# Scikit-learn notes 4: sklearn model selection and evaluation, dataset splitting

## Dataset partitioning methods

## k-fold cross-validation

1. Divide the full training set S into k disjoint subsets. If S contains m training samples, each subset has m/k samples; call the subsets {s1, s2, ..., sk}.
2. Each time, take one of the subsets as the test set and the other k-1 as the training set.
3. Train the learner on the k-1 training subsets.
4. Evaluate the trained model on the test set to get its classification rate.
5. Average the classification rates from the k rounds and use the result as the estimate of the true classification rate of the model or hypothesis function.

This method makes full use of all samples. However, it is computationally expensive, requiring k rounds of training and k rounds of testing.

### Usage
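The five steps listed for k-fold cross-validation can be sketched end to end as follows. This is only a minimal illustration: the iris dataset, the `LogisticRegression` model, and the helper name `kfold_accuracy` are assumptions chosen for the example and are not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def kfold_accuracy(model, X, y, k=5, seed=0):
    """Hypothetical helper: return per-fold accuracies and their mean."""
    # Steps 1-2: split the samples into k disjoint folds (shuffled, because
    # the iris samples are ordered by class).
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_index, test_index in kf.split(X):
        # Step 3: train on the k-1 training folds.
        model.fit(X[train_index], y[train_index])
        # Step 4: accuracy on the held-out fold.
        scores.append(model.score(X[test_index], y[test_index]))
    # Step 5: average over the k folds.
    return scores, float(np.mean(scores))


X_iris, y_iris = load_iris(return_X_y=True)
fold_scores, mean_score = kfold_accuracy(
    LogisticRegression(max_iter=1000), X_iris, y_iris, k=5
)
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", mean_score)
```

In practice `sklearn.model_selection.cross_val_score` wraps the same loop; the explicit version is shown here only to mirror the numbered steps.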

```python
# KFold
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
kf = KFold(n_splits=2)  # 2 folds; n_splits=3 would give 3 folds
kf.get_n_splits(X)  # number of splits
print(kf)
for train_index, test_index in kf.split(X):  # training-set indices, test-set indices
    print("Train Index:", train_index, ", Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]  # slice the raw data using the indices
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)
```

```python
# GroupKFold: a variation of the k-fold iterator
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 2, 3, 4, 5, 6])  # one group label per sample
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
print(group_kfold)

for train_index, test_index in group_kfold.split(X, y, groups):  # yields (train, test) index tuples
    print("Train Index:", train_index, ", Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)
```

```python
# sklearn.model_selection.StratifiedKFold: stratified cross-validation

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratification: each test fold takes one sample from the first three
# (class 1) and one from the last three (class 2).
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 1, 1, 2, 2, 2])
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(X, y)
print(skf)
for train_index, test_index in skf.split(X, y):
    print("Train Index:", train_index, ", Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

## Leave-one-out method

Leave-one-out (LOO):
Suppose there are n samples. Each sample in turn is used as the test sample, with the other n-1 samples as the training set. This produces n classifiers and n test results, and the average of these n results is used to measure the performance of the model.
Compared with k-fold CV, LOO builds n models from n samples instead of k. Furthermore, each of the n models is trained on n-1 samples rather than (k-1)n/k. On both counts, assuming k is not very large and k << n, LOO is more time-consuming than k-fold CV.
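To make the cost comparison concrete, the sketch below counts the models and training-set sizes produced by each splitter. The values `n = 12` and `k = 3` and the toy array `X_demo` are assumptions picked for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

n, k = 12, 3  # hypothetical sample count and fold count
X_demo = np.arange(n).reshape(n, 1)

# One entry per split: the size of that split's training set.
kfold_train_sizes = [len(train) for train, _ in KFold(n_splits=k).split(X_demo)]
loo_train_sizes = [len(train) for train, _ in LeaveOneOut().split(X_demo)]

print("k-fold:", len(kfold_train_sizes), "models, train sizes", kfold_train_sizes)
print("LOO:   ", len(loo_train_sizes), "models, train sizes", sorted(set(loo_train_sizes)))
```

With these numbers, k-fold fits 3 models on (k-1)n/k = 8 samples each, while LOO fits 12 models on n-1 = 11 samples each.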
Leave-P-out validation:
There are n samples. Every possible subset of p samples is used in turn as the test set, with the other n-p samples as the training set. This yields $\binom{n}{p}$ (n choose p) train/test pairs. Unlike LeaveOneOut and KFold, the test sets overlap when p > 1. When p = 1, it reduces to the leave-one-out method.

### Usage
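The count of train/test pairs and the overlap between test sets can be checked directly. This is a small sketch under assumed values (`n = 6`, `p = 3`, and a toy array `X_demo`); it uses `math.comb`, which needs Python 3.8 or later.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n, p = 6, 3  # hypothetical sample count and test-set size
X_demo = np.arange(n).reshape(n, 1)

lpo = LeavePOut(p=p)
n_pairs = lpo.get_n_splits(X_demo)  # number of train/test pairs
test_sets = [tuple(test) for _, test in lpo.split(X_demo)]

# Overlap: count how many test sets contain sample 0.
appearances = sum(0 in t for t in test_sets)

print("pairs:", n_pairs, "expected:", comb(n, p))
print("test sets containing sample 0:", appearances)
```

Here there are 20 pairs, and sample 0 appears in 10 of the test sets, whereas in KFold or LeaveOneOut each sample is tested exactly once.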

```python
# sklearn.model_selection.LeaveOneOut: leave-one-out trains and tests as many times as there are samples
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
loo = LeaveOneOut()
loo.get_n_splits(X)
print(loo)
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)
```

```python
# sklearn.model_selection.LeavePOut: leave-P-out; it cannot guarantee balanced class proportions

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
lpo = LeavePOut(p=3)  # each time, take 3 samples as the test set and the rest as the training set
lpo.get_n_splits(X)
print(lpo)
for train_index, test_index in lpo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)
```

## Random partitioning method

The ShuffleSplit iterator produces a specified number of independent train/test dataset partitions. First all samples are shuffled, then they are split into a train set and a test set. You can use the random seed random_state to control the random number generator so that the results are reproducible.
ShuffleSplit is a good alternative to KFold cross-validation, because it allows finer control over the number of iterations and the proportion of train/test samples.

StratifiedShuffleSplit is a variant of ShuffleSplit that returns stratified partitions: when creating each partition, it ensures that the class proportions in the partition match the proportions in the overall dataset.

### Usage
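The difference between the two iterators shows up most clearly on imbalanced labels. The sketch below counts the classes in each test set; the label vector `y_demo` (8 samples of class 0, 4 of class 1) and the helper name `class_counts_in_test_sets` are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical imbalanced labels: 8 samples of class 0, 4 of class 1.
y_demo = np.array([0] * 8 + [1] * 4)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)


def class_counts_in_test_sets(splitter):
    """Return [count of class 0, count of class 1] for each test set."""
    return [
        np.bincount(y_demo[test], minlength=2).tolist()
        for _, test in splitter.split(X_demo, y_demo)
    ]


plain = class_counts_in_test_sets(
    ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
)
stratified = class_counts_in_test_sets(
    StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
)

print("ShuffleSplit test class counts:          ", plain)
print("StratifiedShuffleSplit test class counts:", stratified)
```

Each test set holds 3 of the 12 samples. The stratified splitter keeps the 2:1 class ratio in every test set, while plain ShuffleSplit makes no such guarantee.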

```python
# sklearn.model_selection.ShuffleSplit: random partition
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])  # sample points
y = np.array([1, 2, 4, 2, 1, 2])  # class labels
# test_size=.25 specifies the proportion of samples that go to the test set
rs = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)
rs.get_n_splits(X)
print(rs)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
print('=======================================================')
# train_size and test_size need not add up to 1, although they usually do.
# The printed training indices show that train_size=0.5 gives 3 training samples.
rs = ShuffleSplit(n_splits=3, train_size=0.5, test_size=.25, random_state=0)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

```python
# sklearn.model_selection.StratifiedShuffleSplit: stratified random partition
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
# n_splits=3 is the number of shuffle-and-split iterations
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

Posted on Sat, 23 Oct 2021 21:11:22 -0400 by BoukeBuffel