(1) SMOTe algorithm: applying the synthetic minority oversampling technique (SMOTe) to unbalanced data sets
In data science, unbalanced data sets are common. A data set used for a classification problem, such as sentiment analysis, medical imaging, or any other discrete prediction task (for example, flight delay prediction), is unbalanced when the number of examples (samples or data points) differs greatly between classes. A class with relatively few instances is called a minority class, and a class with relatively many instances is called a majority class. An example of an unbalanced data set is as follows:
There are two class labels: 0 and 1, which are unbalanced
Training a machine learning model on such an unbalanced data set often biases the model toward the majority class, so that it misclassifies minority class instances.
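To make this bias concrete, here is a minimal sketch on made-up labels (the 95/5 split is an assumption chosen for illustration, not taken from any dataset in this article). A degenerate model that always predicts the majority class still looks accurate while missing every minority instance:

```python
import numpy as np

# Hypothetical toy labels: 95 majority (0) and 5 minority (1) instances.
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
# Recall on the minority class: fraction of true 1s that were predicted as 1.
minority_recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

This is why accuracy alone is a weak indicator on unbalanced data and why per-class metrics are worth checking as well.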
SMOTe is a nearest-neighbour technique that measures the distance between data points in feature space using the Euclidean distance. The oversampling percentage specifies how many synthetic samples to create and is always a multiple of 100. If the percentage is 100, one new sample is created for each minority instance, so the number of minority class instances doubles. Similarly, if the percentage is 200, the total number of minority samples triples. In SMOTe:
- For each minority class instance, find the k nearest neighbours that belong to the same class, where k is the oversampling percentage divided by 100.
- Compute the difference between the feature vector of the instance under consideration and the feature vector of each of its k nearest neighbours. This gives k difference vectors.
- Multiply each of the k difference vectors by a random number between 0 and 1 (excluding 0 and 1).
- Add each scaled difference vector to the feature vector of the instance under consideration (the original minority class instance), one per iteration, to obtain a new synthetic sample.
SMOTe can be implemented from scratch in Python as follows:
import random
from math import sqrt

import numpy as np

def nearest_neighbour(X, x):
    # Euclidean distances from x to every row of X that differs from x
    euclidean = np.ones(X.shape[0] - 1)
    k = 0
    for j in range(0, X.shape[0]):
        if not np.array_equal(X[j], x):
            euclidean[k] = sqrt(sum((X[j] - x) ** 2))
            k = k + 1
    euclidean = np.sort(euclidean)
    # Random weight in (0, 1)
    weight = random.random()
    while weight == 0:
        weight = random.random()
    # Distance to the nearest neighbour, scaled by the random weight
    additive = np.multiply(euclidean[:1], weight)
    return additive

def SMOTE_100(X):
    new = np.empty((X.shape[0], X.shape[1]))
    k = 0
    for i in range(0, X.shape[0]):
        additive = nearest_neighbour(X, X[i])
        for j in range(0, 1):
            # additive[j] is a scalar, so it is added to every feature of X[i]
            new[k] = X[i] + additive[j]
            k = k + 1
    return new  # the synthetic samples created by SMOTe
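The listing above offsets every feature of an instance by the same scaled distance. For comparison, the four steps described earlier can also be sketched with explicit per-feature difference vectors. This is a minimal sketch under stated assumptions, not the implementation used for the results in this article: the function name `smote_sketch`, its parameters and the toy data are all hypothetical, and k is taken as the oversampling percentage divided by 100.

```python
import numpy as np

def smote_sketch(X_min, percentage=100, seed=0):
    """Create synthetic minority samples by interpolating towards neighbours.

    X_min      : 2-D float array holding only minority class instances
    percentage : oversampling percentage, assumed to be a multiple of 100
    """
    rng = np.random.default_rng(seed)
    k = percentage // 100          # neighbours (and new samples) per instance
    n = X_min.shape[0]
    if k < 1 or k >= n:
        raise ValueError("need 1 <= k < number of minority instances")

    synthetic = np.empty((n * k, X_min.shape[1]))
    row = 0
    for i in range(n):
        # Euclidean distance from instance i to every minority instance.
        dist = np.sqrt(((X_min - X_min[i]) ** 2).sum(axis=1))
        dist[i] = np.inf                      # never pick the instance itself
        neighbours = np.argsort(dist)[:k]     # indices of the k nearest
        for j in neighbours:
            diff = X_min[j] - X_min[i]        # difference vector
            gap = rng.uniform(0.0, 1.0)       # random factor in [0, 1)
            synthetic[row] = X_min[i] + gap * diff
            row += 1
    return synthetic

# Hypothetical toy minority class with four instances.
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]])
new_points = smote_sketch(X_min, percentage=200)
print(new_points.shape)   # (8, 2)
```

Each synthetic point lies on the line segment between an original instance and one of its neighbours, so it stays inside the region already occupied by the minority class.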
Machine learning data set
Let's consider the Adult dataset from UCI (University of California, Irvine) (http://archive.ics.uci.edu/ml/datasets/Adult), which contains 48842 instances and 14 attributes / features.
Data preprocessing using Python:
- Label encoding is applied to the categorical (non-numeric) features and to the label, income, listed in Table 1.
- An Extra Trees classifier is trained on the whole data set to select features. It gives a feature importance score for each feature, as shown in Table 1. The features race and native country were dropped because they had the smallest feature importance scores.
- One-hot encoding is applied to categorical features with more than two categories. After one-hot encoding, a categorical feature is split into sub-features, one per category, each taking the binary value 0 or 1. Here, the categorical features workclass, education, marital status, occupation and relationship are one-hot encoded. Since sex has only two categories (male and female), it needs no further encoding.
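The label encoding step in the list above can be sketched with NumPy alone. The column values below are made up for illustration, and this is only one possible way to produce integer codes, not necessarily the one used for this dataset:

```python
import numpy as np

# Hypothetical values of one categorical column.
sex = np.array(["Male", "Female", "Female", "Male", "Female"])

# np.unique returns the sorted categories and, with return_inverse=True,
# the index of each value in that sorted list, i.e. an integer code.
categories, codes = np.unique(sex, return_inverse=True)

print(categories)  # ['Female' 'Male']
print(codes)       # [1 0 0 1 0]
```

The codes are assigned in sorted order of the category names, so the same mapping has to be reused for any data encoded later.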
Table 1
One-hot encoding in Python after feature selection:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Label Encoding and Feature Selection is over ....

# 1. Loading the modified dataset after Label Encoding
df = pd.read_csv('adult.csv')

# Loading of Selected Features into X
X = df.iloc[:, [0,1,2,3,4,5,6,7,9,10,11,12]].values

# Loading of the Label into y
y = df.iloc[:, 14].values

# 2. One Hot Encoding ....
# (the categorical_features argument requires scikit-learn older than 0.22)
onehotencoder = OneHotEncoder(categorical_features = [1,3,5,6,7])
X = onehotencoder.fit_transform(X).toarray()
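The `categorical_features` argument used above is no longer available in recent scikit-learn releases. A roughly equivalent sketch for newer versions uses `ColumnTransformer`; the toy array and the variable names `X_toy`, `encoder` and `X_encoded` are assumptions for illustration only:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded toy data: column 1 is categorical with 3 codes.
X_toy = np.array([[25, 0, 40],
                  [38, 2, 50],
                  [52, 1, 40],
                  [41, 2, 60]])

encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [1])],
    remainder="passthrough",   # keep the other columns unchanged
    sparse_threshold=0,        # always return a dense array
)
X_encoded = encoder.fit_transform(X_toy)

print(X_encoded.shape)  # (4, 5): 3 one-hot columns + 2 passthrough columns
```

For the data in this article, the column list would presumably be `[1, 3, 5, 6, 7]` instead of `[1]`. The one-hot columns come first in the output, followed by the untouched columns.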
The class label in this problem is binary: it takes one of two values, so there are two classes. This is a binary classification problem.
Visualizing the class distribution
# Getting the no. of instances with Label 0
n_class_0 = df[df['income']==0].shape[0]

# Getting the no. of instances with label 1
n_class_1 = df[df['income']==1].shape[0]

# Bar Visualization of Class Distribution
import matplotlib.pyplot as plt # required library
x = ['0', '1']
y = np.array([n_class_0, n_class_1])
plt.bar(x, y)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
Class distribution
Therefore, the given dataset shows a serious imbalance between the two classes: the class with label "1" is the minority class and the class with label "0" is the majority class.
Now, there are two possible approaches:
- Shuffle the data set, split it into training and validation sets, and apply SMOTe to the training set only. (method 1)
- Apply SMOTe to the given data set as a whole, and then randomly split it into training and validation sets. (method 2)
Many online resources, such as Stack Overflow and many personal blogs, consider the second method to be the wrong way to oversample. In particular, Nick Becker argues in his personal blog that the second method is wrong for the following reason:
"Applying SMOTe to the whole data set creates similar examples, because the algorithm is based on k-nearest neighbour theory. For this reason, splitting after applying SMOTe to a given data set leaks information from the validation set into the training set, so the classifier or machine learning model overestimates its accuracy and other performance metrics."
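The leakage argument can be illustrated with a small simulation. This is a sketch under loud assumptions: the data are random, and adding small noise to each original point is only a stand-in for real oversampling, used here to create near-duplicates. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority class: 200 points in 2-D.
originals = rng.normal(size=(200, 2))

# Stand-in for oversampling: one synthetic point close to each original.
synthetic = originals + rng.normal(scale=0.01, size=originals.shape)
parent = np.arange(200)            # synthetic[i] was derived from originals[i]

# Method 2: oversample first, then split originals and synthetics together.
in_validation = rng.random(400) < 0.2      # True -> row goes to validation
orig_val, synth_val = in_validation[:200], in_validation[200:]

# A synthetic validation row "leaks" if its parent sits in the training set.
leaked = synth_val & ~orig_val[parent]
print(leaked.sum(), "of", synth_val.sum(),
      "synthetic validation rows have a training-set parent")
```

With a random split, most synthetic validation rows end up with their parent in the training set, which is the situation the quoted argument warns about.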
Let's follow the first method, since it is the widely accepted one, but also apply the second method and compare the two.
To assess whether the second method is really wrong, I will randomly split the whole data set into a train-validation set and a test set. The test set is kept separate as a set of unseen instances.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
# X_train and y_train is the Train-Validation Set
# X_test and y_test is the Test Set separated out
- The first and second methods are then each applied to the train-validation set.
- Both models (one built with the first method, one with the second) are evaluated on the same separate set of unseen instances (the test set).
Method 1: using SMOTe after splitting
=> Split the train-validation set into training and validation sets. The Python code is as follows:
X_train, X_v, y_train, y_v = train_test_split(X_train, y_train, test_size=0.2, random_state=2341)
# X_train and y_train is the Training Set
# X_v and y_v is the Validation Set
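The two successive 80/20 splits leave roughly 64% of the data for training, 16% for validation and 20% for testing. The sketch below works this out, assuming that all 48842 instances are kept after preprocessing and that the held-out share is rounded up; the helper `split_sizes` is hypothetical and only mimics that rounding.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up (an assumption)."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 48842                                   # instances in the Adult dataset
n_trainval, n_test = split_sizes(n_total, 0.2)    # first split
n_train, n_val = split_sizes(n_trainval, 0.2)     # second split

print(n_train, n_val, n_test)   # 31258 7815 9769
print(round(n_train / n_total, 2),
      round(n_val / n_total, 2),
      round(n_test / n_total, 2))  # 0.64 0.16 0.2
```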
=> Apply SMOTe to the training set only:
# 1. Getting the number of Minority Class Instances in Training Set
import numpy as np # required library
unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]

# 2. Storing the minority class instances separately
x1 = np.ones((minority_shape, X_train.shape[1]))
k = 0
for i in range(0, X_train.shape[0]):
    if y_train[i] == 1.0:
        x1[k] = X_train[i]  # row of X_train that matches y_train[i]
        k = k + 1

# 3. Applying 100% SMOTe
sampled_instances = SMOTE_100(x1)

# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train, sampled_instances), axis = 0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train, y_sampled_instances), axis=0)
# X_f and y_f are the Training Set Features and Labels respectively
Model training using Gradient Boosting classifier
A Gradient Boosting classifier is used to train the machine learning model. Grid search is used to find the best hyperparameters: the number of estimators and the maximum depth.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators':[100,150,200,250,300,350,400,450,500], 'max_depth':[3,4,5]}
clf = GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid = parameters, estimator = clf, verbose = 3)
grid_search_1 = grid_search.fit(X_f, y_f)
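To give a sense of how much work this grid search does, the sketch below enumerates the candidate hyperparameter combinations with the standard library. The fold count `n_folds` is an assumption, since it depends on the `cv` argument and on the library version.

```python
from itertools import product

parameters = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500],
              'max_depth': [3, 4, 5]}

# Every combination of the listed values is one candidate model.
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]

n_folds = 5  # assumption: depends on the cv argument and library version
print(len(candidates))            # 27
print(len(candidates) * n_folds)  # 135 fits, plus one final refit
```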
The trained machine learning model for the first method is therefore stored in grid_search_1.
Method 2: using SMOTe before splitting
=> Apply SMOTe to the entire train-validation set:
# X_train and y_train here are the Train-Validation Set from the first split
# 1. Getting the number of Minority Class Instances in the Train-Validation Set
unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]

# 2. Storing the minority class instances separately
x1 = np.ones((minority_shape, X_train.shape[1]))
k = 0
for i in range(0, X_train.shape[0]):
    if y_train[i] == 1.0:
        x1[k] = X_train[i]  # row of X_train that matches y_train[i]
        k = k + 1

# 3. Applying 100% SMOTe
sampled_instances = SMOTE_100(x1)

# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train, sampled_instances), axis = 0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train, y_sampled_instances), axis=0)
# X_f and y_f are the Train-Validation Set Features and Labels respectively
=> Split the train-validation set into training and validation sets:
X_train, X_v, y_train, y_v = train_test_split(X_f, y_f, test_size=0.2, random_state=9999)
# X_train and y_train is the Training Set
# X_v and y_v is the Validation Set
Model training using Gradient Boosting classifier
Similarly, grid search is applied to the Gradient Boosting classifier:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators':[100,150,200,250,300,350,400,450,500], 'max_depth':[3,4,5]}
clf = GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid = parameters, estimator = clf, verbose = 3)
grid_search_2 = grid_search.fit(X_train, y_train)
The trained machine learning model for the second method is therefore stored in grid_search_2.
Analysis and comparison
The performance indicators used for comparison and analysis are:
- Accuracy on test set
- Precision on test set
- Recall on test set
- F1 score on test set
In addition to these comparison metrics, the training accuracy (on the training set) and the validation accuracy (on the validation set) are also calculated.
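For reference, the four test-set metrics listed above can be computed from confusion counts as in the following sketch. The helper `binary_metrics` and the toy labels are hypothetical. The article itself relies on scikit-learn's `score` and `classification_report`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

On this toy input there are 2 true positives, 1 false positive and 1 false negative, so accuracy is 0.75 while precision, recall and F1 are each two thirds.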
# MODEL 1 PERFORMANCE ANALYSIS
# 1. Training Accuracy for Model 1 (following Approach 1)
print(grid_search_1.score(X_f, y_f))

# 2. Validation Accuracy on Validation Set for Model 1
print(grid_search_1.score(X_v, y_v))

# 3. Test Accuracy on Test Set for Model 1
print(grid_search_1.score(X_test, y_test))

# 4. Precision, Recall and F1-Score on the Test Set for Model 1
from sklearn.metrics import classification_report
predictions = grid_search_1.predict(X_test)
print(classification_report(y_test, predictions))

# MODEL 2 PERFORMANCE ANALYSIS
# 5. Training Accuracy for Model 2 (following Approach 2)
print(grid_search_2.score(X_train, y_train))

# 6. Validation Accuracy on Validation Set for Model 2
print(grid_search_2.score(X_v, y_v))

# 7. Test Accuracy on Test Set for Model 2
print(grid_search_2.score(X_test, y_test))

# 8. Precision, Recall and F1-Score on the Test Set for Model 2
predictions = grid_search_2.predict(X_test)
print(classification_report(y_test, predictions))
The training and validation accuracies of model 1 and model 2 are:
- Training accuracy (model 1): 90.64998262078554%
- Training accuracy (model 2): 90.92736479956705%
- Validation accuracy (model 1): 86.87140115163148%
- Validation accuracy (model 2): 89.33209647495362%
This shows that the second method has the higher validation accuracy, but no conclusion can be drawn without testing both models on the same, completely unseen test set. Table 2 compares the performance of the two models on the test set.
Table 2
However small the difference, method 2 is clearly more successful than method 1 on the test set. Why? This can be examined with further tests.
Although SMOTe creates similar instances, this property helps not only to reduce class imbalance and augment the data, but also to find the training set best suited to model training.
(II) Kaggle credit card fraud prediction: handling unbalanced training samples. The overall conclusion is that random forest combined with oversampling (by direct replication or SMOTE, bringing the black-to-white (fraud to non-fraud) sample ratio to 1:3 or 1:1) works better! Remember to standardize before applying SMOTE!!!
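Standardizing first matters because SMOTE relies on Euclidean distance, so a feature on a large scale would otherwise dominate the neighbour search. A minimal sketch of z-score standardization follows, with made-up feature values and hypothetical variable names:

```python
import numpy as np

# Hypothetical training features on very different scales
# (e.g. an amount in currency units and a ratio between 0 and 1).
X_train = np.array([[1200.0, 0.10],
                    [ 300.0, 0.90],
                    [ 950.0, 0.40],
                    [ 410.0, 0.75]])

# Fit the scaling statistics on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0                 # avoid division by zero for constant columns

X_scaled = (X_train - mean) / std   # each column now has mean 0 and std 1

# Distances are no longer dominated by the large-scale first column.
print(np.linalg.norm(X_train[0] - X_train[1]))
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
```

The same `mean` and `std` would then be reused to scale validation and test data, so that no statistics are taken from unseen instances.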