Common feature selection methods

Conclusion

Filter methods are faster but coarser. Wrapper and embedded methods are more precise and better suited to tuning for a specific algorithm, but they are computationally intensive and take longer to run. When there is a large amount of data, apply variance filtering and the mutual information method first, then move on to other feature selection methods. When using logistic regression, embedded methods are preferred. When using support vector machines, wrapper methods (such as recursive feature elimination) are preferred.
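
As a rough illustration of that ordering (cheap filters first, then the model), the sketch below chains variance filtering and mutual information in a scikit-learn Pipeline. The synthetic data from make_classification, the value k=10, and the logistic regression model are assumptions made for this example, not settings from the experiment in this post.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a real dataset (assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("variance", VarianceThreshold()),  # drop constant features first
    ("mutual_info", SelectKBest(partial(mutual_info_classif, random_state=0), k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

# The selectors are refit inside every fold, so the score is not inflated by leakage
cv_auc = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc").mean()

# Fit once on all data to inspect how many features survive the filters
pipe.fit(X_demo, y_demo)
n_selected = int(pipe.named_steps["mutual_info"].get_support().sum())
print("mean CV AUC:", cv_auc, "features kept:", n_selected)
```

Putting the selectors inside the pipeline is a design choice: it keeps the selection step out of the validation folds, which the standalone fit_transform calls later in this post do not do.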

1. Read data and define test functions

import lightgbm as lgb
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

data = pd.read_csv(r'F:\Teacher Training\ppd7\df_Master_merge_clean.csv', encoding='gb18030')
# Keep only the rows that have a label; drop the ID, label and bookkeeping columns
pd_x = data[data.target.notnull()].drop(columns=['Idx', 'target', 'sample_status', 'ListingInfo'])
pd_y = data[data.target.notnull()]['target']

# LightGBM settings shared by the evaluation model and the base classifier
lgb_params = dict(n_estimators=800, boosting_type='gbdt', learning_rate=0.04,
                  min_child_samples=68, min_child_weight=0.01, max_depth=4,
                  num_leaves=16, colsample_bytree=0.8, subsample=0.8,
                  reg_alpha=0.7777777777777778, reg_lambda=0.3, objective='binary')


# Function that prints KS/AUC and draws the ROC curves
def roc_auc_plot(clf, x_train, y_train, x_test, y_test):
    train_prob = clf.predict_proba(x_train)[:, 1]
    train_auc = roc_auc_score(y_train, train_prob)
    train_fpr, train_tpr, _ = roc_curve(y_train, train_prob)
    train_ks = abs(train_fpr - train_tpr).max()
    print('train_ks = ', train_ks)
    print('train_auc = ', train_auc)

    test_prob = clf.predict_proba(x_test)[:, 1]
    test_auc = roc_auc_score(y_test, test_prob)
    test_fpr, test_tpr, _ = roc_curve(y_test, test_prob)
    test_ks = abs(test_fpr - test_tpr).max()
    print('test_ks = ', test_ks)
    print('test_auc = ', test_auc)

    plt.plot(train_fpr, train_tpr, label='train_roc')
    plt.plot(test_fpr, test_tpr, label='test_roc')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC Curve')
    plt.legend(loc='best')
    plt.show()


# Train a LightGBM model on an 80/20 split and report its KS/AUC
def get_auc(x, y):
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=0.2)
    lgb_model = lgb.LGBMClassifier(**lgb_params)
    # LightGBM 4.0 removed the early_stopping_rounds argument of fit(),
    # so early stopping is passed as a callback here
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    roc_auc_plot(clf, x_train, y_train, x_test, y_test)


# Define the base classifier used by the selectors below
clf1 = lgb.LGBMClassifier(**lgb_params)
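
The KS value printed by the test function is the largest gap between the true positive rate and the false positive rate along the ROC curve. A tiny sketch that can be checked by hand, using four made-up scores (assumed purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # made-up predicted probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)  # 3 of the 4 positive/negative pairs are ranked correctly
ks = np.abs(tpr - fpr).max()          # largest vertical gap between the ROC curve and the diagonal
print("auc =", auc, "ks =", ks)       # auc = 0.75 ks = 0.5
```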

2. The various methods are as follows

1. Recursive feature elimination

Recursive feature elimination (RFE) works well with very few features.
It builds models repeatedly: in each iteration it keeps the best features or removes the worst ones, then builds the next model on the features that remain, until all features are exhausted.
It then ranks the features by the order in which they were kept or removed, and finally selects the best subset.

The code is as follows (example):

from matplotlib import pyplot as plt
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

# RFE is a selector: after fit() it exposes the attributes below.
# n_features_to_select: the number of features you want to keep
# step: the number of features removed in each iteration
# support_: Boolean mask marking which features were finally selected
# ranking_: ranking of the features by their combined importance over the iterations
n_select = 113  # example value; the cross-validation loop below is used to choose it
selector = RFE(clf1, n_features_to_select=n_select, step=50).fit(pd_x, pd_y)
selector.support_.sum()
selector.ranking_

# fit_transform returns the filtered dataset
x_selected = RFE(clf1, n_features_to_select=n_select, step=50).fit_transform(pd_x, pd_y)

##########################################################################
# Choose the number of features with cross-validation.
# Note: every candidate refits RFE and runs 5-fold CV, so this loop is expensive.
axisx = range(1, 250, 10)
scores = []
for i in axisx:
    x_wrapper = RFE(clf1, n_features_to_select=i, step=50).fit_transform(pd_x, pd_y)
    score = cross_val_score(clf1, x_wrapper, pd_y, cv=5).mean()
    scores.append(score)
plt.figure(figsize=(20, 5))
plt.plot(axisx, scores, label='n_features_to_select-score')
plt.xticks(axisx)
plt.legend()
plt.show()
# According to the figure, the optimal number of features is 113.

x_rfe = RFE(clf1, n_features_to_select=113, step=50).fit_transform(pd_x, pd_y)
# Draw the ROC curve
get_auc(x_rfe, pd_y)
# 416 features reduced to 113; test_auc decreased from 0.74109449149997 to 0.7322830868482363
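
The manual search over n_features_to_select can also be delegated to scikit-learn's RFECV, which runs the elimination inside cross-validation and keeps the best-scoring feature count. The sketch below is only an illustration: the synthetic data, the logistic regression estimator (chosen because it is fast), and step=2 are assumptions, not the setup used above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset (assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

# Remove 2 features per iteration and score every candidate size with 5-fold AUC
rfecv = RFECV(
    LogisticRegression(max_iter=1000),
    step=2,
    cv=5,
    scoring="roc_auc",
    min_features_to_select=1,
)
rfecv.fit(X_demo, y_demo)

n_kept = rfecv.n_features_          # feature count with the best CV score
mask = rfecv.support_               # Boolean mask of the selected features
X_reduced = rfecv.transform(X_demo)
print("features kept:", n_kept, "reduced shape:", X_reduced.shape)
```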

2. Embedded method

With embedded methods, a machine learning model is first trained to obtain a weight coefficient for each feature, and the features are then selected in descending order of those coefficients. The coefficients usually represent the contribution or importance of a feature to the model, such as the feature_importances_ attribute of decision trees and tree ensembles. By listing each feature's contribution to building the trees, we can evaluate that contribution and find the features most useful to the model.

The code is as follows (example):

from matplotlib import pyplot as plt
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

# estimator: the model used for selection; any model with a feature_importances_
#            or coef_ attribute (for example, one with an L1 or L2 penalty) can be used
# threshold: features whose importance is below this value are removed

# Take the maximum importance to build axisx and search for the optimal threshold
m = int(clf1.fit(pd_x, pd_y).feature_importances_.max())
axisx = range(0, m + 1, 10)


# Function that plots the cross-validated score for each candidate threshold
def get_opt_thre(axisx):
    scores = []
    for i in axisx:
        # Features with importance below threshold i are removed
        embedded_x = SelectFromModel(clf1, threshold=i).fit_transform(pd_x, pd_y)
        score = cross_val_score(clf1, embedded_x, pd_y, cv=5).mean()
        scores.append(score)
    plt.figure(figsize=(20, 5))
    plt.plot(axisx, scores, label='threshold-scores')
    plt.xticks(axisx)
    plt.legend()
    plt.show()


get_opt_thre(axisx)
# In the figure, the best score appears at a threshold of about 130, while the maximum
# feature importance is 140. Very few features survive such a threshold, so this value
# is not used; the result may be an artifact of the evaluation metric. Use about 30 instead.

axisx = range(25, 36, 1)
get_opt_thre(axisx)
# From the plot, the model gets worse as the threshold rises: more and more features
# are removed and more and more information is lost.
embedded_x = SelectFromModel(clf1, threshold=29).fit_transform(pd_x, pd_y)
get_auc(embedded_x, pd_y)
# 416 features reduced to 120; test_auc decreased from 0.74109449149997 to 0.72692747467951
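
The same selector also works with coefficient-based models, which is the case the conclusion has in mind for logistic regression. A minimal sketch, assuming synthetic data and an L1-penalized logistic regression with C=0.5 (both chosen for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_demo)  # put the coefficients on a comparable scale

# An L1 penalty can shrink the coefficients of unhelpful features to zero;
# SelectFromModel then keeps the features whose |coef_| reaches its threshold
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
embedded = SelectFromModel(l1_model).fit(X_scaled, y_demo)

kept_mask = embedded.get_support()
X_embedded = embedded.transform(X_scaled)
print("features kept:", kept_mask.sum(), "of", X_scaled.shape[1])
```

A smaller C means a stronger penalty and fewer surviving features, so C plays the role that threshold plays in the LightGBM example.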

3. Mutual information method for correlation filtering

Mutual information is a filter method that captures any relationship (linear or non-linear) between each feature and the label.
scikit-learn provides two functions for it: feature_selection.mutual_info_classif (mutual information classification) and feature_selection.mutual_info_regression (mutual information regression). The mutual information method is more powerful than the F-test: the F-test can only find linear relationships, whereas mutual information can find any kind of relationship. It does not return statistics such as p-values or F values; it returns an "estimate of the amount of mutual information between each feature and the target". This estimate is non-negative: a value of 0 indicates that the two variables are independent, and larger values indicate stronger dependence. The code for the mutual information classification example is as follows:

from sklearn.feature_selection import mutual_info_classif

res = mutual_info_classif(pd_x, pd_y)
mc = pd.DataFrame({'fname': pd_x.columns, 'micv': res})
# Keep the features whose estimated mutual information with the label is greater than 0
f = mc[mc.micv > 0]['fname']
get_auc(pd_x[f], pd_y)
# 416 features reduced to 271; test_auc decreased from 0.74109449149997 to 0.7373713932080141
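
To see the difference between the two filters, the sketch below builds a toy label that depends on a feature only through its absolute value, so the class means of that feature are the same and there is nothing linear to find. The data and variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
x_nonlinear = rng.uniform(-1, 1, size=1000)  # the label depends on |x|, not on x itself
x_noise = rng.uniform(-1, 1, size=1000)      # unrelated to the label
X_demo = np.column_stack([x_nonlinear, x_noise])
y_demo = (np.abs(x_nonlinear) > 0.5).astype(int)

F_values, p_values = f_classif(X_demo, y_demo)
mi = mutual_info_classif(X_demo, y_demo, random_state=0)

# The F-test compares class means, which are both close to 0 for the first feature,
# while the mutual information estimate picks up the dependence on |x|
print("F values:", F_values)
print("mutual information:", mi)
```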

4. F-test for correlation filtering

The F-test, also known as ANOVA, is a filter method that captures the linear relationship between each feature and the label. F-test classification (f_classif) is used when the label is a discrete variable, and F-test regression (f_regression) when the label is a continuous variable. The null hypothesis is that there is no significant linear relationship between the feature and the label. The test returns two statistics, the F value and the p-value. As with chi-square filtering, we want to select features whose p-value is less than 0.05 or 0.01: these have a significant linear relationship with the label, while features with p-values greater than 0.05 or 0.01 are considered to have no significant linear relationship with the label and should be deleted.

from sklearn.feature_selection import f_classif

F, p = f_classif(pd_x, pd_y)
sf = pd.DataFrame({'fname': pd_x.columns, 'p': p})
# Keep the features whose p-value is below 0.05
fname = sf[sf.p < 0.05]['fname']
get_auc(pd_x[fname], pd_y)
# 416 features reduced to 201; test_auc decreased from 0.74109449149997 to 0.7360001348353649

5. Chi-square test for correlation filtering

Chi-square filtering is a correlation filter designed for discrete labels (i.e., classification problems). feature_selection.chi2 calculates the chi-square statistic between each non-negative feature and the label, and the features are then ranked by that statistic from high to low. It is not used here because the dataset contains negative values.

from sklearn.feature_selection import chi2, SelectKBest

# Suppose 300 features are needed here
# Note: chi2 raises a ValueError on this dataset because some feature values are negative
x_chi = SelectKBest(chi2, k=300).fit_transform(pd_x, pd_y)
get_auc(x_chi, pd_y)
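
The non-negativity requirement is easy to reproduce. The sketch below shows chi2 rejecting a matrix with negative values and then running after each feature is rescaled to [0, 1]. The rescaling with MinMaxScaler is a workaround assumed for this illustration and was not applied in this experiment; the synthetic data and k=8 are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with negative values (assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# chi2 refuses input that contains negative values
try:
    SelectKBest(chi2, k=8).fit(X_demo, y_demo)
    raised = False
except ValueError as err:
    raised = True
    print("chi2 failed:", err)

# Rescale every feature to [0, 1] so that all values are non-negative
X_nonneg = MinMaxScaler().fit_transform(X_demo)
X_chi = SelectKBest(chi2, k=8).fit_transform(X_nonneg, y_demo)
print(X_chi.shape)  # (300, 8)
```

Whether chi-square scores on rescaled continuous features are meaningful depends on the data; the test is usually applied to counts or frequencies.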

6. Variance filtering

VarianceThreshold is a class that filters features by their variance. If a feature's variance is very small, the samples barely differ on that feature: most of its values are the same, or the whole feature takes a single value, so it does nothing to distinguish samples. Whatever feature engineering comes next, features with zero variance should be eliminated first. VarianceThreshold has one important parameter, threshold, which is the variance cutoff: all features whose variance does not exceed the threshold are discarded. If it is omitted, it defaults to 0, which removes every feature whose values are all identical.

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold()  # Instantiate; the threshold defaults to 0 when no parameter is given
X_var0 = selector.fit_transform(pd_x)  # New feature matrix after removing the features that fail the threshold
# No features with a variance of 0 were found
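
Since the real dataset had no zero-variance features, the effect is easier to see on a toy matrix. A minimal sketch with made-up values, in which the first and last columns are constant:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_toy = np.array([
    [0, 2, 0, 3],
    [0, 1, 4, 3],
    [0, 1, 1, 3],
])  # columns 0 and 3 are constant

toy_selector = VarianceThreshold()  # threshold defaults to 0
X_kept = toy_selector.fit_transform(X_toy)

print(toy_selector.get_support())  # [False  True  True False]
print(X_kept.shape)                # (3, 2)
```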

The k-nearest neighbors algorithm (KNN), single decision trees, support vector machines (SVM), neural networks, and regression algorithms all need to traverse the features or perform dimension-raising operations, so they are computationally intensive and slow. Feature selection such as variance filtering is therefore particularly important for them.
However, algorithms that do not need to traverse every feature, such as random forests, which pick features at random for each split, are already fast, so feature selection brings them only modest gains.

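Summary

Of the two methods that cut the 416 features down to a similar size, recursive feature elimination (113 features, test_auc 0.7323) performed better than the embedded method (120 features, test_auc 0.7269). The mutual information method and the F-test lost less AUC (0.7374 and 0.7360, against 0.7411 with all features) but kept more features (271 and 201). Different methods suit different models. The theory part is drawn from the Caicai (菜菜) machine learning course.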