Generally speaking, a sample is characterized by multiple features. Features can be understood as attributes, which can be divided into three categories:
- Relevant features: they improve the performance of the learning algorithm;
- Irrelevant features: they bring no improvement to the performance of the algorithm;
- Redundant features: their information can be inferred from other features.
2. Feature selection
Feature selection means choosing m features (m <= N) from the N original features so that a criterion function reaches its optimum on the chosen subset.
The goal is to select as few features as possible while making sure that the performance of the model does not decline significantly and that the predicted class distribution stays as close as possible to the true class distribution. (Note the wording.)
Its main purpose is to reduce the dimension, so as to reduce the difficulty of learning tasks and improve the efficiency of the model.
So how is feature selection done? Broadly, there are three families of methods: the filter method, the wrapper method and the embedded method.
(1) Filter method
The filter method scores and screens each feature with statistical indicators, focusing on the characteristics of the data itself. (However, because interactions between features are not considered, features that are only useful in combination may be discarded by mistake.)
When examining variables (i.e. features) with the filter method, we decide whether a variable should be filtered out by looking both at each variable on its own and at the relationships between variables.
- Missing percentage: if missing values make up too large a share of a variable, the variable has little reference value and it is recommended to eliminate it.
- Variance: calculate the variance of each variable and keep the variables whose variance exceeds a chosen threshold. (A feature with very small variance carries little information for the algorithm. Variance depends on the scale of the variable, so the threshold is only meaningful when variables are on comparable scales. In the extreme case the variance is 0, that is, the variable takes the same value for all samples; it then has no effect on model training and it is recommended to eliminate it.)
- Frequency: if the samples of a discrete variable are concentrated on a single category, that is, the category frequencies are seriously unbalanced, the variable can be considered for elimination.
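The three single-variable checks above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the toy data, the column names and the thresholds (0.5 for missing share, 0.0 for variance, 0.8 for the dominant category) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: column names and thresholds are illustrative only.
df = pd.DataFrame({
    "mostly_missing": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    "constant": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
    "imbalanced": ["a", "a", "a", "a", "a", "b"],
    "useful": [0.5, 1.7, 2.9, 4.2, 5.1, 6.8],
})

# 1) Missing percentage: flag columns whose share of missing values is too high.
missing_ratio = df.isna().mean()
high_missing = missing_ratio[missing_ratio > 0.5].index.tolist()

# 2) Variance: flag numeric columns whose variance does not exceed the threshold.
numeric = df.drop(columns=high_missing).select_dtypes(include="number")
selector = VarianceThreshold(threshold=0.0).fit(numeric)
low_variance = numeric.columns[~selector.get_support()].tolist()

# 3) Frequency: flag discrete columns dominated by a single category.
categorical = df.select_dtypes(exclude="number")
top_share = categorical.apply(lambda col: col.value_counts(normalize=True).iloc[0])
imbalanced = top_share[top_share > 0.8].index.tolist()

print(high_missing, low_variance, imbalanced)
```

In practice the thresholds would be tuned to the data set; the values here only make the three rules visible on a six-row table.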
When studying the relationships between variables, we look at two kinds of relationship.
(1) Correlation between independent variables: if the correlation is too high, it causes multicollinearity, which makes the model unstable. It is recommended to keep one feature from each collinear group and eliminate the others.
(2) Correlation between independent variables and the dependent variable: the higher the correlation, the more important the feature is for the prediction target of the model, and it is recommended to keep it.
Because variables can be discrete or continuous, different methods should be chosen depending on the types of the two variables being compared.
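One common way to act on point (1) is to scan the correlation matrix and drop one feature from each highly correlated pair. The sketch below assumes a threshold of 0.9 and uses synthetic data; the helper name `drop_collinear` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.01, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                          # unrelated to x1
})

def drop_collinear(frame, threshold=0.9):
    """Drop the later feature of every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

reduced, dropped = drop_collinear(df)
print(dropped)
```

Which member of a collinear pair to keep is a judgment call; this sketch simply keeps the one that appears first.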
Continuous VS continuous
- Pearson correlation coefficient: the Pearson correlation coefficient is the covariance of two variables divided by the product of their standard deviations. Covariance reflects how two random variables vary together (a covariance greater than 0 indicates positive correlation, and less than 0 indicates negative correlation). After dividing by the standard deviations, the value range of Pearson is [-1, 1]. The stronger the linear relationship between the two variables, the closer the coefficient gets to 1 or -1, and the sign gives the direction of the correlation. The drawback of the Pearson correlation coefficient is that it is only sensitive to linear relationships: if the relationship is nonlinear, the Pearson value may be close to 0 even when one variable is fully determined by the other.
Code: X.corr(Y, method="pearson")
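As a small illustration of the definition and of the linearity caveat, the sketch below computes Pearson by hand and compares it with pandas. The data are synthetic and the helper `pearson` is hypothetical.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-1.0, 1.0, 201))
y_linear = 3.0 * x + 1.0
y_square = x ** 2  # fully determined by x, but not linearly

def pearson(a, b):
    # Covariance divided by the product of the standard deviations.
    return a.cov(b) / (a.std() * b.std())

r_manual = pearson(x, y_linear)
r_pandas = x.corr(y_linear, method="pearson")
r_square = x.corr(y_square, method="pearson")
print(r_manual, r_pandas, r_square)
```

The linear case gives a coefficient of 1 (up to floating-point error), while the quadratic case gives a value near 0 because x is symmetric around zero.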
- Spearman's rank correlation coefficient: inference with the Pearson correlation coefficient usually assumes that the variables are approximately normally distributed. The Spearman correlation coefficient makes no assumption about the distribution of the variables; it computes the correlation from the ranks of the values. It is recommended when a variable is ordinal.
As with Pearson, the coefficient approaches 1 or -1 as the relationship strengthens, and the sign gives the direction of the correlation.
Code: X.corr(Y, method="spearman")
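To make the rank idea concrete, the following sketch (synthetic data, assumed for illustration) compares Pearson and Spearman on a strictly increasing but nonlinear relationship, and checks that Spearman matches Pearson applied to the ranks.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 4.0)  # strictly increasing, strongly nonlinear

pearson_r = x.corr(y, method="pearson")
spearman_r = x.corr(y, method="spearman")

# Spearman equals the Pearson coefficient computed on the ranks.
rank_r = x.rank().corr(y.rank(), method="pearson")
print(pearson_r, spearman_r, rank_r)
```

Because the ordering of y follows the ordering of x exactly, Spearman reaches 1 while Pearson stays below it.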
Continuous VS discrete
- Analysis of variance (ANOVA): the purpose of ANOVA is to test whether the means of different groups differ significantly. See Analysis of variance.
Code: from statsmodels.stats.anova import anova_lm
(Note: three assumptions must hold before running ANOVA: the groups have homogeneous variances, the samples within each group follow a normal distribution, and the samples are independent.)
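For a quick one-way ANOVA without building a model, `scipy.stats.f_oneway` can be used instead of `anova_lm`; that substitution is a choice made for this sketch. The three groups below are synthetic and stand in for a continuous feature split by a three-level discrete variable.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Hypothetical continuous feature observed in three classes of a discrete variable.
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=0.0, scale=1.0, size=100)
group_c = rng.normal(loc=3.0, scale=1.0, size=100)

# A small p-value suggests that at least one group mean differs from the others.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```

A feature whose group means differ strongly (large F, small p-value) is a candidate to keep; the significance level to use is left to the reader.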
- Kendall tau rank correlation coefficient: for example, suppose we want to evaluate the correlation between education and salary. The Kendall coefficient sorts the samples by education. If the samples rank the same by education and by salary, the coefficient is 1 and the two variables are perfectly positively correlated. If the two rankings are exactly opposite, the coefficient is -1, a perfect negative correlation. If education and salary are completely independent, the coefficient is close to 0.
Code: X.corr(Y, method="kendall")
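The education and salary example can be reproduced on a tiny table. The ordinal codes and salary values below are invented for illustration only.

```python
import pandas as pd

# Hypothetical ordinal encoding of education level and matching salary values.
education = pd.Series([1, 2, 3, 4, 5])
salary_same_order = pd.Series([10, 20, 30, 40, 50])
salary_reversed = pd.Series([50, 40, 30, 20, 10])

tau_same = education.corr(salary_same_order, method="kendall")
tau_reversed = education.corr(salary_reversed, method="kendall")
print(tau_same, tau_reversed)
```

Identical rankings give a coefficient of 1 and fully reversed rankings give -1, matching the description above.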
Discrete VS discrete
- Chi-square test: the chi-square test checks whether two variables are related by measuring how far the observed frequencies deviate from the frequencies expected in theory. Its null hypothesis is that the two variables are not related. The higher the chi-square value, the stronger the evidence that the two variables are related. For example, we studied the correlation between fitness and injury:
Code: from scipy.stats import chi2
The t distribution, F distribution and normal distribution are handled similarly and are not covered one by one here.
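A chi-square test of independence on a contingency table can be sketched with `scipy.stats.chi2_contingency`. The counts below are invented for illustration and are not the article's fitness and injury data; the continuity correction is switched off so that the statistic matches the textbook formula.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical counts, invented for illustration (not the article's data).
# Rows: exercises / does not exercise; columns: injured / not injured.
observed = np.array([[30, 70],
                     [10, 90]])

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# The same p-value read from the chi-square distribution's survival function.
p_from_dist = chi2.sf(stat, dof)
print(stat, p_value, dof)
print(expected)
```

Here the expected counts under independence are 20 and 80 in each row, so the statistic is 4 * 10^2 / 20 split as 5 + 1.25 + 5 + 1.25 = 12.5 with one degree of freedom.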
- Mutual information: mutual information is essentially the overlap between the information entropies of two variables, and can be regarded as the amount of information about one random variable contained in another. Mutual information is non-negative and symmetric. When two variables are independent there is no overlap, so the mutual information is 0; the greater the mutual information, the stronger the dependence between the variables. For discrete variables the specific calculation is as follows:
I(X; Y) = sum over x and y of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
Code: sklearn.metrics.normalized_mutual_info_score(X, Y)
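The two boundary cases described above (full dependence and independence) can be checked on small discrete arrays. The arrays are made up for the example; `mutual_info_score` returns the unnormalized value in nats.

```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

x = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_copy = x.copy()                              # carries all the information about x
y_indep = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # empirically independent of x

mi_copy = mutual_info_score(x, y_copy)    # equals the entropy of x, ln 2
mi_indep = mutual_info_score(x, y_indep)  # 0 for independent variables
nmi_copy = normalized_mutual_info_score(x, y_copy)
print(mi_copy, mi_indep, nmi_copy)
```

Both functions expect discrete labels; continuous variables would need to be binned first or handled with an estimator designed for continuous data.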
In summary, the advantages and disadvantages of the filter method are:
Advantages: the method is highly general, skips classifier training and has low complexity. It is therefore suitable for large-scale data sets, can quickly remove a large number of irrelevant features, and works well as a feature pre-filter.
Disadvantages: since its evaluation criteria are independent of the specific learning algorithm, the selected feature subset usually yields lower classification accuracy than a subset chosen by the wrapper method.
(2) Wrapper method
The wrapper method takes the performance of the learner that will eventually be used as the evaluation criterion for a feature subset, so the selected subset is "tailor-made" for that learner.
The most direct approach is to traverse all possible feature subsets, feed each one to the model and select the subset with the best model score. (Not recommended: it is too expensive to compute.)
Still, a few words on it, because how the search space is traversed matters. There are two main kinds: exhaustive search and non-exhaustive search.
Exhaustive search
- Breadth First Search: use the breadth-first algorithm to traverse all possible feature subsets and select the optimal one.
Non exhaustive search
- Beam Search: first score the single features and put the best ones, as one-feature subsets, into a queue of limited length, ordered from the best-performing subset to the worst. At each step, take the highest-scoring subset from the queue, generate every subset obtained by adding one more feature to it, and insert these new subsets into the queue according to their scores.
- Best First Search: similar to beam search, except that the length of the queue is not limited.
Heuristic search is a method to reduce the search space by using heuristic information. For example, model score or feature weight can be used as heuristic information.
- Sequential forward selection (SFS): start from the empty set, add only one feature in each round, and then train the model. If the model evaluation score increases, retain the features added in this round, otherwise stop the iteration, and take the feature subset of the previous round as the optimal feature selection result.
- Generalized Sequential Forward Selection (GSFS): this method is the acceleration of SFS algorithm. It can add r features to the feature set at one time.
- Sequential backward selection (SBS): start from the complete feature set and try removing one feature at a time. If removing a feature reduces model performance, keep that feature; otherwise discard it.
- Generalized Sequential Backward Selection (GSBS): this method is the acceleration of SBS and can remove a certain number of features from the feature subset at one time. It is a fast feature selection algorithm in practical application, and its performance is relatively good. However, it is possible that the elimination operation is too fast and important information is removed, which makes it difficult to find the optimal feature subset.
- Bi directional search (BDS): use SFS and SBS to search at the same time, and stop the search only when they reach the same feature subset. In order to ensure that the same feature subset can be achieved, the features selected by SFS cannot be removed by SBS; Features removed by SBS cannot be selected by SFS.
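Sequential forward selection as described above can be sketched in a few lines. This is a minimal version under stated assumptions: the function name `sequential_forward_selection` is hypothetical, the score is mean cross-validated accuracy, and the iris data with a k-nearest-neighbours classifier are used only as a small stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_forward_selection(estimator, X, y, cv=5):
    """Greedy SFS: add one feature per round, stop when the score stops improving."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], float("-inf")
    while remaining:
        # Score every candidate subset formed by adding one more feature.
        scores = {
            f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_score:
            break  # no improvement: keep the subset from the previous round
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = scores[best_feature]
    return selected, best_score

X, y = load_iris(return_X_y=True)
selected, score = sequential_forward_selection(KNeighborsClassifier(), X, y)
print(selected, score)
```

scikit-learn also ships `sklearn.feature_selection.SequentialFeatureSelector`, which covers forward and backward selection; the hand-written loop is shown here only because it mirrors the stopping rule in the text.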
Recursive feature elimination
Recursive feature elimination (RFE) trains a base model for multiple rounds. After each round, the features with the lowest weights (for example coefficients or feature importances) are eliminated, and the next round is trained on the remaining features.
However, RFE requires the final number of selected features, n_features_to_select, to be set in advance, and it is hard to get this hyperparameter right on the first try: set too high, redundant features remain; set too low, relatively important features may be filtered out. Moreover, RFE selects only by feature weight and does not consider model performance, which is why RFECV was introduced. RFECV is RFE + CV (cross-validation). It works as follows: first use RFE to obtain a ranking of the features; then, based on that ranking, take feature subsets whose sizes run from min_features_to_select to len(features), training and cross-validating the model on each; finally, select the feature subset with the highest average score.
In short, RFECV scores the model at each round of RFE and, once training is finished, selects the feature subset with the highest score.
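A minimal RFECV run might look like the following. The synthetic data set, the logistic regression base model and the parameter values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative and 2 are redundant.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=2,
    random_state=0,
)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                     # remove one feature per round
    cv=5,                       # 5-fold cross-validation at each subset size
    min_features_to_select=1,
    scoring="accuracy",
)
selector.fit(X, y)

print(selector.n_features_)  # number of features in the best-scoring subset
print(selector.support_)     # boolean mask of the selected features
print(selector.ranking_)     # 1 for selected features, larger for earlier removals
```

Unlike plain RFE, the number of features to keep is not fixed in advance; it is whatever subset size scored best under cross-validation.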
Random feature subset
Select multiple feature subsets randomly, then evaluate the model performance respectively, and select the feature subset with high evaluation score.
- Genetic algorithm (GA): randomly generate a batch of feature subsets, and then use the model to score. The next generation feature subsets are generated through selection, crossover and mutation operations, and the higher the score, the higher the probability that the subset is selected to generate the next generation. After N-generation iteration, the feature subset with the highest evaluation function value will be formed in the population. It depends on randomness, because selection, crossover and mutation are controlled by a certain probability, so it is difficult to reproduce the results.
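The selection, crossover and mutation loop can be sketched as below. Everything specific here is an assumption for illustration: the function names, the population size, the number of generations, the mutation rate, uniform crossover, and the decision tree used as the scoring model. A fixed seed is passed in so that a run can be repeated.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def fitness(mask, X, y):
    """Mean cross-validated accuracy of a model trained on the masked features."""
    if not mask.any():
        return 0.0
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

def genetic_selection(X, y, pop_size=8, generations=5, mutation_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Each individual is a boolean mask over the features.
    population = rng.random((pop_size, n_features)) < 0.5
    best_mask, best_score = None, -1.0
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in population])
        if scores.max() > best_score:
            best_score = scores.max()
            best_mask = population[scores.argmax()].copy()
        # Selection: higher-scoring subsets are more likely to become parents.
        probs = (scores + 1e-9) / (scores + 1e-9).sum()
        parents = population[rng.choice(pop_size, size=(pop_size, 2), p=probs)]
        # Crossover: each gene is copied from one of the two parents.
        take_first = rng.random((pop_size, n_features)) < 0.5
        children = np.where(take_first, parents[:, 0], parents[:, 1])
        # Mutation: flip each gene with a small probability.
        flips = rng.random((pop_size, n_features)) < mutation_rate
        population = children ^ flips
    return best_mask, best_score

X, y = load_iris(return_X_y=True)
best_mask, best_score = genetic_selection(X, y)
print(best_mask, best_score)
```

With a different seed the returned subset may change, which is the reproducibility issue mentioned above.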
Null Importance
This method is very interesting, and its idea is simple: features that are genuinely strong, stable and important should show high importance under the true labels, and that importance should drop once the labels are shuffled. Conversely, if a feature performs only moderately under the original labels but its importance rises after the labels are shuffled, it is clearly unreliable, and such "fair-weather" features should be eliminated.
The calculation process of Null Importance is roughly as follows:
- Run the model on the original data set to obtain the feature importances;
- Shuffle the labels multiple times, and obtain the feature importances under the false labels after each shuffle;
- Calculate the difference between the feature importances under the true and false labels, and filter features based on the difference.
If the importance under the original labels is greater than the importance under the shuffled labels, the feature is a high-quality feature; if it is smaller, the feature is a poor-quality feature. For relevant code, see Null Importance.
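The three steps can be sketched as follows. The synthetic data, the random forest model, the 20 shuffles and the scoring rule (real importance minus the 75th percentile of the null importances) are all assumptions of this sketch; other ways of comparing the two sets of importances are possible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def importances(X, y, seed):
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return model.fit(X, y).feature_importances_

# shuffle=False keeps the 3 informative features in the first 3 columns;
# the remaining 5 columns are pure noise.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Step 1: importances under the true labels.
actual = importances(X, y, seed=0)

# Step 2: importances under shuffled labels, repeated several times.
rng = np.random.default_rng(0)
n_runs = 20
null = np.array([importances(X, rng.permutation(y), seed=i) for i in range(n_runs)])

# Step 3 (assumed scoring rule): keep features whose real importance
# exceeds the 75th percentile of their null importances.
score = actual - np.percentile(null, 75, axis=0)
keep = np.where(score > 0)[0]
print(keep)
```

Noise features tend to look about as important under shuffled labels as any other feature, so their real importance falls below the null level and they are filtered out.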
In practical use, RFECV and Null Importance are recommended because they consider both feature weight and model performance.
In summary, the main advantages and disadvantages of the wrapper method are:
Advantages: feature selection is more targeted than with the filter method, which benefits model performance.
Disadvantages: higher computational overhead.
(3) Embedded method
Feature selection is embedded into the learner training process. Unlike the wrapper method, in which feature selection and learner training are clearly separate steps, the embedded method carries out both in the same training process.
Based on penalty terms
For regression problems, we can train Lasso directly on the data set and select the features whose weights are non-zero.
    from sklearn.linear_model import Lasso
    # Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version.
    from sklearn.datasets import load_boston

    # Import dataset
    dataset_boston = load_boston()
    data_boston = dataset_boston.data
    target_boston = dataset_boston.target

    # Model training
    model_lasso = Lasso()
    model_lasso.fit(data_boston, target_boston)

    # Get weight vector
    print(model_lasso.coef_)
array([-0.06343729 , 0.04916467 , -0. , 0. , -0. , 0.9498107 , 0.02090951 , -0.66879 , 0.26420643 , -0.01521159 , -0.72296636 , 0.00824703 , -0.76111454])
Then use the sklearn.feature_selection.SelectFromModel method to select a subset of features from the model.

    from sklearn.feature_selection import SelectFromModel

    # prefit=True: reuse the Lasso model that has already been fitted
    model_sfm = SelectFromModel(model_lasso, prefit=True)
    print(model_sfm.transform(data_boston).shape)
    print(data_boston.shape)

    (506, 10)
    (506, 13)

It can be seen that the SelectFromModel method removes the features whose weights are zero.
For classification problems, logistic regression with an l1 penalty term can be used in the same way.

    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel

    # Import dataset
    dataset_iris = load_iris()
    data_iris = dataset_iris.data
    target_iris = dataset_iris.target

    # Training model (liblinear supports the l1 penalty)
    model_lr = LogisticRegression(penalty='l1', C=0.01, solver='liblinear')
    model_lr.fit(data_iris, target_iris)
    print(model_lr.coef_)

    # Select the features with non-zero weights
    model_sfm = SelectFromModel(model_lr, prefit=True)
    print(model_sfm.transform(data_iris).shape)
    print(data_iris.shape)

    array([[ 0.        ,  0.        , -0.18016819,  0.        ],
           [-0.03183986,  0.        ,  0.        ,  0.        ],
           [-0.00677759,  0.        ,  0.        ,  0.        ]])
    (150, 2)
    (150, 4)

It can be seen that a logistic regression model with an l1 penalty term can perform feature selection: with the help of the SelectFromModel method, the features with non-zero weights are selected. You can also combine the l1 and l2 penalty terms (elastic net), which is not covered here.
Note: for logistic regression, the parameter C controls the degree of sparsity: the smaller C is, the fewer features are selected. For Lasso, the larger the parameter alpha is, the fewer features are selected.
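The effect of alpha on sparsity can be checked directly by counting non-zero Lasso coefficients for a few values. The sketch uses `load_diabetes` rather than the Boston data, purely because it is available in current scikit-learn releases; the alpha grid is also an assumption.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Count how many features keep a non-zero weight at each penalty strength.
counts = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha).fit(X, y)
    counts[alpha] = int(np.sum(model.coef_ != 0))

print(counts)
```

The same experiment can be run for logistic regression by varying C, keeping in mind that C is an inverse penalty strength.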
Tree based model
Decision tree can be used for feature selection. The set composed of the partition features of tree nodes is the selected feature subset.
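- CART is chosen as the tree algorithm because it can process both continuous and discrete data. For classification it uses the Gini index as the splitting criterion; for regression, as in the example below, scikit-learn's DecisionTreeRegressor uses a squared-error criterion by default.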
    # Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version.
    from sklearn.datasets import load_boston
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    dataset_boston = load_boston()
    data_boston = dataset_boston.data
    target_boston = dataset_boston.target

    model_dtc = DecisionTreeRegressor()
    model_dtc.fit(data_boston, target_boston)

    # For ease of viewing, we set the precision to 3
    np.set_printoptions(precision=3)
    print(model_dtc.feature_importances_)
[4.114e-02 1.464e-03 2.521e-03 7.784e-04 5.009e-02 5.882e-01 1.097e-02 7.294e-02 3.122e-04 1.390e-02 7.009e-03 1.527e-02 1.954e-01]
We can use the sklearn.feature_selection.SelectFromModel method to select a subset of features from the model.

    from sklearn.feature_selection import SelectFromModel

    model_sfm = SelectFromModel(model_dtc, prefit=True)
    print(model_sfm.transform(data_boston))

    [[6.575 4.98 ]
     [6.421 9.14 ]
     [7.185 4.03 ]
     ...
     [6.976 5.64 ]
     [6.794 6.48 ]
     [6.03  7.88 ]]

SelectFromModel can be used with any model that exposes a coef_ or feature_importances_ attribute after fitting. Features whose coef_ or feature_importances_ value is below the set threshold are removed. When no threshold is given, scikit-learn uses the mean importance by default (or 1e-5 for models with an l1 penalty), which is why only two features remain here.
- Common forest models include the random forest and extremely randomized trees (extra trees):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel

    # data_boston and target_boston are reused from the decision tree example above
    model_rfr = RandomForestRegressor(n_estimators=50)
    model_rfr.fit(data_boston, target_boston)
    print(model_rfr.feature_importances_)

    model_sfm = SelectFromModel(model_rfr, prefit=True)
    print(model_sfm.transform(data_boston))

    [0.037 0.002 0.006 0.001 0.024 0.461 0.013 0.071 0.004 0.013 0.019 0.011 0.34 ]
    [[6.575 4.98 ]
     [6.421 9.14 ]
     [7.185 4.03 ]
     ...
     [6.976 5.64 ]
     [6.794 6.48 ]
     [6.03  7.88 ]]

    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.feature_selection import SelectFromModel

    # data_boston and target_boston are reused from the decision tree example above
    model_etr = ExtraTreesRegressor(n_estimators=50)
    model_etr.fit(data_boston, target_boston)
    print(model_etr.feature_importances_)

    model_sfm = SelectFromModel(model_etr, prefit=True)
    print(model_sfm.transform(data_boston))

    [0.031 0.004 0.044 0.013 0.041 0.38  0.022 0.031 0.032 0.024 0.034 0.02  0.324]
    [[6.575 4.98 ]
     [6.421 9.14 ]
     [7.185 4.03 ]
     ...
     [6.976 5.64 ]
     [6.794 6.48 ]
     [6.03  7.88 ]]
- Classification problems are handled with tree models in the same way as regression problems. The only difference is that the corresponding classifier is called (DecisionTreeClassifier instead of DecisionTreeRegressor); the other steps are the same.
- Forest model

    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.datasets import load_iris

    dataset_iris = load_iris()
    data_iris = dataset_iris.data
    target_iris = dataset_iris.target

    # Extremely randomized trees
    model_etc = ExtraTreesClassifier(n_estimators=50)
    model_etc.fit(data_iris, target_iris)
    print(model_etc.feature_importances_)

    # Random forest
    model_rfc = RandomForestClassifier(n_estimators=50)
    model_rfc.fit(data_iris, target_iris)
    print(model_rfc.feature_importances_)

    [0.0815576  0.06156375 0.36123267 0.49564599]
    [0.13296298 0.02399033 0.35337825 0.48966843]

Although the two models output different feature importance values, the ranking of the features (in descending order of importance) is the same in this run.