Feature engineering: feature selection

Note: This article comes from the WeChat official account Datawhale.

Reducing the number of features used in a statistical analysis can bring several benefits:

  • Improved accuracy
  • Reduced risk of overfitting
  • Faster training
  • Better data visualization
  • A more interpretable model

In practice, every machine learning task has an optimal number of features (Figure 1). Adding more features than necessary degrades model performance because of the noise they introduce. The real challenge is finding out which features are best to use, which depends on the amount of data available and the complexity of the task. This is where feature selection techniques can help.

Feature selection

There are many different methods for feature selection. The most important ones are:
1. Filter methods: filter the dataset and keep only a subset containing the relevant features (for example, using a Pearson correlation matrix).
2. Wrapper methods: pursue the same goal as filter methods, but use a machine learning model as the evaluation criterion (for example, forward / backward / bidirectional selection or recursive feature elimination). We feed some features to the model, evaluate its performance, and then decide whether to add or remove features to improve accuracy. This can be more accurate than filtering, but it is computationally more expensive.
3. Embedded methods: like wrapper methods, embedded methods rely on a machine learning model. The difference is that an embedded method examines the training iterations of the model and ranks each feature by how much it contributes to training.
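
As a rough illustration of the three families listed above, the sketch below applies one representative of each to synthetic data. The data from make_classification, the f_classif score, the logistic regression model, and the choice of 5 features are all assumptions made for illustration, not the setup used later in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# 1. Filter: score every feature against the label without training a model.
filter_mask = SelectKBest(f_classif, k=5).fit(X_demo, y_demo).get_support()

# 2. Wrapper: repeatedly fit a model and drop the weakest feature.
wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
wrapper_mask = wrapper.fit(X_demo, y_demo).get_support()

# 3. Embedded: use the importances a model learns while it trains.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
embedded_mask = SelectFromModel(forest).fit(X_demo, y_demo).get_support()

for name, mask in [("filter", filter_mask),
                   ("wrapper", wrapper_mask),
                   ("embedded", embedded_mask)]:
    print(name, "keeps", int(mask.sum()), "of", mask.size, "features")
```

The three masks need not agree with each other, which is one reason to compare several methods on the same data.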


In this article, I'll use the Mushroom Classification dataset to try to predict whether a mushroom is poisonous from its characteristics. Along the way, we will try different feature elimination techniques to see how they affect the training time and the overall accuracy of the model.

First, we need to import all the necessary libraries.

The dataset we will use in this example is shown in the following figure.

Before feeding the data into a machine learning model, I one-hot encode all categorical variables, split the data into features (X) and labels (Y), and finally divide it into a training set and a test set.

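import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Separate the features from the 'class' label, then encode both
X = df.drop(['class'], axis = 1)
Y = df['class']
X = pd.get_dummies(X, prefix_sep='_')
Y = LabelEncoder().fit_transform(Y)

X2 = StandardScaler().fit_transform(X)

# Hold out 30% for testing (the seed 101 is assumed from the later splits in this article)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X2, Y, test_size = 0.30, random_state = 101)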
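
To make the encoding step concrete, here is a minimal sketch on a three-row toy table. The rows and the 'e' / 'p' class codes are hypothetical values chosen for illustration; only the column names echo the ones used in this article.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "odor": ["n", "f", "n"],
    "gill-size": ["b", "n", "b"],
    "class": ["e", "p", "e"],
})

# One column per (feature, category) pair, named <feature>_<category>.
features = pd.get_dummies(toy.drop(["class"], axis=1), prefix_sep="_")
# Class labels become integers, assigned in sorted order of the labels.
labels = LabelEncoder().fit_transform(toy["class"])

print(list(features.columns))  # ['odor_f', 'odor_n', 'gill-size_b', 'gill-size_n']
print(list(labels))            # [0, 1, 0]
```

This is why later snippets refer to columns such as odor_n or gill-size_b: each one is a single category of an original feature.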

Feature importance

Ensemble models based on decision trees (such as random forests) can be used to rank the importance of different features. Knowing which features matter most to our model is essential for understanding how it makes predictions (and makes it easier to interpret). At the same time, we can remove features that bring no benefit to the model.

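import time
from sklearn.ensemble import RandomForestClassifier

# Time the training of a random forest on all features
start = time.process_time()
trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train,Y_Train)
print(time.process_time() - start)
predictionforest = trainedforest.predict(X_Test)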

As shown in the figure below, a random forest classifier trained on all the features achieves 100% accuracy with a training time of about 2.2 seconds. In each of the following examples, the training time of the model is printed on the first line of the output for reference.

Once our random forest classifier is trained, we can create a feature importance plot to see which features matter most to the model's predictions (Figure 4). In this case, only the top seven features are shown.
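
One way to obtain such a ranking is sketched below on synthetic data. The data and the feature_0 ... feature_9 names are hypothetical; with the article's variables, trainedforest and X.columns would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded dataset (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# feature_importances_ holds one score per input column; the scores sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top7 = importances.nlargest(7)
print(top7)
```

With matplotlib installed, top7.plot(kind="barh") would draw a horizontal bar chart of these seven scores.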

Now that we know which features our random forest considers most important, we can try training the model on only the top-ranked ones.

X_Reduced = X[['odor_n','odor_f', 'gill-size_n','gill-size_b']]
X_Reduced = StandardScaler().fit_transform(X_Reduced)
X_Train2, X_Test2, Y_Train2, Y_Test2 = train_test_split(X_Reduced, Y, test_size = 0.30,  random_state = 101)

start = time.process_time()
trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train2,Y_Train2)
print(time.process_time() - start)
predictionforest = trainedforest.predict(X_Test2)

As we can see below, using only this reduced feature set lowers the accuracy by just 0.03% and halves the training time.

We can also visualize a trained decision tree to understand how feature selection is carried out.

from sklearn import tree

# Time the training of a single decision tree on all features
start = time.process_time()
trainedtree = tree.DecisionTreeClassifier().fit(X_Train, Y_Train)
print(time.process_time() - start)
predictionstree = trainedtree.predict(X_Test)

The features at the top of the tree structure are the ones our model relies on most to perform the classification. Therefore, keeping only the first few features at the top and discarding the others may still yield a model with considerable accuracy.

import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Export the fitted tree in DOT format; feature names are passed as a plain list
data = export_graphviz(trainedtree, out_file=None, feature_names=list(X.columns),
        class_names=['edible', 'poisonous'],
        filled=True, rounded=True)
graph = graphviz.Source(data)
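
Without drawing the tree, the feature used at the root split can also be read directly from the fitted estimator. The sketch below uses hypothetical binary data in which the label simply copies column 2, so the outcome is easy to check by eye.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-hot-like data: 4 binary columns, label equal to column 2.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 4))
y_demo = X_demo[:, 2]

clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# tree_.feature[0] is the index of the column tested at the root node.
root_feature = int(clf.tree_.feature[0])
print("column used at the root split:", root_feature)
print("importances:", clf.feature_importances_)
```

Here the root split alone separates the classes, so that column receives all of the importance. On real data the importance is spread over several levels of the tree.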

Recursive feature elimination (RFE)

Recursive feature elimination (RFE) takes as input an instance of a machine learning model and the final number of features to keep. It then recursively reduces the number of features by repeatedly fitting the model, ranking the features by the model's coef_ or feature_importances_ attribute, and discarding the least important ones.
By creating a for loop in which the number of input features is the variable, we can track the accuracy recorded at each iteration and find the number of features that works best for our model. Using RFE's support_ attribute, we can find the names of the features judged most important (rfe.support_ returns a Boolean array, where True means a feature is kept as important and False means it is not).

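from sklearn.feature_selection import RFE

model = RandomForestClassifier(n_estimators=700)
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=4)
start = time.process_time()
# Select the 4 best features on the training set, then apply the same selection to the test set
RFE_X_Train = rfe.fit_transform(X_Train,Y_Train)
RFE_X_Test = rfe.transform(X_Test)
# Refit on the reduced data so that score() accepts the 4-column test set
rfe = rfe.fit(RFE_X_Train,Y_Train)
print(time.process_time() - start)
print("Overall Accuracy using RFE: ", rfe.score(RFE_X_Test,Y_Test))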
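
The for loop described above is not shown in the snippet, so here is a minimal sketch of what it might look like. It runs on synthetic data with a small forest to keep it fast. The data, the feature names, and the range of 1 to 4 features are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101
)

accuracy_by_k = {}
selected_by_k = {}
for k in range(1, 5):
    estimator = RandomForestClassifier(n_estimators=30, random_state=0)
    rfe = RFE(estimator, n_features_to_select=k).fit(X_tr, y_tr)
    # score() reduces X_te to the selected columns before scoring.
    accuracy_by_k[k] = rfe.score(X_te, y_te)
    # support_ is a Boolean mask over the original columns.
    selected_by_k[k] = [n for n, keep in zip(feature_names, rfe.support_) if keep]

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print("best number of features:", best_k)
print("selected:", selected_by_k[best_k])
```

Because accuracy is measured on the held-out split, the chosen number of features depends on that split. Cross-validated selection (for example scikit-learn's RFECV) is a common alternative.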


SelectFromModel

SelectFromModel is another scikit-learn method that can be used for feature selection. It works with any scikit-learn model that exposes a coef_ or feature_importances_ attribute after fitting. Compared with RFE, SelectFromModel is a less robust solution: it simply removes the less important features based on a computed threshold, with no iterative optimization involved.
To test the effectiveness of SelectFromModel, I decided to use an ExtraTreesClassifier in this example.
ExtraTreesClassifier (extremely randomized trees) is a tree-based ensemble classifier. Compared with random forests, it can yield lower variance (and therefore a lower risk of overfitting). The main differences from a random forest are that, by default, extra trees are built on the whole training set rather than on bootstrap samples drawn with replacement, and that the split thresholds at each node are drawn at random instead of being searched for the best value.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

model = ExtraTreesClassifier()
start = time.process_time()
model = model.fit(X_Train,Y_Train)
model = SelectFromModel(model, prefit=True)
print(time.process_time() - start)
Selected_X = model.transform(X_Train)

start = time.process_time()
trainedforest = RandomForestClassifier(n_estimators=700).fit(Selected_X, Y_Train)
print(time.process_time() - start)
Selected_X_Test = model.transform(X_Test)
predictionforest = trainedforest.predict(Selected_X_Test)
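
To see what "a computed threshold" means in practice, the sketch below fits SelectFromModel on synthetic data and inspects the threshold it applied. The data and the number of trees are assumptions. For an estimator without an L1 penalty, the default threshold is the mean feature importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the encoded dataset (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))
selector.fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()  # True for every column that is kept

print("threshold used: ", selector.threshold_)
print("mean importance:", importances.mean())
print("kept columns:   ", np.flatnonzero(mask))
```

A column is kept exactly when its importance reaches the threshold. Passing threshold="median" or a number to SelectFromModel changes how strict the cut is.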

Correlation matrix analysis

Another possible way to reduce the number of features in the dataset is to examine the correlation between the features and the label.
The Pearson correlation coefficient ranges from -1 to 1:

  • If the correlation between two features is 0, there is no linear relationship between them.
  • If the correlation between two features is greater than 0, the two features tend to increase together (the closer the coefficient is to 1, the stronger the relationship).
  • If the correlation between two features is less than 0, one feature tends to decrease as the other increases (the closer the coefficient is to -1, the stronger the relationship).
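
The three cases above can be checked with a small worked example. The pearson helper and the toy vectors are illustrative, not part of the original analysis. Note that the last pair is clearly related (y is a function of x), yet its linear correlation is zero.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson(x, 2 * x + 1)       # increases with x
r_down = pearson(x, -3 * x)        # decreases as x increases
r_none = pearson(x, (x - 3) ** 2)  # symmetric around x = 3, no linear trend

print(r_up, r_down, r_none)
```

The first two values are 1 and -1 up to rounding, and the third is 0. This is why a zero coefficient should be read as "no linear relationship" rather than "no relationship".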

In this case, we will only consider features whose absolute correlation with the output variable is greater than 0.5.

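import seaborn as sns
from matplotlib.pyplot import figure

# Correlate every encoded feature with the label Y
Numeric_df = pd.DataFrame(X)
Numeric_df['Y'] = Y
corr= Numeric_df.corr()
corr_y = abs(corr["Y"])
# Keep the features whose absolute correlation with Y exceeds 0.5
highest_corr = corr_y[corr_y >0.5]

figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')

corr2 = Numeric_df[['bruises_f' , 'bruises_t' , 'gill-color_b' , 'gill-size_b' ,  'gill-size_n' , 'ring-type_p' , 'stalk-surface-below-ring_k' ,  'stalk-surface-above-ring_k' , 'odor_f', 'odor_n']].corr()

# Heatmap of the pairwise correlations among the selected features
sns.heatmap(corr2, annot=True, fmt=".2g")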

Another aspect worth checking in this analysis is whether the selected variables are highly correlated with each other. If they are, we only need to keep one of the correlated variables and can drop the others.
Finally, we can select only the features most strongly correlated with Y and train / test an SVM model to evaluate the results of this approach.
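
The redundancy check mentioned above could be sketched as follows. The drop_redundant helper, the 0.9 cutoff, and the toy columns are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def drop_redundant(df, threshold=0.9):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the entries above the diagonal so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, 5, 7, 9, 11],  # b = 2a + 1, so it carries the same information as a
    "c": [5, 1, 4, 2, 3],   # only weakly correlated with a and b
})

reduced, dropped = drop_redundant(toy)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

Which member of a correlated pair is kept is arbitrary here (the first one in column order). A natural refinement is to keep the one more strongly correlated with the label.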

Univariate selection

Univariate feature selection is a statistical method for selecting the features that are most strongly related to the corresponding labels. With the SelectKBest method, we choose which scoring function to use to evaluate the features and how many of the best features (K) to retain. Different scoring functions are available depending on our needs:

  • Classification = chi2, f_classif, mutual_info_classif
  • Regression = f_regression, mutual_info_regression

In this case, we will use chi2 (Figure 7).

Chi-squared (chi2) accepts only non-negative values as input, so we first scale the input data to the range 0 to 1.

from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# chi2 needs non-negative inputs, so rescale the standardized data to [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
Scaled_X = min_max_scaler.fit_transform(X2)

# Keep the 2 features with the highest chi-squared score
X_new = SelectKBest(chi2, k=2).fit_transform(Scaled_X, Y)
X_Train3, X_Test3, Y_Train3, Y_Test3 = train_test_split(X_new, Y, test_size = 0.30,  random_state = 101)
start = time.process_time()
trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train3,Y_Train3)
print(time.process_time() - start)
predictionforest = trainedforest.predict(X_Test3)
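
To give an idea of what the chi2 score measures, the sketch below computes it by hand for two binary features and compares the result with scikit-learn. The six-row toy data is hypothetical. For each feature, the score compares the feature total observed in each class with the total expected if the feature were independent of the class.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy data: column 0 follows the class exactly, column 1 barely does.
X_demo = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
y_demo = np.array([1, 1, 1, 0, 0, 0])

Y_onehot = np.eye(2)[y_demo]                     # one column per class
observed = Y_onehot.T @ X_demo                   # feature totals within each class
class_prob = Y_onehot.mean(axis=0)               # share of samples in each class
feature_total = X_demo.sum(axis=0)               # feature totals over all samples
expected = np.outer(class_prob, feature_total)   # totals expected under independence

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, p_values = chi2(X_demo, y_demo)

print("manual: ", manual_scores)   # [3.0, 0.333...]
print("sklearn:", sklearn_scores)
```

The feature that tracks the class gets the larger score, which is what SelectKBest ranks on. Feature totals only make sense for non-negative values, hence the rescaling step.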

Lasso Regression

When we apply regularization to a machine learning model, we add a penalty on the model parameters to keep the model from fitting the input data too closely. This makes the model less complex and helps avoid overfitting (where the model learns not only the key characteristics of the data but also its inherent noise).
One possible regularization method is lasso (L1) regression. With lasso regression, the coefficients of input features that contribute little to training are shrunk. As a result, some features may be discarded automatically, with their coefficients set exactly to zero.
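
The shrink-to-zero behavior can be illustrated with the soft-thresholding rule. In the special case of orthonormal features, the lasso solution is each least-squares coefficient moved toward zero by an amount proportional to the penalty strength, and set to zero if it would cross zero. The coefficients and the threshold below are made-up numbers, and real lasso solvers handle correlated features iteratively.

```python
import numpy as np

def soft_threshold(coefficients, threshold):
    """Shrink each coefficient toward zero by `threshold`, clipping at zero."""
    coefficients = np.asarray(coefficients, dtype=float)
    return np.sign(coefficients) * np.maximum(np.abs(coefficients) - threshold, 0.0)

# Hypothetical least-squares coefficients for four features.
ols_coefficients = np.array([3.0, -0.4, 0.1, -2.0])
shrunk = soft_threshold(ols_coefficients, threshold=0.5)

print(shrunk)  # the two small coefficients become exactly zero
print("features kept:", int(np.count_nonzero(shrunk)))
```

The larger the threshold (that is, the stronger the penalty), the more coefficients end up at zero and the fewer features remain.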

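from sklearn.linear_model import LassoCV

regr = LassoCV(cv=5, random_state=101)
# Fit first: alpha_ and coef_ only exist on a fitted model
regr.fit(X_Train, Y_Train)
print("LassoCV Best Alpha Scored: ", regr.alpha_)
# For a regressor, score() returns R^2 rather than classification accuracy
print("LassoCV Model Accuracy: ", regr.score(X_Test, Y_Test))
# X.columns may carry the helper 'Y' column from the correlation step, so exclude it by name
model_coef = pd.Series(regr.coef_, index = [c for c in X.columns if c != 'Y'])
print("Variables Eliminated: ", str(sum(model_coef == 0)))
print("Variables Kept: ", str(sum(model_coef != 0)))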

Once we have trained our model, we can create a feature importance graph again to understand which features are considered most important by our model (Figure 8). This is very useful, especially when trying to understand how our model decides to make predictions, so it makes our model easier to explain.

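import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')

# Plot only the features whose lasso coefficient is non-zero
top_coef = model_coef.sort_values()
top_coef[top_coef != 0].plot(kind = "barh")
plt.title("Most Important Features Identified using Lasso (non-zero coefficients)")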

Posted on Tue, 14 Jan 2020 03:18:20 -0500 by jarosciak