Data analysis and mining 3 - Feature Engineering

Data and features determine the upper limit of machine learning; models and algorithms merely approximate that limit.

1. Data preprocessing

  1. Data acquisition
  2. Data cleaning: remove dirty data
  3. Data sampling: used when the classes are imbalanced; includes up-sampling and down-sampling. When one class heavily outnumbers the other and the data volume is large, down-sample the majority class; when the data volume is small, up-sample the minority class. Alternatively, modify the loss function to assign class weights.

2. Feature processing

  1. Standardization: transform features so that they follow a standard normal distribution (zero mean, unit variance).
from sklearn.preprocessing import StandardScaler
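A minimal sketch of standardization on toy data (the array values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_std.std(axis=0))   # and standard deviation ~1
```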
  2. Normalization: scale the feature values of samples to a common range, eliminating the influence of different scales between features. Interval scaling (min-max scaling) is one kind of normalization.
from sklearn.preprocessing import Normalizer
# Normalization is better suited to data whose values are concentrated
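Note that sklearn's `Normalizer` rescales each sample (row) to unit norm, while per-feature interval scaling is done by `MinMaxScaler` (an addition here, not imported in the original text). A small sketch of both:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# Normalizer rescales each sample (row) to unit L2 norm
X_norm = Normalizer(norm='l2').fit_transform(X)

# MinMaxScaler performs interval scaling per feature (column) into [0, 1]
X_mm = MinMaxScaler().fit_transform(X)
```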

Differences between the two:

  • Normalization is easily affected by extreme maximum and minimum values, so it is better suited to data with concentrated values;
  • If the data contains outliers or a lot of noise, use standardization;
  • Distance- or variance-based models such as SVM, KNN, and PCA require standardization or normalization first;
  • Both eliminate the influence of scale between features;
  • Both speed up convergence when solving for the optimum by gradient descent;
  3. Quantitative data binarization: binarize numerical data by setting a threshold
from sklearn.preprocessing import Binarizer
b=Binarizer(threshold=3)  # values strictly greater than the threshold become 1, values less than or equal to it become 0
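A runnable sketch with toy data (array values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1, 5], [3, 2], [4, 6]])
b = Binarizer(threshold=3)       # strictly greater than 3 -> 1, otherwise 0
X_bin = b.fit_transform(X)
print(X_bin)  # [[0 1], [0 0], [1 1]]
```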


  4. Qualitative data dummy coding: convert categorical data into numerical data, e.g. with OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
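A minimal sketch of one-hot encoding (the color labels are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['red'], ['green'], ['red']])
enc = OneHotEncoder()                    # returns a sparse matrix by default
X_oh = enc.fit_transform(X).toarray()
print(enc.categories_)  # categories are sorted: ['green', 'red']
print(X_oh)             # [[0. 1.], [1. 0.], [0. 1.]]
```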


  5. Missing value processing: see Data analysis and mining 2 - Data Preprocessing
  6. Data conversion

3. Feature dimensionality reduction

3.1. Feature selection

Variance-based feature selection: compute the variance of each feature and filter features by setting a variance threshold

# Variance filtering: set the variance threshold to 1
from sklearn.feature_selection import VarianceThreshold
data_train_columns = [col for col in data_train.columns if col not in ['target']]
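A self-contained sketch on toy data (the arrays are illustrative; features with variance at or below the threshold are removed):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 0.0, 10.0],
              [2.0, 0.0, 20.0],
              [3.0, 0.0, 30.0]])
vt = VarianceThreshold(threshold=1)  # drop features whose variance is <= 1
X_sel = vt.fit_transform(X)
print(X_sel)  # only the third column (variance ~66.7) survives
```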

Correlation coefficient method: compute the correlation coefficient between each feature and the target value (Pearson correlation coefficient)

# Screen features by correlation coefficient; k is the number of features to keep
import numpy as np
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr
# pearsonr returns (correlation, p-value); the score function keeps only the correlations
skb=SelectKBest(lambda X,Y:np.array(list(map(lambda x:pearsonr(x,Y),X.T))).T[0],k=10)

Chi-square test: measure the correlation between categorical features and a categorical target with the chi-square statistic

from sklearn.feature_selection import SelectKBest, chi2
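A minimal sketch with toy count data (the arrays are illustrative; note that `chi2` requires non-negative feature values):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative features (e.g. counts or frequencies)
X = np.array([[1, 9, 0], [2, 8, 1], [8, 1, 0], [9, 2, 1]])
y = np.array([0, 0, 1, 1])
X_sel = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_sel.shape)  # (4, 2)
```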

Maximum information coefficient (MIC) method: measure the correlation between features and a categorical target

import numpy as np
from minepy import MINE
from sklearn.feature_selection import SelectKBest

# Wrap MINE so it returns (score, p-value) like other score functions; the fixed 0.5 is a placeholder p-value
def mic(x, y):
	m = MINE()
	m.compute_score(x, y)
	return (m.mic(), 0.5)

SelectKBest(lambda X,Y:np.array(list(map(lambda x:mic(x,Y),X.T))).T[0],k=10).fit_transform(train_data,train_target)

Recursive feature elimination: the RFE algorithm finds the feature subset that maximizes model performance by recursively removing the weakest features.

# RFE recursive feature elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# estimator is the base model; here a logistic regression classifier
# solver: the optimization method. liblinear is a good choice for small data sets, while sag and saga are faster for large ones. For multiclass problems, all solvers except liblinear can handle multinomial loss. newton-cg, lbfgs and sag support only the L2 penalty; liblinear and saga also support L1.
# max_iter: int, default 100; the maximum number of iterations for the solver to converge (applies to newton-cg, sag and lbfgs)
# n_features_to_select=10 is the number of features to keep
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)
data_rfe = rfe.fit_transform(data_train[data_train_columns], data_train['target'])  # data_train['target'] is the class label
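A self-contained version of the same idea; here `make_classification` stands in for the `data_train` frame used elsewhere in this series:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rfe = RFE(estimator=LogisticRegression(solver='liblinear'),
          n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
print(X_rfe.shape)  # (200, 10)
```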

Model-based feature selection

  1. Feature selection based on a penalty term
# Feature selection based on a penalty term
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# solver is the optimization algorithm; lbfgs supports only L2, so use liblinear for an L1 penalty. penalty adds a regularization term to the loss, penalizing model complexity on the training samples to avoid overfitting and improve generalization. C is the inverse of the regularization strength: the smaller the value, the stronger the regularization.
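A runnable sketch on synthetic data (`make_classification` and the `C=0.1` value are illustrative choices, not from the original):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
# The L1 penalty drives some coefficients to exactly zero; those features are dropped
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
X_sel = SelectFromModel(lr).fit_transform(X, y)
print(X_sel.shape)  # the number of surviving features depends on the data
```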
  2. Feature selection based on a tree model
# Feature selection based on a tree model
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
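A runnable sketch on synthetic data (again using `make_classification` as a stand-in dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
# Features whose importance falls below the threshold (default: the mean importance) are dropped
sfm = SelectFromModel(GradientBoostingClassifier(random_state=0))
X_sel = sfm.fit_transform(X, y)
print(X_sel.shape)
```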

3.2. Linear dimensionality reduction
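PCA (principal component analysis) is the standard linear dimensionality-reduction method in sklearn; a minimal sketch, assuming PCA is the method this section covers:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
pca = PCA(n_components=5)    # project onto the top 5 principal components
X_pca = pca.fit_transform(X)
print(X_pca.shape)  # (200, 5)
```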

Tags: Python Machine Learning Data Analysis Data Mining sklearn

Posted on Tue, 21 Sep 2021 18:16:19 -0400 by little_webspinner