Data and features determine the upper limit of machine learning; models and algorithms merely approximate that limit.

# 1. Data preprocessing

- Data acquisition
- Data cleaning: remove dirty data
- Data sampling: used when the classes are imbalanced, via up-sampling or down-sampling. When one class far outnumbers the other and the dataset is large, down-sample the majority class; when the dataset is small, up-sample the minority class. Alternatively, modify the loss function by assigning per-class sample weights.
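The down-sampling step above can be sketched with pandas; the toy DataFrame and its `label` column are illustrative assumptions, not part of the original text:

```python
import pandas as pd

# Toy imbalanced dataset: 8 positive samples, 2 negative samples
df = pd.DataFrame({
    "feature": range(10),
    "label":   [1] * 8 + [0] * 2,
})

pos = df[df["label"] == 1]
neg = df[df["label"] == 0]

# Down-sampling: randomly draw as many majority-class rows
# as there are minority-class rows
pos_down = pos.sample(n=len(neg), random_state=0)
balanced = pd.concat([pos_down, neg])
```

Up-sampling is the mirror image: sample the minority class with `replace=True` until it matches the majority-class count.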

# 2. Feature processing

- Standardization: rescale each feature to zero mean and unit variance, the shape of a standard normal distribution.

```python
# Standardization
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
data = ss.fit_transform(data)
```

- Normalization: rescale the feature values of each sample to a common scale, eliminating the influence of differing units and magnitudes between features. Interval (min-max) scaling is a kind of normalization.

```python
# Normalization; more suitable when the values are concentrated
from sklearn.preprocessing import Normalizer

sn = Normalizer()
data_normalizer = sn.fit_transform(data)
```

Differences between the two:

- Normalization (min-max scaling) is easily distorted by extreme maximum and minimum values, so it suits data whose values are concentrated;
- If the data contain outliers or substantial noise, prefer standardization;
- Models such as SVM, KNN, and PCA require standardization or normalization;
- Both eliminate the influence of differing feature scales;
- Both speed up convergence when solving for the optimum by gradient descent.
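The interval scaling mentioned above maps each feature independently into [0, 1] via x' = (x - min) / (max - min); a minimal sketch with sklearn's `MinMaxScaler` on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Each column is scaled independently to the [0, 1] interval
mms = MinMaxScaler()
X_scaled = mms.fit_transform(X)
```

Because the scale depends only on the column minimum and maximum, a single extreme value compresses all other samples toward one end, which is exactly why this method suits concentrated data.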

- Quantitative feature binarization: convert numerical data to 0/1 by setting a threshold

```python
from sklearn.preprocessing import Binarizer

# Values above the threshold become 1, values at or below it become 0
b = Binarizer(threshold=3)
b.fit_transform(data)
```


- Qualitative feature dummy coding: convert categorical data into numerical data, e.g. with OneHotEncoder

```python
from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder()
oh.fit_transform(target.reshape((-1, 1)))
```


- Missing value processing
- Data conversion

Both are covered in detail in "Data analysis and mining 2 - Data Preprocessing".
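As a brief illustration of missing value processing, each gap can be filled with a column statistic; a minimal sketch using sklearn's `SimpleImputer` with the mean strategy (the toy array is an assumption for demonstration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])

# Replace each NaN with the mean of its column
imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X)
```

Other strategies (`"median"`, `"most_frequent"`, `"constant"`) follow the same fit/transform pattern.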

# 3. Feature dimensionality reduction

## 3.1. Feature selection

Variance selection: compute the variance of each feature and screen features by setting a variance threshold

```python
# Variance filtering: keep features whose variance exceeds 1
from sklearn.feature_selection import VarianceThreshold
import pandas as pd

data_train_columns = [col for col in data_train.columns if col not in ['target']]
vt = VarianceThreshold(threshold=1)
data_vt = vt.fit_transform(data_train[data_train_columns])
data_vt = pd.DataFrame(data_vt)
```

Correlation coefficient method: compute the (Pearson) correlation coefficient of each feature with the target value

```python
# Screen features by their Pearson correlation with the target, keeping the top k
import numpy as np
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr

skb = SelectKBest(
    lambda X, Y: np.array(list(map(lambda x: pearsonr(x, Y), X.T))).T[0],
    k=10,
)
data_skb = skb.fit_transform(data_train[data_train_columns], data_train['target'])
```

Chi-square test: measure the dependence between categorical features and a categorical target

```python
from sklearn.feature_selection import SelectKBest, chi2

SelectKBest(chi2, k=2).fit_transform(x, y)
```

Maximum information coefficient (MIC) method: measure the strength of association between each feature and the target

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from minepy import MINE

def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    # SelectKBest expects (score, p-value) pairs; 0.5 is a placeholder p-value
    return (m.mic(), 0.5)

SelectKBest(
    lambda X, Y: np.array(list(map(lambda x: mic(x, Y), X.T))).T[0],
    k=10,
).fit_transform(train_data, train_target)
```

Recursive feature elimination: the RFE algorithm repeatedly adds or removes specific feature variables to find the combination that maximizes model performance.

```python
# RFE: recursive feature elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# estimator: the base model (here a logistic regression classifier)
# solver: optimization method; liblinear is a good choice for small datasets,
#   sag and saga are faster on large ones; for multiclass problems every
#   solver except liblinear can be used; newton-cg, lbfgs and sag support
#   only the L2 penalty, while liblinear and saga also support L1
# max_iter: maximum number of iterations for the solver to converge
# n_features_to_select: number of features to keep
rfe = RFE(
    estimator=LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=500),
    n_features_to_select=10,
)
# data_train['target'] is the class label
data_rfe = rfe.fit_transform(data_train[data_train_columns], data_train['target'])
```

Model-based feature selection

- Feature selection based on penalty item

```python
# Feature selection based on a penalty term (regularization)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
import pandas as pd

# penalty: regularization term added to the loss to curb overfitting on the
#   training samples and improve generalization; the lbfgs solver supports
#   only l2
# C: inverse of the regularization strength; the smaller the value, the
#   stronger the regularization
sfm = SelectFromModel(
    LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', multi_class='auto')
)
data_sfm = sfm.fit_transform(data, target)
data_sfm = pd.DataFrame(data_sfm)
```

- Feature selection based on a tree model

```python
# Feature selection based on a tree model (GBDT)
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd

sfmGBDT = SelectFromModel(GradientBoostingClassifier())
data_sfmGBDT = sfmGBDT.fit_transform(data, target)
data_sfmGBDT = pd.DataFrame(data_sfmGBDT)
```
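The selectors above all return a bare array, losing the original column names. One way to recover which features survived is `get_support()`, which every sklearn selector provides; the synthetic DataFrame below is an assumption for demonstration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame({
    "informative": rng.rand(100),
    "noise_a": rng.rand(100),
    "noise_b": rng.rand(100),
})
# The target depends only on the "informative" column
y = (X["informative"] > 0.5).astype(int)

sfm = SelectFromModel(GradientBoostingClassifier(random_state=0))
sfm.fit(X, y)

# get_support() returns a boolean mask over the input columns;
# indexing the column list with it recovers the kept feature names
kept = X.columns[sfm.get_support()].tolist()
```

By default `SelectFromModel` keeps features whose importance exceeds the mean importance, so the noise columns are expected to be dropped here.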