# Data analysis and mining 3 - Feature Engineering

Data and features determine the upper limit of machine learning; models and algorithms merely approach that limit.

# 1. Data preprocessing

1. Data acquisition
2. Data cleaning: remove dirty data
3. Data sampling: used when the classes are imbalanced, via up-sampling or down-sampling. When positive samples greatly outnumber negative samples and the dataset is large, down-sample the majority class; when positive samples outnumber negative samples but the dataset is small, up-sample the minority class. Alternatively, modify the loss function to assign sample weights.
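The up- and down-sampling described above can be sketched with `sklearn.utils.resample`; the toy DataFrame and its column names here are illustrative, not from the article:

```python
# A minimal sketch of up- and down-sampling for an imbalanced dataset
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    'feature': range(10),
    'label':   [1] * 8 + [0] * 2,   # 8 positive vs 2 negative: imbalanced
})
pos = df[df['label'] == 1]
neg = df[df['label'] == 0]

# Down-sampling: shrink the majority class to the minority size (large datasets)
pos_down = resample(pos, replace=False, n_samples=len(neg), random_state=0)
balanced_down = pd.concat([pos_down, neg])

# Up-sampling: replicate the minority class up to the majority size (small datasets)
neg_up = resample(neg, replace=True, n_samples=len(pos), random_state=0)
balanced_up = pd.concat([pos, neg_up])
```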

# 2. Feature processing

1. Standardization: rescale each feature to zero mean and unit variance, so that roughly Gaussian features approximate a standard normal distribution.
```
# Standardization: zero mean, unit variance per feature
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
data = ss.fit_transform(data)
```
1. Normalization: rescale sample values to a common scale, eliminating the influence of differing dimensions between features; interval scaling (e.g. min-max scaling) is one kind of normalization.
```
# Normalizer scales each sample (row) to unit norm
from sklearn.preprocessing import Normalizer

sn = Normalizer()
data_normalizer = sn.fit_transform(data)
# Better suited to data with concentrated values
```

Differences between the two:

• Normalization is easily affected by extreme maximum and minimum values, so it suits data with concentrated values;
• If the data contains outliers or substantial noise, prefer standardization;
• Scale-sensitive models such as SVM, KNN, and PCA require standardization or normalization;
• Both eliminate the influence of differing feature scales;
• Both speed up convergence when solving for the optimum by gradient descent.
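The interval scaling mentioned above can be done with `MinMaxScaler`, which maps each feature into [0, 1]; the small array here is illustrative only:

```python
# Interval scaling (min-max normalization): (x - min) / (max - min), per column
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
mms = MinMaxScaler()                 # default feature_range=(0, 1)
data_mms = mms.fit_transform(data)
```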
1. Quantitative data binarization: convert numerical data to 0/1 by setting a threshold.
```
from sklearn.preprocessing import Binarizer

b = Binarizer(threshold=3)  # values greater than the threshold become 1, values less than or equal to it become 0
b.fit_transform(data)
```

1. Qualitative data dummy coding: convert categorical data into numerical data, e.g. with OneHotEncoder.
```
from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder()
oh.fit_transform(target.reshape((-1, 1)))
```

1. Missing value processing: see Data analysis and mining 2 - Data Preprocessing.
1. Data conversion: see Data analysis and mining 2 - Data Preprocessing.
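As a quick reminder of the missing-value step covered in the earlier article, mean imputation with pandas looks like this; the column name is illustrative only:

```python
# A minimal sketch of missing-value handling: mean imputation with pandas
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
df['x'] = df['x'].fillna(df['x'].mean())  # replace NaN with the column mean
```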

# 3. Feature dimensionality reduction

## 3.1. Feature selection

Variance selection: compute the variance of each feature and filter features by setting a variance threshold.

```
# Variance filtering: keep features whose variance exceeds the threshold (here 1)
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data_train_columns = [col for col in data_train.columns if col not in ['target']]
vt = VarianceThreshold(threshold=1)
data_vt = vt.fit_transform(data_train[data_train_columns])
data_vt = pd.DataFrame(data_vt)
```

Correlation coefficient method: compute the Pearson correlation coefficient between each feature and the target value.

```
# Screen features with the correlation-coefficient method, selecting the k best
import numpy as np
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr

# The score function must return a (scores, p-values) pair, one entry per feature column
skb = SelectKBest(lambda X, Y: tuple(np.array(list(map(lambda x: pearsonr(x, Y), X.T))).T), k=10)
data_skb = skb.fit_transform(data_train[data_train_columns], data_train['target'])
```

Chi-square test: measure the correlation between categorical features and a categorical target with the chi-square statistic.

```
from sklearn.feature_selection import SelectKBest, chi2

SelectKBest(chi2, k=2).fit_transform(x, y)
```

Maximal information coefficient (MIC) method: measure the correlation between features and the target.

```
import numpy as np
from minepy import MINE
from sklearn.feature_selection import SelectKBest

def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)  # (score, dummy p-value) in the format SelectKBest expects

SelectKBest(lambda X, Y: tuple(np.array(list(map(lambda x: mic(x, Y), X.T))).T), k=10).fit_transform(train_data, train_target)
```

Recursive feature elimination (RFE): repeatedly fits a base model and removes the weakest features, searching for the feature subset that maximizes model performance.

```
# RFE recursive feature elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# estimator: the base model, here a logistic regression classifier
# solver: optimization method; liblinear is a good choice for small datasets,
#   while sag and saga are faster on large ones. For multiclass problems, all
#   solvers except liblinear can be used; newton-cg, lbfgs and sag support only
#   the L2 penalty, while liblinear and saga support the L1 penalty.
# max_iter: int, default 100; maximum number of iterations for the solver to
#   converge (applies to newton-cg, sag and lbfgs)
# n_features_to_select=10: the number of features to keep
rfe = RFE(estimator=LogisticRegression(multi_class='auto',
                                       solver='lbfgs',
                                       max_iter=500),
          n_features_to_select=10)
data_rfe = rfe.fit_transform(data_train[data_train_columns], data_train['target'])  # 'target' is the class label
```

Model-based feature selection

1. Feature selection based on a penalty term
```
# Feature selection based on a penalty term
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# solver: the optimization algorithm; lbfgs supports only the l2 penalty.
# penalty: regularization term added to avoid overfitting; it penalizes the fit
#   to the training samples to improve generalization.
# C: inverse of the regularization strength; smaller values mean stronger regularization.
sfm = SelectFromModel(LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', multi_class='auto'))
data_sfm = sfm.fit_transform(data, target)
data_sfm = pd.DataFrame(data_sfm)
data_sfm
```
1. Feature selection based on a tree model
```
# Feature selection based on a tree model (completed here with a random forest as an example)
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

sfm_tree = SelectFromModel(RandomForestRegressor(n_estimators=100))
data_sfm_tree = sfm_tree.fit_transform(data, target)
```