# 4 Feature Engineering - feature preprocessing

## 1 what is feature preprocessing

### 1.1 definition of feature preprocessing

##### scikit-learn explanation

The `sklearn.preprocessing` package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

In other words, feature preprocessing is the process of converting feature data, through transformation functions, into a form better suited to the algorithm or model.

#### Why should we normalize / standardize?

• When features differ greatly in unit or scale, or when one feature's variance is several orders of magnitude larger than the others', that feature tends to dominate the result, so some algorithms cannot learn from the other features.
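To make the "dominate" point concrete, here is a minimal sketch that computes a Euclidean distance between two samples. The two rows reuse the `milage` and `Consumtime` values printed later in this post; the choice of Euclidean distance and the helper name `euclidean` are assumptions made for illustration only.

```python
import math

# (milage, Consumtime) for the first two rows shown in the output further down
a = (40920.0, 0.953952)
b = (14488.0, 1.673904)

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

full = euclidean(a, b)           # distance using both features
milage_only = abs(a[0] - b[0])   # distance using milage alone

# The two numbers are almost identical: Consumtime barely contributes.
print(full, milage_only)
```

A distance-based algorithm fed this raw data would effectively see only `milage`, which is the problem scaling is meant to address.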

### 1.2 contents (making numerical data dimensionless)

• Normalization
• Standardization

### 1.3 feature preprocessing API

`sklearn.preprocessing`

## 2 normalization

### 2.1 definitions

Transform the original data so that it is mapped into a specified interval ([0, 1] by default).

### 2.2 formula

$$X' = \frac{x - \min}{\max - \min}, \qquad X'' = X' \cdot (mx - mi) + mi$$

The formula acts on each column: max is the maximum value of the column and min is its minimum value, X'' is the final result, and mx and mi are the upper and lower bounds of the specified interval (by default mx is 1 and mi is 0).
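The formula can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not scikit-learn's implementation; the function name `min_max_scale`, the sample values, and the handling of a constant column are all assumptions.

```python
def min_max_scale(column, mi=0.0, mx=1.0):
    """Apply X' = (x - min) / (max - min), then X'' = X' * (mx - mi) + mi."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # Constant column: the formula would divide by zero.
        # Mapping everything to mi is a choice made for this sketch.
        return [mi for _ in column]
    return [((x - lo) / span) * (mx - mi) + mi for x in column]

column = [60, 75, 90]  # made-up values for one feature column

print(min_max_scale(column))            # [0.0, 0.5, 1.0]
print(min_max_scale(column, 2.0, 3.0))  # [2.0, 2.5, 3.0]
```

The second call shows the role of mi and mx: the same relative positions are simply moved into the interval [2, 3].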

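So how should we understand this process? Take, for instance, a column with the values 60, 75 and 90: the minimum is 60 and the maximum is 90, so 75 becomes (75 - 60) / (90 - 60) = 0.5, which stays 0.5 under the default interval [0, 1].

### 2.3 API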

• sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), ...)
• MinMaxScaler.fit_transform(X)
• X: data in numpy array format, shape [n_samples, n_features]
• Return value: the transformed array, with the same shape
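A small usage sketch of the API listed above, assuming numpy and scikit-learn are installed. The input array is made up; `data_min_` and `data_max_` are the per-column bounds the scaler stores after fitting.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: 3 samples, 2 features, shape [n_samples, n_features]
X = np.array([[60.0, 2.0],
              [75.0, 4.0],
              [90.0, 3.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

print(X_scaled.shape)     # same shape as X
print(scaler.data_min_)   # per-column minimum learned during fit
print(scaler.data_max_)   # per-column maximum learned during fit
print(X_scaled)           # each column now spans [0, 1]
```

Each column is scaled independently, so the second column reaches 1 at the second sample even though the first column reaches 1 at the third.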

### 2.4 data calculation

• analysis

1. Instantiate MinMaxScaler

2. Transform the data with fit_transform

```
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_demo():
    """
    Normalization demonstration
    :return: None
    """
    # Placeholder path: the data-loading step is missing from the source
    data = pd.read_csv("path/to/data.csv")
    print(data)
    # 1. Instantiate a transformer class
    transfer = MinMaxScaler(feature_range=(2, 3))
    # 2. Call fit_transform on the three feature columns
    data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
    print("Results of min-max normalization:\n", data)

    return None
```

### 2.5 normalization summary

Note that the maximum and minimum depend on the data at hand and change whenever the data changes. They are also very sensitive to outliers, so this method is not robust and is only suitable for traditional scenarios with small, precisely measured datasets.
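One way to read "the maximum and minimum vary" is that bounds learned from one batch of data need not hold for the next. The sketch below, in plain Python with made-up numbers and hypothetical helper names, fits the bounds on one list and applies them to another.

```python
def fit_min_max(column):
    """Remember the minimum and maximum of the data seen at fit time."""
    return min(column), max(column)

def transform(values, lo, hi):
    """Apply X' = (x - min) / (max - min) with previously fitted bounds."""
    return [(x - lo) / (hi - lo) for x in values]

train = [10.0, 20.0, 30.0]  # made-up data seen when fitting
new = [5.0, 25.0, 40.0]     # made-up data that arrives later

lo, hi = fit_min_max(train)
scaled_new = transform(new, lo, hi)

print(scaled_new)  # [-0.25, 0.75, 1.5]
```

Two of the three new values fall outside [0, 1], because the later data has a different minimum and maximum than the data used for fitting.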

## 3 standardization

### 3.1 definitions

Transform the original data so that each feature has a mean of 0 and a standard deviation of 1.

### 3.2 formula

$$X' = \frac{x - \text{mean}}{\sigma}$$

The formula acts on each column: mean is the column's average and σ is its standard deviation.
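As a minimal sketch of this formula, the snippet below standardizes one made-up column with Python's `statistics` module. It uses the population standard deviation (dividing by n), which is also what `StandardScaler` uses; the function name `standardize` and the constant-column handling are assumptions for illustration.

```python
import statistics

def standardize(column):
    """Apply X' = (x - mean) / sigma using the population standard deviation."""
    mean = statistics.mean(column)
    sigma = statistics.pstdev(column)
    if sigma == 0:
        # Constant column: avoid dividing by zero (a choice made for this sketch)
        return [0.0 for _ in column]
    return [(x - mean) / sigma for x in column]

# Made-up values chosen so that mean = 5 and sigma = 2
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(standardize(column))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike min-max normalization, the result is not confined to a fixed interval; values are expressed as a number of standard deviations from the mean.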

So, back to the outliers mentioned earlier, let's see how standardization handles them.

• For normalization: if outliers change the maximum or minimum, the results change noticeably.
• For standardization: given a reasonable amount of data, a small number of outliers has little effect on the mean, so the variance also changes little.
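The comparison above can be checked numerically. The following sketch builds a made-up column of 1000 values, appends a single extreme value, and compares the quantities each method depends on; the data and the helper name `summary` are assumptions for illustration.

```python
import statistics

# Made-up column: 1000 values cycling through 40..60
values = [40.0 + (i % 21) for i in range(1000)]
with_outlier = values + [160.0]  # one extreme value appended

def summary(column):
    """Return the quantities each scaling method depends on."""
    return {
        "range": max(column) - min(column),  # used by normalization
        "mean": statistics.mean(column),     # used by standardization
        "std": statistics.pstdev(column),    # used by standardization
    }

before, after = summary(values), summary(with_outlier)
for key in before:
    print(key, before[key], after[key], "ratio:", after[key] / before[key])
```

In this example the range grows by a factor of 6, while the mean moves by well under 1% and the standard deviation by roughly 15%. The standard deviation is not immune to the outlier, but it is affected far less than the range.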

### 3.3 API

• sklearn.preprocessing.StandardScaler()
• After processing, the data in each column is centered on a mean of 0 with a standard deviation of 1
• StandardScaler.fit_transform(X)
• X: data in numpy array format, shape [n_samples, n_features]
• Return value: the transformed array, with the same shape
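A small usage sketch of this API, assuming numpy and scikit-learn are installed. The input array is made up; `mean_` and `var_` are the per-column statistics the scaler stores after fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples, 2 features, shape [n_samples, n_features]
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.shape)         # same shape as X
print(scaler.mean_)        # per-column mean learned during fit
print(scaler.var_)         # per-column (population) variance learned during fit
print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

The last two lines check the property stated in the list above: after the transform, every column has mean 0 and standard deviation 1, up to floating-point error.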

### 3.4 data calculation

The same data is processed again, this time with standardization.

• analysis

1. Instantiate StandardScaler

2. Transform the data with fit_transform

```
import pandas as pd
from sklearn.preprocessing import StandardScaler

def stand_demo():
    """
    Standardization demonstration
    :return: None
    """
    # Placeholder path: the data-loading step is missing from the source
    data = pd.read_csv("path/to/data.csv")
    print(data)
    # 1. Instantiate a transformer class
    transfer = StandardScaler()
    # 2. Call fit_transform on the three feature columns
    data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
    print("Standardized results:\n", data)
    print("Mean of each feature column:\n", transfer.mean_)
    print("Variance of each feature column:\n", transfer.var_)

    return None
```

Return results

```
     milage     Liters  Consumtime  target
0     40920   8.326976    0.953952       3
1     14488   7.153469    1.673904       2
2     26052   1.441871    0.805124       1
..      ...        ...         ...     ...
997   26575  10.650102    0.866627       3
998   48111   9.134528    0.728045       3
999   43757   7.882601    1.332446       3

[1000 rows x 4 columns]
Standardized results:
[[ 0.33193158  0.41660188  0.24523407]
[-0.87247784  0.13992897  1.69385734]
[-0.34554872 -1.20667094 -0.05422437]
...,
[-0.32171752  0.96431572  0.06952649]
[ 0.65959911  0.60699509 -0.20931587]
[ 0.46120328  0.31183342  1.00680598]]
Average value of characteristics of each column:
[  3.36354210e+04   6.55996083e+00   8.32072997e-01]
Variance of characteristics of each column:
[  4.81628039e+08   1.79902874e+01   2.46999554e-01]
```
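The printed results can be cross-checked by hand. The sketch below copies the first data row, the means and the variances from the output above and applies X' = (x - mean) / σ with σ taken as the square root of the variance; the variable names are chosen for illustration.

```python
import math

# Values copied from the printed output above
row0 = [40920, 8.326976, 0.953952]                       # first data row
mean = [3.36354210e+04, 6.55996083e+00, 8.32072997e-01]  # transfer.mean_
var = [4.81628039e+08, 1.79902874e+01, 2.46999554e-01]   # transfer.var_
expected = [0.33193158, 0.41660188, 0.24523407]          # first standardized row

# X' = (x - mean) / sigma, with sigma = sqrt(variance)
z = [(x - m) / math.sqrt(v) for x, m, v in zip(row0, mean, var)]

print(z)
print(expected)
```

The computed values agree with the first row of the standardized results to within the rounding of the printed numbers.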

### 3.5 standardization summary

Standardization is stable when there are enough samples, which makes it suitable for modern big-data scenarios with noisy data.

Posted on Tue, 02 Nov 2021 21:25:01 -0400 by seavers