In machine learning, a random forest is a classifier made up of multiple decision trees. It is an ensemble algorithm, and its output class is the mode of the classes output by the individual trees.
Random forest = Bagging + decision tree
Bagging ensemble principle
Bagging ensemble process
1. Sampling: draw a subset of all samples
2. Learning: train weak learners
3. Aggregation: combine the learners by equal voting
Example: classify the following circles and squares
Implementation process:
1. Sampling different data sets
2. Training classifier
3. Equal voting to obtain the final result
4. Summary of main implementation process
Random forest construction process
For example, if you train five trees, four of which have a result of True and one tree has a result of False, then the final voting result is True
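The five-tree vote described above can be sketched in a few lines. This is only an illustration: `majority_vote` and `tree_votes` are hypothetical names, not part of any library.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction among the trees (equal voting)."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Five hypothetical tree outputs: four vote True, one votes False.
tree_votes = [True, True, False, True, True]
final = majority_vote(tree_votes)
print(final)  # True
```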
Key steps in the process of random forest building (N represents the number of training cases (samples), and M represents the number of features):
1) Randomly select one sample at a time with replacement, and repeat N times (duplicate samples may occur)
2) Randomly select m features, m << M, and build a decision tree
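The two steps above can be sketched by hand as follows. This is a minimal sketch under stated assumptions: the data is synthetic (`make_classification`), m is taken as the square root of M, and each tree gets one fixed feature subset as in the description above. scikit-learn's own RandomForestClassifier instead draws a fresh random subset of features at every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (assumption).
X, y = make_classification(n_samples=200, n_features=16, n_informative=5, random_state=0)
N, M = X.shape
m = int(np.sqrt(M))  # number of features per tree, m << M
rng = np.random.RandomState(0)

forest = []
for _ in range(5):
    rows = rng.randint(0, N, N)                  # step 1: N draws with replacement
    cols = rng.choice(M, size=m, replace=False)  # step 2: m randomly chosen features
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))

# Equal voting: each tree predicts on its own feature subset.
# With binary 0/1 labels and an odd number of trees, mean > 0.5 is the majority.
votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (y_pred == y).mean())
```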
Questions to consider

1. Why randomly sample the training set?
Without random sampling, every tree would be trained on the same training set, and the trained trees would all produce exactly the same classification results, so voting would be pointless.
2. Why sample with replacement?
Without replacement, the training samples of each tree would all be different, with no overlap. Each tree would then be "biased" and thoroughly "one-sided" (this may be putting it too strongly), that is, the trees would differ greatly from one another. The final classification of a random forest depends on voting among multiple trees (weak classifiers), so trees trained on completely disjoint data contribute little to a shared decision.
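To see how much overlap sampling with replacement produces, the sketch below draws one bootstrap sample and measures the fraction of distinct original samples it contains. The value of N and the variable names are assumptions chosen for illustration; the closed form 1 - (1 - 1/N)^N is the probability that a given sample is drawn at least once, which approaches 1 - 1/e (about 0.632) as N grows.

```python
import numpy as np

rng = np.random.RandomState(0)
N = 10_000
sample = rng.randint(0, N, N)                 # bootstrap: N draws with replacement
unique_fraction = np.unique(sample).size / N  # share of distinct original samples drawn
expected = 1 - (1 - 1 / N) ** N               # probability a given sample appears at least once
print(round(unique_fraction, 3), round(expected, 3))
```

Each tree therefore sees roughly two thirds of the distinct training samples, so any two trees share a good part of their data while still differing from each other.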

Random forest API
There are RandomForestClassifier for random forest classification and RandomForestRegressor for random forest regression. RandomForestClassifier is introduced here.
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

n_estimators: integer, optional (default = 100) number of trees in the forest;

criterion: string, optional (default = "gini"). There are two options: Gini impurity ("gini") and information entropy ("entropy")

max_depth: integer or None, optional (default = None) maximum depth of the tree;

max_features="auto": limits the number of features considered at each split. Features beyond the limit are not considered. The default value is the square root of the total number of features. (In scikit-learn 1.1 and later, "auto" is deprecated in favor of "sqrt".)
 If "auto", then max_features=sqrt(n_features).
 If "sqrt", then max_features=sqrt(n_features)(same as "auto").
 If "log2", then max_features=log2(n_features).
 If None, then max_features=n_features.
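The rules above can be restated as a small function. `resolve_max_features` is a hypothetical helper written for illustration, not a scikit-learn function; it assumes the result is truncated to an integer with a minimum of 1.

```python
import math

def resolve_max_features(max_features, n_features):
    """Hypothetical helper: number of features considered at each split."""
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if max_features is None:
        return n_features
    raise ValueError(f"unsupported max_features: {max_features!r}")

# For a dataset with 13 features:
for option in ("sqrt", "log2", None):
    print(option, resolve_max_features(option, 13))
```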

bootstrap: boolean, optional (default = True) whether to sample with replacement (bootstrap sampling) when building the trees

min_samples_split: minimum number of samples for node partition

min_samples_leaf: minimum number of samples of leaf node

Hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf
Example
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 1. Get the Titanic dataset
titan = pd.read_csv('titanic.csv')

# 2. Basic data processing
# 2.1 Determine the feature values and the target value
x = titan[['pclass', 'age', 'sex']].copy()  # copy so the fill below does not act on a view
y = titan['survived']
# 2.2 Handle missing values
x['age'] = x['age'].fillna(x['age'].mean())
# 2.3 Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# 3. Feature engineering (dictionary feature extraction)
# x.to_dict(orient="records") converts x to a list of dictionaries
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient='records'))
x_test = transfer.transform(x_test.to_dict(orient='records'))  # reuse the mapping fitted on the training set

# 4. Machine learning (random forest) with grid search
estimator = RandomForestClassifier()
param_grid = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_grid, cv=3)
estimator.fit(x_train, y_train)

# 5. Model evaluation
score = estimator.score(x_test, y_test)
print("Accuracy on the test set:\n", score)
print("Best cross-validation score:\n", estimator.best_score_)
print("Best parameters found:\n", estimator.best_params_)
print("Best estimator:\n", estimator.best_estimator_)
print("Results of each cross-validation run:\n", estimator.cv_results_)
Random forest regression filling missing values
With the SimpleImputer class in the sklearn.impute module, you can easily fill null values with the mean, the median, or other commonly used values. Next, we fill the missing values of the Boston house price dataset with the mean, with 0, and with random forest regression, compare the fit in each case, and find the best way to fill the missing values.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an earlier version
from sklearn.impute import SimpleImputer  # fills null values
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Get the dataset: 506 * 13 = 6578 values in total
boston = load_boston()
x_full = boston.data          # feature data
y_full = boston.target        # label column
n_samples = x_full.shape[0]   # 506 rows
n_features = x_full.shape[1]  # 13 columns (features)
Build missing values
1. First, determine the proportion of missing values: 50%, that is, 3289 values are missing
rng = np.random.RandomState(0)  # random seed
missing_rate = 0.5  # proportion of missing values
# np.floor() rounds down and returns a float ending in .0
n_missing_samples = int(np.floor(n_samples * n_features * missing_rate))
n_missing_samples  # 3289
2. The missing values are scattered over the 506 * 13 data table: 3289 missing values (cells identified by a row and a column) are generated at random positions. As with a DataFrame, we locate the cells to blank out through their row and column indexes.
missing_samples = rng.randint(0, n_samples, n_missing_samples)    # 3289 random row indexes
missing_features = rng.randint(0, n_features, n_missing_samples)  # 3289 random column indexes
# Sampling this way draws far more indexes (3289) than there are rows (506), so row indexes necessarily repeat
# np.random.choice() can also be used for sampling: it can draw non-repeating random numbers,
# which keeps the missing values from concentrating in the same rows and spreads them out to some extent
missing_features
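The np.random.choice() alternative mentioned in the comment above can be sketched as follows. This is an illustration under assumptions: the table shape is hard-coded to 506 x 13, and the names `flat`, `rows_u` and `cols_u` are hypothetical. It also shows that independent row and column draws can hit the same cell more than once, so the number of distinct missing cells they produce is smaller than the number of draws.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 506, 13
n_missing = int(n_samples * n_features * 0.5)  # 3289

# Independent row/column draws: the same (row, column) cell can be chosen repeatedly.
rows = rng.randint(0, n_samples, n_missing)
cols = rng.randint(0, n_features, n_missing)
n_distinct = len(set(zip(rows.tolist(), cols.tolist())))

# Drawing flat cell indexes without replacement yields exactly n_missing distinct cells.
flat = rng.choice(n_samples * n_features, size=n_missing, replace=False)
rows_u, cols_u = np.unravel_index(flat, (n_samples, n_features))

x = np.zeros((n_samples, n_features))
x[rows_u, cols_u] = np.nan
print(n_distinct, int(np.isnan(x).sum()))
```

If an exact 50% missing rate matters, the second approach is the safer choice; the first gives a lower effective rate because of the repeated cells.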
3. Generate missing values
x_missing = x_full.copy()  # copy the source dataset
x_missing[missing_samples, missing_features] = np.nan  # set the randomly chosen (row, column) cells to missing
x_missing = pd.DataFrame(x_missing)
x_missing
Missing value filling
① mean filling
The SimpleImputer class in sklearn.impute is used for filling. missing_values=np.nan specifies the kind of value to be filled (null values); strategy='mean' specifies the filling strategy, that is, fill with the mean.
# ① Fill with the mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
x_missing_mean = imp_mean.fit_transform(x_missing)  # fit() + transform() ==> fit_transform()
x_missing_mean = pd.DataFrame(x_missing_mean)
x_missing_mean
② Fill with 0 value
strategy='constant' means that a constant is used for filling, and fill_value=0 means that the constant is 0.
imp_0 = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
x_missing_0 = imp_0.fit_transform(x_missing)  # fit() + transform() ==> fit_transform()
x_missing_0 = pd.DataFrame(x_missing_0)
x_missing_0
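The difference between the two strategies is easiest to see on a tiny array. The sketch below uses a made-up 3 x 2 matrix (`toy` is a hypothetical name) rather than the Boston data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (made-up values).
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])

mean_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(toy)
zero_filled = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0).fit_transform(toy)

print(mean_filled)  # missing cells become the column means, 2.0 and 6.0
print(zero_filled)  # missing cells become 0.0
```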
③ Filling with random forest regression
Any regression is a process of learning from the feature matrix and then solving for the continuous label y. This works because the regression algorithm assumes that there is some relationship between the feature matrix and the label. In fact, labels and features can be swapped. For example, in a problem of predicting "house price" from "region", "environment" and "number of nearby schools", we can use the data for "region", "environment" and "number of nearby schools" to predict "house price", or conversely use "environment", "number of nearby schools" and "house price" to predict "region" (somewhat like solving for one unknown from the other three in the equation "y=kx+b"). Filling missing values by regression uses exactly this idea.
For a dataset with n features in which feature T has missing values, we take feature T as the label, and the other n-1 features together with the original label form a new feature matrix. The rows where T is not missing have both features and a label, so they form our training set. The rows where T is missing have features but no label, and they are the part we need to predict.
The other n-1 features + the original label, for rows where feature T is not missing: xtrain
The values of feature T that are not missing: ytrain
The other n-1 features + the original label, for rows where feature T is missing: xtest
The missing values of feature T: unknown, the ytest we need to predict
This method works very well when one feature has a large number of missing values but the other features are complete!
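The xtrain/ytrain/xtest split described above can be sketched for a single feature on synthetic data. Everything here is an assumption made for illustration: the random data, the linear label, the choice of column `t = 2` as feature T, and the 30% missing rate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                    # synthetic feature matrix (assumption)
y = X @ np.array([1.0, 2.0, 3.0, 4.0])  # synthetic original label (assumption)

t = 2                                   # hypothetical feature T to fill
missing = rng.rand(200) < 0.3           # rows where feature T is missing
X_missing = X.copy()
X_missing[missing, t] = np.nan

# New feature matrix: the other n-1 features plus the original label.
others = np.column_stack([np.delete(X_missing, t, axis=1), y])
xtrain, ytrain = others[~missing], X_missing[~missing, t]
xtest = others[missing]

# Train on rows where T is known, predict T for rows where it is missing.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(xtrain, ytrain)
X_filled = X_missing.copy()
X_filled[missing, t] = model.predict(xtest)

# Compare with the true values, which are known here only because the data is synthetic.
rf_error = np.abs(X_filled[missing, t] - X[missing, t]).mean()
print("mean absolute error of the filled values:", rf_error)
```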
What if there are missing values for other features besides feature T in the data?
The answer is to traverse all the features, starting with the one with the fewest missing values (because filling the feature with the fewest missing values requires the least accurate information). When filling in a feature, first replace the missing values of the other features with 0. Each time a regression prediction is completed, put the predicted values into the original feature matrix, and then go on to fill the next feature. After each fill, the number of features with missing values drops by one, so fewer and fewer features need to be filled with 0 on each pass. By the time we reach the last feature (which should have the most missing values of all), no other feature needs to be filled with 0, and the regression has already supplied a great deal of useful information for the other features, which can be used to fill the feature with the most missing values.
(1) Sort the column indexes by number of missing values
x_missing_reg = x_missing.copy()
# Order the features by number of missing values, from fewest to most
# np.argsort() returns the indexes that would sort the values in ascending order
sort_columns_index = np.argsort(x_missing_reg.isnull().sum()).values
sort_columns_index
(2) Traverse the indexes and fill in the null values
for i in sort_columns_index:
    # Build a new feature matrix (the features not being filled + the original label)
    # and a new label (the feature being filled)
    df = x_missing_reg
    fillc = df.iloc[:, i]  # the column currently being filled: the new label
    df = pd.concat([df.iloc[:, df.columns != i], pd.DataFrame(y_full)], axis=1)  # remaining n-1 columns plus the complete label

    # In the new feature matrix, fill the remaining missing values with 0
    df_0 = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0).fit_transform(df)

    # Extract the training and test sets
    ytrain = fillc[fillc.notnull()]  # non-null values of the column being filled: training label
    ytest = fillc[fillc.isnull()]    # null values of the column being filled: test label (only its index is used)
    xtrain = df_0[ytrain.index, :]   # rows of the new feature matrix where the column is not null
    xtest = df_0[ytest.index, :]     # rows of the new feature matrix where the column is null

    # Fill the missing values with random forest regression
    rfc = RandomForestRegressor(n_estimators=100).fit(xtrain, ytrain)
    y_predict = rfc.predict(xtest)

    # Write the predicted values back into the original feature matrix
    x_missing_reg.loc[x_missing_reg.iloc[:, i].isnull(), i] = y_predict
④ Evaluate the filling results
Next, we use cross-validation (mean squared error) to score the original dataset, the mean-filled dataset, the 0-filled dataset and the dataset filled by random forest regression.
# Evaluate the filled datasets
X = [x_full, x_missing_mean, x_missing_0, x_missing_reg]
mse = []  # mean squared error is used for evaluation
for x in X:
    estimator = RandomForestRegressor(n_estimators=100, random_state=0)
    scores = cross_val_score(estimator, x, y_full, scoring='neg_mean_squared_error', cv=5).mean()
    mse.append(scores * -1)  # the scorer returns negative MSE, so flip the sign
mse
[21.571667100368845, 42.62658760318384, 42.62658760318384, 17.52358682764511]
The evaluation shows that mean filling and 0 filling both give a mean squared error above 40, while random forest regression filling scores as low as 17.5, even better than the original dataset (21.6). Of course, overfitting cannot be ruled out. Note also that the mean-filled and 0-filled scores are identical to the last digit, which would be expected only if the two datasets were the same, so the 0-fill figure should be treated with caution.
# Visualization
plt.figure(figsize=(12, 8))  # canvas
colors = ['r', 'g', 'b', 'orange']  # bar colors
x_labels = ["x_full", "x_missing_mean", "x_missing_0", "x_missing_reg"]  # labels
ax = plt.subplot(111)  # add a subplot
for i in range(len(mse)):
    ax.barh(i, mse[i], color=colors[i], alpha=0.6, align='center')
ax.set_title('Imputation Technique with Boston Data')  # set the title
ax.set_xlim(left=np.min(mse) * 0.9, right=np.max(mse) * 1.1)  # set the x-axis range
ax.set_yticks(range(len(mse)))
ax.set_xlabel("MSE")  # set the x-axis label
ax.set_yticklabels(x_labels)  # set the y-axis tick labels
plt.show()
Adjusting parameters
For tree models, the more luxuriant the tree (the greater its depth and the more branches and leaves it has), the more complex the model. Tree models therefore sit naturally in the upper right corner of the graph, and since a random forest is built on tree models, it is also a naturally complex model. The parameters of a random forest all work toward one goal: reduce the complexity of the model and move it to the left of the graph to prevent overfitting. Of course, no tuning rule is absolute.
How does each parameter affect complexity and the model? Until now we have tuned parameters by looking for the optimal value of each in turn on the learning curve, hoping to push the accuracy higher. Now that we know the direction of tuning for a random forest, namely reducing complexity, we can pick out the parameters that have a large impact on complexity, study their monotonicity, and then focus on the ones that reduce complexity the most. Parameters that are not monotonic, or that increase complexity, can be used as the situation requires, and most of the time can simply be left alone. Based on experience, we rank the impact of the various parameters on the model; you can refer to this order when tuning.
| Parameter | Effect on the model's performance on unseen data | Degree of influence |
| --- | --- | --- |
| n_estimators | Improves until it levels off; n_estimators ↑ does not affect the complexity of an individual model | ⭐⭐⭐⭐ |
| max_depth | Can go either way. The default (no depth limit) is the highest complexity; adjust max_depth ↓ to reduce complexity. The model becomes simpler and moves to the left of the graph | ⭐⭐⭐ |
| min_samples_leaf | Can go either way. The default minimum of 1 is the highest complexity; adjust min_samples_leaf ↑ to reduce complexity. The model becomes simpler and moves to the left of the graph | ⭐⭐ |
| min_samples_split | Can go either way. The default minimum of 2 is the highest complexity; adjust min_samples_split ↑ to reduce complexity. The model becomes simpler and moves to the left of the graph | ⭐⭐ |
| max_features | Can go either way. The default, auto, is the square root of the total number of features, which sits at a middle level of complexity, so it can be adjusted in either direction: max_features ↓ makes the model simpler and moves it to the left of the graph; max_features ↑ makes the model more complex and moves it to the right. max_features is the only parameter that can make the model either simpler or more complex, so the direction of adjustment must be considered when tuning it | ⭐ |
| criterion | Can go either way; gini is generally used | It depends |
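One way to study how a single parameter moves the model along the complexity axis is to sweep it and record the cross-validated score. The sketch below does this for max_depth. It is only an illustration under assumptions: the data is synthetic (`make_classification`), and the candidate depths, the forest size and `cv=3` are arbitrary choices kept small so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for depth in [1, 2, 4, 8, None]:  # None means no depth limit (highest complexity)
    model = RandomForestClassifier(n_estimators=30, max_depth=depth, random_state=0)
    results[depth] = cross_val_score(model, X, y, cv=3).mean()

for depth, score in results.items():
    print("max_depth =", depth, "mean CV accuracy =", round(score, 3))

best_depth = max(results, key=results.get)
print("best max_depth:", best_depth)
```

The same loop can be repeated for min_samples_leaf, min_samples_split or max_features, following the order of influence in the table.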