Random forest [machine learning notes]

In machine learning, a random forest is a classifier made up of multiple decision trees. It is an ensemble algorithm: its output class is the mode (majority vote) of the classes output by the individual trees.

Random forest = Bagging + decision tree

Bagging ensemble principle

Bagging ensemble process
1. Sampling: draw a subset of the training samples
2. Learning: train a weak learner on each subset
3. Integration: combine the learners by equal-weight voting

Example: classify the following circles and squares

Implementation process:
1. Sample different data sets
2. Train a classifier on each data set
3. Use equal-weight voting to obtain the final result
4. Summary of the main implementation process

Random forest construction process


For example, if you train five trees and four of them output True while one outputs False, the final voting result is True.
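
A minimal sketch of that majority vote (the five tree predictions here are hypothetical):

from collections import Counter

tree_predictions = [True, True, True, True, False]  # hypothetical outputs of five trees
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)  # True, because 4 of the 5 trees voted True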

Key steps in building a random forest (N is the number of training samples, M is the number of features); a short sketch follows the two steps:
1) Each time, randomly draw one sample with replacement, repeated N times (so a tree's training set may contain duplicate samples)

2) Randomly select m features, with m << M, and build a decision tree on them
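
A minimal numpy sketch of these two randomization steps (N, M and the array names are illustrative assumptions, not part of any library API):

import numpy as np

rng = np.random.RandomState(0)
N, M = 100, 16                  # illustrative: 100 samples, 16 features
m = int(np.sqrt(M))             # m << M, e.g. sqrt(M) features per tree

sample_idx = rng.randint(0, N, size=N)              # step 1: bootstrap, N draws with replacement
feature_idx = rng.choice(M, size=m, replace=False)  # step 2: m features without replacement
# a decision tree would then be trained on X[sample_idx][:, feature_idx]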

  • Reflection
    • 1. Why randomly sample the training set?
      If every tree used the same training set (i.e. no random sampling), all trained trees would produce exactly the same classification result.

    • 2. Why sample with replacement?
      If sampling were done without replacement, each tree's training samples would be completely different, with no overlap. Each tree would then be "biased" and absolutely "one-sided" (which may of course be wrong), i.e. the trees would differ too much from one another, while the final classification of a random forest depends on the vote of many such trees (weak classifiers).

Random forest API

Scikit-learn provides RandomForestClassifier for random forest classification and RandomForestRegressor for random forest regression. RandomForestClassifier is introduced here.

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)[source]
  • n_estimators: integer, optional (default = 100) number of trees in the forest;

  • criterion: string, optional (default = "gini"). Two options: Gini impurity ("gini") and information entropy ("entropy")

  • max_depth: integer or None, optional (default = None) maximum depth of the tree;

  • max_features="auto", limit the number of features to be considered when branching. Features exceeding the limit will be discarded. The default value is the square of the total number of features

    • If "auto", then max_features=sqrt(n_features).
    • If "sqrt", then max_features=sqrt(n_features)(same as "auto").
    • If "log2", then max_features=log2(n_features).
    • If None, then max_features=n_features.
  • bootstrap: boolean, optional (default = True) whether to use sampling with replacement when building the trees

  • min_samples_split: minimum number of samples required to split an internal node

  • min_samples_leaf: minimum number of samples required at a leaf node

  • Main hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf
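
A minimal usage sketch showing how these hyperparameters map onto the constructor (the parameter values are illustrative, not recommendations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    criterion='gini',        # split quality measure: 'gini' or 'entropy'
    max_depth=8,             # maximum depth of each tree
    min_samples_split=4,     # minimum samples required to split a node
    min_samples_leaf=2,      # minimum samples required at a leaf
    max_features='sqrt',     # features considered at each split
    bootstrap=True,          # sample with replacement when building trees
    random_state=0)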

example

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 1. Get Titanic dataset
titan = pd.read_csv('titanic.csv')

#2. Basic data processing
#2.1 determine the feature values and the target value
x = titan[['pclass', 'age', 'sex']]
y = titan['survived']

#2.2 missing value handling
x['age'] = x['age'].fillna(x['age'].mean())  # fill missing ages with the mean age

#2.3 data set division
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)


#3. Feature Engineering (dictionary feature extraction)
# For converting x to dictionary data, x.to_dict(orient="records")
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient='records'))
x_test = transfer.transform(x_test.to_dict(orient='records'))  # use transform, not fit_transform, so the test set uses the vocabulary fitted on the training set


# 4. Machine learning (random forest)
estimator = RandomForestClassifier()
param_grid={"n_estimators": [120,200,300,500,800,1200], "max_depth": [5, 8, 15, 25, 30]}
estimator=GridSearchCV(estimator,param_grid=param_grid,cv=3)

estimator.fit(x_train,y_train)

# 5. Model evaluation
score = estimator.score(x_test, y_test)
print("Direct calculation accuracy:\n", score)
print("The best result of cross validation:\n", estimator.best_score_)
print("Adjusted optimal parameters:\n", estimator.best_params_)
print("Best parametric model:\n", estimator.best_estimator_)
print("Accuracy results after each cross validation:\n", estimator.cv_results_)

Filling missing values with random forest regression

The SimpleImputer class in the sklearn.impute module makes it easy to fill null values with the mean, the median, or another common value. Below, we fill the missing values of the Boston house price data set with the mean, with 0, and with random forest regression, check the fitting performance in each case, and find the best way to fill the missing values.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_boston
from sklearn.impute import SimpleImputer # for filling null values
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Get the data set --- 506 * 13 = 6578 values in total
boston = load_boston()
x_full = boston.data   # feature matrix
y_full = boston.target # label column
n_samples = x_full.shape[0] # 506 rows
n_features = x_full.shape[1] # 13 columns (features)

Build missing values
1. First, determine the proportion of missing values: 50%, i.e. 3289 values will be made missing

rng = np.random.RandomState(0) # Random seed
missing_rate = 0.5 # Proportion of missing values
n_missing_samples = int(np.floor(n_samples*n_features*missing_rate)) # np.floor() rounds down and returns a float ending in .0
n_missing_samples # 3289

2. The missing values are scattered over the 506 * 13 data table: 3289 cells (row/column positions) are chosen at random. As with a DataFrame, we locate the cells to be made missing through row and column indexes.

missing_samples = rng.randint(0,n_samples,n_missing_samples)  # randomly draw 3289 row indexes
missing_features= rng.randint(0,n_features,n_missing_samples) # randomly draw 3289 column indexes
# Sampling this way draws far more row indexes than the 506 available rows, so rows are bound to repeat
# We could also use np.random.choice() to draw non-repeating random positions, which keeps the missing cells from concentrating in the same rows and preserves some dispersion of the data
missing_features
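
A sketch of the np.random.choice() alternative mentioned in the comment above (assumption: we sample unique positions from the flattened 506 * 13 grid so that no cell is chosen twice; the *_alt names are illustrative):

# draw 3289 unique cell positions from the 506*13 grid, then convert back to (row, column)
cell_index = rng.choice(n_samples * n_features, n_missing_samples, replace=False)
missing_samples_alt = cell_index // n_features  # row indexes
missing_features_alt = cell_index % n_features  # column indexes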

3. Generate missing values

x_missing = x_full.copy() # Copy source dataset
x_missing[missing_samples,missing_features] = np.nan # Missing values are generated by random positioning of row and column indexes
x_missing = pd.DataFrame(x_missing)
x_missing

Missing value filling
① Mean filling
The SimpleImputer class in sklearn.impute is used for filling. missing_values=np.nan specifies which values count as missing (null values); strategy='mean' specifies the filling strategy, i.e. the column mean is used.

# ① fill with the mean
imp_mean = SimpleImputer(missing_values=np.nan,strategy='mean')
x_missing_mean = imp_mean.fit_transform(x_missing) # fit() + transform() ==> fit_transform()
x_missing_mean = pd.DataFrame(x_missing_mean) 
x_missing_mean

② Filling with 0
strategy='constant' means a constant is used for filling, and fill_value=0 specifies that the constant is 0.

imp_0 = SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=0)
x_missing_0 = imp_0.fit_transform(x_missing) # fit() + transform() ==> fit_transform()
x_missing_0 = pd.DataFrame(x_missing_0) 
x_missing_0

③ Filling with random forest regression
Any regression learns from a feature matrix in order to predict a continuous label y. This works because the regression algorithm assumes there is some relationship between the feature matrix and the label. In fact, labels and features can be interchanged. For example, in a problem of predicting "house price" from "area", "environment" and "number of nearby schools", we can use "area", "environment" and "number of nearby schools" to predict "house price", or, conversely, use "environment", "number of nearby schools" and "house price" to predict "area" (a bit like solving for whichever quantity is unknown in y = kx + b). Regression imputation of missing values uses exactly this idea.

For a data set with n features, where feature T has missing values, we treat feature T as the label, and the remaining n-1 features together with the original label form a new feature matrix. For T, the part that is not missing has both a label and features: this is our training set. The part that is missing has only features and no label: this is the part we need to predict.

The other n-1 features plus the original label, for the rows where feature T is not missing: xtrain
The non-missing values of feature T: ytrain

The other n-1 features plus the original label, for the rows where feature T is missing: xtest
The missing values of feature T: unknown, the ytest we need to predict

This method is well suited to the situation where one feature has a large number of missing values while the other features are complete!

What if features other than feature T also have missing values in the data?

The answer is to traverse all the features, starting from the one with the fewest missing values (filling the feature with the fewest missing values requires the least accurate information). While filling a given feature, the missing values of the other features are temporarily replaced with 0. Each time a regression prediction is completed, the predicted values are written back into the original feature matrix before the next feature is filled. After every round there is one less feature containing missing values, so fewer and fewer cells need to be filled with 0. By the time we reach the last feature (which should have the most missing values of all), no other features need to be filled with 0 any more, while we have already filled the other features with plenty of useful information through regression, which can then be used to fill the feature with the most missing values.
(1) Sort the feature indexes by number of missing values

x_missing_reg = x_missing.copy()
# Find out the order of the features in the data set with the missing values sorted from small to large
# np.argsort() -- returns the index corresponding to the order sorted from small to large
sort_columns_index = np.argsort(x_missing_reg.isnull().sum()).values
sort_columns_index

(2) Traverse the indexes and fill in the null values

for i in sort_columns_index:
    
    # Build a new feature matrix (the features not currently being filled + the original label) and a new label (the feature currently being filled)
    df = x_missing_reg
    fillc = df.iloc[:,i] # the feature column to be filled this round --- the new label
    df = pd.concat([df.iloc[:,df.columns != i],pd.DataFrame(y_full)],axis=1) # the remaining n-1 columns plus the complete original label
    
    # In the new feature matrix, temporarily fill the remaining missing values with 0
    df_0 = SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=0).fit_transform(df)
    
    # Split into training and test sets
    ytrain = fillc[fillc.notnull()] # non-null entries of the column being filled --- training label
    ytest  = fillc[fillc.isnull()]  # null entries of the column being filled --- the values to predict
    xtrain = df_0[ytrain.index,:] # rows of the new feature matrix where the column being filled is not null
    xtest  = df_0[ytest.index,:]  # rows of the new feature matrix where the column being filled is null
    
    # Use random forest regression to predict the missing values
    rfc = RandomForestRegressor(n_estimators=100).fit(xtrain,ytrain)
    y_predict = rfc.predict(xtest)
    
    # Write the predicted values back into the original feature matrix
    x_missing_reg.loc[x_missing_reg.iloc[:,i].isnull(),i] = y_predict

④ Evaluate the filling results
Next, we use cross-validation with the mean squared error to score the original data set, the mean-filled data set, the 0-filled data set, and the random-forest-regression-filled data set.

# Evaluate null padding
X = [x_full,x_missing_mean,x_missing_0,x_missing_reg]
mse = [] # The mean square error was used for evaluation

for x in X:
    estimator = RandomForestRegressor(n_estimators=100,random_state=0)
    scores = cross_val_score(estimator,x,y_full,scoring='neg_mean_squared_error',cv=5).mean()
    mse.append(scores * -1)
mse 

[21.571667100368845, 42.62658760318384, 42.62658760318384, 17.52358682764511]

The evaluation shows that filling the null values with the mean or with 0 gives a mean squared error above 40, while filling with random forest regression even fits better than the original data set, with a mean squared error as low as 17.5. Of course, overfitting cannot be ruled out.

# visualization
plt.figure(figsize=(12,8)) # canvas
colors = ['r','g','b','orange'] # colour
x_labels = ["x_full","x_missing_mean","x_missing_0","x_missing_reg"] # label

ax = plt.subplot(111) # Add a subplot
for i in range(len(mse)):
    ax.barh(i,mse[i],color=colors[i],alpha=0.6,align='center')
    
ax.set_title('Imputation Technique with Boston Data') # Set title
ax.set_xlim(left=np.min(mse)*0.9,right=np.max(mse)*1.1) # Sets the range of the x-axis
ax.set_yticks(range(len(mse)))
ax.set_xlabel("MSE") # Set x-axis label
ax.set_yticklabels(x_labels) # Set y-axis tick labels

plt.show()

Adjusting parameters

For a tree model, the more luxuriant the tree, the deeper its depth and the more branches and leaves it has, the more complex the model. A tree model therefore naturally sits in the upper-right corner of the complexity/generalization graph, and since random forest is built on tree models, it is also a naturally complex model. The parameters of a random forest all point toward one goal: reduce the complexity of the model and move it to the left of the graph, so as to prevent overfitting. Of course, there are no absolute rules for parameter tuning.
How does each parameter affect complexity and the model? In tuning we usually search for the optimal value of each parameter in turn along a learning curve, hoping to push the accuracy higher. Now that we know the tuning direction for random forests, namely reducing complexity, we can pick out the parameters that strongly affect complexity, study their monotonicity, and focus on tuning the ones that can reduce complexity the most. For parameters that are not monotonic, or that would increase complexity, we use them as the situation requires, and most of the time we can even leave them alone. Based on experience, the parameters are ranked below by their impact on the model; this order can be used as a reference when tuning.

| Parameter | Impact on the model's performance on unknown data | Degree of influence |
| --- | --- | --- |
| n_estimators | Performance improves until it levels off; n_estimators ↑ does not affect the complexity of a single model | ⭐⭐⭐⭐ |
| max_depth | Can raise or lower performance. The default (unlimited depth) gives the highest complexity; adjust max_depth ↓ to reduce complexity, making the model simpler and moving it to the left of the graph | ⭐⭐⭐ |
| min_samples_leaf | Can raise or lower performance. The default minimum of 1 gives the highest complexity; adjust min_samples_leaf ↑ to reduce complexity, making the model simpler and moving it to the left of the graph | ⭐⭐ |
| min_samples_split | Can raise or lower performance. The default minimum of 2 gives the highest complexity; adjust min_samples_split ↑ to reduce complexity, making the model simpler and moving it to the left of the graph | ⭐⭐ |
| max_features | Can raise or lower performance. The default "auto" is the square root of the total number of features, a medium complexity. It can be adjusted in either direction: max_features ↓ makes the model simpler (moves it to the left of the graph), max_features ↑ makes it more complex (moves it to the right). It is the only parameter that can make the model either simpler or more complex, so the tuning direction must be considered | |
| criterion | Can raise or lower performance; gini is generally used | It depends |
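
As a rough sketch of the learning-curve approach described above (assuming the x_train/y_train prepared in the earlier Titanic example, or any other prepared data), one can score a range of n_estimators with cross-validation and pick the value where the curve levels off:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

scores = []
n_range = range(10, 201, 10)
for n in n_range:
    rf = RandomForestClassifier(n_estimators=n, n_jobs=-1, random_state=0)
    scores.append(cross_val_score(rf, x_train, y_train, cv=5).mean())  # mean CV accuracy

plt.plot(list(n_range), scores)
plt.xlabel("n_estimators")
plt.ylabel("cross-validated accuracy")
plt.show()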
