Principles and common parameters of decision trees and random forests in machine learning

Summary: Like a decision tree, a random forest can be used for both classification and regression, but the results of a random forest model are often better than those of a single decision tree. This article explains the principles and common parameters of these two machine learning algorithms.

1. Principles

1.1 Decision tree

1.1.1 Definition of a decision tree

A decision tree is a non-parametric supervised learning method. In essence, it summarizes a set of decision rules from the training data to solve classification and regression problems. The rules are presented as a tree diagram composed of a root node, internal nodes, and leaf nodes (which hold the labels).

1.1.2 Two core problems the decision tree algorithm must solve:

1) How to find the best node and the best split from the data?
Impurity is the basis for feature selection in a decision tree; it is usually measured with the Gini index (Gini coefficient) or information entropy (information gain).
Keep in mind: for a two-class problem the Gini index takes values in [0, 0.5] and information entropy takes values in [0, 1], so entropy is more sensitive to impurity. When entropy is used as the criterion, computation is slower and the tree grows more "finely". With high-dimensional or noisy data, entropy overfits more easily, so the Gini index should be chosen; conversely, if the model underfits and scores are low on both the training and test sets, entropy is the better choice. This is not absolute, however; in machine learning, parameter choices should always be validated against the specific data.
2) How to stop the growth of the tree and prevent overfitting?
Limit the maximum depth of the tree;
Limit the minimum number of samples each child node must contain after a split;
Limit the minimum number of samples a node must contain before it may be split;
Limit the number of features considered when splitting: this approach is simple, rather crude, and not rigorous, and may cause the tree to underfit, so dimensionality-reduction methods such as PCA are generally preferred. A sketch of these ideas in scikit-learn follows.
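
As a rough illustration (not part of the original article), the sketch below uses scikit-learn's DecisionTreeClassifier on a built-in toy dataset to compare the two impurity criteria and apply the growth-limiting parameters just listed; the dataset and parameter values are assumptions chosen only for demonstration.

```python
# A minimal sketch: comparing "gini" vs "entropy" and limiting tree growth
# with scikit-learn (dataset and parameter values are illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(
        criterion=criterion,      # impurity measure used to pick splits
        max_depth=3,              # limit the maximum depth of the tree
        min_samples_split=10,     # samples a node needs before it may split
        min_samples_leaf=5,       # samples each child must keep after a split
        random_state=0,           # fix the randomness so results are repeatable
    )
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"{criterion}: mean CV accuracy = {score:.3f}")
```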

1.2 Random forest

1.2.1 What a random forest is

Random forest is a highly representative Bagging ensemble algorithm in which every base estimator is a decision tree. A forest composed of classification trees is a random forest classifier, and a forest composed of regression trees is a random forest regressor. The core idea of the bagging method is to build multiple independent estimators and then combine their predictions, by averaging or by majority vote, to obtain the ensemble's result. Random forest is the representative model of the bagging method.
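
As a hedged sketch of this idea, the snippet below compares a single decision tree with a random forest using cross validation; the wine dataset and the settings are illustrative assumptions, and exact scores will vary, but the forest typically scores higher.

```python
# Illustrative comparison: a bagged ensemble of trees vs. one tree.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=10).mean())
print("random forest:", cross_val_score(forest, X, y, cv=10).mean())
```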

1.2.2 Is a random forest better than a single classifier?

1. The premise of random forest (the bagging method) is that the base classifiers are independent of each other. How is this achieved?
1) A random forest has its own source of randomness, controlled by random_state, so the trees it generates naturally differ from one another. The usage is the same as for a decision tree: when random_state is fixed, it fixes the entire forest that is generated, not just a single tree.
2) The bagging method draws random samples with replacement (bootstrap sampling) to form different training sets, and the base classifiers trained on different training sets naturally differ.
2. Is the result of a random forest better than that of a single base classifier?

In the referenced figure, the horizontal axis is the error rate of a single classifier and the vertical axis is the error rate of the random forest. The straight line corresponds to the case where every tree in the forest is identical, so the forest's error rate equals the single tree's; the curve corresponds to the case where the trees differ, so their error rates differ.
The figure shows that when the error rate of a single classifier is below 0.5 (i.e., its accuracy is above 0.5), the ensemble performs better than the base classifier; conversely, when the base classifier's error rate exceeds 0.5, the bagging ensemble performs worse than a single tree. Therefore, before using a random forest, make sure that the individual classification trees it is built from achieve at least 50% prediction accuracy.
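
This observation can be checked numerically. Assuming the trees err independently and each has the same error rate ε, a majority-vote forest of n trees misclassifies a sample only when more than half of its trees err at once, so its error rate is a binomial tail probability. The snippet below (25 trees, purely an illustrative assumption) computes it.

```python
# Illustrative calculation: the forest errs only when a majority of its
# (assumed independent, equally accurate) trees err simultaneously.
from scipy.special import comb

def forest_error(eps, n_trees=25):
    """Error rate of a majority-vote ensemble of n_trees independent trees."""
    majority = n_trees // 2 + 1
    return sum(
        comb(n_trees, i) * eps**i * (1 - eps)**(n_trees - i)
        for i in range(majority, n_trees + 1)
    )

for eps in (0.2, 0.4, 0.5, 0.6):
    print(f"single-tree error {eps:.1f} -> forest error {forest_error(eps):.3f}")
```

For single-tree error rates below 0.5 the forest's error drops sharply, while above 0.5 it grows worse, which matches the behaviour described for the figure.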

2. Common parameters

2.1 Decision tree

2.1.1 Common parameters for classification

criterion: the impurity measure. The default is "gini" (Gini index); the alternative is "entropy".
random_state: sets the random mode used when branching. The default is None; entering any fixed value makes the decision tree stable so that it always grows the same tree.
splitter: controls the randomness of the branching. With "best", although the tree still branches with some randomness, it prefers the more important features when splitting; with "random", the branching is more random.
max_depth: limits the maximum depth of the tree; branches beyond the set depth are cut off. Starting from max_depth=3 is a common first try.
min_samples_split: the minimum number of samples an internal node must have before it may be split. The default is 2; nodes with fewer samples will not be split.
min_samples_leaf: the minimum number of samples each child node must have after a split. The default is 1; an integer or a float can be given, and splits that would leave a child with fewer samples will not occur.
max_features: the number of features considered when searching for the best split.
min_impurity_decrease: limits the minimum decrease in impurity. The default is 0; a split is kept only if it decreases the impurity by at least this value, otherwise the tree does not continue to grow along that branch.
class_weight: a parameter for balancing the samples; a dictionary, "balanced", or None can be given.
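
To make the list concrete, here is a hedged sketch that combines several of these parameters on an imbalanced toy dataset generated with make_classification; all values are arbitrary assumptions chosen only for illustration.

```python
# Illustrative only: a classification tree on an imbalanced toy dataset,
# combining several of the parameters listed above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(
    criterion="entropy",          # use information entropy instead of gini
    splitter="best",              # prefer the more important features
    max_depth=5,                  # cap the depth of the tree
    max_features=10,              # features considered at each split
    min_impurity_decrease=0.001,  # require at least this impurity decrease
    class_weight="balanced",      # compensate for the 9:1 class imbalance
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```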

2.1.2 Common parameters for regression

Note: the common parameters of the regression tree are the same as those of the classification tree, with the following differences:
1. criterion: the measure of split quality, one of "mse", "friedman_mse", or "mae". The most commonly used is "mse" (mean squared error), which is essentially the squared difference between the true sample values and the regression output; the closer the MSE is to 0, the better.
2. class_weight is not needed, because the question of whether the label distribution is balanced does not arise for a regression tree.
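
A minimal regression-tree sketch follows; the diabetes dataset and the settings are assumptions for illustration. Note that newer scikit-learn versions rename the "mse" criterion to "squared_error".

```python
# Minimal sketch of a regression tree (illustrative settings only).
# Newer scikit-learn versions name the MSE criterion "squared_error".
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

reg = DecisionTreeRegressor(
    criterion="squared_error",  # mean squared error as the split-quality measure
    max_depth=4,
    random_state=0,
)
# Negative MSE is used because scikit-learn scorers follow "higher is better".
scores = cross_val_score(reg, X, y, cv=5, scoring="neg_mean_squared_error")
print("mean MSE:", -scores.mean())
```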

2.2 Random forest

2.2.1 Common parameters for classification

n_estimators: the number of trees in the forest. It does not affect the complexity of the individual trees; in general, larger is better, but beyond a certain point the score only fluctuates around a fixed value while the computation cost keeps growing. To balance training time against model quality, a common approach is cross validation plus a loop: first scan a wide range with a coarse step to locate the rough optimum, then lock in the best value with a finer step (see the sketch after this list).
criterion: the impurity measure. The default is "gini" (Gini index); the alternative is "entropy".
random_state: the source of the forest's own randomness. When this parameter is fixed, it fixes the entire forest that is generated, not just a single tree.
bootstrap: defaults to True, meaning random sampling with replacement is used to build each tree's training set.
oob_score: with bagging, roughly 1 − 0.632 ≈ 37% of the samples never appear in a given tree's bootstrap sample. These "out-of-bag" samples can serve as a test set; set oob_score=True and read the fitted attribute oob_score_ to see the score on the out-of-bag data.
max_depth: limits the maximum depth of the tree; branches beyond the set depth are cut off. Starting from max_depth=3 is a common first try.
min_samples_split: the minimum number of samples an internal node must have before it may be split. The default is 2; nodes with fewer samples will not be split.
min_samples_leaf: the minimum number of samples each child node must have after a split. The default is 1; an integer or a float can be given, and splits that would leave a child with fewer samples will not occur.
max_features: the number of features considered when searching for the best split.
min_impurity_decrease: limits the minimum decrease in impurity. The default is 0; a split is kept only if it decreases the impurity by at least this value, otherwise the tree does not continue to grow along that branch.
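
The sketch below illustrates the coarse-then-fine search over n_estimators described above, together with out-of-bag scoring; the dataset, step sizes, and ranges are assumptions chosen only for demonstration, and the search can be slow on larger data.

```python
# Illustrative sketch: coarse-then-fine search over n_estimators with
# cross validation, plus out-of-bag scoring (dataset and ranges are arbitrary).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 1) Coarse pass: scan a wide range with a large step.
coarse = {
    n: cross_val_score(
        RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5
    ).mean()
    for n in range(10, 201, 50)
}
best_coarse = max(coarse, key=coarse.get)
print("coarse best:", best_coarse, coarse[best_coarse])

# 2) Fine pass: search around the coarse optimum with a smaller step.
fine = {
    n: cross_val_score(
        RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5
    ).mean()
    for n in range(max(1, best_coarse - 20), best_coarse + 21, 5)
}
best_fine = max(fine, key=fine.get)
print("fine best:", best_fine, fine[best_fine])

# 3) Out-of-bag evaluation: no separate test set is needed.
rf = RandomForestClassifier(
    n_estimators=best_fine, bootstrap=True, oob_score=True, random_state=0
).fit(X, y)
print("out-of-bag score:", rf.oob_score_)
```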

2.2.2 Common parameters for regression

Note: the common parameters are all consistent with those of the random forest classifier. The only difference, as between the regression tree and the classification tree, lies in the impurity measure, i.e. the criterion parameter.
criterion: the measure of split quality, one of "mse", "friedman_mse", or "mae". The most commonly used is "mse" (mean squared error), which is essentially the squared difference between the true sample values and the regression output; the closer the MSE is to 0, the better.
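
A minimal random forest regressor sketch, again with illustrative settings (the diabetes dataset is an assumption, not from the original article):

```python
# Minimal sketch of a random forest regressor (settings are illustrative).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
# Negative MSE scoring, consistent with the "closer to 0 is better" rule above.
scores = cross_val_score(rf_reg, X, y, cv=5, scoring="neg_mean_squared_error")
print("mean MSE:", -scores.mean())
```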

Note: I have been learning machine learning algorithms for some time, and I like to learn by watching an explanation and then practicing (coding it once). I have watched video explanations from different teachers and found [vegetable's sklearn] especially clear, for your reference; you can find it on station b (Bilibili). A good teacher matters a lot along the way, so I am sharing this in the hope that fellow learners can avoid detours and improve their ROI! 😊

Tags: Algorithm Machine Learning Decision Tree

Posted on Sat, 02 Oct 2021 17:21:30 -0400 by garblar