1.1 parameters that control the base estimator
The larger n_estimators is, the better the model tends to perform. But any model has a performance ceiling: once n_estimators reaches a certain size, the accuracy of the random forest usually stops rising or starts to fluctuate. A larger n_estimators also requires more computation and memory, and training takes longer. For this parameter, we want to strike a balance between training cost and model performance.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from matplotlib import pyplot as plt

wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3)

rfc = RandomForestClassifier(n_estimators=25)
rfc = rfc.fit(x_train, y_train)
score = rfc.score(x_test, y_test)
print(score)  # 0.9814814814814815
Plot a comparison of the random forest and the decision tree over one round of 10-fold cross-validation
rfc = RandomForestClassifier(n_estimators=25)
rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10)

clf = DecisionTreeClassifier()
clf_s = cross_val_score(clf, wine.data, wine.target, cv=10)

plt.plot(range(1, 11), rfc_s, label="RandomForest")
plt.plot(range(1, 11), clf_s, label="Decision Tree")
plt.legend()
plt.show()
Learning curve of n_estimators
superpa = []  # mean cross-validation score for each value of n_estimators
for i in range(200):
    rfc = RandomForestClassifier(n_estimators=i+1, n_jobs=-1)
    rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10).mean()
    superpa.append(rfc_s)
# The second printed value is a list index; the matching n_estimators is index + 1
print(max(superpa), superpa.index(max(superpa)))  # 0.9888888888888889 23
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201), superpa)
plt.show()
A random forest also has a random_state parameter. Its usage is similar to that in the classification tree, except that in a classification tree random_state controls the generation of a single tree, whereas in a random forest random_state controls how the whole forest is generated. It does not make every tree in the forest the same.
When random_state is fixed, the random forest generates a fixed set of trees, but the trees still differ from one another, because each one branches on randomly selected features. It can be shown that, in general, the greater this randomness, the better bagging performs. When bagging is used to build an ensemble, the base classifiers should be independent of and different from each other.
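The behaviour described above can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the seed value 2, the forest size of 20 and the variable names are arbitrary choices for illustration. It fits two forests with the same random_state and reads back the random_state that each fitted tree in estimators_ was given.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()

# Two forests built with the same (arbitrarily chosen) seed
rfc_a = RandomForestClassifier(n_estimators=20, random_state=2).fit(wine.data, wine.target)
rfc_b = RandomForestClassifier(n_estimators=20, random_state=2).fit(wine.data, wine.target)

# Each fitted tree carries its own random_state, derived from the forest's seed
states_a = [tree.random_state for tree in rfc_a.estimators_]
states_b = [tree.random_state for tree in rfc_b.estimators_]

print(states_a == states_b)   # whether the two forests got the same per-tree seeds
print(len(set(states_a)))     # number of distinct per-tree seeds within one forest
```

If the two lists match while the seeds inside one list differ, the forest as a whole is reproducible even though its individual trees are not copies of each other.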
1.4 bootstrap & oob_score
# There is no need to split the data into a training set and a test set, or to run cross-validation.
# The out-of-bag data can be used directly as the test data.
rfc = RandomForestClassifier(n_estimators=25, oob_score=True)
rfc = rfc.fit(wine.data, wine.target)
# Important attribute: oob_score_
print(rfc.oob_score_)  # 0.9662921348314607
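To see why out-of-bag data exists at all, here is a minimal sketch using only NumPy. It assumes each tree draws a bootstrap sample of n rows with replacement, with n = 178 (the number of rows in the wine data); the seed and the number of repeats are arbitrary. It measures the share of rows that a single draw never picks and compares it with (1 - 1/n)^n, which is roughly 37% for large n.

```python
import numpy as np

n = 178          # number of rows in the wine data
repeats = 1000   # arbitrary number of simulated bootstrap draws
rng = np.random.default_rng(0)

oob_fractions = []
for _ in range(repeats):
    drawn = rng.integers(0, n, size=n)      # sample n row indices with replacement
    n_unseen = n - np.unique(drawn).size    # rows never drawn are out of bag
    oob_fractions.append(n_unseen / n)

simulated = float(np.mean(oob_fractions))   # average out-of-bag share over all draws
expected = (1 - 1 / n) ** n                 # chance that a given row is never drawn
print(simulated, expected)
```

Because every tree leaves some rows unseen, those rows can score that tree without a separate test split, which is what oob_score=True relies on.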
1.5 important attributes and interfaces
In addition to the two important attributes .estimators_ and .oob_score_, the random forest also has the attribute .feature_importances_. The interface of the random forest is completely consistent with that of the decision tree, so there are still four common interfaces: apply, fit, predict and score. Also pay attention to the predict_proba interface of the random forest, which returns, for each test sample, the probability of being assigned to each class label. If there are several label classes, that many probabilities are returned per sample. For a binary classification problem, a sample is assigned to class 1 if its predict_proba value for class 1 is greater than 0.5, and to class 0 if it is less than 0.5.
rfc = RandomForestClassifier(n_estimators=25)
rfc = rfc.fit(x_train, y_train)
score = rfc.score(x_test, y_test)
print(score)  # 0.9814814814814815
print(rfc.feature_importances_)  # importance of each feature
print(rfc.apply(x_test))  # index of the leaf each test sample falls into, for every tree
rfc.predict(x_test)
rfc.predict_proba(x_test)
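The relationship between predict_proba and predict can be illustrated with a short, self-contained sketch. It assumes scikit-learn and NumPy are installed; the random_state values are arbitrary and are set only so the run is repeatable. The wine data has three classes, so each sample gets three probabilities.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)
rfc = RandomForestClassifier(n_estimators=25, random_state=0).fit(x_train, y_train)

proba = rfc.predict_proba(x_test)   # one row per sample, one column per class
pred = rfc.predict(x_test)

print(proba.shape)                  # (number of test samples, number of classes)
print(proba.sum(axis=1)[:5])        # the probabilities in each row sum to 1
# Compare predict with the class whose column holds the highest probability
print(np.array_equal(pred, rfc.classes_[proba.argmax(axis=1)]))
```

With more than two classes, the 0.5 threshold no longer applies as such; the predicted label is simply the class with the largest probability.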
Bonus: another necessary condition of Bagging
As mentioned earlier, when using bagging, the base estimators should be as independent of each other as possible. Bagging also has another necessary condition: each base classifier must be more accurate than random guessing, that is, its accuracy must exceed 50%.
import numpy as np
from scipy.special import comb

x = np.linspace(0, 1, 20)
y = []
for epsilon in np.linspace(0, 1, 20):
    # Probability that at least 13 of 25 independent estimators are wrong
    E = np.array([comb(25, i) * (epsilon**i) * ((1-epsilon)**(25-i)) for i in range(13, 26)]).sum()
    y.append(E)
plt.plot(x, y, "o-", label="when estimators are different")
plt.plot(x, x, "--", color="red", label="if all estimators are same")
plt.xlabel("individual estimator's error")
plt.ylabel("RandomForest's error")
plt.legend()
plt.show()
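The sum in the plotting code is the probability that a majority of the estimators are wrong at the same time. The sketch below evaluates it at three error rates using only the standard library. It rests on simplifying assumptions: binary classification, 25 estimators whose errors are independent, and the same error rate epsilon for every estimator. The function name ensemble_error is hypothetical and chosen here for illustration.

```python
from math import comb

def ensemble_error(epsilon, n_estimators=25):
    """Probability that a majority of independent estimators are wrong."""
    majority = n_estimators // 2 + 1   # 13 when n_estimators is 25
    return sum(
        comb(n_estimators, i) * epsilon**i * (1 - epsilon) ** (n_estimators - i)
        for i in range(majority, n_estimators + 1)
    )

# One error rate below 0.5, one exactly at 0.5, one above 0.5
for eps in (0.2, 0.5, 0.8):
    print(eps, ensemble_error(eps))
```

Under these assumptions the ensemble beats a single estimator only when epsilon is below 0.5. At exactly 0.5 the vote is no better than a single estimator, and above 0.5 majority voting makes the result worse, which is why the base classifier must be better than random guessing.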
2.1 important parameters, attributes and interfaces
Consistent with decision tree