sklearn practice 02: random forest

1 RandomForestClassifier

1.1 parameters that control the base estimators

1.2 n_estimators

The larger n_estimators is, the better the model tends to perform. But every model has a decision boundary: once n_estimators reaches a certain level, the accuracy of the random forest stops rising or begins to fluctuate, while a larger n_estimators also means more computation, more memory, and a longer training time. For this parameter, we want to strike a balance between training cost and model performance.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from matplotlib import pyplot as plt


wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(wine.data,wine.target, test_size=0.3)

rfc = RandomForestClassifier(n_estimators=25)
rfc = rfc.fit(x_train, y_train)
score = rfc.score(x_test, y_test)
print(score)
#0.9814814814814815

Compare the performance of the random forest and a single decision tree under one round of cross validation

rfc = RandomForestClassifier(n_estimators=25)
rfc_s = cross_val_score(rfc,wine.data,wine.target,cv=10)
clf = DecisionTreeClassifier()
clf_s = cross_val_score(clf,wine.data,wine.target,cv=10)
plt.plot(range(1,11),rfc_s,label = "RandomForest")
plt.plot(range(1,11),clf_s,label = "Decision Tree")
plt.legend()
plt.show()

Learning curve of n_estimators

superpa = []
for i in range(200):
    rfc = RandomForestClassifier(n_estimators=i+1, n_jobs=-1)
    rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10).mean()
    superpa.append(rfc_s)
print(max(superpa), superpa.index(max(superpa)))
#0.9888888888888889 23
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201), superpa)
plt.show()

1.3 random_state

Random forest also has random_state, and its usage is similar to that in the classification tree. The difference is that in the classification tree, random_state controls the generation of a single tree, while in a random forest, random_state controls how the whole forest is generated, rather than producing a forest that contains only one tree.

When random_state is fixed, the random forest generates a fixed set of trees, but each tree is still different from the others, because each tree also branches on randomly selected features. It can be shown that the greater the randomness, the better bagging generally performs: when bagging is used for an ensemble, the base classifiers should be independent of and different from each other.
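
To make this concrete, here is a minimal sketch (it reuses x_train and y_train from the split above): fix random_state on the forest and inspect the estimators_ attribute, which stores the fitted trees; each tree has been given its own, different random_state.

rfc = RandomForestClassifier(n_estimators=20, random_state=2)
rfc = rfc.fit(x_train, y_train)
#Important attribute estimators_: the list of trees that make up the forest
#Every tree carries a different random_state of its own
for tree in rfc.estimators_:
    print(tree.random_state)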

1.4 bootstrap & oob_score

#There is no need to split the data into training and test sets or to run cross validation: we can fit on the whole dataset and read the score on the out-of-bag samples directly
rfc = RandomForestClassifier(n_estimators=25, oob_score=True)
rfc = rfc.fit(wine.data, wine.target)
#Important attribute oob_score_
print(rfc.oob_score_)
#0.9662921348314607

1.5 important attributes and interfaces

In addition to the two important attributes .estimators_ and .oob_score_, the random forest naturally also has the .feature_importances_ attribute. Its interface is exactly the same as that of the decision tree, so there are still the four common interfaces: apply, fit, predict and score. Also note the predict_proba interface of the random forest, which returns, for each test sample, the probability of being assigned to each class label; if there are k classes, k probabilities are returned. In a binary problem, a sample whose predict_proba value is greater than 0.5 is assigned to class 1, and one whose value is less than 0.5 is assigned to class 0.

rfc = RandomForestClassifier(n_estimators=25)
rfc = rfc.fit(x_train, y_train)
score = rfc.score(x_test, y_test)
print(score)
#0.9814814814814815

print(rfc.feature_importances_)  #Importance coefficient of each feature
print(rfc.apply(x_test))  #Leaf node index that each test sample falls into in every tree
print(rfc.predict(x_test))  #Predicted label for each test sample
print(rfc.predict_proba(x_test))  #Predicted probability of each class for each test sample

Bonus: another necessary condition of bagging

As we said before, when using bagging, the base estimators should be as independent of each other as possible. Bagging also has another necessary condition: each base classifier must be more accurate than a random classifier, that is, its accuracy must exceed 50%.
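
The snippet below sketches why: assume 25 mutually independent base classifiers, each with error rate ε, combined by majority vote. The forest is then wrong only when 13 or more of the 25 trees are wrong, so its error rate is

\varepsilon_{\text{forest}} = \sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i}

Plotting this against ε shows that the forest improves on a single tree only while ε < 0.5.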

import numpy as np
from scipy.special import comb  #comb(n, k) is the binomial coefficient "n choose k"
x = np.linspace(0,1,20)
y = []
for epsilon in np.linspace(0,1,20):
    E = np.array([comb(25,i)*(epsilon**i)*((1-epsilon)**(25-i)) 
                  for i in range(13,26)]).sum()
    y.append(E)
plt.plot(x,y,"o-",label="when estimators are different")
plt.plot(x,x,"--",color="red",label="if all estimators are same")
plt.xlabel("individual estimator's error")
plt.ylabel("RandomForest's error")
plt.legend()
plt.show()

2 RandomForestRegressor

2.1 important parameters, attributes and interfaces

2.1.1 criterion

Consistent with the regression decision tree: the criterion measures the quality of a split with a regression error such as the mean squared error.
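
A minimal usage sketch (the diabetes dataset is used here purely as an example; it is not the dataset from the sections above):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

data = load_diabetes()
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
#scoring="neg_mean_squared_error" returns the negative MSE of each fold,
#so values closer to 0 are better
print(cross_val_score(regressor, data.data, data.target, cv=10,
                      scoring="neg_mean_squared_error"))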
