Comparison between ordinary decision tree and random forest
Generate moons dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
# split once here so the classifiers below can be trained and evaluated on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
plt.show()
Drawing function
def plot_decision_boundary(model, X, y):
    x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x0, x1 = np.meshgrid(np.linspace(x0_min, x0_max, 100),
                         np.linspace(x1_min, x1_max, 100))
    Z = model.predict(np.c_[x0.ravel(), x1.ravel()])
    Z = Z.reshape(x0.shape)
    plt.contourf(x0, x1, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x1')
    plt.xlabel('x0')
    plt.scatter(X[:, 0], X[:, 1], c=np.squeeze(y))
    plt.show()
Prediction using decision tree
Build decision tree and train
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(max_depth=6)
dt_clf.fit(X, y)
plot_decision_boundary(dt_clf, X, y)
Drawing
Because a decision tree splits on one feature at a time, the decision boundary it produces is made of straight, axis-parallel segments.
Cross validation
from sklearn.model_selection import cross_val_score
# cv sets how many folds of cross validation to run
print(cross_val_score(dt_clf, X, y, cv=5).mean())

# Stratified k-fold keeps the class proportions of the original dataset in each fold
from sklearn.model_selection import StratifiedKFold
strKFold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(dt_clf, X, y, cv=strKFold).mean())

# Leave-one-out cross validation
from sklearn.model_selection import LeaveOneOut
loout = LeaveOneOut()
print(cross_val_score(dt_clf, X, y, cv=loout).mean())

# ShuffleSplit controls the number of iterations and the train/test proportion of each split
# (the proportions need not sum to 1, so some samples may fall in neither set)
from sklearn.model_selection import ShuffleSplit
shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8)  # 8 iterations
print(cross_val_score(dt_clf, X, y, cv=shufspl).mean())
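To make explicit what cross_val_score does for each fold, here is a minimal hand-rolled sketch of 5-fold scoring (an illustration only, not the library's internal implementation):

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    fold_clf = clone(dt_clf)                                  # fresh copy of the estimator
    fold_clf.fit(X[train_idx], y[train_idx])                  # train on the other folds
    scores.append(fold_clf.score(X[test_idx], y[test_idx]))   # score on the held-out fold
print(np.mean(scores))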
Voting with Voting Classifier
Constructing the voting classifier
The VotingClassifier with voting='hard' performs hard voting: each base classifier first makes its prediction, and the majority class label among those predictions is taken as the final result.
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
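To show what hard voting amounts to, the same result can be approximated by collecting the label predictions of the fitted base estimators and taking the majority per sample (a sketch, not the VotingClassifier internals):

# majority vote over the already-fitted base estimators of voting_clf
preds = np.array([est.predict(X_test) for est in voting_clf.estimators_])  # (n_estimators, n_samples)
majority = np.round(preds.mean(axis=0)).astype(int)   # works here because the labels are 0/1
print((majority == y_test).mean())                    # should be close to voting_clf.score(X_test, y_test)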
Drawing
plot_decision_boundary(voting_clf,X,y)
Soft Voting classifier
The VotingClassifier with voting='soft' performs soft voting: it averages the class probabilities produced by each base classifier and predicts the class with the highest average probability.
Constructing the voting classifier
voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='soft')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
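To see what soft voting computes, here is a minimal sketch that averages the predict_proba outputs of the fitted base estimators by hand (an illustration, not the library's internal code):

# average the predicted class probabilities of the fitted base estimators
probas = np.array([est.predict_proba(X_test) for est in voting_clf.estimators_])  # (n_estimators, n_samples, n_classes)
avg_proba = probas.mean(axis=0)
soft_pred = avg_proba.argmax(axis=1)
print((soft_pred == y_test).mean())   # should match voting_clf.score(X_test, y_test)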
Drawing
bagging
analysis
Bagging builds multiple models, trains each of them on a random subset of the data drawn with replacement, and combines the predictions of all the models to obtain the final result.
Multiple decision tree models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True)
bagging_clf.fit(X_train, y_train)
bagging_clf.score(X_test, y_test)
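For intuition, the mechanism can be sketched by hand: draw a bootstrap sample for each tree, fit it, and majority-vote the predictions. This is only a simplified illustration of what BaggingClassifier automates:

from sklearn.base import clone

rng = np.random.RandomState(42)
trees = []
for _ in range(50):                                             # a small hand-rolled ensemble
    idx = rng.choice(len(X_train), size=100, replace=True)      # bootstrap sample of 100 points
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# majority vote across the hand-rolled trees (labels are 0/1)
votes = np.array([t.predict(X_test) for t in trees]).mean(axis=0)
manual_pred = (votes >= 0.5).astype(int)
print((manual_pred == y_test).mean())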
Drawing
Personally, I feel the prediction result here is similar to that of a single decision tree; the boundary is still made of straight segments.
Out of Bag (oob)
This is the same bagging approach; the point is that when each model draws its training samples with replacement, some samples are never selected for that model. These left-out (out-of-bag) samples can then be used to evaluate the ensemble, so no separate test set is needed.
code implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(),  # base classifier
                                n_estimators=500,          # number of classifiers
                                max_samples=100,           # training samples drawn per model
                                bootstrap=True,            # sample with replacement
                                oob_score=True)            # evaluate on the out-of-bag samples
bagging_clf.fit(X, y)
bagging_clf.oob_score_
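A quick numeric check of why out-of-bag samples exist at all (a small illustration, not part of the original analysis): a draw of max_samples=100 points with replacement from 500 touches only a small fraction of the data, and even a full-size bootstrap covers only about 63% of it on average.

# fraction of the dataset that one bootstrap draw of 100 points actually touches
rng = np.random.RandomState(0)
idx = rng.choice(len(X), size=100, replace=True)
print(np.unique(idx).size / len(X))        # roughly 0.18; the rest are out of bag for this tree
# for a full-size bootstrap (size == n) the covered fraction approaches 1 - 1/e ≈ 0.632
print(1 - (1 - 1 / len(X)) ** len(X))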
Drawing
The prediction results here are similar as well; perhaps other datasets would show the difference more clearly.
Random forest
In effect, a random forest is the same as an ensemble of many bagged decision trees; in addition, each split only considers a random subset of the features.
code implementation
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500, random_state=666, oob_score=True)
rf_clf.fit(X, y)
rf_clf.oob_score_
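A convenient by-product of a fitted random forest is its impurity-based feature importances; printing them here is just an extra illustration, not part of the original post:

# importance of the two input features x0 and x1
print(rf_clf.feature_importances_)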
Drawing
Extreme trees (Extra-Trees)
Both extremely randomized trees (Extra-Trees) and random forests are ensembles of many decision trees. The main difference: a random forest uses bagging (each tree is trained on a bootstrap sample) and searches for the best split threshold, whereas Extra-Trees by default trains every tree on the full sample set and makes the splits themselves random by drawing thresholds at random (bootstrap=True is set in the code below only so that an oob score can be computed). Because the splits are random, the variance is reduced further, which can sometimes give better results than a random forest.
code implementation
from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(n_estimators=500, random_state=666, bootstrap=True, oob_score=True)
et_clf.fit(X, y)
et_clf.oob_score_
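To illustrate the splitting difference described above, the toy sketch below contrasts searching for the best threshold of one feature (random-forest style) with drawing a single random threshold (Extra-Trees style). It is only a conceptual illustration, not the library's split code:

def gini(labels):
    # Gini impurity of a label array
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    # weighted impurity of the two children produced by the threshold
    left, right = labels[feature <= threshold], labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

rng = np.random.RandomState(0)
f = X[:, 0]

# random-forest style: try every candidate threshold and keep the best one
best_threshold = min(np.unique(f), key=lambda t: split_impurity(f, y, t))
# Extra-Trees style: draw one threshold uniformly at random between min and max
random_threshold = rng.uniform(f.min(), f.max())

print(best_threshold, split_impurity(f, y, best_threshold))
print(random_threshold, split_impurity(f, y, random_threshold))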
Drawing
AdaBoost
AdaBoost trains the models one after another: each sample carries a weight, the weights of the samples that the previous model misclassified are increased, and the next model is trained on this re-weighted data; the predictions of all the models are then combined with weights to give the final result.
code implementation
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=6), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)
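Because boosting is sequential, it can be instructive to watch the test accuracy as estimators are added; staged_score yields the score after each boosting round (shown here as an extra illustration):

# test accuracy after 1, 2, ..., n_estimators boosting rounds
staged = list(ada_clf.staged_score(X_test, y_test))
plt.plot(range(1, len(staged) + 1), staged)
plt.xlabel('number of estimators')
plt.ylabel('test accuracy')
plt.show()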
Drawing
Gradient Boosting
Gradient boosting also builds models sequentially, but each new model is trained on the errors (residuals) that the previous models still make, and the predictions of all the models are added together.
code implementation
from sklearn.ensemble import GradientBoostingClassifier

gd_clf = GradientBoostingClassifier(max_depth=6, n_estimators=500)
gd_clf.fit(X_train, y_train)
gd_clf.score(X_test, y_test)
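The "fit the residuals" idea is easiest to see in a regression setting. The sketch below hand-rolls three rounds of gradient boosting with squared loss on toy data, where each tree is trained on the residuals of the current prediction (a conceptual illustration, not the classifier above):

from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression data
rng = np.random.RandomState(0)
X_toy = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

prediction = np.zeros_like(y_toy)
for _ in range(3):
    residual = y_toy - prediction                            # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    prediction += tree.predict(X_toy)                        # add the new tree's correction

print(np.mean((y_toy - prediction) ** 2))                    # squared error shrinks after each round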