Easy-to-understand machine learning: ensemble learning

Comparison between ordinary decision tree and random forest

Generate the moons dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
plt.show()

Drawing function

def plot_decision_boundary(model, X, y):
    # Build a 100x100 grid covering the data range (with a margin of 1)
    x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x0, x1 = np.meshgrid(np.linspace(x0_min, x0_max, 100), np.linspace(x1_min, x1_max, 100))
    # Predict the class of every grid point and color the regions
    Z = model.predict(np.c_[x0.ravel(), x1.ravel()])
    Z = Z.reshape(x0.shape)

    plt.contourf(x0, x1, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x1')
    plt.xlabel('x0')
    plt.scatter(X[:, 0], X[:, 1], c=np.squeeze(y))
    plt.show()

Prediction using decision tree

Build decision tree and train

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(max_depth=6)
dt_clf.fit(X, y)
plot_decision_boundary(dt_clf, X, y)

Drawing

Because a decision tree splits on one feature at a time, its decision boundary is made of straight, axis-parallel segments, as the plot shows.

Cross validation

from sklearn.model_selection import cross_val_score
print(cross_val_score(dt_clf, X, y, cv=5).mean())  # cv sets the number of folds

# Stratified k-fold keeps the class proportions of the original dataset in every fold
from sklearn.model_selection import StratifiedKFold

strKFold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(dt_clf, X, y, cv=strKFold).mean())

# Leave-one-out cross validation: each sample serves as the test set exactly once
from sklearn.model_selection import LeaveOneOut

loout = LeaveOneOut()
print(cross_val_score(dt_clf, X, y, cv=loout).mean())

# ShuffleSplit controls the number of random splits and the train/test proportion of
# each split; the proportions need not sum to 1, so some samples may land in neither set
from sklearn.model_selection import ShuffleSplit

shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8)  # 8 random splits
print(cross_val_score(dt_clf, X, y, cv=shufspl).mean())

Voting with Voting Classifier

Constructing voting classifier

A VotingClassifier with voting='hard' performs hard voting: each base classifier predicts a class, and the majority vote decides the final prediction.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
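To make the mechanism concrete, the majority vote can be reproduced by hand. This is a minimal sketch (not from the original post) that assumes the three fitted base models and the 0/1 labels above:

import numpy as np

# Each fitted base model casts one vote per sample
preds = np.array([est.predict(X_test) for est in voting_clf.estimators_])
# With three 0/1 voters, the majority is simply whether the mean reaches 0.5
manual_vote = (preds.mean(axis=0) >= 0.5).astype(int)
print((manual_vote == voting_clf.predict(X_test)).all())  # expected: True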

Drawing

plot_decision_boundary(voting_clf, X, y)

Soft Voting classifier

With voting='soft', the ensemble performs soft voting: it averages the class probabilities predicted by each classifier and picks the class with the highest average probability, so every base classifier must support predict_proba.

Constructing voting classifier

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='soft')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
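Again as a minimal sketch (not from the original post), soft voting can be reproduced by averaging each fitted model's predict_proba output:

import numpy as np

# Average the class-probability estimates of the base models,
# then pick the class with the highest average probability
probas = np.array([est.predict_proba(X_test) for est in voting_clf.estimators_])
manual_soft = probas.mean(axis=0).argmax(axis=1)
print((manual_soft == voting_clf.predict(X_test)).all())  # expected: True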

Drawing
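As with the hard-voting model, the boundary can be drawn with the helper defined earlier:

plot_decision_boundary(voting_clf, X, y)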

bagging

analysis

Bagging trains many models, each on a random sample drawn from the training data, and combines their individual predictions (for example by majority vote) into the final prediction.
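Before using the library class, here is a minimal hand-rolled sketch of the idea (illustrative only, not the original author's code): train several trees on bootstrap samples and combine them by majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X_train), size=100)  # draw a bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

votes = np.array([t.predict(X_test) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote over 0/1 labels
print((y_pred == y_test).mean())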

Multiple decision tree models

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True)

bagging_clf.fit(X_train, y_train)
bagging_clf.score(X_test, y_test)

Drawing
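Using the same plotting helper:

plot_decision_boundary(bagging_clf, X, y)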

Visually, the boundary here looks much like a single decision tree's: still built from straight, axis-parallel segments.

Out-of-bag (oob)

When sampling with replacement, each model sees only part of the data: on average about 37% of the samples are never drawn for a given model. These out-of-bag samples act as a ready-made validation set, so the ensemble can be evaluated without a separate train/test split.
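The ~37% figure is easy to verify numerically; a quick illustrative check (not from the original post):

import numpy as np

# A bootstrap sample of size n leaves about (1 - 1/n)^n ≈ e^-1 ≈ 36.8% of rows unused
n = 10_000
sample = np.random.default_rng(0).integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(sample)) / n
print(oob_fraction)  # roughly 0.368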

code implementation

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(),  # base classifier
                                n_estimators=500,  # number of classifiers
                                max_samples=100,   # training samples per model
                                bootstrap=True,    # sample with replacement
                                oob_score=True)    # score on out-of-bag samples

bagging_clf.fit(X, y)
bagging_clf.oob_score_

Drawing
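The boundary can again be drawn with:

plot_decision_boundary(bagging_clf, X, y)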

The results here are similar as well; a different dataset might show the difference more clearly.

Random forest

A random forest behaves like a bagging ensemble of decision trees, with one extra source of randomness: each split considers only a random subset of the features.

code implementation

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500, random_state=666, oob_score=True)

rf_clf.fit(X, y)
rf_clf.oob_score_

Drawing
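Plotting the forest's boundary:

plot_decision_boundary(rf_clf, X, y)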

Extra trees (extremely randomized trees)

Extremely randomized trees (extra-trees), like random forests, are ensembles of many decision trees. The main difference is how splits are chosen: a random forest searches for the best threshold within a random subset of features, while an extra-tree draws its split thresholds at random. By default extra-trees also train each tree on the whole training set rather than on bootstrap samples (bootstrap=True is set below only so that an oob score can be computed). The extra randomness reduces variance further, which can sometimes give better results than a random forest.

code implementation

from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(n_estimators=500, random_state=666, bootstrap=True, oob_score=True)
et_clf.fit(X, y)
et_clf.oob_score_

Drawing
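And the extra-trees boundary:

plot_decision_boundary(et_clf, X, y)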

AdaBoost

AdaBoost trains its models sequentially. After each round, the samples the current model misclassified receive larger weights, so the next model concentrates on them; the final prediction combines all models by a weighted vote.
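A one-round sketch of the re-weighting step (illustrative and simplified; sklearn's AdaBoostClassifier uses a variant of this, the SAMME algorithm, which reduces to this form for two classes):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

w = np.full(len(X_train), 1 / len(X_train))     # start from uniform weights
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train, sample_weight=w)
miss = stump.predict(X_train) != y_train

err = w[miss].sum()                             # weighted error rate
alpha = np.log((1 - err) / err)                 # this round's model weight
w[miss] *= np.exp(alpha)                        # boost the misclassified samples
w /= w.sum()                                    # renormalize for the next round
print(err, alpha)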

code implementation

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=6), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)

Drawing
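Drawing the boosted model's boundary:

plot_decision_boundary(ada_clf, X, y)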

Gradient Boosting

Gradient boosting also builds models sequentially, but instead of re-weighting samples, each new model is fitted to the residual errors of the current ensemble, that is, to the negative gradient of the loss.
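The residual-fitting loop is easiest to see in the regression setting, where with squared loss the negative gradient is simply y minus the current prediction. A minimal illustrative sketch (not from the original post):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y_reg = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

pred = np.zeros(len(y_reg))
for _ in range(50):
    residual = y_reg - pred                   # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(x, residual)
    pred += 0.1 * tree.predict(x)             # shrink each step by a learning rate

print(np.mean((y_reg - pred) ** 2))           # training MSE after 50 rounds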

code implementation

from sklearn.ensemble import GradientBoostingClassifier

gd_clf = GradientBoostingClassifier(max_depth=6, n_estimators=500)

gd_clf.fit(X_train, y_train)
gd_clf.score(X_test, y_test)

Drawing
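And for the gradient-boosted model:

plot_decision_boundary(gd_clf, X, y)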

For more related material, see:

General review of basic knowledge of machine learning
