# Easy to understand machine learning -- ensemble learning

## Comparison between ordinary decision tree and random forest

### Generate moons dataset

```python
from sklearn import datasets
import matplotlib.pyplot as plt

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
plt.show()
```

### Drawing function

```python
import numpy as np

def plot_decision_boundary(model, X, y):
    # Build a grid covering the feature space
    x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x0, x1 = np.meshgrid(np.linspace(x0_min, x0_max, 100), np.linspace(x1_min, x1_max, 100))

    # Predict the class of every grid point and colour the regions
    Z = model.predict(np.c_[x0.ravel(), x1.ravel()])
    Z = Z.reshape(x0.shape)

    plt.contourf(x0, x1, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x1')
    plt.xlabel('x0')
    plt.scatter(X[:, 0], X[:, 1], c=np.squeeze(y))
    plt.show()
```

## Prediction using decision tree

### Build decision tree and train

```python
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(max_depth=6)
dt_clf.fit(X, y)
plot_decision_boundary(dt_clf, X, y)
```

### Drawing

Because a decision tree splits on one feature at a time, its decision boundary is made of straight, axis-aligned segments, which is what the plot shows.

### Cross validation

```python
from sklearn.model_selection import cross_val_score
print(cross_val_score(dt_clf, X, y, cv=5).mean())  # cv sets how many folds of cross validation to run

# Stratified k-fold keeps the class proportions of the original dataset in every fold
from sklearn.model_selection import StratifiedKFold

strKFold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(dt_clf, X, y, cv=strKFold).mean())

# Leave-one-out cross validation
from sklearn.model_selection import LeaveOneOut

loout = LeaveOneOut()
print(cross_val_score(dt_clf, X, y, cv=loout).mean())

# ShuffleSplit lets you control the number of iterations and the train/test proportion of each split
# (the proportions need not sum to 1, so some samples may fall in neither the training nor the test set)
from sklearn.model_selection import ShuffleSplit

shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8)  # 8 iterations
print(cross_val_score(dt_clf, X, y, cv=shufspl).mean())
```

## Voting with Voting Classifier

### Constructing voting classifier

A VotingClassifier with voting='hard' uses hard voting: each base classifier first predicts a class label, and the ensemble then takes a majority vote over those labels.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Split the data (the original split is not shown; random_state here is illustrative)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
```

### Drawing

```python
plot_decision_boundary(voting_clf, X, y)
```

## Soft Voting classifier

A VotingClassifier with voting='soft' uses soft voting: it averages the class probabilities produced by each classifier and picks the class with the highest average probability, so every base estimator must support predict_proba.

### Constructing voting classifier

```python
voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='soft')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
```

## bagging

### analysis

Bagging builds multiple models, trains each one on a random subset of the data, and combines the predictions of all the models to produce the final result.

### Multiple decision tree models

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=500,  # number of base classifiers
                                max_samples=100,   # samples drawn for each model
                                bootstrap=True)    # sample with replacement

bagging_clf.fit(X_train, y_train)
bagging_clf.score(X_test, y_test)

### Drawing

Personally, I feel the bagging ensemble's boundary looks similar to that of a single decision tree: it is still made of straight, axis-aligned segments.
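
A sketch of the drawing step, reusing the plot_decision_boundary helper defined above:

```python
plot_decision_boundary(bagging_clf, X, y)
```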

## Out of bag (oob)

This is the same bagging approach, with one addition: because bootstrap sampling leaves some samples unused by each model, those out-of-bag samples can be used to evaluate the ensemble without setting aside a separate test set.

### code implementation

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(),  # base classifier
                                n_estimators=500,  # number of classifiers
                                max_samples=100,   # training samples per model
                                bootstrap=True,    # sample with replacement
                                oob_score=True)    # evaluate on out-of-bag samples

bagging_clf.fit(X, y)
bagging_clf.oob_score_
```

### Drawing

The resulting boundary is again similar; a different dataset might show the difference more clearly.
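
As above, a sketch of how this model's boundary can be plotted with the earlier helper:

```python
plot_decision_boundary(bagging_clf, X, y)
```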

## Random forest

A random forest is, in effect, this same kind of ensemble of many decision trees: it bags the trees and, in addition, considers only a random subset of the features when searching for the best split at each node.

### code implementation

```python
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500, random_state=666, oob_score=True)

rf_clf.fit(X, y)
rf_clf.oob_score_
```
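
For a visual comparison with the earlier models, the random forest's boundary can be drawn the same way (a sketch; the helper assumes two-feature data such as this moons set):

```python
plot_decision_boundary(rf_clf, X, y)
```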

## Extra trees

Like a random forest, an extra-trees (extremely randomized trees) ensemble is built from many decision trees. The main difference is how the trees are grown: a random forest uses bagging and searches for the best split at each node, while extra trees also choose the split thresholds at random. This extra randomness makes the individual trees more diverse, which can in some cases give better results than a random forest.

### code implementation

```python
from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(n_estimators=500, random_state=666, bootstrap=True, oob_score=True)
et_clf.fit(X, y)
et_clf.oob_score_
```

## AdaBoost

Models are trained one after another on weighted data: after each model is trained, the weights of the samples it predicted incorrectly are increased, and the next model is trained on the reweighted data, so it focuses on the hard cases.

### code implementation

```python
from sklearn.ensemble import AdaBoostClassifier
```
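
The post stops at the import; a minimal way to finish this step could look like the sketch below (the base tree depth and n_estimators are illustrative assumptions, not values from the original):

```python
# Boost shallow decision trees; hyperparameters are illustrative
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)
```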

## Gradient Boosting

Gradient boosting also trains models one after another, but instead of reweighting samples, each new model is fitted to the errors left by the models trained so far.

### code implementation

```python
from sklearn.ensemble import GradientBoostingClassifier

gd_clf = GradientBoostingClassifier(max_depth=2, n_estimators=500)  # hyperparameters are illustrative
gd_clf.fit(X_train, y_train)
gd_clf.score(X_test, y_test)
```

## For more relevant knowledge, please refer to

General review of basic knowledge of machine learning
