Preface
This learning note is based on the Alibaba Cloud Tianchi "Longzhu Plan" machine learning training camp. The course link is:
https://tianchi.aliyun.com/specials/promotion/aicampml
1, Summary of learning points
- Understand the application scenarios, advantages and disadvantages of XGBoost
- Learn how to call XGBoost from Python and apply it to prediction on a weather dataset
2, Learning content
1. Application, advantages and disadvantages of XGBoost
1.1 Applications of XGBoost
XGBoost is a scalable machine learning system developed in 2016, led by Tianqi Chen of the University of Washington. Strictly speaking, XGBoost is not a model but a software package that lets users conveniently solve classification, regression, and ranking problems. Internally it implements the Gradient Boosting Decision Tree (GBDT) model with many algorithmic optimizations, achieving high accuracy while remaining very fast. For quite some time it has been a dominant tool in data mining and machine learning competitions and practice, both at home and abroad.
More importantly, XGBoost reflects deep thinking about both systems optimization and machine learning principles. It is no exaggeration to say that the scalability, portability and accuracy provided by XGBoost pushed the upper limits of machine learning computation: the system runs more than ten times faster on a single machine than the popular solutions of the time, and can scale to billions of examples in distributed settings.
XGBoost is widely used in machine learning and data mining, and has been successfully applied to many problems in industry and academia, for example: store sales forecasting, high-energy physics event classification, web text classification, user behavior prediction, motion detection, advertising click-through rate prediction, malware classification, disaster risk prediction, and online course dropout prediction. Although domain-specific data analysis and feature engineering also play an important role in these solutions, the consistent choice of XGBoost by learners and practitioners shows the influence and importance of this package.
1.2 Advantages and disadvantages of XGBoost
Main advantages of XGBoost:
- Easy to use. Compared with other machine learning libraries, users can get quite good results from XGBoost with little effort.
- Efficient and scalable. On large-scale datasets it is fast, effective, and has modest requirements on hardware resources such as memory.
- Strong robustness. Unlike deep learning models, it can achieve comparable results without fine-grained hyperparameter tuning.
- XGBoost implements the boosted tree model internally and can automatically handle missing values (see the short sketch after this list).
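For example, a minimal sketch of this on a tiny made-up array (the names below are illustrative and not from the weather notebook):

## Sketch: XGBoost can be trained directly on data containing NaN (treated as missing values)
import numpy as np
from xgboost import XGBClassifier

X_demo = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 1.0], [4.0, 5.0]])
y_demo = np.array([0, 1, 0, 1])
XGBClassifier(n_estimators=5).fit(X_demo, y_demo)   # no manual imputation needed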
Main disadvantages of XGBoost:
- Compared with deep learning models, it cannot model spatial or temporal structure and does not handle high-dimensional unstructured data such as images, audio and text well.
- When there is a large amount of training data and a suitable deep learning model can be found, the accuracy of deep learning can be far ahead of XGBoost.
2. Hands-on: XGBoost classification on a weather dataset
2.1 Goal
We are given daily weather observations provided by weather stations, and we need to predict whether it will rain tomorrow based on the historical data.
2.2 Data Description:
2.3 Code flow
Step 1: Import the function libraries
Import some basic function libraries, including numpy, pandas, matplotlib and seaborn.
# Download the required dataset
!wget https://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/7XGBoost/train.csv

## Basic function libraries
import numpy as np
import pandas as pd

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Read / load the data
## Use pandas' read_csv function to read the data and convert it to a DataFrame
data = pd.read_csv('train.csv')
Step 3: Handle missing values
Fill in missing values
NaN values exist in the dataset. We usually treat NaN as a missing value, which may come from errors during data collection or processing. Here we fill the missing values with -1. There are other strategies such as median filling or mean filling; interested readers can try them (a minimal sketch of median filling follows the code below).
## Use .info() to view the overall information of the data
data.info()

## For a quick look at the data, we can use .head() for the first rows and .tail() for the last rows
data.head()

data = data.fillna(-1)
data.tail()
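As mentioned above, median filling is one alternative to filling with -1. A minimal sketch (for the numeric columns only, meant to replace the fillna(-1) call above rather than follow it):

## Sketch: fill the numeric columns with their median instead of -1
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].median())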
## Use the value_counts function to view the label distribution of the training set
pd.Series(data['RainTomorrow']).value_counts()
We find that the number of negative samples ('No') is much larger than the number of positive samples ('Yes'). This common situation is called the "class imbalance" problem, and in some cases it requires special treatment; one possible treatment is sketched below.
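One option (not used in the rest of this note) is XGBoost's scale_pos_weight parameter, which re-weights the positive class. A hedged sketch, computing the weight from the label counts:

## Sketch: re-weight the positive class by the negative/positive ratio
from xgboost.sklearn import XGBClassifier

counts = data['RainTomorrow'].value_counts()
ratio = counts['No'] / counts['Yes']                    # imbalance ratio estimated from the labels
clf_balanced = XGBClassifier(scale_pos_weight=ratio)    # defined here only as an illustration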
Compute some descriptive statistics for the features
## Descriptive statistics for the features
data.describe()
Some results are as follows:
Explanation:
Pandas provides the describe method, which reports the count, mean, maximum/minimum and other statistics of each column; it is very convenient.
Description of the statistics:
count: number of non-missing values in the column
mean: average value
std: standard deviation
min: minimum
25%: first quartile (lower quartile)
50%: median
75%: third quartile (upper quartile)
max: maximum
Step 4: Visual description
For convenience, we first separate the numerical features from the categorical (non-numerical) features:
numerical_features = [x for x in data.columns if data[x].dtype == float]
category_features = [x for x in data.columns if data[x].dtype != float and x != 'RainTomorrow']

## Scatter-plot visualization of combinations of three features together with the label
sns.pairplot(data=data[['Rainfall', 'Evaporation', 'Sunshine'] + ['RainTomorrow']],
             diag_kind='hist', hue='RainTomorrow')
plt.show()
From the figure above we can see how, for each 2D feature combination, the rain and no-rain samples for the next day are distributed, and roughly how well each combination separates them. Combinations involving Sunshine are more discriminative than the others.
for col in data[numerical_features].columns:
    if col != 'RainTomorrow':
        sns.boxplot(x='RainTomorrow', y=col, saturation=0.5, palette='pastel', data=data)
        plt.title(col)
        plt.show()
Using box plots we can also see how the two classes are distributed differently on each feature. We can see that Sunshine, Humidity3pm, Cloud9am and Cloud3pm have strong discriminative power.
tlog = {}
for i in category_features:
    tlog[i] = data[data['RainTomorrow'] == 'Yes'][i].value_counts()

flog = {}
for i in category_features:
    flog[i] = data[data['RainTomorrow'] == 'No'][i].value_counts()
plt.figure(figsize=(10, 10))
plt.subplot(1, 2, 1)
plt.title('RainTomorrow')
sns.barplot(x=pd.DataFrame(tlog['Location']).sort_index()['Location'],
            y=pd.DataFrame(tlog['Location']).sort_index().index, color="red")
plt.subplot(1, 2, 2)
plt.title('Not RainTomorrow')
sns.barplot(x=pd.DataFrame(flog['Location']).sort_index()['Location'],
            y=pd.DataFrame(flog['Location']).sort_index().index, color="blue")
plt.show()
From the figure above we can see that the amount of rainfall varies greatly across locations; some places are clearly more prone to rain.
plt.figure(figsize=(10, 2))
plt.subplot(1, 2, 1)
plt.title('RainTomorrow')
sns.barplot(x=pd.DataFrame(tlog['RainToday'][:2]).sort_index()['RainToday'],
            y=pd.DataFrame(tlog['RainToday'][:2]).sort_index().index, color="red")
plt.subplot(1, 2, 2)
plt.title('Not RainTomorrow')
sns.barplot(x=pd.DataFrame(flog['RainToday'][:2]).sort_index()['RainToday'],
            y=pd.DataFrame(flog['RainToday'][:2]).sort_index().index, color="blue")
plt.show()
From the figure above we can see that if it rains today it does not necessarily rain tomorrow, but if it does not rain today it usually does not rain the next day either (a quick numerical check of this observation is sketched below).
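One quick way to check this observation numerically (a sketch, not part of the original notebook) is a normalized contingency table of today's and tomorrow's rain labels:

## Sketch: proportion of 'RainTomorrow' outcomes for each 'RainToday' value
pd.crosstab(data['RainToday'], data['RainTomorrow'], normalize='index')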
Step 5: Encode the discrete (categorical) variables
Since XGBoost cannot handle string-valued data directly, we need to convert the string features into numbers. The simplest method is label encoding: every distinct value of a categorical feature is mapped to an integer, e.g. female = 0, male = 1, dog = 2, so the encoded values fall in [0, number of categories − 1] (a bit like data normalization in image processing). There are also other encodings such as one-hot encoding, sum encoding and leave-one-out encoding, which can sometimes give better results (a one-hot sketch is shown after the encoding code below).
In essence, we convert the string-valued attributes into numbers so that XGBoost can process them.
## Map every distinct value of a categorical feature to an integer
def get_mapfunction(x):
    mapp = dict(zip(x.unique().tolist(), range(len(x.unique().tolist()))))
    def mapfunction(y):
        if y in mapp:
            return mapp[y]
        else:
            return -1
    return mapfunction

for i in category_features:
    data[i] = data[i].apply(get_mapfunction(data[i]))

## The encoded string feature has become numbers
data['Location'].unique()
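As an alternative mentioned above, one-hot encoding could be used in place of the label encoding. A sketch with pandas (illustration only; in this notebook the columns have already been label-encoded at this point, and the rest of the note keeps that encoding):

## Sketch: one-hot encode the categorical columns (would normally replace the label encoding above)
data_onehot = pd.get_dummies(data, columns=category_features)
data_onehot.head()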
Step 6: Train and predict with XGBoost
## To properly evaluate model performance, split the data into a training set and a test set:
## train the model on the training set and evaluate it on the test set.
from sklearn.model_selection import train_test_split

data_target_part = data['RainTomorrow']
data_features_part = data[[x for x in data.columns if x != 'RainTomorrow']]

## Note: recent versions of XGBoost expect numeric class labels; if fitting fails on the
## 'Yes'/'No' strings, map them first, e.g. with data_target_part.map({'No': 0, 'Yes': 1})

## Use 20% of the data as the test set (an 80% / 20% split)
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part,
                                                    test_size=0.2, random_state=2020)
## Import the XGBoost model
from xgboost.sklearn import XGBClassifier

## Define the XGBoost model
clf = XGBClassifier()

## Train the XGBoost model on the training set
clf.fit(x_train, y_train)
## Use the trained model to predict on the training set and the test set
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

from sklearn import metrics

## Evaluate the model with accuracy [the proportion of correctly predicted samples among all samples]
print('The accuracy of XGBoost on the training set is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of XGBoost on the test set is:', metrics.accuracy_score(y_test, test_predict))

## View the confusion matrix (counts of each combination of true value and predicted value)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)

## Visualize the confusion matrix with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
We can see that a total of 15759 + 2306 samples are predicted correctly and 2470 + 794 samples are predicted incorrectly.
Step 7: Feature selection with XGBoost
Feature selection with XGBoost is an embedded feature-selection method: the feature_importances_ attribute of the trained model shows the importance of each feature.
? sns.barplot
sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)
From the plot we can see that the humidity at 3 p.m. and whether it rained today are the most important factors for whether it will rain the next day.
Besides feature_importances_, we can also use the following importance types provided by XGBoost to evaluate features:
- weight: evaluated by the number of times the feature is used to split the data across all trees
- gain: evaluated by the average gain (loss reduction) of the splits that use the feature
- cover: evaluated by the average coverage of the splits that use the feature, where coverage is measured via the second-order gradients (Hessians) of the samples passing through the splits
- total_gain: the total gain over all splits that use the feature
- total_cover: the total coverage over all splits that use the feature
from sklearn.metrics import accuracy_score
from xgboost import plot_importance

def estimate(model, data):
    # sns.barplot(data.columns, model.feature_importances_)
    ax1 = plot_importance(model, importance_type="gain")
    ax1.set_title('gain')
    ax2 = plot_importance(model, importance_type="weight")
    ax2.set_title('weight')
    ax3 = plot_importance(model, importance_type="cover")
    ax3.set_title('cover')
    plt.show()

def classes(data, label, test):
    model = XGBClassifier()
    model.fit(data, label)
    ans = model.predict(test)
    estimate(model, data)
    return ans

ans = classes(x_train, y_train, x_test)
pre = accuracy_score(y_test, ans)
print('acc=', accuracy_score(y_test, ans))
Step 8: Obtain better results by tuning the hyperparameters
The parameters with a large impact on an XGBoost model include, but are not limited to, the following:
learning_rate: sometimes called eta, default 0.3. The step size of each boosting iteration: if it is too large, accuracy suffers; if it is too small, training is slow.
subsample: default 1. Controls the fraction of samples randomly drawn for each tree. Lowering this value makes the algorithm more conservative and helps avoid overfitting. The value range is (0, 1].
colsample_bytree: default 1, usually set to about 0.8. Controls the fraction of columns (features) randomly sampled for each tree.
max_depth: default 6, usually set between 3 and 10. The maximum depth of each tree, used to control overfitting; the larger max_depth is, the more specific the patterns the model learns.
Common tuning methods include greedy search, grid search, Bayesian optimization, and so on. Here we use grid search, whose basic idea is exhaustive search: loop over every combination of the candidate parameter values and keep the best one as the final result.
## Import the grid search function from scikit-learn
from sklearn.model_selection import GridSearchCV

## Define the ranges of parameter values
learning_rate = [0.1, 0.3, 0.6]
subsample = [0.8, 0.9]
colsample_bytree = [0.6, 0.8]
max_depth = [3, 5, 8]

parameters = {'learning_rate': learning_rate,
              'subsample': subsample,
              'colsample_bytree': colsample_bytree,
              'max_depth': max_depth}
model = XGBClassifier(n_estimators=50)

## Perform the grid search
clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
clf = clf.fit(x_train, y_train)

## The best parameters found by the grid search
clf.best_params_
## Predict on the training and test sets using the best parameters
## Define the XGBoost model with the tuned parameters
clf = XGBClassifier(colsample_bytree=0.6, learning_rate=0.3, max_depth=8, subsample=0.9)

## Train the XGBoost model on the training set
clf.fit(x_train, y_train)

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Evaluate the model with accuracy [the proportion of correctly predicted samples among all samples]
print('The accuracy of XGBoost on the training set is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of XGBoost on the test set is:', metrics.accuracy_score(y_test, test_predict))

## View the confusion matrix (counts of each combination of true value and predicted value)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)

## Visualize the confusion matrix with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
There were originally 2470 + 790 misclassified samples; now there are 2112 + 939, so the accuracy has improved noticeably.
3, Learning questions and answers
Hyperparameter tuning methods
Traditional manual tuning
In traditional manual tuning, we train the algorithm with hand-picked sets of hyperparameters and select the set that best meets our goal.
# Import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data
y = wine.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=14)

# Declare the parameter grid
k_value = list(range(2, 11))
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
scores = []
best_comb = []
kfold = KFold(n_splits=5)

# Hyperparameter tuning
for algo in algorithm:
    for k in k_value:
        knn = KNeighborsClassifier(n_neighbors=k, algorithm=algo)
        results = cross_val_score(knn, X_train, y_train, cv=kfold)
        print(f'Score:{round(results.mean(),4)} with algo = {algo} , K = {k}')
        scores.append(results.mean())
        best_comb.append((k, algo))

best_param = best_comb[scores.index(max(scores))]
print(f'\nThe Best Score : {max(scores)}')
print(f"['algorithm': {best_param[1]} ,'n_neighbors': {best_param[0]}]")
GridSearchCV()
# Method 1: grid search with GridSearchCV()
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import time

start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X_train, y_train)
end_time = time.time()
print("Grid search elapsed time:%.3f S" % float(end_time - start_time))
print(gs.best_score_)
print(gs.best_params_)
Random grid search (RandomizedSearchCV)
# Method 2: random grid search with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import time

start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]
# param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear', 'rbf'], 'svc__gamma': param_range}]
gs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid,
                        scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X_train, y_train)
end_time = time.time()
print("Random grid search elapsed time:%.3f S" % float(end_time - start_time))
print(gs.best_score_)
print(gs.best_params_)
Bayesian search
from skopt import BayesSearchCV
import warnings
warnings.filterwarnings("ignore")
# Parameter ranges can be specified with the types below
from skopt.space import Real, Categorical, Integer

knn = KNeighborsClassifier()

# Define the hyper-parameter grid
grid_param = {'n_neighbors': list(range(2, 11)),
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}

# Initialize the Bayesian search
Bayes = BayesSearchCV(knn, grid_param, n_iter=30, random_state=14)
Bayes.fit(X_train, y_train)

# Best parameter combination
Bayes.best_params_

# Score achieved with the best parameter combination
Bayes.best_score_

# All combinations of hyperparameters tried
Bayes.cv_results_['params']

# Average cross-validation scores
Bayes.cv_results_['mean_test_score']
References:
- Summary of machine learning model evaluation and optimization
- Four common hyperparameter tuning methods in machine learning
4, Learning, thinking and summary
XGBoost principle
Under the hood, XGBoost implements the GBDT algorithm and makes a series of optimizations to it:
- It applies a second-order Taylor expansion to the objective function, which fits the error more efficiently (see the sketch after this list).
- It proposes an approximate split-finding algorithm to speed up the construction of the CART trees and to handle sparse data at the same time.
- It proposes a parallel strategy for tree construction to speed up each iteration.
- It optimizes the algorithm for distributed training.
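As a sketch of the standard formulation from the XGBoost paper (the symbols below are introduced here only for illustration): at boosting step $t$, with first- and second-order gradients $g_i$ and $h_i$ of the loss $l$, the objective is approximated as

$$
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\Big[g_i f_t(\mathbf{x}_i) + \tfrac{1}{2} h_i f_t^{2}(\mathbf{x}_i)\Big] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big),\quad
h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big),
$$

where $f_t$ is the tree added at step $t$ and $\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda\lVert w\rVert^{2}$ penalizes the number of leaves $T$ and the leaf weights $w$.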
XGBoost is an ensemble model based on CART trees: multiple decision trees are combined in sequence to make the final decision together.
XGBoost predicts by iteratively fitting the prediction error. As a simple example, suppose we need to predict the value of a car worth 3000 yuan. We build decision tree 1, which after training predicts 2600 yuan, leaving an error of 400 yuan; so the training target of decision tree 2 is 400 yuan. Decision tree 2 predicts 350 yuan, leaving an error of 50 yuan, which is handed to the third tree, and so on. Each tree fits the residual error of all previous trees, and the final prediction is the sum of the predictions of all trees. A minimal sketch of this residual-fitting idea is shown below.
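The following sketch illustrates the residual-fitting idea with two plain scikit-learn regression trees on a made-up dataset (illustration only; XGBoost's actual algorithm fits gradients/Hessians of the loss and adds regularization):

## Minimal sketch of residual fitting with two regression trees
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))               # hypothetical 1-D feature
y_toy = 300 * X_toy.ravel() + rng.normal(0, 50, 200)    # hypothetical target, e.g. a price

tree1 = DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy)
residual = y_toy - tree1.predict(X_toy)                 # error left by the first tree

tree2 = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)

## The ensemble prediction is the sum of the two trees' predictions
pred = tree1.predict(X_toy) + tree2.predict(X_toy)
print('MSE tree1 only :', np.mean((y_toy - tree1.predict(X_toy)) ** 2))
print('MSE tree1+tree2:', np.mean((y_toy - pred) ** 2))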
The base model of XGBoost is the CART regression tree, which has two characteristics: (1) a CART tree is a binary tree; (2) it is a regression tree, so the fitted output is a continuous value.