Implementation and principle of the Python XGBoost regression algorithm (a competition favorite)

Regression prediction of Boston house prices

1, Load the Boston dataset and observe the shape of the data.

from keras.datasets import boston_housing

def del_data():  # helper that loads the dataset so it can be fed directly into the xgboost algorithm
    (train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
    return train_data, train_targets, test_data, test_targets

Printing the shapes shows that the Boston training set contains 404 samples with 13 features each, and the test set contains 102 samples.
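The shapes can be checked directly (a minimal sketch, assuming keras and its bundled boston_housing dataset are available):

```python
from keras.datasets import boston_housing

# load_data() downloads the dataset on first use and returns numpy arrays
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
print(train_data.shape, test_data.shape)  # (404, 13) (102, 13)
```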

2, Does xgboost regression need normalization

Answer: No. Under the hood, xgboost is still built on decision trees and is trained by searching for the optimal split point at each node. Tree-based algorithms split on feature thresholds, and a monotonic transformation such as normalization does not change the ordering of the samples, so the resulting splits are the same and normalization is not needed.

3, xgboost adjustable parameters

Answer: every machine learning algorithm exposes its own parameters, and this parameter set can be tuned.

The official XGBoost documentation describes the full Python API; the most commonly used parameters are summarized below:

| Parameter name | Meaning |
| --- | --- |
| max_depth | Maximum tree depth of each base learner. Limits complexity to avoid overfitting. Default: 6. |
| learning_rate | Shrinkage applied to each tree's contribution. A low value makes learning slower and requires more iterations, but reduces the impact of any single tree. |
| n_estimators | Number of boosted trees; tuned to balance overfitting and underfitting. |
| objective | The loss function that is minimized when solving for the optimal regression tree. |
| gamma | Split penalty coefficient: the minimum decrease in the loss function required to split a node. |
| alpha | L1 regularization coefficient. |

The official documentation lists many tunable parameters, but in actual experiments only a few matter for most calls. The main ones worth tuning are max_depth, learning_rate and n_estimators, for example with a grid search. For the other coefficients, unless you have a strong mathematical foundation and understand the underlying ideas, leaving them at their default values is usually the best choice.

4, The following is the code implementation process

from sklearn.metrics import mean_squared_error
import xgboost as xgb
from keras.datasets import boston_housing

def main():
    (train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

    model = xgb.XGBRegressor(max_depth=6, learning_rate=0.05, n_estimators=100, random_state=42)
    model.fit(train_data, train_targets)
    train_predict = model.predict(train_data)
    test_predict = model.predict(test_data)
    msetrain = mean_squared_error(train_targets, train_predict)  # MSE on the training set
    msetest = mean_squared_error(test_targets, test_predict)     # MSE on the test set
    print(msetrain, msetest)


model is the fitted model: it learns from the training set and then makes predictions on new data.
msetrain and msetest are the evaluation results: the mean squared error (MSE) of the model's predictions on the training set and on the test set respectively. Comparing these two values tells you how well your model is constructed and whether it generalizes.
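As a small reminder of what MSE measures, sklearn's mean_squared_error averages the squared residuals (toy numbers for illustration):

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 3.0]
# residuals are 0.5, 0.0, -0.5 -> mean of squares = (0.25 + 0 + 0.25) / 3
print(mean_squared_error(y_true, y_pred))  # ≈ 0.1667
```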

In my run the model performs very well on the training set but noticeably worse on the test set, which indicates overfitting; the parameters need to be tuned to improve the model. I will publish an article on parameter tuning later; this article mainly explains the principle.

5, Popular explanation of the principle.

obj is the objective function, and the whole algorithm is realized by optimizing it. It is the sum of the training loss and a regularization term; the regularization term penalizes the complexity of the model to prevent overfitting.

Because the objective cannot be optimized over all trees at once, the algorithm works additively: taking the regression trees already formed up to step t-1 as known, it derives the tree to add at step t, and adding that tree reduces the objective.

The concrete optimization method is a second-order Taylor expansion of the loss around the known prediction at step t-1. The official documentation and the paper derive this very thoroughly; here I only explain it from the perspective of simple understanding and implementation.

The final result is a closed-form expression for the optimal leaf weights and for the score of a tree structure.
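The formulas referred to above (originally shown as images) can be reconstructed from the XGBoost paper; a sketch, writing $\hat{y}_i^{(t-1)}$ for the prediction after step $t-1$ and $f_t$ for the tree added at step $t$:

```latex
% Objective at step t: training loss plus complexity penalty
\mathrm{obj}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}

% Second-order Taylor expansion around \hat{y}_i^{(t-1)},
% with g_i and h_i the first and second derivatives of the loss:
\mathrm{obj}^{(t)} \simeq \sum_{i=1}^{n}\Bigl[ g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^{2}(x_i) \Bigr] + \Omega(f_t)

% Optimal leaf weight and resulting structure score,
% where G_j and H_j sum g_i and h_i over the samples in leaf j:
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\mathrm{obj}^{*} = -\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T
```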

6, Simple understanding

In fact, following the decision-tree idea: for n samples with m features, each regression tree is grown by enumerating candidate split points on every feature, choosing the best point to segment the data at each node, and the boosting iterations repeat this from the initial tree to the final ensemble, yielding the optimal divisions of the n samples over the m features.
That is roughly the overall framework; discussion is welcome.
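The greedy search for the best split point described above can be sketched for a single feature (a toy illustration of the idea using squared error, not xgboost's actual gain formula):

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on feature x minimizing total squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        left, right = y_sorted[:i], y_sorted[i:]
        # squared error if each side predicts its own mean
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
threshold, sse = best_split(x, y)
print(threshold)  # 6.5, separating the two clusters of targets
```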
Writing this article was not easy; I hope you find it useful.

Tags: Python Algorithm Machine Learning xgboost

Posted on Sun, 17 Oct 2021 22:06:34 -0400 by MathewByrne