A decision tree is a nonparametric supervised learning method used mainly for classification and regression. The goal of the algorithm is to build a model that predicts a target variable by learning simple decision rules inferred from the features of the data. A decision tree is a tree structure, which may be binary or non-binary. Each non-leaf node represents a test on a feature attribute, each branch represents one outcome of that test over a range of values, and each leaf node stores a category. To make a decision, start at the root node, test the corresponding feature attribute of the item to be classified, and follow the branch that matches its value; repeat until a leaf node is reached, and take the category stored in that leaf as the decision result.
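The root-to-leaf walk described above can be sketched in a few lines. This is only an illustration: the `Node` class, the `classify` function, and the toy tree (including its attribute names and categories) are invented here and are not learned from any data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical node type: internal nodes test an attribute, leaves store a category."""
    attribute: Optional[str] = None                            # attribute tested at an internal node
    branches: Dict[str, "Node"] = field(default_factory=dict)  # attribute value -> child node
    category: Optional[str] = None                             # set only on leaf nodes


def classify(node: Node, item: Dict[str, str]) -> str:
    """Walk from the root to a leaf, following the branch that matches each tested value."""
    while node.category is None:
        value = item[node.attribute]
        node = node.branches[value]
    return node.category


# Hand-written toy tree (an assumption for illustration, not a trained model)
toy_tree = Node(attribute="house", branches={
    "yes": Node(category="approve"),
    "no": Node(attribute="married", branches={
        "yes": Node(category="approve"),
        "no": Node(category="review"),
    }),
})

print(classify(toy_tree, {"house": "no", "married": "yes"}))  # approve
```

A non-binary tree needs no changes here, since each internal node simply holds one branch per attribute value.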
The decision tree is a simple but widely used classifier. Once a decision tree has been built from training data, unknown data can be classified efficiently. Decision trees have two main advantages:
- The decision tree model is readable and descriptive, which is helpful for manual analysis;
- High efficiency: the decision tree only needs to be built once and can then be used repeatedly, and the number of tests performed for each prediction never exceeds the depth of the tree.
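The depth bound on prediction cost can be checked with scikit-learn's `decision_path` and `get_depth`. The sketch below assumes a synthetic dataset from `make_classification`; the dataset sizes and the `random_state` values are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to have something to fit (an assumption of this sketch)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
depth = clf.get_depth()

# decision_path returns a sparse indicator matrix: one row per sample,
# with a 1 for every node the sample passes through (root and leaf included)
path = clf.decision_path(X)
nodes_per_sample = np.asarray(path.sum(axis=1)).ravel()

# The leaf performs no test, so the number of tests is nodes visited minus one
tests_per_sample = nodes_per_sample - 1

print("tree depth:", depth)
print("most tests needed for any sample:", int(tests_per_sample.max()))
```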
The decision tree can do both classification and regression.
- The output of the classification tree is the class label of the sample.
- The output of the regression tree is a real number (such as the price of a house or the length of a patient's hospital stay).
Taking the picture at the beginning of the article as an example, suppose a bank needs to review a user's information before granting a loan, in order to decide whether to approve it. The structured data file data.csv is as follows:
    house, married, income, give_loan
    1, 1, 80, 1
    1, 0, 30, 1
    1, 1, 30, 1
    0, 1, 30, 1
    0, 1, 40, 1
    0, 0, 80, 1
    0, 0, 78, 0
    0, 0, 70, 1
    0, 0, 88, 1
    0, 0, 45, 0
    0, 1, 87, 1
    0, 0, 89, 1
    0, 0, 100, 1
    from numpy import genfromtxt
    from sklearn import tree

    # Load the data; the header row is parsed as NaN and skipped by the slices below
    dataset = genfromtxt('data.csv', delimiter=",")
    x = dataset[1:, 0:3]  # features: house, married, income
    y = dataset[1:, 3]    # target: give_loan

    # Train the classifier
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(x, y)

    # Predict
    print(clf.predict([[0, 0, 50]]))  # an output of [0.] means this user does not meet the loan conditions
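To illustrate the readability of the model, the learned rules can be printed as text with scikit-learn's `export_text`. This is a minimal sketch that inlines the rows of data.csv rather than reading the file, and fixes `random_state=0` so repeated runs pick the same splits; both are choices made here for convenience.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows of data.csv, inlined so the sketch runs without the file
data = np.array([
    [1, 1, 80, 1], [1, 0, 30, 1], [1, 1, 30, 1], [0, 1, 30, 1],
    [0, 1, 40, 1], [0, 0, 80, 1], [0, 0, 78, 0], [0, 0, 70, 1],
    [0, 0, 88, 1], [0, 0, 45, 0], [0, 1, 87, 1], [0, 0, 89, 1],
    [0, 0, 100, 1],
])
X = data[:, :3]  # house, married, income
y = data[:, 3]   # give_loan

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Render the tree as indented if/else style rules
rules = export_text(clf, feature_names=["house", "married", "income"])
print(rules)
```

Each indented line is one test on a feature, and each `class:` line is a leaf, so the printout can be read top to bottom like a set of nested conditions.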
The difference between regression and classification is that, for regression, the target vector y holds floating-point values rather than class labels.
    from sklearn import tree

    X = [[0, 0], [2, 2]]
    y = [0.5, 2.5]  # floating-point targets
    clf = tree.DecisionTreeRegressor()
    clf = clf.fit(X, y)
    print(clf.predict([[1, 1]]))
The example given on scikit-learn's official website is:
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    import matplotlib.pyplot as plt

    # Create a random dataset
    rng = np.random.RandomState(1)
    X = np.sort(5 * rng.rand(80, 1), axis=0)
    y = np.sin(X).ravel()
    y[::5] += 3 * (0.5 - rng.rand(16))  # add noise to every fifth sample

    # Train the decision tree regression models
    regr_1 = DecisionTreeRegressor(max_depth=2)
    regr_2 = DecisionTreeRegressor(max_depth=5)
    regr_1.fit(X, y)
    regr_2.fit(X, y)

    # Predict
    X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
    y_1 = regr_1.predict(X_test)
    y_2 = regr_2.predict(X_test)

    # Plot the results
    plt.figure()
    plt.scatter(X, y, c="darkorange", label="data")
    plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
    plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
    plt.xlabel("data")
    plt.ylabel("target")
    plt.title("Decision Tree Regression")
    plt.legend()
    plt.show()
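One way to read the two curves in that plot without drawing them: a regression tree predicts a single constant per leaf (with the default criterion, the mean of the training targets that fall in that leaf), so a binary tree of depth d can output at most 2^d distinct values. The sketch below reuses the same synthetic data; the `summary` dictionary is a name introduced here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same synthetic data as in the example above
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]

summary = {}
for max_depth in (2, 5):
    regr = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
    n_leaves = regr.get_n_leaves()
    # Count how many different values the step function actually takes
    n_distinct = len(np.unique(regr.predict(X_test)))
    summary[max_depth] = (n_leaves, n_distinct)
    print(f"max_depth={max_depth}: {n_leaves} leaves, {n_distinct} distinct predictions")
```

The deeper tree has more steps available, which is why it can follow the noisy points more closely than the shallow one.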
Tips on using decision trees
- Decision trees tend to overfit when the data has a large number of features. The ratio of samples to features is very important: a tree trained on few samples with many features is very likely to overfit.
- Consider reducing dimensionality beforehand (PCA, ICA) so that the tree has a better chance of finding features that discriminate well.
- Visualize the tree as you train it by using the export function. Start with max_depth=3 as the initial depth to get a feel for how the tree fits the data, and then increase the depth gradually. The number of samples needed to populate the tree doubles with each additional level, so use max_depth to control the size of the tree and prevent overfitting.
- Use min_samples_split or min_samples_leaf to control the number of samples at a node. A very small number often means overfitting, while a large number can prevent overfitting (though too large a value keeps the tree from learning the data). Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a floating-point number can be used instead, interpreted as a fraction of the number of samples. The difference between the two is that min_samples_leaf guarantees a minimum number of samples in each leaf, whereas min_samples_split can produce arbitrarily small leaves; min_samples_split is the more common of the two in the literature.
- If the samples are weighted, min_weight_fraction_leaf can be used to apply a weight-based pre-pruning rule, which makes it easier to optimize the tree structure.
- Decision trees use np.float32 arrays internally. If the training data is not in this format, a copy of the dataset will be made.
- If the data matrix X is very sparse, it is recommended to convert it to a sparse csc_matrix before fitting and to a sparse csr_matrix before prediction. When most feature values are zero, training on a sparse matrix can be several orders of magnitude faster than on a dense one.
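Several of these tips can be tried together in one short script. The sketch below is illustrative only: the synthetic dataset, the 1.5 cutoff used to zero out small values, and the `random_state` values are assumptions made here, and the parameter values are just the starting points suggested above.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data (an assumption of this sketch)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Unconstrained tree versus a tree limited by max_depth and min_samples_leaf
full = DecisionTreeClassifier(random_state=0).fit(X, y)
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)
print("unconstrained:", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("constrained:  ", small.get_depth(), "levels,", small.get_n_leaves(), "leaves")

# A shallow tree is small enough to inspect as text
print(export_text(small))

# min_samples_leaf guarantees a minimum number of training samples per leaf:
# apply() returns the leaf index of every sample, so count samples per leaf
leaf_sizes = np.bincount(small.apply(X))
smallest_leaf = int(leaf_sizes[leaf_sizes > 0].min())
print("smallest leaf holds", smallest_leaf, "samples")

# Sparse input: zero out small values to obtain a mostly-zero matrix,
# then fit on CSC and predict on CSR
X_mostly_zero = np.where(np.abs(X) > 1.5, X, 0.0)
sparse_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
sparse_clf.fit(csc_matrix(X_mostly_zero), y)
sparse_pred = sparse_clf.predict(csr_matrix(X_mostly_zero))
```

Comparing the depth and leaf counts of the two trees gives a quick sense of how much the constraints shrink the model before any accuracy comparison is made.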