CART algorithm
What is CART?
CART stands for Classification And Regression Tree, also called the classification and regression tree. As the name suggests, it is a versatile algorithm that can be used for both classification and regression, which makes it well worth learning.
The CART algorithm uses binary splitting, which lets the tree-building process handle continuous variables. The rule is simple: if a sample's feature value is greater than the given split value, it goes to the left subtree; otherwise it goes to the right subtree.
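As a quick illustration of this rule (a minimal sketch on a toy NumPy array with a made-up split value, not part of the original code):

import numpy as np

# Toy feature values and an arbitrary split value
feature_values = np.array([0.2, 0.8, 0.5, 0.9, 0.1])
split_value = 0.5

# Values greater than the split value go to the left subtree, the rest to the right subtree
left = feature_values[feature_values > split_value]    # [0.8 0.9]
right = feature_values[feature_values <= split_value]  # [0.2 0.5 0.1]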
The CART algorithm has two steps:
- Decision tree generation: recursively build a binary decision tree from the training data set, growing it as large as possible. Nodes are created from the root downward, and at each node the best attribute is chosen for the split so that the training samples in the child nodes are as pure as possible.
- Decision tree pruning: prune the generated tree using a validation data set and select the optimal subtree, taking the minimum of the loss function as the pruning criterion.
Construction process of a CART tree:
First, find the best feature column on which to split the data set, performing a binary split each time: samples whose feature value is greater than the split value go to the left subtree, and the rest go to the right subtree. When a node can no longer be split, it is saved as a leaf node.
Here we introduce two variants: the regression tree and the model tree.
The difference:
Regression tree: each leaf node stores the mean of the target values of the samples that reach it.
Model tree: each leaf node stores a linear equation fitted to those samples.
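To make the difference concrete, here is a minimal sketch (toy numbers, illustrative only) of what the two kinds of leaves predict for samples that fall into the same leaf:

import numpy as np

# Toy samples that end up in one leaf: a single feature x and the target y
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.1, 2.9, 4.2])

# Regression-tree leaf: one constant, the mean of the target values
reg_leaf = y.mean()

# Model-tree leaf: the coefficients of a linear equation y ≈ w0 + w1 * x
w1, w0 = np.polyfit(x, y, deg=1)

x_new = 2.5
print(reg_leaf)          # the same prediction for every sample in this leaf
print(w0 + w1 * x_new)   # the prediction follows the fitted line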
1: Regression tree
The hand-written regression tree takes the whole data set as input, with the feature columns followed by the target column (the target is the last column).
Main functions:
binSplitDataSet(dataSet, feature, value)  # split the data set on a feature and value
errType(dataSet)  # total variance: variance of the target column * number of samples
leafType(dataSet)  # generate a leaf node
chooseBestSplit(dataSet, leafType=leafType, errType=errType, ops=(1, 4))  # find the best binary split of the data
createTree(dataSet, leafType=leafType, errType=errType, ops=(1, 4))  # build the tree
Hand-written regression tree code:
""" Function description:Segment data sets according to features Parameter description: dataSet: Raw data set feature: Feature index to be segmented value: The value of the feature return: mat0: Segmented data set 0 mat1: Segmented data set 1 """ def binSplitDataSet(dataSet, feature, value): mat0 = dataSet.loc[dataSet.iloc[:,feature] > value,:] mat0.index = range(mat0.shape[0]) mat1 = dataSet.loc[dataSet.iloc[:,feature] <= value,:] mat1.index = range(mat1.shape[0]) return mat0, mat1 #Calculate the total variance: mean square deviation * number of samples def errType(dataSet): var= dataSet.iloc[:,-1].var() *dataSet.shape[0] return var #Generate leaf node def leafType(dataSet): leaf = dataSet.iloc[:,-1].mean() return leaf """ Function description:Find the best binary segmentation function of data Parameter description: dataSet: Raw data set leafType: Generate leaf node function errType: Error estimation function ops: Tuples of user-defined parameters return: bestIndex: Optimal segmentation feature bestValue: Optimal eigenvalue """ def chooseBestSplit(dataSet, leafType=leafType, errType=errType, ops = (1,4)): #The allowable error reduction value of tolS and the minimum number of samples for tolN segmentation tolS = ops[0]; tolN = ops[1] #If all current values are equal, exit. (according to the characteristics of set) if len(set(dataSet.iloc[:,-1].values)) == 1: return None, leafType(dataSet) #Rows m and columns n of the statistics set m, n = dataSet.shape #By default, the last feature is the best segmentation feature, and its error estimation is calculated S = errType(dataSet) #They are the best error, the index value of the best feature segmentation, and the best eigenvalue bestS = np.inf; bestIndex = 0; bestValue = 0 #Traverse all feature columns for featIndex in range(n - 1): #The original data set is labeled, and the number of features is n-1 colval= set(dataSet.iloc[:,featIndex].values) #Extract all values of the current cut column #Traverse all eigenvalues for splitVal in colval: #Data sets are segmented according to features and eigenvalues mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal) #If the data is less than tolN, exit if (mat0.shape[0] < tolN) or (mat1.shape[0] < tolN): continue #Calculation error estimation newS = errType(mat0) + errType(mat1) #If the error estimate is smaller, the feature index value and the feature value are updated if newS < bestS: bestIndex = featIndex bestValue = splitVal bestS = newS #Exit if the error reduction is small if (S - bestS) < tolS: return None, leafType(dataSet) #The data set is segmented according to the best segmentation feature and eigenvalue mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue) #Exit if the cut data set is very small if (mat0.shape[0] < tolN) or (mat1.shape[0] < tolN): return None, leafType(dataSet) #Returns the best segmentation feature and feature value return bestIndex, bestValue """ Function function:Tree building function Parameter description: dataSet: Raw data set leafType: Establish the function of leaf node errType: Error calculation function ops: Tuples containing all other parameters of the tree build return: retTree: Constructed regression tree """ def createTree(dataSet, leafType = leafType, errType = errType, ops = (1, 4)): #Select the best segmentation feature and eigenvalue col, value = chooseBestSplit(dataSet, leafType, errType, ops) #If there is no characteristic, the characteristic value is returned if col == None: return value #Regression tree retTree = {} #Store tree information 
retTree['spInd'] = col retTree['spVal'] = value #It is divided into left dataset and right dataset lSet, rSet = binSplitDataSet(dataSet, col, value) #Create left and right subtrees retTree['left'] = createTree(lSet, leafType, errType, ops) retTree['right'] = createTree(rSet, leafType, errType, ops) return retTree
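The overview above lists pruning with a validation set as the second step, but the code here stops at tree generation. As a rough, hedged sketch (modeled on the classic post-pruning routine from Machine Learning in Action; the names getMean and prune are illustrative and not from the original text), pruning the dictionary trees built by createTree could look like this:

# Post-pruning sketch for regression trees (leaves are plain numbers, not model-tree leaves):
# collapse a split whenever merging its two leaves lowers the squared error on validation data.
def getMean(tree):
    # Recursively collapse a subtree into the mean of its leaf values
    if isinstance(tree['left'], dict):
        tree['left'] = getMean(tree['left'])
    if isinstance(tree['right'], dict):
        tree['right'] = getMean(tree['right'])
    return (tree['left'] + tree['right']) / 2

def prune(tree, testData):
    # If no validation data reaches this node, collapse it
    if testData.shape[0] == 0:
        return getMean(tree)
    # Prune the children first
    if isinstance(tree['left'], dict) or isinstance(tree['right'], dict):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        if isinstance(tree['left'], dict):
            tree['left'] = prune(tree['left'], lSet)
        if isinstance(tree['right'], dict):
            tree['right'] = prune(tree['right'], rSet)
    # If both children are now leaves, test whether merging them lowers the validation error
    if not isinstance(tree['left'], dict) and not isinstance(tree['right'], dict):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum((lSet.iloc[:, -1] - tree['left']) ** 2) + \
                       sum((rSet.iloc[:, -1] - tree['right']) ** 2)
        treeMean = (tree['left'] + tree['right']) / 2
        errorMerge = sum((testData.iloc[:, -1] - treeMean) ** 2)
        return treeMean if errorMerge < errorNoMerge else tree
    return tree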
scikit-learn implementation of the regression tree:
Dataset:
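The data set ex0 used below is not reproduced in text form here. Assuming it is the tab-separated ex0.txt file from Machine Learning in Action (no header, with the target in the last column), it could be loaded roughly like this; the file name and layout are assumptions:

import pandas as pd

# Assumed file: tab-separated, no header, target in the last column
ex0 = pd.read_csv('ex0.txt', sep='\t', header=None)
print(ex0.shape)
print(ex0.head())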
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn import linear_model

# Data for training (ex0 is the DataFrame loaded above: column 1 is the feature, the last column is the target)
x = (ex0.iloc[:, 1].values).reshape(-1, 1)
y = (ex0.iloc[:, -1].values).reshape(-1, 1)

# Train the models
model1 = DecisionTreeRegressor(max_depth=1)
model2 = DecisionTreeRegressor(max_depth=3)
model3 = linear_model.LinearRegression()
model1.fit(x, y)
model2.fit(x, y)
model3.fit(x, y)

# Predict
X_test = np.arange(0, 1, 0.01).reshape(-1, 1)
y_1 = model1.predict(X_test)
y_2 = model2.predict(X_test)
y_3 = model3.predict(X_test)

# Visualize the results
plt.figure()  # create the canvas
plt.scatter(x, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=1", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=3", linewidth=2)
plt.plot(X_test, y_3, color="red", label="linear regression", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
The resulting plot:
2: Model tree
Besides setting each leaf node to a constant value, another option is to make the overall model a piecewise linear function, that is, a model composed of multiple linear segments.
In the regression tree described above, each leaf node holds a single value; in the model tree discussed next, each leaf node holds a linear equation.
Hand-written model tree code:
""" Function function: test whether the input variable is of dictionary type and return the result of boolean type """ def isTree(obj): return type(obj).__name__=='dict' """ Function function: calculate characteristic matrix, label matrix and regression coefficient Parameter Description: dataSet: Raw data set return: ws: regression coefficient X: Characteristic matrix (first column added) x0=1) Y: Label matrix """ def linearSolve(dataSet): m,n = dataSet.shape con = pd.DataFrame(np.ones((m,1)))#Add a column of constant value X0=1 in the first column conX = pd.concat([con,dataSet.iloc[:,:-1]],axis=1,ignore_index=True) X = np.mat(conX) Y = np.mat(dataSet.iloc[:,-1].values).T xTx = X.T*X if np.linalg.det(xTx) == 0: raise NameError('Singular matrix cannot be inversed, please try increasing tolN,Namely ops Second value') ws = xTx.I*(X.T*Y) return ws,X,Y """Generate the leaf node of the model tree (i.e. linear equation), where the regression coefficient is returned""" def modelLeaf(dataSet): ws,X,Y = linearSolve(dataSet) return ws """Calculate the error (sum of squares of error) for a given data set""" def modelErr(dataSet): ws,X,Y = linearSolve(dataSet) yHat = X*ws err = sum(np.power(Y-yHat,2)) return err
# Build a model tree with the createTree function of the regression tree
createTree(exp2, modelLeaf, modelErr, (1, 10))
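The exp2 data set is not reproduced here. Purely as an illustration, a model tree could be built on toy piecewise-linear data like this (the toy data are an assumption, not the original exp2):

import numpy as np
import pandas as pd

# Toy piecewise-linear data standing in for exp2: one feature column, then the target column
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.where(x < 0.3, 3.5 + 1.0 * x, 12.0 * x) + rng.normal(0, 0.1, 200)
toy = pd.DataFrame(np.column_stack([x, y]))

# Each leaf of the resulting dictionary tree holds a column vector of regression
# coefficients (ws) rather than a single mean value
toyModelTree = createTree(toy, modelLeaf, modelErr, (1, 10))
print(toyModelTree)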
3: Auxiliary functions for prediction
Leaf-node prediction functions for the regression tree and the model tree:
# Prediction at a regression-tree leaf: the leaf itself is the predicted value
def regTreeEval(model, inData):
    return model

# Prediction at a model-tree leaf: apply the linear equation stored in the leaf
def modelTreeEval(model, inData):
    n = len(inData)
    X = np.mat(np.ones((1, n + 1)))
    # Add the constant term x0 = 1 as the first column
    X[:, 1:n + 1] = inData
    return X * model
Producing the predictions:
""" Function function: returns the prediction result of a single test data Parameter Description: tree: Dictionary tree inData: Single test data modelEval: Leaf node prediction function """ def treeForeCast(tree,inData,modelEval = regTreeEval): #First judge whether it is a leaf node. If it is a leaf node, the prediction result will be returned directly if not isTree(tree): return modelEval(tree,inData) #Find the left and right subtrees according to the index if inData[tree['spInd']] > tree['spVal']: #If the left subtree is not a leaf node, the leaf node is found recursively if isTree(tree['left']): return treeForeCast(tree['left'],inData,modelEval) else: return modelEval(tree['left'],inData) else: if isTree(tree['right']): return treeForeCast(tree['right'],inData,modelEval) else: return modelEval(tree['right'],inData) """ Function function: returns the prediction result of the whole test set Parameter Description: tree:Dictionary tree testData:Test set modelEval: Leaf node prediction function return: yHat:Prediction results of each data """ def createForeCast(tree, testData, modelEval = regTreeEval): m = testData.shape[0] yHat = np.mat(np.zeros((m,1))) for i in range(m): inData = testData.iloc[i,:-1].values yHat[i,0]= treeForeCast(tree,inData,modelEval) return yHat
Prediction code for the regression tree:
# Build a regression tree
regTree = createTree(biketrain, ops=(1, 20))
# Regression tree predictions
yHat = createForeCast(regTree, biketest, regTreeEval)
# Correlation coefficient between predictions and targets
np.corrcoef(yHat.T, biketest.iloc[:, -1].values)[0, 1]
# Sum of squared errors (SSE)
sum((yHat.A.flatten() - biketest.iloc[:, -1].values) ** 2)
Prediction code for the model tree:
# Build a model tree
modelTree = createTree(biketrain, modelLeaf, modelErr, ops=(1, 20))
# Model tree predictions
yHat1 = createForeCast(modelTree, biketest, modelTreeEval)
# Correlation coefficient between predictions and targets
np.corrcoef(yHat1.T, biketest.iloc[:, -1].values)[0, 1]
# Sum of squared errors (SSE)
sum((yHat1.A.flatten() - biketest.iloc[:, -1].values) ** 2)
Standard linear regression:
# Standard linear regression fitted on the training set
ws, X, Y = linearSolve(biketrain)
# Add a constant column of 1s as the first column to build the test feature matrix
testX = pd.concat([pd.DataFrame(np.ones((biketest.shape[0], 1))), biketest.iloc[:, :-1]],
                  axis=1, ignore_index=True)
testMat = np.mat(testX)
# Standard linear regression predictions
yHat_2 = testMat * ws
# Correlation coefficient between predictions and targets
R2_2 = np.corrcoef(yHat_2.T, biketest.iloc[:, -1].values)[0, 1]
# Sum of squared errors (SSE)
SSE_2 = sum((yHat_2.A.flatten() - biketest.iloc[:, -1].values) ** 2)
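One caveat worth noting: np.corrcoef returns the correlation coefficient R rather than the coefficient of determination. If R squared is actually wanted, it could be computed along these lines (a sketch reusing yHat_2 and biketest from the code above):

# Sketch: coefficient of determination R^2 for the standard linear regression predictions
y_true = biketest.iloc[:, -1].values
y_pred = yHat_2.A.flatten()
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot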