Machine learning -- tree regression

CART algorithm

What is CART?

CART stands for Classification And Regression Tree. As the name suggests, it is a versatile algorithm that can be used for both classification and regression, which makes it well worth learning.

The CART algorithm uses binary splitting, which adapts the tree-building process so that it can handle continuous variables.

The rule is simple: if the feature value is greater than the given threshold, the sample goes to the left subtree; otherwise it goes to the right subtree.

The CART algorithm has two steps:

  1. Decision tree generation: recursively build a binary decision tree from the training data set, making the generated tree as large as possible. Nodes are created top-down starting from the root; at each node the best attribute to split on is chosen so that the training samples in the child nodes are as pure as possible.
  2. Decision tree pruning: prune the generated tree with a validation data set and select the optimal subtree, using the minimum loss function as the pruning criterion.

Construction process of CART tree:

First find the best column on which to split the data set, applying a binary split each time: if the feature value is greater than the given threshold, the sample goes to the left subtree, otherwise to the right subtree. When a node can no longer be split, save it as a leaf node.

Here we introduce the regression tree and the model tree.

The difference:

Regression tree: each leaf node stores the mean of the target values in that leaf.
Model tree: each leaf node stores a linear equation.

1: Regression tree

The hand-written regression tree works on the full data set, i.e. the feature columns come first, followed by the target column.
Main functions:

binSplitDataSet(dataSet, feature, value)#Segment data sets according to features 
errType(dataSet)#Calculate the total variance: mean square deviation * number of samples
leafType(dataSet)#Generate leaf node
chooseBestSplit(dataSet, leafType=leafType, errType=errType, ops = (1,4))#Find the best binary segmentation function of data
createTree(dataSet, leafType = leafType, errType = errType, ops = (1, 4))#Tree building function 

Manually implement the regression tree code:

""" 
Function description:Segment data sets according to features 
Parameter description: 
	dataSet: Raw data set 
	feature: Feature index to be segmented 
	value: The value of the feature 
return:
	mat0: Segmented data set 0 
	mat1: Segmented data set 1 
"""
def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet.loc[dataSet.iloc[:,feature] > value,:]
    mat0.index = range(mat0.shape[0])
    mat1 = dataSet.loc[dataSet.iloc[:,feature] <= value,:]
    mat1.index = range(mat1.shape[0])
    return mat0, mat1

#Calculate the total error: sample variance of the target column * number of samples
def errType(dataSet):
    var= dataSet.iloc[:,-1].var() *dataSet.shape[0]
    return var

#Generate a leaf node: the mean of the target values in this subset
def leafType(dataSet):
    leaf = dataSet.iloc[:,-1].mean()
    return leaf
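
To make the behavior of these helpers concrete, here is a small sanity check on a hypothetical toy frame (the values are made up for illustration; the frame just follows the layout above, feature column first, target column last):

import pandas as pd

#Hypothetical toy data: one feature column followed by the target column
toy = pd.DataFrame({'x': [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],
                    'y': [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]})

left, right = binSplitDataSet(toy, 0, 0.5)   #rows with x > 0.5 vs. rows with x <= 0.5
print(left['y'].values)    #[3.  3.1 2.9]
print(right['y'].values)   #[1.  1.1 0.9]
print(errType(toy))        #sample variance of y times the number of rows
print(leafType(toy))       #2.0, the mean of the target column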

""" 
    Function description:Find the best binary segmentation function of data 
    Parameter description: 
        dataSet: Raw data set 
        leafType: Generate leaf node function 
        errType: Error estimation function 
        ops: Tuples of user-defined parameters 
    return:
        bestIndex: Optimal segmentation feature 
        bestValue: Best split value for that feature 
"""
def chooseBestSplit(dataSet, leafType=leafType, errType=errType, ops = (1,4)):
    #tolS is the minimum allowed error reduction; tolN is the minimum number of samples in each split branch
    tolS = ops[0]; tolN = ops[1]
    #If all target values are equal, stop splitting and return a leaf (checked via the set of unique values)
    if len(set(dataSet.iloc[:,-1].values)) == 1:
        return None, leafType(dataSet)
    #Number of rows m and columns n of the data set
    m, n = dataSet.shape
    #Error of the data set before any split
    S = errType(dataSet)
    #Best error so far, index of the best split feature, and the best split value
    bestS = np.inf; bestIndex = 0; bestValue = 0
    #Traverse all feature columns
    for featIndex in range(n - 1):  #The original data set is labeled, and the number of features is n-1
        colval= set(dataSet.iloc[:,featIndex].values) #Extract all values of the current cut column
        #Traverse all candidate split values
        for splitVal in colval:
            #Data sets are segmented according to features and eigenvalues
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            #If either subset has fewer than tolN samples, skip this split value
            if (mat0.shape[0] < tolN) or (mat1.shape[0] < tolN): continue
            #Calculation error estimation
            newS = errType(mat0) + errType(mat1)
            #If the error estimate is smaller, the feature index value and the feature value are updated
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    #If the error reduction is less than tolS, stop splitting and return a leaf
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    #The data set is segmented according to the best segmentation feature and eigenvalue
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    #If either resulting subset is smaller than tolN, stop splitting and return a leaf
    if (mat0.shape[0] < tolN) or (mat1.shape[0] < tolN):
        return None, leafType(dataSet)
    #Returns the best segmentation feature and feature value
    return bestIndex, bestValue
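
As a quick check of chooseBestSplit, consider a hypothetical step-shaped data set whose target jumps from 0 to 10 between x = 4 and x = 5; with ops = (1, 4) the function should pick column 0 and a split value of 4 (samples with x > 4 go to one side, the rest to the other):

import pandas as pd

#Hypothetical step data: the target jumps from 0 to 10 between x = 4 and x = 5
step = pd.DataFrame({'x': list(range(10)),
                     'y': [0]*5 + [10]*5})

print(chooseBestSplit(step, ops=(1, 4)))   #expected: (0, 4)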

""" 
    Function description: Tree building function 
    Parameter description: 
        dataSet: Raw data set 
        leafType: Establish the function of leaf node 
        errType: Error calculation function 
        ops: Tuples containing all other parameters of the tree build 
    return:
        retTree: Constructed regression tree 
"""
def createTree(dataSet, leafType = leafType, errType = errType, ops = (1, 4)):
    #Select the best segmentation feature and eigenvalue
    col, value = chooseBestSplit(dataSet, leafType, errType, ops)
    #If no split feature is found, return the leaf value
    if col == None: return value
    #Regression tree
    retTree = {}  #Store tree information
    retTree['spInd'] = col
    retTree['spVal'] = value
    #It is divided into left dataset and right dataset
    lSet, rSet = binSplitDataSet(dataSet, col, value)
    #Create left and right subtrees
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree  
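
Here is a minimal, hypothetical end-to-end check of createTree on synthetic piecewise-constant data (the data and the split location are made up; with noise this small the tree should contain a single split near x = 0.5 and two leaves close to 2 and 0):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
#Hypothetical target: roughly 0 below x = 0.5 and roughly 2 above it, plus a little noise
y = np.where(x > 0.5, 2.0, 0.0) + rng.normal(0, 0.1, 200)
train = pd.DataFrame({'x': x, 'y': y})

tree = createTree(train, ops=(1, 4))
print(tree)   #e.g. {'spInd': 0, 'spVal': ~0.5, 'left': ~2.0, 'right': ~0.0}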

sklearn implementation of the regression tree:

Data set: ex0 (not shown here; the code below uses its second column as the feature and its last column as the target).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn import linear_model

#Data for training
x = (ex0.iloc[:,1].values).reshape(-1,1)
y = (ex0.iloc[:,-1].values).reshape(-1,1)

# Training model
model1 = DecisionTreeRegressor(max_depth=1)
model2 = DecisionTreeRegressor(max_depth=3)
model3 = linear_model.LinearRegression()
model1.fit(x, y)
model2.fit(x, y)
model3.fit(x, y)

# forecast
X_test = np.arange(0, 1, 0.01).reshape(-1,1)
y_1 = model1.predict(X_test)
y_2 = model2.predict(X_test)
y_3 = model3.predict(X_test)

# Visualization results
plt.figure() #Create canvas
plt.scatter(x, y, s=20, edgecolor="black",c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",label="max_depth=1", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=3", linewidth=2)
plt.plot(X_test, y_3, color='red', label='linear regression', linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

Result: (the comparison plot produced by the code above is omitted here)

2: Model tree

Instead of simply setting each leaf node to a constant value, another option is to make the model a piecewise linear function, that is, a model composed of multiple linear segments.

In the regression tree above, each leaf node holds a single value; in the model tree discussed next, each leaf node holds a linear equation.

Manually implement the model tree code:

""" Function function: test whether the input variable is of dictionary type and return the result of boolean type """
def isTree(obj):
    return type(obj).__name__=='dict'

""" 
    Function function: calculate characteristic matrix, label matrix and regression coefficient 
    Parameter Description: 
        dataSet: Raw data set 
    return:
        ws: regression coefficient 
        X: Characteristic matrix (first column added) x0=1) 
        Y: Label matrix 
"""
def linearSolve(dataSet):
    m,n = dataSet.shape
    con = pd.DataFrame(np.ones((m,1)))#Add a column of constant value X0=1 in the first column
    conX = pd.concat([con,dataSet.iloc[:,:-1]],axis=1,ignore_index=True)
    X = np.mat(conX)
    Y = np.mat(dataSet.iloc[:,-1].values).T
    xTx = X.T*X
    if np.linalg.det(xTx) == 0:
        raise NameError('Singular matrix, cannot be inverted; try increasing tolN (the second value of ops)')
    ws = xTx.I*(X.T*Y)
    return ws,X,Y
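
A quick, hypothetical sanity check of linearSolve on data generated from y = 1 + 2x plus a little noise; the recovered coefficients should be close to an intercept of 1 and a slope of 2:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 1 + 2 * x + rng.normal(0, 0.05, 50)
toy = pd.DataFrame({'x': x, 'y': y})

ws, X, Y = linearSolve(toy)
print(ws)   #roughly [[1.0], [2.0]] (intercept, slope)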

"""Generate the leaf node of the model tree (i.e. linear equation), where the regression coefficient is returned"""
def modelLeaf(dataSet):
    ws,X,Y = linearSolve(dataSet)
    return ws

"""Calculate the error (sum of squares of error) for a given data set"""
def modelErr(dataSet):
    ws,X,Y = linearSolve(dataSet)
    yHat = X*ws
    err = sum(np.power(Y-yHat,2))
    return err

#Using the createTree function of regression tree to build model tree
createTree(exp2,modelLeaf,modelErr,(1, 10))
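
The exp2 data set is not shown here; as a hypothetical substitute, synthetic data with two linear segments illustrates what the model tree produces: a single split and a 2x1 coefficient matrix (intercept and slope) at each leaf.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
#Hypothetical piecewise-linear target: y = 3x below x = 0.5 and y = 2 - x above it, plus noise
y = np.where(x <= 0.5, 3 * x, 2 - x) + rng.normal(0, 0.02, 200)
data = pd.DataFrame({'x': x, 'y': y})

mtree = createTree(data, modelLeaf, modelErr, (1, 10))
print(mtree)   #expected: one split near x = 0.5, with a 2x1 ws matrix at each leaf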

3: Auxiliary functions for making predictions

Prediction function of regression tree and model tree:

#Regression tree leaf prediction function: the stored leaf value is the prediction
def regTreeEval(model,inData):
    return model

#Model leaf node prediction function
def modelTreeEval(model,inData):
    n = len(inData)
    X = np.mat(np.ones((1,n+1)))  #Add a column of constant term x0=1 and put it in the first column
    X[:,1:n+1] = inData
    return X*model
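
A tiny, hypothetical illustration of the two leaf evaluators: regTreeEval just returns the stored leaf value, while modelTreeEval prepends the constant 1 and multiplies by the stored coefficients:

import numpy as np

ws = np.mat([[1.0], [2.0]])        #hypothetical model-tree leaf: intercept 1, slope 2
x = np.array([0.5])                #a single test sample with one feature
print(modelTreeEval(ws, x))        #[[2.]]  ->  1 + 2 * 0.5
print(regTreeEval(37.5, x))        #37.5, the constant stored in a regression-tree leaf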

Forecast results:

"""
    Function function: returns the prediction result of a single test data 
    Parameter Description: 
        tree: Dictionary tree 
        inData: Single test data 
        modelEval: Leaf node prediction function
"""
def treeForeCast(tree,inData,modelEval = regTreeEval):
    #First judge whether it is a leaf node. If it is a leaf node, the prediction result will be returned directly
    if not isTree(tree):
        return modelEval(tree,inData)
    #Find the left and right subtrees according to the index
    if inData[tree['spInd']] > tree['spVal']: 
        #If the left subtree is not a leaf node, the leaf node is found recursively
        if isTree(tree['left']):
            return treeForeCast(tree['left'],inData,modelEval)
        else:
            return modelEval(tree['left'],inData)
    else:
        if isTree(tree['right']):
            return treeForeCast(tree['right'],inData,modelEval)
        else:
            return modelEval(tree['right'],inData)

""" 
    Function function: returns the prediction result of the whole test set 
    Parameter Description: 
        tree:Dictionary tree 
        testData:Test set 
        modelEval: Leaf node prediction function 
    return:
        yHat:Prediction results of each data 
"""
def createForeCast(tree, testData, modelEval = regTreeEval):
    m = testData.shape[0]
    yHat = np.mat(np.zeros((m,1)))
    for i in range(m):
        inData = testData.iloc[i,:-1].values
        yHat[i,0]= treeForeCast(tree,inData,modelEval)
    return yHat

Prediction code of regression tree:

#Create regression tree
regTree = createTree(biketrain,ops=(1,20))
#Regression tree prediction results
yHat = createForeCast(regTree,biketest, regTreeEval)
#Calculate the correlation coefficient R between predictions and actual values
np.corrcoef(yHat.T,biketest.iloc[:,-1].values)[0,1]
#Calculate the sum of squared errors (SSE)
sum((yHat.A.flatten()-biketest.iloc[:,-1].values)**2)

Prediction code of model tree:

#Create model tree
modelTree = createTree(biketrain, modelLeaf, modelErr, ops=(1,20))
#Model tree prediction results
yHat1 = createForeCast( modelTree, biketest, modelTreeEval)
#Calculate the correlation coefficient R between predictions and actual values
np.corrcoef(yHat1.T,biketest.iloc[:,-1].values)[0,1]
#Calculate the sum of squared errors (SSE)
sum((yHat1.A.flatten()-biketest.iloc[:,-1].values)**2)

Standard linear regression:

#Standard linear regression
ws,X,Y = linearSolve(biketrain)
#Add constant item 1 in the first column to construct the characteristic matrix
testX = pd.concat([pd.DataFrame(np.ones((biketest.shape[0],1))),biketest.iloc[:,:-1]],
                  axis=1,ignore_index = True)
testMat = np.mat(testX)
#Standard linear regression prediction results
yHat_2 = testMat*ws
#Correlation coefficient R
R2_2 = np.corrcoef(yHat_2.T,biketest.iloc[:,-1].values)[0,1]
#Sum of squared errors (SSE)
SSE_2 = sum((yHat_2.A.flatten()-biketest.iloc[:,-1].values)**2)
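
Note that np.corrcoef returns the linear correlation coefficient R, not the coefficient of determination R². If R² is what you want, one way to get it (reusing the yHat, yHat1, yHat_2 and biketest variables from the snippets above; the R2_* names here are just illustrative) is 1 - SSE/SST:

import numpy as np

y_true = biketest.iloc[:, -1].values
SST = np.sum((y_true - y_true.mean()) ** 2)                        #total sum of squares
R2_tree  = 1 - np.sum((yHat.A.flatten()   - y_true) ** 2) / SST    #regression tree
R2_model = 1 - np.sum((yHat1.A.flatten()  - y_true) ** 2) / SST    #model tree
R2_ols   = 1 - np.sum((yHat_2.A.flatten() - y_true) ** 2) / SST    #standard linear regression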
