# Machine learning -- tree regression

## CART algorithm

What is CART?

CART stands for Classification And Regression Tree. As the name suggests, the algorithm can be used for both classification and regression, which makes it well worth learning.

CART uses binary splitting: every internal node divides the data into exactly two parts. Because a split only compares one feature against a threshold, the tree-building process can handle continuous variables.

The rule is: if the feature value is greater than the given threshold, the sample goes to the left subtree; otherwise it goes to the right subtree.
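As a minimal sketch of this rule, the snippet below splits a small hand-made list of rows on one feature. The function name `bin_split` and the sample rows are made up for illustration; they are not part of the implementation shown later.

```python
def bin_split(rows, feature, value):
    """Split rows in two: feature > value goes left, everything else goes right."""
    left = [r for r in rows if r[feature] > value]
    right = [r for r in rows if r[feature] <= value]
    return left, right


# Hypothetical rows: (feature, target)
rows = [(0.1, 1.0), (0.4, 1.2), (0.6, 3.1), (0.9, 2.9)]
left, right = bin_split(rows, feature=0, value=0.5)
print(left)   # rows whose feature is greater than 0.5
print(right)  # rows whose feature is less than or equal to 0.5
```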

## CART algorithm has two steps:

1. Decision tree generation: recursively build a binary decision tree from the training data set, growing it as large as possible. Nodes are created from the root downward, and at each node the best attribute is chosen for the split so that the training samples in the child nodes are as pure as possible.
2. Decision tree pruning: prune the generated tree with the validation data set and select the optimal subtree, using minimization of the loss function as the pruning criterion.
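The pruning step can be sketched as follows. This is one simple post-pruning scheme, not the only one: two sibling leaves are merged whenever that lowers the squared error on the validation data. It assumes the tree is a nested dict with the keys `spInd`, `spVal`, `left` and `right` (the layout used by the hand-written tree in this article) and that leaves are plain numbers. All helper names here are hypothetical.

```python
def is_tree(node):
    return isinstance(node, dict)


def tree_mean(node):
    """Collapse a subtree into the average of its two (collapsed) children."""
    if not is_tree(node):
        return node
    return (tree_mean(node["left"]) + tree_mean(node["right"])) / 2.0


def sse(rows, prediction):
    """Sum of squared errors when every row is predicted as `prediction`."""
    return sum((r[-1] - prediction) ** 2 for r in rows)


def prune(node, rows):
    """Return a pruned copy of `node`, using validation `rows` (target last)."""
    if not is_tree(node):
        return node
    if not rows:  # no validation data reaches this node: collapse it
        return tree_mean(node)
    left_rows = [r for r in rows if r[node["spInd"]] > node["spVal"]]
    right_rows = [r for r in rows if r[node["spInd"]] <= node["spVal"]]
    node = dict(node)  # shallow copy so the input tree is left untouched
    node["left"] = prune(node["left"], left_rows)
    node["right"] = prune(node["right"], right_rows)
    if not is_tree(node["left"]) and not is_tree(node["right"]):
        err_split = sse(left_rows, node["left"]) + sse(right_rows, node["right"])
        merged = (node["left"] + node["right"]) / 2.0
        if sse(rows, merged) < err_split:
            return merged  # merging the two leaves fits the validation data better
    return node


tree = {"spInd": 0, "spVal": 0.5, "left": 1.0, "right": 1.2}
merged_result = prune(tree, [(0.2, 1.1), (0.8, 1.1)])  # merging helps here
kept_result = prune(tree, [(0.2, 1.2), (0.8, 1.0)])    # the split is already exact
```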

## Construction process of CART tree:

First, find the best column on which to split the data set; every split is binary. If the feature value is greater than the given threshold, the sample goes to the left subtree, otherwise to the right subtree. When a node cannot be split any further, it is saved as a leaf node.

Here we introduce regression tree and model tree.

## Difference:

Regression tree: each leaf node holds the mean of the target values of the samples that fall into it.

Model tree: each leaf node holds a linear equation.
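To make the difference concrete, here is a small sketch that computes both kinds of leaf for the same handful of samples. The data is made up (it follows y = 1 + 2x exactly), and the use of `np.linalg.lstsq` is a choice made for this illustration, not the method used in the implementation further down.

```python
import numpy as np

# Hypothetical samples that ended up in one leaf: one feature x, target y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Regression tree leaf: a single constant, the mean of the targets
reg_leaf = y.mean()

# Model tree leaf: coefficients (w0, w1) of a least-squares line y ~ w0 + w1 * x
X = np.column_stack([np.ones_like(x), x])
model_leaf, *_ = np.linalg.lstsq(X, y, rcond=None)


def predict_reg(x_new):
    # Same constant for every input
    return reg_leaf


def predict_model(x_new):
    # Evaluate the leaf's linear equation at x_new
    return model_leaf[0] + model_leaf[1] * x_new
```

The regression leaf predicts the same value for every input that reaches it, while the model leaf still varies with x inside the leaf.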

## 1: Regression tree

The data set used by the hand-written regression tree is a single table: the feature columns come first, followed by the target column as the last column.

Main functions:

```python
binSplitDataSet(dataSet, feature, value)  # Split the data set on a feature and a threshold
errType(dataSet)  # Total squared error: variance * number of samples
leafType(dataSet)  # Generate a leaf node
chooseBestSplit(dataSet, leafType=leafType, errType=errType, ops=(1, 4))  # Find the best binary split
createTree(dataSet, leafType=leafType, errType=errType, ops=(1, 4))  # Build the tree
```
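The error measure and the split criterion behind these functions can be written out as below. The notation is introduced here for illustration: $D$ is the set of samples at a node, $\bar{y}_D$ the mean of their targets, $\operatorname{Var}$ the population variance, $j$ a feature index and $s$ a threshold.

$$
S(D) = \sum_{i \in D} (y_i - \bar{y}_D)^2 = |D| \cdot \operatorname{Var}(y_D)
$$

$$
(j^*, s^*) = \arg\min_{j,\, s} \left[ S\left(D_{x_j > s}\right) + S\left(D_{x_j \le s}\right) \right]
$$

With `ops = (tolS, tolN)`, a split is kept only if it reduces the error by at least `tolS`, that is $S(D) - \left[ S(D_{x_{j^*} > s^*}) + S(D_{x_{j^*} \le s^*}) \right] \ge \text{tolS}$, and only if both halves contain at least `tolN` samples. Otherwise the node becomes a leaf.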

A manual implementation of the regression tree:

```python
import numpy as np


def binSplitDataSet(dataSet, feature, value):
    """
    Split a data set in two on one feature.

    dataSet: raw data set (DataFrame, target in the last column)
    feature: index of the feature column to split on
    value: threshold for that feature
    return:
        mat0: rows where the feature is greater than value
        mat1: rows where the feature is less than or equal to value
    """
    # Reset the index so each subset is numbered 0..len-1 again
    mat0 = dataSet.loc[dataSet.iloc[:, feature] > value, :].reset_index(drop=True)
    mat1 = dataSet.loc[dataSet.iloc[:, feature] <= value, :].reset_index(drop=True)
    return mat0, mat1

# Total squared error: variance * number of samples
def errType(dataSet):
    # ddof=0 gives the population variance, so variance * n is the sum of squared deviations
    var = dataSet.iloc[:, -1].var(ddof=0) * dataSet.shape[0]
    return var


# Generate a leaf node: the mean of the target column
def leafType(dataSet):
    leaf = dataSet.iloc[:, -1].mean()
    return leaf

def chooseBestSplit(dataSet, leafType=leafType, errType=errType, ops=(1, 4)):
    """
    Find the best binary split of the data.

    dataSet: raw data set
    leafType: function that generates a leaf node
    errType: error estimation function
    ops: tuple of user-defined parameters (tolS, tolN)
    return:
        bestIndex: index of the best feature to split on (None for a leaf)
        bestValue: best split value (or the leaf value when bestIndex is None)
    """
    # tolS: minimum error reduction required; tolN: minimum number of samples per side
    tolS = ops[0]; tolN = ops[1]
    # If all target values are equal, return a leaf
    if len(set(dataSet.iloc[:, -1].values)) == 1:
        return None, leafType(dataSet)
    # Number of rows m and columns n
    m, n = dataSet.shape
    # Error of the unsplit data set, used as the baseline
    S = errType(dataSet)
    # Best error, index of the best split feature, best split value
    bestS = np.inf; bestIndex = 0; bestValue = 0
    # Traverse all feature columns (the last column is the target, so there are n-1 features)
    for featIndex in range(n - 1):
        colval = set(dataSet.iloc[:, featIndex].values)  # Candidate split values of this column
        # Traverse all candidate values
        for splitVal in colval:
            # Split the data set on this feature and value
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            # Skip splits that leave fewer than tolN samples on either side
            if (mat0.shape[0] < tolN) or (mat1.shape[0] < tolN): continue
            # Error of the split
            newS = errType(mat0) + errType(mat1)
            # Keep the split if its error is smaller than the best so far
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    # Return a leaf if the error reduction is too small
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    # Split the data set on the best feature and value
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    # Return a leaf if either side of the split is too small
    if (mat0.shape[0] < tolN) or (mat1.shape[0] < tolN):
        return None, leafType(dataSet)
    # Return the best split feature and value
    return bestIndex, bestValue

def createTree(dataSet, leafType=leafType, errType=errType, ops=(1, 4)):
    """
    Build the tree recursively.

    dataSet: raw data set
    leafType: function that creates a leaf node
    errType: error calculation function
    ops: tuple containing the other tree-building parameters (tolS, tolN)
    return:
        retTree: the constructed tree (a nested dict), or a leaf value
    """
    # Select the best split feature and value
    col, value = chooseBestSplit(dataSet, leafType, errType, ops)
    # No usable split: value is the leaf, so return it
    if col is None: return value
    # Store the tree information
    retTree = {}
    retTree['spInd'] = col
    retTree['spVal'] = value
    # Split into the left and right data sets
    lSet, rSet = binSplitDataSet(dataSet, col, value)
    # Create the left and right subtrees
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree
```

scikit-learn implementation of the regression tree:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn import linear_model

# Training data: ex0 is the DataFrame holding the data set
# (column 1 is used as the feature, the last column as the target)
x = ex0.iloc[:, 1].values.reshape(-1, 1)
y = ex0.iloc[:, -1].values

# Train the models
model1 = DecisionTreeRegressor(max_depth=1)
model2 = DecisionTreeRegressor(max_depth=3)
model3 = linear_model.LinearRegression()
model1.fit(x, y)
model2.fit(x, y)
model3.fit(x, y)

# Predict
X_test = np.arange(0, 1, 0.01).reshape(-1, 1)
y_1 = model1.predict(X_test)
y_2 = model2.predict(X_test)
y_3 = model3.predict(X_test)

# Visualize the results
plt.figure()  # Create the canvas
plt.scatter(x, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=1", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=3", linewidth=2)
plt.plot(X_test, y_3, color="red", label="linear regression", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
```
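For readers without the data set at hand, the effect of `max_depth` can also be checked on synthetic data. The sketch below is an illustration only: the noisy sine curve and all variable names are made up here. It relies on the fact that a tree of depth d has at most 2^d leaves, so its predictions take at most 2^d distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): a noisy sine curve on [0, 1]
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, size=200)).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, size=200)

shallow = DecisionTreeRegressor(max_depth=1).fit(X, y)
deeper = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Count how many distinct constant levels each tree predicts
X_test = np.arange(0, 1, 0.01).reshape(-1, 1)
n_levels_shallow = len(np.unique(shallow.predict(X_test)))
n_levels_deeper = len(np.unique(deeper.predict(X_test)))
print(n_levels_shallow, n_levels_deeper)
```

The piecewise-constant steps seen in a plot of a regression tree correspond to these levels: one step per leaf.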

## 2: Model tree

Besides setting each leaf node to a constant value, another approach is to set each leaf node to a linear function, so that the model as a whole is piecewise linear, that is, composed of multiple linear segments.

In the regression tree described earlier, each leaf node contains a single value; in the model tree described next, each leaf node contains a linear equation.
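In symbols (notation introduced here for illustration), a model tree leaf with coefficient vector $w$ predicts

$$
\hat{y} = w_0 + w_1 x_1 + \dots + w_n x_n = x^{\top} w, \qquad x = (1, x_1, \dots, x_n)^{\top}
$$

and, for the samples in that leaf stacked into a matrix $X$ (with a leading column of ones) and a target vector $y$, the ordinary least-squares solution and its error are

$$
w = (X^{\top} X)^{-1} X^{\top} y, \qquad \text{err} = \sum_{i} \left( y_i - x_i^{\top} w \right)^2
$$

The solution exists only when $X^{\top} X$ is invertible, which is why a leaf needs enough samples.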

Manually implement the model tree code:

```python
import numpy as np
import pandas as pd


def isTree(obj):
    """Return True if the input is a dict, i.e. a subtree rather than a leaf."""
    return isinstance(obj, dict)


def linearSolve(dataSet):
    """
    Compute the feature matrix, label matrix and regression coefficients.

    dataSet: raw data set (rows are expected to be indexed 0..m-1)
    return:
        ws: regression coefficients
        X: feature matrix (with a constant column x0=1 added first)
        Y: label matrix
    """
    m, n = dataSet.shape
    con = pd.DataFrame(np.ones((m, 1)))  # Constant column x0=1, placed first
    conX = pd.concat([con, dataSet.iloc[:, :-1]], axis=1, ignore_index=True)
    # np.asmatrix keeps matrix semantics (* is the matrix product, .I the inverse);
    # the np.mat alias was removed in NumPy 2.0
    X = np.asmatrix(conX.values)
    Y = np.asmatrix(dataSet.iloc[:, -1].values).T
    xTx = X.T * X
    if np.linalg.det(xTx) == 0:
        raise NameError('Singular matrix cannot be inverted; try increasing tolN, the second value of ops')
    ws = xTx.I * (X.T * Y)
    return ws, X, Y


def modelLeaf(dataSet):
    """Generate a model tree leaf (a linear equation): return its regression coefficients."""
    ws, X, Y = linearSolve(dataSet)
    return ws


def modelErr(dataSet):
    """Compute the error (sum of squared errors) of the linear fit on a data set."""
    ws, X, Y = linearSolve(dataSet)
    yHat = X * ws
    err = float(np.sum(np.power(Y - yHat, 2)))  # Plain scalar instead of a 1x1 matrix
    return err
```
```python
# Build the model tree with the regression tree's createTree function
createTree(exp2, modelLeaf, modelErr, (1, 10))
```

## 3: Auxiliary functions for prediction

Leaf prediction functions for the regression tree and the model tree:

```python
# Prediction function for a regression tree leaf: the leaf is already the predicted value
def regTreeEval(model, inData):
    return model


# Prediction function for a model tree leaf: evaluate the leaf's linear equation
def modelTreeEval(model, inData):
    n = len(inData)
    X = np.asmatrix(np.ones((1, n + 1)))  # Constant term x0=1 in the first column
    X[:, 1:n + 1] = inData
    return float((X * model)[0, 0])  # Plain scalar instead of a 1x1 matrix
```

Prediction functions:

```python
def treeForeCast(tree, inData, modelEval=regTreeEval):
    """
    Return the prediction for a single test sample.

    tree: dictionary tree
    inData: a single test sample
    modelEval: leaf node prediction function
    """
    # If the node is already a leaf, return its prediction directly
    if not isTree(tree):
        return modelEval(tree, inData)
    # Choose the left or right subtree according to the split feature and value
    if inData[tree['spInd']] > tree['spVal']:
        # If the left child is a subtree, descend recursively; otherwise evaluate the leaf
        if isTree(tree['left']):
            return treeForeCast(tree['left'], inData, modelEval)
        else:
            return modelEval(tree['left'], inData)
    else:
        if isTree(tree['right']):
            return treeForeCast(tree['right'], inData, modelEval)
        else:
            return modelEval(tree['right'], inData)


def createForeCast(tree, testData, modelEval=regTreeEval):
    """
    Return the predictions for the whole test set.

    tree: dictionary tree
    testData: test set (target in the last column)
    modelEval: leaf node prediction function
    return:
        yHat: prediction for each sample (m x 1 matrix)
    """
    m = testData.shape[0]
    yHat = np.asmatrix(np.zeros((m, 1)))
    for i in range(m):
        inData = testData.iloc[i, :-1].values
        yHat[i, 0] = treeForeCast(tree, inData, modelEval)
    return yHat
```
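The traversal can be illustrated on two small hand-built trees, one with constant leaves and one with linear leaves. This is a simplified, pure-Python sketch: the function names and the tree contents are hypothetical, and model leaves are stored as plain lists `[w0, w1, ...]` rather than matrices.

```python
def is_tree(node):
    return isinstance(node, dict)


def reg_leaf_eval(leaf, row):
    # Regression leaf: the stored constant is the prediction
    return float(leaf)


def model_leaf_eval(leaf, row):
    # Model leaf: w0 + w1 * x1 + w2 * x2 + ...
    return leaf[0] + sum(w * x for w, x in zip(leaf[1:], row))


def predict_one(node, row, leaf_eval=reg_leaf_eval):
    # Walk down until a leaf is reached: feature > threshold goes left, otherwise right
    while is_tree(node):
        branch = "left" if row[node["spInd"]] > node["spVal"] else "right"
        node = node[branch]
    return leaf_eval(node, row)


reg_tree = {
    "spInd": 0, "spVal": 0.5,
    "left": 1.0,
    "right": {"spInd": 0, "spVal": 0.2, "left": 0.4, "right": 0.1},
}
model_tree = {"spInd": 0, "spVal": 0.5, "left": [0.0, 2.0], "right": [1.0, -1.0]}

print(predict_one(reg_tree, [0.8]))                     # left leaf of the root
print(predict_one(reg_tree, [0.3]))                     # right subtree, then its left leaf
print(predict_one(model_tree, [0.8], model_leaf_eval))  # 0.0 + 2.0 * 0.8
```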

Prediction code of regression tree:

```python
# Create the regression tree
regTree = createTree(biketrain, ops=(1, 20))
# Regression tree predictions
yHat = createForeCast(regTree, biketest, regTreeEval)
# Correlation coefficient R between predictions and targets (square it to get R2)
np.corrcoef(yHat.T, biketest.iloc[:, -1].values)[0, 1]
# Sum of squared errors (SSE)
sum((yHat.A.flatten() - biketest.iloc[:, -1].values) ** 2)
```
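The two metrics used here are easy to confuse, so a small sketch on made-up numbers may help: `np.corrcoef` returns the correlation coefficient R (its square is what is usually reported as R2 for a linear fit), and the sum of squared differences is the SSE, which is not divided by the number of samples. The arrays below are hypothetical.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

r = np.corrcoef(y_pred, y_true)[0, 1]  # correlation coefficient R
r_squared = r ** 2                     # squared correlation
sse = np.sum((y_pred - y_true) ** 2)   # sum of squared errors
mse = sse / len(y_true)                # mean squared error, for comparison

print(r, r_squared, sse, mse)
```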

Prediction code of model tree:

```python
# Create the model tree
modelTree = createTree(biketrain, modelLeaf, modelErr, ops=(1, 20))
# Model tree predictions
yHat1 = createForeCast(modelTree, biketest, modelTreeEval)
# Correlation coefficient R between predictions and targets (square it to get R2)
np.corrcoef(yHat1.T, biketest.iloc[:, -1].values)[0, 1]
# Sum of squared errors (SSE)
sum((yHat1.A.flatten() - biketest.iloc[:, -1].values) ** 2)
```

Standard linear regression:

```python
# Standard linear regression
ws, X, Y = linearSolve(biketrain)
# Add a constant column of 1s as the first column to build the feature matrix
testX = pd.concat([pd.DataFrame(np.ones((biketest.shape[0], 1))), biketest.iloc[:, :-1]],
                  axis=1, ignore_index=True)
testMat = np.asmatrix(testX.values)
# Standard linear regression predictions
yHat_2 = testMat * ws
# Correlation coefficient R between predictions and targets (square it to get R2)
R2_2 = np.corrcoef(yHat_2.T, biketest.iloc[:, -1].values)[0, 1]
# Sum of squared errors (SSE)
SSE_2 = sum((yHat_2.A.flatten() - biketest.iloc[:, -1].values) ** 2)
```

Posted on Sun, 03 Oct 2021 17:53:35 -0400 by diddy1234