Experiment 3: CART regression decision tree python implementation (two test sets) | machine learning

python implementation

Step by step

• Divide the data subset (the left subtree divides the sample set smaller than the specified value, and the right subtree divides the sample set larger than the specified value)

```import numpy as np
#The method of obtaining data subset, classification and regression is the same
#The data set is divided into two categories according to the division characteristics
def split_dataset(data_x,data_y,fea_axis,fea_value):
'''
input:data_x(ndarry):characteristic value
data_y(ndarry):Tag value
fea_axis(int):Number of features to be divided (number of columns)
fea_value(int):The feature to be divided corresponds to the eigenvalue
output:data_x[equal_Idx],data_y[equal_Idx](ndarry):Samples and labels with eigenvalues equal to (greater than or equal to) target eigenvalues
data_x[nequal_Idx],data_y[nequal_Idx](ndarry):Samples and labels with feature values not equal to (less than) the target feature
'''
if  isinstance(fea_value,int) or isinstance(fea_value,float):
#If the eigenvalue is a floating point number (continuous eigenvalue), the continuous value is discretized
equal_Idx = np.where(data_x[:,fea_axis]<=fea_value) #Find out that the eigenvalue is greater than or equal to FEA_ Sample serial number of value
nequal_Idx = np.where(data_x[:,fea_axis]>fea_value) #Find out that the eigenvalue is less than FEA_ Sample serial number of value
else:
equal_Idx = np.where(data_x[:,fea_axis]==fea_value) #Find out that the eigenvalue is equal to FEA_ Sample serial number of value
nequal_Idx = np.where(data_x[:,fea_axis]!=fea_value) #Find out that the eigenvalue is not equal to FEA_ Sample serial number of value
return data_x[equal_Idx],data_y[equal_Idx],data_x[nequal_Idx],data_y[nequal_Idx]
```
• Mean calculation of leaf nodes

```import numpy as np
#Leaf node mean calculation
def reg_leaf(data_y):
'''
input:data_y(array):Tag value
output:(float)mean value
'''
return np.mean(data_y)
```
• Calculate the total variance of the data set

```#Calculate the total variance of the data set
def reg_err(data_y):
'''
input:data_y(array):Tag value
output:(float):Total variance
'''
return np.var(data_y)*len(data_y)
```
• Select partition features and corresponding eigenvalues

```def classify_get_best_fea(data_x,data_y,ops=(1,4)):
'''
input:data_x(ndarry):characteristic value
data_y(array):Tag value
ops(tuple):The first number is the minimum precision of decision tree stopping division, and the second number is the minimum number of decision tree stopping division
output:best_fea_idx(int):Subscript of the best partition feature
best_fea_val(float):It is best to divide the eigenvalues corresponding to the features
'''
m,n = np.shape(data_x)
final_s = ops[0] #Stop precision
final_n = ops[1] #Minimum number of samples to stop
#When there is only one kind of sample, the leaf node and its corresponding mean are output
if len(np.unique(data_y))==1:
return None,reg_leaf(data_y)

#Obtaining optimal features and eigenvalues
total_err = reg_err(data_y)  #Total error
best_err = np.inf
best_fea_idx = 0
best_fea_val = 0

for i in range(n):
feas = np.unique(data_x[:,i])
for fea_val in feas:
data_D1_x,data_D1_y,data_D2_x,data_D2_y = split_dataset(data_x,data_y,i,fea_val)
#If the minimum partition set is not satisfied, no calculation will be performed
if data_D1_x.shape[0]<final_n or data_D2_x.shape[0]<final_n:
continue
con_err = reg_err(data_D1_y)+reg_err(data_D2_y)
if con_err<best_err:
best_err = con_err
best_fea_idx = i
best_fea_val = fea_val

#Pre pruning, the error of the solution is less than the minimum error, and continue to divide
if total_err-best_err<final_s:
return None,reg_leaf(data_y)

#It has been unable to be divided and processed here
data_D1_x,data_D1_y,data_D2_x,data_D2_y = split_dataset(data_x,data_y,best_fea_idx,best_fea_val)
if data_D1_x.shape[0]<final_n or data_D2_x.shape[0]<final_n:
return None,reg_leaf(data_y)

return best_fea_idx,best_fea_val
```
• Generation of CART regression tree (recursive generation)

```def reg_create_tree(data_x,data_y,ops=(1,4)):
fea_idx,fea_val = classify_get_best_fea(data_x,data_y,ops)
if fea_idx == None:
return fea_val

#Recursive establishment of CART regression decision tree
my_tree = {}
my_tree['fea_idx'] = fea_idx
my_tree['fea_val'] = fea_val
data_D1_x,data_D1_y,data_D2_x,data_D2_y = split_dataset(data_x,data_y,fea_idx,fea_val)

my_tree['left'] = reg_create_tree(data_D1_x,data_D1_y,ops)
my_tree['right'] = reg_create_tree(data_D2_x,data_D2_y,ops)

return my_tree
```

The dictionary format is: {'fea_idx': 'current best divided feature subscript', 'fea_val': 'corresponding feature value', 'left': {...}, "right": {...}}.

• Prediction function

```#Test operation
import re
#Predict a test data result
def classify(inputTree,testdata):
'''
input:inputTree(dict):CART Classification decision tree
xlabel(list):Feature attribute list
testdata(darry):Characteristic value of a test data
output:classLabel(int):Test data prediction results
'''
first_fea_idx = inputTree[list(inputTree.keys())[0]]  #Corresponding characteristic subscript
fea_val = inputTree[list(inputTree.keys())[1]]  #Segmentation value of corresponding feature

classLabel = 0.0 #Define the variable classLabel. The default value is 0

if testdata[first_fea_idx]>=fea_val:  #Enter right subtree
if type(inputTree['right']).__name__ == 'dict':
classLabel = classify(inputTree['right'],testdata)
else:
classLabel = inputTree['right']
else:  #Enter left subtree
if type(inputTree['left']).__name__ == 'dict':
classLabel = classify(inputTree['left'],testdata)
else:
classLabel = inputTree['left']

return round(classLabel,2)

#Predict all test data results
def classifytest(inputTree, testDataSet):
'''
input:inputTree(dict):Trained decision tree
xlabel(list):Eigenvalue label list
testDataSet(ndarray):Test data set
output:classLabelAll(list):Test set prediction result list
'''
classLabelAll = []#Create an empty list
for testVec in testDataSet:#Traverse each data
classLabelAll.append(classify(inputTree, testVec))#Add the feature label obtained from each data to the list
return  np.array(classLabelAll)
```

Source code (all)

```import numpy as np
import re
#The method of obtaining data subset, classification and regression is the same
#The data set is divided into two categories according to the division characteristics
def split_dataset(data_x,data_y,fea_axis,fea_value):
'''
input:data_x(ndarry):characteristic value
data_y(ndarry):Tag value
fea_axis(int):Number of features to be divided (number of columns)
fea_value(int):The feature to be divided corresponds to the eigenvalue
output:data_x[equal_Idx],data_y[equal_Idx](ndarry):Samples and labels with eigenvalues equal to (greater than or equal to) target eigenvalues
data_x[nequal_Idx],data_y[nequal_Idx](ndarry):Samples and labels with feature values not equal to (less than) the target feature
'''
if  isinstance(fea_value,int) or isinstance(fea_value,float):
#If the eigenvalue is a floating point number (continuous eigenvalue), the continuous value is discretized
equal_Idx = np.where(data_x[:,fea_axis]<=fea_value) #Find out that the eigenvalue is greater than or equal to FEA_ Sample serial number of value
nequal_Idx = np.where(data_x[:,fea_axis]>fea_value) #Find out that the eigenvalue is less than FEA_ Sample serial number of value
else:
equal_Idx = np.where(data_x[:,fea_axis]==fea_value) #Find out that the eigenvalue is equal to FEA_ Sample serial number of value
nequal_Idx = np.where(data_x[:,fea_axis]!=fea_value) #Find out that the eigenvalue is not equal to FEA_ Sample serial number of value
return data_x[equal_Idx],data_y[equal_Idx],data_x[nequal_Idx],data_y[nequal_Idx]

#Leaf node mean calculation
def reg_leaf(data_y):
'''
input:data_y(array):Tag value
output:(float)mean value
'''
return np.mean(data_y)

#Calculate the total variance of the data set
def reg_err(data_y):
'''
input:data_y(array):Tag value
output:(float):Total variance
'''
return np.var(data_y)*len(data_y)

def classify_get_best_fea(data_x,data_y,ops=(1,4)):
'''
input:data_x(ndarry):characteristic value
data_y(array):Tag value
ops(tuple):The first number is the minimum precision of decision tree stopping division, and the second number is the minimum number of decision tree stopping division
output:best_fea_idx(int):Subscript of the best partition feature
best_fea_val(float):It is best to divide the eigenvalues corresponding to the features
'''
m,n = np.shape(data_x)
final_s = ops[0] #Stop precision
final_n = ops[1] #Minimum number of samples to stop
#When there is only one kind of sample, the leaf node and its corresponding mean are output
if len(np.unique(data_y))==1:
return None,reg_leaf(data_y)

#Obtaining optimal features and eigenvalues
total_err = reg_err(data_y)  #Total error
best_err = np.inf
best_fea_idx = 0
best_fea_val = 0

for i in range(n):
feas = np.unique(data_x[:,i])
for fea_val in feas:
data_D1_x,data_D1_y,data_D2_x,data_D2_y = split_dataset(data_x,data_y,i,fea_val)
#If the minimum partition set is not satisfied, no calculation will be performed
if data_D1_x.shape[0]<final_n or data_D2_x.shape[0]<final_n:
continue
con_err = reg_err(data_D1_y)+reg_err(data_D2_y)
if con_err<best_err:
best_err = con_err
best_fea_idx = i
best_fea_val = fea_val

#Pre pruning, the error of the solution is less than the minimum error, and continue to divide
if total_err-best_err<final_s:
return None,reg_leaf(data_y)

#It has been unable to be divided and processed here
data_D1_x,data_D1_y,data_D2_x,data_D2_y = split_dataset(data_x,data_y,best_fea_idx,best_fea_val)
if data_D1_x.shape[0]<final_n or data_D2_x.shape[0]<final_n:
return None,reg_leaf(data_y)

return best_fea_idx,best_fea_val

def reg_create_tree(data_x,data_y,ops=(1,4)):
'''
input:data_x(ndarry):characteristic value
data_y(array):Tag value
ops(tuple):The first number is the minimum precision of decision tree stopping division, and the second number is the minimum number of decision tree stopping division
output:my_tree(dict):Generated CART Decision tree dictionary
'''
fea_idx,fea_val = classify_get_best_fea(data_x,data_y,ops)
if fea_idx == None:
return fea_val

#Recursive establishment of CART regression decision tree
my_tree = {}
my_tree['fea_idx'] = fea_idx
my_tree['fea_val'] = fea_val
data_D1_x,data_D1_y,data_D2_x,data_D2_y = split_dataset(data_x,data_y,fea_idx,fea_val)

my_tree['left'] = reg_create_tree(data_D1_x,data_D1_y,ops)
my_tree['right'] = reg_create_tree(data_D2_x,data_D2_y,ops)

return my_tree

#Predict a test data result
def classify(inputTree,testdata):
'''
input:inputTree(dict):CART Classification decision tree
xlabel(list):Feature attribute list
testdata(darry):Characteristic value of a test data
output:classLabel(int):Test data prediction results
'''
first_fea_idx = inputTree[list(inputTree.keys())[0]]  #Corresponding characteristic subscript
fea_val = inputTree[list(inputTree.keys())[1]]  #Segmentation value of corresponding feature

classLabel = 0.0 #Define the variable classLabel. The default value is 0

if testdata[first_fea_idx]>=fea_val:  #Enter right subtree
if type(inputTree['right']).__name__ == 'dict':
classLabel = classify(inputTree['right'],testdata)
else:
classLabel = inputTree['right']
else:
if type(inputTree['left']).__name__ == 'dict':
classLabel = classify(inputTree['left'],testdata)
else:
classLabel = inputTree['left']

return round(classLabel,2)

#Predict all test data results
def classifytest(inputTree, testDataSet):
'''
input:inputTree(dict):Trained decision tree
xlabel(list):Eigenvalue label list
testDataSet(ndarray):Test data set
output:classLabelAll(list):Test set prediction result list
'''
classLabelAll = []#Create an empty list
for testVec in testDataSet:#Traverse each data
classLabelAll.append(classify(inputTree, testVec))#Add the feature label obtained from each data to the list
return  np.array(classLabelAll)
```

Test set 1 (Boston house price data set)

There are thirteen attributes in total, as shown in the following figure:

Import dataset and train CART regression tree:

```#Boston house price data set

data = boston.data
target = boston.target

# X = data[:200,:]
# y = target[:200]

x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 666)
#Generate CART regression tree
cartTree = reg_create_tree(x_train,y_train)
print(cartTree)
```

The CART tree obtained from training is shown in the following figure:

The predicted results are shown in the figure below:

```classlist=classifytest(cartTree,x_test)
print('Forecast data',classlist)
print('Real data'.y_test)
print("The average error is:",abs(np.sum(classlist)-np.sum(y_test))/len(y_test))
```

Test set 2 (diabetes dataset)

There are ten attributes in total, as shown in the following figure:

Import dataset and train CART regression tree:

```#Diabetes dataset

data = diabetes.data
target = diabetes.target

X = data[:500,:]
y = target[:500]

x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 666)
#Generate CART regression tree
cartTree = reg_create_tree(x_train,y_train)
print(cartTree)
```

The CART tree obtained from training is shown in the figure below (500 pieces of data):

The predicted results are shown in the figure below:

```classlist=classifytest(cartTree,x_test)
print('Forecast data',classlist)
print('Real data',y_test)
print("The average error is:",abs(np.sum(classlist)-np.sum(y_test))/len(y_test))
```

summary

(1) The label values of data sets predicted by CART regression tree are continuous. Of course, discrete label values can also be used for classification, but it is not necessary. For each feature, the corresponding eigenvalue can be continuous or discrete, and the order of magnitude between different features can also be different, because each feature is independent of each other during division. (for neural networks, normalization is required)

(2) The average error (the average of the sum of errors) and the mean square error (the average of the sum of squares of errors) can be used to evaluate the accuracy of the model.

(3) The calculation method is different from the Gini coefficient of CART classification tree. CART regression tree uses the method of mean square deviation.

Tags: Machine Learning

Posted on Sun, 21 Nov 2021 16:30:04 -0500 by mator