# The decision tree picks out the good watermelon

## 1. Decision tree

### 1.1 Introduction

A decision tree is a decision-analysis method. Given the known probabilities of various outcomes, it builds a tree to compute the probability that the expected net present value is greater than or equal to zero, and uses that to evaluate project risk and judge feasibility. It is an intuitive, graphical method of probability analysis.

The main algorithms are ID3, C4.5, and CART.

The decision tree contains three types of nodes:

• Decision node: usually drawn as a rectangle
• Chance node: usually drawn as a circle
• Terminal node: usually drawn as a triangle

### 1.2 Construction process

1. Construction strategy
As the tree grows, the entropy at the nodes should drop quickly: the faster the entropy decreases, the shorter the resulting decision tree.

2. Basic process
The basic process of building a decision tree follows the divide-and-conquer strategy.

3. Termination conditions
The recursion ends when any of three conditions holds:

• All samples belong to the same class
• The attribute set is empty, or all samples take the same value on every remaining attribute
• The dataset contains no samples for a given attribute value
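The divide-and-conquer process and the three termination conditions can be sketched as follows. This is a minimal illustration with made-up names; the attribute choice is a placeholder, whereas ID3 (implemented in Section 2) would pick the attribute with maximum information gain.

```python
from collections import Counter

def tree_generate(rows, attrs):
    """rows: list of samples whose last element is the class label;
    attrs: list of column indices still available for splitting."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:          # condition 1: all samples share one class
        return labels[0]
    if not attrs:                      # condition 2: no attributes left to split on
        return Counter(labels).most_common(1)[0][0]
    a = attrs[0]                       # placeholder choice; ID3 would pick the max-gain attribute
    tree = {a: {}}
    for v in set(r[a] for r in rows):  # condition 3: only observed values get branches
        subset = [r for r in rows if r[a] == v]
        tree[a][v] = tree_generate(subset, attrs[1:])
    return tree

print(tree_generate([[0, 'yes'], [1, 'no'], [0, 'yes']], [0]))
```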

### 1.3 Theoretical basis

Purity:
For a branch node, if all the samples it contains belong to the same class, its purity is 1.
Higher purity is better: as many samples as possible should belong to the same class.

Information entropy measures the impurity of a sample set $D$:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

Information gain measures how much splitting $D$ on attribute $a$ (with possible values $a^1,\dots,a^V$) reduces the entropy:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

Choosing the optimal attribute by information gain has a drawback: the information-gain criterion prefers attributes with many possible values. C4.5 therefore uses the gain ratio

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

where $\mathrm{IV}(a)$ is called the intrinsic value of attribute $a$. The more possible values attribute $a$ has (that is, the larger $V$), the larger $\mathrm{IV}(a)$ tends to be, which offsets the preference for many-valued attributes to some extent.
In fact, the gain-ratio criterion in turn prefers attributes with few values, so the C4.5 algorithm does not apply it directly: it first picks, from the candidate partition attributes, those whose information gain is above average, and then selects among them the attribute with the highest gain ratio.
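As a quick numeric check of the entropy formula, assume (as in the textbook's watermelon dataset) 8 positive and 9 negative samples out of 17:

```python
from math import log2

# class proportions: 8 good melons, 9 bad melons out of 17
p = [8/17, 9/17]
ent = -sum(pk * log2(pk) for pk in p)
print(round(ent, 3))  # 0.998
```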

## 2. ID3 decision tree implementation

### 2.1 Data

### 2.2 Steps

• Calculate the initial information entropy
• Calculate information gain
• Sort by information gain
• The feature with the maximum information gain is selected as the partition node
• Delete that feature from the feature list, then return to the previous step and repeat until the feature list is empty

### 2.3 Establishing the decision tree

Import packages

```
import pandas as pd
import numpy as np
from collections import Counter
from math import log2
```

Data acquisition and processing

```
#Data acquisition and processing
def getData(filePath):
    data = pd.read_excel(filePath)
    return data
```

Get property name

```
#Get attribute names (skip the index column and the class column)
def getLabels(data):
    labels = list(data.columns)[1:-1]
    return labels
```

Get category tag

```
#Get category tags
def targetClass(dataSet):
    classification = set([element[-1] for element in dataSet])
    return classification
```

Leaf node marker

```
#Mark the branch node as a leaf node: choose the class with the most samples as the class label
def majorityRule(dataSet):
    mostKind = Counter([element[-1] for element in dataSet]).most_common(1)
    majorityKind = mostKind[0][0]  #most_common(1) returns [(label, count)]
    return majorityKind
```
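For reference, `Counter.most_common(1)` returns a list of `(value, count)` pairs, so the class label itself is the first element of the first pair:

```python
from collections import Counter

most = Counter(['yes', 'yes', 'no']).most_common(1)
print(most)        # [('yes', 2)]
print(most[0][0])  # yes
```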

Calculating information entropy

```
#Calculating information entropy
def infoEntropy(dataSet):
    classColumnCnt = Counter([element[-1] for element in dataSet])
    Ent = 0
    for symbol in classColumnCnt:
        p_k = classColumnCnt[symbol]/len(dataSet)
        Ent = Ent - p_k*log2(p_k)
    return Ent
```

Sub dataset construction

```
#Sub-dataset construction: keep rows where column iColumn equals value, and drop that column
def makeAttributeData(dataSet,value,iColumn):
    attributeData = []
    for element in dataSet:
        if element[iColumn]==value:
            row = element[:iColumn]
            row.extend(element[iColumn+1:])
            attributeData.append(row)
    return attributeData
```

Calculate information gain

```
#Calculate information gain
def infoGain(dataSet,iColumn):
    Ent = infoEntropy(dataSet)
    tempGain = 0.0
    attribute = set([element[iColumn] for element in dataSet])
    for value in attribute:
        attributeData = makeAttributeData(dataSet,value,iColumn)
        tempGain = tempGain + len(attributeData)/len(dataSet)*infoEntropy(attributeData)
    Gain = Ent - tempGain
    return Gain
```
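To see the gain computation in isolation, here is a self-contained toy example (hypothetical two-column data: one attribute plus the class label):

```python
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c/len(rows) * log2(c/len(rows)) for c in counts.values())

# toy dataset: [texture, label]
data = [['clear', 'yes'], ['clear', 'yes'], ['fuzzy', 'no'], ['fuzzy', 'no']]
base = entropy(data)  # 1.0 for a 2-vs-2 class split
subsets = [[r for r in data if r[0] == v] for v in {'clear', 'fuzzy'}]
gain = base - sum(len(s)/len(data) * entropy(s) for s in subsets)
print(gain)  # 1.0: texture separates the classes perfectly
```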

Select optimal attribute

```
#Select the optimal attribute
def selectOptimalAttribute(dataSet,labels):
    bestGain = 0
    sequence = 0
    for iColumn in range(0,len(labels)):  #the class column is not in labels
        Gain = infoGain(dataSet,iColumn)
        if Gain>bestGain:
            bestGain = Gain
            sequence = iColumn
        print(labels[iColumn],Gain)
    return sequence
```

Establish decision tree

```
#Establish the decision tree
def createTree(dataSet,labels):
    classification = targetClass(dataSet)  #Get the set of class labels (deduplicated)
    if len(classification) == 1:
        return list(classification)[0]
    if len(labels) == 1:
        return majorityRule(dataSet)  #Return the class with the most samples
    sequence = selectOptimalAttribute(dataSet,labels)
    optimalAttribute = labels[sequence]
    del(labels[sequence])
    myTree = {optimalAttribute:{}}
    attribute = set([element[sequence] for element in dataSet])
    for value in attribute:
        subLabels = labels[:]
        myTree[optimalAttribute][value] = \
            createTree(makeAttributeData(dataSet,value,sequence),subLabels)
    return myTree
```

Main function

```
def main():
    filePath = 'watermalon.xls'
    data = getData(filePath)
    labels = getLabels(data)
    dataList = np.array(data).tolist()
    dataSet = [element[1:] for element in dataList]  #drop the index column
    myTree = createTree(dataSet,labels)
    return myTree
```

Generate and print the tree

```
if __name__ == '__main__':
    myTree = main()
    print(myTree)
```

Dictionary structure of the tree:

### 2.4 Drawing the decision tree

```
#Drawing the decision tree using Matplotlib
import matplotlib.pyplot as plt

#Format of text boxes and arrows
decisionNode = dict(boxstyle = "sawtooth", fc = "0.8")
leafNode = dict(boxstyle = "round4", fc = "0.8")
arrow_args = dict(arrowstyle = "<-")
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = 'sans-serif'

#Draw a node
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy = parentPt,\
        xycoords = "axes fraction", xytext = centerPt, textcoords = 'axes fraction',\
        va = "center", ha = "center", bbox = nodeType, arrowprops = arrow_args)

#Get the number of leaf nodes of the decision tree
def getNumLeafs(myTree):
    leafNumber = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            leafNumber = leafNumber + getNumLeafs(secondDict[key])
        else:
            leafNumber += 1
    return leafNumber

#Get the depth of the decision tree (recursive)
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        #test whether the child is a dict; if not, it is a leaf node
        if type(secondDict[key]).__name__=='dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

#Label the edge between parent and child with the attribute value
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

#Draw the tree; the first key tells you which feature was split on
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]  #the text label for this node
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))  #recurse into the subtree
        else:  #it's a leaf node: draw the leaf
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

#Canvas initialization
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  #no ticks
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0
    plotTree(inTree, (0.5,1.0), '')
    plt.show()
```
```
def main():
    print(getTreeDepth(myTree))  #depth of the tree
    print(getNumLeafs(myTree))   #number of leaves
    createPlot(myTree)

main()
```

### 2.5 Result analysis

First compare against the standard decision tree given in the textbook Machine Learning. Our result is basically consistent with the book's, but there is one slight difference: we are missing a leaf for the colour "light white". Inspecting the sample data and the code reveals the reason: the original data contains no melon that is clear in texture, slightly curled at the root, and light white in colour, so that leaf is absent from our decision tree.

## 3. Implementing a decision tree with sklearn

### 3.1 Building the tree with the information-gain criterion

Import related libraries

```#Import related libraries
import pandas as pd
import graphviz
from sklearn.model_selection import train_test_split
from sklearn import tree
```

Import data

```
data = pd.read_csv('watermalon.csv', encoding='utf-8')
x = data[["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]].copy()
y = data['Good melon'].copy()
print(data)
```

Data conversion

```
#Numeric feature values
x = x.copy()
for i in ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]:
    for j in range(len(x)):
        if(x[i][j] == "dark green" or x[i][j] == "Curl up" or x[i][j] == "Turbid sound" \
           or x[i][j] == "clear" or x[i][j] == "sunken" or x[i][j] == "Hard slip"):
            x[i][j] = 1
        elif(x[i][j] == "Black" or x[i][j] == "Slightly curled" or x[i][j] == "Dull" \
             or x[i][j] == "Slightly paste" or x[i][j] == "Slightly concave" or x[i][j] == "Soft sticky"):
            x[i][j] = 2
        else:
            x[i][j] = 3

y = y.copy()
for i in range(len(y)):
    if(y[i] == "yes"):
        y[i] = 1
    else:
        y[i] = -1
#x and y must be converted to integer DataFrames, otherwise fitting reports a format error
x = pd.DataFrame(x).astype(int)
y = pd.DataFrame(y).astype(int)
print(x)
print(y)
```

Modeling and training

```
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print(x_train)
```

```
#Decision tree learning
clf = tree.DecisionTreeClassifier(criterion="entropy")  #instantiation
clf = clf.fit(x_train, y_train)
score = clf.score(x_test, y_test)
print(score)
```

Draw the decision tree

```
#Add the Graphviz install directory to PATH
import os
os.environ["PATH"] += os.pathsep + r'C:\Program Files\Graphviz\bin'

feature_name = ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]
dot_data = tree.export_graphviz(clf, feature_names=feature_name, class_names=["Good melon","Bad melon"], filled=True, rounded=True, out_file=None)
graph = graphviz.Source(dot_data)
graph
```

### 3.2 CART algorithm implementation

You only need to change the value of the parameter criterion of the DecisionTreeClassifier function to gini:
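For reference, the Gini impurity that the "gini" criterion minimizes can be computed by hand. This is a standalone sketch; the 8-vs-9 class split mirrors the watermelon data:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c/n)**2 for c in Counter(labels).values())

print(round(gini(['yes']*8 + ['no']*9), 3))  # 0.498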

```
clf = tree.DecisionTreeClassifier(criterion="gini")  #instantiation
clf = clf.fit(x_train, y_train)
score = clf.score(x_test, y_test)
print(score)
```

Draw the decision tree

```
#Add the Graphviz install directory to PATH
import os
os.environ["PATH"] += os.pathsep + r'C:\Program Files\Graphviz\bin'

feature_name = ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]
dot_data = tree.export_graphviz(clf ,feature_names= feature_name,class_names=["Good melon","Bad melon"],filled=True,rounded=True,out_file =None)
graph = graphviz.Source(dot_data)
graph
```

## 4. Summary

When the sklearn library is used, the feature values and labels must be converted to numeric type before the model can be fitted.
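One convenient way to do that conversion, as an alternative to the manual mapping in Section 3.1, is `pandas.factorize`, which assigns an integer code to each distinct category in order of appearance:

```python
import pandas as pd

texture = pd.Series(['clear', 'Slightly paste', 'clear'])
codes, uniques = pd.factorize(texture)
print(list(codes))    # [0, 1, 0]
print(list(uniques))  # ['clear', 'Slightly paste']
```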

## 5. References

Posted on Sun, 31 Oct 2021 08:18:47 -0400 by iknownothing