Using a decision tree to pick out good watermelons

1, Decision tree


A decision tree is a decision-analysis method that, given the known probabilities of various outcomes, builds a tree to compute the probability that the expected net present value is greater than or equal to zero, and uses that to evaluate project risk and judge feasibility. It is an intuitive, graphical method of probability analysis

The main algorithms are ID3, C4.5 and CART

The decision tree contains three types of nodes:

  • Decision node: usually drawn as a rectangle
  • Chance node: usually drawn as a circle
  • Terminal node: usually drawn as a triangle

1.2 decision tree construction process

  1. Construction strategy
    As the tree grows, the entropy at the nodes should fall rapidly. The faster the entropy drops, the better, since this yields the shortest possible decision tree

  2. Decision tree
    The basic process of building a decision tree follows a divide-and-conquer strategy

  3. Pseudo code

  4. Three conditions for ending the recursion

    All samples belong to the same class
    The attribute set is empty, or all samples take the same value on every attribute
    The dataset contains no samples for a given attribute value
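The three stopping conditions can be sketched as a recursive skeleton (a sketch only, not the full implementation given later in section 2.3; `tree_generate` is an illustrative name):

```python
from collections import Counter

def tree_generate(dataset, attributes):
    """Sketch of the recursion; each row is [attribute values..., class label]."""
    classes = set(row[-1] for row in dataset)
    # Condition 1: all samples belong to the same class -> return that class
    if len(classes) == 1:
        return classes.pop()
    # Condition 2: attribute set is empty, or all samples agree on every
    # attribute -> return the majority class
    if not attributes or all(row[:-1] == dataset[0][:-1] for row in dataset):
        return Counter(row[-1] for row in dataset).most_common(1)[0][0]
    # Otherwise: split on the best attribute and recurse on each branch; a
    # branch with no samples (Condition 3) becomes a leaf labelled with the
    # parent's majority class
    ...
```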

1.3 theoretical basis

For a branch node, if all the samples it contains belong to the same class, its purity is 1
The higher the purity, the better: ideally as many samples as possible belong to the same class

Information entropy: Ent(D) = -Σ_k p_k log2(p_k), where p_k is the proportion of class-k samples in D

Information gain: Gain(D, a) = Ent(D) - Σ_{v=1..V} (|D^v|/|D|) Ent(D^v), where D^v is the subset of D whose value on attribute a is the v-th value

The optimal attribute selection principle based on information gain - the information gain criterion has a preference for attributes with many possible values

Gain_ratio(D, a) = Gain(D, a) / IV(a), where IV(a) = -Σ_{v=1..V} (|D^v|/|D|) log2(|D^v|/|D|) is called the intrinsic value of attribute a. The more possible values attribute a has (that is, the larger V), the larger IV(a) tends to be. This offsets, to some extent, the preference for attributes with many values.
   In fact, the gain-ratio criterion prefers attributes with a small number of values, so the C4.5 algorithm does not apply it directly: it first picks out, from the candidate attributes, those whose information gain is above average, and then selects the one with the highest gain ratio among them.
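To make the intrinsic value concrete, a small standalone calculation (the branch sizes are made up for illustration):

```python
from math import log2

def intrinsic_value(branch_sizes):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    total = sum(branch_sizes)
    return -sum(n / total * log2(n / total) for n in branch_sizes)

# Two attributes splitting the same 12 samples:
iv_coarse = intrinsic_value([6, 6])              # 2 values -> IV = 1.0
iv_fine = intrinsic_value([2, 2, 2, 2, 2, 2])    # 6 values -> IV = log2(6)
# Dividing the same Gain(D, a) by the larger IV penalises the
# many-valued attribute, exactly as the text above describes.
```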

2, ID3 decision tree implementation

2.1 data

2.2 steps

  • Calculate the initial information entropy
  • Calculate information gain
  • Sort by information gain
  • The feature with the maximum information gain is selected as the partition node
  • The feature is removed from the feature list; return to the previous step and repeat until the feature list is empty

2.3 establishment of decision tree

Import package

import pandas as pd
import numpy as np
from collections import Counter
from math import log2

Data acquisition and processing

#Data acquisition and processing
def getData(filePath):
    data = pd.read_excel(filePath)
    return data

def dataDeal(data):
    dataList = np.array(data).tolist()
    dataSet = [element[1:] for element in dataList]
    return dataSet

Get property name

#Get property name
def getLabels(data):
    labels = list(data.columns)[1:-1]
    return labels

Get category tag

#Get category tag
def targetClass(dataSet):
    classification = set([element[-1] for element in dataSet])
    return classification

Leaf node marker

#Mark the branch node as the leaf node, and select the class with the largest number of samples as the class mark
def majorityRule(dataSet):
    mostKind = Counter([element[-1] for element in dataSet]).most_common(1)
    majorityKind = mostKind[0][0]
    return majorityKind

Calculating information entropy

#Calculating information entropy
def infoEntropy(dataSet):
    classColumnCnt = Counter([element[-1] for element in dataSet])
    Ent = 0
    for symbol in classColumnCnt:
        p_k = classColumnCnt[symbol]/len(dataSet)
        Ent = Ent-p_k*log2(p_k)
    return Ent
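As a quick check of this function: the classic 17-sample watermelon dataset has 8 good and 9 bad melons, so the root entropy should come out near 0.998 (a standalone recomputation from class counts alone; the 8/9 split is the standard dataset's, assumed to match the one used here):

```python
from math import log2

# Root entropy for the assumed class split: 8 positive, 9 negative (17 samples)
p_pos, p_neg = 8 / 17, 9 / 17
ent = -(p_pos * log2(p_pos) + p_neg * log2(p_neg))
# ent is approximately 0.998
```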

Sub dataset construction

#Sub dataset construction
def makeAttributeData(dataSet,value,iColumn):
    attributeData = []
    for element in dataSet:
        if element[iColumn]==value:
            row = element[:iColumn]
            row.extend(element[iColumn+1:])#Drop the column being split on
            attributeData.append(row)
    return attributeData

Calculate information gain

#Calculate information gain
def infoGain(dataSet,iColumn):
    Ent = infoEntropy(dataSet)
    tempGain = 0.0
    attribute = set([element[iColumn] for element in dataSet])
    for value in attribute:
        attributeData = makeAttributeData(dataSet,value,iColumn)
        tempGain = tempGain+len(attributeData)/len(dataSet)*infoEntropy(attributeData)
    Gain = Ent-tempGain
    return Gain

Select optimal attribute

#Select optimal attribute                
def selectOptimalAttribute(dataSet,labels):
    bestGain = 0
    sequence = 0
    for iColumn in range(0,len(labels)):#labels already excludes the class column
        Gain = infoGain(dataSet,iColumn)
        if Gain>bestGain:
            bestGain = Gain
            sequence = iColumn
    return sequence

Establish decision tree

#Establish decision tree
def createTree(dataSet,labels):
    classification = targetClass(dataSet) #Get the set of class labels (deduplicated)
    if len(classification) == 1:
        return list(classification)[0]
    if len(labels) == 1:
        return majorityRule(dataSet)#Return the majority class
    sequence = selectOptimalAttribute(dataSet,labels)
    optimalAttribute = labels[sequence]
    myTree = {optimalAttribute:{}}
    attribute = set([element[sequence] for element in dataSet])
    for value in attribute:
        subLabels = labels[:sequence]+labels[sequence+1:]#Drop the attribute just used
        myTree[optimalAttribute][value] = \
            createTree(makeAttributeData(dataSet,value,sequence),subLabels)
    return myTree
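The tree is returned as a nested dict: attribute names map to value branches, which map to sub-trees or class labels. The post never shows how to classify a new sample with this structure; a minimal lookup sketch (`classify` and the toy tree below are my own illustration, not part of the original code):

```python
def classify(tree, labels, sample):
    """Walk the nested-dict tree: dict keys alternate between attribute
    names and attribute values until a class label (a non-dict) is reached."""
    if not isinstance(tree, dict):
        return tree  # reached a leaf: it is the class label
    attribute = next(iter(tree))
    value = sample[labels.index(attribute)]
    return classify(tree[attribute][value], labels, sample)

# Toy tree over two attributes
toy_tree = {'texture': {'clear': 'yes',
                        'fuzzy': {'root': {'curled': 'yes', 'stiff': 'no'}}}}
labels = ['texture', 'root']
print(classify(toy_tree, labels, ['fuzzy', 'stiff']))  # no
```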

Main function

def main():
    filePath = 'watermalon.xls'
    data = getData(filePath)
    dataSet = dataDeal(data)
    labels = getLabels(data)
    myTree = createTree(dataSet,labels)
    return myTree


if __name__ == '__main__':
    myTree = main()


Dictionary structure of tree

2.4 drawing decision tree

#Drawing a decision tree using Matplotlib
import matplotlib.pyplot as plt

#Format text boxes and arrows
decisionNode = dict(boxstyle = "sawtooth", fc = "0.8")
leafNode = dict(boxstyle = "round4", fc = "0.8")
arrow_args = dict(arrowstyle = "<-")
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = 'sans-serif'

#Draw node
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy = parentPt,\
    xycoords = "axes fraction", xytext = centerPt, textcoords = 'axes fraction',\
    va = "center", ha = "center", bbox = nodeType, arrowprops = arrow_args)
#Gets the number of leaf nodes of the decision tree
def getNumLeafs(myTree):
    leafNumber = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if(type(secondDict[key]).__name__ == 'dict'):
            leafNumber = leafNumber + getNumLeafs(secondDict[key])
        else:
            leafNumber += 1
    return leafNumber

#Get the height of the decision tree (recursive)
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        #test to see if the nodes are dictonaires, if not they are leaf nodes
        if type(secondDict[key]).__name__=='dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

#Add information to parent-child nodes
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

#Painting tree
def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]     #the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))        #recursion
        else:   #it's a leaf node print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

#Canvas initialization
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), '')

def main():
    print(getTreeDepth(myTree)) #Output the tree depth
    print(getNumLeafs(myTree))  #Output the number of leaves
    createPlot(myTree)          #Draw the tree


2.5 result analysis

First, look at the standard decision tree given in the textbook Machine Learning:

Comparing our result with the tree in the book, the two are basically consistent, with one slight difference: we are missing a "light white" leaf. After checking the sample data and the code, the cause was found: the original data contains no melon with clear texture, slightly curled root, and light white colour, so that leaf node is absent from our decision tree.

3, Implementing a decision tree with sklearn

3.1 establishment of decision tree based on information gain criterion method

Import related libraries

#Import related libraries
import pandas as pd
import graphviz 
from sklearn.model_selection import train_test_split
from sklearn import tree

Import data

f = open('watermalon.csv','r',encoding='utf-8')
data = pd.read_csv(f)

x = data[["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]].copy()
y = data['Good melon'].copy()

data conversion

#Numeric eigenvalues
x = x.copy()
for i in ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]:
    for j in range(len(x)):
        if(x[i][j] == "dark green" or x[i][j] == "Curl up" or x[i][j] == "Turbid sound" \
           or x[i][j] == "clear" or x[i][j] == "sunken" or x[i][j] == "Hard slip"):
            x[i][j] = 1
        elif(x[i][j] == "Black" or x[i][j] == "Slightly curled" or x[i][j] == "Dull" \
           or x[i][j] == "Slightly paste" or x[i][j] == "Slightly concave" or x[i][j] == "Soft sticky"):
            x[i][j] = 2
        else:
            x[i][j] = 3
y = y.copy()
for i in range(len(y)):
    if(y[i] == "yes"):
        y[i] = 1
    else:
        y[i] = -1
#Convert x and y to integer DataFrames, otherwise sklearn will report an error
x = pd.DataFrame(x).astype(int)
y = pd.DataFrame(y).astype(int)

Modeling and training

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

#Decision tree learning
clf = tree.DecisionTreeClassifier(criterion="entropy")                    #instantiation  
clf =, y_train)
score = clf.score(x_test, y_test)

Draw decision tree

You need to download and configure Graphviz

reference: Graphviz installation and use - decision tree visualization

# Plus graphviz 2.38 absolute path
import os
os.environ["PATH"] += os.pathsep + r'C:\Program Files\Graphviz\bin'
feature_name = ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]
dot_data = tree.export_graphviz(clf ,feature_names= feature_name,class_names=["Good melon","Bad melon"],filled=True,rounded=True,out_file =None) 
graph = graphviz.Source(dot_data) 

3.2 CART algorithm implementation

You only need to change the value of the parameter criterion of the DecisionTreeClassifier function to gini:

clf = tree.DecisionTreeClassifier(criterion="gini")  #instantiation  
clf =, y_train) 
score = clf.score(x_test, y_test)
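The gini criterion minimises Gini impurity, Gini(D) = 1 - Σ_k p_k², instead of entropy. A quick standalone illustration of the quantity (the 8-good/9-bad class counts are the standard dataset's, assumed to match the one used here):

```python
def gini(counts):
    """Gini(D) = 1 - sum_k p_k^2: the probability that two samples drawn
    at random from D carry different class labels."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

print(gini([8, 9]))   # ~0.498: a nearly even split is close to maximally impure
print(gini([17, 0]))  # 0.0: a pure node
```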

Draw decision tree

# Plus graphviz 2.38 absolute path
import os
os.environ["PATH"] += os.pathsep + r'C:\Program Files\Graphviz\bin'
feature_name = ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]
dot_data = tree.export_graphviz(clf ,feature_names= feature_name,class_names=["Good melon","Bad melon"],filled=True,rounded=True,out_file =None) 
graph = graphviz.Source(dot_data) 

4, Summary

When using the sklearn library, features and labels must be converted to numeric type before the model can be built
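The manual nested loops in section 3.1 are easy to get wrong; the same numeric conversion can be sketched with pandas built-ins (the toy column names below are placeholders, not the dataset's actual columns):

```python
import pandas as pd

df = pd.DataFrame({'texture': ['clear', 'fuzzy', 'clear'],
                   'label':   ['yes', 'no', 'yes']})

# factorize assigns an integer code per distinct category, in order of appearance
df['texture'] = pd.factorize(df['texture'])[0]
# an explicit mapping keeps control over which class becomes +1 / -1
df['label'] = df['label'].map({'yes': 1, 'no': -1})
```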


Posted on Sun, 31 Oct 2021 08:18:47 -0400 by iknownothing