The decision tree picks out the good watermelon

1, Decision tree

1.1 INTRODUCTION

Decision tree is a decision analysis method based on the known probability of occurrence of various situations, which calculates the probability that the expected value of net present value is greater than or equal to zero by forming a decision tree, evaluates the project risk and judges its feasibility. It is a graphical method of intuitive probability analysis

The main algorithms are ID3, C4.5 and CART

The decision tree contains three types of nodes:

Decision node: usually rectangular box
Opportunity node: usually circle
Terminal node: usually tertiary

1.2 decision event handling process

Construction strategy
With the increase of tree, the entropy of nodes decreases rapidly. The faster the entropy is reduced, the better, and the shortest decision tree can be obtained
Decision tree
The basic process of decision tree follows the divide and conquer strategy
Pseudo code
Three conditions for ending recursion

All samples belong to the same category
This attribute set is a hole or all samples have the same attribute value
The dataset has no samples on an attribute

1.3 theoretical basis

Purity:
For a branch node, if the samples contained in the node belong to the same class, its purity is 1
The higher the purity, the better. As many samples as possible belong to the same category

Information entropy:

Information gain:

The optimal attribute division principle based on information gain - information gain criterion has a preference for attributes with more available data

It is called the intrinsic value of attribute a. The greater the number of possible values of attribute a (that is, the greater V), the greater the value of I V (a). This eliminates the preference for attributes with more value data to a certain extent.
in fact, the gain rate criterion has a preference for attributes with a small number of values. The C4.5 algorithm does not directly use the gain rate criterion, but first finds the attributes with higher information gain than the average level from the candidate partition attributes, and then selects the attribute with the highest gain rate.

2, ID3 decision tree implementation

2.1 data

2.2 steps

Calculate the initial information entropy
Calculate information gain
Sort by information gain
The feature with the maximum information gain is selected as the partition node
The feature is deleted from the feature list. Continue to return to the previous step of filtering. Repeat the operation until the feature list = 0

2.3 establishment of decision tree

Import package

import pandas as pd import numpy as np from collections import Counter from math import log2

Data acquisition and processing

#Data acquisition and processing def getData(filePath): data = pd.read_excel(filePath) return data def dataDeal(data): dataList = np.array(data).tolist() dataSet = [element[1:] for element in dataList] return dataSet

Get property name

#Get property name def getLabels(data): labels = list(data.columns)[1:-1] return labels

Get category tag

#Get category tag def targetClass(dataSet): classification = set([element[-1] for element in dataSet]) return classification

Leaf node marker

#Mark the branch node as the leaf node, and select the class with the largest number of samples as the class mark def majorityRule(dataSet): mostKind = Counter([element[-1] for element in dataSet]).most_common(1) majorityKind = mostKind[0][0] return majorityKind

Calculating information entropy

#Calculating information entropy def infoEntropy(dataSet): classColumnCnt = Counter([element[-1] for element in dataSet]) Ent = 0 for symbol in classColumnCnt: p_k = classColumnCnt[symbol]/len(dataSet) Ent = Ent-p_k*log2(p_k) return Ent

Sub dataset construction

#Sub dataset construction def makeAttributeData(dataSet,value,iColumn): attributeData = [] for element in dataSet: if element[iColumn]==value: row = element[:iColumn] row.extend(element[iColumn+1:]) attributeData.append(row) return attributeData

Calculate information gain

#Calculate information gain def infoGain(dataSet,iColumn): Ent = infoEntropy(dataSet) tempGain = 0.0 attribute = set([element[iColumn] for element in dataSet]) for value in attribute: attributeData = makeAttributeData(dataSet,value,iColumn) tempGain = tempGain+len(attributeData)/len(dataSet)*infoEntropy(attributeData) Gain = Ent-tempGain return Gain

Select optimal attribute

#Select optimal attribute def selectOptimalAttribute(dataSet,labels): bestGain = 0 sequence = 0 for iColumn in range(0,len(labels)):#Ignore the last category column Gain = infoGain(dataSet,iColumn) if Gain>bestGain: bestGain = Gain sequence = iColumn print(labels[iColumn],Gain) return sequence#

Establish decision tree

#Establish decision tree def createTree(dataSet,labels): classification = targetClass(dataSet) #Get category type (collection de duplication) if len(classification) == 1: return list(classification)[0] if len(labels) == 1: return majorityRule(dataSet)#Return categories with more sample types sequence = selectOptimalAttribute(dataSet,labels) print(labels) optimalAttribute = labels[sequence] del(labels[sequence]) myTree = } attribute = set([element[sequence] for element in dataSet]) for value in attribute: print(myTree) print(value) subLabels = labels[:] myTree[optimalAttribute][value] = \ createTree(makeAttributeData(dataSet,value,sequence),subLabels) return myTree

Main function

def main(): filePath = 'watermalon.xls' data = getData(filePath) dataSet = dataDeal(data) labels = getLabels(data) myTree = createTree(dataSet,labels) return myTree

generate

if __name__ == '__main__': myTree = main()

print(myTree)

Dictionary structure of tree

2.4 drawing decision tree

#Drawing a decision tree using Matlotlib import matplotlib.pyplot as plt #Format text boxes and arrows decisionNode = dict(boxstyle = "sawtooth", fc = "0.8") leafNode = dict(boxstyle = "round4", fc = "0.8") arrow_args = dict(arrowstyle = "<-") plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['font.family'] = 'sans-serif' #Draw node def plotNode(nodeTxt, centerPt, parentPt, nodeType): createPlot.ax1.annotate(nodeTxt, xy = parentPt,\ xycoords = "axes fraction", xytext = centerPt, textcoords = 'axes fraction',\ va = "center", ha = "center", bbox = nodeType, arrowprops = arrow_args) #Gets the number of leaf nodes of the decision tree def getNumLeafs(myTree): leafNumber = 0 firstStr = list(myTree.keys())[0] secondDict = myTree[firstStr] for key in secondDict.keys(): if(type(secondDict[key]).__name__ == 'dict'): leafNumber = leafNumber + getNumLeafs(secondDict[key]) else: leafNumber += 1 return leafNumber #Get the height of the decision tree (recursive) def getTreeDepth(myTree): maxDepth = 0 firstStr = list(myTree.keys())[0] secondDict = myTree[firstStr] for key in secondDict.keys(): #test to see if the nodes are dictonaires, if not they are leaf nodes if type(secondDict[key]).__name__=='dict': thisDepth = 1 + getTreeDepth(secondDict[key]) else: thisDepth = 1 if thisDepth > maxDepth: maxDepth = thisDepth return maxDepth #Add information to parent-child nodes def plotMidText(cntrPt, parentPt, txtString): xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0] yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1] createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30) #Painting tree def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on numLeafs = getNumLeafs(myTree) #this determines the x width of this tree depth = getTreeDepth(myTree) firstStr = list(myTree.keys())[0] #the text label for this node should be this cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff) plotMidText(cntrPt, parentPt, nodeTxt) plotNode(firstStr, cntrPt, parentPt, decisionNode) secondDict = myTree[firstStr] plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD for key in secondDict.keys(): if type(secondDict[key]).__name__=='dict': plotTree(secondDict[key],cntrPt,str(key)) #recursion else: #it's a leaf node print the leaf node plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode) plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key)) plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD #Canvas initialization def createPlot(inTree): fig = plt.figure(1, facecolor='white') fig.clf() axprops = dict(xticks=[], yticks=[]) createPlot.ax1 = plt.subplot(111, frameon=False, **axprops) #no ticks #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses plotTree.totalW = float(getNumLeafs(inTree)) plotTree.totalD = float(getTreeDepth(inTree)) plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0; plotTree(inTree, (0.5,1.0), '') plt.show()

def main(): #createPlot() print(getTreeDepth(myTree)) #Depth of output number print(getNumLeafs(myTree)) #Number of output leaves createPlot(myTree)

main()

2.4 result analysis

First look at the standard decision tree given in the textbook machine learning:

Comparing the results in the book with the results we obtained, it is basically consistent, but there is still a slight difference. We lack a light white leaf. Through the inspection of the sample data and code, the reason for the deletion is finally found: there is no melon with clear texture, slightly curled root and light white color in the original data, resulting in the absence of cotyledons in the decision tree.

3, Implementing decision tree with SK learn

3.1 establishment of decision tree based on information gain criterion method

Import related libraries

#Import related libraries import pandas as pd import graphviz from sklearn.model_selection import train_test_split from sklearn import tree

Import data

f = open('watermalon.csv','r',encoding='utf-8') data = pd.read_csv(f) x = data[["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]].copy() y = data['Good melon'].copy() print(data)

data conversion

#Numeric eigenvalues x = x.copy() for i in ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]: for j in range(len(x)): if(x[i][j] == "dark green" or x[i][j] == "Curl up" or data[i][j] == "Turbid sound" \ or x[i][j] == "clear" or x[i][j] == "sunken" or x[i][j] == "Hard slip"): x[i][j] = 1 elif(x[i][j] == "Black" or x[i][j] == "Slightly curled" or data[i][j] == "Dull" \ or x[i][j] == "Slightly paste" or x[i][j] == "Slightly concave" or x[i][j] == "Soft sticky"): x[i][j] = 2 else: x[i][j] = 3 y = y.copy() for i in range(len(y)): if(y[i] == "yes"): y[i] = int(1) else: y[i] = int(-1) #You need to convert the data x and y into a good format and the data frame dataframe, otherwise the format will report an error x = pd.DataFrame(x).astype(int) y = pd.DataFrame(y).astype(int) print(x) print(y)

Modeling and training

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2) print(x_train)

#Decision tree learning clf = tree.DecisionTreeClassifier(criterion="entropy") #instantiation clf = clf.fit(x_train, y_train) score = clf.score(x_test, y_test) print(score)

Draw decision tree

You need to download and configure Graphviz

reference: Graphviz installation and use - decision tree visualization

# Plus graphviz 2.38 absolute path import os os.environ["PATH"] += os.pathsep + 'C:\Program Files\Graphviz\bin' feature_name = ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"] dot_data = tree.export_graphviz(clf ,feature_names= feature_name,class_names=["Good melon","Bad melon"],filled=True,rounded=True,out_file =None) graph = graphviz.Source(dot_data) graph

3.2 CART algorithm implementation

You only need to change the value of the parameter criterion of the DecisionTreeClassifier function to gini:

clf = tree.DecisionTreeClassifier(criterion="gini") #instantiation clf = clf.fit(x_train, y_train) score = clf.score(x_test, y_test) print(score)

Draw decision tree

# Plus graphviz 2.38 absolute path import os os.environ["PATH"] += os.pathsep + r'C:\Program Files\Graphviz\bin' feature_name = ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"] dot_data = tree.export_graphviz(clf ,feature_names= feature_name,class_names=["Good melon","Bad melon"],filled=True,rounded=True,out_file =None) graph = graphviz.Source(dot_data) graph

4, Summary

When the Sklearn library is used, the feature quantity and label must be converted into numerical type to generate the model

5, References

https://blog.csdn.net/YangMax1/article/details/120917236
https://blog.csdn.net/weixin_47554309/article/details/120959933