# Experimental background

In previous experiments:

Machine learning 1: K-nearest neighbor algorithm

Machine learning 2: K-nearest neighbor algorithm

we learned that the K-nearest neighbor algorithm can classify without a training phase, but it also has many disadvantages. The biggest one is that it gives no insight into the internal structure of the data. Today I will introduce a classification algorithm whose model is easy to read and understand: the decision tree.

# 1. Principle of decision tree algorithm

## 1.1. What is a decision tree

Strictly speaking, the classification decision tree model is a tree structure that describes the classification of instances. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class.

Informally, a decision tree does only one thing: make judgments. By making one judgment after another, we eventually arrive at a result. Take a blind date as an example.

We reduce the blind date to three attributes (whether the candidate is fair-skinned, rich, and beautiful). A real blind date is of course far more complex than this. Almost no one would turn down a candidate who is fair-skinned, rich, and beautiful. Now put a new candidate into the decision tree: fair-skinned, rich, and not beautiful. After three judgments, the conclusion is "go". That is a decision tree. Of course, nobody really needs this kind of "player's" decision tree.

In practice, decision trees are used more like expert systems. Take disease diagnosis as an example: previously recorded diseases and their symptoms are used to build the decision tree. Once the tree has been trained on these records, a patient enters information such as body temperature when prompted, and the tree judges which disease the patient has.

This raises a question: which attribute should serve as the basis for each judgment, or rather, which attribute gives the best result when used as that basis?

## 1.2. How to build a good decision tree

To construct a good decision tree, you have to find suitable attributes to base the judgments on. How do we select them? Four concepts are introduced here: Shannon entropy, information gain, information gain ratio, and the Gini index.

### 1.2.1. Shannon entropy

Entropy essentially measures the "degree of internal chaos" of a system, and Shannon entropy is no different. It describes how uncertain something is. The greater the Shannon entropy, the more uncertain the thing is. The smaller the Shannon entropy, the more certain it is.

The formula is:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{y} p_k \log_2 p_k$$

Ent(D) is the entropy of set D, p_k is the probability of the k-th event, and k = 1, ..., y ranges over a total of y events.

Take a coin as an example. When a fair coin is tossed, heads and tails each come up with probability 50%. The entropy is then -(1/2 · log2(1/2) + 1/2 · log2(1/2)) = 1. This is when the coin is at its most uncertain. If we tamper with the coin and change the probabilities, the Shannon entropy changes too.

The entropy changes as the coin's probability changes. A coin has only two outcomes, and the entropy is largest when the two probabilities are equal. More generally, the more equally likely events there are, the greater the entropy.
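To make the coin example concrete, here is a minimal sketch that evaluates the entropy formula for a few coin biases. The function name `binary_entropy` and the probabilities tried are illustrative assumptions, not part of the original experiment.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # A certain outcome carries no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))


# Illustrative biases: from a fair coin to a coin that always lands heads
for p in (0.5, 0.7, 0.9, 1.0):
    print(f"p(heads)={p:.1f}  entropy={binary_entropy(p):.3f}")
```

The entropy is 1 bit for the fair coin and falls toward 0 as the outcome becomes certain.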

### 1.2.2. Information gain

The change in entropy before and after a data set is split is called the information gain. The formula is:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

Gain(D, a) is the information gain of attribute a on set D. Attribute a has V possible values, so splitting on it produces V branches, and D^v, the v-th branch node, contains all samples in D whose value on attribute a is a^v. The larger Gain(D, a) is, the greater the "purity improvement" obtained by splitting on attribute a.
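As a worked example of the definition above, the following sketch computes the information gain of each feature on the five-sample test data set that this article uses later (`createDataSet`). The helper names `entropy` and `information_gain` are assumptions made for this illustration and are separate from the article's own functions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature_index):
    """Gain(D, a) for the feature at feature_index; the last column is the label."""
    labels = [row[-1] for row in rows]
    groups = {}  # feature value -> labels of the samples in that branch
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gain_0 = information_gain(rows, 0)  # feature 'no surfacing'
gain_1 = information_gain(rows, 1)  # feature 'flippers'
print(round(gain_0, 3), round(gain_1, 3))
```

Feature 0 yields the larger gain (about 0.420 versus 0.171), so a tree based on information gain would split on it first.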

### 1.2.3. Information gain ratio

Information gain helps us select a suitable partition attribute, but it has an obvious problem: it prefers attributes with many distinct values. Take coins again. Flip a coin 10 times and get 5 heads and 5 tails. If we use the flip number as the attribute, each value covers exactly one flip, so every branch is completely determined and the information gain becomes 1 - 0 = 1. So we introduce the information gain ratio:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

IV(a) is called the intrinsic value of attribute a. The more possible values attribute a has, the greater IV(a) is. This method has a disadvantage of its own: it prefers attributes with few distinct values. Therefore, we first select the attributes whose information gain is above average, and then pick among them by information gain ratio to obtain a suitable attribute.
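The coin example can be checked numerically. The sketch below builds ten flips with two attributes: the flip number (unique per flip) and a made-up two-valued attribute, the hand used, which happens to predict the outcome perfectly. Both the `hand` attribute and the function names are assumptions introduced only for this illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_and_ratio(rows, feature_index):
    """Return (information gain, gain ratio) for one feature; last column is the label."""
    column = [row[feature_index] for row in rows]
    labels = [row[-1] for row in rows]
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    iv = entropy(column)  # IV(a) is the entropy of the attribute's own values
    return gain, (gain / iv if iv > 0 else 0.0)


# Columns: flip number, hand used (hypothetical), outcome
rows = [[i, 'left' if i < 5 else 'right', 'up' if i < 5 else 'down']
        for i in range(10)]
gain_id, ratio_id = gain_and_ratio(rows, 0)
gain_hand, ratio_hand = gain_and_ratio(rows, 1)
print(gain_id, round(ratio_id, 3), gain_hand, ratio_hand)
```

Both attributes reach an information gain of 1, so information gain alone cannot tell them apart. The gain ratio divides by IV(a), which is log2(10) for the flip number and 1 for the two-valued attribute, so the flip number is penalized.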

### 1.2.4. Gini index

The Gini index is another attribute selection criterion, different from information gain and information gain ratio. It describes the purity of a data set: it reflects the probability that two samples drawn at random from the set have different class labels. The smaller the Gini index, the purer the data set.

The formula is:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{y} p_k^2$$

Given data set D, the Gini index of attribute a is defined as:

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$

From the candidate attribute set A, select the attribute with the smallest Gini index after partitioning as the optimal partition attribute.
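The two Gini definitions can be sketched in a few lines. This example reuses the five-sample test data set from later in the article. The function names `gini` and `gini_index` are assumptions for illustration and are not the article's `calcGini`.

```python
from collections import Counter


def gini(labels):
    """Gini(D): probability that two random samples have different labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_index(rows, feature_index):
    """Gini_index(D, a): size-weighted Gini of the branches after splitting."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
base = gini([row[-1] for row in rows])
index_0 = gini_index(rows, 0)
index_1 = gini_index(rows, 1)
print(round(base, 3), round(index_0, 3), round(index_1, 3))
```

Feature 0 gives the smaller Gini index (about 0.267 versus 0.4), so it would be chosen as the partition attribute.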

## 1.3. How to optimize the built decision tree

Although there are many ways to select attributes for splitting the data set and building the tree, a decision tree trained on a training set has an obvious problem: it overfits the training data, so it generalizes poorly. A training set cannot contain every situation in the world, so when a new sample is fed into the tree, the result may well not be what we want, because the tree has never seen a sample like it. This is why we need to "prune" the decision tree.

### 1.3.1. Pre-pruning

As the name suggests, pre-pruning prunes the tree by stopping its construction early. There are four main methods:

1. Stop growing the decision tree when it reaches a preset height.

2. If all instances reaching a node have the same feature vector, stop growing the tree, even if these instances do not all belong to the same class.

(For example, among patients who all have headache and fever, most have a cold and a few have heatstroke. We treat headache plus fever as a cold.)

3. Define a threshold. When the number of instances at a node falls below the threshold, stop growing the tree.

(This prevents overfitting the training data. For example, if there are 10 cases of headache and only 1 case of headache with fever, there is no need to split on fever.)

4. Decide whether to stop by calculating how much each expansion improves system performance.

(If performance improves, the node is expanded. If it degrades, the attribute is cut off.)

Method 3 has an obvious problem: the threshold is specified in advance, and how to choose it is open to debate. Balancing underfitting against overfitting is the hard part here.

Pre-pruning reduces the risk of overfitting and significantly cuts training and testing time. However, because it is greedy, it can get stuck in a local optimum: when performance goes high, then low, then higher, it stops on the first hill and never reaches the higher one. And because branch expansion is restricted, the risk of underfitting rises considerably. (After all, even a single case of headache with fever deserves attention in case it is a virus, and pre-pruning may cause it to be overlooked.)
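The first three stopping rules can be sketched as a single check that a tree builder would call before splitting a node. This is only an illustration under assumptions: the function name `stop_reason` and the default values `max_depth=3` and `min_samples=2` are hypothetical, and rule 4 is left out because it needs a separate validation set.

```python
def stop_reason(rows, depth, max_depth=3, min_samples=2):
    """Return why growth should stop at this node, or None to keep splitting.

    rows: samples reaching the node, each a list whose last item is the label.
    depth: depth of the node, with the root at depth 0.
    """
    if depth >= max_depth:            # Rule 1: preset height reached
        return "max depth reached"
    if len(rows) < min_samples:       # Rule 3: too few instances at the node
        return "too few samples"
    feature_vectors = {tuple(row[:-1]) for row in rows}
    if len(feature_vectors) == 1:     # Rule 2: identical feature vectors
        return "identical feature vectors"
    return None


# Same symptoms, different diagnoses: growth stops even though labels differ
same_symptoms = [['headache', 'fever', 'cold'],
                 ['headache', 'fever', 'cold'],
                 ['headache', 'fever', 'heatstroke']]
print(stop_reason(same_symptoms, depth=1))
```

When a check like this fires, the node becomes a leaf labeled with the majority class of the samples that reached it.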

As shown in the figure, with pre-pruning the complete decision tree (top) is never built. The pruned decision tree (bottom) is drawn directly.

### 1.3.2. Post-pruning

Post-pruning first generates a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up. If replacing the subtree rooted at a node with a leaf node improves the generalization performance of the tree, the subtree is replaced with the leaf. If there is no improvement, the subtree is kept.

The advantages and disadvantages are clear. Keeping more branches reduces the risk of underfitting, so with enough branches post-pruning generally performs better than pre-pruning, which leaves too few. Better performance comes at a cost, though: because pruning runs bottom-up after the complete tree has been generated, training takes longer and costs more.
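One common way to realize this idea is reduced-error pruning, sketched below. It is an assumption that this matches the variant the article has in mind. The sketch assumes trees stored as nested dictionaries of the form `{feature: {value: subtree_or_label}}`, samples that keep all feature columns, and a held-out validation set used to measure generalization. All function names are hypothetical.

```python
from collections import Counter


def predict(tree, feat_labels, sample):
    """Walk the nested-dict tree; return None if a branch is missing."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        value = sample[feat_labels.index(feature)]
        if value not in tree[feature]:
            return None
        tree = tree[feature][value]
    return tree


def count_errors(tree, feat_labels, rows):
    return sum(predict(tree, feat_labels, r) != r[-1] for r in rows)


def prune(tree, feat_labels, train_rows, val_rows):
    """Bottom-up pruning; modifies the tree in place and returns the result."""
    if not isinstance(tree, dict) or not train_rows:
        return tree
    feature = next(iter(tree))
    idx = feat_labels.index(feature)
    for value, subtree in list(tree[feature].items()):  # Prune children first
        sub_train = [r for r in train_rows if r[idx] == value]
        sub_val = [r for r in val_rows if r[idx] == value]
        tree[feature][value] = prune(subtree, feat_labels, sub_train, sub_val)
    # Candidate leaf: majority class of the training samples at this node
    leaf = Counter(r[-1] for r in train_rows).most_common(1)[0][0]
    leaf_errors = sum(r[-1] != leaf for r in val_rows)
    # Replace the subtree only if the leaf makes strictly fewer validation errors
    if leaf_errors < count_errors(tree, feat_labels, val_rows):
        return leaf
    return tree


feat_labels = ['no surfacing', 'flippers']
train = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
val = [[1, 0, 'yes'], [1, 1, 'yes'], [0, 1, 'no']]  # Made-up validation samples
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
pruned = prune(tree, feat_labels, train, val)
print(pruned)
```

On this made-up validation set the `flippers` subtree misclassifies one sample while a single `yes` leaf misclassifies none, so the subtree is collapsed. The root split is kept because replacing it would increase the validation error.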

Post-pruning, by contrast, first generates the complete decision tree above and then prunes it.

# 2. Three kinds of decision tree construction

With the principles explained, we now build decision trees based on information gain, information gain ratio, and the Gini index in turn. First, we need to calculate the Shannon entropy. The code is as follows:

    from math import log

    # Calculate the Shannon entropy of a data set
    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)              # Total number of samples
        labelCounts = {}                       # Maps each class label to its count
        for featVec in dataSet:                # Each sample is a list; its last item is the label
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts:
                labelCounts[currentLabel] = 0  # First time this label is seen
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key]) / numEntries  # Convert count to probability
            shannonEnt -= prob * log(prob, 2)            # Accumulate -p * log2(p)
        return shannonEnt

    # Test data
    def createDataSet():
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        labels = ['no surfacing', 'flippers']  # Feature names; feature values are discrete
        return dataSet, labels

## 2.1. Decision tree based on information gain (ID3)

    import operator

    # Split the data set on a given feature
    def splitDataSet(dataSet, axis, value):
        retDataSet = []                                # Samples that fall into this branch
        for featVec in dataSet:
            if featVec[axis] == value:                 # Keep samples whose feature equals value
                reducedFeatVec = featVec[:axis]        # Copy everything before the feature
                reducedFeatVec.extend(featVec[axis+1:])  # and everything after it
                retDataSet.append(reducedFeatVec)
        return retDataSet

    # Select the best feature to split on (information gain)
    def chooseBestFeatureToSplit1(dataSet):
        numFeatures = len(dataSet[0]) - 1              # Number of features (last column is the label)
        baseEntropy = calcShannonEnt(dataSet)          # Entropy before splitting
        bestInfoGain = 0.0
        bestFeature = -1
        for i in range(numFeatures):                   # Try every feature
            featList = [example[i] for example in dataSet]
            uniqueVals = set(featList)                 # Distinct values of this feature
            newEntropy = 0.0
            for value in uniqueVals:                   # Weighted entropy of each branch
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)  # Conditional entropy
            infoGain = baseEntropy - newEntropy        # Information gain
            if infoGain > bestInfoGain:                # Keep the largest information gain
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature                             # Index of the chosen feature

    # Majority vote: return the most frequent class label
    def majorityCnt(classList):
        classCount = {}                                # Occurrences of each label
        for vote in classList:
            if vote not in classCount:
                classCount[vote] = 0
            classCount[vote] += 1
        # Sort in descending order of count
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

    # Build the decision tree
    def createTree(dataSet, labels):
        classList = [example[-1] for example in dataSet]   # -1 is the label column
        if classList.count(classList[0]) == len(classList):
            return classList[0]                        # All samples share one class: stop
        if len(dataSet[0]) == 1:
            return majorityCnt(classList)              # No features left: majority vote
        # Use information gain to select the best split
        bestFeat = chooseBestFeatureToSplit1(dataSet)
        bestFeatLabel = labels[bestFeat]
        myTree = {bestFeatLabel: {}}
        del(labels[bestFeat])                          # Remove the used feature (modifies labels)
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            subLabels = labels[:]                      # Copy so recursion does not disturb labels
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        return myTree

A decision tree printed as a nested dictionary is not intuitive enough, so we add some plotting code to draw it.

    import matplotlib.pyplot as plt

    # Box and arrow styles for decision nodes and leaf nodes
    decisionNode = dict(boxstyle="sawtooth", fc="0.8")
    leafNode = dict(boxstyle="round4", fc="0.8")
    arrow_args = dict(arrowstyle="<-")

    # Count the leaf nodes of the tree
    def getNumLeafs(myTree):
        numLeafs = 0
        firstStr = list(myTree.keys())[0]
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if isinstance(secondDict[key], dict):  # A dict is a subtree; anything else is a leaf
                numLeafs += getNumLeafs(secondDict[key])
            else:
                numLeafs += 1
        return numLeafs

    # Get the depth of the tree
    def getTreeDepth(myTree):
        maxDepth = 0
        firstStr = list(myTree.keys())[0]
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if isinstance(secondDict[key], dict):  # A dict is a subtree; anything else is a leaf
                thisDepth = 1 + getTreeDepth(secondDict[key])
            else:
                thisDepth = 1
            if thisDepth > maxDepth:
                maxDepth = thisDepth
        return maxDepth

    # Draw one node with an arrow from its parent
    def plotNode(nodeTxt, centerPt, parentPt, nodeType):
        createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                                xytext=centerPt, textcoords='axes fraction',
                                va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

    # Write the branch value halfway between parent and child
    def plotMidText(cntrPt, parentPt, txtString):
        xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
        yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
        createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

    # Recursively draw the tree; the first key tells which feature was split on
    def plotTree(myTree, parentPt, nodeTxt):
        numLeafs = getNumLeafs(myTree)         # Determines the x width of this subtree
        depth = getTreeDepth(myTree)
        firstStr = list(myTree.keys())[0]      # Text label for this node
        cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
        plotMidText(cntrPt, parentPt, nodeTxt)
        plotNode(firstStr, cntrPt, parentPt, decisionNode)
        secondDict = myTree[firstStr]
        plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
        for key in secondDict.keys():
            if isinstance(secondDict[key], dict):  # Subtree: recurse
                plotTree(secondDict[key], cntrPt, str(key))
            else:                                  # Leaf node: draw it
                plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
                plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
                plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
        plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

    # Set up the figure and draw the whole tree
    def createPlot(inTree):
        fig = plt.figure(1, facecolor='white')
        fig.clf()
        axprops = dict(xticks=[], yticks=[])
        createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # No ticks
        # createPlot.ax1 = plt.subplot(111, frameon=False)  # Ticks for demo purposes
        plotTree.totalW = float(getNumLeafs(inTree))
        plotTree.totalD = float(getTreeDepth(inTree))
        plotTree.xOff = -0.5 / plotTree.totalW
        plotTree.yOff = 1.0
        plotTree(inTree, (0.5, 1.0), '')
        plt.show()

    # def createPlot():
    #     fig = plt.figure(1, facecolor='white')
    #     fig.clf()
    #     createPlot.ax1 = plt.subplot(111, frameon=False)  # Ticks for demo purposes
    #     plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    #     plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    #     plt.show()

    # Prebuilt trees for testing the plotting code
    def retrieveTree(i):
        listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                       {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                       ]
        return listOfTrees[i]

    # createPlot(thisTree)

## 2.2. Decision tree based on information gain ratio (C4.5)

    # Select the best feature to split on (information gain ratio)
    def chooseBestFeatureToSplit2(dataSet):
        numFeatures = len(dataSet[0]) - 1          # Number of features (last column is the label)
        baseEntropy = calcShannonEnt(dataSet)      # Entropy before splitting
        infoGains = []                             # Information gain of each feature
        intrinsicValues = []                       # Intrinsic value IV(a) of each feature
        for i in range(numFeatures):               # Try every feature
            featList = [example[i] for example in dataSet]
            uniqueVals = set(featList)             # Distinct values of this feature
            newEntropy = 0.0
            IV = 0.0
            for value in uniqueVals:               # Weighted entropy of each branch
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)  # Conditional entropy
                IV -= prob * log(prob, 2)
            infoGains.append(baseEntropy - newEntropy)
            intrinsicValues.append(IV)
        avg = sum(infoGains) / numFeatures         # Average information gain
        bestInfoGainRatio = 0.0
        bestFeature = -1
        for i in range(numFeatures):
            if intrinsicValues[i] == 0.0:
                infoGainRatio = 0.0
            else:
                infoGainRatio = infoGains[i] / intrinsicValues[i]
            # Candidates are features whose information gain reaches the average
            # (small tolerance for floating-point rounding); pick the highest gain ratio
            if infoGains[i] >= avg - 1e-12 and infoGainRatio > bestInfoGainRatio:
                bestInfoGainRatio = infoGainRatio
                bestFeature = i
        return bestFeature                         # Index of the chosen feature

The code for C4.5 is very similar to ID3. The core difference is that, among the attributes whose information gain is above average, it selects by information gain ratio.

On such a small data set, no obvious difference can be seen.

## 2.3. Decision tree based on Gini index (CART)

    # Calculate the Gini index of a data set
    def calcGini(dataset):
        feature = [example[-1] for example in dataset]  # Class label of every sample
        uniqueFeat = set(feature)
        sumProb = 0.0
        for feat in uniqueFeat:
            prob = feature.count(feat) / float(len(feature))  # Fraction of samples with this label
            sumProb += prob * prob
        sumProb = 1 - sumProb
        return sumProb

    # Select the best feature to split on (Gini index)
    def chooseBestFeatureToSplit3(dataSet):
        numFeatures = len(dataSet[0]) - 1      # The last column is the label, not a feature
        bestInfoGini = 1
        bestFeature = -1
        for i in range(numFeatures):
            featList = [example[i] for example in dataSet]
            uniqueVals = set(featList)
            newGini = 0.0
            for value in uniqueVals:           # One data subset per feature value
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newGini += prob * calcGini(subDataSet)
            infoGini = newGini
            if infoGini < bestInfoGini:        # Keep the smallest Gini index
                bestInfoGini = infoGini
                bestFeature = i
        return bestFeature                     # Index of the chosen feature

On such a small data set, there is again no difference.

# 3. Decision tree test

When the data set is large, we save the built decision tree to disk so that it can be stored and reused.

    import pickle

    # Save the tree to a file
    def saveTree(inputTree, filename):
        with open(filename, 'wb') as fw:   # pickle needs binary mode
            pickle.dump(inputTree, fw)

    # Load the tree from a file
    def loadTree(filename):
        with open(filename, 'rb') as fr:
            return pickle.load(fr)
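The test script below calls a `classify` function that is not listed in this article. A minimal sketch of what such a function could look like is given here, assuming the nested-dictionary tree format produced by `createTree`. The `default` parameter, returned when a test sample has a feature value that never appeared in training, is an addition for this sketch.

```python
def classify(inputTree, featLabels, testVec, default=None):
    """Follow the branches of a nested-dict tree until a leaf label is reached."""
    firstStr = next(iter(inputTree))          # Feature tested at this node
    secondDict = inputTree[firstStr]          # Branches: feature value -> subtree or label
    featIndex = featLabels.index(firstStr)    # Column of that feature in testVec
    branch = secondDict.get(testVec[featIndex])
    if branch is None:
        return default                        # Value not seen during training
    if isinstance(branch, dict):
        return classify(branch, featLabels, testVec, default)
    return branch                             # Leaf node: the class label


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(tree, labels, [1, 1]))
print(classify(tree, labels, [1, 0]))
```

Note that `featLabels` must be the full, unmodified list of feature names, which is why the test script keeps a separate `trainLabels` list alongside the `Labels` list that `createTree` modifies.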

    if __name__ == '__main__':
        with open('D:/vscode/python/.vscode/car.data', 'r') as fr:
            dataSet = [inst.strip().split(',') for inst in fr if inst.strip()]
        Labels = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
        # Untouched copy for classification, because createTree deletes entries from Labels
        trainLabels = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
        train_num = int(len(dataSet) * 0.8)    # 80% of the data for training, 20% for testing
        train_data = dataSet[:train_num]
        test_data = dataSet[train_num:]
        trainTree = createTree(train_data, Labels)   # Build the decision tree from the training data
        createPlot(trainTree)
        errCount = 0.0
        for data in test_data:
            testVec = data[:-1]
            result = classify(trainTree, trainLabels, testVec)  # Classification result
            if result != data[-1]:             # Count the errors
                errCount += 1.0
        prob = (1 - (errCount / len(test_data))) * 100   # Accuracy in percent
        print(prob)

## 3.1. ID3

The accuracy is 68.7%.

## 3.2. C4.5

The accuracy is 68.1%, but the decision tree has far fewer nodes.
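To compare tree sizes in numbers rather than by eye, a small helper can count the nodes of a nested-dictionary tree. The sketch below is an assumption about how one might do this. `count_nodes` is a hypothetical name and the example tree is the small two-feature tree from `retrieveTree`, not the trees trained on the car data.

```python
def count_nodes(tree):
    """Return (internal_nodes, leaf_nodes) for a nested-dict decision tree."""
    if not isinstance(tree, dict):
        return 0, 1                       # A bare label is a single leaf
    feature = next(iter(tree))
    internal, leaves = 1, 0               # This decision node counts as internal
    for subtree in tree[feature].values():
        sub_internal, sub_leaves = count_nodes(subtree)
        internal += sub_internal
        leaves += sub_leaves
    return internal, leaves


small_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(count_nodes(small_tree))
```

Calling it on the trees produced by each split criterion would give a direct size comparison.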

## 3.3. CART

The accuracy is only 65.8%.

# 4. Summary

By studying decision trees and running the code, I have gained a deeper understanding of them. The decision tree algorithm does not require tuning many parameters, and it is highly interpretable. Its disadvantages are also obvious. When there are too many decision attributes, the computational overhead of the whole tree becomes very large. In addition, after testing on the data set many times, I found that when a decision attribute is continuous, the tree easily ends up with too many branches. In that case, the continuous data needs to be discretized.
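As an illustration of the discretization mentioned above, the sketch below maps continuous values to a fixed number of equal-width bins, so that an attribute produces at most `n_bins` branches. Equal-width binning is only one possible scheme and is an assumption here. The function name and the body-temperature values are made up for the example.

```python
def equal_width_bins(values, n_bins=3):
    """Map each continuous value to a bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)          # All values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


temperatures = [36.2, 36.8, 37.5, 38.1, 39.4, 40.0]   # Hypothetical body temperatures
bins = equal_width_bins(temperatures, n_bins=3)
print(bins)
```

After this step the attribute can be passed to the tree-building code like any other discrete feature.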