# Decision tree -- ID3 algorithm, C4.5 algorithm, CART algorithm

Contents

• Steps of decision tree learning
• Generate decision tree for example code

A decision tree is a tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a classification result.

Decision trees are better suited to discrete data; continuous data should be discretized before analysis. The following is an example:

| RID | age | income | student | credit_rating | Class: buys_computer |
|-----|-------------|--------|---------|---------------|----------------------|
| 1   | youth       | high   | no      | fair          | no  |
| 2   | youth       | high   | no      | excellent     | no  |
| 3   | middle_aged | high   | no      | fair          | yes |
| 4   | senior      | medium | no      | fair          | yes |
| 5   | senior      | low    | yes     | fair          | yes |
| 6   | senior      | low    | yes     | excellent     | no  |
| 7   | middle_aged | low    | yes     | excellent     | yes |
| 8   | youth       | medium | no      | fair          | no  |
| 9   | youth       | low    | yes     | fair          | yes |
| 10  | senior      | medium | yes     | fair          | yes |
| 11  | youth       | medium | yes     | excellent     | yes |
| 12  | middle_aged | medium | no      | excellent     | yes |
| 13  | middle_aged | high   | yes     | fair          | yes |
| 14  | senior      | medium | no      | excellent     | no  |

From the table above, we assume the decision tree is a binary tree, as shown in the figure below. Of course, many different decision trees can be trained from the same data; this is just one of them.

### Steps of decision tree learning

• Feature selection: feature selection determines which features to use for splitting the data. Each sample may have many attributes; we select the attributes most correlated with the class, that is, the attributes that separate the classes most easily. The criterion for feature selection is information gain.

ID3 algorithm: selects features by information gain. The calculation formula is as follows:
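The original formula image is not preserved here; the standard definitions are as follows, where $D$ is the dataset, $p_k$ the proportion of class $k$, and $D_v$ the subset of $D$ where attribute $a$ takes its $v$-th value:

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

$$\mathrm{Gain}(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|}\, H(D_v)$$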

C4.5 algorithm: information gain tends to prefer attributes with many distinct values, so C4.5 improves on it by using the gain ratio instead. The formula is as follows:
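Again, the formula image is missing from the original; the standard gain ratio definition, with $\mathrm{IV}(a)$ the intrinsic value (split information) of attribute $a$, is:

$$\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$$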

CART algorithm: generating a CART decision tree is a process of recursively constructing a binary tree. CART uses the Gini index minimization criterion to select features and generate the binary tree. The Gini index is calculated as:
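The standard Gini index formulas (the original image is missing) are, for a dataset $D$ and an attribute $a$ that partitions $D$ into subsets $D_v$:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad \mathrm{Gini}(D, a) = \sum_{v=1}^{V} \frac{|D_v|}{|D|}\,\mathrm{Gini}(D_v)$$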

• Decision tree generation: after feature selection, start from the root node, compute the information gain of every feature for that node, select the feature with the maximum information gain as the node's feature, create child nodes according to the different values of that feature, and repeat the same procedure for each child node until the information gain is very small or no features remain.
• Pruning: pruning is mainly used to prevent overfitting, which decision trees are prone to.
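As a sketch of the feature-selection step above, the following code (the helper names `entropy` and `information_gain` are mine, not from the original post) computes the information gain of the `age` attribute on the buys_computer table:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain(D, a) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# The 14-row buys_computer dataset from the table above
# (columns: age, income, student, credit_rating)
rows = [
    ("youth", "high", "no", "fair"), ("youth", "high", "no", "excellent"),
    ("middle_aged", "high", "no", "fair"), ("senior", "medium", "no", "fair"),
    ("senior", "low", "yes", "fair"), ("senior", "low", "yes", "excellent"),
    ("middle_aged", "low", "yes", "excellent"), ("youth", "medium", "no", "fair"),
    ("youth", "low", "yes", "fair"), ("senior", "medium", "yes", "fair"),
    ("youth", "medium", "yes", "excellent"), ("middle_aged", "medium", "no", "excellent"),
    ("middle_aged", "high", "yes", "fair"), ("senior", "medium", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(information_gain(rows, labels, 0), 3))  # age -> 0.247
```

With 9 "yes" and 5 "no" labels, $H(D) \approx 0.940$, and splitting on `age` leaves a weighted entropy of about 0.694, so the gain is about 0.247; ID3 would pick `age` as the root if this is the largest gain among all attributes.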

### Advantages and disadvantages of decision tree

• Decision trees are easy to understand and interpret, can be analyzed visually, and rules can easily be extracted from them;
• They can handle nominal and numerical data at the same time;
• They are well suited to samples with missing attribute values;
• They can handle irrelevant features;
• They run relatively fast when classifying test data;
• They can produce feasible, effective results on large data sources in a relatively short time.

Disadvantages

• Overfitting occurs easily;
• Correlations between attributes in the dataset are easily ignored;
• When the classes have different numbers of samples, different criteria lead to different biases in attribute selection: the information gain criterion (typified by ID3) prefers attributes with many distinct values, while the gain ratio criterion (used by C4.5) prefers attributes with few distinct values. For this reason C4.5 does not simply apply the gain ratio directly when splitting, but uses a heuristic rule (this bias exists whenever information gain is used);
• When ID3 computes information gain, the result is biased towards features with more values.

### Generate decision tree for example code

```python
# Import packages
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree
from sklearn import preprocessing
import csv

# Open the data file
Dtree = open('AllElectronics.csv', 'r')
reader = csv.reader(Dtree)

# Get the first row of data (the header)
headers = next(reader)

# Define two lists
featureList = []
labelList = []

for row in reader:
    # Save the label in a list
    labelList.append(row[-1])
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]  # Create a data dictionary
    featureList.append(rowDict)  # Store the data dictionary in the list
```

Because our data are all strings, we need to process them here and convert them into a 0-1 format:

```python
# Convert the features to a 0-1 representation
vec = DictVectorizer()
x_data = vec.fit_transform(featureList).toarray()
print("x_data: " + str(x_data))

# Print the attribute names
print(vec.get_feature_names())
print("labelList: " + str(labelList))

# Convert the labels to a 0-1 representation
lb = preprocessing.LabelBinarizer()
y_data = lb.fit_transform(labelList)
print("y_data: " + str(y_data))
```

In the converted data, 1 indicates that the sample has that attribute value and 0 indicates that it does not. The results are as follows:

```python
# Create the decision tree model.
# criterion defaults to 'gini' (the Gini index); setting it to 'entropy'
# makes the splits use information gain, as in the ID3 algorithm.
model = tree.DecisionTreeClassifier(criterion='entropy')

# Fit the model to the data
model.fit(x_data, y_data)
```

The model is now trained. To visualize it, we need to print out the decision tree, as shown in the following code.

There are two things you need to do before running that code:

1. Open a command prompt and enter: `pip install graphviz`
2. Download the Graphviz drawing package and install the version matching your operating system. After installation, add the `bin` directory of the installed files to the PATH environment variable, as follows:

Find the path of the `bin` directory and copy it. Then open This PC, right-click and select Properties to reach the environment variable settings, and add the path there.

After completing this, you can run the following code. If an error still occurs, restart the software you use to edit the code.

```python
import graphviz

# Export the trained tree in DOT format
dot_data = tree.export_graphviz(model,
                                out_file=None,
                                feature_names=vec.get_feature_names(),
                                class_names=lb.classes_,
                                filled=True,
                                rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('computer')  # Writes the rendered tree to computer.pdf
```

The result is shown in the following figure, which is the generated decision tree:

Posted on Thu, 04 Nov 2021 09:55:03 -0400 by darksniperx