Decision tree -- ID3 algorithm, C4.5 algorithm, CART algorithm

Contents

  Steps of decision tree learning

  Advantages and disadvantages of decision tree

  Generate decision tree for example code

A decision tree is a tree structure: each internal node represents a test on an attribute, each branch represents one outcome of that test, and each leaf node represents a classification result.

Decision trees are better suited to discrete data; continuous attributes should be discretized before analysis (the age attribute below, for instance, could come from a numeric age binned by thresholds; see the sketch after the table). The following table is an example:

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no
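As a side note, here is a minimal sketch of the discretization step mentioned above. The thresholds (30 and 45) and the numeric ages are assumptions for illustration; the original post does not give them.

# Hypothetical discretization of a numeric age into the three
# categories used in the table above (thresholds are assumed)
def discretize_age(age):
    if age < 30:
        return 'youth'
    elif age < 45:
        return 'middle_aged'
    else:
        return 'senior'

print([discretize_age(a) for a in (25, 38, 52)])
# ['youth', 'middle_aged', 'senior']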

According to the table above, we can build a decision tree; here we assume it is a binary tree, as shown in the figure below (omitted here). Of course, many different decision trees can be trained from the same data; this is just one of them.

  Steps of decision tree learning

  • Feature selection: feature selection determines which attribute to test at each node. Each sample may have many attributes; we select the attributes most strongly correlated with the class, i.e. the ones that make classification easiest. The criteria for feature selection are information gain and its variants:

ID3 algorithm: features are selected by information gain. With the empirical entropy H(D) = -Σ_k p_k log2(p_k), where p_k is the proportion of class k in dataset D, the information gain of attribute A is g(D, A) = H(D) - H(D|A).

C4.5 algorithm: information gain tends to favor attributes with many distinct values, so C4.5 improves it to the gain ratio: g_R(D, A) = g(D, A) / H_A(D), where H_A(D) = -Σ_i (|D_i|/|D|) log2(|D_i|/|D|) is the entropy of the partition of D induced by A.

  CART algorithm: CART generates a decision tree by recursively constructing a binary tree, selecting features with the Gini index minimization criterion. The Gini index is Gini(D) = 1 - Σ_k p_k², and for a binary split of D into D1 and D2 on attribute A, Gini(D, A) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2). (A worked computation of all three criteria on the table above follows this list.)

  • Decision tree generation: after choosing a criterion, start from the root node, compute the information gain of every feature for the samples at that node, select the feature with the maximum information gain as the node's splitting feature, create child nodes according to the different values of that feature, and apply the same method to each child node, until the information gain becomes very small or no features remain.
  • Pruning: pruning mainly prevents overfitting, to which decision trees are prone (a pruning sketch appears in the code section below).
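To make the three criteria concrete, here is a minimal sketch in plain Python that computes them for the age attribute of the table above. The class counts come directly from the 14-row table; everything else is illustrative, not from the original post.

from math import log2
from collections import Counter

# Class labels of the 14 samples (the buys_computer column)
labels = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
          'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
# Values of the candidate attribute "age" for the same samples
age = ['youth', 'youth', 'middle_aged', 'senior', 'senior', 'senior',
       'middle_aged', 'youth', 'youth', 'senior', 'youth', 'middle_aged',
       'middle_aged', 'senior']

def entropy(ys):
    # H(D) = -sum_k p_k * log2(p_k)
    n = len(ys)
    return -sum((c / n) * log2(c / n) for c in Counter(ys).values())

def gini(ys):
    # Gini(D) = 1 - sum_k p_k^2
    n = len(ys)
    return 1 - sum((c / n) ** 2 for c in Counter(ys).values())

# Partition the labels by the attribute's values
parts = {}
for a, y in zip(age, labels):
    parts.setdefault(a, []).append(y)
n = len(labels)

# ID3: information gain g(D, A) = H(D) - H(D|A)
gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())

# C4.5: gain ratio g_R(D, A) = g(D, A) / H_A(D)
split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
gain_ratio = gain / split_info

# CART-style impurity: weighted Gini after splitting on A (CART proper
# searches over binary partitions; this multiway version is illustrative)
gini_index = sum(len(p) / n * gini(p) for p in parts.values())

print(f"gain={gain:.3f} gain_ratio={gain_ratio:.3f} gini={gini_index:.3f}")
# gain should come out to about 0.246, the classic value for this dataset

Running the same computation for income, student and credit_rating and keeping the attribute with the best score is exactly the per-node step described in the generation bullet above.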

Advantages and disadvantages of decision tree

Advantages

  • Decision trees are easy to understand and interpret, can be analyzed visually, and rules can easily be extracted from them;
  • They can handle nominal and numerical data at the same time;
  • They are well suited to samples with missing attributes;
  • They can handle irrelevant features;
  • They run relatively fast when testing on a dataset;
  • They can produce feasible, effective results on large data sources in a relatively short time.

Disadvantages

  • Overfitting occurs easily;
  • Correlations between attributes in the dataset are easily ignored;
  • For data with different sample counts per category, different criteria lead to different biases in attribute selection: the information gain criterion (used by ID3) prefers attributes with many distinct values, while the gain ratio criterion (used by C4.5) prefers attributes with few distinct values. For this reason C4.5 does not simply apply the gain ratio directly when splitting, but uses a heuristic rule: first keep the attributes whose information gain is above average, then pick the one with the highest gain ratio (this trade-off exists whenever information gain is used);
  • When ID3 computes information gain, the result is biased towards features with more distinct values.

Generate decision tree for example code  

# Import packages
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree
from sklearn import preprocessing
import csv

# Read in the data
Dtree = open('AllElectronics.csv', 'r')
reader = csv.reader(Dtree)

# Get the header row
headers = next(reader)
# print(headers)

# Define two lists
featureList = []
labelList = []

for row in reader:
    # Save the label (last column) in the label list
    labelList.append(row[-1])
    rowDict = {}
    # Skip column 0 (the RID) and the last column (the label)
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]  # build a feature dictionary for this row
    featureList.append(rowDict)  # store the feature dictionary in the list
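After this loop, each entry of featureList should be a dictionary such as {'age': 'youth', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'} (assuming the CSV columns match the table above), which is exactly the input format DictVectorizer expects.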

Because our features are all categorical strings, we need to process the data here and convert it into a 0/1 (one-hot) format:

# Convert the feature dictionaries to a 0/1 representation
vec = DictVectorizer()
x_data = vec.fit_transform(featureList).toarray()
print("x_data:" + str(x_data))

# Print the derived feature names
print(vec.get_feature_names())
print("labelList:" + str(labelList))

# Convert the labels to a 0/1 representation
lb = preprocessing.LabelBinarizer()
y_data = lb.fit_transform(labelList)
print("y_data:" + str(y_data))

In the converted data, 1 indicates that the sample has that attribute value and 0 indicates that it does not (the original screenshot of the output is omitted here).
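For example, DictVectorizer sorts the derived feature names alphabetically (age=middle_aged, age=senior, age=youth, credit_rating=excellent, credit_rating=fair, income=high, income=low, income=medium, student=no, student=yes), so the first sample (youth, high, no, fair) should encode to [0 0 1 0 1 1 0 0 1 0].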

# Create the decision tree model. criterion defaults to the Gini index;
# setting it to entropy selects the ID3-style information gain criterion.
model = tree.DecisionTreeClassifier(criterion='entropy')
# Fit the model to the data
model.fit(x_data, y_data)
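Since overfitting was listed above as the main weakness of decision trees, here is a hedged sketch of how pruning could be added in sklearn. The parameter values are illustrative assumptions, not tuned settings from the original post.

# Pre-pruning (assumed illustrative values): stop growing the tree early
pruned_model = tree.DecisionTreeClassifier(criterion='entropy',
                                           max_depth=3,         # cap the depth
                                           min_samples_leaf=2)  # min samples per leaf
pruned_model.fit(x_data, y_data)

# Post-pruning: cost-complexity pruning (scikit-learn >= 0.22); any alpha
# in ccp_alphas can be passed back as ccp_alpha=... to prune the fitted tree
path = pruned_model.cost_complexity_pruning_path(x_data, y_data)
print(path.ccp_alphas)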

  At this point the model has been trained and the core implementation is complete, but we still need to print out the decision tree, as shown in the following code.

There are a few things you need to do before running it:

  1. First open the command prompt and enter: pip install graphviz
  2. You need to download the Graphviz drawing package installer: Graphviz
  3. Download and install the version matching your computer's operating system. After installation, add the bin directory under the install location to the environment variables, as follows:

Find the Graphviz bin path and copy it; then open This PC, right-click, select Properties, and add the copied path to the system PATH environment variable (the original screenshots are omitted here).

  After that, you can run the following code. If it still reports an error, restart the software you use to edit the code so that it picks up the new environment variable.

import graphviz

dot_data = tree.export_graphviz(model,
                                out_file=None,
                                feature_names=vec.get_feature_names(),
                                class_names=lb.classes_,
                                filled=True,
                                rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('computer')

The result is the generated decision tree; graph.render('computer') writes it to a file named computer.pdf by default (the figure of the tree is omitted here).
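As a quick sanity check (assumed usage, not part of the original post), the trained tree can be asked to classify one of the encoded samples:

# Predict the class of the first encoded sample and compare with its label
prediction = model.predict(x_data[0:1])
print("predicted:", lb.inverse_transform(prediction.reshape(1, -1)))
print("actual   :", labelList[0])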
