Decision tree algorithms (ID3, C4.5 and CART) and their implementation with scikit-learn

1, ID3 algorithm

1. Pseudo code

ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If number of predicting attributes is empty, then Return the single node tree Root,
    with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root

We will not implement this pseudocode directly; instead, we will use the sklearn library below.
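Still, the key step of the pseudocode, choosing "the attribute that best classifies examples", is easy to sketch in plain Python using information gain. This is only an illustration; the function names and the list-of-dicts data layout are our own, not part of the original post.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([ex[target] for ex in examples])
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

def best_attribute(examples, attributes, target):
    """The attribute ID3 would branch on: the one with maximal information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a, target))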

2. Disadvantages

  • Information gain is biased towards attributes with many values. For example, if an attribute takes a different value for almost every sample (in the extreme case, a unique ID per sample), splitting on it yields a very large information gain, even though such a split is useless for prediction (see the small demonstration after this list).
  • The ID3 algorithm cannot handle attributes with continuous values.
  • The ID3 algorithm cannot handle samples with missing values for attributes.
  • The algorithm tends to grow deep trees, which makes overfitting likely.
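A quick, self-contained demonstration of the first point on a hypothetical four-sample toy set: an ID-like attribute that is unique per sample achieves the maximal information gain even though it says nothing about new data.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

# Toy data: 4 samples, 2 classes
target = ['yes', 'yes', 'no', 'no']
# A "real" binary attribute that separates the classes imperfectly
colour = ['dark', 'dark', 'dark', 'light']
# An ID-like attribute: a unique value per sample
sample_id = ['s1', 's2', 's3', 's4']

def gain(attribute):
    remainder = 0.0
    for v in set(attribute):
        subset = [t for a, t in zip(attribute, target) if a == v]
        remainder += len(subset) / len(target) * entropy(subset)
    return entropy(target) - remainder

print(gain(colour))     # ~0.31: removes only part of the uncertainty
print(gain(sample_id))  # 1.0: maximal gain, yet the split is useless for new data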

3. Implementation code

1. Import module

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

2. Read data

data = pd.read_csv('./Watermelon dataset.csv')
data
    color and lustre  Root             Knock         texture         Umbilicus         Tactile sensation  Good melon
0   dark green        Curl up          Turbid sound  clear           sunken            Hard slip          yes
1   Black             Curl up          Dull          clear           sunken            Hard slip          yes
2   Black             Curl up          Turbid sound  clear           sunken            Hard slip          yes
3   dark green        Curl up          Dull          clear           sunken            Hard slip          yes
4   plain             Curl up          Turbid sound  clear           sunken            Hard slip          yes
5   dark green        Slightly curled  Turbid sound  clear           Slightly concave  Soft sticky        yes
6   Black             Slightly curled  Turbid sound  Slightly paste  Slightly concave  Soft sticky        yes
7   Black             Slightly curled  Turbid sound  clear           Slightly concave  Hard slip          yes
8   Black             Slightly curled  Dull          Slightly paste  Slightly concave  Hard slip          no
9   dark green        Stiff            Crisp         clear           flat              Soft sticky        no
10  plain             Stiff            Crisp         vague           flat              Hard slip          no
11  plain             Curl up          Turbid sound  vague           flat              Soft sticky        no
12  dark green        Slightly curled  Turbid sound  Slightly paste  sunken            Hard slip          no
13  plain             Slightly curled  Dull          Slightly paste  sunken            Hard slip          no
14  Black             Slightly curled  Turbid sound  clear           Slightly concave  Soft sticky        no
15  plain             Curl up          Turbid sound  vague           flat              Hard slip          no
16  dark green        Curl up          Dull          Slightly paste  Slightly concave  Hard slip          no
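If the Watermelon dataset.csv file is not at hand, an equivalent DataFrame can be built directly from the table above (a convenience sketch; the column names follow the CSV, and the exact integer codes LabelEncoder assigns later depend on the sort order of these value strings, so they may differ from the encoded table shown in the next step):

import pandas as pd

columns = ['color and lustre', 'Root', 'Knock', 'texture',
           'Umbilicus', 'Tactile sensation', 'Good melon']
rows = [
    ['dark green', 'Curl up',         'Turbid sound', 'clear',          'sunken',           'Hard slip',   'yes'],
    ['Black',      'Curl up',         'Dull',         'clear',          'sunken',           'Hard slip',   'yes'],
    ['Black',      'Curl up',         'Turbid sound', 'clear',          'sunken',           'Hard slip',   'yes'],
    ['dark green', 'Curl up',         'Dull',         'clear',          'sunken',           'Hard slip',   'yes'],
    ['plain',      'Curl up',         'Turbid sound', 'clear',          'sunken',           'Hard slip',   'yes'],
    ['dark green', 'Slightly curled', 'Turbid sound', 'clear',          'Slightly concave', 'Soft sticky', 'yes'],
    ['Black',      'Slightly curled', 'Turbid sound', 'Slightly paste', 'Slightly concave', 'Soft sticky', 'yes'],
    ['Black',      'Slightly curled', 'Turbid sound', 'clear',          'Slightly concave', 'Hard slip',   'yes'],
    ['Black',      'Slightly curled', 'Dull',         'Slightly paste', 'Slightly concave', 'Hard slip',   'no'],
    ['dark green', 'Stiff',           'Crisp',        'clear',          'flat',             'Soft sticky', 'no'],
    ['plain',      'Stiff',           'Crisp',        'vague',          'flat',             'Hard slip',   'no'],
    ['plain',      'Curl up',         'Turbid sound', 'vague',          'flat',             'Soft sticky', 'no'],
    ['dark green', 'Slightly curled', 'Turbid sound', 'Slightly paste', 'sunken',           'Hard slip',   'no'],
    ['plain',      'Slightly curled', 'Dull',         'Slightly paste', 'sunken',           'Hard slip',   'no'],
    ['Black',      'Slightly curled', 'Turbid sound', 'clear',          'Slightly concave', 'Soft sticky', 'no'],
    ['plain',      'Curl up',         'Turbid sound', 'vague',          'flat',             'Hard slip',   'no'],
    ['dark green', 'Curl up',         'Dull',         'Slightly paste', 'Slightly concave', 'Hard slip',   'no'],
]
data = pd.DataFrame(rows, columns=columns)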

3. Data coding

# Create a LabelEncoder object to turn the categorical features into integer codes
label = LabelEncoder()

# Encode each feature column (every column except the target 'Good melon')
for col in data.columns[:-1]:
    data[col] = label.fit_transform(data[col])
data
    color and lustre  Root  Knock  texture  Umbilicus  Tactile sensation  Good melon
0   2                 2     1      1        0          0                  yes
1   0                 2     0      1        0          0                  yes
2   0                 2     1      1        0          0                  yes
3   2                 2     0      1        0          0                  yes
4   1                 2     1      1        0          0                  yes
5   2                 1     1      1        2          1                  yes
6   0                 1     1      2        2          1                  yes
7   0                 1     1      1        2          0                  yes
8   0                 1     0      2        2          0                  no
9   2                 0     2      1        1          1                  no
10  1                 0     2      0        1          0                  no
11  1                 2     1      0        1          1                  no
12  2                 1     1      2        0          0                  no
13  1                 1     0      2        0          0                  no
14  0                 1     1      1        2          1                  no
15  1                 2     1      0        1          0                  no
16  2                 2     0      2        2          0                  no
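Note that fit_transform is re-fitted on every column, so the single label object only remembers the mapping of the last column it processed. If the original category names need to be recovered later, one encoder per column can be kept instead, as an alternative to the loop above (the encoders dict is our own naming):

from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in data.columns[:-1]:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

# Recover the original category names of a column when needed
print(encoders['Root'].inverse_transform(data['Root']))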

4. Fitting with sklearn

# Use entropy (information gain) as the split criterion, the measure used by ID3
dtc = DecisionTreeClassifier(criterion='entropy')
# Fit on the encoded features (all columns except the last) and the target column
dtc.fit(data.iloc[:, :-1].values, data.iloc[:, -1].values)
# Predict a new sample, given as the integer codes of its six features
result = dtc.predict([[1, 1, 1, 1, 0, 0]])
# Prediction result
result
array(['yes'], dtype=object)
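To inspect the tree that was learned, sklearn can print its split rules as text with sklearn.tree.export_text (the feature names are taken from the data columns):

from sklearn.tree import export_text

print(export_text(dtc, feature_names=list(data.columns[:-1])))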

2, C4.5 algorithm

The general idea of the C4.5 algorithm is similar to that of ID3: both classify by constructing a decision tree. The difference lies in how branch attributes are chosen: ID3 uses information gain as the measure, while C4.5 uses the information gain ratio.
[Figure: definition of the information gain ratio (original image unavailable)]
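For reference, the standard C4.5 definition of the gain ratio (the quantity the missing figure presumably showed) is:

\[
\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad
\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}
\]

where V is the number of distinct values of attribute a and D^v is the subset of D taking value v.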
It can be seen from this formula that when the number of distinct values V is large, IV(a) grows and the gain ratio shrinks, which to some extent counteracts ID3's bias towards attributes with many values. Note that sklearn does not provide a C4.5 implementation: its DecisionTreeClassifier is based on CART, and criterion='entropy' only changes the split measure.

3, CART algorithm

The CART algorithm constructs a binary decision tree. After the tree is built, it is usually pruned so that it generalises better to unseen data. When growing the tree, CART selects features using the Gini index.
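In sklearn, pruning of this kind is available through cost-complexity pruning; a minimal sketch (the ccp_alpha value below is only an illustration):

from sklearn.tree import DecisionTreeClassifier

X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Candidate pruning strengths computed by sklearn for this training set
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# A larger ccp_alpha prunes more aggressively (0.0 means no pruning)
pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)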

1. Gini index

[Figure: definition of the Gini index (original image unavailable)]
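For reference, the standard definition the missing figure presumably showed:

\[
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad
\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)
\]

where p_k is the proportion of class k in D, and D^v is the subset of D with value v for attribute a.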

2. CART fitting

# Use the default Gini criterion, i.e. fit a CART tree
dtc = DecisionTreeClassifier()
# Fit on the encoded features and the target column
dtc.fit(data.iloc[:, :-1].values, data.iloc[:, -1].values)
# Predict the same encoded sample as before
result = dtc.predict([[1, 1, 1, 1, 0, 0]])
# Prediction result
result
array(['yes'], dtype=object)
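The fitted tree can also be drawn with sklearn's plotting helper (this assumes matplotlib is installed):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(10, 6))
plot_tree(dtc,
          feature_names=list(data.columns[:-1]),
          class_names=list(dtc.classes_),
          filled=True)
plt.show()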

