[machine learning] contact lens selection based on a decision tree

Experimental introduction

1. Experimental contents

In this experiment we learn and implement the decision tree algorithm.

2. Experimental objectives

Through this experiment, master the basic principles of the decision tree algorithm.

3. Experimental knowledge points

  • Shannon entropy
  • information gain

4. Experimental environment

  • python 3.6.5

5. Preparatory knowledge

  • Fundamentals of Python Programming

Preparation

  Click the "download experimental data" module at the top right of the screen and download decision_tree_glass.tgz to the specified directory. Then select File -> Open -> Upload in sequence, upload the compressed data set package you just downloaded, and decompress it with the following command:

!tar -zxvf decision_tree_glass.tgz
decision_tree_glass/
decision_tree_glass/lenses.txt
decision_tree_glass/classifierStorage.txt


Decision tree construction - ID3 algorithm

   The core of the ID3 algorithm is to select, at each node of the decision tree, the feature chosen by the information gain criterion and to construct the tree recursively. The specific method is as follows: starting from the root node, calculate the information gain of every candidate feature for that node and select the feature with the largest information gain as the feature of the node, then create child nodes according to the different values of that feature; apply the same procedure recursively to each child node; stop when the information gain of every remaining feature is small or no feature is left to choose. The result is a decision tree. ID3 is equivalent to selecting a probabilistic model by maximum likelihood estimation.
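For reference, the quantities used by this criterion (and computed by the code in the exercises below) are the empirical entropy of a data set D with K classes, the empirical conditional entropy given a feature A with V distinct values, and the information gain:

H(D)   = - sum_{k=1..K} (|C_k| / |D|) * log2(|C_k| / |D|)
H(D|A) =   sum_{v=1..V} (|D_v| / |D|) * H(D_v)
g(D,A) = H(D) - H(D|A)

Here C_k is the set of samples belonging to class k, and D_v is the subset of D on which feature A takes its v-th value.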
According to the results of the decision tree experiment, feature A3 (has own house) has the largest information gain, so A3 is selected as the feature of the root node. It divides the training set D into two subsets, D1 (A3 = "yes") and D2 (A3 = "no"). Since D1 contains only sample points of the same class, it becomes a leaf node and the class of that node is marked "yes". For D2, a new feature must be selected from A1 (age), A2 (has a job) and A4 (credit situation) by calculating the information gain of each:

   According to the calculation, feature A2 (has a job), which has the largest information gain, is selected as the feature of this node. Since A2 has two possible values, two child nodes are derived: the child node corresponding to "yes" (has a job) contains three samples that all belong to the same class, so it is a leaf node marked "yes"; the other child node, corresponding to "no" (no job), contains six samples that also belong to the same class, so it is likewise a leaf node, marked "no".
In this way, a decision tree is generated that uses only two features (two internal nodes); its structure corresponds to the nested dictionary shown in the next section.

[exercise] decision tree construction - write code to build a decision tree

We use a dictionary to store the structure of the decision tree. For example, the decision tree we analyzed in the previous section can be expressed as:
{'Have your own house': {0: {'Have a job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
Create the function majorityCnt to count the most frequent class label in classList, and the function createTree to build the decision tree recursively. The code is as follows:

# -*- coding: UTF-8 -*-
from math import log
import operator
"""
Function description:Calculate the empirical entropy of a given data set(Shannon entropy)
Parameters:
    dataSet - data set
Returns:
    shannonEnt - Empirical entropy(Shannon entropy)
"""
def calcShannonEnt(dataSet):
    ### Start Code Here ###                      
    numEntires = len(dataSet)                        #Returns the number of rows in the dataset
    labelCounts = {}                                #Save a dictionary of the number of occurrences of each label
    for featVec in dataSet:                            #Each group of eigenvectors is counted
        currentLabel = featVec[-1]                    #Extract label information
        if currentLabel not in labelCounts.keys():    #If the label is not put into the dictionary of statistical times, add it
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1                #Label count
    shannonEnt = 0.0                                #Empirical entropy (Shannon entropy)
    for key in labelCounts:                            #Calculate Shannon entropy
        prob = float(labelCounts[key]) / numEntires    #The probability of selecting the label
        shannonEnt -= prob * log(prob, 2)            #Calculation by formula
    return shannonEnt                                #Return empirical entropy (Shannon entropy)
    ### End Code Here ###
    
"""
Function description:Create test dataset
Parameters:
    nothing
Returns:
    dataSet - data set
    labels - Feature label
"""
def createDataSet():
    dataSet = [[0, 0, 0, 0, 'no'],                        #data set
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    labels = ['Age', 'Have a job', 'Have your own house', 'Credit situation']        #Feature label
    return dataSet, labels                             #Return dataset and classification properties
"""
Function description:Divide the data set according to the given characteristics
Parameters:
    dataSet - Data set to be divided
    axis - Characteristics of partitioned data sets
    value - The value of the feature to be returned
Returns:
    retDataSet - Partitioned data set
"""
def splitDataSet(dataSet, axis, value):       
    retDataSet = []                                        #Create a list of returned datasets
    for featVec in dataSet:                             #Traversal dataset
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]                #Keep the features before the axis-th feature
            reducedFeatVec.extend(featVec[axis+1:])     #Append the features after it, i.e. remove the axis-th feature
            retDataSet.append(reducedFeatVec)           #Add the qualifying sample to the returned data set
    return retDataSet                                      #Returns the partitioned dataset

"""
Function description:Select the optimal feature
Parameters:
    dataSet - data set
Returns:
    bestFeature - Index value of the feature with the largest information gain (the optimal feature)
"""
def chooseBestFeatureToSplit(dataSet):
    ### Start Code Here ###
    numFeatures = len(dataSet[0]) - 1                    #Number of features
    baseEntropy = calcShannonEnt(dataSet)                 #Calculate Shannon entropy of data set
    bestInfoGain = 0.0                                  #information gain 
    bestFeature = -1                                    #Index value of optimal feature
    for i in range(numFeatures):                         #Traverse all features
        #Collect the values of the i-th feature for every sample
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)                         #Create set set {}, elements cannot be repeated
        newEntropy = 0.0                                  #Empirical conditional entropy
        for value in uniqueVals:                         #Calculate information gain
            subDataSet = splitDataSet(dataSet, i, value)         #The subset of the subDataSet after partition
            prob = len(subDataSet) / float(len(dataSet))           #Calculate the probability of subsets
            newEntropy += prob * calcShannonEnt(subDataSet)     #The empirical conditional entropy is calculated according to the formula
        infoGain = baseEntropy - newEntropy                     #information gain 
        print("The first%d The gain of each feature is%.3f" % (i, infoGain))            #Print information gain for each feature
        if (infoGain > bestInfoGain):                             #Calculate information gain
            bestInfoGain = infoGain                             #Update the information gain to find the maximum information gain
            bestFeature = i                                     #Record the index value of the feature with the largest information gain
    return bestFeature                                             #Returns the index value of the feature with the largest information gain
    ### End Code Here ###
"""
Function description:Count the most frequent element (class label) in classList
Parameters:
    classList - Class label list
Returns:
    sortedClassCount[0][0] - The element (class label) that appears most often
"""
def majorityCnt(classList):
    classCount = {}
    for vote in classList:                                        #Count the number of occurrences of each element in the classList
        if vote not in classCount.keys():classCount[vote] = 0   
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)        #Sort by dictionary values in descending order
    return sortedClassCount[0][0]                                #Returns the most frequent element in the classList
"""
Function description:Create decision tree
Parameters:
    dataSet - Training data set
    labels - Classification attribute label
    featLabels - Store the selected optimal feature label
Returns:
    myTree - Decision tree
"""
def createTree(dataSet, labels, featLabels):
    #Take the classification label (lending or not: yes or no)
    classList = [example[-1] for example in dataSet] 
    #If the categories are exactly the same, stop dividing
    if classList.count(classList[0]) == len(classList):            
        return classList[0]
    #When all features are traversed, the class label with the most occurrences is returned
    if len(dataSet[0]) == 1:                                    
        return majorityCnt(classList)
    #Select the optimal feature
    bestFeat = chooseBestFeatureToSplit(dataSet) 
    #Optimal feature label
    bestFeatLabel = labels[bestFeat]                            
    featLabels.append(bestFeatLabel)
    #Label spanning tree based on optimal features
    myTree = {bestFeatLabel:{}}   
    #Delete used feature labels
    del(labels[bestFeat])                    
    #The attribute values of all optimal features in the training set are obtained
    featValues = [example[bestFeat] for example in dataSet] 
    #Remove duplicate attribute values
    uniqueVals = set(featValues) 
    #Traverse the features and create a decision tree.
    for value in uniqueVals:                                                           
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), labels, featLabels)
    return myTree


if __name__ == '__main__':
    dataSet, labels = createDataSet()
    featLabels = []
    myTree = createTree(dataSet, labels, featLabels)
    print(myTree)
The gain of feature 0 is 0.083
The gain of feature 1 is 0.324
The gain of feature 2 is 0.420
The gain of feature 3 is 0.363
The gain of feature 0 is 0.252
The gain of feature 1 is 0.918
The gain of feature 2 is 0.474
{'Have your own house': {0: {'Have a job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}

   When creating a decision tree recursively, there are two termination conditions: the first is that all class labels in the current subset are identical, in which case that class label is returned directly; the second is that all features have been used but the data still cannot be divided into groups containing a single class, which means the construction cannot be completed because the features, and hence the dimensionality of the data, are insufficient. Since the second condition cannot simply return a unique class label, the class with the largest count is chosen as the return value.

[exercise] classification using decision tree

  After constructing the decision tree from the training data, we can use it to classify real data. To classify a sample we need the decision tree and the feature labels used to build it. The program compares the test data with the values stored in the decision tree and repeats this process recursively until a leaf node is reached; the test data is then assigned the class of that leaf node. In the tree-building code there is a featLabels parameter, which records the feature chosen at each split node. When using the decision tree for prediction, we only have to supply the values of those features, in order. For example, to classify with the decision tree trained in the previous section we only need to provide two pieces of information: whether the person has a house and whether he has a job; no redundant information is required.
The code for classification with the decision tree is very simple. The code is as follows:

# -*- coding: UTF-8 -*-

"""
Function description:Classification using decision tree
Parameters:
    inputTree - Generated decision tree
    featLabels - Store the selected optimal feature label
    testVec - Test data list, the order corresponds to the optimal feature label
Returns:
    classLabel - Classification results
"""
def classify(inputTree, featLabels, testVec):
    ### Start Code Here ###
    #Implement classification function
    firstStr = list(inputTree.keys())[0]       #Feature stored at the current node of the tree
    secondDict = inputTree[firstStr]           #Subtree (or leaf values) under that feature
    featIndex = featLabels.index(firstStr)      #Index of this feature in featLabels / testVec
    for key in secondDict.keys():
        if(testVec[featIndex] == key):
            if(type(secondDict[key]).__name__ == 'dict'):       #If the value is still a dictionary, recurse into the subtree
                classlabel = classify(secondDict[key],featLabels,testVec)
            else:
                classlabel = secondDict[key]                    #Otherwise it is a leaf node: take its class label
    return classlabel
    ### End Code Here ###
    
if __name__ == '__main__':
    dataSet, labels = createDataSet()
    featLabels = []
    myTree = createTree(dataSet, labels, featLabels)
    testVec = [0,1]                                        #test data
    result = classify(myTree, featLabels, testVec)
    if result == 'yes':
        print('lending')
    if result == 'no':
        print('No lending')
The gain of feature 0 is 0.083
The gain of feature 1 is 0.324
The gain of feature 2 is 0.420
The gain of feature 3 is 0.363
The gain of feature 0 is 0.252
The gain of feature 1 is 0.918
The gain of feature 2 is 0.474
lending

  Only the classify function is added here for decision tree classification. The test data [0,1] represents an applicant who has no house but has a job. Other combinations can be classified in the same way, as the short sketch below shows.
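A minimal sketch (assuming the createDataSet, createTree and classify functions defined in the exercises above are available in the same session) that classifies a few more hypothetical applicants; the expected results follow directly from the tree printed earlier:

# Minimal sketch, assuming createDataSet, createTree and classify from the exercises above.
dataSet, labels = createDataSet()
featLabels = []
myTree = createTree(dataSet, labels, featLabels)
# For this data set featLabels ends up as ['Have your own house', 'Have a job'],
# so each test vector is [has house, has job].
for testVec in [[0, 0], [0, 1], [1, 0]]:
    result = classify(myTree, featLabels, testVec)
    print(testVec, '->', 'lending' if result == 'yes' else 'No lending')
# Expected output: [0, 0] -> No lending, [0, 1] -> lending, [1, 0] -> lending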

[exercise] predicting contact lens types based on decision tree - using Sklearn

  Once we understand how the decision tree works, we can help people determine which type of lens to wear. The contact lens data set is a very well-known data set; it contains many observations of patients' eye conditions together with the type of contact lens recommended by doctors. The contact lens types are hard, soft and no lenses.
There are 24 groups of data in the data set. The labels of the data are age, prescript, astigmatic, tearRate and class; that is, the first column is age, the second column is the prescription, the third column is whether the patient is astigmatic, the fourth column is the tear production rate, and the fifth column is the final classification label. The raw data is stored in lenses.txt and is printed by the code below.

  Next, let's look at how to use Sklearn to build a decision tree. The sklearn.tree module provides decision tree models for both classification and regression problems.

  We use DecisionTreeClassifier to build the decision tree. This function has 12 parameters:

The parameters are described as follows:
criterion: feature selection criterion. Optional parameter. The default is gini and it can be set to entropy. gini stands for Gini impurity, the expected error rate of randomly assigning a result drawn from the set to one of its items; entropy is Shannon entropy.
splitter: split point selection strategy. Optional parameter. The default is best and it can be set to random; it controls how each node is split. With best, the best split feature is chosen according to the criterion (gini or entropy); with random, a locally optimal split point is found among a random subset of candidate split points. The default best is suitable when the sample size is small; if the sample size is very large, random is recommended for building the decision tree.
  max_features: the maximum number of features to consider when splitting. Optional parameter. The default is None. The number of features considered when looking for the best split (n_features is the total number of features) is determined as follows:
    If max_features is an integer, consider max_features features;
    If max_features is a float, consider int(max_features * n_features) features;
    If max_features is set to "auto", then max_features = sqrt(n_features);
    If max_features is set to "sqrt", then max_features = sqrt(n_features), the same as "auto";
    If max_features is set to "log2", then max_features = log2(n_features);
    If max_features is set to None, then max_features = n_features, that is, all features are used.
Generally speaking, if the number of sample features is small, such as less than 50, we can use the default "None". If the number of features is very large, we can flexibly use other values just described to control the maximum number of features considered in the division, so as to control the generation time of the decision tree.
  max_depth: the maximum depth of the decision tree. Optional parameter. The default value is None. This parameter is the number of levels of the tree (in the loan example above, for instance, the depth of the decision tree is 2). If it is set to None, the depth of the subtrees is not limited during construction. Generally this value can be ignored when there are few samples or features; alternatively, if min_samples_split is set, splitting continues until a node has fewer than min_samples_split samples. If the model has many samples and many features, it is recommended to limit the maximum depth; the specific value depends on the data distribution, and common values are 10-100.
  min_samples_split: the minimum number of samples required to split an internal node. Optional parameter. The default value is 2. This value limits the conditions under which a subtree may be split further. If min_samples_split is an integer, it is used directly as the minimum number of samples: a node with fewer than min_samples_split samples is not split. If min_samples_split is a float, it is a percentage, and ceil(min_samples_split * n_samples) samples (rounded up) are required. If the sample size is small, this value does not need to be tuned; if the sample size is very large, it is recommended to increase it.
  min_samples_leaf: the minimum number of samples in a leaf node. Optional parameter. The default value is 1. This value limits the minimum number of samples of a leaf node; if a leaf would contain fewer samples than this, it is pruned together with its sibling. In other words, it specifies how many samples are needed to form a leaf; if set to 1, a leaf is kept even when its class has only one sample. If min_samples_leaf is an integer, it is used directly as the minimum number of samples; if it is a float, it is a percentage and ceil(min_samples_leaf * n_samples) samples (rounded up) are required. If the sample size is small, this value does not need to be tuned; if the sample size is very large, it is recommended to increase it.
  class_weight: category weight. Optional parameter. The default is None. It can also be dictionary, dictionary list and balanced. Specifying the weight of each category of samples is mainly to prevent too many samples in some categories of the training set, resulting in the training decision tree being too biased towards these categories. The weight of the category can be given in the format of {class_label: weight}. Here, you can specify the weight of each sample or use balanced. If balanced is used, the algorithm will calculate the weight itself, and the sample weight corresponding to the category with small sample size will be high. Of course, if there is no obvious bias in your sample category distribution, you can choose the default None regardless of this parameter.
  random_state: optional parameter. The default is None. This is the random number seed. If it is an integer, random_state is used as the seed of the random number generator; if no seed is set, the random numbers depend on the current system time and differ on every run, whereas with a fixed seed the same random numbers are generated every time. If it is a RandomState instance, random_state is the random number generator itself; if None, the generator is np.random.
  min_impurity_split: the minimum impurity for node splitting. Optional parameter. The default is 1e-7. This threshold limits the growth of the decision tree: if the impurity of a node (Gini index, information gain, mean squared error or mean absolute error) is below this threshold, the node is not split further and becomes a leaf node.
  min_weight_fraction_leaf: the minimum weighted fraction of samples in a leaf node. Optional parameter. The default value is 0. This value limits the minimum of the sum of all sample weights in a leaf node; if the sum is below this value, the leaf is pruned together with its sibling. Generally, if many samples have missing values, or the class distribution of the classification tree is very skewed, sample weights are introduced and this value should then be watched.
  max_leaf_nodes: the maximum number of leaf nodes. Optional parameter. The default is None. Limiting the maximum number of leaf nodes can prevent overfitting; if a limit is given, the algorithm builds the optimal decision tree within that number of leaf nodes. If there are few features this value can be ignored, but if there are many features it can be limited, with the specific value obtained through cross validation.
presort: whether to pre-sort the data. Optional parameter. The default value is False (no pre-sorting). Generally speaking, if the sample size is small or the depth of the tree is limited to a small value, setting it to True can speed up the selection of split points and the construction of the decision tree.
In addition to these parameters, other precautions during parameter adjustment are:
1) When there are few samples but a very large number of features, the decision tree is likely to overfit. Generally speaking, a robust model is easier to build when there are more samples than features.
2) If there are few samples but very many features, it is recommended to reduce the dimensionality before fitting the decision tree model, for example with principal component analysis (PCA), feature selection (e.g. Lasso) or independent component analysis (ICA); the feature dimension will then be much smaller and the decision tree will fit better.
3) Make frequent use of decision tree visualization, and limit the depth of the tree first, so that you can observe the preliminary fit of the generated tree to the data before deciding whether to increase the depth.
4) When training the model, pay attention to the class balance of the samples (mainly for classification trees). If the class distribution is very uneven, consider using class_weight to keep the model from being biased towards the classes with many samples.
5) Internally the decision tree uses numpy float32 arrays. If the training data is not in this format, the algorithm copies it first and then runs.
6) If the sample matrix is sparse, it is recommended to convert it to a sparse csc_matrix before fitting and to a sparse csr_matrix before prediction.
sklearn.tree.DecisionTreeClassifier() also provides a number of methods for us to use, such as fit(), predict(), predict_proba() and score().
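A minimal usage sketch follows (with an invented toy data set X, y; in the exercise below the encoded lenses DataFrame and the lenses_target list play these roles), showing how a few of the parameters described above are passed to the constructor and how these methods are called:

# Minimal sketch of DecisionTreeClassifier usage; X and y are invented toy data.
from sklearn import tree

X = [[0, 0], [1, 1], [1, 0], [0, 1]]          #numeric feature matrix
y = ['no', 'yes', 'yes', 'no']                #class labels

clf = tree.DecisionTreeClassifier(
    criterion='entropy',          #use Shannon entropy instead of the default gini
    max_depth=4,                  #limit the depth of the tree
    min_samples_split=2,          #minimum samples needed to split an internal node
    min_samples_leaf=1,           #minimum samples required at a leaf node
    random_state=0)               #fix the random seed for reproducibility
clf.fit(X, y)                     #build the decision tree from X and y
print(clf.predict([[1, 0]]))          #predicted class for a new sample
print(clf.predict_proba([[1, 0]]))    #class probabilities for that sample
print(clf.score(X, y))                #mean accuracy on the given data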

[exercise] predict contact lens types based on decision tree - write code

# -*- coding: UTF-8 -*-
from sklearn import tree
if __name__ == '__main__':
    fr = open('decision_tree_glass/lenses.txt')
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    print(lenses)
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    clf = tree.DecisionTreeClassifier()
    lenses = clf.fit(lenses, lensesLabels)
[['young', 'myope', 'no', 'reduced', 'no lenses'], ['young', 'myope', 'no', 'normal', 'soft'], ['young', 'myope', 'yes', 'reduced', 'no lenses'], ['young', 'myope', 'yes', 'normal', 'hard'], ['young', 'hyper', 'no', 'reduced', 'no lenses'], ['young', 'hyper', 'no', 'normal', 'soft'], ['young', 'hyper', 'yes', 'reduced', 'no lenses'], ['young', 'hyper', 'yes', 'normal', 'hard'], ['pre', 'myope', 'no', 'reduced', 'no lenses'], ['pre', 'myope', 'no', 'normal', 'soft'], ['pre', 'myope', 'yes', 'reduced', 'no lenses'], ['pre', 'myope', 'yes', 'normal', 'hard'], ['pre', 'hyper', 'no', 'reduced', 'no lenses'], ['pre', 'hyper', 'no', 'normal', 'soft'], ['pre', 'hyper', 'yes', 'reduced', 'no lenses'], ['pre', 'hyper', 'yes', 'normal', 'no lenses'], ['presbyopic', 'myope', 'no', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'no', 'normal', 'no lenses'], ['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'yes', 'normal', 'hard'], ['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'no', 'normal', 'soft'], ['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'yes', 'normal', 'no lenses']]



---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-8-79a6415291d2> in <module>
      7     lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
      8     clf = tree.DecisionTreeClassifier()
----> 9     lenses = clf.fit(lenses, lensesLabels)


/opt/conda/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    799             sample_weight=sample_weight,
    800             check_input=check_input,
--> 801             X_idx_sorted=X_idx_sorted)
    802         return self
    803 


/opt/conda/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):


/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525             try:
    526                 warnings.simplefilter('error', ComplexWarning)
--> 527                 array = np.asarray(array, dtype=dtype, order=order)
    528             except ComplexWarning:
    529                 raise ValueError("Complex data not supported\n"


/opt/conda/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 


ValueError: could not convert string to float: 'young'

  We can see that the program reports an error. Why? Because the fit() function cannot accept string data; as the printed information shows, our data is of string type. Before calling fit() we therefore need to encode the data set. Two methods can be used here:
LabelEncoder: converts each string category into an incremental integer value
OneHotEncoder: converts each category into a One-of-K (one-hot) binary vector
In order to encode the string data we first generate a pandas DataFrame, which makes the encoding work easier. The pipeline used here is: raw data -> dictionary -> pandas DataFrame.
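As a side note, if the one-hot style of encoding is preferred, the pandas function get_dummies accepts string columns directly; a minimal sketch with an invented toy DataFrame (not part of this experiment's code) is:

# Minimal sketch of one-hot encoding with pandas, as an alternative to LabelEncoder.
# df is an invented toy DataFrame; the lenses DataFrame built below could be encoded the same way.
import pandas as pd

df = pd.DataFrame({'age': ['young', 'pre', 'presbyopic'],
                   'astigmatic': ['no', 'yes', 'no']})
print(pd.get_dummies(df))        #indicator columns such as age_young, astigmatic_yes

The LabelEncoder-based approach used in this experiment is implemented by the following code: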

# -*- coding: UTF-8 -*-
import pandas as pd
if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                                        #load file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]        #process the file
    lenses_target = []                                                        #Extract the category of each group of data and save it in the list
    for each in lenses:
        lenses_target.append(each[-1])
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']            #Feature label       
    lenses_list = []                                                        #Save a temporary list of lens data
    lenses_dict = {}                                                        #A dictionary that holds lens data for generating pandas
    for each_label in lensesLabels:                                            #Extract information and generate a dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    print(lenses_dict)                                                        #Print dictionary information
    lenses_pd = pd.DataFrame(lenses_dict)                                    #Generate pandas.DataFrame
    print(lenses_pd)
{'age': ['young', 'young', 'young', 'young', 'young', 'young', 'young', 'young', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic'], 'prescript': ['myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper', 'myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper', 'myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper'], 'astigmatic': ['no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes'], 'tearRate': ['reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal']}
           age prescript astigmatic tearRate
0        young     myope         no  reduced
1        young     myope         no   normal
2        young     myope        yes  reduced
3        young     myope        yes   normal
4        young     hyper         no  reduced
5        young     hyper         no   normal
6        young     hyper        yes  reduced
7        young     hyper        yes   normal
8          pre     myope         no  reduced
9          pre     myope         no   normal
10         pre     myope        yes  reduced
11         pre     myope        yes   normal
12         pre     hyper         no  reduced
13         pre     hyper         no   normal
14         pre     hyper        yes  reduced
15         pre     hyper        yes   normal
16  presbyopic     myope         no  reduced
17  presbyopic     myope         no   normal
18  presbyopic     myope        yes  reduced
19  presbyopic     myope        yes   normal
20  presbyopic     hyper         no  reduced
21  presbyopic     hyper         no   normal
22  presbyopic     hyper        yes  reduced
23  presbyopic     hyper        yes   normal

  It can be seen from the output that the pandas DataFrame has been generated successfully.
  Next, encode the data; the code is as follows:

#!pip install pydotplus
# -*- coding: UTF-8 -*-
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import pydotplus
from sklearn.externals.six import StringIO
if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                                        #load file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]        #process the file
    lenses_target = []                                                        #Extract the category of each group of data and save it in the list
    for each in lenses:
        lenses_target.append(each[-1])
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']            #Feature label       
    lenses_list = []                                                        #Save a temporary list of lens data
    lenses_dict = {}                                                        #A dictionary that holds lens data for generating pandas
    for each_label in lensesLabels:                                            #Extract information and generate a dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print(lenses_dict)                                                        #Print dictionary information
    lenses_pd = pd.DataFrame(lenses_dict)                                    #Generate pandas.DataFrame
    print(lenses_pd)                                                        #Print pandas.DataFrame
    le = LabelEncoder()                                                        #Create a LabelEncoder() object for serialization            
    for col in lenses_pd.columns:                                            #Serialize for each column
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    print(lenses_pd)
           age prescript astigmatic tearRate
0        young     myope         no  reduced
1        young     myope         no   normal
2        young     myope        yes  reduced
3        young     myope        yes   normal
4        young     hyper         no  reduced
5        young     hyper         no   normal
6        young     hyper        yes  reduced
7        young     hyper        yes   normal
8          pre     myope         no  reduced
9          pre     myope         no   normal
10         pre     myope        yes  reduced
11         pre     myope        yes   normal
12         pre     hyper         no  reduced
13         pre     hyper         no   normal
14         pre     hyper        yes  reduced
15         pre     hyper        yes   normal
16  presbyopic     myope         no  reduced
17  presbyopic     myope         no   normal
18  presbyopic     myope        yes  reduced
19  presbyopic     myope        yes   normal
20  presbyopic     hyper         no  reduced
21  presbyopic     hyper         no   normal
22  presbyopic     hyper        yes  reduced
23  presbyopic     hyper        yes   normal
    age  prescript  astigmatic  tearRate
0     2          1           0         1
1     2          1           0         0
2     2          1           1         1
3     2          1           1         0
4     2          0           0         1
5     2          0           0         0
6     2          0           1         1
7     2          0           1         0
8     0          1           0         1
9     0          1           0         0
10    0          1           1         1
11    0          1           1         0
12    0          0           0         1
13    0          0           0         0
14    0          0           1         1
15    0          0           1         0
16    1          1           0         1
17    1          1           0         0
18    1          1           1         1
19    1          1           1         0
20    1          0           0         1
21    1          0           0         0
22    1          0           1         1
23    1          0           1         0

  As can be seen from the printed result, the data has been successfully encoded. Next, we can call fit() on this data to build a decision tree.

[exercise] prediction of contact lens types based on decision tree - Prediction

  After the decision tree has been built, we can make predictions: given your eye condition, age and other features, you can see which kind of contact lens suits you. The prediction result can be obtained with the following code:
print(clf.predict([[1,1,1,0]]))
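For reference, the meaning of the encoded vector [1, 1, 1, 0] can be read off the encoded table printed in the previous exercise; the small sketch below (the dictionaries simply restate that table) makes the mapping explicit:

# Decoding the test vector using the per-column encodings printed in the previous exercise.
decoded = {'age': {0: 'pre', 1: 'presbyopic', 2: 'young'},
           'prescript': {0: 'hyper', 1: 'myope'},
           'astigmatic': {0: 'no', 1: 'yes'},
           'tearRate': {0: 'normal', 1: 'reduced'}}
testVec = [1, 1, 1, 0]
print([decoded[label][value] for label, value in zip(['age', 'prescript', 'astigmatic', 'tearRate'], testVec)])
# ['presbyopic', 'myope', 'yes', 'normal'] -> the tree predicts 'hard' for this patient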
  the complete code is as follows:

# -*- coding: UTF-8 -*-
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.externals.six import StringIO
from sklearn import tree
import pandas as pd
import numpy as np
import pydotplus

if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                                        #load file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]        #process the file
    lenses_target = []                                                        #Extract the category of each group of data and save it in the list
    for each in lenses:
        lenses_target.append(each[-1])
    print(lenses_target)

    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']            #Feature label       
    lenses_list = []                                                        #Save a temporary list of lens data
    lenses_dict = {}                                                        #A dictionary that holds lens data for generating pandas
    for each_label in lensesLabels:                                            #Extract information and generate a dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print(lenses_dict)                                                        #Print dictionary information
    lenses_pd = pd.DataFrame(lenses_dict)                                    #Generate pandas.DataFrame
    # print(lenses_pd)                                                        #Print pandas.DataFrame
    le = LabelEncoder()                                                        #Create a LabelEncoder() object for serialization           
    for col in lenses_pd.columns:                                            #serialize
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    # print(lenses_pd)                                                        #Print coding information
    
    ### Start Code Here ###
    #Create a DecisionTreeClassifier() object
    clf = tree.DecisionTreeClassifier()
    #Use data to build a decision tree
    clf.fit(lenses_pd,lenses_target)
    ### End Code Here ###          
    dot_data = StringIO()
    tree.export_graphviz(clf, out_file = dot_data,                            #Draw decision tree
                        feature_names = lenses_pd.keys(),
                        class_names = clf.classes_,
                        filled=True, rounded=True,
                        special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
    print(clf.predict([[1,1,1,0]]))                                            #Predict the lens type for the encoded sample [1, 1, 1, 0]
['no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses']
['hard']

Experimental summary

  Through this experiment we have mastered the construction of decision trees and classification with them, and implemented contact lens prediction based on a decision tree.

Tags: Machine Learning Decision Tree sklearn
