Experimental introduction
1. Experimental contents
In this experiment, we learn about and implement the decision tree algorithm.
2. Experimental objectives
Through this experiment, master the basic principles of the decision tree algorithm.
3. Experimental knowledge points
- Shannon entropy
- Information gain
4. Experimental environment
- Python 3.6.5
5. Preparatory knowledge
- Fundamentals of Python Programming
Preparation
Click the "Download experimental data" module at the top right of the screen, select decision_tree_glass.tgz, and download it to the specified directory. Then select File -> Open -> Upload in sequence, upload the compressed package of the data set you just downloaded, and decompress it with the following command:
!tar -zxvf decision_tree_glass.tgz
decision_tree_glass/
decision_tree_glass/lenses.txt
decision_tree_glass/classifierStorage.txt
Decision tree construction - ID3 algorithm
The core of the ID3 algorithm is to select, at each node of the decision tree, the feature chosen by the information gain criterion, and to construct the tree recursively. The specific method is as follows: starting from the root node, calculate the information gain of every candidate feature for that node, select the feature with the largest information gain as the feature of the node, and create child nodes according to the different values of that feature; then apply the same method recursively to the child nodes. This continues until the information gain of every feature is small or no feature remains to be selected, and a decision tree is obtained. ID3 is equivalent to selecting a probabilistic model by the maximum likelihood method.
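For reference, the quantities used throughout this experiment are the standard definitions of empirical entropy, empirical conditional entropy and information gain; written in LaTeX they are:

H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)

g(D, A) = H(D) - H(D \mid A)

where C_1, ..., C_K are the classes appearing in data set D and D_1, ..., D_n are the subsets of D produced by the n possible values of feature A.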
According to the results of the decision tree experiment, feature A3 (has own house) has the largest information gain, so A3 is selected as the feature of the root node. It divides the training set D into two subsets, D1 (A3 is "yes") and D2 (A3 is "no"). Since D1 contains only sample points of the same class, it becomes a leaf node, and the class of the node is marked "yes". For D2, a new feature must be selected from A1 (age), A2 (has a job) and A4 (credit situation) by calculating the information gain of each feature:
According to the calculation, feature A2 (has a job), which has the largest information gain, is selected as the feature of this node. Since A2 has two possible values, two child nodes are derived from it: the child node corresponding to "yes" (has a job) contains three samples that all belong to the same class, so it is a leaf node marked "yes"; the child node corresponding to "no" (no job) contains six samples that also all belong to the same class, so it is a leaf node marked "no".
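As a quick sanity check of the two gain values used above, the following minimal sketch recomputes them by hand from the formulas, using the class counts of the loan data set defined in createDataSet() later in this section (15 samples: 9 "yes", 6 "no"; splitting on A3 gives 6 samples that are all "yes" and 9 samples with 3 "yes" / 6 "no"). It is independent of the exercise code below.

from math import log

def entropy(counts):
    #Empirical (Shannon) entropy of a node given its per-class sample counts
    total = sum(counts)
    return -sum(c / total * log(c / total, 2) for c in counts if c > 0)

H_D = entropy([9, 6])                                    #entropy of the whole set D
H_D_given_A3 = 6/15 * entropy([6]) + 9/15 * entropy([3, 6])
print("g(D, A3) = %.3f" % (H_D - H_D_given_A3))          #prints 0.420

#Within D2 (9 samples), splitting on A2 gives 3 samples all "yes" and 6 samples all "no"
H_D2 = entropy([3, 6])
H_D2_given_A2 = 3/9 * entropy([3]) + 6/9 * entropy([6])
print("g(D2, A2) = %.3f" % (H_D2 - H_D2_given_A2))       #prints 0.918

These values match the information gains printed by the exercise program later (0.420 for the "has own house" feature, 0.918 for the "has a job" feature within D2).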
In this way, a decision tree that uses only two features (two internal nodes) is generated. The resulting decision tree is shown in the following figure:
[Exercise] Decision tree construction - write code to build a decision tree
We use a dictionary to store the structure of the decision tree. For example, the decision tree we analyzed in the previous section can be expressed as:
{'own a house': {0: {'have a job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
Create the function majorityCnt to count the most frequent element (class label) in classList, and the function createTree to build the decision tree recursively. The code is as follows:
# -*- coding: UTF-8 -*-
from math import log
import operator

"""
Function description: Calculate the empirical entropy (Shannon entropy) of a given data set

Parameters:
    dataSet - data set
Returns:
    shannonEnt - empirical entropy (Shannon entropy)
"""
def calcShannonEnt(dataSet):
    ### Start Code Here ###
    numEntires = len(dataSet)                        #Number of rows in the dataset
    labelCounts = {}                                 #Dictionary counting the occurrences of each label
    for featVec in dataSet:                          #Count every feature vector
        currentLabel = featVec[-1]                   #Extract the label information
        if currentLabel not in labelCounts.keys():   #If the label is not yet in the counting dictionary, add it
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1               #Count the label
    shannonEnt = 0.0                                 #Empirical entropy (Shannon entropy)
    for key in labelCounts:                          #Calculate the Shannon entropy
        prob = float(labelCounts[key]) / numEntires  #Probability of selecting this label
        shannonEnt -= prob * log(prob, 2)            #Apply the entropy formula
    return shannonEnt                                #Return the empirical entropy (Shannon entropy)
    ### End Code Here ###
""" Function description:Create test dataset Parameters: nothing Returns: dataSet - data set labels - Feature label """ def createDataSet(): dataSet = [[0, 0, 0, 0, 'no'], #data set [0, 0, 0, 1, 'no'], [0, 1, 0, 1, 'yes'], [0, 1, 1, 0, 'yes'], [0, 0, 0, 0, 'no'], [1, 0, 0, 0, 'no'], [1, 0, 0, 1, 'no'], [1, 1, 1, 1, 'yes'], [1, 0, 1, 2, 'yes'], [1, 0, 1, 2, 'yes'], [2, 0, 1, 2, 'yes'], [2, 0, 1, 1, 'yes'], [2, 1, 0, 1, 'yes'], [2, 1, 0, 2, 'yes'], [2, 0, 0, 0, 'no']] labels = ['Age', 'Have a job', 'Have your own house', 'Credit situation'] #Feature label return dataSet, labels #Return dataset and classification properties """ Function description:Divide the data set according to the given characteristics Parameters: dataSet - Data set to be divided axis - Characteristics of partitioned data sets value - The value of the feature to be returned Returns: nothing """ def splitDataSet(dataSet, axis, value): retDataSet = [] #Create a list of returned datasets for featVec in dataSet: #Traversal dataset if featVec[axis] == value: reducedFeatVec = featVec[:axis] #Remove axis feature reducedFeatVec.extend(featVec[axis+1:]) #Add eligible to the returned dataset retDataSet.append(reducedFeatVec) return retDataSet #Returns the partitioned dataset
""" Function description:Select the optimal feature Parameters: dataSet - data set Returns: bestFeature - Maximum information gain(optimal)Index value of the feature """ def chooseBestFeatureToSplit(dataSet): ### Start Code Here ### numFeatures = len(dataSet[0]) - 1 #Number of features baseEntropy = calcShannonEnt(dataSet) #Calculate Shannon entropy of data set bestInfoGain = 0.0 #information gain bestFeature = -1 #Index value of optimal feature for i in range(numFeatures): #Traverse all features #Get the i th all features of dataSet featList = [example[i] for example in dataSet] uniqueVals = set(featList) #Create set set {}, elements cannot be repeated newEntropy = 0.0 #Empirical conditional entropy for value in uniqueVals: #Calculate information gain subDataSet = splitDataSet(dataSet, i, value) #The subset of the subDataSet after partition prob = len(subDataSet) / float(len(dataSet)) #Calculate the probability of subsets newEntropy += prob * calcShannonEnt(subDataSet) #The empirical conditional entropy is calculated according to the formula infoGain = baseEntropy - newEntropy #information gain print("The first%d The gain of each feature is%.3f" % (i, infoGain)) #Print information gain for each feature if (infoGain > bestInfoGain): #Calculate information gain bestInfoGain = infoGain #Update the information gain to find the maximum information gain bestFeature = i #Record the index value of the feature with the largest information gain return bestFeature #Returns the index value of the feature with the largest information gain ### End Code Here ###
""" Function description:Statistics classList Most elements in(Class label) Parameters: classList - Class label list Returns: sortedClassCount[0][0] - Elements that appear most here(Class label) """ def majorityCnt(classList): classCount = {} for vote in classList: #Count the number of occurrences of each element in the classList if vote not in classCount.keys():classCount[vote] = 0 classCount[vote] += 1 sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True) #Sort by dictionary values in descending order return sortedClassCount[0][0] #Returns the most frequent element in the classList
""" Function description:Create decision tree Parameters: dataSet - Training data set labels - Classification attribute label featLabels - Store the selected optimal feature label Returns: myTree - Decision tree """ def createTree(dataSet, labels, featLabels): #Take the classification label (lending or not: yes or no) classList = [example[-1] for example in dataSet] #If the categories are exactly the same, stop dividing if classList.count(classList[0]) == len(classList): return classList[0] #When all features are traversed, the class label with the most occurrences is returned if len(dataSet[0]) == 1: return majorityCnt(classList) #Select the optimal feature bestFeat = chooseBestFeatureToSplit(dataSet) #Optimal feature label bestFeatLabel = labels[bestFeat] featLabels.append(bestFeatLabel) #Label spanning tree based on optimal features myTree = {bestFeatLabel:{}} #Delete used feature labels del(labels[bestFeat]) #The attribute values of all optimal features in the training set are obtained featValues = [example[bestFeat] for example in dataSet] #Remove duplicate attribute values uniqueVals = set(featValues) #Traverse the features and create a decision tree. for value in uniqueVals: myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), labels, featLabels) return myTree if __name__ == '__main__': dataSet, labels = createDataSet() featLabels = [] myTree = createTree(dataSet, labels, featLabels) print(myTree)
The gain of feature 0 is 0.083
The gain of feature 1 is 0.324
The gain of feature 2 is 0.420
The gain of feature 3 is 0.363
The gain of feature 0 is 0.252
The gain of feature 1 is 0.918
The gain of feature 2 is 0.474
{'Have your own house': {0: {'Have a job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
When creating a decision tree recursively, there are two termination conditions. The first is that all class labels are exactly the same, in which case that class label is returned directly. The second is that all features have been used but the data still cannot be divided into groups containing only a single class; in that case the construction cannot be refined further because the features are not sufficient, which indicates that the data dimensionality is too low. Since the second condition cannot simply return a unique class label, the class with the largest count is chosen as the return value.
[Exercise] Classification using the decision tree
After constructing the decision tree from the training data, we can use it to classify actual data. To classify a sample, we need the decision tree and the label vector used to build it. The program then compares the test data with the values in the decision tree and executes this process recursively until it reaches a leaf node; finally, the test data is assigned the class of that leaf node. In the code that builds the decision tree, there is a featLabels parameter, which records each split feature in order. When using the decision tree for prediction, we only need to provide the attribute values of those split features in the same order. For example, to classify with the decision tree trained in the previous section, we only need to provide two pieces of information: whether the person has a house and whether the person has a job; no redundant information is needed.
The code for classifying with the decision tree is very simple, as shown below:
# -*- coding: UTF-8 -*- """ Function description:Classification using decision tree Parameters: inputTree - Generated decision tree featLabels - Store the selected optimal feature label testVec - Test data list, the order corresponds to the optimal feature label Returns: classLabel - Classification results """ def classify(inputTree, featLabels, testVec): ### Start Code Here ### #Implement classification function firstStr = list(inputTree.keys())[0] #It is necessary to obtain the column number of the first feature for value comparison from the test data secondDict = inputTree[firstStr] #Get a second dictionary featIndex = featLabels.index(firstStr) #Get the corresponding characteristic value of the test set for key in secondDict.keys(): if(testVec[featIndex] == key): if(type(secondDict[key]).__name__ == 'dict'): #Judge whether the value is still a dictionary. If so, continue recursion classlabel = classify(secondDict[key],featLabels,testVec) else: classlabel = secondDict[key] return classlabel ### End Code Here ### if __name__ == '__main__': dataSet, labels = createDataSet() featLabels = [] myTree = createTree(dataSet, labels, featLabels) testVec = [0,1] #test data result = classify(myTree, featLabels, testVec) if result == 'yes': print('lending') if result == 'no': print('No lending')
The gain of feature 0 is 0.083
The gain of feature 1 is 0.324
The gain of feature 2 is 0.420
The gain of feature 3 is 0.363
The gain of feature 0 is 0.252
The gain of feature 1 is 0.918
The gain of feature 2 is 0.474
lending
Only the classify function is added here for decision tree classification. The test data [0, 1] means "no house, but has a job", and the program predicts "lending".
[Exercise] Predicting contact lens types based on a decision tree - using Sklearn
Once we understand how the decision tree works, we can use it to help people determine which type of contact lens to wear. The contact lens data set is a very famous data set that contains many observed eye conditions together with the type of contact lens recommended by doctors. The contact lens types are hard, soft and no lenses.
There are 24 groups of data in the dataset. The labels of the data are age, prescript, astigmatic, tearRate and class; that is, the first column is the age, the second column is the prescription, the third column is whether the eye is astigmatic, the fourth column is the tear production rate, and the fifth column is the final classification label. The data is shown in the figure below:
Next, let's see how to use Sklearn to build a decision tree. The sklearn.tree module provides decision tree models for solving classification and regression problems. The methods are shown in the figure below:
We use DecisionTreeClassifier to build the decision tree. This class has 12 parameters:
The parameters are described as follows:
criterion: feature selection criterion. Optional parameter; the default is "gini" and it can be set to "entropy". "gini" is Gini impurity, the expected error rate of randomly applying a label drawn from the set to a data item; "entropy" is Shannon entropy.
splitter: split-point selection strategy at each node. Optional parameter; the default is "best" and it can be set to "random". "best" selects the best split feature according to the criterion (gini or entropy), while "random" finds a locally optimal split among a random subset of candidate split points. The default "best" is suitable when the sample size is small; if the sample size is very large, "random" is recommended for building the decision tree.
max_features: the maximum number of features to consider when splitting. Optional parameter; the default is None. The maximum number of features considered when searching for the best split (n_features is the total number of features) is determined as follows:
- If max_features is an integer, consider max_features features;
- If max_features is a float, consider int(max_features * n_features) features;
- If max_features is "auto", then max_features = sqrt(n_features);
- If max_features is "sqrt", then max_features = sqrt(n_features), the same as "auto";
- If max_features is "log2", then max_features = log2(n_features);
- If max_features is None, then max_features = n_features, that is, all features are used.
Generally speaking, if the number of features is small (for example, fewer than 50), the default None can be used. If the number of features is very large, the other values described above can be used flexibly to control the maximum number of features considered at each split, and thus the time needed to build the decision tree.
max_depth: the maximum depth of the decision tree. Optional parameter; the default is None. This parameter limits the number of levels of the tree; in the loan example above, for instance, the decision tree has 2 levels. If it is None, the depth of the subtrees is not limited during construction. Generally, this value can be ignored when there are few data or features, or when min_samples_split is set, in which case nodes are split until they contain fewer than min_samples_split samples. If the model has a large sample size and many features, limiting the maximum depth is recommended; the specific value depends on the data distribution, and common values range from 10 to 100.
min_samples_split: the minimum number of samples required to split an internal node. Optional parameter; the default is 2. This value limits the conditions under which a subtree continues to be split. If min_samples_split is an integer, it is used directly as the minimum number of samples; that is, if a node already has fewer than min_samples_split samples, splitting stops. If min_samples_split is a float, it is treated as a fraction, and ceil(min_samples_split * n_samples) samples are required (rounded up). If the sample size is small, this value does not need to be adjusted; if the sample size is very large, increasing it is recommended.
min_samples_leaf: the minimum number of samples required at a leaf node. Optional parameter; the default is 1. This value limits the minimum number of samples a leaf node may contain; if a leaf would have fewer samples than this, it is pruned together with its sibling nodes. If it is set to 1, a leaf is kept even if its class contains only one sample. If min_samples_leaf is an integer, it is used directly as the minimum number of samples; if it is a float, it is treated as a fraction, and ceil(min_samples_leaf * n_samples) samples are required (rounded up). If the sample size is small, this value does not need to be adjusted; if the sample size is very large, increasing it is recommended.
class_weight: class weights. Optional parameter; the default is None. It can also be a dictionary, a list of dictionaries, or "balanced". Specifying weights for the classes mainly prevents the training set from containing too many samples of some classes, which would bias the trained decision tree towards those classes. The class weights can be given in the format {class_label: weight}; you can either specify the weight of each class yourself or use "balanced", in which case the algorithm computes the weights automatically, giving higher weights to classes with fewer samples. If the class distribution of your samples has no obvious skew, you can ignore this parameter and keep the default None.
random_state: random number seed. Optional parameter; the default is None. If it is an integer, random_state is used as the seed of the random number generator. If no seed is set, the random numbers depend on the current system time and differ on every run; if a seed is set, the same seed produces the same random numbers each time. If it is a RandomState instance, random_state is used as the random number generator; if it is None, the random number generator is np.random.
min_impurity_split: minimum impurity for splitting a node. Optional parameter; the default is 1e-7. This threshold limits the growth of the decision tree: if the impurity of a node (Gini impurity, information gain, mean squared error or mean absolute error) is below this threshold, the node is not split further and becomes a leaf node.
min_weight_fraction_leaf: the minimum weighted fraction of samples at a leaf node. Optional parameter; the default is 0. This value limits the minimum of the sum of the sample weights at a leaf node; if the sum falls below this value, the leaf is pruned together with its sibling nodes. Generally, if many samples have missing values, or the class distribution of the classification tree samples is very skewed, sample weights are introduced and this value should then be taken into account.
max_leaf_nodes: the maximum number of leaf nodes. Optional parameter; the default is None. Limiting the maximum number of leaf nodes can prevent overfitting. If a limit is set, the algorithm builds the optimal decision tree within that number of leaf nodes. If there are few features, this value can be ignored; if there are many features, it can be limited, and the specific value can be obtained through cross-validation.
presort: whether to presort the data. Optional parameter; the default is False (no presorting). Generally, if the sample size is small or the tree depth is limited to a small value, setting it to True can speed up the selection of split points and the construction of the decision tree.
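Before moving on, the following minimal sketch (not part of the exercise) shows how a few of the parameters described above might be set; the values here are purely illustrative.

from sklearn import tree

clf = tree.DecisionTreeClassifier(
    criterion='entropy',      #use Shannon entropy instead of the default 'gini'
    splitter='best',          #evaluate the best split at every node
    max_depth=4,              #limit the tree depth to reduce overfitting
    min_samples_split=2,      #minimum samples needed to split an internal node
    min_samples_leaf=1,       #minimum samples required at a leaf node
    class_weight=None,        #or 'balanced' for skewed class distributions
    random_state=0)           #fix the seed so the tree is reproducible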
In addition to these parameters, other points to note when tuning are:
1) When the number of samples is small but the number of features is very large, the decision tree easily overfits. Generally speaking, it is easier to build a robust model when there are more samples than features.
2) If the number of samples is small but the number of features is very large, it is recommended to reduce the dimensionality before fitting the decision tree model, for example with principal component analysis (PCA), feature selection (Lasso) or independent component analysis (ICA). This greatly reduces the feature dimensionality and makes it easier to fit a good decision tree model.
3) It is recommended to make frequent use of decision tree visualization and to limit the depth of the tree first, so that you can observe the initial fit of the generated tree to the data before deciding whether to increase the depth.
4) When training the model, pay attention to the class distribution of the samples (mainly for classification trees). If the class distribution is very unbalanced, consider using class_weight to prevent the model from being biased towards the classes with many samples.
5) The decision tree internally uses numpy arrays of type float32. If the training data is not in this format, the algorithm first copies the data and then runs.
6) If the sample matrix is sparse, it is recommended to convert it with csc_matrix before fitting and with csr_matrix before predicting, as sketched below.
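For note 6), a minimal sketch of the recommended conversions is shown below; the toy data here is purely illustrative.

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn import tree

X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])   #illustrative sample matrix
y = np.array([0, 1, 1, 0])                       #illustrative class labels

clf = tree.DecisionTreeClassifier()
clf.fit(csc_matrix(X), y)                        #CSC format before fitting
print(clf.predict(csr_matrix([[1, 0]])))         #CSR format before predicting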
sklearn.tree.DecisionTreeClassifier() also provides a number of methods for us to use, as shown in the following figure; a small illustration follows.
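Since the figure is not reproduced here, the sketch below illustrates the most commonly used methods (fit, predict, predict_proba, score, get_params) on a tiny made-up dataset.

from sklearn import tree

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   #illustrative training samples
y = ['no', 'no', 'no', 'yes']          #illustrative class labels

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)                          #fit(X, y): build the tree from training data
print(clf.predict([[1, 1]]))           #predict(X): predicted class labels
print(clf.predict_proba([[1, 1]]))     #predict_proba(X): class probabilities
print(clf.score(X, y))                 #score(X, y): mean accuracy on the given data
print(clf.get_params())                #get_params(): the estimator's parameters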
[Exercise] Predicting contact lens types based on a decision tree - write code
# -*- coding: UTF-8 -*-
from sklearn import tree

if __name__ == '__main__':
    fr = open('decision_tree_glass/lenses.txt')
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    print(lenses)
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    clf = tree.DecisionTreeClassifier()
    lenses = clf.fit(lenses, lensesLabels)
[['young', 'myope', 'no', 'reduced', 'no lenses'], ['young', 'myope', 'no', 'normal', 'soft'], ['young', 'myope', 'yes', 'reduced', 'no lenses'], ['young', 'myope', 'yes', 'normal', 'hard'], ['young', 'hyper', 'no', 'reduced', 'no lenses'], ['young', 'hyper', 'no', 'normal', 'soft'], ['young', 'hyper', 'yes', 'reduced', 'no lenses'], ['young', 'hyper', 'yes', 'normal', 'hard'], ['pre', 'myope', 'no', 'reduced', 'no lenses'], ['pre', 'myope', 'no', 'normal', 'soft'], ['pre', 'myope', 'yes', 'reduced', 'no lenses'], ['pre', 'myope', 'yes', 'normal', 'hard'], ['pre', 'hyper', 'no', 'reduced', 'no lenses'], ['pre', 'hyper', 'no', 'normal', 'soft'], ['pre', 'hyper', 'yes', 'reduced', 'no lenses'], ['pre', 'hyper', 'yes', 'normal', 'no lenses'], ['presbyopic', 'myope', 'no', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'no', 'normal', 'no lenses'], ['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'yes', 'normal', 'hard'], ['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'no', 'normal', 'soft'], ['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'yes', 'normal', 'no lenses']]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-79a6415291d2> in <module>
      7     lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
      8     clf = tree.DecisionTreeClassifier()
----> 9     lenses = clf.fit(lenses, lensesLabels)

/opt/conda/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    799             sample_weight=sample_weight,
    800             check_input=check_input,
--> 801             X_idx_sorted=X_idx_sorted)
    802         return self
    803

/opt/conda/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118         if issparse(X):

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525         try:
    526             warnings.simplefilter('error', ComplexWarning)
--> 527             array = np.asarray(array, dtype=dtype, order=order)
    528         except ComplexWarning:
    529             raise ValueError("Complex data not supported\n"

/opt/conda/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86
     87

ValueError: could not convert string to float: 'young'
We can see that the program reports an error. Why? Because the fit() function cannot accept string data, and the printed information shows that our data is of string type. Before calling fit(), we therefore need to encode the data set. Two methods can be used here:
LabelEncoder: converts each string to an incremental integer value
OneHotEncoder: converts each value to a one-of-K (one-hot) indicator vector; see the small illustration below
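As a minimal illustration of the difference (assumed here, not part of the exercise), the snippet below encodes the values of the age column with LabelEncoder and then one-hot encodes the resulting integers with OneHotEncoder.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

ages = ['young', 'pre', 'presbyopic', 'young']

le = LabelEncoder()
codes = le.fit_transform(ages)         #each distinct string becomes an integer: [2 0 1 2]
print(codes)

ohe = OneHotEncoder()
print(ohe.fit_transform(codes.reshape(-1, 1)).toarray())   #one indicator column per distinct value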
To encode the string data conveniently, we first generate a pandas DataFrame. The approach used here is: original data -> dictionary -> pandas DataFrame. The code is as follows:
# -*- coding: UTF-8 -*-
import pandas as pd

if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                #load file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]     #process the file
    lenses_target = []                                                     #Extract the class of each group of data and save it in a list
    for each in lenses:
        lenses_target.append(each[-1])

    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']          #Feature labels
    lenses_list = []                                                       #Temporary list of lens data
    lenses_dict = {}                                                       #Dictionary of lens data used to generate the pandas DataFrame
    for each_label in lensesLabels:                                        #Extract the information and build the dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    print(lenses_dict)                                                     #Print the dictionary
    lenses_pd = pd.DataFrame(lenses_dict)                                  #Generate pandas.DataFrame
    print(lenses_pd)
{'age': ['young', 'young', 'young', 'young', 'young', 'young', 'young', 'young', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'pre', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic', 'presbyopic'], 'prescript': ['myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper', 'myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper', 'myope', 'myope', 'myope', 'myope', 'hyper', 'hyper', 'hyper', 'hyper'], 'astigmatic': ['no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes'], 'tearRate': ['reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal', 'reduced', 'normal']}
           age prescript astigmatic tearRate
0        young     myope         no  reduced
1        young     myope         no   normal
2        young     myope        yes  reduced
3        young     myope        yes   normal
4        young     hyper         no  reduced
5        young     hyper         no   normal
6        young     hyper        yes  reduced
7        young     hyper        yes   normal
8          pre     myope         no  reduced
9          pre     myope         no   normal
10         pre     myope        yes  reduced
11         pre     myope        yes   normal
12         pre     hyper         no  reduced
13         pre     hyper         no   normal
14         pre     hyper        yes  reduced
15         pre     hyper        yes   normal
16  presbyopic     myope         no  reduced
17  presbyopic     myope         no   normal
18  presbyopic     myope        yes  reduced
19  presbyopic     myope        yes   normal
20  presbyopic     hyper         no  reduced
21  presbyopic     hyper         no   normal
22  presbyopic     hyper        yes  reduced
23  presbyopic     hyper        yes   normal
It can be seen from the output that the pandas DataFrame has been generated successfully.
Next, encode the data. The code is as follows:
#!pip install pydotplus
# -*- coding: UTF-8 -*-
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import pydotplus
from sklearn.externals.six import StringIO

if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                #load file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]     #process the file
    lenses_target = []                                                     #Extract the class of each group of data and save it in a list
    for each in lenses:
        lenses_target.append(each[-1])

    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']          #Feature labels
    lenses_list = []                                                       #Temporary list of lens data
    lenses_dict = {}                                                       #Dictionary of lens data used to generate the pandas DataFrame
    for each_label in lensesLabels:                                        #Extract the information and build the dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print(lenses_dict)                                                   #Print the dictionary
    lenses_pd = pd.DataFrame(lenses_dict)                                  #Generate pandas.DataFrame
    print(lenses_pd)                                                       #Print pandas.DataFrame
    le = LabelEncoder()                                                    #Create a LabelEncoder() object for encoding
    for col in lenses_pd.columns:                                          #Encode each column
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    print(lenses_pd)
           age prescript astigmatic tearRate
0        young     myope         no  reduced
1        young     myope         no   normal
2        young     myope        yes  reduced
3        young     myope        yes   normal
4        young     hyper         no  reduced
5        young     hyper         no   normal
6        young     hyper        yes  reduced
7        young     hyper        yes   normal
8          pre     myope         no  reduced
9          pre     myope         no   normal
10         pre     myope        yes  reduced
11         pre     myope        yes   normal
12         pre     hyper         no  reduced
13         pre     hyper         no   normal
14         pre     hyper        yes  reduced
15         pre     hyper        yes   normal
16  presbyopic     myope         no  reduced
17  presbyopic     myope         no   normal
18  presbyopic     myope        yes  reduced
19  presbyopic     myope        yes   normal
20  presbyopic     hyper         no  reduced
21  presbyopic     hyper         no   normal
22  presbyopic     hyper        yes  reduced
23  presbyopic     hyper        yes   normal
    age  prescript  astigmatic  tearRate
0     2          1           0         1
1     2          1           0         0
2     2          1           1         1
3     2          1           1         0
4     2          0           0         1
5     2          0           0         0
6     2          0           1         1
7     2          0           1         0
8     0          1           0         1
9     0          1           0         0
10    0          1           1         1
11    0          1           1         0
12    0          0           0         1
13    0          0           0         0
14    0          0           1         1
15    0          0           1         0
16    1          1           0         1
17    1          1           0         0
18    1          1           1         1
19    1          1           1         0
20    1          0           0         1
21    1          0           0         0
22    1          0           1         1
23    1          0           1         0
As can be seen from the printed result, the data has been encoded successfully. Next, we can call fit() on this data to build the decision tree.
[Exercise] Predicting contact lens types based on a decision tree - prediction
After the decision tree has been built, we can make predictions and see what kind of contact lenses suit a given eye condition, age and the other features. The prediction can be obtained with the following code (according to the encoding above, the test vector [1, 1, 1, 0] corresponds to a presbyopic, myopic patient with astigmatism and a normal tear rate):
print(clf.predict([[1,1,1,0]]))
The complete code is as follows:
# -*- coding: UTF-8 -*-
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.externals.six import StringIO
from sklearn import tree
import pandas as pd
import numpy as np
import pydotplus

if __name__ == '__main__':
    with open('decision_tree_glass/lenses.txt', 'r') as fr:                #load file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]     #process the file
    lenses_target = []                                                     #Extract the class of each group of data and save it in a list
    for each in lenses:
        lenses_target.append(each[-1])
    print(lenses_target)

    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']          #Feature labels
    lenses_list = []                                                       #Temporary list of lens data
    lenses_dict = {}                                                       #Dictionary of lens data used to generate the pandas DataFrame
    for each_label in lensesLabels:                                        #Extract the information and build the dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print(lenses_dict)                                                   #Print the dictionary
    lenses_pd = pd.DataFrame(lenses_dict)                                  #Generate pandas.DataFrame
    # print(lenses_pd)                                                     #Print pandas.DataFrame
    le = LabelEncoder()                                                    #Create a LabelEncoder() object for encoding
    for col in lenses_pd.columns:                                          #Encode each column
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    # print(lenses_pd)                                                     #Print the encoded data

    ### Start Code Here ###
    #Create a DecisionTreeClassifier() object
    clf = tree.DecisionTreeClassifier()
    #Use the data to build the decision tree
    clf.fit(lenses_pd, lenses_target)
    ### End Code Here ###

    dot_data = StringIO()
    tree.export_graphviz(clf, out_file = dot_data,                         #Draw the decision tree
                         feature_names = lenses_pd.keys(),
                         class_names = clf.classes_,
                         filled = True, rounded = True,
                         special_characters = True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    print(clf.predict([[1, 1, 1, 0]]))
    #Save the drawn decision tree as a PDF file (the file name 'tree.pdf' is an example)
    graph.write_pdf('tree.pdf')
['no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses']
['hard']
Experimental summary
Through this experiment, master the construction of decision trees and classification with them, and implement contact lens prediction based on a decision tree.