1. What is a decision tree?

A decision tree is a decision-analysis method. Starting from the known probabilities of the possible outcomes, a tree of decisions is built to compute the probability that the expected net present value (NPV) is greater than or equal to zero, which is then used to evaluate project risk and judge feasibility. It is an intuitive, graphical method of decision analysis. Because the branching decisions, when drawn, look very much like the branches of a tree, the method is called a decision tree.

Look at the example in Figure 1. Put plainly, a decision tree repeatedly tests a sample against conditions until the sample can be accurately assigned to a category.

Figure 1

Decision trees can solve both regression and classification problems. In addition, according to the attribute-selection criterion, decision trees can be divided into:

- ID3 decision tree (using information gain as the basis for attribute selection)
- C4.5 decision tree (using the gain ratio)
- CART decision tree (using the Gini index)

2. Advantages and disadvantages

Advantages:

- Can be analyzed visually
- Strong interpretability
- Can handle missing data

Disadvantages:

- Tends to ignore correlations between attributes in the data
- When there are many categories, the error grows quickly

3. How to select the nodes and leaves of a decision tree?

If a feature separates the samples well, we choose that feature first. The criterion for judging the quality of a feature is information gain.

1. Information entropy

For a random event, the more data we observe, the more the uncertainty of the event decreases and the more information becomes known.

Information entropy is the mathematical expectation of the self-information (the amount of information carried by a single outcome).

The formulas are as follows.

Self-information:

$$I(x) = -\log_b p(x)$$

When the base $b$ is 2, the unit is the bit; when the base is $e$, the unit is the nat.

Information entropy:

$$H(X) = -\sum_{x} p(x)\log p(x)$$
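As a quick sanity check on these definitions, here is a minimal sketch (the `entropy` function is our own helper, not from any library):

```python
from math import log

def entropy(probs, base=2):
    """H = -sum_x p(x) * log_b p(x); base 2 gives bits, base e gives nats."""
    return -sum(p * log(p, base) for p in probs if p > 0)

# A fair coin is maximally uncertain: 1 bit.
print(entropy([0.5, 0.5]))
# A certain outcome carries no information: 0 bits.
print(entropy([1.0]))
# A biased coin is less uncertain than a fair one.
print(entropy([0.9, 0.1]))
```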

Maximum entropy principle:

When inferring from incomplete information, the inference should be drawn from the probability distribution that has maximum entropy among all distributions satisfying the known constraints.

In other words, when faced with unknown data, we should model it with the maximum-entropy distribution consistent with what we know.

Maximum entropy principle - inference:

For a given variance σ², the Gaussian random variable has the largest entropy. This is one reason random events are so often modeled with a Gaussian distribution.
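This inference can be illustrated numerically: for the same variance, the differential entropy of a Gaussian exceeds that of a uniform distribution (a sketch; the variable names are ours):

```python
from math import log, pi, e, sqrt

sigma2 = 1.0  # the fixed variance

# Differential entropy of a Gaussian with variance sigma^2: 0.5 * ln(2*pi*e*sigma^2)
h_gauss = 0.5 * log(2 * pi * e * sigma2)

# A uniform distribution on an interval of width w has variance w^2 / 12
# and differential entropy ln(w); choose w so its variance matches sigma2.
w = sqrt(12 * sigma2)
h_uniform = log(w)

print(h_gauss, h_uniform)  # the Gaussian value is larger
```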

2. Cross entropy and KL divergence

(1) Joint entropy:

If X and Y are a pair of discrete random variables with joint probability distribution p(x, y), then the joint entropy is:

$$H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y)$$

The expression for joint entropy has the same form as information entropy; it simply measures the average information of a pair of random variables.

(2) Conditional entropy

From the above, the conditional entropy is:

$$H(Y|X) = -\sum_{x}\sum_{y} p(x,y)\log p(y|x)$$

From this we obtain the chain rule of entropy:

$$H(X,Y) = H(X) + H(Y|X)$$
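The chain rule can be verified numerically. A minimal sketch with a made-up joint distribution over two binary variables:

```python
from math import log

# A made-up joint distribution p(x, y) for X, Y in {0, 1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(probs):
    """Entropy of a collection of probabilities, in bits."""
    return -sum(p * log(p, 2) for p in probs if p > 0)

# Marginal distribution p(x)
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_xy = H(p_xy.values())   # joint entropy H(X, Y)
h_x = H(p_x.values())     # H(X)
# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) * log p(y|x)
h_y_given_x = -sum(p * log(p / p_x[x], 2)
                   for (x, y), p in p_xy.items() if p > 0)

# Chain rule: H(X, Y) = H(X) + H(Y|X)
print(h_xy, h_x + h_y_given_x)
```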

(3) Relative entropy (KL divergence)

A measure of the relative difference between two probability distributions in the same event space.

For two probability distributions p(x) and q(x), the relative entropy is:

$$D(p\,\|\,q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$$

Corollary:

- When the two distributions are identical, the relative entropy is 0
- As the difference between the two distributions grows, the relative entropy grows
- In general, $D(p\,\|\,q) \neq D(q\,\|\,p)$

(4) Cross entropy

Measure the difference between the estimated probability distribution and the true probability distribution.

Suppose the random variable X ~ p(x), and q(x) is a distribution used to approximate p(x). Then the cross entropy of X and q(x) is:

$$H(X,q) = -\sum_{x} p(x)\log q(x)$$
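These quantities and the decomposition H(p, q) = H(p) + D(p ∥ q) can be checked with a short sketch (the function names are ours):

```python
from math import log

def entropy(p):
    return -sum(pi * log(pi, 2) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x)"""
    return -sum(pi * log(qi, 2) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x))"""
    return sum(pi * log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # true distribution
q = [0.9, 0.1]   # approximating distribution

# Cross entropy decomposes as H(p, q) = H(p) + D(p || q)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
# KL divergence is not symmetric
print(kl_divergence(p, q), kl_divergence(q, p))
# Identical distributions have zero divergence
print(kl_divergence(p, p))
```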

4. Basis for attribute selection

1. General algorithm flow of decision tree

- Select the best partition attribute a* from the attribute set A
- For each possible value v that a* can take, generate a branch node; the branch node contains the samples in the parent node whose value on attribute a* is v, and these samples form the subset Dv
- If Dv is empty, make the current branch node a leaf node whose category is the majority class in the parent node's sample set, and return
- If Dv is not empty, take the current node as the parent node and continue the attribute-selection and partition process

2. Attribute selection

- ID3 decision tree (using information gain as the basis for attribute selection)
- C4.5 decision tree (using the gain ratio)
- CART decision tree (using the Gini index)

Information gain:

$$\mathrm{Gain}(D,a) = \mathrm{Ent}(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\mathrm{Ent}(D^v)$$

Information gain has a preference for attributes with a large number of values.
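This preference is easy to demonstrate: an ID-like attribute with one distinct value per sample achieves the maximum possible gain even though it is useless for generalization (a sketch; the function names are ours):

```python
from math import log

def entropy(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * log(c / n, 2) for c in counts.values())

def information_gain(values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v)"""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels = ['yes', 'yes', 'no', 'no']
noisy_attr = ['a', 'a', 'a', 'b']   # a reasonable but imperfect attribute
id_attr = ['1', '2', '3', '4']      # unique value per sample, like an ID column

print(information_gain(noisy_attr, labels))
print(information_gain(id_attr, labels))  # maximal: every "branch" is pure
```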

Gain ratio:

Definition of the intrinsic value:

$$\mathrm{IV}(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$

For an attribute a, the more values it can take, the greater IV(a).

Gain ratio:

$$\mathrm{Gain\_ratio}(D,a) = \frac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)}$$

The gain ratio has a preference for attributes with a small number of values.

Because both criteria have shortcomings, CART instead bases its choice on the Gini index.

Gini value:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

The Gini value of a sample set reflects the probability that two samples drawn at random belong to different categories.

Gini index:

$$\mathrm{Gini\_index}(D,a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}\mathrm{Gini}(D^v)$$

When using the Gini index, the attribute with the smallest Gini index should be selected.
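A minimal sketch of both quantities (the function names are ours):

```python
def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: the probability that two samples drawn
    at random (with replacement) belong to different categories."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_index(values, labels):
    """Gini_index(D, a) = sum_v |D_v|/|D| * Gini(D_v)"""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        total += len(subset) / n * gini(subset)
    return total

labels = ['yes', 'yes', 'no', 'no']
print(gini(labels))                              # 0.5 for a 50/50 class mix
print(gini_index(['a', 'a', 'b', 'b'], labels))  # 0.0: a perfect split
print(gini_index(['a', 'b', 'a', 'b'], labels))  # worse split, larger index
```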

5. Pruning operation

The process of turning a node that is originally an internal node into a leaf node is called pruning.

Pruning is done to prevent the overfitting caused by growing too many branches.

Pruning operation:

1) Pre-pruning (pruning while the tree is generated)

2) Post-pruning (pruning from bottom to top after the tree is generated)

6. Python code implementation

```python
from math import log
import operator


def calc_entropy(labels):
    """Calculate the information entropy of a list of class labels."""
    label_num = len(labels)
    label_counts = {}
    for label in labels:
        if label not in label_counts:
            label_counts[label] = 0
        label_counts[label] += 1
    entropy = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / label_num
        entropy -= prob * log(prob, 2)
    return entropy


def split_dataset(dataset, labels, index, value):
    """Select the samples whose feature at position `index` equals `value`,
    remove that feature from them, and return the sub-dataset together with
    the matching sub-labels."""
    sub_dataset = []
    sub_labels = []
    for fc_index, fc in enumerate(dataset):
        if fc[index] == value:
            # Drop the feature at `index` from the feature vector
            tmp = fc[:index]
            tmp.extend(fc[index + 1:])
            sub_dataset.append(tmp)
            # Keep the label that corresponds to this feature vector
            sub_labels.append(labels[fc_index])
    return sub_dataset, sub_labels


def select_best_attribute(dataset, labels):
    """Choose the attribute with the largest information gain (ID3)."""
    feature_num = len(dataset[0])
    base_entropy = calc_entropy(labels)
    max_info_gain = -1.0
    best_feature = -1
    for i in range(feature_num):
        # All values taken by the i-th feature (without repetition)
        unique_vals = set(example[i] for example in dataset)
        # Conditional entropy after splitting on this feature
        new_entropy = 0.0
        for value in unique_vals:
            sub_dataset, sub_labels = split_dataset(dataset, labels, i, value)
            prob = float(len(sub_dataset)) / len(dataset)
            new_entropy += prob * calc_entropy(sub_labels)
        # Information gain of the current feature
        info_gain = base_entropy - new_entropy
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = i
    return best_feature


def majority_count(labels):
    """Return the label with the highest count."""
    label_count = {}
    for vote in labels:
        if vote not in label_count:
            label_count[vote] = 0
        label_count[vote] += 1
    sorted_class_count = sorted(label_count.items(),
                                key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]


def decision_tree(dataset, feature_names, labels):
    """Generate an ID3 decision tree, represented as a nested dict.
    `feature_names` must follow the same order as the features in `dataset`."""
    if labels.count(labels[0]) == len(labels):
        # All samples share one category: stop splitting
        return labels[0]
    if len(dataset[0]) == 0:
        # No features left: return the majority class
        return majority_count(labels)
    best_feature_index = select_best_attribute(dataset, labels)
    best_feature_name = feature_names[best_feature_index]
    tree = {best_feature_name: {}}
    # Copy the name list so that sibling branches are not affected
    sub_feature_names = (feature_names[:best_feature_index] +
                         feature_names[best_feature_index + 1:])
    attr_values = [example[best_feature_index] for example in dataset]
    for value in set(attr_values):
        sub_dataset, sub_labels = split_dataset(dataset, labels,
                                                best_feature_index, value)
        tree[best_feature_name][value] = decision_tree(sub_dataset,
                                                       sub_feature_names,
                                                       sub_labels)
    return tree
```

For example, `decision_tree([[1, 1], [1, 0], [0, 1]], ['f1', 'f2'], ['yes', 'no', 'no'])` returns the nested dict `{'f1': {0: 'no', 1: {'f2': {0: 'no', 1: 'yes'}}}}`.