[machine learning] decision tree

This experiment implements a simple binary decision tree. I tried to finish the homework without being familiar with the theory and, as a result, hit a bottleneck... so I went back and organized my ideas carefully from the beginning. It seems that shortcuts taken now have to be paid back later with interest.
Knowledge should be accumulated honestly, not by retreating whenever difficulties come up.

Experiment content:
1. Implement three splitting criteria: information gain, information gain ratio, and Gini index
2. Use the given training set to train three decision trees, one per criterion
3. With a maximum depth of 10, compute the accuracy, precision, recall, and F1 score of the three decision trees on both the training set and the test set

1, Data processing

1. Read data

Use the four columns grade, term, home_ownership, and emp_length as features, and safe_loans as the label.

import pandas as pd

# Import data
loans = pd.read_csv('data/lendingclub/lending-club-data.csv', low_memory=False)

features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

2. Split into training and test sets

from sklearn.utils import shuffle
loans = shuffle(loans, random_state = 34)
# Take the first 60% of the rows as the training set, the rest as the test set
split_line = int(len(loans) * 0.6)
train_data = loans.iloc[: split_line]
test_data = loans.iloc[split_line:]

(Figure omitted: preview of train_data.)
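Since the figure is not reproduced here, a quick way to inspect the split (a minimal sketch using only what was defined above):

# Preview the first few rows and check the sizes of the two splits
print(train_data.head())
print(train_data.shape, test_data.shape)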

3. Feature preprocessing

One-hot encoding

All four features are categorical, so the data needs to be preprocessed with one-hot encoding.
We use pandas' get_dummies to generate the one-hot vectors.

def one_hot_encoding(data, features_categorical):
    '''
    Parameters
    ----------
    data: pd.DataFrame
    
    features_categorical: list(str)
    '''
    
    # Iterate over all categorical features
    for cat in features_categorical:
        
        # One-hot encode the column, prefixing the new columns with the feature name
        one_encoding = pd.get_dummies(data[cat], prefix = cat)
        
        # Concatenate the generated one-hot columns with the original DataFrame
        data = pd.concat([data, one_encoding], axis=1)
        
        # Drop the original categorical column
        del data[cat]
    
    return data
    
 

First we generate one-hot vectors for the training set, and then for the test set. One thing to note: suppose a feature X takes the values {a, b, c} in the training set. We then generate three columns, X_a, X_b, and X_c, and the model is trained only on those three columns. If the value of feature X for some test sample is d, a value never seen in training, then X_a, X_b, and X_c are all 0 for that sample, and we do not create a column X_d, because that column did not exist when the model was trained and introducing it would cause an error.

# One-hot encode the training set
train_data = one_hot_encoding(train_data, features)
one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)

Tips:
data.index returns the row index as a pandas Index object, while data.index.values returns the row index as an ndarray.
data.columns returns the column index as a pandas Index object, while data.columns.values returns the column index as an ndarray.
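A minimal sketch illustrating these attributes on the training set (nothing here beyond plain pandas):

# Index objects vs. their underlying ndarrays
print(type(train_data.index), type(train_data.index.values))      # an Index subclass, numpy.ndarray
print(type(train_data.columns), type(train_data.columns.values))  # an Index subclass, numpy.ndarray
print(one_hot_features[:5])  # first few one-hot feature names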

Next, we one-hot encode the test set, but keep only the columns that appear in one_hot_features (think about why: after one-hot encoding, the test set may contain columns the trained model has never seen, and using them would cause errors, while columns the model expects but the test set lacks must be filled with zeros).

import numpy as np

test_data_tmp = one_hot_encoding(test_data, features)
# Create an empty DataFrame with the same columns as the training set
test_data = pd.DataFrame(columns = train_data.columns)
for feature in train_data.columns:
    # If the current training-set column also exists in test_data_tmp, copy it into test_data
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        # Otherwise, fill the column with all zeros
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype = 'uint8')
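As a quick sanity check, the two column sets should now match. A more concise alternative with the same goal (a sketch, not the code used above) is pandas' reindex, which aligns the test columns to the training columns, fills missing ones with 0, and drops test-only columns:

# The training and test sets should now have identical columns
assert list(train_data.columns) == list(test_data.columns)

# Equivalent one-liner using reindex (test_data_alt is a hypothetical name)
test_data_alt = test_data_tmp.reindex(columns = train_data.columns, fill_value = 0)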
  

2, Feature splitting

Decision trees commonly use one of several splitting criteria, such as information gain, information gain ratio, or the Gini index.
We need to implement a function that, given the labels of all samples in a node of the decision tree, computes the value of the chosen splitting criterion.

  • Convention: all samples whose feature value is 0 are sent to the left subtree, and all samples whose feature value is 1 are sent to the right subtree (see the sketch right after this list).
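A minimal sketch of this convention, assuming the samples of a node are stored in a pandas DataFrame (the helper name split_node is my own, not required by the assignment):

def split_node(data, feature):
    '''Split the samples of a node on a binary (one-hot) feature:
    samples with feature value 0 go to the left subtree, value 1 to the right.'''
    left = data[data[feature] == 0]
    right = data[data[feature] == 1]
    return left, right

# Example: split the root node on the first one-hot feature
# left, right = split_node(train_data, one_hot_features[0])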

1. Information gain

Entropy: measures the uncertainty of a random variable.
Conditional entropy: the remaining uncertainty of a random variable once some other piece of information is known.
Information gain: entropy minus conditional entropy; it measures how much a piece of information reduces the uncertainty.
Intuitively, let X (rain tomorrow) be a random variable; we can compute its entropy. Let Y (cloudy tomorrow) be another random variable. If we also know the uncertainty of rain given that it is cloudy (which requires the joint probability distribution, or an estimate of it from data), that is the conditional entropy. The entropy of X minus the entropy of X given Y is the information gain.
Concretely: say the entropy of tomorrow's rain is 2, and the conditional entropy given that tomorrow is cloudy is 0.01 (once we know it will be cloudy, rain is almost certain, so little uncertainty remains). Subtracting gives 1.99: learning that it will be cloudy reduces the uncertainty about rain by 1.99, which is a lot, so the information gain is large. In other words, the information that tomorrow is cloudy tells us a great deal about whether it will rain.

This is why information gain is often used for feature selection: if a feature has a large information gain (IG), it is very informative for classification. This is exactly how a decision tree chooses which feature to split on.
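To make the definitions concrete, here is a minimal sketch that computes the empirical entropy of a node's labels and the information gain of splitting on a binary feature, reusing the split_node helper above (the names info_entropy and information_gain are my own, not necessarily those required by the assignment):

import numpy as np

def info_entropy(labels):
    '''Empirical entropy H = -sum(p * log2(p)) of a node's labels.'''
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts = True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(data, feature, target = 'safe_loans'):
    '''IG = H(node) - [ |left|/|node| * H(left) + |right|/|node| * H(right) ]'''
    left, right = split_node(data, feature)   # convention: value 0 -> left, value 1 -> right
    n = len(data)
    cond_entropy = (len(left) / n) * info_entropy(left[target]) \
                 + (len(right) / n) * info_entropy(right[target])
    return info_entropy(data[target]) - cond_entropy

# Example: information gain of every one-hot feature at the root node
# gains = {f: information_gain(train_data, f) for f in one_hot_features}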
