How to deal with missing values in machine learning

Note: the data comes from Kaggle (see here for details; the original reference is linked here). This is a long article that aims to walk through some ideas and details of the EDA process.

1, Introduction

The purpose of this EDA (Exploratory Data Analysis) is to explore the missing-value information in the dataset and, by understanding how the missing values are distributed, to find a suitable way to fill them in so that the model performs better.

2, Missing value distribution

Let's take a look at how null values are distributed in the dataset. Understanding the distribution of null values is very useful in feature engineering; its importance is no less than choosing an appropriate algorithm. For example, knowing how the null values are distributed can keep us from naively filling them with the mean or with 0. Next, let's look at how null values affect the ML model.

1. Effect of null values on each feature

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## load data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Define the features we want to examine, names for one-hot encoding them, and the total number of records
features = [x for x in train.columns if x.startswith("f") and not x.endswith("_is_null")]
null_features = ["{}_is_null".format(x) for x in features]
total_rows = float(train.shape[0])

# One-hot encode whether a feature is null per row
for feature, null_feature in zip(features, null_features):
    train[null_feature] = train[feature].isnull().astype(int)

# Percentage of rows that are null for each feature (the *_is_null columns are 0/1, so their sum counts the nulls)
null_counts = pd.DataFrame.from_dict({k: [round(train[k].sum() / total_rows * 100, 3)] for k in null_features})

# Plot percentage of rows impacted by feature
sns.set_style("whitegrid")
bar, ax = plt.subplots(figsize=(10, 35))
ax = sns.barplot(data=null_counts, ci=None, palette="muted", orient="h")
ax.set_title("Percentage of Null Values Per Feature (Train Data)", fontsize=15)
ax.set_xlabel("Percentage")
ax.set_ylabel("Feature")
for rect in ax.patches:
    ax.text(rect.get_width(), rect.get_y() + rect.get_height() / 2, "%.3f%%" % rect.get_width())
plt.show()


Here we can see that every feature contains null values, each accounting for about 1.6% of the rows. Of course, we should then confirm whether the test dataset has the same distribution.

## almost the same as the train code above, applied to the test set
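For completeness, a minimal sketch of the equivalent check on the test set (assuming the test file uses the same f* column names; the plotting code is identical to the train version above):

# One-hot encode whether each feature is null per row of the test set
for feature, null_feature in zip(features, null_features):
    test[null_feature] = test[feature].isnull().astype(int)

# Percentage of rows that are null for each feature in the test set
test_total_rows = float(test.shape[0])
test_null_counts = pd.DataFrame.from_dict(
    {k: [round(test[k].sum() / test_total_rows * 100, 3)] for k in null_features}
)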

We can see that the distribution of null values in the training set and the test set is consistent, so the question now becomes: how much do the null values overlap? In other words, is every sample affected by null values, or are there samples without any missing values?

2. Effect of null values on each sample

In order to analyze the impact of null values on each row, we can construct a "null count" feature that records how many null values each sample has, and then count how many samples contain 0 / 1 / 2 / 3 / ... null values.

# Count the number of null values that occur in each row
train["null_count"] = train.isnull().sum(axis=1)
# Group the null counts
counts = train.groupby("null_count")["claim"].count().to_dict()
null_data = {"{} Null Value(s)".format(k): v for k, v in counts.items() if k < 6}
null_data["6 or More Null Values"] = sum(v for k, v in counts.items() if k >= 6)

# Plot the null count results
pie, ax = plt.subplots(figsize=[20, 10])
plt.pie(x=list(null_data.values()), autopct="%.2f%%", explode=[0.05] * len(null_data), labels=list(null_data.keys()), pctdistance=0.5)
_ = plt.title("Percentage of Null Values Per Row (Train Data)", fontsize=14)
plt.show()


We can see that 37.5% of the samples do not contain any missing values, while 6.05% of the samples have 6 or more missing values. Similarly, let's see whether the test set shows the same pattern.

## almost the same as the train code above, applied to the test set
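A minimal sketch of the per-row check on the test set (the test set has no claim column, so we count rows directly; the plotting code is the same as above):

# Count the number of null values per row of the test set
test["null_count"] = test[features].isnull().sum(axis=1)
test_counts = test["null_count"].value_counts().sort_index().to_dict()
test_null_data = {"{} Null Value(s)".format(k): v for k, v in test_counts.items() if k < 6}
test_null_data["6 or More Null Values"] = sum(v for k, v in test_counts.items() if k >= 6)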

The distribution of null values per sample is consistent between the test set and the training set; again, about 1/3 of the data does not contain any missing values.

3. Linear relationship between null values and the label

Let's see whether there is a strong linear correlation between the null-value indicator features and the label.

# Define the features we want to examine, names for one-hot encoding them, and the total number of records
features = [x for x in train.columns if x.startswith("f") and not x.endswith("_is_null")]
null_features = ["{}_is_null".format(x) for x in features]
total_rows = float(train.shape[0])

# One-hot encode whether a feature is null per row
for feature, null_feature in zip(features, null_features):
    train[null_feature] = train[feature].isnull().astype(int)

# Correlate the null indicators with the label
correlation_features = null_features.copy()
correlation_features.append("claim")
null_correlation = train[correlation_features].corr()
# Styled correlation table (renders with a colour gradient in a notebook)
null_correlation.style.background_gradient(cmap='coolwarm')

f, ax = plt.subplots(figsize=(30, 30))

# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(
    null_correlation,
    mask=np.triu(np.ones_like(null_correlation, dtype=bool)),
    cmap=sns.diverging_palette(230, 20, as_cmap=True),
    vmax=.3,
    center=0,
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .5}
)
plt.show()


We can see that there is no obvious linear relationship between the null-value indicators and the label.

4. Distribution relationship between the number of null values and the label

Now let's look at how the number of null values per row relates to the label.

train["null_count"] = train.isnull().sum(axis=1)

# Count (null_count, claim) combinations
z = dict()
for (null_count, claim_status), value_count in train[["null_count", "claim"]].value_counts().to_dict().items():
    if null_count not in z:
        z[null_count] = dict()
    z[null_count][claim_status] = value_count

# Build a table of claim counts per number of null values (0 to 14)
a = []
for null_values in range(15):
    a.append([null_values, z.get(null_values, {}).get(0, 0), z.get(null_values, {}).get(1, 0)])
df = pd.DataFrame(a, columns=["Number of Null Values", "Not Claimed (0)", "Claimed (1)"])
ax = df.plot(x="Number of Null Values", y=["Not Claimed (0)", "Claimed (1)"], kind="bar", figsize=(20, 10))
_ = ax.set_title("Number of Null Values by Claim Status", fontsize=15)
_ = ax.set_ylabel("Number of Rows")
plt.show()


We can see that once a sample contains two or more null values, the classification result leans towards a claim, so using null_count as a new feature will be of great help to the model. It also suggests that if we can fill in these null values effectively, the classification results will be even better.

3, Influence of null values and filling methods

Now we know that null values have the same distribution in the training set and the test set. Next, let's see how null values affect the performance of the model through a few simple experiments.

1. Simple lightgbm classifier

LightGBM has its own mechanism for handling null values. Before trying various ways to fill them in, we first establish a baseline with a default LGBM classification model.

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from lightgbm import LGBMClassifier

features = [x for x in train.columns if x.startswith("f") and not x.endswith("_is_null")]
train["null_count"] = train.isnull().sum(axis=1)
new_train = train.copy()
target = train["claim"]
k_fold = StratifiedKFold(
    n_splits=3,
    random_state=2021,
    shuffle=True,
)

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )
for fold, (train_index, test_index) in enumerate(k_fold.split(new_train[features], target)):
    x_train = pd.DataFrame(new_train[features].iloc[train_index])
    y_train = target.iloc[train_index]

    x_valid = pd.DataFrame(new_train[features].iloc[test_index])
    y_valid = target.iloc[test_index]
    model = LGBMClassifier(
        random_state=2021,
        metric="auc",
        n_estimators=16000,
        verbose=-1,
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=200,
        verbose=0,
    )

    train_oof_preds = model.predict(x_valid)
    train_oof_probas = model.predict_proba(x_valid)[:, -1]
    train_preds[test_index] = train_oof_preds
    train_probas[test_index] = train_oof_probas

    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- ROC AUC: {}".format(roc_auc_score(target, train_probas)))


train["unmodified_preds"] = train_preds
train["unmodified_probas"] = train_probas
misclassified = train[(train["claim"] != train["unmodified_preds"])]["null_count"].value_counts().to_dict()

# Show the confusion matrix
confusion = confusion_matrix(train["claim"], train["unmodified_preds"])
ax = sns.heatmap(confusion, annot=True, fmt=",d")
_ = ax.set_title("Confusion Matrix for LGB Classifier (Unmodified Dataset)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

# Plot percentage of rows impacted by feature
sns.set_style("whitegrid")
bar, ax = plt.subplots(figsize=(10, 10))
ax = sns.barplot(x=list(misclassified.keys()), y=list(misclassified.values()))
_ = ax.set_title("Number of Misclassifications by Null Values in Row (Unmodified Dataset)", fontsize=15)
_ = ax.set_xlabel("Number of Null Values in Row")
_ = ax.set_ylabel("Number of Misclassified Predictions")
for p in ax.patches:
    height = p.get_height()
    ax.text(
        x=p.get_x()+(p.get_width()/2),
        y=height,
        s=round(height),
        ha="center"
    )
plt.show()
...
-- Overall:
              precision    recall  f1-score   support

           0       0.75      0.76      0.75    480404
           1       0.75      0.74      0.75    477515

    accuracy                           0.75    957919
   macro avg       0.75      0.75      0.75    957919
weighted avg       0.75      0.75      0.75    957919

-- ROC AUC: 0.8040679784649976

Through 3-fold cross validation, we can see that precision and recall are very stable, which indicates that the data distribution is consistent across folds. However, we can also observe a phenomenon: although samples with only one null value account for only 14% of the dataset, they produce the highest number of misclassifications (72118 samples), whereas samples without any null value make up as much as 37% of the data but only 48553 of them are misclassified. The same holds for samples with two null values. As the number of null values increases further, the number of misclassified samples drops sharply. The question now is: can we find an appropriate filling value to improve the classification results?
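Each of the experiments that follow changes only how new_train is preprocessed and then reruns the cross-validation loop above. As a convenience, a hypothetical evaluate_model helper (not part of the original code; it simply wraps the baseline loop and assumes the same imports) is sketched below, so that each experiment only needs to change the preprocessing step:

def evaluate_model(new_train, features, target, seed=2021):
    """Run the same 3-fold CV LGBM pipeline as the baseline and report metrics."""
    k_fold = StratifiedKFold(n_splits=3, random_state=seed, shuffle=True)
    train_preds = np.zeros(len(new_train.index))
    train_probas = np.zeros(len(new_train.index))
    for fold, (train_index, test_index) in enumerate(k_fold.split(new_train[features], target)):
        x_train, y_train = new_train[features].iloc[train_index], target.iloc[train_index]
        x_valid, y_valid = new_train[features].iloc[test_index], target.iloc[test_index]
        model = LGBMClassifier(random_state=seed, metric="auc", n_estimators=16000, verbose=-1)
        model.fit(x_train, y_train, eval_set=[(x_valid, y_valid)],
                  early_stopping_rounds=200, verbose=0)
        train_preds[test_index] = model.predict(x_valid)
        train_probas[test_index] = model.predict_proba(x_valid)[:, -1]
    print(classification_report(target, train_preds))
    print("ROC AUC: {}".format(roc_auc_score(target, train_probas)))
    return train_preds, train_probas

# e.g. for the mean-filling experiment below:
# evaluate_model(new_train, features, target)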

2. Mean filling

Now we will use the mean to fill in all null values and then build the same model. Let's see whether this can beat the original baseline score.

new_train = train.copy()
for feature in features:
    new_train[feature].fillna(new_train[feature].mean(), inplace=True)
## the rest is the same as the baseline model's code
...
-- Overall:
              precision    recall  f1-score   support

           0       0.70      0.78      0.74    480404
           1       0.75      0.67      0.71    477515

    accuracy                           0.73    957919
   macro avg       0.73      0.73      0.72    957919
weighted avg       0.73      0.73      0.72    957919

-- ROC AUC: 0.7903046619321749

Compared with LightGBM's default handling of null values, mean filling reduces the performance of the model, and the misclassification rate of samples with only 1 or 2 null values is higher. The drop mainly comes from false positives (samples belonging to class 0 being predicted as 1). This means that mean filling loses some important information: the null values may originally represent outliers or boundary values, so mean filling is clearly not a good choice here.

3. Constant filling

Now we try to fill all null values with 0. Filling with 0 is not necessarily a good idea, because we have not examined the value distribution of each column separately, and it may skew the data: if a column's values originally lie between 10000 and 100000 and we insert a 0, the distribution of that column changes. On the other hand, a GBM model may be able to pick up this skew and even benefit from it. Who knows, let's try it.

new_train = train.copy()
for feature in features:
    new_train[feature].fillna(0, inplace=True)
## the rest is the same as the baseline model's code
...
-- Overall:
              precision    recall  f1-score   support

           0       0.72      0.77      0.75    480404
           1       0.75      0.70      0.73    477515

    accuracy                           0.74    957919
   macro avg       0.74      0.74      0.74    957919
weighted avg       0.74      0.74      0.74    957919

-- ROC AUC: 0.7969909218144446


Filling with 0 still lags behind the baseline, but compared with mean filling, the misclassification rate of samples with 2 null values is reduced and the ROC AUC also improves slightly.

4. Iterative imputer filling

We can also try various imputers. For example, scikit-learn's IterativeImputer models each feature that has missing values as a function of the other features and fills in the missing entries with the estimated values. Here we set n_nearest_features=5, so only the 5 most correlated features are used to estimate each feature's missing values.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to enable IterativeImputer)
from sklearn.impute import IterativeImputer

new_train = train.copy()
imputer = IterativeImputer(random_state=2021, n_nearest_features=5)
new_train[features] = imputer.fit_transform(new_train[features])
## the rest is the same as the baseline model's code
...
-- Overall:
              precision    recall  f1-score   support

           0       0.68      0.75      0.71    480404
           1       0.72      0.64      0.68    477515

    accuracy                           0.70    957919
   macro avg       0.70      0.70      0.70    957919
weighted avg       0.70      0.70      0.70    957919

-- ROC AUC: 0.7552119652576256


This time we can see some differences. Although the misclassification rate of samples with 1 null value goes down, the imputation changes the distribution of the other columns: the number of misclassified samples without any null value rises from 48553 to 63879. This is probably because the filled values drift away from the actual data distribution. By now we can tell that the null values must contain some important information, and that information cannot be recovered with IterativeImputer.

5. Use null count as a new feature

Instead of trying to recover the original information contained in the null values, we can add the number of null values in each row to the data as a new feature.

features.append("null_count")
## the rest is the same as the baseline model's code
...
-- Overall:
              precision    recall  f1-score   support

           0       0.86      0.66      0.74    480404
           1       0.72      0.89      0.80    477515

    accuracy                           0.77    957919
   macro avg       0.79      0.77      0.77    957919
weighted avg       0.79      0.77      0.77    957919

-- ROC AUC: 0.8126761574315831

We obtain the best results so far: the misclassification rate of samples that contain null values is greatly reduced. Although the misclassification rate of samples without null values has not changed, we could in fact build an even simpler classifier based on the null-value information alone by directly predicting class 1 for any sample that contains null values, and its performance would not be too bad; a minimal sketch follows if you want to try it.
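A minimal sketch of that trivial rule (assuming null_count has been computed as above):

# Trivial rule: predict "claimed" (1) whenever a row contains at least one null value
rule_preds = (train["null_count"] > 0).astype(int)
print(classification_report(train["claim"], rule_preds))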

4, Conclusion

Through the above experiments, we can draw the following conclusions:

1. Null values contain very important information, perhaps even the most important information in this dataset. Constructing the null_count feature provides the model with a very effective discriminative signal and improves performance.

2. It is difficult to find a suitable way to fill in the null values; at least for this dataset, the data does not give us enough information to recover them.

For any ML algorithm, we can keep the null values where they are and use an algorithm that handles them directly, use null_count as a new feature, one-hot encode whether each value is null (a small sketch follows), or use an imputer to fill them in.
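As a sketch of the one-hot option just mentioned (not an experiment run in this article; it reuses the *_is_null columns from section 2 and the hypothetical evaluate_model helper sketched earlier):

# Append the one-hot null indicators created earlier as extra model features
features_with_indicators = features + null_features
# evaluate_model(train.copy(), features_with_indicators, target)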

5, How to handle null values in HyperGBM

In HyperGBM, there are four imputation strategies: ['mean', 'median', 'constant', 'most_frequent']

  • When num_pipeline_mode='simple', missing values are filled with the mean value;
  • When num_pipeline_mode='complex' (the default), the framework searches for the optimal imputer.

For users who need to do ensembling / stacking, different imputers generate different input data, which can make the models more diverse.

...
search_space_ = GeneralSearchSpaceGenerator(num_pipeline_mode='simple')
experiment = make_experiment(train_data=train, target='claim', log_level='info', search_space=search_space_)
...
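As a minimal usage sketch (following the standard HyperGBM make_experiment workflow; exact API details may vary between versions), the experiment is then run to obtain the trained pipeline:

estimator = experiment.run()                      # search for and train the final pipeline
test_proba = estimator.predict_proba(test)[:, 1]  # predicted claim probability on the test set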
