Machine learning -- evaluation of classification algorithms

Problems with classification accuracy

Consider a cancer prediction system: given a patient's information, it predicts whether the patient has cancer. Is classification accuracy alone a reasonable way to evaluate such a model? If the model's prediction accuracy is 99.9%, can the model be considered good? If the prevalence of cancer is only 0.1%, then a system that simply predicts "healthy" for everyone already achieves 99.9% accuracy. Do you still think the model is good? More extreme still, if the prevalence is only 0.01%, predicting "healthy" for everyone achieves 99.99% accuracy. This gives a rough sense of the problem with using classification accuracy as the only evaluation metric. When does this problem arise? With extremely skewed data, that is, when the classes are highly imbalanced, classification accuracy alone is far from sufficient, so additional metrics need to be introduced.
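
A minimal sketch of the point above, assuming a hypothetical screening population of 10,000 people in which only 0.1% actually have cancer; a "model" that predicts everyone is healthy still scores 99.9% accuracy:

import numpy as np

# hypothetical skewed labels: 10 of 10000 people have cancer (label 1)
y_true = np.zeros(10000, dtype=int)
y_true[:10] = 1

# a "model" that predicts everyone is healthy
y_predict = np.zeros(10000, dtype=int)

np.mean(y_true == y_predict)
# 0.999 -- high accuracy, yet not a single cancer case is detected
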
The first tool for further analysis is the confusion matrix, starting with the binary classification problem. From the collected samples we know directly which examples are positive and which are negative in the ground truth.

The confusion matrix, also known as the error matrix, is a standard format for evaluating classification results; it is an n-by-n matrix, where n is the number of classes.

First, a few definitions:

TN (True Negative): the true value is negative and the predicted value is negative
FP (False Positive): the true value is negative and the predicted value is positive
FN (False Negative): the true value is positive and the predicted value is negative
TP (True Positive): the true value is positive and the predicted value is positive

Ideally we want as many samples as possible to fall on the main diagonal, that is, as many TN and TP as possible; several secondary metrics are derived from these four counts.
Returning to cancer prediction, suppose 10,000 people are tested and the results are as follows:

TN: 9978 people truly have no cancer and are predicted to have no cancer;
FP: 12 people truly have no cancer but are predicted to have cancer;
FN: 2 people truly have cancer but are predicted to have no cancer;
TP: 8 people truly have cancer and are predicted to have cancer.
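
Written out as a small sketch (using the hypothetical counts above), the confusion matrix for this example has true classes as rows and predicted classes as columns:

import numpy as np

# rows: true class (0 = no cancer, 1 = cancer)
# columns: predicted class (0 = no cancer, 1 = cancer)
np.array([
    [9978, 12],   # TN, FP
    [   2,  8],   # FN, TP
])
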

Precision and recall

First, two secondary metrics derived from the confusion matrix are introduced:

Precision

precision = TP / (TP + FP)

From the worked example above, precision = 8 / (8 + 12) = 40%. In a skewed data set, attention is usually focused on the rare class of interest, and precision measures exactly that: among the samples predicted as the event we care about (here, having cancer), what fraction truly belongs to that class.

Recall

recall = TP / (TP + FN)

From the same example, recall = 8 / (8 + 2) = 80%: of the 10 people who actually have cancer, 8 are detected by the model, so the recall is 80%.
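
As a quick check of the two worked results, computed directly from the four counts of the hypothetical example:

tn, fp, fn, tp = 9978, 12, 2, 8

precision = tp / (tp + fp)   # 8 / 20 = 0.4
recall = tp / (tp + fn)      # 8 / 10 = 0.8
precision, recall
# (0.4, 0.8)
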

Implementing precision and recall in code

First, generate unbalanced sample data:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
​
digits = datasets.load_digits()
x = digits.data
y = digits.target.copy()
# Generate unbalanced data
y[digits.target==9] = 1
y[digits.target!=9] = 0
​
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
​
from sklearn.linear_model import LogisticRegression
​
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
log_reg.score(x_test, y_test)
# 0.9755555555555555
log_reg_predict = log_reg.predict(x_test)
def TN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 0))
​
def FP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 1))
​
def FN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 0))
​
def TP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 1))
​
TN(y_test, log_reg_predict)
# 403
FP(y_test, log_reg_predict)
# 2
FN(y_test, log_reg_predict)
# 9
TP(y_test, log_reg_predict)
# 36
# Implementation of confusion matrix
def confusion_matrix(y_true, y_predict):
    return np.array([
        [TN(y_true, y_predict), FP(y_true, y_predict)],
        [FN(y_true, y_predict), TP(y_true, y_predict)],
    ])
confusion_matrix(y_test, log_reg_predict)
# array([[403,   2],
#       [  9,  36]])
# Precision
def precision_score(y_true, y_predict):
    tp = int(TP(y_true, y_predict))
    fp = int(FP(y_true, y_predict))
    try:
        return tp / (tp + fp)
    except ZeroDivisionError:
        return 0.0
​
# Recall
def recall_score(y_true, y_predict):
    tp = int(TP(y_true, y_predict))
    fn = int(FN(y_true, y_predict))
    try:
        return tp / (tp + fn)
    except ZeroDivisionError:
        return 0.0
​
precision_score(y_test, log_reg_predict)
# 0.9473684210526315
recall_score(y_test, log_reg_predict)
# 0.8

Precision and recall in sklearn

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
​
confusion_matrix(y_test, log_reg_predict)
# array([[403,   2],
#        [  9,  36]], dtype=int64)
precision_score(y_test, log_reg_predict)
# 0.9473684210526315
recall_score(y_test, log_reg_predict)
# 0.8

In practice the two metrics can conflict: one model may have high precision but low recall, while another has low precision but high recall. How should the two be weighed against each other?

Sometimes precision matters more, as in stock forecasting, and sometimes recall matters more, as in diagnosing patients; different application scenarios favor different metrics. Sometimes, however, the situation is not so extreme and both precision and recall need to be reasonably high. This leads to a new metric: the F1 score.

F1-score

In statistics, the F1 score is a metric used to measure the performance of a binary classification model. It takes both the precision and the recall of the model into account and can be regarded as the harmonic mean of the two; its maximum value is 1 and its minimum value is 0.

F1 = 2 * precision * recall / (precision + recall)

Programming implementation:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
​
digits = datasets.load_digits()
x = digits.data
y = digits.target.copy()
# Generate unbalanced data
y[digits.target==9] = 1
y[digits.target!=9] = 0
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
​
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
log_reg.score(x_test, y_test)
# 0.9755555555555555
log_reg_predict = log_reg.predict(x_test)
​
def f1_score(precision, recall):
    try:
        return 2 * precision * recall / (precision + recall)
    except:
        return 0.0
​
precision_score(y_test, log_reg_predict)
# 0.9473684210526315
recall_score(y_test, log_reg_predict)
# 0.8
f1_score(precision_score(y_test, log_reg_predict), recall_score(y_test, log_reg_predict))
# 0.8674698795180723

Implementation in sklearn:

from sklearn.metrics import f1_score
​
f1_score(y_test, log_reg_predict)
# 0.8674698795180723

Comparing the results of the example above: accuracy 0.9755555555555555, precision 0.9473684210526315, recall 0.8, and harmonic-mean F1 score 0.8674698795180723. Because F1 is a harmonic mean, whichever of precision and recall is lower drags the overall score down.

def f1_score(precision, recall):
    try:
        return 2 * precision * recall / (precision + recall)
    except:
        return 0.0 
    
precision = 0.5
recall = 0.5
f1_score(precision, recall)
# 0.5
precision = 0.1
recall = 0.9
f1_score(precision, recall)
# 0.18000000000000002

Balancing precision and recall

Precision and recall are, to a large extent, a pair of conflicting metrics: when precision is pushed up, recall tends to drop, and when precision is allowed to fall, recall rises. So how can they be balanced? First, recall how the logistic regression algorithm makes its classification decision.

Decision boundary: θᵀ·x_b = 0

In analytic geometry this is a straight line (more generally, a hyperplane), and it is the decision boundary of the classifier: samples on one side are classified as 0 and samples on the other side as 1. But why take 0 as the cut-off? What happens if some other value is used?

Decision boundary: θᵀ·x_b = threshold

This is equivalent to translating the decision boundary, which changes the classification results.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
​
digits = datasets.load_digits()
x = digits.data
y = digits.target.copy()
# Generate unbalanced data
y[digits.target==9] = 1
y[digits.target!=9] = 0
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
​
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
# By default, 0 is used as the decision boundary. How to translate the decision boundary?
y_predict = log_reg.predict(x_test)
​
confusion_matrix(y_test, y_predict)
# array([[403,   2],
#       [  9,  36]], dtype=int64)
precision_score(y_test, y_predict)
# 0.9473684210526315
recall_score(y_test, y_predict)
# 0.8

First, we need to know the range (minimum and maximum) of the decision scores; a suitable threshold can then be chosen within that range according to the needs of the application.

log_reg.decision_function(x_test)
# Too many. Show the first 10
log_reg.decision_function(x_test)[:10]

Output results:

array([-22.05700117, -33.02940957, -16.21334087, -80.3791447 ,
       -48.25125396, -24.54005629, -44.39168773, -25.04292757,
        -0.97829292, -19.7174399 ])
log_reg.predict(x_test)[:10]
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Looking at the first 10 values, they are all negative and the corresponding predictions are all 0. This is because predict uses 0 as the decision threshold by default: scores below 0 are classified as 0 and scores above 0 as 1.
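
As a quick sanity check (a sketch, not part of the original listing), thresholding the decision scores at 0 should reproduce the output of predict for this binary logistic regression model:

scores = log_reg.decision_function(x_test)
manual_predict = np.array(scores > 0, dtype='int')   # ties at exactly 0 aside
np.array_equal(manual_predict, log_reg.predict(x_test))
# expected: True
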

decision_score =  log_reg.decision_function(x_test)
np.min(decision_score)
# -85.68608522646575
np.max(decision_score)
# 19.8895858799022

First, choose threshold = 5:

y_predict2 = np.array(decision_score >= 5, dtype='int')
confusion_matrix(y_test, y_predict2)
# array([[404,   1],
#        [ 21,  24]], dtype=int64)
precision_score(y_test, y_predict2)
# 0.96
recall_score(y_test, y_predict2)
# 0.5333333333333333

What if you choose threshold=-5?

y_predict3 = np.array(decision_score >= -5, dtype='int')
confusion_matrix(y_test, y_predict3)
# array([[390,  15],
#        [  5,  40]], dtype=int64)
precision_score(y_test, y_predict3)
# 0.7272727272727273
recall_score(y_test, y_predict3)
# 0.8888888888888888

At this point decision_function has been used to change the classification threshold of logistic regression, which makes it possible to compare how precision and recall constrain each other under different thresholds. So how should the threshold be chosen to balance precision and recall when building a classifier? This leads to the precision-recall curve.

Precision-recall curve (P-R curve)

Both of the metrics shown on the P-R curve focus on the positive class, the class we care about.

Plotting the P-R curve in code

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
​
digits = datasets.load_digits()
x = digits.data
y = digits.target.copy()
# Generate unbalanced data
y[digits.target==9] = 1
y[digits.target!=9] = 0
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
​
decision_scores = log_reg.decision_function(x_test)
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
​
precisions = []
recalls = []
thresholds = np.arange(np.min(decision_scores), np.max(decision_scores))
​
for threshold in thresholds:
    y_predict = np.array(decision_scores >= threshold, dtype='int')
    precisions.append(precision_score(y_test, y_predict))
    recalls.append(recall_score(y_test, y_predict)) 
    
plt.plot(thresholds, precisions, label='precision')
plt.plot(thresholds, recalls, label='recall')
plt.legend()
plt.show()


With this chart, an appropriate threshold can be chosen to balance precision and recall. For example, if precision has to stay above 90%, how much recall can still be achieved? Reading this off the chart determines a suitable threshold, as in the sketch below.
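
A small sketch of that lookup, reusing the precisions, recalls and thresholds computed above (np.argmax returns index 0 if no precision reaches 0.9, so this assumes the target is attainable):

precisions_arr = np.array(precisions)
recalls_arr = np.array(recalls)
idx = np.argmax(precisions_arr >= 0.9)   # first threshold index with precision >= 90%
thresholds[idx], precisions_arr[idx], recalls_arr[idx]
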

plt.plot(precisions, recalls)
plt.show()


The trend in the figure clearly shows that recall falls as precision rises, which confirms once more that precision and recall constrain and balance each other. The point where the curve starts to drop sharply is usually close to the best trade-off between precision and recall.

The P-R curve in sklearn

from sklearn.metrics import precision_recall_curve
​
precisions, recalls, thresholds = precision_recall_curve(y_test, decision_scores)
precisions.shape
# (145,)
recalls.shape
# (145,)
thresholds.shape
# (144,)

From the output it can be seen that the returned precisions and recalls differ in length from the thresholds. This is because sklearn chooses a suitable range of thresholds automatically and appends a final point with precision 1 and recall 0 that has no corresponding threshold. That is why thresholds is one element shorter than precisions and recalls, which has to be taken into account when plotting.
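
A quick look at that appended end point (a sketch based on the arrays returned above):

precisions[-1], recalls[-1]
# (1.0, 0.0) -- the final point has no corresponding threshold
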

plt.plot(thresholds, precisions[:-1])
plt.plot(thresholds, recalls[:-1])
plt.show()


Comparing the two plots, the precision and recall curves from our own implementation and those from sklearn look roughly the same, with slight differences, because sklearn chooses the threshold values itself and automatically concentrates on the most informative range.

plt.plot(precisions, recalls)
plt.show()

Summary

Using the precision and recall curves, a reasonable threshold can be chosen to balance precision against recall; the point where the P-R curve drops sharply is often close to the best trade-off. Finally, if two different algorithms produce the P-R curves shown in the figure below, which algorithm is better?

Every point on the outer curve has higher precision and recall than the inner curve, so a P-R curve that lies further toward the outside indicates a better model overall. The curve can therefore also be used to choose between algorithms or between hyperparameters. In practice this is summarized by the area under the P-R curve, although more commonly the area under a different curve is used: the ROC curve, which is the next topic.
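
As a sketch of that idea, sklearn summarizes the area under the P-R curve with the average precision score; here it is applied to the decision_scores from the example above (the exact value depends on the train/test split):

from sklearn.metrics import average_precision_score

average_precision_score(y_test, decision_scores)
# a single number summarizing the P-R curve; higher is better
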

ROC curve

The ROC curve (Receiver Operating Characteristic curve) describes the relationship between the TPR (true positive rate) and the FPR (false positive rate). The name comes from signal detection theory, where the receiver operating characteristic plots an observer's hit rate against the false-alarm rate under different decision criteria. The two rates are defined as:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

Programming implementation:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
​
def TN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 0))
​
def FP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 1))
​
def FN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 0))
​
def TP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 1))
​
def TPR(y_true, y_predict):
    tp = int(TP(y_true, y_predict))
    fn = int(FN(y_true, y_predict))
    try:
        return tp / (tp + fn)
    except ZeroDivisionError:
        return 0.0

def FPR(y_true, y_predict):
    fp = int(FP(y_true, y_predict))
    tn = int(TN(y_true, y_predict))
    try:
        return fp / (fp + tn)
    except ZeroDivisionError:
        return 0.0
​
digits = datasets.load_digits()
x = digits.data
y = digits.target.copy()
# Generate unbalanced data
y[digits.target==9] = 1
y[digits.target!=9] = 0
​
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
​
from sklearn.linear_model import LogisticRegression
​
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
log_reg.score(x_test, y_test)
# 0.9755555555555555
log_reg_predict = log_reg.predict(x_test)
​
decision_scores = log_reg.decision_function(x_test)
​
fprs = []
tprs = []
thresholds = np.arange(np.min(decision_scores), np.max(decision_scores))
​
for threshold in thresholds:
    y_predict = np.array(decision_scores >= threshold, dtype='int')
    fprs.append(FPR(y_test, y_predict))
    tprs.append(TPR(y_test, y_predict))
    
plt.plot(fprs, tprs)
plt.show()


Implementation of ROC curve in sklearn:

from sklearn.metrics import roc_curve
​
fprs, tprs, thresholds = roc_curve(y_test, decision_scores)
plt.plot(fprs, tprs)
plt.show()


On the ROC curve, the TPR rises as the FPR increases. Usually more attention is paid to the area under the curve (AUC). How is the area under the curve computed?

from sklearn.metrics import roc_auc_score
# area under curve
roc_auc_score(y_test, decision_scores)
# 0.9830452674897119

The larger the area under the curve, the better the classifier: a good model achieves a high TPR (many true positives found) while the FPR is still low (few negatives wrongly predicted as positive), which pushes the curve toward the upper left and enlarges the area underneath it. The output also shows that the ROC AUC is not very sensitive to class imbalance, so for extremely skewed data sets it is still necessary to inspect the precision-recall curve. The main use of the ROC AUC is to compare models or algorithms against each other.
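
One property worth noting, shown here as a sketch: the AUC depends only on how the scores rank the samples, so scoring with predict_proba instead of decision_function gives the same AUC for logistic regression, because the sigmoid is a monotonic transformation.

from sklearn.metrics import roc_auc_score

proba = log_reg.predict_proba(x_test)[:, 1]   # predicted probability of class 1
roc_auc_score(y_test, proba)
# expected to equal roc_auc_score(y_test, decision_scores)
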

Confusion matrix for multi-class problems

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=666)

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
log_reg.score(x_test, y_test)

y_predict = log_reg.predict(x_test)

from sklearn.metrics import precision_score
precision_score(y_test, y_predict, average='micro')
# Note: precision_score(y_test, y_predict) defaults to binary averaging and raises an error for multi-class targets; passing the average parameter (e.g. average='micro') handles the multi-class case
# 0.93115438108484
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_predict)

Output results:

array([[147,   0,   1,   0,   0,   1,   0,   0,   0,   0],
       [  0, 123,   1,   2,   0,   0,   0,   3,   4,  10],
       [  0,   0, 134,   1,   0,   0,   0,   0,   1,   0],
       [  0,   0,   0, 138,   0,   5,   0,   1,   5,   0],
       [  2,   5,   0,   0, 139,   0,   0,   3,   0,   1],
       [  1,   3,   1,   0,   0, 146,   0,   0,   1,   0],
       [  0,   2,   0,   0,   0,   1, 131,   0,   2,   0],
       [  0,   0,   0,   1,   0,   0,   0, 132,   1,   2],
       [  1,   9,   2,   3,   2,   4,   0,   0, 115,   4],
       [  0,   1,   0,   5,   0,   3,   0,   2,   2, 134]], dtype=int64)

The raw matrix is not very intuitive, so draw the confusion matrix instead.

import matplotlib.pyplot as plt
cfm = confusion_matrix(y_test, y_predict)
plt.matshow(cfm, cmap=plt.cm.gray)
plt.show()


The brighter a cell in the plot, the more samples fall into it, and the bright main diagonal corresponds to correct predictions. Showing only where the predictions are correct does not tell us much about the confusion matrix, however; what we really want to see is where the prediction errors occur.

# Calculate how many samples are in each row
row_sums = np.sum(cfm, axis=1)
err_matrix = cfm / row_sums
# Zero out the diagonal so that only the mis-classified cells remain
np.fill_diagonal(err_matrix, 0)
err_matrix

Output results:

array([[0.        , 0.        , 0.00735294, 0.        , 0.        ,
        0.00657895, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.00735294, 0.01342282, 0.        ,
        0.        , 0.        , 0.02205882, 0.02857143, 0.06802721],
       [0.        , 0.        , 0.        , 0.00671141, 0.        ,
        0.        , 0.        , 0.        , 0.00714286, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.03289474, 0.        , 0.00735294, 0.03571429, 0.        ],
       [0.01342282, 0.03496503, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.02205882, 0.        , 0.00680272],
       [0.00671141, 0.02097902, 0.00735294, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.00714286, 0.        ],
       [0.        , 0.01398601, 0.        , 0.        , 0.        ,
        0.00657895, 0.        , 0.        , 0.01428571, 0.        ],
       [0.        , 0.        , 0.        , 0.00671141, 0.        ,
        0.        , 0.        , 0.        , 0.00714286, 0.01360544],
       [0.00671141, 0.06293706, 0.01470588, 0.02013423, 0.01333333,
        0.02631579, 0.        , 0.        , 0.        , 0.02721088],
       [0.        , 0.00699301, 0.        , 0.03355705, 0.        ,
        0.01973684, 0.        , 0.01470588, 0.01428571, 0.        ]])

This output now contains only the error rates, but it is still hard to read as raw numbers, so plot this matrix as well.

plt.matshow(err_matrix, cmap=plt.cm.gray)
plt.show()


In this figure the brighter cells are where the most prediction errors occur: for example, samples whose true label is 1 are often predicted as 9, and samples whose true label is 8 are often predicted as 1. This not only shows where the errors are but also hints at their main cause. For this handwritten-digit recognition problem the difficulty lies mainly with the digits 1, 8 and 9, which are easily confused with one another. In principle the accuracy of the multi-class task could be improved by adjusting the decision thresholds for these confusable classes, although such fine-tuning is difficult in practice. Visualizing the confusion matrix in this way allows the problem to be analyzed further and the classification algorithm to be improved.
So far the discussion has been about solving the problem and making improvements at the algorithm level. In machine learning, however, the problem often lies not with the algorithm but with the data, for example with the data set itself. Studying the samples for digits such as 1, 8 and 9, and understanding why the algorithm or model makes its mistakes, may suggest new features to extract from the data; this is feature engineering. In short, data is the foundation of machine learning: without good data there is little point in talking about training models, which is why data cleaning and preprocessing are so critical.
