Source: Hengyuan Cloud community (a GPU acceleration platform for artificial intelligence and deep learning; official site: https://gpushare.com)

Original author | Mathor

This article introduces a post-processing trick for classification problems. It comes from the Findings of EMNLP 2021 paper "When in Doubt: Improving Classification Performance with Alternating Normalization". In practice, CAN (Classification with Alternating Normalization) does improve multi-class classification in most cases (it applies to both CV and NLP), and it adds almost no prediction cost, because it only renormalizes the prediction results.

## The idea of CAN

Interestingly, the idea behind CAN is so simple that almost all of us have used it. Suppose an exam has 10 multiple-choice questions. You are fairly confident about the first 9 but have no idea about question 10, so you can only guess. You then notice that your answers A, B, C and D on the first 9 questions are roughly in the ratio 3:3:2:2. When guessing on question 10, would you lean towards C or D? Take a more extreme case: you chose only A, B and C on the first 9 questions and never D. Would you then lean towards D on question 10?

Back to classification. Suppose we have a binary classification problem. For the input \(a\), the model predicts \(p^{(a)}=[0.05,0.95]\), so we can output category 1. Next, for the input \(b\), the model predicts \(p^{(b)}=[0.5,0.5]\). This is the most uncertain case, and we do not know which category to output.

But if I tell you:

- The category must be one of 0 or 1
- Each of the two categories occurs with probability 0.5

Given these two pieces of prior information, and given that the previous sample was predicted as 1, would we not prefer, by a simple evenness argument, to predict the latter sample as 0, so that the predictions agree with the second prior?

These simple examples share the same idea as CAN: use the prior distribution to correct low-confidence predictions, so that the distribution of the new predictions is closer to the prior distribution.
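
The binary example can be checked numerically. The following is a minimal sketch, assuming one pass of the column-then-row normalization that the article describes in later sections; the variable names are illustrative and not from the paper.

```python
import numpy as np

# Confident prediction for input a, maximally uncertain prediction for input b.
p_a = np.array([0.05, 0.95])
p_b = np.array([0.5, 0.5])
prior = np.array([0.5, 0.5])  # the stated prior: both categories equally likely

L = np.vstack([p_a, p_b])              # confident row on top, uncertain row last
L = L / L.sum(axis=0, keepdims=True)   # column normalization
L = L * prior                          # weight each column by the prior
L = L / L.sum(axis=1, keepdims=True)   # row normalization

p_b_adjusted = L[-1]
print(p_b_adjusted)  # category 0 now has the larger probability
```

Because the confident sample already "uses up" category 1, the adjusted prediction for \(b\) shifts towards category 0, which matches the intuition above.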

## Top-k entropy

To be exact, CAN is a post-processing method for low-confidence predictions, so we first need a measure of how uncertain a prediction is. The common measure is entropy. For \(p=[p_1,p_2,\dots,p_m]\) it is defined as

\[
H(p) = -\sum_{i=1}^{m} p_i \log p_i \tag{1}
\]

Although entropy is a common choice, its results do not always match intuition. For example, for \(p^{(a)}=[0.5,0.25,0.25]\) and \(p^{(b)}=[0.5,0.5,0]\), applying the formula directly gives \(H(p^{(a)}) > H(p^{(b)})\). Yet if we judge the two distributions subjectively, \(p^{(b)}\) is clearly more uncertain than \(p^{(a)}\), so using entropy directly is not reasonable.

Objectively, the greater the entropy, the more unstable the system. In terms of confidence: the greater the entropy, the lower the confidence.
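
The comparison between \(p^{(a)}\) and \(p^{(b)}\) is easy to reproduce. This is a small sketch using the natural logarithm; the helper name `entropy` is an illustrative choice, and zero probabilities are treated as contributing nothing to the sum.

```python
import numpy as np

def entropy(p):
    """Shannon entropy with natural log; zero entries contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

H_a = entropy([0.5, 0.25, 0.25])
H_b = entropy([0.5, 0.5, 0.0])
print(H_a, H_b)  # H_a is larger than H_b
```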

A simple correction is to compute the entropy from only the top-k probability values. Assuming that \(p_1,p_2,\dots,p_k\) are the \(k\) highest probabilities, then

\[
H_{\text{top-k}}(p) = H(\mathcal{T}(p,k)) \tag{2}
\]

where \(\mathcal{T}\) is the operation \(\mathbb{R}^{m} \to \mathbb{R}^{k}\) that takes the \(k\) largest values of a vector. Substituting the entropy definition (1) into equation (2) and expanding gives

\[
H_{\text{top-k}}(p) = -\sum_{i=1}^{k} \tilde{p}_i \log \tilde{p}_i \tag{3}
\]

where \(\tilde{p}_i = p_i / \sum\limits_{i=1}^{k} p_i\).
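
As a rough illustration, top-k entropy can be written in a few lines. The function name `topk_entropy` is hypothetical; the sketch keeps the \(k\) largest values, renormalizes them, and then applies the ordinary entropy formula.

```python
import numpy as np

def topk_entropy(p, k):
    """Entropy of the k largest probabilities after renormalizing them to sum to 1."""
    p = np.asarray(p, dtype=float)
    top = np.sort(p)[-k:]        # k largest values
    top = top / top.sum()        # renormalize
    nz = top[top > 0]            # zero entries contribute 0 to the entropy
    return float(-(nz * np.log(nz)).sum())

H_a = topk_entropy([0.5, 0.25, 0.25], k=2)
H_b = topk_entropy([0.5, 0.5, 0.0], k=2)
print(H_a, H_b)  # with k=2, H_b is now the larger of the two
```

With \(k=2\), the ordering of the two example distributions flips relative to plain entropy, which agrees with the subjective judgment that \(p^{(b)}\) is the more uncertain one.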

## Alternating normalization

In this part, I first describe the steps of the algorithm as given in the paper. In the next section, I work through the calculation by hand.

### Step 1

Let the column vector \(\mathbf{b}_0\in \mathbb{R}^m\) be the predicted probability distribution over categories for the input sample \(x\), where \(m\) is the number of categories. We build an \(n \times m\) probability matrix \(A_0\) by stacking the predicted probability vectors of \(n\) samples with very high confidence. Stacking \(A_0\) and \(\mathbf{b}_0\) then gives an \((n+1) \times m\) matrix \(L_0\):

\[
L_0 = \begin{bmatrix} A_0 \\ \mathbf{b}_0^T \end{bmatrix}
\]

### Step 2

The second step is an iterative process. First normalize the columns of the matrix \(L_0\) (so that each column sums to 1), then normalize the rows (so that each row sums to 1). Before giving the steps, define a vector diagonalization operation:

\[
\mathcal{D}(\mathbf{v}) = \operatorname{diag}(v_1, v_2, \dots, v_n)
\]

\(\mathcal{D}(\mathbf{v})\) converts the column vector \(\mathbf{v}\in \mathbb{R}^n\) into an \(n \times n\) diagonal matrix whose diagonal elements are the elements of the original vector.

Column normalization

\[
\Lambda_S = \mathcal{D}\left((L_{d-1}^{\alpha})^T \mathbf{e}\right), \qquad S_d = L_{d-1}^{\alpha} \Lambda_S^{-1} \tag{4}
\]

Here the parameter \(\alpha \in \mathbb{N}^+\) controls how fast \(\mathbf{b}_0\) converges to high confidence (the larger \(\alpha\), the faster; the default is 1), and \(\mathbf{e}\in \mathbb{R}^{n+1}\) is the all-ones column vector. After the transformation in equation (4), the matrix \(S_d\in \mathbb{R}^{(n+1)\times m}\) is the column-normalized form of \(L_{d-1}^{\alpha}\); \(\Lambda_S^{-1}\) is the inverse of \(\Lambda_S\).

Row normalization

\[
\Lambda_L = \mathcal{D}\left(S_d \Lambda_q \mathbf{e}\right), \qquad L_d = \Lambda_L^{-1} S_d \Lambda_q \tag{5}
\]

Here \(\mathbf{e}\in \mathbb{R}^{m}\) is again an all-ones column vector, but its dimension is \(m\). The matrix \(L_d\in \mathbb{R}^{(n+1)\times m}\) is row-normalized (note that it is the row-normalized form of \(S_d \Lambda_q\), not of \(S_d\) itself). \(\Lambda_q \in \mathbb{R}^{m\times m}\) is a diagonal matrix whose diagonal elements are the proportions of each category.

For example,

\[
\Lambda_q = \begin{bmatrix} 0.2 & 0 & 0 \\ 0 & 0.4 & 0 \\ 0 & 0 & 0.4 \end{bmatrix}
\]

means that this is a three-class problem and the categories occur in the proportion 1:2:2.
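
In code, the diagonal matrices \(\Lambda_S\), \(\Lambda_L\) and \(\Lambda_q\) never need to be built explicitly. The sketch below, on an illustrative matrix chosen only for this example, checks that one column-then-row pass written with diagonal matrices gives the same result as plain NumPy broadcasting.

```python
import numpy as np

L_prev = np.array([[0.2, 0.0, 0.8],
                   [0.9, 0.1, 0.0],
                   [0.5, 0.0, 0.5]])   # illustrative (n+1) x m matrix, not from the paper
q = np.array([0.2, 0.4, 0.4])          # class proportions 1:2:2
alpha = 1

# Matrix form: column normalization, then prior weighting and row normalization.
Lambda_S = np.diag((L_prev ** alpha).T @ np.ones(L_prev.shape[0]))
S = (L_prev ** alpha) @ np.linalg.inv(Lambda_S)
Lambda_q = np.diag(q)
Lambda_L = np.diag(S @ Lambda_q @ np.ones(L_prev.shape[1]))
L_matrix = np.linalg.inv(Lambda_L) @ S @ Lambda_q

# Broadcasting form: the same operations without explicit diagonal matrices.
S_b = L_prev ** alpha
S_b = S_b / S_b.sum(axis=0, keepdims=True)   # each column sums to 1
L_b = S_b * q                                # scale column j by q[j]
L_b = L_b / L_b.sum(axis=1, keepdims=True)   # each row sums to 1

print(np.allclose(L_matrix, L_b))
```

Right-multiplying by a diagonal matrix scales columns and left-multiplying scales rows, which is why the broadcasting version is equivalent and avoids the matrix inversions.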

### Step 3

After \(d\) iterations of Step 2, we obtain the matrix \(L_d\):

\[
L_d = \begin{bmatrix} A_d \\ \mathbf{b}_d^T \end{bmatrix}
\]

Here \(\mathbf{b}_d\) is the new probability distribution, adjusted according to the prior distribution.

Note that this process has to be applied to each low-confidence prediction in turn: the predictions are corrected one by one, not all at once. Moreover, although the predicted probabilities of the samples in \(A_0\) are also updated during the iterations, those updates are only temporary and are discarded at the end. Every correction starts from the original \(A_0\).
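
The one-by-one procedure can be sketched as two small functions. The names `alternating_normalization` and `correct_all` are hypothetical, the matrices are illustrative, and the sketch assumes that no column of the stacked matrix sums to zero.

```python
import numpy as np

def alternating_normalization(A_0, b_0, prior, alpha=1, d=1):
    """Adjust one low-confidence prediction b_0 using the confident rows A_0."""
    L = np.vstack([A_0, b_0[None, :]])            # fresh copy; A_0 itself is untouched
    for _ in range(d):
        S = L ** alpha
        S = S / S.sum(axis=0, keepdims=True)      # column normalization
        L = S * prior                             # weight columns by the prior
        L = L / L.sum(axis=1, keepdims=True)      # row normalization
    return L[-1]                                  # keep only the adjusted b; drop the rest

def correct_all(A_0, low_conf, prior, alpha=1, d=1):
    """Correct each low-confidence row separately, always starting from the original A_0."""
    return np.array([alternating_normalization(A_0, b, prior, alpha, d) for b in low_conf])

A_0 = np.array([[0.2, 0.0, 0.8], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
low_conf = np.array([[0.5, 0.0, 0.5], [0.4, 0.3, 0.3]])
prior = np.array([0.8, 0.1, 0.1])

corrected = correct_all(A_0, low_conf, prior)
print(corrected)
```

Each call rebuilds the stacked matrix from the unchanged `A_0`, so the temporary updates to the confident rows never leak from one correction into the next.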

## Worked example of AN (alternating normalization)

First, we set up some matrices and parameters (the same values are used in the Python script later in this section):

\[
A_0 = \begin{bmatrix} 0.2 & 0 & 0.8 \\ 0.9 & 0.1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\mathbf{b}_0 = \begin{bmatrix} 0.5 \\ 0 \\ 0.5 \end{bmatrix}, \quad
\Lambda_q = \begin{bmatrix} 0.8 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.1 \end{bmatrix}, \quad
\alpha = 1
\]

A little explanation: \(A_0\) is built, as in the algorithm above, by stacking the predicted probability distributions of the \(n\) high-confidence samples. Here only three samples have high confidence, and their predicted categories are 2, 0 and 2 respectively. \(\mathbf{b}_0\) is the predicted probability of a sample \(x\); because it is a probability distribution, it must sum to 1. \(\Lambda_q\) holds the proportions of the three categories, and it shows that most of the data belongs to the first category.

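First comes column normalization. The column sums of \(L_0\) are 1.6, 0.1 and 2.3, so

\[
S_1 = L_0 \Lambda_S^{-1} = \begin{bmatrix} 1/8 & 0 & 8/23 \\ 9/16 & 1 & 0 \\ 0 & 0 & 10/23 \\ 5/16 & 0 & 5/23 \end{bmatrix}
\]

Look at the matrix \(S_1\) carefully: each column sums to 1, which is exactly column normalization. Tracing it back, \(S_1\) is obtained by summing each column of \(L_0\) and then dividing every element of that column by the sum.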

Next comes row normalization. Multiplying each column of \(S_1\) by the corresponding category proportion and then dividing each row by its sum gives

\[
L_1 = \Lambda_L^{-1} S_1 \Lambda_q = \begin{bmatrix} 23/31 & 0 & 8/31 \\ 9/11 & 2/11 & 0 \\ 0 & 0 & 1 \\ 23/25 & 0 & 2/25 \end{bmatrix}
\]

We only need the last row of \(L_1\), i.e. \(\mathbf{b}_1=\begin{bmatrix}23/25,0,2/25\end{bmatrix}^T\). The original distribution was \(\mathbf{b}_0 = \begin{bmatrix}0.5, 0, 0.5\end{bmatrix}^T\); after the prior adjustment, the prediction clearly leans towards the first category, which has more data. The entries of \(\mathbf{b}_1\) still sum to 1, so it remains a valid probability distribution.
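
The value of \(\mathbf{b}_1\) can be double-checked with exact rational arithmetic. This sketch uses only the standard library and the values of \(A_0\), \(\mathbf{b}_0\) and \(\Lambda_q\) from the author's script in this section; the list-based layout is just for illustration.

```python
from fractions import Fraction as F

# L_0 is A_0 with b_0 appended as the last row.
L_0 = [[F(1, 5),  F(0),     F(4, 5)],
       [F(9, 10), F(1, 10), F(0)],
       [F(0),     F(0),     F(1)],
       [F(1, 2),  F(0),     F(1, 2)]]
q = [F(4, 5), F(1, 10), F(1, 10)]   # diagonal of Lambda_q

# Column normalization (alpha = 1).
col_sums = [sum(row[j] for row in L_0) for j in range(3)]
S_1 = [[row[j] / col_sums[j] for j in range(3)] for row in L_0]

# Weight columns by the prior, then normalize each row.
weighted = [[row[j] * q[j] for j in range(3)] for row in S_1]
L_1 = [[x / sum(row) for x in row] for row in weighted]

b_1 = L_1[-1]
print(b_1)  # [Fraction(23, 25), Fraction(0, 1), Fraction(2, 25)]
```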

This process is also very simple to implement in Python. Here is some code I wrote myself; the variable names match those in the formulas exactly.

    import numpy as np

    n, m, d, alpha = 3, 3, 5, 1
    # n: number of high-confidence samples
    # m: number of categories
    # d: number of iterations
    # alpha: exponent

    def softmax(arr):
        return np.exp(arr) / np.sum(np.exp(arr))

    A_0 = np.array([[0.2, 0, 0.8], [0.9, 0.1, 0], [0, 0, 1]])
    # A_0 = softmax(np.random.randn(n, m))

    b_0 = np.array([0.5, 0, 0.5])
    # b_0 = softmax(np.random.randn(m))

    L_0 = np.vstack((A_0, b_0))  # (n+1) * m

    Lambda_q = np.diag(np.array([0.8, 0.1, 0.1]))
    # Lambda_q = np.diag(softmax(np.random.randn(m)))

    print("Prediction probability:", b_0)
    print("Sample quantity distribution of each category:", np.diag(Lambda_q, 0))

    L_d_1 = L_0
    for _ in range(d):
        # Column normalization
        Lambda_S = np.diag(np.dot((L_d_1 ** alpha).T, np.ones((n + 1))))
        S_d = np.dot((L_d_1 ** alpha), np.linalg.inv(Lambda_S))
        # Row normalization, weighted by the prior
        Lambda_L = np.diag(np.dot(np.dot(S_d, Lambda_q), np.ones((m))))
        L_d_1 = np.dot(np.dot(np.linalg.inv(Lambda_L), S_d), Lambda_q)
        print("Probability adjusted a priori:", L_d_1[-1:])

### Reference implementation

The following is Su Jianlin's implementation. His code normalizes the top-k entropy so that \(H_{\text{top-k}}(p) \in [0,1]\), which makes it easier to choose the threshold (the `threshold` variable in the code).

    import numpy as np

    # Predictions, and accuracy before correction
    y_pred = np.array([[0.2, 0.5, 0.2, 0.1],
                       [0.3, 0.1, 0.5, 0.1],
                       [0.4, 0.1, 0.1, 0.4],
                       [0.1, 0.1, 0.1, 0.8],
                       [0.3, 0.2, 0.2, 0.3],
                       [0.2, 0.2, 0.2, 0.4]])
    num_classes = y_pred.shape[1]
    y_true = np.array([0, 1, 2, 3, 1, 2])
    acc_original = np.mean([y_pred.argmax(1) == y_true])
    print('original acc: %s' % acc_original)

    # Estimate the prior distribution from the training set
    # prior = np.zeros(num_classes)
    # for d in train_data:
    #     prior[d[1]] += 1.
    # prior /= prior.sum()
    prior = np.array([0.2, 0.2, 0.25, 0.35])

    # Evaluate the uncertainty of each prediction
    k = 3
    y_pred_topk = np.sort(y_pred, axis=1)[:, -k:]
    y_pred_topk /= y_pred_topk.sum(axis=1, keepdims=True)  # normalize
    y_pred_entropy = -(y_pred_topk * np.log(y_pred_topk)).sum(1) / np.log(k)  # top-k entropy
    print(y_pred_entropy)

    # Choose a threshold and split into high- and low-confidence parts
    threshold = 0.9
    y_pred_confident = y_pred[y_pred_entropy < threshold]  # top-k entropy below the threshold: high confidence
    y_pred_unconfident = y_pred[y_pred_entropy >= threshold]  # top-k entropy at or above the threshold: low confidence
    y_true_confident = y_true[y_pred_entropy < threshold]
    y_true_unconfident = y_true[y_pred_entropy >= threshold]

    # Show the accuracy of each part
    # Generally, the high-confidence set is much more accurate than the low-confidence set
    acc_confident = (y_pred_confident.argmax(1) == y_true_confident).mean()
    acc_unconfident = (y_pred_unconfident.argmax(1) == y_true_unconfident).mean()
    print('confident acc: %s' % acc_confident)
    print('unconfident acc: %s' % acc_unconfident)

    # Correct the low-confidence samples one by one and re-evaluate the accuracy
    right, alpha, iters = 0, 1, 1  # number correct, alpha exponent, number of iterations
    for i, y in enumerate(y_pred_unconfident):
        Y = np.concatenate([y_pred_confident, y[None]], axis=0)  # Y is L_0
        for _ in range(iters):
            Y = Y ** alpha
            Y /= Y.sum(axis=0, keepdims=True)
            Y *= prior[None]
            Y /= Y.sum(axis=1, keepdims=True)
        y = Y[-1]
        if y.argmax() == y_true_unconfident[i]:
            right += 1

    # Output the corrected accuracy
    acc_final = (acc_confident * len(y_pred_confident) + right) / len(y_pred)
    print('new unconfident acc: %s' % (right / (i + 1.)))
    print('final acc: %s' % acc_final)
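
The commented-out loop that estimates the prior from the training set can also be written in vectorized form. This is a minimal sketch: the `train_labels` array is made-up illustration data, and `np.bincount` is used in place of the explicit loop.

```python
import numpy as np

# Hypothetical integer class labels of a training set (illustration only).
train_labels = np.array([0, 1, 2, 3, 3, 2, 3, 1, 0, 3])
num_classes = 4

# Count how often each class occurs, then normalize the counts to proportions.
counts = np.bincount(train_labels, minlength=num_classes)
prior = counts / counts.sum()
print(prior)
```

Passing `minlength=num_classes` keeps the prior at the right length even if some class never appears in the labels, though a class with zero prior can then never be predicted after correction.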

## Experimental results

So how much improvement can such a simple post-processing step bring? The experimental results reported in the original paper show considerable gains.

Generally speaking, the more categories there are, the more obvious the improvement. If the number of categories is small, the improvement may be weak, and performance may even drop.

## One more thing

A very natural question is: why not stack all the low-confidence results together with the high-confidence results and correct them in one go, instead of correcting them one by one? The answer is easy to see. The original intention of CAN is to correct low-confidence results with the help of the prior distribution combined with the high-confidence results. The more low-confidence results are mixed into this process, the larger the final deviation may become. In theory, therefore, one-by-one correction is more reliable than batch correction.

## References

- When in Doubt: Improving Classification Performance with Alternating Normalization (Findings of EMNLP 2021)
- CAN: A simple post-processing technique for improving classification performance with a priori distribution