# Summary of PyTorch loss functions

A loss function measures the difference between the predicted value and the real value. The smaller the loss, the closer the prediction is to the real value, which indicates that the network is performing better. The loss function also determines the direction in which a neural network learns, so choosing it is very important.

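Most loss functions in PyTorch take a `reduction` argument: `reduction='mean'` (the default) returns the mean of the per-element losses, `reduction='sum'` returns their total, and `reduction='none'` returns the per-element losses without reducing them.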
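As a small illustration of the `reduction` argument, the sketch below applies `nn.L1Loss` to made-up values under each mode. The tensors are assumptions chosen only to make the arithmetic easy to follow.

```python
import torch

# Made-up predictions and targets
pred = torch.tensor([1.0, 2.0, 4.0])
true = torch.tensor([1.5, 2.0, 2.0])

# Per-element absolute errors: [0.5, 0.0, 2.0]
per_element = torch.nn.L1Loss(reduction='none')(pred, true)
# Total of the per-element errors
total = torch.nn.L1Loss(reduction='sum')(pred, true)
# Total divided by the number of elements
mean = torch.nn.L1Loss(reduction='mean')(pred, true)

print(per_element, total, mean)
```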

## L1 Loss

L1 Loss is the absolute value loss, which is the absolute value of the error between the predicted value and the real value.

$$L1(x, y) = \frac{1}{N} \sum_{i=1}^{N} |x_i - y_i|$$

or

$$L1(x, y) = \sum_{i=1}^{N} |x_i - y_i|$$
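The two forms of the formula can be checked against `nn.L1Loss` directly. This is a minimal sketch with made-up values, not a replacement for the built-in loss.

```python
import torch

x = torch.tensor([0.5, 2.0, -1.0, 3.0])  # predictions (made-up values)
y = torch.tensor([1.0, 2.0, 1.0, 0.0])   # targets (made-up values)

# Mean form and sum form of the L1 formula, written out by hand
manual_mean = (x - y).abs().mean()
manual_sum = (x - y).abs().sum()

builtin_mean = torch.nn.L1Loss(reduction='mean')(x, y)
builtin_sum = torch.nn.L1Loss(reduction='sum')(x, y)

print(manual_mean, builtin_mean)
print(manual_sum, builtin_sum)
```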

## L2 Loss

L2 Loss is also commonly referred to as MSE Loss, the mean squared error loss. In PyTorch it is `nn.MSELoss`, which computes the square of the error between the predicted value and the real value.

$$L2(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$

or

$$L2(x, y) = \sum_{i=1}^{N} (x_i - y_i)^2$$
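The same check works for `nn.MSELoss`. In this sketch (made-up values again) the last element has an error of 3, which contributes 9 to the sum and shows how strongly squaring amplifies large errors.

```python
import torch

x = torch.tensor([0.5, 2.0, -1.0, 3.0])  # predictions (made-up values)
y = torch.tensor([1.0, 2.0, 1.0, 0.0])   # targets (made-up values)

squared_errors = (x - y) ** 2            # [0.25, 0.0, 4.0, 9.0]
manual_mean = squared_errors.mean()
manual_sum = squared_errors.sum()

builtin_mean = torch.nn.MSELoss(reduction='mean')(x, y)
builtin_sum = torch.nn.MSELoss(reduction='sum')(x, y)

print(manual_mean, builtin_mean)
print(manual_sum, builtin_sum)
```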

## Smooth L1 Loss

Smooth L1 Loss is a smoothed version of L1 Loss. L1 Loss is not differentiable at 0, so its gradient jumps between -1 and 1 there, while L2 Loss is easily affected by outliers because large errors are squared. Smooth L1 Loss is quadratic (strongly convex) near 0 and linear elsewhere, which combines the advantages of the squared loss and the absolute loss.

$$SmoothL1(x, y) = \frac{1}{N} \sum_{i=1}^{N} z_i$$

$$z_i = \begin{cases} 0.5 (x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5, & \text{otherwise} \end{cases}$$
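The piecewise definition can be written with `torch.where` and compared with `nn.SmoothL1Loss`. This sketch assumes the default settings of `nn.SmoothL1Loss`, where the switch between the two branches happens at an absolute error of 1. The input values are made up.

```python
import torch

x = torch.tensor([0.2, 1.0, 3.0])  # predictions (made-up values)
y = torch.tensor([0.0, 0.0, 0.0])  # targets (made-up values)

diff = (x - y).abs()
# Quadratic branch for small errors, linear branch otherwise
z = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)
manual = z.mean()

builtin = torch.nn.SmoothL1Loss()(x, y)
print(z, manual, builtin)
```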

## Cross entropy loss

Cross entropy measures how far the predicted probability distribution is from the real one. The smaller the cross entropy, the closer the two distributions are.

The cross entropy is calculated as:

$$H(p, q) = -\sum_{k=1}^{n} p_k \log(q_k)$$

where $p_k$ is the real (expected) probability of class $k$, usually 1 for the correct class and 0 otherwise, and $q_k$ is the predicted probability of class $k$.
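A minimal sketch of the formula, assuming `p` is a one-hot real distribution and `q` is a made-up predicted distribution. With a one-hot `p`, only the term of the correct class survives in the sum.

```python
import torch

p = torch.tensor([0.0, 0.0, 1.0])  # real distribution (one-hot, class 2)
q = torch.tensor([0.2, 0.3, 0.5])  # predicted distribution (made-up values)

# H(p, q) = -sum_k p_k * log(q_k)
h = -(p * torch.log(q)).sum()
# With a one-hot p this is just -log of the probability of the correct class
h_short = -torch.log(q[2])

print(h, h_short)
```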

The cross entropy loss in `torch.nn` accepts a `weight` argument, one weight per class, so classes can be weighted according to how many samples they have (for example, to give rare classes more influence).
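The sketch below shows what `weight` does under the default `reduction='mean'`: each sample's loss is multiplied by the weight of its class, and the total is divided by the sum of the weights that were used. The logits and the weight values are assumptions made up for illustration.

```python
import torch

logits = torch.tensor([[2.0, 0.5], [0.3, 1.2], [1.0, 1.0]])  # made-up values
target = torch.tensor([0, 1, 1])
weight = torch.tensor([1.0, 3.0])  # hypothetical per-class weights

weighted = torch.nn.CrossEntropyLoss(weight=weight)(logits, target)

# The same computation by hand
per_sample = torch.nn.CrossEntropyLoss(reduction='none')(logits, target)
w = weight[target]  # weight of each sample's class: [1.0, 3.0, 3.0]
manual = (w * per_sample).sum() / w.sum()

print(weighted, manual)
```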

## nn.NLLLoss

NLLLoss: negative log likelihood loss.

The formula is:

$$\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = -x_{n, y_n}$$

The formula itself contains no logarithm. NLLLoss expects its input to already be log-probabilities (for example, the output of log-softmax), so all it does is take the value in the target column of each row and negate it.

The reproduction code is as follows:

```python
import torch

# Predicted values
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
# True classes
target = torch.tensor([0, 2])

result = 0
for i in range(target.shape[0]):
    # Take out 0.5796 and 1.6660 respectively,
    # that is, predict[0][0] and predict[1][2]
    result -= predict[i][target[i]]
print(result / target.shape[0])
# tensor(-1.1228)

loss = torch.nn.NLLLoss()
print(loss(predict, target))
# tensor(-1.1228)
```
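The row-by-row selection can also be written without a loop. This is a sketch using `Tensor.gather` on the same values; it is one possible vectorized form, not how `nn.NLLLoss` is implemented internally.

```python
import torch

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
target = torch.tensor([0, 2])

# Pick predict[n, target[n]] for every row n
picked = predict.gather(1, target.unsqueeze(1)).squeeze(1)
# Negate and average, as NLLLoss does with the default reduction
manual = -picked.mean()

builtin = torch.nn.NLLLoss()(predict, target)
print(manual, builtin)
```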

## nn.CrossEntropyLoss

This is CELoss, the cross entropy loss. It is equivalent to applying `log_softmax` to `predict` and then running `nn.NLLLoss`.

The execution process is:

- Apply softmax to the predicted values to obtain a probability for each class.
- Take the logarithm of the probabilities, which turns products of probabilities into sums and simplifies the calculation.
- Take the log value at the target class in each row, negate it, and sum or average over the rows.

```python
import torch

# Predicted values
# The shape of predict is [2, 3]: two samples, three classes
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
# True classes
# The length of target equals predict.shape[0],
# and its maximum value is predict.shape[1] - 1
target = torch.tensor([0, 2])

soft_input = torch.nn.Softmax(dim=-1)
soft_out = soft_input(predict)
# tensor([[0.3068, 0.2669, 0.4263],
#         [0.0335, 0.1172, 0.8494]])

log_soft_out = torch.log(soft_out)
# tensor([[-1.1816, -1.3209, -0.8525],
#         [-3.3966, -2.1443, -0.1633]])

nll_loss = torch.nn.NLLLoss()
# The log-softmax values are passed in here
print(nll_loss(log_soft_out, target))
# tensor(0.6725)

ce_loss = torch.nn.CrossEntropyLoss()
# The raw predicted values are passed in here
print(ce_loss(predict, target))
# tensor(0.6725)
```
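Computing softmax and log in two separate steps works for these values, but it can underflow for large logits. As a hedged sketch, `torch.nn.functional.log_softmax` fuses the two steps; the `big` tensor below is a made-up extreme case used to show the difference.

```python
import torch
import torch.nn.functional as F

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
target = torch.tensor([0, 2])

two_step = torch.log(torch.softmax(predict, dim=-1))
one_step = F.log_softmax(predict, dim=-1)  # same values, computed in one step
loss = F.nll_loss(one_step, target)

# Made-up extreme logits: the softmax of the second entry underflows to 0
big = torch.tensor([[1000.0, 0.0]])
unstable = torch.log(torch.softmax(big, dim=-1))  # second entry becomes -inf
stable = F.log_softmax(big, dim=-1)               # second entry stays finite

print(loss, unstable, stable)
```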

## nn.BCELoss

Binary cross entropy loss. The formula is:

$$BCELoss(x, y) = -\left(y \log(x) + (1 - y) \log(1 - x)\right)$$

It can be seen from the formula that, compared with CELoss, BCELoss also includes a $(1 - y) \log(1 - x)$ term, so it takes the negative case into account as well.

When BCELoss deals with a binary (0-1) target, one of the two terms becomes 0: for $y = 1$ only $-\log(x)$ remains, and for $y = 0$ only $-\log(1 - x)$ remains. The remaining term has the same form as CELoss.
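To see one term vanish for a 0-1 target, the two terms of the formula can be computed separately. A minimal sketch with made-up probabilities:

```python
import torch

x = torch.tensor([0.9, 0.2])  # predicted probabilities (made-up values)
y = torch.tensor([1.0, 0.0])  # binary targets

term_pos = y * torch.log(x)            # non-zero only where y == 1
term_neg = (1 - y) * torch.log(1 - x)  # non-zero only where y == 0
manual = -(term_pos + term_neg).mean()

builtin = torch.nn.BCELoss()(x, y)
print(term_pos, term_neg, manual, builtin)
```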

BCELoss has two requirements for input data:

- The input `predict` and `target` must have the same shape.
- The values of `predict` must lie in the range 0 to 1.

### Solving binary classification problems

Softmax normalizes the data so that each row sums to 1. Looking at the formula of BCELoss, for binary classification problems CELoss is equivalent to Softmax + BCELoss. In other words, BCELoss is equivalent to log + NLLLoss when facing binary classification problems.

```python
import torch

# Predicted values
predict = torch.rand([2, 2])
# True classes
ce_target = torch.tensor([1, 0])

# 1. CELoss
ce_loss = torch.nn.CrossEntropyLoss()
print(ce_loss(predict, ce_target))

# 2. Softmax + BCELoss
soft_input = torch.nn.Softmax(dim=-1)
soft_out = soft_input(predict)
bce_target = torch.Tensor([[0, 1], [1, 0]])
bce_loss = torch.nn.BCELoss()
print(bce_loss(soft_out, bce_target))

# 3. Manually implement a BCELoss
bce_result = - bce_target * torch.log(soft_out) - (1.0 - bce_target) * torch.log(1.0 - soft_out)
print(bce_result.mean())

# 4. Softmax + log + NLLLoss
log_soft_out = torch.log(soft_out)
nll_loss = torch.nn.NLLLoss()
print(nll_loss(log_soft_out, ce_target))
```

All four results are the same, so the pieces connect.

### Solving multi-class classification problems

Since BCELoss requires `predict` and `target` to have the same shape, how should the target be constructed to solve a multi-class problem? If BCELoss could not handle multi-class problems, it would not be used so often in object detection networks.

This is where the one-hot format comes in. As the code above shows, `bce_target` is in one-hot form.

```python
import torch

# Predicted values
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
# True classes
ce_target = torch.tensor([2, 0])

# 1. CELoss
ce_loss = torch.nn.CrossEntropyLoss()
print('ce_loss:', ce_loss(predict, ce_target))
# ce_loss: tensor(2.1246)

# 2. Softmax + BCELoss
soft_input = torch.nn.Softmax(dim=-1)
soft_out = soft_input(predict)
bce_target = torch.Tensor([[0, 0, 1], [1, 0, 0]])
bce_loss = torch.nn.BCELoss()
print('bce_loss:', bce_loss(soft_out, bce_target))
# bce_loss: tensor(1.1572)

# 3. Softmax + log + NLLLoss
log_soft_out = torch.log(soft_out)
nll_loss = torch.nn.NLLLoss()
print('nll_loss:', nll_loss(log_soft_out, ce_target))
# nll_loss: tensor(2.1246)
```
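The one-hot target above is written by hand. As a small sketch, it can also be derived from the class indices with `torch.nn.functional.one_hot`; the `.float()` call is needed because `one_hot` returns integers while BCELoss expects a floating-point target.

```python
import torch
import torch.nn.functional as F

ce_target = torch.tensor([2, 0])

# Class indices -> one-hot rows, converted to float for BCELoss
bce_target = F.one_hot(ce_target, num_classes=3).float()
print(bce_target)
# tensor([[0., 0., 1.],
#         [1., 0., 0.]])
```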

It can be seen that CELoss and BCELoss give different results on the multi-class problem.

To see why, compare the Softmax outputs of a two-class problem and a three-class problem:

```python
import torch

# Two-class predicted values
predict_2 = torch.rand([3, 2])
# tensor([[0.6718, 0.8155],
#         [0.6771, 0.1240],
#         [0.7621, 0.3166]])

soft_input = torch.nn.Softmax(dim=-1)

# Two-class Softmax results
soft_out_2 = soft_input(predict_2)
# tensor([[0.4641, 0.5359],
#         [0.6349, 0.3651],
#         [0.6096, 0.3904]])

# Three-class predicted values
predict_3 = torch.rand([2, 3])
# tensor([[0.0098, 0.5813, 0.9645],
#         [0.4855, 0.5245, 0.4162]])

# Three-class Softmax results
soft_out_3 = soft_input(predict_3)
# tensor([[0.1863, 0.3299, 0.4839],
#         [0.3364, 0.3498, 0.3139]])
```

In the two-class case, each row of `soft_out_2` has only two elements and they sum to 1, that is, `soft_out_2[:, 0] + soft_out_2[:, 1] = 1`.

Assume that the first element of the target is 0 (so the second is 1). Substituting into the BCELoss formula $BCELoss(x, y) = -(y \log(x) + (1 - y) \log(1 - x))$ gives:

$$BCELoss(soft\_out\_2[0][0], 0) = -\log(1 - soft\_out\_2[0][0]) = -\log(soft\_out\_2[0][1])$$

$$BCELoss(soft\_out\_2[0][1], 1) = -\log(soft\_out\_2[0][1])$$

The two are the same. In the binary case every element in a row of the BCELoss result has the same value, so averaging over a row gives exactly that value, which is also the CELoss term for that row.
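This can be checked numerically with `reduction='none'`, which keeps the per-element BCELoss values. The sketch below uses random two-class inputs and a made-up set of target classes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
predict_2 = torch.rand([3, 2])
ce_target = torch.tensor([1, 0, 0])  # made-up target classes
bce_target = F.one_hot(ce_target, num_classes=2).float()

soft_out_2 = torch.softmax(predict_2, dim=-1)
per_element = torch.nn.BCELoss(reduction='none')(soft_out_2, bce_target)
ce = torch.nn.CrossEntropyLoss()(predict_2, ce_target)

# Both columns of per_element hold the same value in every row
print(per_element)
print(per_element.mean(), ce)
```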

In the three-class case, however, each row of `soft_out_3` has three elements, and the three sum to 1.

Again assume that the first element of the target is 0. Now the elements in each row of the BCELoss result are different, so the averaged result differs from CELoss.

This is the advantage of BCELoss over CELoss on multi-class problems: CELoss only uses the value at the target class in each row, while BCELoss takes every element of each row into account.
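As a sketch of that difference, the per-element BCELoss values for the three-class example can be split into two parts: the entry in the target column of each row equals the CELoss term for that row, and the remaining entries are the extra $-\log(1 - p)$ penalties on the other classes.

```python
import torch
import torch.nn.functional as F

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
ce_target = torch.tensor([2, 0])
bce_target = F.one_hot(ce_target, num_classes=3).float()
soft_out = torch.softmax(predict, dim=-1)

per_element = torch.nn.BCELoss(reduction='none')(soft_out, bce_target)
per_sample_ce = torch.nn.CrossEntropyLoss(reduction='none')(predict, ce_target)

# Entry in the target column of each row of the BCELoss result
target_column = per_element.gather(1, ce_target.unsqueeze(1)).squeeze(1)

print(per_element)
print(target_column, per_sample_ce)
```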

## BCEWithLogitsLoss

As mentioned above, BCELoss has two requirements for its input data:

- The input `predict` and `target` must have the same shape.
- The values of `predict` must lie in the range 0 to 1.

The shape requirement is met by constructing the target in one-hot form.

How can the values of `predict` be kept between 0 and 1?

Softmax, mentioned earlier, is one solution, and sigmoid is another. Unlike Softmax, whose outputs in a row sum to 1, the outputs of sigmoid are independent of each other.
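A quick sketch of the two options on the same predicted values: both map every element into the range 0 to 1, but only the Softmax rows are constrained to sum to 1.

```python
import torch

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])

softmax_out = torch.softmax(predict, dim=-1)
sigmoid_out = torch.sigmoid(predict)

print(softmax_out.sum(dim=-1))  # each row sums to 1
print(sigmoid_out.sum(dim=-1))  # no such constraint on the rows
```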

BCEWithLogitsLoss first applies sigmoid to the data and then computes BCELoss. That is, BCEWithLogitsLoss = sigmoid + BCELoss.

```python
import torch

# Predicted values
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
# True values
bce_target = torch.Tensor([[0, 0, 1], [1, 0, 0]])

bce_logits_loss = torch.nn.BCEWithLogitsLoss()
print(bce_logits_loss(predict, bce_target))

sigmoid_out = torch.sigmoid(predict)
bce_loss = torch.nn.BCELoss()
print(bce_loss(sigmoid_out, bce_target))
```
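The two paths agree for moderate values, but they can diverge for extreme logits, because sigmoid saturates to exactly 1 in float32 before BCELoss takes the logarithm. The sketch below uses a made-up logit of 50 with a target of 0; working directly on the logit, BCEWithLogitsLoss returns a loss of about 50.

```python
import torch

logit = torch.tensor([50.0])  # made-up extreme logit
target = torch.tensor([0.0])

# sigmoid(50) rounds to 1.0 in float32, so the information in the logit is lost
naive = torch.nn.BCELoss()(torch.sigmoid(logit), target)
# Working on the logit directly keeps the loss accurate
stable = torch.nn.BCEWithLogitsLoss()(logit, target)

print(naive, stable)
```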

## Focal Loss

Focal Loss was proposed by Kaiming He and his co-authors to deal with the large imbalance between positive and negative samples. At present, `torch.nn` does not include a corresponding loss class. I will add notes on it later. You can read this blog first.
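Until those notes are written, here is a hedged sketch of a binary focal loss built on top of `binary_cross_entropy_with_logits`. The helper name `binary_focal_loss` is hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are assumptions taken from the values commonly quoted for the focal loss paper. The idea is to scale each BCE term by $(1 - p_t)^\gamma$, where $p_t$ is the predicted probability of the true class, so that easy samples contribute less.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Hypothetical helper: mean binary focal loss computed from raw logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    # Probability assigned to the true class of each element
    p_t = p * targets + (1 - p) * (1 - targets)
    # Class-balancing factor: alpha for positives, 1 - alpha for negatives
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])  # made-up logits
targets = torch.tensor([1.0, 0.0, 0.0])
print(binary_focal_loss(logits, targets))
```

With `gamma=0` and `alpha=0.5` the modulating factor disappears and the result is half of the plain BCEWithLogitsLoss value, which is a convenient sanity check.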