An application scenario
In multi-label problems, samples sometimes need special handling. For example, some samples should update only certain classes, while other samples update different classes.
For example, suppose we want to predict what kinds of movies a user likes. If we know which movies a user has watched, each of those is a positive sample. Negative samples are generally constructed by random sampling, and there are two ways to do it: fix the user and randomly sample movies, or fix the movie and randomly sample users. The two ways are essentially equivalent. However, randomly sampling users can also cover users who have no viewing records at all, so we prefer the latter. (Users who watch movies and users who don't may themselves be very different. If the training domain is limited to users who watch movies, the model's output may be unreliable when it meets a user with no viewing history at inference time.) A user can like several kinds of movies or none at all, so this is a multi-label problem.
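The "fix the movie, randomly sample users" strategy can be sketched as follows. This is a minimal illustration, not the pipeline used for this article: the function name sample_negative_users, the watched dictionary, and the toy user and movie IDs are all hypothetical.

```python
import random


def sample_negative_users(movie_id, all_users, watched, num_negatives, seed=None):
    """For a fixed movie, randomly draw users with no record of watching it.

    `watched` maps a user to the set of movies they watched (hypothetical structure).
    Users missing from `watched` have no viewing history and remain eligible.
    """
    rng = random.Random(seed)
    candidates = [u for u in all_users if movie_id not in watched.get(u, set())]
    k = min(num_negatives, len(candidates))
    return rng.sample(candidates, k)


# Toy data (assumed): user "C" has never watched anything.
all_users = ["A", "B", "C", "D"]
watched = {"A": {"m1", "m2"}, "B": {"m2"}, "D": {"m1"}}

negatives = sample_negative_users("m2", all_users, watched, num_negatives=2, seed=0)
print(negatives)
```

Because the candidates come from the full user list rather than from the viewing log, users without any history can appear as negatives, which is the property the paragraph above relies on.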
If there are five types of movies and user A has watched category 2, then the user's label can be expressed as [0,1,0,0,0]. Suppose we construct a negative sample for this positive sample by randomly sampling a user, say user B. What should the negative sample's label be? Can it be [0,0,0,0,0] or [1,0,1,1,1]? No. We only know that user A has watched category 2; we can't say that user A dislikes the other categories. Similarly, for the randomly sampled user B, we can treat category 2 as 0, but it's hard to say whether the other categories are 0 or 1.
The common approach is to let this sample update only the parameters related to category 2 and leave the other categories untouched. This is done by applying a mask when calculating the loss. In the example above, the mask for this sample is [0,1,0,0,0]: positions where the mask is 1 contribute to the loss, and positions where the mask is 0 do not.
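The label and mask for the user A / user B example can be built like this. It is only a sketch: the helper make_example and the constant NUM_CLASSES are names introduced here for illustration, and categories are 0-indexed in the code (category 2 is index 1).

```python
NUM_CLASSES = 5  # five movie categories, as in the example above


def make_example(class_index, is_positive, num_classes=NUM_CLASSES):
    """Return (label, mask) for a sample that carries information about one class only.

    Only `class_index` is known, so only that position is unmasked.
    """
    label = [0.0] * num_classes
    mask = [0.0] * num_classes
    label[class_index] = 1.0 if is_positive else 0.0
    mask[class_index] = 1.0
    return label, mask


# User A watched category 2 (index 1): positive sample.
label_a, mask_a = make_example(1, is_positive=True)
# User B was randomly sampled as the negative for the same category.
label_b, mask_b = make_example(1, is_positive=False)

print(label_a, mask_a)
print(label_b, mask_b)
```

The positive and the negative sample share the same mask; they differ only in the label value at the unmasked position.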
Let's take sigmoid cross-entropy loss (binary classification / multi-label) as an example and look at two ways to mask the loss. One uses TensorFlow's built-in loss function, which accepts weights, to implement the mask; the other is a masked loss function we write ourselves. For a 0/1 mask the two are equivalent. Note that tf.nn.sigmoid_cross_entropy_with_logits returns a loss with the same shape as the input labels and logits, whereas the loss=xxxx we usually print in the log is the value after aggregation. There are many ways to aggregate. The reduction we use here is SUM_OVER_NONZERO_WEIGHTS, that is, the sum of the weighted losses divided by the number of nonzero weights. With a 0/1 mask this is the average of the loss over the unmasked positions.
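Before the TensorFlow version, the reduction can be spelled out in plain NumPy. The sketch below is a re-implementation for illustration, not TensorFlow's own code; the function names are made up here, and the per-element formula is the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) for logits x and labels z.

```python
import numpy as np


def sigmoid_cross_entropy(labels, logits):
    """Element-wise sigmoid cross-entropy; output has the same shape as the inputs."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))


def masked_loss_weighted(labels, logits, mask):
    """Sum of weighted losses divided by the number of nonzero weights."""
    per_element = sigmoid_cross_entropy(labels, logits)
    num_nonzero = np.count_nonzero(mask)
    if num_nonzero == 0:
        return 0.0
    return float(np.sum(per_element * mask) / num_nonzero)


def masked_loss_gather(labels, logits, mask):
    """Mean of the per-element losses at the positions selected by the mask."""
    per_element = sigmoid_cross_entropy(labels, logits)
    selected = per_element[mask > 0]
    return float(selected.mean()) if selected.size else 0.0


labels = np.array([[1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0, 0.0]])
logits = np.array([[0.2, 0.2, 0.2, 0.0, 0.8], [0.0, 0.5, 0.5, 0.2, 0.0]])
mask = np.array([[1, 0, 0, 0, 0], [0, 0, 1, 1, 0]])

per_element = sigmoid_cross_entropy(labels, logits)
print(per_element.shape)  # (2, 5), same shape as labels and logits
print(masked_loss_weighted(labels, logits, mask))
print(masked_loss_gather(labels, logits, mask))
```

The two functions agree only because the mask contains just 0s and 1s. With fractional weights, the weighted version would scale each element's loss while the gather version would not.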
import tensorflow as tf

# Run in graph mode so that tf.compat.v1.Session also works under TensorFlow 2.x.
tf.compat.v1.disable_eager_execution()

label = tf.constant([[1.0, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 1.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 0.0, 1.0],
                     [1.0, 1.0, 1.0, 1.0, 1.0]])
# pred is passed to both loss functions as logits (pre-sigmoid scores).
pred = tf.constant([[0.2, 0.2, 0.2, 0.0, 0.8],
                    [0.0, 0.5, 0.5, 0.2, 0.0],
                    [0.0, 0.5, 0.1, 0.0, 0.9],
                    [0.9, 0.9, 0.7, 0.1, 0.0]])
mask = tf.constant([[1, 0, 0, 0, 0],
                    [0, 0, 1, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 1, 1, 1, 1]])


def loss1(a, b, weights):
    # TensorFlow's built-in loss: it computes the sigmoid cross-entropy,
    # multiplies it by the weights, and applies the reduction.
    loss = tf.compat.v1.losses.sigmoid_cross_entropy(
        multi_class_labels=a,
        logits=b,
        weights=weights,
        reduction=tf.compat.v1.losses.Reduction.SUM_OVER_NONZERO_WEIGHTS)
    return loss


def loss2(a, b, weights):
    # Per-element loss, with the same shape as the labels.
    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=a, logits=b)
    # Indices of the positions where the mask is 1.
    mask_idx = tf.where(weights > 0)
    # Aggregate according to the mask: average the loss over the positions with
    # nonzero weight, which matches SUM_OVER_NONZERO_WEIGHTS for a 0/1 mask.
    # Return 0 when the mask selects nothing.
    loss = tf.cond(tf.equal(tf.size(mask_idx), 0),
                   lambda: tf.constant(0.0),
                   lambda: tf.reduce_mean(tf.gather_nd(cross_entropy, mask_idx)))
    return loss


with tf.compat.v1.Session() as sess:
    result1 = sess.run(loss1(label, pred, mask))
    print(result1)
    result2 = sess.run(loss2(label, pred, mask))
    print(result2)
The output results show that they are the same: