# Why does a neural network think your cat is a cat? Class activation mapping (CAM), with a PyTorch implementation

## 1 Introduction

In ordinary classification, we feed data into a neural network and it tells us which class the data belongs to. In some cases, though, we also need to know how the network reached its decision and which parts of the input influenced it. To this end, the paper Learning Deep Features for Discriminative Localization by Bolei Zhou et al. [1] proposed class activation mapping (CAM), which can tell us why the neural network thinks your cat is a cat.

## 2 Principle of class activation mapping

The explanation in the original paper is:

We perform global average pooling on the convolutional feature maps and use the result as features for a fully connected layer that produces the desired output (classification or otherwise). Given this simple connectivity structure, we can identify the importance of image regions by projecting the weights of the output layer back onto the convolutional feature maps. This technique is called class activation mapping.

As shown in the figure, global average pooling outputs the spatial average of the feature map of each unit in the last convolutional layer. A weighted sum of these values generates the final output. Similarly, we compute a weighted sum of the feature maps of the last convolutional layer to obtain the class activation map.

### 2.1 Formula

Given an image, let $f_k(x,y)$ denote the activation of the $k$-th feature map of the last convolutional layer at spatial location $(x,y)$. Global average pooling (GAP) is then applied. As we know, for feature maps of shape $c \times n \times n$ (that is, $c$ maps of size $n \times n$; this $c$ is the channel count, not the class index used below), only $c \times 1 \times 1$ values remain after average pooling, which amounts to summing all points in each feature map and averaging them. After GAP we therefore get $F_k = \sum_{x,y} f_k(x,y)$ (note that the formula in the paper omits the averaging). For a given class $c$, the input to the softmax is $S_c = \sum_k w_k^c F_k$, where $w_k^c$ is the weight that expresses the importance of feature map $k$ for class $c$, that is, the weight of the last fully connected layer. Finally, softmax turns the scores into the probability of class $c$ among all classes: $P_c = \frac{\exp(S_c)}{\sum_c \exp(S_c)}$. The paper does not use a bias in the last layer, because it has little impact on classification performance.
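
The quantities above can be traced on a toy example. The following is a minimal NumPy sketch, not code from the paper: the sizes `K`, `n`, `C` and the random arrays `f` and `w` are made up for illustration, and the pooling uses a plain sum to match the formula as written.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K feature maps of n x n, and C classes.
K, n, C = 4, 3, 2
f = rng.random((K, n, n))          # f[k, x, y]: activation of map k at (x, y)
w = rng.standard_normal((C, K))    # w[c, k]: weight of map k for class c

# Pooling: one number per feature map (a sum, as in the formula above).
F = f.sum(axis=(1, 2))             # shape (K,)

# Class scores before softmax: S[c] = sum_k w[c, k] * F[k].
S = w @ F                          # shape (C,)

# Softmax over classes (shifted by the max for numerical stability).
P = np.exp(S - S.max())
P /= P.sum()

print("scores:", S)
print("probabilities:", P)
```

Dividing `F` by `n * n` would give the true average. That only rescales every score by the same constant, so the ranking of the classes is unchanged.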

In summary, the output for class $c$ is:

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x,y) = \sum_{x,y} \sum_k w_k^c f_k(x,y)$$

For class $c$, the contribution of location $(x,y)$ is $M_c(x,y) = \sum_k w_k^c f_k(x,y)$, and therefore $S_c = \sum_{x,y} M_c(x,y)$.
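
The identity $S_c = \sum_{x,y} M_c(x,y)$ is easy to check numerically. This is a small NumPy sketch under made-up inputs; the helper name `class_activation_map` and the array sizes are hypothetical and only serve the illustration.

```python
import numpy as np

def class_activation_map(f, w_c):
    """Return M_c with M_c[x, y] = sum_k w_c[k] * f[k, x, y].

    f   : feature maps, shape (K, H, W)
    w_c : weights of one class, shape (K,)
    """
    # Contract the channel axis of f against the weight vector.
    return np.tensordot(w_c, f, axes=1)   # shape (H, W)

rng = np.random.default_rng(0)
f = rng.random((5, 4, 4))        # 5 toy feature maps of size 4 x 4
w_c = rng.standard_normal(5)     # toy weights for one class

M_c = class_activation_map(f, w_c)
S_c = w_c @ f.sum(axis=(1, 2))   # score computed the "pool first" way

# Summing the map over all positions recovers the class score.
assert np.isclose(M_c.sum(), S_c)
```

Swapping the order of the two sums is all that happens here: pooling first and weighting afterwards gives the same number as weighting every position first and pooling afterwards.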

This formula shows that the class activation map is simply the weighted sum of the feature maps of the last convolutional layer, with each map weighted by its importance for the class.

### 2.2 Intuitive explanation

In the paper above, the class activation map relies on GAP (global average pooling), which replaces the usual flatten-plus-fully-connected layers after the convolutions, reducing the number of parameters and helping to prevent overfitting, as shown in the figure below (source: a CSDN blog).
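
To make the parameter saving concrete, here is a rough count. It assumes a ResNet18-sized output of 512 feature maps of 7 × 7 and 2 classes, and it compares a single fully connected layer on the flattened features with the same layer placed after GAP. `fc_params` is a hypothetical helper that counts the weights and biases of one fully connected layer.

```python
def fc_params(in_features, out_features, bias=True):
    """Number of trainable parameters in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

channels, height, width, classes = 512, 7, 7, 2

# Classifier on the flattened feature maps: 512 * 7 * 7 inputs.
flatten_head = fc_params(channels * height * width, classes)

# Classifier after GAP: one input per feature map.
gap_head = fc_params(channels, classes)

print(flatten_head)  # 50178
print(gap_head)      # 1026
```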

In a convolutional neural network, there may be many feature maps at the end of the convolutional part. The convolutional layer before GAP tells us what each feature map looks like at the end of the network. The fully connected layer after GAP uses a set of weights to determine which feature maps are more useful for recognizing a cat. We directly multiply the weights belonging to the cat class (a multi-class problem may have n classes) by all the feature maps to find out which feature maps the network considers helpful for recognizing a cat. This is the principle of CAM.

Our final goal is to add the feature maps together according to their weights to produce one weighted feature map. This map is rendered as a picture with varying color intensity and overlaid on the original image, so that different colors show how important each pixel is for the classification, that is, where the neural network focuses.

For example, in cat-vs-dog classification, suppose we have a picture of a cat. After passing it through the network, the last convolutional layer outputs 512 × 7 × 7, that is, 512 feature maps of size 7 × 7. After GAP, these feature maps are fed into the final fully connected layer with softmax activation for classification. The weight of this fully connected layer has dimension 512 × 2. Here we only take the weights belonging to the cat class (class $c$ in the paper), that is, 512 weights, and take their dot product with the feature maps to get a 7 × 7 map. This 7 × 7 map is the weighted sum of all feature maps of the last convolutional layer. We then draw it as a picture in which different colors represent the contribution of different pixels to the classification.
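
The shapes in this example can be followed in a few lines of NumPy. This is only a sketch with random stand-ins for the real activations and weights: `features`, `fc_weight` and the class index `CAT` are assumptions, the weight is laid out as [2, 512] the way PyTorch stores it, and `np.kron` is used as a crude nearest-neighbour upsampling in place of a proper image resize.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((512, 7, 7))          # stand-in for the last conv output
fc_weight = rng.standard_normal((2, 512))   # stand-in for the fc weights
CAT = 0                                     # assumed row index of the cat class

# Weighted sum over the 512 channels using only the cat weights.
cam = np.einsum('k,khw->hw', fc_weight[CAT], features)    # shape (7, 7)

# Scale to [0, 1] so the map can be drawn; the small epsilon is an
# added guard against a constant map.
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Blow each cell up to a 32 x 32 tile: 7 x 7 -> 224 x 224.
cam_big = np.kron(cam, np.ones((32, 32)))

print(cam.shape, cam_big.shape)
```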

With my less-than-exquisite drawing skills, the process looks roughly like this. I hope it helps you understand:

## 3 Code implementation

First, prepare the dataset for cat-vs-dog recognition: Baidu Netdisk, extraction code: fxvv

For convenience, I simply trained torchvision's ResNet18 as the cat-vs-dog classifier, and I did not use a GPU for training. If you need one, please add the GPU code yourself.

    import os

    import pandas as pd
    import torch
    import torch.nn as nn
    from PIL import Image
    from torch.utils.data import DataLoader, Dataset
    from torchvision import models, transforms


    # Data processing: collect image paths and derive labels from the file names
    def data_process(train_path):
        img_list = os.listdir(train_path)
        train_df = pd.DataFrame()
        train_df['img'] = [os.path.join(train_path, name) for name in img_list]
        # File names start with "cat" or "dog": cat -> 0, dog -> 1
        train_df['label'] = train_df['img'].apply(
            lambda x: 0 if os.path.basename(x)[:3] == 'cat' else 1)
        return train_df['img'].values, train_df['label'].values


    # Wrap the data as a PyTorch Dataset
    class Cat_Dog_Data(Dataset):
        def __init__(self, img, label, transform):
            self.img = img
            self.label = label
            self.transform = transform

        def __getitem__(self, index):
            image = Image.open(self.img[index]).convert('RGB')
            if self.transform is not None:
                image = self.transform(image)
            label = torch.tensor(int(self.label[index])).long()
            return image, label

        def __len__(self):
            return len(self.label)


    if __name__ == '__main__':
        train_path = r'./train/train'
        data_img, data_label = data_process(train_path)
        transform = transforms.Compose([
            # transforms.Scale was deprecated in favor of Resize and later removed
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        train_data = Cat_Dog_Data(data_img, data_label, transform)
        train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
        model = models.resnet18()  # randomly initialized, no pretrained weights
        model.fc = nn.Linear(512, 2)  # two output classes: cat and dog
        criterion = nn.CrossEntropyLoss()
        optim = torch.optim.Adam(params=model.parameters(), lr=0.01)
        max_epoch = 1
        model.train()
        for epoch in range(max_epoch):
            for i, (img, label) in enumerate(train_loader):
                optim.zero_grad()  # clear gradients left over from the previous batch
                out = model(img)
                loss = criterion(out, label)
                loss.backward()
                optim.step()
                print('Epoch:{}/{} , batch:{}/{} , loss:{:.6f}'.format(
                    epoch, max_epoch, i, len(train_loader), loss.item()))
        torch.save(model, 'cat_dog_model.pth')  # saves the whole model object
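
For readers who want the GPU version mentioned above, one possible shape of the training loop is sketched here. It is not the author's code: `train_one_epoch` is a hypothetical helper, and the tiny linear model and the one-batch list used as a loader are placeholders so that the snippet runs on its own. The sketch falls back to the CPU when no CUDA device is available.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over loader and return the mean batch loss."""
    model.train()
    total = 0.0
    for img, label in loader:
        # Move each batch to the same device as the model.
        img, label = img.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(img), label)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Placeholder model and data; in the article these would be the ResNet18
# and the cat-vs-dog DataLoader.
model = nn.Linear(3, 2).to(device)   # move the model before building the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(4, 3), torch.tensor([0, 1, 0, 1]))]

avg_loss = train_one_epoch(model, loader, nn.CrossEntropyLoss(), optimizer, device)
print(device, avg_loss)
```

If the model is trained on a GPU and saved with `torch.save`, passing a `map_location` to `torch.load` lets it be loaded on a CPU-only machine later.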



After training the network, you can extract the feature maps of the last convolutional layer and the weights of the fully connected layer to compute the CAM. The code is adapted from a CSDN blog post by "ghost talk" and "classmate", slightly modified and with comments added. The process is as follows:

2. Extract the convolutional features before GAP
3. Extract the weights of the final fully connected layer
4. Take the dot product of the convolutional features and the weights to get the weighted sum of the feature maps
5. Resize the weighted sum from step 4 to an image, the CAM
6. Overlay the CAM on the original image and save the output

The complete script:

    import os

    import cv2
    import numpy as np
    import torch
    import torch.nn as nn
    from PIL import Image
    from torch.nn import functional as F
    from torchvision import transforms

    # Without this line, the dot product reports an error (duplicate OpenMP runtime)
    os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

    # Load the whole pickled model on the CPU.
    # On PyTorch >= 2.6, torch.load also needs weights_only=False for this.
    model_ft = torch.load('cat_dog_model.pth', map_location='cpu')

    # Get the features before GAP and the weights of the fully connected layer.
    # .children() returns the top-level modules of the network; [:-2] drops the
    # last two (avgpool and fc), so the network stops at the last convolutional block.
    model_features = nn.Sequential(*list(model_ft.children())[:-2])
    fc_weights = model_ft.state_dict()['fc.weight'].cpu().numpy()  # shape [2, 512]
    class_ = {0: 'cat', 1: 'dog'}
    model_ft.eval()
    model_features.eval()

    img_path = r'C:\Users\lenovo\Desktop\CAM\train\train\dog.5966.jpg'
    img = Image.open(img_path).convert('RGB')
    transform = transforms.Compose([
        transforms.Resize((224, 224)),  # same preprocessing as in training
        transforms.ToTensor(),
    ])
    img_tensor = transform(img).unsqueeze(0)  # shape [1, 3, 224, 224]

    # Pass the picture through the network
    with torch.no_grad():
        features = model_features(img_tensor).cpu().numpy()  # features before GAP
        logit = model_ft(img_tensor)  # raw class scores of the model
    h_x = F.softmax(logit, dim=1).squeeze()  # softmax gives the class probabilities
    idx = int(torch.argmax(h_x))  # predicted class
    prob = float(h_x[idx])  # probability of the predicted class

    # Compute the CAM
    bs, c, h, w = features.shape  # batch_size, channels, height, width
    features = features.reshape((c, h * w))  # [512, 49]
    cam = fc_weights[idx].dot(features)  # weighted sum over the channels, [49]
    cam = cam.reshape(h, w)
    cam_img = (cam - cam.min()) / (cam.max() - cam.min())  # scale the CAM to 0-1
    cam_img = np.uint8(255 * cam_img)  # scale the CAM to 0-255

    # Convert the CAM to a heat map and overlay it on the input picture
    img_bgr = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)  # PIL RGB -> OpenCV BGR array
    height, width, _ = img_bgr.shape  # size of the input picture
    # Resize the CAM to match the input picture, then colorize it
    heatmap = cv2.applyColorMap(cv2.resize(cam_img, (width, height)), cv2.COLORMAP_JET)

    result = cv2.addWeighted(heatmap, 0.5, img_bgr, 0.5, 0)  # blend the two pictures 50/50
    text = '%s %.2f%%' % (class_[idx], prob * 100)
    cv2.putText(result, text, (0, 50), fontFace=cv2.FONT_HERSHEY_SIMPLEX, fontScale=0.9,
                color=(123, 222, 238), thickness=2, lineType=cv2.LINE_AA)

    CAM_RESULT_PATH = './CAM_Photo'  # where the results are saved
    os.makedirs(CAM_RESULT_PATH, exist_ok=True)
    name_parts = os.path.basename(img_path).split('.')
    image_name_ = name_parts[0] + name_parts[1]  # e.g. "dog5966"
    cv2.imwrite(os.path.join(CAM_RESULT_PATH, image_name_ + '_pred_' + class_[idx] + '.jpg'), result)



## 4 Results

Correct result:

Wrong result:

Result analysis: because cats and dogs are similar in fur and body shape, the network's classification results are not ideal. Cats are classified correctly more often, and the CAM mostly focuses on the cat itself. Dogs are often confused with cats, and the CAM then attends to parts of the background such as grass or a cage.

## 5 Conclusion and contact information

I learned about CAM from the field of skeleton-based human action recognition [2]. There, the authors treat the skeleton joints and time as x and y and use CAM to infer which joints the neural network focuses on during training (in other words, which joints are more useful for recognizing the action). They then use multiple streams, where each stream only receives the joints that have not yet been attended to, that is, the joints with low CAM scores, which forces the network to take more information into account. This shows that CAM is not limited to images. This article mainly records my own understanding and implementation of CAM, and my ability is limited. If you find any problem, please point it out in the comment area or by private message: 1759412770@qq.com, zn1759412770@163.com

## 6 References

[1] Zhou B, Khosla A, Lapedriza A, et al. Learning Deep Features for Discriminative Localization[C]// CVPR. IEEE Computer Society, 2016.

[2] Song Y F, Zhang Z, Shan C, et al. Richly Activated Graph Convolutional Network for Robust Skeleton-based Action Recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, PP(99):1-1.

Posted on Tue, 23 Nov 2021 16:39:41 -0500 by jonez