1, Summary
CBAM (Convolutional Block Attention Module) is an attention mechanism that combines information along two dimensions, channel and spatial, and can be seen as an extension of SE (Squeeze-and-Excitation). SE addresses "what" to attend to, while CBAM addresses both "what" and "where".
2, Overall structure
The overall structure of CBAM is as follows. It applies channel attention and spatial attention in series: the channel attention (per-channel weighting) is applied to the input feature map first, the spatial attention (per-position weighting) is then applied to the result, and the output is a feature map refined by both.
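In equation form, the serial arrangement can be sketched as below. The symbols are introduced here for illustration only: $F$ is the input feature map, $M_c$ and $M_s$ are the channel and spatial attention maps, and $\otimes$ is element-wise multiplication with broadcasting.

```latex
\begin{aligned}
F'  &= M_c(F) \otimes F, \\
F'' &= M_s(F') \otimes F'.
\end{aligned}
```

Here $F'$ is the channel-refined feature map and $F''$ is the final output of the module.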
3, Channel attention
Channel attention in SE uses global average pooling to obtain the weight of each channel. CBAM argues that global max pooling is also an important cue for locating key features, which matches intuition. The channel attention in CBAM therefore combines the two pooling methods, as shown in the figure below.
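The fusion of the two pooling paths can be written compactly as follows. This is a sketch in assumed notation: $\mathrm{MLP}$ denotes a multi-layer perceptron shared by both paths, $\sigma$ the sigmoid function, and $F$ the input feature map.

```latex
M_c(F) = \sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\bigr)
```

Both pooling operations reduce each channel to a single value, so $M_c(F)$ holds one weight per channel.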
The code implementation of channel attention is as follows. The input feature map is passed through global max pooling and global average pooling, each pooled result is fed to a shared multi-layer perceptron, the two outputs are added, and a sigmoid maps the sum to values between 0 and 1. These values weight the input feature map channel by channel.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, in_planes, reduction=4):
        super(ChannelAttention, self).__init__()
        # Squeeze every channel to a single value in two different ways
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # Shared MLP, implemented with 1x1 convolutions
        self.sharedMLP = nn.Sequential(
            nn.Conv2d(in_planes, in_planes // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(in_planes // reduction, in_planes, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avgout = self.sharedMLP(self.avg_pool(x))
        maxout = self.sharedMLP(self.max_pool(x))
        # Per-channel weights in (0, 1), broadcast over H and W;
        # the return value is the weighted feature map, not the weights
        return self.sigmoid(avgout + maxout) * x
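To make the tensor shapes concrete, the same steps can be traced on a random input. This is a minimal sketch, not part of the original code: the input size, the variable names, and the untrained, randomly initialized `shared_mlp` are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical input: batch of 2, 16 channels, 8x8 spatial size.
channels, reduction = 16, 4
x = torch.randn(2, channels, 8, 8)

# Untrained stand-in for the shared MLP (two 1x1 convolutions).
shared_mlp = nn.Sequential(
    nn.Conv2d(channels, channels // reduction, 1, bias=False),
    nn.ReLU(),
    nn.Conv2d(channels // reduction, channels, 1, bias=False),
)

# Global pooling collapses H x W to 1 x 1: one descriptor per channel.
avg_desc = F.adaptive_avg_pool2d(x, 1)  # shape (2, 16, 1, 1)
max_desc = F.adaptive_max_pool2d(x, 1)  # shape (2, 16, 1, 1)

# The same MLP processes both descriptors; the sum goes through a sigmoid.
weights = torch.sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))

# Broadcasting stretches (2, 16, 1, 1) over (2, 16, 8, 8).
out = weights * x
print(avg_desc.shape, weights.shape, out.shape)
```

Every spatial position within a channel is scaled by the same weight, which is why channel attention alone cannot express "where".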
4, Spatial attention
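The schematic diagram of spatial attention is as follows. The basic idea is again to combine max pooling and average pooling, this time applied along the channel axis so that every spatial position gets one value from each.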
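As a formula, this idea can be sketched as follows. The notation is assumed for illustration: $f^{k \times k}$ is a convolution with a $k \times k$ kernel, $[\,\cdot\,;\,\cdot\,]$ is concatenation along the channel axis, $\sigma$ is the sigmoid function, and both pooling operations run over the channel axis.

```latex
M_s(F) = \sigma\bigl(f^{k \times k}\bigl([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)]\bigr)\bigr)
```

The result $M_s(F)$ has a single channel and the same height and width as $F$, that is, one weight per spatial position.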
The code implementation is as follows. First, the max-pooled and average-pooled outputs of the input feature map are computed along the channel axis. The two are concatenated into a 2-channel feature map, which is then passed through a convolution. A sigmoid is applied to the convolution output, and the result weights the input feature map position by position.
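class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        assert kernel_size in (3, 7), "kernel size must be 3 or 7"
        # "Same" padding so the attention map keeps the input's H and W
        padding = 3 if kernel_size == 7 else 1
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=padding, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Pool along the channel axis: each result has shape (N, 1, H, W)
        avgout = torch.mean(x, dim=1, keepdim=True)
        maxout, _ = torch.max(x, dim=1, keepdim=True)
        y = torch.cat([avgout, maxout], dim=1)  # (N, 2, H, W)
        y = self.conv(y)                        # (N, 1, H, W)
        # Per-position weights in (0, 1), broadcast over the channels;
        # the return value is the weighted feature map, not the weights
        return self.sigmoid(y) * x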
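The shape bookkeeping of these steps can be checked with a short trace. This is a sketch under assumptions: the input size and variable names are hypothetical, and the convolutions are untrained. It also shows that a padding of `k // 2` reproduces the paddings 1 and 3 used above for kernel sizes 3 and 7.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical input: batch of 2, 16 channels, 8x8 spatial size.
x = torch.randn(2, 16, 8, 8)

# Pooling over dim=1 (channels) leaves one value per spatial position.
avg_map = torch.mean(x, dim=1, keepdim=True)    # shape (2, 1, 8, 8)
max_map, _ = torch.max(x, dim=1, keepdim=True)  # shape (2, 1, 8, 8)
stacked = torch.cat([avg_map, max_map], dim=1)  # shape (2, 2, 8, 8)

# For odd kernel sizes, padding = k // 2 preserves height and width.
mask_shapes = {}
for k in (3, 7):
    conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)
    mask = torch.sigmoid(conv(stacked))
    mask_shapes[k] = tuple(mask.shape)

# Broadcasting stretches (2, 1, 8, 8) over (2, 16, 8, 8).
out = mask * x
print(stacked.shape, mask_shapes, out.shape)
```

All channels at a given position share one weight, which complements the per-channel weights produced by channel attention.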
5, CBAM block code implementation
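Combining channel attention and spatial attention, the overall code implementation of the CBAM block is as follows. Here the two attention modules are placed inside a ResNet-style basic block, after the second convolution and before the residual addition.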
def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class CbamBlock(nn.Module):
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        super(CbamBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.ca = ChannelAttention(out_channels)
        self.sa = SpatialAttention()
        # Project the shortcut whenever the output shape differs from the
        # input shape (channel count or spatial size)
        if stride != 1 or in_channels != out_channels:
            self.conv1x1 = nn.Conv2d(in_channels, out_channels,
                                     kernel_size=1, stride=stride)
        else:
            self.conv1x1 = None

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        # ChannelAttention and SpatialAttention already return the weighted
        # feature map (the broadcasting happens inside them), so the result
        # must not be multiplied by `out` a second time
        out = self.ca(out)
        out = self.sa(out)
        if self.conv1x1 is not None:
            residual = self.conv1x1(x)
        else:
            residual = x
        out = self.relu(out + residual)
        return out
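Outside a residual block, the two attention steps can also be packaged as one drop-in module. The following is a minimal sketch under assumptions: the class name `CBAM`, its parameter names, and the use of `mean`/`amax` in place of pooling layers are choices made here for compactness, not part of the code above.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Hypothetical stand-alone module: channel attention, then spatial attention."""

    def __init__(self, channels, reduction=4, kernel_size=7):
        super().__init__()
        # Shared MLP for channel attention (1x1 convolutions).
        self.shared_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Convolution for spatial attention; padding keeps H and W for odd kernels.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size,
                                      padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel attention: pool over H and W, weight each channel.
        avg = self.shared_mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.shared_mlp(x.amax(dim=(2, 3), keepdim=True))
        x = torch.sigmoid(avg + mx) * x
        # Spatial attention: pool over channels, weight each position.
        stacked = torch.cat([x.mean(dim=1, keepdim=True),
                             x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.spatial_conv(stacked)) * x


# Usage on a hypothetical feature map: the output shape equals the input shape.
torch.manual_seed(0)
cbam = CBAM(channels=16)
x = torch.randn(2, 16, 8, 8)
y = cbam(x)
print(y.shape)
```

Because the module preserves the shape of its input, it can in principle be inserted after any convolutional stage without changing the surrounding layers.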
6, Grad-CAM visualization
As shown in the figure below, the regions CBAM attends to are more accurate than those of SE, most visibly in the daddy longlegs example. In addition, CBAM's softmax score for the target class is higher than that of the baseline and SE.