CBAM (Convolutional Block Attention Module)

1. Summary

    CBAM is an attention mechanism that combines information along two dimensions, channel and spatial, and can be seen as a supplement to SE. Where SE answers the question of "what" (which channels matter), CBAM answers "what + where" (which channels and which spatial positions matter).

2. Overall structure

    The overall structure of CBAM is as follows. It applies channel attention and spatial attention in series: the input feature map is first weighted channel by channel (channel attention), then weighted position by position (spatial attention), and the output is a feature map that integrates both kinds of attention information.
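    As a minimal sketch of this sequential composition (ChannelAttention and SpatialAttention stand for the modules defined in the following sections, which return attention maps that are broadcast-multiplied with the feature map):

def cbam_forward(x, channel_attention, spatial_attention):
    # Channel attention first: (B, C, 1, 1) weights broadcast over H and W
    x = channel_attention(x) * x
    # Then spatial attention: (B, 1, H, W) weights broadcast over C
    x = spatial_attention(x) * x
    return x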

3. Channel attention

    Channel attention in SE uses only global average pooling to obtain the weight of each channel, while CBAM argues that global max pooling is also an important way to locate salient features, which matches our intuition. The channel attention in CBAM therefore fuses the two pooling methods, as shown in the figure below.
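    In the notation of the CBAM paper, the resulting channel attention map is Mc(F) = sigmoid( MLP(AvgPool(F)) + MLP(MaxPool(F)) ), where F is the input feature map and the MLP is shared between the two pooled descriptors.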

      The code implementation of channel attention is as follows. The input feature map is passed through global max pooling and global average pooling, each pooled descriptor goes through a shared multi-layer perceptron, the two outputs are added, and a sigmoid maps the sum to values between 0 and 1. These per-channel weights are then multiplied with the input feature map channel by channel (done by the caller, in the CBAM block below).

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, in_planes, reduction=4):
        super(ChannelAttention, self).__init__()
        # Global average pooling and global max pooling over the spatial dims
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        # Shared MLP implemented with 1x1 convolutions: C -> C/r -> C
        self.sharedMLP = nn.Sequential(
            nn.Conv2d(in_planes, in_planes // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(in_planes // reduction, in_planes, 1, bias=False)
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avgout = self.sharedMLP(self.avg_pool(x))
        maxout = self.sharedMLP(self.max_pool(x))
        # Return the (B, C, 1, 1) channel attention map; the caller multiplies
        # it with the input feature map (see the CBAM block below)
        return self.sigmoid(avgout + maxout)
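    A quick shape check of the module above (hypothetical tensor sizes, reusing the imports from the code block):

ca = ChannelAttention(in_planes=64)
x = torch.randn(2, 64, 32, 32)   # (batch, channels, height, width)
w = ca(x)
print(w.shape)                   # torch.Size([2, 64, 1, 1])
print((w * x).shape)             # torch.Size([2, 64, 32, 32])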

4. Spatial attention

    The schematic diagram of spatial attention is as follows. The basic idea is again to combine max pooling and average pooling, this time taken along the channel dimension.
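    In the paper's notation, Ms(F) = sigmoid( f7x7([AvgPool(F); MaxPool(F)]) ), where f7x7 denotes a 7x7 convolution and [;] denotes concatenation along the channel axis.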

      The code implementation is as follows. First, the channel-wise max-pooled and average-pooled maps of the input feature map are computed and concatenated into a 2-channel feature map, which is then passed through a convolution. The convolution output goes through a sigmoid, and the resulting spatial attention map is multiplied with the input feature map (again by the caller).

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        assert kernel_size in (3, 7), "kernel size must be 3 or 7"
        padding = 3 if kernel_size == 7 else 1

        # 2 input channels (avg map and max map) -> 1-channel attention map
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=padding, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Average and max pooling along the channel dimension
        avgout = torch.mean(x, dim=1, keepdim=True)
        maxout, _ = torch.max(x, dim=1, keepdim=True)
        y = torch.cat([avgout, maxout], dim=1)
        y = self.conv(y)
        # Return the (B, 1, H, W) spatial attention map; the caller multiplies
        # it with the input feature map
        return self.sigmoid(y)
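    The corresponding shape check for spatial attention (same hypothetical sizes):

sa = SpatialAttention(kernel_size=7)
x = torch.randn(2, 64, 32, 32)
w = sa(x)
print(w.shape)                   # torch.Size([2, 1, 32, 32])
print((w * x).shape)             # torch.Size([2, 64, 32, 32])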

5. CBAM block code implementation

    Combining channel attention and spatial attention with a ResNet-style residual block, the overall code implementation of the CBAM block is as follows.

def conv3x3(in_planes, out_planes, stride=1):
    "3x3 convolution with padding"
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1,
                     bias=False)


class CbamBlock(nn.Module):
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        super(CbamBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.ca = ChannelAttention(out_channels)
        self.sa = SpatialAttention()

        # Project the shortcut with a 1x1 convolution whenever the shape
        # changes (channel count or stride), so the residual addition matches
        if in_channels != out_channels or stride != 1:
            self.conv1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv1x1 = None

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out = self.ca(out) * out  # Broadcasting mechanism
        out = self.sa(out) * out  # Broadcasting mechanism

        if self.conv1x1 is not None:
            residual = self.conv1x1(x)
        else:
            residual = x
        out = self.relu(out + residual)
        return out
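    A short usage sketch with hypothetical shapes, showing that the block can replace a standard ResNet-style basic block:

block = CbamBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(2, 64, 32, 32)
y = block(x)
print(y.shape)                   # torch.Size([2, 128, 16, 16])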


6. Grad-CAM visualization

    As shown in the figure below, CBAM's attention localizes the target object more accurately than SE's, especially in the side-by-side examples. In addition, CBAM produces higher softmax scores for the target class than both the baseline and SE.
