FPN Backbone: Feature Pyramid Network (PyTorch implementation and code analysis)

  Background:

         To strengthen semantics, traditional object detection models usually run subsequent operations only on the last feature map of a deep convolutional network. The downsampling rate (the factor by which the image is reduced) at that layer is usually large, such as 16 or 32, so small objects retain little effective information on the feature map and their detection performance drops sharply. This is known as the multi-scale problem. The key to solving it is extracting multi-scale features. A traditional approach is the image pyramid: the input image is resized to multiple scales, and each scale produces its own features. This method is simple and effective, and is widely used in competitions such as COCO, but it is very time-consuming and computationally expensive. As the earlier posts on backbone networks showed, different layers of a convolutional neural network differ in spatial size and semantic content, which itself resembles a pyramid structure. FPN (Feature Pyramid Network), proposed in 2017, fuses the features of different layers and substantially improves multi-scale detection.

FPN network structure:

The overall architecture of FPN is shown in the figure above. It consists of four main parts: the bottom-up pathway, the top-down pathway, lateral connections, and convolutional fusion.

Bottom-up pathway:

         The leftmost part is an ordinary convolutional network; by default a ResNet is used to extract semantic information. C1 denotes ResNet's first few convolution and pooling layers, while C2 to C5 are the ResNet convolution stages. Each stage contains multiple Bottleneck blocks; feature maps keep the same size within a stage and shrink between stages.
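As a concrete illustration (a minimal sketch, assuming a ResNet-50-style stem like the one in the code later in this post), the C1 part alone, a 7×7 stride-2 convolution plus a stride-2 max pool, already reduces a 224×224 input to stride 4:

```python
import torch
import torch.nn as nn

# ResNet-style stem ("C1"): 7x7 stride-2 conv, BN, ReLU, stride-2 max pool
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 64, 56, 56]) -> total stride 4
```

Each subsequent stage (C2 to C5) then halves the spatial size again, giving strides 4, 8, 16 and 32.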

Top-down pathway:

         First, C5 passes through a 1×1 convolution to reduce its number of channels, giving P5. P5 is then upsampled step by step to produce P4, P3 and P2, matching the spatial sizes of C4, C3 and C2 so that element-wise addition is possible in the next step. The upsampling used here is 2× nearest-neighbor: adjacent elements are copied directly, with no interpolation.
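The nearest-neighbor behavior is easy to verify with `F.interpolate` (the modern replacement for the deprecated `F.upsample`): each element is simply copied into a 2×2 block.

```python
import torch
import torch.nn.functional as F

# A 2x2 feature map; 2x nearest-neighbor upsampling copies each element
x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])
up = F.interpolate(x, scale_factor=2, mode='nearest')
# up[0, 0] is now:
# [[1., 1., 2., 2.],
#  [1., 1., 2., 2.],
#  [3., 3., 4., 4.],
#  [3., 3., 4., 4.]]
print(up.shape)  # torch.Size([1, 1, 4, 4])
```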

Lateral Connection:

         The purpose is to fuse the upsampled high-level semantic features with the shallow features that carry localization detail. After upsampling, the high-level features match the spatial size of the corresponding shallow features, and their channel count is fixed at 256. The lower-level features C2 to C4 are therefore passed through 1×1 convolutions to bring their channel count to 256, and then added element-wise to obtain P4, P3 and P2. Because C1's feature map is large and semantically weak, it is not included in the lateral connections.
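A minimal sketch of one lateral connection, using hypothetical feature-map sizes and the ResNet-50 channel counts assumed by the code below (1024 channels for C4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c4 = torch.randn(1, 1024, 32, 32)  # shallow feature with localization detail
p5 = torch.randn(1, 256, 16, 16)   # deep feature, already reduced to 256 channels

lat = nn.Conv2d(1024, 256, kernel_size=1)  # 1x1 lateral convolution
# upsample the deep feature to c4's size, then add element-wise
p4 = F.interpolate(p5, size=(32, 32), mode='nearest') + lat(c4)
print(p4.shape)  # torch.Size([1, 256, 32, 32])
```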

Convolution fusion:

         After the element-wise addition, a 3×3 convolution is applied to each of the resulting P2 to P4 in order to eliminate the aliasing effect introduced by upsampling and produce the final feature maps.
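This smoothing layer is just a shape-preserving 3×3 convolution on the 256-channel maps, e.g.:

```python
import torch
import torch.nn as nn

# 3x3 conv, stride 1, padding 1: channel count and spatial size are unchanged
smooth = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
p4 = torch.randn(1, 256, 32, 32)
print(smooth(p4).shape)  # torch.Size([1, 256, 32, 32])
```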

How to select a feature map:

         A practical object detector must extract RoIs (Regions of Interest) from a feature map, but FPN outputs four feature maps, so choosing which map to use for each RoI is itself a question. FPN's answer is to assign RoIs of different sizes to different maps: large RoIs are extracted from deep feature maps such as P5, and small RoIs from shallow feature maps such as P2.
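The FPN paper makes this assignment explicit with the rule k = ⌊k0 + log2(√(wh) / 224)⌋, where 224 is the canonical ImageNet size and k0 = 4. A small sketch, clamping the result to the available levels P2 to P5:

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Map an RoI of size w x h to a pyramid level Pk,
    following the assignment rule from the FPN paper."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))

print(fpn_level(224, 224))  # 4: a canonical 224x224 RoI maps to P4
print(fpn_level(112, 112))  # 3: half the size -> one level shallower
print(fpn_level(448, 448))  # 5: twice the size -> one level deeper
print(fpn_level(32, 32))    # 2: clamped to the shallowest level
```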

FPN propagates deep semantic information down to supplement the shallow layers, producing features that are both high-resolution and semantically strong. It performs very well in small-object detection, instance segmentation and related fields.

Complete code:

import torch.nn as nn
import torch.nn.functional as F

## First, the basic building block (Bottleneck) of ResNet
class Bottleneck(nn.Module):
    expansion = 4   ##Channel multiplier
    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.bottleneck = nn.Sequential(
                nn.Conv2d(in_planes, planes, 1, bias=False),
                nn.BatchNorm2d(planes),
                nn.ReLU(inplace=True),
                nn.Conv2d(planes, planes, 3, stride, 1, bias=False),
                nn.BatchNorm2d(planes),
                nn.ReLU(inplace=True),
                nn.Conv2d(planes, self.expansion * planes, 1, bias=False),
                nn.BatchNorm2d(self.expansion * planes),
            )
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
    def forward(self, x):
        identity = x
        out = self.bottleneck(x)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out
​
##FPN class
class FPN(nn.Module):
    def __init__(self, layers):
        super(FPN, self).__init__()
        self.inplanes = 64
        ### The next four layers form the C1 module that processes the input (cf. the figure above)
        self.conv1 = nn.Conv2d(3, 64, 7, 2, 3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(3, 2, 1)
        ###Build C2, C3, C4 and C5 from bottom to top
        self.layer1 = self._make_layer(64, layers[0])
        self.layer2 = self._make_layer(128, layers[1], 2)
        self.layer3 = self._make_layer(256, layers[2], 2)
        self.layer4 = self._make_layer(512, layers[3], 2)
        ### Top layer: a 1x1 convolution that reduces C5's channels to obtain P5
        self.toplayer = nn.Conv2d(2048, 256, 1, 1, 0) 
        ### 3x3 "smooth" convolutions that fuse P2-P4 and remove the aliasing introduced by upsampling
        self.smooth1 = nn.Conv2d(256, 256, 3, 1, 1)
        self.smooth2 = nn.Conv2d(256, 256, 3, 1, 1)
        self.smooth3 = nn.Conv2d(256, 256, 3, 1, 1)
        ### Lateral connections: 1x1 convolutions that bring C2-C4 to 256 channels
        self.latlayer1 = nn.Conv2d(1024, 256, 1, 1, 0)
        self.latlayer2 = nn.Conv2d( 512, 256, 1, 1, 0)
        self.latlayer3 = nn.Conv2d( 256, 256, 1, 1, 0)
    ## Build one ResNet stage (C2-C5); note the stride: layer1 (C2) uses stride 1, so it adds no extra downsampling
    def _make_layer(self, planes, blocks, stride=1):
        downsample  = None
        if stride != 1 or self.inplanes != Bottleneck.expansion * planes:
            downsample  = nn.Sequential(
                nn.Conv2d(self.inplanes, Bottleneck.expansion * planes, 1, stride, bias=False),
                nn.BatchNorm2d(Bottleneck.expansion * planes)
            )
        ### The first block of a stage handles the stride and channel change; the rest keep them
        layers = []
        layers.append(Bottleneck(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * Bottleneck.expansion
        for i in range(1, blocks):
            layers.append(Bottleneck(self.inplanes, planes))
        return nn.Sequential(*layers)
    ### Top-down step: upsample x to y's spatial size, then add element-wise
    def _upsample_add(self, x, y):
        _, _, H, W = y.shape
        ### nearest-neighbor upsampling, as described above; F.upsample is deprecated in favor of F.interpolate
        return F.interpolate(x, size=(H, W), mode='nearest') + y

    def forward(self, x):
        ###Bottom up
        c1 = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        ###Top down
        p5 = self.toplayer(c5)
        p4 = self._upsample_add(p5, self.latlayer1(c4))
        p3 = self._upsample_add(p4, self.latlayer2(c3))
        p2 = self._upsample_add(p3, self.latlayer3(c2))
        ###Convolution, fusion, smoothing
        p4 = self.smooth1(p4)
        p3 = self.smooth2(p3)
        p2 = self.smooth3(p2)
        return p2, p3, p4, p5

If this helped, give it a like before you go!

 

Tags: Python AI Pytorch Deep Learning Object Detection

Posted on Tue, 30 Nov 2021 11:14:55 -0500 by monloi