To enhance semantics, traditional object detection models usually run subsequent operations only on the last feature map of a deep convolutional network. The downsampling rate (the factor by which the image is reduced) at that layer is usually large, such as 16 or 32, so little effective information about small objects survives on the feature map and small-object detection performance drops sharply. This is known as the multi-scale problem. The key to solving it is how to extract multi-scale features. A traditional approach is the image pyramid: resize the input image to multiple scales, and let each scale produce features at that scale. This method is simple and effective and is widely used in competitions such as COCO, but it is very time-consuming and computationally expensive. From the earlier discussion of backbone networks, we know that different layers of a convolutional neural network differ in spatial size and semantic content, which itself resembles a pyramid structure. FPN (Feature Pyramid Network), proposed in 2017, fuses the features of different layers and substantially alleviates the multi-scale detection problem.
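The effect of a large downsampling rate can be seen with simple arithmetic. Taking a hypothetical 24-pixel-wide object as an example, its footprint on the feature map shrinks with the stride:

```python
# How many feature-map cells a 24x24-pixel object spans at each stride.
# At stride 32 the object covers less than one cell, so almost no
# information about it survives in the final feature map.
obj = 24  # object side length in input pixels (illustrative value)
footprint = {stride: obj / stride for stride in (4, 8, 16, 32)}
print(footprint)  # {4: 6.0, 8: 3.0, 16: 1.5, 32: 0.75}
```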
FPN network structure:
The overall architecture of FPN, shown in the figure above, consists of four parts: a bottom-up network, a top-down pathway, lateral connections, and convolutional fusion.
The leftmost part is an ordinary convolutional network; by default a ResNet is used to extract semantic information. C1 denotes ResNet's first few convolution and pooling layers, while C2 to C5 are the ResNet convolution stages, each containing multiple Bottleneck blocks. Feature maps within a stage share the same size, and the size shrinks from one stage to the next.
First, a 1×1 convolution is applied to C5 to reduce the number of channels, yielding P5. P5 is then upsampled in turn to obtain P4, P3 and P2, matching the height and width of C4, C3 and C2 so that element-wise addition is possible in the next step. 2× nearest-neighbour upsampling is used here: each element is copied directly to its neighbours, rather than being produced by linear interpolation.
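Nearest-neighbour upsampling can be illustrated on a tiny tensor (a sketch using PyTorch's `F.interpolate`; the 2×2 input values are arbitrary):

```python
import torch
import torch.nn.functional as F

# A single-channel 2x2 feature map
x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])

# 2x nearest-neighbour upsampling: each element is copied into a
# 2x2 block; no interpolation between neighbouring values.
y = F.interpolate(x, scale_factor=2, mode='nearest')
print(y[0, 0])
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])
```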
The purpose is to fuse the upsampled high-level semantic features with the shallow features that carry localization detail. After upsampling, the high-level semantic features have the same height and width as the corresponding shallow features, and their channel count is fixed at 256. The bottom-up features C2 to C4 therefore each pass through a 1×1 convolution to bring their channel count to 256, and are then added element-wise to obtain P4, P3 and P2. Because C1's feature map is large and its semantic information is weak, C1 is not given a lateral connection.
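One lateral connection can be sketched in isolation. Assuming hypothetical shapes where C4 has 1024 channels at 14×14 and P5 has 256 channels at 7×7:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c4 = torch.randn(1, 1024, 14, 14)  # shallow feature with more channels
p5 = torch.randn(1, 256, 7, 7)     # deeper feature, already at 256 channels

# 1x1 convolution aligns C4's channel count to 256, so the upsampled
# P5 can be added to it element-wise.
lat = nn.Conv2d(1024, 256, kernel_size=1)
p4 = F.interpolate(p5, scale_factor=2, mode='nearest') + lat(c4)
print(p4.shape)  # torch.Size([1, 256, 14, 14])
```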
After the addition, a 3×3 convolution is applied to each of the resulting P2 to P4 maps, in order to eliminate the aliasing effect introduced by the upsampling process and generate the final feature maps.
How to select a feature map:
In an actual object detection algorithm, RoIs (Regions of Interest) must be pooled from a feature map, but FPN outputs four feature maps, so choosing which map to use for each RoI is itself a problem. FPN's solution is to use different feature maps for RoIs of different sizes: large RoIs are extracted from deep feature maps such as P5, and small RoIs from shallow feature maps such as P2.
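The FPN paper makes this assignment concrete with the rule k = ⌊k0 + log2(√(wh)/224)⌋, clamped to [2, 5], where 224 is the canonical ImageNet pretraining size and k0 = 4. A minimal sketch of this rule (function name and defaults chosen here for illustration):

```python
import math

def roi_to_fpn_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    """Level assignment from the FPN paper: k = floor(k0 + log2(sqrt(w*h)/224)),
    clamped to the available levels P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, k_min), k_max)

print(roi_to_fpn_level(224, 224))  # 4 -> P4 (canonical-sized RoI)
print(roi_to_fpn_level(32, 32))    # 2 -> P2 (small RoI, shallow high-res map)
print(roi_to_fpn_level(448, 448))  # 5 -> P5 (large RoI, deep map)
```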
FPN propagates deep semantic information down to the shallow layers, supplementing their semantics and yielding features that are both high-resolution and semantically strong. It performs very well in small-object detection, instance segmentation, and related fields.
import torch.nn as nn
import torch.nn.functional as F

# Basic building block ("brick") of ResNet
class Bottleneck(nn.Module):
    expansion = 4  # channel multiplier

    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_planes, planes, 1, bias=False),
            nn.BatchNorm2d(planes),
            nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, 3, stride, 1, bias=False),
            nn.BatchNorm2d(planes),
            nn.ReLU(inplace=True),
            nn.Conv2d(planes, self.expansion * planes, 1, bias=False),
            nn.BatchNorm2d(self.expansion * planes),
        )
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.bottleneck(x)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out

# FPN class
class FPN(nn.Module):
    def __init__(self, layers):
        super(FPN, self).__init__()
        self.inplanes = 64
        # C1: the stem that processes the input (see the figure above)
        self.conv1 = nn.Conv2d(3, 64, 7, 2, 3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(3, 2, 1)
        # Build C2, C3, C4 and C5 bottom-up; `layers` lists the number of
        # Bottlenecks per stage, e.g. [3, 4, 6, 3] for ResNet-50
        self.layer1 = self._make_layer(64, layers[0])
        self.layer2 = self._make_layer(128, layers[1], 2)
        self.layer3 = self._make_layer(256, layers[2], 2)
        self.layer4 = self._make_layer(512, layers[3], 2)
        # Top layer: reduce C5's channel count to obtain P5
        self.toplayer = nn.Conv2d(2048, 256, 1, 1, 0)
        # 3x3 convolution fusion: eliminates the aliasing caused by
        # upsampling and produces the final feature maps
        self.smooth1 = nn.Conv2d(256, 256, 3, 1, 1)
        self.smooth2 = nn.Conv2d(256, 256, 3, 1, 1)
        self.smooth3 = nn.Conv2d(256, 256, 3, 1, 1)
        # Lateral connections: 1x1 convolutions that align channel counts
        self.latlayer1 = nn.Conv2d(1024, 256, 1, 1, 0)
        self.latlayer2 = nn.Conv2d(512, 256, 1, 1, 0)
        self.latlayer3 = nn.Conv2d(256, 256, 1, 1, 0)

    # Build one of the C2-C5 stages. Note the stride: C2 (stride 1) does
    # not downsample, while C3-C5 (stride 2) do.
    def _make_layer(self, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != Bottleneck.expansion * planes:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, Bottleneck.expansion * planes, 1, stride, bias=False),
                nn.BatchNorm2d(Bottleneck.expansion * planes)
            )
        layers = []
        layers.append(Bottleneck(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * Bottleneck.expansion
        for i in range(1, blocks):
            layers.append(Bottleneck(self.inplanes, planes))
        return nn.Sequential(*layers)

    # Top-down upsampling plus lateral addition
    def _upsample_add(self, x, y):
        _, _, H, W = y.shape
        return F.interpolate(x, size=(H, W), mode='nearest') + y

    def forward(self, x):
        # Bottom-up
        c1 = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        # Top-down
        p5 = self.toplayer(c5)
        p4 = self._upsample_add(p5, self.latlayer1(c4))
        p3 = self._upsample_add(p4, self.latlayer2(c3))
        p2 = self._upsample_add(p3, self.latlayer3(c2))
        # 3x3 convolution fusion (smoothing)
        p4 = self.smooth1(p4)
        p3 = self.smooth2(p3)
        p2 = self.smooth3(p2)
        return p2, p3, p4, p5
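As a quick sanity check on the module above, the expected output shapes can be computed from the stage strides alone (a sketch assuming a hypothetical 800×800 input, which divides evenly by 32):

```python
# Strides of C2..C5 in the ResNet backbone are 4, 8, 16, 32; each P level
# keeps the spatial size of its C level and has 256 channels.
H = W = 800  # hypothetical input size
sizes = {name: (256, H // s, W // s)
         for name, s in [("p2", 4), ("p3", 8), ("p4", 16), ("p5", 32)]}
print(sizes)
# {'p2': (256, 200, 200), 'p3': (256, 100, 100),
#  'p4': (256, 50, 50), 'p5': (256, 25, 25)}
```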
Give it a like before you go!!