# RFBnet

In the past I mostly studied from CSDN and Zhihu posts, but I still felt that I did not understand many details well, so I recently decided to start reading the original papers and to record my thoughts and code analysis.

There may be some typos or wrong conclusions; please bear with me.

### Abstract

The top object detection networks at the time (2018) rely on deep convolutional neural networks as feature extractors (backbones), such as ResNet-101 and Inception. These backbones extract strong features, but at a high computational cost. In contrast, some lightweight models achieve real-time detection, but their accuracy has been criticized.

In this paper, the author explores an alternative way to build a detector: strengthening the features of these lightweight models with a hand-crafted mechanism. The inspiration comes from the structure of receptive fields (RFs) in the human visual system (the region of the visual field that the eye attends to around a target). Following this bionic principle, the author proposes the RFB module, which takes the relationship between the size and eccentricity of RFs into account in order to improve the discriminability and robustness of the extracted features.

The author integrates this module with the prediction layers of the SSD network to form the RFBNet detector. To evaluate the proposed network, experiments are run on two major benchmarks. The results show that RFBNet achieves excellent detection accuracy while keeping real-time speed.

### Introduction

In recent years (as of 2018), object detection has been led by the three brothers R-CNN, Fast R-CNN and Faster R-CNN, evaluated on the three datasets Pascal VOC, MS COCO and ILSVRC. They are all two-stage detectors: roughly, the first stage separates foreground from background, and the second stage performs classification and box regression. The author then mentions some recent achievements of two-stage detectors.

Suddenly the tone changes: the author criticizes these methods because their features come from very deep networks, so the computational overhead is large and the speed is poor. Then it is time for the author to promote his own work.

The author first introduces the speed advantage of one-stage detectors, which comes at the cost of accuracy: reportedly a drop of about 10% to 40% relative to the state-of-the-art (SOTA) two-stage detectors. I accept that, but I'm still not entirely convinced, because I work on one-stage detectors myself (doge).

The author goes on to say that two recent one-stage detectors, DSSD and RetinaNet, have been very successful and have largely closed the accuracy gap. Unfortunately, both of them use ResNet-101, which is too slow.

So what now? The author proposes using a lightweight backbone together with the RFB module to extract features, instead of such a deep backbone. Moreover, the author says that the RFB module does not restrict the overall network architecture, so I guess it should have no side effects.

Then the author lists several contributions:

- The RFB module is proposed.
- When the RFB module is placed before the detection head, the added computational overhead is negligible in the author's view.
- Experiments with MobileNet verify that RFB is also effective for lightweight backbones.

### Related Work

- Inception
- ASPP
- Deformable Conv

The author compares the receptive fields of these structures with those of the RFB module to show the advantages of the RFB module:

- Inception uses convolution kernels of different sizes to obtain different receptive fields, so the figure shows more colors.
- ASPP uses the same kernel size but different dilation rates, so the colored area in the figure is larger.
- Deformable Conv shifts the sampling points of the convolution kernel to obtain different receptive fields.

Therefore, the author combines Inception with ASPP and proposes the RFB module.
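To make the dilation-rate point concrete, here is a minimal sketch of how dilation enlarges the area covered by a kernel without adding weights. The helper name `effective_kernel_size` is my own and only for illustration; the rates 1, 3 and 5 are the ones that appear in the `BasicRFB_a` code in this post.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in input pixels) covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# A 3x3 kernel at the dilation rates used in BasicRFB_a
for rate in (1, 3, 5):
    k = effective_kernel_size(3, rate)
    print(f"3x3 kernel, dilation {rate}: covers {k}x{k} pixels")
```

The number of weights stays at 9 in all three cases; only the spacing between the sampled pixels changes.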

### Introduction to the RFB module

The author divides the RFB module into two parts: multi-branch convolution and dilated (atrous) convolution.

The multi-branch convolution mainly borrows the idea of GoogLeNet (Inception).

The dilated convolution settings are mainly based on the idea of ASPP.

The author proposes two variants of the RFB module, applied at different locations in the detection network.

The explanatory diagram of the RFB module shows that the feature map produced by each branch is different. The author marks them with blue, green and red circles, whose sizes and coverage areas differ. Overlaid on one map, they cover the whole region. This is why the author finally concatenates the feature maps of the branches along the channel dimension.
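As a rough check of why the branches cover areas of different sizes, the sketch below computes the receptive field of each branch from the kernel sizes and dilation rates in the `BasicRFB` code in this post, assuming `visual=1` and `stride=1`. The function `receptive_field` is a hypothetical helper written for this note, not part of the original code.

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolution layers.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer widens
    the receptive field by (kernel_size - 1) * dilation pixels.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# (kernel_size, dilation) per layer, read off BasicRFB with visual=1
branches = {
    "branch0": [(1, 1), (3, 1)],
    "branch1": [(1, 1), (3, 1), (3, 2)],
    "branch2": [(1, 1), (3, 1), (3, 1), (3, 3)],
}
for name, layers in branches.items():
    rf = receptive_field(layers)
    print(f"{name}: receptive field {rf}x{rf}")
```

Under these assumptions the three branches see 3x3, 7x7 and 11x11 regions of the module input, which fits the small, medium and large circles in the diagram.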

After knowing what the module looks like, we need to know where to put it, or where it can be used.

Considering that the shallow layers lack receptive field, the author mainly applies the module to the predictions on the 38×38 and 19×19 feature maps.

Later layers also use dilated convolution.
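For reference, here is a small sketch of where the 38 and 19 come from. It assumes a 300×300 input and a VGG-style backbone with 2×2, stride-2 max pooling in which the third pooling rounds up, as in common SSD300 implementations; none of this is stated in the text above, and `pool_output_size` is a name made up for this example.

```python
import math


def pool_output_size(size: int, kernel: int = 2, stride: int = 2,
                     ceil_mode: bool = False) -> int:
    """Output side length of a max-pooling layer without padding."""
    steps = (size - kernel) / stride
    return (math.ceil(steps) if ceil_mode else math.floor(steps)) + 1


size = 300  # assumed input resolution (SSD300)
sizes = []
# Assumption: four poolings, the third one rounds up (ceil mode)
for ceil_mode in (False, False, True, False):
    size = pool_output_size(size, ceil_mode=ceil_mode)
    sizes.append(size)
print(sizes)  # [150, 75, 38, 19]
```

The last two values are the 38×38 and 19×19 feature maps that the RFB modules are attached to.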

### Experiments

- Ablation experiments
- Pascal VOC
- COCO

### Code

```python
import torch
import torch.nn as nn

# NOTE: BasicConv is not shown in this post. It is assumed to be a
# Conv2d + BatchNorm + optional ReLU wrapper that accepts a `relu` flag.


class BasicRFB(nn.Module):
    def __init__(self, in_planes, out_planes, stride=1, scale=0.1, visual=1):
        super(BasicRFB, self).__init__()
        self.scale = scale
        self.out_channels = out_planes
        inter_planes = in_planes // 8
        # Branch 0: 1x1 conv, then 3x3 conv with dilation = visual
        self.branch0 = nn.Sequential(
            BasicConv(in_planes, 2*inter_planes, kernel_size=1, stride=stride),
            BasicConv(2*inter_planes, 2*inter_planes, kernel_size=3, stride=1, padding=visual, dilation=visual, relu=False)
        )
        # Branch 1: 1x1 conv, 3x3 conv, then 3x3 conv with dilation = visual + 1
        self.branch1 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1),
            BasicConv(inter_planes, 2*inter_planes, kernel_size=(3,3), stride=stride, padding=(1,1)),
            BasicConv(2*inter_planes, 2*inter_planes, kernel_size=3, stride=1, padding=visual+1, dilation=visual+1, relu=False)
        )
        # Branch 2: 1x1 conv, two 3x3 convs, then 3x3 conv with dilation = 2*visual + 1
        self.branch2 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1),
            BasicConv(inter_planes, (inter_planes//2)*3, kernel_size=3, stride=1, padding=1),
            BasicConv((inter_planes//2)*3, 2*inter_planes, kernel_size=3, stride=stride, padding=1),
            BasicConv(2*inter_planes, 2*inter_planes, kernel_size=3, stride=1, padding=2*visual+1, dilation=2*visual+1, relu=False)
        )
        # 1x1 conv that fuses the concatenated branches (3 branches x 2*inter_planes channels)
        self.ConvLinear = BasicConv(6*inter_planes, out_planes, kernel_size=1, stride=1, relu=False)
        self.shortcut = BasicConv(in_planes, out_planes, kernel_size=1, stride=stride, relu=False)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        # Stack the branch outputs along the channel dimension
        out = torch.cat((x0, x1, x2), 1)
        out = self.ConvLinear(out)
        short = self.shortcut(x)
        # Scaled residual connection, followed by ReLU
        out = out*self.scale + short
        out = self.relu(out)
        return out


class BasicRFB_a(nn.Module):
    # As written, branch0 and branch1 ignore `stride`, so the branch outputs
    # only have matching sizes for stride=1 (the default).
    def __init__(self, in_planes, out_planes, stride=1, scale=0.1):
        super(BasicRFB_a, self).__init__()
        self.scale = scale
        self.out_channels = out_planes
        inter_planes = in_planes // 4
        # Branch 0: 1x1 conv, then plain 3x3 conv
        self.branch0 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1),
            BasicConv(inter_planes, inter_planes, kernel_size=3, stride=1, padding=1, relu=False)
        )
        # Branch 1: 1x1 conv, 3x1 conv, then 3x3 conv with dilation 3
        self.branch1 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1),
            BasicConv(inter_planes, inter_planes, kernel_size=(3,1), stride=1, padding=(1,0)),
            BasicConv(inter_planes, inter_planes, kernel_size=3, stride=1, padding=3, dilation=3, relu=False)
        )
        # Branch 2: 1x1 conv, 1x3 conv, then 3x3 conv with dilation 3
        self.branch2 = nn.Sequential(
            BasicConv(in_planes, inter_planes, kernel_size=1, stride=1),
            BasicConv(inter_planes, inter_planes, kernel_size=(1,3), stride=stride, padding=(0,1)),
            BasicConv(inter_planes, inter_planes, kernel_size=3, stride=1, padding=3, dilation=3, relu=False)
        )
        # Branch 3: 1x1 conv, 1x3 conv, 3x1 conv, then 3x3 conv with dilation 5
        self.branch3 = nn.Sequential(
            BasicConv(in_planes, inter_planes//2, kernel_size=1, stride=1),
            BasicConv(inter_planes//2, (inter_planes//4)*3, kernel_size=(1,3), stride=1, padding=(0,1)),
            BasicConv((inter_planes//4)*3, inter_planes, kernel_size=(3,1), stride=stride, padding=(1,0)),
            BasicConv(inter_planes, inter_planes, kernel_size=3, stride=1, padding=5, dilation=5, relu=False)
        )
        # 1x1 conv that fuses the concatenated branches (4 branches x inter_planes channels)
        self.ConvLinear = BasicConv(4*inter_planes, out_planes, kernel_size=1, stride=1, relu=False)
        self.shortcut = BasicConv(in_planes, out_planes, kernel_size=1, stride=stride, relu=False)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        x3 = self.branch3(x)
        # Stack the branch outputs along the channel dimension
        out = torch.cat((x0, x1, x2, x3), 1)
        out = self.ConvLinear(out)
        short = self.shortcut(x)
        # Scaled residual connection, followed by ReLU
        out = out*self.scale + short
        out = self.relu(out)
        return out
```
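One detail in the code that is easy to miss: every dilated 3×3 convolution sets `padding` equal to `dilation`, which keeps the spatial size unchanged, so the branch outputs can be concatenated. The sketch below checks this with the output-size formula from the `nn.Conv2d` documentation. The function name `conv_output_size` and the example sizes 19 and 38 are my own choices for illustration.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output side length of a 2D convolution (formula from the nn.Conv2d docs)."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


size = 19  # example feature-map side length, chosen for illustration

# Final dilated 3x3 conv of each BasicRFB branch: padding == dilation
for visual in (1, 2):
    for rate in (visual, visual + 1, 2 * visual + 1):
        out = conv_output_size(size, kernel_size=3, padding=rate, dilation=rate)
        print(f"visual={visual}, dilation={rate}: {size} -> {out}")

# With stride=2 in BasicRFB, the strided 1x1 conv (branch0) and the strided
# 3x3 conv with padding=1 (branch1, branch2) shrink the map to the same size
print(conv_output_size(38, kernel_size=1, stride=2))
print(conv_output_size(38, kernel_size=3, stride=2, padding=1))
```

For a 3×3 kernel the padding adds `2 * dilation` pixels and the dilated kernel removes `2 * dilation` pixels, so the two terms cancel for any dilation rate.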

Liu, Songtao, and Di Huang. "Receptive field block net for accurate and fast object detection." Proceedings of the European Conference on Computer Vision (ECCV). 2018.