Reference code: YOLOF
1. General
Introduction: Previous detection networks commonly use an FPN to fuse features at different scales and to detect targets at different scales (targets of different sizes are assigned to FPN feature maps with different strides). This is essentially a divide-and-conquer strategy. However, such a design brings extra computational overhead, since every level of the FPN pyramid has to be computed. The idea of this paper is to ask whether the FPN can be simplified so that detection time is greatly reduced. The paper removes the FPN and uses only the single C5 feature scale to detect targets at all scales. For the receptive field problem, dilated convolution blocks with shortcut connections are used to enlarge the receptive field while keeping adaptability to a wide range of target sizes. Once the receptive field problem is solved, the remaining issue is the positive-sample matching problem caused by predicting at only one scale, for which a new matching strategy is proposed. As a result, the detector is greatly simplified in the encoder stage, improving speed while maintaining performance (the baseline is RetinaNet).
As mentioned above, the two more important functions of the FPN are multi-scale feature aggregation and the hierarchical (divide-and-conquer) assignment of targets to levels. So which of these two functions has the greater impact on final performance? The paper tests several FPN variants and obtains the following figure:
Analyzing the figure above, we can conclude:
- 1) In figure b, only the C5 feature map is used as input, yet multiple levels of features are still output, and the performance is similar to that of figure a;
- 2) Comparing figures c and d, the performance gap between multi-scale input / single-scale output and single-scale input / single-scale output is small; that is, the impact of multi-scale feature aggregation is relatively minor.
Combining these two conclusions, multi-scale prediction, i.e. the divide-and-conquer strategy mentioned in the paper, is what dominates FPN performance. Moreover, the C5 feature map already carries enough information to classify and regress targets at multiple scales.
The receptive field problem and the positive-sample imbalance problem are then solved following the two ideas described above, and the performance of the resulting network is compared with other detectors, as shown in the following figure:
2. Ideas for solving problems
2.1 Receptive field problem
The receptive field determines the range of target scales the network can describe. When only the C5 feature is used, its receptive field (figure a below) cannot cover larger targets. A common way to enlarge the receptive field is to cascade dilated convolutions, which yields the receptive field shown in figure b below.
Since adding dilated convolutions does enlarge the receptive field, how can adaptability to small targets be maintained? The natural idea is a shortcut connection, which keeps the large receptive field while retaining adaptability to small targets. The dilated convolution block mentioned above is therefore redesigned, with the structure shown in the following figure:
After this improvement, the receptive field covers a wide range of scales, as shown in figure c.
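A minimal sketch of such a residual dilated block is shown below. The class name, channel sizes, and dilation values are illustrative assumptions, not the exact repository configuration; the point is the combination of a dilated 3x3 convolution (large receptive field) with a shortcut (preserving small-receptive-field features):

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Bottleneck block with a dilated 3x3 conv and a shortcut connection (sketch)."""
    def __init__(self, channels: int = 512, mid_channels: int = 128, dilation: int = 2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid_channels, kernel_size=1),        # reduce channels
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                      padding=dilation, dilation=dilation),          # enlarge receptive field
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, kernel_size=1),        # restore channels
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shortcut keeps small-receptive-field features alongside the
        # dilated branch, so both small and large targets remain covered.
        return x + self.block(x)

# Stacking blocks with increasing dilation spans a range of receptive fields,
# e.g. dilations (2, 4, 6, 8) as a plausible configuration:
encoder = nn.Sequential(*[DilatedResidualBlock(dilation=d) for d in (2, 4, 6, 8)])
out = encoder(torch.randn(1, 512, 32, 32))  # shape preserved: [1, 512, 32, 32]
```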
2.2 Imbalance of positive samples
Traditional detection algorithms usually select positive samples by the Max-IoU rule. This causes few problems with multi-scale prediction, but problems appear when it is applied to the single-scale setting of this paper, as shown in the figure below:
This is because Max-IoU selection naturally biases toward large targets, so medium and small targets receive too small a share of positive samples, which hurts the learning of the detector. The paper therefore adopts Uniform Matching: for each GT box, an equal number of best-matching prediction boxes and anchors are selected as positive samples (the anchor scales here are set to [32, 64, 128, 256, 512]). For the implementation, refer to:
```python
# playground/detection/coco/yolof/yolof_base/uniform_matcher.py#L40
# Calculate the matching degree (L1 distance) between prediction boxes and GT
cost_bbox = torch.cdist(
    box_xyxy_to_cxcywh(out_bbox),
    box_xyxy_to_cxcywh(tgt_bbox),
    p=1)
# Calculate the matching degree (L1 distance) between anchors and GT
cost_bbox_anchors = torch.cdist(
    box_xyxy_to_cxcywh(anchors),
    box_xyxy_to_cxcywh(tgt_bbox),
    p=1)

# Final cost matrices
C = cost_bbox
C = C.view(bs, num_queries, -1).cpu()
C1 = cost_bbox_anchors
C1 = C1.view(bs, num_queries, -1).cpu()

sizes = [len(v.gt_boxes.tensor) for v in targets]
all_indices_list = [[] for _ in range(bs)]

# positive indices when matching predict boxes and gt boxes:
# uniformly select the same number of prediction boxes for every GT
indices = [
    tuple(
        torch.topk(
            c[i],
            k=self.match_times,
            dim=0,
            largest=False)[1].numpy().tolist()
    )
    for i, c in enumerate(C.split(sizes, -1))
]
# positive indices when matching anchor boxes and gt boxes:
# select the same number of anchors for every GT
indices1 = [
    tuple(
        torch.topk(
            c[i],
            k=self.match_times,
            dim=0,
            largest=False)[1].numpy().tolist())
    for i, c in enumerate(C1.split(sizes, -1))
]
```
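As a standalone illustration of the same top-k selection (hypothetical helper `uniform_topk_match` and dummy data, not code from the repository), every GT box receives the same number of nearest candidates regardless of its size:

```python
import torch

def uniform_topk_match(anchors_cxcywh: torch.Tensor,
                       gts_cxcywh: torch.Tensor,
                       k: int = 4) -> torch.Tensor:
    # cost[i, j] = L1 distance between anchor i and GT j in (cx, cy, w, h) space
    cost = torch.cdist(anchors_cxcywh, gts_cxcywh, p=1)        # [num_anchors, num_gt]
    # indices of the k lowest-cost anchors for every GT column
    _, topk_idx = torch.topk(cost, k=k, dim=0, largest=False)  # [k, num_gt]
    return topk_idx

anchors = torch.rand(100, 4) * 100  # dummy anchors, (cx, cy, w, h)
gts = torch.rand(5, 4) * 100        # dummy GT boxes, (cx, cy, w, h)
print(uniform_topk_match(anchors, gts, k=4).shape)  # torch.Size([4, 5])
```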
The uniform matching mechanism here can be compared with ATSS: in ATSS, for each GT box the k best-matching anchors are selected on each of the L feature levels, and a dynamic IoU threshold is then applied to the resulting L*k candidate anchors to pick positives, so that work is mainly about adapting the positive/negative split. The method of this paper instead focuses on balancing positive samples across different target scales.
Another contribution of this paper is the improved prediction module (the decoder in the figure below), which fuses an objectness term derived from the bounding-box regression features with the classification scores:
For the corresponding implementation, refer to:
```python
# playground/detection/coco/yolof/yolof_base/decoder.py#L93
def forward(self, feature: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    cls_score = self.cls_score(self.cls_subnet(feature))  # [N,A*C,H,W]
    N, _, H, W = cls_score.shape
    cls_score = cls_score.view(N, -1, self.num_classes, H, W)  # [N,A,C,H,W]

    reg_feat = self.bbox_subnet(feature)
    bbox_reg = self.bbox_pred(reg_feat)        # [N,A*4,H,W]
    objectness = self.object_pred(reg_feat)    # [N,A,H,W]

    # implicit objectness
    objectness = objectness.view(N, -1, 1, H, W)  # [N,A,1,H,W]
    normalized_cls_score = cls_score + objectness - torch.log(
        1. + torch.clamp(cls_score.exp(), max=self.INF) + torch.clamp(
            objectness.exp(), max=self.INF))  # [N,A,C,H,W]
    normalized_cls_score = normalized_cls_score.view(N, -1, H, W)  # [N,A*C,H,W]
    return normalized_cls_score, bbox_reg
```
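The `normalized_cls_score` computation above fuses the classification and objectness logits in log space. The small check below (dummy tensors, with the clamp omitted for brevity) shows the interpretation: the sigmoid of the fused logit equals the product of the two individual sigmoid probabilities, i.e. the implicit objectness is multiplied into the class probability in a numerically stable way:

```python
import torch

cls_logit = torch.randn(2, 3)  # dummy class logits
obj_logit = torch.randn(2, 1)  # dummy objectness logits (broadcast over classes)

# Fused logit, as in the decoder above (clamp omitted for clarity)
fused = cls_logit + obj_logit - torch.log(1. + cls_logit.exp() + obj_logit.exp())

# Its sigmoid equals the product of the two separate sigmoid probabilities
assert torch.allclose(torch.sigmoid(fused),
                      torch.sigmoid(cls_logit) * torch.sigmoid(obj_logit),
                      atol=1e-6)
```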
2.3 Ablation Experiment
Overall ablation results:
Detailed ablation experiments of internal specific modules: