Object recognition and location algorithm based on deep neural network

YOLO (You Only Look Once) is an object recognition and localization algorithm based on a deep neural network. Its biggest feature is speed: it runs fast enough to be used in real-time systems. YOLO has now reached version v3, but each new version is an incremental improvement on the original, so this article starts by analyzing YOLO v1.

The task: given an input picture, output the objects it contains and the position of each object (a rectangular box containing the object).


Object recognition and localization can be regarded as two tasks: finding a region of the picture that contains an object, and then identifying which object is in that region. For object recognition alone (a picture contains only one object, which occupies most of the picture), various methods based on convolutional neural networks (CNNs) have already achieved good results. So the main problem left to solve is where the object is.

The simplest idea is to traverse all possible positions in the picture: exhaustively search regions of different sizes, aspect ratios and positions, check each one for an object, and select the result with the greatest probability as the output. Obviously, this method is too inefficient.


RCNN pioneered the use of candidate areas (Region Proposals). First, about 2000 candidate areas that may contain objects are extracted from the picture with Selective Search, and then object recognition is carried out on each candidate area. This greatly improves the efficiency of object recognition and localization. However, RCNN is still very slow: it takes about 49 seconds to process a picture. The follow-up Fast RCNN and Faster RCNN keep improving RCNN's neural network structure and its candidate area algorithm, and Faster RCNN can already reach a processing speed of about 0.2 seconds per picture.

YOLO creatively merges the two stages of candidate area selection and object recognition into one: a single look at the picture (no second look needed) tells you which objects are there and where they are. In fact, YOLO does not really remove candidate areas; it uses predefined ones (to be exact, they are prediction areas, not the Anchors used by Faster RCNN). The picture is divided into 7 * 7 = 49 grids, and each grid is allowed to predict 2 bounding boxes (rectangular boxes containing an object), for a total of 49 * 2 = 98 bounding boxes. These can be understood as 98 candidate areas that roughly cover the whole picture.

RCNN: Let's study the picture first. Well, there may be some objects at these locations. Check these locations to see which objects are in them.
YOLO: We roughly divide the picture into 98 areas. For each area, we look at whether it contains an object and where exactly the object is.
RCNN: Is it really OK to be so simple and crude?
YOLO: Of course not... Well, there is actually a small problem. The accuracy is a little lower, but I'm very fast! Fast! Fast!
RCNN: How can you start from such rough candidate areas and still end up with a good bounding box?
YOLO: Haven't you been using border regression yourself? Why shouldn't I use it too?
Although RCNN finds candidate areas, they are only candidates after all. Once the object has actually been identified, the candidate area still has to be fine-tuned to bring it closer to the real bounding box. This process is border regression: adjusting the bounding box of the candidate area to be closer to the real bounding box. Since the box has to be adjusted in the end anyway, why bother finding candidate areas first? Any rough area will do, and that is what YOLO does.

Let's look at the implementation scheme of YOLO.

1) Structure

After removing the candidate area stage, the structure of YOLO is very simple: plain convolution and pooling, followed by two fully connected layers. The biggest difference is that the final output layer uses a linear function as its activation function, because it has to predict the position of the bounding box (a numerical value), not just the probability of an object. The YOLO network consists of 24 convolution layers and 2 fully connected layers. The network input is 448 x 448 (416 x 416 in v2), so pictures are resized before entering the network. The output of the network is a tensor with dimensions:

S * S * (B * 5 + C)

Here S is the number of grids along each side, B is the number of bounding boxes each grid is responsible for, and C is the number of categories. Each grid corresponds to B bounding boxes whose width and height may range over the full picture, meaning that the bounding box of an object is searched for with the grid as its center. Each bounding box has a score that represents whether it contains an object and how accurate its position is: Pr(Object) * IOU_{pred}^{truth}. Each grid also has C probability values; the category with the maximum probability P(Class|Object) is taken, and the grid is considered to contain that object or a part of it.

2) Mapping between input and output

3) Input

The input is the original image, and the only requirement is that it be scaled to 448*448. This is mainly because the convolution layers of the YOLO network are followed by two fully connected layers. A fully connected layer requires a fixed-size vector as input, so working backwards, the original image must have a fixed size too. The size YOLO is designed for is 448*448.

4) The output is a 7*7*30 tensor.

The input image is divided into a 7*7 grid, and the 7*7 in the output tensor corresponds to the 7*7 grid of the input image. In other words, we can regard the 7*7*30 tensor as 49 30-dimensional vectors: each grid in the input image corresponds to one 30-dimensional output vector. Referring to the figure above, for example, the grid in the upper left corner of the input image corresponds to the vector in the upper left corner of the output tensor.

Note that it is not only the information inside a grid that is mapped to its 30-dimensional vector. As the neural network extracts and transforms the input image, information around the grid is also recognized and organized, and finally encoded into the 30-dimensional vector. So what exactly does the 30-dimensional vector of each grid contain?

① Probabilities of 20 object classes

YOLO supports the recognition of 20 different objects (people, birds, cats, cars, chairs, etc.), so there are 20 values here giving the probability that each kind of object is present at the grid position. They can be written as P(C_1|Object), ......, P(C_i|Object), ......, P(C_{20}|Object). They are written as conditional probabilities because they mean: if there is an Object in the grid, the probability that it is C_i is P(C_i|Object).

② Locations of 2 bounding boxes

Each bounding box needs four values to represent its position: (Center_x, Center_y, width, height), namely the x coordinate and y coordinate of the bounding box center point, plus its width and height. The 2 bounding boxes need 8 values in total. The center coordinates x, y are expressed as offsets relative to the corresponding grid and normalized to between 0 and 1; w, h are normalized to between 0 and 1 by the image width and height.
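The normalization described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_box`, the pixel-space input format and the 0-based (row, col) grid index are assumptions made for the example, not part of the YOLO source code.

```python
def encode_box(cx, cy, w, h, img_w=448, img_h=448, S=7):
    """Convert a pixel-space box (center x, center y, width, height)
    into (row, col, (x, y, w, h)) in the normalized form described above."""
    gx = cx / img_w * S              # center position measured in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)        # grid that contains the center point
    row = min(int(gy), S - 1)
    # x, y: offset inside the grid, in [0, 1); w, h: fraction of the image size
    return row, col, (gx - col, gy - row, w / img_w, h / img_h)


# A box centered in the middle of a 448*448 image, 112 wide and 224 high
print(encode_box(224, 224, 112, 224))
```

The center falls in the middle grid (index 3 of 0..6 in both directions), halfway across it, so both offsets are 0.5.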

③ Confidence of 2 bounding boxes

Confidence level of the bounding box = probability that there is an object in the bounding box * IOU between the bounding box and the actual bounding box of the object, formula:

                                     Confidence = Pr(Object) * IOU_{pred}^{truth}

Pr(Object) is the probability that an object exists in the bounding box, as distinct from P(C_i|Object) in point ① above. Pr(Object) does not care which object it is; it reflects only the probability that there is or is not an object. P(C_i|Object) in point ① means: assuming there is already an object in the grid, which object is it.

IOU_{pred}^{truth} is the IOU between the bounding box and the object's real bounding box. Note that the bounding box in the 30-dimensional vector discussed here is the output of the YOLO network, that is, the predicted bounding box. Therefore IOU_{pred}^{truth} reflects how close the predicted bounding box is to the real bounding box. Also note that although we speak of a "predicted" bounding box, this IOU is calculated during the training phase. In the test phase (Inference) we do not know where the real object is and can only rely on the output of the network, so at that point we do not need to (and cannot) calculate the IOU.

Overall, the Confidence of a bounding box expresses whether it contains an object and whether its position is accurate. A high confidence indicates that there is an object and the position is accurate. A low confidence indicates that there may be no object, or that the position deviates a lot even if there is one.
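To make the IOU term concrete, here is a minimal Python sketch. It assumes both boxes are given as (center_x, center_y, width, height) in the same units; the function name `iou` is chosen for this example only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    # Convert center/size form to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlap (0 if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IOU of 1, disjoint boxes give 0, and two 2*2 boxes shifted by half their width give 2 / 6 = 1/3.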

In general, 30-dimensional vector = 20 object probabilities + 2 bounding boxes * 4 coordinates + 2 bounding box confidences

Note: class information is for each grid, and confidence information is for each bounding box.
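As a rough illustration, the 7*7*30 tensor can be viewed as 49 vectors and each vector split into its three parts with NumPy. The ordering used here (20 class probabilities, then 8 coordinates, then 2 confidences) simply follows the order in which the parts are listed above; it is an assumption for this sketch, and real implementations may lay the values out differently.

```python
import numpy as np

S, B, C = 7, 2, 20
output = np.zeros((S, S, C + B * 5))      # stand-in for a real network output

cells = output.reshape(S * S, -1)         # 49 vectors of length 30
vec = cells[0]                            # vector of the upper-left grid

class_probs = vec[:C]                     # 20 values, one group per grid
boxes = vec[C:C + B * 4].reshape(B, 4)    # 2 boxes * (x, y, w, h)
confidences = vec[C + B * 4:]             # 1 confidence per bounding box
```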

4.2) Discussion

① At most 49 objects can be detected in a picture. Each 30-dimensional vector contains only one group of (20) object classification probabilities, so it can predict only one object. Therefore, the 7 * 7 = 49 output vectors represent at most 49 objects.

② There are 49 * 2 = 98 bounding boxes in total, and there are 2 groups of bounding boxes in each 30-dimensional vector, so there are 98 candidate areas in total.

③ YOLO's bounding box is not the Anchor of Faster RCNN

Faster RCNN and some other algorithms manually set n Anchors (a priori boxes, bounding boxes at preset positions) in each grid, each designed with a different size and aspect ratio. YOLO's bounding boxes look like 2 Anchors in a grid, but they are not. YOLO does not preset the size and shape of the two bounding boxes, nor does each bounding box output a prediction for a separate object. It simply predicts two bounding boxes for one object and chooses the one that predicts more accurately.

For training, we need a correct answer taken from the sample in advance, the bounding box, as the regression target. We do not know in advance where YOLO's 2 bounding boxes will be. Only after a forward pass does the network output 2 bounding boxes, and the IOU of each is then calculated against the object's actual bounding box in the sample. Only then can we determine that the bounding box with the larger IOU is the one responsible for predicting the object.

At the beginning of training, the bounding boxes predicted by the network may be all over the place, but the one with the better IOU is always chosen. As training goes on, each bounding box gradually becomes good at predicting certain situations (possibly certain object sizes, aspect ratios, object types, etc.). In this sense, the idea resembles evolutionary or unsupervised learning.

Another point is that one Object is predicted by only one grid, rather than several grids all rushing to predict the same Object. More specifically, when setting up training samples, each Object in a sample belongs to one and only one grid; even if the Object spans several grids, only one of them is assigned. Concretely, calculate the center of the Object's bounding box and find which grid this center falls in. In the output vector of that grid, the category probability of the object is 1 (the grid is responsible for predicting this object), and in all other grids the prediction probability for this Object is set to 0 (not responsible for predicting the object).

Also: although a grid produces 2 bounding boxes, only one is chosen as the prediction result and the other is ignored. This will become clearer in the section on constructing training samples below.

④ You can adjust the number of grids and bounding boxes

A 7*7 grid with 2 bounding boxes per grid gives rather coarse coverage of a 448*448 input image. We can also set up more grids and more bounding boxes. Suppose the number of grids is S*S, each grid generates B bounding boxes, and the network supports the recognition of C different objects. Then the length of each output vector is C + B * (4+1), and the whole output tensor is S * S * (C + B * (4+1)).

The parameters YOLO chose are a 7*7 grid, 2 bounding boxes and 20 objects, so the output vector length = 20 + 2 * (4+1) = 30 and the whole output tensor is 7*7*30. Because the grid and bounding box settings are sparse, the prediction accuracy and recall of this version of YOLO after training are not ideal; the later v2 and v3 versions improve on this. Still, because its speed meets the requirements of real-time processing, it is very attractive to industry.
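The arithmetic above is easy to check with a tiny helper. The function name `output_shape` is made up for this example.

```python
def output_shape(S, B, C):
    """Shape of the output tensor: S x S grids, each holding C class
    probabilities plus B boxes of 4 coordinates and 1 confidence."""
    return (S, S, C + B * (4 + 1))


shape = output_shape(S=7, B=2, C=20)      # the parameters chosen by YOLO v1
total_boxes = shape[0] * shape[1] * 2     # 49 grids * 2 bounding boxes
print(shape, total_boxes)
```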

5) Training sample structure

As in any supervised learning, we need to construct training samples before the model can learn from them. For an input picture, what data should be filled into the corresponding 7*7*30 output tensor (the label, in supervised learning terms)? First, the 7*7 dimensions of the output correspond to the 7*7 grid of the input. Then consider how the 30-dimensional vector is filled.

① Probabilities of 20 object classes

For each object in the input image, first find its center point. For example, for the bicycle in the figure, the center point is at the yellow dot and falls in the yellow grid, so in the 30-dimensional vector of the yellow grid the probability of bicycle is 1 and the probability of every other object is 0. In the 30-dimensional vectors of all the other 48 grids, the probability of bicycle is 0. This is what is meant by "the grid where the center point is located is responsible for predicting the object". The classification probabilities for the dog and the car are filled in the same way.

② Locations of 2 bounding boxes

In the training sample, the bounding box position should be filled with the object's actual bounding box. But one object corresponds to two sets of bounding box parameters, so which one should the actual bounding box be filled into? As discussed above, the choice depends on the IOU between each bounding box output by the network and the object's actual bounding box, so which bounding box to fill must be decided dynamically during training. See point ③ below.

③ Confidence of 2 bounding boxes

The confidence formula discussed above:

                                          Confidence = Pr(Object) * IOU_{pred}^{truth}
IOU_{pred}^{truth} can be calculated directly: take the two bounding boxes output by the network and compute the IOU of each with the object's actual bounding box. Then compare the two IOUs. Whichever bounding box has the larger IOU (is closer to the actual bounding box) is responsible for predicting whether the object exists, that is, Pr(Object) = 1 for that bounding box, and the object's actual bounding box parameters are filled into that bounding box's position in the vector. For the other bounding box, which is not responsible for the prediction, Pr(Object) = 0.
In short, the bounding box closest to the object's actual bounding box has Confidence = IOU_{pred}^{truth}, and the other bounding box of the grid has Confidence = 0.

For example, in the figure above, the center point of the bicycle is located in the grid at row 4, column 3, so the 30-dimensional vector at row 4, column 3 of the output tensor is as shown in the figure below.

There is a bicycle at the grid position of row 4, column 3, and its center point is in this grid. The actual bounding box parameters of the bicycle are filled into the bounding box1 position of the sample label vector. Note that the figure places the bicycle's actual bounding box parameters in bounding box1, but in practice this is decided during training: after the network produces its output, the IOUs of bounding box1 and bounding box2 with the bicycle's actual bounding box are compared. Here output bounding box1 has the larger IOU with the actual box, so the bicycle's actual bounding box parameters are placed in the bounding box1 position and the confidence of bounding box1 is set to 1. Bounding box2 is stored as usual, with its confidence set to 0.
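The label construction described in this section can be sketched as follows. Several things here are assumptions made for illustration: the function name `make_label`, the vector layout (20 class probabilities, then 8 coordinates, then 2 confidences), object coordinates given as fractions of the image size, and the `responsible` argument, which stands in for the IOU comparison that real training performs after each forward pass.

```python
import numpy as np


def make_label(objects, responsible, S=7, B=2, C=20):
    """Build an S*S*(C + B*5) label tensor.

    objects:     list of (class_id, cx, cy, w, h), all relative to the image (0..1)
    responsible: for each object, the index (0..B-1) of the bounding box slot
                 that had the larger IOU with it (decided during training)
    """
    label = np.zeros((S, S, C + B * 5))
    for (cls, cx, cy, w, h), j in zip(objects, responsible):
        col = min(int(cx * S), S - 1)     # grid containing the center point
        row = min(int(cy * S), S - 1)
        label[row, col, :C] = 0.0
        label[row, col, cls] = 1.0        # this grid is responsible for the object
        start = C + 4 * j                 # coordinate slot of the responsible box
        label[row, col, start:start + 4] = (cx * S - col, cy * S - row, w, h)
        label[row, col, C + 4 * B + j] = 1.0   # confidence of the responsible box
    return label


# One object of (hypothetical) class 1 whose center falls in row 4, column 3
# when counting from 1, i.e. index [3, 2] when counting from 0.
label = make_label([(1, 0.4, 0.55, 0.3, 0.5)], responsible=[0])
```

All other grids, and the second bounding box slot of the responsible grid, keep their zeros.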

6) Loss function

The loss is the deviation between the actual output value of the network and the sample label value.

(We take predicted box 1 and predicted box 2 output by the network, calculate the IOU of each with the object's actual box, and fill the actual box into the sample label at the position of the predicted box with the larger IOU.)

YOLO's loss function:
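For reference, the loss function as given in the YOLO v1 paper can be written out as below. Hatted symbols are network outputs and unhatted symbols are label values; the five lines are numbered from top to bottom in the discussion that follows.

```latex
\begin{aligned}
\text{Loss} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```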

The design goal of the loss function is to strike a good balance among three aspects: coordinates (x, y, w, h; 8 dimensions), confidence (2 dimensions) and classification (20 dimensions). Simply using sum-squared error loss for all of them has the following shortcomings:

a) It is obviously unreasonable to treat the 8-dimensional localization error as equally important as the 20-dimensional classification error.
b) If some grids contain no objects (there are many such grids in a picture), the confidence of the bounding boxes in these grids is pushed towards 0. Because grids with objects are few, the contribution of the grids without objects to the gradient update will be much greater than that of the grids with objects, which makes the network unstable or even divergent.

For bounding boxes of different sizes, the same offset affects the IOU of a small bbox much more than that of a large bbox, yet sum-squared error loss gives the same loss for the same offset. To alleviate this problem, the author uses a clever trick: take the square root of the box width and height instead of the raw width and height. As shown below: a small bbox has a small value on the horizontal axis, so when an offset occurs, the loss reflected on the y axis (green in the figure below) is larger than for a big box (red in the figure below).
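A quick numeric check shows the effect of the square root. The widths and the offset below are arbitrary values picked for this example.

```python
import math


def size_error(true_w, pred_w, use_sqrt):
    """Squared error on a box width, with or without the square-root trick."""
    if use_sqrt:
        return (math.sqrt(true_w) - math.sqrt(pred_w)) ** 2
    return (true_w - pred_w) ** 2


offset = 0.05
small = (0.10, 0.10 + offset)   # small box, width 0.10 of the image
large = (0.80, 0.80 + offset)   # large box, width 0.80 of the image

plain_small = size_error(*small, use_sqrt=False)
plain_large = size_error(*large, use_sqrt=False)
sqrt_small = size_error(*small, use_sqrt=True)
sqrt_large = size_error(*large, use_sqrt=True)

print(plain_small, plain_large)   # practically identical
print(sqrt_small, sqrt_large)     # the small box is penalized more
```

Without the square root, both boxes receive the same loss for the same offset; with it, the small box's loss is several times larger.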

1_i^{obj} means that an object exists in grid i; 1_{ij}^{obj} means that an object exists in the j-th bounding box of grid i; 1_{ij}^{noobj} means that no object exists in the j-th bounding box of grid i; C_i denotes the confidence (the probability that an object exists).

In general, the sum of the squares of the errors between the network output and the contents of the sample label is used as the overall error of a sample. Several terms in the loss function correspond to the contents of the output 30 dimensional vector.

① Error of object classification
Line 5 of the formula is the category prediction error. 1_i^{obj} means that only grids containing objects are included in the error.

② Position error of bounding box
There is an error in the blue box above, and the following figure shall prevail:

a) Every term carries 1_{ij}^{obj}, meaning that only the data of the "responsible" (larger IOU) predicted bounding box is included in the error.
b) The width and height in line 2 take the square root first. If the difference were taken directly, a large object would be less sensitive to the difference and a small object more sensitive. Taking the square root reduces this difference in sensitivity, so that larger and smaller objects have similar weight in the size error.
c) These losses are given a larger loss weight, written \lambda_{coord}. Multiplying by \lambda_{coord} adjusts the weight of the bounding box position error relative to the classification error and the confidence error. YOLO sets \lambda_{coord} = 5, i.e. it increases the weight of the position error.

③ Confidence error of bounding box
a) Line 3 is the confidence error of bounding boxes that contain an object. The factor 1_{ij}^{obj} means that only the confidence of the "responsible" (larger IOU) predicted bounding box is included in the error.
b) Line 4 is the confidence error of bounding boxes that contain no object. Because there is no object in the bounding box, it should honestly say "I have no object here", that is, output a confidence as low as possible. If it wrongly outputs a high confidence, it will be confused with the bounding box that is really "responsible" for predicting the object. This is just like object classification, where the best probability for the correct object is 1 and the best probability for all other objects is 0. The loss weight of the confidence loss of a bbox with an object (red box in the figure above) is normally taken as 1.
c) The confidence loss of a bbox without an object is given a small loss weight, written \lambda_{noobj}. Multiplying by \lambda_{noobj} adjusts the weight of the confidence error of bounding boxes without an object relative to the other errors. YOLO sets \lambda_{noobj} = 0.5, i.e. it reduces the weight of the confidence error of bounding boxes without an object.
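Putting the three kinds of error together, the loss of a single grid could be sketched as below. This is a simplified illustration and not the original training code: the function name `cell_loss` and its argument layout are assumptions, widths and heights are assumed to be non-negative, and only the weights 5 and 0.5 come from the text above.

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5


def cell_loss(pred_boxes, pred_conf, pred_cls,
              true_box, true_conf, true_cls, responsible):
    """Sum-squared loss of one grid.

    pred_boxes: (B, 4) array of (x, y, w, h); pred_conf: (B,); pred_cls: (C,)
    responsible: index of the responsible bounding box,
                 or None if the grid contains no object
    """
    loss = 0.0
    for j in range(len(pred_conf)):
        if j == responsible:
            x, y, w, h = pred_boxes[j]
            tx, ty, tw, th = true_box
            # position error of the center, weighted by lambda_coord
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            # size error on square roots, weighted by lambda_coord
            loss += LAMBDA_COORD * ((np.sqrt(w) - np.sqrt(tw)) ** 2
                                    + (np.sqrt(h) - np.sqrt(th)) ** 2)
            # confidence error of the responsible box (weight 1)
            loss += (pred_conf[j] - true_conf) ** 2
        else:
            # confidence error of a box without an object, target 0
            loss += LAMBDA_NOOBJ * pred_conf[j] ** 2
    if responsible is not None:
        # classification error, only for grids that contain an object
        loss += float(np.sum((pred_cls - true_cls) ** 2))
    return float(loss)
```

For an empty grid whose two boxes output confidences 0.2 and 0.4, only the no-object term contributes: 0.5 * (0.04 + 0.16) = 0.1.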

7) Training

YOLO first pre-trains the first 20 convolution layers on the ImageNet dataset, and then trains the complete network for object recognition and localization on the PASCAL VOC dataset. The network structure of YOLO is shown in the figure below:


The last layer of YOLO uses a linear activation function, and the other layers use Leaky ReLU. Dropout and data augmentation are used during training to prevent overfitting.

8) Prediction (inference)

Given a picture, the trained YOLO network outputs a 7*7*30 tensor that represents the objects (probabilities) contained in all grids of the picture, together with the 2 possible positions (bounding boxes) and confidences for each grid. YOLO then uses the NMS (Non-Maximal Suppression) algorithm to extract the most likely objects and positions.

9) NMS (non maximum suppression)

The core idea of NMS is: select the candidate with the highest score as an output, remove the candidates that overlap with it, and repeat this process until all candidates are processed. YOLO's NMS is calculated as follows.

In the 7*7*30 tensor output by the network, for each grid, the score of object C_i in the j-th bounding box is:
Score_{ij} = P(C_i|Object) * Confidence_j

It represents the possibility that object C_i exists in the j-th bounding box. Each grid has 20 object probabilities * 2 bounding box confidences, for a total of 40 scores (candidates). The 49 grids give 1960 scores. NMS is run separately for each object class, so each class has 1960/20 = 98 scores.
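The score computation maps naturally onto NumPy broadcasting. The random arrays below are only stand-ins for real network outputs, and the array names are chosen for this example.

```python
import numpy as np

S, B, C = 7, 2, 20
rng = np.random.default_rng(0)
class_probs = rng.random((S, S, C))    # stand-in for P(C_i|Object) per grid
confidences = rng.random((S, S, B))    # stand-in for Confidence_j per box

# scores[row, col, j, i] = P(C_i|Object) * Confidence_j
scores = confidences[:, :, :, None] * class_probs[:, :, None, :]

# One column per object class, 98 candidate scores in each
per_class = scores.reshape(S * S * B, C)
print(scores.size, per_class.shape)
```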

NMS steps are as follows:

1) Set a Score threshold and exclude candidates below it (set their Score to 0)
2) Traverse each object category
  2.1) Traverse the 98 scores of the object
    2.1.1) Find the candidate with the largest Score and its bounding box, and add it to the output list
    2.1.2) For each remaining candidate whose Score is not 0, calculate its IOU with the bounding box output in 2.1.1
    2.1.3) According to the preset IOU threshold, exclude all candidates above the threshold (high degree of overlap) by setting their Score to 0
    2.1.4) If every bounding box is either in the output list or has Score = 0, NMS for this object category is complete; return to step 2 to process the next category
3) The output list is the set of predicted objects
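The steps above can be sketched for a single object category as follows. The function names, the corner format (x1, y1, x2, y2) for boxes, and both default thresholds are assumptions made for this example; the text does not specify threshold values.

```python
def iou_xyxy(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_one_class(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Return the indices of the boxes kept for one object category."""
    # Step 1: candidates below the score threshold are set to 0
    scores = [s if s >= score_thresh else 0.0 for s in scores]
    keep = []
    while True:
        # Step 2.1.1: pick the remaining candidate with the largest score
        best = max(range(len(scores)), key=lambda k: scores[k], default=None)
        if best is None or scores[best] == 0.0:
            break                       # step 2.1.4: nothing left to process
        keep.append(best)
        scores[best] = 0.0
        # Steps 2.1.2 and 2.1.3: suppress candidates that overlap too much
        for k in range(len(scores)):
            if scores[k] > 0.0 and iou_xyxy(boxes[best], boxes[k]) > iou_thresh:
                scores[k] = 0.0
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
print(nms_one_class(boxes, [0.9, 0.8, 0.7, 0.1]))
```

In this example the second box overlaps the first too much and the fourth falls below the score threshold, so only the first and third boxes are kept. Running the function once per category gives the full output list of step 3.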

10) Summary of the YOLO detection process:

1. Resize the picture to 448 * 448.

2. Feed the picture into the network for processing.

3. Apply non-maximum suppression to obtain the results.

Unlike traditional detection algorithms, which use a sliding window to find the target, YOLO directly uses a single convolutional neural network to predict multiple bounding boxes and category probabilities.

The advantages of YOLO are:

1. High speed: processing can reach 45fps, and the fast version (a smaller network) can even reach 155fps. This is due to the network design that combines recognition and localization, and this unified design also makes training and prediction end-to-end and very simple.

2. Strong generalization ability: it transfers well to other test sets.

3. Low error rate on background: the whole picture is put into the network for prediction.

The disadvantages of YOLO are:

The accuracy is low: small targets and adjacent targets are detected poorly (especially small objects gathered together), the predicted boxes are not very precise, and the overall prediction accuracy is slightly lower than that of Fast RCNN. This is mainly because the grid settings are sparse and each grid predicts only two boxes. In addition, the pooling layers lose some details, which affects localization.

Compared with Fast R-CNN, YOLO has larger localization error, and compared with region proposal based methods it has lower recall. However, YOLO is more accurate when it comes to the background, where Fast R-CNN has a high false positive rate. Based on this, the author designed a combined Fast R-CNN + YOLO detection mode: first use Fast R-CNN to extract a group of bounding boxes, then process the image with YOLO to obtain another group of bounding boxes. Compare whether the two groups of bounding boxes are basically consistent. If they are, use the probability calculated by YOLO to classify the target, and take the intersection area of the two as the final bounding box. This combination improves accuracy by about 3 percentage points.

Paper download address: https://pjreddie.com/media/files/papers/yolo.pdf

Caffe code download address: https://github.com/yeahkun/caffe-yolo

2, YOLOv2
V1 defects:

Fixed input size: since the output layer is a fully connected layer, the YOLO training model only supports the same input resolution as the training image during detection. Other resolutions need to be scaled to this fixed resolution;
Poor detection of objects that occupy a small part of the picture: although each grid can predict B bounding boxes, in the end only the bbox with the highest IOU is selected as the detection output, so each grid can predict at most one object. When objects are small relative to the picture, such as a herd of livestock or a flock of birds, a grid may contain multiple objects, but only one of them can be detected.
To improve the accuracy and recall of object localization, the author of YOLO proposed "YOLO9000: Better, Faster, Stronger" (Joseph Redmon, Ali Farhadi, CVPR 2017, Best Paper Honorable Mention), which is the full title of the YOLOv2 paper. Compared with v1, it raises the resolution of the training images, introduces the anchor box idea from Faster RCNN, and improves the design of the network structure to make the model easier to learn. The structural diagram of YOLOv2 is as follows:

Classification network

Fine grained features

A passthrough layer is added here (the reorg layer in the source code). It connects the 26*26 feature map of the previous layer with the 13*13 feature map of this layer, similar to the shortcut in ResNet: the earlier, higher-resolution feature map is taken as input and concatenated onto the later low-resolution feature map.

Prediction on the 13*13 feature map is enough for large targets, but not good enough for small ones. Merging in the earlier, larger feature map helps detect small targets.
Specific operation: the 26*26*512 feature map is processed by the passthrough layer into a new 13*13*2048 feature map (the spatial size becomes 1/4 and the number of channels becomes 4 times the previous), and is then concatenated with the later 13*13*1024 feature map to form a 13*13*3072 feature map. Finally, the convolutional prediction is made on this feature map.
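The shape bookkeeping of the passthrough layer can be reproduced with a NumPy space-to-depth rearrangement. This sketch only demonstrates the shapes; the function name `passthrough` is illustrative, and the order in which the 4 neighbouring positions are stacked into channels is an assumption that may differ from the actual reorg layer.

```python
import numpy as np


def passthrough(x, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride)."""
    h, w, c = x.shape
    # Split each spatial axis into (block index, position inside the block)
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    # Move the two "position inside the block" axes next to the channels
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, stride * stride * c)


fine = np.zeros((26, 26, 512))       # earlier, higher-resolution feature map
coarse = np.zeros((13, 13, 1024))    # later, low-resolution feature map
merged = np.concatenate([passthrough(fine), coarse], axis=-1)
print(passthrough(fine).shape, merged.shape)
```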

1. Anchor Boxes: in v1, fully connected layers directly predict the bbox coordinates after the convolution layers. v2 borrows the idea of Faster R-CNN and predicts bbox offsets instead.

To introduce anchor boxes for predicting bounding boxes, the author removed the fully connected layers from the network. What about the rest? First, the author removed one of the later pooling layers so that the output convolution feature map has a higher resolution. Then the network was shrunk so that the image input resolution is 416 * 416. The purpose of this step is to make the width and height of the resulting convolution feature map odd, so that there is a single center cell. (YOLO downsamples with pooling: there are five max pooling layers with size=2, stride=2, and the convolution layers do not reduce the size, so the final feature map is 416/(2^5) = 13.) The author observed that large objects usually occupy the middle of the image, so a single center cell can predict the position of these objects; otherwise the four middle cells would be needed. This trick slightly improves efficiency. In the end, a 416 * 416 picture fed into the YOLOv2 convolution network yields a 13 * 13 convolution feature map (416/32 = 13).

2. The output layer uses the convolution layer instead of the full connection layer of YOLOv1

YOLOv2 proposes a new classification model, Darknet-19. It mainly uses 3x3 convolutions and doubles the number of channels after each pooling step (as in VGG); global average pooling replaces the fully connected layer for classification, and 1x1 convolutions between the 3x3 convolutions compress the feature representation (as in Network in Network). Batch normalization is used to improve stability, accelerate convergence and regularize the model. The structure of Darknet-19 is shown in the following table:


It contains 19 conv + 5 maxpooling layers. A 1x1 convolution layer replaces the fully connected layer of YOLOv1. The 1x1 convolution layers (used for cross-channel information integration) are shown in the red rectangles above.

3. High resolution classifier: at present, most detection models first pre-train the main part of the model (the CNN feature extractor) on the ImageNet classification dataset. ImageNet classification models basically use 224 × 224 images as input, and this relatively low resolution is not ideal for a detection model. YOLOv1 therefore pre-trains a classification model at 224 × 224, then raises the resolution to 448 × 448 and fine-tunes at this high resolution on the detection dataset. However, switching the resolution so abruptly may make it hard for the model to adapt quickly to high resolution. YOLOv2 therefore adds an intermediate step: the classification network is fine-tuned on the ImageNet dataset with 448 × 448 input (10 epochs), so that the model is already adapted to high-resolution input before it is fine-tuned on the detection dataset.

In short: v1 trains the classifier network at 224 × 224 and expands to 448 for the detection network; v2 fine-tunes the initial classification network on ImageNet at 448 × 448 resolution for 10 epochs.

4. All convolution layers use Batch Normalization

Batch Normalization is also widely used in v1, dropout is used behind the positioning layer, dropout is cancelled in v2, and Batch Normalization is used in all convolution layers.

5. K-Means algorithm

In Faster R-CNN, the sizes and aspect ratios of the anchor boxes are set by experience. YOLOv2 improves on this by running k-means clustering on the bboxes of the training set to produce suitable a priori boxes. With the Euclidean distance of standard k-means, larger bboxes would produce larger errors than smaller ones, whereas IOU is independent of bbox size. So IOU is used in the distance calculation, so that the resulting anchor boxes achieve good IOU scores. Distance formula:

d(box, centroid) = 1 - IOU(box, centroid)


The advantage of choosing anchor boxes by clustering is that fewer anchor boxes are needed to reach the same IOU, which makes the model more expressive and the task easier to learn. The algorithm is: cluster the width and height of each bbox relative to the whole picture (wr, hr) to obtain k anchor boxes, then multiply these ratios by the size of the output feature map of the convolution layers. If the input is 416x416, the last convolution feature map is 13x13.
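A minimal sketch of this clustering is shown below. It follows the description above (k-means with 1 - IOU as the distance, on relative widths and heights), but the function names, the random initialization and the use of the cluster mean as the new centroid are choices made for this example, not necessarily those of the original code.

```python
import numpy as np


def wh_iou(boxes, centroids):
    """IOU between (N, 2) boxes and (k, 2) centroids given as (w, h),
    with all boxes aligned at the same corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster relative (w, h) pairs with distance d = 1 - IOU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box to the centroid with the smallest 1 - IOU
        assign = np.argmin(1.0 - wh_iou(boxes, centroids), axis=1)
        new = np.array([boxes[assign == c].mean(axis=0) if np.any(assign == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids


# Four toy boxes: two small and two large, as fractions of the picture size
boxes = np.array([[0.10, 0.10], [0.11, 0.10], [0.50, 0.60], [0.52, 0.58]])
anchors = kmeans_anchors(boxes, k=2)
print(anchors * 13)   # anchor sizes in units of the 13x13 feature map
```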

6. Multi-Scale Training

Unlike YOLOv1, which trains with a fixed input image size, YOLOv2 (when random=1 in the cfg file) adjusts the input size of the network every few iterations: every 10 iterations during training, a new input image size is chosen at random. Because the YOLOv2 network downsamples by a factor of 32, the input sizes are multiples of 32: {320, 352, ..., 608}. The smallest training size is 320 x 320 and the largest is 608 x 608. This lets the network adapt to inputs at a variety of scales.
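As a rough sketch of this schedule, the snippet below picks a new input size from the multiples of 32 every 10 batches. The names SIZES, pick_input_size and size_schedule are hypothetical and only illustrate the idea; they do not come from Darknet.

```python
import random

# Candidate input sizes: multiples of 32 from 320 to 608.
SIZES = list(range(320, 608 + 1, 32))

def pick_input_size(rng):
    """Choose one square input size at random."""
    return rng.choice(SIZES)

def size_schedule(num_batches, every=10, seed=0):
    """Return the input size used for each batch, re-drawn every `every` batches."""
    rng = random.Random(seed)
    size = pick_input_size(rng)
    schedule = []
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            size = pick_input_size(rng)
        schedule.append(size)
    return schedule

print(SIZES)
print(size_schedule(30))
```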

7. Direct location prediction
The second problem the author found when using anchor boxes is model instability, especially in early iterations. Most of the instability comes from predicting the (x, y) coordinates of the box. In a region proposal network, (x, y) and tx, ty are related by the following formulas:

tx = (x - xa) / wa,    ty = (y - ya) / ha
tx* = (x* - xa) / wa,  ty* = (y* - ya) / ha

In the formulas, x is the predicted coordinate, xa is the anchor coordinate (a preset fixed value), and x* is the ground-truth coordinate (from the annotation); the same convention applies to y, w and h, and the t variables are the offsets. Rearranging the first two formulas gives x = (tx * wa) + xa and y = (ty * ha) + ya.

This formula can be read as follows: a prediction of tx = 1 shifts the box to the right by the width of the anchor box, and a prediction of tx = −1 shifts it to the left by the same amount. The formula is unconstrained, so any anchor box can end up at any point in the image, regardless of where the prediction was made. (My understanding: because tx has no numerical bound, an anchor may end up detecting a far-away target box, which is inefficient. The correct approach is for each anchor to be responsible only for target boxes within plus or minus one unit of it.)

Here the author does not predict the offset directly. Instead, following the approach of YOLO, the network predicts location coordinates relative to the grid cell. This bounds the ground truth between 0 and 1, and a logistic function is used to constrain the predictions to that range.

Now the network predicts 5 bounding boxes (the clustered priors) in each cell of the 13*13 feature map, and each bounding box predicts five values: tx, ty, tw, th and to. The first four are coordinates and to is the confidence. If the box (bounding box prior) corresponding to a cell has width and height (pw, ph), and the cell is offset from the top-left corner of the image by (cx, cy) (in plain terms, every cell has its own (cx, cy)), then the predicted values can be expressed as:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^tw
bh = ph * e^th
Pr(object) * IOU(b, object) = σ(to)

bx represents the predicted coordinate value, tx represents the offset, and cx is the preset fixed value (similar to anchor coordinates)

tx and ty pass through the sigmoid function, which limits their values to 0~1. Keeping the offset between 0 and 1 means each anchor is responsible only for boxes around it, which helps efficiency and network convergence. σ is the sigmoid (logistic) function, so σ(tx) is the abscissa of the bounding box center relative to the top-left corner of its grid cell, σ(ty) is the ordinate, and σ(to) is the confidence score of the bounding box. With the location predictions normalized in this way, the parameters are easier to learn and the model is more stable.
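To make the direct location prediction concrete, here is a small sketch of the decoding step: bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw * e^tw, bh = ph * e^th, with σ(to) as the confidence. The function name decode_box and the example numbers are assumptions for illustration; units are grid cells.

```python
import math

def sigmoid(t):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Turn raw network outputs into a box, in grid-cell units."""
    bx = sigmoid(tx) + cx      # center x stays inside cell column cx
    by = sigmoid(ty) + cy      # center y stays inside cell row cy
    bw = pw * math.exp(tw)     # width scales the prior width
    bh = ph * math.exp(th)     # height scales the prior height
    conf = sigmoid(to)         # confidence score
    return bx, by, bw, bh, conf

# With all raw outputs at 0, the box sits at the cell center with the prior's size.
print(decode_box(0, 0, 0, 0, 0, cx=6, cy=4, pw=3.0, ph=2.0))
```

However large tx becomes, the decoded center cannot leave the cell that predicted it, which is the constraint the unbounded RPN-style offset lacks.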

Introduction to YOLO v3
The YOLO v3 code discussed here is the Keras version by qqwweee. It is easy to reproduce and relatively easy to follow, so readers can grasp the essence of v3 by reading the code alongside blog posts.

GitHub address: https://github.com/qqwweee/keras-yolo3

For the implementation code based on tensorflow, please refer to: https://github.com/wizyoung/YOLOv3_TensorFlow

As the latest algorithm in the YOLO series, yolo_v3 both retains and improves on aspects of the earlier algorithms. First, what yolo_v3 retains:

Since yolo_v1, the YOLO algorithm has detected objects by dividing the image into cells, although the number of divisions differs between versions. It uses "leaky ReLU" as the activation function. Training is end to end: a single loss function handles training, so one only needs to focus on the inputs and outputs.

Since yolo_v2, YOLO has used batch normalization as a means of regularization, accelerating convergence and avoiding overfitting, and has attached a BN layer and a leaky ReLU layer to each convolution layer.

Multi-scale training, which offers a tradeoff between speed and accuracy. If you want more speed, you can sacrifice accuracy; if you want more accuracy, you can sacrifice a little speed. The gains of each YOLO generation come largely from improvements to the backbone network, from darknet-19 in v2 to darknet-53 in v3. yolo_v3 also offers an alternative backbone, tiny darknet. If you want top performance, use Darknet-53 as the backbone; for a lightweight, high-speed model, use tiny-darknet. In short, YOLO is naturally "flexible", which makes it especially well suited to engineering use.


DBL: shown in the lower-left corner of the figure. This is Darknetconv2d_BN_Leaky in the code, the basic component of yolo_v3: convolution + BN + Leaky ReLU. In v3, BN and leaky ReLU are inseparable parts of the convolution layer (except for the last convolution layer); together they form the smallest component.

res_n: n is a number (res1, res2, ..., res8 and so on) that gives how many res_units the res_block contains. This is the large component of yolo_v3. yolo_v3 borrows the residual structure from ResNet, which allows the network to be deeper (from darknet-19 in v2 to darknet-53 in v3; the former has no residual structure). The res_block is illustrated in the lower-right corner of the figure, and its basic component is again DBL.

concat: tensor concatenation. It concatenates a middle layer of darknet with the upsampled output of a later layer. Concatenation differs from the add operation of the residual layer: concatenation expands the tensor dimension, while add sums the tensors directly and leaves the tensor dimension unchanged.
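The difference between add and concat is easy to see from the resulting shapes. The sketch below uses NumPy arrays as stand-ins for feature maps; the 26x26x256 shape is an assumption chosen only for illustration.

```python
import numpy as np

# Two feature maps with the same spatial size and channel count (assumed shapes).
a = np.zeros((26, 26, 256))
b = np.ones((26, 26, 256))

added = a + b                                   # residual-style add
concatenated = np.concatenate([a, b], axis=-1)  # concat along channels

print(added.shape)         # channel count unchanged
print(concatenated.shape)  # channel count is the sum of both inputs
```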

The whole yolo_v3_body contains 252 layers, including 23 add layers (mainly used to build the res_blocks; each res_unit needs one add layer, for a total of 1+2+8+8+4=23). In addition, the numbers of BN layers and LeakyReLU layers are exactly the same (72 each); in the network structure, every BN layer is followed by a LeakyReLU layer. There are 75 convolution layers, 72 of which are followed by the BN+LeakyReLU combination to form the basic component DBL. The structure diagram shows that upsampling and concat each occur 2 times. As the table analysis shows, each res_block uses one zero padding, and there are 5 res_blocks in total.
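The layer counts quoted above can be checked with simple arithmetic. In the sketch below every count comes from the paragraph above except the single input layer, which is an assumption added so that the total reaches 252; the dictionary keys are illustrative names, not Keras class names.

```python
# Layer counts quoted in the text; "input" is an assumed extra layer.
layer_counts = {
    "conv": 75,
    "batch_norm": 72,
    "leaky_relu": 72,
    "add": 23,
    "zero_padding": 5,
    "upsampling": 2,
    "concat": 2,
    "input": 1,  # assumption: one input layer, not stated in the text
}

# Number of res_units in each of the 5 res_blocks.
res_units_per_block = [1, 2, 8, 8, 4]

total = sum(layer_counts.values())
convs_without_bn = layer_counts["conv"] - layer_counts["batch_norm"]

print(sum(res_units_per_block))  # add layers, one per res_unit
print(total)
print(convs_without_bn)          # convolutions not followed by BN+LeakyReLU
```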


darknet-53 model



Throughout v3's forward propagation, tensor size changes are achieved by changing the stride of the convolution kernel. In yolo_v2, by contrast, tensor size changes in the forward pass are done by max pooling, 5 times in total. After 5 reductions the feature map is 1/32 of the original input size: with 416x416 input, the output is 13x13 (416/32 = 13). Like v2, the yolo_v3 backbone reduces the output feature map to 1/32 of the input, which is why the input image side is usually required to be a multiple of 32; v3 does this by increasing the convolution stride, also 5 times. (darknet-53 ends with a global average pooling layer, which yolo_v3 does not have, so only the preceding five reductions matter for the tensor dimensions.)
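The arithmetic behind the multiple-of-32 requirement can be sketched as follows. The helper name feature_side is hypothetical; it simply halves the side length five times, as five stride-2 convolutions (or five max pooling layers in v2) would.

```python
def feature_side(input_side, num_reductions=5):
    """Side length of the feature map after halving `num_reductions` times."""
    factor = 2 ** num_reductions  # 32 for five reductions
    if input_side % factor != 0:
        raise ValueError(f"input side {input_side} is not a multiple of {factor}")
    side = input_side
    for _ in range(num_reductions):
        side //= 2  # each stride-2 step halves the side length
    return side

print(feature_side(416))
print(feature_side(608))
```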

darknet-19 has no residual structure (the resblock, borrowed from resnet) and is the same type of backbone as VGG (the previous generation of CNN structures), whereas darknet-53 is a backbone that can go head to head with resnet-152. darknet-19 still has a big advantage in speed. Other details (such as using k=9 bounding box priors) also show that yolo_v3 does not chase speed; it pursues performance on the basis of keeping real-time operation (fps > 36).


yolo v3 outputs feature maps at three different scales, shown as y1, y2 and y3 in the figure above. This is one of the few improvements mentioned in the v3 paper: predictions across scales. It borrows from FPN (feature pyramid networks), using multiple scales to detect targets of different sizes; the finer the grid cell, the finer the objects that can be detected. y1, y2 and y3 all have depth 255, and their side lengths are in the ratio 13:26:52.

COCO has 80 categories, so each box must output a probability for each category. yolo v3 has each grid cell predict 3 boxes, and each box needs the five basic parameters (x, y, w, h, confidence) plus the 80 category probabilities. So 3*(5 + 80) = 255; that is where 255 comes from. (Remember the output tensor of yolo v1? It was 7x7x30, could recognize only 20 kinds of objects, and each cell could predict only 2 boxes. Next to v3, it is like a senior citizen's feature phone next to an iPhone X.)
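The two output depths mentioned above follow from slightly different formulas, which the sketch below spells out. The function names v3_depth and v1_depth are made up for the example. In v3 every box carries its own class probabilities, while in v1 the class probabilities are shared by the boxes of a cell.

```python
def v3_depth(boxes_per_cell=3, num_classes=80):
    """Each box has (x, y, w, h, confidence) plus its own class probabilities."""
    return boxes_per_cell * (5 + num_classes)

def v1_depth(boxes_per_cell=2, num_classes=20):
    """Each box has (x, y, w, h, confidence); class probabilities are per cell."""
    return boxes_per_cell * 5 + num_classes

print(v3_depth())  # COCO, 3 boxes per cell
print(v1_depth())  # VOC, 2 boxes per cell
```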

v3 uses upsampling to implement these multi-scale feature maps. The two tensors joined by concat have the same scale (the two concatenations take place at the 26x26 scale and at the 52x52 scale, with (2, 2) upsampling ensuring that the tensors being concatenated match). Unlike SSD, the author does not use the outputs of middle backbone layers directly as feature maps; instead, they are first concatenated with the upsampled results of later network layers. Why do it this way? I find it a little mystical. On the one hand, it avoids overlap with other algorithms. On the other hand, it may simply have turned out better in experiments, and it may also keep the model size down.

3. Bounding Box Prediction

For v3, the handling of priors is stated clearly: k=9 b-box priors are selected; for tiny-yolo, k=6. The priors are all obtained by clustering on the dataset and have fixed values, as follows:

10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
Each anchor prior (it is called an anchor prior, but the anchor mechanism is not used) consists of two numbers, one for the height and one for the width. v3 uses logistic regression when predicting b-boxes, a neat trick that plays the same role as the linear regression used to adjust b-boxes in RPN. Each time v3 predicts a b-box, the raw output is the same as in v2, and the absolute (x, y, w, h, c) is then computed through formula 1:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^tw
bh = ph * e^th

Here cx, cy are the coordinate offsets of the grid cell, and pw, ph are the side lengths of the preset anchor box. The final predicted box coordinates are bx, by, bw, bh, while the targets the network learns are tx, ty, tw, th. The author uses logistic regression to give an objectness score to the content enclosed by each anchor, that is, how likely it is that this location contains a target. This step happens before prediction and can remove unnecessary anchors, reducing the amount of computation. Anchor priors are selected for prediction according to the objectness score; not every anchor prior produces an output.

Unlike Faster R-CNN, yolo_v3 operates on only one prior, the best one. Logistic regression is used to find, among the nine anchor priors, the one with the highest objectness score (the score for how likely a target is present). Logistic regression here means using a curve to model the mapping from prior to objectness score.

Multiscale prediction:
The nine anchors are divided equally among the three output tensors, and each output takes its anchors according to size: large, medium or small. Each scale predicts three boxes. The anchors are still designed by clustering, which gives nine cluster centers that are split evenly across the three scales by size.

Scale 1: add some convolution layers after the basic network, and then output box information
Scale 2: upsample (x2) from the penultimate convolution layer of scale 1, merge the result with the last 16x16 feature map, and output box information after several more convolutions. The feature map is twice the size of that of scale 1.
Scale 3: similar to scale 2, a 32 x 32 feature map is used
Each output y predicts three boxes in every grid cell. The three comes from dividing 9 by 3, a choice made by the author. This can be seen from the dimensions of the output tensor, for example 13x13x255. Where does 255 come from? It is 3 * (5 + 80): 80 is the number of categories, 5 is the location information plus confidence, and 3 is the number of boxes to output. In the code, the 3 in 3 * (5 + 80) is produced directly by num_anchors//3.
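A minimal sketch of how the nine priors listed earlier might be shared among the three scales is shown below. The helper split_anchors is hypothetical, and the pairing of the largest anchors with the coarsest 13x13 grid is an assumption based on how this is commonly configured, since large objects need less spatial resolution.

```python
# The nine priors listed above, as pairs of numbers.
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

def split_anchors(anchors, grid_sides=(13, 26, 52)):
    """Sort anchors by area and give each grid an equal share, largest first."""
    ordered = sorted(anchors, key=lambda a: a[0] * a[1], reverse=True)
    per_scale = len(ordered) // len(grid_sides)  # 9 // 3 = 3
    return {
        side: ordered[i * per_scale:(i + 1) * per_scale]
        for i, side in enumerate(grid_sides)
    }

for side, group in split_anchors(anchors).items():
    print(side, group)
```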

Classifier category prediction:

YOLOv3 does not use Softmax to classify each box. There are two main considerations:

Softmax assigns a single category (the one with the largest score) to each box. In datasets such as Open Images, a target may carry overlapping category labels, so softmax is not suitable for multi-label classification.
Softmax can be replaced by multiple independent logistic classifiers without loss of accuracy. Binary cross-entropy loss is used for the classification loss.
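The following sketch illustrates why independent logistic classifiers fit overlapping labels better than softmax. The three logits and the labels "woman", "person" and "car" are made-up values for illustration only; the helper functions are plain NumPy, not taken from any YOLO implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over independent class predictions."""
    p = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

# Hypothetical logits for the classes ("woman", "person", "car").
logits = np.array([4.0, 3.5, -5.0])
targets = np.array([1.0, 1.0, 0.0])  # the box is both a woman and a person

p_sigmoid = sigmoid(logits)  # each class is scored independently
p_softmax = softmax(logits)  # classes compete, scores sum to 1

print(p_sigmoid, binary_cross_entropy(p_sigmoid, targets))
print(p_softmax, binary_cross_entropy(p_softmax, targets))
```

With softmax the two correct labels must share a total probability of 1, so at least one of them is pushed below 0.5; with independent sigmoids both can be close to 1.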
Improvement of yolov3 over yolov2
Built on the Darknet-53 network structure, yolov3 is deeper than yolov2, reaching 106 layers. It borrows ideas from the ResNet and FPN network structures, and uses multi-scale prediction to make up for the coarse granularity of the original 13 * 13 grid. At the same time, the yolov3 network keeps the data augmentation and batch normalization operations of yolov2.

The YOLOv2 network structure has a special conversion layer (Passthrough Layer). Suppose the last extracted feature map is 13*13; the conversion layer stacks the earlier 26*26 feature map with the 13*13 feature map of this layer (expanding the amount of feature dimension data), fuses them, and runs detection on the fused feature map. This improves the accuracy of the algorithm on small targets. YOLOv3 strengthens and improves this idea to achieve better results.
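One way to picture the Passthrough Layer is as a space-to-depth rearrangement followed by a channel concat. The sketch below uses the 26x26x512 and 13x13x1024 shapes that this post quotes for YOLOv2; the function name space_to_depth and the exact channel ordering are assumptions, and Darknet's own layer may order the channels differently.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Move each block x block spatial patch into the channel dimension."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the patch offsets next to channels
    return x.reshape(h // block, w // block, block * block * c)

shallow = np.random.rand(26, 26, 512)  # earlier, higher-resolution feature map
deep = np.random.rand(13, 13, 1024)    # last feature map

passed = space_to_depth(shallow)       # 26x26x512 -> 13x13x2048
fused = np.concatenate([passed, deep], axis=-1)

print(passed.shape)
print(fused.shape)
```

No values are discarded in the rearrangement: every pixel of the 26x26 map ends up in some channel of the 13x13 map.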

YOLO v3 uses FPN-like upsampling (Upsample) and fusion, fusing 3 scales (13*13, 26*26 and 52*52) and running detection independently on each of the fused multi-scale feature maps, which markedly improves the detection of small targets. (Some algorithms that use multi-scale feature fusion, such as YOLOv2, generally predict from a single fused feature map; FPN differs in that prediction is carried out on different feature layers.)

Effective improvement strategies:

1. Logistic regression is used to predict the objectness score of each bounding box. Bounding boxes and ground truths are matched 1 vs 1.

2. Each bounding box uses multi-label classification to predict the classes it contains, with binary cross entropy (sigmoid) replacing softmax for class prediction.

3. A pyramid network is used to predict boxes at multiple scales (multi-layer feature maps [13, 26, 52]). A feature map from earlier in the network is taken and concatenated with the later feature map after 2x upsampling, giving more accurate fine-grained predictions (similar to the SSD idea). v2 does not follow the SSD idea: although it also uses multi-layer feature maps, v2 converts the 26x26 feature map into a 13x13 feature map through the Passthrough Layer, increasing the feature dimension instead.

from: https://blog.csdn.net/donkey_1993/article/details/81481833
from: https://www.jianshu.com/p/cad68ca85e27
from: https://blog.csdn.net/zxyhhjs2017/article/details/83013297
from: https://www.cnblogs.com/makefile/p/YOLOv3.html
from: https://blog.csdn.net/leviopku/article/details/82660381


1. YOLOv1 treats the image as an s x s grid, where s equals 7. Each grid cell predicts 2 bounding boxes and the confidence that each contains an object, and each grid cell also predicts the category of the object it contains. YOLOv1 consists of 24 convolution layers, 4 max pooling layers and 2 fully connected layers. In practice the focus is on the final 7x7x30 output. Here 7x7 corresponds one-to-one with the 7x7 grid over the input image. The first ten of the 30 values are the coordinates and object confidences of the 2 bounding boxes, and the last 20 are the 20 categories of the VOC dataset.

2. YOLOv2 adds batch normalization, a high-resolution classifier, anchor boxes, dimension clustering, fine-grained features and multi-scale training. YOLOv2 adopts a new base model (feature extractor) called Darknet-19, with 19 convolution layers and 5 maxpooling layers. These changes improve accuracy while remaining faster than YOLOv1.

Batch Normalization maps the data distribution to a relatively compact distribution, so that the network learns faster and better and avoids overfitting. Using Batch Normalization improves mAP by 2%.
v1 trains the classifier network at 224 × 224 and enlarges the input to 448 for the detection network. v2 fine-tunes the initial classification network on ImageNet at 448 × 448 resolution for 10 epochs.
YOLOv1 uses fully connected layers to generate the bounding box coordinates, which loses the spatial information of the feature map and leads to inaccurate localization. YOLOv2 uses anchor boxes and samples directly in a sliding-window fashion over the convolutional feature map. Unlike YOLOv1, YOLOv2 predicts the offset of the coordinates relative to the top-left corner of the grid cell and obtains the final predicted coordinates through a transformation formula. YOLOv1 can predict only 98 bounding boxes, while YOLOv2 can predict thousands once anchor boxes are used.
YOLOv2 uses the K-means clustering algorithm to cluster the ground truth bounding boxes in the training set. To balance model complexity against recall, five cluster centers are chosen, giving the five most representative bounding boxes.
Fine-grained features have a great impact on the detection of small objects. A Passthrough Layer is introduced to connect the shallow feature map to the deep feature map: the 26x26x512 feature map in the figure is transformed into a 13x13x2048 feature map through interlaced sampling, and then concatenated with the 13x13x1024 feature map along the channel dimension.
To make YOLOv2 robust to images of different sizes, multi-size training is introduced. Every 10 batches, a new image size is selected to train the network.
3. On the basis of maintaining real-time performance, YOLOv3 has made several improvements to YOLOv2, mainly including three points:

Logistic regression is used to predict confidence and classification; b-box coordinates are predicted at three scales; and the feature extractor is changed to use a residual network.
For classification, softmax multi-class classification is not used. The author points out that softmax ultimately does not improve performance, and that softmax assumes each box has only one class, which does not transfer well to larger datasets with multiple category labels. In Open Images, for example, a target may carry overlapping category labels, so softmax is not suitable for multi-label classification. The author therefore uses multiple logistic regressions to predict the classes, and binary cross entropy to compute the classification loss.
Using the idea of FPN, the output of a middle layer is fused with the output of a later layer, and predictions are made at three scales. Each cell at each scale predicts three boxes. Taking the above as an example, downsampling is 32x and the last output is 8x8x1024. Convolution and logistic regression then give 8x8x255 (255 = 3 x (5 + 80), where 5 is the coordinates plus confidence and 80 is the number of COCO categories); this is the prediction output of the first scale. For the second scale, the 8x8x1024 tensor is transformed into 16x16x512 by upsampling and convolution, then concatenated with the 16x16x512 output of an earlier stage, and 16x16x255 is generated in a similar way. A similar operation yields 32x32x255.
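The shape bookkeeping of the second scale can be traced with NumPy stand-ins. This is only a sketch under stated assumptions: conv1x1 is a hypothetical 1x1 convolution with random weights (a real head stacks several convolutions), the upsampling is nearest-neighbour, and applying the convolution before the upsampling is an assumed ordering that gives the same shapes as the text.

```python
import numpy as np

def conv1x1(x, out_channels, rng):
    """A 1x1 convolution is a per-pixel matrix multiply over channels."""
    w = rng.standard_normal((x.shape[-1], out_channels)) * 0.01
    return x @ w

def upsample2x(x):
    """Nearest-neighbour upsampling: repeat every row and column twice."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
deep = rng.standard_normal((8, 8, 1024))      # last backbone output
earlier = rng.standard_normal((16, 16, 512))  # feature map from an earlier stage

up = upsample2x(conv1x1(deep, 512, rng))          # 8x8x1024 -> 16x16x512
merged = np.concatenate([up, earlier], axis=-1)   # channels add up on concat
head = conv1x1(merged, 3 * (5 + 80), rng)         # prediction for the second scale

print(up.shape, merged.shape, head.shape)
```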

Tags: Algorithm neural networks Deep Learning yolo

Posted on Wed, 15 Sep 2021 18:10:59 -0400 by bough