Intelligent Target Detection 52 - Keras Builds YoloX Target Detection Platform

Learning Preface

Megvii's newly released YoloX looked very interesting, so I reproduced it.

Source Download

https://github.com/bubbliiiing/yolox-keras
If you like it, you can give it a star.

Improvements to YoloX (incomplete)

1. Backbone: the Focus network structure is used, an interesting structure that first appeared in YoloV5. The specific operation is to take a value from every other pixel in the image, which produces four independent feature layers; these four layers are then stacked, so the width and height information is concentrated into the channel dimension and the number of input channels is expanded four times. (A small NumPy sketch of this operation appears after this list.)

2. Classification and Regression Layers: Decoupled Head. The detection heads used by previous versions of Yolo are coupled, that is, classification and regression are implemented together in a single 1x1 convolution, which the YoloX authors consider harmful to recognition. In YoloX, the Yolo Head is split into two branches that are implemented separately and only merged at prediction time.

3. Data Augmentation: Mosaic data augmentation. Mosaic stitches four pictures together to augment the data. According to the paper, a huge advantage is that it enriches the background of the detected objects, and when BN is computed, the statistics of four pictures are calculated at once.

4. Anchor Free: no prior (anchor) boxes are used.

5. SimOTA: Dynamically match positive samples for targets of different sizes.

The above is not a complete list of the improvements; there are others. Only a few that I find interesting and very effective are listed here.
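
To make the Focus operation in point 1 concrete, here is a minimal NumPy sketch (not part of the original repository code) showing how taking a value from every other pixel turns a 4x4x3 input into a 2x2x12 output:

import numpy as np

# A dummy "image" with height 4, width 4 and 3 channels.
x = np.arange(4 * 4 * 3).reshape(4, 4, 3)

# Take a value every other pixel, producing four half-resolution slices.
patches = [x[ ::2,  ::2, :],   # even rows, even columns
           x[1::2,  ::2, :],   # odd rows,  even columns
           x[ ::2, 1::2, :],   # even rows, odd columns
           x[1::2, 1::2, :]]   # odd rows,  odd columns

# Stacking along the channel axis concentrates width/height information
# into the channels: (4, 4, 3) -> (2, 2, 12).
y = np.concatenate(patches, axis=-1)
print(x.shape, '->', y.shape)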

YoloX implementation ideas

1. Analysis of the Overall Structure

Before learning YoloX, we need to understand what YoloX is actually doing, which will help us understand the network later.

Like previous versions of Yolo, the entire YoloX can still be divided into three parts, CSPDarknet, FPN, and Yolo Head.

CSPDarknet can be called YoloX's backbone feature extraction network. The input picture first has its features extracted in CSPDarknet. The extracted features can be called feature layers, which are collections of features of the input picture. In the backbone part, we obtain three feature layers for building the rest of the network; I call these the effective feature layers.

FPN can be called YoloX's enhanced feature extraction network. The three effective feature layers obtained from the backbone are fused in this part. The purpose of feature fusion is to combine feature information of different scales. In the FPN part, the effective feature layers already obtained are used to continue extracting features. YoloX also uses the PANet structure from YoloV4: features are not only upsampled for feature fusion, but afterwards downsampled again for a second round of feature fusion.

Yolo Head is YoloX's classifier and regressor. Through CSPDarknet and FPN, we obtain three enhanced effective feature layers. Each feature layer has a width, a height, and a number of channels, so we can think of the feature map as a collection of feature points, each with its own channels. What Yolo Head actually does is judge these feature points to decide whether an object corresponds to them. The detection heads of previous versions of Yolo are coupled, that is, classification and regression are implemented together in a single 1x1 convolution, which the YoloX authors consider harmful to recognition. In YoloX, the Yolo Head is split into two branches that are implemented separately and only merged at the final prediction.

Therefore, what the entire YoloX network does is feature extraction, feature enhancement, and prediction of the objects corresponding to the feature points.

2. Network Structure Analysis

1. Introduction to the backbone network CSPDarknet


The backbone feature extraction network used by YoloX is CSPDarknet, which has five important characteristics:
1. The residual network is used. The residual convolution in CSPDarknet can be divided into two parts: the main branch is a 1x1 convolution followed by a 3x3 convolution, while the residual edge is left unprocessed, and the input is added directly to the output of the main branch.

def Bottleneck(x, out_channels, shortcut=True, name = ""):
    y = compose(
            DarknetConv2D_BN_SiLU(out_channels, (1,1), name = name + '.conv1'),
            DarknetConv2D_BN_SiLU(out_channels, (3,3), name = name + '.conv2'))(x)
    if shortcut:
        y = Add()([x, y])
    return y


Residual networks are easy to optimize and can gain accuracy from considerably increased depth. The residual blocks inside them use skip connections, which alleviate the vanishing gradient problem caused by increasing depth in deep neural networks.

2. The CSPnet structure is used. The CSPnet structure is not complicated: the stack of the original residual blocks is split into two parts. The main branch continues to stack the original residual blocks, while the other part, like a residual edge, is connected directly to the end after a small amount of processing. Therefore, it can be considered that there is a large residual edge in the CSP block.

def CSPLayer(x, num_filters, num_blocks, shortcut=True, expansion=0.5, name=""):
    hidden_channels = int(num_filters * expansion)  # hidden channels
    #----------------------------------------------------------------#
    #   The main branch loops num_blocks times, with the residual structure inside.
    #----------------------------------------------------------------#
    x_1 = DarknetConv2D_BN_SiLU(hidden_channels, (1,1), name = name + '.conv1')(x)
    #--------------------------------------------------------------------#
    #   A large residual edge x_2 is then created, bypassing the stacked residual structures
    #--------------------------------------------------------------------#
    x_2 = DarknetConv2D_BN_SiLU(hidden_channels, (1,1), name = name + '.conv2')(x)
    for i in range(num_blocks):
        x_1 = Bottleneck(x_1, hidden_channels, shortcut, name = name + '.m.' + str(i))
    #----------------------------------------------------------------#
    #   Concatenate the large residual edge back in
    #----------------------------------------------------------------#
    route = Concatenate()([x_1, x_2])

    #----------------------------------------------------------------#
    #   Finally, integrate the number of channels
    #----------------------------------------------------------------#
    return DarknetConv2D_BN_SiLU(num_filters, (1,1), name = name + '.conv3')(route)

3. The Focus network structure is used, an interesting structure that first appeared in YoloV5. The operation is to take a value from every other pixel in the image, which produces four independent feature layers; these four layers are then stacked, so the width and height information is concentrated into the channel dimension and the number of input channels is expanded four times: the stitched feature layer has twelve channels compared with the original three. The Focus class below implements this operation.

class Focus(Layer):
    def __init__(self):
        super(Focus, self).__init__()

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1] // 2 if input_shape[1] != None else input_shape[1], input_shape[2] // 2 if input_shape[2] != None else input_shape[2], input_shape[3] * 4)

    def call(self, x):
        return tf.concat(
            [x[...,  ::2,  ::2, :],
             x[..., 1::2,  ::2, :],
             x[...,  ::2, 1::2, :],
             x[..., 1::2, 1::2, :]],
             axis=-1
        )

4. The SiLU activation function is used. SiLU is an improvement that combines Sigmoid and ReLU: it is unbounded above, bounded below, smooth, and non-monotonic. SiLU outperforms ReLU in deep models and can be seen as a smooth ReLU activation function:
f(x) = x · sigmoid(x)

class SiLU(Layer):
    def __init__(self, **kwargs):
        super(SiLU, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        return inputs * K.sigmoid(inputs)

    def get_config(self):
        config = super(SiLU, self).get_config()
        return config

    def compute_output_shape(self, input_shape):
        return input_shape

5. The SPP structure is used: features are extracted by max pooling with different kernel sizes to enlarge the network's receptive field. In YoloV4, SPP was used in the FPN; in YoloX, the SPP module is used in the backbone feature extraction network.

def SPPBottleneck(x, out_channels, name = ""):
    #---------------------------------------------------#
    #   The SPP structure: max pooling results at different scales are stacked.
    #---------------------------------------------------#
    x = DarknetConv2D_BN_SiLU(out_channels // 2, (1,1), name = name + '.conv1')(x)
    maxpool1 = MaxPooling2D(pool_size=(5,5), strides=(1,1), padding='same')(x)
    maxpool2 = MaxPooling2D(pool_size=(9,9), strides=(1,1), padding='same')(x)
    maxpool3 = MaxPooling2D(pool_size=(13,13), strides=(1,1), padding='same')(x)
    x = Concatenate()([x, maxpool1, maxpool2, maxpool3])
    x = DarknetConv2D_BN_SiLU(out_channels, (1,1), name = name + '.conv2')(x)
    return x

The full backbone implementation code is:

from functools import wraps

import tensorflow as tf
from keras import backend as K
from keras.initializers import random_normal
from keras.layers import (Add, BatchNormalization, Concatenate, Conv2D, Layer,
                          MaxPooling2D, ZeroPadding2D)
from keras.regularizers import l2
from utils.utils import compose


class SiLU(Layer):
    def __init__(self, **kwargs):
        super(SiLU, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        return inputs * K.sigmoid(inputs)

    def get_config(self):
        config = super(SiLU, self).get_config()
        return config

    def compute_output_shape(self, input_shape):
        return input_shape

class Focus(Layer):
    def __init__(self):
        super(Focus, self).__init__()

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1] // 2 if input_shape[1] != None else input_shape[1], input_shape[2] // 2 if input_shape[2] != None else input_shape[2], input_shape[3] * 4)

    def call(self, x):
        return tf.concat(
            [x[...,  ::2,  ::2, :],
             x[..., 1::2,  ::2, :],
             x[...,  ::2, 1::2, :],
             x[..., 1::2, 1::2, :]],
             axis=-1
        )
#------------------------------------------------------#
#   A single DarknetConv2D convolution
#   If the stride is 2, the padding mode is set manually
#------------------------------------------------------#
@wraps(Conv2D)
def DarknetConv2D(*args, **kwargs):
    darknet_conv_kwargs = {'kernel_initializer' : random_normal(stddev=0.02)}
    darknet_conv_kwargs['padding'] = 'valid' if kwargs.get('strides')==(2,2) else 'same'
    darknet_conv_kwargs.update(kwargs)
    return Conv2D(*args, **darknet_conv_kwargs)

#---------------------------------------------------#
#   Convolution block: Convolution + Batch Normalization + Activation
#   DarknetConv2D + BatchNormalization + SiLU
#---------------------------------------------------#
def DarknetConv2D_BN_SiLU(*args, **kwargs):
    no_bias_kwargs = {'use_bias': False}
    no_bias_kwargs.update(kwargs)
    if "name" in kwargs.keys():
        no_bias_kwargs['name'] = kwargs['name'] + '.conv'
    return compose(
        DarknetConv2D(*args, **no_bias_kwargs),
        BatchNormalization(name = kwargs['name'] + '.bn'),
        SiLU())

def SPPBottleneck(x, out_channels, name = ""):
    #---------------------------------------------------#
    #   The SPP structure: max pooling results at different scales are stacked.
    #---------------------------------------------------#
    x = DarknetConv2D_BN_SiLU(out_channels // 2, (1,1), name = name + '.conv1')(x)
    maxpool1 = MaxPooling2D(pool_size=(5,5), strides=(1,1), padding='same')(x)
    maxpool2 = MaxPooling2D(pool_size=(9,9), strides=(1,1), padding='same')(x)
    maxpool3 = MaxPooling2D(pool_size=(13,13), strides=(1,1), padding='same')(x)
    x = Concatenate()([x, maxpool1, maxpool2, maxpool3])
    x = DarknetConv2D_BN_SiLU(out_channels, (1,1), name = name + '.conv2')(x)
    return x

def Bottleneck(x, out_channels, shortcut=True, name = ""):
    y = compose(
            DarknetConv2D_BN_SiLU(out_channels, (1,1), name = name + '.conv1'),
            DarknetConv2D_BN_SiLU(out_channels, (3,3), name = name + '.conv2'))(x)
    if shortcut:
        y = Add()([x, y])
    return y

def CSPLayer(x, num_filters, num_blocks, shortcut=True, expansion=0.5, name=""):
    hidden_channels = int(num_filters * expansion)  # hidden channels
    #----------------------------------------------------------------#
    #   The main branch loops num_blocks times, with the residual structure inside.
    #----------------------------------------------------------------#
    x_1 = DarknetConv2D_BN_SiLU(hidden_channels, (1,1), name = name + '.conv1')(x)
    #--------------------------------------------------------------------#
    #   A large residual edge x_2 is then created, bypassing the stacked residual structures
    #--------------------------------------------------------------------#
    x_2 = DarknetConv2D_BN_SiLU(hidden_channels, (1,1), name = name + '.conv2')(x)
    for i in range(num_blocks):
        x_1 = Bottleneck(x_1, hidden_channels, shortcut, name = name + '.m.' + str(i))
    #----------------------------------------------------------------#
    #   Concatenate the large residual edge back in
    #----------------------------------------------------------------#
    route = Concatenate()([x_1, x_2])

    #----------------------------------------------------------------#
    #   Finally, integrate the number of channels
    #----------------------------------------------------------------#
    return DarknetConv2D_BN_SiLU(num_filters, (1,1), name = name + '.conv3')(route)

def resblock_body(x, num_filters, num_blocks, shortcut=True, expansion=0.5, last = False, name = ""):
    #----------------------------------------------------------------#
    #   Compress height and width using ZeroPadding2D and a convolution block with stride 2
    #----------------------------------------------------------------#
    x = ZeroPadding2D(((1,1),(1,1)))(x)
    x = DarknetConv2D_BN_SiLU(num_filters, (3,3), strides=(2,2), name = name + '.0')(x)
    if last:
        x = SPPBottleneck(x, num_filters, name = name + '.1')
    return CSPLayer(x, num_filters, num_blocks, shortcut=shortcut, expansion=expansion, name = name + '.1' if not last else name + '.2')

#---------------------------------------------------#
#   The main part of CSPDarknet
#   Input: a 416x416x3 picture
#   Output: three effective feature layers
#---------------------------------------------------#
def darknet_body(x, dep_mul, wid_mul):
    base_channels   = int(wid_mul * 64)  # 64
    base_depth      = max(round(dep_mul * 3), 1)  # 3

    x = Focus()(x)
    x = DarknetConv2D_BN_SiLU(base_channels, (3,3), name = 'backbone.backbone.stem.conv')(x)

    x = resblock_body(x, base_channels * 2, base_depth, name = 'backbone.backbone.dark2')
    x = resblock_body(x, base_channels * 4, base_depth * 3, name = 'backbone.backbone.dark3')
    feat1 = x
    x = resblock_body(x, base_channels * 8, base_depth * 3, name = 'backbone.backbone.dark4')
    feat2 = x
    x = resblock_body(x, base_channels * 16, base_depth, last = True, name = 'backbone.backbone.dark5')
    feat3 = x
    return feat1,feat2,feat3

2. Build FPN feature pyramid to enhance feature extraction


In the feature utilization part, YoloX extracts three feature layers for object detection.
The three feature layers are located at different depths of the backbone CSPDarknet: the middle layer, the lower-middle layer, and the bottom layer. When the input is (640,640,3), the shapes of the three feature layers are feat1=(80,80,256), feat2=(40,40,512), and feat3=(20,20,1024).

After obtaining the three effective feature layers, we use them to construct the FPN layer as follows:

  1. The feat3=(20,20,1024) feature layer passes through a 1x1 convolution to adjust its channels, giving P5. P5 is upsampled 2x with UpSampling2D and combined with the feat2=(40,40,512) feature layer, and then CSPLayer is used to extract features, giving P5_upsample, whose shape is (40,40,512).
  2. The P5_upsample=(40,40,512) feature layer passes through a 1x1 convolution to adjust its channels, giving P4. P4 is upsampled 2x with UpSampling2D and combined with the feat1=(80,80,256) feature layer, and then CSPLayer is used to extract P3_out, whose shape is (80,80,256).
  3. The P3_out=(80,80,256) feature layer is downsampled by a 3x3 convolution with stride 2 and stacked with P4, and then CSPLayer is used to extract P4_out, whose shape is (40,40,512).
  4. The P4_out=(40,40,512) feature layer is downsampled by a 3x3 convolution with stride 2 and stacked with P5, and then CSPLayer is used to extract P5_out, whose shape is (20,20,1024).

The feature pyramid can fuse feature layers of different shapes to extract better features.

from keras.layers import (Concatenate, Input, Lambda, UpSampling2D,
                          ZeroPadding2D)
from keras.models import Model

from nets.CSPdarknet53 import (CSPLayer, DarknetConv2D, DarknetConv2D_BN_SiLU,
                               darknet_body)
from nets.yolo_training import get_yolo_loss


#---------------------------------------------------#
#   Construction of Panet Network and Prediction Results
#---------------------------------------------------#
def yolo_body(input_shape, num_classes, phi):
    depth_dict      = {'s' : 0.33, 'm' : 0.67, 'l' : 1.00, 'x' : 1.33,}
    width_dict      = {'s' : 0.50, 'm' : 0.75, 'l' : 1.00, 'x' : 1.25,}
    depth, width    = depth_dict[phi], width_dict[phi]
    in_channels     = [256, 512, 1024]
    
    inputs      = Input(input_shape)
    feat1, feat2, feat3 = darknet_body(inputs, depth, width)

    P5          = DarknetConv2D_BN_SiLU(int(in_channels[1] * width), (1, 1), name = 'backbone.lateral_conv0')(feat3)  
    P5_upsample = UpSampling2D()(P5)  # 512/16
    P5_upsample = Concatenate(axis = -1)([P5_upsample, feat2])  # 512->1024/16
    P5_upsample = CSPLayer(P5_upsample, int(in_channels[1] * width), round(3 * depth), shortcut = False, name = 'backbone.C3_p4')  # 1024->512/16

    P4          = DarknetConv2D_BN_SiLU(int(in_channels[0] * width), (1, 1), name = 'backbone.reduce_conv1')(P5_upsample)  # 512->256/16
    P4_upsample = UpSampling2D()(P4)  # 256/8
    P4_upsample = Concatenate(axis = -1)([P4_upsample, feat1])  # 256->512/8
    P3_out      = CSPLayer(P4_upsample, int(in_channels[0] * width), round(3 * depth), shortcut = False, name = 'backbone.C3_p3')  # 1024->512/16

    P3_downsample   = ZeroPadding2D(((1,1),(1,1)))(P3_out)
    P3_downsample   = DarknetConv2D_BN_SiLU(int(in_channels[0] * width), (3, 3), strides = (2, 2), name = 'backbone.bu_conv2')(P3_downsample)  # 256->256/16
    P3_downsample   = Concatenate(axis = -1)([P3_downsample, P4])  # 256->512/16
    P4_out          = CSPLayer(P3_downsample, int(in_channels[1] * width), round(3 * depth), shortcut = False, name = 'backbone.C3_n3')  # 1024->512/16

    P4_downsample   = ZeroPadding2D(((1,1),(1,1)))(P4_out)
    P4_downsample   = DarknetConv2D_BN_SiLU(int(in_channels[1] * width), (3, 3), strides = (2, 2), name = 'backbone.bu_conv1')(P4_downsample)  # 256->256/16
    P4_downsample   = Concatenate(axis = -1)([P4_downsample, P5])  # 512->1024/32
    P5_out          = CSPLayer(P4_downsample, int(in_channels[2] * width), round(3 * depth), shortcut = False, name = 'backbone.C3_n4')  # 1024->512/16

3. Using Yolo Head to obtain predictions


Using the FPN feature pyramid, we obtain three enhanced features with shapes (20,20,1024), (40,40,512), and (80,80,256). We then pass these three feature layers into Yolo Head to obtain the prediction results.

The Yolo Head in YoloX is different from that of previous versions. In previous versions of Yolo the detection head is coupled, that is, classification and regression are implemented together in a single 1x1 convolution, which the YoloX authors consider harmful to recognition. In YoloX, the Yolo Head is split into two branches that are implemented separately and only merged at the final prediction.
For each feature layer, we can obtain three predictions:
1. Reg(h,w,4) is used to determine the regression parameters of each feature point; the prediction box is obtained once the regression parameters are applied.
2. Obj(h,w,1) is used to determine whether each feature point contains an object.
3. Cls(h,w,num_classes) is used to determine the class of the object contained at each feature point.
The three predictions are stacked, and each feature layer yields Out(h,w,4+1+num_classes): the first four parameters determine the regression parameters of each feature point and give the prediction box once applied; the fifth parameter determines whether each feature point contains an object; and the last num_classes parameters determine the class of the object contained at each feature point.

The implementation code is as follows:

fpn_outs    = [P3_out, P4_out, P5_out]
yolo_outs   = []
for i, out in enumerate(fpn_outs):
    stem    = DarknetConv2D_BN_SiLU(int(256 * width), (1, 1), strides = (1, 1), name = 'head.stems.' + str(i))(out)
    
    cls_conv = DarknetConv2D_BN_SiLU(int(256 * width), (3, 3), strides = (1, 1), name = 'head.cls_convs.' + str(i) + '.0')(stem)
    cls_conv = DarknetConv2D_BN_SiLU(int(256 * width), (3, 3), strides = (1, 1), name = 'head.cls_convs.' + str(i) + '.1')(cls_conv)
    cls_pred = DarknetConv2D(num_classes, (1, 1), strides = (1, 1), name = 'head.cls_preds.' + str(i))(cls_conv)

    reg_conv = DarknetConv2D_BN_SiLU(int(256 * width), (3, 3), strides = (1, 1), name = 'head.reg_convs.' + str(i) + '.0')(stem)
    reg_conv = DarknetConv2D_BN_SiLU(int(256 * width), (3, 3), strides = (1, 1), name = 'head.reg_convs.' + str(i) + '.1')(reg_conv)
    reg_pred = DarknetConv2D(4, (1, 1), strides = (1, 1), name = 'head.reg_preds.' + str(i))(reg_conv)
    obj_pred = DarknetConv2D(1, (1, 1), strides = (1, 1), name = 'head.obj_preds.' + str(i))(reg_conv)
    output   = Concatenate(axis = -1)([reg_pred, obj_pred, cls_pred])
    yolo_outs.append(output)
return Model(inputs, yolo_outs)
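
As a quick sanity check, the assembled model can be built and inspected as below. This is only a usage sketch: the 640x640 input, num_classes=80, and phi='s' are example choices, and the import path assumes the yolo_body above lives in nets/yolo.py (check the actual repository layout).

# Usage sketch; the module path is an assumption, adjust it to the repository layout.
from nets.yolo import yolo_body

model = yolo_body((640, 640, 3), num_classes=80, phi='s')
model.summary()
# Three outputs are returned, one per feature layer, with shapes roughly
# (None, 80, 80, 4 + 1 + num_classes), (None, 40, 40, ...) and (None, 20, 20, ...).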

3. Decoding of Prediction Results

1. Get Prediction Boxes and Scores

Before decoding the predictions, let's look at what they represent. In the previous step, we obtained three predictions for each feature layer.

This article takes the three prediction results corresponding to the (20,20,1024) feature layer as an example:

1. The Reg prediction, where the convolution has 4 channels, so the result is (20,20,4). The four values can be split into two parts: the first two are the offsets of the prediction box center from the feature point, and the last two are exponentiated to give the width and height of the prediction box.
2. The Obj prediction, where the convolution has 1 channel, so the result is (20,20,1), which represents the probability that the prediction box of each feature point contains an object.
3. The Cls prediction, where the convolution has num_classes channels, so the result is (20,20,num_classes), which represents the probability that each feature point corresponds to each class of object; the num_classes values in the last dimension are the per-class probabilities.

This feature layer is equivalent to dividing the image into 20x20 feature points; a feature point is responsible for predicting an object if that object falls within its corresponding area.

As shown in the figure, the blue points are the 20x20 feature points. Here we demonstrate the decoding operation for the three red points in the left image:
1. Calculate the predicted center point: offset the coordinates of the feature point by the first two values of the Reg prediction; the three red feature points in the left picture are offset to the three green points in the right picture.
2. Calculate the width and height of the prediction box by exponentiating the last two values of the Reg prediction.
3. The prediction box obtained in this way can then be drawn on the picture. (A small numeric example follows.)
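
As a concrete numeric sketch of this decoding (the grid position and Reg values below are made up for illustration), take a feature point at grid position (7, 5) on the 20x20 layer of a 640x640 input, so its stride is 32, with a Reg prediction of (0.2, -0.3, 0.5, 0.1):

import math

stride = 640 / 20                      # 32 pixels per grid cell on the 20x20 layer
grid_x, grid_y = 7, 5                  # hypothetical feature point
tx, ty, tw, th = 0.2, -0.3, 0.5, 0.1   # hypothetical Reg prediction

# 1. Offset the feature point by the first two values to get the box center.
center_x = (grid_x + tx) * stride      # (7 + 0.2) * 32 = 230.4
center_y = (grid_y + ty) * stride      # (5 - 0.3) * 32 = 150.4

# 2. Exponentiate the last two values to get the box width and height.
box_w = math.exp(tw) * stride          # e^0.5 * 32 ≈ 52.8
box_h = math.exp(th) * stride          # e^0.1 * 32 ≈ 35.4

print(center_x, center_y, box_w, box_h)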

Besides this decoding, non-maximum suppression is also needed to prevent boxes of the same class from piling up.

#---------------------------------------------------#
#   Picture Prediction
#---------------------------------------------------#
def DecodeBox(outputs,
            num_classes,
            image_shape,
            input_shape,
            max_boxes       = 100,
            confidence      = 0.5,
            nms_iou         = 0.3,
            letterbox_image = True):
            
    bs      = K.shape(outputs[0])[0]

    grids   = []
    strides = []
    hw      = [K.shape(x)[1:3] for x in outputs]
    outputs = tf.concat([tf.reshape(x, [bs, -1, 5 + num_classes]) for x in outputs], axis = 1)
    for i in range(len(hw)):
        #---------------------------#
        #   Generating grid points from feature layers
        #---------------------------#
        grid_x, grid_y  = tf.meshgrid(K.arange(hw[i][1]), K.arange(hw[i][0]))
        grid            = tf.reshape(tf.stack((grid_x, grid_y), 2), (1, -1, 2))
        shape           = tf.shape(grid)[:2]

        grids.append(tf.cast(grid, K.dtype(outputs)))
        strides.append(tf.ones((shape[0], shape[1], 1)) * input_shape[0] / tf.cast(hw[i][0], K.dtype(outputs)))
    #---------------------------#
    #   Stack grid points together
    #---------------------------#
    grids               = tf.concat(grids, axis=1)
    strides             = tf.concat(strides, axis=1)
    #------------------------#
    #   Decode from grid points
    #------------------------#
    box_xy = (outputs[..., :2] + grids) * strides / K.cast(input_shape[::-1], K.dtype(outputs))
    box_wh = tf.exp(outputs[..., 2:4]) * strides / K.cast(input_shape[::-1], K.dtype(outputs))

    box_confidence  = K.sigmoid(outputs[..., 4:5])
    box_class_probs = K.sigmoid(outputs[..., 5: ])
    #------------------------------------------------------------------------------------------------------------#
    #   letterbox_image adds gray bars around the image before it is fed into the network, so the resulting box_xy and box_wh are relative to the padded image
    #   We need to remove the gray bars and convert box_xy and box_wh into y_min, x_min, y_max, x_max
    #   If letterbox_image is not used, the normalized box_xy and box_wh still need to be rescaled to the original image size
    #------------------------------------------------------------------------------------------------------------#
    boxes       = yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)

2. Score Filtering and Non-Maximum Suppression

After the final prediction results are obtained, score sorting and non-maximum suppression are also performed.

Score filtering keeps the prediction boxes whose scores exceed the confidence threshold.
Non-maximum suppression keeps only the highest-scoring box of each class in a given area.

The process of score filtering and non-maximum suppression can be summarized as follows:
1. Find the boxes in the picture whose scores are greater than the threshold. Filtering by score before the overlap filtering significantly reduces the number of boxes.
2. Loop over the classes. The purpose of non-maximum suppression is to keep only the highest-scoring box of the same class in a given area, and looping over the classes lets us apply it to each class separately.
3. Sort the boxes of each class by score from high to low.
4. Each time, take out the box with the highest score and compute its overlap with all the other prediction boxes; boxes that overlap too much are eliminated.

The results of score filtering and non-maximum suppression can then be used to draw the prediction boxes.

The figure below shows the result without non-maximum suppression.

The figure below shows the result after non-maximum suppression.

The implementation code is:

box_scores  = box_confidence * box_class_probs

#-----------------------------------------------------------#
#   Determine if the score is greater than score_threshold
#-----------------------------------------------------------#
mask             = box_scores >= confidence
max_boxes_tensor = K.constant(max_boxes, dtype='int32')
boxes_out   = []
scores_out  = []
classes_out = []
for c in range(num_classes):
    #-----------------------------------------------------------#
    #   Take out all boxes and scores with box_scores >= score_threshold
    #-----------------------------------------------------------#
    class_boxes      = tf.boolean_mask(boxes, mask[:, c])
    class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])

    #-----------------------------------------------------------#
    #   Nonmaximal suppression
    #   Keep the box with the highest score in a certain area
    #-----------------------------------------------------------#
    nms_index = tf.image.non_max_suppression(class_boxes, class_box_scores, max_boxes_tensor, iou_threshold=nms_iou)

    #-----------------------------------------------------------#
    #   Obtain results after non-maximal suppression
    #   The following three are: the position of the box, the score and the type
    #-----------------------------------------------------------#
    class_boxes         = K.gather(class_boxes, nms_index)
    class_box_scores    = K.gather(class_box_scores, nms_index)
    classes             = K.ones_like(class_box_scores, 'int32') * c

    boxes_out.append(class_boxes)
    scores_out.append(class_box_scores)
    classes_out.append(classes)
boxes_out      = K.concatenate(boxes_out, axis=0)
scores_out     = K.concatenate(scores_out, axis=0)
classes_out    = K.concatenate(classes_out, axis=0)

4. Training Section

1. What is needed to calculate loss

Calculating the loss is actually a comparison between the network's predictions and the ground truth.
Like the network's predictions, the network's loss consists of three parts: the Reg part, the Obj part, and the Cls part. The Reg part is the judgment of the feature points' regression parameters, the Obj part is the judgment of whether a feature point contains an object, and the Cls part is the judgment of the class of object that a feature point contains.

2. Necessary conditions for positive sample feature points

In YoloX, the feature points inside which an object's ground-truth box falls are responsible for predicting that object.

For each ground-truth box, we find all feature points by their spatial locations. To be a positive sample, a feature point needs to satisfy the following conditions:
1. The feature point falls inside the ground-truth box of the object.
2. The feature point lies within a certain radius of the center of the object.

Conditions 1 and 2 ensure that the feature points selected as positive samples fall inside the ground-truth box of the object and are close to its center.

In YoloX, the SimOTA method is used for dynamic positive sample allocation.

def get_in_boxes_info(gt_bboxes_per_image, x_shifts, y_shifts, expanded_strides, num_gt, total_num_anchors, center_radius = 2.5):
    #-------------------------------------------------------#
    #   expanded_strides_per_image  [n_anchors_all]
    #   x_centers_per_image         [num_gt, n_anchors_all]
    #   y_centers_per_image         [num_gt, n_anchors_all]
    #-------------------------------------------------------#
    expanded_strides_per_image  = expanded_strides[0]
    x_centers_per_image         = tf.tile(tf.expand_dims(((x_shifts[0] + 0.5) * expanded_strides_per_image), 0), [num_gt, 1])
    y_centers_per_image         = tf.tile(tf.expand_dims(((y_shifts[0] + 0.5) * expanded_strides_per_image), 0), [num_gt, 1])

    #-------------------------------------------------------#
    #   gt_bboxes_per_image_x       [num_gt, n_anchors_all]
    #-------------------------------------------------------#
    gt_bboxes_per_image_l = tf.tile(tf.expand_dims((gt_bboxes_per_image[:, 0] - 0.5 * gt_bboxes_per_image[:, 2]), 1), [1, total_num_anchors])
    gt_bboxes_per_image_r = tf.tile(tf.expand_dims((gt_bboxes_per_image[:, 0] + 0.5 * gt_bboxes_per_image[:, 2]), 1), [1, total_num_anchors])
    gt_bboxes_per_image_t = tf.tile(tf.expand_dims((gt_bboxes_per_image[:, 1] - 0.5 * gt_bboxes_per_image[:, 3]), 1), [1, total_num_anchors])
    gt_bboxes_per_image_b = tf.tile(tf.expand_dims((gt_bboxes_per_image[:, 1] + 0.5 * gt_bboxes_per_image[:, 3]), 1), [1, total_num_anchors])

    #-------------------------------------------------------#
    #   bbox_deltas     [num_gt, n_anchors_all, 4]
    #-------------------------------------------------------#
    b_l = x_centers_per_image - gt_bboxes_per_image_l
    b_r = gt_bboxes_per_image_r - x_centers_per_image
    b_t = y_centers_per_image - gt_bboxes_per_image_t
    b_b = gt_bboxes_per_image_b - y_centers_per_image
    bbox_deltas = tf.stack([b_l, b_t, b_r, b_b], 2)

    #-------------------------------------------------------#
    #   is_in_boxes     [num_gt, n_anchors_all]
    #   is_in_boxes_all [n_anchors_all]
    #-------------------------------------------------------#
    is_in_boxes     = tf.reduce_min(bbox_deltas, axis = -1) > 0.0
    is_in_boxes_all = tf.reduce_sum(tf.cast(is_in_boxes, K.dtype(gt_bboxes_per_image)), axis = 0) > 0.0

    gt_bboxes_per_image_l = tf.tile(tf.expand_dims(gt_bboxes_per_image[:, 0], 1), [1, total_num_anchors]) - center_radius * tf.expand_dims(expanded_strides_per_image, 0)
    gt_bboxes_per_image_r = tf.tile(tf.expand_dims(gt_bboxes_per_image[:, 0], 1), [1, total_num_anchors]) + center_radius * tf.expand_dims(expanded_strides_per_image, 0)
    gt_bboxes_per_image_t = tf.tile(tf.expand_dims(gt_bboxes_per_image[:, 1], 1), [1, total_num_anchors]) - center_radius * tf.expand_dims(expanded_strides_per_image, 0)
    gt_bboxes_per_image_b = tf.tile(tf.expand_dims(gt_bboxes_per_image[:, 1], 1), [1, total_num_anchors]) + center_radius * tf.expand_dims(expanded_strides_per_image, 0)

    #-------------------------------------------------------#
    #   center_deltas   [num_gt, n_anchors_all, 4]
    #-------------------------------------------------------#
    c_l = x_centers_per_image - gt_bboxes_per_image_l
    c_r = gt_bboxes_per_image_r - x_centers_per_image
    c_t = y_centers_per_image - gt_bboxes_per_image_t
    c_b = gt_bboxes_per_image_b - y_centers_per_image
    center_deltas       = tf.stack([c_l, c_t, c_r, c_b], 2)

    #-------------------------------------------------------#
    #   is_in_centers       [num_gt, n_anchors_all]
    #   is_in_centers_all   [n_anchors_all]
    #-------------------------------------------------------#
    is_in_centers       = tf.reduce_min(center_deltas, axis = -1) > 0.0
    is_in_centers_all   = tf.reduce_sum(tf.cast(is_in_centers, K.dtype(gt_bboxes_per_image)), axis = 0) > 0.0

    #-------------------------------------------------------#
    #   fg_mask                 [n_anchors_all]
    #   is_in_boxes_and_center  [num_gt, fg_mask]
    #-------------------------------------------------------#
    fg_mask = tf.cast(is_in_boxes_all | is_in_centers_all, tf.bool)
    
    is_in_boxes_and_center  = tf.boolean_mask(is_in_boxes, fg_mask, axis = 1) & tf.boolean_mask(is_in_centers, fg_mask, axis = 1)
    return fg_mask, is_in_boxes_and_center

3. SimOTA Dynamic Matching Positive Samples

In YoloX, we compute a Cost matrix, which represents the cost relationship between each ground-truth box and each feature point. The Cost matrix consists of three parts:
1. the overlap between each ground-truth box and the prediction box of the current feature point;
2. the class prediction accuracy of each ground-truth box against the prediction of the current feature point;
3. whether the center of each ground-truth box falls within a certain radius of the feature point.

The higher the overlap between a ground-truth box and the prediction box of the current feature point, the more that feature point has already tried to fit this ground-truth box, so its Cost is lower.

The higher the class prediction accuracy between a ground-truth box and the current feature point, the more that feature point has already tried to fit this ground-truth box, so its Cost is lower.

If the center of a ground-truth box falls within a certain radius of a feature point, that feature point ought to fit this ground-truth box, so its Cost is lower.

The purpose of the Cost matrix is to adaptively find the ground-truth box that each feature point should fit: the higher the overlap, the more it should fit; the more accurate the classification, the more it should fit; and within a certain radius of the center, the more it should fit.

In SimOTA, different targets are assigned different numbers of positive samples (dynamic k). Take the ants and watermelons in Megvii's official explanation as an example: a traditional positive sample assignment scheme often assigns the same number of positive samples to a watermelon and an ant in the same scene, so either the ant gets many low-quality positive samples or the watermelon gets only one or two positive samples. Neither assignment is appropriate.
The key to dynamic positive sample assignment is how to determine k. SimOTA first selects the 10 feature points with the lowest Cost for each target, then sums the IoUs between the prediction boxes of these ten feature points and the ground-truth box to obtain the final k.
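
The following is a minimal sketch of the dynamic-k idea with made-up IoU values (it mirrors the dynamic_ks computation in dynamic_k_matching shown later, not the exact library code):

import numpy as np

# Hypothetical IoUs between one ground-truth box and its candidate feature points.
ious = np.array([0.72, 0.65, 0.61, 0.55, 0.40, 0.33, 0.21, 0.15, 0.10, 0.08, 0.05])

# Take the 10 highest IoUs and sum them; the (clamped) sum becomes k for this box.
topk = np.sort(ious)[::-1][:10]
dynamic_k = max(int(topk.sum()), 1)    # the sum is about 3.8, so k = 3
print(dynamic_k)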

Therefore, the SimOTA process can be summarized as follows:
1. Compute the overlap between each ground-truth box and the prediction box of each feature point.
2. Compute the class prediction accuracy of each ground-truth box against each feature point.
3. Determine whether the center of each ground-truth box falls within a certain radius of the feature point.
4. Compute the Cost matrix.
5. Select the 10 feature points with the lowest Cost for each ground-truth box, then sum the IoUs between the prediction boxes of these ten feature points and the ground-truth box to obtain the final k of each ground-truth box.

def get_assignments(gt_bboxes_per_image, gt_classes, bboxes_preds_per_image, obj_preds_per_image, cls_preds_per_image, x_shifts, y_shifts, expanded_strides, num_classes, num_gt, total_num_anchors):
    #-------------------------------------------------------#
    #   fg_mask                 [n_anchors_all]
    #   is_in_boxes_and_center  [num_gt, len(fg_mask)]
    #-------------------------------------------------------#
    fg_mask, is_in_boxes_and_center = get_in_boxes_info(gt_bboxes_per_image, x_shifts, y_shifts, expanded_strides, num_gt, total_num_anchors)
    
    #-------------------------------------------------------#
    #   fg_mask                 [n_anchors_all]
    #   bboxes_preds_per_image  [fg_mask, 4]
    #   cls_preds_              [fg_mask, num_classes]
    #   obj_preds_              [fg_mask, 1]
    #-------------------------------------------------------#
    bboxes_preds_per_image  = tf.boolean_mask(bboxes_preds_per_image, fg_mask, axis = 0)
    obj_preds_              = tf.boolean_mask(obj_preds_per_image, fg_mask, axis = 0)
    cls_preds_              = tf.boolean_mask(cls_preds_per_image, fg_mask, axis = 0)
    num_in_boxes_anchor     = tf.shape(bboxes_preds_per_image)[0]

    #-------------------------------------------------------#
    #   pair_wise_ious      [num_gt, fg_mask]
    #-------------------------------------------------------#
    pair_wise_ious      = bboxes_iou(gt_bboxes_per_image, bboxes_preds_per_image)
    pair_wise_ious_loss = -tf.log(pair_wise_ious + 1e-8)
    #-------------------------------------------------------#
    #   cls_preds_          [num_gt, fg_mask, num_classes]
    #   gt_cls_per_image    [num_gt, fg_mask, num_classes]
    #-------------------------------------------------------#
    gt_cls_per_image    = tf.tile(tf.expand_dims(tf.one_hot(tf.cast(gt_classes, tf.int32), num_classes), 1), (1, num_in_boxes_anchor, 1))
    cls_preds_          = K.sigmoid(tf.tile(tf.expand_dims(cls_preds_, 0), (num_gt, 1, 1))) *\
                          K.sigmoid(tf.tile(tf.expand_dims(obj_preds_, 0), (num_gt, 1, 1)))

    pair_wise_cls_loss  = tf.reduce_sum(K.binary_crossentropy(gt_cls_per_image, tf.sqrt(cls_preds_)), -1)

    cost = pair_wise_cls_loss + 3.0 * pair_wise_ious_loss + 100000.0 * tf.cast((~is_in_boxes_and_center), K.dtype(bboxes_preds_per_image))

    gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg = dynamic_k_matching(cost, pair_wise_ious, fg_mask, gt_classes, num_gt)
    return gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg

def bboxes_iou(b1, b2):
    #---------------------------------------------------#
    #   num_anchor,1,4
    #   Calculate the coordinates of the upper left corner and the lower right corner
    #---------------------------------------------------#
    b1              = K.expand_dims(b1, -2)
    b1_xy           = b1[..., :2]
    b1_wh           = b1[..., 2:4]
    b1_wh_half      = b1_wh/2.
    b1_mins         = b1_xy - b1_wh_half
    b1_maxes        = b1_xy + b1_wh_half

    #---------------------------------------------------#
    #   1,n,4
    #   Calculate coordinates for upper left and lower right corners
    #---------------------------------------------------#
    b2              = K.expand_dims(b2, 0)
    b2_xy           = b2[..., :2]
    b2_wh           = b2[..., 2:4]
    b2_wh_half      = b2_wh/2.
    b2_mins         = b2_xy - b2_wh_half
    b2_maxes        = b2_xy + b2_wh_half

    #---------------------------------------------------#
    #   Calculate coincident area
    #---------------------------------------------------#
    intersect_mins  = K.maximum(b1_mins, b2_mins)
    intersect_maxes = K.minimum(b1_maxes, b2_maxes)
    intersect_wh    = K.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_area  = intersect_wh[..., 0] * intersect_wh[..., 1]
    b1_area         = b1_wh[..., 0] * b1_wh[..., 1]
    b2_area         = b2_wh[..., 0] * b2_wh[..., 1]
    iou             = intersect_area / (b1_area + b2_area - intersect_area)
    return iou

def dynamic_k_matching(cost, pair_wise_ious, fg_mask, gt_classes, num_gt):
    #-------------------------------------------------------#
    #   cost                [num_gt, fg_mask]
    #   pair_wise_ious      [num_gt, fg_mask]
    #   gt_classes          [num_gt]        
    #   fg_mask             [n_anchors_all]
    #   matching_matrix     [num_gt, fg_mask]
    #-------------------------------------------------------#
    matching_matrix         = tf.zeros_like(cost)

    #------------------------------------------------------------#
    #   Select the n_candidate_k points with the highest IoU for each ground-truth box
    #   Then sum these IoUs to determine how many feature points this box should use for prediction
    #   topk_ious           [num_gt, n_candidate_k]
    #   dynamic_ks          [num_gt]
    #   matching_matrix     [num_gt, fg_mask]
    #------------------------------------------------------------#
    n_candidate_k           = tf.minimum(10, tf.shape(pair_wise_ious)[1])
    topk_ious, _            = tf.nn.top_k(pair_wise_ious, n_candidate_k)
    dynamic_ks              = tf.maximum(tf.reduce_sum(topk_ious, 1), 1)
    # dynamic_ks              = tf.Print(dynamic_ks, [topk_ious, dynamic_ks], summarize = 100)
    
    def loop_body_1(b, matching_matrix):
        #------------------------------------------------------------#
        #   Select the dynamic_k feature points with the lowest Cost for each ground-truth box
        #------------------------------------------------------------#
        _, pos_idx = tf.nn.top_k(-cost[b], k=tf.cast(dynamic_ks[b], tf.int32))
        matching_matrix = tf.concat(
            [matching_matrix[:b], tf.expand_dims(tf.reduce_max(tf.one_hot(pos_idx, tf.shape(cost)[1]), 0), 0), matching_matrix[b+1:]], axis = 0
        )
        # matching_matrix = matching_matrix.write(b, K.cast(tf.reduce_max(tf.one_hot(pos_idx, tf.shape(cost)[1]), 0), K.dtype(cost)))
        return b + 1, matching_matrix
    #-----------------------------------------------------------#
    #   The loop here iterates over each ground-truth box
    #-----------------------------------------------------------#
    _, matching_matrix = K.control_flow_ops.while_loop(lambda b,*args: b < tf.cast(num_gt, tf.int32), loop_body_1, [0, matching_matrix])

    #------------------------------------------------------------#
    #   anchor_matching_gt  [fg_mask]
    #------------------------------------------------------------#
    anchor_matching_gt = tf.reduce_sum(matching_matrix, 0)
    #------------------------------------------------------------#
    #   When a feature point is matched to multiple ground-truth boxes,
    #   keep the ground-truth box with the lowest Cost.
    #------------------------------------------------------------#
    biger_one_indice = tf.reshape(tf.where(anchor_matching_gt > 1), [-1])
    def loop_body_2(b, matching_matrix):
        indice_anchor   = tf.cast(biger_one_indice[b], tf.int32)
        indice_gt       = tf.math.argmin(cost[:, indice_anchor])
        matching_matrix = tf.concat(
            [
                matching_matrix[:, :indice_anchor], 
                tf.expand_dims(tf.one_hot(indice_gt, tf.cast(num_gt, tf.int32)), 1), 
                matching_matrix[:, indice_anchor+1:]
            ], axis = -1
        )
        return b + 1, matching_matrix
    #-----------------------------------------------------------#
    #   The loop here iterates over each feature point matched to more than one ground-truth box
    #-----------------------------------------------------------#
    _, matching_matrix = K.control_flow_ops.while_loop(lambda b,*args: b < tf.cast(tf.shape(biger_one_indice)[0], tf.int32), loop_body_2, [0, matching_matrix])

    #------------------------------------------------------------#
    #   fg_mask_inboxes  [fg_mask]
    #   num_fg is the number of feature points of positive samples
    #------------------------------------------------------------#
    fg_mask_inboxes = tf.reduce_sum(matching_matrix, 0) > 0.0
    num_fg          = tf.reduce_sum(tf.cast(fg_mask_inboxes, K.dtype(cost)))

    fg_mask_indices         = tf.reshape(tf.where(fg_mask), [-1])
    fg_mask_inboxes_indices = tf.reshape(tf.where(fg_mask_inboxes), [-1, 1])
    fg_mask_select_indices  = tf.gather_nd(fg_mask_indices, fg_mask_inboxes_indices)
    fg_mask                 = tf.cast(tf.reduce_max(tf.one_hot(fg_mask_select_indices, tf.shape(fg_mask)[0]), 0), K.dtype(fg_mask))

    #------------------------------------------------------------#
    #   Obtain item types corresponding to feature points
    #------------------------------------------------------------#
    matched_gt_inds     = tf.math.argmax(tf.boolean_mask(matching_matrix, fg_mask_inboxes, axis = 1), 0)
    gt_matched_classes  = tf.gather_nd(gt_classes, tf.reshape(matched_gt_inds, [-1, 1]))

    pred_ious_this_matching = tf.boolean_mask(tf.reduce_sum(matching_matrix * pair_wise_ious, 0), fg_mask_inboxes)
    return gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg

4. Calculating Loss

As can be seen from the first part, YoloX's loss consists of three parts:
1. The Reg part. From the third part we know which feature points correspond to each ground-truth box; we take the prediction boxes of those feature points and compute the IoU loss between the ground-truth boxes and the prediction boxes, which makes up the Reg part of the loss.
2. The Obj part. From the third part we know which feature points correspond to each ground-truth box. All feature points matched to ground-truth boxes are positive samples, and the remaining feature points are negative samples. The cross-entropy loss is computed from the positive/negative labels and each feature point's prediction of whether it contains an object, which makes up the Obj part of the loss.
3. The Cls part. From the third part we know which feature points correspond to each ground-truth box; we take the class predictions of those feature points and compute the cross-entropy loss between the ground-truth class and the predicted class distribution, which makes up the Cls part of the loss.

def get_yolo_loss(input_shape, num_layers, num_classes):
    def yolo_loss(args):
        labels, y_pred = args[-1], args[:-1]
        x_shifts            = []
        y_shifts            = []
        expanded_strides    = []
        outputs             = []
        #-----------------------------------------------#
        # inputs    [[batch_size, 20, 20, num_classes + 5]
        #            [batch_size, 40, 40, num_classes + 5]
        #            [batch_size, 80, 80, num_classes + 5]]
        # outputs   [[batch_size, 400, num_classes + 5]
        #            [batch_size, 1600, num_classes + 5]
        #            [batch_size, 6400, num_classes + 5]]
        #-----------------------------------------------#
        for i in range(num_layers):
            output          = y_pred[i]
            grid_shape      = tf.shape(output)[1:3]
            stride          = input_shape[0] / tf.cast(grid_shape[0], K.dtype(output))

            grid_x, grid_y  = tf.meshgrid(K.arange(grid_shape[1]), K.arange(grid_shape[0]))
            grid            = tf.cast(tf.reshape(tf.stack((grid_x, grid_y), 2), (1, -1, 2)), K.dtype(output))
            
            output          = tf.reshape(output, [tf.shape(y_pred[i])[0], grid_shape[0] * grid_shape[1], -1])
            output_xy       = (output[..., :2] + grid) * stride
            output_wh       = tf.exp(output[..., 2:4]) * stride
            output          = tf.concat([output_xy, output_wh, output[..., 4:]], -1)

            x_shifts.append(grid[..., 0])
            y_shifts.append(grid[..., 1])
            expanded_strides.append(tf.ones_like(grid[..., 0]) * stride)
            outputs.append(output)
        #-----------------------------------------------#
        #   x_shifts            [1, n_anchors_all]
        #   y_shifts            [1, n_anchors_all]
        #   expanded_strides    [1, n_anchors_all]
        #-----------------------------------------------#
        x_shifts            = tf.concat(x_shifts, 1)
        y_shifts            = tf.concat(y_shifts, 1)
        expanded_strides    = tf.concat(expanded_strides, 1)
        outputs             = tf.concat(outputs, 1)
        return get_losses(x_shifts, y_shifts, expanded_strides, outputs, labels, num_classes)
    return yolo_loss

def get_losses(x_shifts, y_shifts, expanded_strides, outputs, labels, num_classes):
    #-----------------------------------------------#
    #   [batch, n_anchors_all, 4]
    #   [batch, n_anchors_all, 1]
    #   [batch, n_anchors_all, n_cls]
    #-----------------------------------------------#
    bbox_preds  = outputs[:, :, :4]  
    obj_preds   = outputs[:, :, 4:5]
    cls_preds   = outputs[:, :, 5:]  
    
    #------------------------------------------------------------#
    #   labels                      [batch, max_boxes, 5]
    #   tf.reduce_sum(labels, -1)   [batch, max_boxes]
    #   nlabel                      [batch]
    #------------------------------------------------------------#
    nlabel = tf.reduce_sum(tf.cast(tf.reduce_sum(labels, -1) > 0, K.dtype(outputs)), -1)
    total_num_anchors = tf.shape(outputs)[1]

    num_fg      = 0.0
    loss_obj    = 0.0
    loss_cls    = 0.0
    loss_iou    = 0.0
    def loop_body(b, num_fg, loss_iou, loss_obj, loss_cls):
        num_gt  = tf.cast(nlabel[b], tf.int32)
        #-----------------------------------------------#
        #   gt_bboxes_per_image     [num_gt, 4]
        #   gt_classes              [num_gt]
        #   bboxes_preds_per_image  [n_anchors_all, 4]
        #   obj_preds_per_image     [n_anchors_all, 1]
        #   cls_preds_per_image     [n_anchors_all, num_classes]
        #-----------------------------------------------#
        gt_bboxes_per_image     = labels[b][:num_gt, :4]
        gt_classes              = labels[b][:num_gt,  4]
        bboxes_preds_per_image  = bbox_preds[b]
        obj_preds_per_image     = obj_preds[b]
        cls_preds_per_image     = cls_preds[b]

        def f1():
            num_fg_img  = tf.cast(tf.constant(0), K.dtype(outputs))
            cls_target  = tf.cast(tf.zeros((0, num_classes)), K.dtype(outputs))
            reg_target  = tf.cast(tf.zeros((0, 4)), K.dtype(outputs))
            obj_target  = tf.cast(tf.zeros((total_num_anchors, 1)), K.dtype(outputs))
            fg_mask     = tf.cast(tf.zeros(total_num_anchors), tf.bool)
            return num_fg_img, cls_target, reg_target, obj_target, fg_mask
        def f2():
            gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg_img = get_assignments( 
                gt_bboxes_per_image, gt_classes, bboxes_preds_per_image, obj_preds_per_image, cls_preds_per_image,
                x_shifts, y_shifts, expanded_strides, num_classes, num_gt, total_num_anchors, 
            )
            reg_target  = tf.cast(tf.gather_nd(gt_bboxes_per_image, tf.reshape(matched_gt_inds, [-1, 1])), K.dtype(outputs))
            cls_target  = tf.cast(tf.one_hot(tf.cast(gt_matched_classes, tf.int32), num_classes) * tf.expand_dims(pred_ious_this_matching, -1), K.dtype(outputs))
            obj_target  = tf.cast(tf.expand_dims(fg_mask, -1), K.dtype(outputs))
            return num_fg_img, cls_target, reg_target, obj_target, fg_mask

        num_fg_img, cls_target, reg_target, obj_target, fg_mask = tf.cond(tf.equal(num_gt, 0), f1, f2)
        num_fg      += num_fg_img
        loss_iou    += K.sum(1 - box_ciou(reg_target, tf.boolean_mask(bboxes_preds_per_image, fg_mask)))
        loss_obj    += K.sum(K.binary_crossentropy(obj_target, obj_preds_per_image, from_logits=True))
        loss_cls    += K.sum(K.binary_crossentropy(cls_target, tf.boolean_mask(cls_preds_per_image, fg_mask), from_logits=True))
        return b + 1, num_fg, loss_iou, loss_obj, loss_cls
    #-----------------------------------------------------------#
    #   A loop in this place is for each picture
    #-----------------------------------------------------------#
    _, num_fg, loss_iou, loss_obj, loss_cls = K.control_flow_ops.while_loop(lambda b,*args: b < tf.cast(tf.shape(outputs)[0], tf.int32), loop_body, [0, num_fg, loss_iou, loss_obj, loss_cls])
    
    num_fg      = tf.cast(tf.maximum(num_fg, 1), K.dtype(outputs))
    reg_weight  = 5.0
    loss        = reg_weight * loss_iou + loss_obj + loss_cls
    return loss / num_fg

Training your own YoloX model

First download the repository from GitHub. After downloading, unzip it and open the folder with your editor or IDE.
Note that you must open the correct root directory, that is, the directory where the files are stored; otherwise the relative paths will be wrong and the code will not run.

1. Preparation of datasets

This article uses the VOC format for training. You need to prepare your own dataset before training. If you don't have one, you can download the VOC12+07 dataset via the GitHub link.
Before training, place the label files in the Annotation folder under VOC2007 inside the VOCdevkit folder.

Before training, place the image files in the JPEGImages folder under VOC2007 inside the VOCdevkit folder.

The placement of the dataset is now complete.
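
Based on the folder names described above, the expected layout is roughly as follows (a sketch, not an exhaustive listing of the repository):

VOCdevkit/
└── VOC2007/
    ├── Annotation/     (VOC-format xml label files)
    └── JPEGImages/     (jpg image files)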

2. Processing of datasets

After placing the dataset, we need to use voc_annotation.py in the root directory to process it and generate the 2007_train.txt and 2007_val.txt needed for training.

There are some parameters in voc_annotation.py that need to be set:
annotation_mode, classes_path, trainval_percent, train_percent, and VOCdevkit_path. For the first training run, only classes_path needs to be modified.

'''
annotation_mode is used to specify what this file computes when it runs
annotation_mode = 0 means the whole labeling process, including generating the txt files in VOCdevkit/VOC2007/ImageSets and the 2007_train.txt and 2007_val.txt used for training
annotation_mode = 1 means only generating the txt files in VOCdevkit/VOC2007/ImageSets
annotation_mode = 2 means only generating 2007_train.txt and 2007_val.txt
'''
annotation_mode     = 0
'''
Must be modified so that the target information is generated in 2007_train.txt and 2007_val.txt
It only needs to be consistent with the classes_path used for training and prediction
If there is no target information in the generated 2007_train.txt,
it is because the classes are not set correctly
Only valid when annotation_mode is 0 or 2
'''
classes_path        = 'model_data/voc_classes.txt'
'''
trainval_percent specifies the ratio of (training set + validation set) to test set; by default (training set + validation set):test set = 9:1
train_percent specifies the ratio of training set to validation set within (training set + validation set); by default training set:validation set = 9:1
Only valid when annotation_mode is 0 or 1
'''
trainval_percent    = 0.9
train_percent       = 0.9
'''
Points to the folder where the VOC dataset is located
By default it points to the VOC dataset in the root directory
'''
VOCdevkit_path  = 'VOCdevkit'

classes_path points to the txt file corresponding to the detected classes. For example, with the VOC dataset, the txt we use is:
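
One class name per line; for VOC these are the standard 20 classes (which should match model_data/voc_classes.txt):

aeroplane
bicycle
bird
boat
bottle
bus
car
cat
chair
cow
diningtable
dog
horse
motorbike
person
pottedplant
sheep
sofa
train
tvmonitor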

When you train your own dataset, you can create your own cls_classes.txt that contains the categories you want to distinguish.

3. Start Network Training

With voc_annotation.py we have generated 2007_train.txt and 2007_val.txt, so now we can start training.
There are many training parameters; you can read the comments carefully after downloading the repository. The most important one is classes_path in train.py.

classes_path points to the txt file corresponding to the detected classes, the same txt as in voc_annotation.py. It must be modified when training on your own dataset!

After modifying classes_path, you can run train.py to start training. After training for multiple epochs, the weights will be generated in the logs folder.
The other parameters work as follows:

'''
Whether to use eager mode training
'''
eager = False
'''
Always modify classes_path before training so that it corresponds to your own dataset
'''
classes_path    = 'model_data/voc_classes.txt'
'''
anchors_path points to the txt file of the prior boxes; generally it is not modified.
anchors_mask helps the code find the corresponding prior boxes; generally it is not modified.
'''
anchors_path    = 'model_data/yolo_anchors.txt'
anchors_mask    = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
'''
For the weight file, see the README; it can be downloaded from Baidu Netdisk.
If a dimension mismatch is reported when training your own dataset, that is normal: the number of predicted classes is different, so the dimensions naturally do not match.
Pre-trained weights must be used in 99% of cases; without them the backbone weights are too random, the feature extraction is not effective,
and the training results will not be good. The pre-trained weights can be shared across datasets because the features are general.
'''
model_path      = 'model_data/yolo_weight.h5'
'''
The input shape must be a multiple of 32
'''
input_shape     = [416, 416]
'''
Training is divided into two stages: the freezing stage and the unfreezing stage.
Training parameters for the freezing stage:
the backbone of the model is frozen and the feature extraction network does not change.
The memory footprint is small; only the rest of the network is fine-tuned.
'''
Init_Epoch          = 0
Freeze_Epoch        = 50
Freeze_batch_size   = 8
Freeze_lr           = 1e-3
'''
Training parameters for the unfreezing stage:
the backbone of the model is no longer frozen and the feature extraction network changes.
The memory footprint is large; all network parameters change.
'''
UnFreeze_Epoch      = 100
Unfreeze_batch_size = 4
Unfreeze_lr         = 1e-4
'''
Whether to perform freeze training; by default, the backbone is frozen for training first and then unfrozen.
'''
Freeze_Train        = True
'''
Sets whether to read data with multiple threads; 0 means multithreading is disabled.
Enabling it speeds up data reading but uses more memory.
In Keras, enabling multithreading is sometimes much slower.
Only enable multithreading when IO is the bottleneck, i.e. when the GPU is much faster than reading images.
'''
num_workers         = 0
'''
The txt files containing the image paths and labels
'''
train_annotation_path   = '2007_train.txt'
val_annotation_path     = '2007_val.txt'

4. Prediction of training results

Two files, yolo.py and predict.py, are needed to predict with the trained model.
First, we need to modify model_path and classes_path in yolo.py; these two parameters must be modified.

model_path points to the trained weight file in the logs folder.
classes_path points to the txt corresponding to the detected category.
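
In this author's repositories, these settings usually live in a _defaults dictionary inside yolo.py; the fragment below is only a hedged sketch of the two entries to edit (the exact keys, surrounding code, and default values may differ in the actual file, and the paths shown are placeholders):

_defaults = {
    "model_path"   : 'logs/your_trained_weights.h5',   # weights generated in the logs folder (placeholder name)
    "classes_path" : 'model_data/cls_classes.txt',     # the same classes txt used for training
    # ... other entries are left unchanged ...
}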

After the modification is complete, you can run predict.py to detect objects. Once it is running, enter an image path to detect it.

