Deep learning Yolo algorithm code implementation

Preface

This post summarizes the techniques used in the YOLO series by walking through the algorithm code of the YOLO series, including anchor box setup, data reading and processing, plug-and-play attention modules, loss function design, and so on. The code discussed here comes from Yolov3 and Yolov4 and is based on the TensorFlow framework. For the underlying principles, see some posts I wrote earlier: Overview of target detection, Target detection yolo series, and YOLO series algorithms for target detection. Related papers: Yolov3 and Yolov4. Original code: Yolov4 and Yolov3. For Yolov4, please refer to the Readme.md file in my git repository.

Yolov3

The network structure diagram of Yolov3 is shown in the following figure:

Anchor settings

$C_x$ and $C_y$ are the coordinates of the top-left corner of the grid cell that contains the center of the target to be detected; in the figure below they are (1, 1). $\sigma(t_x)$ and $\sigma(t_y)$ are the offsets; squashing $t_x$ and $t_y$ into the [0, 1] interval with a sigmoid guarantees that the predicted center stays inside the grid cell making the prediction and prevents excessive offsets. $P_w$ and $P_h$ are the width and height of the anchor box mapped onto the feature map. The exponential is used because $t_w$ and $t_h$ are scaled into logarithmic space. The resulting values $b_x$, $b_y$, $b_w$, $b_h$ are the position and size of the bounding box relative to the feature map, i.e. the predicted box coordinates the model ultimately outputs. The actual learning targets of the network, however, are the four offsets $t_x$, $t_y$, $t_w$, $t_h$: $t_x$ and $t_y$ are the predicted coordinate offsets, and $t_w$ and $t_h$ are the scale factors. Given these four offsets, the required coordinates $b_x$, $b_y$, $b_w$, $b_h$ follow directly from the decoding formulas (shown below). As for resizing the input image, the aspect ratio must be preserved and the remainder filled with padding.
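For reference, these are the box decoding equations from the YOLOv3 paper that the paragraph above describes:

$$b_x = \sigma(t_x) + C_x,\qquad b_y = \sigma(t_y) + C_y,\qquad b_w = P_w\, e^{t_w},\qquad b_h = P_h\, e^{t_h}$$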

The formulas used in YOLOv3 differ from those of the Faster R-CNN family in the first two lines: $P_x$ and $P_y$ are replaced by the top-left coordinates $C_x$, $C_y$ of the grid cell on the feature map. That is, in YOLOv3 the ground-truth center $G_x$, $G_y$ has the grid cell's top-left corner $C_x$, $C_y$ subtracted from it. The x, y offsets are not computed relative to the anchor box, so there is no division by $P_w$, $P_h$: $t_x = G_x - C_x$ and $t_y = G_y - C_y$, which directly gives the offset of the bbox center from the top-left corner of the grid cell. The formulas for $t_w$ and $t_h$ in YOLOv3 are the same as in the Faster R-CNN family: the ratio between the width/height of the box containing the object and the width/height of the anchor box. Neither Faster R-CNN nor YOLO regresses the box width and height directly; both scale them into logarithmic space, for fear that training would otherwise suffer from unstable gradients. Without this transformation, predicting the relative deformation $t_w$ directly would require $t_w > 0$, since a box's width and height cannot be negative; that would turn training into an optimization problem with inequality constraints, which cannot be handled directly by SGD. Taking the logarithm first removes the inequality constraint.

The first purpose of anchor boxes is to improve recall: when several target centers fall in the same cell, prediction boxes with different aspect ratios can predict objects of different shapes. The second is to reduce the learning difficulty: the anchor provides an absolute reference for the target's width and height, so the network only needs to regress the offsets.

Anchor calculation

Let the area of the rectangular box be $s$, its width $w$, and its height $h$. Then:

$$\begin{cases} w \times h = s \\ \dfrac{w}{h} = ratio \end{cases} \Rightarrow \begin{cases} ratio \cdot h^2 = s \\ w = ratio \cdot h \end{cases}$$

which finally gives:

$$\begin{cases} h = \sqrt{s / ratio} \\ w = ratio \cdot h = \sqrt{s \cdot ratio} \end{cases}$$

and, applying a scale factor for the different scales:

$$\begin{cases} h = \sqrt{scale \cdot s / ratio} \\ w = ratio \cdot h = \sqrt{scale \cdot s \cdot ratio} \end{cases}$$
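A minimal sketch of this computation in code (the function and variable names here are illustrative, not taken from the repos):

import numpy as np

def anchor_wh(s, ratio, scale=1.0):
    """Width and height of an anchor with area scale * s and aspect ratio w/h = ratio."""
    h = np.sqrt(scale * s / ratio)
    w = ratio * h  # equivalently np.sqrt(scale * s * ratio)
    return w, h

# e.g. a 16x16 base area at aspect ratios 1:2, 1:1 and 2:1
for r in (0.5, 1.0, 2.0):
    print(anchor_wh(16 * 16, r))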

Generate anchor box by clustering

The principle can be referred to: Anchor box generated by K-means clustering
Note that the standard K-means algorithm generally uses Euclidean distance as the sample metric. For generating anchor boxes, Euclidean distance cannot be used, because the error it induces depends on box size: clusters of large boxes would produce larger errors than clusters of small boxes. Instead, the distance is defined as d(box, anchor) = 1 − IOU(box, anchor).
The code is as follows:

#coding=utf-8
import xml.etree.ElementTree as ET
import numpy as np
import glob
import sys
sys.path.append('.')
import config as sys_config
from tqdm import tqdm
 
def iou(box, clusters):
    """
    Compute the intersection-over-union (IOU) between one ground-truth bounding box and the k prior (anchor) boxes.
    box: tuple or array containing the ground truth's width and height.
    clusters: numpy array of shape (k, 2), where k is the number of clustered anchor boxes.
    Returns: the IOU between the ground truth and each anchor box.
    """
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
        raise ValueError("Box has no area")
    interp = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]
    iou_ = interp / (box_area + cluster_area - interp)
    return iou_
 
 
def avg_iou(boxes, clusters):
    """
    Compute the mean, over all ground-truth boxes, of each box's best (maximum) IOU with the k anchors.
    """
    return np.mean([np.max(iou(boxes[i], clusters)) for i in range(boxes.shape[0])])
 
def kmeans(boxes, k, dist=np.median):
    """
    Run K-means clustering using the IOU value as the similarity measure.
    boxes: ground-truth boxes, a numpy array of shape (r, 2), where r is the number of ground truths.
    k: the number of anchors.
    dist: the function used to update each cluster center (the median by default).
    Returns: the k anchor boxes, as an array of shape (k, 2).
    """
    # That is, r mentioned above
    rows = boxes.shape[0]
    # Distance array to calculate the distance between each ground truth and k anchors
    distances = np.empty((rows, k))
    # The Anchor index closest to each ground truth last time
    last_clusters = np.zeros((rows,))
    # Set random number seed
    np.random.seed()
 
    # Initialize the cluster centers: randomly pick k of the r ground-truth boxes
    clusters = boxes[np.random.choice(rows, k, replace=False)]
    # Start clustering
    while True:
        # Distance between each ground truth and the k anchors, defined as d = 1 - IOU(box, anchor)
        for row in range(rows):
            distances[row] = 1 - iou(boxes[row], clusters)
        # For each ground truth, select the Anchor with the smallest distance and index it
        nearest_clusters = np.argmin(distances, axis=1)
        # If the Anchor index closest to each ground truth is the same as the last time, the clustering ends
        if (last_clusters == nearest_clusters).all():
            break
        # Update the cluster center to the mean value of all ground truth boxes in the cluster
        for cluster in range(k):
            clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)
        # Update the Anchor index closest to each ground truth
        last_clusters = nearest_clusters
 
    return clusters
 
# To load your own dataset, you only need the xml files annotated with labelImg
def load_dataset(path):
    dataset = []
    for xml_file in tqdm(glob.glob("{}/*xml".format(path))):
        tree = ET.parse(xml_file)
        # Picture height
        height = int(tree.findtext("./size/height"))
        # image width
        width = int(tree.findtext("./size/width"))
        
        for obj in tree.iter("object"):
            # Normalize the box coordinates by the image width and height
            xmin = int(obj.findtext("bndbox/xmin")) / width
            ymin = int(obj.findtext("bndbox/ymin")) / height
            xmax = int(obj.findtext("bndbox/xmax")) / width
            ymax = int(obj.findtext("bndbox/ymax")) / height
            xmin = np.float64(xmin)
            ymin = np.float64(ymin)
            xmax = np.float64(xmax)
            ymax = np.float64(ymax)
            if xmax == xmin or ymax == ymin:
                print(xml_file)
            # Put the normalized width and height of the box into the dataset; kmeans runs on these to obtain the anchors
            dataset.append([xmax - xmin, ymax - ymin])
    return np.array(dataset)
 
if __name__ == '__main__':
    import os
    ANNOTATIONS_PATH = os.path.join(sys_config.dataset_base_path, 'Annotations')
    # ANNOTATIONS_PATH = r'D:\Datasets\VOC\VOCtest_06-Nov-2007\VOCdevkit\VOC2007\Annotations'
    CLUSTERS = 9 #Number of clusters
    INPUTDIM = sys_config.imagesize # network input size
 
    data = load_dataset(ANNOTATIONS_PATH)
    out = kmeans(data, k=CLUSTERS)
    print('Boxes:')
    print(np.array(out)*INPUTDIM)
    print("Accuracy: {:.2f}%".format(avg_iou(data, out) * 100))       
    final_anchors = np.around(out[:, 0] / out[:, 1], decimals=2).tolist()
    print("Before Sort Ratios:\n {}".format(final_anchors))
    print("After Sort Ratios:\n {}".format(sorted(final_anchors)))

Data processing

Converting data to tfrecord format

In the Yolov3 code, the data is fed in as tfrecord files. TFRecord is a data format officially recommended by Google. It is essentially a binary file and makes better use of memory. It contains multiple tf.train.Example messages, which implement the protocol buffer data standard. An Example message contains a series of tf.train.Feature attributes, each of which is a key-value pair; the value can be one of three types:

  • bytes_list: stores the string and byte data types.
  • float_list: stores the float (float32) and double (float64) data types.
  • int64_list: stores bool, enum, int32, uint32, int64, and uint64.

Code to generate tfrecord:

# This function takes the annotation parsed from the xml file; class_map maps class_name to class id
def build_example(annotation, class_map):
    img_path = os.path.join(
        FLAGS.data_dir, 'JPEGImages', annotation['filename'])
    print(img_path)
    img_raw = open(img_path, 'rb').read()
    key = hashlib.sha256(img_raw).hexdigest()

    width = int(annotation['size']['width'])
    height = int(annotation['size']['height'])

    xmin = []
    ymin = []
    xmax = []
    ymax = []
    classes = []
    classes_text = []
    truncated = []
    views = []
    difficult_obj = []
    if 'object' in annotation:
        for obj in annotation['object']:
            difficult = bool(int(obj['difficult']))
            difficult_obj.append(int(difficult))

            xmin.append(float(obj['bndbox']['xmin']) / width)
            ymin.append(float(obj['bndbox']['ymin']) / height)
            xmax.append(float(obj['bndbox']['xmax']) / width)
            ymax.append(float(obj['bndbox']['ymax']) / height)
            classes_text.append(obj['name'].encode('utf8'))
            classes.append(class_map[obj['name']])
            truncated.append(int(obj['truncated']))
            views.append(obj['pose'].encode('utf8'))

    example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        'image/filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=[
            annotation['filename'].encode('utf8')])),
        'image/source_id': tf.train.Feature(bytes_list=tf.train.BytesList(value=[
            annotation['filename'].encode('utf8')])),
        'image/key/sha256': tf.train.Feature(bytes_list=tf.train.BytesList(value=[key.encode('utf8')])),
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw])),
        'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=['jpeg'.encode('utf8')])),
        'image/object/bbox/xmin': tf.train.Feature(float_list=tf.train.FloatList(value=xmin)),
        'image/object/bbox/xmax': tf.train.Feature(float_list=tf.train.FloatList(value=xmax)),
        'image/object/bbox/ymin': tf.train.Feature(float_list=tf.train.FloatList(value=ymin)),
        'image/object/bbox/ymax': tf.train.Feature(float_list=tf.train.FloatList(value=ymax)),
        'image/object/class/text': tf.train.Feature(bytes_list=tf.train.BytesList(value=classes_text)),
        'image/object/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=classes)),
        'image/object/difficult': tf.train.Feature(int64_list=tf.train.Int64List(value=difficult_obj)),
        'image/object/truncated': tf.train.Feature(int64_list=tf.train.Int64List(value=truncated)),
        'image/object/view': tf.train.Feature(bytes_list=tf.train.BytesList(value=views)),
    }))
    return example
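build_example only constructs a tf.train.Example; to actually produce the .tfrecord file, each Example is serialized and written with tf.io.TFRecordWriter, roughly as follows (the annotations list and the output path here are placeholders, not the repo's exact code):

import tensorflow as tf

# annotations: list of parsed VOC xml dicts; class_map: {class_name: class_id}
writer = tf.io.TFRecordWriter('./data/voc_train.tfrecord')
for annotation in annotations:
    example = build_example(annotation, class_map)
    writer.write(example.SerializeToString())
writer.close()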

You can inspect what a tfrecord contains with the following functions:

# Parse a single tfrecord Example
def parse_tfrecord(tfrecord, class_table, size):
    x = tf.io.parse_single_example(tfrecord, IMAGE_FEATURE_MAP)
    x_train = tf.image.decode_jpeg(x['image/encoded'], channels=3)
    x_train = tf.image.resize(x_train, (size, size))

    class_text = tf.sparse.to_dense(
        x['image/object/class/text'], default_value='')
    labels = tf.cast(class_table.lookup(class_text), tf.float32)
    y_train = tf.stack([tf.sparse.to_dense(x['image/object/bbox/xmin']),
                        tf.sparse.to_dense(x['image/object/bbox/ymin']),
                        tf.sparse.to_dense(x['image/object/bbox/xmax']),
                        tf.sparse.to_dense(x['image/object/bbox/ymax']),
                        labels], axis=1)

    paddings = [[0, FLAGS.yolo_max_boxes - tf.shape(y_train)[0]], [0, 0]]
    y_train = tf.pad(y_train, paddings)

    return x_train, y_train
def load_tfrecord_dataset(file_pattern, class_file, size=416):
    LINE_NUMBER = -1  # TODO: use tf.lookup.TextFileIndex.LINE_NUMBER
    '''
    tf.lookup.StaticHashTable: Establish the association between category and number
    keys_tensor = tf.constant([1, 2])
    vals_tensor = tf.constant([3, 4])
    input_tensor = tf.constant([1, 5])
    table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys_tensor, vals_tensor), -1)
    print(table.lookup(input_tensor))
    output:tf.Tensor([ 3 -1], shape=(2,), dtype=int32)

    tf.lookup.TextFileInitializer: Table initializers from a text file.
    '''
    class_table = tf.lookup.StaticHashTable(tf.lookup.TextFileInitializer(
        class_file, tf.string, 0, tf.int64, LINE_NUMBER, delimiter="\n"), -1)

    files = tf.data.Dataset.list_files(file_pattern)
    dataset = files.flat_map(tf.data.TFRecordDataset)
    return dataset.map(lambda x: parse_tfrecord(x, class_table, size))
import sys
import os
path = os.path.abspath(os.path.join(os.getcwd()))
sys.path.append(path)
print(sys.path)
import time
from absl import app, flags, logging
from absl.flags import FLAGS
import cv2
import numpy as np
import tensorflow as tf
from yolov3_tf2.models import (
    YoloV3, YoloV3Tiny
)
from yolov3_tf2.dataset import load_tfrecord_dataset, transform_images
from yolov3_tf2.utils import draw_outputs

flags.DEFINE_string('classes', './data/voc2012_trainbin.names', 'path to classes file')
flags.DEFINE_integer('size', 416, 'resize images to')
flags.DEFINE_string(
    'dataset', './data/voc_train_trashBin.tfrecord', 'path to dataset')
flags.DEFINE_string('output', './output.jpg', 'path to output image')


def main(_argv):
    class_names = [c.strip() for c in open(FLAGS.classes).readlines()]
    logging.info('classes loaded')
    dataset = load_tfrecord_dataset(FLAGS.dataset, FLAGS.classes, FLAGS.size)
    dataset = dataset.shuffle(512)
    for image, labels in dataset.take(4):
        boxes = []
        scores = []
        classes = []
        for x1, y1, x2, y2, label in labels:
            if x1 == 0 and x2 == 0:
                continue

            boxes.append((x1, y1, x2, y2))
            scores.append(1)
            classes.append(label)
        nums = [len(boxes)]
        boxes = [boxes]
        scores = [scores]
        classes = [classes]

        logging.info('labels:')
        for i in range(nums[0]):
            logging.info('\t{}, {}, {}'.format(class_names[int(classes[0][i])],
                                               np.array(scores[0][i]),
                                               np.array(boxes[0][i])))

        img = cv2.cvtColor(image.numpy(), cv2.COLOR_RGB2BGR)
        img = draw_outputs(img, (boxes, scores, classes, nums), class_names)
        cv2.imwrite(FLAGS.output, img)
        logging.info('output saved to: {}'.format(FLAGS.output))


if __name__ == '__main__':
    app.run(main)

I have to take a closer look at this piece of code.

Load data

After loading the data, it needs to be shuffled, batched, and resized.

    train_dataset = train_dataset.shuffle(buffer_size=512)
    train_dataset = train_dataset.batch(FLAGS.batch_size)
    train_dataset = train_dataset.map(lambda x, y: (
        dataset.transform_images(x, FLAGS.size),
        dataset.transform_targets(y, anchors, anchor_masks, FLAGS.size)))
    train_dataset = train_dataset.prefetch(
        buffer_size=tf.data.experimental.AUTOTUNE)
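Here transform_images simply resizes each image to the square network input and scales pixel values to [0, 1]; below is a minimal sketch of roughly what it does (transform_targets, not shown, assigns each ground-truth box to its best-matching anchor on each output scale):

import tensorflow as tf

def transform_images(x_train, size):
    # Resize to the network input resolution and normalize pixels to [0, 1]
    x_train = tf.image.resize(x_train, (size, size))
    x_train = x_train / 255
    return x_train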

Model loading

Yolov3

With the network structure diagram above in mind, the code is quite easy to follow.
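The blocks below rely on a DarknetConv helper (and the detection head uses YoloConv / YoloOutput helpers) defined elsewhere in the repo and not shown in this excerpt. As a reference, here is a sketch of what DarknetConv does: a Conv2D with optional batch norm and LeakyReLU, plus explicit zero padding when downsampling. The exact regularizer and alpha values are approximate, not guaranteed to match the repo:

from tensorflow.keras.layers import Conv2D, BatchNormalization, LeakyReLU, ZeroPadding2D
from tensorflow.keras.regularizers import l2

def DarknetConv(x, filters, size, strides=1, batch_norm=True):
    if strides == 1:
        padding = 'same'
    else:
        # top-left half padding so that stride-2 convs downsample cleanly
        x = ZeroPadding2D(((1, 0), (1, 0)))(x)
        padding = 'valid'
    x = Conv2D(filters=filters, kernel_size=size, strides=strides,
               padding=padding, use_bias=not batch_norm,
               kernel_regularizer=l2(0.0005))(x)
    if batch_norm:
        x = BatchNormalization()(x)
        x = LeakyReLU(alpha=0.1)(x)
    return x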

def DarknetResidual(x, filters):
    prev = x
    x = DarknetConv(x, filters // 2, 1)
    x = DarknetConv(x, filters, 3)
    x = Add()([prev, x])
    return x


def DarknetBlock(x, filters, blocks):
    x = DarknetConv(x, filters, 3, strides=2)
    for _ in range(blocks):
        x = DarknetResidual(x, filters)
    return x


def Darknet(name=None):
    x = inputs = Input([None, None, 3])
    x = DarknetConv(x, 32, 3)
    x = DarknetBlock(x, 64, 1)
    x = DarknetBlock(x, 128, 2)  # skip connection
    x = x_36 = DarknetBlock(x, 256, 8)  # skip connection
    x = x_61 = DarknetBlock(x, 512, 8)
    x = DarknetBlock(x, 1024, 4)
    return tf.keras.Model(inputs, (x_36, x_61, x), name=name)
def YoloV3(size=None, channels=3, anchors=yolo_anchors,
           masks=yolo_anchor_masks, classes=80, training=False):
    x = inputs = Input([size, size, channels], name='input')

    x_36, x_61, x = Darknet(name='yolo_darknet')(x)

    x = YoloConv(512, name='yolo_conv_0')(x)
    output_0 = YoloOutput(512, len(masks[0]), classes, name='yolo_output_0')(x)
    # Upsample and concat with the x_61 skip connection (done inside YoloConv when given a tuple)
    x = YoloConv(256, name='yolo_conv_1')((x, x_61))
    output_1 = YoloOutput(256, len(masks[1]), classes, name='yolo_output_1')(x)
    # Same again with the x_36 skip connection; note the closure style (functions returning functions)
    x = YoloConv(128, name='yolo_conv_2')((x, x_36))
    output_2 = YoloOutput(128, len(masks[2]), classes, name='yolo_output_2')(x)
    # During training return the raw outputs; otherwise decode the boxes and apply NMS below
    if training:
        return Model(inputs, (output_0, output_1, output_2), name='yolov3')

    boxes_0 = Lambda(lambda x: yolo_boxes(x, anchors[masks[0]], classes),
                     name='yolo_boxes_0')(output_0)
    boxes_1 = Lambda(lambda x: yolo_boxes(x, anchors[masks[1]], classes),
                     name='yolo_boxes_1')(output_1)
    boxes_2 = Lambda(lambda x: yolo_boxes(x, anchors[masks[2]], classes),
                     name='yolo_boxes_2')(output_2)

    outputs = Lambda(lambda x: yolo_nms(x, anchors, masks, classes),
                     name='yolo_nms')((boxes_0[:3], boxes_1[:3], boxes_2[:3]))

    return Model(inputs, outputs, name='yolov3')
# As TensorFlow Lite doesn't support tf.size, which is used in tf.meshgrid,
# we reimplement a simple meshgrid function that uses basic tf functions.
def _meshgrid(n_a, n_b):

    return [
        tf.reshape(tf.tile(tf.range(n_a), [n_b]), (n_b, n_a)),
        tf.reshape(tf.repeat(tf.range(n_b), n_a), (n_b, n_a))
    ]


def yolo_boxes(pred, anchors, classes):
    # pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...classes))
    grid_size = tf.shape(pred)[1:3]
    box_xy, box_wh, objectness, class_probs = tf.split(
        pred, (2, 2, 1, classes), axis=-1)
    # Normalize the predicted data to the [0,1] interval
    box_xy = tf.sigmoid(box_xy)
    objectness = tf.sigmoid(objectness)
    class_probs = tf.sigmoid(class_probs)
    pred_box = tf.concat((box_xy, box_wh), axis=-1)  # original xywh for loss

    # !!! grid[x][y] == (y, x)
    grid = _meshgrid(grid_size[1],grid_size[0])
    grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)  # [gx, gy, 1, 2]

    box_xy = (box_xy + tf.cast(grid, tf.float32)) / \
        tf.cast(grid_size, tf.float32)
    # exp() because t_w, t_h were encoded in log space when the targets were built
    box_wh = tf.exp(box_wh) * anchors

    box_x1y1 = box_xy - box_wh / 2
    box_x2y2 = box_xy + box_wh / 2
    bbox = tf.concat([box_x1y1, box_x2y2], axis=-1)

    return bbox, objectness, class_probs, pred_box
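At inference time, the decoded boxes from the three scales are merged and filtered by yolo_nms, which is also not shown above. Here is a sketch of what it does, using TensorFlow's combined non-max suppression (the threshold and size limits are illustrative; the repo reads them from flags):

import tensorflow as tf

def yolo_nms(outputs, anchors, masks, classes):
    # Each element of outputs is (bbox, objectness, class_probs) from one scale
    b, c, t = [], [], []
    for o in outputs:
        b.append(tf.reshape(o[0], (tf.shape(o[0])[0], -1, tf.shape(o[0])[-1])))
        c.append(tf.reshape(o[1], (tf.shape(o[1])[0], -1, tf.shape(o[1])[-1])))
        t.append(tf.reshape(o[2], (tf.shape(o[2])[0], -1, tf.shape(o[2])[-1])))
    bbox = tf.concat(b, axis=1)          # (batch, total_boxes, 4)
    confidence = tf.concat(c, axis=1)    # (batch, total_boxes, 1)
    class_probs = tf.concat(t, axis=1)   # (batch, total_boxes, classes)
    scores = confidence * class_probs

    boxes, scores, classes, valid = tf.image.combined_non_max_suppression(
        boxes=tf.reshape(bbox, (tf.shape(bbox)[0], -1, 1, 4)),  # class-agnostic boxes
        scores=scores,
        max_output_size_per_class=100,   # illustrative; the repo uses its max-boxes flag
        max_total_size=100,
        iou_threshold=0.5,
        score_threshold=0.5)
    return boxes, scores, classes, valid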

Train

The training loop below is written in eager mode, that is, debugging mode. (I'm afraid I couldn't have written code like this myself.)

    if FLAGS.mode == 'eager_tf':
        # Eager mode is great for debugging
        # Non eager graph mode is recommended for real training
        avg_loss = tf.keras.metrics.Mean('loss', dtype=tf.float32)
        avg_val_loss = tf.keras.metrics.Mean('val_loss', dtype=tf.float32)

        for epoch in range(1, FLAGS.epochs + 1):
            for batch, (images, labels) in enumerate(train_dataset):
                with tf.GradientTape() as tape:
                    outputs = model(images, training=True)
                    regularization_loss = tf.reduce_sum(model.losses)
                    pred_loss = []
                    for output, label, loss_fn in zip(outputs, labels, loss):
                        pred_loss.append(loss_fn(label, output))
                    total_loss = tf.reduce_sum(pred_loss) + regularization_loss

                grads = tape.gradient(total_loss, model.trainable_variables)
                optimizer.apply_gradients(
                    zip(grads, model.trainable_variables))

                logging.info("{}_train_{}, {}, {}".format(
                    epoch, batch, total_loss.numpy(),
                    list(map(lambda x: np.sum(x.numpy()), pred_loss))))
                avg_loss.update_state(total_loss)

            for batch, (images, labels) in enumerate(val_dataset):
                outputs = model(images)
                regularization_loss = tf.reduce_sum(model.losses)
                pred_loss = []
                for output, label, loss_fn in zip(outputs, labels, loss):
                    pred_loss.append(loss_fn(label, output))
                total_loss = tf.reduce_sum(pred_loss) + regularization_loss

                logging.info("{}_val_{}, {}, {}".format(
                    epoch, batch, total_loss.numpy(),
                    list(map(lambda x: np.sum(x.numpy()), pred_loss))))
                avg_val_loss.update_state(total_loss)

            logging.info("{}, train: {}, val: {}".format(
                epoch,
                avg_loss.result().numpy(),
                avg_val_loss.result().numpy()))

            avg_loss.reset_states()
            avg_val_loss.reset_states()
            model.save_weights(
                'checkpoints/yolov3_train_{}.tf'.format(epoch))

Transfer Learning

# Configure the model for transfer learning
    if FLAGS.transfer == 'none':
        pass  # Nothing to do
    elif FLAGS.transfer in ['darknet', 'no_output']:
        # Darknet transfer is a special case that works
        # with incompatible number of classes

        # reset top layers
        if FLAGS.tiny:
            model_pretrained = YoloV3Tiny(
                FLAGS.size, training=True, classes=FLAGS.weights_num_classes or FLAGS.num_classes)
        else:
            model_pretrained = YoloV3(
                FLAGS.size, training=True, classes=FLAGS.weights_num_classes or FLAGS.num_classes)
        
        # finetune 
        if FLAGS.pretrain:
            model_pretrained.load_weights(FLAGS.weights)

        if FLAGS.transfer == 'darknet':
            model.get_layer('yolo_darknet').set_weights(
                model_pretrained.get_layer('yolo_darknet').get_weights())
            freeze_all(model.get_layer('yolo_darknet'))

        elif FLAGS.transfer == 'no_output':
            for l in model.layers:
                if not l.name.startswith('yolo_output'):
                    l.set_weights(model_pretrained.get_layer(
                        l.name).get_weights())
                    freeze_all(l)

    else:
        # All other transfer require matching classes
        model.load_weights(FLAGS.weights)
        if FLAGS.transfer == 'fine_tune':
            # freeze darknet and fine tune other layers
            darknet = model.get_layer('yolo_darknet')
            freeze_all(darknet)
        elif FLAGS.transfer == 'frozen':
            # freeze everything
            freeze_all(model)
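The freeze_all helper used above comes from the repo's utils; conceptually it just toggles the trainable flag recursively. A minimal sketch of what it does:

import tensorflow as tf

def freeze_all(model, frozen=True):
    # Recursively mark a layer/model and all of its sub-layers as non-trainable
    model.trainable = not frozen
    if isinstance(model, tf.keras.Model):
        for layer in model.layers:
            freeze_all(layer, frozen)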

Optimization function

Adam is used as the optimizer; the model is compiled with the optimizer and loss:

model.compile(optimizer=optimizer, loss=loss,
                      run_eagerly=(FLAGS.mode == 'eager_fit'))
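For context, the optimizer object and the per-scale loss list passed to compile above are built roughly like this in the training script (YoloLoss is defined in the next section; the flag names and anchor_masks are assumptions based on the repo):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=FLAGS.learning_rate)
# One YoloLoss per output scale, each with the anchors belonging to that scale
loss = [YoloLoss(anchors[mask], classes=FLAGS.num_classes)
        for mask in anchor_masks]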

Loss function

def YoloLoss(anchors, classes=80, ignore_thresh=0.5):
    def yolo_loss(y_true, y_pred):
        # 1. transform all pred outputs
        # y_pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...cls))
        pred_box, pred_obj, pred_class, pred_xywh = yolo_boxes(
            y_pred, anchors, classes)
        pred_xy = pred_xywh[..., 0:2]
        pred_wh = pred_xywh[..., 2:4]

        # 2. transform all true outputs
        # y_true: (batch_size, grid, grid, anchors, (x1, y1, x2, y2, obj, cls))
        true_box, true_obj, true_class_idx = tf.split(
            y_true, (4, 1, 1), axis=-1)
        true_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2
        true_wh = true_box[..., 2:4] - true_box[..., 0:2]

        # give higher weights to small boxes
        box_loss_scale = 2 - true_wh[..., 0] * true_wh[..., 1]

        # 3. inverting the pred box equations
        grid_size = tf.shape(y_true)[1]
        grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
        grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
        true_xy = true_xy * tf.cast(grid_size, tf.float32) - \
            tf.cast(grid, tf.float32)
        true_wh = tf.math.log(true_wh / anchors)
        true_wh = tf.where(tf.math.is_inf(true_wh),
                           tf.zeros_like(true_wh), true_wh)

        # 4. calculate all masks
        obj_mask = tf.squeeze(true_obj, -1)
        # ignore false positive when iou is over threshold
        best_iou = tf.map_fn(
            lambda x: tf.reduce_max(broadcast_iou(x[0], tf.boolean_mask(
                x[1], tf.cast(x[2], tf.bool))), axis=-1),
            (pred_box, true_box, obj_mask),
            tf.float32)
        ignore_mask = tf.cast(best_iou < ignore_thresh, tf.float32)
        # Coordinates use squared error
        # 5. calculate all losses
        xy_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
        wh_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_wh - pred_wh), axis=-1)
        # The confidence level uses binary cross entropy
        obj_loss = binary_crossentropy(true_obj, pred_obj)
        obj_loss = obj_mask * obj_loss + \
            (1 - obj_mask) * ignore_mask * obj_loss
        # TODO: use binary_crossentropy instead
        # Classes use sparse categorical cross entropy loss
        class_loss = obj_mask * sparse_categorical_crossentropy(
            true_class_idx, pred_class)

        # 6. sum over (batch, gridx, gridy, anchors) => (batch, 1)
        xy_loss = tf.reduce_sum(xy_loss, axis=(1, 2, 3))
        wh_loss = tf.reduce_sum(wh_loss, axis=(1, 2, 3))
        obj_loss = tf.reduce_sum(obj_loss, axis=(1, 2, 3))
        class_loss = tf.reduce_sum(class_loss, axis=(1, 2, 3))

        return xy_loss + wh_loss + obj_loss + class_loss
    return yolo_loss
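The broadcast_iou helper used in step 4 (to find each predicted box's best IOU against all ground-truth boxes in the image) is defined in the repo's utils; here is a sketch of what it computes:

import tensorflow as tf

def broadcast_iou(box_1, box_2):
    # box_1: (..., 4) predicted boxes as (x1, y1, x2, y2)
    # box_2: (N, 4) ground-truth boxes
    box_1 = tf.expand_dims(box_1, -2)
    box_2 = tf.expand_dims(box_2, 0)
    new_shape = tf.broadcast_dynamic_shape(tf.shape(box_1), tf.shape(box_2))
    box_1 = tf.broadcast_to(box_1, new_shape)
    box_2 = tf.broadcast_to(box_2, new_shape)

    # Intersection width/height, clipped at zero
    int_w = tf.maximum(tf.minimum(box_1[..., 2], box_2[..., 2]) -
                       tf.maximum(box_1[..., 0], box_2[..., 0]), 0)
    int_h = tf.maximum(tf.minimum(box_1[..., 3], box_2[..., 3]) -
                       tf.maximum(box_1[..., 1], box_2[..., 1]), 0)
    int_area = int_w * int_h
    box_1_area = (box_1[..., 2] - box_1[..., 0]) * (box_1[..., 3] - box_1[..., 1])
    box_2_area = (box_2[..., 2] - box_2[..., 0]) * (box_2[..., 3] - box_2[..., 1])
    return int_area / (box_1_area + box_2_area - int_area)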

Tags: Algorithm Deep Learning Object Detection
