Deep learning Yolo algorithm code implementation

preface
Anchor settings
data processing
Model loading
train

preface

This blog summarizes the techniques used in the Yolo series by following the algorithm code of the series, including anchor box setting, data reading and processing, plug-and-play attention modules, loss function setting, and so on. The code comes from Yolov3 and Yolov4 and is based on the TensorFlow framework. For the principles, please refer to some blogs I wrote earlier: Overview of target detection, Target detection yolo series, and YOLO series algorithms for target detection. Related papers: Yolov3 and Yolov4. Original code: Yolov4 and Yolov3. For Yolov4, please refer to the Readme.md file in my git repository.

Yolov3

The network structure diagram of Yolov3 is shown in the following figure:

Anchor settings

$C_x$ and $C_y$ are the coordinates of the upper-left corner of the grid cell that contains the center point of the target to be detected; in the figure below, $(C_x, C_y)$ is $(1, 1)$. $\sigma(t_x)$ and $\sigma(t_y)$ are the offsets of the center from that corner. Squashing $t_x$ and $t_y$ into the $[0, 1]$ interval with a sigmoid keeps the target center inside the grid cell that makes the prediction and prevents excessive offsets. $P_w$ and $P_h$ are the width and height of the anchor box mapped onto the feature map. The exponential is used because $t_w$ and $t_h$ live in logarithmic space. The resulting box coordinates $b_x$, $b_y$, $b_w$, $b_h$, that is, the position and size of the bounding box (bbox) relative to the feature map, are the predicted output coordinates required from the model:

$$b_x = \sigma(t_x) + C_x, \quad b_y = \sigma(t_y) + C_y, \quad b_w = P_w e^{t_w}, \quad b_h = P_h e^{t_h}$$

What the network actually learns, however, is the four offsets $t_x$, $t_y$, $t_w$, $t_h$, where $t_x$ and $t_y$ are the predicted coordinate offsets and $t_w$ and $t_h$ are the scale factors. With these four offsets, the four coordinates $b_x$, $b_y$, $b_w$, $b_h$ follow directly from the formula above. When the input image is resized, its aspect ratio must be preserved and the remainder filled with padding.
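The decoding step described above can be sketched in plain Python. This is only an illustration for a single box, not the repository's TensorFlow code; the function name `decode_box` and its argument layout are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Logistic function; squashes x into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw offsets into a box on the feature map.

    t     -- (t_x, t_y, t_w, t_h), the raw network outputs
    cell  -- (C_x, C_y), upper-left corner of the predicting grid cell
    prior -- (P_w, P_h), anchor size in feature-map units
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x      # center stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # always positive, whatever t_w is
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero offsets put the center in the middle of cell (1, 1) and keep the anchor size.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(1, 1), prior=(3.0, 2.0)))
# (1.5, 1.5, 3.0, 2.0)
```

Even very large positive or negative values of `t_x` and `t_y` cannot move the center out of its cell, which is the point of the sigmoid.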

The formulas used in yolov3 differ from those of the Faster R-CNN series in the first two lines: $P_x$ and $P_y$ are replaced by the upper-left coordinates of the grid cell on the feature map, $C_x$ and $C_y$. That is, yolov3 subtracts the grid cell's upper-left corner $C_x$, $C_y$ from $G_x$, $G_y$. The $x$ and $y$ coordinates are not offsets relative to the anchor box, so they are not divided by $P_w$ and $P_h$. The targets are therefore $\sigma(t_x) = G_x - C_x$ and $\sigma(t_y) = G_y - C_y$, which give the offset of the bbox center from the upper-left corner of the grid cell directly. For $t_w$ and $t_h$, yolov3 uses the same formulas as the Faster R-CNN series, $t_w = \log(G_w / P_w)$ and $t_h = \log(G_h / P_h)$: the logarithm of the ratio between the width and height of the object's box and those of the anchor box. Neither Faster R-CNN nor YOLO regresses the width and height of the bounding box directly; both work in logarithmic space, because direct regression risks unstable gradients during training. Without the transformation, the network would predict the relative scale directly, and that value would have to be constrained to be greater than 0, because the width and height of a box cannot be negative. That is an optimization problem with inequality constraints, which SGD cannot solve directly. Taking the logarithm first removes the inequality constraint.
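A small sketch of how the regression targets could be built from a ground-truth box, following the description above. The name `encode_box` and the tuple layout are hypothetical; all quantities are in feature-map units.

```python
import math


def encode_box(g, prior):
    """Build regression targets for one ground-truth box.

    g     -- (G_x, G_y, G_w, G_h): center and size of the ground-truth box
    prior -- (P_w, P_h): anchor size
    Returns the responsible cell, the targets for sigma(t_x), sigma(t_y),
    and the log-space targets (t_w, t_h).
    """
    g_x, g_y, g_w, g_h = g
    p_w, p_h = prior
    # The cell whose upper-left corner is (C_x, C_y) contains the center.
    c_x, c_y = math.floor(g_x), math.floor(g_y)
    xy_target = (g_x - c_x, g_y - c_y)   # always in [0, 1)
    t_w = math.log(g_w / p_w)            # may be negative: no constraint needed
    t_h = math.log(g_h / p_h)
    return (c_x, c_y), xy_target, (t_w, t_h)


cell, xy_target, (t_w, t_h) = encode_box((1.5, 1.25, 6.0, 1.0), prior=(3.0, 2.0))
print(cell, xy_target)   # (1, 1) (0.5, 0.25)
print(t_w > 0, t_h < 0)  # True True
```

The box is twice as wide as the anchor and half as tall, so `t_w` is positive and `t_h` is negative, yet exponentiating either one always gives back a positive size.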

The first function of the anchor box is to improve recall: when the centers of several targets fall in the same cell, prediction boxes with different aspect ratios can predict the different objects. The second is to reduce the learning difficulty: the anchor supplies an absolute width and height for the target, so the network only needs to regress the offsets.

Anchor calculation

Let the area of the rectangular box be $s$, its width $w$ and its height $h$. Then:

$$\begin{cases} \dfrac{w}{h} = ratio \\ w \times h = s \end{cases} \Rightarrow \begin{cases} ratio \times h^2 = s \\ w = ratio \cdot h \end{cases}$$

The script below reads the annotation boxes and runs k-means on their widths and heights to obtain the anchors:

    import glob
    import os
    import xml.etree.ElementTree as ET

    import numpy as np

    # `kmeans`, `avg_iou` and `sys_config` belong to the author's project;
    # their definitions are not shown in this post.


    def load_dataset(path):
        # The first lines of this function are cut off in the source;
        # `dataset = []` and the glob loop are reconstructed from context.
        dataset = []
        for xml_file in glob.glob("{}/*xml".format(path)):
            tree = ET.parse(xml_file)
            # Image height
            height = int(tree.findtext("./size/height"))
            # Image width
            width = int(tree.findtext("./size/width"))
            for obj in tree.iter("object"):
                # Box corners normalized by the image size
                xmin = int(obj.findtext("bndbox/xmin")) / width
                ymin = int(obj.findtext("bndbox/ymin")) / height
                xmax = int(obj.findtext("bndbox/xmax")) / width
                ymax = int(obj.findtext("bndbox/ymax")) / height
                xmin = np.float64(xmin)
                ymin = np.float64(ymin)
                xmax = np.float64(xmax)
                ymax = np.float64(ymax)
                # Report annotation files that contain a degenerate box
                if xmax == xmin or ymax == ymin:
                    print(xml_file)
                # Store the box width and height; k-means runs on these to obtain the anchors
                dataset.append([xmax - xmin, ymax - ymin])
        return np.array(dataset)


    if __name__ == '__main__':
        ANNOTATIONS_PATH = os.path.join(sys_config.dataset_base_path, 'Annotations')
        # ANNOTATIONS_PATH = r'D:\Datasets\VOC\VOCtest_06-Nov-2007\VOCdevkit\VOC2007\Annotations'
        CLUSTERS = 9  # Number of clusters
        INPUTDIM = sys_config.imagesize  # Network input size
        data = load_dataset(ANNOTATIONS_PATH)
        out = kmeans(data, k=CLUSTERS)
        print('Boxes:')
        print(np.array(out) * INPUTDIM)
        print("Accuracy: {:.2f}%".format(avg_iou(data, out) * 100))
        final_anchors = np.around(out[:, 0] / out[:, 1], decimals=2).tolist()
        print("Before Sort Ratios:\n {}".format(final_anchors))
        print("After Sort Ratios:\n {}".format(sorted(final_anchors)))
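The script calls `kmeans` and `avg_iou`, whose definitions are not shown here. The following is a minimal pure-Python sketch of what such functions might look like, assuming the usual YOLO choice of distance d = 1 - IoU between (width, height) pairs and assuming the cluster center is updated with the mean (some implementations use the median instead). `wh_from_area_ratio` applies the area/ratio relation above. All names are illustrative.

```python
import math
import random


def wh_from_area_ratio(s, ratio):
    """Solve w / h = ratio and w * h = s for (w, h)."""
    h = math.sqrt(s / ratio)
    return ratio * h, h


def iou_wh(box, cluster):
    """IoU of two (w, h) boxes that share the same upper-left corner."""
    inter = min(box[0], cluster[0]) * min(box[1], cluster[1])
    union = box[0] * box[1] + cluster[0] * cluster[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (w, h) pairs using the distance 1 - IoU; returns k anchors."""
    rng = random.Random(seed)
    clusters = rng.sample(boxes, k)
    assignment = None
    for _ in range(iterations):
        # Highest IoU is the same as smallest 1 - IoU distance.
        new_assignment = [
            max(range(k), key=lambda j: iou_wh(b, clusters[j])) for b in boxes
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for j in range(k):
            members = [b for b, a in zip(boxes, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                clusters[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return sorted(clusters, key=lambda c: c[0] * c[1])  # small to large


def avg_iou(boxes, clusters):
    """Mean IoU between every box and its best-matching anchor."""
    return sum(max(iou_wh(b, c) for c in clusters) for b in boxes) / len(boxes)


boxes = [(0.1, 0.1), (0.12, 0.1), (0.5, 0.4), (0.52, 0.42), (0.9, 0.8), (0.88, 0.82)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
print("Accuracy: {:.2f}%".format(avg_iou(boxes, anchors) * 100))
```

IoU is used instead of Euclidean distance so that large boxes do not dominate the clustering error.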

data processing

Convert to tfrecord format data

The Yolov3 code uses the TFRecord format for its input data. TFRecord is a data format officially recommended by Google. It is a binary file format, which makes better use of memory. A file contains multiple tf.train.Example messages, which implement the protocol buffer data standard. An Example message contains a series of tf.train.Feature attributes, each of which is a key-value pair. The value can be one of three types:

  • bytes_list: stores string and byte data.
  • float_list: stores floating-point data; values are kept as float32, so float64 (double) input is stored with float32 precision.
  • int64_list: stores bool, enum, int32, uint32, int64 and uint64 data.

Code to generate tfrecord:

    import hashlib
    import os

    import tensorflow as tf
    from absl.flags import FLAGS  # FLAGS.data_dir is defined elsewhere in the script


    # Takes one parsed XML annotation; class_map maps class_name to class ID
    def build_example(annotation, class_map):
        img_path = os.path.join(
            FLAGS.data_dir, 'JPEGImages', annotation['filename'])
        print(img_path)
        with open(img_path, 'rb') as f:
            img_raw = f.read()
        key = hashlib.sha256(img_raw).hexdigest()

        width = int(annotation['size']['width'])
        height = int(annotation['size']['height'])

        xmin = []
        ymin = []
        xmax = []
        ymax = []
        classes = []
        classes_text = []
        truncated = []
        views = []
        difficult_obj = []
        if 'object' in annotation:
            for obj in annotation['object']:
                difficult = bool(int(obj['difficult']))
                difficult_obj.append(int(difficult))

                # Box corners normalized by the image size
                xmin.append(float(obj['bndbox']['xmin']) / width)
                ymin.append(float(obj['bndbox']['ymin']) / height)
                xmax.append(float(obj['bndbox']['xmax']) / width)
                ymax.append(float(obj['bndbox']['ymax']) / height)
                classes_text.append(obj['name'].encode('utf8'))
                classes.append(class_map[obj['name']])
                truncated.append(int(obj['truncated']))
                views.append(obj['pose'].encode('utf8'))

        example = tf.train.Example(features=tf.train.Features(feature={
            'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
            'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
            'image/filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=[
                annotation['filename'].encode('utf8')])),
            'image/source_id': tf.train.Feature(bytes_list=tf.train.BytesList(value=[
                annotation['filename'].encode('utf8')])),
            'image/key/sha256': tf.train.Feature(bytes_list=tf.train.BytesList(value=[key.encode('utf8')])),
            'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw])),
            'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=['jpeg'.encode('utf8')])),
            'image/object/bbox/xmin': tf.train.Feature(float_list=tf.train.FloatList(value=xmin)),
            'image/object/bbox/xmax': tf.train.Feature(float_list=tf.train.FloatList(value=xmax)),
            'image/object/bbox/ymin': tf.train.Feature(float_list=tf.train.FloatList(value=ymin)),
            'image/object/bbox/ymax': tf.train.Feature(float_list=tf.train.FloatList(value=ymax)),
            'image/object/class/text': tf.train.Feature(bytes_list=tf.train.BytesList(value=classes_text)),
            'image/object/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=classes)),
            'image/object/difficult': tf.train.Feature(int64_list=tf.train.Int64List(value=difficult_obj)),
            'image/object/truncated': tf.train.Feature(int64_list=tf.train.Int64List(value=truncated)),
            'image/object/view': tf.train.Feature(bytes_list=tf.train.BytesList(value=views)),
        }))
        return example

The following functions parse a TFRecord file so that you can inspect what it records:

    # Parse one serialized tf.train.Example.
    # IMAGE_FEATURE_MAP (defined elsewhere in the repository) describes the stored features.
    def parse_tfrecord(tfrecord, class_table, size):
        x = tf.io.parse_single_example(tfrecord, IMAGE_FEATURE_MAP)
        x_train = tf.image.decode_jpeg(x['image/encoded'], channels=3)
        x_train = tf.image.resize(x_train, (size, size))

        class_text = tf.sparse.to_dense(
            x['image/object/class/text'], default_value='')
        labels = tf.cast(class_table.lookup(class_text), tf.float32)
        y_train = tf.stack([tf.sparse.to_dense(x['image/object/bbox/xmin']),
                            tf.sparse.to_dense(x['image/object/bbox/ymin']),
                            tf.sparse.to_dense(x['image/object/bbox/xmax']),
                            tf.sparse.to_dense(x['image/object/bbox/ymax']),
                            labels], axis=1)

        # Pad with zero rows so that every image has yolo_max_boxes rows
        paddings = [[0, FLAGS.yolo_max_boxes - tf.shape(y_train)[0]], [0, 0]]
        y_train = tf.pad(y_train, paddings)

        return x_train, y_train


    def load_tfrecord_dataset(file_pattern, class_file, size=416):
        LINE_NUMBER = -1  # TODO: use tf.lookup.TextFileIndex.LINE_NUMBER
        '''
        tf.lookup.StaticHashTable: establishes the association between category and number
            keys_tensor = tf.constant([1, 2])
            vals_tensor = tf.constant([3, 4])
            input_tensor = tf.constant([1, 5])
            table = tf.lookup.StaticHashTable(
                tf.lookup.KeyValueTensorInitializer(keys_tensor, vals_tensor), -1)
            print(table.lookup(input_tensor))
            output: tf.Tensor([ 3 -1], shape=(2,), dtype=int32)
        tf.lookup.TextFileInitializer: table initializer that reads from a text file.
        '''
        class_table = tf.lookup.StaticHashTable(tf.lookup.TextFileInitializer(
            class_file, tf.string, 0, tf.int64, LINE_NUMBER, delimiter="\n"), -1)

        files = tf.data.Dataset.list_files(file_pattern)
        dataset = files.flat_map(tf.data.TFRecordDataset)
        return dataset.map(lambda x: parse_tfrecord(x, class_table, size))

The following script reads a few records back and draws the labelled boxes onto the image:

    import sys
    import os

    path = os.path.abspath(os.path.join(os.getcwd()))
    sys.path.append(path)
    print(sys.path)

    import time
    from absl import app, flags, logging
    from absl.flags import FLAGS
    import cv2
    import numpy as np
    import tensorflow as tf
    from yolov3_tf2.models import (
        YoloV3, YoloV3Tiny
    )
    from yolov3_tf2.dataset import load_tfrecord_dataset, transform_images
    from yolov3_tf2.utils import draw_outputs

    flags.DEFINE_string('classes', './data/voc2012_trainbin.names', 'path to classes file')
    flags.DEFINE_integer('size', 416, 'resize images to')
    flags.DEFINE_string(
        'dataset', './data/voc_train_trashBin.tfrecord', 'path to dataset')
    flags.DEFINE_string('output', './output.jpg', 'path to output image')


    def main(_argv):
        class_names = [c.strip() for c in open(FLAGS.classes).readlines()]
        logging.info('classes loaded')

        dataset = load_tfrecord_dataset(FLAGS.dataset, FLAGS.classes, FLAGS.size)
        dataset = dataset.shuffle(512)

        for image, labels in dataset.take(4):
            boxes = []
            scores = []
            classes = []
            for x1, y1, x2, y2, label in labels:
                # All-zero rows are padding, not real boxes
                if x1 == 0 and x2 == 0:
                    continue

                boxes.append((x1, y1, x2, y2))
                scores.append(1)
                classes.append(label)
            nums = [len(boxes)]
            boxes = [boxes]
            scores = [scores]
            classes = [classes]

            logging.info('labels:')
            for i in range(nums[0]):
                logging.info('\t{}, {}, {}'.format(class_names[int(classes[0][i])],
                                                   np.array(scores[0][i]),
                                                   np.array(boxes[0][i])))

            img = cv2.cvtColor(image.numpy(), cv2.COLOR_RGB2BGR)
            img = draw_outputs(img, (boxes, scores, classes, nums), class_names)
            cv2.imwrite(FLAGS.output, img)
            logging.info('output saved to: {}'.format(FLAGS.output))


    if __name__ == '__main__':
        app.run(main)

I have to take a closer look at this piece of code.
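One detail worth isolating is the padding in `parse_tfrecord`: images contain different numbers of objects, but batching needs every label tensor to have the same shape, so the rows are padded with zeros up to `yolo_max_boxes`. The visualization script later skips those rows with the `x1 == 0 and x2 == 0` check. A plain-Python sketch of the same idea, with hypothetical helper names:

```python
def pad_boxes(boxes, max_boxes):
    """Pad a list of [xmin, ymin, xmax, ymax, label] rows with zero rows."""
    if len(boxes) > max_boxes:
        raise ValueError("image has more boxes than max_boxes")
    padding = [[0.0] * 5 for _ in range(max_boxes - len(boxes))]
    return boxes + padding


def strip_padding(rows):
    """Drop padded rows, using the same test as the visualization script."""
    return [row for row in rows if not (row[0] == 0 and row[2] == 0)]


labels = [[0.1, 0.2, 0.4, 0.6, 3.0], [0.5, 0.5, 0.9, 0.8, 7.0]]
padded = pad_boxes(labels, max_boxes=5)
print(len(padded))                      # 5
print(strip_padding(padded) == labels)  # True
```

A consequence of this convention is that a real box with `xmin == 0` and `xmax == 0` would be indistinguishable from padding, which is harmless because such a box has zero width.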

Load data

After loading the data, you need to shuffle it, batch it and resize it.

    train_dataset = train_dataset.shuffle(buffer_size=512)
    train_dataset = train_dataset.batch(FLAGS.batch_size)
    train_dataset = train_dataset.map(lambda x, y: (
        dataset.transform_images(x, FLAGS.size),
        dataset.transform_targets(y, anchors, anchor_masks, FLAGS.size)))
    train_dataset = train_dataset.prefetch(
        buffer_size=tf.data.experimental.AUTOTUNE)
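`transform_targets` is not shown in this post. Conceptually it has to decide, for every ground-truth box, which anchor and which grid cell are responsible for it. The sketch below illustrates that assignment for one box in plain Python. It assumes the standard YOLOv3 anchors (in pixels at 416x416) and anchor masks; the `anchors` and `anchor_masks` actually passed in the code above may differ, and the function name `assign_box` is made up for this example.

```python
# Assumed values: the standard YOLOv3 anchors, in pixels for a 416x416 input.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
# Assumed masks: which anchors belong to the coarse, medium and fine output.
ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]


def assign_box(box, size=416):
    """Pick the responsible output scale, grid cell and anchor for one box.

    box -- (xmin, ymin, xmax, ymax), normalized to [0, 1]
    Returns (scale index, grid row, grid column, anchor slot in that scale).
    """
    xmin, ymin, xmax, ymax = box
    w, h = xmax - xmin, ymax - ymin
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2

    def iou_wh(anchor):
        # IoU of the box and the anchor when both share the same corner.
        aw, ah = anchor[0] / size, anchor[1] / size
        inter = min(w, aw) * min(h, ah)
        return inter / (w * h + aw * ah - inter)

    best = max(range(len(ANCHORS)), key=lambda i: iou_wh(ANCHORS[i]))
    for scale, mask in enumerate(ANCHOR_MASKS):
        if best in mask:
            grid = (size // 32) * 2 ** scale   # 13, 26, 52 for size=416
            col = min(int(cx * grid), grid - 1)
            row = min(int(cy * grid), grid - 1)
            return scale, row, col, mask.index(best)


print(assign_box((0.1, 0.1, 0.9, 0.9)))    # (0, 6, 6, 2): large box, 13x13 grid
print(assign_box((0.5, 0.5, 0.53, 0.53)))  # (2, 26, 26, 0): small box, 52x52 grid
```

Large objects end up on the coarse 13x13 grid and small objects on the fine 52x52 grid, which is how the three outputs divide the work.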

Model loading

Yolov3

With the network structure diagram above in mind, the code is easy to follow.

    def DarknetResidual(x, filters):
        prev = x
        x = DarknetConv(x, filters // 2, 1)
        x = DarknetConv(x, filters, 3)
        x = Add()([prev, x])
        return x


    def DarknetBlock(x, filters, blocks):
        x = DarknetConv(x, filters, 3, strides=2)
        for _ in range(blocks):
            x = DarknetResidual(x, filters)
        return x


    def Darknet(name=None):
        x = inputs = Input([None, None, 3])
        x = DarknetConv(x, 32, 3)
        x = DarknetBlock(x, 64, 1)
        x = DarknetBlock(x, 128, 2)
        # skip connection
        x = x_36 = DarknetBlock(x, 256, 8)
        # skip connection
        x = x_61 = DarknetBlock(x, 512, 8)
        x = DarknetBlock(x, 1024, 4)
        return tf.keras.Model(inputs, (x_36, x_61, x), name=name)
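Each `DarknetBlock` starts with a stride-2 convolution, so the spatial size halves five times across the backbone. A quick way to check which grid each of the three returned feature maps corresponds to is to do the arithmetic by hand. This helper is only illustrative and is not part of the repository:

```python
def darknet_output_grids(size):
    """Grid sizes of the three feature maps returned by Darknet."""
    if size % 32 != 0:
        raise ValueError("input size should be a multiple of 32")
    # x_36 is taken after 3 downsamplings, x_61 after 4, x after 5.
    strides = {"x_36": 8, "x_61": 16, "x": 32}
    return {name: size // stride for name, stride in strides.items()}


def count_conv_layers(blocks=(1, 2, 8, 8, 4)):
    """One stem conv, then per block one stride-2 conv plus two per residual unit."""
    return 1 + sum(1 + 2 * b for b in blocks)


print(darknet_output_grids(416))  # {'x_36': 52, 'x_61': 26, 'x': 13}
print(count_conv_layers())        # 52
```

The 52 convolutional layers plus the fully connected layer of the original classifier are where the name Darknet-53 comes from.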
    def YoloV3(size=None, channels=3, anchors=yolo_anchors,
               masks=yolo_anchor_masks, classes=80, training=False):
        x = inputs = Input([size, size, channels], name='input')

        x_36, x_61, x = Darknet(name='yolo_darknet')(x)

        x = YoloConv(512, name='yolo_conv_0')(x)
        output_0 = YoloOutput(512, len(masks[0]), classes, name='yolo_output_0')(x)

        # concat with x_61 is done inside YoloConv
        x = YoloConv(256, name='yolo_conv_1')((x, x_61))
        output_1 = YoloOutput(256, len(masks[1]), classes, name='yolo_output_1')(x)

        # concat with x_36 is done here as well; YoloConv returns a callable
        # (a function defined inside a function)
        x = YoloConv(128, name='yolo_conv_2')((x, x_36))
        output_2 = YoloOutput(128, len(masks[2]), classes, name='yolo_output_2')(x)

        # For training, return the model with the raw outputs;
        # otherwise decode the boxes and apply NMS
        if training:
            return Model(inputs, (output_0, output_1, output_2), name='yolov3')

        boxes_0 = Lambda(lambda x: yolo_boxes(x, anchors[masks[0]], classes),
                         name='yolo_boxes_0')(output_0)
        boxes_1 = Lambda(lambda x: yolo_boxes(x, anchors[masks[1]], classes),
                         name='yolo_boxes_1')(output_1)
        boxes_2 = Lambda(lambda x: yolo_boxes(x, anchors[masks[2]], classes),
                         name='yolo_boxes_2')(output_2)

        outputs = Lambda(lambda x: yolo_nms(x, anchors, masks, classes),
                         name='yolo_nms')((boxes_0[:3], boxes_1[:3], boxes_2[:3]))

        return Model(inputs, outputs, name='yolov3')


    # As tensorflow lite doesn't support tf.size used in tf.meshgrid,
    # we reimplemented a simple meshgrid function that use basic tf function.
    def _meshgrid(n_a, n_b):
        return [
            tf.reshape(tf.tile(tf.range(n_a), [n_b]), (n_b, n_a)),
            tf.reshape(tf.repeat(tf.range(n_b), n_a), (n_b, n_a))
        ]


    def yolo_boxes(pred, anchors, classes):
        # pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...classes))
        grid_size = tf.shape(pred)[1:3]
        box_xy, box_wh, objectness, class_probs = tf.split(
            pred, (2, 2, 1, classes), axis=-1)

        # Squash the predictions into the [0, 1] interval
        box_xy = tf.sigmoid(box_xy)
        objectness = tf.sigmoid(objectness)
        class_probs = tf.sigmoid(class_probs)
        pred_box = tf.concat((box_xy, box_wh), axis=-1)  # original xywh for loss

        # !!! grid[x][y] == (y, x)
        grid = _meshgrid(grid_size[1], grid_size[0])
        grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)  # [gx, gy, 1, 2]

        box_xy = (box_xy + tf.cast(grid, tf.float32)) / \
            tf.cast(grid_size, tf.float32)
        # exp undoes the log that was applied to width/height when the targets were built
        box_wh = tf.exp(box_wh) * anchors

        box_x1y1 = box_xy - box_wh / 2
        box_x2y2 = box_xy + box_wh / 2
        bbox = tf.concat([box_x1y1, box_x2y2], axis=-1)

        return bbox, objectness, class_probs, pred_box
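`YoloV3` passes the decoded boxes to `yolo_nms`, which is not shown in this post. To make the idea concrete, here is a generic greedy non-maximum suppression in plain Python. It is a textbook sketch, not the repository's implementation, and the thresholds are arbitrary example values.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.5):
    """Return the indices of the boxes to keep, best score first."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(nms(boxes, scores))  # [0, 2]
```

Box 1 overlaps box 0 with an IoU of about 0.68 and is suppressed; box 3 is discarded by the score threshold before any overlap test.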

train

The training loop below is written in eager mode, that is, debugging mode. I doubt I could have written code like this myself.

    if FLAGS.mode == 'eager_tf':
        # Eager mode is great for debugging
        # Non eager graph mode is recommended for real training
        avg_loss = tf.keras.metrics.Mean('loss', dtype=tf.float32)
        avg_val_loss = tf.keras.metrics.Mean('val_loss', dtype=tf.float32)

        for epoch in range(1, FLAGS.epochs + 1):
            for batch, (images, labels) in enumerate(train_dataset):
                with tf.GradientTape() as tape:
                    outputs = model(images, training=True)
                    regularization_loss = tf.reduce_sum(model.losses)
                    pred_loss = []
                    for output, label, loss_fn in zip(outputs, labels, loss):
                        pred_loss.append(loss_fn(label, output))
                    total_loss = tf.reduce_sum(pred_loss) + regularization_loss

                grads = tape.gradient(total_loss, model.trainable_variables)
                optimizer.apply_gradients(
                    zip(grads, model.trainable_variables))

                logging.info("{}_train_{}, {}, {}".format(
                    epoch, batch, total_loss.numpy(),
                    list(map(lambda x: np.sum(x.numpy()), pred_loss))))
                avg_loss.update_state(total_loss)

            for batch, (images, labels) in enumerate(val_dataset):
                outputs = model(images)
                regularization_loss = tf.reduce_sum(model.losses)
                pred_loss = []
                for output, label, loss_fn in zip(outputs, labels, loss):
                    pred_loss.append(loss_fn(label, output))
                total_loss = tf.reduce_sum(pred_loss) + regularization_loss

                logging.info("{}_val_{}, {}, {}".format(
                    epoch, batch, total_loss.numpy(),
                    list(map(lambda x: np.sum(x.numpy()), pred_loss))))
                avg_val_loss.update_state(total_loss)

            logging.info("{}, train: {}, val: {}".format(
                epoch,
                avg_loss.result().numpy(),
                avg_val_loss.result().numpy()))

            avg_loss.reset_states()
            avg_val_loss.reset_states()
            model.save_weights(
                'checkpoints/yolov3_train_{}.tf'.format(epoch))

Transfer Learning

    # Configure the model for transfer learning
    if FLAGS.transfer == 'none':
        pass  # Nothing to do
    elif FLAGS.transfer in ['darknet', 'no_output']:
        # Darknet transfer is a special case that works
        # with incompatible number of classes

        # reset top layers
        if FLAGS.tiny:
            model_pretrained = YoloV3Tiny(
                FLAGS.size, training=True,
                classes=FLAGS.weights_num_classes or FLAGS.num_classes)
        else:
            model_pretrained = YoloV3(
                FLAGS.size, training=True,
                classes=FLAGS.weights_num_classes or FLAGS.num_classes)
        # finetune
        if FLAGS.pretrain:
            model_pretrained.load_weights(FLAGS.weights)

        if FLAGS.transfer == 'darknet':
            model.get_layer('yolo_darknet').set_weights(
                model_pretrained.get_layer('yolo_darknet').get_weights())
            freeze_all(model.get_layer('yolo_darknet'))

        elif FLAGS.transfer == 'no_output':
            for l in model.layers:
                if not l.name.startswith('yolo_output'):
                    l.set_weights(model_pretrained.get_layer(
                        l.name).get_weights())
                    freeze_all(l)

    else:
        # All other transfer require matching classes
        model.load_weights(FLAGS.weights)
        if FLAGS.transfer == 'fine_tune':
            # freeze darknet and fine tune other layers
            darknet = model.get_layer('yolo_darknet')
            freeze_all(darknet)
        elif FLAGS.transfer == 'frozen':
            # freeze everything
            freeze_all(model)
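`freeze_all` is used above but not shown. The idea is to switch off `trainable` on a layer and, recursively, on everything nested inside it. The sketch below demonstrates that behaviour with a hypothetical `Layer` stand-in class rather than real Keras objects, so the class and the helper names are assumptions made for illustration only.

```python
class Layer:
    """Hypothetical stand-in for a Keras layer or nested model (not the real API)."""

    def __init__(self, name, layers=()):
        self.name = name
        self.layers = list(layers)  # sub-layers; empty for a plain layer
        self.trainable = True


def freeze_all(layer, frozen=True):
    """Mark a layer and everything nested inside it as (un)trainable."""
    layer.trainable = not frozen
    for child in layer.layers:
        freeze_all(child, frozen)


def build_model():
    backbone = Layer("yolo_darknet", [Layer("conv_0"), Layer("conv_1")])
    return Layer("yolov3", [backbone, Layer("yolo_conv_0"), Layer("yolo_output_0")])


def trainable_names(model):
    return [l.name for l in model.layers if l.trainable]


# 'darknet' mode: only the backbone is frozen.
model = build_model()
freeze_all(model.layers[0])
print(trainable_names(model))  # ['yolo_conv_0', 'yolo_output_0']

# 'no_output' mode: everything except the output layers is frozen.
model = build_model()
for l in model.layers:
    if not l.name.startswith("yolo_output"):
        freeze_all(l)
print(trainable_names(model))  # ['yolo_output_0']
```

The two modes differ only in how much of the pretrained network is kept fixed: the backbone alone, or everything up to the class-dependent output layers.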

Optimization function

Using Adam

model.compile(optimizer=optimizer, loss=loss, run_eagerly=(FLAGS.mode == 'eager_fit'))

loss function

    def YoloLoss(anchors, classes=80, ignore_thresh=0.5):
        def yolo_loss(y_true, y_pred):
            # 1. transform all pred outputs
            # y_pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...cls))
            pred_box, pred_obj, pred_class, pred_xywh = yolo_boxes(
                y_pred, anchors, classes)
            pred_xy = pred_xywh[..., 0:2]
            pred_wh = pred_xywh[..., 2:4]

            # 2. transform all true outputs
            # y_true: (batch_size, grid, grid, anchors, (x1, y1, x2, y2, obj, cls))
            true_box, true_obj, true_class_idx = tf.split(
                y_true, (4, 1, 1), axis=-1)
            true_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2
            true_wh = true_box[..., 2:4] - true_box[..., 0:2]

            # give higher weights to small boxes
            box_loss_scale = 2 - true_wh[..., 0] * true_wh[..., 1]

            # 3. inverting the pred box equations
            grid_size = tf.shape(y_true)[1]
            grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
            grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
            true_xy = true_xy * tf.cast(grid_size, tf.float32) - \
                tf.cast(grid, tf.float32)
            true_wh = tf.math.log(true_wh / anchors)
            true_wh = tf.where(tf.math.is_inf(true_wh),
                               tf.zeros_like(true_wh), true_wh)

            # 4. calculate all masks
            obj_mask = tf.squeeze(true_obj, -1)
            # ignore false positive when iou is over threshold
            best_iou = tf.map_fn(
                lambda x: tf.reduce_max(broadcast_iou(x[0], tf.boolean_mask(
                    x[1], tf.cast(x[2], tf.bool))), axis=-1),
                (pred_box, true_box, obj_mask),
                tf.float32)
            ignore_mask = tf.cast(best_iou < ignore_thresh, tf.float32)

            # 5. calculate all losses
            # Coordinates use squared error
            xy_loss = obj_mask * box_loss_scale * \
                tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
            wh_loss = obj_mask * box_loss_scale * \
                tf.reduce_sum(tf.square(true_wh - pred_wh), axis=-1)
            # Confidence uses binary cross entropy
            obj_loss = binary_crossentropy(true_obj, pred_obj)
            obj_loss = obj_mask * obj_loss + \
                (1 - obj_mask) * ignore_mask * obj_loss
            # TODO: use binary_crossentropy instead
            # Category uses cross entropy loss
            class_loss = obj_mask * sparse_categorical_crossentropy(
                true_class_idx, pred_class)

            # 6. sum over (batch, gridx, gridy, anchors) => (batch, 1)
            xy_loss = tf.reduce_sum(xy_loss, axis=(1, 2, 3))
            wh_loss = tf.reduce_sum(wh_loss, axis=(1, 2, 3))
            obj_loss = tf.reduce_sum(obj_loss, axis=(1, 2, 3))
            class_loss = tf.reduce_sum(class_loss, axis=(1, 2, 3))

            return xy_loss + wh_loss + obj_loss + class_loss
        return yolo_loss
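Two parts of the loss are easy to misread: the `box_loss_scale` weight and the way `ignore_mask` enters the objectness term. The scalar sketch below reproduces both for a single grid cell in plain Python. It mirrors the formulas in the code above, but the function names and the example numbers are made up for illustration.

```python
import math


def box_loss_scale(w, h):
    """Weight for the coordinate loss; w and h are normalized to [0, 1]."""
    return 2.0 - w * h  # close to 2 for tiny boxes, 1 for a full-image box


def binary_crossentropy(y_true, p, eps=1e-7):
    """Binary cross entropy for one probability, clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


def objectness_loss(true_obj, pred_obj, best_iou, ignore_thresh=0.5):
    """Objectness term for one cell/anchor, as in the loss above."""
    obj_mask = true_obj
    # Note the naming: ignore_mask is 1 when the cell is NOT ignored.
    ignore_mask = 1.0 if best_iou < ignore_thresh else 0.0
    bce = binary_crossentropy(true_obj, pred_obj)
    return obj_mask * bce + (1.0 - obj_mask) * ignore_mask * bce


# Positive cell: always penalized by how far pred_obj is from 1.
print(round(objectness_loss(1.0, 0.9, best_iou=0.8), 3))  # 0.105
# Negative cell whose prediction overlaps some object well: not penalized.
print(objectness_loss(0.0, 0.9, best_iou=0.7))            # 0.0
# Negative cell with a confident prediction far from any object: penalized.
print(round(objectness_loss(0.0, 0.9, best_iou=0.1), 3))  # 2.303
```

The middle case is the reason for the mask: a prediction that overlaps a real object well but is not the assigned anchor is neither rewarded nor punished.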

4 November 2021, 04:51 | Views: 10020
