Introduction to the forest pest dataset and data preprocessing methods
This course uses the insect dataset from the forestry pest control project jointly developed by Baidu and Forestry University.
Reading the annotation information of the AI insect recognition dataset
The structure of the AI insect recognition dataset is as follows:
 2183 pictures are provided: 1693 for training, 245 for validation, and 245 for testing.
 It contains 7 categories of insects, namely Boerner, Leconte, Linnaeus, acuminatus, armandi, coleoptera, and linnaeus.
 It contains pictures and labels. Please decompress the data and store it in the insects directory.
!conda list
# packages in environment at C:\ProgramData\Anaconda3\envs\paddle:
#
# Name               Version          Build             Channel
astor                0.8.1            pypi_0            pypi
backcall             0.2.0            pyhd3eb1b0_0      defaults
certifi              2021.5.30        py36haa95532_0    defaults
charset-normalizer   2.0.7            pypi_0            pypi
colorama             0.4.4            pyhd3eb1b0_0      defaults
decorator            5.1.0            pyhd3eb1b0_0      defaults
entrypoints          0.3              py36_0            defaults
gast                 0.3.3            pypi_0            pypi
idna                 3.3              pypi_0            pypi
ipykernel            5.3.4            py36h5ca1d4c_0    defaults
ipython              7.16.1           py36h5ca1d4c_0    defaults
ipython_genutils     0.2.0            pyhd3eb1b0_1      defaults
jedi                 0.17.0           py36_0            defaults
jupyter_client       7.0.1            pyhd3eb1b0_0      defaults
jupyter_core         4.8.1            py36haa95532_0    defaults
nest-asyncio         1.5.1            pyhd3eb1b0_0      defaults
numpy                1.19.3           pypi_0            pypi
paddlepaddle-gpu     2.1.3.post101    pypi_0            pypi
parso                0.8.2            pyhd3eb1b0_0      defaults
pickleshare          0.7.5            pyhd3eb1b0_1003   defaults
pillow               8.4.0            pypi_0            pypi
pip                  21.2.2           py36haa95532_0    defaults
prompt-toolkit       3.0.20           pyhd3eb1b0_0      defaults
protobuf             3.19.1           pypi_0            pypi
pygments             2.10.0           pyhd3eb1b0_0      defaults
python               3.6.13           h3758d61_0        defaults
python-dateutil      2.8.2            pyhd3eb1b0_0      defaults
pywin32              228              py36hbaba5e8_1    defaults
pyzmq                22.2.1           py36hd77b12b_1    defaults
requests             2.26.0           pypi_0            pypi
setuptools           58.0.4           py36haa95532_0    defaults
six                  1.16.0           pyhd3eb1b0_0      defaults
sqlite               3.36.0           h2bbff1b_0        defaults
tornado              6.1              py36h2bbff1b_0    defaults
traitlets            4.3.3            py36haa95532_0    defaults
urllib3              1.26.7           pypi_0            pypi
vc                   14.2             h21ff451_1        defaults
vs2015_runtime       14.27.29016      h5e58377_2        defaults
wcwidth              0.2.5            pyhd3eb1b0_0      defaults
wheel                0.37.0           pyhd3eb1b0_1      defaults
wincertstore         0.2              py36h7fe50ca_0    defaults
After decompressing the data, you can see the structure under the insects directory as follows.
insects
    train
        annotations
            xmls
                100.xml
                101.xml
                ...
        images
            100.jpeg
            101.jpeg
            ...
    val
        annotations
            xmls
                1221.xml
                1277.xml
                ...
        images
            1221.jpeg
            1277.jpeg
            ...
    test
        images
            1833.jpeg
            1838.jpeg
            ...
insects contains three folders: train, val and test. The labels of pictures are stored in the train/annotations/xmls directory. Each xml file is a description of a picture, including the size of the picture, the name of the insect contained, the location on the picture and other information.
<annotation>
    <folder>Liu Feifei</folder>
    <filename>100.jpeg</filename>
    <path>/home/fion/desktop/Liu Feifei/100.jpeg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>1336</width>
        <height>1336</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>Boerner</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>500</xmin>
            <ymin>893</ymin>
            <xmax>656</xmax>
            <ymax>966</ymax>
        </bndbox>
    </object>
    <object>
        <name>Leconte</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>622</xmin>
            <ymin>490</ymin>
            <xmax>756</xmax>
            <ymax>610</ymax>
        </bndbox>
    </object>
    <object>
        <name>armandi</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>432</xmin>
            <ymin>663</ymin>
            <xmax>517</xmax>
            <ymax>729</ymax>
        </bndbox>
    </object>
    <object>
        <name>coleoptera</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>624</xmin>
            <ymin>685</ymin>
            <xmax>697</xmax>
            <ymax>771</ymax>
        </bndbox>
    </object>
    <object>
        <name>linnaeus</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>783</xmin>
            <ymin>700</ymin>
            <xmax>856</xmax>
            <ymax>802</ymax>
        </bndbox>
    </object>
</annotation>
The main parameters in the xml file listed above are described as follows:

size: the picture size (width, height, and depth).

object: an object contained in the picture. A picture may contain multiple objects.
– name: the insect name;
– bndbox: the ground-truth bounding box of the object (xmin, ymin, xmax, ymax);
– difficult: whether the object is difficult to recognize.
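As a quick illustration of these fields, the sketch below parses a trimmed annotation with the standard-library xml.etree.ElementTree module. The SAMPLE_XML string and the parse_objects helper are hypothetical names introduced only for this example; the tag layout follows the sample file above.

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical annotation that follows the layout of the sample file.
SAMPLE_XML = """
<annotation>
  <size><width>1336</width><height>1336</height><depth>3</depth></size>
  <object>
    <name>Boerner</name>
    <difficult>0</difficult>
    <bndbox><xmin>500</xmin><ymin>893</ymin><xmax>656</xmax><ymax>966</ymax></bndbox>
  </object>
</annotation>
"""

def parse_objects(xml_text):
    """Return (width, height) and a list of (name, difficult, [xmin, ymin, xmax, ymax])."""
    root = ET.fromstring(xml_text.strip())
    size = root.find('size')
    width = int(size.find('width').text)
    height = int(size.find('height').text)
    objects = []
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        corners = [int(box.find(tag).text) for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append((obj.find('name').text, int(obj.find('difficult').text), corners))
    return (width, height), objects

image_size, objects = parse_objects(SAMPLE_XML)
print(image_size, objects)
```

Each object entry carries its own name, difficult flag, and box, so a picture with several insects simply yields several tuples.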
Next, we read the xml files from the dataset to obtain the annotation information of each picture. Before reading the annotation files, we first convert the insect category names (strings) into numeric categories, because a neural network computes on numerical inputs. The list of insect category names is: ['Boerner', 'Leconte', 'Linnaeus', 'acuminatus', 'armandi', 'coleoptera', 'linnaeus']. We agree that in this list 'Boerner' corresponds to category 0, 'Leconte' corresponds to category 1, ..., and 'linnaeus' corresponds to category 6. The following program builds a dictionary that maps name strings to numeric categories.
INSECT_NAMES = ['Boerner', 'Leconte', 'Linnaeus',
                'acuminatus', 'armandi', 'coleoptera', 'linnaeus']

def get_insect_names():
    """
    return a dict, as following,
        {'Boerner': 0,
         'Leconte': 1,
         'Linnaeus': 2,
         'acuminatus': 3,
         'armandi': 4,
         'coleoptera': 5,
         'linnaeus': 6
        }
    It can map the insect name into an integer label.
    """
    insect_category2id = {}
    for i, item in enumerate(INSECT_NAMES):
        insect_category2id[item] = i

    return insect_category2id

cname2cid = get_insect_names()
cname2cid
{'Boerner': 0, 'Leconte': 1, 'Linnaeus': 2, 'acuminatus': 3, 'armandi': 4, 'coleoptera': 5, 'linnaeus': 6}
Calling the get_insect_names function returns a dict that maps insect names to numeric categories. The following program reads the annotation information of all files in the annotations/xmls directory.
import os
import numpy as np
import xml.etree.ElementTree as ET

def get_annotations(cname2cid, datadir):
    filenames = os.listdir(os.path.join(datadir, 'annotations', 'xmls'))
    records = []
    ct = 0
    for fname in filenames:
        fid = fname.split('.')[0]
        fpath = os.path.join(datadir, 'annotations', 'xmls', fname)
        img_file = os.path.join(datadir, 'images', fid + '.jpeg')
        tree = ET.parse(fpath)

        if tree.find('id') is None:
            im_id = np.array([ct])
        else:
            im_id = np.array([int(tree.find('id').text)])

        objs = tree.findall('object')
        im_w = float(tree.find('size').find('width').text)
        im_h = float(tree.find('size').find('height').text)
        gt_bbox = np.zeros((len(objs), 4), dtype=np.float32)
        gt_class = np.zeros((len(objs), ), dtype=np.int32)
        is_crowd = np.zeros((len(objs), ), dtype=np.int32)
        difficult = np.zeros((len(objs), ), dtype=np.int32)
        for i, obj in enumerate(objs):
            cname = obj.find('name').text
            gt_class[i] = cname2cid[cname]
            _difficult = int(obj.find('difficult').text)
            x1 = float(obj.find('bndbox').find('xmin').text)
            y1 = float(obj.find('bndbox').find('ymin').text)
            x2 = float(obj.find('bndbox').find('xmax').text)
            y2 = float(obj.find('bndbox').find('ymax').text)
            # Clip the box to the image boundary
            x1 = max(0, x1)
            y1 = max(0, y1)
            x2 = min(im_w - 1, x2)
            y2 = min(im_h - 1, y2)
            # Here, xywh format (center x, center y, width, height) is used
            # to represent the ground-truth box of the target object
            gt_bbox[i] = [(x1+x2)/2.0, (y1+y2)/2.0, x2-x1+1., y2-y1+1.]
            is_crowd[i] = 0
            difficult[i] = _difficult

        voc_rec = {
            'im_file': img_file,
            'im_id': im_id,
            'h': im_h,
            'w': im_w,
            'is_crowd': is_crowd,
            'gt_class': gt_class,
            'gt_bbox': gt_bbox,
            'gt_poly': [],
            'difficult': difficult
            }
        if len(objs) != 0:
            records.append(voc_rec)
        ct += 1
    return records

TRAINDIR = './insects/train'
TESTDIR = './insects/test'
VALIDDIR = './insects/val'
cname2cid = get_insect_names()
records = get_annotations(cname2cid, TRAINDIR)
records[0]
{'im_file': './insects/train\\images\\1.jpeg', 'im_id': array([0]), 'h': 1344.0, 'w': 1344.0, 'is_crowd': array([0, 0, 0, 0, 0]), 'gt_class': array([1, 0, 6, 4, 5]), 'gt_bbox': array([[542.5, 652.5, 140. , 150. ], [885. , 572. , 127. , 135. ], [648.5, 811.5, 84. , 62. ], [798.5, 821. , 86. , 71. ], [667.5, 521. , 88. , 67. ]], dtype=float32), 'gt_poly': [], 'difficult': array([0, 0, 0, 0, 0])}
len(records)
1693
The above program reads the annotations of the whole training set and stores them in the records list. Each element holds the annotation of one picture, including the picture path, picture id, picture height and width, and the categories and positions of the target objects contained in the picture.
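To make the box format concrete, here is a minimal sketch of the corner-to-center conversion that get_annotations applies, run on the first object of the sample annotation (xmin=500, ymin=893, xmax=656, ymax=966 in a 1336 x 1336 picture). The helper name corners_to_xywh is an assumption made for this example.

```python
def corners_to_xywh(x1, y1, x2, y2, im_w, im_h):
    """Clip a corner-format box to the image, then return [center_x, center_y, width, height]."""
    x1, y1 = max(0.0, x1), max(0.0, y1)
    x2, y2 = min(im_w - 1.0, x2), min(im_h - 1.0, y2)
    # The +1 matches the pixel-inclusive width/height convention used in get_annotations.
    return [(x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1 + 1.0, y2 - y1 + 1.0]

box = corners_to_xywh(500, 893, 656, 966, im_w=1336, im_h=1336)
print(box)  # [578.0, 929.5, 157.0, 74.0]
```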
Data reading and preprocessing
Data preprocessing is a very important step in training neural networks. Appropriate preprocessing helps the model converge better and reduces overfitting. First, we read the data from disk, and then we preprocess it. To keep training fast, the data preprocessing usually needs to be accelerated as well.
Data reading
All the description information of the pictures has already been saved in records, and each element contains the description of one picture. The following program shows how to read a picture and its labels according to the description in records.
# Data reading
import cv2

def get_bbox(gt_bbox, gt_class):
    # For general detection tasks, a picture often contains multiple target objects.
    # Set MAX_NUM = 50, that is, a picture can hold at most 50 ground-truth boxes.
    # If there are fewer than 50 boxes, the remaining entries of gt_bbox and
    # gt_class are set to 0.
    MAX_NUM = 50
    gt_bbox2 = np.zeros((MAX_NUM, 4))
    gt_class2 = np.zeros((MAX_NUM,))
    for i in range(len(gt_bbox)):
        # Stop before writing past the end of the fixed-size arrays
        if i >= MAX_NUM:
            break
        gt_bbox2[i, :] = gt_bbox[i, :]
        gt_class2[i] = gt_class[i]
    return gt_bbox2, gt_class2

def get_img_data_from_file(record):
    """
    record is a dict as following,
      record = {
            'im_file': img_file,
            'im_id': im_id,
            'h': im_h,
            'w': im_w,
            'is_crowd': is_crowd,
            'gt_class': gt_class,
            'gt_bbox': gt_bbox,
            'gt_poly': [],
            'difficult': difficult
            }
    """
    im_file = record['im_file']
    h = record['h']
    w = record['w']
    is_crowd = record['is_crowd']
    gt_class = record['gt_class']
    gt_bbox = record['gt_bbox']
    difficult = record['difficult']

    img = cv2.imread(im_file)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # check if h and w in record equals that read from img
    assert img.shape[0] == int(h), \
             "image height of {} inconsistent in record({}) and img file({})".format(
               im_file, h, img.shape[0])

    assert img.shape[1] == int(w), \
             "image width of {} inconsistent in record({}) and img file({})".format(
               im_file, w, img.shape[1])

    gt_boxes, gt_labels = get_bbox(gt_bbox, gt_class)

    # Convert gt_boxes to values relative to the image size
    gt_boxes[:, 0] = gt_boxes[:, 0] / float(w)
    gt_boxes[:, 1] = gt_boxes[:, 1] / float(h)
    gt_boxes[:, 2] = gt_boxes[:, 2] / float(w)
    gt_boxes[:, 3] = gt_boxes[:, 3] / float(h)

    return img, gt_boxes, gt_labels, (h, w)

record = records[0]
img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)
img.shape
(1344, 1344, 3)
gt_boxes.shape
(50, 4)
gt_labels
array([1., 0., 6., 4., 5., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
scales
(1344.0, 1344.0)
The get_img_data_from_file() function returns the data of one picture: the image data img, the ground-truth box coordinates gt_boxes, the categories gt_labels of the objects in the ground-truth boxes, and the image size scales.
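The padding to a fixed number of boxes and the conversion to relative coordinates can also be written in vectorized form. The following is only a sketch under that assumption; pad_and_normalize is a hypothetical helper, not part of the tutorial code, and the demo box is made up so that the results are easy to check by hand.

```python
import numpy as np

def pad_and_normalize(boxes_xywh, labels, im_w, im_h, max_num=50):
    """Pad boxes/labels with zeros up to max_num and scale xywh boxes to [0, 1]."""
    boxes = np.asarray(boxes_xywh, dtype=np.float32).reshape(-1, 4)
    labels = np.asarray(labels, dtype=np.int32).reshape(-1)
    n = min(len(boxes), max_num)  # extra boxes beyond max_num are dropped
    padded_boxes = np.zeros((max_num, 4), dtype=np.float32)
    padded_labels = np.zeros((max_num,), dtype=np.int32)
    padded_boxes[:n] = boxes[:n]
    padded_labels[:n] = labels[:n]
    padded_boxes[:, [0, 2]] /= float(im_w)  # center x and width
    padded_boxes[:, [1, 3]] /= float(im_h)  # center y and height
    return padded_boxes, padded_labels

demo_boxes, demo_labels = pad_and_normalize(
    [[668.0, 668.0, 334.0, 167.0]], [3], im_w=1336, im_h=1336)
print(demo_boxes[0], demo_labels[:3])
```

Note that padded rows have label 0, which is also the id of 'Boerner', so downstream code has to tell padding apart by the all-zero box rather than by the label.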
Data preprocessing
In computer vision, images are usually transformed randomly to produce similar but not identical samples. This expands the training dataset, suppresses overfitting, and improves the generalization ability of the model. Common methods are:
 Randomly change brightness, contrast, and color
 Random padding (expansion)
 Random cropping
 Random scaling
 Random flipping
 Randomly shuffle the order of the ground-truth boxes
Next, we use numpy to implement these data augmentation methods.
Randomly change brightness, contrast, color, etc
import numpy as np
import cv2
from PIL import Image, ImageEnhance
import random

# Randomly change brightness, contrast, color, etc.
def random_distort(img):
    # Randomly change brightness
    def random_brightness(img, lower=0.5, upper=1.5):
        e = np.random.uniform(lower, upper)
        return ImageEnhance.Brightness(img).enhance(e)
    # Randomly change contrast
    def random_contrast(img, lower=0.5, upper=1.5):
        e = np.random.uniform(lower, upper)
        return ImageEnhance.Contrast(img).enhance(e)
    # Randomly change color
    def random_color(img, lower=0.5, upper=1.5):
        e = np.random.uniform(lower, upper)
        return ImageEnhance.Color(img).enhance(e)

    # Apply the three operations in random order
    ops = [random_brightness, random_contrast, random_color]
    np.random.shuffle(ops)

    img = Image.fromarray(img)
    img = ops[0](img)
    img = ops[1](img)
    img = ops[2](img)
    img = np.asarray(img)

    return img

# Define a visualization function to compare the original image with the augmented image
import matplotlib.pyplot as plt

def visualize(srcimg, img_enhance):
    # Image visualization
    plt.figure(num=2, figsize=(6,12))
    plt.subplot(1,2,1)
    plt.title('Src Image', color='#0000FF')
    plt.axis('off') # Do not display axes
    plt.imshow(srcimg) # Show original picture

    srcimg_gtbox = records[0]['gt_bbox']
    srcimg_label = records[0]['gt_class']

    plt.subplot(1,2,2)
    plt.title('Enhance Image', color='#0000FF')
    plt.axis('off') # Do not display axes
    plt.imshow(img_enhance) # Show augmented picture

image_path = records[0]['im_file']
print("read image from file {}".format(image_path))
srcimg = Image.open(image_path)
# Convert the image read by PIL to array type
srcimg = np.array(srcimg)

# Randomly change the brightness, contrast, and color of the original image
img_enhance = random_distort(srcimg)
visualize(srcimg, img_enhance)
read image from file ./insects/train\images\1.jpeg
Random padding (expansion)
# Random padding: place the image at a random position on a larger canvas
def random_expand(img,
                  gtboxes,
                  max_ratio=4.,
                  fill=None,
                  keep_ratio=True,
                  thresh=0.5):
    if random.random() > thresh:
        return img, gtboxes

    if max_ratio < 1.0:
        return img, gtboxes

    h, w, c = img.shape
    ratio_x = random.uniform(1, max_ratio)
    if keep_ratio:
        ratio_y = ratio_x
    else:
        ratio_y = random.uniform(1, max_ratio)
    oh = int(h * ratio_y)
    ow = int(w * ratio_x)
    off_x = random.randint(0, ow - w)
    off_y = random.randint(0, oh - h)

    out_img = np.zeros((oh, ow, c))
    if fill and len(fill) == c:
        for i in range(c):
            out_img[:, :, i] = fill[i] * 255.0

    out_img[off_y:off_y + h, off_x:off_x + w, :] = img
    # Update the relative box coordinates for the larger canvas
    gtboxes[:, 0] = ((gtboxes[:, 0] * w) + off_x) / float(ow)
    gtboxes[:, 1] = ((gtboxes[:, 1] * h) + off_y) / float(oh)
    gtboxes[:, 2] = gtboxes[:, 2] / ratio_x
    gtboxes[:, 3] = gtboxes[:, 3] / ratio_y

    return out_img.astype('uint8'), gtboxes

# Apply random padding to the original image
srcimg_gtbox = records[0]['gt_bbox']
img_enhance, new_gtbox = random_expand(srcimg, srcimg_gtbox)
visualize(srcimg, img_enhance)
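The coordinate update in random_expand is easier to follow on a single box. The sketch below assumes a 100 x 100 picture placed on a 200 x 200 canvas at offset (20, 50); expand_box is a hypothetical helper that mirrors the same arithmetic for one relative xywh box, using the canvas size directly instead of the ratio (the two agree up to the integer truncation of the canvas size).

```python
def expand_box(box, img_w, img_h, canvas_w, canvas_h, off_x, off_y):
    """Map a relative xywh box from the original image onto a larger canvas."""
    cx, cy, bw, bh = box
    new_cx = (cx * img_w + off_x) / float(canvas_w)  # shift, then renormalize
    new_cy = (cy * img_h + off_y) / float(canvas_h)
    new_bw = bw * img_w / float(canvas_w)            # same pixels, larger canvas
    new_bh = bh * img_h / float(canvas_h)
    return [new_cx, new_cy, new_bw, new_bh]

new_box = expand_box([0.5, 0.5, 0.25, 0.25], 100, 100, 200, 200, off_x=20, off_y=50)
print(new_box)
```

The box keeps its size in pixels, so its relative width and height shrink by the expansion ratio while its center moves with the offset.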
Random cropping
Before random cropping, two functions need to be defined, multi_box_iou_xywh and box_crop. These two functions are saved in the box_utils.py file.
import numpy as np

def multi_box_iou_xywh(box1, box2):
    """
    In this case, box1 or box2 can contain multi boxes.
    Only two cases can be processed in this method:
       1, box1 and box2 have the same shape, box1.shape == box2.shape
       2, either box1 or box2 contains only one box, len(box1) == 1 or len(box2) == 1
    If the shape of box1 and box2 does not match, and both of them contain multi boxes, it will be wrong.
    """
    assert box1.shape[-1] == 4, "Box1 shape[-1] should be 4."
    assert box2.shape[-1] == 4, "Box2 shape[-1] should be 4."

    # Convert center format (x, y, w, h) to corner coordinates
    b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
    b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
    b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
    b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2

    inter_x1 = np.maximum(b1_x1, b2_x1)
    inter_x2 = np.minimum(b1_x2, b2_x2)
    inter_y1 = np.maximum(b1_y1, b2_y1)
    inter_y2 = np.minimum(b1_y2, b2_y2)
    inter_w = inter_x2 - inter_x1
    inter_h = inter_y2 - inter_y1
    inter_w = np.clip(inter_w, a_min=0., a_max=None)
    inter_h = np.clip(inter_h, a_min=0., a_max=None)

    inter_area = inter_w * inter_h
    b1_area = (b1_x2 - b1_x1) * (b1_y2 - b1_y1)
    b2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1)

    return inter_area / (b1_area + b2_area - inter_area)

def box_crop(boxes, labels, crop, img_shape):
    x, y, w, h = map(float, crop)
    im_w, im_h = map(float, img_shape)

    # Convert relative xywh boxes to absolute corner coordinates
    boxes = boxes.copy()
    boxes[:, 0], boxes[:, 2] = (boxes[:, 0] - boxes[:, 2] / 2) * im_w, (
        boxes[:, 0] + boxes[:, 2] / 2) * im_w
    boxes[:, 1], boxes[:, 3] = (boxes[:, 1] - boxes[:, 3] / 2) * im_h, (
        boxes[:, 1] + boxes[:, 3] / 2) * im_h

    # Keep only boxes whose center lies inside the crop
    crop_box = np.array([x, y, x + w, y + h])
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    mask = np.logical_and(crop_box[:2] <= centers, centers <= crop_box[2:]).all(
        axis=1)

    # Clip the boxes to the crop and shift them into crop coordinates
    boxes[:, :2] = np.maximum(boxes[:, :2], crop_box[:2])
    boxes[:, 2:] = np.minimum(boxes[:, 2:], crop_box[2:])
    boxes[:, :2] -= crop_box[:2]
    boxes[:, 2:] -= crop_box[:2]

    mask = np.logical_and(mask, (boxes[:, :2] < boxes[:, 2:]).all(axis=1))
    boxes = boxes * np.expand_dims(mask.astype('float32'), axis=1)
    labels = labels * mask.astype('float32')
    # Convert back to xywh relative to the crop size
    boxes[:, 0], boxes[:, 2] = (boxes[:, 0] + boxes[:, 2]) / 2 / w, (
        boxes[:, 2] - boxes[:, 0]) / w
    boxes[:, 1], boxes[:, 3] = (boxes[:, 1] + boxes[:, 3]) / 2 / h, (
        boxes[:, 3] - boxes[:, 1]) / h

    return boxes, labels, mask.sum()
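For intuition about what multi_box_iou_xywh computes, here is a scalar version for a single pair of boxes in center format. The function name iou_xywh and the example boxes are assumptions for this sketch; the arithmetic is the usual intersection-over-union.

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    ax1, ax2 = box_a[0] - box_a[2] / 2.0, box_a[0] + box_a[2] / 2.0
    ay1, ay2 = box_a[1] - box_a[3] / 2.0, box_a[1] + box_a[3] / 2.0
    bx1, bx2 = box_b[0] - box_b[2] / 2.0, box_b[0] + box_b[2] / 2.0
    by1, by2 = box_b[1] - box_b[3] / 2.0, box_b[1] + box_b[3] / 2.0
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # 0 when boxes do not overlap
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

# Two 4x4 boxes shifted by 2 along x: intersection 8, union 24.
overlap = iou_xywh((2, 2, 4, 4), (4, 2, 4, 4))
print(overlap)
```

In this example the boxes share half of their area, yet the IoU is only 1/3, because the union counts the non-shared halves of both boxes.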
# Random cropping
def random_crop(img,
                boxes,
                labels,
                scales=[0.3, 1.0],
                max_ratio=2.0,
                constraints=None,
                max_trial=50):
    if len(boxes) == 0:
        return img, boxes, labels

    if not constraints:
        constraints = [(0.1, 1.0), (0.3, 1.0), (0.5, 1.0), (0.7, 1.0),
                       (0.9, 1.0), (0.0, 1.0)]

    img = Image.fromarray(img)
    w, h = img.size
    crops = [(0, 0, w, h)]
    for min_iou, max_iou in constraints:
        for _ in range(max_trial):
            scale = random.uniform(scales[0], scales[1])
            aspect_ratio = random.uniform(max(1 / max_ratio, scale * scale), \
                                          min(max_ratio, 1 / scale / scale))
            crop_h = int(h * scale / np.sqrt(aspect_ratio))
            crop_w = int(w * scale * np.sqrt(aspect_ratio))
            crop_x = random.randrange(w - crop_w)
            crop_y = random.randrange(h - crop_h)
            crop_box = np.array([[(crop_x + crop_w / 2.0) / w,
                                  (crop_y + crop_h / 2.0) / h,
                                  crop_w / float(w), crop_h / float(h)]])

            iou = multi_box_iou_xywh(crop_box, boxes)
            if min_iou <= iou.min() and max_iou >= iou.max():
                crops.append((crop_x, crop_y, crop_w, crop_h))
                break

    # Try the candidate crops in random order until one keeps at least one box
    while crops:
        crop = crops.pop(np.random.randint(0, len(crops)))
        crop_boxes, crop_labels, box_num = box_crop(boxes, labels, crop, (w, h))
        if box_num < 1:
            continue
        img = img.crop((crop[0], crop[1], crop[0] + crop[2],
                        crop[1] + crop[3])).resize(img.size, Image.LANCZOS)
        img = np.asarray(img)
        return img, crop_boxes, crop_labels
    img = np.asarray(img)
    return img, boxes, labels

# Apply random cropping to the original image
srcimg_gtbox = records[0]['gt_bbox']
srcimg_label = records[0]['gt_class']

img_enhance, new_gtbox, new_labels = random_crop(srcimg, srcimg_gtbox, srcimg_label)
visualize(srcimg, img_enhance)
Random scaling
# Random scaling
def random_interp(img, size, interp=None):
    interp_method = [
        cv2.INTER_NEAREST,
        cv2.INTER_LINEAR,
        cv2.INTER_AREA,
        cv2.INTER_CUBIC,
        cv2.INTER_LANCZOS4,
    ]
    # Pick a random interpolation method unless a valid one is given
    if not interp or interp not in interp_method:
        interp = interp_method[random.randint(0, len(interp_method) - 1)]
    h, w, _ = img.shape
    im_scale_x = size / float(w)
    im_scale_y = size / float(h)
    img = cv2.resize(
        img, None, None, fx=im_scale_x, fy=im_scale_y, interpolation=interp)
    return img

# Apply random scaling to the original image
img_enhance = random_interp(srcimg, 640)
visualize(srcimg, img_enhance)
Random flipping
# Random horizontal flipping
def random_flip(img, gtboxes, thresh=0.5):
    if random.random() > thresh:
        img = img[:, ::-1, :]
        gtboxes[:, 0] = 1.0 - gtboxes[:, 0]
    return img, gtboxes

# Apply random flipping to the original image
img_enhance, box_enhance = random_flip(srcimg, srcimg_gtbox)
visualize(srcimg, img_enhance)
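A small deterministic example may help to see what a horizontal flip does to the pixels and to a relative box. This is a sketch with made-up data; unlike random_flip, it always flips and works on a copy of the boxes instead of modifying them in place.

```python
import numpy as np

# A tiny H=2, W=3, C=1 "image" whose pixel values encode their position.
image = np.arange(2 * 3 * 1).reshape(2, 3, 1)
flipped = image[:, ::-1, :]  # reverse the width axis

# One relative xywh box; only the x center changes under a horizontal flip.
boxes = np.array([[0.25, 0.5, 0.2, 0.4]])
flipped_boxes = boxes.copy()
flipped_boxes[:, 0] = 1.0 - flipped_boxes[:, 0]

print(flipped[0, :, 0], flipped_boxes)
```

Width and height stay unchanged because the boxes are stored in center format, which is one reason this format is convenient for flipping.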
Randomly shuffle the order of the ground-truth boxes
# Randomly shuffle the order of the ground-truth boxes
def shuffle_gtbox(gtbox, gtlabel):
    # Concatenate boxes and labels so that each box stays paired with its label
    gt = np.concatenate(
        [gtbox, gtlabel[:, np.newaxis]], axis=1)
    idx = np.arange(gt.shape[0])
    np.random.shuffle(idx)
    gt = gt[idx, :]
    return gt[:, :4], gt[:, 4]
Summary of image augmentation methods
# Summary of image augmentation methods
def image_augment(img, gtboxes, gtlabels, size, means=None):
    # Randomly change brightness, contrast, color, etc.
    img = random_distort(img)
    # Random padding
    img, gtboxes = random_expand(img, gtboxes, fill=means)
    # Random cropping
    img, gtboxes, gtlabels, = random_crop(img, gtboxes, gtlabels)
    # Random scaling
    img = random_interp(img, size)
    # Random flipping
    img, gtboxes = random_flip(img, gtboxes)
    # Randomly shuffle the order of the ground-truth boxes
    gtboxes, gtlabels = shuffle_gtbox(gtboxes, gtlabels)

    return img.astype('float32'), gtboxes.astype('float32'), gtlabels.astype('int32')

img_enhance, img_box, img_label = image_augment(srcimg, srcimg_gtbox, srcimg_label, size=320)
visualize(srcimg, img_enhance)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)
size = 512
img, gt_boxes, gt_labels = image_augment(img, gt_boxes, gt_labels, size)
img.shape
(512, 512, 3)
gt_boxes.shape
(50, 4)
gt_labels.shape
(50,)
The img values obtained here still need to be divided by 255, normalized by subtracting the mean and dividing by the standard deviation, and then transposed from [H, W, C] to [C, H, W].
img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)
size = 512
img, gt_boxes, gt_labels = image_augment(img, gt_boxes, gt_labels, size)
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
# Reshape to (1, 1, 3) so that mean and std broadcast over the channel axis
mean = np.array(mean).reshape((1, 1, -1))
std = np.array(std).reshape((1, 1, -1))
img = (img / 255.0 - mean) / std
img = img.astype('float32').transpose((2, 0, 1))
img
array([[[-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 , -2.117904 , -2.117904 ],
        [-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 , -2.117904 , -2.117904 ],
        [-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 , -2.117904 , -2.117904 ],
        ...,
        [-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 , -2.117904 , -2.117904 ],
        [-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 , -2.117904 , -2.117904 ],
        [-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 , -2.117904 , -2.117904 ]],

       [[-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144, -2.0357144, -2.0357144],
        [-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144, -2.0357144, -2.0357144],
        [-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144, -2.0357144, -2.0357144],
        ...,
        [-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144, -2.0357144, -2.0357144],
        [-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144, -2.0357144, -2.0357144],
        [-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144, -2.0357144, -2.0357144]],

       [[-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444, -1.8044444, -1.8044444],
        [-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444, -1.8044444, -1.8044444],
        [-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444, -1.8044444, -1.8044444],
        ...,
        [-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444, -1.8044444, -1.8044444],
        [-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444, -1.8044444, -1.8044444],
        [-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444, -1.8044444, -1.8044444]]],
      dtype=float32)
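As a sanity check on the normalization, the short calculation below applies the same formula to a black pixel (value 0 in every channel), using the mean and std values from the code above. The variable name normalized_black is introduced only for this example.

```python
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Normalize a black pixel channel by channel: (0 / 255 - mean) / std
normalized_black = [(0 / 255.0 - m) / s for m, s in zip(mean, std)]
print([round(v, 4) for v in normalized_black])  # [-2.1179, -2.0357, -1.8044]
```

A pixel value of 0 therefore maps to about -2.118, -2.036, and -1.804 in the three channels. Regions filled with zeros, for example the border that random padding adds when no fill value is given, presumably account for the long runs of identical values in a normalized array.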
Organize the above process into a get_img_data function.
def get_img_data(record, size=640):
    img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)
    img, gt_boxes, gt_labels = image_augment(img, gt_boxes, gt_labels, size)
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]
    mean = np.array(mean).reshape((1, 1, -1))
    std = np.array(std).reshape((1, 1, -1))
    img = (img / 255.0 - mean) / std
    img = img.astype('float32').transpose((2, 0, 1))
    return img, gt_boxes, gt_labels, scales

TRAINDIR = '/home/aistudio/work/insects/train'
TESTDIR = '/home/aistudio/work/insects/test'
VALIDDIR = '/home/aistudio/work/insects/val'
cname2cid = get_insect_names()
records = get_annotations(cname2cid, TRAINDIR)

record = records[0]
img, gt_boxes, gt_labels, scales = get_img_data(record, size=480)
img.shape
(3, 480, 480)
gt_boxes.shape
(50, 4)
gt_labels
array([0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int32)
scales
(1220.0, 1220.0)
Fast data augmentation using the high-level API of PaddlePaddle
In the above code, we used numpy to implement a variety of data augmentation methods. PaddlePaddle also provides ready-to-use data augmentation methods in paddle.vision.transforms. The transforms module provides dozens of data augmentation methods, including brightness adjustment (adjust_brightness), contrast adjustment (adjust_contrast), and random cropping (RandomCrop). For more information on how to use the high-level API, see the official PaddlePaddle website.
The data augmentation in the paddle.vision.transforms module is used as follows:
# Random cropping of the image
# Import the random cropping API RandomCrop from the paddle.vision.transforms module
from paddle.vision.transforms import RandomCrop

# RandomCrop is a Python class that must be instantiated first
# RandomCrop also needs the crop size, which is set to 640 here
transform = RandomCrop(640)
# Convert image to PIL.Image format
srcimg = Image.fromarray(np.array(srcimg))
# Call the instantiated transform to perform random cropping
img_res = transform(srcimg)
# Visualization results
visualize(srcimg, np.array(img_res))
In the same way, brightness augmentation can be implemented with the high-level API of PaddlePaddle, as shown in the following code:
from paddle.vision.transforms import BrightnessTransform

# BrightnessTransform is a Python class that must be instantiated first
transform = BrightnessTransform(0.4)
# Convert image to PIL.Image format
srcimg = Image.fromarray(np.array(srcimg))
# Call the instantiated transform to adjust the brightness
img_res = transform(srcimg)
# Visualization results
visualize(srcimg, np.array(img_res))
Batch data reading and acceleration
The above program shows how to read and preprocess the data of one picture. The following code implements batch data reading.
# Obtain the randomly chosen image size for the samples in a batch
def get_img_size(mode):
    if (mode == 'train') or (mode == 'valid'):
        inds = np.array([0,1,2,3,4,5,6,7,8,9])
        ii = np.random.choice(inds)
        img_size = 320 + ii * 32
    else:
        img_size = 608
    return img_size

# Convert batch data in the form of a list into a tuple of arrays
def make_array(batch_data):
    img_array = np.array([item[0] for item in batch_data], dtype = 'float32')
    gt_box_array = np.array([item[1] for item in batch_data], dtype = 'float32')
    gt_labels_array = np.array([item[2] for item in batch_data], dtype = 'int32')
    img_scale = np.array([item[3] for item in batch_data], dtype='int32')
    return img_array, gt_box_array, gt_labels_array, img_scale
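The sizes that get_img_size can return in training mode are easy to enumerate. The sketch below lists them and, assuming the network downsamples by a factor of 32 as described for YOLOv3 later in this text, the side length of the resulting output feature map. The variable names are introduced only for this example.

```python
# All image sizes get_img_size can choose in 'train' or 'valid' mode.
candidate_sizes = [320 + i * 32 for i in range(10)]

# Side length of the output feature map, assuming a total stride of 32.
feature_map_sizes = [size // 32 for size in candidate_sizes]

print(candidate_sizes)
print(feature_map_sizes)
```

Every candidate size is a multiple of 32, so the feature map side length is always an integer, from 10 for a 320 x 320 input up to 19 for 608 x 608.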
Because data preprocessing takes a long time, it may become the bottleneck of training speed, so the preprocessing part needs to be optimized. The num_workers parameter of the paddle.io.DataLoader API provided by PaddlePaddle sets the number of processes, so that data is read by multiple processes. The implementation is as follows.
import paddle

# Define a data reading class that inherits from paddle.io.Dataset
class TrainDataset(paddle.io.Dataset):
    def __init__(self, datadir, mode='train'):
        self.datadir = datadir
        cname2cid = get_insect_names()
        self.records = get_annotations(cname2cid, datadir)
        self.img_size = 640  #get_img_size(mode)

    def __getitem__(self, idx):
        record = self.records[idx]
        # print("print: ", record)
        img, gt_bbox, gt_labels, im_shape = get_img_data(record, size=self.img_size)

        return img, gt_bbox, gt_labels, np.array(im_shape)

    def __len__(self):
        return len(self.records)

# Create the dataset
train_dataset = TrainDataset(TRAINDIR, mode='train')

# Create a data reader with paddle.io.DataLoader, and set batch_size, the number of processes num_workers, and other parameters
train_loader = paddle.io.DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=2, drop_last=True)
d = paddle.io.DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=1, drop_last=True)
img, gt_boxes, gt_labels, im_shape = next(d())
img.shape, gt_boxes.shape, gt_labels.shape, im_shape.shape
([2, 3, 640, 640], [2, 50, 4], [2, 50], [2, 2])
So far, we have viewed the data in the dataset, extracted the annotation information, read images and annotations from files, augmented the images, and implemented batch reading and acceleration. Through paddle.io.Dataset we can obtain img, gt_boxes, gt_labels, and im_shape, which can then be fed into the neural network and applied to the specific algorithm.
Before explaining the algorithm, we first add the code that reads the test data. The test data has no annotation information and needs no image augmentation. The code is as follows.
import os

# Convert batch data in the form of a list into a tuple of arrays
def make_test_array(batch_data):
    img_name_array = np.array([item[0] for item in batch_data])
    img_data_array = np.array([item[1] for item in batch_data], dtype = 'float32')
    img_scale_array = np.array([item[2] for item in batch_data], dtype='int32')
    return img_name_array, img_data_array, img_scale_array

# Test data reading
def test_data_loader(datadir, batch_size= 10, test_image_size=608, mode='test'):
    """
    Load the test pictures. The test data has no ground-truth labels.
    """
    image_names = os.listdir(datadir)
    def reader():
        batch_data = []
        img_size = test_image_size
        for image_name in image_names:
            file_path = os.path.join(datadir, image_name)
            img = cv2.imread(file_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            H = img.shape[0]
            W = img.shape[1]
            img = cv2.resize(img, (img_size, img_size))

            mean = [0.485, 0.456, 0.406]
            std = [0.229, 0.224, 0.225]
            mean = np.array(mean).reshape((1, 1, -1))
            std = np.array(std).reshape((1, 1, -1))
            out_img = (img / 255.0 - mean) / std
            out_img = out_img.astype('float32').transpose((2, 0, 1))
            img = out_img #np.transpose(out_img, (2,0,1))
            im_shape = [H, W]

            batch_data.append((image_name.split('.')[0], img, im_shape))
            if len(batch_data) == batch_size:
                yield make_test_array(batch_data)
                batch_data = []
        # Yield the last, possibly smaller, batch
        if len(batch_data) > 0:
            yield make_test_array(batch_data)

    return reader
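The batching logic inside reader() is a common generator pattern: collect items until the batch is full, yield it, and finally yield whatever is left over. The stripped-down sketch below shows only that pattern on plain integers; the helper name batched is hypothetical and no images are involved.

```python
def batched(items, batch_size):
    """Yield lists of up to batch_size items, including a final partial batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover items that did not fill a whole batch
        yield batch

batches = list(batched(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Keeping the final partial batch matters for test data, since every picture needs a prediction; the training loader above instead uses drop_last=True and discards it.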
Single-stage object detection model YOLOv3
R-CNN series algorithms first generate candidate regions, and then classify the candidate regions and predict their position coordinates. Such algorithms are called two-stage object detection algorithms. In recent years, researchers have proposed a series of single-stage detection algorithms, which need only one network to generate candidate regions and predict the categories and position coordinates of objects at the same time.
Unlike the R-CNN series, YOLOv3 uses a single network to predict object categories and locations while generating candidate regions, so the detection task does not need to be split into two stages. In addition, YOLOv3 generates far fewer prediction boxes than Faster R-CNN. In Faster R-CNN, each ground-truth box may correspond to multiple candidate regions with positive labels, while in YOLOv3 each ground-truth box corresponds to only one positive candidate region. These characteristics make YOLOv3 faster, fast enough for real-time response.
Joseph Redmon et al. proposed the YOLO (You Only Look Once) algorithm in 2015, commonly known as YOLOv1. In 2016 they improved the algorithm and proposed YOLOv2, and YOLOv3 followed in 2018.
YOLOv3 model design idea
The basic idea of YOLOv3 algorithm can be divided into two parts:
 A series of candidate regions is generated on the picture according to certain rules, and the candidate regions are then labeled according to their positional relationship to the real boxes of the objects in the picture. Candidate regions close enough to a real box are marked as positive samples, and the position of the real box is taken as the position target of the positive sample. Candidate regions that deviate greatly from every real box are marked as negative samples; negative samples do not need a predicted location or category.
 A convolutional neural network extracts image features and predicts the location and category of the candidate regions. In this way, each prediction box can be regarded as a sample. Its label is obtained from the position and category of the real box it is matched against. The network predicts its position and category, and the loss function is built by comparing the predicted values with the label values.
The flow chart of YOLOv3 algorithm training process is shown in Figure 8:
Figure 8: YOLOv3 algorithm training flow chart
 The left side of Fig. 8 is the input picture. The upper branch extracts features from the picture with a convolutional neural network. As the data propagates through the network, the feature map becomes smaller and smaller and each pixel represents a more abstract feature pattern, until the output feature map is reduced to $\frac{1}{32}$ of the original image size.
 The lower part of Fig. 8 describes the process of generating candidate regions. First, the original picture is divided into small blocks of size $32 \times 32$. Then a series of anchor boxes is generated centered on each small block, so that the whole picture is covered by anchor boxes. A prediction box is generated on the basis of each anchor box, and these prediction boxes are labeled according to the positional relationship between the anchor box, the prediction box and the real boxes of the objects in the picture.
 The feature map output by the upper branch is associated with the prediction box labels generated in the lower branch to create a loss function and start the end-to-end training process.
Next, the principle and code implementation of each node in the process are introduced in detail.
Generate candidate regions
How to generate candidate regions is the core design scheme of detection model. At present, most models based on convolutional neural network adopt the following methods:
 According to certain rules, a series of anchor boxes with fixed positions are generated on the picture, and these anchor boxes are regarded as possible candidate areas.
 Predict whether the anchor box contains the target object. If it contains the target object, it is also necessary to predict the category of the included object and the range of adjustment of the prediction box relative to the anchor box position.
Generate anchor box
Divide the original picture into $m \times n$ areas. As shown in the figure below, the height of the original picture is $H = 640$ and the width is $W = 480$. If we select $32 \times 32$ as the size of a small area, then $m$ and $n$ are:
$$m = \frac{640}{32} = 20$$
$$n = \frac{480}{32} = 15$$
As shown in Fig. 9, the original image is divided into 20 rows and 15 columns of small square areas.
Figure 9: divide the picture into multiple 32x32 small blocks
YOLOv3 algorithm will generate a series of anchor boxes in the center of each region. For display convenience, we first draw the generated anchor box near the small box position in the tenth row and fourth column in the figure, as shown in Figure 10.
Note:
To match the numbering used in the program, the top row is row 0 and the leftmost column is column 0.
Figure 10: three anchor boxes are generated in the small square area in row 10 and column 4
Figure 11 shows three anchor boxes generated in every area. With so many anchor boxes stacked together they are hard to tell apart, but the process is the same as above: three anchor boxes are generated around the center point of each area.
Figure 11: generate 3 anchor boxes in each small square area
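The grid division and anchor placement described above can be sketched as follows. This is only an illustration, not the tutorial's own code: the function name `generate_anchor_boxes` is hypothetical, and the three anchor sizes are assumed to be the (width, height) pairs (116, 90), (156, 198) and (373, 326) that appear later in this section.

```python
import numpy as np


def generate_anchor_boxes(img_h, img_w, cell=32,
                          anchor_sizes=((116, 90), (156, 198), (373, 326))):
    """Return anchors of shape [rows, cols, num_anchors, 4] in (cx, cy, w, h) pixels."""
    rows, cols = img_h // cell, img_w // cell
    boxes = np.zeros((rows, cols, len(anchor_sizes), 4), dtype='float32')
    for i in range(rows):          # row index, counted from the top
        for j in range(cols):      # column index, counted from the left
            cx = (j + 0.5) * cell  # center of the small square, in pixels
            cy = (i + 0.5) * cell
            for k, (w, h) in enumerate(anchor_sizes):
                boxes[i, j, k] = [cx, cy, w, h]
    return boxes


anchors = generate_anchor_boxes(640, 480)
print(anchors.shape)              # (20, 15, 3, 4): 20 rows, 15 columns, 3 anchors
print(anchors[10, 4, 0].tolist())  # [144.0, 336.0, 116.0, 90.0]
```

Every small square contributes the same three shapes, so a 640 x 480 picture yields 20 x 15 x 3 = 900 anchor boxes.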
Generate prediction box
As pointed out earlier, the positions of the anchor boxes are fixed and will generally not coincide with the object bounding box, so the position must be fine-tuned on the basis of the anchor box to generate the prediction box. The prediction box has a different center position and size from the anchor box. How can the prediction box be obtained? Let's first consider how to generate its center coordinates.
For example, in the figure above, an anchor box is generated at the center of the small square in row 10 and column 4, shown as the green dotted box. Taking the width of a small square as the unit length, the coordinates of the upper-left corner of this small square are:
$$c_x = 4$$
$$c_y = 10$$
The center coordinates of this anchor box are:
$$center\_x = c_x + 0.5 = 4.5$$
$$center\_y = c_y + 0.5 = 10.5$$
The center coordinates of the prediction box can be generated as follows:
$$b_x = c_x + \sigma(t_x)$$
$$b_y = c_y + \sigma(t_y)$$
where $t_x$ and $t_y$ are real numbers and $\sigma(x)$ is the Sigmoid function we learned before, defined as:
$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$
Because the value of the Sigmoid function lies in $0 \sim 1$, the center point of the prediction box calculated by the formulas above always falls within the small area in row 10 and column 4.
When $t_x = t_y = 0$, $b_x = c_x + 0.5$ and $b_y = c_y + 0.5$, so the center of the prediction box coincides with the center of the anchor box, which is the center of the small area.
The size of the anchor box is preset and can be regarded as a hyperparameter of the model. The size of the anchor box drawn in the figure below is
$$p_h = 350$$
$$p_w = 250$$
The size of the prediction box is generated by the following formulas:
$$b_h = p_h e^{t_h}$$
$$b_w = p_w e^{t_w}$$
If $t_x = t_y = 0$ and $t_h = t_w = 0$, then the prediction box coincides with the anchor box.
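The four formulas can be combined into a small decoding routine. The sketch below is illustrative only: `decode_box` is a hypothetical helper, and it assumes that $c_x$ and $c_y$ are given in grid units and that the center is scaled back to pixels with a cell size of 32. It checks the statement above that zero offsets reproduce the anchor box.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph, cell=32):
    """Decode (t_x, t_y, t_w, t_h) into a box (b_x, b_y, b_w, b_h) in pixels."""
    bx = (cx + sigmoid(tx)) * cell  # b_x = c_x + sigmoid(t_x), scaled to pixels
    by = (cy + sigmoid(ty)) * cell  # b_y = c_y + sigmoid(t_y), scaled to pixels
    bw = pw * np.exp(tw)            # b_w = p_w * exp(t_w)
    bh = ph * np.exp(th)            # b_h = p_h * exp(t_h)
    return float(bx), float(by), float(bw), float(bh)


# With all offsets equal to zero the prediction box is the anchor box itself,
# centered in the small square at row 10, column 4.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=4, cy=10, pw=250, ph=350))
# (144.0, 336.0, 250.0, 350.0)
```

Because the Sigmoid output stays strictly between 0 and 1, even very large or very small values of `tx` and `ty` keep the decoded center inside the same small square.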
If $t_x, t_y, t_w, t_h$ are randomly assigned as follows:
$$t_x = 0.2, \quad t_y = 0.3, \quad t_w = 0.1, \quad t_h = -0.12$$
The coordinates of the prediction box are (154.98, 357.44, 276.29, 310.42), as shown by the blue box in Figure 12.
Note:
The coordinates here are in $xywh$ format.
Figure 12: generate prediction box
Here we ask: for which values of $t_x, t_y, t_w, t_h$ does the prediction box coincide with the real box? To answer this question, set $b_x, b_y, b_h, b_w$ in the prediction box formulas to the position of the real box and solve for $t$.
Let:
$$\sigma(t^*_x) + c_x = gt_x$$
$$\sigma(t^*_y) + c_y = gt_y$$
$$p_w e^{t^*_w} = gt_w$$
$$p_h e^{t^*_h} = gt_h$$
Solving these equations gives $(t^*_x, t^*_y, t^*_w, t^*_h)$.
If $t$ is the output of the network and $t^*$ is the target value, then using the gap between them as the loss function establishes a regression problem. By learning the network parameters, $t$ becomes close enough to $t^*$, so that the position coordinates and size of the prediction box can be solved.
The prediction box can be regarded as a fine adjustment of the anchor box. Each anchor box has one corresponding prediction box. We need to determine the values of $t_x, t_y, t_w, t_h$ in the formulas above in order to calculate the position and shape of the prediction box corresponding to the anchor box.
Label candidate areas
Each region can generate three anchor boxes with different shapes. Each anchor box is a possible candidate region. For these candidate regions, we need to know the following things:

Whether the anchor box contains an object. This can be regarded as a binary classification problem, represented by the label objectness. When the anchor box contains an object, objectness=1, indicating that the prediction box belongs to the positive class; when it does not, objectness=0, indicating that it belongs to the negative class.

If the anchor box contains an object, what the center position and size of its corresponding prediction box should be, that is, what $t_x, t_y, t_w, t_h$ in the formulas above should be. This is represented by the label location.

If the anchor box contains an object, which category the object belongs to. Here the variable label represents the category label.
To label an anchor box, we therefore need to determine its objectness, $(t_x, t_y, t_w, t_h)$ and label. How to determine the values of these three labels is described below.
Mark whether the anchor box contains objects
As shown in Figure 13, there are three targets here. Take the portrait on the far left as an example; its real box is $(133.96, 328.42, 186.06, 374.63)$.
Figure 13: select the anchor frame located in the same area as the center of the real frame
The coordinates of the center point of the real box are:
$$center\_x = 133.96$$
$$center\_y = 328.42$$
$$i = 133.96 / 32 = 4.18625$$
$$j = 328.42 / 32 = 10.263125$$
It falls in the small box in row 10 and column 4, as shown in Figure 13. This small square area can generate three anchor boxes with different shapes, whose numbers and sizes in the figure are $A_1(116, 90)$, $A_2(156, 198)$ and $A_3(373, 326)$.
Calculate the IoU between the real box and each of these three anchor boxes, and select the anchor box with the largest IoU. To simplify the calculation, only the shape of the anchor box is considered here; the offset between the center of the anchor box and the center of the real box is ignored. The results are shown in Figure 14.
Figure 14: IoU selected with real frame and anchor frame
The anchor box with the largest IoU with the real box is $A_3$, whose shape is $(373, 326)$. The objectness label of its prediction box is set to 1, and the object category it contains is the category of the object in the real box.
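The selection step can be reproduced with a few lines of Python. This is a minimal sketch rather than the tutorial's implementation: `shape_iou` is a hypothetical helper that compares two boxes as if they shared the same center, and the anchor sizes are read as (width, height) pairs.

```python
import numpy as np


def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes with a common center, so only width and height matter."""
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


# Real box of the leftmost target, in xywh format (pixels)
gt_x, gt_y, gt_w, gt_h = 133.96, 328.42, 186.06, 374.63
anchors = [(116, 90), (156, 198), (373, 326)]  # A_1, A_2, A_3

col = int(gt_x // 32)  # column of the small square that holds the center
row = int(gt_y // 32)  # row of the small square that holds the center
ious = [shape_iou(gt_w, gt_h, aw, ah) for aw, ah in anchors]
best = int(np.argmax(ious))

print(row, col)                     # 10 4
print([round(v, 3) for v in ious])  # roughly [0.150, 0.443, 0.464]
print(best)                         # 2, i.e. A_3
```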
In the same way, find the anchor box with the largest IoU for each of the other real boxes, and set the objectness label of its prediction box to 1. There are $20 \times 15 \times 3 = 900$ anchor boxes in total, and only 3 prediction boxes are marked as positive.
Since each real box corresponds to only one prediction box with a positive objectness label, a prediction box whose IoU with the real box is large, but not the largest, should not simply be given objectness 0 and treated as a negative sample. To avoid this, the YOLOv3 algorithm sets an IoU threshold iou_threshold. When the objectness of a prediction box is not 1 but its IoU with some real box is greater than iou_threshold, its objectness label is set to -1 and it does not participate in the calculation of the loss function.
For all other prediction boxes, the objectness label is set to 0, indicating the negative class.
For a prediction box with objectness=1, its location and the category label of the object need to be determined further. For a prediction box with objectness=0 or -1, location and category do not matter.
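The three-way labeling rule can be written compactly. The following is a sketch under stated assumptions: `assign_objectness` is a hypothetical helper, and the IoU values in the usage example are made up for illustration.

```python
import numpy as np


def assign_objectness(ious, best_index, iou_threshold=0.7):
    """Label each prediction box as positive (1), ignored (-1) or negative (0).

    ious: IoU of every prediction box with one real box.
    best_index: index of the box matched to that real box.
    """
    ious = np.asarray(ious, dtype='float32')
    labels = np.zeros(ious.shape, dtype='int32')  # default: negative sample
    labels[ious > iou_threshold] = -1             # large IoU: excluded from the loss
    labels[best_index] = 1                        # the matched box is positive
    return labels


# Hypothetical IoU values for four prediction boxes; box 2 is the matched one.
print(assign_objectness([0.10, 0.75, 0.82, 0.30], best_index=2).tolist())
# [0, -1, 1, 0]
```

The positive label is written last so that it overrides the ignore label for the matched box.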
Label the location coordinates of the prediction box
When the anchor box objectness=1, it is necessary to determine the fine adjustment range of the prediction box position relative to it, that is, the position label of the anchor box.
We asked this question earlier: for which values of $t_x, t_y, t_w, t_h$ does the prediction box coincide with the real box? The method is to set $b_x, b_y, b_h, b_w$ in the prediction box formulas to the coordinates of the real box and solve for $t$.
Let:
$$\sigma(t^*_x) + c_x = gt_x$$
$$\sigma(t^*_y) + c_y = gt_y$$
$$p_w e^{t^*_w} = gt_w$$
$$p_h e^{t^*_h} = gt_h$$
For $t_x^*$ and $t_y^*$, because the inverse of the Sigmoid function is inconvenient to compute, we use $\sigma(t^*_x)$ and $\sigma(t^*_y)$ directly as the regression targets:
$$d_x^* = \sigma(t^*_x) = gt_x - c_x$$
$$d_y^* = \sigma(t^*_y) = gt_y - c_y$$
$$t^*_w = \log\left(\frac{gt_w}{p_w}\right)$$
$$t^*_h = \log\left(\frac{gt_h}{p_h}\right)$$
If $(t_x, t_y, t_w, t_h)$ is the output of the network, take $(\sigma(t_x), \sigma(t_y), t_w, t_h)$ as the predicted values and $(d_x^*, d_y^*, t_w^*, t_h^*)$ as the target values. Using the gap between them as the loss function establishes a regression problem. By learning the network parameters, the predicted values become close enough to the targets, so that the position of the prediction box can be solved.
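As a worked example, the targets for the leftmost real box $(133.96, 328.42, 186.06, 374.63)$ matched with anchor $A_3(373, 326)$ can be computed as below. This is a sketch only: `location_targets` is a hypothetical helper, the anchor size is read as (width, height), and the real box is converted to grid units by dividing by a cell size of 32.

```python
import math


def location_targets(gt_x, gt_y, gt_w, gt_h, anchor_w, anchor_h, cell=32):
    """Return (d_x*, d_y*, t_w*, t_h*) for a real box given in pixels (xywh)."""
    col = int(gt_x // cell)          # c_x: column of the small square
    row = int(gt_y // cell)          # c_y: row of the small square
    dx = gt_x / cell - col           # d_x* = gt_x - c_x, in grid units
    dy = gt_y / cell - row           # d_y* = gt_y - c_y, in grid units
    tw = math.log(gt_w / anchor_w)   # t_w* = log(gt_w / p_w)
    th = math.log(gt_h / anchor_h)   # t_h* = log(gt_h / p_h)
    return dx, dy, tw, th


dx, dy, tw, th = location_targets(133.96, 328.42, 186.06, 374.63, 373, 326)
print(round(dx, 5), round(dy, 6))  # 0.18625 0.263125
print(round(tw, 3), round(th, 3))  # about -0.696 and 0.139
```

The targets $d_x^*$ and $d_y^*$ always lie between 0 and 1, which matches the range of the Sigmoid output they are compared with.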
Label the object category contained in the anchor box
For an anchor box with objectness=1, its category needs to be determined. As mentioned above, an anchor box with objectness 1 has a corresponding real box, and the object category of the anchor box is the category of the object in that real box. A one-hot vector is used to represent the category label. For example, if there are 10 categories in total and the object in the real box belongs to category 2, then label is $(0,1,0,0,0,0,0,0,0,0)$.
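A one-hot label like this can be built in a couple of lines. In this sketch `one_hot` is a hypothetical helper, and the category number is assumed to be 1-based so that it matches the example above; the labeling code later in this section indexes classes from 0 instead.

```python
import numpy as np


def one_hot(category, num_classes):
    """Build a one-hot label; category is 1-based to match the text example."""
    label = np.zeros(num_classes, dtype='float32')
    label[category - 1] = 1.0
    return label


print(one_hot(2, 10).tolist())
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```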
The above steps are summarized, and the marked process is shown in Figure 15.
Figure 15: schematic diagram of marking process
In this way, we generate a series of anchor boxes in each small block area as candidate areas, and mark the objectness label, the position adjustment and the category of the contained object for each candidate area according to the positions of the real objects in the picture. The position adjustment is described by four variables $(t_x, t_y, t_w, t_h)$, the objectness label is described by one variable $obj$, and the length of the variable describing the category is equal to the number of categories C.
For each anchor box, the model needs to predict the output $(t_x, t_y, t_w, t_h, P_{obj}, P_1, P_2, ..., P_C)$, where $P_{obj}$ is the probability that the anchor box contains an object, and $P_1, P_2, ..., P_C$ are the probabilities that the object contained in the anchor box belongs to each category. Next, let's learn how to output such predicted values with a convolutional neural network.
Specific procedures for marking anchor boxes
The above describes how to label the anchor boxes, but some details may still be unclear. The following program carries out these steps concretely.
import numpy as np


# Label the objectness of the prediction boxes
def get_objectness_label(img, gt_boxes, gt_labels, iou_threshold=0.7,
                         anchors=[116, 90, 156, 198, 373, 326],
                         num_classes=7, downsample=32):
    """
    img: input image data with shape [N, C, H, W]
    gt_boxes: real boxes with shape [N, 50, 4], where 50 is the upper limit on
        the number of real boxes. When a picture has fewer than 50 real boxes,
        the coordinates of the unused entries are all 0.
        The real box coordinates are in xywh format, given as relative values.
    gt_labels: categories of the real boxes, with shape [N, 50]
    iou_threshold: a prediction box whose IoU with a real box is greater than
        iou_threshold is not regarded as a negative sample
    anchors: optional sizes of the anchor boxes
    anchor_masks: used together with anchors to determine which anchor sizes
        are selected for the feature map of this level
    num_classes: number of categories
    downsample: ratio of the input picture size to the feature map size
    """

    img_shape = img.shape
    batchsize = img_shape[0]
    num_anchors = len(anchors) // 2
    input_h = img_shape[2]
    input_w = img_shape[3]
    # Divide the input picture into num_rows x num_cols small square areas,
    # each with side length downsample
    # Number of rows of small squares
    num_rows = input_h // downsample
    # Number of columns of small squares
    num_cols = input_w // downsample

    label_objectness = np.zeros([batchsize, num_anchors, num_rows, num_cols])
    label_classification = np.zeros([batchsize, num_anchors, num_classes, num_rows, num_cols])
    label_location = np.zeros([batchsize, num_anchors, 4, num_rows, num_cols])

    scale_location = np.ones([batchsize, num_anchors, num_rows, num_cols])

    # Loop over the batch and process each picture in turn
    for n in range(batchsize):
        # Loop over the real boxes of the picture and find, for each one,
        # the anchor box whose shape matches it best
        for n_gt in range(len(gt_boxes[n])):
            gt = gt_boxes[n][n_gt]
            gt_cls = gt_labels[n][n_gt]
            gt_center_x = gt[0]
            gt_center_y = gt[1]
            gt_width = gt[2]
            gt_height = gt[3]
            # Skip the zero-padded entries
            if (gt_width < 1e-3) or (gt_height < 1e-3):
                continue
            i = int(gt_center_y * num_rows)
            j = int(gt_center_x * num_cols)
            ious = []
            for ka in range(num_anchors):
                bbox1 = [0., 0., float(gt_width), float(gt_height)]
                anchor_w = anchors[ka * 2]
                anchor_h = anchors[ka * 2 + 1]
                bbox2 = [0., 0., anchor_w/float(input_w), anchor_h/float(input_h)]
                # Calculate iou
                iou = box_iou_xywh(bbox1, bbox2)
                ious.append(iou)
            ious = np.array(ious)
            inds = np.argsort(ious)
            # The last index after sorting has the largest IoU
            k = inds[-1]
            label_objectness[n, k, i, j] = 1
            c = gt_cls
            label_classification[n, k, c, i, j] = 1.

            # For prediction boxes with objectness=1, set the location label
            dx_label = gt_center_x * num_cols - j
            dy_label = gt_center_y * num_rows - i
            dw_label = np.log(gt_width * input_w / anchors[k*2])
            dh_label = np.log(gt_height * input_h / anchors[k*2 + 1])
            label_location[n, k, 0, i, j] = dx_label
            label_location[n, k, 1, i, j] = dy_label
            label_location[n, k, 2, i, j] = dw_label
            label_location[n, k, 3, i, j] = dh_label
            # scale_location adjusts the contribution of anchor boxes of different
            # sizes to the loss function; it is a weighting coefficient multiplied
            # with the position loss
            scale_location[n, k, i, j] = 2.0 - gt_width * gt_height

    # At this point, the prediction boxes with positive objectness have been marked
    # according to all real boxes in each picture; the remaining prediction boxes
    # default to objectness 0.
    # For the prediction boxes with objectness 1, the object category and the
    # position regression targets have been marked.
    return label_objectness.astype('float32'), label_location.astype('float32'), label_classification.astype('float32'), \
             scale_location.astype('float32')
# Calculate IoU; the rectangular boxes are given in xywh format
def box_iou_xywh(box1, box2):
    x1min, y1min = box1[0] - box1[2]/2.0, box1[1] - box1[3]/2.0
    x1max, y1max = box1[0] + box1[2]/2.0, box1[1] + box1[3]/2.0
    s1 = box1[2] * box1[3]

    x2min, y2min = box2[0] - box2[2]/2.0, box2[1] - box2[3]/2.0
    x2max, y2max = box2[0] + box2[2]/2.0, box2[1] + box2[3]/2.0
    s2 = box2[2] * box2[3]

    # Corners of the intersection rectangle
    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    inter_h = np.maximum(ymax - ymin, 0.)
    inter_w = np.maximum(xmax - xmin, 0.)
    intersection = inter_h * inter_w

    union = s1 + s2 - intersection
    iou = intersection / union
    return iou
# Read data
import paddle
reader = paddle.io.DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=1, drop_last=True)
img, gt_boxes, gt_labels, im_shape = next(reader())
img, gt_boxes, gt_labels, im_shape = img.numpy(), gt_boxes.numpy(), gt_labels.numpy(), im_shape.numpy()
# Calculate the labels corresponding to the anchor boxes
label_objectness, label_location, label_classification, scale_location = get_objectness_label(
    img, gt_boxes, gt_labels, iou_threshold=0.7,
    anchors=[116, 90, 156, 198, 373, 326], num_classes=7, downsample=32)
img.shape, gt_boxes.shape, gt_labels.shape, im_shape.shape
((2, 3, 640, 640), (2, 50, 4), (2, 50), (2, 2))
label_objectness.shape, label_location.shape, label_classification.shape, scale_location.shape
((2, 3, 20, 20), (2, 3, 4, 20, 20), (2, 3, 7, 20, 20), (2, 3, 20, 20))
The above program labels the anchor boxes. For each real box, the anchor box that best matches its shape is selected, its objectness is marked as 1, $[d_x^*, d_y^*, t_w^*, t_h^*]$ is used as the position label of the positive sample, and the object category contained in the real box is used as the category of the anchor box. For the other anchor boxes, objectness is marked as 0, and position and category do not need to be labeled.
 Note: one small problem remains. As we said earlier, an anchor box whose IoU with a real box exceeds the threshold needs its objectness marked as -1, so that it does not participate in the calculation of the loss function. We set this aside for now and complete it later when we establish the loss function.
Feature extraction using convolutional neural network
In the previous course on image classification, we learned how to extract image features with a convolutional neural network. By stacking multiple layers of convolution and pooling, we obtain feature maps with richer semantic meaning. In the detection problem, a convolutional neural network is likewise used to extract image features layer by layer, and the final output feature map is used to represent information such as object position and category.
The backbone network used by the YOLOv3 algorithm is Darknet53. Its structure is shown in Figure 16, and it has achieved good results on the ImageNet image classification task. For the detection task, the average pooling, fully connected layer and Softmax after C0 in the figure are removed, and the network from the input to C0 is kept as the basic structure of the detection model, also known as the backbone network. The YOLOv3 model adds detection-related network modules on top of the backbone network.
Figure 16: Darknet53 network structure
The following program implements the Darknet53 backbone network. Here we take the outputs denoted C0, C1 and C2 in the figure above and check their shapes: $C0\ [1, 1024, 20, 20]$, $C1\ [1, 512, 40, 40]$ and $C2\ [1, 256, 80, 80]$.
 Term explanation: stride of a feature map
During feature extraction, convolution or pooling with a stride greater than 1 is usually used, so the feature maps become smaller and smaller. The stride of a feature map is equal to the size of the input picture divided by the size of the feature map. For example, the size of C0 is $20 \times 20$ and the size of the original picture is $640 \times 640$, so the stride of C0 is $\frac{640}{20} = 32$. Similarly, the stride of C1 is 16 and the stride of C2 is 8.
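The stride arithmetic can be checked directly. In this small sketch `feature_stride` is a hypothetical helper, and the feature map sizes 20, 40 and 80 are the ones quoted above for a 640 x 640 input.

```python
def feature_stride(input_size, feature_size):
    """Stride of a feature map: input picture size divided by feature map size."""
    return input_size // feature_size


strides = {name: feature_stride(640, size)
           for name, size in [('C0', 20), ('C1', 40), ('C2', 80)]}
print(strides)  # {'C0': 32, 'C1': 16, 'C2': 8}
```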
import paddle
import paddle.nn.functional as F
import numpy as np


class ConvBNLayer(paddle.nn.Layer):
    # Convolution followed by batch normalization and an optional leaky ReLU
    def __init__(self, ch_in, ch_out,
                 kernel_size=3, stride=1, groups=1,
                 padding=0, act="leaky"):
        super(ConvBNLayer, self).__init__()

        self.conv = paddle.nn.Conv2D(
            in_channels=ch_in,
            out_channels=ch_out,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            groups=groups,
            weight_attr=paddle.ParamAttr(
                initializer=paddle.nn.initializer.Normal(0., 0.02)),
            bias_attr=False)

        self.batch_norm = paddle.nn.BatchNorm2D(
            num_features=ch_out,
            weight_attr=paddle.ParamAttr(
                initializer=paddle.nn.initializer.Normal(0., 0.02),
                regularizer=paddle.regularizer.L2Decay(0.)),
            bias_attr=paddle.ParamAttr(
                initializer=paddle.nn.initializer.Constant(0.0),
                regularizer=paddle.regularizer.L2Decay(0.)))
        self.act = act

    def forward(self, inputs):
        out = self.conv(inputs)
        out = self.batch_norm(out)
        if self.act == 'leaky':
            out = F.leaky_relu(x=out, negative_slope=0.1)
        return out


class DownSample(paddle.nn.Layer):
    # Down sampling halves the picture size; it is implemented
    # with a convolution of stride=2
    def __init__(self, ch_in, ch_out,
                 kernel_size=3, stride=2, padding=1):
        super(DownSample, self).__init__()

        self.conv_bn_layer = ConvBNLayer(
            ch_in=ch_in,
            ch_out=ch_out,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding)
        self.ch_out = ch_out

    def forward(self, inputs):
        out = self.conv_bn_layer(inputs)
        return out


class BasicBlock(paddle.nn.Layer):
    """
    Basic residual block: the input x passes through two convolution layers,
    and the output of the second convolution is added to the input x
    """
    def __init__(self, ch_in, ch_out):
        super(BasicBlock, self).__init__()

        self.conv1 = ConvBNLayer(
            ch_in=ch_in,
            ch_out=ch_out,
            kernel_size=1,
            stride=1,
            padding=0)
        self.conv2 = ConvBNLayer(
            ch_in=ch_out,
            ch_out=ch_out*2,
            kernel_size=3,
            stride=1,
            padding=1)

    def forward(self, inputs):
        conv1 = self.conv1(inputs)
        conv2 = self.conv2(conv1)
        out = paddle.add(x=inputs, y=conv2)
        return out


class LayerWarp(paddle.nn.Layer):
    """
    Stack several residual blocks to form one stage of the Darknet53 network
    """
    def __init__(self, ch_in, ch_out, count, is_test=True):
        super(LayerWarp, self).__init__()

        self.basicblock0 = BasicBlock(ch_in, ch_out)
        self.res_out_list = []
        for i in range(1, count):
            # Use add_sublayer to register each sublayer
            res_out = self.add_sublayer("basic_block_%d" % (i),
                                        BasicBlock(ch_out*2, ch_out))
            self.res_out_list.append(res_out)

    def forward(self, inputs):
        y = self.basicblock0(inputs)
        for basic_block_i in self.res_out_list:
            y = basic_block_i(y)
        return y


# Number of residual blocks in each stage of DarkNet,
# taken from the DarkNet network structure diagram
DarkNet_cfg = {53: ([1, 2, 8, 8, 4])}


class DarkNet53_conv_body(paddle.nn.Layer):
    def __init__(self):
        super(DarkNet53_conv_body, self).__init__()
        self.stages = DarkNet_cfg[53]
        self.stages = self.stages[0:5]

        # First convolution layer
        self.conv0 = ConvBNLayer(
            ch_in=3,
            ch_out=32,
            kernel_size=3,
            stride=1,
            padding=1)

        # Down sampling, implemented with a convolution of stride=2
        self.downsample0 = DownSample(
            ch_in=32,
            ch_out=32 * 2)

        # Add the stages
        self.darknet53_conv_block_list = []
        self.downsample_list = []
        for i, stage in enumerate(self.stages):
            conv_block = self.add_sublayer(
                "stage_%d" % (i),
                LayerWarp(32*(2**(i+1)),
                          32*(2**i),
                          stage))
            self.darknet53_conv_block_list.append(conv_block)
        # Use DownSample between two stages to halve the size
        for i in range(len(self.stages) - 1):
            downsample = self.add_sublayer(
                "stage_%d_downsample" % i,
                DownSample(ch_in=32*(2**(i+1)),
                           ch_out=32*(2**(i+2))))
            self.downsample_list.append(downsample)

    def forward(self, inputs):
        out = self.conv0(inputs)
        out = self.downsample0(out)
        blocks = []
        # Apply each stage to the input in turn
        for i, conv_block_i in enumerate(self.darknet53_conv_block_list):
            out = conv_block_i(out)
            blocks.append(out)
            if i < len(self.stages) - 1:
                out = self.downsample_list[i](out)
        # Return the last three stage outputs as C0, C1 and C2
        return blocks[-1:-4:-1]
# View the output feature maps of the Darknet53 network
import numpy as np
backbone = DarkNet53_conv_body()
x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
print(C0.shape, C1.shape, C2.shape)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/norm.py:648: UserWarning: When training, we now always track global mean and variance.
  "When training, we now always track global mean and variance.")
[1, 1024, 20, 20] [1, 512, 40, 40] [1, 256, 80, 80]
The example code above uses input data of shape $(1, 3, 640, 640)$. The shapes of the output feature maps of the three levels are $C0\ (1, 1024, 20, 20)$, $C1\ (1, 512, 40, 40)$ and $C2\ (1, 256, 80, 80)$.
Calculate the position and category of the prediction box from the output feature map
The calculation logic for each prediction box in YOLOv3 is as follows:

Whether the prediction box contains an object. This can also be understood as the probability that objectness=1. The network outputs a real number $x$, and $Sigmoid(x)$ represents the probability $P_{obj}$ that objectness is positive.

Predict the position and shape of the object. The position and shape $t_x, t_y, t_w, t_h$ can be represented by four real numbers $t_x, t_y, t_w, t_h$ output by the network.

Predict the object category. Predict the specific category of the object in the image, or the probability that it belongs to each category. The total number of categories is C, and the probabilities $(P_1, P_2, ..., P_C)$ that the object belongs to each category are needed. The network outputs C real numbers $(x_1, x_2, ..., x_C)$, and applying the Sigmoid function to each of them, $P_i = Sigmoid(x_i)$, gives the probability that the object belongs to each category.
For one prediction box, the network needs to output $(5 + C)$ real numbers to represent whether it contains an object, its position and shape, and the probability of belonging to each category.
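The interpretation of the $(5 + C)$ numbers can be sketched as follows. This is illustrative only: `split_prediction` is a hypothetical helper, and the ordering (four location values, then objectness, then C class scores) follows the output vector listed earlier in this section.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def split_prediction(vec, num_classes):
    """Split the (5 + C) raw outputs of one prediction box into its three parts."""
    vec = np.asarray(vec, dtype='float32')
    assert vec.shape[0] == 5 + num_classes
    t_xywh = vec[0:4]         # t_x, t_y, t_w, t_h, used to decode the box
    p_obj = sigmoid(vec[4])   # probability that the box contains an object
    p_cls = sigmoid(vec[5:])  # independent probability for each category
    return t_xywh, p_obj, p_cls


# Raw outputs of all zeros give a probability of 0.5 everywhere.
t_xywh, p_obj, p_cls = split_prediction(np.zeros(5 + 7), num_classes=7)
print(t_xywh.shape, float(p_obj), p_cls.shape)  # (4,) 0.5 (7,)
```

Since each class score passes through its own Sigmoid, the class probabilities are independent and do not have to sum to 1.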
Since we have generated K prediction boxes in each small block area, the total number of prediction values that all prediction boxes need to be output by the network is:
[ K ( 5 + C ) ] × m × n [K(5 + C)] \times m \times n [K(5+C)]×m×n
Another more important point is that the network output must be able to distinguish the position of the small block area, and the characteristic diagram cannot be directly connected to an output with the size of [ K ( 5 + C ) ] × m × n [K(5 + C)] \times m \times n [K(5+C)] × m × Full connection layer of n.
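To get a feel for the size of this output, the count $[K(5 + C)] \times m \times n$ can be evaluated directly. The sketch below uses K = 3, C = 7 and a 20 × 20 grid, the values of the insect example used later in this section; the function name is hypothetical.

```python
def num_prediction_values(k, c, m, n):
    """Total real numbers the network must output: [K(5 + C)] * m * n."""
    return k * (5 + c) * m * n


per_box = 5 + 7                    # objectness (1) + location (4) + 7 class scores
per_cell = 3 * per_box             # K = 3 prediction boxes per small block area
total = num_prediction_values(k=3, c=7, m=20, n=20)
print(per_box, per_cell, total)
```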
Establish the association between the output feature map and the prediction boxes
Now look at the feature map. After repeated convolution and pooling its stride is 32, so a $640 \times 480$ input picture becomes a $20 \times 15$ feature map. The number of small block areas is also exactly $20 \times 15$, that is, each pixel on the feature map corresponds to one small block area on the original image. This is why the size of the small block area was set to 32 in the first place: it lets the block areas line up with the pixels of the feature map and settles the correspondence of spatial positions.
Figure 17: comparison of the shape of feature C0 and small block area
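The correspondence between image pixels and feature-map pixels can be sketched with integer division. This is an illustrative sketch only, assuming a stride of 32 and the 640 × 480 picture from the text; the helper names `grid_size` and `cell_of_pixel` are hypothetical.

```python
def grid_size(width, height, stride=32):
    """Number of small block areas as (rows, cols) for an image of the given size."""
    return height // stride, width // stride


def cell_of_pixel(x, y, stride=32):
    """Return (row i, column j) of the small block area that contains pixel (x, y)."""
    return y // stride, x // stride


rows, cols = grid_size(width=640, height=480)
i, j = cell_of_pixel(x=100, y=70)
print(rows, cols)   # size of the feature map
print(i, j)         # feature-map pixel that covers image pixel (100, 70)
```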
Next, the pixel $(i, j)$ on the feature map must be associated with the prediction values required by the small block area in row $i$, column $j$. Each small block area generates K prediction boxes and each prediction box needs $(5 + C)$ real-valued predictions, so each pixel must correspond to $K(5 + C)$ real numbers. To achieve this, the feature map is convolved several more times and the final number of output channels is set to $K(5 + C)$, so that the generated feature map lines up with the prediction values required by every prediction box. This correspondence exists so that the features extracted by the backbone network can be connected to the output layer and a loss can be formed. In practice these sizes can be adjusted to the data distribution of the task, as long as the size of the output feature map (controlled by the convolution kernels and downsampling) matches the size of the output layer (controlled by the size of the small block areas).
The output feature map of the backbone network is C0. The following program applies several convolutions to C0 to obtain the feature map P0 associated with the prediction boxes.
# Define the YOLOv3 detection head:
# feature extraction using several convolution + BN layers
class YoloDetectionBlock(paddle.nn.Layer):
    def __init__(self, ch_in, ch_out, is_test=True):
        super(YoloDetectionBlock, self).__init__()

        assert ch_out % 2 == 0, \
            "channel {} cannot be divided by 2".format(ch_out)

        self.conv0 = ConvBNLayer(
            ch_in=ch_in, ch_out=ch_out, kernel_size=1, stride=1, padding=0)
        self.conv1 = ConvBNLayer(
            ch_in=ch_out, ch_out=ch_out*2, kernel_size=3, stride=1, padding=1)
        self.conv2 = ConvBNLayer(
            ch_in=ch_out*2, ch_out=ch_out, kernel_size=1, stride=1, padding=0)
        self.conv3 = ConvBNLayer(
            ch_in=ch_out, ch_out=ch_out*2, kernel_size=3, stride=1, padding=1)
        self.route = ConvBNLayer(
            ch_in=ch_out*2, ch_out=ch_out, kernel_size=1, stride=1, padding=0)
        self.tip = ConvBNLayer(
            ch_in=ch_out, ch_out=ch_out*2, kernel_size=3, stride=1, padding=1)

    def forward(self, inputs):
        out = self.conv0(inputs)
        out = self.conv1(out)
        out = self.conv2(out)
        out = self.conv3(out)
        # route feeds the next level; tip feeds the prediction convolution
        route = self.route(out)
        tip = self.tip(route)
        return route, tip
NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters = NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)

print(P0.shape)
[1, 36, 20, 20]
As the code above shows, the feature map P0 is generated from the feature map C0, and the shape of P0 is $[1, 36, 20, 20]$. Each small block area generates 3 anchor boxes (and hence 3 prediction boxes) and there are 7 object categories, so each area needs $3 \times (5 + 7) = 36$ prediction values, which is exactly the number of output channels of P0.
$P0[t, 0:12, i, j]$ holds the 12 prediction values required by the first prediction box of the small block area $(i, j)$ on the $t$-th input picture, $P0[t, 12:24, i, j]$ holds the 12 values required by the second prediction box of that area, and $P0[t, 24:36, i, j]$ holds the 12 values required by the third.
Within the first group, $P0[t, 0:4, i, j]$ corresponds to the position of the first prediction box of area $(i, j)$ on the $t$-th picture, $P0[t, 4, i, j]$ to its objectness, and $P0[t, 5:12, i, j]$ to its category.
As shown in Fig. 18, the network's output feature map can thus be matched to the prediction boxes generated by each small block area.
Fig. 18: association between feature map P0 and candidate region
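The channel layout described above can be written down as index arithmetic. The following sketch is illustrative only (the helper `channel_slices` is hypothetical); it returns, for a given anchor index, the half-open channel ranges that hold the location values, the objectness value and the class scores.

```python
def channel_slices(anchor_index, num_classes):
    """Channel positions in P0 for one anchor: (location range, objectness index, class range)."""
    step = 5 + num_classes           # values per prediction box
    base = anchor_index * step       # first channel of this anchor's group
    location = (base, base + 4)      # tx, ty, tw, th
    objectness = base + 4
    classes = (base + 5, base + step)
    return location, objectness, classes


layout = [channel_slices(k, num_classes=7) for k in range(3)]
for k, (loc, obj, cls) in enumerate(layout):
    print("anchor", k, "location", loc, "objectness", obj, "classes", cls)
```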
Calculate the probability of whether the prediction box contains objects
According to the analysis above, $P0[t, 4, i, j]$ corresponds to the objectness of the first prediction box of the small block area $(i, j)$ on the $t$-th input picture, $P0[t, 4+12, i, j]$ corresponds to the objectness of the second prediction box, and so on. The following program extracts the predictions related to objectness and computes the output probability with paddle.nn.functional.sigmoid.
import paddle.nn.functional as F

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters = NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)

# Reshape [N, C, H, W] to [N, NUM_ANCHORS, NUM_CLASSES + 5, H, W]
reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)
print(pred_objectness.shape, pred_objectness_probability.shape)
[1, 3, 20, 20] [1, 3, 20, 20]
The output above shows that pred_objectness_probability, the probability that each prediction box contains an object, has shape [1, 3, 20, 20], which matches the number of prediction boxes described above. Its values lie between 0 and 1 and give the probability that the prediction box is a positive sample.
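For a single value, the mapping from the raw network output to a probability can be sketched in plain Python. The branch on the sign of `x` is an addition that is not in the tutorial code; it is one common way to avoid overflow in `exp` for large negative inputs.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


# Raw objectness outputs (made-up values) and their probabilities
for raw in (-4.0, 0.0, 4.0):
    print(raw, sigmoid(raw))
```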
Calculate the position coordinates of the prediction boxes
$P0[t, 0:4, i, j]$ corresponds to the position of the first prediction box of the small block area $(i, j)$ on the $t$-th input picture, $P0[t, 12:16, i, j]$ corresponds to the position of the second prediction box, and so on. The following program extracts from $P0$ the predicted values related to the positions of the prediction boxes.
NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters = NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)

reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)

# Channels 0:4 of every anchor group hold tx, ty, tw, th
pred_location = reshaped_p0[:, :, 0:4, :, :]
print(pred_location.shape)
[1, 3, 4, 20, 20]
The network outputs $(t_x, t_y, t_w, t_h)$, which still have to be converted to the coordinate form $(x_1, y_1, x_2, y_2)$. Paddle's paddle.vision.ops.yolo_box API computes the result directly, but to show the algorithm more clearly, the process is implemented here with NumPy.
# Define the Sigmoid function
def sigmoid(x):
    return 1./(1.0 + np.exp(-x))

# Convert the [tx, ty, tw, th] output by the network feature map
# into prediction box coordinates [x1, y1, x2, y2]
def get_yolo_box_xxyy(pred, anchors, num_classes, downsample):
    """
    pred is the numpy.ndarray converted from the network's output feature map
    anchors is a list giving the sizes of the anchor boxes,
    for example anchors = [116, 90, 156, 198, 373, 326] describes three anchor boxes:
    the first has size [w, h] = [116, 90], the second [156, 198], and the third [373, 326]
    """
    batchsize = pred.shape[0]
    num_rows = pred.shape[-2]
    num_cols = pred.shape[-1]

    input_h = num_rows * downsample
    input_w = num_cols * downsample

    num_anchors = len(anchors) // 2

    # The shape of pred is [N, C, H, W], where C = NUM_ANCHORS * (5 + NUM_CLASSES)
    # Reshape pred
    pred = pred.reshape([-1, num_anchors, 5+num_classes, num_rows, num_cols])
    pred_location = pred[:, :, 0:4, :, :]
    pred_location = np.transpose(pred_location, (0,3,4,1,2))
    anchors_this = []
    for ind in range(num_anchors):
        anchors_this.append([anchors[ind*2], anchors[ind*2+1]])
    anchors_this = np.array(anchors_this).astype('float32')

    # The final output is stored in pred_box, whose shape is [N, H, W, NUM_ANCHORS, 4];
    # the last dimension holds the four coordinates of the position
    pred_box = np.zeros(pred_location.shape)
    for n in range(batchsize):
        for i in range(num_rows):
            for j in range(num_cols):
                for k in range(num_anchors):
                    pred_box[n, i, j, k, 0] = j
                    pred_box[n, i, j, k, 1] = i
                    pred_box[n, i, j, k, 2] = anchors_this[k][0]
                    pred_box[n, i, j, k, 3] = anchors_this[k][1]

    # Relative coordinates are used here; the elements of pred_box lie in 0.0 ~ 1.0
    pred_box[:, :, :, :, 0] = (sigmoid(pred_location[:, :, :, :, 0]) + pred_box[:, :, :, :, 0]) / num_cols
    pred_box[:, :, :, :, 1] = (sigmoid(pred_location[:, :, :, :, 1]) + pred_box[:, :, :, :, 1]) / num_rows
    pred_box[:, :, :, :, 2] = np.exp(pred_location[:, :, :, :, 2]) * pred_box[:, :, :, :, 2] / input_w
    pred_box[:, :, :, :, 3] = np.exp(pred_location[:, :, :, :, 3]) * pred_box[:, :, :, :, 3] / input_h

    # Convert coordinates from xywh to xyxy
    pred_box[:, :, :, :, 0] = pred_box[:, :, :, :, 0] - pred_box[:, :, :, :, 2] / 2.
    pred_box[:, :, :, :, 1] = pred_box[:, :, :, :, 1] - pred_box[:, :, :, :, 3] / 2.
    pred_box[:, :, :, :, 2] = pred_box[:, :, :, :, 0] + pred_box[:, :, :, :, 2]
    pred_box[:, :, :, :, 3] = pred_box[:, :, :, :, 1] + pred_box[:, :, :, :, 3]

    pred_box = np.clip(pred_box, 0., 1.0)

    return pred_box
By calling the get_yolo_box_xxyy function defined above, the coordinates of the prediction boxes can be computed from $P0$. The specific program is as follows:
NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters = NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)

reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)

pred_location = reshaped_p0[:, :, 0:4, :, :]

# anchors contains the preset anchor box sizes
anchors = [116, 90, 156, 198, 373, 326]
# downsample is the stride of the feature map P0
# Compute the position coordinates of the prediction boxes from the output feature map P0
pred_boxes = get_yolo_box_xxyy(P0.numpy(), anchors, num_classes=7, downsample=32)
print(pred_boxes.shape)
(1, 20, 20, 3, 4)
The pred_boxes computed by the program above has shape $[N, H, W, num\_anchors, 4]$. The coordinates are in the format $[x_1, y_1, x_2, y_2]$ with values between 0 and 1, that is, relative coordinates.
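To see what get_yolo_box_xxyy does for one prediction box, the same arithmetic can be written for a single cell and a single anchor. This is a plain-Python sketch that follows the formulas of the NumPy code above; the function name `decode_box` and the sample inputs are illustrative assumptions.

```python
import math


def decode_box(t, i, j, anchor, num_rows, num_cols, downsample):
    """Decode (tx, ty, tw, th) at cell (row i, col j) into relative [x1, y1, x2, y2]."""
    tx, ty, tw, th = t
    anchor_w, anchor_h = anchor
    input_w = num_cols * downsample
    input_h = num_rows * downsample

    def sig(v):
        return 1.0 / (1.0 + math.exp(-v))

    # Box center: cell offset plus a sigmoid-bounded shift, normalized by the grid size
    cx = (sig(tx) + j) / num_cols
    cy = (sig(ty) + i) / num_rows
    # Box size: anchor size scaled by exp(t), normalized by the input size
    w = math.exp(tw) * anchor_w / input_w
    h = math.exp(th) * anchor_h / input_h

    def clip(v):
        return min(max(v, 0.0), 1.0)

    return [clip(cx - w / 2), clip(cy - h / 2), clip(cx + w / 2), clip(cy + h / 2)]


# All-zero outputs place the box at the cell center with exactly the anchor's size
box = decode_box((0.0, 0.0, 0.0, 0.0), i=10, j=10, anchor=(116, 90),
                 num_rows=20, num_cols=20, downsample=32)
print(box)
```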
Calculate the probability that the object belongs to each category
$P0[t, 5:12, i, j]$ corresponds to the category of the object contained in the first prediction box of the small block area $(i, j)$ on the $t$-th input picture, $P0[t, 17:24, i, j]$ corresponds to the category of the second prediction box, and so on. The following program extracts from $P0$ the predicted values related to the categories of the prediction boxes.
NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters = NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)

reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
# Extract the predicted values related to objectness
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)
# Extract the predicted values related to the position
pred_location = reshaped_p0[:, :, 0:4, :, :]
# Extract the predicted values related to the category
pred_classification = reshaped_p0[:, :, 5:5+NUM_CLASSES, :, :]
pred_classification_probability = F.sigmoid(pred_classification)
print(pred_classification.shape)
[1, 3, 7, 20, 20]
The program above computes from $P0$ the probability of each category for the object in every prediction box. pred_classification_probability has shape $[1, 3, 7, 20, 20]$ and its values lie between 0 and 1.
Loss function
So far the pixels of the output feature map have been associated with the prediction boxes conceptually. To train the neural network, the network output must also be tied to the prediction boxes mathematically, that is, a relationship between the loss function and the network output must be established. The following discusses how the loss function of YOLOv3 is built.
For each prediction box, YOLOv3 model will establish three types of loss functions:

The loss function characterizing whether a target object is contained, computed from pred_objectness and label_objectness.
loss_obj = paddle.nn.functional.binary_cross_entropy_with_logits(pred_objectness, label_objectness)

The loss function characterizing the position of the object, computed from pred_location and label_location.
pred_location_x = pred_location[:, :, 0, :, :]
pred_location_y = pred_location[:, :, 1, :, :]
pred_location_w = pred_location[:, :, 2, :, :]
pred_location_h = pred_location[:, :, 3, :, :]
loss_location_x = paddle.nn.functional.binary_cross_entropy_with_logits(pred_location_x, label_location_x)
loss_location_y = paddle.nn.functional.binary_cross_entropy_with_logits(pred_location_y, label_location_y)
loss_location_w = paddle.abs(pred_location_w - label_location_w)
loss_location_h = paddle.abs(pred_location_h - label_location_h)
loss_location = loss_location_x + loss_location_y + loss_location_w + loss_location_h

The loss function characterizing the object category, computed from pred_classification and label_classification.
loss_classification = paddle.nn.functional.binary_cross_entropy_with_logits(pred_classification, label_classification)
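For a single prediction box, the three kinds of loss above reduce to a few scalar formulas. The sketch below writes binary cross-entropy on a raw logit in the usual numerically stable form, max(x, 0) - x·z + log(1 + exp(-|x|)), together with the absolute-difference loss used for width and height. The function names and sample numbers are illustrative assumptions, not values from the tutorial.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy between sigmoid(logit) and label, computed stably."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def l1(pred, label):
    """Absolute-difference loss used for the width and height terms."""
    return abs(pred - label)


# Objectness: raw output 2.0 for a box labelled positive (label 1)
loss_obj = bce_with_logits(2.0, 1.0)

# Location: x and y use cross-entropy on the offsets, w and h use the L1 difference
loss_location = (bce_with_logits(0.0, 0.5) + bce_with_logits(0.0, 0.5)
                 + l1(0.2, 0.0) + l1(-0.1, 0.0))

# Classification: one cross-entropy term per category, summed over the categories
logits = [3.0, -3.0, -3.0]
labels = [1.0, 0.0, 0.0]
loss_classification = sum(bce_with_logits(x, z) for x, z in zip(logits, labels))

print(loss_obj, loss_location, loss_classification)
```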
We already know how to compute these predicted values and labels, but one small problem remains: we have not yet marked which anchor boxes should have an objectness of -1. To do this, compute the IoU between every prediction box and every real box, and then select the prediction boxes whose IoU is greater than the threshold. The implementation code is as follows:
# Select the prediction boxes whose IoU with a real box is greater than the threshold
def get_iou_above_thresh_inds(pred_box, gt_boxes, iou_threshold):
    batchsize = pred_box.shape[0]
    num_rows = pred_box.shape[1]
    num_cols = pred_box.shape[2]
    num_anchors = pred_box.shape[3]
    ret_inds = np.zeros([batchsize, num_rows, num_cols, num_anchors])
    for i in range(batchsize):
        pred_box_i = pred_box[i]
        gt_boxes_i = gt_boxes[i]
        for k in range(len(gt_boxes_i)):
            # Real boxes are stored as [center x, center y, width, height]
            gt = gt_boxes_i[k]
            gtx_min = gt[0] - gt[2] / 2.
            gty_min = gt[1] - gt[3] / 2.
            gtx_max = gt[0] + gt[2] / 2.
            gty_max = gt[1] + gt[3] / 2.
            # Skip empty (padding) boxes
            if (gtx_max - gtx_min < 1e-3) or (gty_max - gty_min < 1e-3):
                continue
            x1 = np.maximum(pred_box_i[:, :, :, 0], gtx_min)
            y1 = np.maximum(pred_box_i[:, :, :, 1], gty_min)
            x2 = np.minimum(pred_box_i[:, :, :, 2], gtx_max)
            y2 = np.minimum(pred_box_i[:, :, :, 3], gty_max)
            intersection = np.maximum(x2 - x1, 0.) * np.maximum(y2 - y1, 0.)
            s1 = (gty_max - gty_min) * (gtx_max - gtx_min)
            s2 = (pred_box_i[:, :, :, 2] - pred_box_i[:, :, :, 0]) * (pred_box_i[:, :, :, 3] - pred_box_i[:, :, :, 1])
            union = s2 + s1 - intersection
            iou = intersection / union
            above_inds = np.where(iou > iou_threshold)
            ret_inds[i][above_inds] = 1
    # Reorder to [N, NUM_ANCHORS, H, W] to match label_objectness
    ret_inds = np.transpose(ret_inds, (0,3,1,2))
    return ret_inds.astype('bool')
The function above identifies the anchor boxes whose objectness needs to be marked as -1. The following program processes label_objectness and marks as -1 the anchor boxes whose IoU is greater than the threshold but which are not positive samples.
def label_objectness_ignore(label_objectness, iou_above_thresh_indices):
    # Note: label_objectness[iou_above_thresh_indices] = -1 cannot simply be used here,
    # because that could set points whose label_objectness is 1 to -1.
    # Only prediction boxes that are labelled 0 and whose IoU with a real box
    # exceeds the threshold are marked as -1.
    negative_indices = (label_objectness < 0.5)
    ignore_indices = negative_indices * iou_above_thresh_indices
    label_objectness[ignore_indices] = -1
    return label_objectness
Next, call these two functions to set the label_objectness of the selected prediction boxes to -1.
# Read data
reader = paddle.io.DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=0, drop_last=True)
img, gt_boxes, gt_labels, im_shape = next(reader())
img, gt_boxes, gt_labels, im_shape = img.numpy(), gt_boxes.numpy(), gt_labels.numpy(), im_shape.numpy()

# Calculate the labels corresponding to the anchor boxes
label_objectness, label_location, label_classification, scale_location = get_objectness_label(img,
                                                                                              gt_boxes, gt_labels,
                                                                                              iou_threshold = 0.7,
                                                                                              anchors = [116, 90, 156, 198, 373, 326],
                                                                                              num_classes=7, downsample=32)

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters = NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = paddle.to_tensor(img)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)

# anchors contains the preset anchor box sizes
anchors = [116, 90, 156, 198, 373, 326]
# downsample is the stride of the feature map P0
pred_boxes = get_yolo_box_xxyy(P0.numpy(), anchors, num_classes=7, downsample=32)
iou_above_thresh_indices = get_iou_above_thresh_inds(pred_boxes, gt_boxes, iou_threshold=0.7)
label_objectness = label_objectness_ignore(label_objectness, iou_above_thresh_indices)
In this way, the objectness label of the samples that are not marked as positive but whose IoU with a real box is greater than the threshold is set to -1, and they do not contribute to any loss function. The code for calculating the total loss function is as follows:
def get_loss(output, label_objectness, label_location, label_classification, scales, num_anchors=3, num_classes=7):
    # Reshape output from [N, C, H, W] to [N, num_anchors, num_classes + 5, H, W]
    reshaped_output = paddle.reshape(output, [-1, num_anchors, num_classes + 5, output.shape[2], output.shape[3]])

    # Extract the predicted values related to objectness from the output
    pred_objectness = reshaped_output[:, :, 4, :, :]
    loss_objectness = F.binary_cross_entropy_with_logits(pred_objectness, label_objectness, reduction="none")

    # pos_samples is 1 only at positive samples and 0 elsewhere
    pos_objectness = label_objectness > 0
    pos_samples = paddle.cast(pos_objectness, 'float32')
    pos_samples.stop_gradient=True

    # Extract all position-related predicted values from the output
    tx = reshaped_output[:, :, 0, :, :]
    ty = reshaped_output[:, :, 1, :, :]
    tw = reshaped_output[:, :, 2, :, :]
    th = reshaped_output[:, :, 3, :, :]

    # Take the label of each position coordinate out of label_location
    dx_label = label_location[:, :, 0, :, :]
    dy_label = label_location[:, :, 1, :, :]
    tw_label = label_location[:, :, 2, :, :]
    th_label = label_location[:, :, 3, :, :]

    # Build the loss functions
    loss_location_x = F.binary_cross_entropy_with_logits(tx, dx_label, reduction="none")
    loss_location_y = F.binary_cross_entropy_with_logits(ty, dy_label, reduction="none")
    loss_location_w = paddle.abs(tw - tw_label)
    loss_location_h = paddle.abs(th - th_label)

    # Calculate the total position loss
    loss_location = loss_location_x + loss_location_y + loss_location_h + loss_location_w

    # Multiply by scales
    loss_location = loss_location * scales
    # Only the position loss of positive samples is counted
    loss_location = loss_location * pos_samples

    # Extract all values related to the object category from the output
    pred_classification = reshaped_output[:, :, 5:5+num_classes, :, :]

    # Calculate the loss related to classification
    loss_classification = F.binary_cross_entropy_with_logits(pred_classification, label_classification, reduction="none")

    # Sum over the category dimension (axis 2)
    loss_classification = paddle.sum(loss_classification, axis=2)

    # Only the classification loss of samples with positive objectness is counted
    loss_classification = loss_classification * pos_samples
    total_loss = loss_objectness + loss_location + loss_classification
    # Sum the loss of all prediction boxes
    total_loss = paddle.sum(total_loss, axis=[1,2,3])
    # Average over all samples
    total_loss = paddle.mean(total_loss)

    return total_loss
from paddle.nn import Conv2D

# Calculate the labels corresponding to the anchor boxes
label_objectness, label_location, label_classification, scale_location = get_objectness_label(img,
                                                                                              gt_boxes, gt_labels,
                                                                                              iou_threshold = 0.7,
                                                                                              anchors = [116, 90, 156, 198, 373, 326],
                                                                                              num_classes=7, downsample=32)

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters = NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = paddle.to_tensor(img)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)

# anchors contains the preset anchor box sizes
anchors = [116, 90, 156, 198, 373, 326]
# downsample is the stride of the feature map P0
pred_boxes = get_yolo_box_xxyy(P0.numpy(), anchors, num_classes=7, downsample=32)
iou_above_thresh_indices = get_iou_above_thresh_inds(pred_boxes, gt_boxes, iou_threshold=0.7)
label_objectness = label_objectness_ignore(label_objectness, iou_above_thresh_indices)

# Convert the labels to tensors and exclude them from gradient computation
label_objectness = paddle.to_tensor(label_objectness)
label_location = paddle.to_tensor(label_location)
label_classification = paddle.to_tensor(label_classification)
scales = paddle.to_tensor(scale_location)
label_objectness.stop_gradient=True
label_location.stop_gradient=True
label_classification.stop_gradient=True
scales.stop_gradient=True

total_loss = get_loss(P0, label_objectness, label_location, label_classification, scales,
                      num_anchors=NUM_ANCHORS, num_classes=NUM_CLASSES)
total_loss_data = total_loss.numpy()
print(total_loss_data)
The program above calculates the total loss function. At this point the reader has covered most of the YOLOv3 algorithm: how to generate anchor boxes, label them, extract features with a convolutional neural network, associate the output feature map with the prediction boxes, and establish the loss function.
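As a compact recap of the labelling rule used above, the IoU of two boxes and the resulting objectness label can be sketched for a single pair of boxes. This is an illustrative plain-Python sketch; the function names are hypothetical, boxes are given as corner coordinates [x1, y1, x2, y2], and the default threshold of 0.7 is the value used in this section.

```python
def iou_xyxy(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def objectness_label(is_positive, iou_with_gt, iou_threshold=0.7):
    """1 for positive samples, -1 for ignored samples, 0 for negative samples."""
    if is_positive:
        return 1
    if iou_with_gt > iou_threshold:
        return -1   # overlaps a real box well, but was not chosen as the positive sample
    return 0


overlap = iou_xyxy([0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0])
print(overlap)
print(objectness_label(False, 0.8), objectness_label(False, overlap), objectness_label(True, 0.8))
```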
Multi-scale detection
So far the loss function has been computed on the feature map P0, whose stride is 32. This feature map is relatively small, with few pixels, and each pixel has a large receptive field. It carries rich high-level semantic information and is better suited to detecting large targets. To detect smaller targets, the prediction output needs to be built on a larger feature map. Generating the prediction output directly on the C2 or C1 feature map raises a new problem: these maps have not gone through enough feature extraction, the semantic information in their pixels is not rich enough, and it may be difficult to extract effective feature patterns. In object detection, the usual solution is to enlarge the high-level feature map and fuse it with the low-level feature map. The new feature map then contains rich semantic information while also having more pixels and describing finer structure.
The specific network implementation method is shown in Figure 19:
Fig. 19: generating the multi-level output feature maps P0, P1 and P2
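The size and channel bookkeeping of this fusion can be sketched as follows. The channel expressions mirror those used in the YOLOv3 class later in this section; the helper names are hypothetical, and the 20 × 20 starting size assumes the 640 × 640 input used in the earlier examples.

```python
def fused_input_channels(level):
    """Input channels of the detection block at level 0 (C0), 1 (C1) or 2 (C2)."""
    backbone_channels = 512 // (2 ** level) * 2      # C0: 1024, C1: 512, C2: 256
    if level == 0:
        return backbone_channels                     # nothing to fuse at the top level
    route_channels = 512 // (2 ** level)             # upsampled route from the level above
    return backbone_channels + route_channels        # channel-wise concatenation


def upsampled_size(size, scale=2):
    """Spatial size after nearest-neighbour upsampling by an integer factor."""
    return size * scale


channels = [fused_input_channels(i) for i in range(3)]
sizes = [20, upsampled_size(20), upsampled_size(upsampled_size(20))]
print(channels)
print(sizes)
```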
YOLOv3 generates three anchor boxes at the center of each area. The anchor sizes used on the feature maps of the three levels are P2 [(10 × 13), (16 × 30), (33 × 23)], P1 [(30 × 61), (62 × 45), (59 × 119)] and P0 [(116 × 90), (156 × 198), (373 × 326)]. The later feature maps use larger anchor boxes and can capture large targets; the earlier feature maps use smaller anchor boxes and can capture small targets.
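The assignment of anchor sizes to levels can be expressed as a flat list of sizes plus one index mask per level, which is the form used by the training code later in this section. A minimal sketch follows; the helper `anchors_for_level` is hypothetical.

```python
ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]
ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]   # masks for P0, P1 and P2


def anchors_for_level(anchors, mask):
    """Pick the (width, height) pairs selected by the mask from the flat anchor list."""
    return [(anchors[2 * k], anchors[2 * k + 1]) for k in mask]


per_level = {"P%d" % i: anchors_for_level(ANCHORS, m) for i, m in enumerate(ANCHOR_MASKS)}
for name, sizes in per_level.items():
    print(name, sizes)
```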
Because of multi-scale detection, the code above would need substantial changes and the implementation becomes somewhat cumbersome. It is therefore recommended to use Paddle's paddle.vision.ops.yolo_loss API directly. The key parameters are described below:
paddle.vision.ops.yolo_loss(x, gt_box, gt_label, anchors, anchor_mask, class_num, ignore_thresh, downsample_ratio, gt_score=None, use_label_smooth=True, name=None, scale_x_y=1.0)
 x: the output feature map.
 gt_box: the real boxes.
 gt_label: the labels of the real boxes.
 ignore_thresh: when the IoU between a prediction box and a real box exceeds ignore_thresh, the prediction box is not treated as a negative sample; it is set to 0.7 in the YOLOv3 model.
 downsample_ratio: the downsampling ratio of the feature map; it is 32 for P0 when the Darknet53 backbone network is used.
 gt_score: the confidence of the real boxes, used with the mixup technique.
 use_label_smooth: a training technique; set to False if not used.
 name: the name of the layer, such as 'yolov3_loss'; the default value is None and it generally does not need to be set.
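The earlier function get_iou_above_thresh_inds treats each real box as relative [center x, center y, width, height]. Assuming the boxes passed as gt_box follow the same convention (an assumption that should be checked against the Paddle documentation), annotations stored as pixel corner coordinates could be converted with a sketch like the one below; the function name and sample box are hypothetical.

```python
def xyxy_pixels_to_relative_xywh(box, image_w, image_h):
    """Convert [x1, y1, x2, y2] in pixels to relative [center x, center y, width, height]."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2.0 / image_w
    cy = (y1 + y2) / 2.0 / image_h
    w = (x2 - x1) / image_w
    h = (y2 - y1) / image_h
    return [cx, cy, w, h]


# A made-up annotation on a 640 x 480 picture
gt = xyxy_pixels_to_relative_xywh([160, 120, 480, 360], image_w=640, image_h=480)
print(gt)
```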
The implementation that generates prediction boxes from the multi-level feature maps is as follows:
# Define the upsampling module
class Upsample(paddle.nn.Layer):
    def __init__(self, scale=2):
        super(Upsample,self).__init__()
        self.scale = scale

    def forward(self, inputs):
        # get dynamic upsample output shape
        shape_nchw = paddle.shape(inputs)
        shape_hw = paddle.slice(shape_nchw, axes=[0], starts=[2], ends=[4])
        shape_hw.stop_gradient = True
        in_shape = paddle.cast(shape_hw, dtype='int32')
        out_shape = in_shape * self.scale
        out_shape.stop_gradient = True

        # resize by the scale factor
        out = paddle.nn.functional.interpolate(
            x=inputs, scale_factor=self.scale, mode="NEAREST")
        return out


class YOLOv3(paddle.nn.Layer):
    def __init__(self, num_classes=7):
        super(YOLOv3,self).__init__()

        self.num_classes = num_classes
        # Backbone that extracts image features
        self.block = DarkNet53_conv_body()
        self.block_outputs = []
        self.yolo_blocks = []
        self.route_blocks_2 = []
        # Generate the feature maps P0, P1 and P2 of the three levels
        for i in range(3):
            # Add the module that generates ri and ti from ci
            yolo_block = self.add_sublayer(
                "yolo_detecton_block_%d" % (i),
                YoloDetectionBlock(
                    ch_in=512//(2**i)*2 if i==0 else 512//(2**i)*2 + 512//(2**i),
                    ch_out = 512//(2**i)))
            self.yolo_blocks.append(yolo_block)

            num_filters = 3 * (self.num_classes + 5)

            # Add the module that generates pi from ti. This is a Conv2D operation
            # whose number of output channels is 3 * (num_classes + 5)
            block_out = self.add_sublayer(
                "block_out_%d" % (i),
                paddle.nn.Conv2D(in_channels=512//(2**i)*2,
                                 out_channels=num_filters,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0,
                                 weight_attr=paddle.ParamAttr(
                                     initializer=paddle.nn.initializer.Normal(0., 0.02)),
                                 bias_attr=paddle.ParamAttr(
                                     initializer=paddle.nn.initializer.Constant(0.0),
                                     regularizer=paddle.regularizer.L2Decay(0.))))
            self.block_outputs.append(block_out)
            if i < 2:
                # Convolve ri
                route = self.add_sublayer("route2_%d"%i,
                                          ConvBNLayer(ch_in=512//(2**i),
                                                      ch_out=256//(2**i),
                                                      kernel_size=1,
                                                      stride=1,
                                                      padding=0))
                self.route_blocks_2.append(route)
            # Enlarge ri so that it has the same size as c_{i+1}
            self.upsample = Upsample()

    def forward(self, inputs):
        outputs = []
        blocks = self.block(inputs)
        for i, block in enumerate(blocks):
            if i > 0:
                # r_{i-1}, after convolution and upsampling, is concatenated with ci of this level
                block = paddle.concat([route, block], axis=1)
            # Generate ti and ri from ci
            route, tip = self.yolo_blocks[i](block)
            # Generate pi from ti
            block_out = self.block_outputs[i](tip)
            # Put pi into the list
            outputs.append(block_out)

            if i < 2:
                # Convolve ri to adjust the number of channels
                route = self.route_blocks_2[i](route)
                # Enlarge ri so that its size matches c_{i+1}
                route = self.upsample(route)

        return outputs

    def get_loss(self, outputs, gtbox, gtlabel, gtscore=None,
                 anchors = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326],
                 anchor_masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]],
                 ignore_thresh=0.7,
                 use_label_smooth=False):
        """
        Use paddle.vision.ops.yolo_loss to compute the loss function directly, which is simpler and faster
        """
        self.losses = []
        downsample = 32
        # Calculate the loss function for each of the three levels
        for i, out in enumerate(outputs):
            anchor_mask_i = anchor_masks[i]
            loss = paddle.vision.ops.yolo_loss(
                    x=out,  # out is one of P0, P1 and P2
                    gt_box=gtbox,  # Real box coordinates
                    gt_label=gtlabel,  # Real box categories
                    gt_score=gtscore,  # Real box scores, needed for the mixup training technique; otherwise all 1, with the same shape as gtlabel
                    anchors=anchors,   # Anchor box sizes: the sizes of 9 anchor boxes [w0, h0, w1, h1, ..., w8, h8]
                    anchor_mask=anchor_mask_i, # Mask that selects anchor boxes, e.g. anchor_mask_i=[3, 4, 5] selects the 3rd, 4th and 5th anchor boxes in anchors for this level
                    class_num=self.num_classes, # Number of categories
                    ignore_thresh=ignore_thresh, # When the IoU of a prediction box and a real box > ignore_thresh, label objectness = -1
                    downsample_ratio=downsample, # Downsampling ratio of this feature map, e.g. 32 for P0, 16 for P1 and 8 for P2
                    use_label_smooth=False)      # The label_smooth training technique is not used here, so this is set to False
            self.losses.append(paddle.mean(loss))  # mean averages over the pictures in the batch
            downsample = downsample // 2 # The downsampling ratio of the next level's feature map is halved
        return sum(self.losses) # Sum over the levels
Start end-to-end training
The training process is shown in Figure 20. After feature extraction, the input picture yields output feature maps at three levels: P0 (stride = 32), P1 (stride = 16) and P2 (stride = 8). For each level, small square areas of the corresponding size are used to generate anchor boxes and prediction boxes, and the anchor boxes are labeled.

P0 level feature map: small square areas of size $32\times32$ are used, and three anchor boxes with sizes $[116, 90]$, $[156, 198]$ and $[373, 326]$ are generated at the center of each area.

P1 level feature map: small square areas of size $16\times16$ are used, and three anchor boxes with sizes $[30, 61]$, $[62, 45]$ and $[59, 119]$ are generated at the center of each area.

P2 level feature map: small square areas of size $8\times8$ are used, and three anchor boxes with sizes $[10, 13]$, $[16, 30]$ and $[33, 23]$ are generated at the center of each area.
Associate the feature maps of the three levels with the labels of the corresponding anchor boxes and build the loss function. The total loss is the sum of the losses of the three levels. Minimizing this loss drives the end-to-end training process.
Figure 20: End-to-end training process
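The relationship between levels, strides and anchor boxes described above can be sketched in a few lines. This is only an illustration: the helper name describe_levels is hypothetical, and the 608-pixel input size is an assumption borrowed from the test_image_size default used later in this tutorial.

```python
# Anchor sizes and masks as used throughout this tutorial.
ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]
ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
STRIDES = [32, 16, 8]  # strides of P0, P1 and P2


def describe_levels(input_size, anchors=ANCHORS, anchor_masks=ANCHOR_MASKS, strides=STRIDES):
    """Hypothetical helper: list stride, grid size and anchor sizes per level."""
    levels = []
    for stride, mask in zip(strides, anchor_masks):
        grid = input_size // stride  # number of small square areas per side
        # Each mask index m selects the pair (w, h) stored at positions 2m and 2m+1.
        level_anchors = [[anchors[2 * m], anchors[2 * m + 1]] for m in mask]
        levels.append({
            "stride": stride,
            "grid": grid,
            "anchors": level_anchors,
            "num_boxes": grid * grid * len(mask),  # one box per anchor per area
        })
    return levels


if __name__ == "__main__":
    # 608 is an assumed input size, used here only for illustration.
    for idx, level in enumerate(describe_levels(608)):
        print("P%d" % idx, level)
```

Under that assumption, P0 has a 19 x 19 grid, P1 a 38 x 38 grid and P2 a 76 x 76 grid, with three anchor boxes per area at every level.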
The specific implementation code of the training process is as follows:
############# Be careful when running this code on a local machine; it can easily crash #######################
import time
import os
import numpy as np
import paddle

ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]

ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]

IGNORE_THRESH = .7
NUM_CLASSES = 7

def get_lr(base_lr=0.0001, lr_decay=0.1):
    # Piecewise-constant learning rate: base_lr, then base_lr * 0.1, then base_lr * 0.01
    bd = [10000, 20000]
    lr = [base_lr, base_lr * lr_decay, base_lr * lr_decay * lr_decay]
    learning_rate = paddle.optimizer.lr.PiecewiseDecay(boundaries=bd, values=lr)
    return learning_rate


if __name__ == '__main__':

    TRAINDIR = '/home/aistudio/work/insects/train'
    TESTDIR = '/home/aistudio/work/insects/test'
    VALIDDIR = '/home/aistudio/work/insects/val'
    paddle.set_device("gpu:0")
    # Create the dataset objects
    train_dataset = TrainDataset(TRAINDIR, mode='train')
    valid_dataset = TrainDataset(VALIDDIR, mode='valid')
    test_dataset = TrainDataset(VALIDDIR, mode='valid')
    # Create the data readers with paddle.io.DataLoader, setting batch_size, the number of worker processes num_workers and other parameters
    train_loader = paddle.io.DataLoader(train_dataset, batch_size=10, shuffle=True, num_workers=0, drop_last=True, use_shared_memory=False)
    valid_loader = paddle.io.DataLoader(valid_dataset, batch_size=10, shuffle=False, num_workers=0, drop_last=False, use_shared_memory=False)
    model = YOLOv3(num_classes=NUM_CLASSES)  # Create the model
    learning_rate = get_lr()
    opt = paddle.optimizer.Momentum(
        learning_rate=learning_rate,
        momentum=0.9,
        weight_decay=paddle.regularizer.L2Decay(0.0005),
        parameters=model.parameters())  # Create the optimizer
    # opt = paddle.optimizer.Adam(learning_rate=learning_rate, weight_decay=paddle.regularizer.L2Decay(0.0005), parameters=model.parameters())

    MAX_EPOCH = 200
    for epoch in range(MAX_EPOCH):
        for i, data in enumerate(train_loader()):
            img, gt_boxes, gt_labels, img_scale = data
            gt_scores = np.ones(gt_labels.shape).astype('float32')
            gt_scores = paddle.to_tensor(gt_scores)
            img = paddle.to_tensor(img)
            gt_boxes = paddle.to_tensor(gt_boxes)
            gt_labels = paddle.to_tensor(gt_labels)
            outputs = model(img)  # Forward pass, outputs [P0, P1, P2]
            loss = model.get_loss(outputs, gt_boxes, gt_labels, gtscore=gt_scores,
                                  anchors=ANCHORS,
                                  anchor_masks=ANCHOR_MASKS,
                                  ignore_thresh=IGNORE_THRESH,
                                  use_label_smooth=False)  # Compute the loss function

            loss.backward()  # Backward pass to compute the gradients
            opt.step()  # Update the parameters
            opt.clear_grad()
            if i % 10 == 0:
                timestring = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
                print('{}[TRAIN]epoch {}, iter {}, output loss: {}'.format(timestring, epoch, i, loss.numpy()))

        # Save the model parameters
        if (epoch % 5 == 0) or (epoch == MAX_EPOCH - 1):
            paddle.save(model.state_dict(), 'yolo_epoch{}'.format(epoch))

        # Evaluate on the validation set after each epoch
        model.eval()
        for i, data in enumerate(valid_loader()):
            img, gt_boxes, gt_labels, img_scale = data
            gt_scores = np.ones(gt_labels.shape).astype('float32')
            gt_scores = paddle.to_tensor(gt_scores)
            img = paddle.to_tensor(img)
            gt_boxes = paddle.to_tensor(gt_boxes)
            gt_labels = paddle.to_tensor(gt_labels)
            outputs = model(img)
            loss = model.get_loss(outputs, gt_boxes, gt_labels, gtscore=gt_scores,
                                  anchors=ANCHORS,
                                  anchor_masks=ANCHOR_MASKS,
                                  ignore_thresh=IGNORE_THRESH,
                                  use_label_smooth=False)
            if i % 1 == 0:
                timestring = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
                print('{}[VALID]epoch {}, iter {}, output loss: {}'.format(timestring, epoch, i, loss.numpy()))
        model.train()
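The get_lr function in the training code builds a piecewise-constant learning-rate schedule. The pure-Python sketch below mirrors the rule I understand PiecewiseDecay to apply (the value changes once the step count reaches a boundary). The function name piecewise_lr is hypothetical and this is not Paddle's implementation.

```python
def piecewise_lr(step, boundaries=(10000, 20000), base_lr=0.0001, lr_decay=0.1):
    """Hypothetical sketch of a piecewise-constant learning-rate schedule.

    Returns base_lr before the first boundary and multiplies it by lr_decay
    each time a further boundary is reached.
    """
    # One value per interval: [base_lr, base_lr * decay, base_lr * decay^2, ...]
    values = [base_lr * lr_decay ** k for k in range(len(boundaries) + 1)]
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]


if __name__ == "__main__":
    for step in (0, 9999, 10000, 19999, 20000):
        print(step, piecewise_lr(step))
```

As far as I know, a Paddle learning-rate scheduler only advances when its step() method is called, so it is worth checking whether the training loop makes that call if the decay is expected to take effect.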
Prediction
The prediction process is shown in Figure 21:
Figure 21: Prediction process
The prediction process can be divided into two steps:
 The positions of the prediction boxes and their category scores are calculated from the network output.
 Non-maximum suppression is used to eliminate prediction boxes that overlap heavily.
For step 1, we have already discussed how to compute pred_objectness_probability, pred_boxes and pred_classification_probability from the network output. It is recommended to use paddle.vision.ops.yolo_box directly. Its key parameters have the following meanings:
paddle.vision.ops.yolo_box(x, img_size, anchors, class_num, conf_thresh, downsample_ratio, clip_bbox=True, name=None, scale_x_y=1.0)
 x: the feature map output by the network, such as P0, P1 or P2 mentioned above.
 img_size: the size of the input picture.
 anchors: the sizes of the anchor boxes used, such as [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]. The get_pred function below passes only the three anchors that belong to the current level.
 class_num: the number of object categories.
 conf_thresh: the confidence threshold. For prediction boxes whose score is below the threshold, the position values are set to 0.0 directly without calculation.
 downsample_ratio: the downsampling ratio of the feature map, for example 32 for P0, 16 for P1 and 8 for P2.
 name=None: a name such as 'yolo_box'. It usually does not need to be set, and the default value is None.
The return value includes two items, boxes and scores, where boxes holds the coordinates of all prediction boxes and scores holds the scores of all prediction boxes for each category.
The score of a prediction box is defined as the category probability multiplied by the objectness probability that the prediction box contains a target object, i.e.
$$score = P_{obj} \cdot P_{classification}$$
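The score formula can be illustrated with a small NumPy sketch. It assumes that the objectness and category probabilities are obtained by applying a sigmoid to the raw network outputs, and that boxes whose objectness is below conf_thresh get a score of 0. The function box_scores is hypothetical and only approximates what yolo_box computes internally.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping raw outputs to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def box_scores(obj_logits, cls_logits, conf_thresh=0.01):
    """Hypothetical sketch: score = P_obj * P_classification for every box.

    obj_logits: array of shape [N], raw objectness outputs
    cls_logits: array of shape [N, C], raw category outputs
    returns:    array of shape [N, C], one score per box and category
    """
    p_obj = sigmoid(np.asarray(obj_logits, dtype="float64"))
    p_cls = sigmoid(np.asarray(cls_logits, dtype="float64"))
    scores = p_obj[:, None] * p_cls
    # Assumption: boxes with low objectness are zeroed out instead of scored.
    scores[p_obj < conf_thresh] = 0.0
    return scores


if __name__ == "__main__":
    # Two boxes, two categories; the second box has very low objectness.
    print(box_scores([0.0, -10.0], [[0.0, 0.0], [0.0, 0.0]]))
```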
Add a method get_pred to the YOLOv3 class defined above. It calls paddle.vision.ops.yolo_box to obtain the prediction boxes and scores corresponding to the feature maps P0, P1 and P2, and concatenates them to obtain all prediction boxes and their scores for each category.
# Define the YOLOv3 model
class YOLOv3(paddle.nn.Layer):
    def __init__(self, num_classes=7):
        super(YOLOv3, self).__init__()

        self.num_classes = num_classes
        # Backbone network for extracting image features
        self.block = DarkNet53_conv_body()
        self.block_outputs = []
        self.yolo_blocks = []
        self.route_blocks_2 = []
        # Generate the feature maps P0, P1 and P2 of the three levels
        for i in range(3):
            # Add the module that generates ri and ti from ci
            yolo_block = self.add_sublayer(
                "yolo_detecton_block_%d" % (i),
                YoloDetectionBlock(
                    ch_in=512//(2**i)*2 if i==0 else 512//(2**i)*2 + 512//(2**i),
                    ch_out=512//(2**i)))
            self.yolo_blocks.append(yolo_block)

            num_filters = 3 * (self.num_classes + 5)

            # Add the module that generates pi from ti. This is a Conv2D operation
            # whose number of output channels is 3 * (num_classes + 5)
            block_out = self.add_sublayer(
                "block_out_%d" % (i),
                paddle.nn.Conv2D(
                    in_channels=512//(2**i)*2,
                    out_channels=num_filters,
                    kernel_size=1,
                    stride=1,
                    padding=0,
                    weight_attr=paddle.ParamAttr(
                        initializer=paddle.nn.initializer.Normal(0., 0.02)),
                    bias_attr=paddle.ParamAttr(
                        initializer=paddle.nn.initializer.Constant(0.0),
                        regularizer=paddle.regularizer.L2Decay(0.))))
            self.block_outputs.append(block_out)
            if i < 2:
                # Convolve ri
                route = self.add_sublayer(
                    "route2_%d" % i,
                    ConvBNLayer(ch_in=512//(2**i), ch_out=256//(2**i),
                                kernel_size=1, stride=1, padding=0))
                self.route_blocks_2.append(route)
            # Upsample ri so that it has the same size as c_{i+1}
            self.upsample = Upsample()

    def forward(self, inputs):
        outputs = []
        blocks = self.block(inputs)
        for i, block in enumerate(blocks):
            if i > 0:
                # r_{i-1}, after convolution and upsampling, is concatenated with ci of this level
                block = paddle.concat([route, block], axis=1)
            # Generate ti and ri from ci
            route, tip = self.yolo_blocks[i](block)
            # Generate pi from ti
            block_out = self.block_outputs[i](tip)
            # Put pi into the list
            outputs.append(block_out)

            if i < 2:
                # Convolve ri to adjust the number of channels
                route = self.route_blocks_2[i](route)
                # Upsample ri so that its size matches c_{i+1}
                route = self.upsample(route)

        return outputs

    def get_loss(self, outputs, gtbox, gtlabel, gtscore=None,
                 anchors=[10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326],
                 anchor_masks=[[6, 7, 8], [3, 4, 5], [0, 1, 2]],
                 ignore_thresh=0.7,
                 use_label_smooth=False):
        """
        Use paddle.vision.ops.yolo_loss to compute the loss function directly, which is simpler and faster
        """
        self.losses = []
        downsample = 32
        for i, out in enumerate(outputs):  # Compute the loss for each of the three levels
            anchor_mask_i = anchor_masks[i]
            loss = paddle.vision.ops.yolo_loss(
                x=out,  # out is one of P0, P1 and P2
                gt_box=gtbox,  # Ground-truth box coordinates
                gt_label=gtlabel,  # Ground-truth box categories
                gt_score=gtscore,  # Ground-truth box scores, needed for the mixup training trick; set to 1 when mixup is not used, with the same shape as gtlabel
                anchors=anchors,  # Anchor box sizes for all 9 anchors: [w0, h0, w1, h1, ..., w8, h8]
                anchor_mask=anchor_mask_i,  # Mask selecting the anchors of this level, e.g. anchor_mask_i=[3, 4, 5] selects anchors 3, 4 and 5
                class_num=self.num_classes,  # Number of categories
                ignore_thresh=ignore_thresh,  # When the IoU of a prediction box and a ground-truth box exceeds ignore_thresh, objectness is labeled -1
                downsample_ratio=downsample,  # 32 for P0, 16 for P1 and 8 for P2
                use_label_smooth=False)  # Label smoothing is a training trick that is not used here, so it is set to False
            self.losses.append(paddle.mean(loss))  # paddle.mean averages the loss over the pictures in the batch
            downsample = downsample // 2  # The downsampling ratio of the next level is halved
        return sum(self.losses)  # Sum over the levels

    def get_pred(self,
                 outputs,
                 im_shape=None,
                 anchors=[10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326],
                 anchor_masks=[[6, 7, 8], [3, 4, 5], [0, 1, 2]],
                 valid_thresh=0.01):
        downsample = 32
        total_boxes = []
        total_scores = []
        for i, out in enumerate(outputs):
            # Collect the [w, h] pairs of the anchors that belong to this level
            anchor_mask = anchor_masks[i]
            anchors_this_level = []
            for m in anchor_mask:
                anchors_this_level.append(anchors[2 * m])
                anchors_this_level.append(anchors[2 * m + 1])

            boxes, scores = paddle.vision.ops.yolo_box(
                x=out,
                img_size=im_shape,
                anchors=anchors_this_level,
                class_num=self.num_classes,
                conf_thresh=valid_thresh,
                downsample_ratio=downsample,
                name="yolo_box" + str(i))
            total_boxes.append(boxes)
            total_scores.append(
                paddle.transpose(
                    scores, perm=[0, 2, 1]))
            downsample = downsample // 2

        # Concatenate the boxes and scores of the three levels
        yolo_boxes = paddle.concat(total_boxes, axis=1)
        yolo_scores = paddle.concat(total_scores, axis=2)
        return yolo_boxes, yolo_scores
Step 1 generates multiple prediction boxes for each small square area, and many of these boxes overlap heavily, so redundant detection boxes with large overlap need to be eliminated.
The prediction boxes in the following example code were output by the model for one picture. A total of 11 prediction boxes are selected and drawn on the picture, as shown below. There are multiple prediction boxes around each person, and the redundant ones must be eliminated to get the final prediction result.
# Draw a picture to show the bounding boxes of the target objects
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.image import imread
import math

# Define the function for drawing a rectangular box
def draw_rectangle(currentAxis, bbox, edgecolor='k', facecolor='y', fill=False, linestyle='-'):
    # currentAxis: the coordinate axes, obtained through plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border line color
    # facecolor: fill color
    # fill: whether to fill the box
    # linestyle: border line style
    # patches.Rectangle takes the coordinates of the upper-left corner and the width and height of the rectangle
    rect = patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                             edgecolor=edgecolor, facecolor=facecolor, fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)

plt.figure(figsize=(10, 10))

filename = '/home/aistudio/work/images/section3/000000086956.jpg'
im = imread(filename)
plt.imshow(im)

currentAxis = plt.gca()

# Prediction box positions
boxes = np.array([[4.21716537e+01, 1.28230896e+02, 2.26547668e+02, 6.00434631e+02],
       [3.18562988e+02, 1.23168472e+02, 4.79000000e+02, 6.05688416e+02],
       [2.62704697e+01, 1.39430557e+02, 2.20587097e+02, 6.38959656e+02],
       [4.24965363e+01, 1.42706665e+02, 2.25955185e+02, 6.35671204e+02],
       [2.37462646e+02, 1.35731537e+02, 4.79000000e+02, 6.31451294e+02],
       [3.19390472e+02, 1.29295090e+02, 4.79000000e+02, 6.33003845e+02],
       [3.28933838e+02, 1.22736115e+02, 4.79000000e+02, 6.39000000e+02],
       [4.44292603e+01, 1.70438187e+02, 2.26841858e+02, 6.39000000e+02],
       [2.17988785e+02, 3.02472412e+02, 4.06062927e+02, 6.29106628e+02],
       [2.00241089e+02, 3.23755096e+02, 3.96929321e+02, 6.36386108e+02],
       [2.14310303e+02, 3.23443665e+02, 4.06732849e+02, 6.35775269e+02]])

# Prediction box scores
scores = np.array([0.5247661 , 0.51759845, 0.86075854, 0.9910175 , 0.39170712,
       0.9297706 , 0.5115228 , 0.270992  , 0.19087596, 0.64201415, 0.879036])

# Draw all prediction boxes
for box in boxes:
    draw_rectangle(currentAxis, box)
Here, non-maximum suppression (NMS) is used to eliminate redundant boxes. The basic idea is that if several prediction boxes correspond to the same object, only the one with the highest score is kept and the others are discarded.
How do we decide that two prediction boxes correspond to the same object, and what criterion should be used?
If two prediction boxes have the same category and their positions overlap heavily, they can be considered to be predicting the same target. Non-maximum suppression selects the prediction box with the highest score in a category, finds the prediction boxes whose IoU with it is greater than the threshold, and discards them. The IoU threshold is a hyperparameter that must be set in advance. It is set to 0.5 in the YOLOv3 model.
For example, in the program above, boxes contains 11 prediction boxes, and scores gives their scores for the category "person".
 Step 0: create the list of selected boxes, keep_list = []
 Step 1: sort the scores in descending order, remain_list = [3, 5, 10, 2, 9, 0, 1, 6, 4, 7, 8]
 Step 2: select boxes[3]. keep_list is empty, so there is no need to calculate IoU, and the box is put directly into keep_list: keep_list = [3], remain_list = [5, 10, 2, 9, 0, 1, 6, 4, 7, 8]
 Step 3: select boxes[5]. keep_list already contains boxes[3]. IoU(boxes[3], boxes[5]) = 0.0, which is clearly below the threshold, so keep_list = [3, 5], remain_list = [10, 2, 9, 0, 1, 6, 4, 7, 8]
 Step 4: select boxes[10]. keep_list = [3, 5]. IoU(boxes[3], boxes[10]) = 0.0268 and IoU(boxes[5], boxes[10]) = 0.24, both below the threshold, so keep_list = [3, 5, 10], remain_list = [2, 9, 0, 1, 6, 4, 7, 8]
 Step 5: select boxes[2]. keep_list = [3, 5, 10]. IoU(boxes[3], boxes[2]) = 0.88 exceeds the threshold, so boxes[2] is discarded directly: keep_list = [3, 5, 10], remain_list = [9, 0, 1, 6, 4, 7, 8]
 Step 6: select boxes[9]. keep_list = [3, 5, 10]. IoU(boxes[3], boxes[9]) = 0.0577, IoU(boxes[5], boxes[9]) = 0.205 and IoU(boxes[10], boxes[9]) = 0.88. The last value exceeds the threshold, so boxes[9] is discarded: keep_list = [3, 5, 10], remain_list = [0, 1, 6, 4, 7, 8]
 Step 7: repeat step 6 until remain_list is empty.
The final result is keep_list = [3, 5, 10], that is, prediction boxes 3, 5 and 10 are selected, as shown in the figure below.
# Draw a picture to show the bounding boxes of the target objects
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.image import imread
import math

# Define the function for drawing a rectangular box
def draw_rectangle(currentAxis, bbox, edgecolor='k', facecolor='y', fill=False, linestyle='-'):
    # currentAxis: the coordinate axes, obtained through plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border line color
    # facecolor: fill color
    # fill: whether to fill the box
    # linestyle: border line style
    # patches.Rectangle takes the coordinates of the upper-left corner and the width and height of the rectangle
    rect = patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                             edgecolor=edgecolor, facecolor=facecolor, fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)

plt.figure(figsize=(10, 10))

filename = '/home/aistudio/work/images/section3/000000086956.jpg'
im = imread(filename)
plt.imshow(im)

currentAxis = plt.gca()

boxes = np.array([[4.21716537e+01, 1.28230896e+02, 2.26547668e+02, 6.00434631e+02],
       [3.18562988e+02, 1.23168472e+02, 4.79000000e+02, 6.05688416e+02],
       [2.62704697e+01, 1.39430557e+02, 2.20587097e+02, 6.38959656e+02],
       [4.24965363e+01, 1.42706665e+02, 2.25955185e+02, 6.35671204e+02],
       [2.37462646e+02, 1.35731537e+02, 4.79000000e+02, 6.31451294e+02],
       [3.19390472e+02, 1.29295090e+02, 4.79000000e+02, 6.33003845e+02],
       [3.28933838e+02, 1.22736115e+02, 4.79000000e+02, 6.39000000e+02],
       [4.44292603e+01, 1.70438187e+02, 2.26841858e+02, 6.39000000e+02],
       [2.17988785e+02, 3.02472412e+02, 4.06062927e+02, 6.29106628e+02],
       [2.00241089e+02, 3.23755096e+02, 3.96929321e+02, 6.36386108e+02],
       [2.14310303e+02, 3.23443665e+02, 4.06732849e+02, 6.35775269e+02]])

scores = np.array([0.5247661 , 0.51759845, 0.86075854, 0.9910175 , 0.39170712,
       0.9297706 , 0.5115228 , 0.270992  , 0.19087596, 0.64201415, 0.879036])

# Boxes whose left edge lies between x = 20 and x = 60
left_ind = np.where((boxes[:, 0]<60) * (boxes[:, 0]>20))
left_boxes = boxes[left_ind]
left_scores = scores[left_ind]

colors = ['r', 'g', 'b', 'k']

# Draw the prediction boxes that are finally kept
inds = [3, 5, 10]
for i in range(3):
    box = boxes[inds[i]]
    draw_rectangle(currentAxis, box, edgecolor=colors[i])
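The step-by-step walk-through above can be reproduced with a short single-class sketch. The box coordinates are the 11 example boxes rounded to two decimals. The helper names iou_xyxy and greedy_nms are chosen here for illustration, and the IoU uses the same "+1" pixel convention as the box_iou_xyxy function in this tutorial.

```python
import numpy as np

# The 11 example boxes [x1, y1, x2, y2], rounded to two decimals.
BOXES = np.array([
    [42.17, 128.23, 226.55, 600.43],
    [318.56, 123.17, 479.00, 605.69],
    [26.27, 139.43, 220.59, 638.96],
    [42.50, 142.71, 225.96, 635.67],
    [237.46, 135.73, 479.00, 631.45],
    [319.39, 129.30, 479.00, 633.00],
    [328.93, 122.74, 479.00, 639.00],
    [44.43, 170.44, 226.84, 639.00],
    [217.99, 302.47, 406.06, 629.11],
    [200.24, 323.76, 396.93, 636.39],
    [214.31, 323.44, 406.73, 635.78],
])
SCORES = np.array([0.5247661, 0.51759845, 0.86075854, 0.9910175, 0.39170712,
                   0.9297706, 0.5115228, 0.270992, 0.19087596, 0.64201415, 0.879036])


def iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes, using the +1 pixel convention."""
    area_a = (a[2] - a[0] + 1.0) * (a[3] - a[1] + 1.0)
    area_b = (b[2] - b[0] + 1.0) * (b[3] - b[1] + 1.0)
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]) + 1.0, 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]) + 1.0, 0.0)
    inter = inter_w * inter_h
    return float(inter / (area_a + area_b - inter))


def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Keep a box only if its IoU with every already kept box is <= iou_thresh."""
    remain_list = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep_list = []
    for idx in remain_list:
        if all(iou_xyxy(boxes[idx], boxes[k]) <= iou_thresh for k in keep_list):
            keep_list.append(int(idx))
    return keep_list


if __name__ == "__main__":
    print(greedy_nms(BOXES, SCORES, iou_thresh=0.5))
```

With the 0.5 threshold from the walk-through, the sketch keeps boxes 3, 5 and 10.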
The implementation of non-maximum suppression is shown in the nms function below. Note that the data set contains objects of several categories, so multi-class non-maximum suppression is required here. It works on the same principle, except that non-maximum suppression is performed separately for each category. The implementation is shown in the multiclass_nms function below.
# Non-maximum suppression
def nms(bboxes, scores, score_thresh, nms_thresh, pre_nms_topk, i=0, c=0):
    """
    nms
    """
    # Sort the box indices by descending score
    inds = np.argsort(scores)
    inds = inds[::-1]
    keep_inds = []
    while(len(inds) > 0):
        cur_ind = inds[0]
        cur_score = scores[cur_ind]
        # If the score of the box is less than score_thresh, just drop it
        if cur_score < score_thresh:
            break

        keep = True
        for ind in keep_inds:
            current_box = bboxes[cur_ind]
            remain_box = bboxes[ind]
            iou = box_iou_xyxy(current_box, remain_box)
            if iou > nms_thresh:
                keep = False
                break
        if keep:
            keep_inds.append(cur_ind)
        inds = inds[1:]

    return np.array(keep_inds)

# Multi-class non-maximum suppression
def multiclass_nms(bboxes, scores, score_thresh=0.01, nms_thresh=0.45, pre_nms_topk=1000, pos_nms_topk=100):
    """
    This is for multiclass_nms
    """
    batch_size = bboxes.shape[0]
    class_num = scores.shape[1]
    rets = []
    for i in range(batch_size):
        bboxes_i = bboxes[i]
        scores_i = scores[i]
        ret = []
        # Run NMS separately for each category
        for c in range(class_num):
            scores_i_c = scores_i[c]
            keep_inds = nms(bboxes_i, scores_i_c, score_thresh, nms_thresh, pre_nms_topk, i=i, c=c)
            if len(keep_inds) < 1:
                continue
            keep_bboxes = bboxes_i[keep_inds]
            keep_scores = scores_i_c[keep_inds]
            # Each row is [label, score, x1, y1, x2, y2]
            keep_results = np.zeros([keep_scores.shape[0], 6])
            keep_results[:, 0] = c
            keep_results[:, 1] = keep_scores[:]
            keep_results[:, 2:6] = keep_bboxes[:, :]
            ret.append(keep_results)
        if len(ret) < 1:
            rets.append(ret)
            continue
        ret_i = np.concatenate(ret, axis=0)
        scores_i = ret_i[:, 1]
        # Keep at most pos_nms_topk boxes with the highest scores
        if len(scores_i) > pos_nms_topk:
            inds = np.argsort(scores_i)[::-1]
            inds = inds[:pos_nms_topk]
            ret_i = ret_i[inds]

        rets.append(ret_i)

    return rets
The following is the complete test program, preceded by the IoU helper it relies on. The output results on the test data set are saved in the pred_results.json file.
# Compute IoU for rectangular boxes in xyxy format. This function is saved in the box_utils.py file
def box_iou_xyxy(box1, box2):
    # Get the coordinates of the upper-left and lower-right corners of box1
    x1min, y1min, x1max, y1max = box1[0], box1[1], box1[2], box1[3]
    # Calculate the area of box1
    s1 = (y1max - y1min + 1.) * (x1max - x1min + 1.)
    # Get the coordinates of the upper-left and lower-right corners of box2
    x2min, y2min, x2max, y2max = box2[0], box2[1], box2[2], box2[3]
    # Calculate the area of box2
    s2 = (y2max - y2min + 1.) * (x2max - x2min + 1.)

    # Calculate the coordinates of the intersection rectangle
    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    # Calculate the height, width and area of the intersection rectangle
    inter_h = np.maximum(ymax - ymin + 1., 0.)
    inter_w = np.maximum(xmax - xmin + 1., 0.)
    intersection = inter_h * inter_w
    # Calculate the area of the union
    union = s1 + s2 - intersection
    # Calculate the intersection over union
    iou = intersection / union
    return iou
import json
import os
import paddle

ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]

ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]

VALID_THRESH = 0.01

NMS_TOPK = 400
NMS_POSK = 100
NMS_THRESH = 0.45

NUM_CLASSES = 7
if __name__ == '__main__':
    TRAINDIR = '/home/aistudio/work/insects/train/images'
    TESTDIR = '/home/aistudio/work/insects/test/images'
    VALIDDIR = '/home/aistudio/work/insects/val'

    model = YOLOv3(num_classes=NUM_CLASSES)
    params_file_path = '/home/aistudio/yolo_epoch50.pdparams'
    model_state_dict = paddle.load(params_file_path)
    model.load_dict(model_state_dict)
    model.eval()

    total_results = []
    test_loader = test_data_loader(TESTDIR, batch_size=1, mode='test')
    for i, data in enumerate(test_loader()):
        img_name, img_data, img_scale_data = data
        img = paddle.to_tensor(img_data)
        img_scale = paddle.to_tensor(img_scale_data)

        outputs = model.forward(img)
        bboxes, scores = model.get_pred(outputs,
                                        im_shape=img_scale,
                                        anchors=ANCHORS,
                                        anchor_masks=ANCHOR_MASKS,
                                        valid_thresh=VALID_THRESH)

        bboxes_data = bboxes.numpy()
        scores_data = scores.numpy()
        result = multiclass_nms(bboxes_data, scores_data,
                                score_thresh=VALID_THRESH,
                                nms_thresh=NMS_THRESH,
                                pre_nms_topk=NMS_TOPK,
                                pos_nms_topk=NMS_POSK)
        for j in range(len(result)):
            result_j = result[j]
            img_name_j = img_name[j]
            total_results.append([img_name_j, result_j.tolist()])
        print('processed {} pictures'.format(len(total_results)))

    print('')
    # Save all prediction results to a JSON file
    with open('pred_results.json', 'w') as f:
        json.dump(total_results, f)
The test results are saved in a JSON file, which is a list containing the prediction results of all pictures. Its structure is as follows:
[[img_name, [[label, score, x1, y1, x2, y2], ..., [label, score, x1, y1, x2, y2]]], [img_name, [[label, score, x1, y1, x2, y2], ..., [label, score, x1, y1, x2, y2]]], ... [img_name, [[label, score, x1, y1, x2, y2],..., [label, score, x1, y1, x2, y2]]]]
Each element in the list is the prediction result of a picture. The total length of the list is equal to the number of pictures. The format of the prediction result of each picture is:
[img_name, [[label, score, x1, y1, x2, y2],..., [label, score, x1, y1, x2, y2]]]
The first element is the picture name img_name, and the second element is a list containing all prediction boxes of the picture. The prediction box list has the form:
[[label, score, x1, y1, x2, y2],..., [label, score, x1, y1, x2, y2]]
Each element [label, score, x1, y1, x2, y2] in the prediction box list describes one prediction box. label is the category label of the prediction box and score is its score. x1, y1, x2 and y2 are the coordinates of the upper-left corner (x1, y1) and the lower-right corner (x2, y2) of the prediction box. Each picture may have many prediction boxes, and they are all put in the prediction box list.
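To make the result format concrete, the sketch below builds a tiny result list in the layout described above, round-trips it through JSON and keeps only the boxes above a score threshold. The picture names, box values and the helper name filter_results are made up for illustration. They are not real model output.

```python
import json


def filter_results(total_results, score_thresh=0.5):
    """Hypothetical helper: keep boxes whose score exceeds score_thresh.

    total_results has the layout [[img_name, [[label, score, x1, y1, x2, y2], ...]], ...].
    Returns a dict mapping img_name to its remaining boxes.
    """
    filtered = {}
    for img_name, boxes in total_results:
        filtered[img_name] = [box for box in boxes if box[1] > score_thresh]
    return filtered


if __name__ == "__main__":
    # Made-up example data in the same layout as pred_results.json.
    sample = [
        ["img_a", [[0.0, 0.93, 10.0, 20.0, 110.0, 220.0],
                   [3.0, 0.12, 5.0, 5.0, 50.0, 60.0]]],
        ["img_b", []],
    ]
    text = json.dumps(sample)      # what json.dump would write to the file
    loaded = json.loads(text)      # what json.load would read back
    for name, boxes in filter_results(loaded).items():
        for label, score, x1, y1, x2, y2 in boxes:
            print(name, int(label), score, (x1, y1), (x2, y2))
```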
Model results and visualization
The program above shows how to read the pictures of the test data set and save the final results in a JSON file. To show the model results more intuitively, the following program reads a single picture and draws its prediction boxes.
 Create a data reader that reads data from a single picture
# Read a single test picture
import os
import cv2
import numpy as np

def single_image_data_loader(filename, test_image_size=608, mode='test'):
    """
    Load a test picture. Test data has no ground-truth labels
    """
    batch_size = 1
    def reader():
        batch_data = []
        img_size = test_image_size
        file_path = os.path.join(filename)
        img = cv2.imread(file_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        H = img.shape[0]
        W = img.shape[1]
        img = cv2.resize(img, (img_size, img_size))

        # Normalize each channel with the given mean and standard deviation
        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]
        mean = np.array(mean).reshape((1, 1, -1))
        std = np.array(std).reshape((1, 1, -1))
        out_img = (img / 255.0 - mean) / std
        # Convert from HWC to CHW layout
        out_img = out_img.astype('float32').transpose((2, 0, 1))
        img = out_img
        im_shape = [H, W]

        batch_data.append((filename.split('.')[0], img, im_shape))
        if len(batch_data) == batch_size:
            yield make_test_array(batch_data)
            batch_data = []

    return reader
 Define the function that draws the prediction boxes. The code is as follows.
# Define the drawing functions
INSECT_NAMES = ['Boerner', 'Leconte', 'Linnaeus',
                'acuminatus', 'armandi', 'coleoptera', 'linnaeus']

# Define the function for drawing a rectangular box
def draw_rectangle(currentAxis, bbox, edgecolor='k', facecolor='y', fill=False, linestyle='-'):
    # currentAxis: the coordinate axes, obtained through plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border line color
    # facecolor: fill color
    # fill: whether to fill the box
    # linestyle: border line style
    # patches.Rectangle takes the coordinates of the upper-left corner and the width and height of the rectangle
    rect = patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                             edgecolor=edgecolor, facecolor=facecolor, fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)

# Define the function that draws the prediction results
def draw_results(result, filename, draw_thresh=0.5):
    plt.figure(figsize=(10, 10))
    im = imread(filename)
    plt.imshow(im)
    currentAxis = plt.gca()
    colors = ['r', 'g', 'b', 'k', 'y', 'c', 'purple']
    for item in result:
        # Each item is [label, score, x1, y1, x2, y2]
        box = item[2:6]
        label = int(item[0])
        name = INSECT_NAMES[label]
        # Only draw boxes whose score exceeds draw_thresh
        if item[1] > draw_thresh:
            draw_rectangle(currentAxis, box, edgecolor=colors[label])
            plt.text(box[0], box[1], name, fontsize=12, color=colors[label])
 Use the single_image_data_loader function defined above to read the specified picture, feed it to the network to compute the prediction boxes and scores, and then apply multi-class non-maximum suppression to eliminate redundant boxes. Finally, draw and display the results.
import json

import paddle

ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]
ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
VALID_THRESH = 0.01
NMS_TOPK = 400
NMS_POSK = 100
NMS_THRESH = 0.45

NUM_CLASSES = 7
if __name__ == '__main__':
    image_name = '/home/aistudio/work/insects/test/images/2599.jpeg'
    params_file_path = '/home/aistudio/yolo_epoch50.pdparams'

    model = YOLOv3(num_classes=NUM_CLASSES)
    model_state_dict = paddle.load(params_file_path)
    model.load_dict(model_state_dict)
    model.eval()

    total_results = []
    test_loader = single_image_data_loader(image_name, mode='test')
    for i, data in enumerate(test_loader()):
        img_name, img_data, img_scale_data = data
        img = paddle.to_tensor(img_data)
        img_scale = paddle.to_tensor(img_scale_data)

        outputs = model.forward(img)
        bboxes, scores = model.get_pred(outputs,
                                        im_shape=img_scale,
                                        anchors=ANCHORS,
                                        anchor_masks=ANCHOR_MASKS,
                                        valid_thresh=VALID_THRESH)

        bboxes_data = bboxes.numpy()
        scores_data = scores.numpy()
        results = multiclass_nms(bboxes_data, scores_data,
                                 score_thresh=VALID_THRESH,
                                 nms_thresh=NMS_THRESH,
                                 pre_nms_topk=NMS_TOPK,
                                 pos_nms_topk=NMS_POSK)

    result = results[0]
    draw_results(result, image_name, draw_thresh=0.5)
The program above shows how to use the trained weights to run prediction on a picture and visualize the results. On the final output image, the detected insects are marked with their bounding boxes and categories.
Summary
This chapter systematically introduced the network structures and development of computer vision, and used the two tasks of image classification and object detection as examples to show the implementation of the ResNet and YOLOv3 algorithms. Readers are expected not only to master the method of building computer vision models, but also to gain a deeper understanding of how visual features are extracted.