# Introduction to the forest pest dataset and data preprocessing methods

In this course, we use the insect dataset from the forestry pest control project jointly developed by Baidu and Forestry University.

## Reading the annotation information of the AI insect recognition dataset

The structure of the AI insect recognition dataset is as follows:

• 2183 pictures are provided: 1693 in the training set, 245 in the validation set and 245 in the test set.
• It contains 7 species of insects, namely Boerner, Leconte, Linnaeus, acuminatus, armandi, coleoptera and linnaeus.
• It contains pictures and labels. Please decompress the data and store it in the insects directory.
!conda list

# packages in environment at C:\ProgramData\Anaconda3\envs\paddle:
#
# Name                    Version                   Build  Channel
astor                     0.8.1                    pypi_0    pypi
backcall                  0.2.0              pyhd3eb1b0_0    defaults
certifi                   2021.5.30        py36haa95532_0    defaults
charset-normalizer        2.0.7                    pypi_0    pypi
colorama                  0.4.4              pyhd3eb1b0_0    defaults
decorator                 5.1.0              pyhd3eb1b0_0    defaults
entrypoints               0.3                      py36_0    defaults
gast                      0.3.3                    pypi_0    pypi
idna                      3.3                      pypi_0    pypi
ipykernel                 5.3.4            py36h5ca1d4c_0    defaults
ipython                   7.16.1           py36h5ca1d4c_0    defaults
ipython_genutils          0.2.0              pyhd3eb1b0_1    defaults
jedi                      0.17.0                   py36_0    defaults
jupyter_client            7.0.1              pyhd3eb1b0_0    defaults
jupyter_core              4.8.1            py36haa95532_0    defaults
nest-asyncio              1.5.1              pyhd3eb1b0_0    defaults
numpy                     1.19.3                   pypi_0    pypi
parso                     0.8.2              pyhd3eb1b0_0    defaults
pickleshare               0.7.5           pyhd3eb1b0_1003    defaults
pillow                    8.4.0                    pypi_0    pypi
pip                       21.2.2           py36haa95532_0    defaults
prompt-toolkit            3.0.20             pyhd3eb1b0_0    defaults
protobuf                  3.19.1                   pypi_0    pypi
pygments                  2.10.0             pyhd3eb1b0_0    defaults
python                    3.6.13               h3758d61_0    defaults
python-dateutil           2.8.2              pyhd3eb1b0_0    defaults
pywin32                   228              py36hbaba5e8_1    defaults
pyzmq                     22.2.1           py36hd77b12b_1    defaults
requests                  2.26.0                   pypi_0    pypi
setuptools                58.0.4           py36haa95532_0    defaults
six                       1.16.0             pyhd3eb1b0_0    defaults
sqlite                    3.36.0               h2bbff1b_0    defaults
traitlets                 4.3.3            py36haa95532_0    defaults
urllib3                   1.26.7                   pypi_0    pypi
vc                        14.2                 h21ff451_1    defaults
vs2015_runtime            14.27.29016          h5e58377_2    defaults
wcwidth                   0.2.5              pyhd3eb1b0_0    defaults
wheel                     0.37.0             pyhd3eb1b0_1    defaults
wincertstore              0.2              py36h7fe50ca_0    defaults


After decompressing the data, you can see the structure under the insects directory as follows.

    insects
|---train
|         |---annotations
|         |         |---xmls
|         |                  |---100.xml
|         |                  |---101.xml
|         |                  |---...
|         |
|         |---images
|                   |---100.jpeg
|                   |---101.jpeg
|                   |---...
|
|---val
|        |---annotations
|        |         |---xmls
|        |                  |---1221.xml
|        |                  |---1277.xml
|        |                  |---...
|        |
|        |---images
|                  |---1221.jpeg
|                  |---1277.jpeg
|                  |---...
|
|---test
         |---images
                   |---1833.jpeg
                   |---1838.jpeg
                   |---...


The insects directory contains three folders: train, val and test. The labels of the training pictures are stored in the train/annotations/xmls directory. Each xml file describes one picture, including the size of the picture, the names of the insects it contains, their locations in the picture and other information.

<annotation>
<folder>Liu Feifei</folder>
<filename>100.jpeg</filename>
<path>/home/fion/desktop/Liu Feifei/100.jpeg</path>
<source>
<database>Unknown</database>
</source>
<size>
<width>1336</width>
<height>1336</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>Boerner</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>500</xmin>
<ymin>893</ymin>
<xmax>656</xmax>
<ymax>966</ymax>
</bndbox>
</object>
<object>
<name>Leconte</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>622</xmin>
<ymin>490</ymin>
<xmax>756</xmax>
<ymax>610</ymax>
</bndbox>
</object>
<object>
<name>armandi</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>432</xmin>
<ymin>663</ymin>
<xmax>517</xmax>
<ymax>729</ymax>
</bndbox>
</object>
<object>
<name>coleoptera</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>624</xmin>
<ymin>685</ymin>
<xmax>697</xmax>
<ymax>771</ymax>
</bndbox>
</object>
<object>
<name>linnaeus</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>783</xmin>
<ymin>700</ymin>
<xmax>856</xmax>
<ymax>802</ymax>
</bndbox>
</object>
</annotation>


The main fields in the xml file listed above are described as follows:

• size: picture size.

• object: an object contained in the picture. A picture may contain multiple objects.

– name: insect name;

– bndbox: real box (ground-truth bounding box) of the object;

– difficult: whether the object is difficult to recognize.
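As a minimal sketch of how these fields can be read, the snippet below parses a shortened annotation with Python's standard xml.etree.ElementTree module. The XML string is a trimmed copy of the example above with a single object, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample annotation, keeping a single object.
SAMPLE_XML = """
<annotation>
  <filename>100.jpeg</filename>
  <size><width>1336</width><height>1336</height><depth>3</depth></size>
  <object>
    <name>Boerner</name>
    <difficult>0</difficult>
    <bndbox><xmin>500</xmin><ymin>893</ymin><xmax>656</xmax><ymax>966</ymax></bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(SAMPLE_XML)

# size: width and height of the picture
width = int(root.find('size').find('width').text)
height = int(root.find('size').find('height').text)

# object: one entry per annotated insect
objects = []
for obj in root.findall('object'):
    box = obj.find('bndbox')
    objects.append({
        'name': obj.find('name').text,
        'difficult': int(obj.find('difficult').text),
        'bndbox': [int(box.find(tag).text)
                   for tag in ('xmin', 'ymin', 'xmax', 'ymax')],
    })

print(width, height)
print(objects)
```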

Next, we read the xml files from the dataset to obtain the annotation information of each picture. Before reading the annotation files, we first convert the insect category names (strings) into numeric categories, because a neural network computes on numerical inputs. The list of insect category names is: ['Boerner', 'Leconte', 'Linnaeus', 'acuminatus', 'armandi', 'coleoptera', 'linnaeus']. We agree that in this list 'Boerner' corresponds to category 0, 'Leconte' corresponds to category 1, ..., and 'linnaeus' corresponds to category 6. Note that 'Linnaeus' (category 2) and 'linnaeus' (category 6) differ only in capitalization and are distinct categories. The following program builds a dictionary that maps name strings to numeric categories.

INSECT_NAMES = ['Boerner', 'Leconte', 'Linnaeus',
                'acuminatus', 'armandi', 'coleoptera', 'linnaeus']

def get_insect_names():
    """
    return a dict, as following,
        {'Boerner': 0,
         'Leconte': 1,
         'Linnaeus': 2,
         'acuminatus': 3,
         'armandi': 4,
         'coleoptera': 5,
         'linnaeus': 6
        }
    It can map the insect name into an integer label.
    """
    insect_category2id = {}
    for i, item in enumerate(INSECT_NAMES):
        insect_category2id[item] = i

    return insect_category2id

cname2cid = get_insect_names()
cname2cid

{'Boerner': 0,
'Leconte': 1,
'Linnaeus': 2,
'acuminatus': 3,
'armandi': 4,
'coleoptera': 5,
'linnaeus': 6}


Calling the get_insect_names function returns a dict that describes the mapping between insect names and numeric categories. The following program reads the annotation information of all files from the annotations/xmls directory.

import os
import numpy as np
import xml.etree.ElementTree as ET

def get_annotations(cname2cid, datadir):
    # List every annotation file of this data split
    filenames = os.listdir(os.path.join(datadir, 'annotations', 'xmls'))
    records = []
    ct = 0
    for fname in filenames:
        fid = fname.split('.')[0]
        fpath = os.path.join(datadir, 'annotations', 'xmls', fname)
        img_file = os.path.join(datadir, 'images', fid + '.jpeg')
        tree = ET.parse(fpath)

        # Use the id stored in the file if present, otherwise a running counter
        if tree.find('id') is None:
            im_id = np.array([ct])
        else:
            im_id = np.array([int(tree.find('id').text)])

        objs = tree.findall('object')
        im_w = float(tree.find('size').find('width').text)
        im_h = float(tree.find('size').find('height').text)
        gt_bbox = np.zeros((len(objs), 4), dtype=np.float32)
        gt_class = np.zeros((len(objs), ), dtype=np.int32)
        is_crowd = np.zeros((len(objs), ), dtype=np.int32)
        difficult = np.zeros((len(objs), ), dtype=np.int32)
        for i, obj in enumerate(objs):
            cname = obj.find('name').text
            gt_class[i] = cname2cid[cname]
            _difficult = int(obj.find('difficult').text)
            x1 = float(obj.find('bndbox').find('xmin').text)
            y1 = float(obj.find('bndbox').find('ymin').text)
            x2 = float(obj.find('bndbox').find('xmax').text)
            y2 = float(obj.find('bndbox').find('ymax').text)
            # Clip the box to the picture boundaries
            x1 = max(0, x1)
            y1 = max(0, y1)
            x2 = min(im_w - 1, x2)
            y2 = min(im_h - 1, y2)
            # Here, xywh format (center x, center y, width, height) is used
            # to represent the real box of the target object
            gt_bbox[i] = [(x1+x2)/2.0 , (y1+y2)/2.0, x2-x1+1., y2-y1+1.]
            is_crowd[i] = 0
            difficult[i] = _difficult

        voc_rec = {
            'im_file': img_file,
            'im_id': im_id,
            'h': im_h,
            'w': im_w,
            'is_crowd': is_crowd,
            'gt_class': gt_class,
            'gt_bbox': gt_bbox,
            'gt_poly': [],
            'difficult': difficult
            }
        # Keep only pictures that contain at least one object
        if len(objs) != 0:
            records.append(voc_rec)
        ct += 1
    return records

TRAINDIR = './insects/train'
TESTDIR = './insects/test'
VALIDDIR = './insects/val'
cname2cid = get_insect_names()
records = get_annotations(cname2cid, TRAINDIR)
records[0]

{'im_file': './insects/train\\images\\1.jpeg',
'im_id': array([0]),
'h': 1344.0,
'w': 1344.0,
'is_crowd': array([0, 0, 0, 0, 0]),
'gt_class': array([1, 0, 6, 4, 5]),
'gt_bbox': array([[542.5, 652.5, 140. , 150. ],
[885. , 572. , 127. , 135. ],
[648.5, 811.5,  84. ,  62. ],
[798.5, 821. ,  86. ,  71. ],
[667.5, 521. ,  88. ,  67. ]], dtype=float32),
'gt_poly': [],
'difficult': array([0, 0, 0, 0, 0])}

len(records)

1693


Through the above procedure, all the labeled data of all training data sets are read out and stored under the records list. Each element is the labeled data of a picture, including the picture storage address, picture id, picture height and width, and the type and position of the target object contained in the picture.
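The gt_bbox entries in records store each real box as center coordinates plus width and height (xywh) rather than as corner coordinates. The following sketch reproduces that conversion in plain Python for the first object of the sample annotation shown earlier. The helper names corners_to_xywh and xywh_to_corners are illustrative and are not part of the original code.

```python
def corners_to_xywh(x1, y1, x2, y2, im_w, im_h):
    """Convert a corner-format box to (center_x, center_y, width, height).

    The box is first clipped to the picture, and width/height count both
    end pixels (hence the +1), mirroring the reading code above.
    """
    x1 = max(0.0, x1)
    y1 = max(0.0, y1)
    x2 = min(im_w - 1.0, x2)
    y2 = min(im_h - 1.0, y2)
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1 + 1.0, y2 - y1 + 1.0)


def xywh_to_corners(cx, cy, w, h):
    """Inverse conversion, undoing the +1 used for width and height."""
    half_w = (w - 1.0) / 2.0
    half_h = (h - 1.0) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# First object of the sample annotation: xmin=500, ymin=893, xmax=656, ymax=966
box_xywh = corners_to_xywh(500, 893, 656, 966, im_w=1336, im_h=1336)
print(box_xywh)                    # (578.0, 929.5, 157.0, 74.0)
print(xywh_to_corners(*box_xywh))  # (500.0, 893.0, 656.0, 966.0)
```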

Data preprocessing is a very important step in training neural networks. Appropriate preprocessing helps the model converge and helps prevent overfitting. First, we read the data from disk, and then we preprocess it. To keep network training fast, the data preprocessing itself usually also needs to be accelerated.

### Data reading

So far, all the description information of the pictures has been saved in records, and each element contains the description of one picture. The following program shows how to read a picture and its labels according to the description in records.

# Data reading
import cv2

def get_bbox(gt_bbox, gt_class):
    # For general detection tasks, a picture often contains multiple target objects.
    # Set MAX_NUM = 50, that is, a picture keeps at most 50 real boxes. If a picture has
    # fewer than 50 real boxes, the remaining entries of gt_bbox and gt_class are set to 0.
    MAX_NUM = 50
    gt_bbox2 = np.zeros((MAX_NUM, 4))
    gt_class2 = np.zeros((MAX_NUM,))
    for i in range(len(gt_bbox)):
        # Stop before writing past the end of the fixed-size arrays
        if i >= MAX_NUM:
            break
        gt_bbox2[i, :] = gt_bbox[i, :]
        gt_class2[i] = gt_class[i]
    return gt_bbox2, gt_class2

def get_img_data_from_file(record):
    """
    record is a dict as following,
      record = {
            'im_file': img_file,
            'im_id': im_id,
            'h': im_h,
            'w': im_w,
            'is_crowd': is_crowd,
            'gt_class': gt_class,
            'gt_bbox': gt_bbox,
            'gt_poly': [],
            'difficult': difficult
            }
    """
    im_file = record['im_file']
    h = record['h']
    w = record['w']
    is_crowd = record['is_crowd']
    gt_class = record['gt_class']
    gt_bbox = record['gt_bbox']
    difficult = record['difficult']

    # OpenCV reads pictures in BGR order, so convert to RGB
    img = cv2.imread(im_file)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # check if h and w in record equals that read from img
    assert img.shape[0] == int(h), \
             "image height of {} inconsistent in record({}) and img file({})".format(
               im_file, h, img.shape[0])

    assert img.shape[1] == int(w), \
             "image width of {} inconsistent in record({}) and img file({})".format(
               im_file, w, img.shape[1])

    gt_boxes, gt_labels = get_bbox(gt_bbox, gt_class)

    # Convert gt_boxes to values relative to the picture size
    gt_boxes[:, 0] = gt_boxes[:, 0] / float(w)
    gt_boxes[:, 1] = gt_boxes[:, 1] / float(h)
    gt_boxes[:, 2] = gt_boxes[:, 2] / float(w)
    gt_boxes[:, 3] = gt_boxes[:, 3] / float(h)

    return img, gt_boxes, gt_labels, (h, w)

record = records[0]
img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)

img.shape

(1344, 1344, 3)

gt_boxes.shape

(50, 4)

gt_labels

array([1., 0., 6., 4., 5., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

scales

(1344.0, 1344.0)


The get_img_data_from_file() function returns the data of one picture: the image data img, the real box coordinates gt_boxes, the categories gt_labels of the objects contained in the real boxes, and the image size scales.
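Because unused rows are filled with zeros, a padded label of 0 looks the same as the real category 0 ('Boerner'). One way to tell them apart is to look at the box itself, since a padded row has zero width and height. The sketch below illustrates this in plain Python. The helpers pad_boxes and count_valid are hypothetical and only mirror the padding logic described above.

```python
MAX_NUM = 50  # same limit as in get_bbox above


def pad_boxes(boxes, labels, max_num=MAX_NUM):
    """Pad (or truncate) boxes and labels to exactly max_num entries."""
    boxes = [list(b) for b in boxes[:max_num]]
    labels = list(labels[:max_num])
    missing = max_num - len(boxes)
    boxes += [[0.0, 0.0, 0.0, 0.0] for _ in range(missing)]
    labels += [0] * missing
    return boxes, labels


def count_valid(padded_boxes):
    """A row is a real box only if its width and height are positive."""
    return sum(1 for b in padded_boxes if b[2] > 0 and b[3] > 0)


# Two real boxes in relative xywh format; the first one is category 0
boxes = [(0.40, 0.49, 0.10, 0.11), (0.66, 0.43, 0.09, 0.10)]
labels = [0, 6]

padded_boxes, padded_labels = pad_boxes(boxes, labels)
print(len(padded_boxes), len(padded_labels))  # 50 50
print(count_valid(padded_boxes))              # 2
```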

### Data preprocessing

In computer vision, images are usually transformed randomly to produce similar but not identical samples. The main purpose is to expand the training dataset, suppress overfitting and improve the generalization ability of the model. The common methods are as follows:

• Randomly change brightness, contrast and color
• Random expansion (padding)
• Random cropping
• Random scaling
• Random flipping
• Randomly shuffle the order of real boxes

Next, we use numpy to implement these data augmentation methods.

#### Randomly change brightness, contrast, color, etc

import numpy as np
import cv2
from PIL import Image, ImageEnhance
import random

# Randomly change brightness, contrast, color, etc
def random_distort(img):
    # Randomly change brightness
    def random_brightness(img, lower=0.5, upper=1.5):
        e = np.random.uniform(lower, upper)
        return ImageEnhance.Brightness(img).enhance(e)
    # Randomly change contrast
    def random_contrast(img, lower=0.5, upper=1.5):
        e = np.random.uniform(lower, upper)
        return ImageEnhance.Contrast(img).enhance(e)
    # Randomly change color
    def random_color(img, lower=0.5, upper=1.5):
        e = np.random.uniform(lower, upper)
        return ImageEnhance.Color(img).enhance(e)

    # Apply the three operations in a random order
    ops = [random_brightness, random_contrast, random_color]
    np.random.shuffle(ops)

    img = Image.fromarray(img)
    img = ops[0](img)
    img = ops[1](img)
    img = ops[2](img)
    img = np.asarray(img)

    return img

# Define a visualization function to compare the original image with the augmented image
import matplotlib.pyplot as plt
def visualize(srcimg, img_enhance):
    # Image visualization
    plt.figure(num=2, figsize=(6,12))
    plt.subplot(1,2,1)
    plt.title('Src Image', color='#0000FF')
    plt.axis('off') # Do not display axes
    plt.imshow(srcimg) # Show original picture

    plt.subplot(1,2,2)
    plt.title('Enhance Image', color='#0000FF')
    plt.axis('off') # Do not display axes
    plt.imshow(img_enhance) # Show augmented picture

image_path = records[0]['im_file']
srcimg = Image.open(image_path)
# Convert the image read by PIL to array type
srcimg = np.array(srcimg)

# Randomly change the brightness, contrast and color of the original image
img_enhance = random_distort(srcimg)
visualize(srcimg, img_enhance)

read image from file ./insects/train\images\1.jpeg


#### Random expansion (padding)

# Random expansion (padding)
def random_expand(img,
                  gtboxes,
                  max_ratio=4.,
                  fill=None,
                  keep_ratio=True,
                  thresh=0.5):
    if random.random() > thresh:
        return img, gtboxes

    if max_ratio < 1.0:
        return img, gtboxes

    h, w, c = img.shape
    ratio_x = random.uniform(1, max_ratio)
    if keep_ratio:
        ratio_y = ratio_x
    else:
        ratio_y = random.uniform(1, max_ratio)
    # Size of the enlarged canvas and position of the original image on it
    oh = int(h * ratio_y)
    ow = int(w * ratio_x)
    off_x = random.randint(0, ow - w)
    off_y = random.randint(0, oh - h)

    out_img = np.zeros((oh, ow, c))
    if fill and len(fill) == c:
        for i in range(c):
            out_img[:, :, i] = fill[i] * 255.0

    out_img[off_y:off_y + h, off_x:off_x + w, :] = img
    # gtboxes are relative xywh values and are updated in place
    gtboxes[:, 0] = ((gtboxes[:, 0] * w) + off_x) / float(ow)
    gtboxes[:, 1] = ((gtboxes[:, 1] * h) + off_y) / float(oh)
    gtboxes[:, 2] = gtboxes[:, 2] / ratio_x
    gtboxes[:, 3] = gtboxes[:, 3] / ratio_y

    return out_img.astype('uint8'), gtboxes

# Apply random expansion to the original image.
# A copy is passed because random_expand modifies the boxes in place.
srcimg_gtbox = records[0]['gt_bbox'].copy()
img_enhance, new_gtbox = random_expand(srcimg, srcimg_gtbox)
visualize(srcimg, img_enhance)
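To make the box update in random_expand concrete, here is a small worked sketch in plain Python. It assumes boxes in relative xywh format and uses the exact canvas size for the width and height, whereas the function above divides by the sampled ratio, which differs slightly once the canvas size is truncated to an integer. The helper name expand_box is illustrative.

```python
def expand_box(box, img_w, img_h, off_x, off_y, out_w, out_h):
    """Map a relative xywh box from the original image onto a larger canvas.

    The original image is pasted with its top-left corner at (off_x, off_y)
    on a canvas of size out_w x out_h.
    """
    cx, cy, bw, bh = box
    new_cx = (cx * img_w + off_x) / float(out_w)  # center shifts by the offset
    new_cy = (cy * img_h + off_y) / float(out_h)
    new_bw = bw * img_w / float(out_w)            # size shrinks with the canvas ratio
    new_bh = bh * img_h / float(out_h)
    return new_cx, new_cy, new_bw, new_bh


# A 100x100 image pasted at offset (50, 100) on a 200x200 canvas
box = (0.5, 0.5, 0.2, 0.4)
new_box = expand_box(box, 100, 100, 50, 100, 200, 200)
print(new_box)
```

In this example the box center moves from the middle of the image to three quarters of the canvas height, and the relative width and height are halved because the canvas is twice as large as the image.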


#### Random cropping

Before random cropping, two functions need to be defined: multi_box_iou_xywh and box_crop. Both functions are saved in the box_utils.py file.

import numpy as np

def multi_box_iou_xywh(box1, box2):
    """
    In this case, box1 or box2 can contain multi boxes.
    Only two cases can be processed in this method:
       1, box1 and box2 have the same shape, box1.shape == box2.shape
       2, either box1 or box2 contains only one box, len(box1) == 1 or len(box2) == 1
    If the shape of box1 and box2 does not match, and both of them contain multi boxes, it will be wrong.
    """
    assert box1.shape[-1] == 4, "Box1 shape[-1] should be 4."
    assert box2.shape[-1] == 4, "Box2 shape[-1] should be 4."

    # Convert xywh to corner coordinates
    b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
    b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
    b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
    b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2

    # Intersection rectangle, clipped to zero when the boxes do not overlap
    inter_x1 = np.maximum(b1_x1, b2_x1)
    inter_x2 = np.minimum(b1_x2, b2_x2)
    inter_y1 = np.maximum(b1_y1, b2_y1)
    inter_y2 = np.minimum(b1_y2, b2_y2)
    inter_w = inter_x2 - inter_x1
    inter_h = inter_y2 - inter_y1
    inter_w = np.clip(inter_w, a_min=0., a_max=None)
    inter_h = np.clip(inter_h, a_min=0., a_max=None)

    inter_area = inter_w * inter_h
    b1_area = (b1_x2 - b1_x1) * (b1_y2 - b1_y1)
    b2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1)

    return inter_area / (b1_area + b2_area - inter_area)

def box_crop(boxes, labels, crop, img_shape):
    x, y, w, h = map(float, crop)
    im_w, im_h = map(float, img_shape)

    # Convert relative xywh boxes to absolute corner coordinates
    boxes = boxes.copy()
    boxes[:, 0], boxes[:, 2] = (boxes[:, 0] - boxes[:, 2] / 2) * im_w, (
        boxes[:, 0] + boxes[:, 2] / 2) * im_w
    boxes[:, 1], boxes[:, 3] = (boxes[:, 1] - boxes[:, 3] / 2) * im_h, (
        boxes[:, 1] + boxes[:, 3] / 2) * im_h

    # Keep only the boxes whose center lies inside the crop
    crop_box = np.array([x, y, x + w, y + h])
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    mask = np.logical_and(crop_box[:2] <= centers, centers <= crop_box[2:]).all(
        axis=1)

    # Clip the boxes to the crop and shift them into crop coordinates
    boxes[:, :2] = np.maximum(boxes[:, :2], crop_box[:2])
    boxes[:, 2:] = np.minimum(boxes[:, 2:], crop_box[2:])
    boxes[:, :2] -= crop_box[:2]
    boxes[:, 2:] -= crop_box[:2]

    # Zero out the boxes and labels that fall outside the crop
    boxes = boxes * np.expand_dims(mask.astype('float32'), axis=1)
    labels = labels * mask.astype('float32')
    # Convert back to relative xywh, now relative to the crop size
    boxes[:, 0], boxes[:, 2] = (boxes[:, 0] + boxes[:, 2]) / 2 / w, (
        boxes[:, 2] - boxes[:, 0]) / w
    boxes[:, 1], boxes[:, 3] = (boxes[:, 1] + boxes[:, 3]) / 2 / h, (
        boxes[:, 3] - boxes[:, 1]) / h

    return boxes, labels, mask.sum()


# Random cropping
def random_crop(img,
                boxes,
                labels,
                scales=[0.3, 1.0],
                max_ratio=2.0,
                constraints=None,
                max_trial=50):
    if len(boxes) == 0:
        return img, boxes, labels

    if not constraints:
        constraints = [(0.1, 1.0), (0.3, 1.0), (0.5, 1.0), (0.7, 1.0),
                       (0.9, 1.0), (0.0, 1.0)]

    img = Image.fromarray(img)
    w, h = img.size
    crops = [(0, 0, w, h)]
    # For each IoU constraint, try to sample one crop that satisfies it
    for min_iou, max_iou in constraints:
        for _ in range(max_trial):
            scale = random.uniform(scales[0], scales[1])
            aspect_ratio = random.uniform(max(1 / max_ratio, scale * scale),
                                          min(max_ratio, 1 / scale / scale))
            crop_h = int(h * scale / np.sqrt(aspect_ratio))
            crop_w = int(w * scale * np.sqrt(aspect_ratio))
            crop_x = random.randrange(w - crop_w)
            crop_y = random.randrange(h - crop_h)
            crop_box = np.array([[(crop_x + crop_w / 2.0) / w,
                                  (crop_y + crop_h / 2.0) / h,
                                  crop_w / float(w), crop_h / float(h)]])

            iou = multi_box_iou_xywh(crop_box, boxes)
            if min_iou <= iou.min() and max_iou >= iou.max():
                crops.append((crop_x, crop_y, crop_w, crop_h))
                break

    # Pick candidate crops at random until one still contains a real box
    while crops:
        crop = crops.pop(np.random.randint(0, len(crops)))
        crop_boxes, crop_labels, box_num = box_crop(boxes, labels, crop, (w, h))
        if box_num < 1:
            continue
        img = img.crop((crop[0], crop[1], crop[0] + crop[2],
                        crop[1] + crop[3])).resize(img.size, Image.LANCZOS)
        img = np.asarray(img)
        return img, crop_boxes, crop_labels
    img = np.asarray(img)
    return img, boxes, labels

# Apply random cropping to the original image
srcimg_gtbox = records[0]['gt_bbox'].copy()
srcimg_label = records[0]['gt_class']

img_enhance, new_gtbox, new_labels = random_crop(srcimg, srcimg_gtbox, srcimg_label)
visualize(srcimg, img_enhance)


#### Random scaling

# Random scaling
def random_interp(img, size, interp=None):
    interp_method = [
        cv2.INTER_NEAREST,
        cv2.INTER_LINEAR,
        cv2.INTER_AREA,
        cv2.INTER_CUBIC,
        cv2.INTER_LANCZOS4,
    ]
    # Pick a random interpolation method unless a valid one is given
    if not interp or interp not in interp_method:
        interp = interp_method[random.randint(0, len(interp_method) - 1)]
    h, w, _ = img.shape
    im_scale_x = size / float(w)
    im_scale_y = size / float(h)
    img = cv2.resize(
        img, None, None, fx=im_scale_x, fy=im_scale_y, interpolation=interp)
    return img

# Apply random scaling to the original image
img_enhance = random_interp(srcimg, 640)
visualize(srcimg, img_enhance)


#### Random flipping

# Random flipping (horizontal)
def random_flip(img, gtboxes, thresh=0.5):
    if random.random() > thresh:
        # Reverse the column order and mirror the relative x center of each box
        img = img[:, ::-1, :]
        gtboxes[:, 0] = 1.0 - gtboxes[:, 0]
    return img, gtboxes

# Apply random flipping to the original image
img_enhance, box_enhance = random_flip(srcimg, srcimg_gtbox)
visualize(srcimg, img_enhance)
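The effect of a horizontal flip on a box in relative xywh format can be checked by hand: only the x center changes, to 1 - x, while the y center, width and height stay the same. The following plain-Python sketch, with illustrative helper names, shows this on a tiny nested-list "image" and one box.

```python
def flip_image_horizontally(rows):
    """Reverse the column order of every row (image given as nested lists)."""
    return [row[::-1] for row in rows]


def flip_box_horizontally(box):
    """Mirror a relative xywh box: only the x center changes."""
    cx, cy, w, h = box
    return (1.0 - cx, cy, w, h)


image = [[1, 2, 3],
         [4, 5, 6]]
box = (0.25, 0.5, 0.1, 0.2)

print(flip_image_horizontally(image))  # [[3, 2, 1], [6, 5, 4]]
print(flip_box_horizontally(box))      # (0.75, 0.5, 0.1, 0.2)
```

Flipping twice returns both the image and the box to their original values, which is a quick sanity check for this kind of augmentation.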


#### Randomly shuffle the order of real boxes

# Randomly shuffle the order of real boxes
def shuffle_gtbox(gtbox, gtlabel):
    # Join boxes and labels so that they are shuffled together
    gt = np.concatenate(
        [gtbox, gtlabel[:, np.newaxis]], axis=1)
    idx = np.arange(gt.shape[0])
    np.random.shuffle(idx)
    gt = gt[idx, :]
    return gt[:, :4], gt[:, 4]


#### Summary of image augmentation methods

# Summary of image augmentation methods
def image_augment(img, gtboxes, gtlabels, size, means=None):
    # Randomly change brightness, contrast, color, etc
    img = random_distort(img)
    # Random expansion (padding)
    img, gtboxes = random_expand(img, gtboxes, fill=means)
    # Random cropping
    img, gtboxes, gtlabels = random_crop(img, gtboxes, gtlabels)
    # Random scaling
    img = random_interp(img, size)
    # Random flipping
    img, gtboxes = random_flip(img, gtboxes)
    # Randomly shuffle the order of real boxes
    gtboxes, gtlabels = shuffle_gtbox(gtboxes, gtlabels)

    return img.astype('float32'), gtboxes.astype('float32'), gtlabels.astype('int32')

img_enhance, img_box, img_label = image_augment(srcimg, srcimg_gtbox, srcimg_label, size=320)
visualize(srcimg, img_enhance)


Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).


img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)
size = 512
img, gt_boxes, gt_labels = image_augment(img, gt_boxes, gt_labels, size)

img.shape

(512, 512, 3)

gt_boxes.shape

(50, 4)

gt_labels.shape

(50,)


The img values obtained here still need to be normalized: divide by 255, subtract the mean and divide by the standard deviation. Then the dimensions are rearranged from [H, W, C] to [C, H, W].

img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)
size = 512
img, gt_boxes, gt_labels = image_augment(img, gt_boxes, gt_labels, size)
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
mean = np.array(mean).reshape((1, 1, -1))
std = np.array(std).reshape((1, 1, -1))
img = (img / 255.0 - mean) / std
img = img.astype('float32').transpose((2, 0, 1))
img

array([[[-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 ,
-2.117904 , -2.117904 ],
[-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 ,
-2.117904 , -2.117904 ],
[-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 ,
-2.117904 , -2.117904 ],
...,
[-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 ,
-2.117904 , -2.117904 ],
[-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 ,
-2.117904 , -2.117904 ],
[-2.117904 , -2.117904 , -2.117904 , ..., -2.117904 ,
-2.117904 , -2.117904 ]],

[[-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144,
-2.0357144, -2.0357144],
[-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144,
-2.0357144, -2.0357144],
[-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144,
-2.0357144, -2.0357144],
...,
[-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144,
-2.0357144, -2.0357144],
[-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144,
-2.0357144, -2.0357144],
[-2.0357144, -2.0357144, -2.0357144, ..., -2.0357144,
-2.0357144, -2.0357144]],

[[-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444,
-1.8044444, -1.8044444],
[-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444,
-1.8044444, -1.8044444],
[-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444,
-1.8044444, -1.8044444],
...,
[-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444,
-1.8044444, -1.8044444],
[-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444,
-1.8044444, -1.8044444],
[-1.8044444, -1.8044444, -1.8044444, ..., -1.8044444,
-1.8044444, -1.8044444]]], dtype=float32)
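The constant values in the printed array can be reproduced by hand. A black pixel (0, 0, 0) normalized with the mean and standard deviation above gives exactly these numbers, which suggests that the visible corners of this particular sample are the black border added by random expansion. The sketch below checks the arithmetic in plain Python. The helper normalize_pixel is illustrative and is not part of the original code.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Scale an RGB pixel to [0, 1], subtract the mean, divide by the std."""
    return [(v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD)]


black = normalize_pixel([0, 0, 0])
white = normalize_pixel([255, 255, 255])

print([round(v, 4) for v in black])  # [-2.1179, -2.0357, -1.8044]
print([round(v, 4) for v in white])
```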


Organize the above process into a get_img_data function.

def get_img_data(record, size=640):
    img, gt_boxes, gt_labels, scales = get_img_data_from_file(record)
    img, gt_boxes, gt_labels = image_augment(img, gt_boxes, gt_labels, size)
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]
    mean = np.array(mean).reshape((1, 1, -1))
    std = np.array(std).reshape((1, 1, -1))
    img = (img / 255.0 - mean) / std
    img = img.astype('float32').transpose((2, 0, 1))
    return img, gt_boxes, gt_labels, scales

TRAINDIR = '/home/aistudio/work/insects/train'
TESTDIR = '/home/aistudio/work/insects/test'
VALIDDIR = '/home/aistudio/work/insects/val'
cname2cid = get_insect_names()
records = get_annotations(cname2cid, TRAINDIR)

record = records[0]
img, gt_boxes, gt_labels, scales = get_img_data(record, size=480)

img.shape

(3, 480, 480)

gt_boxes.shape

(50, 4)

gt_labels

array([0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1], dtype=int32)

scales

(1220.0, 1220.0)


### Fast data augmentation using the high-level API of PaddlePaddle

In the above code, we used numpy to implement a variety of data augmentation methods. PaddlePaddle also provides ready-to-use data augmentation methods in paddle.vision.transforms. This module provides dozens of data augmentation methods, including brightness adjustment (adjust_brightness), contrast adjustment (adjust_contrast) and random cropping (RandomCrop). For more information on how to use the high-level API, please refer to the official PaddlePaddle website.

The data augmentation in the paddle.vision.transforms module is used as follows:

# Random cropping of an image
# Import the random cropping API RandomCrop from the paddle.vision.transforms module
from paddle.vision.transforms import RandomCrop

# RandomCrop is a python class that needs to be instantiated first
# RandomCrop also needs the crop size, which is set to 640 here
transform = RandomCrop(640)
# Convert image to PIL.Image format
srcimg = Image.fromarray(np.array(srcimg))
# Call the instantiated API to perform random cropping
img_res = transform(srcimg)
# Visualization results
visualize(srcimg, np.array(img_res))


In the same way, brightness adjustment can be done with the high-level API of PaddlePaddle, as shown in the following code:

from paddle.vision.transforms import BrightnessTransform

# BrightnessTransform is a python class that needs to be instantiated first
transform = BrightnessTransform(0.4)
# Convert image to PIL.Image format
srcimg = Image.fromarray(np.array(srcimg))
# Call the instantiated API to adjust the brightness
img_res = transform(srcimg)
# Visualization results
visualize(srcimg, np.array(img_res))


### Batch data reading and acceleration

The above program shows how to read and preprocess the data of a single picture. The following code implements batch data reading.

# Obtain the randomly scaled size of samples in a batch
def get_img_size(mode):
    if (mode == 'train') or (mode == 'valid'):
        # Choose one of ten sizes between 320 and 608, in steps of 32
        inds = np.array([0,1,2,3,4,5,6,7,8,9])
        ii = np.random.choice(inds)
        img_size = 320 + ii * 32
    else:
        img_size = 608
    return img_size

# Convert batch data in the form of a list into a tuple composed of multiple arrays
def make_array(batch_data):
    img_array = np.array([item[0] for item in batch_data], dtype = 'float32')
    gt_box_array = np.array([item[1] for item in batch_data], dtype = 'float32')
    gt_labels_array = np.array([item[2] for item in batch_data], dtype = 'int32')
    img_scale = np.array([item[3] for item in batch_data], dtype='int32')
    return img_array, gt_box_array, gt_labels_array, img_scale


Because data preprocessing takes a long time, it may become the bottleneck of network training speed, so the preprocessing part needs to be optimized. By setting the num_workers parameter of the paddle.io.DataLoader API provided by PaddlePaddle, data can be read with multiple processes. The specific implementation code is as follows.

import paddle

# Define a data reading class that inherits from paddle.io.Dataset
class TrainDataset(paddle.io.Dataset):
    def __init__(self, datadir, mode='train'):
        self.datadir = datadir
        cname2cid = get_insect_names()
        self.records = get_annotations(cname2cid, datadir)
        self.img_size = 640  #get_img_size(mode)

    def __getitem__(self, idx):
        record = self.records[idx]
        # print("print: ", record)
        img, gt_bbox, gt_labels, im_shape = get_img_data(record, size=self.img_size)

        return img, gt_bbox, gt_labels, np.array(im_shape)

    def __len__(self):
        return len(self.records)

train_dataset = TrainDataset(TRAINDIR, mode='train')

# Create a data reader with paddle.io.DataLoader, and set batch_size, the number of processes num_workers and other parameters

d = paddle.io.DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=1, drop_last=True)

img, gt_boxes, gt_labels, im_shape = next(d())

img.shape, gt_boxes.shape, gt_labels.shape, im_shape.shape

([2, 3, 640, 640], [2, 50, 4], [2, 50], [2, 2])


So far, we have viewed the data in the dataset, extracted the annotation information, read images and annotations from files, augmented the images, and read data in batches with acceleration. Through paddle.io.Dataset, data such as img, gt_boxes, gt_labels and im_shape can be returned and then fed into the neural network for the specific algorithm.

Before explaining the algorithm itself, we first add the code that reads the test data. The test data has no annotation information and needs no image augmentation. The code is as follows.

import os

# Convert batch data in the form of a list into a tuple composed of multiple arrays
def make_test_array(batch_data):
    img_name_array = np.array([item[0] for item in batch_data])
    img_data_array = np.array([item[1] for item in batch_data], dtype = 'float32')
    img_scale_array = np.array([item[2] for item in batch_data], dtype='int32')
    return img_name_array, img_data_array, img_scale_array

# Read the test data
def test_data_loader(datadir, batch_size=10, test_image_size=608, mode='test'):
    """
    Load the test pictures. The test data has no ground-truth labels.
    """
    image_names = os.listdir(datadir)
    def reader():
        batch_data = []
        img_size = test_image_size
        for image_name in image_names:
            file_path = os.path.join(datadir, image_name)
            img = cv2.imread(file_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            H = img.shape[0]
            W = img.shape[1]
            img = cv2.resize(img, (img_size, img_size))

            mean = [0.485, 0.456, 0.406]
            std = [0.229, 0.224, 0.225]
            mean = np.array(mean).reshape((1, 1, -1))
            std = np.array(std).reshape((1, 1, -1))
            out_img = (img / 255.0 - mean) / std
            out_img = out_img.astype('float32').transpose((2, 0, 1))
            img = out_img #np.transpose(out_img, (2,0,1))
            im_shape = [H, W]

            batch_data.append((image_name.split('.')[0], img, im_shape))
            if len(batch_data) == batch_size:
                yield make_test_array(batch_data)
                batch_data = []
        # Yield the last, possibly smaller, batch
        if len(batch_data) > 0:
            yield make_test_array(batch_data)

    return reader



# Single stage target detection model YOLOv3

R-CNN series algorithms need to generate candidate regions first, and then classify the candidate regions and predict the position coordinates. This kind of algorithm is called two-stage target detection algorithm. In recent years, many researchers have proposed a series of single-stage detection algorithms, which only need a network to generate candidate regions and predict the category and position coordinates of objects at the same time.

Unlike the R-CNN series, YOLOv3 uses a single network to predict object category and location while generating candidate regions, so the detection task does not need to be split into two stages. In addition, YOLOv3 generates far fewer prediction boxes than Faster R-CNN. In Faster R-CNN, each real box may correspond to multiple candidate regions with positive labels, while in YOLOv3 each real box corresponds to only one positive candidate region. These characteristics make YOLOv3 faster and allow it to reach real-time response.

Joseph Redmon et al. proposed the YOLO (You Only Look Once) algorithm in 2015, commonly known as YOLOv1. In 2016 they improved the algorithm and proposed YOLOv2, and YOLOv3 followed in 2018.

## YOLOv3 model design idea

The basic idea of YOLOv3 algorithm can be divided into two parts:

• A series of candidate regions is generated on the picture according to certain rules, and the candidate regions are then labeled according to their positional relationship with the real boxes of objects in the picture. Candidate regions close enough to a real box are labeled as positive samples, and the position of the real box is taken as the position target of the positive sample. Candidate regions that deviate greatly from the real boxes are labeled as negative samples, which need no position or category prediction.
• A convolutional neural network is used to extract image features and predict the position and category of the candidate regions. Each prediction box can thus be regarded as a sample whose label values come from the position and category of the real box relative to it. The network model predicts its position and category, and comparing the predicted values with the label values yields the loss function.

The flow chart of YOLOv3 algorithm training process is shown in Figure 8:

Figure 8: YOLOv3 algorithm training flow chart
• The left side of Fig. 8 is the input picture. The upper branch shows feature extraction with a convolutional neural network. As the network propagates forward, the feature map becomes smaller and smaller and each pixel represents a more abstract feature pattern, until the output feature map is reached, whose size is $\frac{1}{32}$ of the original image.
• The lower part of Fig. 8 describes the process of generating candidate regions. First, the original picture is divided into small blocks of size $32 \times 32$. Then a series of anchor boxes is generated centered on each small block, so that the whole picture is covered by anchor boxes. A prediction box is generated on the basis of each anchor box, and these prediction boxes are labeled according to the positional relationship between the anchor boxes, the prediction boxes and the real boxes of objects in the picture.
• Associate the feature map output in the upper branch with the prediction frame label generated in the lower branch, create a loss function, and start the end-to-end training process.

Next, the principle and code implementation of each node in the process are introduced in detail.

## Generate candidate regions

How to generate candidate regions is the core design scheme of detection model. At present, most models based on convolutional neural network adopt the following methods:

• According to certain rules, a series of anchor boxes with fixed positions are generated on the picture, and these anchor boxes are regarded as possible candidate areas.
• Predict whether the anchor box contains the target object. If it contains the target object, it is also necessary to predict the category of the included object and the range of adjustment of the prediction box relative to the anchor box position.

### Generate anchor box

Divide the original picture into $m \times n$ areas. As shown in the figure below, the original picture has height $H=640$ and width $W=480$. If we choose small areas of size $32 \times 32$, then $m$ and $n$ are:

$$m = \frac{640}{32} = 20$$

$$n = \frac{480}{32} = 15$$

As shown in Fig. 9, the original image is divided into 20 rows and 15 columns of small square areas.

Figure 9: divide the picture into multiple 32x32 small blocks
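The grid arithmetic can be checked with a few lines of Python. This is a minimal sketch; the helper `cell_of` and the sample point are hypothetical and only illustrate how a pixel maps to a row and column under the 0-based numbering used here.

```python
H, W = 640, 480   # picture height and width from the text
cell = 32         # side length of each small block

m = H // cell     # number of rows
n = W // cell     # number of columns
print(m, n, m * n)

def cell_of(x, y, cell=32):
    """Return (row, column) of the block containing pixel (x, y), 0-based."""
    return int(y // cell), int(x // cell)

# A hypothetical point used only for illustration
row, col = cell_of(150, 330)
print(row, col)
```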

YOLOv3 algorithm will generate a series of anchor boxes in the center of each region. For display convenience, we first draw the generated anchor box near the small box position in the tenth row and fourth column in the figure, as shown in Figure 10.

Note:

Here, in order to correspond to the number in the program, the top line number is row 0 and the left column number is column 0.

Figure 10: three anchor boxes are generated in the small square area in row 10 and column 4

Figure 11 shows that three anchor frames are generated near each area. It may not be easy to see many anchor frames stacked together, but the process is similar to the above, except that three anchor frames need to be generated respectively with the center point of each area as the center.

Figure 11: generate 3 anchor boxes in each small square area
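Generating the anchor boxes for the whole picture can be sketched as follows. This is an illustration rather than the tutorial's own implementation: the function name `generate_anchors` is hypothetical, and the three anchor sizes are assumed to be the (width, height) pairs (116, 90), (156, 198) and (373, 326) that the text uses later.

```python
import numpy as np

def generate_anchors(img_h, img_w, cell, anchor_sizes):
    """Place every anchor size at the center of every cell.
    Returns an array of [center_x, center_y, width, height] in pixels."""
    rows, cols = img_h // cell, img_w // cell
    boxes = []
    for i in range(rows):            # row index, 0-based
        for j in range(cols):        # column index, 0-based
            cx = (j + 0.5) * cell
            cy = (i + 0.5) * cell
            for w, h in anchor_sizes:
                boxes.append([cx, cy, w, h])
    return np.array(boxes, dtype='float32')

anchor_sizes = [(116, 90), (156, 198), (373, 326)]  # assumed (w, h) pairs
anchors = generate_anchors(640, 480, 32, anchor_sizes)
print(anchors.shape)
```

Only the centers differ from cell to cell; the set of shapes is the same everywhere, which is why the anchor sizes can be treated as a small fixed list.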

### Generate prediction box

It has been pointed out earlier that the position of the anchor frame is fixed and cannot coincide with the object boundary frame. It is necessary to fine tune the position on the basis of the anchor frame to generate the prediction frame. The prediction box will have different center positions and sizes relative to the anchor box. What method can be used to obtain the prediction box? Let's first consider how to generate its center position coordinates.

For example, in the figure above, an anchor box is generated at the center of the small square area in row 10 and column 4, as shown by the green dotted box. The unit length here is the width of a small square.

The position coordinates of the upper left corner of this small square area are:

$$c_x = 4$$

$$c_y = 10$$

The center coordinates of this anchor box are:

$$center\_x = c_x + 0.5 = 4.5$$

$$center\_y = c_y + 0.5 = 10.5$$

The center coordinates of the prediction box can be generated as follows:

$$b_x = c_x + \sigma(t_x)$$

$$b_y = c_y + \sigma(t_y)$$

where $t_x$ and $t_y$ are real numbers and $\sigma(x)$ is the Sigmoid function we learned before, defined as:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

Because the value of the Sigmoid function lies between $0$ and $1$, the center of the prediction box computed by the formula above always falls within the small area in row 10 and column 4.

When $t_x=t_y=0$, $b_x = c_x + 0.5$ and $b_y = c_y + 0.5$, so the center of the prediction box coincides with the center of the anchor box, which is the center of the small area.

The size of the anchor box is preset and can be regarded as a hyperparameter of the model. The anchor box drawn in the figure below has size

$$p_h = 350$$

$$p_w = 250$$

The size of the prediction box is generated by the following formulas:

$$b_h = p_h e^{t_h}$$

$$b_w = p_w e^{t_w}$$

If $t_x=t_y=0$ and $t_h=t_w=0$, the prediction box coincides with the anchor box.

If $t_x, t_y, t_h, t_w$ are assigned arbitrary values, for example:

$$t_x = 0.2, \quad t_y = 0.3, \quad t_w = 0.1, \quad t_h = -0.12$$

Converting the center from grid units to pixels by multiplying by the cell width of 32, the coordinates of the prediction box are (145.59, 338.38, 276.29, 310.42), as shown in the blue box in Figure 12.

Note:
The coordinates here are in $xywh$ format.

Figure 12: generate prediction box
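The decoding formulas can be written as a short function. This is a minimal sketch: the name `decode_box` is hypothetical, and it assumes the center is computed in grid units and then multiplied by the 32-pixel cell width, while the anchor size is already given in pixels.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, cell=32):
    """Turn network offsets into a prediction box in xywh pixel format."""
    bx = (cx + sigmoid(tx)) * cell   # center x, grid units -> pixels
    by = (cy + sigmoid(ty)) * cell   # center y, grid units -> pixels
    bw = pw * math.exp(tw)           # width scales the anchor width
    bh = ph * math.exp(th)           # height scales the anchor height
    return bx, by, bw, bh

# Offsets and anchor from the text: cell (row 10, column 4), anchor 250 x 350
box = decode_box(0.2, 0.3, 0.1, -0.12, cx=4, cy=10, pw=250, ph=350)
print([round(v, 2) for v in box])

# With all offsets at zero the prediction box is the anchor box itself
same = decode_box(0.0, 0.0, 0.0, 0.0, cx=4, cy=10, pw=250, ph=350)
print(same)
```

Using the Sigmoid for the center and the exponential for the size keeps the center inside its cell and the width and height positive for any real-valued network output.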

Here we ask: for what values of $t_x, t_y, t_w, t_h$ does the prediction box coincide with the real box? To answer this, set $b_x, b_y, b_h, b_w$ in the prediction box coordinates to the position of the real box and solve for $t$.

Let:

$$\sigma(t^*_x) + c_x = gt_x$$

$$\sigma(t^*_y) + c_y = gt_y$$

$$p_w e^{t^*_w} = gt_w$$

$$p_h e^{t^*_h} = gt_h$$

Solving these equations gives $(t^*_x, t^*_y, t^*_w, t^*_h)$.

If $t$ is the value predicted by the network and $t^*$ is taken as the target value, the gap between them can serve as the loss function. This sets up a regression problem: by learning the network parameters so that $t$ is close enough to $t^*$, the position and size of the prediction box can be solved.

The prediction box can be regarded as a fine adjustment of the anchor box. Each anchor box has one corresponding prediction box. We need to determine the values of $t_x, t_y, t_w, t_h$ in the formulas above in order to compute the position and shape of the prediction box corresponding to the anchor box.

### Label candidate areas

Each region can generate three anchor boxes with different shapes. Each anchor box is a possible candidate region. For these candidate regions, we need to know the following things:

• Whether the anchor box contains an object. This can be treated as a binary classification problem, represented by the label objectness. When the anchor box contains an object, objectness=1, indicating that the prediction box belongs to the positive class; when it does not, objectness=0, indicating the negative class.

• If the anchor box contains an object, what should the center position and size of its prediction box be, that is, what should $t_x, t_y, t_w, t_h$ in the formulas above be? This is represented by the label location.

• If the anchor box contains an object, which category does it belong to? Here the variable label represents its category label.

Select any anchor box and label it, that is, determine its corresponding objectness, $(t_x, t_y, t_w, t_h)$ and label. How to determine these three kinds of labels is described below.

#### Mark whether the anchor box contains objects

As shown in Figure 13, there are three targets here. Take the person on the far left as an example; its real box is $(133.96, 328.42, 186.06, 374.63)$.

Figure 13: select the anchor frame located in the same area as the center of the real frame

The coordinates of the center point of the real box are:

$$center\_x = 133.96$$

$$center\_y = 328.42$$

$$i = 133.96 / 32 = 4.18625$$

$$j = 328.42 / 32 = 10.263125$$

It falls in the small block in row 10 and column 4, as shown in Figure 13. This small block area can generate three anchor boxes with different shapes, whose numbers and sizes in the figure are $A_1(116, 90)$, $A_2(156, 198)$, $A_3(373, 326)$.

Use these three anchor frames with different shapes and real frames to calculate IoU, and select the anchor frame with the largest IoU. Here, in order to simplify the calculation, only the shape of the anchor frame is considered, and the offset between the anchor frame and the center of the real frame is not considered. The specific calculation results are shown in Figure 14.

Figure 14: IoU selected with real frame and anchor frame

The anchor box with the largest IoU with the real box is $A_3$, with shape $(373, 326)$. Set the objectness label of its prediction box to 1; the object category it contains is the category of the object in the real box.
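The shape-only IoU comparison can be reproduced with a short sketch. It assumes, as the text describes, that both boxes share the same center so only width and height matter; the function name `shape_iou` is hypothetical.

```python
def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same center (shape-only comparison)."""
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

# Width and height of the real box from the text
gt_w, gt_h = 186.06, 374.63
anchor_sizes = {'A1': (116, 90), 'A2': (156, 198), 'A3': (373, 326)}

ious = {name: shape_iou(gt_w, gt_h, w, h) for name, (w, h) in anchor_sizes.items()}
best = max(ious, key=ious.get)
for name, value in ious.items():
    print(name, round(value, 4))
print('best match:', best)
```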

In the same way, find the anchor box with the largest IoU for each of the other real boxes, and set the objectness label of their prediction boxes to 1. Of the $20 \times 15 \times 3 = 900$ anchor boxes in total, only 3 prediction boxes are labeled positive.

Since each real box corresponds to only one prediction box with a positive objectness label, some prediction boxes may have a large IoU with a real box without having the largest one, and directly setting their objectness label to 0 as negative samples may not be appropriate. To avoid this, the YOLOv3 algorithm sets an IoU threshold iou_threshold: when the objectness of a prediction box is not 1 but its IoU with some real box is greater than iou_threshold, its objectness label is set to -1 and it does not participate in the calculation of the loss function.

For all other prediction boxes, the objectness tag is set to 0, indicating a negative class.

For a prediction box with objectness=1, its location and the specific category label of the object need to be determined further. For prediction boxes with objectness=0 or -1, their location and category do not matter.
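The three-way labeling rule (1, 0 or -1) can be sketched with NumPy. This is an illustration under assumptions: the function name `mark_ignored` is hypothetical, and `max_iou` is taken to hold, for each prediction box, its largest IoU with any real box, using made-up values.

```python
import numpy as np

def mark_ignored(objectness, max_iou, iou_threshold=0.7):
    """Set objectness to -1 for boxes that are not positive but overlap
    a real box by more than iou_threshold; they are skipped in the loss."""
    out = objectness.copy()
    ignore = (max_iou > iou_threshold) & (objectness != 1)
    out[ignore] = -1
    return out

# Four hypothetical prediction boxes; the first is the positive sample
objectness = np.array([1, 0, 0, 0], dtype='float32')
max_iou = np.array([0.85, 0.75, 0.30, 0.70])

labels = mark_ignored(objectness, max_iou)
print(labels)
```

The second box overlaps a real box strongly without being the best match, so it is neither rewarded nor penalized.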

#### Label the location coordinates of the prediction box

When the anchor box has objectness=1, we need to determine how much the prediction box position is adjusted relative to it, that is, the location label of the anchor box.

We asked this question earlier: for what values of $t_x, t_y, t_w, t_h$ does the prediction box coincide with the real box? The method is to set $b_x, b_y, b_h, b_w$ in the prediction box coordinates to the coordinates of the real box and solve for $t$.

Let:

$$\sigma(t^*_x) + c_x = gt_x$$

$$\sigma(t^*_y) + c_y = gt_y$$

$$p_w e^{t^*_w} = gt_w$$

$$p_h e^{t^*_h} = gt_h$$

For $t_x^*$ and $t_y^*$, because the inverse of the Sigmoid function is inconvenient to compute, we directly use $\sigma(t^*_x)$ and $\sigma(t^*_y)$ as the regression targets.

$$d_x^* = \sigma(t^*_x) = gt_x - c_x$$

$$d_y^* = \sigma(t^*_y) = gt_y - c_y$$

$$t^*_w = \log\left(\frac{gt_w}{p_w}\right)$$

$$t^*_h = \log\left(\frac{gt_h}{p_h}\right)$$

If $(t_x, t_y, t_h, t_w)$ is the output of the network, take $(d_x^*, d_y^*, t_w^*, t_h^*)$ as the target values of $(\sigma(t_x), \sigma(t_y), t_w, t_h)$ and use the gap between them as the loss function. This sets up a regression problem: by learning the network parameters so that $t$ is close enough to $t^*$, the position of the prediction box can be solved.
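The target computation can be worked through for the example real box. This is a minimal sketch, assuming pixel coordinates, a 32-pixel cell and the matched anchor $A_3$ with width 373 and height 326; the function name `encode_targets` is hypothetical.

```python
import math

def encode_targets(gt_x, gt_y, gt_w, gt_h, pw, ph, cell=32):
    """Compute the cell index and the regression targets
    (d_x*, d_y*, t_w*, t_h*) for one real box and its matched anchor."""
    gx, gy = gt_x / cell, gt_y / cell    # center in grid units
    cx, cy = int(gx), int(gy)            # upper left corner of the cell
    dx, dy = gx - cx, gy - cy            # offsets inside the cell, in [0, 1)
    tw = math.log(gt_w / pw)
    th = math.log(gt_h / ph)
    return (cx, cy), (dx, dy, tw, th)

cell_idx, target = encode_targets(133.96, 328.42, 186.06, 374.63, pw=373, ph=326)
print(cell_idx)
print([round(v, 4) for v in target])

# Decoding the size targets recovers the width and height of the real box
print(round(373 * math.exp(target[2]), 2), round(326 * math.exp(target[3]), 2))
```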

#### Label the category of the object contained in the anchor box

For an anchor box with objectness=1, its specific category must be determined. As mentioned above, an anchor box with objectness 1 has a corresponding real box, and the object category of the anchor box is the category contained in that real box. Here a one-hot vector is used to represent the category label. For example, if there are 10 categories in total and the real box contains an object of category 2, then label is $(0,1,0,0,0,0,0,0,0,0)$.
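A one-hot label is easy to build with NumPy. In this sketch the helper name `one_hot` is hypothetical, and the index is 0-based, so "category 2" in the example above corresponds to index 1.

```python
import numpy as np

def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1 at the given 0-based index."""
    label = np.zeros(num_classes, dtype='float32')
    label[index] = 1.0
    return label

# Category 2 of 10 in the text is index 1 when counting from 0
label = one_hot(index=1, num_classes=10)
print(label)
```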

The above steps are summarized, and the marked process is shown in Figure 15.

Figure 15: schematic diagram of marking process

In this way, we generate a series of anchor boxes in each small block area as candidate areas and, according to the positions of the real objects in the picture, label each candidate area with its objectness label, its position adjustment and the category of the object it contains. The position adjustment is described by the four variables $(t_x, t_y, t_w, t_h)$, the objectness label by one variable $obj$, and the category by a variable whose length equals the number of categories C.

For each anchor box, the model needs to predict the output $(t_x, t_y, t_w, t_h, P_{obj}, P_1, P_2, ..., P_C)$, where $P_{obj}$ is the probability that the anchor box contains an object and $P_1, P_2, ..., P_C$ are the probabilities that the object contained in the anchor box belongs to each category. Next, let's learn how to output such predicted values with a convolutional neural network.

### Specific procedures for marking anchor boxes

The above describes how to label the anchor boxes, but the details may still be unclear to the reader. This step is completed below with a concrete program.

# Label the objectness of the prediction box
def get_objectness_label(img, gt_boxes, gt_labels, iou_threshold = 0.7,
                         anchors = [116, 90, 156, 198, 373, 326],
                         num_classes=7, downsample=32):
    """
    img: input image data with shape [N, C, H, W]
    gt_boxes: real boxes with shape [N, 50, 4], where 50 is the upper limit on the number of real boxes.
              When a picture has fewer than 50 real boxes, the coordinates of the unused entries are all 0.
              The real box coordinate format is xywh, and relative values are used here
    gt_labels: categories of the real boxes, with shape [N, 50]
    iou_threshold: when the IoU between a prediction box and a real box is greater than iou_threshold, the prediction box is not treated as a negative sample
    anchors: optional sizes of the anchor boxes
    anchor_masks: used together with anchors to determine which anchor sizes should be selected for the feature map of this level
    num_classes: number of categories
    downsample: ratio by which the feature map shrinks relative to the picture size of the network input
    """

    img_shape = img.shape
    batchsize = img_shape[0]
    num_anchors = len(anchors) // 2
    input_h = img_shape[2]
    input_w = img_shape[3]
    # Divide the input picture into num_rows x num_cols small square areas, each with side length downsample
    # Calculate the total number of rows of small squares
    num_rows = input_h // downsample
    # Calculate the total number of columns of small squares
    num_cols = input_w // downsample

    label_objectness = np.zeros([batchsize, num_anchors, num_rows, num_cols])
    label_classification = np.zeros([batchsize, num_anchors, num_classes, num_rows, num_cols])
    label_location = np.zeros([batchsize, num_anchors, 4, num_rows, num_cols])

    scale_location = np.ones([batchsize, num_anchors, num_rows, num_cols])

    # Loop over the batch and process each picture in turn
    for n in range(batchsize):
        # Loop over the real boxes on the picture and find, for each, the anchor box that best matches its shape
        for n_gt in range(len(gt_boxes[n])):
            gt = gt_boxes[n][n_gt]
            gt_cls = gt_labels[n][n_gt]
            gt_center_x = gt[0]
            gt_center_y = gt[1]
            gt_width = gt[2]
            gt_height = gt[3]
            # Skip the zero-padded entries
            if (gt_width < 1e-3) or (gt_height < 1e-3):
                continue
            i = int(gt_center_y * num_rows)
            j = int(gt_center_x * num_cols)
            ious = []
            for ka in range(num_anchors):
                bbox1 = [0., 0., float(gt_width), float(gt_height)]
                anchor_w = anchors[ka * 2]
                anchor_h = anchors[ka * 2 + 1]
                bbox2 = [0., 0., anchor_w/float(input_w), anchor_h/float(input_h)]
                # Calculate iou
                iou = box_iou_xywh(bbox1, bbox2)
                ious.append(iou)
            ious = np.array(ious)
            inds = np.argsort(ious)
            k = inds[-1]
            label_objectness[n, k, i, j] = 1
            c = gt_cls
            label_classification[n, k, c, i, j] = 1.

            # for those prediction bbox with objectness =1, set label of location
            dx_label = gt_center_x * num_cols - j
            dy_label = gt_center_y * num_rows - i
            dw_label = np.log(gt_width * input_w / anchors[k*2])
            dh_label = np.log(gt_height * input_h / anchors[k*2 + 1])
            label_location[n, k, 0, i, j] = dx_label
            label_location[n, k, 1, i, j] = dy_label
            label_location[n, k, 2, i, j] = dw_label
            label_location[n, k, 3, i, j] = dh_label
            # scale_location adjusts the contribution of anchor boxes of different sizes to the loss function; it is a weighting coefficient multiplied with the position loss
            scale_location[n, k, i, j] = 2.0 - gt_width * gt_height

    # At this point, the prediction boxes with positive objectness have been labeled from all GT boxes in each picture; the remaining prediction boxes keep the default objectness of 0
    # For the prediction boxes with objectness of 1, the object category and the position regression targets have been labeled
    return label_objectness.astype('float32'), label_location.astype('float32'), label_classification.astype('float32'), \
             scale_location.astype('float32')

# Calculate IoU; the coordinate format of the rectangular boxes is xywh
def box_iou_xywh(box1, box2):
    x1min, y1min = box1[0] - box1[2]/2.0, box1[1] - box1[3]/2.0
    x1max, y1max = box1[0] + box1[2]/2.0, box1[1] + box1[3]/2.0
    s1 = box1[2] * box1[3]

    x2min, y2min = box2[0] - box2[2]/2.0, box2[1] - box2[3]/2.0
    x2max, y2max = box2[0] + box2[2]/2.0, box2[1] + box2[3]/2.0
    s2 = box2[2] * box2[3]

    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    inter_h = np.maximum(ymax - ymin, 0.)
    inter_w = np.maximum(xmax - xmin, 0.)
    intersection = inter_h * inter_w

    union = s1 + s2 - intersection
    iou = intersection / union
    return iou

# Read data
img, gt_boxes, gt_labels, im_shape = next(reader())
img, gt_boxes, gt_labels, im_shape = img.numpy(), gt_boxes.numpy(), gt_labels.numpy(), im_shape.numpy()

# Calculate the label corresponding to the anchor box
label_objectness, label_location, label_classification, scale_location = get_objectness_label(img,
gt_boxes, gt_labels,
iou_threshold = 0.7,
anchors = [116, 90, 156, 198, 373, 326],
num_classes=7, downsample=32)


img.shape, gt_boxes.shape, gt_labels.shape, im_shape.shape

((2, 3, 640, 640), (2, 50, 4), (2, 50), (2, 2))

label_objectness.shape, label_location.shape, label_classification.shape, scale_location.shape

((2, 3, 20, 20), (2, 3, 4, 20, 20), (2, 3, 7, 20, 20), (2, 3, 20, 20))


The program above labels the anchor boxes. For each real box, the anchor box that best matches its shape is selected, its objectness is labeled 1, $[d_x^*, d_y^*, t_w^*, t_h^*]$ is used as the location label of the positive sample, and the object category contained in the real box is used as the category of the anchor box. For the remaining anchor boxes, objectness is labeled 0, and no location or category label is needed.

• Note: one small problem remains. As stated earlier, anchor boxes whose IoU with a real box exceeds the threshold need their objectness labeled -1 so that they do not participate in the calculation of the loss function. We set this aside for now and complete it later when the loss function is established.

## Feature extraction using convolution neural network

In the earlier course on image classification, we learned how to extract image features with a convolutional neural network. By stacking multiple convolution and pooling layers, we obtain feature maps with richer semantic meaning. In the detection problem, a convolutional neural network is likewise used to extract image features layer by layer, and the final output feature map is used to represent information such as object position and category.

The backbone network used by YOLOv3 algorithm is Darknet53. The specific structure of Darknet53 network is shown in Figure 16, and good results have been achieved in ImageNet image classification task. In the detection task, the average pooling, full connection layer and Softmax behind C0 in the figure are removed, and the network structure from the input to C0 is retained as the basic network structure of the detection model, also known as the backbone network. YOLOv3 model will add detection related network modules on the basis of backbone network.

Figure 16: Darknet53 network structure

The following program implements the Darknet53 backbone network. Here we take the output data represented by C0, C1 and C2 in the figure above and check their shapes: C0 $[1, 1024, 20, 20]$, C1 $[1, 512, 40, 40]$, C2 $[1, 256, 80, 80]$.

• Term explanation: stride of a feature map

In feature extraction, convolution or pooling with a stride greater than 1 is usually used, so subsequent feature maps become smaller and smaller. The stride of a feature map equals the size of the input picture divided by the size of the feature map. For example, C0 has size $20\times20$ and the original picture has size $640\times640$, so the stride of C0 is $\frac{640}{20}=32$. Similarly, the stride of C1 is 16 and the stride of C2 is 8.
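The stride definition translates directly into code. This is a small sketch; the helper name `feature_stride` is hypothetical, and the feature map sizes are the ones quoted above.

```python
def feature_stride(input_size, feature_size):
    """Stride of a feature map: input size divided by feature map size."""
    if input_size % feature_size != 0:
        raise ValueError("input size must be a multiple of the feature map size")
    return input_size // feature_size

# Feature map sizes of C0, C1 and C2 for a 640 x 640 input
strides = {name: feature_stride(640, size)
           for name, size in [('C0', 20), ('C1', 40), ('C2', 80)]}
print(strides)
```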

import paddle
import paddle.nn.functional as F
import numpy as np

class ConvBNLayer(paddle.nn.Layer):
    # Convolution followed by batch normalization and an optional leaky ReLU
    def __init__(self, ch_in, ch_out,
                 kernel_size=3, stride=1, groups=1,
                 padding=0, act="leaky"):
        super(ConvBNLayer, self).__init__()

        self.conv = paddle.nn.Conv2D(
            in_channels=ch_in,
            out_channels=ch_out,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            groups=groups,
            bias_attr=False)

        self.batch_norm = paddle.nn.BatchNorm2D(
            num_features=ch_out)
        self.act = act

    def forward(self, inputs):
        out = self.conv(inputs)
        out = self.batch_norm(out)
        if self.act == 'leaky':
            out = F.leaky_relu(x=out, negative_slope=0.1)
        return out

class DownSample(paddle.nn.Layer):
    # Downsampling halves the picture size. It is implemented with a convolution of stride=2
    def __init__(self,
                 ch_in,
                 ch_out,
                 kernel_size=3,
                 stride=2,
                 padding=1):

        super(DownSample, self).__init__()

        self.conv_bn_layer = ConvBNLayer(
            ch_in=ch_in,
            ch_out=ch_out,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding)
        self.ch_out = ch_out
    def forward(self, inputs):
        out = self.conv_bn_layer(inputs)
        return out

class BasicBlock(paddle.nn.Layer):
    """
    Definition of the basic residual block: the input x passes through two convolution layers, and the output of the second convolution is then added to the input x
    """
    def __init__(self, ch_in, ch_out):
        super(BasicBlock, self).__init__()

        self.conv1 = ConvBNLayer(
            ch_in=ch_in,
            ch_out=ch_out,
            kernel_size=1,
            stride=1,
            padding=0
            )
        self.conv2 = ConvBNLayer(
            ch_in=ch_out,
            ch_out=ch_out*2,
            kernel_size=3,
            stride=1,
            padding=1
            )
    def forward(self, inputs):
        conv1 = self.conv1(inputs)
        conv2 = self.conv2(conv1)
        # Residual connection: add the block input to the second convolution output
        out = paddle.add(x=inputs, y=conv2)
        return out

class LayerWarp(paddle.nn.Layer):
    """
    Stack multiple residual blocks to form one level of the Darknet53 network
    """
    def __init__(self, ch_in, ch_out, count, is_test=True):
        super(LayerWarp,self).__init__()

        self.basicblock0 = BasicBlock(ch_in,
            ch_out)
        self.res_out_list = []
        for i in range(1, count):
            # Register each additional block as a sublayer
            res_out = self.add_sublayer("basic_block_%d" % (i),
                BasicBlock(ch_out*2,
                    ch_out))
            self.res_out_list.append(res_out)

    def forward(self,inputs):
        y = self.basicblock0(inputs)
        for basic_block_i in self.res_out_list:
            y = basic_block_i(y)
        return y

# The number of residual blocks in each group of DarkNet, taken from the DarkNet network structure diagram
DarkNet_cfg = {53: ([1, 2, 8, 8, 4])}

class DarkNet53_conv_body(paddle.nn.Layer):
    def __init__(self):
        super(DarkNet53_conv_body, self).__init__()
        self.stages = DarkNet_cfg[53]
        self.stages = self.stages[0:5]

        # First convolution layer
        self.conv0 = ConvBNLayer(
            ch_in=3,
            ch_out=32,
            kernel_size=3,
            stride=1,
            padding=1)

        # Downsampling is implemented with a convolution of stride=2
        self.downsample0 = DownSample(
            ch_in=32,
            ch_out=32 * 2)

        # Add the implementation of each level
        self.darknet53_conv_block_list = []
        self.downsample_list = []
        for i, stage in enumerate(self.stages):
            conv_block = self.add_sublayer(
                "stage_%d" % (i),
                LayerWarp(32*(2**(i+1)),
                32*(2**i),
                stage))
            self.darknet53_conv_block_list.append(conv_block)
        # Use DownSample to halve the size between two levels
        for i in range(len(self.stages) - 1):
            downsample = self.add_sublayer(
                "stage_%d_downsample" % i,
                DownSample(ch_in=32*(2**(i+1)),
                    ch_out=32*(2**(i+2))))
            self.downsample_list.append(downsample)

    def forward(self,inputs):
        out = self.conv0(inputs)
        #print("conv1:",out.numpy())
        out = self.downsample0(out)
        #print("dy:",out.numpy())
        blocks = []
        for i, conv_block_i in enumerate(self.darknet53_conv_block_list): # Apply each level to the input in turn
            out = conv_block_i(out)
            blocks.append(out)
            if i < len(self.stages) - 1:
                out = self.downsample_list[i](out)
        return blocks[-1:-4:-1] # Take C0, C1 and C2 as return values


# View the output feature maps of the Darknet53 network
import numpy as np
backbone = DarkNet53_conv_body()
x = np.random.randn(1, 3, 640, 640).astype('float32')
# Convert the NumPy array to a Paddle tensor before the forward pass
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
print(C0.shape, C1.shape, C2.shape)

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/norm.py:648: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")

[1, 1024, 20, 20] [1, 512, 40, 40] [1, 256, 80, 80]


The example code above specifies input data of shape $(1, 3, 640, 640)$, and the output feature maps of the three levels have shapes C0 $(1, 1024, 20, 20)$, C1 $(1, 512, 40, 40)$ and C2 $(1, 256, 80, 80)$.

## Calculate the position and category of prediction boxes from the output feature map

The calculation logic for each prediction box in YOLOv3 is as follows:

• Whether the prediction box contains an object. This can be understood as the probability that objectness=1. The network outputs a real number $x$, and $Sigmoid(x)$ represents the probability $P_{obj}$ that objectness is positive.

• Predict the position and shape of the object. The position and shape $t_x, t_y, t_w, t_h$ can be represented by four real numbers output by the network.

• Predict the object category. This means predicting the specific category of the object in the image, or the probability that it belongs to each category. The total number of categories is C, and the probabilities $(P_1, P_2, ..., P_C)$ that the object belongs to each category need to be predicted. The network outputs C real numbers $(x_1, x_2, ..., x_C)$; applying the Sigmoid function to each, $P_i = Sigmoid(x_i)$, gives the probability that the object belongs to each category.

For one prediction box, the network needs to output $(5 + C)$ real numbers to represent whether it contains an object, its position and shape, and the probability of belonging to each category.
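How the $(5 + C)$ numbers of one prediction box are interpreted can be sketched as follows. The layout (four offsets, then objectness, then C class scores) follows the order given earlier in the text; the function name `split_prediction` and the all-zero input are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_prediction(raw, num_classes):
    """Split one raw output vector of length 5 + C into its three parts."""
    t = raw[:4]                                  # t_x, t_y, t_w, t_h
    p_obj = sigmoid(raw[4])                      # probability of objectness=1
    p_cls = sigmoid(raw[5:5 + num_classes])      # one probability per category
    return t, p_obj, p_cls

C = 7                                            # number of insect categories
raw = np.zeros(5 + C, dtype='float32')           # hypothetical network output
t, p_obj, p_cls = split_prediction(raw, C)
print(t.shape, float(p_obj), p_cls.shape)
```

Each class score goes through its own Sigmoid, so the class probabilities are independent and need not sum to 1.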

Since $K$ prediction boxes are generated in each small block area, the total number of values the network must output for all prediction boxes is:

$$[K(5 + C)] \times m \times n$$

Another, more important point is that the network output must be able to distinguish the positions of the small block areas, so the feature map cannot simply be connected to a fully connected layer with an output size of $[K(5 + C)] \times m \times n$.
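
As a quick check of the count above, the following minimal sketch evaluates $[K(5 + C)] \times m \times n$ for the settings used in this course. The function name `num_prediction_values` is only illustrative and is not part of the original code.

```python
def num_prediction_values(num_anchors, num_classes, rows, cols):
    """Total number of real values the network must output for one feature map."""
    per_box = 5 + num_classes          # objectness + 4 location values + C class scores
    per_cell = num_anchors * per_box   # K prediction boxes per small block area
    return per_cell * rows * cols      # m x n small block areas


# K = 3 anchor boxes, C = 7 insect classes, 20 x 20 grid (640 x 640 input, stride 32)
total = num_prediction_values(num_anchors=3, num_classes=7, rows=20, cols=20)
print(total)
```

Each grid cell carries 36 values here, which is the channel count that appears later for the feature map P0.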

### Establish the association between the output feature map and the prediction boxes

Now look at the feature map. After several convolution and pooling layers its stride is 32, so a $640 \times 480$ input image becomes a $20 \times 15$ feature map. The number of small block areas is also exactly $20 \times 15$, so each pixel on the feature map corresponds to one small block area on the original image. This is why the size of the small block area was set to 32 in the first place: it lines up the small block areas with the pixels of the feature map and settles the correspondence of spatial positions.

Figure 17: comparison of the shape of feature C0 and small block area
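
The correspondence between a feature-map pixel and its small block area can be sketched in a few lines. This is an illustration only; the helper names `grid_shape` and `cell_region` are hypothetical and do not appear in the original code.

```python
def grid_shape(image_h, image_w, stride=32):
    """Number of rows and columns of small block areas (and of feature-map pixels)."""
    return image_h // stride, image_w // stride


def cell_region(i, j, stride=32):
    """Pixel region (x_min, y_min, x_max, y_max) covered by the block in row i, column j."""
    return j * stride, i * stride, (j + 1) * stride, (i + 1) * stride


rows, cols = grid_shape(image_h=480, image_w=640)
print(rows, cols)          # 15 rows and 20 columns for a 640 x 480 picture
print(cell_region(2, 3))   # region of the block in row 2, column 3
```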

Next, pixel $(i, j)$ of the feature map must be associated with the prediction values required by the small block area in row $i$ and column $j$. Each small block area generates $K$ prediction boxes and each prediction box needs $(5 + C)$ real-valued predictions, so each pixel must correspond to $K(5 + C)$ real numbers. To achieve this, the feature map is passed through several more convolutions and the final number of output channels is set to $K(5 + C)$, which lines up the generated feature map with the prediction values required by each prediction box. This correspondence is what connects the features extracted by the backbone network to the output layer so that a loss can be formed. In practice the sizes can be adjusted to suit the data distribution of the task, as long as the size of the output feature map (controlled by the convolution kernels and downsampling) matches the size of the output layer (controlled by the size of the small block area).

The output feature map of the backbone network is C0. The following program applies several convolutions to C0 to obtain the feature map P0 that is associated with the prediction boxes.

class YoloDetectionBlock(paddle.nn.Layer):
    # Feature extraction using multilayer convolution and BN
    def __init__(self, ch_in, ch_out, is_test=True):
        super(YoloDetectionBlock, self).__init__()

        assert ch_out % 2 == 0, \
            "channel {} cannot be divided by 2".format(ch_out)

        # 1x1 convolutions use padding=0 and 3x3 convolutions use padding=1,
        # so the spatial size of the feature map is preserved
        self.conv0 = ConvBNLayer(
            ch_in=ch_in,
            ch_out=ch_out,
            kernel_size=1,
            stride=1,
            padding=0)
        self.conv1 = ConvBNLayer(
            ch_in=ch_out,
            ch_out=ch_out*2,
            kernel_size=3,
            stride=1,
            padding=1)
        self.conv2 = ConvBNLayer(
            ch_in=ch_out*2,
            ch_out=ch_out,
            kernel_size=1,
            stride=1,
            padding=0)
        self.conv3 = ConvBNLayer(
            ch_in=ch_out,
            ch_out=ch_out*2,
            kernel_size=3,
            stride=1,
            padding=1)
        self.route = ConvBNLayer(
            ch_in=ch_out*2,
            ch_out=ch_out,
            kernel_size=1,
            stride=1,
            padding=0)
        self.tip = ConvBNLayer(
            ch_in=ch_out,
            ch_out=ch_out*2,
            kernel_size=3,
            stride=1,
            padding=1)

    def forward(self, inputs):
        out = self.conv0(inputs)
        out = self.conv1(out)
        out = self.conv2(out)
        out = self.conv3(out)
        # route feeds the next (finer) level, tip feeds the prediction convolution
        route = self.route(out)
        tip = self.tip(route)
        return route, tip


NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters=NUM_ANCHORS * (NUM_CLASSES + 5)

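backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
# 1x1 convolution that maps tip to NUM_ANCHORS * (NUM_CLASSES + 5) output channels
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)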

print(P0.shape)

[1, 36, 20, 20]


As the code above shows, the feature map P0 is generated from the feature map C0, and the shape of P0 is $[1, 36, 20, 20]$. Each small block area generates 3 anchor boxes (and hence 3 prediction boxes), there are 7 object categories, and so each area requires $3 \times (5 + 7) = 36$ prediction values, which is exactly the number of output channels of P0.

$P0[t, 0:12, i, j]$ corresponds to the 12 prediction values required by the first prediction box of small block area $(i, j)$ on the $t$-th input picture, $P0[t, 12:24, i, j]$ to the 12 prediction values required by the second prediction box, and $P0[t, 24:36, i, j]$ to the 12 prediction values required by the third prediction box of the same area.

Within the first prediction box, $P0[t, 0:4, i, j]$ corresponds to its position, $P0[t, 4, i, j]$ to its objectness, and $P0[t, 5:12, i, j]$ to its category.

As shown in Fig. 18, the feature map output by the network is matched in this way to the prediction boxes generated by each small block area.

Fig. 18: association between feature map P0 and candidate region
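
The channel layout described above can be written down explicitly. The sketch below lists, for each prediction box, which channels of P0 hold its position, objectness and category values; `channel_layout` is a hypothetical helper introduced only for illustration.

```python
def channel_layout(num_anchors, num_classes):
    """Channel ranges of P0 for each prediction box, as half-open (start, end) pairs."""
    per_box = 5 + num_classes
    layout = []
    for k in range(num_anchors):
        base = k * per_box
        layout.append({
            "location": (base, base + 4),           # tx, ty, tw, th
            "objectness": base + 4,                 # single channel
            "classes": (base + 5, base + per_box),  # one channel per category
        })
    return layout


for k, entry in enumerate(channel_layout(num_anchors=3, num_classes=7)):
    print(k, entry)
```

For the second prediction box this gives objectness channel 16 and category channels 17:24, which agrees with the indices $4 + 12$ and $17:24$ used in the following sections.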

### Calculate the probability of whether the prediction box contains objects

According to the analysis above, $P0[t, 4, i, j]$ corresponds to the objectness of the first prediction box of small block area $(i, j)$ on the $t$-th input picture, $P0[t, 4+12, i, j]$ to the objectness of the second prediction box, and so on. The following program extracts the predictions related to objectness and uses paddle.nn.functional.sigmoid to compute the output probability.

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters=NUM_ANCHORS * (NUM_CLASSES + 5)

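backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
# 1x1 convolution that maps tip to NUM_ANCHORS * (NUM_CLASSES + 5) output channels
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)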

reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)
print(pred_objectness.shape, pred_objectness_probability.shape)

[1, 3, 20, 20] [1, 3, 20, 20]


The output above shows that pred_objectness_probability, the probability that each prediction box contains an object, has shape [1, 3, 20, 20], which matches the number of prediction boxes described above. Its values lie between 0 and 1 and give the probability that each prediction box is a positive sample.

### Calculate the position coordinates of the prediction frame

$P0[t, 0:4, i, j]$ corresponds to the position of the first prediction box of small block area $(i, j)$ on the $t$-th input picture, $P0[t, 12:16, i, j]$ to the position of the second prediction box, and so on. The following program extracts from $P0$ the predicted values related to the position of the prediction boxes.

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters=NUM_ANCHORS * (NUM_CLASSES + 5)

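backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
# 1x1 convolution that maps tip to NUM_ANCHORS * (NUM_CLASSES + 5) output channels
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)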

reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)

pred_location = reshaped_p0[:, :, 0:4, :, :]
print(pred_location.shape)


[1, 3, 4, 20, 20]


The network outputs $(t_x, t_y, t_w, t_h)$, which still has to be converted into the coordinate form $(x_1, y_1, x_2, y_2)$. The PaddlePaddle API paddle.vision.ops.yolo_box computes this directly, but to show the algorithm more clearly, the process is implemented here with NumPy.

# Define the Sigmoid function
def sigmoid(x):
    return 1. / (1.0 + np.exp(-x))

# Convert the [tx, ty, tw, th] output by the network feature map into prediction box coordinates [x1, y1, x2, y2]
def get_yolo_box_xxyy(pred, anchors, num_classes, downsample):
    """
    pred is a numpy.ndarray converted from the network output feature map
    anchors is a list giving the sizes of the anchor boxes,
    for example anchors = [116, 90, 156, 198, 373, 326] means there are three anchor boxes:
    the size [w, h] of the first is [116, 90], of the second [156, 198], and of the third [373, 326]
    """
    batchsize = pred.shape[0]
    num_rows = pred.shape[-2]
    num_cols = pred.shape[-1]

    input_h = num_rows * downsample
    input_w = num_cols * downsample

    num_anchors = len(anchors) // 2

    # The shape of pred is [N, C, H, W], where C = NUM_ANCHORS * (5 + NUM_CLASSES)
    # Reshape pred to [N, NUM_ANCHORS, 5 + NUM_CLASSES, H, W]
    pred = pred.reshape([-1, num_anchors, 5+num_classes, num_rows, num_cols])
    pred_location = pred[:, :, 0:4, :, :]
    pred_location = np.transpose(pred_location, (0,3,4,1,2))
    anchors_this = []
    for ind in range(num_anchors):
        anchors_this.append([anchors[ind*2], anchors[ind*2+1]])
    anchors_this = np.array(anchors_this).astype('float32')

    # The final output is stored in pred_box, whose shape is [N, H, W, NUM_ANCHORS, 4];
    # the last dimension holds the four coordinates of the position
    pred_box = np.zeros(pred_location.shape)
    for n in range(batchsize):
        for i in range(num_rows):
            for j in range(num_cols):
                for k in range(num_anchors):
                    pred_box[n, i, j, k, 0] = j
                    pred_box[n, i, j, k, 1] = i
                    pred_box[n, i, j, k, 2] = anchors_this[k][0]
                    pred_box[n, i, j, k, 3] = anchors_this[k][1]

    # Relative coordinates are used here, so the elements of pred_box lie between 0. and 1.0
    pred_box[:, :, :, :, 0] = (sigmoid(pred_location[:, :, :, :, 0]) + pred_box[:, :, :, :, 0]) / num_cols
    pred_box[:, :, :, :, 1] = (sigmoid(pred_location[:, :, :, :, 1]) + pred_box[:, :, :, :, 1]) / num_rows
    pred_box[:, :, :, :, 2] = np.exp(pred_location[:, :, :, :, 2]) * pred_box[:, :, :, :, 2] / input_w
    pred_box[:, :, :, :, 3] = np.exp(pred_location[:, :, :, :, 3]) * pred_box[:, :, :, :, 3] / input_h

    # Convert coordinates from xywh (center, width, height) to xyxy (corners)
    pred_box[:, :, :, :, 0] = pred_box[:, :, :, :, 0] - pred_box[:, :, :, :, 2] / 2.
    pred_box[:, :, :, :, 1] = pred_box[:, :, :, :, 1] - pred_box[:, :, :, :, 3] / 2.
    pred_box[:, :, :, :, 2] = pred_box[:, :, :, :, 0] + pred_box[:, :, :, :, 2]
    pred_box[:, :, :, :, 3] = pred_box[:, :, :, :, 1] + pred_box[:, :, :, :, 3]

    # Clip the boxes to the image
    pred_box = np.clip(pred_box, 0., 1.0)

    return pred_box



Calling the get_yolo_box_xxyy function defined above computes the coordinates of the prediction boxes from $P0$. The program is as follows:

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters=NUM_ANCHORS * (NUM_CLASSES + 5)

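backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
# 1x1 convolution that maps tip to NUM_ANCHORS * (NUM_CLASSES + 5) output channels
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)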

reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)

pred_location = reshaped_p0[:, :, 0:4, :, :]

# anchors contain pre-set anchor frame dimensions
anchors = [116, 90, 156, 198, 373, 326]
# downsample is the stride of the characteristic map P0
pred_boxes = get_yolo_box_xxyy(P0.numpy(), anchors, num_classes=7, downsample=32) # Calculate the position coordinates of the prediction frame from the output characteristic map P0
print(pred_boxes.shape)

(1, 20, 20, 3, 4)


The pred_boxes computed by the program above has shape $[N, H, W, num\_anchors, 4]$ and coordinate format $[x_1, y_1, x_2, y_2]$; the values lie between 0 and 1 and are relative coordinates.
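
To make the conversion concrete, the following sketch decodes a single prediction box by hand, following the same steps as get_yolo_box_xxyy but in plain Python. The function `decode_box` and the all-zero network output are assumptions chosen for illustration, not part of the original code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, i, j, anchor_wh, num_rows, num_cols, downsample=32):
    """Decode (tx, ty, tw, th) for the cell in row i, column j into relative [x1, y1, x2, y2]."""
    tx, ty, tw, th = t
    input_h = num_rows * downsample
    input_w = num_cols * downsample
    # Box center: cell index plus a sigmoid offset, divided by the grid size
    cx = (sigmoid(tx) + j) / num_cols
    cy = (sigmoid(ty) + i) / num_rows
    # Box size: anchor size scaled by exp(t), divided by the input size
    w = math.exp(tw) * anchor_wh[0] / input_w
    h = math.exp(th) * anchor_wh[1] / input_h
    # Center/size to corner coordinates, then clip to [0, 1]
    x1, y1 = cx - w / 2.0, cy - h / 2.0
    x2, y2 = x1 + w, y1 + h
    return [min(max(v, 0.0), 1.0) for v in (x1, y1, x2, y2)]


# All-zero outputs: the box sits at the center of its cell and has the anchor's size
box = decode_box((0.0, 0.0, 0.0, 0.0), i=10, j=10, anchor_wh=(116, 90),
                 num_rows=20, num_cols=20)
print(box)
```

With zero outputs the sigmoid offset is 0.5, so the box center is at $(10 + 0.5) / 20 = 0.525$ in both directions, and its relative width and height are $116 / 640$ and $90 / 640$.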

### Calculate the probability that the object belongs to each category

$P0[t, 5:12, i, j]$ corresponds to the category of the object contained in the first prediction box of small block area $(i, j)$ on the $t$-th input picture, $P0[t, 17:24, i, j]$ to the category of the second prediction box, and so on. The following program extracts from $P0$ the predicted values related to the category of the prediction boxes.

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters=NUM_ANCHORS * (NUM_CLASSES + 5)

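backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
# 1x1 convolution that maps tip to NUM_ANCHORS * (NUM_CLASSES + 5) output channels
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

x = np.random.randn(1, 3, 640, 640).astype('float32')
x = paddle.to_tensor(x)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)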

reshaped_p0 = paddle.reshape(P0, [-1, NUM_ANCHORS, NUM_CLASSES + 5, P0.shape[2], P0.shape[3]])
# Take out the predicted value related to objectness
pred_objectness = reshaped_p0[:, :, 4, :, :]
pred_objectness_probability = F.sigmoid(pred_objectness)
# Take out the predicted value related to the position
pred_location = reshaped_p0[:, :, 0:4, :, :]
# Take out the predicted value related to the category
pred_classification = reshaped_p0[:, :, 5:5+NUM_CLASSES, :, :]
pred_classification_probability = F.sigmoid(pred_classification)
print(pred_classification.shape)

[1, 3, 7, 20, 20]


The program above computes from $P0$ the probability of each category for the object contained in each prediction box. pred_classification_probability has shape $[1, 3, 7, 20, 20]$ and its values lie between 0 and 1.
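
Because Sigmoid is applied to each category independently, the seven probabilities of one prediction box do not have to sum to 1, unlike a softmax output. The short sketch below illustrates the difference; the logits are made-up numbers and `softmax` is included only for comparison.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)                               # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical network outputs for the 7 categories of one prediction box
logits = [2.0, 1.5, -3.0, 0.0, -1.0, -2.0, -4.0]

per_class = [sigmoid(x) for x in logits]      # independent probability per category
print(sum(per_class))                         # not constrained to equal 1
print(sum(softmax(logits)))                   # sums to 1 up to rounding
```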

## Loss function

The sections above associate the pixels of the output feature map with the prediction boxes conceptually. To train the neural network, the network output must also be associated with the prediction boxes mathematically, that is, a relationship must be established between the loss function and the network output. The following discusses how the loss function of YOLOv3 is built.

For each prediction box, the YOLOv3 model builds three kinds of loss function:

• The loss for whether the box contains a target object, computed from pred_objectness and label_objectness.

  loss_obj = paddle.nn.functional.binary_cross_entropy_with_logits(pred_objectness, label_objectness)

• The loss for the position of the object, computed from pred_location and label_location.

  pred_location_x = pred_location[:, :, 0, :, :]
  pred_location_y = pred_location[:, :, 1, :, :]
  pred_location_w = pred_location[:, :, 2, :, :]
  pred_location_h = pred_location[:, :, 3, :, :]
  loss_location_x = paddle.nn.functional.binary_cross_entropy_with_logits(pred_location_x, label_location_x)
  loss_location_y = paddle.nn.functional.binary_cross_entropy_with_logits(pred_location_y, label_location_y)
  # an L1 loss is assumed here for width and height
  loss_location_w = paddle.abs(pred_location_w - label_location_w)
  loss_location_h = paddle.abs(pred_location_h - label_location_h)
  loss_location = loss_location_x + loss_location_y + loss_location_w + loss_location_h

• The loss for the object category, computed from pred_classification and label_classification.

  loss_classification = paddle.nn.functional.binary_cross_entropy_with_logits(pred_classification, label_classification)
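
All three losses above rely on binary cross entropy computed from logits, meaning the Sigmoid is applied inside the loss rather than beforehand. The sketch below shows what that computes for a single value, using one common numerically stable form; it illustrates the formula and is not a description of how PaddlePaddle implements it internally.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross entropy between sigmoid(logit) and label, in a numerically stable form.

    Mathematically equal to
    -(label * log(sigmoid(logit)) + (1 - label) * log(1 - sigmoid(logit))).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


print(bce_with_logits(0.0, 1.0))    # an undecided prediction costs log(2)
print(bce_with_logits(10.0, 1.0))   # confident and correct: loss close to 0
print(bce_with_logits(10.0, 0.0))   # confident and wrong: loss close to 10
```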


We already know how to compute these predicted values and labels, but one small problem remains: the anchor boxes whose objectness should be -1 have not yet been marked. To do so, we compute the IoU between every prediction box and every real box, and then select the prediction boxes whose IoU is greater than the threshold. The implementation code is as follows:

# Select the prediction boxes whose IoU with a real box is greater than the threshold
def get_iou_above_thresh_inds(pred_box, gt_boxes, iou_threshold):
    batchsize = pred_box.shape[0]
    num_rows = pred_box.shape[1]
    num_cols = pred_box.shape[2]
    num_anchors = pred_box.shape[3]
    ret_inds = np.zeros([batchsize, num_rows, num_cols, num_anchors])
    for i in range(batchsize):
        pred_box_i = pred_box[i]
        gt_boxes_i = gt_boxes[i]
        for k in range(len(gt_boxes_i)):
            # Real boxes are stored as [center x, center y, width, height]
            gt = gt_boxes_i[k]
            gtx_min = gt[0] - gt[2] / 2.
            gty_min = gt[1] - gt[3] / 2.
            gtx_max = gt[0] + gt[2] / 2.
            gty_max = gt[1] + gt[3] / 2.
            # Skip empty (padding) boxes
            if (gtx_max - gtx_min < 1e-3) or (gty_max - gty_min < 1e-3):
                continue
            x1 = np.maximum(pred_box_i[:, :, :, 0], gtx_min)
            y1 = np.maximum(pred_box_i[:, :, :, 1], gty_min)
            x2 = np.minimum(pred_box_i[:, :, :, 2], gtx_max)
            y2 = np.minimum(pred_box_i[:, :, :, 3], gty_max)
            intersection = np.maximum(x2 - x1, 0.) * np.maximum(y2 - y1, 0.)
            s1 = (gty_max - gty_min) * (gtx_max - gtx_min)
            s2 = (pred_box_i[:, :, :, 2] - pred_box_i[:, :, :, 0]) * (pred_box_i[:, :, :, 3] - pred_box_i[:, :, :, 1])
            union = s2 + s1 - intersection
            iou = intersection / union
            above_inds = np.where(iou > iou_threshold)
            ret_inds[i][above_inds] = 1
    # Reorder from [N, H, W, NUM_ANCHORS] to [N, NUM_ANCHORS, H, W]
    ret_inds = np.transpose(ret_inds, (0,3,1,2))
    return ret_inds.astype('bool')


The function above determines which anchor boxes need their objectness marked as -1. The following program processes label_objectness and marks as -1 the anchor boxes whose IoU is greater than the threshold but which are not positive samples.

def label_objectness_ignore(label_objectness, iou_above_thresh_indices):
    # Note: label_objectness[iou_above_thresh_indices] = -1 cannot be used here directly,
    #       because that could set points whose label_objectness is 1 to -1.
    #       Only prediction boxes that are labelled 0 and whose IoU with a real box exceeds the threshold are marked as -1.
    negative_indices = (label_objectness < 0.5)
    ignore_indices = negative_indices * iou_above_thresh_indices
    label_objectness[ignore_indices] = -1
    return label_objectness


Next, these two functions can be called to set the label_objectness of the selected prediction boxes to -1.

# Read data
img, gt_boxes, gt_labels, im_shape = next(reader())
img, gt_boxes, gt_labels, im_shape = img.numpy(), gt_boxes.numpy(), gt_labels.numpy(), im_shape.numpy()
# Calculate the label corresponding to the anchor box
label_objectness, label_location, label_classification, scale_location = get_objectness_label(img,
gt_boxes, gt_labels,
iou_threshold = 0.7,
anchors = [116, 90, 156, 198, 373, 326],
num_classes=7, downsample=32)

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters=NUM_ANCHORS * (NUM_CLASSES + 5)

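backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
# 1x1 convolution that maps tip to NUM_ANCHORS * (NUM_CLASSES + 5) output channels
conv2d_pred = paddle.nn.Conv2D(in_channels=1024, out_channels=num_filters, kernel_size=1)

# Feed the pictures read above through the network
x = paddle.to_tensor(img)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)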

# anchors contain pre-set anchor frame dimensions
anchors = [116, 90, 156, 198, 373, 326]
# downsample is the stride of the characteristic map P0
pred_boxes = get_yolo_box_xxyy(P0.numpy(), anchors, num_classes=7, downsample=32)
iou_above_thresh_indices = get_iou_above_thresh_inds(pred_boxes, gt_boxes, iou_threshold=0.7)
label_objectness = label_objectness_ignore(label_objectness, iou_above_thresh_indices)
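
As a small self-contained illustration of the rule applied above, the sketch below computes the IoU of one prediction box with one real box and decides the resulting objectness label. The helpers `iou_xyxy` and `ignore_label` and the box coordinates are hypothetical; both boxes are given in corner format $[x_1, y_1, x_2, y_2]$.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over union of two boxes in [x1, y1, x2, y2] format."""
    inter_w = max(min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]), 0.0)
    inter_h = max(min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]), 0.0)
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def ignore_label(label, iou, iou_threshold=0.7):
    """Negative boxes that overlap a real box too well are marked -1; positives are kept."""
    if label < 0.5 and iou > iou_threshold:
        return -1
    return label


pred = [0.10, 0.10, 0.50, 0.50]   # made-up prediction box
gt = [0.12, 0.10, 0.50, 0.52]     # made-up real box
iou = iou_xyxy(pred, gt)
print(iou, ignore_label(0, iou), ignore_label(1, iou))
```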



In this way, the objectness label of prediction boxes that are not marked as positive samples but whose IoU with a real box exceeds the threshold is set to -1, and their contribution is not counted in any loss function. The code for calculating the total loss function is as follows:

def get_loss(output, label_objectness, label_location, label_classification, scales, num_anchors=3, num_classes=7):
    # Reshape output from [N, C, H, W] to [N, NUM_ANCHORS, NUM_CLASSES + 5, H, W]
    reshaped_output = paddle.reshape(output, [-1, num_anchors, num_classes + 5, output.shape[2], output.shape[3]])

    # Extract the predicted values related to objectness from the output
    pred_objectness = reshaped_output[:, :, 4, :, :]
    loss_objectness = F.binary_cross_entropy_with_logits(pred_objectness, label_objectness, reduction="none")

    # pos_samples is 1 only at positive samples and 0 everywhere else
    pos_objectness = label_objectness > 0
    pos_samples = paddle.cast(pos_objectness, 'float32')
    pos_samples.stop_gradient = True

    # Extract all position-related predicted values from the output
    tx = reshaped_output[:, :, 0, :, :]
    ty = reshaped_output[:, :, 1, :, :]
    tw = reshaped_output[:, :, 2, :, :]
    th = reshaped_output[:, :, 3, :, :]

    # Take the label of each position coordinate from label_location
    dx_label = label_location[:, :, 0, :, :]
    dy_label = label_location[:, :, 1, :, :]
    tw_label = label_location[:, :, 2, :, :]
    th_label = label_location[:, :, 3, :, :]

    # Build the loss functions (an L1 loss is assumed for width and height)
    loss_location_x = F.binary_cross_entropy_with_logits(tx, dx_label, reduction="none")
    loss_location_y = F.binary_cross_entropy_with_logits(ty, dy_label, reduction="none")
    loss_location_w = paddle.abs(tw - tw_label)
    loss_location_h = paddle.abs(th - th_label)

    # Calculate the total position loss function
    loss_location = loss_location_x + loss_location_y + loss_location_h + loss_location_w

    # Multiply by scales
    loss_location = loss_location * scales
    # Only the position loss of positive samples is counted
    loss_location = loss_location * pos_samples

    # Extract all values related to the object category from the output
    pred_classification = reshaped_output[:, :, 5:5+num_classes, :, :]

    # Calculate the loss function related to classification
    loss_classification = F.binary_cross_entropy_with_logits(pred_classification, label_classification, reduction="none")

    # Sum over the category dimension (axis 2)
    loss_classification = paddle.sum(loss_classification, axis=2)

    # Only the classification loss of samples with positive objectness is counted
    loss_classification = loss_classification * pos_samples
    total_loss = loss_objectness + loss_location + loss_classification
    # Sum the loss of all prediction boxes
    total_loss = paddle.sum(total_loss, axis=[1, 2, 3])
    # Average over all samples
    total_loss = paddle.mean(total_loss)

    return total_loss

from paddle.nn import Conv2D

# Calculate the label corresponding to the anchor box
label_objectness, label_location, label_classification, scale_location = get_objectness_label(img,
gt_boxes, gt_labels,
iou_threshold = 0.7,
anchors = [116, 90, 156, 198, 373, 326],
num_classes=7, downsample=32)

NUM_ANCHORS = 3
NUM_CLASSES = 7
num_filters=NUM_ANCHORS * (NUM_CLASSES + 5)

backbone = DarkNet53_conv_body()
detection = YoloDetectionBlock(ch_in=1024, ch_out=512)
conv2d_pred = Conv2D(in_channels=1024, out_channels=num_filters,  kernel_size=1)

# Feed the pictures read above through the network
x = paddle.to_tensor(img)
C0, C1, C2 = backbone(x)
route, tip = detection(C0)
P0 = conv2d_pred(tip)
# anchors contains the preset anchor box sizes
anchors = [116, 90, 156, 198, 373, 326]
# downsample is the stride of the feature map P0
pred_boxes = get_yolo_box_xxyy(P0.numpy(), anchors, num_classes=7, downsample=32)
iou_above_thresh_indices = get_iou_above_thresh_inds(pred_boxes, gt_boxes, iou_threshold=0.7)
label_objectness = label_objectness_ignore(label_objectness, iou_above_thresh_indices)

# Convert the NumPy labels to tensors before computing the loss
label_objectness = paddle.to_tensor(label_objectness)
label_location = paddle.to_tensor(label_location)
label_classification = paddle.to_tensor(label_classification)
scales = paddle.to_tensor(scale_location)

total_loss = get_loss(P0, label_objectness, label_location, label_classification, scales,
                      num_anchors=NUM_ANCHORS, num_classes=NUM_CLASSES)
total_loss_data = total_loss.numpy()
print(total_loss_data)



The above program calculates the total loss function. At this point the reader has covered most of the YOLOv3 algorithm: how to generate anchor boxes, label them, extract features with a convolutional neural network, associate the output feature map with the prediction boxes, and build the loss function.

## Multiscale detection

So far the loss function has been computed on the feature map P0, whose stride is 32. This feature map is small, has few pixels, and each pixel has a large receptive field and carries rich high-level semantic information, which makes large targets easier to detect. To detect smaller targets, prediction outputs must be built on larger feature maps. Generating prediction outputs directly on the C2 or C1 feature map raises a new problem: features at those levels have not been fully extracted, and the semantic information in each pixel is not rich enough, so effective feature patterns may be hard to extract. In object detection, the usual solution is to enlarge the high-level feature map and fuse it with the lower-level one. The resulting feature map contains rich semantic information and also has more pixels, so it can describe finer structures.

The specific network implementation method is shown in Figure 19:

Fig. 19: generating the output feature maps P0, P1 and P2 at multiple levels
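
The fusion can be followed through its channel counts. The sketch below assumes the channel sizes printed earlier for C0, C1 and C2 (1024, 512 and 256) and a route that is reduced to half the channels of the detection block output before upsampling; the function name `fused_input_channels` is only illustrative.

```python
def fused_input_channels(level):
    """Input channels of the detection block at level 0 (P0), 1 (P1) or 2 (P2)."""
    ch_out = 512 // (2 ** level)                  # 512, 256, 128
    backbone_channels = ch_out * 2                # C0: 1024, C1: 512, C2: 256
    # From level 1 on, the upsampled route of the previous level is concatenated
    route_channels = 0 if level == 0 else ch_out
    return backbone_channels + route_channels


for level, stride in enumerate((32, 16, 8)):
    grid = 640 // stride                          # feature map side for a 640 x 640 input
    print(level, fused_input_channels(level), grid)
```

Each upsampling step doubles the spatial size (20 to 40 to 80 for a 640 x 640 input), so the enlarged route matches the next backbone feature map and the two can be concatenated along the channel axis.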

YOLOv3 generates three anchor boxes at the center of each area. The anchor box sizes used on the feature maps of the three levels are P2: [(10 × 13), (16 × 30), (33 × 23)]; P1: [(30 × 61), (62 × 45), (59 × 119)]; P0: [(116 × 90), (156 × 198), (373 × 326)]. The deeper feature maps use larger anchor boxes and capture large targets; the shallower feature maps use smaller anchor boxes and capture small targets.
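
The assignment of anchor boxes to levels can be expressed with a flat anchor list and one mask per level, which is also the form used by the training code later in this section. The helper `anchors_for_level` below is a hypothetical name used only for this sketch.

```python
ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]
ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]   # masks for P0, P1, P2


def anchors_for_level(level, anchors=ANCHORS, masks=ANCHOR_MASKS):
    """Return the (width, height) pairs selected by the mask of the given level."""
    return [(anchors[2 * m], anchors[2 * m + 1]) for m in masks[level]]


for level, name in enumerate(("P0", "P1", "P2")):
    print(name, anchors_for_level(level))
```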

Multi-scale detection requires substantial changes to the code above, and the implementation is somewhat cumbersome. It is therefore recommended to use the PaddlePaddle API paddle.vision.ops.yolo_loss directly. Its key parameters are described as follows:

paddle.vision.ops.yolo_loss(x, gt_box, gt_label, anchors, anchor_mask, class_num, ignore_thresh, downsample_ratio, gt_score=None, use_label_smooth=True, name=None, scale_x_y=1.0)

• x: the output feature map.
• gt_box: the real boxes.
• gt_label: the labels of the real boxes.
• ignore_thresh: when the IoU between a prediction box and a real box exceeds ignore_thresh, the prediction box is not treated as a negative sample. It is set to 0.7 in the YOLOv3 model.
• downsample_ratio: the downsampling ratio of the feature map; for P0 it is 32 when the DarkNet53 backbone network is used.
• gt_score: the confidence of the real boxes, used with the mixup technique.
• use_label_smooth: a training technique; set it to False if it is not used.
• name: the name of the layer, for example 'yolov3_loss'. The default value is None and it generally does not need to be set.

The code that generates prediction boxes from the multi-level feature maps is as follows:

# Define the upsampling module
class Upsample(paddle.nn.Layer):
    def __init__(self, scale=2):
        super(Upsample, self).__init__()
        self.scale = scale

    def forward(self, inputs):
        # Enlarge the feature map by self.scale with nearest-neighbor interpolation
        out = paddle.nn.functional.interpolate(
            x=inputs, scale_factor=self.scale, mode="NEAREST")
        return out

class YOLOv3(paddle.nn.Layer):
    def __init__(self, num_classes=7):
        super(YOLOv3, self).__init__()

        self.num_classes = num_classes
        # Backbone code for extracting image features
        self.block = DarkNet53_conv_body()
        self.block_outputs = []
        self.yolo_blocks = []
        self.route_blocks_2 = []
        # Generate the feature maps P0, P1 and P2 of the three levels
        for i in range(3):
            # Add the module that generates ri and ti from ci
            yolo_block = self.add_sublayer(
                "yolo_detecton_block_%d" % (i),
                YoloDetectionBlock(
                    ch_in=512//(2**i)*2 if i==0 else 512//(2**i)*2 + 512//(2**i),
                    ch_out = 512//(2**i)))
            self.yolo_blocks.append(yolo_block)

            num_filters = 3 * (self.num_classes + 5)

            # Add the module that generates pi from ti. This is a Conv2D operation
            # whose number of output channels is 3 * (num_classes + 5)
            block_out = self.add_sublayer(
                "block_out_%d" % (i),
                paddle.nn.Conv2D(in_channels=512//(2**i)*2,  # ti has ch_out * 2 channels
                                 out_channels=num_filters,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0))
            self.block_outputs.append(block_out)
            if i < 2:
                # Convolution of ri
                route = self.add_sublayer(
                    "route2_%d" % (i),
                    ConvBNLayer(ch_in=512//(2**i),
                                ch_out=256//(2**i),
                                kernel_size=1,
                                stride=1,
                                padding=0))
                self.route_blocks_2.append(route)
            # Enlarge ri to keep the same size as c_{i+1}
            self.upsample = Upsample()

    def forward(self, inputs):
        outputs = []
        blocks = self.block(inputs)
        for i, block in enumerate(blocks):
            if i > 0:
                # r_{i-1}, after convolution and upsampling, is concatenated with ci of this level
                block = paddle.concat([route, block], axis=1)
            # Generate ti and ri from ci
            route, tip = self.yolo_blocks[i](block)
            # Generate pi from ti
            block_out = self.block_outputs[i](tip)
            # Put pi in the list
            outputs.append(block_out)

            if i < 2:
                # Convolve ri to adjust the number of channels
                route = self.route_blocks_2[i](route)
                # Enlarge ri so that its size is consistent with c_{i+1}
                route = self.upsample(route)

        return outputs

    def get_loss(self, outputs, gtbox, gtlabel, gtscore=None,
                 anchors = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326],
                 anchor_masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]],
                 ignore_thresh=0.7,
                 use_label_smooth=False):
        """
        Use paddle.vision.ops.yolo_loss to calculate the loss function directly, which is simpler and faster
        """
        self.losses = []
        downsample = 32
        for i, out in enumerate(outputs):  # Calculate the loss function for each of the three levels
            anchor_mask_i = anchor_masks[i]
            loss = paddle.vision.ops.yolo_loss(
                x=out,  # out is one of P0, P1 and P2
                gt_box=gtbox,  # Real box coordinates
                gt_label=gtlabel,  # Real box categories
                gt_score=gtscore,  # Real box scores, needed for the mixup training technique; set to 1 when it is not used, with the same shape as gtlabel
                anchors=anchors,   # Anchor box sizes for all 9 anchor boxes: [w0, h0, w1, h1, ..., w8, h8]
                anchor_mask=anchor_mask_i,  # Mask that selects anchor boxes, e.g. anchor_mask_i=[3, 4, 5] selects anchors 3, 4 and 5 for this level
                class_num=self.num_classes,  # Number of categories
                ignore_thresh=ignore_thresh,  # When the IoU of a prediction box and a real box > ignore_thresh, label objectness = -1
                downsample_ratio=downsample,  # For example, 32 for P0, 16 for P1 and 8 for P2
                use_label_smooth=False)      # The label_smooth training technique is not used here, so this is set to False
            self.losses.append(paddle.mean(loss))  # Average over the pictures in the batch
            downsample = downsample // 2  # The downsampling ratio of the next level is halved
        return sum(self.losses)  # Sum over the levels


### Start end-to-end training

The training process is shown in Figure 20. After feature extraction, the input picture yields output feature maps at three levels: P0 (stride = 32), P1 (stride = 16) and P2 (stride = 8). Small block areas of the corresponding sizes are used to generate anchor boxes and prediction boxes, and the anchor boxes are labelled.

• P0 feature map: uses small blocks of size $32 \times 32$; three anchor boxes of sizes $[116, 90]$, $[156, 198]$ and $[373, 326]$ are generated at the center of each area.

• P1 feature map: uses small blocks of size $16 \times 16$; three anchor boxes of sizes $[30, 61]$, $[62, 45]$ and $[59, 119]$ are generated at the center of each area.

• P2 feature map: uses small blocks of size $8 \times 8$; three anchor boxes of sizes $[10, 13]$, $[16, 30]$ and $[33, 23]$ are generated at the center of each area.

The feature maps of the three levels are associated with the labels of their corresponding anchor boxes to build the loss functions. The total loss is the sum of the loss functions of the three levels, and minimizing it starts the end-to-end training process.

Figure 20: end to end training process
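
To get a feeling for the scale of the problem, the sketch below counts the prediction boxes produced at each level. It assumes a 640 × 640 input, as in the earlier examples; `boxes_per_level` is a hypothetical helper.

```python
def boxes_per_level(input_size, strides=(32, 16, 8), anchors_per_cell=3):
    """Number of prediction boxes on P0, P1 and P2 for a square input picture."""
    return [anchors_per_cell * (input_size // s) ** 2 for s in strides]


counts = boxes_per_level(640)
print(counts, sum(counts))
```

Most prediction boxes come from the finest level P2, which is the level responsible for small targets.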

The specific implementation code of the training process is as follows:

# Warning: be careful when running this code on a local machine; it is resource-heavy and can easily crash

import time
import os
import numpy as np
import paddle

ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]

ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]

IGNORE_THRESH = .7
NUM_CLASSES = 7

def get_lr(base_lr = 0.0001, lr_decay = 0.1):
    bd = [10000, 20000]
    lr = [base_lr, base_lr * lr_decay, base_lr * lr_decay * lr_decay]
    # Piecewise-constant schedule: the learning rate is multiplied by lr_decay at each boundary
    learning_rate = paddle.optimizer.lr.PiecewiseDecay(boundaries=bd, values=lr)
    return learning_rate

if __name__ == '__main__':

    TRAINDIR = '/home/aistudio/work/insects/train'
    TESTDIR = '/home/aistudio/work/insects/test'
    VALIDDIR = '/home/aistudio/work/insects/val'
    train_dataset = TrainDataset(TRAINDIR, mode='train')
    valid_dataset = TrainDataset(VALIDDIR, mode='valid')
    test_dataset = TrainDataset(VALIDDIR, mode='valid')
    # Create data readers with paddle.io.DataLoader, setting batch_size, the number of worker processes num_workers and other parameters
    # (the loader lines were missing from this listing; batch_size=10 and num_workers=0 are assumed values)
    train_loader = paddle.io.DataLoader(train_dataset, batch_size=10, shuffle=True, num_workers=0, drop_last=True)
    valid_loader = paddle.io.DataLoader(valid_dataset, batch_size=10, shuffle=False, num_workers=0, drop_last=False)
    model = YOLOv3(num_classes = NUM_CLASSES)  # Create model
    learning_rate = get_lr()
    opt = paddle.optimizer.Momentum(
        learning_rate=learning_rate,
        momentum=0.9,
        parameters=model.parameters())  # Create optimizer

    MAX_EPOCH = 200
    for epoch in range(MAX_EPOCH):
        for i, data in enumerate(train_loader):
            img, gt_boxes, gt_labels, img_scale = data
            gt_scores = np.ones(gt_labels.shape).astype('float32')
            gt_scores = paddle.to_tensor(gt_scores)
            outputs = model(img)  # Forward propagation, outputs [P0, P1, P2]
            loss = model.get_loss(outputs, gt_boxes, gt_labels, gtscore=gt_scores,
                                  anchors = ANCHORS,
                                  ignore_thresh=IGNORE_THRESH,
                                  use_label_smooth=False)        # Compute the loss function

            loss.backward()    # Back propagation to compute gradients
            opt.step()  # Update parameters
            opt.clear_grad()  # Reset gradients before the next iteration
            if i % 10 == 0:
                timestring = time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(time.time()))
                print('{}[TRAIN]epoch {}, iter {}, output loss: {}'.format(timestring, epoch, i, loss.numpy()))

        # Save the model parameters
        if (epoch % 5 == 0) or (epoch == MAX_EPOCH -1):
            paddle.save(model.state_dict(), 'yolo_epoch{}.pdparams'.format(epoch))

        # Evaluate on the validation set after each epoch
        model.eval()
        for i, data in enumerate(valid_loader):
            img, gt_boxes, gt_labels, img_scale = data
            gt_scores = np.ones(gt_labels.shape).astype('float32')
            gt_scores = paddle.to_tensor(gt_scores)
            outputs = model(img)
            loss = model.get_loss(outputs, gt_boxes, gt_labels, gtscore=gt_scores,
                                  anchors = ANCHORS,
                                  ignore_thresh=IGNORE_THRESH,
                                  use_label_smooth=False)
            if i % 1 == 0:
                timestring = time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(time.time()))
                print('{}[VALID]epoch {}, iter {}, output loss: {}'.format(timestring, epoch, i, loss.numpy()))
        model.train()
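The learning-rate schedule built in get_lr is piecewise constant: with boundaries [10000, 20000], the rate starts at base_lr and is multiplied by lr_decay at each boundary. The following framework-free sketch shows the values it produces. The function name `piecewise_lr` is hypothetical, and it assumes the schedule is advanced once per training iteration and that a step equal to a boundary already uses the decayed value.

```python
def piecewise_lr(step, base_lr=0.0001, lr_decay=0.1, boundaries=(10000, 20000)):
    """Learning rate at a given step of a piecewise-constant schedule (illustrative)."""
    # One value per interval: base_lr, base_lr * lr_decay, base_lr * lr_decay ** 2, ...
    values = [base_lr * lr_decay ** k for k in range(len(boundaries) + 1)]
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]


for step in (0, 9999, 10000, 19999, 20000):
    print(step, piecewise_lr(step))
```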


## Prediction

The prediction process is shown in Figure 21:

Figure 21: prediction process

The prediction process can be divided into two steps:

1. The positions of the prediction boxes and the scores of the categories are calculated from the network output.
2. Non-maximum suppression is used to eliminate prediction boxes with large overlap.

For step 1, we have already described how to calculate pred_objectness_probability, pred_boxes and pred_classification_probability from the network output. It is recommended to use paddle.vision.ops.yolo_box directly. Its key parameters have the following meanings:

paddle.vision.ops.yolo_box(x, img_size, anchors, class_num, conf_thresh, downsample_ratio, clip_bbox=True, name=None, scale_x_y=1.0)

• x: the feature map output by the network, such as P0, P1 or P2 mentioned above.
• img_size: the size of the input picture.
• anchors: the sizes of the anchors used, such as [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326].
• class_num: the number of object categories.
• conf_thresh: the confidence threshold. For prediction boxes whose score is lower than the threshold, the position values are set directly to 0.0 without being calculated.
• downsample_ratio: the downsampling ratio of the feature map; for example, P0 is 32, P1 is 16 and P2 is 8.
• name=None: the name, e.g. 'yolo_box'. It generally does not need to be set, and the default value is None.

The return value includes two items, boxes and scores, where boxes holds the coordinates of all prediction boxes and scores holds the scores of all prediction boxes.

The score of a prediction box is defined as the category probability multiplied by the objectness probability, which indicates whether the prediction box contains a target object, i.e.

$$score = P_{obj} \cdot P_{classification}$$
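As a small numeric illustration of this definition, the NumPy sketch below computes the scores of two prediction boxes over three categories. The logits are made-up values, and the sketch assumes that both the objectness and the category probabilities are obtained with a sigmoid, as is usual for YOLOv3; it is not the internal code of paddle.vision.ops.yolo_box.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical raw network outputs for two prediction boxes and three categories
objectness_logits = np.array([2.0, -1.0])
class_logits = np.array([[0.0, 3.0, -2.0],
                         [1.0, 1.0, 1.0]])

p_obj = sigmoid(objectness_logits)        # shape [2]
p_class = sigmoid(class_logits)           # shape [2, 3]

# score = P_obj * P_classification, broadcast over the category axis
scores = p_obj[:, np.newaxis] * p_class   # shape [2, 3]
print(scores)
```

Because each score is a product of two probabilities, it can never exceed the objectness probability of its box, so a box that is unlikely to contain an object scores low in every category.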

Add a function get_pred to the YOLOv3 class defined above. It calls paddle.vision.ops.yolo_box to obtain the prediction boxes and scores corresponding to the feature maps P0, P1 and P2, and concatenates them to obtain all prediction boxes together with their scores for each category.

# Define the YOLOv3 model
class YOLOv3(paddle.nn.Layer):
    def __init__(self, num_classes=7):
        super(YOLOv3,self).__init__()

        self.num_classes = num_classes
        # Backbone for extracting image features
        self.block = DarkNet53_conv_body()
        self.block_outputs = []
        self.yolo_blocks = []
        self.route_blocks_2 = []
        # Generate the feature maps P0, P1 and P2 of the three levels
        for i in range(3):
            # Add the module that generates ri and ti from ci
            yolo_block = self.add_sublayer(
                "yolo_detecton_block_%d" % (i),
                YoloDetectionBlock(
                    ch_in=512//(2**i)*2 if i==0 else 512//(2**i)*2 + 512//(2**i),
                    ch_out = 512//(2**i)))
            self.yolo_blocks.append(yolo_block)

            num_filters = 3 * (self.num_classes + 5)

            # Add the module that generates pi from ti. This is a Conv2D operation with 3 * (num_classes + 5) output channels
            # (in_channels assumes that ti has 2 * ch_out channels, as produced by YoloDetectionBlock)
            block_out = self.add_sublayer(
                "block_out_%d" % (i),
                paddle.nn.Conv2D(in_channels=512//(2**i)*2,
                                 out_channels=num_filters,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0))
            self.block_outputs.append(block_out)
            if i < 2:
                # Convolution applied to ri
                route = self.add_sublayer("route2_%d" % (i),
                                          ConvBNLayer(ch_in=512//(2**i),
                                                      ch_out=256//(2**i),
                                                      kernel_size=1,
                                                      stride=1,
                                                      padding=0))
                self.route_blocks_2.append(route)
            # Upsample ri so that it has the same size as c_{i+1}
            self.upsample = Upsample()

    def forward(self, inputs):
        outputs = []
        blocks = self.block(inputs)
        for i, block in enumerate(blocks):
            if i > 0:
                # r_{i-1}, after convolution and upsampling, is concatenated with ci of this level
                block = paddle.concat([route, block], axis=1)
            # Generate ti and ri from ci
            route, tip = self.yolo_blocks[i](block)
            # Generate pi from ti
            block_out = self.block_outputs[i](tip)
            # Put pi in the list
            outputs.append(block_out)

            if i < 2:
                # Convolve ri to adjust the number of channels
                route = self.route_blocks_2[i](route)
                # Upsample ri so that its size matches c_{i+1}
                route = self.upsample(route)

        return outputs

    def get_loss(self, outputs, gtbox, gtlabel, gtscore=None,
                 anchors = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326],
                 anchor_masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]],
                 ignore_thresh=0.7,
                 use_label_smooth=False):
        """
        Use paddle.vision.ops.yolo_loss to compute the loss function directly, which is simpler and faster
        """
        self.losses = []
        downsample = 32
        for i, out in enumerate(outputs): # Compute the loss function for each of the three levels
            anchor_mask_i = anchor_masks[i]
            loss = paddle.vision.ops.yolo_loss(
                x=out,  # out is one of P0, P1 and P2
                gt_box=gtbox,  # Ground-truth box coordinates
                gt_label=gtlabel,  # Ground-truth box categories
                gt_score=gtscore,  # Ground-truth box scores, needed for the mixup training trick; otherwise all set to 1, with the same shape as gtlabel
                anchors=anchors,   # Anchor box sizes, containing the sizes of 9 anchor boxes [w0, h0, w1, h1, ..., w8, h8]
                anchor_mask=anchor_mask_i, # Mask that selects the anchor boxes, e.g. anchor_mask_i=[3, 4, 5] selects anchor boxes 3, 4 and 5 for this level
                class_num=self.num_classes, # Number of categories
                ignore_thresh=ignore_thresh, # When the IoU of a prediction box and a ground-truth box exceeds ignore_thresh, its objectness label is set to -1
                downsample_ratio=downsample, # For example, P0 is 32, P1 is 16 and P2 is 8
                use_label_smooth=False)      # The label_smooth training trick is not used here, so it is set to False
            self.losses.append(paddle.mean(loss))  # Average over the batch
            downsample = downsample // 2 # The downsampling ratio of the next level is halved
        return sum(self.losses) # Sum over the levels

    def get_pred(self,
                 outputs,
                 im_shape=None,
                 anchors = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326],
                 anchor_masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]],
                 valid_thresh = 0.01):
        downsample = 32
        total_boxes = []
        total_scores = []
        for i, out in enumerate(outputs):
            # Collect the widths and heights of the anchors selected for this level
            anchor_mask = anchor_masks[i]
            anchors_this_level = []
            for m in anchor_mask:
                anchors_this_level.append(anchors[2 * m])
                anchors_this_level.append(anchors[2 * m + 1])

            boxes, scores = paddle.vision.ops.yolo_box(
                x=out,
                img_size=im_shape,
                anchors=anchors_this_level,
                class_num=self.num_classes,
                conf_thresh=valid_thresh,
                downsample_ratio=downsample,
                name="yolo_box" + str(i))
            total_boxes.append(boxes)
            # Transpose scores from [N, M, class_num] to [N, class_num, M]
            total_scores.append(
                paddle.transpose(
                    scores, perm=[0, 2, 1]))
            downsample = downsample // 2

        # Concatenate the predictions of the three levels along the box dimension
        yolo_boxes = paddle.concat(total_boxes, axis=1)
        yolo_scores = paddle.concat(total_scores, axis=2)
        return yolo_boxes, yolo_scores
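To get a feeling for how many candidates get_pred returns, the following sketch counts the prediction boxes per level and the number of output channels of each pi. It is plain arithmetic, not part of the model; the function name `num_predictions` is hypothetical, and the 608 × 608 input size is an assumed example value.

```python
def num_predictions(input_size, num_classes=7, strides=(32, 16, 8), anchors_per_level=3):
    """Count prediction boxes per level and output channels (illustrative)."""
    # Each level has (input_size // stride)^2 small regions with 3 anchor boxes each
    per_level = [anchors_per_level * (input_size // s) ** 2 for s in strides]
    # Each anchor box predicts 4 position values, 1 objectness value and num_classes class values
    channels = anchors_per_level * (num_classes + 5)
    return per_level, sum(per_level), channels


per_level, total, channels = num_predictions(608)
print(per_level, total, channels)
```

Under these assumptions the three levels together produce more than twenty thousand candidate boxes per picture, which is why the low-score filter (valid_thresh) and non-maximum suppression are needed afterwards.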


The calculation in step 1 generates multiple prediction boxes for each small square region, and many of these prediction boxes overlap heavily, so the redundant boxes with large overlap need to be eliminated.

The prediction boxes in the following example code were output by the model for one picture. A total of 11 prediction boxes are selected here and drawn on the picture, as shown below. There are several prediction boxes around each person, and the redundant ones must be eliminated to obtain the final prediction results.

# Draw the bounding boxes of the target objects on the picture
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import math

# Define the function for drawing rectangular boxes
def draw_rectangle(currentAxis, bbox, edgecolor = 'k', facecolor = 'y', fill=False, linestyle='-'):
    # currentAxis: coordinate axes, obtained through plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border line color
    # facecolor: fill color
    # fill: whether to fill the box
    # linestyle: border line style
    # patches.Rectangle takes the coordinates of the upper-left corner and the width and height of the rectangle
    rect=patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                           edgecolor=edgecolor,facecolor=facecolor,fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)  # Attach the rectangle to the axes so that it is actually drawn

plt.figure(figsize=(10, 10))

filename = '/home/aistudio/work/images/section3/000000086956.jpg'
im = plt.imread(filename)  # Load the picture before displaying it
plt.imshow(im)

currentAxis=plt.gca()

# Prediction box position
boxes = np.array([[4.21716537e+01, 1.28230896e+02, 2.26547668e+02, 6.00434631e+02],
[3.18562988e+02, 1.23168472e+02, 4.79000000e+02, 6.05688416e+02],
[2.62704697e+01, 1.39430557e+02, 2.20587097e+02, 6.38959656e+02],
[4.24965363e+01, 1.42706665e+02, 2.25955185e+02, 6.35671204e+02],
[2.37462646e+02, 1.35731537e+02, 4.79000000e+02, 6.31451294e+02],
[3.19390472e+02, 1.29295090e+02, 4.79000000e+02, 6.33003845e+02],
[3.28933838e+02, 1.22736115e+02, 4.79000000e+02, 6.39000000e+02],
[4.44292603e+01, 1.70438187e+02, 2.26841858e+02, 6.39000000e+02],
[2.17988785e+02, 3.02472412e+02, 4.06062927e+02, 6.29106628e+02],
[2.00241089e+02, 3.23755096e+02, 3.96929321e+02, 6.36386108e+02],
[2.14310303e+02, 3.23443665e+02, 4.06732849e+02, 6.35775269e+02]])

# Prediction box score
scores = np.array([0.5247661 , 0.51759845, 0.86075854, 0.9910175 , 0.39170712,
0.9297706 , 0.5115228 , 0.270992  , 0.19087596, 0.64201415, 0.879036])

# Draw all prediction boxes
for box in boxes:
    draw_rectangle(currentAxis, box)



Here, non-maximum suppression (NMS) is used to eliminate redundant boxes. The basic idea is that if several prediction boxes correspond to the same object, only the one with the highest score is kept and the rest are discarded.

How can we judge that two prediction boxes correspond to the same object? What criterion should be used?

If two prediction boxes have the same category and their positions overlap heavily, they can be considered to predict the same target. Non-maximum suppression selects the prediction box with the highest score in a category, then checks which other prediction boxes have an IoU with it greater than a threshold, and discards those boxes. The IoU threshold is a hyperparameter that must be set in advance; it is set to 0.5 in the YOLOv3 model.

For example, in the program above there are 11 prediction boxes in boxes, and scores gives their scores for the category "person".

• Step 0: create the list of selected boxes, keep_list = [].
• Step 1: sort the scores in descending order, remain_list = [3, 5, 10, 2, 9, 0, 1, 6, 4, 7, 8].
• Step 2: select boxes[3]. keep_list is empty, so no IoU needs to be calculated and the box is put directly into keep_list. keep_list = [3], remain_list = [5, 10, 2, 9, 0, 1, 6, 4, 7, 8].
• Step 3: select boxes[5]. keep_list already contains boxes[3]. IoU(boxes[3], boxes[5]) = 0.0, which is clearly less than the threshold, so keep_list = [3, 5], remain_list = [10, 2, 9, 0, 1, 6, 4, 7, 8].
• Step 4: select boxes[10]. keep_list = [3, 5]. IoU(boxes[3], boxes[10]) = 0.0268 and IoU(boxes[5], boxes[10]) = 0.24, both less than the threshold, so keep_list = [3, 5, 10], remain_list = [2, 9, 0, 1, 6, 4, 7, 8].
• Step 5: select boxes[2]. keep_list = [3, 5, 10]. IoU(boxes[3], boxes[2]) = 0.88 exceeds the threshold, so boxes[2] is discarded directly. keep_list = [3, 5, 10], remain_list = [9, 0, 1, 6, 4, 7, 8].
• Step 6: select boxes[9]. keep_list = [3, 5, 10]. IoU(boxes[3], boxes[9]) = 0.0577, IoU(boxes[5], boxes[9]) = 0.205 and IoU(boxes[10], boxes[9]) = 0.88. The last value exceeds the threshold, so boxes[9] is discarded. keep_list = [3, 5, 10], remain_list = [0, 1, 6, 4, 7, 8].
• Step 7: repeat step 6 until remain_list is empty.

The final result is keep_list = [3, 5, 10]; that is, prediction boxes 3, 5 and 10 are selected, as shown in the figure below.
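The walkthrough above can be reproduced numerically with a short NumPy sketch. The helper names `iou_xyxy` and `greedy_nms` are hypothetical, the box coordinates are the 11 boxes of the example rounded to two decimals, and the IoU uses the same +1 pixel convention as the rest of this tutorial. It is a minimal single-class illustration, not the full implementation given later.

```python
import numpy as np


def iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes, with the +1 pixel convention."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]) + 1.0, 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]) + 1.0, 0.0)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0] + 1.0) * (a[3] - a[1] + 1.0)
    area_b = (b[2] - b[0] + 1.0) * (b[3] - b[1] + 1.0)
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Keep a box only if its IoU with every already kept box is at most iou_thresh."""
    remain_list = [int(i) for i in np.argsort(scores)[::-1]]  # highest score first
    keep_list = []
    for idx in remain_list:
        if all(iou_xyxy(boxes[idx], boxes[k]) <= iou_thresh for k in keep_list):
            keep_list.append(idx)
    return keep_list


boxes = np.array([[42.17, 128.23, 226.55, 600.43],
                  [318.56, 123.17, 479.00, 605.69],
                  [26.27, 139.43, 220.59, 638.96],
                  [42.50, 142.71, 225.96, 635.67],
                  [237.46, 135.73, 479.00, 631.45],
                  [319.39, 129.30, 479.00, 633.00],
                  [328.93, 122.74, 479.00, 639.00],
                  [44.43, 170.44, 226.84, 639.00],
                  [217.99, 302.47, 406.06, 629.11],
                  [200.24, 323.76, 396.93, 636.39],
                  [214.31, 323.44, 406.73, 635.78]])
scores = np.array([0.5247661, 0.51759845, 0.86075854, 0.9910175, 0.39170712,
                   0.9297706, 0.5115228, 0.270992, 0.19087596, 0.64201415, 0.879036])

keep_list = greedy_nms(boxes, scores, iou_thresh=0.5)
print(keep_list)
```

With the threshold of 0.5 used in the walkthrough, the sketch keeps boxes 3, 5 and 10.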

# Draw the bounding boxes of the target objects on the picture
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import math

# Define the function for drawing rectangular boxes
def draw_rectangle(currentAxis, bbox, edgecolor = 'k', facecolor = 'y', fill=False, linestyle='-'):
    # currentAxis: coordinate axes, obtained through plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border line color
    # facecolor: fill color
    # fill: whether to fill the box
    # linestyle: border line style
    # patches.Rectangle takes the coordinates of the upper-left corner and the width and height of the rectangle
    rect=patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                           edgecolor=edgecolor,facecolor=facecolor,fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)  # Attach the rectangle to the axes so that it is actually drawn

plt.figure(figsize=(10, 10))

filename = '/home/aistudio/work/images/section3/000000086956.jpg'
im = plt.imread(filename)  # Load the picture before displaying it
plt.imshow(im)

currentAxis=plt.gca()

boxes = np.array([[4.21716537e+01, 1.28230896e+02, 2.26547668e+02, 6.00434631e+02],
[3.18562988e+02, 1.23168472e+02, 4.79000000e+02, 6.05688416e+02],
[2.62704697e+01, 1.39430557e+02, 2.20587097e+02, 6.38959656e+02],
[4.24965363e+01, 1.42706665e+02, 2.25955185e+02, 6.35671204e+02],
[2.37462646e+02, 1.35731537e+02, 4.79000000e+02, 6.31451294e+02],
[3.19390472e+02, 1.29295090e+02, 4.79000000e+02, 6.33003845e+02],
[3.28933838e+02, 1.22736115e+02, 4.79000000e+02, 6.39000000e+02],
[4.44292603e+01, 1.70438187e+02, 2.26841858e+02, 6.39000000e+02],
[2.17988785e+02, 3.02472412e+02, 4.06062927e+02, 6.29106628e+02],
[2.00241089e+02, 3.23755096e+02, 3.96929321e+02, 6.36386108e+02],
[2.14310303e+02, 3.23443665e+02, 4.06732849e+02, 6.35775269e+02]])

scores = np.array([0.5247661 , 0.51759845, 0.86075854, 0.9910175 , 0.39170712,
0.9297706 , 0.5115228 , 0.270992  , 0.19087596, 0.64201415, 0.879036])

left_ind = np.where((boxes[:, 0]<60) * (boxes[:, 0]>20))
left_boxes = boxes[left_ind]
left_scores = scores[left_ind]

colors = ['r', 'g', 'b', 'k']

# Draw the prediction boxes that are finally kept
inds = [3, 5, 10]
for i in range(3):
    box = boxes[inds[i]]
    draw_rectangle(currentAxis, box, edgecolor=colors[i])



The implementation of non-maximum suppression is shown in the nms function below. Note that the dataset contains objects of several categories, so multi-class non-maximum suppression is required here. Its principle is the same as that of plain non-maximum suppression; the difference is that the suppression is performed separately for each category. The implementation is shown in multiclass_nms below.

# Non-maximum suppression
def nms(bboxes, scores, score_thresh, nms_thresh, pre_nms_topk, i=0, c=0):
    """
    Single-class NMS; returns the indices of the boxes to keep
    """
    inds = np.argsort(scores)
    inds = inds[::-1]
    keep_inds = []
    while(len(inds) > 0):
        cur_ind = inds[0]
        cur_score = scores[cur_ind]
        # If the score of the box is less than score_thresh, stop: all remaining boxes score lower
        if cur_score < score_thresh:
            break

        keep = True
        for ind in keep_inds:
            current_box = bboxes[cur_ind]
            remain_box = bboxes[ind]
            iou = box_iou_xyxy(current_box, remain_box)
            if iou > nms_thresh:
                keep = False
                break
        if keep:
            keep_inds.append(cur_ind)
        inds = inds[1:]

    return np.array(keep_inds)

# Multi-class non-maximum suppression
def multiclass_nms(bboxes, scores, score_thresh=0.01, nms_thresh=0.45, pre_nms_topk=1000, pos_nms_topk=100):
    """
    Apply NMS separately to each category of each picture in the batch
    """
    batch_size = bboxes.shape[0]
    class_num = scores.shape[1]
    rets = []
    for i in range(batch_size):
        bboxes_i = bboxes[i]
        scores_i = scores[i]
        ret = []
        for c in range(class_num):
            scores_i_c = scores_i[c]
            keep_inds = nms(bboxes_i, scores_i_c, score_thresh, nms_thresh, pre_nms_topk, i=i, c=c)
            if len(keep_inds) < 1:
                continue
            keep_bboxes = bboxes_i[keep_inds]
            keep_scores = scores_i_c[keep_inds]
            # Each row is [label, score, x1, y1, x2, y2]
            keep_results = np.zeros([keep_scores.shape[0], 6])
            keep_results[:, 0] = c
            keep_results[:, 1] = keep_scores[:]
            keep_results[:, 2:6] = keep_bboxes[:, :]
            ret.append(keep_results)
        if len(ret) < 1:
            # No box kept for this picture: append an empty array so that every entry has the same type
            rets.append(np.zeros((0, 6)))
            continue
        ret_i = np.concatenate(ret, axis=0)
        scores_i = ret_i[:, 1]
        # Keep at most pos_nms_topk boxes per picture, ranked by score
        if len(scores_i) > pos_nms_topk:
            inds = np.argsort(scores_i)[::-1]
            inds = inds[:pos_nms_topk]
            ret_i = ret_i[inds]

        rets.append(ret_i)

    return rets


The following is the complete test program. The output on the test dataset is saved in the pred_results.json file.

# Compute IoU for rectangular boxes in xyxy coordinate form; this function is saved in the box_utils.py file
def box_iou_xyxy(box1, box2):
    # Get the coordinates of the upper-left and lower-right corners of box1
    x1min, y1min, x1max, y1max = box1[0], box1[1], box1[2], box1[3]
    # Calculate the area of box1
    s1 = (y1max - y1min + 1.) * (x1max - x1min + 1.)
    # Get the coordinates of the upper-left and lower-right corners of box2
    x2min, y2min, x2max, y2max = box2[0], box2[1], box2[2], box2[3]
    # Calculate the area of box2
    s2 = (y2max - y2min + 1.) * (x2max - x2min + 1.)

    # Calculate the coordinates of the intersection rectangle
    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    # Calculate the height, width and area of the intersection rectangle
    inter_h = np.maximum(ymax - ymin + 1., 0.)
    inter_w = np.maximum(xmax - xmin + 1., 0.)
    intersection = inter_h * inter_w
    # Calculate the area of the union
    union = s1 + s2 - intersection
    # Calculate the intersection over union
    iou = intersection / union
    return iou

import json
import os
ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]
ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
VALID_THRESH = 0.01
NMS_TOPK = 400
NMS_POSK = 100
NMS_THRESH = 0.45
NUM_CLASSES = 7

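if __name__ == '__main__':
    TRAINDIR = '/home/aistudio/work/insects/train/images'
    TESTDIR = '/home/aistudio/work/insects/test/images'
    VALIDDIR = '/home/aistudio/work/insects/val'

    model = YOLOv3(num_classes=NUM_CLASSES)
    params_file_path = '/home/aistudio/yolo_epoch50.pdparams'
    # Load the trained weights into the model
    model_state_dict = paddle.load(params_file_path)
    model.set_state_dict(model_state_dict)
    model.eval()

    total_results = []
    # test_data_loader is assumed to be the test-set reader defined earlier in this tutorial
    test_loader = test_data_loader(TESTDIR, batch_size=1, mode='test')
    for i, data in enumerate(test_loader()):
        img_name, img_data, img_scale_data = data
        img = paddle.to_tensor(img_data)
        img_scale = paddle.to_tensor(img_scale_data)

        outputs = model.forward(img)
        bboxes, scores = model.get_pred(outputs,
                                        im_shape=img_scale,
                                        anchors=ANCHORS,
                                        valid_thresh = VALID_THRESH)

        bboxes_data = bboxes.numpy()
        scores_data = scores.numpy()
        result = multiclass_nms(bboxes_data, scores_data,
                                score_thresh=VALID_THRESH,
                                nms_thresh=NMS_THRESH,
                                pre_nms_topk=NMS_TOPK,
                                pos_nms_topk=NMS_POSK)
        for j in range(len(result)):
            result_j = result[j]
            img_name_j = img_name[j]
            # np.asarray also covers the case where no box was kept for a picture
            total_results.append([img_name_j, np.asarray(result_j).tolist()])
        print('processed {} pictures'.format(len(total_results)))

    print('')
    # Write all prediction results to the json file and close it properly
    with open('pred_results.json', 'w') as f:
        json.dump(total_results, f)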



The test results are saved in the json file, which is a list containing all picture prediction results. Its composition is as follows:

[[img_name, [[label, score, x1, y1, x2, y2], ..., [label, score, x1, y1, x2, y2]]],
[img_name, [[label, score, x1, y1, x2, y2], ..., [label, score, x1, y1, x2, y2]]],
...
[img_name, [[label, score, x1, y1, x2, y2],..., [label, score, x1, y1, x2, y2]]]]


Each element in the list is the prediction result of a picture. The total length of the list is equal to the number of pictures. The format of the prediction result of each picture is:

 [img_name, [[label, score, x1, y1, x2, y2],..., [label, score, x1, y1, x2, y2]]]


The first element is the picture name img_name, and the second element is a list containing all prediction boxes of the picture. The prediction box list has the form:

 [[label, score, x1, y1, x2, y2],..., [label, score, x1, y1, x2, y2]]


Each element [label, score, x1, y1, x2, y2] in the prediction box list describes one prediction box. label is the category label of the prediction box and score is its score; x1, y1, x2 and y2 are the coordinates of the upper-left corner (x1, y1) and the lower-right corner (x2, y2) of the prediction box. Each picture may have many prediction boxes, and they are all put in the prediction box list.
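As an illustration of this layout, the sketch below builds a tiny result list in the same format, round-trips it through json, and filters the prediction boxes by score. The picture names and numbers are made-up sample values, and the helper `boxes_above` is hypothetical; reading the real pred_results.json would work the same way with json.load.

```python
import json

# Hypothetical sample in the format described above (values are made up for illustration)
sample = [
    ['2599', [[0, 0.92, 10.0, 20.0, 110.0, 220.0],
              [3, 0.15, 5.0, 5.0, 50.0, 60.0]]],
    ['2600', []],
]

# Round trip through json, as the test program does when writing the file
loaded = json.loads(json.dumps(sample))


def boxes_above(results, thresh):
    """Return {img_name: [prediction boxes whose score exceeds thresh]}."""
    return {name: [p for p in preds if p[1] > thresh] for name, preds in results}


kept = boxes_above(loaded, 0.5)
for name, preds in kept.items():
    for label, score, x1, y1, x2, y2 in preds:
        print(name, int(label), score, (x1, y1), (x2, y2))
```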

### Model effect and visual display

The program above shows how to read the pictures of the test dataset and save the final results in a json file. To show the effect of the model more intuitively, the following program reads a single picture and draws its prediction boxes.

1. Create a data reader to read data from a single picture
# Read a single test picture
import os
import cv2
import numpy as np

# The default input size of 608 is an assumed value; adjust it to the size used for training
def single_image_data_loader(filename, test_image_size=608, mode='test'):
    """
    Load the test picture; test data has no ground-truth labels
    """
    batch_size= 1
    def reader():
        batch_data = []
        img_size = test_image_size
        file_path = os.path.join(filename)
        img = cv2.imread(file_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        H = img.shape[0]
        W = img.shape[1]
        img = cv2.resize(img, (img_size, img_size))

        # Normalize with the ImageNet mean and standard deviation
        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]
        mean = np.array(mean).reshape((1, 1, -1))
        std = np.array(std).reshape((1, 1, -1))
        out_img = (img / 255.0 - mean) / std
        # Convert from HWC to CHW layout
        out_img = out_img.astype('float32').transpose((2, 0, 1))
        img = out_img
        im_shape = [H, W]

        # Use the file name without extension as the picture name
        image_name = os.path.basename(filename)
        batch_data.append((image_name.split('.')[0], img, im_shape))
        if len(batch_data) == batch_size:
            yield make_test_array(batch_data)
            batch_data = []

    return reader


2. Define the drawing function for drawing the prediction boxes. The code is as follows.
# Define drawing function
INSECT_NAMES = ['Boerner', 'Leconte', 'Linnaeus',
                'acuminatus', 'armandi', 'coleoptera', 'linnaeus']

# Define the function for drawing rectangular boxes
def draw_rectangle(currentAxis, bbox, edgecolor = 'k', facecolor = 'y', fill=False, linestyle='-'):
    # currentAxis: coordinate axes, obtained through plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border line color
    # facecolor: fill color
    # fill: whether to fill the box
    # linestyle: border line style
    # patches.Rectangle takes the coordinates of the upper-left corner and the width and height of the rectangle
    rect=patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                           edgecolor=edgecolor,facecolor=facecolor,fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)  # Attach the rectangle to the axes so that it is actually drawn

# Define the function that plots the prediction results
def draw_results(result, filename, draw_thresh=0.5):
    plt.figure(figsize=(10, 10))
    im = plt.imread(filename)  # Load the picture before displaying it
    plt.imshow(im)
    currentAxis=plt.gca()
    colors = ['r', 'g', 'b', 'k', 'y', 'c', 'purple']
    for item in result:
        box = item[2:6]
        label = int(item[0])
        name = INSECT_NAMES[label]
        # Only draw boxes whose score exceeds draw_thresh
        if item[1] > draw_thresh:
            draw_rectangle(currentAxis, box, edgecolor = colors[label])
            plt.text(box[0], box[1], name, fontsize=12, color=colors[label])

3. Use the single_image_data_loader function defined above to read the specified picture, feed it to the network to calculate the prediction boxes and scores, and then apply multi-class non-maximum suppression to eliminate redundant boxes. Finally, draw and display the results.
import json

ANCHORS = [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]
ANCHOR_MASKS = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
VALID_THRESH = 0.01
NMS_TOPK = 400
NMS_POSK = 100
NMS_THRESH = 0.45

NUM_CLASSES = 7
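if __name__ == '__main__':
    image_name = '/home/aistudio/work/insects/test/images/2599.jpeg'
    params_file_path = '/home/aistudio/yolo_epoch50.pdparams'

    model = YOLOv3(num_classes=NUM_CLASSES)
    # Load the trained weights into the model
    model_state_dict = paddle.load(params_file_path)
    model.set_state_dict(model_state_dict)
    model.eval()

    total_results = []
    test_loader = single_image_data_loader(image_name, mode='test')
    for i, data in enumerate(test_loader()):
        img_name, img_data, img_scale_data = data
        img = paddle.to_tensor(img_data)
        img_scale = paddle.to_tensor(img_scale_data)

        outputs = model.forward(img)
        bboxes, scores = model.get_pred(outputs,
                                        im_shape=img_scale,
                                        anchors=ANCHORS,
                                        valid_thresh = VALID_THRESH)

        bboxes_data = bboxes.numpy()
        scores_data = scores.numpy()
        results = multiclass_nms(bboxes_data, scores_data,
                                 score_thresh=VALID_THRESH,
                                 nms_thresh=NMS_THRESH,
                                 pre_nms_topk=NMS_TOPK,
                                 pos_nms_topk=NMS_POSK)

        # Only one picture is in the batch, so take the first result
        result = results[0]
        draw_results(result, image_name, draw_thresh=0.5)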


The program above shows how to use the trained weights to predict on a picture and visualize the results. On the final output image, each insect is detected, and its bounding box and category are marked.

## Summary

This chapter systematically introduces the network structures and the development history of computer vision, and uses two tasks, image classification and target detection, as examples to show the implementation of the ResNet and YOLOv3 algorithms. Readers are expected not only to master how to build computer vision models, but also to gain a deeper understanding of how visual features are extracted.

Tags: Python

Posted on Tue, 02 Nov 2021 07:58:01 -0400 by tipjones