Deep learning series notes 08 computer vision

In recent years, deep learning has been a revolutionary force in improving the performance of computer vision systems. Whether in medical diagnosis, autonomous vehicles, smart filters, or camera surveillance, many computer vision applications are closely tied to our current and future lives. It is fair to say that the most advanced computer vision applications are almost inseparable from deep learning.

We study various convolutional neural networks commonly used in computer vision and apply them to simple image classification tasks. We will introduce two methods that can improve model generalization, namely image augmentation and fine tuning, and apply them to image classification.

Because deep neural networks can effectively represent images at multiple levels, this hierarchical representation has been used successfully in various computer vision tasks, such as object detection, semantic segmentation, and style transfer. Following the key idea of hierarchical representation in computer vision, we will start with the main components and techniques of object detection, then show how to use fully convolutional networks for semantic segmentation, and then explain how to use style transfer to generate images like the cover of this book.

1 image augmentation

We mentioned that large data sets are a prerequisite for the successful application of deep neural networks. Image augmentation generates similar but different training samples after a series of random changes to the training image, so as to expand the scale of the training set.
For example, we can crop the image in different ways to make the objects of interest appear in different locations, so as to reduce the dependence of the model on the location of the objects. We can also adjust the brightness, color and other factors to reduce the color sensitivity of the model.

1.1 common image augmentation methods

import torch
import torchvision
from torch import nn
import matplotlib.pyplot as plt  # plt is used to display pictures
import matplotlib.image as mpimg  # mpimg is used to read pictures
from d2l import torch as d2l

img = d2l.Image.open('cat1.jpg')  # PIL image, as expected by the torchvision transforms below
# mpimg.imread('cat1.jpg') also reads the file, but returns a NumPy array rather than a PIL image

1.1.1 flipping and cropping

# Helper that applies an augmentation several times so its effect is easy to observe
def apply(img, aug, num_rows=2, num_cols=4, scale=1.5):
    Y = [aug(img) for _ in range(num_rows * num_cols)]
    d2l.show_images(Y, num_rows, num_cols, scale=scale)

# Flip
apply(img, torchvision.transforms.RandomHorizontalFlip())
apply(img, torchvision.transforms.RandomVerticalFlip())


We randomly crop a region whose area is 10% to 100% of the original area and whose aspect ratio is randomly selected between 0.5 and 2. The width and height of the region are then both scaled to 200 pixels. Here, a random number between a and b means a continuous value sampled uniformly from the interval [a, b].

# Crop
shape_aug = torchvision.transforms.RandomResizedCrop(
    (200, 200), scale=(0.1, 1), ratio=(0.5, 2))
apply(img, shape_aug)
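
To make the sampling step concrete, here is a minimal pure-Python sketch of how one crop window could be drawn under the description above. The helper name sample_crop, the retry loop, and the fallback are assumptions made for illustration; torchvision's own implementation differs in its details.

```python
import math
import random

def sample_crop(img_w, img_h, scale=(0.1, 1.0), ratio=(0.5, 2.0), rng=random):
    """Return (left, top, crop_w, crop_h) for one random crop (hypothetical helper)."""
    for _ in range(10):  # retry a few times if the sampled box does not fit
        area = img_w * img_h * rng.uniform(*scale)  # 10% to 100% of the image area
        r = rng.uniform(*ratio)                     # width / height between 0.5 and 2
        crop_w = int(round(math.sqrt(area * r)))
        crop_h = int(round(math.sqrt(area / r)))
        if 0 < crop_w <= img_w and 0 < crop_h <= img_h:
            left = rng.randint(0, img_w - crop_w)
            top = rng.randint(0, img_h - crop_h)
            return left, top, crop_w, crop_h
    return 0, 0, img_w, img_h  # fallback: keep the whole image

left, top, crop_w, crop_h = sample_crop(500, 400)
print(left, top, crop_w, crop_h)
```

The cropped window would then be resized to 200 × 200 pixels, which is why crops with very different aspect ratios look stretched in the displayed grid.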

1.1.2 change color

Another augmentation method is to change the color. We can change four aspects of image color: brightness, contrast, saturation, and hue. A single ColorJitter instance can randomly change all four at the same time. In the example below, only the brightness is changed, to a random value between 50% (1 − 0.5) and 150% (1 + 0.5) of the original image.

# Change color
apply(img, torchvision.transforms.ColorJitter(
    brightness=0.5, contrast=0, saturation=0, hue=0))
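
As a rough illustration of what brightness=0.5 means, the sketch below scales a few toy pixel values by one random factor drawn from [0.5, 1.5]. The helper name jitter_brightness and the toy values are assumptions; this is a simplified model of the idea, not torchvision's code.

```python
import random

def jitter_brightness(pixels, brightness=0.5, rng=random):
    """Scale every pixel by one random factor in [1 - brightness, 1 + brightness]."""
    factor = rng.uniform(1 - brightness, 1 + brightness)
    # Clamp to the valid range so that bright pixels saturate at 1.0
    return [min(1.0, max(0.0, p * factor)) for p in pixels], factor

pixels = [0.0, 0.2, 0.5, 0.9]  # toy grayscale values in [0, 1]
jittered, factor = jitter_brightness(pixels)
print(factor, jittered)
```

Because one factor is drawn per image, the whole image becomes uniformly darker or brighter rather than noisy.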





1.1.3 combining multiple image augmentation methods

shape_aug = torchvision.transforms.RandomResizedCrop(
    (200, 200), scale=(0.1, 1), ratio=(0.5, 2))
color_aug = torchvision.transforms.ColorJitter(
    brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5)

augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(), color_aug, shape_aug])
apply(img, augs)

1.2 training with image augmentation

Let's train a model with image augmentation. Here we use the CIFAR-10 dataset instead of the Fashion-MNIST dataset used before, because the position and size of objects in Fashion-MNIST have been normalized, while the color and size of objects in CIFAR-10 vary much more noticeably.

all_images = torchvision.datasets.CIFAR10(train=True, root='data',
                                          download=True)
d2l.show_images([all_images[i][0] for i in range(32)], 4, 8, scale=0.8)

To obtain deterministic results during prediction, we usually apply image augmentation only to training examples and do not use random augmentation during prediction. Here we use only the simplest random left-right flip. In addition, we use a ToTensor instance to convert each image into the format required by the deep learning framework, that is, 32-bit floating-point numbers with values between 0 and 1; once batched, they have shape (batch size, number of channels, height, width).

train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor()])

test_augs = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor()])

Next, we define an auxiliary function for reading the images and applying image augmentation. The transform argument of a PyTorch dataset applies the given augmentation to each image as it is read.

def load_cifar10(is_train, augs, batch_size):
    dataset = torchvision.datasets.CIFAR10(root="../data", train=is_train,
                                           transform=augs, download=True)
    dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size,
        shuffle=is_train, num_workers=d2l.get_dataloader_workers())
    return dataloader

We train a ResNet-18 model on the CIFAR-10 dataset. First, define functions that train and evaluate the model on multiple GPUs.

def train_batch_ch13(net, X, y, loss, trainer, devices):
    # Move the minibatch to the first device
    if isinstance(X, list):
        X = [x.to(devices[0]) for x in X]
    else:
        X = X.to(devices[0])
    y = y.to(devices[0])
    net.train()
    trainer.zero_grad()
    pred = net(X)
    l = loss(pred, y)
    l.sum().backward()
    trainer.step()
    train_loss_sum = l.sum()
    train_acc_sum = d2l.accuracy(pred, y)
    return train_loss_sum, train_acc_sum

def train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
               devices=d2l.try_all_gpus()):
    timer, num_batches = d2l.Timer(), len(train_iter)
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 1],
                            legend=['train loss', 'train acc', 'test acc'])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    for epoch in range(num_epochs):
        # Four accumulators: sum of training loss, sum of training accuracy,
        # number of examples, and number of predictions
        metric = d2l.Accumulator(4)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = train_batch_ch13(
                net, features, labels, loss, trainer, devices)
            metric.add(l, acc, labels.shape[0], labels.numel())
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[2], metric[1] / metric[3],
                              None))
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {metric[0] / metric[2]:.3f}, train acc '
          f'{metric[1] / metric[3]:.3f}, test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
          f'{str(devices)}')

Define the train_with_data_aug function to train the model with image augmentation. This function uses all available GPUs and Adam as the optimization algorithm (similar to SGD, but less sensitive to the learning rate), applies image augmentation to the training set, and finally calls the train_ch13 function just defined to train and evaluate the model.

batch_size, devices, net = 256, d2l.try_all_gpus(), d2l.resnet18(10, 3)

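def init_weights(m):
    if type(m) in [nn.Linear, nn.Conv2d]:
        # Assumption: the body is missing from these notes; Xavier initialization is a common choice
        nn.init.xavier_uniform_(m.weight)

net.apply(init_weights)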

def train_with_data_aug(train_augs, test_augs, net, lr=0.001):
    train_iter = load_cifar10(True, train_augs, batch_size)
    test_iter = load_cifar10(False, test_augs, batch_size)
    loss = nn.CrossEntropyLoss(reduction="none")
    trainer = torch.optim.Adam(net.parameters(), lr=lr)
    train_ch13(net, train_iter, test_iter, loss, trainer, 10, devices)

Let's use image augmentation based on random left-right flip to train the model.

train_with_data_aug(train_augs, test_augs, net)

loss 0.174, train acc 0.941, test acc 0.824
551.9 examples/sec on [device(type='cuda', index=0)]
# I only have one GPU

1.3 summary

  • Image augmentation generates random images from existing training data to increase diversity, which improves the generalization of the model.
  • To obtain deterministic results during prediction, we usually apply image augmentation only to training examples and do not use random augmentation during prediction.
  • Deep learning frameworks provide many image augmentation methods (flipping, cropping, color changes), which can be applied together to reduce overfitting.

2 fine tuning

Earlier, we introduced how to train a model on the Fashion-MNIST training dataset, which has only 60,000 images. We also described ImageNet, the most widely used large-scale image dataset in academia, which has more than 10 million images and 1,000 classes of objects. However, the datasets we usually work with are somewhere between the two in size. In addition, because the number of training examples is limited, the accuracy of the trained model may not meet practical requirements.
An obvious solution to these problems is to collect more data. However, collecting and labeling data can take a lot of time and money.
Another solution is to apply transfer learning, which transfers the knowledge learned from a source dataset to a target dataset. For example, although most images in ImageNet have nothing to do with chairs, a model trained on this dataset can extract more general image features that help identify edges, textures, shapes, and object composition. These features may be equally effective for recognizing chairs.

2.1 steps

A common technique in transfer learning is fine-tuning. Fine-tuning consists of the following four steps:

  1. Pretrain a neural network model, the source model, on the source dataset.
  2. Create a new neural network model, the target model. It copies the entire model design and parameters of the source model, except the output layer. We assume that these parameters contain the knowledge learned from the source dataset and that this knowledge also applies to the target dataset. We also assume that the output layer of the source model is closely tied to the labels of the source dataset, so it is not used in the target model.
  3. Add an output layer to the target model whose number of outputs equals the number of categories in the target dataset, and initialize its parameters randomly.
  4. Train the target model on the target dataset. The output layer is trained from scratch, while the parameters of all other layers are fine-tuned starting from the parameters of the source model.

When the target dataset is much smaller than the source dataset, fine tuning helps to improve the generalization ability of the model.
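
The four steps can be sketched on a toy network. Everything in this block is an illustrative assumption: the tiny nn.Sequential models stand in for a real pretrained network, the "source model" is only randomly initialized, and the learning rates are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Step 1: a stand-in source model (assumed to be pretrained; here only random)
source = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1000))

# Steps 2-3: same design, but a new randomly initialized 2-class output layer
target = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
# Copy every parameter except those of the output layer (index 2 in the Sequential)
copied = {k: v for k, v in source.state_dict().items()
          if not k.startswith('2.')}
target.load_state_dict(copied, strict=False)

# Step 4: fine-tune the copied layers gently and train the new layer faster
base_lr = 0.01  # illustrative value
trainer = torch.optim.SGD([
    {'params': target[0].parameters()},
    {'params': target[2].parameters(), 'lr': base_lr * 10}], lr=base_lr)
```

Passing strict=False lets load_state_dict skip the output layer, whose keys were deliberately left out of the copied dictionary.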

2.2 hot dog recognition

We demonstrate fine-tuning with a concrete case: hot dog recognition. We will take a ResNet model pretrained on the ImageNet dataset and fine-tune it on a small dataset. This small dataset contains thousands of images with and without hot dogs. We will use the fine-tuned model to identify whether an image contains a hot dog.

2.2.1 obtaining data sets

The hot dog dataset we use was collected from the Internet. It contains 1,400 positive images that contain hot dogs and the same number of negative images that contain other foods. 1,000 images of each class are used for training and the rest for testing.

import matplotlib.pyplot as plt
import os
import torch
import torchvision
from torch import nn
from d2l import torch as d2l

d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + '',

data_dir = d2l.download_extract('hotdog')
# Create two instances to read all image files in the training and test data sets respectively
train_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'train'))
test_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'test'))

hotdogs = [train_imgs[i][0] for i in range(8)]
not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4)

During training, we first crop a region of random size and random aspect ratio from the image, and then scale the region to a 224 × 224 input image. During testing, we scale both the height and width of the image to 256 pixels, and then crop the central 224 × 224 region as input.
In addition, for the three RGB (red, green and blue) color channels, we standardize each channel. Specifically, the average value of the channel is subtracted from each value of the channel, and then the result is divided by the standard deviation of the channel.

# The mean and standard deviation of the three RGB channels are used to standardize each channel
normalize = torchvision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
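
As a small worked example of the per-channel standardization, the sketch below applies the same means and standard deviations to a single RGB pixel in plain Python. The helper name normalize_pixel is an assumption for illustration.

```python
mean = [0.485, 0.456, 0.406]  # per-channel means used above
std = [0.229, 0.224, 0.225]   # per-channel standard deviations used above

def normalize_pixel(rgb):
    """Standardize one RGB pixel whose values are already scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]

print(normalize_pixel([0.485, 0.456, 0.406]))  # a pixel equal to the mean maps to zeros
print(normalize_pixel([1.0, 1.0, 1.0]))        # a white pixel maps to positive values
```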

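train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.ToTensor(), normalize])

test_augs = torchvision.transforms.Compose([
    torchvision.transforms.Resize([256, 256]),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(), normalize])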

2.2.2 defining and initializing the model

We use ResNet-18 pretrained on the ImageNet dataset as the source model. Here, we specify pretrained=True to automatically download the pretrained model parameters. The first time you use this model, you need an Internet connection to download it.

pretrained_net = torchvision.models.resnet18(pretrained=True)

The pretrained source model instance contains a number of feature layers and an output layer fc. The main purpose of this division is to make it easy to fine-tune the parameters of all layers except the output layer. The member variable fc of the source model (pretrained_net.fc) is shown below.

Linear(in_features=512, out_features=1000, bias=True)

As a fully connected layer, fc transforms the output of ResNet's final global average pooling into the 1,000 class outputs of the ImageNet dataset. We then construct a new neural network as the target model. It is defined in the same way as the pretrained source model, except that the number of outputs of its final layer is set to the number of classes in the target dataset.

The target model instance finetune_net is initialized with the parameters of the corresponding layers of the source model. Because these parameters were pretrained on ImageNet and are already good enough, a small learning rate is usually all that is needed to fine-tune them.

The parameters of the new output layer are initialized randomly and usually need a larger learning rate to be trained from scratch. If the base learning rate is η, we use a learning rate of 10η for the parameters of the output layer.

finetune_net = torchvision.models.resnet18(pretrained=True)
finetune_net.fc = nn.Linear(finetune_net.fc.in_features, 2)

2.2.3 fine tuning model

# If `param_group=True`, the model parameters in the output layer use ten times the learning rate
def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5,
                      param_group=True):
    train_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'train'), transform=train_augs),
        batch_size=batch_size, shuffle=True)
    test_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'test'), transform=test_augs),
        batch_size=batch_size)
    devices = d2l.try_all_gpus()
    loss = nn.CrossEntropyLoss(reduction="none")
    if param_group:
        params_1x = [param for name, param in net.named_parameters()
                     if name not in ["fc.weight", "fc.bias"]]
        trainer = torch.optim.SGD([{'params': params_1x},
                                   {'params': net.fc.parameters(),
                                    'lr': learning_rate * 10}],
                                  lr=learning_rate, weight_decay=0.001)
    else:
        trainer = torch.optim.SGD(net.parameters(), lr=learning_rate,
                                  weight_decay=0.001)
    d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
                   devices)

If you run out of GPU memory, reduce batch_size.

We use a smaller learning rate to fine tune the model parameters obtained by pre training.

train_fine_tuning(finetune_net, 5e-5)

loss 0.172, train acc 0.930, test acc 0.943
187.2 examples/sec on [device(type='cuda', index=0)]

For comparison, the same model is defined, but all its model parameters are initialized to random values. Since the whole model needs to be trained from scratch, we need to use a larger learning rate.

scratch_net = torchvision.models.resnet18()
scratch_net.fc = nn.Linear(scratch_net.fc.in_features, 2)
train_fine_tuning(scratch_net, 5e-4, param_group=False)

loss 0.362, train acc 0.841, test acc 0.826
182.7 examples/sec on [device(type='cuda', index=0)]

As expected, the fine-tuned model performs better (test accuracy 0.943 versus 0.826 after the same number of epochs), because its initial parameter values are more effective.

2.3 summary

  • Fine-tuning improves accuracy by initializing the model weights from a model pretrained on a large dataset.
  • Transfer learning transfers the knowledge learned from a source dataset to a target dataset. Fine-tuning is a common transfer learning technique.
  • Fine-tuning usually trains faster and reaches higher accuracy than training from scratch. The pretrained parameters are typically fine-tuned with a smaller learning rate.

3 object detection and bounding boxes

So far we have introduced various image classification models. In image classification, we assume that there is only one main object in the image and focus only on identifying its category. Often, however, an image contains several objects of interest. We want to know not only their categories but also their specific positions in the image. In computer vision, this kind of task is called object detection (or object recognition).

Object detection is widely used in many fields. For example, autonomous driving needs to plan routes by identifying the positions of vehicles, pedestrians, roads, and obstacles in captured video frames. Robots often use it to detect objects of interest. Security systems need to detect abnormal objects, such as intruders or bombs.

import torch
from d2l import torch as d2l
import matplotlib.pyplot as plt

img = plt.imread('catdog.png')

3.1 bounding box

In object detection, we usually use a bounding box to describe the spatial position of an object. The bounding box is rectangular and is determined by the x and y coordinates of its upper-left corner and of its lower-right corner. Another common representation consists of the (x, y) coordinates of the center of the bounding box together with the width and height of the box.

def box_corner_to_center(boxes):
    """Transition from (top left, bottom right) to (middle, width, height)"""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = torch.stack((cx, cy, w, h), axis=-1)
    return boxes

def box_center_to_corner(boxes):
    """Transition from (middle, width, height) to (top left, bottom right)"""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = torch.stack((x1, y1, x2, y2), axis=-1)
    return boxes
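
To see the two representations side by side, here is a single-box version of the same conversions in plain Python. The function names and the box coordinates are illustrative assumptions, not values taken from the image.

```python
def corner_to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h) for one box."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]

def center_to_corner(box):
    """(cx, cy, w, h) -> (x1, y1, x2, y2) for one box."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

box = [60.0, 40.0, 260.0, 340.0]  # illustrative corner-format box
center = corner_to_center(box)
print(center)                     # [160.0, 190.0, 200.0, 300.0]
assert center_to_corner(center) == box  # the round trip recovers the box
```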

We define the bounding boxes of the dog and the cat in the image from their coordinates. The origin of the image coordinates is the upper-left corner of the image, and right and down are the positive directions of the x and y axes, respectively.

# bbox is the abbreviation of bounding box
dog_bbox, cat_bbox = [55.0, 99.0, 261.0, 354.0], [75.0, 640.0, 317.0, 450.0]
# Define an auxiliary function bbox_to_rect
# It represents the bounding box in matplotlib's bounding box format
def bbox_to_rect(bbox, color):
    # Convert the format of bounding box (upper left x, upper left y, lower right x, lower right y) to matplotlib format:
    # ((upper left x, upper left y), width, height)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

fig = plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));

3.2 summary

  • Object detection identifies the category and position of multiple objects in the picture.
  • The position is usually represented by a bounding box.

4 anchor boxes

Object detection algorithms usually sample a large number of regions in the input image, determine whether these regions contain objects of interest, and adjust the region boundaries to predict the ground-truth bounding boxes of the objects more accurately. Different models may use different region sampling methods. Here we introduce one of them: it generates multiple bounding boxes with different scales and aspect ratios centered on each pixel. These bounding boxes are called anchor boxes.

import torch
from d2l import torch as d2l
import matplotlib.pyplot as plt


A note on the set_printoptions function

torch.set_printoptions(precision=2, threshold=4)
a = torch.arange(1, 12)
print(a)
tensor([ 1,  2,  3,  ...,  9, 10, 11])
torch.set_printoptions(precision=2, threshold=4, edgeitems=2)
a = torch.arange(1, 12)
print(a)
tensor([ 1,  2,  ..., 10, 11])
  • precision: number of digits kept after the decimal point when printing floating-point values (here two, with rounding).
  • threshold: the summarized display (with ...) starts when the number of elements exceeds this threshold.
  • edgeitems: number of elements shown at each end of a summarized dimension. The default is 3.

4.1 generate multiple anchor boxes

Suppose the input image has height h and width w. We generate anchor boxes of different shapes centered on each pixel of the image. Let the scale be s ∈ (0, 1] and the aspect ratio (width to height) be r > 0. Then the width and height of the anchor box are ws√r and hs/√r, respectively. Note that once the center position is given, an anchor box with known width and height is fully determined.

To generate anchor boxes of different shapes, we set a series of scales s1, ..., sn and a series of aspect ratios r1, ..., rm. If all combinations of these scales and aspect ratios were used with every pixel as the center, the input image would have a total of whnm anchor boxes.

Although these anchor boxes may cover all ground-truth bounding boxes, the computational cost easily becomes too high. In practice, we only consider the combinations that contain s1 or r1: (s1, r1), (s1, r2), ..., (s1, rm), (s2, r1), (s3, r1), ..., (sn, r1). That is, the number of anchor boxes centered on the same pixel is n + m − 1. For the whole input image, we generate wh(n + m − 1) anchor boxes.
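
The counting argument can be checked with a few lines of plain Python. The helper names anchor_shapes and anchor_wh are assumptions for illustration, the scales and ratios are example values, and the 733 × 550 image size is used only as a sample input. The sketch follows the width and height formulas in the text rather than any particular library implementation.

```python
import math

def anchor_shapes(sizes, ratios):
    """(scale, ratio) pairs that contain the first scale or the first ratio."""
    return [(s, ratios[0]) for s in sizes] + [(sizes[0], r) for r in ratios[1:]]

def anchor_wh(h, w, s, r):
    """Anchor width and height in pixels: w*s*sqrt(r) and h*s/sqrt(r)."""
    return w * s * math.sqrt(r), h * s / math.sqrt(r)

sizes, ratios = [0.75, 0.5, 0.25], [1, 2, 0.5]
pairs = anchor_shapes(sizes, ratios)
h, w = 733, 550
print(len(pairs))               # 5 == n + m - 1
print(w * h * len(pairs))       # 2015750 anchor boxes in total
print(anchor_wh(h, w, 0.5, 1))  # (275.0, 366.5)
```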

The above method of generating anchor boxes can be implemented in the following multibox_prior function. We specify the input image, scale list and aspect ratio list, and then this function will return all anchor boxes.

def multibox_prior(data, sizes, ratios):
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)

    # In order to move the anchor to the center of the pixel, you need to set the offset.
    # Because a pixel has a height of 1 and a width of 1, we choose to offset our center by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

    # Generate all center points of the anchor box
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w)
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Generate `boxes_per_pixel` widths and heights, which are later used to
    # build the anchor box corner coordinates (xmin, ymin, xmax, ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

    # Each center point will have "boxes_per_pixel" anchor boxes,
    # Therefore, the grid with all anchor box centers is generated, and the "boxes_per_pixel" is repeated several times
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

img = plt.imread('catdog.png')
h, w = img.shape[:2]

print(h, w)
X = torch.rand(size=(1, 3, h, w))  # Batch, channel, height, width
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
print(Y.shape)  # (batch size, number of anchor boxes, 4)

733 550
torch.Size([1, 2015750, 4])
# 2015750 == 733 * 550 * (3 + 3 - 1)

After reshaping the anchor box variable Y to (image height, image width, number of anchor boxes centered on the same pixel, 4), we can obtain all anchor boxes centered on a specified pixel. Below, we access the first anchor box centered on the pixel at row 148, column 224. It has four elements: the (x, y) coordinates of the upper-left corner and the (x, y) coordinates of the lower-right corner of the anchor box. The coordinates on the two axes are divided by the width and height of the image, respectively, so they are expressed relative to the image size, with the image itself spanning 0 to 1.

boxes = Y.reshape(h, w, 5, 4)
boxes[148, 224, 0, :]
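
The reshape works because the anchor boxes are stored pixel by pixel in row-major order, with the boxes of one pixel next to each other. The plain-Python sketch below computes the position of one anchor box in the flat list under that assumption; the helper name flat_anchor_index is hypothetical.

```python
def flat_anchor_index(row, col, k, width, boxes_per_pixel):
    """Flat index of the k-th anchor box centered on pixel (row, col)."""
    return (row * width + col) * boxes_per_pixel + k

h, w, boxes_per_pixel = 733, 550, 5
idx = flat_anchor_index(148, 224, 0, w, boxes_per_pixel)
print(idx)  # 408120
```

Under this layout, boxes[148, 224, 0, :] and Y[0, 408120, :] refer to the same anchor box.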

In order to display all anchor boxes centered on a pixel in the image, we define the following show_bboxes function to draw multiple bounding boxes on the image.

def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Displays all bounding boxes."""
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj
    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)  # draw the rectangle on the axes
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))

Now we can draw all anchor boxes centered on the pixel at row 224, column 148 of the image (indices are given as height, width).

bbox_scale = torch.tensor((w, h, w, h))
fig = plt.imshow(img)
show_bboxes(fig.axes, boxes[224, 148, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])

4.2 intersection over union (IoU)

We need a way to measure the similarity between an anchor box and a ground-truth bounding box.
For two bounding boxes, we usually refer to their Jaccard index as intersection over union (IoU): the ratio of the area of their intersection to the area of their union. IoU ranges between 0 and 1: 0 means that the two bounding boxes do not overlap at all, and 1 means that they coincide exactly.

def box_iou(boxes1, boxes2):
    # box_area is a lambda that maps boxes (x1, y1, x2, y2) to their areas:
    # (x2 - x1) * (y2 - y1)
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas
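
For intuition, the same computation can be carried out on a single pair of boxes in plain Python. The function name iou and the example boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the intersection are clamped at zero for disjoint boxes
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 2, 2], [1, 1, 3, 3]))  # intersection 1, union 7, so 1/7
print(iou([0, 0, 1, 1], [2, 2, 3, 3]))  # disjoint boxes give 0.0
print(iou([0, 0, 2, 2], [0, 0, 2, 2]))  # identical boxes give 1.0
```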


4.3 labeling anchor boxes in training data

In the training set, we treat each anchor box as a training example. To train an object detection model, we need a class label and an offset label for each anchor box: the former is the category of the object associated with the anchor box, and the latter is the offset of the ground-truth bounding box relative to the anchor box. During prediction, we generate multiple anchor boxes for each image, predict the class and offset of every anchor box, adjust their positions according to the predicted offsets to obtain the predicted bounding boxes, and finally output only the predicted bounding boxes that satisfy certain criteria.


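4.3.1 assigning ground-truth bounding boxes to anchor boxes

Consider a matrix of IoU values in which each row corresponds to an anchor box and each column to a ground-truth bounding box. First, find the largest element in the matrix and assign the ground-truth box of its column to the anchor box of its row; for example, if the largest element lies in row 2 and column 3, anchor box 2 is used to predict object 3. Then discard that row and that column and find the largest element among those that remain. Repeat until every column has been discarded, that is, until every ground-truth box has been assigned an anchor box. In the implementation below, each remaining anchor box is additionally assigned the ground-truth box with which it has the largest IoU, but only if that IoU reaches the threshold iou_threshold.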

def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign the closest real bounding box to the anchor box."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # The element x_ij in row i and column j is the IoU of anchor box i and real bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # For each anchor box, the tensor of the real bounding box is assigned
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # According to the threshold, decide whether to assign a real bounding box
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)
        box_idx = (max_idx % num_gt_boxes).long()
        anc_idx = (max_idx / num_gt_boxes).long()
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map
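
The assignment procedure can be traced by hand on a small IoU matrix. The sketch below is a plain-Python rendering of the same two rules (threshold first, then greedy matching); the function name greedy_assign and the matrix values are assumptions chosen for illustration.

```python
def greedy_assign(iou, threshold=0.5):
    """Map each anchor (row) to a ground-truth box (column), or -1 for none."""
    num_anchors, num_gt = len(iou), len(iou[0])
    # Threshold rule: each anchor takes its best box if the IoU is large enough
    assignment = []
    for row in iou:
        best = max(range(num_gt), key=lambda j: row[j])
        assignment.append(best if row[best] >= threshold else -1)
    # Greedy rule: repeatedly take the largest remaining element, then
    # discard its row and column
    free_rows, free_cols = set(range(num_anchors)), set(range(num_gt))
    for _ in range(min(num_anchors, num_gt)):
        i, j = max(((i, j) for i in free_rows for j in free_cols),
                   key=lambda ij: iou[ij[0]][ij[1]])
        assignment[i] = j
        free_rows.discard(i)
        free_cols.discard(j)
    return assignment

iou = [[0.05, 0.10],   # anchor 0
       [0.60, 0.20],   # anchor 1
       [0.55, 0.30],   # anchor 2
       [0.10, 0.40]]   # anchor 3
print(greedy_assign(iou))  # [-1, 0, 0, 1]
```

Anchor 3 never reaches the threshold, yet it is still assigned box 1, because the greedy rule guarantees every ground-truth box at least one anchor.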

4.3.2 labeling classes and offsets

Now we can label the class and offset of each anchor box. Suppose anchor box A is assigned ground-truth bounding box B. On the one hand, the class of anchor box A is labeled as that of B. On the other hand, the offset of anchor box A is labeled according to the relative position of the center coordinates of B and A, together with the relative size of the two boxes.

def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Conversion of anchor box offset."""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset
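
A single-box worked example may help to read the offset formula. The sketch below repeats the arithmetic of offset_boxes for one anchor and one ground-truth box in center format; the function name box_offset and the box values are assumptions for illustration.

```python
import math

def box_offset(anchor, gt, eps=1e-6):
    """Offset label of one ground-truth box relative to one anchor box.

    Both boxes are (cx, cy, w, h); the factors 10 and 5 mirror offset_boxes.
    """
    dx = 10 * (gt[0] - anchor[0]) / anchor[2]
    dy = 10 * (gt[1] - anchor[1]) / anchor[3]
    dw = 5 * math.log(eps + gt[2] / anchor[2])
    dh = 5 * math.log(eps + gt[3] / anchor[3])
    return [dx, dy, dw, dh]

anchor = [0.5, 0.5, 0.2, 0.4]  # illustrative center-format boxes
gt = [0.52, 0.46, 0.2, 0.4]
print(box_offset(anchor, gt))  # roughly [1, -1, 0, 0]
```

Here the ground-truth center is shifted by a tenth of the anchor's width and height, which the factor 10 turns into offsets of about ±1, while equal sizes give width and height offsets close to zero.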

If an anchor box is not assigned a ground-truth bounding box, we simply label its class as "background". Anchor boxes of the background class are often called negative anchor boxes, and the rest are called positive anchor boxes. The following multibox_target function uses the ground-truth bounding boxes (the labels argument) to label the classes and offsets of the anchor boxes (the anchors argument). It sets the background class to zero and increments the integer index of every other class by one.

def multibox_target(anchors, labels):
    """Mark the anchor box with a real bounding box."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initializes the class label and the assigned bounding box coordinates to zero
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Use a real bounding box to mark the category of the anchor box.
        # If an anchor box is not assigned, we mark it as the background (the value is zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
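        # Offset conversion (offsets of background anchor boxes are masked to zero)
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        # Collect the per-image results so they can be stacked after the loop
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)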
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)

4.3.3 an example

Let's illustrate anchor box labeling with a concrete example. In the loaded image, we define ground-truth bounding boxes for the dog and the cat. The first element is the class (0 for the dog and 1 for the cat), and the other four elements are the (x, y) coordinates of the upper-left and lower-right corners (in the range 0 to 1). We also construct five anchor boxes, A0, ..., A4 (indexed from 0), given by the coordinates of their upper-left and lower-right corners. Then we draw these ground-truth bounding boxes and anchor boxes on the image.

ground_truth = torch.tensor([[0, 0.1, 0.14, 0.47, 0.45],
                             [1, 0.14, 0.61, 0.58, 0.88]])
anchors = torch.tensor([[0, 0.2, 0.2, 0.3], [0.07, 0.08, 0.52, 0.52],
                        [0.2, 0.5, 0.53, 0.98], [0.09, 0.58, 0.7, 0.8],
                        [0.07, 0.7, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4'])
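
To see which labels these boxes would receive, the assignment procedure can be traced by hand. The following is a minimal pure-Python sketch, not the multibox_target implementation itself: the helper names iou and assign_anchors are hypothetical, and it assumes the same two steps as above (a 0.5 IoU threshold followed by a greedy match for each ground-truth box).

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def assign_anchors(gt_boxes, anchor_boxes, threshold=0.5):
    """Return, for each anchor, the index of its assigned box or -1."""
    n_anchors, n_gt = len(anchor_boxes), len(gt_boxes)
    table = [[iou(a, g) for g in gt_boxes] for a in anchor_boxes]
    assigned = [-1] * n_anchors
    # Step 1: assign every anchor whose best IoU reaches the threshold
    for i, row in enumerate(table):
        best = max(range(n_gt), key=row.__getitem__)
        if row[best] >= threshold:
            assigned[i] = best
    # Step 2: repeatedly take the largest remaining IoU, so that each
    # ground-truth box ends up with at least one anchor
    for _ in range(n_gt):
        i, j = max(((r, c) for r in range(n_anchors) for c in range(n_gt)),
                   key=lambda rc: table[rc[0]][rc[1]])
        assigned[i] = j
        for r in range(n_anchors):
            table[r][j] = -1.0          # discard the matched column
        table[i] = [-1.0] * n_gt        # discard the matched row
    return assigned


gt_classes = [0, 1]  # 0 = dog, 1 = cat
gt_boxes = [(0.1, 0.14, 0.47, 0.45), (0.14, 0.61, 0.58, 0.88)]
anchor_boxes = [(0, 0.2, 0.2, 0.3), (0.07, 0.08, 0.52, 0.52),
                (0.2, 0.5, 0.53, 0.98), (0.09, 0.58, 0.7, 0.8),
                (0.07, 0.7, 0.92, 0.9)]

assigned = assign_anchors(gt_boxes, anchor_boxes)
# Background is 0; every real class index is shifted up by one
class_labels = [0 if j < 0 else gt_classes[j] + 1 for j in assigned]
print(assigned, class_labels)
```

Under these assumptions, A1 is matched to the dog because it is the only anchor whose IoU reaches the threshold, A3 is matched to the cat in the greedy step, and the remaining anchors are labeled as background.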

4.4 predicting bounding boxes with non-maximum suppression

During prediction, we first generate multiple anchor boxes for the image and then predict the class and offset of each anchor box. A "predicted bounding box" is obtained from an anchor box and its predicted offset. Next, we implement the offset_inverse function, which takes the anchor boxes and the offset predictions as input and applies the inverse offset transformation to return the coordinates of the predicted bounding boxes.

def offset_inverse(anchors, offset_preds):
    """The bounding box is predicted based on the anchor box with the predicted offset."""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox
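
A quick way to convince yourself that this transformation really is the inverse of the labeling step is a round trip on a single box. The sketch below uses plain Python instead of tensors; encode_offset and decode_offset are hypothetical names that mirror the arithmetic of offset_boxes and offset_inverse, and the two boxes are the dog box and anchor A1 from the earlier example.

```python
import math


def corner_to_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def center_to_corner(box):
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def encode_offset(anchor, target, eps=1e-6):
    """Offset of `target` relative to `anchor` (same constants as above)."""
    ax, ay, aw, ah = corner_to_center(anchor)
    tx, ty, tw, th = corner_to_center(target)
    return (10 * (tx - ax) / aw, 10 * (ty - ay) / ah,
            5 * math.log(eps + tw / aw), 5 * math.log(eps + th / ah))


def decode_offset(anchor, offset):
    """Apply the inverse transformation to recover a corner-format box."""
    ax, ay, aw, ah = corner_to_center(anchor)
    dx, dy, dw, dh = offset
    center_box = (dx * aw / 10 + ax, dy * ah / 10 + ay,
                  math.exp(dw / 5) * aw, math.exp(dh / 5) * ah)
    return center_to_corner(center_box)


anchor = (0.07, 0.08, 0.52, 0.52)
target = (0.1, 0.14, 0.47, 0.45)
recovered = decode_offset(anchor, encode_offset(anchor, target))
print(recovered)
```

The recovered box agrees with the target up to the tiny error introduced by eps.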

The following nms function sorts the confidence scores in descending order and returns the indices of the predicted bounding boxes that are kept.

def nms(boxes, scores, iou_threshold):
    """Sort the confidence of the prediction bounding box."""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # Indices of the predicted bounding boxes to keep
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)
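
The same greedy procedure can be followed step by step without tensors. This is a small sketch under stated assumptions: iou and greedy_nms are hypothetical helper names, and the three boxes and scores are made up for illustration (they are not taken from the image used in these notes).

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def greedy_nms(boxes, scores, iou_threshold):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # most confident remaining box
        keep.append(best)
        # Drop every remaining box that overlaps it too strongly
        order = [k for k in order
                 if iou(boxes[best], boxes[k]) <= iou_threshold]
    return keep


boxes = [(0.10, 0.10, 0.50, 0.50),   # high score
         (0.15, 0.15, 0.55, 0.55),   # overlaps the first box heavily
         (0.60, 0.60, 0.90, 0.90)]   # far away from both
scores = [0.9, 0.8, 0.7]
keep = greedy_nms(boxes, scores, iou_threshold=0.5)
print(keep)
```

Here the second box has an IoU of about 0.62 with the first one, so it is suppressed, while the third box does not overlap either of them and survives.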

The following multibox_detection function applies non-maximum suppression to the predicted bounding boxes.

def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Non maximum suppression is used to predict the bounding box."""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)

        # Find all `non_keep` indices and set their class to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # `pos_threshold` is the confidence threshold for positive (non-background) predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)

Now let's apply the above algorithm to a specific example with four anchor boxes. It is assumed that the predicted offsets are all zero, which means that the predicted bounding box is the anchor box. For each class of background, dog and cat, we also define its prediction probability.

anchors = torch.tensor([[0.1, 0.14, 0.47, 0.45], [0.08, 0.2, 0.56, 0.55],
                        [0.12, 0.3, 0.52, 0.5], [0.14, 0.61, 0.58, 0.88]])
offset_preds = torch.tensor([0] * anchors.numel())
cls_probs = torch.tensor([[0] * 4,  # Prediction probability of background
                          [0.9, 0.8, 0.7, 0.1],  # Prediction probability of dog
                          [0.1, 0.2, 0.3, 0.9]])  # Cat prediction probability

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
            ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])

Now we can call the multibox_detection function to perform non-maximum suppression, with the threshold set to 0.5.

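output = multibox_detection(cls_probs.unsqueeze(dim=0),
                            offset_preds.unsqueeze(dim=0),
                            anchors.unsqueeze(dim=0),
                            nms_threshold=0.5)
fig = d2l.plt.imshow(img)
for i in output[0].detach().numpy():
    if i[0] == -1:
        continue  # Skip background and suppressed boxes
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label)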

The shape of the returned result is (batch size, number of anchor boxes, 6). The six elements in the innermost dimension describe a single predicted bounding box. The first element is the predicted class index, starting from 0 (0 for dog and 1 for cat); a value of -1 indicates background or a box removed by non-maximum suppression. The second element is the confidence of the predicted bounding box. The remaining four elements are the (x, y) coordinates of the upper-left and lower-right corners of the predicted bounding box (in the range 0 to 1).

In practice, we can remove predicted bounding boxes with low confidence even before non-maximum suppression, which reduces the amount of computation in the algorithm. We can also post-process the output of non-maximum suppression, for example by keeping only the results with higher confidence as the final output.
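
Such post-processing can be as simple as a filter over the output rows. The following is a minimal sketch: filter_detections is a hypothetical helper, the 0.5 cutoff is an arbitrary choice, and the rows are illustrative values in the (class, confidence, x1, y1, x2, y2) layout described above rather than real model output.

```python
def filter_detections(rows, min_confidence):
    """Keep rows that are not background (-1) and are confident enough."""
    return [row for row in rows
            if row[0] != -1 and row[1] >= min_confidence]


# Illustrative rows: class, confidence, x1, y1, x2, y2
detections = [
    [0.0, 0.90, 0.10, 0.14, 0.47, 0.45],   # confident dog
    [1.0, 0.90, 0.14, 0.61, 0.58, 0.88],   # confident cat
    [-1.0, 0.80, 0.08, 0.20, 0.56, 0.55],  # removed by NMS
    [0.0, 0.30, 0.12, 0.30, 0.52, 0.50],   # dog, but low confidence
]

final = filter_detections(detections, min_confidence=0.5)
for row in final:
    print(int(row[0]), row[1], row[2:])
```

Only the first two rows remain: the third is marked as background or suppressed, and the fourth falls below the chosen confidence cutoff.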

4.5 summary

  • One class of object detection algorithms makes predictions based on anchor boxes.
  • First, a large number of anchor boxes are generated and labeled, and each anchor box is treated as a training sample.
  • Intersection over union (IoU), also known as the Jaccard index, measures the similarity between two bounding boxes. It is the ratio of their intersection area to their union area.
  • In the training set, each anchor box needs two types of labels: the class of the object in the anchor box, and the offset of the real bounding box relative to the anchor box.
  • During prediction, non-maximum suppression (NMS) is used to remove redundant predicted bounding boxes and simplify the output.

Posted on Sun, 10 Oct 2021 19:30:27 -0400 by phpScott