Multi-target tracking based on PP-YOLO Tiny and the DSST algorithm

1 Project background

Target tracking in video sequences is an active research topic in computer vision, with important applications in security, transportation, the military and other fields.

This project builds on the PaddlePaddle computer vision development kit and combines deep learning with a traditional visual tracking algorithm to perform multi-target tracking.

As a stepping stone from target detection to (multi-)target tracking, the project designs its tracking method around the tracking-by-detection idea, in order to become familiar with the relevant steps and difficulties.

Common target tracking methods are generally divided into two categories: generative models and discriminative models. This project adopts the discriminative approach:

  1. Generative model methods: search image regions and minimize the reconstruction error with respect to a learned target model, for example the Kalman Filter;
  2. Discriminative model methods: treat tracking as a binary classification problem that separates the target from the background, for example DSST.

2 Main work

Taking a manually annotated HeLa cell dataset as an example, this project uses PaddleX to train a PP-YOLO Tiny target detector, then uses the DSST single-target tracking algorithm built into DLib, and achieves multi-target tracking through IoU (intersection over union) cascade matching between detection boxes and observation boxes.

  • Note: the project's outputs have already been saved in this version (under work/), so the commented-out code does not need to be run. Training the detector can also be skipped, but the dataset still needs to be unzipped.

3 Target detection

Under the tracking-by-detection idea, the tracking targets are determined by the detection model. So how are tracking targets generated? When nothing is being tracked yet (frame 1), the detector is run to obtain prediction boxes, which become the first batch of targets, and one tracker is created to observe each box. When the next frame arrives, each single-target tracker searches that frame for the best-matching region using its own region matching algorithm (for example the scale and position filters of DSST) and automatically takes the new region as its observation target. This answers the question of how tracking targets are generated: they come from the detector's prediction boxes.

3.1 Dependency preparation

  • Install DLib persistently in the AI Studio premium environment.
# !mkdir /home/aistudio/external-libraries
# !pip install dlib -t /home/aistudio/external-libraries

The original plan was to export the detector through the high-performance deployment interface, but the pip release of paddlex.deploy.Predictor still has some problems (they are fixed in the development version, but installing from source on AI Studio also runs into problems), so we run inference directly with the trained model, or alternatively with Paddle Inference.

# !git clone
# %cd PaddleX
# !git checkout develop
# !python install
# %cd ../
!pip install paddlex
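import sys
sys.path.append('/home/aistudio/external-libraries')  # assumed: makes the persistently installed dlib importable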

import paddle
import paddlex as pdx
from paddlex import transforms as T

import shutil
import glob
import os

import dlib
import numpy as np
import pandas as pd

import cv2
import imghdr
from PIL import Image

import matplotlib
import matplotlib.pyplot as plt

3.2 Data division

  • Unzip the dataset archive into a directory at the same level.
!unzip -oq data/data107056/ -d data/data107056
  • Split the dataset: 20% for validation (16 images) and 80% for training (68 images).
!paddlex --split_dataset --format VOC\
    --dataset_dir data/data107056/DIC-C2DH-HeLa\
    --val_value 0.2\
    --test_value 0
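
The counts quoted above follow from simple arithmetic. The following is a minimal sketch, assuming 84 annotated images in total (16 + 68) and assuming the validation share is rounded down; how `paddlex --split_dataset` rounds and shuffles internally is not shown in this post, and the helper names are hypothetical.

```python
import random


def split_counts(total, val_ratio):
    """Return (n_train, n_val), rounding the validation share down (an assumption)."""
    n_val = int(total * val_ratio)
    return total - n_val, n_val


def split_items(items, val_ratio, seed=0):
    """Shuffle a copy of items and cut it into (train, val) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_counts(len(shuffled), val_ratio)
    return shuffled[:n_train], shuffled[n_train:]


n_train, n_val = split_counts(84, 0.2)
print(n_train, n_val)  # 68 16
```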

3.3 Data augmentation and reader

Define the data preprocessing for the training set and the validation set.

Both are resized to 320x320 and then normalized with the ImageNet mean and standard deviation (ImageNet pre-trained weights are used later).
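
To make the normalization concrete, here is a small sketch in plain Python. It assumes pixel values are first scaled from 0-255 to [0, 1] and then standardized per channel, which is the usual ImageNet convention; it is not PaddleX's own implementation, and the function names are hypothetical. It also reproduces the padding value expression used in the training transforms.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [(c / 255.0 - m) / s
            for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


def mean_as_pixel_value():
    """ImageNet mean on the 0-255 scale, truncated to whole numbers."""
    return [float(int(m * 255)) for m in IMAGENET_MEAN]


print(mean_as_pixel_value())  # [123.0, 116.0, 103.0]
```

A pixel filled with this padding value lands very close to zero after normalization, which is why the mean color is a natural choice for padding.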

train_transforms = T.Compose([
    T.MixupImage(alpha=1.5, beta=1.5, mixup_epoch=int(550 * 25. / 27)),
    # Call name lost in the source; T.RandomDistort is assumed from the keyword arguments
    T.RandomDistort(
        brightness_range=0.5, brightness_prob=0.5,
        contrast_range=0.5, contrast_prob=0.5,
        saturation_range=0.5, saturation_prob=0.5,
        hue_range=18.0, hue_prob=0.5),
    # Call name lost in the source; T.RandomExpand is assumed from the keyword argument
    T.RandomExpand(
        im_padding_value=[float(int(x * 255)) for x in [0.485, 0.456, 0.406]]),
    T.Resize(target_size=320, interp='RANDOM'),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

eval_transforms = T.Compose([
    T.Resize(target_size=320, interp='AREA'),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Point the data readers at the *.txt file lists generated in step 3.2.

train_dataset = pdx.datasets.VOCDetection(

eval_dataset = pdx.datasets.VOCDetection(

3.4 Model configuration and training

  • Cluster anchors on the training set.
anchors =
  • Load the clustered anchors and construct PP-YOLO Tiny.
model = pdx.det.PPYOLOTiny(
  • Custom optimizer (Momentum with L2 regularization of 5e-4) and a learning rate schedule based on empirical values (warmup + piecewise decay).
learning_rate = 0.001
warmup_steps = 66
warmup_start_lr = 0.0
train_batch_size = 8
step_each_epoch = train_dataset.num_samples // train_batch_size

lr_decay_epochs = [130, 540]
boundaries = [b * step_each_epoch for b in lr_decay_epochs]
values = [learning_rate * (0.1**i) for i in range(len(lr_decay_epochs) + 1)]
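# Right-hand sides lost in the source; assumed reconstruction of "warmup + piecewise decay"
lr = paddle.optimizer.lr.PiecewiseDecay(boundaries=boundaries, values=values)

lr = paddle.optimizer.lr.LinearWarmup(
    learning_rate=lr, warmup_steps=warmup_steps, start_lr=warmup_start_lr, end_lr=learning_rate)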

optimizer = paddle.optimizer.Momentum(
  • Start training with the ImageNet pre-trained weights and the custom optimizer, evaluating and saving the model every 30 epochs.


    log_interval_steps=step_each_epoch * 5,
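
The warmup plus piecewise-decay schedule configured in this section can be sketched in plain Python. This is only an illustration of the shape of the schedule, not the Paddle scheduler itself: `lr_at_step` is a hypothetical helper, `steps_per_epoch=8` assumes 68 training images with a batch size of 8, and any step offset the framework applies after warmup is ignored.

```python
def lr_at_step(step, base_lr=0.001, warmup_steps=66, warmup_start_lr=0.0,
               steps_per_epoch=8, decay_epochs=(130, 540), gamma=0.1):
    """Learning rate at a global step: linear warmup, then piecewise-constant decay."""
    if step < warmup_steps:
        # Linear ramp from warmup_start_lr up to base_lr
        return warmup_start_lr + (base_lr - warmup_start_lr) * step / warmup_steps
    boundaries = [e * steps_per_epoch for e in decay_epochs]
    # Multiply by gamma once for every boundary already passed
    n_passed = sum(1 for b in boundaries if step >= b)
    return base_lr * gamma ** n_passed


for s in (0, 33, 66, 1040, 4320):
    print(s, lr_at_step(s))
```

With these values the rate ramps up over the first 66 steps, stays at 0.001 until epoch 130, drops by a factor of ten there, and drops by another factor of ten at epoch 540.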

3.5 Model evaluation

  • Draw the PR curve.
  • Draw the error analysis chart.
_, evaluate_details = model.evaluate(eval_dataset, return_details=True)
gt, bbox = evaluate_details['gt'], evaluate_details['bbox']
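
For readers unfamiliar with how a PR curve and an average precision value arise from scored detections, here is a minimal sketch. It is not PaddleX's evaluation code: it assumes each detection has already been marked as a true or false positive (the matching against ground truth is not shown), and the function names are hypothetical.

```python
def precision_recall(detections, n_gt):
    """detections: list of (score, is_true_positive); n_gt: number of ground-truth boxes.

    Returns the (precision, recall) point reached after each detection,
    processing detections from the highest score to the lowest.
    """
    points, tp, fp = [], 0, 0
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / n_gt))
    return points


def average_precision(points):
    """Area under the PR curve after making precision non-increasing in recall."""
    envelope, best = [], 0.0
    for precision, recall in reversed(points):
        best = max(best, precision)
        envelope.append((best, recall))
    ap, prev_recall = 0.0, 0.0
    for precision, recall in reversed(envelope):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
print(average_precision(precision_recall(dets, n_gt=4)))  # 0.625
```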

  • The best model observed so far is the one from epoch 550 (the last saved model), with mAP = 79.77.

  • Save the model to work/best_model. This completes the target detector steps.

!cp -r output/PPYOLOTiny/best_model work/

4 Target tracking

4.1 Synthesizing a video from images

This step first defines an images2mp4 function. The original dataset is a video sequence stored as images at 10 FPS, so the *.jpg files in the target folder need to be merged into an mp4 video at 10 FPS to simulate prediction on a deployed video stream.

def images2mp4(images_dir, output_path):
    """
    Combine all .jpg images in the target folder into an .mp4 video file.
    :param images_dir: Path to the folder containing the images
    :param output_path: Where to save the synthesized video
    """
    # Create an mp4 video stream file with 512x512 resolution and FPS of 10
    # (filename, fourcc and fps were lost in the source; the values below are assumptions)
    video = cv2.VideoWriter(
        filename=output_path,
        fourcc=cv2.VideoWriter_fourcc(*'mp4v'),
        fps=10,
        frameSize=(512, 512))

    # Read each target picture file and write it to the video stream
    for img_path in sorted(glob.glob(os.path.join(images_dir, '*.jpg'))):
        video.write(cv2.imread(img_path))

    video.release()

!mkdir work/viedio

# DIC-C2DH-HeLa/Test/*.jpg => work/viedio/test.mp4
# Convert the .mp4 to a .gif file using ffmpeg
!ffmpeg -i work/viedio/test.mp4 -s 320*320 work/viedio/test.gif -y

The following is the video sequence synthesized from the test images (FPS=10):

Once the tracker is defined, this file can be used as its test input.

4.2 Single-target tracker

A single tracker follows one fixed target. To achieve multi-target tracking, we design a scheme that manages a set of these single-target trackers uniformly. So we first build a single-target tracker and encapsulate the necessary steps; the multi-target tracker then only needs to implement the management algorithm and call that interface.

The following is the DSST single-target tracker SingleTracker, which can be used to track a single target.

DSST (from the paper "Accurate Scale Estimation for Robust Visual Tracking") splits tracking into two parts and defines two correlation filters: one estimates the target's position and the other estimates its scale. DLib ships with this method built in.
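
The idea behind a position filter can be illustrated with a deliberately simplified 1-D example: slide a template over a signal and take the offset with the strongest correlation response. This is only an intuition aid under strong assumptions (raw values, no learning, one dimension); it is not the DSST algorithm or DLib's implementation, and the scale filter is not shown.

```python
def cross_correlate(signal, template):
    """Correlation score of the template at every valid offset in the signal."""
    n = len(template)
    return [sum(signal[o + k] * template[k] for k in range(n))
            for o in range(len(signal) - n + 1)]


def locate(signal, template):
    """Offset with the highest correlation response."""
    scores = cross_correlate(signal, template)
    return max(range(len(scores)), key=scores.__getitem__)


template = [1, 3, 1]                 # appearance of the target in the first frame
frame = [0, 0, 0, 1, 3, 1, 0, 0]     # the target has moved to offset 3
print(locate(frame, template))       # 3
```

The height of the response peak relative to its surroundings is also what makes a tracking quality score possible: a weak or ambiguous peak suggests the target has been lost.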

The following is a brief outline of single-target tracking. Boxes are uniformly represented as (x_min, y_min, x_max, y_max):

(1) Create a tracker and pass in the first frame to fix the target location: instantiate SingleTracker and pass the observation region to the begin() method to start automatic tracking;

(2) Read subsequent frames: pass each image to update_bbox(), which returns a score. If the score is below a certain threshold, the target is considered lost in that image and the tracker can be deleted.

In short, once the target box has been determined (manually), the tracker automatically updates its observation position every time an image is passed in. After the update, SingleTracker.bbox can be used to draw the box.

class SingleTracker(dlib.correlation_tracker):
    def __init__(self, tracker_id, category):
        """
        Initialize the single-target tracker.
        :param tracker_id: ID assigned to the tracker
        :param category: Category of the tracked target
        """
        super().__init__()
        # Attribute name lost in the source; `tracker_id` is an assumption
        self.tracker_id = int(tracker_id)
        self.category = str(category)
        self.bbox = None
        self.bbox_color = (
            100 + np.random.randint(0, 155),
            100 + np.random.randint(0, 155),
            100 + np.random.randint(0, 155))

    def begin(self, image, bbox: list or tuple):
        """
        Start observing the region bbox in the input image.
        :param image: Input image
        :param bbox: x_min, y_min, x_max, y_max
        :return: None
        """
        self.bbox = bbox
        self.start_track(image, dlib.rectangle(*bbox))

    def update_bbox(self, image):
        """
        Update the observation region of this tracker according to the input image.
        :param image: Input image
        :return: The tracking quality score of the tracker for the current image
        """
        score = self.update(image)

        curr_pos = self.get_position()
        self.bbox = (int(curr_pos.left()), int(curr_pos.top()),
                     int(curr_pos.right()), int(curr_pos.bottom()))

        return score

4.3 Multi-target tracker

How can a single-target tracker be applied to a multi-target tracking task? A simple idea is to use several trackers, one per target, but three main problems then need to be handled: generating tracking targets, matching old targets against new ones, and handling targets that disappear.

  1. Section 3 (target detection) explained that targets are generated from the prediction boxes of PP-YOLO Tiny;
  2. Section 4.2 (single-target tracker) explained that a target is judged to have disappeared when the tracking quality score returned by its tracker falls below a certain threshold;
  3. Finally, old and new targets are matched with IoU (intersection over union) cascade matching. The result is poor when objects overlap, but this may be less common in cell tracking. Suppose there are N (N ≥ 0) existing tracked targets and M (M ≥ 0) targets returned by the detector. Computing the IoU between every pair gives an N × M matrix, cost_matrix. An assignment algorithm pairs the N tracked targets with the M detections one to one, and the prediction boxes left unpaired become new tracking regions.
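
The matching in item 3 can be tried on a toy example. The sketch below builds the IoU distance matrix (1 - IoU, using the same pixel-inclusive "+1" convention as the class further down) and pairs boxes with a brute-force search over all assignments. The brute force is a stand-in for scipy's linear_sum_assignment that is only practical for a handful of boxes; the function names and sample boxes are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes, counting edge pixels (+1)."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]) + 1.0, 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]) + 1.0, 0.0)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0] + 1.0) * (a[3] - a[1] + 1.0)
    area_b = (b[2] - b[0] + 1.0) * (b[3] - b[1] + 1.0)
    return inter / (area_a + area_b - inter)


def match(tracked, detected):
    """Minimum-cost pairing by exhaustive search.

    Returns (pairs, unmatched detection indices), where pairs holds
    (tracker index, detection index) tuples.
    """
    cost = [[1.0 - iou(t, d) for d in detected] for t in tracked]
    n, m = len(tracked), len(detected)
    if n == 0 or m == 0:
        return [], list(range(m))
    best_pairs, best_cost = None, float('inf')
    if n <= m:
        # Every tracker gets a distinct detection
        for cols in permutations(range(m), n):
            c = sum(cost[i][j] for i, j in enumerate(cols))
            if c < best_cost:
                best_cost, best_pairs = c, list(enumerate(cols))
    else:
        # Every detection gets a distinct tracker
        for rows in permutations(range(n), m):
            c = sum(cost[i][j] for j, i in enumerate(rows))
            if c < best_cost:
                best_cost, best_pairs = c, [(i, j) for j, i in enumerate(rows)]
    matched = {j for _, j in best_pairs}
    return sorted(best_pairs), [j for j in range(m) if j not in matched]


tracked = [(0, 0, 9, 9), (20, 20, 29, 29)]
detected = [(21, 21, 30, 30), (50, 50, 59, 59), (1, 0, 10, 9)]
pairs, new_targets = match(tracked, detected)
print(pairs, new_targets)  # [(0, 2), (1, 0)] [1]
```

Note that a minimum-cost assignment pairs boxes even when their IoU is zero, so a detection can be counted as matched without overlapping its tracker at all. Rejecting pairs whose cost exceeds a threshold would be one possible refinement; it is not part of the original design.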

The MultiTracker class below is designed along these lines. The code logic is fairly straightforward, and the entry point is update_trackers(image) at the bottom:

# Import the linear assignment function used in cascade matching: linear_sum_assignment(cost_matrix, maximize=False)
from scipy.optimize import linear_sum_assignment
class MultiTracker:
    # Default argument values, if any, were lost in the source
    def __init__(self, model_path, det_threshold, stride):
        """
        :param model_path: Path to the trained detection model
        :param det_threshold: Score threshold for filtering predictions
        :param stride: Interval (in frames) for generating new targets
        """
        self.det_threshold = det_threshold
        self.stride = stride
        try:
            from paddlex import load_model
            self.model = load_model(model_path)
        except Exception as e:
            raise e

        self.frame_num = 0  # Frame count
        self.tracker_num = 0  # Tracker ID statistics
        self.trackers = []  # Tracker instance list
        self.tracking_threshold = 6.5  # Tracking score threshold for tracker instances

    def _update_existed_trackers(self, image, is_update_frame=False):
        """
        Move the tracking region of every existing DSST tracker to its position in the current image.
        If the tracking score is lower than self.tracking_threshold, the target is considered lost and its tracker is deleted.
        """
        del_idx = []
        for i in range(len(self.trackers)):
            if self.trackers[i].update_bbox(image) < self.tracking_threshold:
                del_idx.append(i)

        # Lost targets are only removed on update frames
        if is_update_frame:
            self.trackers = [self.trackers[i] for i in range(len(self.trackers)) if i not in del_idx]

    def det_image(self, img):
        """
        Model prediction function; returns the predictions on img as ([xmin, ymin, xmax, ymax], category).
        """
        result = self.model.predict(img.astype('float32'))
        selected_result = []
        for item in result:
            # Skip low-confidence predictions
            if item['score'] < self.det_threshold:
                continue

            # The detector returns (x, y, width, height); convert to corner format
            x_min, y_min, w, h = np.int64(item['bbox'])
            selected_result.append((
                [x_min, y_min, x_min + w, y_min + h],
                item['category']))

        return selected_result

    @staticmethod
    def get_IoU(_bbox1, _bbox2):
        """
        Given the diagonal corners (x_min, y_min, x_max, y_max) of two boxes, compute their intersection over union (IoU).
        """
        x1min, y1min, x1max, y1max = _bbox1
        x2min, y2min, x2max, y2max = _bbox2

        s1 = (y1max - y1min + 1.) * (x1max - x1min + 1.)
        s2 = (y2max - y2min + 1.) * (x2max - x2min + 1.)

        x_min, y_min = max(x1min, x2min), max(y1min, y2min)
        x_max, y_max = min(x1max, x2max), min(y1max, y2max)
        inter_w, inter_h = max(x_max - x_min + 1., 0.), max(y_max - y_min + 1., 0.)

        intersection = inter_h * inter_w
        union = s1 + s2 - intersection

        return intersection / union

    def _add_new_tracker(self, image, bbox: list or tuple, category: str):
        """
        Create a single-target tracker that observes the region bbox in the image.
        """
        tracker = SingleTracker(tracker_id=self.tracker_num, category=category)
        tracker.begin(image=image, bbox=bbox)
        self.trackers.append(tracker)
        self.tracker_num += 1

    def _matching_and_add_trackers(self, image, is_update_frame):
        """
        Match prediction boxes against tracked boxes using the IoU distance. Unmatched predictions are treated as new targets, and a tracker is created for each.
        """
        if not is_update_frame:
            return

        # Run the detector and build the lists of prediction boxes and observation boxes
        predict_result = self.det_image(image)
        predict_bboxes = [bbox for bbox, _ in predict_result]
        tracker_bboxes = [tracker.bbox for tracker in self.trackers]

        # Build the IoU distance matrix (1 - IoU)
        cost_matrix = np.zeros(shape=(len(tracker_bboxes), len(predict_bboxes)), dtype='float32')
        for i in range(len(tracker_bboxes)):
            for j in range(len(predict_bboxes)):
                cost_matrix[i, j] = 1. - self.get_IoU(tracker_bboxes[i], predict_bboxes[j])

        # Get the index pairs (row_i, col_i) that minimize the total distance of the assignment
        row, col = linear_sum_assignment(cost_matrix)

        # Treat every unmatched prediction box as a new target and create a tracker to observe it
        unused_idx = [i for i in range(len(predict_result)) if i not in col]
        for idx in unused_idx:
            bbox, category = predict_result[idx]
            self._add_new_tracker(image, bbox, category)

    def _plot_trackers(self, image):
        """
        Draw the box and label of every tracker on the image and return it.
        Several arguments were lost in the source; the color, fill and font choices below are assumptions.
        """
        thickness = round(0.002 * (image.shape[0] + image.shape[1]) / 2) + 1  # Line thickness
        for tracker in self.trackers:
            # Get the two diagonal vertices of the box
            pt1, pt2 = (tracker.bbox[0], tracker.bbox[1]), (tracker.bbox[2], tracker.bbox[3])

            # Draw the target box
            cv2.rectangle(image,
                          pt1=pt1, pt2=pt2,
                          color=tracker.bbox_color,
                          thickness=thickness)

            # Get the two diagonal vertices of the text box
            w, h = cv2.getTextSize(text=tracker.category,
                                   fontFace=cv2.FONT_HERSHEY_SIMPLEX,
                                   fontScale=thickness / 3,
                                   thickness=max(thickness - 1, 1))[0]
            font_pt1, font_pt2 = pt1, (pt1[0] + w, pt1[1] + h)

            # Fill the background of the text box area
            cv2.rectangle(image,
                          pt1=font_pt1, pt2=font_pt2,
                          color=tracker.bbox_color,
                          thickness=-1)

            # Write the label in the text box area (the ID attribute name is an assumption)
            cv2.putText(image, '{}({})'.format(tracker.category, tracker.tracker_id),
                        org=(font_pt1[0], font_pt2[1]),
                        fontFace=cv2.FONT_HERSHEY_SIMPLEX,
                        fontScale=thickness / 3,
                        color=(225, 255, 255),
                        thickness=max(thickness - 1, 1),
                        lineType=cv2.LINE_AA)

        return image

    def update_trackers(self, image):
        self.frame_num = (self.frame_num + 1) % 864000  # Prevent overflow
        is_update_frame = self.frame_num % self.stride == 1  # Whether trackers are added / removed on this frame

        self._update_existed_trackers(image, is_update_frame)  # Update each tracker's region and delete lost targets
        self._matching_and_add_trackers(image, is_update_frame)  # Match detector boxes against tracker boxes and add new targets

        plotted_image = self._plot_trackers(image)  # Draw the existing trackers on the original image

        return plotted_image
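
The effect of stride in update_trackers can be checked in isolation. The sketch below lists the frames on which trackers would be added or removed for a given stride, using the same `frame_num % stride == 1` test (the wrap-around at 864000 is ignored, and the helper name is hypothetical).

```python
def update_frames(n_frames, stride):
    """Frame numbers (1-based, as counted by frame_num) on which trackers are added or removed."""
    return [f for f in range(1, n_frames + 1) if f % stride == 1]


print(update_frames(10, 4))  # [1, 5, 9]
print(update_frames(10, 1))  # []
```

One consequence worth knowing: with stride=1 the condition is never true, so the detector would never run and no tracker would ever be created. If detection on every frame is wanted, comparing against `1 % stride` instead of `1` would be one way to cover that case; this is a suggestion, not something the original code does.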

4.4 Trying out prediction

A function predict_stream() is defined here. It takes the path of the video stream and the path of the model, and saves the prediction result of each frame (.jpg) into save_dir. The image synthesis function images2mp4() from step 4.1 is then used to turn the results back into a video.

def predict_stream(stream_path, model_path, save_dir):
    """
    Run prediction on every frame of a video stream and save each frame, with its boxes drawn, into a folder.
    :param stream_path: Video stream file path
    :param model_path: Model path
    :param save_dir: Folder in which the pictures are saved
    """
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # Open target video stream
    video = cv2.VideoCapture(stream_path)

    # Define multi-target tracker
    multi_tracker = MultiTracker(

    while True:  # Read each frame of the video stream for prediction
        _, frame = video.read()
        if frame is None:
            break

        # Update the tracking area of each tracker in the multi-target tracker and return the drawn image
        plotted_frame = multi_tracker.update_trackers(frame)

        # Save the drawn border image in the target folder and name it with the frame number
        save_path = os.path.join(save_dir, '%03d.jpg' % (multi_tracker.frame_num - 1))
        cv2.imwrite(save_path, plotted_frame)
# Run inference on the video to generate pictures with tracking labels

# Synthesize the inference result pictures into a video
# Convert the newly synthesized `work/viedio/test_track_result.mp4` to a gif
!ffmpeg -i work/viedio/test_track_result.mp4 -s 320*320 work/viedio/test_track_result.gif -y
  • Some of the tracking results are visibly poor: cells divide and proliferate, and their shapes change much more than those of targets such as pedestrians, which makes prediction harder.

5 Summary and analysis

Following the tracking-by-detection idea, this project uses the computer vision development kit PaddleX to train a PP-YOLO Tiny target detection model to detect cells, then uses DLib's built-in DSST single-target tracking algorithm, together with cascade matching based on the IoU distance between prediction boxes and observation boxes, to implement a multi-target tracking class.

The project can be improved in the following directions:

(1) The performance of the detector (its speed affects the speed of the whole system, and its accuracy affects how well each tracker locks onto its initial target);

(2) The performance of the single-target tracker (the DSST correlation filter can be replaced with a Kalman Filter, which turns the project into the classic SORT algorithm; this is the sense in which, as noted at the beginning, the project paves the way for later study);

(3) The matching algorithm between prediction boxes and tracked boxes (this project uses the IoU distance as the cost, whereas DeepSORT uses cosine distance and Mahalanobis distance);

(4) The generation of prediction boxes, the matching threshold between prediction boxes and tracked boxes, and other parameters (tuning parameters for a specific scene can raise the scores, but this matters little for deployment).
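
As a pointer for direction (3), the cosine distance mentioned there can be sketched as follows. It shows how a cost matrix could be built from appearance feature vectors instead of IoU. The feature vectors are made up for illustration, the function names are hypothetical, and the network that would produce such features is not part of this project.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two feature vectors; 0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def appearance_cost_matrix(track_features, detection_features):
    """Rows are tracked targets, columns are detections, entries are cosine distances."""
    return [[cosine_distance(t, d) for d in detection_features]
            for t in track_features]


tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.0, 2.0], [3.0, 0.0]]
print(appearance_cost_matrix(tracks, detections))  # [[1.0, 0.0], [0.0, 1.0]]
```

Such a matrix could be passed to the same assignment step in place of the IoU distance matrix, which makes matching less dependent on boxes overlapping between frames.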

Tags: Computer Vision paddlepaddle

Posted on Fri, 26 Nov 2021 21:02:16 -0500 by Joseph Witchard