ByteTrack real-time multi-target tracking

Last year I wrote an article on FairMOT real-time multi-target tracking. A year has passed, and the original author of FairMOT has recently released ByteTrack, which is faster and stronger. Writing this article, that earlier one feels like a world away.

Brief introduction

ByteTrack is a recently published state-of-the-art (SOTA) multi-target tracking method. It is the first to exceed 80 MOTA on the MOT17 dataset, and it ranks first on several leaderboards, so it is fair to say it dominates the multi-target tracking rankings. This article mainly shows how to use the ByteTrack source code for real-time tracking on both video files and a camera. The environment here is configured on Ubuntu 18.04. On other operating systems some libraries may fail to install, and you will need to resolve those problems yourself.

The performance comparison of ByteTrack is shown in the figure below. The horizontal axis is inference speed, the vertical axis is MOTA accuracy, and the size of each circle is the IDF1 value. As you can see, ByteTrack outperforms all the previous tracking methods in the comparison.
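
For readers unfamiliar with the two metrics named above, here is a minimal sketch of how they are usually defined (the standard CLEAR MOT and identity-based definitions). The function names and the counts in the usage note are made up for illustration and are not taken from the ByteTrack code or paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

With hypothetical counts, mota(10, 5, 5, 100) and idf1(80, 10, 30) both come out at 0.8. MOTA mostly rewards detection quality, while IDF1 rewards keeping the same identity on the same object over time.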

Here is a brief overview of the algorithm. Tracking-by-detection is a classic and efficient paradigm in MOT: trajectories are obtained by associating detection boxes across frames according to similarity (position, appearance, motion and other cues). Because real scenes are complex, however, detectors rarely produce perfect results. To trade off true positives against false positives, most current MOT methods pick a score threshold, keep only the detections above it for association, and simply discard those below it. The author argues that this strategy is unreasonable. As Hegel said, "existence is reasonable": low-score detection boxes often do indicate real objects (for example, heavily occluded ones). Simply discarding them introduces irreversible errors into MOT, including many missed detections and broken trajectories, and lowers overall tracking performance. The author therefore proposes a new data association method, BYTE, which handles high-score and low-score boxes separately: using the similarity between low-score boxes and existing tracks, it recovers real objects from the low-score boxes and filters out the background. In short, it is a two-stage matching process. See the original paper for the full algorithm.

The reason this strategy works is very similar to earlier methods for handling occlusion: an object does not become occluded instantaneously. Occlusion is accompanied by the detection going from clear to unclear, that is, by a gradual drop in the box's score. Mining low-score detection boxes therefore helps repair tracks that would otherwise be broken, while still maintaining a high running speed.
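
The two-stage matching idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the author's implementation: it uses plain IoU with greedy matching instead of the motion prediction and assignment solver of a real tracker, and the names byte_associate, greedy_match and the threshold values are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, min_iou):
    """Greedily pair track ids with detection indices, best IoU first."""
    pairs = sorted(
        ((iou(box, dets[j][0]), tid, j)
         for tid, box in tracks.items() for j in range(len(dets))),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # every remaining pair overlaps even less
        if tid in used_tracks or j in used_dets:
            continue
        matches[tid] = j
        used_tracks.add(tid)
        used_dets.add(j)
    return matches


def byte_associate(tracks, detections, high_thresh=0.6, low_thresh=0.1, min_iou=0.3):
    """tracks: {track_id: box}; detections: list of (box, score).

    Returns (updated boxes per matched track, unmatched high-score boxes).
    Threshold values are illustrative assumptions.
    """
    high = [d for d in detections if d[1] >= high_thresh]
    low = [d for d in detections if low_thresh <= d[1] < high_thresh]

    # First pass: all tracks against high-score detections.
    first = greedy_match(tracks, high, min_iou)
    updated = {tid: high[j][0] for tid, j in first.items()}

    # Second pass: tracks still unmatched against low-score detections.
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, low, min_iou)
    updated.update({tid: low[j][0] for tid, j in second.items()})

    # Unmatched high-score boxes may start new tracks; unmatched
    # low-score boxes are treated as background and dropped.
    new_candidates = [high[j][0] for j in range(len(high)) if j not in first.values()]
    return updated, new_candidates


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [
    ((1, 1, 11, 11), 0.9),    # clear view of object 1
    ((21, 21, 31, 31), 0.3),  # object 2, partly occluded, low score
    ((50, 50, 60, 60), 0.2),  # low score, no nearby track: background
    ((80, 80, 90, 90), 0.8),  # high score, no nearby track: new object
]
updated, new_candidates = byte_associate(tracks, detections)
```

In this toy frame, track 2 is kept alive only because the second pass looks at the low-score box. A tracker that discarded everything below the high threshold would lose it.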

Environment configuration

This section describes how to configure the environment for the project. Make sure that Git and Conda are installed, along with a graphics card driver that supports CUDA 10.2 or above.

Execute the following commands line by line. Note that PyTorch and CUDA are installed through conda, so the torch and torchvision lines in the requirements.txt file need to be deleted first.
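
If you would rather not edit requirements.txt by hand, the deletion can be scripted. The helper below is a minimal sketch. The function names are my own, and it assumes the file lists one package per line, optionally followed by a version specifier.

```python
import re
from pathlib import Path


def strip_packages(text, names=("torch", "torchvision")):
    """Return the requirements text without the lines for the given packages."""
    kept = []
    for line in text.splitlines():
        # The package name is whatever precedes the first version specifier.
        pkg = re.split(r"[<>=!~\[; ]", line.strip(), maxsplit=1)[0].lower()
        if pkg not in names:
            kept.append(line)
    return "\n".join(kept) + "\n"


def rewrite_requirements(path="requirements.txt"):
    """Rewrite the file in place (hypothetical helper; back up the file first)."""
    req = Path(path)
    req.write_text(strip_packages(req.read_text()))
```

Calling rewrite_requirements() from the project root before the pip install -r requirements.txt step has the same effect as deleting the two lines manually.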

git clone
cd ByteTrack/
conda create -n bytetrack python=3.8 -y
conda activate bytetrack
conda install pytorch=1.7.1 torchvision cudatoolkit -y
pip install -r requirements.txt
python develop
pip install cython
pip install 'git+'
pip install cython_bbox

At this point, the environment for model inference is installed. The official repository also provides a tutorial for configuring a Docker environment, which is not covered here. If you are interested, please refer to the author's README.
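
Before moving on, it can be worth checking that the key packages are importable inside the bytetrack environment. The snippet below is a small sketch. The package list is my assumption about what the demo needs, and check_environment is a hypothetical helper name.

```python
import importlib.util


def check_environment(packages=("torch", "torchvision", "cv2", "cython_bbox")):
    """Return {package: True/False} depending on whether it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    status = check_environment()
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'MISSING'}")
    if status["torch"]:
        import torch
        # False here usually points to a driver or cudatoolkit mismatch.
        print("CUDA available:", torch.cuda.is_available())
```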

Model Download

The models are trained on CrowdHuman, MOT17, Cityperson and ETHZ. The download links are listed below. The metrics in the table are measured on the MOT17 training set.

Model               Download                        MOTA   IDF1   IDs   FPS
bytetrack_x_mot17   [google], [baidu(code:ic0i)]    90.0   83.3   422   29.6
bytetrack_l_mot17   [google], [baidu(code:1cml)]    88.7   80.7   460   43.7
bytetrack_m_mot17   [google], [baidu(code:u3m4)]    87.0   80.1   477   54.1
bytetrack_s_mot17   [google], [baidu(code:qflm)]    79.2   74.3   533   64.5

This article uses the lightest version, s, as an example, so download the bytetrack_s_mot17.pth.tar file. After downloading, create a new models folder in the root directory of the project and put the file in it.

Real-time tracking

With the environment configured and the model downloaded, you can run inference on the demo video file provided by the author with the following command.

video file

python tools/ video -f exps/example/mot/ -c ./models/bytetrack_s_mot17.pth.tar --path ./videos/palace.mp4 --fp16 --fuse --save_result

Logs are printed during inference, and a YOLOX_outputs directory is generated in the current directory. The tracking results produced by inference are stored there.

The meanings of some of the options are as follows.

  • demo: task type; required; one of image, video or webcam
  • -f: model configuration file
  • -c: model weights (checkpoint) file
  • --path: path of the file to run inference on
  • --save_result: save the inference result
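
To make the shape of the command line concrete, here is an illustrative argparse sketch that accepts the options used in this article. It is not the author's parser: the long option names, defaults and help strings are assumptions, and the real script accepts more options than shown.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the options used in this article."""
    parser = argparse.ArgumentParser("ByteTrack demo options (sketch)")
    parser.add_argument("demo", choices=["image", "video", "webcam"], help="task type")
    parser.add_argument("-f", "--exp_file", help="model configuration file")
    parser.add_argument("-c", "--ckpt", help="model weights file")
    parser.add_argument("--path", help="file to run inference on")
    parser.add_argument("--camid", type=int, default=0, help="camera number")
    parser.add_argument("--save_result", action="store_true", help="save the result")
    parser.add_argument("--fp16", action="store_true")
    parser.add_argument("--fuse", action="store_true")
    return parser


# Hypothetical file names, used only to show how the options are parsed.
args = build_parser().parse_args(
    ["video", "-f", "exp.py", "-c", "model.pth.tar", "--path", "clip.mp4", "--save_result"]
)
```

The positional demo argument selects the input source, which is why the video command passes --path while the webcam command passes --camid instead.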

We tested the s model on a scene from the VisDrone dataset, with the results shown below. Because this is the lightest model, its accuracy is not very high and many small targets are missed, but it is very fast. For more accurate results, try the larger m, l and x models.

Next, I use a camera for real-time tracking. Here I use a convenient USB camera. The author already provides an interface for camera data streams, which saves the result video once inference is finished. When tracking with a camera, however, we usually want to watch the result in real time. To do that, the imageflow_demo function in the author's tools/demo_track.py needs to be modified as follows.

# Relies on names already imported in the author's script: cv2, os, time,
# logger, BYTETracker, Timer and plot_tracking. `exp` is not a parameter;
# it is resolved from the script's module scope.
def imageflow_demo(predictor, vis_folder, current_time, args):
    cap = cv2.VideoCapture(args.path if args.demo == "video" else args.camid)
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)  # float
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)  # float
    fps = cap.get(cv2.CAP_PROP_FPS)
    save_folder = os.path.join(
        vis_folder, time.strftime("%Y_%m_%d_%H_%M_%S", current_time)
    )
    os.makedirs(save_folder, exist_ok=True)
    if args.demo == "video":
        save_path = os.path.join(save_folder, args.path.split("/")[-1])
    else:
        save_path = os.path.join(save_folder, "camera.mp4")
    logger.info(f"video save_path is {save_path}")
    vid_writer = cv2.VideoWriter(
        save_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (int(width), int(height))
    )
    tracker = BYTETracker(args, frame_rate=30)
    timer = Timer()
    frame_id = 0
    results = []
    while True:
        if frame_id % 20 == 0:
            logger.info('Processing frame {} ({:.2f} fps)'.format(frame_id, 1. / max(1e-5, timer.average_time)))
        ret_val, frame = cap.read()
        if ret_val:
            outputs, img_info = predictor.inference(frame, timer)
            online_targets = tracker.update(outputs[0], [img_info['height'], img_info['width']], exp.test_size)
            online_tlwhs = []
            online_ids = []
            online_scores = []
            for t in online_targets:
                tlwh = t.tlwh  # (top-left x, top-left y, width, height)
                tid = t.track_id
                # Skip boxes that are much wider than tall or too small.
                vertical = tlwh[2] / tlwh[3] > 1.6
                if tlwh[2] * tlwh[3] > args.min_box_area and not vertical:
                    online_tlwhs.append(tlwh)
                    online_ids.append(tid)
                    online_scores.append(t.score)
            timer.toc()
            results.append((frame_id + 1, online_tlwhs, online_ids, online_scores))
            online_im = plot_tracking(img_info['raw_img'], online_tlwhs, online_ids, frame_id=frame_id + 1,
                                      fps=1. / timer.average_time)
            if args.save_result:
                vid_writer.write(online_im)
            # Added: show each tracked frame in a window as it is processed.
            cv2.imshow("demo", online_im)
            ch = cv2.waitKey(1)
            if ch == 27 or ch == ord("q") or ch == ord("Q"):  # Esc or q quits
                break
        else:
            break
        frame_id += 1
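
The box filter inside the loop is easy to miss, so here it is isolated as a standalone sketch. The helper name keep_box and the min_box_area value in the usage note are assumptions for illustration. The 1.6 ratio is the one used in the function above.

```python
def keep_box(tlwh, min_box_area, max_aspect_ratio=1.6):
    """Return True if a (top-left x, top-left y, width, height) box should be drawn."""
    _, _, w, h = tlwh
    if h <= 0:
        return False  # degenerate box; also avoids dividing by zero
    too_wide = w / h > max_aspect_ratio
    return w * h > min_box_area and not too_wide
```

With a hypothetical min_box_area of 10, a 20x50 box is kept, a 50x20 box is rejected for being too wide, and a 2x3 box is rejected for being too small. Since the models here are trained on pedestrian datasets, rejecting boxes that are much wider than tall is a cheap way to drop implausible detections.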

You can now watch the tracking result for the camera feed in real time with the following command (the --camid option specifies the camera number). I have tested the code and it runs successfully, but for privacy reasons I will not show the tracking video here.

python tools/ webcam -f exps/example/mot/ -c ./models/bytetrack_s_mot17.pth.tar --fp16 --fuse --save_result --camid 0

Supplementary notes

This article applies ByteTrack to real-time multi-target tracking on video streams. It is a small demo built on the ByteTrack code in my spare time. The video used here is for illustration only. If it infringes any rights, please contact me and I will remove it.

Posted on Sat, 23 Oct 2021 23:23:24 -0400 by mazman13