1 Overview
Many embedded devices today need target-tracking applications, for example missiles (infrared guidance) and UAVs (follow-me shooting). Given the small size, low power budget, and limited computing power of embedded devices, a suitable algorithm must be chosen for the embedded platform. Here we discuss only the FPGA-based implementation. As long as the algorithm does not involve an indeterminate number of iterations, an FPGA can generally outperform an ARM processor in computation speed, and the FPGA's data transmit and receive capability is also very strong.
The computation of a correlation-filter algorithm is very well suited to an FPGA: it has a closed-form solution and needs no iterative optimization. Since an FPGA implementation must be organized as a pipeline, some optimization of the computation flow is of course required beforehand; the algorithm cannot be ported without any changes. The algorithm follows the paper "Staple: Complementary Learners for Real-Time Tracking", which adopts the correlation-filtering framework and combines HOG features with a color histogram, giving strong robustness and scale estimation. Later work seems to just add neural networks to extract stronger features without changing the overall framework, so I have not followed it closely since.
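As a one-line summary of that closed-form solution (with symbols chosen to match the hf_num, hf_den, and p.lambda variables in the MATLAB code later in this article): if \(\hat{Y}\) is the FFT of the desired Gaussian response and \(\hat{X}^{l}\) the FFT of feature channel \(l\), the ridge-regression filter in the Fourier domain is

```latex
\hat{H}^{l} \;=\; \frac{\overline{\hat{Y}} \odot \hat{X}^{l}}
  {\sum_{k=1}^{K} \hat{X}^{k} \odot \overline{\hat{X}^{k}} \;+\; \lambda}
```

Each filter pixel comes from a single element-wise division (numerator hf_num, summed denominator hf_den plus the regularizer lambda); no gradient descent is involved, which is what makes a fixed-latency FPGA pipeline possible.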
2 MATLAB implementation process
It mainly includes response computation and template training. If a subfunction's code is not shown here, you can leave a message in the comments. The formula names, MATLAB variables, and Verilog variables in this article are kept consistent; any problems can be discussed in detail. Link to the original MATLAB code: https://github.com/bertinetto/staple.
Response calculation process:
%% TESTING step
% extract patch of size bg_area and resize to norm_bg_area
im_patch_cf = getSubwindow(im, pos, p.norm_bg_area, bg_area);
pwp_search_area = round(p.norm_pwp_search_area / area_resize_factor);
% extract patch of size pwp_search_area and resize to norm_pwp_search_area
im_patch_pwp = getSubwindow(im, pos, p.norm_pwp_search_area, pwp_search_area);
% compute feature map
xt = getFeatureMap(im_patch_cf, p.feature_type, p.cf_response_size, p.hog_cell_size);
% apply Hann window
xt_windowed = bsxfun(@times, hann_window, xt);
% compute FFT
xtf = fft2(xt_windowed);
% Correlation between filter and test patch gives the response
% Solve diagonal system per pixel.
if p.den_per_channel
    hf = hf_num ./ (hf_den + p.lambda);
else
    hf = bsxfun(@rdivide, hf_num, sum(hf_den, 3) + p.lambda);
    %hf = bsxfun(@rdivide, hf_num, sum_hf_den + p.lambda);
end
conj_hf_xtf = conj(hf) .* xtf;
iconj_hf_xtf = ifft2(sum(conj_hf_xtf, 3));
response_cf = ensure_real(iconj_hf_xtf);
% Crop square search region (in feature pixels).
response_cf = cropFilterResponse(response_cf, ...
floor_odd(p.norm_delta_area / p.hog_cell_size));
if p.hog_cell_size > 1
    % Scale up to match center likelihood resolution.
    response_cf = mexResize(response_cf, p.norm_delta_area, 'auto');
end
[likelihood_map] = getColourMap(im_patch_pwp, bg_hist, fg_hist, p.n_bins, p.grayscale_sequence);
% (TODO) in theory it should be at 0.5 (unseen colors should have max entropy)
likelihood_map(isnan(likelihood_map)) = 0;
% each pixel of response_pwp loosely represents the likelihood that
% the target (of size norm_target_sz) is centred on it
response_pwp = getCenterLikelihood(likelihood_map, p.norm_target_sz);
%% ESTIMATION
response = mergeResponses(response_cf, response_pwp, p.merge_factor, p.merge_method);
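The testing-step arithmetic above can be sketched in a few lines of NumPy. This is a toy stand-in, not the full tracker: array names mirror the MATLAB variables, but the filter statistics, features, and merge factor are random placeholders, and the merge shown is the simple const_factor blend.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 64, 64, 3                      # toy response size and channel count

# Toy stand-ins for the trained filter statistics and the test features.
hf_num = rng.standard_normal((H, W, C)) + 1j * rng.standard_normal((H, W, C))
hf_den = np.abs(rng.standard_normal((H, W, C))) ** 2
lam = 1e-3                               # p.lambda, the ridge regularizer

xt = rng.standard_normal((H, W, C))      # feature map (e.g. HOG channels)
hann = np.outer(np.hanning(H), np.hanning(W))
xt_windowed = xt * hann[:, :, None]      # bsxfun(@times, hann_window, xt)
xtf = np.fft.fft2(xt_windowed, axes=(0, 1))

# hf = bsxfun(@rdivide, hf_num, sum(hf_den, 3) + p.lambda)
hf = hf_num / (hf_den.sum(axis=2, keepdims=True) + lam)

# response_cf = ensure_real(ifft2(sum(conj(hf) .* xtf, 3)))
response_cf = np.real(np.fft.ifft2((np.conj(hf) * xtf).sum(axis=2)))

# Merge with a (toy) color-histogram response, as mergeResponses does
# for the 'const_factor' merge method.
response_pwp = rng.random((H, W))
merge_factor = 0.3                       # p.merge_factor
response = (1 - merge_factor) * response_cf + merge_factor * response_pwp

# Estimated position = location of the maximum of the merged response.
row, col = np.unravel_index(np.argmax(response), response.shape)
print(response.shape, (row, col))
```

The cropping and resizing steps (cropFilterResponse, mexResize) are omitted here; they only change which window of the response is searched for the peak.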
Template training process:
%% TRAINING
% extract patch of size bg_area and resize to norm_bg_area
im_patch_bg = getSubwindow(im, pos, p.norm_bg_area, bg_area);
pos_r = pos;
% compute feature map, of cf_response_size
xt = getFeatureMap(im_patch_bg, p.feature_type, p.cf_response_size, p.hog_cell_size);
% apply Hann window
xt = bsxfun(@times, hann_window, xt);
% compute FFT
xtf = fft2(xt);
%% FILTER UPDATE
% Compute expectations over circular shifts,
% therefore divide by number of pixels.
new_hf_num1 = bsxfun(@times, conj(yf), xtf);
new_hf_den1 = (conj(xtf) .* xtf);
new_hf_num = new_hf_num1 / prod(p.cf_response_size);
new_hf_den = new_hf_den1 / prod(p.cf_response_size);
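The training snippet above only computes the per-frame statistics new_hf_num and new_hf_den; in the full Staple tracker these are blended into the running filter statistics with a learning rate. A minimal NumPy sketch, where interp_factor is an assumed stand-in for the learning-rate parameter and the inputs are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C = 64, 64, 3
yf = np.fft.fft2(rng.standard_normal((H, W)))   # FFT of desired Gaussian response
xtf = np.fft.fft2(rng.standard_normal((H, W, C)), axes=(0, 1))

# new_hf_num = bsxfun(@times, conj(yf), xtf) / prod(p.cf_response_size)
# new_hf_den = (conj(xtf) .* xtf)            / prod(p.cf_response_size)
n_pixels = H * W
new_hf_num = np.conj(yf)[:, :, None] * xtf / n_pixels
new_hf_den = (np.conj(xtf) * xtf).real / n_pixels   # |xtf|^2 is real

# Running-average update of the filter statistics (interp_factor is an
# assumed stand-in for the tracker's CF learning rate):
interp_factor = 0.01
hf_num = np.zeros_like(new_hf_num)   # numerator accumulated from earlier frames
hf_den = np.zeros_like(new_hf_den)
hf_num = (1 - interp_factor) * hf_num + interp_factor * new_hf_num
hf_den = (1 - interp_factor) * hf_den + interp_factor * new_hf_den
print(hf_num.shape, hf_den.dtype)
```

Keeping the numerator and denominator as separate running averages (rather than averaging the filter itself) is what allows the division by hf_den + lambda to be deferred to the testing step.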
3 FPGA implementation process
3.1 Overall structure
The FPGA implementation follows the principle of pipelined processing, so parts of the algorithm's computation flow must be approximated to achieve high-frame-rate processing while keeping the impact on tracking quality small.
In the STAPLE algorithm, the target position is estimated first; the filter template is then updated using the estimated position, and the scale space is extracted and the target scale computed at that position. On the FPGA, however, the tracking computation and the image transfer proceed simultaneously, so position estimation, filter-template update, and scale estimation must be synchronized. The template update and scale estimation are therefore approximated: the target information from the previous frame is used to update the template and generate the scale space. The schematic diagram of this computation-structure optimization is shown in the figure below.
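The one-frame-delay approximation described above can be sketched in software as follows. The function names are hypothetical placeholders for the FPGA stages; the point is only the data flow: template update and scale-space generation consume the position estimated on the previous frame, so none of the three stages waits on the current frame's result.

```python
def estimate_position(frame, pos):
    # Hypothetical stand-in: pretend the target drifts right by 1 px/frame.
    return (pos[0], pos[1] + 1)

def update_template(frame, pos):
    pass  # placeholder for the filter-template update at the given position

def build_scale_space(frame, pos):
    pass  # placeholder for extracting the scale patches around pos

def track_pipelined(frames, init_pos):
    prev_pos = init_pos
    out = []
    for frame in frames:
        # On the FPGA these three stages run concurrently; in software we
        # just order them so the update never depends on this frame's result.
        new_pos = estimate_position(frame, prev_pos)   # uses current frame
        update_template(frame, prev_pos)               # one frame behind
        build_scale_space(frame, prev_pos)             # one frame behind
        prev_pos = new_pos
        out.append(new_pos)
    return out

positions = track_pipelined(frames=range(3), init_pos=(100, 100))
print(positions)  # → [(100, 101), (100, 102), (100, 103)]
```

Since the target moves little between consecutive frames, updating the template and scale space with last frame's position has only a small effect on tracking accuracy.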
3.2 Module division
The STAPLE algorithm is divided into two main parts, target position estimation and target scale estimation, which are independent of each other and processed in parallel. For position estimation, the 1-D grayscale feature and the 32-D HOG feature of the target area are extracted and correlation filtering is performed; combined with the color-histogram information this yields the final position response map, whose maximum gives the estimated target position. For scale estimation, seven scale levels are generated and the response of each is computed; the scale with the largest response is the estimated target scale. The block diagram of the module structure is shown in the figure below. In this experiment the algorithm module runs in a 150 MHz clock domain.
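The seven-level scale search can be sketched as below. The scale step and the response function are assumed placeholders (a DSST-style geometric scale grid centered on 1.0); the source only fixes the number of levels at seven and the argmax selection.

```python
import numpy as np

rng = np.random.default_rng(2)
num_scales = 7
scale_step = 1.02        # assumed per-level scale factor (DSST-style grid)
# Scale factors centered on 1.0: step^(-3) ... step^(+3)
scale_factors = scale_step ** (np.arange(num_scales) - num_scales // 2)

def scale_response(frame, pos, size, factor):
    # Hypothetical stand-in for: resample a patch of size*factor around pos,
    # extract features, correlate with the scale filter, return the peak.
    return float(rng.random())

frame, pos, target_sz = None, (100, 100), (64, 64)
responses = [scale_response(frame, pos, target_sz, f) for f in scale_factors]
best = int(np.argmax(responses))      # scale level with the largest response
estimated_scale = scale_factors[best]
print(len(scale_factors), estimated_scale)
```

Because every level's response is computed independently, the seven evaluations map naturally onto parallel hardware, which is why scale estimation can run alongside position estimation.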
The ISE project structure is as follows:
3.3 Result introduction
PC configuration: Intel i5-6500 CPU @ 3.2 GHz, 8 GB RAM; the FPGA development tool is ISE 14.7, the simulation tool is ModelSim SE 10.1c, and the MATLAB version is 2017b.
Simulation waveform and calculation time statistics:
| No. | Name | Effective clock cycles | FPGA time (ms) | PC time (ms) | Speedup (PC/FPGA) |
| --- | --- | --- | --- | --- | --- |
| A | Original image frame input | at least 312010 | at least 2.08 | 6.7 | |
| B | Image block extraction | up to 166400 | up to 0.111 | 0.372 | |
| C | Interpolation and HOG feature extraction | about 82790 | about 0.083 | 1.2 | 14.5 |
| D | Position estimation | about 386900 | about 0.387 | 3.3 | 8.5 |
| E | Color histogram extraction and matching | about 35400 | about 0.236 | 1.4 | 5.9 |
| F | Scale estimation | about 2716900 | about 2.717 | 4.2 | 1.5 |
| Total | | | at least 2.9 | 17.172 | |
The maximum supported target size is 256×256 pixels, and the input is a 640×480 color image. With the algorithm module running at a 150 MHz clock, extracting one HOG feature map takes 83 us, computing the tracking position takes 0.387 ms, and computing the scale takes 2.717 ms; position and scale computation are independent of each other. In theory the processing frame rate can exceed 286 FPS, with LUT utilization at 48% and storage-resource utilization at 42%. The method is robust to scale change and deformation of the target while using few hardware resources. In subsequent posts I will introduce the implementation idea and process of each module in turn, with the Verilog code of some modules attached. If anything is wrong, please leave a message to correct it.
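The cycle-count-to-time conversion at the 150 MHz module clock can be checked directly; for example, rows A and E of the table above:

```python
clk_hz = 150e6                      # algorithm module clock, 150 MHz

def cycles_to_ms(cycles, clk=clk_hz):
    """Convert a clock-cycle count to milliseconds at the given clock rate."""
    return cycles / clk * 1e3

print(round(cycles_to_ms(312010), 2))   # frame input:      → 2.08 (ms)
print(round(cycles_to_ms(35400), 3))    # color histogram:  → 0.236 (ms)
```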