Multi-hierarchical Independent Correlation Filters for Visual Tracking
For visual tracking, most traditional correlation filter (CF) based methods suffer from feature redundancy and a lack of motion information. In this paper, we design a novel tracking framework, called multi-hierarchical independent correlation filters (MHIT). The framework consists of a motion estimation module, hierarchical feature selection, independent CF online learning, and adaptive multi-branch CF fusion. Specifically, the motion estimation module is introduced to capture motion information, which effectively alleviates partial occlusion of the object in the video sequence. The multi-hierarchical deep features of a CNN, which represent different semantic information, can be fully exploited to track multi-scale objects. To better overcome deep feature redundancy, the features of each hierarchy are independently fed into a single branch for online learning of the parameters. Finally, an adaptive weight scheme is integrated into the framework to fuse these independent multi-branch CFs for more robust visual object tracking. Extensive experiments on the OTB and VOT datasets show that the proposed MHIT tracker significantly improves tracking performance. In particular, it obtains a 20.1% relative performance gain compared to the top tracker on the VOT2017 challenge, and also achieves new state-of-the-art performance on the VOT2018 challenge.
Visual object tracking is the task of continually localizing a target in a video given only its position in the first frame. It has many real-world applications, such as automatic driving, robotic services, and object surveillance. However, it also faces complex situations, such as foreground occlusions, illumination changes, and appearance changes. Therefore, how to design a robust tracker has drawn a significant amount of interest from both academia and industry.
Many attempts have been made to improve the performance of trackers in recent years. One group of methods mainly adopts deep CNN features to train a tracker in an end-to-end manner [2, 35, 23, 40]. This type of tracker obtains features with strong discriminative ability by pretraining the CNN model offline on large-scale datasets (e.g., ImageNet and Youtube-BB). Another group consists of correlation filter (CF) based methods, which mainly use the circulant matrix to generate dense samples for the online learning of CF parameters. The CF based trackers [10, 6, 33, 7] rely on strong feature representations with a large number of parameters and frequent online updates. Researchers have gradually shifted attention from traditional handcrafted features (e.g., HOG, Color Names (CN)) to more powerful multi-level CNN features.
However, most CF based trackers simply concatenate all kinds of features, which can be regarded as multiple feature fusion; this introduces much redundancy and buries the characteristics of the hierarchical features. Moreover, a massive number of trainable parameters increases the risk of severe over-fitting, as indicated in ECO. Besides, due to the large number of parameters and the sparse features of CNNs, only relatively shallow networks with limited hierarchy, such as AlexNet and VGG, have been used in visual tracking over the past few years. Recently, deeper networks (e.g., ResNet, DenseNet and SE-ResNet) have shown better performance on many computer vision tasks. However, how to effectively utilize rich deep CNN features to construct a robust tracker remains a challenging problem.
In this paper, we fully exploit multi-level deep convolutional features to form a more robust yet efficient tracking framework, called MHIT. Instead of directly concatenating multi-level deep features as in previous trackers, we use the multi-hierarchical deep features independently for the online learning of different branches, and adaptively fuse the respective CF score maps. To further improve the robustness of our framework, motion estimation is introduced to exploit temporal continuity and overcome difficult situations such as complete occlusion and deformation.
In summary, we make the following contributions:
We propose a novel CF tracking framework, called MHIT, which efficiently fuses the multi-branch independent solutions of CF via an adaptive weight strategy for more robust and reliable tracking.
Different hierarchical features are independently fed into different branches to update the CF parameters online, which substantially alleviates the curse of dimensionality of conventional multi-feature fusion. A motion estimation module is also introduced to capture motion information.
2 Related Work
Among CNN based methods, Li et al. exploit an end-to-end CNN training approach to turn the tracking problem into a classification problem. MDNet further combines offline multi-domain training with online classifier updates to identify specific targets. Following the end-to-end idea, some works further use a Siamese matching structure to learn a similarity measure, regarding the DCF as part of the network. SiamFC trains offline on the ILSVRC dataset and does not update the parameters online. DCFNet presents an end-to-end network architecture that learns the convolutional features and performs the correlation tracking process simultaneously. SiamRPN introduces feature extraction and region proposal subnetworks, the latter including a classification branch and a regression branch. DaSiamRPN builds on SiamRPN to learn distractor-aware features and explicitly suppress distractors during online tracking inference. SiamVGG replaces the AlexNet base network of SiamFC with VGG to improve tracking performance. These methods typically take the ground truth of the first frame as the template or employ a simple moving average strategy to update the template.
Correlation filter algorithms have received extensive attention in visual tracking due to their high computational efficiency in the Fourier domain. Bolme et al. propose a CF tracker by learning a minimum output sum of squared error (MOSSE) filter for target appearance, which is able to run at high speed. CSK uses a circulant matrix for dense sampling to generate a large number of samples with low computational load. KCF adopts ridge regression and multi-channel features to solve for the correlation filter parameters. SRDCF applies a negative Gaussian penalty weight to the filter parameters to overcome the boundary effect. DeepSRDCF introduces CNN features into SRDCF and achieves good results. C-COT further converts feature maps of different resolutions into a continuous spatial domain to achieve better accuracy. The subsequent ECO improves the C-COT tracker in terms of performance and efficiency. Based on ECO, CFWCR normalizes each individual feature extracted from different layers to get more robust results. Feature design has shifted from hand-crafted features such as CN and HOG to CNN features.
3 Proposed Method
3.1 Correlation Filter for Visual Tracking
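We first review the traditional correlation filters algorithm. Each sample $x_k$ contains $D$ feature channels $x_k^1, x_k^2, \dots, x_k^D$, extracted from the same image patch, where $k$ is the index of the samples. Assume that $f = \{f^1, f^2, \dots, f^D\}$ is the corresponding set of $D$ channel filters. The correlation filters algorithm can be formulated as:
$$E(f) = \sum_{k} \Big( \sum_{d=1}^{D} \langle f^d, \varphi(x_k^d) \rangle - y_k \Big)^2 + \lambda \sum_{d=1}^{D} \| f^d \|^2, \tag{1}$$
where $x_k$ is the $k$-th cyclic shift sample of the base sample $x$, $y_k$ is the Gaussian response label, $\varphi$ is the feature mapping, and $\lambda$ is the regularization weight. The optimization problem in Eq.1 can be solved efficiently in the Fourier domain. Eq.1 is minimized by $f = \sum_k \alpha_k \varphi(x_k)$, where the coefficient $\alpha$ is computed with:
$$\alpha = \mathcal{F}^{-1}\left( \frac{\mathcal{F}(y)}{\mathcal{F}(\langle \varphi(x), \varphi(x) \rangle) + \lambda} \right),$$
where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, respectively. Given $\alpha$ and the appearance model $\bar{x}$, we can get the response map of a new patch $z$ by:
$$g(z) = \mathcal{F}^{-1}\left( \mathcal{F}(\alpha) \odot \mathcal{F}(\langle \varphi(z), \varphi(\bar{x}) \rangle) \right).$$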
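The training and detection steps reviewed above can be illustrated with a minimal NumPy sketch. It assumes a linear kernel, a single training sample, and channel-wise summation; the function names (`gaussian_label`, `kernel_hat`, `train`, `detect`, `peak_position`) and the default regularization value are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Gaussian response label whose peak sits at index (0, 0), wrapped cyclically."""
    h, w = shape
    dy = np.minimum(np.arange(h), h - np.arange(h))  # cyclic distance to row 0
    dx = np.minimum(np.arange(w), w - np.arange(w))  # cyclic distance to column 0
    return np.exp(-(dy[:, None] ** 2 + dx[None, :] ** 2) / (2.0 * sigma ** 2))

def kernel_hat(a, b):
    """Fourier transform of the linear-kernel correlation between a and b.

    a, b: arrays of shape (H, W, D); the D channels are summed.
    """
    a_hat = np.fft.fft2(a, axes=(0, 1))
    b_hat = np.fft.fft2(b, axes=(0, 1))
    return np.sum(np.conj(a_hat) * b_hat, axis=2) / a.size

def train(x, y, lam=1e-4):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat^{xx} + lambda)."""
    return np.fft.fft2(y) / (kernel_hat(x, x) + lam)

def detect(alpha_hat, x_model, z):
    """Response map of a new patch z against the appearance model x_model."""
    return np.real(np.fft.ifft2(alpha_hat * kernel_hat(x_model, z)))

def peak_position(response):
    """(row, col) of the response maximum, i.e. the estimated cyclic shift."""
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)
```

In this sketch, a patch that is a cyclic shift of the training sample produces a response peak at that shift, which is how the target displacement is read off the response map.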
3.2 Motion Estimation Module
Most existing DCF trackers only consider appearance features of the current frame and hardly benefit from motion information. The lack of temporal information degrades tracking performance under challenges such as partial occlusion and deformation. Our proposed tracker uses a motion estimation module to make full use of motion information.
Kalman filtering is an algorithm that uses a series of measurements observed over time, which estimates a process by using a form of feedback control. The equations for Kalman filters fall in two groups: time update equations and measurement update equations. The time update equations can also be regarded as predictor equations, while the measurement update equations can be regarded as corrector equations. The time update projects the current state estimate ahead in time. The measurement update adjusts the projected estimate by an actual measurement at that time.
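Time update equations can be formulated as:
$$\hat{x}_k^- = A \hat{x}_{k-1} + B u_{k-1}. \tag{5}$$
In Eq.5, $\hat{x}_k^-$ is a vector representing the predicted process state at time $k$ before the measurement update, $u_{k-1}$ is a control vector, and $B$ relates the optional control vector to the state space. $\hat{x}_k$ is a 4-dimensional vector $[x, y, \dot{x}, \dot{y}]^T$, where $x$ and $y$ represent the coordinates of the center of the target, and $\dot{x}$ and $\dot{y}$ represent its velocity. Therefore, with a unit time step between frames, the process transition matrix $A$ can be expressed as:
$$A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
The predicted estimate covariance matrix can be formulated as:
$$P_k^- = A P_{k-1} A^T + Q_k,$$
where $P_{k-1}$ is the posterior estimate error covariance matrix at time $k-1$, $P_k^-$ measures the accuracy of the estimate at time $k$ before the measurement update, and $Q_k$ is the process noise covariance at time $k$.
After the time update, the Kalman filter uses the measurement $z_k$ to correct its prediction during the measurement update steps. The measurement residual can be expressed as:
$$\tilde{y}_k = z_k - H_k \hat{x}_k^-,$$
where $H_k$ is the matrix converting the state space into the measurement space at time $k$. The covariance of the measurement residual can be expressed as:
$$S_k = H_k P_k^- H_k^T + R_k,$$
where $R_k$ is the measurement noise covariance at time $k$. The optimal Kalman gain can be formulated as:
$$K_k = P_k^- H_k^T S_k^{-1}.$$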
The motion of an object has a certain regularity, so changes in the size of the bounding box and in the center coordinates follow a certain pattern. The trajectory of the target generated by the DCF framework is not smooth enough to satisfy this motion pattern. After obtaining the final center position from the DCF tracker, we use it as the observation of the current state to correct the Kalman filter. We thereby obtain a more accurate estimate with less noise.
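The predict-correct cycle described above can be sketched as a constant-velocity Kalman filter over the target center. This is a minimal illustration using the standard Kalman equations with no control input; the class name, the noise levels `q` and `r`, and the identity initial covariance are assumptions made for the example, not values from the paper.

```python
import numpy as np

class ConstantVelocityKalman:
    """State is [x, y, vx, vy]; only the center (x, y) is observed."""

    def __init__(self, x0, y0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])            # state estimate
        self.P = np.eye(4)                               # estimate error covariance
        self.A = np.array([[1.0, 0.0, 1.0, 0.0],         # transition, unit time step
                           [0.0, 1.0, 0.0, 1.0],
                           [0.0, 0.0, 1.0, 0.0],
                           [0.0, 0.0, 0.0, 1.0]])
        self.H = np.array([[1.0, 0.0, 0.0, 0.0],         # state -> measurement
                           [0.0, 1.0, 0.0, 0.0]])
        self.Q = q * np.eye(4)                           # process noise (assumed)
        self.R = r * np.eye(2)                           # measurement noise (assumed)

    def predict(self):
        """Time update: project state and covariance one frame ahead."""
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:2].copy()

    def correct(self, z):
        """Measurement update with the center z reported by the CF tracker."""
        residual = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R          # residual covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ residual
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2].copy()
```

A typical per-frame use would call `predict()` to place the search window, run the correlation filters, and then pass the detected center to `correct()`.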
In the inference phase, we obtain the center point of the object by the motion estimation model and then we generate a cosine window as motion map centering on it. After that, we expand the patch to get the accurate search area to predict the location of the target. The Gaussian motion map is used to multiply the feature map generated by the CNN. Through the above steps, a more accurate estimation of the center point of the object is obtained, and the problem of center point drift caused by deformation and partial occlusion is alleviated.
3.3 Hierarchical Features Selection
Lin et al. propose feature pyramid networks (FPN), which use top-down and lateral connections to capture objects at different scales. During online tracking, the size of the target changes frequently. Therefore, hierarchical features are introduced into our tracking framework. In this subsection, we investigate the characteristics of different layers and the proper ways to combine them.
The semantic information of CNN layers, from shallow to deep, affects the tracking problem in different ways. The performance of different layers in three tracking situations (occlusion, illumination and simple case) is visualized in Figure 4. The shallower features provide rich details, which give better adaptability and precision. Yet, for objects with large deformation and occlusion, their performance degrades. The middle-layer features contain not only the object outline but also powerful semantic information, and tend to be more stable under scale variations. Deeper features are more stable when dealing with larger deformation. However, they are prone to drifting when similar objects are present. Conventional multi-feature fusion directly combines the features of multiple levels, ignoring the interference between different levels, which causes failures in some complex situations.
Specifically, for ResNet, we use the features before ReLU activations as output. The features of adjacent layers are similar. Therefore, if we select adjacent layers, more redundancy and interference will be introduced into our tracking framework. We finally select the best-performing level of each stage in the graph as the feature extraction layers, namely Conv1x, Res3d and Res4f, which not only increases the specificity of the features but also adapts easily to dramatic scale variations.
3.4 Independent Correlation Filters Online Learning
Compared to the conventional CF algorithms, we treat the hierarchical features differently, and train a set of independent filters for each feature:
where $x_l^d$ is the feature of the $l$-th layer and the $d$-th channel. Each layer of a convolutional neural network can be viewed as a set of nonlinear filters, so an input image is encoded by the filters at each layer. More complexity and redundancy are introduced into the algorithm as the feature dimensions increase. If there are $N_l$ filters in one layer and the size of the feature map is $M_l$, the number of channels of the feature map is also $N_l$ and the feature matrix belongs to $\mathbb{R}^{N_l \times M_l}$. With the deepening of the hierarchical features, $N_l$ increases gradually and the complexity of the obtained feature map also increases gradually. If we combine the multi-dimensional features directly, the dimension of the features becomes high, which greatly increases the computational burden. A lot of redundant information has also been shown to exist, which is not conducive to solving the filters. Specific information can be learned independently by different training strategies. On the other hand, the adaptive multi-branch correlation filter fusion is able to improve the robustness effectively, as described in section 3.5.
Assume that the results of $n$ layers are adopted and that each result is correct with probability $p$; the probability that the correct result is among the final results then lies between $p$ and $1-(1-p)^n$. When the accuracy of each result is above 0.7, the probability of having the correct result is between 0.7 and $1-0.3^n$. When the specificity among the results is more distinct, the eventual result tends toward $1-0.3^n$. At the same time, as $n$ increases, $1-0.3^n$ tends to 1. Therefore, the accuracy of the result can be improved by increasing the specificity of the selected levels or the number of layers. However, too many layers make it more difficult to choose the right result and incur a heavy computational burden. So the best solution is a compromise between the specificity and the number of layers. We investigate the differing performance of CNN features in tracking problems in section 3.3.
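The bounds discussed above can be checked numerically. The sketch below assumes the upper bound takes the form 1-(1-p)^n, which corresponds to branches whose errors are independent (the most "specific" case), while the lower bound p corresponds to fully redundant branches; the function name is illustrative.

```python
def fused_accuracy_bounds(p, n):
    """Bounds on the probability that at least one of n branch results is correct.

    p: probability that a single branch is correct (assumed equal for all branches).
    Lower bound: branches are fully redundant. Upper bound: branch errors are independent.
    """
    lower = p
    upper = 1.0 - (1.0 - p) ** n
    return lower, upper
```

For example, with p = 0.7 and n = 3 the upper bound is 1 - 0.3^3 = 0.973, and it approaches 1 as n grows, which is the trade-off between branch specificity and branch count described in the text.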
The decomposed objective function can be expressed as independent solution objective functions:
where is the layer and the channel filter parameters, and represents the predefined Gaussian window objective function.
For the above optimization problem in Eq.14, we first solve the filter parameters . As for each group of filters, we set the derivative to be zero, and the minimizer of Eq.14 is solved by the following normal equations, where :
where . Moreover, we adopt the Conjugate Gradient method to solve Eq.14 iteratively.
Most existing correlation filters algorithms tend to update at each frame, which causes high computational load. ECO  updates the model in a fixed frame interval. It proves a sparser updating scheme is more efficient than the conventional strategy which updates every frame. By postponing update of the model a few frames, the loss is updated by adding a new mini-batch to the training samples, instead of only a single one, which helps to reduce over-fitting to the recent training samples. Intuitively, a sparser updating scheme leads to a low convergence speed. Hence, adopting more conjugate gradient iterations is necessary. More than that, to improve convergence rate, we choose a suitable momentum factor by using Fletcher-Reeves formula  or the Polak-Ribiere formula .
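The iterative solver mentioned above can be sketched as a conjugate gradient routine with a selectable momentum factor. This is a generic real-valued illustration for a symmetric positive definite system, not the paper's Fourier-domain solver; `apply_lhs`, `beta_rule` and the stopping threshold are hypothetical names and values, and the Polak-Ribiere variant is clipped at zero (often called PR+).

```python
import numpy as np

def conjugate_gradient(apply_lhs, rhs, n_iter=50, beta_rule="FR", tol=1e-16):
    """Solve apply_lhs(f) = rhs for a symmetric positive definite operator.

    apply_lhs: callable computing the left-hand side of the normal equations.
    beta_rule: "FR" (Fletcher-Reeves) or "PR" (Polak-Ribiere, clipped at zero).
    tol: threshold on the squared residual norm.
    """
    f = np.zeros_like(rhs)
    r = rhs - apply_lhs(f)          # residual
    p = r.copy()                    # search direction
    rs_old = r @ r
    for _ in range(n_iter):
        if rs_old < tol:
            break
        Ap = apply_lhs(p)
        alpha = rs_old / (p @ Ap)   # exact step length for a quadratic
        f = f + alpha * p
        r_new = r - alpha * Ap
        rs_new = r_new @ r_new
        if beta_rule == "FR":
            beta = rs_new / rs_old
        else:
            beta = max(0.0, (r_new @ (r_new - r)) / rs_old)
        p = r_new + beta * p
        r, rs_old = r_new, rs_new
    return f
```

For an exactly quadratic objective the two rules give the same iterates in exact arithmetic; they differ in practice when the system changes between updates, as it does when new training samples are added.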
3.5 Adaptive Multi-branch Correlation Filters Fusion
For multiple independent branches, the hierarchical filters are solved. Then we design an adaptive weight scheme to effectively fuse and obtain a more robust result. We call this weight . Then our final loss function can be formulated as follows:
we express the results of each layer as . The optimization problem of can be converted to:
Because of , then and are constant, which can be ignored. can be converted to:
in this equation, . Meanwhile, the problem in Eq.17 becomes a quadratic programming problem, which can be solved by standard quadratic programming.
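To illustrate how fusion weights can be obtained from a small quadratic program, the sketch below makes two assumptions that the text does not confirm: the weights are non-negative and sum to one, and the objective is a least-squares fit of the weighted response maps to a desired label. It uses projected gradient descent as a simple stand-in for a standard QP solver; all names are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def fusion_weights(responses, label, n_iter=500):
    """Minimize 0.5 * ||sum_l w_l * R_l - label||^2 over the probability simplex.

    responses: list of per-branch response maps with identical shapes.
    label: desired response map (assumed here to be a Gaussian label).
    Assumes at least one response map is non-zero.
    """
    R = np.stack([r.ravel() for r in responses], axis=1)   # (pixels, branches)
    Q = R.T @ R                                            # quadratic term
    c = R.T @ label.ravel()                                # linear term
    step = 1.0 / np.linalg.eigvalsh(Q)[-1]                 # 1 / largest eigenvalue
    w = np.full(R.shape[1], 1.0 / R.shape[1])              # start from equal weights
    for _ in range(n_iter):
        w = project_to_simplex(w - step * (Q @ w - c))
    return w

def fuse(responses, weights):
    """Weighted sum of the branch response maps."""
    return sum(w * r for w, r in zip(weights, responses))
```

Under these assumptions, a branch whose response map matches the label receives most of the weight, while branches with displaced peaks are suppressed.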
The center point coordinates of the target can be obtained by the fusion score map, after which the scale of the target can be predicted. We apply a multi-scale search scheme, which takes the position predicted by the motion estimation module and takes multiple scales to extract the search area. In terms of our conclusion in section 3.3, medium-layer features are more robust for the determination of scales. We extract an image patch of size centered around the target, where is 1.03, . After that, we extract the medium features and employ filters to acquire response maps, where is the position of CNN’s layer, and can be expressed as:
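The multi-scale search described above can be sketched as follows. Only the scale step 1.03 comes from the text; the number of scales, the symmetric range of exponents, and the rule of picking the scale with the highest response peak are assumptions made for illustration.

```python
def scale_factors(a=1.03, num_scales=5):
    """Scale factors a**s for s = -(num_scales-1)//2, ..., (num_scales-1)//2."""
    half = (num_scales - 1) // 2
    return [a ** s for s in range(-half, half + 1)]

def patch_sizes(width, height, a=1.03, num_scales=5):
    """Sizes of the image patches extracted around the target at each scale."""
    return [(f * width, f * height) for f in scale_factors(a, num_scales)]

def best_scale(peak_scores, a=1.03, num_scales=5):
    """Scale factor whose response map has the highest peak (assumed selection rule)."""
    factors = scale_factors(a, num_scales)
    best = max(range(len(peak_scores)), key=lambda i: peak_scores[i])
    return factors[best]
```

Each patch would be resized to the filter input size before the medium-layer features are extracted and correlated with the learned filters.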
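Table 1: Comparison with the top trackers on VOT2016, VOT2017 and VOT2018 in terms of accuracy (A), robustness (R) and expected average overlap (EAO).
| VOT2016 tracker | A | R | EAO | VOT2017 tracker | A | R | EAO | VOT2018 tracker | A | R | EAO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DNT | 0.515 | 0.329 | 0.278 | MCPF | 0.510 | 0.427 | 0.248 | DLSTpp | 0.583 | 0.454 | 0.196 |
| STAPLEp | 0.557 | 0.329 | 0.278 | SiamDCF | 0.500 | 0.473 | 0.249 | DASiamRPN | 0.628 | 0.518 | 0.205 |
| SRBT | 0.496 | 0.350 | 0.290 | CSRDCF | 0.491 | 0.356 | 0.256 | CPT | 0.577 | 0.424 | 0.209 |
| EBT | 0.465 | 0.252 | 0.291 | CCOT | 0.494 | 0.318 | 0.267 | DeepSTRCF | 0.600 | 0.444 | 0.221 |
| DDC | 0.541 | 0.345 | 0.293 | MCCT | 0.525 | 0.323 | 0.270 | LADCF | 0.550 | 0.375 | 0.222 |
| Staple | 0.544 | 0.378 | 0.295 | Gnet | 0.502 | 0.276 | 0.274 | RCO(Resnet) | 0.571 | 0.315 | 0.246 |
| MLDF | 0.490 | 0.233 | 0.311 | ECO | 0.483 | 0.276 | 0.280 | UPDT | 0.603 | 0.343 | 0.247 |
| SSAT | 0.577 | 0.291 | 0.321 | CFCF | 0.509 | 0.281 | 0.286 | - | - | - | - |
| TCNN | 0.554 | 0.268 | 0.325 | CFWCR | 0.484 | 0.267 | 0.303 | - | - | - | - |
| CCOT | 0.539 | 0.238 | 0.331 | LSART | 0.493 | 0.218 | 0.323 | - | - | - | - |
| MHIT(Ours) | 0.580 | 0.111 | 0.451 | MHIT(Ours) | 0.510 | 0.138 | 0.388 | MFT(Ours) | 0.577 | 0.311 | 0.252 |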
4.1 Implementation Details
Our tracker is implemented in MATLAB using the MatConvNet and AutoNN tools. We employ the Conv1x, Res3d, and Res4f layers of ResNet50 and SE-ResNet50 for feature extraction. To reduce and balance the dimensions of the hierarchical features, principal component analysis (PCA) is introduced into our tracking framework. Through PCA, the feature dimensions of Conv1x, Res3d and Res4f are reduced to 64, 256 and 256, respectively. The search area is set between 224×224 and 250×250. We use the same model update strategy as CFWCR; the maximum number of stored training samples is set to 50 to avoid over-fitting, and the number of intermediate frames without training is set to 6. We select different Gaussian window variances for different layers: 1/12, 1/12 and 1/3, from shallow to deep. All the parameters are determined on a uniform validation set.
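The PCA step used to reduce the channel dimension of each layer can be sketched in NumPy as below. This is a plain PCA on one feature map computed via SVD, offered as an illustration only; the function name and the choice to fit the projection on a single frame are assumptions, and the paper's MATLAB implementation may differ.

```python
import numpy as np

def pca_reduce(features, out_dim):
    """Project an (H, W, D) feature map onto its top out_dim principal directions.

    Returns the reduced (H, W, out_dim) map and the (D, out_dim) projection matrix.
    Requires H * W >= out_dim and D >= out_dim.
    """
    h, w, d = features.shape
    X = features.reshape(-1, d)
    X = X - X.mean(axis=0, keepdims=True)          # center each channel
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    P = vt[:out_dim].T                             # principal directions as columns
    return (X @ P).reshape(h, w, out_dim), P
```

With the settings listed above, `out_dim` would be 64 for Conv1x and 256 for Res3d and Res4f.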
4.2 Results on VOT
The visual object tracking (VOT) challenge is a competition between short-term, model-free visual tracking algorithms, evaluated on a dataset of 60 sequences. For each sequence in the dataset, the tracker is initialized with the bounding rectangle of the target in the first frame. The toolkit restarts the tracker whenever the target is lost. Robustness is obtained by counting the average number of failures, and accuracy is the average overlap (intersection-over-union) ratio.
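The two per-sequence measures described above can be sketched as follows. This is a simplified illustration: boxes are assumed to be axis-aligned `(x, y, w, h)` tuples, a failure is counted as a frame with zero overlap, and the re-initialization gap and burn-in frames handled by the official toolkit are ignored.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(overlaps):
    """Mean overlap over successfully tracked frames and the number of failures."""
    tracked = [o for o in overlaps if o > 0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```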
4.2.1 Results of VOT2016
The results in Table 1 are presented in terms of expected average overlap (EAO), robustness (R), and accuracy (A). For clarity, we show the comparison with the top-10 best trackers, including CCOT , TCNN , SSAT , MLDF , Staple , DDC , EBT , SRBT, STAPLE+  in the VOT2016  competition.
Our proposed MHIT outperforms all top 10 trackers with an EAO score of 0.451. MHIT significantly outperforms both CF approaches that do not apply deep CNN features and CNN based trackers trained end-to-end. Meanwhile, Figure 5 shows the per-attribute EAO plot for the ten top-performing trackers on VOT2016. In all the attributes (size change, camera motion, occlusion, unassigned, motion change, illumination change), our MHIT tracker performs better than the other state-of-the-art trackers, which also demonstrates the effectiveness of our tracking framework.
4.2.2 Results of VOT2017
We compare the tracking results of the top 10 trackers, including LSART, CFWCR, CFCF, ECO, Gnet, MCCT, CCOT, CSRDCF and SiamDCF, on the VOT2017 challenge. LSART ranked first on the 60 public sequences of VOT2017 with an EAO of 0.323. By fully exploiting different semantic information, the proposed MHIT framework achieves a relative gain of 20.1% compared to LSART in EAO.
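The relative gain quoted above follows directly from the VOT2017 EAO values listed in Table 1 (0.388 for MHIT and 0.323 for LSART):

```latex
\[
\frac{\mathrm{EAO}_{\mathrm{MHIT}} - \mathrm{EAO}_{\mathrm{LSART}}}{\mathrm{EAO}_{\mathrm{LSART}}}
= \frac{0.388 - 0.323}{0.323}
= \frac{0.065}{0.323}
\approx 0.201 = 20.1\%.
\]
```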
4.2.3 Results of VOT2018
The 60 public sequences of VOT2018 remain unchanged compared to VOT2017. Besides, VOT2018 also adopts EAO, accuracy (A) and robustness (R) for evaluation. The official top 8 trackers on the hidden dataset include MHIT, UPDT, RCO, LADCF, DeepSTRCF, CPT, DaSiamRPN [40, 23], and DLSTpp.
Table 1 illustrates that our MHIT ranks 1st according to the EAO criterion. When taking part in the VOT2018 challenge, we named our trackers MFT (SE-ResNet50) and RCO (ResNet50). The robustness of our tracker is 9.4% better than that of the other trackers. The robustness and the EAO of our tracker outperform all the state-of-the-art trackers in the VOT2018 challenge.
4.3 Results on OTB
4.3.1 Results of OTB2013
The OTB2013 dataset is one of the most widely used canonical datasets in visual tracking, and contains 50 image sequences with various challenging factors. The evaluation is based on two metrics: the precision plot and the success plot. Mean overlap precision (OP) is defined as the percentage of frames in a video where the intersection-over-union overlap exceeds a threshold of 0.5. The area under curve (AUC) of each success plot is used to rank the tracking algorithms. AUC is computed from the success plot, where the mean OP over all videos is plotted over the range of thresholds [0, 1]. To reduce clutter in the graphs, we show only the results for top-performing recent baselines, i.e., ECO, CCOT, SRDCFdecon, DeepSRDCF, SRDCF, siamfc3s and Staple.
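The OP and AUC measures defined above can be sketched as follows. The threshold grid of 21 evenly spaced values in [0, 1] and the use of the mean success rate as the AUC are assumptions about the evaluation protocol, and the function names are illustrative.

```python
import numpy as np

def overlap_precision(ious, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds the threshold."""
    return float((np.asarray(ious, dtype=float) > threshold).mean())

def success_curve(ious, num_thresholds=21):
    """Success rate at evenly spaced overlap thresholds in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    ious = np.asarray(ious, dtype=float)
    rates = np.array([(ious > t).mean() for t in thresholds])
    return thresholds, rates

def auc(ious, num_thresholds=21):
    """Area under the success curve, approximated by the mean success rate."""
    _, rates = success_curve(ious, num_thresholds)
    return float(rates.mean())
```

For a full benchmark, the per-frame overlaps of each video would be evaluated separately and the resulting curves averaged over all videos.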
Among all compared trackers, the proposed MHIT method obtains the best performance, achieving a 94.6% distance precision rate at the threshold of 20 pixels and a 72.6% area-under-curve (AUC) score. Performance evaluation on different attributes of OTB2013 can be found in the supplementary material.
4.3.2 Results of OTB2015
The OTB2015  dataset is an extension of the OTB2013 dataset, which contains 50 more video sequences. We also evaluate the performance of the proposed MHIT method over all 100 videos in this dataset. To reduce clutter in the graphs, we show only the results for top-performing recent baselines, i.e., ECO , CCOT , SRDCFdecon , DeepSRDCF , SRDCF , siamfc3s , Staple .
Overall, our MHIT method provides the best result with a distance precision rate of 91.7% and an AUC score of 69.8%, which again is a substantial improvement over several outstanding trackers (e.g., ECO, C-COT and DeepSRDCF). Performance evaluation on different attributes of OTB2015 can be found in the supplementary material.
4.4 Ablation Analyses
In this subsection, ablation analyses are performed to illustrate the effectiveness of proposed components. To verify the contributions of each component in our algorithm, the variations of our approach are implemented and evaluated.
Feature Comparisons We compare the performance of VGG-M, DenseNet121, ResNet50 and SE-ResNet50 in Figure 9. In all cases, we employ the same shallow representation, consisting of HOG and Color Names. The baseline CFWCR does not benefit from the deeper and more powerful ResNet backbone. When using our MHIT framework, the EAO reaches 0.341 with the original VGG-M backbone, which demonstrates the effectiveness of the independent correlation filters. When using the deeper ResNet50 and SE-ResNet50 backbones in our proposed tracking framework, the power of the CNN is significantly unveiled. In conclusion, our approach is able to exploit more powerful representations, achieving a remarkable gain when moving from hand-crafted features towards more powerful network architectures.
After that, we compare the performance of different combinations of features extracted by SE-ResNet50, as shown in Table 2. If we select adjacent layers, more redundancy and interference are introduced into our tracking framework, causing performance degradation. Besides, according to Table 2, Conv1x with a spatial size of 112×112 provides richer details of the target outline than Res2a with a spatial size of 56×56. The middle layer of SE-ResNet50, Res3d, provides not only the object outline but also powerful semantic information, thus significantly improving the performance.
Module Analyses We next investigate the effect of the four proposed modules, as shown in Table 3. Baseline means using VGG-M without independent correlation filters. IS+VGG-M means introducing the independent solution into the baseline and fusing by summation with equal weights. IS+ResNet50 means replacing VGG-M with ResNet50 as the backbone network. IS+AMF+ResNet50 adds the adaptive multi-branch fusion method. To verify the importance of the feature diversity and feature quantity discussed in Section 3.4, we further add the features extracted from SE-ResNet50 on top of IS+AMF+ResNet50, called IS+AMF+SE-ResNet50. Besides, IS+AMF+SE-ResNet50+ME means adding the motion estimation module on top of IS+AMF+SE-ResNet50. According to Table 3, compared with the baseline, these variants improve EAO by 3.4%, 6.2%, 7.3%, 8.3% and 8.5%, respectively.
The tracking performance under partial occlusion is visualized in Figure 10. When the girl in the figure suffers from short-term partial occlusion (marked in red), motion estimation can help the tracking framework to predict the position of the target.
In this paper, we fully utilize hierarchical features and multi-branch correlation filter fusion to construct a novel CF based tracking framework, which efficiently alleviates the curse of dimensionality of conventional multi-feature fusion. A motion estimation module is introduced into the framework as a supplement to the appearance information, so as to capture the motion state of the target and mitigate partial occlusion. Besides, our tracking framework benefits from online learning to adapt to appearance changes and scale variations in continuous image sequences. Finally, extensive experiments verify the effectiveness of the proposed tracker, which achieves new state-of-the-art performance on both the OTB and VOT benchmarks.
-  L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1401–1409, 2016.
-  L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850–865. Springer, 2016.
-  G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, and M. Felsberg. Unveiling the power of deep tracking. arXiv preprint arXiv:1804.06833, 2018.
-  D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In Computer Vision and Pattern Recognition, pages 2544–2550, 2010.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
-  M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: efficient convolution operators for tracking. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 21–26, 2017.
-  M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 58–66, 2015.
-  M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. pages 1430–1438, 2016.
-  M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. pages 4310–4318, 2016.
-  M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
-  R. Fletcher and C. M. Reeves. Function minimization by conjugate gradients. Computer Journal, 7(2):149–154, 1964.
-  E. Gundogdu and A. A. Alatan. Good features to correlate for visual tracking. IEEE Transactions on Image Processing, 27(5):2526–2540, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  J. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. PAMI, 37(3):583–596, 2015.
-  J. F. Henriques, C. Rui, P. Martins, and J. Batista. Exploiting the Circulant Structure of Tracking-by-Detection with Kernels. Springer Berlin Heidelberg, 2012.
-  Z. He, Y. Fan, J. Zhuang, Y. Dong, and H. Bai. Correlation filters with weighted convolution responses. IEEE International Conference on Computer Vision, 2017.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. 2017.
-  G. Huang, Z. Liu, L. V. D. Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
-  M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojíř, G. Häger, A. Lukežič, and G. Fernández. The visual object tracking vot2016 challenge results. In Computer Vision - ECCV 2016 Workshops, Part II, 8926:191–217, 2016.
-  M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Hager, A. Lukezic, and A. Eldesokey. The visual object tracking vot2017 challenge results. In IEEE International Conference on Computer Vision Workshop, pages 1949–1972, 2017.
-  M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, et al. The sixth visual object tracking vot2018 challenge results, 2018.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.
-  B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. June 2018.
-  H. Li, Y. Li, and F. Porikli. Deeptrack: Learning discriminative feature representations online for robust visual tracking. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, 25(4):1834–1848, 2015.
-  T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. pages 936–944, 2016.
-  A. Lukezic, T. Vojír, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 4847–4856, 2017.
-  H. Nam, M. Baek, and B. Han. Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242, 2016.
-  H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 4293–4302. IEEE, 2016.
-  E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7464–7473, 2017.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. 186(3):219–20, 1994.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Sun, H. Lu, and M. H. Yang. Learning spatial-aware regressions for visual tracking. 2018.
-  J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Transactions on Image Processing, 18(7):1512–1523, 2009.
-  Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu. Dcfnet: Discriminant correlation filters network for visual tracking. arXiv preprint arXiv:1704.04057, 2017.
-  Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Computer vision and pattern recognition (CVPR), 2013 IEEE Conference on, pages 2411–2418. Ieee, 2013.
-  Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
-  T. Xu, Z. H. Feng, X. J. Wu, and J. Kittler. Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual tracking. 2018.
-  G. Zhu, F. Porikli, and H. Li. Beyond local search: Tracking objects everywhere with instance-specific proposals. In Computer Vision and Pattern Recognition, pages 943–951, 2016.
-  Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, pages 103–119. Springer, 2018.
Appendix A Supplementary Material
A.1 Detailed results on OTB2013
In this subsection, detailed results on OTB2013 are provided. Figure 11 shows the success plots for all 11 attributes, including abrupt motion, background clutter, blur, deformation, in-plane rotation, low resolution, illumination variation, occlusion, out-of-plane rotation, out-of-view and scale variation on OTB2013.
Our tracker MHIT obtains remarkable performance with good robustness, outperforming the state-of-the-art tracker ECO on most of the attributes. On the fast motion and motion blur attributes, MHIT achieves 5.6% and 4.2% relative AUC gains compared to ECO, respectively. This illustrates the effectiveness of the motion estimation module, which captures motion information to pre-locate the position of the target and generates a motion map to rectify the final score map. Moreover, due to the motion estimation module, our tracker also outperforms ECO under deformation and occlusion. Benefiting from the powerful multi-hierarchical deep features of SE-ResNet50 and the independent correlation filters, MHIT achieves a 3.0% relative AUC gain compared to ECO in the case of scale variation.
A.2 Detailed results on OTB2015
In this subsection, we provide detailed results on OTB2015. Figure 12 shows the success plots for all 11 attributes, including abrupt motion, background clutter, blur, deformation, in-plane rotation, low resolution, illumination variation, occlusion, out-of-plane rotation, out-of-view and scale variation on OTB2015. Our tracker MHIT also significantly outperforms ECO in most of the attributes, which shares consistent results with OTB2013.