EnKCF: Ensemble of Kernelized Correlation Filters for High-Speed Object Tracking
Computer vision technologies are very attractive for practical applications running on embedded systems. For such an application, it is desirable for the deployed algorithms to run in high-speed and require no offline training. To develop a single-target tracking algorithm with these properties, we propose an ensemble of the kernelized correlation filters (KCF), we call it EnKCF. A committee of KCFs is specifically designed to address the variations in scale and translation of moving objects. To guarantee a high-speed run-time performance, we deploy each of KCFs in turn, instead of applying multiple KCFs to each frame. To reduce any potential drifts between individual KCFs’ transition, we developed a particle filter. Experimental results showed that the performance of ours is, on average, 70.10% for precision at 20 pixels, 53.00% for success rate for the OTB100 data, and 54.50% and 40.2% for the UAV123 data. Experimental results showed that our method is better than other high-speed trackers over 5% on precision on 20 pixels and 10-20% on AUC on average. Moreover, our implementation ran at 340 fps for the OTB100 and at 416 fps for the UAV123 dataset that is faster than DCF (292 fps) for the OTB100 and KCF (292 fps) for the UAV123. To increase flexibility of the proposed EnKCF running on various platforms, we also explored different levels of deep convolutional features.
A recent advancement of air/ground/water unmanned vehicle technologies has increased interests on deploying intelligent algorithms to existing embedded and mobile platforms. Among those technologies, computer vision algorithms are getting more attentions primarily because payloads of those mobile platforms are limited to carry any other sensors than a monocular camera. Instead of just manually being flew for video recording, an unmanned air vehicle (UAV) equipped with an object or feature following function would make it more useful for the application of monitoring/surveillance/surveying on private properties/wild-life/crop, video recording on sports/movies/events, many others. To this end, in this paper, we propose a single-target tracking algorithm that does not require offline training and can run in high-speed. Specifically, we would like to have our algorithm 1) learn the appearance model of a target on the fly and 2) run on a typical desktop as fast as - fps.
One of the dominant frameworks for online object tracking is the correlation filter that essentially solves a single-target tracking problem as a regression problem in the frequency domain. This framework assumes that a target location is given at the beginning like any other online tracking algorithms . Given this positive example for the regression problem, a set of negative examples is collected around the initial target bounding box and represented as a form of the circulant matrix . One can optimally solve this regression problem using a ridge regression in a closed form. However, this solution has to deal with expensive matrix operations . The correlation filter offers a less complex solution, over element-wise multiplication in a frequency domain [2, 12]. Thank to this reformulation, an object tracking pipeline based on the correlation filter can run very efficiently and be even easily implemented. In fact, an extension of a linear correlation filter, the kernelized correlation filter with multi-channel features  showed impressive object tracking results and outperformed other state-of-the-art, online tracking algorithms in terms of run-time and tracking accuracy. However, a vanilla form of such an online tracker is prone to drift, and fails to track a target over a long period of time . This is primarily due to the dilemma of stability-plasticity in updating appearance model, where the appearance model will be overfitted to only the images used to train, unless a compromise on the frequency of updating the model is carefully implemented . For example, one may handle a target’s scale variation by just concatenating multiple correlation filters including KCF and running them on each frame. Alternatively one could think of scanning the region of interest (ROI) with a list of templates in predefined scale ratios to find the target in appropriate scale [12, 23, 18, 1, 16]. However, these approaches would drastically reduce run-time performance because multiple KCFs run on each frame.
Another way of handling the scale change for the correlation filter based approach is to estimate a correct scale at a location where a target highly likely appears  – estimate a target’s translation first and then estimate correct scale. For example, Zhang and his colleagues used the MOSSE tracker  to estimate the target’s translation. And then they attempted to update the scale of the target by further analyzing image sub-regions in high confidence. Their method is based on an assumption that the scale of a target would not change drastically over two consecutive frames. Similarly, Ma and his colleagues used two KCFs to learn the translation and scale of a target separately . In particular, a KCF is used to learn the translation of the target and its background. Given this, another KCF is used to learn the target area to estimate the new scale of the target. However, because of running more than a KCF on each frame, this method degrades its run-time performance (i.e., ). Our method is motivated by this idea – the idea of deploying multiple KCFs to address the issues of single target tracking: scale and translation, but in a more efficient way. To maximize run-time performance and accuracy, instead of running them all together on every frame, we deploy three KCFs in turn: target+small background translation filter (), target-only scale filter () and target+large background translation filter (). By doing so, our method aims at addressing scale change and estimating target’s motion efficiently while maintaining or increasing run-time performance. Figure 2 illustrates the workflow of the EnKCF.
The contribution of this paper is a novel, single-target tracking algorithm running in a very high-speed ( fps). In particular, to effectively address the changes in scale and translation of a target, we extended the KCF and deployed each of three KCFs in turn. Because of such a deployment strategy, the run-time performance of the proposed algorithm maintains high without sacrificing the accuracy of tracking. To reduce any potential drifts while switching KCFs, we developed a particle filter. Additionally, to increase the flexibility of the proposed algorithm’s usage, we explore deep convolutional features with varying levels of abstraction.
2 EnKCF: Ensemble of Kernelized Correlation Filters
We propose a novel way of utilizing an ensemble of the KCFs  to effectively handle scale variations and dynamic maneuvers of a target. To maintain or improve the run-time performance of the original KCF (e.g., 300), we deploy three KCFs in turn, instead of applying them on the same image frame together. Each of these KCFs is designed to address the challenges of single-target tracking – variance in scale and translation. By deploying each of three KCFs in turn, our algorithm will improve the performance of the original KCF while preserving the original KCF’s high run-time performance and small-footprint.
The proposed algorithm, EnKCF, learns three KCFs in turn: The first filter, , focuses on learning the target area and its background for addressing a marginal translation by a target, the second filter, , focuses entirely on learning the target’s scale, and the last filter, , focuses on the target area and its background bigger than that of the first filter, . We set up EnKCF in this way so that a correlation filter for learning a target’s translation could include the background of the target to better discriminate it from its surrounding, and another correlation filter could focus on the target itself to estimate the right scale of the target. In particular, a transition filter with larger padding size, will enable EnKCF to recover from potential drifts after a scale filter is applied to the input images. On the other hand, another translation filter with smaller padding size, , will help EnKCF to better localize the position of the target after the large ROI translation filter is applied.
Our approach is similar to that of  which sequentially applied more than one KCF to every frame, but different in that we operate multiple KCFs in an alternating manner. It intuitively makes sense to alternatively use multiple KCFs with different purposes because the appearance of a target does not change drastically over consecutive images. In other words, for most cases, learning a correlation filter over consecutive frames would not be substantially different from the one updated with smaller frequency. Figure 3 shows examples supporting this observation.
The algorithm LABEL:alg:MKCF shows the pseudo-code of EnKCF. The order of running these three KCFs is important because each of these filters aims at addressing different challenges. A translation filter, , is applied to the th and +th image frames, another translation filter, , is applied to the +th and +th image frames, and then the scale filter, , is applied to the +th image. This order repeats until the last image is presented. Note that the translation filter, , is intended to run right after the scale filter, , runs which is applied at every other + frames. We run these filters in this way because we want to minimize any drifts that are likely to happen running only . In addition, the filter, , is applied to every other two frames before and right after two consecutive frames running . By repeating this order of learning three KCFs, we can integrate more discriminative shape features that cannot be learned just by . The filter, , uses shape and color information together to recover from any potential drifts – drifts could happen due to only scale filter operation in certain frames.
In summary, the scale filter, , is designed to directly handle the scale change of a target and provides the translation filters, and , with more accurate ROIs. On the other hand, a translation filter, , is devised to look into a larger search area to estimate the target’s translation and recover from any potential drifts. Another translation filter, , is designed to address the drawback of that it may learn noisy shape features due to its relatively larger search area. In what follows, we will detail the main idea behind the KCF.
Kernelized Correlation Filter The Kernelized Correlation Filter is a well-known single target tracking method and its workflow has been detailed in other papers [11, 12]. This section briefly goes over the parts of the KCF relevant to this study. Its computational efficiency is derived from the correlation filter framework representing training examples as a circulant matrix. The fact that a circulant matrix can be diagonalized by Discrete Fourier transform is the key to reduce the complexity of any tracking method based on correlation filter. The off-diagonal elements become zero whereas the diagonal elements represent the eigenvalues of the circulant matrix. The eigenvalues are equal to the DFT transformation of the base sample () elements. The Kernelized Correlation Filter, in particular, applies a kernel to to transform to a more discriminative domain. The circulant matrix is then formed by applying cyclic shifts on the kernelized . Such kernelization operation maintains () complexity unlike other kernel algorithms leading to () or even higher complexities.
The KCF solves essentially the problem of a regression in the form of the regularization (ridge regression):
where we seek for that minimizes given the desired continuous response, , and the training template . The parameter enables one to integrate features in the form of multiple channels, such as HoG and color, in this setup [12, 9]. A closed-form solution for Equation 1 exists. To simplify the closed-form solution, an element-wise multiplication in frequency domain was proposed to learn the frequency domain correspondence of , :
Where is an element-wise multiplication. A non-linear version of and are proposed to increase robustness to any geometric and photometric variations . In particular, the diagonalized non-linear Fourier domain dual form solution is expressed as
where represents the regularization weight whereas denotes the first row of the kernel matrix known as gram matrix and is expressed as
An early version based on this formulation used grayscale feature () to learn the solution vector and integrating multi-channel features such as HoG and Color showed improved accuracy [12, 9, 23, 18, 1]. In the detection phase, the learned correlation filter is correlated with the first row of the gram matrix, , which contains the similarity values between the learned feature template and the new test template . This can be formulated as
where denotes the correlation response at all cyclic shifts of the first row of the kernel matrix.
As a common practice, to integrate further temporal information into tracker, we update the correlation filter and the appearance model as below.
where represents the learning rate tuned to a small value in practice.
In the following sections, we will detail two different types of features used in EnKCF: hand-crafted features (fHoG + color-naming) and deep convolutional features.
EnKCF with Hand-Crafted Features We, first, use two conventional hand-crafted features, fHoG (shape)  and color-naming (color) . Figure 4 shows examples of these features. This setup of the EnKCF is designed to perform tracking at fps on low-end embedded systems without GPUs.
We use both fHoG  and color-naming  for the large area translation filter, . This is because the fHoG applied to relatively larger area tends to be more noisy and less discriminative, and adding color information makes the feature vector for the translation filter more discriminative. Figure 4(b) shows an example of this translation filter. By contrast, the only employs the fHoG features because it covers a relatively smaller area. Figure 4(a) shows an example of this translation filter. Lastly, we use both fHoG and color-naming features again for the scale filter, . By assigning fHoG and color-naming features, we ensure that the likelihood of inaccurate scale estimation is reduced in comparison to the scale filter with only fHoG features. This is more important in our case as the scale filter is operated in every 5 frames in the proposed tracker. Also, the scale filter, explained earlier in the algorithm LABEL:alg:MKCF, estimates the scale of a target by correlating it with three candidate ROIs. This search may increase the run-time complexity of the scale filter from to . To ensure practical actual filtering operations, we use a smaller template size (i.e., 64) for the scale filter, . The smaller template size does not degrade the tracking performance because of 1) zero padding around the target (smaller ROI) and 2) use of color-naming features.
EnKCF with Deep Convolutional Features In addition to hand-crafted features, we explore deep convolutional features to extend the applicability of the EnKCF and boost tracking performance.
There has been a large volume of studies on utilizing a pre-trained CNN features in tracking-by-detection algorithms. For example,  used the activation of the first and second convolutional layer of the VGGNet . They report that the first convolutional layer features lead to slightly better precision and success rates than the second layer due to increased translational invariance in this layer.  proposed a KCF tracker integrating features with higher abstraction capacity. Their method first employs the third convolutional layer features to estimate the response map. In the next stage, the KCF, concentrating on the second convolutional layer features, is run to update transition. The third KCF then works on the transition given by the previous KCF and learns the first convolutional layer features to update the transition. This coarse-to-fine translation estimation accommodates different levels of abstractions of the object. However, it runs multiple KCFs in a sequential fashion, increasing the run-time complexity. For this reason, we follow an approach similar to , in order to embed deep features in EnKCF. Our EnKCF algorithm enables us to exploit different level of feature encodings with different KCFs. The translation filters, and , consider at least twice bigger area than the scale filter, . Given this, we can assign deeper feature encodings to as it learns less spatial information than and . Thus we assign the activation of the fourth convolutional layer features () to whereas and are assigned the activation of the second layer features (). Figure 5 illustrates this feature assignment. Additionally, we assign the second convolutional layer features to as in and and compare this setting (conv222-VGG) to conv224-VGG. The conv222 setting stands for the assignment of the activation of convolutional layer features of VGGNet to , and . We use the VGGNet since it provides higher spatial resolution at the first several layers than AlexNet .
Particle Filter for Smoothing Transition among KCFs As explained earlier, the EnKCF updates the target’s scale at every other frames. Although the strategy of updating every other th frames will result in an optimal run-time performance, this may lead drifts at the later frames. To prevent such potential drifts, we developed a Bayes filter that incorporates a target’s motion to smooth any intermediate outputs from individual KCFs. In particular, we use a particle filter to estimate the target’s image coordinate based on the EnKCF’s outputs. The state in this paper, , represents the target’s pixel coordinates and its velocity, , where and are the pixel coordinates of the target’s centroid, and are the velocities estimated along the -axis and -axis. The particle filter predicts, using a constant velocity motion model, the target’s next state by generating a predefined number of particles. Then it uses the confidence maps of EnKCF as observation to update its state. The particle weight is computed as, , where is the weight of the particle at time , is the response map from one of the KCFs, and and denote the number of rows and columns of the confidence map.
So far we explained how the proposed algorithm could efficiently perform a single-target tracking task while effectively addressing variations in scale and translation of moving objects. To accurately evaluate the performance of the proposed algorithm in terms of the goal of this paper, we need data with challenging variations in scale and translation. To this end, we chose the UAV123 data111https://ivul.kaust.edu.sa/Pages/Dataset-UAV123.aspx that contains 123 video sequences of objects captured from low-altitude UAVs. Because of the nature of the video acquisition – the targets and the camera are moving together, this data pose a great deal of challenges as the scale and position of the targets change drastically in most of the sequences. To verify that our algorithm is useful not only for the image data with such extreme challenges but also for the one with nominal difficulties in single-object tracking, we also used the OTB100 data222http://cvlab.hanyang.ac.kr/tracker_benchmark/benchmark_v10.html  that contains the videos recorded from smart phones and typical cameras in perspective view. In addition, we use the temporarily down-sampled UAV123 dataset to check how robust our algorithm is to drastic motions of camera and moving targets. Lastly, we evaluate the proposed algorithm with deep convolutional features on the UAV123 dataset to investigate how much performance gain we can achieve in using deep features over the conventional features like fHoG and color-naming.
Finding Optimal Hyper-parameters As each of three KCFs in EnKCF is designed to address specific challenges in single target tracking problem, the optimal parameters for individual KCFs should be different. We set the learning rates () of individual filters, , , and as , and . For the kernel of the Gaussian neighboring function, we empirically found the optimal values of as , , and for , , and . We set the scale filter update threshold, , to 4, peak-to-sidelobe (PSR) ratio. The padding size for the correlation filters is tuned to for , for , and for . For our particle filter implementation, we empirically set the number of particles to to balance the run-time performance and accuracy. To keep the level of variance among the particles reasonable, we performed the re-sampling only when the efficient number of samples, (), is lower than a pre-defined threshold.
Performance on UAV123 Dataset We used the precision and success rates to compare the performance of the proposed algorithm with those of the state-of-the-art tracking algorithms. For the precision, we rank the trackers based on the precision numbers at 20 pixels whereas in the success rate plots, they are ranked based on the area under curve (AUC) scores. The tracking algorithms under the comparison include ones in high-speed (300 fps): KCF , CSK , DCF , MOSSE , and STC  and ones in relatively lower-speed (50): ECO , CCOT , SCT , SAMF , DSST , Struck , MUSTER , TLD , and OAB . The ECO and CCOT trackers originally uses deep convolutional features, however, to perform fair comparison, we employ fHoG and color-naming features similar to our tracker. Figure 6 shows the results on the UAV123 dataset. The EnKCF outperformed other high-speed trackers by - at 20 pixels precision. In particular, three algorithms, SAMF, DSST, and Struck did about better than ours in accuracy, but 10-20 times slower. The more recent trackers, ECO and CCOT, on the other hand, outperforms EnKCF by while running at 10-15 times slower. For the scale adaptation, EnKCF did fourth best in terms of AUC for the success rate plot. It outperformed other high-speed trackers by about - in AUC. In addition, it performed even better than some of the lower-speed trackers. For example, for AUC, EnKCF outperformed Struck and DSST by and while running at more than and times faster. For the DSST, we believe our algorithm outperformed it because, first of all, the DSST uses a 1-D scale filter and searches the target’s scale over 33 candidates in a scale pool whereas our algorithm uses a 2-D scale filter with 3 candidate scales. Learning a 2-D filter resulted in learning more spatial information. Second of all, the DSST uses only fHoG whereas our algorithm uses both fHoG and color-naming features. Learning complementary features such as color and shape provided a better understanding of the ROI.
— Best — Best — Best
Finding Optimal Combination and Deployment Order of Multiple KCFs In addition to performance comparison with other trackers, we also conducted an experiment to see what combination and deployment order of the KCF works best. Table 1 presents results of this experiment. To achieve fair comparison, we did not use the particle filter. The combination presented at the column switched the order of deploying and . The one at the column removed and replaces with whereas the one at the column removes . The combinations with the “*” marker ran the filters on every frame in the listed order. The trackers at the and columns ran a translation and a scale filter on every frame which is similar to the LCT tracker .333To be precise, the LCT tracker comes with a re-detection module for a long-term tracking. This experiment empirically verified that EnKCF is an optimal way of using multiple KCFs, in terms of both tracking accuracy and run-time performance. For example, one could achieve higher in accuracy by running and at every frame, but this combination decreases run-time performance to 317 fps.
Evaluation on Particle Filter Contribution For the UAV123 dataset, we also evaluated the performance of the EnKCF with and without the particle filter. In particular, we use 50 video sequences with no drastic camera motion. To evaluate the robustness of particle filter, we add noise from a uniform distribution (), to the translation estimations of the KCFs. For this experiment, the EnKCF with particle filter achieves precision (px), outperforming the one without particle filter by . This experiment validated that the integration of particle filter into the EnKCF can further improve the performance.
Performance on OTB100 Dataset Figure 6 shows the results on the OTB100 dataset. The performance of EnKCF on the OTB100 data is similar to that of the UAV123 dataset. Specifically, it performed reasonably well at estimating target scale in that it showed the highest precision and success rates among the other high-speed trackers. Interestingly, it outperformed another correlation filter based tracker, DSST, but performing behind of another low-speed scale adaptive SAMF tracker. Finally, the ECO, and CCOT achieve - higher precision and success rates than ours while operating at 10-20 times slower rates.
Performance on UAV12310fps Dataset We were curious about how the frame rate of a testing video would affect the performance of a tracking algorithm. To this end, we use the temporarily down-sampled version of the UAV123, called UAV12310fps dataset. We believe the downsampling would make the original UAV123 data more challenging because the displacements of the moving objects become even bigger. To tackle this challenge, we slightly modified the proposed algorithm – ran every frame and removed the particle filter due to larger ego-motion. Figure 7 shows the performance of the modified version of EnKCF and other tracking methods. The precision rates of the ECO and CCOT dropped about in comparison to the original UAV123 dataset. Our tracker outperforms other high-speed correlation filter based trackers including KCF, DSST by about - and showed a precision rate similar to that of SAMF. It ranks as fourth in AUC while running about to times faster than the top three trackers. It was interesting to observe how sensitive the performance of the state-of-the-art tracking algorithms is to the frame rate, and, with a slight modification, the proposed method effectively handled large displacement of objects in successive frames.
Evaluation with Deep Features Our focus so far was to develop a single-target tracker that does not require offline learning to effectively address the variations in scale and translation, and can run at a high-speed, fps. We would like to see how much the performance gain we could achieve by replacing the conventional features, fHoG and color-naming with deep convolutional features. We use deep convolutional features similar to [17, 6]. Note that running the EnKCF with deep features would obviously increase the computational cost and degrade the run-time performance.
Figure 8 shows the DeepEnKCF’s performance on UAV123 dataset. DeepEnKCF outperformed the EnKCF with hand-crafted features by about to in precision (20 px) and in success rates (AUC). The conv224-VGG setting performed slightly better than conv222-VGG in precision while achieving similar success rate (AUC). This indicates that higher level feature abstraction works better for the smaller ROIs. By using feature abstractions from the VGGNet pre-trained on millions of images, we can better represent the low level features of the object than hand-crafted features in challenging cases such as low contrast ROIs. However, we believe that the contribution of the deep features is limited by two factors: increased translational invariance in deeper layers and losing targets due to large target and camera motion in the UAV123 dataset.
Running a computer vision algorithm on any existing embedded systems for real-world applications is economically and practically very attractive. Among other practical considerations, it would be desirable if such computer vision algorithms require no offline training and operate at high-speed. To develop a single-target tracking algorithm to meet these features, we proposed an extension of KCF that applies three KCFs, in turn, to address the variations in scale and translation of moving objects. We also developed a particle filter to smooth the transition between three KCFs, especially the transition between the scale and large ROI translation filter. Through experiments, we found that the way the EnKCF deployed three KCFs was optimal, and the particle filter contributed to increase the performance. We used two public datasets to evaluate the performance of the proposed algorithm and other state-of-the-art tracking algorithms, and found that, on average, the performance of the proposed algorithm is better than other high-speed trackers over 5% on precision at 20 pixels and 10-20% on AUC. Our implementation ran at 340 fps for OTB100 and at 416 fps for UAV123 data that is faster than DCF (292 fps) for OTB100 and KCF (292 fps) for UAV123. Finally, we explored the idea of utilizing deep features for the proposed algorithm and found that the deep features helped the proposed algorithm boost the performance by .
Although the proposed algorithm showed a promising result tested with challenging data, we believe the corner case analysis has not been extensively done yet. As future work, we would like to thoroughly study under what conditions our algorithm would fail to track objects. In addition, we also would like to investigate a way of re-initializing a target when the target is lost, due to occlusion, illumination change, drastic camera motion, etc.
-  A. Bibi and B. Ghanem. Multi-template scale-adaptive kernelized correlation filters. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 50–57, 2015.
-  D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In Proceedings of the IEEE Computer Vision and Pattern Recognition, pages 2544–2550, 2010.
-  J. Choi, H. J. Chang, J. Jeong, Y. Demiris, and J. Y. Choi. Visual tracking using attention-modulated disintegration and integration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In CVPR, 2017.
-  M. Danelljan, G. Häger, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In Proceedings of British Machine Vision Conference, 2014.
-  M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 58–66, 2015.
-  M. Danelljan, A. Robinson, F. Shahbaz Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
-  H. K. Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In Proceedings of International Conference on Computer Vision, 2013.
-  S. Hare, A. Saffari, and P. H. Torr. Efficient online structured output learning for keypoint-based object tracking. In Proceedings on IEEE Conference on Computer Vision and Pattern Recognition, pages 1894–1901, 2012.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings on European Conference on Computer Vision, pages 702–715, 2012.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
-  Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao. Multi-store tracker (muster): A cognitive psychology inspired approach to object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 749–758, 2015.
-  Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Y. Li and J. Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of European Conference on Computer Vision, pages 254–265, 2014.
-  C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 3074–3082, 2015.
-  C. Ma, X. Yang, C. Zhang, and M.-H. Yang. Long-term correlation tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5388–5396, 2015.
-  M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for uav tracking. In Proceedings of European Conference on Computer Vision, pages 445–461, 2016.
-  J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof. Prost: parallel robust online simple tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: an experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, 2014.
-  M. Tang and J. Feng. Multi-kernel correlation filter for visual tracking. In Proceedings of International Conference on Computer Vision, pages 3038–3046, 2015.
-  J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Transactions on Image Processing, 18(7):1512–1523, 2009.
-  Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
-  K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang. Fast visual tracking via dense spatio-temporal context learning. In Proceedings of the European Conference on Computer Vision, pages 127–141, 2014.
-  T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Robust visual tracking via multi-task sparse learning. In Computer vision and pattern recognition (CVPR), 2012 IEEE conference on, pages 2042–2049. IEEE, 2012.