Robust Object Tracking with a Hierarchical Ensemble Framework
Abstract
Autonomous robots are widely popular nowadays and have been applied in many settings, such as home security, entertainment, delivery, navigation and guidance. It is vital for robots in these applications to track objects accurately in real time, so tracking algorithms must be improved in robustness, speed and accuracy. In this paper, we propose a real-time robust object tracking algorithm based on a hierarchical ensemble framework which incorporates information from individual pixel features, local patches and holistic target models. The framework combines multiple ensemble models simultaneously instead of using a single ensemble model individually. A discriminative model which accounts for the matching degree of local patches is adopted via a bottom ensemble layer, and a generative model which exploits holistic templates is used to search for the object based on the middle ensemble layer as well as an adaptive Kalman filter. We test the proposed tracker on challenging benchmark image sequences. The experimental results demonstrate that the proposed tracker performs favorably against several state-of-the-art algorithms, especially when the appearance changes dramatically and occlusions occur.
I Introduction
Visual tracking is a well-studied problem in computer vision with a variety of applications such as surveillance, human motion analysis, robot guidance and human-computer interaction. Recent attention has focused on visual tracking in robotic domains [1, 2]. However, due to the diversity of environments and the complex motion of robots, tracking conditions such as occlusions, deformations, fast motion and background clutter remain challenging.
There are three fundamental components that are essential [3] for improving tracking performance: (1) background information; (2) local appearance models; (3) motion models. This paper presents a hierarchical tracking framework which takes all three into account. We model the object as an ensemble three-layer structure which incorporates information from individual pixel features, local patches and the target bounding box. The first component, background information, is essential for overcoming background clutter caused by the complexity of the environment; in our proposed method, we incorporate both object and background information into the classifiers. For the second component, most existing approaches [4, 5] represent the target with a limited number of non-overlapping or regularly arranged local regions, so they may not cope well with large deformations of the target, whereas our hierarchical tracker models the target with a series of overlapping, randomly sampled regions. We introduce the compressive sensing theory [6, 7], which significantly reduces the dimension of the pixel features in local regions. An overall schematic of the tracker is shown in Fig. 1. For each subpatch, we build a bottom ensemble layer which combines a collection of weak classifiers on the subpatch's compressive features into a strong classifier as a base ensemble. In the middle ensemble layer, we aggregate these base ensembles to generate the measurement of the target. As robots move almost all the time when tracking an object, our approach needs to consider the third component, so we introduce an adaptive Kalman filter [8] in the top layer to account for the motion model and the temporal consistency at the target bounding-box level. In summary, the contributions of our method are as follows:

We systematically organize compressive features, overlapping subpatches and holistic target models to capture the detailed appearance of the object;

We propose a hierarchical ensemble framework that combines multiple ensemble models simultaneously instead of using a single ensemble model individually;

We employ the compressive sensing method to significantly reduce the feature dimensions, so that our approach can handle color images without prohibitive memory cost;

We take the motion model into consideration with an adaptive Kalman filter to overcome temporary occlusions, missed detections and false detections.
In the experiments, we compare the proposed method against state-of-the-art tracking approaches that are feasible for robotic applications in terms of computational complexity and hardware requirements, using an online object tracking benchmark [3]. Our method obtains superior results compared with these approaches. The results also show that our method performs much better in moving-human tracking than the other approaches under occlusions, deformations, background clutter and scale variations.
II Related Work
Recent tracking algorithms are developed in terms of three primary components: target representation, matching mechanism, and model update mechanism.
Target representation plays a pivotal role in visual tracking, and numerous representation schemes have been proposed. Several factors need to be considered for an effective appearance model. First, there are many choices of features to represent objects, such as color histograms [9], superpixels [10] and Haar-like features [11, 12, 13]. Second, the templates representing the objects can be global or local. Global templates [12, 1] make it easy to construct an object representation that contains information about the whole object. However, for the tracking problems of robots, holistic templates have difficulty handling significant appearance changes and deformations of the targets. Local templates [14, 15, 4] are more robust and flexible under these conditions, but maintaining the geometric relationships among local patches remains difficult, since environmental clutter, occlusions and partially similar objects can often distract such patches and lead to drift.
The matching mechanism is used to identify candidate regions most similar to the target and separate them from the background. There are two main streams of research here. One is the generative model, which typically searches for the candidate most similar to the target within a neighborhood [16, 17, 18]. The other is the discriminative model, which poses tracking as a binary classification task that determines the decision boundary separating the target from the background [13, 12, 19, 20].
An online model update mechanism is essential for robust visual tracking under appearance variations. To address this problem, Kalal et al. [15] develop a bootstrapping classifier that selects positive and negative samples for model update. Grabner et al. [21] formulate the update problem as a semi-supervised task where the classifier is updated with both labeled and unlabeled data. However, online boosting requires the data to be independent and identically distributed, which is not always satisfied in visual tracking because the data are often temporally correlated.
In the proposed method, we adopt the compressive sensing theory to reduce the dimension of Haar-like features, and this process operates similarly to [12]. We employ a joint representation that considers both global and local models of the target to better handle significant appearance changes, deformations, similar-object distraction and occlusions. Our local models are efficiently constructed from a number of overlapping, randomly sampled local patches, and we re-extract the subpatches at each time step to avoid the drift that any single subpatch could cause. We adopt a discriminative model via the bottom ensemble layer to account for the matching degree of the local patches, and a generative model is used to search for the object through the middle ensemble layer as well as an adaptive Kalman filter. For model update, we employ ensemble learning to update the patches and classifiers to capture appearance variations and reduce tracking drift.
III Robust Object Tracking with a Hierarchical Ensemble Framework
In this section, we give a detailed description of the proposed hierarchical ensemble tracking (HET) framework. It is composed of two ensemble layers and a Kalman filter layer. At each time step, we first detect several samples around each local patch and formulate the corresponding base ensemble for each subpatch from several weak classifiers in the bottom ensemble layer. Second, we recover the target location in the middle ensemble layer by combining these base ensembles, and regard this location as the measurement for an adaptive Kalman filter. Third, we ascertain the ultimate object location at the current frame from the motion model and the measurement via the adaptive Kalman filter in the top layer. Finally, we update the model by re-extracting the local overlapping image subpatches in the final target region with a random spatial layout and updating the parameters of the weak classifiers for tracking in the next frame.
III-A Local Compressive Appearance Model
The compressive sensing theory shows that if the dimension of a feature space is sufficiently high, the features can be projected onto a randomly chosen low-dimensional space that preserves most of the salient information of the original high-dimensional features through a random projection matrix [22]. The signal can be recovered as long as the projection matrix satisfies the Restricted Isometry Property (RIP) [7]. Representing the object appearance by regions allows the proposed tracker to better handle occlusions and large appearance changes, and the compressive appearance model allows us to process a large number of regions in real time.
In this paper, we build compressively sensed versions of subpatches. Subpatches are randomly extracted, and the relative locations between the subpatches and the target bounding box are established when the tracking window is given by a detector or a manual label at the first frame. Every subpatch is represented by four components: a compressive feature vector $\mathbf{v}^i$, a classification score $s^i$, a relative location $\Delta\mathbf{l}^i$, which denotes the offset of the subpatch's upper-left corner from the upper-left corner of the target window, and the location $\mathbf{l}^i$ of the subpatch itself in the image space. The $i$-th subpatch is denoted as:
$$P^i = \{\mathbf{v}^i, s^i, \Delta\mathbf{l}^i, \mathbf{l}^i\} \qquad (1)$$
It is notable that the width and the height of each subpatch are identical for all subpatches and are determined at the beginning. After extracting these $N$ local overlapping image subpatches $\{P^i\}_{i=1}^{N}$, for the $i$-th subpatch we sample $M$ candidate subpatches of the same size, whose Euclidean distances to the subpatch are smaller than a threshold that is fixed throughout the sequence. These samples form a set $S^i$, and all samples are denoted as $S = \{S^1, \ldots, S^N\}$.
In order to obtain features that are invariant to scale, we adopt a multiscale image representation, which is often formed by convolving the input image with Gaussian filters of different spatial variances, and speed up the process via the integral image method. For computational reasons, we replace the Gaussian filters with rectangle filters [12]. For the $M$ samples of the $i$-th subpatch, we obtain the feature matrix $X^i \in \mathbb{R}^{m \times M}$, whose $k$-th column $\mathbf{x}^i_k \in \mathbb{R}^m$ denotes the high-dimensional multiscale feature vector of the $k$-th sample, produced by filtering with the rectangle filters and concatenating the responses. The features of all samples are denoted as $X = [X^1, \ldots, X^N]$.
We adopt a sparse random matrix $R \in \mathbb{R}^{n \times m}$, $n \ll m$, to reduce the original feature space to a lower-dimensional space, yielding $Z^i = RX^i$ for the $i$-th subpatch. Concatenating the local patches together, we obtain $Z = [Z^1, \ldots, Z^N]$, computed by
$$Z = RX \qquad (2)$$
A typical choice of such a measurement matrix is the random Gaussian matrix with $r_{ij} \sim \mathcal{N}(0, 1)$. But when $m$ is large, the computational load is still heavy because the random Gaussian matrix is dense. Thus it is common to employ a very sparse random measurement matrix that satisfies a weaker property than the RIP but is almost as accurate as the conventional random Gaussian matrix [23], as in (3), where $r_{ij}$ denotes the element in the $i$-th row and $j$-th column of $R$. This random matrix is fixed at the beginning and easy to compute for real-time tracking by fixing the maximum number of nonzero elements to a low value. The scheme used to produce the random matrix in this work is similar to [12]. We illustrate the dimension reduction process in Fig. 2.
$$r_{ij} = \sqrt{\rho} \times \begin{cases} 1 & \text{with probability } 1/(2\rho) \\ 0 & \text{with probability } 1 - 1/\rho \\ -1 & \text{with probability } 1/(2\rho) \end{cases} \qquad (3)$$
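A sparse measurement matrix of this form is straightforward to realize in code. Below is a minimal NumPy sketch; the dimensions, the sparsity parameter and all variable names are illustrative, not taken from the paper's Matlab implementation:

```python
import numpy as np

def sparse_projection_matrix(n, m, rho=3, seed=0):
    """Very sparse random measurement matrix [23]: each entry is
    sqrt(rho) with probability 1/(2*rho), -sqrt(rho) with probability
    1/(2*rho), and 0 otherwise, so roughly a 1/rho fraction of the
    entries are nonzero."""
    rng = np.random.default_rng(seed)
    u = rng.random((n, m))
    R = np.zeros((n, m))
    R[u < 1.0 / (2 * rho)] = np.sqrt(rho)
    R[u > 1.0 - 1.0 / (2 * rho)] = -np.sqrt(rho)
    return R

# project a high-dimensional multiscale feature vector x down to z = R x
n, m = 50, 10_000
R = sparse_projection_matrix(n, m)
x = np.random.default_rng(1).random(m)
z = R @ x  # n-dimensional compressive feature
```

Because two thirds of the entries of R are zero (for rho = 3), the projection touches only about a third of the high-dimensional features, which is what makes the compression cheap enough for real-time use.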
III-B Classification via Ensemble Layers
To link the individual pixels with the local patches, we employ naive Bayes classifiers to construct the pool of weak classifiers, one for each individual compressive feature, in the bottom layer. We assume the $n$ compressive features of each subpatch are independently distributed and build $n$ weak classifiers corresponding to these features by considering both the object and the background information. Since $R$ is fixed during the tracking process, the way the high-dimensional features of samples are compressed stays consistent for all subpatches. Let $\mathbf{z}$ denote an arbitrary compressive sample; for the $j$-th compressive feature $z_j$, the $j$-th classifier is constructed as follows:
$$h_j(\mathbf{z}) = \log\frac{p(z_j \mid y = 1)\,p(y = 1)}{p(z_j \mid y = 0)\,p(y = 0)} \qquad (4)$$
where $y \in \{0, 1\}$ is a binary variable representing the sample label. We assume $p(y = 1) = p(y = 0)$ by sampling the same number of positive and negative samples at the update step. The conditional distributions are approximately Gaussian due to the random projection of the high-dimensional features [24]. Thus we have:
$$p(z_j \mid y = 1) \sim \mathcal{N}(\mu_j^1, \sigma_j^1), \qquad p(z_j \mid y = 0) \sim \mathcal{N}(\mu_j^0, \sigma_j^0) \qquad (5)$$
where $\mu_j^1, \sigma_j^1$ ($\mu_j^0, \sigma_j^0$) are the mean and standard deviation of the positive (negative) class.
Then we introduce an ensemble strategy that combines the outputs of the weak classifiers into a strong classifier, which serves as a base ensemble to detect the subpatches as shown in Fig. 3:
$$H(\mathbf{z}) = \sum_{j=1}^{n} h_j(\mathbf{z}) \qquad (6)$$
For the $i$-th subpatch, we evaluate its $M$ samples and take the best match and its matching score as:
$$\mathbf{z}^{i*} = \arg\max_{\mathbf{z} \in S^i} H(\mathbf{z}), \qquad s^i = H(\mathbf{z}^{i*}) \qquad (7)$$
We match all subpatches in the same way in the bottom layer and obtain the compressive features of their optimal matches $\{\mathbf{z}^{i*}\}$ and their scores $\{s^i\}$. In the ensemble-learning field, it is often found that improved performance can be obtained by combining multiple models simultaneously, as in (6), instead of using a single model individually [25].
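With equal class priors, the base ensemble of (4)-(6) reduces to a sum of per-feature Gaussian log-likelihood ratios. A minimal sketch; the class parameters below are toy values chosen purely for illustration:

```python
import numpy as np

def log_gauss(x, mu, sig):
    # log density of N(mu, sig^2), evaluated elementwise
    return -0.5 * np.log(2 * np.pi * sig**2) - (x - mu)**2 / (2 * sig**2)

def strong_classifier(z, mu1, sig1, mu0, sig0):
    """Base ensemble H(z): sum over the n compressive features of the
    naive-Bayes log-ratio h_j(z) = log p(z_j|y=1) - log p(z_j|y=0),
    assuming equal class priors as in the text."""
    return np.sum(log_gauss(z, mu1, sig1) - log_gauss(z, mu0, sig0))

# toy class parameters (illustrative only): positive class centered at 0,
# negative class centered at 3, for a 50-dimensional compressive feature
mu1, sig1 = np.zeros(50), np.ones(50)
mu0, sig0 = 3.0 * np.ones(50), np.ones(50)
```

A sample near the positive-class means then receives a positive score, e.g. `strong_classifier(np.zeros(50), mu1, sig1, mu0, sig0) > 0`, while one near the negative-class means scores negative.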
In the middle layer, we propose a novel ensemble strategy to acquire the observed location of the object from the base ensembles via these detected local patches, as illustrated in Fig. 4.
Suppose the actual location of the object we are trying to predict is $\mathbf{l}$, and $\mathbf{l}^i$ denotes the $i$-th hypothesis of the object location, obtained from the $i$-th detected subpatch and its stored relative location. The output of each subpatch model can be written as the true value plus an error:
$$\mathbf{l}^i = \mathbf{l} + \boldsymbol{\epsilon}^i \qquad (8)$$
To make the scores comparable, we apply zero-mean normalization to the scores of the subpatches and then rescale them to nonnegative weights $\hat{w}^i$ with $\sum_{i=1}^{N} \hat{w}^i = 1$. $\hat{w}^i$ is regarded as the weight of the candidate obtained from the corresponding subpatch, and these weights are updated adaptively at each new frame. The combined prediction is given by
$$\mathbf{l}_{\mathrm{COM}} = \sum_{i=1}^{N} \hat{w}^i\, \mathbf{l}^i \qquad (9)$$
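The weighted combination can be sketched as follows. Note that the exact mapping from raw scores to the weights is not fully specified above, so the zero-mean-then-shift scheme below is an assumption, not the paper's formula:

```python
import numpy as np

def combined_location(locations, scores):
    """Middle-layer ensemble of (9): weight the per-subpatch location
    hypotheses by normalized scores.  The normalization (zero-mean,
    shift to nonnegative, scale to sum one) is one plausible reading
    of the text, not necessarily the paper's exact formula."""
    s = np.asarray(scores, dtype=float)
    s = s - s.mean()          # zero-mean normalization
    s = s - s.min()           # shift so all weights are nonnegative
    if s.sum() == 0:          # all scores equal: fall back to uniform
        s = np.ones_like(s)
    w = s / s.sum()           # weights sum to one
    return w @ np.asarray(locations, dtype=float)
```

For example, `combined_location([[0, 0], [3, 3], [6, 6]], [1, 2, 3])` pulls the prediction toward the best-scoring hypothesis.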
The average sum-of-squares error of an individual subpatch model then takes the form
$$\mathbb{E}\big[\|\mathbf{l}^i - \mathbf{l}\|^2\big] = \mathbb{E}\big[\|\boldsymbol{\epsilon}^i\|^2\big] \qquad (10)$$
where $\mathbb{E}[\cdot]$ denotes a frequentist expectation. The average error made by the subpatch models acting individually is
$$E_{\mathrm{AV}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big[\|\boldsymbol{\epsilon}^i\|^2\big] \qquad (11)$$
We assume that the errors have zero mean and are uncorrelated, since the subpatches are randomly extracted. So we have:
$$\mathbb{E}\big[\boldsymbol{\epsilon}^i\big] = \mathbf{0}, \qquad \mathbb{E}\big[(\boldsymbol{\epsilon}^i)^{\top}\boldsymbol{\epsilon}^j\big] = 0, \quad i \neq j \qquad (12)$$
The expected error of the combined prediction (taking uniform weights $\hat{w}^i = 1/N$ for the analysis) is computed by
$$E_{\mathrm{COM}} = \mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{i=1}^{N}\mathbf{l}^i - \mathbf{l}\Big\|^2\bigg] = \mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{\epsilon}^i\Big\|^2\bigg] = \frac{1}{N} E_{\mathrm{AV}} \qquad (13)$$
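The $1/N$ error reduction of (13) is easy to verify numerically under the zero-mean, uncorrelated-error assumption of (12); here we use uniform weights and Gaussian errors as an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 11, 200_000        # N subpatch predictors, T Monte Carlo trials

# zero-mean, uncorrelated per-predictor errors, as assumed in (12)
errors = rng.normal(0.0, 1.0, size=(T, N))

individual_mse = np.mean(errors**2)               # estimates E_AV
combined_mse = np.mean(errors.mean(axis=1)**2)    # estimates E_COM

# combining N uncorrelated predictors cuts the expected squared
# error by a factor of about N (here N = 11)
assert combined_mse < individual_mse / (N - 1)
```

In practice the errors of overlapping subpatches are not perfectly uncorrelated, so the achievable reduction is smaller than $1/N$, but the averaging effect still favors combining many subpatches over trusting any single one.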
III-C Adaptive Kalman Filter
The top layer builds an adaptive Kalman filter on the two ensemble layers to estimate the optimal system state and the target's image velocity, so that the proposed tracker can overcome temporary occlusions, missed detections and false detections. We regard the observed location (9) from the bottom and middle ensemble layers as the measurement. The discrete-time system state and measurement at time $t$ are $\mathbf{x}_t = [c_x, c_y, \dot{c}_x, \dot{c}_y]^{\top}$ and $\mathbf{z}_t = [\hat{c}_x, \hat{c}_y]^{\top}$, where $(c_x, c_y)$ and $(\hat{c}_x, \hat{c}_y)$ denote the center coordinates in image space corresponding to the system state and the measurement at time $t$, respectively, and $(\dot{c}_x, \dot{c}_y)$ denote the velocities along the two image axes. The state at the next time step $t+1$ and the measurement are given by
$$\mathbf{x}_{t+1} = A\mathbf{x}_t + \mathbf{w}_t, \qquad \mathbf{z}_t = H\mathbf{x}_t + \mathbf{v}_t \qquad (14)$$
where the state transition matrix $A$ is modeled according to Newton's equations of motion with $\Delta t$ the time between two frames, and $\mathbf{w}_t$ and $\mathbf{v}_t$ are assumed to be white Gaussian noises with zero mean and covariance matrices $Q_t$ and $R_t$, respectively. To achieve an adaptive Kalman filter, we take the mean of the normalized scores in the middle layer to update these two covariance matrices at every frame, as in (15) and (16). We ascertain the ultimate object location at the current frame in the top layer with this adaptive Kalman filter.
(15) 
(16) 
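A generic constant-velocity Kalman filter of the form (14) can be sketched as below. The fixed covariances `q` and `r` stand in for the score-adapted matrices of (15)-(16); all parameter values here are illustrative:

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter of the form (14).
    State x = [cx, cy, vx, vy]; measurement z = [cx, cy].
    The fixed q, r below stand in for the score-adapted
    covariances of (15)-(16); values are illustrative."""

    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.A = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def step(self, z):
        # predict with the motion model
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        # correct with the middle-layer measurement z
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]  # filtered target center
```

Fed a linear trajectory, the estimate locks onto both position and velocity within a few frames; inflating the measurement covariance when the ensemble score drops, as in (15)-(16), makes the filter coast on the motion model through occlusions.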
III-D Model Update
It is important to update the target model continuously for robust tracking in various difficult environments. The proposed method updates the hierarchical model via three mechanisms: re-extracting the subpatches according to the object found at the current frame, choosing the subpatches that need to be updated, and adjusting the parameters of the weak classifiers in the bottom layer. The update process is also shown in Fig. 1.
Once we find the object at the current frame, we need to correct the locations of all subpatches in the middle layer to counter the drift of the detection process. The layout used to re-extract the randomly overlapping subpatches is fixed at the first frame. After that, we compress the features of these new subpatches and feed them into the weak classifiers in the bottom layer to obtain the updated scores. We assume the scores are Gaussian distributed, and the subpatches to be updated are those whose scores satisfy
(17) 
where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of the scores.
Then, for the $i$-th chosen subpatch, we extract positive samples whose Euclidean distances to the subpatch are smaller than one threshold, and negative samples whose Euclidean distances are larger than another threshold, both fixed at the beginning. We update the parameters of its $j$-th weak classifier in (5) as
$$\mu_j^1 \leftarrow \lambda\mu_j^1 + (1-\lambda)\hat{\mu}^1, \qquad \sigma_j^1 \leftarrow \sqrt{\lambda(\sigma_j^1)^2 + (1-\lambda)(\hat{\sigma}^1)^2 + \lambda(1-\lambda)(\mu_j^1 - \hat{\mu}^1)^2} \qquad (18)$$
where $\lambda$ is the learning rate, and $\hat{\mu}^1$, $\hat{\sigma}^1$ denote the mean and standard deviation of the positive samples. The negative-class parameters $\mu_j^0$ and $\sigma_j^0$ are updated in a similar way.
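This update is a running Gaussian blend in the style of fast compressive tracking [12]. A minimal sketch; the learning-rate value is illustrative:

```python
import numpy as np

def update_gaussian_params(mu, sigma, samples, lam=0.85):
    """Running update of a weak classifier's Gaussian parameters as
    in (18), following the scheme of fast compressive tracking [12].
    `lam` is the learning rate (the value here is illustrative):
    lam close to 1 favors the old model, lam close to 0 favors the
    newly extracted samples."""
    mu_new = samples.mean(axis=0)
    sig_new = samples.std(axis=0)
    sigma_upd = np.sqrt(lam * sigma**2 + (1 - lam) * sig_new**2
                        + lam * (1 - lam) * (mu - mu_new)**2)
    mu_upd = lam * mu + (1 - lam) * mu_new
    return mu_upd, sigma_upd
```

The cross term $\lambda(1-\lambda)(\mu - \hat{\mu})^2$ accounts for the spread between the old and new means, so the blended variance is exact for a mixture of the two Gaussians rather than a naive average.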
IV Experiments
In this section, we show the experimental results of our method. First, we present the implementation details of the proposed tracker and the evaluation criteria used to quantitatively assess performance. Second, we validate the joint representation of our hierarchical ensemble framework against the base method. Third, we compare our tracker with the three most similar well-known methods in the visual tracking field. Fourth, we compare our method with 8 state-of-the-art methods that are feasible for robotic applications in terms of computational complexity and hardware requirements. Finally, we demonstrate that our tracker performs excellently for moving-human tracking, which is crucial for the tracking applications of robots.
IV-A Implementation Details
The proposed algorithm is implemented in Matlab (R2013a) and runs at 30 frames per second on an Intel i7-4790 machine with a 3.6 GHz CPU and 8 GB RAM. For each sequence, the location of the target object is manually labeled at the first frame. For all reported experiments, we employ 150 weak classifiers in the bottom ensemble layer and randomly generate 11 subpatches that are located inside the object, with width and height equal to three quarters of those of the object. The learning rate $\lambda$, the maximum number of nonzero elements in the random matrix, and the sampling distance thresholds are fixed across all experiments.
In the experiments, we employ two evaluation criteria to quantitatively assess the performance of the trackers: the average overlap rate and the center location error. Given the tracked bounding box $R_T$ and the ground-truth bounding box $R_G$, we use the detection criterion of the PASCAL VOC challenge [26], $\mathrm{score} = \frac{\mathrm{area}(R_T \cap R_G)}{\mathrm{area}(R_T \cup R_G)}$, to evaluate the success rate.
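The PASCAL overlap criterion is a simple intersection-over-union of the two boxes. A minimal sketch, with boxes given as (x, y, w, h):

```python
def overlap_rate(bt, bg):
    """PASCAL VOC overlap: area(BT ∩ BG) / area(BT ∪ BG).
    Boxes are (x, y, w, h) with (x, y) the top-left corner."""
    ix = max(0.0, min(bt[0] + bt[2], bg[0] + bg[2]) - max(bt[0], bg[0]))
    iy = max(0.0, min(bt[1] + bt[3], bg[1] + bg[3]) - max(bt[1], bg[1]))
    inter = ix * iy
    union = bt[2] * bt[3] + bg[2] * bg[3] - inter
    return inter / union
```

A frame counts as a success when this score exceeds a chosen threshold (the PASCAL challenge uses 0.5); the success plot sweeps that threshold.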
IV-B Comparison with the Base Method
Compressive tracking (CT) [12] employs the compressive sensing theory to compress the appearance models. It is reasonable to consider CT as our base method, since the way an image subpatch is compressed is almost the same. In the bottom layer of our method, we build compressively sensed versions of subpatches, while CT represents objects with a single global compressive appearance model.
However, a single holistic appearance model such as CT's is insufficient, especially for tracking non-rigid objects. We therefore adopt a joint representation that considers both global and local models of the targets to better handle significant appearance changes, deformations and occlusions. As shown in Fig. 5 and Fig. 7, our method obtains more accurate tracking performance than the base method, outperforming CT by 24% on the success plots and by 38.3% on the precision plots.
IV-C Comparison with Similar Methods
Among recent trackers, LSK [14], OAB [20] and MIL [19] are the most similar to ours. The proposed method outperforms all three, as shown in Fig. 6 and Fig. 7.
LSK is a robust tracking algorithm with a local sparse appearance model that combines a static sparse dictionary with a sparse coding histogram. It outperforms several sparse-representation methods according to [3]. However, LSK neglects the temporal consistency at the target bounding-box level, which we take into consideration by employing an adaptive Kalman filter. Therefore our method is more robust to occlusions than LSK, as shown in Fig. 9.
OAB and MIL are both boosting-based algorithms similar to ours, although our ensemble technique is much simpler than their boosting schemes. They characterize objects with global templates, whereas we adopt both local representations and holistic templates; thus we can better handle deformations and occlusions, as shown in Fig. 9.
IV-D Comparison with State-of-the-Art Methods
For comparison, we run 11 state-of-the-art algorithms with the same initial target positions: CT [12], CN [27], Struck [11], TLD [15], ASLA [16], CSK [18], OAB [20], MIL [19], LSK [14], SCM [28] and VTD [17]. For a fair evaluation, we evaluate the proposed HET against these methods using the source codes provided by the authors with adjusted parameters. We examine the effectiveness of the proposed approach on an online object tracking benchmark [3] with 50 sequences that cover most challenging tracking scenarios, such as illumination variations, scale variations, occlusions and deformations.
Note that we do not compare with some excellent methods such as MEEM [29], TGPR [30] and MUSTer [31], because they are not fast enough for robotic applications. For instance, MUSTer takes an average of 0.287 s/frame on the benchmark [3] on a cluster node (3.4 GHz, 8 cores, 32 GB RAM). We also do not consider convolutional-neural-network-based methods such as FCNT [32] and HCF [33], because they require powerful GPUs and still run at low frame rates. Although these trackers may perform slightly better than ours, their computational costs and hardware requirements are impracticable for robots.
The precision and success plots of OPE on the benchmark are shown in Fig. 7. Compared with the other top-ranking trackers in the benchmark, the proposed method achieves the best average performance, which can be attributed to the efficient ensemble methods on subpatches with a spatial layout, combined with the adaptive Kalman filter.
IV-E Human Body Tracking for Robots
In particular, we find that the proposed HET performs excellently for moving-human tracking, which is crucial for the tracking applications of robots. We choose 20 videos whose targets are moving human bodies and which include challenging conditions such as occlusions, deformations, fast motion and background clutter. Due to space limitations, we only show snapshots from 6 of the 20 videos in Fig. 8. When the human bodies undergo large deformations or occlusions, all methods except the proposed HET almost lose the objects.
Due to the supple structure of human limbs and the flexibility of human movements, the most challenging factors in moving-body tracking are deformations and occlusions. Splitting a human object into local parts clearly makes the appearance model more flexible, and this is exactly what the local compressive appearance models of the proposed HET do: we build compressively sensed versions of local patches in the bottom layer, which allows the proposed tracker to better handle occlusions and large appearance changes. The tracking results on these moving-human videos with the occlusion (OCC), deformation (DEF), background clutter (BC), scale variation (SV), fast motion (FM) and illumination variation (IV) attributes, based on the precision and success-rate metrics, are persuasive, as shown in Fig. 9. Our method ranks first on almost all of these attributes under both criteria.
The robustness and real-time performance on human body tracking make HET suitable for many robotic applications such as human-computer interaction, home-service robots, robot teaching systems and unmanned vehicles.
V Conclusion
In this paper, we propose a novel hierarchical ensemble framework in which the representations of the target candidates are localized and compressed. We incorporate information from individual pixel features, local patches and holistic target models. The multiple ensemble layers exploit the intrinsic relationships not only between the individual pixel features and the local patches, but also between the patches and the target candidates. In the bottom layer, the base ensembles are created as linear combinations of the outputs of the weak classifiers. A diverse collection of base ensembles is systematically combined to generate a stronger ensemble classifier in the middle layer, and the scores of the local patches are normalized to produce a vector of weights for the base ensembles. Experimental results, with evaluations against several state-of-the-art methods on challenging image sequences, demonstrate the robustness of the proposed HET tracking algorithm. Since our method is real-time, general and robust, we plan to apply it to the tracking tasks of robots. In particular, the proposed HET is very efficient for moving-human tracking, which can serve many applications such as unmanned vehicles and robot teaching systems.
References
 [1] A. Kolarow, M. Brauckmann, M. Eisenbach, K. Schenk, E. Einhorn, K. Debes, and H.-M. Gross, “Vision-based hyper real-time object tracker for robotic applications,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 2108–2115, IEEE, 2012.
 [2] D. A. Klein, D. Schulz, S. Frintrop, and A. B. Cremers, “Adaptive real-time video-tracking for arbitrary objects,” in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pp. 772–777, IEEE, 2010.
 [3] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 37, no. 9, pp. 1834–1848, 2015.
 [4] Y. Li, J. Zhu, and S. C. Hoi, “Reliable patch trackers: Robust visual tracking by exploiting reliable patches,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 353–361, 2015.
 [5] T. Zhang, S. Liu, C. Xu, S. Yan, B. Ghanem, N. Ahuja, and M.-H. Yang, “Structural sparse tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 150–158, 2015.
 [6] D. L. Donoho, “Compressed sensing,” Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289–1306, 2006.
 [7] E. J. Candes, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
 [8] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Fluids Engineering, vol. 82, no. 1, pp. 35–45, 1960.
 [9] S. He, Q. Yang, R. W. Lau, J. Wang, and M.-H. Yang, “Visual tracking via locality sensitive histograms,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2427–2434, IEEE, 2013.
 [10] J. Xiao, R. Stolkin, and A. Leonardis, “Single target tracking using adaptive clustered decision trees and dynamic multi-level appearance models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4978–4987, 2015.
 [11] S. Hare, A. Saffari, and P. H. Torr, “Struck: Structured output tracking with kernels,” in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 263–270, IEEE, 2011.
 [12] K. Zhang, L. Zhang, and M.-H. Yang, “Fast compressive tracking,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 10, pp. 2002–2015, 2014.
 [13] S. Stalder, H. Grabner, and L. Van Gool, “Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition,” in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pp. 1409–1416, IEEE, 2009.
 [14] B. Liu, J. Huang, L. Yang, and C. Kulikowski, “Robust tracking using local sparse appearance model and k-selection,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1313–1320, IEEE, 2011.
 [15] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 7, pp. 1409–1422, 2012.
 [16] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1822–1829, IEEE, 2012.
 [17] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 1269–1276, IEEE, 2010.
 [18] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Computer Vision–ECCV 2012, pp. 702–715, Springer, 2012.
 [19] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 8, pp. 1619–1632, 2011.
 [20] H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via on-line boosting,” in BMVC, vol. 1, p. 6, 2006.
 [21] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in Computer Vision–ECCV 2008, pp. 234–247, Springer, 2008.
 [22] R. G. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, vol. 24, no. 4, 2007.
 [23] P. Li, T. J. Hastie, and K. W. Church, “Very sparse random projections,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 287–296, ACM, 2006.
 [24] P. Diaconis and D. Freedman, “Asymptotics of graphical projection pursuit,” The Annals of Statistics, pp. 793–815, 1984.
 [25] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33, no. 1–2, pp. 1–39, 2010.
 [26] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
 [27] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. Van de Weijer, “Adaptive color attributes for real-time visual tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1090–1097, 2014.
 [28] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparse collaborative appearance model,” Image Processing, IEEE Transactions on, vol. 23, no. 5, pp. 2356–2368, 2014.
 [29] J. Zhang, S. Ma, and S. Sclaroff, “MEEM: Robust tracking via multiple experts using entropy minimization,” in Computer Vision–ECCV 2014, pp. 188–203, Springer, 2014.
 [30] J. Gao, H. Ling, W. Hu, and J. Xing, “Transfer learning based visual tracking with Gaussian processes regression,” in Computer Vision–ECCV 2014, pp. 188–203, Springer, 2014.
 [31] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, “Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 749–758, 2015.
 [32] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3119–3127, 2015.
 [33] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional features for visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3074–3082, 2015.