Learning Support Correlation Filters
for Visual Tracking
Abstract
Sampling and budgeting training examples are two essential factors in tracking algorithms based on support vector machines (SVMs), as a trade-off between accuracy and efficiency. Recently, the circulant matrix formed by dense sampling of translated image patches has been utilized in correlation filters for fast tracking. In this paper, we derive an equivalent formulation of an SVM model with a circulant matrix expression and present an efficient alternating optimization method for visual tracking. We incorporate the discrete Fourier transform into the proposed alternating optimization process, and pose the tracking problem as an iterative learning of support correlation filters (SCFs), which finds the globally optimal solution with real-time performance. For a given circulant data matrix with $n^2$ samples of size $n \times n$, the computational complexity of the proposed algorithm is $O(n^2 \log n)$, whereas that of the standard SVM-based approaches is at least $O(n^4)$. In addition, we extend the SCF-based tracking algorithm with multi-channel features, kernel functions, and scale-adaptive approaches to further improve the tracking performance. Experimental results on a large benchmark dataset show that the proposed SCF-based algorithms perform favorably against the state-of-the-art tracking methods in terms of accuracy and speed.
I Introduction
Robust visual tracking is a challenging problem due to the large changes of object appearance caused by pose, illumination, deformation, occlusion, distractors, and background clutter. Among the state-of-the-art methods, discriminative classifiers with model update and dense sampling have been demonstrated to perform well in visual tracking. On the other hand, correlation filters have been shown to be efficient for locating objects with the use of the circulant matrix and the fast Fourier transform. Central to the advances in visual tracking is the development of effective appearance models and efficient sampling schemes.
Discriminative appearance models have been extensively studied in visual tracking and have achieved state-of-the-art results. One representative discriminative appearance model is based on support vector machines (SVMs) [1, 2, 3, 4]. To learn classifiers for detecting objects within local regions, SVM-based tracking approaches are developed with two modules: a sampler that generates a set of positive and negative samples, and a learner that updates the classifier using the training samples. To reduce the computational load, SVM-based trackers usually use only a limited set of samples [3, 4]. As kernel SVM-based tracking methods are susceptible to the curse of kernelization, a budget mechanism is introduced for online learning of the structured SVM tracker [3] to restrict the number of support vectors, or an explicit feature mapping function is used to approximate the intersection kernel [4]. While sampling and budgeting may improve tracking efficiency at the expense of accuracy, most SVM-based trackers [2, 3, 4] do not run in real time.
Correlation filters (CFs) [5, 6, 7, 8] have recently been utilized for efficient visual tracking. The data matrix formed by dense sampling of a base sample has a circulant structure, which facilitates the use of the discrete Fourier transform (DFT) for efficient and effective visual tracking [5, 6, 7, 8]. However, ridge regression or kernel ridge regression is generally adopted as the predictor in these trackers. Henriques et al. [9] apply the circulant property to train support vector regression efficiently for pedestrian detection. The problem of how to exploit the circulant property to accelerate SVM-based trackers remains unaddressed.
In this paper, we propose a novel SVM-based algorithm via support correlation filters (SCFs) for efficient and accurate visual tracking. Different from the existing SVM-based trackers, the proposed algorithm based on SCFs deals with the sampling and budgeting issues by using the data matrix formed by dense sampling. By exploiting the circulant property, we formulate the proposed SVM-based tracker as a learning problem for support correlation filters and propose an efficient algorithm. By incorporating the discrete Fourier transform in an alternating optimization process, the SVM classifier can be efficiently updated by iterative learning of correlation filters. For an $n \times n$ image, there are $n^2$ training sample images of the same size in the circulant data matrix, and the computational complexity of the proposed algorithm is $O(n^2 \log n)$, whereas that of the standard SVM-based approaches is at least $O(n^4)$. Furthermore, we extend the proposed SCF-based algorithm to multi-channel SCF (MSCF), kernelized SCF (KSCF), and scale-adaptive KSCF (SKSCF) methods to improve the tracking performance.
We evaluate the proposed SCF-based algorithms on a large benchmark dataset in comparison with the state-of-the-art methods [10] and analyze the tracking results. First, with the discriminative strength of SVMs, the proposed KSCF method performs favorably against the existing regression-based correlation filter trackers. Second, by exploiting the circulant structure of training samples, the proposed KSCF algorithm performs well against the existing SVM-based trackers [3, 4] in terms of efficiency and accuracy. Third, the proposed KSCF and SKSCF algorithms outperform the state-of-the-art methods, including the ensemble and scale-adaptive tracking methods [4, 11, 12].
II Related Work and Problem Context
Visual tracking has long been an active research topic in computer vision, involving developments in both learning methods (e.g., feature learning and selection, online learning, and ensemble models) and application domains (e.g., auto-navigation, visual surveillance, and human-computer interaction). Several surveys and performance evaluations of state-of-the-art tracking algorithms [13, 14, 10, 15] have been reported in the literature; in this section we discuss the methods most relevant to this work.
II-A Appearance models for visual tracking
Appearance models play an important role in visual tracking and can be broadly categorized as generative or discriminative. Generative appearance methods based on holistic templates [16], subspace representations [17, 18, 19], and sparse representations [20, 21, 22] have been developed for object representations. Discriminative appearance methods are usually based on features learned from a large set of examples with effective classifiers, and visual tracking is posed as a task of distinguishing the target objects from the backgrounds. Tracking methods based on discriminative appearance models have been shown to achieve state-of-the-art results [10].
Discriminative tracking methods are usually based on object detection within a local search region using classifiers such as boosting methods [23, 24, 25, 26, 27], random forests [28, 29], and SVMs [1, 2, 3]. Among these classifiers, boosting methods [23, 24, 25, 26, 27] and random forests [28, 29] are ensemble learning methods where sampling from large sets of features is indispensable, which makes it difficult to incorporate correlation filters into these approaches. In this work, we exploit the discriminative strength of SVMs and the efficiency of correlation filters for visual tracking.
Label ambiguity has also been studied for visual tracking, e.g., in semi-supervised [26, 27, 30] and multiple-instance [23, 24] learning methods. Considering that classification-based methods are trained to predict the class label rather than the object location, Hare et al. [3] propose a tracker based on structured SVM. In this work, we alleviate the label ambiguity problem by using an assignment scheme similar to that for object detection and tracking [31, 25, 32].
II-B Correlation filters for tracking
A correlation filter uses a designed template to generate a strong response to a region that is similar to the target object while suppressing responses to distractors. Correlation filters have been widely applied to numerous problems such as face recognition [33, 34], object detection [35, 9, 36], object alignment [37] and action recognition [38, 39]. A number of correlation filters have been proposed in the literature, including the minimum average correlation energy (MACE) [36], optimal trade-off synthetic discriminant filter (OTSDF) [40], unconstrained minimum average correlation energy (UMACE) [34], and minimum output sum of squared error (MOSSE) [5] methods.
Recently, the max-margin CF (MMCF) [41], multi-channel CF [11, 42, 43, 6], and kernelized CF [6, 7, 44] methods have been developed for object detection and tracking. The MMCF [41] scheme combines the localization properties of correlation filters with the good generalization performance of SVMs. The multi-channel correlation filters [11, 42, 43, 6] are designed to use more effective features, e.g., histograms of oriented gradients (HOG). In addition, a method that combines MMCF and multi-channel CF is developed [45] for object detection and landmark localization. The kernel trick is also employed to learn kernelized synthetic discriminant functions (SDFs) [44] with correlation filters. We note that the MMCF [45, 41] and kernelized SDF [44] schemes are trained offline with a high computational load, and do not exploit the circulant structure of the data matrix formed by translated images of target objects.
In visual tracking, Bolme et al. [5] propose the MOSSE method to learn adaptive correlation filters with high efficiency and competitive performance. Subsequently, the kernelized correlation filter (KCF) [6] is developed by exploiting the circulant property of the kernel matrix. Extensions of CF and KCF with multi-channel features are introduced for visual tracking [11, 42, 43, 6]. Within the tracking methods based on correlation filters, numerous issues such as adaptive scale estimation [11, 12, 8], limited boundaries [46], zero-aliasing [47], tracking failure [48], and partial occlusion [49] have been addressed.
We note that existing CF-based tracking methods are developed with ridge regression schemes for locating the target. On the other hand, SVM-based tracking methods, e.g., Struck [3] and MEEM [4], have been demonstrated to achieve state-of-the-art performance. One straightforward extension is to integrate SVM-based trackers with the MMCF method [41]. Nevertheless, the MMCF scheme is computationally prohibitive for real-time applications. In this work, we develop novel discriminative tracking algorithms based on SVMs and correlation filters that perform efficiently and effectively.
III Support Correlation Filtering
We first present the problem formulation and propose an alternating optimization algorithm to learn support correlation filters efficiently. We then develop the MSCF, KSCF and SKSCF methods to learn multi-channel, nonlinear, and scale-adaptive correlation filters, respectively, for robust visual tracking.
III-A Problem formulation
Given an image $\mathbf{x}$, the full set of its translated versions forms a circulant matrix $X$ with several interesting properties [50], where each row represents one possible observation of a target object (see Fig. 1). A circulant matrix consists of all possible cyclic translations of a target image, and tracking is formulated as determining the most likely row. In general, the eigenvectors of a circulant matrix are the base vectors of the discrete Fourier transform:

$$X = F\,\mathrm{diag}(\hat{\mathbf{x}})\,F^{H}, \qquad (1)$$

where $F$ is the unitary DFT matrix, $F^{H}$ is the Hermitian transpose of $F$, and $\hat{\mathbf{x}} = \mathcal{F}(\mathbf{x})$ denotes the Fourier transform of $\mathbf{x}$. In the following, we use $\mathrm{diag}(\cdot)$ to form a diagonal matrix from a vector, and overload it to also return the diagonal vector of a matrix.
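The diagonalization in (1) is easy to verify numerically on a 1-D toy signal; the snippet below is illustrative only (a small random sample rather than an image patch):

```python
import numpy as np

np.random.seed(0)
n = 8
x = np.random.randn(n)

# Circulant data matrix: row i is the base sample cyclically shifted by i.
X = np.stack([np.roll(x, i) for i in range(n)])

# Unitary DFT matrix F and the DFT of the base sample.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
x_hat = np.fft.fft(x)

# X = F diag(x_hat) F^H: the DFT basis diagonalizes every circulant matrix.
X_rec = F @ np.diag(x_hat) @ F.conj().T
print(np.allclose(X_rec, X))   # True
```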
Our goal is to learn a support correlation filter $\mathbf{w}$ and a bias $b$ to classify any translated image $\mathbf{x}_i$ by

$$f(\mathbf{x}_i) = \mathbf{w}^{T}\mathbf{x}_i + b. \qquad (2)$$
Note that all the translated images form a circulant matrix $X$. We can classify all the samples in $X$ by

$$\mathbf{f} = X\mathbf{w} + b\mathbf{1} = \mathcal{F}^{-1}\big(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{w}}\big) + b\mathbf{1}, \qquad (3)$$

where $\mathcal{F}^{-1}$ denotes the inverse discrete Fourier transform (IDFT), $\odot$ denotes element-wise multiplication, and $\hat{\mathbf{x}}^{*}$ denotes the complex conjugate of $\hat{\mathbf{x}}$. Given the circulant matrix generated by an $n \times n$ image $\mathbf{x}$, the computational complexity of classifying every $\mathbf{x}_i$ by (2) is $O(n^4)$, while that of classifying all samples of $X$ by (3) is $O(n^2 \log n)$.
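As a quick illustration of (3), dense evaluation of all shifted samples and the Fourier-domain evaluation produce identical scores (1-D toy data; the filter and bias are synthetic):

```python
import numpy as np

np.random.seed(0)
n = 16
x = np.random.randn(n)   # base sample
w = np.random.randn(n)   # support correlation filter
b = 0.3                  # bias

# Dense evaluation: one inner product per cyclic shift, O(n^2) in total.
X = np.stack([np.roll(x, i) for i in range(n)])
f_dense = X @ w + b

# Fourier-domain evaluation of the same scores in O(n log n).
f_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w))) + b

print(np.allclose(f_dense, f_fft))   # True
```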
Given the training set of a circulant matrix $X$ with the corresponding class labels $\mathbf{y}$, we use the squared hinge loss and define the SVM model [51] as follows:

$$\min_{\mathbf{w}, b, \mathbf{e}} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + \frac{C}{2}\|\mathbf{e}\|^{2}, \quad \text{s.t.} \;\; \mathbf{y} \odot (X\mathbf{w} + b\mathbf{1}) \geq \mathbf{1} - \mathbf{e},$$

where $\mathbf{e}$ is the vector of slack variables and $C$ is the trade-off parameter.
Based on the circulant property of $X$, the SVM model can be equivalently formulated as:

$$\min_{\mathbf{w}, b} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + \frac{C}{2}\Big\|\max\Big(\mathbf{0},\, \mathbf{1} - \mathbf{y} \odot \big(\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{w}}) + b\mathbf{1}\big)\Big)\Big\|^{2}, \qquad (4)$$

where $\odot$ denotes element-wise multiplication, and $\mathbf{1}$ denotes a vector of 1s.
Class labels of the translated images.
Let $\mathbf{p}_1$ denote the centre position of the object of interest $\mathbf{x}_1$, and $\mathbf{p}_i$ the position of the translated image $\mathbf{x}_i$. In object detection [31, 32], the overlap between the windows at $\mathbf{p}_i$ and $\mathbf{p}_1$ is used to measure the similarity between $\mathbf{x}_i$ and $\mathbf{x}_1$. Specifically, the positive samples are defined by windows with high overlap with the ground-truth object window, and the negative samples are defined by those whose overlap is below a lower threshold. In the proposed discriminative tracking model, we need to set upper and lower thresholds for assigning binary labels. In Section IV, we determine the optimal upper and lower thresholds for SCF, MSCF and KSCF respectively with experiments.
In this work, we use the following confidence map of object position [8] to define the class labels:

$$m_i = \kappa \exp\!\left(-\left(\frac{\|\mathbf{p}_i - \mathbf{p}_1\|}{\sigma}\right)^{\beta}\right),$$

where $\kappa$ is a normalization constant, and $\sigma$ and $\beta$ are the scale and shape parameters, respectively. With the confidence map, we define the class labels as follows:

$$y_i = \begin{cases} +1, & m_i \geq \theta_u, \\ -1, & m_i \leq \theta_l, \\ \text{unlabeled}, & \text{otherwise}, \end{cases} \qquad (5)$$

where $\theta_l$ and $\theta_u$ are the lower and upper thresholds, respectively. With this formulation, we can use the circulant matrix formed by all samples to improve training efficiency, and discard any samples that are not labeled.
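The label assignment in (5) can be sketched as follows; the map parameters and thresholds below are illustrative values, not the tuned ones used in the experiments:

```python
import numpy as np

# Ternary label assignment from a confidence map peaked at the target center.
# sigma/beta are the scale/shape parameters; theta_l/theta_u the thresholds.
def assign_labels(h, w, center, sigma=2.0, beta=2.0, theta_l=0.3, theta_u=0.7):
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.sqrt((ys - center[0]) ** 2 + (xs - center[1]) ** 2)
    m = np.exp(-((d / sigma) ** beta))      # confidence map, peak 1 at center
    y = np.zeros((h, w))                    # 0 marks discarded (unlabeled)
    y[m >= theta_u] = 1.0                   # confident positives
    y[m <= theta_l] = -1.0                  # confident negatives
    return m, y

m, y = assign_labels(9, 9, center=(4, 4))
print(y[4, 4], y[0, 0], y[4, 6])   # 1.0 -1.0 0.0 (near, far, uncertain)
```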
Comparisons with existing CF-based trackers.
As illustrated in Fig. 2(a), existing CF-based trackers generally follow the ridge regression model. That is, with the continuous confidence map $\mathbf{m}$, CF-based trackers seek the optimal correlation filter by minimizing the mean squared error (MSE) between the predefined confidence map and the actual output,

$$\min_{\mathbf{w}} \;\; \|X\mathbf{w} - \mathbf{m}\|^{2} + \lambda\|\mathbf{w}\|^{2}, \qquad (6)$$

which has the closed-form solution,

$$\hat{\mathbf{w}} = \frac{\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{m}}}{\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}} + \lambda}. \qquad (7)$$
As shown in Fig. 2(b), the proposed model aims to learn a max-margin SVM classifier to distinguish the object of interest from the background. Using the label assignment scheme in (5), we can discard uncertain samples in training to alleviate the label ambiguity problem. The importance of max-margin classification and of handling label ambiguity has been demonstrated in object detection [31]. The proposed model copes with both issues for effective visual tracking.
III-B Alternating optimization
In this section, we reformulate the model in (4) and propose an alternating optimization algorithm to learn SCFs efficiently. To exploit the property of the circulant matrix for learning SCFs, we introduce the slack vector $\mathbf{e}$ explicitly, and the SVM model in (4) is then reformulated as:

$$\min_{\mathbf{w}, b, \mathbf{e}} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + \frac{C}{2}\|\mathbf{e}\|^{2}, \quad \text{s.t.} \;\; \mathbf{y} \odot \big(\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{w}}) + b\mathbf{1}\big) \geq \mathbf{1} - \mathbf{e}. \qquad (8)$$
With this formulation, the subproblem on $\mathbf{e}$ has a closed-form solution when $(\mathbf{w}, b)$ is known, and the subproblem on $(\mathbf{w}, b)$ has a closed-form solution when $\mathbf{e}$ is known. Thus the above model can be efficiently solved using the alternating optimization algorithm by iterating between the following two steps:
Updating $\mathbf{e}$.
Given $(\mathbf{w}, b)$, we let $\mathbf{f} = \mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{w}}) + b\mathbf{1}$, and the subproblem on $\mathbf{e}$ becomes:

$$\min_{\mathbf{e}} \;\; \|\mathbf{e}\|^{2}, \quad \text{s.t.} \;\; \mathbf{e} \geq \mathbf{1} - \mathbf{y} \odot \mathbf{f},$$

which has the closed-form solution:

$$\mathbf{e} = \max\big(\mathbf{0},\, \mathbf{1} - \mathbf{y} \odot \mathbf{f}\big). \qquad (9)$$
Updating $(\mathbf{w}, b)$.
Given $\mathbf{e}$, we let $\mathbf{z} = \mathbf{y} \odot (\mathbf{1} - \mathbf{e})$, and the subproblem on $(\mathbf{w}, b)$ becomes:

$$\min_{\mathbf{w}, b} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + \frac{C}{2}\|X\mathbf{w} + b\mathbf{1} - \mathbf{z}\|^{2}.$$

This subproblem is a quadratic programming problem. One feasible solution is to eliminate $b$ and derive the closed-form solution on $\mathbf{w}$ directly. However, this approach fails to exploit the circulant property of $X$. Thus, we obtain $(\mathbf{w}, b)$ by solving the following system of equations:

$$\Big(X^{T}X + \tfrac{1}{C}I\Big)\mathbf{w} = X^{T}(\mathbf{z} - b\mathbf{1}), \qquad (10)$$
$$b = \tfrac{1}{N}\,\mathbf{1}^{T}(\mathbf{z} - X\mathbf{w}), \qquad (11)$$

where $N$ is the number of samples. Combining the two equations above with the property of the DFT, we have

$$b = \bar{z}, \qquad (12)$$

where $\bar{z}$ is the mean of $\mathbf{z}$. Given $b$, we use (10) to obtain the closed-form solution to $\mathbf{w}$ in the Fourier domain:

$$\hat{\mathbf{w}} = \frac{\hat{\mathbf{x}}^{*} \odot \mathcal{F}(\mathbf{z} - b\mathbf{1})}{\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}} + 1/C}.$$
As illustrated in Fig. 1, when the $t$th frame with class labels arrives, the proposed algorithm learns support correlation filters by iterating between updating $\mathbf{e}$ and updating $(\mathbf{w}, b)$ until convergence. Given $(\mathbf{w}, b)$, the update of $\mathbf{e}$ can be computed element-wise, which has a complexity of $O(n^2)$. Given $\mathbf{e}$, the complexity of updating $b$ is $O(n^2)$ and that of updating $\mathbf{w}$ is $O(n^2 \log n)$. Thus, the complexity is $O(n^2 \log n)$ per iteration, which makes our algorithm efficient in learning support correlation filters. The main steps of the proposed learning algorithm for support correlation filters are summarized in Algorithm 1.
Convergence.
The proposed algorithm converges to the global optimum with a linear convergence rate. For presentation clarity, we give the detailed analysis and proof of its optimality condition, global convergence, and convergence rate in Appendix A. Based on the optimality condition, we define
and adopt the following stopping criterion:
(13) 
Comparisons with MMCF [41].
The proposed SCF model and learning algorithm differ from the MMCF approach in three aspects. First, the training samples for MMCF are independently cropped image patches, while those for SCF are all translated versions of the base sample; we exploit the circulant property of the resulting data matrix to develop an efficient learning algorithm. Second, we propose an alternating optimization algorithm to solve the proposed model, with a per-iteration complexity of $O(n^2 \log n)$. In contrast, the MMCF method adopts the conventional SMO algorithm, whose complexity grows with both the number of training samples and the sample dimension, which is computationally expensive for real-time applications. Third, the proposed model has the squared hinge loss and regularizer terms, while the MMCF method adopts the hinge loss and includes an extra average correlation energy term.
III-C Multi-channel SCF
Different local descriptors, e.g., color attributes, HOG, and SIFT [52, 42, 53], provide rich image features for effective visual tracking. We treat local descriptors as multi-channel images where multiple measurements are associated with each pixel. To exploit multi-dimensional features, we propose the multi-channel SCF as follows:

$$\min_{\{\mathbf{w}^{(d)}\}, b, \mathbf{e}} \;\; \frac{1}{2}\sum_{d=1}^{D}\|\mathbf{w}^{(d)}\|^{2} + \frac{C}{2}\|\mathbf{e}\|^{2}, \quad \text{s.t.} \;\; \mathbf{y} \odot \Big(\sum_{d=1}^{D}\mathcal{F}^{-1}\big(\hat{\mathbf{x}}^{(d)*} \odot \hat{\mathbf{w}}^{(d)}\big) + b\mathbf{1}\Big) \geq \mathbf{1} - \mathbf{e}, \qquad (14)$$

where $D$ is the number of channels, and $\mathbf{x}^{(d)}$ and $\mathbf{w}^{(d)}$ denote the $d$th channel of the image and correlation filter, respectively. To learn the proposed MSCF model, we adopt the same equations for updating $\mathbf{e}$ and $b$, and compute $\{\mathbf{w}^{(d)}\}$ by solving the following problem:

$$\min_{\{\mathbf{w}^{(d)}\}} \;\; \frac{1}{2}\sum_{d=1}^{D}\|\mathbf{w}^{(d)}\|^{2} + \frac{C}{2}\Big\|\sum_{d=1}^{D}\mathcal{F}^{-1}\big(\hat{\mathbf{x}}^{(d)*} \odot \hat{\mathbf{w}}^{(d)}\big) - (\mathbf{z} - b\mathbf{1})\Big\|^{2},$$

where $\mathbf{z} = \mathbf{y} \odot (\mathbf{1} - \mathbf{e})$.
Let $\hat{\mathbf{x}}(k) = [\hat{x}^{(1)}(k), \ldots, \hat{x}^{(D)}(k)]^{T}$ collect the $k$th Fourier coefficients of all channels, and define $\hat{\mathbf{w}}(k)$ and $\hat{z}'(k) = \mathcal{F}(\mathbf{z} - b\mathbf{1})(k)$ analogously. The closed-form solution for the concatenated filter can be directly written as

$$\hat{\mathbf{w}} = \Big(\hat{X}^{H}\hat{X} + \tfrac{1}{C}I\Big)^{-1}\hat{X}^{H}\hat{\mathbf{z}}', \qquad (15)$$

where $I$ is the identity matrix. Note that the matrix to be inverted is an $n^{2}D \times n^{2}D$ matrix, so it is not practical to compute its inverse to update $\hat{\mathbf{w}}$. In the multi-channel correlation filters, it is noted that $\hat{X}^{H}\hat{X}$ has a diagonal block structure, and the $k$th element of $\hat{\mathbf{w}}^{(d)}$ depends only on $\hat{\mathbf{x}}(k)$ and $\hat{z}'(k)$. Thus, the subproblem on $\{\mathbf{w}^{(d)}\}$ can be further decomposed into $n^{2}$ small systems of equations:

$$\Big(\hat{\mathbf{x}}(k)\,\hat{\mathbf{x}}(k)^{H} + \tfrac{1}{C}I\Big)\hat{\mathbf{w}}(k) = \hat{\mathbf{x}}(k)\,\hat{z}'(k). \qquad (16)$$
In [43], Galoogahi et al. solve such systems of equations with an algorithm of complexity $O(D^{3})$ per frequency. We note that the matrix on the left-hand side of (16) is the sum of a rank-one matrix and a scaled identity matrix. Based on the Sherman-Morrison formula [54], we have

$$\Big(\hat{\mathbf{x}}(k)\,\hat{\mathbf{x}}(k)^{H} + \tfrac{1}{C}I\Big)^{-1} = C\left(I - \frac{\hat{\mathbf{x}}(k)\,\hat{\mathbf{x}}(k)^{H}}{1/C + \hat{\mathbf{x}}(k)^{H}\hat{\mathbf{x}}(k)}\right).$$

The closed-form solution for $\hat{\mathbf{w}}(k)$ is then obtained by

$$\hat{\mathbf{w}}(k) = \frac{\hat{z}'(k)\,\hat{\mathbf{x}}(k)}{1/C + \hat{\mathbf{x}}(k)^{H}\hat{\mathbf{x}}(k)}. \qquad (17)$$

It should be noted that all the denominators $1/C + \hat{\mathbf{x}}(k)^{H}\hat{\mathbf{x}}(k)$ can be precomputed with a complexity of $O(n^{2}D)$. As such, the proposed algorithm only involves one DFT, one IDFT, and several element-wise operations per iteration, and the complexity is $O(n^{2}(\log n + D))$.
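Each per-frequency system in (16) is a scaled identity plus a rank-one matrix, so it can be solved without any matrix inversion. The check below compares a naive dense solve against the Sherman-Morrison shortcut on one synthetic frequency bin ($D$ channels; all quantities are illustrative):

```python
import numpy as np

np.random.seed(0)
D, lam = 31, 0.1                  # number of channels, regularization weight
u = np.random.randn(D) + 1j * np.random.randn(D)   # per-frequency channel vector
v = np.random.randn(D) + 1j * np.random.randn(D)   # per-frequency right-hand side

# Naive O(D^3) solve of the rank-one-plus-identity system (lam*I + u u^H) w = v.
w_naive = np.linalg.solve(lam * np.eye(D) + np.outer(u, u.conj()), v)

# Sherman-Morrison gives the same solution with O(D) work per frequency bin.
w_sm = (v - u * (u.conj() @ v) / (lam + u.conj() @ u)) / lam

print(np.allclose(w_naive, w_sm))   # True
```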
Algorithms  MSCF  DCF [6]
Features  Raw pixels  CN  HOG  HOG + CN  Raw pixels  CN  HOG  HOG + CN
Mean DP (%)  64.9  66.3  78.4  80.6  44.4  48.0  71.9  76.2
Mean AUC (%)  44.6  44.9  53.7  55.5  31.2  34.8  50.1  53.2
Mean FPS  76  62  64  54  278  210  292  151
Algorithms  KSCF  KCF [6]
Features  Raw pixels  CN  HOG  HOG + CN  Raw pixels  CN  HOG  HOG + CN
Mean DP (%)  64.4  68.1  79.3  85.0  55.3  57.3  73.2  75.8
Mean AUC (%)  45.3  46.9  53.2  57.5  40.0  41.8  50.7  53.0
Mean FPS  40  37  44  35  154  120  172  102
III-D Kernelized SCF
Given the kernel function $k(\cdot, \cdot)$, the proposed kernelized SCF model can be extended to learn the nonlinear decision function:

$$f(\mathbf{x}_i) = \sum_{j} \alpha_j\, k(\mathbf{x}_i, \mathbf{x}_j) + b,$$

where the filter $\mathbf{w} = \sum_j \alpha_j \varphi(\mathbf{x}_j)$, $\varphi(\cdot)$ stands for the nonlinear feature mapping implicitly determined by the kernel function $k$, and $\boldsymbol{\alpha}$ is the coefficient vector to be learned.
Denote by $K$ the kernel matrix with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. As noted in [7], for some kernel functions (e.g., Gaussian RBF and polynomial) which are permutation invariant, the kernel matrix $K$ is circulant. Let $\mathbf{k}$ be the first row of the circulant matrix $K$. Therefore, the matrix-vector multiplication $K\boldsymbol{\alpha}$ can be efficiently computed via the DFT:

$$K\boldsymbol{\alpha} = \mathcal{F}^{-1}\big(\hat{\mathbf{k}} \odot \hat{\boldsymbol{\alpha}}\big), \qquad (18)$$

and we have,

$$\boldsymbol{\alpha}^{T}K\boldsymbol{\alpha} = \boldsymbol{\alpha}^{T}\mathcal{F}^{-1}\big(\hat{\mathbf{k}} \odot \hat{\boldsymbol{\alpha}}\big). \qquad (19)$$
Based on (18) and (19), the proposed kernelized SCF model is formulated as

$$\min_{\boldsymbol{\alpha}, b, \mathbf{e}} \;\; \frac{1}{2}\boldsymbol{\alpha}^{T}K\boldsymbol{\alpha} + \frac{C}{2}\|\mathbf{e}\|^{2}, \quad \text{s.t.} \;\; \mathbf{y} \odot \big(K\boldsymbol{\alpha} + b\mathbf{1}\big) \geq \mathbf{1} - \mathbf{e}. \qquad (20)$$
To learn KSCF, we use the alternating optimization method by iteratively solving for $\mathbf{e}$ and $(\boldsymbol{\alpha}, b)$. The solution of the subproblem on $\mathbf{e}$ is similar to that in the SCF model, and we update $\boldsymbol{\alpha}$ and $b$ using the closed-form solution of kernel ridge regression. Based on the representer theorem [55], the optimal solution in the kernel space can be expressed as a linear combination of the feature maps of the samples: $\mathbf{w} = \sum_i \alpha_i \varphi(\mathbf{x}_i)$. Namely, only the coefficient vector $\boldsymbol{\alpha}$ needs to be learned. In [55], the solution to kernelized ridge regression in the dual space is given by

$$\boldsymbol{\alpha} = \Big(K + \tfrac{1}{C}I\Big)^{-1}\mathbf{t}.$$

Thus, the closed-form solution to our subproblem on $\boldsymbol{\alpha}$ can be formulated as

$$\boldsymbol{\alpha} = \Big(K + \tfrac{1}{C}I\Big)^{-1}(\mathbf{z} - b\mathbf{1}),$$

where $\mathbf{z} = \mathbf{y} \odot (\mathbf{1} - \mathbf{e})$ and $\mathbf{1}$ denotes a vector of 1s. As the kernel matrix $K$ is circulant and can be diagonalized by the DFT, the optimal solution of $\boldsymbol{\alpha}$ in the Fourier domain can be computed by

$$\hat{\boldsymbol{\alpha}} = \frac{\mathcal{F}(\mathbf{z} - b\mathbf{1})}{\hat{\mathbf{k}}^{\mathbf{xx}} + 1/C}, \qquad (21)$$

where $\hat{\mathbf{k}}^{\mathbf{xx}}$ denotes the DFT of the kernel correlation of $\mathbf{x}$ with itself, which is known as the kernel autocorrelation.
For image features with $D$ channels, the complexity of computing the kernel correlation is $O(Dn^2 \log n)$. After that, the learning process only requires element-wise operations and one DFT and one IDFT per iteration, with a complexity of $O(n^2 \log n)$. Thus, the proposed KSCF model leverages rich features through nonlinear kernels without significantly increasing the computational load.
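The kernel correlation with all cyclic shifts can be computed with a single FFT pair rather than one kernel evaluation per shift, in the style of the kernelized correlation filter literature. The sketch below is illustrative (synthetic 1-D signals; the kernel width and normalization are arbitrary choices):

```python
import numpy as np

np.random.seed(0)
n = 64
x = np.random.randn(n)                        # template signal
z = np.roll(x, 5) + 0.05 * np.random.randn(n) # shifted, slightly noisy copy

# Gaussian kernel correlation of z with every cyclic shift of x, using one
# FFT pair for the cross term instead of n explicit kernel evaluations.
def gaussian_kernel_correlation(x, z, sigma=1.0):
    cross = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(z)))
    d2 = x @ x + z @ z - 2.0 * cross          # squared distance to every shift
    return np.exp(-np.maximum(d2, 0.0) / (sigma ** 2 * n))

k = gaussian_kernel_correlation(x, z)
print(int(np.argmax(k)))   # 5: the shift that aligns z with x
```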
Furthermore, to handle large scale changes, we develop the SKSCF model by maintaining a scaling pool in a way similar to the scale-adaptive CF scheme [12], where bilinear interpolation is used to resize samples across scales.
IV Performance Evaluation
We use the benchmark dataset and protocols [10] to evaluate the proposed SCF algorithms. First, we evaluate several variants of the proposed method, i.e., SCF, MSCF, KSCF, and SKSCF, to analyze the effect of feature representations and kernel functions. Next, comprehensive experiments are conducted to compare the proposed methods with other CF-based trackers. Finally, the KSCF and SKSCF algorithms are compared with existing SVM-based and other state-of-the-art methods. The tracking results can be found at http://faculty.ucmerced.edu/project/scf/ and the source code will be made available to the public.
IV-A Experimental setup
Datasets and evaluated tracking methods.
To assess the performance of the proposed methods, experiments are carried out on a benchmark dataset [10] of 50 challenging image sequences annotated with 11 attributes. For the first frame of each sequence, the bounding box of the target object is provided for fair comparisons. For comprehensive comparisons, we evaluate the baseline SCF, multi-channel SCF, kernelized SCF and SKSCF methods. The SCF and MSCF methods are designed in the linear space, with raw pixels and with multi-channel features based on HOG [52] as well as color names (CN) [42], respectively. The KSCF and SKSCF algorithms are evaluated using the Gaussian kernel on multi-channel feature representations. Furthermore, we compare the proposed trackers with other trackers based on correlation filters (e.g., MOSSE [5], CSK [7], KCF [6], DCF [6], STC [8] and CN [42]), existing SVM-based trackers (e.g., Struck [3] and MEEM [4]), and other state-of-the-art methods (e.g., TGPR [56], SCM [16], TLD [27], L1APG [57], MIL [23], ASLA [20] and CT [58]).
Evaluation protocols.
We use the one-pass evaluation (OPE) protocol [10], which reports precision and success plots based on the position error and bounding-box overlap metrics with respect to the ground-truth object locations. For precision plots, the distance precision (DP) at a threshold of 20 pixels is reported. For success plots, the area under the curve (AUC) is computed. In addition, the number of frames per second (FPS) that each method is able to process is discussed.
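The two summary numbers can be computed from per-frame results as follows; the per-frame errors and overlaps below are synthetic examples:

```python
import numpy as np

# OPE-style summary metrics: distance precision at a 20-pixel threshold, and
# the area under the success curve over overlap thresholds in [0, 1].
def dp_at_threshold(center_errors, thresh=20.0):
    return float(np.mean(center_errors <= thresh))

def success_auc(overlaps, num_thresholds=101):
    ts = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([(overlaps > t).mean() for t in ts]))

errors = np.array([5.0, 12.0, 18.0, 35.0])     # per-frame center errors (pixels)
overlaps = np.array([0.85, 0.65, 0.45, 0.15])  # per-frame IoU with ground truth
print(dp_at_threshold(errors))                 # 0.75
print(round(success_auc(overlaps), 2))
```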
Parameter settings.
The experiments are carried out on a desktop computer with an Intel Xeon 2-core 3.30 GHz CPU and 32 GB RAM. The proposed SCF-based trackers involve a few model parameters, i.e., the trade-off parameter $C$, the scale parameter $\sigma$ and shape parameter $\beta$ of the confidence maps, and the lower and upper thresholds ($\theta_l$, $\theta_u$) in (5). In addition, the KSCF method has one extra parameter for the Gaussian RBF kernel function, and SKSCF contains a scaling pool parameter. For online tracking, the model is updated by linear interpolation with the adaption rate [10].
In all experiments, the model parameters are fixed for each SCFbased tracker. For all SCFbased trackers, the tradeoff and shape parameter are fixed to and , respectively. The thresholds (, ) in (5) are set to for SCF, for MSCF and for KSCF, SKSCF. The scale parameter is set to be , which is adaptive to the size of each target object. The scaling pool is fixed as . The adaption rate is set to for raw pixel features, and for multichannel features, respectively. The kernel parameter of KSCF is set to . As for HOG parameters, the orientations and cell size are set to 9 and 4.
IV-B Evaluation on SCF-based trackers
In this section, we first evaluate the effects of feature representations and kernel functions, and then compare four variants of the SCF-based trackers, i.e., SCF, MSCF, KSCF, and SKSCF, in terms of both accuracy and efficiency. The results of the corresponding CF-based trackers are also reported for all SCF-based methods.
Kernels  Linear  Polynomial  Gaussian
Mean DP (%)  82.0  84.2  85.0
Mean AUC (%)  56.2  57.1  57.5
Mean FPS  94  55  35
We consider three typical feature representations, i.e., raw pixels, HOG features [52], and color names (CN) [42]. The results of the MSCF and KSCF methods are listed in Table I and Table II. For each feature representation, the model parameters are tuned to their optimal values and then fixed for all the following experiments. For KSCF, the Gaussian RBF kernel is adopted.
The OPE plots of MSCF with the linear DCF [6], and of KSCF with the nonlinear KCF [6], are shown in Fig. 3 and Fig. 4. Compared with raw pixels and color features, the HOG representation significantly improves the tracking performance in terms of mean DP and mean AUC. For MSCF, the implementations using color names and HOG features outperform raw pixels by 1.4% and 13.5% in terms of mean DP, respectively. For KSCF, the trackers using color names and HOG features outperform raw pixels by 3.7% and 14.9% in terms of mean DP, respectively.
With the combination of color names and HOG, the MSCF tracker is further improved to 80.6% in terms of DP. Similarly, the performance of the KSCF method is improved to 85.0% in terms of DP with the use of color names and HOG features. Compared with the DCF [6] and KCF [6] methods, the proposed MSCF and KSCF algorithms achieve higher DP and AUC values for each feature representation. Tables I and II show that both the KSCF and MSCF methods run in real time even using the representation based on HOG and CN features.
Algorithms
Mean DP (%)  87.4  85.0  80.6  62.8  77.1  74.8  73.2  71.9  63.7  58.6  55.8  44.4
Mean AUC (%)  62.3  57.5  55.5  48.9  56.5  56.3  50.7  50.1  44.9  37.4  40.6  31.3
Mean FPS  8  35  54  76  14  30  172  292  79  557  151  421
Algorithms
Mean DP (%)  87.4  85.0  83.3  67.4
Mean AUC (%)  62.3  57.5  57.2  48.6
Mean FPS  8  35  10  10
We evaluate the effect of kernel functions on KSCF using HOG and CN features, including the linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{T}\mathbf{x}'$, the polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^{T}\mathbf{x}' + c)^{d}$, and the Gaussian RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^{2}/\sigma_k^{2})$. The polynomial degree $d$ and the Gaussian kernel width $\sigma_k$ are fixed in all experiments. Table III shows the results of KSCF with different kernels. Clearly, the KSCF method with a nonlinear kernel outperforms the one with a linear kernel in terms of mean DP and mean AUC, and the one with the Gaussian RBF kernel achieves the best performance.
We implement the SKSCF method by extending KSCF with the Gaussian RBF kernel, and compare four variants of the SCF-based trackers, i.e., SCF, MSCF, KSCF, and SKSCF. Table IV shows the results of the four SCF-based trackers, where the SKSCF method performs best, followed by the KSCF approach. On the other hand, the KSCF method is more efficient than the SKSCF approach. In the following experiments, we compare both the KSCF and SKSCF methods with other schemes based on correlation filters, SVMs, and other state-of-the-art tracking approaches.
IV-C Comparisons with CF-based trackers
We use the tracking benchmark dataset [10] to evaluate the proposed SCF-based algorithms against existing CF-based methods including MOSSE [5], CSK [7], KCF [6], DCF [6], STC [8], CN [42], DSST [11] and SAMF [12].
Algorithms
Mean DP (%)  87.4  85.0  83.3  73.2  71.8  65.2  60.6  54.5  49.4  48.8  41.5
Mean AUC (%)  62.3  57.5  57.2  50.7  51.1  50.1  43.4  44.2  38.6  36.9  30.8
Mean FPS  8  35  10  172  0.5  1  22  8  3  28  39
Classic correlation filters.
Fig. 5 shows the OPE plots of these trackers. The SCF, MOSSE [5], CSK [7] and STC [8] methods operate on raw pixels in the linear space. We note that the MOSSE method adopts a ridge regression function, while the SCF algorithm uses the max-margin model. Although the CSK and STC methods also operate on raw pixels, the CSK method is a kernelized CF-based tracker and the STC approach is a scale-adaptive tracking method. Overall, the SCF algorithm performs favorably against these CF-based methods based on regression and nonlinear functions.
Multichannel correlation filters.
The MSCF, CN [42], and DCF [6] methods are based on correlation filters with multi-channel features. The DCF method is based on HOG features and the CN approach operates on color attributes, while the MSCF scheme uses the combination of HOG and color representations. Fig. 5 shows that the MSCF method performs well among these three trackers based on correlation filters.
Kernelized correlation filters.
The KSCF method is compared with the corresponding kernelized KCF [6] and CSK [7] trackers. The CSK and KCF methods are based on raw pixels and HOG features, respectively. As shown in Table IV and Fig. 6, the KSCF method based on HOG and CN features performs favorably against the KCF and CSK approaches.
Scale-adaptive correlation filters.
The KSCF and SKSCF methods are evaluated against three scale-adaptive trackers: STC [8], DSST [11] and SAMF [12]. We note that the DSST [11] and SAMF [12] methods have been shown to be the best and second-best performers in a recent tracking benchmark evaluation [59]. Both the KSCF and SKSCF trackers perform significantly better than the STC method, and they also outperform the DSST and SAMF approaches by a large margin. Fig. 7 shows the OPE plots on all the sequences with the scale-variation attribute, where the KSCF method performs favorably against the DSST and SAMF trackers. Overall, the KSCF algorithm performs favorably in terms of accuracy and speed.
Attributes  FM  BC  MB  DEF  IV  IPR  LR  OCC  OPR  OV  SV 

SKSCF  0.779  0.859  0.802  0.893  0.841  0.810  0.596  0.872  0.857  0.800  0.809 
KSCF  0.680  0.825  0.761  0.854  0.805  0.816  0.555  0.852  0.836  0.697  0.768 
MEEM [4]  0.745  0.802  0.721  0.856  0.771  0.796  0.529  0.801  0.840  0.726  0.795 
TGPR [56]  0.579  0.763  0.570  0.760  0.695  0.683  0.567  0.668  0.693  0.535  0.637 
KCF [6]  0.564  0.752  0.599  0.747  0.687  0.692  0.379  0.735  0.718  0.589  0.680 
SCM [16]  0.346  0.578  0.358  0.589  0.613  0.613  0.305  0.646  0.621  0.429  0.672 
TLD [27]  0.557  0.428  0.523  0.495  0.540  0.588  0.349  0.556  0.593  0.576  0.606 
ASLA [20]  0.255  0.496  0.283  0.473  0.529  0.521  0.156  0.479  0.535  0.333  0.552 
L1APG [57]  0.367  0.425  0.379  0.398  0.341  0.524  0.460  0.475  0.490  0.329  0.472 
MIL [23]  0.415  0.456  0.381  0.493  0.359  0.465  0.171  0.448  0.484  0.393  0.471 
CT [58]  0.330  0.339  0.314  0.463  0.365  0.361  0.152  0.429  0.405  0.336  0.448 
Table VIII. Overlap success (AUC) scores with respect to each attribute.

| Attribute | FM | BC | MB | DEF | IV | IPR | LR | OCC | OPR | OV | SV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SKSCF | 0.729 | 0.795 | 0.757 | 0.863 | 0.743 | 0.720 | 0.542 | 0.788 | 0.757 | 0.808 | 0.682 |
| KSCF | 0.629 | 0.741 | 0.689 | 0.779 | 0.649 | 0.690 | 0.389 | 0.696 | 0.697 | 0.705 | 0.540 |
| MEEM [4] | 0.706 | 0.747 | 0.692 | 0.711 | 0.653 | 0.648 | 0.470 | 0.694 | 0.694 | 0.742 | 0.594 |
| TGPR [56] | 0.542 | 0.713 | 0.570 | 0.711 | 0.632 | 0.601 | 0.501 | 0.592 | 0.603 | 0.546 | 0.505 |
| KCF [6] | 0.516 | 0.669 | 0.539 | 0.668 | 0.534 | 0.575 | 0.358 | 0.593 | 0.587 | 0.589 | 0.477 |
| SCM [16] | 0.348 | 0.550 | 0.358 | 0.566 | 0.586 | 0.574 | 0.308 | 0.602 | 0.576 | 0.449 | 0.635 |
| TLD [27] | 0.475 | 0.388 | 0.485 | 0.434 | 0.461 | 0.477 | 0.327 | 0.455 | 0.489 | 0.516 | 0.494 |
| ASLA [20] | 0.261 | 0.468 | 0.284 | 0.485 | 0.514 | 0.496 | 0.163 | 0.469 | 0.509 | 0.359 | 0.544 |
| L1APG [57] | 0.359 | 0.404 | 0.363 | 0.398 | 0.298 | 0.445 | 0.458 | 0.437 | 0.423 | 0.341 | 0.407 |
| MIL [23] | 0.353 | 0.414 | 0.261 | 0.440 | 0.300 | 0.339 | 0.157 | 0.378 | 0.369 | 0.416 | 0.335 |
| CT [58] | 0.327 | 0.323 | 0.262 | 0.420 | 0.308 | 0.290 | 0.143 | 0.360 | 0.325 | 0.405 | 0.342 |
IV-D Comparisons with SVM-based trackers
We evaluate the proposed KSCF and SKSCF methods against two state-of-the-art SVM-based trackers, Struck [3] and MEEM [4], which are based on structured learning and ensemble learning, respectively. Table V and Fig. 8 show that both the KSCF and SKSCF algorithms perform favorably against the MEEM and Struck methods in all aspects. As shown in Fig. 6, the KSCF algorithm tracks target objects more precisely than the other methods in the Singer2, Coke, Suv and Tiger2 sequences, while the other trackers tend to drift away from the targets. These results show that dense sampling can be efficiently combined with SVMs for effective visual tracking.
IV-E Comparisons with state-of-the-art trackers
We evaluate the KSCF algorithm against other state-of-the-art trackers, including MEEM [4], KCF [6], TGPR [56], SCM [16], TLD [27], L1APG [57], MIL [23], ASLA [20] and CT [58]. Fig. 10 shows the OPE plots, and Table VI presents the mean DP, AUC and FPS. Overall, the proposed KSCF and SKSCF algorithms perform favorably against the state-of-the-art methods, including the TLD, SCM, TGPR and MEEM schemes.
The sequences in the benchmark dataset [10] are annotated with 11 challenging factors for visual tracking, including illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutters (BC), and low resolution (LR). Table VII and Table VIII show the performance of the KSCF and state-of-the-art methods in terms of DP and AUC with respect to each factor. Fig. 9 shows the precision and success metrics of the leading trackers (i.e., SKSCF, KSCF, MEEM, KCF and TGPR) with respect to the attributes. We note that the MEEM method [4] adopts a multiple-expert framework to deal with model drift, and performs slightly better than KSCF on the FM, LR, OV and SV attributes. Overall, the KSCF algorithm is among the top three trackers for every attribute, and the SKSCF algorithm performs best in both metrics for all but one attribute.
V Conclusions
We propose an effective and efficient approach to learning support correlation filters for real-time visual tracking. By reformulating the SVM model with a circulant data matrix as training input, we present a DFT-based alternating optimization algorithm that learns support correlation filters efficiently. In addition, we develop the MSCF, KSCF, and SKSCF tracking methods to exploit multichannel features, nonlinear classifiers, and scale-adaptive schemes. Experiments on a large benchmark dataset show that the proposed KSCF and SKSCF algorithms perform favorably against the state-of-the-art tracking methods in terms of accuracy and speed.
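To illustrate how the DFT turns learning over all circular shifts into cheap element-wise operations, the sketch below learns a minimal ridge-regression correlation filter and locates a shifted patch. This is the simpler closed-form baseline in the spirit of CSK/KCF, not the SCF model with its SVM loss; the patch size, regularization value, and Gaussian label width are illustrative choices.

```python
import numpy as np

def learn_filter(x, y, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain.

    Ridge regression over all circular shifts of x diagonalizes under
    the DFT, so the solution is an element-wise division.
    """
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (np.conj(X) * Y) / (np.conj(X) * X + lam)

def detect(W, z):
    """Apply the filter to a test patch; the response peak gives the shift."""
    response = np.real(np.fft.ifft2(W * np.fft.fft2(z)))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(i) for i in peak)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
# Gaussian regression target peaked (circularly) at the origin.
d = np.minimum(np.arange(32), 32 - np.arange(32)).astype(float)
y = np.exp(-(d[:, None] ** 2 + d[None, :] ** 2) / (2 * 2.0 ** 2))

W = learn_filter(x, y)
z = np.roll(np.roll(x, 5, axis=0), 7, axis=1)  # circularly shifted copy
print(detect(W, z))  # recovers the (5, 7) shift
```

Because every step is an FFT or an element-wise product, learning and detection both cost O(n log n) for an n-pixel patch, which is the source of the efficiency gap over standard SVM solvers noted in the abstract.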
Acknowledgments
This work is supported in part by NSFC grant (61271093), the Program of Ministry of Education for New Century Excellent Talents (NCET-12-0150), NSF CAREER Grant (No. 1149783), and NSF IIS Grant (No. 1152576).
Appendix A Convergence analysis
In the following, we first analyze the optimality condition of the problem, and then prove the global convergence and convergence rate of the SCF algorithm.
A-A Optimality conditions
In the spatial domain, the SCF model can be expressed as:
Defining the augmented vector $\tilde{\mathbf{x}}_i = [\mathbf{x}_i^\top, 1]^\top$, we obtain the augmented weight vector $\tilde{\mathbf{w}} = [\mathbf{w}^\top, b]^\top$, which absorbs the bias term. The above problem can then be reformulated as:
(22) 
where and . We introduce an indicator function and the subdifferential [60] of is:
(23) 
As the loss function in (22) is convex, a point is a solution if and only if the subdifferential of the loss at that point contains zero [61]. Thus the optimality conditions are:
(24) 
where denotes the i-th training sample. With , we have:
where with . Thus the matrix is invertible. For simplicity, let ; from (24) and the above equation, we have
(25) 
(26) 
Based on the optimality conditions in (24), we define
and use the stopping criterion:
where is a predefined threshold.
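Since the updates and the residual above appear here only symbolically, the following sketch illustrates the same structure (alternating closed-form block updates, stopped once a gradient-based optimality residual falls below a threshold) on a toy convex surrogate objective 0.5||Xw - e||^2 + (lam/2)||w||^2 + (mu/2)||e - y||^2. The data X, y and the values of lam, mu, and tol are illustrative, not the SCF quantities.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))   # toy data matrix
y = rng.standard_normal(20)        # toy targets
lam, mu, tol = 0.1, 1.0, 1e-8

w, e = np.zeros(5), np.zeros(20)
for it in range(1000):
    # w-step: exact minimizer of the objective with e fixed
    w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ e)
    # e-step: exact minimizer of the objective with w fixed
    e = (X @ w + mu * y) / (1 + mu)
    # residual: norm of the full gradient at the current iterate,
    # mirroring a stopping rule based on the optimality conditions
    g_w = X.T @ (X @ w - e) + lam * w
    g_e = (e - X @ w) + mu * (e - y)
    if max(np.linalg.norm(g_w), np.linalg.norm(g_e)) < tol:
        break
print(f"stopped after {it + 1} iterations")
```

Because each block update is an exact minimizer of a strongly convex objective, the residual decays geometrically, so the loop terminates well before the iteration cap.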
A-B Global convergence
To compute , we reformulate the subproblem for each entry:
where . Its solution is given by:
Proposition 1.
For any , we have:
where the equality holds only if .
Proof.
(i) If , , and we also have .
(ii) If , , where the equality holds only if .
(iii) If , e.g., , it is easy to see that the inequality holds.
∎
For simplicity, let . We have and then we get two symmetric positive definite matrices as follows:
where and is the spectral radius of matrix [62].
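The role of the spectral radius here can be illustrated numerically: a fixed-point iteration x_{k+1} = M x_k + c converges linearly to its unique fixed point whenever the spectral radius of M is below one. The matrix below is a random toy example rescaled to spectral radius 0.5, not one of the matrices defined above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Random 6x6 matrix, rescaled so its spectral radius is exactly 0.5.
A = rng.standard_normal((6, 6))
M = 0.5 * A / max(abs(np.linalg.eigvals(A)))

c = rng.standard_normal(6)
x_star = np.linalg.solve(np.eye(6) - M, c)  # the unique fixed point

x = np.zeros(6)
errors = []
for _ in range(60):
    x = M @ x + c
    errors.append(np.linalg.norm(x - x_star))

# The error shrinks asymptotically by the spectral radius per step.
print(errors[0], errors[-1])
```

After 60 iterations the error has dropped by roughly 0.5^60, which is why a spectral-radius bound strictly below one yields the linear convergence rate claimed for the SCF algorithm.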
With the definitions of and , the updating rules and can be written as:
Let . We then have the following proposition.
Proposition 2.
For any , the following inequality holds:
and the equality holds if and only if .
Proof.
Note that . From the definition of , we have:
Denote the eigendecomposition of by , where is a full-rank orthogonal matrix and is a diagonal matrix with .
The equality can be written as . Since is full-rank orthogonal, we have . Thus . Since is diagonal with , it holds that . Multiplying both sides by , we obtain