Learning Support Correlation Filters for Visual Tracking

Wangmeng Zuo, Xiaohe Wu, Liang Lin, Lei Zhang, and Ming-Hsuan Yang
Abstract

Sampling and budgeting training examples are two essential factors in tracking algorithms based on support vector machines (SVMs), reflecting a trade-off between accuracy and efficiency. Recently, the circulant matrix formed by dense sampling of translated image patches has been utilized in correlation filters for fast tracking. In this paper, we derive an equivalent formulation of an SVM model with a circulant matrix expression and present an efficient alternating optimization method for visual tracking. We incorporate the discrete Fourier transform into the proposed alternating optimization process, and pose the tracking problem as iterative learning of support correlation filters (SCFs), which finds the globally optimal solution with real-time performance. For a given circulant data matrix with n samples of size n, the computational complexity of the proposed algorithm is O(n log n) per iteration, whereas that of standard SVM-based approaches is at least O(n²). In addition, we extend the SCF-based tracking algorithm with multi-channel features, kernel functions, and scale-adaptive approaches to further improve the tracking performance. Experimental results on a large benchmark dataset show that the proposed SCF-based algorithms perform favorably against the state-of-the-art tracking methods in terms of accuracy and speed.

I Introduction

Robust visual tracking is a challenging problem due to the large changes in object appearance caused by pose, illumination, deformation, occlusion, and distractors, as well as background clutter. Among the state-of-the-art methods, discriminative classifiers with model update and dense sampling have been demonstrated to perform well in visual tracking. On the other hand, correlation filters have been shown to be efficient for locating objects with the use of circulant matrices and the fast Fourier transform. Central to the advances in visual tracking are the development of effective appearance models and efficient sampling schemes.

Discriminative appearance models have been extensively studied in visual tracking and have achieved the state-of-the-art results. One representative discriminative appearance model is based on support vector machines (SVMs) [1, 2, 3, 4]. To learn classifiers for detecting objects within local regions, SVM-based tracking approaches are developed based on two modules: a sampler to generate a set of positive and negative samples and a learner to update the classifier using the training samples. To reduce the computational load, SVM-based trackers usually only use a limited set of samples [3, 4]. As kernel SVM-based tracking methods are susceptible to the curse of kernelization, a budget mechanism is introduced for online learning of the structural SVM tracker [3] to restrict the number of support vectors, or an explicit feature mapping function is used to approximate the intersection kernel [4]. While sampling and budgeting may improve tracking efficiency at the expense of accuracy, most SVM-based trackers [2, 3, 4] do not run in real-time.

Correlation filters (CFs) [5, 6, 7, 8] have recently been utilized for efficient visual tracking. The data matrix formed by dense sampling of the base sample has a circulant structure, which facilitates the use of the discrete Fourier transform (DFT) for efficient and effective visual tracking [5, 6, 7, 8]. However, ridge regression or kernel ridge regression is generally adopted as the predictor in these trackers. Henriques et al. [9] exploit the circulant property to train support vector regression efficiently for pedestrian detection. The problem of how to exploit the circulant property to accelerate SVM-based trackers remains unaddressed.

In this paper, we propose a novel SVM-based algorithm via support correlation filters (SCFs) for efficient and accurate visual tracking. Different from existing SVM-based trackers, the proposed algorithm based on SCFs deals with the sampling and budgeting issues by using the data matrix formed by dense sampling. By exploiting the circulant property, we formulate the proposed SVM-based tracker as a learning problem for support correlation filters and propose an efficient algorithm. By incorporating the discrete Fourier transform into an alternating optimization process, the SVM classifier can be efficiently updated by iterative learning of correlation filters. For an image with n pixels, the circulant data matrix contains n translated training samples of the same size, and the computational complexity of the proposed algorithm is O(n log n) per iteration, whereas that of standard SVM-based approaches is at least O(n²). Furthermore, we extend the proposed SCF-based algorithm to multi-channel SCF (MSCF), kernelized SCF (KSCF), and scale-adaptive KSCF (SKSCF) methods to improve the tracking performance.

We evaluate the proposed SCF-based algorithms on a large benchmark dataset with comparison to the state-of-the-art methods [10] and analyze the tracking results. First, with the discriminative strength of SVMs, the proposed KSCF method performs favorably against the existing regression-based correlation filter trackers. Second, by exploiting the circulant structure of training samples, the proposed KSCF algorithm performs well compared with the existing SVM-based trackers [3, 4] in terms of efficiency and accuracy. Third, the proposed KSCF and SKSCF algorithms outperform the state-of-the-art methods including the ensemble and scale-adaptive tracking methods [4, 11, 12].

II Related Work and Problem Context

Visual tracking has long been an active research topic in computer vision which involves developments in both learning methods (e.g., feature learning and selection, online learning, and ensemble models) and application domains (e.g., autonomous navigation, visual surveillance, and human-computer interaction). Several surveys and performance evaluations of state-of-the-art tracking algorithms [13, 14, 10, 15] have been reported in the literature; in this section we discuss the methods most relevant to this work.

Fig. 1: Illustration of the proposed SCF learning algorithm at the t-th frame. The proposed algorithm iterates between updating the auxiliary variable and updating the SVM classifier (w, b) until convergence. In each iteration, only one DFT and one IDFT are required, which makes the proposed algorithm computationally efficient. The black blocks denote support vectors, and our algorithm can adaptively find and exploit difficult samples (i.e., support vectors) to learn support correlation filters.

II-A Appearance models for visual tracking

Appearance models play an important role in visual tracking which can be broadly categorized as generative or discriminative. Generative appearance methods based on holistic templates [16], subspace representations [17, 18, 19], and sparse representations [20, 21, 22] have been developed for object representations. Discriminative appearance methods are usually based on features learned from a large set of examples with effective classifiers. Visual tracking is posed as a task to distinguish the target objects from the backgrounds. Tracking methods based on discriminative appearance models have been shown to achieve the state-of-the-art results [10].

Discriminative tracking methods are usually based on object detection within a local search region using classifiers such as boosting methods [23, 24, 25, 26, 27], random forests [28, 29], and SVMs [1, 2, 3]. Among these classifiers, boosting methods [23, 24, 25, 26, 27] and random forests [28, 29] are ensemble learning methods in which sampling from large sets of features is indispensable, which makes it difficult to incorporate correlation filters into these approaches. In this work, we exploit the discriminative strength of SVMs and the efficiency of correlation filters for visual tracking.

Label ambiguity has also been studied for visual tracking, e.g., semi-supervised [26, 27, 30] and multiple instance [23, 24] learning methods. Considering that classification based methods are trained to predict the class label rather than the object location, Hare et al. [3] propose a tracker based on structured SVM. In this work, we alleviate the label ambiguity problem by using the assignment scheme in a way similar to that for object detection and tracking [31, 25, 32].

II-B Correlation filters for tracking

A correlation filter uses a designed template to generate strong responses to regions that are similar to the target object while suppressing responses to distractors. Correlation filters have been widely applied to numerous problems such as face recognition [33, 34], object detection [35, 9, 36], object alignment [37] and action recognition [38, 39]. A number of correlation filters have been proposed in the literature, including the minimum average correlation energy (MACE) [36], optimal trade-off synthetic discriminant filter (OTSDF) [40], unconstrained minimum average correlation energy (UMACE) [34], and minimum output sum of squared error (MOSSE) [5] methods.

Recently, the max-margin CF (MMCF) [41], multi-channel CF [11, 42, 43, 6], and kernelized CF [6, 7, 44] methods have been developed for object detection and tracking. The MMCF [41] scheme combines the localization properties of correlation filters with the good generalization performance of SVMs. The multi-channel correlation filters [11, 42, 43, 6] are designed to use more effective features, e.g., the histogram of oriented gradients (HOG). In addition, a method that combines MMCF and multi-channel CF is developed [45] for object detection and landmark localization. The kernel trick is also employed to learn kernelized synthetic discriminant functions (SDFs) [44] with correlation filters. We note that the MMCF [45, 41] and kernelized SDF [44] schemes are trained offline with high computational load, and do not exploit the circulant structure of the data matrix formed by translated images of target objects.

In visual tracking, Bolme et al. [5] propose the MOSSE method to learn adaptive correlation filters with high efficiency and competitive performance. Subsequently, the kernelized correlation filter (KCF) [6] is developed by exploiting the circulant property of the kernel matrix. Extensions of CF and KCF with multi-channel features are introduced for visual tracking [11, 42, 43, 6]. Within the tracking methods based on correlation filters, numerous issues such as adaptive scale estimation [11, 12, 8], limited boundaries [46], zero-aliasing [47], tracking failure [48], and partial occlusion [49] have been addressed.

We note that existing CF-based tracking methods are developed with ridge regression schemes for locating the target. On the other hand, SVM-based tracking methods, e.g., Struck [3] and MEEM [4], have been demonstrated to achieve state-of-the-art performance. One straightforward extension is to integrate SVM-based trackers with the MMCF method [41]. Nevertheless, the MMCF scheme is computationally prohibitive for real-time applications. In this work, we develop novel discriminative tracking algorithms based on SVMs and correlation filters that perform efficiently and effectively.

III Support Correlation Filtering

We first present the problem formulation and propose an alternating optimization algorithm to learn support correlation filters efficiently. We then develop the MSCF, KSCF and SKSCF methods to learn multi-channel, nonlinear and scale-adaptive correlation filters respectively for robust visual tracking.

III-A Problem formulation

Given an image x, the full set of its translated versions forms a circulant matrix X with several interesting properties [50], where each row represents one possible observation of a target object (see Fig. 1). A circulant matrix consists of all possible cyclic translations of a target image, and tracking is formulated as determining the most likely row. In general, the eigenvectors of a circulant matrix are the base vectors of the discrete Fourier transform:

X = F diag(x̂) Fᴴ,   (1)

where Fᴴ is the Hermitian transpose of the DFT matrix F and x̂ = F(x) denotes the Fourier transform of x. In the following, diag(·) applied to a vector forms the corresponding diagonal matrix, and applied to a matrix it returns the diagonal vector.

Our goal is to learn a support correlation filter w and a bias b to classify any translated image x_i by

f(x_i) = wᵀx_i + b.   (2)

Note that all the translated images form a circulant matrix X. We can classify all the samples in X by

f(X) = Xw + b·1 = F⁻¹(x̂* ⊙ ŵ) + b·1,   (3)

where F⁻¹(·) denotes the inverse discrete Fourier transform (IDFT) and x̂* denotes the complex conjugate of x̂. Given the circulant matrix X generated by an image x with n pixels, the computational complexity of classifying every x_i by (2) is O(n²), while that of classifying all samples of X by (3) is O(n log n).
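To make this complexity gap concrete, the following minimal NumPy sketch (not from the paper; a 1-D illustration with arbitrary data) evaluates the classifier of (2) on every cyclic shift with an explicit loop and then reproduces the same responses with the single DFT/IDFT pair of (3).

```python
import numpy as np

n = 256
rng = np.random.default_rng(0)
x = rng.standard_normal(n)          # base sample (vectorized patch)
w = rng.standard_normal(n)          # support correlation filter
b = 0.1                             # bias

# Naive evaluation of (2): one dot product per cyclic shift, O(n^2) overall.
f_naive = np.array([w @ np.roll(x, i) + b for i in range(n)])

# Evaluation of (3): all responses at once with one DFT/IDFT pair, O(n log n).
x_hat, w_hat = np.fft.fft(x), np.fft.fft(w)
f_fft = np.real(np.fft.ifft(np.conj(x_hat) * w_hat)) + b

assert np.allclose(f_naive, f_fft)
```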

Given the training set of a circulant matrix X with the corresponding class labels y, we use the squared hinge loss and define the SVM model [51] as follows:

min_{w,b,e} (1/2)‖w‖² + (C/2)‖e‖²,  s.t.  y_i (wᵀx_i + b) ≥ 1 − e_i,  i = 1, …, n,

where e = [e_1, …, e_n]ᵀ is the vector of slack variables.

Based on the circulant property of X, the SVM model can be equivalently formulated as:

min_{w,b} (1/2)‖w‖² + (C/2)‖max(0, 1 − y ⊙ (F⁻¹(x̂* ⊙ ŵ) + b·1))‖²,   (4)

where ⊙ denotes element-wise multiplication and 1 denotes a vector of ones.

Class labels of the translated images.

Let p* denote the centre position of the object of interest x, and p_i the position of the translated image x_i. In object detection [31, 32], the overlap between a candidate window and the ground-truth window is used to measure the similarity between x_i and x. Specifically, the positive samples are defined by the ground-truth object windows, and the negative samples are defined by those whose overlap with the ground truth is below a lower threshold. In the proposed discriminative tracking model, we need to set upper and lower thresholds for assigning binary labels. In Section IV, we determine the optimal upper and lower thresholds for SCF, MSCF and KSCF respectively with experiments.

In this work, we use the following confidence map of object position [8] to define the class labels:

m(p) = κ exp(−(‖p − p*‖ / σ)^β),

where κ is a normalization constant, and σ and β are the scale and shape parameters, respectively. With the confidence map, we define the class labels as follows:

y_i = +1 if m(p_i) ≥ θ_u,  y_i = −1 if m(p_i) ≤ θ_l,  and x_i is left unlabeled otherwise,   (5)

where θ_l and θ_u are the lower and upper thresholds, respectively. With this formulation, we can use the circulant matrix formed by all samples to improve training efficiency, and discard any samples that are not labeled.
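As an illustration of the label assignment in (5), the sketch below generates labels and a mask of discarded samples for the cyclic shifts of a 1-D sample; the Gaussian-shaped confidence map and the particular threshold values are assumptions chosen only for the example, not the parameters used in the paper.

```python
import numpy as np

def assign_labels(n, sigma, beta, theta_l, theta_u):
    """Assign {+1, -1, unlabeled} to the n cyclic shifts of the base sample."""
    # Distance of each shift from the target position (shift 0), wrapped around.
    d = np.minimum(np.arange(n), n - np.arange(n)).astype(float)
    conf = np.exp(-(d / sigma) ** beta)        # confidence map m(p), kappa = 1
    y = np.zeros(n)                            # 0 marks a discarded sample
    y[conf >= theta_u] = 1.0                   # confident positives
    y[conf <= theta_l] = -1.0                  # confident negatives
    mask = (y != 0).astype(float)              # 1 for labeled, 0 for discarded
    return y, mask

y, mask = assign_labels(n=256, sigma=10.0, beta=2.0, theta_l=0.3, theta_u=0.7)
```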

Fig. 2: Differences between the proposed SCF model and existing CF approaches [5, 7, 8]. (a) Existing CF-based models are designed to learn correlation filters that make the actual output close to the predefined confidence map. (b) The SCF model aims to learn a support correlation filter together with the bias for distinguishing a target object from the background based on the max-margin principle. The peak value in the right response map of (b) locates the target object well.

Comparisons with existing CF-based trackers.

As illustrated in Fig. 2(a), existing CF-based trackers generally follow the ridge regression model. That is, with the continuous confidence map m, CF-based trackers seek the optimal correlation filter by minimizing the mean squared error (MSE) between the predefined confidence map and the actual output,

min_w ‖Xw − m‖² + λ‖w‖²,   (6)

which has the closed form solution

ŵ = (x̂ ⊙ m̂) / (x̂* ⊙ x̂ + λ).   (7)

As shown in Fig. 2(b), the proposed model aims to learn a max-margin SVM classifier to distinguish the object of interest from the background. Using the label assignment scheme in (5), we can discard uncertain samples in training to alleviate the label ambiguity problem. The importance of the max-margin classifier and of handling label ambiguity has been demonstrated in object detection [31]. The proposed model copes with both issues (classification and label ambiguity) for effective visual tracking.
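For reference, the ridge-regression baseline of (6)-(7) has a one-shot closed-form solution; the snippet below is a minimal 1-D sketch of that baseline (the regularization weight and the toy confidence map are arbitrary choices), using the same correlation convention as (3).

```python
import numpy as np

def ridge_cf(x, m, lam=1e-2):
    """Closed-form filter of (7): w_hat = x_hat * m_hat / (|x_hat|^2 + lam)."""
    x_hat, m_hat = np.fft.fft(x), np.fft.fft(m)
    return x_hat * m_hat / (np.conj(x_hat) * x_hat + lam)

# Toy 1-D example: a base sample and a peaked confidence map.
x = np.random.default_rng(0).standard_normal(128)
d = np.minimum(np.arange(128), 128 - np.arange(128)).astype(float)
m = np.exp(-0.5 * (d / 4.0) ** 2)
w_hat = ridge_cf(x, m)
f = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * w_hat))   # responses on all shifts
```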

III-B Alternating optimization

In this section, we reformulate the model in (4) and propose an alternating optimization algorithm to learn SCFs efficiently. To exploit the property of the circulant matrix for learning SCFs, we introduce an auxiliary variable e ≥ 0 and use the identity max(0, 1 − t)² = min_{e≥0} (1 − t + e)²; the SVM model in (4) is then reformulated as:

min_{w,b,e≥0} (1/2)‖w‖² + (C/2)‖1 − y ⊙ (F⁻¹(x̂* ⊙ ŵ) + b·1) + e‖².   (8)

With this formulation, the subproblem on e has a closed form solution when (w, b) is known, and the subproblem on (w, b) has a closed form solution when e is known. Thus the above model can be efficiently solved using the alternating optimization algorithm by iterating between the following two steps:

Updating e.

Given (w, b), we let f = F⁻¹(x̂* ⊙ ŵ) + b·1, and the subproblem on e becomes:

min_{e≥0} ‖1 − y ⊙ f + e‖².

The subproblem has the closed form solution:

e = max(0, y ⊙ f − 1).   (9)

Updating (w, b).

Given e, we let g = y ⊙ (1 + e), and the subproblem on (w, b) becomes:

min_{w,b} (1/2)‖w‖² + (C/2)‖Xw + b·1 − g‖².

The subproblem on (w, b) is a quadratic programming problem. One feasible solution is to eliminate b and derive the closed form solution for w directly. However, this approach fails to exploit the circulant property of X. Thus, we obtain (w, b) by solving the following system of equations:

(XᵀX + (1/C)·I) w = Xᵀ(g − b·1),   (10)
b = (1/n)·1ᵀ(g − Xw),   (11)

where n is the number of samples. Combining the two equations above and using the property of the DFT that the all-ones vector has a nonzero Fourier coefficient only at the zero frequency, we have

b = ḡ,   (12)

where ḡ is the mean of g. Given b, we use (10) to obtain the closed form solution to w in the Fourier domain, ŵ = (x̂ ⊙ (ĝ − b·1̂)) / (x̂* ⊙ x̂ + 1/C).

Input: training image patch x and class labels y.
Output: support correlation filter w and bias b.
1: Initialize ŵ, b, and e.
2: while not converged do
3:     Compute the responses f = F⁻¹(x̂* ⊙ ŵ) + b·1 (one IDFT).
4:     Update the auxiliary variable e by (9).
5:     Form the regression target g = y ⊙ (1 + e).
6:     Update the bias b by (12).
7:     Update the filter ŵ by (10) in the Fourier domain (one DFT).
8:     Check the stopping criterion (13).
9: end while
Algorithm 1 SCF model training

As illustrated in Fig. 1, when the t-th frame with class labels y arrives, the proposed algorithm learns support correlation filters by iterating between updating e and updating (w, b) until convergence. Given (w, b), the update of e can be computed element-wise, which has a complexity of O(n). Given e, the complexity of updating b is O(n) and that of updating ŵ is O(n log n) due to the DFT and IDFT. Thus, the complexity is O(n log n) per iteration, which makes our algorithm efficient in learning support correlation filters. The main steps of the proposed learning algorithm for support correlation filters are summarized in Algorithm 1.
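The following is a minimal 1-D NumPy sketch of Algorithm 1 under the reconstruction given above. It is not the authors' implementation: the way discarded (unlabeled) shifts are handled, the zero initialization, and the relative-change stopping test (in place of the criterion (13)) are simplifying assumptions.

```python
import numpy as np

def train_scf(x, y, mask, C=100.0, n_iter=50, tol=1e-4):
    """DFT-based alternating optimization for the SCF model (sketch).

    x    : 1-D base sample of length n
    y    : labels in {+1, -1} for each cyclic shift (arbitrary where mask == 0)
    mask : 1 for labeled shifts, 0 for discarded ones
    """
    n = x.size
    x_hat = np.fft.fft(x)
    w_hat = np.zeros(n, dtype=complex)
    b = 0.0
    for _ in range(n_iter):
        # Responses on all cyclic shifts: f = X w + b 1, one IDFT.
        f = np.real(np.fft.ifft(np.conj(x_hat) * w_hat)) + b
        # Auxiliary-variable step (9): e = max(0, y * f - 1) on labeled shifts.
        e = np.maximum(0.0, y * f - 1.0) * mask
        # Regression target g = y * (1 + e); discarded shifts keep their current
        # response so they contribute nothing to the update (assumption).
        g = np.where(mask > 0, y * (1.0 + e), f)
        # Bias step (12): b equals the mean of the regression target g.
        b = np.mean(g)
        # Filter step (10) in the Fourier domain, one DFT.
        g_hat = np.fft.fft(g - b)
        w_hat_new = x_hat * g_hat / (np.conj(x_hat) * x_hat + 1.0 / C)
        # Simple relative-change test used here instead of criterion (13).
        change = np.linalg.norm(w_hat_new - w_hat) / max(np.linalg.norm(w_hat_new), 1e-12)
        w_hat = w_hat_new
        if change < tol:
            break
    return w_hat, b
```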

Convergence.

The proposed algorithm converges to the global optimum at a linear rate. For presentation clarity, we give the detailed analysis and proofs of its optimality condition, global convergence, and convergence rate in Appendix A. Based on the optimality condition, we define a residual that measures how far the current iterate is from satisfying it,

and adopt the following stopping criterion:

(13)

Comparisons with MMCF [41].

The proposed SCF model and learning algorithm differ from the MMCF approach in three aspects. First, the training samples for MMCF are a set of separately collected images, while those for SCF are all translated versions of the base sample; we exploit the circulant property of the resulting data matrix to develop an efficient learning algorithm. Second, we propose an alternating optimization algorithm to solve the proposed model, which has a complexity of O(n log n) per iteration. In contrast, the MMCF method adopts the conventional SMO algorithm, whose complexity grows at least quadratically with the number of training samples and linearly with the sample dimension; for the dense sampling considered in this work, this is computationally prohibitive for real-time applications. Third, the proposed model uses the squared hinge loss and an ℓ2 regularizer, while the MMCF method adopts the hinge loss and includes an extra average correlation energy term.

III-C Multi-channel SCF

Different local descriptors, e.g., color attributes, HOG, and SIFT [52, 42, 53], provide rich image features for effective visual tracking. We treat local descriptors as multi-channel images where multiple measurements are associated with each pixel. To exploit multi-dimensional features, we propose the multi-channel SCF as follows:

min_{w,b} (1/2) Σ_{l=1}^{L} ‖w^(l)‖² + (C/2) ‖max(0, 1 − y ⊙ (Σ_{l=1}^{L} F⁻¹(x̂^(l)* ⊙ ŵ^(l)) + b·1))‖²,   (14)

where L is the number of channels, and x^(l) and w^(l) denote the l-th channel of the image and correlation filter, respectively. To learn the proposed MSCF model, we adopt the same equations for updating e and b, and compute w by solving the following problem:

min_w (1/2) Σ_{l=1}^{L} ‖w^(l)‖² + (C/2) ‖Σ_{l=1}^{L} X^(l) w^(l) − g̃‖²,

where g̃ = g − b·1, g = y ⊙ (1 + e), and X^(l) is the circulant matrix generated by x^(l).

Let w = [w^(1); …; w^(L)] and X = [X^(1), …, X^(L)]. The closed form solution for w can be directly obtained by

w = (XᵀX + (1/C)·I)⁻¹ Xᵀ g̃,   (15)

where I is the identity matrix. Note that XᵀX is an nL × nL matrix, so it is not practical to compute its inverse to update w directly. As in multi-channel correlation filters, XᵀX has a block structure that becomes diagonal in the Fourier domain: the k-th frequency component of Σ_l X^(l) w^(l) depends only on the k-th frequency components x̂^(l)_k and ŵ^(l)_k. Thus, the subproblem on w can be further decomposed into n systems of L equations:

(ŝ_k ŝ_kᴴ + (1/C)·I_L) ẑ_k = ŝ_k (g̃̂)_k,  k = 1, …, n,   (16)

where ŝ_k = [x̂^(1)_k, …, x̂^(L)_k]ᵀ and ẑ_k = [ŵ^(1)_k, …, ŵ^(L)_k]ᵀ.

In [43], Galoogahi et al. solve these systems of equations with a complexity of O(nL³). We note that the matrix on the left hand side of (16) is the sum of a rank-one matrix and a scaled identity matrix. Based on the Sherman-Morrison formula [54], we have

(ŝ_k ŝ_kᴴ + (1/C)·I_L)⁻¹ = C·I_L − C² ŝ_k ŝ_kᴴ / (1 + C ŝ_kᴴ ŝ_k).

The closed form solution for ẑ_k is then obtained by

ẑ_k = C (g̃̂)_k ŝ_k / (1 + C ŝ_kᴴ ŝ_k).   (17)

It should be noted that all the denominators 1 + C ŝ_kᴴ ŝ_k can be pre-computed with a complexity of O(nL). As such, the proposed algorithm only involves one DFT, one IDFT and several element-wise operations per iteration, and the complexity is O(nL + n log n) per iteration.
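The per-frequency solves of (16)-(17) can be vectorized over all frequencies; the sketch below (an illustration of the Sherman-Morrison step under the reconstruction above, not the authors' code) never forms an L x L matrix explicitly.

```python
import numpy as np

def mscf_filter_update(X_hat, g_hat, C=100.0):
    """Multi-channel filter update, per frequency, via Sherman-Morrison (sketch).

    X_hat : (L, n) complex array, DFTs of the feature channels of the base sample
    g_hat : (n,)   complex array, DFT of the (bias-subtracted) regression target
    Solves (s s^H + (1/C) I) z = s * g_hat independently at each frequency,
    where s is the L-vector of channel spectra at that frequency.
    """
    R_hat = X_hat * g_hat                                  # right-hand sides, (L, n)
    sHs = np.sum(np.conj(X_hat) * X_hat, axis=0)           # s^H s for each frequency
    sHr = np.sum(np.conj(X_hat) * R_hat, axis=0)           # s^H r for each frequency
    # (s s^H + I/C)^(-1) r = C r - C^2 s (s^H r) / (1 + C s^H s)
    return C * R_hat - (C ** 2) * X_hat * (sHr / (1.0 + C * sHs))
```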

Algorithms     MSCF                                     DCF [6]
Features       Raw pixels   CN     HOG    HOG+CN        Raw pixels   CN     HOG    HOG+CN
Mean DP (%)    64.9         66.3   78.4   80.6          44.4         48.0   71.9   76.2
Mean AUC (%)   44.6         44.9   53.7   55.5          31.2         34.8   50.1   53.2
Mean FPS       76           62     64     54            278          210    292    151
TABLE I: Results of MSCF and DCF [6] with different feature representations.
Algorithms     KSCF                                     KCF [6]
Features       Raw pixels   CN     HOG    HOG+CN        Raw pixels   CN     HOG    HOG+CN
Mean DP (%)    64.4         68.1   79.3   85.0          55.3         57.3   73.2   75.8
Mean AUC (%)   45.3         46.9   53.2   57.5          40.0         41.8   50.7   53.0
Mean FPS       40           37     44     35            154          120    172    102
TABLE II: Results of KSCF and KCF [6] with different feature representations.

III-D Kernelized SCF

Given a kernel function κ(·,·), the proposed SCF model can be extended to learn the nonlinear decision function:

f(z) = Σ_i α_i ⟨φ(x_i), φ(z)⟩ + b = Σ_i α_i κ(x_i, z) + b,

where φ(·) stands for the nonlinear feature mapping implicitly determined by the kernel function κ, and α is the coefficient vector to be learned.

Denote by K the kernel matrix with K_ij = κ(x_i, x_j). As noted in [7], for some kernel functions (e.g., Gaussian RBF and polynomial) which are permutation invariant, the kernel matrix is circulant. Let k be the first row of the circulant matrix K. Therefore, the matrix-vector multiplication Kα can be efficiently computed via the DFT:

Kα = F⁻¹(k̂* ⊙ α̂),   (18)

and we have

f(X) = Kα + b·1 = F⁻¹(k̂* ⊙ α̂) + b·1.   (19)

Based on (18) and (19), the proposed kernelized SCF model is formulated as

min_{α,b} (1/2) αᵀKα + (C/2) ‖max(0, 1 − y ⊙ (Kα + b·1))‖².   (20)

To learn KSCF, we use the alternating optimization method by iteratively solving for e and for (α, b). The solution of the subproblem on e is similar to that in the SCF model, and we update α and b using the closed form solution of kernel ridge regression. Based on the representer theorem [55], the optimal solution in the kernel space can be expressed as a linear combination of the feature maps of the samples: w = Σ_i α_i φ(x_i). Namely, only the coefficient vector α needs to be learned. In [55], the solution to kernel ridge regression in the dual space is given by

α = (K + λI)⁻¹ y.

Thus, the closed form solution to our sub-problem on α can be formulated as

α = (K + (1/C)·I)⁻¹ (g − b·1),

where g = y ⊙ (1 + e) and 1 denotes a vector of ones. As the kernel matrix K is circulant and can be diagonalized by the DFT, the optimal solution of α in the Fourier domain can be computed by

α̂ = (ĝ − b·1̂) ⊘ (k̂^{xx} + 1/C),   (21)

where ⊘ denotes element-wise division and k̂^{xx}, the DFT of the kernel correlation of x with itself, is known as the kernel auto-correlation.

For image features with L channels, the complexity of computing the kernel auto-correlation (and hence the circulant kernel matrix) is O(nL log n). After that, the learning process only requires element-wise operations, one DFT and one IDFT per iteration, so the per-iteration complexity is O(n log n). Thus, the proposed KSCF model leverages rich features and nonlinear filters without significantly increasing the computational load.
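A common way to obtain the kernel auto-correlation k^{xx} for the Gaussian RBF kernel is the FFT-based kernel correlation of [7]; the sketch below assumes that construction and then applies the closed form (21). The function and parameter names are illustrative only.

```python
import numpy as np

def gaussian_kernel_autocorrelation(X, sigma):
    """First row k of the circulant Gaussian kernel matrix K (1-D, multi-channel).

    X : (L, n) array holding the L feature channels of the base sample.
    Returns k with k[i] = exp(-||x - x_i||^2 / sigma^2), x_i the i-th cyclic shift.
    """
    X_hat = np.fft.fft(X, axis=1)
    # <x, x_i> for all shifts i at once: correlation via the FFT, summed over channels.
    xcorr = np.real(np.fft.ifft(np.sum(np.conj(X_hat) * X_hat, axis=0)))
    xx = np.sum(X * X)                                     # ||x||^2
    return np.exp(-np.maximum(0.0, 2.0 * xx - 2.0 * xcorr) / (sigma ** 2))

def kscf_alpha(k, g, b, C=100.0):
    """Closed form (21): alpha_hat = DFT(g - b) / (DFT(k) + 1/C)."""
    return np.fft.fft(g - b) / (np.fft.fft(k) + 1.0 / C)
```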

Furthermore, to handle large scale changes, we develop the SKSCF model by maintaining a scaling pool in a way similar to the scale-adaptive CF scheme [12], and bilinear interpolation is used to resize samples across scales.

IV Performance Evaluation

We use the benchmark dataset and protocols [10] to evaluate the proposed SCF algorithms. First, we evaluate several variants of the proposed method, i.e., SCF, MSCF, KSCF, and SKSCF, to analyze the effect of feature representations and kernel functions. Next, comprehensive experiments are conducted to compare the proposed methods with other CF-based trackers. Finally, the KSCF and SKSCF algorithms are compared with existing SVM-based and the state-of-the-art methods. The tracking results can be found at http://faculty.ucmerced.edu/project/scf/ and the source code will be made available to the public.

IV-A Experimental setup

Datasets and evaluated tracking methods.

To assess the performance of the proposed methods, experiments are carried out on a benchmark dataset [10] of 50 challenging image sequences annotated with 11 attributes. For the first frame of each sequence, the bounding box of the target object is provided for fair comparisons. For comprehensive comparisons, we evaluate the baseline SCF, multi-channel SCF, kernelized SCF and SKSCF methods. The SCF method operates in the linear space on raw pixels, while the multi-channel features of MSCF are based on HOG [52] and color names (CN) [42]. The KSCF and SKSCF algorithms are evaluated using the Gaussian kernel on multi-channel feature representations. Furthermore, we compare the proposed trackers with other trackers based on correlation filters (e.g., MOSSE [5], CSK [7], KCF [6], DCF [6], STC [8] and CN [42]), existing SVM-based trackers (e.g., Struck [3] and MEEM [4]), and other state-of-the-art methods (e.g., TGPR [56], SCM [16], TLD [27], L1APG [57], MIL [23], ASLA [20] and CT [58]).

Evaluation protocols.

We use the one-pass evaluation (OPE) protocol [10], which reports the precision and success plots based on the center position error and bounding box overlap metrics with respect to the ground truth object locations. For precision plots, the distance precision (DP) at a threshold of 20 pixels is reported. For success plots, the area under the curve (AUC) is computed. In addition, the number of frames per second (FPS) that each method processes is reported.
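For completeness, the two summary numbers reported in the tables can be computed as follows; this is a small sketch of the OPE metrics, not the benchmark toolkit's reference code, and the 101-point threshold sweep for the AUC is an approximation of the protocol.

```python
import numpy as np

def distance_precision(center_errors, threshold=20.0):
    """Fraction of frames whose center location error is within `threshold` pixels."""
    center_errors = np.asarray(center_errors, dtype=float)
    return np.mean(center_errors <= threshold)

def success_auc(overlaps, n_steps=101):
    """Area under the success plot: mean success rate over overlap thresholds in [0, 1]."""
    overlaps = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_steps)
    return np.mean([np.mean(overlaps >= t) for t in thresholds])
```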

Parameter settings.

The experiments are carried out on a desktop computer with a 2-core Intel Xeon 3.30 GHz CPU and 32 GB RAM. The proposed SCF-based trackers involve a few model parameters, i.e., the trade-off parameter C, the scale parameter σ and shape parameter β of the confidence map, and the lower and upper thresholds (θ_l, θ_u) in (5). In addition, the KSCF method has one extra parameter for the Gaussian RBF kernel function, and SKSCF contains a scaling pool parameter. For online tracking, the model is updated by linear interpolation with an adaptation rate [10].

In all experiments, the model parameters are fixed for each SCF-based tracker: the trade-off parameter C and the shape parameter β are shared by all SCF-based trackers; the thresholds (θ_l, θ_u) in (5) are set separately for SCF, for MSCF, and for KSCF/SKSCF; the scale parameter σ is adaptive to the size of each target object; the scaling pool and the kernel parameter of KSCF are fixed; and the adaptation rate is set separately for raw pixel and multi-channel features. As for the HOG parameters, the number of orientations and the cell size are set to 9 and 4, respectively.

IV-B Evaluation on SCF-based trackers

In this section, we first evaluate the effect of feature representations and kernel functions, and then compare four variants of the SCF-based trackers, i.e., SCF, MSCF, KSCF, and SKSCF, in terms of both accuracy and efficiency. The results of the corresponding CF-based trackers are also reported for all SCF-based methods.

Kernels        Linear   Polynomial   Gaussian
Mean DP (%)    82.0     84.2         85.0
Mean AUC (%)   56.2     57.1         57.5
Mean FPS       94       55           35
TABLE III: Results of KSCF with different kernels.
Fig. 3: OPE plots of the MSCF and DCF [6] methods with different feature representations. The AUC values are shown next to the legends.

We consider three typical feature representations, i.e., raw pixels, HOG features [52], and color names (CN) [42]. The results of the MSCF and KSCF methods are listed in Table I and Table II. For each feature representation we report the result obtained with the best-performing setting of the trade-off parameter and the label-assignment thresholds; these parameters are then fixed for all the following experiments. For KSCF, the Gaussian RBF kernel is adopted.

The OPE plots of MSCF with the linear DCF [6] and of KSCF with the nonlinear KCF [6] are shown in Fig. 3 and Fig. 4. Compared with raw pixels and color features, the HOG representation significantly improves the tracking performance in terms of mean DP and mean AUC. For MSCF, the implementations using color names and HOG features outperform raw pixels by 1.4% and 13.5% in terms of mean DP, respectively. For KSCF, the trackers using color names and HOG features outperform raw pixels by 3.7% and 14.9% in terms of mean DP, respectively.

Fig. 4: OPE plots of the KSCF and KCF [6] methods with different feature representations.
Fig. 5: OPE plots of the SCF methods (i.e., SCF, MSCF, KSCF, and SKSCF) and other CF-based trackers (i.e., MOSSE [5], CSK [7], DCF [6], KCF [6], STC [8], CN [42], DSST [11] and SAMF [12]).

The MSCF tracker with the combination of color names and HOG features is further improved to 80.6% in terms of DP. Similarly, the performance of the KSCF method is improved to 85.0% in terms of DP with the use of color names and HOG features. Compared with the DCF [6] and KCF [6] methods, the proposed MSCF and KSCF algorithms achieve higher DP and AUC values for each feature representation. Tables I and II show that both the KSCF and MSCF methods run in real time even with the representation based on HOG and CN features.

Algorithms     Mean DP (%)   Mean AUC (%)   Mean FPS
SKSCF          87.4          62.3           8
KSCF           85.0          57.5           35
MSCF           80.6          55.5           54
SCF            62.8          48.9           76
SAMF [12]      77.1          56.5           14
DSST [11]      74.8          56.3           30
KCF [6]        73.2          50.7           172
DCF [6]        71.9          50.1           292
CN [42]        63.7          44.9           79
STC [8]        58.6          37.4           557
CSK [7]        55.8          40.6           151
MOSSE [5]      44.4          31.3           421
TABLE IV: Performance of tracking methods based on correlation filters. The top three results are shown in red, blue and orange.
Algorithms     Mean DP (%)   Mean AUC (%)   Mean FPS
SKSCF          87.4          62.3           8
KSCF           85.0          57.5           35
MEEM [4]       83.3          57.2           10
Struck [3]     67.4          48.6           10
TABLE V: Comparison of SVM-based trackers.
Fig. 6: Screenshots of tracking results on eight challenging benchmark sequences: (a) Bolt, (b) Singer2, (c) Coke, (d) David3, (e) Suv, (f) Tiger2, (g) Football1, and (h) Jumping. For the sake of clarity, we only show the results of six trackers, i.e., KSCF, KCF [6], MEEM [4], TGPR [56], Struck [3] and SCM [16].

We evaluate the effect of kernel functions on KSCF using HOG and CN features, including the linear, polynomial, and Gaussian RBF kernels, with the degree of the polynomial kernel and the bandwidth of the Gaussian RBF kernel held fixed. Table III shows the results of KSCF with different kernels. Clearly the KSCF method with a nonlinear kernel outperforms the one with a linear kernel in terms of mean DP and mean AUC, and the one with the Gaussian RBF kernel achieves the best performance.

We implement the SKSCF method by extending KSCF with the Gaussian RBF kernel, and compare four variants of the SCF-based trackers, i.e., SCF, MSCF, KSCF, and SKSCF. Table IV shows the results of four SCF-based trackers, where the SKSCF method performs best, followed by the KSCF approach. On the other hand, the KSCF method is more efficient than the SKSCF approach. In the following experiments, we compare both KSCF and SKSCF methods with the other schemes based on correlation filters, SVMs, and other state-of-the-art tracking approaches.

IV-C Comparisons with CF-based trackers

We use the tracking benchmark dataset [10] to evaluate the proposed SCF-based algorithm against existing CF-based methods including MOSSE [5], CSK [7], KCF [6], DCF [6], STC [8], CN [42], DSST [11] and SAMF [12].

Algorithms     Mean DP (%)   Mean AUC (%)   Mean FPS
SKSCF          87.4          62.3           8
KSCF           85.0          57.5           35
MEEM [4]       83.3          57.2           10
KCF [6]        73.2          50.7           172
TGPR [56]      71.8          51.1           0.5
SCM [16]       65.2          50.1           1
TLD [27]       60.6          43.4           22
ASLA [20]      54.5          44.2           8
L1APG [57]     49.4          38.6           3
MIL [23]       48.8          36.9           28
CT [58]        41.5          30.8           39
TABLE VI: Comparison of the KSCF and SKSCF methods with the state-of-the-art trackers. The top three results are shown in red, blue and orange.
Fig. 7: OPE plots of the KSCF, SKSCF, DSST [11] and SAMF [12] methods on sequences with large scale variation.
Fig. 8: OPE plots of the KSCF, SKSCF and other SVM-based trackers, including MEEM [4] and Struck [3].

Classic correlation filters.

Fig. 5 shows the OPE plots of these trackers. The SCF, MOSSE [5], CSK [7] and STC [8] methods operate on raw pixels in the linear space. We note that the MOSSE method adopts the ridge regression function while the SCF algorithm uses the max-margin model. Although the CSK and STC methods operate on raw pixels, the CSK method is a kernelized CF-based tracker and the STC approach is a scale-adaptive tracking method. Overall, the SCF algorithm performs favorably against these CF-based methods based on regression and nonlinear functions.

Multi-channel correlation filters.

The MSCF, CN [42], and DCF [6] methods are based on correlation filters using multi-channel features. The DCF method is based on HOG features and the CN approach is operated on color attributes, while the MSCF scheme uses the combination of HOG and color representations. Fig. 5 shows that the MSCF method performs well among these three trackers based on correlation filters.

Kernelized correlation filters.

The KSCF method is compared with the corresponding kernelized KCF [6] and CSK [7] trackers. The CSK and KCF methods are based on raw pixels and HOG features, respectively. As shown in Table IV and Fig. 6, the KSCF method based on HOG and CN features performs favorably against the KCF and CSK approaches.


Fig. 9: Precision and success metrics of four top-performing trackers for the 11 attributes.
Fig. 10: OPE plots of the KSCF, SKSCF and other state-of-the-art trackers, including MEEM [4], TGPR [56], KCF [6], SCM [16], TLD [27], ASLA [20], L1APG [57], MIL [23] and CT [58].

Scale-adaptive correlation filters.

The KSCF and SKSCF methods are evaluated against three scale-adaptive trackers: STC [8], DSST [11] and SAMF [12]. We note that the DSST [11] and SAMF [12] methods have been shown to be the best and second-best performers in a recent tracking benchmark evaluation [59]. Both the KSCF and SKSCF trackers perform significantly better than the STC method, and they also outperform the DSST and SAMF approaches by a large margin. Fig. 7 shows the OPE plots on all the sequences with the scale variation attribute, where the KSCF method performs favorably against the DSST and SAMF trackers. Overall, the KSCF algorithm performs favorably in terms of accuracy and speed.

Attributes FM BC MB DEF IV IPR LR OCC OPR OV SV
SKSCF 0.779 0.859 0.802 0.893 0.841 0.810 0.596 0.872 0.857 0.800 0.809
KSCF 0.680 0.825 0.761 0.854 0.805 0.816 0.555 0.852 0.836 0.697 0.768
MEEM [4] 0.745 0.802 0.721 0.856 0.771 0.796 0.529 0.801 0.840 0.726 0.795
TGPR [56] 0.579 0.763 0.570 0.760 0.695 0.683 0.567 0.668 0.693 0.535 0.637
KCF [6] 0.564 0.752 0.599 0.747 0.687 0.692 0.379 0.735 0.718 0.589 0.680
SCM [16] 0.346 0.578 0.358 0.589 0.613 0.613 0.305 0.646 0.621 0.429 0.672
TLD [27] 0.557 0.428 0.523 0.495 0.540 0.588 0.349 0.556 0.593 0.576 0.606
ASLA [20] 0.255 0.496 0.283 0.473 0.529 0.521 0.156 0.479 0.535 0.333 0.552
L1APG [57] 0.367 0.425 0.379 0.398 0.341 0.524 0.460 0.475 0.490 0.329 0.472
MIL [23] 0.415 0.456 0.381 0.493 0.359 0.465 0.171 0.448 0.484 0.393 0.471
CT [58] 0.330 0.339 0.314 0.463 0.365 0.361 0.152 0.429 0.405 0.336 0.448
TABLE VII: Precision metrics of the trackers for 11 attributes. The top three results are shown in red, blue and orange.
Attributes FM BC MB DEF IV IPR LR OCC OPR OV SV
SKSCF 0.729 0.795 0.757 0.863 0.743 0.720 0.542 0.788 0.757 0.808 0.682
KSCF 0.629 0.741 0.689 0.779 0.649 0.690 0.389 0.696 0.697 0.705 0.540
MEEM [4] 0.706 0.747 0.692 0.711 0.653 0.648 0.470 0.694 0.694 0.742 0.594
TGPR [56] 0.542 0.713 0.570 0.711 0.632 0.601 0.501 0.592 0.603 0.546 0.505
KCF [6] 0.516 0.669 0.539 0.668 0.534 0.575 0.358 0.593 0.587 0.589 0.477
SCM [16] 0.348 0.550 0.358 0.566 0.586 0.574 0.308 0.602 0.576 0.449 0.635
TLD [27] 0.475 0.388 0.485 0.434 0.461 0.477 0.327 0.455 0.489 0.516 0.494
ASLA [20] 0.261 0.468 0.284 0.485 0.514 0.496 0.163 0.469 0.509 0.359 0.544
L1APG [57] 0.359 0.404 0.363 0.398 0.298 0.445 0.458 0.437 0.423 0.341 0.407
MIL [23] 0.353 0.414 0.261 0.440 0.300 0.339 0.157 0.378 0.369 0.416 0.335
CT [58] 0.327 0.323 0.262 0.420 0.308 0.290 0.143 0.360 0.325 0.405 0.342
TABLE VIII: Success metrics of the trackers for 11 attributes. The top three results are shown in red, blue and orange.

IV-D Comparisons with SVM-based trackers

We evaluate the proposed KSCF and SKSCF methods against two state-of-the-art SVM-based methods, i.e., Struck [3] and MEEM [4], which are based on structured learning and ensemble learning, respectively. Table V and Fig. 8 show that both the KSCF and SKSCF algorithms perform favorably against the MEEM and Struck methods in all aspects. As shown in Fig. 6, the KSCF algorithm tracks the target objects more precisely than the other methods in the Singer2, Coke, Suv and Tiger2 sequences, while the other trackers tend to drift away from the target objects. These results show that dense sampling can be efficiently combined with SVMs for effective visual tracking.

IV-E Comparisons with state-of-the-art trackers

We evaluate the KSCF algorithm with the other state-of-the-art trackers, including MEEM [4], KCF [6], TGPR [56], SCM [16], TLD [27], L1APG [57], MIL [23], ASLA [20] and CT [58]. Fig. 10 shows the OPE plots, and Table VI presents the mean DP, AUC and FPS. Overall, the proposed KSCF and SKSCF algorithms perform favorably against the state-of-the-art methods including the TLD, SCM, TGPR and MEEM schemes.

The sequences in the benchmark dataset [10] are annotated with 11 challenging factors for visual tracking, including illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR). Table VII and Table VIII show the performance of the KSCF and state-of-the-art methods in terms of DP and AUC with respect to each factor. Fig. 9 shows the precision and success metrics of the leading trackers (i.e., SKSCF, KSCF, MEEM, KCF and TGPR) with respect to these attributes. We note that the MEEM method [4] adopts a multiple-experts framework to deal with model drift, and performs slightly better than KSCF for the FM, LR, OV and SV attributes. Overall, the KSCF algorithm is among the top three trackers for every attribute, and the SKSCF algorithm performs best in both metrics for all but one attribute.

V Conclusions

We propose an effective and efficient approach to learn support correlation filters for real-time visual tracking. By reformulating the SVM model with a circulant data matrix as training input, we present a DFT-based alternating optimization algorithm to learn support correlation filters efficiently. In addition, we develop the MSCF, KSCF, and SKSCF tracking methods to exploit multi-dimensional features, nonlinear classifiers, and scale-adaptive schemes. Experiments on a large benchmark dataset show that the proposed KSCF and SKSCF algorithms perform favorably against the state-of-the-art tracking methods in terms of accuracy and speed.

Acknowledgments

This work is supported in part by NSFC grant (61271093), the program of ministry of education for new century excellent talents (NCET-12-0150), NSF CAREER Grant (No. 1149783) and NSF IIS Grant (No. 1152576).

Appendix A Convergence analysis

In the following, we first analyze the optimality condition of the problem, and then prove the global convergence and convergence rate of the SCF algorithm.

A-A Optimality conditions

In the spatial domain, the SCF model can be expressed as:

Defining the augmented vector with , we compute the augmented weight vector . The above problem can then be reformulated as:

(22)

where and . We introduce an indicator function and the subdifferential [60] of is:

(23)

As the loss function (22) is convex, is a solution if and only if the subdifferential of the loss at contains zero [61]. Thus the optimality conditions are:

(24)

where denotes the i-th training sample. With , we have:

where with . Thus the matrix is invertible. For simplicity, let , from (24) and above equation, we have

(25)
(26)

Based on the optimality conditions in (24), we define

and use the stopping criterion:

where is a predefined threshold.

A-B Global convergence

To compute , we reformulate the sub-problem for each entry:

where . Its solution is given by:

Proposition 1.

For any , we have:

where the equality holds only if .

Proof.
  1. if , , and we also have .

  2. if , , where the equality holds only if .

  3. if , e.g., , it is easy to see that, .

For simplicity, let . We have and then we get two symmetric positive definite matrices as follows:

where and is the spectral radius of matrix [62].

With the definitions of and , the updating rules and can be written as:

Let , we have the following proposition.

Proposition 2.

For any , the following inequality holds:

and the equality holds if and only if .

Proof.

Note that . From the definition of , we have:

Denote the eigen-decomposition of by , where is a full rank orthogonal matrix, and is a diagonal matrix with .

The equality can be written as . Since is full-rank orthogonal, there is . Thus, we have . Since is diagonal with , it holds that . Multiplying both sides by , we have