Robust Structured Multi-task Multi-view Sparse Tracking
Sparse representation is a viable solution to visual tracking. In this paper, we propose a structured multi-task multi-view tracking (SMTMVT) method, which exploits the sparse appearance model in the particle filter framework to track targets under different challenges. Specifically, we extract features of the target candidates from different views and sparsely represent them by a linear combination of templates of different views. Unlike conventional sparse trackers, SMTMVT not only jointly considers the relationship between different tasks and different views but also retains the structures among different views in a robust multi-task multi-view formulation. We introduce a numerical algorithm based on the proximal gradient method to quickly and effectively find the sparsity by dividing the optimization problem into two subproblems with closed-form solutions. Both qualitative and quantitative evaluations on a benchmark of challenging image sequences demonstrate the superior performance of the proposed tracker against various state-of-the-art trackers.
Mohammadreza Javanmardi, Xiaojun Qi
Department of Computer Science, Utah State University, Logan, UT 84322, USA
Index Terms— Sparse Representation, Particle Filter, Convex Optimization, Proximal Gradient
1 Introduction
Visual tracking is the process of estimating the states of a moving target in a dynamic frame sequence. It is considered one of the most important and challenging topics in computer vision and has abundant applications in surveillance, human motion analysis, smart vehicle transportation, navigation, etc. Although numerous methods [1, 2, 3] have been introduced in recent years, it is still challenging to develop a robust tracking algorithm due to occlusion, illumination variations, deformation, camera motion, background clutter, etc.
Tracking algorithms can be classified into two categories: discriminative and generative. Discriminative approaches formulate a decision boundary to separate the target from the background. For example, Avidan  proposes the ensemble tracking method, which combines a set of weak classifiers into a strong one to label each candidate as target or background. Grabner et al.  propose semi-supervised boosting to alleviate the drifting problem in tracking. Babenko et al.  use a large number of positive and negative bags consisting of image patches to update a multiple instance learning-based appearance model. In contrast, generative approaches adopt a model to represent the target and formulate tracking as a model-based search procedure to find the region most similar to the target. Black et al.  formulate an eigenspace model to represent the target and employ a coarse-to-fine matching strategy to track the target over image sequences. Adam et al.  propose the FragTrack algorithm, which represents a template object using multiple arbitrary image patches and creates a vote map using the integral histogram. Ross et al.  learn a low-dimensional subspace representation of the target to track objects.
Sparse representation based trackers (sparse trackers) are considered generative tracking methods since they express features of a target as a sparse linear combination of a template set. Based on the number of employed features, sparse trackers are further classified into single-view and multi-view. Single-view sparse tracking approaches represent one feature (e.g., pixel intensity) of a target region using a set of templates. Mei et al.  propose the L1T tracker, which represents the intensity of each target candidate by a set of templates and finds the sparsity by solving an $\ell_1$ minimization problem. Li et al.  propose to adopt the orthogonal matching pursuit (OMP) algorithm  to reduce the complexity of solving the $\ell_1$ minimization problem in the L1T tracker. The modified version of OMP  may be incorporated to further reduce the complexity. Zhang et al.  propose to jointly learn the intensities of all target candidates. These methods adopt a global sparse appearance model to represent a target as a single entity. Therefore, they are less effective in handling large occlusions. To address this problem, Jia et al.  divide each candidate into a set of overlapping patches and represent them by a template set of patches. Zhang et al.  exploit the global and local representations of a target candidate to achieve robust tracking. These methods alleviate the problems associated with occlusion and noise. However, they are sensitive to shape deformation of targets and varied illumination due to the use of intensity values. In contrast, multi-view sparse tracking approaches extract visual features such as color, edge, texture, and histogram to complement the intensity of the target. Exploiting multi-view information has also been widely used in many computer vision tasks such as visual classification, face recognition, and image segmentation [17, 18]. For instance, Zohrizadeh et al.  
employ multiple local features in a non-negative matrix factorization framework to segment an image. Hong et al.  propose a tracker that considers the underlying relationship among different views and particles in terms of least squares (LS). To handle data contaminated by outliers and noise, Mei et al.  use the least absolute deviation (LAD) in their optimization model. However, neither approach retains the underlying layout structure among different views. In other words, different views of a target candidate may be reconstructed by activating the same templates in the dictionary set, yet their representation coefficients may not resemble a similar combination of activated templates.
To address these issues, we propose a novel structured multi-task multi-view tracking (SMTMVT) method to track objects under different challenges. Similar to the tracker of Hong et al. , SMTMVT exploits multi-view information such as intensity, edge, and histogram of target candidates and jointly represents them using templates. However, SMTMVT improves Hong’s tracker by proposing a new optimization model to attain the underlying layout structure among different views and reduce the error corresponding to outlier target candidates. The main contributions of the proposed work are summarized as follows: 1) designing a novel optimization model to effectively utilize a nuclear norm of the sparsity for multi-task multi-view sparse trackers; 2) representing a particular view of a target candidate as an individual task while simultaneously retaining the underlying layout structure among different views; 3) incorporating an outlier minimization term in the optimization model to efficiently reduce the error of outlier target candidates; and 4) adopting the proximal gradient (PG) method to quickly and effectively solve the optimization problem.
The remainder of this paper is organized as follows: Section 2 introduces the notations. Section 3 presents the SMTMVT method together with its optimization model, which is solved by our proposed PG-based numerical algorithm. Section 4 demonstrates the experimental results on 15 publicly available challenging image sequences and the CVPR2013 tracking benchmark and compares the results of the proposed method with several state-of-the-art methods. Section 5 draws the conclusions.
2 Notations
Throughout this paper, we use bold lowercase and bold uppercase letters to denote vectors and matrices, respectively. The sets of real and nonnegative real numbers are denoted by $\mathbb{R}$ and $\mathbb{R}_+$, respectively. The vector $\mathbf{1}$ is a column vector of all ones of an appropriate dimension. For a given matrix, we denote its Frobenius norm and nuclear norm by $\|\cdot\|_F$ and $\|\cdot\|_*$, respectively. The soft-thresholding operator is defined as $\mathcal{S}_\lambda(x)=\operatorname{sign}(x)\max(|x|-\lambda,0)$. For a set $\mathcal{C}$, the indicator function $\delta_{\mathcal{C}}(\mathbf{x})$ returns $0$ when $\mathbf{x}\in\mathcal{C}$ and $+\infty$ when $\mathbf{x}\notin\mathcal{C}$. The proximal operator of a given function $g$ is defined as $\operatorname{prox}_g(\mathbf{v})=\arg\min_{\mathbf{x}}\big(g(\mathbf{x})+\tfrac{1}{2}\|\mathbf{x}-\mathbf{v}\|_2^2\big)$.
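As a concrete illustration of these operators, the following NumPy sketch implements the soft-thresholding operator and the proximal operator of the $\ell_1$ norm (an illustrative sketch; the function names are ours, not the paper's):

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lam(x) = sign(x) * max(|x| - lam, 0), applied elementwise.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l1(v, lam):
    # The proximal operator of g(x) = lam * ||x||_1 reduces to soft thresholding.
    return soft_threshold(v, lam)
```

Soft thresholding shrinks every entry toward zero by lam and zeroes out entries whose magnitude falls below lam, which is what makes it a natural building block for sparse coding.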
3 The Proposed SMTMVT Method
This section provides detailed information about the proposed particle filter based tracker. Specifically, we formulate a sparse appearance model in the proposed SMTMVT and propose a numerical solution to efficiently solve the model.
3.1 Structured Multi-Task Multi-View Tracking (SMTMVT)
The proposed SMTMVT method utilizes the sparse appearance model to exploit multi-task multi-view information in a new optimization model, attain the underlying layout structure among different views, and reduce the error of outlier target candidates. At time , we consider particles with their corresponding image observations (target candidates). Using the state of the -th particle, its observation is obtained by cropping the region of interest around the target. Each observation is considered to have different views. For the -th view, the dimensional feature vectors of all particles are combined to form the feature matrix, and target templates are used to create its target dictionary. Following the same notations in , we use the -th dictionary to represent the -th feature matrix and learn the sparsity. In addition, we divide the reconstruction errors of the -th view into two components as follows:
The first error component corresponds to the minor reconstruction errors resulting from the representation of good target candidates. The second error component corresponds to the significant reconstruction errors resulting from the representation of outlier target candidates. We minimize the Frobenius norm of the first error component, i.e., the square root of the sum of the squared magnitudes of its elements, and minimize a matrix norm of the second error component given by the maximum absolute column sum of its elements. This ensures that the reconstruction errors for both good and outlier target candidates are minimized.
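The two matrix norms used above can be written out explicitly; a minimal NumPy sketch (illustrative, with our own function names):

```python
import numpy as np

def frobenius_norm(E):
    # Square root of the sum of the squared magnitudes of the elements.
    return np.sqrt((E ** 2).sum())

def max_abs_col_sum(E):
    # Maximum absolute column sum, which penalizes the single worst column
    # (e.g., the reconstruction error of one outlier candidate).
    return np.abs(E).sum(axis=0).max()
```

The Frobenius norm spreads the penalty over all entries, while the maximum-column-sum norm concentrates it on the worst candidate, matching the split into minor and outlier errors.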
To boost the performance, we maintain the underlying layout structure among different views. For the -th particle, we not only represent all of its feature vectors by activating the same subset of target templates in the target dictionaries, but also equalize the representation coefficients of the activated templates across all views. In other words, we aim for the corresponding columns of the sparsity matrices of the different views to have a similar representation structure in terms of the activated templates and similar coefficients in terms of the activated values. To do so, we concatenate the per-view sparsity matrices to form the sparsity matrix corresponding to the representation of all views of the observations. We then minimize the nuclear norm of the sub-matrix formed by the columns associated with one target candidate, which is a good surrogate for rank minimization, to encourage these columns to be similar or linearly dependent on each other. The selected columns are the simultaneous columns of the -th target candidate in the different views.
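To see why the nuclear norm encourages the per-view coefficient columns of one candidate to be similar, consider the following NumPy sketch (the matrices are toy examples of ours, not the paper's data):

```python
import numpy as np

def nuclear_norm(X):
    # Sum of singular values: a convex surrogate for rank(X).
    return np.linalg.svd(X, compute_uv=False).sum()

# Two views activating the same templates with equal coefficients
# form a rank-1 (linearly dependent) pair of columns.
aligned = np.array([[0.7, 0.7],
                    [0.3, 0.3]])
# Two views activating different templates break the shared structure.
mismatched = np.array([[0.7, 0.0],
                       [0.3, 1.0]])
```

Minimizing the nuclear norm of a candidate's sub-matrix therefore pushes the solution toward the aligned case, where all views agree on the activated templates and their coefficients.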
We formulate the SMTMVT sparse appearance model as the following optimization problem by jointly evaluating its view matrices with different particles (tasks):
where ’s are vertically stacked to form the bigger matrix , parameter regularizes the nuclear norm of , and parameter controls the sparsity of and .
Finally, we compute the likelihood of the -th candidate as follows:
where denotes the vector of sparse coefficients of the -th candidate corresponding to the target templates in the -th view and is a constant. We select the candidate with the highest likelihood value as the tracking result at frame . Similar to , we update the target templates to handle the appearance changes of the target throughout the frame sequences.
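A hedged sketch of how such a likelihood can be computed is given below. The exponential form and the constant `alpha` are assumptions on our part, chosen to mirror the common practice in sparse trackers of mapping the multi-view reconstruction error to a likelihood; they are not taken verbatim from the paper:

```python
import numpy as np

def candidate_likelihood(features, dictionaries, coeffs, alpha=30.0):
    # Sum the squared reconstruction errors of the candidate over all views,
    # then map the total error to a likelihood with an exponential.
    # `alpha` is an assumed constant controlling how sharply the likelihood
    # decays with the reconstruction error.
    err = sum(np.linalg.norm(y - D @ c) ** 2
              for y, D, c in zip(features, dictionaries, coeffs))
    return np.exp(-alpha * err)
```

A candidate whose features are perfectly reconstructed by its sparse coefficients attains the maximum likelihood of 1.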
3.2 Optimization Algorithm
Since the convex problem in (3) can be split into differentiable and non-differentiable subproblems, we adopt the PG method  to develop a numerical solution to the proposed model. To do so, we cast the differentiable subproblem as follows:
This equation is sub-differentiable with respect to and differentiable with respect to . Hence, two variables and can be updated at time by the following equations:
where the step size controls the convergence rate and denotes the sub-gradient operator. We adopt the computationally efficient PG algorithm to iteratively update the two variables, which are initially set to zero, until they converge to constant matrices. Both subproblems (6a) and (6b) can be easily solved via existing methods. Specifically, (6a) is an $\ell_1$-minimization problem with an analytic solution, which can be obtained using soft thresholding. Moreover, (6b) is a Euclidean norm projection onto the nonnegative orthant, which enjoys a closed-form solution. It should be emphasized that the convergence rate of this numerical algorithm can be further improved by the acceleration techniques presented in .
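As a simplified, single-view, single-task analogue of this scheme, the following sketch runs a proximal-gradient (ISTA-style) loop for an $\ell_1$-regularized least-squares problem, alternating a gradient step on the smooth term with a soft-thresholding proximal step; the closed-form projection onto the nonnegative orthant is also shown. This is our own illustrative reduction, not the paper's full multi-view solver:

```python
import numpy as np

def ista(D, y, lam, step, n_iter=200):
    # Minimize 0.5 * ||y - D w||^2 + lam * ||w||_1 by proximal gradient:
    # a gradient step on the smooth part followed by soft thresholding.
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ w - y)                                  # gradient of smooth term
        v = w - step * grad                                       # gradient step
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)  # proximal step
    return w

def project_nonneg(x):
    # Euclidean projection onto the nonnegative orthant (closed form).
    return np.maximum(x, 0.0)
```

For a well-chosen step size (e.g., the reciprocal of the largest eigenvalue of D^T D), the iterates converge, and Nesterov-style acceleration can further speed up the loop.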
4 Experimental Results
In this section, we evaluate the performance of the proposed method on 15 publicly available frame sequences and the CVPR2013 tracking benchmark data set .
To ensure a fair comparison, we employ the four popular features used in [19, 20] in the proposed SMTMVT method: intensity, color histogram, histograms of oriented gradients (HOG) , and local binary patterns (LBP) . In addition, we apply a simple but effective illumination normalization method  before feature extraction to eliminate the effect of illumination and improve the quality and discriminative power of the features. Following the same settings in [19, 20], we set the size of the intensity template to one third of the size of the initial target, or one half of its size when its shorter side is less than 20 pixels. For all the experiments, we set , , , the number of particles , and the number of target templates .
4.1 Experiments on Publicly Available Sequences
We extensively conduct experiments on 15 challenging frame sequences and follow the same settings as in [19, 20] to resize all frames to 320×240. We compare the proposed SMTMVT method with eight state-of-the-art tracking methods, namely, the L1 tracker (L1T) , multi-task tracking (MTT) , Struck , tracking with multiple instance learning (MIL) , incremental learning for visual tracking (IVT) , visual tracking decomposition (VTD) , multi-task multi-view tracking with least squares (MTMVTLS) , and multi-task multi-view tracking with least absolute deviation (MTMVTLAD) . We use the publicly available source code or binary code provided by the authors, with the default parameters for initialization, to produce the tracking results.
Fig. 1 demonstrates the tracking results of all compared methods on two representative frames for each of the 15 sequences. In the david1, david2, girl, faceocc2, fleetface, and jumping sequences, the task is to track human faces under occlusion and scale variations. Take the girl sequence as an example: IVT drifts from the target because of appearance changes, while MIL and VTD are prone to drift due to scale changes and occlusion, respectively. Struck successfully tracks the target in most frames. MTMVTLS and MTMVTLAD achieve better performance than L1T and MTT due to the use of different features. SMTMVT achieves the best performance in handling the occlusion and scale variations because it retains the structure among different views.
In the basketball, walking, subway, football, singer2, and crossing sequences in Fig. 1, the task is to track human bodies under fast motion, rapid pose changes, and illumination variations. For instance, in the singer2 sequence, the algorithms aim to track a target with illumination variation, deformation, and rotations. IVT, L1T, MTT, and Struck quickly drift from the target, mainly because of illumination changes. VTD gradually drifts from the target and loses it completely after some deformation and rotations. MIL is only able to track a part of the target without losing it. MTMVTLS and MTMVTLAD achieve good overall performance. SMTMVT achieves the best performance due to the use of different views and their structured representation.
In the doll, dog, and carDark sequences in Fig. 1, the task is to track various objects under different challenges. For instance, in the doll sequence, the algorithms aim to track a doll with various rotations and background clutter. MTT loses the target due to the background clutter, and IVT fails when the target undergoes pose changes. L1T, MIL, and Struck include much of the background in their results; however, they do not lose the target since they track a part of it throughout the frames. VTD, MTMVTLS, and MTMVTLAD achieve better performance compared with the five other methods due to the incorporation of multiple features. SMTMVT produces more accurate tracking results, especially when the target undergoes in-plane and out-of-plane rotations.
For quantitative comparison, we adopt the overlap score between the tracked bounding box $r_t$ and the ground-truth bounding box $r_g$, defined as $S=\frac{|r_t\cap r_g|}{|r_t\cup r_g|}$, where $|\cdot|$ denotes the number of pixels in a bounding box and $\cap$ and $\cup$ represent the intersection and union of two bounding boxes, respectively. We compute the average overlap score across all frames of each image sequence for each compared method. Table 1 summarizes the average overlap scores for each of the 15 sequences for the nine compared methods. It is clear that the proposed SMTMVT method achieves the best overall performance on the tested sequences. It improves on the second best method (i.e., MTMVTLAD) in terms of the average overlap score over all 15 sequences. It ranks best on seven sequences (i.e., david1, girl, subway, singer2, fleetface, football, and crossing) and second best on four sequences (i.e., basketball, david2, doll, and walking).
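The overlap score above can be computed directly from the box coordinates; a small sketch for axis-aligned boxes given as (x, y, w, h) tuples (this box representation is our assumption):

```python
def overlap_score(box_a, box_b):
    # Intersection-over-union of two axis-aligned boxes (x, y, w, h).
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1, disjoint boxes score 0, and partial overlaps fall in between.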
(Fig. 1 legend: L1T, MTT, Struck, MIL, IVT, VTD, MTMVTLS, MTMVTLAD, and SMTMVT.)
4.2 Experiments on CVPR2013 Tracking Benchmark
We conduct the experiments on the CVPR2013 tracking benchmark  to evaluate the performance of SMTMVT under different challenges. This benchmark consists of 50 annotated sequences. Each sequence is also labeled with attributes specifying the presence of different challenges including illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR). The sequences are categorized based on the attributes and 11 challenge subsets are generated. These subsets are utilized to evaluate the performance of trackers in different challenge categories.
For this benchmark dataset, tracking results for 29 trackers are available online. In addition, we include the results of MTMVTLS and MTMVTLAD provided by the authors. Following the protocol in , we use the same parameters for all the sequences to produce the results for SMTMVT. We run SMTMVT to obtain the one-pass evaluation (OPE) results and compare them with the OPE results of the other 31 trackers. The OPE is conventionally used to evaluate trackers by initializing them with the ground-truth location in the first frame. We present the overall OPE success plot and the OPE success plot for each of the 11 challenge subsets in Fig. 2. These success plots show the percentage of successful frames at overlap thresholds ranging from 0 to 1, where the successful frames are those whose overlap scores are larger than a given threshold. For a fair comparison, we use the area under the curve (AUC) of each success plot to rank the trackers. For clarity, we include only the top 10 of the 32 trackers in each plot. The values shown in parentheses alongside the legends are the AUC scores, and the values shown in parentheses alongside the titles of the 11 challenge subsets are the numbers of video sequences in the respective subsets. It is clear that SMTMVT achieves the best overall performance since it has the largest AUC score of 0.507. SMTMVT also ranks best on four challenge subsets, with the highest AUC scores of 0.502 for IV, 0.518 for IPR, 0.518 for OPR, and 0.527 for OV. It achieves the second best on five challenge subsets (i.e., FM, MB, DEF, SV, and OCC) and the third best on BC.
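The success-plot ranking described above can be sketched as follows: per-frame overlap scores are swept over a grid of thresholds, and the mean success rate serves as the AUC ranking score (the 21-point grid is our assumption):

```python
import numpy as np

def success_curve(overlap_scores, thresholds=None):
    # Fraction of frames whose overlap score exceeds each threshold,
    # plus the average success rate used as the AUC ranking score.
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    scores = np.asarray(overlap_scores)
    success = np.array([(scores > t).mean() for t in thresholds])
    return success, success.mean()
```

Averaging the curve, rather than reading a single threshold, ranks trackers by their behavior over the whole range of overlap requirements.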
5 Conclusions
In this paper, we propose a robust SMTMVT method that uses sparse representation in the particle filter framework to track objects in challenging frame sequences. By introducing the nuclear norm regularization, we represent all views of a target candidate using the same subset of templates in the target dictionaries and further equalize the representation coefficients of the activated templates across all views. The proposed model is efficiently solved by a numerical algorithm based on the PG method. The results on 15 publicly available frame sequences and the CVPR2013 tracking benchmark demonstrate that the SMTMVT method outperforms various state-of-the-art trackers.
References
-  A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
-  S. Salti, A. Cavallaro, and L. Di Stefano, “Adaptive appearance modeling for video tracking: Survey and evaluation,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4334–4348, 2012.
-  M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual object tracking vot2015 challenge results,” in ICCV, 2015.
-  S. Avidan, “Ensemble tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, 2007.
-  H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in ECCV, 2008.
-  B. Babenko, M-H. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” in CVPR, 2009.
-  M. J. Black and A. D. Jepson, “Eigentracking: Robust matching and tracking of articulated objects using a view-based representation,” Int. J. Comput. Vis., vol. 26, no. 1, pp. 63–84, 1998.
-  A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in CVPR, 2006.
-  D. A. Ross, J. Lim, R-S. Lin, and M-H. Yang, “Incremental learning for robust visual tracking,” Int. J. Comput. Vis., 2008.
-  X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 33, no. 11, pp. 2259–2272, 2011.
-  H. Li, C. Shen, and Q. Shi, “Real-time visual tracking using compressive sensing,” in CVPR, 2011.
-  Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in 27th Asilomar Conf. on Signals, Systems and Computers, 1993.
-  M. Shekaramiz, T. K. Moon, and J. H. Gunther, “On the block-sparsity of multiple-measurement vectors,” in Sig. Process. and Sig. Process. Edu. Workshop (SP/SPE), 2015.
-  T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in CVPR, 2012.
-  X. Jia, H. Lu, and M-H Yang, “Visual tracking via adaptive structural local sparse appearance model,” in CVPR, 2012.
-  T. Zhang, S. Liu, C. Xu, S. Yan, B. Ghanem, N. Ahuja, and M-H. Yang, “Structural sparse tracking,” in CVPR, 2015.
-  X-T. Yuan, X. Liu, and S. Yan, “Visual classification with multitask joint sparse representation,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4349–4360, 2012.
-  F. Zohrizadeh, M. Kheirandishfard, K. Ghasedidizaji, and F. Kamangar, “Reliability-based local features aggregation for image segmentation,” in ISVC, 2016.
-  Z. Hong, X. Mei, D. Prokhorov, and D. Tao, “Tracking via robust multi-task multi-view joint sparse representation,” in ICCV, 2013.
-  X. Mei, Z. Hong, D. Prokhorov, and D. Tao, “Robust multitask multiview tracking in videos,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 11, pp. 2874–2890, 2015.
-  N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
-  Y. Nesterov, “Smooth minimization of non-smooth functions,” Math. Prog., vol. 103, no. 1, pp. 127–152, 2005.
-  Y. Wu, J. Lim, and M-H. Yang, “Online object tracking: A benchmark,” in CVPR, 2013.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
-  T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, 2002.
-  X. Tan and B. Triggs, “Enhanced local texture feature sets for face recognition under difficult lighting conditions,” IEEE Trans. Image Process., vol. 19, no. 6, pp. 1635–1650, 2010.
-  S. Hare, A. Saffari, and P. H. Torr, “Struck: Structured output tracking with kernels,” in ICCV, 2011.
-  B. Babenko, M-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 33, no. 8, pp. 1619–1632, 2011.
-  J. Kwon and K. M. Lee, “Visual tracking decomposition,” in CVPR, 2010.