Online Object Tracking, Learning and Parsing with And-Or Graphs
Abstract
This paper presents a method, called AOGTracker, for simultaneous tracking, learning and parsing (TLP) of unknown objects in video sequences with a hierarchical and compositional And-Or graph (AOG) representation. The TLP method is formulated in the Bayesian framework, with a spatial and a temporal dynamic programming (DP) algorithm inferring object bounding boxes on-the-fly. During online learning, the AOG is discriminatively learned using latent SVM [1] to account for appearance variations (e.g., lighting and partial occlusion) and structural variations (e.g., different poses and viewpoints) of a tracked object, as well as distractors (e.g., similar objects) in the background. Three key issues in online inference and learning are addressed: (i) maintaining the purity of positive and negative examples collected online, (ii) controlling model complexity in latent structure learning, and (iii) identifying critical moments to re-learn the structure of the AOG based on its intrackability. The intrackability measures the uncertainty of an AOG based on its score maps in a frame. In experiments, our AOGTracker is tested on two popular tracking benchmarks with the same parameter setting: the TB-100/50/CVPR2013 benchmarks [2, 3], and the VOT benchmarks [4] (VOT 2013, 2014, 2015 and TIR2015, i.e., thermal imagery tracking). In the former, our AOGTracker outperforms state-of-the-art tracking algorithms, including two trackers based on deep convolutional networks [5, 6]. In the latter, our AOGTracker outperforms all other trackers in VOT2013 and is comparable to the state-of-the-art methods in VOT2014, 2015 and TIR2015.
1 Introduction
1.1 Motivation and Objective
Online object tracking is an innate capability in human and animal vision for learning visual concepts [7], and is an important task in computer vision. Given the state of an unknown object (e.g., its bounding box) in the first frame of a video, the task is to infer the hidden states of the object in subsequent frames. Online object tracking, especially long-term tracking, is a difficult problem. It needs to handle variations of a tracked object, including appearance and structural variations, scale changes, occlusions (partial or complete), etc. It also needs to tackle the complexity of the scene, including camera motion, background clutter, distractors, illumination changes, frame cropping, etc. Fig. 1 illustrates some typical issues in online object tracking. In recent literature, object tracking has received much attention due to practical applications in video surveillance, activity and event prediction, human-computer interaction and traffic monitoring.
This paper presents an integrated framework for online tracking, learning and parsing (TLP) of unknown objects with a unified representation. We focus on settings in which the object state is represented by a bounding box, without using pre-trained models. We address five issues associated with online object tracking in the following.
Issue I: Expressive representation accounting for structural and appearance variations of unknown objects in tracking. We are interested in hierarchical and compositional object models. Such models have shown promising performance in object detection [1, 8, 9, 10, 11] and object recognition [12]. A popular modeling scheme represents object categories by mixtures of deformable part-based models (DPMs) [1]. The number of mixture components is usually predefined, and the part configuration of each component is fixed after initialization or based directly on strong supervision. In online tracking, since a tracker can only access the ground-truth object state in the first frame, it is not suitable for it to “make decisions” on the number of mixture components and part configurations, and it does not have enough data to learn from. It is desirable to have an object representation which has the expressive power to represent a large number of part configurations, and which facilitates computationally effective inference and learning. We quantize the space of part configurations recursively in a principled way with a hierarchical and compositional And-Or graph (AOG) representation [8, 11]. We learn and update the most discriminative part configurations online by pruning the quantized space based on part discriminability.
Issue II: Computing joint optimal solutions. Online object tracking is usually posed as a maximum a posteriori (MAP) problem using first-order hidden Markov models (HMMs) [13, 2, 14]. The likelihood or observation density is temporally inhomogeneous due to the online updating of object models. Typically, the objective is to infer the most likely hidden state of a tracked object in a frame by maximizing a Bayesian marginal posterior probability given all the data observed so far. The maximization is based on either particle filtering [15] or dense sampling, as in the tracking-by-detection methods [16, 17, 18]. In most prior approaches (e.g., the 29 trackers evaluated in the TB-100 benchmark [2]), no feedback inspection is applied to the history of the inferred trajectory. We utilize tracking-by-parsing with hierarchical models in inference. By computing joint optimal solutions, we can not only improve prediction accuracy in a new frame by integrating the past estimated trajectory, but also potentially correct errors in the past estimated trajectory. Furthermore, we simultaneously address another key issue in online learning (Issue III).
Issue III: Maintaining the purity of a training dataset. The dataset consists of a set of positive examples computed based on the current trajectory, and a set of negative examples mined from outside the current trajectory. In the dataset, we can only guarantee that the positives and negatives in the first frame are true positives and true negatives respectively. A tracker needs to carefully choose the frames from which it can learn, to avoid model drift (i.e., self-paced learning). Most prior approaches do not address this issue since they focus on marginally optimal solutions with which object models are updated, with the exceptions of the P-N learning in TLD [17] and the self-paced learning for tracking [18]. Since we compute joint optimal solutions in online tracking, we can better maintain the purity of an online collected training dataset.
Issue IV: Failure-aware online learning of object models. In online learning, we mostly update model parameters incrementally after inference in a frame. Theoretically speaking, after an initial object model is learned in the first frame, model drift is inevitable in the general setting. Thus, in addition to maintaining the purity of the training dataset, it is also important to identify critical moments (caused by different structural and appearance variations) automatically. At those moments, a tracker needs to re-learn both the structure and the parameters of the object model using the whole current training dataset. We address this issue by computing the uncertainty of an object model in a frame based on its response maps.
Issue V: Computational efficiency via a dynamic search strategy. Most tracking-by-detection methods run detection in the whole frame since they usually use relatively simple models such as a single object template. With hierarchical models in tracking and sophisticated online inference and updating strategies, the computational complexity is high. To speed up tracking, we need to utilize a dynamic search strategy. This strategy must take into account the trade-off between generating a conservative proposal state space for efficiency and allowing an exhaustive search for accuracy (e.g., to handle the situation where the object is completely occluded for a while or moves out of the camera view and then reappears). We address this issue by adopting a simple search cascade in which we run detection in the whole frame only when local search has failed.
Our TLP method obtains state-of-the-art performance on one popular tracking benchmark [2]. We give a brief overview of our method in the next subsection.
1.2 Method Overview
As illustrated in Fig.LABEL:fig:overview (a), the TLP method consists of four components. We introduce them briefly as follows.
(1) An AOG quantizing the space of part configurations. Given the bounding box of an object in the first frame, we assume object parts are also of rectangular shape. We first divide the box evenly into a small cell-based grid, where a cell defines the smallest part. We then enumerate all possible parts with different aspect ratios and different sizes which can be placed inside the grid. All the enumerated parts are organized into a hierarchical and compositional AOG. Each part is represented by a terminal-node. There are two types of non-terminal nodes serving as compositional rules: an And-node represents the decomposition of a large part into two smaller ones, and an Or-node represents alternative ways of decomposition through different horizontal or vertical binary splits. We call this the full structure AOG (by “full structure” we mean all possible compositions on top of the grid, with binary composition used for And-nodes). It is capable of exploring a large number of latent part configurations (see some examples in Fig. LABEL:fig:overview (b)); meanwhile, it makes the problem of online model learning feasible.
(2) Learning object AOGs. An object AOG is a subgraph learned from the full structure AOG (see Fig. LABEL:fig:overview (c); we note that some Or-nodes in object AOGs have only one child node, since object AOGs are subgraphs of the full structure AOG and we keep their original structures). Learning an object AOG consists of two steps: (i) The initial object AOG is learned by pruning branches of Or-nodes in the full structure AOG based on discriminative power, following breadth-first search (BFS) order. The discriminative power of a node is measured by its training error rate. To preserve ambiguities, we keep multiple branches for each encountered Or-node, namely those whose training error rates exceed the minimum error rate by no more than a small positive value. (ii) We retrain the initial object AOG using latent SVM (LSVM), as was done in learning the DPMs [1]. LSVM utilizes positive re-labeling (i.e., inferring the best configuration for each positive example) and hard negative mining. To further control the model complexity, we prune the initial object AOG through majority voting of the latent assignments in positive re-labeling.
(3) A spatial dynamic programming (DP) algorithm for computing all the proposals in a frame with the current object AOG. Thanks to the DAG structure of the object AOG, a DP parsing algorithm is utilized to compute the matching scores and the optimal parse trees of all sliding windows inside the search region in a frame. A parse tree is an instantiation of the object AOG which selects the best child for each encountered Or-node according to the matching score. A configuration is obtained by collapsing a parse tree onto the image domain, capturing the layout of the latent parts of a tracked object in a frame.
(4) A temporal DP algorithm for inferring the most likely trajectory. We maintain a DP table memoizing the candidate object states computed by the spatial DP in the past frames. Then, based on the first-order HMM assumption, a temporal DP algorithm is used to find the optimal solution for the past frames jointly with pairwise motion constraints (i.e., the Viterbi path [14]). The joint solution can help correct potential tracking errors (i.e., false negatives and false positives collected online) by leveraging more spatial and temporal information. This is similar in spirit to methods of keeping an N-best maximal decoder for part models [19] and maintaining diverse M-best solutions in MRFs [20].
2 Related Work
In the literature of object tracking, for either single-object or multiple-object tracking, there are often two settings.
Offline visual tracking [21, 22, 23, 24]. These methods assume the whole video sequence has been recorded, and consist of two steps: (i) they first compute object proposals in all frames using pre-trained detectors (e.g., the DPMs [1]) and then form “tracklets” in consecutive frames; (ii) they seek the optimal object trajectory (or trajectories for multiple objects) by solving an optimization problem (e.g., the K-shortest path or min-cost flow formulation) for data association. Most work assumed first-order HMMs in the formulation. Recently, Hong and Han [25] proposed an offline single-object tracking method by sampling tree-structured graphical models which exploit the underlying intrinsic structure of the input video in orderless tracking [26].
Online visual tracking for streaming videos. Tracking starts after the state of an object is specified in a certain frame. In the literature, particle filtering [15] has been widely adopted; it approximately represents the posterior probability in a non-parametric form by maintaining a set of particles (i.e., weighted candidates). In practice, particle filtering does not perform well in high-dimensional state spaces. More recently, tracking-by-detection methods [16, 17] have become popular, which learn and update object models online and encode the posterior probability using dense sampling through sliding-window-based detection on-the-fly. Thus, object tracking is treated as instance-based object detection. To leverage recent advances in object detection, object tracking research has made progress by incorporating discriminatively trained part-based models [1, 8, 27] (or, more generally, grammar models [11, 10, 9]). Most popular methods also assume first-order HMMs, except for the recently proposed online graph-based tracker [28]. There are four streams in the literature of online visual tracking:

Appearance modeling of the whole object, such as incremental learning [29], kernel-based methods [30], particle filtering [15], sparse coding [31] and 3D-DCT representations [32]. More recently, convolutional neural networks have been utilized to improve tracking performance [6, 5, 33]; they are usually pre-trained on large-scale image datasets such as ImageNet [34], or on the video sequences in a benchmark with the testing one excluded.

Appearance modeling of objects with parts, such as patch-based methods [35], coupled 2-layer models [36] and adaptive sparse appearance [37]. The major limitation of appearance modeling of a tracked object is the lack of background models, especially for preventing model drift caused by distractors (e.g., players in sports games). Addressing this issue leads to discriminant tracking.
Our method belongs to the fourth stream of online visual tracking. Unlike the predefined or fixed part configurations with star-model structure used in previous work, our method learns both the structure and the appearance of object AOGs online, which is, to our knowledge, the first method to address the problem of online explicit structure learning in tracking. The advantages of introducing the AOG representation are threefold.

More representational power: Unlike TLD [17] and many other methods (e.g., [18]) which model an object as a single template or a mixture of templates, and thus do not perform well in tracking objects with large structural and appearance variations, an AOG represents an object by a hierarchical and compositional graph expressing a large number of latent part configurations.

More robust tracking and online learning strategies: While the whole object undergoes large variations or might be partially occluded from time to time during tracking, some parts remain stable and are less likely to be occluded. Some of these parts can be learned to robustly track the object, which can also improve the accuracy of appearance adaptation of the terminal-nodes. This idea is similar in spirit to finding good features to track objects [45]; we find good part configurations online for both tracking and learning.

Fine-grained tracking results: In addition to predicting bounding boxes of a tracked object, the outputs of our AOGTracker (i.e., the parse trees) carry more information which is potentially useful for other modules beyond tracking, such as activity or event prediction.
Our preliminary work was published in [46], and the method for constructing the full structure AOG was published in [8]. This paper extends them by: (i) adding more experimental results with state-of-the-art performance obtained and the full source code released; (ii) substantially elaborating the details of deriving the formulation of the inference and learning algorithms; and (iii) adding more analyses on different aspects of our method. This paper makes three contributions to the online object tracking problem:

It presents a tracking-learning-parsing (TLP) framework which can learn and track objects with AOGs.

It presents spatial and temporal DP algorithms for tracking-by-parsing with AOGs, which output fine-grained tracking results using parse trees.
Paper Organization. The remainder of this paper is organized as follows. Section 3 presents the formulation of our TLP framework in the Bayesian framework. Section 4 gives the details of the spatial and temporal DP algorithms. Section 5 presents the online learning algorithm using the latent SVM method. Section 6 shows the experimental results and analyses. Section 7 concludes this paper and discusses issues and future work.
3 Problem Formulation
3.1 Formulation of Online Object Tracking
In this section, we first derive a generic formulation from a generative perspective in the Bayesian framework, and then derive its discriminative counterpart.
3.1.1 Tracking with HMM
Let $\Lambda$ denote the image lattice on which video frames are defined. Denote a sequence of video frames within the time range $[1, t]$ by,

$I_{1:t} = \{I_1, \cdots, I_t\}$.  (1)
Denote by $B_t$ the bounding box of a target object in $I_t$. In online object tracking, $B_1$ is given and the $B_t$'s are inferred by a tracker ($t \geq 2$). A first-order HMM is specified by,

the prior model: $p(B_1)$,  (2)

the motion model: $p(B_t | B_{t-1})$,  (3)

the likelihood: $p(I_t | B_t)$.  (4)
Then, the prediction model is defined by,

$p(B_t | I_{1:t-1}) = \int_{\Omega_{B_{t-1}}} p(B_t | B_{t-1}) \, p(B_{t-1} | I_{1:t-1}) \, dB_{t-1}$,  (5)

where $\Omega_{B_{t-1}}$ is the candidate space of $B_{t-1}$, and the updating model is defined by,

$p(B_t | I_{1:t}) \propto p(I_t | B_t) \, p(B_t | I_{1:t-1})$,  (6)
which is a marginal posterior probability. The tracking result, the best bounding box $B_t^*$, is computed by,

$B_t^* = \arg\max_{B_t} p(B_t | I_{1:t})$,  (7)

which is usually solved using particle filtering [15] in practice.
To allow feedback inspection of the history of a trajectory, we seek to maximize a joint posterior probability,

$B_{1:t}^* = \arg\max_{B_{1:t}} p(B_{1:t} | I_{1:t}) = \arg\max_{B_{1:t}} p(B_1) \, p(I_1 | B_1) \prod_{i=2}^{t} p(B_i | B_{i-1}) \, p(I_i | B_i)$.  (8)
By taking the logarithm of both sides of Eqn. (8), we have,

$B_{1:t}^* = \arg\max_{B_{1:t}} \left[ \log p(B_1) + \log p(I_1 | B_1) + \sum_{i=2}^{t} \left( \log p(B_i | B_{i-1}) + \log p(I_i | B_i) \right) \right]$,  (9)

where the image data term $\log p(I_{1:t})$ and the normalizing constants are not included in the maximization as they are treated as constant terms.
Since we have the ground-truth $B_1$, $p(I_1 | B_1)$ can also be treated as known after the object model is learned based on $(I_1, B_1)$. Then, Eqn. (9) can be rewritten as,

$B_{2:t}^* = \arg\max_{B_{2:t}} \sum_{i=2}^{t} \left[ \log p(B_i | B_{i-1}) + \log p(I_i | B_i) \right]$.  (10)
3.1.2 Tracking as Energy Minimization over Trajectories
To derive the discriminative formulation of Eqn. (10), we show that only the log-likelihood ratio matters in computing $\log p(I_i | B_i)$ in Eqn. (10) under very mild assumptions.
Let $\Lambda_{B_t}$ be the image domain occupied by a tracked object, and $\Lambda_{\bar{B}_t}$ the remaining domain (i.e., $\Lambda_{B_t} \cup \Lambda_{\bar{B}_t} = \Lambda$ and $\Lambda_{B_t} \cap \Lambda_{\bar{B}_t} = \emptyset$) in a frame $I_t$. With the independence assumption between $I_{\Lambda_{B_t}}$ and $I_{\Lambda_{\bar{B}_t}}$ given $B_t$, we have,

$p(I_t | B_t) = p(I_{\Lambda_{B_t}} | B_t) \, q(I_{\Lambda_{\bar{B}_t}}) = \frac{p(I_{\Lambda_{B_t}} | B_t)}{q(I_{\Lambda_{B_t}})} \, q(I_{\Lambda})$,  (11)

where $q(\cdot)$ is the probability model of the background scene and we have $q(I_{\Lambda}) = q(I_{\Lambda_{B_t}}) \, q(I_{\Lambda_{\bar{B}_t}})$ w.r.t. the context-free assumption. So, $q(I_{\Lambda})$ does not need to be specified explicitly and can be omitted in the maximization. This derivation gives an alternative explanation for discriminant tracking vs. tracking by generative appearance modeling of an object [47].
Based on Eqn. (10), we define an energy function by,

$E(B_{2:t} | I_{1:t}) = -\log p(B_{2:t} | I_{1:t})$.  (12)

Further, we do not compute the log-likelihood ratio in a probabilistic way; instead we compute a matching score defined by,

$\text{Score}(B_i | I_i) \propto \log \frac{p(I_{\Lambda_{B_i}} | B_i)}{q(I_{\Lambda_{B_i}})}$,  (13)

to which we can apply discriminative learning methods.
Also, denote the motion cost by,

$\text{Cost}(B_i | B_{i-1}) = -\log p(B_i | B_{i-1})$.  (14)

We use a thresholded motion model in experiments: the cost is $0$ if the transition is accepted based on the median flow [17] (which is a forward-backward extension of the Lucas-Kanade optical flow [48]) and $+\infty$ otherwise. A similar method was explored in [18].
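As an illustration, the thresholded motion model can be sketched in Python as follows. The predicted box would come from the median-flow forward-backward check; the IoU-based acceptance criterion and the threshold value here are assumptions of this sketch, not the exact rule used in the implementation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

INF = float("inf")

def motion_cost(b_curr, b_pred, iou_thresh=0.3):
    """Thresholded motion cost (Eqn. 14): 0 if the transition to b_curr is
    accepted relative to the median-flow-predicted box b_pred, +inf otherwise.
    (iou_thresh is an assumed acceptance criterion for illustration.)"""
    return 0.0 if iou(b_curr, b_pred) >= iou_thresh else INF
```

The infinite cost makes rejected transitions impossible in the temporal minimization, which is exactly the effect of the thresholded model.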
So, we can rewrite Eqn. (10) in the minimization form,

$B_{2:t}^* = \arg\min_{B_{2:t}} \sum_{i=2}^{t} \left[ \text{Cost}(B_i | B_{i-1}) - \text{Score}(B_i | I_i) \right]$.  (15)
In our TLP framework, we compute $\text{Score}(B_i | I_i)$ in Eqn. (15) with an object AOG. That is, we interpret a sliding window by the optimal parse tree inferred from the object AOG. We treat parts as latent variables which are modeled to leverage more information for inferring the object bounding box. We note that we do not track parts explicitly in this paper.
3.2 Quantizing the Space of Part Configurations
In this section, we first present the construction of a full structure AOG which quantizes the space of part configurations. We then introduce the notation used in defining an AOG.
Part configurations. For an input bounding box, a part configuration is defined by a partition with different numbers of parts of different shapes (see Fig. 2 (a)). Two natural questions arise: (i) How many part configurations (i.e., how large a space) can be defined in a bounding box? (ii) How do we organize them into a compact representation? Without imposing structural constraints, this is a combinatorial problem.
We assume rectangular shapes for parts. A configuration can then be treated as a tiling of the input bounding box using either horizontal or vertical cuts. We utilize binary splitting rules only in decomposition (see Fig. 2 (b) and (c)). With these two constraints, we represent all possible part configurations by a hierarchical and compositional AOG constructed as follows.
Given a bounding box, we first divide it evenly into a cell-based grid (e.g., the grid in the right of Fig. 3). Then, in the grid, we define a dictionary of part types and enumerate all instances of all part types.
A dictionary of part types. A part type is defined by its width and height. Starting from some minimal size, we enumerate all possible part types with different aspect ratios and sizes which fit the grid (see Fig. 3 (a)).
Part instances. An instance of a part type is obtained by placing the part type at a position. Thus, a part instance is defined by a “sliding window” in the grid. Fig. 3 (b) shows an example of placing a part type in a grid and enumerating all of its instances.
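The enumeration of part types and instances can be sketched as follows; the grid size and the one-cell minimal part size are illustrative choices, not the paper's settings.

```python
def enumerate_part_types(W, H, min_w=1, min_h=1):
    """All part types (w, h) that fit a W x H cell grid."""
    return [(w, h) for w in range(min_w, W + 1) for h in range(min_h, H + 1)]

def enumerate_instances(W, H, w, h):
    """All placements ("sliding windows") of a w x h part type in the grid."""
    return [(x, y, w, h) for x in range(W - w + 1) for y in range(H - h + 1)]

# For a 3x3 grid with 1x1 minimal parts there are 9 part types and
# (3 + 2 + 1)^2 = 36 part instances in total.
types_ = enumerate_part_types(3, 3)
instances = [inst for (w, h) in types_
             for inst in enumerate_instances(3, 3, w, h)]
```

The instance count grows polynomially with the grid size, while the number of configurations built on top of these instances grows much faster, which is what the AOG exploits.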
To represent part configurations compactly, we exploit the compositional relationships between enumerated part instances.
The full structure AOG. For any sub-grid indexed by its left-top position, width and height (e.g., in the right-middle of Fig. 3 (c)), we can either terminate it directly into the corresponding part instance (Fig. 3 (c.1)), or decompose it into two smaller sub-grids using either horizontal or vertical binary splits. Depending on the side length, we may have multiple valid splits along both directions (Fig. 3 (c.2)). When splitting either side, we allow overlaps between the two sub-grids up to some ratio (Fig. 3 (c.3)). Then, we represent the sub-grid as an Or-node, which has a set of child nodes including a terminal-node (i.e., the part instance directly terminated from it) and a number of And-nodes (each of which represents a valid decomposition). This procedure is applied recursively to all child sub-grids. Starting from the whole grid and using BFS order, we construct a full structure AOG, as summarized in Algorithm 1 (see Fig. 4 for an example). Table I lists the number of part configurations for three cases, from which we can see that full structure AOGs cover a large number of part configurations using a relatively small set of part instances.
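The recursive construction can be sketched as below. This is a simplified version of Algorithm 1 that omits the overlapping splits the paper allows and keeps only minimal node bookkeeping.

```python
from collections import deque

def build_full_aog(W, H):
    """BFS construction of a full structure AOG over a W x H grid using
    non-overlapping binary splits only. Each sub-grid (x, y, w, h) becomes
    an Or-node whose children are one terminal-node (the part instance)
    plus one And-node per valid vertical or horizontal binary split."""
    or_nodes, and_nodes, t_nodes = {}, [], []
    queue = deque([(0, 0, W, H)])
    while queue:
        g = queue.popleft()
        if g in or_nodes:          # each sub-grid is expanded once
            continue
        x, y, w, h = g
        children = [("T", g)]      # terminate to the part instance
        t_nodes.append(g)
        for cut in range(1, w):    # vertical binary splits
            a, b = (x, y, cut, h), (x + cut, y, w - cut, h)
            and_nodes.append((g, a, b))
            children.append(("And", a, b))
            queue.extend([a, b])
        for cut in range(1, h):    # horizontal binary splits
            a, b = (x, y, w, cut), (x, y + cut, w, h - cut)
            and_nodes.append((g, a, b))
            children.append(("And", a, b))
            queue.extend([a, b])
        or_nodes[g] = children
    return or_nodes, and_nodes, t_nodes

ors, ands, ts = build_full_aog(3, 3)
```

Running this on a 3x3 grid gives 36 Or-nodes, 36 terminal-nodes and 48 And-nodes under these simplifications; the counts change once overlapping splits and a larger minimal part size are allowed, as in the full algorithm.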
We denote an AOG by,

$\mathcal{G} = (V_{\text{And}}, V_{\text{Or}}, V_T, E, \Theta)$,  (16)

where $V_{\text{And}}$, $V_{\text{Or}}$ and $V_T$ represent the sets of And-nodes, Or-nodes and terminal-nodes respectively, $E$ a set of edges and $\Theta$ a set of parameters (to be defined in Section 4.1). We have,

The object/root Or-node (plotted as green circles), which represents alternative object configurations;

A set of And-nodes (solid blue circles), each of which represents the rule of decomposing a complex structure (e.g., a walking person or a running basketball player) into simpler ones;

A set of part Or-nodes, which handle local variations and configurations in a recursive way;

A set of terminal-nodes (red rectangles), which link an object and its parts to image data (i.e., grounding symbols) to account for appearance variations and occlusions (e.g., the head-shoulder of a walking person before and after opening a sun umbrella).
Grid | Primitive part | #Configurations | #T-nodes | #And-nodes
--- | --- | --- | --- | ---
 | | 319 | 35 | 48
 | | 76,879,359 | 224 | 600
 | | 3.8936e+009 | 1,409 | 5,209
An object AOG is a subgraph of a full structure AOG with the same root Or-node. For notational simplicity, we also denote an object AOG by $\mathcal{G}$. So, we will write the score term in Eqn. (15) as $\text{Score}(B_i | I_i; \mathcal{G})$ with $\mathcal{G}$ added.
A parse tree is an instantiation of an object AOG with the best child node (w.r.t. matching scores) selected for each encountered Or-node. All the terminal-nodes in a parse tree represent a part configuration when collapsed onto the image domain.
We note that an object AOG contains multiple parse trees to preserve ambiguities in interpreting a tracked object (see examples in Fig. LABEL:fig:overview (c) and Fig. 6).
4 Tracking-by-Parsing with Object AOGs
In this section, we present the details of inference with object AOGs. We first define the scoring functions of nodes in an AOG. Then, we present a spatial DP algorithm for computing candidate object states in a frame, and a temporal DP algorithm for inferring the trajectory in Eqn. (15).
4.1 Scoring Functions of Nodes in an AOG
Let $F$ be the feature pyramid computed for either the local ROI or the whole image $I_t$, and $\Lambda_F$ the position space of the pyramid $F$. Let $p = (l, x, y) \in \Lambda_F$ specify a position $(x, y)$ in the $l$-th level of the pyramid $F$.
Given an AOG (e.g., the left of Fig. 5), we define four types of edges, as shown in Fig. 5. We elaborate the definitions of the parameters $\Theta$:

Each terminal-node has appearance parameters, which are used to ground the terminal-node to image data.

The parent And-node of a part terminal-node with a deformation edge has deformation parameters. They are used for penalizing local displacements when placing a terminal-node around its anchor position. We note that the object template is not allowed to perturb locally in inference, since we infer the optimal part configuration for each given object location in the pyramid with the sliding window technique, as done in the DPMs [1]; so the parent And-node of the object terminal-node does not have deformation parameters.

A child And-node of the root Or-node has a bias term. We do not define bias terms for child nodes of other Or-nodes.
Appearance Features. We use three types of features: histogram of oriented gradient (HOG) [49], local binary pattern features (LBP) [50], and RGB color histograms (for color videos).
Deformation Features. Denote by $\delta = (dx, dy)$ the displacement of placing a terminal-node around its anchor location. The deformation feature is defined by $\Phi^{\text{def}}(\delta) = (dx^2, dx, dy^2, dy)$, as done in DPMs [1].
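A minimal sketch of the deformation features and the corresponding linear penalty:

```python
def phi_def(dx, dy):
    """Quadratic deformation features for a displacement (dx, dy),
    as in DPMs: (dx^2, dx, dy^2, dy)."""
    return (dx * dx, dx, dy * dy, dy)

def deformation_penalty(w_def, dx, dy):
    """Penalty <w_def, phi_def(dx, dy)> for placing a terminal-node at an
    offset (dx, dy) from its anchor; w_def are learned deformation
    parameters (4 values, one per feature)."""
    return sum(wi * fi for wi, fi in zip(w_def, phi_def(dx, dy)))
```

With positive weights on the quadratic terms, the penalty grows with the distance from the anchor, so large displacements are discouraged but not forbidden.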
We use linear functions to evaluate both appearance scores and deformation scores. The score functions of nodes in an AOG are defined as follows:

For a terminal-node $t$, its score at a position $p$ is computed by,

$\text{Score}(t, p) = \langle \omega_t^{\text{app}}, \Phi^{\text{app}}(F, p) \rangle$,  (17)

where $\langle \cdot, \cdot \rangle$ represents the inner product and $\Phi^{\text{app}}(F, p)$ extracts features in the feature pyramid.

For an Or-node $O$, its score at position $p$ takes the maximum score over its child nodes,

$\text{Score}(O, p) = \max_{v \in ch(O)} \text{Score}(v, p)$,  (18)

where $ch(v)$ denotes the set of child nodes of a node $v$.

For an And-node $A$, we have three different functions w.r.t. the type of its out-edge (i.e., Terminal-, Deformation-, or Decomposition-edge),

$\text{Score}(A, p) = \begin{cases} \text{Score}(t, p) & \text{Terminal-edge,} \\ \max_{\delta} \left[ \text{Score}(t, p \oplus \delta) - \langle \omega_A^{\text{def}}, \Phi^{\text{def}}(\delta) \rangle \right] & \text{Deformation-edge,} \\ \sum_{v \in ch(A)} \text{Score}(v, p) & \text{Decomposition-edge,} \end{cases}$  (19)

where the first case shares score maps between the object terminal-node and its parent And-node, since we do not allow local deformation of the whole object; the second case computes the transformed score maps of the parent And-node of a part terminal-node, which is allowed to find the best placement through the distance transform [1], with $\oplus$ the displacement operator in the position space $\Lambda_F$; and the third case computes the score maps of an And-node with two child nodes through composition.
4.2 Tracking-by-Parsing
With the scoring functions defined above, we present the spatial and temporal DP algorithms for solving Eqn. (15).
Spatial DP: The DP algorithm (see Algorithm 2) consists of two stages: (i) The bottom-up pass computes score map pyramids (as illustrated in Fig. 5) for all nodes following the depth-first search (DFS) order of nodes. It computes the matching scores of all possible parse trees at all possible positions in the feature pyramid. (ii) In the top-down pass, we first find all candidate positions for the root Or-node $O_{\text{root}}$ based on its score maps and the current detection threshold $\tau_t$ of the object AOG, denoted by

$\mathbb{P}_t = \{ p \in \Lambda_F : \text{Score}(O_{\text{root}}, p) \geq \tau_t \}$.  (20)

Then, following the BFS order of nodes, we retrieve the optimal parse tree at each $p \in \mathbb{P}_t$: starting from the root Or-node, we select the optimal branch (with the largest score) of each encountered Or-node, keep the two child nodes of each encountered And-node, and retrieve the optimal position of each encountered part terminal-node (by taking the $\arg\max$ displacement for the second case in Eqn. (19)).
After spatial parsing, we apply non-maximum suppression (NMS) in computing the optimal parse trees with a predefined intersection-over-union (IoU) overlap threshold. We keep the top $N$ parse trees to infer the best $B_t$ together with the temporal DP algorithm, similar to the strategies used in [19, 20].
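Greedy IoU-based NMS over the scored windows can be sketched as below; the threshold and top-N values are illustrative defaults, not the paper's settings.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms_top_n(boxes, scores, iou_thresh=0.5, top_n=10):
    """Greedy NMS: walk windows in descending score order, suppress any
    window overlapping a kept one by iou_thresh or more, stop at top_n."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep
```

Keeping several surviving windows per frame, rather than only the single best one, is what gives the temporal DP alternatives to choose from.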
Temporal DP: Assuming that all the N-best candidates for $B_i$ are memoized after running the spatial DP algorithm in frames $1$ to $t$, Eqn. (15) corresponds to the classic DP formulation of forward and backward inference for decoding HMMs, with $-\text{Score}(B_i | I_i)$ being the singleton “data” term and $\text{Cost}(B_i | B_{i-1})$ the pairwise cost term.
Let $E_i(B_i)$ be the energy of the best object states in the first $i$ frames with the constraint that the $i$-th one is $B_i$. We have,

$E_i(B_i) = \min_{B_{i-1}} \left[ E_{i-1}(B_{i-1}) + \text{Cost}(B_i | B_{i-1}) \right] - \text{Score}(B_i | I_i)$,  (21)

where $B_1$ is the input bounding box. Then, the temporal DP algorithm consists of two steps:

The forward step for computing all the $E_i(B_i)$'s, and caching the optimal $B_{i-1}$ as a function of $B_i$ for later back-tracing starting at $i = t$;

The backward step for finding the optimal trajectory, where we first take,

$B_t^* = \arg\min_{B_t} E_t(B_t)$,  (22)

and then trace back in the order $i = t, t-1, \cdots, 3$,

$B_{i-1}^* = \arg\min_{B_{i-1}} \left[ E_{i-1}(B_{i-1}) + \text{Cost}(B_i^* | B_{i-1}) \right]$.  (23)
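The forward and backward steps can be sketched over per-frame N-best candidate lists as follows. Candidates are abstract objects here, and `pair_cost` plays the role of Eqn. (14); the initialization of the first frame's energies is a simplification of this sketch.

```python
INF = float("inf")

def temporal_dp(candidates, scores, pair_cost):
    """Viterbi decoding over per-frame N-best candidates.
    candidates[i] lists the candidate states in frame i, scores[i][k] is
    the matching score of candidates[i][k], and pair_cost(a, b) is the
    motion cost of the transition a -> b. Returns the minimum-energy
    trajectory (one candidate per frame)."""
    T = len(candidates)
    E = [[-s for s in scores[0]]]            # forward energies, frame 0
    back = [[None] * len(candidates[0])]     # back-pointers
    for i in range(1, T):
        Ei, Bi = [], []
        for k, b in enumerate(candidates[i]):
            best, arg = INF, None
            for j, a in enumerate(candidates[i - 1]):
                e = E[i - 1][j] + pair_cost(a, b)
                if e < best:
                    best, arg = e, j
            Ei.append(best - scores[i][k])   # Eqn. (21)
            Bi.append(arg)
        E.append(Ei)
        back.append(Bi)
    # backward step: Eqns. (22) and (23)
    k = min(range(len(E[-1])), key=lambda idx: E[-1][idx])
    path = [k]
    for i in range(T - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return [candidates[i][path[i]] for i in range(T)]
```

With $N$ candidates per frame, the forward step costs $O(T N^2)$ pairwise evaluations, which is cheap for the short time ranges used in practice.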
In practice, we often do not need to run the temporal DP over the whole time range $[1, t]$, especially for long-term tracking, since the target object might have changed significantly or there might be camera motion; instead, we focus only on some short time range (see the settings in experiments).
Remarks: In our TLP method, we apply the spatial and the temporal DP algorithms in a stage-wise manner and without tracking parts explicitly. Thus, we do not introduce loops in inference. If we instead attempted to learn a joint spatial-temporal AOG, it would be a much more difficult problem due to loops in joint spatial-temporal inference, and approximate inference would be needed.
Search Strategy: During tracking, at time $t$, the search is initialized by $B_{t-1}$: a rectangular region of interest (ROI) centered at the center of $B_{t-1}$ is used to compute the feature pyramid and run parsing with the AOG. The ROI is first computed as a square area with a side length several times the maximum of the width and height of $B_{t-1}$, and is then clipped to the image domain. If no candidates are found (i.e., the candidate set in Eqn. (20) is empty), we run the parsing in the whole image domain; so, our AOGTracker is capable of re-detecting a tracked object. If there are still no candidates (e.g., the target object was completely occluded or went out of the camera view), the tracking result of this frame is set to be invalid and we do not need to run the temporal DP.
4.3 The Trackability of an Object AOG
To detect critical moments online, we need to measure the quality of an object AOG at time $t$. We compute its trackability based on the score maps in which the optimal parse tree is placed. For each node $v$ in the parse tree, we have its position in the score map pyramid (i.e., the level of the pyramid and the location in that level). We define the trackability of node $v$ by,

$\text{Trackability}(v | I_t) = s(v) - \bar{s}(v)$,  (24)

where $s(v)$ is the score of node $v$ and $\bar{s}(v)$ the mean score computed from the whole score map. Intuitively, we expect the score map of a discriminative node to have a peaky and steep landscape, as investigated in [51]. The trackabilities of part nodes are used to infer partial occlusion and local structure variations, and the trackability of the inferred parse tree indicates the “goodness” of the current object AOG. We note that we treat trackability and intrackability (i.e., the inverse of trackability) interchangeably. More sophisticated definitions of intrackability in tracking are given in [52].
We model trackability by a Gaussian model whose mean and standard deviation are computed incrementally over time. At time $t$, a tracked object is said to be “intrackable” if its trackability is less than a threshold derived from the Gaussian model. We note that the tracking result could still be valid even if it is “intrackable” (e.g., in the first few frames in which the target object is occluded partially, especially by similar distractors).
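The incremental Gaussian model can be sketched with Welford's online update; the k-sigma test below is an assumed concrete form of the threshold, for illustration only.

```python
import math

class TrackabilityModel:
    """Gaussian model of trackability with incrementally updated mean and
    standard deviation (Welford's algorithm). The intrackability test uses
    an assumed k-sigma rule below the running mean."""
    def __init__(self, k=3.0):
        self.n, self.mean, self.m2, self.k = 0, 0.0, 0.0, k

    def update(self, x):
        """Incorporate one trackability observation."""
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    @property
    def std(self):
        """Population standard deviation of the observations so far."""
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0

    def intrackable(self, x):
        """True if x falls more than k standard deviations below the mean."""
        return x < self.mean - self.k * self.std
```

Welford's update keeps the running statistics in O(1) memory per frame, which suits the online setting.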
5 Online Learning of Object AOGs
In this section, we present online learning of object AOGs, which consists of three components: (i) maintaining a training dataset based on tracking results; (ii) estimating the parameters of a given object AOG; and (iii) learning the structure of the object AOG by pruning the full structure AOG, which requires (ii) in the process.
5.1 Maintaining the Training Dataset Online
Denote by D_t the training dataset at time t, consisting of a positive dataset D_t^+ and a negative dataset D_t^-.
In the first frame, the input bounding box gives the initial positive example. We augment the positive set with eight locally shifted positives, i.e., copies of the input bounding box shifted by ±d pixels along the x- and y-axes and their four diagonal combinations, with width and height unchanged; d is set to the cell size used in computing HOG features. The initial negative set uses the whole remaining image for mining hard negatives in training.
At time t, if the tracking result is valid according to tracking-by-parsing, we add it to the positive set, and add to the negative set all other candidates from the spatial DP (Eqn. 20) which are not suppressed by the tracking result according to NMS (i.e., hard negatives). Otherwise, the dataset is left unchanged.
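The two dataset-maintenance rules above can be sketched in Python. The helper names (`shifted_positives`, `hard_negatives`) and the (x, y, w, h) box layout are illustrative assumptions.

```python
def shifted_positives(box, d):
    """Eight locally shifted copies of the annotated first-frame box:
    shifts of +/-d along x, y, and the four diagonals; size unchanged."""
    x, y, w, h = box
    shifts = [(dx, dy) for dx in (-d, 0, d) for dy in (-d, 0, d)
              if (dx, dy) != (0, 0)]
    return [(x + dx, y + dy, w, h) for dx, dy in shifts]

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / float(aw * ah + bw * bh - inter)

def hard_negatives(result, candidates, nms_thr):
    """Candidates not suppressed by the tracking result under NMS
    become hard negatives."""
    return [c for c in candidates if iou(result, c) < nms_thr]
```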
5.2 Estimating Parameters of a Given Object AOG
We use the latent SVM method (LSVM) [1]. Based on the scoring functions defined in Section 4.1, we can rewrite the scoring function of applying a given object AOG on a training example (denoted by x for simplicity),
Score(x) = max_{pt ∈ Ω} ⟨Θ, Φ(x, pt)⟩ + b,  (25)
where pt represents a parse tree, Ω the space of parse trees, Θ the concatenated vector of all parameters, Φ(x, pt) the concatenated vector of appearance and deformation features in the feature pyramid w.r.t. the parse tree pt, and b the bias term.
The objective function in estimating parameters is defined by the regularized empirical hinge loss function,
L(Θ) = (1/2)‖Θ‖² + C Σ_{i=1}^{N} max(0, 1 − y_i · Score(x_i)),  (26)
where C is the trade-off parameter in learning and y_i ∈ {+1, −1} the label of example x_i. Eqn. (26) is semi-convex in the parameters: it is convex w.r.t. the negatives but not w.r.t. the positives, due to the latent maximization in the empirical loss term on positives.
In optimization, we utilize an iterative procedure in a "coordinate descent" way. We first convert the objective function to a convex function by assigning latent values (parse trees) to all positives using the spatial DP algorithm. Then, we estimate the parameters. While we could use stochastic gradient descent as done in DPMs [1], we adopt the L-BFGS method [53] in practice (we re-implemented the MATLAB code available at http://www.cs.ubc.ca/~schmidtm/Software/minConf.html in C++), since it is more robust and efficient with a parallel implementation, as investigated in [9, 54]. The detection threshold is estimated as the minimum score of the positives.
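The coordinate-descent structure of LSVM training can be illustrated on a toy problem. This is a deliberately simplified Python sketch: the latent assignment here is a plain argmax (standing in for the spatial DP), the convex step uses subgradient descent rather than L-BFGS, and all names (`train_latent_svm`, `phi`, `latents`) are hypothetical.

```python
def train_latent_svm(positives, negatives, phi, latents, C=1.0,
                     rounds=3, epochs=50, lr=0.01):
    """Toy latent-SVM training by 'coordinate descent':
    (1) fix the latent value of each positive to its best-scoring
        assignment (the role of the spatial DP in the paper);
    (2) minimize the now-convex regularized hinge loss (Eqn. 26)
        by subgradient descent (the paper uses L-BFGS).
    phi(x, z) -> feature vector (list); negatives score by max over latents.
    Returns the weights and the detection threshold (min positive score).
    """
    dim = len(phi(positives[0], latents[0]))
    w = [0.0] * dim

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def score(x):
        return max(dot(w, phi(x, z)) for z in latents)

    for _ in range(rounds):
        # Step 1: relabel positives (latent assignment).
        zp = [max(latents, key=lambda z: dot(w, phi(x, z))) for x in positives]
        # Step 2: subgradient descent on the convex objective.
        for _ in range(epochs):
            grad = [wi for wi in w]  # gradient of (1/2)||w||^2
            for x, z in zip(positives, zp):
                if dot(w, phi(x, z)) < 1.0:  # positive hinge active
                    grad = [g - C * f for g, f in zip(grad, phi(x, z))]
            for x in negatives:
                zbest = max(latents, key=lambda z: dot(w, phi(x, z)))
                if dot(w, phi(x, zbest)) > -1.0:  # negative hinge active
                    grad = [g + C * f for g, f in zip(grad, phi(x, zbest))]
            w = [wi - lr * g for wi, g in zip(w, grad)]
    # Detection threshold: minimum score over positives.
    thr = min(score(x) for x in positives)
    return w, thr
```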
5.3 Learning Object AOGs
With the training dataset and the full structure AOG constructed based on the input bounding box, an object AOG is learned in three steps:
i) Evaluating the figures of merit of nodes in the full structure AOG. We first train the root classifier (i.e., the object appearance parameters and bias term) by linear SVM, using the positives and data-mining hard negatives. Then, the appearance parameters of each part terminal-node are initialized by cropping out the corresponding portion of the object template (we also tried training linear SVM classifiers for all terminal-nodes individually using cropped examples, which increases the runtime but does not improve the tracking performance in experiments, so we use the simplified method above). Following DFS order, we evaluate the figure of merit of each node in the full structure AOG by its training error rate, calculated on the training dataset, where the score of a node is computed w.r.t. the scoring functions defined in Section 4.1. The smaller the error rate, the more discriminative the node.
ii) Retrieving an initial object AOG and re-estimating parameters. We retrieve the most discriminative subgraph of the full structure AOG as the initial object AOG. Following BFS order, we start from the root Or-node; for each encountered Or-node we select the best child node (the one with the smallest training error rate among all children) together with any child nodes whose training error rates exceed that of the best child by no more than a predefined small positive value (i.e., preserving ambiguities); we keep the two child nodes of each encountered And-node, and stop at each encountered terminal-node. Two examples are shown in the left of Fig. 6. We train the parameters of the initial object AOG using LSVM [1] with two rounds of positive relabeling and hard negative mining.
iii) Controlling model complexity. A refined object AOG for tracking is obtained by further selecting the most discriminative part configuration(s) in the initial object AOG learned in step ii). The selection is based on the latent assignments when relabeling positives in LSVM training: a part configuration in the initial object AOG is pruned if it relabels fewer than 10% of the positives (see the right of Fig. 6). We further train the refined object AOG with one round of latent positive relabeling and hard negative mining. By reducing model complexity, we speed up the tracking-by-parsing procedure.
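The subgraph retrieval of step ii) and the pruning of step iii) can be sketched as follows. The dict-based node layout (`'id'`, `'type'`, `'children'`) and the function names are schematic assumptions made for illustration.

```python
def retrieve_initial_aog(root, err, tau):
    """BFS retrieval of the most discriminative subgraph (step ii).

    At each Or-node, keep the best child (lowest training error rate)
    plus any child within `tau` of it (preserving ambiguities); at each
    And-node, keep its children; stop at terminal nodes.
    err: node id -> training error rate (needed for Or-node children).
    """
    kept, frontier = [], [root]
    while frontier:
        node = frontier.pop(0)
        kept.append(node['id'])
        if node['type'] == 'or':
            best = min(err[c['id']] for c in node['children'])
            frontier += [c for c in node['children']
                         if err[c['id']] <= best + tau]
        elif node['type'] == 'and':
            frontier += node['children']
        # terminal nodes: nothing to expand
    return kept

def prune_configurations(configs, latent_counts, n_pos, frac=0.1):
    """Step iii: drop part configurations that relabel fewer than
    `frac` (10% in the paper) of the positives."""
    return [c for c in configs if latent_counts.get(c, 0) >= frac * n_pos]
```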
Verification of a refined object AOG. We run parsing with the refined object AOG in the first frame. The refined object AOG is accepted if the score of the optimal parse tree is greater than the threshold estimated in training, and the IoU overlap between the predicted bounding box and the input bounding box is greater than or equal to the IoU NMS threshold used in detection.
Identifying critical moments in tracking. A critical moment means a tracker has become "uncertain" and has at the same time accumulated "enough" new samples. It is triggered in tracking when two conditions are satisfied: first, the number of frames in which the tracked object is "intrackable" exceeds some threshold; second, the number of new valid tracking results exceeds another threshold. Both counts are accumulated from the last time the object AOG was re-learned.
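The two-condition trigger can be sketched as a small counter class. The class and threshold names (`t1`, `t2`) are placeholders for the paper's two thresholds, whose exact values are fixed parameters in the experiments.

```python
class CriticalMomentDetector:
    """Triggers structure re-learning when, since the last re-learning,
    (a) the number of 'intrackable' frames exceeds t1, and
    (b) the number of new valid tracking results exceeds t2."""

    def __init__(self, t1, t2):
        self.t1, self.t2 = t1, t2
        self.reset()

    def reset(self):
        self.n_intrackable = 0
        self.n_valid = 0

    def observe(self, valid, intrackable):
        """Update counters for one frame; return True at a critical moment."""
        if valid:
            self.n_valid += 1
        if intrackable:
            self.n_intrackable += 1
        if self.n_intrackable > self.t1 and self.n_valid > self.t2:
            self.reset()  # re-learn structure now, restart the counters
            return True
        return False
```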
The spatial resolution of placing parts. In learning object AOGs, we first place parts at the same spatial resolution as the object. If the learned object AOG is not accepted in verification, we then place parts at twice the spatial resolution w.r.t. the object and re-learn the object AOG. In our experiments, these two settings handled all testing sequences successfully.
Overall flow of online learning. In the first frame, or when a critical moment is identified in tracking, we learn both the structure and the parameters of an object AOG; otherwise, we update the parameters only, and only when the tracking result in a frame is valid based on tracking-by-parsing.
[Table II: The 29 trackers evaluated on TB100, categorized by representation (local vs. holistic template, color histogram, subspace, sparse, binary or Haar features, discriminative vs. generative, model update) and search scheme (particle filter, MCMC, local optimum, dense sampling): ASLA [55], BSBT [56], CPF [57], CSK [58], CT [59], CXT [60], DFT [61], FOT [62], FRAG [63], IVT [29], KMS [30], L1APG [64], LOT [65], LSHT [66], LSK [67], LSS [68], MIL [39], MTT [69], OAB [70], ORIA [71], PCOM [72], SCM [73], SMS [74], SBT [75], STRUCK [40], TLD [17], VR [76], VTD [77], VTS [78]. Our AOG uses HOG [+Color] features.]
Table III (top): Performance gains (success rate / precision rate, %) of our AOGTracker over the runner-up.

| Evaluation | TB100 | TB50 | CVPR2013 |
|---|---|---|---|
| OPE | 13.93 / 18.06 | 16.84 / 22.23 | 2.74 / 19.37 |
| SRE | 11.47 / 16.79 | 12.52 / 17.82 | 11.89 / 17.55 |
| TRE | 9.25 / 11.06 | 11.37 / 14.61 | 11.59 / 14.38 |

Runner-up: STRUCK [40] in all cases except OPE on CVPR2013, where SODLT [6] is the runner-up in success rate (STRUCK [40] in precision rate).

Table III (bottom): Gains in success rate (%) on the 11 attribute subsets of TB50 (number of sequences in parentheses).

| Subset | DEF (23) | FM (25) | MB (19) | IPR (29) | BC (20) | OPR (32) | OCC (29) | IV (22) | LR (8) | SV (38) | OV (11) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AOG gain | 15.89 | 15.56 | 17.29 | 12.29 | 17.81 | 14.04 | 14.70 | 15.73 | 6.65 | 18.38 | 15.99 |

Runner-up: STRUCK [40], TLD [17], SCM [73], or MIL [39], depending on the subset.
6 Experiments
In this section, we present comparison results on the TB50/100/CVPR2013 benchmarks [2, 3] and the VOT benchmarks [4]. We also analyze different aspects of our method. The source code (available at https://github.com/tfwu/RGMAOGTracker) is released with this paper for reproducing all results. We denote the proposed method by AOG in tables and plots.
Parameter Setting. We use the same parameters for all experiments, since we emphasize online learning in this paper. In learning object AOGs, the side length of the grid used for constructing the full structure AOG takes one of two values, depending on the side length of the input bounding box (to reduce the time complexity of online learning). The number of intervals in computing the feature pyramid and the HOG cell size are fixed, as are the factor used in computing the search ROI, the NMS IoU threshold, the number of top parse trees kept after spatial DP parsing, the time range of the temporal DP algorithm, the two thresholds used in identifying critical moments, and the LSVM trade-off parameter C in Eqn. (26). When re-learning structure and parameters, we could use all the frames with valid tracking results; to reduce the time complexity, the number of frames used in re-learning is capped in our experiments. At time t, we first take the earliest frames with valid tracking results, with the underlying intuition that they have high probabilities of being tracked correctly (note that we always use the first frame, since its ground-truth bounding box is given), and then take the remaining frames in reversed time order.
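The frame-selection rule for re-learning can be sketched as follows. This is a minimal Python sketch; `select_training_frames`, `max_frames` (the capped frame budget) and `m` (the number of leading valid frames) are hypothetical names for quantities whose actual values are fixed parameters in our experiments.

```python
def select_training_frames(valid_frames, max_frames, m):
    """Select frames for re-learning: always the first frame (its
    ground-truth box is given), then the next m valid results (likely
    tracked correctly), then the remaining valid frames in reversed
    time order, truncated to max_frames in total.

    valid_frames: frame indices with valid tracking results, in time order.
    """
    first = valid_frames[:1]            # always keep the first frame
    head = valid_frames[1:1 + m]        # earliest m valid results
    tail = list(reversed(valid_frames[1 + m:]))  # most recent first
    return (first + head + tail)[:max_frames]
```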
Speed. In our current C++ implementation, we adopt FFT in computing score pyramids as done in [54], and use multi-threading with OpenMP. We also provide a distributed version based on MPI (https://www.mpich.org/) for evaluation. The speed is about 2 to 3 FPS. We are experimenting with GPU implementations to speed up our TLP.
6.1 Results on TB50/100/CVPR2013
The TB100 benchmark has 100 target objects, with 29 publicly available trackers evaluated. It extends a previous benchmark with 51 target objects released at CVPR2013 (denoted by TB-CVPR2013). Further, since some target objects are similar or less challenging, a subset of 50 difficult and representative ones (denoted by TB50) is selected for an in-depth analysis. Two types of performance metric are used: the precision plot (i.e., the percentage of frames in which estimated locations are within a given threshold distance of ground-truth positions) and the success plot (i.e., based on IoU overlap scores, as commonly used in object detection benchmarks, e.g., PASCAL VOC [79]). The higher a success rate or a precision rate, the better a tracker. Success plots are usually preferred for ranking trackers [2, 4], so we focus on success plots in comparisons. Three types of evaluation are used, as illustrated in Fig. 7.
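One point on each curve corresponds to a single threshold; sweeping the threshold traces out the plot. A minimal Python sketch of the two per-threshold metrics (function names are ours):

```python
def success_rate(ious, thr):
    """Fraction of frames whose IoU overlap exceeds `thr`:
    one point on the success plot."""
    return sum(1 for v in ious if v > thr) / float(len(ious))

def precision_rate(dists, thr):
    """Fraction of frames whose center-location error is within `thr`
    pixels: one point on the precision plot."""
    return sum(1 for d in dists if d <= thr) / float(len(dists))
```

Ranking by the success plot typically uses the area under this curve over thresholds in [0, 1].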
To account for the different factors of a test sequence affecting performance, the testing sequences are further categorized w.r.t. 11 attributes for more in-depth comparisons: (1) Illumination Variation (IV, 38/22/21 sequences in TB100/50/CVPR2013), (2) Scale Variation (SV, 64/38/28 sequences), (3) Occlusion (OCC, 49/29/29 sequences), (4) Deformation (DEF, 44/23/19 sequences), (5) Motion Blur (MB, 29/19/12 sequences), (6) Fast Motion (FM, 39/25/17 sequences), (7) In-Plane Rotation (IPR, 51/29/31 sequences), (8) Out-of-Plane Rotation (OPR, 63/32/39 sequences), (9) Out-of-View (OV, 14/11/6 sequences), (10) Background Clutters (BC, 31/20/21 sequences), and (11) Low Resolution (LR, 9/8/4 sequences). More details on the attributes and their distributions in the benchmark are given in [2, 3].
Table II lists the 29 evaluated tracking algorithms, categorized by representation and search scheme; see [2] for more details on this categorization. In TB-CVPR2013, two recent trackers based on deep convolutional networks (CNT [5], SODLT [6]) were evaluated using OPE.
We summarize the performance gain of our AOGTracker in Table III. Our AOGTracker obtains significant improvement (more than 12%) in 10 of the 11 subsets of TB50. Our AOGTracker handles out-of-view situations much better than other trackers, since it is capable of re-detecting target objects in the whole image, and it performs very well in the scale-variation subset (see examples in the second and fourth rows of Fig. 10), since it searches over the feature pyramid explicitly (at the expense of more computation). Our AOGTracker obtains the least improvement in the low-resolution subset, since it uses HOG features, and the discrepancy between the HOG cell-based coordinates and the pixel-based ones can cause some loss in the overlap measurement, especially at low resolutions. In future work, we will add automatic selection of feature types (e.g., HOG vs. pixel-based features such as intensity and gradient) according to the resolution, as well as other factors.
Fig. 8 shows success plots of OPE, SRE and TRE on TB100/50/CVPR2013. Our AOGTracker consistently outperforms all other trackers. We note that for OPE in TB-CVPR2013, although the improvement of our AOGTracker over SODLT [6] is modest, SODLT utilizes two deep convolutional networks with different model-update strategies in tracking, both pre-trained on ImageNet [34]. Fig. 10 shows some qualitative results.
6.2 Analyses of AOG models and the TLP Algorithm
To analyze the contributions of different components in our AOGTracker, we compare the performance of six variants: three object representation schemes, namely the AOG with and without structure re-learning (denoted by AOG and AOG-Fixed, respectively) and the whole-object template only (i.e., without part configurations, denoted by Object-Only); and, for each representation scheme, two inference strategies, with and without the temporal DP (denoted by -st and -s, respectively). As stated above, we use a very simple setting for the temporal DP, which considers only a short time range, in our experiments.
Fig. 11 shows the performance comparison of the six variants. AOG-st consistently obtains the best overall performance. Trackers with AOGs perform better than those with the whole-object template only. AOG structure re-learning yields a consistent overall performance improvement, although we observed that AOG-Fixed-st works slightly better than AOG-st on two of the 11 subsets, Motion Blur and Out-of-View, on which the simple intrackability measurement is not good enough. For trackers with AOGs, the temporal DP helps improve performance, while for trackers with whole-object templates only, the variant without temporal DP (Object-Only-s) slightly outperforms the one with it (Object-Only-st), which suggests that sufficiently strong object models are needed when integrating spatial and temporal information for better performance.
6.3 Comparison with StateoftheArt Methods
We explain why our AOGTracker outperforms other trackers on the TB100 benchmark in terms of representation, online learning and inference.
Representation Scheme. Our AOGTracker utilizes three types of complementary features (HOG+LBP+Color) jointly to capture appearance variations, while most other trackers use simpler ones (e.g., TLD [17] uses intensity-based Haar-like features). More importantly, we address the issue of learning the optimal deformable part-based configurations in the quantized space of latent object structures, while most other trackers focus on either whole objects [58] or implicit configurations (e.g., the random fern forest used in TLD). These two components are integrated in a latent structured-output discriminative learning framework, which improves the overall tracking performance (e.g., see the comparisons in Fig. 11).
Online Learning. Our AOGTracker includes two components addressed by none of the other trackers evaluated on TB100: online structure re-learning based on intrackability, and a simple temporal DP for computing the optimal joint solution. Both improve the performance in our ablation experiments. The former enables our AOGTracker to capture both large structural variations and sudden appearance variations automatically, which is especially important for long-term tracking. The latter, in addition to improving the prediction performance, improves the capability of maintaining the purity of the online-collected training dataset.
Inference. Unlike many other trackers which do not handle scale changes explicitly (e.g., CSK [58] and STRUCK [40]), our AOGTracker runs tracking-by-parsing in a feature pyramid to detect scale changes (e.g., the car example in the second row of Fig. 10). Our AOGTracker also utilizes a dynamic search strategy which re-detects an object in the whole frame if the local ROI search fails. For example, our AOGTracker handles out-of-view situations much better than other trackers thanks to this re-detection component (see examples in the fourth row of Fig. 10).
Limitations. All the performance improvements stated above are obtained at the expense of more computation in learning and tracking. As discussed in Section 6.1, the low-resolution subset remains the weakest case for our tracker due to the HOG cell-based coordinates, and automatic feature-type selection is left for future work.
6.4 Results on VOT
In VOT, the evaluation focuses on short-term tracking (i.e., a tracker is not expected to perform re-detection after losing a target object), so the evaluation toolkit re-initializes a tracker after it loses the target (i.e., when the overlap between the predicted bounding box and the ground-truth one drops to zero), with the number of failures counted. In the VOT protocol, a tracker is tested on each sequence multiple times. The performance is measured in terms of accuracy and robustness. Accuracy is computed as the average of per-frame accuracies, which themselves are computed by averaging over the repetitions. Robustness is computed as the average number of failures over the repetitions.
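The two VOT measures, as described above, can be sketched as follows (a minimal Python illustration of the averaging order; function names are ours, and per-frame accuracy is taken to be the overlap score):

```python
def vot_accuracy(per_run_overlaps):
    """Accuracy: per-frame accuracy is the average overlap over the
    repetitions; the final accuracy averages these over frames.

    per_run_overlaps: one list of per-frame overlaps per repetition,
    all of the same length."""
    n_runs = len(per_run_overlaps)
    n_frames = len(per_run_overlaps[0])
    per_frame = [sum(run[i] for run in per_run_overlaps) / n_runs
                 for i in range(n_frames)]
    return sum(per_frame) / n_frames

def vot_robustness(failures_per_run):
    """Robustness: average number of failures over the repetitions."""
    return sum(failures_per_run) / float(len(failures_per_run))
```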
We integrate our AOGTracker into the latest VOT toolkit (version 3.2, available at https://github.com/votchallenge/vot-toolkit) to run experiments with the baseline protocol and to generate plots. (The plots for VOT2013 and 2014 might differ from those in the original VOT reports [80, 81] due to the new version of the toolkit.)
The VOT2013 dataset [80] has 16 sequences, selected from a large pool such that various visual phenomena, like occlusion and illumination changes, are well represented within the selection. Seven of the sequences are also used in TB100. 27 trackers were evaluated; the reader is referred to the VOT technical report [80] for details.
Fig. 12 shows the ranking plot and AR plot for VOT2013. Our AOGTracker obtains the best accuracy, while its robustness is slightly worse than that of three other trackers (PLT [80], LGT [82] and LGT++ [83]; PLT was the winner of the VOT2013 challenge). Our AOGTracker obtains the best overall rank.
The VOT2014 dataset [81] has 25 sequences, extended from VOT2013. The annotation is based on rotated bounding boxes instead of upright rectangles. 33 trackers were evaluated; details on the trackers are given in [81]. Fig. 13 shows the ranking plot and AR plot. Our AOGTracker is comparable to the other trackers. One main limitation of the AOGTracker is that it does not handle rotated bounding boxes well.
The VOT2015 dataset [84] consists of 60 short sequences (with rotated bounding-box annotations), and VOT-TIR2015 comprises 20 sequences (with bounding-box annotations). 62 and 28 trackers were evaluated in VOT2015 and VOT-TIR2015, respectively. Our AOGTracker obtains 65% accuracy (tied for third place) in VOT-TIR2015; due to space limits, we refer to the reports [84] for the detailed VOT2015 results.
7 Discussion and Future Work
We have presented a tracking, learning and parsing (TLP) framework, and derived a spatial dynamic programming (DP) algorithm and a temporal DP algorithm for online object tracking with AOGs. We have also presented a method for online learning of object AOGs, including both their structure and parameters. In experiments, we tested our method on two main public benchmark datasets, and the results show better or comparable performance with respect to the state of the art.
In our ongoing work, we are studying more flexible computing schemes for tracking with AOGs. The compositional property embedded in an AOG naturally leads to different bottom-up/top-down computing schemes, such as the three computing processes studied by Wu and Zhu [85]: we can track an object by matching the object template directly (the α-process), by computing some discriminative parts first and then combining them into the object (the β-process), or by doing both (as done in this paper). In tracking, as time evolves, the object AOG might grow through online learning, especially for objects with large variations in long-term tracking, so faster inference is required for the sake of real-time applications. We are trying to learn near-optimal decision policies for tracking using the framework proposed by Wu and Zhu [86].
In our future work, we will extend the TLP framework by incorporating generic category-level AOGs [8] to scale it up. The generic AOGs are pre-trained offline (e.g., on PASCAL VOC [79] or ImageNet [34]) and will help the online learning of the specific AOGs for a target object (e.g., by helping to maintain the purity of the positive and negative datasets collected online). The generic AOGs will also be updated online together with the specific AOGs. By integrating generic and specific AOGs, we aim at lifelong learning of objects in videos without annotations. Furthermore, we are also interested in integrating scene grammar [87] and event grammar [88] to leverage more top-down information.
Acknowledgments
This work is supported by the DARPA SIMPLEX Award N6600115C4035, the ONR MURI grant N000141612007, and NSF IIS1423305. T. Wu was also supported by the ECE startup fund 20147302119 at NCSU. We thank Steven Holtzen for proofreading this paper. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of one GPU.
References
 [1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained partbased models,” PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
 [2] Y. Wu, J. Lim, and M.H. Yang, “Object tracking benchmark,” PAMI, vol. 37, no. 9, pp. 1834–1848, 2015.
 [3] ——, “Online object tracking: A benchmark,” in CVPR, 2013.
 [4] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. P. Pflugfelder, G. Fernández, G. Nebehay, F. Porikli, and L. Cehovin, “A novel performance evaluation methodology for singletarget trackers,” CoRR, vol. abs/1503.01313, 2015. [Online]. Available: http://arxiv.org/abs/1503.01313
 [5] K. Zhang, Q. Liu, Y. Wu, and M.H. Yang, “Robust visual tracking via convolutional networks,” arXiv preprint arXiv:1501.04505v2, 2015.
 [6] N. Wang, S. Li, A. Gupta, and D.Y. Yeung, “Transferring rich feature hierarchies for robust visual tracking,” arXiv preprint arXiv:1501.04587v2, 2015.
 [7] S. Carey, The Origin of Concepts. Oxford University Press, 2011.
 [8] X. Song, T. Wu, Y. Jia, and S.C. Zhu, “Discriminatively trained andor tree models for object detection,” in CVPR, 2013.
 [9] R. Girshick, P. Felzenszwalb, and D. McAllester, “Object detection with grammar models,” in NIPS, 2011.
 [10] P. Felzenszwalb and D. McAllester, “Object detection grammars,” University of Chicago, Computer Science TR201002, Tech. Rep., 2010.
 [11] S. C. Zhu and D. Mumford, “A stochastic grammar of images,” Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 4, pp. 259–362, 2006.
 [12] Y. Amit and A. Trouvé, “POP: patchwork of parts models for object recognition,” IJCV, vol. 75, no. 2, pp. 267–282, 2007.
 [13] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, 2006.
 [14] L. R. Rabiner, “Readings in speech recognition,” A. Waibel and K.F. Lee, Eds., 1990, ch. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, pp. 267–296.
 [15] M. Isard and A. Blake, “Condensation  conditional density propagation for visual tracking,” IJCV, vol. 29, no. 1, pp. 5–28, 1998.
 [16] M. Andriluka, S. Roth, and B. Schiele, “Peopletrackingbydetection and peopledetectionbytracking,” in CVPR, 2008.
 [17] Z. Kalal, K. Mikolajczyk, and J. Matas, “Trackinglearningdetection,” PAMI, vol. 34, no. 7, pp. 1409–1422, 2012.
 [18] J. S. Supancic III and D. Ramanan, “Selfpaced learning for longterm tracking,” in CVPR, 2013.
 [19] D. Park and D. Ramanan, “Nbest maximal decoder for part models,” in ICCV, 2011.
 [20] D. Batra, P. Yadollahpour, A. GuzmánRivera, and G. Shakhnarovich, “Diverse mbest solutions in markov random fields,” in ECCV, 2012.
 [21] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multiobject tracking using network flows,” in CVPR, 2008.
 [22] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globallyoptimal greedy algorithms for tracking a variable number of objects,” in CVPR, 2011.
 [23] J. Berclaz, F. Fleuret, E. Türetken, and P. Fua, “Multiple object tracking using kshortest paths optimization,” PAMI, vol. 33, no. 9, pp. 1806–1819, 2011.
 [24] A. V. Goldberg, “An efficient implementation of a scaling minimumcost flow algorithm,” J. Algorithms, vol. 22, no. 1, pp. 1–29, 1997.
 [25] S. Hong and B. Han, “Visual tracking by sampling treestructured graphical models,” in ECCV, 2014.
 [26] S. Hong, S. Kwak, and B. Han, “Orderless tracking through modelaveraged posterior estimation,” in ICCV, 2013.
 [27] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, “Partbased visual tracking with online latent structural learning,” in CVPR, 2013.
 [28] H. Nam, S. Hong, and B. Han, “Online graphbased tracking,” in ECCV, 2014.
 [29] D. A. Ross, J. Lim, R.S. Lin, and M.H. Yang, “Incremental learning for robust visual tracking,” IJCV, vol. 77, no. 13, pp. 125–141, 2008.
 [30] D. Comaniciu, V. Ramesh, and P. Meer, “Kernelbased object tracking,” PAMI, vol. 25, no. 5, pp. 564–575, 2003.
 [31] X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” PAMI, vol. 33, no. 11, pp. 2259–2272, 2011.
 [32] X. Li, A. R. Dick, C. Shen, A. van den Hengel, and H. Wang, “Incremental learning of 3ddct compact representations for robust visual tracking,” PAMI, vol. 35, no. 4, pp. 863–881, 2013.
 [33] H. Nam and B. Han, “Learning multidomain convolutional neural networks for visual tracking,” in CVPR, 2016.
 [34] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, “ImageNet: A LargeScale Hierarchical Image Database,” in CVPR, 2009.
 [35] J. Kwon and K. M. Lee, “Highly nonrigid object tracking via patchbased dynamic appearance modeling,” PAMI, vol. 35, no. 10, pp. 2427–2441, 2013.
 [36] L. Cehovin, M. Kristan, and A. Leonardis, “Robust visual tracking using an adaptive coupledlayer visual model,” PAMI, vol. 35, no. 4, pp. 941–953, 2013.
 [37] X. Jia, H. Lu, and M.H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in CVPR, 2012.
 [38] S. Avidan, “Support vector tracking,” PAMI, vol. 26, no. 8, pp. 1064–1072, 2004.
 [39] B. Babenko, M.H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” PAMI, vol. 33, no. 8, pp. 1619–1632, 2011.
 [40] S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in ICCV, 2011.
 [41] J. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of trackingbydetection with kernels,” in ECCV, 2012.
 [42] V. Mahadevan and N. Vasconcelos, “Biologically inspired object tracking using centersurround saliency mechanisms,” PAMI, vol. 35, no. 3, pp. 541–554, 2013.
 [43] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, “Partbased visual tracking with online latent structural learning,” in CVPR, 2013.
 [44] L. Zhang and L. van der Maaten, “Structure preserving object tracking,” in CVPR, 2013.
 [45] J. Shi and C. Tomasi, “Good feature to track,” in CVPR, 1994.
 [46] Y. Lu, T. Wu, and S.C. Zhu, “Online object tracking, learning and parsing with andor graphs,” in CVPR, 2014.
 [47] X. Li, W. Hu, C. Shen, Z. Zhang, A. R. Dick, and A. van den Hengel, “A survey of appearance models in visual object tracking,” CoRR, vol. abs/1303.4803, 2013.
 [48] S. Baker and I. Matthews, “Lucaskanade 20 years on: A unifying framework,” IJCV, vol. 56, no. 3, pp. 221–255, 2004.
 [49] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
 [50] T. Ojala, M. Pietikainen, and D. Harwood, “Performance evaluation of texture measures with classification based on kullback discrimination of distributions,” in ICPR, 1994.
 [51] J. Kwon and K. M. Lee, “Highly nonrigid object tracking via patchbased dynamic appearance modeling,” TPAMI, vol. 35, no. 10, pp. 2427–2441, 2013.
 [52] H. Gong and S. C. Zhu, “Intrackability: Characterizing video statistics and pursuing video representations,” IJCV, vol. 97, no. 3, pp. 255–275, 2012.
 [53] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM J. Sci. Comput., vol. 16, no. 5, pp. 1190–1208, 1995.
 [54] C. Dubout and F. Fleuret, “Exact acceleration of linear object detectors,” in ECCV, 2012.
 [55] X. Jia, H. Lu, and M.H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in CVPR, 2012.
 [56] S. Stalder, H. Grabner, and L. van Gool, “Beyond semisupervised tracking: Tracking should be as simple as detection, but not simpler than recognition,” in ICCV Workshop, 2009.
 [57] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Colorbased probabilistic tracking,” in ECCV, 2002.
 [58] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of trackingbydetection with kernels,” in ECCV, 2012.
 [59] K. Zhang, L. Zhang, and M. Yang, “Fast compressive tracking,” PAMI, vol. 36, no. 10, pp. 2002–2015, 2014.
 [60] T. B. Dinh, N. Vo, and G. G. Medioni, “Context tracker: Exploring supporters and distracters in unconstrained environments,” in CVPR, 2011.
 [61] L. SevillaLara and E. LearnedMiller, “Distribution fields for tracking,” in CVPR, 2012.
 [62] T. Vojir and J. Matas, “Robustifying the flock of trackers,” in Computer Vision Winter Workshop, 2011.
 [63] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragmentsbased tracking using the integral histogram,” in CVPR, 2006.
 [64] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust L1 tracker using accelerated proximal gradient approach,” in CVPR, 2012.
 [65] S. Oron, A. BarHillel, D. Levi, and S. Avidan, “Locally orderless tracking,” in CVPR, 2012.
 [66] S. He, Q. Yang, R. W. Lau, J. Wang, and M.H. Yang, “Visual tracking via locality sensitive histograms,” in CVPR, 2013.
 [67] B. Liu, J. Huang, L. Yang, and C. A. Kulikowski, “Robust tracking using local sparse appearance model and kselection,” in CVPR, 2011.
 [68] D. Wang, H. Lu, and M.H. Yang, “Least softthresold squares tracking,” in CVPR, 2013.
 [69] T.Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multitask sparse learning,” in CVPR, 2012.
 [70] H. Grabner, M. Grabner, and H. Bischof, “Realtime tracking via online boosting,” in BMVC, 2006.
 [71] Y. Wu, B. Shen, and H. Ling, “Online robust image alignment via iterative convex optimization,” in CVPR, 2012.
 [72] D. Wang and H. Lu, “Visual tracking via probability continuous outlier model,” in CVPR, 2014.
 [73] W. Zhong, H. Lu, and M. Yang, “Robust object tracking via sparsitybased collaborative model,” in CVPR, 2012.
 [74] R. T. Collins, “Meanshift blob tracking through scale space,” in CVPR, 2003.
 [75] H. Grabner, C. Leistner, and H. Bischof, “Semisupervised online boosting for robust tracking,” in ECCV, 2008.
 [76] R. T. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” PAMI, vol. 27, no. 10, pp. 1631–1643, 2005.
 [77] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in CVPR, 2010.
 [78] ——, “Tracking by sampling trackers,” in ICCV, 2011.
 [79] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.”
 [80] M. Kristan et al., “The visual object tracking VOT2013 challenge results,” 2013. [Online]. Available: http://www.votchallenge.net/vot2013/program.html
 [81] ——, “The visual object tracking VOT2014 challenge results,” 2014. [Online]. Available: http://www.votchallenge.net/vot2014/program.html
 [82] L. Cehovin, M. Kristan, and A. Leonardis, “Robust visual tracking using an adaptive coupled-layer visual model,” PAMI, vol. 35, no. 4, pp. 941–953, 2013.
 [83] J. Xiao, R. Stolkin, and A. Leonardis, “An enhanced adaptive coupled-layer LGTracker++,” in Visual Object Tracking Challenge VOT2013, in conjunction with ICCV 2013, 2013.
 [84] M. Kristan et al., “The visual object tracking VOT2015 and TIR2015 challenge results,” 2015. [Online]. Available: http://www.votchallenge.net/vot2015/program.html
 [85] T. Wu and S. C. Zhu, “A numerical study of the bottom-up and top-down inference processes in and-or graphs,” IJCV, vol. 93, no. 2, pp. 226–252, 2011.
 [86] T. Wu and S. Zhu, “Learning near-optimal cost-sensitive decision policy for object detection,” PAMI, vol. 37, no. 5, pp. 1013–1027, 2015.
 [87] Y. Zhao and S. C. Zhu, “Image parsing with stochastic scene grammar,” in NIPS, 2011.
 [88] M. Pei, Z. Si, B. Z. Yao, and S. Zhu, “Learning and parsing video events with goal and intent prediction,” CVIU, vol. 117, no. 10, pp. 1369–1383, 2013.
Tianfu Wu received the Ph.D. degree in Statistics from the University of California, Los Angeles (UCLA) in 2011. He joined NC State University in August 2016 as part of the Chancellor's Faculty Excellence Program cluster hire in Visual Narrative. He is currently an assistant professor in the Department of Electrical and Computer Engineering. His research focuses on the explainable and improvable visual Turing test and on robot autonomy through lifelong communicative learning, pursuing a unified framework for machines to ALTER (Ask, Learn, Test, Explain, and Refine) recursively in a principled way: (i) statistical learning of large-scale and highly expressive hierarchical and compositional models from visual big data (images and videos); (ii) statistical inference by learning near-optimal cost-sensitive decision policies; and (iii) statistical theory of performance-guaranteed learning algorithms and optimally scheduled inference procedures.
Yang Lu is currently a Ph.D. student in the Center for Vision, Cognition, Learning and Autonomy at the University of California, Los Angeles. He received the B.S. and M.S. degrees in Computer Science from the Beijing Institute of Technology, China, in 2009 and 2012, respectively. He received the University Fellowship from UCLA and National Fellowships from the Department of Education of China. His current research interests include computer vision and statistical machine learning, in particular statistical modeling of natural images and videos, and structure learning of hierarchical models.
Song-Chun Zhu received the Ph.D. degree from Harvard University in 1996. He is currently a professor of Statistics and Computer Science at UCLA, and director of the Center for Vision, Cognition, Learning and Autonomy. He has received a number of honors, including the J. K. Aggarwal Prize from the Int'l Association of Pattern Recognition in 2008 for “contributions to a unified foundation for visual pattern conceptualization, modeling, learning, and inference,” the David Marr Prize in 2003 with Z. Tu et al. for image parsing, and two Marr Prize honorary nominations, in 1999 for texture modeling and in 2007 for object modeling with Z. Si and Y. N. Wu. He received the Sloan Fellowship in 2001, a US NSF CAREER Award in 2001, and a US ONR Young Investigator Award in 2001. He received the Helmholtz Test-of-Time Award at ICCV 2013, and he has been a Fellow of the IEEE since 2011.