Online Object Tracking, Learning and Parsing with And-Or Graphs

Tianfu Wu, Yang Lu and Song-Chun Zhu

T.F. Wu is with the Department of Electrical and Computer Engineering and the Visual Narrative Cluster, North Carolina State University. This work was mainly done when T.F. Wu was a research assistant professor at UCLA.
Y. Lu is with the Department of Statistics, University of California, Los Angeles.
S.-C. Zhu is with the Department of Statistics and Computer Science, University of California, Los Angeles.

This paper presents a method, called AOGTracker, for simultaneously tracking, learning and parsing (TLP) of unknown objects in video sequences with a hierarchical and compositional And-Or graph (AOG) representation. The TLP method is formulated in the Bayesian framework with spatial and temporal dynamic programming (DP) algorithms inferring object bounding boxes on-the-fly. During online learning, the AOG is discriminatively learned using latent SVM [1] to account for appearance (e.g., lighting and partial occlusion) and structural (e.g., different poses and viewpoints) variations of a tracked object, as well as distractors (e.g., similar objects) in the background. Three key issues in online inference and learning are addressed: (i) maintaining purity of positive and negative examples collected online, (ii) controlling model complexity in latent structure learning, and (iii) identifying critical moments to re-learn the structure of AOG based on its intrackability. The intrackability measures uncertainty of an AOG based on its score maps in a frame. In experiments, our AOGTracker is tested on two popular tracking benchmarks with the same parameter setting: the TB-100/50/CVPR2013 benchmarks [2, 3], and the VOT benchmarks [4] (VOT 2013, 2014, 2015 and TIR2015, the last of which covers thermal imagery tracking). In the former, our AOGTracker outperforms state-of-the-art tracking algorithms including two trackers based on deep convolutional networks [5, 6]. In the latter, our AOGTracker outperforms all other trackers in VOT2013 and is comparable to the state-of-the-art methods in VOT2014, 2015 and TIR2015.

Visual Tracking, And-Or Graphs, Latent SVM, Dynamic Programming, Intrackability

1 Introduction

1.1 Motivation and Objective

Online object tracking is an innate capability in human and animal vision for learning visual concepts [7], and is an important task in computer vision. Given the state of an unknown object (e.g., its bounding box) in the first frame of a video, the task is to infer hidden states of the object in subsequent frames. Online object tracking, especially long-term tracking, is a difficult problem. It needs to handle variations of a tracked object, including appearance and structural variations, scale changes, occlusions (partial or complete), etc. It also needs to tackle complexity of the scene, including camera motion, background clutter, distractors, illumination changes, frame cropping, etc. Fig. 1 illustrates some typical issues in online object tracking. In recent literature, object tracking has received much attention due to practical applications in video surveillance, activity and event prediction, human-computer interactions and traffic monitoring.

Fig. 1: Illustration of some typical issues in online object tracking using the “skating1” video in the benchmark [2]. Starting from the object specified in the first frame, a tracker needs to handle many variations in subsequent frames, including illumination variation, scale variation, occlusion, deformation, fast motion, in-plane and out-of-plane rotation, background clutter, etc.


This paper presents an integrated framework for online tracking, learning and parsing (TLP) of unknown objects with a unified representation. We focus on settings in which the object state is represented by a bounding box, without using pre-trained models. We address the following five issues associated with online object tracking.

Issue I: Expressive representation accounting for structural and appearance variations of unknown objects in tracking. We are interested in hierarchical and compositional object models. Such models have shown promising performance in object detection [1, 8, 9, 10, 11] and object recognition [12]. A popular modeling scheme represents object categories by mixtures of deformable part-based models (DPMs) [1]. The number of mixture components is usually predefined and the part configuration of each component is fixed after initialization or directly based on strong supervision. In online tracking, since a tracker can only access the ground-truth object state in the first frame, it is not in a position to “make decisions” on the number of mixture components and part configurations, and it does not have enough data to learn them. It is desirable to have an object representation which has the expressive power to represent a large number of part configurations, and which facilitates computationally efficient inference and learning. We quantize the space of part configurations recursively in a principled way with a hierarchical and compositional And-Or graph (AOG) representation [8, 11]. We learn and update the most discriminative part configurations online by pruning the quantized space based on part discriminability.

Issue II: Computing joint optimal solutions. Online object tracking is usually posed as a maximum a posteriori (MAP) problem using first-order hidden Markov models (HMMs) [13, 2, 14]. The likelihood or observation density is temporally inhomogeneous due to online updating of object models. Typically, the objective is to infer the most likely hidden state of a tracked object in a frame by maximizing a Bayesian marginal posterior probability given all the data observed so far. The maximization is based on either particle filtering [15] or dense sampling such as the tracking-by-detection methods [16, 17, 18]. In most prior approaches (e.g., the 29 trackers evaluated in the TB-100 benchmark [2]), no feedback inspection is applied to the history of the inferred trajectory. We utilize tracking-by-parsing with hierarchical models in inference. By computing joint optimal solutions, we can not only improve prediction accuracy in a new frame by integrating the past estimated trajectory, but also potentially correct errors in the past estimated trajectory. Furthermore, we simultaneously address another key issue in online learning (Issue III).

Issue III: Maintaining the purity of a training dataset. The dataset consists of a set of positive examples computed based on the current trajectory, and a set of negative examples mined from outside the current trajectory. In the dataset, we can only guarantee that the positives and the negatives in the first frame are true positives and true negatives respectively. A tracker needs to carefully choose frames from which it can learn to avoid model drifting (i.e., self-paced learning). Most prior approaches do not address this issue since they focus on marginally optimal solutions with which object models are updated, except for the P-N learning in TLD [17] and the self-paced learning for tracking [18]. Since we compute joint optimal solutions in online tracking, we can maintain the purity of an online collected training dataset in a better way.

Issue IV: Failure-aware online learning of object models. In online learning, we mostly update model parameters incrementally after inference in a frame. Theoretically speaking, after an initial object model is learned in the first frame, model drifting is inevitable in general settings. Thus, in addition to maintaining the purity of a training dataset, it is also important that we can identify critical moments (caused by different structural and appearance variations) automatically. At those moments, a tracker needs to re-learn both the structure and the parameters of the object model using the whole current training dataset. We address this issue by computing the uncertainty of an object model in a frame based on its response maps.

Issue V: Computational efficiency by dynamic search strategy. Most tracking-by-detection methods run detection in the whole frame since they usually use relatively simple models such as a single object template. With hierarchical models in tracking and sophisticated online inference and updating strategies, the computational complexity is high. To speed up tracking, we need to utilize a dynamic search strategy. This strategy must take into account the trade-off between generating a conservative proposal state space for efficiency and allowing an exhaustive search for accuracy (e.g., to handle the situation where the object is completely occluded for a while or moves out of the camera view and then reappears). We address this issue by adopting a simple search cascade with which we run detection in the whole frame only when local search has failed.

Our TLP method obtains state-of-the-art performance on one popular tracking benchmark [2]. We give a brief overview of our method in the next subsection.

1.2 Method Overview

As illustrated in the overview figure (a), the TLP method consists of four components. We introduce them briefly as follows.

(1) An AOG quantizing the space of part configurations. Given the bounding box of an object in the first frame, we assume object parts are also of rectangular shapes. We first divide the bounding box evenly into a small cell-based grid, in which a cell defines the smallest part. We then enumerate all possible parts with different aspect ratios and different sizes which can be placed inside the grid. All the enumerated parts are organized into a hierarchical and compositional AOG. Each part is represented by a terminal-node. Two types of nonterminal nodes serve as compositional rules: an And-node represents the decomposition of a large part into two smaller ones, and an Or-node represents alternative ways of decomposition through different horizontal or vertical binary splits. We call it the full structure AOG (by “full structure”, we mean all the possible compositions on top of the grid, with binary composition being used for And-nodes). It is capable of exploring a large number of latent part configurations (see some examples in the overview figure (b)), while keeping the problem of online model learning feasible.

(2) Learning object AOGs. An object AOG is a subgraph learned from the full structure AOG (see the overview figure (c); we note that some Or-nodes in the object AOGs have only one child node, since the object AOGs are subgraphs of the full structure AOG and we keep their original structures). Learning an object AOG consists of two steps: (i) The initial object AOG is learned by pruning branches of Or-nodes in the full structure AOG based on discriminative power, following breadth-first search (BFS) order. The discriminative power of a node is measured based on its training error rate. To preserve ambiguities, for each encountered Or-node we keep all branches whose training error rates exceed the minimum one by no more than a small positive value. (ii) We retrain the initial object AOG using latent SVM (LSVM), as is done in learning the DPMs [1]. LSVM utilizes positive re-labeling (i.e., inferring the best configuration for each positive example) and hard negative mining. To further control the model complexity, we prune the initial object AOG through majority voting of latent assignments in positive re-labeling.

(3) A spatial dynamic programming (DP) algorithm for computing all the proposals in a frame with the current object AOG. Thanks to the DAG structure of the object AOG, a DP parsing algorithm is utilized to compute the matching scores and the optimal parse trees of all sliding windows inside the search region in a frame. A parse tree is an instantiation of the object AOG which selects the best child for each encountered Or-node according to matching score. A configuration is obtained by collapsing a parse tree onto the image domain, capturing layout of latent parts of a tracked object in a frame.

(4) A temporal DP algorithm for inferring the most likely trajectory. We maintain a DP table memorizing the candidate object states computed by the spatial DP in the past frames. Then, based on the first-order HMM assumption, a temporal DP algorithm is used to find the optimal solution for the past frames jointly with pair-wise motion constraints (i.e., the Viterbi path [14]). The joint solution can help correct potential tracking errors (i.e., false negatives and false positives collected online) by leveraging more spatial and temporal information. This is similar in spirit to methods of keeping N-best maximal decoder for part models [19] and maintaining diverse M-best solutions in MRF [20].

2 Related Work

In the literature of object tracking, either single object tracking or multiple-object tracking, there are often two settings.

Offline visual tracking [21, 22, 23, 24]. These methods assume the whole video sequence has been recorded, and consist of two steps: (i) they first compute object proposals in all frames using some pre-trained detectors (e.g., the DPMs [1]) and then form “tracklets” in consecutive frames; (ii) they seek the optimal object trajectory (or trajectories for multiple objects) by solving an optimization problem (e.g., the K-shortest path or min-cost flow formulation) for the data association. Most work assumes first-order HMMs in the formulation. Recently, Hong and Han [25] proposed an offline single object tracking method by sampling tree-structured graphical models which exploit the underlying intrinsic structure of input video in an orderless tracking [26].

Online visual tracking for streaming videos. It starts tracking after the state of an object is specified in a certain frame. In the literature, particle filtering [15] has been widely adopted, which approximately represents the posterior probability in a non-parametric form by maintaining a set of particles (i.e., weighted candidates). In practice, particle filtering does not perform well in high-dimensional state spaces. More recently, tracking-by-detection methods [16, 17] have become popular; they learn and update object models online and encode the posterior probability using dense sampling through sliding-window based detection on-the-fly. Thus, object tracking is treated as instance-based object detection. To leverage recent advances in object detection, object tracking research has made progress by incorporating discriminatively trained part-based models [1, 8, 27] (or more generally grammar models [11, 10, 9]). Most popular methods also assume first-order HMMs except for the recently proposed online graph-based tracker [28]. There are four streams in the literature of online visual tracking:

  • Appearance modeling of the whole object, such as incremental learning [29], kernel-based [30], particle filtering [15], sparse coding [31] and 3D-DCT representation [32]. More recently, convolutional neural networks have been utilized to improve tracking performance [6, 5, 33]; they are usually pre-trained on some large scale image datasets such as ImageNet [34] or on video sequences in a benchmark with the testing one excluded.

  • Appearance modeling of objects with parts, such as patch-based [35], coupled 2-layer models [36] and adaptive sparse appearance [37]. The major limitation of appearance modeling of a tracked object is the lack of background models, especially in preventing model drift caused by distractors (e.g., players in sport games). Addressing this issue leads to discriminant tracking.

  • Tracking by discrimination using a single classifier, such as support vector tracking [38], multiple instance learning [39], STRUCK [40], circulant structure-based kernel method [41], and discriminant saliency based tracking [42];

  • Tracking by part-based discriminative models, such as online extensions of DPMs [43], and structure preserving tracking method [44, 27].

Our method belongs to the fourth stream of online visual tracking. Unlike the predefined or fixed part configurations with star-model structure used in previous work, our method learns both the structure and the appearance of object AOGs online, and is, to our knowledge, the first method to address the problem of online explicit structure learning in tracking. The advantages of introducing the AOG representation are three-fold.

  • More representational power: Unlike TLD [17] and many other methods (e.g., [18]) which model an object as a single template or a mixture of templates and thus do not perform well in tracking objects with large structural and appearance variations, an AOG represents an object in a hierarchical and compositional graph expressing a large number of latent part configurations.

  • More robust tracking and online learning strategies: While the whole object has large variations or might be partially occluded from time to time during tracking, some other parts remain stable and are less likely to be occluded. Some of the parts can be learned to robustly track the object, which can also improve accuracy of appearance adaptation of terminal-nodes. This idea is similar in spirit to finding good features to track objects [45], and we find good part configurations online for both tracking and learning.

  • Fine-grained tracking results: In addition to predicting bounding boxes of a tracked object, outputs of our AOGTracker (i.e., the parse trees) carry more information, which is potentially useful for other modules beyond tracking such as activity or event prediction.

Our preliminary work has been published in [46] and the method for constructing full structure AOG was published in [8]. This paper extends them by: (i) adding more experimental results with state-of-the-art performance obtained and full source code released; (ii) elaborating details substantially in deriving the formulation of inference and learning algorithms; and (iii) adding more analyses on different aspects of our method. This paper makes three contributions to the online object tracking problem:

  • It presents a tracking-learning-parsing (TLP) framework which can learn and track object AOGs.

  • It presents spatial and temporal DP algorithms for tracking-by-parsing with AOGs and outputs fine-grained tracking results using parse trees.

  • It outperforms the state-of-the-art tracking methods in a recent public benchmark, TB-100 [2], and obtains comparable performance on a series of VOT benchmarks [4].

Paper Organization. The remainder of this paper is organized as follows. Section 3 presents the formulation of our TLP framework under the Bayesian framework. Section 4 gives the details of spatial-temporal DP algorithm. Section 5 presents the online learning algorithm using the latent SVM method. Section 6 shows the experimental results and analyses. Section 7 concludes this paper and discusses issues and future work.

3 Problem Formulation

3.1 Formulation of Online Object Tracking

In this section, we first derive a generic formulation from a generative perspective in the Bayesian framework, and then derive the discriminative counterpart.

3.1.1 Tracking with HMM

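Let $\Lambda$ denote the image lattice on which video frames are defined. Denote a sequence of video frames within time range $[1, T]$ by,

$$I_{1:T} = \{I_1, \cdots, I_T\}. \tag{1}$$

Denote by $B_t$ the bounding box of a target object in $I_t$. In online object tracking, $B_1$ is given and the $B_t$'s are inferred by a tracker ($t \in [2, T]$). With first-order HMM, we have,

$$\text{the prior model:}\quad B_1 \sim p(B_1), \tag{2}$$
$$\text{the motion model:}\quad B_t \mid B_{t-1} \sim p(B_t \mid B_{t-1}), \tag{3}$$
$$\text{the likelihood:}\quad I_t \mid B_t \sim p(I_t \mid B_t). \tag{4}$$

Then, the prediction model is defined by,

$$p(B_t \mid I_{1:t-1}) = \int_{\Omega_{B_{t-1}}} p(B_t \mid B_{t-1})\, p(B_{t-1} \mid I_{1:t-1})\, dB_{t-1}, \tag{5}$$

where $\Omega_{B_{t-1}}$ is the candidate space of $B_{t-1}$, and the updating model is defined by,

$$p(B_t \mid I_{1:t}) = \frac{p(I_t \mid B_t)\, p(B_t \mid I_{1:t-1})}{p(I_t \mid I_{1:t-1})}, \tag{6}$$

which is a marginal posterior probability. The tracking result, the best bounding box $B_t^*$, is computed by,

$$B_t^* = \arg\max_{B_t \in \Omega_{B_t}} p(B_t \mid I_{1:t}), \tag{7}$$

which is usually solved using particle filtering [15] in practice.

To allow feedback inspection of the history of a trajectory, we seek to maximize a joint posterior probability,

$$p(B_{1:t} \mid I_{1:t}) = p(B_{1:t-1} \mid I_{1:t-1})\, \frac{p(B_t \mid B_{t-1})\, p(I_t \mid B_t)}{p(I_t \mid I_{1:t-1})} = p(B_1 \mid I_1) \prod_{i=2}^{t} \frac{p(B_i \mid B_{i-1})\, p(I_i \mid B_i)}{p(I_i \mid I_{1:i-1})}. \tag{8}$$

By taking the logarithm of both sides of Eqn.(8), we have,

$$B_{1:t}^* = \arg\max_{B_{1:t}} \log p(B_{1:t} \mid I_{1:t}) = \arg\max_{B_{1:t}} \Big\{ \log p(B_1) + \log p(I_1 \mid B_1) + \sum_{i=2}^{t} \big[ \log p(B_i \mid B_{i-1}) + \log p(I_i \mid B_i) \big] \Big\}, \tag{9}$$

where the image data terms $\log p(I_1)$ and $\sum_{i=2}^{t} \log p(I_i \mid I_{1:i-1})$ are not included in the maximization as they are treated as constant terms.

Since we have ground-truth for $B_1$, $p(I_1 \mid B_1)$ can also be treated as known after the object model is learned based on $B_1$. Then, Eqn.(9) can be rewritten as,

$$B_{2:t}^* = \arg\max_{B_{2:t}} \log p(B_{2:t} \mid I_{1:t}, B_1) = \arg\max_{B_{2:t}} \Big\{ \sum_{i=2}^{t} \big[ \log p(B_i \mid B_{i-1}) + \log p(I_i \mid B_i) \big] \Big\}. \tag{10}$$
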
3.1.2 Tracking as Energy Minimization over Trajectories

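To derive the discriminative formulation of Eqn.(10), we show that only the log-likelihood ratio matters in computing $\log p(I_i \mid B_i)$ in Eqn.(10) with very mild assumptions.

Let $\Lambda_{B_i}$ be the image domain occupied by a tracked object, and $\Lambda_{\bar{B}_i}$ the remaining domain (i.e., $\Lambda_{B_i} \cup \Lambda_{\bar{B}_i} = \Lambda$ and $\Lambda_{B_i} \cap \Lambda_{\bar{B}_i} = \emptyset$) in a frame $I_i$. With the independence assumption between $I_{\Lambda_{B_i}}$ and $I_{\Lambda_{\bar{B}_i}}$ given $B_i$, we have,

$$p(I_i \mid B_i) = p(I_{\Lambda_{B_i}}, I_{\Lambda_{\bar{B}_i}} \mid B_i) = p(I_{\Lambda_{B_i}} \mid B_i)\, q(I_{\Lambda_{\bar{B}_i}}) = q(I_\Lambda)\, \frac{p(I_{\Lambda_{B_i}} \mid B_i)}{q(I_{\Lambda_{B_i}})}, \tag{11}$$

where $q(I_\Lambda)$ is the probability model of the background scene and we have $q(I_{\Lambda_{\bar{B}_i}}) = q(I_\Lambda) / q(I_{\Lambda_{B_i}})$ under the context-free assumption. So, $q(I_\Lambda)$ does not need to be specified explicitly and can be omitted in the maximization. This derivation gives an alternative explanation for discriminant tracking vs. tracking by generative appearance modeling of an object [47].

Based on Eqn.(10), we define an energy function by,

$$E(B_{2:t} \mid I_{1:t}, B_1) \propto -\log p(B_{2:t} \mid I_{1:t}, B_1). \tag{12}$$

We do not compute $\log \frac{p(I_{\Lambda_{B_i}} \mid B_i)}{q(I_{\Lambda_{B_i}})}$ in the probabilistic way; instead, we compute a matching score defined by,

$$\mathrm{Score}(I_i \mid B_i) = \log \frac{p(I_{\Lambda_{B_i}} \mid B_i)}{q(I_{\Lambda_{B_i}})} = \log p(I_i \mid B_i) - \log q(I_\Lambda), \tag{13}$$

to which we can apply discriminative learning methods.

Also, denote the motion cost by,

$$\mathrm{Cost}(B_i \mid B_{i-1}) = -\log p(B_i \mid B_{i-1}). \tag{14}$$

We use a thresholded motion model in experiments: the cost is $0$ if the transition is accepted based on the median flow [17] (which is a forward-backward extension of the Lucas-Kanade optical flow [48]) and $+\infty$ otherwise. A similar method was explored in [18].

So, we can re-write Eqn.(10) in the minimization form,

$$B_{2:t}^* = \arg\min_{B_{2:t}} E(B_{2:t} \mid I_{1:t}, B_1) = \arg\min_{B_{2:t}} \Big\{ \sum_{i=2}^{t} \big[ \mathrm{Cost}(B_i \mid B_{i-1}) - \mathrm{Score}(I_i \mid B_i) \big] \Big\}. \tag{15}$$

In our TLP framework, we compute $\mathrm{Score}(I_i \mid B_i)$ in Eqn.(15) with an object AOG. So, we interpret a sliding window by the optimal parse tree inferred from the object AOG. We treat parts as latent variables which are modeled to leverage more information for inferring the object bounding box. We note that we do not track parts explicitly in this paper.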
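The minimization over trajectories in Eqn.(15) has the structure of a Viterbi recursion. The following Python sketch illustrates that structure under stated assumptions; it is not the authors' implementation. The function name `temporal_dp`, the `scores` layout (one list of candidate matching scores per frame) and the `accept` predicate standing in for the median-flow test are all hypothetical, and the toy numbers are invented for illustration.

```python
import math


def temporal_dp(scores, accept):
    """Viterbi-style DP over per-frame candidates (illustrative sketch).

    scores[i][k]   : matching score of candidate k in frame i; frame 0 holds
                     the single given box and contributes no energy.
    accept(i, j, k): True if moving from candidate j in frame i-1 to candidate
                     k in frame i is accepted (cost 0); otherwise cost is +inf.
    Returns (minimal energy, list with the chosen candidate index per frame).
    """
    energy = [[0.0] * len(scores[0])]
    back = [[None] * len(scores[0])]
    for i in range(1, len(scores)):
        row_e, row_b = [], []
        for k, s in enumerate(scores[i]):
            best_e, best_j = math.inf, None
            for j, prev in enumerate(energy[i - 1]):
                cost = 0.0 if accept(i, j, k) else math.inf
                e = prev + cost - s  # accumulate Cost - Score
                if e < best_e:
                    best_e, best_j = e, j
            row_e.append(best_e)
            row_b.append(best_j)
        energy.append(row_e)
        back.append(row_b)

    k = min(range(len(energy[-1])), key=energy[-1].__getitem__)
    best = energy[-1][k]
    if math.isinf(best):
        return best, []  # no admissible trajectory
    path = [k]
    for i in range(len(scores) - 1, 0, -1):  # backtrack
        k = back[i][k]
        path.append(k)
    path.reverse()
    return best, path


# Toy data: the transition from candidate 1 (frame 1) to candidate 0 (frame 2)
# is rejected by the (hypothetical) motion test.
toy_scores = [[0.0], [1.0, 3.0], [4.0, 0.5]]
forbidden = {(2, 1, 0)}
best_energy, best_path = temporal_dp(
    toy_scores, lambda i, j, k: (i, j, k) not in forbidden
)
print(best_energy, best_path)
```

In this toy case a frame-by-frame choice would pick candidate 1 in frame 1 (score 3.0), whereas the joint solution selects candidate 0 there because it admits the high-scoring candidate in frame 2. This is the sense in which a joint solution can revise earlier decisions.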

Fig. 2: We assume parts are of rectangular shapes. (a) shows a configuration with 3 parts. Two different, yet equivalent, decomposition rules in representing a configuration are shown in (b) for decomposition with branching factor equal to the number of parts (i.e., a flat structure), and in (c) for a hierarchical decomposition with branching factor being set to 2 at all levels.
Fig. 3: Illustration of (a) the dictionary of part types, and (b) part instances generated by placing a part type in a grid. Given part instances, (c) shows how a sub-grid is decomposed in different ways. We allow overlap between child nodes (see (c.3)).
Fig. 4: Illustration of the full structure And-Or Graph (AOG) representing the space of part configurations. It has a directed acyclic graph (DAG) structure. For clarity, we show a toy example constructed for a grid. The AOG can generate all possible part configurations (the number is often huge for typical grid sizes, see Table I), while allowing efficient exploration with a DP algorithm due to the DAG structure. See text for details. (Best viewed in color and with magnification)

3.2 Quantizing the Space of Part Configurations

In this section, we first present the construction of a full structure AOG which quantizes the space of part configurations. We then introduce notations in defining an AOG.

Part configurations. For an input bounding box, a part configuration is defined by a partition with different number of parts of different shapes (see Fig. 2 (a)). Two natural questions arise: (i) How many part configurations (i.e., the space) can be defined in a bounding box? (ii) How to organize them into a compact representation? Without posing some structural constraints, it is a combinatorial problem.

We assume rectangular shapes are used for parts. Then, a configuration can be treated as a tiling of input bounding box using either horizontal or vertical cuts. We utilize binary splitting rule only in decomposition (see Fig. 2 (b) and (c)). With these two constraints, we represent all possible part configurations by a hierarchical and compositional AOG constructed in the following.

Given a bounding box, we first divide it evenly into a cell-based grid (e.g., the grid on the right of Fig. 3). Then, in the grid, we define a dictionary of part types and enumerate all instances for all part types.

A dictionary of part types. A part type is defined by its width and height. Starting from some minimal size (specified in cells), we enumerate all possible part types with different aspect ratios and sizes which fit the grid (see Fig. 3 (a)).

Part instances. An instance of a part type is obtained by placing the part type at a position. Thus, a part instance is defined by a “sliding window” in the grid. Fig. 3 (b) shows an example of placing a part type in a grid and the instances this generates.
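As a small illustration of the dictionary of part types and of part instances, the Python sketch below enumerates both for a grid given in cell units. The function names and the 3 × 3 grid with a 2 × 1 part type are hypothetical choices made for this example, not values taken from the paper.

```python
def part_types(grid_w, grid_h, min_w=1, min_h=1):
    """All (width, height) part types, in cells, that fit the grid."""
    return [(w, h)
            for h in range(min_h, grid_h + 1)
            for w in range(min_w, grid_w + 1)]


def part_instances(grid_w, grid_h, part_w, part_h):
    """All placements (x, y, w, h) of one part type: a sliding window in the grid."""
    return [(x, y, part_w, part_h)
            for y in range(grid_h - part_h + 1)
            for x in range(grid_w - part_w + 1)]


types_3x3 = part_types(3, 3)
instances_2x1 = part_instances(3, 3, 2, 1)
total_instances = sum(len(part_instances(3, 3, w, h)) for w, h in types_3x3)
print(len(types_3x3), len(instances_2x1), total_instances)
```

A part type of w × h cells has (W - w + 1)(H - h + 1) placements in a W × H grid, which is what the list comprehension enumerates.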

To represent part configurations compactly, we exploit the compositional relationships between enumerated part instances.

The full structure AOG. For any sub-grid indexed by the left-top position, width and height (e.g., the one in the right-middle of Fig. 3 (c)), we can either terminate it directly to the corresponding part instance (Fig. 3 (c.1)), or decompose it into two smaller sub-grids using either horizontal or vertical binary splits. Depending on the side length, we may have multiple valid splits along both directions (Fig. 3 (c.2)). When splitting either side we allow overlaps between the two sub-grids up to some ratio (Fig. 3 (c.3)). Then, we represent the sub-grid as an Or-node, which has a set of child nodes including a terminal-node (i.e., the part instance directly terminated from it), and a number of And-nodes (each of which represents a valid decomposition). This procedure is applied recursively for all child sub-grids. Starting from the whole grid and using BFS order, we construct a full structure AOG, as summarized in Algorithm 1 (see Fig. 4 for an example). Table I lists the number of part configurations for three cases, from which we can see that full structure AOGs cover a large number of part configurations using a relatively small set of part instances.

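Input: Image grid $\Lambda$ with $W \times H$ cells; minimal size of a part type $(w_0, h_0)$; maximal overlap ratio $r$ between two sub-grids.
Output: The And-Or graph $G = (V, E)$ (see Fig. 4)
Initialization: Create an Or-node $O_\Lambda$ for the grid $\Lambda$; $V = \{O_\Lambda\}$, $E = \emptyset$, BFSqueue $= \{O_\Lambda\}$;
while BFSqueue is not empty do
       Pop a node $v$ from the BFSqueue;
       if $v$ is an Or-node then
             i) Add a terminal-node $t$ (i.e., the part instance): $V = V \cup \{t\}$, $E = E \cup \{\langle v, t \rangle\}$;
             ii) Create And-nodes $A_i$ for all valid cuts: $E = E \cup \{\langle v, A_i \rangle\}$;
             if $A_i \notin V$ then
                   $V = V \cup \{A_i\}$; push $A_i$ to the back of BFSqueue;
             end if
       else if $v$ is an And-node then
             Create two Or-nodes $O_i$ for the two sub-grids: $E = E \cup \{\langle v, O_i \rangle\}$;
             if $O_i \notin V$ then
                   $V = V \cup \{O_i\}$; push $O_i$ to the back of BFSqueue;
             end if
       end if
end while
Algorithm 1 Constructing the grid AOG using BFS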
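The BFS construction in Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the released code: it omits overlapping splits (matching the "without considering overlapped compositions" setting of Table I), and the node encoding as tuples and the names `build_full_aog` and `count_nodes` are hypothetical.

```python
from collections import Counter, deque


def build_full_aog(grid_w, grid_h, min_w=1, min_h=1):
    """Build a full structure AOG by BFS, without overlapping splits.

    Node encoding (an assumption of this sketch):
      ("Or", x, y, w, h), ("T", x, y, w, h), ("And", x, y, w, h, axis, cut)
    """
    root = ("Or", 0, 0, grid_w, grid_h)
    nodes, edges = {root}, set()
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if v[0] == "Or":
            _, x, y, w, h = v
            # i) terminate the sub-grid directly to its part instance
            t = ("T", x, y, w, h)
            nodes.add(t)
            edges.add((v, t))
            # ii) one And-node per valid binary cut (both halves >= minimal size)
            cuts = [("x", c) for c in range(min_w, w - min_w + 1)]
            cuts += [("y", c) for c in range(min_h, h - min_h + 1)]
            for axis, c in cuts:
                a = ("And", x, y, w, h, axis, c)
                edges.add((v, a))
                if a not in nodes:
                    nodes.add(a)
                    queue.append(a)
        else:  # And-node: two child Or-nodes for the two sub-grids
            _, x, y, w, h, axis, c = v
            if axis == "x":
                children = [("Or", x, y, c, h), ("Or", x + c, y, w - c, h)]
            else:
                children = [("Or", x, y, w, c), ("Or", x, y + c, w, h - c)]
            for o in children:
                edges.add((v, o))
                if o not in nodes:  # shared sub-grids are created only once
                    nodes.add(o)
                    queue.append(o)
    return nodes, edges


def count_nodes(nodes):
    """Number of nodes per type."""
    return Counter(n[0] for n in nodes)


counts_3x3 = count_nodes(build_full_aog(3, 3)[0])
counts_5x5 = count_nodes(build_full_aog(5, 5)[0])
counts_10x12 = count_nodes(build_full_aog(10, 12, 2, 2)[0])
print(counts_3x3, counts_5x5, counts_10x12)
```

Under the assumed settings of a 3 × 3 grid and a 5 × 5 grid with 1 × 1 minimal parts, and a 10 × 12 grid with 2 × 2 minimal parts, the sketch yields 48, 600 and 5209 And-nodes, which agree with the And-node column of Table I. The sketch creates one terminal-node per Or-node (36 in the 3 × 3 case), so its terminal-node count need not match the T-node column, which may follow a different counting convention.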

We denote an AOG by,

$$G = (V_{And}, V_{Or}, V_T, E, \Theta), \tag{16}$$

where $V_{And}$, $V_{Or}$ and $V_T$ represent a set of And-nodes, Or-nodes and terminal-nodes respectively, $E$ a set of edges and $\Theta$ a set of parameters (to be defined in Section 4.1). We have,

  • The object/root Or-node (plotted by green circles), which represents alternative object configurations;

  • A set of And-nodes (solid blue circles), each of which represents the rule of decomposing a complex structure (e.g., a walking person or a running basketball player) into simpler ones;

  • A set of part Or-nodes, which handle local variations and configurations in a recursive way;

  • A set of terminal-nodes (red rectangles), which link an object and its parts to image data (i.e., grounding symbols) to account for appearance variations and occlusions (e.g., head-shoulder of a walking person before and after opening a sun umbrella).

TABLE I: The number of part configurations generated from our AOG without considering overlapped compositions.
Grid | Primitive part | #Configuration | #T-node | #And-node
     |                | 319            | 35      | 48
     |                | 76,879,359     | 224     | 600
     |                | 3.8936e+009    | 1409    | 5209

An object AOG is a subgraph of a full structure AOG with the same root Or-node. For notational simplicity, we also denote by $G$ an object AOG. So, we will write $\mathrm{Score}(I_i \mid B_i; G)$ in Eqn.(15) with $G$ added.

A parse tree is an instantiation of an object AOG with the best child node (w.r.t. matching scores) selected for each encountered Or-node. All the terminal-nodes in a parse tree represent a part configuration when collapsed to the image domain.

We note that an object AOG contains multiple parse trees to preserve ambiguities in interpreting a tracked object (see examples in the overview figure (c) and Fig. 6).

4 Tracking-by-Parsing with Object AOGs

In this section, we present details of inference with object AOGs. We first define scoring functions of nodes in an AOG. Then, we present a spatial DP algorithm for computing $\mathrm{Score}(I_i \mid B_i; G)$, and a temporal DP algorithm for inferring the trajectory in Eqn.(15).

4.1 Scoring Functions of Nodes in an AOG

Let $F$ be the feature pyramid computed for either the local ROI or the whole image $I_t$, and $\Lambda_F$ the position space of pyramid $F$. Let $p = (l, x, y) \in \Lambda_F$ specify a position $(x, y)$ in the $l$-th level of pyramid $F$.

Given an AOG $G = (V_{And}, V_{Or}, V_T, E, \Theta)$ (e.g., the left in Fig. 5), we define four types of edges, i.e., $E = E_{Switch} \cup E_{Dec} \cup E_{Def} \cup E_T$ (Switching-, Decomposition-, Deformation- and Terminal-edges) as shown in Fig. 5. We elaborate the definitions of parameters $\Theta = (\Theta^{app}, \Theta^{def}, \Theta^{bias})$:

  • Each terminal-node $t \in V_T$ has appearance parameters $\theta_t^{app} \subset \Theta^{app}$, which are used to ground a terminal-node to image data.

  • The parent And-node $A$ of a part terminal-node with deformation edge has deformation parameters $\theta_A^{def} \subset \Theta^{def}$. They are used for penalizing local displacements when placing a terminal-node around its anchor position. We note that the object template is not allowed to perturb locally in inference since we infer the optimal part configuration for each given object location in the pyramid with the sliding window technique, as done in the DPM [1], so the parent And-node of the object terminal-node does not have deformation parameters.

  • A child And-node of the root Or-node has a bias term $\Theta^{bias} = \{b\}$. We do not define bias terms for child nodes of other Or-nodes.

Appearance Features. We use three types of features: histogram of oriented gradient (HOG) [49], local binary pattern features (LBP) [50], and RGB color histograms (for color videos).

Deformation Features. Denote by $\delta = [dx, dy]$ the displacement of placing a terminal-node around its anchor location. The deformation feature is defined by $\Phi^{def}(\delta) = [dx^2, dx, dy^2, dy]'$ as done in DPMs [1].

We use linear functions to evaluate both appearance scores and deformation scores. The score functions of nodes in an AOG are defined as follows:

  • For a terminal-node $t$, its score at a position $p$ is computed by,

$$\mathrm{Score}(t, p \mid F) = \langle \theta_t^{app}, \Phi^{app}(F, p) \rangle, \tag{17}$$

    where $\langle \cdot, \cdot \rangle$ represents the inner product and $\Phi^{app}(F, p)$ extracts features in the feature pyramid.

  • For an Or-node $O$, its score at position $p$ takes the maximum score over its child nodes,

$$\mathrm{Score}(O, p \mid F) = \max_{c \in ch(O)} \mathrm{Score}(c, p \mid F), \tag{18}$$

    where $ch(v)$ denotes the set of child nodes of a node $v$.

  • For an And-node $A$, we have three different functions w.r.t. the type of its out-edge (i.e., Terminal-, Deformation-, or Decomposition-edge),

$$\mathrm{Score}(A, p \mid F) = \begin{cases} \mathrm{Score}(t, p \mid F), & e_{A,t} \in E_T \\ \max_{\delta} \big[ \mathrm{Score}(t, p \oplus \delta \mid F) - \langle \theta_A^{def}, \Phi^{def}(\delta) \rangle \big], & e_{A,t} \in E_{Def} \\ \sum_{c \in ch(A)} \mathrm{Score}(c, p \mid F), & e_{A,c} \in E_{Dec} \end{cases} \tag{19}$$

    where the first case is for sharing score maps between the object terminal-node and its parent And-node since we do not allow local deformation for the whole object; the second case is for computing transformed score maps of the parent And-node of a part terminal-node, which is allowed to find the best placement through distance transformation [1], with $\oplus$ representing the displacement operator in the position space $\Lambda_F$; and the third case is for computing the score maps of an And-node which has two child nodes through composition.
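To make the three kinds of node scores concrete, the following Python sketch evaluates them on a toy score map at a single pyramid level. It is a simplified illustration under stated assumptions: the function names are hypothetical, the deformation maximum is found by brute force over a small window instead of a distance transform, and the quadratic deformation feature with ordering (dx², dx, dy², dy) is assumed in the style of DPMs [1].

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def terminal_score(theta_app, feature):
    """Terminal-node: linear appearance score."""
    return dot(theta_app, feature)


def or_score(child_scores):
    """Or-node: best child."""
    return max(child_scores)


def deformation_feature(dx, dy):
    """Assumed quadratic deformation feature (dx^2, dx, dy^2, dy)."""
    return [dx * dx, dx, dy * dy, dy]


def and_score_deform(score_map, x, y, theta_def, radius=1):
    """And-node over a deformation edge: best displaced child score minus penalty.

    Brute-force search in a (2*radius+1)^2 window around the anchor (x, y).
    """
    best = float("-inf")
    height, width = len(score_map), len(score_map[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < height and 0 <= xx < width:
                penalty = dot(theta_def, deformation_feature(dx, dy))
                best = max(best, score_map[yy][xx] - penalty)
    return best


def and_score_compose(child_scores):
    """And-node over decomposition edges: sum of child scores."""
    return sum(child_scores)


toy_map = [[0, 1, 0],
           [0, 2, 5],
           [0, 0, 0]]
weak_penalty = and_score_deform(toy_map, 1, 1, [1, 0, 1, 0])
strong_penalty = and_score_deform(toy_map, 1, 1, [4, 0, 4, 0])
print(weak_penalty, strong_penalty)
```

With a weak penalty the part moves one cell to the right to reach the stronger response (5 - 1 = 4); with a strong penalty it stays at its anchor (score 2). This is the trade-off that the deformation parameters control.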

Fig. 5: Illustration of the spatial DP algorithm for parsing with AOGs (e.g., in the left). Right-middle: The input image (ROI in the 173-th frame in the “Skating1” sequence) and the inferred object configuration. Right-top: The score map pyramid for root Or-node. Middle: For each node in AOG, we show one level of score map pyramid at which the optimal parse tree is retrieved.
Input: An image $I_i$, a bounding box $B_i$, and an AOG $G$
Output: $\mathrm{Score}(I_i \mid B_i; G)$ in Eqn.(15) and the optimal configuration $C_i^*$ from the parse tree for the object at frame $i$.
Initialization: Build the depth-first search (DFS) ordering queue ($Q_{DFS}$) of all nodes in the AOG;
Step 0: Compute scores for all nodes in $Q_{DFS}$;
while $Q_{DFS}$ is not empty do
       Pop a node $v$ from the $Q_{DFS}$;
       if $v$ is an Or-node then
             Score(v) = $\max_{u \in ch(v)}$ Score(u); // $ch(v)$ is the set of child nodes of $v$
       else if $v$ is an And-node then
             Score(v) = $\sum_{u \in ch(v)}$ LocalMax(Score(u))
       else if $v$ is a Terminal-node then
             Compute the filter response map for $I_{N(\Lambda_v)}$. // $N(\Lambda_v)$ represents the image domain of the LocalMax operation of Terminal-node $v$.
       end if
end while
$\mathrm{Score}(I_i \mid B_i; G)$ = Score(RootOrNode);
Step 1: Compute $C_i^*$ using the breadth-first search;
$Q_{BFS} = \{\text{RootOrNode}\}$, $C_i^* = (B_i)$, $k = 1$;
while $Q_{BFS}$ is not empty do
       Pop a node $v$ from the $Q_{BFS}$;
       if $v$ is an Or-node then
             Push the child node $u$ with maximum score into $Q_{BFS}$ (i.e., Score(u)=Score(v)).
       else if $v$ is an And-node then
             Push all the child nodes $u$'s into $Q_{BFS}$.
       else if $v$ is a Terminal-node then
             Add $B_i^{(k)} = \mathrm{Deformed}(\Lambda_v)$ to $C_i^*$. Increase $k = k + 1$.
       end if
end while
Algorithm 2 The spatial DP algorithm for parsing with the AOG, Parse($I_i$, $B_i$; $G$)

4.2 Tracking-by-Parsing

With the scoring functions defined above, we present a spatial DP algorithm and a temporal DP algorithm for solving Eqn.(15).

Spatial DP: The DP algorithm (see Algorithm 2) consists of two stages: (i) The bottom-up pass computes score map pyramids (as illustrated in Fig. 5) for all nodes following the depth-first-search (DFS) order of nodes. It computes matching scores of all possible parse trees at all possible positions in feature pyramid. (ii) In the top-down pass, we first find all candidate positions for the root Or-node based on its score maps and current threshold of the object AOG, denoted by


Then, following BFS order of nodes, we retrieve the optimal parse tree at each : starting from the root Or-node, we select the optimal branch (with the largest score) of each encountered Or-node, keep the two child nodes of each encountered And-node, and retrieve the optimal position of each encountered part terminal-node (by taking for the second case in Eqn.(19)).
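The two passes of the spatial DP can be illustrated with a minimal Python sketch on a toy AOG. It is not the released implementation and makes several simplifying assumptions: a single pyramid level, precomputed terminal response maps instead of filter responses on HOG/LBP/color features, no part anchor offsets, a brute-force `local_max` in place of the distance transform, and only the single best root position instead of all candidates above the threshold. The node type names (`or`, `and_terminal`, `and_deform`, `and_decomp`, `terminal`) and the deformation weight are hypothetical.

```python
import numpy as np

def local_max(score, weight=0.1, radius=1):
    """Brute-force stand-in for the distance transform of a part score map."""
    h, w = score.shape
    out = np.full((h, w), -np.inf)
    arg = np.zeros((h, w, 2), dtype=int)
    for y in range(h):
        for x in range(w):
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        s = score[yy, xx] - weight * (dx * dx + dy * dy)
                        if s > out[y, x]:
                            out[y, x] = s
                            arg[y, x] = (dy, dx)
    return out, arg

def bottom_up(aog, root, responses):
    """Compute one score map per node, children before parents (DFS post-order)."""
    scores, shifts = {}, {}

    def visit(v):
        node = aog[v]
        children = node.get("children", [])
        for u in children:
            if u not in scores:
                visit(u)
        kind = node["type"]
        if kind == "terminal":
            scores[v] = np.asarray(responses[v], dtype=float)
        elif kind == "or":
            scores[v] = np.max([scores[u] for u in children], axis=0)
        elif kind == "and_terminal":
            scores[v] = scores[children[0]]
        elif kind == "and_deform":
            scores[v], shifts[v] = local_max(scores[children[0]])
        else:  # "and_decomp": sum the score maps of the child nodes
            scores[v] = sum(scores[u] for u in children)

    visit(root)
    return scores, shifts

def top_down(aog, root, scores, shifts):
    """Retrieve the best parse tree in BFS order, starting at the best root position."""
    pos = tuple(int(i) for i in np.unravel_index(np.argmax(scores[root]), scores[root].shape))
    parts, queue = [], [(root, pos)]
    while queue:
        v, p = queue.pop(0)
        node = aog[v]
        if node["type"] == "terminal":
            parts.append((v, p))
        elif node["type"] == "or":
            best = max(node["children"], key=lambda u: scores[u][p])
            queue.append((best, p))
        elif node["type"] == "and_deform":
            dy, dx = shifts[v][p]
            queue.append((node["children"][0], (p[0] + int(dy), p[1] + int(dx))))
        else:
            queue.extend((u, p) for u in node["children"])
    return float(scores[root][pos]), parts

def one_hot(y, x):
    m = np.zeros((3, 3))
    m[y, x] = 1.0
    return m

# Toy AOG: the object is explained either as a whole or as a left/right split.
aog = {
    "root": {"type": "or", "children": ["whole", "split"]},
    "whole": {"type": "and_terminal", "children": ["T_obj"]},
    "split": {"type": "and_decomp", "children": ["left", "right"]},
    "left": {"type": "and_deform", "children": ["T_left"]},
    "right": {"type": "and_deform", "children": ["T_right"]},
    "T_obj": {"type": "terminal"},
    "T_left": {"type": "terminal"},
    "T_right": {"type": "terminal"},
}
responses = {"T_obj": one_hot(1, 1), "T_left": one_hot(1, 0), "T_right": one_hot(1, 2)}

scores, shifts = bottom_up(aog, "root", responses)
best_score, parts = top_down(aog, "root", scores, shifts)
```

In this toy case the split branch wins at the center position because both parts can shift by one cell to reach their peaks at a small deformation cost.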

After spatial parsing, we apply non-maximum suppression (NMS) in computing the optimal parse trees with a predefined intersection-over-union (IoU) overlap threshold, denoted by . We keep top parse trees to infer the best together with a temporal DP algorithm, similar to the strategies used in [19, 20].
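A greedy form of NMS followed by keeping the top candidates could look like the sketch below. It assumes boxes given as `(x1, y1, x2, y2)` corners and a hypothetical `top_n` argument for the number of parse trees kept; the threshold value and the exact tie-breaking are not taken from the source.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms_top_n(boxes, scores, iou_thr, top_n):
    """Greedy NMS: visit boxes by decreasing score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
        if len(keep) == top_n:
            break
    return keep

candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms_top_n(candidates, [0.9, 0.8, 0.7], iou_thr=0.5, top_n=10)
```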

Temporal DP: Assuming that all the N-best candidates for are memoized after running spatial DP algorithm in to , Eqn.(15) corresponds to the classic DP formulation of forward and backward inference for decoding HMMs with being the singleton “data” term and the pairwise cost term.

Let be energy of the best object states in the first frames with the constraint that the -th one is . We have,


When is the input bounding box. Then, the temporal DP algorithm consists of two steps:

  1. The forward step for computing all ’s, and caching the optimal solution for as a function of for later back-tracing starting at ,

  2. The backward step for finding the optimal trajectory , where we first take,


    and then in the order of trace back,


In practice, we often do not need to run temporal DP over the whole time range, especially for long-term tracking, since the target object might have changed significantly or the camera might have moved. Instead, we only focus on a short time range (see settings in experiments).
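The forward and backward steps above follow the standard Viterbi pattern over the N-best candidates of each frame. The sketch below assumes a score-maximizing convention (unary score minus pairwise cost), which may differ in sign from the energy used in Eqn.(15); `unary`, `pairwise` and the toy positions are hypothetical.

```python
def temporal_dp(unary, pairwise):
    """Viterbi decoding over per-frame candidates.

    unary[t][k]: score of candidate k in frame t (higher is better).
    pairwise(t, j, k): cost of moving from candidate j in frame t-1 to k in frame t.
    Returns the best candidate index per frame and the total score.
    """
    best = [list(unary[0])]
    back = [[-1] * len(unary[0])]
    for t in range(1, len(unary)):
        row, ptr = [], []
        for k, u in enumerate(unary[t]):
            # Forward step: best predecessor for candidate k, cached for back-tracing.
            prev = [best[t - 1][j] - pairwise(t, j, k) for j in range(len(unary[t - 1]))]
            j_star = max(range(len(prev)), key=prev.__getitem__)
            row.append(u + prev[j_star])
            ptr.append(j_star)
        best.append(row)
        back.append(ptr)
    # Backward step: start from the best final state and trace back.
    k = max(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = [k]
    for t in range(len(unary) - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    return path[::-1], total

# Frame 0 holds the single input box; later frames hold two candidates each.
pos = [[0], [5, 1], [6, 1]]
unary = [[0.0], [1.0, 0.9], [0.2, 1.0]]
path, total = temporal_dp(unary, lambda t, j, k: 0.1 * abs(pos[t][k] - pos[t - 1][j]))
```

In this toy example the per-frame best candidate in frame 1 (index 0) is rejected in favor of index 1, because the latter gives a smoother and overall higher-scoring trajectory.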

Remarks: In our TLP method, we apply the spatial and the temporal DP algorithms in a stage-wise manner and without tracking parts explicitly. Thus, we do not introduce loops in inference. If we instead attempted to learn a joint spatial-temporal AOG, the problem would be much more difficult due to loops in joint spatial-temporal inference, and approximate inference would be required.

Search Strategy: During tracking, at time , is initialized by , and then a rectangular region of interest (ROI) centered at the center of is used to compute feature pyramid and run parsing with AOG. The ROI is first computed as a square area with the side length being times longer than the maximum of width and height of and then is clipped with the image domain. If no candidates are found (i.e., is empty), we will run the parsing in whole image domain. So, our AOGTracker is capable of re-detecting a tracked object. If there are still no candidates (e.g., the target object was completely occluded or went out of camera view), the tracking result of this frame is set to be invalid and we do not need to run the temporal DP.
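The ROI computation described above can be sketched as follows, assuming boxes in `(x, y, w, h)` format with a top-left origin. The `factor` argument stands for the side-length multiplier, whose value is not assumed here.

```python
def search_roi(box, img_w, img_h, factor):
    """Square ROI centered on the box center, clipped to the image domain."""
    x, y, w, h = box
    side = factor * max(w, h)
    cx, cy = x + w / 2.0, y + h / 2.0
    x1 = max(0.0, cx - side / 2.0)
    y1 = max(0.0, cy - side / 2.0)
    x2 = min(float(img_w), cx + side / 2.0)
    y2 = min(float(img_h), cy + side / 2.0)
    return (x1, y1, x2 - x1, y2 - y1)

roi = search_roi((10, 10, 20, 40), img_w=100, img_h=100, factor=3)
```

After clipping, the ROI is in general no longer square, as in this example where the box lies near the image corner.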

4.3 The Trackability of an Object AOG

To detect critical moments online, we need to measure the quality of an object AOG, at time . We compute its trackability based on the score maps in which the optimal parse tree is placed. For each node in the parse tree, we have its position in score map pyramid (i.e., the level of pyramid and the location in that level), . We define the trackability of node by,


where is the score of node , the mean score computed from the whole score map. Intuitively, we expect the score map of a discriminative node to have a peaked and steep landscape, as investigated in [51]. The trackabilities of part nodes are used to infer partial occlusion and local structure variations, and the trackability of the inferred parse tree indicates the “goodness” of the current object AOG. We note that we use the terms trackability and intrackability (i.e., the inverse of the trackability) interchangeably. More sophisticated definitions of intrackability in tracking can be found in [52].

We model trackability by a Gaussian model whose mean and standard deviation are computed incrementally in . At time , a tracked object is said to be “intrackable” if its trackability is less than . We note that the tracking result could be still valid even if it is “intrackable” (e.g., in the first few frames in which the target object is occluded partially, especially by similar distractors).
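A minimal sketch of this measurement is given below. It assumes that the trackability of a node is its score at the parsed position minus the mean of the corresponding score map, and that the "intrackable" test compares against the running mean minus `k` running standard deviations; the multiplier `k` and the use of Welford's update are assumptions for illustration.

```python
import numpy as np

def node_trackability(score_map, pos):
    """Score at the parsed position minus the mean score of the whole map."""
    return float(score_map[pos] - score_map.mean())

class RunningGaussian:
    """Incremental mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 0 else 0.0

    def is_intrackable(self, x, k):
        # k is a hypothetical multiplier; a low trackability flags uncertainty.
        return x < self.mean - k * self.std

peaked = np.array([[0.0, 0.0], [0.0, 4.0]])
t_peak = node_trackability(peaked, (1, 1))

model = RunningGaussian()
for value in (2.0, 4.0):
    model.update(value)
```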

5 Online Learning of Object AOGs

In this section, we present online learning of object AOGs, which consists of three components: (i) Maintaining a training dataset based on tracking results; (ii) Estimating parameters of a given object AOG; and (iii) Learning structure of the object AOG by pruning full structure AOG, which requires (ii) in the process.

5.1 Maintaining the Training Dataset Online

Denote by the training dataset at time , consisting of , a positive dataset, and , a negative dataset.

In the first frame, we have and let . We augment it with eight locally shifted positives, i.e., where and with width and height not changed. is set to the cell size in computing HOG features. The initial uses the whole remaining image for mining hard negatives in training.
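The augmentation with eight shifted positives can be sketched as below, assuming that each shift moves the box by `-d`, `0` or `+d` along each axis (excluding the zero shift) and that boxes are stored as `(x, y, w, h)`.

```python
def shifted_positives(box, d):
    """Eight copies of the box shifted by -d, 0 or +d per axis; size unchanged."""
    x, y, w, h = box
    return [
        (x + dx, y + dy, w, h)
        for dy in (-d, 0, d)
        for dx in (-d, 0, d)
        if (dx, dy) != (0, 0)
    ]

positives = shifted_positives((10, 20, 30, 40), d=8)
```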

At time , if is valid according to tracking-by-parsing, we have , and add to all other candidates in (Eqn. 20) which are not suppressed by according to NMS (i.e., hard negatives). Otherwise, we have .

5.2 Estimating Parameters of a Given Object AOG

We use latent SVM method (LSVM) [1]. Based on the scoring functions defined in Section 4.1, we can re-write the scoring function of applying a given object AOG, on a training example (denoted by for simplicity),


where represents a parse tree, the space of parse trees, the concatenated vector of all parameters, the concatenated vector of appearance and deformation features in feature pyramid w.r.t. parse tree , and the bias term.

The objective function in estimating parameters is defined by the -regularized empirical hinge loss function,


where is the trade-off parameter in learning. Eqn.(26) is a semi-convex function of the parameters due to the empirical loss term on positives.
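In assumed notation (the symbols $\Theta$, $\Phi$, $pt$, $\Omega_G$, $b$ and $C$ are placeholders, and the $\ell_2$ regularizer follows the DPM formulation [1] rather than being confirmed by the text above), the scoring function and the objective have the following form:

```latex
\begin{align*}
f_{\Theta}(x) &= \max_{pt \in \Omega_G}
  \big( \langle \Theta,\ \Phi(x, pt) \rangle + b \big), \\
\mathcal{L}(\Theta) &= \tfrac{1}{2} \lVert \Theta \rVert_2^2
  + C \sum_{i} \max\big(0,\ 1 - y_i\, f_{\Theta}(x_i)\big),
  \qquad y_i \in \{+1, -1\}.
\end{align*}
```

Since $f_{\Theta}$ is a maximum of linear functions, it is convex in $\Theta$. The hinge loss of a negative example, $\max(0, 1 + f_{\Theta}(x_i))$, is therefore convex, while that of a positive example, $\max(0, 1 - f_{\Theta}(x_i))$, is not. Fixing the parse tree of every positive makes $f_{\Theta}$ linear on the positives and the whole objective convex.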

In optimization, we utilize an iterative procedure in a “coordinate descent” way. We first convert the objective function to a convex function by assigning latent values for all positives using the spatial DP algorithm. Then, we estimate parameters. While we can use stochastic gradient descent as done in DPMs [1], we adopt the LBFGS method [53] in practice (we reimplemented the MATLAB code available at schmidtm/Software/minConf.html in C++), since it is more robust and efficient with parallel implementation, as investigated in [9, 54]. The detection threshold, is estimated as the minimum score of positives.

Fig. 6: Illustration of learning an object AOG in the first frame (top) and re-learning an object AOG in the 281st frame when a critical moment has been triggered. It consists of two steps: (a) learning the initial object AOG by pruning branches of Or-nodes in the full structure AOG, and (b) learning the refined object AOG by pruning part configurations w.r.t. majority voting in positive re-labeling in LSVM.

5.3 Learning Object AOGs

With the training dataset and the full structure AOG constructed based on , an object AOG is learned in three steps:

i) Evaluating the figure of merit of nodes in the full structure AOG. We first train the root classifier (i.e., object appearance parameters and bias term) by linear SVM using and data-mining hard negatives in . Then, the appearance parameters for each part terminal-node are initialized by cropping out the corresponding portion in the object template. (We also tried to train the linear SVM classifiers for all the terminal-nodes individually using cropped examples, which increases the runtime but does not improve the tracking performance in experiments, so we use the simplified method above.) Following DFS order, we evaluate the figure of merit of each node in the full structure AOG by its training error rate. The error rate is calculated on where the score of a node is computed w.r.t. the scoring functions defined in Section 4.1. The smaller the error rate is, the more discriminative a node is.

ii) Retrieving an initial object AOG and re-estimating parameters. We retrieve the most discriminative subgraph in the full structure AOG as the initial object AOG. Following BFS order, we start from the root Or-node, select for each encountered Or-node the best child node (with the smallest training error rate among all children) together with the child nodes whose training error rates exceed that of the best child by no more than some predefined small positive value (i.e., preserving ambiguities), keep the two child nodes for each encountered And-node, and stop at each encountered terminal-node. We show two examples in the left of Fig. 6. We train the parameters of the initial object AOG using LSVM [1] with two rounds of positive re-labeling and hard negative mining respectively.
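The BFS retrieval in step ii) can be sketched as follows. The graph encoding, the node names, the error rates and the tolerance `eps` are all hypothetical; only the selection rule follows the description above.

```python
from collections import deque

def retrieve_initial_aog(aog, root, err, eps):
    """Keep, per Or-node, the best child and all children within eps of its error rate."""
    kept, queue = set(), deque([root])
    while queue:
        v = queue.popleft()
        if v in kept:
            continue
        kept.add(v)
        node = aog[v]
        if node["type"] == "or":
            best = min(err[u] for u in node["children"])
            queue.extend(u for u in node["children"] if err[u] <= best + eps)
        elif node["type"] == "and":
            queue.extend(node["children"])  # keep all child nodes of an And-node
        # Terminal-nodes end the traversal of their branch.
    return kept

full_aog = {
    "root": {"type": "or", "children": ["A", "B", "C"]},
    "A": {"type": "and", "children": ["T1", "T2"]},
    "B": {"type": "and", "children": ["T3"]},
    "C": {"type": "and", "children": ["T4"]},
    "T1": {"type": "terminal"}, "T2": {"type": "terminal"},
    "T3": {"type": "terminal"}, "T4": {"type": "terminal"},
}
error_rate = {"A": 0.10, "B": 0.12, "C": 0.30}
initial = retrieve_initial_aog(full_aog, "root", error_rate, eps=0.05)
```

Branch `C` is pruned here because its error rate is far above that of the best branch, while `B` is kept to preserve ambiguity.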

Fig. 7: Illustration of the three types of evaluation methods in TB-100/50/CVPR2013. In one-pass evaluation (OPE), a tracker is initialized in the first frame and tracks the target until the end of the sequence. In temporal robustness evaluation (TRE), a tracker starts at different starting frames initialized with the corresponding ground-truth bounding boxes and then tracks the object until the end. 20 starting frames (including the first frame) are used in TB-100. In spatial robustness evaluation (SRE), a tracker runs multiple times with spatially scaled (4 types) and shifted (8 types of perturbation) initializations in the first frame.

iii) Controlling model complexity. A refined object AOG for tracking is obtained by further selecting the most discriminative part configuration(s) in the initial object AOG learned in step ii). The selection process is based on the latent assignment in relabeling positives in LSVM training. A part configuration in the initial object AOG is pruned if it is assigned to fewer than 10% of the positives in relabeling (see the right of Fig. 6). We further train the refined object AOG with one round of latent positive re-labeling and hard negative mining. By reducing model complexity, we can speed up the tracking-by-parsing procedure.
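The pruning rule in step iii) amounts to counting how often each part configuration is selected when positives are relabeled. A minimal sketch, with hypothetical configuration labels:

```python
from collections import Counter

def prune_configurations(assignments, min_share=0.10):
    """Return the configurations assigned to at least min_share of the positives."""
    counts = Counter(assignments)
    total = len(assignments)
    return sorted(c for c, k in counts.items() if k / total >= min_share)

# Latent assignment of 20 positives to three part configurations.
latent = ["A"] * 12 + ["B"] * 7 + ["C"] * 1
kept_configs = prune_configurations(latent)
```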

Verification of a refined object AOG. We run parsing with a refined object AOG in the first frame. The refined object AOG is accepted if the score of the optimal parse tree is greater than the threshold estimated in training and the IoU overlap between the predicted bounding box and the input bounding box is greater than or equal to the IoU NMS threshold, in detection.

Identifying critical moments in tracking. A critical moment means that a tracker has become “uncertain” and has at the same time accumulated “enough” new samples. It is triggered in tracking when two conditions are satisfied. The first is that the number of frames in which a tracked object is “intrackable” is larger than some value, . The second is that the number of new valid tracking results is greater than some value, . Both are accumulated from the last time an object AOG was re-learned.
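The two counters can be sketched as a small helper class. The class name and the threshold arguments `max_intrackable` and `min_new_valid` are hypothetical, and their values are not assumed here.

```python
class CriticalMomentDetector:
    """Counts intrackable frames and new valid results since the last re-learning."""

    def __init__(self, max_intrackable, min_new_valid):
        self.max_intrackable = max_intrackable
        self.min_new_valid = min_new_valid
        self.reset()

    def reset(self):
        # Call after the object AOG has been re-learned.
        self.n_intrackable = 0
        self.n_valid = 0

    def update(self, intrackable, valid):
        """Record one frame and report whether a critical moment is triggered."""
        self.n_intrackable += int(intrackable)
        self.n_valid += int(valid)
        return (self.n_intrackable > self.max_intrackable
                and self.n_valid > self.min_new_valid)

detector = CriticalMomentDetector(max_intrackable=1, min_new_valid=2)
triggers = [detector.update(True, True),
            detector.update(True, True),
            detector.update(False, True)]
```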

The spatial resolution of placing parts. In learning object AOGs, we first place parts at the same spatial resolution as the object. If the learned object AOG was not accepted in verification, we then place parts at twice the spatial resolution w.r.t. the object and re-learn the object AOG. In our experiments, the two specifications handled all testing sequences successfully.

Overall flow of online learning. In the first frame or when a critical moment is identified in tracking, we learn both the structure and the parameters of an object AOG. Otherwise, we update the parameters only, provided that the tracking result in the frame is valid based on tracking-by-parsing.

Column groups: Representation, Search
Column headers: Binary or Haar, Model Update, Particle Filter, Local Optimum, Dense Sampling
Tracker and entry:
ASLA [55]
BSBT [56] H
CPF [57]
CSK [58]
CT [59] H
CXT [60] B
DFT [61]
FOT [62]
FRAG [63]
IVT [29]
KMS [30]
L1APG [64]
LOT [65]
LSHT [66] H
LSK [67]
LSS [68]
MIL [39] H
MTT [69]
OAB [70] H
ORIA [71] H
PCOM [72]
SCM [73]
SMS [74]
SBT [75] H
TLD [17] B
VR [76]
VTD [77]
VTS [78]
AOG HOG [+Color]
TABLE II: Tracking algorithms evaluated in the TB-100 benchmark (reproduced from [2]).
Metric: Success Rate / Precision Rate

Evaluation | Subset | AOG Gain
OPE | 100 | 13.93 / 18.06
OPE | 50 | 16.84 / 22.23
OPE | CVPR2013 | 2.74 / 19.37
SRE | 100 | 11.47 / 16.79
SRE | 50 | 12.52 / 17.82
SRE | CVPR2013 | 11.89 / 17.55
TRE | 100 | 9.25 / 11.06
TRE | 50 | 11.37 / 14.61
TRE | CVPR2013 | 11.59 / 14.38
Runner-up: STRUCK[40]; SO-DLT[6] / STRUCK[40]; STRUCK[40]

Subset in TB-50 | AOG Gain (success rate)
DEF(23) | 15.89
FM(25) | 15.56
MB(19) | 17.29
IPR(29) | 12.29
BC(20) | 17.81
OPR(32) | 14.04
OCC(29) | 14.7
IV(22) | 15.73
LR(8) | 6.65
SV(38) | 18.38
OV(11) | 15.99
Runner-up: STRUCK[40]; TLD[17]; SCM[73]; MIL[39]
TABLE III: Performance gain (in %) of our AOGTracker in terms of success rate and precision rate in the benchmark [2]. Success plots of TB-100/50/CVPR2013 are shown in Fig. 8. The success plots of the 11 subsets in TB-50 are shown in Fig. 9. Precision plots are provided in the supplementary material due to the space limit.
Fig. 8: Performance comparison in TB-100 (1st row), TB-50 (2nd row) and TB-CVPR2013 (3rd row) in terms of success plots of OPE (1st column), SRE (2nd column) and TRE (3rd column). For clarity, only the top 10 trackers are shown in color curves and listed in the legend. Two deep learning based trackers, CNT[5] and SO-DLT[6], are evaluated in TB-CVPR2013 using OPE (with their performance plots manually added in the left-bottom figure). We note that the plots are reproduced with the raw results provided at (Best viewed in color and with magnification)
Fig. 9: Performance comparison in the 11 subsets (with different attributes and different number of sequences as shown by the titles in the sub-figures) of TB-50 based on the success plots of OPE.
Fig. 10: Qualitative results. For clarity, we show tracking results (bounding boxes) in 6 randomly sampled frames for the top 10 trackers according to their OPE performance in TB-100. (Best viewed in color and with magnification.)

6 Experiments

In this section, we present comparison results on the TB-50/100/CVPR2013 benchmarks [2, 3] and the VOT benchmarks [4]. We also analyze different aspects of our method. The source code is released with this paper for reproducing all results. We denote the proposed method by AOG in tables and plots.

Parameter Setting. We use the same parameters for all experiments since we emphasize online learning in this paper. In learning object AOGs, the side length of the grid used for constructing the full structure AOG is either or depending on the side length of the input bounding box (to reduce the time complexity of online learning). The number of intervals in computing feature pyramid is set to with cell size being . The factor in computing search ROI is set to . The NMS IoU threshold is set to . The number of top parse trees kept after spatial DP parsing is set . The time range in temporal DP algorithm is set to . In identifying critical moments, we set and . The LSVM trade-off parameter in Eqn.(26) is set to . When re-learning structure and parameters, we could use all the frames with valid tracking results. To reduce the time complexity, the number of frames used in re-learning is at most in our experiments. At time , we first take the first frames with valid tracking results in with the underlying intuition that they have high probabilities of being tracked correctly (note that we always use the first frame since the ground-truth bounding box is given), and then take the remaining frames in reversed time order.

Speed. In our current C++ implementation, we adopt FFT in computing score pyramids as done in [54], which also utilizes multi-threading with OpenMP. We also provide a distributed version based on MPI in evaluation. The speed is about 2 to 3 frames per second (FPS). We are experimenting with GPU implementations to speed up our TLP.

Fig. 11: Performance comparison of the six variants of our AOGTracker in TB-100/50/CVPR2013 in terms of the success plots of OPE (1st column), SRE (2nd column) and TRE (3rd column).

6.1 Results on TB-50/100/CVPR2013

The TB-100 benchmark has 100 target objects ( frames in total) with 29 publicly available trackers evaluated. It is extended from a previous benchmark with 51 target objects released at CVPR2013 (denoted by TB-CVPR2013). Further, since some target objects are similar or less challenging, a subset of 50 difficult and representative ones (denoted by TB-50) is selected for an in-depth analysis. Two types of performance metric are used, the precision plot (i.e., the percentage of frames in which estimated locations are within a given threshold distance of ground-truth positions) and the success plot (i.e., based on IoU overlap scores which are commonly used in object detection benchmarks, e.g., PASCAL VOC [79]). The higher a success rate or a precision rate is, the better a tracker is. Usually, success plots are preferred to rank trackers [2, 4] (thus we focus on success plots in comparison). Three types of evaluation methods are used as illustrated in Fig.7.
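The two metrics can be sketched for a single threshold as follows. A full success or precision plot sweeps the threshold over a range. The box format `(x, y, w, h)`, the strict or non-strict comparisons and the toy data are assumptions for illustration, not the benchmark toolkit's code.

```python
import math

def iou_xywh(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred, gt, iou_thr):
    """Fraction of frames whose IoU with the ground truth exceeds iou_thr."""
    return sum(iou_xywh(p, g) > iou_thr for p, g in zip(pred, gt)) / len(gt)

def precision_rate(pred, gt, dist_thr):
    """Fraction of frames whose center location error is within dist_thr pixels."""
    def center(b):
        return (b[0] + b[2] / 2.0, b[1] + b[3] / 2.0)
    hits = 0
    for p, g in zip(pred, gt):
        (px, py), (gx, gy) = center(p), center(g)
        hits += math.hypot(px - gx, py - gy) <= dist_thr
    return hits / len(gt)

pred = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt = [(0, 0, 10, 10), (20, 20, 10, 10)]
success = success_rate(pred, gt, iou_thr=0.5)
precision = precision_rate(pred, gt, dist_thr=20)
```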

To account for different factors of a test sequence affecting performance, the testing sequences are further categorized w.r.t. 11 attributes for more in-depth comparisons: (1) Illumination Variation (IV, 38/22/21 sequences in TB-100/50/CVPR2013), (2) Scale Variation (SV, 64/38/28 sequences), (3) Occlusion (OCC, 49/29/29 sequences), (4) Deformation (DEF, 44/23/19 sequences), (5) Motion Blur (MB, 29/19/12 sequences), (6) Fast Motion (FM, 39/25/17 sequences), (7) In-Plane Rotation (IPR, 51/29/31 sequences), (8) Out-of-Plane Rotation (OPR, 63/32/39 sequences), (9) Out-of-View (OV, 14/11/6 sequences), (10) Background Clutters (BC, 31/20/21 sequences), and (11) Low Resolution (LR, 9/8/4 sequences). More details on the attributes and their distributions in the benchmark can be found in [2, 3].

Table II lists the 29 evaluated tracking algorithms, which are categorized based on representation and search scheme. See [2] for more details about categorizing these trackers. In TB-CVPR2013, two recent trackers based on deep convolutional networks (CNT[5], SO-DLT[6]) were evaluated using OPE.

We summarize the performance gain of our AOGTracker in Table III. Our AOGTracker obtains significant improvement (more than 12%) in 10 of the 11 subsets in TB-50. Our AOGTracker handles out-of-view situations much better than other trackers since it is capable of re-detecting target objects in the whole image, and it performs very well in the scale variation subset (see examples in the second and fourth rows in Fig. 10) since it searches over the feature pyramid explicitly (at the expense of more computation). Our AOGTracker obtains the least improvement in the low-resolution subset since it uses HOG features, and the discrepancy between HOG cell-based coordinates and pixel-based ones can cause some loss in overlap measurement, especially at low resolution. We will add automatic selection of feature types (e.g., HOG vs. pixel-based features such as intensity and gradient) according to the resolution, as well as other factors, in future work.

Fig.8 shows success plots of OPE, SRE and TRE in TB-100/50/CVPR2013. Our AOGTracker consistently outperforms all other trackers. We note that for OPE in TB-CVPR2013, although the improvement of our AOGTracker over the SO-DLT[6] is not very big, the SO-DLT utilized two deep convolutional networks with different model update strategies in tracking, both of which are pretrained on the ImageNet [34]. Fig. 10 shows some qualitative results.

6.2 Analyses of AOG models and the TLP Algorithm

To analyze contributions of different components in our AOGTracker, we compare the performance of six variants. They combine three object representation schemes, namely AOG with and without structure re-learning (denoted by AOG and AOGFixed respectively) and whole object template only (i.e., without part configurations, denoted by ObjectOnly), with two inference strategies for each representation scheme, namely inference with and without temporal DP (denoted by -st and -s respectively). As stated above, we use a very simple setting for temporal DP which takes into account frames, in our experiments.

Fig. 11 shows the performance comparison of the six variants. AOG-st consistently obtains the best overall performance. Trackers with AOG perform better than those with whole object template only. AOG structure re-learning yields a consistent overall performance improvement. However, we observed that AOGFixed-st works slightly better than AOG-st on two subsets out of 11, Motion-Blur and Out-of-View, on which the simple intrackability measurement is not good enough. For trackers with AOG, temporal DP helps improve performance, while for trackers with whole object templates only, the one without temporal DP (ObjectOnly-s) slightly outperforms the one with temporal DP (ObjectOnly-st), which suggests that sufficiently strong object models might be needed for integrating spatial and temporal information to improve performance.

6.3 Comparison with State-of-the-Art Methods

We explain why our AOGTracker outperforms other trackers on the TB-100 benchmark in terms of representation, online learning and inference.

Representation Scheme. Our AOGTracker utilizes three types of complementary features (HOG+LBP+Color) jointly to capture appearance variations, while most other trackers use simpler ones (e.g., TLD [17] uses intensity-based Haar-like features). More importantly, we address the issue of learning the optimal deformable part-based configurations in the quantized space of latent object structures, while most other trackers focus on either whole objects [58] or implicit configurations (e.g., the random fern forest used in TLD). These two components are integrated in a latent structured-output discriminative learning framework, which improves the overall tracking performance (e.g., see comparisons in Fig. 11).

Online Learning. Our AOGTracker includes two components which are not addressed in any of the other trackers evaluated on TB-100: online structure re-learning based on intrackability, and a simple temporal DP for computing the optimal joint solution. Both of them improve the performance according to our ablation experiments. The former enables our AOGTracker to capture both large structural and sudden appearance variations automatically, which is especially important for long-term tracking. In addition to improving the prediction performance, the latter improves the capability of maintaining the purity of the training dataset collected online.

Inference. Unlike many other trackers which do not handle scale changes explicitly (e.g., CSK [58] and STRUCK [40]), our AOGTracker runs tracking-by-parsing in feature pyramid to detect scale changes (e.g., the car example in the second row in Fig. 10). Our AOGTracker also utilizes a dynamic search strategy which re-detects an object in whole frame if local ROI search failed. For example, our AOGTracker handles out-of-view situations much better than other trackers due to the re-detection component (see examples in the fourth row in Fig. 10).

Limitations. All the performance improvements stated above are obtained at the expense of more computation in learning and tracking. Our AOGTracker obtains the least improvement in the low-resolution subset since it uses HOG features, and the discrepancy between HOG cell-based coordinates and pixel-based ones can cause some loss in overlap measurement, especially at low resolution. We will add automatic selection of feature types (e.g., HOG vs. pixel-based features such as intensity and gradient) according to the resolution, as well as other factors, in future work.

Fig. 12: Performance comparison in VOT2013. Left: Ranking plot for the baseline experiment. The smaller the rank number is, the better a tracker is w.r.t. accuracy and/or robustness (i.e., the right-top region indicates better performance). Right: Accuracy-Robustness plot. The larger the rate is, the better a tracker is.
Fig. 13: Performance comparison in VOT2014.

6.4 Results on VOT

In VOT, the evaluation focuses on short-term tracking (i.e., a tracker is not expected to perform re-detection after losing a target object), so the evaluation toolkit re-initializes a tracker after it loses the target (i.e., when the overlap between the predicted bounding box and the ground-truth one drops to zero) and counts the number of failures. In the VOT protocol, a tracker is tested on each sequence multiple times. The performance is measured in terms of accuracy and robustness. Accuracy is computed as the average of per-frame accuracies, which themselves are computed by taking the average over the repetitions. Robustness is computed as the number of failures averaged over the repetitions.
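The two measures, as described in the paragraph above, can be sketched as below. This is a simplified illustration rather than the VOT toolkit's code: it assumes every repetition provides an overlap for every frame and ignores any frames the toolkit may exclude around re-initialization.

```python
def vot_accuracy(overlaps):
    """overlaps[r][t]: overlap in repetition r at frame t.

    Per-frame accuracy is the mean over repetitions; accuracy is its mean over frames.
    """
    n_rep, n_frames = len(overlaps), len(overlaps[0])
    per_frame = [sum(run[t] for run in overlaps) / n_rep for t in range(n_frames)]
    return sum(per_frame) / n_frames

def vot_robustness(failures):
    """failures[r]: number of failures in repetition r; returns the mean."""
    return sum(failures) / len(failures)

accuracy = vot_accuracy([[0.5, 0.7], [0.7, 0.9]])
robustness = vot_robustness([1, 2, 0])
```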

We integrate our AOGTracker in the latest VOT toolkit (version 3.2) to run experiments with the baseline protocol and to generate plots. (The plots for VOT2013 and 2014 might be different compared to those in the original VOT reports [80, 81] due to the new version of vot-toolkit.)

The VOT2013 dataset [80] has 16 sequences, which were selected from a large pool such that various visual phenomena, like occlusion and illumination changes, were still represented well within the selection. Seven of the sequences are also used in TB-100. There are 27 trackers evaluated. The readers are referred to the VOT technical report [80] for details.

Fig.12 shows the ranking plot and AR plot in VOT2013. Our AOGTracker obtains the best accuracy while its robustness is slightly worse than three other trackers (i.e., PLT[80], LGT[82] and LGTpp[83], and PLT was the winner in VOT2013 challenge). Our AOGTracker obtains the best overall rank.

The VOT2014 dataset [81] has 25 sequences extended from VOT2013. The annotation is based on rotated bounding boxes instead of upright rectangles. There are 33 trackers evaluated. Details on the trackers can be found in [81]. Fig.13 shows the ranking plot and AR plot. Our AOGTracker is comparable to other trackers. One main limitation of AOGTracker is that it does not handle rotated bounding boxes well.

The VOT2015 dataset [84] consists of 60 short sequences (with rotated bounding box annotations) and VOT-TIR2015 comprises 20 sequences (with bounding box annotations). There are 62 and 28 trackers evaluated in VOT2015 and VOT-TIR2015 respectively. Our AOGTracker obtains % and 65% (tied for third place) in accuracy in VOT2015 and VOT-TIR2015 respectively. The details are referred to the reports [84] due to space limit here.

7 Discussion and Future Work

We have presented a tracking, learning and parsing (TLP) framework and derived a spatial dynamic programming (DP) and a temporal DP algorithm for online object tracking with AOGs. We also have presented a method of online learning object AOGs including its structure and parameters. In experiments, we test our method on two main public benchmark datasets and experimental results show better or comparable performance.

In our on-going work, we are studying more flexible computing schemes in tracking with AOGs. The compositional property embedded in an AOG naturally leads to different bottom-up/top-down computing schemes such as the three computing processes studied by Wu and Zhu [85]. We can track an object by matching the object template directly (i.e., -process), or by computing some discriminative parts first and then combining them into the object (-process), or by doing both (-process, as done in this paper). In tracking, as time evolves, the object AOG might grow through online learning, especially for objects with large variations in long-term tracking. Thus, faster inference is needed for real-time applications. We are trying to learn near optimal decision policies for tracking using the framework proposed by Wu and Zhu [86].

In our future work, we will extend the TLP framework by incorporating generic category-level AOGs [8] to scale up the TLP framework. The generic AOGs are pre-trained offline (e.g., using the PASCAL VOC [79] or the ImageNet [34]), and will help the online learning of specific AOGs for a target object (e.g., help to maintain the purity of the positive and negative datasets collected online). The generic AOGs will also be updated online together with the specific AOGs. By integrating generic and specific AOGs, we aim at the life-long learning of objects in videos without annotations. Furthermore, we are also interested in integrating scene grammar [87] and event grammar [88] to leverage more top-down information.


This work is supported by the DARPA SIMPLEX Award N66001-15-C-4035, the ONR MURI grant N00014-16-1-2007, and NSF IIS-1423305. T. Wu was also supported by the ECE startup fund 201473-02119 at NCSU. We thank Steven Holtzen for proofreading this paper. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of one GPU.


  • [1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [2] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” PAMI, vol. 37, no. 9, pp. 1834–1848, 2015.
  • [3] ——, “Online object tracking: A benchmark,” in CVPR, 2013.
  • [4] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. P. Pflugfelder, G. Fernández, G. Nebehay, F. Porikli, and L. Cehovin, “A novel performance evaluation methodology for single-target trackers,” CoRR, vol. abs/1503.01313, 2015.
  • [5] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang, “Robust visual tracking via convolutional networks,” arXiv preprint arXiv:1501.04505v2, 2015.
  • [6] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung, “Transferring rich feature hierarchies for robust visual tracking,” arXiv preprint arXiv:1501.04587v2, 2015.
  • [7] S. Carey, The Origin of Concepts.   Oxford University Press, 2011.
  • [8] X. Song, T. Wu, Y. Jia, and S.-C. Zhu, “Discriminatively trained and-or tree models for object detection,” in CVPR, 2013.
  • [9] R. Girshick, P. Felzenszwalb, and D. McAllester, “Object detection with grammar models,” in NIPS, 2011.
  • [10] P. Felzenszwalb and D. McAllester, “Object detection grammars,” University of Chicago, Computer Science TR-2010-02, Tech. Rep., 2010.
  • [11] S. C. Zhu and D. Mumford, “A stochastic grammar of images,” Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 4, pp. 259–362, 2006.
  • [12] Y. Amit and A. Trouvé, “POP: patchwork of parts models for object recognition,” IJCV, vol. 75, no. 2, pp. 267–282, 2007.
  • [13] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, 2006.
  • [14] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.-F. Lee, Eds., 1990, pp. 267–296.
  • [15] M. Isard and A. Blake, “Condensation - conditional density propagation for visual tracking,” IJCV, vol. 29, no. 1, pp. 5–28, 1998.
  • [16] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in CVPR, 2008.
  • [17] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” PAMI, vol. 34, no. 7, pp. 1409–1422, 2012.
  • [18] J. S. Supancic III and D. Ramanan, “Self-paced learning for long-term tracking,” in CVPR, 2013.
  • [19] D. Park and D. Ramanan, “N-best maximal decoder for part models,” in ICCV, 2011.
  • [20] D. Batra, P. Yadollahpour, A. Guzmán-Rivera, and G. Shakhnarovich, “Diverse m-best solutions in Markov random fields,” in ECCV, 2012.
  • [21] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in CVPR, 2008.
  • [22] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” in CVPR, 2011.
  • [23] J. Berclaz, F. Fleuret, E. Türetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” PAMI, vol. 33, no. 9, pp. 1806–1819, 2011.
  • [24] A. V. Goldberg, “An efficient implementation of a scaling minimum-cost flow algorithm,” J. Algorithms, vol. 22, no. 1, pp. 1–29, 1997.
  • [25] S. Hong and B. Han, “Visual tracking by sampling tree-structured graphical models,” in ECCV, 2014.
  • [26] S. Hong, S. Kwak, and B. Han, “Orderless tracking through model-averaged posterior estimation,” in ICCV, 2013.
  • [27] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, “Part-based visual tracking with online latent structural learning,” in CVPR, 2013.
  • [28] H. Nam, S. Hong, and B. Han, “Online graph-based tracking,” in ECCV, 2014.
  • [29] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” IJCV, vol. 77, no. 1-3, pp. 125–141, 2008.
  • [30] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” PAMI, vol. 25, no. 5, pp. 564–575, 2003.
  • [31] X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” PAMI, vol. 33, no. 11, pp. 2259–2272, 2011.
  • [32] X. Li, A. R. Dick, C. Shen, A. van den Hengel, and H. Wang, “Incremental learning of 3D-DCT compact representations for robust visual tracking,” PAMI, vol. 35, no. 4, pp. 863–881, 2013.
  • [33] H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in CVPR, 2016.
  • [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR, 2009.
  • [35] J. Kwon and K. M. Lee, “Highly nonrigid object tracking via patch-based dynamic appearance modeling,” PAMI, vol. 35, no. 10, pp. 2427–2441, 2013.
  • [36] L. Cehovin, M. Kristan, and A. Leonardis, “Robust visual tracking using an adaptive coupled-layer visual model,” PAMI, vol. 35, no. 4, pp. 941–953, 2013.
  • [37] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in CVPR, 2012.
  • [38] S. Avidan, “Support vector tracking,” PAMI, vol. 26, no. 8, pp. 1064–1072, 2004.
  • [39] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” PAMI, vol. 33, no. 8, pp. 1619–1632, 2011.
  • [40] S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in ICCV, 2011.
  • [41] J. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in ECCV, 2012.
  • [42] V. Mahadevan and N. Vasconcelos, “Biologically inspired object tracking using center-surround saliency mechanisms,” PAMI, vol. 35, no. 3, pp. 541–554, 2013.
  • [43] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, “Part-based visual tracking with online latent structural learning,” in CVPR, 2013.
  • [44] L. Zhang and L. van der Maaten, “Structure preserving object tracking,” in CVPR, 2013.
  • [45] J. Shi and C. Tomasi, “Good features to track,” in CVPR, 1994.
  • [46] Y. Lu, T. Wu, and S.-C. Zhu, “Online object tracking, learning and parsing with and-or graphs,” in CVPR, 2014.
  • [47] X. Li, W. Hu, C. Shen, Z. Zhang, A. R. Dick, and A. van den Hengel, “A survey of appearance models in visual object tracking,” CoRR, vol. abs/1303.4803, 2013.
  • [48] S. Baker and I. Matthews, “Lucas-Kanade 20 years on: A unifying framework,” IJCV, vol. 56, no. 3, pp. 221–255, 2004.
  • [49] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
  • [50] T. Ojala, M. Pietikainen, and D. Harwood, “Performance evaluation of texture measures with classification based on Kullback discrimination of distributions,” in ICPR, 1994.
  • [51] J. Kwon and K. M. Lee, “Highly nonrigid object tracking via patch-based dynamic appearance modeling,” PAMI, vol. 35, no. 10, pp. 2427–2441, 2013.
  • [52] H. Gong and S. C. Zhu, “Intrackability: Characterizing video statistics and pursuing video representations,” IJCV, vol. 97, no. 3, pp. 255–275, 2012.
  • [53] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM J. Sci. Comput., vol. 16, no. 5, pp. 1190–1208, 1995.
  • [54] C. Dubout and F. Fleuret, “Exact acceleration of linear object detectors,” in ECCV, 2012.
  • [55] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in CVPR, 2012.
  • [56] S. Stalder, H. Grabner, and L. van Gool, “Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition,” in ICCV Workshop, 2009.
  • [57] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in ECCV, 2002.
  • [58] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in ECCV, 2012.
  • [59] K. Zhang, L. Zhang, and M.-H. Yang, “Fast compressive tracking,” PAMI, vol. 36, no. 10, pp. 2002–2015, 2014.
  • [60] T. B. Dinh, N. Vo, and G. G. Medioni, “Context tracker: Exploring supporters and distracters in unconstrained environments,” in CVPR, 2011.
  • [61] L. Sevilla-Lara and E. Learned-Miller, “Distribution fields for tracking,” in CVPR, 2012.
  • [62] T. Vojir and J. Matas, “Robustifying the flock of trackers,” in Computer Vision Winter Workshop, 2011.
  • [63] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in CVPR, 2006.
  • [64] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust L1 tracker using accelerated proximal gradient approach,” in CVPR, 2012.
  • [65] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, “Locally orderless tracking,” in CVPR, 2012.
  • [66] S. He, Q. Yang, R. W. Lau, J. Wang, and M.-H. Yang, “Visual tracking via locality sensitive histograms,” in CVPR, 2013.
  • [67] B. Liu, J. Huang, L. Yang, and C. A. Kulikowski, “Robust tracking using local sparse appearance model and k-selection,” in CVPR, 2011.
  • [68] D. Wang, H. Lu, and M.-H. Yang, “Least soft-threshold squares tracking,” in CVPR, 2013.
  • [69] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in CVPR, 2012.
  • [70] H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via on-line boosting,” in BMVC, 2006.
  • [71] Y. Wu, B. Shen, and H. Ling, “Online robust image alignment via iterative convex optimization,” in CVPR, 2012.
  • [72] D. Wang and H. Lu, “Visual tracking via probability continuous outlier model,” in CVPR, 2014.
  • [73] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparsity-based collaborative model,” in CVPR, 2012.
  • [74] R. T. Collins, “Mean-shift blob tracking through scale space,” in CVPR, 2003.
  • [75] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in ECCV, 2008.
  • [76] R. T. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” PAMI, vol. 27, no. 10, pp. 1631–1643, 2005.
  • [77] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in CVPR, 2010.
  • [78] ——, “Tracking by sampling trackers,” in ICCV, 2011.
  • [79] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.”
  • [80] M. Kristan et al., “The visual object tracking VOT2013 challenge results,” 2013. [Online]. Available:
  • [81] ——, “The visual object tracking VOT2014 challenge results,” 2014. [Online]. Available:
  • [82] L. Cehovin, M. Kristan, and A. Leonardis, “Robust visual tracking using an adaptive coupled-layer visual model,” PAMI, vol. 35, no. 4, pp. 941–953, 2013.
  • [83] J. Xiao, R. Stolkin, and A. Leonardis, “An enhanced adaptive coupled-layer LGTracker++,” in Vis. Obj. Track. Challenge VOT2013, In conjunction with ICCV2013, 2013.
  • [84] M. Kristan et al., “The visual object tracking VOT2015 and TIR2015 challenge results,” 2015. [Online]. Available:
  • [85] T. Wu and S. C. Zhu, “A numerical study of the bottom-up and top-down inference processes in and-or graphs,” IJCV, vol. 93, no. 2, pp. 226–252, 2011.
  • [86] T. Wu and S.-C. Zhu, “Learning near-optimal cost-sensitive decision policy for object detection,” PAMI, vol. 37, no. 5, pp. 1013–1027, 2015.
  • [87] Y. Zhao and S. C. Zhu, “Image parsing with stochastic scene grammar,” in NIPS, 2011.
  • [88] M. Pei, Z. Si, B. Z. Yao, and S.-C. Zhu, “Learning and parsing video events with goal and intent prediction,” CVIU, vol. 117, no. 10, pp. 1369–1383, 2013.

Tianfu Wu received the Ph.D. degree in Statistics from the University of California, Los Angeles (UCLA) in 2011. He joined NC State University in August 2016 as a Chancellor’s Faculty Excellence Program cluster hire in Visual Narrative. He is currently an assistant professor in the Department of Electrical and Computer Engineering. His research focuses on the explainable and improvable visual Turing test and on robot autonomy through life-long communicative learning, pursuing a unified framework for machines to ALTER (Ask, Learn, Test, Explain, and Refine) recursively in a principled way: (i) statistical learning of large-scale and highly expressive hierarchical and compositional models from visual big data (images and videos); (ii) statistical inference by learning near-optimal cost-sensitive decision policies; and (iii) statistical theory of performance-guaranteed learning algorithms and optimally scheduled inference procedures.

Yang Lu is currently a Ph.D. student in the Center for Vision, Cognition, Learning and Autonomy at the University of California, Los Angeles. He received the B.S. and M.S. degrees in Computer Science from Beijing Institute of Technology, China, in 2009 and 2012, respectively. He received the University Fellowship from UCLA and National Fellowships from the Department of Education in China. His current research interests include computer vision and statistical machine learning, with a focus on statistical modeling of natural images and videos and on structure learning of hierarchical models.

Song-Chun Zhu received the Ph.D. degree from Harvard University in 1996. He is currently a professor of Statistics and Computer Science at UCLA, and director of the Center for Vision, Cognition, Learning and Autonomy. He has received a number of honors, including the J.K. Aggarwal Prize from the Int’l Association of Pattern Recognition in 2008 for “contributions to a unified foundation for visual pattern conceptualization, modeling, learning, and inference”, the David Marr Prize in 2003 with Z. Tu et al. for image parsing, and two Marr Prize honorary nominations, in 1999 for texture modeling and in 2007 for object modeling with Z. Si and Y.N. Wu. He received the Sloan Fellowship in 2001, a US NSF Career Award in 2001, and a US ONR Young Investigator Award in 2001. He received the Helmholtz Test-of-Time Award at ICCV 2013, and he has been a Fellow of the IEEE since 2011.
