Planar Object Tracking in the Wild: A Benchmark
Abstract
Planar object tracking is an actively studied problem in vision-based robotic applications. While several benchmarks have been constructed for evaluating state-of-the-art algorithms, there is a lack of video sequences captured in the wild rather than in constrained laboratory environments. In this paper, we present a carefully designed planar object tracking benchmark containing 210 videos of 30 planar objects sampled in the natural environment. In particular, for each object, we shoot seven videos involving various challenging factors, namely scale change, rotation, perspective distortion, motion blur, occlusion, out-of-view, and unconstrained. The ground truth is carefully annotated semi-manually to ensure its quality. Moreover, eleven state-of-the-art algorithms are evaluated on the benchmark using two evaluation metrics, with detailed analysis provided for the evaluation results. We expect the proposed benchmark to benefit future studies on planar object tracking.
I Introduction
Camera localization and environment modeling are fundamental problems in vision-based robotics. In theory, these tasks can be completed by tracking and then analyzing 3D structures in the input from visual sensors. In practice, however, tracking of 3D structures is by itself very challenging. Two-dimensional planar structures, instead, often serve as a reliable and reasonable surrogate. As a result, planar object tracking plays an important role in many vision-based robotic applications, such as visual servoing [1], visual SLAM [2], and UAV control [3], as well as related fields, e.g. augmented reality [4, 5].
Recently, several datasets have been provided for comprehensively evaluating planar tracking, including the Metaio dataset [6], the tracking manipulation tasks (TMT) dataset [7] and the planar texture dataset [8]. Though these datasets overcome the shortcoming of synthetic datasets, which cannot faithfully reproduce the real effects of every condition, all of them are constructed in laboratory environments (see Fig. 1). A disadvantage of datasets collected this way is that the background lacks diversity or is even artificial, while backgrounds in real-world scenarios can be much more complicated. Consequently, these datasets are insufficient for evaluating the effectiveness of planar object tracking algorithms in natural settings.
To address the above issue, in this paper we present a novel planar object tracking benchmark containing 210 video sequences collected in the wild; each sequence has 500 frames plus an additional frame for initialization. To construct the dataset, we first select 30 planar objects in natural scenes; then, for each object, we capture seven videos involving seven challenging factors. Six of the challenging factors are commonly encountered in practical applications, while the seventh is dedicated to an unconstrained condition that typically involves multiple challenging factors. To annotate the ground truth as precisely as possible, given the initial state of an object, we first run a keypoint-based object tracking algorithm using structured output learning [9] as an initial guess; then we manually check and revise the results to ensure accuracy, re-initializing the tracker when needed. We annotate every other frame of each sequence.
To understand the performance of the state of the art, we evaluate eleven modern tracking algorithms on the dataset. These algorithms fall into three groups: four keypoint-based planar object tracking algorithms [9, 10, 11, 12], four region-based (a.k.a. direct-method) planar object tracking algorithms [13, 14, 15, 16], and three generic object tracking algorithms [17, 18, 19]. We use two performance metrics to analyze the evaluation results in detail. One metric is based on four reference points and measures the misalignment between the ground truth state and the predicted state; the other measures the difference between the ground truth homography and the predicted homography. Note that we do not evaluate state-of-the-art generic object trackers such as those using deeply learned features [20, 21]. This is because such trackers, by outputting rectangular bounding boxes, aim at locating the target rather than providing its precise state.
In summary, our contributions are threefold: (1) systematically collecting a dataset containing 210 videos for planar object tracking in the wild; (2) providing accurate ground truth by annotating the data in a semi-automatic manner, with 52,710 frames annotated in total; and (3) evaluating eleven representative state-of-the-art algorithms with two performance metrics, and analyzing the results in detail according to seven different motion patterns. To the best of our knowledge, our benchmark not only is the largest one to date, but also is more realistic than previously proposed ones. The benchmark, along with the evaluation results, is made available for research purposes (http://www.dabi.temple.edu/~hbling/data/POT210/planar_benchmark.html).
II Related Work
II-A Previous benchmarks
With the advance of planar object tracking, it is crucial to provide benchmarks for evaluation purposes. Recently, there have been several such benchmarks relevant to our work [6], [7] and [8]. In [6], the authors collected 40 sequences with eight different texture patterns under five different dynamic behaviors. To annotate the ground truth precisely, a camera was mounted on a robotic measurement arm that recorded the camera pose. One limitation of using the measurement arm for annotation is that it cannot be deployed flexibly in natural environments.
To evaluate tracking algorithms for manipulation tasks, 100 sequences were collected and each sequence was tagged with different challenging factors in [7]. For annotation, three trackers were used to annotate the ground truth, and the coordinates of the four reference corners were determined when the coordinates reported by all the three trackers lay within a certain range. Such annotation avoids heavy manual work, but can be noisy especially for challenging sequences on which at least one tracker fails.
In [8], 96 sequences were collected with six planar textures under 16 different motion patterns each. To annotate the ground truth in a semiautomatic manner, a planar texture picture was held by a milled acrylic glass frame and there were four bright red balls on the frame as markers.
Besides the above three benchmark datasets, the authors of several papers focusing on tracking algorithms collected their own data for evaluation purpose. In [9], five sequences were collected and the ground truth was obtained using a SLAM system which could track the 3D camera pose in each frame. In [22], image sequences of three different objects were collected and the ground truth was annotated manually using the object corners. In [23], the authors used the five sequences from [9] and another four sequences collected by themselves to evaluate their algorithm.
It is worth mentioning that several benchmarks for generic object tracking have been proposed in recent years [24, 25, 26, 27]. However, all of these datasets provide rectangular bounding box annotation, and none of them can be used for evaluating planar object tracking algorithms.
To the best of our knowledge, our work is the first one providing a dataset for planar object tracking in the wild. Moreover, our dataset contains 210 sequences with careful annotation, and is much larger than previous ones.
[Fig. 2: the 30 planar objects used in the benchmark with their per-object scores: Painting2 (0.853), BusStop (0.844), IndegoStation (0.831), ShuttleStop (0.821), Lottery2 (0.798), SmokeFree (0.796), Painting1 (0.790), Map1 (0.788), Citibank (0.785), Snap (0.760), Fruit (0.735), Poster2 (0.733), Woman (0.724), Lottery1 (0.721), Pretzel (0.721), Coke (0.704), WalkYourBike (0.699), OneWay (0.697), NoStopping (0.690), StopSign (0.681), Map2 (0.676), Poster1 (0.659), Snack (0.643), Melts (0.640), Burger (0.624), Map3 (0.615), Sundae (0.615), Sunoco (0.595), Amish (0.594), Pizza (0.519).]
II-B Tracking algorithms
Current planar object tracking algorithms can be categorized into two main groups. The first group is keypoint-based. The algorithms in this group [9, 28, 23, 29, 10] often model an object with a set of keypoints (e.g., SIFT [11], SURF [12] and FAST [30]) and associated descriptors, and the tracking process consists of two steps. First, a set of correspondences between object and image keypoints is constructed through descriptor matching; then, the transformation of the object in the image is estimated from the correspondences using a robust geometric estimation algorithm such as RANSAC [31]. In [28], keypoint matching was formulated as a multi-class classification problem so that the computational burden was shifted to the training phase. In [23], to exploit temporal and spatial consistency during tracking, a robust keypoint-based appearance model was learned with a metric-learning-driven approach. The authors of [29] carefully modified the feature descriptors SIFT [11] and Ferns [32] so that they could run in real time on mobile phones. Recently, graph matching was integrated for keypoint matching in [33].
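The two-step pipeline described above (descriptor matching followed by robust geometric estimation) can be sketched in a few dozen lines. The following is an illustrative NumPy implementation of the direct linear transform (DLT) for homography estimation inside a minimal RANSAC loop; it is a simplified sketch, not the code of any tracker evaluated here, and the iteration count and inlier threshold are arbitrary choices.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H mapping src -> dst (N x 2 arrays, N >= 4) via the DLT."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A (smallest right singular vector).
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pts):
    """Apply homography H to an N x 2 array of points (projective division)."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, n_iters=200, thresh=3.0, seed=0):
    """Minimal RANSAC: sample 4 correspondences, keep the largest inlier set."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = homography_dlt(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best model (assumes at least one good sample).
    return homography_dlt(src[best_inliers], dst[best_inliers]), best_inliers
```

In practice, normalizing point coordinates before the DLT and adapting the number of RANSAC iterations to the inlier ratio improve numerical stability and speed.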
The second group of planar tracking algorithms is region-based, sometimes called direct methods. The algorithms in this group [13, 14, 34, 35, 15, 36, 16] directly estimate the transformation parameters by minimizing an error that measures the image similarity between the template and its projection in the image. In [34], both texture and contour information were used to construct the appearance model, and the 2D transformation was estimated by minimizing the error between the multi-cue template and the projected image patch. To deal with resolution degradation, the authors of [35] proposed to reconstruct the target model with an image sampling process. In [36], random forests were used to learn the relationship between the parameters modeling the motion of the target and the intensity changes of the template. This learning-based approach helps avoid local minima and handle partial occlusion. The authors of [37] provided a code framework for region-based trackers, also known as registration-based tracking or direct visual tracking, by decomposing such trackers into three modules: an appearance model, a state space model and a search method.
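To make the direct-method formulation concrete, below is a minimal Gauss-Newton minimization of the sum-of-squared-difference for a translation-only warp, the simplest special case of the transformations discussed above. Full planar trackers such as ESM optimize over all eight homography parameters, which this sketch does not attempt.

```python
import numpy as np

def warp_translate(img, p):
    """Sample img at positions shifted by p = (px, py), bilinear, clamped border."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs = xs + p[0]
    ys = ys + p[1]
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    ax = xs - x0
    ay = ys - y0
    def at(yy, xx):
        return img[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)]
    return ((1 - ax) * (1 - ay) * at(y0, x0) + ax * (1 - ay) * at(y0, x0 + 1)
            + (1 - ax) * ay * at(y0 + 1, x0) + ax * ay * at(y0 + 1, x0 + 1))

def lk_translation(template, image, p0=(0.0, 0.0), n_iters=50):
    """Gauss-Newton minimization of the SSD between template and warped image."""
    p = np.array(p0, dtype=float)
    for _ in range(n_iters):
        warped = warp_translate(image, p)
        gy, gx = np.gradient(warped)          # image gradients at warped positions
        err = (template - warped).ravel()     # photometric residual
        J = np.stack([gx.ravel(), gy.ravel()], axis=1)  # Jacobian w.r.t. (px, py)
        dp, *_ = np.linalg.lstsq(J, err, rcond=None)
        p += dp
        if np.linalg.norm(dp) < 1e-6:
            break
    return p
```

As the text notes, such local optimization only converges when the initial state is close to the true one; this is the reason direct methods have difficulty recovering from failure.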
In this paper, we select four keypoint-based [11, 12, 9, 10], four region-based [13, 14, 15, 16] and three generic object tracking algorithms [17, 18, 19] as representative trackers for evaluation. The details of these algorithms are given in Sec. IV.







III Dataset Design
III-A Dataset Construction
We use a smartphone (iPhone 5S) to record all the videos, and the camera is held by hand. The reason for using a smartphone is that it approaches everyday scenarios as closely as possible. The videos are recorded at 30 frames per second, and we downsample the video sequences to a lower resolution for efficiency¹. ¹By contrast, the frame sizes used in [6], [8] and [7] differ from ours.
We select 30 planar objects in natural scenes under different photometric environments, as shown in Fig. 2. As can be seen, the backgrounds of the selected objects vary a lot, especially compared with previous benchmarks as shown in Fig. 1. For each object, we shoot videos involving seven motion patterns so that the dataset can be used to systematically analyze the strengths and weaknesses of different tracking algorithms. The dataset contains 210 sequences in total, and each sequence has 500 frames plus an additional frame for initialization. The challenging factors involved are the following:

Scale change (SC): the distance between the camera and the target changes significantly (Fig. 3(a)).

Rotation (RT): rotating the camera and trying to keep the camera in the same plane during rotation (Fig. 3(b)).

Perspective distortion (PD): changing the perspective between the object and the camera (Fig. 3(c)).

Motion blur (MB): motion blur is generated by fast camera movement (Fig. 3(d)).

Occlusion (OCC): manually occluding the object while moving the camera (Fig. 3(e)).

Out-of-view (OV): part of the object is out of the image (Fig. 3(f)).

Unconstrained (UC): the camera moves freely, and the resulting video sequence may involve one or more of the above challenging factors (Fig. 3(g)).
It is worth noting that, as it is hard to control the illumination condition in natural environments, illumination variation is not included in the challenging factors.
III-B Annotating the ground truth
Following the popular strategy in planar object tracking [8], we define the tracking ground truth as a homography matrix that projects a point in the initial (reference) frame to its corresponding point in the current frame. To find the homography, we annotate four reference points (corners of the object) on the object in each frame. The natural environment prevents us from using a measurement arm [6], markers [8] or a SLAM system [9] to obtain the ground truth. In [7], three tracking algorithms were used to annotate the ground truth. Despite still requiring manual verification as the final step, this approach is not suitable for cases where the three algorithms fail to reach a correct consensus, especially in challenging scenarios. In this paper, we instead use a semi-automatic approach to annotate the ground truth. In particular, we annotate every other frame of each sequence, and the ground truth of 52,710 frames is produced in total.
Fig. 4 shows the user interface of our annotation tool. Besides the four corner points, we use four additional points located around the middle of the four edges to deal with occlusion and out-of-view. The top of Fig. 4 shows the initial eight points for reference; the bottom shows the current frame that needs annotation. The black margin around the image helps annotate frames in which the target is partly out of view. The annotation consists of two steps:

Step 1: Run the keypoint-based algorithm [9] to get an initial estimate of the object state. The algorithm is manually re-initialized when necessary so that it can better adapt to changes of the object state.

Step 2: Select four out of the eight reference points and manually fine-tune their positions, then re-estimate the homography transformation with the selected points. The global shape of the object is also taken into account when it is occluded or out of view.
Note that in step 2: (1) the four corner points are selected first if they are visible in the image; (2) the initial four middle points might not remain at the middle after the homography transformation, so when we use the middle points, we also take into consideration the context around their initial positions in the reference frame; (3) we mark frames in which more than half of the target is invisible (occluded or out of view, Fig. 5(a)) and frames that are heavily blurred (Fig. 5(b)). Such marked frames are not used for evaluation.
In general, after excluding frames in which more than half of the target is invisible or which are heavily blurred, as shown in Fig. 5, the above annotation approach generates accurate ground truth with a manageable amount of human labor.
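For illustration, the homography re-estimation of step 2 from exactly four point correspondences reduces to solving an 8×8 linear system. The sketch below assumes the normalization h₃₃ = 1 and is not the annotation tool's actual code:

```python
import numpy as np

def homography_from_4pts(src, dst):
    """Exact homography from four point pairs, fixing h33 = 1.

    Each correspondence (x, y) -> (u, v) gives two linear equations:
        h11*x + h12*y + h13 - u*(h31*x + h32*y) = u
        h21*x + h22*y + h23 - v*(h31*x + h32*y) = v
    Fails only for degenerate configurations (e.g., three collinear points).
    """
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.asarray(A, dtype=float), np.asarray(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)
```

With more than four points (or when h₃₃ may be near zero), a least-squares formulation via SVD is the more robust choice.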
IV Evaluation
IV-A Selected trackers
To study the performance of modern visual trackers for planar object tracking, we select eleven representative algorithms from three different groups.
Keypointbased planar tracking:
SIFT [11] and SURF [12]: To evaluate the performance of SIFT and SURF for planar object tracking on our benchmark, we follow the traditional keypoint-based tracking pipeline and use OpenCV for the implementation. These two trackers consist of three steps: (1) keypoint detection; (2) keypoint matching via nearest neighbor search; and (3) homography estimation using RANSAC [31].
FERNS [10]: FERNS formulates the keypoint recognition problem in a naive Bayesian classification framework. The appearance of the image patch surrounding a keypoint is described by hundreds of simple binary features (ferns), each depending on the intensities of two pixels, and the class posterior probabilities are estimated. By shifting the computational burden to the training stage as in [28], the classification of keypoints can be performed very fast.
SOL [9]: Structured output learning (SOL) is used to combine keypoints matching and transformation estimation in a unified framework. The adopted linear structured SVM algorithm allows the object model to adapt to a given environment quickly. To speed up the algorithm, the classifier is approximated with a set of binary vectors and the binary descriptor BRIEF [38] is used for keypoint matching. The keypoints are extracted by FAST [30]. With binary representation and Hamming distance similarity, matching can be performed extremely fast using bitwise operations.
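The speed of binary-descriptor matching comes from the fact that Hamming distance on packed bits reduces to XOR plus a population count. The following sketch illustrates this for BRIEF-style descriptors packed into bytes; it is an illustration of the idea, not SOL's implementation:

```python
import numpy as np

def hamming_matrix(desc_a, desc_b):
    """Pairwise Hamming distances between two sets of packed binary
    descriptors (uint8 arrays of shape (N, n_bytes) and (M, n_bytes))."""
    # XOR exposes the differing bits; count them per byte via a lookup table.
    popcount = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    return popcount[xor].sum(axis=2)

def match_nearest(desc_a, desc_b):
    """For each descriptor in desc_a, the index of its Hamming-nearest
    neighbor in desc_b."""
    return hamming_matrix(desc_a, desc_b).argmin(axis=1)
```

On hardware with a popcount instruction, the lookup table is replaced by a single instruction per word, which is what makes this kind of matching "extremely fast" in practice.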
Regionbased planar tracking:
ESM [14]: The transformation parameters in [14] are estimated by minimizing the sum-of-squared-difference between a given template and the current image. To solve the optimization problem efficiently, efficient second-order minimization (ESM) is used to estimate a second-order approximation of the cost function. Compared with the Newton method, ESM does not need to compute the Hessian and has a higher convergence rate.
IC [15]: To avoid re-evaluating the Hessian in every iteration of the Lucas-Kanade image alignment algorithm [39], the inverse compositional (IC) algorithm switches the roles of the template and the image. The resulting optimization problem has a constant Hessian, which can be precomputed. A proof of the equivalence between IC and Lucas-Kanade is provided in [15].
SCV [13]: Being invariant to nonlinear illumination variation, the sum of conditional variance (SCV) is employed to measure the similarity between a given template and the current image in [13]. The SCV tracker can be viewed as an extension of ESM.
GOESM [16]: As gradient orientations (GO) are robust to illumination changes, they are used in GOESM along with denoising techniques to model the appearance of the target. GOESM also generalizes ESM to multi-dimensional features.
Generic object tracking:
GPF [17]: Using deterministic optimization to estimate the spatial transformation for template-based tracking can result in local optima. To overcome this limitation, the authors of [17] formulate the problem in a geometric particle filter (GPF) framework on a matrix Lie group. GPF combines the incremental PCA model [19] with a normalized cross-correlation (NCC) score to model the appearance of the target.
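The NCC score mentioned above can be illustrated in a few lines; the sketch below is a generic implementation, not GPF's code. NCC is invariant to affine changes of patch intensity, which is why it is a popular appearance similarity measure:

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between two same-sized patches, in [-1, 1]."""
    a = patch.astype(float).ravel()
    b = template.astype(float).ravel()
    a = a - a.mean()   # remove mean: invariance to brightness offset
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)  # normalize: contrast invariance
    return float(a @ b / denom) if denom > 0 else 0.0
```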
IVT [19]: To deal with appearance changes of the target, IVT uses an incremental PCA algorithm to update its appearance model, an eigenbasis. The algorithm estimates an affine transformation for each frame with a particle filter.
L1APG [18]: To efficiently solve the ℓ1-norm minimization problem arising from the sparse linear representation of the target appearance, and to improve robustness, L1APG uses a mixed norm and an efficient optimization method based on the accelerated proximal gradient (APG) approach. L1APG also estimates an affine transformation for each frame.
Note that all three generic tracking algorithms are template-based and can be attributed to the region-based group. For all of the above eleven algorithms except SIFT [11] and SURF [12], we use their publicly available source code. For ESM [14], IC [15] and SCV [13], we increase the maximum number of iterations for solving the optimization problem to 200; for the other trackers, we use their default parameter settings. The number of iterations originally used by GOESM [16] is 200.
IV-B Evaluation metrics
In this paper, we use the following two metrics to analyze the results quantitatively.
Alignment error. The alignment error is based on the four reference points (the four corners of the object) and is defined as the root of the mean squared distances between the estimated positions of the points and their ground truth [6, 7]:

$e_{AL} = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\|\mathbf{x}_i - \mathbf{x}_i^*\|^2}$,  (1)

where $\mathbf{x}_i$ is the estimated position of the $i$-th reference point and $\mathbf{x}_i^*$ is its ground truth position.
The precision plot has recently been adopted to evaluate tracking algorithms for general purposes [24]. In this work, we draw the precision plot based on the alignment error: it shows the percentage of frames whose $e_{AL}$ is smaller than a threshold $t_p$. We use the precision at a representative threshold as the precision score of each algorithm.
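As an illustration, the alignment error of Eq. (1) and the corresponding precision score can be computed as follows; this is a sketch, and the threshold in the example is arbitrary:

```python
import numpy as np

def alignment_error(pred_pts, gt_pts):
    """RMS distance over the four reference points, Eq. (1)."""
    diff = np.asarray(pred_pts, dtype=float) - np.asarray(gt_pts, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def precision(per_frame_errors, threshold):
    """Fraction of frames whose alignment error is below the threshold;
    sweeping the threshold produces the precision plot."""
    e = np.asarray(per_frame_errors, dtype=float)
    return float((e < threshold).mean())
```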
Degree of difficulty of each object. To rank the 30 planar objects used in our benchmark as shown in Fig. 2, we quantitatively derive the degree of difficulty (DoD) of each object. Specifically, during the evaluation process, the precision score at a fixed threshold $t_p$ is recorded for each sequence and each tracker. Then, given an object, its degree of difficulty is derived by aggregating these precision scores over all trackers and all of its sequences.
Homography discrepancy. The homography discrepancy measures the difference between the ground truth homography $\mathbf{H}^*$ and the predicted one $\mathbf{H}$, and is defined as [9]:

$e_H(\mathbf{H}, \mathbf{H}^*) = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\|\mathbf{H}\mathbf{x}_i - \mathbf{H}^*\mathbf{x}_i\|^2}$,  (2)

where $\mathbf{x}_i$, $i = 1, \ldots, 4$, are the corners of a square. $e_H$ is 0 if $\mathbf{H}$ and $\mathbf{H}^*$ are identical. The success rate of a tracker on a sequence is the percentage of frames whose homography discrepancy is less than a threshold $t_s$. We generate the success plot by varying $t_s$ from 0 to 200. Following [9], the success rate at a fixed threshold is used as a representative score.
Note: (1) the same $t_s$ for the success rate of different sequences may correspond to different $t_p$ for the precision score; (2) the chosen $t_s$ is a very tight threshold, as shown by the illustrative examples in Fig. 8; and (3) as there is no correspondence between $t_s$ and $t_p$, there are inconsistencies between the ranks of trackers in Fig. 6 and Fig. 7.
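Similarly, the homography discrepancy of Eq. (2) and the success rate can be sketched as below; the corner coordinates of the square are an illustrative choice, as the exact square used in [9] is not reproduced here:

```python
import numpy as np

def homography_discrepancy(H_pred, H_gt, corners=None):
    """RMS distance between the images of a square's corners under the
    predicted and ground-truth homographies, Eq. (2)."""
    if corners is None:
        corners = np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]], dtype=float)
    def proj(H, pts):
        p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
        return p[:, :2] / p[:, 2:3]
    d2 = ((proj(H_pred, corners) - proj(H_gt, corners)) ** 2).sum(axis=1)
    return float(np.sqrt(d2.mean()))

def success_rate(per_frame_scores, threshold):
    """Fraction of frames whose discrepancy is below the threshold; sweeping
    the threshold from 0 to 200 produces the success plot."""
    return float((np.asarray(per_frame_scores, dtype=float) < threshold).mean())
```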
[Fig. 8: illustrative examples with homography discrepancy scores of (a) 8.05, (b) 85.75, and (c) 315.75.]
IV-C Results and analysis







IV-C.1 Comparison with respect to different challenges
Fig. 6 shows the comparison among the eleven trackers by precision plots, both on subsets of sequences grouped by motion pattern and on all sequences. In addition, the success plots of each tracker using the homography discrepancy are reported in Fig. 7. It is worth noting that the performance of the generic object trackers IVT [19] and L1APG [18] is obviously worse than that of the other trackers. One possible reason is that the parameters of these two trackers are tuned for tracking scenarios that only require coarse bounding box estimation; another is that the adopted affine transformation, with six degrees of freedom, is not sufficient for very accurate results. In the following, we therefore use the performance of the other nine trackers for analysis. Also, as the alignment error is easier to interpret perceptually and the success rate at the chosen $t_s$ is very tight (as shown in Fig. 8), we mainly use the precision plots to analyze the results; the success plots are displayed for further validation.
For scale change (Fig. 6(a)), GPF performs best and FERNS achieves comparable performance. Though SURF is also designed to be scale invariant, its performance on this subset is not promising. For the rotation subset (Fig. 6(b)), although SIFT, FERNS and SURF are all designed to be rotation-invariant, SIFT outperforms the other two algorithms by a large margin. Also, SCV and GPF achieve better results than the other region-based trackers. The relatively inferior performance of SOL is likely due to the BRIEF descriptor's lack of invariance to in-plane rotation [38].
On the perspective distortion subset (Fig. 6(c)), all the keypoint-based trackers outperform all the region-based trackers. The performance of SIFT, FERNS and GPF drops noticeably compared with scale change or rotation. SIFT itself is not designed to be invariant to perspective distortion. During its training stage, FERNS generates training samples with randomly picked affine transformations; however, perspective distortion corresponds to a general homography, which affine warps cannot fully cover. SCV and ESM have similar performance across these three motion patterns.
Motion blur (Fig. 6(d)) is the most challenging motion pattern for all eleven trackers. As motion blur deteriorates the quality of the entire image, it is difficult for keypoint-based trackers to detect useful keypoints, and for region-based trackers to measure the similarity between image patches effectively.
For occlusion (Fig. 6(e)) and out-of-view (Fig. 6(f)), the keypoint-based trackers perform obviously better than the region-based trackers. This is consistent with the fact that it is still possible to obtain a set of correspondences between the target and image keypoints when occlusion appears or the target is partly out of view, and these correspondences are accurate enough to estimate the geometric transformation correctly. For region-based trackers, however, both occlusion and out-of-view cause large appearance variations.
According to the performance on the unconstrained subset (Fig. 6(g)) and on all sequences (Fig. 6(h)), the keypoint-based trackers are in general more robust than the region-based trackers. The obvious performance difference can be attributed to two reasons: (1) though the image similarity measures SCV adopted by [13] and GO adopted by [16] are robust to illumination variations, their robustness is not comparable to that of state-of-the-art keypoint detectors and descriptors or ferns; and (2) the keypoint-based algorithms use a tracking-by-detection strategy, so detection in the current frame depends little on the object location in previous frames; by contrast, the region-based algorithms use the previous object state to reduce the optimization space for efficiency. It is thus easier for keypoint-based trackers to recover from failure than for region-based trackers.
Also, among the ESM-based algorithms [14, 13, 16], SCV [13] is a little better than the original ESM tracker [14], which uses the sum-of-squared-difference as the appearance similarity measure. Though gradient orientations are robust to illumination change, the overall performance of GOESM [16] is worse than that of ESM [14]. At the same time, ESM, SCV and GOESM perform better than IC [15], implying that the efficient second-order minimization approach is better suited to the planar object tracking task than the inverse compositional optimization approach. Some failure cases for different motion patterns are shown in Fig. 9.
IV-C.2 Overall performance of trackers in each group
We summarize the overall performance of the trackers in each group by average precision plots in Fig. 10(a) and Fig. 10(b), respectively. Note that we include the GPF tracker [17] in the region-based group, and we do not consider IVT [19] and L1APG [18] in these two figures. We rank the performance with respect to different challenging factors using the precision score at a fixed threshold $t_p$.
The average precision plot of the keypoint-based trackers [11, 9, 10, 12] in Fig. 10(a) shows that they are more robust to occlusion, rotation and out-of-view than to the other challenging factors. This is consistent with the better performance of keypoint-based trackers on these three subsets as shown in Fig. 6(e), Fig. 6(b) and Fig. 6(f), respectively. The most challenging situation for the keypoint-based trackers is motion blur, as it heavily affects the repeatability of the keypoints and the associated appearance description.
The average precision plot of the region-based trackers [13, 14, 15, 16, 17] is given in Fig. 10(b). It shows that the region-based trackers are more robust to scale change, rotation and perspective distortion than to occlusion and out-of-view. This observation is consistent with the fact that the region-based trackers find the transformation by directly minimizing an error that measures the similarity between the entire template and the image, and occlusion and out-of-view largely increase the dissimilarity between the template and the corresponding image patch after alignment. Motion blur remains the most challenging factor due to appearance corruption and large displacements of the target.
V Conclusion
In this paper, we present a benchmark for evaluating planar object tracking algorithms in the wild. The dataset is constructed according to seven different challenging factors so that the performance of trackers can be investigated thoroughly. We design a semi-manual approach to annotate the ground truth accurately. We also evaluate eleven state-of-the-art algorithms on the dataset with two metrics and give a detailed analysis. The evaluation results show that there is large room for improvement for all algorithms. We expect our work to provide a dataset and motivation for future studies on planar object tracking in unconstrained natural environments.
Acknowledgment
This work is supported by China National Key Research and Development Plan (Grant No. 2016YFB1001200).
References
 [1] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” TRA, 1996.
 [2] A. Concha and J. Civera, “DPPTAM: Dense piecewise planar tracking and mapping from a monocular sequence,” in IROS, 2015.
 [3] I. F. Mondragón, P. Campoy, C. Martinez, and M. A. Olivares-Méndez, “3D pose estimation based on planar object tracking for UAVs control,” in ICRA, 2010.
 [4] G. Klein and D. Murray, “Parallel tracking and mapping for small AR workspaces,” in ISMAR, 2007.
 [5] H. Kato and M. Billinghurst, “Marker tracking and HMD calibration for a video-based augmented reality conferencing system,” in IWAR, 1999.
 [6] S. Lieberknecht, S. Benhimane, P. Meier, and N. Navab, “A dataset and evaluation methodology for template-based tracking algorithms,” in ISMAR, 2009.
 [7] A. Roy, X. Zhang, N. Wolleb, C. P. Quintero, and M. Jägersand, “Tracking benchmark and evaluation for manipulation tasks,” in ICRA, 2015.
 [8] S. Gauglitz, T. Höllerer, and M. Turk, “Evaluation of interest point detectors and feature descriptors for visual tracking,” IJCV, 2011.
 [9] S. Hare, A. Saffari, and P. H. Torr, “Efficient online structured output learning for keypointbased object tracking,” in CVPR, 2012.
 [10] M. Ozuysal, M. Calonder, V. Lepetit, and P. Fua, “Fast keypoint recognition using random ferns,” PAMI, 2010.
 [11] D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” IJCV, 2004.
 [12] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” CVIU, 2008.
 [13] R. Richa, R. Sznitman, R. Taylor, and G. Hager, “Visual tracking using the sum of conditional variance,” in IROS, 2011.
 [14] S. Benhimane and E. Malis, “Real-time image-based tracking of planes using efficient second-order minimization,” in IROS, 2004.
 [15] S. Baker and I. Matthews, “Lucas-Kanade 20 years on: A unifying framework,” IJCV, 2004.
 [16] L. Chen, F. Zhou, Y. Shen, X. Tian, H. Ling, and Y. Chen, “Illumination insensitive efficient secondorder minimization for planar object tracking,” in ICRA, 2017.
 [17] J. Kwon, H. S. Lee, F. C. Park, and K. M. Lee, “A geometric particle filter for templatebased visual tracking,” PAMI, 2014.
 [18] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust L1 tracker using accelerated proximal gradient approach,” in CVPR, 2012.
 [19] D. A. Ross, J. Lim, R.S. Lin, and M.H. Yang, “Incremental learning for robust visual tracking,” IJCV, 2008.
 [20] H. Nam and B. Han, “Learning multidomain convolutional neural networks for visual tracking,” in CVPR, 2016.
 [21] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.H. Yang, “Hedged deep tracking,” in CVPR, 2016.
 [22] K. Zimmermann, J. Matas, and T. Svoboda, “Tracking by an optimal sequence of linear predictors,” PAMI, 2009.
 [23] L. Zhao, X. Li, J. Xiao, F. Wu, and Y. Zhuang, “Metric learning driven multitask structured output optimization for robust keypoint tracking,” in AAAI, 2015.
 [24] Y. Wu, J. Lim, and M.H. Yang, “Object tracking benchmark,” PAMI, 2015.
 [25] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: An experimental survey,” PAMI, 2014.
 [26] A. Li, M. Lin, Y. Wu, M.-H. Yang, and S. Yan, “NUS-PRO: A new visual tracking challenge,” PAMI, 2016.
 [27] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual object tracking VOT2015 challenge results,” in ICCV Workshops, 2015.
 [28] V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,” PAMI, 2006.
 [29] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Real-time detection and tracking for augmented reality on mobile phones,” TVCG, 2010.
 [30] E. Rosten, R. Porter, and T. Drummond, “Faster and better: A machine learning approach to corner detection,” PAMI, 2010.
 [31] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, 1981.
 [32] M. Ozuysal, P. Fua, and V. Lepetit, “Fast keypoint recognition in ten lines of code,” in CVPR, 2007.
 [33] T. Wang and H. Ling, “Gracker: A Graphbased Planar Object Tracker,” PAMI, in press.
 [34] M. Pressigout and E. Marchand, “Real time planar structure tracking for visual servoing: a contour and texture approach,” in IROS, 2005.
 [35] E. Ito, T. Okatani, and K. Deguchi, “Accurate and robust planar tracking based on a model of image sampling and reconstruction process,” in ISMAR, 2011.
 [36] D. J. Tan and S. Ilic, “Multiforest tracker: A chameleon in tracking,” in CVPR, 2014.
 [37] A. Singh and M. Jagersand, “Modular tracking framework: A unified approach to registration based tracking,” arXiv preprint arXiv:1602.09130, 2016.
 [38] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in ECCV, 2010.
 [39] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI, 1981.