Planar Object Tracking in the Wild: A Benchmark

Pengpeng Liang, Yifan Wu, Hu Lu, Liming Wang, Chunyuan Liao, Haibin Ling

Pengpeng Liang and Liming Wang are with School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China. {ieppliang, ielmwang}
Yifan Wu and Haibin Ling are with Computer & Information Sciences Department, Temple University, Philadelphia, USA. {yifan.wu, hbling}@temple.edu
HiScene Information Technologies, Shanghai 201210, China.
Hu Lu is with School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212003, China.
*Corresponding author.

Planar object tracking is an actively studied problem in vision-based robotic applications. While several benchmarks have been constructed for evaluating state-of-the-art algorithms, there is a lack of video sequences captured in the wild rather than in constrained laboratory environments. In this paper, we present a carefully designed planar object tracking benchmark containing 210 videos of 30 planar objects sampled in natural environments. In particular, for each object, we shoot seven videos involving various challenging factors, namely scale change, rotation, perspective distortion, motion blur, occlusion, out-of-view, and unconstrained. The ground truth is carefully annotated semi-manually to ensure its quality. Moreover, eleven state-of-the-art algorithms are evaluated on the benchmark using two evaluation metrics, with detailed analysis provided for the evaluation results. We expect the proposed benchmark to benefit future studies on planar object tracking.

I Introduction

Camera localization and environment modeling are fundamental problems in vision-based robotics. In theory, these tasks can be completed by tracking and then analyzing 3D structures in the input from visual sensors. In practice, however, tracking of 3D structures is by itself very challenging. Two-dimensional planar structures, instead, often serve as a reliable and reasonable surrogate. As a result, planar object tracking plays an important role in many vision-based robotic applications, such as visual servoing [1], visual SLAM [2], and UAV control [3], as well as related fields, e.g. augmented reality [4, 5].

Recently, several datasets have been provided for comprehensively evaluating planar tracking, including the Metaio dataset [6], the tracking manipulation tasks (TMT) dataset [7] and the planar texture dataset [8]. Though these datasets overcome the shortcomings of synthetic datasets, which cannot faithfully reproduce the real effects of every condition, all of them are constructed in laboratory environments (see Fig. 1). A disadvantage of datasets collected this way is that the background lacks diversity or is even artificial, while real-world scenarios can be much more complicated. Consequently, these datasets are insufficient for evaluating the effectiveness of planar object tracking algorithms in natural settings.

(a) The Metaio dataset [6]
(b) The TMT dataset [7]
(c) The planar texture dataset [8]
(d) The proposed benchmark
Fig. 1: Sample frames from three representative benchmarks and ours. Note: frames in the Metaio dataset have artificial white background by design, and we draw the image boundary for better illustration.

To address the above issue, in this paper, we present a novel planar object tracking benchmark containing 210 video sequences collected in the wild, where each sequence has 500 frames plus an additional frame for initialization. To construct the dataset, we first select 30 planar objects in natural scenes; then, for each object, we capture seven videos involving seven challenging factors. Six of the challenging factors are commonly encountered in practical applications, while the seventh is dedicated to an unconstrained condition, typically involving multiple challenging factors. To annotate the ground truth as precisely as possible, given the initial state of an object, we first run a keypoint-based object tracking algorithm using structured output learning [9] to obtain an initial guess; then we manually check and revise the results to ensure accuracy, re-initializing the tracker if needed. We annotate every other frame for each sequence.

To understand the performance of state-of-the-art methods, we evaluate eleven modern tracking algorithms on the dataset. These algorithms include three types of trackers: four keypoint-based planar object tracking algorithms [9, 10, 11, 12], four region-based (a.k.a. direct methods) planar object tracking algorithms [13, 14, 15, 16], and three generic object tracking algorithms [17, 18, 19]. We use two performance metrics to analyze the evaluation results in detail. One metric is based on four reference points and measures the distance of misalignment between the ground truth state and the predicted state; the other is the difference between the ground truth homography and the predicted homography. Note that we do not evaluate state-of-the-art generic object trackers such as trackers using deep learned features [20, 21]. This is because such trackers, by outputting rectangular bounding boxes, aim at locating the target rather than providing the precise state of the target.

In summary, our contributions are three-fold: (1) systematically collecting a dataset containing 210 videos for planar object tracking in the wild; (2) providing accurate ground truth by annotating the data in a semi-automatic manner, with 52,710 frames annotated in total; and (3) evaluating eleven representative state-of-the-art algorithms with two performance metrics, and analyzing the results in detail according to seven different motion patterns. To the best of our knowledge, our benchmark is not only the largest one to date, but also more realistic than previously proposed ones. The benchmark, along with the evaluation results, is made available for research purposes (

In the rest of the paper, we first summarize related work in Sec. II and then introduce details of the dataset in Sec. III. The evaluation and the analysis of the results are described in Sec. IV. Finally, we conclude the paper in Sec. V.

II Related Work

II-A Previous benchmark

With the advance of planar object tracking, it is crucial to provide benchmarks for evaluation purposes. Recently, several such benchmarks relevant to our work have been proposed [6, 7, 8]. In [6], the authors collected 40 sequences with eight different texture patterns under five different dynamic behaviors. To annotate the ground truth precisely, a camera was mounted on a robotic measurement arm which could record the camera pose. One limitation of using the measurement arm for annotation is that it is difficult to deploy flexibly in natural environments.

To evaluate tracking algorithms for manipulation tasks, 100 sequences were collected and each sequence was tagged with different challenging factors in [7]. For annotation, three trackers were used to annotate the ground truth, and the coordinates of the four reference corners were determined when the coordinates reported by all the three trackers lay within a certain range. Such annotation avoids heavy manual work, but can be noisy especially for challenging sequences on which at least one tracker fails.

In [8], 96 sequences were collected with six planar textures under 16 different motion patterns each. To annotate the ground truth in a semi-automatic manner, a planar texture picture was held by a milled acrylic glass frame and there were four bright red balls on the frame as markers.

Besides the above three benchmark datasets, the authors of several papers focusing on tracking algorithms collected their own data for evaluation purpose. In [9], five sequences were collected and the ground truth was obtained using a SLAM system which could track the 3D camera pose in each frame. In [22], image sequences of three different objects were collected and the ground truth was annotated manually using the object corners. In [23], the authors used the five sequences from [9] and another four sequences collected by themselves to evaluate their algorithm.

It is worth mentioning that several benchmarks for generic object tracking have been proposed in recent years [24, 25, 26, 27]. However, all of these datasets provide rectangular bounding box annotation, and none of them can be used for evaluating planar object tracking algorithms.

To the best of our knowledge, our work is the first one providing a dataset for planar object tracking in the wild. Moreover, our dataset contains 210 sequences with careful annotation, and is much larger than previous ones.

Painting-2, 0.853 BusStop, 0.844 IndegoStation, 0.831 ShuttleStop, 0.821 Lottery-2, 0.798 SmokeFree, 0.796
Painting-1, 0.790 Map-1, 0.788 Citibank, 0.785 Snap, 0.760 Fruit, 0.735 Poster-2, 0.733
Woman, 0.724 Lottery-1, 0.721 Pretzel, 0.721 Coke, 0.704 WalkYourBike, 0.699 OneWay, 0.697
NoStopping, 0.690 StopSign, 0.681 Map-2, 0.676 Poster-1, 0.659 Snack, 0.643 Melts, 0.640
Burger, 0.624 Map-3, 0.615 Sundae, 0.615 Sunoco, 0.595 Amish, 0.594 Pizza, 0.519
Fig. 2: The 30 planar objects (in green bounding box) in our dataset, ordered from hardest to easiest according to the degree of difficulty (Sec. IV-B).

II-B Tracking algorithms

Current planar object tracking algorithms can be categorized into two main groups. The first group is keypoint-based. The algorithms in this group [9, 28, 23, 29, 10] often model an object with a set of keypoints (e.g., SIFT [11], SURF [12] and FAST [30]) and associated descriptors, and the tracking process consists of two steps. First, a set of correspondences between object and image keypoints is constructed through descriptor matching; then, the transformation of the object in the image is estimated from the correspondences using a robust geometric estimation algorithm such as RANSAC [31]. In [28], keypoint matching was formulated as a multi-class classification problem so that the computational burden was shifted to the training phase. In [23], to utilize the temporal and spatial consistency during the tracking process, a robust keypoint-based appearance model was learned with a metric learning driven approach. The authors of [29] carefully modified the feature descriptors SIFT [11] and Ferns [32] so that they could work at real-time speed on mobile phones. More recently, graph matching was integrated into keypoint matching in [33].

The second group of planar tracking algorithms is region-based; these algorithms are sometimes called direct methods. The algorithms in this group [13, 14, 34, 35, 15, 36, 16] directly estimate the transformation parameters by minimizing an error that measures the image similarity between the template and its projection in the image. In [34], both texture and contour information were used to construct the appearance model, and the 2D transformation was estimated by minimizing the error between the multi-cue template and the projected image patch. To deal with resolution degradation, the authors of [35] proposed to reconstruct the target model with an image sampling process. In [36], a random forest was used to learn the relationship between the parameters modeling the motion of the target and the image intensity change of the template. This learning-based approach helps to avoid local minima and to handle partial occlusion. The authors of [37] provided a code framework for region-based trackers, also known as registration-based tracking or direct visual tracking, by decomposing this kind of tracker into three modules: an appearance model, a state space model and a search method.

In this paper, we select four keypoint-based [11, 12, 9, 10], four region-based [13, 14, 15, 16] and three generic object tracking algorithms [17, 18, 19] as representative trackers in evaluation. The details of these algorithms are given in Sec. IV.

(a) Scale change
(b) Rotation
(c) Perspective distortion
(d) Motion blur
(e) Occlusion
(f) Out-of-view
(g) Unconstrained
Fig. 3: Example frames for different challenging factors.

III Dataset Design

III-A Dataset Construction

We use a smartphone (iPhone 5S) to record all the videos, with the camera held by hand. The reason for using a smartphone is that it approximates everyday scenarios as closely as possible. The videos are recorded at 30 frames per second with a resolution of , and we resample the video sequences to for efficiency (see footnote 1).

Footnote 1: By contrast, the frame size in [6] and [8] is ; and the frame size in [7] is .

We select 30 planar objects in natural scenes under different photometric conditions, as shown in Fig. 2. The backgrounds of the selected objects vary considerably, especially when compared with previous benchmarks as shown in Fig. 1. For each object, we shoot videos involving seven motion patterns so that the dataset can be used to systematically analyze the strengths and weaknesses of different tracking algorithms. The dataset contains 210 sequences in total, and each sequence has 500 frames plus an additional frame for initialization. The following are the challenging factors involved:

  • Scale change (SC): the distance between the camera and the target changes significantly (Fig. 3(a)).

  • Rotation (RT): rotating the camera and trying to keep the camera in the same plane during rotation (Fig. 3(b)).

  • Perspective distortion (PD): changing the perspective between the object and the camera (Fig. 3(c)).

  • Motion blur (MB): motion blur is generated by fast camera movement (Fig. 3(d)).

  • Occlusion (OCC): manually occluding the object while moving the camera (Fig. 3(e)).

  • Out-of-view (OV): part of the object is out of the image (Fig. 3(f)).

  • Unconstrained (UC): moving the camera freely and the resulting video sequence may involve one or more of the above challenging factors (Fig. 3(g)).

It is worth noting that as it is hard to control the illumination condition in natural environment, illumination variation is not included in the challenging factors.

III-B Annotating the ground truth

(a) Normal
(b) Occlusion
(c) Out-of-view
Fig. 4: The user interface of our annotation tool for different situations.

Following the popular strategy in planar object tracking [8], we define the tracking ground truth as a transformation matrix that projects a point in frame to its corresponding point in frame . To find the homography, we annotate four reference points (corners of the object) on the object in each frame. The natural environment prevents us from using a measurement arm [6], markers [8] or a SLAM system [9] to obtain the ground truth. In [7], three tracking algorithms were used to annotate the ground truth. Although it still requires manual verification as the final step, this approach is not suitable for cases where the three algorithms fail to reach a correct consensus, especially in challenging scenarios. In this paper, we use a semi-automatic approach to annotate the ground truth. In particular, we annotate every other frame for each sequence, and the ground truth of 52,710 frames is produced in total.
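The frame count quoted above can be checked with a short calculation. This is only a sketch: it assumes that "every other frame" means frames 0, 2, 4, ..., 500 of each 501-frame sequence (500 frames plus the initialization frame), which is an assumption that happens to reproduce the stated total.

```python
SEQUENCES = 210
FRAMES_PER_SEQUENCE = 500 + 1  # 500 frames plus one initialization frame

# Assumption: "every other frame" means frame indices 0, 2, 4, ..., 500.
annotated_per_sequence = len(range(0, FRAMES_PER_SEQUENCE, 2))
total_annotated = SEQUENCES * annotated_per_sequence

print(annotated_per_sequence, total_annotated)  # 251 52710
```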

Fig. 4 shows the user interface of our annotation tool. Besides the four corner points, we use four additional points located around the middle of the four edges to deal with occlusion and out-of-view. The top part of the interface shows the initial eight points for reference, and the bottom part shows the current frame that needs annotation. The black margin around the image helps to annotate frames in which the object is out-of-view. The annotation contains two steps:

  • Step 1: Run the keypoint-based algorithm [9] to get an initial estimation of the object state. Manual re-initialization of the algorithm is used so that the algorithm can better adapt to the change of the object state.

  • Step 2: Select four out of the eight reference points and manually fine tune their positions, then re-estimate the homography transformation with the selected points. The global shape of the object is also taken into account when it is occluded or out-of-view.

Note that in step 2: (1) the four corner points are selected first if they are visible in the image; (2) the initial four middle points might not remain at the middle after homography transformation, so when we use the middle points, we also take the context around the initial positions in the reference frame into consideration; (3) we mark frames in which more than half of the target is invisible (occluded or out-of-view, Fig. 5(a)) and frames that are heavily blurred (Fig. 5(b)). Such marked frames will not be used for evaluation.

In general, after excluding the frames in which more than half of the target is invisible or which are heavily blurred, as shown in Fig. 5, the above annotation approach generates accurate ground truth with a manageable amount of human labor.
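The re-estimation of the homography from four selected reference points in Step 2 can be illustrated with a direct linear solve. The following is a minimal sketch and not the annotation tool itself: the function names are hypothetical, the homography is normalized so that its bottom-right entry is 1, and the four points are assumed to be in general position (no three collinear).

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def homography_from_4_points(src, dst):
    """Homography (3x3 nested list, H[2][2] = 1) mapping each src point to dst."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), likewise for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(H, point):
    """Project a 2D point with H, including the perspective division."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


# Hypothetical reference points and their manually adjusted positions.
reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
adjusted = [(10.0, 20.0), (110.0, 25.0), (120.0, 140.0), (5.0, 130.0)]
H = homography_from_4_points(reference, adjusted)
```

Because any four of the eight reference points determine the homography, the same routine applies whether corner points or edge midpoints are selected; once the homography is known, the positions of the hidden corners follow by projecting their reference coordinates with `apply_homography`.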

(a) Invisible (Map-2)
(b) Blur (Painting-1)
Fig. 5: Example frames excluded from annotation.

IV Evaluation

IV-A Selected trackers

To study the performance of modern visual trackers for planar object tracking, we select eleven representative algorithms from three different groups.

Keypoint-based planar tracking:

SIFT [11] and SURF [12]: To evaluate the performance of SIFT and SURF for planar object tracking on our benchmark, we follow the traditional keypoint-based tracking pipeline and use OpenCV for implementation. These two trackers contain three steps: (1) keypoint detection; (2) keypoint matching via nearest neighbour search; and (3) homography estimation using RANSAC [31].

FERNS [10]: FERNS formulates the keypoint recognition problem in a naive Bayesian classification framework. The appearance of the image patch surrounding a keypoint is described by hundreds of simple binary features (ferns), each depending on the intensities of two pixels; then the class posterior probabilities are estimated. By shifting the computational burden to the training stage as in [28], the classification of keypoints can be performed very fast.

SOL [9]: Structured output learning (SOL) is used to combine keypoints matching and transformation estimation in a unified framework. The adopted linear structured SVM algorithm allows the object model to adapt to a given environment quickly. To speed up the algorithm, the classifier is approximated with a set of binary vectors and the binary descriptor BRIEF [38] is used for keypoint matching. The keypoints are extracted by FAST [30]. With binary representation and Hamming distance similarity, matching can be performed extremely fast using bitwise operations.
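The bitwise matching mentioned above can be illustrated in a few lines. This is a minimal sketch of nearest-neighbour matching between binary descriptors under the Hamming distance, not the SOL implementation; the descriptors are hypothetical 8-bit values stored as Python integers, whereas real BRIEF descriptors are much longer.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors (as ints)."""
    return bin(a ^ b).count("1")  # XOR marks differing bits, then count them


def nearest_descriptor(query, database):
    """Return (index, distance) of the database descriptor closest to query."""
    best = min(range(len(database)), key=lambda i: hamming(query, database[i]))
    return best, hamming(query, database[best])


# Hypothetical 8-bit descriptors.
database = [0b10110010, 0b01101100, 0b11110000]
query = 0b10110110

print(nearest_descriptor(query, database))  # (0, 1)
```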

Region-based planar tracking:

ESM [14]: The transformation parameters in [14] are estimated by minimizing the sum-of-squared-difference between a given template and the current image. To solve the optimization problem efficiently, efficient second-order minimization (ESM) is used to estimate the second-order approximation of the cost function. Compared with the Newton method, ESM does not need to compute the Hessian and has a higher convergence rate.

IC [15]: To avoid re-evaluating the Hessian in every iteration of the Lucas-Kanade image alignment algorithm [39], the inverse compositional (IC) algorithm switches the roles of the template and the image. The resulting optimization problem has a constant Hessian, which can be pre-computed. The proof of equivalence between IC and Lucas-Kanade is provided in [15].
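For reference, the role switch can be written out explicitly. The block below is a sketch in the usual notation of the image alignment literature, which the text above does not define: $T$ is the template, $I$ the current image, $W(x; p)$ the warp with parameters $p$, and $\Delta p$ the parameter increment.

```latex
% Lucas-Kanade (forwards additive): the image is linearized around the current warp,
% so the Gauss-Newton Hessian depends on p and changes in every iteration.
\min_{\Delta p} \sum_{x} \big[ I(W(x; p + \Delta p)) - T(x) \big]^2

% Inverse compositional: the template is linearized instead,
% and the warp is updated by composition with the inverted increment.
\min_{\Delta p} \sum_{x} \big[ T(W(x; \Delta p)) - I(W(x; p)) \big]^2,
\qquad W(x; p) \leftarrow W(x; p) \circ W(x; \Delta p)^{-1}

% The Gauss-Newton Hessian involves only the template gradient and the warp
% Jacobian evaluated at p = 0, so it is constant and can be pre-computed.
H = \sum_{x} \Big[ \nabla T \, \frac{\partial W}{\partial p} \Big]^{\top}
             \Big[ \nabla T \, \frac{\partial W}{\partial p} \Big]
```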

SCV [13]: Being invariant to non-linear illumination variation, the sum of conditional variance (SCV) is employed to measure the similarity between a given template and the current image in [13]. The SCV tracker can be viewed as an extension of ESM.
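The invariance property can be illustrated with a small numerical example. This is a simplified sketch rather than the tracker of [13]: it assumes intensities are already quantized to a few integer levels, it works on flattened pixel lists, and it computes the measure as the sum of squared differences between each image intensity and the expected image intensity conditioned on the template intensity at the same pixel.

```python
def scv(template, image):
    """Sum of conditional variance between two equally sized pixel lists."""
    sums, counts = {}, {}
    for t, i in zip(template, image):
        sums[t] = sums.get(t, 0.0) + i
        counts[t] = counts.get(t, 0) + 1
    # Expected image intensity for each template intensity level.
    expected = {t: sums[t] / counts[t] for t in sums}
    return sum((i - expected[t]) ** 2 for t, i in zip(template, image))


def ssd(template, image):
    """Plain sum of squared differences, for comparison."""
    return sum((i - t) ** 2 for t, i in zip(template, image))


# Hypothetical quantized template and a non-linearly re-lit copy of it.
template = [0, 1, 2, 1, 0, 2]
relit = [10 + 50 * t * t for t in template]

print(scv(template, relit), ssd(template, relit))
```

In this example the SCV score is zero because the image is a function of the template intensities, while the sum of squared differences is large; a structural change in the image, as opposed to a change in lighting, makes the SCV score positive.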

GO-ESM [16]: As gradient orientations (GO) are robust to illumination changes, they are used in GO-ESM along with denoising techniques to model the appearance of the target. GO-ESM also generalizes ESM to multidimensional features.

Generic object tracking:

GPF [17]: Using deterministic optimization to estimate the spatial transformation for template-based tracking can result in local optima. To overcome this limitation, the authors of [17] formulate the problem in a geometric particle filter (GPF) framework on a matrix Lie group. GPF uses the combination of the incremental PCA model [19] and the normalized cross correlation (NCC) score to model the appearance of the target.

IVT [19]: To deal with appearance change of the target, IVT uses an incremental PCA algorithm to update the appearance model, which is an eigenbasis learned off-line. The algorithm estimates an affine transformation for each frame with a particle filter.

L1APG [18]: To solve the $\ell_1$ norm minimization problem of the sparse linear representation of target appearance efficiently and to improve its robustness, L1APG uses a mixed norm and an efficient optimization method based on the accelerated proximal gradient (APG) approach. L1APG also estimates an affine transformation for each frame.

Note that all three of these generic tracking algorithms are template-based and can be attributed to the region-based group. For all eleven algorithms except SIFT [11] and SURF [12], we use the available source code. For ESM [14], IC [15] and SCV [13], we increase the maximum number of iterations for solving the optimization problem to 200; for other trackers, we use their default parameter settings. The original number of iterations used by GO-ESM [16] is 200.

Fig. 6: Comparison of evaluated trackers using precision plots. Panels: (a) scale change, (b) rotation, (c) perspective distortion, (d) motion blur, (e) occlusion, (f) out-of-view, (g) unconstrained, (h) all sequences. The precision at a fixed alignment-error threshold is used as a representative score.
Fig. 7: Comparison of evaluated trackers using success plots, panels (a)-(h). The success rate at a fixed homography-discrepancy threshold is used as a representative score.

IV-B Evaluation metrics

In this paper, we use the following two metrics to analyze the results quantitatively.

Alignment error. The alignment error is based on the four reference points (the four corners of the object), and is defined as the root of the mean squared distance between the estimated positions of the points and their ground truth [6, 7]:

$$e_{AL} = \sqrt{\frac{1}{4}\sum_{i=1}^{4} \lVert x_i - x_i^{*} \rVert_2^{2}},$$

where $x_i$ is the position of a reference point and $x_i^{*}$ is its ground truth position.
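As a concrete illustration, the alignment error can be computed directly from the four predicted corners and their ground truth. This is a minimal sketch; the function name and the sample coordinates are hypothetical.

```python
import math


def alignment_error(pred_points, gt_points):
    """Root of the mean squared distance over the reference points."""
    if len(pred_points) != len(gt_points) or not pred_points:
        raise ValueError("need the same non-zero number of points")
    squared = [(px - gx) ** 2 + (py - gy) ** 2
               for (px, py), (gx, gy) in zip(pred_points, gt_points)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical ground-truth corners and a prediction shifted by (3, 4) pixels.
gt = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]
pred = [(x + 3.0, y + 4.0) for x, y in gt]

print(alignment_error(pred, gt))  # 5.0, every corner is off by 5 pixels
```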

Precision plots have recently been adopted to evaluate tracking algorithms for general purposes [24]. In this work, we draw the precision plot based on the alignment error; it shows the percentage of frames whose alignment error is smaller than a threshold $t_p$. We use the precision at a fixed value of $t_p$ as a representative precision score for each algorithm.
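A precision plot of this kind can be produced from per-frame alignment errors as sketched below. The error values and the threshold grid are hypothetical and chosen only for illustration.

```python
def precision_at(errors, threshold):
    """Fraction of frames whose alignment error is smaller than threshold."""
    if not errors:
        raise ValueError("no frames to evaluate")
    return sum(1 for e in errors if e < threshold) / len(errors)


def precision_curve(errors, thresholds):
    """List of (threshold, precision) pairs, i.e. the points of the plot."""
    return [(t, precision_at(errors, t)) for t in thresholds]


# Hypothetical per-frame alignment errors of one tracker on one sequence.
errors = [1.2, 3.5, 4.9, 7.0, 12.0, 40.0]
curve = precision_curve(errors, range(0, 21, 5))
```

Reading a single point off the curve at a fixed threshold gives the representative precision score used to rank trackers.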

Degree of difficulty of each object. To rank the 30 planar objects used in our benchmark as shown in Fig. 2, we quantitatively derive the degree of difficulty (DoD) of each object. Specifically, during the evaluation process, the representative precision score for each sequence and each tracker is recorded. Then, given an object, its degree of difficulty is computed from the recorded precision scores of all its sequences and all trackers; objects with a higher degree of difficulty are harder (Fig. 2).
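One way to turn the recorded precision scores into a difficulty value is sketched below. It assumes, as a labeled assumption rather than the paper's stated formula, that the degree of difficulty is one minus the mean precision over all sequences of an object and all trackers; this choice is consistent with the hardest-to-easiest ordering by decreasing score in Fig. 2. The object names and scores in the example are hypothetical.

```python
def degree_of_difficulty(precision_scores):
    """Assumed DoD: 1 - mean precision over sequences (rows) and trackers (columns)."""
    flat = [p for row in precision_scores for p in row]
    if not flat:
        raise ValueError("no precision scores recorded")
    return 1.0 - sum(flat) / len(flat)


def rank_by_difficulty(scores_per_object):
    """Object names sorted from hardest to easiest under the assumed DoD."""
    return sorted(scores_per_object,
                  key=lambda name: degree_of_difficulty(scores_per_object[name]),
                  reverse=True)


# Hypothetical scores: two sequences (rows) and two trackers (columns) per object.
scores = {
    "ObjectA": [[0.2, 0.4], [0.0, 0.2]],
    "ObjectB": [[0.9, 0.7], [0.8, 0.6]],
}
print(rank_by_difficulty(scores))  # ['ObjectA', 'ObjectB']
```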

Homography discrepancy. Homography discrepancy measures the difference between the ground truth homography $H^{*}$ and the predicted one $H$, and it is defined as [9]:

$$S(H^{*}, H) = \frac{1}{4}\sum_{i=1}^{4} \lVert (H^{*} c_i) - (H c_i) \rVert_2,$$

where $\{c_i\}_{i=1}^{4}$ are the corners of a square. $S(H^{*}, H)$ is $0$ if $H^{*}$ and $H$ are identical. The success rate of a tracker on a sequence is the percentage of frames whose homography discrepancy score is less than a threshold $t_s$. We generate the success plot by varying the threshold from 0 to 200. Following [9], the success rate at a fixed value of $t_s$ is used as a representative score.
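The homography discrepancy and the resulting success rate can be sketched as follows. The side length of the square is not fixed by the text above, so `half_size` is a hypothetical parameter, and the example homographies are made up for illustration.

```python
import math


def project(H, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def homography_discrepancy(H_gt, H_pred, half_size=1.0):
    """Mean distance between the square's corners projected by both homographies."""
    h = half_size  # assumption: square centered at the origin
    corners = [(-h, -h), (h, -h), (h, h), (-h, h)]
    return sum(math.dist(project(H_gt, c), project(H_pred, c))
               for c in corners) / len(corners)


def success_rate(scores, threshold):
    """Fraction of frames whose discrepancy score is less than threshold."""
    if not scores:
        raise ValueError("no frames to evaluate")
    return sum(1 for s in scores if s < threshold) / len(scores)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]

print(homography_discrepancy(identity, identity))  # 0.0
print(homography_discrepancy(identity, shifted))   # 5.0
```

Evaluating `success_rate` over a range of thresholds yields the points of a success plot.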

Note: (1) the same threshold $t_s$ for success rate on different sequences may correspond to different thresholds $t_p$ for precision score; (2) the representative threshold for success rate is very tight, as shown by some illustrative examples in Fig. 8; and (3) as there is no correspondence between $t_s$ for success rate and $t_p$ for precision score, there are inconsistencies between the rankings of trackers in Fig. 6 and Fig. 7.

(a) 8.05 (b) 85.75 (c) 315.75
Fig. 8: Some example homography discrepancy scores (shown under subfigures). The black bounding box represents ground truth while the red one represents tracking result.

IV-C Results and analysis

(a) Failure in scale change
(b) Failure in rotation
(c) Failure in perspective distortion
(d) Failure in motion blur
(e) Failure in occlusion
(f) Failure in out-of-view
(g) Failure in unconstrained
Fig. 9: Some failures observed in our experiment involving different challenge factors.

IV-C1 Comparison with respect to different challenges

Fig. 6 shows the comparison among the eleven trackers by precision plot, both on subsets of sequences grouped by motion pattern and on all the sequences. In addition, the success plots of each tracker using the homography discrepancy are reported in Fig. 7. It is worth noting that the performance of the generic object trackers IVT [19] and L1APG [18] is clearly worse than that of the other trackers. One possible reason is that the parameters of these two trackers are set for tracking scenarios that require only coarse bounding box estimation; another possible reason is that the adopted affine transformation, with six degrees of freedom, is not sufficient to obtain very accurate results. In the following, we use the performance of the other nine trackers for analysis. Also, as the alignment error is easier to interpret perceptually and the representative threshold for success rate is very tight (as shown in Fig. 8), we mainly use the precision plots to analyze the results, and the success plots are displayed for further validation.

Fig. 10: The overall performance of trackers in two groups for different challenging factors: (a) keypoint-based trackers, (b) region-based trackers. For each group, the overall performance is calculated by averaging the performances of trackers within this group. The precision at a fixed alignment-error threshold is used.

For scale change (Fig. 6(a)), GPF performs best and FERNS achieves comparable performance. Though SURF is also designed to be scale invariant, its performance is not promising on this subset. For the rotation subset (Fig. 6(b)), although SIFT, FERNS and SURF are all designed to be rotation-invariant, SIFT outperforms the other two algorithms by a large margin. Also, SCV and GPF achieve better results than the other region-based trackers. The relatively inferior performance of SOL is likely because the BRIEF descriptor is not invariant to in-plane rotation [38].

On the perspective distortion subset (Fig. 6(c)), all the keypoint-based trackers outperform all the region-based trackers. The performances of SIFT, FERNS and GPF decrease noticeably compared with scale change or rotation. SIFT itself is not designed to be invariant to perspective distortion. During its training stage, FERNS generates training samples with randomly picked affine transformations, whereas perspective distortion can involve a general homography transformation. SCV and ESM have similar performance across these three motion patterns.

Motion blur (Fig. 6(d)) is the most challenging motion pattern for all eleven trackers. As motion blur deteriorates the quality of the entire image, it is difficult for keypoint-based trackers to detect useful keypoints, and for region-based trackers to measure the similarity between image patches effectively.

For occlusion (Fig. 6(e)) and out-of-view (Fig. 6(f)), the performances of the keypoint-based trackers are clearly better than those of the region-based trackers. This is consistent with the fact that it is still possible to obtain a set of correspondences between the target and image keypoints when occlusion appears or the target is out-of-view, and the correspondences are accurate enough to estimate the geometric transformation correctly. However, for region-based trackers, both occlusion and out-of-view can cause large appearance variance.

According to the performances on the unconstrained subset of sequences (Fig. 6(g)) and on all the sequences (Fig. 6(h)), the keypoint-based trackers are in general more robust than the region-based trackers. The clear performance difference can be attributed to the following two reasons: (1) though the image similarity measures, SCV adopted by [13] and GO adopted by [16], are robust to illumination variations, their robustness is not comparable with that of the state-of-the-art keypoint detectors and descriptors or ferns; and (2) the keypoint-based algorithms use the tracking-by-detection strategy, and the detection in the current frame depends little on the object location in previous frames; by contrast, the region-based algorithms make use of the previous object state to reduce the optimization space for efficiency. Thus it is easier for keypoint-based trackers to recover from failure than for region-based trackers.

Also, among the ESM-based algorithms [14, 13, 16], SCV [13] is slightly better than the original ESM tracker [14], which uses the sum-of-squared-difference as its appearance similarity measure. Though gradient orientations are robust to illumination change, the overall performance of GO-ESM [16] is worse than that of ESM [14]. At the same time, ESM, SCV and GO-ESM perform better than IC [15], implying that the efficient second-order minimization approach is better than the inverse compositional optimization approach for the planar object tracking task. Some failure cases for different motion patterns are shown in Fig. 9.

IV-C2 Overall performance of trackers in each group

We summarize the overall performance of trackers in each group by average precision plot in Fig. 10(a) and Fig. 10(b) respectively. Note that we include the GPF tracker [17] in the region-based group, and we do not consider IVT [19] and L1APG [18] for these two figures. We rank the performance with respect to different challenging factors using the representative precision score.

The average precision plot of the keypoint-based trackers [11, 9, 10, 12] in Fig. 10(a) shows that they are more robust to occlusion, rotation and out-of-view than to other challenging factors. This is consistent with the better performance of keypoint-based trackers on these three subsets as shown in Fig. 6(e), Fig. 6(b) and Fig. 6(f) respectively. The most challenging situation for the keypoint-based trackers is motion blur, as motion blur heavily affects the repeatability of the keypoints and the associated appearance description.

The average precision plot of the region-based trackers [13, 14, 15, 16, 17] is given in Fig. 10(b). It shows that the region-based trackers are more robust to scale change, rotation and perspective distortion than to occlusion and out-of-view. This observation is consistent with the fact that the region-based trackers find the transformation by directly minimizing the error that measures the similarity between the entire template and the image, and occlusion and out-of-view greatly increase the dissimilarity between the template and the corresponding image patch after alignment. Motion blur remains the most challenging factor due to appearance corruption and large displacement of the target.

V Conclusion

In this paper, we present a benchmark for evaluating planar object tracking algorithms in the wild. The dataset is constructed according to seven different challenging factors so that the performance of trackers can be investigated thoroughly. We design a semi-manual approach to annotate the ground truth accurately. We also evaluate eleven state-of-the-art algorithms on the dataset with two metrics and give a detailed analysis. The evaluation results show that there is considerable room for improvement for all algorithms. We expect that our work can provide both a dataset and motivation for future studies on planar object tracking in unconstrained natural environments.


This work is supported by China National Key Research and Development Plan (Grant No. 2016YFB1001200).


  • [1] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” TRA, 1996.
  • [2] A. Concha and J. Civera, “Dpptam: Dense piecewise planar tracking and mapping from a monocular sequence,” in IROS, 2015.
  • [3] I. F. Mondragón, P. Campoy, C. Martinez, and M. A. Olivares-Méndez, “3d pose estimation based on planar object tracking for uavs control,” in ICRA, 2010.
  • [4] G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in ISMA, 2007.
  • [5] H. Kato and M. Billinghurst, “Marker tracking and hmd calibration for a video-based augmented reality conferencing system,” in IWAR, 1999.
  • [6] S. Lieberknecht, S. Benhimane, P. Meier, and N. Navab, “A dataset and evaluation methodology for template-based tracking algorithms,” in ISMAR, 2009.
  • [7] A. Roy, X. Zhang, N. Wolleb, C. P. Quintero, and M. Jägersand, “Tracking benchmark and evaluation for manipulation tasks,” in ICRA, 2015.
  • [8] S. Gauglitz, T. Höllerer, and M. Turk, “Evaluation of interest point detectors and feature descriptors for visual tracking,” IJCV, 2011.
  • [9] S. Hare, A. Saffari, and P. H. Torr, “Efficient online structured output learning for keypoint-based object tracking,” in CVPR, 2012.
  • [10] M. Ozuysal, M. Calonder, V. Lepetit, and P. Fua, “Fast keypoint recognition using random ferns,” PAMI, 2010.
  • [11] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.
  • [12] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” CVIU, 2008.
  • [13] R. Richa, R. Sznitman, R. Taylor, and G. Hager, “Visual tracking using the sum of conditional variance,” in IROS, 2011.
  • [14] S. Benhimane and E. Malis, “Real-time image-based tracking of planes using efficient second-order minimization,” in IROS, 2004.
  • [15] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” IJCV, 2004.
  • [16] L. Chen, F. Zhou, Y. Shen, X. Tian, H. Ling, and Y. Chen, “Illumination insensitive efficient second-order minimization for planar object tracking,” in ICRA, 2017.
  • [17] J. Kwon, H. S. Lee, F. C. Park, and K. M. Lee, “A geometric particle filter for template-based visual tracking,” PAMI, 2014.
  • [18] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust l1 tracker using accelerated proximal gradient approach,” in CVPR, 2012.
  • [19] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” IJCV, 2008.
  • [20] H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in CVPR, 2016.
  • [21] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang, “Hedged deep tracking,” in CVPR, 2016.
  • [22] K. Zimmermann, J. Matas, and T. Svoboda, “Tracking by an optimal sequence of linear predictors,” PAMI, 2009.
  • [23] L. Zhao, X. Li, J. Xiao, F. Wu, and Y. Zhuang, “Metric learning driven multi-task structured output optimization for robust keypoint tracking,” in AAAI, 2015.
  • [24] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” PAMI, 2015.
  • [25] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: An experimental survey,” PAMI, 2014.
  • [26] A. Li, M. Lin, Y. Wu, M.-H. Yang, and S. Yan, “Nus-pro: A new visual tracking challenge,” PAMI, 2016.
  • [27] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual object tracking vot2015 challenge results,” in ICCV workshops, 2015.
  • [28] V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,” PAMI, 2006.
  • [29] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Real-time detection and tracking for augmented reality on mobile phones,” TVCG, 2010.
  • [30] E. Rosten, R. Porter, and T. Drummond, “Faster and better: A machine learning approach to corner detection,” PAMI, 2010.
  • [31] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, 1981.
  • [32] M. Ozuysal, P. Fua, and V. Lepetit, “Fast keypoint recognition in ten lines of code,” in CVPR, 2007.
  • [33] T. Wang and H. Ling, “Gracker: A Graph-based Planar Object Tracker,” PAMI, in press.
  • [34] M. Pressigout and E. Marchand, “Real time planar structure tracking for visual servoing: a contour and texture approach,” in IROS, 2005.
  • [35] E. Ito, T. Okatani, and K. Deguchi, “Accurate and robust planar tracking based on a model of image sampling and reconstruction process,” in ISMAR, 2011.
  • [36] D. J. Tan and S. Ilic, “Multi-forest tracker: A chameleon in tracking,” in CVPR, 2014.
  • [37] A. Singh and M. Jagersand, “Modular tracking framework: A unified approach to registration based tracking,” arXiv preprint arXiv:1602.09130, 2016.
  • [38] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “Brief: Binary robust independent elementary features,” in ECCV, 2010.
  • [39] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI, 1981.