A Framework for Evaluating 6-DOF Object Trackers

Mathieu Garon, Denis Laurendeau, and Jean-François Lalonde
Université Laval
mathieu.garon.2@ulaval.ca, denis.laurendeau@gel.ulaval.ca, jflalonde@gel.ulaval.ca

We present a challenging and realistic novel dataset for evaluating 6-DOF object tracking algorithms. Existing datasets show serious limitations (notably, unrealistic synthetic data, or real data with large fiducial markers), preventing the community from obtaining an accurate picture of the state of the art. Our key contribution is a novel pipeline for acquiring accurate ground truth poses of real objects w.r.t. a Kinect V2 sensor by using a commercial motion capture system. A total of 100 calibrated sequences of real objects are acquired in three different scenarios to evaluate the performance of trackers: stability, robustness to occlusion, and accuracy during challenging interactions between a person and the object. We conduct an extensive study of a deep 6-DOF tracking architecture [1] and determine a set of optimal parameters. We enhance the architecture and the training methodology to train a 6-DOF tracker that can robustly generalize to objects never seen during training, and demonstrate favorable performance compared to previous approaches trained specifically on the objects to track.

3D object tracking, databases, deep learning, augmented reality

1 Introduction

With the recent emergence of 3D-enabled augmented reality devices, tracking 3D objects in 6 degrees of freedom (DOF) is a problem that has received increased attention in the past few years. As opposed to SLAM-based camera localization techniques—now robustly implemented on-board various commercial devices such as the iPhone X or the Hololens—that can use features from the entire scene, 6-DOF object tracking approaches have to rely on features present on a (typically small) object, making it a challenging problem. Despite this challenge, recent approaches have demonstrated tremendous performance both in terms of speed and accuracy [2, 3, 1].

Unfortunately, obtaining an accurate assessment of the performance of 6-DOF object tracking approaches is becoming increasingly difficult since the main dataset used for this purpose has reached its limitations. Indeed, the main dataset currently used to evaluate 6-DOF tracking algorithms was introduced in 2013 by Choi and Christensen [4] and consists of 4 short sequences of purely synthetic scenes. The scenes are made of unrealistic, texture-less backgrounds with a single colored object to track, resulting in noiseless RGBD images (see fig. 1-(a)). The object is static and the camera rotates around it in wide motions, occasionally creating small occlusions (at most 20% of the object is occluded). While challenging at first, the dataset has now essentially been solved, and as such has become less useful. For example, the method of Kehl et al. [2] (2017) reports an overall accuracy of 0.5mm/, which is an improvement of 0.3mm/ over the work of Tan et al. (2015) [5], who have themselves reported a 0.01mm/ improvement to the approach designed by Krull et al. (2014) [6]. The state of the art on the dataset has reached a near-perfect accuracy of 0.1mm/ [3], which shows that the dataset has reached the end of its useful life.

(a) Choi-Christensen [4] (b) Garon-Lalonde [1] (c) Ours
Figure 1: Comparison of datasets for evaluating 6-DOF tracking algorithms. Typical RGB (top) and depth (bottom) frames for (a) the synthetic dataset of Choi and Christensen [4], (b) the real dataset of Garon and Lalonde [1], and (c) ours. Compared to the previous work, our dataset contains real objects captured by a sensor, and does not use a calibration board, therefore mimicking realistic real-world scenarios.

Another dataset, introduced by Garon and Lalonde [1], includes 12 sequences of real objects captured with real sensors. While a significant improvement over the synthetic dataset of [4], dealing with real data raises the issue of providing accurate ground truth pose of the object at all times. To obtain this ground truth information, their strategy (also adopted in 6-DOF detection datasets [7, 8]) is to use calibration boards with fiducial markers. While useful to accurately and easily determine an object pose, this has the unfortunate consequence of constraining the object to lie on a large planar surface (fig. 1-(b)).

In this paper, we present a novel dataset allowing systematic evaluation of 6-DOF tracking algorithms in a wide variety of real scenarios without requiring calibration boards (fig. 1-(c)). Our dataset is one order of magnitude larger than the previous work: it contains 100 sequences of 4 real objects. The sequences are split into 3 different scenarios, which we refer to as stability, occlusion, and interaction. The stability scenario aims at quantifying the degree of jitter in a tracker. The object is kept static and placed at various angles and distances from a Kinect V2. The occlusion scenario, inspired by [1], has the object rotating on a turntable and being progressively occluded by a flat panel. Occlusion ranges from 0% (unoccluded) to 60%, thereby testing trackers in very challenging situations. Finally, in the interaction scenario, a person is moving the object around freely in front of the camera (fig. 1-(c)), creating occlusions and varying object speeds.

In addition, we also introduce two new 6-DOF real-time object trackers based on deep learning. The first, trained for a specific object, achieves state-of-the-art performance on the new dataset. The second, trained without a priori knowledge of the object to track, is able to achieve an accuracy that is comparable with previous work trained specifically on the object. These two trackers actually rely on the same deep learning architecture and only differ in the training data. Furthermore, both of our trackers have the additional significant advantage of requiring only synthetic training data (i.e. no real data is needed for training). We believe this is an exciting first step in the direction of training generic trackers which do not require knowledge of the object to track at training time.

In summary, this paper brings 3 key contributions to 6-DOF object tracking:

  1. A novel dataset of real RGBD sequences for the systematic evaluation of 6-DOF tracking algorithms that is one order of magnitude larger than existing ones, and contains 3 challenging scenarios;

  2. A real-time deep learning architecture for tracking objects in 6-DOF which is more stable and more robust to occlusions than previous approaches;

  3. A generic 6-DOF object tracker trained without knowledge of the object to track, achieving performance on par with previous approaches trained specifically on the object.

2 Related work

There are two main relevant aspects in 6-DOF pose estimation: single-frame object detection and multi-frame temporal tracking. The former has received a lot of attention in the literature and benefits from a large range of public datasets. The best-known dataset is arguably Linemod [7], which provides 15 objects with their mesh models and surface colors. To obtain the ground truth object pose, a calibration board with fiducial markers is used. Since then, many authors have created similar but more challenging benchmarks [9, 10, 8]. However, these datasets do not contain temporal and displacement correlation between frames, which makes them unsuitable for evaluating temporal trackers. As opposed to the above datasets, we provide a challenging set of sequences that can be used for benchmarking detection and tracking approaches in various scenarios. Additionally, our dataset contains no fiducial markers and does not constrain the object to lie on a plane.

In the case of temporal tracking, only a few datasets exist to evaluate the approaches. As mentioned in the introduction, the current standard dataset widely used is the synthetic dataset of Choi and Christensen [4], which contains 4 sequences with 4 objects rendered in a texture-less virtual scene. Another available option is the one provided by Akkaladevi et al. [11] who captured a single sequence of a scene containing 4 different objects with a Primesense sensor. However, the 3D models are not complete and do not include training data that could be exploited by learning-based methods. Finally, recent work by Garon and Lalonde [1] proposed a public dataset of 4 objects containing 4 sequences with clutter and an additional set of 8 sequences with controlled occlusion on a specific object. Fiducial markers are used to generate the ground truth pose of the model, which limits the range of displacements that can be achieved. In contrast, we propose a new method to collect ground truth pose data that makes the acquisition simpler without the need for fiducial markers.

There is an increasing interest in 6-DOF temporal trackers since they were shown to be faster and more robust than single-frame detection methods. In the past, geometric methods based on ICP [4, 12, 13, 14] were used for temporal tracking, but they lack robustness for small objects and are generally computationally expensive. Data-driven approaches such as the ones reported in [6, 5, 15] can learn more robust features, and the use of the Random Forest regressor [16] decreases the computing overhead significantly. Other methods show that the contours of the objects in RGB and depth data provide important cues for estimating pose [3, 2, 17]. While their optimization techniques can be accurate, many assumptions are made on the features, which restricts the type of object and the type of background that can be dealt with at test time. Recently, Garon and Lalonde [1] proposed a deep learning framework which can learn robust features automatically from data. They use a feedback loop by rendering the 3D model at runtime at the previous pose, and regress the pose difference between the rendered object and the real image. While the accuracy of their method is comparable to that of previous work, their learned features are more robust to higher levels of occlusion and noise. A downside is that their method needs a dataset of real images, and a specific network has to be trained for each object, which can be time-consuming. We take advantage of the latter architecture but introduce novel ideas to provide a better performing tracker that can be trained entirely on synthetic data. In addition, our network can be trained to generalize to previously unseen objects.

3 Dataset capture and calibration

Building a dataset with calibrated object pose w.r.t the sensor at each frame is a challenging task since it requires an accurate method to collect ground truth object pose. Until now, the most practical method to achieve this task was to use fiducial markers and calibrate the object pose w.r.t these markers [1, 7, 8, 9]. However, this method suffers from several drawbacks. Firstly, the object cannot be moved independently of the panel so this restricts the camera to move around the object of interest. Secondly, and perhaps more importantly, the scene always contains visual cues (the markers) which could indirectly “help” the algorithms.

Our approach eliminates these two limitations. A Vicon™  MX-T40 motion capture system is used to collect the ground truth pose of the objects in the scene. The retroreflective Vicon markers that must be used are very small in size (3mm diameter) and can be automatically removed in a post-processing step. In this section, we describe the capture setup and the various calibration steps needed to align the object model and estimate its ground truth pose. The resulting RGBD video sequences captured using this setup are presented in sec. 4.

3.1 Capture setup

The motion capture setup is composed of a set of 8 calibrated cameras that track retro-reflective markers of 3mm in diameter installed on the objects of interest in a work area. Vicon systems can provide a marker detection accuracy of up to 0.15 mm on static objects and 2mm on moving objects according to [18]. A Kinect V2 is used to acquire the RGBD frames, and is calibrated with the Vicon to record the ground truth pose of the objects in the Kinect coordinate system. The actual setup used to capture the dataset is shown in fig. 2-(a).

(a) (b)
Figure 2: Capture setup used to acquire our novel dataset. (a) Actual setup, which includes an 8-camera Vicon motion capture system and a Kinect V2. The resulting view from the Kinect is shown in the inset. Here, an occluder is placed in front of the object. (b) The various transformations that must be calibrated in order to obtain the object pose in the camera reference frame. The transformations shown in black are obtained from the Vicon motion capture system directly, while the gray ones need a specific calibration procedure which is described in the main body of the paper.

3.2 Calibration

With an RGB-D sensor such as the Kinect V2, color and depth values are projected onto two different planes. We define the Kinect reference frame (“knt”) as the origin of its RGB camera, and align the depth data by reprojecting it to the color plane using the factory calibration parameters. We calibrate the intrinsic parameters of the RGB camera and remove the lens distortion as in Hodan et al. [8]. In this section, the notation $\mathbf{T}_{a \to b}$ is used to denote a rigid transformation from reference frame “a” to “b”.

We aim to recover the pose $\mathbf{T}_{\mathrm{obj} \to \mathrm{knt}}$ of the object in the Kinect reference frame (fig. 2-(b)). To do so, we first rely on the Vicon motion capture system, which has its own reference frame “vcn”. The set of retroreflective markers installed on the object defines the local reference frame “objm”. Similarly, the set of markers placed on the Kinect defines the local reference frame “kntm”. The Vicon system provides the transformations $\mathbf{T}_{\mathrm{objm} \to \mathrm{vcn}}$ and $\mathbf{T}_{\mathrm{kntm} \to \mathrm{vcn}}$ directly, that is, the mappings between the object and Kinect markers and the Vicon reference frame respectively. The transformation between the object markers and the Kinect markers is obtained by chaining the previous transformations:

$$\mathbf{T}_{\mathrm{objm} \to \mathrm{kntm}} = \mathbf{T}_{\mathrm{kntm} \to \mathrm{vcn}}^{-1} \, \mathbf{T}_{\mathrm{objm} \to \mathrm{vcn}}.$$

The pose $\mathbf{T}_{\mathrm{obj} \to \mathrm{knt}}$ is recovered with the transformations between the local frames defined by the markers and the object/Kinect reference frames, $\mathbf{T}_{\mathrm{obj} \to \mathrm{objm}}$ and $\mathbf{T}_{\mathrm{kntm} \to \mathrm{knt}}$:

$$\mathbf{T}_{\mathrm{obj} \to \mathrm{knt}} = \mathbf{T}_{\mathrm{kntm} \to \mathrm{knt}} \, \mathbf{T}_{\mathrm{objm} \to \mathrm{kntm}} \, \mathbf{T}_{\mathrm{obj} \to \mathrm{objm}}.$$

The calibration procedures needed to obtain these two transformations, also shown in gray in fig. 2-(b), are detailed next.
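The chaining of rigid transformations described in this section can be illustrated with a minimal sketch. It assumes that every transformation is stored as a 4x4 homogeneous matrix and that the two calibrated transformations map the mesh frame to the object-marker frame and the Kinect-marker frame to the RGB camera frame; these direction conventions and all function names are illustrative assumptions, not taken from the capture software.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous rigid transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def invert_transform(T):
    """Invert a rigid transform using the transpose of its rotation block."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def object_pose_in_kinect(T_objm_to_vcn, T_kntm_to_vcn, T_kntm_to_knt, T_obj_to_objm):
    """Chain obj -> objm -> vcn -> kntm -> knt (assumed direction conventions)."""
    # Object markers expressed in the Kinect-marker frame, via the Vicon frame.
    T_objm_to_kntm = invert_transform(T_kntm_to_vcn) @ T_objm_to_vcn
    # Apply the two calibrated transforms on either side of the Vicon measurement.
    return T_kntm_to_knt @ T_objm_to_kntm @ T_obj_to_objm
```

Only the two Vicon-provided matrices change from frame to frame; under these assumptions the two calibrated matrices are estimated once and reused for a whole sequence.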

3.2.1 Kinect calibration

In order to find the transformation between the local frame defined by the markers installed on the Kinect and its RGB camera, we rely on a planar checkerboard target on which Vicon markers are randomly placed. Then, the position of each corner of the checkerboard is determined with respect to the markers with the following procedure. A 15cm-long pen-like probe that has a 1cm Vicon marker attached at one end was designed for this purpose. The sharp end is placed on the corner to be detected, and the probe is moved in a circular motion around that point. A sphere is then fit (using least-squares) to the resulting marker positions (achieving an average radius estimation error of 0.7mm), and the center of the sphere is kept as the location for the checkerboard corner. The checkerboard target was then moved in the capture volume and corners were detected by the Kinect RGB camera, thereby establishing 2D-3D correspondences between these points. The perspective-$n$-point (PnP) algorithm [19] was finally used to compute the transformation between the Kinect marker frame and the RGB camera frame, $\mathbf{T}_{\mathrm{kntm} \to \mathrm{knt}}$.
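The sphere-fitting step can be sketched as follows. The text only states that a least-squares fit is used; the algebraic (linear) formulation below is one common choice and is an assumption, as is the function name `fit_sphere`.

```python
import numpy as np


def fit_sphere(points):
    """Algebraic least-squares sphere fit.

    Uses |p|^2 = 2 c.p + (r^2 - |c|^2), which is linear in the unknown
    center c and in the scalar k = r^2 - |c|^2.
    """
    points = np.asarray(points, dtype=float)
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    b = np.sum(points ** 2, axis=1)
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = solution[:3]
    radius = np.sqrt(solution[3] + center @ center)
    return center, radius
```

With the probe tip resting on a checkerboard corner, the marker positions lie on a sphere centered at the tip, so the returned center is the corner location and the returned radius can be compared with the known probe length as a sanity check.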

3.2.2 Object calibration

To estimate the transformation $\mathbf{T}_{\mathrm{obj} \to \mathrm{objm}}$ between the mesh coordinate system of the object and the local frame defined by the markers placed on it, we rely on the Kinect pose calibrated with the method described in sec. 3.2.1. As a convention, we define the origin of the object local coordinate system at the center of mass of the markers; the same convention is used for the mesh by using the center of mass of the vertices. We roughly align the axes and use ICP to refine the alignment (based on the Kinect depth values). Finally, with the help of a visual interface where a user can move and visualize the alignment of the object, fine-scale adjustments can be performed manually from several viewpoints to minimize the error between the observed object and the reprojected mesh.

3.2.3 Synchronization

In addition to spatial calibration, precise temporal alignment must be achieved to synchronize Vicon and Kinect frames. Unfortunately, the Kinect does not offer hardware synchronization capabilities; therefore, the following software solution is adopted. We assume the absence of clock drift and that the sequence length is short enough to avoid the accumulation of time drift. We also assume a stable sampling of the Vicon system on a high bandwidth closed network. In this setup, synchronization can be achieved by estimating the (constant) time offset between the Vicon and the Kinect frame timestamps. By moving the checkerboard of sec. 3.2 at varying speeds, we estimate the time offset that minimizes the reprojection error between the checkerboard corners from sec. 3.2 and the Vicon markers.
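The idea of searching for a constant time offset can be sketched as below. This is a simplified illustration: it compares 3D point positions rather than the 2D reprojection error used in the paper, relies on linear interpolation of the Vicon trajectory, and performs a plain grid search over candidate offsets. All names and the search strategy are assumptions.

```python
import numpy as np


def estimate_time_offset(kinect_times, kinect_points, vicon_times, vicon_points, candidates):
    """Return (offset, error) for the candidate offset with the lowest mean 3D error.

    kinect_points[i] is a 3D point observed at kinect_times[i];
    vicon_points[j] is the same physical point tracked at vicon_times[j].
    """
    kinect_times = np.asarray(kinect_times, dtype=float)
    kinect_points = np.asarray(kinect_points, dtype=float)
    vicon_times = np.asarray(vicon_times, dtype=float)
    vicon_points = np.asarray(vicon_points, dtype=float)

    best_offset, best_error = None, np.inf
    for offset in candidates:
        shifted = kinect_times + offset
        # Interpolate each Vicon coordinate at the shifted Kinect timestamps.
        interpolated = np.stack(
            [np.interp(shifted, vicon_times, vicon_points[:, k]) for k in range(3)],
            axis=1,
        )
        error = np.mean(np.linalg.norm(interpolated - kinect_points, axis=1))
        if error < best_error:
            best_offset, best_error = float(offset), float(error)
    return best_offset, best_error
```

Moving the target at varying speeds matters for this kind of search: if the target were static, every candidate offset would give the same error and the minimum would not be identifiable.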

3.2.4 Removing the markers

The 3mm markers used to track the object are retro-reflective and, despite their small size and their low number (7 per object on average), they nevertheless create visible artifacts in the depth data measured by the Kinect (see fig. 3). We propose a post-processing algorithm for automatically removing them in all the sequences. First, to ensure that the marker can be observed by the Kinect, we reproject the (known) marker positions onto the depth image and compute the median distance between the depth in a small window around the reprojected point and its ground truth depth. If the difference is less than 1cm, the point is considered as not occluded, and will be processed. Finally, we render the depth values of the 3D model at the given pose and replace the pixel window from the original image with the rendered depth values. For more realism, a small amount of Gaussian noise is added. Pixels from the background are simply ignored. On average, only 3.4% of the object pixels are corrected. We also minimize the chances of affecting the geometric structure of the object by placing the markers on planar surfaces. Fig. 3 shows a comparison of the error between a Kinect depth image captured with markers, and another image of the same scene with markers that have been corrected with our algorithm. Compared to a ground truth image without markers, the RMSE of the pixel patches around the markers is 139.8 mm without the correction, and 4.7 mm with the correction.
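The marker-removal procedure can be sketched for a single marker as follows. The 1cm visibility threshold comes from the text; the window size, the noise standard deviation, the convention that background pixels of the rendered depth map are zero, and the function name are all illustrative assumptions.

```python
import numpy as np


def remove_marker_artifact(depth, rendered_depth, marker_uv, marker_depth,
                           half_window=3, max_diff=0.01, noise_std=0.001, rng=None):
    """Replace the depth patch around one visible marker by rendered depth.

    depth, rendered_depth: 2D arrays in meters (rendered background assumed 0).
    marker_uv: (column, row) of the reprojected marker; marker_depth: its known depth.
    """
    rng = np.random.default_rng() if rng is None else rng
    u, v = marker_uv
    rows = slice(max(v - half_window, 0), v + half_window + 1)
    cols = slice(max(u - half_window, 0), u + half_window + 1)
    window = depth[rows, cols]

    # Visibility test: median distance between observed and expected depth.
    if np.median(np.abs(window - marker_depth)) >= max_diff:
        return depth.copy()  # marker is occluded, leave the frame untouched

    corrected = depth.copy()
    patch = rendered_depth[rows, cols]
    on_object = patch > 0  # background pixels of the render are ignored
    noisy_patch = patch + rng.normal(0.0, noise_std, size=patch.shape)
    corrected[rows, cols] = np.where(on_object, noisy_patch, window)
    return corrected
```

Using the median rather than the mean in the visibility test keeps the decision robust to the few corrupted pixels produced by the marker itself.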

Figure 3: Example of an RGB and depth frame containing 2 markers on a flat surface, and 2 markers near an edge. We take advantage of our knowledge of the object mesh and pose to replace patches of pixels around the marker by the depth values of a render at the same pose. We capture an image without the markers to compare the error. On the modified patches we report an RMSE of 139.8 mm on the depth with the markers, and 4.7 mm with the corrected version.

4 Dataset scenarios, metrics, and statistics

This section defines novel ways to systematically evaluate 6-DOF trackers using calibrated sequences captured with the setup presented in sec. 3. We strive to provide an evaluation methodology that reflects the overall performance of a tracker in different scenarios. To attain this objective, we captured 100 sequences of 4 different objects of various shapes in 3 scenarios: stability, occlusion, and interaction. We also provide quantitative metrics to measure the performance of the tracker in each scenario. See the supplementary material for videos of sequences for all three scenarios.

4.1 Performance metrics

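Before we describe each scenario, we first introduce how we propose to evaluate the difference between two poses $\mathbf{P}_1$ and $\mathbf{P}_2$. Here, a pose $\mathbf{P} = (\mathbf{R}, \mathbf{t})$ is described by a rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$. Previous works consider the average of each axis component in translation and rotation separately. The side effect of this metric is that a large error on a single component is less penalized. To overcome this limitation, the translation error is simply defined as the L2 norm of the difference between the two translation vectors:

$$\delta_T(\mathbf{P}_1, \mathbf{P}_2) = \lVert \mathbf{t}_1 - \mathbf{t}_2 \rVert_2 .$$

The distance between two rotation matrices is computed using:

$$\delta_R(\mathbf{P}_1, \mathbf{P}_2) = \arccos\left( \frac{\operatorname{Tr}(\mathbf{R}_1^{\top} \mathbf{R}_2) - 1}{2} \right),$$

where $\operatorname{Tr}(\cdot)$ denotes the matrix trace.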
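As a minimal sketch, the two pose distances of this section (the L2 norm between translation vectors and the trace-based angle between rotation matrices) could be computed as follows. Returning the rotation distance in degrees and clipping the cosine against rounding errors are choices made here for illustration, and the function names are hypothetical.

```python
import numpy as np


def translation_error(t1, t2):
    """L2 norm of the difference between two translation vectors."""
    return float(np.linalg.norm(np.asarray(t1, dtype=float) - np.asarray(t2, dtype=float)))


def rotation_error(R1, R2):
    """Angle (in degrees) of the relative rotation between two rotation matrices."""
    R1 = np.asarray(R1, dtype=float)
    R2 = np.asarray(R2, dtype=float)
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

Unlike a per-axis average, both quantities grow with a large error on a single component, which is the property motivated in the text.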

4.2 Scenarios

4.2.1 The stability scenario

In this first scenario, we propose to quantify the degree of pose jitter when tracking a static object. To evaluate this, we captured sequences of 5 seconds of the object under 4 different viewpoints and with 3 configurations: at a distance of 0.8m from the sensor (“near”), at a distance of 1.5m from the sensor (“far”), and a last one at 0.8m from the sensor, but this time with distractor objects partly occluding the object of interest (“occluded”). To measure the stability, Tan et al. [3] use the standard deviation of the pose parameters on a sequence. We propose a different metric inspired from [20] that penalizes variation from frame to frame instead of the general distribution across the sequence. We compute the distance between the poses $\mathbf{P}_{t-1}$ and $\mathbf{P}_t$ estimated at consecutive frames. In other words, we report the distribution of the translation and rotation distances of sec. 4.1, $\delta_T(\mathbf{P}_{t-1}, \mathbf{P}_t)$ and $\delta_R(\mathbf{P}_{t-1}, \mathbf{P}_t)$, for all frames of the stability scenario. Note that $t$ indexes frames, so the two poses are one frame interval apart.
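The frame-to-frame stability measure can be sketched as below, assuming the tracker output is available as an array of translation vectors and an array of rotation matrices (one per frame). The function name and the choice of degrees for the rotation distance are assumptions.

```python
import numpy as np


def frame_to_frame_jitter(translations, rotations):
    """Distances between consecutive predicted poses of a static object.

    translations: (N, 3) array; rotations: (N, 3, 3) array.
    Returns two arrays of length N - 1 (translation and rotation distances).
    """
    translations = np.asarray(translations, dtype=float)
    rotations = np.asarray(rotations, dtype=float)
    d_trans = np.linalg.norm(translations[1:] - translations[:-1], axis=1)
    # trace(A^T B) equals the sum of the element-wise products of A and B.
    traces = np.einsum("nij,nij->n", rotations[:-1], rotations[1:])
    d_rot = np.degrees(np.arccos(np.clip((traces - 1.0) / 2.0, -1.0, 1.0)))
    return d_trans, d_rot
```

A slowly drifting estimate illustrates the difference with a standard-deviation measure: its per-sequence standard deviation can be large while its frame-to-frame distances stay small, so only the latter isolates jitter.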

4.2.2 The occlusion scenario

To evaluate the robustness to occlusion, we follow [1] and place the object on a turntable at 1.2m from the sensor, and a static occluder is placed in front of the object in a vertical and horizontal position. We compute the amount of occlusion based on the largest dimension of the object, and provide sequences for each object from 0% to 60% occlusion in 15% increments, which results in a total of 9 sequences per object. Here, we compute errors by comparing the predicted pose $\mathbf{P}_t$ at time $t$ with the ground truth pose $\mathbf{P}^*_t$ for that same frame, i.e., we report the translation and rotation errors of sec. 4.1, $\delta_T(\mathbf{P}_t, \mathbf{P}^*_t)$ and $\delta_R(\mathbf{P}_t, \mathbf{P}^*_t)$. Temporal trackers may lose tracking on difficult frames. This can affect the overall score depending on the moment where the tracker fails. To bypass this limitation, we initialize the tracker at the ground truth pose every 15 frames as in [1].
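The evaluation protocol with periodic re-initialization can be sketched as a simple loop. The tracker is abstracted as a hypothetical callable `track_step(frame, previous_pose)` and the metric as `error_fn(predicted_pose, true_pose)`; the exact frame at which the reset is applied is an assumption.

```python
def evaluate_with_resets(frames, ground_truth, track_step, error_fn, reset_interval=15):
    """Run a temporal tracker, re-initializing it at the ground truth periodically.

    Returns one error value per tracked frame (the first frame only initializes).
    """
    errors = []
    pose = ground_truth[0]  # the tracker starts from the true pose
    for index in range(1, len(frames)):
        if index % reset_interval == 0:
            # Periodic reset: forget accumulated drift before this frame.
            pose = ground_truth[index - 1]
        pose = track_step(frames[index], pose)
        errors.append(error_fn(pose, ground_truth[index]))
    return errors
```

With this protocol, a single early failure cannot dominate the score of a sequence, since the accumulated error is bounded by what the tracker can drift within one reset interval.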

4.2.3 The interaction scenario

In this last scenario, the experimenter holds the object in his hands and manipulates it in 4 different ways: 1) by moving the object around but without rotating it (“translation-only”); 2) by rotating the object on itself without translating it (“rotation-only”); 3) by freely moving and rotating the object around at low speeds (“free-slow”); and 4) by freely moving and rotating the object at higher speeds and by voluntarily generating more occlusions (“free-hard”). In all situations but “free-hard”, we reset the tracker every 15 frames and report the translation and rotation errors $\delta_T$ and $\delta_R$ as in sec. 4.2.2. Since object speed varies, we also compute the translational and rotational speeds of the object and report the performance metrics above as a function of these speeds. In addition, it is also informative to count the number of times the tracker has failed. We consider that tracking has failed when either $\delta_T$ or $\delta_R$ exceeds its threshold for more than 7 consecutive frames. When a failure is detected, the tracker is reset at the ground truth pose. We report these failures on the “free-hard” sequences only.
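The failure-counting rule can be sketched on logged per-frame errors as follows. The two thresholds are left as arguments because their values are not recoverable from the text here; the function name is hypothetical, and the sketch assumes the errors were logged while the tracker was being reset after each detected failure.

```python
def count_failures(trans_errors, rot_errors, trans_threshold, rot_threshold, max_consecutive=7):
    """Count tracking failures in a sequence of per-frame errors.

    A failure is declared once either error has exceeded its threshold for
    more than `max_consecutive` consecutive frames; the streak then restarts,
    mirroring the reset of the tracker at the ground truth pose.
    """
    failures = 0
    streak = 0
    for d_t, d_r in zip(trans_errors, rot_errors):
        if d_t > trans_threshold or d_r > rot_threshold:
            streak += 1
        else:
            streak = 0
        if streak > max_consecutive:
            failures += 1
            streak = 0
    return failures
```

Requiring several consecutive bad frames avoids counting a brief error spike, from which a temporal tracker may recover on its own, as a loss of tracking.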

4.3 Dataset statistics

Four different objects were selected ranging from simple to more complex geometry and texture: shoe, toy clock, skull, and dragon. To obtain a highly precise 3D model of each object in the database, each of them was scanned with a Creaform GoScan™ handheld 3D scanner at a resolution of 1mm. The scans were cleaned using Creaform VxElements™ to remove background and spurious vertices.

Overall, the dataset contains 100 sequences: 25 sequences for each object. The breakdown is the following: 12 sequences for stability (4 viewpoints, 3 configurations: “near”, “far”, “occluded”); 9 sequences for occlusion (0%, and 15% to 60% in 15% increments for both horizontal and vertical occluders); and 4 sequences for interaction (“rotation-only”, “translation-only”, “free-slow”, “free-hard”). It also contains 4 high resolution models with mesh and texture, and 100 Kinect V2 RGBD frames with ground truth pose per object.
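The sequence counts above can be checked by enumerating the combinations. The sequence and object identifiers below are made up for illustration and do not reflect the naming used in the released dataset.

```python
from itertools import product

objects = ["shoe", "clock", "skull", "dragon"]

# 3 configurations x 4 viewpoints
stability = [f"stability_{config}_{view}"
             for config, view in product(["near", "far", "occluded"], range(4))]

# one unoccluded sequence, plus 4 occlusion levels x 2 occluder orientations
occlusion = ["occlusion_0"] + [
    f"occlusion_{orientation}_{level}"
    for orientation, level in product(["horizontal", "vertical"], [15, 30, 45, 60])
]

interaction = ["rotation-only", "translation-only", "free-slow", "free-hard"]

per_object = stability + occlusion + interaction
all_sequences = [(name, sequence) for name in objects for sequence in per_object]
```

The enumeration gives 12 + 9 + 4 = 25 sequences per object, hence 100 sequences for the 4 objects.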

5 Analyzing a deep 6-DOF tracker with our dataset

As a testbed to evaluate the relevance of the new dataset, we borrow the technique of Garon and Lalonde [1] who train a 6-DOF tracker using deep learning, but propose changes to their architecture and training methodology. We evaluate several variants of the network on our dataset and show that it can be used to accurately quantify the performance of a tracker in a wide variety of scenarios.

5.1 Training an object-specific tracker

Input (rendered object)     Input (observed object)
conv3-96                    conv3-96
fire-48-96                  fire-48-96
              concatenation
              fire-96-384
              fire-192-768
              fire-384-768
              FC-500
              FC-6
Output: 6-DOF pose change

Figure 6: The deep learning architecture used to track 3D objects in this work, inspired by the one in [1]. The notation “conv$x$-$y$” indicates a convolution layer of $y$ filters of dimension $x \times x$, “fire-$x$-$y$” indicates a “fire” module [21] which reduces the number of channels to $x$ and expands it to $y$, and “FC-$x$” is a fully-connected layer of $x$ units. Each convolution layer is followed by a max pooling operation to downsample its representation. We use a dropout of 50% on the input connections to the FC-500 layer. All layers (except the last FC-6) have batch normalization and the ELU activation function [22].

The proposed network architecture is shown in fig. 6. As in [1], the network accepts two inputs: an image of the object rendered at its predicted position (from the previous timestamp in the video sequence), and an image of the observed object at the current timestamp. The last layer outputs the 6 DOF (3 for translation, 3 for rotation in Euler angles) representing the pose change between the two inputs. As in [1], the loss used is simply the MSE between the predicted and ground truth pose change. Note that we experimented with the reprojection loss [23], but found it did not help in our context. Our improvements require the same runtime as [1].

As another key difference with [1], we rely purely on synthetic data to train the network in fig. 6 (in [1], a set of real frames was required to fine-tune the network). We also generate the synthetic data with one important difference. The approach of [1] consists in generating pairs of synthetic views of the object with random pose changes between them. To sample the random pose changes, they proposed to independently sample each component of a random translation and of a rotation in Euler angle notation from a uniform distribution on a fixed interval. Doing so actually biases the resulting pose changes. For example, translations of very small amplitudes are very unlikely to be generated (since all three translation components would need to be small simultaneously). Instead, we propose to sample a random translation direction and magnitude separately. The translation direction is sampled in spherical coordinates $(\theta, \phi)$, where $\theta \sim U(-180^\circ, 180^\circ)$ and $\phi = \cos^{-1}(2x - 1)$ with $x \sim U(0, 1)$, $U(a, b)$ referring to a uniform distribution on the interval $[a, b]$. The translation magnitude is drawn from a Gaussian distribution $\mathcal{N}(0, \Delta_t)$. The same process is repeated for rotations, where the rotation axis and angle are sampled similarly, with the angle drawn from $\mathcal{N}(0, \Delta_r)$. Here, we intentionally parameterize the translation magnitude and rotation angle distributions with $\Delta_t$ and $\Delta_r$, since the range of these parameters may influence the behavior of the network.
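Sampling a pose change by drawing a direction on the unit sphere and a Gaussian magnitude separately can be sketched as follows. The sketch assumes the standard uniform-on-the-sphere construction (one angle uniform over a full turn, the other obtained through an inverse cosine), treats the scale parameters `delta_t` and `delta_r` as standard deviations, and uses hypothetical function names.

```python
import math
import random


def sample_direction(rng):
    """Draw a unit vector uniformly distributed on the sphere."""
    theta = rng.uniform(-math.pi, math.pi)       # azimuth, uniform over a full turn
    phi = math.acos(2.0 * rng.random() - 1.0)    # polar angle, cos(phi) uniform in [-1, 1)
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))


def sample_translation(delta_t, rng):
    """Random translation: uniform direction, Gaussian magnitude (std delta_t, assumed)."""
    direction = sample_direction(rng)
    magnitude = rng.gauss(0.0, delta_t)
    return tuple(magnitude * component for component in direction)


def sample_rotation(delta_r, rng):
    """Random rotation as (axis, angle): uniform axis, Gaussian angle (std delta_r, assumed)."""
    return sample_direction(rng), rng.gauss(0.0, delta_r)
```

Because the magnitude is drawn on its own, small pose changes are the most likely ones, whereas drawing the three components independently from a uniform distribution makes near-zero magnitudes rare.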

5.2 Training a generic tracker

To train a generic 6-DOF object tracker, we experimented with two ways of generating a training dataset, using the same network architecture, loss, and training procedure described in sec. 5.1. First, we generate a training set of images that contain all 4 objects from our dataset, as well as 30 other objects. These other objects, downloaded from 3D Warehouse (available at https://3dwarehouse.sketchup.com) and from “Linemod” [7], show a high diversity in geometry and texture and are roughly of the same size. We name the network trained on this dataset the “multi-object” network. Second, we generate a training set of images that contain only the 30 other objects; the actual objects to track are not included. We call this network “generic”, since it never saw any of the objects in our dataset during training. Note that all these approaches require the 3D model of the object to track at test time, however.

(a) Impact of $\Delta_t$ on $\delta_T$ (b) Impact of $\Delta_r$ on $\delta_R$
Figure 7: Applying our evaluation methodology for determining the best range of translations and rotations for generating synthetic data when training a deep 6-DOF tracker. We plot (a) the impact of varying the translation magnitude parameter $\Delta_t$ on the translation error $\delta_T$, and (b) the impact of varying the rotation angle parameter $\Delta_r$ on the rotation error $\delta_R$ for all three scenarios (from top to bottom: stability, occlusion, and interaction). The box plots indicate percentiles of the error distribution, ordered from bottom to top. See the supplementary material for more results.

6 Experiments

In this section, we perform an exhaustive evaluation of the various approaches presented in sec. 5 using our novel dataset and framework proposed in sec. 4. First, we analyze the impact of varying the training data generation hyper-parameters, namely the translation magnitude parameter $\Delta_t$ and the rotation angle parameter $\Delta_r$, for the object-specific case. Then, we proceed to compare our object-specific, “multi-object”, and “generic” trackers with two methods from the state of the art: Garon and Lalonde [1] and Tan et al. [5].

6.1 Analysis of training data generation parameters

We now apply the evaluation methodology proposed in sec. 4 to the method presented above and evaluate the influence of the translation magnitude and rotation angle hyper-parameters $\Delta_t$ and $\Delta_r$ on the various metrics and sequences from our dataset. We experiment by varying $\Delta_t$ and $\Delta_r$ one at a time (the other parameter is kept at its lowest value). For each of these parameters, we synthesize 200,000 training image pairs per object using [1] and the modifications proposed in sec. 5.1. We then train a network for each object, for each set of parameters, and evaluate each network on our dataset.

A subset of the results of this analysis is shown in fig. 7. Note that, for the interaction scenario, the “free-hard” sequences (sec. 4.2.3) were left out since they are much harder than the others and would bias the results. In particular, we show the impact that varying $\Delta_t$ has on the translation error $\delta_T$, as well as the impact that varying $\Delta_r$ has on the rotation error $\delta_R$, for all 3 scenarios. Here, we drop the parentheses for the error metrics for ease of notation (see sec. 4 for the definitions).

The figure reveals a clear trend: increasing $\Delta_r$ (fig. 7-(b)) systematically results in worse performance in rotation. This is especially visible for the high occlusion cases (45% and 60%), where the rotation error increases significantly as a function of $\Delta_r$. The situation is not so simple when $\Delta_t$ is increased (fig. 7-(a)). Indeed, while increasing $\Delta_t$ negatively impacts $\delta_T$ in the stability and occlusion scenarios, performance actually improves when the object speed is higher, as seen in the interaction scenario. Therefore, to achieve a good balance between stability and accuracy at higher speeds, an intermediate value of $\Delta_t$ seems to be a good trade-off. The remainder of the plots for this analysis, as well as plots evaluating the impact of the resolution of the crop and the size of the bounding box w.r.t. the object, are shown in the supplementary material.

6.2 Comparison with previous work

Additionally, we provide a comparison of our networks against other approaches in fig. 8. In particular, we compare with object-specific versions of the Deep 6-DOF Tracker work of Garon and Lalonde [1] as well as the Random Forest approach of Tan et al. [5]. For [5], we use the training parameters reported in their paper. For our techniques, the translation magnitude and rotation angle hyper-parameters $\Delta_t$ and $\Delta_r$ were obtained with leave-one-out cross-validation to ensure no training/test overlap. As before, the “free-hard” sequences were left out for the interaction experiments.

Overall, as can be observed in fig. 8, the proposed deep learning methods perform either on par or better than the previous work. The “object-specific” networks outperform all the other techniques in the stability and occlusion scenarios, except for the case of translational error in the 0% occlusion case. In the interaction scenario, it performs remarkably well at predicting rotations (only median error at 60% occlusion), and is on par with the other methods for translation. In comparison, [5] performs well at low occlusions, but fails when the occlusion level is 30% or greater (particularly in rotation). [1] shows improved robustness to occlusions, but still achieves high rotation errors at 45% occlusion, and is also much less stable (esp. in rotation) than our “object-specific” networks. Interestingly, our “generic” tracker, which has seen none of these objects in training, performs similarly to the previous works that were trained specifically on these objects. Indeed, it shows a stability, robustness to occlusions and behavior at higher speeds that is similar to [1] and [5], demonstrating that learning generic features that are useful for tracking objects can be achieved.

Finally, we use the “free-hard” interaction sequences to count the number of times the tracking is lost (sec. 4.2.1). In this case, [5] loses tracking 23 times on 4 sequences, while [1] loses tracking only 6 times. In contrast, the “object-specific” networks lose tracking only once, whereas the “multi-object” network loses it 4 times. Surprisingly, we recorded only a single failure for the “generic” network. Qualitative videos showing side-by-side comparisons of these methods are available in the supplementary material.
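Counting tracking losses from per-frame errors could look like the sketch below. The exact failure criterion is defined in sec. 4.2.1 and is not reproduced here, so the thresholds and the minimum run length are hypothetical parameters, and the function name is illustrative only. The sketch also assumes the error sequences are already given; any tracker reset after a failure happens upstream.

```python
from typing import Sequence


def count_tracking_losses(
    trans_err: Sequence[float],
    rot_err: Sequence[float],
    trans_thresh: float,
    rot_thresh: float,
    min_frames: int,
) -> int:
    """Count runs of at least `min_frames` consecutive frames in which the
    translation or rotation error exceeds its (assumed) threshold."""
    if len(trans_err) != len(rot_err):
        raise ValueError("error sequences must have the same length")
    if min_frames < 1:
        raise ValueError("min_frames must be at least 1")
    losses = 0
    run = 0  # length of the current run of over-threshold frames
    for t, r in zip(trans_err, rot_err):
        if t > trans_thresh or r > rot_thresh:
            run += 1
            if run == min_frames:
                # Count each run once, at the frame where it becomes long enough.
                losses += 1
        else:
            run = 0
    return losses
```

Requiring several consecutive bad frames keeps a single noisy estimate from being counted as a failure; whether the paper uses such a rule should be checked against sec. 4.2.1.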

[Figure 8 plots: one row each for the stability, occlusion, and interaction scenarios.]

Figure 8: Comparison of our networks (shades of blue) with the previous work of [1] (green) and [5] (red). Our “object-specific” networks outperform the state of the art in stability and occlusion, and perform remarkably well at predicting the rotation. Our “generic” tracker shows great promise: although not quite as good as the “object-specific” version overall, it results in lower translational error in the interaction scenarios, even though it has not seen any of these objects during training. See the supplementary video for a qualitative comparison of the trackers.

7 Discussion

When considering the recent evolution in tracking performance on the popular dataset of Choi and Christensen [4], we conclude that the dataset has now essentially been solved. To keep making progress as a community, it is critical that it be replaced by a new dataset containing real data and more challenging situations. We provide such a dataset, which we hope will spur further research in the field. Our dataset comprises 100 sequences of 4 objects of various shapes and textures. The sequences are grouped into 3 scenarios: stability, occlusion, and interaction. Our dataset and companion evaluation code will be released publicly upon publication of the paper. Additionally, we build on the deep learning framework of [1] with an improved architecture and training procedure, which allows the network to learn purely from synthetic data yet generalize well to real data. The architecture also allows for training on multiple objects and testing on different objects never seen in training. To the best of our knowledge, we are the first to propose such a generic learner for the 6-DOF object tracking task. Finally, our approach is extensively compared with recent work and is shown to achieve comparable or better performance.

A current limitation of the dataset is its small number of objects (4) and of high-speed sequences, which explains some of the high variance visible in the interaction scenario. We are currently expanding the dataset, with the goal of reaching at least 10 different objects. Another limitation is that the Vicon markers must be removed in a post-processing step, which may leave some artifacts behind. While the markers are very small (3mm) and the resulting marker-free images have low error (see fig. 3), there might still be room for improvement. Finally, our “generic” tracker is promising, but it still does not perform quite as well as “object-specific” models, especially for rotations. In addition, a 3D model of the object is still required at test time, so exploring how this constraint can be removed would make for an exciting future direction.


The authors wish to thank Jonathan Gilbert for his help with data acquisition and Sylvain Comtois for his help with Vicon setup. This work was supported by the NSERC/Creaform Industrial Research Chair on 3D Scanning: CREATION 3D. We gratefully acknowledge the support of Nvidia with the donation of the Tesla K40 and Titan X GPUs used for this research.


  • [1] Garon, M., Lalonde, J.F.: Deep 6-DOF tracking. IEEE Transactions on Visualization and Computer Graphics 23(11) (2017)
  • [2] Kehl, W., Tombari, F., Ilic, S., Navab, N.: Real-time 3D model tracking in color and depth on a single CPU core. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
  • [3] Tan, D.J., Navab, N., Tombari, F.: Looking beyond the simple scenarios: Combining learners and optimizers in 3D temporal tracking. IEEE Transactions on Visualization and Computer Graphics 23(11) (2017) 2399–2409
  • [4] Choi, C., Christensen, H.I.: RGB-D object tracking: A particle filter approach on GPU. In: International Conference on Intelligent Robots and Systems. (2013)
  • [5] Tan, D.J., Tombari, F., Ilic, S., Navab, N.: A versatile learning-based 3D temporal tracker: Scalable, robust, online. In: IEEE International Conference on Computer Vision. (2015)
  • [6] Krull, A., Michel, F., Brachmann, E., Gumhold, S., Ihrke, S., Rother, C.: 6-DOF model based tracking via object coordinate regression. In: Asian Conference on Computer Vision. (2014)
  • [7] Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: Asian Conference on Computer Vision. (2012)
  • [8] Hodan, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis, M., Zabulis, X.: T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In: IEEE Winter Conference on Applications of Computer Vision. (2017)
  • [9] Tejani, A., Tang, D., Kouskouridas, R., Kim, T.K.: Latent-class hough forests for 3D object detection and pose estimation. In: European Conference on Computer Vision. (2014)
  • [10] Doumanoglou, A., Kouskouridas, R., Malassiotis, S., Kim, T.K.: Recovering 6D object pose and predicting next-best-view in the crowd. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
  • [11] Akkaladevi, S., Ankerl, M., Heindl, C., Pichler, A.: Tracking multiple rigid symmetric and non-symmetric objects in real-time using depth data. In: IEEE International Conference on Robotics and Automation. (2016)
  • [12] Aldoma, A., Tombari, F., Prankl, J., Richtsfeld, A., Di Stefano, L., Vincze, M.: Multimodal cue integration through hypotheses verification for RGB-D object recognition and 6DOF pose estimation. In: IEEE International Conference on Robotics and Automation. (2013) 2104–2111
  • [13] Kwon, J., Choi, M., Park, F.C., Chun, C.: Particle filtering on the euclidean group: framework and applications. Robotica 25(6) (2007)
  • [14] Chitchian, M., van Amesfoort, A.S., Simonetto, A., Keviczky, T., Sips, H.J.: Adapting particle filter algorithms to many-core architectures. In: IEEE International Symposium on Parallel & Distributed Processing (IPDPS). (2013) 427–438
  • [15] Tan, D.J., Ilic, S.: Multi-forest tracker: A chameleon in tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014)
  • [16] Breiman, L.: Random forests. Machine Learning 45(1) (2001)
  • [17] Tjaden, H., Schwanecke, U., Schömer, E.: Real-time monocular pose estimation of 3D objects using temporally consistent local color histograms. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
  • [18] Merriaux, P., Dupuis, Y., Boutteau, R., Vasseur, P., Savatier, X.: A study of vicon system positioning performance. Sensors 17(7) (2017) 1591
  • [19] Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11) (2000) 1330–1334
  • [20] Niehorster, D.C., Li, L., Lappe, M.: The accuracy and precision of position and orientation tracking in the HTC vive virtual reality system for scientific research. i-Perception 8(3) (2017)
  • [21] Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. arXiv:1602.07360 (2016)
  • [22] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289 (2015)
  • [23] Kendall, A., Cipolla, R.: Geometric loss functions for camera pose regression with deep learning. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)

Appendix A Supplementary material

In this document, we present two additional sets of results to complement the main paper. First, in addition to the experiments conducted on hyperparameters and in sec. 6.1 of the main paper, we also evaluate two other parameters for the dataset generation:

  • the size of the bounding box used to crop the object in the scene;

  • the resolution of the input images.

Second, fig. 11 shows images of the 3D models used for training the generic network in sec. 5.2 of the manuscript.

A.1 Bounding box size evaluation

[Figure 9 plots: one row each for the stability, occlusion, and interaction scenarios; columns (a) and (b).]

Figure 9: Applying our evaluation methodology to determine the impact of the size of the bounding box used to crop the object. The initial bounding box size (0%) is determined by the distance between the two vertices that are furthest apart on the object. We vary this size from -25% to 25% of the initial bounding box for all three scenarios (from top to bottom: stability, occlusion, and interaction). A smaller bounding box helps in most scenarios, but achieves lower performance on the interaction scenario. This can be explained by the fact that larger translations may bring the object outside of the bounding box.
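The bounding box sizing described in the caption above can be sketched as follows. This is a minimal illustration, assuming the vertices are given as 3D coordinate tuples and that a variation of -25% or +25% simply scales the initial size by 0.75 or 1.25; the function names are hypothetical and are not taken from the paper's code.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]


def object_diameter(vertices: Sequence[Vertex]) -> float:
    """Distance between the two vertices that are furthest apart.

    Brute force over all pairs (quadratic in the number of vertices),
    which is fine for a sketch but slow for dense meshes.
    """
    if len(vertices) < 2:
        raise ValueError("need at least two vertices")
    return max(math.dist(p, q) for p, q in combinations(vertices, 2))


def bounding_box_size(vertices: Sequence[Vertex], relative_change: float = 0.0) -> float:
    """Initial size (relative_change = 0) scaled by (1 + relative_change),
    e.g. -0.25 for the -25% setting and 0.25 for the +25% setting."""
    return object_diameter(vertices) * (1.0 + relative_change)


# Toy usage: the eight corners of a unit cube.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
sizes = [bounding_box_size(cube, c) for c in (-0.25, 0.0, 0.25)]
```

For large meshes, computing the diameter on the convex hull vertices only would give the same value at lower cost.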

A.2 Resolution evaluation

[Figure 10 plots: one row each for the stability, occlusion, and interaction scenarios; columns (a) and (b).]

Figure 10: Applying our evaluation methodology for determining the impact of the resolution of the input images. We vary the resolution from to for all three scenarios (from top to bottom: stability, occlusion, and interaction). While the lower resolution achieves good results in most situations, the achieves a better overall result, esp. for estimating rotations in the occlusion scenario.

A.3 3D models

Figure 11: 3D models used to train the generic tracker. All models are resized so they have approximately the same dimensions. The models were downloaded from https://3dwarehouse.sketchup.com, the Linemod dataset [7], and Choi and Christensen [4].