ViPR: Visual-Odometry-aided Pose Regression for 6DoF Camera Localization


Visual Odometry (VO) accumulates a positional drift in long-term robot navigation tasks. Although Convolutional Neural Networks (CNNs) improve VO in various aspects, VO still suffers from moving obstacles, discontinuous observation of features, and poor textures or visual information. While recent approaches estimate a 6DoF pose either directly from (a series of) images or by merging depth maps with the optical flow (OF), research that combines absolute pose regression with OF is limited.

We propose ViPR, a novel architecture for long-term 6DoF VO that leverages synergies between absolute pose estimates (from PoseNet-like architectures) and relative pose estimates (from FlowNet-based architectures) by combining both through recurrent layers. Experiments with known publicly available datasets and with our own Industry dataset show that our novel design outperforms existing techniques in long-term navigation tasks.


1 Introduction

Real-time tracking of mobile objects, e.g., forklifts, trucks or workers in industrial areas, makes it possible to monitor and optimize workflows, enables zoning rules for safety, and tracks goods for automated inventory management. Such environments typically include large warehouses or factory buildings, and localization solutions often combine radio-, LiDAR- or radar-based systems.

However, these solutions often require infrastructure or they are costly in their operation. An alternative approach is a (mobile) optical pose estimation based on ego-motion. Such approaches are usually based on SLAM (Simultaneous Localization and Mapping), meet the requirements of exact real-time localization, and are also cost-efficient.

Available pose estimation approaches are categorized into three groups: classical, hybrid, and deep learning (DL)-based methods. Classical methods often require an infrastructure that includes either synthetic (i.e., installed in the environment) or natural (e.g., walls and edges) markers. The accuracy of the pose estimation depends to a large extent on suitable invariance properties of the available features such that they can be reliably recognized. However, to reliably detect features, we have to invest a lot of expensive computing time [30, 20]. Additional sensors (e.g., inertial sensors, depth cameras, etc.) or additional context (e.g., 3D models of the environment, prerecorded landmark databases, etc.) may increase the accuracy but also increase system complexity and costs [35]. Hybrid methods [49, 5, 4, 17, 54] combine geometric and machine learning (ML) approaches. For instance, ML predicts the 3D position of each pixel in world coordinates, from which geometry-based methods infer the camera pose [13].

Figure 1: Our pose estimation pipeline solves the APR- and RPR-tasks in parallel, and recurrent layers estimate the final 6DoF pose.

Recent methods that exploit DL to partly address the above mentioned issues of complexity and cost also aim for high positioning accuracy, e.g., regression forests [40] learn a mapping of images to positions based on 3D models of the environment. Absolute pose regression (APR) uses DL [47] as a cascade of convolution operators to learn poses only from 2D images. PoseNet [25] as an initial attempt has been successfully extended by Bayesian approaches [23], long short-term memories (LSTMs) [57] and others [39, 19, 28, 9]. Recent approaches to APR such as VLocNet [53, 46] and DGRNets [33] introduce relative pose regression (RPR) to address the APR-task. While APR needs to be trained for a particular scene, RPR may be trained for multiple scenes [47]. However, RPR alone does not solve the navigation task.

For applications such as indoor positioning, existing approaches are not yet mature, i.e., in terms of robustness and accuracy to handle real-world challenges such as changing environment geometries, lighting conditions, and camera (motion) artifacts. Hence, this paper proposes a 6DoF pose estimation based on PoseNet-like architectures and predictions of relative camera movement by RNNs, using the flow of image pixels between successive images computed by FlowNet2.0 [18], to capture time dependencies in the camera movement in the recurrent layers, see Fig. 1. Our model reduces the positioning error using this multitasking approach, which learns both the absolute poses based on monocular (2D) imaging and the relative motion for the task of estimating visual odometry (VO).

We first evaluate our approach on the small-scale 7-Scenes [49] dataset. As other datasets are unsuitable for evaluating continuous navigation tasks, we also release a dataset that can be used to evaluate various problems arising in real industrial scenarios, such as inconsistent lighting, occlusion, and dynamic environments. We benchmark our approach on both datasets against existing approaches [25, 57] and show that ViPR consistently outperforms them in pose estimation accuracy.

The rest of the paper is structured as follows. Section 2 discusses related work. Section 3 provides details about our architecture. We discuss available datasets and introduce our novel Industry dataset in Section 4. We present experimental results in Section 5 before Section 6 concludes.

2 Related Work

SLAM-driven 3D point registration methods enable precise self-localization even in unknown environments. Although VO has made remarkable progress over the last decade, it still suffers greatly from scaling errors between real and estimated maps [34, 51, 38, 21, 26, 27, 32, 42, 3, 31]. With more computing power, Visual Inertial SLAM (VISLAM) combines VO with Inertial Measurement Unit (IMU) sensors to partly resolve the scale ambiguity, to provide motion cues without visual features [34, 52, 21], to process more features, and to make tracking more robust [51, 26]. However, recent SLAM methods do not yet meet industrial requirements with respect to accuracy and reliability [44], as they need undamaged, clean and unobscured markers [31, 22] and as they still suffer from poor long-term stability and from the effects of movement, sudden acceleration and occlusion [55].

VO primarily addresses the problem of separating ego- from feature-motion and suffers from area constraints, poorly textured environments, scale drift, a lack of an initial position, and thus inconsistent camera trajectories [8]. Instead, PoseNet-like architectures (see Sec. 2.1) that estimate absolute poses on single-shot images are more robust, less compute-intensive, and can be trained in advance on application data. Unlike VO, they do not suffer from a lack of initial poses and do not require access to camera parameters, good initialization, and handcrafted features [48]. Although the joint estimation of relative poses may contribute to increasing accuracy (see Sec. 2.2), such hybrid approaches still suffer from dynamic environments, as they are often trained offline in quasi-rigid environments. While optical flow (see Sec. 2.3) addresses these challenges it has not yet been combined with APR for 6DoF self-localization.

2.1 Absolute Pose Regression (APR)

Methods that derive a 6DoF pose directly from images have been studied for decades. As a result, there are many classic methods whose complex components have been replaced by machine learning (ML) or deep learning (DL). For instance, RelocNet [2] learns metrics continuously from global image features through a camera frustum overlap loss. CamNet [12] is a coarse (image-based)-to-fine (pose-based) retrieval-based model that includes relative pose regression to get close to the best database entry that contains extracted features of images. NNet [29] queries a database for similar images to predict the relative pose between images, and RANSAC solves the triangulation to provide a position. While those classic approaches have already been extended with DL-components, their pipelines are expensive as they embed feature matching and projection and/or manage a database. Most recent (and simpler) DL-based methods also achieve higher accuracy.

The key idea of PoseNet [25] and its variants [24, 23, 15, 57, 56, 58, 45, 48, 43, 49], among others BranchNet [43] and Hourglass [49], is to use a CNN for camera (re-)localization. PoseNet works with scene elements of different scales and is partially insensitive to light changes, occlusions and motion blur. While Dense PoseNet [25] crops subimages and PoseNet2 [24] jointly learns network and loss function parameters, Cipolla et al. [23] link a Bernoulli function and apply variational inference [15] to improve the positioning accuracy. However, those variants work with single images and hence do not use the temporal context (which is available in continuous navigation tasks) that could help to increase accuracy.

In addition to PoseNet+LSTM [57], there are also similar approaches that exploit time-context that is inherently given by consecutive images (i.e., DeepVO [58], ContextualNet [45], and VidLoc [10]). Here, the key-idea is to identify temporal connections in-between the feature vectors (extracted from images) with LSTM-units and to only track feature correlations that contribute the most to the pose estimation. However, there are hardly any long-term dependencies between successive images, and therefore LSTMs give worse or equal accuracy to, for example, simple averaging over successively estimated poses [48]. Instead, we combine estimated poses from time-distributed CNNs with estimates of the optical flow to maintain the required temporal context in the features of image series.

2.2 APR/RPR-Hybrids

In addition to approaches that derive a 6DoF pose directly from an image there are hybrid methods that combine them with VO to increase the accuracy. VLocNet [53] is closely related to our approach as it estimates a global pose and combines it with VO (but it does not use OF). To further improve the (re)localization accuracy VLocNet++ [46] uses features from a semantic segmentation. However, we use different networks and do not need to share weights between VO and the global pose estimation. DGRNets [33] estimates both the absolute and relative poses, concatenates them, and uses recurrent CNNs to extract temporal relations between consecutive images. This is similar to our approach but we estimate the relative motion with OF, which allows us to train in advance on large datasets, making the model more robust. MapNet [6] learns a map representation from input data, combines it with GPS, inertial data, and unlabeled images, and uses pose graph optimization (PGO) to combine absolute and relative pose predictions. However, compared to all other methods the most accurate extension of it, MapNet+PGO, does not work on purely visual information, but exploits additional sensors.

Figure 2: Optical flow (OF): input image (left); OF-vectors as RPR-input (middle); color-coded visualization of OF [1] (right).

2.3 Optical Flow

Typically, VO uses OF to extract features from image sequences. Motion fields and OF-images, see Fig. 2, are used to estimate trajectories of pixels in a series of images. For instance, Flowdometry [41] estimates displacements and rotations from OF-input. Mansur et al. [37] proposed a VO-based dead reckoning system that uses OF to match features. Zhao et al. [59] combined two CNNs to estimate the VO-motion: FlowNet2-ss [18] estimates the OF and PCNN [11] links two images to process global and local pose information. However, to the best of our knowledge, we are the first to propose an OF-based architecture that estimates the relative camera movement through RNNs, using the optical flow (FlowNet2.0 [18]).

3 Proposed Model

After a data preprocessing step that crops fixed-size subimages from a sequence of four images, our pose regression pipeline consists of three parts (see Fig. 4): an absolute pose regression (APR) network, a relative pose regression (RPR) network, and a 6DoF pose estimation (PE) network. PE uses the outputs of the APR- and RPR-networks to provide the final 6DoF pose.

3.1 Absolute Pose Regression (APR) Network

Our APR-network predicts the 6DoF camera pose from three input images. The APR-network is based on the original PoseNet [25] model (i.e., essentially a modified GoogLeNet [50] with a regression head instead of a softmax) to train and predict the absolute positions in the Euclidean space and the absolute orientations as quaternions. From a single monocular image I the model predicts the pose

$\hat{x} = [\hat{p}, \hat{q}]$

as approximations to the actual p and q.

As the original model learns the image context, based on the shape and appearance of the environment, but does not exploit the time context and the relation between consecutive images [24], we adapted the model to a time-distributed variant. Hence, instead of a single image the new model receives three (consecutive) input images, see Fig. 3, uses three separate dense layers (one for each pose) with 2,048 neurons each, and each of the dense layers yields a pose. The middle pose yields the most accurate position for the image at the middle time step.

Figure 3: Structure of our APR-network (with time-distribution). A CNN encodes 3 images from three consecutive timesteps into 3 feature layers, and time-distributed FC-layers obtain 3 poses.

Figure 4: Pipeline of the ViPR-architecture. Data preprocessing (grey): Four consecutive input images are center cropped. For the absolute network the mean is subtracted. For the relative network the OF is precomputed by FlowNet2.0 [18]. The absolute poses are predicted by our time-distributed absolute pose regression network (yellow). The relative pose regression network (purple) predicts the transformed relative displacements and rotations on reshaped mean vectors of the OF with (stacked) LSTM-RNNs. The pose estimator (green) concatenates the absolute and relative components and predicts the absolute 6DoF poses with stacked LSTM-RNNs.

3.2 Relative Pose Regression (RPR) Network

Our RPR-network uses FlowNet2.0 [18] on each consecutive pair of the four input images to compute an approximation of the optical flow (see Fig. 2) and to predict three relative poses for later use.

As displacements of similar length but from different camera viewing directions result in different OFs, the displacement and rotation of the camera between pairwise images must be relative to the camera’s viewing direction of the first image. Therefore, we transform each camera’s global coordinate systems to the same local coordinate system by


with the rotation matrix R. The displacement is the difference between the transformed coordinate systems. The displacement, in global coordinates, is obtained by a back-transformation of the predicted displacement:


such that the following equations hold:

Figure 5: Pipeline of the relative pose regression (RPR) architecture: Data preprocessing, OF- and mean computation, reshaping, and concatenation, 3 recurrent LSTM units, and 2 FC-layers that yield the relative pose.

Fig. 5 shows the structure of the RPR-network. Similar to the APR-network, the RPR-network also uses a stack of images, i.e., three OF-fields from the four input images of the timesteps , to include more time context.
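The transformation into the viewing direction of the first camera, and its back-transformation into global coordinates, can be illustrated with a minimal sketch. It assumes that R is the rotation matrix of the first camera's orientation quaternion in (w, x, y, z) order and that R maps camera-frame vectors to the global frame; both conventions, and all function names, are illustrative assumptions and not taken from the paper. Because a rotation is linear, rotating the difference of two positions is equivalent to taking the difference of the two rotated positions.

```python
import math

def quat_to_rot(q):
    """Rotation matrix (3x3 nested list) of a quaternion q = (w, x, y, z).

    The quaternion is normalized first, so it need not be of unit length.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]

def transpose(m):
    """Transpose of a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[m[c][r] for c in range(3)] for r in range(3)]

def to_local(p_first, q_first, p_second):
    """Displacement from p_first to p_second, expressed in the first camera's frame."""
    delta_global = [b - a for a, b in zip(p_first, p_second)]
    return matvec(transpose(quat_to_rot(q_first)), delta_global)

def to_global(q_first, delta_local):
    """Back-transformation of a local displacement into global coordinates."""
    return matvec(quat_to_rot(q_first), delta_local)

# Camera rotated by 90 degrees about the z-axis, then moved 1 unit along global y.
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
local = to_local((0.0, 0.0, 0.0), q, (0.0, 1.0, 0.0))
restored = to_global(q, local)
```

In this example the global motion along y becomes a motion along the camera's own x-axis, so the same physical forward motion yields the same local displacement regardless of the global heading.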

In a preliminary study, we found that our recurrent units struggle to remember temporal features when the direct input of the OF is too large (i.e., the raw, per-pixel OF). This is in line with findings from Walch et al. [57]. Hence, we split the OF into zones and compute the mean value per zone for each of the two flow directions. We reshape the zone means of both directions and concatenate them, which results in a considerably smaller input. The LSTM-output is forwarded to 2 FC-layers that regress both the displacement and the rotation.
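The zone-wise averaging of the optical flow can be sketched as follows. This is a minimal illustration under stated assumptions: the flow field is a nested list of (u, v) vectors, the grid is square, and pixels that do not fit into a whole zone are ignored. The function names and the grid size used in the example are hypothetical and do not reflect the sizes used in the paper.

```python
def zone_means(flow, zones):
    """Average a dense flow field over a zones x zones grid.

    flow: H x W nested list of (u, v) flow vectors.
    Returns two flat lists in row-major zone order: mean u and mean v per zone.
    """
    height, width = len(flow), len(flow[0])
    zone_h, zone_w = height // zones, width // zones
    if zone_h == 0 or zone_w == 0:
        raise ValueError("more zones than pixels along one axis")
    mean_u, mean_v = [], []
    for zr in range(zones):
        for zc in range(zones):
            cells = [flow[r][c]
                     for r in range(zr * zone_h, (zr + 1) * zone_h)
                     for c in range(zc * zone_w, (zc + 1) * zone_w)]
            mean_u.append(sum(u for u, _ in cells) / len(cells))
            mean_v.append(sum(v for _, v in cells) / len(cells))
    return mean_u, mean_v

def rpr_input(flows, zones):
    """One feature row per flow field: zone means of both directions, concatenated."""
    rows = []
    for flow in flows:
        mean_u, mean_v = zone_means(flow, zones)
        rows.append(mean_u + mean_v)
    return rows

# Toy 4x4 flow field whose u equals the column index and v the row index.
toy_flow = [[(float(c), float(r)) for c in range(4)] for r in range(4)]
features = rpr_input([toy_flow, toy_flow, toy_flow], zones=2)
```

With three flow fields and a 2 x 2 grid, `features` holds three rows of eight values each, which is the kind of compact sequence a recurrent layer can consume.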

The 2 FC-layers are trained with a two-term loss function to predict the relative transformed displacement and rotation.

The first term measures the distance between the predicted, transformed displacement and the ground truth displacement under a vector norm. The second term quantifies the error of the predicted rotation with respect to the normalized ground truth rotation, likewise under a vector norm. Both terms are weighted by hyperparameters. A preliminary grid search with one weight fixed revealed an optimal value for the other weight that depends on the scaling of the environment.
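The weighted two-term loss described above can be sketched as follows. This is a minimal sketch that assumes Euclidean (L2) norms for both terms and uses the hypothetical weight names `alpha` and `beta`; the choice of norm, the names, and the example values are assumptions for illustration and are not taken from the paper.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(q):
    """Scale a rotation vector (e.g., a quaternion) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]

def relative_pose_loss(pred_disp, true_disp, pred_rot, true_rot, alpha=1.0, beta=1.0):
    """Weighted sum of a displacement error and a rotation error.

    The rotation error is measured against the normalized ground truth rotation.
    alpha and beta are hypothetical names for the two weighting hyperparameters.
    """
    displacement_term = l2_distance(pred_disp, true_disp)
    rotation_term = l2_distance(pred_rot, normalize(true_rot))
    return alpha * displacement_term + beta * rotation_term

# Displacement error of length 5, perfect rotation after normalization.
example_loss = relative_pose_loss(
    [0.0, 0.0, 0.0], [3.0, 4.0, 0.0],
    [1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0],
    alpha=1.0, beta=10.0,
)
```

Fixing one weight and searching over the other, as in the grid search mentioned above, changes only the relative influence of the two terms, which is why the best value can depend on the metric scale of the environment.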

3.3 6DoF Pose Estimation (PE) Network

Our PE-network predicts absolute 6DoF poses from the outputs of both the APR- and RPR-networks, see Fig. 6. The PE-network takes as input the absolute position, the absolute orientation, the relative displacement, and the rotation change. As we feed poses from three sequential timesteps as input to the model, it is implicitly time-distributed. The 2 stacked LSTM-layers and the 2 FC-layers return a 3DoF absolute position and a 3DoF orientation and are trained with a loss that weights the position and orientation errors.

Again, a preliminary grid search determined the norms and, with one weight fixed, an optimal value for the other weight.
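How the PE-network input combines absolute and relative estimates per timestep can be sketched as follows. The sketch assumes a 3-dimensional position and displacement and a 4-dimensional quaternion for orientation and rotation change, which gives 14 features per timestep; these sizes and the function name are assumptions for illustration, not values confirmed by the paper.

```python
def pe_input(abs_positions, abs_orientations, rel_displacements, rel_rotations):
    """Build one feature row per timestep by concatenating APR and RPR outputs.

    Each argument is a sequence with one entry per timestep.
    """
    lengths = {len(abs_positions), len(abs_orientations),
               len(rel_displacements), len(rel_rotations)}
    if len(lengths) != 1:
        raise ValueError("all inputs need one entry per timestep")
    steps = zip(abs_positions, abs_orientations, rel_displacements, rel_rotations)
    return [list(p) + list(q) + list(dp) + list(dq) for p, q, dp, dq in steps]

# Three timesteps of a camera moving 0.1 units per step along x without rotating.
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
orientations = [(1.0, 0.0, 0.0, 0.0)] * 3
displacements = [(0.1, 0.0, 0.0)] * 3
rotations = [(1.0, 0.0, 0.0, 0.0)] * 3
tensor = pe_input(positions, orientations, displacements, rotations)
```

The resulting 3 x 14 nested list corresponds to the sequence that stacked recurrent layers would process, with each row holding the absolute and the relative estimate of the same timestep side by side.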

4 Evaluation Datasets

To train our network we need two different types of image data: (1) images annotated with their absolute poses for the APR-network, and (2) images of optical flow, annotated with their relative poses for the RPR-network.

Datasets to evaluate APR. Publicly available datasets for absolute pose regression (Cambridge Landmarks [25] and TUM-LSI [57]) either lack accurate ground truth labels or the proximity between consecutive images is too large to embed meaningful temporal context, which is required for temporal networks such as ViPR. 7-Scenes [49] only embeds small-scale scenes and hence only enables small scene-wise evaluations. The Aalto University [29], Oxford RobotCar [36] and DeepLoc [46] datasets solve the small-scale issue, but are barely used for evaluation of state-of-the-art techniques. Hence, to compare ViPR with recent techniques we can only use parts of the 7-Scenes [49] dataset, and thus, had to record the Industry dataset (see Sec. 4.1) that embeds three different scenarios. These datasets allow a comprehensive and detailed evaluation with different movement patterns (such as slow motion and fast rotation).

Figure 6: Pipeline of the 6DoF pose estimation (PE) architecture. The input tensor contains absolute positions and orientations and relative displacements and rotations at three consecutive timesteps. 2 stacked LSTMs process the tensor and 2 FC-layers return the pose.

Datasets to evaluate RPR. To evaluate the performance of the RPR-network and its contribution to ViPR, we also need a dataset with a close proximity between consecutive images. This is key to calculate the relative movement with optical flow. However, most publicly available datasets (Middlebury [1], MPI Sintel [7], KITTI Vision Benchmark [16], and FlyingChairs [14]) either do not meet this requirement or the OF pixel velocities do not match those of real world applications. Hence, we cannot use available datasets to train the RPR-network. Instead, we directly calculate the OF from images with FlowNet2.0 [18] to train our RPR-network on it. Our novel Industry dataset allows this, while simultaneously retaining a large, diverse environment with hard real-world conditions, as described in the following.

4.1 Industry Dataset

We designed the Industry dataset to suit the requirements of both the APR- and the RPR-network. It is composed of three scenarios with multiple trajectories recorded at large scale using a high-precision laser-based reference system. Each scenario presents different challenges (such as dynamic ego-motion with motion blur), various environmental characteristics (such as different geometric scales and light changes, i.e., artificial and natural light), and ambiguously structured elements, see Fig. 7.

(a) Scenario #1 example images.
(b) Scenario #2 example images.
(c) Scenario #3 setup and example image.
Figure 7: Industry datasets. Setup of the measurement environment (i.e., forklift truck, warehouse racks and black walls) and example images with normal (a) and wide-angle (b+c) cameras.
(a) Training.
(b) Testing.
(c) Training.
(d) Testing.
Figure 8: Exemplary trajectories of Industry Scenarios #2 (a-b) and #3 (c-d) to assess the generalizability of ViPR.

Industry Scenario #1 [35] has been recorded with 8 cameras with overlapping fields of view (FoV), mounted on a stable apparatus that has been moved automatically at a constant velocity of approx. 0.3 m/s. The cameras are mounted at a height of 1.7 m. The scenario contains 521,256 images and densely covers an area of 1,320 m². The environment imitates a typical warehouse scenario under realistic conditions. Besides well-structured elements such as high-level racks with goods, there are also very ambiguous and homogeneously textured elements (e.g., blank white or dark black walls). Both natural and artificial light illuminates volatile structures such as mobile work benches. While the training dataset is composed of horizontal and vertical zig-zag movements of the apparatus, the movements in the test datasets vary to cover different properties for a detailed evaluation, e.g., different environmental scalings (i.e., scale transition, cross, large scale, and small scale), network generalization (i.e., generalize open, generalize racks, and cross), fast rotations (i.e., motion artifacts, which was recorded on a forklift at a height of 2.26 m) and volatile objects (i.e., volatility).

Industry Scenario #2 uses three cameras (with overlaps) on the same apparatus at the same height. The 11,859 recorded training images represent a horizontal zig-zag movement (see Fig. 8(a)) and the 3,096 test images represent a diagonal movement (see Fig. 8(b)). Compared to Scenario #1, this scenario has more variation in its velocities (between 0 m/s and 0.3 m/s, SD 0.05 m/s).

Industry Scenario #3 uses four cameras (with overlaps) on a forklift truck at a height of 2.26 m. Both the training and test datasets represent camera movements at varying, faster, and dynamic speeds (between 0 m/s and 1.5 m/s, SD 0.51 m/s). This makes the scenario the most challenging one. The training trajectory (see Fig. 8(c)) consists of 4,166 images and the test trajectory (see Fig. 8(d)) consists of 1,687 images. In contrast to Scenarios #1 and #2, we train and test a typical industry scenario on the dynamic movements of a forklift truck. However, the images of one camera were corrupted in the test dataset and are thus not used in the evaluation.

5 Experimental Results

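| Dataset | Spatial extent | PoseNet [25] pos. (original / our param.) | PoseNet [25] ori. (original / our param.) | PoseNet+LSTM [57] pos. | PoseNet+LSTM [57] ori. | APR-only (our param.) pos. | APR-only (our param.) ori. | APR+LSTM pos. | APR+LSTM ori. | ViPR* pos. | ViPR* ori. | Improv. ViPR (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7-Scenes [49] | | | | | | | | | | | | |
| chess | | 0.32 / 0.24 | 4.06 / 7.79 | 0.24 | 5.77 | 0.23 | 7.96 | 0.27 | 9.66 | 0.22 | 7.89 | +1.74 |
| heads | | 0.29 / 0.21 | 6.00 / 16.46 | 0.21 | 13.7 | 0.22 | 16.48 | 0.23 | 16.91 | 0.21 | 16.41 | +3.64 |
| office | | 0.48 / 0.33 | 3.84 / 10.08 | 0.30 | 8.08 | 0.36 | 10.11 | 0.37 | 10.83 | 0.35 | 9.59 | +4.01 |
| stairs | | 0.47 / 0.36 | 6.93 / 13.69 | 0.40 | 13.7 | 0.31 | 12.49 | 0.42 | 13.50 | 0.31 | 12.65 | +0.46 |
| total | | 0.39 / 0.29 | 5.21 / 12.00 | 0.29 | 10.3 | 0.28 | 11.76 | 0.32 | 12.73 | 0.27 | 11.63 | +2.46 |
| Industry Scenario 1 [35] | | | | | | | | | | | | |
| cross | | -- / 1.15 | -- / 0.75 | -- | -- | 0.61 | 0.53 | 4.42 | 0.21 | 0.46 | 0.60 | +25.31 |
| gener. open | | -- / 1.94 | -- / 11.73 | -- | -- | 1.68 | 11.07 | 3.36 | 2.95 | 1.48 | 10.86 | +11.75 |
| gener. racks | | -- / 3.48 | -- / 6.01 | -- | -- | 2.48 | 1.53 | 3.90 | 0.61 | 2.38 | 1.95 | +4.03 |
| large scale | | -- / 2.32 | -- / 6.37 | -- | -- | 2.37 | 9.82 | 4.99 | 1.61 | 2.12 | 8.64 | +10.68 |
| motion art. | | -- / 7.43 | -- / 124.94 | -- | -- | 7.48 | 131.30 | 8.18 | 139.37 | 6.73 | 136.6 | +10.01 |
| scale trans. | | -- / 2.17 | -- / 3.03 | -- | -- | 1.94 | 6.46 | 5.63 | 0.58 | 1.64 | 6.29 | +15.52 |
| small scale | | -- / 3.78 | -- / 9.18 | -- | -- | 4.09 | 20.75 | 4.46 | 6.06 | 3.50 | 15.74 | +14.41 |
| volatility | | -- / 2.68 | -- / 78.52 | -- | -- | 2.09 | 77.68 | 4.16 | 78.73 | 1.96 | 77.54 | +6.41 |
| total | | -- / 3.12 | -- / 30.07 | -- | -- | 2.82 | 32.30 | 4.89 | 28.76 | 2.53 | 32.28 | +12.27 |
| Scen. 2 | | | | | | | | | | | | |
| cam #0 | | -- / 0.49 | -- / 0.21 | -- | -- | 0.22 | 0.29 | 1.49 | 0.14 | 0.16 | 3.37 | +26.24 |
| cam #1 | | -- / 0.15 | -- / 0.38 | -- | -- | 0.23 | 0.35 | 2.68 | 0.17 | 0.12 | 2.75 | +46.49 |
| cam #2 | | -- / 0.43 | -- / 0.19 | -- | -- | 0.37 | 0.13 | 0.90 | 0.15 | 0.30 | 1.84 | +17.87 |
| total | | -- / 0.36 | -- / 0.26 | -- | -- | 0.27 | 0.26 | 1.69 | 0.15 | 0.20 | 2.65 | +30.20 |
| Scen. 3 | | | | | | | | | | | | |
| cam #0 | | -- / 0.41 | -- / 1.00 | -- | -- | 0.34 | 1.26 | 0.72 | 1.31 | 0.27 | 1.43 | +20.64 |
| cam #1 | | -- / 0.32 | -- / 1.07 | -- | -- | 0.26 | 1.11 | 0.88 | 1.27 | 0.21 | 1.06 | +20.13 |
| cam #2 | | -- / 0.32 | -- / 1.60 | -- | -- | 0.36 | 1.62 | 0.72 | 1.74 | 0.32 | 1.38 | +11.47 |
| total | | -- / 0.35 | -- / 1.22 | -- | -- | 0.32 | 1.33 | 0.77 | 1.44 | 0.27 | 1.29 | +17.41 |
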
Table 1: Pose estimation results (median position and orientation errors in meters (m) and degrees (°)) and total improvement of PE in % on the 7-Scenes [49] and Industry datasets. The best results are bold and underlined ones are additionally referenced in the text.

To compare ViPR with originally reported results of state-of-the-art, we first briefly describe our parameterization of PoseNet [25] and PoseNet+LSTM [57] in Section 5.1. Next, Section 5.2 presents our results. We highlight the performance of ViPR’s sub-networks (APR, APR+LSTM) individually, and investigate both the impact of RPR and PE on the final pose estimation accuracy of ViPR. Section 5.3 shows results of the RPR-network. Finally, we discuss general findings and show runtimes of our models in Section 5.4.

For all experiments we used an AMD Ryzen 7 2700 CPU (3.2 GHz) with two NVidia GeForce RTX 2070 GPUs with 8 GB GDDR6 VRAM each. Table 1 shows the median error of the position in meters and of the orientation in degrees. The second column reports the spatial extents of the datasets. The last column reports the improvement in position accuracy of ViPR (in %) over APR-only.

5.1 Baselines

As a baseline we report the originally published results on 7-Scenes of PoseNet [25] and PoseNet+LSTM [57] (in italic). We further reimplemented the initial variant of PoseNet and trained it from scratch with loss weights chosen to optimize positional accuracy at the expense of orientation accuracy. Tab. 1 (cols. 3 and 4) shows our implementation's results next to the originally reported ones (on 7-Scenes). As expected, the results of the two PoseNet implementations differ because of the changed loss weights in our implementation.

5.2 Evaluation of the ViPR-Network

In the following, we evaluate our method in multiple scenarios with different distinct challenges for the pose estimation task. 7-Scenes focuses on difficult motion blur conditions of typical human motion. We then use the Industry Scenario #1 to investigate various challenges at a larger scale, but with mostly constant velocities. Industry Scenarios #2 and #3 then focus on dynamic, fast ego-motion of a moving forklift truck at large-scale.

7-Scenes [49]. For both architectures (PoseNet and ViPR), we optimized the weighting of position and orientation in the loss such that it yields the smallest total median error. Both APR-only and ViPR return a slightly lower position error (0.28 m and 0.27 m, respectively) than PoseNet+LSTM with 0.29 m. ViPR yields an average improvement of the position accuracy of 2.46 % even in strong motion blur situations. The results indicate that ViPR relies on a plausible optical flow component to achieve performance that is superior to the baseline. In situations of negligible motion between frames the median only improves by 0.02 m. However, the average accuracy gain still shows that ViPR performs on par with or better than the baselines.

Stable motion evaluation. For the Industry Scenario #1 dataset, we train the models on the zig-zag trajectories and test them on specific sub-trajectories with individual challenges, but at almost constant velocity. In total, ViPR improves the position accuracy by 12.27 % on average (min.: 4.03 %; max.: 25.31 %), while the orientation error is similar for most of the architectures and test sets.

In environments with volatile features, i.e., objects that are only present in the test dataset, we found that ViPR (with optical flow) is significantly (6.41 %) better than APR-only. However, the high angular error of about 78° for all architectures indicates an irrecoverable degeneration of the APR-part. In tests with different scalings of the environment, we think that ViPR learns an interpretation of relative and absolute position regression that works in both small and large proximity to environmental features, as ViPR improves by 15.52 % (scale trans.), 14.41 % (small scale), and 10.68 % (large scale). When the test trajectories are located within areas that embed only few or no training samples (gener. racks and open), ViPR still improves over other methods by 4.03 % to 11.75 %. The highly dynamic test on a forklift truck (motion artifacts) is exceptional here, as only the test dataset contains dynamics and blur, which challenges ViPR most. However, ViPR still improves by 10.01 % over APR-only, even though such dynamics are entirely absent from the training data.

In summary, ViPR significantly decreases the median position error to 2.53 m, compared with 4.89 m for APR+LSTM. This and the other findings are strong indicators that the relative component RPR significantly supports the final pose estimation of ViPR.

(a) Scenario #2.
(b) Scenario #3.
Figure 9: Exemplary comparison of APR, ViPR, and a baseline (ground truth) trajectory of the Industry datasets.

Industry Scenario #2 is designed to evaluate the generalization to unknown trajectories. Hence, training trajectories represent an orthogonal grid, and test trajectories are diagonal. In total, ViPR improves the position accuracy by 30.20 % on average (min.: 17.87 %; max.: 46.49 %). Surprisingly, the orientation error is comparable for all architectures except ViPR. We think that this is because ViPR learns to optimize its position based on the APR- and RPR-orientations, and hence exploits these orientations to improve its position estimate, which we prioritized in the loss function. APR-only yields an average position error of 0.27 m, while the pure PoseNet yields position errors of 0.36 m on average, and APR+LSTM results in an even worse error of 1.69 m. In contrast, the novel ViPR outperforms all of them significantly with 0.20 m. Compared to our APR+LSTM approach, we think that ViPR on the one hand interprets and compensates the (long-term) drift of RPR and on the other hand smooths the short-term errors of APR, as PE counteracts the accumulation of RPR's scaling errors with APR's absolute estimates. Here, the synergies of the networks in ViPR are particularly effective. This is also visualized in Fig. 9: the green (ViPR) trajectory aligns more smoothly to the blue baseline when the movement direction changes. This may also indicate that the RPR component is necessary for ViPR to be able to generalize to unknown trajectories and to compensate scaling errors.

Dynamic motion evaluation. In contrast to the other datasets, the Industry Scenario #3 includes fast, large-scale, and highly dynamic ego-motion in both training and test datasets. However, all estimators yield findings similar to those of Scenario #2, as both scenarios embed motion dynamics and unknown trajectory shapes. Accordingly, ViPR again improves the position accuracy by 17.41 % on average (min.: 11.47 %; max.: 20.64 %), but this time exhibits very similar orientation errors. The improved orientation accuracy compared to Scenario #2 is likely due to the diverse orientations available in this dataset's training data.

Fig. 9 shows exemplary results that visualize how ViPR handles motion changes and motion dynamics in particular (see the abrupt direction changes in the trajectories). The results also indicate that ViPR predicts the smoothest and most accurate trajectories on unknown trajectory shapes (compare the corresponding trajectory segments). We think that ViPR significantly outperforms APR here because of the synergy of APR, RPR, and PE. RPR contributes the most in fast motion changes and in situations of motion blur. The success of RPR may also indicate that RPR differentiates between ego- and feature-motion to estimate a pose more robustly.

5.3 Evaluation of the RPR-Network

We use the smaller FlowNet2-s [18] variant of FlowNet2.0 as this has faster runtimes (140 Hz), and use it pretrained on the FlyingChairs [14], ChairsSDHom and FlyingThings3D datasets. To highlight that RPR contributes to the accuracy of the final pose estimates of ViPR, we explicitly test it on the Industry Scenario #3 that embeds dynamic motion of both ego- and feature-motion. Here, the relative proximity between consecutive samples is up to 20 , see Fig. 10. This results in a median of 2.49  in x- and 4.09  in y-direction on average. Hence, the error is between and . This shows that the RPR achieves meaningful results for relative position regression in a highly dynamic and therefore difficult setting. It furthermore appears to be relatively robust in its predictions despite both ego- and feature-motion. Further tests also showed comparable results for 7-Scenes [49].

Figure 10: Exemplary RPR-results (displacements ) against the baseline (ground truth) on the Scenario #3 dataset, see trajectory Fig. 8.

5.4 Discussion

6DoF Pose Regression with LSTMs. APR increases the positional accuracy over PoseNet for all datasets, see PoseNet and APR-only columns of Table 1. We found that the position errors increase when we use state-of-the-art methods with independent and single-layer LSTM-extensions [57, 58, 45, 48] on both the 7-Scenes and the Industry datasets, by 0.04 , 2.07 , 1.42  and 0.45 . This motivated us to investigate stacked LSTM-layers only for the RPR- and PE-networks.

We support the statement of Seifi et al. [48] that the motion between consecutive frames is too small for naive CNNs to embed it. Hence, additionally connected LSTMs are also unable to discover and track meaningful temporal and contextual relations between the features. Therefore, we undersampled the image sequences (5, 10, 15, 20, ..., 60) to embed longer temporal relations between consecutive frames such that both the CNN and the LSTM identify and track temporal causalities and relations. Note that the minimal undersampling rate depends on the minimal velocity of the ego-motion. We found that a sampling rate of 10 at a minimal velocity of 0.1  yields the highest gain.
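The undersampling step can be sketched as follows. This is a minimal illustration under the assumption that a rate of n means keeping every n-th frame; the helper names are hypothetical and the frame indices stand in for actual images.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def undersample(frames: Sequence[T], rate: int) -> List[T]:
    """Keep every `rate`-th frame, starting with the first one."""
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    return list(frames[::rate])


def consecutive_pairs(frames: Sequence[T]) -> List[Tuple[T, T]]:
    """Pair each kept frame with its successor, e.g., as input for a
    relative pose model that works on two images."""
    return list(zip(frames[:-1], frames[1:]))


# Frame indices stand in for images; a rate of 10 is the value that
# yielded the highest gain in the experiments described above.
frame_ids = list(range(50))
kept = undersample(frame_ids, rate=10)
pairs = consecutive_pairs(kept)
print(kept)
print(pairs)
```

With a larger rate, each pair spans a longer time interval and thus a larger displacement, which is the effect the undersampling is meant to achieve.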

Runtimes. The training of the APR takes 0.86 s per iteration for 20,585,503 trainable parameters and a batch size of 50 (GoogLeNet [50]) on our hardware setup. In contrast, the training of both the RPR-network and the PE-network is faster (0.065 s per iteration) even at a higher batch size of 100, as these models are smaller (214,605 and 55,239 trainable parameters, respectively). In total, the training of the complete RPR-network takes 90  and that of the PE-network 65  on average for more than 12,000 samples. Hence, it is possible to retrain the PE-network quickly upon environment changes. The inference times of both networks are between 0.25 s and 0.35 s.
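To put the per-iteration times into perspective, the following short calculation converts them into training throughput (samples per second). It only uses the batch sizes and iteration times stated above; the helper name is hypothetical, and the result is a rough comparison that ignores data loading and other overhead.

```python
def samples_per_second(batch_size: int, seconds_per_iteration: float) -> float:
    """Training throughput for a given batch size and time per iteration."""
    if seconds_per_iteration <= 0:
        raise ValueError("seconds_per_iteration must be positive")
    return batch_size / seconds_per_iteration


# Values as stated in the text: APR (batch size 50, 0.86 s per iteration)
# and RPR-/PE-network (batch size 100, 0.065 s per iteration).
apr_throughput = samples_per_second(50, 0.86)
small_throughput = samples_per_second(100, 0.065)

print(f"APR:    {apr_throughput:.1f} samples/s")
print(f"RPR/PE: {small_throughput:.1f} samples/s")
print(f"ratio:  {small_throughput / apr_throughput:.1f}x")
```

Under these numbers, the smaller networks process roughly 26 times more samples per second during training than the APR.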

6 Conclusion

In this paper, we addressed typical challenges of learning-based visual self-localization of a monocular camera. We introduced a novel deep learning architecture that makes use of three modules: an absolute pose regressor, a relative pose regressor (that predicts displacement and rotation from OF), and a final regressor that predicts a 6DoF pose by concatenating the predictions of the two former sub-networks. To show that our novel architecture improves the absolute pose estimates, we evaluated it on a publicly available dataset and on our novel Industry datasets, which enable a more detailed evaluation of different (dynamic) movement patterns, generalization, and scale transitions.

In the future, we will further investigate the impact of the orientation, i.e., quaternions or Euler angles (especially yaw), on the pose accuracy. We will also explore a closed end-to-end architecture to investigate the gain of information through direct concatenation of CNN-encoder-output (APR) and LSTM-output (RPR). Furthermore, instead of a position regression for each camera separately, we will investigate learning a weighted fusion of each of the multiple cameras to gain benefits from diverse viewing angles.


  1. S. Baker, S. Roth, D. Scharstein, M. J. Black, J. P. Lewis and R. Szeliski (2007) ”A Database and Evaluation Methodology for Optical Flow”. In Intl. Conf. on Computer Vision (ICCV), Rio de Janeiro, Brazil, pp. 1–8. Cited by: Figure 2, §4.
  2. V. Balntas, S. Li and V. Prisacariu (2018) ”RelocNet: Continuous Metric Learning Relocalisation Using Neural Nets”. In Europ. Conf. on Computer Vision (ECCV), Cited by: §2.1.
  3. J. Bergen (2004) Visual odometry. Intl. J. of Robotics Research 33 (7), pp. 8–18. Cited by: §2.
  4. E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. M. S. Gumhold and C. Rother (2017) ”DSAC — Differentiable RANSAC for Camera Localization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 2492–2500. Cited by: §1.
  5. E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold and C. Rother (2016) ”Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 3364–3372. Cited by: §1.
  6. S. Brahmbhatt, J. Gu, K. Kim, J. Hays and J. Kautz (2018) ”Geometry-Aware Learning of Maps for Camera Localization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, pp. 2616–2625. Cited by: §2.2.
  7. D. J. Butler, J. Wulff, G. B. Stanley and M. J. Black (2012) ”A Naturalistic Open Source Movie for Optical Flow Evaluation”. In Proc. Europ. Conf. on Computer Vision (ECCV), Florence, Italy, pp. 611–625. Cited by: §4.
  8. C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid and J. J. Leonard (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. Trans. on Robotics 32 (6), pp. 1309–1332. Cited by: §2.
  9. M. Chi, C. Shen and I. Reid (2018) ”A Hybrid Probabilistic Model for Camera Relocalization”. In British Machine Vision Conf. (BMVC), York, UK. Cited by: §1.
  10. R. Clark, S. Wang, A. Markham, N. Trigoni and H. Wen (2017) ”VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization”. In Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 2652–2660. Cited by: §2.1.
  11. G. Costante, M. Mancini, P. Valigi and T. A. Ciarfuglia (2016) ”Exploring Representation Learning With CNNs for Frame-to-Frame Ego-Motion Estimation”. In Robotics and Automation Letters, Boston, MA, pp. 1–12. Cited by: §2.3.
  12. M. Ding, Z. Wang, J. Sun, J. Shi and P. Luo (2019) ”CamNet: Coarse-to-Fine Retrieval for Camera Re-Localization”. In Intl. Conf. on Computer Vision (ICCV), Seoul, South Korea, pp. 2871–2880. Cited by: §2.1.
  13. N. Duong, A. Kacete, C. Sodalie, P. Richard and J. Royan (2018) ”xyzNet: Towards Machine Learning Camera Relocalization by Using a Scene Coordinate Prediction Network”. In Intl. Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Munich, Germany, pp. 258–263. Cited by: §1.
  14. P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers and T. Brox (2015) ”FlowNet: Learning Optical Flow with Convolutional Networks”. In Intl. Conf. on Computer Vision (ICCV), Santiago de Chile, Chile, pp. 2758–2766. Cited by: §4, §5.3.
  15. Y. Gal and Z. Ghahramani (2016) ”Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference”. In arXiv preprint arXiv:1506.02158, Cited by: §2.1.
  16. A. Geiger, P. Lenz and R. Urtasun (2012) ”Are we ready for autonomous driving? The Kitti vision benchmark suite”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Providence, RI, pp. 3354–3361. Cited by: §4.
  17. A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon and S. Izadi (2014) ”Multi-output Learning for Camera Relocalization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, pp. 1114–1121. Cited by: §1.
  18. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy and T. Brox (2017) ”FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 1647–1655. Cited by: §1, §2.3, Figure 4, §3.2, §4, §5.3.
  19. G. Iyer, J. K. Murthy, G. Gupta, K. M. Krishna and L. Paull (2018) ”Geometric Consistency for Self-Supervised End-to-End Visual Odometry”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT. Cited by: §1.
  20. G. Jang, S. Lee and I. Kweon (2002) ”Color landmark based self-localization for indoor mobile robots”. In Intl. Conf. on Robotics and Automation (ICRA), Washington, DC, pp. 1037–1042. Cited by: §1.
  21. A. Kasyanov, F. Engelmann, J. Stückler and B. Leibe (2017) Keyframe-based visual-inertial online slam with relocalization. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, Canada, pp. 6662–6669. Cited by: §2.
  22. H. Kato and M. Billinghurst (1999) Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proc. Intl. Workshop on Augmented Reality (IWAR), San Francisco, CA, pp. 85–94. Cited by: §2.
  23. A. Kendall and R. Cipolla (2016) ”Modelling Uncertainty in Deep Learning for Camera Relocalization”. In Intl. Conf. on Robotics and Automation (ICRA), Stockholm, Sweden, pp. 4762–4769. Cited by: §1, §2.1.
  24. A. Kendall and R. Cipolla (2017) ”Geometric Loss Functions for Camera Pose Regression with Deep Learning”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 6555–6564. Cited by: §2.1, §3.1.
  25. A. Kendall, M. Grimes and R. Cipolla (2015) ”PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”. In Intl. Conf. on Computer Vision (ICCV), Santiago de Chile, Chile, pp. 2938–2946. Cited by: §1, §1, §2.1, §3.1, §4, §5.1, Table 1, §5.
  26. C. Kerl, J. Sturm and D. Cremers (2013) Dense visual SLAM for RGB-d cameras. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Tokyo, Japan, pp. 2100–2106. Cited by: §2.
  27. G. Klein and D. Murray (2007) Parallel tracking and mapping for small AR workspaces. In Proc. Intl. Workshop on Augmented Reality (ISMAR), Nara, Japan, pp. 1–10. Cited by: §2.
  28. R. Kreuzig, M. Ochs and R. Mester (2019) ”DistanceNet: Estimating Traveled Distance From Monocular Images Using a Recurrent Convolutional Neural Network”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA. Cited by: §1.
  29. Z. Laskar, I. Melekhov, S. Kalia and J. Kannala (2017) ”Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network”. In Intl. Conf. on Computer Vision Workshop (ICCVW), Venice, Italy, pp. 920–929. Cited by: §2.1, §4.
  30. S. Lee and J. Song (2007) ”Mobile robot localization using infrared light reflecting landmarks”. In Intl. Conf. Control, Automation and Systems, Seoul, South Korea, pp. 674–677. Cited by: §1.
  31. P. Li, T. Qin, B. Hu, F. Zhu and S. Shen (2017) Monocular visual-inertial state estimation for mobile augmented reality. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, Canada, pp. 11–21. Cited by: §2.
  32. W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang and S. Leutenegger (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv preprint arXiv:1809.00716 18 (17). Cited by: §2.
  33. Y. Lin, Z. Liu, J. Huang, C. Wang, G. Du, J. Bai, S. Lian and B. Huang (2019) ”Deep Global-Relative Networks for End-to-End 6-DoF Visual Localization and Odometry”. In Pacific Rim Intl. Conf. Artificial Intelligence (PRICAI), Yanuca Island, Cuvu, Fiji, pp. 454–467. Cited by: §1, §2.2.
  34. H. Liu, M. Chen, G. Zhang, H. Bao and Y. Bao (2018) Ice-ba: incremental, consistent and efficient bundle adjustment for visual-inertial slam. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, pp. 1974–1982. Cited by: §2.
  35. C. Löffler, S. Riechel, J. Fischer and C. Mutschler (2018) ”An Evaluation Methodology for Inside-Out Indoor Positioning based on Machine Learning”. In Intl. Conf. Indoor Positioning and Indoor Navigation (IPIN), Nantes, France. Cited by: §1, §4.1, Table 1.
  36. W. Maddern, G. Pascoe, C. Linegar and P. Newman (2016) ”1 Year, 1000km: The Oxford RobotCar Dataset”. In Intl. J. of Robotics Research (IJRR), pp. 3–15. Cited by: §4.
  37. S. Mansur, M. Habib, G. N. P. Pratama, A. I. Cahyadi and I. Ardiyanto (2017) Real Time Monocular Visual Odometry using Optical Flow: Study on Navigation of Quadrotor’s UAV. In Intl. Conf. on Science and Technology - Computer (ICST), Yogyakarta, Indonesia, pp. 122–126. Cited by: §2.3.
  38. E. Marchand, H. Uchiyama and F. Spindler (2016) Pose estimation for augmented reality: a hands-on survey. Trans. Visualization and Computer Graphics 22 (12), pp. 2633–2651. Cited by: §2.
  39. I. Melekhov, J. Ylioinas, J. Kannala and E. Rahtu (2017) ”Image-Based Localization Using Hourglass Networks”. In Intl. Conf. on Computer Vision Workshop (ICCVW), Venice, Italy, pp. 870–877. Cited by: §1.
  40. L. Meng, J. Chen, F. Tung, J. J. Little, J. Valentin and C. W. da Silva (2017) ”Backtracking Regression Forests for Accurate Camera Relocalization”. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, BC, pp. 6886–6893. Cited by: §1.
  41. P. Muller and A. Savakis (2017) ”Flowdometry: An Optical Flow and Deep Learning Based Approach to Visual Odometry”. In Winter Conf. on Applications of Computer Vision (WACV), Santa Rosa, CA, pp. 624–631. Cited by: §2.3.
  42. R. Mur-Artal and J. D. Tardos (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-d cameras. Trans. Robotics 33 (5), pp. 1255–1262. Cited by: §2.
  43. T. Naseer and W. Burgard (2017) ”Deep Regression for Monocular Camera-based 6-DoF Global Localization in Outdoor Environments”. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, BC, pp. 1525–1530. Cited by: §2.1.
  44. R. Palmarini, J. A. Erkoyuncu and R. Roy (2017) An innovative process to select augmented reality (AR) technology for maintenance. In Proc. Intl. Conf. on Manufacturing Systems (CIRP), Vol. 59, Taichung, Taiwan, pp. 23–28. Cited by: §2.
  45. M. Patel, B. Emery and Y. Chen (2018) ”ContextualNet: Exploiting Contextual Information using LSTMs to Improve Image-based Localization”. In Intl. Conf. Robotics and Automation (ICRA), Cited by: §2.1, §2.1, §5.4.
  46. N. Radwan, A. Valada and W. Burgard (2018) ”VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry”. Robotics and Automation Letters 3 (4), pp. 4407–4414. Cited by: §1, §2.2, §4.
  47. T. Sattler, Q. Zhou, M. Pollefeys and L. Leal-Taixé (2019) ”Understanding the Limitations of CNN-based Absolute Camera Pose Regression”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, pp. 3302–3312. Cited by: §1.
  48. S. Seifi and T. Tuytelaars (2019) ”How to improve CNN-based 6-DoF Camera Pose Estimation”. In Intl. Conf. on Computer Vision (ICCV), Cited by: §2.1, §2.1, §2, §5.4, §5.4.
  49. J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi and A. Fitzgibbon (2013) ”Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Portland, OR, pp. 2930–2937. Cited by: §1, §1, §2.1, §4, §5.2, §5.3, Table 1.
  50. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) ”Going deeper with convolutions”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 1–9. Cited by: §3.1, §5.4.
  51. T. Taketomi, H. Uchiyama and S. Ikeda (2017) Visual SLAM algorithms: a survey from 2010 to 2016. Trans. Computer Vision and Applications 9 (1), pp. 452–461. Cited by: §2.
  52. T. Terashima and O. Hasegawa (2017) A visual-SLAM for first person vision and mobile robots. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, Canada, pp. 73–76. Cited by: §2.
  53. A. Valada, N. Radwan and W. Burgard (2018) ”Deep Auxiliary Learning for Visual Localization and Odometry”. In Intl. Conf. on Robotics and Automation (ICRA), Brisbane, Australia, pp. 6939–6946. Cited by: §1, §2.2.
  54. J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi and P. Torr (2015) ”Exploiting uncertainty in regression forests for accurate camera relocalization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 4400–4408. Cited by: §1.
  55. R. Vassallo, A. Rankin, E. C. S. Chen and T. M. Peters (2017) Hologram stability evaluation for microsoft (r) hololens tm. In Intl. Conf. on Robotics and Automation (ICRA), Marina Bay Sands, Singapore, pp. 3–14. Cited by: §2.
  56. F. Walch, D. Cremers, S. Hilsenbeck, C. Hazirbas and L. Leal-Taixé (2016) ”Deep Learning for Image-Based Localization”. Master’s Thesis, Technische Universität München, Department of Informatics, Semantic Scholar, Munich, Germany. Cited by: §2.1.
  57. F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck and D. Cremers (2017) ”Image-Based Localization Using LSTMs for Structured Feature Correlation”. In Intl. Conf. on Computer Vision (ICCV), Venice, Italy, pp. 627–637. Cited by: §1, §1, §2.1, §2.1, §3.2, §4, §5.1, §5.4, Table 1, §5.
  58. S. Wang, R. Clark, H. Wen and N. Trigoni (2017) ”DeepVO: Towards end-to-end Visual Odometry with deep Recurrent Convolutional Neural Networks”. In Intl. Conf. on Robotics and Automation (ICRA), Singapore, Singapore, pp. 2043–2050. Cited by: §2.1, §2.1, §5.4.
  59. Q. Zhao, F. Li and X. Liu (2018) ”Real-time Visual Odometry based on Optical Flow and Depth Learning”. In Intl. Conf. on Measuring Technology and Mechatronics Automation (ICMTMA), Changsha, China, pp. 239–242. Cited by: §2.3.