ViPR: Visual-Odometry-aided Pose Regression for 6DoF Camera Localization
Visual Odometry (VO) accumulates a positional drift in long-term robot navigation tasks. Although Convolutional Neural Networks (CNNs) improve VO in various aspects, VO still suffers from moving obstacles, discontinuous observation of features, and poor textures or visual information. While recent approaches estimate a 6DoF pose either directly from (a series of) images or by merging depth maps with the optical flow (OF), research that combines absolute pose regression with OF is limited.
We propose ViPR, a novel architecture for long-term 6DoF VO that leverages synergies between absolute pose estimates (from PoseNet-like architectures) and relative pose estimates (from FlowNet-based architectures) by combining both through recurrent layers. Experiments with known publicly available datasets and with our own Industry dataset show that our novel design outperforms existing techniques in long-term navigation tasks.
Real-time tracking of mobile objects, e.g., forklifts, trucks, or workers in industrial areas, makes it possible to monitor and optimize workflows, enables zoning rules for safety, and tracks goods for automated inventory management. Such environments typically comprise large warehouses or factory buildings, and localization solutions often combine radio-, LiDAR-, or radar-based systems.
However, these solutions often require infrastructure or are costly to operate. An alternative is (mobile) optical pose estimation based on ego-motion. Such approaches are usually based on SLAM (Simultaneous Localization and Mapping), meet the requirements of exact real-time localization, and are cost-efficient.
Available pose estimation approaches fall into three groups: classical, hybrid, and deep learning (DL)-based methods. Classical methods often require an infrastructure that includes either synthetic (i.e., installed in the environment) or natural (e.g., walls and edges) markers. The accuracy of the pose estimation depends to a large extent on suitable invariance properties of the available features such that they can be reliably recognized. However, reliable feature detection demands considerable computing time [30, 20]. Additional sensors (e.g., inertial sensors, depth cameras) or additional context (e.g., 3D models of the environment, prerecorded landmark databases) may increase the accuracy, but also increase system complexity and costs. Hybrid methods [49, 5, 4, 17, 54] combine geometric and machine learning (ML) approaches. For instance, ML predicts the 3D position of each pixel in world coordinates, from which geometry-based methods infer the camera pose.
Recent methods that exploit DL to partly address the above-mentioned issues of complexity and cost also aim for high positioning accuracy; e.g., regression forests learn a mapping of images to positions based on 3D models of the environment. Absolute pose regression (APR) uses DL as a cascade of convolution operators to learn poses only from 2D images. PoseNet, as an initial attempt, has been successfully extended by Bayesian approaches, long short-term memories (LSTMs), and others [39, 19, 28, 9]. Recent approaches to APR such as VLocNet [53, 46] and DGRNets introduce relative pose regression (RPR) to address the APR task. While APR needs to be trained for a particular scene, RPR may be trained for multiple scenes. However, RPR alone does not solve the navigation task.
For applications such as indoor positioning, existing approaches are not yet mature enough, i.e., in terms of robustness and accuracy, to handle real-world challenges such as changing environment geometries, lighting conditions, and camera (motion) artifacts. Hence, this paper proposes a 6DoF pose estimator that combines PoseNet-like absolute pose estimates with relative camera movements predicted by RNNs from the flow of image pixels between successive images (computed by FlowNet2.0), and captures time dependencies of the camera movement in recurrent layers, see Fig. 1. Our model reduces the positioning error using this multitasking approach, which learns both absolute poses from monocular (2D) images and relative motion for the task of estimating visual odometry (VO).
We evaluate our approach first on the small-scale 7-Scenes dataset. As other datasets are unsuitable for evaluating continuous navigation tasks, we also release a dataset that covers various problems arising in real industrial scenarios, such as inconsistent lighting, occlusion, and dynamic environments. We benchmark our approach on both datasets against existing approaches [25, 57] and show that we consistently outperform the accuracy of their pose estimates.
2 Related Work
SLAM-driven 3D point registration methods enable precise self-localization even in unknown environments. Although VO has made remarkable progress over the last decade, it still suffers greatly from scaling errors between real and estimated maps [34, 51, 38, 21, 26, 27, 32, 42, 3, 31]. With more computing power, Visual-Inertial SLAM (VISLAM) combines VO with Inertial Measurement Unit (IMU) sensors to partly resolve the scale ambiguity, to provide motion cues without visual features [34, 52, 21], to process more features, and to make tracking more robust [51, 26]. However, recent SLAM methods do not yet provide industry-strength accuracy and reliability, as they need undamaged, clean, and undisguised markers [31, 22] and still suffer from limited long-term stability and from the effects of movement, sudden acceleration, and occlusion.
VO primarily addresses the problem of separating ego- from feature-motion and suffers from area constraints, poorly textured environments, scale drift, the lack of an initial position, and thus inconsistent camera trajectories. In contrast, PoseNet-like architectures (see Sec. 2.1) that estimate absolute poses on single-shot images are more robust, less compute-intensive, and can be trained in advance on application data. Unlike VO, they do not suffer from a lack of initial poses and do not require access to camera parameters, good initialization, or handcrafted features. Although the joint estimation of relative poses may help to increase accuracy (see Sec. 2.2), such hybrid approaches still suffer from dynamic environments, as they are often trained offline in quasi-rigid environments. While optical flow (see Sec. 2.3) addresses these challenges, it has not yet been combined with APR for 6DoF self-localization.
2.1 Absolute Pose Regression (APR)
Methods that derive a 6DoF pose directly from images have been studied for decades. Consequently, there are many classical methods whose complex components are now replaced by machine learning (ML) or deep learning (DL). For instance, RelocNet learns metrics continuously from global image features through a camera-frustum overlap loss. CamNet is a coarse (image-based)-to-fine (pose-based) retrieval-based model that includes relative pose regression to get close to the best database entry containing extracted image features. NNet queries a database for similar images to predict the relative poses between images, and a RANSAC solves the triangulation to provide a position. While these classical approaches have already been extended with DL components, their pipelines are expensive, as they embed feature matching and projection and/or manage a database. Most recent (and simpler) DL-based methods also outperform their accuracy.
The key idea of PoseNet and its variants [24, 23, 15, 57, 56, 58, 45, 48, 43, 49], among others such as BranchNet and Hourglass, is to use a CNN for camera (re-)localization. PoseNet works with scene elements of different scales and is partially insensitive to light changes, occlusions, and motion blur. Dense PoseNet crops subimages, PoseNet2 jointly learns network and loss-function parameters, and Cipolla et al. link a Bernoulli function with variational inference to improve the positioning accuracy. However, these variants work with single images and hence do not use the temporal context (available in continuous navigation tasks) that could help to increase accuracy.
In addition to PoseNet+LSTM, there are similar approaches that exploit the time context inherently given by consecutive images (i.e., DeepVO, ContextualNet, and VidLoc). The key idea is to identify temporal connections between the feature vectors (extracted from images) with LSTM units and to track only those feature correlations that contribute most to the pose estimation. However, there are hardly any long-term dependencies between successive images, and LSTMs therefore yield accuracy that is worse than or equal to, e.g., simple averaging over successively estimated poses. Instead, we combine estimated poses from time-distributed CNNs with estimates of the optical flow to maintain the required temporal context in the features of image series.
2.2 Relative Pose Regression (RPR)
In addition to approaches that derive a 6DoF pose directly from an image, there are hybrid methods that combine them with VO to increase accuracy. VLocNet is closely related to our approach, as it estimates a global pose and combines it with VO (but does not use OF). To further improve the (re-)localization accuracy, VLocNet++ uses features from a semantic segmentation. In contrast, we use different networks and do not need to share weights between VO and the global pose estimation. DGRNets estimates both absolute and relative poses, concatenates them, and uses recurrent CNNs to extract temporal relations between consecutive images. This is similar to our approach, but we estimate the relative motion from OF, which allows us to train in advance on large datasets and makes the model more robust. MapNet learns a map representation from input data, combines it with GPS, inertial data, and unlabeled images, and uses pose graph optimization (PGO) to combine absolute and relative pose predictions. However, unlike all other methods, its most accurate extension, MapNet+PGO, does not work on purely visual information but exploits additional sensors.
2.3 Optical Flow
Typically, VO uses OF to extract features from image sequences. Motion fields and OF-images, see Fig. 2, are used to estimate the trajectories of pixels in a series of images. For instance, Flowdometry estimates displacements and rotations from OF input. Mansur et al. proposed a VO-based dead-reckoning system that uses OF to match features. Zhao et al. combined two CNNs to estimate the VO motion: FlowNet2-ss estimates the OF and PCNN links two images to process global and local pose information. However, to the best of our knowledge, we are the first to propose an OF-based architecture that estimates the relative camera movement through RNNs from the optical flow (FlowNet2.0).
3 Proposed Model
After a preprocessing step that crops fixed-size subimages from a sequence of four images, our pose regression pipeline consists of three parts (see Fig. 4): an absolute pose regression (APR) network, a relative pose regression (RPR) network, and a 6DoF pose estimation (PE) network. PE uses the outputs of the APR- and RPR-networks to provide the final 6DoF pose.
3.1 Absolute Pose Regression (APR) Network
Our APR-network predicts the 6DoF camera pose from three input images. It is based on the original PoseNet model (i.e., essentially a modified GoogLeNet with a regression head instead of a softmax) and predicts absolute positions in Euclidean space and absolute orientations as quaternions. From a single monocular image I, the model predicts the pose (p̂, q̂) as an approximation of the actual position p and orientation q.
As the original model learns the image context based on the shape and appearance of the environment, but does not exploit the time context and relation between consecutive images, we adapted the model to a time-distributed variant. Hence, instead of a single image, the new model receives three (consecutive) input images, see Fig. 3, and uses three separate dense layers (one per pose) with 2,048 neurons each, each of which yields a pose. The middle of the three poses yields the most accurate estimate for the corresponding image.
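The per-frame heads described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the backbone feature size, the random stand-in features, and the weight initialization are our assumptions; only the three separate 2,048-unit dense layers and the 7D pose output (3D position plus 4D quaternion) follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, W, b):
    # Fully connected layer with ReLU activation
    return np.maximum(x @ W + b, 0.0)

def apr_heads(features, params):
    """One separate 2,048-unit dense layer plus a linear pose regressor
    per frame. features: (3, C) -- one backbone feature vector per image.
    Returns three 7D poses (xyz position + wxyz quaternion)."""
    poses = []
    for f, (W1, b1, W2, b2) in zip(features, params):
        h = dense_relu(f, W1, b1)      # 2,048-neuron dense layer
        poses.append(h @ W2 + b2)      # linear regression to 7 pose values
    return np.stack(poses)

C = 1024                               # assumed backbone feature size
params = [(rng.standard_normal((C, 2048)) * 0.01, np.zeros(2048),
           rng.standard_normal((2048, 7)) * 0.01, np.zeros(7))
          for _ in range(3)]
feats = rng.standard_normal((3, C))    # stand-in features of 3 consecutive frames
poses = apr_heads(feats, params)       # shape (3, 7)
middle_pose = poses[1]                 # the middle pose is the most accurate one
```

In a real model the stand-in features would come from a time-distributed GoogLeNet-style encoder applied to each of the three images.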
3.2 Relative Pose Regression (RPR) Network
As displacements of similar length but from different camera viewing directions result in different OFs, the displacement and rotation of the camera between pairwise images must be expressed relative to the camera's viewing direction of the first image. Therefore, we transform the global coordinates into the local coordinate system of the first camera using its rotation matrix R; the relative displacement is the difference between the transformed coordinates. The displacement in global coordinates is then obtained by the inverse transformation of the predicted displacement.
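This transformation can be written down directly. The following NumPy sketch (with our notation: q as a unit quaternion (w, x, y, z) and R(q) its rotation matrix) computes the relative displacement in the first camera's local frame and the back-transformation to global coordinates:

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def relative_displacement(p1, q1, p2):
    """Displacement p2 - p1 expressed in the local frame of camera 1:
    dp_local = R(q1)^T (p2 - p1)."""
    return quat_to_rot(q1).T @ (np.asarray(p2) - np.asarray(p1))

def to_global(dp_local, q1):
    """Back-transformation of a (predicted) local displacement:
    dp_global = R(q1) dp_local."""
    return quat_to_rot(q1) @ np.asarray(dp_local)
```

For example, with camera 1 rotated 90° about the vertical axis, a global displacement of 1 m along x appears as a purely lateral displacement in the local frame, and `to_global` recovers the global displacement exactly.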
Fig. 5 shows the structure of the RPR-network. Similar to the APR-network, the RPR-network also uses a stack of images, i.e., three OF-fields computed from the four input images at consecutive timesteps, to include more time context.
In a preliminary study, we found that our recurrent units struggle to remember temporal features when the direct input of the OF is too large (at its raw size in px). This is in line with findings from Walch et al. . Hence, we split the OF into zones and compute the mean value per zone for both the x- and y-direction, and concatenate the zone means into a much smaller input vector. The LSTM-output is forwarded to 2 FC-layers that regress both the displacement and the rotation.
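The zone-wise downsampling can be sketched as follows; the zone counts here are illustrative assumptions, as the paper's exact sizes are not reproduced above:

```python
import numpy as np

def zone_mean_flow(flow, zones=(8, 8)):
    """Downsample a dense optical-flow field of shape (H, W, 2) by averaging
    the x- and y-flow components over a grid of zones, then flatten the
    result into the vector fed to the recurrent layers."""
    h, w, _ = flow.shape
    zh, zw = zones
    out = np.empty((zh, zw, 2), dtype=np.float32)
    for i in range(zh):
        for j in range(zw):
            patch = flow[i * h // zh:(i + 1) * h // zh,
                         j * w // zw:(j + 1) * w // zw]
            out[i, j] = patch.mean(axis=(0, 1))  # mean x- and y-flow per zone
    return out.reshape(-1)
```

With an 8×8 grid this reduces a full-resolution flow field to a 128-value vector (8 × 8 zones × 2 directions), small enough for the LSTM to track over time.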
The 2 FC-layers use a loss function to predict the relative transposed poses: the first term penalizes the deviation of the predicted (and transformed) displacement from the ground-truth displacement with a norm, and the second term quantifies the error of the predicted rotation w.r.t. the normalized ground-truth rotation, also with a norm. Both terms are weighted by hyperparameters. A preliminary grid search with the first weight fixed revealed that the optimal value of the second depends on the scaling of the environment.
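Since the loss itself is not reproduced above, a plausible reconstruction in the style of PoseNet-type losses is the following, with Δp and Δq the ground-truth relative displacement and rotation, hatted quantities the predictions, and α, β the two weighting hyperparameters (the symbol names and norm orders are our assumptions):

```latex
\mathcal{L}_{\mathrm{RPR}}
  = \alpha \,\bigl\lVert \Delta\hat{\mathbf{p}} - \Delta\mathbf{p} \bigr\rVert
  + \beta  \,\Bigl\lVert \Delta\hat{\mathbf{q}} - \tfrac{\Delta\mathbf{q}}{\lVert \Delta\mathbf{q} \rVert} \Bigr\rVert
```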
3.3 6DoF Pose Estimation (PE) Network
Our PE-network predicts absolute 6DoF poses from the outputs of both the APR- and RPR-networks, see Fig. 6. It takes as input the absolute position, the absolute orientation, the relative displacement, and the rotation change. As we feed poses from three sequential timesteps as input, the model is implicitly time-distributed. The 2 stacked LSTM-layers and the 2 FC-layers return a 3DoF absolute position and a 3DoF orientation using a weighted-norm loss.
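Analogously to the RPR loss, the PE loss can be reconstructed as follows, with p and q the ground-truth absolute position and orientation, hatted quantities the predictions, and α, β the weighting hyperparameters (again, the symbol names and norm orders are our assumptions, as the values are elided in the text):

```latex
\mathcal{L}_{\mathrm{PE}}
  = \alpha \,\bigl\lVert \hat{\mathbf{p}} - \mathbf{p} \bigr\rVert
  + \beta  \,\Bigl\lVert \hat{\mathbf{q}} - \tfrac{\mathbf{q}}{\lVert \mathbf{q} \rVert} \Bigr\rVert
```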
Again, in a preliminary grid search with one loss weight fixed, we determined an optimal value for the other.
4 Evaluation Datasets
To train our network we need two different types of image data: (1) images annotated with their absolute poses for the APR-network, and (2) images of optical flow, annotated with their relative poses for the RPR-network.
Datasets to evaluate APR. Publicly available datasets for absolute pose regression (Cambridge Landmarks and TUM-LSI) either lack accurate ground-truth labels or the proximity between consecutive images is too large to embed meaningful temporal context, which is required for temporal networks such as ViPR. 7-Scenes only embeds small-scale scenes and hence only enables small scene-wise evaluations. The Aalto University, Oxford RobotCar, and DeepLoc datasets solve the small-scale issue, but are rarely used to evaluate state-of-the-art techniques. Hence, to compare ViPR with recent techniques we can only use parts of the 7-Scenes dataset, and thus had to record the Industry dataset (see Sec. 4.1), which embeds three different scenarios. These datasets allow a comprehensive and detailed evaluation with different movement patterns (such as slow motion and fast rotation).
Datasets to evaluate RPR. To evaluate the performance of the RPR-network and its contribution to ViPR, we also need a dataset with close proximity between consecutive images, which is key to calculating the relative movement from optical flow. However, most publicly available datasets (Middlebury, MPI Sintel, KITTI Vision Benchmark, and FlyingChairs) either do not meet this requirement or their OF pixel velocities do not match those of real-world applications. Hence, we cannot use available datasets to train the RPR-network. Instead, we directly calculate the OF from images with FlowNet2.0 and train our RPR-network on it. Our novel Industry dataset allows this while simultaneously retaining a large, diverse environment with hard real-world conditions, as described in the following.
4.1 Industry Dataset
We designed the Industry dataset to suit the requirements of both the APR- and the RPR-network. It is composed of three scenarios with multiple trajectories recorded at large scale using a high-precision laser-based reference system. Each scenario presents different challenges (such as dynamic ego-motion with motion blur), various environmental characteristics (such as different geometric scales and light changes, i.e., artificial and natural light), and ambiguously structured elements, see Fig. 7.
Industry Scenario #1 has been recorded with 8 cameras (approx. field-of-view (FoV) each) mounted on a stable apparatus that covers the surroundings with overlaps and has been moved automatically at a constant velocity of approx. 0.3 m/s. The cameras are mounted at a height of 1.7 m. The scenario contains 521,256 images ( px) and densely covers an area of 1,320 m². The environment imitates a typical warehouse scenario under realistic conditions. Besides well-structured elements such as high-level racks with goods, there are also very ambiguous and homogeneously textured elements (e.g., blank white or dark black walls). Both natural and artificial light illuminate volatile structures such as mobile work benches. While the training dataset is composed of horizontal and vertical zig-zag movements of the apparatus, the test datasets' movements vary to cover different properties for a detailed evaluation, e.g., different environmental scalings (i.e., scale transition, cross, large scale, and small scale), network generalization (i.e., generalize open, generalize racks, and cross), fast rotations (i.e., motion artifacts, recorded on a forklift at 2.26 m height), and volatile objects (i.e., volatility).
Industry Scenario #2 uses three cameras (with overlaps) on the same apparatus at the same height. The recorded 11,859 training images ( px) represent a horizontal zig-zag movement (see Fig. 8(a)) and the 3,096 test images represent a diagonal movement (see Fig. 8(b)). Compared to Scenario #1, this scenario has more variation in its velocities (between 0 and 0.3 m/s, SD 0.05 m/s).
Industry Scenario #3 uses four cameras (with overlaps) on a forklift truck at a height of 2.26 m. Both the training and test datasets represent camera movements at varying, faster, and dynamic speeds (between 0 and 1.5 m/s, SD 0.51 m/s), which makes this the most challenging scenario. The training trajectory (see Fig. 8(c)) consists of 4,166 images and the test trajectory (see Fig. 8(d)) of 1,687 images. In contrast to Scenarios #1 and #2, we train and test on the dynamic movements of a forklift truck in a typical industry scenario. However, one camera's images were corrupted in the test dataset and were thus not used in the evaluation.
5 Experimental Results
Table 1: Median position error [m] and orientation error [°]; "orig." denotes originally reported results, "ours" our parameterization. The last column reports ViPR's position improvement over APR-only.

| Dataset / Scene | PoseNet Pos. [m] (orig./ours) | PoseNet Ori. [°] (orig./ours) | PoseNet+LSTM Pos. [m] (orig.) | PoseNet+LSTM Ori. [°] (orig.) | APR-only Pos. [m] | APR-only Ori. [°] | APR+LSTM Pos. [m] | APR+LSTM Ori. [°] | ViPR Pos. [m] | ViPR Ori. [°] | Impr. [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7-Scenes chess | 0.32 / 0.24 | 4.06 / 7.79 | 0.24 | 5.77 | 0.23 | 7.96 | 0.27 | 9.66 | 0.22 | 7.89 | +1.74 |
| 7-Scenes heads | 0.29 / 0.21 | 6.00 / 16.46 | 0.21 | 13.7 | 0.22 | 16.48 | 0.23 | 16.91 | 0.21 | 16.41 | +3.64 |
| 7-Scenes office | 0.48 / 0.33 | 3.84 / 10.08 | 0.30 | 8.08 | 0.36 | 10.11 | 0.37 | 10.83 | 0.35 | 9.59 | +4.01 |
| 7-Scenes stairs | 0.47 / 0.36 | 6.93 / 13.69 | 0.40 | 13.7 | 0.31 | 12.49 | 0.42 | 13.50 | 0.31 | 12.65 | +0.46 |
| 7-Scenes total | 0.39 / 0.29 | 5.21 / 12.00 | 0.29 | 10.3 | 0.28 | 11.76 | 0.32 | 12.73 | 0.27 | 11.63 | +2.46 |
| Scen. 1 cross | -- / 1.15 | -- / 0.75 | -- | -- | 0.61 | 0.53 | 4.42 | 0.21 | 0.46 | 0.60 | +25.31 |
| Scen. 1 gener. open | -- / 1.94 | -- / 11.73 | -- | -- | 1.68 | 11.07 | 3.36 | 2.95 | 1.48 | 10.86 | +11.75 |
| Scen. 1 gener. racks | -- / 3.48 | -- / 6.01 | -- | -- | 2.48 | 1.53 | 3.90 | 0.61 | 2.38 | 1.95 | +4.03 |
| Scen. 1 large scale | -- / 2.32 | -- / 6.37 | -- | -- | 2.37 | 9.82 | 4.99 | 1.61 | 2.12 | 8.64 | +10.68 |
| Scen. 1 motion art. | -- / 7.43 | -- / 124.94 | -- | -- | 7.48 | 131.30 | 8.18 | 139.37 | 6.73 | 136.6 | +10.01 |
| Scen. 1 scale trans. | -- / 2.17 | -- / 3.03 | -- | -- | 1.94 | 6.46 | 5.63 | 0.58 | 1.64 | 6.29 | +15.52 |
| Scen. 1 small scale | -- / 3.78 | -- / 9.18 | -- | -- | 4.09 | 20.75 | 4.46 | 6.06 | 3.50 | 15.74 | +14.41 |
| Scen. 1 volatility | -- / 2.68 | -- / 78.52 | -- | -- | 2.09 | 77.68 | 4.16 | 78.73 | 1.96 | 77.54 | +6.41 |
| Scen. 1 total | -- / 3.12 | -- / 30.07 | -- | -- | 2.82 | 32.30 | 4.89 | 28.76 | 2.53 | 32.28 | +12.27 |
| Scen. 2 cam #0 | -- / 0.49 | -- / 0.21 | -- | -- | 0.22 | 0.29 | 1.49 | 0.14 | 0.16 | 3.37 | +26.24 |
| Scen. 2 cam #1 | -- / 0.15 | -- / 0.38 | -- | -- | 0.23 | 0.35 | 2.68 | 0.17 | 0.12 | 2.75 | +46.49 |
| Scen. 2 cam #2 | -- / 0.43 | -- / 0.19 | -- | -- | 0.37 | 0.13 | 0.90 | 0.15 | 0.30 | 1.84 | +17.87 |
| Scen. 2 total | -- / 0.36 | -- / 0.26 | -- | -- | 0.27 | 0.26 | 1.69 | 0.15 | 0.20 | 2.65 | +30.20 |
| Scen. 3 cam #0 | -- / 0.41 | -- / 1.00 | -- | -- | 0.34 | 1.26 | 0.72 | 1.31 | 0.27 | 1.43 | +20.64 |
| Scen. 3 cam #1 | -- / 0.32 | -- / 1.07 | -- | -- | 0.26 | 1.11 | 0.88 | 1.27 | 0.21 | 1.06 | +20.13 |
| Scen. 3 cam #2 | -- / 0.32 | -- / 1.60 | -- | -- | 0.36 | 1.62 | 0.72 | 1.74 | 0.32 | 1.38 | +11.47 |
| Scen. 3 total | -- / 0.35 | -- / 1.22 | -- | -- | 0.32 | 1.33 | 0.77 | 1.44 | 0.27 | 1.29 | +17.41 |
To compare ViPR with originally reported results of state-of-the-art, we first briefly describe our parameterization of PoseNet  and PoseNet+LSTM  in Section 5.1. Next, Section 5.2 presents our results. We highlight the performance of ViPR’s sub-networks (APR, APR+LSTM) individually, and investigate both the impact of RPR and PE on the final pose estimation accuracy of ViPR. Section 5.3 shows results of the RPR-network. Finally, we discuss general findings and show runtimes of our models in Section 5.4.
For all experiments we used an AMD Ryzen 7 2700 CPU (3.2 GHz) equipped with two NVidia GeForce RTX 2070 GPUs with 8 GB GDDR6 VRAM each. Table 1 shows the median error of the position in m and of the orientation in degrees. The last column reports the improvement in position accuracy of ViPR (in %) over APR-only.
As a baseline we report the initially described results on 7-Scenes of PoseNet and PoseNet+LSTM. We further reimplemented the initial variant of PoseNet and trained it from scratch with adapted loss weights (thus optimizing for positional accuracy at the expense of orientation accuracy). Tab. 1 shows our implementation's results next to the initially reported ones (on 7-Scenes). As expected, the results of the two PoseNet implementations differ due to the changed loss weights in our implementation.
5.2 Evaluation of the ViPR-Network
In the following, we evaluate our method in multiple scenarios, each with distinct challenges for the pose estimation task. 7-Scenes focuses on difficult motion-blur conditions of typical human motion. We then use the Industry Scenario #1 to investigate various challenges at a larger scale, but with mostly constant velocities. Industry Scenarios #2 and #3 then focus on dynamic, fast ego-motion of a moving forklift truck at large scale.
7-Scenes. For both architectures (PoseNet and ViPR), we optimized the weighting between position and orientation such that it yields the smallest total median error. Both APR-only and ViPR return a slightly lower position error (0.28 m and 0.27 m) than PoseNet+LSTM (0.29 m). ViPR yields an average improvement of the position accuracy of 2.46% even in strong motion-blur situations. The results indicate that ViPR relies on a plausible optical-flow component to achieve performance superior to the baseline. In situations of negligible motion between frames, the median only improves by 0.02 m. However, the average accuracy gain still shows that ViPR performs on par with or better than the baselines.
Stable motion evaluation. For the Industry Scenario #1 dataset, we train the models on the zig-zag trajectories and test them on specific sub-trajectories with individual challenges, but at almost constant velocity. In total, ViPR improves the position accuracy by 12.27% on average (min.: 4.03%; max.: 25.31%), while the orientation error is similar for most of the architectures and test sets.
In environments with volatile features, i.e., objects that are only present in the test dataset, we found that ViPR (with optical flow) is significantly (6.41%) better than APR-only. However, the high angular error of about 78° indicates an irrecoverable degeneration of the APR-part. From tests with different scalings of the environment, we conclude that ViPR learns an interpretation of relative and absolute position regression that works both in small and large proximity to environmental features, as ViPR improves by 15.52% (scale trans.), 14.41% (small scale), and 10.68% (large scale). When the test trajectories are located within areas that embed only few or no training samples (gener. racks and open), ViPR still improves over other methods by 4.03% to 11.75%. The highly dynamic test on a forklift truck (motion artifacts) is exceptional here, as only the test dataset contains dynamics and blur, and hence challenges ViPR most. However, ViPR still improves by 10.01% over APR-only, despite the absolute novelty of the data's dynamics.
In summary, ViPR decreases the position median significantly, to about 2.53 m, compared to 4.89 m for APR+LSTM. This and the other findings are strong indicators that the relative component RPR significantly supports the final pose estimation of ViPR.
Industry Scenario #2 is designed to evaluate unknown trajectories. Hence, training trajectories represent an orthogonal grid, and test trajectories are diagonal. In total, ViPR improves the position accuracy by 30.20% on average (min.: 17.87%; max.: 46.49%). Surprisingly, the orientation error is comparable for all architectures except ViPR. We think this is because ViPR learns to optimize its position based on the APR- and RPR-orientations, and hence exploits these orientations to improve its position estimate, which we prioritized in the loss function. APR-only yields an average position error of 0.27 m and the pure PoseNet of 0.36 m, while APR+LSTM results in an even worse 1.69 m. Instead, the novel ViPR outperforms all of them significantly with 0.20 m. Compared to our APR+LSTM approach, we think that ViPR on the one hand interprets and compensates the (long-term) drift of RPR, and on the other hand smooths the short-term errors of APR, as PE counteracts the accumulation of RPR's scaling errors with APR's absolute estimates. Here, the synergies of the networks in ViPR are particularly effective. This is also visualized in Fig. 9: the green (ViPR) trajectory aligns more smoothly with the blue baseline when the movement direction changes. This may also indicate that the RPR component is necessary for ViPR to generalize to unknown trajectories and to compensate scaling errors.
Dynamic motion evaluation. In contrast to the other datasets, the Industry Scenario #3 includes fast, large-scale, and highly dynamic ego-motion in both training and test datasets. However, all estimators yield similar findings as in Scenario #2, as both scenarios embed motion dynamics and unknown trajectory shapes. Accordingly, ViPR again improves the position accuracy by 17.41% on average (min.: 11.47%; max.: 20.64%), but this time exhibits very similar orientation errors. The improved orientation accuracy compared to Scenario #2 is likely due to the more diverse orientations available in this dataset's training.
Fig. 9 shows exemplary results that visualize how ViPR handles motion changes and motion dynamics in particular (see the abrupt direction changes). The results also indicate that ViPR predicts the smoothest and most accurate trajectories on unknown trajectory shapes (compare the highlighted trajectory segments). We think the reason why ViPR significantly outperforms APR by 17.41% here is the synergy of APR, RPR, and PE: RPR contributes the most in fast motion changes, respectively in situations of motion blur. The success of RPR may also indicate that RPR differentiates between ego- and feature-motion to estimate a pose more robustly.
5.3 Evaluation of the RPR-Network
We use the smaller FlowNet2-s variant of FlowNet2.0, as it has a faster runtime (140 Hz), pretrained on the FlyingChairs, ChairsSDHom, and FlyingThings3D datasets. To highlight that RPR contributes to the accuracy of the final pose estimates of ViPR, we explicitly test it on the Industry Scenario #3, which embeds both dynamic ego- and feature-motion. Here, the relative proximity between consecutive samples is up to 20 , see Fig. 10. This results in a median error of 2.49 in x- and 4.09 in y-direction on average. This shows that the RPR achieves meaningful results for relative position regression in a highly dynamic and therefore difficult setting. It furthermore appears to be relatively robust in its predictions despite both ego- and feature-motion. Further tests also showed comparable results for 7-Scenes.
6DoF Pose Regression with LSTMs. APR increases the positional accuracy over PoseNet for all datasets, see the PoseNet and APR-only columns of Table 1. However, we found that the position errors increase by 0.04 m, 2.07 m, 1.42 m, and 0.45 m on the 7-Scenes and Industry datasets when we use state-of-the-art methods with independent, single-layer LSTM-extensions [57, 58, 45, 48]. This motivated us to investigate stacked LSTM-layers only for the RPR- and PE-networks.
We support the statement of Seifi et al. that the motion between consecutive frames is too small for naive CNNs to embed, and hence additionally connected LSTMs are also unable to discover and track meaningful temporal and contextual relations between the features. Therefore, we undersampled the image sequences with rates of 5, 10, 15, 20, ..., 60 to embed longer temporal relations between consecutive frames, such that both the CNN and the LSTM can identify and track temporal causalities and relations. Note that the minimal undersampling rate depends on the minimal velocity of the ego-motion. We found that a sampling rate of 10 at a minimal velocity of 0.1 m/s yields the highest gain.
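The undersampling itself is straightforward; a minimal sketch (function names are ours, and the window length matches the four-image input sequences of the pipeline):

```python
def undersample(frames, rate=10):
    """Keep every `rate`-th frame to enlarge the temporal baseline
    between consecutive samples."""
    return frames[::rate]

def make_sequences(frames, seq_len=4, rate=10):
    """Slide a window of `seq_len` consecutive undersampled frames,
    producing the input sequences for the network."""
    f = undersample(frames, rate)
    return [f[i:i + seq_len] for i in range(len(f) - seq_len + 1)]
```

For example, 100 frames at rate 10 yield 10 undersampled frames and hence 7 overlapping four-frame sequences.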
Runtimes. Training the APR takes 0.86 s per iteration for 20,585,503 trainable parameters at a batch size of 50 (GoogLeNet) on our hardware setup. Training the RPR- and PE-networks is faster (0.065 s per iteration) even at a larger batch size of 100, as these models are smaller (214,605 resp. 55,239 trainable parameters). In total, the training of the complete RPR-network takes 90 and that of the PE-network 65 on average for more than 12,000 samples. Hence, it is possible to retrain the PE-network quickly upon environment changes. The inference time of both networks is between 0.25 s and 0.35 s.
In this paper, we addressed typical challenges of learning-based visual self-localization with a monocular camera. We introduced a novel deep learning architecture that makes use of three modules: an absolute pose regressor, a relative pose regressor (which predicts displacement and rotation from OF), and a final regressor that predicts a 6DoF pose by concatenating the predictions of the two former sub-networks. To show that our novel architecture improves the absolute pose estimates, we compared it against state-of-the-art methods on a publicly available dataset and proposed the novel Industry dataset, which enables a more detailed evaluation of different (dynamic) movement patterns, generalization, and scale transitions.
In the future, we will further investigate the impact of the orientation, i.e., quaternions or Euler angles (especially yaw), on the pose accuracy. We will also explore a closed end-to-end architecture to investigate the gain of information through direct concatenation of CNN-encoder-output (APR) and LSTM-output (RPR). Furthermore, instead of a position regression for each camera separately, we will investigate learning a weighted fusion of each of the multiple cameras to gain benefits from diverse viewing angles.
- (2007) ”A Database and Evaluation Methodology for Optical Flow”. In Intl. Conf. on Computer Vision (ICCV), Rio de Janeiro, Brazil, pp. 1–8. Cited by: Figure 2, §4.
- (2018) ”RelocNet: Continuous Metric Learning Relocalisation Using Neural Nets”. In Europ. Conf. on Computer Vision (ECCV), Cited by: §2.1.
- (2004) Visual odometry. Intl. J. of Robotics Research 33 (7), pp. 8–18. Cited by: §2.
- (2017) ”DSAC – Differentiable RANSAC for Camera Localization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 2492–2500. Cited by: §1.
- (2016) ”Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 3364–3372. Cited by: §1.
- (2018) ”Geometry-Aware Learning of Maps for Camera Localization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, pp. 2616–2625. Cited by: §2.2.
- (2012) ”A Naturalistic Open Source Movie for Optical Flow Evaluation”. In Proc. Europ. Conf. on Computer Vision (ECCV), Florence, Italy, pp. 611–625. Cited by: §4.
- (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. Trans. on Robotics 32 (6), pp. 1309–1332. Cited by: §2.
- (2018) ”A Hybrid Probabilistic Model for Camera Relocalization”. In British Machine Vision Conf. (BMVC), York, UK. Cited by: §1.
- (2017) ”VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization”. In Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 2652–2660. Cited by: §2.1.
- (2016) ”Exploring Representation Learning With CNNs for Frame-to-Frame Ego-Motion Estimation”. In Robotics and Automation Letters, Boston, MA, pp. 1–12. Cited by: §2.3.
- (2019) ”CamNet: Coarse-to-Fine Retrieval for Camera Re-Localization”. In Intl. Conf. on Computer Vision (ICCV), Seoul, South Korea, pp. 2871–2880. Cited by: §2.1.
- (2018) ”xyzNet: Towards Machine Learning Camera Relocalization by Using a Scene Coordinate Prediction Network”. In Intl. Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Munich, Germany, pp. 258–263. Cited by: §1.
- (2015) ”FlowNet: Learning Optical Flow with Convolutional Networks”. In Intl. Conf. on Computer Vision (ICCV), Santiago de Chile, Chile, pp. 2758–2766. Cited by: §4, §5.3.
- (2016) ”Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference”. In arXiv preprint arXiv:1506.02158, Cited by: §2.1.
- (2012) ”Are we ready for autonomous driving? The KITTI vision benchmark suite”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Providence, RI, pp. 3354–3361. Cited by: §4.
- (2014) ”Multi-output Learning for Camera Relocalization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, pp. 1114–1121. Cited by: §1.
- (2017) ”FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 1647–1655. Cited by: §1, §2.3, Figure 4, §3.2, §4, §5.3.
- (2018) ”Geometric Consistency for Self-Supervised End-to-End Visual Odometry”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT. Cited by: §1.
- (2002) ”Color landmark based self-localization for indoor mobile robots”. In Intl. Conf. on Robotics and Automation (ICRA), Washington, DC, pp. 1037–1042. Cited by: §1.
- (2017) Keyframe-based visual-inertial online slam with relocalization. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, Canada, pp. 6662–6669. Cited by: §2.
- (1999) Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proc. Intl. Workshop on Augmented Reality (IWAR), San Francisco, CA, pp. 85–94. Cited by: §2.
- (2016) ”Modelling Uncertainty in Deep Learning for Camera Relocalization”. In Intl. Conf. on Robotics and Automation (ICRA), Stockholm, Sweden, pp. 4762–4769. Cited by: §1, §2.1.
- (2017) ”Geometric Loss Functions for Camera Pose Regression with Deep Learning”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 6555–6564. Cited by: §2.1, §3.1.
- (2015) ”PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”. In Intl. Conf. on Computer Vision (ICCV), Santiago de Chile, Chile, pp. 2938–2946. Cited by: §1, §1, §2.1, §3.1, §4, §5.1, Table 1, §5.
- (2013) Dense visual SLAM for RGB-D cameras. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Tokyo, Japan, pp. 2100–2106. Cited by: §2.
- (2007) Parallel tracking and mapping for small AR workspaces. In Proc. Intl. Workshop on Augmented Reality (ISMAR), Nara, Japan, pp. 1–10. Cited by: §2.
- (2019) ”DistanceNet: Estimating Traveled Distance From Monocular Images Using a Recurrent Convolutional Neural Network”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA. Cited by: §1.
- (2017) ”Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network”. In Intl. Conf. on Computer Vision Workshop (ICCVW), Venice, Italy, pp. 920–929. Cited by: §2.1, §4.
- (2007) ”Mobile robot localization using infrared light reflecting landmarks”. In Intl. Conf. Control, Automation and Systems, Seoul, South Korea, pp. 674–677. Cited by: §1.
- (2017) Monocular visual-inertial state estimation for mobile augmented reality. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, Canada, pp. 11–21. Cited by: §2.
- (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv preprint arXiv:1809.00716 18 (17). Cited by: §2.
- (2019) ”Deep Global-Relative Networks for End-to-End 6-DoF Visual Localization and Odometry”. In Pacific Rim Intl. Conf. Artificial Intelligence (PRICAI), Yanuca Island, Cuvu, Fiji, pp. 454–467. Cited by: §1, §2.2.
- (2018) ICE-BA: incremental, consistent and efficient bundle adjustment for visual-inertial SLAM. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, pp. 1974–1982. Cited by: §2.
- (2018) ”An Evaluation Methodology for Inside-Out Indoor Positioning based on Machine Learning”. In Intl. Conf. Indoor Positioning and Indoor Navigation (IPIN), Nantes, France. Cited by: §1, §4.1, Table 1.
- (2016) ”1 Year, 1000km: The Oxford RobotCar Dataset”. In Intl. J. of Robotics Research (IJRR), pp. 3–15. Cited by: §4.
- (2017) Real Time Monocular Visual Odometry using Optical Flow: Study on Navigation of Quadrotor UAV. In Intl. Conf. on Science and Technology - Computer (ICST), Yogyakarta, Indonesia, pp. 122–126. Cited by: §2.3.
- (2016) Pose estimation for augmented reality: a hands-on survey. Trans. Visualization and Computer Graphics 22 (12), pp. 2633–2651. Cited by: §2.
- (2017) ”Image-Based Localization Using Hourglass Networks”. In Intl. Conf. on Computer Vision Workshop (ICCVW), Venice, Italy, pp. 870–877. Cited by: §1.
- (2017) ”Backtracking Regression Forests for Accurate Camera Relocalization”. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, BC, pp. 6886–6893. Cited by: §1.
- (2017) ”Flowdometry: An Optical Flow and Deep Learning Based Approach to Visual Odometry”. In Winter Conf. on Applications of Computer Vision (WACV), Santa Rosa, CA, pp. 624–631. Cited by: §2.3.
- (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. Trans. Robotics 33 (5), pp. 1255–1262. Cited by: §2.
- (2017) ”Deep Regression for Monocular Camera-based DoF Global Localization in Outdoor Environments”. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, BC, pp. 1525–1530. Cited by: §2.1.
- (2017) An innovative process to select augmented reality (AR) technology for maintenance. In Proc. Intl. Conf. on Manufacturing Systems (CIRP), Vol. 59, Taichung, Taiwan, pp. 23–28. Cited by: §2.
- (2018) ”ContextualNet: Exploiting Contextual Information using LSTMs to Improve Image-based Localization”. In Intl. Conf. Robotics and Automation (ICRA), Cited by: §2.1, §2.1, §5.4.
- (2018) ”VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry”. Robotics and Automation Letters 3 (4), pp. 4407–4414. Cited by: §1, §2.2, §4.
- (2019) ”Understanding the Limitations of CNN-based Absolute Camera Pose Regression”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, pp. 3302–3312. Cited by: §1.
- (2019) ”How to improve CNN-based 6-DoF Camera Pose Estimation”. In Intl. Conf. on Computer Vision (ICCV), Cited by: §2.1, §2.1, §2, §5.4, §5.4.
- (2013) ”Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Portland, OR, pp. 2930–2937. Cited by: §1, §1, §2.1, §4, §5.2, §5.3, Table 1.
- (2015) ”Going deeper with convolutions”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 1–9. Cited by: §3.1, §5.4.
- (2017) Visual SLAM algorithms: a survey from 2010 to 2016. Trans. Computer Vision and Applications 9 (1), pp. 452–461. Cited by: §2.
- (2017) A visual-SLAM for first person vision and mobile robots. In Proc. Intl. Conf. on Intelligent Robots and Systems (IROS), Vancouver, Canada, pp. 73–76. Cited by: §2.
- (2018) ”Deep Auxiliary Learning for Visual Localization and Odometry”. In Intl. Conf. on Robotics and Automation (ICRA), Brisbane, Australia, pp. 6939–6946. Cited by: §1, §2.2.
- (2015) ”Exploiting uncertainty in regression forests for accurate camera relocalization”. In Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 4400–4408. Cited by: §1.
- (2017) Hologram stability evaluation for Microsoft HoloLens. In Intl. Conf. on Robotics and Automation (ICRA), Marina Bay Sands, Singapore, pp. 3–14. Cited by: §2.
- (2016) ”Deep Learning for Image-Based Localization”. Master’s Thesis, Technische Universität München, Department of Informatics, Semantic Scholar, Munich, Germany. Cited by: §2.1.
- (2017) ”Image-Based Localization Using LSTMs for Structured Feature Correlation”. In Intl. Conf. on Computer Vision (ICCV), Venice, Italy, pp. 627–637. Cited by: §1, §1, §2.1, §2.1, §3.2, §4, §5.1, §5.4, Table 1, §5.
- (2017) ”DeepVO: Towards end-to-end Visual Odometry with deep Recurrent Convolutional Neural Networks”. In Intl. Conf. on Robotics and Automation (ICRA), Singapore, Singapore, pp. 2043–2050. Cited by: §2.1, §2.1, §5.4.
- (2018) ”Real-time Visual Odometry based on Optical Flow and Depth Learning”. In Intl. Conf. on Measuring Technology and Mechatronics Automation (ICMTMA), Changsha, China, pp. 239–242. Cited by: §2.3.