Improving Image-Based Localization with Deep Learning:
The Impact of the Loss Function
This work investigates the impact of the loss function on the performance of Neural Networks, in the context of a monocular, RGB-only, image localization task. A common technique used when regressing a camera’s pose from an image is to formulate the loss as a linear combination of positional and rotational mean squared error (using tuned hyperparameters as coefficients). In this work we observe that changes to rotation and position mutually affect the captured image, and that, in order to improve performance, a pose regression network’s loss function should include a term which combines the error of both of these coupled quantities. Based on task-specific observations and experimental tuning, we present such a loss term, and create a new model by appending it to the loss function of the pre-existing pose regression network ‘PoseNet’. We achieve improvements in the localization accuracy of the network for indoor scenes, with decreases of up to % and % in the median positional and rotational error respectively, when compared to the default PoseNet.
1 Introduction
In Convolutional Neural Networks (CNNs) and other Neural Network (NN) based architectures, a ‘loss’ function is provided which quantifies the error between the ground truths and each of the NN’s predictions. This scalar quantity is used during the backpropagation process, essentially ‘informing’ the NN on how to adjust its trainable parameters. Naturally, the design of this loss function greatly affects the training process, yet simple metrics such as mean squared error (MSE) are often used in place of more intuitive, task-specific loss functions. In this work, we explore the design and subsequent impact of a NN’s loss function in the context of a monocular, RGB-only, image localization task.
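For concreteness, the MSE referred to throughout this work reduces to a few lines of code (a generic sketch, not code from the proposed pipeline):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the scalar that backpropagation differentiates."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Squared errors are 0.25, 0.0 and 1.0, so the loss is their mean, 1.25 / 3.
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```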
The problem of image localization — that is, extracting the position and rotation (herein referred to collectively as the ‘pose’) of a camera directly from an image — has been approached using a variety of traditional and deep learning based techniques in recent years. The problem remains exceedingly relevant as it lies at the heart of numerous technologies in Computer Vision (CV) and robotics, e.g. geo-tagging, augmented reality and robotic navigation.
More colloquially, the problem can be understood as trying to find out where you are, and where you are looking, by considering only the information present in an RGB image.
CNN-based approaches to image localization — such as PoseNet [kendall_posenet_2015] — have found success in recent years due to the availability of large datasets and powerful training hardware, but the performance gap between these systems and the more accurate SIFT feature-based pipelines remains large. For example, the SIFT-based Active Search algorithm [sattler_prior_2017] serves as a reminder that significant improvements need to be made before CNN techniques can be considered competitive when localizing images.
However, CNN-based approaches do possess a number of characteristics which qualify them to handle this task well. Namely, CNNs are robust to changes in illumination and occlusion [melekhov_hourglass_2017], they can operate in close to real time [massiceti_rf_nn_2016] ( frames per second) and can be trained from labelled data (which can easily be gathered via Structure from Motion (SfM) for any arbitrary scene [schonberger_sfm_2016, schonberger_pixelwise_2016]). CNN-based systems also tend to excel in textureless environments where SIFT-based methods would typically fail [brachmann_lessmore_2017]. They have also proven to operate well using purely RGB image data — making them an ideal solution for localizing small, cheap, robotic devices such as drones and unmanned ground vehicles. The major concern of this work is to extend existing pipelines whilst ensuring that the benefits provided by CNNs are preserved.
A key observation when considering existing CNN approaches is how position and rotation are treated separately in the loss function. It can be observed that altering a camera’s position or rotation both affect the image produced, and hence the error in the regressed position and the regressed rotation cannot be decoupled — each mutually affects the other. In order to optimize a CNN for regressing a camera’s pose accurately, a loss term should be used which combines both distinct quantities in an intuitive fashion.
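This coupling can be illustrated with a toy pinhole-camera computation: a small sideways translation of the camera and a small rotation both displace the same projected point in the image, so their effects are entangled in image space. The focal length and geometry below are arbitrary illustrative choices, not values from any experiment in this work:

```python
import numpy as np

def project(point_w, R, t, f=500.0):
    """Project a world point to pixel coordinates for a camera with
    rotation R and position t (simple pinhole model, no distortion)."""
    p_cam = R @ (point_w - t)
    return f * p_cam[:2] / p_cam[2]

def rot_y(theta):
    """Rotation matrix for an angle theta (radians) about the y axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

P = np.array([0.0, 0.0, 10.0])  # a point 10 m in front of the camera
base = project(P, np.eye(3), np.zeros(3))

# Translating the camera 0.1 m sideways shifts the pixel by ~5 px ...
moved = project(P, np.eye(3), np.array([0.1, 0.0, 0.0]))
# ... and rotating it by just 0.01 rad shifts it by a similar amount,
# so image-space evidence cannot cleanly separate the two error sources.
turned = project(P, rot_y(0.01), np.zeros(3))
print(base, moved, turned)
```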
This publication thus offers the following key contributions:
The formulation of a loss term which considers the error in both the regressed position and rotation (Section 3).
Comparison of a CNN trained with and without this loss term on common RGB image localization datasets (Section LABEL:sec:results).
An indoor image localization dataset (the Gemini dataset) with over pose-labelled images per scene (Section LABEL:ssec:datasets).
2 Related work
This work builds chiefly on the PoseNet architecture (a camera pose regression network [kendall_posenet_2015]). PoseNet was one of the first CNNs to regress the 6 degrees of freedom in a camera’s pose. The network is pretrained on object detection datasets in order to maximize the quality of feature extraction, which occurs in the first stage of the network. It only requires a single RGB image as input, unlike other networks [radwan_vloc_2018, radwan_vloc++_2018], and operates in real time.
Notably, PoseNet is able to localize traditionally difficult-to-localize images, specifically those with large textureless areas (where SIFT-based methods fail). PoseNet’s end-to-end nature and relatively simple ‘one-step’ training process make it ideal for modification, which in the case of this work comes in the form of changing its loss function.
PoseNet has had its loss function augmented in prior works. In [kendall_geometric_2017] it was demonstrated that changing a pose regression network’s loss function is sufficient to cause an improvement in performance. The network was similarly ‘upgraded’ in [walch_image_2016] using LSTMs to correlate features at the CNN’s output. Further improvements were made in [kendall_uncertain_2015], where a Bayesian CNN implementation was used to estimate re-localization accuracy.
More complex CNN approaches do exist [melekhov_hourglass_2017, melekhov_relative_cnn_2017, purkait_sppnet_2017]. For example, the pipeline outlined in [laskar_camera_2017] uses a CNN to regress the relative poses between a set of images which are similar to a query image. These relative pose estimates are coalesced in a fusion algorithm which produces an estimate for the camera pose of the query image.
Depth data has also been incorporated into the inputs of pose regression networks (to improve performance by leveraging multi-modal input information). These RGB-D input pipelines are commonplace in the image localization literature [brachmann_lessmore_2017], and typically boast higher localization accuracy at the cost of requiring additional sensors, data and computation.
A variety of non-CNN solutions exist, with one of the more notable being the Active Search algorithm [sattler_prior_2017], which uses SIFT features to inform a matching process. SIFT descriptors are calculated over the query image and are directly compared to a known 3D model’s SIFT features. SIFT and other non-CNN learned descriptors have been used to achieve high localization accuracy, but these descriptors tend to be susceptible to changes in the environment, and they often necessitate systems with large amounts of memory and computational power (compared to CNNs) [kendall_posenet_2015].
The primary focus of this work is quantifying the impact of the loss function when training a pose regression CNN. Hence, we do not draw direct comparisons between the proposed model and significantly different pipelines — such as SIFT-based feature matching algorithms or PoseNet variations with highly modified architectures. Moreover, for the purpose of maximizing the number of available benchmark datasets, we consider pose regressors which handle purely RGB query images. In this way, this work deals specifically with CNN solutions to the monocular, RGB-only image localization task.
3 Formulating the proposed loss term
When trying to accurately regress one’s pose based on visual data alone, the error in the two terms which define pose — position and rotation — obviously needs to be minimized. If these error terms were entirely minimized, the camera would be in the correct location and would be ‘looking’ in the correct direction.
Formally, pose regression networks — such as the default PoseNet — are trained to regress an estimate $\hat{\mathbf{p}}$ for a camera’s true pose $\mathbf{p}$. They do this by calculating the loss after every training iteration, which is formulated as the MSE between the predicted position $\hat{\mathbf{x}}$ and the true position $\mathbf{x}$, plus the MSE between the predicted rotation $\hat{\mathbf{q}}$ and the true rotation $\mathbf{q}$. Note that rotations are encoded as quaternions, since the space of rotations is continuous, and results can be easily normalized to the unit sphere in order to ensure valid rotations. Hyperparameters $\alpha$ and $\beta$ control the balance between positional and rotational error, as illustrated in Equation (LABEL:eq:origlossfn). In practice, RGB-only pose regression networks reach a maximum localization accuracy when minimizing these error terms independently.