DeepTIO: A Deep Thermal-Inertial Odometry with Visual Hallucination
Visual odometry shows excellent performance in a wide range of environments. However, in visually-denied scenarios (e.g. heavy smoke or darkness), pose estimates degrade or even fail. Thermal imaging cameras are commonly used for perception and inspection when the environment has low visibility. However, their use in odometry estimation is hampered by the lack of robust visual features. In part, this is because the sensor measures the ambient temperature profile rather than scene appearance and geometry. To overcome these issues, we propose a Deep Neural Network model for thermal-inertial odometry (DeepTIO) that incorporates a visual hallucination network to provide the thermal network with complementary information. The hallucination network is taught to predict fake visual features from thermal images by using the robust Huber loss. We also employ selective fusion to attentively fuse the features from three different modalities, i.e. thermal, hallucination, and inertial features. Extensive experiments are performed on our large-scale hand-held data in benign and smoke-filled environments, showing the efficacy of the proposed model.
Camera pose estimation is a key enabler for a wide range of applications in robotics and computer vision. Primary examples include position tracking of mobile robots, autonomous vehicles, pedestrians, or mobile devices for augmented reality applications. Visual Odometry (VO) is the de facto solution for estimating camera pose. Many VO techniques have been proposed, ranging from traditional feature-based approaches [Nister2004f, Forster2014, Mur-Artal2015b] to the more recently developed Deep Neural Network (DNN) based approaches [Wang2017, saputra19clvo, almalioglu2018ganvo, zhan2018unsupervised]. While VO is useful in a number of scenarios, its application is limited to those with sufficient illumination. For instance, VO systems fail in locating aerial robots in dim underground tunnels [kanellakis2016evaluation] or tracking a firefighter in emergency response scenarios in presence of airborne particulates (e.g., smoke and soot). In contrast, thermal cameras are not affected by illumination conditions or airborne particulates, making them a viable sensing alternative to RGB cameras.
Although thermal cameras have been commonly used in visually-denied environments, their use cases are largely limited to perception and inspection [peynot2009towards, quater2014light]. The main hindrance preventing their usage in odometry estimation is the lack of visual features (e.g. edges and textures) in the imaging system. Thermal cameras capture the radiation emitted from objects in the Long-Wave Infrared (LWIR) portion of the spectrum. These raw radiometric data are then converted to a temperature profile represented in a visible format (e.g. grayscale) to ease human interpretation [yu2013camera]. As the camera captures the environmental temperature rather than the scene appearance and geometry, it is difficult to extract sufficient hand engineered features to accurately estimate pose. Moreover, even for the same scene, the extracted features are dependent on the temperature gradient. This issue is further compounded by the fact that every thermal camera is plagued with fixed-pattern noise and requires frequent re-calibration during operation through Non-Uniformity Correction (NUC) which periodically freezes the images for about half to one second [averbuch2007scene] every 30-150 seconds.
The last decade has witnessed a rapid development in the use of deep learning for automatically extracting salient features by directly learning a non-linear mapping function from data. We believe that, with sufficient training data, a DNN can also learn to infer 6 Degree-of-Freedom (DoF) poses from a sequence of thermal images. However, despite the DNN’s ability to model this complexity, as stated before, thermal images are largely textureless and inherently lack sufficient features for accurate odometry estimation. Our novel intuition to alleviate this issue is to force our network to not only extract features from thermal images, but to additionally learn to hallucinate visual features similar to the ones extracted from a DNN-based VO, which have been proven to work well [Wang2017, saputra19clvo, almalioglu2018ganvo, saputra2018visual]. Given sufficient training data, we hypothesize that hallucinating visual features is possible and can provide the thermal network with auxiliary information for accurate odometry estimation.
In this paper, we propose a DNN-based thermal-inertial odometry which is able to estimate accurate camera pose by not only extracting features from thermal images, but also hallucinating the visual features given thermal images as input. We also fuse the thermal image stream with Inertial Measurement Unit (IMU) data to improve pose estimation robustness, since inertial measurements are environment-agnostic. To this end, we employ selective fusion [chen2019selective] to adaptively fuse the different modalities conditioned on the input data. In summary, our key contributions are as follows:
We propose the first end-to-end trainable Deep Thermal-Inertial Odometry (DeepTIO) model.
We present a novel deep neural odometry architecture incorporating a hallucination network.
We present a new application of selective fusion with input from three feature channels, i.e. thermal, IMU, and hallucinated visual features.
We perform extensive experiments and analysis in our self-collected hand-held dataset in both benign and smoke-filled environments.
II Related Work
II-A Thermal Odometry
Accurately estimating camera ego-motion from a thermal imaging system remains a challenging problem. Some efforts have been made towards thermal odometry systems, although these are limited to relatively short distances and yield sub-optimal performances compared to visible camera systems. Existing works either rely on sparse feature-based or direct-based approaches. Mouats et al. [mouats2015thermal] employed a Fast-Hessian feature detector for UAV tracking using a stereo thermal camera. Khattak et al. [khattak2019keyframe] developed a keyframe-based direct approach which minimizes radiometric error (raw thermal data) between consecutive frames. Borges and Vidas [borges2016practical] designed a practical thermal odometry system with an automatic mechanism to determine when to perform NUC operation based on the current and the predicted poses. To improve robustness during NUC operation, most works incorporate thermal imaging with other modalities such as visual [poujol2016visible, khattak2019visual] or inertial [papachristos2018thermal, khattak2019keyframe].
II-B DNN-based Odometry
Owing to advances in DNNs, learning-based odometry has recently been gaining favor. Wang et al. [Wang2017] started this trend by introducing an end-to-end trainable deep visual odometry method (DeepVO) that combines the feature extraction capabilities of Convolutional Neural Networks (CNN) with the ability of a Recurrent Neural Network (RNN) to model long-term camera pose dependencies. This was followed by other improvements such as enforcing consistency among multiple poses [saputra19clvo] or introducing additional learning signals through multi-task learning such as global pose localization [valada2018deep] or semantic segmentation [radwan2018vlocnet]. Other works improved the robustness of VO by fusing visual and inertial streams [clark2017vinet] and performing selective fusion between visual and inertial features [chen2019selective]. In parallel to these supervised approaches, many self-supervised DNN-based VO approaches were developed by leveraging the view reconstruction paradigm, started by Zhou et al. [Zhou2017] and followed by the addition of stereo information [li2018undeepvo, zhan2018unsupervised, yang2018deep] or generative networks [almalioglu2018ganvo, feng2019sganvo]. While many works exist for DNN-based odometry, to the best of our knowledge none of them uses a thermal camera as input.
II-C Learning with Side Information
Related to our work is the concept of learning with side information. Hoffman et al. [hoffman2016learning] introduced this concept by incorporating a depth hallucination network to increase the accuracy of object detection in RGB images. This concept was then adopted in other applications such as learning hand articulations [choi2017learning] or face recognition [lezama2017not]. Our work introduces this concept to odometry regression and trains the whole network with the non-trivial Huber loss. We are the first to hallucinate visual features from thermal images for odometry regression.
III Network Architecture
In this section we describe our proposed DeepTIO model for estimating thermal-inertial odometry. Fig. 1 illustrates the general architecture of the DeepTIO model at inference time. It is composed of a feature encoder, a selective fusion module, and a pose regressor. The feature encoder extracts salient features from each modality. We use a CNN for encoding thermal data and hallucinating visual features from thermal images. To extract features from the IMU data stream we employ an RNN, as RNNs are better suited to modeling the temporal dependencies of time-series data [connor1994recurrent]. The feature vectors generated by the IMU, thermal, and hallucination encoder networks are input into the selective fusion module, which attentively selects the features that are necessary for pose regression. The reweighted features are then fed into the pose regression module to infer 6-DoF relative camera poses. The details of each module are described below.
III-A Feature Encoder
Given a pair of consecutive thermal images $\mathbf{x}_T$, the purpose of the thermal encoder network $\Psi_T$ is to extract geometrically meaningful features for movement estimation (e.g. optical flow captured between moving edges). To this end, both the thermal encoder $\Psi_T$ and the hallucination encoder $\Psi_H$ are implemented with, and pre-initialized from, the FlowNetSimple structure [Dosovitskiy2016]. As the observed temperature profile (in grayscale) fluctuates when the camera captures hotter objects, we directly use the 16-bit raw radiometric data to obtain more stable inputs. Since raw radiometric data are represented by only one channel, we duplicate it into three channels before feeding it into the FlowNet structure. We use the last output activation from both $\Psi_T$ and $\Psi_H$ as our thermal and visual hallucinated features:
$$\mathbf{a}_T = \Psi_T(\mathbf{x}_T), \qquad \mathbf{a}_H = \Psi_H(\mathbf{x}_T) \tag{1}$$
We employ a single LSTM layer with 256 hidden states as the IMU encoder $\Psi_I$. The 6-dimensional inertial data $\mathbf{x}_I$, with a sequence of 20 frames, are fed into the IMU encoder to produce the IMU features:
$$\mathbf{a}_I = \Psi_I(\mathbf{x}_I) \tag{2}$$
To balance the number of features, we perform average pooling on $\mathbf{a}_T$ and $\mathbf{a}_H$, which fixes the final dimensions of all three feature vectors.
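The input handling described for the thermal encoder (single-channel 16-bit radiometric frames duplicated into three channels) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' pipeline: the function names, the tiny 2x2 frames, and the choice of a single scalar dataset mean are assumptions. The mean subtraction itself is the normalization stated in the training details.

```python
from statistics import fmean

def subtract_dataset_mean(frames):
    """Centre raw radiometric frames by the mean over every pixel of every frame."""
    mean = fmean(v for frame in frames for row in frame for v in row)
    centred = [[[v - mean for v in row] for row in frame] for frame in frames]
    return centred, mean

def to_three_channels(frame):
    """Duplicate one H x W channel into a 3 x H x W input for a FlowNet-style encoder."""
    return [[row[:] for row in frame] for _ in range(3)]

# Two hypothetical 2x2 frames of 16-bit raw radiometric counts.
frames = [[[30000, 30010], [30020, 30030]],
          [[30040, 30050], [30060, 30070]]]
centred, mean = subtract_dataset_mean(frames)
stacked = to_three_channels(centred[0])
```

All three channels carry identical values, so the duplication adds no information; it only matches the input shape expected by a network pre-initialized on three-channel images.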
III-B Selective Fusion
In deep learning-based VIO, a standard way to fuse feature vectors coming from different modalities is concatenation. However, directly fusing all feature modalities by concatenation results in sub-optimal performance, as not all features are useful and necessary [chen2019selective]. The situation is further exacerbated by the intrinsic noise distribution of each modality. In our case, thermal data are plagued by fixed-pattern noise, while IMU data are affected by white random noise and sensor bias. The hallucination network, in turn, might produce erroneous visual features. Moreover, in real applications there is a high chance that the different modalities, as well as the ground truth poses, will not be tightly synchronized.
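To this end, we employ selective fusion [chen2019selective] to let the network automatically learn the most suitable feature combination given the feature inputs. Specifically, a deterministic soft fusion is employed to attentively fuse features from the three sources while compensating for possible misalignment between inputs and ground truth. The fusion module learns to re-weight each feature by conditioning on all channels. The corresponding masks for the thermal ($\mathbf{s}_T$), hallucination ($\mathbf{s}_H$), and inertial ($\mathbf{s}_I$) features are learnt via:
$$\mathbf{s}_T = \sigma(\mathbf{W}_T\,\mathbf{a}_{[T;H;I]}), \qquad \mathbf{s}_H = \sigma(\mathbf{W}_H\,\mathbf{a}_{[T;H;I]}), \qquad \mathbf{s}_I = \sigma(\mathbf{W}_I\,\mathbf{a}_{[T;H;I]}) \tag{3}$$
where $\mathbf{a}_{[T;H;I]}$ denotes the concatenation of the features from all channels (thermal $\mathbf{a}_T$, hallucination $\mathbf{a}_H$, and inertial $\mathbf{a}_I$), $\sigma$ is the sigmoid function, and $\mathbf{W}_T$, $\mathbf{W}_H$, and $\mathbf{W}_I$ are the learnable weights for each feature modality. These masks weight the relative importance of the feature modalities through element-wise multiplication of each feature with its corresponding mask:
$$g(\mathbf{a}_T, \mathbf{a}_H, \mathbf{a}_I) = [\mathbf{a}_T \odot \mathbf{s}_T;\ \mathbf{a}_H \odot \mathbf{s}_H;\ \mathbf{a}_I \odot \mathbf{s}_I] \tag{4}$$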
Finally, the merged features are fed to the pose regressor network to estimate 6-DoF poses.
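The soft fusion described above (one sigmoid mask per modality, computed from the concatenation of all features and applied element-wise) can be illustrated with a small sketch. It is written in plain Python for readability and is not the authors' implementation: the names `soft_fusion`, `matvec`, the toy feature sizes, and the zero weight matrices are assumptions chosen to make the arithmetic easy to follow.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def soft_fusion(a_t, a_h, a_i, w_t, w_h, w_i):
    """Re-weight thermal, hallucinated and inertial features with learnt masks."""
    concat = a_t + a_h + a_i              # every mask is conditioned on all channels
    fused, masks = [], []
    for feat, weights in ((a_t, w_t), (a_h, w_h), (a_i, w_i)):
        mask = [sigmoid(z) for z in matvec(weights, concat)]
        masks.append(mask)
        fused.extend(f * m for f, m in zip(feat, mask))   # element-wise product
    return fused, masks

# Toy features; zero weights (hypothetical) give a mask of 0.5 everywhere.
a_t, a_h, a_i = [1.0, -2.0], [0.5], [3.0]
zeros = lambda rows: [[0.0] * 4 for _ in range(rows)]
fused, masks = soft_fusion(a_t, a_h, a_i, zeros(2), zeros(1), zeros(1))
```

Each weight matrix has as many rows as its own feature vector and as many columns as the concatenated vector, so a mask entry can suppress a thermal feature based on what the inertial or hallucinated channels contain.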
III-C Pose Regressor
The pose regressor consists of LSTM layers followed by two parallel Fully Connected (FC) branches that estimate relative translation and rotation respectively. We use LSTMs to model the long-term temporal dependencies of camera ego-motion as in [Wang2017, saputra19clvo]. Each LSTM has 512 hidden states and takes the reweighted features as input. The output latent vectors from the LSTMs are then fed into the two branches, each consisting of three FC layers with 128, 64, and 3 units respectively. We decouple the FC layers for translation and rotation, as estimating them separately has been shown to work better [deeppco2019]. We also apply dropout [srivastava2014dropout] with a rate of 0.25 between FC layers to aid regularization.
IV Learning Mechanism
This section introduces the mechanism used to train the hallucination network and to learn odometry regression.
IV-A Learning Visual Hallucination
The visual hallucination network $\Psi_H$ is intended to provide additional information alongside the thermal encoder $\Psi_T$. Given the original thermal images $\mathbf{x}_T$ as input, this module produces visual hallucination vectors $\mathbf{a}_H$ that imitate the visual features $\mathbf{a}_V$ obtained from real RGB images $\mathbf{x}_V$ by a visual encoder $\Psi_V$. To acquire pseudo ground truth visual features, we employ a modified deep Visual-Inertial Odometry (VIO) model, i.e. VINet [clark2017vinet]. The only difference is that we use FlowNetSimple as the feature extractor instead of FlowNetCorr [Dosovitskiy2016], which is used in the original VINet. This modification gives the hallucination features and the visual features the same dimension, simplifying the training process. After training the VINet model, the weights of the visual encoder $\Psi_V$ are frozen during the training of the hallucination network, while the weights of the hallucination encoder $\Psi_H$ remain trainable.
Fig. 2 illustrates the architecture of our visual hallucination model during training. We train the hallucination network by minimizing the discrepancy $\xi = \mathbf{a}_V - \mathbf{a}_H$ between the output activations of the visual encoder ($\mathbf{a}_V$) and the hallucination encoder ($\mathbf{a}_H$). The standard L2 norm is generally used for minimizing $\xi$ in benign cases [hoffman2016learning, saputra2019distilling]. However, a thermal camera requires periodic NUC calibration, during which the same image is output for half a second to one second. NUC thus forces several identical thermal features to be matched with different visual features during network training. This might produce an erroneous mapping between $\mathbf{a}_H$ and $\mathbf{a}_V$ and contaminate $\xi$ with outliers. Since the L2 loss is very sensitive to outliers, encountering them during training lets the outliers dominate the loss and its back-propagated gradients, impairing convergence. To improve robustness against outliers, we instead propose to use the Huber loss [huber1992robust] to minimize $\xi$. Our hallucination loss is then formally defined as follows:
$$\mathcal{L}_H = \frac{1}{n}\sum_{i=1}^{n} H(\xi_i), \qquad H(\xi) = \begin{cases} \frac{1}{2}\lVert\xi\rVert^2 & \text{if } \lVert\xi\rVert \le \delta \\ \delta\left(\lVert\xi\rVert - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases} \tag{5}$$
where $\delta$ is a threshold and $n$ is the batch size during training. With the Huber loss, discrepancies larger than $\delta$ have a linear rather than quadratic effect, making the loss less sensitive to outliers. Discrepancies below $\delta$ are still minimized with the quadratic loss to enable fast convergence. During training, we use a fixed value of $\delta$.
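To make the quadratic-versus-linear behaviour of the Huber loss concrete, the sketch below evaluates it on a tiny batch of feature discrepancies. It is an illustration under assumptions: the loss is applied to the Euclidean norm of each per-sample discrepancy (an element-wise variant is also common), and the threshold value 1.0 and the toy feature vectors are hypothetical, not values taken from the paper.

```python
import math

def huber(residual, delta):
    """Quadratic below the threshold delta, linear above it."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def hallucination_loss(visual_feats, halluc_feats, delta):
    """Mean Huber penalty over a batch of (visual, hallucinated) feature pairs."""
    norms = [math.dist(v, h) for v, h in zip(visual_feats, halluc_feats)]
    return sum(huber(x, delta) for x in norms) / len(norms)

# Sample 1 is a small discrepancy; sample 2 mimics an outlier, such as a frozen
# thermal frame during NUC being paired with a changed visual frame.
visual = [[0.0, 0.0], [3.0, 4.0]]
halluc = [[0.0, 0.5], [0.0, 0.0]]
delta = 1.0                                   # hypothetical threshold
loss = hallucination_loss(visual, halluc, delta)
outlier_quadratic = 0.5 * 5.0 ** 2            # what a pure quadratic loss would charge
```

The outlier with norm 5 contributes 4.5 under the Huber loss against 12.5 under a purely quadratic loss, which is why a handful of NUC-induced mismatches no longer dominate the batch gradient.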
IV-B Learning Odometry Regression
We train the network to estimate odometry by minimizing the loss between the predicted pose and the ground truth pose. This task essentially amounts to learning a mapping function from the input (the thermal images and inertial data) to the output (the relative pose) over the whole training data. The pose regressor network, together with all other networks except the hallucination part, is trained using the following regression loss:
$$\mathcal{L}_R = \frac{1}{n}\sum_{i=1}^{n} \left[ H(\mathbf{t}_i - \hat{\mathbf{t}}_i) + \gamma\, H(\mathbf{r}_i - \hat{\mathbf{r}}_i) \right] \tag{6}$$
where $H$ is the Huber loss as in (5), $n$ is the batch size, and $[\mathbf{t}, \mathbf{r}]$ and $[\hat{\mathbf{t}}, \hat{\mathbf{r}}]$ are the translation and rotation components of the predicted poses and the ground truth poses respectively. We use Euler angles to represent rotation since they converge faster, being free from the constraints of other representations (e.g. rotation matrix, quaternion). We also use a factor $\gamma$ to balance the loss between translation and rotation.
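The need for a balancing factor between the translation and rotation terms is easiest to see numerically. The following sketch is illustrative only: placing the weight on the rotation term, the weight value 100, the threshold 1.0, and the example pose are all assumptions, since the source does not preserve these specifics.

```python
import math

def huber(residual, delta):
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def regression_loss(predicted, truth, delta, rot_weight):
    """Batch mean of Huber(translation error) + rot_weight * Huber(rotation error).

    Each pose is a (translation, rotation) pair of 3-vectors; rotation is given
    as Euler angles, so no unit-norm or orthogonality constraint is enforced.
    """
    total = 0.0
    for (t_p, r_p), (t_g, r_g) in zip(predicted, truth):
        total += huber(math.dist(t_p, t_g), delta)
        total += rot_weight * huber(math.dist(r_p, r_g), delta)
    return total / len(predicted)

predicted = [((0.1, 0.0, 0.0), (0.0, 0.0, 0.02))]
truth = [((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))]
unweighted = regression_loss(predicted, truth, delta=1.0, rot_weight=1.0)
weighted = regression_loss(predicted, truth, delta=1.0, rot_weight=100.0)
```

With equal weights the rotation term here is 25 times smaller than the translation term, so the gradient would be driven almost entirely by translation; the weight brings the two onto a comparable scale.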
IV-C Training Details
The architecture is trained in two stages. In the first stage we train the hallucination network, while in the second stage we train the remaining networks. Note that, in the second stage, we freeze the hallucination network so that the unstable learning process at the beginning of training the other networks does not alter the hallucination weights learnt in the first stage. We train the losses separately, as this empirically gives better results. We use the Adam optimizer with a 0.0001 learning rate to train the hallucination network for 200 epochs. For training the remaining networks in the second stage we employ RMSProp with a 0.001 initial learning rate, which is dropped every 25 epochs, for a total of 200 epochs. We normalize the input radiometric data by subtracting the mean over the dataset. We randomly cut the training sequences into small batches of consecutive pairs to obtain better generalization. We also sub-sample the input so that the frame rate is around 4-5 fps, providing sufficient parallax between two consecutive frames. To further fine-tune the network, we alternately freeze and train the selective fusion and the pose regressor.
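The sub-sampling and sequence-cutting steps can be sketched as follows. This is a minimal illustration, not the authors' data loader: the helper names, the stride rule (rounding the ratio of source to target frame rate), the 4.5 fps target, and the window length of 4 pairs are assumptions. Only the 60 fps thermal rate and the 4-5 fps target range come from the text.

```python
import random

def subsample(frames, src_fps, target_fps):
    """Keep every k-th frame so the effective rate lands near target_fps."""
    stride = max(1, round(src_fps / target_fps))
    return frames[::stride], src_fps / stride

def consecutive_pairs(frames):
    """Pair each frame with its successor: (f0, f1), (f1, f2), ..."""
    return list(zip(frames[:-1], frames[1:]))

def random_window(pairs, length, rng):
    """Cut a random run of `length` consecutive pairs from a sequence."""
    start = rng.randrange(len(pairs) - length + 1)
    return pairs[start:start + length]

frames = list(range(120))                      # 2 s of frame indices at 60 fps
kept, fps = subsample(frames, src_fps=60, target_fps=4.5)
pairs = consecutive_pairs(kept)
window = random_window(pairs, length=4, rng=random.Random(0))
```

A stride of 13 on a 60 fps stream gives roughly 4.6 fps, inside the stated range. Because consecutive pairs share a frame, each window still describes one unbroken stretch of motion.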
The thermal data was collected using a hand-held FLIR E95 camera at 60 fps with 464×348 image resolution, while inertial measurements were captured using an XSens MTI-1 Series IMU. As we gathered the data mostly in public spaces, real ground truth poses are not available. Instead, we use VINS-Mono (visual-inertial SLAM) [qin2018vins] as pseudo ground truth. For this purpose, we also collected RGB-D data at 30 fps using an Intel RealSense D435 depth camera at 848×480 image resolution.
We collected data from five different buildings including a real firefighter training facility. In total, we collected 19 sequences in different places including a library, open office, large room, apartment, corridor, underground storage, and an actual smoke-filled environment. The sequences are divided into two groups, one with good time alignment and another one with slight misalignments among sensor modalities. This misalignment can be used to test the robustness of the model against unsynchronized inputs. We use 13 sequences for training (about half from bad and another half from good alignment) and use the remaining data for testing.
V-B Evaluation Metrics
To evaluate the proposed model, we utilize the Mean Square (MS) of the Relative Pose Error (RPE) and the Absolute Trajectory Error (ATE), since they have been widely used for measuring VO or visual SLAM accuracy [sturm2012benchmark]. Since VINS-Mono generates trajectories through a key-frame selection process, the estimated poses will be unaligned in time with the poses produced by the model. To this end, we align the predicted poses with VINS-Mono using Horn's closed-form method and evaluate only the poses that are closest in time. We use the evaluation tools from the TUM RGB-D dataset to do this (https://vision.in.tum.de/data/datasets/rgbd-dataset).
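The closest-in-time evaluation can be pictured with a short sketch that pairs each key-frame pose of the reference with the nearest predicted pose and then summarizes the positional error. It is a simplified stand-in for the TUM tools, not a reproduction of them: the spatial alignment step (Horn's method) is omitted and the trajectories are assumed to be already expressed in a common frame. The function names, the 0.15 s tolerance, the root-mean-square summary, and the toy data are assumptions.

```python
import bisect
import math

def associate(ref_stamps, est_stamps, max_dt):
    """Match each reference stamp to the closest estimate stamp within max_dt.

    est_stamps must be sorted in increasing order.
    """
    matches = []
    for i, t in enumerate(ref_stamps):
        j = bisect.bisect_left(est_stamps, t)
        candidates = [k for k in (j - 1, j) if 0 <= k < len(est_stamps)]
        best = min(candidates, key=lambda k: abs(est_stamps[k] - t))
        if abs(est_stamps[best] - t) <= max_dt:
            matches.append((i, best))
    return matches

def ate_rmse(ref_xyz, est_xyz, matches):
    """Root-mean-square positional error over the matched pose pairs."""
    squared = [math.dist(ref_xyz[i], est_xyz[j]) ** 2 for i, j in matches]
    return math.sqrt(sum(squared) / len(squared))

ref_stamps = [0.0, 0.45, 1.1]                      # irregular key-frame times
est_stamps = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]        # regular model output
ref_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est_xyz = [(0.0, 0.0, 0.0), (0.4, 0.0, 0.0), (1.0, 0.3, 0.0),
           (1.3, 0.0, 0.0), (1.7, 0.0, 0.0), (2.0, 0.0, 0.4)]
matches = associate(ref_stamps, est_stamps, max_dt=0.15)
error = ate_rmse(ref_xyz, est_xyz, matches)
```

Only three of the six predicted poses take part in the score, which reflects how a key-frame-based pseudo ground truth limits the evaluation to the instants it actually covers.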
V-C Sensitivity Analysis
To understand the influence of the hallucination network, we perform a sensitivity analysis in the following section by using the test sequence in Corridor 2.
V-C1 Validating the Hallucination Network
To validate the hallucination network, we replace the visual encoder network of VINet with the hallucination network from DeepTIO, as seen in Fig. 3. By feeding the hallucinated visual features (fake RGB features) to the original VINet, we can measure how accurate the learnt representation produced by the hallucination network is. Fig. 4 shows the distribution of RPE for VINet and Fake VINet (VINet with input from fake RGB features). The error distributions for both translation and rotation are very similar, indicating that the hallucination network was trained successfully. Table I shows how close the average RPE of VINet and Fake VINet are. Surprisingly, Fake VINet achieves a slightly better result for rotation estimation, showing the efficacy of training with the Huber loss. Fig. 5 visualizes the output features from VINet and from the hallucination network in the test sequence (Corridor 2). The network can hallucinate visual features accurately (top), although there are also cases in which the hallucination network produces erroneous features (bottom) due to blurriness or a lack of thermal edges. In such cases, selective fusion plays an important role in selecting only the relevant information from the hallucination network. In the erroneous example, DeepTIO’s selective fusion produces less dense fusion masks, indicating that fewer features are being used.
|Model||t (m)||r (°)|
|Fake VINet (L2)||0.1197||5.1926|
|Fake VINet (Huber)||0.1128||5.0739|
V-C2 The Influence of Each Feature Modality
To understand the influence of each feature modality, we decouple the modalities, train on each separately, and test the result. The results in Table II show that thermal alone yields the worst accuracy, implying the difficulty of estimating odometry solely from temperature profile information. IMU alone clearly shows much stronger performance, although the optimal solution may need to produce only 3-DoF poses (instead of 6-DoF) as in [chen2018ionet], since the IMU data do not carry enough information to produce accurate 6-DoF poses. Combining IMU with thermal features or fake RGB features improves the accuracy (ATE), as the thermal or visual features constrain the IMU error growth. Adding fake RGB features to the IMU+Thermal model further reduces the ATE, indicating that the hallucinated visual features help generate more accurate poses. Note that all feature fusions (with the same mark in Table II) have the same network capacity, indicating that the improved accuracy is due to more useful information rather than increased network capacity.
V-C3 The Influence of Selective Fusion
As seen in Table II, adding selective fusion to the combined features consistently reduces the ATE compared to the network without selective fusion. This shows that selective fusion plays an important role in producing accurate results, since each feature modality comes with intrinsic noise, the hallucination network may produce erroneous visual features, and there is time misalignment between the sensors and the ground truth. Finally, combining all feature modalities with selective fusion yields the strongest performance for both RPE and ATE.
|Features||SF†||t (m)||r (°)||ATE (m)|
|has 52 M weights, while has 136 M weights.|
|†Whether Selective Fusion (SF) is employed or not.|
V-D Experimental Evaluation
V-D1 Test in Benign Environment
We test our model across different buildings and compare it with the state-of-the-art VIO frameworks to show that our DeepTIO solution is comparable, even though the image representation has fewer channels and useful features. For the traditional approach we employ ROVIO [sturm2012benchmark], which tightly fuses IMU and visual data with an iterated extended Kalman filter. For deep learning based approaches, we use VINet [clark2017vinet] which fuses IMU and visual features in the intermediate layer. We also compare with Vanilla DeepTIO, a version of DeepTIO without the visual hallucination network. Fig. 6 (a)-(e) depicts the qualitative results in this scenario.
Table III shows the numerical evaluation results in terms of RPE and ATE. ROVIO provides good accuracy in Corridor 2, although it suffers from a large scaling problem and loses tracking in Corridor 1 due to the lack of visual features when the camera faces white, flat walls. In misaligned sequences, ROVIO completely fails to initialize, since it requires tightly synchronized inputs. VINet also performs well when good alignment is available but suffers from large drift in the presence of time misalignment. This shows that directly concatenating features may lead to sub-optimal performance. Nevertheless, VINet can still produce odometry where ROVIO completely fails, showing that deep learning approaches are more robust to sensor alignment issues. However, the best results are achieved by Vanilla DeepTIO and DeepTIO, as they employ selective fusion, which has been shown to be robust to time synchronization issues [chen2019selective]. Note that Vanilla DeepTIO and DeepTIO use a smaller thermal image resolution (464×348) compared to the RGB images used by ROVIO and VINet (848×480).
Vanilla DeepTIO achieves excellent results in Corridor 2, Large Office, and Library 2, but suffers from drift in Corridor 1 and Library 1. DeepTIO, on the other hand, produces better results due to the additional information provided by the hallucination network. Nonetheless, estimating an accurate scale is a problem in some sequences. As seen in the Large Office sequence, both Vanilla DeepTIO and DeepTIO give inaccurate scale, possibly due to a large variation of walking speeds. This scaling problem is very common in VO or VIO (as seen in ROVIO test in Corridor 1) and remains an open problem. Overall, DeepTIO yields the best ATE against the competing approaches, with an average ATE of 1.67 m.
|ROVIO (VIO)||VINet (VIO)||Vanilla DeepTIO (ours)||DeepTIO (ours)|
|Sequence||t (m)||r (°)||ATE (m)||t (m)||r (°)||ATE (m)||t (m)||r (°)||ATE (m)||t (m)||r (°)||ATE (m)|
|Large Office*||failed to initialize||0.1202||3.0351||4.4359||0.1325||2.9469||3.3088||0.1348||3.1266||3.2648|
|Library 1*||failed to initialize||0.2041||3.9357||5.2647||0.2046||3.3424||2.5698||0.2027||3.2498||2.0532|
|Library 2*||failed to initialize||0.1290||3.1720||1.6812||0.1307||3.1546||1.5741||0.1286||3.0403||0.5735|
|*In these sequences, there is time misalignment among sensor modalities.|
V-D2 Test in Smoke-filled Environment
In the smoke-filled environment, none of the VIO frameworks can work, as the camera only captures black frames. Even Lidar odometry does not work well, as near-visible light is blocked by the smoke [bijelic2019seeing]. In this case, we cannot provide a quantitative evaluation with any (pseudo) ground truth. We instead provide a qualitative comparison with a zero-velocity-aided Inertial Navigation System (INS) [wahlstrom2019zero], which is not impacted by visibility. This navigation system uses foot-mounted inertial sensors to detect Zero Velocity Updates (ZUPT) and thereby mitigates the fast error growth of stand-alone inertial navigation. Fig. 6 (f) shows the result of the test. DeepTIO yields a trajectory shape similar to that of ZUPT. This shows that our model, despite being trained in a benign environment, can generalize to a smoke-filled environment, as the thermal camera is not affected by the smoke. However, there is a scaling issue, which is probably due to a different camera speed (as an effect of a different walking speed) or a different temperature profile compared to the one observed in the training data. If we adjust the scale of DeepTIO, the prediction is very close to ZUPT. This shows that our model is promising for odometry estimation in smoke-filled environments.
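The text does not say how the scale of the DeepTIO trajectory was adjusted. One simple possibility, shown here purely as an assumption, is to rescale the predicted trajectory about its starting point so that its total path length matches that of the reference (here, the ZUPT-aided trajectory). The function names and the toy 2-D trajectories are hypothetical.

```python
import math

def path_length(points):
    """Sum of straight-line distances between successive positions."""
    return sum(math.dist(a, b) for a, b in zip(points[:-1], points[1:]))

def rescale_to_reference(estimate, reference):
    """Scale `estimate` about its first point to match the reference path length."""
    scale = path_length(reference) / path_length(estimate)
    origin = estimate[0]
    scaled = [tuple(o + scale * (p - o) for p, o in zip(point, origin))
              for point in estimate]
    return scaled, scale

estimate = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]     # right shape, half the size
reference = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
scaled, scale = rescale_to_reference(estimate, reference)
```

A single global factor leaves the trajectory shape untouched, so any remaining gap after rescaling reflects heading or drift error rather than scale.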
V-D3 Memory and Execution Time
The network was trained on an NVIDIA TITAN V GPU and required around 18 hours to train the hallucination network and 20 hours to train the remaining networks. The network contains around 136 million weights, requiring 847 MB of space. Neglecting the time to load and normalize the input, the model can run at 40 fps on a TITAN V and 5 fps on a standard CPU.
V-E Challenges and Limitations
Despite the fact that DeepTIO can work well in our test scenarios, there are some limitations:
Sensitivity to sampling rate. As we trained DeepTIO with a frame rate of 4-5 fps, the network only performs well at that frame rate. When inferring at lower or higher frame rates, the accuracy degrades, as seen in Fig. 7. Training with multiple frame rates at the same time might make it possible to obtain robustness against different sampling rates. This may also alleviate the scaling problem, as the network would be trained with more variations of parallax. However, this might require a (pseudo) ground truth with a constant frame rate, i.e. one that is not irregularly sampled by a key-frame selection process as in VINS-Mono.
Robustness against distributional shift. DNNs are usually vulnerable to distributional (covariate) shift which occurs when the test data are sampled from a different distribution than the training data. As we train our model in a benign environment, when we test it in smoke-filled environment, it is expected that we will experience some covariate shift as the temperature profile will be different. Development of an odometry approach robust against this distributional shift might be necessary to enable practical odometry in adverse environments.
We presented a novel DNN-based method for thermal-inertial odometry (DeepTIO) using hallucination networks. We demonstrated that the hallucination network can provide side information for the thermal network to produce accurate odometry estimation. Future work includes incorporating other sensor modalities for more accurate scale estimation in diverse scenarios and developing robust techniques to cope with distributional (covariate) shifts in the test data.
This work is funded by the US National Institute of Standards and Technology (NIST) grant “Pervasive, Accurate, and Reliable Location-Based Services for Emergency Responders” (No. 70NANB17H185). M. R. U. Saputra was supported by Indonesia Endowment Fund for Education (LPDP).