Online Adaptation through Meta-Learning for Stereo Depth Estimation


Zhenyu Zhang,  Stéphane Lathuilière, Andrea Pilzer,  Nicu Sebe,  Elisa Ricci and Jian Yang
PCALab, Nanjing University of Science and Technology, China
DISI, University of Trento, via Sommarive 14, Povo (TN), Italy
Huawei Technologies Ireland, Dublin, Ireland
Technologies of Vision, Fondazione Bruno Kessler, via Sommarive 18, Povo (TN), Italy
{stephane.lathuiliere, andrea.pilzer, niculae.sebe, e.ricci}

In this work, we tackle the problem of online adaptation for stereo depth estimation, which consists of continuously adapting a deep network to a target video recorded in an environment different from that of the source training set. To address this problem, we propose a novel Online Meta-Learning model with Adaptation (OMLA). Our proposal is based on two main contributions. First, to reduce the domain shift between the source and target feature distributions, we introduce an online feature alignment procedure derived from Batch Normalization. Second, we devise a meta-learning approach that exploits feature alignment for faster convergence in an online learning setting. Additionally, we propose a meta-pre-training algorithm to obtain initial network weights on the source dataset that facilitate adaptation to future data streams. Experimentally, we show that both OMLA and meta-pre-training help the model adapt faster to a new environment. Our proposal is evaluated on the well-established KITTI dataset, where we show that our online method is competitive with state-of-the-art algorithms trained in a batch setting.

1 Introduction

Deep neural networks have brought remarkable progress in visual scene understanding in the last few years, enabling impressive results in tasks such as object recognition [13, 17], semantic segmentation [47], depth estimation [6] and many more. These advances can be ascribed not only to the availability of large-scale datasets and powerful computational resources, but also to the design of specialized deep architectures.

Depth estimation is one of the fundamental tasks in visual scene understanding and, over the years, has attracted considerable attention in the computer vision and robotics research communities. Earlier deep learning-based approaches for depth estimation considered a supervised setting: a deep regression model was trained to estimate a dense depth map from RGB images. This approach was exploited in many works [6, 18, 8, 21, 45, 46, 49], where it was shown that accurate depth maps can be recovered given enough training data. However, in the context of depth estimation, collecting such data is an expensive and time-consuming task. For instance, in an autonomous driving setting it requires a car equipped with a camera system and a LIDAR, driven for many hours in different environmental conditions. More recently, to avoid the costly procedure of collecting densely and accurately annotated datasets, researchers have proposed self-supervised, also known as unsupervised, depth estimation approaches. In the unsupervised setting, a deep network is asked to regress the dense correspondence map (i.e. the disparity) between the two views of a stereo image pair. Interestingly, recent works [9, 11, 48, 32] showed performance comparable to supervised methods on the common benchmark datasets (e.g. KITTI [10, 27]).

Figure 1: We propose an Online Meta-Learning model with Adaptation (OMLA) for depth estimation. Our approach combines feature distribution alignment and meta-learning optimization for the purpose of predicting depth maps on a target video. Additionally, OMLA is employed when pretraining on the source data in order to obtain a network whose parameters enable fast online adaptation.

One main limitation of current approaches is that they are designed and tested under a closed-world assumption, meaning that training and testing data are derived from a common dataset and there is little difference in terms of visual appearance between the video sequences used for learning the model and the data considered for testing. In this paper we argue that, to be deployed in real applications (e.g. autonomous driving, robotics), deep architectures for depth estimation should consider an open-world setting, with visual data continuously gathered in changing environmental conditions. For example, in an autonomous driving scenario we would need a model that continuously adapts to changing environments (e.g. urban, suburban, highway) and lighting conditions (e.g. night, dawn, day, tunnel). In other words, we require a deep architecture with online adaptation abilities.

Driven by this motivation, in this paper we propose a meta-learning approach for stereo depth estimation designed for fast online adaptation. Our proposal is illustrated in Fig. 1. First, we introduce an Online Meta-Learning with Adaptation (OMLA) algorithm which combines feature distribution alignment and meta-learning (upper half of Fig. 1). Specifically, in order to handle the domain shift between the source training data and the target video, we first align the feature distributions of the two domains using statistics gathered along the video sequence, considering alignment layers derived from Batch Normalization similarly to [23, 25]. Feature alignment is then combined with a meta-learning algorithm. The motivation for this choice is that previous network parameter updates can be exploited to learn how to update better on future frames. Since the frames of a video depict a similar environment, our meta-learner can learn how to optimize the network specifically for this environment.

In addition, we also propose to employ our OMLA algorithm when pre-training the model on the source dataset (bottom half of Fig. 1), with the intent of obtaining a network parameter initialization that leads to accurate disparities after only a few frames of adaptation. Rather than pre-training over stereo image pairs, we propose to explicitly use OMLA on video sequences to define our training loss. Specifically, our meta-learning loss favors network parameters that lead to good depth predictions when using OMLA on every video sequence of the source training set.

To summarize, our contributions are the following: (i) We propose a novel approach for online adaptation in the context of depth prediction. Our method combines meta-learning and feature alignment to allow fast adaptation on video sequences recorded in new environments. (ii) We introduce a meta-pre-training approach that explicitly uses OMLA in order to provide a good parameter initialization for online adaptation. (iii) From an experimental perspective, we perform an extensive evaluation on the well-known KITTI [27] benchmark. We show that both OMLA and meta-pre-training improve depth prediction performance and that our method is even competitive with previous algorithms trained in a batch offline setting.

2 Related Work

Depth Estimation. Depth estimation, among other scene understanding tasks, has attracted a lot of attention in the last years with the development of deep Convolutional Neural Networks (ConvNets). Deep models are usually trained in a supervised setting [6, 18, 8, 21, 45, 46, 49] by minimizing the discrepancy between the predicted and the ground-truth depth maps. Eigen et al. [6] showed that a multi-scale approach leads to better performance, and Laina et al. [18] outlined the benefit of using a very deep architecture. Other works proposed to enforce structure in the predicted depth maps using graphical models such as CRFs [21, 45, 43]. In order to train and evaluate these deep architectures, several datasets have been recorded, such as NYU [29] for indoor scenes, or KITTI [10, 27] and Make3D [37] for outdoor scenes. Synthetic datasets, such as Synthia [34], have also been considered as an alternative in order to avoid the time-consuming ground-truth recording process. However, the resulting models generally suffer from the domain shift between the synthetic and the real environments.

To avoid the need for annotated data, self-supervised depth estimation methods [9, 11, 48, 32] have been recently developed. For instance, Godard et al. [11] used appearance-matching and left-right consistency losses to improve prediction quality. Other works proposed to enhance estimation accuracy through ego-motion estimation [48], adversarial learning [32, 28] or visual odometry [46]. Interestingly, Kundu et al. [28] tackled the problem of domain adaptation from synthetic to real-world data, proposing an adversarial approach for depth estimation. Recently, Tonioni et al. [39] employed a self-supervised formulation to enable fast parameter updates in an online setting for estimating depth maps. We follow this research direction and consider the problem of updating the prediction model online but, in contrast to [39], we explicitly tackle the distribution misalignment problem and devise a novel strategy to obtain faster adaptation.

Figure 2: Proposed Online Meta-Learning Algorithm (OMLA). At time t, the feature statistics are updated within the BN layers for feature distribution alignment (Sec. 3.1). Then, the model weights are updated according to our meta-learning optimizer (Sec. 3.2).

Domain Adaptation. Over the years, several works have considered the problem of domain adaptation within computer vision applications [4], proposing both shallow models and deep neural networks. Focusing on recent deep learning-based models, the different methods can be roughly grouped into three categories, according to the strategies used to reduce the discrepancy between the source and target feature distributions. The first category includes approaches which reduce the domain shift by designing appropriate loss functions, such as the Maximum Mean Discrepancy [22, 41] or the domain confusion loss [40]. A second group of methods considers approaches based on Generative Adversarial Networks (GANs) [2, 38, 36], whose main idea is to directly transform images from the target domain to the source domain. The last category includes approaches which deal with the domain-shift problem by embedding specifically designed domain alignment layers into the deep architecture [20, 3, 24]. The idea is to consider domain-specific Batch Normalization (BN) layers in order to align the source and target feature distributions to a common reference distribution. While most previous works on domain adaptation focused on a classification setting, recent works have considered structured prediction problems, such as semantic segmentation and depth prediction [5, 35].

Domain adaptation has also been studied in the online learning setting, where data are available sequentially and the target domain distribution changes continuously [14, 44, 23]. Extending the domain-alignment layers of [20, 24], online adaptation can be performed by incrementally updating the feature statistics in the BN layers [23]. In this work, we consider domain adaptation in a pixel-level prediction problem, i.e. depth estimation, and propose a richer formulation for online adaptation that combines feature distribution alignment through domain-specific layers with loss minimization. Furthermore, we introduce a meta-learning approach that improves the adaptation ability of the model trained on the source dataset.

Meta-Learning. Meta-learning addresses the problem of learning how to learn. In [42, 33, 7, 19], meta-learning has been employed to obtain fast generalization on novel domains or categories. In [33, 42], the problem has been explicitly formulated as few-shot learning. For deep networks, meta-learning can improve the convergence of gradient descent [1] by using a trainable optimizer to train the network. For transfer learning applications, a policy network has been proposed in [12] to decide which layers should be fine-tuned. Park et al. [30] employed an offline meta-learning method to adjust the initial deep networks used in online tracking. Following this line of research, we use meta-learning to obtain a source model that can quickly adapt to a particular stereo video sequence. In contrast to [30], feature distribution alignment is explicitly modeled in our source training algorithm. To the best of our knowledge, ours is the first approach that introduces meta-learning in the context of online depth estimation.

3 Meta-learning for Self-adaptive Depth Estimation

In this section, we detail the proposed meta-learning approach for online adaptation. Formally, we assume to have a source domain composed of stereo video sequences recorded with the same calibrated stereo camera. In a first stage, we employ this source dataset to train a neural network to predict the disparity maps between image pairs recorded with this stereo setting. Training is performed by minimizing a loss over the source data, yielding an initial set of network parameters. In a second stage, we consider that a target video sequence is recorded using a different calibrated stereo camera in a different environment, and that the video frame pairs arrive sequentially, one at each time step t. The goal is to adapt the network parameters in order to predict more accurate disparity maps between image pairs recorded in this new environment. Note that, in contrast to deep domain adaptation in a batch setting [28], in this work we assume that the source dataset is no longer available during this second stage.

A naive approach for online adaptation could consist in computing the training loss on the current frame and updating the whole network by gradient descent. This procedure could then be applied to each video frame. Despite its simplicity, this strategy has several drawbacks: it is very sensitive to domain shift, it accounts only for the current frame, and it may introduce negative bias in the learning procedure. To cope with these issues, we propose to adapt our network by combining two complementary approaches: feature distribution alignment and meta-learning (see Fig. 2). Specifically, to neutralize domain shift, we align the feature distributions using statistics gathered in the batch normalization layers and combined over time, as detailed in Section 3.1. In contrast to back-propagation-based loss minimization, feature distribution alignment is performed during the forward pass, allowing adaptation of the first layers at a limited computational cost. Additionally, we optimize our model with a fine-tuning strategy, and propose to guide the fine-tuning with a meta-learning optimizer (Meta-Optimizer in Fig. 2) for faster convergence, as motivated and detailed in Section 3.2. We argue that fine-tuning and feature distribution alignment are complementary: feature alignment can cope with low-level feature domain shifts, whereas fine-tuning can handle higher-level representation shifts. In addition, we propose a meta-pre-training formulation to obtain initial parameters that can be adapted faster to a particular sequence. Our meta-pre-training strategy is explained in Section 3.3. Finally, the whole model is trained using unsupervised depth estimation losses, as detailed in Section 3.4.

3.1 Domain Adaptation via Online Feature Distribution Alignment (OFDA)

We consider a deep network embedding BN layers. We follow the idea of previous works [25, 20, 23] and perform domain adaptation by updating the BN statistics with the incoming frames of the target video. The main idea behind this strategy is that the domain shift is reduced by aligning the target feature distribution to a Gaussian reference distribution [25, 20, 23]. For the sake of notation, we consider here a single BN layer, but this approach is applied independently to each BN layer of the network. First, when training on the source domain, we collect the BN statistics as in [20]. Second, we perform adaptation on the target video using the following procedure. At time t = 0, we initialize the statistics with the source statistics. At time t, we have the BN statistics $(\mu_{t-1}, \sigma^2_{t-1})$ computed at time t-1. Given the n samples of a feature available in the current frame, we compute the partial BN statistics:

$\hat{\mu}_t = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_t = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_t)^2.$
Given a dynamic blending parameter $m_t \in [0,1]$, the global statistics are computed as follows:

$\mu_t = (1 - m_t)\,\mu_{t-1} + m_t\,\hat{\mu}_t, \qquad \sigma^2_t = (1 - m_t)\,\sigma^2_{t-1} + m_t\,\hat{\sigma}^2_t.$
For a given input $x$, the output of the normalization layer is then given by:

$y = \gamma\,\frac{x - \mu_t}{\sqrt{\sigma^2_t + \epsilon}} + \beta,$

where $\gamma$ and $\beta$ are the usual affine transformation parameters of the BN layer, while $\epsilon$ is a small constant introduced for numerical stability.
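The update rules above can be sketched in a few lines. The following NumPy snippet (function and variable names are ours, not from the paper's code) blends the running BN statistics with the partial statistics of the current frame and then normalizes the features:

```python
import numpy as np

def update_bn_stats(mu_prev, var_prev, x, m):
    """One online BN statistics update (Sec. 3.1): blend the running
    statistics from time t-1 with the partial statistics of the
    current frame's features.

    mu_prev, var_prev : running mean / variance at time t-1, shape (c,)
    x                 : feature samples of the current frame, shape (n, c)
    m                 : dynamic blending weight in [0, 1]
    """
    mu_hat = x.mean(axis=0)            # partial statistics on frame t
    var_hat = x.var(axis=0)
    mu = (1.0 - m) * mu_prev + m * mu_hat
    var = (1.0 - m) * var_prev + m * var_hat
    return mu, var

def bn_forward(x, mu, var, gamma, beta, eps=1e-5):
    """Normalize with the blended statistics, then apply the affine map."""
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Because the statistics are updated in the forward pass, this alignment step requires no back-propagation, which is what keeps its computational cost low.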

3.2 Online Meta-Learning with Adaptation (OMLA)

We now introduce our online meta-learning approach with adaptation. We assume to have a target stereo video sequence of length T. When performing online training on this sequence, we use the recursive procedure given in Alg. 1.

1: procedure OMLA(θ₀, s₀, α₀, β; video V)
2:     for t = 0..T do
3:         dₜˡ, dₜʳ, sₜ₊₁ ← f(Iₜˡ, Iₜʳ; θₜ, sₜ)    ▷ forward pass with BN statistics update (Sec. 3.1)
4:         Lₜ ← L(dₜˡ, dₜʳ, Iₜˡ, Iₜʳ)              ▷ loss on the current frame (Sec. 3.4)
5:         if t > 0 then
6:             αₜ ← αₜ₋₁ − β ∇α Lₜ                  ▷ meta-update of the per-parameter learning rates
7:         θₜ₊₁ ← θₜ − αₜ ⊙ ∇θ Lₜ                   ▷ network parameter update
8:     return θₜ₊₁, sₜ₊₁
Algorithm 1 Online Meta-Learning with Adaptation

For initialization, we assume we are given network parameters θ₀, BN statistics s₀, initial learning rates α₀ for the network parameters and a meta-learning rate β. Note that we use a specific learning rate for each network parameter; therefore, the learning rates have the same dimension as the network parameters. At time t, given the current parameters θₜ, the left and right disparity maps of the stereo pair are predicted by a forward pass of the network. Note that, in the forward pass, we perform feature distribution alignment using statistics gathered in the BN layers, as described in Sec. 3.1. The BN statistics are stored for the next iteration. The quality of the predicted disparities is assessed via the loss function detailed in Sec. 3.4. Then, we update the per-parameter learning rates by performing one gradient descent step that minimizes the current loss with respect to the learning rates used at the previous OMLA iteration. The motivation is to obtain better learning rates for the next network parameter updates. This gradient descent step can be computed using any gradient-based optimizer; in all of our experiments, we employ the Adam optimizer [16]. Finally, we update the network parameters by applying a gradient descent step with the meta-learned learning rates. The procedure returns the final network parameters and BN statistics. In the case of online learning on the target video, these returned values are not further used. Nevertheless, they are used when OMLA is employed within our meta-pre-training procedure described in the next section.
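The learning-rate meta-update can be illustrated on a toy problem. The sketch below is our own simplification of the OMLA loop without the BN part (all names are ours); the per-parameter learning rate is updated from the interaction between consecutive gradients before each parameter step, akin to hypergradient descent:

```python
import numpy as np

def omla_toy(theta, alpha, beta, frames, loss_grad):
    """Toy sketch of the OMLA update loop (Alg. 1) without feature
    alignment. Each 'frame' defines a loss; alpha holds one learning
    rate per parameter and is itself updated by gradient descent on
    the current loss."""
    grad_prev = None
    for frame in frames:
        g = loss_grad(theta, frame)            # gradient of L_t w.r.t. theta_t
        if grad_prev is not None:
            # Through theta_t = theta_{t-1} - alpha * grad_prev, the
            # chain rule gives dL_t/dalpha = -grad_prev * g.
            alpha = alpha - beta * (-grad_prev * g)
            alpha = np.clip(alpha, 0.0, None)  # keep the rates non-negative
        theta = theta - alpha * g              # per-parameter update
        grad_prev = g
    return theta, alpha
```

When consecutive gradients agree (as on the frames of a video depicting one environment), the rates grow and convergence speeds up; when they disagree, the rates shrink.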

3.3 Meta-pre-training for Fast Adaptation

In this section, we detail our meta-learning framework for pre-training the network on the source dataset. The motivation behind our meta-learning formulation is to obtain network parameters that lead to accurate disparity predictions on image pairs of the source dataset, but that can also be adapted to a specific sequence within a few frames only. Rather than using a training procedure that minimizes a reconstruction loss over stereo pairs, we propose to explicitly use OMLA to define our training loss. Consequently, our loss enforces that the network parameters lead to accurate depth predictions after running OMLA on every video sequence of the source training set. Formally, we assume to have a source domain composed of video sequences recorded with the same calibrated stereo camera. For the sake of notation, we consider video sequences of equal length, but the approach applies to sequences of arbitrary length. We seek network parameters that lead to a low loss value after N steps of OMLA. Importantly, these parameters must lead to fast adaptation given that feature distribution alignment is employed when learning online. In addition, we propose to use the meta-pre-training procedure to learn OMLA hyper-parameters such as the per-parameter learning rates. Interestingly, the meta-learned learning rates can be interpreted as hyper-parameters indicating which network parameters should be fine-tuned and which should not be updated. Our meta-training procedure consists in repeating the training step given in Alg. 2 until convergence.

1: procedure STEP({V₁,…,V_K}, θ, s, α, η_θ, η_α)
2:     g_θ ← 0, g_α ← 0
3:     for k = 1..K do                                  ▷ for each video of the meta-batch
4:         θₖ, sₖ ← OMLA(θ, s, α, β; first N frames of Vₖ)   ▷ adaptation to video k
5:         for t = N+1..T do                            ▷ evaluation of θₖ on future frames
6:             Lₜ ← loss of the adapted model on frame t of Vₖ
7:             g_θ ← g_θ + ∇θ Lₜ                        ▷ gradient w.r.t. the initial parameters
8:             g_α ← g_α + ∇α Lₜ
9:     θ ← θ − η_θ g_θ, α ← α − η_α g_α                 ▷ updates
10:    return θ, α
Algorithm 2 Meta-Training Step for Adaptation

The procedure takes as input a subset of the source dataset composed of K videos. These videos form a meta-batch containing K different cases in which the network is adapted to a particular video. We also provide the current network parameters together with the BN statistics, as well as three learning rates: the current meta-learned per-parameter learning rates, and the two fixed learning rates used to update the network parameters and the meta-learned learning rates, respectively. For each video of the meta-batch, our algorithm is divided into two steps. First, we employ OMLA on the first N frame pairs to adapt the network specifically to the video. For the k-th video, we obtain adapted network parameters and BN statistics. The second step consists in evaluating the parameters obtained with OMLA on the remaining frames of the video. We compute the loss function and its gradient with respect to the original parameters used to initialize OMLA. The motivation for computing this gradient is that we aim at obtaining parameters that lead to fast adaptation and, therefore, to low loss values on the frames that follow adaptation. The gradients are summed over all the future frames of all the videos of the meta-batch. The same procedure is applied to the meta-learned learning rates. Finally, we perform two gradient descent steps using the computed gradients and an ad hoc optimizer, one for the network parameters and one for the learning rates.
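To make the structure of this step concrete, here is a first-order toy sketch (a FOMAML-style simplification of Alg. 2, not the authors' second-order implementation; all names are ours): adapt on the first frames of each video, then accumulate the evaluation gradient of the remaining frames onto the initial parameters:

```python
import numpy as np

def meta_training_step(theta, alpha, videos, loss_grad, eta_theta, n_adapt):
    """First-order sketch of one meta-training step: for each video,
    adapt on the first n_adapt frames with fixed per-parameter rates
    alpha, then evaluate on the remaining frames and push that gradient
    back onto the initial parameters theta."""
    meta_grad = np.zeros_like(theta)
    for video in videos:
        th = theta.copy()
        for frame in video[:n_adapt]:            # inner OMLA-style adaptation
            th = th - alpha * loss_grad(th, frame)
        for frame in video[n_adapt:]:            # evaluation on future frames
            meta_grad += loss_grad(th, frame)    # first-order approximation
    return theta - eta_theta * meta_grad / len(videos)
```

The full method additionally differentiates through the inner adaptation steps (and through the learning rates), which this first-order sketch deliberately omits.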

Online Evaluation Scores (whole sequence)    Evaluation Scores on Last 20% Frames
Method    Pre-training    RMSE    Abs Rel    Sq Rel    RMSE log    RMSE    Abs Rel    Sq Rel    RMSE log
Online Naive w/o 12.2012 0.4357 5.5672 1.3598 12.2874 0.4452 5.5213 1.3426
Online Naive Standard 9.0518 0.2499 3.2901 0.9503 9.0309 0.2512 3.3104 0.9495
Online Meta-learning Standard 8.7553 0.2367 3.0028 0.9412 8.7032 0.2285 2.9842 0.9403
OFDA Standard 4.7280 0.1885 1.3012 0.2331 4.6134 0.1800 1.2957 0.2297
OMLA Standard 4.5126 0.1623 1.2892 0.2287 4.4783 0.1503 1.2033 0.2198
Online Naive Meta 8.8230 0.2305 3.0578 0.9324 8.7061 0.2273 2.9804 0.9065
Online Meta-learning Meta 8.5572 0.2301 2.9576 0.9054 8.4325 0.2278 2.8503 0.8921
OFDA Meta 4.1279 0.1236 0.9027 0.1989 4.0731 0.1176 0.8845 0.1921
OMLA Meta 3.9025 0.1189 0.8256 0.1952 3.7203 0.1058 0.8176 0.1835
Table 1: Ablation study on KITTI Eigen test split of the proposed unsupervised online stereo method. At the top we show fine-tuning without pre-training, in the middle part fine-tuning after standard batch pre-training and in the bottom part fine-tuning after meta-pre-training on Synthia dataset. Depth predictions are capped at 50 meters.

3.4 Depth Estimation Loss

In this section, we describe the loss used both in OMLA and in our meta-pre-training algorithm (see Algs. 1 and 2). Following [11], the model takes as input the left and right images and outputs the corresponding left and right disparities. We employ a warping operation in order to reconstruct the left image from the right image according to:

$\tilde{I}^l(i,j) = I^r\big(i,\; j - d^l(i,j)\big),$

where sampling at non-integer locations is performed with differentiable bilinear interpolation as in [11].
Symmetrically, we obtain a reconstructed right image from the left image. The loss is a combination of an L1 reconstruction term and a structural similarity (SSIM) term, as proposed in [11], weighted by a parameter $\lambda$:

$\mathcal{L} = \lambda\,\frac{1 - \mathrm{SSIM}(I^l, \tilde{I}^l)}{2} + (1-\lambda)\,\big\|I^l - \tilde{I}^l\big\|_1,$

plus the symmetric term for the right image.
Using such a reconstruction loss, we can perform depth estimation in a fully unsupervised way, and perform adaptation in an online mode without ground truth.
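A minimal sketch of this loss follows (ours, with nearest-neighbour instead of bilinear warping, and a single global SSIM window rather than the local one of [11]):

```python
import numpy as np

def warp_left_from_right(right, disp_left):
    """Reconstruct the left view by sampling the right view at columns
    shifted by the left disparity (nearest-neighbour for brevity)."""
    h, w = right.shape
    cols = np.clip(np.arange(w)[None, :] - np.rint(disp_left).astype(int), 0, w - 1)
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    return right[rows, cols]

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM, a simplification of the local SSIM of [11]."""
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))

def photometric_loss(left, right, disp_left, lam=0.85):
    """Weighted combination of the SSIM and L1 reconstruction terms."""
    recon = warp_left_from_right(right, disp_left)
    l1 = np.abs(left - recon).mean()
    return lam * (1.0 - ssim_global(left, recon)) / 2.0 + (1.0 - lam) * l1
```

A correct disparity map drives the loss to zero, since the warped right image then matches the left image exactly; the symmetric right-image term is analogous.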

4 Experiments

Pretraining on Synthia [34]    Pretraining on SceneFlow [26]
Method    RMSE    Abs Rel    Sq Rel    RMSE log    RMSE    Abs Rel    Sq Rel    RMSE log    FPS
DispNet [26] Naive 9.0222 0.2710 4.3281 0.9452 9.1587 0.2805 4.3590 0.9528 5.42
DispNet ours 4.5201 0.2396 1.3104 0.2503 4.6314 0.2457 1.3541 0.2516 4.00
MADNet [39] Naive 8.8650 0.2684 3.1503 0.8233 8.9823 0.2790 3.3021 0.8350 12.05
MADNet ours 4.0236 0.1756 1.1825 0.2501 4.2179 0.1883 1.2761 0.2523 9.56

Godard et al. (ResNet) [11] Naive 9.0518 0.2499 3.2901 0.8577 9.0893 0.2602 3.3896 0.8901 5.06
Godard et al. (ResNet) ours 3.9025 0.1189 0.8256 0.1952 4.0573 0.1231 1.1532 0.1985 3.40

Table 2: Analysis of the performance of our method on common stereo architectures, DispNet [26], MADNet [39] and Godard (ResNet) [11], and different pretraining datasets, Synthia [34] or SceneFlow [26]. Depth predictions are capped at 50 meters.
Method    RMSE    Abs Rel    Sq Rel    RMSE log    δ<1.25    δ<1.25²    δ<1.25³
Godard et al.  [11] Offline 3.6975 0.0983 1.1720 0.1923 0.9166 0.9580 0.9778
Godard et al.  [11] Offline + Online 3.7059 0.0980 1.1712 0.1956 0.9203 0.9612 0.9776
Godard et al.  + OMLA 3.9025 0.1189 0.8256 0.1952 0.9110 0.9505 0.9776
MADNet [39] Offline 3.8965 0.1793 1.2369 0.2457 0.9147 0.9601 0.9790
MADNet [39] Offline + Online 3.9023 0.1760 1.1902 0.2469 0.9233 0.9652 0.9813
MADNet + OMLA 4.0236 0.1756 1.1825 0.2501 0.9022 0.9453 0.9586
DispNet [26] Offline 4.5210 0.2433 1.2801 0.2490 0.9126 0.9472 0.9730
DispNet [26] Offline + Online 4.5327 0.2368 1.2853 0.2506 0.9178 0.9600 0.9725
DispNet + OMLA 4.5201 0.2396 1.3104 0.2503 0.9085 0.9460 0.9613

Table 3: Comparison with different offline methods. Only points with depth below 50 meters are evaluated.

4.1 Evaluation

Evaluation for online learning. Following an online learning protocol, the frames are fed to the network sequentially. Each frame leads to a predicted depth map and a model parameter update. Importantly, we evaluate the estimated depth map obtained at each time step before applying gradient descent. After processing the whole sequence, we compute the average scores over the sequence. In order to further evaluate the adaptation ability of the different models, we also report the average scores over the last 20% of frames of each video. The motivation behind these scores is that they measure the final prediction quality after convergence, whereas the average scores over the whole sequence better reflect convergence speed.
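This protocol can be summarized as a predict-then-update loop; the sketch below (all names ours) scores each frame with the current model before the update it triggers, then reports both averages:

```python
def online_evaluate(frames, score_fn, update_fn, last_frac=0.2):
    """Predict-then-update evaluation loop: each frame is scored with
    the current model *before* the parameter update it triggers; we
    report the average over the whole sequence and over the last
    `last_frac` of the frames."""
    scores = []
    for frame in frames:
        scores.append(score_fn(frame))  # evaluate first ...
        update_fn(frame)                # ... then adapt on the same frame
    k = max(1, int(round(len(scores) * last_frac)))
    return sum(scores) / len(scores), sum(scores[-k:]) / k
```

Scoring before updating guarantees that each frame is evaluated with a model that has never seen it, which keeps the online evaluation honest.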

Evaluation metrics. The quantitative evaluation is performed according to several standard metrics used in previous works [6, 11, 43]. Let N be the total number of pixels in the test set and $d_i$, $d_i^*$ the estimated and ground-truth depth values for pixel i. We compute the following metrics:

  • Mean relative error (abs rel): $\frac{1}{N}\sum_{i=1}^{N} \frac{|d_i - d_i^*|}{d_i^*}$,

  • Squared relative error (sq rel): $\frac{1}{N}\sum_{i=1}^{N} \frac{(d_i - d_i^*)^2}{d_i^*}$,

  • Root mean squared error (rmse): $\sqrt{\frac{1}{N}\sum_{i=1}^{N} (d_i - d_i^*)^2}$,

  • Root mean squared log error (rmse log): $\sqrt{\frac{1}{N}\sum_{i=1}^{N} (\log d_i - \log d_i^*)^2}$,

  • Accuracy with threshold $\tau$, i.e. the percentage of $d_i$ such that $\delta = \max(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}) < \tau$. We employ $\tau = 1.25, 1.25^2, 1.25^3$ following [6].
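These metrics can be computed directly; here is a NumPy sketch (ours; the depth cap and validity masking used in the experiments are omitted):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-evaluation metrics of [6]: abs rel, sq rel, rmse,
    rmse log, and the three threshold accuracies."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, acc
```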

Datasets. Evaluation of adaptation methods requires two different datasets: a source and a target dataset. As source datasets, we select synthetic datasets which contain videos of driving environments. To evaluate online adaptation performance, we select a real-world urban dataset as the target dataset. In detail, we use the following benchmarks:

Synthia dataset: Synthia [34] is a synthetic dataset of urban driving scenes. It contains stereo image pairs for four views: frontal, rear, left and right. There are five video sequences for each of the four seasons: spring, summer, fall and winter. We select 4k frontal-view image pairs from the spring recordings and use them as the source dataset for our meta-pretraining procedure.

Scene Flow Driving: Scene Flow Driving [26] is a synthetic dataset with one driving video, recorded under different camera, speed and direction settings. We select all 2k stereo image pairs of the forward setting for meta-pretraining.

KITTI: As target domain, we employ the KITTI [27] dataset, recorded from a driving vehicle in real-world conditions. We employ the training and test split of Eigen et al. [6], which is composed of 32 different scenes for training and 28 different driving scenes for testing. Note that, for online evaluation, we use all stereo images in the testing sequences.

Implementation Details. We implement the proposed method using PyTorch [31] on a single Nvidia P40 GPU. All of the networks we use contain batch normalization layers [15], on which we perform the proposed feature distribution alignment. For pretraining on each synthetic dataset, we first perform unsupervised learning for a total of 200 epochs in an offline batch setting, with the initial learning rate halved after the first 100 epochs. Then we perform meta-pretraining following Alg. 2 for 10 epochs, with the meta-batch size and the numbers of adaptation and future frames fixed across experiments. For online learning, we perform adaptation following Alg. 1 with a fixed meta-learning rate. All of the networks are trained from scratch with the Adam optimizer [16].

4.2 Results and Analysis

In this section, we evaluate our proposed approach, experimentally validate the benefit of each component, and compare its performance with state-of-the-art methods.

Analysis on the Proposed Method. To validate the contribution of each component of our method, we adopt the framework proposed in [11] and used in several recent works [32, 48, 46]. For fair comparison, all the online learning procedures are applied and evaluated on the videos of Eigen's test split [6] and, unless otherwise specified, all models are pretrained on the Synthia dataset.

As a first baseline for online learning, we consider the approach that consists in performing adaptation via gradient descent at every step using a fixed learning rate. This approach is referred to as Online Naive. Note that Online Naive is equivalent to our approach without feature distribution alignment and without the meta-learning updates of Alg. 1. We consider three different variants of our model. First, in Online Meta-learning, we employ the meta-learned gradient updates of Alg. 1 but do not use feature distribution alignment. Second, in OFDA, we use feature distribution alignment but perform gradient descent as in Online Naive. Finally, in OMLA, we use our full model.

Concerning the pretraining, we compare models pretrained either with standard batch pretraining (referred to as Standard) or with our meta-pretraining (referred to as Meta). For completeness, we also report the performance of a model without pretraining (w/o).

We report the evaluation scores obtained by the different methods in Table 1. First, we observe that directly performing naive online learning without pretraining (Online Naive, w/o) does not lead to good performance. The scores obtained on the last frames are not better than the average scores over the whole videos, showing that the model is not learning. Similarly, moving to the models with Standard pretraining on Synthia, Online Naive only provides a very limited gain. A first indication that online meta-learning is beneficial is found in Online Meta-learning, where we see a clear improvement on the last frames when applying meta-learning for online fine-tuning. Even better performance is obtained with OFDA, which performs feature distribution alignment. These results show that handling the domain shift between the source and target distributions truly improves the quality of the estimated depth maps. A further improvement is obtained with our full meta-learning method, OMLA, which reaches the best performance in the Standard pretraining setting.

Concerning the pretraining strategy, our meta-pretraining approach, denoted by Meta, consistently improves performance over Standard for every online setting adopted on the target video. Similarly to what is observed with standard pretraining, Online Naive does not perform well. Online Meta-learning with meta-pretraining obtains better results considering both the scores averaged over the complete sequences and over the last 20% of frames. This indicates that meta-pretraining helps the model not only to adapt faster, but also to perform better after observing many frames. Furthermore, OFDA again improves the performance; the gain is even larger on the last 20% of the frames. Finally, OMLA achieves the best performance by combining OFDA with meta-learning.

Analysis on Network Architectures and Datasets. In order to further evaluate our approach, in Table 2 we report the performance of our method considering three different architectures: DispNet [26], MADNet [39] and Godard et al. (ResNet encoder) [11]. DispNet and MADNet are two lightweight networks for stereo matching. We compare the performance obtained with these architectures when employing the baseline naive online learning approach (Naive) and our full model (ours). We report results when pretraining either on the Synthia or the SceneFlow Driving dataset.

From Table 2, we see that our proposed method obtains significantly better performance independently of the architecture. These results demonstrate that OMLA is effective even with smaller networks such as DispNet or MADNet. Concerning the SceneFlow Driving dataset, we observe that both models, Naive and ours, obtain slightly poorer performance than when pretraining on Synthia. A possible explanation is that the SceneFlow Driving dataset is smaller and less diverse than Synthia. Nonetheless, these experiments again confirm the good performance of our approach even on this small dataset.

Figure 3: Qualitative comparison of different baseline models of the proposed approach on three video sequences of the KITTI dataset Eigen test split. We report frames at the beginning, in the middle and at the end of each video.

Concerning running time, the reported frames per second (FPS) are reduced since OMLA requires additional gradient computations and parameter updates. Nevertheless, taking into account the performance gains, we argue that such a running-time increase is acceptable for most applications.

4.3 Comparison with offline methods

In this section we compare our online learning method with models trained in an offline setting. We consider the following baselines:

  • Offline, model pretrained using offline training as in [11] on the KITTI Eigen training split and tested on KITTI Eigen test split.

  • Offline+Online, model pretrained using standard offline training on KITTI Eigen training split and online learning on KITTI Eigen test split. In that case, we employed the naive online formulation previously described.

  • OMLA, model meta-pretrained offline on Synthia and using OMLA on KITTI Eigen test split.
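For reference, the naive online formulation used by the Offline+Online baseline amounts to one self-supervised gradient step per incoming frame, minimizing a photometric reconstruction loss between the left image and the right image warped by the predicted disparity. The toy sketch below illustrates this loop on 1-D scanlines with a single scalar disparity; the warping model, finite-difference gradient and learning rate are illustrative assumptions, not the actual network update.

```python
import numpy as np

def warp(right, d, period):
    """Reconstruct the left scanline by sampling the right one at x - d
    (circular linear interpolation, to avoid border artifacts)."""
    x = np.arange(len(right), dtype=float)
    return np.interp(x - d, x, right, period=period)

def photometric_loss(left, right, d, period):
    # L1 photometric reconstruction error, as in unsupervised stereo losses.
    return np.mean(np.abs(left - warp(right, d, period)))

def online_adapt(frames, d0=0.0, lr=8.0, eps=0.5):
    """Naive online fine-tuning: one gradient step per frame.
    The gradient is approximated by central finite differences, standing
    in for backpropagation through a real disparity network."""
    d = d0
    for left, right in frames:
        period = len(left)
        g = (photometric_loss(left, right, d + eps, period)
             - photometric_loss(left, right, d - eps, period)) / (2 * eps)
        d -= lr * g  # single SGD step on the current frame
    return d
```

Even in this toy setting, the estimate drifts from its (wrong) initial value toward the true disparity as frames stream in, mirroring the slow adaptation of the Naive baseline; OMLA's contribution is to make this per-frame update converge much faster.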

The results are reported in Table 3. First, when the models are trained in an Offline setting on the KITTI training set, naive online learning does not significantly improve the performance. Second, we observe that our online approach is competitive with the methods trained offline on the KITTI training set, even though our model never saw any real-world KITTI image. According to some metrics, our approach even outperforms the model trained offline. These observations clearly show the potential of our approach.

Finally, we report qualitative results in Fig. 3. We show the input frames, and the associated predictions, from the beginning, the middle and the end of the same video. On the first frame, we observe that the offline method already performs well, while the naive online model and our model obtain poor results: these two models were trained in totally different environments and have not yet observed enough frames to adapt. However, after several frames, our method learns and improves its predictions, while the naive model improves more slowly. Finally, after observing enough frames, our model produces satisfactory results, coming close to the offline model predictions and to the ground truth. These qualitative results demonstrate that our method adapts effectively to a new environment and progressively improves its estimations.

5 Conclusions

We addressed the problem of online domain adaptation in the context of depth estimation and presented an algorithm, OMLA, specifically designed for a sequential learning setting where fast convergence of network training and adaptation to evolving data streams are required. We evaluated the proposed framework on the challenging KITTI dataset, where we achieved state-of-the-art performance. As future work, we plan to combine our approach with the fast network update method of [39] and to extend our framework to the monocular setting.


  • [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
  • [2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with gans. In CVPR, 2017.
  • [3] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò. Autodial: Automatic domain alignment layers. In ICCV, 2017.
  • [4] G. Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
  • [5] D. Dai and L. Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In IEEE International Conference on Intelligent Transportation Systems, 2018.
  • [6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
  • [7] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [8] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
  • [9] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV. Springer, 2016.
  • [10] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
  • [11] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
  • [12] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. Feris. Spottune: Transfer learning through adaptive fine-tuning. In CVPR, 2019.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [14] J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In CVPR, 2014.
  • [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [18] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.
  • [19] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [20] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.
  • [21] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. TPAMI, 2016.
  • [22] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. ICML, 2017.
  • [23] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, and B. Caputo. Kitting in the wild through online domain adaptation. In IROS, 2018.
  • [24] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Boosting domain adaptation by discovering latent domains. CVPR, 2018.
  • [25] F. Maria Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. Rota Bulo. Autodial: Automatic domain alignment layers. In ICCV, 2017.
  • [26] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
  • [27] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
  • [28] J. Nath Kundu, P. Krishna Uppala, A. Pahuja, and R. Venkatesh Babu. Adadepth: Unsupervised content congruent adaptation for depth estimation. In CVPR, 2018.
  • [29] P. K. Nathan Silberman, Derek Hoiem and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  • [30] E. Park and A. C. Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In ECCV, 2018.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • [32] A. Pilzer, D. Xu, M. Puscas, E. Ricci, and N. Sebe. Unsupervised adversarial depth estimation using cycled generative networks. In 3DV, 2018.
  • [33] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • [34] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, June 2016.
  • [35] C. Sakaridis, D. Dai, S. Hecker, and L. Van Gool. Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In ECCV, 2018.
  • [36] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
  • [37] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene structure from a single still image. TPAMI, 2009.
  • [38] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. ICLR, 2017.
  • [39] A. Tonioni, F. Tosi, M. Poggi, S. Mattoccia, and L. Di Stefano. Real-time self-adaptive deep stereo. In CVPR, 2019.
  • [40] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
  • [41] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
  • [42] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
  • [43] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille. Towards unified depth and semantic prediction from a single image. In CVPR, 2015.
  • [44] M. Wulfmeier, A. Bewley, and I. Posner. Incremental adversarial domain adaptation for continually changing environments. In ICRA, 2018.
  • [45] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci. Structured attention guided convolutional neural fields for monocular depth estimation. In CVPR, 2018.
  • [46] N. Yang, R. Wang, J. Stuckler, and D. Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In ECCV, September 2018.
  • [47] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang. Joint task-recursive learning for semantic segmentation and depth estimation. In ECCV, 2018.
  • [48] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
  • [49] W. Zhuo, M. Salzmann, X. He, and M. Liu. Indoor scene structure analysis for single image depth estimation. In CVPR, 2015.


We now report additional experiments using a monocular setting for depth estimation in order to further compare our approach with [11] (see Sec. A). Then, in Sec. B, we evaluate our approach according to stereo matching metrics. Finally, in Sec. C, we study the temporal behavior of several variants of our proposed model.

Appendix A Analyses on monocular setting

Although our method is designed for online stereo depth estimation, we also report experimental results in a monocular setting for further evaluation. More precisely, we employ the monocular network of [11] but still use stereo pairs to compute the loss, as in [11]. The results obtained in the monocular setting are reported in Table 4. As in the main paper, we also show the results averaged over the last 20% of frames of each scene. We observe that directly performing naive online learning without pretraining does not lead to good performance: the scores on the last frames are no better than the scores averaged over the whole video, showing that the model is not learning. Concerning the online methods with pretraining, the results are well in line with the stereo results reported in the main paper. We first observe that our meta-pretraining strategy consistently improves the performance for every strategy adopted on the target video. With meta-pretraining, all models obtain better results on the last 20% of frames, which shows again that meta-pretraining helps the model not only to adapt faster, but also to perform better after observing many frames. Finally, both Online Feature Distribution Alignment (OFDA) and online meta-learning improve the performance of online learning. As in the stereo setting, OMLA (online meta-learning with OFDA) leads to the best results with both standard and meta-pretraining.

Method | Pretraining | Online Evaluation Scores (RMSE / Abs Rel / Sq Rel / RMSE log) | Scores on Last 20% Frames (RMSE / Abs Rel / Sq Rel / RMSE log)
Naive online FT [11] | w/o | 13.4035 / 0.4687 / 5.7436 / 1.3801 | 13.3264 / 0.4693 / 5.7342 / 1.3810
Naive online FT | Standard | 12.3065 / 0.4189 / 5.5863 / 1.2247 | 12.3169 / 0.4120 / 5.5691 / 1.2298
Online FT with meta-learning | Standard | 10.3564 / 0.3556 / 3.6403 / 1.1720 | 10.3010 / 0.3486 / 3.6287 / 1.1653
OFDA | Standard | 6.2100 / 0.2903 / 2.9608 / 0.2899 | 6.1870 / 0.2833 / 2.9531 / 0.2745
OMLA | Standard | 5.6230 / 0.2267 / 2.3094 / 0.2680 | 5.4730 / 0.2106 / 2.1587 / 0.2541
Naive online FT | Meta | 11.7650 / 0.3903 / 5.3825 / 1.1065 | 11.7302 / 0.3756 / 5.3530 / 1.0103
Online FT with meta-learning | Meta | 10.1024 / 0.3271 / 3.3195 / 0.9653 | 9.9366 / 0.3063 / 3.2987 / 0.9469
OFDA | Meta | 5.6074 / 0.2301 / 2.1874 / 0.2745 | 5.4840 / 0.2157 / 2.1542 / 0.9261
OMLA | Meta | 5.3898 / 0.2047 / 2.0069 / 0.2590 | 5.2187 / 0.1956 / 1.9803 / 0.2490
Table 4: Unsupervised online monocular depth estimation results on the Eigen test scenes of the KITTI dataset. Only points with ground-truth depth below 50m are evaluated.
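The error metrics reported in Table 4 are the standard depth-evaluation measures, computed only on valid points below the 50m cap. A minimal sketch is given below; the function name and return layout are our own, but the formulas are the usual Abs Rel, Sq Rel and RMSE definitions.

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=50.0):
    """Standard depth-error metrics on valid ground-truth points.

    Abs Rel = mean(|p - g| / g)
    Sq Rel  = mean((p - g)^2 / g)
    RMSE    = sqrt(mean((p - g)^2))
    Only points with 0 < gt < max_depth are evaluated, as in Table 4."""
    mask = (gt > 0) & (gt < max_depth)
    p, g = pred[mask], gt[mask]
    diff = p - g
    return {
        "abs_rel": np.mean(np.abs(diff) / g),
        "sq_rel": np.mean(diff ** 2 / g),
        "rmse": np.sqrt(np.mean(diff ** 2)),
    }
```

Lower is better for all three metrics; a prediction that is uniformly 10% too deep, for instance, yields an Abs Rel of exactly 0.1 regardless of scene scale.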

Appendix B Results in stereo matching metrics

We now evaluate our approach with different network architectures according to stereo matching metrics. We use D1-all and end-point error (EPE) to compare the different approaches [27]. All experiments follow the exact same protocol as in the main paper: the online models are pretrained on the Synthia dataset [34], and the offline models are pretrained on the KITTI [27] Eigen training split [6]. As shown in Table 5, the offline methods obtain better results according to both metrics, but naive online fine-tuning does not bring any significant improvement to the offline method. Here again, naive online learning obtains poor results on all metrics, and we observe that both OMLA and meta-pretraining significantly improve the performance. Even though the online models report slightly lower performance than the offline models, these experiments clearly illustrate the potential of the online learning setting.
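Both stereo metrics are simple functions of the per-pixel disparity error. The sketch below assumes the usual KITTI 2015 convention for D1-all (a pixel is an outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity), which may differ in detail from the exact evaluation code used here.

```python
import numpy as np

def stereo_metrics(pred_disp, gt_disp, tau=3.0, rel=0.05):
    """End-point error (EPE) and D1-all outlier percentage.

    EPE    : mean absolute disparity error over all pixels.
    D1-all : percentage of pixels whose error exceeds both `tau` pixels
             and `rel` * |gt| (assumed KITTI 2015 convention)."""
    err = np.abs(pred_disp - gt_disp)
    epe = err.mean()
    bad = (err > tau) & (err > rel * np.abs(gt_disp))
    d1_all = 100.0 * bad.mean()
    return epe, d1_all
```

Note that the two metrics can disagree: a uniform 2 px error gives a nonzero EPE but a D1-all of zero, while a few pixels with very large errors can inflate EPE without much affecting D1-all.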

Method | D1-all | EPE
Godard et al. [11], Offline | 18.6883 | 2.9076
Godard et al. [11], Offline + Online | 19.3257 | 2.9803
Godard et al. [11], Naive | 50.2587 | 5.2140
Godard et al. + OMLA | 22.3525 | 3.5820
MADNet [39], Offline | 17.2573 | 2.7544
MADNet [39], Offline + Online | 17.1209 | 2.7631
MADNet [39], Naive | 46.9753 | 4.9866
MADNet + OMLA | 20.2215 | 3.2014
DispNet [26], Offline | 20.4301 | 2.9542
DispNet [26], Offline + Online | 20.1037 | 2.9256
DispNet [26], Naive | 51.8796 | 3.0259
DispNet + OMLA | 25.3598 | 3.3746

Table 5: Comparison with different offline methods. Only points with ground-truth depth below 50m are evaluated.
Figure 4: Online evaluation across frames of different methods on the 2011_09_26_drive_0052_sync sequence from the KITTI Eigen test split.

Appendix C Illustration of Online Learning

In this section we show an online evaluation over a video sequence using different methods. We select the sequence 2011_09_26_drive_0056_sync from the KITTI Eigen test split and perform online evaluation on it. This sequence contains only 293 frames, so online learning on such a short sequence is challenging. We plot the per-frame RMSE of the different methods to see how each method adapts to the current environment. The offline models are trained on the KITTI Eigen training split, and the online models (our OMLA and the naive baseline) are meta-pretrained on Synthia. All models are based on [11] and use a ResNet-50 architecture. As shown in Fig. 4, the offline method performs well from the beginning of the sequence, since it is trained on images visually similar to the test sequence. Applying naive online learning to the model trained offline does not significantly improve the performance, and may even introduce some instability (e.g., around frame 150). The naive online learning model (green line) performs poorly in the first frames because of the difference between the synthetic and real-world images; after about 20 frames, the network starts to adapt to the current environment. Note that, even if this model provides better results after 50 frames, its results are not as stable and robust as those of the offline methods in the following frames, and its performance remains consistently worse than that of the offline models. Concerning our approach, the model with OMLA (red line) obtains performance competitive with the two offline models trained on real KITTI images. Even in the first 10 frames, it adapts quickly to the new environment and shows much faster convergence than naive online learning. In the following frames, the performance of OMLA is also more stable than that of the naive approach.
Even though the sequence is rather short, our model with OMLA provides depth predictions with a precision similar to the offline models. These results demonstrate the effectiveness of our method.
