Instance-wise Depth and Motion Learning from Monocular Videos


We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. The only annotation used in our pipeline is a video instance segmentation map that can be predicted by our new auto-annotation scheme. Our technical contributions are three-fold. First, we propose a differentiable forward rigid projection module that plays a key role in our instance-wise depth and motion learning. Second, we design an instance-wise photometric and geometric consistency loss that effectively decomposes background and moving object regions. Lastly, we introduce an instance-wise mini-batch re-arrangement scheme that does not require additional iterations in training. These proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI dataset, our framework is shown to outperform the state-of-the-art depth and motion estimation methods.

Figure 1: Overview of our system. The proposed method learns 3D motion of multiple dynamic objects, ego-motion and depth, leveraging an intermediate video instance segmentation.

1 Introduction

Knowledge of the 3D environment structure and the motion of dynamic objects is essential for autonomous navigation [10, 28]. The 3D structure is valuable because it implicitly models the relative position of the agent, and it is also utilized to improve performance on high-level scene understanding tasks such as detection and segmentation [17, 29, 1, 16]. Besides scene structure, the 3D motion of the agent and of traffic participants such as pedestrians and vehicles is also required for safe driving. The relative direction and speed between them are taken as the primary inputs for determining the next direction of travel.

Recent advances in deep neural networks (DNNs) have led to a surge of interest in depth prediction using monocular images [8, 9] and stereo images [22, 5], as well as in optical flow estimation [7, 30, 20]. These supervised methods require a large amount and broad variety of training data with ground-truth labels. Studies have shown significant progress in unsupervised learning of depth and ego-motion from unlabeled image sequences [39, 12, 32, 21, 26]. The joint optimization framework uses a network for predicting single-view depth and pose, and exploits view synthesis of images in the sequence as the supervisory signal. However, these works ignore or mask out regions of moving objects for pose and depth inference.

In this work, rather than considering moving object regions as a nuisance [39, 12, 32, 21, 26], we utilize them as an important clue for estimating 3D object motions. This problem can be formulated as motion factorization of object motion and ego-motion. Factorizing object motion in monocular sequences is a challenging problem, especially in complex urban environments that contain many dynamic objects. Moreover, deformable dynamic objects such as humans make the problem more difficult because of the greater inaccuracy in their correspondence [27].

To address this problem, we propose a novel framework that explicitly models 3D motions of dynamic objects and ego-motion together with scene depth in a monocular camera setting. Our unsupervised method relies solely on monocular video for training (without any ground-truth labels) and imposes a photo-consistency loss on warped frames from one time step to the next in a sequence. Given two consecutive frames from a video, the proposed neural network produces depth, 6-DoF motion of each moving object, and the ego-motion between adjacent frames as shown in Fig. 1. In this process, we leverage the instance mask of each dynamic object, obtained from an off-the-shelf instance segmentation module.

Our main contributions are the following:

Forward image warping Differentiable depth-based rendering (which we call inverse warping) was introduced in [39], where the target view is reconstructed by sampling pixels from a source view based on the target depth map and the relative pose. The warping procedure is effective in static scene areas, but regions of moving objects cause warping artifacts because the 3D structure of the source image may become distorted after warping based on the target image's depth [4]. To build a geometrically plausible formulation, we introduce forward warping, which maps the source image to the target viewpoint based on the source depth and the relative pose. A well-known issue with forward warping is that the output image may contain holes. Thus, we propose a differentiable and hole-free forward warping module that serves as a key component in our instance-wise depth and motion learning from monocular videos. The details are described in Sec. 3.2.
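The gather/scatter distinction between the two warping directions can be illustrated with a toy 1-D example. This is a minimal sketch with hypothetical helper names, not the paper's implementation: inverse warping gathers a value for every target pixel (hole-free), while forward warping scatters source pixels and can leave holes.

```python
import numpy as np

def inverse_warp_1d(source, coords):
    """Gather: each target pixel samples the source at a computed coordinate.
    Every target pixel receives a value, so no holes appear."""
    idx = np.clip(np.round(coords).astype(int), 0, len(source) - 1)
    return source[idx]

def forward_warp_1d(source, coords, length):
    """Scatter: each source pixel is pushed to a computed target coordinate.
    Target pixels that no source pixel maps to remain holes (NaN)."""
    target = np.full(length, np.nan)
    idx = np.round(coords).astype(int)
    valid = (idx >= 0) & (idx < length)
    target[idx[valid]] = source[valid]
    return target

src = np.array([10., 20., 30., 40.])
fwd = forward_warp_1d(src, np.arange(4) + 2.0, 6)  # shift of +2: holes at 0 and 1
inv = inverse_warp_1d(src, np.arange(6) - 2.0)     # hole-free, clamped at borders
```

The hole pattern in `fwd` is exactly the problem the paper's hole-free forward warping module addresses.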

Instance-wise photometric and geometric consistency Existing works [3, 15] have successfully estimated independent object motion with stereo cameras. Approaches based on stereo video can explicitly separate static and dynamic motion by using stereo offset and temporal information. On the other hand, estimation from monocular video captured in the dynamic real world, where both agents and objects are moving, suffers from motion ambiguity, as only temporal clues are available. To address this issue, we introduce instance-wise view synthesis and geometric consistency into the training loss. We first decompose the image into background and object (potentially moving) regions using a predicted instance mask. We then warp each component using the estimated single-view depth and camera poses to compute photometric consistency. We also impose a geometric consistency loss for each instance that constrains the estimated geometry from all input frames to be consistent. Sec. 3.3 presents our technical approach (our loss and network details) for inferring 3D object motion, ego-motion and depth.

Mini-batch arrangement Our work aims to recover instance-wise 6-DoF object motion regardless of how many instances the scene contains. To accomplish this efficiently, we propose an instance-wise mini-batch arrangement technique that organizes mini-batches with respect to the total number of instances to avoid iterative training.

KITTI video instance segmentation dataset We introduce an auto-annotation scheme to generate a video instance segmentation dataset, which is expected to contribute to various areas of self-driving research. Its role is similar to that of [34], but we design a new framework tailored to driving scenarios. Details are described in Sec. 3.4.

State-of-the-art performance Our unsupervised monocular depth and pose estimation is validated with a performance evaluation, presented in Sec. 4, which shows that our jointly learned system outperforms earlier approaches.

Our codes, models, and video instance segmentation dataset will be made publicly available.

2 Related Works

Unsupervised depth and ego-motion learning Several works [39, 12, 32, 21, 26] have studied the inference of depth and ego-motion from monocular sequences. Zhou et al. [39] introduce an unsupervised learning framework for depth and ego-motion by maximizing photometric consistency across monocular videos during training. Godard et al. [12] offer an approach that replaces the use of explicit depth data during training with easier-to-obtain binocular stereo footage; it trains a network by searching for correspondences in a rectified stereo pair, which requires only a one-dimensional search. Wang et al. [32] show that Direct Visual Odometry (DVO) can be used to estimate the camera pose between frames, and that inverse depth normalization leads to a better local minimum. Mahjourian et al. [21] combine 3D geometric constraints using Iterative Closest Point (ICP) with a photometric consistency loss. Ranjan et al. [26] propose a competitive collaboration framework that facilitates the coordinated training of multiple specialized neural networks to solve joint problems. Recently, two works [2, 6] have achieved state-of-the-art performance on depth and ego-motion estimation using geometric constraints.

Figure 2: Overview of the proposed framework.

Learning motion of moving objects Recently, the joint optimization of dynamic object motion along with depth and ego-motion [4, 3] has gained interest as a new research topic. Casser et al. [4] present an unsupervised image-to-depth framework that models the motion of moving objects and cameras. The main idea is to introduce geometric structure into the learning process by modeling the scene and the individual objects; camera ego-motion and object motions are learned from monocular videos. Cao et al. [3] propose a self-supervised framework, given 2D bounding boxes, to learn scene structure and 3D object motion from stereo videos. They factor the scene representation into independently moving objects with geometric reasoning. However, this work is based on a stereo camera setup and computes the 3D motion vector of each instance using simple mean filtering.

Video instance segmentation The task of video instance segmentation (VIS) is to simultaneously conduct detection, segmentation, and tracking of instances in videos. Yang et al. [34] first extended the image instance segmentation problem to the video domain. To facilitate research on this new task, they present a large-scale dataset and add a tracking branch to Mask R-CNN to perform detection, segmentation, and tracking jointly.

3 Methodology

We introduce an end-to-end joint training framework for instance-wise depth and motion learning from monocular videos without supervision. Figure 2 illustrates an overview of the complete pipeline. The core of our method is a novel forward rigid projection module to align the depth map from adjacent frames and an instance-wise training loss. In this section, we introduce the forward projective geometry and the networks for each type of output: DepthNet, Ego-PoseNet, and Obj-PoseNet. Further, we describe our novel loss functions and how they are designed for back-propagation in decomposing the background and moving object regions.

3.1 Method Overview


Given two consecutive RGB images $(I_t, I_{t+1})$ sampled from an unlabeled video, we first predict their respective depth maps $(\hat{D}_t, \hat{D}_{t+1})$ via our presented DepthNet with trainable parameters $\theta_D$. By concatenating the two sequential images as an input, our proposed Ego-PoseNet, with trainable parameters $\theta_E$, estimates the six-dimensional SE(3) relative transformation vectors $\hat{T}_{t \to t+1}$ and $\hat{T}_{t+1 \to t}$. With the predicted depth, relative ego-motion, and a given camera intrinsic matrix $K$, we can synthesize an adjacent image in the sequence using an inverse warping operation, where the reconstructed image is obtained by warping the reference frame [39, 13]. As a supervisory signal, an image reconstruction loss is imposed to optimize the parameters $\theta_D$ and $\theta_E$.
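The rigid-projection step underlying this view synthesis can be sketched concretely: back-project a pixel with its depth, transform it by the relative pose, and re-project it with the intrinsics. This is an illustrative NumPy sketch (function name and values are ours, not the paper's code):

```python
import numpy as np

def project_pixel(p, depth, K, T):
    """Back-project pixel p = (u, v) with its depth into 3D, apply the
    relative camera pose T (4x4 rigid transform), and project into the
    adjacent view with intrinsics K. Returns the corresponding pixel."""
    p_h = np.array([p[0], p[1], 1.0])
    X = depth * (np.linalg.inv(K) @ p_h)   # 3D point in the reference camera
    X_adj = T[:3, :3] @ X + T[:3, 3]       # transform to the adjacent camera
    p_adj = K @ X_adj
    return p_adj[:2] / p_adj[2]            # perspective divide

# Example: principal-point pixel at depth 10 m under a 0.5 m sideways motion
K = np.array([[720., 0., 620.], [0., 720., 180.], [0., 0., 1.]])
T = np.eye(4); T[0, 3] = 0.5
uv = project_pixel((620., 180.), 10.0, K, T)
```

The photometric loss then compares the intensity sampled at the projected location with the reference intensity.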

Instance-wise learning

The baseline method is limited in that it cannot handle dynamic scenes containing moving objects. Our goal is to learn depth and ego-motion, as well as object motions, from monocular videos by constraining them with instance-wise geometric consistencies. We propose an Obj-PoseNet with trainable parameters $\theta_O$, which is specialized to estimating individual object motions. We annotate a novel video instance segmentation dataset to provide individual object masks while training the ego-motion and object motion networks. The details of the video instance segmentation dataset are described in Sec. 3.4. Given two consecutive binary instance masks $(M_t, M_{t+1})$ corresponding to $(I_t, I_{t+1})$, $n$ instances are annotated and matched between the frames. First, in the case of camera ego-motion, potentially moving objects are masked out and only the background region is fed to Ego-PoseNet. Second, each of the $n$ binary instance masks is multiplied with the input images and fed to Obj-PoseNet. For both networks, the motion of the $i$-th element is represented as $\hat{T}^{i}_{t \to t+1}$, where $i=0$ denotes camera ego-motion from frame $t$ to $t+1$. The details of the motion models are described in the following subsections.

Training objectives

The previous works [21, 2, 6] imposed geometric constraints between frames, but they are limited to rigid projections. Regions containing moving objects cannot be constrained with this term and are treated as outlier regions with regard to geometric consistency. In this paper, we propose instance-wise geometric consistency. We leverage instance masks to impose geometric consistency region-by-region. Following instance-wise learning, our overall objective function can be defined as follows:


$$\mathcal{L} = \lambda_r \mathcal{L}_r + \lambda_g \mathcal{L}_g + \lambda_s \mathcal{L}_s + \lambda_t \mathcal{L}_t + \lambda_h \mathcal{L}_h, \qquad (1)$$

where $\mathcal{L}_r$ and $\mathcal{L}_g$ are the reconstruction and geometric consistency losses applied on each instance region including the background, $\mathcal{L}_s$ stands for the depth smoothness loss, and $\mathcal{L}_t$ and $\mathcal{L}_h$ are the object translation and height constraint losses. $\{\lambda_r, \lambda_g, \lambda_s, \lambda_t, \lambda_h\}$ is the set of loss weights. We train the models in both forward ($t \to t+1$) and backward ($t+1 \to t$) directions to maximally use the self-supervisory signals. In the following subsections, we introduce how we constrain the instance-wise consistencies.

Figure 3: Warping discrepancy occurs for inverse projection of moving objects.

3.2 Forward Projective Geometry

A fully differentiable warping function enables learning of structure-from-motion tasks. This operation was first proposed in spatial transformer networks (STN) [13]. Previous works on learning depth and ego-motion from unlabeled videos have so far followed this grid sampling module to synthesize adjacent views. To synthesize $\hat{I}_t$ from $I_{t+1}$, the homogeneous coordinates $p_t$ of a pixel in $I_t$ are projected to $p_{t+1}$ as follows:

$$p_{t+1} \sim K \, \hat{T}_{t \to t+1} \, \hat{D}_t(p_t) \, K^{-1} p_t. \qquad (2)$$
As expressed in the equation, this operation computes $\hat{I}_t(p_t)$ by taking the value at the coordinates $p_{t+1}$ obtained from the inverse rigid projection using $\hat{D}_t$ and $\hat{T}_{t \to t+1}$. As a result, the coordinates are not valid if $p_t$ lies on an object that moves between $t$ and $t+1$. Therefore, the inverse warping operation is not suitable for removing the effects of ego-motion in dynamic scenes. Figure 3 illustrates the point discrepancy that arises during geometric warping under the rigid structure assumption. As shown in Figure 4, inverse warping distorts the appearance of moving objects.

In order to properly synthesize the novel view (from $t$ to $t+1$) when there exist moving objects, we propose forward projective geometry as follows:

$$p_{t \to t+1} \sim K \, \hat{T}_{t \to t+1} \, \hat{D}_t(p_t) \, K^{-1} p_t.$$
Unlike the inverse projection in Eq. (2), this warping process cannot be sampled by the STN since the projection is in the forward direction (the inverse of grid sampling). In order to make this operation fully differentiable, we first use sparse tensor coding to index the homogeneous coordinates $p_{t \to t+1}$ of each projected pixel in $I_{t+1}$. Invalid coordinates (those exiting the view) of the sparse tensor are masked out. We then densify this sparse tensor by taking the nearest-neighbor value of the source pixel. However, this process has the limitation that irregular holes remain due to the sparse tensor coding. Since these forward-projected images are fed into the neural networks in the next step, the size of the holes should be minimized. To fill the holes as much as possible, we pre-upsample the depth map $\hat{D}_t$ of the reference frame. If the depth map is upsampled by a factor of $u$, the camera intrinsic matrix is upsampled accordingly:

$$K_{\uparrow u} = \begin{bmatrix} u f_x & 0 & u c_x \\ 0 & u f_y & u c_y \\ 0 & 0 & 1 \end{bmatrix},$$

where $(f_x, f_y)$ are the focal lengths along the $x$- and $y$-axes and $(c_x, c_y)$ is the principal point. Figure 5 shows the effect of pre-upsampling the reference depth during forward warping. With a sufficient upsampling factor during forward projection, the holes in the warped valid masks are filled properly. In the following subsection, we describe how to synthesize novel views with inverse and forward projection in each instance region.
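A minimal NumPy sketch of nearest-neighbor forward warping and of the intrinsics scaling (illustrative only; the paper's module is differentiable and GPU-based, and the function names here are ours):

```python
import numpy as np

def upsample_intrinsics(K, u):
    """Scale the intrinsics when the reference depth is upsampled by factor u
    (focal lengths and principal point scale together)."""
    return np.diag([u, u, 1.0]) @ K

def forward_warp_depth(depth, K, T, out_shape):
    """Scatter every reference pixel into the target view using the *reference*
    depth (forward projection). Pixels nothing maps to remain holes (0)."""
    H, W = depth.shape
    warped = np.zeros(out_shape)
    Kinv = np.linalg.inv(K)
    for v in range(H):
        for u in range(W):
            X = depth[v, u] * (Kinv @ np.array([u, v, 1.0]))
            Xa = T[:3, :3] @ X + T[:3, 3]
            if Xa[2] <= 0:                 # behind the camera: view-exiting
                continue
            p = K @ Xa
            ut, vt = int(round(p[0] / p[2])), int(round(p[1] / p[2]))
            if 0 <= ut < out_shape[1] and 0 <= vt < out_shape[0]:
                warped[vt, ut] = Xa[2]     # nearest-neighbor scatter of depth
    return warped

K = np.array([[2., 0., 2.], [0., 2., 2.], [0., 0., 1.]])
depth = np.full((4, 4), 5.0)
warped = forward_warp_depth(depth, K, np.eye(4), (4, 4))  # identity pose: no holes
K2 = upsample_intrinsics(K, 2)
```

Pre-upsampling the reference depth (with `upsample_intrinsics` applied to $K$) densifies the scattered source samples, so fewer target pixels are missed.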

Figure 4: Visual comparisons between inverse and forward warping. Here, the images and depth maps are warped only by the camera ego-motion. As shown in the last column, inverse warping distorts the appearances of the vehicles due to the rigid structure assumption, while forward warping preserves the geometric characteristics.
Figure 5: Effect of pre-upsampling the reference depth for forward projection. The right column shows the results of forward-warped depth with different upsampling factors. Holes are colored white.

3.3 Instance-wise View Synthesis and Geometric Consistency

Instance-wise reconstruction

Each step of the instance-wise view synthesis is described in Figure 2. To synthesize a novel view in an instance-wise manner, we first decompose the image region into background and object (potentially moving) regions. Given the instance masks $(M_t, M_{t+1})$, the background mask across the frames is generated as


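The background-mask construction (complement of the union of all instance masks across both frames) can be sketched as follows. This is an illustrative NumPy sketch under the assumption that masks are binary {0, 1} arrays:

```python
import numpy as np

def background_mask(inst_masks_t, inst_masks_t1):
    """Background = complement of the union of all instance masks in both
    frames, so potentially moving regions are removed from the input that
    Ego-PoseNet sees."""
    union = np.zeros_like(inst_masks_t[0])
    for m in list(inst_masks_t) + list(inst_masks_t1):
        union = np.maximum(union, m)   # elementwise union of binary masks
    return 1 - union

m1 = np.array([[1, 0], [0, 0]])  # instance in frame t
m2 = np.array([[0, 0], [0, 1]])  # matched instance in frame t+1
bg = background_mask([m1], [m2])  # 1 only where no instance appears
```

Multiplying `bg` with both input frames masks out every potentially moving region before ego-motion estimation.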
The background mask is element-wise multiplied ($\odot$, the Hadamard product) with the input frames $(I_t, I_{t+1})$, and the results are concatenated along the channel axis as the input to Ego-PoseNet. The camera ego-motion is computed as


To learn the object motions, we first apply forward warping to generate ego-motion-eliminated warped images and masks as follows:



where both equations are likewise applied in the backward direction by swapping the subscripts $t$ and $t+1$. We can then generate the forward-projected instance images and masks. For every object instance in the image, Obj-PoseNet predicts the object motion as


where both object motions are composed of six-dimensional SE(3) translation and rotation vectors. We merge all instance regions to synthesize the novel view. In this step, we utilize inverse warping. First, the background region is reconstructed as


where the gradients are propagated with respect to the predicted depth and ego-motion. Second, the inverse-warped instance region is represented as


where the gradients are propagated with respect to the predicted depth and object motions. Finally, our instance-wise fully reconstructed novel view is formulated as


Note that the above three equations are applied in either the forward or backward direction by switching the subscripts $t$ and $t+1$.

Instance-wise mini-batch re-arrangement

While training Obj-PoseNet, the number of instance images may change after each iteration. In order to avoid inefficient iterative training, we fix the maximum number of instances per image (sampled in order of instance size) and re-arrange the mini-batches on the fly with respect to the total number of instances they contain. For example, if a mini-batch has four frames and each frame has $n$ instances, then the re-arranged mini-batch size is $4 \times n$. The scale of the gradients while training Obj-PoseNet is normalized according to the total number of instances per mini-batch.
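The re-arrangement amounts to flattening the (batch, instance) dimensions into a single instance-batch so Obj-PoseNet runs once per mini-batch instead of looping over instances. A minimal NumPy sketch (array layouts and the empty-slot handling are assumptions):

```python
import numpy as np

def rearrange_instances(images, masks):
    """Flatten (B, N) instance slots into one instance-batch.
    `images` is (B, H, W, C); `masks` is (B, N, H, W) binary."""
    B, N = masks.shape[0], masks.shape[1]
    inst_inputs = []
    for b in range(B):
        for n in range(N):
            if masks[b, n].sum() == 0:   # skip empty instance slots
                continue
            # mask out everything except this instance
            inst_inputs.append(images[b] * masks[b, n][..., None])
    if not inst_inputs:
        return np.empty((0,) + images.shape[1:])
    return np.stack(inst_inputs)

images = np.ones((2, 2, 2, 3))           # (B, H, W, C)
masks = np.zeros((2, 2, 2, 2))           # (B, N, H, W)
masks[0, 0, 0, 0] = 1; masks[0, 1, 1, 1] = 1; masks[1, 0] = 1
batch = rearrange_instances(images, masks)  # 3 non-empty instances
```

One forward pass over `batch` then covers every instance in the original mini-batch; gradient scaling by the instance count can be applied afterwards.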

Instance mask propagation

Through the processes of forward and inverse warping, the instance mask is also propagated to convey the instance position and pixel validity. For the instance mask $M^i_t$, the forward- and inverse-warped mask is expressed as follows:


Note that the forward-warped mask has holes due to the sparse tensor coding. To keep the binary format and avoid interpolation near the holes during inverse warping, we round up the fractional values after each warping operation. The final valid instance mask is expressed as follows:


Instance-wise geometric consistency

We impose the geometric consistency loss on each instance region. Following the work of Bian et al. [2], we constrain geometric consistency during inverse warping. With the predicted depth map and warped instance mask, the depth of the adjacent frame can be spatially aligned to the reference frame by inverse warping, separately for the background and instance regions. In addition, the reference depth can be scale-consistently transformed to the adjacent frame, again separately for the background and instance regions. Based on this instance-wise operation, we compute the unified depth inconsistency map as:



where each line is applied to either the background or an instance region, and both are applied in either the forward or backward direction by changing the subscripts $t$ and $t+1$. Note that the above depth inconsistency maps are spatially aligned to the reference frame. Therefore, we can integrate the depth inconsistency maps from the background and instance regions as follows:


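The normalized depth-difference form used for this consistency check (as in the scale-consistency term of [2]) can be sketched directly. This is an illustrative NumPy sketch; the full method computes it per region with the warped masks:

```python
import numpy as np

def depth_inconsistency(d_warped, d_interp):
    """Normalized difference between the warped depth from one frame and the
    interpolated depth of the other; 0 where consistent, approaching 1 as
    the two estimates diverge."""
    return np.abs(d_warped - d_interp) / (d_warped + d_interp)

d1 = np.array([[10.0, 5.0]])
d2 = np.array([[10.0, 15.0]])
d_diff = depth_inconsistency(d1, d2)   # consistent pixel vs. diverging pixel
```

The symmetric normalization keeps the map in [0, 1), which also makes it usable as a per-pixel weight for the photometric loss.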
Training loss

In order to handle occluded, view-exiting, and valid instance regions, we leverage Eq. (23) and Eq. (17). We generate a weight mask from the unified depth inconsistency map and multiply it with the valid instance mask. Finally, our weighted valid mask is formulated as:


The reconstruction loss is expressed as follows:


where $p$ indexes pixel locations, SSIM is the structural similarity index [33], and $\alpha$ is set to 0.8 based on cross-validation. The geometric consistency loss is expressed as follows:


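For concreteness, the SSIM + L1 reconstruction term can be sketched as follows. This is a minimal NumPy sketch with $\alpha = 0.8$ as in the text; the SSIM map is taken as a precomputed input to keep the sketch short (a full SSIM needs local windowed statistics), and the exact weighting is an assumption:

```python
import numpy as np

def photometric_loss(img, img_rec, mask, alpha=0.8, ssim_map=None):
    """Weighted SSIM + L1 photometric loss averaged over valid pixels."""
    l1 = np.abs(img - img_rec).mean(axis=-1)   # per-pixel L1 term
    if ssim_map is None:
        ssim_map = np.ones_like(l1)            # treat as fully similar
    loss_map = alpha * (1 - ssim_map) / 2 + (1 - alpha) * l1
    return (loss_map * mask).sum() / max(mask.sum(), 1)

img = np.zeros((2, 2, 3)); rec = np.ones((2, 2, 3)); mask = np.ones((2, 2))
loss = photometric_loss(img, rec, mask)  # only the L1 term contributes here
```

The `mask` argument is where the weighted valid mask enters, restricting the loss to well-observed pixels of each region.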
To mitigate spatial fluctuation, we incorporate a smoothness term to regularize the predicted depth maps. We apply the edge-aware smoothness loss proposed by Ranjan et al. [26], described as:


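The edge-aware form penalizes depth gradients while downweighting them where the image itself has strong gradients. A minimal NumPy sketch (first-order gradients; the exact formulation in [26] may differ in detail):

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Mean of |grad depth| weighted by exp(-|grad image|): depth is allowed
    to change sharply only at image edges."""
    dx_d = np.abs(np.diff(depth, axis=1))
    dy_d = np.abs(np.diff(depth, axis=0))
    dx_i = np.abs(np.diff(image, axis=1)).mean(axis=-1)
    dy_i = np.abs(np.diff(image, axis=0)).mean(axis=-1)
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

depth = np.ones((3, 3))
image = np.zeros((3, 3, 3))
flat_penalty = edge_aware_smoothness(depth, image)  # constant depth: zero penalty
```

A textureless wall thus incurs no penalty for smooth depth, while an object boundary in the image permits a depth discontinuity at low cost.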
Note that the above loss functions are imposed in both forward and backward directions by switching the subscripts $t$ and $t+1$.

Since the dataset has a low proportion of moving objects, the estimated object motions tend to converge to zero. The same issue was raised in a previous study [4]. To supervise the approximate amount of an object's movement, we constrain the motion of the object with a translation prior. We compute this translation prior by subtracting the mean of the object's 3D points in the forward-warped frame from that of the target frame's 3D object points. This represents the mean estimated 3D vector of the object's motion. The object translation constraint loss is defined as follows:


where the two terms are the object translation predicted by Obj-PoseNet and the translation prior on the instance mask, respectively.
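The translation prior itself reduces to a difference of masked means over 3D point maps. A minimal NumPy sketch (array layouts are assumptions; in the pipeline the point maps come from back-projected depth):

```python
import numpy as np

def translation_prior(pts_warped, pts_target, mask):
    """Mean 3D motion of an object: mean of its 3D points in the target frame
    minus the mean in the ego-motion-compensated (forward-warped) frame.
    pts_* are (H, W, 3) point maps; mask is a binary (H, W) instance mask."""
    m = mask.astype(bool)
    return pts_target[m].mean(axis=0) - pts_warped[m].mean(axis=0)

pts_warped = np.zeros((2, 2, 3))                       # object after ego-motion removal
pts_target = np.tile(np.array([1., 0., 0.]), (2, 2, 1))  # object in the target frame
prior = translation_prior(pts_warped, pts_target, np.ones((2, 2)))
```

Penalizing the gap between Obj-PoseNet's predicted translation and this prior keeps the network from collapsing to zero motion.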

Although we have designed the instance-wise geometric consistency, there still exists a trivial case: a moving object with the same motion as the camera, especially a vehicle in front, can be assigned infinite depth. To mitigate this issue, we adopt the object height constraint loss proposed in the previous study [4], described as:


where the mean estimated depth, a learnable height prior, and the pixel height of the instance together constrain the predicted scale. The final loss is the weighted summation of these five losses, as defined in Eq. (1).

Figure 6: Auto-annotation of video instance segmentation using instance segmentation and optical flow estimation.

3.4 Auto-annotation of Video Instance Segmentation Dataset

We introduce an auto-annotation scheme to generate a video instance segmentation dataset from the existing KITTI autonomous driving dataset [11]. To this end, we adopt an off-the-shelf instance segmentation model, PANet [19], and an optical flow estimation model, PWC-Net [30]. Figure 6 shows an example case between two consecutive frames. We first compute the instance segmentation for every image frame, and calculate the Intersection over Union (IoU) score table among instances in adjacent frames. Occluded and disoccluded regions are handled by the bidirectional flow consensus proposed in UnFlow [23]. If the maximal IoU in the adjacent frame is above a threshold, the instance is considered matched. Our KITTI video instance segmentation dataset (KITTI-VIS) will be publicly available.
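The IoU-based matching step can be sketched as follows. This is an illustrative NumPy sketch: in the full pipeline the masks of frame $t$ are first flow-warped toward frame $t+1$ before comparison, and the 0.5 threshold here is our assumption, not the paper's value:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def match_instances(masks_t, masks_t1, thresh=0.5):
    """Greedy matching of instances between adjacent frames by maximal IoU;
    an instance is considered matched only if its best IoU exceeds thresh."""
    matches = []
    for i, ma in enumerate(masks_t):
        ious = [iou(ma, mb) for mb in masks_t1]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thresh:
            matches.append((i, j))
    return matches

a = np.zeros((4, 4), dtype=bool); a[:2, :2] = True
b = np.zeros((4, 4), dtype=bool); b[2:, 2:] = True
matches = match_instances([a, b], [b, a])  # each instance finds its twin
```

Chaining these pairwise matches across a sequence yields the per-instance tracks that form the video instance segmentation annotation.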

4 Experiments

We evaluate the performance of our framework and compare it with previous unsupervised methods on single-view depth and visual odometry tasks. We train and test our method on KITTI [11] for benchmarking.

4.1 Implementation Details

Network details

For DepthNet, we use DispResNet [26] based on ResNet-50 with an encoder-decoder structure. The network can generate multi-scale outputs (six different scales), but single-scale training converges faster and produces better performance (based on the Abs Rel error metric). The structures of Ego-PoseNet and Obj-PoseNet are identical, but their weights are not shared. Each consists of seven convolutional layers and regresses the relative pose as three Euler angles and a three-dimensional translation vector.

Figure 7: Cyclic training. Intermediate frames are generated to eliminate the effect of camera motion. Object motions are regularized by averaging two motions in the same direction, identified with the same colored arrows. Dashed arrows indicate the direction of warping to the intermediate frames.


Our system is implemented in PyTorch [25]. We train our neural networks using the ADAM optimizer [14] on Nvidia RTX 2080 GPUs, augmenting the video data with random scaling, cropping, and horizontal flipping. We set the mini-batch size to 4 and train the networks over 200 epochs with 1,000 randomly sampled batches in each epoch. The learning rate is decreased by half every 50 epochs.

While training, we take three consecutive frames as input to our joint networks. Our three-frame cyclic training is described in Figure 7. Dynamic scenes are hard to handle with rigid projective geometry in a one-shot manner. We utilize an intermediate frame, which enables decomposition into ego-motion-driven global view synthesis and residual view synthesis by object motions. From this, several warping directions can be proposed. The arrows in Figure 7 represent the warping directions of RGB images and depth maps. We tried to optimize Obj-PoseNet by warping to the intermediate views (dashed arrows); however, the network did not converge. One important point is that the supervisory signals must be generated at the original timestamps, not at the intermediate frames. Although we generate the photometric and geometric supervision only in the reference or target frames, we utilize the object motions while warping to the intermediate frames. We regularize the object motions by averaging the two motions in the same direction (e.g., "intermediate → target" and "reference → intermediate", shown as red and green arrows).


Abs Rel results of the ablated variants: 0.156 (baseline), 0.151, 0.143, 0.137, and 0.124 (full model with the instance-wise loss, geometric consistency loss, and forward projection).

Table 1: Ablation study on the Abs Rel metric. We verify the effect of forward projection (proj.) and the instance-wise geometric consistency term. Note that instance and geometric denote the instance-wise loss and the geometric consistency loss, respectively.

4.2 Ablation Study

To validate the effect of our forward projective geometry and instance-wise geometric consistency term, we conduct an ablation study against the baseline method. As shown in Table 1, our forward projection works effectively and leads to convergence of DepthNet. Each proposed module improves the quality of the depth maps, and the best performance is achieved with all proposed components. We observe that, without forward projection, Obj-PoseNet did not converge under the instance-wise geometric loss, so the performance of DepthNet was not improved. In other words, an optimized Obj-PoseNet helps to boost the performance of DepthNet: DepthNet and Obj-PoseNet complement each other. We note that the most significant performance improvement comes from the instance-wise geometric loss incorporated with forward projection.

Figure 8: Qualitative results of single-view depth prediction on the KITTI test set (from the Eigen split). Green circles highlight favorable results of our method for depth prediction on moving objects.

Method | Supervision | Train | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Eigen et al. [8] | Depth | M | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958
Liu et al. [18] | Depth | S | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965
Garg et al. [9] | – | S | 0.152 | 1.226 | 5.849 | 0.246 | 0.784 | 0.921 | 0.967
Godard et al. [12] | – | S | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964
Zhan et al. [38] | – | S | 0.144 | 1.391 | 5.869 | 0.241 | 0.803 | 0.928 | 0.969
SfM-Learner [39] | – | M | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
Yang et al. [36] | – | M | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963
Wang et al. [32] | – | M | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974
LEGO [35] | – | M | 0.162 | 1.352 | 6.276 | 0.252 | 0.783 | 0.921 | 0.969
DF-Net [40] | – | M | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973
DDVO [32] | – | M | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974
GeoNet [37] | – | M | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973
Mahjourian et al. [21] | – | M | 0.159 | 1.231 | 5.912 | 0.243 | 0.784 | 0.923 | 0.970
CC [26] | – | M | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975
SC-SfM-Learner [2] | – | M | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975
GLNet [6] | – | M | 0.135 | 1.070 | 5.230 | 0.210 | 0.841 | 0.948 | 0.980
Struct2Depth [4] | – | M | 0.141 | 1.026 | 5.290 | 0.215 | 0.816 | 0.945 | 0.979
Ours | – | M | 0.124 | 1.009 | 5.176 | 0.208 | 0.839 | 0.942 | 0.980

Table 2: Single-view depth estimation results on the KITTI Eigen split. 'M' and 'S' denote monocular and stereo video inputs for training. Bold: best, underline: second best.

4.3 Monocular Depth Estimation

Test setup

Following the test setup proposed for SfM-Learner [39], we train and test our models with the Eigen split [8] of the KITTI dataset. We compare the performance of the proposed method with recent state-of-the-art works [6, 4, 2, 26] for unsupervised single-view depth estimation.

Results analysis

We show qualitative results on single-view depth estimation in Figure 8. The compared methods are CC [26] and SC-SfM [2], which have the same (ResNet-based) network structure for depth map prediction. Ours produces better depth representations on moving objects than the previous methods. As the previous studies do not consider the dynamics of objects when finding pixel correspondences, their predicted object distances can be either farther or closer than the actual distance. This is a long-standing limitation of self-supervised learning of depth from monocular videos; however, our networks self-disentangle moving and static regions via our instance-wise losses.

Table 2 shows the results on the KITTI Eigen split test, where the proposed method achieves state-of-the-art performance in the single-view depth prediction task with unsupervised training. The advantage of using instance masks and constraining instance-wise photometric and geometric consistency is evident. Note that DepthNet does not require instance masks in the test phase.


Method | No. frames | Seq. 09 | Seq. 10
ORB-SLAM (full) [24] | All | |
ORB-SLAM (short) [24] | 5 | |
SfM-Learner [39] | 5 | |
SfM-Learner (updated) [39] | 5 | |
DF-Net [40] | 3 | |
Vid2Depth [21] | 3 | |
GeoNet [37] | 3 | |
CC [26] | 3 | |
Struct2Depth [4] | 3 | |
GLNet [6] | 3 | |
Ours (w/o inst.) | 3 | |
Ours (w/ inst.) | 3 | |

Table 3: Absolute Trajectory Error (ATE) on the KITTI visual odometry dataset (lower is better). Bold: Best.

4.4 Visual Odometry Estimation

Test setup

We evaluate the performance of our Ego-PoseNet on the KITTI visual odometry dataset. Following the evaluation setup of SfM-Learner [39], we use sequences 00-08 for training and sequences 09 and 10 for testing. In our case, since the potentially moving object masks are fed together with the image sequences while training Ego-PoseNet, we test the performance of visual odometry under two conditions: with and without instance masks.

Results analysis

Table 3 shows the results on the KITTI visual odometry test. We measure the Absolute Trajectory Error (ATE) and achieve state-of-the-art performance. Even without instance masks, the result on sequence 10 remains favorable. This is because the scene does not contain many potentially moving objects (e.g., vehicles and pedestrians), so the result is not much affected by whether instance masks are used.

5 Conclusion

In this work, we proposed a novel framework that predicts 6-DoF motion of multiple dynamic objects, ego-motion, and depth from monocular image sequences. Leveraging video instance segmentation, we design an end-to-end joint training pipeline in an unsupervised manner. There are four main contributions of our work: (1) an auto-annotation scheme for video instance segmentation, (2) differentiable forward image warping, (3) an instance-wise view-synthesis and geometric consistency loss, and (4) instance-wise mini-batch re-arrangement. We show that our method outperforms existing methods that estimate object motion, ego-motion, and depth. We also show that each proposed module plays a role in improving the performance of our framework.

In the future, we plan to investigate joint optimization of the instance mask together with the depth and motion. Another future direction is to consider longer input sequences as in bundle adjustment [31].


  1. J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss and J. Gall (2019) SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In ICCV, Cited by: §1.
  2. J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In NeurIPS, Cited by: §2, §3.1, §3.3, §4.3, §4.3, Table 2.
  3. Z. Cao, A. Kar, C. Hane and J. Malik (2019) Learning independent object motion from unlabelled stereoscopic videos. In CVPR, Cited by: §1, §2.
  4. V. Casser, S. Pirk, R. Mahjourian and A. Angelova (2019) Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In AAAI, Cited by: §1, §2, §3.3, §3.3, §4.3, Table 2, Table 3.
  5. J. Chang and Y. Chen (2018) Pyramid stereo matching network. In CVPR, Cited by: §1.
  6. Y. Chen, C. Schmid and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In ICCV, Cited by: §2, §3.1, §4.3, Table 2, Table 3.
  7. A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In ICCV, Cited by: §1.
  8. D. Eigen, C. Puhrsch and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, Cited by: §1, §4.3, Table 2.
  9. R. Garg, V. K. BG, G. Carneiro and I. Reid (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In ECCV, Cited by: §1, Table 2.
  10. A. Geiger, M. Lauer, C. Wojek, C. Stiller and R. Urtasun (2014) 3d traffic scene understanding from movable platforms. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). Cited by: §1.
  11. A. Geiger, P. Lenz and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, Cited by: §3.4, §4.
  12. C. Godard, O. Mac Aodha and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: §1, §1, §2, Table 2.
  13. M. Jaderberg, K. Simonyan and A. Zisserman (2015) Spatial transformer networks. In NIPS, Cited by: §3.1, §3.2.
  14. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  15. S. Lee, S. Im, S. Lin and I. S. Kweon (2019) Learning residual flow as dynamic motion from stereo videos. In IROS, Cited by: §1.
  16. S. Lee, J. Kim, T. Oh, Y. Jeong, D. Yoo, S. Lin and I. S. Kweon (2019) Visuomotor understanding for representation learning of driving scenes. In BMVC, Cited by: §1.
  17. S. Lee, J. Kim, J. Shin Yoon, S. Shin, O. Bailo, N. Kim, T. Lee, H. Seok Hong, S. Han and I. So Kweon (2017) Vpgnet: vanishing point guided network for lane and road marking detection and recognition. In ICCV, Cited by: §1.
  18. F. Liu, C. Shen, G. Lin and I. Reid (2016) Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). Cited by: Table 2.
  19. S. Liu, L. Qi, H. Qin, J. Shi and J. Jia (2018) Path aggregation network for instance segmentation. In CVPR, Cited by: §3.4.
  20. Z. Lv, K. Kim, A. Troccoli, D. Sun, J. M. Rehg and J. Kautz (2018) Learning rigidity in dynamic scenes with a moving camera for 3d motion field estimation. In ECCV, Cited by: §1.
  21. R. Mahjourian, M. Wicke and A. Angelova (2018) Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In CVPR, Cited by: §1, §1, §2, §3.1, Table 2, Table 3.
  22. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, Cited by: §1.
  23. S. Meister, J. Hur and S. Roth (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In AAAI, Cited by: §3.4.
  24. R. Mur-Artal, J. M. M. Montiel and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics. Cited by: Table 3.
  25. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  26. A. Ranjan, V. Jampani, K. Kim, D. Sun, J. Wulff and M. J. Black (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, Cited by: §1, §1, §2, §3.3, §4.1, §4.3, §4.3, Table 2, Table 3.
  27. C. Russell, R. Yu and L. Agapito (2014) Video pop-up: monocular 3d reconstruction of dynamic scenes. In ECCV, Cited by: §1.
  28. A. Shashua, Y. Gdalyahu and G. Hayun (2004) Pedestrian detection for driving assistance systems: single-frame classification and system level performance. In IEEE Intelligent Vehicles Symposium, Cited by: §1.
  29. K. Shin, Y. P. Kwon and M. Tomizuka (2019) Roarnet: a robust 3d object detection based on region approximation refinement. In IEEE Intelligent Vehicles Symposium (IV), Cited by: §1.
  30. D. Sun, X. Yang, M. Liu and J. Kautz (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In CVPR, Cited by: §1, §3.4.
  31. B. Triggs, P. F. McLauchlan, R. I. Hartley and A. W. Fitzgibbon (1999) Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, Cited by: §5.
  32. C. Wang, J. M. Buenaposada, R. Zhu and S. Lucey (2018) Learning depth from monocular videos using direct methods. In CVPR, Cited by: §1, §1, §2, Table 2.
  33. Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing. Cited by: §3.3.
  34. L. Yang, Y. Fan and N. Xu (2019) Video instance segmentation. In ICCV, Cited by: §1, §2.
  35. Z. Yang, P. Wang, Y. Wang, W. Xu and R. Nevatia (2018) Lego: learning edge with geometry all at once by watching videos. In CVPR, Cited by: Table 2.
  36. Z. Yang, P. Wang, W. Xu, L. Zhao and R. Nevatia (2018) Unsupervised learning of geometry with edge-aware depth-normal consistency. In AAAI, Cited by: Table 2.
  37. Z. Yin and J. Shi (2018) Geonet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, Cited by: Table 2, Table 3.
  38. H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal and I. Reid (2018) Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, Cited by: Table 2.
  39. T. Zhou, M. Brown, N. Snavely and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, Cited by: §1, §1, §1, §2, §3.1, §4.3, §4.4, Table 2, Table 3.
  40. Y. Zou, Z. Luo and J. Huang (2018) Df-net: unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, Cited by: Table 2, Table 3.