PointFlowNet: Learning Representations for
3D Scene Flow Estimation from Point Clouds
Abstract
Despite significant progress in image-based 3D scene flow estimation, the performance of such approaches has not yet reached the fidelity required by many applications. Simultaneously, these applications are often not restricted to image-based estimation: laser scanners provide a popular alternative to traditional cameras, for example in the context of self-driving cars, as they directly yield a 3D point cloud. In this paper, we propose to estimate 3D scene flow from such unstructured point clouds using a deep neural network. In a single forward pass, our model jointly predicts 3D scene flow as well as the 3D bounding box and rigid body motion of objects in the scene. While the prospect of estimating 3D scene flow from unstructured point clouds is promising, it is also a challenging task. We show that the traditional global representation of rigid body motion prohibits inference by CNNs, and propose a translation-equivariant representation to circumvent this problem. For training our deep network, a large dataset is required. Because of this, we augment real scans from KITTI with virtual objects, realistically modeling occlusions and simulating sensor noise. A thorough comparison with classic and learning-based techniques highlights the robustness of the proposed approach.
1 Introduction
For intelligent systems such as self-driving cars, the precise understanding of their surroundings is key. Notably, in order to make predictions and decisions about the future, tasks like navigation and planning require knowledge about the 3D geometry of the environment as well as about the 3D motion of other agents in the scene.
3D scene flow is the most generic representation of this 3D motion; it associates a 3D velocity vector with each measured point. Traditionally, 3D scene flow is estimated based on two consecutive image pairs of a calibrated stereo rig [38, 39, 17]. While the accuracy of scene flow methods has greatly improved over the last decade [24], image-based scene flow methods have rarely made it into robotics applications. The reasons for this are twofold. First, most leading techniques take several minutes or hours to predict 3D scene flow. Second, stereo-based scene flow methods suffer from a fundamental flaw, the “curse of two-view geometry”: it can be shown that the depth error grows quadratically with the distance to the observer [20]. This causes problems for the baselines and object depths often found in self-driving cars, as illustrated in Fig. 1 (top).
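The quadratic growth of the depth error can be made concrete with a first-order calculation. For a rectified stereo rig with focal length f (in pixels) and baseline b (in metres), depth is Z = f b / d for disparity d, so a disparity error maps to a depth error of roughly Z² / (f b) times that error. The sketch below evaluates this relation; the focal length, baseline and disparity error are illustrative assumptions and are not values taken from this paper.

```python
def stereo_depth_error(depth_m, focal_px, baseline_m, disparity_err_px):
    """First-order depth error of a rectified stereo rig.

    Depth is Z = f * b / d, so |dZ| ~= Z**2 / (f * b) * |dd|.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


# Illustrative, assumed rig parameters (not taken from the paper).
FOCAL_PX = 720.0
BASELINE_M = 0.5
DISPARITY_ERR_PX = 0.5

for depth in (10.0, 20.0, 40.0):
    err = stereo_depth_error(depth, FOCAL_PX, BASELINE_M, DISPARITY_ERR_PX)
    print(f"depth {depth:5.1f} m -> depth error ~{err:.2f} m")
```

Under this model, doubling the distance quadruples the depth error, which is why fixed-baseline stereo degrades quickly at the ranges relevant for driving.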
Consequently, most modern self-driving car platforms rely on LIDAR technology for 3D geometry perception. In contrast to cameras, laser scanners provide a 360-degree field of view with just one sensor, are generally unaffected by lighting conditions, and do not suffer from the quadratic error behavior of stereo cameras. However, while LIDAR provides accurate 3D point cloud measurements, estimating the motion between two such scans is a non-trivial task. Because of the sparse and non-uniform nature of the point clouds, as well as the missing appearance information, the data association problem is complicated. Moreover, characteristic patterns produced by the scanner, such as the circular rings in Fig. 1 (bottom), move with the observer and can easily mislead local correspondence estimation algorithms.
To address these challenges, we propose PointFlowNet, a generic model for learning 3D scene flow from pairs of unstructured 3D point clouds. Our main contributions are:

We present an end-to-end trainable model for joint 3D scene flow and rigid motion prediction and 3D object detection from unstructured LIDAR data, as captured from a (self-driving) car.

We show that a global representation is not suitable for rigid motion prediction, and propose a local translation-equivariant representation to mitigate this problem.

We augment the KITTI dataset with virtual cars, taking into account occlusions and simulating sensor noise, to provide more (realistic) training data.

We demonstrate that our approach compares favorably to the state-of-the-art.
We will make the code and dataset available.
2 Related Work
In the following discussion, we first group related methods based on their expected input; we finish this section with a discussion of learning-based solutions.
Scene Flow from Image Sequences: The most common approach to 3D scene flow estimation is to recover correspondences between two calibrated stereo image pairs. Early approaches solve the problem using coarse-to-fine variational optimization [38, 39, 2, 17, 36, 42, 40]. As coarse-to-fine optimization often performs poorly in the presence of large displacements, slanted-plane models which decompose the scene into a collection of rigidly moving 3D patches have been proposed [41, 24, 25, 22]. The benefit of incorporating semantics has been demonstrated in [3]. While the state-of-the-art in image-based scene flow estimation has advanced significantly, its accuracy is inherently limited by the geometric properties of two-view geometry as previously mentioned and illustrated in Figure 1.
Scene Flow from RGB-D Sequences: When per-pixel depth information is available, two consecutive RGB-D frames are sufficient for estimating 3D scene flow. Initially, the image-based variational scene flow approach was extended to RGB-D inputs [43, 15, 30]. Franke et al. [12] instead proposed to track KLT feature correspondences using a set of Kalman filters. Exploiting PatchMatch optimization on spherical 3D patches, Hornacek et al. [16] recover a dense field of 3D rigid body motions. However, while structured light scanning techniques (e.g., Kinect) are able to capture indoor environments, dense RGB-D sequences are hard to acquire in outdoor scenarios like ours. Furthermore, structured light sensors suffer from the same depth error characteristics as stereo techniques.
Scene Flow from 3D Point Clouds: In the robotics community, motion estimation from 3D point clouds has so far been addressed primarily with classical techniques. Several works [6, 35, 33] extend occupancy maps to dynamic scenes by representing moving objects via particles which are updated using particle filters [6, 33] or EM [35]. Others tackle the problem as 3D detection and tracking using mean shift [1], RANSAC [8], ICP [26], CRFs [37] or Bayesian networks [14]. In contrast, Dewan et al. [9] propose a 3D scene flow approach where local SHOT descriptors [34] are associated via a CRF that incorporates local smoothness and rigidity assumptions. While impressive results have been achieved, all the aforementioned approaches require significant engineering and manual model specification. In addition, local shape representations such as SHOT [34] often fail in the presence of noisy or ambiguous inputs. In contrast, we address the scene flow problem using a generic end-to-end trainable model which is able to learn local and global statistical relationships directly from data. Accordingly, our experiments show that our model compares favorably to the aforementioned classical approaches.
Learning-based Solutions: While several learning-based approaches for stereo [19, 44, 21] and optical flow [7, 10, 18] have been proposed in the literature, there is little prior work on learning scene flow estimation. A notable exception is SceneFlowNet [23], which concatenates features from FlowNet [10] and DispNet [23] for image-based scene flow estimation. In contrast, this paper proposes a novel end-to-end trainable approach for scene flow estimation from unstructured 3D point clouds.
3 Method
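We start by formally defining our problem. Let $\mathcal{P}_t \in \mathbb{R}^{N \times 3}$ and $\mathcal{P}_{t+1} \in \mathbb{R}^{M \times 3}$ denote the input 3D point clouds at frames $t$ and $t+1$, respectively. Our goal is to estimate

the 3D scene flow $\mathbf{v}_i \in \mathbb{R}^3$ and the 3D rigid motion $\mathbf{R}_i \in \mathbb{R}^{3 \times 3}$, $\mathbf{t}_i \in \mathbb{R}^3$ at each of the $N$ points in the reference point cloud at frame $t$, and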

the location, orientation, size and rigid motion of every moving object in the scene (in our experiments, we focus solely on cars).
The overall network architecture of our approach is illustrated in Figure 2. The network comprises four main components: (1) feature encoding layers, (2) scene flow estimation and 3D object detection layers, (3) rigid motion estimation layers and (4) dynamic pooling layers. In the following, we provide a detailed description for each of these components as well as the loss functions.
3.1 Feature Encoding Layers
The feature encoding layers take a raw point cloud as input, partition the space into voxels, and describe each voxel with a feature vector. The simplest form of aggregation is binarization, where any voxel containing at least one point is set to 1 and all others are set to 0. However, better results can be achieved by aggregating high-order statistics over the voxel [28, 29, 46, 32, 27, 5]. We leverage the feature encoding recently proposed by Zhou et al. [46], which has demonstrated state-of-the-art results for 3D object detection from point clouds.
We briefly summarize this encoding, but refer the reader to [46] for more details. We subdivide the 3D space of each input point cloud into equally spaced voxels and group points according to the voxel they reside in. To reduce bias with respect to LIDAR point density, a fixed number of $T$ points is randomly sampled for all voxels containing more than $T$ points. Each voxel is processed with a stack of Voxel Feature Encoding (VFE) layers to capture local and global geometric properties of its contained points. As the vast majority of the voxels in LIDAR scans tend to be empty, we only process non-empty voxels and store the results in a sparse 4D tensor.
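The grouping and sampling step described above can be sketched as follows. This is a minimal illustration, not the encoding of [46] itself: the function name, the voxel size and the cap of two points per voxel in the usage lines are hypothetical choices, and the VFE layers that follow are omitted.

```python
import random
from collections import defaultdict


def group_points_into_voxels(points, voxel_size, max_points, seed=0):
    """Group 3D points by voxel index and cap each voxel at max_points.

    points: iterable of (x, y, z) tuples.
    voxel_size: (sx, sy, sz) edge lengths of a voxel.
    Returns a dict {voxel_index: [points]} holding only non-empty voxels.
    """
    rng = random.Random(seed)
    voxels = defaultdict(list)
    for p in points:
        # Floor division assigns each point to the voxel containing it.
        index = tuple(int(c // s) for c, s in zip(p, voxel_size))
        voxels[index].append(p)
    # Randomly subsample over-full voxels to reduce density bias.
    return {
        index: rng.sample(pts, max_points) if len(pts) > max_points else pts
        for index, pts in voxels.items()
    }


cloud = [(0.1, 0.2, 0.0), (0.3, 0.1, 0.1), (0.2, 0.4, 0.2), (1.5, 0.2, 0.0)]
sparse = group_points_into_voxels(cloud, (1.0, 1.0, 1.0), max_points=2)
```

Storing only the occupied voxel indices in a dictionary mirrors the idea of a sparse tensor: empty voxels cost neither memory nor computation.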
3.2 3D Detection, Egomotion and 3D Scene Flow
As objects in a street scene are restricted to the ground plane, we only estimate objects and motions on this plane: we assume that 3D objects cannot be located on top of each other and that 3D scene points directly above each other undergo the same 3D motion. This is a valid assumption for our autonomous driving scenario, and greatly improves memory efficiency. Following [46], we vertically downsample the voxel feature map using three vertically strided 3D convolutions.
The resulting 3D feature map is reshaped by stacking the remaining height slices as feature maps to yield a 2D feature map, which is then processed by several convolutional blocks. The first layer of each block downsamples the feature map via a strided convolution, followed by a series of convolution layers that preserve the resolution. Each convolution layer is followed by Batch Normalization and a ReLU.
Next, the network splits into three branches for egomotion estimation, 3D object detection and 3D scene flow estimation, respectively. As there is only one observer, the egomotion branch further downsamples the feature map by interleaving convolutional layers with strided convolutional layers and finally uses a fully connected layer to regress a 3D egomotion (movement in the ground plane and rotation around the vertical). For the other two tasks, we upsample the output of the various blocks using up-convolutions: to half the original resolution for 3D object detection, and to the full resolution for 3D scene flow estimation. The resulting features are stacked and mapped to the training targets with one 2D convolutional layer each. We regress a 3D vector per voxel for the scene flow, and follow [46] for the object detections: we regress likelihoods for a set of proposal bounding boxes as well as the residuals (translation, rotation and size) between the positive proposal boxes and the corresponding ground truth boxes. A proposal bounding box is called positive if it has the highest Intersection over Union (IoU, in the ground plane) with a ground truth detection, or if its IoU with any ground truth box is larger than 0.6, as in [46].
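The rule for labelling proposals as positive can be illustrated with a short sketch. As a simplifying assumption, boxes are axis-aligned rectangles in the ground plane, whereas the actual proposals also carry an orientation; the function names are hypothetical.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def positive_proposals(proposals, gt_boxes, iou_threshold=0.6):
    """Return the sorted indices of proposals labelled positive."""
    positives = set()
    for gt in gt_boxes:
        ious = [iou_2d(p, gt) for p in proposals]
        # Rule 1: the proposal with the highest IoU for this ground truth box.
        positives.add(max(range(len(proposals)), key=ious.__getitem__))
        # Rule 2: every proposal whose IoU exceeds the threshold.
        positives.update(i for i, v in enumerate(ious) if v > iou_threshold)
    return sorted(positives)
```

The first rule guarantees that every ground truth box is matched by at least one proposal, even when no proposal clears the threshold.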
3.3 Rigid Motion Layers
We now wish to infer per-pixel and per-object rigid body motions from the previously estimated 3D scene flow. For a single point in isolation, there are infinitely many rigid body motions that explain a given 3D scene flow: this ambiguity can be resolved by considering the local neighborhood.
It is unfortunately impossible to use a convolutional neural network to regress rigid body motions that are represented in global world coordinates, as the conversion between scene flow and global rigid body motion depends on the location in the scene: while convolutional layers are translation equivariant, the mapping to be learned is not. Identical regions of flow lead to different global rigid body motions, depending on the location in the volume, and a fully convolutional network cannot model this. In the following, we first prove that the global rigid motion representation is not translation equivariant. Subsequently, we introduce our proposed localized representation and show it to be translation equivariant and therefore amenable to fully convolutional inference.
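Let us assume a point $\mathbf{p}_i$ in the world coordinate system $W$ with scene flow $\mathbf{v}_i$, and let $A$ denote a local coordinate system with origin $\mathbf{o}_A$, as illustrated in Fig. 2(a). A given scene flow $\mathbf{v}_i$ is said to be explained by the rigid body motion $(\mathbf{R}_A, \mathbf{t}_A)$, represented in local coordinate system $A$ at world location $\mathbf{o}_A$, if and only if
$$\mathbf{v}_i = \left[\mathbf{R}_A (\mathbf{p}_i - \mathbf{o}_A) + \mathbf{t}_A\right] - (\mathbf{p}_i - \mathbf{o}_A). \tag{1}$$
Now assume a second world location $\mathbf{p}_j$, also with scene flow $\mathbf{v}_j = \mathbf{v}_i$ (as in Fig. 2(a)). Let $B$ denote a second local coordinate system with origin $\mathbf{o}_B$ such that $\mathbf{p}_i$ and $\mathbf{p}_j$ have the same local coordinates in their respective coordinate system, i.e., $\mathbf{p}_i - \mathbf{o}_A = \mathbf{p}_j - \mathbf{o}_B$. We now prove these two claims:

There exists no rigid body motion $(\mathbf{R}_W, \mathbf{t}_W)$ represented in the world coordinate system $W$ that explains the scene flow $\mathbf{v}_i$ for both $\mathbf{p}_i$ and $\mathbf{p}_j$, unless $\mathbf{R}_W = \mathbf{I}$.

Any rigid body motion $(\mathbf{R}_A, \mathbf{t}_A)$ explaining the scene flow $\mathbf{v}_i$ for $\mathbf{p}_i$ in system $A$ also does so for $\mathbf{p}_j$ in $B$.
Proof of Claim 1
$$\mathbf{v}_i = \left[\mathbf{R}_W \mathbf{p}_i + \mathbf{t}_W\right] - \mathbf{p}_i, \qquad \mathbf{v}_j = \left[\mathbf{R}_W \mathbf{p}_j + \mathbf{t}_W\right] - \mathbf{p}_j \tag{2}$$
From $\mathbf{v}_i = \mathbf{v}_j$ we get
$$(\mathbf{R}_W - \mathbf{I})\, \Delta\mathbf{p} = \mathbf{0},$$
where $\Delta\mathbf{p} = \mathbf{p}_i - \mathbf{p}_j$. Now, we assume that $\mathbf{R}_W \neq \mathbf{I}$ and that $\Delta\mathbf{p} \neq \mathbf{0}$ (in all other cases the claim is already fulfilled). In this case, we have $\mathbf{R}_W \Delta\mathbf{p} = \Delta\mathbf{p}$, i.e., $\Delta\mathbf{p}$ would have to be an eigenvector of $\mathbf{R}_W$ with eigenvalue 1. However, a rotation matrix representing a nonzero rotation leaves only vectors along its rotation axis unchanged. In our setting, rotations are about the vertical axis while the two locations differ within the ground plane, so $\Delta\mathbf{p}$ is not parallel to the rotation axis. Hence, as $\Delta\mathbf{p} \neq \mathbf{0}$, this equality can only be fulfilled if $\mathbf{R}_W$ is the identity matrix, i.e., if it does not represent an actual rotation.
Proof of Claim 2
$$\mathbf{v}_j = \left[\mathbf{R}_A (\mathbf{p}_j - \mathbf{o}_B) + \mathbf{t}_A\right] - (\mathbf{p}_j - \mathbf{o}_B) = \left[\mathbf{R}_A (\mathbf{p}_i - \mathbf{o}_A) + \mathbf{t}_A\right] - (\mathbf{p}_i - \mathbf{o}_A) = \mathbf{v}_i \tag{3}$$
This follows trivially from $\mathbf{p}_i - \mathbf{o}_A = \mathbf{p}_j - \mathbf{o}_B$.
The first proof shows the non-stationarity of globally represented rigid body motion, while the second proof shows that the local representation is stationary in all cases and can therefore be learned with a CNN.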
We provide a simple synthetic experiment in Figure 3 to empirically confirm this analysis. We transform a grid of points with random rigid motions, and then try to infer these rigid motions from the resulting scene flow: as expected, the estimation is only successful under the local representation. Note that a change of reference system only affects the translation component: while the rotation $\mathbf{R}_W = \mathbf{R}_A$ remains constant, the relationship between the translation $\mathbf{t}_W$ in world coordinates and the translation $\mathbf{t}_A$ in a local coordinate system with origin $\mathbf{o}_A$ is given by $\mathbf{t}_W = (\mathbf{I} - \mathbf{R}_A)\, \mathbf{o}_A + \mathbf{t}_A$. Motivated by the preceding analysis, we task our architecture to predict local rigid body motion.
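The difference between the two representations can be checked numerically. The sketch below is a planar illustration under the assumption that motion consists of a rotation about the vertical axis plus a translation in the ground plane; all function names and the numbers in the example are hypothetical. Two patches at different world locations carry the same flow pattern: their local rigid motions coincide, while the corresponding world-frame translations differ.

```python
import math


def rotate(theta, p):
    """Rotate the 2D point p by angle theta (radians) about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def flow_from_local_motion(theta, t, p, origin):
    """Scene flow at p induced by the motion (theta, t) expressed about origin."""
    q = (p[0] - origin[0], p[1] - origin[1])  # p in local coordinates
    r = rotate(theta, q)
    return (r[0] + t[0] - q[0], r[1] + t[1] - q[1])


def local_to_global(theta, t_local, origin):
    """World-frame translation t_W = (I - R) o + t_local of the same motion."""
    r = rotate(theta, origin)
    return (origin[0] - r[0] + t_local[0], origin[1] - r[1] + t_local[1])


theta, t_local = 0.3, (1.0, 0.0)
origin_a, origin_b = (5.0, 0.0), (20.0, 10.0)
# Same local offset (1, 2) in both patches, hence identical flow.
flow_a = flow_from_local_motion(theta, t_local, (6.0, 2.0), origin_a)
flow_b = flow_from_local_motion(theta, t_local, (21.0, 12.0), origin_b)
t_world_a = local_to_global(theta, t_local, origin_a)
t_world_b = local_to_global(theta, t_local, origin_b)
print(flow_a, flow_b)        # identical flow vectors
print(t_world_a, t_world_b)  # different world-frame translations
```

A convolutional layer sees the same input at both locations and must therefore produce the same output, which is only compatible with the local targets.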
3.4 Dynamic Pooling Layers
Finally, we combine the results of the 3D object detection and the rigid motion estimation into a single rigid motion for each detected object. We first apply non-maximum suppression (NMS) using a detection threshold, yielding a set of 3D object detections. To estimate the rigid body motion of each detection, we pool the predicted rigid body motions over the corresponding voxels (i.e., the voxels in the bounding box of the detection). Note that the rigid body motions must be converted back into global coordinates for this pooling to be meaningful.
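The pooling step can be sketched as follows. Several details here are assumptions made for illustration and are not confirmed by the text: the pooling operator is taken to be a plain mean, detections are axis-aligned boxes in the ground plane, rotations are small enough to be averaged as angles, and each local translation is expressed in a coordinate system centred on its voxel. The function name is hypothetical.

```python
import math


def pool_object_motion(voxel_motions, box):
    """Average per-voxel rigid motions inside an axis-aligned box.

    voxel_motions: list of (center_xy, theta, t_local), with t_local expressed
        in a coordinate system centred on the voxel.
    box: (x_min, y_min, x_max, y_max) of a detection in the ground plane.
    Returns (theta, t_world) in world coordinates, or None if the box is empty.
    """
    thetas, translations = [], []
    for (cx, cy), theta, (tx, ty) in voxel_motions:
        if not (box[0] <= cx <= box[2] and box[1] <= cy <= box[3]):
            continue
        c, s = math.cos(theta), math.sin(theta)
        # Convert to world coordinates: t_world = (I - R) o + t_local.
        translations.append((cx - (c * cx - s * cy) + tx,
                             cy - (s * cx + c * cy) + ty))
        thetas.append(theta)
    if not thetas:
        return None
    n = len(thetas)
    return (sum(thetas) / n,
            (sum(t[0] for t in translations) / n,
             sum(t[1] for t in translations) / n))
```

Without the conversion, voxels of the same rigid object would report different local translations, and their average would not correspond to any single object motion.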
3.5 Loss Functions
This section describes the loss functions used by our approach. While it might be desirable to define a rigid motion loss directly at object level, this is complicated by the need for differentiation through the non-maximum suppression step and the difficulty of associating detections with ground truth objects. Furthermore, balancing the influence of an object loss across voxels is much more complex than applying all loss functions directly at the voxel level. We therefore use a voxel-level loss function. It comprises four task-specific parts:
$$\mathcal{L} = \alpha\, \mathcal{L}_{flow} + \beta\, \mathcal{L}_{rigmo} + \gamma\, \mathcal{L}_{ego} + \mathcal{L}_{det} \tag{4}$$
where $\alpha$, $\beta$, $\gamma$ are positive constants for balancing the relative importance of the task-specific loss functions. We now describe the task-specific loss functions in more detail.
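Scene Flow Loss: The scene flow loss $\mathcal{L}_{flow}$ is defined as the average distance between the predicted scene flow and the true scene flow at every voxel
$$\mathcal{L}_{flow} = \frac{1}{K} \sum_{j} \left\| \mathbf{v}_j - \mathbf{v}_j^* \right\| \tag{5}$$
where $\mathbf{v}_j$ and $\mathbf{v}_j^*$ are respectively the regression estimate and ground truth scene flow at voxel $j$, and $K$ is the number of non-empty voxels.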
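Rigid Motion Loss: The rigid motion loss $\mathcal{L}_{rigmo}$ is defined as the average error between the predicted translation $\mathbf{t}_j$ and its ground truth $\mathbf{t}_j^*$ and the predicted rotation $\theta_j$ around the Z-axis and its ground truth $\theta_j^*$ at every voxel $j$:
$$\mathcal{L}_{rigmo} = \frac{1}{K} \sum_{j} \left( \left\| \mathbf{t}_j - \mathbf{t}_j^* \right\| + \lambda \left\| \theta_j - \theta_j^* \right\| \right) \tag{6}$$
where $\lambda$ is a positive constant to balance the relative importance of the two terms and $K$ is the number of non-empty voxels.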
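Egomotion Loss: Similarly, the egomotion loss $\mathcal{L}_{ego}$ is defined as the distance between the predicted background translation $\mathbf{t}_{BG}$ and its ground truth $\mathbf{t}_{BG}^*$ and the predicted rotation $\theta_{BG}$ and its ground truth $\theta_{BG}^*$:
$$\mathcal{L}_{ego} = \left\| \mathbf{t}_{BG} - \mathbf{t}_{BG}^* \right\| + \lambda \left\| \theta_{BG} - \theta_{BG}^* \right\| \tag{7}$$
where $\lambda$ again balances the translation and rotation terms.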
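The three regression losses above can be written down in a few lines. The sketch below is illustrative only: it assumes an L1 distance for every term and reuses one balancing constant `lam` for both the rigid motion and the egomotion loss, neither of which should be read as the exact choice made in our implementation. The function names are hypothetical.

```python
def l1(a, b):
    """L1 distance between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def scene_flow_loss(pred_flow, gt_flow):
    """Average distance between predicted and true flow over non-empty voxels."""
    return sum(l1(p, g) for p, g in zip(pred_flow, gt_flow)) / len(pred_flow)


def rigid_motion_loss(pred_t, gt_t, pred_theta, gt_theta, lam):
    """Average translation error plus lam times the rotation error per voxel."""
    total = sum(l1(pt, gt) + lam * abs(pa - ga)
                for pt, gt, pa, ga in zip(pred_t, gt_t, pred_theta, gt_theta))
    return total / len(pred_t)


def ego_motion_loss(pred_t, gt_t, pred_theta, gt_theta, lam):
    """Translation error plus lam times the rotation error for the background."""
    return l1(pred_t, gt_t) + lam * abs(pred_theta - gt_theta)
```

Each list holds one entry per non-empty voxel, so dividing by the list length corresponds to averaging over the non-empty voxels only.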
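Detection Loss: Following [46], we define the detection loss as follows:
$$\mathcal{L}_{det} = \frac{1}{M_{pos}} \sum_{k} \mathcal{L}_{cls}\left(p_k^{pos}, 1\right) + \frac{1}{M_{neg}} \sum_{l} \mathcal{L}_{cls}\left(p_l^{neg}, 0\right) + \frac{1}{M_{pos}} \sum_{k} \mathcal{L}_{reg}\left(\mathbf{r}_k, \mathbf{r}_k^*\right) \tag{8}$$
where $p_k^{pos}$ and $p_l^{neg}$ represent the softmax output for positive proposal boxes and negative proposal boxes respectively, and $\mathbf{r}_k$ and $\mathbf{r}_k^*$ respectively are the regression estimate and ground truth residual vector for the positive proposal box $k$. $M_{pos}$ and $M_{neg}$ represent the number of positive and negative proposal boxes respectively. $\mathcal{L}_{cls}$ denotes the binary cross entropy loss, while $\mathcal{L}_{reg}$ stands for the smooth $\ell_1$ distance function. We refer to [46] for further details.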
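A compact sketch of such a detection loss is given below. It combines a binary cross entropy term on positive and negative proposals with a smooth L1 term on the box residuals of the positive proposals. The function names are hypothetical, the smooth L1 transition point of 1 is an assumed default, and any relative weighting between the terms is omitted.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty with the transition at |x| = 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def bce(p, target):
    """Binary cross entropy for one probability p in (0, 1)."""
    return -math.log(p) if target == 1 else -math.log(1.0 - p)


def detection_loss(p_pos, p_neg, residuals, gt_residuals):
    """Classification loss on positive/negative proposals plus box regression.

    p_pos, p_neg: predicted probabilities for positive and negative proposals.
    residuals, gt_residuals: one residual vector per positive proposal.
    """
    cls_pos = sum(bce(p, 1) for p in p_pos) / len(p_pos)
    cls_neg = sum(bce(p, 0) for p in p_neg) / len(p_neg)
    reg = sum(smooth_l1(r - g)
              for res, gt in zip(residuals, gt_residuals)
              for r, g in zip(res, gt)) / len(p_pos)
    return cls_pos + cls_neg + reg
```

Normalizing the positive and negative classification terms separately keeps the far more numerous negative proposals from dominating the loss.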
4 Experimental Evaluation
We now evaluate the performance of our method on the KITTI object detection dataset [13] as well as an extended version, which we have augmented by simulating virtual objects in each scene.
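Table 1. K: KITTI, AK: Augmented KITTI. FG, BG and All refer to the dense error columns.

| Eval. Dataset | Method | Training Dataset | Dense S.Flow FG (m) | Dense S.Flow BG (m) | Dense S.Flow All (m) | Dense Rot. FG (rad) | Dense Rot. BG (rad) | Dense Rot. All (rad) | Dense Trans. FG (m) | Dense Trans. BG (m) | Dense Trans. All (m) | Object Motion Rot. (rad) | Object Motion Tr. (m) | Egomotion Rot. (rad) | Egomotion Tr. (m) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K | PointFlowNet | K | 0.23 | 0.14 | 0.14 | 0.004 | 0.004 | 0.004 | 0.18 | 0.09 | 0.10 | 0.004 | 0.30 | 0.004 | 0.09 |
| K | PointFlowNet | K+AK | 0.18 | 0.14 | 0.14 | 0.004 | 0.004 | 0.004 | 0.15 | 0.09 | 0.10 | 0.004 | 0.29 | 0.004 | 0.09 |
| K+AK | PointFlowNet | K | 0.58 | 0.14 | 0.18 | 0.015 | 0.004 | 0.005 | 0.59 | 0.15 | 0.19 | 0.010 | 0.57 | 0.004 | 0.14 |
| K+AK | PointFlowNet | K+AK | 0.28 | 0.14 | 0.16 | 0.015 | 0.004 | 0.005 | 0.35 | 0.12 | 0.14 | 0.011 | 0.48 | 0.004 | 0.12 |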
4.1 Datasets
KITTI: For evaluating our approach, we use all 61 sequences of the training set in the KITTI object detection dataset [13], containing a total of 20k frames. As there is no point-cloud-based scene flow benchmark in KITTI, we perform our experiments on the original training set. To this end, we split the original training set into 70% train, 10% validation and 20% test sequences, making sure that frames from the same sequence are not used in different splits.
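Splitting at the level of whole sequences, rather than individual frames, is what prevents near-duplicate frames from leaking between splits. A minimal sketch is shown below; the function name, the random seed and the rounding of the split sizes are assumptions made for illustration, and the percentages are applied to the number of sequences.

```python
import random


def split_sequences(sequence_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split whole sequences into train/validation/test sets."""
    ids = sorted(sequence_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


splits = split_sequences(range(61))
```

Every frame then inherits the split of the sequence it belongs to.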
Augmented KITTI: However, the official KITTI object detection dataset lacks cars with a diverse range of motions. To generate more salient training examples, we generate a realistic mixed reality LIDAR dataset by exploiting a set of high-quality 3D CAD models of cars [11] and taking the characteristics of real LIDAR scans into account.
We discuss our workflow here. We start by fitting the ground plane using RANSAC 3D plane fitting; this allows us to detect obstacles and hence the drivable region. In a second step, we randomly place virtual cars in the drivable region, and simulate a new LIDAR scan that includes these virtual cars. Our simulator uses a noise model learned from the real KITTI scanner, and also produces missing estimates at transparent surfaces using the transparency information provided by the CAD models, as illustrated in Figure 4. Additionally, we remove points in the original scan which become occluded by the augmented car by tracing a ray between each point and the LIDAR, and removing those points whose ray intersects with the car mesh. Finally, we sample the augmented car’s rigid motion using a simple approximation of the Ackermann steering geometry, place the car at the corresponding location in the next frame, and repeat the LIDAR simulation. We generate thousands of such frames, each with a varying number of augmented moving cars per scene. We split the sequences into train, validation and test sets in the same way as for the original KITTI dataset.
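The first step of this workflow, RANSAC plane fitting, can be sketched in plain Python. This is a generic textbook variant rather than our exact implementation: the function names, the number of iterations and the inlier threshold are assumed values chosen for illustration.

```python
import random


def fit_plane(p0, p1, p2):
    """Plane (unit normal n, offset d) with n . x + d = 0 through three points."""
    u = [b - a for a, b in zip(p0, p1)]
    v = [b - a for a, b in zip(p0, p2)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-9:
        return None  # degenerate (collinear) sample
    n = [c / norm for c in n]
    return n, -sum(a * b for a, b in zip(n, p0))


def ransac_ground_plane(points, iterations=200, threshold=0.2, seed=0):
    """Return the plane supported by the most points within threshold metres."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [p for p in points
                   if abs(sum(a * b for a, b in zip(n, p)) + d) < threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

Points that are not inliers of the returned plane can then be treated as obstacles when determining the drivable region.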
4.2 Baseline Methods
We compare our method to four baselines: a point-cloud-based method using a CRF [9], two point-matching methods, and an Iterative Closest Point [4] (ICP) baseline.
Dewan et al. [9] estimate per-point rigid motion. To arrive at object-level motion and egomotion, we pool the estimates over our object detections and over the background. As they only estimate valid scene flow for a subset of the points, we evaluate [9] only on those estimates and the comparison is therefore inherently biased in their favor.
Methods Matching 3D Descriptors yield a scene flow estimate for each point in the reference point cloud by finding correspondences of 3D features in two timesteps. We evaluate two different descriptors: 3D Match [45], a learnable 3D descriptor trained on KITTI, and Fast Point Feature Histogram features (FPFH) [31]. Based on the per-point scene flow, we fit rigid body motions to each of the objects and to the background, again using the object detections from our pipeline for a fair comparison.
Iterative Closest Point (ICP) [4] outputs a transformation relating two point clouds to each other using an SVD-based point-to-point algorithm. We estimate object rigid motions by fitting the points of each detected 3D object in the first point cloud to the entire second point cloud.
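To illustrate the structure of a point-to-point ICP iteration, the sketch below alternates nearest-neighbour matching with a least-squares rigid alignment. Note the substitution: instead of the SVD-based 3D solution used by the baseline, it uses the closed-form planar solution for a rotation about the vertical axis, which keeps the example free of external dependencies. The function names are hypothetical.

```python
import math


def best_planar_rigid_motion(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src, dst: equally long lists of matched 2D points.
    """
    n = len(src)
    cs = (sum(p[0] for p in src) / n, sum(p[1] for p in src) / n)
    cd = (sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n)
    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - cs[0], sy - cs[1]
        bx, by = dx - cd[0], dy - cd[1]
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation aligns the rotated source centroid with the target centroid.
    t = (cd[0] - (c * cs[0] - s * cs[1]), cd[1] - (s * cs[0] + c * cs[1]))
    return theta, t


def icp_step(src, dst):
    """One ICP iteration: match each source point to its nearest target point."""
    matches = [min(dst, key=lambda q, p=p: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
               for p in src]
    return best_planar_rigid_motion(src, matches)
```

In a full ICP loop, the estimated motion is applied to the source points and the step is repeated until the estimate stops changing. Objects covered by only a handful of points give the matching step little to work with, which makes the alignment unreliable.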
Evaluation Metrics: We quantify performance using several metrics applied to both the detected objects and the background. To quantify the accuracy of the estimates independently from the detection accuracy, we only evaluate object motion on true positive detections.

For 3D scene flow, we use the average endpoint error between the prediction and the ground truth.

For the dense rigid motion estimates, the average rotation and translation errors are reported separately.

Similarly, we list the average rotation and translation error averaged over all of the detected objects, and averaged over all scenes for the observer’s egomotion.
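The metrics listed above can be computed as in the following sketch. It assumes the usual definitions: the end-point error is the Euclidean distance between predicted and true flow vectors, rotations are single angles about the vertical axis compared modulo a full turn, and translation error is a Euclidean distance. The function names are hypothetical.

```python
import math


def average_endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and true flow vectors."""
    return sum(math.dist(p, g) for p, g in zip(pred_flow, gt_flow)) / len(pred_flow)


def rotation_error(pred_theta, gt_theta):
    """Absolute angular difference in radians, wrapped to [0, pi]."""
    diff = (pred_theta - gt_theta + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff)


def translation_error(pred_t, gt_t):
    """Euclidean distance between predicted and true translation in metres."""
    return math.dist(pred_t, gt_t)
```

Averaging these per-point, per-object or per-scene values over the respective sets yields the dense, object motion and egomotion columns of the result tables.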
4.3 Experimental Results
The importance of simulated augmentation: To quantify the value of our proposed LIDAR simulator for realistic augmentation with extra cars, we compare the performance of our method trained on the original KITTI object detection dataset with our method trained on both KITTI and Augmented KITTI. Table 1 shows the results of this study. Training on a combination of KITTI and Augmented KITTI leads to significant performance gains, especially when evaluating on the more diverse vehicle motions in the validation set of Augmented KITTI.
[Figure: qualitative comparison. Panels: (a) Ground Truth, (b) Our result, (c) Dewan et al. [9] + Det., (d) ICP + Det.]
Direct scene flow vs. object motion: We have also evaluated the difference between estimating scene flow directly and calculating it from either dense or object-level rigid motion estimates. While scene flow computed from rigid motion estimates was qualitatively smoother, there was no significant difference in overall accuracy; we have used the direct estimates as output by our network for all evaluations.
Comparison with the baselines: Table 2 summarizes the complete performance comparison on the KITTI test set. Note that the comparison with Dewan et al. [9] is biased in their favor, as mentioned earlier, as we only evaluate their accuracy on the points they consider accurate. Regardless, our method outperforms all baselines. Additionally, we observe that the ICP-based method exhibits large errors for object motions. This is because of objects with few points: ICP often performs very poorly on these, and while their impact on the dense evaluation is small, they constitute a relatively larger fraction of the object-based evaluation. Visual examination (Fig. 5) shows that the baseline methods predict a reasonable estimate for the background motion, but fail to estimate motion for dynamic objects; in contrast, our method is able to estimate these motions correctly. This further reinforces the importance of training our method on scenes with many augmented cars and challenging and diverse motions.
Regarding execution time, our method requires 0.5 seconds to process one point cloud pair. In comparison, Dewan et al. (4 seconds) and the 3D Match and FPFH-based approaches (100 and 300 seconds, respectively) require significantly longer, while the ICP solution also takes 0.5 seconds but performs considerably worse. On top of that, our method also outputs the object detections, which are passed to the other methods as part of their input.
5 Conclusion
In this paper, we have proposed the first learning-based approach for estimating scene flow and rigid body motion from unstructured point clouds. Our proposed architecture simultaneously detects objects in the point clouds, estimates dense scene flow and rigid motion for all points in the cloud, and estimates object rigid motion for all detected objects as well as the observer. We have shown that a global rigid motion representation is not amenable to fully convolutional estimation, and instead use a local representation. Our proposed approach outperforms all evaluated baselines, yielding more accurate object motions in less time. Future directions include defining the rigid motion loss in the network at object level and extending the method to jointly estimate 3D scene flow from image and LIDAR data.
References
 [1] A. Asvadi, P. Girao, P. Peixoto, and U. Nunes. 3d object tracking using RGB and LIDAR data. In Proc. IEEE Conf. on Intelligent Transportation Systems (ITSC), 2016.
 [2] T. Basha, Y. Moses, and N. Kiryati. Multi-view scene flow estimation: A view centered variational approach. International Journal of Computer Vision (IJCV), 101(1):6–21, 2013.
 [3] A. Behl, O. H. Jafari, S. K. Mustikovela, H. A. Alhaija, C. Rother, and A. Geiger. Bounding boxes, segmentations and object coordinates: How important is recognition for 3d scene flow estimation in autonomous driving scenarios? In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2017.
 [4] P. Besl and H. McKay. A method for registration of 3d shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 14:239–256, 1992.
 [5] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
 [6] R. Danescu, F. Oniga, and S. Nedevschi. Modeling and tracking the driving environment with a particle-based occupancy grid. IEEE Trans. on Intelligent Transportation Systems (TITS), 12(4):1331–1342, 2011.
 [7] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv.org, 2017.
 [8] A. Dewan, T. Caselitz, G. D. Tipaldi, and W. Burgard. Motion-based detection and tracking in 3d lidar scans. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2016.
 [9] A. Dewan, T. Caselitz, G. D. Tipaldi, and W. Burgard. Rigid scene flow for 3d lidar scans. In Proc. IEEE International Conf. on Intelligent Robots and Systems (IROS), 2016.
 [10] A. Dosovitskiy, P. Fischer, E. Ilg, P. Haeusser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2015.
 [11] S. Fidler, S. Dickinson, and R. Urtasun. 3d object detection and viewpoint estimation with a deformable 3d cuboid model. In Advances in Neural Information Processing Systems (NIPS), December 2012.
 [12] U. Franke, C. Rabe, H. Badino, and S. Gehrig. 6D-Vision: Fusion of stereo and motion for robust environment perception. In Proc. of the DAGM Symposium on Pattern Recognition (DAGM), 2005.
 [13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
 [14] D. Held, J. Levinson, S. Thrun, and S. Savarese. Robust real-time tracking combining 3d shape, color, and motion. International Journal of Robotics Research (IJRR), 35(13):30–49, 2016.
 [15] E. Herbst, X. Ren, and D. Fox. RGB-D flow: Dense 3D motion estimation using color and depth. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2013.
 [16] M. Hornacek, A. Fitzgibbon, and C. Rother. SphereFlow: 6 DoF scene flow from RGB-D pairs. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
 [17] F. Huguet and F. Devernay. A variational method for scene flow estimation from stereo sequences. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2007.
 [18] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
 [19] A. Kendall, H. Martirosyan, S. Dasgupta, and P. Henry. End-to-end learning of geometry and context for deep stereo regression. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2017.
 [20] P. Lenz, J. Ziegler, A. Geiger, and M. Roser. Sparse scene flow segmentation for moving object detection in urban environments. In Proc. IEEE Intelligent Vehicles Symposium (IV), 2011.
 [21] Z. Liang, Y. Feng, Y. Guo, H. Liu, L. Qiao, W. Chen, L. Zhou, and J. Zhang. Learning deep correspondence through prior and posterior feature constancy. arXiv.org, 1712.01039, 2017.
 [22] Z. Lv, C. Beall, P. Alcantarilla, F. Li, Z. Kira, and F. Dellaert. A continuous optimization approach for efficient and accurate scene flow. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
 [23] N. Mayer, E. Ilg, P. Haeusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
 [24] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
 [25] M. Menze, C. Heipke, and A. Geiger. Joint 3d estimation of vehicles and scene flow. In Proc. of the ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
 [26] F. Moosmann and C. Stiller. Joint self-localization and tracking of generic objects in 3d range data. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2013.
 [27] P. Purkait, C. Zhao, and C. Zach. Sppnet: Deep absolute pose regression with synthetic views. arXiv.org, 1712.03452, 2017.
 [28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
 [29] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS), 2017.
 [30] J. Quiroga, T. Brox, F. Devernay, and J. L. Crowley. Dense semi-rigid scene flow estimation from RGBD images. In Proc. of the European Conf. on Computer Vision (ECCV), 2014.
 [31] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3d registration. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2009.
 [32] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
 [33] G. Tanzmeister, J. Thomas, D. Wollherr, and M. Buss. Grid-based mapping and tracking in dynamic environments using a uniform evidential environment representation. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2014.
 [34] F. Tombari, S. Salti, and L. di Stefano. Unique signatures of histograms for local surface description. In Proc. of the European Conf. on Computer Vision (ECCV), 2010.
 [35] A. K. Ushani, R. W. Wolcott, J. M. Walls, and R. M. Eustice. A learning approach for real-time temporal scene flow estimation from LIDAR data. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2017.
 [36] L. Valgaerts, A. Bruhn, H. Zimmer, J. Weickert, C. Stoll, and C. Theobalt. Joint estimation of motion, structure and geometry from stereo sequences. In Proc. of the European Conf. on Computer Vision (ECCV), 2010.
 [37] J. van de Ven, F. Ramos, and G. D. Tipaldi. An integrated probabilistic model for scan-matching, moving object detection and motion estimation. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2010.
 [38] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 1999.
 [39] S. Vedula, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 27(3):475–480, 2005.
 [40] C. Vogel, K. Schindler, and S. Roth. 3D scene flow estimation with a rigid motion prior. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2011.
 [41] C. Vogel, K. Schindler, and S. Roth. 3d scene flow estimation with a piecewise rigid scene model. International Journal of Computer Vision (IJCV), 115(1):1–28, 2015.
 [42] A. Wedel, T. Brox, T. Vaudrey, C. Rabe, U. Franke, and D. Cremers. Stereoscopic scene flow computation for 3D motion understanding. International Journal of Computer Vision (IJCV), 95(1):29–51, 2011.
 [43] A. Wedel, C. Rabe, T. Vaudrey, T. Brox, U. Franke, and D. Cremers. Efficient dense scene flow from sparse or dense stereo data. In Proc. of the European Conf. on Computer Vision (ECCV), 2008.
 [44] J. Žbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research (JMLR), 17(65):1–32, 2016.
 [45] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
 [46] Y. Zhou and O. Tuzel. VoxelNet: End-to-end learning for point cloud based 3d object detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
Appendix A
The following appendix provides additional qualitative comparisons of our method with the baseline methods along with examples of failure cases for our method.
A.1 Qualitative Comparison to Baseline Methods
Figures 6-27 show a qualitative comparison of our method with the best performing baseline methods on examples from the test set of the Augmented KITTI dataset. The qualitative results show that our method predicts motion for both background and foreground parts of the scene with higher accuracy than all the baselines on a diverse range of scenes and motions.
In Figures 25-27, we provide challenging examples where our method fails to predict the correct scene flow. We observe that, for scenes with two or more cars in very close proximity, our method may predict the wrong scene flow for points on one car in the reference point cloud at frame $t$ by matching them with points on the other, nearby car at frame $t+1$. However, we note that even for these failure cases our method performs better than the baseline methods.