# Planecell: Representing the 3D Space with Planes

###### Abstract

Reconstruction based on stereo cameras has received considerable attention recently, but two particular challenges remain. The first is to aggregate similar pixels effectively, and the second is to retain as much of the available information as possible while ensuring sufficient accuracy. To overcome these issues, we propose a new 3D representation method, namely planecell, that extracts planarity from depth-assisted image segmentation and then projects these depth planes into the 3D world. An energy function formulated from a Conditional Random Field that generalizes the planar relationships is maximized to merge coplanar segments. We evaluate our method against a variety of reconstruction baselines on both the KITTI and Middlebury datasets, and the results indicate its advantages over other 3D space representation methods in accuracy, memory requirements, and further applications.

## 1 Introduction

3D reconstruction has been an active research area in the computer vision community and supports numerous tasks, such as perception and navigation for intelligent robots, high-precision mapping, and online modeling. Among the various sensors that can be used for 3D reconstruction, stereo cameras are popular because they are low-cost and supply color information. Many researchers have improved the precision and speed of self-positioning and depth calculation algorithms to enable better reconstruction, but few have attempted to change the basic map representation method, which determines the upper bound of reconstruction quality. Current approaches, including point-based and voxel-based representations, are confronted with problems such as significant redundancy, ambiguities, and high memory requirements. To overcome these limitations, we propose a new representation method named planecell, which models planes to deliver geometric information in the 3D space.

Representing the 3D space with a preliminary point-level map is a classical approach. Point-based representations usually suffer from a tradeoff between density and efficiency. Many approaches [15, 1, 11] have been developed to address this issue, i.e., to merge similar points in the 3D reconstruction results for both indoor and outdoor scenes. The current leading representation method, called the voxel map [2, 21, 15, 19], assigns each voxel grid an occupancy probability and then aggregates all points within a fixed range. However, dense reconstructions using regular voxel grids are limited to small volumes because of their memory requirements.

Previous studies have adopted the plane prior in both stereo matching [22] and reconstruction [15]. For the latter, Ulusoy et al. [15] presented a Markov random field model for volumetric multi-view 3D reconstruction. The model uses large 3D surface patches that can be encoded as probabilistic priors. Deriving primitives in the model raises the complexity and restricts further applications. Methods that derive the planarity parameter or use the plane model rely on the fact that the world we live in is mostly composed of planar structures, especially in man-made environments.

In this paper, we propose a novel representation method that differs from existing approaches by mapping the 3D space with basic plane units, called planecells because they play the role that cells play in a living being. The proposed method uses a general function to represent a group of points with similar geometric information, i.e., points assigned to the same plane by a depth-aware superpixel segmentation, and these planes are projected into real-world coordinates after plane-fitting with depth values. The standardized representation promotes memory efficiency and simplifies subsequent computations, such as surface segmentation and distance calculation. Our method starts by extracting planecells from 2D images, superpixelizing the input image following the hierarchical strategy of SEEDS [18], and then converts them into a 3D map. The planecells are then merged using a Conditional Random Field (CRF) formulation. Unlike existing surface estimation methods, aggregating coplanar units with the proposed CRF formulation only needs to refer to the properties of each planecell, which dramatically reduces the required computation. The proposed representation is motivated by the planar nature of the environment. The input to our method is a color reference image and the corresponding depth map, and the output is a plane-based 3D map. In our experiments, we evaluate input disparity images from various matching algorithms to demonstrate the adaptiveness of our technique, and we compare our planecell with existing popular 3D space representation approaches.

The detailed contributions of this paper are as follows: (a) We propose a novel plane-based 3D map representation method that demonstrates remarkable accuracy and enhances space perception abilities. (b) We propose a CRF model that aggregates coplanar planecells in 3D space. (c) We study the accuracy and efficiency of our representation method by comparison with existing popular approaches. In the experiments, we also show that applications including, but not limited to, road extraction and obstacle avoidance are readily accessible on our planecell representation. In practice, this objective can be optimized on a single core in as little as 0.2 seconds for about 700 planecells. To further aggregate coplanar planes requires only s per frame. More detailed results can be found at http://www.carlib.net/planecell.html.

## 2 Related Work

Basic 3D map representation methods using an image pair build on various stereo matching algorithms [22, 12, 23, 5, 14, 10]. Point-based 3D reconstruction methods that directly transform stereo matching results lack structural representations. Recent point-level online scanning [24] produces a high-quality 3D model of small objects with a geometric surface prior, which is simpler to operate than strong shape assumptions. For large-scale reconstructions, sparse point-based representations are mainly used for their quality and speed. The point-based maps embedded in the system [5] are designed for real-time applications, such as localization. Different features have been developed for this purpose. For example, ORB feature matching [16] is designed for fast tracking via a binary descriptor. Adopting denser point clouds in the mapping is challenging because it involves managing millions of discrete values.

The heightmap is a 2.5D continuous surface representation, which is well suited to modeling large buildings and floors. Gallup et al. proposed an $n$-layer heightmap [7] to support more complex 3D reconstruction of urban scenes. The proposed heightmap enforced vertical surfaces and avoided major limitations when reconstructing overhanging structures. The basic unit of the heightmap is the probability occupancy grid computed by Bayesian inference, which compresses surface data efficiently but loses point-level precision.

Recent studies on voxelized 3D reconstruction focus on infusing primitives into the reconstructions [13, 15, 6, 3] or utilizing scalable data structures to meet CPU requirements [19]. Dame et al. proposed a formulation that combines shape-prior-based tracking and reconstruction. The map was represented as voxels with two parameters: the distance to the closest surface and a confidence value. Nonetheless, the accuracy of volumetric reconstruction is inherently limited by the voxel grid itself, and re-estimating object surfaces from voxels or 3D grids leads to ambiguities.

Planar nature assumptions have been applied to both the reconstruction [13] and depth recovery from a stereo pair [22, 10, 20]. For surface reconstruction or segmentation, Liu et al. [13] partitioned the large-scale environment into structural surfaces including planes, cylinders, and spheres using a higher-order CRF. A bottom-up progressive approach is adopted alternately on the input mesh with high geometrical and topological noises. Adopting this assumption, we present a new representation method of 3D space, which is composed of planes with pixel-level accuracy.

## 3 System Overview

As shown in Fig. 2, the input to the system is a combination of a color image and the disparity map. We use a depth-aware superpixel segmentation method with an additional depth term. A hill-climbing [18] superpixel segmentation method is applied to the color image with a regularization term to reduce the complexity. The disparity map is pre-calculated with stereo matching algorithms. Sparse results produced by fast algorithms can still serve as the input, as we use random sampling to suppress the effect of outliers during plane-fitting. The boundaries of the segmentation are further updated after plane functions have been assigned to each segment. The superpixels are the basic elements of the mapping process. We extract the vertexes of each plane and then convert them into the camera coordinate system. For existing 3D planes, we aggregate those that are coplanar while optimizing the total energy function.

As the core of our algorithm is independent of how the 3D information is acquired, we can replace the input with ground truth from laser scanners. However, a stereo camera has the advantage of being a low-cost solution for obtaining both depth and color information. In the next section, we primarily describe the process of using stereo pairs as inputs, and we employ SGM to obtain the depth map.

## 4 Representing the 3D Space with Planecells

The planecell is the basic unit representing geometric information of objects in the 3D space. Each planecell is a group of pixels from the color image that share a joint plane function describing their positions. The shape of each planecell is a polygon, which enables us to define its boundary by vertexes. Planecells are obtained through two main processes, stereo matching (introduced with the SGM method) and superpixel segmentation, which are explained separately.

### 4.1 Depth Map Calculation with SGM

The proposed method first calculates a semi-dense disparity map on the input image with a kind of SGM method that combines both the Census transformation and gradient information. Denote as the descriptor of the Census transformation and as the Hamming distance between two descriptors. Let be the directional gradient in the image. The matching cost between the pixel in the left image and pixel on the epipolar line in the right image is defined as

(1) |

Applying the minimum cost path aggregated in direction with penalties for discontinuities, the final disparity of pixel is calculated as

(2) |
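As an illustration of the cost described in this subsection, the following is a minimal sketch of a Census-plus-gradient matching cost at one integer disparity. The window radius, the weight `grad_weight`, the use of a horizontal central-difference gradient, and the function names are assumptions made for illustration; the path aggregation and discontinuity penalties of SGM are not included.

```python
import numpy as np


def census_transform(img, radius=1):
    """Census descriptor per pixel: one bit per neighbor, set if neighbor < center."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    desc = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[radius + dy:radius + dy + h,
                              radius + dx:radius + dx + w]
            desc = (desc << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return desc


def hamming(a, b):
    """Number of differing bits between two arrays of census descriptors."""
    x = np.bitwise_xor(a, b)
    count = np.zeros(x.shape, dtype=np.int64)
    while np.any(x):
        count += (x & np.uint64(1)).astype(np.int64)
        x = x >> np.uint64(1)
    return count


def matching_cost(left, right, d, grad_weight=0.5, radius=1):
    """Cost of matching left pixel (y, x) with right pixel (y, x - d).

    Pixels with x < d have no counterpart and receive an infinite cost.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    census_l = census_transform(left, radius)
    census_r = census_transform(right, radius)
    grad_l = np.gradient(left, axis=1)
    grad_r = np.gradient(right, axis=1)
    cost = np.full((h, w), np.inf)
    cost[:, d:] = (hamming(census_l[:, d:], census_r[:, :w - d])
                   + grad_weight * np.abs(grad_l[:, d:] - grad_r[:, :w - d]))
    return cost
```

Evaluating `matching_cost` over a range of disparities yields a cost volume to which a path-wise aggregation can then be applied.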

### 4.2 Planecells Extraction

We utilize superpixel segmentation methods to adopt basic planecells from the color reference image. For a color image , the superpixel segmentation has the following properties

(3) |

Throughout the literature, various superpixel algorithms are graph-based methods that aggregate similar pixels belonging to the same object. This property helps to initially distinguish planes in the region of interest, and superpixel segmentation leaves no holes in the input reference image, which also benefits the 3D map representation. However, the boundary of each superpixel is often unnecessarily complicated, especially for urban scenarios with structural objects, which increases the computational and storage demands. To address this issue, we propose an improved superpixel method based on the prior work [18]. We define the smallest unit of boundary update as a block instead of a pixel. In practice, the size of is related to the levels of the hill-climbing algorithm and the number of superpixels , which is set as pixels.

### 4.3 Superpixel Energy Function

The superpixel segmentation is bounded by the maximization of the energy function, which is defined as the sum of three terms. The energy comprises a color term based on the histogram of the color space, a regularization term , and a depth term :

(4) |

where and are two balancing parameters.

Color term: The color term measures the color distribution of the superpixels and inclines toward superpixels with color histograms that drop into similar bins. With the image segmentation , the color term is formulated as

(5) |

where denotes the histogram bin and is the number of pixels in the bin. It is not difficult to infer that reaches its maximum if and only if each superpixel's histogram is concentrated in a single bin. Nonetheless, the quality of this color evaluation depends on the bin size, i.e., the sensitivity to color declines when the number of neighboring colors in a single bin is large.
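To make the behavior of a histogram-based color term concrete, the sketch below scores a labeling by summing, over superpixels, the squared bin fractions of each superpixel's histogram. The quadratic form follows the SEEDS formulation [18] and is an assumption here, as are the single-channel input, the bin count, and the function name.

```python
import numpy as np


def color_term(values, labels, n_bins=8, max_value=256):
    """Sum over superpixels of the squared normalized histogram entries.

    Each superpixel contributes at most 1.0, reached when all of its pixels
    fall into a single histogram bin.
    """
    values = np.asarray(values).ravel().astype(np.int64)
    labels = np.asarray(labels).ravel()
    bins = (values * n_bins) // max_value  # bin index of every pixel
    total = 0.0
    for s in np.unique(labels):
        hist = np.bincount(bins[labels == s], minlength=n_bins)
        frac = hist / hist.sum()
        total += float(np.sum(frac ** 2))
    return total
```

For example, two superpixels that each contain a single intensity score 2.0, whereas the same pixels split so that each superpixel mixes two bins equally score 1.0. This also shows the dependence on bin size: with fewer, wider bins, distinct colors share a bin and the term can no longer separate them.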

Regularization term: The regularization term (see Fig. 3(a)) constrains the superpixels to be standard, encouraging straight boundaries. Let be the center of segment , be the set of boundary blocks between segment and segment , and be the set of adjacent segments of segment . The regularization term is given by

(6) |

The value of is maximized when all blocks on the boundary have the same distances to the neighboring superpixels.

Depth term: The depth term (see Fig. 3(b)) comes into effect after the plane function of each segment has been obtained. We denote the plane function of as which equals . The depth distance between a block and neighboring segments is estimated by measuring the difference in the average block depth and the estimated depth generated by the plane function. The formula for is quite similar to with the two-dimensional coordinate alternatives to the disparity. By applying this term, the segmentation outperforms the former method when the color loses its effect.

### 4.4 Plane Function Estimation

After importing the disparity image, we assign each pixel a label to distinguish whether it is an outlier, i.e., the unmatched pixels. To further identify mismatched pixels, we estimate the plane function by random sampling. The plane-fitting terminates when the number of inliers reaches a target percentage. The difference between the estimated disparity of pixel with segment and the input disparity is measured as . If this term exceeds a given threshold, the pixel is considered to be an outlier. If no appropriate function can be obtained after a designated number of iterations of , we omit this segment and re-estimate it after the boundary is updated with depth.
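The sampling procedure above can be sketched as a RANSAC-style fit of a slanted disparity plane. This is a minimal sketch under stated assumptions: the plane is parameterized as d = a·u + b·v + c, unmatched pixels are encoded as non-positive or non-finite disparities, and the threshold, target inlier ratio, iteration count, and function name are illustrative choices rather than the values used in the experiments.

```python
import numpy as np


def fit_disparity_plane(u, v, d, threshold=1.0, target_inlier_ratio=0.8,
                        max_iters=200, seed=0):
    """Fit d = a*u + b*v + c to one segment by random sampling.

    Returns (plane, inlier_mask); plane is None when no sample reaches the
    target inlier ratio within max_iters, so the segment can be revisited.
    """
    rng = np.random.default_rng(seed)
    u, v, d = (np.asarray(x, dtype=np.float64).ravel() for x in (u, v, d))
    valid = np.isfinite(d) & (d > 0)          # drop unmatched pixels
    idx = np.flatnonzero(valid)
    best_plane, best_inliers = None, np.zeros(d.shape, dtype=bool)
    if idx.size < 3:
        return None, best_inliers
    A = np.column_stack([u, v, np.ones_like(u)])
    for _ in range(max_iters):
        sample = rng.choice(idx, size=3, replace=False)
        try:
            plane = np.linalg.solve(A[sample], d[sample])
        except np.linalg.LinAlgError:
            continue                          # collinear sample, draw again
        inliers = valid & (np.abs(A @ plane - d) < threshold)
        if inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = plane, inliers
            if inliers.sum() >= target_inlier_ratio * idx.size:
                break                         # target percentage reached
    if best_plane is None or best_inliers.sum() < target_inlier_ratio * idx.size:
        return None, best_inliers
    # Refine on all inliers with a least-squares fit.
    plane, *_ = np.linalg.lstsq(A[best_inliers], d[best_inliers], rcond=None)
    return plane, best_inliers
```

Pixels outside the returned mask correspond to the outliers described above, and a `None` plane marks a segment to be re-estimated after the boundary update.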

### 4.5 Block-level Update

The proposed method is implemented using a hill-climbing algorithm, which reduces the computational complexity as it allows for faster convergence by changing the size of the initial blocks. Nonetheless, when the updating blocks become bigger, the accuracy decreases. The block size shrinks after the movement of bigger blocks has finished. At level of the hill-climbing process, the algorithm proposes a new partitioning with blocks changing to its neighboring superpixel horizontally or vertically. The partitioning process is evaluated by the superpixel energy function (Eq.4). In our implementation, to adopt more efficient segmentation, the boundary block alters its label at level depending on the costs defined below:

(7) |

where . Note that and increase and decrease separately when measuring the same block . and are evaluated during the boundary blocks updating. As the minimum updating unit of our algorithm is at the block level, we iterate changing the boundaries with the smallest blocks until a valid image partitioning is obtained or the maximum run-time is reached.

## 5 3D Map Expression

After producing the partition results with a plane function assigned to each superpixel, we extract the vertex of each segment. The vertex of each segment is the set of intersections of horizontal and vertical edges. Vertexes require much less memory and computation time than storing all of the plane pixels or edges during 2D-3D conversion. We propose a method of selecting the vertexes on the segmentation results by referring to the count of adjacent pixels that belong to the same superpixel. As demonstrated in Fig. 4, the pixel is a when three neighboring pixels belong to a different superpixel and two adjacent pixels on the horizontal or vertical line belongs to the same superpixel, and a when only one or larger than neighboring pixels belong to the same superpixel. The result after extracting the vertexes is shown in Fig. 4(b).

### 5.1 2D-3D Conversion

The conversion is based on the vertexes. Each vertex set contains vertexes, and is composed of variables describing their location in the 2D image and the disparity value estimated with the plane function. Then, for in segment , the position in the 3D coordinate system of the left camera can be calculated using the camera’s intrinsic parameters and the relative rotation and translation between the stereo cameras. We denote the plane function as after conversion. It should be noted that the 2D-3D conversion does not cause any loss of precision for each pixel.
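For intuition, the conversion can be sketched for the common special case of a rectified stereo pair with focal length `f`, principal point `(cx, cy)`, and baseline `baseline`. These parameter names, the pinhole model, and the disparity-plane form d = a·u + b·v + c are assumptions for illustration. Substituting u = f·X/Z + cx, v = f·Y/Z + cy, and d = f·baseline/Z into the disparity plane gives a plane in camera coordinates, which is why fitting in disparity space and converting afterwards loses no precision.

```python
import numpy as np


def backproject(u, v, disparity, f, cx, cy, baseline):
    """Pixels (u, v) with disparity -> 3D points in the left camera frame."""
    u, v, disparity = (np.asarray(x, dtype=np.float64) for x in (u, v, disparity))
    z = f * baseline / disparity
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.stack([x, y, z], axis=-1)


def disparity_plane_to_3d(a, b, c, f, cx, cy, baseline):
    """Disparity plane d = a*u + b*v + c -> unit normal n and offset D with n.P + D = 0."""
    n = np.array([a * f, b * f, a * cx + b * cy + c], dtype=np.float64)
    norm = np.linalg.norm(n)
    return n / norm, -f * baseline / norm
```

Only the vertexes of a planecell need to pass through `backproject`; every other pixel of the segment is covered by the converted plane.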

### 5.2 Coplanar Planecells Aggregation with CRF

The process of aggregating coplanar planecells starts from a plane-based model reconstructed from 2D-3D conversion of vertexes. The target is to assign each planecell with a common label if they fit into a similar geometric primitive in the 3D world. This aggregation reveals higher-level comprehension of the environment, which can be further used in the road extraction and understanding of structures. Prior methods utilizing CRF to merge pixels in the 3D world for the purpose of surface segmentation do not integrate existing knowledge of the color image sufficiently, which requires significant computational resources, especially when dealing with large-scale maps.

The plane-based map is demonstrated by a set of discrete plane units . The process is then presented as a labeling problem from the CRF model , i.e., assign each unit a label whose value indicates the most probable surface to which it belongs. We denote a tuple to describe a plane, where is the plane parameters in the camera coordinates and is the color distribution descriptor. The CRF model is shown in Fig. 5. The implementation of our process merges coplanar units into a larger surface iteratively until each surface is denoted with a unique tuple. The CRF model at the -th iteration is defined as

(8) |

where is the set of nodes denoting the surface units, is the labels of , is the set of boundaries between each adjacent units, and is a descriptive label of the boundary .

The CRF energy function is then formulated as the following (the superscript has been omitted)

(9) |

where the potential evaluates the color distribution of from the reference image using the histogram of the color space, the pairwise potential measures the difference of depth in the 3D space, which encourages neighboring planecells to belong to the same surface if they are close in both geometric position and pose, and the term technically encodes the boundaries of . These potentials are further explained in the following.

The term is a unary potential that measures the similarity of the unit and the surface with respect to the color histogram:

(10) |

where is the histogram bin of unit . This potential increases when the similarity in color rises. The potential is designed to constrain the geometric information, which refers of the in each planecell:

(11) |

where denotes the difference value with coordinates of the plane function . Let be the plane pose of unit . The 3D point in unit (planecell) obeys . The potential reaches its maximum when two units agree in their poses. For the potential , the formula can be written as

(12) |

Maximizing Eq. 9 is an NP-hard problem, as it requires solving a CRF model over many variables. We implement the labeling process using a circular greedy algorithm, which merges units within a given range of variation to the greatest extent. Note that the variation range determines how likely adjacent plane units are to be treated as coplanar.
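One way to picture such a greedy labeling is the sketch below, which repeatedly sweeps over the adjacency edges and merges neighboring units whose plane parameters agree within a tolerance. It is a simplified stand-in rather than the exact procedure: the color potential is omitted, each surface is represented by the parameters of its first unit, normals are assumed to be unit length and consistently oriented, and the tolerances and function name are hypothetical.

```python
import numpy as np


def merge_coplanar(normals, offsets, edges, max_angle_deg=5.0, max_offset=0.1):
    """Greedy merging of adjacent plane units n.P + D = 0 into surfaces.

    normals: (N, 3) unit normals, offsets: (N,) plane offsets D,
    edges: iterable of (i, j) index pairs of adjacent units.
    Returns one surface label per unit.
    """
    normals = np.asarray(normals, dtype=np.float64)
    offsets = np.asarray(offsets, dtype=np.float64)
    parent = list(range(len(normals)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    cos_tol = np.cos(np.radians(max_angle_deg))
    changed = True
    while changed:                      # sweep until no further merge occurs
        changed = False
        for i, j in edges:
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            similar_pose = normals[ri] @ normals[rj] >= cos_tol
            similar_position = abs(offsets[ri] - offsets[rj]) <= max_offset
            if similar_pose and similar_position:
                parent[rj] = ri         # rj joins the surface represented by ri
                changed = True
    return [find(i) for i in range(len(normals))]
```

Because only per-unit plane parameters and the adjacency list are consulted, the cost of a sweep grows with the number of planecell boundaries rather than with the number of pixels.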

## 6 Experimental Results

We evaluate our algorithm on three datasets, namely the KITTI stereo dataset, the KITTI odometry dataset [8], and the Middlebury stereo dataset [17]. The KITTI stereo dataset separates the images into training and testing sets. The training part includes LiDAR ground truth data with and without occlusions. Each group of images contains two consecutive stereo pairs with scene flow information. The outdoor scene dataset provided by the KITTI benchmark is quite challenging, as it contains significant depth variation. Our method is well suited to the KITTI datasets, whose images are largely of man-made environments that exhibit geometric structures. To better demonstrate its ability to handle large-scale inputs, we test our algorithm on the KITTI odometry dataset, which has consecutive stereo pairs with camera poses. The final dataset on which we evaluate our method is the Middlebury 2001 stereo dataset, which is composed of 9 image groups with ground truths and mostly piecewise planar indoor scenes.

The results are discussed in terms of accuracy, speed, memory requirements, and the ability to represent useful information. We compare our reconstruction accuracy with the point-level method which directly converts 2D pixels into the 3D world. Then, by changing the input depth maps, we test the variation of reconstruction accuracy. We also analyze the 3D map results with a voxel-grid based method. Detailed baselines and evaluations are given in the following.

### 6.1 Implementation Details

As the goal of the proposed method is a new representation of 3D geometric information, we give each planecell an average RGB value for reference. For and , we assign them with values according to the initial superpixel size. The plane function is obtained during plane-fitting, and this process may fail if the input depth information is insufficient. In our experiments, the average rate at which the plane function successfully defines all superpixels is with the input depth map from SGM. Those planes without plane functions are mostly the area of the sky or reflective objects that will not be converted into the final output. The input depth maps to point or voxel-based 3D space representation methods in our experiments are all calculated by deriving SGM [12]. All experiments in this paper only occupy a single core.

### 6.2 Baselines

We compare our results with several state-of-the-art stereo matching algorithms on point-level 3D map reconstruction. The method [23] proposed by Zbontar et al., which uses a convolutional neural network to calculate the matching cost between patches, serves as a preprocessing step for many stereo algorithms. The corresponding algorithm, named MC-CNN-acrt, outperforms other approaches on both KITTI and Middlebury stereo datasets. We also compare our results with the matching algorithm of Yamaguchi et al. [22] called SPS-st, whose formulation is based on a slanted plane model. In addition, SNCC [5], ELAS [9], SGBM, and SGM [12] are included in our experiments. SNCC [5] is implemented with an additional left-right consistency check and median filters.

The voxelized representation [4, 15] has been well developed recently because it standardizes the observations of regions in space. Following this concept, we implement it by dividing the space into a 3D voxel grid. The input contains a depth map and a color reference image. The color estimated for each voxel is the average over the observed pixels. The voxel size is fixed to for KITTI datasets in our experiments.
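A voxel baseline of this kind can be sketched in a few lines: 3D points are binned into a regular grid and each occupied voxel keeps the average color of the points that fall into it. The function name and the choice of returning voxel centers are assumptions, and `voxel_size` is left as a free parameter.

```python
import numpy as np


def voxelize(points, colors, voxel_size):
    """Bin 3D points into a regular grid and average the colors per voxel.

    points: (N, 3) positions, colors: (N, C) per-point colors.
    Returns (centers, mean_colors) for the occupied voxels only.
    """
    points = np.asarray(points, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    keys = np.floor(points / voxel_size).astype(np.int64)   # integer voxel index
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse, minlength=len(uniq))
    mean_colors = np.zeros((len(uniq), colors.shape[1]))
    np.add.at(mean_colors, inverse, colors)                 # accumulate per voxel
    mean_colors /= counts[:, None]
    centers = (uniq + 0.5) * voxel_size
    return centers, mean_colors
```

All points inside a voxel collapse to its center, which is the source of the precision loss of grid-based maps relative to a plane that retains per-pixel positions.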

### 6.3 Evaluation and Discussion

We first evaluate reconstruction accuracy by comparing the depth maps of each method to the ground truth. The sum of per-pixel Euclidean distance errors over the ground truth is computed after reprojecting into the coordinate system of the left camera. The comparison demonstrates the pixel-level accuracy of our method. Since our method does not lose the position information of each pixel when converting each plane into the 3D space, the comparison is performed on the depth maps. The comparison results are displayed in Fig. 6. We set the parameters of SPS-st to produce superpixels. Our method generates nearly planecells for the KITTI dataset and planecells for the Middlebury dataset, depending on the image sizes. Because camera parameters are absent from the Middlebury 2001 dataset, we report the result as the error on the disparities. It can be observed from Fig. 6 that more than points of our results are located within around the ground truth. The end of each curve is also restricted by the density of each method.

| Method | | | | |
|---|---|---|---|---|
| SGBM | | | | |
| SNCC | | | | |
| LiDAR Data | | | | |

Since the quality of our results depends on the input depth map to some degree, we then test with different inputs in Table 1. Note that the ground truth from LiDAR can produce a planecell model as well. The loss of precision with ground truth inputs is mainly due to inaccurate superpixelization. Another test, which varies the number of planecells, is shown in Fig. 7. It demonstrates that the precision increases with more planecells, because finer segments are more likely to partition boundaries correctly. Our depth term also helps improve boundary update results.

We provide several results from three different 3D space representations in Fig. 8. The input depth maps are all generated by the SGM [12] method. As shown in Fig. 8(b) and (e), both point-based and voxel-based results become sparse as the depth grows, mainly because distant scenes do not carry sufficient information. The proposed planecell method avoids this effect by summarizing pixels into a 3D plane, which restricts blank areas in the output. With the regularization term , the partition of our method reduces the complexity of boundaries. The boundaries of each planecell influence both the subsequent computation time and storage, since they define the vertexes. Because the input depth maps generated by SGM [12] are not fully dense and include many unmatched areas, the proposed method applies the slanted-plane model to produce optimized depth maps. The planecell also benefits distance measurements in applications like obstacle avoidance. For instance, denote a position in the 3D world as , the shortest distance to planecell with can be calculated as . The proposed method also shows advantages for storing the reconstruction results efficiently. In contrast to point-based methods, which save all locations, the proposed method requires an average of 45kB per frame.
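The distance query mentioned above reduces to the standard point-to-plane formula. The sketch below assumes the plane of a planecell is stored as coefficients `(a, b, c, d)` of a·x + b·y + c·z + d = 0 in camera coordinates, which is an illustrative convention; it measures the distance to the infinite supporting plane, which is a lower bound on the distance to the bounded polygon.

```python
import numpy as np


def point_to_plane_distance(point, plane):
    """Distance from a 3D point to the plane a*x + b*y + c*z + d = 0."""
    a, b, c, d = plane
    x, y, z = point
    return abs(a * x + b * y + c * z + d) / np.sqrt(a * a + b * b + c * c)
```

For a camera at the origin and a ground plane y = 1.65 (coefficients `(0, 1, 0, -1.65)`), the function returns 1.65. A single evaluation per planecell replaces a search over all points that a point cloud would require.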

More detailed results are displayed in Fig. 9 with consecutive frames from the KITTI odometry dataset. The reconstruction is based on 50 frames with ground truth poses. The map is reconstructed by mapping the data of each frame to the coordinate system of the first left camera. Moreover, to show the ability of height perception, we color the planecells with an additional height attribute (see the fifth row of Fig. 9). The height is an essential variable for path planning in autonomous driving. With the proposed CRF model, we further aggregate coplanar planecells using their plane functions and boundaries. As demonstrated in the last row of Fig. 9, coplanar planecells are given the same color. In the supplementary materials, we present additional evaluations covering more conditions, such as larger-scale reconstructions.

## 7 Conclusion

In this paper, we propose a novel approach that represents the 3D space with basic plane units named planecells. The planecells are extracted in a depth-aware manner and can be further aggregated with the proposed CRF model if they belong to the same surface. The experiments demonstrate that our method preserves pixel-level accuracy while efficiently expressing the locations of similar pixels. The results avoid the redundancy of point cloud maps and limit output map sizes for further applications. In future work, we plan to incorporate more complex surface models, such as spheres and cylinders, to suit more conditions. We also believe that giving each planecell a semantic label would extend the understanding of the environment in a more effective way.

## References

- [1] M. Agrawal and L. S. Davis. A probabilistic framework for surface reconstruction from multiple images. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2001.
- [2] R. Bhotika, D. J. Fleet, and K. N. Kutulakos. A probabilistic theory of occupancy and emptiness. In European conference on computer vision, pages 112–130. Springer, 2002.
- [3] A.-L. Chauve, P. Labatut, and J.-P. Pons. Robust piecewise-planar 3d reconstruction and completion from large-scale unstructured point data. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1261–1268. IEEE, 2010.
- [4] J. S. De Bonet and P. Viola. Poxels: Probabilistic voxelized volume reconstruction. ICCV, 1999.
- [5] N. Einecke and J. Eggert. A two-stage correlation method for stereoscopic depth estimation. In Digital Image Computing: Techniques and Applications (DICTA), 2010 International Conference on, pages 227–234. IEEE, 2010.
- [6] F. Fraundorfer, K. Schindler, and H. Bischof. Piecewise planar scene reconstruction from sparse correspondences. Image and vision computing, 24(4):395–406, 2006.
- [7] D. Gallup, M. Pollefeys, and J.-M. Frahm. 3d reconstruction using an n-layer heightmap. In Joint Pattern Recognition Symposium, pages 1–10. Springer, 2010.
- [8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012.
- [9] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In Asian conference on computer vision, pages 25–38. Springer, 2010.
- [10] F. Guney and A. Geiger. Displets: Resolving stereo ambiguities using object knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4165–4175, 2015.
- [11] C. Hane, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3d scene reconstruction and class segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 97–104, 2013.
- [12] H. Hirschmüller. Stereo processing by semi-global matching and mutual information. In IEEE Transactions on Pattern Analysis and Machine Intelligence. Citeseer, 2007.
- [13] J. Liu, J. Wang, T. Fang, C.-L. Tai, and L. Quan. Higher-order crf structural segmentation of 3d reconstructed surfaces. In Proceedings of the IEEE International Conference on Computer Vision, pages 2093–2101, 2015.
- [14] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016.
- [15] A. Osman Ulusoy, M. J. Black, and A. Geiger. Patches, planes and probabilities: A non-local prior for volumetric 3d reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3280–3289, 2016.
- [16] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient alternative to sift or surf. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011.
- [17] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47(1-3):7–42, 2002.
- [18] M. Van den Bergh, X. Boix, G. Roig, B. de Capitani, and L. Van Gool. Seeds: Superpixels extracted via energy-driven sampling. In European conference on computer vision, pages 13–26. Springer, 2012.
- [19] V. Vineet, O. Miksik, M. Lidegaard, M. Nießner, S. Golodetz, V. A. Prisacariu, O. Kähler, D. W. Murray, S. Izadi, P. Pérez, et al. Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 75–82. IEEE, 2015.
- [20] C. Vogel, K. Schindler, and S. Roth. 3d scene flow estimation with a piecewise rigid scene model. International Journal of Computer Vision, 115(1):1–28, 2015.
- [21] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald. Robust real-time visual odometry for dense rgb-d mapping. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 5724–5731. IEEE, 2013.
- [22] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision, pages 756–771. Springer, 2014.
- [23] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32):2, 2016.
- [24] M. Zollhöfer, M. Nießner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, et al. Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics (TOG), 33(4):156, 2014.