3DFS: Deformable Dense Depth Fusion and Segmentation for Object Reconstruction from a Handheld Camera
Abstract
We propose an approach for 3D reconstruction and segmentation of a single object placed on a flat surface from an input video. Our approach is to perform dense depth map estimation for multiple views using a proposed objective function that preserves detail. The resulting depth maps are then fused using a proposed implicit surface function that is robust to estimation error, producing a smooth surface reconstruction of the entire scene. Finally, the object is segmented from the remaining scene using a proposed 2D-3D segmentation that incorporates image and depth cues with priors and regularization over the 3D volume and 2D segmentations. We evaluate 3D reconstructions qualitatively on our ObjectVideos dataset, comparing to fusion, multiview stereo, and segmentation baselines. We also quantitatively evaluate the dense depth estimation using the RGBD Scenes V2 dataset [Henry et al. 2013] and the segmentation using keyframe annotations of the ObjectVideos dataset.
Keywords: 3D Reconstruction, Dense Depth Estimation, Volumetric Fusion, Joint 2D-3D Segmentation
Concepts: • Computing methodologies → Shape modeling; Volumetric models;
1 Introduction
Our aim is to create a 3D model of a single object recorded by a handheld mobile phone camera. We assume only that the object is placed on a flat surface and that the object is approximately centered. The ability to easily and accurately create 3D models from handheld cameras has broad applications including virtual reality and 3D printing. But existing methods are unable to consistently produce 3D models of objects with specularities and irregular shapes without user interaction or controlled capture settings.
3D reconstruction of scenes from image sequences is a well studied problem in computer vision. The underlying idea behind most existing work is that multiple pixel measurements of a point in the scene can be used to triangulate its 3D position. Therefore, reconstruction accuracy hinges on the ability to track a pixel with subpixel accuracy. Tracking and matching algorithms, however, assume brightness constancy which breaks down for specular surfaces. Tracking algorithms also falter in textureless regions where neighboring pixels have similar intensities.
MonoFusion [Pradeep et al. 2013] and MobileFusion [Ondruska et al. 2015] demonstrate 3D reconstruction by explicitly modeling surfaces as depth maps and then performing volumetric fusion. However, to achieve real-time performance, such methods compromise on the accuracy of the depth maps by relying on stereo matching between the live frame and the last keyframe. Similarly, volumetric fusion is performed by a weighted average of implicit surface representations of the depth surfaces, such as Truncated Signed Distance Fields (TSDF), computed in different views. The success of such volumetric fusion approaches depends on the accuracy of the depth maps. While these techniques work well for depth maps acquired using active sensors like Kinect, they are not robust to the localized but often large errors in depth maps that are common with multiview stereo techniques.
Another important aspect of single object 3D reconstructions is segmentation of the object from the scene. To be useful for 3D printing, the 3D volume occupied by the object needs to be identified. While TSDF gives an estimate of empty regions in space, the cues that distinguish the object from other surfaces come from scene priors and the images. Some approaches iteratively solve for 2D and 3D segmentations [Vogiatzis et al. 2005] or jointly segment pixels and sparse point clouds [Xiao et al. 2007], but none, to the best of our knowledge, perform joint inference over pixels and a dense grid of voxels, which we find to be important for obtaining accurate 3D models.
In this work, we propose to improve the accuracy and robustness of video-based multiview single object 3D reconstruction and segmentation systems by improving surface modeling, volumetric fusion, and segmentation. Our system computes accurate depth maps by posing dense per-pixel depth estimation as an optimization problem which incorporates multiview stereo cues, sparse point cloud reconstruction from a VSLAM system, and a surface smoothness prior in the form of a rotation invariant bending energy. For 3D reconstruction, we reformulate volumetric fusion of depth maps by removing the truncation in TSDF and using a softmax-based signed distance function (SDF). Our fusion approach is more robust to errors in depth maps and also produces smoother surfaces. This technique, however, leads to a shift in the zero-crossing of the function field, which we remedy by introducing a novel volume field deformation using a sparse surface point cloud such as that provided by patch-based multiview stereo [Furukawa and Ponce 2010]. For segmentation, we perform joint inference over pixel and voxel labels in a graph cut optimization framework. Pixels and voxels impose complementary constraints on the reconstruction. Pixels model the object color, which helps distinguish the object surface from other background surfaces, and provide cues to surface discontinuities through contour edges. Voxels impose constraints on empty regions, enforce continuity of objects in 3D space, and incorporate our scene prior. In addition, surface voxels enforce consistency of pixel labeling across multiple images. We evaluate each of these components to demonstrate good performance even in the presence of specularities and textureless surfaces.
To summarize our contributions:

We propose a fully automated approach to produce a 3D mesh of a single object placed on a flat surface.

We propose a method to robustly estimate depth surfaces from multiview stereo cues and a sparse point cloud (optional) which is regularized by a rotation invariant second order bending energy. We demonstrate its performance on our ObjectVideos dataset as well as RGBD Scenes v2 dataset and compare against other forms of regularization.

We reformulate volumetric fusion by using a softmax instead of the truncation and weighted averaging for computing the TSDF proposed in [Newcombe et al. 2011a]. To correct the bias in the zero-crossing, we perform a smooth function field deformation using a surface point cloud. Our approach is more robust to errors in depth maps and produces smooth surface meshes.

We formulate joint 2D image segmentation and volumetric 3D reconstruction as the task of assigning discrete labels to every pixel and to every voxel. The discrete optimization is solved using graph cut with the expansion procedure under constraints imposed by pixels, voxels, and scene priors.

We evaluate our entire system through pixel segmentation accuracy on the ObjectVideos dataset, a video dataset of 12 objects recorded by a mobile phone camera. We provide ground-truth segmentation masks for selected frames to encourage advancement of the state of the art on this task.
2 Related Work
We can group relevant literature into two broad areas: 3D reconstruction and multiview segmentation.
3D Reconstruction: Most of the existing work in this area focuses on reconstruction of an entire scene. Different approaches cater to different reconstruction resolution requirements and scene sizes. Single object reconstruction mostly falls within the purview of literature that deals with small indoor scenes. MonoFusion [Pradeep et al. 2013] and MobileFusion [Ondruska et al. 2015] use stereo matching between the live frame and the last selected keyframe to generate depth maps. The final surface reconstruction is computed by extracting the zero-level set of the Signed Distance Field (SDF) obtained by weighted averaging of TSDFs for different views. This approach to volumetric fusion, popularized by KinectFusion [Newcombe et al. 2011a], has demonstrated good performance for fusing accurate depth measurements from active sensors like the Kinect. However, this method has two drawbacks. First, due to weighted averaging, the TSDFs are forced to be asymmetric to avoid changing the zero-level set. This in turn hinders their ability to collect evidence of occupancy for voxels behind the visible surfaces. Second, this approach relies on accurate depth surface estimates, an assumption that generally holds for active sensors operating in indoor scenes but does not generalize to stereo based methods in the presence of specular reflection and textureless surfaces. To counter these problems we propose an alternative scheme that relies on the maximum signed distance to the visible surfaces. For robustness we use a softmax instead of a hard maximum. This is followed by zero-crossing correction using a smooth deformation field generated using a sparse point cloud provided by PMVS [Furukawa and Ponce 2010]. Our joint 2D-3D segmentation also reduces dependence on the depth maps by utilizing multiple cues and hence improves the overall robustness of the system.
DTAM [Newcombe et al. 2011b] proposes a novel and robust alternative to pairwise stereo for computing dense per-pixel depth maps. They pose the problem of computing depth at every pixel as a two-step iterative optimization problem, where the first step ensures consistency of depth estimates with photometric evidence integrated over multiple frames (cost volume) and the second step provides a first order spatial regularization. Inspired by DTAM, we also pose depth surface computation as a two-step iterative optimization problem but with some important improvements and simplifications. The major differences are: (1) relaxation of the frontal-planar assumption imposed by the first order regularization; (2) removal of the Huber loss on the smoothness term and appropriate compensation through spatially varying weights, which turns the optimization problem in the second step into linear least squares; (3) replacement of pixel based photometric error by patch based Zero-mean Normalized Cross-Correlation (ZNCC), which is more robust to brightness changes, while computing the cost volume.
Another approach to dense 3D reconstruction is to begin with a sparse point cloud, increase density by propagating measurements to nearby points using techniques like PMVS [Furukawa and Ponce 2010], and then fit a mesh to this semi-dense reconstruction [Kazhdan et al. 2006]. However, to expand the point cloud, techniques like PMVS perform only local operations that may not produce a globally consistent result. It is also difficult to enforce surface regularization.
Camera pose estimation is an integral component of all of the above methods. Direct monocular SLAM approaches like LSD-SLAM [Engel et al. 2014] and DTAM, which directly minimize photometric error to register the live frame with the last selected keyframe, have been shown to be more robust than feature tracking based approaches such as PTAM [Klein and Murray 2009]. LSD-SLAM also provides semi-dense depth estimates with an inverse depth variance which expresses belief about the accuracy of the estimates.
Multiview Segmentation: The dominant approach for multiview segmentation involves finding optimal labels for some subset of {pixels, superpixels, 3D points, voxels} that minimizes an energy function that encodes task specific priors and evidence from data.
The work most closely related to ours is [Vogiatzis et al. 2005], where 3D reconstruction is posed as a volumetric Graph Cut [Boykov and Kolmogorov 2004] optimization problem defined over a 6-connected grid of voxels. Foreground and background color distributions were modeled from images and the unary term for a voxel was defined in terms of the average posterior probabilities of the pixels that the voxel projected to. The foreground voxels were then projected onto the images to get segmentation masks and the color model was updated. Similarly, [Djelouah et al. 2013] propose an approach for segmenting an object from a video by performing graph cut over superpixels and sparse 3D points. They have explicit edges between 3D points and superpixels to ensure multiview coherence, and edges between superpixels across frames, which are related by optical flow, to model temporal consistency. Both approaches generate image segmentation masks, but while the first method generates a volumetric reconstruction by labeling voxels, the second only segments a sparse point cloud. Our formulation borrows the idea of jointly labeling superpixels and a voxel grid with edges between them to enforce consistency, but we incorporate rich information from volumetrically fused dense depth maps computed in early stages of our system and scene priors to get good quality reconstructions of objects with complicated shapes and varying material properties. Our system also differs from these approaches in that we use a richer set of labels that distinguish between the voxels that belong to the object of interest and those belonging to other objects in the scene, namely voxels behind the support surface. This distinction allows a more complete and accurate reasoning about the scene.
Like us, [Kowdle et al. 2012] attempts to exploit the rich evidence available in dense depth maps, using stereo and color based appearance cues to first compute dense piecewise planar depth maps for every view. Then they assign a foreground or background label to polygons in every image independently. Finally, they fuse these independent segmentations using multiview reasoning. Each of these steps is performed by expansion based energy minimization with carefully chosen terms. Unlike us, they only generate image segmentations. In addition, the piecewise planar assumption does not hold for many objects of interest.
An alternative to graph cut optimization was proposed in [Xiao et al. 2007], where joint segmentation is performed over 2D pixels and 3D scene points using affinity propagation. A graph is constructed over joint nodes, each of which comprises a 3D point and its image projections in different views. Similarity between nodes is defined using 3D features like spatial proximity and angle between normal directions, and 2D features like color differences and Kullback-Leibler divergence between patch histograms. This method, however, requires user initialization and includes only sparse 3D points in the optimization.
All the above approaches perform 3D segmentation by converting the problem to a discrete optimization. [Kolev et al. 2009] formulate volumetric reconstruction as the minimization of a continuous convex functional by relaxing binary labels to lie in the interval [0, 1], and they show quantitative improvement over discrete graph cut based approaches.
3 Approach
Our system takes as input an image sequence and automatically generates a volumetric reconstruction of the object, depth surfaces, and segmentation masks for selected keyframes, as shown in Fig. 1. There are four main stages:
Pose Estimation: A state-of-the-art VSLAM system, LSD-SLAM [Engel et al. 2014], is used to get camera poses, keyframe selection, and semi-dense depth maps which are used in later stages of the system.
Surface Modeling: The visible surfaces from each view are modeled as dense depth maps. The procedure, inspired by DTAM [Newcombe et al. 2011b], involves minimizing an objective function comprising a cost volume based data term regularized by a spatially varying linearized bending energy. We experiment with second order bending energies and compare to the first order regularization used in [Newcombe et al. 2011b].
Volumetric Fusion: Depth maps from the previous stage are fused together volumetrically to generate a signed function field over a voxel grid. The function field indicates the normalized and clipped signed distance of every voxel to its nearest surface. The fusion, however, introduces a bias in the function field and shifts the zero-crossing. PMVS [Furukawa and Ponce 2010] points, which are known to lie on the surface, are used to correct the bias by deforming the function field, where the deformation is modeled by a radial basis expansion.
Joint 2D-3D Segmentation: A joint segmentation of all keyframe images and a common voxel map is performed using graph cuts with expansion over the set of all pixel and voxel nodes. The pixel unary models the color of the object and background regions. The voxel unary enforces constraints on the empty regions (using the SDF) and incorporates the scene prior by encouraging voxels below the fitted plane to belong to the background. Pixel-pixel and voxel-voxel pairwise terms are used to impose smoothness in 2D (edge-aware) and 3D space respectively. Finally, edges connecting pixel nodes to sparse surface voxel nodes (identified by hashing sparse 3D points generated by LSD-SLAM into the voxel grid) implicitly enforce consistent labeling across views.
3.1 Pose Estimation
Camera poses and an initial sparse point cloud with visibility information for each point are required in later stages. Camera focal length, optical center, and radial distortion parameters are precomputed using a checkerboard pattern. The video frames are also corrected for radial distortion. LSD-SLAM [Engel et al. 2014] with loop closure is then used to select a set of keyframes, compute camera poses for every frame, and generate semi-dense depth maps with corresponding estimates of inverse depth variance for all keyframes.
3.2 Surface Modeling
Let the set of all pixels in the current reference image be denoted by . The pixel coordinates will be referred to as or by a 2-dimensional vector u. The problem of estimating the depth surface reduces to assigning a depth value to each pixel such that the assignment results in low photometric error, agrees with the depth measurements wherever available, and is mostly smooth barring depth discontinuities. In our system these depth measurements are provided by LSD-SLAM for a set of pixels . LSD-SLAM also provides an estimate of the inverse depth variance for all .
Inspired by DTAM [Newcombe et al. 2011b], we begin by constructing the cost volume which specifies the average photometric error of pixel u with respect to neighboring frames for some discrete depth candidates . Instead of simply using the difference between the pixel intensities, we use a measure based on Zero-mean Normalized Cross-Correlation (ZNCC) between patches of size . ZNCC has two advantages: (i) removal of the brightness constancy assumption; and (ii) robustness to false matches between pixels with similar intensities but different neighborhoods. For a selected set of neighboring frames , the cost volume is given by
(1) 
where is the ZNCC between a patch around u in and the patch around its reprojection in assuming depth . Depth estimation is now formulated as the following optimization problem
(2) 
where is the surface bending energy at u and is a spatially varying weight that prevents smoothing across edges. For ease of optimization, we restrict the choice of bending energy to those which can be approximated by finite differences. Ideally, bending energy must also be invariant to in-plane rotation of the camera. One such bending energy has the form
(3) 
Experiments show that dropping the cross term produces similar results but considerably speeds up the optimization procedure. A commonly used substitute, which is also used in [Newcombe et al. 2011b], is the squared norm of the gradient . However, this imposes an often misleading bias towards frontal-planar surfaces.
The optimization problem in Equation 2 is solved using a two-step iterative procedure, similar to [Newcombe et al. 2011b], by introducing auxiliary variables and a coupling term whose role in the optimization is controlled by parameter
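As an illustration of the patch measure used to build the cost volume earlier in this subsection, the following is a minimal sketch of ZNCC between two equally sized patches. The mapping from ZNCC to a cost in [0, 1] and the function names `zncc` and `zncc_cost` are assumptions made for this sketch, not the exact form of Equation 1.

```python
import math


def zncc(patch_a, patch_b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two flattened patches.

    Returns a value in [-1, 1]. The result is unchanged if either patch is
    scaled by a positive gain or shifted by a constant offset, which is why
    the measure does not need brightness constancy.
    """
    n = len(patch_a)
    if n == 0 or n != len(patch_b):
        raise ValueError("patches must be non-empty and of equal size")
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    dev_a = [a - mean_a for a in patch_a]
    dev_b = [b - mean_b for b in patch_b]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    denominator = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if denominator < eps:
        # At least one patch is constant: correlation is undefined, so the
        # patch pair is treated as uninformative.
        return 0.0
    return numerator / denominator


def zncc_cost(reference_patch, neighbor_patches):
    """Average matching cost over neighboring frames (assumed mapping).

    Each ZNCC score is mapped to (1 - ZNCC) / 2, so 0 is a perfect match
    and 1 is a perfectly anti-correlated patch.
    """
    costs = [(1.0 - zncc(reference_patch, p)) / 2.0 for p in neighbor_patches]
    return sum(costs) / len(costs)


# A 3x3 patch and a brighter, higher-contrast copy of it.
ref = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
brighter = [2.0 * v + 10.0 for v in ref]
score = zncc(ref, brighter)
```

In this toy case the brighter copy correlates perfectly with the reference, whereas a raw intensity difference would report a large error for the same pair.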
(4) 
where is a constant that is empirically determined and is a spatially varying weight that depends on the reliability of the measurement at u.
The first step involves minimizing the data term by pointwise search; while the second step solves for the surface that is consistent with the estimates in the first step, agrees with the measurement, and has low bending energy.
Step 1 involves a per-pixel pointwise search over the discrete depth candidates to obtain photo-consistent depth estimates
(5) 
(6) 
where is decreased after every iteration to increase the coupling between the discrete auxiliary variables and continuous depth estimates. This step is identical to [Newcombe et al. 2011b] except that our final estimate in the current iteration is obtained by median filtering the depth image to increase robustness to outliers and then performing a single Newton step using the numerical gradient of
(7) 
where is the learning rate. We use for our experiments. was used for RGBD Scenes dataset and was used for ObjectVideos dataset.
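The pointwise search in Step 1 can be sketched for a single pixel as follows. The sketch assumes the DTAM-style auxiliary energy, a weighted cost-volume entry plus a quadratic coupling term divided by twice the coupling parameter. The names `theta` and `lam` are placeholders chosen for this illustration, and the median filtering and Newton refinement described above are omitted.

```python
def pointwise_search(cost_row, depth_candidates, d_current, theta, lam=1.0):
    """Exhaustive search over discrete depth candidates for one pixel.

    cost_row[i] is the cost-volume entry for depth_candidates[i].
    d_current is the continuous depth estimate from the smoothing step.
    A smaller theta pulls the result towards d_current (stronger coupling).
    """
    best_depth = None
    best_energy = float("inf")
    for depth, cost in zip(depth_candidates, cost_row):
        energy = lam * cost + (d_current - depth) ** 2 / (2.0 * theta)
        if energy < best_energy:
            best_depth, best_energy = depth, energy
    return best_depth


candidates = [1.0, 2.0, 3.0]
costs = [0.5, 0.1, 0.4]  # photometric evidence favors depth 2.0

# Weak coupling: the photometric minimum wins.
loose = pointwise_search(costs, candidates, d_current=1.0, theta=100.0)
# Strong coupling: the estimate stays at the current smooth solution.
tight = pointwise_search(costs, candidates, d_current=1.0, theta=0.01)
```

With weak coupling the search returns the photo-consistent depth 2.0, and with strong coupling it returns 1.0, which mirrors how decreasing the coupling parameter over iterations ties the discrete estimates to the continuous surface.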
Step 2 enforces smoothness and consistency with depth measurements while being tied to the photo-consistent depth estimates obtained in the previous step.
(8) 
In contrast to [Newcombe et al. 2011b], we remove the Huber loss on the smoothness term which turns our optimization problem into linear least squares thereby improving efficiency. We deal with outliers and large depth discontinuities by employing two techniques: 1) median filtering the optimal solution to equation (5); 2) appropriate choice of spatially varying constants and :

Measurement weights: is chosen to be inversely proportional to , the inverse-depth variance estimate for each measurement provided by LSD-SLAM. We use as the constant of proportionality.

Smoothness weights: Since we remove the Huber loss on the smoothness term, we gradually increase with every iteration so that by the time the smoothness term dominates, is already very close to the true solution. The weight also depends upon edge strength to make smoothing edge-aware. Specifically we use
(9)
where is the edge strength obtained from the Structured Edge Detector [Dollár and Zitnick 2013], which was trained to detect object contours in images. For our experiments we used and . was used for RGBD Scenes dataset, whereas was used for ObjectVideos dataset.

Coupling weights: Similar to [Newcombe et al. 2011b], we decrease with every iteration to increase coupling
(10)
In addition, photometric error based spatially varying weights are used to propagate depth estimates from more confident regions to less confident ones such as textureless and specular surfaces.
(11)
where is the minimum photometric error stored in the cost volume at pixel u
(12)
In our experiments, we use , , and .
Note that with the exception of and , the same set of parameters generalized to two datasets with very different scales.
3.3 Volumetric Fusion
The depth maps estimated for the keyframes are fused together to get a volumetric reconstruction of the scene represented by an implicit surface function field defined on a voxel grid, . A desired property of such a function field is that takes a positive value in empty regions and a negative value otherwise. The surface mesh can then be extracted as the zero-level set of . Let be the Signed Distance Function (SDF) computed over due to keyframe. Given keyframes, the actual choice of the functional mapping from to is closely related to the choice of the mapping from the keyframe’s depth map to . Another important requirement from is that its zero-crossing lies on the object’s surface. If the depth maps were very accurate, then for any choice of which assigns a negative value to voxels behind the depth surface and positive otherwise, can simply be computed as follows
(13) 
The surface extracted from will be guaranteed to be accurate up to the resolution of the voxel grid. Subvoxel accuracy however still depends on the computation of . Naturally smooth and more accurate surfaces can be expected if is proportional to the distance of v from the closest depth surface.
However, since the estimated depth maps often have local errors in regions with textureless surfaces or specularities, the above computations need to be suitably modified for robustness. We propose to decompose volumetric fusion into two parts: i) robust fusion, and ii) zero-crossing or bias correction. The first step produces SDFs that are robust to errors in depth map or camera pose estimation at the cost of incurring a bias or a shift in the zero-crossing. The second step corrects for this bias by applying a smooth deformation field to force the function value close to zero at a set of points known to lie on the object surface. We compare our proposed approach to TSDF [Newcombe et al. 2011a] in Fig. 4. Fig. 5 shows the effect and importance of zero-crossing correction.
3.3.1 Robust Fusion
Let be the point where the ray from the camera center to v intersects the depth surface due to . is simply given by
(14) 
where with being the transformation from world coordinates to camera coordinates. Here is a constant used to clip the SDF value further away from the surface.
The SDFs from all keyframes are fused using a soft maximum operation defined as follows
(15) 
where is a hardness constant. We use for all our experiments. Soft maximum produces level surfaces that are naturally smooth and trades the property of preserving the zero-crossings for robustness to errors in depth and camera parameter estimates. Note that this is in contrast to [Newcombe et al. 2011a] where were truncated to zero beyond a depth of behind the depth surface in order to preserve the zero-crossing as it relied on a weighted averaging of s. Views in which a voxel was more directly in the line of sight were given a higher weight while fusing the SDFs. Such a weighting is implicit in our softmax formulation since a voxel tends to have higher in views where it is directly in the line of sight.
3.3.2 Zero-crossing Correction
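The robust fusion step can be sketched for a single voxel as below. Two assumptions are made because the exact forms of Equations 14 and 15 are not reproduced here: the per-view SDF is taken as the depth difference along the viewing ray, normalized by a clipping constant (`mu` here) and clipped to [-1, 1], and the soft maximum is taken as an exponentially weighted average controlled by a hardness constant (`alpha` here). Both names are placeholders for this illustration.

```python
import math


def clipped_sdf(surface_depth, voxel_depth, mu):
    """Normalized, clipped signed distance of a voxel for one keyframe.

    Positive when the voxel lies in front of the depth surface (empty
    space), negative when it lies behind the surface.
    """
    return max(-1.0, min(1.0, (surface_depth - voxel_depth) / mu))


def soft_maximum(values, alpha):
    """Exponentially weighted average that approaches max(values) as alpha grows.

    alpha = 0 gives the plain mean. Subtracting the maximum before
    exponentiating keeps the weights numerically stable.
    """
    peak = max(values)
    weights = [math.exp(alpha * (v - peak)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# One voxel seen from three keyframes (camera-space depths in arbitrary units).
per_view = [
    clipped_sdf(surface_depth=2.0, voxel_depth=2.1, mu=0.5),  # slightly behind
    clipped_sdf(surface_depth=2.0, voxel_depth=2.2, mu=0.5),  # behind
    clipped_sdf(surface_depth=2.0, voxel_depth=1.0, mu=0.5),  # far in front, clipped
]
fused = soft_maximum(per_view, alpha=5.0)
```

Under this definition the fused value always lies between the mean and the hard maximum of the per-view values, which is the source of both the smoothing and the zero-crossing shift discussed above.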
To correct for the bias in , we use a sparse set of 3D surface points, generated by PMVS [Furukawa and Ponce 2010]. We pose this problem as that of finding a deformation field such that
(16) 
We choose to parametrize by a radial basis expansion
(17) 
where is a set of control points, a denotes the coefficients of expansion, and is any radial basis kernel. For our experiments, we use with . For a voxel grid of maximum grid dimension , we use randomly sampled control points out of which half are sampled in the region and the other half outside this region. The coefficients are obtained by minimizing the following least squares objective function
(18) 
The first term enforces for which contains points randomly sampled from . The second term constrains the deformation to be zero at points which contains 500 points each sampled from and level sets of . The third term is used to regularize the deformation.
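The zero-crossing correction can be sketched on a toy 2D field as follows. This sketch assumes that the deformation is additive (the corrected field is the fused field plus the radial basis expansion), that the kernel is a Gaussian, and that the regularizer is a ridge penalty on the coefficients. The control points are placed deterministically rather than sampled at random, and all function names are hypothetical.

```python
import math


def gaussian_kernel(r, sigma):
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def design_row(x, controls, sigma):
    """Kernel responses of point x to every control point."""
    return [gaussian_kernel(math.dist(x, c), sigma) for c in controls]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_deformation(field, surface_pts, fixed_pts, controls, sigma, reg):
    """Ridge least squares for the expansion coefficients.

    Surface points ask for field(p) + D(p) = 0; fixed points ask for D(q) = 0.
    """
    rows, targets = [], []
    for p in surface_pts:
        rows.append(design_row(p, controls, sigma))
        targets.append(-field(p))
    for q in fixed_pts:
        rows.append(design_row(q, controls, sigma))
        targets.append(0.0)
    m = len(controls)
    # Normal equations: (Phi^T Phi + reg * I) a = Phi^T t
    A = [[sum(r[i] * r[j] for r in rows) + (reg if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    return solve_linear(A, b)


def deformation(x, coeffs, controls, sigma):
    return sum(a * k for a, k in zip(coeffs, design_row(x, controls, sigma)))


# Toy field whose zero-crossing sits at y = 0.3 although the true surface is y = 0.
biased_field = lambda p: p[1] - 0.3
surface = [(float(x), 0.0) for x in range(5)]
fixed = [(float(x), y) for x in range(5) for y in (-2.0, 2.0)]
controls = surface + fixed
coeffs = fit_deformation(biased_field, surface, fixed, controls, sigma=1.0, reg=1e-6)
corrected = [biased_field(p) + deformation(p, coeffs, controls, 1.0) for p in surface]
```

After the fit, the corrected field is close to zero at the known surface points while the deformation stays close to zero at the fixed points away from the surface, so the correction remains local to the zero-crossing.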
3.4 Joint 2D-3D Segmentation
Given a set of keyframe images , their camera pose estimates , fused and deformed SDF , and a sparse set of pixel-voxel correspondences, we want to label all 2D image pixels as and all voxels as . Note that even though the objects are placed on simple planar surfaces, achieving good quality segmentations is challenging because of one or more of the following reasons: (i) specularities; (ii) significant color variations on the object surface; (iii) errors in camera pose estimation; (iv) local errors in depth map estimates; (v) noisy PMVS point cloud; and (vi) error in support surface estimation. Another challenge is the computational complexity that arises as a result of dealing with pixels in all images and a dense voxel grid.
To keep computational complexity in check, we label superpixels instead of pixels. About superpixels are computed for each image using SLIC [Achanta et al. 2012]. We denote the label assigned to the superpixel and voxel by and , and the set of all pixels and voxels by and respectively. We formulate our objective as a joint 2D-3D segmentation and minimize it using graph cuts with expansion. The objective function is given by
(19) 
The superpixel unary term encodes object color, a scene prior that is informative about background superpixels, and consistency of image segmentations with volumetric reconstruction. The voxel unary term, , encodes information about the empty regions using the SDF, location of background voxels using the scene prior pertaining to a flat support surface, and consistency of volumetric reconstruction with image segmentations. The superpixel binary term, , encodes edge-aware smoothness between labeling of neighboring superpixels. Similarly, the voxel binary term encodes smoothness in voxel labeling. Finally, the superpixel-voxel pairwise term, , ensures consistency between surface voxels and the superpixels they project to in different views. Indirectly, ensures consistent superpixel labeling across views. Next we describe the graph structure, initialization, and each of the unary and pairwise energy terms in detail.
3.4.1 Graph Structure
There is a node in the graph, , for every superpixel in and every voxel in . We connect the neighboring voxels using a connected grid. We insert an edge between two superpixels if they both lie in the same image and share a boundary. We would ideally like to connect each superpixel with all voxels that project inside the superpixel, are in direct line of sight in the corresponding view, and lie on the surface of the object. However, this would require knowing the surface voxels and their visibility information. We can compute this information for a sparse set of points using the semi-dense depth maps provided by LSD-SLAM. For each image we backproject the LSD-SLAM depth map and hash the points into the voxel grid . This gives us a sparse set of correspondences between superpixels and voxels. The region to be discretized by the voxel grid (our region of interest or ROI) is set to standard deviations computed from the LSD-SLAM point cloud along each dimension. The resolution of the grid is chosen such that the largest dimension is divided into voxels.
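The hashing of backprojected points into the voxel grid, which yields the sparse superpixel-voxel correspondences, can be sketched as follows. The ROI origin, the voxel size, the grid shape, and the superpixel identifiers are illustrative values, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def voxel_index(point, roi_min, voxel_size, grid_shape):
    """Integer voxel coordinates of a 3D point, or None if outside the ROI."""
    idx = tuple(
        int(math.floor((coord - low) / voxel_size))
        for coord, low in zip(point, roi_min)
    )
    if all(0 <= i < n for i, n in zip(idx, grid_shape)):
        return idx
    return None


def hash_points(points_with_superpixel, roi_min, voxel_size, grid_shape):
    """Map each occupied voxel to the set of superpixels that observe it.

    points_with_superpixel is an iterable of (point, superpixel_id) pairs,
    where point is a backprojected 3D point in world coordinates and
    superpixel_id identifies the superpixel containing the source pixel.
    """
    links = defaultdict(set)
    for point, superpixel_id in points_with_superpixel:
        idx = voxel_index(point, roi_min, voxel_size, grid_shape)
        if idx is not None:
            links[idx].add(superpixel_id)
    return dict(links)


observations = [
    ((0.1, 0.1, 0.1), "img0_sp3"),
    ((0.4, 0.2, 0.3), "img1_sp7"),  # same voxel seen from another view
    ((1.6, 0.0, 1.9), "img0_sp5"),
    ((2.5, 0.0, 0.0), "img1_sp2"),  # outside the ROI, dropped
]
edges = hash_points(observations, roi_min=(0.0, 0.0, 0.0),
                    voxel_size=0.5, grid_shape=(4, 4, 4))
```

A voxel linked to superpixels from several images is what couples the labelings of those images in the joint graph.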
3.4.2 Initialization
We first extract the zero-level set of the fused and deformed SDF and fit a plane to the points to identify the flat surface that supports the object. All voxels whose signed distance to the plane (negative denoting behind the plane) is less than a small positive threshold, , are set to . All voxels above the surface which contain at least one point from the LSD-SLAM point cloud are marked as and the remaining ones are marked . We initialize the segmentation with a per-image trimap (see Fig. 6), computed by projecting an overestimate, , and an underestimate, , of the 3D volume onto each image. is obtained by selecting the largest connected component from the set of voxels with and distance to the plane greater than . We project onto each image to get segmentation masks. Superpixels that have more than pixels in the background region of the segmentation mask are permanently set to . Next we create a set of voxels . is projected onto each image plane to get segmentation masks and each superpixel with more than pixels in the foreground is permanently marked . The remaining superpixels are randomly initialized. Note that only the superpixels that lie in this narrow band around the object silhouette are being solved for by graph cuts. A good initialization must include as few voxels on or below the plane in as possible. For this is set to times the mean of the standard deviations along each dimension of the points used to fit the plane.
3.4.3 Energy Terms
Superpixel Unary comprises 3 components: encodes foreground and background colors; ensures that superpixels which correspond to surfaces farther away from the ROI are assigned to background; and ensures consistency of silhouettes with volumetric reconstruction.
For the color term, separate GMMs are used to model the color distribution of foreground and background in Lab color space across all keyframe images. We learn both GMMs with 10 components and learn a full covariance matrix. superpixels are randomly sampled across images from the currently labeled foreground or background regions and from them pixels are sampled to learn each GMM model. Next, the posterior probabilities, denoted by , are computed for each pixel using the learned GMMs. In order to limit the influence of the color term, the probabilities are scaled and truncated by using the following function
(20) 
Let denote the set of pixels in the superpixel. Then the color based superpixel unary term is given by
(21) 
To compute , we backproject pixels using the depth map and count how many lie inside the ROI. Let denote the fraction of pixels in which fall outside the ROI. Then the ROI based term is given by
(22) 
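The counting step behind the ROI term can be sketched as below. The sketch assumes a pinhole camera and an axis-aligned box ROI expressed in camera coordinates; in practice the backprojected points would first be transformed by the camera pose into the frame in which the ROI is defined. The names `backproject` and `fraction_outside_roi` are illustrative, and the energy built from the resulting fraction is not reproduced here.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole backprojection of pixel (u, v) with the given depth
    into camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def fraction_outside_roi(pixels, depth_map, intrinsics, roi_min, roi_max):
    """Fraction of a superpixel's pixels whose backprojection falls outside
    an axis-aligned ROI box (assumed shape of the ROI).
    pixels: iterable of (u, v); depth_map[v][u] is the depth at that pixel."""
    fx, fy, cx, cy = intrinsics
    outside = 0
    total = 0
    for u, v in pixels:
        point = backproject(u, v, depth_map[v][u], fx, fy, cx, cy)
        inside = all(lo <= c <= hi for c, lo, hi in zip(point, roi_min, roi_max))
        if not inside:
            outside += 1
        total += 1
    return outside / total if total else 0.0

# Toy example: one of four pixels has a depth far beyond the ROI.
depth_map = [[1.0, 1.0],
             [5.0, 1.0]]
superpixel = [(0, 0), (1, 0), (0, 1), (1, 1)]
frac = fraction_outside_roi(superpixel, depth_map, (1.0, 1.0, 0.0, 0.0),
                            roi_min=(-2.0, -2.0, 0.0), roi_max=(2.0, 2.0, 2.0))
```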
For the silhouette based term, the current voxels labeled as object are used to render segmentation masks. Let denote the fraction of pixels in labeled as foreground. Then silhouette consistency term is given by
(23) 
Finally . For our experiments we used and . In our experiments, we found that using an average of over all past iterations of graph cut leads to more stable solutions.
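Averaging a unary term over all past graph cut iterations does not require storing every past value; an incremental mean suffices. The following is a minimal sketch under that assumption, with `RunningAverage` and the dictionary-of-costs interface being illustrative choices rather than part of our system.

```python
class RunningAverage:
    """Keeps, per variable (e.g. superpixel id), the mean of all values
    seen so far, updated incrementally after each iteration."""

    def __init__(self):
        self.counts = {}
        self.means = {}

    def update(self, costs):
        """costs: dict mapping variable id -> unary value for this iteration.
        Returns the averaged values to use in the next graph cut."""
        for key, value in costs.items():
            n = self.counts.get(key, 0) + 1
            previous = self.means.get(key, 0.0)
            self.counts[key] = n
            self.means[key] = previous + (value - previous) / n
        return dict(self.means)

avg = RunningAverage()
avg.update({"sp0": 1.0, "sp1": 4.0})
smoothed = avg.update({"sp0": 3.0, "sp1": 0.0})
```

Because each iteration's unaries depend on the previous labeling, averaging damps oscillation between labelings that would otherwise alternate from one iteration to the next.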
Voxel Unary has 3 components: encodes the information contained in the SDF, ; encodes our scene prior; and ensures that voxel labeling is consistent with superpixel labeling. Note that the SDF helps in identifying empty regions () but cannot distinguish between the object of interest and other background regions, such as those below the support surface or regions further away from the object, since the SDF attains a positive value in both cases. This complementary information is provided by the other two components, and .
First, for each voxel v, a normalized distance to the fitted plane is computed as follows
(24) 
where denotes the distance of v from the plane and is the standard deviation of the distance of all voxels from the plane. Next, all voxels with are permanently set to . All voxels with are permanently set to . For the remaining voxels, v the based energy term is given by
(25) 
The scene prior energy term, , penalizes labeling of voxels above the support surface as background and those below the surface as either or . It is defined as
(26) 
In order to compute , we first perform silhouette carving to get a set of voxels that project onto the superpixels labeled in all keyframe views. Let the complementary set be . is then given by
(27) 
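The silhouette carving step can be sketched as follows. This is an illustration under stated assumptions: each view is represented by a hypothetical `project` callable that maps a voxel to integer pixel coordinates (or `None` if it is not visible) together with a binary mask of the superpixels carrying the target label, and a voxel that projects outside an image is treated as carved.

```python
def in_mask(pixel, mask):
    """True if pixel = (u, v) lies inside the image and mask[v][u] is set."""
    if pixel is None:
        return False
    u, v = pixel
    return 0 <= v < len(mask) and 0 <= u < len(mask[v]) and bool(mask[v][u])

def carve(voxels, views):
    """Split voxels into those that project onto the mask in all views
    (the carved hull) and the complementary set.
    views: list of (project, mask) pairs."""
    kept, removed = [], []
    for voxel in voxels:
        if all(in_mask(project(voxel), mask) for project, mask in views):
            kept.append(voxel)
        else:
            removed.append(voxel)
    return kept, removed

# Toy example with two orthographic views of a tiny grid.
front = (lambda p: (p[0], p[1]), [[True, False],
                                  [True, True]])
top = (lambda p: (p[0], p[2]), [[True, True],
                                [False, True]])
kept, removed = carve([(0, 0, 0), (1, 0, 0), (0, 1, 1)], [front, top])
```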
The final voxel unary for the variable voxels is given by . For all our experiments we have set , , and . Similar to , we also use the average over all past iterations.
Superpixel Binary term imposes edge-aware smoothness constraints on the superpixel labeling. For a pair of superpixels with indices that share a boundary, with the set of boundary pixels denoted by , the superpixel pairwise energy term is given by
(28) 
where denotes the contour edge strength of pixel p obtained using Structured Edge Detector [Dollár and Zitnick 2013].
Voxel Binary term imposes smoothness constraints on the voxels. For a pair of voxels indexed by , the voxel binary term is defined as .
Superpixel-Voxel Pairwise term ensures that for every superpixel and voxel with an edge between them, with indices , superpixel and voxel labels are consistent with each other. Note that only connects surface voxels with superpixels. The surface voxels can never be hence both and lie in . Given this restriction on the labels .
3.4.4 Details of expansion
We run iterations of optimization with expansion steps per iteration or until the sum of the number of label swaps for all expansion moves in an iteration falls below , whichever comes first. To make the set of labels the same for pixels and voxels, we assign a very high cost to for any superpixel. The color models are updated in each iteration using the current superpixel labeling hence needs to be recomputed in each iteration. Besides this only and are updated in each iteration since they depend on the current voxel map and segmentation masks.
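The control flow of the optimization described above can be sketched as below. The sketch assumes one expansion step per label in each iteration; `expand`, `refresh`, `max_iters`, and `min_swaps` are hypothetical names, with `expand(label)` standing for a single expansion move that returns the number of variables whose label changed.

```python
def run_expansion(labels, expand, max_iters, min_swaps, refresh=None):
    """Outer loop of an expansion-move optimization.
    Stops after max_iters iterations, or as soon as the total number of
    label swaps in an iteration falls below min_swaps."""
    swaps_per_iteration = []
    for _ in range(max_iters):
        if refresh is not None:
            # Recompute the terms that depend on the current labeling,
            # e.g. color models and silhouette consistency.
            refresh()
        swaps = sum(expand(label) for label in labels)
        swaps_per_iteration.append(swaps)
        if swaps < min_swaps:
            break
    return swaps_per_iteration

# Toy example: a fake expansion move whose swap counts shrink over time.
fake_counts = iter([5, 4, 3, 1, 0, 0, 9, 9, 9])
history = run_expansion(["a", "b", "c"], lambda label: next(fake_counts),
                        max_iters=5, min_swaps=2)
```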
3.5 Postprocessing
As an important postprocessing step, we select the largest connected component from the voxels labeled as . While this could be used to generate a surface mesh of the object directly, we found that qualitatively better results are obtained by using it as a mask to set to for all v such that , extract its zero level set, and perform mesh smoothing using [Zhang and Haniza 2006].
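Selecting the largest connected component of the labeled voxels can be sketched with a breadth-first flood fill. The sketch assumes 6-connectivity on integer voxel indices; the connectivity actually used is not specified in the text.

```python
from collections import deque

# Face-adjacent neighbors (6-connectivity, an assumption of this sketch).
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def largest_component(voxels):
    """Return the largest 6-connected set among the given voxel indices."""
    remaining = set(voxels)
    best = set()
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        if len(component) > len(best):
            best = component
    return best

object_voxels = largest_component([(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5)])
```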
4 Experiments
We evaluate our system qualitatively (Fig. 9 and Fig. 7) with displays of 3D reconstructed objects from our ObjectVideos dataset. We also quantitatively evaluate dense depth map estimation (Fig. 8) with the RGBD Scenes V2 Dataset [Henry et al. 2013] and segmentation accuracy (Tab. 1) on key frames from the ObjectVideos dataset.
4.1 Object Reconstruction
Our ObjectVideos dataset consists of 12 videos of 10 objects captured using a commercial mobile phone camera (Samsung Galaxy S4). Many of the objects have complex shapes, low-texture surfaces, and specular materials. Thus, while many graphics and vision papers use carefully designed experimental setups and/or objects with smooth Lambertian surfaces that satisfy model assumptions, we attempt to reconstruct common objects filmed with a typical camera in a casual process. In Fig. 7, we show results of our method “SDF + 2D3D Voxel” with comparisons to baseline methods: (i) Poisson surface reconstruction [Kazhdan et al. 2006] using the PMVS [Furukawa and Ponce 2010] point cloud; (ii) the volumetric graph cut method of [Vogiatzis et al. 2005]; and (iii) the zero level set of the deformed SDF after selecting an appropriate region using aggressive and conservative thresholds on .
4.2 Dense Depth Estimation
To evaluate our dense depth map estimation procedure we use the available camera poses to estimate depth maps for five selected key frames of each video in the RGBD Scenes V2 dataset. We compare three different smoothness priors: first order (); second order (eq. 3 without the cross term); and second order rotation invariant bending energy (eq. 3). We also evaluate the effect of increasing the number of measurements (known depth points) from 100 to 1000 (see Fig. 3 for a qualitative comparison on RGBD Scenes V2 dataset). We compare to bilinear interpolation of known depth values as a baseline. For indoor scenes that consist of many planar surfaces, the baseline works quite well, but it provides poor estimates for objects that have curved surfaces. Fig. 2 shows the qualitative comparison of depth estimation with different smoothness priors on 2 videos from our ObjectVideos dataset.
In Fig. 8, we plot root mean squared error as a function of ground truth depth. All the variants of our approach beat the baseline for small depths, which is most relevant for our application. For large depths, nonlinear discretization used for the cost volume results in high quantization errors, causing the dense depth estimation to underperform the baseline. While inverse depth and log space discretization are common, in our experiments we found that scaling to the expected range of depth values, where is the number of discrete values, performed best.
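The growth of quantization error with depth under nonlinear discretization can be illustrated by comparing the spacing between consecutive depth levels. The sketch below covers only the linear, inverse depth, and log space schemes named above; it does not reproduce the scaling we found to work best, and the depth range used is arbitrary.

```python
import math

def linear_levels(d_min, d_max, n):
    """n depth levels spaced uniformly in depth."""
    return [d_min + (d_max - d_min) * i / (n - 1) for i in range(n)]

def inverse_depth_levels(d_min, d_max, n):
    """n depth levels spaced uniformly in inverse depth."""
    near, far = 1.0 / d_min, 1.0 / d_max
    return [1.0 / (near + (far - near) * i / (n - 1)) for i in range(n)]

def log_levels(d_min, d_max, n):
    """n depth levels spaced uniformly in log depth."""
    lo, hi = math.log(d_min), math.log(d_max)
    return [math.exp(lo + (hi - lo) * i / (n - 1)) for i in range(n)]

def gaps(levels):
    """Spacing between consecutive levels; half a gap bounds the
    quantization error for depths falling in that interval."""
    return [b - a for a, b in zip(levels, levels[1:])]

inv_gaps = gaps(inverse_depth_levels(1.0, 10.0, 10))
log_gaps = gaps(log_levels(1.0, 10.0, 10))
lin_gaps = gaps(linear_levels(1.0, 10.0, 10))
```

With 10 levels between depths 1 and 10, inverse depth spacing places the last two levels at roughly 5 and 10, so a far surface can be off by several units, while the same scheme resolves near surfaces to about a tenth of a unit.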
Among smoothness priors, the rotation invariant second order bending energy performs the best, beating the second order bending energy by a small but noticeable amount. However, the latter may be preferred because it requires less computation. Both the second order energies perform significantly better than the first order energy.
Finally, as the number of measurements increases, the error decreases, as expected, especially for regions further from the camera.
4.3 Object Segmentation
Table 1: Segmentation accuracy (IoU) on annotated keyframes of the ObjectVideos dataset.
Method            dragon  stapler  rubik_cube  paper_punch  keyboard  teabox  books  helmet  hedgehog1  hedgehog2  godzilla1  godzilla2  mean  median
Fused SDF         0.86    0.79     0.83        0.73         0.74      0.74    0.88   0.84    0.82       0.73       0.82       0.78       0.80  0.80
Deformed SDF      0.87    0.82     0.85        0.81         0.76      0.54    0.81   0.79    0.63       0.80       0.58       0.82       0.76  0.81
Volumetric GC     0.90    0.86     0.76        0.88         0.68      0.85    0.35   0.37    0.86       0.86       0.81       0.90       0.76  0.85
2D GC             0.90    0.90     0.95        0.93         0.76      0.94    0.88   0.84    0.83       0.88       0.80       0.91       0.88  0.89
2D3D GC Pixel     0.91    0.90     0.95        0.93         0.78      0.87    0.92   0.87    0.85       0.85       0.83       0.91       0.88  0.88
2D3D GC Voxel     0.91    0.90     0.93        0.88         0.73      0.84    0.92   0.88    0.87       0.82       0.80       0.88       0.86  0.88
GC Select         0.90    0.89     0.91        0.88         0.74      0.82    0.92   0.87    0.86       0.80       0.79       0.86       0.85  0.87
SDF + 2D3D Voxel  0.90    0.89     0.91        0.88         0.73      0.82    0.91   0.87    0.86       0.79       0.79       0.85       0.85  0.87
We also compare variants of our algorithm on 2D pixel segmentation. These results give a sense of 3D segmentation and reconstruction quality, but multiple 3D segmentations are consistent with a single 2D segmentation (e.g., the visual hull and the true shape have the same silhouettes) and different kinds of 2D segmentation errors have different impact on 3D reconstruction. In Table 1, we compare segmentation performance, measured by intersection over union (IoU) of ground truth and estimated segmentation masks over annotated keyframes, for several variants of our technique:

Fused SDF: Segmentations obtained by backprojecting the largest connected component of interior voxels above the fitted plane from the signed distance function (SDF) created from depth maps.

Deformed SDF: Same as above, but after deforming the SDF so that its surface lies close to PMVS points.

Volumetric GC: Our implementation of the volumetric graph cut method of [Campbell et al. 2010].

2D GC: Image graph cuts cosegmentation using color terms computed from multiple images but without voxel segmentation, the pixel-voxel constraints, or the term.

2D3D GC Pixel: Pixel segmentations resulting from our joint 2D3D graph cuts method.

2D3D GC Voxel: Backprojects segmented voxels using our joint 2D3D method.

SDF + 2D3D Voxel: Backprojects the 3D volume obtained by slightly dilating the voxels obtained from 2D3D segmentation, intersecting with Deformed SDF, and smoothing. This method provides the best qualitative results and is used for reconstruction results.
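The IoU measure used to score each variant can be sketched as below. The per-mask computation is standard; averaging the per-keyframe scores to obtain one number per video is an assumption of this sketch, since the aggregation is not spelled out in the text.

```python
def iou(predicted, truth):
    """Intersection over union of two binary masks given as nested lists
    of equal shape. Two empty masks are treated as a perfect match."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0

def mean_iou(mask_pairs):
    """Average IoU over (predicted, truth) pairs, one pair per keyframe."""
    scores = [iou(pred, truth) for pred, truth in mask_pairs]
    return sum(scores) / len(scores)

pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [1, 0]]
score = iou(pred, truth)
```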
The SDF deformation yields better results in most cases, but for teabox, books, helmet, hedgehog1, and godzilla1, the results are much worse due to large errors in camera pose estimation that decreased accuracy of the PMVS point cloud used for deformation. The PMVS results are included in the supplementary video. Our graph cuts methods are robust to these errors, outperforming “Fused SDF” and “Deformed SDF” by large margins.
Our joint 2D3D segmentation (“2D3D GC Pixel”) performs as well as or better than 2D-only segmentation (“2D GC”) in all but two cases, supporting the value of joint segmentation. The main case in which 2D-only segmentation performs better is teabox, where errors in camera pose estimation harm the 2D3D result. Although backprojecting voxels from the 2D3D segmentation (“2D3D GC Voxel”) slightly underperforms the 2D3D pixel segmentation (partly due to coarser voxel discretization), the voxel-based 3D model is better than that obtained by shape carving from 2D segmentations because shape carving is sensitive to errors in individual images. Our most qualitatively pleasing 3D models are produced by combining meshes from “Deformed SDF” and “2D3D GC Voxel”, but when backprojected to images, the resulting silhouettes are slightly less accurate than “2D3D GC Voxel”.
As an alternative to graph cut optimization, we tried Spectral Matting [Levin et al. 2008] for performing independent image segmentations using the trimap. Spectral Matting first identifies a basis set of matting components, where each component is obtained as a linear combination of the smallest eigenvectors of a Laplacian matrix. It then uses the trimap initialization to assign each component to foreground or background and constructs an alpha matte. However, we found that obtaining consistent segmentations across views requires computing eigenvectors of a prohibitively large Laplacian matrix defined over the pixels in all keyframes and all voxels.
5 Failure Modes and Limitations
Based on qualitative (see Fig. 10) and quantitative evaluation, we have identified a set of failure modes for our approach.
A major source of error that affects all stages of our system is camera pose estimation. Camera pose estimates directly affect the accuracy of the constructed cost volume and depth measurements for computing dense depth maps, the accuracy of the PMVS point cloud, the alignment of depth maps during volumetric fusion, the deformation of the SDF using the PMVS point cloud, and the superpixel-voxel consistency constraints during joint 2D3D segmentation. The pose estimation errors are largely due to severe occlusion and violations of the brightness constancy assumption caused by textureless and specular surfaces. The videos in the ObjectVideos dataset are also collected in a casual fashion with blurry frames and large displacements between consecutive frames.
Secondly, errors in estimated depth maps due to specular and textureless surfaces adversely affect the quality of the fused SDF, which is the driving force behind the 2D3D segmentation mechanism. We have demonstrated some degree of robustness in reconstructing textureless and specular surfaces such as in stapler, paper_punch, teabox, and helmet, but explicit removal of specularities would further improve performance.
As a byproduct of the above two, incorrect estimation of the support surface is a main source of error while reconstructing flat objects like the keyboard.
Our approach also has some limitations. Our method is targeted at the reconstruction of small objects for 3D printing or augmented reality applications and hence applies to a scale of objects and scenes for which computation on a discrete grid of voxels is affordable. We are also limited by the computation of the cost volume for depth estimation, which grows linearly in the number of discrete depth values used. We limit ourselves to planar support surfaces with limited background clutter. Finally, our method cannot recover fine geometric details such as the scale pattern on the dragon. Such detail recovery would require shape from shading and material analysis, which are still open research problems.
6 Conclusion
In this work, we proposed a system for 3D reconstruction of an object from a video taken with a handheld mobile phone camera. Our three major contributions are: (1) an improved objective function for dense depth map computation; (2) robust estimation of an implicit surface using a softmax signed distance function with zero-crossing correction; and (3) a method for joint 2D3D segmentation. Through qualitative and quantitative results we demonstrate robustness to textureless surfaces, specularities, and errors in camera pose estimation. Potential directions for future work include extending the proposed approach to category-specific reconstruction using data-driven priors and recovering high-frequency details in the reconstruction through shape from shading and material analysis.
Acknowledgements
This research was supported in part by NSF Award 1421521. We are also thankful to David Forsyth for helpful discussion on linearized bending energy and smoothness priors, and to Jason Rock for suggesting the region of interest based component of superpixel unary.
References
[Achanta et al. 2012] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Susstrunk, S. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, 11, 2274–2282.
[Boykov and Kolmogorov 2004] Boykov, Y., and Kolmogorov, V. 2004. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. Pattern Analysis and Machine Intelligence, IEEE Transactions on 26, 9, 1124–1137.
[Campbell et al. 2010] Campbell, N. D., Vogiatzis, G., Hernández, C., and Cipolla, R. 2010. Automatic 3D object segmentation in multiple views using volumetric graph-cuts. Image and Vision Computing 28, 1, 14–25.
[Djelouah et al. 2013] Djelouah, A., Franco, J.-S., Boyer, E., Le Clerc, F., and Perez, P. 2013. Multi-view object segmentation in space and time. In Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE, 2640–2647.
[Dollár and Zitnick 2013] Dollár, P., and Zitnick, C. L. 2013. Structured forests for fast edge detection. In Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE, 1841–1848.
[Engel et al. 2014] Engel, J., Schöps, T., and Cremers, D. 2014. LSD-SLAM: Large-scale direct monocular SLAM. In Computer Vision–ECCV 2014. Springer, 834–849.
[Furukawa and Ponce 2010] Furukawa, Y., and Ponce, J. 2010. Accurate, dense, and robust multi-view stereopsis. Pattern Analysis and Machine Intelligence, IEEE Transactions on 32, 8, 1362–1376.
[Henry et al. 2013] Henry, P., Fox, D., Bhowmik, A., and Mongia, R. 2013. Patch volumes: Segmentation-based consistent mapping with RGB-D cameras. In 3D Vision - 3DV 2013, 2013 International Conference on, IEEE, 398–405.
[Kazhdan et al. 2006] Kazhdan, M., Bolitho, M., and Hoppe, H. 2006. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, vol. 7.
[Klein and Murray 2009] Klein, G., and Murray, D. 2009. Parallel tracking and mapping on a camera phone. In Mixed and Augmented Reality, 2009. ISMAR 2009. 8th IEEE International Symposium on, IEEE, 83–86.
[Kolev et al. 2009] Kolev, K., Klodt, M., Brox, T., and Cremers, D. 2009. Continuous global optimization in multiview 3D reconstruction. International Journal of Computer Vision 84, 1, 80–96.
[Kowdle et al. 2012] Kowdle, A., Sinha, S. N., and Szeliski, R. 2012. Multiple view object cosegmentation using appearance and stereo cues. In Computer Vision–ECCV 2012. Springer, 789–803.
[Levin et al. 2008] Levin, A., Rav-Acha, A., and Lischinski, D. 2008. Spectral matting. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30, 10, 1699–1712.
[Newcombe et al. 2011a] Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., and Fitzgibbon, A. 2011. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, IEEE, 127–136.
[Newcombe et al. 2011b] Newcombe, R. A., Lovegrove, S. J., and Davison, A. J. 2011. DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2320–2327.
[Ondruska et al. 2015] Ondruska, P., Kohli, P., and Izadi, S. 2015. MobileFusion: Real-time volumetric surface reconstruction and dense tracking on mobile phones.
[Pradeep et al. 2013] Pradeep, V., Rhemann, C., Izadi, S., Zach, C., Bleyer, M., and Bathiche, S. 2013. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera. In Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on, IEEE, 83–88.
[Rother et al. 2004] Rother, C., Kolmogorov, V., and Blake, A. 2004. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23, 3, 309–314.
[Vogiatzis et al. 2005] Vogiatzis, G., Torr, P. H., and Cipolla, R. 2005. Multi-view stereo via volumetric graph-cuts. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, IEEE, 391–398.
[Xiao and Quan 2009] Xiao, J., and Quan, L. 2009. Multiple view semantic segmentation for street view images. In Computer Vision, 2009 IEEE 12th International Conference on, IEEE, 686–693.
[Xiao et al. 2007] Xiao, J., Wang, J., Tan, P., and Quan, L. 2007. Joint affinity propagation for multiple view segmentation. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, IEEE, 1–7.
[Zhang and Haniza 2006] Zhang, Y., and Haniza, A. 2006. Vertex-based anisotropic smoothing of 3D mesh data. In Electrical and Computer Engineering, 2006. CCECE’06. Canadian Conference on, IEEE, 202–205.