Semantic 3D Reconstruction with Continuous Regularization and Ray Potentials Using a Visibility Consistency Constraint
Abstract
We propose an approach for dense semantic 3D reconstruction which uses a data term that is defined as potentials over viewing rays, combined with continuous surface area penalization. Our formulation is a convex relaxation which we augment with a crucial nonconvex constraint that ensures exact handling of visibility. To tackle the nonconvex minimization problem, we propose a majorizeminimize type strategy which converges to a critical point. We demonstrate the benefits of using the nonconvex constraint experimentally. For the geometryonly case, we set a new state of the art on two datasets of the commonly used Middlebury multiview stereo benchmark. Moreover, our generalpurpose formulation directly reconstructs thin objects, which are usually treated with specialized algorithms. A qualitative evaluation on the dense semantic 3D reconstruction task shows that we improve significantly over previous methods.
1 Introduction
One of the major goals in computer vision is to compute dense 3D geometry from images. Recently, also approaches that jointly reason about the geometry and semantic segmentation have emerged [11]. Due to the noise in the input data often strong regularization has to be performed. Optimizing jointly over 3D geometry and semantics has the advantage that the smoothness for a surface can be chosen depending on the involved semantic labels and the normal direction to the surface. Eventually, this leads to more faithful reconstructions that directly include a semantic labeling.
Posing the reconstruction task as a volumetric segmentation problem [5] is a widely used approach. A volume gets segmented into occupied and free space. In case of dense semantic 3D reconstruction, the occupied space label is replaced by a set of semantic labels [11]. To get smooth, noisefree reconstructions, the final labeling is normally determined by energy minimization. The formulation of the energy comes with the challenge that the observations are given in the image space but the reconstruction is volumetric. Therefore each pixel of an image contains information about a ray composed out of voxels. This naturally leads to energy formulations with potential functions that depend on the configuration of a whole ray. Including such potentials in a naive way leads to (on current hardware) infeasible optimization problems. Hence, many approaches try to approximate such a potential. One often utilized strategy is to derive a pervoxel unary potential (cost for assigning a specific label to a specific voxel). However, this is only possible in a restricted setting and under a set of assumptions that often do not hold in practice. By modeling the true ray potential, more faithful reconstructions are obtained [28]. Thus, efficient ways to minimize energies with ray potentials, while at the same time being able to benefit from the joint formulation of 3D modeling and semantic segmentation, are desired.
In this work, we propose an energy minimization strategy for ray potentials that can be directly used together with continuously inspired surface regularization approaches and hence does not suffer from metrication artifacts [13], common to discrete formulations on grid graphs. When using our ray potential formulation for dense semantic 3D reconstruction, it additionally allows for the usage of classspecific anisotropic smoothness priors. Continuously inspired surface regularization approaches are formulated as convex optimization problems. We identify that a convex relaxation for the ray potential is weak and unusable in practice. We propose to add a nonconvex term that handles visibility exactly and optimize the resulting energy by linearly majorizing the nonconvex part. By regularly reestimating the linear majorizer during the optimization, we devise an energy minimization algorithm with guaranteed convergence.
1.1 Related Work
Visibility relations in 3D reconstructions were studied for computing a single depth map out of multiple images [16, 17]. To generate a full consistent 3D model from many depth maps, a popular approach is posing the problem in the volume [5]. To handle the noise in the input data a surface regularization term is added [21, 41]. A discrete graphbased formulation is used in [21] and continuous surface area penalization in [41]. One of the key questions is how to model the data term. Starting from depth maps, [21, 41] model the 2.5D data as pervoxel unary potentials. Such a modeling utilizes information contained in the depth map only partially. Using a discrete graph formulation [18, 36] propose to model the free space between the camera center and the measured depth with pairwise potentials.
Another approach to modeling the data term is to directly use a photoconsistencybased smoothness term [31, 12, 14]. To resolve the visibility relations, image silhouettes are used. This is done either in the optimization as a constraint, meaning that the reconstruction needs to be consistent with the image silhouettes [31, 14], or by deriving pervoxel occupancy probability [12]. Silhouette consistency is achieved through a discrete graphcut optimization in [31], and with a convexrelaxationbased approach in the continuous domain in [14]. The resulting relaxed problem in the latter case is not tight and hence does not generate binary solutions. Therefore, to guarantee silhouette consistency a special thresholding scheme is required. Handling visibility has also been done in meshbased photoconsistency minimization [6].
To fully address the 2.5D nature of the input data, the true ray potential should be used, meaning the data cost depends on the first occupied voxel along the ray. What happens behind is unobserved and hence has no influence. This was formulated in [27, 10, 23, 24, 33] as a problem of finding a voxel labeling in terms of color and occupancy such that the first occupied voxel along a ray has a similar color as the pixel it projects to. One of the limitations all these works share is that they only compare colors of single pixels, which often does not give a strong enough signal to recover weakly textured areas. We use depth maps that are computed based on comparing image patches and interpret them as noisy input data containing outliers. We use regularization to handle the noise and outliers in the input data, but in contrast to other approaches with ray potentials that use a purely discrete graphbased formulation [10, 23, 24] our proposed method is the first one that combines true ray potentials with a continuous surface regularization term. This allows us to set a new state of the art on two commonly used benchmark datasets. Unlike in any previous volumetric depth map fusion approach, thin surfaces do not pose problems in our formulation, due to an accurate representation of the input data.
Most earlier formulations of ray potentials are for purely geometrybased 3D reconstruction. Ours is more general and also allows to incorporate semantic labels. [28] shows that by using a discrete graphbased approach the true multilabel ray potential can be used as data term. Several artifacts present in the unary potential approximation [11] can be resolved using a formulation over rays. However, utilizing a discrete graphbased approach it is not directly possible to use the classspecific anisotropic regularization proposed in [11]. We bridge this gap and show how the full multilabel ray potential can be used together with continuously inspired anisotropic surface regularization [2, 40].
2 Formulation
In this section we will introduce the mathematical formulation that we are using to represent the dense semantic 3D reconstruction as an optimization problem. The problem is posed over a 3D voxel volume . We denote the label as the free space label and introduce the set of semantic labels, which represent the occupied space, plus the free space label. The final goal of our method is to assign a label to each of the voxels. The label assignment is formalized using indicator variables indicating if label is assigned at voxel , () or not.
We denote the vector of all pervoxel indicator variables as . Finally, the energy that we are minimizing in this paper has the form
subject to  (1) 
where guarantees that exactly one label is assigned to each voxel. The objective contains two terms. The term , the ray potential, contributes the data to the optimization problem. This is in contrast to many other formulations where the data term is specified as local pervoxel preferences for the individual labels. The second term is a smoothness term, which penalizes the surface area of the transitions between different labels. For this term, we utilize formulations originating from convex continuous multilabel segmentation [2]. As we will see in Sec. 4 this smoothness term allows for classspecific anisotropic penalization of the interfaces between all labels. Due to the continuous nature of the regularization term, it does not suffer from metrication artifacts like most of the graphbased formulations. The straightforward way to utilize such a smoothness term would be to use a convex relaxation of the ray potential. Unfortunately, convex relaxations of the ray potential do not seem to lead to strong formulations (c.f. Fig.4). In this paper we show how to resolve this problem by adding a nonconvex constraint. In Sec. 3 we introduce the convex relaxation formulation of the ray potential and its nonconvex extension. The regularization term and optimization strategy are discussed in Sec. 4.
3 Ray Potential
The main idea of the ray potential [28] is that for each ray, originating from an image pixel, a cost is induced that only depends on the position of the first nonfree space label along the viewing ray or the ray is all free space. This means that the potential can take only linearly many (in the number of voxels along a ray) values, which is the reason why optimization of such potentials is tractable. Note that this is not a restriction we impose, it represents the fact that it is impossible to see behind occupied space. We denote the cost of having the first occupied space label at position with label as and the cost of having the whole ray free space as .
The vector of indicator variables belonging to ray is denoted as . To index positions along a ray, we denote as the positions of all the voxels belonging to ray , where denotes the position index along the ray. Note that there exists only one variable per label for each voxel , if evaluates to the same position for different rays it refers to the same variable. Now we can state the ray potential part of the energy as a sum of potentials over rays
(2)  
with meaning the set of all labels excluding the free space label. The term is iff the first occupied label along the ray is at position . Similarly, equals iff the whole ray is free space. Thus, in Eq. 2 only one term is nonzero, and its coefficient equals the desired cost of the ray configuration.
To make the derivations throughout the paper compact, we omit the last term without loss of generality by shifting the costs by a constant, and .
3.1 Visibility Variables
Before we state a convex relaxation of the ray potential, which we eventually augment with a nonconvex constraint, we rewrite the potential using visibility variables. First, we introduce the visibility variables indicating that the ray only contains free space up to the position and the label assigned at position is .
(3) 
To anchor the definition we assume that the 1^{st} voxel of the ray has free space assigned, . Note that if we insert all the nested definitions for a free space variable we get . Note that this variables are perray local variables and multiple ones can exist per voxel in case multiple rays cross that voxel in contrast to the global pervoxel variables , which exist only once per voxel (c.f. Fig. 2). In the remainder of Sec. 3 we will drop the index at most places for better readability. We now state a reformulation of the ray potential as
(4)  
subject to 
Here we introduced costs along the ray also for the free space label . This does not change the potential but is required for our next step, where we reformulate the nonconvex equality constraints as a series of inequality constraints. To make sure that the corresponding equalities are still satisfied in the optimum, we show that the costs can be replaced by nonpositive ones without changing the minimizer of the energy. This means that the are bounded from above by the linear inequality constraints and are tight from below through the minimization of the cost function, so the resulting constraints model the same optimization problem. The inequality constraints read as follows
(5) 
To derive the transformation to nonpositive costs, we first notice that after applying to both sides of the constraint from Eq. 1, we can plug in the constraints of Eq. 4 to obtain
(6) 
Intuitively, this means if position is in the observed visible free space then the next position is either free space or one of the occupied space labels and if is in the occupied space then all the are (see Fig. 3). The cost transformation is done for every ray separately. Starting with the last position , we add the following expression, which always evaluates to , to the ray potential.
(7) 
This moves one nonnegative term to the previous position and make all the for the current position nonpositive.
(8) 
This is done iteratively for all , leaving just a constant, which can be omitted.
3.2 Convex Relaxation and Visibility Consistency
So far our derivation has been done using binary variables and hence also all the . To minimize the energy, we relax this constraint by replacing with in Eq. 1. This directly leads to a convex relaxation of the ray potential. Unfortunately, this relaxation is weak and therefore inapplicable in practice. In Fig. 4 we evaluate the convex relaxation on a twolabel example (Lemon dataset), using surface area penalization via a total variation (TV) smoothness prior. The convex relaxation fails entirely, producing variable assignments to the that are up to machine precision and hence no meaningful solution can be extracted. A comparison of the energies reveals that there is a significant difference between the nonconvex and the convex solution ( and , respectively), which indicates that the relaxed problem is far from the original binary one. Most importantly, our earlier convex formulation [28] shares this behavior of not making a decision for any voxel, when run without initialization on a twolabel problem. The aspects of initialization, heuristic assignment of unassigned variables, move making algorithm, and a coarsetofine scheme are essential elements of the algorithm in [28].
The reason for the weak relaxation is that Eq. 6 is unsatisfied for the solution of the convex relaxation. This equation ensures that the per camera local view is consistent with the global model (c.f. Fig. 2). Concretely, the equation states that the change in visibility is directly linked to the cost that can be taken by the potential e.g. a surface can only be placed iff the occupancy along the ray changes. Hence we propose a formulation that directly enforces this constraint, which we will call visibility consistency constraint. Eq. 6 can be reformulated using the definition of as
(9) 
This means that we can only have an occupied space label assigned as the visible surface at position , if position does not have free space assigned and and hence the whole ray from the camera center to the position has free space assigned (see Fig. 3).
Since we minimize the objective with nonpositive , the visibility consistency constraint is equivalent to the inequality
(10) 
Our final formulation for the ray potential is
(11)  
s.t.  
The above potential is nonconvex because of the nonconvex inequality which describes visibility consistency. We follow the strategy of using a surrogate convex constraint for the nonconvex one that majorizes the objective of the nonconvex program. The majorization, as we will see in Sec. 4, happens during the iterative optimization. Therefore, at each iteration, we have a current assignment to the variables, which we denote by and . Here we introduced the notation that variable assignments at iteration are denoted with a superscript . Replacing
(12) 
with the linear majorizer,
(13) 
leads to a surrogate linear (and therefore convex) ray potential, which we will denote by . The variables and denote the position of the linearization. We handle the corner case where both branches are feasible to always take the first branch. In numerical experiments we observed that this choice is not critical, it makes no significant difference which branch is used in this case.
Next we state a Lemma that will be a crucial part of the optimization strategy detailed in Sec. 4.
Lemma 1.
Given , with pointwise, we can find such that all the constraints of the ray potential Eq. 11 are fulfilled and the value of the potential is minimal.
Intuitively, the lemma states that given the global pervoxel variable assignments , an assignment to the perray variables can be found. This is not surprising given that the whole information about the scene is contained in the variables (c.f. Fig. 2). We prove the lemma by giving a construction.
Proof.
We provide an algorithm that computes for each ray individually. First we set , which satisfies , . Now we iteratively increase such that and . For an optimal assignment we do this procedure in an increasing order of . The observation holds by construction. ∎
4 Energy Minimization Strategy
Before we discuss the proposed energy minimization, we complete the formulation by including the regularization term.
4.1 Regularization Term
There are several choices of regularization terms for continuously inspired multilabel segmentation that can be inserted into our formulation [39, 2, 32, 40]. They are all convex relaxations and are originally posed in the continuum and discretized for numerical optimization. The main differences are the strength of relaxation and generality of the allowed smoothness priors. We directly describe the strongest, most general version, which allows for nonmetric and anisotropic smoothness [40]. We only state the smoothness term and explain the meaning of the individual variables. For a thorough mathematical derivation we refer the reader to the original publications [40, 11].
(14)  
The variables describe the transitions between the assigned labels. They indicate how much change there is from label to label along the direction they point to and are hence called label transition gradients. For example, if there is a change from label to label at voxel along the first canonical direction, the corresponding is . The need to be nonnegative in order to allow for general, nonmetric smoothness priors [40]. Therefore the difference is used to allow for arbitrary transition directions. The variable denotes the canonical basis vector for the th component, i.e. . are convex positively homogeneous functions that act as anisotropic regularization of a surface between label and . Note that the regularization term takes into account label combinations. This enables us to select classspecific smoothness priors, which depend on the surface direction and the involved labels and are inferred from training data [11]. For example, a surface between ground and building is treated differently from a transition between free space and building. The following lemma will be necessary for our optimization strategy.
Lemma 2.
Given , , with an assignment can be determined that fulfills the constraints of the regularization term.
For the full proof of the lemma we refer the reader to the supplementary material, here we only state the main idea of the proof. In a first step we project our current solution onto the space spanned by the equality constraints. This leads to an initialization of the which fulfills the equality constraints but might lead to negative assignments to the . To get a nonnegative solution, we notice that as long as there is a which is negative we can find and such that we can increase by along with changing by the same in order not to affect the equality constraints.
4.2 Optimization
The goal of this section is to minimize the proposed energy using the nonconvex ray potential Eq. 11. Optimizing nonconvex functionals is an inherently difficult task. One often successfully utilized strategy is the so called majorizeminimize strategy (for example [20]). The idea is to majorize the nonconvex functional in some way with a surrogate convex one. Alternating between minimizing the surrogate convex energy, which we will call the minimization step in the following, and recomputing the surrogate convex majorizer, which we will denote the majorization step, leads to an algorithm that decreases the energy at each step and hence converges.
Note that we already discussed the majorization step of the ray potential in Sec. 3, Eq. 13. Together with the regularizer we end up with a surrogate convex but nonsmooth program.
(15)  
This energy can be globally minimized using the iterative first order primaldual algorithm [26]. However, there is no guarantee that the energy during the iterative minimization decreases monotonically nor that the constraints are fulfilled before convergence. One solution is to run the convex optimization until convergence however in practice this leads to slow convergence. Therefore, we follow a different strategy where we regularly run the majorization step during the optimization of the energy. Before we can state the final algorithm we present the following lemma.
Lemma 3.
Proof.
Our final majorizeminimize optimization strategy can now be stated as follows.
 Majorization step

Using the current variables and Lemma 3 a feasible solution can be found. If the new energy is lower or equal than the last known energy the linearization Eq. 13 is applied and the new optimization problem is passed to the minimization step. Otherwise the whole majorization step is skipped and the old variables are passed to the minimization step.
 Minimization step

The primaldual algorithm [26] is run on the surrogate convex program for a fixed number of iterations. For guaranteed convergence, the primal dual gap can be evaluated and the minimization step can be restarted until we have with a function for . In practice, we get a good convergence behavior without a restart.
The majorization step either does no changes to the optimization state or finds a better or equal solution with a new linearization of the nonconvex part because the linearization does not change the energy. If no changes are done this can be due to two reasons. Either the current solution was worse than the last known one, or the majorization stayed the same. In the latter case the primaldual gap could reveal convergence and as a consequence we know that we arrived at a critical point or corner point of the original nonconvex energy. In any other case the minimization step is run again and due to the convexity of the surrogate convex function a better solution or the optimality certificate will be found. The above procedure could stop at a noncritical point^{1}^{1}1similar to blockcoordinate descent based message passing algorithms due to the kink in the maximum in Eq. 12. To guarantee convergence to a critical point, the visibility consistency constraint Eq. 6 can be smoothed slightly e.g. using [25]. Again, in our experiments this was unnecessary to achieve good convergence behavior.
5 Experiments
Before we discuss the experiments, we describe the input data and state the costs used for the ray potentials.
5.1 Input Data
We are using our approach for two different tasks: standard dense 3D reconstruction and dense semantic 3D reconstruction. In both cases, the initial input is a set of images with associated camera poses. Those camera poses are either provided with the dataset (as in the Middlebury Benchmark [29] or in the Thin Road Sign dataset [34]) or computed via structure from motion algorithm [3] (as in the semantic reconstruction experiments). We computed the depth maps using plane sweeping stereo for Middlebury Benchmark and semantic reconstruction datasets, while utilizing those already provided with the dataset for the experiment with Thin Road Sign. The patch similarity measure for stereo matching was zeromean normalized cross correlation. For the dense semantic 3D reconstruction experiments, we computed perpixel semantic labels using [19], trained on the datasets from [1], [30] and [11].
5.2 Ray Potential Costs
In case of a twolabel problem, there exists only one single label . This allows us to directly insert the visibility consistency, Eq. 9, into the objective. In this case the majorization can directly be done on the objective instead of the visibility constraint, leading to a more compact optimization problem with a smaller memory footprint. Like [11], we assume exponential noise on the depth maps and define the assignments to the costs , given the position of the depth measurement along the ray as as
(16) 
The parameters and are chosen such that the potential captures the uncertainty of the depth measurement.
For the multiclass case we also assume exponential noise on the depth data and independence between the depth measurement and the semantic measurement. Therefore the combined costs read as
(17) 
with being the response of the semantic classifier for the respective pixel. This is the same potential that [11] approximates with unary potentials.
5.3 Middlebury MultiView Stereo Benchmark
We evaluate our method for dense 3D reconstuction on the Middlebury benchmark [29]. We ran our algorithm on all 6 datasets (using the same parameters). Two quantitative measures are defined in this benchmark paper: accuracy (Acc) and completeness (Comp). In terms of accuracy our algorithm sets a new stateoftheart for the Dino Full and Dino Ring datasets (c.f. Fig. 5). An actual ranking of the benchmark is difficult because there is no default, commonly accepted, way to combine the two measures. Taking into account both measures we are close to the stateoftheart on all datasets (results can be found online ^{2}^{2}2http://vision.middlebury.edu/mview/eval/).
5.4 Street Sign Dataset
A challenging case for volumetric 3D reconstruction are thin objects. When approximating the data term, which is naturally given as a ray potential in the 2.5D input data, by unary or pairwise potentials the data terms from both sides are prone to cancel out. Similarly, when using visual hulls a slight misalignment of the two sides might generate an empty visual hull. These are the reasons why thin objects are considered to be a hard case in volumetric 3D reconstruction. We evaluate the performance of our algorithm for such objects on the street sign dataset from [34]. The dataset consists images of a street sign with corresponding depth maps. As depicted in Fig 6 the thin surface does not pose any problem to our method, thanks to an accurate representation of the input data in the optimization problem. To illustrate the result obtained with a standard volumetric 3D reconstruction algorithm we ran our implementation of the TVFlux fusion from [38] on the same data. Note that this dataset is particularly hard because the two sides actually interpenetrate as detailed in [34].
5.5 MultiLabel Experiments
We evaluate our formulation for dense semantic 3D reconstruction on several realworld datasets. We show our results sidebyside with the method of [11] and [28] in Figs. 7 and 1. Our method uses the same smoothness prior as [11]. For all the datasets we observe that the approximation of the data cost with a unary potential in [11] artificially fattens corners and thin objects (e.g. pillars or tree branches). In the closeups (c.f. Fig. 1) we see that such a data term recovers significantly less surface detail with respect to our proposed method. This problem has been addressed in [28], but their discrete graphbased approach suffers from metrication artifacts, cannot be combined with the classspecific anisotropic smoothness prior and does not lead to smooth surfaces (c.f. Fig. 1). Moreover, their coarsetofine scheme produces artifacts in the reconstructions. Our approach takes the best of both worlds, the ray potential part ensures an accurate position of the observed surfaces, while the anisotropic smoothness prior faithfully handles weakly observed areas.
6 Conclusion
In this paper we proposed an approach for using ray potentials together with continuously inspired surface regularization. We demonstrated that a direct convex relaxation is too weak to be used in practice. We resolved this issue by adding a nonconvex constraint to the formulation. Further, we detailed an optimization strategy and gave an extensive evaluation on twolabel and multilabel datasets. Our algorithm allows for a general multilabel ray potential, at the same time it achieves volumetric 3D reconstruction with high accuracy. In semantic 3D reconstruction we are able to overcome limitations of earlier methods.
Acknowledgements: We thank Ian Cherabier for providing code for Lemma 2. This work is partially funded by the Swiss National Science Foundation projects 157101 and 163910. Ľubor Ladický is funded by the Max Planck Center for Learning Systems Fellowship.
References
 [1] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, 2008.
 [2] A. Chambolle, D. Cremers, and T. Pock. A convex approach to minimal partitions. SIAM Journal on Imaging Sciences, 2012.
 [3] A. Cohen, C. Zach, S. N. Sinha, and M. Pollefeys. Discovering and exploiting 3D symmetries in structure from motion. In Conference on Computer Vision and Pattern Recognition, 2012.
 [4] D. Cremers and K. Kolev. Multiview stereo and silhouette consistency via convex functionals over convex domains. Transactions on Pattern Analysis and Machine Intelligence, 2011.
 [5] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Conference on Computer graphics and interactive techniques, 1996.
 [6] A. Delaunoy and E. Prados. Gradient flows for optimizing triangular meshbased surfaces: Applications to 3D reconstruction problems dealing with visibility. International Journal of Computer Vision (IJCV), 2011.
 [7] J. Duchi, S. ShalevShwartz, Y. Singer, and T. Chandra. Efficient projections onto the ball for learning in high dimensions. In International conference on Machine learning (ICML), 2008.
 [8] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
 [9] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In International Conference on Computer Vision, 2015.
 [10] P. Gargallo, P. Sturm, and S. Pujades. An occupancy  depth generative model of multiview images. In Asian Conference on Computer Vision (ACCV). 2007.
 [11] C. Häne, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3D scene reconstruction and class segmentation. In Conference on Computer Vision and Pattern Recognition, 2013.
 [12] C. Hernandez, G. Vogiatzis, and R. Cipolla. Probabilistic visibility for multiview stereo. In Conference on Computer Vision and Pattern Recognition, 2007.
 [13] M. Klodt, T. Schoenemann, K. Kolev, M. Schikora, and D. Cremers. An experimental comparison of discrete and continuous shape optimization methods. In Computer Vision–ECCV 2008, pages 332–345. Springer, 2008.
 [14] K. Kolev and D. Cremers. Integration of multiview stereo and silhouettes via convex functionals on convex domains. In European Conference on Computer Vision (ECCV), 2008.
 [15] K. Kolev, M. Klodt, T. Brox, and D. Cremers. Propagated photoconsistency and convexity in variational multiview 3D reconstruction. In IN WORKSHOP ON, 2007.
 [16] V. Kolmogorov and R. Zabih. Multicamera scene reconstruction via graph cuts. In European Conference on Computer Vision (ECCV). 2002.
 [17] V. Kolmogorov, R. Zabih, and S. Gortler. Generalized multicamera scene reconstruction using graph cuts. In Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2003.
 [18] P. Labatut, J.P. Pons, and R. Keriven. Efficient multiview reconstruction of largescale scenes using interest points, delaunay triangulation and graph cuts. In IEEE International Conference on Computer Vision, 2007.
 [19] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In International Conference on Computer Vision (ICCV), 2009.
 [20] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of computational and graphical statistics, 2000.
 [21] V. S. Lempitsky and Y. Boykov. Global optimization for shape fitting. In Conference on Computer Vision and Pattern Recognition, 2007.
 [22] Z. Li, K. Wang, W. Zuo, D. Meng, and L. Zhang. Detailpreserving and contentaware variational multiview stereo reconstruction. arXiv preprint arXiv:1505.00389, 2015.
 [23] S. Liu and D. B. Cooper. Ray Markov random fields for imagebased 3D modeling: Model and efficient inference. In Conference on Computer Vision and Pattern Recognition, 2010.
 [24] S. Liu and D. B. Cooper. Statistical inverse ray tracing for imagebased 3D modeling. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.
 [25] Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical programming, 2005.
 [26] T. Pock and A. Chambolle. Diagonal preconditioning for first order primaldual algorithms in convex optimization. In 2011 IEEE International Conference on Computer Vision (ICCV), 2011.
 [27] T. Pollard and J. L. Mundy. Change detection in a 3d world. In Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
 [28] N. Savinov, L. Ladicky, C. Häne, and M. Pollefeys. Discrete optimization of ray potentials for semantic 3D reconstruction. In Conference on Computer Vision and Pattern Recognition, 2015.
 [29] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multiview stereo reconstruction algorithms. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’2006), volume 1, pages 519–526. IEEE Computer Society, June 2006.
 [30] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multiclass object recognition and segmentation. In European Conference on Computer Vision, 2006.
 [31] S. N. Sinha, P. Mordohai, and M. Pollefeys. Multiview stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In International Conference on Computer Vision, 2007.
 [32] E. Strekalovskiy and D. Cremers. Generalized ordering constraints for multilabel optimization. In International Conference on Computer Vision (ICCV), 2011.
 [33] A. O. Ulusoy, A. Geiger, and M. J. Black. Towards probabilistic volumetric reconstruction using ray potentials. In International Conference on 3D Vision (3DV), 2015.
 [34] B. Ummenhofer and T. Brox. Pointbased 3D reconstruction of thin objects. In IEEE International Conference on Computer Vision (ICCV), 2013.
 [35] G. Vogiatzis, P. H. Torr, and R. Cipolla. Multiview stereo via volumetric graphcuts. In Conference on Computer Vision and Pattern Recognition, 2005.
 [36] H.H. Vu, P. Labatut, J.P. Pons, and R. Keriven. High accuracy and visibilityconsistent dense multiview stereo. Transactions on Pattern Analysis and Machine Intelligence, 2012.
 [37] J. Wei, B. Resch, and H. Lensch. Multiview depth map estimation with crossview consistency. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
 [38] C. Zach. Fast and high quality fusion of depth maps. In 3D Data Processing, Visualization and Transmission, 2008.
 [39] C. Zach, D. Gallup, J.M. Frahm, and M. Niethammer. Fast global labeling for realtime stereo using multiple plane sweeps. In International Workshop on Vision, Modeling and Visualization (VMV), 2008.
 [40] C. Zach, C. Hane, and M. Pollefeys. What is optimized in convex relaxations for multilabel problems: Connecting discrete and continuously inspired MAP inference. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.
 [41] C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust TVL1 range image integration. In International Conference on Computer Vision (ICCV), 2007.
 [42] Z. Zhu, C. Stamatopoulos, and C. S. Fraser. Accurate and occlusionrobust multiview stereo. ISPRS Journal of Photogrammetry and Remote Sensing, 109:47–61, 2015.
Appendix A Supplementary Material
First, we provide the proof of Lemma 2 from the main text. Then we give additional experimental evaluation which did not fit into the main text of the paper: the additional experiments are shown for semantic 3D reconstruction as well as for classic 3D reconstruction. Afterwards, we give an intuition why the convex formulation, which was introduced in Section 3.2 of the main text, provides a very weak solution. Eventually, we show convergence experiments for our algorithm.
a.1 Proof of Lemma 2
Proof.
For readability we drop the iteration index . First we note the following. If we fix and each only appears in two linear equation systems with equations.
(18) 
Hence, this constraints can be written in the form for each and , where is a vector containing the variables . contains the values of and . The variables are initialized by projecting the variables to the affine space defined by the equation system for each , combination individually. To also ensure that the nonnegativity constraints on the are fulfilled, the following substitution is applied until there are no more negative . Assuming , from it follows that there are and . Hence, we update
(19)  
(20) 
Note that this substitution does not affect the original constraints if we choose such that nonnegative variables stay nonnegative. The above substitution is iteratively applied until no more nonnegative variables are left. By always choosing as big as possible, meaning such that either the nonpositive variable or one of the positive variables gets , the number of iterations of the algorithm is bounded by . This holds because for each negative variable there is a maximum of steps that can be made to increase it. ∎
a.2 Semantic 3D Reconstruction: Additional Results
Additional reconstructions are shown in Fig. 8. We refer the reader to the supplementary video where renderings of our models can be found.
a.3 Dataset ”Head”
We test our algorithm on a challenging specular ”Head” dataset from [4]. It was shown in that paper that the results of traditional dense 3D reconstruction methods can be improved by utilizing the silhouette information. This information was included in their formulation as energy over rays. We show even more improvement by using our nonconvex ray potential formulation in Fig. 9.
a.4 Middlebury: Additional Analysis
We provide accuracy (Acc) and completeness (Comp) plots for Dino Ring dataset in Fig. 10. We also show additional renderings of reconstructions in Fig. 11.
Overall, besides being accurate (as shown in the paper), our algorithm produces reconstructions with very high completeness: for out of datasets our reconstructions have completeness above .
a.5 Why is Convex Formulation so Weak?
In this section we give a small intuitive example why the convex relaxation gives a solution which is far from binary. We give this example for a label problem without regularization and use the following notation for the labels: means occupied, means freespace. Consider one ray of the length with costs and the rest of the costs are . This is a realistic example since it corresponds to allowing the uncertainty around the estimated depth position (for example, camera sees the wall and stereo matching provides an estimate of depth, but this estimate is noisy in practice, so the uncertainty window along the ray is very desirable). Since we only consider single ray, the ray index is omitted and the voxel space indexing function simplifies to just position along the ray. The exact problem, which we are solving, would be (as a reminder, is always set to be ):
(21)  
s.t.  
The desired solution to this problem would be
(22)  
This means taking the best position in the uncertainty window. This solution has the cost . Unfortunately, the solution where all the variables above take value has a better cost: .
Our preliminary investigations indicate that the ”all” solution will always be the optimal solution to the convex relaxation as long as the best cost is larger than the sum of other occupied costs (as it is the case in the example above, versus ).
a.6 Convergence Analysis
In this section we analyze the convergence behavior of our method.
First, we evaluate how fast the algorithm converges using different minimization intervals in between the majorization steps. In Fig. 12 we can see that a frequent execution of the majorization step has a very beneficial effect on the convergence. Additionally, we see that for a broad range of values we reach similar (in energy) critical points of our cost function. This is a strong indication that our method is robust against bad solutions.
Second, we analyze tie handling in Eq. 13 of the main text. As a reminder, this equation describes linear majorizer as
(23) 
In that equation the tie case is and it is possible to choose any of the two branches in this case: or . Our experiment in Fig. 13 shows that the difference in final energies between these two choices is very small, of their values.
Input Image  Häne et al. 2013 [11]  Savinov et al. 2015 [28]  Proposed Method 
Original Images  [35]  [12], [15]  [4]  Our Method 

Dino Full  Dino Sparse  Temple Full  Temple Sparse 

Providence  Catania 
