From Shading to Local Shape
Abstract
We develop a framework for extracting a concise representation of the shape information available from diffuse shading in a small image patch. This produces a mid-level scene descriptor, composed of local shape distributions that are inferred separately at every image patch across multiple scales. The framework is based on a quadratic representation of local shape that, in the absence of noise, has guarantees on recovering accurate local shape and lighting. And when noise is present, the inferred local shape distributions provide useful shape information without overcommitting to any particular image explanation. These local shape distributions naturally encode the fact that some smooth diffuse regions are more informative than others, and they enable efficient and robust reconstruction of object-scale shape. Experimental results show that this approach to surface reconstruction compares well against the state of the art on both synthetic images and captured photographs.
1 Introduction
Recovering shape from diffuse shading is pointwise ambiguous because each surface normal can lie anywhere on a cone of directions. Surface normals are uniquely determined only where they align with the light direction which, at best, occurs at only a handful of singular points. A common strategy for reducing the ambiguity is to pursue global reconstructions of large, presegmented regions, with the hope that many pointwise ambiguities will collaboratively resolve, or that shape information will successfully propagate from identifiable singular points and occluding contours.
Global strategies are difficult to apply in natural scenes because diffuse shading is typically intermixed with other phenomena such as texture, gloss, shadows, translucency, and mesostructure. Occluding contours and singular points are hard to detect in these scenes; and shading-based shape propagation breaks down unless occlusions, gloss, texture, etc. are somehow analyzed and removed by additional visual reasoning. Moreover, most global strategies do not provide spatial uncertainty information to accompany their output reconstructions, and this limits their use in providing feedback to improve top-down scene analysis, or in co-computing with other necessary bottom-up processes that perform complementary analysis of other phenomena.
This paper develops a framework for leveraging diffuse shading more broadly and robustly by developing a richer description of what it says locally about shape. We show that pointwise ambiguity can be systematically reduced by jointly analyzing intensities in small image patches, and that some of these patches are inherently more informative than others. Accordingly, we develop an algorithm that produces for any image patch a concise distribution of surface patches that are likely to have created it. We propose these dense, local shape distributions as a new mid-level scene representation that provides useful local shape information without overcommitting to any particular image explanation. Finally, we show how these local shape distributions can be combined to recover global object-scale shape.
Our framework is developed in three parts:


Local uniqueness. We provide uniqueness results for jointly recovering shape and lighting from a small image patch. By considering a world in which the shape of each small surface patch is exactly the graph of a quadratic function, we prove two generic facts: i) when the light direction is known, quadratic shape is uniquely determined; and ii) when the light is unknown, it is determined up to a four-way choice. We also catalog the degenerate cases, which correspond to special shapes, or conspiracies between the light and shape. These results are of direct interest to those studying the mathematics of shape from shading.

Local shape distributions. We introduce a computational process that takes an image patch at any scale and produces a compact distribution of quadratic shapes that are likely to have produced it. At the core of this process is our observation that all likely shapes corresponding to a (noisy) image patch lie close to a one-dimensional manifold embedded in the five-dimensional space of quadratic shapes. This part of the paper is of broad interest because these local, multiscale shape distributions may be useful as intermediate scene descriptors for various visual tasks.

Reconstruction. We present a simple and effective bottom-up reconstruction system for inferring object-scale shape from a single image of a predominantly textureless and diffusely-shaded surface. This reconstruction system uses as input our local shape distributions inferred from dense, overlapping patches at multiple scales. It is conceptually simple, computationally parallelizable, and robust to non-idealities like shadows, texture, highlights, etc., with reconstruction accuracy that compares well to the state of the art. This system is of direct interest to those studying algorithms for traditional shape from shading, and it is structured in a modular way that provides a step toward co-computation with other reconstructive processes that also analyze other phenomena.
These three parts are tightly bound together. The uniqueness results in Sec. 3 show that the quadratic model is a particularly convenient representation for small surface patches. In the absence of noise, both shape and lighting are locally revealed, and local shape is generally unique when lighting is known. Building on this, Sec. 4 examines how uniqueness breaks down in the presence of noise. While very different quadratic shapes can produce equally-likely local intensity patterns, we find that all highly-likely shapes lie close to a one-dimensional submanifold. Then, Sec. 5 shows how to infer a dense set of sample shapes along this submanifold, thereby taking an image patch and producing a one-dimensional shape distribution. Finally, Sec. 6 shows how these multiscale local distributions can enable robust global reconstruction of shape, by naturally encoding the fact that some smooth diffuse regions are more informative than others.
The project page associated with this paper [1] provides separate implementations of our algorithms for inferring local distributions (Sec. 5) and global shape (Sec. 6). These are highly parallelized and can be executed on a single machine, a local cluster of machines, or a cluster from a standard utility computing service.
2 Related Work
Background on shape inference from diffuse shading can be found in several reviews and surveys [2, 3, 4]. An important question is whether shape is uniquely determined by a noiseless image, which has been addressed by a variety of PDE-based formulations. For example, Oliensis considered surfaces and showed that shape can be uniquely determined for the entire image by singular points and occluding boundaries together [5], and in many parts of the image by singular points alone [6]. For the more general class of surfaces, Prados and Faugeras [7] employed a smoothness constraint to prove uniqueness properties in a more general perspective setup [8, 9] given appropriate boundary conditions. In this paper, we use a more restrictive local surface model but prove local uniqueness without any boundary conditions or knowledge of singular points. This generalizes previous studies of local uniqueness, which have considered locally-spherical [10] and fronto-parallel [11] surfaces.
Global uniqueness analyses have inspired global propagation and energy-based methods for global shape inference (e.g. [2, 12, 13]), some of which rely on identifying occluding boundaries and/or singular points. While most methods do not typically provide any measurement of uncertainty in their output, progress toward representing shape ambiguity was made by Ecker and Jepson [14], who use a polynomial formulation of global shape from shading to numerically generate distinct global surfaces that are equally close to an input image. In this paper, we study uniqueness and uncertainty at the local level, and infer distributions over candidate local shapes.
Our work is related to patch-based approaches that use synthetically-generated reference databases. The idea there is to reconstruct depth (or other scene properties [15]) by synthesizing a database of aligned image and depth-map pairs, and then finding and stitching together depth patches from this database to match the input image and be spatially consistent. Hassner and Basri [16] obtain plausible results this way when the input image and the database are of similar object categories, and Huang et al. [17] pursue a similar goal for textureless objects using a database of rendered Lambertian spheres. Cole et al. [18] focus on patches located at detected keypoints near an object's occlusion boundaries, combining shading and contour cues. We also describe global shape as a mosaic of per-patch depth primitives, but instead of relying on primitives from a pre-chosen set of 3D models, we consider a continuous five-parameter family of depth primitives corresponding to graphs of quadratic functions at multiple scales.
One of our main motivations is the long-term goal of enabling better co-computation with other bottom-up and top-down visual processes, and by providing useful local shape information without choosing any single image interpretation, our distributions are consistent with Marr's principle of least commitment [19]. We focus on diffuse shading on textureless surfaces, leaving for future work the task of merging with bottom-up processes for other cues like occluding contours (e.g., [18, 20]), texture, gloss, and so on. Our belief that this will be useful is bolstered by promising results achieved by recent global approaches to such combined reasoning [21].
In independent work, Kunsberg and Zucker [22] have recently derived local uniqueness results that are related to, and consistent with, our results in Sec. 3. Their elegant analysis, which uses differential geometry and applies to continuous images, is complementary to the discrete and algebraic approach employed in this paper. Kunsberg and Zucker also observe that the analysis of shading in patches instead of at isolated points is consistent with early processing in the visual cortex, and they discuss the possibility of local shading distributions being computed there. Indeed, the notion of such distributions is compatible with evidence that humans perceive shape in some diffuse regions more accurately than others [11].
3 Quadratic-Patch Shape from Shading
We begin by analyzing the ability to uniquely determine the shape and lighting of a local patch from a Lambertian shading image in the absence of noise. The key assumption in our analysis is that the depth of the patch can be exactly expressed as the graph of a quadratic function. While subsequent sections consider deviations from this idealized setting, the following analysis characterizes the inherent ambiguity under a local quadratic patch model.
We model the depth of a local surface patch as a quadratic function defined by coefficient vector $a = (a_1, a_2, a_3, a_4, a_5)$, up to a constant offset:^{1}

(1)  $z(x, y) = \frac{a_1}{2} x^2 + a_2 x y + \frac{a_3}{2} y^2 + a_4 x + a_5 y.$

In matrix form, this is $z(x) = \frac{1}{2} x^T H_a x + J_a^T x$ with

(2)  $H_a = \begin{pmatrix} a_1 & a_2 \\ a_2 & a_3 \end{pmatrix}, \qquad J_a = \begin{pmatrix} a_4 \\ a_5 \end{pmatrix},$

the Hessian matrix and the Jacobian of the depth function. The unnormalized surface normal to this patch at each location $x = (x, y)$ is then given by

(3)  $n(x) = \left( p(x), q(x), 1 \right)^T,$

where

(4)  $p(x) = -\left( a_1 x + a_2 y + a_4 \right),$

(5)  $q(x) = -\left( a_2 x + a_3 y + a_5 \right).$

In matrix form, this is $n(x) = M_a \tilde{x}$, with $\tilde{x} = (x, y, 1)^T$ and

(6)  $M_a = \begin{pmatrix} -a_1 & -a_2 & -a_4 \\ -a_2 & -a_3 & -a_5 \\ 0 & 0 & 1 \end{pmatrix}$

the shape matrix corresponding to quadratic shape $a$.

The intensity of this patch, observed from viewing direction $v = (0, 0, 1)^T$ under a directional light source $l \in \mathbb{R}^3$, is

(7)  $I(x) = \frac{l^T n(x)}{\| n(x) \|},$

assuming spatially-uniform Lambertian reflectance and that no part of the patch is in shadow, i.e., $l^T n(x) > 0$. Here, the magnitude of the light vector $\|l\|$ represents the product of the surface albedo and the light strength, and it is not assumed to be equal to one. Rearranging, the intensity at each point induces a quadratic constraint on its surface normal [14]:

(8)  $n(x)^T \left( l l^T - I(x)^2 E \right) n(x) = 0,$

where $E$ is the $3 \times 3$ identity matrix. This further induces a related constraint on the shape parameters $a$:

(9)  $\tilde{x}^T M_a^T \left( l l^T - I(x)^2 E \right) M_a \tilde{x} = 0,$

where we use the matrix $M_a$ to rewrite the relationship between $n$ and $a$ in (3)-(5) as $n(x) = M_a \tilde{x}$.

Every pixel in an image patch gives one such constraint on the shape parameters $a$, and shape from shading for quadratic patches rests on solving this system of polynomial equations. Our immediate goal is to determine whether the shape and lighting can be uniquely determined from these local constraints.

^{1} Local shading for the special case $a_4 = a_5 = 0$ is described in [11], and a more restrictive, locally-spherical model is analyzed in [10].
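As a concrete sanity check, the patch model and shading equation can be exercised numerically. The sketch below assumes the conventions used here: coefficients $a = (a_1, \dots, a_5)$ with Hessian entries $(a_1, a_2, a_3)$, unnormalized normals $n = (-z_x, -z_y, 1)$, and view along $(0, 0, 1)$; the particular shape and light values are illustrative only.

```python
import numpy as np

def quad_depth(a, x, y):
    # z(x, y) for the quadratic patch model; coefficient ordering is our assumption:
    # a = (a1, a2, a3, a4, a5), Hessian [[a1, a2], [a2, a3]], Jacobian (a4, a5)
    a1, a2, a3, a4, a5 = a
    return 0.5 * (a1 * x**2 + 2 * a2 * x * y + a3 * y**2) + a4 * x + a5 * y

def quad_normals(a, x, y):
    # Unnormalized surface normals n = (-dz/dx, -dz/dy, 1)
    a1, a2, a3, a4, a5 = a
    p = -(a1 * x + a2 * y + a4)
    q = -(a2 * x + a3 * y + a5)
    return np.stack([p, q, np.ones_like(p)], axis=-1)

def render(a, l, x, y):
    # Lambertian intensity I = <l, n>/||n||; ||l|| encodes albedo times light strength
    n = quad_normals(a, x, y)
    return n @ l / np.linalg.norm(n, axis=-1)

# Illustrative convex patch under an oblique light; no pixel is shadowed
xs, ys = np.meshgrid(np.arange(-2, 3), np.arange(-2, 3))
a = (0.3, 0.1, 0.2, 0.05, -0.1)
l = np.array([0.3, 0.2, 0.9])
I = render(a, l, xs, ys)
assert np.all(I > 0)   # every pixel satisfies the no-shadow condition
```

Each of the 25 intensities in `I` supplies one polynomial constraint of the form above on the five unknowns in `a`, which is the system analyzed in the remainder of this section.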
3.1 Uniqueness of simultaneous shape and light
We assume that the local patch is sufficiently large to contain a minimum number of non-degenerate pixel locations, where the condition for non-degeneracy is defined as follows:
Definition 1.
For a patch $\{x_i\}_{i=1}^{m}$, we define the matrix $W$ such that each row of $W$ consists of all fourth-order and lower monomials of $x_i$ and $y_i$:

(10)  $W_i = \left( x_i^4,\; x_i^3 y_i,\; x_i^2 y_i^2,\; x_i y_i^3,\; y_i^4,\; x_i^3,\; x_i^2 y_i,\; x_i y_i^2,\; y_i^3,\; x_i^2,\; x_i y_i,\; y_i^2,\; x_i,\; y_i,\; 1 \right).$

A patch is considered non-degenerate if the $m \times 15$ matrix $W$ has rank 15.

Note that rectangular grids of pixels that are $5 \times 5$ or larger will be non-degenerate under the above definition.
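The rank condition in Definition 1 can be checked numerically. The sketch below builds the monomial matrix for a pixel grid (the column ordering is an arbitrary choice) and confirms that a 5x5 grid attains full rank 15 while a 4x4 grid does not, since on four distinct coordinates the degree-4 monomials are indistinguishable from lower-order polynomials.

```python
import numpy as np

def monomial_matrix(xs, ys):
    # One row per pixel; columns are all monomials x**i * y**j with i + j <= 4
    # (15 columns in total; column ordering is an arbitrary choice)
    exps = [(i, j) for i in range(5) for j in range(5) if i + j <= 4]
    return np.array([[x**i * y**j for (i, j) in exps]
                     for x, y in zip(xs.ravel(), ys.ravel())])

# Centered integer grids for good numerical conditioning
g5 = np.meshgrid(np.arange(5) - 2.0, np.arange(5) - 2.0)
g4 = np.meshgrid(np.arange(4) - 1.5, np.arange(4) - 1.5)
rank5 = np.linalg.matrix_rank(monomial_matrix(*g5))
rank4 = np.linalg.matrix_rank(monomial_matrix(*g4))
assert rank5 == 15   # a 5x5 grid is non-degenerate
assert rank4 < 15    # a 4x4 grid cannot span the degree-4 monomials
```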
Theorem 1.
Given intensities in an image patch collected at a set of non-degenerate locations not in shadow, if any quadratic-patch/lighting pair that satisfies the set of polynomial equations (9) has a surface Hessian with eigenvalues that are not equal in magnitude, then there are no more than four distinct surfaces that can create the same image. Each of these surfaces is associated with a unique lighting when the Hessian of any solution is nonsingular, and a one-dimensional family of lighting vectors otherwise.
This theorem states that, given measurements of intensity from a quadratic surface patch, there generically exist four physical explanations, each comprised of a shape, a light direction, and a scalar encoding the product of albedo and light strength.
Before proceeding to the proof, we introduce a lemma concerning equations with ratios of quadratic terms. Recall that the normals are given by $n(x) = M_a \tilde{x}$ with $\tilde{x} = (x, y, 1)^T$, so that the intensity constraint (9) becomes

(11)  $I(x)^2 = \dfrac{\tilde{x}^T M_a^T l l^T M_a \tilde{x}}{\tilde{x}^T M_a^T M_a \tilde{x}}.$
Using this notation, we can state the following lemma, which is proven in the supplementary material:
Lemma 1.
Let $M_1$ and $M_2$ correspond to two matrices of the form in (6), and $l_1$ and $l_2$ to two lighting vectors. If, for every $\tilde{x}$ in a non-degenerate patch,

(12)  $\dfrac{\tilde{x}^T M_1^T l_1 l_1^T M_1 \tilde{x}}{\tilde{x}^T M_1^T M_1 \tilde{x}} = \dfrac{\tilde{x}^T M_2^T l_2 l_2^T M_2 \tilde{x}}{\tilde{x}^T M_2^T M_2 \tilde{x}},$

and if Rank$(M_1) \geq 2$, Rank$(M_2) \geq 2$, and $l_1^T M_1 \tilde{x} > 0$, $l_2^T M_2 \tilde{x} > 0$ (i.e., no point is in shadow), then

(13)  $M_2^T M_2 = M_1^T M_1 \quad \text{and} \quad M_2^T l_2 l_2^T M_2 = M_1^T l_1 l_1^T M_1.$

Moreover, if Rank$(M_1) = 2$, then Rank$(M_2) = 2$ and both $M_1$ and $M_2$ share a common null space.
Proof of Theorem 1: Suppose there exists a solution with shape matrix $M_1$ and lighting $l_1$ that produces the observed set of intensities in the patch, and the Hessian matrix of this surface has eigenvalues of unequal magnitude. We will prove that if there exists another solution, with shape matrix $M_2$ and lighting $l_2$, such that

(14)  $\dfrac{\tilde{x}^T M_2^T l_2 l_2^T M_2 \tilde{x}}{\tilde{x}^T M_2^T M_2 \tilde{x}} = I(x)^2 \quad \text{for every location in the patch},$

then $M_2$ must be related to $M_1$ in one of four specific ways.

Since the surface is not planar (otherwise the Hessian would have both eigenvalues equal to zero), the corresponding matrix $M_1$ is at least rank 2, and we can apply Lemma 1:

(15)  $M_2^T M_2 = M_1^T M_1 \quad \text{and} \quad M_2^T l_2 l_2^T M_2 = M_1^T l_1 l_1^T M_1.$
We define a new matrix $R$ satisfying $M_2 = R M_1$. Specifically, when $M_1$ is full rank we set $R = M_2 M_1^{-1}$; and when Rank$(M_1) = 2$, we construct $R$ using a vector $w$ in the common null space of $M_1$ and $M_2$, i.e., $M_1 w = M_2 w = 0$. We will show that there are only four possibilities for the matrix $R$.
Note that $M_1$ and $M_2$ are affine matrices (their last rows are both $(0, 0, 1)$). Moreover, in the rank-2 case, the last entry of $w$ will be 0, so the rank-completed factors remain affine. Therefore, $M_2 M_1^{-1}$ (if $M_1$ is full rank) and its rank-2 analogue (if $M_1$ is rank 2) are affine. Hence, $R$ is also an affine matrix:

(16)  $R = \begin{pmatrix} R_2 & r \\ 0^T & 1 \end{pmatrix}.$
From (15), we have $M_1^T R^T R M_1 = M_1^T M_1$, i.e.,

(17)  $R^T R = E.$
The orthogonality of $R$ forces $r = 0$ and further restricts its top-left block $R_2$ to be either a 2D rotation matrix

(18)  $R_2 = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},$

or an "anti-rotation" matrix

(19)  $R_2 = \begin{pmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{pmatrix},$

for some $\theta \in [0, 2\pi)$.
From $M_2 = R M_1$ and the fact that the $(1,2)$ entry and $(2,1)$ entry of $M_2$ should be the same (since $M_2$ must also be of the form in (6), with a symmetric top-left block $-R_2 H_1$), we have

(20)  $\left( R_2 H_1 \right)_{12} = \left( R_2 H_1 \right)_{21}.$

This implies that when $R_2$ is of the form in (18),

(21)  $\sin\theta \, (a_1 + a_3) = 0,$

and when $R_2$ is of the form in (19),

(22)  $2 a_2 \cos\theta = (a_1 - a_3) \sin\theta.$
Since the Hessian of the solution defined in (2) has eigenvalues of unequal magnitude, $a_1 + a_3 \neq 0$, and either $a_2 \neq 0$ or $a_1 \neq a_3$. This leaves only four possible solutions for $R_2$:

(23)  $R_2 \in \left\{ E_2,\; -E_2,\; S_{\theta_0},\; -S_{\theta_0} \right\},$

where $E_2$ is the $2 \times 2$ identity and $S_{\theta_0}$ is the anti-rotation in (19) with $\theta_0$ satisfying $(a_1 - a_3) \sin\theta_0 = 2 a_2 \cos\theta_0$, i.e., a reflection across an eigenvector of the Hessian. Thus $M_2$ can relate to $M_1$ in only four possible ways.
Next, we consider the lighting $l_2$ associated with each shape. Equation (15) implies $M_2^T l_2 = M_1^T l_1$ or $M_2^T l_2 = -M_1^T l_1$, but the latter has shadows, so

(24)  $M_2^T l_2 = M_1^T l_1.$

When $M_2$ is full rank, (24) implies a unique $l_2$ given by

(25)  $l_2 = M_2^{-T} M_1^T l_1.$

If Rank$(M_2) = 2$, we define $u$ as a unit vector spanning the null space of $M_2^T$, so that $l_2$ is determined by (24) only up to its component along $u$:

(26)  $l_2 = \left( M_2^T \right)^{+} M_1^T l_1 + s \, u,$

where $s$ is a scalar. In this case there is a 1D family of $l_2$ for each of the four shapes. ∎
Figure 2 provides an example of the four choices of shape/light pairs in the generic, non-cylindrical case when both eigenvalues of the surface Hessian are nonzero. Without loss of generality, we consider a rotated coordinate system where $a_2 = 0$, i.e., the $x$ and $y$ axes are aligned with the eigenvectors of the surface Hessian. Writing $l = (l_x, l_y, l_z)$, the four solutions from (23) are:

(27)  $a^{(1)} = (a_1, 0, a_3, a_4, a_5), \qquad l^{(1)} = (l_x, l_y, l_z);$

(28)  $a^{(2)} = (-a_1, 0, -a_3, -a_4, -a_5), \qquad l^{(2)} = (-l_x, -l_y, l_z);$

(29)  $a^{(3)} = (-a_1, 0, a_3, -a_4, a_5), \qquad l^{(3)} = (-l_x, l_y, l_z);$

(30)  $a^{(4)} = (a_1, 0, -a_3, a_4, -a_5), \qquad l^{(4)} = (l_x, -l_y, l_z).$

The first choice is the surface/lighting pair that actually induced the image. The second corresponds to the well-known convex-concave ambiguity [10], and is obtained by reflecting both the light and the normals across the view direction. The last two choices (29)-(30) correspond to performing the reflection separately along each of the eigenvector directions of the Hessian matrix. These form a second concave-convex pair.
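In the rotated coordinate system, the four solutions amount to independently flipping signs along each Hessian eigendirection, jointly in the shape coefficients and in the matching light components. A small numerical check (with illustrative values, and the cross term zero by the choice of coordinates) confirms that all four shape/light pairs render identically:

```python
import numpy as np

def render(a, l, x, y):
    # Lambertian shading of the patch z = (a1/2)x^2 + (a3/2)y^2 + a4*x + a5*y
    # (coordinates aligned with the Hessian eigenvectors, so no cross term)
    a1, a3, a4, a5 = a
    p, q = -(a1 * x + a4), -(a3 * y + a5)
    n = np.stack([p, q, np.ones_like(p)], axis=-1)
    return n @ l / np.linalg.norm(n, axis=-1)

xs, ys = np.meshgrid(np.linspace(-1, 1, 7), np.linspace(-1, 1, 7))
a = np.array([0.4, 0.2, 0.1, -0.05])   # (a1, a3, a4, a5), with a1 != +-a3
l = np.array([0.2, 0.3, 0.9])

I0 = render(a, l, xs, ys)
# Four pairs: identity, full reflection across the view (convex-concave),
# and a reflection along each Hessian eigendirection separately.
for sx, sy in [(1, 1), (-1, -1), (-1, 1), (1, -1)]:
    a_k = np.array([sx * a[0], sy * a[1], sx * a[2], sy * a[3]])
    l_k = np.array([sx * l[0], sy * l[1], l[2]])
    assert np.allclose(render(a_k, l_k, xs, ys), I0)
```

The assertion holds because each sign flip changes the corresponding normal component and light component together, leaving both the dot product and the normal's length unchanged.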
When one of the Hessian eigenvalues is zero (say $a_3 = 0$ in our rotated coordinate system), the patch surface is a cylinder, and it is possible to construct a 1D family of lights for each of the four surfaces:

(31)  $l(s) = l + s \, (0, 1, a_5)^T,$

for any $s$ such that no pixel is in shadow. Figure 3 shows an example of four cylindrical surfaces and associated families of lights that can produce the same image.
Theorem 1 applies when the Hessian eigenvalues of any solution shape are not equal in magnitude. What happens when shape solutions have Hessian eigenvalues that are equal in magnitude? There are two distinct cases. The first is when the Hessian is zero and the true surface is planar. In this case every surface normal in the patch is identical, and the well-known pointwise cone ambiguity applies to the patch as a whole: the observed image can be explained by a one-parameter family of planar surfaces for every light $l$.
In the second case, the true surface is not planar, but the magnitudes of the two eigenvalues of the Hessian matrix are equal. Unlike the planar ambiguity, there is not an infinite number of surfaces that can combine with every lighting. But as depicted in Fig. 4, there is still an infinite number of allowable patch/lighting pairs. We note that all quadratic surfaces in this category can be expressed in one of the two following forms:

(32)  

(33)  

Given fixed values of the remaining parameters, these surfaces generate identical images when paired with lighting

(34)  

for surfaces (32), or with

(35)  

for surfaces (33).
3.2 Unique shape when light is known
Theorem 2.
Given intensities at a non-degenerate set of locations, a known light $l$, and a quadratic patch $a$ that satisfies the set of equations in (9), if the planar component $\bar{l}$ of the light is nonzero (i.e., $l$ is not equal to the viewing direction) and not an eigenvector of the Hessian of the surface, then the solution $a$ is unique.
Proof of Theorem 2: Without loss of generality, we choose a coordinate system where $a_2 = 0$. Note that for any such choice, $l_x$ and $l_y$ will both be nonzero, unless $\bar{l}$ is zero or an eigenvector of the surface Hessian, which is ruled out by the statement of the theorem.

If the Hessian of the solution has eigenvalues with unequal magnitudes, then it is easy to see that each of the four possible solutions from Theorem 1 has a distinct light from (25) and (26), and therefore, for a fixed light, the shape is unique. A Hessian with equal eigenvalues is ruled out, since then every light direction would be an eigenvector. When the eigenvalues have equal magnitudes but opposite signs, the surface must be of the form in (32) or (33). In this case too, we see that each member of the continuous family of solutions (with light (34) for surfaces (32), or light (35) for surfaces (33)) has a distinct light direction. ∎
When the conditions in Theorem 2 are not satisfied, there are shape ambiguities as follows. First, planar patches have Hessians with zero eigenvalues, so that every $\bar{l}$ is an eigenvector; this leads to an infinite set of planar shape explanations for any given light. Second, when the light and view directions are the same, there are generically four shape solutions analogous to Fig. 2 or, in the case of equal eigenvalue magnitudes, a continuous family of solutions analogous to Fig. 4. Finally, when the true surface is not planar but the azimuthal component of the light happens to be aligned with one of the Hessian eigenvectors, it is possible to construct a second solution by performing a reflection of the normals across that eigenvector direction. Figure 5 demonstrates this with photographs of two 3D-printed surfaces that are distinct but related by a horizontal reflection of their normals.
4 Ambiguity in the presence of noise
The uniqueness results from the previous section suggest that among the many possible models one could use for local shapes—such as splines, linear subspaces, exemplar dictionaries [17], or continuous functions with smoothness constraints as in [21]—the quadratic function model may be particularly useful. However, before we can use this model for inference, we must understand the effects of deviations, such as intensity noise and higher-order (non-quadratic) components of local shape. To this end, we provide some intuition about the types of quadratic shapes that almost satisfy the polynomial system (9) and thus become likely explanations in the presence of noise. These intuitions motivate the statistical inference technique introduced in Sec. 5.
In the rest of this paper, we assume that the light direction and the albedo/light-strength product (i.e., the vector $l$) are known. Then, the polynomial system (9) relating the quadratic parameters $a$ to the observed intensities can be understood as combining two types of constraints on the patch normals. First, each pixel's normal is constrained by its intensity to a light-centered circle of directions as per (7). This is shown in the left of Fig. 6, where the circle of directions is parameterized by "azimuthal" angle $\phi$:

(36)  

The second type of constraint comes from the quadratic shape model, which induces a joint geometric constraint on the set of surface normals that belong to the patch. This joint constraint has an intuitive interpretation when we represent the normals, light, and view as points on the plane $n_z = 1$ (the so-called projective plane [23]). This representation is constructed by radially projecting the hemisphere of directions onto the plane, as shown in Figure 6. The view is the origin of the plane, the light projects to another planar point $\bar{l}$, and each pixel's parameterized circle of normal directions projects to a conic section, still parameterized by $\phi$. The set of normals that lie on different conics but have the same azimuthal angle forms a ray (right of Fig. 6), and an inversion in the sign of the ray parameter corresponds to a reflection of the surface normal across the light point.
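The following sketch illustrates this construction numerically: it samples the cone of unit normals consistent with a single intensity value, projects them radially onto the plane $z = 1$, and verifies that the projected points lie on a conic. The light vector, intensity value, and rank tolerance are illustrative choices.

```python
import numpy as np

l = np.array([0.3, 0.2, 0.9])
I = 0.8                                   # a single observed intensity
lhat = l / np.linalg.norm(l)
# Unit normals with <l, n> = I lie on a circle around lhat (the pointwise cone)
c = I / np.linalg.norm(l)                 # cosine of the cone half-angle
s = np.sqrt(1 - c**2)
u = np.cross(lhat, [0, 0, 1.0]); u /= np.linalg.norm(u)
w = np.cross(lhat, u)                     # u, w: orthonormal frame around lhat
phi = np.linspace(0, 2 * np.pi, 40, endpoint=False)
N = c * lhat + s * (np.outer(np.cos(phi), u) + np.outer(np.sin(phi), w))
assert np.allclose(N @ l, I)              # every normal on the cone explains I

# Radial projection onto the plane z = 1 (the projective plane of this section)
P = N[:, :2] / N[:, 2:3]
# The projected points satisfy one quadratic (conic) equation: the 6-dim
# monomial vectors (x^2, xy, y^2, x, y, 1) span only a rank-5 subspace.
x, y = P[:, 0], P[:, 1]
M = np.stack([x**2, x * y, y**2, x, y, np.ones_like(x)], axis=1)
assert np.linalg.matrix_rank(M, tol=1e-8) == 5
```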
Using this representation, Fig. 7 visualizes the two types of constraints (under a light with $\bar{l} \neq 0$) for 25 normals at a $5 \times 5$ grid of pixel locations. In addition to each pixel's normal being constrained to its conic, the set of normals is collectively constrained, via (6), to be a symmetric affine grid. Therefore, solving the polynomial system for quadratic coefficients $a$ amounts to finding a symmetric affine grid that aligns properly with the per-pixel conics. Theorem 2 tells us there is only one grid that aligns perfectly, but as shown in the figure, there will be other grids that come close. When there is noise, the shapes corresponding to all of these grids become likely explanations, even though they are physically quite different from one another. To avoid overcommitting, local inference systems must output distributions of shapes that encode this fact.
Then, a natural question is: do we need to search the entire five-dimensional space of quadratic parameters to find all the likely approximate solutions? To answer this question, we note that these approximate solutions are intuitively expected to arise from the degenerate cases detailed in Theorem 2. For example, we find that these solutions often occur in pairs corresponding to reflections across the light direction (i.e., across the light axis in Fig. 7), which would correspond to a second exact solution if the light were an eigenvector of the surface Hessian. Remember that the most ambiguous degeneracy is the one induced by the true surface being planar, when all the conics overlap and there is a continuous set of solutions whose normals can be parametrized by a single angle as per (36). Based on this intuition, we define the first-order orientation $\phi$ of the shape to be the azimuthal angle of the center normal, and find empirically that it is sufficient to search along only a one-dimensional manifold parametrized by this angle.
In Fig. 7, this search can be understood as fixing the value of $\phi$, and warping an affine grid by optimizing the remaining parameters to fit the conic intensity constraints. We see that this leaves very little play in the parameters, so the shapes of possible solutions are highly constrained once $\phi$ is fixed. This effect is further visualized in Fig. 8, which shows contours of constant RMS intensity difference (equally spaced in value on a logarithmic scale) between the observed intensities and the Lambertian renderings of best-fit shapes, obtained by fixing $\phi$ and one coefficient and then optimally fitting the others. The four "close fits" appear as the four modes in these plots, where the value of $\phi$ strongly constrains each coefficient of low-error shapes.
5 Local shape distributions
Armed with intuition about the characteristics of approximate solutions for the quadratic-patch model, we now develop a method for inferring shape distributions at any local image patch of any size. The output for each image patch is a set of quadratic shapes of the same size that correspond to a discrete sampling along a parametrized one-dimensional manifold, as well as a probability distribution over this set of quadratic shapes. The previous sections have demonstrated that shading in some image patches is inherently more informative than others. Our goal is to create a compact description of this ambiguity in each local region at multiple scales, thereby providing a useful mid-level representation of "intrinsic" scene information for vision.
5.1 Computing quadratic shape proposals
Given the intensities at a patch, we first generate a set of quadratic proposals for the shape of that patch, and based on the intuition from the previous section, we index these proposals angularly in reference to the light $l$. Consider a discrete set of uniformly-spaced values $\{\phi_k\}$ over $[0, 2\pi)$,^{2} and for each angle $\phi_k$ we find the corresponding quadratic shape $a_k$ that best explains the observed intensities in terms of minimum sum of squared errors:

(37)  $a_k = \underset{a \,:\, \phi(a) = \phi_k}{\arg\min} \; \sum_i \left( I(x_i) - \hat{I}(x_i; a) \right)^2,$

where $\hat{I}(x_i; a)$ is defined as per (7).

^{2} For some patches, we consider closer-spaced samples over a shorter interval, when some values of $\phi$ do not correspond to physically feasible estimates for shape.
Let $x_0 = (0, 0)$ be the center of the patch. Then, since $\phi$ is fixed, the quadratic coefficients $a_4$ and $a_5$ of $a$ have only one remaining degree of freedom, and can be reparametrized in terms of a single variable $r$ that indexes points along the constant-$\phi$ ray on the projective plane:

(38)  $a_4 = -\left( \bar{l}_x + r \cos\phi \right),$

(39)  $a_5 = -\left( \bar{l}_y + r \sin\phi \right).$
Therefore, the nonlinear least-squares minimization in (37) is over the four variables $(a_1, a_2, a_3, r)$, and can be efficiently carried out with Levenberg-Marquardt [24]. We found empirically that it is insensitive to initialization, and initialize with $a_1 = a_2 = a_3 = 0$ and $r = r_0$ in our experiments, where $r_0$ is chosen such that the center pixel lies on the corresponding conic.

This minimization occurs independently and in parallel for every patch in an image, and it can therefore be parallelized over an arbitrary number of CPU cores, on a single machine or a cluster of machines, as required for increasing image sizes. Our reference implementation considers a fixed set of quantized angles for each patch, and takes about one minute on an eight-core machine for inference on all overlapping patches of an image.
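A minimal version of this per-angle fit can be sketched as follows. Here we take $\phi$ to be the azimuth, about the projected light point, of the projected center normal, and parametrize $(a_4, a_5)$ by a scalar $r$ along that ray; this reading of the reparametrization in (38)-(39), and all numeric values, are our assumptions rather than the reference implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def render(a, l, x, y):
    # Lambertian intensity of a quadratic patch, as in (7)
    a1, a2, a3, a4, a5 = a
    p = -(a1 * x + a2 * y + a4)
    q = -(a2 * x + a3 * y + a5)
    n = np.stack([p, q, np.ones_like(p)], axis=-1)
    return n @ l / np.linalg.norm(n, axis=-1)

def fit_proposal(I, l, x, y, phi, x0):
    # With phi fixed, the projected center normal is confined to the ray through
    # the projected light point lbar at angle phi (our reading of (38)-(39)),
    # leaving four free variables (a1, a2, a3, r), optimized by least squares.
    lbar = l[:2] / l[2]
    def unpack(v):
        a1, a2, a3, r = v
        return (a1, a2, a3,
                -(lbar[0] + r * np.cos(phi)),
                -(lbar[1] + r * np.sin(phi)))
    res = least_squares(lambda v: (render(unpack(v), l, x, y) - I).ravel(), x0)
    return unpack(res.x), res.cost

# Sanity check: refit a noiselessly rendered patch at its correct phi
xs, ys = np.meshgrid(np.linspace(-1, 1, 9), np.linspace(-1, 1, 9))
l = np.array([0.2, 0.1, 0.95])
a_true = (0.3, 0.05, 0.15, -0.3, 0.1)
phi = np.arctan2(-a_true[4] - l[1] / l[2], -a_true[3] - l[0] / l[2])
a_fit, cost = fit_proposal(render(a_true, l, xs, ys), l, xs, ys, phi,
                           x0=(0.25, 0.0, 0.1, 0.2))
assert cost < 1e-8   # a noiseless patch is recovered exactly at the true phi
```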
5.2 Computing shape likelihoods
Next, we define a probability distribution over these shape proposals by computing the likelihood of the observed intensities being generated by each proposed shape $a_k$. We introduce a model for the deviation between the observed intensity $I(x)$ and the expected intensity $\hat{I}(x; a_k)$ from a proposal at each location $x$:

(40)  $I(x) \sim \mathcal{N}\!\left( \hat{I}(x; a_k),\; \sigma_{ns}^2 + \sigma^2(x; a_k) \right).$

This is a Gaussian distribution conditioned on $a_k$, where the variance at each pixel is a sum of two terms. The first, $\sigma_{ns}^2$, is additive i.i.d. intensity noise induced, for example, by sensor noise. The second, $\sigma^2(x; a_k)$, is a function of $a_k$ and varies spatially across the patch, capturing the fact that the veridical shape may exhibit higher-order (non-quadratic) variations at this patch's scale. It is the expected variance in intensity induced by higher-order components of shape that exist on top of the shape predicted by $a_k$ at the current scale.
To compute $\sigma^2(x; a_k)$, we model the deviations of the true normals from those predicted by $a_k$ as i.i.d. Gaussian random variables:

(41)  $p(x) = \hat{p}(x; a_k) + \epsilon_p(x), \qquad \epsilon_p(x) \sim \mathcal{N}(0, \sigma_n^2),$

(42)  $q(x) = \hat{q}(x; a_k) + \epsilon_q(x), \qquad \epsilon_q(x) \sim \mathcal{N}(0, \sigma_n^2),$

where $\sigma_n^2$ is the expected variance of these deviations, which is set to a fixed value in our experiments. Then, we compute $\sigma^2(x; a_k)$ as the expected variance in intensity over the distribution of perturbed normals:

(43)  $\sigma^2(x; a_k) = \operatorname{Var}\!\left[ \frac{l^T n(x)}{\| n(x) \|} \right], \qquad n(x) = \left( p(x), q(x), 1 \right)^T.$
We find that for lights not aligned with the view, i.e., $\bar{l} \neq 0$, this expression can be reliably approximated in closed form:

(44)  

Intuitively, this says that the observed intensity is less sensitive to perturbations in the normal when the surface is tilted away from the viewing direction.
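This sensitivity argument can be checked by Monte Carlo: perturb the predicted slopes $(p, q)$ with i.i.d. Gaussian noise, as in (41)-(42), and measure the induced intensity variance for a mildly versus a strongly tilted normal. The light, noise level, and slope values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l = np.array([0.3, 0.1, 0.95])   # oblique light (lbar != 0); illustrative value
sigma_n = 0.05                   # assumed std of the slope perturbations

def intensity_var(p, q, n_samples=200_000):
    # Monte Carlo estimate of the intensity variance induced by i.i.d. Gaussian
    # perturbations of the predicted slopes (p, q) of the unnormalized normal
    dp = sigma_n * rng.standard_normal(n_samples)
    dq = sigma_n * rng.standard_normal(n_samples)
    n = np.stack([p + dp, q + dq, np.ones(n_samples)], axis=-1)
    I = n @ l / np.linalg.norm(n, axis=-1)
    return I.var()

# A strongly tilted normal induces less intensity variance than a mildly tilted one
assert intensity_var(5.0, 0.0) < intensity_var(0.2, 0.0)
```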
Putting everything together, we compute a cost $c_k$ for every proposal $a_k$, defined as the negative log-likelihood (up to a constant) of all observed intensities under the above model:

(45)  $c_k = \sum_i \left[ \frac{ \left( I(x_i) - \hat{I}(x_i; a_k) \right)^2 }{ 2 \left( \sigma_{ns}^2 + \sigma^2(x_i; a_k) \right) } + \frac{1}{2} \log \left( \sigma_{ns}^2 + \sigma^2(x_i; a_k) \right) \right].$
5.3 Evaluation
We evaluate the accuracy of the proposed local shape distributions using synthetically-rendered images of a set of six random surfaces, with a directional light at a fixed elevation, where each surface is created by generating a coarse grid of random depth values and then smoothly interpolating these to form a surface (see [1], and Fig. 12 for an example).
Figure 9 shows likelihood distributions and proposed shapes for representative image patches from this synthetic dataset. Empirically, we find that the distributions can have a single peak (left), be multimodal (center), or nearly uniform (right); reinforcing our intuition that shading in some image patches is more informative than others. Note that given the correct value of $\phi$ (green dot in the figure), the corresponding estimated shape proposal yields an accurate reconstruction of the true shape in all three cases shown.
We perform a quantitative evaluation of accuracy using all overlapping patches from all six random surfaces, for three different patch sizes. We are interested in knowing: (i) how often the veridical shape is among the set of shape proposals at a patch; and (ii) whether the cost $c_k$ is an informative metric for determining which proposals are most accurate. To this end, for each image patch in the evaluation set, we sort the proposed quadratic shapes according to their likelihood costs $c_k$, compute for each proposal the mean angular difference between its surface normals and the veridical ones, and record, for increasing values of $K$, the lowest mean angular error among the $K$ most-likely shape proposals. Figure 10 shows the statistics of these errors across all test patches for increasing values of $K$. Although the most-likely shape proposal (i.e., $K = 1$) is often reasonably close to the true shape, the error quantiles decrease significantly as we consider larger sets of likely proposals. This emphasizes the value of maintaining full distributions of local shape as a mid-level scene representation, as opposed to "overcommitting" to only one (often suboptimal) shape proposal for each patch through a process of hard local decision-making.
Figure 10 also provides some insight about the effect of patch size, and it shows that patches at multiple scales tend to be complementary. Smaller patches are more likely to have lower errors when considering the full set of proposals (large $K$), since the veridical shape is much more likely to be exactly quadratic at smaller scales. But, as evidenced by the relatively smaller error quantiles for lower values of $K$, larger patches tend to be more informative, with their likelihoods being better predictors of which of the proposals is the true one.
6 Surface reconstruction
To demonstrate the utility of our theory and local distributions for higher-level scene analysis, we consider the application of reconstructing object-scale surface shape when the light is known. The local representations provide concise summaries of the shape information available in each image patch, and they do so without “overcommitting” to any one local explanation. This allows us to achieve reliable performance with a very simple algorithm for global reasoning, which infers object-scale shape through simple iterations between: 1) choosing one likely shape proposal for each local patch; and 2) fitting a global smooth surface to the set of chosen per-patch proposals.
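The two-step alternation can be sketched abstractly as follows. The interfaces (`fit_surface`, `disagreement`) are hypothetical stand-ins for the surface-fitting and agreement terms defined below, and the scalar "surface" in the test is a toy instance, not the paper's depth map:

```python
import numpy as np

def reconstruct(proposals, costs, fit_surface, disagreement, lam, n_iters=100):
    """Alternate between (1) picking one proposal per patch and (2) refitting
    a global surface to the chosen proposals.

    proposals[p][j]: j-th local shape proposal for patch p
    costs[p][j]:     its likelihood cost
    fit_surface(labels) -> global surface fit to the chosen proposals
    disagreement(surface, p, j) -> penalty for proposal j at patch p
    """
    labels = [int(np.argmin(c)) for c in costs]        # start from most-likely
    surface = fit_surface(labels)
    for _ in range(n_iters):
        # Step 1: update each patch label given the current surface
        new_labels = [
            int(np.argmin([costs[p][j] + lam * disagreement(surface, p, j)
                           for j in range(len(proposals[p]))]))
            for p in range(len(proposals))
        ]
        # Step 2: refit the global surface to the chosen proposals
        surface = fit_surface(new_labels)
        if new_labels == labels:                        # converged
            break
        labels = new_labels
    return surface, labels
```

The structure mirrors the algorithm of this section: local label updates are independent (and hence parallelizable), while the surface refit couples all patches.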
Formally, our goal is to estimate the depth map from an observed intensity image , with known lighting and under the assumption that the surface is predominantly textureless and diffuse, i.e., that the shading equation (7) holds at most pixels. We first compute local distributions as described in the previous section, by dividing the surface into a mosaic of overlapping patches of different sizes. We let index all patches (across different patch sizes), with corresponding to the pixel locations, and denoting the local shape proposals and distribution for each patch .
In addition to the proposals at each patch, we use an approach similar to [18] and include a dummy proposal in the distribution for every patch. This makes the surface reconstruction robust to outliers, such as when a local patch deviates significantly from a quadratic approximation (e.g., sharp edges or depth discontinuities), or when the observed intensities deviate from the diffuse model in (7), e.g., due to specularities, shadows, or albedo variations from texture.
We formulate the reconstruction problem as simultaneously finding a depth estimate and a labeling that minimize the cost function:
(46) 
where is a scalar weight, and measures the agreement between the local shape proposal and at :
(47) 
if , and otherwise.
We use an iterative algorithm to minimize the cost , alternating updates to and based on the current estimate of the other. Given the current estimate of the depth map at each iteration, we update the label of every patch independently (and in parallel):
(48) 
Similarly, given a labeling (with corresponding shape proposals ), we compute the depth map that minimizes the cost in (46) as:
(49) 
where is the number of patches that include and have not been labeled as outliers, and is the mean of their normal estimates:
(50) 
The computation in (49) could be carried out exactly and efficiently using the Frankot-Chellappa algorithm [25] if all were equal. But this is not the case, since will be lower near the boundary and in regions where some patches have been detected as outliers. Nevertheless, we find that [25] provides an acceptable approximation in these cases. We use [25] throughout the alternating iterations until and converge (we simply set when ), and then we run a limited number of additional iterations using conjugate gradient to compute step (49) exactly.
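For reference, the Frankot-Chellappa projection [25] used in this step integrates a gradient field by projecting it onto the space of integrable surfaces in the Fourier domain. The following is a minimal sketch assuming periodic boundaries and uniform weights (it does not include the spatially varying weights discussed above):

```python
import numpy as np

def frankot_chellappa(p, q):
    """Integrate a gradient field (p = dz/dx, q = dz/dy) into the nearest
    integrable surface via an FFT-domain projection [25].
    Assumes periodic boundaries; depth is recovered up to a constant.
    """
    h, w = p.shape
    wx = np.fft.fftfreq(w) * 2 * np.pi         # frequencies along x (columns)
    wy = np.fft.fftfreq(h) * 2 * np.pi         # frequencies along y (rows)
    u, v = np.meshgrid(wx, wy)
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0                          # avoid division by zero at DC
    Z = (-1j * u * P - 1j * v * Q) / denom     # least-squares integrable fit
    Z[0, 0] = 0.0                              # fix the free depth constant
    return np.real(np.fft.ifft2(Z))
```

For a bandlimited periodic surface, the projection is exact; in practice the conjugate-gradient refinement mentioned above handles the non-uniform weights that this sketch ignores.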
To speed up convergence, we smooth the estimate of in the first few iterations with a Gaussian filter with variance , and set starting from an initial value that is decreased by a constant factor until it reaches (at which point we stop smoothing). We also initially run the algorithm over only the valid proposals at each patch until convergence, and then introduce the dummy proposal . We set the parameters and automatically based on the input distributions: is set to th the reciprocal of the mean of the differences between the minimum and median likelihood costs across all patches at the smallest scale, and is set to . The reconstruction from local patch proposals takes 40 seconds on average on an eight-core machine for images with local distributions at four scales (not including the computation time for estimating local proposals, which is reported in Sec. 5.1).
6.1 Evaluation
We first quantitatively evaluate the proposed reconstruction algorithm, under known lighting, with the random surfaces described in Sec. 5.3. We render images with different amounts of additive white Gaussian noise, as well as with specular highlights. For the latter, we use the Beckmann [26] model and consider different values of “surface smoothness” to get images with increasing numbers of saturated pixels. We mask out pixels that are saturated during estimation, but note that many nearby unclipped pixels will also include a nonzero specular component that violates the diffuse shading model.
Figure 11 summarizes the performance (using local distributions of , , , and overlapping patches) and compares it to two state-of-the-art methods. The first is the iterative algorithm proposed by Ecker and Jepson [14] (labeled “Polynomial SFS”). The second (labeled “Cross-scale”) is the shape-from-shading component of the SIRFS method [21], i.e., where we treat the light and shading image as given, and do not use contour information. The cross-scale method uses an overcomplete, multiscale representation of the global depth map and minimizes the rendering error along with the likelihood of the recovered shape under a prior model. For both methods we use implementations provided by their authors, and for the cross-scale method, we use the author-provided prior parameters that were trained on the MIT intrinsic image database [27]. (We also evaluated the cross-scale method with a prior trained on the random surfaces, but this did not improve performance.)
We see from Fig. 11 that while the polynomial SFS method performs best in the noiseless case, the proposed algorithm is more robust to both Gaussian noise and the structured artifacts from specular highlights. The cross-scale method is also reasonably robust to these effects due to its use of a shape prior, but in general has higher errors. Figure 12 provides example reconstructions by the proposed method for one surface; the full set of reconstructions is available at [1].
Next, we evaluate all algorithms using photographs of seven relatively diffuse objects, captured with a Canon EOS 40D camera under directional lighting, with two chrome spheres in the scene to measure the light direction. These photographs contain non-idealities such as mutual illumination, self-shadowing, and slight texture. For each object under a fixed viewpoint, we took twenty images with varying light directions, from which we recover the normal vectors as well as the depth map to high accuracy using photometric stereo. We use this recovered shape as ground truth for our evaluation. All the captured images, calibration information, and recovered normal and depth maps are available on our project page [1].
For each object, we choose a single image as input to evaluate the performance of the different SFS frameworks. The 99th-percentile intensity value of the image is assumed to correspond to the albedo times the light intensity and is used for image normalization; and since these images are larger, we use local distributions at two additional patch scales: and . The surfaces reconstructed using the different methods and the measured light direction are shown in Fig. 13, along with median angular error values. We find that in most cases, the proposed algorithm provides a better reconstruction of object-scale shape than the baselines.
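The normalization step can be sketched as follows (the helper name is ours):

```python
import numpy as np

def normalize_input(image, percentile=99):
    """Normalize an input photograph so that its 99th-percentile intensity,
    taken as the product of albedo and light strength, maps to 1."""
    return image / np.percentile(image, percentile)
```

Using a high percentile rather than the maximum makes the scale estimate robust to a few clipped or specular pixels.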
[Figure 13: for each input image, reconstructions by the proposed method, polynomial SFS, and the cross-scale method, each shown with its resolution and median angular error.]
7 Discussion
Our theoretical analysis shows that in an idealized quadratic world, local shape can be recovered uniquely in almost every local image patch, without the use of singular points, occluding contours, or any other external shape information. Beyond this idealized world, our evaluations on synthetic and captured images suggest that one can infer, efficiently and in parallel, concise multiscale local shape distributions that are accurate and useful for global reasoning.
There are many viable directions for future work. Foremost among these is the joint estimation of shape, lighting, and albedo. The reconstruction algorithms proposed in this paper are limited to the case of known lighting, but the uniqueness results in Sec. 3.1 suggest that simultaneous reconstruction of shape and lighting may also be possible. Theorem 1 tells us that, in an idealized quadratic world, there are generically four lights that can explain each local patch, and that these quadruples of possible lights vary from patch to patch according to the directions of each patch’s Hessian eigenvectors. Intuitively, one might infer the true light (along with its reflection across the view direction, which is always equally likely) as the one that is common to all or most of the per-patch quadruples. (We have experimented with a direct implementation of this intuition that performs a brute-force search over lighting direction alone, assuming a known constant light strength and albedo, and that pools local estimates without considering consistency or noise. This method worked reasonably well in many cases, but was computationally expensive and not entirely robust.) Practically speaking, it is likely that a reconstruction algorithm that handles unknown lighting will need to jointly reason about shape, lighting, and varying albedo, in the same spirit as Barron and Malik [21]; such reasoning would benefit from an analysis of the joint ambiguities induced by noise and non-quadratic shape, similar to what was done for shape alone in Sections 4 and 5.
Also, while we provide a means to extract a single estimate of the global surface from local shape distributions, one could also imagine using reasoning about consistency and outliers to allow the full distributions of neighboring patches to collaboratively refine themselves. This could be useful, for instance, when the object boundaries in a scene are not known a priori. These refined local distributions may then be able to identify depth discontinuities in the scene, and help segment out individual objects for shape recovery.
Finally, it will be interesting to combine our shading-based local distributions with complementary reasoning about contours, shading keypoints [28], texture, gloss, shadows, and so on, treating these as additional cues for shape as well as ways to better identify outliers to our smooth diffuse shading model. We also believe it is worth integrating these local shape distributions into processes for higher-level vision tasks such as pose estimation, object recognition, and multi-view reconstruction, where one can imagine additionally using top-down processing to aid local inference, for example by exploiting priors on local quadratic shapes that are based on object identity or scene category.
References
 [1] Project Page, http://vision.seas.harvard.edu/qsfs/.
 [2] J. D. Durou, M. Falcone, and M. Sagona, “Numerical methods for shape-from-shading: A new survey with benchmarks,” CVIU, 2008.
 [3] B. K. P. Horn and M. J. Brooks, Shape from shading. MIT press, 1989.
 [4] R. Zhang, P. Tsai, J. E. Cryer, and M. Shah, “Shape from shading: A survey,” PAMI, 1999.
 [5] J. Oliensis, “Uniqueness in shape from shading,” IJCV, 1991.
 [6] ——, “Shape from shading as a partially well-constrained problem,” CVGIP: Image Understanding, 1991.
 [7] E. Prados and O. Faugeras, “A generic and provably convergent shape-from-shading method for orthographic and pinhole cameras,” IJCV, 2005.
 [8] ——, “Unifying approaches and removing unrealistic assumptions in shape from shading: Mathematics can help,” in ECCV, 2004.
 [9] ——, “Shape from shading: a well-posed problem?” in CVPR, 2005.
 [10] A. P. Pentland, “Local shading analysis,” PAMI, 1984.
 [11] J. Wagemans, A. J. Van Doorn, and J. J. Koenderink, “The shading cue in context,” iPerception, 2010.
 [12] M. K. Johnson and E. H. Adelson, “Shape estimation in natural illumination,” in CVPR, 2011.
 [13] Q. Zhu and J. Shi, “Shape from shading: Recognizing the mountains through a global view,” in CVPR, 2006.
 [14] A. Ecker and A. D. Jepson, “Polynomial shape from shading,” in CVPR, 2010.
 [15] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” IJCV, 2000.
 [16] T. Hassner and R. Basri, “Example based 3D reconstruction from single 2D images,” in CVPR Workshop “Beyond patches”, 2006.
 [17] X. Huang, J. Gao, L. Wang, and R. Yang, “Examplar-based shape from shading,” in 3DIM, 2007.
 [18] F. Cole, P. Isola, W. T. Freeman, F. Durand, and E. H. Adelson, “Shapecollage: occlusion-aware, example-based shape interpretation,” in ECCV, 2012.
 [19] D. Marr, “Early processing of visual information,” Philosophical Trans. of the Royal Society of London. B, Biological Sciences, 1976.
 [20] P. S. Huggins, H. F. Chen, P. N. Belhumeur, and S. W. Zucker, “Finding folds: On the appearance and identification of occlusion,” in CVPR, 2001.
 [21] J. T. Barron and J. Malik, “Shape, illumination, and reflectance from shading,” Berkeley Tech Report, 2013.
 [22] B. Kunsberg and S. W. Zucker, “Characterizing ambiguity in light source invariant shape from shading,” arXiv:1306.5480v1 [cs.CV], June 2013.
 [23] P. Tan, L. Quan, and T. Zickler, “The geometry of reflectance symmetries,” PAMI, 2011.
 [24] D. W. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” Journal of the Society for Industrial & Applied Mathematics, 1963.
 [25] R. T. Frankot and R. Chellappa, “A method for enforcing integrability in shape from shading algorithms,” PAMI, 1988.
 [26] P. Beckmann and A. Spizzichino, The scattering of electromagnetic waves from rough surfaces. Pergamon New York, 1963.
 [27] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman, “Ground truth dataset and baseline evaluations for intrinsic image algorithms,” ICCV, 2009.
 [28] J. Haddon and D. Forsyth, “Shape representations from shading primitives,” in ECCV, 1998.
Supplementary Material
In this supplementary section, we provide a proof of Lemma 1 from Sec. 3.1. As a reminder, the lemma is stated in terms of matrices , which are related to the coefficient vectors as:
(6) 
The statement of the lemma itself is reproduced below.
Lemma 1 : Let and correspond to two coefficient matrices of the form in (6), and and to two lighting vectors. If,
(12) 
Rank, Rank, and (i.e., no point is in shadow), then
(13) 
Moreover, if Rank, then Rank and both and have a common null space.
The expression in (12) equates two rational forms in . To prove the lemma, we will show that the equality holds for all if we have a sufficient number of nondegenerate locations in the patch. Then, we will show that the corresponding coefficients in the quadratic expressions in the numerator and denominator must be equal when they are of the form in (6) and the conditions of Lemma 1 are met, essentially ruling out the possibility of a common factor or scaling term. To this end, we introduce another lemma, with proof, and then present the proof of Lemma 1.
Lemma 2.
Let be symmetric matrices. Then, and , where is a constant scalar, if
(51) 
Rank() = 15, and,
Case 1: All of the following conditions are satisfied:
(52)  
(53)  
(54)  