Joint Image and Depth Estimation
with Mask-Based Lensless Cameras
Abstract
Mask-based lensless cameras replace the lens of a conventional camera with a customized mask. These cameras can potentially be very thin and even flexible. Recently, it has been demonstrated that such mask-based cameras can recover the light intensity and depth information of a scene. Existing depth recovery algorithms either assume that the scene consists of a small number of depth planes or solve a sparse recovery problem over a large 3D volume. Both of these approaches fail to recover scenes with large depth variations. In this paper, we propose a new approach for depth estimation based on an alternating gradient descent algorithm that jointly estimates a continuous depth map and the light distribution of the unknown scene from its lensless measurements. The computational complexity of the algorithm scales linearly with the spatial dimension of the imaging system. We present simulation results on image and depth reconstruction for a variety of 3D test scenes. A comparison between the proposed algorithm and other methods shows that our algorithm is faster and more robust for natural scenes with a large range of depths.
I. Introduction
Depth estimation is an important and challenging problem that arises in a variety of applications, including computer vision, robotics, and autonomous systems. Existing depth estimation systems use stereo pairs of conventional (lens-based) cameras or time-of-flight sensors [17, 15, 18]. These cameras can be heavy and bulky and require large space for installation. Therefore, their adoption in portable and lightweight devices with strict physical constraints remains limited.
In this paper, we propose a joint image and depth estimation framework for a computational lensless camera that consists of a fixed, binary mask placed on top of a bare sensor. Such mask-based cameras offer an alternative design for building cameras without lenses. A recent example of a mask-based lensless camera is the FlatCam [4]. In contrast with a lens-based camera, which is designed to map every point in the scene to a single pixel on the sensor, every sensor pixel in a FlatCam records light from every point in the scene. A single point source in the scene casts a shadow of the mask on the sensor; the shadow shifts if the point moves parallel to the sensor plane and expands/shrinks if the point source moves toward/away from the sensor plane. The measurements recorded on the sensor thus represent a superposition of shifted and scaled versions of the mask shadows corresponding to light sources at different directions and depths. Image and depth information about the scene is thus encoded in the measurements, and we can solve an inverse problem to estimate both.
Joint estimation of intensity and depth is a nonconvex problem. In fact, estimation of depth by itself, even with known scene intensity, is a nonconvex problem. To jointly estimate depth and light distribution, we propose a two-step approach that consists of an initialization step and an alternating gradient descent step to minimize our objective. To preserve sharp edges in the image intensity and depth map, we include an adaptive regularization penalty in our objective function. An overview of the reconstruction framework is illustrated in Figure 1. We initialize the estimates of image intensity and depth using a greedy algorithm proposed in [3]. We then refine the estimates by minimizing an objective function with respect to image intensity and depth via alternating gradient descent. To simplify the recovery algorithm, we assume that the mask pattern is differentiable everywhere. We use adaptive weights to add smoothness regularization to the intensity and depth estimates [22]. Even though the problem of joint estimation of intensity and depth is nonconvex, we observe that a simple regularization makes the algorithm robust against local minima of the loss function and improves its performance.
The key contributions of this paper are as follows.


We propose a new computational framework for joint estimation of light intensity and depth map from a single image of a maskbased lensless camera. In contrast to other methods, our method estimates the depth map on a continuous domain. Our algorithm consists of a careful initialization step based on greedy pursuit and an alternating minimization step based on gradient descent.

The problem of joint image and depth recovery is highly nonconvex. To tackle this issue, we present different regularization schemes that offer robust recovery on a diverse dataset.

We present simulation results on standard 3D datasets and demonstrate a significant improvement over existing methods for 3D imaging using coded mask-based lensless cameras.
II. Related Work
A pinhole camera, also known as a camera obscura, is the simplest example of a mask-based lensless camera. Even though a pinhole can easily form an image of the scene on a sensor plane, the image quality is often severely affected by noise because the amount of light collected is limited by the pinhole aperture [42]. Coded aperture-based lensless cameras avoid this problem by increasing the number of pinholes and allowing more light to reach the sensor [14, 8, 9, 4, 6]. In contrast to a pinhole camera, where only one inverted image of the scene is obtained through a single pinhole, the measurements captured through a coded mask are a linear combination of all the pinhole images under every mask element. To recover an image of the scene, we need to solve a computational image recovery problem [14, 4].
A coded aperture system offers another advantage by encoding light from different directions and depths differently. Note that a bare sensor can provide the intensity of a light source but not its spatial location. A mask in front of the sensor encodes directional information of the source in the sensor measurements. Consider a single light source with a dark background; the image formed on the sensor will be a shadow of the mask. If we change the angle of the light source, the mask shadow on the sensor will shift. Furthermore, if we increase or decrease the depth of the light source, the width of the shadow will decrease or increase, respectively. Thus, we can represent the relationship between all the points in the 3D world and the sensor measurements as a linear system, which depends on the pattern and the placement of the mask. We can solve this system using an appropriate computational algorithm to recover the image of the scene.
The depth-dependent imaging capability of coded aperture systems has been known since the pioneering work in this domain [31, 14]. The following excerpt from [14] summarizes it well: “One can reconstruct a particular depth in the object by treating the picture as if it was formed by an aperture scaled to the size of the shadow produced by the depth under consideration.” However, classical methods usually assume that the scene consists of a single plane at a known depth. In this paper, we assume that the depth map is arbitrarily distributed on a continuous domain and that the true depth map is unknown at the time of reconstruction.
The 3D lensless imaging problem has also recently been studied in [2, 3, 1, 6]. These methods can broadly be divided into two categories. In the first category, the 3D scene is divided into a finite number of voxels. To recover the 3D light distribution, these methods solve an ℓ1-norm-based recovery problem under the assumption that the scene is very sparse [2, 1]. In the second category, the 3D scene is divided into an intensity map and multiple depth planes such that each pixel is assigned one intensity and one depth. To solve the intensity and depth recovery problem, these methods either sweep through the depth planes [6] or assign a depth to each pixel using a greedy method [3]. Our proposed method belongs to the second category: we model the image intensity and depth separately and assume that the depth values of the scene are distributed on a continuous domain. To recover the 3D scene, we jointly estimate the image intensity and depth map from the available sensor measurements.
Joint estimation of an image intensity and depth map can be viewed as a nonlinear inverse problem in which the sampling function depends on the scene depth. Similar inverse problems arise in many other fields, such as direction-of-arrival estimation in radar [35], super-resolution [7], and compressed sensing [38, 26, 5]. As in joint estimation of image intensity and depth, the solution approaches to these problems consist of two main steps: identification of signal bases and estimation of signal intensities based on the identified bases. The problem of identifying the signal bases from continuously varying candidates is often called off-the-grid signal recovery. The methods for solving off-the-grid signal recovery problems can be divided into two main types. The first approach formulates the problem as a convex program on a continuous domain and solves it using atomic norm minimization [10, 36]. The second approach linearizes the problem with respect to the continuous optimization parameter using a first-order approximation at every iteration [7, 41]. Our proposed algorithm is inspired by the second approach.
Mask-based lensless cameras have traditionally been used for imaging light at wavelengths beyond the visible spectrum [8, 9]. Other examples related to mask-based cameras include controllable apertures, coded masks for compressed sensing and computational imaging [34, 43], the single-pixel camera [12], and external mask settings [27].
Coded masks have also recently been used with conventional lens-based cameras to estimate depth and light fields [19, 39]. Recently, a number of data-driven methods have been proposed to design custom phase masks and optical elements to estimate depth from a single image [40, 11]. An all-optical diffractive deep neural network is proposed in [20, 24], which can perform pattern recognition tasks such as handwritten digit classification using optical mask layers. Such networks can process images at the speed of light with near-zero energy cost.
III. Methods
III-A Imaging Model
We divide the 3D scene under observation into uniformly spaced angular directions. We use θ and φ to denote the angular directions of a light source with respect to the center of the sensor. The intensity and depth of the light source are denoted by l(θ, φ) and D(θ, φ), respectively. Figure 1(a) depicts the geometry of this imaging model. A planar coded mask is placed on top of a planar sensor array at distance d. The sensor array captures light coming from the scene modulated by the coded mask.
Every light source in the scene casts a shadow of the mask on the sensor array, which we denote using basis functions ψ_{x,y}(θ, φ, z). We use x and y to index a pixel on the rectangular sensor array. The shadow cast by a light source with unit intensity at (θ, φ, z) can be represented as the following basis or point spread function:

ψ_{x,y}(θ, φ, z) = m(z x + d tan θ, z y + d tan φ),  (1)

where m(u, v) denotes the transmittance of the mask pattern at location (u, v) on the mask plane and z is a variable that is related to the physical depth D with the following inverse relation:

z = 1 − d/D.  (2)
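The inverse relation between the variable z and the physical depth D can be sketched in code. One choice consistent with the values quoted later in the paper (z ≈ 0.996 at D ≈ 1 m with a mask-to-sensor distance of 4 mm) is z = 1 − d/D; the function names here are ours:

```python
import numpy as np

def inverse_depth(D, d=0.004):
    """Map physical depth D (meters) to the inverse-depth variable z = 1 - d/D,
    where d is the mask-to-sensor distance (4 mm in the simulation setup)."""
    return 1.0 - d / np.asarray(D, dtype=float)

def physical_depth(z, d=0.004):
    """Invert z back to physical depth: D = d / (1 - z)."""
    return d / (1.0 - np.asarray(z, dtype=float))
```

Note that z approaches 1 as the scene moves away from the sensor, which is why distant scenes become hard to resolve in depth.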
If the 3D scene consists of only a single point source at (θ₀, φ₀) with light intensity l₀ and inverse depth z₀, the measurement captured at sensor pixel (x, y) would be

y(x, y) = l₀ ψ_{x,y}(θ₀, φ₀, z₀).  (3)
The measurement recorded at any sensor pixel is the summation of contributions from each of the point sources in the 3D scene. The imaging model for a single sensor pixel can be represented as

y(x, y) = Σ_{θ,φ} l(θ, φ) ψ_{x,y}(θ, φ, z(θ, φ)).  (4)
We can write the imaging model for all the sensor pixels in a compact form as

y = Φ(z) l + e,  (5)

where y is a vectorized form of the matrix of sensor measurements, l is a vectorized form of the matrix of light intensities from all the directions (θ, φ), and Φ(z) is a matrix containing all the basis functions corresponding to z. The basis functions in (5) are parameterized by the unknown inverse depth map z. e denotes noise and other nonidealities in the system.
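The linear model above can be sketched in one dimension. This is a toy illustration only: the mask pattern, dimensions, and the assumed shadow expression m(z·x + d·tanθ) are hypothetical stand-ins for the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 0.004                                 # mask-to-sensor distance (4 mm)
x = (np.arange(64) - 32) * 50e-6          # 1D sensor pixel coordinates (50 um pitch)
thetas = np.deg2rad(np.linspace(-10, 10, 16))   # candidate source angles

# Smooth 1D "mask" transmittance: a blurred random binary pattern (hypothetical).
grid = np.linspace(-0.01, 0.01, 4001)
pattern = np.convolve(rng.integers(0, 2, grid.size).astype(float),
                      np.ones(25) / 25, mode="same")
m = lambda u: np.interp(u, grid, pattern)

def basis_matrix(z):
    """Columns are mask shadows evaluated at every pixel: m(z*x + d*tan(theta))."""
    return np.stack([m(z * x + d * np.tan(t)) for t in thetas], axis=1)

A = basis_matrix(0.996)        # all sources on one depth plane (z = 0.996)
l_true = rng.random(thetas.size)
y = A @ l_true                 # noiseless sensor measurements, y = Phi(z) l
```

Each column of `A` is the shadow of one hypothetical source direction; the measurement vector is their intensity-weighted superposition.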
We can jointly estimate the light distribution l and the inverse depth map z (z has an inverse relation with the depth map in (2); therefore, we refer to it as the inverse depth map throughout the paper) using the following optimization problem:

minimize over (l, z):  L(l, z) = ‖y − Φ(z) l‖₂².  (6)
Note that if we know the true values of z (or fix them), the problem in (6) reduces to a linear least-squares problem that can be solved efficiently via standard solvers. On the other hand, if we fix the value of l, the problem remains nonlinear with respect to z. In the next few sections, we discuss our approach for solving the problem in (6) via alternating minimization.
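The observation that the problem becomes linear once the inverse depth is fixed can be illustrated with a stand-in basis matrix (a random, well-conditioned matrix in place of the actual Φ(z)):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))   # stand-in for the basis matrix Phi(z) at fixed z
l_true = rng.random(50)
y = A @ l_true                       # noiseless measurements

# With z fixed, the joint problem reduces to ordinary linear least squares in l.
l_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
```

In the noiseless, well-conditioned case the least-squares solve recovers the intensities exactly.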
III-B Initialization
Since the minimization problem in (6) is not convex, a proper initialization is often needed to ensure convergence to a local minimum close to the optimal point. A naïve approach is to initialize all the point sources in the scene at the same depth plane. To select an initial depth plane, we sweep through a set of candidate depth planes and perform image reconstruction on one depth plane at a time by solving the following linear least-squares problem:

l̂(z̄) = arg min over l:  ‖y − Φ(z̄ 1) l‖₂²,  (7)

where z̄ is the candidate inverse depth assigned to every pixel.
We evaluate the loss value for all the candidate depth planes and pick the one with the smallest loss as our initial depth. The mask basis function in (1) changes as we change z, which has an inverse relation with the scene depth. We select candidate depths corresponding to uniformly sampled values of z, which yields nonuniform sampling of the physical scene depth. The single-depth initialization approach is computationally simple and provides a reasonable initialization of the light distribution, especially when the scene is far from the sensor.
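The single-depth sweep described above can be sketched as follows; the function names and the synthetic basis used in the usage example are hypothetical:

```python
import numpy as np

def sweep_init(y, build_basis, z_candidates):
    """Single-plane initialization: for each candidate inverse depth z, solve the
    linear least-squares problem and keep the z with the smallest residual loss."""
    best = None
    for z in z_candidates:
        A = build_basis(z)                        # basis matrix for a single plane
        l, *_ = np.linalg.lstsq(A, y, rcond=None)
        loss = np.sum((y - A @ l) ** 2)
        if best is None or loss < best[0]:
            best = (loss, z, l)
    return best[1], best[2]                       # initial depth and intensities

# Usage with a synthetic two-column basis whose shape depends on z:
t = np.linspace(0.0, 1.0, 40)
build = lambda z: np.column_stack([np.sin(z * t), np.cos(z * t)])
y_demo = build(2.0) @ np.array([1.0, 0.5])        # data generated at z = 2.0
z0, l0 = sweep_init(y_demo, build, [1.0, 2.0, 3.0])
```

The candidate whose plane matches the data drives the residual to zero and is selected.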
Our second approach for initialization is the greedy method proposed in [3]. Greedy algorithms are widely used for sparse signal recovery [38, 26, 5]. Building on these algorithms, [3] proposed a greedy depth pursuit algorithm for depth estimation with FlatCam [4]. The algorithm works by iteratively updating the depth surface that best matches the observed measurements.
The depth pursuit method assumes that the scene consists of a finite number of predefined depth planes. We start the program by initializing all the pixels at a single depth plane and estimating the light intensities based on the initialized depth map. The first step is to select new candidate values for z. The new candidates are selected using the basis vectors that are most correlated with the current residual of the estimate. In the second step, the new candidates for z are appended to the current estimate, and we solve a least-squares problem using the appended bases. In the third step, we prune the estimate by selecting, for each pixel, the z value corresponding to the largest magnitude of the recovered intensity. Although this method may not estimate off-grid point sources well, it produces a good preliminary estimate of the scene.
III-C Refinement via Alternating Gradient Descent
To solve the minimization problem in (6), we start with the preliminary image and depth estimates from the initialization step and alternately update the depth and light distribution via gradient descent. The main computational task in the gradient descent method is computing the gradient of the loss function with respect to z. To compute that gradient, we expand the loss function in (6) as

L(l, z) = Σ_{x,y} [ y(x, y) − Σ_{θ,φ} l(θ, φ) ψ_{x,y}(θ, φ, z(θ, φ)) ]².  (8)
We define r(x, y) = y(x, y) − Σ_{θ,φ} l(θ, φ) ψ_{x,y}(θ, φ, z(θ, φ)) as the residual approximation error at location (x, y). The derivative of the loss function with respect to z(θ, φ) is given as

∂L/∂z(θ, φ) = −2 Σ_{x,y} r(x, y) l(θ, φ) ∂ψ_{x,y}(θ, φ, z)/∂z.  (9)
We compute the derivative of the basis function with respect to z using the total derivative (recall that the total derivative of a multivariate function f(u(z), v(z)) is df/dz = (∂f/∂u)(du/dz) + (∂f/∂v)(dv/dz)) as follows:

∂ψ_{x,y}(θ, φ, z)/∂z = x (∂m/∂u)(u₀, v₀) + y (∂m/∂v)(u₀, v₀).  (10)
u₀ and v₀ denote two dummy variables that correspond to the specific location on the mask where a light ray from a point source at angles (θ, φ) and inverse depth z to the sensor pixel at (x, y) intersects the mask plane. The terms ∂m/∂u and ∂m/∂v in (10) can be viewed as the derivatives of the mask pattern along the respective spatial coordinates, evaluated at (u₀, v₀). We compute these derivatives using finite differences of m over a fine grid and linear interpolation.
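The finite-difference-plus-interpolation evaluation of the mask derivative can be sketched in 1D. A smooth analytic stand-in replaces the blurred mask pattern so the numerical derivative can be checked against a known answer:

```python
import numpy as np

# Fine grid of a smooth (blurred) 1D mask pattern; the pattern is a stand-in.
u_grid = np.linspace(0.0, 1.0, 10001)
m_vals = 0.5 * (1 + np.sin(40 * u_grid))

# Finite-difference derivative on the fine grid, then linear interpolation to
# evaluate the mask m and its derivative dm/du at arbitrary intersection points.
dm_vals = np.gradient(m_vals, u_grid)
m  = lambda u: np.interp(u, u_grid, m_vals)
dm = lambda u: np.interp(u, u_grid, dm_vals)

u0 = np.array([0.123, 0.456])
# For this smooth test pattern, dm/du should match the analytic 20*cos(40*u).
err = np.max(np.abs(dm(u0) - 20 * np.cos(40 * u0)))
```

On a sufficiently fine grid the interpolated derivative tracks the analytic one closely, which is what the gradient computation above relies on.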
III-D Algorithm Analysis
To solve the nonlinear least-squares problem in our algorithm, we compute the gradient derived in (III-C) and use it as input to an optimization solver. Suppose ψ_x(θ, z) and ψ_y(φ, z) denote the basis function vectors evaluated on the two 1D mask patterns m₁ and m₂,

ψ_x(θ, z) = [ m₁(z x + d tan θ) ]_x,   ψ_y(φ, z) = [ m₂(z y + d tan φ) ]_y.  (11)
If we use a separable mask pattern, i.e., the 2D mask pattern is the outer product of two 1D mask patterns, the 2D mask function in (1) can be computed as the outer product ψ_x(θ, z) ψ_y(φ, z)ᵀ. Similarly, we define the 1D derivative functions as

ψ′_x(θ, z) = [ m₁′(z x + d tan θ) ]_x,   ψ′_y(φ, z) = [ m₂′(z y + d tan φ) ]_y.  (12)
Similar to (III-C), the functions m₁′ and m₂′ are the derivatives of the 1D mask patterns; they take nonzero values at locations where the mask pattern value changes and are zero elsewhere. Using the derivation in (III-C), the matrix containing ∂ψ_{x,y}/∂z at all (x, y) can be computed as the following sum of two vector outer products:

∂Ψ_{θ,φ}/∂z = (x ∘ ψ′_x(θ, z)) ψ_y(φ, z)ᵀ + ψ_x(θ, z) (y ∘ ψ′_y(φ, z))ᵀ,  (13)

where x and y denote the vectors of sensor pixel coordinates and ∘ denotes elementwise multiplication.
Using the derivations in (III-C), the derivative of the loss function with respect to the depth value of each direction can be computed using the following matrix multiplications, where R refers to the matrix of residuals at all (x, y):

∂L/∂z(θ, φ) = −2 l(θ, φ) [ (x ∘ ψ′_x)ᵀ R ψ_y + ψ_xᵀ R (y ∘ ψ′_y) ].  (14)

Suppose we have M × M pixels on the sensor array. The computation in (14) takes O(M²) multiplications per direction. We then feed our gradients to the minFunc solver [33] with the L-BFGS algorithm [21] to solve the nonlinear optimization problem.
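The separable structure described above — a 2D shadow formed as an outer product of two 1D shadows, and its derivative with respect to z as a sum of two outer products — can be verified numerically. The mask patterns, angles, and dimensions here are hypothetical smooth stand-ins:

```python
import numpy as np

d = 0.004
x = (np.arange(32) - 16) * 50e-6         # sensor pixel coordinates (one axis)
y = x.copy()
t1, t2 = np.deg2rad(3.0), np.deg2rad(-5.0)   # source angles (theta, phi)

# Smooth separable 1D mask patterns and their analytic derivatives (stand-ins).
m1  = lambda u: 0.5 * (1 + np.sin(3e3 * u))
dm1 = lambda u: 1.5e3 * np.cos(3e3 * u)
m2  = lambda v: 0.5 * (1 + np.cos(2e3 * v))
dm2 = lambda v: -1e3 * np.sin(2e3 * v)

def shadow(z):
    """2D mask shadow as an outer product of two 1D shadows."""
    return np.outer(m1(z * x + d * np.tan(t1)), m2(z * y + d * np.tan(t2)))

def dshadow_dz(z):
    """Derivative of the outer product w.r.t. z: a sum of two outer products."""
    u, v = z * x + d * np.tan(t1), z * y + d * np.tan(t2)
    return np.outer(x * dm1(u), m2(v)) + np.outer(m1(u), y * dm2(v))

# Check the analytic derivative against a central finite difference in z.
z, eps = 0.996, 1e-6
fd = (shadow(z + eps) - shadow(z - eps)) / (2 * eps)
```

The key point is cost: the derivative of an M×M shadow never has to be formed column by column; two rank-one terms suffice.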
III-E Regularization Approaches
ℓ2 regularization on spatial gradients. The optimization problem in (6) is highly nonconvex and contains several local minima; therefore, the estimate often gets stuck in a local minimum, and the estimated intensity and depth map are coarse. To improve the performance of our algorithm on the nonconvex problem in (6), we exploit additional structure in the scene. A standard assumption is that the depths of neighboring pixels are usually close, which implies that the spatial differences of the (inverse) depth map are small. To incorporate this assumption into our model, we add a quadratic regularization term on the spatial gradients of the inverse depth map to our loss function. The quadratic regularization term is defined on the inverse depth map matrix Z and can be written as

R(Z) = ‖Δ_r Z‖_F² + ‖Δ_c Z‖_F²,  (15)
where the operators Δ_r and Δ_c compute spatial differences along rows and columns, respectively. We call this regularization ℓ2-norm-based total variation (TV) in this paper. Figure 2 illustrates the effect of the depth regularization. From Figure 2, we observe that the smoothness regularization improves the loss landscape by removing several local minima. We also observed this effect in our simulations for the high-dimensional depth recovery problem, which becomes much less sensitive to initialization when depth regularization is used.
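The quadratic penalty on spatial differences can be sketched as:

```python
import numpy as np

def tv_l2(Z):
    """Squared-l2 penalty on the spatial gradients of the inverse depth map Z."""
    dr = np.diff(Z, axis=0)      # differences along rows
    dc = np.diff(Z, axis=1)      # differences along columns
    return np.sum(dr ** 2) + np.sum(dc ** 2)

# A single unit jump in each direction contributes 1 + 1 = 2 to the penalty.
Z = np.array([[1.0, 1.0],
              [1.0, 2.0]])
```

Constant depth maps incur zero penalty, and the penalty grows quadratically with any depth jump — which is exactly why it over-smooths edges, motivating the weighted variant below.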
Weighted ℓ2 regularization on spatial gradients. Even though the smoothness regularization on the inverse depth map removes some local minima and helps with convergence, it does not respect sharp edges in the depth map. To preserve sharp discontinuities in the (inverse) depth map, we use the following adaptive weighted regularization:

R_w(Z) = Σ_{i,j} [ w_{i,j}^r (z_{i+1,j} − z_{i,j})² + w_{i,j}^c (z_{i,j+1} − z_{i,j})² ],  (16)
where w^r and w^c denote weights for the row and column differences, respectively. We aim to select these weights to promote depth similarity for neighboring pixels while avoiding smoothing across sharp edges. To promote this, we select weights with exponential decay in our experiments, computed from the current depth estimate as

w_{i,j}^r = exp(−(z_{i+1,j} − z_{i,j})²/σ²),   w_{i,j}^c = exp(−(z_{i,j+1} − z_{i,j})²/σ²).  (17)
Such a weighted regularization forces pixels whose depths are within a small range of one another to be smooth and does not penalize points with a larger gap in depth (which indicates the presence of an edge). This helps preserve sharp edges in the reconstructed depth estimates. This weighting approach is analogous to the bilateral filtering approach for image denoising [37, 13].
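A sketch of the edge-preserving weighted penalty. The exact decay used in the paper is not reproduced here; exp(−(Δz/σ)²) is one plausible choice consistent with the exponential-decay description:

```python
import numpy as np

def weighted_tv_l2(Z, sigma=0.05):
    """Edge-preserving quadratic penalty: each squared difference is scaled by a
    weight that decays exponentially with the difference itself, so large jumps
    (likely edges) contribute almost nothing."""
    dr, dc = np.diff(Z, axis=0), np.diff(Z, axis=1)
    wr, wc = np.exp(-(dr / sigma) ** 2), np.exp(-(dc / sigma) ** 2)
    return np.sum(wr * dr ** 2) + np.sum(wc * dc ** 2)

# A sharp edge is barely penalized, while small variations are still smoothed.
Z_edge = np.array([[0.0, 1.0]])
Z_smooth = np.array([[0.0, 0.01]])
```

With this weighting, a unit jump receives weight exp(−400) ≈ 0, whereas a 0.01 jump keeps nearly full weight, mirroring the bilateral-filter behavior described above.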
To highlight the effect of the weighted smoothness regularization on depth, we plot the weighted quadratic function in Figure 3, where z₁ and z₂ stand for the inverse depths of neighboring pixels. We plot the weighted function for different weight values along with a standard quadratic function. The plots show that the quadratic function (without any weights) heavily penalizes large depth differences, whereas the weighted function adds only a small penalty when neighboring pixels have a large depth difference (which indicates the presence of an edge).
The regularized estimation problem for image and depth can be written in the following form:

minimize over (l, Z):  ‖y − Φ(Z) l‖₂² + λ R_w(Z).  (18)
We call this regularization approach weighted TV and solve the problem by alternately updating the inverse depth map Z and light intensity l. Pseudocode of the algorithm is presented in Algorithm 1.
ℓ1 regularization on spatial gradients. It is well known that ℓ1-norm regularization enforces sparse solutions. We add an ℓ1-based total variation norm [28] of the depth map to our optimization problem. By enforcing sparsity of the spatial gradients, the edges of the (inverse) depth map can be preserved. The ℓ1-norm-based TV regularization term is given as

R₁(Z) = ‖Δ_r Z‖₁ + ‖Δ_c Z‖₁.  (19)
To solve the nonlinear optimization problem with ℓ1-norm regularization, we write the optimization problem as

minimize over (l, Z, G):  ‖y − Φ(Z) l‖₂² + λ ‖G‖₁   s.t.   G = [Δ_r Z; Δ_c Z].  (20)
We solve the problem in (20) using a split-Bregman method [16].
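For contrast with the quadratic penalty, the anisotropic ℓ1 TV term itself can be evaluated as follows; this sketches only the penalty, not the split-Bregman solver:

```python
import numpy as np

def tv_l1(Z):
    """l1-norm of the spatial gradients of the inverse depth map (anisotropic TV)."""
    return np.abs(np.diff(Z, axis=0)).sum() + np.abs(np.diff(Z, axis=1)).sum()

# Unlike the quadratic penalty, l1 TV charges a sharp edge the same total amount
# as a gradual ramp with the same endpoints, so edges are not over-penalized.
ramp = np.array([[0.0, 0.5, 1.0]])
edge = np.array([[0.0, 0.0, 1.0]])
```

This invariance to how a fixed total jump is distributed is the standard reason ℓ1 TV preserves piecewise-constant structure.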
IV. Simulations and Results
IV-A Simulation Setup
To validate the performance of the proposed algorithm, we simulate a lensless imaging system using a binary planar mask with a separable maximum-length sequence (MLS) pattern [23] placed 4mm away from a planar sensor array. We used an MLS sequence of length 1024 and converted all the −1s to 0s to create a separable binary pattern. We used square mask features, each of which is 30µm wide. Since we require the mask to be differentiable, we cannot use a strictly binary mask pattern; therefore, we convolved the binary pattern with a Gaussian blur kernel of length 15µm and standard deviation 5. The sensor contains square pixels, each of which is 50µm wide. The chief ray angle of each sensor pixel is fixed. We assume that no noise is added to the sensor measurements. In our experiments for continuous depth estimation, we fix all the parameters to these default values and analyze the performance with respect to a single parameter.
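A sketch of generating an MLS-style binary pattern with a linear-feedback shift register and smoothing it for differentiability. Note that a degree-10 m-sequence has period 2¹⁰ − 1 = 1023; the LFSR taps and blur width below are illustrative, not the paper's exact parameters:

```python
import numpy as np

def mls1023():
    """Binary maximum-length sequence from a 10-bit Fibonacci LFSR with
    feedback a[t+10] = a[t] XOR a[t+7] (primitive trinomial), period 1023."""
    lfsr, seq = 1, []
    for _ in range(1023):
        seq.append(lfsr & 1)
        bit = ((lfsr >> 0) ^ (lfsr >> 7)) & 1
        lfsr = (lfsr >> 1) | (bit << 9)
    return np.array(seq, dtype=float)

s = mls1023()

# Smooth the binary pattern with a normalized Gaussian kernel so that the mask
# function is differentiable, analogous to the blur described in the setup.
k = np.exp(-0.5 * (np.arange(-7, 8) / 2.5) ** 2)
smooth = np.convolve(s, k / k.sum(), mode="same")
```

An m-sequence of this degree is balanced (512 ones, 511 zeros), which is the property that makes MLS masks well-conditioned for deconvolution.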
IV-B Reconstruction of a Single Plane
To understand various parameters of our method and their effect on depth estimation, we perform a simple simulation experiment in which the scene consists of a single plane and our goal is to recover the correct depth accurately. In other words, the depth map is parameterized by a scalar z instead of a matrix of the same size as the intensity l. The estimation of the single depth parameter does not involve the regularization, so we solve (6) via alternating minimization using Algorithm 1 without the regularization term. We test the performance of our optimization program using a USAF target image placed at three different depths in the scene: 10cm, 1m, and 10m. We test two different initialization schemes for comparison. (1) The initial value of z is selected to set the depth at 1km (i.e., z ≈ 1) for all the experiments. (2) We choose a depth plane from 10 candidate depths that are uniformly sampled in z to cover an effective depth range from 9cm to 1km; for a given measurement, we select the candidate that provides the smallest loss in (6). To separate the effects of the image estimate and the depth estimate, we test two cases: one in which the image intensity is an optimization variable updated iteratively, and one in which the image intensity is fixed to its original value. Since the problem in (6) is nonconvex in z even if we fix the intensity l, there is no guarantee that we can estimate the correct value of z.
We report the results for 12 experiments (image at three different depths, with two different initializations, and with known or unknown image intensity) in Figure 4. Figures 4(a) and (b) plot the loss function at the estimated depth at every iteration of the refinement step in Algorithm 1 for the cases when the initial depth is set to 1km and when the initial depth is selected out of 10 candidate depths, respectively. From the curves in Figure 4(a), we observe that we can estimate the depth accurately in all three cases when the image intensities are known (dashed lines). If we jointly estimate the depth and image intensities, the initial value of depth plays a critical role (solid lines). We observe that the intensity and depth are recovered correctly for the scenes at 1m and 10m, but the algorithm fails to recover the correct depth for the scene at 10cm. This is mainly because the initial value of z is very far from the true value, and the algorithm is more likely to get stuck in a local minimum. Figure 4(c) shows the reconstructed image when the original scene is 10cm away, the initial depth is set to 1km, and we estimate the image intensity and depth by solving (6). On the other hand, when we pick an initial depth close to the true depth (out of 10 candidates, using a greedy approach), as shown in Figure 4(b), the algorithm converges to the true depth and recovers the correct image intensities as well.


IV-C Reconstruction of Scenes with Continuous Depth
Depth datasets
We performed all our experiments on 3D scenes created using light intensities and depth information from the Middlebury [32], Make3D [30, 29], and NYU Depth [25] datasets. The test scenes and their depth ranges are listed in Table I.
Test datasets   Min depth (m)   Max depth (m)
Sword           0.65            0.95
Cones           0.99            1.70
Playtable       1.47            3.75
Corner          3.93            10.60
Whiteboard      1.08            2.90
Playroom        1.62            2.93
Moebius         0.74            1.23
Books           0.73            1.27
Initialization via greedy method
Let us further discuss our simulation setup using the Cones scene, for which the results are presented in Figure 5. We simulated the 3D scene using depth data from the Middlebury dataset [32]. We sample the scene at uniform angles to create an image and its (inverse) depth map of the same size. We can compute the physical depth from z using (2). In our simulation, the depth of this scene ranges from around 0.99m to 1.7m. We used the depth pursuit greedy algorithm in [3] as our initialization method. We selected 15 candidate depths by uniformly sampling the inverse depth values from 0.996 to 0.9976, which gives an effective depth range close to that of the original scene. Since we are trying to gauge the performance for off-the-grid depth estimation, the candidate values of z are not exactly the same as the true values of z in our simulations. The output of the initialization algorithm is then fed into the alternating gradient descent method.
Performance metrics
We evaluate the performance of the recovered image intensity and depth independently of each other. We report the peak signal-to-noise ratio (PSNR) of the estimated intensity images and the root mean squared error (RMSE) of the estimated depth maps for all our experiments. The image intensity and depth map estimates for the initialization and our proposed weighted TV method are shown in Figure 5, along with the PSNR and RMSE. We observe that both the image and depth estimates from the greedy method [3] contain several spikes because of the model mismatch with the predefined depth grid. In contrast, many of these spikes are removed in the estimates from the proposed algorithm with weighted TV, while the edges are preserved.
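The two metrics can be computed as follows (peak value is an assumption; images here are normalized to [0, 1]):

```python
import numpy as np

def psnr(est, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB between estimated and reference images."""
    mse = np.mean((est - ref) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def rmse(est, ref):
    """Root mean squared error between estimated and reference depth maps."""
    return np.sqrt(np.mean((est - ref) ** 2))

# A uniform error of 0.1 on a unit-peak image gives MSE = 0.01, i.e., 20 dB PSNR.
ref = np.zeros((4, 4))
est = ref + 0.1
```
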
Comparison of regularization methods
Here we present a comparison between three different regularization approaches. We reconstruct the image intensity and (inverse) depth map from the same measurements with ℓ2 TV, weighted ℓ2 TV, and ℓ1 TV regularization. The results are shown in Figure 6. Compared to the ℓ2 TV method, we observe that both weighted ℓ2 TV and ℓ1 TV preserve the sharp edges in the image and depth estimates. Overall, in our experiments, weighted ℓ2 TV provided the best results; therefore, we use it as our default method in the rest of the paper.
IV-D Effects of Noise
Sensor noise is present in any real observation process. The amplitude of the noise depends on the intensities of the sensor measurements and can adversely affect the reconstruction results. To investigate the effect of noise on our algorithm, we present simulation results for reconstruction of scenes from the same sensor measurements under different levels of additive white Gaussian noise. The experiments are performed on the multiple 3D scenes listed in Table I. Some examples of reconstruction with different levels of noise are shown in Figure 7.
The plots of PSNR of the image intensities and RMSE of the depth maps over a range of measurement SNR values are presented in Figure 8. We observe from the curves that the quality of both the estimated image and depth improves when the measurements have little noise (high SNR) and degrades as we add more noise (low SNR). Another observation is that scenes that are farther away have higher RMSE. This is understandable because, as the scene moves farther away, the z values of the scene pixels all get very close to 1 and we cannot resolve fine depth variations in the scene.
IV-E Number of Sensor Measurements
In this experiment, we evaluate the performance of our algorithm as we increase or decrease the number of sensor measurements. The depth estimation problem we are solving is highly ill-posed because of the existence of a nontrivial null space of the system matrix and the nonlinear dependence of the measurements on the depth parameters. Adding more measurements helps improve the solution by adding more constraints on the feasible solutions.
We perform experiments with different numbers of sensor pixels while the size of each pixel is fixed at 50µm. We do not add any noise in these experiments to avoid randomness that could affect the comparison. Increasing or decreasing the number of pixels is equivalent to increasing or decreasing the sensor area. Therefore, when we increase the number of sensor pixels (equivalently, the sensor area), the baseline of the sensor also increases, which helps in resolving the depth more accurately. The results are presented in Figure 9 for three sensor sizes. We observe that the quality of both the image and depth improves as we use more sensor pixels.
IV-F Size of Sensor
In conventional disparity-based depth estimation methods [17], the quality of the reconstructed depth depends on the disparity between frames captured from multiple camera views: a larger distance between viewing positions yields better depth estimation accuracy. In a lensless imaging system, we can think of each pinhole on the mask and the sensor area behind the mask as a tiny pinhole camera. The analogy only goes this far, because we do not record images from these tiny pinhole cameras separately; instead, we record a multiplexed version of all the views. The disparity between different points on the sensor, however, does affect our ability to resolve the depth of the scene, and it is determined by the size of the sensor.
To analyze the effect of disparity in our system, we performed experiments with three different sensor pixel sizes: 25µm, 50µm, and 100µm. For a fair comparison, the number of sensor pixels and other parameters are set to the default settings described earlier. No noise is included in this experiment. Results in terms of reconstructed image and depth maps are presented in Figure 10, where we observe that the quality of the depth reconstruction improves as we increase the size of the sensor pixels. The results in Figures 10 and 9 demonstrate that increasing the disparity of the viewing points improves the depth reconstruction quality.
IV-G Comparison with Existing Methods
Finally, we present a comparison of our proposed algorithm with two other methods for 3D recovery with lensless cameras. In our method, we estimate the light intensity and a depth map over a continuous domain. The greedy method in [3] also estimates intensity and depth separately, but the depth at any angle is restricted to one of the predetermined planes. Three-dimensional recovery using lensless cameras for 3D fluorescence microscopy was presented in [2] and [1], which estimate the entire 3D volume of the scene sampled over a predetermined 3D grid. Since the unknown volumetric scene in microscopy is often very sparse, the 3D scene recovery problem is solved as a sparse recovery problem for the light intensity over all the grid voxels. The result is a light distribution over the entire 3D space. We call this method 3D Grid and use the code provided in [2] to solve the 3D recovery problem using the forward model and measurements from our simulation setup.
The scenes studied in [1] and [2] are mostly transparent and contain multiple point light sources at different depths along the same angle. This differs from the natural scenes we are testing, where objects are usually opaque and block light from objects behind them. We could model such scenes by requiring that only one voxel along any angle be nonzero; however, that is a nonconvex constraint, and enforcing it would require a heuristic similar to the one in [3]. For the sake of comparison, we solve the norm-based sparse recovery problem as described in [2] and then pick the point with the maximum light intensity at each angle to form the reconstructed image and (inverse) depth map.
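The per-angle extraction step described above can be sketched as follows; the function name `collapse_volume` and the (angle, angle, depth) array layout are our own illustrative choices, not taken from the code of [2].

```python
import numpy as np

def collapse_volume(volume, depths):
    """Collapse a reconstructed 3D grid to a 2D image and inverse depth map
    by keeping the brightest voxel along the depth axis at each angle.

    volume: (H, W, D) array of nonnegative intensities on the 3D grid
    depths: (D,) depth value of each grid plane
    """
    idx = np.argmax(volume, axis=-1)                               # (H, W)
    image = np.take_along_axis(volume, idx[..., None], axis=-1)[..., 0]
    inverse_depth = 1.0 / depths[idx]
    return image, inverse_depth
```

Angles where the recovered volume is entirely zero fall back to the first depth plane; a practical implementation would mask those pixels out before comparing depth maps.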
A comparison of the different recovery methods under the same imaging setup is shown in Figure 11. For each scene, we reconstruct from the same measurements using all three methods. We observe that our proposed algorithm offers a significant improvement over the existing methods on all the test scenes.
V Conclusion
We presented a new algorithm to jointly estimate the image and depth of a scene from a single snapshot of a mask-based lensless camera. Existing methods for 3D lensless imaging either estimate the scene over a predefined 3D grid (which is computationally expensive) or over a finite number of candidate depth planes (which provides only a coarse depth map). We represent the scene as an intensity map sampled at uniform angles and a depth map defined on a continuous domain, which allows us to estimate a variety of scenes with different depth ranges using the same formulation. We jointly estimate the image intensity and the depth map by solving a nonconvex problem. We initialize our estimates using a greedy method and add weighted regularization to enforce smoothness in the depth estimate while preserving sharp edges. We demonstrated with extensive simulations that our proposed method can recover image and depth with high accuracy for a variety of scenes. We evaluated the performance of our method under different noise levels, sensor sizes, and numbers of sensor pixels and found it to be robust. Finally, we presented a comparison with existing methods for lensless 3D imaging and demonstrated that our method provides significantly better results.
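The alternating scheme summarized above can be sketched in a few lines; this skeleton is ours and omits the actual lensless forward model, gradient expressions, regularizers, and the inner-solver details, so it should be read as an illustration of the alternating structure only.

```python
import numpy as np

def alternating_descent(grad_I, grad_Z, I0, Z0,
                        step_I=0.1, step_Z=0.1, iters=200):
    """Minimal alternating gradient descent on a nonconvex joint
    objective f(I, Z): update the intensity map I with the depth map Z
    held fixed, then update Z with I held fixed, and repeat."""
    I = np.array(I0, dtype=float)
    Z = np.array(Z0, dtype=float)
    for _ in range(iters):
        I = I - step_I * grad_I(I, Z)   # intensity step, depth fixed
        Z = Z - step_Z * grad_Z(I, Z)   # depth step, intensity fixed
    return I, Z
```

On a toy decoupled quadratic objective such as f(I, Z) = (I - 3)² + (Z - 5)², the iterates converge to the minimizer; in the joint image-and-depth problem the two subproblems are coupled through the forward model, which is what makes the overall objective nonconvex.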












References
 [1] (2017) Single-frame 3D fluorescence microscopy with ultra-miniature lensless FlatScope. Science Advances 3(12).
 [2] (2018) DiffuserCam: lensless single-exposure 3D imaging. Optica 5(1), pp. 1–9.
 [3] (2017) Toward depth estimation using mask-based lensless cameras. In Signals, Systems, and Computers, 2017 51st Asilomar Conference on, pp. 1467–1470.
 [4] (2017) FlatCam: thin, lensless cameras using coded aperture and computation. IEEE Transactions on Computational Imaging 3(3), pp. 384–397.
 [5] (2010) Model-based compressive sensing. IEEE Transactions on Information Theory 56(4), pp. 1982–2001.
 [6] (2016) Lensless imaging: a computational renaissance. IEEE Signal Processing Magazine 33(5), pp. 23–35.
 [7] (2015) The alternating descent conditional gradient method for sparse inverse problems. In IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 57–60.
 [8] (1998) Uniformly redundant arrays. Experimental Astronomy 8(2), pp. 97–123.
 [9] (1980) Coded aperture imaging: many holes make light work. Optical Engineering 19, pp. 283.
 [10] (2012) The convex geometry of linear inverse problems. Foundations of Computational Mathematics 12(6), pp. 805–849.
 [11] (2019) Deep optics for monocular depth estimation and 3D object detection. In Proc. IEEE ICCV.
 [12] (2008) Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine 25(2), pp. 83–91.
 [13] (2002) Fast bilateral filtering for the display of high-dynamic-range images. ACM Trans. Graph. 21(3), pp. 257–266.
 [14] (1978) Coded aperture imaging with uniformly redundant arrays. Appl. Opt. 17(3), pp. 337–347.
 [15] (2004) A time-of-flight depth sensor: system description, issues and solutions. In Conference on Computer Vision and Pattern Recognition Workshop, pp. 35–35.
 [16] (2009) The split Bregman method for L1-regularized problems. SIAM J. Img. Sci. 2(2), pp. 323–343.
 [17] (2003) Multiple view geometry in computer vision. Cambridge University Press.
 [18] (2013) Low-budget transient imaging using photonic mixer devices. ACM Transactions on Graphics (ToG) 32(4), pp. 45.
 [19] (2007) Image and depth from a conventional camera with a coded aperture. ACM Trans. Graph. 26(3).
 [20] (2018) All-optical machine learning using diffractive deep neural networks. Science 361(6406), pp. 1004–1008.
 [21] (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1), pp. 503–528.
 [22] (2012) Adaptive-weighted total variation minimization for sparse data toward low-dose X-ray computed tomography image reconstruction. Physics in Medicine & Biology 57(23), pp. 7923.
 [23] (1976) Pseudo-random sequences and arrays. Proceedings of the IEEE 64(12), pp. 1715–1729.
 [24] (2020) Analysis of diffractive optical neural networks and their integration with electronic neural networks. IEEE Journal of Selected Topics in Quantum Electronics 26(1), pp. 1–14.
 [25] (2012) Indoor segmentation and support inference from RGBD images. In ECCV.
 [26] (2010) CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Commun. ACM 53(12), pp. 93–100.
 [27] (2013) External mask based depth and light field camera. In 2013 IEEE International Conference on Computer Vision Workshops, pp. 37–44.
 [28] (1992) Nonlinear total variation based noise removal algorithms. Phys. D 60(1–4), pp. 259–268.
 [29] (2009) Make3D: learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), pp. 824–840.
 [30] (2006) Learning depth from single monocular images. In Advances in Neural Information Processing Systems 18, pp. 1161–1168.
 [31] (1973) Fresnel zone plate imaging in radiology and nuclear medicine. Optical Engineering 12(1).
 [32] (2001) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision, pp. 131–140.
 [33] (2005) minFunc: unconstrained differentiable multivariate optimization in MATLAB.
 [34] (2006) A new compressive imaging camera architecture using optical-domain compression. Proc. SPIE 6065.
 [35] (2014) Joint sparse recovery method for compressed sensing with structured dictionary mismatches. IEEE Transactions on Signal Processing 62(19), pp. 4997–5008.
 [36] (2013) Compressed sensing off the grid. IEEE Transactions on Information Theory 59(11), pp. 7465–7490.
 [37] (1998) Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision, pp. 839–846.
 [38] (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory 53(12), pp. 4655–4666.
 [39] (2007) Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing. In ACM Transactions on Graphics (TOG), Vol. 26, pp. 69.
 [40] (2019) PhaseCam3D: learning phase masks for passive single view depth estimation. In 2019 IEEE International Conference on Computational Photography (ICCP), pp. 1–12.
 [41] (2013) Off-grid direction of arrival estimation using sparse Bayesian inference. IEEE Transactions on Signal Processing 61(1), pp. 38–43.
 [42] (2018) Analysis and optimization of aperture design in computational imaging. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4029–4033.
 [43] (2006) Lensless imaging with a controllable aperture. In IEEE Computer Vision and Pattern Recognition, Vol. 1, pp. 339–346.