Joint Image and Depth Estimation with Mask-Based Lensless Cameras


Yucheng Zheng and M. Salman Asif
Y. Zheng and M. S. Asif are with the Department of Electrical and Computer Engineering, University of California, Riverside, CA 92521 USA (e-mail: yzhen069@ucr.edu; sasif@ece.ucr.edu).
Abstract

Mask-based lensless cameras replace the lens of a conventional camera with a customized mask. These cameras can potentially be very thin and even flexible. Recently, it has been demonstrated that such mask-based cameras can recover the light intensity and depth information of a scene. Existing depth recovery algorithms either assume that the scene consists of a small number of depth planes or solve a sparse recovery problem over a large 3D volume. Both of these approaches fail to recover scenes with large depth variations. In this paper, we propose a new approach for depth estimation based on an alternating gradient descent algorithm that jointly estimates a continuous depth map and the light distribution of the unknown scene from its lensless measurements. The computational complexity of the algorithm scales linearly with the spatial dimension of the imaging system. We present simulation results on image and depth reconstruction for a variety of 3D test scenes. A comparison between the proposed algorithm and other methods shows that our algorithm is faster and more robust for natural scenes with a large range of depths.

Index Terms: Lensless imaging, FlatCam, depth estimation, non-convex optimization, alternating minimization.
(a) 1D imaging model for a planar sensor with a coded mask placed at distance d. Light rays from a light source at angle θ and depth z are received by all the sensor pixels. A light ray that hits sensor pixel x passes through the mask at location u.
(b) An overview of the proposed intensity and depth estimation framework. Consider a natural scene as a 3D point cloud, where each point represents a light source located at a different depth. The camera consists of a fixed, coded mask placed on top of an image sensor. Every point in the scene casts a shadow of the mask on the sensor plane. Each sensor pixel records a linear combination of the scene modulated by the mask pattern. The recovery algorithm consists of two steps: (1) initialization using a greedy depth selection method, and (2) an alternating gradient descent-based refinement algorithm that jointly estimates the light distribution and depth map on a continuous domain.
Fig. 1: A coded mask-based imaging model and an overview of the proposed continuous depth estimation framework.

I Introduction

Depth estimation is an important and challenging problem that arises in a variety of applications including computer vision, robotics, and autonomous systems. Existing depth estimation systems use stereo pairs of conventional (lens-based) cameras or time-of-flight sensors [17, 15, 18]. These cameras can be heavy and bulky and require a large space for installation. Therefore, their adoption in portable and lightweight devices with strict physical constraints is still limited.

In this paper, we propose a joint image and depth estimation framework for a computational lensless camera that consists of a fixed, binary mask placed on top of a bare sensor. Such mask-based cameras offer an alternative design for building cameras without lenses. A recent example of a mask-based lensless camera is the FlatCam [4]. In contrast with a lens-based camera, which is designed to map every point in the scene to a single pixel on the sensor, every sensor pixel in a FlatCam records light from every point in the scene. A single point source in the scene casts a shadow of the mask on the sensor; the shadow shifts if the point moves parallel to the sensor plane and expands or shrinks if the point source moves toward or away from the sensor plane. The measurements recorded on the sensor thus represent a superposition of shifted and scaled versions of the mask shadows corresponding to light sources at different directions and depths. Image and depth information about the scene is therefore encoded in the measurements, and we can solve an inverse problem to estimate both.

Joint estimation of intensity and depth is a nonconvex problem. In fact, estimation of depth by itself, even with known scene intensity, is a nonconvex problem. To jointly estimate depth and light distribution, we propose a two-step approach that consists of an initialization step and an alternating gradient descent step to minimize our objective. To preserve sharp edges in the image intensity and depth map, we include an adaptive regularization penalty in our objective function. An overview of the reconstruction framework is illustrated in Figure 1. We initialize the estimates of image intensity and depth using a greedy algorithm proposed in [3]. We then refine the estimates by minimizing an objective function with respect to image intensity and depth via alternating gradient descent. To simplify the recovery algorithm, we assume that the mask pattern is differentiable everywhere. We use adaptive weights to add smoothness regularization on the intensity and depth estimates [22]. Even though the problem of joint estimation of intensity and depth is nonconvex, we observed that a simple regularization makes the algorithm robust against local minima of the loss function and improves its performance.

The key contributions of this paper are as follows.


  • We propose a new computational framework for joint estimation of light intensity and depth map from a single image of a mask-based lensless camera. In contrast to other methods, our method estimates the depth map on a continuous domain. Our algorithm consists of a careful initialization step based on greedy pursuit and an alternating minimization step based on gradient descent.

  • The problem of joint image and depth recovery is highly nonconvex. To tackle this issue, we present different regularization schemes that offer robust recovery on a diverse dataset.

  • We present simulation results on standard 3D datasets and demonstrate a significant improvement over existing methods for 3D imaging with coded mask-based lensless cameras.

II Related Work

A pinhole camera, also known as camera obscura, is the simplest example of a mask-based lensless camera. Even though a pinhole can easily form an image of the scene on a sensor plane, the image quality is often severely affected by noise because the amount of light collected is limited by the pinhole aperture [42]. Coded aperture-based lensless cameras avoid this problem by increasing the number of pinholes and allowing more light to reach the sensor [14, 8, 9, 4, 6]. In contrast to a pinhole camera, where only one inverted image of the scene is formed through the single pinhole, the measurements captured through a coded mask are a linear combination of the pinhole images formed under every mask element. To recover an image of the scene, we need to solve a computational image recovery problem [14, 4].

A coded aperture system offers another advantage by encoding light from different directions and depths differently. Note that a bare sensor can provide the intensity of a light source but not its spatial location. A mask in front of the sensor encodes directional information of the source in the sensor measurements. Consider a single light source with a dark background; the image formed on the sensor will be a shadow of the mask. If we change the angle of the light source, the mask shadow on the sensor will shift. Furthermore, if we increase or decrease the depth of the light source, the width of the shadow will decrease or increase, respectively. Thus, we can represent the relationship between all the points in the 3D world and the sensor measurements as a linear system, which depends on the pattern and the placement of the mask. We can solve this system using an appropriate computational algorithm to recover the image of the scene.

The depth-dependent imaging capability of coded aperture systems has been known since the pioneering work in this domain [31, 14]. The following excerpt from [14] summarizes it well: “One can reconstruct a particular depth in the object by treating the picture as if it was formed by an aperture scaled to the size of the shadow produced by the depth under consideration.” However, the classical methods usually assume that the scene consists of a single plane at a known depth. In this paper, we assume that the depth map is arbitrarily distributed on a continuous domain and that the true depth map is unknown at the time of reconstruction.

The 3D lensless imaging problem has also been studied recently in [2, 3, 1, 6]. These methods can broadly be divided into two categories. In the first category, the 3D scene is divided into a finite number of voxels. To recover the 3D light distribution, these methods solve an ℓ1 norm-based recovery problem under the assumption that the scene is very sparse [2, 1]. In the second category, the 3D scene is divided into an intensity map and multiple depth planes such that each pixel is assigned one intensity and one depth. To solve the intensity and depth recovery problem, these methods either sweep through the depth planes [6] or assign a depth to each pixel using a greedy method [3]. Our proposed method belongs to the second category, in which we model the image intensity and depth separately and assume that the depth values of the scene are distributed on a continuous domain. To recover the 3D scene, we jointly estimate the image intensity and depth map from the available sensor measurements.

Joint estimation of image intensity and depth map can be viewed as a nonlinear inverse problem in which the sampling function depends on the scene depth. Similar inverse problems also arise in many other fields, such as direction-of-arrival estimation in radar [35], super resolution [7], and compressed sensing [38, 26, 5]. As in joint estimation of image intensity and depth, the solution approaches to these problems consist of two main steps: identification of signal bases and estimation of signal intensities based on the identified bases. The problem of identifying the signal bases from continuously varying candidates is often called off-the-grid signal recovery. The methods for solving off-the-grid signal recovery problems can be divided into two main types. The first approach formulates the problem as a convex program on a continuous domain and solves it using atomic norm minimization [10, 36]. The second approach linearizes the problem with respect to the continuous optimization parameter using a first-order approximation at every iteration [7, 41]. Our proposed algorithm is inspired by the second approach.

Mask-based lensless cameras have traditionally been used for imaging light at wavelengths beyond the visible spectrum [8, 9]. Other examples related to mask-based cameras include controllable apertures and coded masks for compressed sensing and computational imaging [34, 43], the single-pixel camera [12], and external mask settings [27].

Coded masks have also recently been used with conventional lens-based cameras to estimate depth and light fields [19, 39]. Recently, a number of data-driven methods have been proposed to design custom phase masks and optical elements to estimate depth from a single image [40, 11]. An all-optical diffractive deep neural network was proposed in [20, 24], which can perform pattern recognition tasks such as handwritten digit classification using optical mask layers. Such networks process information optically at extremely high speed and with near-zero energy cost.

III Methods

III-A Imaging Model

We divide the 3D scene under observation into uniformly spaced angular directions. We use θ and φ to denote the angular directions of a light source with respect to the center of the sensor. The intensity and depth of the light source are denoted by l(θ, φ) and z(θ, φ), respectively. Figure 1(a) depicts the geometry of such an imaging model. A planar coded mask is placed on top of a planar sensor array at distance d. The sensor array captures light coming from the scene modulated by the coded mask.

Every light source in the scene casts a shadow of the mask on the sensor array, which we denote using basis functions ψ_{θ,φ}. We use x and y to index a pixel on the rectangular sensor array. The shadow cast by a light source with unit intensity at (θ, φ) can be represented as the following basis or point spread function:

    ψ_{θ,φ}(x, y; r) = m(r·x + d·tanθ, r·y + d·tanφ),    (1)

where m(u, v) denotes the transmittance of the mask pattern at location (u, v) on the mask plane, and r is a variable that is related to the physical depth z with the following inverse relation:

    r = 1 - d/z.    (2)

If the 3D scene consists of only a single point source at (θ, φ) with light intensity l, the measurement captured at sensor pixel (x, y) would be

    b(x, y) = l · ψ_{θ,φ}(x, y; r).    (3)

The measurement recorded on any sensor pixel is the summation of contributions from each of the point sources in the 3D scene. The imaging model for a single sensor pixel can be represented by

    b(x, y) = Σ_{θ,φ} l(θ, φ) ψ_{θ,φ}(x, y; r(θ, φ)).    (4)

We can write the imaging model for all the sensors in a compact form as

    b = Φ(r) l + e,    (5)

where b is a vectorized form of the matrix of sensor measurements, l is a vectorized form of the matrix of light intensities at all the angular locations (θ, φ), and Φ(r) is a matrix whose columns are the basis functions corresponding to all the scene pixels. The basis functions in (5) are parameterized by the unknown inverse depth map r, and e denotes noise and other nonidealities in the system.
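
A minimal 1D sketch of this forward model may help make (1)-(5) concrete. The code below simulates, under a separable-mask assumption, the shadow cast by each scene point and sums the shadows into a sensor measurement; all names (psf_column, mask_1d, sensor_x, thetas) and the random stand-in pattern are illustrative assumptions rather than the actual simulation code.

```python
# Illustrative 1D forward model for a mask-based lensless camera (sketch, not the
# authors' implementation). The mask profile here is a random binary stand-in.
import numpy as np

d = 4e-3                                              # mask-to-sensor distance (4 mm)
sensor_x = (np.arange(-256, 256) + 0.5) * 50e-6       # sensor pixel centers, 50 um pitch
mask_u = np.arange(-1024, 1024) * 15e-6               # fine grid on the mask plane
mask_1d = np.random.randint(0, 2, mask_u.size).astype(float)  # stand-in binary pattern

def psf_column(theta, z):
    """Shadow of the 1D mask cast on the sensor by a unit source at angle theta, depth z."""
    r = 1.0 - d / z                                   # inverse-depth variable, as in (2)
    u = r * sensor_x + d * np.tan(theta)              # mask location hit by each ray, as in (1)
    return np.interp(u, mask_u, mask_1d)              # sample the mask transmittance

# Sensor measurements are a superposition of scaled and shifted mask shadows, as in (4)-(5).
thetas = np.deg2rad(np.linspace(-20, 20, 64))         # scene directions
depths = np.linspace(0.99, 1.7, thetas.size)          # continuously varying scene depth (m)
intensities = np.random.rand(thetas.size)             # scene light distribution l
Phi = np.stack([psf_column(t, z) for t, z in zip(thetas, depths)], axis=1)
b = Phi @ intensities                                 # noiseless measurements b = Phi(r) l
```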

We can jointly estimate the light distribution l and the inverse depth map r (r has an inverse relation with the depth map, as in (2); therefore, we refer to it as the inverse depth map throughout the paper) using the following optimization problem:

    (l̂, r̂) = argmin_{l, r}  (1/2) ||b - Φ(r) l||₂².    (6)

Note that if we know the true value of r (or fix it to some value), then the problem in (6) reduces to a linear least-squares problem that can be solved efficiently with standard solvers. On the other hand, if we fix the value of l, the problem remains nonlinear with respect to r. In the next few sections we discuss our approach for solving the problem in (6) via alternating minimization.

III-B Initialization

Since the minimization problem in (6) is not convex, a proper initialization is often needed to ensure convergence to a local minimum close to the optimal point. A naïve approach is to initialize all the point sources in the scene at the same depth plane. To select an initial depth plane, we sweep through a set of candidate depth planes and perform image reconstruction on one depth plane at a time by solving the following linear least-squares problem:

    l̂(r_c) = argmin_l ||b - Φ(r_c 1) l||₂²,    (7)
where r_c denotes the candidate inverse depth assigned to all the scene pixels.

We evaluate the loss value for all the candidate depth planes and pick the one with the smallest loss as our initial depth. The mask basis function in (1) changes as we change r, which has an inverse relation with the scene depth. We select candidate depths corresponding to uniformly sampled values of r, which yields a nonuniform sampling of the physical scene depth. This single-depth initialization is computationally simple and provides a reasonable initialization of the light distribution, especially when the scene is far from the sensor; a sketch of the sweep is shown below.
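
To make the single-plane initialization concrete, the sketch below sweeps a set of candidate inverse depths, solves the linear least-squares problem in (7) for each, and keeps the plane with the smallest residual. It reuses the helpers (psf_column, thetas, b) from the forward-model sketch above; the candidate range corresponds to the 9cm-to-1km range used in our single-plane experiments.

```python
# Single-plane initialization sketch: sweep candidate depth planes and keep the best fit.
import numpy as np

def build_plane_Phi(r_value, thetas, d=4e-3):
    """Basis matrix when every scene pixel lies on one plane with inverse depth r_value."""
    z = d / (1.0 - r_value)
    return np.stack([psf_column(t, z) for t in thetas], axis=1)

candidate_r = np.linspace(0.9556, 0.999996, 10)       # roughly 9 cm to 1 km with d = 4 mm
best = None
for r_c in candidate_r:
    Phi_c = build_plane_Phi(r_c, thetas)
    l_c, *_ = np.linalg.lstsq(Phi_c, b, rcond=None)   # linear least squares, as in (7)
    loss = 0.5 * np.sum((b - Phi_c @ l_c) ** 2)
    if best is None or loss < best[0]:
        best = (loss, r_c, l_c)

loss0, r0, l0 = best                                  # initial depth plane and intensities
```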

Our second approach for initialization is the greedy method proposed in [3]. Greedy algorithms are widely used for sparse signal recovery [38, 26, 5]. Building on these algorithms, [3] proposed a greedy depth pursuit algorithm for depth estimation with FlatCam [4]. The algorithm works by iteratively updating the depth surface that best matches the observed measurements.

The depth pursuit method assumes that the scene consists of a finite number of predefined depth planes. We start by initializing all the pixels at a single depth plane and estimating the light intensities based on the initialized depth map. The first step is to select new candidate depth values; the new candidates are selected using the basis vectors that are most correlated with the current residual of the estimate. In the second step, the new depth candidates are appended to the current estimate and we solve a least-squares problem over the appended set of basis vectors. In the third step, we prune the candidates by keeping, for every pixel, the depth value whose corresponding intensity estimate has the largest magnitude. Although this method may not estimate off-grid point sources well, it produces a good preliminary estimate of the scene; a simplified sketch of one pass is shown below.
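
The sketch below illustrates one pass of this select-append-prune idea on the 1D model, reusing psf_column, thetas, b, and the single-plane initialization (r0, l0) from the earlier sketches. It is a loose, simplified illustration of the flavor of the depth pursuit algorithm in [3], not the authors' implementation.

```python
# One simplified pass of a greedy depth-pursuit style update (sketch).
import numpy as np

d = 4e-3
depth_planes = np.linspace(0.9, 1.8, 15)              # predefined candidate depth planes (m)
r_est = np.full(thetas.size, r0)                      # current inverse-depth estimate
l_est = l0.copy()                                     # current intensity estimate

def basis(theta, r):
    return psf_column(theta, d / (1.0 - r))

Phi_cur = np.stack([basis(t, r) for t, r in zip(thetas, r_est)], axis=1)
residual = b - Phi_cur @ l_est

# Step 1: for each angle, propose the candidate plane most correlated with the residual.
new_r = np.empty_like(r_est)
for i, t in enumerate(thetas):
    cand = np.stack([basis(t, 1.0 - d / z) for z in depth_planes], axis=1)
    new_r[i] = 1.0 - d / depth_planes[np.argmax(np.abs(cand.T @ residual))]

# Step 2: append the proposed bases and solve a least-squares problem over both sets.
Phi_new = np.stack([basis(t, r) for t, r in zip(thetas, new_r)], axis=1)
l_aug, *_ = np.linalg.lstsq(np.concatenate([Phi_cur, Phi_new], axis=1), b, rcond=None)

# Step 3: prune, keeping per angle the depth whose coefficient has the larger magnitude.
keep_new = np.abs(l_aug[thetas.size:]) > np.abs(l_aug[:thetas.size])
r_est = np.where(keep_new, new_r, r_est)
l_est = np.where(keep_new, l_aug[thetas.size:], l_aug[:thetas.size])
```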

III-C Refinement via Alternating Gradient Descent

To solve the minimization problem in (6), we start with the preliminary image and depth estimates from the initialization step and alternately update the depth and light distribution via gradient descent. The main computational task in the gradient descent method is computing the gradient of the loss function with respect to r. To compute that gradient, we expand the loss function in (6) as

    L(l, r) = (1/2) Σ_{x,y} ( b(x, y) - Σ_{θ,φ} l(θ, φ) ψ_{θ,φ}(x, y; r(θ, φ)) )².    (8)

We define ρ(x, y) = b(x, y) - Σ_{θ,φ} l(θ, φ) ψ_{θ,φ}(x, y; r(θ, φ)) as the residual approximation error at sensor location (x, y). The derivative of the loss function with respect to r(θ, φ) is given as

    ∂L/∂r(θ, φ) = - Σ_{x,y} ρ(x, y) l(θ, φ) ∂ψ_{θ,φ}(x, y; r(θ, φ))/∂r.    (9)

We compute the derivative of the basis function with respect to r using the total derivative (recall that the total derivative of a function f(u(r), v(r)) is df/dr = (∂f/∂u)(du/dr) + (∂f/∂v)(dv/dr)) as follows:

    ∂ψ_{θ,φ}(x, y; r)/∂r = m_u(u, v)·x + m_v(u, v)·y.    (10)

Here u = r·x + d·tanθ and v = r·y + d·tanφ denote two dummy variables that correspond to the specific location on the mask where the light ray from a point source at angle (θ, φ) and depth z to the sensor pixel at (x, y) intersects the mask plane. The terms m_u and m_v in (10) can be viewed as the derivatives of the mask pattern along the respective spatial coordinates, evaluated at (u, v). We compute these derivatives using finite differences of m over a fine grid and linear interpolation; a 1D sketch of this gradient computation is shown below.
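
The sketch below evaluates the gradient in (9)-(10) on the 1D model, reusing the mask profile (mask_1d, mask_u), sensor grid, and psf_column helper from the forward-model sketch. The mask derivative is approximated with finite differences on the fine grid and sampled by linear interpolation, as described above; the function name and structure are illustrative assumptions.

```python
# 1D sketch of the depth gradient in (9)-(10) with a finite-difference mask derivative.
import numpy as np

d = 4e-3
dmask_du = np.gradient(mask_1d, mask_u)               # finite-difference derivative of the mask

def loss_grad_r(l_est, r_est):
    """Gradient of 0.5*||b - Phi(r) l||^2 with respect to each entry of r (1D model)."""
    Phi = np.stack([psf_column(t, d / (1.0 - r)) for t, r in zip(thetas, r_est)], axis=1)
    rho = b - Phi @ l_est                             # residual at every sensor pixel
    grad = np.empty_like(r_est)
    for i, (t, r) in enumerate(zip(thetas, r_est)):
        u = r * sensor_x + d * np.tan(t)              # mask locations hit by the rays
        dpsi_dr = sensor_x * np.interp(u, mask_u, dmask_du)   # total derivative, as in (10)
        grad[i] = -l_est[i] * (rho @ dpsi_dr)         # chain rule, as in (9)
    return grad
```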

III-D Algorithm Analysis

To solve the nonlinear least-squares problem in (6), we compute the gradient derived in Section III-C and use it as the input to an optimization solver. Suppose ψ_θ and ψ_φ denote the basis function vectors evaluated on the 1D masks,

    ψ_θ[x] = m₁(r·x + d·tanθ),   ψ_φ[y] = m₂(r·y + d·tanφ).    (11)

If we use a separable mask pattern, i.e., the 2D mask pattern is the outer product of two 1D mask patterns m₁ and m₂, then the 2D basis function in (1) can be computed as the outer product of two vectors, Ψ_{θ,φ} = ψ_θ ψ_φᵀ. Similarly, we define the 1D sub-gradient functions as

    g_θ[x] = x·m₁′(r·x + d·tanθ),   g_φ[y] = y·m₂′(r·y + d·tanφ).    (12)

Similar to (10), the functions g_θ and g_φ are the sub-gradient functions along the 1D masks. They take nonzero values at locations where the mask pattern value changes and are zero elsewhere. Using the derivation in (10), the matrix that contains ∂ψ_{θ,φ}(x, y; r)/∂r at all (x, y) can be computed as the following sum of two vector outer products:

    ∂Ψ_{θ,φ}/∂r = g_θ ψ_φᵀ + ψ_θ g_φᵀ.    (13)

Using these derivations, the derivative of the loss function with respect to the inverse depth value r(θ, φ) can be computed using the following matrix multiplications, where P denotes the matrix of residuals ρ(x, y) at all sensor pixels:

    ∂L/∂r(θ, φ) = - l(θ, φ) ( g_θᵀ P ψ_φ + ψ_θᵀ P g_φ ).    (14)

Suppose we have N×N pixels on the sensor array. The computation in (14) then takes on the order of N² multiplications for each scene pixel; a small sketch of this computation is shown below. We feed our gradients to the minFunc solver [33] with the L-BFGS algorithm [21] to solve the nonlinear optimization problem in (6).
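
The sketch below implements the per-pixel computation in (14); the argument names are assumptions that mirror the 1D basis and sub-gradient vectors defined in (11)-(12), and P is the residual matrix.

```python
# Sketch of the separable depth-gradient computation in (14) for one scene pixel.
import numpy as np

def grad_r_separable(P, basis_x, basis_y, sub_x, sub_y, intensity):
    """dL/dr for one scene pixel (theta, phi), as in (14).

    P         : N x N matrix of residuals rho(x, y)
    basis_x   : 1D basis vector along x, m1(r*x + d*tan(theta)), as in (11)
    basis_y   : 1D basis vector along y, m2(r*y + d*tan(phi)), as in (11)
    sub_x     : 1D sub-gradient vector along x, x * m1'(r*x + d*tan(theta)), as in (12)
    sub_y     : 1D sub-gradient vector along y, y * m2'(r*y + d*tan(phi)), as in (12)
    intensity : light intensity l(theta, phi) of this scene pixel
    """
    return -intensity * (sub_x @ P @ basis_y + basis_x @ P @ sub_y)  # ~O(N^2) multiplications
```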

III-E Regularization Approaches

ℓ2 regularization on spatial gradients. The optimization problem in (6) is highly nonconvex and contains several local minima; therefore, the estimate often gets stuck in a local minimum and the estimated intensity and depth map are coarse. To improve the performance of our algorithm for solving the nonconvex problem in (6), we seek to exploit additional structure in the scene. A standard assumption is that the depths of neighboring pixels are usually close, which implies that the spatial differences of the (inverse) depth map are small. To incorporate this assumption in our model, we add a quadratic regularization term on the spatial gradients of the inverse depth map to our loss function. The quadratic regularization term is defined on the inverse depth map (treated as a matrix) and can be written as

    R_ℓ2(r) = ||D_h r||_F² + ||D_v r||_F²,    (15)

where the operators D_h and D_v compute spatial differences along rows and columns, respectively. We call this regularization an ℓ2 norm-based total variation (TV-ℓ2) in this paper. Figure 2 illustrates the effect of the depth regularization: the smoothness regularization improves the loss function by removing several local minima. We also observed this effect in our simulations for the high-dimensional depth recovery problem, which becomes much less sensitive to initialization once the depth regularization is added. A sketch of this penalty and its gradient is shown below.
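
The sketch below shows the quadratic smoothness penalty in (15) and its gradient, as they might be added to the data term during the descent on r; the inverse depth map R is treated as a 2D array and the function names are illustrative.

```python
# Sketch of the TV-l2 penalty (15) on an inverse depth map R and its gradient.
import numpy as np

def tv_l2(R):
    dh = np.diff(R, axis=1)                  # differences along rows
    dv = np.diff(R, axis=0)                  # differences along columns
    return np.sum(dh ** 2) + np.sum(dv ** 2)

def tv_l2_grad(R):
    g = np.zeros_like(R)
    dh = np.diff(R, axis=1)
    dv = np.diff(R, axis=0)
    g[:, :-1] -= 2 * dh                      # contribution of each row difference to its left pixel
    g[:, 1:] += 2 * dh                       # ... and to its right pixel
    g[:-1, :] -= 2 * dv                      # same bookkeeping for the column differences
    g[1:, :] += 2 * dv
    return g
```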

(a) Without smoothness regularization, the loss curve is highly nonconvex and contains several local minima.
(b) With smoothness regularization, the loss curve is smooth and several local minima are removed.
(c) Similar to the 1D case, the loss surface contains many local minima without smoothness regularization.
(d) With smoothness regularization, many local minima are removed from the loss surface.
Fig. 2: A comparison between objective loss functions without and with smoothness regularization. The inverse depth axis refers to the value of r.

Weighted ℓ2 regularization on spatial gradients. Even though the smoothness regularization on the inverse depth map removes some local minima and helps with convergence, it does not respect sharp edges in the depth map. To preserve sharp discontinuities in the (inverse) depth map, we use the following adaptive weighted regularization:

    R_w(r) = Σ_{i,j} [ w_h(i, j) (r_{i,j+1} - r_{i,j})² + w_v(i, j) (r_{i+1,j} - r_{i,j})² ],    (16)

where w_h and w_v denote weights for the row and column differences, respectively. We aim to select these weights to promote depth similarity between neighboring pixels while avoiding smoothing across sharp edges. To promote this, we select weights with exponential decay in our experiments, computed as

    w_h(i, j) = exp( -(r_{i,j+1} - r_{i,j})² / σ² ),   w_v(i, j) = exp( -(r_{i+1,j} - r_{i,j})² / σ² ).    (17)

Such a weighted regularization forces pixels whose depths are within a small range of one another to be smooth and does not penalize pixel pairs with a larger gap in depth (which indicates the presence of an edge). This helps preserve sharp edges in the reconstructed depth estimates. The weighting is analogous to the bilateral filtering approach for image denoising [37, 13]; a sketch of the weight computation is shown below.
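
A sketch of the edge-aware weight computation in (17) and the resulting weighted penalty in (16) follows; the bandwidth parameter sigma and the function names are illustrative assumptions.

```python
# Sketch of the adaptive weights (17) and the weighted TV-l2 penalty (16).
import numpy as np

def edge_aware_weights(R, sigma=1e-4):
    """Weights close to 1 for similar neighboring inverse depths, close to 0 across edges."""
    w_h = np.exp(-np.diff(R, axis=1) ** 2 / sigma ** 2)   # weights for row differences
    w_v = np.exp(-np.diff(R, axis=0) ** 2 / sigma ** 2)   # weights for column differences
    return w_h, w_v

def weighted_tv_l2(R, w_h, w_v):
    return np.sum(w_h * np.diff(R, axis=1) ** 2) + np.sum(w_v * np.diff(R, axis=0) ** 2)
```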

To highlight the effect of the weighted smoothness regularization on depth, we plot the weighted quadratic function w(r₁, r₂)·(r₁ - r₂)² in Figure 3, where w(r₁, r₂) = exp(-(r₁ - r₂)²/σ²) and r₁, r₂ stand for the inverse depths of two neighboring pixels. We plot the weighted function for different values of σ along with a plain quadratic function. The plots show that the quadratic function (without any weights) heavily penalizes large depth differences, whereas the weighted function adds only a small penalty if the neighboring pixels have a large depth difference (which indicates the presence of an edge).

The regularized estimation problem for image and depth can be written in the following form:

    (l̂, r̂) = argmin_{l, r}  (1/2) ||b - Φ(r) l||₂² + λ_l R_w(l) + λ_r R_w(r),    (18)
where R_w denotes the weighted penalty in (16) and λ_l, λ_r control the strength of the regularization on the intensity and the inverse depth map, respectively.

We call this regularization approach weighted TV-ℓ2 and solve it by alternately updating the inverse depth map r and the light intensity l. Pseudocode for the algorithm is presented in Algorithm 1.

Fig. 3: The weighted regularization function penalizes depth differences that are within a small range of one another and does not penalize differences above a certain threshold. The smooth range can be changed by tuning the parameter σ. In contrast to TV-ℓ2, the weighted TV-ℓ2 regularization term does not penalize neighboring pixels with a large depth disparity, which tends to preserve the sharpness of the edges in the depth estimate.
Input: sensor measurements b
Output: light distribution l̂ and inverse depth map r̂
Initialization via greedy algorithm:
  Compute l_0 and r_0 with the depth pursuit algorithm in [3].
Refinement via alternating gradient descent:
  for k = 1, …, K do
    r_k ← update r_{k-1} by gradient descent on (18) with l fixed at l_{k-1}
    l_k ← update l_{k-1} by gradient descent on (18) with r fixed at r_k
  end for
  return l_K and r_K
Algorithm 1: Weighted TV-ℓ2 regularized optimization
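
The sketch below shows how the refinement loop of Algorithm 1 might look on the 1D model, starting from the initialization estimates (l_est, r_est) and reusing loss_grad_r, psf_column, thetas, and b from the earlier sketches. The step size, iteration count, and regularization weight are illustrative; as noted above, in our experiments the depth update is carried out with an L-BFGS solver rather than a plain gradient step.

```python
# Sketch of the alternating refinement in Algorithm 1 (1D model, plain gradient steps).
import numpy as np

d = 4e-3
lam = 1e-3                                   # smoothness weight on r (illustrative)
step = 1e-6                                  # gradient step size for r (illustrative)

for k in range(50):
    # Depth update: one gradient step on the regularized loss with l fixed.
    g_data = loss_grad_r(l_est, r_est)
    d1 = np.diff(r_est)
    g_smooth = np.zeros_like(r_est)
    g_smooth[:-1] -= 2 * d1                  # gradient of the 1D quadratic smoothness term
    g_smooth[1:] += 2 * d1
    r_est = r_est - step * (g_data + lam * g_smooth)

    # Intensity update: with r fixed, the data term is linear least squares in l.
    Phi = np.stack([psf_column(t, d / (1.0 - r)) for t, r in zip(thetas, r_est)], axis=1)
    l_est, *_ = np.linalg.lstsq(Phi, b, rcond=None)
```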

ℓ1 regularization on spatial gradients. It is well known that ℓ1 norm regularization enforces sparse solutions. We add an ℓ1 norm-based total variation [28] of the depth map to our optimization problem. By enforcing sparsity of the spatial gradients, the edges of the (inverse) depth map can be preserved. The ℓ1 norm-based TV regularization term is given as

    R_ℓ1(r) = ||D_h r||₁ + ||D_v r||₁.    (19)

To solve the nonlinear optimization problem with the ℓ1 norm regularization, we write the optimization problem as

    (l̂, r̂) = argmin_{l, r, u}  (1/2) ||b - Φ(r) l||₂² + λ ||u||₁   s.t.   u = D r,    (20)
where D stacks the difference operators D_h and D_v.

We solve problem (20) using the split-Bregman method [16]; a sketch of the shrinkage step at its core is shown below.
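
The core of the split-Bregman iterations is a closed-form soft-thresholding (shrinkage) update on the auxiliary variable that is split off from the spatial gradients of r. A sketch of that step is shown below; lam_tv and mu are illustrative penalty parameters, and the quadratic subproblem in (l, r) would still be handled by the gradient-based updates described earlier.

```python
# Sketch of the shrinkage step used inside a split-Bregman solver for the TV-l1 penalty.
import numpy as np

def shrink(x, kappa):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def u_update(Dr, bq, lam_tv, mu):
    # u <- argmin_u  lam_tv*||u||_1 + (mu/2)*||u - (D r + bq)||^2
    return shrink(Dr + bq, lam_tv / mu)
```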

IV Simulations and Results

IV-A Simulation Setup

To validate the performance of the proposed algorithm, we simulate a lensless imaging system using a binary planar mask with a separable maximum length sequence (MLS) pattern [23] that is placed 4mm away from a planar sensor array. We used an MLS of length 1024 and converted all the -1s to 0s to create a separable binary pattern. We used square mask features, each of which is 30μm wide. Since we require the mask to be differentiable, we cannot use a strictly binary mask pattern; therefore, we convolved the binary pattern with a Gaussian blur kernel of length 15μm and standard deviation 5. The sensor contains square pixels, each of which is 50μm wide. The chief ray angle of each sensor pixel is . We assume that no noise is added to the sensor measurements. In our experiments for continuous depth estimation, we fix all the parameters to these default values and analyze the performance with respect to one parameter at a time.
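
The sketch below shows one way to build a differentiable 1D mask profile of the kind described above: a binary sequence with 30μm features, slightly blurred so that its derivative exists everywhere. The random bits are a stand-in for the actual MLS sequence, and the 5μm sampling grid and blur discretization are assumptions for illustration.

```python
# Sketch of a blurred, separable 1D mask profile (stand-in for the MLS pattern).
import numpy as np

grid = 5e-6                                            # fine sampling grid on the mask plane
samples_per_feature = 6                                # 30 um features on a 5 um grid
bits = np.random.randint(0, 2, 1024).astype(float)     # stand-in for the binary MLS sequence
profile = np.repeat(bits, samples_per_feature)

# Gaussian blur with ~15 um support so the profile (and its derivative) is smooth.
taps = np.arange(-1, 2) * grid
kernel = np.exp(-taps ** 2 / (2 * (5e-6) ** 2))
kernel /= kernel.sum()
mask_profile = np.convolve(profile, kernel, mode="same")
mask_coords = (np.arange(mask_profile.size) - mask_profile.size / 2) * grid
```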

IV-B Reconstruction of a Single Plane

To understand various parameters of our method and their effect on depth estimation, we perform a simple simulation experiment in which the scene consists of a single plane and our goal is to recover the correct depth accurately. In other words, the depth map is parameterized by a scalar instead of a matrix of the same size as the intensity image. The estimation of the single depth parameter does not require regularization, so we solve (6) via alternating minimization using Algorithm 1 without the regularization terms. We test the performance of our optimization program using a USAF target image placed at three different depths in the scene: 10cm, 1m, and 10m. We test two different initialization schemes for comparison. (1) The initial value of r is selected to set the depth at 1km in all the experiments. (2) We choose a depth plane from 10 candidate depths that were uniformly sampled in r to get an effective depth range from 9cm to 1km; for a given measurement we select the candidate that provides the smallest loss in (6). To separate the effects of the image estimate and the depth estimate, we test two cases: one in which the image intensity is an optimization variable that is updated iteratively, and another in which the image intensity is fixed to its original value. Since the problem in (6) is nonconvex in r even if we fix the intensity l, there is no guarantee that we can estimate the correct value of r.

(a) Convergence of estimates; initial depth at 1km.
(b) Convergence of estimates; initial depth selected from 10 candidate planes.
(c) Initial depth 1km. Image intensity unknown
(d) Initial depth 1km. Image intensity known
(e) Initial depth selected from 10 planes. Image intensity unknown
(f) Initial depth selected from 10 planes. Image intensity known
Fig. 4: Recovery of a single depth parameter. The USAF target is tested at three different depths with two different types of initialization. (a) Initial depth is set to 1km. (b) Initial depth is selected out of 10 candidate planes. (a), (b) Reconstruction loss plotted against the depth estimate at every iteration of the algorithm. If the algorithm estimates the correct image intensity and depth, the plot should converge to the true depth with a small loss function value. Solid lines correspond to the case when we jointly estimate image intensity and depth. Dashed lines correspond to the case when image intensities are known. (c)–(f) Reconstructed images when the original scene is 10cm away, under different choices of initial depth and unknown/known image intensities.

We report the results of 12 experiments (the image at three different depths, with two different initializations, and with known or unknown image intensity) in Figure 4. Figures 4(a) and (b) plot the loss function at the estimated depth at every iteration of the refinement step in Algorithm 1 for the cases when the initial depth is set to 1km and when the initial depth is selected out of 10 candidate depths, respectively. From the curves in Figure 4(a), we observe that we can estimate the depth accurately in all three cases when the image intensities are known (dashed lines). If we jointly estimate the depth and image intensities, the initial value of depth plays a critical role (solid lines). We observe that the intensity and depth are recovered correctly for the scenes at 1m and 10m, but the algorithm fails to recover the correct depth for the scene at 10cm. This is mainly because the initial value of r is very far from the true value and the algorithm is more likely to get stuck in a local minimum. Figure 4(c) shows the reconstructed image when the original scene is 10cm away, the initial depth is set to 1km, and we estimate image intensity and depth by solving (6). On the other hand, when we pick an initial depth close to the true depth (out of 10 candidates, using a greedy approach), as shown in Figure 4(b), the algorithm converges to the true depth and recovers the correct image intensities as well.

Image PSNR / depth RMSE: Greedy [3]: 16.57dB / 87.48mm; Ours: 31.65dB / 17.90mm.
Fig. 5: Left to right: original image and depth of the Cones scene; image and depth initialized via the greedy algorithm [3]; image and depth estimated using the proposed weighted ℓ2-based regularization. The depth in this scene varies from around 0.99m to 1.7m.

IV-C Reconstruction of Scenes with Continuous Depth

Depth datasets

We performed all our experiments on 3D scenes created using light intensities and depth information from the Middlebury [32], Make3D [30, 29], and NYU Depth [25] datasets. The test scenes and their depth ranges are listed in Table I.

Test datasets Min depth (m) Max depth (m)
Sword 0.65 0.95
Cones 0.99 1.70
Playtable 1.47 3.75
Corner 3.93 10.60
Whiteboard 1.08 2.90
Playroom 1.62 2.93
Moebius 0.74 1.23
Books 0.73 1.27
TABLE I: Analysis experiments are performed on multiple scenes picked from the Middlebury [32], Make3D [30, 29], and NYU Depth [25] datasets. Results for four of the scenes are presented in the main text; the rest are reported in the supplementary material.
(a) Image and depth of the original scene. The Cones scene is taken from the Middlebury dataset [32]. The range of depth is from 0.99 to 1.7 meters.
Original Scene
(b) Image and depth reconstruction with isotropic total variation. The PSNR of the image is 29.69dB and the depth RMSE is 25.21mm.
TV-ℓ2
(c) Image and depth reconstruction with weighted total variation. The PSNR of the image is 31.65dB and the depth RMSE is 17.90mm. The edges of the depth map are preserved better.
Weighted TV-ℓ2
(d) Image and depth reconstruction with TV-ℓ1. The PSNR of the image is 30.82dB and the depth RMSE is 19.56mm. The edges of the depth map are preserved well.
TV-ℓ1
Fig. 6: Comparison between reconstructions using three different regularization approaches from the same measurements.

Initialization via greedy method

Let us further discuss our simulation setup using the Cones scene, for which the results are presented in Figure 5. We simulated the 3D scene using depth data from the Middlebury dataset [32]. We sample the scene at uniform angles to create an intensity image and an (inverse) depth map of the same size. We can compute the physical depth z from r using (2). In our simulation, the depth of this scene ranges from around 0.99m to 1.7m. We used the depth pursuit greedy algorithm in [3] as our initialization method. We selected 15 candidate depths by uniformly sampling the inverse depth values from 0.996 to 0.9976, which gives an effective depth range that matches the original depth range. Since we are trying to gauge the performance for off-the-grid estimates of depth, the candidate values of r are not exactly the same as the true values of r in our simulations. The output of the initialization algorithm is then fed into the alternating gradient descent method.

Performance metrics

We evaluate the performance of the recovered image intensity and depth independently of each other. We report the peak signal-to-noise ratio (PSNR) of the estimated intensity images and the root mean squared error (RMSE) of the estimated depth maps for all our experiments; a sketch of both metrics is shown below. The estimates of image intensity and depth for the initialization and for our proposed weighted TV-ℓ2 method are shown in Figure 5, along with the PSNR and RMSE. We observe that both the image and depth estimates from the greedy method [3] contain several spikes because of the model mismatch with the predefined depth grid. In contrast, many of these spikes are removed in the estimates from the proposed algorithm with weighted TV-ℓ2, while the edges are preserved.
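
For reference, the two metrics can be computed as in the sketch below; the peak value of 1.0 assumes intensities normalized to [0, 1], and depths are assumed to be in meters.

```python
# Sketch of the evaluation metrics: image PSNR and depth RMSE (reported in mm).
import numpy as np

def psnr(img_est, img_true, peak=1.0):
    mse = np.mean((img_est - img_true) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def depth_rmse_mm(z_est, z_true):
    return 1e3 * np.sqrt(np.mean((z_est - z_true) ** 2))
```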

Comparison of regularization methods

Here we present a comparison between three different regularization approaches. We reconstruct the image intensity and (inverse) depth map from the same measurements with TV-ℓ2, weighted TV-ℓ2, and TV-ℓ1 regularization. The results are shown in Figure 6. Compared to the TV-ℓ2 method, we observe that both weighted TV-ℓ2 and TV-ℓ1 preserve the sharp edges in the image and depth estimates. Overall, in our experiments, weighted TV-ℓ2 provided the best results; therefore, we use it as our default method in the rest of the paper.

IV-D Effects of Noise

Sensor noise is present in any practical measurement process. The amplitude of the noise depends on the intensities of the sensor measurements and can adversely affect the reconstruction results. To investigate the effect of noise on our algorithm, we present simulation results for reconstruction of scenes from the same sensor measurements under different levels of additive white Gaussian noise. The experiments are performed on the 3D scenes listed in Table I. Some examples of reconstructions with different levels of noise are shown in Figure 7.

Image PSNR / depth RMSE at measurement SNR of 20dB, 30dB, and 40dB:
Scene      20dB SNR              30dB SNR              40dB SNR
Sword      1.15dB / 66.98mm      11.33dB / 38.84mm     19.16dB / 34.29mm
Cones      3.22dB / 120.66mm     13.76dB / 43.52mm     23.70dB / 29.22mm
Playtable  5.12dB / 360.95mm     15.50dB / 163.26mm    24.88dB / 151.46mm
Corner     6.33dB / 32810.73mm   16.52dB / 420.07mm    26.32dB / 304.35mm
Fig. 7: Effects of noise: reconstructions from measurements with signal-to-noise ratio (SNR) of 20dB, 30dB, and 40dB, along with the PSNR of the reconstructed image and the RMSE of the reconstructed depth map. As expected, the quality of the reconstructed image and depth improves as the noise level is reduced. Top left: Sword; top right: Cones; bottom left: Playtable; bottom right: Corner.

Plots of the PSNR of the image intensities and the RMSE of the depth maps over a range of measurement SNR values are presented in Figure 8. We observe from the curves that the quality of both the estimated image and depth improves when the measurements have little noise (high SNR) and degrades as we add more noise to the measurements (low SNR). Another observation is that the scenes that are farther away have higher RMSE. This is understandable because, as the scene moves farther away, the inverse depth values r of the scene pixels all get very close to 1 and we cannot resolve fine depth variations in the scene.

(a) Image PSNR for different noise levels
(b) Depth RMSE for different noise levels
Fig. 8: Reconstruction from measurements with different levels of Gaussian noise on multiple scenes. Both the image PSNR and the depth RMSE improve as the noise is reduced. The reconstruction quality degrades as the scene is placed farther from the camera.
Image PSNR / depth RMSE as the number of sensor pixels increases from left to right (pixel size fixed at 50μm):
Sword      16.51dB / 74.76mm     22.27dB / 34.39mm     27.79dB / 11.17mm
Cones      26.85dB / 56.59mm     32.83dB / 28.70mm     35.29dB / 11.02mm
Playtable  25.55dB / 248.78mm    31.05dB / 147.68mm    32.79dB / 107.04mm
Corner     29.82dB / 439.66mm    34.15dB / 282.61mm    35.69dB / 194.91mm
Fig. 9: Reconstructions from measurements with different numbers of sensor pixels. The size of each sensor pixel is fixed at 50μm. Top left: Sword; top right: Cones; bottom left: Playtable; bottom right: Corner.

IV-E Number of Sensor Measurements

In this experiment, we evaluate the performance of our algorithm as we increase or decrease the number of sensor measurements. The depth estimation problem we are solving is highly ill-posed because of the nontrivial null space of the system matrix and the nonlinear dependence of the measurements on the depth parameters. Adding more measurements helps improve the solution by adding more constraints on the feasible solutions.

We perform experiments with different numbers of sensor pixels while the size of each pixel is fixed at 50μm. We do not add any noise in these experiments to avoid randomness that could affect the comparison. Increasing or decreasing the number of pixels is equivalent to increasing or decreasing the sensor area. Therefore, when we increase the number of sensor pixels (equivalently, the sensor area), the baseline of the sensor is also increased, which helps us resolve the depth more accurately. The results are presented in Figure 9 for three different sensor sizes. We observe that the quality of both the image and depth improves as we use more sensor pixels for the measurements.

IV-F Size of Sensor

In conventional disparity-based depth estimation methods [17], the quality of the reconstructed depth depends on the disparity between frames captured from multiple camera views; a larger distance between camera viewing positions results in better depth estimation accuracy. In a lensless imaging system, we can think of each pinhole on the mask and the sensor area behind the mask as a tiny pinhole camera. The analogy only goes this far, because we do not record images from these tiny pinhole cameras separately; instead, we record a multiplexed version of all the views. The disparity between different points on the sensor, however, does affect our ability to resolve the depth of the scene, and it is determined by the size of the sensor.

To analyze the effect of disparity in our system, we performed experiments with three different sensor pixel sizes: 25μm, 50μm, and 100μm. For a fair comparison, the number of sensor pixels and other parameters are set to the default settings described earlier, and no noise is included in this experiment. Results in terms of reconstructed images and depth maps are presented in Figure 10, where we observe that the quality of the depth reconstruction improves as we increase the size of the sensor pixels. The results in Figures 9 and 10 demonstrate that increasing the disparity between viewing points improves the depth reconstruction quality.

Image PSNR / depth RMSE for sensor pixel sizes of 25μm, 50μm, and 100μm:
Scene      25μm                  50μm                  100μm
Sword      24.31dB / 45.61mm     22.27dB / 34.39mm     21.92dB / 24.42mm
Cones      33.90dB / 39.64mm     32.83dB / 28.70mm     31.22dB / 17.56mm
Playtable  32.38dB / 198.50mm    31.50dB / 147.68mm    28.09dB / 134.15mm
Corner     35.60dB / 358.98mm    34.15dB / 282.61mm    30.74dB / 267.94mm
Fig. 10: Reconstructions from measurements with different sensor pixel sizes. The number of sensor pixels is fixed. The quality of the depth reconstruction improves as we increase the size of the sensor pixels. Top left: Sword; top right: Cones; bottom left: Playtable; bottom right: Corner.

IV-G Comparison with Existing Methods

Finally, we present a comparison of our proposed algorithm with two other methods for 3D recovery with lensless cameras. In our method, we estimate the light intensity and a depth map over a continuous domain. The greedy method in [3] also estimates intensity and depth separately, but the depth at any angle is restricted to one of the predetermined planes. Three-dimensional recovery using lensless cameras for 3D fluorescence microscopy was presented in [2] and [1], which estimate the entire 3D volume of the scene sampled over a predetermined 3D grid. Since the unknown volumetric scene in microscopy is often very sparse, the 3D scene recovery problem is solved as a sparse recovery problem for the light intensity over all the grid voxels. The result is a light distribution over the entire 3D space. We call this method 3D Grid and use the code provided in [2] to solve the 3D recovery problem using the forward model and measurements from our simulation setup.

The scenes studied in [1] and [2] are mostly transparent and contain multiple point light sources at different depths along the same angle. This is different from the natural scenes we are testing, where the objects are usually opaque and block light from objects behind them. We could model such scenes by requiring only one voxel along any angle to be nonzero; however, that would be a nonconvex constraint, and enforcing it would require a heuristic similar to the one in [3]. For the sake of comparison, we solve the ℓ1 norm-based sparse recovery problem as described in [2] and then pick the point with the maximum light intensity along each angle to form the reconstructed image and (inverse) depth map.

A comparison of the different recovery methods with the same imaging setup is shown in Figure 11. For each scene, we reconstruct from the same measurements using the three methods. We observe that our proposed algorithm offers a significant improvement over the existing methods on all the test scenes.

V Conclusion

We presented a new algorithm to jointly estimate the image and depth of a scene using a single snapshot of a mask-based lensless camera. Existing methods for 3D lensless imaging either estimate the scene over a predefined 3D grid (which is computationally expensive) or over a finite number of candidate depth planes (which provides a coarse depth map). We divide the scene into an intensity map on uniform angles and a depth map on a continuous domain, which allows us to estimate a variety of scenes with different depth ranges using the same formulation. We jointly estimate the image intensity and depth map by solving a nonconvex problem. We initialize our estimates using a greedy method and add weighted regularization to enforce smoothness in the depth estimate while preserving sharp edges. We demonstrated with extensive simulations that our proposed method can recover image and depth with high accuracy for a variety of scenes. We evaluated the performance of our method under different noise levels, sensor sizes, and numbers of sensor pixels and found the method to be robust. Finally, we presented a comparison with existing methods for lensless 3D imaging and demonstrated that our method provides significantly better results.

Image PSNR / depth RMSE for the three recovery methods:
Scene      3D Grid [2]            Greedy [3]            Ours
Sword      9.10dB / 87.55mm       14.04dB / 47.59mm     22.27dB / 34.39mm
Cones      8.46dB / 96.08mm       14.13dB / 109.47mm    32.83dB / 28.70mm
Playtable  7.14dB / 479.78mm      16.71dB / 400.99mm    31.50dB / 147.68mm
Corner     6.10dB / 1423.17mm     18.15dB / 1506.94mm   30.47dB / 267.94mm
Fig. 11: Comparison of existing 3D recovery methods for lensless imaging, the 3D grid method in [1, 2] and the greedy method in [3], with our proposed method. The 3D grid method provides a 3D volume with multiple depth planes; therefore, we pick the depth with the largest light intensity along each angle for comparison.

References

  • [1] J. K. Adams, V. Boominathan, B. W. Avants, D. G. Vercosa, F. Ye, R. G. Baraniuk, J. T. Robinson, and A. Veeraraghavan, "Single-frame 3D fluorescence microscopy with ultraminiature lensless FlatScope," Science Advances, 3(12), 2017.
  • [2] N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, "DiffuserCam: lensless single-exposure 3D imaging," Optica, 5(1), pp. 1–9, 2018.
  • [3] M. S. Asif, "Toward depth estimation using mask-based lensless cameras," in Proc. 51st Asilomar Conference on Signals, Systems, and Computers, pp. 1467–1470, 2017.
  • [4] M. S. Asif, A. Ayremlou, A. Sankaranarayanan, A. Veeraraghavan, and R. G. Baraniuk, "FlatCam: thin, lensless cameras using coded aperture and computation," IEEE Transactions on Computational Imaging, 3(3), pp. 384–397, 2017.
  • [5] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, "Model-based compressive sensing," IEEE Transactions on Information Theory, 56(4), pp. 1982–2001, 2010.
  • [6] V. Boominathan, J. K. Adams, M. S. Asif, B. W. Avants, J. T. Robinson, R. G. Baraniuk, A. C. Sankaranarayanan, and A. Veeraraghavan, "Lensless imaging: a computational renaissance," IEEE Signal Processing Magazine, 33(5), pp. 23–35, 2016.
  • [7] N. Boyd, G. Schiebinger, and B. Recht, "The alternating descent conditional gradient method for sparse inverse problems," in IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 57–60, 2015.
  • [8] A. Busboom, H. Elders-Boll, and H. D. Schotten, "Uniformly redundant arrays," Experimental Astronomy, 8(2), pp. 97–123, 1998.
  • [9] T. M. Cannon and E. E. Fenimore, "Coded aperture imaging: many holes make light work," Optical Engineering, 19, pp. 283, 1980.
  • [10] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Foundations of Computational Mathematics, 12(6), pp. 805–849, 2012.
  • [11] J. Chang and G. Wetzstein, "Deep optics for monocular depth estimation and 3D object detection," in Proc. IEEE ICCV, 2019.
  • [12] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, "Single-pixel imaging via compressive sampling," IEEE Signal Processing Magazine, 25(2), pp. 83–91, 2008.
  • [13] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," ACM Transactions on Graphics, 21(3), pp. 257–266, 2002.
  • [14] E. E. Fenimore and T. M. Cannon, "Coded aperture imaging with uniformly redundant arrays," Applied Optics, 17(3), pp. 337–347, 1978.
  • [15] S. B. Gokturk, H. Yalcin, and C. Bamji, "A time-of-flight depth sensor - system description, issues and solutions," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2004.
  • [16] T. Goldstein and S. Osher, "The split Bregman method for L1-regularized problems," SIAM Journal on Imaging Sciences, 2(2), pp. 323–343, 2009.
  • [17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003.
  • [18] F. Heide, M. B. Hullin, J. Gregson, and W. Heidrich, "Low-budget transient imaging using photonic mixer devices," ACM Transactions on Graphics, 32(4), pp. 45, 2013.
  • [19] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, "Image and depth from a conventional camera with a coded aperture," ACM Transactions on Graphics, 26(3), 2007.
  • [20] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, "All-optical machine learning using diffractive deep neural networks," Science, 361(6406), pp. 1004–1008, 2018.
  • [21] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming, 45(1), pp. 503–528, 1989.
  • [22] Y. Liu, J. Ma, Y. Fan, and Z. Liang, "Adaptive-weighted total variation minimization for sparse data toward low-dose X-ray computed tomography image reconstruction," Physics in Medicine & Biology, 57(23), pp. 7923, 2012.
  • [23] F. J. MacWilliams and N. J. A. Sloane, "Pseudo-random sequences and arrays," Proceedings of the IEEE, 64(12), pp. 1715–1729, 1976.
  • [24] D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, "Analysis of diffractive optical neural networks and their integration with electronic neural networks," IEEE Journal of Selected Topics in Quantum Electronics, 26(1), pp. 1–14, 2020.
  • [25] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
  • [26] D. Needell and J. A. Tropp, "CoSaMP: iterative signal recovery from incomplete and inaccurate samples," Communications of the ACM, 53(12), pp. 93–100, 2010.
  • [27] D. Reddy, J. Bai, and R. Ramamoorthi, "External mask based depth and light field camera," in IEEE International Conference on Computer Vision Workshops, pp. 37–44, 2013.
  • [28] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, 60(1–4), pp. 259–268, 1992.
  • [29] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), pp. 824–840, 2009.
  • [30] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems 18, pp. 1161–1168, 2006.
  • [31] H. H. Barrett, D. T. Wilson, G. D. DeMeester, and H. Scharfman, "Fresnel zone plate imaging in radiology and nuclear medicine," Optical Engineering, 12(1), 1973.
  • [32] D. Scharstein, R. Szeliski, and R. Zabih, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," in Proc. IEEE Workshop on Stereo and Multi-Baseline Vision, pp. 131–140, 2001.
  • [33] M. Schmidt, "minFunc: unconstrained differentiable multivariate optimization in MATLAB," 2005.
  • [34] D. Takhar, J. N. Laska, M. B. Wakin, M. F. Duarte, D. Baron, S. Sarvotham, K. F. Kelly, and R. G. Baraniuk, "A new compressive imaging camera architecture using optical-domain compression," Proc. SPIE 6065, 2006.
  • [35] Z. Tan, P. Yang, and A. Nehorai, "Joint sparse recovery method for compressed sensing with structured dictionary mismatches," IEEE Transactions on Signal Processing, 62(19), pp. 4997–5008, 2014.
  • [36] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, "Compressed sensing off the grid," IEEE Transactions on Information Theory, 59(11), pp. 7465–7490, 2013.
  • [37] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Sixth International Conference on Computer Vision, pp. 839–846, 1998.
  • [38] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, 53(12), pp. 4655–4666, 2007.
  • [39] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan, and J. Tumblin, "Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing," ACM Transactions on Graphics, 26(3), pp. 69, 2007.
  • [40] Y. Wu, V. Boominathan, H. Chen, A. Sankaranarayanan, and A. Veeraraghavan, "PhaseCam3D: learning phase masks for passive single view depth estimation," in IEEE International Conference on Computational Photography (ICCP), pp. 1–12, 2019.
  • [41] Z. Yang, L. Xie, and C. Zhang, "Off-grid direction of arrival estimation using sparse Bayesian inference," IEEE Transactions on Signal Processing, 61(1), pp. 38–43, 2013.
  • [42] A. Yedidia, C. Thrampoulidis, and G. Wornell, "Analysis and optimization of aperture design in computational imaging," in IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4029–4033, 2018.
  • [43] A. Zomet and S. K. Nayar, "Lensless imaging with a controllable aperture," in IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 339–346, 2006.