Compressive Coded Aperture Keyed Exposure Imaging with Optical Flow Reconstruction
Abstract
This paper describes a coded aperture and keyed exposure approach to compressive video measurement which admits a small physical platform, high photon efficiency, high temporal resolution, and fast reconstruction algorithms. The proposed projections satisfy the Restricted Isometry Property (RIP), and hence compressed sensing theory provides theoretical guarantees on the video reconstruction quality. Moreover, the projections can be easily implemented using existing optical elements such as spatial light modulators (SLMs). We extend these coded mask designs to novel dual-scale masks (DSMs) which enable the recovery of a coarse-resolution estimate of the scene with negligible computational cost. We develop fast numerical algorithms which utilize both temporal correlations and optical flow in the video sequence as well as the innovative structure of the projections. Our numerical experiments demonstrate the efficacy of the proposed approach on shortwave infrared data.
I. Introduction
The theory of compressed sensing (CS) suggests that we can collect high-resolution imagery with relatively few photodetectors or a small focal plane array (FPA) when the scene is sparse or compressible in some dictionary or basis and the measurements are chosen appropriately [1, 2]. There has been significant recent interest in building imaging systems in a variety of contexts to exploit CS (cf. [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]). By designing optical sensors to collect measurements of a scene according to CS theory, we can use sophisticated computational methods to infer critical scene structure and content. One particularly famous example is the Single Pixel Camera [13], which collects sequential projections of the scene. While these measurements are supported by the CS literature, there are several practical challenges associated with the tradeoff between temporal resolution and physical system footprint. In this paper we describe an alternative approach to designing a low frame-rate snapshot CS camera which naturally parallelizes the compressive data acquisition. Our approach is based on two imaging techniques called coded apertures and keyed exposures, which we explain next.
Coded apertures [14, 15] have a long history in low-light astronomical applications. Coded apertures were first developed to increase the amount of light hitting a detector in an optical system without sacrificing resolution (by, say, increasing the diameter of an opening in a pinhole camera). The basic idea is to use a mask, i.e., an opaque rectangular plate with a specified pattern of openings, that allows significantly brighter observations with higher signal-to-noise ratio than those from conventional pinhole cameras [14, 15]. These masks encode the image before detection, inducing a more complicated point spread function than that associated with a pinhole aperture. The original scene is then recovered from the encoded observations in post-processing using an appropriate reconstruction algorithm which exploits the mask pattern. These multiplexing techniques are particularly popular in astronomical [16, 17] and medical [18, 19, 20] applications because of their efficacy at wavelengths where lenses cannot be used, but recent work has also demonstrated their utility for collecting both high-resolution images and object depth information simultaneously [21].
Keyed exposure imaging (also called coded exposure [22], flutter shutter [23], or coded strobing [24]) was initially developed to facilitate motion deblurring in video using a relatively low frame rate. In some cases motion has been inferred from a single keyed exposure snapshot. The basic idea is that the camera sensor continuously collects light while the shutter is rapidly opened and closed; the shutter movement effectively modulates the motion blur point spread function, and with well-chosen shutter movement patterns it becomes possible to deblur moving objects. Similar effects can be achieved using a strobe light instead of moving a shutter.
Despite the utility of the above methods in specific settings, they both face some limitations. The design of conventional coded apertures does not account for the inherent structure and compressibility of natural scenes, nor the potential for nonlinear reconstruction algorithms. Likewise, existing keyed exposure methods focus on direct (uncoded) measurements of the spatial content of the scene and have limited reconstruction capabilities, as we detail below. The Coded Aperture Keyed Exposure (CAKE) sensing paradigm we describe in this paper is designed to allow nonlinear high-resolution video reconstruction from relatively few measurements in more general settings. Recent related work [25] following our early technical report [26] describes a physical platform for combining coded apertures with coded exposures. We provide theoretical support for this framework as well as detail novel dual-scale mask generation schemes that can be used to compute a coarse-resolution estimate of the scene with negligible computational cost; this coarse estimation can be followed by reconstruction algorithms which use optical flow models of video sequences.
I-A. Problem Formulation
We consider the problem of reconstructing an frame video sequence , where each frame is an two-dimensional image denoted . Using standard vector representation, we have that for where is the total number of pixels. As a result, the vector representation of the video sequence is .
The observations of are also acquired as a video sequence. We do not assume that the observations are acquired at the same rate at which we will ultimately reconstruct . In general, we assume is an frame video sequence, with each frame of size . Similarly to , we have for , where , therefore .
We observe via a spatiotemporal sensing matrix which linearly projects the spatiotemporal scene onto an dimensional set of observations:
where is noise associated with the physics of the sensor.
CS optical imaging systems must be designed to meet several competing objectives:

- The sensing matrix must satisfy some necessary criterion (such as the RIP, defined below) which provides theoretical guarantees on the accuracy with which we can estimate from .
- The total number of measurements, , must be lower than the total number of pixels to be reconstructed, . This is achievable via compressive spatial acquisition (), frame-rate reduction (), or simultaneous spatiotemporal compression.
- The measurements must be causally related to the temporal scene , which restricts the structure of the projections .
- The optical measurements modeled by must be implementable in a way that results in a smaller, cheaper, more robust, or lower-power system.
- The sensing matrix structure must facilitate fast reconstruction algorithms.
This paper demonstrates that compressive Coded Aperture Keyed Exposure systems achieve all these objectives.
I-B. Contributions
The primary contribution of this paper is the design and theoretical characterization of compressive Coded Aperture Keyed Exposure (CAKE) sensing. We describe how keyed exposure ideas can be used in conjunction with coded apertures to increase both the spatial and temporal resolution of video from relatively few measurements. We prove hitherto unknown theoretical properties of such systems and demonstrate their efficacy in several simulations. We also detail a new mask design method that enables, with trivial computational cost, a low-resolution image of the scene from which we estimate an optical flow vector field for use in a computationally efficient optical flow-based reconstruction algorithm. In addition, we discuss several important algorithmic aspects of our approach, as well as detailing considerations for hardware implementations of the proposed approach. This paper builds substantially upon earlier preliminary studies by the authors [27, 28, 29, 30] and related independent work by Romberg [31].
I-C. Organization of the Paper
The paper is organized as follows. Section II introduces relevant background material; Section II-A describes conventional coded aperture imaging techniques and Section II-B describes keyed exposure techniques currently in the literature. We describe the compressive sensing problem and formulate it mathematically in Section II-C. We describe our approach to video compressed sensing using the CAKE paradigm in Section III, including theoretical analysis of the method. We then describe the design of novel dual-scale mask patterns in Section IV. Section V details how we perform the video reconstruction from the CAKE measurements, which we validate with numerical experiments in Section VI. We conclude with a discussion of the implementation details and tradeoffs in practical optical systems in Section VII.
II. Background
Prior to detailing our main contributions, we first review pertinent background material, covering the development of coded aperture imaging, coded exposure photography, and salient concepts in compressed sensing theory.
II-A. Coded Aperture Imaging
Seminal work in coded aperture imaging includes the development of masks based on Hadamard transform optics [32] and pseudo-random phase masks [33]. Modified Uniformly Redundant Arrays (MURAs) [34] are generally accepted as optimal mask patterns for coded aperture imaging. These mask patterns (which we denote by ) are binary, square patterns, whose grid size matches the spatial resolution of the photodetector and hence are length . Each mask pattern is specifically designed to have a complementary pattern such that the two-dimensional convolution is a single peak with flat sidelobes (i.e., a Kronecker delta function).
In practice, the resolution of a detector array dictates the properties of the mask pattern and hence the resolution at which can be reconstructed. We model this effect as being downsampled to the resolution of the detector array and then convolved with the mask pattern , which has the same resolution as the FPA and the downsampled , i.e.,
where denotes circular convolution in two dimensions, corresponds to noise associated with the physics of the sensor, and is the integration downsampling of the scene, which consists of partitioning into uniformly sized blocks, where for , and measuring the total intensity in each block.
Because of the construction of and , can be reconstructed using . However, the resulting resolution is often lower than what is necessary to capture some of the desired details in the image. Clearly, the estimates from MURA reconstruction are limited by the spatial resolution of the photodetector. Thus, high-resolution reconstructions cannot generally be obtained from low-resolution MURA-coded observations.
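To make the convolve-then-decode principle concrete, the following sketch simulates a coded aperture observation as a two-dimensional circular convolution with a binary mask and inverts it with a regularized Fourier-domain filter. This is a toy illustration under assumed parameters (a random 16x16 mask); it is not the MURA construction, whose complementary-pattern correlation decode the regularized inverse merely stands in for.

```python
import numpy as np

def coded_aperture_observe(scene, mask):
    """Observation = 2-D circular convolution of the scene with a binary mask,
    computed in the Fourier domain (both arrays share the same shape)."""
    return np.real(np.fft.ifft2(np.fft.fft2(scene) * np.fft.fft2(mask)))

def fourier_decode(obs, mask, eps=1e-3):
    """Regularized inverse filter standing in for the MURA correlation decode."""
    H = np.fft.fft2(mask)
    return np.real(np.fft.ifft2(np.fft.fft2(obs) * np.conj(H) / (np.abs(H) ** 2 + eps)))

rng = np.random.default_rng(0)
scene = rng.random((16, 16))
mask = (rng.random((16, 16)) < 0.5).astype(float)  # open/closed aperture elements
obs = coded_aperture_observe(scene, mask)
rec = fourier_decode(obs, mask)
```

Because the mask is known, the (well-conditioned parts of the) encoding can be undone almost exactly; with a true MURA pattern the decode would be a single correlation with the complementary pattern.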
It can be shown that this mask design and reconstruction yield minimal reconstruction errors at the FPA resolution, subject to the constraint that linear, convolution-based reconstruction methods are used. However, when the scene of interest is sparse or compressible, and nonlinear sparse reconstruction methods may be employed, CS ideas can be used to design coded apertures.
II-B. Coded (Keyed) Exposure Imaging
Coded (or keyed) exposures were developed recently in the computational photography community. Initial work in this area was focused on engineering the temporal component of a motion blur point spread function by rapidly opening and closing the shutter during a single exposure or a small number of exposures at a low frame rate [22, 23]. That is,
where the keyed exposure (KE) measurement matrix selects the subset of frames during which the shutter is open. We refer to this subset as the exposure code .
If an object is moving during image acquisition, then a static shutter would induce a typical motion blur, making the moving object difficult to resolve with standard deblurring methods. However, by “fluttering” the shutter during the exposure using carefully designed patterns, the induced motion blur can be made invertible and moving objects can be accurately reconstructed. Instead of a moving shutter, more recent work uses a strobe light to produce a similar effect [24].
While this novel approach to video acquisition can produce very accurate deblurred images of moving objects, there is significant overhead associated with the reconstruction process. To see why, note that every object moving with a different velocity or trajectory will produce a different motion blur. This means that (a) any stationary background must be removed during preprocessing and (b) multiple moving objects must be separated and processed individually.
More recently, it was shown that these challenges could be sidestepped when the video is temporally periodic (e.g., consider a video of an electronic toothbrush spinning) [24]. The periodic assumption amounts to a sparse temporal Fourier transform of the video, and this approach, called coded strobing, is a compressive acquisition in the temporal domain. As a result, the authors were able to leverage ideas from compressed sensing to achieve highquality video reconstruction.
The assumption of a periodic video makes it possible to apply much more general reconstruction algorithms that do not require background subtraction or the separation of different moving components. However, it is a very strong assumption, which limits its applicability to real-world settings. The approach described in this paper has similar performance guarantees but operates on much more general video sequences.
II-C. Compressed Sensing
In this section we briefly define the Restricted Isometry Property (RIP) and explain its significance to reconstruction performance. In subsequent sections, we demonstrate our primary theoretical contribution, which is to prove the RIP for compressive coded aperture and keyed exposure systems.
Definition 1 (Restricted Isometry Property (RIP) [35]).
A matrix satisfies the RIP of order if there exists a constant for which
holds for all sparse . When this property holds, we say that satisfies the RIP of order .
Matrices which satisfy the RIP are called CS matrices; when combined with sparse recovery algorithms, they are guaranteed to yield accurate estimates of the underlying function :
Theorem 1 (Sparse Recovery with RIP Matrices [36, 37]).
Let be a matrix satisfying with , and let be a vector of noisy observations of any signal , where is a noise or error term with for some . Let be the best sparse approximation of ; that is, is the approximation obtained by keeping the largest entries of and setting the others to zero. Then the estimate
(1)  
obeys
where and are constants which depend on but not on or .
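As a minimal numerical illustration of the kind of recovery guaranteed by Theorem 1 (and not the solvers used later in this paper), the sketch below recovers a sparse vector from compressive Gaussian measurements via iterative soft-thresholding (ISTA), which solves an l1-regularized least-squares problem akin to (1). The problem dimensions and regularization weight are illustrative assumptions.

```python
import numpy as np

def ista(A, y, lam, n_iter):
    """Iterative soft-thresholding for min_x 0.5 * ||Ax - y||_2^2 + lam * ||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x - step * A.T @ (A @ x - y)                        # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0)  # soft threshold
    return x

rng = np.random.default_rng(1)
m, n, k = 40, 100, 4
A = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian matrices satisfy the RIP w.h.p.
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                                  # noiseless compressive measurements
x_hat = ista(A, y, lam=0.01, n_iter=2000)
```

With m well below n but above the sparsity-dependent threshold, the sparse signal is recovered up to a small regularization bias.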
III. Coded Aperture Keyed Exposure Sensing
Here we detail our compressive video acquisition method, which combines coded apertures and keyed exposures to address all the competing challenges detailed in Sec. I-A. In our CAKE imaging method, each observed frame is given by an exposure of high-rate coded observations in the time interval :
(3) 
for and is a zero-mean additive noise. Without the exposure, the action of each would be to collect a compressive snapshot measurement of the scene at time . Each of these snapshots utilizes a high-resolution coded aperture and can be modeled mathematically as
(4) 
where is a coding mask, and is a subsampling operator (detailed below). The subsampling operator effectively models the action of a low-resolution FPA or CCD array. Our theory allows to be a structured non-random linear operator, and hence we can rewrite the above as
(5) 
To implement this sensing paradigm we simply modulate the coded aperture mask over the frame exposure time.
It should be stressed that the coding masks () are of the size and resolution at which will be reconstructed; this is in contradistinction to the MURA system, in which is at the size and resolution of the FPA. Thus in (4), we model the measurements as the scene being convolved with the coded mask and then downsampled.
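For concreteness, a single compressive snapshot of the form just described can be simulated as follows: circularly convolve a full-resolution frame with a full-resolution coded mask via the FFT, then retain one sample per block. The mask values and downsampling factor below are illustrative assumptions.

```python
import numpy as np

def cake_frame_measure(frame, mask, d=2):
    """One compressive snapshot: circularly convolve the full-resolution frame
    with a full-resolution coded mask, then keep one sample per d x d block."""
    coded = np.real(np.fft.ifft2(np.fft.fft2(frame) * np.fft.fft2(mask)))
    return coded[::d, ::d]          # pointwise subsampling to the FPA resolution

rng = np.random.default_rng(2)
frame = rng.random((8, 8))
mask = rng.choice([0.0, 1.0], size=(8, 8))   # assumed open/closed mask values
y = cake_frame_measure(frame, mask, d=2)     # 4 x 4 detector measurement
```

Note that, unlike the MURA setting, the mask here lives at the reconstruction resolution and only the measurement lives at the detector resolution.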
While the sensing strategy described in (3) is straightforward to implement, the additional structure imposed on the sensing matrix complicates the performance analysis of the system from a compressive sensing perspective. To highlight this, we now describe the structure of these sensing operations.
The two-dimensional convolution of with a frame as in (5) can be represented as the application of the Fourier transform to and , followed by element-wise multiplication and application of the inverse Fourier transform. In matrix notation, this series of linear operations can be expressed as
where is the two-dimensional Fourier transform matrix, and is a diagonal matrix whose elements correspond to the transfer function, which is the Fourier transform of . The matrix product is a block-circulant with circulant blocks (BCCB) matrix. In matrix notation, a BCCB matrix consists of blocks,
where each is circulant; i.e., is of the form
This structure is a direct result of the fact that the point one-dimensional Fourier transform diagonalizes any circulant matrix (such as with ) and so diagonalizes block-circulant matrices (such as ). Here denotes the Kronecker matrix product.
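The diagonalization property underlying this structure can be checked numerically in the one-dimensional case: the DFT matrix diagonalizes any circulant matrix, and the eigenvalues are the DFT of its first column. A small self-contained check (sizes are illustrative):

```python
import numpy as np

n = 8
rng = np.random.default_rng(3)
c = rng.random(n)                      # first column of a circulant matrix
# C[i, j] = c[(i - j) mod n], so C x is the circular convolution of c and x.
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

F = np.fft.fft(np.eye(n))              # n-point DFT matrix
D = F @ C @ np.conj(F).T / n           # F C F^{-1}: diagonal, entries = fft(c)
off_diag = D - np.diag(np.diag(D))
```

The same argument applied blockwise, via the Kronecker structure of the 2-D DFT, diagonalizes BCCB matrices.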
For each frame, a compressed sensing matrix can be formed from the matrix by retaining only a subset of the rows of . This is equivalent to restricting the number of measurements collected of the vector . Here we simply subsample by applying a pointwise subsampling matrix which retains only one measurement per uniformly sized block; i.e., we subsample by in the first coordinate and in the second coordinate so that the result is an image with . In matrix form can be thought of as retaining a certain number of rows of the identity matrix. Because of the structure and deterministic nature of this type of downsampling, it is often more straightforward to realize in practical imaging hardware (see Sec. VII).
The resulting perframe sensing matrix is then given by
For our CAKE sensing strategy in (3), the sensing matrix for each block of frames is formed by concatenating sensing matrices of the above type. For the th lowrate observation, we have a corresponding sensing matrix
We illustrate this sensing procedure in Figure 1.
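Putting the pieces together, one low-rate CAKE observation as in (3) sums several per-frame coded, subsampled snapshots on the detector. A sketch under assumed sizes, exposure length, and binary mask values:

```python
import numpy as np

def cake_snapshot(frames, masks, d=2):
    """One low-rate CAKE observation: each high-rate frame is convolved with its
    own coded mask, subsampled, and the results are summed on the detector."""
    acc = 0.0
    for frame, mask in zip(frames, masks):
        coded = np.real(np.fft.ifft2(np.fft.fft2(frame) * np.fft.fft2(mask)))
        acc = acc + coded[::d, ::d]
    return acc

rng = np.random.default_rng(4)
K, n = 4, 8                                    # exposure block length and frame size
frames = rng.random((K, n, n))
masks = rng.choice([0.0, 1.0], size=(K, n, n))  # one mask per high-rate frame
y = cake_snapshot(frames, masks)
```

Modulating the mask pattern within the exposure is what distinguishes this from a single static coded aperture.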
Since our sensing strategy is independent across each lowresolution frame, and also for simplicity of presentation, we only consider the recovery of a length block of frames from a single snapshot image in our theoretical analysis.
Theorem 2 (Coded Aperture Keyed Exposure Sensing).
Let be an sensing matrix for the CAKE system where the coded aperture pattern for each is drawn iid from a suitable probability distribution. Then there exist absolute constants such that for any
is with with probability exceeding
Note here that denotes the sparsity of frames of the video sequence, rather than simply the sparsity of an individual frame; that is, if each frame had sparsity , then naïvely, without accounting for temporal correlations, we would have . We present the proof of Theorem 2 in Appendix A for the particular case where we select
iid for and . This binary-valued distribution was selected to model coded apertures with two states (open and closed) per mask element. It is straightforward to extend the result to other mask generating distributions, such as uniform or Gaussian distributed entries. We discuss the issue of implementing these masks in Section VII.
Recent work by Bajwa et al. [38, 39] showed that subsampled rows from random circulant matrices (and Toeplitz matrices, in general) are sufficient to recover sparse from with high probability. In particular, they showed that when for some constant , a Toeplitz matrix whose first row contains elements drawn independently from a Gaussian distribution is with with high probability, with extensions to other generating distributions. Drawing iid from any zero-mean sub-Gaussian distribution with variance is sufficient to establish the RIP; if one is willing to pay additional logarithmic factors, [40] shows that the RIP can be established if for some constant .
It is of interest to note that since the CAKE sensing matrix is a concatenation of downsampled Toeplitz matrices, this sensing strategy has clear connections to [41], where the authors consider concatenations of Toeplitz matrices as a sensing matrix for performing multiuser detection in wireless sensor networks. The important conceptual link is that their sensing matrix is used to determine a sparse set of simultaneously active users, whereas in our system, we use it to infer a sequence of sparse frames.
IV. Dual-Scale Masks
While the iid random coded aperture mask patterns are supported theoretically by Theorem 2, using a generative distribution with more structure allows for mask designs with several practical benefits. Here we describe a new method of coded aperture mask design which we call Dual-Scale Masks (DSMs). This approach allows us to obtain a coarse low-rate, low-resolution estimate from which we can often accurately estimate the optical flow motion field. We describe the algorithms used for optical flow estimation, as well as how these optical flow estimates are used in the final reconstruction, in Sec. V-B. As these mask designs leverage elements of Romberg’s work [31], we first review important components of this work prior to detailing our dual-scale mask designs.
IV-A. Phase-Shifting Mask Designs
We note that Romberg [31] conducted related work concurrently with and independently of our initial investigations [27]. While the random convolution in Romberg’s work follows a similar structure, the convolution pattern is generated randomly in the frequency domain. More specifically, the entries of the transfer function correspond to random phase shifts (with some constraints to keep the resulting observation matrix real-valued). For simplicity of presentation, we describe the generating distribution for a one-dimensional convolution for even-length masks. First we generate such that
(6) 
Here denotes the uniform distribution over . The real-valued convolution kernel is then given by . This result can be extended easily to two-dimensional convolutions. We call coded aperture mask patterns generated by this procedure phase-shift masks.
To form a compressed sensing matrix from this random convolution, [31] considers two different downsampling strategies: sampling at random locations and random demodulation. The first method entails selecting a random subset of indices. We form a downsampling matrix by retaining only the rows of the identity matrix indexed by , hence the resulting measurement matrix is given by
where is generated according to the two-dimensional analog of (6). The random demodulation method multiplies the result of the random convolution by a random sign sequence , such that with equal probability for all , then performs an integration downsampling of the result. Therefore in this case the measurement matrix is
It can be shown that both of these strategies yield RIPsatisfying matrices with high probability.
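The one-dimensional generating procedure in (6) can be sketched as follows: unit-magnitude random phases with the conjugate symmetry needed for a real-valued spatial-domain kernel. The helper name and sizes are illustrative assumptions.

```python
import numpy as np

def phase_shift_mask(n, rng):
    """1-D phase-shift convolution kernel (even n): unit-magnitude random phases
    with conjugate symmetry so the spatial-domain kernel is real-valued."""
    H = np.ones(n, dtype=complex)
    H[0] = rng.choice([-1.0, 1.0])            # DC and Nyquist bins must be +/- 1
    H[n // 2] = rng.choice([-1.0, 1.0])
    theta = rng.uniform(0, 2 * np.pi, n // 2 - 1)
    H[1:n // 2] = np.exp(1j * theta)
    H[n // 2 + 1:] = np.conj(H[1:n // 2][::-1])  # H[n - k] = conj(H[k])
    return np.real(np.fft.ifft(H)), H

rng = np.random.default_rng(5)
h, H = phase_shift_mask(16, rng)   # spatial kernel and its transfer function
```

Because every bin of the transfer function has unit magnitude, the associated convolution operator is unitary, which is the property the dual-scale construction below exploits.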
Theorem 3 (Fourier-Domain CCA Sensing [31]).
Let be an sensing matrix resulting from random convolution with phase shifts followed by either random subsampling or random demodulation, and let denote an arbitrary orthonormal basis. Then there exists a constant such that if
is with with probability exceeding .
In certain regimes, these theoretical results are stronger than those in Theorem 2 specialized to . The main strength is that these results allow sparsity in an arbitrary orthonormal basis. However, there are important differences between the observation models which have a significant impact on the feasibility and ease of hardware implementation. We elaborate on this in Section VII.
IV-B. Dual-Scale Mask Designs
As the name suggests, our DSMs utilize two mask patterns: one at the coarse low-rate, low-resolution measurement scale, and the other at the fine high-rate, high-resolution reconstruction scale. We differentiate the two patterns using superscripts: for the coarse-scale masks, and for the fine-scale masks. We leverage the uniform phase generation strategy in Section IV-A for our low-resolution pattern, and generate our high-resolution pattern directly in the spatial domain. The reason is that the random phase patterns yield a unitary transform which is easily inverted to acquire a low-resolution estimate, while generating the high-resolution pattern in the spatial domain allows us to ensure that each low-resolution block in the pattern has zero mean. While this may be possible with a suitable normalization of the phase-shift patterns, some of the desirable properties of these patterns would be lost.
Our procedure is as follows. For , generate a length random phase sequence such that , and is conjugate symmetric (i.e., according to the two-dimensional analog of (6) with ). We then let where is integration downsampling and is the 2D DFT matrix. For , generate an as follows. First we partition each mask into blocks, then on each block we select a binary-valued vector in uniformly at random from binary vectors which sum to zero and assign these values to this portion of the block. Said differently, on each block we randomly select half of the mask elements to have value and set the other half to , so that by construction. The DSM patterns suitable for use in the CAKE sensing strategy (3) are linear combinations of the form
(7) 
where , and . The scalars and allow additional flexibility in the generation and are normalized such that . Individually, and are normalized such that , so this scaling ensures that (as by design). See Fig. 2 for an example mask pattern. The selection of and essentially trades off the quality of the intermediate low-resolution estimate against that of the final high-resolution reconstruction. Our experiments reveal that a larger weighting on the high-resolution mask results in higher fidelity reconstructions.
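A toy version of this dual-scale construction is sketched below: a coarse pattern that is constant on each block (obtained from random Fourier phases, with a simplified symmetrization that does not preserve the unit-magnitude property of Section IV-A) plus a fine pattern whose blocks each contain equal numbers of +1 and -1 entries. All sizes and weights are illustrative assumptions.

```python
import numpy as np

def dual_scale_mask(n, d, alpha, beta, rng):
    """Toy dual-scale mask: a coarse pattern constant on each d x d block plus a
    fine pattern whose d x d blocks each have zero mean (half +1, half -1)."""
    # Coarse scale: random phases, symmetrized so the inverse FFT is real
    # (this simple symmetrization sacrifices the unit-magnitude property).
    H = np.exp(1j * rng.uniform(0, 2 * np.pi, (n // d, n // d)))
    H = (H + np.conj(np.roll(np.roll(H[::-1, ::-1], 1, 0), 1, 1))) / 2
    coarse = np.kron(np.real(np.fft.ifft2(H)), np.ones((d, d)))
    # Fine scale: each d x d block is a random arrangement of half +1s, half -1s.
    fine = np.empty((n, n))
    block = np.array([1.0] * (d * d // 2) + [-1.0] * (d * d // 2))
    for i in range(0, n, d):
        for j in range(0, n, d):
            fine[i:i + d, j:j + d] = rng.permutation(block).reshape(d, d)
    return coarse, fine, alpha * coarse + beta * fine

rng = np.random.default_rng(6)
coarse, fine, mask = dual_scale_mask(16, 4, alpha=0.3, beta=0.7, rng=rng)
```

The zero-mean fine blocks are what make the coarse component cheaply recoverable: block-averaging the coded data cancels the fine-scale contribution.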
The advantage of the DSM design is that we can obtain a coarse estimate of with negligible computational cost. The remainder of this section details how this is accomplished. Our target is to recover an estimate of the low-rate, low-resolution frame sequence
For each low-resolution frame, an estimate of can be found easily from a size convolution with . If we denote , this estimate is computed via
(8) 
We derive (8) in Appendix B, and we detail how this estimate is used to inform an optical flow-based reconstruction algorithm in Section V.
V. Video Reconstruction
To solve the CS minimization problems (1) or (2), we use well-established gradient-based optimization methods. These iterative algorithms are attractive because they utilize repeated applications of the operators and [42, 43, 44]. For most CS matrices, these matrix-vector products are a large computational burden, and storing such matrices is memory intensive even for modest image sizes. However, the BCCB structure of allows for computationally fast matrix-vector products using FFTs that do not require explicit storage of the matrix. In our case, the memory footprint is no larger than that of the image itself. In the remainder of this section, we describe how these algorithmic approaches are adapted for video reconstruction, as well as how optical flow constraints can be introduced to improve reconstruction performance.
V-A. Reconstruction for Video
Given measurements from the proposed CAKE imaging system, where we have designed the mask patterns for sparsity in a temporal transform, we recover the frames by solving a minimization similar to (2). However, since successive video frames are generally highly correlated, there are gains in considering the differences between subsequent frames instead. We therefore define the difference frame sequence such that
(9) 
From , the frames can easily be recovered by , , which can be expressed compactly as , where is an lower-triangular matrix of all ones. Additionally, total variation (TV) regularization [45] has been shown to lead to increased reconstruction accuracy. We combine these approaches by utilizing a TV penalty on the first frame (), and an sparsity penalty on the subsequent difference frames, and solve the following minimization problem
(10)  
Note that in the above we are solving for the entire video sequence in one large optimization problem. This is the reconstruction technique we use in the experimental results section.
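The difference-frame transform (9) and its inversion by a running sum (equivalently, multiplication by a lower-triangular matrix of ones) can be verified in a few lines; the frame sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
T, n = 6, 4
x = rng.random((T, n * n))                                # vectorized frames
# Difference frames: theta_0 = x_0, theta_t = x_t - x_{t-1} for t >= 1.
theta = np.diff(x, axis=0, prepend=np.zeros((1, n * n)))
x_rec = np.cumsum(theta, axis=0)                          # running-sum inversion
# Equivalent matrix form: lower-triangular matrix of all ones.
L = np.tril(np.ones((T, T)))
x_rec2 = L @ theta
```

Since successive frames are highly correlated, the difference frames are typically much sparser than the frames themselves, which is what the penalty in (10) exploits.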
Empirically, this leads to better reconstruction performance; however, it may pose computational challenges for video sequences of longer duration. In such cases, an effective approach is to process the video in an online fashion, solving for only a few blocks of frames simultaneously. In many applications, such as surveillance and monitoring, the video frames are strongly correlated, and the solution to a previous block of frames may be used as an initialization to the algorithm for the next block of frames. This estimate will generally be very accurate, and therefore relatively few iterations are needed to obtain a solution to each optimization problem. In previous work, we noticed that when the amount of processing time allotted per frame is held constant, the accuracy generally increases with the number of frames processed simultaneously. However, one cannot solve for arbitrarily many frames simultaneously, as the improvement in accuracy diminishes when the size of the problem is such that only a very small number of reconstruction iterations can be run within the allotted time. We refer the reader to [28] for details.
V-B. Reconstruction using Estimated Optical Flow
Optical flow is a measurement of the spatial motion of the grayscale intensity of a video sequence over time [46] and has enjoyed success in the computer vision community for several applications such as video coding [47], and robot vision [48]. The development of fast algorithms for estimating optical flow, such as [49] or [50] based on convex optimization, has enabled its use in the signal processing community. In addition to its use in this work, we have successfully employed optical flow in related work to design adaptive sensing matrices based on compressive coded apertures [51].
Mathematically, optical flow is a time-varying vector field over the image space which describes how the grayscale intensity of frame is propagated to frame . For two-dimensional video sequences, we consider , where is the horizontal component and is the vertical component of the optical flow. We can express the effects of the optical flow as a matrix such that . Grouping these constraints over the duration of the video sequence, we can express this as , where
(11) 
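For a constant integer flow, the warp matrix can be written down explicitly as a permutation with circular boundary handling. This toy construction (real optical flow fields are spatially varying and sub-pixel, requiring interpolation) illustrates the relation between consecutive frames:

```python
import numpy as np

def flow_warp_matrix(n, u, v):
    """Permutation matrix propagating an n x n frame along a constant integer
    flow (u, v), with circular boundary handling (toy model)."""
    P = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            src = i * n + j                          # pixel (i, j) in frame t
            dst = ((i + v) % n) * n + ((j + u) % n)  # its location in frame t + 1
            P[dst, src] = 1.0
    return P

n = 4
frame = np.arange(n * n, dtype=float)
P = flow_warp_matrix(n, u=1, v=0)   # shift one pixel to the right
warped = P @ frame
```

In the reconstruction, such warp matrices supply the constraints linking estimated frames, penalizing deviations from the estimated motion field.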
In this work, we use optical flow as a method to enforce smooth motion fields over an estimated video sequence. Our approach is based on the CS-MUVI approach [52]. The reconstruction proceeds in the following steps:

- From the low-rate coded data , we compute the low-resolution estimate in (8).
- We upsample , to match the frame rate and resolution of the sequence , . In our implementation, we use spline interpolation based on the interp3 MATLAB command.
In our implementation, we consider sparsity in the Daubechies-4 wavelet basis. Here and are tuning parameters that dictate how tightly the data and optical flow constraints should be enforced. The convex program above can be recast as a standard constrained program which can be solved using SPGL1 [44]. The added optical flow constraints are an effective tool for enforcing smooth motion in the video sequence, as well as for constraining the optimization program, which, in our experiments, results in faster convergence behavior. We demonstrate the effectiveness of this method in Section VI.
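The spline upsampling step can be sketched with a cubic-spline zoom in time and space; here `scipy.ndimage.zoom` is used as a stand-in assumption for MATLAB's `interp3`, and the rate and resolution factors are illustrative.

```python
import numpy as np
from scipy.ndimage import zoom

rng = np.random.default_rng(8)
low = rng.random((7, 8, 8))          # low-rate, low-resolution estimate (T/K frames)
K, d = 4, 2                          # temporal and spatial upsampling factors
# Cubic-spline interpolation (order=3) along time and both spatial axes.
up = zoom(low, (K, d, d), order=3)
```

The upsampled sequence is only an intermediate product: it is used to estimate the optical flow field, not reported as the final reconstruction.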
VI. Numerical Experiments
[Fig. 4: Frames 5 and 24 of the reconstructions and their residuals for ground truth, conventional spline upsampling, CAKE, Dual-Scale Mask CAKE, and Optical Flow CAKE.]

Table I: Total RMSE over the sailboat ROI for each acquisition and reconstruction method.

Conventional (spline upsampling): 0.7192%
CAKE: 0.5270%
Dual-Scale Mask CAKE: 0.4764%
Optical Flow CAKE: 0.4503%
In this section, we demonstrate the effectiveness of the proposed CAKE architecture at recovering a video sequence of shortwave infrared (SWIR) data collected (courtesy of Jon Nichols at NRL) by a shortwave IR () camera. The camera is based on an InGaAs (indium gallium arsenide) focal plane array with pixel pitch. Optically, the camera was built around a fixed, aperture and provides a field of view along the diagonal with a focus range of . Imagery was output at the standard 30 Hz frame rate with a 14-bit dynamic range. An example frame is shown in Fig. 3. In this sequence, the three boats are traveling at different velocities with respect to the slowly changing background of the waves. The size of each frame is , and we consider reconstructing 28 frames of the sequence.
We consider CAKE observations where we downsample spatially by a factor of 2 in both directions and use a fixed coded-exposure block length. We compare three CAKE paradigms: single mask, dual-scale masks, and CAKE using optical flow. We reconstruct the entire video sequence from all low-resolution, low-rate frames using a total variation penalty on the first frame and a sparsity penalty on all subsequent difference frames. We optimize the regularization parameters, for the single and dual-scale mask CAKE programs (10) and for CAKE with optical flow (12), along with the weights in (7), to minimize the reconstruction error. For comparison, we consider traditionally captured data (i.e., obtained by simply averaging over blocks of the spatiotemporal video sequence). To interpolate these data to the original resolution of the video sequence, we use spline interpolation.
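For concreteness, the CAKE measurement process described above (per-frame random sign masks, summation over a coded-exposure block, then spatial downsampling by 2 in each direction) can be sketched as follows; the frame size, the block length of 4, and the use of block-averaging for the spatial downsampling are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

T = 28          # number of frames (as in the experiments)
B = 4           # assumed coded-exposure block length
N = 64          # assumed high-resolution frame size (N x N)
D = 2           # spatial downsampling factor in each direction

video = rng.random((T, N, N))                     # stand-in for the SWIR data
masks = rng.choice([-1.0, 1.0], size=(T, N, N))   # per-frame sign masks

# One coded snapshot per block of B frames: modulate each frame by its
# mask, sum over the exposure block, then block-average spatially.
snapshots = []
for t0 in range(0, T, B):
    coded = (masks[t0:t0 + B] * video[t0:t0 + B]).sum(axis=0)
    low = coded.reshape(N // D, D, N // D, D).mean(axis=(1, 3))
    snapshots.append(low)
snapshots = np.stack(snapshots)

print(snapshots.shape)  # (7, 32, 32): T // B low-rate, low-resolution frames
```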
We show the estimates for the different acquisition and reconstruction methods in Fig. 4, focusing on a region of interest (ROI) around the sailboat. We see that CAKE sensing is able to reconstruct the scene with higher spatial and temporal resolution. This is evidenced in the residuals, which contain much less critical scene structure than those of the conventional system. Examination of the difference frame shows that the nearest-neighbor reconstruction from conventionally sampled data yields only slight motion over the two frames and suffers from poor spatial resolution. Spline interpolation helps improve the spatial resolution, but it is still insufficient to recover the scene with high temporal resolution, as can be seen in the blur on the leading edge of the sail. Numerically, we quantify the performance over the entire video sequence in terms of the RMSE (%), calculated over the sailboat ROI in order to assess the ability to capture motion information in the scene; the results are tabulated in Table I. In summary, the CAKE acquisitions outperform traditionally sampled video both in reconstruction accuracy and in recovering salient motion. On average, to reconstruct the 28 video frames, CAKE took 96 minutes and dual-scale mask CAKE took 93 minutes, while CAKE with optical flow took only 31 minutes. Conventional spline upsampling took roughly 23 seconds.
It should be noted that there are limitations to the CAKE system in the presence of strong motion. In this case, the sparsity level of the difference frames may increase drastically as the previous frame ceases to be a good prediction of the next frame. The RIP bound in Theorem 2 then implies that the number of measurements required to reconstruct the scene must necessarily increase, requiring either faster temporal sampling or a higher-resolution FPA to achieve the same accuracy. Because of this balance, a system designer may need to make important engineering tradeoffs to implement CAKE acquisition for a particular application.
Table I: Reconstruction RMSE (average over 10 trials)

Sensing Architecture           | Reconstruction RMSE
Conventional Spline Upsampling | 0.7192%
CAKE                           | 0.5270%
Dual-Scale Mask CAKE           | 0.4764%
Optical Flow CAKE              | 0.4503%
VII Hardware and Practical Implementation Considerations
In this section we describe how we shift from modeling the coded aperture masks in a way that is compatible with compressed sensing theory to a model that describes their actual implementation in an optical system. In particular, gaps between theory and practice arise due to:

limited lens bandwidth relative to the desired resolution;

system calibration issues;

nonnegativity of scenes and sensing matrices in incoherent light settings;

inaccurate synchronization of spatial light modulators and sensor readout;

low photon efficiency of some architectures;

hardware implementation of downsampling operation;

quantization errors in observations.
Each of these is described in more detail below.
Lens bandwidth: Our analysis does not account for the bandwidth of the lenses; in particular, we implicitly assume that the bandwidth of the lenses is high enough that bandlimitation effects are negligible at the resolution of interest.
System calibration issues: In all of the hardware settings described below, precise alignment of the optical components (e.g., the mask and the focal plane array) is critical to the performance of the proposed system. Often a high-resolution FPA is helpful for alignment and calibration. Even when we have the ability to estimate the sensing matrix precisely, there are settings where using an approximation of it has advantages; for instance, when we can approximately compute its action using fast Fourier transforms, conducting sparse recovery is much faster than with a dense matrix representation. When we run a sparse recovery algorithm with an inaccurate sensing matrix, it corresponds to the observation model
where the perturbation term represents the difference between the true projections collected by the imager and the assumed projections, and can be thought of as signal-dependent noise. Recent work analyzes the theoretical ramifications of these kinds of errors [54].
Nonnegativity in incoherent light settings: In this paper we focus on incoherent light settings (consistent with many applications in astronomy, microscopy, and infrared imaging). In this case, the coded aperture must be real-valued and flux-preserving (i.e., the light intensity hitting the detector cannot exceed the light intensity of the source). In a conventional lensless coded aperture imaging setup, the point spread function associated with the aperture is the mask pattern itself. To shift a RIP-satisfying aperture as in Theorem 2 to an implementable aperture, one simply needs to apply an affine transform to the mask entries. This transform ensures that the resulting mask pattern is nonnegative and flux-preserving. In previous work [29, 30], we discuss accounting for nonnegativity during reconstruction by a mean subtraction step.
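A minimal sketch of such an affine transform, assuming a ±1 RIP-style pattern mapped to a {0, 1} transmittance mask (the mask size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
rip_mask = rng.choice([-1.0, 1.0], size=(16, 16))  # RIP-style +/-1 pattern

# Affine transform to a physically realizable amplitude mask: map
# {-1, +1} to {0, 1}, i.e. opaque / fully transmissive elements.
# The result is nonnegative and flux-preserving (transmittance <= 1).
physical_mask = (rip_mask + 1.0) / 2.0

# At reconstruction time the mean can be subtracted back out, recovering
# the zero-mean pattern (cf. the mean-subtraction step discussed above).
recovered = 2.0 * physical_mask - 1.0
print(np.array_equal(recovered, rip_mask))  # True
```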
Synchronization issues: Programmable spatial light modulators (SLMs) allow changing the mask pattern over time, a critical component of the CAKE concept. However, in order for the proposed approach to work, the mask or SLM used must have higher resolution than the FPA. Currently, very high resolution SLMs are still in development. Chrome-on-quartz masks can be made with higher resolution than many SLMs, but cannot be changed on the fly unless we mount a small number of fixed masks on a rotating wheel or translating stage. Furthermore, the changing of the mask patterns must be carefully synchronized to the timing of readout on the focal plane array.
Photon efficiency: Phase-shift masks for coded aperture imaging have been implemented recently using a phase screen [55]. This approach allows one to account for diffraction in the optical design. However, depending on the precise optical architecture, phase-shift masks may be less photon-efficient than amplitude modulation masks.
Downsampling in hardware: In developing RIP-satisfying coded apertures, Theorem 2 assumes the subsampling operation selects one measurement per block. From an implementation standpoint, this operation effectively discards a large portion of the available light, which results in much lower signal-to-noise ratios at the detector. A more pragmatic approach is to use larger detector elements that essentially sum the intensity over each block, making better use of the available light. We call this operation integration downsampling to distinguish it from subsampling. The drawback to this approach is that we lose many of the desirable features of the system in terms of the RIP: integration downsampling causes large coherence between neighboring columns of the resulting sensing matrix. An intermediate approach would randomly sum a fraction of the elements in each block, which increases the signal-to-noise ratio relative to subsampling while yielding smaller expected coherence than integration downsampling. This approach is motivated by the random demodulation proposed in [31] and described in Sec. IV, whereby the signal is multiplied by a random sequence of signs and then blockwise averaged. The pseudorandom summation proposed here can be thought of as an optically realizable instantiation of the same idea, where we instead multiply by a random binary sequence.
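The three downsampling options can be compared on a single detector block; the block size is assumed for illustration, and the binary pattern stands in for the pseudorandom summation:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 4                        # assumed: each detector block covers D x D pixels
block = rng.random((D, D))   # nonnegative light intensities over one block

# (a) Subsampling: keep one pixel and discard the rest of the light.
sub = block[0, 0]

# (b) Integration downsampling: the large detector sums all D*D intensities.
integ = block.sum()

# (c) Pseudorandom partial summation: multiply by a random binary pattern
# and sum, collecting roughly half the light on average.
pattern = rng.integers(0, 2, size=(D, D)).astype(float)
partial = (pattern * block).sum()

# Neither option can exceed full integration, and in expectation the
# partial sum collects D*D/2 times as much light as subsampling.
print(sub <= integ and partial <= integ)  # True
```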
Noise and quantization errors: While CS is particularly useful when the FPA needs to be kept compact, it should be noted that CS is more sensitive to measurement errors and noise than more direct imaging techniques. The experiments conducted in this paper simulated very high signal-to-noise ratio (SNR) settings and showed that CS methods can help resolve high-resolution features in images. However, in low-SNR settings CS reconstructions can exhibit significant artifacts that may even cause more distortion than the low-resolution effects associated with conventional coded aperture techniques such as MURA. Similar observations are made in [56], which presents a direct comparison of the noise robustness of CS against conventional imaging techniques, both in terms of bounds on how reconstruction error decays with the number of measurements and in a simulation setup; the authors conclude that for most real-world images, CS yields the biggest gains in high-SNR settings. Related theoretical work in [57] shows that in the presence of low-SNR photon noise, theoretical error bounds can be large, and thus the expected performance of CS may be limited unless the number of available photons is sufficiently high. These considerations play an important role in choosing the type of downsampling to implement. Similar issues arise when considering the bit depth of focal plane arrays, which corresponds to measurement quantization errors. Future efforts in designing optical CS systems must carefully consider the amount of noise anticipated in the measurements to find the optimal tradeoff between focal plane array size and image quality.
VIII Conclusions
In this paper, we demonstrate how CS principles can be applied to physically realizable optical system designs, namely coded aperture imaging. Numerical experiments show that CS methods can help resolve high-resolution features in videos that conventional imaging systems cannot. We have also demonstrated that our CAKE acquisition system can recover video sequences from highly undersampled data in much broader settings than those considered in initial coded exposure studies. However, our derived theoretical limits show that there are important tradeoffs, dependent on the spatial sparsity and temporal correlations in the scene, that ultimately govern the accuracy of our reconstructions for a specified number of compressive measurements. Finally, we note that our proposed approach performs well in high signal-to-noise ratio settings, but like all CS reconstructions, it can exhibit significant artifacts in high-noise settings.
Two practical issues associated with coded aperture imaging in general are blur due to misalignment of the mask and diffraction and interference effects. Non-compressive aperture codes have been developed to be robust to these effects [58]. One important avenue for future research is the development of compressive coded aperture masks with similar robustness properties. Non-compressive coded apertures have also been shown to be useful in inferring the depth of different objects in a scene; similar inference may be possible with the compressive coded apertures described in this paper. However, the current proof techniques used here or in [40] cannot be directly applied, as the additional structure in the DSM masks results in random vectors that are not isotropic sub-Gaussian random vectors.
Acknowledgements
The authors would like to thank Prof. Justin Romberg and Dr. Kerkil Choi for several valuable discussions, Dr. Aswin Sankaranarayanan for providing code to reproduce the results in [52], and Dr. Seda Senay for assisting with the numerical experiments.
A Proof of the RIP for CAKE Sensing
Here we present the proof of Theorem 2, which states that our CAKE sensing architecture satisfies the restricted isometry property. The proof uses the same techniques as that of Theorem 4 in [39], where the RIP is established by shifting the analysis of the submatrices of the sensing matrix to the entries of the Gram matrix by invoking Geršgorin's disc theorem [59]. This theorem states that the eigenvalues of an $n \times n$ complex matrix $A$ all lie in the union of discs $D_i$, $i = 1, \ldots, n$, centered at $a_{ii}$ with radius $r_i = \sum_{j \neq i} |a_{ij}|$.
In essence, we show that, with high probability, the off-diagonal mass of the Gram matrix is small, so that its eigenvalues are clustered around one with suitably high probability.
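As a small numerical illustration of this argument (with an assumed size and perturbation level), one can verify Geršgorin's theorem on a Gram-like matrix whose diagonal is close to one:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
# A Gram-like matrix: unit diagonal plus a small symmetric perturbation,
# mimicking A^T A for a sensing matrix with near-orthonormal columns.
G = np.eye(n) + 0.02 * rng.standard_normal((n, n))
G = (G + G.T) / 2.0

radii = np.abs(G).sum(axis=1) - np.abs(np.diag(G))   # Gersgorin radii
eigvals = np.linalg.eigvalsh(G)

# Every eigenvalue lies in some disc centered at G[i, i] of radius radii[i],
# so all eigenvalues are clustered near one when the radii are small.
in_disc = [any(abs(ev - G[i, i]) <= radii[i] + 1e-12 for i in range(n))
           for ev in eigvals]
print(all(in_disc))  # True
```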
Proof:
Note that the Gram matrix has a certain block structure; since , we have
In the interest of notational simplicity, it is easier to work on the twodimensional images versus their onedimensional vectorial representations. This means that for our coded aperture mask patterns , instead of our previous indexing notation , we use two dimensional indexing such that
Additionally, for the sensing matrices , we use the indexing
We also define indices using modular arithmetic.
We first focus on the diagonal blocks of this matrix, for which we need a relationship between the sensing matrices and the mask patterns. It can be shown that
so the entries of the diagonal blocks of are
(13)  
From the normalization, it is straightforward to show that the diagonal entries have unit expectation. Now we need to bound the deviation about the mean via concentration. The diagonal terms are simply sums of bounded i.i.d. entries, and hence, for all indices, Hoeffding's inequality yields
Next we consider the off-diagonal entries of the diagonal blocks. There are two cases to consider. In the first case, each of the terms in the summand in (13) picks out a different set of coefficients from the mask pattern. The second case needs special care, since there are now dependencies between the terms in the summand in (13). However, due to the special manner in which we select the coefficients, we can partition the sum into two sums, each of which is a sum of independent terms. We can then apply the Hoeffding bound to each to yield
Then applying the union bound gives us
(14) 
Since this latter bound decays more slowly, and in the interest of simplicity, we overbound all the offdiagonal entries of the diagonal blocks using this latter expression.
Now consider the off-diagonal blocks. By independence, their entries have zero mean. These entries are sums of independent terms, so we could apply Hoeffding's bound directly; however, to obtain a simple expression we overbound them with (14), hence we use
Lastly, we need to perform a union bound over all the entries of the matrix. We have diagonal entries, and exploiting symmetry, we only have remaining entries. Hence taking ,
where . If , this probability is less than one provided for any where , noting that establishes the theorem. ∎
B Coarse Reconstruction using DualScale Masks
This section verifies (8). First, we decompose into two terms,
(15) 
such that the first term captures the coarse-scale component and the second is a residual term. In the sequel we define the residual as
Now consider our measurements. For simplicity, assume we have these without noise. Then if we define
we can write the observations as
where we used the decomposition (15) in the third equality. Turning to our coarse estimate (8), we have
(16)  
It can be shown that . For simplicity of notation, we show this for 1D signals; the extension to higher dimensions is straightforward. Furthermore, we drop the subscripts temporarily. We show that , since by construction of the random phases . To see this, first let , then by construction we have
and so
Now let us examine :
and hence as claimed.
Because , we have
where the error term is defined as the last two terms in (16). Note that we expect this error to be small in comparison to the leading term. In particular, the first of the two terms will be small because the energy in the coarse component will be much greater than that in the residual for smoothly varying video sequences. The second term will be zero-mean because the mask components are generated independently from zero-mean distributions.
References
 [1] E. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489–509, 2006.
 [2] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 [3] M. Shankar, R. Willett, N. Pitsianis, T. Schulz, R. Gibbons, R. T. Kolste, J. Carriere, C. Chen, D. Prather, and D. Brady, “Thin infrared imaging systems through multichannel sampling,” Applied Optics, vol. 47, no. 10, pp. B1–B10, 2008.
 [4] A. F. Coskun, I. Sencan, T.-W. Su, and A. Ozcan, “Lensless wide-field fluorescent imaging on a chip using compressive decoding of sparse objects,” Optics Express, vol. 18, no. 10, pp. 10510–10523, 2010.
 [5] L. C. Potter, E. Ertin, J. T. Parker, and M. Cetin, “Sparsity and compressed sensing in radar imaging,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1006–1020, 2010.
 [6] J. Gu, S. K. Nayar, E. Grinspun, P. N. Belhumeur, and R. Ramamoorthi, “Compressive structured light for recovering inhomogeneous participating media,” in Proceedings of the European Conference on Computer Vision, 2008.
 [7] M. Mohtashemi, H. Smith, D. Walburger, F. Sutton, and J. Diggans, “Sparse sensing DNA microarray-based biosensor: Is it feasible?,” in 2010 IEEE Sensors Applications Symposium, February 2010, pp. 127–130.
 [8] M.A. Sheikh, O. Milenkovic, and R.G. Baraniuk, “Designing compressive sensing DNA microarrays,” in 2nd IEEE International Workshop on Computational Advances in MultiSensor Adaptive Processing, 2007, December 2007, pp. 141–144.
 [9] M. Lustig, D. Donoho, and J. M. Pauly, “Sparse MRI: The application of compressed sensing for rapid MR imaging,” Magnetic Resonance in Medicine, vol. 58, no. 6, pp. 1182–1195, December 2007.
 [10] A. C. Gurbuz, J. H. McClellan, and W. R. Scott, “A compressive sensing data acquisition and imaging method for stepped frequency GPRs,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2640–2650, July 2009.
 [11] P. Ye, J. L. Paredes, G. R. Arce, Y. Wu, C. Chen, and D. W. Prather, “Compressive confocal microscopy,” in IEEE International Conference on Acoustics, Speech and Signal Processing, April 2009, pp. 429–432.
 [12] J. Bobin, J.-L. Starck, and R. Ottensamer, “Compressed sensing in astronomy,” IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 5, pp. 718–726, October 2008.
 [13] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, “Single pixel imaging via compressive sampling,” IEEE Sig. Proc. Mag., vol. 25, no. 2, pp. 83–91, 2008.
 [14] J. G. Ables, “Fourier transform photography: a new method for X-ray astronomy,” Proceedings of the Astronomical Society of Australia, vol