A Unified Framework for Multi-Sensor HDR Video Reconstruction
One of the most successful approaches to modern high quality HDR-video capture is to use camera setups with multiple sensors imaging the scene through a common optical system. However, such systems pose several challenges for HDR reconstruction algorithms. Previous reconstruction techniques have considered debayering, denoising, resampling (alignment) and exposure fusion as separate problems. In contrast, in this paper we present a unifying approach, performing HDR assembly directly from raw sensor data. Our framework includes a camera noise model adapted to HDR video and an algorithm for spatially adaptive HDR reconstruction based on fitting of local polynomial approximations to observed sensor data. The method is easy to implement and allows reconstruction to an arbitrary resolution and output mapping. We present an implementation in CUDA and show real-time performance for an experimental 4 Mpixel multi-sensor HDR video system. We further show that our algorithm has clear advantages over existing methods, both in terms of flexibility and reconstruction quality.
keywords:HDR Video Capture, HDR Reconstruction, Local Polynomial Approximation, Camera Noise
High dynamic range (HDR) video is an emerging field of technology, by many considered to be one of the key components of future imaging, that will enable a wide range of new applications in image processing, computer vision, computer graphics and cinematography. A few prototype HDR-video camera systems have been presented recently, showing that the computational power and bandwidth is now high enough to handle high resolution HDR-video processing and storage. Current CMOS and CCD imaging sensors are still unable to accurately capture the full dynamic range in general scenes. There are sensors with logarithmic response to incident light that exhibit significantly higher dynamic range compared to standard CMOS and CCD sensors Technologies (2008). These are, however, in most applications still not accurate enough due to problems with image noise and that the entire dynamic range is usually quantized to 10-12 bit output. To date, the most successful approach for high quality HDR-video capture has instead been to use camera setups with multiple sensors imaging the scene through a common optical system Aggarwal and Ahuja (2004); Mcguire and Hughes (2007); Tocci et al. (2011); Kronander et al. (2013). The idea is similar to exposure bracketing Debevec and Malik (1997), but instead of varying the exposure time, the optical setup is designed to make the sensors capture different exposures of the scene by the use of different neutral density (ND) filters. By capturing all exposures simultaneously, temporal artifacts such as ghosting and incorrect motion blur inherent to techniques where the exposure time is varied, e.g. Kang et al. (2003); Unger and Gustavson (2007), are avoided. Although multi-sensor systems currently provide the (arguably) best alternative for high quality HDR video capture, there has been a lack of a formalized framework for efficient and accurate image reconstruction for general sensor configurations.
In order to reconstruct HDR images from multi-sensor systems it is necessary to perform: demosaicing of color filter array (CFA) sampled data, resampling to correct for geometric misalignments between the sensors, HDR assembly to fuse the input low dynamic range (LDR) images into the output HDR image, and denoising to reduce image and color noise.
Most existing HDR reconstruction techniques, e.g. Debevec and Malik (1997); Ajdin et al. (2008); Granados et al. (2010); Tocci et al. (2011), have considered this as separate steps in a pipeline fashion and performed demosaicing and realignment either before or after HDR assembly, see Figure 1. This pipeline approach introduces several problems. Demosaicing before HDR assembly as in Debevec and Malik (1997), causes problems with bad or missing data around saturated pixels. This is especially problematic as color channels usually saturate at different levels. Demosaicing after HDR assembly as in Ajdin et al. (2008); Tocci et al. (2011) causes problems with blur and ghosting unless the sensors are perfectly aligned. For multi-sensor systems using high resolution sensors, it is problematic, costly, and sometimes even impossible to match the sensors so that the CFA patterns align correctly. Tocci et al. (2011) report misalignment errors around the size of a pixel despite considerable alignment efforts. This means that overlapping pixels from different sensors might not have the same spectral response. Also, treating demosaicing, resampling and HDR assembly in separate steps makes it difficult to analyze and incorporate sensor noise in the HDR reconstruction in a coherent way.
The main contribution of this article is a filtering framework that addresses all of these steps simultaneously in a formalized way. Our reconstruction method takes into account the entire set of non-saturated LDR samples available around the output HDR pixel, and performs demosaicing, resampling, HDR assembly and denoising in a single operation. The framework is based on spatially adaptive cross-sensor filtering, and uses a noise-aware local polynomial approximation (LPA) approach adapted to include heterogeneous noise in the reconstruction. Using our framework, we present two flexible and robust algorithms for high quality HDR reconstructions using both isotropic filtering support and adaptive anisotropic filtering support with correlations between the color channels taken into account. A key feature of our framework is its computational speed, making it applicable to real-time HDR reconstruction for video applications. Using isotropic filtering, our CUDA implementation of the reconstruction algorithm runs at a sustained rate of 25 fps for input data from a 4 Mpixel system with four sensors. As a part of our framework, we have also developed a new sensor noise model.
To demonstrate the usefulness of our approach, we evaluate the algorithms and noise model using both simulated data for a range of possible multi-sensor setups for HDR-imaging as well as on data from an experimental multi-sensor HDR-video camera prototype.
2 Related Work
To the authors’ knowledge, no previous approach has considered HDR reconstruction with joint realignment, noise reduction and CFA interpolation. While our method is not designed to provide optimal results with respect to denoising, demosaicing or super-resolution alone, we take a unified approach based on a model of sensor noise and provide a flexible framework which is straightforward to implement and use. Our method is also well suited to the needs of a versatile high dynamic range video system, by allowing fast parallel processing and an arbitrary resolution for the reconstruction. In comparison to our previous work Kronander et al. (2013), where we only consider isotropic filtering support, we use adaptive anisotropic filtering support to adapt the reconstruction to the local image structure. This also enables us to incorporate correlations between the color channels in the reconstruction, similar to modern demosaicing methods Gunturk et al. (2005).
The algorithms presented in this paper are related to a large body of previous work, ranging from HDR capture and reconstruction, e.g. Reinhard et al. (2010); Myszkowski et al. (2008), to theory and algorithms for accurate image reconstruction and image fusion, e.g. Astola et al. (2006). In this section, we give an overview of the previous work most closely related to the methods proposed in this paper.
2.1 HDR Video Capture
Nayar and Mitsunaga (2000) proposed an HDR imaging system suitable for video capture, based on placing a neutral density (ND) filter array over the image sensor introducing a tradeoff between spatial resolution and dynamic range. While early methods only considered grayscale images the approach was later generalized to include spectral sampling using a spatially varying exposure and color filter array Narasimhan and Nayar (2005); Yasuma et al. (2010). Solutions to the sampling and resolution enhancement of image dimensions such as spatial, spectral, temporal and intensity information have also been solved previously using multiple sensors for the same optical viewpoint. A simple example is the tri-color CCD system, where a beam splitter separates the color spectrum into three components (R, G, B) which are then measured individually on three different sensors. Aggarwal and Ahuja Aggarwal and Ahuja (2004) present a system for HDR video capture using a mirror pyramid aligned with the optical axis to split incoming light and consider different exposure times or ND-filters for the sensors, or they consider mirrors with an offset from the optical axis to effectively distribute light unevenly over the sensors. Mcguire et al. Mcguire and Hughes (2007) presented a framework for optimally selecting components for optical splitting trees given specific target goals, such as dynamic range, spectral sensitivity and cost budgets. Tocci et al. Tocci et al. (2011) recently presented a light efficient multi-sensor HDR video system utilizing up to 99.96% of the incident light, which has spawned a renewed interest in these systems. Hybrid approaches could also use a multi-sensor setup where one of the sensors includes a spatially varying ND-filter array. Our reconstruction method is general and applies to all of these various setups.
2.2 HDR fusion
The most common method for HDR reconstruction from a set of differently exposed Low Dynamic Range (LDR) images is to compute a per-pixel weighted average of the LDR measurements. The weights, often based on heuristics, are chosen to suppress image noise and remove saturated values from processing Mann and Picard (1995); Debevec and Malik (1997); Akyüz and Reinhard (2007) . Weight functions can also be based on more sophisticated camera noise models. Mitsunaga and Nayar Mitsunaga and Nayar (1999) derived a weight function that maximizes SNR assuming signal-independent additive noise. Kirk and Andersen Kirk and Andersen (2006) derived a weight function inversely proportional to the temporal variance of the digital LDR values. Granados et al. (2010) later extended this approach to include both spatial and temporal camera noise. While most previous methods consider only a single pixel at a time from each LDR exposure, Tocci et al. (2011) presented an algorithm that incorporates a neighborhood of LDR samples in the reconstruction. This method helps smooth the transition regions between sensors and performs well for sensors with large exposure differences, over 3 f-stops apart.
2.3 Non-parametric Image Processing
The last two decades have seen an increased popularity of image processing operations using locally adaptive filter weights, for applications in e.g. interpolation, denoising and upsampling. Examples include normalized convolution Knutsson and Westin (1993), the bilateral filter Tomasi and Manduchi (1998), and moving least squares Lancaster and Salkauskasr (1981). Recently, deep connections have been shown Takeda et al. (2007); Milanfar (2013) between these methods and traditional non-parametric statistics Loader (1999). In this paper, we fit Local Polynomial Approximations (LPA) Astola et al. (2006) to irregularly distributed samples around output pixels using a localized maximum likelihood estimation Tibshirani and Hastie (1987) to incorporate the heterogeneous noise of the samples.
2.4 Image Fusion
When multiple images of the same scene are captured within sub-pixel displacements of each other, super-resolution reconstruction methods can be applied. A class of computationally simple super-resolution techniques are those based on interpolation in a common high resolution grid Milanfar (2010). These methods typically first use some registration to put observed images in a common high resolution reference frame. The high-resolution image is then estimated in a least squares sense given a local neighborhood of observed pixels Farsiu et al. (2004); Hardie (2007). As noted in previous work, super-resolution and HDR imaging complement each other. Super-resolution increases the resolution in the spatial domain, and HDR increases the radiometric range and resolution. Several methods have also been proposed to incorporate both enhancements in the same processing framework Gunturk and Gevrekci (2006); Zimmer et al. (2011).
Most modern camera systems acquire color images using a sensor covered by a Color Filter Array (CFA), and they estimate the final color vector for each pixel from neighboring color samples. This process is known as demosaicing. The most frequently used CFA of today is the RGB Bayer pattern Bayer (1976), sampling the G channel on a quincunx grid and the R and B images on a rectangular grid at half the native resolution of the sensor. Several demosaicing methods have been developed through the years. A comprehensive overview can be found in Gunturk et al. Gunturk et al. (2005). Joint denoising and demosaicing have been considered in previous work K. Hirakawa (2006); Hirakawa (2008). Using a global maximum a posteriori optimization with priors that enforce correlation between color channels, Farsiu et al. (2006) proposed a framework for joint super-resolution and demosaicing. A similar approach was taken in Kachatou and van Silfhout (2008) for HDR imaging with joint demosaicing for sensors with non-destructive readout (i.e. perfect alignment between consecutive readouts/frames). A global MAP optimization can also be used for joint HDR fusion, realignment and demosaicing. However, it is computationally demanding and does not straightforwardly allow for parallel processing, making it less suitable for the HDR video applications considered in this paper.
3 Problem formulation
The goal of the algorithm presented in this paper is to generate a video stream of output HDR images based on input data from multi-sensor imaging systems which for each output frame capture a set of differently exposed raw CFA images , where . The algorithm is designed for multi-sensor systems where the sensors image the scene through a common front lens, and where the light path is split onto the sensors by a beam splitter arrangement. An example multi-sensor setup is displayed in Figure 2. We assume that misalignments between the sensors can be described as 2-D transformations, . The difference in exposure between the sensors, created by e.g. neutral density (ND) filters placed in the ray path, can be described as an exposure scaling coefficient for each sensor. Each pixel in the input images provides a measurement of the incident radiant power at the pixel location, scaled by an unknown factor due to the quantum efficiency of the sensor, Hasinoff et al. (2010). In this work we assume that the sensors have the same quantum efficiency or that it can be measured separately for each sensor so that it can be incorporated into the exposure scaling coefficient .
This yields the image formation model described in Figure 3. Each sensor image, , samples the incident radiant power, , at a set of discrete pixel locations. Using a linear index for pixels in each sensor image, we define the measured digital sample value at a pixel in image as . The samples, , contain measurement noise that is dependent on the incident radiant power, , as well as the sensor characteristics.
During capture, we assume that the sensors are synchronized or that the scene is static, so that the motion blur characteristics are the same for all sensors. The parameters that are allowed to vary between the sensors are: the exposure time , the sensor gain setting and the exposure scaling coefficient, .
Before estimating the HDR frames, we perform radiometric and geometric calibration of the sensors.
4.1 Radiometric Calibration
The (noisy) digital pixel values in the input images, , are first converted to radiant power estimates . This is carried out by subtracting a bias frame and taking into account the exposure settings (exposure time, exposure scaling, global gain) and per-pixel gain variations. The variance of the radiant power, , can then be estimated using a calibrated radiometric camera noise model. Similarly to Granados et al. (2010), we assume the noise to follow a Gaussian distribution. This assumption is described in detail in Section 6, where we give an in-depth overview of our radiometric noise model adapted for HDR video. It should be noted that our framework can use any radiometric noise model that assumes an approximately Gaussian noise distribution.
4.2 Geometric Alignment
The HDR reconstruction is performed in a virtual reference space, corresponding to a virtual sensor placed somewhere in the focal plane. The virtual sensor dimensions are chosen to reflect the desired output frame, unconstrained by the resolution of the input frames. Using a geometric calibration procedure for each sensor an affine transform, , is established which maps sensor pixel coordinates, , to the coordinates of the reference output coordinate system, . In general, the transformed input pixel coordinates, , are not integer-valued in the output image coordinates, and for general affine transformations the sample locations will become irregularly positioned. An example for three sensor images is shown in Figure 4.
5 HDR Reconstruction
To reconstruct the HDR output frame the relative radiant power is estimated for each pixel in the virtual sensor separately in a non-parametric fashion using the transformed samples . For each output pixel with integer valued, regularly spaced coordinates , a local polynomial model is fitted to observed samples in a local neighborhood, see Figure 4. We first discuss how a single color channel, , can be reconstructed independently of the other channels using LPA with isotropic filtering supports. In section 5.4 it is then shown how the first LPA estimate can be improved by using adaptive anisotropic filtering support. Using adaptive filtering support also enables us to utilize cross-correlation between color channels to improve the estimate. This is discussed in section 5.5.
5.1 Local Polynomial Model
To estimate the radiant power, , at an output pixel, we use a generic local polynomial expansion of the radiant power around the output pixel location . Assuming that the radiant power is a smooth function in a local neighborhood around the output location a M-th order Taylor series expansion is used to predict the radiant power at a point close to as:
where tril lexicographically vectorizes the lower triangular part of a symmetric matrix and where
Given the fitted polynomial coefficients, , we can thus predict the radiant power at the output location by , and the first order gradients by . In this work we only consider local expansions of order .
5.2 Maximum Localized Likelihood Fitting
To estimate the coefficients, , of the local polynomial model, we maximize a localized likelihood function Tibshirani and Hastie (1987) defined using a smoothing window centered around :
where is a smoothing matrix that determines the shape and size of the window. In section 5.3.1, we discuss how the shape and size of the window function and smoothing can be selected. To lessen the notational burden, the observed samples in the neighborhood, , are denoted below by with a linear index . Using the assumption of normally distributed radiant power estimates , see Section 6, the log of the localized likelihood for the polynomial expansion centered at is given by
where represents terms independent of . The polynomial coefficients, , maximizing the localized likelihood function is found by the weighted least squares estimate
Note that similarly to the kernel regression framework Takeda et al. (2007), using equivalent kernels, the solution to the weighted least squares problem can also be formulated as a locally adaptive linear filtering.
The expected mean square error of the reconstructed image depends on a trade-off between bias and variance of the estimate. This trade-off is determined by: the order of the polynomial basis , the window function , and the smoothing matrix .
Using a piecewise constant polynomial, , the estimator corresponds to an ordinary locally weighted average of neighboring sensor observations
However, locally weighted averages can exhibit severe bias at image boundaries and in regions around sensor saturation, due to the asymmetry of the number of available sensor measurements in these locations. By instead fitting a linear polynomial (a plane, ), the bias can be reduced significantly, see Figure 5. Introducing higher order polynomials is possible, but may lead to increased variance in the estimates. In this work, we only consider .
5.3.1 Window and scale selection
In this paper we only show results using a Gaussian window function
There are also other possible choices of symmetric smooth window functions, e.g. Epanechnikov or Tricube windows Astola et al. (2006). In this work we consider the Gaussian window function for its simplicity and widespread use.
The smoothing matrix, , affects the shape of the Gaussian. It is important that the support of the smoothing window is extended to incorporate enough samples into the estimate in Equation (6) so that is invertible (full rank). The simplest choice of the smoothing matrix is , where is a global scale parameter and I is the identity matrix. This corresponds to an isotropic filter support. The choice of is dependent on the sensor characteristics (noise) and the scene. We therefore treat as a user parameter. To reconstruct sharp images in low noise conditions, we generally choose the smallest possible such that is invertible. However, if the captured images exhibit severe noise, a larger scale parameter may be beneficial. For Bayer CFA sensors, we consider different scale parameters for the green and red/blue color channels, , as there are more green samples per unit area. For the remainder of this paper we will refer to the reconstruction method using a global smoothing matrix, , as Local Polynomial Approximation (LPA).
5.4 Adaptive LPA Reconstruction
The truncated local polynomial expansion assumes that the radiant power, , is a smooth function (up to order ) in a neighborhood of the reconstruction point . In many image regions these assumptions are not valid. For example, if an output pixel is located near a sharp edge, it cannot be represented accurately with a finite polynomial expansion. It is therefore desirable to adapt the window support so that it only includes observations on the same side of the edge.
To adapt the support of the local polynomial model to the data, we use a two step approach: First, we use LPA with and isotropic window support to compute an initial estimation of the local gradients, . Using the estimated gradients, we then locally adapt the smoothing matrix, , to the local structure of the signal at each output pixel .
We define an oriented, anisotropic smoothing matrix as where represents the local gradient covariance of the signal estimated from the data. To estimate , we use a parametric approach similar to Takeda et al. (2007). Specifically, we consider a decomposition of the covariance matrix as:
This corresponds to describing the covariance matrix as a function of three parameters: describing an elongation along the principal directions, describing a rotation angle and describing an overall scaling. The elongation, rotation and scaling parameters are estimated from a truncated singular value decomposition of weighted gradients in a local neighborhood around the output location :
where and are the estimated gradients along and at the nearby point , and represents a local window function. The singular values and along the diagonal of represent the energy along the dominant directions in the neighborhood defined by the local window function. Thus, the orientation angle is set so that the window is elongated orthogonal to the dominant gradient direction, i.e. from the second column of , as
The elongation parameter, is then set according to:
where is a regularization parameter. The scaling parameter, , is finally set according to
where is another regularization parameter, is the structure sensitivity parameter, and is the number of samples in the local neighborhood of the gradient analysis window . Intuitively the shape of the window support should be: circular and relatively large in flat areas, elongated along edges, and small in textured areas to preserve detail. For all results presented in this paper we fix the regularization parameters to and . The structure sensitivity parameter, , is treated as a user specified parameter, representing a tradeoff between denoising and detail preservation. For a more detailed analysis of the choice of parameters, we refer the reader to Takeda et al. (2007) and Milanfar (2013). Note that as we are primarily interested in irregularly sampled data we choose to adapt the window functions instead of the support of each observation (kernel) as in the work of Takeda et al.
5.5 Color Channel Correlation
Modern demosaicing methods consider the correlation between color channels to improve CFA interpolation. Commonly used heuristics are that edges should match between color channels and that interpolation should be performed along and not across edges. For Bayer pattern CFA arrays, in which the G channel is sampled more densely than the R and B channels, a common method is to first reconstruct the G channel, then use the gradients of the G channel when reconstructing the R and B channel.
To take into account the correlation between color channels in our reconstruction, we adapt this idea to the irregularly sampled input data from multi-sensor camera systems. First we estimate the gradients for the G channel using LPA with and isotropic window supports. We then use the G channel gradients to locally adapt as described in the previous section. Using the adapted , we then compute the radiant power estimates for all color channels, (R, G, B). This effectively forces the interpolation to be performed along and not across edges and preserves their location between the color channels. We refer to this reconstruction method as Color Adaptive Local Polynomial Approximation (CALPA).
6 Sensor noise model
In this section we describe a radiometric camera noise model adapted to HDR video. We assume a linear digital response and model the noise using a radiometric camera model taking into account the exposure times, gain settings and ND-filter coefficients for the different sensors. Our noise model is inspired by previous methods considering radiometric camera calibration Reibel et al. (2003); Foi et al. (2008) and optimal HDR capture methods Granados et al. (2010); Hasinoff et al. (2010).
Each observed digital pixel value, , is obtained using an exposure time, , and an exposure scaling, that is assumed constant over the image . The pixel response is a measurement of the radiant power reaching the image sensor, which we for convenience express as the number of photo-induced electrons collected per unit time, . The accumulated number of photoelectrons, , collected at the pixel during the exposure time, , follows a Poisson distribution with expected value
and variance where is a per pixel factor due to non-uniform photo-response. The recorded digital value, , is also affected by the sensor gain and signal independent readout noise,
where is the analog amplifier/sensor gain (proportional to the native ISO settings on modern cameras) and is the readout noise which generally depends on the gain setting of the sensor. As we focus on video applications with frame rates of fps or more, the dark current noise is neglected. For most modern camera sensors dark current noise has negligible effect for exposures less than a second Hasinoff et al. (2010). In contrast to previous work Hasinoff et al. (2010); Kronander et al. (2013) we do not assume a simple parametric model for the readout noise dependence on the gain, as we have found that such models do not generally provide a satisfactory fit to measured readout noise at different gain settings for modern camera sensors. To handle sensors with different gain settings, we instead calibrate the readout noise, for each separate gain/ISO setting considered.
Before reconstruction of the HDR output frame, each digital pixel value, , is independently transformed to an estimate of the number of photoelectrons reaching the pixel per unit time, . To compensate for the readout noise bias (blacklevel), , we subtract a bias frame, , from each observation. The bias frame is computed as the average of a large set of black images captured with the same camera settings as the observations but with the lens covered, so that no photons reach the sensor. The radiant power can then be estimated as
6.1 Variance estimate
We assume to follow a normal distribution with mean and standard deviation . For low light levels photon shot noise is generally dominated by signal independent readout noise approximately following a normal distribution. In contrast for brighter regions the Poisson distributed shot noise is well approximated by a normal distribution. Assuming no saturated or clipped pixel values, the variance of is given by
Due to the photon shot noise, to estimate, , we should ideally know the true value of , however in practice we are instead forced to use the (noisy) estimate, , obtained from equation (18). Thus we form an approximate variance estimate, , as
The analysis above assumes non-saturated pixels. In practice, we exclude saturated pixels by comparing each digital pixel value, , to a saturation frame. The model does not include the effect of pixel cross-talk, and the variances, , are assumed to be independent of each other. More complicated models of sensor noise also include the effects of saturation on the digital value variance, , see e.g. Foi (2009). However such models assume to follow a clipped normal distribution which requires an impractical non-linear estimation when computing the coefficients for the local polynomial approximations discussed in section 5.
6.2 Parameter calibration
The exposure scaling, , and per-pixel non-uniformity, , can in practice be estimated using a flat field image computed as the average over a large sequence of images, running the sensors at different exposure times ensure well exposed pixels for all sensors. Choosing one sensor as reference, the exposure scaling, , can be estimated as the spatial mean of the ratio, computed pixel-wise between the radiant power estimates, Eq. 18, of the sensors. The non-uniformity, , can then be computed as the per-pixel deviation from the exposure scaling, .
The variance of the readout noise, can be estimated, similarly to the bias frame, from a set of black images, captured with the same camera settings, (, ), as the observations but with the lens covered, so that no photons reach the sensor.
The sensor gain, , can be calibrated using the relation,
where the second equality follows from being Poisson distributed shot noise with . and can be estimated by averaged flat fields and the bias frame respectively, and as described above.
7 Algorithm evaluation
To evaluate our algorithm, we compare its performance against several reconstruction methods. As most previously proposed algorithms do not generalize to setups with arbitrary misalignments between the sensors, we include evaluations on setups where the sensors are assumed to be perfectly aligned as well as setups where the sensors only have a translational offset from each other.
We compare our algorithms with the following methods:
Tocci et al. : The recent real-time multi-sensor reconstruction algorithm presented by Tocci et al. (2011).
Debayer-first : A method performing demosaicing before HDR assembly, using heuristic linear radiometric weights.
Debayer-last: A method performing demosaicing after HDR assembly, using radiometric weights inversely proportional to the variance estimate given by our sensor noise model 6.
Granados et al. : The iterative method proposed by Granados et al. Granados et al. (2010) for high-quality HDR assembly of perfectly aligned data.
For the above methods we consider demosaicing using both bilinear interpolation and the gradient adaptive method presented by Malvar et al. (2004).
We base the majority of our comparisons on HDR image data generated from a camera simulation framework. In the simulator, the scene is represented by a virtually noise free high resolution HDR image, meticulously generated from a set of exposures captured 1 f-stop apart covering the dynamic range of the real scene. Each exposure is computed as an average of frames, in this case captured with a Canon 5D SLR camera. The camera simulation framework allows us to generate images of arbitrary bit depth, sensor misalignments, exposure settings and noise characteristics according to the model described in Section 6. For our experiments, we simulate different multi-sensor setups with different sensor misalignments, , and different noise characteristics, given the simulation parameters . We focus our evaluation on simulated sensors with three to four f-stops difference between the sensor exposures as this is the most common setup for multi-sensor HDR video systems in practice Tocci et al. (2011).
7.1 Perfect alignment
Using perfectly aligned sensors allows us to compare our more general algorithms to a range of previous methods only applicable to this type of sensor data. We note that these sensor setups are not very realistic for multi-sensor video applications. However, we show that our methods yield image quality that is comparable and in some cases even better than state-of-the-art high quality reconstruction methods even for these cases. In Figure 6 the local polynomial approximation (LPA) and color adaptive local polynomial approximation (CALPA) methods are compared to the Debayer-first method and the Debayer-last method. The virtual sensors were perfectly aligned, captured with exposure scalings and set to simulate the noise properties of a Canon 5D sensor, with parameters: , and as reported by Granados et al. (2010). The Debayer-first method performs poorly for high frequency details, introducing significant color artifacts. The Debayer-last method performs slightly better for bilinear CFA interpolation. Using the gradient adaptive method of Malvar et al. (2004) effectively suppresses color moiré artifacts in low contrast regions, however in high contrast regions, such as close to the light bulb filament, significant color artifacts are introduced. Compared to the Debayer-first method, the LPA reconstruction reduces color artifacts by performing resampling and CFA interpolation jointly with reconstruction. LPA reconstruction also gracefully handles transitional regions around sensor saturation points, as a smoothing spatial filtering and a cross-sensor blending are both inherent in the reconstruction. This has the benefit of implicitly performing denoising. Using anisotropic window supports, correlated between the color channels, CALPA produces even smoother results in high contrast regions, e.g. close to the filament. CALPA also produces less color artifacts than LPA in regions with high spatial frequency.
To compare our method to the algorithm proposed by Granados et al. (2010) we captured four exposures 4 f-stops apart with a Canon 40D camera, with calibrated gain (ISO 400), and readout noise variance (14 bit sensor). For a fair comparison to our method we perform demosaicing (bilinear CFA interpolation) after HDR-assembly for Granados method, in contrast to the publicly available code Granados (8 14) that performs a crude nearest neighbor demosaicing before HDR assembly. Careful inspection of the reconstructions, see e.g. Figure 7, shows that the method proposed by Granados et al. (2010) performs slightly better, although the difference is not striking. Note that in contrast to the method of Granados et al. (2010) our method handles arbitrary misalignments between the sensors.
7.2 Sensor Missalignment
In Figure 8, LPA and CALPA are compared to the real-time multi-sensor reconstruction method recently proposed by Tocci et al. (2011). The reconstructions were generated from 3 virtual exposures of a simulated Kodak Kai-04050 sensor (used in our experimental HDR video system) captured 5 f-stops apart. The lowest exposed and the highest exposed sensors are perfectly aligned, while the middle sensor has a translational offset of in pixel units. This is similar to the alignment errors reported by Tocci et al. for their camera system. The method of Tocci et al. copes rather well with translational misalignments, because each reconstructed pixel uses information from only one sensor. However, the reconstruction suffers from noise. This is expected as the method of Tocci et al. only considers observations from the highest exposed non-saturated sensor at each point. In contrast LPA and CALPA are based on a maximum likelihood approach incorporating information from all non-saturated sensor observations. CALPA effectively denoises the image by using larger filtering supports in flat image areas. Compared to the other methods, CALPA also reduces unwanted color artifacts in areas of high spatial frequency.
A robust multi-sensor reconstruction algorithm requires handling of general sub-pixel misalignments between the sensors, including rotational errors. Both LPA and CALPA gracefully handle any geometric misalignments between the sensors, as can be seen in Figure 10, where the reconstructions were based on 3 virtual Kodak Kai-04050 sensors simulated with 6 degree rotational errors between them. The Debayer-first method can also handle general misalignment errors including rotations. However, methods performing demosaicing before HDR assembly often suffer from color artifacts in regions of high spatial frequency, e.g. seen as color fringes in Figure 6. Methods performing demosaicing after resampling (alignment), e.g. Tocci et al. Tocci et al. (2011), cannot handle general misalignments, as the color filters no longer necessarily overlap after resampling.
7.3 LPA parameters
Varying the polynomial order and the scale parameter for the LPA method imposes a trade-off between image sharpness, noise reduction and processing speed. Figure 9b and 9c show reconstructions from three aligned virtual Kodak KAI-04050 sensors 4 f-stops apart and a resolution of pixels. The close-ups show how the image fidelity varies for different choices of polynomial order, , and the scale parameter, , described in Section 5.3. Using a constant fit, , leads to unwanted block artifacts and color fringes, especially at edge features and in regions were one sensor is close to saturation. The difference between orders and is more subtle, with only small improvements in visual quality. A low value for the scale parameter, , increases image sharpness, while a large reduces image noise.
8 Experimental validation
To evaluate the real world performance of our approach, we have implemented the LPA algorithm in CUDA to enable GPU processing of the output from a custom built multi-sensor HDR camera system.
The camera system used in our experiments is based on four high quality Kodak KAI-04050 CCD sensors with a resolution of pixels and RGB Bayer pattern CFA sampling. The sensors receive different amounts of light from an optical system with a common lens, a four-way beam splitter and four different ND filters, see Figure 11, top. The relay lens was custom built for this setup, and extends the optical path length to make room for the beam splitters. All other optics are standard, off-the-shelf components. The sensors have 12 bits linear A/D conversion, and the exposure scalings cover a range of , yielding a dynamic range equivalent to bits of linear resolution, commonly referred to as “24 f-stops” of dynamic range. The dynamic range can be extended further by varying the exposure times between the sensors. The sensors are connected to a host computer through a CameraLink interface. The system allows for capture, processing and off-line storage of up to 32 frames per second at 4 Mpixels resolution, amounting to around 1 GiB/s per second of raw data.
The misalignments between the sensors were measured by imaging a checkerboard calibration target and computing the cross-sensor correlation using OpenCV 2.2. The misalignments could be characterized as translations of a few pixels and as 2D rotations in the order of fractions of a degree. The sensor transformation matrices could be estimated with an accuracy in the order of pixels. The system was radiometrically calibrated as described in Section 6.2. The gain parameter was estimated for each pixel separately, and then spatially averaged to find the sensor gain . Readout noise was estimated for each pixel, with a cross-sensor spatial mean of e, and a standard deviation of e.
Our software implementation of the LPA algorithm required several seconds for processing of reasonably sized frames. To enable real time processing for high resolution video data, we implemented the LPA algorithm in CUDA. Two CUDA kernels are executed per frame. First, the raw sensor images are converted to a half-float radiant power estimate, , and a variance estimate, , is computed. Secondly, for each pixel in the output HDR image, all samples overlapping the observation window are read from global GPU memory. The corresponding weight is computed and used to compute the weighted least squares estimate described in Equation (6). Since memory transfer is a bottle-neck, we use page-locked and write-combined memory to improve efficiency. The data is transferred to the GPU for kernel execution and display, and then back to the CPU for disk storage. To enable simultaneous data transfer and kernel execution we utilize two CUDA streams.
If the sensors are aligned or related by purely translational transforms, and if the output resolution matches the input resolution, the spatial arrangement of samples in each observation window will be invariant across the image, and the window weights can be precomputed once. This results in a to speedup for a local constant fit .
Figure 11 shows an example frame captured using our experimental HDR-video system. Using our CUDA implementation with , and an output resolution of pixels, the four input frames are processed at fps using pre-computed weights on an Nvidia GForce 680 GPU. As can be seen in the tonemapped image, which is in full resolution to enable zooming in during on-screen viewing, the small rotational misalignments present in the hardware setup are not visible even though the weights are pre-computed. This is partly due to the smoothing inherent in the LPA method. Reconstructing the same frame without pre-computed weights achieves an interactive speed of fps. The performance scales linearly with the number of pixels, so the corresponding figures for reconstructing to HD resolution frames of are fps for pre-computed weights and fps without pre-computed weights.
9 Conclusions and Future Work
This paper presented a novel filtering framework for HDR reconstruction from multi-sensor images that performs all steps in the traditional HDR imaging pipeline in a single step. The reconstruction is based on spatially adaptive filtering and noise-aware local likelihood estimation using isotropic and anisotropic filter supports taking into account the correlation between color channels. The method uses a novel sensor noise model adapted to multi-sensor systems. We also presented an overview of a novel HDR video camera and showed how our algorithm achieves real-time performance. It should be noted that the framework is general in that the algorithms apply equally well to still image HDR reconstruction from e.g. exposure bracketing. The only requirement is that the sensor transformations can be accurately estimated.
As future work, we would like to extend the filter supports from 2D to 3D to include both the spatial and temporal domain in the reconstruction.
- Technologies (2008) N. I. Technologies, MAGIC Technology - Scene contrast indexed image sensing with WDR, White paper, www.new-imaging-technologies.com/download-wpan.html, 2008.
- Aggarwal and Ahuja (2004) M. Aggarwal, N. Ahuja, Split aperture imaging for high dynamic range, International Journal of Computer Vision 58 (1) (2004) 7–17.
- Mcguire and Hughes (2007) M. Mcguire, J. F. Hughes, Optical Splitting Trees for High-Precision Monocular Imaging, IEEE Computer Graphics And Applications 27 (2007) 32–42.
- Tocci et al. (2011) M. D. Tocci, C. Kiser, N. Tocci, P. Sen, A Versatile HDR Video Production System, ACM Transactions on Graphics (Proceedings of SIGGRAPH 2011) 30 (4) (2011) 41:1–41:10.
- Kronander et al. (2013) J. Kronander, S. Gustavson, G. Bonnet, J. Unger, Unified HDR reconstruction from raw CFA data, in: IEEE International Conference on Computational Photography (ICCP), 2013.
- Debevec and Malik (1997) P. Debevec, J. Malik, Recovering high dynamic range radiance maps from photographs, in: SIGGRAPH, 369–378, 1997.
- Kang et al. (2003) S. Kang, M. Uyttendaele, S. Winder, R. Szeliski, High dynamic range video, ACM Transactions on Graphics (Proceedings of SIGGRAPH 2003) 22 (3) (2003) 319–325.
- Unger and Gustavson (2007) J. Unger, S. Gustavson, High-dynamic-range video for photometric measurement of illumination, in: SPIE Electronic Imaging, 2007.
- Ajdin et al. (2008) B. Ajdin, M. B. Hullin, C. Fuchs, H.-P. Seidel, H. P. A. Lensch, Demosaicing by Smoothing along 1-D Features, in: CVPR, 2008.
- Granados et al. (2010) M. Granados, B. Ajdin, M. Wand, C. Theobalt, H. Seidel, H. Lensch, Optimal HDR reconstruction with linear digital cameras, in: CVPR, 2010.
- Gunturk et al. (2005) B. K. Gunturk, J. Glotzbach, Y. Altunbasak, R. W. Schafer, R. M. Mersereau, Demosaicking: color filter array interpolation, IEEE Signal Processing Magazine 22 (1) (2005) 44–54.
- Reinhard et al. (2010) E. Reinhard, W. Heidrich, S. Pattanaik, P. Debevec, G. Ward, K. Myszkowski, High dynamic range imaging: acquisition, display, and image-based lighting, Morgan Kaufmann, 2010.
- Myszkowski et al. (2008) K. Myszkowski, R. Mantiuk, G. Krawczyk, High Dynamic Range Video, Morgan Claypool, 2008.
- Astola et al. (2006) J. Astola, V. Katkovnik, K. Egiazarian, Local Approximation Techniques in Signal and Image Processing, SPIE Publications, 2006.
- Nayar and Mitsunaga (2000) S. Nayar, T. Mitsunaga, High dynamic range imaging: Spatially varying pixel exposures, in: CVPR, 2000.
- Narasimhan and Nayar (2005) S. G. Narasimhan, S. K. Nayar, Enhancing Resolution Along Multiple Imaging Dimensions Using Assorted Pixels, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (4) (2005) 518–530.
- Yasuma et al. (2010) F. Yasuma, T. Mitsunaga, D. Iso, S. K. Nayar, Generalized Assorted Pixel Camera : Postcapture Control of Resolution , Dynamic Range , and Spectrum, IEEE Transactions on Image Processing 19 (9) (2010) 2241–2253.
- Mann and Picard (1995) S. Mann, R. W. Picard, On Being ’undigital’ With Digital Cameras: Extending Dynamic Range By Combining Differently Exposed Pictures, in: IS&T, 1995.
- Akyüz and Reinhard (2007) A. O. Akyüz, E. Reinhard, Noise reduction in high dynamic range imaging, Journal of Visual Communication and Image Representation 18 (5) (2007) 366–376.
- Mitsunaga and Nayar (1999) T. Mitsunaga, S. K. Nayar, Radiometric Self Calibration, in: CVPR, ISBN 0769501494, 374–380, 1999.
- Kirk and Andersen (2006) K. Kirk, H. Andersen, Noise characterization of weighting schemes for combination of multiple exposures, in: Proc. British Machine Vision Conference (BMVC), 1129–1138, 2006.
- Knutsson and Westin (1993) H. Knutsson, C.-F. Westin, Normalized and differential convolution, in: CVPR, 1993.
- Tomasi and Manduchi (1998) C. Tomasi, R. Manduchi, Bilateral filtering for gray and color images, in: ICCV, 839–846, 1998.
- Lancaster and Salkauskasr (1981) P. Lancaster, K. Salkauskasr, Surfaces generated by moving least squares methods, Mathematics of Computation 87 (1981) 141–158.
- Takeda et al. (2007) H. Takeda, S. Farsiu, P. Milanfar, Kernel regression for image processing and reconstruction, IEEE Transactions On Image Processing 16 (2) (2007) 349–366.
- Milanfar (2013) P. Milanfar, A Tour of Modern Image Filtering: New Insights and Methods, Both Practical and Theoretical, IEEE Signal Processing Magazine 30 (1) (2013) 106 – 128.
- Loader (1999) C. Loader, Local regression and likelihood, New York: Springer-Verlag, ISBN 0-387-9877, 1999.
- Tibshirani and Hastie (1987) R. Tibshirani, T. Hastie, Local Likelihood Estimation, Journal of the American Statistical Association 82 (398) (1987) 559–567.
- Milanfar (2010) P. Milanfar (Ed.), Super-resolution Imaging, CRC Press, 2010.
- Farsiu et al. (2004) S. Farsiu, M. D. Robinson, M. Elad, P. Milanfar, Fast and robust multi-frame super-resolution, IEEE Transactions on Image Processing 13 (10) (2004) 1327–1344.
- Hardie (2007) R. Hardie, A fast image super-resolution algorithm using an adaptive Wiener filter, IEEE Transactions on Image Processing 16 (12) (2007) 2953–64.
- Gunturk and Gevrekci (2006) B. K. Gunturk, M. Gevrekci, High-Resolution Image Reconstruction From Multiple Differently Exposed Images, IEEE Signal Processing Letters 13 (4) (2006) 197–200.
- Zimmer et al. (2011) H. Zimmer, A. Bruhn, J. Weickert, Freehand HDR Imaging of Moving Scenes with Simultaneous Resolution Enhancement, Computer Graphics Forum (Proceedings of Eurographics) 30 (2) (2011) 405–414.
- Bayer (1976) B. Bayer, Color imaging array, US Patent 3 971 065, 1976.
- K. Hirakawa (2006) T. P. K. Hirakawa, Joint Demosaicing and Denoising, IEEE Trans. Image Processing 15 (8) (2006) 2146–2157.
- Hirakawa (2008) K. Hirakawa, Color Filter Array Image Analysis for Joint Denoising and Demosaicking, in: R. Lukac (Ed.), Single-Sensor Imaging: Methods and Applications for Digital Cameras, CRC Press, 2008.
- Farsiu et al. (2006) S. Farsiu, M. Elad, P. Milanfar, Multiframe Demosaicing and Super-Resolution of Color Images, IEEE Transactions on Image Processing 15 (1) (2006) 141–159.
- Kachatou and van Silfhout (2008) A. Kachatou, R. van Silfhout, Dynamic Range Enhancement Algorithms for CMOS Sensors With Non-Destructive Readout, in: IEEE International Worshop on Imaging Systems and Techniques, 2008.
- Hasinoff et al. (2010) S. Hasinoff, F. Durand, W. Freeman, Noise-optimal capture for high dynamic range photography, in: CVPR, 2010.
- Reibel et al. (2003) Y. Reibel, M. Jung, M. Bouhifd, B. Cunin, C. Draman, CCD or CMOS camera noise characterisation, The European Physical Journal - Applied Physics 21 (2003) 75–80.
- Foi et al. (2008) A. Foi, M. Trimeche, V. Katkovnik, K. Egiazarian, Practical Poissonian-Gaussian Noise Modeling and Fitting for Single-Image Raw-Data, IEEE Transactions on Image Processing 17 (10) (2008) 1737–1754.
- Foi (2009) A. Foi, Clipped noisy images: heteroskedastic modeling and practical denoising, Signal Processing 89 (12) (2009) 2609–2629.
- Malvar et al. (2004) H. S. Malvar, L. wei He, R. Cutler, High-quality linear interpolation for demosaicing of bayer-patterned color images, in: IEEE International Conference on Speech, Acoustics, and Signal Processing (ICASSP), 2004.
Granados (8 14)
granados/projects/opthdr/index.html, Project Webpage, Last checked 2013-08-14.