
Light Field Super-Resolution Via Graph-Based Regularization

Mattia Rossi and Pascal Frossard
École Polytechnique Fédérale de Lausanne
mattia.rossi@epfl.ch, pascal.frossard@epfl.ch
Abstract

Light field cameras capture the 3D information in a scene with a single exposure. This special feature makes light field cameras very appealing for a variety of applications: from post-capture refocus, to depth estimation and image-based rendering. However, light field cameras suffer by design from strong limitations in their spatial resolution, which therefore needs to be augmented by computational methods. On the one hand, off-the-shelf single-frame and multi-frame super-resolution algorithms are not ideal for light field data, as they do not consider its particular structure. On the other hand, the few super-resolution algorithms explicitly tailored for light field data exhibit significant limitations, such as the need to estimate an explicit disparity map at each view. In this work we propose a new light field super-resolution algorithm meant to address these limitations. We adopt a multi-frame-like super-resolution approach, where the complementary information in the different light field views is used to augment the spatial resolution of the whole light field. We show that coupling the multi-frame approach with a graph regularizer that enforces the light field structure via nonlocal self-similarities makes it possible to avoid the costly and challenging disparity estimation step for all the views. Extensive experiments show that the new algorithm compares favorably to state-of-the-art methods for light field super-resolution, both in terms of PSNR and visual quality.

I Introduction

We live in a 3D world, but the pictures taken with traditional cameras capture only 2D projections of this reality. The light field is a model that was originally introduced in the context of image-based rendering with the purpose of capturing richer information about a 3D scene [1] [2]. The light emitted by the scene is modeled in terms of rays, each one characterized by a direction and a radiance value. The light field function provides, at each point in space, the radiance from a given direction. The rich information captured by the light field function can be used in many applications, from post-capture refocus to depth estimation or virtual reality.

However, the light field is a theoretical model: in practice the light field function has to be properly sampled, which is a challenging task. A straightforward but hardware-intensive approach relies on camera arrays [3]. In this setup, each camera records an image of the same scene from a particular position and the light field takes the form of an array of views. More recently, the development of the first commercial light field cameras [4] [5] has made light field sampling more accessible. In light field cameras, a micro lens array placed between the main lens and the sensor makes it possible to virtually partition the main lens into sub-apertures, whose images are recorded altogether in a single exposure [6] [7]. As a consequence, a light field camera behaves as a compact camera array, providing multiple simultaneous images of a 3D scene from slightly different points of view.

Although light field cameras are very appealing, they face the so-called spatio-angular resolution tradeoff. Since the whole array of views is captured by a single sensor, a dense sampling of the light field in the angular domain (i.e., a large number of views) necessarily translates into a sparse sampling in the spatial domain (i.e., low resolution views) and vice versa. A dense angular sampling is at the basis of any light field application, as the 3D information provided by the light field data comes from the availability of different views. It follows that the angular sampling cannot be excessively penalized to favor spatial resolution. Moreover, even in the limit scenario of a light field with just two views, the spatial resolution of each one may be reduced to half of the sensor [6], which is still a dramatic drop in resolution. Consequently, the light field views exhibit a significantly lower resolution than images from traditional cameras, and many light field applications, such as depth estimation, are very challenging on low spatial resolution data. The design of spatial super-resolution techniques, aiming at increasing the view resolution, is therefore crucial in order to fully exploit the potential of light field cameras.

In this work, we propose a new light field super-resolution algorithm that provides a global solution that augments the resolution of all the views together, without an explicit a priori disparity estimation step. In particular, we propose to cast light field spatial super-resolution into a global optimization problem, whose objective function is designed to capture the relations between the light field views. The objective function comprises three terms. The first one enforces data fidelity, by constraining each high resolution view to be consistent with its low resolution counterpart. The second one is a warping term, which gathers for each view the complementary information encoded in the other ones. The third one is a graph-based prior, which regularizes the high resolution views by enforcing smoothness along the light field epipolar lines that define the light field structure. These terms altogether form a quadratic objective function that we solve iteratively with the proximal point algorithm. The results show that our algorithm compares favorably to state-of-the-art light field super-resolution algorithms, both visually and in terms of reconstruction error.

The article is organized as follows. Section II presents an overview of the related literature. Section III formalizes the light field structure. Section IV presents our problem formulation and carefully analyzes each of its terms. Section V provides a detailed description of our super-resolution algorithm, and Section VI analyses its computational complexity. Section VII is dedicated to our experiments. Finally, Section VIII concludes the article.

II Related work

The super-resolution literature is quite vast, but it can be divided mainly into two areas: single-frame and multi-frame super-resolution methods. In single-frame super-resolution, only one image from a scene is provided, and its resolution has to be increased. This goal is typically achieved by learning a mapping from the low resolution data to the high resolution one, either on an external training set [8] [9] [10] or on the image itself [11] [12]. Single-frame algorithms can be applied to each light field view separately in order to augment the resolution of the whole light field, but this approach would neither exploit the high correlation among the views, nor enforce the consistency among them.

In the multi-frame scenario, multiple images of the same scene are used to increase the resolution of a target image. For this purpose, all the available images are typically modeled as translated and rotated versions of the target one [13] [14]. The multi-frame super-resolution scenario resembles the light field one, but its global image warping model does not fit the light field structure. In particular, the different apparent speeds of the objects in the scene across the light field views, which encode their different depths, cannot be captured by a global warping model. Multi-frame algorithms employing more complex warping models exist, for example in video super-resolution [15] [16], yet these warping models do not exactly fit the geometry of light field data and their construction is computationally demanding. In particular, multi-frame video super-resolution involves two main steps, namely optical flow estimation, which finds correspondences between temporally successive frames, followed by a super-resolution step built on the optical flow.

In the light field representation, the views lie on a two-dimensional grid, with adjacent views sharing a constant baseline under the assumption of both vertical and horizontal registration. As a consequence, not only does the optical flow computation reduce to disparity estimation, but the disparity map at one view also determines its warping to every other view in the light field, in the absence of occlusions. In [17] Wanner and Goldluecke build on these observations to extract the disparity map at each view directly from the epipolar line slopes with the help of a structure tensor operator. Then, similarly to multi-frame super-resolution, they project all the views to the target one within a global optimization formulation endowed with a Total Variation (TV) prior. Although the structure tensor operator makes it possible to carry out disparity estimation in the continuous domain, this task remains very challenging at low spatial resolution. As a result, disparity errors translate into significant artifacts in the textured areas and along object edges. Finally, each view of the light field has to be processed separately to super-resolve the complete light field, which prevents the inter-view dependencies from being fully exploited.

In another work, Heber and Pock [18] consider the matrix obtained by warping all the views to a reference one, and propose to model it as the sum of a low rank matrix and a noise matrix, where the latter accounts for noise and occlusions. This model, which resembles Robust PCA [19], is primarily meant for disparity estimation at the reference view. However, the authors show that a slight modification of the objective function can provide the corresponding high resolution view, in addition to the low resolution disparity map at the reference view. The algorithm could ideally be applied separately to each view in order to super-resolve the whole light field, but this would not be an ideal solution to the global problem, due to the high redundancy of estimating all the low resolution disparity maps independently.

In a different framework, Mitra and Veeraraghavan propose a light field super-resolution algorithm based on a learning procedure [20]. Each view in the low resolution light field is divided into possibly overlapping patches. All the patches at the same spatial coordinates in the different views form a light field with very small spatial resolution, i.e., a light field patch. The authors assign a constant disparity to each light field patch, i.e., all the objects within the light field patch are assumed to lie at the same depth in the scene. A different Gaussian Mixture Model (GMM) prior for high resolution light field patches is learnt offline for each discrete disparity value, and it is then employed within a MAP estimator to super-resolve each light field patch with the corresponding disparity. Due to the reduced dimensionality of each light field patch, and to the closed form solution of the estimator, this approach requires less online computation than other light field super-resolution algorithms. However, the offline learning strategy also has some drawbacks: the dependency of the reconstruction on the chosen training set, the need for a new training for each super-resolution factor, and finally the need for a proper discretization of the disparity range, which introduces a tradeoff between the reconstruction quality and the time required by both the training and the reconstruction steps. Moreover, the simple assumption of constant disparity within each light field patch leads to severe artifacts at depth discontinuities in the super-resolved light field views.

The light field super-resolution problem has also been addressed within the framework of Convolutional Neural Networks (CNNs). In particular, Yoon et al. [21] consider the cascade of two CNNs, the first meant to super-resolve the given light field views, and the second to synthesize new high resolution views based on the previously super-resolved ones. However, the first CNN (whose design is borrowed from [10]) is meant for single-frame super-resolution, therefore the views are super-resolved independently, without considering the light field structure.

Finally, we note that some authors, e.g., Bishop et al. [22], consider the recovery of an all-in-focus image with full sensor resolution from the light field camera output. They refer to this task as light field super-resolution, although it is different from the problem considered in this work. In this article, no light field application is considered a priori: the light field views are all super-resolved, thus enabling any light field application to be performed later at a resolution higher than the original one. Differently from the other light field super-resolution algorithms, ours does not require an explicit a priori disparity estimation step, and does not rely on a learning procedure. Moreover, our algorithm reconstructs all the views jointly, provides homogeneous quality across the reconstructed views, and preserves the light field structure.

Fig. 1: Light field sampling in the two plane parametrization. The light field is ideally sampled through an array of pinhole cameras. Two pinhole cameras of the array are represented as two pyramids, with their apertures denoted by the two green dots on the angular plane, and their sensors represented by the two orange squares on the spatial plane. The distance between the two planes is the focal length. The distance between the apertures of horizontally or vertically adjacent views in the array is the baseline, which determines the distance between the two green dots. The small yellow squares in the two sensors denote a pixel at the same spatial coordinates. The pixel of the first camera captures one of the light rays (in red) emitted by a scene point at a given depth. The disparity associated to this pixel determines where the point projects on the sensor of the second camera. The projection is denoted by a red spot, as it does not necessarily correspond to a pixel, since the disparity is not necessarily integer.

III Light field structure

In the light field literature, it is common to parametrize the light rays from a 3D scene by the coordinates of their intersections with two parallel planes, typically referred to as the spatial plane and the angular plane. Each light ray is associated to a radiance value, and a pinhole camera with its aperture on the angular plane and its sensor on the spatial plane can record the radiance of all the rays accommodated by its aperture. This is represented in Figure 1, where each pinhole camera is represented as a pyramid, with its vertex and basis representing the camera aperture and sensor, respectively. In general, an array of pinhole cameras can perform a regular sampling of the angular plane, therefore the sampled light field takes the form of a set of images captured from different points of view. This is the sampling scheme approximated by both camera arrays and light field cameras.

In the following we consider the light field as the output of an array of pinhole cameras, each one equipped with its own pixel sensor. Each camera (aperture) is identified through its angular coordinates in the array, while a pixel within the camera sensor is identified through its spatial coordinates. The distance between the apertures of horizontally or vertically adjacent cameras is referred to as the baseline. The distance between the angular and spatial planes is referred to as the camera focal length. Figure 1 sketches two cameras of the array. Within this setup, we can represent the light field as a real matrix collecting the pixel intensities of all the views, where we refer to the image captured by one camera as a view. Finally, without loss of generality, we assume that each pair of horizontally or vertically adjacent views in the light field is properly registered.

With reference to Figure 1, we now describe in more detail the particular structure of the light field data. We consider a scene point at a given depth, whose projection on one of the cameras is represented by the yellow pixel in the right view of Figure 1. We now look at the projections of this point on the other views in the same row of the camera array, such as the left view in Figure 1. We observe that, in the absence of occlusions and under the Lambertian assumption (all the rays emitted by the point exhibit the same radiance), the projection obeys the following stereo equation:

(1)

where the horizontal shift is given by the disparity map of the considered view with respect to its left neighboring view. A visual interpretation of Eq. (1) is provided by the Epipolar Plane Image (EPI) [23] in Figure 2(b), which represents a slice of the light field. This is obtained by stacking the same row extracted from each view in a row of the camera array on top of each other. This procedure leads to a clear line pattern, as the projection of a scene point on one view is the pivot of a line hosting all its projections on the other views. In particular, the line slope depends on the disparity, hence on the depth of the point in the scene. We stress that, although the considered point is a pixel in the captured light field, its projections do not necessarily correspond to actual pixels in the other light field views, as the disparity may not be integer. We finally observe that Eq. (1) can be extended to the whole light field:

(2)

We refer to the model in Eq. (2) as the light field structure.
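For illustration only, the relations described by Eqs. (1) and (2) can be written as in the sketch below, using hypothetical notation where $I_{s,t}(x,y)$ denotes the intensity of pixel $(x,y)$ in the view with angular coordinates $(s,t)$ and $d_{s,t}$ denotes the disparity map of that view; the actual symbols and sign conventions of the original equations may differ.

```latex
% Hypothetical notation: I_{s,t}(x,y) is the intensity of pixel (x,y) in view (s,t),
% d_{s,t}(x,y) its disparity. Signs depend on the chosen axis orientation.
% Eq. (1): projection on another view (s,t') in the same row of the camera array.
I_{s,t}(x,y) \;=\; I_{s,t'}\big(x,\; y + (t'-t)\, d_{s,t}(x,y)\big)
% Eq. (2): extension to the whole light field, for a generic view (s',t').
I_{s,t}(x,y) \;=\; I_{s',t'}\big(x + (s'-s)\, d_{s,t}(x,y),\; y + (t'-t)\, d_{s,t}(x,y)\big)
```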

Later on, for the sake of clarity, we will denote a light field view either by its angular coordinates or by its linear coordinate, i.e., by its position when visiting the camera array in column major order. Finally, we also handle the light field in a vectorized form: each view is vectorized by visiting its entries in column major order, and the vectorized views are then stacked, again in column major order over the camera array, to obtain the light field vector.
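As a small illustration of this convention, the following Python sketch vectorizes a toy light field; the array names and sizes are ours and purely illustrative.

```python
import numpy as np

def vectorize_view(view):
    """Vectorize a single view by visiting its entries in column major order."""
    return view.flatten(order="F")  # Fortran order = column major

def vectorize_light_field(views):
    """Stack the vectorized views, visiting the camera array in column major order.

    `views` is an illustrative 2D list indexed as views[row][col], each entry
    being an (n_rows, n_cols) numpy array.
    """
    ordered = [views[s][t] for t in range(len(views[0])) for s in range(len(views))]
    return np.concatenate([vectorize_view(v) for v in ordered])

# Tiny usage example: a 2 x 2 array of 4 x 4 views gives a vector of 64 entries.
views = [[np.random.rand(4, 4) for _ in range(2)] for _ in range(2)]
x = vectorize_light_field(views)
print(x.shape)  # (64,)
```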

Fig. 2: Example of light field and epipolar plane image (EPI). Figure (a) shows a small array of views extracted from the knights light field (Stanford dataset), which actually consists of a larger array of views. Figure (b) shows an epipolar image from the original knights light field: each row of the epipolar image corresponds to the same row extracted from one of the views. The top, central, and bottom red dashed rows in (b) correspond to the left, central, and right red dashed rows in the sample views in (a), respectively.

IV Problem formulation

The light field (spatial) super-resolution problem concerns the recovery of the high resolution light field from its low resolution counterpart, given the super-resolution factor. Equivalently, we aim at super-resolving each view to obtain its high resolution version. In order to recover the high resolution light field from the low resolution data, we propose to minimize the following objective function:

(3)

where each term describes one of the constraints related to the light field structure, and two multipliers balance the different terms. We now analyze each term separately.

Each pair of high and low resolution views has to be consistent, and we model their relationship as follows:

(4)

where the two matrices denote a blurring and a sampling operator, respectively, and a noise vector captures possible inaccuracies of the assumed model. The first term in Eq. (3) enforces the constraint in Eq. (4) for each high resolution and low resolution view pair, and it is typically referred to as the data fidelity term:

(5)
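To make the data fidelity term concrete, the following sketch applies a box blur followed by regular sampling to a high resolution view and measures the squared residual against the corresponding low resolution view. The box filter and the regular sampling mirror the experimental setting of Section VII, while the kernel size, the border handling, and the function names are assumptions of this illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def blur_and_sample(view_hr, factor):
    """Assumed degradation: a factor x factor box blur followed by regular sampling.

    This plays the role of the product of the sampling and blurring matrices in
    Eq. (4); kernel size and border mode are assumptions of this sketch.
    """
    blurred = uniform_filter(view_hr, size=factor, mode="nearest")  # box filter
    return blurred[::factor, ::factor]                              # regular sampling

def data_fidelity(view_hr, view_lr, factor):
    """Squared residual of one high/low resolution view pair, as in Eq. (5)."""
    residual = blur_and_sample(view_hr, factor) - view_lr
    return float(np.sum(residual ** 2))

# Usage: a 64 x 64 high resolution view observed at super-resolution factor 2.
hr = np.random.rand(64, 64)
lr = blur_and_sample(hr, 2) + 0.01 * np.random.randn(32, 32)  # noisy observation
print(data_fidelity(hr, lr, 2))
```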

Then, since the various low resolution views in the light field capture the scene from slightly different perspectives, details dropped by the digital sensor sampling at one view may survive in another one. Gathering at one view all the complementary information from the others can augment its resolution. This can be achieved by enforcing that each high resolution view can generate all the other low resolution views in the light field. For every view we thus have the following model:

(6)

where the warping matrix maps the high resolution view toward another view of the light field, and an error vector captures possible inaccuracies of the model, such as the presence of pixels that cannot be generated because they correspond to occluded areas. The second term in Eq. (3) enforces the constraint in Eq. (6) for every high resolution view:

(7)

where a diagonal and binary masking matrix removes those pixels that cannot be generated due to occlusions, and the inner sum runs over a subset of the views (potentially all of them).
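In the same spirit, the warping term compares each low resolution neighboring view with a blurred, sampled, and warped version of the high resolution view, with occluded pixels masked out. A minimal sketch, assuming the sparse warping matrix and the binary mask are already available and that warping is applied before blurring and sampling, is given below.

```python
import numpy as np

def warping_residual(x_hr, y_lr, F, mask, blur_sample):
    """Masked residual of the model in Eq. (6) for one pair of views.

    x_hr        : vectorized high resolution view.
    y_lr        : vectorized low resolution neighboring view.
    F           : sparse warping matrix toward the neighboring view.
    mask        : binary vector, 0 for pixels that cannot be generated (occlusions).
    blur_sample : callable implementing the blurring and sampling of Eq. (4)
                  on vectorized data.
    Names, and the order warping -> blurring/sampling, are assumptions of this sketch.
    """
    predicted = blur_sample(F @ x_hr)  # warp, then bring to low resolution
    return float(np.sum((mask * (y_lr - predicted)) ** 2))
```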

Finally, a regularizer is necessary in the overall objective function of Eq. (3), as the original problem in Eq. (4), encoded in the first term, is ill-posed due to the fat blurring and sampling matrix. The second term can help, but the warping matrices in Eq. (7) are not known exactly, so that the third term is necessary.

We borrow the regularizer from Graph Signal Processing (GSP) [24] [25], and define it as follows:

(8)

where the positive semi-definite matrix is the un-normalized Laplacian of a graph designed to capture the light field structure. In particular, each pixel in the high resolution light field is modeled as a vertex in a graph, where the edges connect each pixel to its projections on the other views. The quadratic form in Eq. (8) enforces connected pixels to share similar intensity values, thus promoting the light field structure described in Eq. (2).

In particular, we consider an undirected weighted graph, defined by a set of graph vertices, an edge set, and a function mapping each edge to a non negative real value, referred to as the edge weight.

Each vertex corresponds to one entry of the (vectorized) high resolution light field, therefore the graph can be represented through its adjacency matrix, whose dimension is given by the number of pixels in the light field.

Since the graph is assumed to be undirected, the adjacency matrix is symmetric. We can finally rewrite the regularizer in Eq. (8) as follows:

(9)

where the inner sum runs over the set of vertices directly connected to each vertex, and we recall that the entries of the vectorized light field are the pixel intensities. Eq. (9) penalizes significant intensity variations along highly weighted edges. A weight typically captures the similarity between vertices, therefore the minimization of Eq. (9) leads to an adaptive smoothing [24], ideally along the EPI lines of Figure 2(b) in our light field framework.
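The following sketch illustrates the regularizer on a toy graph: it builds the un-normalized Laplacian from a symmetric adjacency matrix and checks that the quadratic form of Eq. (8) coincides with a weighted sum of squared differences over the edges, in the spirit of Eq. (9) (up to the exact normalization used there); the tiny graph and the names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def unnormalized_laplacian(W):
    """L = D - W, with D the diagonal degree matrix of the symmetric adjacency W."""
    degrees = np.asarray(W.sum(axis=1)).ravel()
    return sp.diags(degrees) - W

# Tiny illustrative graph on 3 pixels: 0-1 strongly connected, 1-2 weakly connected.
W = sp.csr_matrix(np.array([[0.0, 0.9, 0.0],
                            [0.9, 0.0, 0.1],
                            [0.0, 0.1, 0.0]]))
L = unnormalized_laplacian(W)

x = np.array([1.0, 1.05, 3.0])   # pixel intensities
quad = x @ (L @ x)               # x^T L x, the regularizer of Eq. (8)

# Same quantity as a weighted sum of squared differences over the edges.
pairwise = 0.5 * sum(W[i, j] * (x[i] - x[j]) ** 2
                     for i in range(3) for j in range(3))
print(quad, pairwise)            # identical values
```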

Differently from other light field super-resolution methods, the proposed formulation addresses the recovery of the whole light field at once, thanks to the global graph regularizer. The warping term augments the resolution of each view without resorting to external data and learning procedures. However, differently from video super-resolution or the light field super-resolution approach in [17], the warping matrices do not rely on a precise estimation of the disparity at each view. This is possible mainly thanks to the graph regularizer, which acts on each view as a denoising term based on nonlocal similarities [26] but at the same time constrains the reconstruction of all the views jointly, thus enforcing the full light field structure captured by the graph.

Fig. 3: The neighboring views and the square constraint. All the squares indicate pixels; in particular, all the yellow pixels lie at the same spatial coordinates in their own view. The projections of the considered pixel on the other eight neighboring views are indicated with red dots. According to Eq. (2) they all lie on a square, highlighted in red. The four orange views represent the set of adjacent views used in the warping matrix construction. In each one of these four views, the two pixels employed in the convex combination targeting the considered pixel are indicated in green and are adjacent to each other. The projection of the pixel lies between the two green pixels, and these two belong to a 1D search window indicated in gray.

V Super-resolution algorithm

We now describe the algorithm that we use to solve the optimization problem in Eq. (3). We first discuss the construction of the warping matrices of the term in Eq. (7), and then the construction of the graph employed in the regularizer in Eq. (8). Finally, we describe the complete super-resolution algorithm.

V-A Warping matrix construction

We define the set of neighboring views in the warping term in Eq. (7) as containing only the four views adjacent to the considered one in the light field.

This choice reduces the number of warping matrices but does not limit our problem formulation, as the interlaced structure of the warping term also constrains those pairs of views that are not explicitly coupled in it.

The inner summation in Eq. (7) considers the set of the four warping matrices that warp the considered view to each one of its four adjacent views. Conversely, but without loss of generality, in this section we consider the set of the four warping matrices that warp each one of the four adjacent views to the considered view. Each row of a warping matrix computes one pixel of the target view as a convex combination of the pixels around its projection on the corresponding adjacent view. Note that the convex combination is necessary, as the projection does not lie at integer spatial coordinates in general. The exact position of the projection is determined by the disparity value associated to the target pixel. This is represented in Figure 3, which shows that the projections of a pixel on the four adjacent views lie on the edges of a virtual square centered on the pixel itself and whose size depends on the disparity value. We roughly estimate this disparity value as follows. First, we define a similarity score between the target pixel and a generic pixel:

(10)

where the score compares square patches centered at the two pixels through the Frobenius norm of their difference, up to a tunable constant. Then, we center a search window at the same spatial coordinates in each one of the four neighboring views, as represented in Figure 3. In particular, we consider

  • a horizontal 1D pixel window for the left and right adjacent views,

  • a vertical 1D pixel window for the top and bottom adjacent views,

with the (odd) window length defining the disparity range assumed for the whole light field. Finally, we introduce the following function, which assigns a score to each possible disparity value:

(11)

Note that each line of Eq. (11) refers to a pair of adjacent pixels in one of the neighboring views. We finally estimate the disparity as follows:

(12)

We can now fill the corresponding row of the warping matrix such that the target pixel is computed as the convex combination of the two pixels closest to its projection on the adjacent view, namely the following two pixels:

which are indicated in green in Figure 3. Once the two pixels involved in the convex combination at a given row of the warping matrix are determined, that row can be constructed. As an example, let us focus on the left adjacent view, where the two pixels involved in the convex combination are the following:

We thus define the corresponding row of the warping matrix as follows:

where each one of the two pixels in the convex combination receives a weight proportional to its similarity to the target pixel, and the two weights are normalized to sum to one. For the remaining three adjacent views we proceed similarly.
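A compact sketch of this procedure for the left adjacent view is given below. The exponential form of the patch similarity, the sign convention of the horizontal shift, the window size, and the function names are assumptions of this illustration; border handling is omitted for brevity.

```python
import numpy as np

def patch(img, x, y, half):
    """Square patch centered at (x, y); border handling is ignored for brevity."""
    return img[x - half:x + half + 1, y - half:y + half + 1]

def similarity(img_a, xa, ya, img_b, xb, yb, half=3, sigma=0.2):
    """Assumed exponential patch similarity in the spirit of Eq. (10)."""
    diff = patch(img_a, xa, ya, half) - patch(img_b, xb, yb, half)
    return np.exp(-np.sum(diff ** 2) / sigma ** 2)

def left_view_weights(target, left, x, y, max_disp=4):
    """Pixels and weights of the convex combination for pixel (x, y) of the target view.

    Pairs of horizontally adjacent pixels of the left adjacent view are scored
    inside a 1D window, loosely following Eqs. (11)-(12); the sign convention of
    the horizontal shift is an assumption of this sketch.
    """
    # Score each pair of horizontally adjacent pixels inside the search window.
    pair_scores = {d: similarity(target, x, y, left, x, y + d)
                      + similarity(target, x, y, left, x, y + d + 1)
                   for d in range(-max_disp, max_disp)}
    d_best = max(pair_scores, key=pair_scores.get)

    # Convex combination of the two pixels enclosing the estimated projection,
    # with weights proportional to their similarity to the target pixel.
    w = np.array([similarity(target, x, y, left, x, y + d_best),
                  similarity(target, x, y, left, x, y + d_best + 1)])
    cols = [(x, y + d_best), (x, y + d_best + 1)]
    return cols, w / (w.sum() + 1e-12)

# Usage on two synthetic views related by a one pixel horizontal shift.
rng = np.random.default_rng(0)
target = rng.random((32, 32))
left = np.roll(target, 1, axis=1)
print(left_view_weights(target, left, 16, 16))
```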

We stress that, for each pixel, the outlined procedure fills the corresponding row of each one of the four warping matrices. In particular, as illustrated in Figure 3, the pair of pixels selected in each one of the four adjacent views encloses one edge of the red square hosting the projections of the considered pixel, therefore this procedure contributes to enforcing the light field structure in Eq. (2). Later on, we will refer to this particular structure as the square constraint.

Finally, since occlusions are mostly handled by the graph regularizer, we use the masking matrices in Eq. (7) to handle only the trivial occlusions due to the image borders.

V-B Regularization graph construction

The effectiveness of the regularizer depends on the capability of the graph to capture the underlying structure of the light field. Ideally, we would like to connect each pixel in the light field to its projections on the other views, as these all share the same intensity value under the Lambertian assumption. However, since the projections do not lie at integer spatial coordinates in general, we adopt a procedure similar to the warping matrix construction and aim at connecting each pixel to those pixels that are close to its projections on the other views. We thus propose a three step approach to the computation of the graph adjacency matrix in Eq. (9).

V-B1 Edge weight computation

We consider a view and define its set of neighboring views by extending the neighborhood used in the warping matrix construction with the four diagonal views, so that each view has up to eight neighbors.

The full set of neighboring views is represented in Figure 3, with the four adjacent views in orange and the four diagonal ones in green. We then consider a pixel and define its edges toward one of the neighboring views: we center a search window at the pixel's spatial coordinates in the neighboring view and compute the following similarity score (weight) between the pixel and each pixel in the considered window:

(13)

with the notation already introduced in Section V-A. We repeat the procedure for each one of the eight neighboring views, but we use differently shaped windows for different views:

  • a horizontal 1D pixel window for the left and right adjacent views,

  • a vertical 1D pixel window for the top and bottom adjacent views,

  • a 2D pixel window for the diagonal views.

This is illustrated in Figure 3. The 2D window is introduced for the diagonal views, as the projection of the pixel on these views lies neither along the same row nor along the same column. Iterating the outlined procedure over each pixel in the light field leads to the construction of a first adjacency matrix, which we regard as the adjacency matrix of a directed graph, where each entry is the weight of the edge from one pixel to another.
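A sketch of the weight computation from one pixel toward a single neighboring view is given below; it reuses the same patch-based similarity as in the warping matrix construction (again assumed exponential) and returns the weights of all the candidate edges inside the search window. The window shapes and names are illustrative.

```python
import numpy as np

def candidate_edge_weights(view_a, view_b, x, y, window, half=3, sigma=0.2):
    """Weights of the candidate edges from pixel (x, y) of view_a toward view_b.

    `window` is a list of (dx, dy) offsets describing the search window shape,
    e.g. [(0, dy) for dy in range(-4, 5)] for a horizontally adjacent view, or a
    small 2D set of offsets for a diagonal one. All values are illustrative.
    """
    weights = {}
    for dx, dy in window:
        pa = view_a[x - half:x + half + 1, y - half:y + half + 1]
        pb = view_b[x + dx - half:x + dx + half + 1, y + dy - half:y + dy + half + 1]
        weights[(x + dx, y + dy)] = np.exp(-np.sum((pa - pb) ** 2) / sigma ** 2)
    return weights

# Horizontal 1D window of 9 candidates toward a left/right adjacent view.
rng = np.random.default_rng(1)
view_a, view_b = rng.random((32, 32)), rng.random((32, 32))
w = candidate_edge_weights(view_a, view_b, 16, 16, [(0, dy) for dy in range(-4, 5)])
```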

V-B2 Edge pruning

We want to keep only the most important connections in the graph. We thus prune the edges leaving each pixel toward the eight neighboring views. In particular, we keep only

  • the two largest weight edges toward each of the left and right adjacent views,

  • the two largest weight edges toward each of the top and bottom adjacent views,

  • the four largest weight edges toward each of the diagonal views.

For the diagonal neighboring views we allow four weights rather than two, as it is more difficult to detect the pixels that lie close to the projection of the considered pixel. The pruned weights define a new (still directed) adjacency matrix.
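The pruning rule itself is straightforward; a minimal sketch, with illustrative candidate weights, is the following.

```python
def prune_edges(weights, k):
    """Keep only the k largest-weight edges from a dict {pixel: weight}."""
    kept = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(kept)

# Illustrative candidate weights from one pixel toward one neighboring view.
candidates = {(16, 12): 0.05, (16, 13): 0.60, (16, 14): 0.90, (16, 15): 0.20}
print(prune_edges(candidates, k=2))  # two edges kept for an adjacent view
print(prune_edges(candidates, k=4))  # four edges would be kept for a diagonal view
```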

V-B3 Symmetric adjacency matrix

We finally carry out the symmetrization of the pruned adjacency matrix, and use the resulting matrix in Eq. (9). We adopt a simple approach for obtaining a symmetric matrix: we preserve an edge between two vertices if and only if both of the corresponding directed entries are non zero. If this is the case, then the two weights necessarily coincide, and they are maintained. This procedure mimics the well-known left-right disparity check of stereo vision [27].
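The mutual check described above has a compact sparse-matrix form: an entry of the final adjacency matrix is kept only if both directed entries are non zero. A sketch with scipy sparse matrices and an illustrative pruned matrix follows.

```python
import numpy as np
import scipy.sparse as sp

def symmetrize(A):
    """Keep an edge i-j only if both A[i, j] and A[j, i] are non zero.

    When both directions survive the pruning, the two weights coincide (the
    patch similarity is symmetric), so masking A is enough.
    """
    mutual = (A != 0).multiply(A.T != 0)  # boolean mask of the mutual edges
    return A.multiply(mutual)

# Tiny pruned directed adjacency: only the 0-1 edge is mutual.
A = sp.csr_matrix(np.array([[0.0, 0.8, 0.3],
                            [0.8, 0.0, 0.0],
                            [0.0, 0.5, 0.0]]))
print(symmetrize(A).toarray())
```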

We finally note that the constructed graph can be used to build an alternative warping matrix to the one in Section V-A. We recall that each row of a warping matrix is expected to compute a pixel of the target view as a convex combination of the pixels around its projection on the corresponding neighboring view. We thus observe that the sub-matrix obtained by extracting from the adjacency matrix the rows associated to the target view and the columns associated to the neighboring view represents a directed weighted graph with edges from the pixels of the target view (rows of the sub-matrix) to the pixels of the neighboring view (columns of the sub-matrix). In this graph, each pixel of the target view is connected to a subset of pixels that lie close to its projection on the neighboring view. We thus normalize the rows of the sub-matrix such that they sum up to one, in order to implement a convex combination, and use the normalized sub-matrix as the warping matrix. This alternative approach to the warping matrix construction does not take the light field structure explicitly into account, but it represents a valid option when computational resources are limited, as it reuses the regularization graph.
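A sketch of this alternative construction with scipy sparse matrices is given below: the inter-view block of the adjacency matrix is extracted and its rows are normalized to sum to one; the index sets and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def warping_from_graph(W, target_pixels, neighbor_pixels):
    """Extract the inter-view block of the adjacency matrix and row-normalize it.

    W is the symmetric regularization adjacency over all light field pixels;
    the two index arrays select the pixels of the target view (rows) and of the
    neighboring view (columns). Names and indices are illustrative.
    """
    block = W[target_pixels, :][:, neighbor_pixels].tocsr()
    row_sums = np.asarray(block.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0            # keep all-zero rows at zero
    return sp.diags(1.0 / row_sums) @ block  # rows now sum to one (or zero)

# Usage: two views of 4 pixels each inside an 8-pixel light field graph.
W = sp.random(8, 8, density=0.4, format="csr", random_state=0)
W = W + W.T                                  # make it symmetric
F = warping_from_graph(W, np.arange(0, 4), np.arange(4, 8))
print(F.toarray().sum(axis=1))               # ones, or zeros for isolated pixels
```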

V-C Optimization algorithm

We now have all the ingredients to solve the optimization problem in Eq. (3). We observe that it corresponds to a quadratic problem. We can first rewrite the first term, in Eq. (5), as follows:

(14)

where the global blurring and sampling operator is obtained from the single-view one through a Kronecker product with the identity matrix. For the second term, in Eq. (7), we introduce the following matrices:

  • a block diagonal matrix collecting the per-view masking matrices,

  • a matrix collecting the per-view warping matrices, assembled with the help of the canonical basis vectors,

where the block diagonal operator stacks its arguments along the diagonal, and the k-th canonical basis vector has its k-th entry equal to one and zero elsewhere. The masking and warping matrices, originally defined only for the neighboring views, have been extended to the whole light field by assuming them to be zero for the remaining views. Finally, it is possible to remove the inner sum in Eq. (7):

(15)

Replacing Eq. (14) and Eq. (15) in Eq. (3) permits rewriting the objective function in quadratic form:

(16)

with the matrix and the vector of the quadratic form collecting the contributions of the data fidelity, warping, and graph terms.

We observe that, in general, the matrix of the quadratic form is only positive semi-definite, therefore it is not possible to solve Eq. (16) simply by employing the Conjugate Gradient (CG) method on the associated linear system. We thus choose to adopt the Proximal Point Algorithm (PPA), which iteratively solves Eq. (16) using the following update rule:

(17)

The matrix of the resulting linear system is positive definite for every positive value of the proximal parameter, hence we can now use the CG method to solve it. The full Graph-Based super-resolution algorithm is summarized in Algorithm 1. We observe that both the construction of the warping matrices and that of the graph require the high resolution light field to be available. In order to bypass this causality problem, a fast and rough high resolution estimate of the light field is computed at the bootstrap phase, e.g., via bilinear interpolation. Then, at each new iteration, the warping matrices and the graph are rebuilt on the newly available estimate of the high resolution light field, and a new estimate is obtained by solving the problem in Eq. (16).
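A compact sketch of the optimization loop is given below for a generic quadratic objective written, as an assumption of this sketch, as f(x) = 0.5 x^T A x - b^T x with A positive semi-definite; each PPA step adds a scaled identity, which makes the linear system positive definite, and the system is solved with scipy's conjugate gradient. In the full algorithm, A and b would be rebuilt from the warping matrices and the graph at the beginning of every outer iteration, as in Algorithm 1.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def ppa_quadratic(A, b, x0, tau=1.0, outer_iters=10):
    """Proximal point iterations for min_x 0.5 * x^T A x - b^T x, with A PSD.

    Each step minimizes the objective plus (1 / (2 * tau)) * ||x - x_prev||^2,
    i.e. it solves the positive definite system (A + I / tau) x = b + x_prev / tau
    with conjugate gradient. The exact parametrization of Eq. (17) may differ.
    """
    system = A + sp.identity(A.shape[0]) / tau
    x = x0.copy()
    for _ in range(outer_iters):
        # In the full algorithm, A, b (and hence `system`) would be rebuilt here
        # from the warping matrices and the graph computed on the current estimate x.
        x, _ = cg(system, b + x / tau, x0=x)
    return x

# Usage on a small singular PSD matrix (the Laplacian of a path graph).
n = 50
main = np.array([1.0] + [2.0] * (n - 2) + [1.0])
L = sp.diags(main) - sp.diags(np.ones(n - 1), 1) - sp.diags(np.ones(n - 1), -1)
x = ppa_quadratic(L.tocsr(), b=np.zeros(n), x0=np.random.rand(n))
```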

1: Input: the low resolution light field, the super-resolution factor, the multipliers, and the number of outer iterations.
2: Output: the high resolution light field estimate.
3: initialize the high resolution light field via bilinear interpolation of the low resolution one;
4: for each outer iteration do
5:     build the graph adjacency matrix on the current estimate;
6:     build the warping matrices on the current estimate;
7:     update the matrix of the quadratic form;
8:     update the vector of the quadratic form;
9:     initialize CG with the current estimate;
10:     while convergence is not reached do
11:          perform a CG iteration;
12:     end while
13:     set the current estimate to the CG solution;
14: end for
15: return the high resolution light field estimate;
Algorithm 1 Graph-Based Light Field Super-Resolution

VI Computational Complexity

In this section we provide an estimate of the computational complexity of the super-resolution algorithm proposed in Section V. The algorithm comprises three main steps: the construction of the graph adjacency matrix, the construction of the warping matrices, and the solution of the optimization problem in Eq. (16). We analyze each of these steps separately.

In the graph construction step, the weights from each view to its eight neighboring views are computed. Using the method in [28], the computation of all the weights from one view to its eight neighbors can be made independent of the size of the square patch, with a cost governed by the number of pixels per view and by the maximum number of pixels in a search window. This count also takes into account the operations required to select the largest weights, which is necessary to define the graph edges. Repeating this procedure for all the views in the light field gives the cost of the whole graph construction step; this cost also grows with the super-resolution factor, as the disparity in the high resolution views grows with the factor, and it is therefore reasonable to define the size of the search window as a multiple of it.

The construction of the warping matrices relies on the previously computed weights, therefore the complexity of this step depends exclusively on the estimation of the disparity parameter in Eq. (12). Its computation for each pixel in a view has a cost that grows only linearly with the search window length, as only 1D search windows are considered at this step. Repeating the computation for all the views in the light field gives the cost of the whole warping matrix construction step.

Finally, the optimization problem in Eq. (16) is solved via PPA, whose iterations each consist in a call to the CG method (cf. steps 9-12 in Algorithm 1). Each internal iteration of the CG method is dominated by a matrix-vector multiplication with the system matrix. However, it is straightforward to observe that this matrix is very sparse, with a number of non zero entries that grows linearly with the number of pixels in the light field for a blurring kernel of fixed size, as in our tests in Section VII. It follows that the matrix-vector multiplication within each CG internal iteration requires a number of operations that is linear in the number of pixels. The complexity of the overall optimization step then depends on the number of PPA iterations and on the number of internal iterations performed by each instance of CG. Although we do not provide an analysis of the convergence rate of our optimization algorithm, in our tests we empirically observe that, regardless of the number of pixels in the high resolution light field, PPA generally converges after a small number of iterations (each one consisting in a call to CG), while each instance of CG typically converges in only a few iterations. Therefore, assuming the global number of CG iterations to be independent of the light field size, the complexity of the optimization step also remains linear in the number of pixels.

The global complexity of our super-resolution algorithm is therefore linear in the number of pixels of the high resolution light field, which represents a reasonable complexity. Moreover, we observe that the graph and warping matrix construction steps can be highly parallelized. Although this feature would not affect the algorithm computational complexity, in practice it could lead to a significant speed-up. Finally, compared to the light field super-resolution method in [17], which employs TV regularization, our algorithm turns super-resolution into a simpler (quadratic) optimization problem, and, differently from the learning-based light field super-resolution method in [20], it does not require any time-demanding training.

VII Experiments

VII-A Experimental settings

We now provide extensive experiments to analyze the performance of our algorithm. We compare it to the two super-resolution algorithms that, to the best of our knowledge, are the only ones developed explicitly for light field data, and that we already introduced in Section II: Wanner and Goldluecke [17], and Mitra and Veeraraghavan [20]. We also compare our algorithm to the CNN-based super-resolution algorithm in [10], which represents the state of the art for single-frame super-resolution. To the best of our knowledge, a multi-frame super-resolution algorithm based on CNNs has not been presented yet.

We test our algorithm on two public datasets: the HCI light field dataset [29] and the (New) Stanford light field dataset [30]. The HCI dataset comprises twelve light fields, each one characterized by a square array of views. Seven light fields have been artificially generated with a 3D creation suite, while the remaining five have been acquired with a traditional SLR camera mounted on a motorized gantry, which makes it possible to move the camera precisely and to emulate a camera array with an arbitrary baseline. The HCI dataset is meant to represent the data from a light field camera, where both the baseline between adjacent views and the disparity range are typically very small; in particular, the disparity range of the HCI light fields spans only a few pixels. Differently, the Stanford dataset contains light fields whose view baseline and disparity range can be much larger. For this reason, the Stanford dataset does not closely resemble the typical data from a light field camera. However, we include the Stanford dataset in our experiments in order to evaluate the robustness of light field super-resolution methods to larger disparity ranges, possibly exceeding the assumed one. The Stanford light fields have all been acquired with a gantry, and they are characterized by a larger array of views. Similarly to [17] and [20], in our experiments we crop each light field to a smaller array of views.

In our experiments, we first create the low resolution version of each test light field from the datasets above. The spatial resolution of the test light field is decreased by the super-resolution factor by applying the blurring and sampling matrices of Eq. (4) to each color channel of each view. In order to match the assumptions of the other light field super-resolution frameworks involved in the comparison [17] [20], and without loss of generality, the blur kernel implemented by the blurring matrix is set to a box filter, and the sampling matrix performs a regular sampling. Then the obtained low resolution light field is brought back to the original spatial resolution by the super-resolution algorithms under study. This approach guarantees that the final spatial resolution of the test light field is exactly the original one, regardless of the super-resolution factor.

In our framework, the low resolution light field is divided into possibly overlapping sub-light-fields, and each one is reconstructed separately. Formally, a sub-light-field is obtained by fixing a spatial coordinate and then extracting a square patch with its top left pixel at that coordinate from each view. The result is a light field with the same number of views but a much smaller spatial resolution. Once all the sub-light-fields are super-resolved, the different estimates of the same pixel produced by the possible overlap are merged. The patch size is chosen separately for each super-resolution factor, such that the high resolution sub-light-fields have approximately the same spatial resolution in all the experiments. Finally, only the luminance channel of the low resolution light field is super-resolved using our method, as the two chrominance channels can be easily up-sampled via bilinear interpolation due to their low frequency nature.
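The division into sub-light-fields and the merging of the overlapping estimates can be sketched as follows; the patch size, the stride, and the plain averaging of the overlapping pixels are assumptions of this illustration.

```python
import numpy as np

def extract_sub_light_field(views, top, left, size):
    """Extract the same size x size patch (top left corner at (top, left)) from every view."""
    return [[v[top:top + size, left:left + size] for v in row] for row in views]

def merge_overlapping(estimates, full_shape):
    """Average the overlapping super-resolved patches back into a full view.

    `estimates` is a list of (top, left, patch) tuples in high resolution
    coordinates; plain averaging of the overlaps is an assumption of this sketch.
    """
    acc = np.zeros(full_shape)
    cnt = np.zeros(full_shape)
    for top, left, patch in estimates:
        h, w = patch.shape
        acc[top:top + h, left:left + w] += patch
        cnt[top:top + h, left:left + w] += 1
    return acc / np.maximum(cnt, 1)

# Usage: a 2 x 2 array of 64 x 64 views, from which one 32 x 32 sub-light-field is cut.
views = [[np.random.rand(64, 64) for _ in range(2)] for _ in range(2)]
sub_lf = extract_sub_light_field(views, top=16, left=16, size=32)
```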

Bilinear [17] [20] [10] GB-DR GB-SQ
buddha 35.22 0.00 38.22 0.11 39.12 0.62 37.73 0.03 38.59 0.08 39.00 0.14
buddha2 30.97 0.00 33.01 0.11 33.63 0.22 33.67 0.00 34.17 0.01 34.41 0.02
couple 25.52 0.00 26.22 1.61 31.83 2.80 28.56 0.00 32.79 0.17 33.51 0.25
cube 26.06 0.00 26.40 1.90 30.99 3.02 28.81 0.00 32.60 0.23 33.28 0.35
horses 26.37 0.00 29.14 0.63 33.13 0.72 27.80 0.00 30.99 0.05 32.62 0.12
maria 32.84 0.00 35.60 0.33 37.03 0.44 35.50 0.00 37.19 0.03 37.25 0.02
medieval 30.07 0.00 30.53 0.67 33.34 0.71 31.23 0.00 33.23 0.03 33.45 0.02
mona 35.11 0.00 37.54 0.64 38.32 1.14 39.07 0.00 39.30 0.04 39.37 0.05
papillon 36.19 0.00 39.91 0.15 40.59 0.89 39.88 0.00 40.94 0.06 40.70 0.04
pyramide 26.49 0.00 26.73 1.42 33.35 4.06 29.69 0.00 34.63 0.34 35.41 0.67
statue 26.32 0.00 26.15 2.15 32.95 4.67 29.65 0.00 34.81 0.38 35.61 0.73
stillLife 25.28 0.00 25.58 1.41 28.84 0.82 27.27 0.00 30.80 0.07 30.98 0.05
TABLE I: HCI dataset - PSNR mean and variance for the super-resolution factor
Bilinear [17] [20] [10] GB-DR GB-SQ
amethyst 35.59 0.01 24.18 0.20 36.08 4.12 38.81 0.01 40.30 0.11 39.19 0.25
beans 47.92 0.01 23.28 0.53 36.02 7.38 52.01 0.01 50.20 0.16 48.41 1.18
bracelet 33.02 0.00 18.98 0.22 19.91 3.86 38.05 0.00 39.10 0.43 28.27 2.84
bulldozer 34.94 0.01 22.82 0.09 32.05 3.73 39.76 0.03 37.27 0.33 35.96 0.43
bunny 42.44 0.01 26.22 1.15 40.66 6.69 47.16 0.01 46.96 0.06 47.01 0.06
cards 29.50 0.00 19.38 0.22 28.46 5.68 33.77 0.00 36.54 0.20 36.52 0.20
chess 36.36 0.00 21.77 0.27 34.74 5.85 40.75 0.00 42.04 0.08 41.86 0.07
eucalyptus 34.09 0.00 25.04 0.28 34.90 3.50 36.69 0.00 39.07 0.12 39.09 0.08
knights 34.31 0.04 21.14 0.24 29.33 4.77 38.37 0.06 39.68 0.27 37.23 0.40
treasure 30.83 0.00 22.81 0.15 30.52 2.93 34.16 0.00 37.68 0.26 37.51 0.15
truck 36.26 0.04 25.77 0.08 37.52 4.60 39.11 0.09 41.09 0.14 41.57 0.15
TABLE II: Stanford dataset - PSNR mean and variance for the super-resolution factor
Bilinear [17] [20] [10] GB-DR GB-SQ
buddha 32.58 0.01 34.92 0.63 35.36 0.34 34.62 0.01 35.42 0.02 35.40 0.02
buddha2 28.14 0.00 29.96 0.07 30.29 0.10 30.23 0.00 30.52 0.00 30.43 0.00
couple 22.62 0.00 23.02 1.56 27.43 1.16 24.01 0.00 26.65 0.01 26.95 0.00
cube 23.25 0.00 22.47 2.65 26.48 1.16 24.58 0.00 27.23 0.01 27.39 0.00
horses 24.35 0.00 24.45 1.27 29.90 0.55 24.73 0.00 25.53 0.00 26.41 0.01
maria 30.02 0.00 30.64 0.87 33.36 0.37 31.55 0.00 33.48 0.01 33.12 0.01
medieval 28.29 0.00 28.54 0.37 29.78 0.50 28.57 0.00 29.23 0.00 29.54 0.01
mona 32.05 0.00 33.42 0.52 33.31 0.40 34.82 0.00 34.66 0.01 34.47 0.01
papillon 33.66 0.00 36.76 0.13 36.13 0.48 36.56 0.00 36.44 0.01 36.18 0.01
pyramide 23.39 0.00 23.60 2.72 29.13 1.86 24.84 0.00 28.34 0.01 28.48 0.00
statue 23.21 0.00 22.97 3.63 28.93 2.03 24.72 0.00 28.21 0.01 28.38 0.00
stillLife 23.28 0.00 23.62 1.64 27.23 0.49 23.83 0.00 24.99 0.00 25.54 0.00
TABLE III: HCI dataset - PSNR mean and variance for the super-resolution factor
Bilinear [17] [20] [10] GB-DR GB-SQ
amethyst 32.69 0.01 21.94 0.30 31.47 1.07 34.79 0.01 35.97 0.03 35.63 0.02
beans 43.81 0.01 22.66 0.26 31.25 1.85 47.38 0.02 48.67 0.12 47.28 0.43
bracelet 29.06 0.00 17.37 0.22 15.83 0.32 31.96 0.00 35.23 0.06 30.46 1.08
bulldozer 31.88 0.00 21.85 0.11 26.21 0.85 35.48 0.01 35.44 0.05 35.31 0.04
bunny 39.03 0.00 23.40 1.22 35.88 1.82 43.62 0.01 43.73 0.05 43.54 0.05
cards 26.13 0.00 17.77 0.33 25.22 1.40 28.34 0.00 31.03 0.02 31.49 0.03
chess 33.11 0.00 20.56 0.24 31.19 1.96 35.76 0.00 36.87 0.04 36.76 0.03
eucalyptus 31.71 0.00 23.38 0.17 32.23 1.61 33.03 0.00 34.51 0.01 34.80 0.01
knights 31.31 0.02 19.36 0.07 25.55 1.40 34.38 0.05 35.37 0.06 35.21 0.05
treasure 27.98 0.00 21.45 0.14 27.86 0.89 29.58 0.00 31.37 0.01 31.21 0.01
truck 33.52 0.02 23.27 0.05 33.04 1.66 35.45 0.04 36.67 0.05 36.97 0.05
TABLE IV: Stanford dataset - PSNR mean and variance for the super-resolution factor

In our experiments we consider two variants of our Graph-Based super-resolution algorithm (GB). The first is GB-SQ, the main variant, which employs the warping matrix construction based on the square constraint (SQ) presented in Section V-A. The second is GB-DR, the variant employing the warping matrices extracted directly (DR) from the graph, introduced at the end of Section V-B. In the warping matrix construction in Eq. (10), as well as in the graph construction in Eq. (13), we empirically set the patch size and the tunable constant. The search window size is chosen so as to cover a fixed disparity range at high resolution. Note that this range happens to be loose for the HCI dataset, whose disparity range is much smaller. Choosing exactly the correct disparity range for each light field could both improve the reconstruction by