Graph-based representation for multiview image coding


Thomas Maugey, Antonio Ortega, and Pascal Frossard
Thomas Maugey and Pascal Frossard are with the Signal Processing Laboratory (LTS4), Institute of Electrical Engineering, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland (e-mail: thomas.maugey@epfl.ch; pascal.frossard@epfl.ch). Antonio Ortega is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089 USA (e-mail: antonio.ortega@sipi.usc.edu).
Abstract

In this paper, we propose a new representation for multiview image sets. Our approach relies on graphs to describe geometry information in a compact and controllable way. The links of the graph connect pixels in different images and describe the proximity between pixels in the 3D space. These connections depend on the scene geometry and provide just the amount of information that is necessary for coding and reconstructing the multiple views. This multiview image representation is very compact and adapts the transmitted geometry information to the complexity of the prediction performed at the decoder side. To achieve this, our graph-based representation (GBR) adapts the accuracy of the geometry description, in contrast with depth coding, which directly compresses the original geometry signal with losses. We present the principles of GBR and build a complete prototype coding scheme for multiview images. Experimental results demonstrate the potential of this new representation as compared to a depth-based approach: GBR achieves noticeable gains in reconstructed quality over depth-based schemes operating at similar rates.

Multiview image coding, 3D representation, view prediction, graph-based representation

I Introduction

Multiview image coding has received considerable attention in recent years. In particular, hardware technologies for the capture and the rendering of multiview content have improved significantly. For example, depth sensors and autostereoscopic displays have become popular in the past years [1]. This has led to novel immersive applications and thus to more challenges for the research community. One of the main open questions in multiview data processing resides in the design of representation methods for multiview data [2, 3, 4], where the challenge is to describe the scene content in a compact form that is robust to lossy compression. Many approaches have been studied in the literature, such as the multiview format [5], light fields [6] or even mesh-based techniques [7]. All these representations contain two types of data. On the one hand, the color or luminance information is classically described by 2D images. On the other hand, the geometry information describes the scene's 3D characteristics and is represented by 3D coordinates, depth maps or disparity vectors (note that no explicit scene geometry information is transmitted in the multiview format). Effective representation, coding and processing of multiview data rely on the proper manipulation of these two types of information, i.e., luminance and geometry.

Since depth signals can be efficiently captured with the advent of new sensor devices, the multiview plus depth (MVD) [8] format has become very popular in recent years. Depth information allows us to build a reliable estimation of the scene geometry. With this information, encoders are able to exploit the correlations between views [9], and decoders can synthesize virtual views [10]. Many recent multiview video coders rely on depth signals to enhance their coding performance [11]. However, the representation of geometry with depth data has one main drawback: if lossy compression is applied to depth, as done in classical coders, the induced error makes it difficult to control the quality of the synthesized viewpoints. This is the case even if depth gives a good estimation of the 3D scene geometry. More specifically, uncertainty in the depth value (due to quantization, for example) leads to a spatial uncertainty when determining the correspondence between pixels in neighboring views. This is illustrated in Fig. 1. Proper modeling of the impact of quantization on rendered view quality is in general difficult, even though it is crucial for solving classical problems such as rate allocation between depth and color signals in order to maximize the quality of the reconstructed views.

Fig. 1: A pixel in the reference view is associated with a pixel in a neighboring view, given its geometry (depth value). An uncertainty about this pixel's depth leads to a spatial inaccuracy in the neighboring view. This basic observation is the origin of the main drawbacks of depth-based representations. In contrast, our GBR uses disparity values, which are controlled lossy versions of depth values.

Note that, by nature, depth information represents the geometry of one view without considering any information about the predicted viewpoints. For example, raw depth maps generally have too much precision given the view predictions they are supposed to support. Instead of directly coding the raw depth maps with hard-to-control losses, a more efficient approach may consist of building a representation that captures only the information needed for the required view predictions, and then performing a lossless coding of this new geometry signal (this remains a lossy representation of the geometry, since only the information needed for rendering is transmitted). This approach is similar to one based on disparity vectors, even though these have a block-level precision in the current standards where they are employed. Hence, we investigate in this paper a solution for building "just enough" geometry information for coding a given set of views. The proposed approach considers only integer disparities that are obtained after a rounding operation on the floating-point disparity values derived from the depth maps. This geometry information can be viewed as a dense disparity map, which explicitly contains the information to link pixels in different views.

We propose a new geometry representation format based on graph structures, called Graph-Based Representation (GBR), where the geometry of the scene is represented as connections between corresponding pixels in different images. In this notation, two connected pixels are neighbors in the 3D scene. As shown in Fig. 1, the connections are derived from dense disparity maps and provide just enough geometry information to predict another viewpoint. In other words, before losslessly coding the geometry signal, GBR drastically simplifies it as a function of how it will be used at the decoder. This "use-aware" geometry compression allows us to control the error due to coding. While GBR offers a very generic format, we focus our study on the scenario where a set of views (color and depth), acquired at the encoder, is transmitted to a decoder that reconstructs the luminance images for all the views. This scheme is illustrated in Fig. 2. Throughout this paper, we compare our GBR solution, where geometry is represented by connections between pixels, to approaches where geometry is described by depth. We outline the importance of proper control of coding errors and show that it leads to a better view reconstruction quality.

Fig. 2: Our goal is to represent and code a set of viewpoints with ground-truth color and depth information, at a low rate and with high decoding quality.

In more detail, the GBR is constructed as follows. The first image in the set is represented by its color information. Then the GBR represents the new pixels of the second image (i.e., pixels that are not present in the first image, such as disoccluded pixels) and links them to their neighbors in the first image that correspond to the same 3D points in the scene. The same approach is repeated for all views (or a subset of them, as explained later) until the last view is reached. Hence, the resulting representation describes 3D points of the scene once and only once, i.e., the first time they are captured by one of the cameras, and links them through the different views in the graph. We build on [12, 13], where the basic concept of GBR has been introduced, and here we extend that work by designing a complete scheme, where luminance information is coded along with the graph information. Moreover, we take into account the errors in the connections and introduce residual images that allow us to correct minor geometrical distortions. Our GBR-based multiview coding scheme thus has to transmit one reference image, the graph connections providing the geometrical information, the luminance signal of the new pixels of every viewpoint and finally some residual images. As a first prototype implementation, we make use of off-the-shelf tools: JPEG2000 for the reference image and the residuals, and arithmetic coding for the graph.

Throughout this paper, we compare our approach to a simplified depth-based scheme. Rather than using the most recent standards, which apply depth-based intra prediction to each block, we build a hybrid coding scheme where depth-based prediction is used for the whole image, and prediction residuals are transmitted. Image by image, the current view (color and depth images) is predicted using the previous view and the corresponding depth, and then residuals for the luminance and depth signals are sent. The residuals for luminance and depth correct the prediction errors and complete the information in the disoccluded regions. The reconstructed images eventually serve to build an estimate of the images from the following viewpoints. This simplified approach provides a more direct way of comparing depth-based techniques to our GBR approach, since in both cases the encoder is required to use geometry-based prediction for all images except the first one. For depth image compression, we also use JPEG2000, which helps us to highlight the difficulties due to "blind" geometry compression. In this paper we provide a proof-of-concept implementation of our GBR, rather than optimized RD results. Our experiments nevertheless demonstrate that our GBR representation leads to an easier control of geometry compression artifacts than depth-based representations, leading to a better reconstruction quality.

Our GBR thus constitutes a promising alternative to depth-based representations that face the problem of geometry inaccuracies due to lossy compression of depth information. A number of approaches have been proposed recently to address this problem. In particular, some recent methods aim at improving rate-distortion performance of standard compression tools when applied to depth information. For that purpose, models of the error in geometry estimation due to traditional lossy compression of depth have been studied. In [14], the optimization is done by experimentally simulating some practical RD points and choosing the best one. The minimization is done with a multi-resolution full search. In some other works, a rate-distortion (RD) model is developed [15]. For example, in [16], the RD model is estimated region-by-region, corresponding to the different objects of the scene. In [17], the RD analysis relies on some complex models for the image textures. In [18], wavelet properties are used to separate the different components of the scene and to analyze object by object the consequence of inaccuracies in their depth values. Note that, regardless of the chosen RD model, optimization remains complex and strongly dependent on scene content and camera settings (baseline, geometry complexity, etc.).

Instead of optimizing standard codecs as described above, another solution is to develop alternative coding tools for depth maps in order to enhance the control of depth compression. The main observation in these approaches is that depth maps have sharp edges but very smooth textures. The goal of these coding tools is to preserve the sharpness of the edges, while spending few bits on the flat or smooth parts. Examples of tools that have been proposed include meshes [7], new block formats [19], graph-based transforms [20] and coding of depth edges [21]. These tools indeed improve the performance of depth-based coding schemes. However, they do not provide a better understanding of the effect of depth compression on rendering.

The same objective of reducing geometry inaccuracies has also been targeted by works that investigate alternative representations for multiview data. Similarly to GBR, the layered depth image (LDI) representation [22, 23] avoids the inter-view redundancies in the signal description. More precisely, in both GBR and LDI, the 3D points of the scene are represented once and only once, which is not the case for light field, multiview or depth-based representations. In LDI [22, 23], the pixels of multiple viewpoints are projected onto a single view. The redundant pixels are discarded and the new ones (i.e., the ones occluded in this reference view) are added in an additional layer. This very promising representation has, however, the drawback of being dependent on the depth signal. Indeed, the LDI also describes the depth values in multiple layers, which are necessary at the decoder side for retrieving the viewpoints. Thus, the problem of controlling the error due to depth compression, mentioned for the multiview-plus-depth format, still arises in LDI. A better control of these inaccuracies is achieved with GBR.

The rest of this paper is organized as follows. In Section II, we present our GBR solution by introducing in detail the graph construction process and the view reconstruction technique. We then present the complete coding scheme for the transmission of multiview data with our GBR representation (Section III). Finally, in Section IV, we present various experiments to compare the depth-based scheme and the GBR approach and we show the benefit of representing geometry with graphs.

II Graph-based geometry representation

II-A Multiview image data

Let us consider a scene captured by N cameras with the same resolution and the same focal length f. The n-th image is denoted by I_n, with 1 ≤ n ≤ N, where I_n(i,j) is the pixel at row i and column j. We consider a horizontal translation between the cameras, and we assume that the views are rectified. In other words, the geometrical correlation between the views reduces to horizontal displacements. We also work under the Lambertian assumption, which states that each 3D point of the scene has the same luminosity when viewed from every possible viewpoint. We assume that a depth image Z_n is available at the encoder for every viewpoint n, as illustrated in Fig. 2. Since the images are rectified, the relation between the depth z and the disparity d for two camera images is given by d = f·δ/z, where δ is the distance between the two cameras. In what follows, the geometry information is given by disparity values that are computed from the depth maps and the camera parameters. Our goal is to design a compact multiview representation of these camera images that offers control of the geometry information accuracy.
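
As an illustration, the following Python sketch (with hypothetical variable names) computes the dense integer disparity map used later by the GBR from a depth map, assuming the focal length f and the baseline δ are known from the camera parameters; the rounding to the nearest integer is the approximation discussed in Sections II-C and III.

import numpy as np

def depth_to_disparity(depth, f, delta):
    # Rectified views: d = f * delta / z, rounded to the nearest integer.
    # The rounding error is later corrected by residual images (Section III).
    disparity = f * delta / depth.astype(np.float64)
    return np.rint(disparity).astype(np.int64)

# Toy usage: background at depth 100, foreground object at depth 20.
depth = np.full((4, 8), 100.0)
depth[1:3, 2:5] = 20.0
print(depth_to_disparity(depth, f=50.0, delta=2.0))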

II-B Geometrical structure representation

Fig. 3: Illustration of camera translation for a simple scene with a uniform background and one foreground object. Types of pixels in depth-based inter-view image warping: pixels can be a) appearing, b) disoccluded, c) occluded and d) disappearing. The solid green line is an arbitrary row in the reference image and the dashed line is the corresponding row in the target image.

Before introducing our new data representation in detail, we analyze the effect of camera translation on the image content. Let us consider two images captured by cameras that are separated by a given distance. Since we consider only full-pixel displacements, the geometrical correlation between pixels in these two images takes the form of a horizontal shift by an integer disparity value. When this relation holds, pixels in certain regions of one image can be directly associated to pixels in corresponding regions of the other. These correspond to the elements of the scene that are visible in both images. Alternatively, the elements that are visible from only one viewpoint are often designated under the general name of occlusions, even if their appearance is not only due to object occlusions. More precisely, we can categorize these pixels that are present or absent in only one image into four different types, as illustrated in Fig. 3. First, a new part of the scene appears in the camera because of the camera translation. It usually comes from the right or left (depending on the translation direction) and the new pixels are not related to object occlusions. They are called appearing pixels. During camera translation, foreground objects move faster than the background. As a result, some background pixels may appear behind objects and are thus called disoccluded pixels. Conversely, some background pixels may become hidden by a foreground object. These are called the occluded pixels. Finally, some pixels disappear in the viewpoint change, and they are called disappearing pixels.

We illustrate these different types of pixels and consider a row of the target image in Fig. 3. Starting from the left border, we notice that the row first contains several appearing pixels, and then some pixels of the reference image. Then, the row presents some disoccluded pixels before coming back to pixels of the reference image. After that, the row contains occluded pixels, which correspond to a jump between pixels in the reference image. The rest of the row refers to the reference image, until a series of disappearing pixels at the end of the row. We now want to describe the pixels of this target row in the second view by maximizing references to elements from the corresponding row in the first view. This can be achieved by navigating between the reference image and the "new" pixels of the target image. This navigation can be guided by connections between corresponding pixels in both views. We thus propose to construct a graph that is exactly made of these connections. This graph is derived from the depth information, and the number of connections varies linearly with the number of foreground objects in the image. Similarly, the size of these connections evolves linearly with the distance between cameras and the object disparities. A more formal description of the graph construction method is given next.

II-C Graph construction

Fig. 4: Graph construction example: the blue textured background has a constant disparity at each view and the red rectangular foreground has a larger constant disparity. This example graph contains all the different types of pixels: a) appearing, b) disoccluded, c) occluded and d) disappearing.
Fig. 5: Reconstruction of a view with the toy example of Fig. 4. The green arrows indicate the graph exploration order for view reconstruction.

The proposed graph representation intends to avoid redundancies in the color information (i.e., only "new" pixels are described) and, additionally, to offer an intuitive description of the geometry information with links between corresponding pixels in different views. Generally, a graph with L levels describes one reference image and L−1 predicted ones, and is constructed based on the depth maps of the successive views. Since the object displacement is only horizontal, the graph construction is independent for each image row. For each row, the graph is made of two components, which are described by two matrices of size L × W, where L is the number of levels (i.e., the number of images encoded by the graph) and W is the image width in pixels. These two matrices respectively gather the color and the geometry information for all pixels in that row across all images: one contains the color values of the row and the other contains the connections of the same row. In the graph construction, both matrices are initialized to zero, which means "no color value" and "no connection", respectively.
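
A minimal sketch of this per-row data structure is given below (Python/NumPy, with illustrative names); as in the text, a zero entry stands for "no color value" or "no connection".

import numpy as np

def init_row_matrices(L, W):
    # One pair of L x W matrices per image row: new color values and
    # graph connections.  Zero is used as the "empty" sentinel here.
    colors = np.zeros((L, W), dtype=np.float64)
    connections = np.zeros((L, W), dtype=np.int64)
    return colors, connections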

We now describe in detail the construction of the graph. We show in Fig. 4 a graph construction example, whose levels correspond to one reference view and the successive synthesized views. For the sake of clarity, we first describe in detail the graph construction of an arbitrary row by considering only one predicted view, one reference view and its associated depth map. The first level corresponds to the reference view, whose color values it directly stores. The connections then indicate the relation between the pixels in the current level and those in the next one.

Then, the connection values and the color values are assigned based on the following principles:

  • The pixel intensities are represented in the level where they appear first, which means that the second level only contains pixels that are not present in the reference image.

  • The connections simply consist in linking these "new" pixels to the position of their neighbor in the previous level. More precisely, a new pixel represented at a level higher than the first one is hidden by a foreground object in the previous views. If this foreground object were not in the scene, the pixel would have been visible in the previous views, next to the other background pixels. The "neighbor" in the lower level is thus the pixel just next to the disoccluded area.

We now describe precisely how each of the pixel types in Fig. 3 is handled in our graph-based representation. First, the appearing pixels are assigned their color values in the second level without any connectivity information (they are implicitly attached to the side of the image). In the example of Fig. 4, the dark blue appearing pixel is stored in the second level at its position in the target view. Similarly, since the disoccluded pixels do not appear in the reference image, their color values are stored in the second level of the color matrix, at their positions in the target view. Additionally, at the reference level, at the position of the last pixel before the foreground object in the row, we store a connection whose value is the disparity associated with the background depth. This connection links the last background pixel of the reference view to the pixels of the target view that are disoccluded. For example, in Fig. 4, the foreground object is red; in the first level, the last pixel before this foreground object is the light blue pixel. The graph thus links this pixel to the first disoccluded pixel of the second level, and these two pixels are considered as neighbors. The connection value is given by the disparity of the background in the example of Fig. 4.

The occluded pixels correspond to a jump in the reference view, since they represent color values that are absent in the second view and only visible in the reference one. The jump value is stored in the connectivity matrix at the following two positions: 1) the last pixel of the foreground object (with a connection value equal to the foreground disparity) and 2) the last pixel of the corresponding occluded region (with a connection value equal to the background disparity). In the example of Fig. 4, the first connection is stored at the position of the last pixel of the red foreground object in the reference level, with the foreground disparity as its value; the second connection is stored at the position of the last pixel of the occluded region, with the background disparity as its value. We notice that the two connections meet at the position corresponding to the last foreground pixel in the second level. This time, since no new pixel is contained in the second view, we do not store any value in the color matrix. Finally, the disappearing pixels are simply indicated by a connection value stored at the position of the last preceding pixel; this connection value is equal to the background disparity. For the next views, the graph construction proceeds in the same way, i.e., each view is connected to the previous one and constitutes a new level in the graph. This leads to a pair of color and connection matrices for each image row, which are concatenated into two 3D matrices that constitute the complete GBR data structure.

The GBR construction strategy introduced above is summarized in a general form in Algorithm 1. The inputs are the two luminance views, the depth image of the reference view and the distance between the two cameras. First, we convert the depth image into a dense disparity map (line 1). The non-integer disparity values are simply rounded to the closest integer, since the current GBR implementation only handles integer disparities. This operation induces an approximation error that is corrected by a residual image, as detailed in Section III. Then, the graph construction is done row by row. The pixels of the reference view are first inserted in the first level of the color matrix (line 3). We then insert the appearing pixels in the second level of the color matrix (line 4). After this operation, we go through the dense disparity map and detect disocclusions (line 6) and occlusions (line 7). For building a graph with more than two images, one simply needs to repeat the operations from lines 3 to 8 for every predicted view, while taking the most recent view as the starting point. Finally, the matrices obtained for every row are concatenated into the two 3D matrices that form the complete GBR structure.

With the above graph construction method, the graph representation is sparse (only a small fraction of entries is non-zero) and avoids all redundancy in the color value description, since the pixel values stored at a given level are only those that are not present in the lower levels. Another important advantage of this graph representation is its multi-level structure, where the connections in one level are related to connections in the lower levels and form chains of connections. Therefore, a reconstruction algorithm only needs to go through these connection chains to reconstruct the different multiview images.

Algorithm 1: GBR construction for two levels
Input: the two luminance images (reference and predicted views), the depth map of the reference view, and the distance between the two cameras.
Output: the color and geometry (connection) matrices of the GBR.
1:  Convert the depth map of the reference view into a dense disparity map, rounding each value to the nearest integer.
2:  for each image row do
3:      Insert the pixels of the reference view in the first level of the color matrix.
4:      Insert the appearing pixels ((a) in Fig. 4) in the second level of the color matrix.
5:      Scan the disparity values of the row from left to right and compare each value with the previous one:
6:          if the change corresponds to a disocclusion, fill the disoccluded pixels ((b) in Fig. 4) in the second level of the color matrix and store in the connection matrix the link between the two neighbors in the 3D space;
7:          if the change corresponds to an occlusion, store the jump ((c) in Fig. 4) in the connection matrix.
8:      Mark the disappearing pixels ((d) in Fig. 4) with a last connection in the connection matrix.
9:  end for
10: Concatenate the row matrices into the two 3D matrices that form the complete GBR structure.
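
To make the construction concrete, the following Python sketch builds the two-level graph for a single image row. The disparity sign convention (a reference pixel at column j appears at column j + d in the predicted view), the zero sentinel and the exact positions at which the connections are stored are illustrative assumptions fixed only by the example of Fig. 4, so this should be read as one possible instantiation of Algorithm 1 rather than the reference implementation.

import numpy as np

NO_VALUE = 0  # sentinel for "no color value" / "no connection" in this sketch

def build_gbr_row(row1, row2, disp1):
    # row1, row2: luminance of one row in the reference and predicted views
    # disp1: rounded integer disparities of the reference row
    # Assumed convention: pixel j of the reference view maps to column
    # j + disp1[j] of the predicted view (appearing pixels on the left,
    # disappearing pixels on the right).  Luminance is assumed positive.
    W = len(row1)
    colors = np.full((2, W), NO_VALUE, dtype=np.float64)
    conn = np.full((2, W), NO_VALUE, dtype=np.int64)

    colors[0, :] = row1                     # level 1: reference row as is
    colors[1, :disp1[0]] = row2[:disp1[0]]  # appearing pixels (left border)

    for j in range(1, W):
        d_prev, d = int(disp1[j - 1]), int(disp1[j])
        if d > d_prev:
            # Disocclusion of size d - d_prev: store the new pixels in
            # level 2 and link them to their neighbor in level 1.
            lo, hi = j + d_prev, min(j + d, W)
            colors[1, lo:hi] = row2[lo:hi]
            conn[0, j - 1] = d_prev
        elif d < d_prev:
            # Occlusion of size d_prev - d: store the jump at the last
            # foreground pixel and at the last occluded background pixel.
            conn[0, j - 1] = d_prev
            conn[0, min(j - 1 + (d_prev - d), W - 1)] = d

    # Disappearing pixels: mark the last reference pixel still visible
    # in the predicted view.
    visible = np.nonzero(np.arange(W) + np.asarray(disp1) < W)[0]
    if len(visible) > 0 and visible[-1] < W - 1:
        conn[0, visible[-1]] = int(disp1[visible[-1]])
    return colors, conn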

II-D View reconstruction at the decoder

The graph information described in the previous section is used directly for view reconstruction at the decoder, which has access to the color and connection components of the graph for every row. The reconstruction of a certain view requires the color values and the connections of all lower levels. The reconstruction of the color values in the current view is performed by navigating the graph between the different levels. This navigation starts from the border of the image at the level that needs to be constructed, then follows the connections and refers to the lower levels when no color information is available at the current level. We show in Fig. 5 an example of view synthesis for the image of the second level, based on the graph of the example in Fig. 4. The pixel numbering follows the column indices of the reference row, as in Fig. 4. The reconstruction starts with the appearing pixel at the second level. Then, it moves to the reference level and fills pixel color values until encountering a non-zero connection. This first connection links the last background pixel to the disoccluded pixels stored in the second level. After filling all the disoccluded pixels, the reconstruction goes back to the reference level and fills color information until the next non-zero connection. This connection indicates an occluded region. Hence, the reconstruction algorithm jumps across the occluded columns in the reference view and continues the decoding of the pixels in the reference level until it encounters the next non-zero connection (disappearing pixels). The reconstruction of the other views (i.e., the other levels of the graph) is done recursively. We see that the reconstruction process is very simple and that the required geometry information is captured in a flexible and controlled way by the graph connections. The integer disparities obtained after a rounding operation lead, however, to errors in the view prediction. We leave for future work the study of more advanced techniques that could interpolate pixels from non-integer disparities, as is done in [24]. In this paper, we handle these rounding errors with the generation of residual images, as detailed in Section III.
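
The navigation can be sketched as follows for one row and two levels (Python, reusing the conventions of the construction sketch above; the way disocclusion and occlusion connections are disambiguated is an assumption of this sketch).

import numpy as np

NO_VALUE = 0

def reconstruct_row(colors, conn):
    # colors, conn: the 2 x W matrices produced for one row (see above).
    # A read pointer j walks the reference level while a write pointer t
    # fills the target row; levels are switched at non-zero connections.
    _, W = colors.shape
    target = np.full(W, float(NO_VALUE))

    t = 0
    while t < W and colors[1, t] != NO_VALUE:   # appearing pixels
        target[t] = colors[1, t]
        t += 1

    j = 0
    while j < W and t < W:
        target[t] = colors[0, j]                # copy from reference level
        t += 1
        if conn[0, j] != NO_VALUE:
            if t < W and colors[1, t] != NO_VALUE:
                # Disocclusion: copy the new pixels stored in level 2.
                while t < W and colors[1, t] != NO_VALUE:
                    target[t] = colors[1, t]
                    t += 1
            else:
                # Occlusion (or disappearing pixels): skip the occluded
                # reference pixels, up to the next non-zero connection.
                j += 1
                while j < W and conn[0, j] == NO_VALUE:
                    j += 1
        j += 1
    return target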

III GBR information coding

In this section, we propose a complete encoding scheme where the GBR information (color and geometry) is compressed to provide a compact description of multiview images. The geometrical errors, due to depth errors or non-integer disparities, are carefully taken into account in order to minimize the reconstructed image distortion.

III-A Geometry coding

We describe first our approach to code the graph connections. As we can observe in the example of Fig. 4, the matrix of connections for a given row is sparse. Hence, it can be coded with a small number of bits. For that purpose, we do not code the connection matrices directly but rather consider a smaller matrix with four columns and one line per connection, whose number of lines is the number of non-zero elements in all the connection sub-matrices. This matrix stores all the meaningful connections, each characterized by four parameters: the first column contains the row index of each graph connection, the second column contains the column index, the third column contains the graph level index and, finally, the fourth column contains the actual connection value. We then code the columns separately, using first a differential operator along each of them in order to decrease the entropy, and then an arithmetic coding technique.
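
A minimal Python sketch of this step is given below; the layout of the connection array (rows x levels x columns) and the use of an empirical entropy estimate in place of the actual arithmetic coder are assumptions made for illustration only.

import numpy as np

def connections_to_table(conn_3d):
    # conn_3d: connections for the whole image, shape (rows, levels, columns).
    # Returns the K x 4 table (row index, column index, level index, value).
    rows, levels, cols = np.nonzero(conn_3d)
    values = conn_3d[rows, levels, cols]
    return np.stack([rows, cols, levels, values], axis=1)

def columnwise_differences(table):
    # Differential operator applied independently to each column; the
    # result would then be fed to an arithmetic coder.
    diff = table.copy()
    diff[1:, :] = table[1:, :] - table[:-1, :]
    return diff

def empirical_entropy_bits(symbols):
    # Rough stand-in for the arithmetic coder: empirical entropy in bits.
    symbols = np.asarray(symbols)
    if symbols.size == 0:
        return 0.0
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum() * symbols.size)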

Fig. 6: Luminance and depth images of views 1 and 2 of the "squares 1" dataset. The disparities of the objects in the scene are integer.

We illustrate the behavior of the proposed compression scheme in the lossless coding of the images of an artificial dataset (called "squares 1" and shown in Fig. 6). The data represent a 3D scene made of one planar background and multiple foreground square objects. The scene is captured by parallel cameras such that the disparities of the objects between the viewpoints are only horizontal. The background and the foreground objects are parallel to the camera planes. The number of foreground objects, their position, their size and the integer disparity values are generated randomly. We show in Fig. 7 the evolution of the graph geometry coding size (in bits) as a function of the number of views involved in the representation. Although the observed linear relationship depends only on the regularity of the scene and acquisition, we notice that the required number of bits increases with the number of levels. This is due to the nature of the graph construction. It reflects that GBR sends "just enough" geometry information for a given number of views to predict, and increases this amount of geometry information only when the number of views to predict becomes higher.

Fig. 7: Evolution of the coding size (in bits) for lossless geometry compression, as a function of the number of images in the “squares 1” dataset.

In order to decrease the coding costs, we can estimate the geometry in some views, instead of coding it for every image. Hence, we introduce the possibility of removing some images (i.e., levels) from the graph structure and interpolating them at the receiver. In this case, fewer bits are required for encoding the geometry since the number of levels is reduced. When a level is removed from the graph, the graph links are directly extended to the next level (e.g., edges connect two remaining levels directly, instead of passing through the skipped level), and the pixel values of the level that is skipped are stored in the upper level. However, the interpolation of views at the decoder may create some distortion in the geometry. The interpolation of a view at the decoder is done by disparity compensation with the two closest received images. The two disparity-compensated estimations of the interpolated view are then merged, which results in a synthesized image with no disocclusion. Since the disparity maps are not explicitly transmitted in a GBR-based scheme, they are retrieved from the values of the connections in the graph. In other words, the GBR geometry can be used for virtual view synthesis at the decoder, similarly to what can be achieved with depth images. The choice of the number of levels and of which levels are included in the graph is a tradeoff between the bitrate required for graph transmission and the distortion of the reconstructed views induced by view removal. In this paper, we choose the number of levels and the retained views with a full search algorithm that evaluates the graph size and the rendering distortion for many configurations.
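
The following sketch illustrates, for one row, a simple way of interpolating a skipped view from its two closest received neighbors (Python; the forward-warping rule, the handling of overlaps and the merging strategy are simplifications assumed here, and the dense disparities are supposed to have been retrieved from the graph connections).

import numpy as np

def warp_row(row, disp, scale):
    # Shift each pixel by `scale` times its disparity (rounded to the
    # nearest column); unfilled positions stay NaN.  Z-ordering of
    # overlapping pixels is ignored in this sketch.
    W = len(row)
    out = np.full(W, np.nan)
    pos = np.rint(np.arange(W) + scale * np.asarray(disp)).astype(int)
    valid = (pos >= 0) & (pos < W)
    out[pos[valid]] = np.asarray(row)[valid]
    return out

def interpolate_row(row_left, disp_left, row_right, disp_right, alpha):
    # alpha in (0, 1): relative position of the skipped view between the
    # two received views.  The two disparity-compensated estimates are
    # merged so that no disocclusion remains.
    est_left = warp_row(row_left, disp_left, alpha)
    est_right = warp_row(row_right, disp_right, -(1.0 - alpha))
    return np.where(np.isnan(est_left), est_right, est_left)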

III-B Luminance compression and residual images

The color signal compression may benefit from the graph structure that links pixels at different levels to each other. In the proposed scheme, the reference image is encoded with traditional image coding tools. The novel pixels at every level are to be coded by traversing pixels along the graph connections. One of the interesting properties of the graph is that it links pixels that are supposed to represent two neighboring points in the 3D scene. In other words, these pixels might be correlated, which can be exploited for coding. For the current system, we use a simple differential operator along the graph. The differentiated color values are then coded using an arithmetic coder. Development of more sophisticated graph-based techniques is part of our ongoing work.
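
As a small illustration, the differential step applied to a sequence of new pixel values gathered along the graph can be sketched as follows (the traversal order that produces the values sequence is assumed to follow the graph connections, as described above).

import numpy as np

def delta_encode(values):
    # Keep the first value, then successive differences; the output is
    # then entropy coded (arithmetic coder in the prototype).
    values = np.asarray(values, dtype=np.int64)
    if len(values) == 0:
        return values
    return np.concatenate(([values[0]], np.diff(values)))

def delta_decode(deltas):
    # Inverse operation: cumulative sum.
    return np.cumsum(np.asarray(deltas, dtype=np.int64))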

The graph introduced in the previous sections only handles integer disparities in the connections. However, the actual disparity values obtained from the depth data are not always integer, and the rounding operation (see line 1 of Algorithm 1) in the graph construction introduces a geometric error in the view prediction. In addition, if the initial depth map contains errors, the views predicted by geometric projection contain errors too. In order to compensate for these errors, we generate residual images that correct the view prediction error and prevent error propagation through the graph levels. Compensation errors may appear on almost every part of the predicted images, as illustrated in Fig. 8. More precisely, sub-pixel precision disparity values imply that a 3D point of the scene is not necessarily captured at integer pixel positions by two adjacent viewpoints. Therefore, when a view is predicted from another one, errors may occur. We can see in Fig. 8 that these errors may appear in every region of the image except in the ones filled by the current level of the graph. However, we still build the residual at the image level, with a value of zero at the locations where the current level provides a new value. These residuals are coded using the JPEG2000 coder in our current implementation. In order to illustrate the role of the residual images, we build a dataset called "squares 2" involving non-integer disparities. The scene is made of square foreground objects with half-pixel precision disparity values. As for the "squares 1" dataset, the positions of the foreground objects and their disparities are initialized randomly. Thus, for some views, a foreground object may appear at a half-pixel position. In this case, the pixel intensity represented in the image is the average of the foreground and background luminance values. We show in Fig. 9 the images corresponding to views 1 and 2 of the dataset, along with the residual error image of view 2, which is the difference between view 2 and its estimation from view 1 and the given GBR geometry information. We see that the residual error image mostly contains energy at the object boundaries, as is also the case in the illustration of Fig. 8.
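
A minimal sketch of the residual generation is given below (Python; the JPEG2000 compression of the result is performed with an external coder and is not shown).

import numpy as np

def prediction_residual(original, predicted, new_pixel_mask):
    # Residual sent for one predicted view: the prediction error, forced
    # to zero wherever the current graph level already provides a new
    # pixel value (those locations need no correction).
    residual = original.astype(np.float64) - predicted.astype(np.float64)
    residual[new_pixel_mask] = 0.0
    return residual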

Similarly, we also generate residual images to correct the error due to interpolation when a view is removed from the graph structure. For the interpolated views, errors may occur in any region of the image, given that the connections of the graph do not correspond exactly to the actual geometry of the scene. Therefore, these residuals are again images, coded with JPEG2000.

Fig. 8: Illustration of the prediction error between two views when half-pixel disparities occur for a given row.
Fig. 9: Views 1 (a) and 2 (b) and the residual error image of view 2 (c) after geometry prediction from the GBR connections for the "squares 2" dataset. A zoom on the foreground object is shown for view 2 (b). The disparities of the objects in the scene are non-integer, such that errors appear at object boundaries, as can be observed in (c).

We have described above a complete coding scheme where we can vary the coding precision of the color signal and of the residual images, and where we can also adjust the number of levels involved in the graph representation, in order to optimize the rate-distortion performance. We are thus able to generate several rate-distortion points from low to high bitrate. The optimal rate allocation between the different components is, however, a complex task. In our prototype encoder, it relies on a full search over the compression settings of all the components. Development of coding tools better suited to these datasets, as well as RD optimization techniques, is part of ongoing work.
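
Such a full search can be sketched as follows (Python; the encode_fn callback and the granularity of the candidate settings are hypothetical, since the prototype simply enumerates the compression steps of all components).

from itertools import product

def full_search_allocation(encode_fn, color_settings, residual_settings,
                           level_settings, rate_budget):
    # encode_fn(c, r, l) is assumed to return a (rate, distortion) pair for
    # one configuration; keep the least distorted one that fits the budget.
    best = None
    for c, r, l in product(color_settings, residual_settings, level_settings):
        rate, dist = encode_fn(c, r, l)
        if rate <= rate_budget and (best is None or dist < best[1]):
            best = ((c, r, l), dist, rate)
    return best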

Fig. 10: Original depth map (a) and luminance (b) of the first view of the "squares 3" dataset, JPEG2000 compressed depth map (c), and disparity map retrieved from the GBR (d). Reconstruction of the second view from the JPEG2000 compressed depth map (e) and from the GBR geometry (f). No residual error data is used for reconstruction.
Fig. 11: Original depth image of the "sawtooth" dataset (a), the corresponding disparity retrieved from the GBR (b) and the depth image compressed with JPEG2000 (c). The GBR geometry and the depth map have been compressed at the same rate.
Fig. 12: Coding rate distribution (a) between the geometry, texture and residual components for the GBR and depth-based representations. For both schemes, the average reconstruction quality of the views of the "venus" dataset (b) is set to the same value.

IV Multiview coding experiments

We now evaluate the performance of our novel GBR representation technique. We consider the multiview system described in Fig. 2 and show that i) the representation of the geometry with our graph-based approach leads to more efficient compression performance than depth-based schemes and ii) the graph-based representation of the geometry provides a better control of geometry coding artifacts than commonly used approaches for depth map compression.

We first propose experiments where we measure the compressibility of the geometry signal in the GBR and depth-based schemes in lossless representation scenarios, in the sense that the view prediction is perfect. We focus on two views with only integer disparities in the "squares 1" dataset introduced above. We build our GBR structure on the first two images. The reference image is not compressed and no residual is transmitted. Similarly, the depth-based scheme encodes one reference image, one depth image and the color residual of the second view. The compression of the depth image is done with the lossless JPEG2000 codec [25], while the luminance is also transmitted losslessly. We focus on the geometry rate only, and for both schemes the prediction is perfect. We observe that the rate required for the graph information is significantly smaller than the rate needed to compress the depth image. Thus, the graph links provide a more compact description of the scene geometry than lossless compression of depth map images. Even though a more efficient technique could be considered for lossless depth compression, this first experiment shows that our graph achieves a good compressibility of its geometry signal, even in the lossless coding case. This case is, however, particular in the sense that lossless prediction only happens for very specific datasets. Moreover, coding schemes almost never operate in lossless configurations.

We next evaluate the performance in lossy compression scenarios. In natural images, losses are introduced because of a) non-integer disparities or depth inaccuracies, as shown in Sec. III, and b) geometry compression, with graph reduction or depth image compression in the GBR or depth-based schemes respectively. We now study the geometry compression artifacts. We use a more complex dataset called "squares 3" that contains more complex depth maps, since the foreground objects are not parallel to the camera plane, unlike in the "squares 1" and "squares 2" datasets. The positions and the depths of the foreground objects are initialized randomly. Depth images corresponding to this new dataset are shown in Fig. 10. As the foreground objects do not have the same depth everywhere, new disoccluded pixels may appear. This is due to the fact that the foreground objects change size from one view to another. Since prediction algorithms simply project the pixels involved in a view, some additional pixels might be added to complete the view. They are handled by the residual images in the depth-based scheme, while they are simply added in the current graph level in our GBR. As in the previous experiment, we are interested in the compression of the geometry information only. We thus compress the geometry information with our GBR scheme, and compare this with the depth-based scheme where the depth image is encoded with JPEG2000. In both cases, we use the same encoding rate for the geometry information. As can be seen in Fig. 10 (c), the JPEG2000 depth compression leads to significant artifacts on the resulting depth maps and thus to a high compensation error (Fig. 10 (e)). With the GBR scheme, the geometry information is more accurate (Fig. 10 (d)), and the reconstruction results are better, as shown in Fig. 10 (f). Similar observations can be made on a natural sequence, as shown in Fig. 11 for the "sawtooth" dataset. We use the same comparison method and study the geometry compression artifacts. We compare the original depth map, the disparity map retrieved from the GBR and the compressed depth image (at similar bitrate). We observe that GBR provides better control over where to introduce losses and where to preserve geometry accuracy. More specifically, the reconstructed disparity map is piecewise constant but the edges are still sharp, in contrast to the approximation provided by JPEG2000 compression. Moreover, the level of geometry precision achieved by GBR is just enough to reconstruct the second viewpoint. We next show how these GBR properties lead to better reconstructed view quality.

We build another experiment with a larger number of images and extend our study of the effect of geometry compression for the GBR and depth-based representations when, this time, the reference, the geometry and the residuals are coded. We run experiments where we represent the images of the "venus" dataset (Fig. 12 (b)) using the GBR and depth-based coding schemes. We select the total coding rates so that we achieve the same reconstruction quality, while the geometry rates and color rates are kept similar in both coding schemes. In other words, we vary only the rate of the residual images. We show the rate distribution in Fig. 12 (a) for the two representations. We observe that, for a constant geometry and texture rate, the depth-based scheme needs to send more residual information in order to achieve the same quality. In other words, the GBR has to perform less compensation after geometry compression, which means that it better controls the effect of geometry coding. While similar observations are made at different target qualities, the GBR gains with respect to the depth-based scheme are highest at medium and high bitrates. The GBR, by the nature of its construction, cannot decrease its geometry rate below the minimum amount of information that is needed for one view prediction. Thus, it loses its advantage with respect to the depth-based representation at very low rates.

Fig. 13: IPPPP depth-based multiview encoding scheme used for experiments.
Fig. 14: Rate-distortion performance comparison between the GBR system and the depth-based scheme for the (a) "squares 2", (b) "venus" and (c) "sawtooth" test sequences.

Finally, we present some rate-distortion (RD) performance evaluation results, where we compare the optimized GBR and optimized depth-based schemes in the scenario depicted in Fig. 2. For the depth-based scheme, we consider a format of type IPPPPP, which means that a first view is transmitted along with its depth, and then, the other views are estimated iteratively by disparity-compensation using residual error data (for depth and color images). The block diagram of this scheme is illustrated in Fig. 13. We build the GBR coder as explained before; it uses the same images (color and depth) as the depth-based scheme. The objective for both schemes is the reconstruction of color images. For both schemes, we simulate RD points at different quantization steps for geometry, color and residual compression. For the GBR scheme, we also vary the number of levels in the graph (the other levels are interpolated at the decoder side). In both schemes, we have distributed the rates of geometry, texture and residual optimally in order to maximize the reconstruction quality. In particular, we retain the convex envelope of these two RD point clouds in order to present the optimal RD curves for each scheme. We present the results obtained for the “squares 2”, “venus” and “sawtooth” datasets in Fig. 14 (a), (b) and (c) respectively. We see that our scheme generally outperforms the depth-based approach. This is due to the fact that GBR controls the geometry compression, which leads to reduced residual error sizes. We see however that, at low bitrates, the difference between the two schemes is smaller or that the depth approach is better for the “Venus” dataset. The simple graph compression algorithm that we have designed is still limited when the bandwidth is too small. In particular, once we have removed all the intermediary images from the graph, we cannot reduce further the rate required for the geometry information in GBR. This fixed overhead leads to less competitive behavior at low rates. However, outside of the very low bitrate regime, the GBR representation leads to improved RD performance.
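
For reference, the envelope of an RD point cloud can be extracted as in the following sketch (Python; quality is assumed to be measured as PSNR, so dominated points are dropped and the marginal quality gain is forced to decrease with the rate).

def rd_convex_envelope(rates, psnrs):
    # Keep, among all simulated (rate, PSNR) points, only those lying on
    # the envelope of the cloud.
    pts = sorted(zip(rates, psnrs))
    hull = []
    for r, q in pts:
        if hull and q <= hull[-1][1]:        # dominated: worse quality at higher rate
            continue
        while len(hull) >= 2:
            (r1, q1), (r2, q2) = hull[-2], hull[-1]
            if (q2 - q1) * (r - r2) <= (q - q2) * (r2 - r1):
                hull.pop()                   # previous point lies below the chord
            else:
                break
        hull.append((r, q))
    return hull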

V Conclusion

In this paper, we have proposed an alternative to depth-based representations for multiview image coding. Using graphs to describe connections between pixels of different views, our method manages to represent the geometry of the scene and to avoid the inter-view redundancies. At the same time, it improves the control of geometry compression artifacts in the reconstructed images. We have proposed a complete coding scheme based on this new graph-based representation and illustrated its potential in rate-distortion performance compared to depth-based schemes. Future work will focus on the development of more effective coding strategies in order to extend the performance of this promising GBR representation of multiview images. More precisely, we will investigate how GBR can handle non-integer disparity values and quantization errors in order to improve the performance at low bitrates.

Acknowledgment

This work has been partly supported by the Hasler Foundation, within the project NORIA (novel image representation for future interactive multiview systems).

References

  • [1] G. Alenya and C. Torras, “Lock-in time-of-flight (TOF) cameras: A survey,” IEEE Sensors Journal, vol. 11, pp. 1917–1926, 2011.
  • [2] H. Shum, S. Kang, and S. Chan, “Survey of image-based representations and compression techniques,” IEEE Trans. on Circ. and Syst. for Video Technology, vol. 13, pp. 1020–1037, 2003.
  • [3] K. Müller, P. Merkle, and T. Wiegand, “3D video representation using depth maps,” Proc. IEEE, vol. 99, no. 4, pp. 643–656, Apr. 2011.
  • [4] J. Salvador and J. Casas, “Multi-view video representation based on fast monte carlo surface reconstruction,” IEEE Trans. on Image Proc., vol. 22, pp. 3342–3352, 2013.
  • [5] ISO/IEC MPEG & ITU-T VCEG, “Joint multiview video model (JMVM),” Marrakech, Morocco, Jan.13-19 2007.
  • [6] J. Chai, X. Tong, S. Chan, and H. Shum, “Plenoptic sampling,” in Proc. Int. Conf. on Computer Graphics and Interactive Techniques, 2000.
  • [7] S. Kim and Y. Ho, “Mesh-based depth coding for 3D video using hierarchical decomposition of depth maps,” in Proc. IEEE Int. Conf. on Image Processing, San Antonio, TX, USA, Sep. 2007.
  • [8] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, “Multi-view video plus depth representation and coding,” in Proc. IEEE Int. Conf. on Image Processing, San Antonio, TX, US, Oct. 2007.
  • [9] S. Yea and A. Vetro, “View synthesis prediction for multiview video coding,” EURASIP J. on Sign. Proc.: Image Commun., vol. 24, pp. 89–100, 2009.
  • [10] D. Tian, P. Lai, P. Lopez, and C. Gomila, “View synthesis techniques for 3D videos,” Proc. of SPIE, the Int. Soc. for Optical Engineering, vol. 7443, 2009.
  • [11] K. Müller, H. Schwarz, D. Marpe, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, P. Merkle, H. Rhee, G. Tech, M. Winken, and T. Wiegand, “3D high-efficiency video coding for multi-view video and depth data,” IEEE Trans. on Image Proc., 2013.
  • [12] T. Maugey, A. Ortega, and P. Frossard, “Graph-based representation and coding of multiview geometry,” in Proc. Int. Conf. on Acoust., Speech and Sig. Proc., Vancouver, Canada, 2013.
  • [13] ——, “Multiview image coding using graph-based approach,” in IEEE Workshop on 3D Image/Video Technologies and Applications (IVMSP), Seoul, Korea, Jun. 2013.
  • [14] Y. Morvan, D. Farin, and P. de With, “Joint depth/texture bit-allocation for multi-view video compression,” in Picture Coding Symposium (PCS), 2007.
  • [15] W. Kim, A. Ortega, P. Lai, D. Tian, and C. Gomila, “Depth map distortion analysis for view rendering and depth coding,” in Proc. IEEE Int. Conf. on Image Processing, Cairo, Egypt, Nov. 2009.
  • [16] Q. Wang, X. Ji, Q. Dai, and N. Zhang, “Free viewpoint video coding with rate-distortion analysis,” IEEE Trans. on Circ. and Syst. for Video Technology, vol. 22, pp. 875–889, 2012.
  • [17] G. Cheung, V. Velisavlevic, and A. Ortega, “On dependent bit allocation for multiview image coding with depth-image-based rendering,” IEEE Trans. on Image Proc., vol. 20, pp. 3179–3194, 2011.
  • [18] B. Rajei, T. Maugey, and P. Frossard, “Rate-distortion analysis of multiview coding in a DIBR framework,” Annals of Telecommunications, 2013.
  • [19] A. Liu, P. Lai, D. Tian, and C. Chen, “New depth coding techniques with utilization of corresponding video,” IEEE Trans. on Broadcasting, vol. 57, pp. 551–561, 2011.
  • [20] G. Cheung, W. Kim, A. Ortega, J. Ishida, and A. Kubota, “Depth map coding using graph based transform and transform domain sparsification,” in IEEE Int. Workshop on Multimedia Sig. Proc., Hangzhou, China, Oct. 2011.
  • [21] I. Daribo, G. Cheung, and D. Florencio, “Arithmetic edge coding for arbitrarily shaped sub-block motion prediction in depth video coding,” in Proc. IEEE Int. Conf. on Image Processing, Orlando, FL, USA, Sep. 2012.
  • [22] A. Gelman, P. Dragotti, and V. Velisavlevic, “Multiview image coding using depth layers and an optimized bit allocation,” IEEE Trans. on Image Proc., vol. 21, pp. 4092–4105, 2012.
  • [23] U. Takyar, T. Maugey, and P. Frossard, “Extended layered depth image representation in multiview navigation,” IEEE Signal Processing Letters, vol. 21, pp. 22–25, Jan. 2014.
  • [24] Y. Mao, G. Cheung, A. Ortega, and Y. Ji, “Expansion hole filling in depth-image-based rendering using graph-based interpolation,” in Proc. IEEE Int. Conf. on Image Processing, Vancouver, Canada, May 2013.
  • [25] JPEG-2000, “ISO/IEC FCD 15444-1: JPEG 2000 final committee draft version 1.0,” 2000. [Online]. Available: http://www.jpeg.org/FCD15444-1.htm