Stereo on a Budget

Dana Menaker, Shai Avidan

We propose an algorithm for recovering depth using less than two images. Instead of having both cameras send their entire image to the host computer, the left camera sends its image to the host while the right camera sends only a fraction of its image. The key aspect is that the cameras send this information without communicating with each other. Hence, the required communication bandwidth is significantly reduced.

While standard image compression techniques can reduce the communication bandwidth, this requires additional computational resources on the part of the encoder (camera). We aim at designing a light weight encoder that only touches a fraction of the pixels. The burden of decoding is placed on the decoder (host).

We show that it is enough for the encoder to transmit a sparse set of pixels. Using the left image and as little as 2% of the right image, the decoder can compute a depth map. The depth map’s accuracy is comparable to that of traditional stereo matching algorithms that require both images as input. Using the depth map and the left image, the right image can be synthesized. No computations are required at the encoder, and the decoder’s runtime is linear in the image size.

Stereo matching, Wyner-Ziv coding, Stereo vision, Stereo image processing.

I Introduction

Stereo matching algorithms assume that both images are available for processing. This puts a burden on the host computer that must capture both images even though they are highly correlated with each other. Once captured, the host can recover the depth map of the scene and there are numerous algorithms for doing so.

Our goal is to minimize the communication cost between the cameras and the host and still be able to produce a depth map of the scene, as well as both images captured by the cameras. Our intent is to let the left camera transmit its image to the host and let the right camera transmit only a fraction of its image. The host uses the images to compute the depth map. Using the left image and the depth map, a high quality estimate of the right image can be generated. The most important aspect of our work is that the right camera cannot communicate with the left camera. What information should the right camera send to the host?

The right camera can use a standard image compression algorithm to reduce the communication bandwidth to the host but this, in turn, places a higher computational burden on the camera. Higher computational cost translates to higher battery consumption and we would like to avoid that as much as possible.

The scenario we envision is a group of people taking pictures of the same scene with multiple smartphones and uploading them to the cloud where the host can then run a stereo matching algorithm. Because all smartphones capture the same scene the images they capture are highly correlated. It is therefore a waste to let each smartphone compress and transmit highly correlated images.

As a first step toward reaching this goal we consider a simple stereo pair with two calibrated and synchronized cameras. The left camera transmits its image to the host and the right camera transmits an encoded image. Suppose that the encoded image is a low-resolution version of the original right image. Then the host must solve a super-resolution problem: given the left image and the low-resolution right image, it must recover both the depth and an approximation of the true high-resolution right image.

This straightforward approach still requires the right camera to touch every pixel in the right image in order to construct the low-resolution version. We argue that this is the worst possible choice. To understand why, take this approach to the extreme. Suppose the right camera can send only one pixel to the host and the value of this pixel is the mean intensity of the right image. Because the left and right images show the same scene, they are highly correlated, and therefore so are their mean intensities. The mean intensity of the left image, which the host already has, would be a good enough approximation, so we gain little from sending the mean intensity of the right image. We give a better alternative.

Instead of sending a low-resolution version of the right image, we sample a sparse grid (without smoothing) of the right image and send it. The sparse grid keeps the high frequencies of the right image at the cost of introducing aliasing, and we use the left image to resolve this problem. Our key insight is that even a small fraction of the right image is sufficient to compute a disparity map by using a Joint Bilateral Filter with the left image serving as the guidance image. Once we have the depth map we can recover a high-quality approximation of the right image.

II Background

There is inherent redundancy in a stereo image pair, and stereoscopic compression algorithms use this redundancy in order to encode the stereo pair efficiently, e.g. [1]. Most stereo compression techniques use disparity compensation, with one image serving as a reference and the other predicted using the reference image and the disparity field. The residual image can also be encoded for improved performance. However, these techniques require the knowledge of both stereo images at the encoder, unlike the scenario we address.

In our scenario we wish to encode a single image, without information about its stereo counterpart (except that it exists). Furthermore, we would like the encoding to be as light as possible, so sampling the image seems attractive. The topic of image sampling has been studied extensively, and one particular sampling method, called the Farthest Point Strategy (FPS) [2], aims at reducing the communication bandwidth, as we do. This method preserves the sampling uniformity while being random, and it avoids the extra cost of transmitting each pixel’s coordinates.

The redundancy in a stereo image pair is also utilized in 3D-TV applications, where different views of a real world scene can be synthesized from a monoscopic view and the associated per-pixel depth information [3].

Our work is also related to super resolution from multiple cameras where the goal is to recover a high resolution video from a collection of low resolution videos and high resolution still images [4]. The key difference is that in our case we choose what information to send and can therefore avoid sending redundant information.

Disparity estimation algorithms can be divided into global methods that solve a global optimization problem or local methods that estimate disparity values for each pixel independently. An extensive survey of methods can be found in [5].

Local methods compute, for every pixel in the reference image, the cost for a range of disparity values. The disparity value with the lowest cost is assigned to that pixel. Because a single pixel may not be robust to noise, it’s common to aggregate information in a neighborhood. One way to do that is to use a bilateral filter (BF) [6]. Instead of aggregating information over a rectangular window, BF is used to respect edge boundaries in the aggregation step.

The bilateral filter was developed as an edge preserving filter, where the weight of pixels is based on space-range distance. See [7] for a review of the topic. An interesting extension of BF is the realization that the weights of the filter need not come from the input image itself but can come from some guidance image. One example is flash/no-flash photography: the no-flash image, which has warm colors but a lot of noise, is filtered with the guidance of the flash image, which has cold colors but is less noisy [8, 9]. The same principle was applied in Joint Bilateral Upsampling [10], where a high resolution image served as a guide when upsampling a low resolution image. This led to the general idea of the Guided Image Filter [11], which also offers an exact algorithm that is linear in the size of the image.

Edge preserving filtering has also been used in the context of depth maps to estimate a high resolution depth map from a low resolution active 3D time-of-flight camera and a high resolution RGB image [12]. This setup, however, is quite different from ours. It is an active method while ours is passive. That is, the pixels we send do not carry accurate depth information as is the case of a ToF camera.

III Depth Estimation

In this section we first describe how to estimate the disparity map using the left image and the encoded right image.

Consider a stereo pair where the left camera sends its image to the host and the right camera sends an encoded image to the host. The host must recover the disparity map from the left image and the encoded image. The question is, what is the best encoding to use? We evaluate two encoding options.

The first, denoted Downsample, is to take the encoded image to be a downsampled version of the right image, which is equivalent to having an asymmetrical pair of cameras: one with a higher resolution than the other. This is appealing as it can reduce hardware costs. However, fine details are lost in this process, as we will later show.

The second, denoted Sparse, is to take the encoded image to contain sparse samples from the right image. In this scenario the camera capturing the right image can have a high resolution, and the amount of information transmitted can vary according to the available bandwidth.

Each of these encoding schemes has its own upsides and downsides. Finally, we consider a Hybrid method, in which the encoded image contains sparse samples from the right image as in Sparse, but the stereo matching is inspired by both approaches.

III-A Preliminaries

Depth estimation algorithms often use the Disparity Space Image (DSI), which is calculated from two stereo images. The DSI is a 3D volume that assigns a cost to each pixel and disparity value, and the goal of Depth Estimation algorithms is to take the DSI as input and return a depth map as output. This involves choosing one, and only one, disparity value per pixel.

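The DSI can be easily computed given a pair of stereo images. A commonly used cost measure is the sum of absolute differences. Formally, given the left image $I_L$ and the right image $I_R$, we define the Disparity Space Image to be:

$$DSI(x, y, d) = \sum_{c} \left| I_L^{c}(x, y) - I_R^{c}(x - d, y) \right|$$

where $d$ is a candidate disparity and $c$ ranges over the color channels.
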
Many stereo methods perform a cost aggregation step on the DSI. For example, Yang [13] aggregates costs adaptively, based on pixel similarity, which is derived from the left image in order to preserve edges. Hence, the left image is an input to the aggregation step as well. Yang proposed a left-right consistency check that improves results, but it requires the full right image, and not just sparse samples of it.
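
As a rough illustration of the DSI and of the final "one disparity per pixel" decision, the sketch below builds a per-pixel absolute-difference cost volume and picks the cheapest disparity for each pixel. It is a minimal sketch rather than the implementation used in the paper: it assumes rectified grayscale images stored as lists of rows, the left image as reference (left pixel (x, y) is compared with right pixel (x - d, y)), and no aggregation step. The function names and the `missing_cost` parameter are hypothetical.

```python
def disparity_space_image(left, right, max_disp, missing_cost=255.0):
    """Per-pixel absolute-difference cost volume, indexed as dsi[d][y][x].

    Assumption: left image is the reference, so left pixel (x, y) is
    compared with right pixel (x - d, y).
    """
    height, width = len(left), len(left[0])
    dsi = []
    for d in range(max_disp + 1):
        layer = []
        for y in range(height):
            row = []
            for x in range(width):
                if x - d >= 0:
                    row.append(abs(left[y][x] - right[y][x - d]))
                else:
                    # No right-image pixel exists for this (x, d) pair.
                    row.append(missing_cost)
            layer.append(row)
        dsi.append(layer)
    return dsi


def winner_take_all(dsi):
    """Choose, for every pixel, the single disparity with the lowest cost."""
    height, width = len(dsi[0]), len(dsi[0][0])
    return [[min(range(len(dsi)), key=lambda d: dsi[d][y][x])
             for x in range(width)]
            for y in range(height)]


# Toy example: the right view is the left view shifted by two pixels.
left = [[10, 20, 30, 40, 50, 60]]
right = [[30, 40, 50, 60, 70, 80]]
disparity = winner_take_all(disparity_space_image(left, right, max_disp=3))
```

In the toy example the two leftmost pixels have no valid match at disparity 2, so they fall back to smaller disparities; the remaining pixels recover the true shift.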

III-B Downsample

Let the encoded image be a downsampled version of the right image. The Downsample algorithm is presented in Figure 2(a). We upsample the encoded image to the original size and use it as input to a standard stereo method, which yields a high resolution disparity map. When doing so, it is crucial to smooth the left image prior to the stereo matching, in order to match the spectral frequencies between the images. Otherwise the high frequencies in the left image, which are not present in the upsampled encoded image, will introduce noise to the matching.

The importance of smoothing is demonstrated in Figure 1, where (a) is the matching result when the encoded image is the right image downsized by a factor of 5 in each dimension, and (b) is the matching result with both images smoothed prior to matching (as shown in Figure 2(a)). We used the stereo matching algorithm of [13] in both cases. The result without smoothing has many discontinuities, caused by edges in the image that are not necessarily depth discontinuities, such as the shadow cast by the head on the table behind it, or the various folders and boxes on the shelves in the background. Those edges are not sharp in the upsampled encoded image, and therefore the matching costs are ambiguous with respect to the correct disparity value. The aggregation assumes correlation between depth and color, which is not a valid assumption in this case. Even if a different stereo matching algorithm that includes a global optimization step (such as graph cut) were used, the weight on the smoothness term would have to be increased significantly, to the point of losing details in other areas.

The input to the stereo matcher is symmetric in the sense that both images have the same resolution. That enables us to perform the left-right consistency check suggested in [13], which improves the disparity map’s accuracy. Since the images contain only low frequencies, we expect the disparity map to contain only low frequencies as well.

Fig. 1: Stereo matching results on Tsukuba, panels (a) and (b), where the encoded image is the right image downsampled by a factor of 5 in each dimension. In (a) the encoded image is upsampled and the stereo matching is performed with the unprocessed left image. In (b) the left image is smoothed prior to the stereo matching. The stereo matching was done with [13]. In (a), 14% of the pixels in the disparity map have an error larger than 1, while in (b) only 8.18% of the pixels have such an error.
Fig. 2: Algorithm flowcharts. In (a) the encoded image is a downsampled version of the right image; in (b-c) it is a sparse sampling of the right image.

III-C Sparse

Let the encoded image be a sparse sampling of the right image. In Sparse the high frequencies in the right image can be transmitted, contrary to Downsample. We wish to sample the image in a manner that will allow us to extract depth information in conjunction with the left image.

A progressive image sampling method that aims at minimizing the communication bandwidth, called the Farthest Point Strategy (FPS), is described in [2]. This sampling is random; however, only the first pixel’s coordinates need to be transmitted, since the coordinates of the remaining pixels are determined by the intensities of the previously transmitted pixels. This sampling can guarantee a uniform density, and it can also adapt to the content of the image, with a higher sample density in areas with finer details (denoted adaptive-FPS). Another advantage is that the irregular sampling corresponds to convolution of the signal with a wideband noise, which reduces aliasing.

However, this sampling strategy requires a significant amount of computation, which contradicts our aim of designing a light weight encoder. Therefore, we sample on a uniform grid, requiring very little power and no computations at the camera.
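
The uniform grid encoder can be sketched in a few lines. This is only an illustration under assumed conventions (a grayscale image stored as a list of rows, a grid anchored at pixel (0, 0), hypothetical function names); because the grid is fixed, the receiver can reconstruct the sample coordinates without them being transmitted.

```python
def sample_uniform_grid(image, step):
    """Keep every `step`-th pixel in each dimension; no smoothing, no arithmetic.

    Returns a dict mapping (x, y) -> intensity for the transmitted pixels.
    """
    return {(x, y): image[y][x]
            for y in range(0, len(image), step)
            for x in range(0, len(image[0]), step)}


def transmitted_fraction(step):
    """Fraction of pixels sent when sampling every `step`-th row and column."""
    return 1.0 / (step * step)


image = [[x + 10 * y for x in range(15)] for y in range(15)]
samples = sample_uniform_grid(image, step=5)
# 15 x 15 image, step 5 -> 3 x 3 = 9 of 225 pixels, i.e. 4%.
```

A step of 5 in each dimension corresponds to 1/25 = 4% of the pixels, and a step of 3 to 1/9, about 11.1%, which are the two budgets used in the experiments.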

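The entire Sparse scheme is depicted in Figure 2(b). First, we use the left image $I_L$ and the sparse samples of the right image $I_R$ to calculate a sparse DSI (denoted $DSI_S$) according to the following equation:

$$DSI_S(x, y, d) = \mathbb{1}_S(x - d, y) \cdot \sum_{c} \left| I_L^{c}(x, y) - I_R^{c}(x - d, y) \right|$$

where $\mathbb{1}_S(x, y)$ is an indicator function that equals 1 if pixel $(x, y)$ of $I_R$ was transmitted and 0 otherwise. Note that $DSI_S$ is sparse and has many zero entries due to lack of data. In order to distinguish the case of a zero entry due to equal intensities from the case of missing data, we use .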
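
To make the role of the indicator concrete, the sketch below builds a sparse cost volume from the left image and a set of transmitted right-image samples. It is a minimal sketch under assumptions that are not taken from the paper: grayscale images, the left image as reference, and a separate validity volume that marks measured entries so that a genuine zero cost is not confused with missing data. The function name and data layout are hypothetical.

```python
def sparse_dsi(left, samples, max_disp):
    """Cost volume restricted to positions where a right-image sample exists.

    `samples` maps (x, y) -> intensity of a transmitted right-image pixel.
    Returns (cost, valid), both indexed [d][y][x]; valid is 1 where the cost
    was actually measured and 0 where data is missing (assumed convention).
    """
    height, width = len(left), len(left[0])
    cost = [[[0.0] * width for _ in range(height)] for _ in range(max_disp + 1)]
    valid = [[[0] * width for _ in range(height)] for _ in range(max_disp + 1)]
    for (xr, yr), value in samples.items():
        for d in range(max_disp + 1):
            x = xr + d  # left pixel that would see this sample at disparity d
            if x < width:
                cost[d][yr][x] = abs(left[yr][x] - value)
                valid[d][yr][x] = 1
    return cost, valid


left = [[10, 20, 30, 40, 50, 60]]
samples = {(0, 0): 30, (3, 0): 60}   # right image sampled at x = 0 and x = 3
cost, valid = sparse_dsi(left, samples, max_disp=2)
```

In this toy case the layer for disparity 2 contains only zeros, yet the validity volume shows that two of those zeros are real measurements (perfect matches) while the rest are missing data.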

We then fill in each layer of the sparse DSI (one layer for every disparity value) using a Joint Bilateral Filter, with the left image as the guidance image. We refer to the result of the filtering as the estimated DSI. In this process we exploit the correlation between color and disparity. Because the sparse DSI and the left image are expressed in the same coordinate system, we do not have to estimate an intermediate motion field between them.

For the Bilateral Filter we use the fast implementation of [14], and we take into account the fact that the filtered DSI is sparse according to the following equation:


where the filter is defined by


where and are range kernels.
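
One common way to filter sparse data with a joint bilateral filter is to normalize by the filtered validity mask, so that missing entries do not pull the result toward zero. The sketch below follows that idea for a single DSI layer. It is an assumption-laden illustration, not the paper's formulation: it uses brute-force Gaussian spatial and range weights instead of the fast method of [14], and the function name and the parameters `radius`, `sigma_s` and `sigma_r` are hypothetical.

```python
import math


def joint_bilateral_sparse(cost, valid, guide, radius, sigma_s, sigma_r):
    """Densify one sparse DSI layer, guided by a dense image.

    Each output pixel is a weighted mean of the *valid* costs around it.
    Weights fall off with spatial distance and with intensity difference in
    `guide`, so costs do not leak across edges of the guidance image.
    Pixels with no valid neighbour in the window are returned as None.
    """
    height, width = len(guide), len(guide[0])
    out = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            num = den = 0.0
            for qy in range(max(0, y - radius), min(height, y + radius + 1)):
                for qx in range(max(0, x - radius), min(width, x + radius + 1)):
                    if not valid[qy][qx]:
                        continue
                    dist2 = (qx - x) ** 2 + (qy - y) ** 2
                    diff2 = (guide[qy][qx] - guide[y][x]) ** 2
                    w = math.exp(-dist2 / (2 * sigma_s ** 2)
                                 - diff2 / (2 * sigma_r ** 2))
                    num += w * cost[qy][qx]
                    den += w
            if den > 0:
                out[y][x] = num / den
    return out


# One image row with an intensity edge between x = 2 and x = 3.
guide = [[10, 10, 10, 200, 200, 200]]
cost = [[5.0, 0, 0, 0, 0, 9.0]]
valid = [[1, 0, 0, 0, 0, 1]]
dense = joint_bilateral_sparse(cost, valid, guide,
                               radius=5, sigma_s=3.0, sigma_r=10.0)
```

In the example, pixels on the dark side of the edge inherit the cost measured at x = 0 and pixels on the bright side inherit the cost measured at x = 5, which is the edge-respecting behaviour the guidance image is meant to provide.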

Given the estimated DSI, we can use it instead of the full DSI in a stereo matching algorithm. As shown in Figure 2(b), the aggregation and final disparity selection are done according to [13]. Since the full right image is not available, we must skip the left-right consistency check.

III-D Hybrid

In Downsample the lower frequencies are transmitted, while in Sparse the high frequencies help preserve the details. In order to enjoy the best of both approaches, we wish to combine Downsample and Sparse into Hybrid. The DSI contains a soft estimation of the disparities, and we wish to utilize information from both approaches at this stage, prior to the hard decision (depth selection).

The algorithm is depicted in Figure 2(c). Let the encoded image be a sparse sampling of the right image. We compute a weighted mean of the estimated DSI and a second DSI, calculated from an interpolation of the samples in the encoded image. This second DSI can be seen in the bottom route, surrounded by the dash-dot blue line. A direct interpolation of the samples would introduce aliasing, so we apply a low pass filter to the interpolated image, according to the transmitted frequencies. We also smooth the left image to match the frequencies, just as we did in Downsample. The blocks surrounded by the dotted red line are identical to Sparse, while the route surrounded by the dash-dot blue line is inspired by Downsample (the difference stems from the different input).

Given the interpolated right image, a left-right consistency check such as the one described in [13] can be performed. It significantly improves the disparity map’s accuracy.
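
The fusion step can be pictured as a per-entry weighted mean of two cost volumes, taken before the hard disparity decision. The sketch below is only illustrative: the weight value, the function names and the toy costs are hypothetical, and the aggregation and consistency check of [13] are omitted.

```python
def blend_dsi(dsi_a, dsi_b, weight):
    """Weighted mean of two cost volumes indexed [d][y][x]."""
    return [[[weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(layer_a, layer_b)]
            for layer_a, layer_b in zip(dsi_a, dsi_b)]


def best_disparity(dsi, x, y):
    """Hard decision for one pixel: index of the cheapest disparity layer."""
    return min(range(len(dsi)), key=lambda d: dsi[d][y][x])


# One pixel, three candidate disparities.
dsi_sparse = [[[5.0]], [[2.0]], [[3.0]]]   # alone, it would pick d = 1
dsi_interp = [[[2.0]], [[5.0]], [[3.0]]]   # alone, it would pick d = 0
dsi_hybrid = blend_dsi(dsi_sparse, dsi_interp, weight=0.5)
chosen = best_disparity(dsi_hybrid, 0, 0)
```

Because the two volumes are combined while they are still soft estimates, the fused decision can differ from what either volume would choose on its own, as it does in this toy case.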

III-E A lower bound

We report results of the encoding schemes in the experimental section, but as with any encoding scheme one wonders: Are we making the most out of the bandwidth at our disposal? How far are we from optimal encoding? We measure the optimal encoding as follows.

Consider the full DSI computed from the original left and right images. This is the best we can hope for. Now, resize this DSI down to the size of the allocated bandwidth. If we take image resizing to be the optimal encoding of a signal, then this gives the optimal encoding. In the experimental section we use this technique to measure how far the encoding schemes are from optimal encoding.

III-F Recovering the Right Image

So far we discussed recovering depth. Given the left image and the recovered depth, an estimate of the right image can be synthesized using depth-image-based rendering techniques. This is beyond the scope of this paper, but we briefly describe the method we used. Recovering the right image involves three major steps: warping, inpainting and enhancement.

Given the left image and the disparity map, it is possible to warp the left image and obtain the scene as viewed from the position of the camera that produced the right image, minus occluded areas (see [5] for details).
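
The warping step can be sketched as a forward warp along each row. This is a simplified illustration, not the rendering method used in the paper: it assumes integer disparities, a grayscale image stored as a list of rows, and the convention that left pixel (x, y) with disparity d lands at (x - d, y). The function name is hypothetical.

```python
def warp_left_to_right(left, disparity):
    """Forward-warp the left image into the right camera's view.

    When several pixels land on the same target, the one with the larger
    disparity (closer to the camera) wins. Targets that receive no pixel
    stay None; these are the holes that must be inpainted afterwards.
    """
    height, width = len(left), len(left[0])
    warped = [[None] * width for _ in range(height)]
    best_disp = [[-1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = disparity[y][x]
            xr = x - d
            if 0 <= xr < width and d > best_disp[y][xr]:
                warped[y][xr] = left[y][x]
                best_disp[y][xr] = d
    return warped


left = [[1, 2, 3, 4, 5, 6]]
disparity = [[0, 0, 0, 2, 2, 2]]   # far background, then a nearer object
warped = warp_left_to_right(left, disparity)
```

In the example the nearer object overwrites two background pixels, and the two rightmost positions receive no value, which is where inpainting from the transmitted samples becomes necessary.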

Although the pixels in the occluded areas can’t be retrieved from the left image, we have additional information in the encoded image. We use this information to inpaint the occluded areas. In the case of Downsample, we can copy the missing pixels from the upsampled encoded image. When the encoded image contains sparse samples from the right image (Sparse and Hybrid), we can interpolate the samples (after smoothing) and use them to fill the occluded areas. If the occlusion is small, or if it doesn’t contain texture, the blur will not be significant.

Furthermore, in the cases of Sparse and Hybrid, the encoded image contains exact samples of the right image. We use those samples to enhance the warped image in the non-occluded areas as follows: we calculate a sparse difference image between the encoded image and the warped image (it’s sparse because the encoded image is sparse). We then perform joint bilateral filtering on the difference image with the warped image as the guidance image, and finally we add the filtered difference image to the warped image in the non-occluded areas.

IV Connection to Information Theory

Our problem has its roots in the Distributed Source Coding literature. Consider two sources $X$ and $Y$ that are known to be correlated. Suppose both $X$ and $Y$ are known at the encoder but only $Y$ is known at the decoder, and assume the encoder wishes to efficiently transmit $X$ to the decoder. Clearly the encoder can take advantage of the fact that $Y$ is known to both sides to better encode $X$. The remarkable result of Slepian and Wolf [15] is that the encoder can encode $X$ just as well without knowing $Y$ at all. The basic result of Slepian and Wolf holds for lossless compression and it was later extended by Wyner and Ziv [16] to the lossy case.
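
For reference, the lossless result can be written as a rate bound. The notation here is assumed rather than taken from the text: $X$ is the source to be encoded, $Y$ is the side information available at the decoder, $R_X$ is the rate spent on $X$, and $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$.

```latex
R_X \ge H(X \mid Y)
```

The bound is the same whether or not the encoder observes $Y$. In the stereo setting considered here, the right image plays the role of the encoded source and the left image plays the role of the side information at the host.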

These results were theoretical and the first practical implementation was reported in DISCUS [17]. This led to a surge of interest in developing efficient Wyner-Ziv video encoding algorithms. See [18, 19, 20] for a review of recent advances in the field.

V Results

We first evaluate the performance of the different algorithms by measuring the accuracy of the initial disparity map. We evaluate the result on the Middlebury benchmark stereo datasets [5, 21]. We also show how our algorithm compares with a standard compression technique such as JPEG2000. Next, we compare our recovered images to a distributed video codec. We use the recovery process described in Section III-F in order to make this comparison. Finally we show results of our algorithm on indoor and outdoor scenes. Since the true depth is not available for these scenes, we measure results using the recovered right image.

V-A Stereo

Fig. 3: Average performance of different “stereo on a budget” strategies on the Middlebury stereo dataset. Panels: (a) Downsample; (b) Sparse; (c) Downsample, Sparse, Hybrid; (d) Sparse vs. JPEG2000. The x-axis is the percentage of the right image used for the calculation of the disparity map; the y-axis is the average percentage of pixels with disparity error larger than 1. When a standard stereo algorithm is used, we use [13]. Its performance is included for reference (using 100% of the right image for the calculation). The labels on the graph denote the scaling factor of the image, in both axes.
Fig. 4: Results for Middlebury benchmark data. Columns, left to right: ground truth, Downsample, Sparse (uniform grid), Hybrid. The first column shows the ground-truth disparity maps. The other columns are disparity maps calculated using only 4% of the right image.

First, we evaluate the performance of Downsample vs. other approaches, using as input a downsampled version of the right image. Figure 3(a) shows that Downsample as described in Figure 2(a) performs better than computing a low resolution disparity map and performing joint bilateral upsampling (denoted JBU) as suggested in [10]. In addition, the disparity map is more accurate in terms of number of bad pixels when the left image is smoothed than when it is not smoothed (denoted “Upsample”).

When the images are resized to half of their original dimensions there is very little loss of quality compared to no resizing at all, since the correlation between adjacent pixels is high. When the image is resized further, the quality of the disparity map drops. We are interested in sending only a small fraction of the data, which would require downsizing the right image by a factor of at least 5 in each dimension. Computing disparity at low resolution and then applying JBU would require extremely good sub-pixel accuracy, which is rare for large scaling factors. Hence we compute the stereo matching at full resolution.

Next, we compare different sparse sampling strategies. Uniform sampling over a grid is more appealing in terms of computation power; however, it may be sub-optimal in terms of aliasing. Therefore, we compare its performance to FPS (random uniform sampling) [2], and to adaptive-FPS, random sampling that is part uniform (first 80% of the samples) and part adaptive (last 20% of the samples). The disparity maps were calculated according to Sparse, as described in Figure 2(b). Figure 3(b) shows that in terms of percentage of bad pixels in the disparity map, there is no significant difference between the sampling methods. The sampling strategies are also quite comparable in terms of RMS error, as evident in Tables I and II. Therefore, the uniform grid sampling, which is the cheapest to compute, is preferable.
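
The two error measures used in this comparison can be sketched as follows. This is a minimal sketch assuming dense disparity maps stored as lists of rows and an error threshold of 1, as in Figure 3; it evaluates all pixels and does not reproduce the benchmark's region masks (nonocc, all, disc). The function names are hypothetical.

```python
import math


def bad_pixel_percentage(disparity, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds `threshold`."""
    errors = [abs(d - g)
              for row_d, row_g in zip(disparity, ground_truth)
              for d, g in zip(row_d, row_g)]
    return 100.0 * sum(e > threshold for e in errors) / len(errors)


def rms_error(disparity, ground_truth):
    """Root-mean-square disparity error over all pixels."""
    squared = [(d - g) ** 2
               for row_d, row_g in zip(disparity, ground_truth)
               for d, g in zip(row_d, row_g)]
    return math.sqrt(sum(squared) / len(squared))


estimate = [[5, 5, 7, 9]]
truth = [[5, 6, 5, 5]]
bad = bad_pixel_percentage(estimate, truth)   # errors 0, 1, 2, 4 -> 2 of 4 are bad
rms = rms_error(estimate, truth)
```

The bad-pixel percentage counts how often the estimate is wrong by more than the threshold, while the RMS error also reflects how large the mistakes are, which is why both are reported.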

For the Joint-Bilateral filter, we used the parameters and in Equation 3, where is the sub-sampling factor in the Sparse algorithm (see [14] for an explicit definition of the filter).

Algorithm     | Tsukuba               | Venus                 | Cones                 | Teddy
              | nonocc   all    disc  | nonocc   all    disc  | nonocc   all    disc  | nonocc   all    disc
Uniform grid  |  1.33    1.45   2.43  |  1.28    1.49   1.74  |  5.41    8.56   8.62  |  4.33    9.45   7.41
FPS           |  1.37    1.52   2.69  |  1.23    1.48   1.61  |  5.05    8.29   7.31  |  4.68    9.62   8.01
Adaptive FPS  |  1.36    1.49   2.49  |  1.57    1.77   2.14  |  5.75    8.33   8.12  |  4.60    9.75   7.78
TABLE I: RMS error of the disparity maps, calculated using only 4% of the right image. In the adaptive FPS, we sampled the first 80% of the samples uniformly, and only the last 20% of the samples were adaptive. Otherwise large regions of the image would not be covered, which would lead to errors.

Algorithm     | Tsukuba               | Venus                 | Cones                 | Teddy
              | nonocc   all    disc  | nonocc   all    disc  | nonocc   all    disc  | nonocc   all    disc
Uniform grid  |  7.30    8.86  20.55  | 10.03   11.12  19.79  | 20.37   26.66  38.61  | 14.01   24.04  29.65
FPS           |  6.61    8.18  24.50  | 10.49   11.62  21.39  | 20.91   28.22  36.58  | 14.49   24.47  29.59
Adaptive FPS  |  7.49    9.20  24.50  | 11.93   12.96  25.16  | 24.01   30.43  38.81  | 15.44   25.17  31.87
TABLE II: Percentage of bad pixels (with error larger than 1) in the disparity maps, calculated using only 4% of the right image. In the adaptive FPS, we sampled the first 80% of the samples uniformly, and only the last 20% of the samples were adaptive. Otherwise large regions of the image would not be covered, which would lead to errors.

Figure 3(c) compares the Sparse, Downsample and Hybrid strategies (Figure 2). For Sparse we used uniform grid sampling, since it is the most efficient in terms of computation and power savings. At strong compression ratios, Hybrid is superior to using either strategy alone. Noteworthy is the fact that with only 11.1% of the right image we can compute a disparity map with an accuracy comparable to that of graph-cut on the full left and right images.

Figure 3(d) compares Hybrid and standard JPEG2000 compression with variable rate on the right image. Noteworthy is the fact that at extreme compression regimes, retrieving depth from the left image and the sparse samples using Hybrid is more accurate than using the left image and a JPEG2000-compressed right image. Also shown is the lower bound estimation as described in Section III-E.

Some of the disparity maps can be seen in Figure 4, using only 4% of the right image for the calculation. In Downsample the high frequencies are lost in the images, and hence also in the depth map, e.g. the missing lamp arm in tsukuba. In Sparse the fine details are better preserved, such as the shape of the cones and the video camera in tsukuba. Hybrid combines the best of both, and we will focus on that algorithm.

Figure 5 demonstrates the process of recovering the right image, which is described in Section III-F. The right image is encoded in-camera and is estimated at the host by warping the left image with the calculated disparity map and using the available samples from the encoded image to enhance the result and inpaint the occluded areas.

Figure 7 shows the effectiveness of the enhancement process described in Section III-F, showing the PSNR before (dashed lines) and after the enhancement (solid lines), for the Middlebury datasets. Once an estimate of the right image is calculated, it can be used as input to other algorithms. Thus future advancements in stereo matching as well as different types of algorithms can benefit from our algorithm.

Fig. 5: Demonstration of the estimation process of the right image described in Section III-F on an outdoor scene. Panels, left to right: original; warped, Sparse 11.1% pixels (PSNR: 27.17 dB); warped + enhancement, Sparse 11.1% pixels (PSNR: 27.66 dB); warped + inpainting, Sparse 11.1% pixels (PSNR: 26.24 dB); warped + enhancement and inpainting, Sparse 4% pixels (PSNR: 26.60 dB). This image was downloaded from Flickr and manually rectified by [22]. The first column shows the right image of the stereo pair. The 2nd column shows the warp results using the disparity map calculated with the Sparse method, with the encoded image containing 11.1% of the pixels in the right image. The PSNR is calculated on the non-occluded areas. The 3rd column is after the enhancement step described in Section III-F, which increases the PSNR by about 0.5 dB. The 4th column is the 2nd column with the occluded areas now inpainted with an interpolation of the samples from the encoded image. The 5th column is the inpainting result of the image from the 3rd column.

V-B Comparison to Distributed Video Coding

In [23] Varodayan et al. developed a coding scheme that exploits the similarity of stereo images without communication among the cameras. It was later extended to video [24], and the code is available online.

The code was designed as a video codec, and its output is the reconstructed frame, not the disparity map. Because it is a video codec, it works on any two frames and does not take advantage of the fact that the images are rectified. Therefore, the motion search space is two-dimensional and is limited to a motion field of 5 pixels in each direction (a total of $11 \times 11 = 121$ possibilities), even though the true disparity is larger. The code only accepts images in QCIF resolution, so we downsampled the standard Middlebury datasets (and cropped if necessary, to maintain the aspect ratio). The rate is determined automatically: the decoder may request more bits of information from the encoder via a feedback channel if the reconstructed image isn’t good enough. Our algorithm does not use a feedback channel. The running time of DVC is several minutes, while ours is an optimized Matlab code that takes a few seconds to run.

Figure 6 shows both our results and [24]’s. On average we achieve higher PSNRs, while transmitting a smaller fraction of the image, and the computation time is shorter than the DVC codec. The DVC results from Figure 6 can be compared to the rate-distortion curves in Figure 7.

Fig. 6: Comparing our method (left column) to the DVC method of [24] (right column). We show the recovered and enhanced right image using our method with only 11.1% of the right image used to calculate the disparity map. The DVC method determines the percentage of transmitted pixels adaptively. We give these numbers below each image. (See more results in supplemental material).
Ours             | DVC [24]
PSNR: 29.64 dB   | PSNR: 30.76 dB, 12.6%
PSNR: 30.23 dB   | PSNR: 26.86 dB, 15.21%
PSNR: 27.45 dB   | PSNR: 25.6 dB, 22.96%
PSNR: 26.48 dB   | PSNR: 25.01 dB, 27.96%
Fig. 7: Rate distortion curves of the recovered right image using the Hybrid algorithm, for the Middlebury datasets (panels: tsukuba, venus, teddy, cones). The graph shows the PSNR before (dashed lines) and after (solid lines) the enhancement, showing a larger gain when more samples of the right image are given. For comparison, the results obtained from the code of [24] are shown on the graphs.

In addition to the Middlebury dataset, we tested our algorithm on various stereo pairs, some captured by the UCSD Vision and Graphics Laboratories [25] and some downloaded from Flickr and rectified manually [22]. Since the ground truth disparity is unavailable for those images, we measure the PSNR between the original right image and the warped and enhanced result. Figure 8 shows the original right image in the first column for various datasets, and the enhanced estimate for two different compression levels: in the 2nd column the disparity map is calculated with only 11.1% of the pixels of the right image; in the rightmost column, only 4% of the right image’s pixels were transmitted to the host.
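
For reference, the PSNR between an original image and its estimate can be computed as in the sketch below. It assumes 8-bit grayscale images stored as lists of rows (peak value 255) and an optional binary mask that restricts the measurement to chosen pixels, for example the non-occluded areas; the function name and the mask convention are hypothetical.

```python
import math


def psnr(reference, estimate, mask=None, peak=255.0):
    """Peak signal-to-noise ratio in dB, optionally restricted to mask == 1."""
    total, count = 0.0, 0
    for y, row in enumerate(reference):
        for x, value in enumerate(row):
            if mask is None or mask[y][x]:
                total += (value - estimate[y][x]) ** 2
                count += 1
    mse = total / count
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


reference = [[100, 100], [100, 100]]
estimate = [[100, 110], [90, 100]]
value_db = psnr(reference, estimate)   # mean squared error is 50
```

Passing a mask that excludes the two erroneous pixels in this toy example yields an infinite PSNR, which shows how strongly the choice of evaluated region affects the reported number.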

Fig. 8: Results of our algorithm on various stereo pairs (the stereo images can be found in the supplementary material). The first column is the original right image; the 2nd and 3rd columns show the enhanced estimate of the right image, when the disparity map was generated with 11.1% and 4% of the right image respectively.
Original | Enhanced, 11.1%  | Enhanced, 4%
         | PSNR: 27.02 dB   | PSNR: 25.4 dB
         | PSNR: 23.4 dB    | PSNR: 21.7 dB
         | PSNR: 27.8 dB    | PSNR: 26.8 dB
         | PSNR: 28.94 dB   | PSNR: 27.3 dB
         | PSNR: 30.2 dB    | PSNR: 27.95 dB

VI Conclusions

We proposed an algorithm for recovering depth using less than two images, in order to reduce the communication costs. Specifically, we have shown that the Joint Bilateral Filter (JBF) offers a simple and attractive way to compress correlated images from cameras that cannot communicate with each other, as is the case in practical scenarios.

In our experiments, one camera sends a full image to the host to serve as a reference, while the other camera sends only a small fraction of its pixels to the host. The host can then use JBF to recover an initial depth map and use it, together with the reference image, to recover the sampled image.

Our algorithm is quite fast, since both the Bilateral filter’s complexity and the Non Local aggregation’s complexity are linear in the image size and the disparity search range. This is significantly more efficient than previously suggested distributed source coding schemes.

There is a trade-off between the amount of data transmitted and the quality of the reconstruction. This paves the way to camera arrays that can adjust the number of pixels sent to the host based on the particular bandwidth of the host and still produce a depth image that, in turn, can be used to synthesize the encoded images. In scenarios where a feedback channel exists, the errors due to occlusions can be significantly reduced. The algorithm is efficient and can be made to run at several frames per second.


  • [1] H. Aydinoglu and M. H. Hayes III, “Stereo image coding: a projection approach,” IEEE TIP, vol. 7, no. 4, pp. 506–516, 1998.
  • [2] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi, “The farthest point strategy for progressive image sampling,” IEEE Trans. on Image Processing, p. 1315, 1997.
  • [3] C. Fehn, “Depth-image-based rendering (dibr), compression, and transmission for a new approach on 3d-tv,” pp. 93–104, 2004.
  • [4] E. Shechtman, Y. Caspi, and M. Irani, “Increasing space-time resolution in video,” in ECCV, 2002.
  • [5] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” IJCV, vol. 47, no. 1-3, pp. 7–42, Apr. 2002.
  • [6] K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for correspondence search,” TPAMI, 2006.
  • [7] S. Paris, P. Kornprobst, J. Tumblin, and F. Durand, “Bilateral filtering: Theory and applications,” Foundations and Trends® in Computer Graphics and Vision, vol. 4, no. 1, pp. 1–73, 2009.
  • [8] E. Eisemann and F. Durand, “Flash photography enhancement via intrinsic relighting,” in SIGGRAPH, 2004.
  • [9] G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, and K. Toyama, “Digital photography with flash and no-flash image pairs,” ser. SIGGRAPH, 2004.
  • [10] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” in SIGGRAPH, 2007.
  • [11] K. He, J. Sun, and X. Tang, “Guided image filtering,” ser. ECCV, 2010.
  • [12] J. Park, H. Kim, Y.-W. Tai, M. Brown, and I. Kweon, “High quality depth map upsampling for 3d-tof cameras,” ser. ICCV, 2011.
  • [13] Q. Yang, “A non-local cost aggregation method for stereo matching,” in CVPR, 2012.
  • [14] K. N. Chaudhury, D. Sage, and M. Unser, “Fast bilateral filtering using trigonometric range kernels,” TIP, vol. 20, no. 12, pp. 3376–3382, Dec. 2011.
  • [15] D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE Trans. on Information Theory, vol. 19, no. 4, pp. 471–480, Jul. 1973.
  • [16] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,” IEEE Trans. on Information Theory, vol. 22, no. 1, pp. 1–10, Jan. 1976.
  • [17] S. S. Pradhan and K. Ramchandran, “Distributed source coding using syndromes (discus): Design and construction,” IEEE Trans. on Information Theory, vol. 49, pp. 626–643, 1999.
  • [18] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, “Distributed video coding,” Proceedings of the IEEE, vol. 93, no. 1, pp. 71–83, Jan. 2005.
  • [19] F. Pereira, C. Brites, J. Ascenso, and M. Tagliasacchi, “Wyner-ziv video coding: A review of the early architectures and further developments,” ser. ICME, 2008, pp. 625–628.
  • [20] F. Dufaux, W. Gao, S. Tubaro, and A. Vetro, “Distributed video coding: Trends and perspectives,” EURASIP J. Image and Video Processing, 2009.
  • [21] D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using structured light,” ser. CVPR, 2003.
  • [22] T. Basha, Y. Moses, and S. Avidan, “Geometrically consistent stereo seam carving,” in Computer Vision (ICCV), 2011 IEEE International Conference on, 2011, pp. 1816–1823.
  • [23] D. Varodayan, Y.-C. Lin, A. Mavlankar, M. Flierl, and B. Girod, “Wyner-ziv coding of stereo images with unsupervised learning of disparity,” in Proc. Picture Coding Symp, 2007.
  • [24] D. P. Varodayan, D. M. Chen, M. Flierl, and B. Girod, “Wyner-ziv coding of video with unsupervised motion vector learning,” Sig. Proc.: Image Comm., vol. 23, no. 5, pp. 369–378, 2008.
  • [25] M. Zwicker, W. Matusik, F. Durand, and H. Pfister, “Antialiasing for automultiscopic 3d displays,” in Eurographics Symposium on Rendering, 2006.