4D Temporally Coherent Light-field Video
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multi-view dense correspondence approaches demonstrates a significant improvement in the accuracy of temporal coherence.
Light-field cameras capture multiple densely spaced viewpoints using lenticular arrays on a single sensor, a camera densely scanned over a static scene to capture a large number of views, or multiple densely spaced cameras. This allows photo-realistic rendering of novel viewpoints, depth-of-field effects and high-accuracy dense reconstruction [2, 4]. Reconstruction from densely sampled light-fields commonly employs an Epipolar Plane Image (EPI) representation to estimate spatial correspondence between images.
Light-fields have primarily been captured for static scenes due to the requirement for a large number of viewpoints, resulting in a high volume of data. Recently, motivated by applications in immersive content production for Virtual and Augmented Reality (VR/AR), arrays of video cameras have been employed to acquire light-fields of dynamic scenes. However, conventional video editing techniques are not suitable for such data, as they fail to exploit the spatial and temporal redundancy encoded across the multiple discrete views of a light-field. Hence there is a need for efficient light-field video representations that exploit this spatio-temporal redundancy to enable compression and facilitate editing for live-action VR. In this paper we propose the first method for spatio-temporally coherent dynamic light-field video, illustrated in Figure 1. This enables the use of light-field video for immersive VR content production. An EPI representation is used to estimate dense spatio-temporal correspondence from light-field video for temporal alignment.
Previous work on temporal alignment of complex dynamic objects has primarily focused on acquisition using multi-view cameras to reliably reconstruct the complete object surface at each frame using shape-from-silhouette and multiple-view stereo [6, 7]. Robust techniques have been introduced for temporal alignment of the reconstructed non-rigid shape to obtain a 4D model based on tracking the complete surface shape or volume through complex motion [8, 9, 7, 10]. However, these approaches assume a reconstruction of the full non-rigid object surface at each time frame and do not easily extend to 4D alignment of partial surface reconstructions or depth maps. Recent work obtains reliable temporal alignment of partial surfaces for complex dynamic scenes [11, 12, 13, 14], and DynamicFusion was introduced for 4D modelling from depth image sequences, integrating temporal observations of non-rigid shape to resolve fine detail. However, due to the high degree of spatial and temporal regularity and redundancy in light-field video, existing methods for temporal alignment are not directly applicable. This paper exploits the spatio-temporal redundancy in light-field video for a robust 4D spatio-temporally coherent representation of dynamic scenes. Sparse temporal correspondence tracks are obtained across the sequence for each view using temporal feature matching, and are used to initialize dense scene flow estimation. A novel sparse-to-dense light-field scene flow is proposed exploiting EPIs to obtain a temporally coherent dense 4D representation of the scene, shown in Figure 2. The contributions of the proposed approach include:
Temporally coherent 4D reconstruction of dynamic light-field video;
EPI from sparse light-field video for spatio-temporal correspondence;
Sparse-to-dense light-field scene flow exploiting EPI image information.
2 Related work
With the advent of virtual and augmented reality, light-field capture has been explored for live-action immersive virtual reality experiences. Bolles et al. introduced the EPI to represent a light-field as a volume. This representation has been used effectively for depth estimation and segmentation [5, 17], and methods have been proposed for depth-of-field effects and image-based rendering [18, 19]. However, existing light-field representations have been designed around capturing static objects. A recent method introduced oriented light-field windows to leverage the EPI information in light-fields and enable more robust and accurate pixel comparisons, improving scene flow estimation for a pair of images. An efficient light-field representation for live-action data should account for the high level of redundancy in light-field video and handle multiple resolutions, occlusions, and high variation in depth in complex scenes. Another challenge is the development of creative tools dedicated to light-field editing; there is currently no efficient representation to enable light-field video editing for content production. In this paper we propose a method that exploits the spatio-temporal redundancy in light-fields: the EPI is used to obtain dense scene flow for a 4D temporally coherent representation of dynamic light-field video, enabling compression and efficient light-field video editing (colour grading, object removal, infilling, etc.).
2.2 4D reconstruction of dynamic scenes
Techniques have been introduced to align conventional single-view depth sequences and multiple-view reconstructions of dynamic scenes using correspondence information between frames. Methods have been proposed to obtain sparse [22, 23, 24, 25] and dense [26, 27, 28] correspondence between consecutive frames and across entire sequences. Existing sparse correspondence methods work independently on a frame-by-frame basis for a single view or multiple views and require a strong prior initialization. Existing feature matching techniques work in 2D or 3D, for sparse [23, 24] or dense points; other methods are limited to RGBD data or stereo pairs for dynamic scenes. Dense matching techniques include scene flow methods. Scene flow techniques [29, 28] typically estimate the pairwise surface or volume correspondence between reconstructions at successive frames, but do not extend to 4D alignment or correspondence across complete sequences due to drift and failure under rapid and complex motion. In this paper we propose sparse-to-dense temporal alignment exploiting the high spatio-temporal redundancy in light-fields to robustly align light-field video captured with sparse camera arrays.
The high volume and redundancy of light-field video make it challenging to use in content production for AR/VR. The aim of this work is to obtain a 4D temporally coherent representation of dynamic light-field video by exploiting its spatio-temporal redundancy. This provides an efficient structured representation for light-field editing and use in immersive VR content production.
Given an independent per-frame surface or depth reconstruction from light-field video captured with a sparse camera array, the problem is to simultaneously estimate the temporal correspondence of the input light-field across all views for the entire sequence. This is achieved efficiently by estimating the temporal alignment of the reconstructed surface between each pair of frames and propagating this across all light-field camera views. A coarse-to-fine approach is introduced that initially estimates temporal correspondence from sparse features and then estimates dense scene flow initialised from the sparse features. To ensure robust tracking, key-frames are identified as references to minimise drift in long-term tracking. An overview of the 4D temporally coherent reconstruction of light-field video is presented in Figure 2.
Light-field video: Light-field video of the dynamic scene is captured using a camera array, and multiple-view stereo is performed to obtain a per-frame surface mesh reconstruction. Throughout this work, a sparse 5x4 camera array spanning up to 50cm x 50cm is used to capture light-field video.
Key-frame detection: Key-frames are detected for light-field video exploiting redundant spatial information across views to identify a set of unique reference frames for stable long-term tracking. Surface tracking is performed between key-frames to reduce the accumulation of errors in sequential tracking caused by large non-rigid motion over long sequences.
Sparse temporal feature correspondence tracks: Reconstructed 3D points projected into all frames are matched frame-to-frame across the sequence to estimate sparse temporal feature tracks for each dynamic object in each light-field camera view. These sparse temporal correspondence tracks are used to initialize the light-field scene flow to handle occlusions and improve robustness.
Light-field scene flow: We propose to estimate dense scene flow between images by exploiting EPI information from light-field video based on oriented light-field windows, initialised by the sparse feature correspondences per light-field view at each time instant. This exploits the spatio-temporal redundancy in light-fields to give dense 2D correspondences, which are back-projected to the 3D mesh to obtain a 4D spatio-temporally coherent dynamic light-field video representation.
3.2 Key-frame detection
Aligning per-frame reconstructions over long sequences leads to drift due to the accumulation of alignment errors between frames, and failure is observed for large non-rigid motion. To tackle this problem we detect key-frames across the sequence. Key-frame detection exploits the spatial redundancy of light-field capture by fusing appearance, distance and shape information across all views in the sparse camera array. Temporal coherence is introduced between key-frames as explained in Sections 3.3 and 3.4.
Appearance Metric: This measures the appearance similarity between a pair of frames for each object region in each light-field view. It is defined as the ratio of the number of temporal feature correspondences between the two frames to the total number of features in the object region at the two frames.
Distance Metric: This measures the temporal distance between a pair of frames for each object in each view. It decreases with the number of frames between candidate key-frames and ensures that the distance between two key-frames does not exceed a fixed maximum number of frames.
Shape Metric: This gives the shape overlap between pairs of frames for each object in the light-field video. It is defined as the ratio of the intersection of the aligned segmentations to their union.
Key-frame similarity metric: The three metrics defined above are combined to calculate the similarity between frames. All frames whose similarity exceeds a threshold are selected as key-frames for the light-field video.
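As a concrete illustration, the three metrics and their combination might be sketched as below. The equal weighting, the specific ratio definitions and the linear distance decay are assumptions for illustration only; the paper's exact formulas, weights and thresholds are not reproduced here.

```python
import numpy as np

def appearance_metric(n_matches, n_feat_i, n_feat_j):
    """Ratio of temporal feature correspondences to the total number of
    features in the object region at the two frames (assumed form)."""
    total = n_feat_i + n_feat_j
    return 2.0 * n_matches / total if total > 0 else 0.0

def distance_metric(i, j, max_gap):
    """Linearly penalise temporal distance; zero once the gap between
    candidate key-frames exceeds max_gap (assumed linear decay)."""
    gap = abs(j - i)
    return max(0.0, 1.0 - gap / max_gap)

def shape_metric(mask_i, mask_j):
    """Intersection-over-union of the aligned object silhouettes."""
    inter = np.logical_and(mask_i, mask_j).sum()
    union = np.logical_or(mask_i, mask_j).sum()
    return inter / union if union > 0 else 0.0

def keyframe_similarity(n_matches, n_feat_i, n_feat_j, i, j, max_gap,
                        mask_i, mask_j, weights=(1/3, 1/3, 1/3)):
    """Weighted combination of the three metrics; equal weights assumed."""
    wa, wd, ws = weights
    return (wa * appearance_metric(n_matches, n_feat_i, n_feat_j)
            + wd * distance_metric(i, j, max_gap)
            + ws * shape_metric(mask_i, mask_j))
```

Frames whose combined similarity exceeds a chosen threshold would then be flagged as key-frames.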
3.3 Sparse temporal feature correspondence
Numerous approaches have been proposed to temporally align moving objects in 2D using either feature matching or optical flow. However, these methods may fail in cases of occlusion, movement parallel to the view direction, large motion and visual ambiguity. To overcome these limitations we match sparse feature points from all light-field camera views at each time instant for each object. These matches are used to estimate the similarity between the object surface observed at different frames of the light-field video for key-frame detection, and subsequently to initialize dense light-field scene flow between frames.
3D points corresponding to each dynamic object are projected into each frame to ensure spatial light-field coherence, and features are computed at the projected locations of the 3D points visible in each light-field camera view at each frame. Nearest neighbour matching is used to establish matches between features. The ratio of the first to second nearest neighbour descriptor matching score is used to eliminate ambiguous matches. This is followed by a symmetry test, which employs the principle of forward and backward match consistency to remove erroneous, inconsistent correspondences. To further refine the sparse matching and eliminate outliers we enforce local spatial coherence: for matches in a local neighbourhood of each feature we find the average Euclidean distance and constrain the match to lie within a threshold of this average.
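The matching cascade described above (nearest-neighbour matching, ratio test, symmetry test, local spatial coherence) can be sketched as follows; the ratio, neighbourhood radius and distance threshold values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def nn_matches(desc_a, desc_b, ratio=0.85):
    """One-way matches from desc_a to desc_b with a first-to-second
    nearest-neighbour ratio test (0.85 is an illustrative value)."""
    matches = {}
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        if len(order) > 1 and dists[order[0]] < ratio * dists[order[1]]:
            matches[i] = int(order[0])
    return matches

def symmetric_matches(desc_a, desc_b, ratio=0.85):
    """Keep only forward/backward consistent correspondences."""
    fwd = nn_matches(desc_a, desc_b, ratio)
    bwd = nn_matches(desc_b, desc_a, ratio)
    return {i: j for i, j in fwd.items() if bwd.get(j) == i}

def spatially_coherent(matches, pts_a, pts_b, radius=20.0, tol=5.0):
    """Reject matches whose displacement differs from the average
    displacement of neighbouring matches by more than tol pixels."""
    kept = {}
    for i, j in matches.items():
        disp = pts_b[j] - pts_a[i]
        neigh = [pts_b[jj] - pts_a[ii] for ii, jj in matches.items()
                 if ii != i and np.linalg.norm(pts_a[ii] - pts_a[i]) < radius]
        if not neigh or np.linalg.norm(disp - np.mean(neigh, axis=0)) < tol:
            kept[i] = j
    return kept
```

The same cascade would be applied per light-field view and per frame pair.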
Sparse temporal correspondence tracks are obtained by performing exhaustive matching against each key-frame across the entire sequence: features at the key-frame are matched to features at all other frames in each view, giving correspondences between every frame and the key-frame. Point tracks are constructed from this correspondence information for each key-frame. The same process is repeated for key-points that are not yet part of a point track at the corresponding key-frame, and any new point tracks are added to the list for that key-frame. The exhaustive matching per key-frame handles the reappearance and disappearance of points due to occlusion or object movement. An example of sparse feature correspondence tracks is shown in Figure 4, where each point track is represented by a unique colour and missing points indicate partial feature correspondence tracks. Sparse correspondences are also obtained between all key-frames to initialise the dense light-field flow.
3.4 Light-field scene flow
Dense temporal correspondence is estimated with a novel light-field scene flow method in which pairwise dense correspondence is computed by oriented light-field window matching. This combines information across all light-field views to achieve robust temporal correspondence. The sparse feature correspondences provide a robust initialisation of the proposed dense flow under large non-rigid shape deformation in the light-field video.
The high level of inherent redundancy in light-field images is used to improve the robustness and reliability of dense scene flow by using oriented light-field windows on the EPI for matching, which has been shown to improve temporal correspondences for scene flow. In this paper we propose a novel method to obtain EPIs from sparse light-field data to improve the quality of dense scene flow. Our approach uses oriented light-field windows to represent surface appearance information for matching, as illustrated in Figure 5.
3.4.1 Epipolar Plane Image (EPI)
Traditional light-field data is captured with a plenoptic camera or a dense camera array typically capturing 200-300 views. An example dataset captured with 289 images is illustrated in Figure 6 (a). Given a dense set of views, the EPI provides a representation for estimating correspondence across views from the regular structure of the image, i.e. slanted lines correspond to the same surface point. A 2D slice of this representation has the same width as the captured image, and its height is given by the number of views in a camera array row (17 in Figure 6 (a)). With dense sampling the disparity between adjacent views is typically sub-pixel, giving slanted lines in the EPI. Approaches have been proposed to estimate depth and segmentation from this dense EPI representation [5, 17]. However, for sparse camera sampling the disparity between views may be several pixels, making it difficult to directly establish surface correspondence from the EPI (Figure 6 (b)). In our case of sparse views from a 5x4 camera array, the height of the EPI reduces to only a few pixels, which makes it challenging to exploit the regular structure of light-field EPIs.
In this paper we aim to use the EPIs to introduce spatio-temporal coherence in dynamic light-field video. We propose a novel method to create an EPI parametrized representation from sparse light-field capture. We assume that the calibration is known. All the images at each time instant are undistorted and rectified with respect to the reference camera. The depth information at each time instant is used to resample the light-field to create an EPI representation of sparse light-field data. The algorithm to obtain EPI from image, calibration and depth information is illustrated in Figure 5; the stages are as follows:
The dense point-cloud of each dynamic object is projected onto the undistorted and rectified images, giving a set of projected points in each view. For each row of the projected points on the dynamic object, a 2D EPI is obtained.
The width of each 2D EPI matches the width of the input images, and its height is proportional to the number of cameras in each row of the array, scaled by a constant introduced to increase the separation between views under sparse camera sampling. For each row of the camera array, one 2D EPI is obtained per row of projected object points, giving a set of EPIs whose number equals the height of the object in pixels. The projected points are plotted in the 2D EPI with horizontal coordinates equal to their image coordinates; the vertical coordinates are estimated using the translation information from the calibration.
Given the set of image samples corresponding to a surface point, we fit a line in the EPI to account for imperfections in the camera array; a scene point is thus represented by a line in the 2D EPI. The same process is repeated for the entire point-cloud to obtain multiple 2D EPIs. Given a two-dimensional camera array, two sets of EPIs are obtained by resampling along epipolar lines in the horizontal and vertical directions.
The EPIs obtained from sparse camera arrays are used to constrain dense light-field flow using oriented light-field windows.
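A minimal sketch of this EPI construction for one surface point follows, assuming rectified views along one array row so that a point's projections differ only in their horizontal coordinate; the spacing constant k = 4 is an assumed value, not the paper's setting.

```python
import numpy as np

def epi_line_fit(proj_x, k=4):
    """Fit a line x = a*v + b through the projections of one surface
    point across the cameras of an array row, where v is the view index
    scaled by the spacing constant k. With a perfect array the samples
    are exactly collinear; the least-squares fit absorbs small
    calibration errors."""
    v = k * np.arange(len(proj_x))
    a, b = np.polyfit(v, proj_x, 1)
    return a, b, v

def build_epi(proj_x, width, k=4):
    """Rasterise the fitted line into a 2D EPI of height k*(n_views-1)+1
    and the input image width; one such EPI is built per row of
    projected object points."""
    a, b, v = epi_line_fit(proj_x, k)
    epi = np.zeros((v[-1] + 1, width), dtype=np.uint8)
    for vi in range(v[-1] + 1):
        x = int(round(a * vi + b))
        if 0 <= x < width:
            epi[vi, x] = 255
    return epi
```

The slope of the fitted line encodes the point's disparity (and hence depth), which is what the oriented light-field windows below exploit.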
3.4.2 Oriented light-field windows
Oriented light-field windows exploit the regular structure of the EPI to enable more robust and accurate pixel comparisons than spatial windows, which suffer from defocus blur and loss of precision. In this paper we apply oriented light-field windows to the EPIs obtained from the sparse light-field data to estimate light-field scene flow temporal correspondence.
Each scene point can be represented by an oriented window in the light-field ray space. The shear of the ray is related to the depth of the 3D point, and the size of the window is defined by spatial and angular Gaussian weights. However, the ray moves with object motion, hence both shear and translation must be accounted for. The oriented light-field window corresponding to a scene point is computed as follows:
Here the window is centred on the scene point, with coordinates parametrized over the image plane and the camera plane. A shear operation determined by the depth of the point aligns the window with the point's EPI line, a translation operator accounts for object motion, and Gaussian-weighted windows over the spatial and angular dimensions define the extent of the window. The angular weights form a 2D Gaussian centred on the row and column of the camera array containing the reference light-field view (shown as green cameras and a red box respectively in Figure 5).
This formulation extends directly to a dense camera array. We use these oriented light-field windows to robustly estimate dense scene flow for light-field video.
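A possible sketch of sampling an oriented light-field window from a 2D EPI is given below; the window size and Gaussian sigmas are illustrative assumptions, and the shear is taken directly as the slope of the point's EPI line (a function of its depth).

```python
import numpy as np

def oriented_window(epi, x0, v0, shear, half_w=2, sigma_x=2.0, sigma_v=2.0):
    """Sample a sheared, Gaussian-weighted window around EPI position
    (v0, x0). The shear follows the point's EPI line slope, so the
    window tracks the same surface point across views. Returns the
    weighted samples and the weights themselves."""
    rows, cols = epi.shape
    out = np.zeros((2 * half_w + 1, 2 * half_w + 1))
    wts = np.zeros_like(out)
    for dv in range(-half_w, half_w + 1):
        for dx in range(-half_w, half_w + 1):
            # shear: the sample column shifts with the view offset dv
            x = int(round(x0 + dx + shear * dv))
            v = v0 + dv
            if 0 <= v < rows and 0 <= x < cols:
                w = (np.exp(-dx**2 / (2 * sigma_x**2))
                     * np.exp(-dv**2 / (2 * sigma_v**2)))
                out[dv + half_w, dx + half_w] = w * epi[v, x]
                wts[dv + half_w, dx + half_w] = w
    return out, wts
```

Two such windows (at consecutive time instants) can then be compared, e.g. by weighted SSD, to score a candidate flow vector.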
3.4.3 Dense light-field scene flow
Oriented light-field windows are used to estimate flow between consecutive frames and between key-frames (Figure 7), using the sparse temporal feature correspondences as initialization for each view. Flow is estimated on the object region for each light-field view, as illustrated in Figure 8, to obtain dense temporal correspondence.
The flow is formulated as a translation of each pixel location in the image over time. The flow per view for each dynamic object is computed by minimizing the following cost function:
The cost consists of three terms: a light-field consistency term for the oriented light-field window alignment; an appearance term for brightness coherency; and a regularization term to avoid sudden peaks in the flow and maintain consistency. The colour and regularization terms are common in optical flow, while the light-field consistency term is introduced for the first time to improve dense flow for sparse light-field video.
Light-field Consistency: The 2D EPIs obtained from the sparse light-field views are used to define oriented light-field windows for each scene point. These windows encapsulate the observed multi-view light-field appearance of the corresponding surface point and can be matched over time to estimate temporal correspondence: the consistency term compares the oriented light-field windows at the two time instants at the estimated depth, as defined in Equation 1.
Appearance Consistency: This adds the brightness constancy assumption to the cost function, generalized over all light-field cameras and both time steps. The term is obtained by integrating the sum of three penalizers over the reference image domain: the first penalizes deviation from brightness constancy over time within the same view; the second penalizes deviation from brightness constancy between the reference view and each of the other views at the same time instant; and the third, applied wherever the flow vector lies within a window of a sparse constraint, forces the flow to approximate the sparse 2D temporal correspondence tracks.
Regularization: This penalizes the absolute difference of the flow field to enforce motion smoothness and to handle occlusions and areas of low confidence. The weighting is computed adaptively for each pixel as the minimum subtracted from the mean data energy within the search window.
Occlusions: To detect occlusions, we compare the forward flow from one frame to the next with the backward flow in the opposite direction. Occluded pixels are robustly indicated by large differences between the forward and backward motion and are excluded as outliers.
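The forward/backward consistency check can be sketched as follows; the one-pixel threshold is an assumed value.

```python
import numpy as np

def occlusion_mask(flow_fwd, flow_bwd, thresh=1.0):
    """Mark a pixel occluded when following the forward flow and then the
    backward flow does not return near the starting position. Flows are
    (H, W, 2) arrays of (dx, dy); thresh is in pixels (assumed value)."""
    h, w = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # follow the forward flow, then look up the backward flow there
    xt = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, h - 1)
    round_trip = flow_fwd + flow_bwd[yt, xt]
    return np.linalg.norm(round_trip, axis=-1) > thresh
```

Pixels flagged by this mask would be excluded from the data term as outliers.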
Optimization: For each pixel, the energy defined in Equation 2 is evaluated over a local window, as in the SimpleFlow algorithm, to estimate the flow vector. The optical flow is optimized over a multi-scale pyramid with warping between pyramid levels, giving a coarse-to-fine strategy that allows the estimation of large displacements.
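A sketch of the per-pixel window search in the spirit of SimpleFlow is given below, using only a data (SSD) term; the paper's full energy additionally includes the light-field consistency and regularization terms and is optimized coarse-to-fine, so this is an illustrative fragment only.

```python
import numpy as np

def local_flow(img0, img1, p, search=3, patch=2):
    """For pixel p = (y, x), pick the displacement within a
    (2*search+1)^2 window that minimises patch SSD between frames.
    Sketch of the data term of a SimpleFlow-style local search."""
    y, x = p
    h, w = img0.shape
    ref = img0[y - patch:y + patch + 1, x - patch:x + patch + 1]
    best, best_e = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if patch <= yy < h - patch and patch <= xx < w - patch:
                cand = img1[yy - patch:yy + patch + 1,
                            xx - patch:xx + patch + 1]
                e = np.sum((ref - cand) ** 2)
                if e < best_e:
                    best_e, best = e, (dy, dx)
    return best
```

In a pyramid scheme, this search would run at each level on warped images, with the coarse-level flow upsampled to initialise the next level.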
3.4.4 4D temporally coherent light-field video
The estimated dense flow for each view is back-projected to the visible 3D surface to establish dense 4D correspondence between consecutive frames and between key-frames, as seen in Figure 7, to obtain 4D temporally coherent light-field video. Dense 4D correspondence is first obtained for the light-field view with maximum visibility of 3D points. To increase surface coverage, correspondences are added in order of visibility of 3D points across the sparse light-field views. Dense temporal correspondence is propagated to new surface regions as they appear using the sparse feature correspondence tracks and the respective dense light-field scene flow. Dense scene flow is estimated between key-frames for robust long-term surface tracking.
| Datasets | Sequence length | No. of views | Shot level | Resolution | Key-frames | Avg. sparse tracks |
|----------|-----------------|--------------|------------|------------|------------|--------------------|
| Walking  | 667 | 20 | Far | 2448 x 2048 | 15 | 1934 |
| Sitting  | 694 | 20 | Mid-level | 2448 x 2048 | 13 | 1046 |
| Wakingup | 270 | 20 | Close-up | 2448 x 2048 | 7 | 2083 |
| Running  | 140 | 20 | Far | 2448 x 2048 | 5 | 1278 |
| Magician | 353 | 20 | Close-up | 2448 x 2048 | 6 | 1312 |
4 Results and Performance Evaluation
The proposed approach is tested on a variety of light-field captures; the properties of the evaluation datasets are presented in Table 1. Algorithm parameters are set empirically and kept constant for all results.
Sparse and dense correspondences are obtained on the sparse dynamic light-field data; colour-coded results are shown in Figure 9 for the Walking dataset and in Figure 10 for the Sitting and Wakingup datasets, using the method explained in Section 3.1.
To illustrate the 2D dense alignment the silhouette of the dense mesh on key-frames is colour coded and the colours are propagated between frames using dense scene flow explained in Section 3.4.
Results of the proposed 4D temporal alignment, illustrated in Figure 11, show that the colour of the points remains consistent between frames.
The proposed approach is qualitatively shown to propagate the correspondences reliably over the entire light-field video for complex dynamic scenes with large non-rigid motion.
For comparative evaluation we use: (a) the state-of-the-art dense flow algorithm DeepFlow; (b) dense flow without light-field consistency (DFwLF) in Equation 2; (c) a recent algorithm for alignment of partial surfaces (4DMatch); and (d) SimpleFlow.
Qualitative results against DFwLF, 4DMatch, DeepFlow and SimpleFlow, shown in Figure 12, indicate that for these methods the propagated colour map does not remain consistent across the sequence under large motion, in contrast to the proposed method (red regions indicate correspondence failure).
Quantitative evaluation: For quantitative evaluation we compare the silhouette overlap error (SOE). Dense correspondence over time is used to create a propagated mask for each image. The propagated mask is overlapped with the silhouette of the projected surface reconstruction at each frame to evaluate the accuracy of the dense propagation. Evaluation against the different techniques is shown in Table 2 for all datasets. The silhouette overlap error is lowest for the proposed approach, showing relatively high accuracy.
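A silhouette overlap error of this kind might be computed as below; the paper's exact normalisation is not reproduced here, so this is an assumed variant that measures disagreement relative to the reference silhouette area.

```python
import numpy as np

def silhouette_overlap_error(propagated, reference):
    """Fraction of pixels where the propagated mask and the projected
    reconstruction silhouette disagree, normalised by the reference
    silhouette area (assumed normalisation)."""
    disagree = np.logical_xor(propagated, reference).sum()
    area = reference.sum()
    return disagree / area if area > 0 else 0.0
```

A perfect propagation yields an error of zero; larger values indicate drift in the dense correspondence.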
We evaluate temporal coherence across the Walking sequence by measuring the variation in appearance for each scene point between consecutive frames, and between key-frames and frames, for the proposed and state-of-the-art methods. The evaluation in Table 3 demonstrates the stability of long-term temporal tracking for the proposed method.
Evaluation of the proposed method against dense flow without light-field consistency (DFwLF) demonstrates the benefit of the EPI information in the dense flow estimation of Section 3.4.
Light-field camera array configuration: We evaluate the performance of the proposed 4D temporal alignment for the different light-field camera configurations shown in Figure 13. The completeness of the 3D correspondences at each time instant is compared for all camera configurations in Table 4. The evaluation demonstrates a drop in 3D correspondence with a reduction in the number of cameras, especially for close-up shots (Wakingup and Magician). However, configurations with cameras at the corners of the array provide better coverage than those without.
Limitations: The proposed method fails for fast-spinning objects, scenes with uniform appearance, and highly crowded dynamic environments. This is due to the failure of sparse and dense correspondence under high ambiguity.
This paper introduced the first algorithm to obtain a 4D temporally coherent representation of dynamic light-field video. A novel method to obtain EPIs from sparse light-field video for spatio-temporal correspondence was proposed, and sparse-to-dense light-field scene flow was introduced exploiting information from the EPIs. Dense correspondence is fused spatially for 4D temporally coherent light-field video. The proposed approach was evaluated on a variety of light-field sequences of complex dynamic scenes with large non-rigid deformations, demonstrating the accuracy of the resulting temporally consistent 4D representation. 4D light-field video provides a spatio-temporally coherent representation to support light-field video compression and editing, replicating the functionality of conventional video editing and allowing edits to be propagated both spatially across views and temporally across frames.
Acknowledgments: This research was supported by the InnovateUK grant for Live Action Lightfields for Immersive Virtual Reality Experiences (ALIVE) project (grant 102686).
-  M. Levoy, R. Ng, A. Adams, M. Footer, and M. Horowitz. Light field microscopy. ACM Trans. Graph., 25(3):924–934, July 2006.
-  C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross. Scene reconstruction from high spatio-angular resolution light fields. ACM Trans. Graph., 32(4):73:1–73:12, July 2013.
-  B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy. High performance imaging using large camera arrays. ACM Trans. Graph., 24(3):765–776, July 2005.
-  V. Vaish, R. Szeliski, C. L. Zitnick, S. B. Kang, and M. Levoy. Reconstructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures. In CVPR, page 2331, 2006.
-  K. Yücer, A. Sorkine-Hornung, O. Wang, and O. Sorkine-Hornung. Efficient 3D object segmentation from densely sampled light fields with applications to 3D reconstruction. ACM Transactions on Graphics, 35(3):22:1–22:15, March 2016.
-  J.-S. Franco and E. Boyer. Exact polyhedral visual hulls. In Proc. BMVC, pages 32.1–32.10, 2003.
-  J. Starck and A. Hilton. Model-based multiple view reconstruction of people. In ICCV, pages 915–922, 2003.
-  C. Budd, P. Huang, M. Klaudiny, and A. Hilton. Global non-rigid alignment of surface sequences. Int. J. Comput. Vision, 102:256–270, 2013.
-  C. Cagniart, E. Boyer, and S. Ilic. Probabilistic deformable surface tracking from multiple videos. In ECCV, pages 326–339, 2010.
-  C. Huang, C. Cagniart, E. Boyer, and S. Ilic. A bayesian approach to multi-view 4d modeling. Int. J. Comput. Vision, 116:115–135, 2016.
-  A. Mustafa, H. Kim, and A. Hilton. 4d match trees for non-rigid surface alignment. In ECCV, 2016.
-  A. Tevs, A. Berner, M. Wand, I. Ihrke, M. Bokeloh, J. Kerber, and H.-P. Seidel. Animation cartography: intrinsic reconstruction of shape and motion. ACM Trans. Graph., pages 12:1–12:15, April 2012.
-  L. Wei, Q. Huang, D. Ceylan, E. Vouga, and H. Li. Dense human body correspondences using convolutional networks. CoRR, abs/1511.05904, 2015.
-  A. Mustafa, H. Kim, J-Y. Guillemaut, and A. Hilton. Temporally coherent 4d reconstruction of complex dynamic scenes. In CVPR, 2016.
-  R. Newcombe, D. Fox, and S. Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In CVPR, 2015.
-  R. C. Bolles, H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. IJCV, 1(1):7–55, Mar 1987.
-  S. Wanner and B. Goldluecke. Globally consistent depth labeling of 4D light fields. In CVPR, pages 41–48, June 2012.
-  M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pages 31–42, New York, NY, USA, 1996. ACM.
-  S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pages 43–54, New York, NY, USA, 1996. ACM.
-  P. Srinivasan, M. Tao, R. Ng, and R. Ramamoorthi. Oriented light-field windows for scene flow. In ICCV, December 2015.
-  A. Jarabo, B. Masia, A. Bousseau, F. Pellacini, and D. Gutierrez. How do people edit light fields? ACM Transactions on Graphics (SIGGRAPH 2014), 33(4), 2014.
-  A. Mustafa and A. Hilton. Semantically coherent co-segmentation and reconstruction of dynamic scenes. In CVPR, 2017.
-  H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic studio: A massively multiview system for social motion capture. In ICCV, 2015.
-  E. Zheng, D. Ji, E. Dunn, and J-M. Frahm. Sparse dynamic 3D reconstruction from unsynchronized videos. In ICCV, 2015.
-  N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In ECCV, pages 438–451, 2010.
-  M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
-  A. Zanfir and C. Sminchisescu. Large displacement 3d scene flow with occlusion reasoning. In ICCV, 2015.
-  T. Basha, Y. Moses, and N. Kiryati. Multi-view scene flow estimation: A view centered variational approach. In CVPR, pages 1506–1513, 2010.
-  A. Wedel, T. Brox, T. Vaudrey, C. Rabe, U. Franke, and D. Cremers. Stereoscopic scene flow computation for 3D motion understanding. IJCV, 95:29–51, 2011.
-  S.M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, pages 519–528, 2006.
-  R. B. Rusu. Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Computer Science department, Technische Universitaet Muenchen, Germany, 2009.
-  G. D. Evangelidis and E. Z. Psarakis. Parametric image alignment using enhanced correlation coefficient maximization. IEEE Trans. Pattern Anal. Mach. Intell., 30:1858–1865, 2008.
-  Stanford Graphics Laboratory. The (New) Stanford Light Field Archive, 2008.
-  M. W. Tao, J. Bai, P. Kohli, and S. Paris. SimpleFlow: A non-iterative, sublinear optical flow algorithm. Computer Graphics Forum (Eurographics 2012), 31(2), May 2012.
-  P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. Deepflow: Large displacement optical flow with deep matching. In ICCV, pages 1385–1392, 2013.