Multiframe Motion Coupling for Video Super Resolution
The idea of video super resolution is to use different view points of a single scene to enhance the overall resolution and quality. Classical energy minimization approaches first establish a correspondence of the current frame to all its neighbors in some radius and then use this temporal information for enhancement. In this paper, we propose the first variational super resolution approach that computes several super resolved frames in one batch optimization procedure by incorporating motion information between the high-resolution image frames themselves. As a consequence, the number of motion estimation problems grows linearly in the number of frames, as opposed to the quadratic growth of classical methods, and temporal consistency is enforced naturally.
We use infimal convolution regularization together with an automatic parameter balancing scheme to determine the reliability of the motion information and reweight the regularization locally. We demonstrate that our approach yields state-of-the-art results and is even competitive with machine learning approaches.
The technique of video super resolution combines the spatial information from several low resolution frames of the same scene to produce a high resolution video. A classical way of solving the super resolution problem is to estimate the motion from the current frame to its neighboring frames, model the data formation process via warping, blur, and downsampling, and use a suitable regularization to suppress possible artifacts arising from the ill-posedness of the underlying problem. The final goal is to produce an enhanced, visually pleasing high resolution video in a reasonable runtime.
However, the number of flow computations in this approach increases quadratically with the number of frames. Moreover, due to the strategy of super resolving each frame separately, temporal consistency cannot be enforced explicitly. Yet the latter is a key feature of a visually pleasing video: Even if a method generates a sequence of high quality high resolution frames, temporal inconsistencies will be visible as a disturbing flickering.
In addition, choosing the right strength of the regularization is a delicate issue. While a small regularization allows significant improvements in areas where the motion estimation is precise, it can lead to heavy oscillations and ringing artifacts in areas of quick motion and occlusions. A large regularization, on the other hand, avoids these artifacts but quickly oversmooths the image and hence also suppresses the desirable super resolution effect.
Contributions of this work. We propose a method that jointly solves for all frames of the super resolved video and couples the high resolution frames directly. Such an approach tackles the drawbacks mentioned above: Because only neighboring frames are coupled explicitly, the number of required motion estimations grows linearly with the number of frames. However, by introducing this coupling on the unknown high resolution images directly, all frames are still coupled implicitly and information is exchanged over the entire sequence.
Furthermore, we tackle the problem of choosing the right strength of spatial regularity by proposing to use the infimal convolution between a strong spatial and a strong temporal regularization term. The latter allows our framework to automatically select the right type of regularization locally in a single convex optimization approach that can be minimized globally. To make this approach robust we devise a parameter choice heuristic that allows us to process very different videos.
As illustrated in Figure 1 our approach yields state-of-the-art results. While Figure 1 is a synthetic test consisting of planar motion only, we demonstrate the performance of the proposed approach on several real world videos in Section 4.
The literature on super resolution techniques is vast and it goes beyond the scope of the paper to present a complete overview. An extensive survey of super resolution techniques published before 2012 can be found in [18]. We will focus on recalling some recent approaches based on energy minimization and deep learning techniques.
Variational Video Reconstruction. A classical variational super resolution technique was presented in  in which the authors propose to determine a high resolution version of the -th frame via
where denotes the Huber loss, a downsampling operator, a blur kernel, a regularization parameter, and a warping operator that compensates the motion from the -th to the -th frame and is computed by an optical flow estimation in a first processing step. The temporal consistency term is based on and hence compares each frame to multiple low resolution frames. Figure (a)a shows all couplings needed to use this approach for a sequence of frames.
Mitzel et al. [15] use a similar minimization, albeit with the L1 norm instead of Huber loss. In comparison to [25] they do not compute all needed couplings but approximate them from the flows between neighboring frames, which allows for a trade-off between speed and accuracy.
Liu and Sun [12] proposed to incorporate different (global) weights for each of the temporal consistency terms in eq. (1), and additionally estimate the blur kernel as well as the warping operators by applying alternating minimization. In [13], Ma et al. extended the work [12] for the case of some of the low resolution frames being particularly blurry. Similar to (1), the energies proposed in [12, 13] do not enforce regularity between the high resolution frames directly and require quadratically many motion estimations. Furthermore, both works focus on a simplified downsampling procedure that is easier to invert than our more realistic model.
In a recent work [3] on time continuous variational models, the authors proposed to use an optical flow penalty as a temporal regularization for joint image and motion reconstruction. While the optical flow term is exact in the temporally continuous setting, it would require small motions of less than one pixel to be a good approximation in a temporally discrete video.
Learning based approaches. With the recent breakthroughs of deep learning and convolutional neural networks, researchers have promoted learning-based methods for super resolution [9, 11, 30, 10, 22].
The focus of [30, 22] is the development of a real-time capable super resolution technique, so we concentrate our comparison on [9], [11], and [10], which focus on high image quality rather than computational efficiency.
Note that [9] and [11] work with motion correction and require optical flow estimations. Similar to the classical variational techniques, they register multiple neighboring frames to the current frame and hence also require quadratically many flow estimations.
The very deep convolutional network VDSR [10] is a conceptually different approach that does not use any temporal information, but solely relies on the training data.
2 Proposed Method
For a sequence of low-resolution input images we propose a multi-frame super resolution model based on motion coupling between subsequent frames. Opposed to any of the variational approaches summarized in the previous section, the energy we propose directly couples all (unknown) high resolution frames . Our method jointly computes the super resolved versions of video frames at once via the following minimization problem,
The first term is a standard data fidelity term similar to (1). The key novelty of our approach is twofold and lies in the way we incorporate and utilize the motion information as well as the way we combine the temporal information with a spatial regularity assumption. The latter combines an extension of a spatio-temporal infimal convolution technique proposed by Holler and Kunisch in [8] with an automatic parameter balancing scheme.
2.1 Spatio-Temporal Infimal Convolution and Parameter Balancing
The second term in (2) denotes the infimal convolution  between a term , which is mostly focused on introducing temporal information, and a term , which is mostly focused on enforcing spatial regularity on . The infimal convolution between the two terms is defined as
It can be understood as a convex approximation to a logical OR connection and allows to optimally divide the input into two parts, one of which is preferable in terms of the costs and the other one in terms of the costs . The respective costs are defined as
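To make the "logical OR" interpretation concrete, the following minimal sketch evaluates an infimal convolution numerically in one dimension, using the standard definition (f □ g)(x) = inf over all splits x = w + z of f(w) + g(z). The names `cost_a`, `cost_b`, and `grid` and the weights 3 and 1 are illustrative assumptions and not the regularizers of this paper. For two weighted absolute values the infimal convolution equals the cheaper of the two, which the brute-force search over all splits reproduces.

```python
def infimal_convolution(f, g, x, candidates):
    """Brute-force (f inf-conv g)(x) = min over w of f(w) + g(x - w).

    `candidates` is a finite list of trial values for w, so the result is
    an upper bound that is exact whenever the true minimizer is in the list.
    """
    return min(f(w) + g(x - w) for w in candidates)


# Hypothetical costs: an expensive and a cheap weighted absolute value.
def cost_a(w):
    return 3.0 * abs(w)


def cost_b(z):
    return 1.0 * abs(z)


# Trial splits on a grid from -4 to 4 with step 0.5 (contains 0 and x).
grid = [k * 0.5 for k in range(-8, 9)]

x = 2.0
value = infimal_convolution(cost_a, cost_b, x, grid)
# All of x is assigned to the cheaper term, so value equals 1.0 * abs(x).
```

In this toy setting the split that minimizes the sum puts the entire input into the cheaper term, which is the behavior the text describes: each part of the signal is charged by whichever cost suits it better.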
for , where the subscripts and denote the - and -derivatives, and denotes the photoconsistency
given a motion field , see section 3.1. The parameter encodes the relative scaling of the time and space dimensions and is estimated automatically as the ratio of warp energy to gradient energy on a bicubic estimate :
where denotes the vector-valued image obtained by stacking all , and denote the stacked - and -derivatives of all frames of the sequence.
Since the warp operator is multiplied with this provides an image-adaptive way to make sure that the spatial and temporal regularity terms are in the same order of magnitude. Note that such a term also makes sense from a physical point of view: Since and measure change in space and measures change per time, a normalization factor with units ‘time over space’ is necessary to make these physical quantities comparable. A related discussion can be found in [8, Section 4].
The idea for using the infimal convolution approach originates from [8], in which the authors used a similar term with a time derivative instead of the operator for video denoising and decompression. The infimal convolution automatically selects a regularization focusing either on space or time at each point. At points in the image where the warp energy is high, our approach automatically uses strong total variation (TV) regularization. In this sense it is a convex way of replacing the EM-based local parameter estimation from  by a joint and fully automatic regularization method with similar effects: It can handle inconsistencies in the motion field by deciding to determine such locations by . On the other hand, introducing strong spatial regularity can suppress details that would be introduced by the temporal coupling. The infimal convolution approach allows favoring the optical flow information without over-regularizing those parts of the image where the flow estimation seems to be faithful.
Figure 3 demonstrates the behavior of the infimal convolution by illustrating the division of one frame into the two parts and of (3). Areas in which the optical flow estimation is problematic are visible in the variable and hence mostly regularized spatially. All other areas are dominated by strong temporal regularization.
2.2 Multiframe Motion Coupling
A key aspect of our approach is the temporal coupling of the (unknown) high resolution frames . It is based on color constancy assumptions and couples the entire sequence in a spatio-temporal manner using only linearly many flow fields . Figure 2 illustrates the difference of the temporal coupling of previous energy minimization techniques and the proposed method. Besides only requiring linearly many flow fields, the high resolution frames are estimated jointly such that temporal consistency is enforced directly. Note that the energies (1), or the ones of [15, 12, 13] decouple and solve for each high resolution frame separately with the temporal conformance only given by the consistency of the low resolution frames , so that inconsistent flickering in high resolution components is not accounted for.
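The difference between the two coupling strategies can be made concrete by counting the required flow fields. The following sketch is only an illustration: the function names are hypothetical, and it assumes that classical methods register every frame to all neighbors within a temporal radius, while the coupling described above needs one flow field per pair of consecutive frames.

```python
def classical_flow_count(num_frames, radius):
    """Flows needed when every frame is registered to all neighbors within `radius`."""
    count = 0
    for i in range(num_frames):
        lo = max(0, i - radius)
        hi = min(num_frames - 1, i + radius)
        count += hi - lo  # neighbors of frame i, excluding i itself
    return count


def coupled_flow_count(num_frames):
    """Flows needed when only consecutive frames are coupled."""
    return max(0, num_frames - 1)


# A batch of 13 frames in which every frame uses all other frames.
classical = classical_flow_count(13, radius=12)
coupled = coupled_flow_count(13)
```

For a batch of 13 frames with full coupling this gives 156 flow fields for the classical strategy and 12 for the consecutive-frame coupling.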
The optimization is performed in a two-step procedure: We compute the optical flow on the low resolution input frames and upsample the flow to the desired resolution using bicubic interpolation. Then we solve the super resolution problem (2). We experimented extensively with an alternating scheme, cf. , but the effective resolution increase through this recurring optical flow computation is marginal, as we will discuss in section 4.4. An alternative approach shown by the authors of  would be to compute the high resolution optical flow on a bicubic video estimate. However, our experiments showed that our approach was as precise while being much more efficient.
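One detail of the flow upsampling step is easy to overlook: displacements are measured in pixels, so a flow field computed on the low resolution grid has to be multiplied by the magnification factor when it is moved to the high resolution grid. The sketch below illustrates only this point. It is an assumption-laden simplification: it uses nearest-neighbor replication instead of the bicubic interpolation used in our pipeline, and the function name and the list-of-tuples layout are hypothetical.

```python
def upsample_flow_nearest(flow, factor):
    """Upsample a flow field (2D list of (dx, dy) tuples) by an integer factor.

    Each low-resolution vector is replicated over a factor x factor block and
    scaled by `factor`, since displacements are measured in pixels.
    """
    out = []
    for row in flow:
        up_row = []
        for dx, dy in row:
            up_row.extend([(dx * factor, dy * factor)] * factor)
        for _ in range(factor):
            out.append(list(up_row))
    return out


# A single low-resolution pixel moving one pixel right and half a pixel up.
low_res_flow = [[(1.0, -0.5)]]
high_res_flow = upsample_flow_nearest(low_res_flow, 2)
```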
3.1 Optical Flow Estimation
The optical flow on low resolution input frames is calculated via
It consists of two data terms, one that models brightness constancy and one that models gradient constancy, as well as a Huber penalty () that is enforcing the regularity of the flow field. Note that (8) describes a series of time-independent problems. To solve each of these problems we follow well-established methods [24, 27, 29] and first linearize the brightness and gradient constancy terms using a first order Taylor expansion with respect to the current estimate of the flow field, resulting in a convex energy minimization problem for each linearization. We exploit the well-known iterative coarse-to-fine approach [1, 2] with median filtering. A detailed evaluation of this strategy can be found in [24]. We use a primal-dual algorithm with preconditioning [19, 6] to solve the convex subproblems within the coarse-to-fine pyramid using the CUDA module of the FlexBox framework [7].
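For readers unfamiliar with the Huber penalty, a minimal scalar sketch follows. It uses one common normalization (quadratic below a threshold, linear with slope one above it, continuous at the threshold); the exact scaling and threshold used in our energy are not restated here, so the parameter `eps` and its example values are assumptions.

```python
def huber(x, eps):
    """Huber function: quadratic for |x| <= eps, linear (slope 1) beyond."""
    a = abs(x)
    if a <= eps:
        return a * a / (2.0 * eps)
    return a - eps / 2.0


small = huber(0.5, 1.0)   # quadratic regime
large = huber(2.0, 1.0)   # linear regime
```

The quadratic part smooths small variations of the flow, while the linear part penalizes large jumps only proportionally, so motion discontinuities remain possible.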
3.2 Super Resolution
Unlike previous approaches, the super resolution problem (2) does not simplify to a series of time-independent problems, since individual frames are correlated by the flow. Consequently, the problem is solved in the whole space/time domain. First, we want to deduce that (2) can be rewritten in the form
where , , and denotes a linear operator, i.e. a matrix in the discrete case after vectorization of the images , that contains the downsampling and blur operators. We use an averaging approach for the downsampling, e.g.  and choose the subsequent blur operator as Gaussian blur with variance dependent on the magnification factor, e.g. for a factor of 4. Similarly, the gradients on and are block-diagonal operators consisting of the gradient operators of the single frames along the diagonal. The operator is also linear and can be seen as a motion-corrected time derivative. The notation is used to denote the sum of the norms of the vector formed by two entries from the gradient and one entry from the warping operator .
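As a rough illustration of the averaging downsampling, the sketch below averages non-overlapping blocks of a small image. It is not the matrix-based operator of our implementation: the function name and the nested-list image layout are hypothetical, the image size is assumed to be divisible by the factor, and the Gaussian blur is omitted.

```python
def average_downsample(image, factor):
    """Downsample a 2D list of floats by averaging factor x factor blocks.

    Assumes both image dimensions are divisible by `factor`.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block_sum = sum(
                image[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            out_row.append(block_sum / (factor * factor))
        out.append(out_row)
    return out


# 4 x 4 test image with values 0..15, reduced by a factor of 2.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
small_image = average_downsample(image, 2)
```

Because every high resolution pixel contributes to exactly one low resolution pixel, the operator is linear and can be written as a sparse matrix acting on the vectorized image.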
Based on the flow fields from the first step, we write the functions of the form as , where the are bicubic interpolation operators, such that . The final linear operator consists of entries of the form and one final block of zeros, acting as zero Neumann boundary conditions in time.
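The structure of the motion-corrected time derivative, including the final block of zeros, can be sketched on one-dimensional frames. The example below is a simplification under stated assumptions: displacements are integers and lookups are clamped to the frame, whereas our operator uses bicubic interpolation; the sign convention (next frame at the displaced position minus current frame) and all names are hypothetical.

```python
def warped_time_derivative(frames, flows):
    """Motion-corrected temporal differences for a list of 1D frames.

    frames: list of n lists of floats (one per frame).
    flows:  list of n - 1 lists of integer displacements; flows[i][x] says
            where pixel x of frame i is found in frame i + 1.
    Returns n lists; the last one is all zeros (no successor frame).
    """
    n = len(frames)
    width = len(frames[0])
    result = []
    for i in range(n - 1):
        diff = []
        for x in range(width):
            target = min(max(x + flows[i][x], 0), width - 1)  # clamp to the frame
            diff.append(frames[i + 1][target] - frames[i][x])
        result.append(diff)
    result.append([0.0] * width)  # final zero block
    return result


# A single bright pixel moving one position to the right.
frames = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
with_flow = warped_time_derivative(frames, [[1, 1, 1, 1]])
without_flow = warped_time_derivative(frames, [[0, 0, 0, 0]])
```

With the correct displacement the residual vanishes everywhere, while a zero flow leaves a nonzero residual at the old and new positions of the moving pixel.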
Similar to the flow problem, we used an implementation of the primal-dual algorithm in the PROST [16] framework but also provide an optional binding to FlexBox [7]. Our code is publicly available on GitHub (https://github.com/HendrikMuenster/superResolution) for the sake of reproducibility.
4 Numerical Results
We choose static parameters and across all of our different datasets and figures as we found them to yield a good and robust trade off for arbitrary video sequences for a magnification factor of 4.
To be able to super resolve color videos we follow a common approach [28, 9, 10] and transform the image sequence into a YCbCr color space and only super resolve the luminance channel Y with our variational method. The chrominance channels Cr and Cb are upsampled using bicubic interpolation. Since almost all detail information is concentrated in the luminance channel, this simplification yields almost exactly the same peak signal-to-noise ratio (PSNR) as super resolving each channel separately.
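The color handling can be sketched per pixel as follows. The conversion shown is the full-range JFIF variant with BT.601 weights, which is an assumption: the text does not state which YCbCr variant is used, and other tools apply a limited-range variant with different offsets. The function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (JFIF, BT.601 weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


# Gray pixels carry no chrominance: Cb and Cr stay at the neutral value 128.
y, cb, cr = rgb_to_ycbcr(255, 255, 255)
```

Only the Y channel would then be passed to the variational method, while Cb and Cr are interpolated and recombined with the super resolved luminance.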
To process longer videos, we apply our method to batches of frames whose size is given by the desired temporal radius, and use the last computed frame of each batch as a boundary value for the next batch to ensure temporal consistency.
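One possible way to organize such batches is sketched below. It assumes that consecutive batches overlap by exactly one frame, so that the last frame of a batch can be held fixed as the first frame of the next one; this overlap, the function name, and the batch size in the example are illustrative assumptions.

```python
def frame_batches(num_frames, batch_size):
    """Split frame indices into batches that overlap by one frame.

    Each batch after the first starts with the last frame of the previous
    batch, which can then serve as a fixed boundary value.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    batches = []
    start = 0
    while start < num_frames - 1:
        end = min(start + batch_size, num_frames)
        batches.append(list(range(start, end)))
        start = end - 1  # reuse the last frame as the next boundary
    return batches


batches = frame_batches(7, 3)
```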
We evaluate the presented algorithm on several scenes with very different complexity and resolution. Included in our test set is one simple synthetic scene consisting of a planar motion of the London subway map (tube), shown in Figure 1, four common test videos [12, 9, 21] (calendar, city, foliage, walk), three sequences from [11, 21] (foreman, temple, penguins), and four sequences from a realistic and modern UHD video sequence (sheets, wave, surfer, dog) [23] subsampled to 720p, which contain large non-linear motion and complex scene geometries. For the sake of this comparison we focused on an upsampling factor of 4, although our variational approach is able to handle arbitrary positive real upsampling factors in a straightforward fashion.
We evaluate nearest neighbor (NN) and bicubic interpolation (Bic), Video Enhancer [20] (a commercial upsampling software), the variational approach [13] (MFSR), as well as the learning based techniques Deep Draft [11], VSRnet [9], and VDSR [10] using code provided by the respective authors, along with our proposed method and reimplementations of the variational methods [25, 15] (with ). For the sake of fairness in comparison of [15] to [25] we computed all necessary optical flows directly instead of approximating them. We consider 13 frames of the tube, city, calendar, foliage, walk and foreman sets and 5 frames of the larger temple, penguins, sheets, surfer, wave and dog sets. The PSNR and structural similarity index measure (SSIM) [26] were determined for the central image of each sequence after cropping 20 pixels at each boundary. This was done so that the classical coupling methods [13, 11, 25, 15] are properly evaluated at the frame with maximal information in each direction for a given batch of frames.
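The PSNR evaluation with a cropped border can be written down in a few lines. The sketch below is not our evaluation script: the function name, the nested-list image layout, and the peak value of 255 are assumptions, and the border of 20 pixels is exposed as a parameter.

```python
import math


def psnr_cropped(reference, estimate, border=20, peak=255.0):
    """PSNR in dB between two equally sized 2D lists, ignoring a border of pixels."""
    rows, cols = len(reference), len(reference[0])
    if rows <= 2 * border or cols <= 2 * border:
        raise ValueError("image too small for the requested border")
    squared_error = 0.0
    count = 0
    for r in range(border, rows - border):
        for c in range(border, cols - border):
            diff = reference[r][c] - estimate[r][c]
            squared_error += diff * diff
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


# Tiny example: the estimate is off by 1 inside and by 100 on the border.
reference = [[0.0] * 6 for _ in range(6)]
estimate = [
    [1.0 if 1 <= r <= 4 and 1 <= c <= 4 else 100.0 for c in range(6)]
    for r in range(6)
]
score = psnr_cropped(reference, estimate, border=1)
```

In this example the large border errors do not influence the score, since only the inner 4 x 4 region with a mean squared error of 1 is evaluated.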
4.1 Evaluation of proposed improvements
We present several incremental steps in this work. To delineate the contributions of each, we will consider the average PSNR value score over our data sets in the bar plot to the right. The baseline is given by nearest neighbors interpolation with dB. Bicubic interpolation yields an improvement to dB and total variation upsampling, e.g. , adds further dB.
As a next step we consider our model of coupling frames directly, but without the infimal convolution. Instead we consider a simpler additive regularizer first,
using this regularizer results in an average PSNR of dB. It turns out that this method is already dB better than the classical coupling of , due to failure cases in several fast-moving sets. In these cases, computing the optical flow between frames that are further apart is too error-prone, whereas the flow between neighboring frames is still reasonable to compute.
Next we consider our robustness improvements: Coupling spatial and temporal regularizers via the proposed infimal convolution (3) increases the PSNR value by dB to dB for fixed . Adapting the spatio-temporal scaling with the heuristic (7) finally adds dB. Note this choice of can also be used directly for the additive regularizer, yielding dB. A memory constrained implementation of the proposed method might want to rely just on that.
We report run times of 24 seconds per frame (~40% optical flow, ~60% super resolution) for our medium sized datasets (13 frames) on an NVIDIA Titan GPU. Although these results are on a modern GPU, the flow and the super resolution problem are implemented in a general purpose framework without direct communication and with linear operators in explicit matrix notation. Further increase in speed could be obtained by porting to a specialized framework avoiding matrix representations. For comparison, our implementation of classical coupling, e.g.  with the same framework, needs 126 seconds per frame (~86% optical flow, ~14% super resolution).
4.2 Choice of Forward Model
During our comparison to other approaches we found out that there was a significant disparity in the choice of operators for the forward model and subsequent data generation. Whereas our approach follows the works [15, 25] and uses a bicubic downsampling process, other works [12, 13, 11] use a Gaussian kernel followed by an asymmetric ’striding’ operation, which keeps every -th pixel in each direction for a downsampling factor of . The Gaussian kernel in  is further chosen to be the theoretically optimal kernel. It turns out that this forward model is firstly easier to invert and secondly favors different strategies. Using it with our infimal convolution approach yields sharper results, significantly improving the PSNR values, e.g. the city dataset. However the direct use of the additive regularizer, eq.(10), is the optimal choice, outperforming infimal convolution and results of  with up to dB. This is a direct consequence of the perfect match of data simulation and construction as discussed in detail in [17, Chapter 2].
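The striding operation itself is simple to state in code. The sketch below keeps every `factor`-th pixel starting at the top-left corner; the starting offset and the function name are assumptions, and the preceding Gaussian blur is omitted. The checkerboard example shows why the blur matters for this forward model: without it, striding can discard all information about the fine structure.

```python
def stride_downsample(image, factor):
    """Keep every `factor`-th pixel of a 2D list in each direction, starting at (0, 0)."""
    return [row[::factor] for row in image[::factor]]


# Checkerboard of zeros and ones; its mean intensity is 0.5.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
strided = stride_downsample(checkerboard, 2)
```

Here the strided result contains only zeros although half of the original pixels are ones, whereas a block average of the same image would return 0.5 everywhere.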
To have a proper evaluation, we generate data by using Matlab’s bicubic image rescaling in our experiments, including color dithering and an anti-aliasing filter, followed by a clipping to obtain image values in . We explicitly do not use this operator in our reconstruction, c.f. eq. (9). Note that these shortfalls are not limited to variational methods: Neural networks equally benefit from training on exactly the same data formation process that is later used for testing.
4.3 Comparison to other Methods
The results for all test sequences and algorithms are shown in Table 1. We structured the methods into three categories; simple interpolation based methods, variational super resolution approaches that utilize temporal information but do not require any training data, and deep learning methods. We indicate the three categories by vertical lines in the tables.
Our method consistently outperforms simple interpolation techniques and also improves upon competing variational approaches, especially for complex motions like walk or surfer. Comparing to the learning based methods, our model based technique seems to be superior on those sequences that contain reasonable motion or a high frame rate. On sequences with particularly large motion and strong occlusions, e.g. penguins or foreman, the very deep convolutional neural network  performs very well, possibly because it does not rely on any motion information but produces high quality results purely based on learned information.
Besides the fact that our approach remains competitive even for the aforementioned challenging data sets in terms of the PSNR values, we want to stress the importance of temporal consistency: Consistency of successive frames is required for a visually pleasing video perception and the lack thereof in other methods immediately yields a disturbing flickering effect. Demo videos showcasing this effect can be found on our supplementary web page (http://www.vsa.informatik.uni-siegen.de/en/superResolution), including a comparison of the consistency of our approach to the VSRnet and VDSR methods.
For a visual inspection of single frames, we present the super resolution results obtained by various methods on a selection of four data sets in Figure 4.
4.4 Numerical Analysis
In light of the results of [12], where alternating the optical flow (OF) estimation and super resolution was beneficial for a simplified and controlled data generation, we experimented with its application to our data model and more sophisticated regularization. However, as mentioned, applying the alternating procedure does not increase the video quality. The authors of [13] (who extend the model of [12] to include motion blur) report a similar behavior ([13], Figure 9).
We investigate this further by running our approach with samples from the Sintel MPI dataset [4], which contains ground truth OF and several levels of realism denoted by ’albedo’, ’clean’ and ’final’, respectively. We compared PSNR values by running our method with estimated optical flow and running our method with the ground truth OF for all three realism settings, cf. Table 2.
Table 2 (columns): Rendering | GT Flow | our OF
Interestingly, we do not profit from the ground truth OF on realistic data. Our super resolution warping operator penalizes changes in the brightness of the current pixel to the corresponding pixels in neighboring frames, i.e. brightness constancy as does our OF. It turns out that the estimated OF yields matchings that are well suited for super resolution despite not being the physically correct ones.
In light of this discussion, the effectiveness of an alternating scheme is questionable. Even if the repeated OF computations converged to the GT OF, performance would not necessarily improve. The performance can only improve if the new OF yields a refined pixel matching. Ma et al. [13] report for the case of heavy motion blur that recognizing and eliminating particularly blurry frames can refine their matchings in an alternating minimization. However, it remains unclear how this translates into a generalized strategy when all frames are equally low on details.
We have proposed a variational super resolution technique based on a multiframe motion coupling of the unknown high resolution frames. The latter enforces temporal consistency of the super resolved video directly and requires only as optical flow estimations for frames. By combining spatial regularity and temporal information with an infimal convolution and estimating their relative weight automatically, our method adapts the strength of spatial and temporal smoothing autonomously without a change of parameters. We provided an extensive numerical comparison which demonstrates that the proposed method outperforms interpolation approaches as well as competing variational super resolution methods, while being competitive to state-of-the-art learning approaches. For small motions or sufficiently high frame rate, our results are temporally consistent and avoid flickering effects.
J.G. and M.M. acknowledge the support of the German Research Foundation (DFG) via the research training group GRK 1564 Imaging New Modalities. D.C. was partially funded by the ERC Consolidator grant 3D Reloaded.
[1] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
[2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, pages 25–36. Springer, 2004.
[3] M. Burger, H. Dirks, and C.-B. Schönlieb. A variational model for joint motion estimation and image reconstruction. arXiv preprint arXiv:1607.03255, 2016.
[4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
[5] A. Chambolle and P.-L. Lions. Image recovery via total variation minimization and related problems. Numerische Mathematik, 76(2):167–188, 1997.
[6] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[7] H. Dirks. A flexible primal-dual toolbox. arXiv preprint, 2016. http://www.flexbox.im.
[8] M. Holler and K. Kunisch. On infimal convolution of TV-type functionals and applications to video and image reconstruction. SIAM Journal on Imaging Sciences, 7(4):2258–2300, 2014.
[9] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
[10] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR Oral), June 2016.
[11] R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia. Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 531–539, 2015.
[12] C. Liu and D. Sun. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):346–360, 2014.
[13] Z. Ma, R. Liao, X. Tao, L. Xu, J. Jia, and E. Wu. Handling motion blur in multi-frame super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5224–5232, 2015.
[14] A. Marquina and S. J. Osher. Image super-resolution by TV-regularization and Bregman iteration. Journal of Scientific Computing, 37(3):367–382, 2008.
[15] D. Mitzel, T. Pock, T. Schoenemann, and D. Cremers. Video super resolution using duality based TV-L1 optical flow. In Pattern Recognition, pages 432–441. Springer, 2009.
[16] T. Möllenhoff, E. Laude, M. Moeller, J. Lellmann, and D. Cremers. Sublabel-accurate relaxation of nonconvex energies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. https://github.com/tum-vision/prost.
[17] J. Mueller and S. Siltanen. Linear and Nonlinear Inverse Problems with Practical Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2012.
[18] K. Nasrollahi and T. B. Moeslund. Super-resolution: a comprehensive survey. Machine Vision and Applications, 25(6):1423–1468, 2014.
[19] T. Pock, D. Cremers, H. Bischof, and A. Chambolle. An algorithm for minimizing the Mumford-Shah functional. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1133–1140. IEEE, 2009.
[20] Infognition Co. Ltd. Videoenhancer 2 software, version 2.1.
[21] Xiph.org, redistributable Video Test Media Collection. https://media.xiph.org/video/derf/.
[22] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[23] Sony Corporation. Sony 4k uhd surfing screen test demo. CC-BY License.
[24] D. Sun, S. Roth, and M. J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
[25] M. Unger, T. Pock, M. Werlberger, and H. Bischof. A convex approach for variational super-resolution. In Pattern Recognition, pages 313–322. Springer, 2010.
[26] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[27] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers. An improved algorithm for TV-L1 optical flow. In Statistical and Geometrical Approaches to Visual Motion Analysis, pages 23–45. Springer, 2009.
[28] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[29] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, pages 214–223. Springer, 2007.
[30] Z. Zhang and V. Sze. Fast: Free adaptive super-resolution via transfer for compressed videos. Available on arXiv, https://arxiv.org/abs/1603.08968, 2016.