Aman Chadha et al.
Recently, learning-based models have enhanced the performance of single-image super-resolution (SISR). However, applying SISR successively to each video frame leads to a lack of temporal coherency. Convolutional neural networks (CNNs) outperform traditional approaches in terms of image quality metrics such as peak signal to noise ratio (PSNR) and structural similarity (SSIM). However, generative adversarial networks (GANs) offer a competitive advantage by being able to mitigate the lack of finer texture details usually seen with CNNs when super-resolving at large upscaling factors. We present iSeeBetter, a novel GAN-based spatio-temporal approach to video super-resolution (VSR) that renders temporally consistent super-resolution videos. iSeeBetter extracts spatial and temporal information from the current and neighboring frames using the concept of recurrent back-projection networks as its generator. Furthermore, to improve the “naturality” of the super-resolved image while eliminating artifacts seen with traditional algorithms, we utilize the discriminator from the super-resolution generative adversarial network (SRGAN). Although mean squared error (MSE) as a primary loss-minimization objective improves PSNR/SSIM, these metrics may not capture fine details in the image, resulting in misrepresentation of perceptual quality. To address this, we use a four-fold (MSE, perceptual, adversarial, and total-variation (TV)) loss function. Our results demonstrate that iSeeBetter offers superior VSR fidelity and surpasses state-of-the-art performance.
Keywords: super resolution; video upscaling; frame recurrence; optical flow; generative adversarial networks; convolutional neural networks
The goal of super-resolution (SR) is to enhance a low resolution (LR) image to a higher resolution (HR) image by filling in missing fine-grained details in the LR image. The domain of SR research can be divided into three main areas: single image SR (SISR) [6, 15, 17, 27], multi image SR (MISR) [9, 10] and video SR (VSR) [3, 43, 38, 16, 23].
Consider an LR video source consisting of a sequence of LR video frames $LR_{t-n}, \ldots, LR_t, \ldots, LR_{t+n}$, where we super-resolve a target frame $LR_t$. The idea behind SISR is to super-resolve $LR_t$ by utilizing spatial information inherent in the frame, independently of the other frames in the video sequence. However, this technique fails to exploit the temporal details inherent in a video sequence, resulting in temporal incoherence. MISR seeks to address just that: it utilizes the missing details available from the neighboring frames $LR_{t-n}, \ldots, LR_{t-1}, LR_{t+1}, \ldots, LR_{t+n}$ and fuses them to super-resolve $LR_t$. After spatially aligning the frames, missing details are extracted by separating differences between the aligned frames from details observed only in one or some of the frames. However, in MISR the alignment of the frames is done without any concern for temporal smoothness, in stark contrast to VSR, where the frames are typically aligned in temporally smooth order.
Traditional VSR methods upscale based on a single degradation model (usually bicubic interpolation) followed by reconstruction. This is sub-optimal and adds computational complexity. Recently, learning-based models that utilize convolutional neural networks (CNNs) have outperformed traditional approaches in terms of widely-accepted image reconstruction metrics such as peak signal to noise ratio (PSNR) and structural similarity (SSIM).
In some recent VSR methods that utilize CNNs, frames are concatenated or fed into recurrent neural networks (RNNs) in temporal order, without explicit alignment. In other methods, the frames are aligned explicitly, using motion cues between temporal frames with alignment modules [3, 33, 43, 38]. The latter set of methods generally renders temporally smoother results than the methods with no explicit spatial alignment [31, 20]. However, these VSR methods suffer from a number of problems. In the frame-concatenation approach [3, 33, 23], many frames are processed simultaneously in the network, resulting in significantly longer network training times. With methods that use RNNs [38, 43, 20], modeling both subtle and significant changes simultaneously (e.g., slow and quick motions of foreground objects) is challenging even when long short-term memory units (LSTMs), which are designed to maintain long-term temporal dependencies, are deployed. A crucial aspect of an effective VSR system is the ability to handle motion sequences, which are often integral components of videos [3, 34].
The proposed method, iSeeBetter, is inspired by recurrent back-projection networks (RBPNs), which utilize “back-projection” as their underpinning approach, originally introduced in [21, 22] for MISR. The basic concept behind back-projection is to iteratively compute residual images as the reconstruction error between a target image and a set of neighboring images. The residuals are then back-projected to the target image to improve super-resolution accuracy. The multiple residuals enable representation of both subtle and significant differences between the target frame and its adjacent frames, thus exploiting the temporal relationships between adjacent frames as shown in Fig. 1. Deep back-projection networks (DBPNs) use back-projection to perform SISR with learning-based methods by estimating the output frame $SR_t$ from the corresponding input frame $LR_t$. To this end, DBPN produces a high-resolution feature map that is iteratively refined through multiple up- and down-sampling layers. RBPN offers superior results by combining the benefits of the original MISR back-projection approach with DBPN. Specifically, RBPN adopts the idea of iteratively refining HR feature maps from DBPN, but extracts missing details using neighboring video frames like the original back-projection technique [21, 22]. This results in superior SR accuracy.
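Back-projection predates deep SR; the following is a minimal single-channel NumPy sketch of the classic iterative scheme. The box-filter downsampler and nearest-neighbor upsampler here are illustrative stand-ins for the true degradation and back-projection kernels, not the learned projection modules of DBPN/RBPN:

```python
import numpy as np

def downsample(img, s=2):
    """Box-filter downsampling by factor s (a stand-in for the true degradation)."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(img, s=2):
    """Nearest-neighbor upsampling by factor s (a simple back-projection kernel)."""
    return img.repeat(s, axis=0).repeat(s, axis=1)

def iterative_back_projection(lr, s=2, n_iters=20):
    """Classic IBP: refine the SR estimate by back-projecting the reconstruction
    error (the residual between the LR observation and the re-degraded estimate)."""
    sr = upsample(lr, s)                      # initial HR estimate
    for _ in range(n_iters):
        residual = lr - downsample(sr, s)     # reconstruction error in LR space
        sr = sr + upsample(residual, s)       # project the error back to HR space
    return sr

rng = np.random.default_rng(0)
hr = rng.random((8, 8))
lr = downsample(hr)
sr = iterative_back_projection(lr)
```

Because the toy upsampler is an exact right-inverse of the toy downsampler, the residual vanishes almost immediately here; with realistic degradation kernels, more iterations are needed.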
To mitigate the lack of finer texture details when super-resolving at large upscaling factors that is usually seen with CNNs, iSeeBetter utilizes GANs with a loss function that weighs adversarial loss, perceptual loss, mean squared error (MSE)-based loss, and total-variation (TV) loss. Our approach combines the merits of RBPN and SRGAN: it is based on RBPN as its generator and is complemented by SRGAN’s discriminator architecture, which is trained to differentiate between super-resolved images and original photo-realistic images. Blending these techniques yields iSeeBetter, a state-of-the-art system that is able to recover precise photo-realistic textures and motion-based scenes from heavily down-sampled videos.
Our contributions include the following key innovations.
Combining the state-of-the-art in SR: We propose a model that leverages two superior SR techniques: (i) RBPN, which is based on the idea of integrating SISR and MISR in a unified VSR framework using back-projection, and (ii) SRGAN, which is a framework capable of inferring photo-realistic natural images. RBPN enables iSeeBetter to extract details from neighboring frames, complemented by the generator-discriminator architecture in GANs, which pushes iSeeBetter to generate more realistic and appealing frames while eliminating artifacts seen with traditional algorithms. iSeeBetter thus yields more than the sum of the benefits of RBPN and SRGAN.
“Optimizing” the loss function: Pixel-wise loss functions such as the L1 loss used in RBPN struggle to handle the uncertainty inherent in recovering lost high-frequency details such as the complex textures that commonly exist in many videos. Minimizing MSE encourages finding pixel-wise averages of plausible solutions that are typically overly smooth and thus have poor perceptual quality [36, 24, 7, 2]. To address this, we adopt a four-fold (MSE, perceptual, adversarial, and TV) loss function for superior results. Similar to SRGAN, we utilize a loss function that optimizes perceptual quality by minimizing both adversarial loss and MSE loss. Adversarial loss uses the discriminator to improve the “naturality” of the output image, while perceptual loss optimizes similarity in feature space instead of similarity in pixel space. Furthermore, we use a de-noising loss function called TV loss. We carried out experiments comparing the L1 loss with our four-fold loss and found significant improvements with the latter (cf. Section 4).
Extended evaluation protocol: To evaluate iSeeBetter, we used standard datasets: Vimeo90K, Vid4, and SPMCS. Since Vid4 and SPMCS lack significant motion sequences, we included Vimeo90K, a dataset containing various types of motion. This enabled us to conduct a more holistic evaluation of the strengths and weaknesses of iSeeBetter. To make iSeeBetter more robust and enable it to handle real-world videos, we expanded the spectrum of data diversity and wrote scripts to collect additional data from YouTube. As a result, we augmented our dataset to about 170,000 clips.
User-friendly infrastructure: We built several useful tools to download and structure datasets, visualize temporal profiles of intermediate blocks and the output, and run predefined benchmark sequences on a trained model to be able to iterate on different models quickly. In addition, we built a video-to-frames tool to directly input videos to iSeeBetter, rather than frames. We also ensured our script infrastructure is flexible (such that it supports a myriad of options) and can be easily leveraged. The code and pre-trained models are available at https://iseebetter.amanchadha.com.
2 Related work
Since the seminal work by Tsai on image registration, many SR techniques based on various underlying principles have been proposed. Initial methods included spatial or frequency domain signal processing, statistical models, and interpolation approaches. In this section, we focus our discussion on learning-based methods, which have emerged as superior VSR techniques compared to traditional statistical methods.
2.1 Deep SISR
First introduced by SRCNN, deep SISR required a predefined up-sampling operator. Further improvements in this field include better up-sampling layers, residual learning, back-projection, recursive layers, and progressive up-sampling. A significant milestone in SR research was the introduction of a GAN-powered SR approach, which achieved state-of-the-art performance.
2.2 Deep VSR
Deep VSR can be divided into five types based on the approach to preserving temporal information.
(a) Temporal Concatenation. The most popular approach to retain temporal information in VSR is concatenating multiple frames [26, 3, 23, 31]. This approach can be seen as an extension of SISR to accept multiple input images. However, this approach fails to represent multiple motion regimes within a single input sequence since the input frames are simply concatenated together.
(b) Temporal Aggregation. To address the dynamic motion problem in VSR, a temporal aggregation method was proposed that runs multiple SR inferences, each working on a different motion regime. The final layer aggregates the outputs of all branches to construct the SR frame. However, this approach still concatenates many input frames, resulting in lengthy convergence during global optimization.
(c) Recurrent Networks. RNNs deal with temporal inputs and/or outputs and have been deployed in a myriad of applications ranging from video captioning [25, 35, 50] and video summarization [5, 45] to VSR [43, 20, 38]. Two types of RNNs have been used for VSR. A many-to-one architecture is used in [20, 43], where a sequence of LR frames is mapped to a single target HR frame. A many-to-many RNN has recently been used, in which an optical flow network accepts the previous and current LR frames and its output is fed to an SR network along with the current LR frame. This approach was first proposed using bidirectional RNNs; however, that network has a small capacity and no frame alignment step. Further improvement has been proposed using a motion compensation module and a ConvLSTM layer.
(d) Optical Flow-Based Methods. The above methods estimate a single HR frame by combining a batch of LR frames and are thus computationally expensive. They often produce unwanted flickering artifacts in the output frames. To address this, a method was proposed that utilizes a network trained to estimate the optical flow alongside the SR network. Optical flow methods allow estimation of the trajectories of moving objects, thereby assisting in VSR. One approach warps the previous and next video frames onto the current frame using optical flow, concatenates the three frames, and passes them through a CNN that produces the output frame. A related approach follows the same pipeline but replaces the optical flow model with a trainable motion compensation network.
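The warping step used by such flow-based methods can be sketched as follows. This is a hypothetical nearest-neighbor backward warp for illustration; the actual systems use differentiable bilinear sampling inside the network:

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a frame toward the target using a dense flow field.
    flow[y, x] = (dx, dy) points from each target pixel to its source location.
    Nearest-neighbor sampling with border clamping is used here for simplicity."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

frame = np.arange(16.0).reshape(4, 4)
identity = warp(frame, np.zeros((4, 4, 2)))  # zero flow leaves the frame unchanged
```

With the estimated flow fields, adjacent frames are warped onto the target frame before being concatenated and fed to the SR network.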
(e) Pre-Training then Fine-Tuning vs. End-to-End Training. While most of the above-mentioned methods are end-to-end trainable, certain approaches first pre-train each component before fine-tuning the system as a whole in a final step [3, 43, 33].
Our approach is a combination of (i) an RNN-based optical flow method that preserves spatio-temporal information in the current and adjacent frames as the generator, and (ii) a discriminator that is adept at ensuring the generated SR frame offers superior fidelity.
3.1 Datasets
To train iSeeBetter, we amalgamated diverse datasets with differing video lengths, resolutions, motion sequences, and numbers of clips. Tab. 1 presents a summary of the datasets used. When training our model, we generated the corresponding LR frame for each HR input frame by performing 4× down-sampling using bicubic interpolation. We thus perform self-supervised learning by automatically generating the input-output pairs for training without any human intervention. To further extend our dataset, we wrote scripts to collect additional data from YouTube. The dataset was shuffled for training and testing. Our training/validation/test split was 80%/10%/10%.
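A minimal sketch of this self-supervised pairing follows, with block averaging standing in for the bicubic kernel (the paper uses bicubic interpolation, e.g. via PIL's `Image.resize(..., Image.BICUBIC)`):

```python
import numpy as np

def degrade(hr, scale=4):
    """Stand-in degradation: average each scale x scale patch.
    The paper instead derives LR_t from HR_t by 4x bicubic down-sampling."""
    h, w, c = hr.shape
    return hr[:h - h % scale, :w - w % scale].reshape(
        h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))

def make_pairs(hr_frames, scale=4):
    """Build (LR_t, HR_t) training pairs with no human labeling."""
    return [(degrade(f, scale), f) for f in hr_frames]

frames = [np.random.rand(64, 64, 3) for _ in range(3)]
pairs = make_pairs(frames)
```

Because the LR inputs are derived deterministically from the HR targets, arbitrarily large training sets (such as the additional YouTube clips) can be paired automatically.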
3.2 Network architecture
Fig. 2 shows the iSeeBetter architecture, which consists of RBPN and SRGAN as its generator and discriminator, respectively. Tab. 2 shows our notational convention. RBPN has two approaches that extract missing details from different sources: SISR and MISR. Fig. 3 shows the horizontal flow (represented by blue arrows in Fig. 2), which enlarges $LR_t$ using SISR. Fig. 4 shows the vertical flow (represented by red arrows in Fig. 2), which is based on MISR and computes residual features from (i) pairs of $LR_t$ and its neighboring frames ($LR_{t-1}, \ldots, LR_{t-n}$), coupled with (ii) the pre-computed dense motion flow maps ($F_{t-1}, \ldots, F_{t-n}$).
|$HR_t$||input high resolution image|
|$LR_t$||low resolution image (derived from $HR_t$)|
|$F_t$||optical flow output|
|$H_t$||residual features extracted from ($LR_t$, $LR_{t-k}$, $F_{t-k}$)|
|$SR_t$||estimated HR output|
At each projection step, RBPN observes the missing details in $LR_t$ and extracts residual features from neighboring frames to recover those details. The convolutional layers that feed the projection modules in Fig. 2 thus serve as initial feature extractors. Within the projection modules, RBPN utilizes a recurrent encoder-decoder mechanism to fuse the details extracted from adjacent frames in SISR and MISR, and incorporates them into the estimated frame through back-projection. The convolutional layer that operates on the concatenated output from all the projection modules is responsible for generating $SR_t$. Once $SR_t$ is synthesized, it is sent to the discriminator (shown in Fig. 5) to validate its “authenticity”.
3.3 Loss functions
The perceptual image quality of the resulting SR image depends on the choice of the loss function. To evaluate the quality of an image, MSE is the most commonly used loss function in a wide variety of state-of-the-art SR approaches, and aims to improve the PSNR of an image. While optimizing MSE during training improves PSNR and SSIM, these metrics may not capture fine details in the image, leading to misrepresentation of perceptual quality. The ability of MSE to capture intricate texture details based on pixel-wise frame differences is very limited, and can cause the resulting video frames to be overly smooth. In a series of experiments, it was found that even manually distorted images had an MSE score comparable to that of the original image. To address this, iSeeBetter uses a four-fold (MSE, perceptual, adversarial, and TV) loss instead of relying solely on pixel-wise MSE loss. We weigh these losses together as the final evaluation standard for training iSeeBetter, thus taking into account both pixel-wise similarities and high-level features. Fig. 6 shows the individual components of the iSeeBetter loss function.
3.3.1 MSE loss
We use a pixel-wise MSE loss (also called content loss) for the estimated frame $SR_t$ against the ground truth $HR_t$:

$$\mathcal{L}_{MSE} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( HR_t(x, y) - SR_t(x, y) \right)^2$$

where $SR_t = G(LR_t)$ is the estimated frame, and $W$ and $H$ represent the width and height of the frames, respectively.
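As a NumPy sketch (the training code computes this on GPU tensors):

```python
import numpy as np

def mse_loss(sr, hr):
    """Pixel-wise MSE between the estimated frame SR_t and ground truth HR_t."""
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    return float(np.mean((hr - sr) ** 2))
```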
3.3.2 Perceptual loss
[11, 2] introduced a new loss function called perceptual loss, also used in [24, 30], which focuses on perceptual similarity instead of similarity in pixel space. Perceptual loss relies on features extracted from the activation layers of a pre-trained VGG-19 network, instead of low-level pixel-wise error measures. We define perceptual loss as the Euclidean distance between the feature representations of the estimated SR image $SR_t$ and the ground truth $HR_t$:

$$\mathcal{L}_{P} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(HR_t)(x, y) - \phi_{i,j}(SR_t)(x, y) \right)^2$$

where $\phi_{i,j}$ denotes the feature map obtained by the $j^{\text{th}}$ convolution (after activation) before the $i^{\text{th}}$ maxpooling layer in the VGG-19 network, and $W_{i,j}$ and $H_{i,j}$ are the dimensions of the respective feature maps in the VGG-19 network.
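Conceptually, perceptual loss is just an MSE computed in feature space. The sketch below uses a toy gradient operator as a stand-in for the VGG-19 feature extractor phi:

```python
import numpy as np

def perceptual_loss(sr, hr, phi):
    """MSE between feature representations phi(HR_t) and phi(SR_t).
    In iSeeBetter, phi is a fixed pre-trained VGG-19 activation layer;
    any callable feature extractor works in this sketch."""
    f_hr, f_sr = phi(hr), phi(sr)
    return float(np.mean((f_hr - f_sr) ** 2))

def toy_phi(img):
    """Toy stand-in feature map: horizontal gradients (NOT VGG-19)."""
    return np.diff(np.asarray(img, dtype=np.float64), axis=1)
```

Because the comparison happens in feature space, two frames can differ pixel-wise yet still incur a small perceptual loss if their textures match.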
3.3.3 Adversarial loss
We use the generative component of iSeeBetter as the adversarial loss to limit model “fantasy”, thus improving the “naturality” associated with the super-resolved image. Adversarial loss is defined as:

$$\mathcal{L}_{Adv} = -\log D(G(LR_t))$$

where $D(G(LR_t))$ is the discriminator’s output probability that the reconstructed image $G(LR_t)$ is a real HR image. We minimize $-\log D(G(LR_t))$ instead of $\log \left[ 1 - D(G(LR_t)) \right]$ for better gradient behavior.
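A sketch of this generator-side term over a batch of discriminator output probabilities:

```python
import math

def adversarial_loss(d_probs, eps=1e-12):
    """Sum of -log D(G(LR_t)) over a batch of discriminator probabilities.
    The non-saturating -log D(G(LR)) form gives stronger gradients early in
    training than minimizing log(1 - D(G(LR)))."""
    return sum(-math.log(p + eps) for p in d_probs)
```

When the discriminator is fooled (probabilities near 1) the loss approaches zero; confident rejections (probabilities near 0) are penalized heavily.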
3.3.4 Total-Variation loss
TV loss was introduced as a loss function in the domain of SR in earlier work. It is defined as the sum of the absolute differences between neighboring pixels in the horizontal and vertical directions. Since TV loss measures noise in the input, minimizing it as part of our overall loss objective helps de-noise the output SR image and thus encourages spatial smoothness. TV loss is defined as follows:

$$\mathcal{L}_{TV} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( \left| SR_t(x+1, y) - SR_t(x, y) \right| + \left| SR_t(x, y+1) - SR_t(x, y) \right| \right)$$
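A sketch of this anisotropic TV term for a single-channel frame:

```python
import numpy as np

def tv_loss(img):
    """Mean absolute difference between horizontally and vertically neighboring
    pixels; small for spatially smooth images, large for noisy ones."""
    img = np.asarray(img, dtype=np.float64)
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbor differences
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbor differences
    return (dh + dv) / img.size
```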
3.3.5 Loss formulation
We define our overall loss objective for each frame as the weighted sum of the MSE, adversarial, perceptual, and TV loss components:

$$\mathcal{L}_{G} = \alpha \mathcal{L}_{MSE} + \beta \mathcal{L}_{Adv} + \gamma \mathcal{L}_{P} + \delta \mathcal{L}_{TV}$$

where $\alpha$, $\beta$, $\gamma$, and $\delta$ are weights set as $1$, $6 \times 10^{-3}$, $10^{-3}$, and $2 \times 10^{-8}$, respectively.
The discriminator loss for each frame is as follows:

$$\mathcal{L}_{D} = 1 - D(HR_t) + D(G(LR_t))$$

The total loss of an input sample is the average loss over all its frames.
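Putting the pieces together, a sketch of the per-frame weighted generator objective and the per-sample average; the default weights below are illustrative, not authoritative:

```python
def generator_loss(l_mse, l_adv, l_p, l_tv,
                   alpha=1.0, beta=6e-3, gamma=1e-3, delta=2e-8):
    """Weighted four-fold objective for one frame (illustrative default weights,
    ordered as MSE, adversarial, perceptual, TV)."""
    return alpha * l_mse + beta * l_adv + gamma * l_p + delta * l_tv

def sample_loss(per_frame_losses):
    """Total loss of an input sample: the average loss over its frames."""
    return sum(per_frame_losses) / len(per_frame_losses)
```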
4 Experimental evaluation
To train the model, we used an Amazon EC2 p3.2xlarge instance with an NVIDIA Tesla V100 GPU (16 GB VRAM), 8 vCPUs, and 64 GB of host memory. We used the hyperparameters from RBPN and SRGAN. Tab. 3 compares iSeeBetter with six state-of-the-art VSR algorithms: DBPN, B + T, DRDVSR, FRVSR, VSR-DUF, and RBPN/6-PF. Tab. 4 offers a visual analysis of VSR-DUF and iSeeBetter. Tab. 5 shows ablation studies assessing the impact of the generator-discriminator architecture and the four-fold loss as design decisions.
[Tab. 4: per-clip visual comparison, with columns Dataset, Clip Name, VSR-DUF, iSeeBetter, and Ground Truth.]
5 Conclusions and future work
We proposed iSeeBetter, a novel spatio-temporal approach to VSR that uses recurrent-generative back-projection networks. iSeeBetter couples the virtues of RBPN and SRGAN. RBPN enables iSeeBetter to generate superior SR images by combining spatial and temporal information from the input and neighboring frames. In addition, SRGAN’s discriminator architecture fosters generation of photo-realistic frames. We used a four-fold loss function that emphasizes perceptual quality. Furthermore, we proposed a new evaluation protocol for video SR by collating diverse datasets. With extensive experiments, we assessed the role played by various design choices in the ultimate performance of iSeeBetter, and demonstrated that on a vast majority of test video sequences, iSeeBetter advances the state-of-the-art.
To improve iSeeBetter, a couple of ideas could be explored. In visual imagery, the foreground receives much more attention than the background since it typically includes subjects such as humans. To improve perceptual quality, we could segment the foreground and background, and make iSeeBetter perform “adaptive VSR” by utilizing different policies for the foreground and background. For instance, we could extract details from a wider span of frames for the foreground than for the background. Another idea is to decompose a video sequence into scenes on the basis of frame similarity and make iSeeBetter assign weights to adjacent frames based on the scene they belong to. Adjacent frames from a different scene can be weighed lower than frames from the same scene, thereby making iSeeBetter focus on extracting details from frames within the same scene: à la the concept of attention applied to VSR.

Acknowledgments. The authors would like to thank Andrew Ng’s lab at Stanford University for their guidance on this project. In particular, the authors express their gratitude to Mohamed El-Geish for the idea-inducing brainstorming sessions throughout the project.
-  (2005) Image up-sampling using total-variation regularization with a new observation model. IEEE Transactions on Image Processing 14 (10), pp. 1647–1659. Cited by: §1, §3.3.4.
-  (2016-01-01) Super-resolution with deep convolutional sufficient statistics. (English (US)). Note: 4th International Conference on Learning Representations (ICLR) 2016 Cited by: §1, §3.3.2.
-  (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787. Cited by: §1, §1, §2.2, §2.2, §2.2.
-  (2012) Fast video super-resolution using artificial neural networks. In 2012 8th International Symposium on Communication Systems, Networks & Digital Signal Processing (CSNDSP), pp. 1–4. Cited by: §3.3.
-  (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634. Cited by: §2.2.
-  (2015) Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38 (2), pp. 295–307. Cited by: §1, §2.1.
-  (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in neural information processing systems, pp. 658–666. Cited by: §1.
-  (2011) Total variation regularization of local-global optical flow. In 2011 14th International IEEE Conference on Intelligent Transportation Systems, pp. 318–323. Cited by: §2.2.
-  (2013) Unified blind method for multi-image super-resolution and single/multi-image blur deconvolution. IEEE Transactions on Image Processing 22 (6), pp. 2101–2114. Cited by: §1.
-  (2012) Super resolution for multiview images using depth information. IEEE Transactions on Circuits and Systems for Video Technology 22 (9), pp. 1249–1256. Cited by: §1.
-  (2015) Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pp. 262–270. Cited by: §3.3.2.
-  (1999) Learning to forget: continual prediction with lstm. Cited by: §1.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.3.3.
-  (2019) Hands-on generative adversarial networks with pytorch 1. x: implement next-generation neural networks to build powerful gan models using python. Packt Publishing Ltd. Cited by: §3.3.5.
-  (2018) Deep back-projection networks for super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1664–1673. Cited by: §1, §1, §2.1, Figure 3, Table 3, §4.
-  (2019) Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3897–3906. Cited by: §1, §1, §1, §3.2, Table 3, §4.
-  (2017) Inception learning super-resolution. Applied optics 56 (22), pp. 6043–6048. Cited by: §1.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: Figure 3, Figure 4.
-  (2010) Image quality metrics: psnr vs. ssim. In 2010 20th International Conference on Pattern Recognition, pp. 2366–2369. Cited by: §3.3.
-  (2015) Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems, pp. 235–243. Cited by: §1, §2.2.
-  (1991) Improving resolution by image registration. CVGIP: Graphical models and image processing 53 (3), pp. 231–239. Cited by: §1.
-  (1993) Motion analysis for image enhancement: resolution, occlusion, and transparency. Journal of Visual Communication and Image Representation 4 (4), pp. 324–335. Cited by: §1.
-  (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3224–3232. Cited by: §1, §1, §2.2, Table 3, Table 4, §4.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §1, §3.3.2.
-  (2016) Densecap: fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574. Cited by: §2.2.
-  (2016) Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2 (2), pp. 109–122. Cited by: §2.2, §2.2.
-  (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. Cited by: §1.
-  (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1637–1645. Cited by: §2.1.
-  (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §2.1.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §1, §1, §2.1, Figure 5, §3.2, §3.3.1, §3.3.2, §3.3.
-  (2015) Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 531–539. Cited by: §1, §2.2.
-  (2011) A bayesian approach to adaptive video super resolution. In CVPR 2011, pp. 209–216. Cited by: §1.
-  (2017) Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2507–2515. Cited by: §1, §2.2, §2.2, Table 3, §4.
-  (2017) End-to-end learning of video super-resolution with motion compensation. In German conference on pattern recognition, pp. 203–214. Cited by: §1.
-  (2014) Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632. Cited by: §2.2.
-  (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §1.
-  (2018) Recurrent back-projection network for video super-resolution. Final Project for MIT 6.819 Advances in Computer Vision. Cited by: §1, §2.2.
-  (2018) Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6626–6634. Cited by: §1, §1, §2.2, §2.2, Table 3, §4.
-  (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: §1, §2.1.
-  (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 1–9. Cited by: §2.2.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3.2.
-  (2017) Image super-resolution via deep recursive residual network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3147–3155. Cited by: §2.1.
-  (2017) Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4472–4480. Cited by: §1, §1, §1, §2.2, §2.2, Table 3, §4.
-  (1984) Multiframe image restoration and registration. Advance Computer Visual and Image Processing 1, pp. 317–339. Cited by: §2.
-  (2015) Translating videos to natural language using deep recurrent neural networks. pp. 1494–1504. Cited by: §2.2.
-  (2020) Deep learning for image super-resolution: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §3.3.4.
-  (2002) A universal image quality index. IEEE signal processing letters 9 (3), pp. 81–84. Cited by: §3.3.
-  (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision 127 (8), pp. 1106–1125. Cited by: §1.
-  (2010) Image super-resolution: historical overview and future challenges. Super-resolution imaging, pp. 20–34. Cited by: §2.
-  (2016) Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4584–4593. Cited by: §2.2.
Aman Chadha has held positions at some of the world’s leading semiconductor/product companies. He is based in Cupertino (Silicon Valley), California, and is currently pursuing graduate studies in Artificial Intelligence at Stanford University. He has published in prestigious international journals and conferences, and has authored two books. His publications have garnered about 200 citations. He serves on the editorial boards of several international journals including IJATCA, IJLTET, IJCET, IJEACS and IJRTER. He has served as a reviewer for IJEST, IJCST, IJCSEIT and JESTEC. Aman graduated with an M.S. from the University of Wisconsin-Madison with an outstanding graduate student award in 2014 and a B.E. with distinction from the University of Mumbai in 2012. His research interests include Computer Vision (particularly, Pattern Recognition), Artificial Intelligence, Machine Learning and Computer Architecture. Aman has 18 publications to his credit.
John Britto is pursuing his M.S. in Computer Science at the University of Massachusetts, Amherst. He completed his B.E. in Computer Engineering at the University of Mumbai in 2018. His research interests lie in Machine Learning, Natural Language Processing, and Artificial Intelligence.
M. Mani Roja has been a full professor in the Electronics and Telecommunication Department at the University of Mumbai for the past 30 years. She received her Ph.D. in Electronics and Telecommunication Engineering from Sant Gadge Baba Amravati University and her Master’s in Electronics and Telecommunication Engineering from the University of Mumbai. She has collaborated over the years in the fields of Image Processing, Speech Processing, and Biometric Recognition.