Video coding is a critical step in all popular methods of streaming video. Marked progress has been made in video quality, compression, and computational efficiency. Recently, there has been interest in applying techniques from the fast-progressing field of Machine Learning to further improve video coding.

We present a method that uses convolutional neural networks to help refine the output of various standard coding methods. The novelty of our approach is to train multiple different sets of network parameters, with each set corresponding to a specific, short segment of video. The array of network parameter sets expands dynamically to match a video of any length. We show that our method can improve the quality and compression efficiency of standard video codecs.



Dynamically Expanded CNN Array for Video Coding


Everett Fall  Kai-Wei Chang  Liang-Gee Chen


In recent years there have been many advances in video coding standards, and several prominent video codecs are used commercially, such as H.264/AVC (Schwarz et al., 2007), H.265/HEVC (Sullivan et al., 2012) and VP9. However, most of these traditional coding algorithms are block-based and therefore suffer from block artifacts. They are also largely hand-designed, making joint optimization difficult.

In Machine Learning (ML), the past decade has yielded vast improvements and altogether new techniques for processing images and video, such as convolutional neural networks (CNNs). CNNs offer an intuitive theoretical approach to encoding or embedding information as vectors that can be interpreted as high-level features. Despite the success of CNNs in many ML tasks, they have seen very limited success as a direct replacement for traditional commercial video codecs, partly because training ML models is computationally expensive and partly because existing methods such as H.265 or VP9 are already highly optimized, which makes improving on benchmarks a challenging barrier to entry.

In this work we take a different approach. Instead of directly replacing traditional codecs by using CNNs to learn an end-to-end model, we apply ML as a post-processing step to refine the decoded output. The novelty of our approach is to train a dynamically expanding array of many small CNNs, allowing each network to specialize in refining a relatively short segment of video. This refiner module switches between different networks as the video is decoded.

We evaluate our method on several commonly used commercial codecs and show that the refiner network makes a substantial improvement to quality with only a minimal increase in code size.


Several deep learning-integrated video compression methods have been proposed to improve traditional coding. One approach is to replace or enhance individual modules of a traditional codec, especially the state-of-the-art HEVC codec. Examples include improving motion compensation and inter-prediction (Huo et al., 2018; Yan et al., 2019; Zhao et al., 2018), improving intra-prediction (Song et al., 2017; Cui et al., 2018; J. Pfaff, 2018), and replacing the in-loop filter (Park & Kim, 2016; Kang et al., 2017; Zhang et al., 2018; Jia et al., 2019).

Another approach is to apply ML as a post-processing step to improve video quality. For example, (Li et al., 2017) proposed a dynamic metadata post-processing scheme based on a CNN, while (Yang et al., 2017) and (Wang et al., 2017) proposed the Decoder-side Scalable Convolutional Neural Network (DS-CNN) and the Deep CNN-based Auto Decoder (DCAD), respectively, for video quality and efficiency enhancement.

Instead of using ML to form a hybrid video coding framework, some works propose an end-to-end framework for video compression, the performance of which can be on par with commercial codecs. (Wu et al., 2018) developed an end-to-end deep video codec that relies on repeated image interpolation in a hierarchical manner. Inheriting the conventional video coding structure, (Lu et al., 2018) employed multiple neural networks to constitute different modules, which can be jointly optimized through a single loss function. (Rippel et al., 2018) proposed a learned end-to-end model for low-latency mode with spatial rate control.


We denote a video as a sequence of frames $V = (x_1, \dots, x_T)$, where each $x_t$ is an image with width $W$ and height $H$, with 3 color channels for each pixel. A video coding scheme provides two algorithms: an encoder $E$ that transforms the video into a code, $c = E(V)$, and a decoder $D$ that converts an encoded video back into a sequence of images, $D(c)$. Let $\hat{V} = (\hat{x}_1, \dots, \hat{x}_T)$ denote the sequence of frames generated by encoding and then decoding a video: $\hat{V} = D(E(V))$. The goal of video coding is to minimize the size of the encoded video $E(V)$ and to design $E$ and $D$ to be as computationally efficient as possible. In the case of lossy coding, there is a pixel-wise error, known as the residual, $r_t = \hat{x}_t - x_t$, associated with each decoded frame $\hat{x}_t$, which it is also desirable to minimize.
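As a concrete illustration of the encode, decode, and residual relationship described above, the following minimal sketch uses uniform quantization as a stand-in for a lossy codec. This is an assumption made purely for illustration (real codecs such as H.264 are far more elaborate), and the names `encode`, `decode`, and `STEP`, as well as the sign convention of the residual (decoded minus original), are hypothetical choices rather than details taken from our implementation.

```python
# Toy stand-in for a lossy coding scheme: uniform quantization of pixel values.
# This is NOT how H.264/HEVC work; it only illustrates encode -> decode -> residual.
# Frames here are tiny and single-channel to keep the example short.

STEP = 16  # hypothetical quantization step; larger means smaller code, larger error


def encode(frame):
    """Map each 8-bit pixel value to a coarse quantization index (the 'code')."""
    return [[p // STEP for p in row] for row in frame]


def decode(code):
    """Reconstruct each pixel as the midpoint of its quantization bin."""
    return [[q * STEP + STEP // 2 for q in row] for row in code]


def residual(decoded, original):
    """Pixel-wise error introduced by coding (assumed sign: decoded minus original)."""
    return [[d - o for d, o in zip(drow, orow)]
            for drow, orow in zip(decoded, original)]


def mse(res):
    """Mean squared error of a residual image."""
    values = [v for row in res for v in row]
    return sum(v * v for v in values) / len(values)


frame = [[0, 15, 16, 100], [200, 255, 31, 32]]
decoded = decode(encode(frame))
res = residual(decoded, frame)
print(decoded)
print(res, mse(res))
```

A refiner in the sense of this paper would take `decoded` as input and try to produce something closer to `frame`, that is, to shrink `mse(res)`.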

Our method is designed to complement an existing coding scheme, reducing the error in the decoded video $\hat{V}$ by adding a refining function $R$, which yields the refined video $\tilde{V}$ when applied to each frame of $\hat{V}$, as shown in Fig. 1. The existing coding scheme could be a standard commercial codec or a custom end-to-end CNN.

Let $\hat{V}_{a:b}$ denote the contiguous segment of $\hat{V}$ spanning frames $a$ through $b$. The novelty of our method is to use a CNN to implement the refining function $R$ with parameters $\theta$ that vary with time. Specifically, we partition $\hat{V}$ into many small segments of duration $d$, $\hat{V}_{kd+1:(k+1)d}$ for $k = 0, 1, 2, \dots$, and for each segment a corresponding set of network parameters is learned for $R$, which we denote as $\theta_k$. Intuitively, this can be thought of as an array of CNNs that expands dynamically as needed to refine a video of any length.
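The bookkeeping behind such a dynamically expanding array can be sketched as follows. This is a minimal illustration under assumed names (`RefinerArray`, `params_for`, and the placeholder parameter dictionary are hypothetical), with 0-based frame indices and parameter sets created lazily the first time a frame from a new segment is seen.

```python
class RefinerArray:
    """Array of per-segment parameter sets that grows on demand."""

    def __init__(self, segment_length, init_params):
        self.segment_length = segment_length  # frames per segment
        self.init_params = init_params        # factory for a fresh parameter set
        self.params = []                      # params[k] belongs to segment k

    def segment_index(self, frame_index):
        """Map a 0-based frame index to the index of its segment."""
        return frame_index // self.segment_length

    def params_for(self, frame_index):
        """Return the parameter set for this frame, expanding the array if needed."""
        k = self.segment_index(frame_index)
        while len(self.params) <= k:
            self.params.append(self.init_params())
        return self.params[k]


# Placeholder parameter sets; a real refiner would hold CNN weights here.
array = RefinerArray(segment_length=50, init_params=lambda: {"weights": []})
array.params_for(0)
array.params_for(49)    # same segment as frame 0
array.params_for(120)   # forces the array to grow to three parameter sets
print(len(array.params))
```

At decode time the same lookup selects which network refines the current frame, so the array never needs to know the video length in advance.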

Learning is accomplished through standard stochastic gradient descent. A training example for $\theta_k$ consists of a pair $(\hat{x}_t, x_t)$, where $t$ is sampled from the range $kd < t \le (k+1)d$. Intuitively, the network learns to predict $x_t$ from $\hat{x}_t$. Alternatively, the refining function can learn to predict the residual $r_t = \hat{x}_t - x_t$ (the prediction is denoted $\hat{r}_t$), in which case refining involves removing the predicted residual from the signal. In this case the training example consists of $(\hat{x}_t, r_t)$, and the output is obtained by $\tilde{x}_t = \hat{x}_t - \hat{r}_t$.
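The residual-prediction variant of this training procedure can be sketched on a toy problem. The example below is only a sketch under strong assumptions: a two-parameter affine model stands in for the CNN, frames are short flat lists of values in [0, 1], the loss is mean squared error, and the distortions, `SEGMENT_LENGTH`, step count, and learning rate are all hypothetical. It shows one parameter set being fitted per segment with plain SGD.

```python
import random

random.seed(0)
SEGMENT_LENGTH = 4   # frames per segment (hypothetical; kept tiny for the demo)
PIXELS = 16          # pixels per flattened frame


def make_segment(gain, offset):
    """Create (decoded, original) frame pairs with a segment-specific distortion."""
    pairs = []
    for _ in range(SEGMENT_LENGTH):
        original = [random.random() for _ in range(PIXELS)]
        decoded = [gain * p + offset for p in original]
        pairs.append((decoded, original))
    return pairs


def predict_residual(params, decoded):
    """Affine stand-in for the CNN: predicted residual per pixel."""
    a, b = params
    return [a * p + b for p in decoded]


def refine(params, decoded):
    """Residual mode: subtract the predicted residual from the decoded frame."""
    return [p - r for p, r in zip(decoded, predict_residual(params, decoded))]


def frame_loss(params, decoded, original):
    """Mean squared error between the refined frame and the original."""
    refined = refine(params, decoded)
    return sum((u - v) ** 2 for u, v in zip(refined, original)) / len(original)


def train_segment(pairs, steps=2000, lr=0.2):
    """Plain SGD: one randomly sampled frame of the segment per step."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        decoded, original = random.choice(pairs)
        grad_a = grad_b = 0.0
        for p, x in zip(decoded, original):
            err = (a * p + b) - (p - x)   # predicted minus true residual
            grad_a += 2 * err * p / len(decoded)
            grad_b += 2 * err / len(decoded)
        a, b = a - lr * grad_a, b - lr * grad_b
    return (a, b)


# Two segments with different distortions, each with its own parameter set.
segments = [make_segment(0.8, 0.1), make_segment(0.9, 0.0)]
params = [train_segment(pairs) for pairs in segments]

for k, pairs in enumerate(segments):
    before = sum(frame_loss((0.0, 0.0), d, x) for d, x in pairs) / len(pairs)
    after = sum(frame_loss(params[k], d, x) for d, x in pairs) / len(pairs)
    print(k, before, after)
```

Because each segment has its own distortion, the two learned parameter sets differ, which is the motivation for specializing a small network per segment rather than sharing one model across the whole video.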

Another commonly desired characteristic of an encoding scheme is for the code to have a localized temporal correspondence to the video. This allows a small segment of the video, known as a random access segment, to be decoded without requiring the entire code, which is useful for data transmission applications such as video streaming. Since each refiner segment is relatively short, our method can also support random access by transmitting the parameters $\theta_k$ in advance of the corresponding random access segment of the code.
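One possible stream layout that satisfies this requirement is sketched below. It assumes, for simplicity, that each random access segment of the code lines up with exactly one refiner segment; the tuple format and the function names `build_stream` and `random_access` are hypothetical and not part of any codec's bitstream syntax.

```python
def build_stream(segment_codes, segment_params):
    """Interleave per-segment parameters ahead of the matching code chunk."""
    stream = []
    for k, (params, code) in enumerate(zip(segment_params, segment_codes)):
        stream.append(("params", k, params))
        stream.append(("code", k, code))
    return stream


def random_access(stream, k):
    """Return only the stream items needed to decode and refine segment k."""
    return [item for item in stream if item[1] == k]


codes = ["code-0", "code-1", "code-2"]       # placeholder encoded chunks
params = ["theta-0", "theta-1", "theta-2"]   # placeholder parameter sets
stream = build_stream(codes, params)
print(random_access(stream, 1))
```

With this ordering a receiver that joins at segment 1 needs only that segment's parameters and code, and the parameters always arrive before the frames they refine.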

Figure 1: The video $V$ is encoded and decoded by some coding scheme to produce $\hat{V}$. The correct parameters $\theta_k$ for the refiner network are selected according to the current frame being processed. The refiner produces the refined frame $\tilde{x}_t$ from the decoded frame $\hat{x}_t$.

In this section we present the results of initial experiments as a proof of concept for our method. We implemented our proposed method and applied it to the benchmark dataset "Big Buck Bunny". We used the H.264 codec with a high CRF value (low quality and high compression ratio) to create the input to the refiner network. The refiner contains 4 convolutional layers with 5x5 filters followed by 3 convolutional layers with 3x3 filters and is applied to segments of $d = 50$ frames. The network is given approximately 500 training steps, which corresponds to 10 epochs of the (very small) dataset for each 50-frame segment. The before and after result for a typical frame is shown in Fig. 2. The quality of the refined image was MS-SSIM 0.9802 and PSNR 36.96, compared with MS-SSIM 0.9589 and PSNR 33.82 for the original H.264 output at CRF 36.
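To give a rough sense of how much data one parameter set adds to the code, the sketch below counts the weights of a network with the layer layout described above (4 layers of 5x5 filters followed by 3 layers of 3x3 filters). The channel width `HIDDEN` is not stated in the text and is purely a hypothetical value, as is the assumption of uncompressed 32-bit floats, so the resulting numbers are illustrative only and are not results from our experiments.

```python
IN_CHANNELS = 3    # RGB input frame
OUT_CHANNELS = 3   # RGB refined frame (or residual)
HIDDEN = 16        # hypothetical channel width; not given in the text


def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one 2D convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + c_out


# (kernel size, input channels, output channels) for each of the 7 layers
layers = (
    [(5, IN_CHANNELS, HIDDEN)] + [(5, HIDDEN, HIDDEN)] * 3
    + [(3, HIDDEN, HIDDEN)] * 2 + [(3, HIDDEN, OUT_CHANNELS)]
)

total = sum(conv_params(k, c_in, c_out) for k, c_in, c_out in layers)
bytes_per_segment = total * 4  # assuming uncompressed 32-bit floats
print(total, bytes_per_segment)
```

Changing `HIDDEN` shows how quickly the per-segment overhead grows with network width, which is the trade-off between refinement quality and added code size.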

Figure 2: Left: The refiner input (output of the H.264 codec using CRF 36). Middle: Output of the refiner. Right: Ground truth.

In this work we introduced a novel method for video coding that uses an array of CNNs to refine each frame. We described the algorithms used to segment the video, train a CNN for each segment, and automatically switch between networks during the decoding process. We implemented our proposed algorithm and conducted several experiments to evaluate its performance with standard benchmark coding schemes. Our method was able to provide a substantial improvement in quality and to reduce the compressed video size.


  • Cui et al. (2018) Cui, W., Zhang, T., Zhang, S., Jiang, F., Zuo, W., and Zhao, D. Convolutional neural networks based intra prediction for HEVC. CoRR, abs/1808.05734, 2018.
  • Huo et al. (2018) Huo, S., Liu, D., Wu, F., and Li, H. Convolutional neural network-based motion compensation refinement for video coding. In IEEE International Symposium on Circuits and Systems, ISCAS 2018, 27-30 May 2018, Florence, Italy, pp. 1–4, 2018. doi: 10.1109/ISCAS.2018.8351609.
  • J. Pfaff (2018) J. Pfaff, P. Helle, D. M. S. K. W. S. H. S. D. M. T. W. Neural network based intra prediction for video coding, 2018.
  • Jia et al. (2019) Jia, C., Wang, S., Zhang, X., Wang, S., Liu, J., Pu, S., and Ma, S. Content-aware convolutional neural network for in-loop filtering in high efficiency video coding. IEEE Transactions on Image Processing, pp. 1–1, 2019. ISSN 1057-7149. doi: 10.1109/TIP.2019.2896489.
  • Kang et al. (2017) Kang, J., Kim, S., and Lee, K. M. Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec. In 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, September 17-20, 2017, pp. 26–30, 2017. doi: 10.1109/ICIP.2017.8296236.
  • Li et al. (2017) Li, C., Song, L., Xie, R., and Zhang, W. CNN based post-processing to improve HEVC. In 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, September 17-20, 2017, pp. 4577–4580, 2017. doi: 10.1109/ICIP.2017.8297149.
  • Lu et al. (2018) Lu, G., Ouyang, W., Xu, D., Zhang, X., Cai, C., and Gao, Z. DVC: an end-to-end deep video compression framework. CoRR, abs/1812.00101, 2018.
  • Park & Kim (2016) Park, W. and Kim, M. CNN-based in-loop filtering for coding efficiency improvement. In IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop, IVMSP 2016, Bordeaux, France, July 11-12, 2016, pp. 1–5, 2016. doi: 10.1109/IVMSPW.2016.7528223.
  • Rippel et al. (2018) Rippel, O., Nair, S., Lew, C., Branson, S., Anderson, A. G., and Bourdev, L. D. Learned video compression. CoRR, abs/1811.06981, 2018.
  • Schwarz et al. (2007) Schwarz, H., Marpe, D., and Wiegand, T. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Trans. Circuits Syst. Video Techn., 17(9):1103–1120, 2007. doi: 10.1109/TCSVT.2007.905532.
  • Song et al. (2017) Song, R., Liu, D., Li, H., and Wu, F. Neural network-based arithmetic coding of intra prediction modes in HEVC. In 2017 IEEE Visual Communications and Image Processing, VCIP 2017, St. Petersburg, FL, USA, December 10-13, 2017, pp. 1–4, 2017. doi: 10.1109/VCIP.2017.8305104.
  • Sullivan et al. (2012) Sullivan, G. J., Ohm, J., Han, W., and Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Techn., 22(12):1649–1668, 2012. doi: 10.1109/TCSVT.2012.2221191.
  • Wang et al. (2017) Wang, T., Chen, M., and Chao, H. A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC. In 2017 Data Compression Conference, DCC 2017, Snowbird, UT, USA, April 4-7, 2017, pp. 410–419, 2017. doi: 10.1109/DCC.2017.42.
  • Wu et al. (2018) Wu, C., Singhal, N., and Krähenbühl, P. Video compression through image interpolation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, pp. 425–440, 2018. doi: 10.1007/978-3-030-01237-3_26.
  • Yan et al. (2019) Yan, N., Liu, D., Li, H., Li, B., Li, L., and Wu, F. Convolutional neural network-based fractional-pixel motion compensation. IEEE Trans. Circuits Syst. Video Techn., 29(3):840–853, 2019. doi: 10.1109/TCSVT.2018.2816932.
  • Yang et al. (2017) Yang, R., Xu, M., and Wang, Z. Decoder-side HEVC quality enhancement with scalable convolutional neural network. In 2017 IEEE International Conference on Multimedia and Expo, ICME 2017, Hong Kong, China, July 10-14, 2017, pp. 817–822, 2017. doi: 10.1109/ICME.2017.8019299.
  • Zhang et al. (2018) Zhang, Y., Shen, T., Ji, X., Zhang, Y., Xiong, R., and Dai, Q. Residual highway convolutional neural networks for in-loop filtering in HEVC. IEEE Trans. Image Processing, 27(8):3827–3841, 2018. doi: 10.1109/TIP.2018.2815841.
  • Zhao et al. (2018) Zhao, L., Wang, S., Zhang, X., Wang, S., Ma, S., and Gao, W. Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding. In 2018 IEEE International Conference on Image Processing, ICIP 2018, Athens, Greece, October 7-10, 2018, pp. 206–210, 2018. doi: 10.1109/ICIP.2018.8451465.