A Convolutional Neural Network Approach for Half-Pel Interpolation in Video Coding
Motion compensation is a fundamental technology in video coding to remove the temporal redundancy between video frames. To further improve the coding efficiency, sub-pel motion compensation has been utilized, which requires interpolation of fractional samples. The video coding standards usually adopt fixed interpolation filters that are derived from the signal processing theory. However, as video signal is not stationary, the fixed interpolation filters may turn out less efficient. Inspired by the great success of convolutional neural network (CNN) in computer vision, we propose to design a CNN-based interpolation filter (CNNIF) for video coding. Different from previous studies, one difficulty for training CNNIF is the lack of ground-truth since the fractional samples are actually not available. Our solution for this problem is to derive the “ground-truth” of fractional samples by smoothing high-resolution images, which is verified to be effective by the conducted experiments. Compared to the fixed half-pel interpolation filter for luma in High Efficiency Video Coding (HEVC), our proposed CNNIF achieves up to 3.2% and on average 0.9% BD-rate reduction under low-delay P configuration.
Motion compensation is a fundamental technology in video coding to remove the temporal redundancy between video frames. The existing video coding standards, including High Efficiency Video Coding (HEVC), mostly adopt block-based motion compensation for inter prediction, which assumes the block to be coded can be retrieved from the previously coded frames, and the corresponding location of the retrieved block is indicated by a motion vector (MV). Due to the inherent spatial sampling of digital video, MV is probably not integer, and thus the corresponding block may need to be generated instead of simply retrieved. Typically, generating non-integer samples is performed by kinds of interpolation, which can be viewed as fitting a continuous curve through a set of discrete samples, and picking out the new values at specific positions on the curve.
The problem of fractional sample interpolation for video coding has been widely studied [MOMS]. A theoretical analysis is conducted in [Girod] about the influence of fractional-pel accuracy on the efficiency of motion compensated prediction, using a Gaussian power spectral density model. Practically, the video coding standards mostly adopt fixed interpolation filters. For example, MPEG-4 AVC/H.264 uses the 6-tap filter to perform half-pel interpolation and the simple average filter to perform quarter-pel interpolation for luma component [wiegand2003overview]. In HEVC, the DCT-based interpolation filter (DCTIF) is adopted [sullivan2012overview]. HEVC uses a “7q+8h” DCTIF for luma component, that is a 7-tap DCTIF used for quarter-pel samples, and a 8-tap DCTIF for half-pel samples. Lv et al. [DCTIF] give the derivation process of DCTIF in detail, and compare the frequency responses between the interpolation filters in HEVC and H.264.
The fixed interpolation filters have been designed according to the signal processing theory, with the premise that the video signal is an ideal low-pass one. However, the video signal is indeed not low-pass and not stationary. Further study has been conducted to design different interpolation filters. For example, a motion compensated hybrid video coding scheme using an adaptive filter is presented in [wedi2001adaptive], where the filter coefficients are estimated during motion compensation for each frame. Wittmann et al. [wittmann2008separable] present a separable adaptive interpolation filter to reduce the computational cost while maintaining the coding efficiency of non-separable adaptive filter. Nonetheless, the previous works adopted hand-crafted filters, leaving a space for further improving the accuracy of motion compensation.
Recently, deep learning has achieved great success in computer vision. Convolutional neural network (CNN) based models led to a series of breakthroughs in high-level computer vision tasks, such as image classification [AlexNet] and object detection [GoogleNet]. Later on, CNN is also utilized in some low-level computer vision tasks. For example, Dong et al. [dong2014learning] propose a CNN approach for image super-resolution, termed SRCNN, which learns an end-to-end mapping between low- and high-resolution images. SRCNN has achieved significant boost of performance in both subjective and objective quality, compared to the previous methods without CNN. SRCNN is then extended to cope with the problem of artifact reduction [ARCNN]. More recently, Dai et al. [VRCNN] propose to learn a CNN for post-processing in video coding, and demonstrate on average 4.6% bit-rate reduction than HEVC baseline. All these works seem to open up a new direction that adopts CNN into video coding to further improve the coding efficiency.
In this paper, we present a CNN approach for fractional sample interpolation in video coding. We expect the CNN to automate the discovery of interpolation filters rather than to design them manually. However, a key difficulty here is how to generate training data for CNN. In previous studies concerning CNN, the ground-truth labels are provided comprehensively in the training data. However, for fractional sample interpolation, we have no ground-truth label because the fractional samples actually do not exist. Our solution for this problem is to derive the labels by smoothing high-resolution images, which is verified to be effective in experiments. We then reuse the network architecture of SRCNN, considering the similarity between super-resolution and fractional interpolation, but train the network with our derived training data. After training, the CNN-based interpolation filter (CNNIF) is integrated into HEVC for testing its performance in video coding. Currently, the CNNIF is applied only to half-pel samples of luma component. Experimental results show that the proposed CNNIF leads to up to 3.2% and on average 0.9% bits saving under low-delay P configuration.
The remainder of this paper is organized as follows. Section II provides the details of the proposed CNNIF. Section III gives the experimental results, and Section IV concludes the paper.
Ii Proposed Method
Ii-a Overview of the Proposed Method
In this work, we propose a CNN approach for half-pel interpolation of luma component. Fig. 1 illustrates the integer and half-pel positions during interpolation. The positions labeled with A stand for the available luma samples at integer locations, whereas the other positions labeled with b, h and j represent samples at half-pel locations, which need to be interpolated from integer-location samples. In HEVC, these three positions are derived using uniform 8-tap DCTIF. Taking into account the positional relation between the integer and the half-pel samples, we propose to train three CNN models to perform the interpolation of horizontal half-pel, vertical half-pel, and diagonal half-pel samples, respectively. That is, we train CNNIF_H, CNNIF_V and CNNIF_D, for the interpolation of b, h, and j, respectively.
It is very important to have abundant training data in store for supervised deep learning. However, in this fractional interpolation task, we cannot obtain the real ground-truth for training, i.e. the fractional samples do not exist actually. Without the ground-truth, it is not possible to carry out the training task, let alone expecting performance improvement in video coding. To overcome this contradiction, in this work, we design a method for generating training data, which will be introduced in Section II-C.
Ii-B Network Architecture
In this work, we reuse the existing Convolutional Neural Network for Super-Resolution (SRCNN) in [dong2014learning] to carry out the half-pel interpolation task. Fig. 2 depicts the architecture of SRCNN.
SRCNN consists of three convolutional layers. The output of -th layer () is the result of a linear transform of the previous layer followed by a Rectified Linear Unit (ReLU) [ReLU], and this process can be expressed as:
where and are the convolutional filter kernel and bias of the -th layer, respectively, and ‘*’ means convolutional operation. For SRCNN, the three layers are claimed to perform three steps, respectively [dong2014learning]:
The first layer is used for patch extraction and representation, extracting the features from low-resolution image. Here, is of size and is a 64-dimensional vector.
The second layer can be seen as non-linear mapping, which converts the features of low-resolution image to those of high-resolution. Here, is of size and is a 32-dimensional vector.
The third layer, where is of size , is used to recover the high-resolution image from the high-resolution features.
In the super-resolution task, a low-resolution image is firstly up-scaled to the desired size using bicubic interpolation, the interpolated image is denoted as . The goal is to recover an image to be as similar as possible to the ground-truth image . While in our fractional interpolation task, the input of the network is the image consisting of integer-position pixels, denoted by , and the output is the interpolated image of fractional positions, , which has the same size with the input image:
Please note that fractional sample interpolation is related to but different from super-resolution: the former tries to generate only fractional samples, while the latter is to generate a complete high-resolution image. If we simply reuse super-resolution for fractional sample interpolation, the integer-position samples cannot be guaranteed to be identical. Our experimental results also show that the simple reusing does not work well.
Ii-C Derivation of Training Data
The derivation of training data is performed in two steps:
Blurring a training image with a low-pass filter.
Extracting input and label for CNNIF_H, CNNIF_V, and CNNIF_D, respectively.
Ii-C1 Image Blurring
Like image super-resolution, fractional interpolation is also an ill-posed problem. One of the most difficult issues in training the CNN for fractional interpolation is the absence of ground-truth, since the fractional pixels are not available. It is infeasible for training the CNN if we do not have the ground-truth.
The generation of digital images is a process of sampling from the analog signal, which is essentially a process of low-pass filtering followed by decimation. Therefore, analogous to the derivation of digital images, we propose to firstly blur the training images with a low-pass filter to simulate the process of sampling. And intuitively, this operation can increase the correlation between neighboring pixels, so the relation between two neighboring pixels is more like that between integer and half-pel samples.
Ii-C2 Data Extracting
After blurring the training images, the correlation between the neighboring pixels of the images is more like that between integer and half pixels. Fig. 3 shows the process of extracting input and labels for training. The pixels of phase zero (red points in the figure) are extracted as the input image of CNN and regarded as integer pixels. The pixels of phase one (black points) are used as the horizontal half pixels. Similarly, the pixels of phase two and phase three (green and purple points) are used as vertical and diagonal half pixels, respectively.
It is noticeable that the fractional interpolation is performed upon the reconstruction of the previously coded frames. For lossy video coding, especially with high quantization parameter (QP), significant reconstruction error will be introduced, and thus the interpolation accuracy will deteriorate. To correct this, the pixels of phase zero (red points in Fig. 3) are firstly coded and reconstructed by HEVC intra coding, and the reconstruction results are used as the inputs of CNNIF. In other words, we hope the CNNIF to generate the half-pel pixels from the compressed integer pixels.
Up to now, we have generated the training data of the three CNN models for half-pel interpolation. The pairs shown in Fig. 3 (Input, horizontal half-pel), (Input, vertical half-pel) and (Input, diagonal half-pel) are used as the training data of CNNIF_H, CNNIF_V, and CNNIF_D, respectively.
Ii-D Training Method
In this task, the three CNN models are trained in the same way, and the loss function is optimized using stochastic gradient descent with back-propagation. The training of CNN is actually a process of adjusting the parameters set , i.e. for SRCNN, to minimize the loss function over the training set. Let denotes the output of the CNNIF, and the labels are denoted as . Here, we use the Euclidean distance as the loss function:
where is the total number of training data items.
Iii-a Training Stage
We use the deep learning framework Caffe [Caffe] to train the CNNIFs on an NVIDIA Tesla K40C graphical processing unit (GPU). The training set we use is the same as that in [ARCNN], which is a collection of 400 natural images. All the images are processed using the method depicted in Fig. 3, and after that we have 400 sub-images as input and 1200 sub-images for the labels (400 sub-images for each CNNIF). In this implementation, each sub-image as input is compressed by HEVC intra coding at four different QPs: 22, 27, 32, and 37. For each QP and each half-pel position, a separate network is trained. Therefore, we finally train 12 CNNIFs. During the process of compression, a CNNIF will be selected according to the slice QP and the corresponding half-pel position. The nearest QP among 22, 27, 32, and 37 to the current slice QP will be considered.
Iii-B Comparison with HEVC Baseline
The proposed method is implemented based on HEVC reference software HM 16.7. Currently, only the process of half-pel interpolation of the luma component is replaced by CNNIFs. The low delay P (LDP) configuration is tested in the experiment under the HEVC common test conditions. BD-rate is used to measure the rate-distortion (RD) performance. The experimental results are summarized in Table I. As can be observed, the proposed method achieves on average 0.9% BD-rate reduction. For the test sequence BQTerrace, the BD-rate reduction can be as high as 3.2%, 1.6%, 1.6% for Y, U, V components, respectively. Since the fractional interpolation of chroma components is still DCTIF, the performance of chroma components is not prominent. In the future work, we will train CNN models for the chroma components.
|Y (%)||U (%)||V (%)|
|Class Summary||Class B||-1.4||-0.4||-0.3|
Iii-C Comparison with Super-Resolution
We also compare our proposed method with super-resolution method. As an anchor, the reconstructed frame is up-scaled to a larger one by the pre-trained SRCNN model [SRCNNModel] with a factor of 2. From the enlarged frame, the phase one, phase two and phase three pixels are used as the interpolated half-pel samples. We integrate this anchor method into HM 16.7. Table II shows the results of the HEVC test sequences Class C and Class D of the anchor method compared to HEVC baseline. It can be observed that all the sequences suffer from significant loss. For the test sequence BQSquare, the loss can be as high as 8.2% for luma component. Therefore, despite the similarity between fractional interpolation and image super-resolution, they are indeed not the same task, and it is not appealing to directly apply super-resolution method to fractional interpolation, as validated by the experimental results.
|Y (%)||U (%)||V (%)|
This paper presents a convolutional neural network based fractional interpolation for inter prediction in HEVC. We use the existing SRCNN structure, but retrain interpolation CNN models for the three half-pel positions of luma component. A blurring followed by extracting method is proposed to generate training data, especially the missing labels. Experimental results show that the proposed CNNIF can achieve on avearge 0.9% bits saving. Our further work will focus on two aspects. First, to design a more efficient network architecture that is more suitable for the interpolation task. Second, we will investigate how to generate better labels for training, especially for the quarter-pel interpolation.