Copy-and-Paste Networks for Deep Video Inpainting
We present a novel deep learning based algorithm for video inpainting. Video inpainting is a process of completing corrupted or missing regions in videos. Video inpainting has additional challenges compared to image inpainting due to the extra temporal information as well as the need for maintaining the temporal coherency. We propose a novel DNN-based framework called the Copy-and-Paste Networks for video inpainting that takes advantage of additional information in other frames of the video. The network is trained to copy corresponding contents in reference frames and paste them to fill the holes in the target frame. Our network also includes an alignment network that computes affine matrices between frames for the alignment, enabling the network to take information from more distant frames for robustness. Our method produces visually pleasing and temporally coherent results while running faster than the state-of-the-art optimization-based method. In addition, we extend our framework for enhancing over/under exposed frames in videos. Using this enhancement technique, we were able to significantly improve the lane detection accuracy on road videos.
Inpainting is the task of completing an image that has empty pixels by filling the empty regions with visually plausible pixels. Inpainting is very useful in the image editing process and is commonly used to generate more satisfying images by removing unwanted objects. There is a large body of literature on image inpainting, and significant progress has been made recently by employing deep learning. Impressive inpainting results have been reported by applying evolving deep generative models, synthesizing visually pleasing images even for complex scenes.
In this paper, we focus on the video inpainting problem. The additional temporal information in videos makes the already difficult problem even more challenging. In addition to filling the holes in every frame, the algorithm has to ensure that the completed frames are temporally consistent. Due to these challenges, we have seen only one work that tackles the problem using deep neural networks (DNNs), compared to the image inpainting problem, where many deep learning based algorithms have been introduced.
While video inpainting is more challenging compared to image inpainting, it inherently includes more cues for the problem as valid pixels for missing regions in a frame may exist in other frames. Therefore, we propose a novel DNN based framework called the Copy-and-Paste Networks for video inpainting that takes advantage of additional information in other frames in the video. As the name suggests, the network is trained to copy the necessary pixels from other frames and paste those pixels on the holes in the current frame (Fig. 1).
The key components of our DNN system are the alignment and the context matching. To find corresponding pixels in other frames for the holes in the given frame, the frames need to be registered first. We propose a self-supervised alignment network, which estimates affine matrices between frames. While DNNs for computing an affine matrix or a homography exist [5, 11, 17], our alignment method is able to deal with holes in images when computing the affine matrices. After the alignment, the novel context matching algorithm is used to compute the similarity between the target frame and the reference frames. Through the context matching, the network learns which pixels are valuable for copying, and those pixels are pasted to complete the image. By progressively updating the reference frames with the inpainted results at each step, the algorithm can produce videos with temporal consistency.
Our results are comparable to those of the state-of-the-art method and outperform other deep learning based approaches [13, 24]. Moreover, our method can easily be extended to restore saturated/under-exposed images, as shown in Fig. 1(b). By enhancing the saturated/under-exposed images, we were able to significantly increase the lane detection accuracy.
In summary, the major contributions of our paper are as follows:
We propose a self-supervised deep alignment network that can compute affine matrices between images that contain large holes.
We propose a novel context-matching algorithm to combine reference frame features based on similarity between images.
Our method produces visually pleasing completed videos while running much faster than the state-of-the-art method. Additionally, we extend our framework to enhance over/under-exposed frames in videos, which can help improve other vision tasks such as lane detection.
2 Related works
2.1 Image Inpainting
In traditional image inpainting methods, an image is filled by referencing pixels outside the hole in the image or in an external image database. As one of the most representative inpainting methods, PatchMatch reconstructs the missing region by searching for patches outside the hole with an approximate nearest neighbor algorithm. With this type of approach, however, it is difficult to inpaint images with complicated scenes, or images that do not contain sufficient information for filling the holes.
Since deep image inpainting was introduced in [10, 18], many deep generative models for image inpainting have been proposed, showing impressive restoration results on complex scenes. Yu et al. proposed the contextual attention module between the completed structure of the hole area and the patches outside the hole. Liu et al. and Yu et al. applied the partial convolution and the gated convolution, respectively, to compensate for the weakness of the vanilla convolution in image inpainting. In particular, Liu et al. corrected the blurred results by using the perceptual and the style losses without an adversarial loss.
2.2 Video Inpainting
Video inpainting has the additional challenges of restoring the holes in every frame and maintaining temporal consistency between the reconstructed frames. On the other hand, unlike in image inpainting, one can exploit the redundant information between the frames of a video. However, directly exploiting this redundant information is difficult due to image variation caused by the movements of the camera and the objects. To compensate for the movements, Granados et al. proposed aligning the frames based on homographies. They also applied optical flow between the completed frames to maintain temporal consistency.
Newson et al. proposed 3D PatchMatch to maintain temporal consistency, in addition to using an affine transformation to compensate for the motion. While the spatio-temporal patches improve short-term temporal consistency, long-term consistency in complicated scenes remained a limitation. To address this limitation, Huang et al. proposed an optical flow optimization over spatial patches to complete images while preserving temporal consistency. This method has shown state-of-the-art performance up until now. All the methods described above are based on heavy optimization and therefore suffer from long computation times, limiting their practical use.
Wang et al. proposed the first deep learning based video inpainting method, using 3D encoder-decoder networks. However, this work does not cover the object removal task in general videos and was only applied to a few specific domains. Kim et al. proposed 3D-2D encoder-decoder networks to complete the missing contents efficiently. Temporal consistency is maintained through a recurrent feedback and a memory layer, trained with the flow and the warping losses. The temporal window for referencing is small in their method, so it is difficult to use valid pixels in distant frames, resulting in limited performance for scenes with large or slowly moving objects.
Our copy-and-paste network overcomes the issues in the method of Kim et al. by aligning the frames with affine matrices computed by our alignment network instead of using optical flow. With the novel context matching algorithm, our method can extract valid pixels from distant frames, resulting in more accurate reconstruction for general scenes. The performance of our method is comparable to that of the state-of-the-art method of Huang et al. while being more practical, with a faster runtime due to the feed-forward nature of DNNs.
3 Copy-and-Paste Network Algorithm
The overview of our framework is shown in Fig. 2. The system takes a video annotated with the missing pixels in each frame and outputs the completed video. The video is processed frame by frame in temporal order. We call the frame to be filled the target frame and the other frames the reference frames. For each target frame, our network completes the missing region by copying and pasting contents from the reference frames.
To complete a target frame, each reference frame is first aligned to the target frame through the alignment network. Then, in the copy network, the pixels to be copied from the aligned reference frames are determined by the context matching module. Finally, in the paste network, the outputs of the copy network are decoded to produce the inpainted target frame. The input video in memory is updated with the completed frame, which is subsequently used as a reference frame, providing more information for the following frames.
3.1 Alignment Network
In video inpainting, a large temporal window is essential, as valuable information is more likely to be found in distant frames. With an optical flow based alignment, as used by Kim et al., the temporal range is too small to extract useful information. As illustrated in Fig. 3, a reference frame temporally close to the target frame lacks the information needed to fill the hole because there is too much overlap between the holes in the two images. Moreover, computing optical flow between images with holes is more difficult, as the holes themselves act as occlusions. Therefore, our alignment network estimates affine matrices to align the reference frames with the target frame.
The alignment network consists of shared alignment encoders and alignment regressors. Details on the network architectures are provided in the supplementary materials. To train the alignment network, we minimize a self-supervised loss, which is the L1 distance between the target frame and the aligned reference frame. To exclude the hole regions, this pixel-wise loss is measured only over pixels that are valid in both images: the absolute difference between the two frames is multiplied element-wise by the visibility map of the target frame and by the visibility map of the reference frame aligned to the target. The visibility map is computed from the given masks, where 0 indicates hole pixels and 1 indicates non-hole pixels.
Note that the alignment network is jointly trained with other networks in an end-to-end manner, not independently.
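The masked L1 alignment loss described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `masked_l1_alignment_loss`, the (H, W, C) array layout, and the normalization by the number of jointly valid values are assumptions, since the text only states that the L1 distance is measured on pixels valid in both frames.

```python
import numpy as np

def masked_l1_alignment_loss(target, aligned_ref, vis_target, vis_aligned_ref, eps=1e-8):
    """L1 distance restricted to pixels that are visible in both frames.

    target, aligned_ref: float arrays of shape (H, W, C).
    vis_target, vis_aligned_ref: arrays of shape (H, W); 1 = valid, 0 = hole.
    """
    # Element-wise product of the two visibility maps: 1 only where both are valid.
    joint = (vis_target * vis_aligned_ref)[..., None]
    abs_diff = np.abs(target - aligned_ref) * joint
    # Assumed normalization: average over the jointly valid values.
    n_valid = joint.sum() * target.shape[-1]
    return float(abs_diff.sum() / (n_valid + eps))
```

Because the difference is multiplied by the joint visibility map, arbitrary content inside either hole has no influence on the loss value.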
3.2 Copy-and-Paste Network
After the frame alignment, the aligned frames are mapped into the feature space through the shared encoders. The context matching module computes the importance of each pixel in the reference frames for completing the holes, as well as a mask indicating the visibility of each pixel throughout the video. Finally, the decoder takes the output of the context matching module together with the target frame features to restore values for the missing pixels.
Encoder networks extract the features from the target and the aligned reference frames. The input to the encoder is a concatenation of an RGB image and the corresponding binary mask. The details on the architecture will be described in the supplementary materials.
Context matching module
Together with the encoder, the context matching module constitutes the copy network. The context matching module is illustrated in Fig. 4. First, global similarities between the aligned reference frames and the target frame are computed in the feature space. Each similarity is essentially the cosine similarity between the two feature maps, computed with the hole pixels excluded.
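A masked cosine similarity of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the paper's exact formula: the name `masked_global_similarity`, the (C, H, W) layout, and the choice to average per-pixel cosine similarities over the jointly visible pixels are all hypothetical.

```python
import numpy as np

def masked_global_similarity(feat_target, feat_ref, vis_target, vis_ref, eps=1e-8):
    """One scalar similarity between a target and an aligned reference feature map.

    feat_target, feat_ref: float arrays of shape (C, H, W).
    vis_target, vis_ref: arrays of shape (H, W); 1 = valid, 0 = hole.
    """
    joint = vis_target * vis_ref
    # Per-pixel cosine similarity along the channel axis.
    dot = (feat_target * feat_ref).sum(axis=0)
    norms = np.linalg.norm(feat_target, axis=0) * np.linalg.norm(feat_ref, axis=0)
    cosine = dot / (norms + eps)
    # Assumed reduction: mean over pixels that are visible in both frames.
    return float((cosine * joint).sum() / (joint.sum() + eps))
```

One scalar per reference frame is enough to rank the references globally; the per-pixel visibility is reintroduced later when the weights are turned into saliency maps.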
Then, a saliency map is computed for each reference frame. Fig. 5 illustrates the steps for computing the saliency map in 1-D. Each pixel value in the saliency map holds the weight that the corresponding reference pixel has in filling the hole in the target. The reference features are aggregated through a weighted sum with the saliency maps, producing the features passed to the decoder.
The hole masks of the reference frames are aggregated in a similar fashion, resulting in a mask that indicates the pixels that are never visible in any reference frame.
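The saliency computation and the aggregation can be sketched as a softmax over reference frames that is masked by each frame's visibility (the ablation study refers to this as a masked softmax). The code below is a minimal sketch under assumptions: the function name, the array shapes, and the exact form of the never-visible mask (one minus the summed saliency) are illustrative choices, not taken verbatim from the source.

```python
import numpy as np

def masked_softmax_aggregate(similarities, ref_feats, ref_vis, eps=1e-8):
    """Aggregate aligned reference features with visibility-masked softmax weights.

    similarities: (N,) one global similarity per reference frame.
    ref_feats:    (N, C, H, W) aligned reference features.
    ref_vis:      (N, H, W) aligned visibility maps; 1 = valid, 0 = hole.
    Returns the aggregated features (C, H, W) and a mask (H, W) that is close
    to 1 where no reference frame is visible and close to 0 elsewhere.
    """
    sims = np.asarray(similarities, dtype=float)
    # Subtracting the maximum does not change the softmax but avoids overflow.
    weights = np.exp(sims - sims.max())[:, None, None] * ref_vis
    denom = weights.sum(axis=0, keepdims=True)
    # Saliency is zero wherever a reference pixel falls inside its own hole.
    saliency = weights / (denom + eps)
    aggregated = (saliency[:, None] * ref_feats).sum(axis=0)
    never_visible = 1.0 - saliency.sum(axis=0)
    return aggregated, never_visible
```

With a plain softmax, hole pixels of a highly similar reference frame would still receive weight and dilute the valid content from other frames, which is consistent with the sharper results reported for the masked variant.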
The decoder network completes the target frame given the target features, the aggregated reference features, and the aggregated mask. The inputs are concatenated before being fed into the decoder. The decoder is essentially our paste network, which learns to fill the missing region by using the aggregated reference features and the visibility of those features. The pixels marked in the aggregated mask are never visible in any reference frame because they always fall into holes. Therefore, the decoder has to be able to synthesize contents for those pixels as well. We add dilated convolution blocks to grow the receptive field and make the decoder deeper than the other networks in order to improve the completion of the unseen area by looking at other pixels within the image itself.
3.3 Temporal Consistency
The frames of the video are completed sequentially by the network, one by one. The completed frame at each iteration replaces the corresponding reference frame, providing more information for the following frames, as its holes are now filled with contents. This iterative reference update not only improves the quality of the restored images but also enhances the temporal consistency, as analyzed later in the ablation study. To further ensure temporal consistency, we run the feed-forward network twice: once completing the video from the first frame to the last, and once in the reverse order. The final results are then computed by combining the outputs of the two passes.
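The sequential completion with reference updates, run once in each temporal direction, can be sketched as below. The names `run_pass` and `complete_video`, the toy list-based frame representation, and the mask convention (1 = hole) are assumptions for illustration, and `inpaint_frame` stands in for the whole network as a caller-supplied function. The sketch returns both passes and leaves their combination to the caller, because the combination rule is not reproduced in the text.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]
Mask = List[int]  # assumed convention: 1 = hole, 0 = valid
Inpainter = Callable[[Frame, Mask, List[Tuple[Frame, Mask]]], Frame]

def run_pass(frames: Sequence[Frame], hole_masks: Sequence[Mask],
             inpaint_frame: Inpainter, order: Sequence[int]) -> List[Frame]:
    """Complete the frames in the given order, updating the reference pool."""
    memory = list(frames)     # working copy that later frames can copy from
    masks = list(hole_masks)
    for t in order:
        refs = [(memory[r], masks[r]) for r in range(len(memory)) if r != t]
        memory[t] = inpaint_frame(memory[t], masks[t], refs)
        # The completed frame has no holes left when it serves as a reference.
        masks[t] = [0] * len(masks[t])
    return memory

def complete_video(frames: Sequence[Frame], hole_masks: Sequence[Mask],
                   inpaint_frame: Inpainter) -> Tuple[List[Frame], List[Frame]]:
    """Run a first-to-last pass and a last-to-first pass over the video."""
    indices = list(range(len(frames)))
    forward = run_pass(frames, hole_masks, inpaint_frame, indices)
    backward = run_pass(frames, hole_masks, inpaint_frame, indices[::-1])
    return forward, backward
```

Each pass starts from the original input, so the two results are independent estimates of every frame that were built from different reference histories.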
4.1 Loss functions
All the networks are trained jointly in an end-to-end manner. First, we compute the loss between the completed target frame and the ground truth. The losses for the hole region and the non-hole region are separately calculated. Furthermore, the hole region can be divided into areas depending on whether the pixel value can be copied from reference frames or not. Therefore, we also apply the losses in the hole region separately.
The aggregated mask from the context matching module is resized to fit the size of the target frame.
To further improve the visual quality of the results, we also apply perceptual, style, and total variation loss.
These losses are computed on a composite image that combines the decoder output in the hole region with the input outside the hole. The features are the outputs of the pooling layers of a VGG-16 network pretrained on ImageNet, indexed by pooling layer, and the style loss compares the Gram matrices of these features.
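For reference, a Gram-matrix style loss can be sketched as follows. This is a generic illustration: the normalization by C·H·W, the use of a mean absolute difference, and the function names are assumptions, and the VGG-16 feature extraction is omitted, so the inputs are plain arrays standing in for pooling-layer features.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlation of a (C, H, W) feature map, shape (C, C)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # Assumed normalization by the number of feature values.
    return flat @ flat.T / (c * h * w)

def style_loss(feats_pred, feats_gt):
    """Sum over layers of the mean absolute difference between Gram matrices."""
    return float(sum(np.abs(gram_matrix(p) - gram_matrix(g)).mean()
                     for p, g in zip(feats_pred, feats_gt)))
```

The Gram matrix discards spatial positions and keeps only feature co-occurrence statistics, which is why it captures texture rather than layout.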
The total loss is a weighted sum of the losses above and a total variation loss, which smooths checkerboard artifacts. The weight for each loss is empirically determined.
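A total variation term penalizes differences between neighboring pixels. The sketch below uses an anisotropic L1 form over the whole image; the function name, the (H, W, C) layout, and the choice to average rather than sum are assumptions, and the source does not state whether the term is restricted to the hole region.

```python
import numpy as np

def total_variation(image):
    """Mean absolute difference between vertically and horizontally adjacent pixels.

    image: float array of shape (H, W, C).
    """
    diff_rows = np.abs(image[1:, :, :] - image[:-1, :, :]).mean()
    diff_cols = np.abs(image[:, 1:, :] - image[:, :-1, :]).mean()
    return float(diff_rows + diff_cols)
```

A checkerboard pattern maximizes this quantity for a given value range, so minimizing it discourages exactly that artifact.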
Our goal is to complete holes in video sequences. The inputs are image sequences with holes and binary masks indicating the hole regions. However, no public dataset for video inpainting exists. Therefore, we synthesized a dataset for video inpainting using background images and segmentation masks.
We synthesize videos by compositing background image sequences with object masks (Fig. 6). To build background image sequences, we use the Places image dataset (1.8M images). To synthesize a sequence of images from a single image, we apply random crops and successive random transformations (shear, scale, translation, rotation) to the image. Additionally, we crawled YouTube video clips and divided them by scene (7.3K scenes). Frames are randomly sampled from the video clips to form an image sequence. The source of each background image sequence is selected at random with equal probability.
To simulate masks for holes, we use object masks from the MIT Saliency Benchmark (11K masks) and Pascal VOC 2012 (14.3K masks). A mask is randomly resized to be smaller than the background frames. The mask is then randomly transformed into a mask sequence that simulates a moving object. A training sample is made by compositing a background image sequence with a mask sequence generated in this way.
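The successive random transformations used to turn a single image or mask into a sequence can be sketched as a chain of small affine matrices. Everything specific here is an assumption for illustration: the function names, the parameter ranges, the composition order, and the use of 3x3 homogeneous matrices. Warping the pixels with each matrix is left to an image library.

```python
import numpy as np

def random_affine(rng, max_shift=5.0, max_rot=0.05, max_log_scale=0.05, max_shear=0.05):
    """Sample one small affine step (translation, rotation, shear, scale) as a 3x3 matrix."""
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    angle = rng.uniform(-max_rot, max_rot)
    scale = np.exp(rng.uniform(-max_log_scale, max_log_scale))
    shear = rng.uniform(-max_shear, max_shear)
    rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                         [np.sin(angle), np.cos(angle), 0.0],
                         [0.0, 0.0, 1.0]])
    shearing = np.array([[1.0, shear, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    scaling = np.diag([scale, scale, 1.0])
    translation = np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])
    return translation @ rotation @ shearing @ scaling

def affine_trajectory(num_frames, seed=0):
    """Accumulate successive random steps so that the motion drifts smoothly over time."""
    rng = np.random.default_rng(seed)
    current = np.eye(3)
    trajectory = [current]
    for _ in range(num_frames - 1):
        current = random_affine(rng) @ current
        trajectory.append(current)
    return trajectory
```

Composing each new step with the previous matrix, instead of sampling every frame independently, yields gradual camera-like or object-like motion across the sequence.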
4.3 Training Details
Our model runs on hardware with an Intel(R) Core(TM) i7-7800X CPU (3.50GHz) and NVIDIA TITAN XP GPUs. We train with five frames randomly selected from the synthesized video sequences as inputs, using a batch size of 40. We use the Adam optimizer and reduce the learning rate by a factor of 10 every 1 million iterations. The training process takes about 7 days using three NVIDIA TITAN XP GPUs.
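The step schedule described above can be written in a few lines. The function name is hypothetical, and the base learning rate is left as a parameter because its value is not recoverable from the text.

```python
def step_decay_lr(base_lr, iteration, step=1_000_000, factor=10.0):
    """Divide the learning rate by `factor` once every `step` iterations."""
    return base_lr / (factor ** (iteration // step))
```

For any base rate, the value stays constant within each block of one million iterations and drops by a factor of 10 at the block boundary.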
To evaluate our algorithm, we provide both quantitative and qualitative analysis, as well as a user study. We conducted the experiments on videos scaled to half size. Our code will be available online. We also show an application of our work in restoring under/over-exposed images.
5.1 Quantitative Results
We first conducted a quantitative evaluation by measuring the quality of video restoration. For this experiment, we randomly selected 25 video sequences from the DAVIS dataset [19, 20], which consists of pairs of video and object segmentation mask sequences. To simulate image restoration, we synthesized videos by placing imaginary object masks from DAVIS [19, 20] on the videos. The videos without the object masks are used as the ground truth. Table 1 compares the PSNR and SSIM measures of our method and the method of Huang et al. Both methods show good performance with similar measures. Note that VINet is excluded from this experiment because its official code has not been published yet.
|Method||PSNR||SSIM|
|Huang et al.||28.14||0.859|
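For readers who want to reproduce this kind of measurement, PSNR between a ground-truth frame and a restored frame follows the standard definition sketched below. The function name and the default peak value of 255 (8-bit images) are assumptions; the source does not state the value range or how scores are averaged over frames and videos.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    reference = np.asarray(reference, dtype=float)
    restored = np.asarray(restored, dtype=float)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values indicate a restoration closer to the ground truth; SSIM complements it by comparing local structure rather than per-pixel error.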
5.2 User Study and Qualitative Analysis
We further conducted experiments on dynamic object removal with 30 videos from the DAVIS dataset [19, 20]. We compared our method with the state-of-the-art video inpainting models [9, 13]. Results of the previous methods were gathered by using the official code released by the authors of the optimization-based method and by requesting the results from the authors of VINet.
The results of the user study, conducted on Amazon Mechanical Turk (AMT), are shown in Fig. 8 and Table 2. The workers were asked to rank the video completion results, and ties were allowed. All tests were evaluated by 40 participants.
|Huang et al.||1.74|
The user study shows that our method is highly competitive with the optimization-based method of Huang et al., while VINet is not on par with the other two methods. While the method of Huang et al. was slightly more favored, it requires an average completion time of 952 seconds per video, whereas our method takes only 27.14 seconds, roughly 35 times less.
We extend our method to restore under/over-exposed image sequences. The restoration process is similar to the video inpainting problem in that it fills areas with missing information. This problem often occurs in image sequences taken by a camera attached to a vehicle, due to rapid exposure changes (e.g., tunnel entry and exit).
As shown in Fig. 9, both the texture and the color are improved. To validate the effectiveness of our restoration process, we ran a lane detection algorithm on road images before and after the enhancement. We collected 469 video frames that contain rapid exposure changes due to tunnels (the dataset was captured with the Mobile Mapping System camera of Hyundai MnSOFT, Inc.), and an in-house color-histogram-based lane detection method was used. As shown in Fig. 9 and Table 3, the lane detection results are significantly improved.
|Lane detection input||Lane detection accuracy|
|Restored input by our model||83.00%|
6 Ablation Study
We conducted an ablation study to verify that the masked softmax contributes to the performance improvement. We trained our model with a normal softmax under the same conditions. As shown in Fig. 10, the results with the masked softmax are sharper than those with the normal one.
To produce temporally coherent outputs, we update the past reference frames with their inpainted versions. To visualize the effect of this update protocol, we compare the temporal profiles of the resulting videos in Fig. 11. As shown there, the update procedure contributes to enhancing the temporal consistency.
In this paper, we presented a novel DNN framework for video inpainting. The proposed method inpaints the missing information by copying and pasting contents from the reference frames. The reference information is dynamically updated with the previous completion results to ensure temporal consistency. Our experiments show that the proposed framework is comparable to the optimization-based methods and outperforms other deep learning based approaches. We extended our framework to restore over/under-exposed frames in videos and were able to significantly increase the lane detection accuracy.
This work was supported by Institute for Information & communication Technology Promotion (IITP) grant funded by the Korea government (MSIP) (2018-0-01858).
- [1] (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) 28 (3), pp. 24.
- [2] MIT saliency benchmark.
- [3] (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787.
- [4] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- [5] (2016) Deep image homography estimation. arXiv preprint arXiv:1606.03798.
- [6] (2015) The Pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136.
- [7] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- [8] (2012) Background inpainting for videos with dynamic objects and a free-moving camera. In ECCV.
- [9] (2016) Temporally coherent completion of dynamic video. ACM Transactions on Graphics (TOG) 35 (6).
- [10] (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017) 36 (4), pp. 107:1–107:14.
- [11] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
- [12] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
- [13] (2019) Deep video inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5792–5801.
- [14] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [15] (2018) Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100.
- [16] (2014) Video inpainting of complex scenes. SIAM Journal on Imaging Sciences 7 (4), pp. 1993–2019.
- [17] (2018) Unsupervised deep homography: a fast and robust homography estimation model. IEEE Robotics and Automation Letters 3 (3), pp. 2346–2353.
- [18] (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [19] (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732.
- [20] (2017) The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675.
- [21] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [22] (2019) Video inpainting by jointly learning temporal structure and spatial details. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5232–5239.
- [23] (2018) Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589.
- [24] (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514.
- [25] (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464.