Deep Video Color Propagation


Abstract

Traditional approaches for color propagation in videos rely on some form of matching between consecutive video frames. Using appearance descriptors, colors are then propagated both spatially and temporally. These methods, however, are computationally expensive and do not take advantage of semantic information of the scene. In this work we propose a deep learning framework for color propagation that combines a local strategy, to propagate colors frame-by-frame ensuring temporal stability, and a global strategy, using semantics for color propagation within a longer range. Our evaluation shows the superiority of our strategy over existing video and image color propagation methods as well as neural photo-realistic style transfer approaches.

Simone Meyer (simone.meyer@inf.ethz.ch), Victor Cornillère, Abdelaziz Djelouah (aziz.djelouah@disneyresearch.com), Christopher Schroers, Markus Gross
Department of Computer Science, ETH Zurich
Disney Research

1 Introduction

Color propagation is an important problem in video processing and has a wide range of applications. For example, in movie-making workflows, color modification for artistic purposes [Pai()] plays an important role. It is also used in the restoration and colorization of heritage footage [Ame()] to create more engaging experiences. Finally, the ability to faithfully propagate colors in videos can have a direct impact on video compression.

Traditional approaches for color propagation rely on optical flow computation to propagate colors in videos, either from scribbles or from fully colored frames. Estimating these correspondence maps is computationally expensive and error-prone. Inaccuracies in optical flow can lead to color artifacts which accumulate over time. Recently, deep learning methods have been proposed to take advantage of semantics for color propagation in images [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros] and videos [Jampani et al.(2017)Jampani, Gadde, and Gehler]. Still, these approaches have limitations and do not yet achieve satisfactory results on video content.

In this work we propose a framework for color propagation in videos that combines local and global strategies. Given the first frame of a sequence in color, the local strategy warps these colors frame by frame based on the motion. However, this local warping becomes less reliable with increasing distance from the reference frame. To account for this, we propose a global strategy that transfers the colors of the first frame based on semantics, through deep feature matching. Both strategies are combined through a fusion and refinement network that synthesizes the final image. The network is trained on video sequences, and our evaluation shows the superiority of the proposed method over image and video propagation methods as well as neural style transfer approaches, see Figure 1.

Figure 1: Color propagation after 30 frames. Compared methods: Image PropNet [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros], style transfer [Li et al.(2018)Li, Liu, Li, Yang, and Kautz], SepConv [Niklaus et al.(2017)Niklaus, Mai, and Liu], Video PropNet [Jampani et al.(2017)Jampani, Gadde, and Gehler], phase-based [Meyer et al.(2016)Meyer, Sorkine-Hornung, and Gross], bilateral solver [Barron and Poole(2016)], and flow-based [Xia et al.(2016)Xia, Liu, Fang, Yang, and Guo], shown next to the reference frame, the ground truth and our result. Our approach is superior to existing strategies for video color propagation. (Image source: [Pont-Tuset et al.(2017)Pont-Tuset, Perazzi, Caelles, Arbelaez, Sorkine-Hornung, and Gool])

Our main contribution is a deep learning architecture that combines local and global strategies for color propagation in videos. We use a two-stage training procedure that is necessary to fully take advantage of both strategies. Our approach achieves state-of-the-art results, as it is able to maintain better colorization over a longer time interval than a wide range of existing methods.

2 Related work

2.1 Image and Video Colorization

A traditional approach to image colorization is to propagate colors or transformation parameters from user scribbles to unknown regions. Seminal works in this direction considered low-level affinities based on spatial and intensity distance [Levin et al.(2004)Levin, Lischinski, and Weiss]. To reduce user interaction, many directions have been considered, such as designing better similarities [Luan et al.(2007)Luan, Wen, Cohen-Or, Liang, Xu, and Shum]. Other approaches to improve edit propagation include embedding learning [Chen et al.(2012)Chen, Zou, Zhao, and Tan], iterative feature discrimination [Xu et al.(2013)Xu, Yan, and Jia] and dictionary learning [Chen et al.(2014)Chen, Zou, Li, Cao, Zhao, and Zhang]. Deep convolutional networks, which achieve convincing results for automatic image colorization [Cheng et al.(2015)Cheng, Yang, and Sheng, Iizuka et al.(2016)Iizuka, Simo-Serra, and Ishikawa], have also been considered for edit propagation [Endo et al.(2016)Endo, Iizuka, Kanamori, and Mitani] and interactive image colorization [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros]. To extend edit propagation to videos, computational efficiency is critical and various strategies have been investigated [An and Pellacini(2008), Yatagawa and Yamaguchi(2014)].

One of the first methods considering gray scale video colorization was proposed by Welsh et al. [Welsh et al.(2002)Welsh, Ashikhmin, and Mueller] as a frame-to-frame color propagation. Later, image patch comparisons [Sýkora et al.(2004)Sýkora, Buriánek, and Zára] were used to handle large displacements and rotations; however, this method targets cartoon content and is not directly adaptable to natural videos. Yatziv and Sapiro [Yatziv and Sapiro(2006)] consider geodesic distance in the 3D spatio-temporal volume to color pixels in videos, and Sheng et al. [Sheng et al.(2011)Sheng, Sun, Chen, Liu, and Wu] replace spatial distance by a distance based on Gabor features. The notions of reliability and priority [Heu et al.(2009)Heu, Hyun, Kim, and Lee] for coloring pixels allow better color propagation. These notions are extended to entire frames [Xia et al.(2016)Xia, Liu, Fang, Yang, and Guo], considering several of them as sources for coloring the next gray images. For increased robustness, Pierre et al. [Pierre et al.(2017)Pierre, Aujol, Bugeau, and Ta] use a variational model that relies on temporal correspondence maps estimated through patch matching and optical flow estimation.

Instead of using pixel correspondences, some recent methods have proposed alternative approaches to the video colorization problem. Meyer et al. [Meyer et al.(2016)Meyer, Sorkine-Hornung, and Gross] transfer image edits as modifications of a phase-based representation of the pixels. The main advantage is that expensive global optimization is avoided; however, propagation is limited to only a few frames. Paul et al. [Paul et al.(2017)Paul, Bhattacharya, and Gupta] use the dominant orientations of a 3D steerable pyramid decomposition, instead of motion vectors, as guidance for the color propagation of user scribbles. Jampani et al. [Jampani et al.(2017)Jampani, Gadde, and Gehler], on the other hand, use a temporal bilateral network for dense and video-adaptive filtering, followed by a spatial network to refine features.

2.2 Style Transfer

Video colorization can be seen as transferring the color or style of the first frame to the rest of the images in the sequence. We only outline the main directions of color transfer, as an extensive review of these methods is available in [Faridul et al.(2016)Faridul, Pouli, Chamaret, Stauder, Reinhard, Kuzovkin, and Trémeau]. Many methods rely on histogram matching [Reinhard et al.(2001)Reinhard, Ashikhmin, Gooch, and Shirley], which can achieve surprisingly good results given its relative simplicity, but colors may be transferred between incoherent regions. Taking segmentation into account can help to improve this aspect [Tai et al.(2005)Tai, Jia, and Tang]. Color transfer between videos is also possible [Bonneel et al.(2013)Bonneel, Sunkavalli, Paris, and Pfister] by segmenting the images using luminance and transferring chrominance. Recently, Arbelot et al. [Arbelot et al.(2016)Arbelot, Vergne, Hurtut, and Thollot] proposed an edge-aware texture descriptor to guide the colorization. Other works focus on more complex transformations such as changing the time of day in photographs [Shih et al.(2013)Shih, Paris, Durand, and Freeman], artistic edits [Shih et al.(2014)Shih, Paris, Barnes, Freeman, and Durand] or season change [Okura et al.(2015)Okura, Vanhoey, Bousseau, Efros, and Drettakis].

Since the seminal work of Gatys et al. [Gatys et al.(2016)Gatys, Ecker, and Bethge], various methods based on neural networks have been proposed [Li and Wand(2016)]. While most of them focus on painterly results, several recent works have targeted photo-realistic style transfer [Mechrez et al.(2017)Mechrez, Shechtman, and Zelnik-Manor, Luan et al.(2017)Luan, Paris, Shechtman, and Bala, Li et al.(2018)Li, Liu, Li, Yang, and Kautz, He et al.(2017)He, Liao, Yuan, and Sander]. Mechrez et al. [Mechrez et al.(2017)Mechrez, Shechtman, and Zelnik-Manor] rely on the screened Poisson equation to maintain fidelity with the style image while constraining the result to have gradients similar to the content image. In [Luan et al.(2017)Luan, Paris, Shechtman, and Bala], photo-realism is maintained by constraining the image transformation to be locally affine in color space. This is achieved by adding a corresponding loss to the original neural style transfer formulation [Gatys et al.(2015)Gatys, Ecker, and Bethge]. To avoid the resulting slow optimization process, patch matching on VGG features [He et al.(2017)He, Liao, Yuan, and Sander] can be used to obtain a guidance image. Finally, Li et al. [Li et al.(2018)Li, Liu, Li, Yang, and Kautz] proposed a two-stage architecture where an initial stylized image, estimated through the whitening and coloring transform (WCT) [Li et al.(2017)Li, Fang, Yang, Wang, Lu, and Yang], is refined with a smoothing step.

3 Overview

The goal of our method is to colorize a gray scale image sequence by propagating the given colors of the first frame to the following frames. Our proposed approach takes into account two complementary aspects: short-range and long-range color propagation, see Figure 2.

The objective of the short range propagation network is to propagate colors on a frame by frame basis. It takes as input two consecutive gray scale frames and estimates a warping function. This warping function is used to transfer the colors of the previous frame to the next one. Following recent trends [Xue et al.(2016)Xue, Wu, Bouman, and Freeman, Jia et al.(2016)Jia, Brabandere, Tuytelaars, and Gool, Niklaus et al.(2017)Niklaus, Mai, and Liu], warping is expressed as a convolution process. In our case we choose to use spatially adaptive kernels that account for motion and re-sampling simultaneously [Niklaus et al.(2017)Niklaus, Mai, and Liu], but other approaches based on optical flow could be considered as well.

Figure 2: Overview. To propagate colors in a video we use both short range and long range color propagation. First, the local color propagation network uses the consecutive gray scale frames $g_{t-1}$ and $g_t$ to predict spatially adaptive kernels that account for motion and re-sampling from the previous color frame $c_{t-1}$. To globally transfer the colors from the reference frame to the entire video, a matching based on deep image features is used. The results of these two steps, $\tilde{c}^{\,l}_t$ and $\tilde{c}^{\,g}_t$, are fed, together with the input $g_t$, to the fusion and refinement network which estimates the final current color frame $\hat{c}_t$. (Image source: [Pont-Tuset et al.(2017)Pont-Tuset, Perazzi, Caelles, Arbelaez, Sorkine-Hornung, and Gool])

For longer range propagation, simply smoothing the warped colors according to the gray scale guide image is not sufficient. Semantic understanding of the scene is needed to transfer colors from the first, colored frame of the video to the rest of the sequence. In our case, we find correspondences between pixels of the first frame and the rest of the video. Instead of matching pixel colors directly, we incorporate semantic information by matching deep features extracted from the frames. These correspondences are then used to sample colors from the first frame. Besides the advantage for long range color propagation, this approach also helps to recover colors missing due to occlusion/dis-occlusion.

To combine the intermediate images of these two parallel stages, we use a convolutional neural network. This corresponds to the fusion and refinement stage. As a result, the final colored image is estimated by taking advantage of information that is present in both intermediate images, i.e. local and global color information.

4 Approach

Let us consider a grayscale video sequence of $T$ frames $\{g_1, \dots, g_T\}$, where the colored image $c_1$ (corresponding to $g_1$) is available. Our objective is to use the frame $c_1$ to colorize the set of grayscale frames $\{g_2, \dots, g_T\}$. Using a local (frame-by-frame) strategy, the colors of $c_1$ can be sequentially propagated to the entire video using temporal consistency. With a global strategy, the colors present in the first frame can be simultaneously transferred to all the frames of the video using a style-transfer-like approach. In this work we propose a unified solution for video colorization combining local and global strategies.

4.1 Local Color Propagation

Relying on temporal consistency, our objective is to propagate colors frame by frame. Using the adaptive convolution approach developed for frame interpolation [Niklaus et al.(2017)Niklaus, Mai, and Liu], one can similarly write color propagation as a convolution operation on the color image: given two consecutive grayscale frames $g_{t-1}$ and $g_t$, and the color frame $c_{t-1}$, an estimate $\tilde{c}^{\,l}_t$ of the colored frame $c_t$ can be expressed as

$$\tilde{c}^{\,l}_t(x) = K_t(x) * P_{c_{t-1}}(x), \qquad (1)$$

where $P_{c_{t-1}}(x)$ is the image patch around pixel $x$ in $c_{t-1}$ and $K_t(x)$ is the estimated pixel-dependent convolution kernel based on $g_{t-1}$ and $g_t$. This kernel is approximated with two 1D kernels as

$$K_t(x) = k^{v}_t(x)\,\big(k^{h}_t(x)\big)^{\top}. \qquad (2)$$

The convolutional neural network architecture used to predict these kernels is similar to the one originally proposed for frame interpolation [Niklaus et al.(2017)Niklaus, Mai, and Liu], with the difference that 2 kernels are predicted (instead of 4 in the interpolation case). Furthermore, we use a softmax layer for kernel prediction, which helps to speed up training [Vogels et al.(2018)Vogels, Rousselle, McWilliams, Röthlin, Harvill, Adler, Meyer, and Novák]. If we denote the prediction function by $\mathcal{F}_{l}$, the local color propagation can be written as

$$\tilde{c}^{\,l}_t = \mathcal{F}_{l}(g_{t-1}, g_t, c_{t-1}; \theta_{l}), \qquad (3)$$

with $\theta_{l}$ being the set of trainable parameters.
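To make the warping step concrete, the following is a minimal NumPy sketch of how predicted separable kernels in the spirit of Eqs. (1)-(2) can be applied to the previous color frame. The function name and kernel size are illustrative, and the kernels are assumed to be already softmax-normalized; this is a slow reference implementation, not the network itself.

```python
import numpy as np

def warp_with_separable_kernels(c_prev, k_v, k_h):
    """Apply per-pixel separable kernels (cf. Eqs. 1-2) to the color frame c_{t-1}.

    c_prev : (H, W, 3) previous color frame.
    k_v, k_h : (H, W, K) vertical / horizontal 1D kernels predicted by the
               local network from (g_{t-1}, g_t); assumed softmax-normalized.
    Returns the locally propagated color estimate for frame t.
    """
    H, W, _ = c_prev.shape
    K = k_v.shape[-1]
    r = K // 2
    # Pad so a full KxK patch P_{c_{t-1}}(x) exists around every pixel.
    padded = np.pad(c_prev, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros_like(c_prev)
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + K, x:x + K, :]         # (K, K, 3) patch
            kernel = np.outer(k_v[y, x], k_h[y, x])     # Eq. (2): outer product of 1D kernels
            out[y, x] = (patch * kernel[:, :, None]).sum(axis=(0, 1))
    return out
```

In an actual training pipeline this operation would be implemented as a vectorized, differentiable layer rather than Python loops.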

4.2 Global Color Transfer

The local propagation strategy becomes less reliable as the frame to colorize gets further away from the first frame. This can be due to occlusions/dis-occlusions, new elements appearing in the scene, or even a complete change of background (due to camera panning, for example). In this case, a global strategy with semantic understanding of the scene is necessary. It allows transferring colors over a longer range, both temporally and spatially. To achieve this, we leverage deep features extracted with convolutional neural networks trained for classification and image segmentation. Similar ideas have been developed for style transfer and image inpainting [Li and Wand(2016), Yang et al.(2017)Yang, Lu, Lin, Shechtman, Wang, and Li].

Figure 3: Global Color Transfer. To transfer the colors of the first frame $c_1$, feature maps are extracted from both inputs $g_1$ and $g_t$. First, a matching is estimated at low resolution. This matching, performed on features from a deep layer $l_c$, allows to consider more abstract information. It is however too coarse to directly copy corresponding image patches. Instead, we use this initial matching to restrict the search region when matching pixels using low level image statistics (from the level $l_f$ feature map). Here we show the region of interest (in blue) used to match the pixel in light green. All the pixels sharing the same coarse position (in the dark green rectangle) share the same Region Of Interest (ROI). Using the final matching, colors are transferred to the current gray scale image $g_t$. (Image source: [Pont-Tuset et al.(2017)Pont-Tuset, Perazzi, Caelles, Arbelaez, Sorkine-Hornung, and Gool])

Formally, we denote by $F^{l}_i$ the feature map extracted from the image $g_i$ at layer $l$ of a discriminatively trained deep convolutional neural network. We can estimate a pixel-wise matching between the reference frame $g_1$ and the current frame $g_t$ to colorize using their respective feature maps $F^{l}_1$ and $F^{l}_t$. The similarity for two positions $x$ and $y$ is measured as

$$s^{l}(x, y) = \frac{\big\langle F^{l}_1(x),\, F^{l}_t(y) \big\rangle}{\big\lVert F^{l}_1(x) \big\rVert \, \big\lVert F^{l}_t(y) \big\rVert}. \qquad (4)$$

Transferring the colors using pixel descriptor matching can then be written as

$$\tilde{c}^{\,g}_t(y) = c_1\Big(\operatorname*{arg\,max}_{x} \, s^{l}(x, y)\Big). \qquad (5)$$

To maintain a good quality for the matching while being computationally efficient, we adopt a two-stage coarse-to-fine matching. Matching is first estimated for features from a deep layer $l_c$. This first matching, at lower resolution, defines a region of interest for each pixel in the second matching step on features at level $l_f$. The different levels of the feature maps correspond to different abstraction levels. The coarse-level matching allows to consider regions that have similar semantics, whereas the fine matching step considers texture-like statistics that are more effective once a region of interest has been defined. We denote the global color transfer function as

$$\tilde{c}^{\,g}_t = \mathcal{F}_{g}(g_1, c_1, g_t; \theta_{g}), \qquad (6)$$

with $\theta_{g}$ being the set of trainable parameters. Figure 3 illustrates all the steps from feature extraction to color transfer. Any neural network trained for image segmentation could be used to compute the feature maps. In our case we use the ResNet-101 [He et al.(2016)He, Zhang, Ren, and Sun] architecture fine-tuned for semantic image segmentation [Chen et al.(2018)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. For $l_c$ we use the output of the last layer of the conv3 block, while for $l_f$ we use the output of the first conv1 block (but with stride 1).
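The coarse-to-fine matching can be sketched as follows in NumPy, using the cosine similarity of Eq. (4). The function names, the ROI margin and the per-cell handling of the search window are illustrative choices under the assumptions stated in the docstring, not the exact implementation.

```python
import numpy as np

def _normalize(f):
    # L2-normalize feature vectors along the channel axis.
    return f / (np.linalg.norm(f, axis=-1, keepdims=True) + 1e-8)

def global_color_transfer(c_ref, feat_coarse_ref, feat_coarse_cur,
                          feat_fine_ref, feat_fine_cur, roi=4):
    """Two-stage matching (Sec. 4.2): a coarse match on deep features defines a
    region of interest, a fine match on shallow features picks the source pixel.

    c_ref         : (Hf, Wf, 3) reference colors c_1 at the fine resolution.
    feat_coarse_* : (Hc, Wc, Cc) deep feature maps of g_1 / g_t.
    feat_fine_*   : (Hf, Wf, Cf) shallow feature maps of g_1 / g_t.
    roi           : margin (in fine pixels) around the up-scaled coarse match.
    """
    Hc, Wc, _ = feat_coarse_cur.shape
    Hf, Wf, _ = feat_fine_cur.shape
    sy, sx = Hf // Hc, Wf // Wc                       # coarse-to-fine scale factors
    fc_ref = _normalize(feat_coarse_ref).reshape(Hc * Wc, -1)
    fc_cur = _normalize(feat_coarse_cur)
    ff_ref = _normalize(feat_fine_ref)
    ff_cur = _normalize(feat_fine_cur)
    out = np.zeros((Hf, Wf, 3), dtype=c_ref.dtype)
    for yc in range(Hc):
        for xc in range(Wc):
            # Coarse match: best reference cell for this coarse position (Eq. 4).
            best = int(np.argmax(fc_ref @ fc_cur[yc, xc]))
            ry, rx = divmod(best, Wc)
            # ROI in the fine reference map, centered on the coarse match.
            y0, y1 = max(0, ry * sy - roi), min(Hf, (ry + 1) * sy + roi)
            x0, x1 = max(0, rx * sx - roi), min(Wf, (rx + 1) * sx + roi)
            cand = ff_ref[y0:y1, x0:x1].reshape(-1, ff_ref.shape[-1])
            # Fine match for every pixel sharing this coarse cell (Eq. 5).
            for yf in range(yc * sy, min((yc + 1) * sy, Hf)):
                for xf in range(xc * sx, min((xc + 1) * sx, Wf)):
                    idx = int(np.argmax(cand @ ff_cur[yf, xf]))
                    py, px = divmod(idx, x1 - x0)
                    out[yf, xf] = c_ref[y0 + py, x0 + px]
    return out
```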

4.3 Fusion and Refinement Network

The results we obtain from the local and global stages are complementary. The local color propagation result is sharp with most of the fine details preserved. Colors are mostly well estimated except at occlusion/dis-occlusion boundaries where some color bleeding can be noticed. The result obtained from the global approach is very coarse but colors can be propagated to a much larger range both temporally and spatially. Fusing these two results is learned with a fully convolutional neural network.

For any given gray scale frame $g_t$, the local and global steps result in two estimates of the color image $c_t$: $\tilde{c}^{\,l}_t$ and $\tilde{c}^{\,g}_t$. These intermediate results are leveraged by the proposed convolutional network (Figure 2) to predict the final output

$$\hat{c}_t = \mathcal{F}_{f}\big(g_t, \tilde{c}^{\,l}_t, \tilde{c}^{\,g}_t; \theta_{f}\big), \qquad (7)$$

where $\mathcal{F}_{f}$ denotes the prediction function and $\theta_{f}$ the set of trainable parameters.

Architecture details. The proposed fusion and refinement network consists of 5 convolutional layers, each followed by a ReLU activation function. To keep the full resolution we use a stride of 1 and increase the receptive field by using dilated convolutions. To project the output to the final colors we use another convolutional layer without any activation function. To improve training and the prediction we use instance normalization [Ulyanov et al.(2016)Ulyanov, Vedaldi, and Lempitsky] to jointly normalize the input frames. The computed statistics are then also used to renormalize the final output.
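The PyTorch sketch below illustrates the overall shape of such a fusion and refinement network. The hidden channel count, the dilation rates and the class and argument names are placeholders rather than the paper's exact values, and the renormalization of the output with the input statistics is omitted for brevity.

```python
import torch
import torch.nn as nn

class FusionRefinementNet(nn.Module):
    """Sketch of the fusion and refinement stage (Sec. 4.3): a small fully
    convolutional network with dilated 3x3 convolutions at full resolution."""

    def __init__(self, in_ch=7, hidden=32, dilations=(1, 2, 4, 2, 1)):
        super().__init__()
        layers, ch = [], in_ch
        for d in dilations:                              # 5 conv layers, stride 1
            layers += [nn.Conv2d(ch, hidden, 3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
            ch = hidden
        self.body = nn.Sequential(*layers)
        self.to_uv = nn.Conv2d(hidden, 2, 3, padding=1)  # chrominance output, no activation
        self.norm = nn.InstanceNorm2d(in_ch, affine=False)

    def forward(self, g_t, c_local_yuv, c_global_yuv):
        # Inputs: grayscale frame (B,1,H,W) and the two intermediate YUV
        # estimates (B,3,H,W each), concatenated into a 7-channel tensor.
        x = torch.cat([g_t, c_local_yuv, c_global_yuv], dim=1)
        x = self.norm(x)                                 # jointly normalize the inputs
        uv = self.to_uv(self.body(x))
        # Final frame: input luminance g_t plus the predicted chrominance.
        return torch.cat([g_t, uv], dim=1)
```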

4.4 Training

Since all the layers we use are differentiable, the proposed framework is end-to-end trainable and can be seen as predicting the colored frame $\hat{c}_t$ from all the available inputs:

$$\hat{c}_t = \mathcal{F}\big(g_1, c_1, g_{t-1}, c_{t-1}, g_t; \theta\big). \qquad (8)$$

The network is trained to minimize the total objective function over the dataset $\mathcal{D}$, consisting of sequences of colored and gray scale images:

$$\mathcal{L} = \sum_{\mathcal{D}} \big( \mathcal{L}_{img} + \mathcal{L}_{warp} \big). \qquad (9)$$

Image loss. We use the $\ell_1$-norm of pixel differences, which has been shown to lead to sharper results than $\ell_2$ [Niklaus et al.(2017)Niklaus, Mai, and Liu, Long et al.(2016)Long, Kneip, Alvarez, Li, Zhang, and Yu, Mathieu et al.(2015)Mathieu, Couprie, and LeCun]. This loss is computed on the final image estimate:

$$\mathcal{L}_{img} = \big\lVert \hat{c}_t - c_t \big\rVert_1. \qquad (10)$$

Warp loss. The local propagation part of the network has to predict the kernels used to warp the color image $c_{t-1}$. This is enforced through the warp loss. It is also computed as the $\ell_1$-norm of pixel differences, between the ground truth image $c_t$ and $\tilde{c}^{\,l}_t$:

$$\mathcal{L}_{warp} = \big\lVert \tilde{c}^{\,l}_t - c_t \big\rVert_1. \qquad (11)$$

Since $\tilde{c}^{\,l}_t$ is an intermediate result, using more sophisticated loss functions such as a feature loss [Gatys et al.(2015)Gatys, Ecker, and Bethge] or an adversarial loss [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] is not necessary: all the sharp details will be recovered by the fusion network.
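A minimal PyTorch sketch of this objective (Eqs. 9-11) could look as follows; the equal weighting of the two terms and the function name are assumptions.

```python
import torch

def color_propagation_loss(c_hat, c_local, c_gt):
    """L1 image loss on the final estimate (Eq. 10) plus L1 warp loss on the
    locally propagated intermediate result (Eq. 11), summed with equal weight."""
    loss_img = torch.mean(torch.abs(c_hat - c_gt))
    loss_warp = torch.mean(torch.abs(c_local - c_gt))
    return loss_img + loss_warp
```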

Training procedure. To train the network we use pairs of frames from video sequences obtained from the DAVIS [Perazzi et al.(2016)Perazzi, Pont-Tuset, McWilliams, Gool, Gross, and Sorkine-Hornung, Pont-Tuset et al.(2017)Pont-Tuset, Perazzi, Caelles, Arbelaez, Sorkine-Hornung, and Gool] dataset and YouTube. We randomly extract patches from the training frames and train the fusion net with a batch size of 16 over 12 epochs.

To efficiently train the fusion network we first apply $\mathcal{F}_{l}$ and $\mathcal{F}_{g}$ separately to all training video sequences. The resulting images $\tilde{c}^{\,l}_t$ and $\tilde{c}^{\,g}_t$ show the limitations of their respective generators. The fusion network can then be trained to synthesize the best color image from these two intermediate results. As input we provide $g_t$ and the intermediate images $\tilde{c}^{\,l}_t$ and $\tilde{c}^{\,g}_t$ converted to YUV color space. Using the luminance channel helps the prediction process as it can be seen as an indicator of the accuracy of the intermediate results. The final image consists of the chrominance values estimated by the fusion network and $g_t$ as the luminance channel.
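A sketch of this second training stage is given below. The data loader, its field names, the optimizer and the learning rate are illustrative assumptions; the batch size and number of epochs follow the values stated above.

```python
import torch

def train_fusion_stage(fusion_net, loader, epochs=12, lr=1e-4):
    """Second training stage: the local and global intermediates are
    precomputed (stored in YUV) and only the fusion network is optimized."""
    opt = torch.optim.Adam(fusion_net.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            # batch: grayscale frame, the two YUV intermediates, ground-truth chrominance.
            pred = fusion_net(batch["g_t"], batch["local_yuv"], batch["global_yuv"])
            loss = torch.mean(torch.abs(pred[:, 1:] - batch["gt_uv"]))  # L1 on chroma
            opt.zero_grad()
            loss.backward()
            opt.step()
    return fusion_net
```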

Running time. At test time, the matching step is the most computationally involved. Still, our naive TensorFlow implementation propagates edits to high resolution frames within a practical per-frame running time on a Titan X (Pascal).

5 Results

Figure 4: Ablation study (panels: reference, local only, global only, full method, ground truth). Using only local color propagation based on [Niklaus et al.(2017)Niklaus, Mai, and Liu] preserves details but is sensitive to occlusion/dis-occlusion. Using only the global color transfer does not preserve details and is not temporally stable. The best result is obtained when combining both strategies. See Figure 9 for a quantitative evaluation. (Image source: [Wang et al.(2017)Wang, Katsavounidis, Zhou, Park, Lei, Zhou, Pun, Jin, Wang, Wang, Zhang, Huang, Kwong, and Kuo, Butler et al.(2012)Butler, Wulff, Stanley, and Black])

Ablation Study. To show the importance of both the local and the global strategy, we evaluate both configurations. The local strategy is more effective for temporal stability and detail preservation but is sensitive to occlusion/dis-occlusion. Figure 4 shows an example where color propagation is not possible due to an occluding object, and a global strategy is necessary. Using only a global strategy is not sufficient either, as some details are lost during the matching step and temporal stability is not maintained (see video in supplemental material).

Figure 5 panels (two time steps): Ground Truth, Zhang et al. [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros], Barron et al. [Barron and Poole(2016)], Ours.
Figure 5: Comparison with image color propagation methods. Methods propagating colors in a single image achieve good results on the first frame. The quality of the results degrades as the frame to colorize is further away from the reference image. (Image source: [Wang et al.(2017)Wang, Katsavounidis, Zhou, Park, Lei, Zhou, Pun, Jin, Wang, Wang, Zhang, Huang, Kwong, and Kuo])

Comparison with image color propagation. Given a partially colored image, propagating the colors to the entire image can be achieved using the bilateral space [Barron and Poole(2016)] or deep learning [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros]. To extend these methods to video, we compute optical flow between consecutive frames [Zach et al.(2007)Zach, Pock, and Bischof] and use it to warp the current color image (details provided in the supplementary material). These image-based methods achieve satisfactory color propagation on the first few frames (Figure 5) but the quality quickly degrades. In the case of the bilateral solver, there is no single set of parameters that performs satisfactorily on all the sequences. The deep learning approach [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros] is not designed for videos and drifts towards extreme values.

Figure 6 panels (two time steps): Ground Truth, Phase-based [Meyer et al.(2016)Meyer, Sorkine-Hornung, and Gross], Video PropNet [Jampani et al.(2017)Jampani, Gadde, and Gehler], Flow-based [Xia et al.(2016)Xia, Liu, Fang, Yang, and Guo], Ours.
Figure 6: Comparison with video color propagation methods. Our approach best retains the sharpness and colors of this video sequence. Our result was obtained in less than one minute while the optical flow method [Xia et al.(2016)Xia, Liu, Fang, Yang, and Guo] needed 5 hours for half the original resolution. (Image source: [Wang et al.(2017)Wang, Katsavounidis, Zhou, Park, Lei, Zhou, Pun, Jin, Wang, Wang, Zhang, Huang, Kwong, and Kuo])

Comparison with video color propagation. Relying on optical flow to propagate colors in a video is the most common approach. In addition to this, Xia et al. [Xia et al.(2016)Xia, Liu, Fang, Yang, and Guo] also consider frame re-ordering and use multiple reference frames. However, this costly process is limiting, as processing HD frames requires several hours. Figure 1 and Figure 6 show that we achieve similar or better quality in one minute. A phase-based representation can also be used for edit propagation in videos [Meyer et al.(2016)Meyer, Sorkine-Hornung, and Gross]. This original approach to color propagation is however limited by the difficulty of propagating high frequencies. Recently, video propagation networks [Jampani et al.(2017)Jampani, Gadde, and Gehler] were proposed to propagate information forward through a video, and color propagation is a natural application of such networks. Contrary to the fast bilateral solver [Barron and Poole(2016)] that only operates on the bilateral grid, video propagation networks [Jampani et al.(2017)Jampani, Gadde, and Gehler] benefit from a spatial refinement module and achieve sharper and better results. Still, by relying on standard bilateral features (i.e. colors, position, time), colors can be mixed and propagated from incorrect regions, which leads to the global impression of washed-out colors.

Comparison with photo-realistic style transfer. Propagating the colors of a reference image is the problem solved by photo-realistic style transfer methods [Luan et al.(2017)Luan, Paris, Shechtman, and Bala, Li et al.(2018)Li, Liu, Li, Yang, and Kautz]. These methods replicate the global look, but little emphasis is put on transferring the exact colors (see Figure 7).

Figure 7 panels: Reference, Gray input, Li et al. [Li et al.(2018)Li, Liu, Li, Yang, and Kautz], Luan et al. [Luan et al.(2017)Luan, Paris, Shechtman, and Bala], Ours.
Figure 7: Comparison with photo-realistic style transfer. The reference frame is used as style image. (Image source: [Pont-Tuset et al.(2017)Pont-Tuset, Perazzi, Caelles, Arbelaez, Sorkine-Hornung, and Gool])

Quantitative evaluation. Our test set consists of videos spanning a large range of scenarios, with various amounts of motion, occlusions/dis-occlusions, background changes and objects appearing or disappearing. Due to their prohibitive running time, some methods [Xia et al.(2016)Xia, Liu, Fang, Yang, and Guo, Luan et al.(2017)Luan, Paris, Shechtman, and Bala] are not included in this quantitative evaluation. Figures 8 and 9 show the details of this evaluation. For a better understanding of the temporal behavior of the different methods, we plot the error evolution over time, averaged over all sequences. On the first frames, our results are almost indistinguishable from the local strategy (with very similar error values), but we quickly see the benefit of the global matching strategy. Our approach consistently outperforms related approaches for every frame and is able to propagate colors over a much larger time frame. Results of the video propagation networks [Jampani et al.(2017)Jampani, Gadde, and Gehler] vary largely depending on the sequence, which explains their inconsistent numerical performance on our large test set compared to the selected images shown in this paper.

Figure 8: Quantitative evaluation. Using PSNR in Lab-space we compute the average over the first N frames; each row corresponds to a different value of N. Higher is better.

Gray | BSolver [Barron and Poole(2016)] | Style [Li et al.(2018)Li, Liu, Li, Yang, and Kautz] | VideoProp [Jampani et al.(2017)Jampani, Gadde, and Gehler] | SepConv [Niklaus et al.(2017)Niklaus, Mai, and Liu] (local only) | Matching (global only) | Ours
33.65 | 41.00 | 32.94 | 34.96 | 42.72 | 38.90 | 43.64
33.66 | 39.57 | 32.81 | 34.65 | 41.01 | 37.97 | 42.64
33.66 | 38.59 | 32.70 | 34.45 | 39.90 | 37.43 | 42.02
33.67 | 37.86 | 32.61 | 34.26 | 39.08 | 37.02 | 41.54
33.68 | 37.40 | 32.54 | 34.13 | 38.56 | 36.75 | 41.23

Figure 9: Temporal evaluation. The average PSNR per frame shows the temporal stability of our method and its ability to maintain a higher quality over a longer period.

6 Conclusions

In this work we have presented a new approach for color propagation in videos. Thanks to the combination of a local strategy, consisting of frame-by-frame image warping, and a global strategy, based on feature matching and color transfer, we have extended the temporal range over which colors can be propagated. Our extensive comparative results show that the proposed approach outperforms recent methods in image and video color propagation as well as style transfer.

Acknowledgments.

This work was supported by ETH Research Grant ETH-12 17-1.

Appendix A Implementation Details for Comparisons

Image Color Propagation. To extend image color propagation methods  [Barron and Poole(2016), Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros] to video, we compute optical flow between consecutive frames [Zach et al.(2007)Zach, Pock, and Bischof] and use it to warp the current color image to the next frame. We compute a confidence measure for the warped colors by warping the gray scale image and taking the difference in intensities with the original gray frame. The warped colors, the confidence maps and the reference gray scale image can be used to color the second frame using the fast bilateral solver [Barron and Poole(2016)]. Using a very conservative threshold, the confidence map is binarized to indicate regions where colors should be propagated using deep priors [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros].
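As a sketch of this comparison setup, the warping and confidence computation could be implemented as follows with OpenCV. Farneback flow stands in here for the TV-L1 flow of [Zach et al.(2007)Zach, Pock, and Bischof], and the threshold value and function name are illustrative assumptions.

```python
import numpy as np
import cv2

def warp_colors_with_confidence(c_prev, g_prev, g_cur, conf_thresh=0.98):
    """Warp the previous colors into the current frame with dense optical flow
    and derive a confidence map from the grayscale warping error.

    c_prev : (H, W, 3) float32 colors of the previous frame, values in [0, 1].
    g_prev, g_cur : (H, W) float32 grayscale frames, values in [0, 1].
    """
    g_prev_u8 = (np.clip(g_prev, 0, 1) * 255).astype(np.uint8)
    g_cur_u8 = (np.clip(g_cur, 0, 1) * 255).astype(np.uint8)
    # Backward flow: for each pixel of the current frame, where to sample in the previous one.
    flow = cv2.calcOpticalFlowFarneback(g_cur_u8, g_prev_u8, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g_cur.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_c = cv2.remap(c_prev, map_x, map_y, cv2.INTER_LINEAR)
    warped_g = cv2.remap(g_prev, map_x, map_y, cv2.INTER_LINEAR)
    # High confidence where the warped gray image matches the current gray frame.
    confidence = 1.0 - np.abs(warped_g - g_cur)
    mask = confidence > conf_thresh   # conservative binarization for the deep-prior variant
    return warped_c, confidence, mask
```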

References

  • [Ame()] America In Color. https://www.smithsonianchannel.com/shows/america-in-color/1004516. Accessed: 2018-03-12.
  • [Pai()] Short Documentary - Painting with Pixels: O Brother, Where Art Thou?
  • [An and Pellacini(2008)] Xiaobo An and Fabio Pellacini. Appprop: all-pairs appearance-space edit propagation. ACM Transactions on Graphics (TOG), 27(3):40:1–40:9, 2008.
  • [Arbelot et al.(2016)Arbelot, Vergne, Hurtut, and Thollot] Benoit Arbelot, Romain Vergne, Thomas Hurtut, and Joëlle Thollot. Automatic texture guided color transfer and colorization. In Joint Symposium on Computational Aesthetics and Sketch Based Interfaces and Modeling and Non-Photorealistic Animation and Rendering, pages 21–32. Eurographics Association, 2016.
  • [Barron and Poole(2016)] Jonathan T. Barron and Ben Poole. The fast bilateral solver. In European Conference on Computer Vision, pages 617–632, 2016.
  • [Bonneel et al.(2013)Bonneel, Sunkavalli, Paris, and Pfister] Nicolas Bonneel, Kalyan Sunkavalli, Sylvain Paris, and Hanspeter Pfister. Example-based video color grading. ACM Transactions on Graphics (TOG), 32(4):39:1–39:12, 2013.
  • [Butler et al.(2012)Butler, Wulff, Stanley, and Black] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625, 2012.
  • [Chen et al.(2018)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
  • [Chen et al.(2012)Chen, Zou, Zhao, and Tan] Xiaowu Chen, Dongqing Zou, Qinping Zhao, and Ping Tan. Manifold preserving edit propagation. ACM Transactions on Graphics (TOG), 31(6):132:1–132:7, 2012.
  • [Chen et al.(2014)Chen, Zou, Li, Cao, Zhao, and Zhang] Xiaowu Chen, Dongqing Zou, Jianwei Li, Xiaochun Cao, Qinping Zhao, and Hao Zhang. Sparse dictionary learning for edit propagation of high-resolution images. In Computer Vision and Pattern Recognition, pages 2854–2861, 2014.
  • [Cheng et al.(2015)Cheng, Yang, and Sheng] Zezhou Cheng, Qingxiong Yang, and Bin Sheng. Deep colorization. In International Conference on Computer Vision, pages 415–423, 2015.
  • [Endo et al.(2016)Endo, Iizuka, Kanamori, and Mitani] Yuki Endo, Satoshi Iizuka, Yoshihiro Kanamori, and Jun Mitani. Deepprop: Extracting deep features from a single image for edit propagation. Computer Graphics Forum, 35(2):189–201, 2016.
  • [Faridul et al.(2016)Faridul, Pouli, Chamaret, Stauder, Reinhard, Kuzovkin, and Trémeau] Hasan Sheikh Faridul, Tania Pouli, Christel Chamaret, Jürgen Stauder, Erik Reinhard, Dmitry Kuzovkin, and Alain Trémeau. Colour mapping: A review of recent methods, extensions and applications. Computer Graphics Forum, 35(1):59–88, 2016.
  • [Gatys et al.(2015)Gatys, Ecker, and Bethge] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
  • [Gatys et al.(2016)Gatys, Ecker, and Bethge] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition, 2016.
  • [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
  • [He et al.(2017)He, Liao, Yuan, and Sander] Mingming He, Jing Liao, Lu Yuan, and Pedro V. Sander. Neural color transfer between images. arXiv preprint arXiv:1710.00756, 2017.
  • [Heu et al.(2009)Heu, Hyun, Kim, and Lee] Junhee Heu, Dae-Young Hyun, Chang-Su Kim, and Sang-Uk Lee. Image and video colorization based on prioritized source propagation. In International Conference on Image Processing, pages 465–468, 2009.
  • [Iizuka et al.(2016)Iizuka, Simo-Serra, and Ishikawa] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110:1–110:11, 2016.
  • [Jampani et al.(2017)Jampani, Gadde, and Gehler] Varun Jampani, Raghudeep Gadde, and Peter V. Gehler. Video propagation networks. In Computer Vision and Pattern Recognition, pages 3154–3164, 2017.
  • [Jia et al.(2016)Jia, Brabandere, Tuytelaars, and Gool] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, 2016.
  • [Levin et al.(2004)Levin, Lischinski, and Weiss] Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. ACM Transactions on Graphics (TOG), 23(3):689–694, 2004.
  • [Li and Wand(2016)] Chuan Li and Michael Wand. Combining markov random fields and convolutional neural networks for image synthesis. In Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
  • [Li et al.(2017)Li, Fang, Yang, Wang, Lu, and Yang] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 385–395, 2017.
  • [Li et al.(2018)Li, Liu, Li, Yang, and Kautz] Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.
  • [Long et al.(2016)Long, Kneip, Alvarez, Li, Zhang, and Yu] Gucan Long, Laurent Kneip, Jose M. Alvarez, Hongdong Li, Xiaohu Zhang, and Qifeng Yu. Learning image matching by simply watching video. In European Conference on Computer Vision, pages 434–450, 2016.
  • [Luan et al.(2017)Luan, Paris, Shechtman, and Bala] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. In Computer Vision and Pattern Recognition, pages 6997–7005, 2017.
  • [Luan et al.(2007)Luan, Wen, Cohen-Or, Liang, Xu, and Shum] Qing Luan, Fang Wen, Daniel Cohen-Or, Lin Liang, Ying-Qing Xu, and Heung-Yeung Shum. Natural image colorization. In Eurographics Symposium on Rendering Techniques, pages 309–320, 2007.
  • [Mathieu et al.(2015)Mathieu, Couprie, and LeCun] Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • [Mechrez et al.(2017)Mechrez, Shechtman, and Zelnik-Manor] Roey Mechrez, Eli Shechtman, and Lihi Zelnik-Manor. Photorealistic style transfer with screened poisson equation. In British Machine Vision Conference, 2017.
  • [Meyer et al.(2016)Meyer, Sorkine-Hornung, and Gross] Simone Meyer, Alexander Sorkine-Hornung, and Markus Gross. Phase-based modification transfer for video. In European Conference on Computer Vision, 2016.
  • [Niklaus et al.(2017)Niklaus, Mai, and Liu] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In International Conference on Computer Vision, 2017.
  • [Okura et al.(2015)Okura, Vanhoey, Bousseau, Efros, and Drettakis] Fumio Okura, Kenneth Vanhoey, Adrien Bousseau, Alexei A. Efros, and George Drettakis. Unifying color and texture transfer for predictive appearance manipulation. Computer Graphics Forum, 34(4):53–63, 2015.
  • [Paul et al.(2017)Paul, Bhattacharya, and Gupta] Somdyuti Paul, Saumik Bhattacharya, and Sumana Gupta. Spatiotemporal colorization of video using 3d steerable pyramids. IEEE Transactions on Circuits and Systems for Video Technology, 27(8):1605–1619, 2017.
  • [Perazzi et al.(2016)Perazzi, Pont-Tuset, McWilliams, Gool, Gross, and Sorkine-Hornung] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc J. Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.
  • [Pierre et al.(2017)Pierre, Aujol, Bugeau, and Ta] Fabien Pierre, Jean-François Aujol, Aurélie Bugeau, and Vinh-Thong Ta. Interactive video colorization within a variational framework. SIAM Journal on Imaging Sciences, 10(4):2293–2325, 2017.
  • [Pont-Tuset et al.(2017)Pont-Tuset, Perazzi, Caelles, Arbelaez, Sorkine-Hornung, and Gool] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbelaez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • [Reinhard et al.(2001)Reinhard, Ashikhmin, Gooch, and Shirley] Erik Reinhard, Michael Ashikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34–41, 2001.
  • [Sheng et al.(2011)Sheng, Sun, Chen, Liu, and Wu] Bin Sheng, Hanqiu Sun, Shunbin Chen, Xuehui Liu, and Enhua Wu. Colorization using the rotation-invariant feature space. IEEE Computer Graphics and Applications, 31(2):24–35, 2011.
  • [Shih et al.(2013)Shih, Paris, Durand, and Freeman] Yi-Chang Shih, Sylvain Paris, Frédo Durand, and William T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200:1–200:11, 2013.
  • [Shih et al.(2014)Shih, Paris, Barnes, Freeman, and Durand] Yi-Chang Shih, Sylvain Paris, Connelly Barnes, William T. Freeman, and Frédo Durand. Style transfer for headshot portraits. ACM Transactions on Graphics (TOG), 2014.
  • [Sýkora et al.(2004)Sýkora, Buriánek, and Zára] Daniel Sýkora, Jan Buriánek, and Jirí Zára. Unsupervised colorization of black-and-white cartoons. In International Symposium on Non-Photorealistic Animation and Rendering, 2004.
  • [Tai et al.(2005)Tai, Jia, and Tang] Yu-Wing Tai, Jiaya Jia, and Chi-Keung Tang. Local color transfer via probabilistic segmentation by expectation-maximization. In Computer Vision and Pattern Recognition, 2005.
  • [Ulyanov et al.(2016)Ulyanov, Vedaldi, and Lempitsky] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • [Vogels et al.(2018)Vogels, Rousselle, McWilliams, Röthlin, Harvill, Adler, Meyer, and Novák] Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. Denoising with kernel prediction and asymmetric loss functions. ACM Transactions on Graphics (TOG), 2018.
  • [Wang et al.(2017)Wang, Katsavounidis, Zhou, Park, Lei, Zhou, Pun, Jin, Wang, Wang, Zhang, Huang, Kwong, and Kuo] Haiqiang Wang, Ioannis Katsavounidis, Jiantong Zhou, Jeong-Hoon Park, Shawmin Lei, Xin Zhou, Man-On Pun, Xin Jin, Ronggang Wang, Xu Wang, Yun Zhang, Jiwu Huang, Sam Kwong, and C.-C. Jay Kuo. Videoset: A large-scale compressed video quality dataset based on JND measurement. Journal of Visual Communication and Image Representation, 46:292–302, 2017.
  • [Welsh et al.(2002)Welsh, Ashikhmin, and Mueller] Tomihisa Welsh, Michael Ashikhmin, and Klaus Mueller. Transferring color to greyscale images. ACM Transactions on Graphics (TOG), 21(3):277–280, 2002.
  • [Xia et al.(2016)Xia, Liu, Fang, Yang, and Guo] Sifeng Xia, Jiaying Liu, Yuming Fang, Wenhan Yang, and Zongming Guo. Robust and automatic video colorization via multiframe reordering refinement. In International Conference on Image Processing, pages 4017–4021, 2016.
  • [Xu et al.(2013)Xu, Yan, and Jia] Li Xu, Qiong Yan, and Jiaya Jia. A sparse control model for image and video editing. ACM Transactions on Graphics (TOG), 32(6):197:1–197:10, 2013.
  • [Xue et al.(2016)Xue, Wu, Bouman, and Freeman] Tianfan Xue, Jiajun Wu, Katherine L. Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.
  • [Yang et al.(2017)Yang, Lu, Lin, Shechtman, Wang, and Li] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In Computer Vision and Pattern Recognition, pages 4076–4084, 2017.
  • [Yatagawa and Yamaguchi(2014)] Tatsuya Yatagawa and Yasushi Yamaguchi. Temporally coherent video editing using an edit propagation matrix. Computers & Graphics, 43:1–10, 2014.
  • [Yatziv and Sapiro(2006)] Liron Yatziv and Guillermo Sapiro. Fast image and video colorization using chrominance blending. IEEE Transactions on Image Processing, 15(5):1120–1129, 2006.
  • [Zach et al.(2007)Zach, Pock, and Bischof] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, 2007.
  • [Zhang et al.(2017)Zhang, Zhu, Isola, Geng, Lin, Yu, and Efros] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S. Lin, Tianhe Yu, and Alexei A. Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 36(4):119:1–119:11, 2017.