Artistic style transfer for videos and spherical images
(This study was partially supported by the Excellence Initiative of the German Federal and State Governments, EXC 294.)
Manually re-drawing an image in a certain artistic style takes a professional artist a long time. Doing this for a video sequence single-handedly is beyond imagination. We present two computational approaches that transfer the style from one image (for example, a painting) to a whole video sequence. In our first approach, we adapt to videos the original image style transfer technique by Gatys et al. based on energy minimization. We introduce new ways of initialization and new loss functions to generate consistent and stable stylized video sequences even in cases with large motion and strong occlusion. Our second approach formulates video stylization as a learning problem. We propose a deep network architecture and training procedures that allow us to stylize arbitrary-length videos in a consistent and stable way, and nearly in real time. We show that the proposed methods clearly outperform simpler baselines both qualitatively and quantitatively. Finally, we propose a way to adapt these approaches also to 360 degree images and videos as they emerge with recent virtual reality hardware.
Keywords: Style transfer · Deep networks · Artistic videos · Video stylization
1 Introduction
The seminal work of Gatys et al. Gatys2016 that transfers the artistic style of a painting to real world photographs has opened up an interesting application area for deep networks and has triggered numerous follow-up works. Their approach uses the statistics of high-level feature representations of the images from the hidden layers of an image classification network to separate and reassemble content and style. This is done by formulating an optimization problem that, starting with white noise, searches for a new image showing similar neural activations as the content image and similar feature correlations (expressed by a Gram matrix) as the style image. Based on this, Johnson et al. Johnson2016 demonstrated that style transfer, conceived as an image transformation problem, can be learned by a convolutional neural network, which drastically reduces the runtime to process an image.
This paper presents extensions of the works of Gatys et al. Gatys2016 and Johnson et al. Johnson2016 to video sequences. Given an artistic image, we transfer its particular style of painting to the entire video. Processing each frame of the video independently leads to flickering and false discontinuities, since the solution of the style transfer task is not stable. To regularize the transfer and to preserve smooth transitions between individual frames of the video, we introduce a temporal constraint that penalizes deviations between two successive frames. The temporal constraint takes the optical flow of the original video into account: instead of penalizing the deviation from the previous frame directly, we penalize deviations along the point trajectories. Disoccluded regions as well as motion boundaries are excluded from the penalizer. This allows the process to rebuild disoccluded regions and distorted motion boundaries while preserving the appearance of the rest of the image; see Fig. 1.
Adding a loss that prefers temporal consistency between successive frames alone still leads to visible artifacts in the resulting videos. Thus, we present two extensions to our basic approach. The first one aims at improving consistency over larger periods of time. When a region that is occluded in some frame and disoccluded later gets rebuilt during the process, it will most likely be given a different appearance than before the occlusion. We solve this issue by taking long-term motion estimates into account. Secondly, the style transfer tends to create artifacts at the image boundaries. Therefore, we developed a multi-pass algorithm that processes the video in alternating temporal directions using both forward and backward flow.
Apart from the optimization-based approach, which yields high-quality results but is very slow to compute, we also present a network-based approach to video style transfer in the spirit of Johnson et al. Johnson2016 . We present a network to stylize subsequent frames of a video given the previously stylized frame as additional input, warped according to the optical flow. As with the optimization-based approach, we incorporate a temporal constraint into the loss and train a network which can be applied recursively to obtain a stable stylized video.
We compare both approaches in terms of speed and consistency. We show quantitative results on the Sintel MPI dataset and qualitative results on several movie shots. We were able to successfully eliminate most of the temporal artifacts and can create smooth and coherent stylized videos both using the optimization-based approach and image transformation networks. Additionally, we provide a detailed ablation study showing the importance of different components of both approaches.
Finally, we apply the style transfer to spherical images projected onto a cube map. The goal is to stylize a cube map such that adjacent edges remain consistent to each other. Technically this is related to video style transfer. We show that style transfer to videos and spherical images can be combined to process spherical videos. Although the approach is still a little too slow for real-time processing, it demonstrates the potential of artistic virtual reality applications.
2 Related work
Image style transfer. In an initial study, Gatys et al. DBLP:conf/nips/GatysEB15 showed that hidden feature maps learned by a deep neural network can be used for texture synthesis by using feature correlations of an image classification network as a descriptor for the texture.
Later, Gatys et al. Gatys2016 extended their approach to synthesizing images which have the texture (or, more generally, the style) of one image while showing the content of another image. The style is represented by feature correlations and the content of an image is described by high-level feature maps of the deep neural network.
The approach of Gatys et al. received much attention inside and outside the computer vision community and triggered a whole line of research on deep-learning-based style transfer. Li et al. DBLP:conf/cvpr/LiW16 proposed a different way to represent the style within the neural network, which improves visual quality in cases where the content and style image show similar semantic content. Nikulin et al. DBLP:journals/corr/NikulinN16 tried the style transfer algorithm on features from other networks and proposed several variations in the way the style of the image is represented to achieve different goals like illumination or season transfer. Luan et al. presented an approach for photo-realistic style transfer, where both the style image and the content are photographs luan2017deep .
Fast image style transfer. The processing time can be reduced by orders of magnitude by training a deep neural network to learn a mapping from an image to its stylized version. Different approaches have been developed: Li et al. DBLP:conf/eccv/LiW16 used supervised training and adversarial training. Johnson et al. Johnson2016 presented an unsupervised training approach where the loss is computed by using neural network features and statistics as introduced by Gatys et al. Gatys2016 . Ulyanov et al. DBLP:conf/icml/UlyanovLVL16 proposed an unsupervised approach, too, but used a multi-scale network architecture. Unsupervised approaches allow the use of larger training datasets, like Microsoft COCO, while Li et al. precomputed a small number of stylized images and generated more training samples by data augmentation techniques. Ulyanov et al. DBLP:journals/corr/UlyanovVL16 later introduced instance normalization, which greatly improves the qualitative results of feed-forward style transfer networks. Common to all approaches is the limitation that each model can only learn to transform the style according to one specific style template. However, recent techniques also allow multi-style networks DBLP:journals/corr/ZhangD17 ; Ghiasi2017 .
Painted animations. Litwinowicz Litwinowicz:1997:PIV:258734.258893 proposed an approach to create temporally consistent artistic videos by adding artificial brush strokes to the video. To gain consistency, brush strokes were generated for the first frame and then moved along the optical flow field. Different artistic styles were achieved by modifying various parameters of these brush strokes, such as thickness, or by using different brush placement methods. Hays et al. Hays:2004:IVB:987657.987676 built upon this idea and proposed new stylistic parameters for the brush strokes to mimic different artistic styles. O’Donovan et al. odonovan:2012 introduced an energy minimization problem for this task. The optimization objective included a temporal constraint by penalizing changes in shape and width of the brush strokes compared to the previous frame.
Video style transfer. In an earlier conference version of the present paper, we demonstrated for the first time the capability of the optimization-based approach to stylize videos in a consistent way by incorporating additional loss functions into the objective function Ruder2016 . In the present paper, we extend this previous work to fast, network-based style transfer and to spherical images and videos.
Recently, Chen et al. DBLP:journals/corr/ChenLYYH17 , too, implemented video style transfer using feed-forward style transfer networks. Different from this work, Chen et al. used so-called style banks, which are high-level feature maps generated by style transfer networks. To stylize successive frames, style banks from previous frames are reused in order to achieve a similar stylization for similar content. More recently, Gupta et al. Gupta2017 and Huang et al. Huang17 showed a way to train style transfer networks that do not require optical flow at test time. Their style transfer networks get the new content frame and the last stylized frame as input, where the last stylized frame is not warped.
3 Style transfer in still images
In this section, we briefly review the image style transfer approaches introduced by Gatys et al. Gatys2016 and Johnson et al. Johnson2016 , which are used as basis for our video style transfer algorithms.
The goal is to generate a stylized image $x$ showing the content of an image $p$ in the style of an image $a$. Gatys et al. formulated an energy minimization problem consisting of a content loss and a style loss. The key idea is that features extracted by a convolutional network carry information about the content of the image, while the correlations of these features encode the style.
We denote by $\Phi^l(\cdot)$ the function implemented by the part of the convolutional network from the input up to layer $l$. We denote the feature maps extracted by the network from the original image $p$, the style image $a$, and the stylized image $x$ by $\Phi^l(p)$, $\Phi^l(a)$, and $\Phi^l(x)$, respectively. The dimensionality of these feature maps we denote by $N_l \times M_l$, where $N_l$ is the number of filters (channels) in the layer, and $M_l$ is the spatial dimensionality of the feature map, that is, the product of its width and height.
The content loss, denoted as $\mathcal{L}_{content}$, is the mean squared error between $\Phi^l(p)$ and $\Phi^l(x)$. This loss need not be restricted to only one layer. Let $L_{content}$ be the set of layers to be used for content representation, then we have:

$\mathcal{L}_{content}(p, x) = \sum_{l \in L_{content}} \frac{1}{N_l M_l} \sum_{i,j} \big( \Phi^l(x) - \Phi^l(p) \big)_{ij}^2$
The style loss is also a mean squared error, but between the correlations of the filter responses, expressed by their Gram matrices $G^l(a)$ for the style image and $G^l(x)$ for the stylized image. These are computed as $G^l(a) = \frac{1}{N_l M_l}\,\Phi^l(a)\,\Phi^l(a)^{\top}$ and $G^l(x) = \frac{1}{N_l M_l}\,\Phi^l(x)\,\Phi^l(x)^{\top}$. As above, let $L_{style}$ be the set of layers we use to represent the style, then the style loss is given by:

$\mathcal{L}_{style}(a, x) = \sum_{l \in L_{style}} \frac{1}{N_l^2} \sum_{i,j} \big( G^l(x) - G^l(a) \big)_{ij}^2$
Overall, the loss function is given by

$\mathcal{L}_{singleimage}(p, a, x) = \alpha\,\mathcal{L}_{content}(p, x) + \beta\,\mathcal{L}_{style}(a, x)$

with weighting factors $\alpha$ and $\beta$ governing the relative importance of the two components.
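The losses above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code: feature maps are assumed to be already flattened to (channels, positions) arrays, and the normalization constants vary between style transfer implementations.

```python
import numpy as np

def gram(features):
    """Gram matrix of a (channels, positions) feature map, normalized by N*M."""
    n, m = features.shape
    return features @ features.T / (n * m)

def content_loss(feat_x, feat_p):
    """Mean squared error between stylized and content feature maps."""
    n, m = feat_x.shape
    return ((feat_x - feat_p) ** 2).sum() / (n * m)

def style_loss(feat_x, feat_a):
    """Mean squared error between the Gram matrices of stylized and style image."""
    n, _ = feat_x.shape
    return ((gram(feat_x) - gram(feat_a)) ** 2).sum() / (n ** 2)

def total_loss(feats_x, feats_p, feats_a, alpha, beta,
               content_layers, style_layers):
    """Weighted sum of per-layer content and style losses (dicts keyed by layer)."""
    lc = sum(content_loss(feats_x[l], feats_p[l]) for l in content_layers)
    ls = sum(style_loss(feats_x[l], feats_a[l]) for l in style_layers)
    return alpha * lc + beta * ls
```

In the optimization-based approach, this total loss would be minimized with respect to the stylized image; in the network-based approach, it serves as the perceptual training loss.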
The stylized image is computed by minimizing this energy with respect to $x$ using gradient-based optimization.
Typically it is initialized with random Gaussian noise.
However, the loss function is non-convex, therefore the optimization is prone to falling into local minima.
This makes the initialization of the stylized image important, especially when applying the method to frames of a video.
Instead of solving an optimization problem, a much faster method directly learns a style transfer function for one particular style, mapping from an input image $p$ to its stylized counterpart. Such a function can be expressed by a convolutional neural network with parameters $\Theta$. The network is trained to minimize the expected loss for an arbitrary image $p$:

$\min_{\Theta}\ \mathbb{E}_{p}\big[\, \mathcal{L}_{singleimage}\big(p, a, \Theta(p)\big) \big]$
Johnson et al. Johnson2016 directly evaluate $\mathcal{L}_{singleimage}$ in each iteration of the training phase and use the loss and gradients from this function to perform a backward pass. Since this function does not compare the output of the network to a ground truth but acts as a perceptual quality measure, it is called a perceptual loss function.
The network architecture chosen by Johnson et al. Johnson2016 is a fully convolutional neural network, which uses downsampling by strided convolution to increase the receptive field and upsampling to produce the final output. Furthermore, the network utilizes residual connections which help the network converge faster.
The quality of network-based style transfer can be significantly improved with instance normalization DBLP:journals/corr/UlyanovVL16 – a technique inspired by batch normalization Ioffe2015BatchNA . Instance normalization performs contrast normalization for each data sample. Details are provided in the Appendix.
4 Optimization-based coherent video style transfer
In this section, we explain our video style transfer approach formulated as an energy minimization problem. This includes two extensions that improve long-term consistency and image quality during camera motion.
We use the following notation: $f^{(i)}$ is the $i$-th frame of the original video, $a$ is the style image and $x^{(i)}$ are the stylized frames to be generated. Furthermore, we denote by $x'^{(i)}$ the initialization of the style optimization algorithm at frame $i$. By $x_k$ we denote the $k$-th component of a vector $x$.
4.1 Short-term consistency by initialization
Independent initialization with Gaussian noise yields two consecutive frames that are stylized very differently, even when the original frames hardly differ. The most basic way to improve temporal consistency is to initialize the optimization for frame $i+1$ with the already stylized frame $x^{(i)}$. Areas that have not changed between the two frames are then initialized with the desired appearance, while the rest of the image has to be rebuilt through the optimization process.
This simple approach is insufficient for moving parts of the scene, because the initialization does not match. Thus, we take the optical flow into account and initialize the optimization for frame $i+1$ with the previous stylized frame warped by the optical flow: $x'^{(i+1)} = \omega_{i}^{i+1}\big(x^{(i)}\big)$. Here $\omega_{i}^{i+1}$ denotes the function that warps a given image using the optical flow field that was estimated between images $f^{(i)}$ and $f^{(i+1)}$. Only the first frame of the stylized video is initialized randomly.
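A minimal warping routine can be sketched as follows. This is an illustrative nearest-neighbor version (real implementations use bilinear interpolation), and the flow convention, mapping each pixel of the current frame back into the previous frame, is an assumption of this sketch.

```python
import numpy as np

def warp(image, flow):
    """Warp `image` using nearest-neighbor sampling.

    `flow[y, x] = (u, v)` gives, for each pixel of the output frame, the
    offset of the corresponding position in `image`; out-of-bounds samples
    are clamped to the image border.
    """
    h, w = image.shape[:2]
    out = np.empty_like(image)
    for y in range(h):
        for x in range(w):
            u, v = flow[y, x]
            sx = min(max(int(round(x + u)), 0), w - 1)
            sy = min(max(int(round(y + v)), 0), h - 1)
            out[y, x] = image[sy, sx]
    return out
```

With a zero flow field the warp is the identity, so unchanged image regions keep their stylized appearance as initialization.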
Every reliable optical flow technique can be used for computing the optical flow. We experimented with two popular algorithms: DeepFlow weinzaepfel:hal-00873592 and EpicFlow revaud:hal-01142656 . Both are based on Deep Matching weinzaepfel:hal-00873592 : DeepFlow combines it with a variational approach, while EpicFlow relies on edge-preserving sparse-to-dense interpolation.
4.2 Temporal consistency loss
To enforce stronger consistency between adjacent frames, we additionally introduce an explicit consistency penalty to the objective function. Disoccluded regions as well as motion boundaries are excluded from the penalizer, which allows the optimizer to rebuild those regions. To detect disocclusions, we perform a forward-backward consistency check of the optical flow Bro10e . Let $w = (u, v)$ be the optical flow in forward direction and $\hat{w} = (\hat{u}, \hat{v})$ the flow in backward direction. Denote by $\tilde{w}$ the forward flow warped to the second image:

$\tilde{w}(x, y) = w\big((x, y) + \hat{w}(x, y)\big)$
In areas without disocclusion, this warped flow should be approximately the opposite of the backward flow. Therefore we mark as disocclusions those areas where the following inequality holds:

$|\tilde{w} + \hat{w}|^2 > 0.01\,\big(|\tilde{w}|^2 + |\hat{w}|^2\big) + 0.5$
Motion boundaries are detected using the following inequality:

$|\nabla \hat{u}|^2 + |\nabla \hat{v}|^2 > 0.01\,|\hat{w}|^2 + 0.002$
The temporal consistency loss function penalizes deviations from the warped image in regions where the optical flow is consistent and estimated with high confidence:

$\mathcal{L}_{temporal}(x, \omega, c) = \frac{1}{D} \sum_{k=1}^{D} c_k \, (x_k - \omega_k)^2$
Here $c \in [0, 1]^D$ is a per-pixel weighting of the loss and $D$ is the dimensionality of the image. We define the weights $c^{(i-1,i)}$ between frames $i-1$ and $i$ as follows: $0$ in disoccluded regions (as detected by forward-backward consistency) and at the motion boundaries, and $1$ elsewhere. The overall loss takes the form:

$\mathcal{L}_{shortterm}\big(f^{(i)}, a, x^{(i)}\big) = \alpha\,\mathcal{L}_{content}\big(f^{(i)}, x^{(i)}\big) + \beta\,\mathcal{L}_{style}\big(a, x^{(i)}\big) + \gamma\,\mathcal{L}_{temporal}\big(x^{(i)},\, \omega_{i-1}^{i}(x^{(i-1)}),\, c^{(i-1,i)}\big)$
We optimize the frames sequentially; thus, $x^{(i-1)}$ refers to the already stylized frame $i-1$.
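The forward-backward check and the motion-boundary test translate directly into the per-pixel weight mask used by the temporal loss. The sketch below is illustrative, not the authors' code; the function name is ours and the threshold constants are the commonly used values for this consistency check.

```python
import numpy as np

def consistency_weights(w_tilde, w_hat, grad_u_hat, grad_v_hat):
    """Per-pixel weights c in {0, 1}.

    0 where the forward-backward check fails (disocclusion) or a motion
    boundary is detected, 1 elsewhere. `w_tilde` is the warped forward flow,
    `w_hat` the backward flow, both of shape (H, W, 2); the gradient arrays
    hold the spatial derivatives of each backward-flow component.
    """
    sq = lambda a: (a ** 2).sum(axis=-1)  # squared vector norm per pixel
    disocc = sq(w_tilde + w_hat) > 0.01 * (sq(w_tilde) + sq(w_hat)) + 0.5
    boundary = sq(grad_u_hat) + sq(grad_v_hat) > 0.01 * sq(w_hat) + 0.002
    return np.where(disocc | boundary, 0.0, 1.0)
```

Where the warped forward flow is approximately the negative of the backward flow and the flow field is smooth, the weight is one and the temporal penalty is active.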
We also experimented with the more robust absolute error instead of squared error for the temporal consistency loss; results are shown in section A.4.
4.3 Long-term consistency
The short-term model only accounts for consistency between adjacent frames. Therefore, areas that are occluded in some frame and disoccluded later will likely change their appearance in the stylized video. This can be counteracted by taking the long-term motion into account, i.e., not only penalizing deviations from the previous frame, but also from temporally distant frames. Let $J$ denote the set of indices each frame should take into account relative to the frame number, e.g., with $J = \{1, 2, 4\}$, frame $i$ takes frames $i-1$, $i-2$, and $i-4$ into account. The loss function with long-term consistency is given by:

$\mathcal{L}_{longterm}\big(f^{(i)}, a, x^{(i)}\big) = \alpha\,\mathcal{L}_{content}\big(f^{(i)}, x^{(i)}\big) + \beta\,\mathcal{L}_{style}\big(a, x^{(i)}\big) + \gamma \sum_{j \in J:\, i-j \geq 1} \mathcal{L}_{temporal}\big(x^{(i)},\, \omega_{i-j}^{i}(x^{(i-j)}),\, c_{long}^{(i-j,i)}\big)$
The weights $c_{long}^{(i-j,i)}$ are adjusted such that each pixel is only connected to the closest possible frame from the past. Since the optical flow computed over more frames is more erroneous than over fewer frames, this results in nicer videos. This is implemented as follows: let $c^{(i-j,i)}$ be the weights for the flow between image $i-j$ and image $i$, as defined for the short-term model. The long-term weights are then computed as:

$c_{long}^{(i-j,i)} = \max\Big( c^{(i-j,i)} - \sum_{k \in J:\, k < j} c^{(i-k,i)},\ 0 \Big)$
where the maximum is applied element-wise. This means we first apply the usual short-term constraint. For pixels in disoccluded regions we look further into the past until we find a frame in which these pixels have consistent correspondences. An empirical comparison of $c$ and $c_{long}$ is shown in the supplementary video (see section A.1).
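The element-wise maximum above is a one-liner; the following illustrative sketch (function name ours) computes the long-term weights for one offset j, given the short-term weights for all offsets.

```python
import numpy as np

def long_term_weights(short_term, j, J):
    """Long-term weights for offset j.

    `short_term` maps each offset k to its per-pixel weight array
    c^{(i-k,i)}; the weights of all closer frames (k in J, k < j) are
    subtracted and the result is clipped at 0 element-wise.
    """
    closer = sum(short_term[k] for k in J if k < j)
    return np.maximum(short_term[j] - closer, 0.0)
```

A pixel thus contributes to the loss for a distant frame only if no closer frame already provides a consistent correspondence for it.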
4.4 Multi-pass algorithm
We found that the output image tends to have less contrast and to be less diverse near image boundaries than in other areas of the image. This effect appears particularly in dynamic videos with much camera motion. Disocclusion at image boundaries constantly creates small areas with new content that moves towards other parts of the image. This leads to a lower image quality over time when combined with our temporal constraint. Therefore, we developed a multi-pass algorithm that processes the whole sequence in multiple passes and alternates between the forward and backward direction. Every pass consists of a relatively low number of iterations without full convergence. At the beginning, we process every frame independently. After that, we blend frames with non-disoccluded parts of previous frames warped according to the optical flow, then run the optimization algorithm for some iterations initialized with this blend. We repeat this blending and optimization to convergence.
Let $x'^{(i)}(j)$ be the initialization of frame $i$ in pass $j$ and $x^{(i)}(j)$ the corresponding output after some iterations of the optimization algorithm. When processed in forward direction, the initialization of frame $i$ is created as follows:

$x'^{(i)}(j{+}1) = \delta\, c^{(i-1,i)} \odot \omega_{i-1}^{i}\big(x^{(i-1)}(j{+}1)\big) + \big(\bar{\delta}\,\mathbf{1} + \delta\,\bar{c}^{(i-1,i)}\big) \odot x^{(i)}(j)$
Here $\odot$ denotes element-wise vector multiplication, $\delta$ and $\bar{\delta} = 1 - \delta$ are the blend factors, $\mathbf{1}$ is a vector of all ones, and $\bar{c}^{(i-1,i)} = \mathbf{1} - c^{(i-1,i)}$.
Analogously, the initialization for a backward direction pass is:

$x'^{(i)}(j{+}1) = \delta\, c^{(i+1,i)} \odot \omega_{i+1}^{i}\big(x^{(i+1)}(j{+}1)\big) + \big(\bar{\delta}\,\mathbf{1} + \delta\,\bar{c}^{(i+1,i)}\big) \odot x^{(i)}(j)$
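The forward-direction blend can be sketched in one line. This is an illustrative sketch (names and the default blend factor are ours): `prev_out_warped` is the warped output of the previous frame in the current pass, `cur_out` the output of this frame from the last pass, and `c` the per-pixel consistency weight.

```python
import numpy as np

def forward_pass_init(prev_out_warped, cur_out, c, delta=0.5):
    """Blend warped previous-frame output with this frame's last-pass output.

    Per pixel, the two coefficients delta*c and (1 - delta + delta*(1 - c))
    sum to one, so the result is a convex combination of the two inputs.
    """
    return delta * c * prev_out_warped + (1 - delta + delta * (1 - c)) * cur_out
```

Where the flow is unreliable (c = 0), the initialization falls back entirely to the frame's own previous output, which lets the optimizer rebuild those regions.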
The multi-pass algorithm can be combined with the temporal consistency loss described above. We achieved good results when we disabled the temporal consistency loss in several initial passes and enabled it in later passes after the images had stabilized.
5 Fast coherent video style transfer
5.1 Consistency by architecture
Before introducing temporal consistency constraints into a network based approach, we investigated whether the temporal consistency of the output of the stylization network can be improved simply by using an architecture that is robust to transformations of the input image. Figure 3 shows that the network of Johnson et al. is invariant to shifts of exactly four pixels, but its output is significantly disturbed by smaller shifts. The loss of shift invariance is caused by the use of downsampling (strided) layers in the network, with a total downsampling factor of 4.
Shift invariance can be retained by avoiding downsampling and using a different technique for multi-scale context aggregation. This can be achieved with dilated convolutions YuKoltun2016 , which expand the convolutional filters instead of downsampling the feature maps. The experiments in Section 7.6 show that this indeed leads to improved temporal consistency, but at the cost of a heavily increased computational burden, especially in terms of memory. This makes the approach impractical under most circumstances. Furthermore, this technique does not account for transformations that are more complex than axis-aligned shifts. Thus, in the following we present more efficient and more general techniques to improve temporal consistency in network based stylization.
5.2 Training with the prior image
Our main approach builds upon the architecture by Johnson et al. Johnson2016 but extends the network to take additional inputs. In addition to the content image, the network gets an incomplete image representing prior knowledge about the output, which we refer to as the prior image, as well as a per-pixel mask telling the network which regions of the prior image are valid. For video stylization, the prior image is the previously stylized frame warped according to the optical flow we calculated for the original video sequence. The mask is given by the occlusions between two frames and the motion boundaries. Occlusions are detected through a forward-backward consistency check, as described in the previous section, i.e., we calculate the optical flow both in forward and backward direction and test whether both estimates agree. At pixels where the estimates do not agree, and at motion boundaries, we set the certainty mask to zero; everywhere else, we set it to one.
The goal of the network is to produce a new stylized image, where the combination of perceptual loss and deviation from the prior image (in non-occluded regions) is minimized.
The network with parameters $\Theta$ takes as input a frame $f^{(t)}$ and the previously generated frame $x^{(t-1)}$, warped and masked by the occlusion mask, and yields the output $x^{(t)}$. We can apply this function recursively to obtain a video. The first frame in the sequence is generated by an image network.
Ideally, the network parameters $\Theta$ would minimize the video loss function, as defined in Section 4.2, for an arbitrary video with $N$ frames:

$\min_{\Theta}\ \mathbb{E}\Big[ \sum_{t=2}^{N} \mathcal{L}_{shortterm}\big(f^{(t)}, a, x^{(t)}\big) \Big]$

where $x^{(t)} = \Theta\big(f^{(t)},\, c^{(t-1,t)} \odot \omega_{t-1}^{t}(x^{(t-1)})\big)$ for a given pre-trained parameter set $\Theta$.
However, joint optimization for a whole video is computationally infeasible. Hence, we resort to recursively optimizing two-frame losses. Our basic approach is illustrated in Figure 4. The network is trained to generate a successor frame to the first frame of a video:
We found this approach to yield unsatisfactory results when applied recursively to stylize several successive frames of a video. While the network learns to fix occlusions and to preserve the prior image as much as possible, there is nothing that prevents the propagation of errors and the degeneration of the image quality over time. In particular, we observed blurring and the loss of content accuracy in the image. In the following we introduce two improvements to avoid the degeneration of image quality.
5.2.1 Mixed training
Our first approach to counteract the propagation of errors aims at forcing the network to refine the stylization at every step, rather than just copying the prior image. To this end, we randomly add training samples without a prior image, i.e., the network has to generate a completely new image. For these samples, we set the occlusion mask to zero everywhere, which also disables the temporal loss, and the network computes $\Theta(f^{(t)}, \mathbf{0})$, where $\mathbf{0}$ is a vector of the same dimension as the input image with zeros everywhere.
We increase the weight for the consistency loss proportionally to the percentage of empty-prior images. This aims at still achieving temporal consistency at test time while focusing on content generation during training. We experimented with various fractions of zero-prior samples; a moderate fraction gave good results.
5.2.2 Multi-frame training
Our second approach to improve longer-term stylization presents more than just the first two frames of a sequence during training: we recursively pass multiple subsequent frames through the network, i.e., training includes the potentially degenerated and blurred output $x^{(t-1)}$ as the prior image and requires the network to generate a non-degenerated stylization of frame $t$. This forces the network to learn not only to copy the prior image but also to correct degeneration. If the network did not correct errors and blurriness in the prior image, this would lead to a large loss; thus, the overall loss function eventually overrules the temporal consistency constraint.
Formally, we generalize the objective function shown in Equation (5.2) to a sum over $n$ frames:

$\min_{\Theta}\ \mathbb{E}\Big[ \sum_{t=2}^{n} \mathcal{L}_{shortterm}\big(f^{(t)}, a,\, \Theta\big(f^{(t)},\, c^{(t-1,t)} \odot \omega_{t-1}^{t}(x^{(t-1)})\big)\big) \Big]$

Since the network does not produce meaningful stylizations in the beginning of the training phase, we first start training the network with $n = 2$ and gradually increase $n$ in later iterations.
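The recursive pass through the sequence can be sketched as a short skeleton. This is illustrative, not the training code: `net` and `warp_fn` are placeholders for the style network and the flow-based warping, and in training the two-frame losses would be accumulated over the returned outputs.

```python
def multi_frame_step(net, warp_fn, frames, flows, masks, first_styled):
    """Recursively stylize frames[1..], feeding each output back as prior.

    `net(frame, prior, mask)` is the style network, `warp_fn(img, flow)`
    warps the previous output, and `masks[t]` zeroes the prior in occluded
    regions. `first_styled` is the first frame's stylization (from an
    image network). Returns all generated outputs.
    """
    outputs = [first_styled]
    for t in range(1, len(frames)):
        prior = masks[t] * warp_fn(outputs[-1], flows[t])
        outputs.append(net(frames[t], prior, masks[t]))
    return outputs
```

Because later outputs depend on earlier, possibly degraded ones, the loss on the final frames penalizes accumulated blur, which is exactly the training signal multi-frame training adds.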
5.2.3 Training data
We used the Hollywood2 video scene dataset DBLP:conf/cvpr/MarszalekLS09 , a collection of 570 training videos. Some videos contain multiple shots, so we split the videos at shot boundaries and extracted the distinct shots from the dataset. Since the dataset contains many frame pairs with little motion, we only took the five most differing frame tuples per shot: we computed the squared distance between subsequent frames of each tuple and kept the five tuples ranking highest on that measure. The size of a tuple corresponds to the five steps in our multi-frame training approach.
Additionally, we created a large synthetic set of video sequences based on the training images of the Microsoft COCO dataset coco04 to increase the diversity in our training data. To create each of these sequences, we selected a window from the image and translated it by a bounded random offset per frame to create consecutive frames. Moreover, we used a zoom-out effect, i.e., the window size was increased by a bounded random amount in either direction when generating a following frame, with all crops being resized to the dimensions of the original window. Translation strength and resize factor were sampled uniformly and independently for each axis of each training sample at training time. The advantage of the synthetic data is that it comes with ground truth optical flow and occlusion maps.
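The window sampling for a synthetic sequence can be sketched as follows. This is an illustrative sketch: the window size, shift bound, and zoom bound are placeholder values, not the ones used for training, and crops would later be resized back to the original window size.

```python
import random

def synthetic_windows(img_w, img_h, win, n_frames,
                      max_shift=32, max_zoom=16):
    """Sample crop boxes (x, y, w, h) for one synthetic sequence.

    A random per-axis translation (dx, dy) and a per-axis zoom-out
    (growing window, zx, zy >= 0) are applied cumulatively per frame;
    coordinates are clamped to stay non-negative.
    """
    x = random.randint(0, img_w - win - max_shift - max_zoom * n_frames)
    y = random.randint(0, img_h - win - max_shift - max_zoom * n_frames)
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    zx = random.randint(0, max_zoom)
    zy = random.randint(0, max_zoom)
    boxes = []
    for t in range(n_frames):
        boxes.append((max(0, x + t * dx), max(0, y + t * dy),
                      win + t * zx, win + t * zy))
    return boxes
```

Since every crop is an axis-aligned window of the same still image, the ground truth flow between consecutive frames follows directly from the sampled translation and zoom parameters.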
6 Spherical images and videos
Virtual reality (VR) applications are becoming increasingly popular, and the demand for image processing methods applicable to spherical images and videos is rising. Spherical media are typically distributed via a 2D projection. The most common format is the equirectangular projection. However, this format does not preserve shapes: the distortion of the projection becomes very large towards the poles of the sphere. Such non-uniform distortions are problematic for style transfer. Therefore, we work with subdivided spherical images that consist of multiple rectilinear projections. In particular, we use the cubic projection, which represents a spherical image with six non-distorted square images, as illustrated in Figure 6. For style transfer in this regime, the six cube faces must be stylized such that their cut edges are consistent, i.e., the style transfer must not introduce false discontinuities along the edges of the cube in the final projection. Since applications in VR environments must run in real time, we only consider the fast, network-based approach in this section.
6.1 Border consistency
In order to implement the consistency constraint, we extend the cube faces such that they overlap each other in the proximity of the face boundaries; see Figure 7 (top).
We stylize the cube faces sequentially and enforce subsequent cube faces to be consistent with already stylized edges from adjacent cube faces. This is similar to the video style transfer network, and we can use the same techniques described in detail in the previous section, such as mixed training and multi-frame training. Instead of using a previously generated frame from a video as the prior image, we use regions of adjacent cube faces overlapping with the current cube faces as prior image. The regions are transformed to match the cube face projection; see Figure 7 (bottom) and the Appendix.
6.2 Training dataset
A style transfer network trained with regular video frames did not produce good results for stylizing cube faces, where a perspective transformed border of the image is given as the prior image. Therefore, we extended our dataset by training samples particularly designed for this task.
We stylized and transformed the image borders of samples from the Microsoft COCO dataset in such a way that they mimic the perspective-transformed borders of a cube-face prior image. This process is illustrated in Figure 8. We created four such training samples for each image in COCO.
6.3 Spherical videos
The extensions for video style transfer and for spherical images can be combined to process spherical videos. This yields two constraints: (1) each cube face should be consistent along the motion trajectory; (2) neighboring cube faces must have consistent boundaries.
To implement the first constraint, we calculate optical flow for each cube face separately, which allows us to warp stylized cube faces the same way as for regular planar videos. To implement the second constraint, we blend both the warped image from the last frame and the transformed border of already stylized neighboring cube faces.
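The combination of the two priors can be sketched as follows. This is an illustrative sketch under our own assumptions: where a neighboring, already stylized cube face provides a transformed border, that border is used; elsewhere the prior falls back to the warped stylized face from the previous frame.

```python
def spherical_prior(warped_prev, border_prior, border_mask):
    """Blend the two priors for one spherical video frame (per pixel).

    `border_mask` is 1 where a transformed border from an already stylized
    neighboring cube face is available; `warped_prev` is the flow-warped
    stylization of this face from the previous frame.
    """
    return [bm * bp + (1 - bm) * wp
            for bm, bp, wp in zip(border_mask, border_prior, warped_prev)]
```

The resulting image serves as the prior input to the style transfer network, together with a mask that is zero wherever neither prior is valid.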
7 Experiments on video style transfer
In this section, we present the experimental results obtained with different versions of the presented methodology. Since it is hard to visualize qualitative differences in video content and spherical images appropriately on paper, we additionally refer to the supplemental video linked in Section A.1.
We used the MPI Sintel Dataset Butler:ECCV:2012 , including 23 training scenes with 20 to 50 frames of resolution 1024 x 436 pixels per scene. To keep the runtime at an acceptable level, we tested our optimization-based approach on a subset of five diverse scenes. Our feed-forward approach, being orders of magnitude faster, was additionally evaluated on the full set.
The Sintel dataset provides ground truth optical flow and ground truth occlusion areas, which allows a quantitative study. To evaluate short-term consistency, we warped each stylized frame at time $t$ back with the ground truth flow and computed the mean squared difference to the stylized frame $t-1$ in regions without disocclusion. Each method was evaluated with different style templates, as shown in Section A.2.
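The consistency metric can be sketched as follows (illustrative; the function name is ours). `warp_fn` warps the stylized frame back with the ground truth flow, and the squared difference is averaged only over pixels where the occlusion mask is one.

```python
import numpy as np

def short_term_error(stylized_t, stylized_prev, gt_flow, occ_mask, warp_fn):
    """Mean squared difference between frame t warped back with the ground
    truth flow and the stylized frame t-1, restricted to non-disoccluded
    pixels (occ_mask == 1)."""
    warped = warp_fn(stylized_t, gt_flow)
    diff = (warped - stylized_prev) ** 2 * occ_mask
    return diff.sum() / occ_mask.sum()
```

A perfectly consistent stylization yields an error of zero under this metric, while independently stylized frames produce large values.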
7.2 Implementation details
7.2.1 Optimization-based approach
Our implementation (https://github.com/manuelruder/artistic-videos) is based on a Torch Collobert_NIPSWORKSHOP_2011 implementation called neural-style (https://github.com/jcjohnson/neural-style). We used the following layers of the VGG-19 network Simonyan14 for computing the losses: relu4_2 for the content and relu1_1, relu2_1, relu3_1, relu4_1, relu5_1 for the style. The energy function was minimized using L-BFGS. For precise evaluation we incorporated a strict stopping criterion: the optimization was considered converged if the loss did not change noticeably during 50 iterations. With our temporal constraint, subsequent frames typically converged after far fewer iterations than the first frame, depending on the amount of motion and the complexity of the style image. Using a looser convergence threshold cuts the number of iterations and the running time roughly in half, and we found that it still produces reasonable results in most cases. However, we used the stricter criterion in our experiments for the sake of accuracy.
For videos of resolution we used weights and for the content and style losses, respectively (default values from neural-style), and weight for the temporal losses. However, the weights should be adjusted if the video resolution is different. We provide the details in section A.2.
We ran our multi-pass algorithm with iterations per pass and set . Since at least passes are needed for good results, this algorithm needs in total more computation time than the single-pass approaches.
Optical flow was computed with the DeepMatching, DeepFlow, and EpicFlow implementations provided by the authors of these methods. We used the “improved-settings” flag in DeepMatching 1.0.1 and the default settings for DeepFlow 1.0.1 and EpicFlow 1.00.
7.2.2 Network-based approach
Table 1: Number of frames used per training iteration of the multi-frame model.

| Iteration | Number of frames |
| 0 – 60,000 | 2 |
| 60,001 – 70,000 | 3 |
| 70,001 – 120,000 | 5 |
Table 2 (excerpt; one column per test scene, last column: runtime per frame):

| Optimization, random init | 0.019 | 0.027 | 0.037 | 0.0180 | 0.023 | 540 s |
| Optimization, prev. frame init | 0.010 | 0.018 | 0.028 | 0.0041 | 0.014 | 260 s |

Table 3 (excerpt; optimization / network / undecided votes per style, last column: total):

| Better reflects the style | 44/55/0 | 28/72/0 | 34/61/1 | 39/60/1 | 46/53/1 | 191/301/3 |
Our network-based style transfer implementation (to be made available later) is based on the Torch implementation by Johnson et al. Johnson2016, called fast-neural-style (https://github.com/jcjohnson/fast-neural-style). The architecture follows the design principles of Johnson et al., and we used the following layers of the VGG-16 network for the perceptual loss: relu3_3 for the content and relu1_2, relu2_2, relu3_3, relu4_3 for the style target. In the proposed style transfer network, we replaced the upsampling convolution by two nearest-neighbor upsampling layers with a regular convolution layer in between. This greatly reduces the checkerboard patterns visible in the original work. In addition, we replaced batch normalization by instance normalization, as proposed in DBLP:journals/corr/UlyanovVL16, and used a single padding layer in the beginning. The network was trained with the ADAM optimizer, a learning rate of and a batch size of for the strided convolution architecture, and a learning rate of with a batch size of for the dilated convolution variant due to memory restrictions. We trained the two-frame model for 60,000 iterations. The multi-frame model was fine-tuned from a pre-trained two-frame model; the number of iterations and frames are shown in Table 1. For spherical image fine-tuning, we started from the trained multi-frame model and trained for another 40,000 iterations using the specialized dataset described in Section 6.2. To calculate optical flow for the videos from the Hollywood2 dataset, we used FlowNet 2.0 IMKDB17, a recent state-of-the-art method for optical flow estimation running at interactive frame rates. To overcome GPU memory limitations when training with dilated convolutions, we downsampled the frames by a factor of . We set the weight of the temporal consistency loss to when using mixed training and to otherwise.
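The upsampling change can be illustrated with a minimal NumPy sketch: nearest-neighbor upsampling followed by a regular convolution gives every output pixel a uniform input neighborhood, unlike a strided transposed convolution whose uneven kernel overlap causes checkerboard artifacts. A single shared averaging kernel stands in for the learned convolution; this is not the authors' network code:

```python
import numpy as np

def nn_upsample(x, factor=2):
    """Nearest-neighbor upsampling of an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def conv2d_same(x, kernel):
    """Naive 'same' 2-D convolution of (H, W, C) with a (k, k) kernel
    shared across channels (illustrative, not optimized)."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + k, j:j + k] * kernel[..., None]).sum(axis=(0, 1))
    return out

# Upsample, then convolve: spatial resolution doubles without the
# stride-dependent interference pattern of a transposed convolution.
feat = np.random.rand(8, 8, 3)
out = conv2d_same(nn_upsample(feat), np.full((3, 3), 1.0 / 9.0))
```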
7.3 Comparison to baselines
We benchmarked the following video stylization approaches: the optimization-based approach with per-frame random initialization, with initialization from the previous frame, and with our short-term consistency loss, where optical flow was provided by DeepFlow. Moreover, we ran the frame-wise network-based approach and our multi-frame network.
Quantitative results are shown in Table 2, averaged over the style templates (also used by Gatys et al.) shown in Section A.2.1. Among optimization-based approaches, the most straightforward method – processing every frame independently – performs roughly an order of magnitude worse than the version with temporal consistency. Initialization with the previous stylized frame leads to better results than random initialization, but is still to times worse than our approach, depending on the scene.
The baseline for our network-based approach (independent per-frame) shows less flickering than the previous baseline, since this approach is more deterministic by design. Nevertheless, we consistently outperform this baseline by a factor of two or more. Interestingly, the network-based approach even outperforms the optimization-based approach in some cases while being much faster. The network-based method is more accurate on the ambush scenes, which contain very large motion and thus more errors in the optical flow. We conclude that the network-based method is able to partially compensate for these errors in the optical flow.
7.4 Results of a user study
In the previous section we evaluated only the temporal consistency, not the perceptual quality of the stylized videos. We therefore performed a comprehensive user study on Amazon Mechanical Turk (AMT) to evaluate the perceptual quality of videos stylized with our two best approaches: optimization-based with DeepFlow and network-based with multi-frame and mixed training (for details see Sections 7.5 and 7.6). We performed trials for each combination of the five tested scenes and the five tested styles, resulting in a total of trials. In each trial, a user was presented with the style template, a still frame from the original video, and two stylized versions of the video clip: one processed with the optimization-based approach and the other processed with a network. The order of the videos was randomized. Inspired by Gupta2017, users were asked three questions: “Which of the two videos better reflects the style of the painting?”, “Which of the two videos has less flickering?” and “Overall, which of the two videos do you think looks better?”. Users had to select either one of the videos or “undecided”.
Table 3 shows the results of the user study, separately for each style image. There was a quite clear preference towards video clips generated with the network approach, even though this approach is just an approximation of the much costlier optimization problem. This could be explained by the fact that our network approach is more robust to errors in the optical flow and can fill disocclusions more consistently due to less randomness. This reduces artifacts compared to the optimization-based approach.
7.5 Detailed analysis of the optimization-based approach
We performed an ablation study of the variants of our optimization-based approach to show the contribution of the different components and the choice of the optical flow method. The results are provided in Table 4. Initialization with the motion-compensated previous frame already performs much better than initializing just with the previous frame. The explicit temporal penalty additionally improves over this result. Comparing the two optical flow techniques, on average DeepFlow yields slightly better results than EpicFlow.
For the proposed long-term consistency and multi-pass techniques, we cannot run a quantitative evaluation, since no long-term ground-truth optical flow is available. Instead, we present qualitative results in Fig. 9 and Fig. 10, respectively. Please see also the supplementary video.
Fig. 9 shows a scene with a person walking through the scene. Without our long-term consistency model, the background looks very different after the person passed by. The long-term consistency model keeps the background unchanged. Fig. 10 shows another scene with fast camera motion. The multi-pass algorithm avoids the artifacts introduced in new image areas over time by the basic algorithm.
7.6 Detailed analysis of the network-based approach
Table 5 shows the results of an ablation study for different variants of our network-based approach. We measure the temporal, style, and content losses. The latter two losses measure the propagation of error: if the image starts to blur or degenerate over time, the style and content losses increase. The numbers show results on the complete Sintel training dataset, averaged over all scenes, frames, and styles. We used a different set of style templates, inspired by Johnson2016; see Section A.2.2. Since the absolute losses vary heavily, we averaged the loss as a percentage of the loss of the baseline, which is the independent per-frame processing approach.
Dilated convolution, our proposal to achieve consistency without using prior images, yields a noticeable improvement in temporal consistency while keeping the style and content losses constant, see Table 5. To overcome GPU memory limitations, the frames were downsampled by a factor . To get a fair comparison, we also downsampled the frames for strided convolution and show results relative to this downsampled baseline.
The two-frame training procedure yields very good temporal consistency, but at the same time leads to a large increase in the content and the style loss. The best results are achieved with the mixed and multi-frame training. Both techniques are complementary and bring down the content and style loss compared to the two-frame training, while only marginally reducing temporal consistency.
Figure 12 shows a qualitative comparison; another comparison with different style templates can be seen in the appendix (Figure A.5). The difference between the multi-frame and the mixed training is very subtle, while the two-frame, non-mixed training produces clearly inferior stylizations. An example of how our approach minimizes inconsistencies is shown in Figure 11.
8 Experiments on spherical images and videos
To evaluate spherical images and videos, we downloaded 20 video scenes from YouTube containing different content (seven cityscape, six landscape and seven building interior scenes). For the spherical still image evaluation, we took three frames from each scene at a distance of approximately three seconds, i.e. 60 images.
8.1 Spherical images
To evaluate how well the individual cube faces fit together, i.e., whether there are false discontinuities after stitching the cube faces, we measured the gradient magnitude at the cut edges (two pixel width). We report the gradient magnitude relative to the overall gradient magnitude in the image. A value of 1 means there is no unusual gradient in that region; values larger than 1 indicate more gradients compared to the average gradient of the image.
Subsequent cube faces generated with a neighboring cube face border as prior image sometimes showed false discontinuities and/or unusually high gradients at the inner edges of the prior image, where the overlap between the images ends. Therefore, we also measured gradients along the inner edges of the prior image in the stylized cube faces (four pixel width). Our detailed evaluation metric is shown in the appendix.
Results are shown in Table 6. Introducing edge consistency clearly reduces the false gradients along cut edges. However, it introduces new gradients along the inner edge of the prior image, most noticeably in regions with little structure. Thus, on average, the magnitude of artificial gradients does not improve noticeably. Only when we fine-tuned the model on our specialized dataset for spherical images, as described in Section 6.2, did the gradient error improve overall. Fig. 13 shows a qualitative comparison. In textured regions, fine-tuning is not necessary; in homogeneous regions, it helps to avoid artificial edges. Interestingly, fine-tuning introduces style features to cover the false edges, i.e., the gradient magnitude in our quantitative measure is still increased, but the disturbing visual effect is reduced and only noticeable when large parts of the image are untextured, for example the top cube face showing the sky.
| Loss | Ind. per face | Video net | Video net, ft |
| Content | 100 % | 101 % | 98 % |
| Style | 100 % | 104 % | 98 % |
8.2 Spherical videos
Table 7 shows numerical results for stylized panoramic videos at different frames of the video. We can see that unusually high gradients decrease over time, demonstrating the algorithm’s ability to recover from impaired prior images. We also observe this effect when the video is mostly still and only few objects are moving.
| Loss | Frame #1 | Frame #2 | Frame #10 |
8.3 Filling missing regions
To further reduce unusual gradients, we experimented with different ways to fill in missing or disoccluded regions in the prior image. Originally, we masked missing regions with zeros, which is equivalent to the mean pixel of the ImageNet dataset (in the pretrained VGG models, which we use for our perceptual loss, values are centered around the mean pixel of the ImageNet dataset). By filling missing regions with random noise instead, we avoid the network falsely observing sharp edges along the boundaries of missing regions.
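The two fill-in strategies can be sketched as follows; the helper name and signature are our own illustration, not the authors' code:

```python
import numpy as np

def fill_missing(prior, mask, mode="noise", rng=None):
    """Fill masked (missing/disoccluded) pixels of the prior image.

    mode="zeros" mimics the original scheme (zeros, i.e. the ImageNet
    mean pixel in the VGG-centered representation); mode="noise" fills
    with random values so the network does not see a sharp artificial
    edge along the region boundary."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = prior.copy()
    hole = mask.astype(bool)
    if mode == "zeros":
        out[hole] = 0.0
    else:
        out[hole] = rng.normal(0.0, 1.0, size=(hole.sum(), prior.shape[-1]))
    return out
```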
Numerical analysis shows that the measure indeed improves slightly from to when filling masked regions with random noise.
Most noticeably, a random fill-in improves a failure case of our algorithm: regions with little structure in the image, see Figure 14. False gradients along the prior image boundary are less visible.
However, we observed that a random fill-in further changes the way the network fills missing regions. To better understand how the fill-in influences the network’s behavior, we performed an experiment where we used random noise as fill-in during training, but zeros at test time. For this experiment, we gave the network an empty prior image, as if the whole image was missing or disoccluded, i.e., the prior image contained either only zeros or only random noise. Figure 15 shows how this changes the output. The network had apparently learned to rely on the noise during training: once we removed this extra source of entropy, the image lost variance, especially in regions with little structure. We conclude that the network has learned to make use of the noise to create a more diverse appearance in regions with little structure.
We measured the runtime of our algorithms on an Nvidia Titan X GPU (Maxwell architecture) at a resolution of pixels.
Using the relaxed convergence threshold of , the optimization-based approach takes around eight to ten minutes for the initial frame initialized with random noise. When initialized with the warped previous frame and combined with our temporal loss, the optimization converges 2 to 3 times faster, taking three minutes on average.
Our feed-forward approach only takes ms to stylize a new frame, including the time needed to pre-process the last frame (masking and warping, done on the GPU as well), but excluding time for I/O operations (e.g. writing the file to disk). This corresponds to a speedup factor of compared to the optimization-based approach. A single image processed with the method of Johnson et al. Johnson2016 needed at that resolution. Training the multi-frame model takes around 24 hours for 120,000 iterations.
For panoramic images, our approach needs seconds to stylize a full frame ( for each cube face with resolution including overlapping regions. Without overlapping, this equals a resolution of per cube face). If the video is not distributed in the cube face projection, additional time for reprojection has to be taken into account.
We presented a set of techniques for optimization-based style transfer in videos: suitable initialization and a loss function that enforces short-term temporal consistency of the stylized video, a loss function for long-term consistency, and a multi-pass approach. As a consequence, we can produce stable and visually appealing stylized videos even in the presence of fast motion and strong occlusion.
Furthermore, we presented an algorithm to train a neural network with a prior image in order to create coherent style transfer with a much lower run time per frame. We were able to prevent the propagation of errors when applying our model recursively: by iterating the network multiple times during training, using a longer sequence of frames, we trained our network to recognize and counteract progressed degeneration. We also generalized our model for spherical images: in this task, the prior image is not the last frame, but an edge from a neighboring cube face, perspectively transformed by . We could show that our model is able to adapt to such a prior image when there is enough structure in the image.
- (1) Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: ECCV, pp. 611–625 (2012)
- (2) Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. CoRR abs/1703.09211 (2017)
- (3) Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: A matlab-like environment for machine learning. In: BigLearn, NIPS Workshop (2011)
- (4) Gatys, L.A., Ecker, A.S., Bethge, M.: Texture synthesis using convolutional neural networks. In: NIPS, pp. 262–270 (2015)
- (5) Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: CVPR, pp. 2414–2423 (2016)
- (6) Ghiasi, G., Lee, H., Kudlur, M., Dumoulin, V., Shlens, J.: Exploring the structure of a real-time, arbitrary neural artistic stylization network
- (7) Gupta, A., Johnson, J., Alahi, A., Fei-Fei, L.: Characterizing and improving stability in neural style transfer. CoRR 1705.02092 (2017)
- (8) Hays, J., Essa, I.: Image and video based painterly animation. In: Proceedings of the 3rd International Symposium on Non-photorealistic Animation and Rendering, NPAR, pp. 113–120 (2004)
- (9) Huang, H., Wang, H., Luo, W., Ma, L., Jiang, W., Zhu, X., Li, Z., Liu, W.: Real-time neural style transfer for videos. In: CVPR (2017)
- (10) Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (2017)
- (11) Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
- (12) Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV, pp. 694–711 (2016)
- (13) Li, C., Wand, M.: Combining markov random fields and convolutional neural networks for image synthesis. In: CVPR, pp. 2479–2486 (2016)
- (14) Li, C., Wand, M.: Precomputed real-time texture synthesis with markovian generative adversarial networks. In: ECCV, pp. 702–716 (2016)
- (15) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
- (16) Litwinowicz, P.: Processing images and video for an impressionist effect. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, pp. 407–414 (1997)
- (17) Luan, F., Paris, S., Shechtman, E., Bala, K.: Deep photo style transfer. arXiv preprint arXiv:1703.07511 (2017)
- (18) Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR, pp. 2929–2936 (2009)
- (19) Nikulin, Y., Novak, R.: Exploring the neural algorithm of artistic style. CoRR abs/1602.07188 (2016)
- (20) O’Donovan, P., Hertzmann, A.: Anipaint: Interactive painterly animation from video. Transactions on Visualization and Computer Graphics 18(3), 475–487 (2012)
- (21) Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: CVPR, pp. 1164–1172 (2015)
- (22) Ruder, M., Dosovitskiy, A., Brox, T.: Artistic style transfer for videos. In: GCPR, pp. 26–36 (2016)
- (23) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2015)
- (24) Sundaram, N., Brox, T., Keutzer, K.: Dense point trajectories by gpu-accelerated large displacement optical flow. In: ECCV, pp. 438–451 (2010)
- (25) Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.S.: Texture networks: Feed-forward synthesis of textures and stylized images. In: ICML, pp. 1349–1357 (2016)
- (26) Ulyanov, D., Vedaldi, A., Lempitsky, V.S.: Instance normalization: The missing ingredient for fast stylization. CoRR abs/1607.08022 (2016)
- (27) Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: Large displacement optical flow with deep matching. In: ICCV, pp. 1385–1392 (2013)
- (28) Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
- (29) Zhang, H., Dana, K.J.: Multi-style generative network for real-time transfer. CoRR abs/1703.06953 (2017)
Appendix A Appendix
A.1 Supplementary videos
A supplementary video, available at https://youtu.be/SKql5wkWz8E, shows moving sequences corresponding to figures from this paper, plus a number of additional results:
Results of the optimization-based algorithm on different sequences, including a comparison of the basic algorithm and the multi-pass and long-term algorithm
Comparison of “naive” () and “advanced” () weighting schemes for long-term consistency
Results of the feed-forward algorithm on different sequences, including results from different techniques to reduce the propagation of errors.
Comparison between optimization-based and fast style transfer
A demonstration of our panoramic video algorithm
Another video, showing a full panoramic video in , can be found at https://youtu.be/pkgMUfNeUCQ.
A.2 Style images and parameter configuration
A.2.1 Optimization-based approach
A.2.2 Network-based approach
Figure A.2 shows the style images used for the detailed analysis of the network-based approach and for spherical image and video evaluation, inspired by the selection of styles by Johnson2016 . In Table 8, the parameters for training individual models are shown.
A.3 Effect of errors in optical flow estimation
The quality of results produced by our algorithm strongly depends on the quality of optical flow estimation. This is illustrated in Figure A.3. When the optical flow is correct (top right region of the image), the method manages to repair the artifacts introduced by warping in the disoccluded region. However, erroneous optical flow (tip of the sword in the bottom right) leads to degraded performance. The optimization process partially compensates for the errors (sword edges get sharp), but cannot fully recover.
A.4 Robust loss function for temporal consistency in the optimization-based approach
We tried using the more robust absolute error instead of the squared error for the temporal consistency loss; the weight for the temporal consistency was doubled in this case. Results are shown in Figure A.4. While in some cases (left example in the figure) the absolute error leads to slightly improved results, in other cases (right example in the figure) it causes large fluctuations. We therefore used the mean squared error in all our experiments.
A.5 Batch and instance normalization
Batch normalization Ioffe2015BatchNA normalizes the feature activations for individual feature maps in a mini-batch after each layer of a neural network. This has been found to be advantageous especially for very deep neural networks, where the variance of the output is likely to shift during training. Let $x \in \mathbb{R}^{N \times C \times H \times W}$ be a tensor with batch size $N$, $C$ feature channels, and spatial dimensions $H \times W$, and let $x_{nchw}$ be the $nchw$-th element in this tensor. Then the per-channel mean and variance are calculated as $\mu_c = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}$ and $\sigma_c^2 = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{nchw} - \mu_c \right)^2$.
The batch normalization layer performs the following operation to compute the output tensor $y$:
$$y_{nchw} = \gamma_c \, \frac{x_{nchw} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c,$$
with learnable scale and shift parameters $\gamma$ and $\beta$.
In contrast, instance normalization normalizes every single instance in a batch, that is, contrast normalization is performed for each individual instance. Therefore, the mean and the variance are computed per instance and feature map: $\mu_{nc} = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}$ and $\sigma_{nc}^2 = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{nchw} - \mu_{nc} \right)^2$.
The instance normalization layer performs the following operation to compute the output tensor $y$:
$$y_{nchw} = \gamma_c \, \frac{x_{nchw} - \mu_{nc}}{\sqrt{\sigma_{nc}^2 + \epsilon}} + \beta_c.$$
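The difference between the two schemes reduces to the axes over which mean and variance are computed. A minimal NumPy sketch with the learnable scale and shift parameters omitted:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize per channel over the batch and spatial dimensions.
    x has shape (N, C, H, W)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    """Normalize each instance and channel separately over the spatial
    dimensions, i.e. per-image contrast normalization."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

After instance normalization, every single image in the batch has zero mean and unit variance per feature map, independent of the other images, which is why it suits stylization.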
A.6 Comparison of different methods to reduce the propagation of error in the network approach
A.7 Convergence of our style transfer network
Our network converges without overfitting, as seen in Figure A.6. For the validation loss, we always use the average loss of consecutive frames, processed recursively, so that the validation objective stays constant during the training of the multi-frame model. We used 120,000 iterations for the sake of accuracy, but Figure A.6 also indicates that the training can be stopped earlier in practice.
A.8 Reprojection for border consistency in spherical images
For perspective transformation of a border region we virtually organize the cube faces in a three-dimensional space, so that we can use 3D projection techniques such as the pinhole camera model. According to the pinhole camera model, a point $(x, y, z)$ in three-dimensional Cartesian space is projected onto the target plane as
$$(u, v) = \left( f \, \frac{x}{z},\; f \, \frac{y}{z} \right),$$
where $f$ is the distance from the projection plane to the origin. We arrange the already stylized cube face at an angle of $90^\circ$ to the projection plane to project its border into the plane of another cube face.
A.9 Evaluation metric for spherical images
To evaluate if there are unusually high gradients in a given region compared to the rest of the image, we calculate the ratio of gradient magnitudes in that region to gradient magnitudes in the overall image. We take the maximum color gradient per pixel, i.e., we define the image gradient in $x$-direction as $g_x(i) = \max_c \left| \left( \nabla_x I_c \right)_i \right|$, where $c$ ranges over the color channels. On that basis, we define the ratio
$$r_x(m) = \frac{N}{\sum_i m_i} \cdot \frac{\| m \odot g_x \|_1}{\| g_x \|_1},$$
where $N$ is the dimensionality of the image and $m$ is a binary vector which encodes the region where we want to test for unusual gradients (in $x$-direction), such that $m_i = 1$ in the desired region and $m_i = 0$ everywhere else. By $\odot$ we denote the element-wise multiplication.
The error metric $E$, a weighted average between the horizontal and vertical gradient ratios, is then calculated as
$$E = \frac{|m_x| \, r_x(m_x) + |m_y| \, r_y(m_y)}{|m_x| + |m_y|},$$
where $|m| = \sum_i m_i$ denotes the size of the tested region.
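The gradient ratio described above can be sketched for the horizontal direction as follows (the vertical direction is analogous; the helper name is our own):

```python
import numpy as np

def gradient_ratio(img, region_mask):
    """Ratio of the mean (max-over-channels) horizontal gradient
    magnitude inside `region_mask` to the mean over the whole image.
    A value of 1 means the region is not unusual; values above 1
    indicate elevated gradients, e.g. a false seam between cube faces."""
    gx = np.abs(np.diff(img, axis=1))      # (H, W-1, C) forward differences
    g = gx.max(axis=-1)                    # maximum over color channels
    m = region_mask[:, :-1].astype(bool)   # crop mask to the gradient grid
    return g[m].mean() / g.mean()
```

For a region covering the whole image the ratio is exactly 1; a mask placed over an artificial edge yields a value well above 1.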