Comixify: Transform video into a comics
In this paper, we propose a solution for transforming a video into a comics. We approach this task using a neural style algorithm based on Generative Adversarial Networks (GANs). Several recent works in the field of Neural Style Transfer have shown that producing an image in the style of another image is feasible. We build upon these works and extend the existing set of style transfer use cases with a working application of video comixification. To that end, we train an end-to-end solution that transforms an input video into a comics in two stages. In the first stage, we propose a state-of-the-art keyframe extraction algorithm that selects a subset of frames from the video to provide the most comprehensive video context, and we filter those frames using an image aesthetic estimation engine. In the second stage, the selected keyframes are rendered in a comics style. To provide the most aesthetically compelling results, we select a state-of-the-art style transfer solution and, based on it, implement our own ComixGAN framework. The final contribution of our work is a Web-based working application of video comixification available at http://comixify.ii.pw.edu.pl.
Maciej Pęśko, Adam Svystun, Paweł Andruszkiewicz, Przemysław Rokita and Tomasz Trzciński Faculty of Electronics and Information Technology Warsaw University of Technology Nowowiejska 15/19, 00-665 Warszawa, Poland email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com
December 12, 2018
Keywords Neural Style Transfer, Style Transfer, Comics Style, Comics, Computer Vision, Neural Network
Cartoons and comics have become a popular means of artistic expression worldwide. Unfortunately, only a limited number of talented people with painting or graphics skills can create them while holding to aesthetic standards. What is more, creating valuable comics graphics takes a significant amount of time. An automatic tool that converts videos to comics could revolutionize the way publishers and individuals create comixified content.
To advance the development of such a tool, we propose a working solution for the task of video comixification. We split this problem into two separate ones: (a) frame extraction and (b) style transfer. The goal of the first stage is to select a subset of frames from the video that provides the most comprehensive video context while also being visually attractive to the reader of the comics. In this stage we propose to use a state-of-the-art keyframe extraction algorithm based on reinforcement learning, which we further extend by combining a temporal segmentation method with image aesthetic estimation. In the second stage, we transfer a comics style onto the selected keyframes. To achieve the best results, we analyze the existing style transfer approaches on the specific problem of comics style transfer. We compare their advantages and disadvantages, both qualitatively and quantitatively, to find the most aesthetically pleasing solution. The final contribution of our work is a web-based working application of video comixification available at http://comixify.ii.pw.edu.pl.
The remainder of this paper is organized as follows. In Section 2 we describe related work on style transfer and keyframe extraction. Section 3 describes the details of the method proposed for video comixification. Section 4 presents our web-based application. In Section 5 we draw conclusions and outline plans for further research.
2 Related Work
2.1 Keyframes Extraction
The task of keyframe extraction is closely related to video summarization. Both try to find a subset of input frames that provides a comprehensive representation of the video. In recent years video summarization has gained a significant amount of attention from the research community. This attention may have been sparked by the availability of benchmark datasets, such as SumMe (Gygli et al.) and TVSum (Song et al.). Zhang et al. used Long Short-Term Memory (LSTM) networks to model the variable-range temporal dependency among video frames, so as to derive both representative and compact video summaries. Later, Mahasseni et al. achieved even better results with a novel generative adversarial framework (GAN) consisting of a summarizer and a discriminator. The summarizer was implemented as an autoencoder LSTM network with the objective of first selecting video frames and then decoding the obtained summarization to reconstruct the input video. The discriminator was another LSTM aimed at distinguishing between the original video and its reconstruction from the summarizer. Lastly, Zhou et al. proposed an end-to-end, reinforcement learning based framework with a novel reward function that jointly accounts for the diversity and representativeness of generated summaries. Extensive experiments on two benchmark datasets showed that this unsupervised method not only outperforms other state-of-the-art unsupervised methods, but is also comparable or even superior to most supervised approaches. In contrast to the methods above, in this paper we leverage keyframe extraction and extend it with aesthetic estimation. For this task, we propose two approaches: our popularity estimation engine and an image quality estimation model.
As with video summarization, a considerable amount of research has been carried out in the domain of content popularity estimation. Various aspects of popularity prediction have been studied, such as the relation between user-access patterns and popularity (Almeida et al., Chesire et al.), prediction of the peak popularity time (Jiang et al.) and popularity evolution patterns (Crane and Sornette, Szabó and Huberman, Pinto et al.). Khosla et al. proposed using visual cues in the context of popularity prediction. The authors trained Support Vector Machine models using, among other features, deep neural network outputs. Trzciński et al. used a similar approach as the baseline model in Trzcinski et al., where the authors proposed the use of recurrent neural networks for video popularity prediction.
Image quality assessment is also a popular topic among researchers. Large datasets for aesthetic visual analysis are available, such as AVA (Murray et al.) and TID2013 (Ponomarenko et al.). Many studies (Lu et al., Kao et al., Kim et al.) have shown that extracting high-level features with convolutional neural networks (CNNs) yields significantly better results in the image quality assessment task than previous methods based on hand-crafted features, such as Murray et al. Talebi and Milanfar proposed a novel approach based on predicting the distribution of human opinion scores. They used the squared EMD (earth mover’s distance) loss proposed in Hou et al. and achieved results close to the state of the art with a much less computationally expensive model. In this work we combine the advancements in video summarization and popularity estimation to produce a keyframe selection solution suited to video comixification.
2.2 Style Transfer
Gatys et al. demonstrated that deep convolutional neural networks are able to encode the style information of any given image. Moreover, they showed that the content and style of an image can be separated and treated individually. As a consequence, it is possible to transfer the style characteristics of a given image to another one while preserving the content of the latter. They proposed to exploit the correlations between the features of deep neural networks, expressed as Gram matrices, to capture image style. Furthermore, Y. Li et al. showed that the covariance matrix can be as effective as the Gram matrix in representing image style (Li et al. [2017a]) and theoretically proved that matching the Gram matrices of the neural activations can be seen as minimizing a specific Maximum Mean Discrepancy function (Li et al. [2017b]), which gives more intuition on why the Gram matrix can represent an artistic style.
Since the work of Gatys et al., numerous improvements have been made in the field of Style Transfer. Johnson et al. and Ulyanov et al. [2016a] proposed fast approaches that increase the efficiency of style transfer by three orders of magnitude. Nevertheless, this improvement comes at the price of lower quality of the results. Multiple authors have tried to address this shortcoming (Ulyanov et al. [2016b], Yeh and Tang, Wang et al., Wilmot et al.) or to make the approach more generic so that different styles can be used in one model. The proposed solutions include using a conditional normalization layer that learns normalization parameters for each style (Dumoulin et al.), swapping a given content feature with the closest style feature in its local neighborhood (Chen and Schmidt), directly adjusting the content feature to match the mean and variance of the style feature (Huang and Belongie, Desai, Ghiasi et al.) or building a meta network which takes in the style image and directly produces the corresponding image transformation network (Shen et al.). Other methods include automatic segmentation of the objects and extraction of their soft semantic masks (Zhao et al.) or adjusting feature maps using whitening and coloring transforms (Li et al. [2017c]). There are also works on photo-realistic style transfer (Li et al., Luan et al.), which carry the style of one photo to another while keeping the result as realistic as possible. In addition, many works focus on other, more specific areas of Style Transfer. Examples include Coherent Online Video Style Transfer (Chen et al. [2017a]), which describes an end-to-end network that generates consistent stylized video sequences in near real time, StyleBank (Chen et al. [2017b]), which uses multiple convolution filter banks where each filter bank explicitly represents one style, and Stereoscopic Neural Style Transfer (Chen et al. [2018a]), which addresses 3D and AR/VR content. Recently, Chen, Yang et al. (Chen et al. [2018b]) presented a style transfer method based on Generative Adversarial Networks that appears to work well for the photo cartoonization problem.
It is worth mentioning that existing approaches often suffer from a trade-off between generalization, quality and efficiency. More precisely, the optimization-based approaches handle arbitrary styles with great visual quality, but their computational costs are relatively high. On the other hand, the feed-forward methods run very fast with slightly worse but acceptable quality, but they are limited to a specific, fixed number of styles. Finally, the arbitrary-style methods are fast and support multiple styles, but their quality is often worse than that of the previously mentioned ones.
2.3 Video comixification
The topic of video comixification has also gained some interest in the industry. Google Research has recently launched an Android application called Storyboard (https://play.google.com/store/apps/details?id=com.google.android.apps.photolab.storyboard), which allows users to transform videos into comics on a mobile device. In contrast to the solution provided by Google, we allow the user to choose among several methods for keyframe extraction and Style Transfer, and we enable the selection of not only the most comprehensive frames, but also those that are linked to the highest popularity.
In this section, we give an overview of the method proposed for video comixification. Our pipeline consists of two stages, described in the following sections. In Section 3.1 we describe in detail the keyframe extraction part, whereas in Section 3.2 we show how we implement the style transfer part. Fig. 1 shows the general architecture of the system. As input our pipeline takes raw videos and some optional configuration parameters, and as output it produces fully formatted comics. In Section 4 we go into more detail about the real-world implementation of the pipeline.
3.1 Keyframes Extraction
In this section, we describe the keyframe extraction process. The full process is accomplished by a combination of video summarization and image aesthetic estimation techniques. The task of video summarization is to create short and concise summaries of longer videos. The first step in creating such summaries is usually to evaluate each frame’s highlightness, a score that tells how well the frame represents the video. Once all frames have been evaluated for highlightness, the most representative video segments are selected for inclusion in the summary.
In step 2 features are extracted from selected frames with GoogLeNet v1 Szegedy et al.  (pre-trained on ImageNet Deng et al. ) from the last pooling layer. The resulting feature vector is used in steps 3, 4, and 6.
In the following subsections we describe the other steps of our process.
3.1.1 Highlightness score
In step 3 we extract the highlightness score. The model we use for that is the unsupervised RL model described by Zhou et al. and shown to outperform its competitors on the task of video summarization. In their work Zhou et al. developed a deep summarization network (DSN) that predicts for each frame a probability indicating how likely the frame is to be selected as a representative frame of the video. That probability is what we call the highlightness score. To train their DSN, they propose an end-to-end, reinforcement learning based framework. The reward function of their RL model jointly accounts for the diversity and representativeness of the summaries generated by the DSN. During training, the reward function judges how diverse and representative the generated summaries are, while the DSN aims at earning higher rewards by learning to produce more diverse and more representative summaries. Since labels are not required, their method can be fully unsupervised. We train their model on the four datasets they used, and additionally extend the training with one more dataset, VTW (Zeng et al.).
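To make the idea of a diversity-representativeness reward more concrete, the following is a minimal sketch, not the implementation used in the pipeline. It assumes one common formulation: diversity is the mean pairwise cosine dissimilarity among the selected frames, and representativeness is the exponential of the negative mean distance from every frame to its nearest selected frame. The function names and the plain-list feature representation are hypothetical, and feature vectors are assumed to be non-zero.

```python
import math


def _cosine(a, b):
    # Cosine similarity of two non-zero feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def diversity_reward(features, selected):
    """Mean pairwise cosine dissimilarity among the selected frames."""
    if len(selected) < 2:
        return 0.0
    total = 0.0
    for i in selected:
        for j in selected:
            if i != j:
                total += 1.0 - _cosine(features[i], features[j])
    return total / (len(selected) * (len(selected) - 1))


def representativeness_reward(features, selected):
    """exp(-mean distance from each frame to its nearest selected frame)."""
    nearest = [
        min(math.dist(x, features[j]) for j in selected) for x in features
    ]
    return math.exp(-sum(nearest) / len(nearest))


def reward(features, selected):
    # The two terms are simply added (an assumption of this sketch).
    return diversity_reward(features, selected) + representativeness_reward(
        features, selected
    )


# Three frames; frames 0 and 2 are identical, frame 1 is orthogonal to both.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
good = reward(feats, [0, 1])  # diverse and covers every frame
poor = reward(feats, [0, 2])  # redundant selection, frame 1 not covered
```

In this toy example the selection `[0, 1]` earns a higher reward than `[0, 2]`, which is the behaviour the reward is meant to encourage.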
3.1.2 Temporal segmentation
In step 4 we split the video into segments, and in step 5 we combine the aforementioned segments with the highlightness score to obtain a given number of chosen frames. Segmentation is done mainly to further enforce the diversity of frames, as it limits the possibility of neighboring frames being selected.
For temporal segmentation, we use KTS, proposed by Potapov et al., which splits a video into segments based on temporal differences between frames. We slightly modify the algorithm so that a minimum number of segments can be enforced. From the resulting segments we select the n segments with the highest highlightness, computed by averaging the scores of the frames in each segment. Then, in each of the n segments, we find the one frame with the highest score. To these n frames we apply popularity estimation to select the final k keyframes.
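The selection of candidate frames from segments can be sketched as follows. This is an illustrative sketch only: it assumes the segment boundaries have already been produced by a temporal segmentation step (KTS itself is not implemented here), and the function name and the `(start, end)` boundary format with an exclusive end are hypothetical.

```python
from statistics import mean


def pick_candidate_frames(scores, boundaries, n):
    """Return n frame indices, one per selected segment.

    scores     -- per-frame highlightness scores
    boundaries -- (start, end) frame index pairs, end exclusive
    n          -- number of segments (and therefore frames) to keep
    """
    # Rank segments by the average highlightness of their frames.
    ranked = sorted(
        boundaries,
        key=lambda seg: mean(scores[seg[0]:seg[1]]),
        reverse=True,
    )
    chosen = ranked[:n]
    # Within each chosen segment keep the single best-scoring frame.
    frames = [max(range(s, e), key=lambda i: scores[i]) for s, e in chosen]
    # Restore temporal order so later stages see frames as they occur.
    return sorted(frames)


scores = [0.1, 0.9, 0.2, 0.1, 0.2, 0.8, 0.7, 0.6, 0.5]
boundaries = [(0, 3), (3, 5), (5, 9)]
candidates = pick_candidate_frames(scores, boundaries, n=2)
```

Here the middle segment has the lowest average score and is dropped, so the candidates are the best frames of the first and last segments.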
3.1.3 Image aesthetic estimation
Estimation of image aesthetics is included in our pipeline as the last stage of keyframe selection. We reduce the number of frames selected by temporal segmentation to the expected number k (k ≤ n, where k is a divisor of n) by arranging the selected frames in order of their occurrence, dividing them into k equal groups and selecting the frame with the highest estimated aesthetic score in each group.
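The grouping step described above can be sketched in a few lines. The function name and the dictionary of aesthetic scores are hypothetical; the scores would come from one of the two scoring methods, and the sketch assumes, as stated in the text, that k divides the number of candidate frames.

```python
def select_keyframes(frames, aesthetic, k):
    """Pick one frame per group of consecutive candidates.

    frames    -- candidate frame indices in temporal order (length n)
    aesthetic -- mapping from frame index to aesthetic score
    k         -- number of keyframes to return; must divide n
    """
    n = len(frames)
    if k <= 0 or n % k != 0:
        raise ValueError("k must be a positive divisor of the number of frames")
    size = n // k
    # Split the temporally ordered candidates into k equal groups.
    groups = [frames[i:i + size] for i in range(0, n, size)]
    # Keep the most aesthetic frame of each group.
    return [max(group, key=lambda f: aesthetic[f]) for group in groups]


frames = [3, 10, 25, 40, 58, 71]
aesthetic = {3: 0.2, 10: 0.7, 25: 0.5, 40: 0.4, 58: 0.9, 71: 0.1}
keyframes = select_keyframes(frames, aesthetic, k=3)
```

Because each group covers a contiguous part of the video, the result stays in temporal order and is spread across the whole clip.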
3.1.4 Image aesthetic - scoring
Image aesthetic score can be obtained using one of two methods - either popularity estimation or image quality estimation:
For popularity estimation we use baseline model from Trzcinski et al. . We use the same dataset of over 37’000 thumbnails of videos published on Facebook. Every video has a calculated normalized popularity score, according to the following formula:
We train a Support Vector Regression (SVR) model and evaluate it using Spearman correlation, achieving a Spearman correlation coefficient of 0.41. We use the estimated popularity score as our aesthetic score.
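For readers unfamiliar with the evaluation metric, Spearman correlation is the Pearson correlation of the rank-transformed values, so it measures whether the predicted ordering of items matches the true ordering. The following self-contained sketch illustrates the computation; the helper names are hypothetical and the inputs are toy values, not data from the experiments.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(predicted, target):
    """Spearman rank correlation between two equally long sequences."""
    return pearson(average_ranks(predicted), average_ranks(target))


same_order = spearman([0.1, 0.4, 0.5, 0.9], [10, 20, 30, 40])
reversed_order = spearman([1, 2, 3], [3, 2, 1])
```

A value of 1 means the predicted ranking is identical to the true one and -1 means it is exactly reversed.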
Image quality estimation
For image quality estimation we use the approach proposed in NIMA: Neural Image Assessment (Talebi and Milanfar). The model (based on the NASNet-A architecture, Zoph et al.) is trained on the AVA dataset (Murray et al.) using the squared EMD loss (Hou et al.). The output of this model is an estimated histogram of ratings on a scale from 1 to 10. We use the mean of this distribution as our aesthetic score.
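The two quantities mentioned above can be illustrated with a short sketch. It is not the NIMA implementation: the function names are hypothetical, and the squared EMD is written in one common form for ordered bins, as the sum of squared differences between the cumulative distributions (the exact normalization is an assumption).

```python
def mean_score(hist):
    """Expected rating of a histogram over the ratings 1..len(hist)."""
    total = sum(hist)
    return sum((i + 1) * p for i, p in enumerate(hist)) / total


def squared_emd(p, q):
    """Squared EMD between two normalized histograms over ordered bins.

    Computed here as the sum of squared differences of the two
    cumulative distribution functions.
    """
    cdf_p = cdf_q = 0.0
    acc = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        acc += (cdf_p - cdf_q) ** 2
    return acc


uniform = [1] * 10            # every rating from 1 to 10 equally likely
all_tens = [0] * 9 + [1]      # all mass on the top rating
score_uniform = mean_score(uniform)
score_top = mean_score(all_tens)
```

Unlike a plain cross-entropy, the EMD-style loss penalizes a prediction more the further its probability mass lies from the true ratings, which suits ordered rating scales.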
3.2 Comics style transfer
In this section, we first give an overview of the solutions proposed for transferring the style of images. We start with the initial work of Gatys et al. and continue with the state-of-the-art methods that perform best at transferring comics style to images.
3.2.1 Existing approaches
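Gatys et al. used 16 convolutional and 5 pooling layers of the 19-layer VGG network (Simonyan and Zisserman). They passed three images through this network: a content image, a style image and a white noise image. The content information is given by the feature maps of one layer, and the content loss is the squared-error loss between the two feature representations:
$$\mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^2$$
where $F^{l}_{ij}$ is the activation of the i-th filter at position $j$ in layer $l$ for the white noise image, and $P^{l}_{ij}$ is the corresponding activation for the content image. The style information, on the other hand, is given by the Gram matrix of the vectorized feature maps $i$ and $j$ in layer $l$:
$$G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}$$
The style loss is a squared-error loss computed between the two Gram matrices obtained at specific layers from the white noise image ($G^{l}$) and the style image ($A^{l}$) passed through the network:
$$\mathcal{L}_{style} = \sum_{l} w_l \, \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^2$$
where $N_l$ is the number of feature maps, $M_l$ is the feature map size and $w_l$ weights the contribution of layer $l$.
Finally, the cost function is defined as a weighted sum of the two losses above, namely the loss between the activations (content) and the loss between the Gram matrices (style), and is then minimized using backpropagation:
$$\mathcal{L}_{total} = \alpha \, \mathcal{L}_{content} + \beta \, \mathcal{L}_{style}$$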
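The content and style losses described above can be illustrated with a small, framework-free sketch. Feature maps of one layer are represented as a list of N vectorized maps of length M. The normalization constants (1/2 for content, 1/(4 N^2 M^2) for style) follow the usual formulation of Gatys et al. and should be read as assumptions here, as should the default weights `alpha` and `beta`, which are purely illustrative.

```python
def gram(F):
    """Gram matrix G[i][j] = sum_k F[i][k] * F[j][k] of vectorized feature maps."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]


def content_loss(F, P):
    """Half the squared error between generated (F) and content (P) features."""
    return 0.5 * sum(
        (f - p) ** 2 for f_row, p_row in zip(F, P) for f, p in zip(f_row, p_row)
    )


def style_loss(F, S):
    """Squared error between Gram matrices of generated (F) and style (S) features."""
    n, m = len(F), len(F[0])  # number of feature maps, feature map size
    G, A = gram(F), gram(S)
    sq = sum((g - a) ** 2 for g_row, a_row in zip(G, A) for g, a in zip(g_row, a_row))
    return sq / (4 * n ** 2 * m ** 2)


def total_loss(F, P, S, alpha=1.0, beta=1e3):
    """Weighted sum of content and style losses for a single layer."""
    return alpha * content_loss(F, P) + beta * style_loss(F, S)


F = [[1.0, 2.0], [3.0, 4.0]]
G = gram(F)
```

In the optimization-based method, the pixels of the generated image are updated by gradient descent on this total loss, which is why every new image pair requires its own optimization run.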
Unfortunately, this method is very slow. With this approach, style transfer of a single 512x512 pixel image takes almost a minute on recent GPU architectures such as the NVIDIA Quadro M6000 or Titan X (Johnson et al., Li et al. [2017b]). This is because, for each pair of images, the method runs an optimization process using backpropagation to minimize the total loss. To address this shortcoming, several approaches have been proposed. M. Pęśko and T. Trzciński (Pesko and Trzcinski) presented a very detailed comparison of state-of-the-art neural style transfer approaches in terms of comics style transfer. Two of them were found to be the best, namely the Adaptive Instance Normalization approach presented by X. Huang and S. Belongie (Huang and Belongie) and Universal Style Transfer proposed by Y. Li et al. (Li et al. [2017c]).
3.2.2 Adaptive Instance Normalization
This is the first fast and arbitrary neural style transfer algorithm that resolves the trade-off between generalization, quality and efficiency. The method consists of two networks: a style transfer network and a loss network.
The loss network is very similar to the network presented in Gatys et al. It is used to compute the total loss, which is minimized by backpropagation to fit the parameters of the style transfer network. The style transfer network has a simple encoder-decoder architecture. The encoder consists of the first few layers of a pre-trained VGG-19 network. The decoder mirrors the encoder with a few minor differences: all pooling layers are replaced by nearest-neighbor up-sampling layers and there are no normalization layers. The most interesting part is the AdaIN layer, placed between the encoder and the decoder. AdaIN produces the target feature maps that are fed to the decoder by aligning the mean and variance of the content feature maps to those of the style feature maps. A randomly initialized decoder is trained using the loss network to map the outputs of AdaIN to the image space in order to get a stylized image. The AdaIN architecture can be seen in Fig. 4.
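The alignment of mean and variance performed by the AdaIN layer can be sketched per channel as follows. This is a minimal illustration on plain Python lists, not the network layer itself; the function names and the small stabilizing constant `eps` are assumptions of the sketch.

```python
from statistics import fmean, pstdev


def adain_channel(content, style, eps=1e-5):
    """Shift and scale one content channel to the style channel's statistics."""
    mu_c, mu_s = fmean(content), fmean(style)
    sd_c, sd_s = pstdev(content), pstdev(style)
    # Normalize the content values, then apply the style's scale and mean.
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]


def adain(content_maps, style_maps, eps=1e-5):
    """Apply the alignment independently to every channel (feature map)."""
    return [adain_channel(c, s, eps) for c, s in zip(content_maps, style_maps)]


content = [[0.0, 2.0, 4.0, 6.0]]     # one channel, flattened spatially
style = [[10.0, 10.0, 14.0, 14.0]]   # mean 12, standard deviation 2
out = adain(content, style)[0]
```

The output keeps the spatial pattern of the content channel (its values remain in the same order) while its mean and standard deviation now match those of the style channel, up to the effect of `eps`.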
3.2.3 Universal Style Transfer
Knowing that the covariance matrix can be as effective as the Gram matrix in representing image style (Li et al. [2017a]), Y. Li et al. (Li et al. [2017c]) proposed a novel approach called Universal Style Transfer (UST). It is closely related to the AdaIN method, but the intuition behind UST is to match the covariance matrices of feature maps instead of aligning only the mean and variance, as proposed in Huang and Belongie. UST uses a very similar encoder-decoder architecture, where the encoder is composed of the first few layers of a pre-trained VGG-19 network and the decoder mostly mirrors the encoder. However, instead of using the AdaIN layer to carry style information, it uses the Whitening and Coloring transform (WCT). Moreover, in this method the style transfer, represented by the WCT layer, is not used during training. The network is only trained to reconstruct an input content image. The entire style transfer takes place in the WCT layer, which is added to the already pre-trained image reconstruction network.
Another extension is to use multi-level stylization in order to match the statistics of the style at all abstraction levels. This means that the result obtained from a network that matches higher-level information is treated as the new content input image for the network that matches lower-level statistics. Such an architecture can be seen in Fig. 5.
Sample results for the above-mentioned architectures can be found in Fig. 6. Unfortunately, both methods suffer from common problems such as inappropriate color transfer and blur effects. The AdaIN method returns images that are close to cartoons, but it often keeps the content colors or produces mixed hues that do not fit the style. UST-WCT gives results with more appropriate and stylistically coherent colors. However, its results seem to be stylized too strongly, which often leads to significant distortions in the pictures. Moreover, for our "comixification" purpose we would like a more powerful solution that is able to "remember" how to apply comics style to any content image without being given a style image. Fortunately, Chen, Yang et al. (Chen et al. [2018b]) recently presented a completely different approach based on Generative Adversarial Networks that gives very promising results.
[Figure: a) Style images, b) Content images, c) AdaIN, d) UST-WCT]
This approach tackles the Style Transfer problem in a completely different way, using Generative Adversarial Networks (GANs). The GAN-based method consists of two Convolutional Neural Networks: a Discriminator D and a Generator G. The Generator is trained to create images that fool the Discriminator. The Discriminator, in turn, is trained to classify whether an image comes from the real target domain or is synthetic, i.e. produced by the Generator. The architectures of G and D can be seen in Fig. 7. The Discriminator is designed to be shallow, because judging whether an image is a cartoon or not is a less demanding task and should rely only on local features of the image.
Loss function. The loss function consists of two parts: the adversarial loss, which is responsible for achieving the desired style transformation by the generator network, and the content loss, which helps to preserve the image content during training. A weighting parameter balances these two losses.
The adversarial loss is applied to both networks G and D. In contrast to the adversarial loss of classic GAN frameworks, the authors introduce a so-called Edge Promoting Loss. It adds an extra class that represents cartoons with blurred edges and treats it like non-cartoon images. This addition handles the problem of clear edges, which are very characteristic of cartoon and comics images but may not be learned by the Generator because these edges make up only a small proportion of the whole image. The content loss is simply the L1 norm between feature maps obtained from one layer of the VGG network (Simonyan and Zisserman).
Initialization phase. To improve the convergence of the GAN framework, the authors propose an additional step before the actual training. This initialization phase is a short training of the Generator network to reconstruct the input image. It is performed using only the content loss.
However, CartoonGAN results are still not perfect. The authors shared two pretrained models, trained to output images in the style of two Anime artists, Mamoru Hosoda and Miyazaki Hayao. As a result, the output images are targeted at the Anime style, whose colors can look unnatural. Moreover, edges are still not as sharp and distinct as they should be. Since we would like to obtain very universal comic images with more natural and contrasting colors and more distinct edges, we decided to train our own ComixGAN framework based on the CartoonGAN approach.
Data. For training we use three kinds of images: real images, comics images and comics images with blurred edges, as described in Chen et al. [2018b]. As real images we use the MS COCO dataset (Lin et al.). As comics images we use keyframes obtained from different cartoons that can be described as having a comic book style.
Architecture. We start with an architecture and parameters identical to those of the original CartoonGAN framework, but surprisingly the results turn out to be poor. Generated images lack the comic book style and have no distinct edges. Moreover, the networks are very unstable during training. To mitigate those shortcomings of the base model, we introduce the following changes:
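To illustrate how an edge-promoting adversarial loss treats blurred-edge cartoons as an extra "fake" class, the following sketch computes a discriminator loss from the probabilities the Discriminator assigns to three batches of images. It is an illustrative reading of the idea, not the authors' code: the function names, the use of averaged log-likelihoods and the unnormalized L1 content loss are assumptions.

```python
import math


def discriminator_loss(d_comics, d_blurred, d_generated, eps=1e-12):
    """Negative log-likelihood of the Discriminator's decisions.

    d_comics    -- D's outputs on sharp comics images (should approach 1)
    d_blurred   -- D's outputs on edge-blurred comics (should approach 0)
    d_generated -- D's outputs on Generator outputs (should approach 0)
    """
    real = sum(math.log(p + eps) for p in d_comics) / len(d_comics)
    blur = sum(math.log(1.0 - p + eps) for p in d_blurred) / len(d_blurred)
    fake = sum(math.log(1.0 - p + eps) for p in d_generated) / len(d_generated)
    return -(real + blur + fake)


def l1_content_loss(feat_generated, feat_photo):
    """L1 norm of the difference between two flattened feature maps."""
    return sum(abs(a - b) for a, b in zip(feat_generated, feat_photo))


perfect = discriminator_loss([1.0], [0.0], [0.0])
undecided = discriminator_loss([0.5], [0.5], [0.5])
fooled_by_blur = discriminator_loss([1.0], [0.9], [0.0])
```

A Discriminator that accepts blurred-edge comics as real is penalized (`fooled_by_blur` is larger than `perfect`), so it learns to rely on sharp edges, and the Generator in turn has to produce them.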
- As the Generator loss function we use the non-saturating loss (as described in Goodfellow et al.) instead of the standard minimax loss.
- We use a sigmoid function in the last layer of the Discriminator network.
- We use a Generator/Discriminator update ratio of 3:1 during training (three updates of the Generator weights per one update of the Discriminator).
- We also use an initial training phase for the Discriminator, pretraining it to distinguish comics images from real images and edge-blurred comics.
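Two of the changes listed above, the non-saturating Generator loss and the 3:1 update ratio, can be sketched as follows. The function names are hypothetical, the inputs are the Discriminator's probabilities on generated images, and the exact ordering of Generator and Discriminator updates within a cycle is an assumption of this sketch.

```python
import math


def minimax_generator_loss(d_fake, eps=1e-12):
    """Standard minimax form: the Generator minimizes mean log(1 - D(G(x)))."""
    return sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)


def non_saturating_generator_loss(d_fake, eps=1e-12):
    """Non-saturating form: the Generator minimizes -mean log D(G(x))."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)


def update_schedule(num_steps, g_per_d=3):
    """Yield 'G' or 'D' so that D is updated once per g_per_d Generator updates."""
    for step in range(num_steps):
        yield "D" if step % (g_per_d + 1) == g_per_d else "G"


schedule = list(update_schedule(8))
loss_half = non_saturating_generator_loss([0.5])

# When D confidently rejects generated images (D(G(x)) close to 0), the
# minimax loss is nearly flat while the non-saturating loss is large.
flat = minimax_generator_loss([0.01])
steep = non_saturating_generator_loss([0.01])
```

Early in training the Discriminator tends to reject generated images easily, and the non-saturating form still provides a strong learning signal in that regime, which is one common motivation for preferring it.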
In Fig. 8 we present sample results obtained from the original CartoonGAN framework in both the Hayao and Hosoda styles (columns b) and c)) and from our ComixGAN framework with the improvements we introduce to obtain better quality results (column d)). We can see that ComixGAN produces images with more distinct and clear edges than the other approaches. Moreover, colors are uniform and vivid, which is very characteristic of the comics style. Content information is also preserved very well, with no visible distortions. The quality of images produced by ComixGAN appears to be better than in all previous approaches. For this reason we decided to use it as the main style transfer method in our comixify pipeline. At the same time, to give everyone the opportunity to compare, our application also includes the option of choosing the models provided by the CartoonGAN authors.
4 Web application
In order to demonstrate our solution, we share a Web-based application that allows users to process video files using our pipeline. The application is publicly available at https://comixify.ii.pw.edu.pl/. The frontend interface can be seen in Fig. 9. We publish the source code of the implementation at https://github.com/maciej3031/comixify. The application gives everyone the opportunity to test our solution in three ways: by uploading a video file, by providing a link to a YouTube video or by selecting a video from the available samples. One can choose which image aesthetic estimation model will be used: popularity estimation or image quality estimation (NIMA). The application also allows the user to select the frame extraction model in order to experiment with the results. There is a choice between the Basic model and the Basic + VTW model. The Style Transfer model can be chosen as well: one can select between the ComixGAN model and both CartoonGAN models (Chen et al. [2018b]). In addition, a REST API is available, allowing the pipeline to be used without the frontend.
In this paper, we introduced an extensive pipeline for transforming videos into comics. Our solution works in a semi-real-time manner and produces convincing, eye-pleasing comics layouts. Our approach can be extended by adding generative comics layout composition and by introducing voice recognition to add text annotations to the comics. These are directions we plan to explore in future work.
- Gygli et al.  Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc J. Van Gool. Creating summaries from user videos. In ECCV (7), volume 8695 of Lecture Notes in Computer Science, pages 505–520. Springer, 2014.
- Song et al.  Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. Tvsum: Summarizing web videos using titles. In CVPR, pages 5179–5187. IEEE Computer Society, 2015.
- Zhang et al.  Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. In ECCV (7), volume 9911 of Lecture Notes in Computer Science, pages 766–782. Springer, 2016.
- Mahasseni et al.  Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. Unsupervised video summarization with adversarial LSTM networks. In CVPR, pages 2982–2991. IEEE Computer Society, 2017.
- Zhou et al.  Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In AAAI, pages 7582–7589. AAAI Press, 2018.
- Almeida et al.  Virgílio A. F. Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira. Characterizing reference locality in the WWW. In PDIS, pages 92–103. IEEE Computer Society, 1996.
- Chesire et al.  Maureen Chesire, Alec Wolman, Geoffrey M. Voelker, and Henry M. Levy. Measurement and analysis of a streaming media workload. In USITS, pages 1–12. USENIX, 2001.
- Jiang et al.  Lu Jiang, Yajie Miao, Yi Yang, Zhen-Zhong Lan, and Alexander G. Hauptmann. Viral video style: A closer look at viral videos on youtube. In ICMR, page 193. ACM, 2014.
- Crane and Sornette  R Crane and D Sornette. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences, 105(41):15649–15653, 2008.
- Szabó and Huberman [2010] Gábor Szabó and Bernardo A. Huberman. Predicting the popularity of online content. Commun. ACM, 53(8):80–88, 2010.
- Pinto et al. [2013] Henrique Pinto, Jussara M. Almeida, and Marcos André Gonçalves. Using early view patterns to predict the popularity of YouTube videos. In WSDM, pages 365–374. ACM, 2013.
- Khosla et al. [2014] Aditya Khosla, Atish Das Sarma, and Raffay Hamid. What makes an image popular? In WWW, pages 867–876. ACM, 2014.
- Trzcinski et al. [2017] Tomasz Trzcinski, Pawel Andruszkiewicz, Tomasz Bochenski, and Przemyslaw Rokita. Recurrent neural networks for online video popularity prediction. In ISMIS, volume 10352 of Lecture Notes in Computer Science, pages 146–153. Springer, 2017.
- Murray et al. [2012] Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, pages 2408–2415. IEEE Computer Society, 2012.
- Ponomarenko et al. [2013] Nikolay N. Ponomarenko, Oleg Ieremeiev, Vladimir V. Lukin, Karen O. Egiazarian, Lina Jin, Jaakko Astola, Benoit Vozel, Kacem Chehdi, Marco Carli, Federica Battisti, and C.-C. Jay Kuo. Color image database TID2013: peculiarities and preliminary results. In EUVIP, pages 106–111. IEEE, 2013.
- Lu et al. [2015] Xin Lu, Zhe Lin, Xiaohui Shen, Radomír Mech, and James Zijun Wang. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In ICCV, pages 990–998. IEEE Computer Society, 2015.
- Kao et al. [2015] Yueying Kao, Chong Wang, and Kaiqi Huang. Visual aesthetic quality assessment with a regression model. In ICIP, pages 1583–1587. IEEE, 2015.
- Kim et al. [2017] Jongyoo Kim, Hui Zeng, Deepti Ghadiyaram, Sanghoon Lee, Lei Zhang, and Alan C. Bovik. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment. IEEE Signal Process. Mag., 34(6):130–141, 2017.
- Talebi and Milanfar [2018] Hossein Talebi and Peyman Milanfar. NIMA: neural image assessment. IEEE Trans. Image Processing, 27(8):3998–4011, 2018.
- Hou et al. [2016] Le Hou, Chen-Ping Yu, and Dimitris Samaras. Squared earth mover’s distance-based loss for training deep neural networks. CoRR, abs/1611.05916, 2016.
- Gatys et al. [2016] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414–2423. IEEE Computer Society, 2016.
- Li et al. [2017a] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Diversified texture synthesis with feed-forward networks. In CVPR, pages 266–274. IEEE Computer Society, 2017a.
- Li et al. [2017b] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. In IJCAI, pages 2230–2236, 2017b.
- Johnson et al. [2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV (2), volume 9906 of Lecture Notes in Computer Science, pages 694–711. Springer, 2016.
- Ulyanov et al. [2016a] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1349–1357. JMLR.org, 2016a.
- Ulyanov et al. [2016b] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016b.
- Yeh and Tang [2018] Mao-Chuang Yeh and Shuai Tang. Improved style transfer by respecting inter-layer correlations. CoRR, abs/1801.01933, 2018.
- Wang et al. [2017] Xin Wang, Geoffrey Oxholm, Da Zhang, and Yuan-Fang Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. In CVPR, pages 7178–7186. IEEE Computer Society, 2017.
- Wilmot et al. [2017] Pierre Wilmot, Eric Risser, and Connelly Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. CoRR, abs/1701.08893, 2017.
- Dumoulin et al. [2017] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. 2017. URL https://openreview.net/forum?id=BJO-BuT1g.
- Chen and Schmidt [2016] Tian Qi Chen and Mark Schmidt. Fast patch-based style transfer of arbitrary style. CoRR, abs/1612.04337, 2016.
- Huang and Belongie [2017] Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pages 1510–1519, 2017.
- Desai [2017] Shubhang Desai. End-to-end learning of one objective function to represent multiple styles for neural style transfer. Technical report, 2017. URL http://cs231n.stanford.edu/reports/2017/pdfs/407.pdf.
- Ghiasi et al. [2017] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. In BMVC. BMVA Press, 2017.
- Shen et al. [2017] Falong Shen, Shuicheng Yan, and Gang Zeng. Meta networks for neural style transfer. CoRR, abs/1709.04111, 2017.
- Zhao et al. [2017] Huihuang Zhao, Paul L. Rosin, and Yu-Kun Lai. Automatic semantic style transfer using deep convolutional neural networks and soft masks. CoRR, abs/1708.09641, 2017.
- Li et al. [2017c] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In NIPS, pages 385–395, 2017c.
- Li et al. [2018] Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. A closed-form solution to photorealistic image stylization. In ECCV (3), volume 11207 of Lecture Notes in Computer Science, pages 468–483. Springer, 2018.
- Luan et al. [2017] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. In CVPR, pages 6997–7005. IEEE Computer Society, 2017.
- Chen et al. [2017a] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In ICCV, pages 1114–1123. IEEE Computer Society, 2017a.
- Chen et al. [2017b] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. StyleBank: An explicit representation for neural image style transfer. In CVPR, pages 2770–2779. IEEE Computer Society, 2017b.
- Chen et al. [2018a] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. Stereoscopic neural style transfer. CoRR, abs/1802.10591, 2018a.
- Chen et al. [2018b] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. CartoonGAN: Generative adversarial networks for photo cartoonization. In CVPR, 2018b.
- Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9. IEEE Computer Society, 2015.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE Computer Society, 2009.
- Zeng et al. [2016] Kuo-Hao Zeng, Tseng-Hung Chen, Juan Carlos Niebles, and Min Sun. Title generation for user generated videos. In ECCV (2), volume 9906 of Lecture Notes in Computer Science, pages 609–625. Springer, 2016.
- Potapov et al. [2014] Danila Potapov, Matthijs Douze, Zaïd Harchaoui, and Cordelia Schmid. Category-specific video summarization. In ECCV (6), volume 8694 of Lecture Notes in Computer Science, pages 540–555. Springer, 2014.
- Zoph et al. [2017] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.
- Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- Pesko and Trzcinski [2018] Maciej Pesko and Tomasz Trzcinski. Neural comic style transfer: Case study. CoRR, abs/1809.01726, 2018.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV (5), volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer, 2014.
- Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.