Automatic Semantic Style Transfer using Deep Convolutional Neural Networks and Soft Masks
This paper presents an automatic image synthesis method to transfer the style of an example image to a content image. When standard neural style transfer approaches are used, the textures and colours in different semantic regions of the style image are often applied inappropriately to the content image, ignoring its semantic layout, and ruining the transfer result. In order to reduce or avoid such effects, we propose a novel method based on automatically segmenting the objects and extracting their soft semantic masks from the style and content images, in order to preserve the structure of the content image while having the style transferred. Each soft mask of the style image represents a specific part of the style image, corresponding to the soft mask of the content image with the same semantics. Both the soft masks and source images are provided as multichannel input to an augmented deep CNN framework for style transfer which incorporates a generative Markov random field (MRF) model. Results on various images show that our method outperforms the most recent techniques.
• Computing methodologies → Non-photorealistic rendering; Neural networks;
Huihuang Zhao, Paul L. Rosin and Yu-Kun Lai
School of Computer Science and Technology, Hengyang Normal University, Hengyang, Hunan, China; School of Computer Science and Informatics, Cardiff University, Cardiff, UK
Style transfer is a process of migrating a style from a “style image” to a “content image”. The goal is to be able to generate different renditions of the same scene according to different style images. Image style transfer has become a popular problem in computer vision and graphics, and can generate impressive results covering a wide variety of styles for both images [GEB15b] and videos [RDB16]. It has also been widely employed to solve problems such as texture synthesis [EF01], inpainting [CPT04], head portraits [SED16] and super-resolution [JAFF16].
When existing neural style transfer methods are applied to images with complex structures, visual elements from the style image are often transferred to semantically irrelevant areas of the content image. In order to achieve good results, users must pay attention to the composition and/or the selection of the style image, because for example the background colours or textures will often ruin the style transfer results, especially for portraits where the artifacts can be particularly off-putting. Addressing this problem, [Cha16] (and subsequently [GEB16]) recently proposed a method which uses a manually generated semantic map to help control the style transfer, and can achieve better results than some common methods.
In this paper, we specifically consider the problem of image style transfer guided by automatically extracted soft semantic masks. To achieve this, we adapt various semantic segmentation and labelling techniques to extract soft masks associated with specific semantics. By deploying the semantic masks to control the transfer, it is possible to avoid errors such as those shown in figure 1(c) generated using the CNNMRF method [LW16] in which stylised foreground objects are contaminated by the background texture, and vice versa.
The main contributions of the paper are as follows:
We adapt a state-of-the-art semantic segmentation method [ZJRP15] to generate semantic masks automatically. Instead of using hard segmentation as in [ZJRP15], we propose to use soft masks containing the probabilities of occurrence of different objects in the image, since they preserve more information and are more robust when image regions have similar chances of belonging to multiple object categories. They are used to capture elements of the styles for objects in the style image and to preserve the structure of the content image. For the human face in particular we use a more detailed segmentation, in which different facial parts such as the nose, eyes and mouth are also automatically segmented, providing fine-grained control in perceptually crucial areas; these are also treated as semantic masks.
We augment a trained deep convolutional neural network by concatenating soft mask channels and channels of regular filters. This is further combined with a generative Markov random field (MRF) model [LW16] for image style transfer. Both the style and content images and their semantic maps are input into the augmented deep convolutional neural network. Extensive experiments show that such higher-level semantic information improves the quality of style transfer.
Style transfer using deep networks. The success of deep CNNs (DCNNs) in image processing has also raised interest in image style transfer. [SPB14] proposed a style transfer method for headshot portraits, using a multiscale technique to robustly transfer the local statistics of an example portrait onto a new one. [GEB15b, GEB15a] showed remarkable results by using the VGG-19 deep neural network for style transfer. Their approach was employed in unguided settings and taken up by various follow-up papers. [GEB16] in particular extended the Gram matrix method beyond the paradigm of transferring global style information between pairs of images, and introduced control over spatial location, colour information and spatial scale. [ULVL16] presented an alternative approach which trained compact feed-forward convolutional networks. The resulting networks are extremely light-weight and can generate images faster than [GEB15b]. By combining the benefits of training feed-forward convolutional neural networks and perceptual loss functions, [JAFF16] presented a novel approach for image style transfer. [LW16] suggested an approach to preserving local patterns of the style image. Instead of using a global representation of the style computed as a Gram matrix, they used patches of the neural activation from the style image. [RDB16] presented an approach that transfers the style from one image (for example, a painting) to a whole video sequence.
Two main types of methods are used in deep learning based style transfer: global approaches based on the Gram matrix, and local approaches based on patch matching. Compared to the global methods, methods based on patch matching are more flexible and better cope with images with spatial variation of visual styles or elements. However, they could also produce visible artefacts when there are local matching errors. In order to control the region of application of the style image, [GEB16] used several manually specified spatial guidance channels, containing values in [0,1], for both the content and style images. Their experiments showed that the guidance channels can ensure that the style is transferred between regions of similar scene content in the content and style images. It is however time-consuming to produce masks. As a result, for examples in their paper, they just used a mask to separate two parts of the image (e.g. sky and non-sky) for simple spatial control, and did not distinguish more detailed content in the images.
MRF-based image synthesis. Markov Random Fields (MRFs) are a well-known framework for non-parametric image synthesis [EL99], [FPC00]. [KSE03] and [KEBK05] modelled the texture as an MRF and computed an approximation to the optimal solution. [ZCC13] formulated the patch mapping problem as a labelling problem modelled by a discrete MRF. Moreover, [FSDH16] proposed an unsupervised method for texture and colour transfer based on MRFs, in which an adaptive patch partition is used to capture the style of the example image and preserve the structure of the source image. MRF models suffer from the limitation that local image statistics are usually not sufficient for capturing complex image layouts at a global scale. [WL00] and [KEBK05] proposed a multi-resolution synthesis approach to alleviate this, which we adopt in our method. [LW16] combined generative MRF models with deep convolutional neural networks for image synthesis. Unlike other MRF-based texture synthesis approaches, their combined system can both match and adapt local features with considerable variability, and we therefore base our method on it.
Semantic segmentation. Recently, CNN architectures have been shown to be capable of providing semantic segmentation [GDDM14, Tho16]. [GDDM14] proposed a method called R-CNN, which combined region proposals with CNNs. [NHH15] applied a trained network (VGG 16-layer net) to each proposal in an input image, and constructed the final semantic segmentation map by combining the results from all the proposals. [SLD17] proposed a fully convolutional network for semantic segmentation. For producing accurate and detailed segmentations, they defined a skip architecture which combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer. In order to achieve better results, some existing face detection methods are also used in style transfer. By searching a database using Face++ [Fac] to find images with similar poses to a given source portrait image, [YZY17] presented a novel colour transfer approach for portraits. [ZJRP15] introduced a new form of convolutional neural network that combines the strengths of Convolutional Neural Networks and Conditional Random Fields based probabilistic graphical modelling. These models rely primarily on convolutional layers to extract high-level patterns, then use deconvolution to label the individual pixels. Currently they have trained this model to recognise 20 classes, and our paper uses this method to obtain some semantic content from images.
Limitations of current methods. Images are typically composed of regions corresponding to different (foreground) objects and background. Most existing methods either use Gram matrices, which treat images globally, or use local patch matching, which can often match regions of one object in the style image to regions of a different object in the content image, causing artefacts such as those shown in figure 1. This is more critical for human faces, as subtle mismatches can be detrimental to the quality of synthesised images. To address this, existing methods [Cha16, GEB16] use manual segmentation to improve style transfer. However, manual segmentation is time-consuming and laborious. In contrast, our method automatically performs a partial soft semantic segmentation of the content and style images. We augment the CNNMRF model used in [LW16] to further incorporate soft semantic masks, which can better capture features from the style image and preserve the structure of the content image.
We first briefly introduce our augmented DCNN architecture, followed by details of the style transfer algorithm. We then provide details of the automatic semantic mask extraction. Finally, experimental results and discussion are presented and conclusions are drawn.
We now discuss our augmented DCNN architecture for style transfer, which is based on VGG [SZ14]. It takes as input a content image and a style image, both of which are fed into the VGG net. The architecture combines pooling layers and convolution layers with $3 \times 3$ filters (for example, the first convolution layer after the second pooling layer is named conv3_1). As in common DCNNs, the intermediate post-activation results of layer $l$, denoted $x^l$, consist of $N$ channels, which capture patterns from the source images for each region of the image. Our augmented network is shown in figure 2.
Our augmented network also takes semantic soft masks as input, which are down-sampled to produce $M$ semantic channels $m^l$ at layer $l$ with the same resolution as $x^l$. We concatenate them to form a new output $s^l$ with $N + M$ channels, labelled accordingly for each layer. Before concatenation, the semantic channels are weighted by a parameter $\beta$ to balance their importance:
$$s^l = x^l \,\Vert\, \beta\, m^l,$$
where $\Vert$ denotes concatenation along the channel dimension.
We set $\beta$ to a value that we have found experimentally to provide interesting results.
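The channel augmentation described above can be illustrated with a minimal sketch. It assumes that the soft masks are down-sampled by average pooling and that feature maps are stored as plain nested lists; the function names `downsample` and `augment`, and the weight value 10.0, are hypothetical choices for illustration and are not taken from the paper.

```python
def downsample(mask, factor):
    """Average-pool a 2D soft mask (a list of rows) by an integer factor."""
    rows, cols = len(mask) // factor, len(mask[0]) // factor
    area = factor * factor
    return [
        [
            sum(
                mask[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ) / area
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def augment(feature_channels, mask_channels, weight):
    """Append M weighted semantic channels to N feature channels."""
    rows, cols = len(feature_channels[0]), len(feature_channels[0][0])
    for mask in mask_channels:
        if len(mask) != rows or len(mask[0]) != cols:
            raise ValueError("mask resolution must match the feature map")
    weighted = [[[weight * v for v in row] for row in mask] for mask in mask_channels]
    return feature_channels + weighted


# Two 2x2 feature channels (N = 2) and one 4x4 soft mask (M = 1).
features = [[[0.2, 0.4], [0.6, 0.8]], [[1.0, 0.0], [0.0, 1.0]]]
person_mask = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
# The weight 10.0 is an illustrative value only.
augmented = augment(features, [downsample(person_mask, 2)], weight=10.0)
print(len(augmented))  # number of channels after concatenation (N + M)
print(augmented[-1])   # the weighted semantic channel
```

A larger weight makes the semantic channels dominate any distance computed on the concatenated output, which is why the weight needs to be balanced against the magnitude of the neural activations.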
Next, we introduce our style transfer model. We use an augmented loss function based on the patch-based approach of [LW16], which combines an MRF with a DCNN model and uses optimisation to minimise a content reconstruction error $E_c$ and a style remapping error $E_s$. We are given a style image $x_s$, a content image $x_c$, and semantic masks $m_c^k$ and $m_s^k$ associated with the content and style images, respectively ($k = 1, \dots, M$). For simplicity, the semantic masks for the content and style images are also collectively written as $m_c$ and $m_s$. The style transfer result is denoted by $x$. Since the synthesised image is expected to have the same semantic layout as the content image, we also treat $m_c$ as the semantic masks of the synthesised image. In our method, we make the high-level neural encoding of $x$ similar to that of $x_c$, and make its local patches similar to patches in $x_s$. As a result, the style of $x_s$ is transferred onto the layout of $x_c$. Meanwhile, we penalise patch matches with inconsistent semantic masks. We define the following energy function and seek the $x$ that minimises it:
$$E(x) = \alpha_1 E_s(\Phi(x), \Phi(x_s), m_c, m_s) + \alpha_2 E_c(\Phi(x), \Phi(x_c)).$$
$E_s$ and $E_c$ are the style loss function and content loss function respectively, where $\Phi(x)$ is the feature map (activation) that the network outputs for the synthesised image $x$ in some layer, $\Phi(x_s)$ is the feature map of the style image $x_s$ in the same layer, and $m_c$ and $m_s$ are the semantic masks of the content and style images downsampled to the same resolution as $\Phi(x)$ and $\Phi(x_s)$. $E_s$ penalises inconsistencies in neural activations and/or semantic masks between $x$ and $x_s$. $E_c$ computes the squared distance between the feature map of the synthesised image $x$ and that of the content image $x_c$. Since $x$ is assumed to have the same content layout as $x_c$, $E_c$ does not involve the semantic masks.
Style loss function: We extract all the local patches from $\Phi(x)$, denoted $\Psi(\Phi(x))$. For a given layer with $C$ channels, each patch has size $k \times k \times C$, where $k$ is the width and height of the patch. Similarly, $\Psi(m_c)$ and $\Psi(m_s)$ are the patches extracted from the down-sampled semantic masks, each of size $k \times k \times M$. We define the modified energy function incorporating semantic masks as
$$E_s(\Phi(x), \Phi(x_s), m_c, m_s) = \sum_{i=1}^{n} \left\lVert s_i - s^{s}_{NN(i)} \right\rVert^2,$$
where $n$ is the number of patches in the synthesised image. For each patch $i$ with its semantic masks, we find the best matching patch $NN(i)$ using normalised cross-correlation over all $n_s$ example patches in the style image:
$$NN(i) = \arg\max_{j = 1, \dots, n_s} \frac{s_i \cdot s^{s}_j}{\lvert s_i \rvert \, \lvert s^{s}_j \rvert},$$
where $s_i$ is the concatenation of the neural activation and semantic masks for the $i$-th patch of the synthesised image, and $s^{s}_j$ is the concatenation of the neural activation and semantic masks for the $j$-th patch of the style image. The nearest patch thus takes both style similarity and semantic consistency into account.
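The effect of including the semantic masks in the patch matching can be seen in a small sketch. It assumes that patches are flattened into plain vectors, and the helper names `ncc`, `concat` and `best_match`, the toy values and the mask weights are all hypothetical, chosen only to illustrate the matching rule described above.

```python
import math


def ncc(a, b):
    """Normalised cross-correlation (cosine similarity) of two flat vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def concat(activation, mask, weight):
    """Concatenate a flattened activation patch with its weighted mask patch."""
    return list(activation) + [weight * m for m in mask]


def best_match(query, candidates):
    """Index of the candidate with the highest normalised cross-correlation."""
    return max(range(len(candidates)), key=lambda j: ncc(query, candidates[j]))


# Each patch is (activation, mask); the mask holds [foreground, background] values.
style_patches = [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [0.0, 1.0])]
query_patch = ([1.0, 0.0], [0.0, 1.0])  # a background patch of the synthesised image


def match(weight):
    query = concat(query_patch[0], query_patch[1], weight)
    candidates = [concat(a, m, weight) for a, m in style_patches]
    return best_match(query, candidates)


print(match(0.0))  # masks ignored: the foreground style patch wins on activation alone
print(match(1.0))  # masks included: the background style patch is selected
```

With a zero mask weight the query is matched to a style patch from a different semantic region, which is the kind of mismatch the semantic channels are intended to prevent.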
Content loss function: In order to control the content of the synthesised image, we define $E_c$ as the squared Euclidean distance between $\Phi(x)$ and $\Phi(x_c)$:
$$E_c(\Phi(x), \Phi(x_c)) = \lVert \Phi(x) - \Phi(x_c) \rVert^2.$$
As in [LW16], we minimise the energy function using backpropagation with L-BFGS. In the energy function, $\alpha_1$ and $\alpha_2$ are the weights for the style image and content image constraints, respectively. Both weights were set experimentally, and they can be fine-tuned to interpolate between content and style preservation.
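As a worked illustration of the content term and of how the two weights trade off against each other, the following sketch evaluates a squared Euclidean distance on toy feature vectors and forms the weighted sum. The function names, the feature values and the weights are hypothetical, and the sketch only evaluates the energy; it does not perform the L-BFGS optimisation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened feature maps."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same size")
    return sum((p - q) ** 2 for p, q in zip(a, b))


def total_energy(style_term, content_term, style_weight, content_weight):
    """Weighted sum of the style and content loss terms."""
    return style_weight * style_term + content_weight * content_term


phi_x = [0.5, 1.0, 0.0, 2.0]   # toy features of the synthesised image
phi_xc = [0.5, 0.0, 0.0, 1.0]  # toy features of the content image

content_term = squared_distance(phi_x, phi_xc)
print(content_term)  # 2.0

# The style term and both weights are illustrative values only.
print(total_energy(3.0, content_term, style_weight=1.0, content_weight=2.0))  # 7.0
```

Raising the content weight relative to the style weight pulls the optimum towards the content image, and lowering it favours the style patches.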
[Cha16] manually generated the semantic masks that they used in their work to control the style transfer. Each image used one mask containing semantic labels, where each component (not necessarily connected) was indicated by a particular pixel value in the image. Often these values were carefully chosen so that components with similar appearance such as ear and nose would be assigned similar mask values. Not only is it tedious to manually segment the image, but for most images some parts cannot be partitioned accurately. Therefore, instead of using a single crisp mask to control an image stylization, we propose instead to use a set of soft masks. Such soft masks provide more information than a single crisp mask, and do not require potentially unreliable boundaries to be set in the semantic mask, which is especially beneficial at ill-defined object boundaries.
In this paper we aim to generate soft masks automatically, which makes mask-based style transfer more convenient for the user. However, generating appropriate masks is challenging. Ideally, the segmentation of the style and content images should be consistent, e.g. using co-segmentation [VRK11]. However, such approaches have not been developed for semantic segmentation. Moreover, the different appearance of photographs compared to artwork (typically used for style images) leads to the cross-depiction problem [HCWC15], so that semantic segmentation techniques trained on photographs often fail on paintings. In this paper we not only demonstrate our approach for the domain of portraits, which are a popular topic for style transfer [SED16] and for non-photorealistic rendering in general, but also show stylisation of scenes containing other objects, such as cars and trains. Portrait style transfer allows us to leverage state-of-the-art techniques for face detection, which are more robust than general segmentation methods and are effective even for many artworks. In our method, facial component masks are automatically extracted using a combination of semantic segmentation, facial landmark detection, and skin detection.
[ZJRP15] proposed a semantic segmentation method named CRF-RNN which can segment 20 different objects. CRF-RNN achieves a good result on the popular Pascal VOC segmentation benchmark. This improvement can be attributed to the uniting of the strengths of CNNs and CRFs in a single deep network. In our work, we use CRF-RNN to produce semantic probability maps. Instead of labelling each pixel with an object category, we skip over the max pooling stage and extract the neural activations before that and rescale them to [0, 1]. These are treated as probability maps predicting the chance of each pixel belonging to each object category. An example is shown in figure 3.
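The rescaling step can be sketched as follows. The text does not state which rescaling is used, so this sketch assumes a simple min-max normalisation applied to one activation map at a time; the function name `rescale` and the toy values are hypothetical.

```python
def rescale(activation_map):
    """Min-max rescale a 2D activation map (a list of rows) to [0, 1]."""
    flat = [v for row in activation_map for v in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A constant map carries no evidence for the class; return all zeros.
        return [[0.0 for _ in row] for row in activation_map]
    span = high - low
    return [[(v - low) / span for v in row] for row in activation_map]


person_activation = [[-2.0, 0.0], [2.0, 6.0]]
print(rescale(person_activation))  # [[0.0, 0.25], [0.5, 1.0]]
```

Other normalisations, such as a softmax across the class maps, would also produce values in [0, 1]; the choice affects how the maps for different classes compare with each other.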
This provides 20 probability masks which represent different objects. Since most images only contain a small number of object types, rather than use all 20 semantic masks we just use a subset of five so as to reduce memory requirements and improve efficiency. For a given content and style image pair the five semantic masks are automatically selected as the five masks maximising their average probability.
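The selection of the most relevant masks can be sketched as below. The sketch assumes that the score of a class is the mean of its average probability in the content image and in the style image, which is one plausible reading of the description; the function names and the toy probability maps are hypothetical.

```python
def mean_probability(mask):
    """Average value of a 2D probability map (a list of rows)."""
    return sum(sum(row) for row in mask) / (len(mask) * len(mask[0]))


def select_masks(content_maps, style_maps, k=5):
    """Return the k class labels with the highest average probability."""
    scores = {
        label: 0.5 * (mean_probability(content_maps[label])
                      + mean_probability(style_maps[label]))
        for label in content_maps
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


content_maps = {
    "person": [[0.9, 0.8]],
    "dog": [[0.1, 0.3]],
    "car": [[0.0, 0.05]],
}
style_maps = {
    "person": [[0.7, 0.9]],
    "dog": [[0.4, 0.2]],
    "car": [[0.0, 0.0]],
}
print(select_masks(content_maps, style_maps, k=2))  # ['person', 'dog']
```

Classes that are absent from both images receive a low score and are dropped, which keeps the number of semantic channels small.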
We have found that the CRF-RNN is mostly reliable for photographs. For paintings its performance degrades, especially as the style of the artwork becomes more extreme. However, it is still capable of producing adequate extractions of people, cars, etc. for many paintings (used as style images) that we have tested.
Skin detection is performed on the photographic images [BDPFG17], using a rule-based analysis of pixels in YCbCr colour space. The skin mask is then intersected with the person mask provided by the CRF-RNN, so as to subdivide the person into skin and non-skin (e.g. hair, clothing). An example is shown in figure 4.
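The subdivision of the person region into skin and non-skin can be sketched as follows. The sketch assumes that the intersection of two masks is taken as the pixel-wise minimum, which reduces to a logical AND for binary masks; the function names and the toy masks are hypothetical.

```python
def intersect(a, b):
    """Pixel-wise minimum of two masks with values in [0, 1]."""
    return [[min(p, q) for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def split_person(person_mask, skin_mask):
    """Split the person mask into a skin part and a non-skin part."""
    skin_part = intersect(person_mask, skin_mask)
    non_skin_part = [
        [p - s for p, s in zip(row_p, row_s)]
        for row_p, row_s in zip(person_mask, skin_part)
    ]
    return skin_part, non_skin_part


person = [[1.0, 1.0, 0.0]]    # the third pixel is background
skin_like = [[1.0, 0.0, 1.0]]  # the third pixel is a skin-coloured background pixel
skin, non_skin = split_person(person, skin_like)
print(skin)      # [[1.0, 0.0, 0.0]]
print(non_skin)  # [[0.0, 1.0, 0.0]]
```

The intersection discards skin-coloured pixels outside the person region, so background areas with similar colours do not enter the skin mask.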
Since skin detection is primarily colour based, it is not in general effective on artwork due to the typical colour shifts, as well as distortions caused by strong brush stroke textures. Therefore, for paintings the facial region is detected using the face detector, rather than using skin detection.
Facial landmark detection is performed using OpenFace [BRM16], which is based on Conditional Local Neural Fields, a version of the well known Constrained Local Model approach. Sixty-eight facial landmarks are located, from which the eye, nose, inner and outer mouth regions are determined – see figure 4.
Since the facial landmarks only cover the lower half of the face, the outline of the face is extended upwards, and intersected with the person mask provided by the semantic segmentation to produce a good approximation to the head region. This mask is used for artwork. For photographs the skin mask is used instead of the extended facial region as it is more accurate (although prone to noise).
The above steps result in a set of masks that are blurred to produce soft masks identifying the following objects: face/skin, nose, eye and mouth (see figure 4 for an example). To provide a more compact visualisation we also combine the set of soft masks into a single colour image, also shown in figure 4. The soft masks for body, background and face/skin are mapped to red, green and blue respectively, while the eye, nose and mouth masks are mapped to cyan, yellow and magenta respectively. (Note that when performing style transfer the multiple soft masks are used instead.)
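The colour visualisation can be sketched as follows, using the colour assignment given above. The blending rule (adding each mask value times its colour and clipping to 1) is an assumption, as are the dictionary keys and the function name `composite`.

```python
# RGB colours for each soft mask, following the mapping described in the text.
COLOURS = {
    "body": (1.0, 0.0, 0.0),        # red
    "background": (0.0, 1.0, 0.0),  # green
    "face": (0.0, 0.0, 1.0),        # blue (face/skin)
    "eyes": (0.0, 1.0, 1.0),        # cyan
    "nose": (1.0, 1.0, 0.0),        # yellow
    "mouth": (1.0, 0.0, 1.0),       # magenta
}


def composite(masks):
    """Blend soft masks (label -> 2D list) into one RGB image for display."""
    first = next(iter(masks.values()))
    rows, cols = len(first), len(first[0])
    image = [[[0.0, 0.0, 0.0] for _ in range(cols)] for _ in range(rows)]
    for label, mask in masks.items():
        colour = COLOURS[label]
        for r in range(rows):
            for c in range(cols):
                for ch in range(3):
                    value = image[r][c][ch] + mask[r][c] * colour[ch]
                    image[r][c][ch] = min(1.0, value)  # clip to the valid range
    return image


masks = {"body": [[1.0, 0.0]], "eyes": [[0.0, 0.5]]}
print(composite(masks))  # [[[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]]
```

The composite is only for display; information is lost where masks overlap, which is why the separate soft masks are used for the transfer itself.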
We use the pre-trained 19-layer VGG network together with two augmented semantic layers. Patches are extracted from the selected layers with a stride of one. Following the patch-based approach of [LW16], we synthesise at multiple increasing resolutions, and randomly initialise the optimisation. On an NVIDIA GTX Titan GPU with 12 GB of RAM, synthesis takes from 5 to 30 minutes depending on the quality and resolution.
We will now compare the proposed method with several popular methods: [GEB15b, LW16] which are representative global and local neural style transfer methods, and [GEB16, Cha16] which use manual segmentation to improve style transfer.
Note that for our method multiple soft masks were used; the single colour mask is just shown for illustrative purposes. For [Cha16], we set the content weight to 10, style weight to 25, semantic weight to 100, and we use the masks from [Cha16] when available and otherwise manually draw them ourselves. For [GEB16], we used two image maps of values in the range [0,1] for content and style images like figure 3 (c, d), similar to the examples used in their paper, which are also used in our method. To partially overcome orientation and scale differences between the style and the content images, we also allow a range of rotations and scalings to be considered in the CNNMRF, following the settings in [LW16].
We use figure 3(a) with several different backgrounds as the content image, and choose figure 3(b) as the style image. Style transfer results obtained by the different methods are shown in figure 5. Considering the four existing methods and by comparing the results in figure 5, it seems that [GEB15b] and [LW16] cannot transfer the background texture well. [Cha16] achieves better background texture transfer, comparable to our method, but some key facial parts (nose and mouth) are lost. [GEB16] can control the spatial texture very well, but the human style transfer is not so good. It also generates errors in rows 1 and 2 of figure 5(c).
Figures 7 and 9 show style transfer applied separately to photographs of men and women. We transfer the style of each style image to each content image. These figures show that our method achieves better results than the CNNMRF method and avoids applying style to inappropriate parts. The style images contain a range of simple and more complicated textures. In both cases, our method achieves effective results and preserves the content of the images. [LW16] can also achieve interesting results, but only for style images with simple textures. For some examples, such as the woman content image in figure 8(b) and (c), the CNNMRF method achieves interesting results as well, but our method achieves better results in specific parts, such as the eyes, nose, mouth and background area. For style images that contain a mixture of textures (figures 7 and 9), the results of [LW16] have many errors in which styles are misapplied. In the last row, column 1 of figure 9, our results contain artefacts due to errors in the content semantic masks.
More style transfer results for objects like train, car, bus and boat are shown in figures 11, 13, 15, and 16. In these examples, in the mask images the green part shows the background probability mask, and the red part shows the object probability mask. Our method produces better results in all these examples.
Automatic multi probability maps selection. Not only will probability maps provide a richer feature vector that will benefit the style transfer, but avoiding the need for thresholding or winner-take-all selection has the potential to improve robustness. Figure 18 shows an example in which our automatic semantic mask selection effectively chooses relevant object types (person and dog). It demonstrates style transfer using our method when multiple object categories are present. Note that even though the irrelevant 3rd – 5th masks contain very little response, it is not a problem to include them.
Multiple style images. Our method also allows styles to be transferred from multiple style images to a single content image. In this case, the semantic masks are essential to direct the method to choose suitable patches. Some interesting style transfer results are shown in figures 19 and 20.
Comparison of soft masks and binary masks. We compare our method using soft masks with alternative binary masks. The results are shown in figure 21. In comparison, the results with the soft masks (the 2nd column in figure 11) not only avoid choosing thresholds but also are visually better since more information is preserved.
Modifying the number of masks. The semantic segmentation significantly affects the style transfer results. For some style images, for example paintings of portraits, it is difficult to automatically segment the face, skin, mouth, eyes, etc., and to properly segment the background and foreground. Failures in the segmentation will cause some background texture to be embedded into the foreground elements in the synthesised image, thereby generating bad content, such as the jewellery in figure 8(f) and our result in figure 9 row 4. The jewellery should be around the neck and not in the background. If the accuracy and reliability of the semantic segmentation can be improved, this will lead to better style transfer results. Figure 22 shows an experiment in which the number of labels in the semantic masks is increased, and demonstrates the importance of separately labelling all the major components of the face.
Modifying the soft mask weight. There are three parameters in our style transfer model, $\alpha_1$, $\alpha_2$ and $\beta$, which are the weights for the style, content and semantic mask loss terms. Since the effect of $\alpha_1$ and $\alpha_2$ is considered in [LW16], we focus on studying the effect of the soft mask weight $\beta$. By default we use a fixed value of $\beta$, which can be adjusted to control the importance of semantic compliance. Figure 23 demonstrates the effect of modifying $\beta$ using the content image in figure 6(b) and the style image in figure 3(b), with $\alpha_1$ and $\alpha_2$ held fixed. When $\beta$ is too small, the result does not have sufficient semantic control and can contain semantically wrong matches. On the other hand, setting $\beta$ too large may result in matched patches having poor content/style consistency. According to our experiments, an intermediate value of $\beta$ achieves the best results.
Semantic masks are very important for improving the style transfer results. They can achieve background texture and object texture style transfer separately, and prevent them from contaminating each other. We can also fine tune the weight of the semantic masks to achieve different results. In the future we will carry out more extensive experiments to determine which weights produce the best style transfer results.
In most cases, soft masks can achieve better results than binary masks, especially in uncertain areas. The probability maps show the likelihood of having specific objects in the image, and can help capture elements of the styles for objects in the style image and preserve the structure of the content image. Therefore, they are useful for finding better patches in the style image and improving the style transfer results.
Our paper demonstrates the benefits of automatic semantic mask extraction by combining state-of-the-art methods for both semantic segmentation and facial features. The correctness and accuracy of the semantic masks are critical. Using soft masks helps mitigate this, but there is certainly scope to improve semantic segmentation, or to develop methods dedicated to generating soft semantic masks.
This work was supported by National Natural Science Foundation of China (61503128), Science and Technology Plan Project of Hunan Province (2016TP102), Scientific Research Fund of Hunan Provincial Education Department (14B025,16C0311), and Hunan Provincial Natural Science Foundation of China (2017JJ4001). We also would like to thank NVIDIA for the GPU donation.
- [BDPFG17] Brancati N., De Pietro G., Frucci M., Gallo L.: Human skin detection through correlation rules between the YCb and YCr subspaces based on dynamic color clustering. Computer Vision and Image Understanding 155 (2017), 33–42.
- [BRM16] Baltrušaitis T., Robinson P., Morency L.-P.: OpenFace: an open source facial behavior analysis toolkit. In Winter Conf. on Applications of Computer Vision (2016), pp. 1–10.
- [Cha16] Champandard A. J.: Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768 (2016).
- [CPT04] Criminisi A., Pérez P., Toyama K.: Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing 13, 9 (2004), 1200–1212.
- [EF01] Efros A. A., Freeman W. T.: Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (2001), ACM, pp. 341–346.
- [EL99] Efros A. A., Leung T. K.: Texture synthesis by non-parametric sampling. In Proc. Int. Conf. Computer Vision (1999), vol. 2, IEEE, pp. 1033–1038.
- [Fac] Face++: Face++. https://www.faceplusplus.com/face-detection/. Accessed April 4, 2015.
- [FPC00] Freeman W. T., Pasztor E. C., Carmichael O. T.: Learning low-level vision. International Journal of Computer Vision 40, 1 (2000), 25–47.
- [FSDH16] Frigo O., Sabater N., Delon J., Hellier P.: Split and match: example-based adaptive patch sampling for unsupervised style transfer. In Proc. Conf. Computer Vision and Pattern Recognition (2016), pp. 553–561.
- [GDDM14] Girshick R., Donahue J., Darrell T., Malik J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. Conf. Computer Vision and Pattern Recognition (2014), pp. 580–587.
- [GEB15a] Gatys L., Ecker A. S., Bethge M.: Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems (2015), pp. 262–270.
- [GEB15b] Gatys L. A., Ecker A. S., Bethge M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015).
- [GEB16] Gatys L. A., Ecker A. S., Bethge M., Hertzmann A., Shechtman E.: Controlling perceptual factors in neural style transfer. arXiv preprint arXiv:1611.07865 (2016).
- [HCWC15] Hall P., Cai H., Wu Q., Corradi T.: Cross-depiction problem: Recognition and synthesis of photographs and artwork. Computational Visual Media 1, 2 (2015), 91–103.
- [JAFF16] Johnson J., Alahi A., Fei-Fei L.: Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (2016), Springer, pp. 694–711.
- [KEBK05] Kwatra V., Essa I., Bobick A., Kwatra N.: Texture optimization for example-based synthesis. ACM Transactions on Graphics (ToG) 24, 3 (2005), 795–802.
- [KSE03] Kwatra V., Schödl A., Essa I., Turk G., Bobick A.: Graphcut textures: image and video synthesis using graph cuts. In ACM Transactions on Graphics (ToG) (2003), vol. 22, ACM, pp. 277–286.
- [LW16] Li C., Wand M.: Combining Markov random fields and convolutional neural networks for image synthesis. In Proc. Conf. Computer Vision and Pattern Recognition (2016), pp. 2479–2486.
- [NHH15] Noh H., Hong S., Han B.: Learning deconvolution network for semantic segmentation. In Proc. Int. Conf. Computer Vision (2015), pp. 1520–1528.
- [RDB16] Ruder M., Dosovitskiy A., Brox T.: Artistic style transfer for videos. In German Conference on Pattern Recognition (2016), Springer, pp. 26–36.
- [SED16] Selim A., Elgharib M., Doyle L.: Painting style transfer for head portraits using convolutional neural networks. ACM Transactions on Graphics (TOG) 35, 4 (2016), 129.
- [SLD17] Shelhamer E., Long J., Darrell T.: Fully convolutional networks for semantic segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 39, 4 (2017), 640–651.
- [SPB14] Shih Y., Paris S., Barnes C., Freeman W. T., Durand F.: Style transfer for headshot portraits. ACM Transactions on Graphics (TOG) 33, 4 (2014).
- [SZ14] Simonyan K., Zisserman A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- [Tho16] Thoma M.: A survey of semantic segmentation. arXiv preprint arXiv:1602.06541 (2016).
- [ULVL16] Ulyanov D., Lebedev V., Vedaldi A., Lempitsky V.: Texture networks: Feed-forward synthesis of textures and stylized images. In Int. Conf. on Machine Learning (ICML) (2016).
- [VRK11] Vicente S., Rother C., Kolmogorov V.: Object cosegmentation. In Conf. Computer Vision and Pattern Recognition (2011), IEEE, pp. 2217–2224.
- [WL00] Wei L.-Y., Levoy M.: Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques (2000), ACM Press/Addison-Wesley Publishing Co., pp. 479–488.
- [YZY17] Yang Y., Zhao H., You L., Tu R., Wu X., Jin X.: Semantic portrait color transfer with internet images. Multimedia Tools and Applications 76, 1 (2017), 523–541.
- [ZCC13] Zhang W., Cao C., Chen S., Liu J., Tang X.: Style transfer via image component analysis. IEEE Transactions on Multimedia 15, 7 (2013), 1594–1601.
- [ZJRP15] Zheng S., Jayasumana S., Romera-Paredes B., Vineet V., Su Z., Du D., Huang C., Torr P. H. S.: Conditional random fields as recurrent neural networks. In Proc. Int. Conf. Computer Vision (2015), pp. 1529–1537.