Language Guided Fashion Image Manipulation with Feature-wise Transformations
This is an extended version of a paper with the same title that has been accepted for presentation at the First Workshop on Computer Vision For Fashion, Art and Design at ECCV 2018. This research was supported in part by TUBITAK with award no 217E029. We would like to thank NVIDIA Corporation for the donation of a Quadro P5000 GPU.
Developing techniques for editing an outfit image through natural sentences and accordingly generating new outfits has promising applications for art, fashion and design. However, it is a challenging task, since the manipulation should be carried out only on the relevant parts of the image while keeping the remaining sections untouched. Moreover, this manipulation process should generate an image that is as realistic as possible. In this work, we propose FiLMedGAN, which leverages feature-wise linear modulation (FiLM) to relate and transform visual features with natural language representations without using extra spatial information. Our experiments demonstrate that this approach, when combined with skip connections and total variation regularization, produces more plausible results than the baseline work, and has a better localization capability when generating new outfits consistent with the target description.
Keywords: image editing, fashion images, generative adversarial networks
Language based image editing (LBIE) is a recently proposed umbrella term which describes the task of transforming a source image based on natural language descriptions. A specific case of LBIE aims at modifying an outfit in an image using textual descriptions as target transformations, which has potential applications in art, fashion, shopping and design. However, this is a challenging problem for two main reasons. A successful model should be able to (i) reflect the changes in the input image while preserving structural coherence (e.g. body shape, pose, person identity), and (ii) understand and resolve the local changes in the image according to only the relevant parts of the textual description. While the former is about the image generation process, the latter is related to understanding the relations between the source image and the language description, and it requires disentangling semantics from both visual and textual modalities. In this respect, it shares some similarities with other integrated vision and language problems such as visual question answering (VQA).
The main motivation of this paper comes from a recent conditioning mechanism known as Feature-wise Linear Modulation (FiLM), which was initially proposed for solving complicated VQA tasks and has proven very useful. In this work, we propose a new conditional Generative Adversarial Network (GAN), named FiLMedGAN, which incorporates FiLM based feature transformations to better guide the manipulation process based on natural language descriptions. To increase the overall quality of the resulting images, our network architecture also employs skip connections, and we additionally use total variation regularization during training. We demonstrate that our proposed approach can synthesize and modify plausible outfit images without the need for extra spatial information such as segmentation maps, body joints or pose guidance, as commonly considered in previous work (see Fig. 1).
Fig. 1: An input image and the outfit images generated from it based on different textual descriptions.
2 Related Work
Our model is based on Generative Adversarial Networks (GANs). GANs have become one of the dominant methods to build generative models of complex, real-life data and many GAN variants have been proposed for a range of generation tasks. GANs can be formulated as a two-player game where a discriminator ($D$) and a generator ($G$) are trained in an alternating manner with an adversarial loss. Despite the difficulties in training, adversarial learning has been applied to numerous domains such as text-to-image synthesis [28, 42, 41, 43, 38], language based image editing [7, 44], person image generation [20, 33, 27, 18, 21, 13] and texture synthesis [14, 2, 37, 36].
The most relevant work to ours is by Dong et al.  who first learn a visual-semantic text embedding from the image-text description pairs and then adversarially train a conditional generator network to perform LBIE on bird images from Caltech-200  and flower images from Oxford-102 . Our model is built upon these ideas and indeed can be viewed as an improved version of that work. The details of this model and the extensions we propose in this paper will be described fully in Sec. 3.
In another related work, Zhu et al. performed LBIE on fashion images and proposed a model called FashionGAN. They emphasized structural coherence, which involves retaining body shape and pose, producing image parts that conform to the given language description and enforcing coherent visibility of body parts. For that purpose, they proposed a two-stage generator model, which also takes a human-parsed segmentation map of the input image as complementary information. In the second stage, they generate the target image conditioned on the language description and on the segmentation map produced by the first stage. This differs significantly from our approach, since we neither require segmentation maps nor employ explicit spatial constraints, which are costly to obtain and might not always be available. We also believe that directly using segmentation maps in synthesizing the output might introduce some visual inconsistencies between the generated output and the actual input, as the output is not generated in a holistic manner.
Other related works focus on image generation from text rather than on directly manipulating images. Lassner et al. proposed a model (ClothNet) which is able to generate full body images of people with clothing conditioned on a specific pose, shape and color. The CAGAN and VITON models take a person and a clothing image as inputs to dress up the person with the specified clothing item. The rest of the related work is mostly concerned with the pose of humans and accordingly adds some spatial constraints [20, 21, 33, 27, 8, 39], which also distinguishes it from our work.
A popular approach to increase the quality of generated images is to incorporate attention mechanisms into GANs [22, 16, 38, 5, 40], which help identify the most relevant parts of images or features as needed. Ma et al. proposed a deep attention encoder as a part of their model for instance level translation. Kastaniotis et al. used an attention mechanism in the discriminator for generating better cell images. Xu et al. proposed an attention driven multi-stage refinement approach for text-to-image problems. Chen et al. proposed an extra attention network for object transfiguration. A recent work of Zhang et al. involves a self-attention model for retaining global consistency in generated images. In contrast to these previous works, we explore using FiLM transformations as a conditioning mechanism and exploit its implicit attention-like behavior. Quite recently, FiLM has been investigated in a similar manner for colorization and recovering textures. However, the former does not use GANs and the task explored in the latter is quite dissimilar to ours.
Network Architecture. Our network architecture is an improved version of the model suggested by Dong et al. , which is itself inspired by  and . Fig. 2(a) gives an overview of our improved architecture. The generator network in  is made up of an encoder, a residual transformation unit and a decoder. The encoder and the decoder consist of 2D convolution layers together with several strides and nearest-neighbor up-samplings followed by ReLU activations and Batch Normalization (BN) , except the first and the last layers respectively. We extend the encoder by including an extra 2D convolution and BN layers, which adds new features of dimensions . Moreover, the feature maps of the encoder are concatenated to the corresponding decoder layers via symmetric skip connections. The residual transformation unit is made up of four residual blocks. We redesign this part by adding a FiLM block after the first BN layer. The architecture of our modified residual block can be seen in Fig. 2(b). The discriminator also has an encoder with a residual branch composed of convolution layers akin to the generator, followed by a classifier layer. We modify it to process images instead of sized images by adding (2,1) strides to convolution layers. We also fuse the semantic embedding of the textual input with the encoder output using FiLM rather than a simple replication and concatenation, as done in .
To train our visual-semantic text embedding, we utilize an external pre-trained word embedding (fastText)  and the early layers of a pre-trained VGG-16  network, as done in . Sentences are represented as the output of a GRU  unit. For training, we follow the same procedure as in  and utilize a pairwise ranking loss.
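As a rough illustration of the pairwise ranking loss mentioned above, the following NumPy sketch uses a common bidirectional hinge form over cosine similarities. The exact formula is not spelled out in the text, so the form of the loss, the function names and the batch layout are assumptions; only the margin of 0.2 is taken from the training details.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge ranking loss (assumed form, not the authors' code).

    img_emb, txt_emb: arrays of shape (batch, dim); row i of each is a
    matching image/sentence pair, every other row acts as a negative.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    scores = img @ txt.T                 # scores[i, j] = cos(image i, sentence j)
    pos = np.diag(scores)                # similarities of the matching pairs
    # Image i as anchor, contrasted with every non-matching sentence j.
    cost_txt = np.maximum(0.0, margin - pos[:, None] + scores)
    # Sentence j as anchor, contrasted with every non-matching image i.
    cost_img = np.maximum(0.0, margin - pos[None, :] + scores)
    off_diag = 1.0 - np.eye(scores.shape[0])   # ignore the matching pairs
    return float(((cost_txt + cost_img) * off_diag).sum())
```

The loss is zero once every matching pair scores at least the margin above all mismatching pairs in the batch.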
Improved Conditioning using FiLM. Our inspiration comes from a recent work by Manjunatha et al. on colorization of gray scale images with natural language descriptions, in which the authors explore the use of FiLM to fuse textual representations with visual representations. Although the qualitative results are not much different from those of simple concatenation, the authors reported that the activations of the FiLM layer can emulate guided attention. This is in line with our own observation that FiLM helps localization while manipulating an image based on a textual description.
Mathematically speaking, a FiLM layer performs a feature-wise affine transform on visual features conditioned on textual information. Given a continuous vector representation of the natural language description, we compute the scaling and shifting vectors $\gamma$ and $\beta$ as in Eqn. 1, where the parameters of the two mappings are learned.
A feature output is then modulated as in Eqn. 2: it is multiplied element-wise by $\gamma$ and shifted by $\beta$, with both vectors shared across the spatial dimensions. In terms of implementation, fusing the vectors by concatenation might increase the parameter size of the network, whereas FiLM is much more efficient.
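The feature-wise affine transform described above can be sketched in NumPy as follows. This is a minimal illustration of the generic FiLM operation, not the exact layer used in our network: the linear form of the two mappings, the parameter names (`w_gamma`, `b_gamma`, `w_beta`, `b_beta`) and the (batch, channels, height, width) layout are assumptions made for the example.

```python
import numpy as np

def film_params(text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Map a text embedding to per-channel scale and shift vectors.

    text_emb: (batch, emb_dim); w_*: (emb_dim, channels); b_*: (channels,).
    A single linear map per vector is an assumption for this sketch.
    """
    gamma = text_emb @ w_gamma + b_gamma   # (batch, channels)
    beta = text_emb @ w_beta + b_beta      # (batch, channels)
    return gamma, beta

def film(features, gamma, beta):
    """Apply the feature-wise affine transform.

    features: (batch, channels, height, width). gamma and beta are
    broadcast over the two spatial dimensions, so every location of a
    feature map shares the same scale and shift.
    """
    return gamma[:, :, None, None] * features + beta[:, :, None, None]
```

Because the text only contributes one scale and one shift per channel, the conditioning leaves the number of feature channels unchanged, unlike replication and concatenation.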
Regularization with Total Variation (TV). In our FiLMed experiments, we observed artifacts and blur in some of the manipulated images. To overcome this issue, we additionally include a total variation loss as a regularization term in the loss function of the generator. The final adversarial loss function that we use in our FiLMedGAN model thus becomes:
The loss distinguishes three kinds of text input: the text matching the image, a mismatching text, and a semantically relevant text.
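To make the regularization term concrete, the sketch below computes an anisotropic total variation penalty and adds it to a generator loss. The precise TV variant and its weight are not recoverable from the text, so the absolute-difference form, the function names and the `tv_weight` argument are assumptions for illustration.

```python
import numpy as np

def total_variation(images):
    """Anisotropic total variation of a batch of images.

    images: (batch, channels, height, width). Sums the absolute
    differences between vertically and horizontally adjacent pixels.
    """
    dh = np.abs(images[:, :, 1:, :] - images[:, :, :-1, :]).sum()
    dw = np.abs(images[:, :, :, 1:] - images[:, :, :, :-1]).sum()
    return float(dh + dw)

def generator_loss(adv_loss, fake_images, tv_weight):
    """Adversarial generator loss plus a weighted TV penalty.

    adv_loss is assumed to be computed elsewhere; tv_weight is a
    hypothetical hyperparameter name.
    """
    return adv_loss + tv_weight * total_variation(fake_images)
```

A perfectly flat image has zero penalty, while isolated high-frequency artifacts increase it, which is why the term discourages the artifacts mentioned above.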
Dataset. In our experiments, we use the Fashion Synthesis dataset, an extension of the DeepFashion dataset, which contains 78,979 images along with textual descriptions. It also provides gender, color, sleeve and category attributes as well as segmentation maps. We utilize the provided training (70,000) and test (8,979) splits and do not make use of the segmentation maps or the attributes during training.
Implementation Details and Training. To train visual-semantic text embedding, we use the Adam optimizer  with the parameters , , and a learning rate of . We train our model with the batch size of for epochs. We set the pairwise ranking loss margin to 0.2, embedding dimension to 300 and max words to 25. Similarly, to train GAN model, we employ the Adam optimizer with a different parameter , the rest is the same as visual-semantic text embedding training parameters. We also apply learning rate decay for 100 epochs with . We set the parameter of the TV regularization term to . All GAN models are trained for 125 epochs.
Visual-Semantic Text Embedding Evaluation. We qualitatively evaluated the learned visual-semantic text embeddings by comparing the vector representations of the first 500 test samples with each other in a pairwise manner and inspecting the top 3 most similar images based on both their projected textual and visual features. Our analysis reveals that visual-semantic text embedding learns the relationship between images and sentences in a proper manner. Fig. 3 shows the results of a sample query. As can be seen, the nearest neighbors in the embedding space are highly consistent with each other in terms of both modalities.
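The nearest-neighbor inspection described above amounts to ranking samples by similarity in the joint embedding space. A minimal sketch follows, assuming cosine similarity and already projected feature vectors; the function name and the choice of similarity measure are assumptions.

```python
import numpy as np

def top_k_neighbors(query_emb, gallery_emb, k=3):
    """Return indices and cosine similarities of the k closest gallery items.

    query_emb: (dim,) projected textual or visual feature of the query.
    gallery_emb: (n, dim) projected features of the candidate samples.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to every candidate
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]
```

The same routine can be applied with either the textual or the visual projection as the query, which is how both modalities can be compared for a sample.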
Fig. 4: Qualitative comparison. Columns: Input, Baseline, FiLM, TV, FiLM + TV, FiLMedGAN.
Qualitative Evaluation. In Fig. 4, we compare the results of our proposed FiLMedGAN model against those of the baseline method, and analyze the importance of FiLM, skip connections and TV regularization with an ablation study. While FiLM by itself performs better with regard to language conditioned visual changes (color change in the last row), TV regularization provides slightly more detailed images (hair and glasses in the first row). When they are combined, the results are visually more appealing than those of the baseline model. Moreover, introducing additional skip connections, as in our FiLMedGAN model, gives the best results in terms of image details and quality, since it decreases the information loss introduced by the vanilla encoder-decoder.
As for disadvantages, a closer inspection of the results shows that FiLMedGAN can make hair in the foreground disappear while transforming the blouse (e.g. the rightmost image in the last row of Fig. 4). Although FiLMedGAN generates plausible images in general, it may degrade some parts of the input image. This is a drawback of our approach compared to FashionGAN, which addresses this kind of issue by using segmentation maps.
It is also important to mention that FiLM helps to visualize the internal dynamics of a network and provides a kind of implicit attention mechanism. We visualize heat maps of the average filter outputs of each of the four FiLMed residual layers in the generator (Fig. 5). Generally speaking, while the first block processes the head and the legs, the third block focuses on the entire body to perform the transformations. Note that it is also possible to interpret each filter output separately rather than averaging them.
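Averaging the filter outputs of a block into a single heat map can be sketched as below. The min-max rescaling to [0, 1] and the function name are assumptions added for display purposes; only the averaging over filters is described in the text.

```python
import numpy as np

def activation_heatmap(feature_maps):
    """Collapse the filter outputs of one block into a heat map.

    feature_maps: (channels, height, width) activations of a FiLMed
    residual block for a single image. The channels are averaged and the
    result is rescaled to [0, 1] (the rescaling is an assumption).
    """
    heat = feature_maps.mean(axis=0)
    lo, hi = heat.min(), heat.max()
    if hi - lo < 1e-12:               # constant map: nothing to highlight
        return np.zeros_like(heat)
    return (heat - lo) / (hi - lo)
```

Indexing a single channel instead of taking the mean gives the per-filter view mentioned above.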
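Table 1: Inception Score (IS), Fréchet Inception Distance (FID) and average attribute score (AS) for each model.
|Metric|Baseline|FiLM|TV|FiLM + TV|FiLMedGAN|
|IS|2.52 (2.68)|2.54 (2.65)|2.48 (2.67)|2.52 (2.62)|2.58 (2.68)|
|FID|22.86 (20.73)|23.38 (20.10)|18.79 (16.16)|16.83 (14.84)|10.72 (9.12)|
|AS|0.65 (0.67)|0.65 (0.66)|0.66 (0.68)|0.66 (0.67)|0.67 (0.68)|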
Quantitative Evaluation. We apply three quantitative evaluation methods for our comparison: two for measuring realism and one for measuring the manipulation success. For quantifying realism, we consider Inception Score (IS) and Fréchet Inception Distance (FID). These evaluation metrics do not measure how successfully an image is manipulated according to a target textual description, and thus we also incorporate an attribute prediction method similar to . For each image in the test set, without loss of generality, we set the next image's text description as the target text description and measure how well the modified image's attributes agree with the actual attributes of this next image. In order to do so, we fine-tuned a pre-trained VGG-16 model to simultaneously predict gender, sleeve, color and category attributes. The first three attributes can in all likelihood be inferred from the textual description. The category attribute is considered holistic and is also included. Consequently, these attributes can be considered as representative as the textual description.
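The attribute-based protocol above can be sketched as follows. The sketch assumes that attributes are encoded as integer class labels, one column per attribute (gender, sleeve, color, category), and that the last test image wraps around to the first one as its target; both choices, as well as the function name, are assumptions.

```python
import numpy as np

def attribute_score(pred_attrs, true_attrs):
    """Agreement between edited images and their target attributes.

    pred_attrs: (n, num_attrs) labels predicted by the attribute
        classifier for each edited image i, where image i was edited
        with the description of image i + 1.
    true_attrs: (n, num_attrs) ground-truth labels of the test images.
    Returns per-attribute accuracy and their average.
    """
    # Target of image i is image i + 1 (wrap-around for the last one is
    # an assumption of this sketch).
    targets = np.roll(true_attrs, -1, axis=0)
    per_attr = (pred_attrs == targets).mean(axis=0)
    return per_attr, float(per_attr.mean())
```

In this reading, the average over the four attributes corresponds to a single score per model.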
Estimated IS, FID and average attribute scores (AS) are reported in Table 1. In terms of AS, there is no noteworthy difference among the models through epochs. We think that this is because the attributes are not distinctive enough. For example, in the first row of Fig. 3, all models have succeeded in performing the corresponding changes according to the textual description, and there are many such examples, but the details visible in the images are open to discussion. Our FiLMedGAN model gives the best IS, but IS is not a reliable measure. We observed that the IS of the early epochs of FiLM alone is implausible when compared with the IS of the original images. IS should therefore be treated only as a rough measure of quality. According to FID, FiLM is similar to the baseline, because the main role of FiLM is to improve the conditioning rather than the image quality. Interestingly, FiLM+TV clearly improves the FID score. We speculate that the two components contribute jointly to the overall result. Lastly, FiLMedGAN shows a significant improvement over all the other models, owing to its skip connections.
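For reference, the Fréchet distance underlying FID compares the mean and covariance of two sets of feature vectors. The sketch below assumes that the features (in FID, activations of a pre-trained Inception network) have already been extracted; the function name is hypothetical, and the trace of the matrix square root is obtained from eigenvalues rather than from an explicit matrix square root.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (n_samples, dim) arrays of precomputed features.
    Computes ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2)).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # The product of two PSD matrices has real, non-negative eigenvalues,
    # so the trace of its square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Lower values indicate that the generated and real feature distributions are closer, which is the sense in which the FID numbers are compared.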
We present a novel approach for language conditioned editing of fashion images. Our approach employs a GAN-based architecture which allows users to edit an outfit image by feeding in different descriptions to generate new outfits. Our experimental analysis demonstrates that our FiLMedGAN model, which employs skip connections and FiLMed residual blocks, outperforms the baselines both quantitatively and qualitatively and generates more plausible outfit images according to a given natural language description.
- [1] Barratt, S., Sharma, R.: A note on the inception score. CoRR abs/1801.01973 (2018), http://arxiv.org/abs/1801.01973
- [2] Bergmann, U., Jetchev, N., Vollgraf, R.: Learning texture manifolds with the periodic spatial GAN. In: Proceedings of the 34th International Conference on Machine Learning (ICML) (2017)
- [3] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL) (2017)
- [4] Chen, J., Shen, Y., Gao, J., Liu, J., Liu, X.: Language-based image editing with recurrent attentive models. CoRR abs/1711.06288 (2017), http://arxiv.org/abs/1711.06288
- [5] Chen, X., Xu, C., Yang, X., Tao, D.: Attention-GAN for object transfiguration in wild images. CoRR abs/1803.06798 (2018), http://arxiv.org/abs/1803.06798
- [6] Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
- [7] Dong, H., Yu, S., Wu, C., Guo, Y.: Semantic image synthesis via adversarial learning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
- [8] Esser, P., Sutter, E., Ommer, B.: A variational U-Net for conditional appearance and shape generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27 (NIPS) (2014)
- [10] Han, X., Wu, Z., Wu, Z., Yu, R., Davis, L.S.: VITON: An image-based virtual try-on network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [11] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a Nash equilibrium. CoRR abs/1706.08500 (2017), http://arxiv.org/abs/1706.08500
- [12] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML) (2015)
- [13] Jetchev, N., Bergmann, U.: The conditional analogy GAN: Swapping fashion articles on people images. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) (2017)
- [14] Jetchev, N., Bergmann, U., Vollgraf, R.: Texture synthesis with spatial generative adversarial networks. CoRR abs/1611.08207 (2016), http://arxiv.org/abs/1611.08207
- [15] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (ECCV) (2016)
- [16] Kastaniotis, D., Ntinou, I., Tsourounis, D., Economou, G., Fotopoulos, S.: Attention-aware generative adversarial networks (ATA-GANs). CoRR abs/1802.09070 (2018), http://arxiv.org/abs/1802.09070
- [17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: The International Conference on Learning Representations (ICLR) (2015)
- [18] Lassner, C., Pons-Moll, G., Gehler, P.V.: A generative model for people in clothing. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
- [19] Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
- [20] Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided person image generation. In: Advances in Neural Information Processing Systems 30 (NIPS) (2017)
- [21] Ma, L., Sun, Q., Georgoulis, S., Van Gool, L., Schiele, B., Fritz, M.: Disentangled person image generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [22] Ma, S., Fu, J., Chen, C.W., Mei, T.: DA-GAN: Instance-level image translation by deep attention generative adversarial networks (with supplementary materials). CoRR abs/1802.06454 (2018), http://arxiv.org/abs/1802.06454
- [23] Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
- [24] Manjunatha, V., Iyyer, M., Boyd-Graber, J., Davis, L.: Learning to color from language. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (2018)
- [25] Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP) (2008)
- [26] Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence. New Orleans, United States (Feb 2018), https://hal.inria.fr/hal-01648685
- [27] Pumarola, A., Agudo, A., Sanfeliu, A., Moreno-Noguer, F.: Unsupervised person image synthesis in arbitrary poses. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [28] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: Proceedings of The 33rd International Conference on Machine Learning (ICML) (2016)
- [29] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). LNCS, vol. 9351, pp. 234–241. Springer (2015), http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a, (available on arXiv:1505.04597 [cs.CV])
- [30] Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60(1-4), 259–268 (Nov 1992), https://doi.org/10.1016/0167-2789(92)90242-f
- [31] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems 29 (NIPS) (2016)
- [32] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems 29 (NIPS) (2016)
- [33] Siarohin, A., Sangineto, E., Lathuilière, S., Sebe, N.: Deformable GANs for pose-based human image generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [34] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014), http://arxiv.org/abs/1409.1556
- [35] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011)
- [36] Xian, W., Sangkloy, P., Agrawal, V., Raj, A., Lu, J., Fang, C., Yu, F., Hays, J.: TextureGAN: Controlling deep image synthesis with texture patches. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [37] Wang, X., Yu, K., Dong, C., Loy, C.C.: Recovering realistic texture in image super-resolution by deep spatial feature transform. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [38] Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [39] Zanfir, M., Popa, A.I., Zanfir, A., Sminchisescu, C.: Human appearance transfer. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [40] Zhang, H., Goodfellow, I.J., Metaxas, D.N., Odena, A.: Self-attention generative adversarial networks. CoRR abs/1805.08318 (2018), http://arxiv.org/abs/1805.08318
- [41] Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN++: Realistic image synthesis with stacked generative adversarial networks. CoRR abs/1710.10916 (2017), http://arxiv.org/abs/1710.10916
- [42] Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
- [43] Zhang, Z., Xie, Y., Yang, L.: Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [44] Zhu, S., Fidler, S., Urtasun, R., Lin, D., Loy, C.C.: Be your own Prada: Fashion synthesis with structural coherence. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)