Fashion++: Minimal Edits for Outfit Improvement
Abstract
† Authors contributed equally.
Given an outfit, what small changes would most improve its fashionability? This question presents an intriguing new vision challenge. We introduce Fashion++, an approach that proposes minimal adjustments to a full-body clothing outfit that will have maximal impact on its fashionability. Our model consists of a deep image generation neural network that learns to synthesize clothing conditioned on learned per-garment encodings. The latent encodings are explicitly factorized according to shape and texture, thereby allowing direct edits for both fit/presentation and color/patterns/material, respectively. We show how to bootstrap Web photos to automatically train a fashionability model, and develop an activation maximization-style approach to transform the input image into its more fashionable self. The edits suggested range from swapping in a new garment to tweaking its color, how it is worn (e.g., rolling up sleeves), or its fit (e.g., making pants baggier). Experiments demonstrate that Fashion++ provides successful edits, both according to automated metrics and human opinion. Project page is at http://vision.cs.utexas.edu/projects/FashionPlus.
“Before you leave the house, look in the mirror and take one thing off.” – Coco Chanel
The elegant Coco Chanel’s famous words advocate for making small changes with large impact on fashionability. Whether removing an accessory, selecting a blouse with a higher neckline, tucking in a shirt, or swapping to pants a shade darker, often small adjustments can make an existing outfit noticeably more stylish. This strategy has practical value for consumers and designers alike. For everyday consumers, recommendations for how to edit an outfit would allow them to tweak their look to be more polished, rather than start from scratch or buy an entirely new wardrobe. For fashion designers, envisioning novel enhancements to familiar looks could inspire new garment creations.
Motivated by these observations, we introduce a new computer vision challenge: minimal edits for outfit improvement. To minimally edit an outfit, an algorithm must propose alterations to the garments/accessories that are slight, yet visibly improve the overall fashionability. A “minimal” edit need not strictly minimize the amount of change; rather, it incrementally adjusts an outfit as opposed to starting from scratch. It can be recommendations on which garment to put on, take off, or swap out, or even how to wear the same garment in a better way. See Figure 1.
This goal presents several technical challenges. First, there is the question of training. A natural supervised approach might curate pairs of images showing better and worse versions of each outfit to teach the system the difference; however, such data is not only very costly to procure, it also becomes out of date as trends evolve. Secondly, even with such ideal pairs of images, the model needs to distinguish very subtle differences between positives and negatives (sometimes just a small fraction of pixels as in Fig. 1), which is difficult for an image-based model. It must reason about the parts (garments, accessories) within the original outfit and how their synergy changes with any candidate tweak. Finally, the notion of minimal edits implies that adjustments may be sub-garment level, and the inherent properties of the person wearing the clothes—e.g., their pose, body shape—should not be altered.
Limited prior work explores how to recommend a garment for an unfinished outfit [8, 11, 47, 33] (e.g., the fill-in-the-blank task). Not only is their goal different from ours, but they focus on clean per-garment catalog photos, and their recommendations are restricted to retrieved garments from a dataset. However, we propose that in the fashion domain, the problem demands going beyond seeking an existing garment to add—to further inferring which garments are detrimental and should be taken off, and how to adjust the presentation and details of each garment (e.g., cuff the jeans above the ankle) within a complete outfit to improve its style.
We introduce a novel image generation approach called Fashion++ to address the above challenges. The main idea is an activation maximization method that operates on localized encodings from a deep image generation network. Given an original outfit, we map its constituent pieces (e.g., bag, blouse, boots) to their respective codes. Then we use a discriminative fashionability model as an editing module to gradually update the encoding(s) in the direction that maximizes the outfit’s score, thereby improving its style. The update trajectory offers a spectrum of edits, starting from the least changed and moving towards the most fashionable, from which users can choose a preferred end point. We show how to bootstrap Web photos of fashionable outfits, together with automatically created “negative” alterations, to train the fashionability model. To account for both the pattern/colors and shape/fit of the garments, we factorize each garment’s encoding into texture and shape components, allowing the editing module to control where and what to change (e.g., tweaking a shirt’s color while keeping its cut vs. changing the neckline or tucking it in).
After optimizing the edit, our approach provides its output in two formats: 1) retrieved garment(s) from an inventory that would best achieve its recommendations and 2) a rendering of the same person in the newly adjusted look, generated from the edited outfit’s encodings. Both outputs aim to provide actionable advice for small but high-impact changes for an existing outfit.
We validate our approach using the Chictopia dataset and, through both automated metrics and user studies, demonstrate that it generates minimal outfit edits more successfully than several baselines. Fashion++ offers a unique new tool for data-driven fashion advice and design: a novel image generation pipeline relevant for a real-world application.
2 Related Work
Recognition for fashion.
Fashion image synthesis.
Synthesis methods explore ways to map specified garments to new poses or people. This includes generating a clothed person conditioned on a product image [9, 50, 55] (and vice versa), or conditioned on textual descriptions (e.g., “a woman dressed in sleeveless white clothes”) [39, 64], as well as methods for swapping clothes between people [38, 58] or synthesizing a clothed person in unseen poses [22, 60, 30, 42, 1, 2, 37]. Whereas these problems render people in a target garment or body pose, we use image synthesis as a communication tool to make suggestions to minimally edit outfits.
Image manipulation, translation, and style transfer are also popular ways to edit images. A large body of literature studies generating realistic images based on user-specified edits (or some target domain) that condition on semantic label maps [14, 62, 63, 61, 51], edge maps [54, 40], or 3D models [25, 53], using generative adversarial networks (GANs). Related ideas are explored in interactive image search, where users specify visual attributes to alter in their query [21, 59, 7]. Style transfer methods [3, 4, 5, 12] offer another way to edit images, turning photographs into artwork. Unlike previous work that conditions on given maps, our approach generates the maps; as a result, we enable sub-object shape changes that alter regions’ footprints, which is critical for fashion image synthesis. Most importantly, all these works aim to edit images according to human-specified input, whereas we aim to automatically suggest where and how to edit to improve the input.
Compatibility and fashionability.
Fashionability refers to the popularity or stylishness of clothing items, while compatibility refers to how well-coordinated individual garments are. Prior work on compatibility recommends garments retrieved from a database that go well together [45, 8, 10, 48, 15, 11, 47, 13], or even garments generated from GANs. Some also recommend interchangeable items [8, 47, 33] that are equally compatible to substitute in for another. We address a new and different problem: instead of recommending compatible garments from scratch, our approach tweaks an existing outfit to make it more compatible/fashionable. It can suggest removals, revise a garment, optimize fashionability, and identify where to edit, none of which is handled by existing methods. Using online “likes” as a proxy for fashionability, one prior system suggests, in words, garments or scenery a user should change to improve fashionability; however, it conditions on meta-data rather than images, and suggests coarse properties specified in words (e.g., heels, pastel shirt) that often dictate changing to an entirely new outfit.
Activation maximization is a gradient-based approach that optimizes an image to highly activate a target neuron in a neural network. It is widely used for visualizing what a network has learned [36, 44, 31, 57, 52], and recently to synthesize images [34, 17]. In particular, one such method also generates clothing images, but it produces single-garment products rather than full-body outfits. In addition, it optimizes images to match purchase history, not to improve fashionability.
Minimal editing suggests changes to an existing outfit such that it remains similar but noticeably more fashionable. To address this newly proposed task, there are three key desired objectives: (1) training must be scalable in terms of supervision and adaptability to changing trends; (2) the model must capture subtle visual differences and the complex synergy between garments that affects fashionability; and (3) edits should be localized, doing as little as swapping one garment or modifying its properties, while keeping fashion-irrelevant factors unchanged.
In the following, we first present our image generation framework, which decomposes outfit images into their garment regions and factorizes shape/fit and texture, in support of the latter two objectives (Sec. 3.1). Then we present our training data source and discuss how it facilitates the first two objectives (Sec. 3.2). Finally, we introduce our activation maximization-based outfit editing procedure and show how it recommends garments (Sec. 3.3).
3.1 Fashion++ Outfit Generation Framework
The coordination of all constituent pieces defines an outfit’s look. To control which parts (shirt, skirt, pants) and aspects (neckline, sleeve length, color, pattern) to change, while keeping identity and other fashion-irrelevant factors unchanged, we want to explicitly model their spatial locality. Furthermore, to perform minimal edits, we need to control pieces’ texture as well as their shape. Texture often decides an outfit’s theme (style): denim with solid patterns gives a more casual look, while leather with red colors gives a more street-style look. With the same materials, colors, and patterns of garments, how they are worn (e.g., tucked in or pulled out) and the fit (e.g., skinny vs. baggy pants) and cut (e.g., a V-neck vs. turtleneck) of a garment will complement a person’s silhouette in different ways. Accounting for all these factors, we devise an image generation framework that both gives control over individual pieces (garments, accessories, body parts) and also factorizes shape (fit and cut) from texture (color, patterns, materials).
Our system has the following structure at test time: it first maps an outfit image $x^{(q)}$ and its associated semantic segmentation map $m^{(q)}$ into a texture feature $t^{(q)}$ and a shape feature $s^{(q)}$. Our editing module, $F^{++}$, then gradually updates $t^{(q)}$ and $s^{(q)}$ into $t^{++}$ and $s^{++}$ to improve fashionability. Finally, based on $t^{++}$ and $s^{++}$, the system generates the output image(s) of the edited outfit $x^{++}$. Fig. 2 overviews our system. Superscripts $(q)$ and $++$ denote variables before and after editing, respectively. We omit the superscript when clear from context. We next describe how our system maps an outfit into latent features.
An input image $x$ is a real full-body photo of a clothed person. It is accompanied by a region map $m$ assigning each pixel to a region for a clothing piece or body part. We use $n$ unique region labels defined in Chictopia10k: face, hair, shirt, pants, dress, hats, etc. We first feed $x$ into a learned texture encoder $E_t$ that outputs a feature map $v$. Let $r_i$ be the region associated with label $i$. We average pool $v$ in $r_i$ to obtain the texture feature $t_i$. The whole outfit’s texture feature is represented as the concatenation $t := [t_0; \ldots; t_{n-1}]$ over all $n$ regions. See Fig. 2 top left.
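The region-wise average pooling described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `region_average_pool`, the toy feature map, and the choice to leave absent regions as zero vectors are hypothetical and not taken from the paper.

```python
import numpy as np

def region_average_pool(feature_map, region_map, num_regions):
    """Average-pool an (H, W, d) feature map inside each labeled region.

    feature_map: float array of shape (H, W, d), the texture encoder output.
    region_map:  int array of shape (H, W), one region label per pixel.
    Returns an array of shape (num_regions, d), one texture code per region.
    Regions that do not appear in the image keep a zero code (an assumption).
    """
    d = feature_map.shape[-1]
    pooled = np.zeros((num_regions, d), dtype=feature_map.dtype)
    for i in range(num_regions):
        mask = region_map == i
        if mask.any():
            # feature_map[mask] has shape (num_pixels_in_region, d)
            pooled[i] = feature_map[mask].mean(axis=0)
    return pooled

# Toy 2x2 "image" with 2 feature channels and 4 possible region labels.
v = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[5.0, 6.0], [7.0, 8.0]]])
m = np.array([[0, 0],
              [1, 2]])
t = region_average_pool(v, m, num_regions=4)
outfit_texture = t.reshape(-1)  # concatenation of all per-region codes
```

Each row of `t` plays the role of one region's texture feature, and flattening the rows gives the whole-outfit texture feature.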
We also develop a shape encoding that allows per-region shape control separate from texture control. Specifically, we construct a binary segmentation map $m_i$ for each region $r_i$, and use a shared shape encoder $E_s$ to encode each $m_i$ into a shape feature $s_i$. The whole outfit’s shape feature is represented as the concatenation $s := [s_0; \ldots; s_{n-1}]$ over all $n$ regions. See Fig. 2 bottom left.
To generate an image, we first use a shape generator $G_s$ that takes in the whole-body shape feature $s$ and generates an image-sized region map $\hat{m}$. We then perform region-wise broadcasting, which broadcasts each region's texture feature $t_i$ to all locations with label $i$ based on $\hat{m}$, and obtain the texture feature map $u$. (Note that $u$ has uniform features within a region, since it is average-pooled, while the texture encoder's feature map $v$ does not.) Finally, we channel-wise concatenate $u$ and $\hat{m}$ to construct the input to a texture generator $G_t$, which generates the final outfit image. This generation process is summarized in Fig. 2 (right). Hence, the generators $G_s$ and $G_t$ learn to reconstruct outfit images conditioned on garment shapes and textures.
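As a rough illustration of region-wise broadcasting and channel-wise concatenation, the sketch below uses NumPy integer indexing. The function names and toy values are hypothetical, and representing the region map as one-hot channels before concatenation is an assumption; the text only states that the two maps are concatenated channel-wise.

```python
import numpy as np

def region_broadcast(region_codes, region_map):
    """Copy each region's code to every pixel carrying that region's label.

    region_codes: (n, d) array, one texture code per region.
    region_map:   (H, W) int array of region labels.
    Returns an (H, W, d) texture feature map, uniform within each region.
    """
    return region_codes[region_map]

def one_hot(region_map, num_regions):
    """Turn an (H, W) label map into (H, W, num_regions) one-hot channels."""
    return np.eye(num_regions)[region_map]

t = np.array([[2.0, 3.0],
              [5.0, 6.0],
              [7.0, 8.0]])          # codes for 3 regions, d = 2
m_hat = np.array([[0, 0],
                  [1, 2]])          # generated region map (toy values)

u = region_broadcast(t, m_hat)                            # shape (2, 2, 2)
generator_input = np.concatenate([u, one_hot(m_hat, 3)], axis=-1)
```

Because every pixel in a region receives the same code, editing one region's code changes the conditioning for that region only.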
Although jointly training the whole system is possible, we found a decoupled strategy to be effective. Our insight is that if we assume a fixed semantic region map, the generation problem is reduced to an extensively studied image translation problem, and we can benefit from recent advances in this area. In addition, if we separate the shape encoding and generation from the whole system, it reduces to an auto-encoder, which is also easy to train.
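Specifically, for the image translation part (Texture++ in Fig. 2), we adapt conditional generative adversarial networks (cGANs) that take in segmentation label maps and associated feature maps to generate photo-realistic images [63, 51]. We combine the texture encoder $E_t$ and texture generator $G_t$ with a discriminator $D$ to formulate a cGAN. An image is generated by $G_t(m, u)$, where $u = \Phi(E_t(x), m)$, and $\Phi$ is the combined operations of region-wise average pooling and broadcasting. The discriminator $D$ aims to distinguish real images from generated ones. $G_t$, $E_t$, and $D$ are learned simultaneously with a minimax adversarial game objective:
$$G_t^{*}, E_t^{*} = \arg\min_{G_t, E_t} \max_{D} \; \mathcal{L}_{GAN}(G_t, D, E_t) + \mathcal{L}_{FM}(G_t, D, E_t), \tag{1}$$
where $\mathcal{L}_{GAN}$ is defined as:
$$\mathcal{L}_{GAN} = \mathbb{E}_{(m, x)}\big[\log D(m, x) + \log\big(1 - D(m, G_t(m, u))\big)\big] \tag{2}$$
for all training images $x$, and $\mathcal{L}_{FM}$ denotes feature matching loss. For the shape deformation part of our model (Shape++ in Fig. 2), we formulate a shape encoder $E_s$ and generator $G_s$ with a region-wise Variational Autoencoder (VAE). The VAE assumes the data is generated by a directed graphical model $p(m \mid s)$ and the encoder learns an approximation $q(s \mid m)$ to the posterior distribution $p(s \mid m)$. The prior over the encoded feature is set to be Gaussian with zero mean and identity covariance, $p(s) = \mathcal{N}(0, I)$. The objective of our VAE is to minimize the Kullback-Leibler (KL) divergence between $q(s \mid m)$ and $p(s)$, and the reconstruction loss:
$$D_{KL}\big(q(s \mid m) \,\|\, p(s)\big) + \mathbb{E}_{m}\big[\ell_{rec}\big(m, G_s(E_s(m))\big)\big]. \tag{3}$$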
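For a diagonal-Gaussian encoder and a standard normal prior, the KL term of a VAE objective has a well-known closed form, sketched below. The function names are hypothetical, and the mean absolute error used for the reconstruction term is a placeholder assumption rather than the paper's stated choice.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

def vae_loss(mask, reconstruction, mu, log_var):
    """KL term plus a reconstruction term (mean absolute error, assumed)."""
    recon = np.abs(mask - reconstruction).mean()
    return kl_to_standard_normal(mu, log_var) + recon

# Toy binary mask for one region and a perfect reconstruction of it.
mask = np.array([[0.0, 1.0],
                 [1.0, 1.0]])
kl_zero = kl_to_standard_normal(np.zeros(3), np.zeros(3))
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
loss_perfect = vae_loss(mask, mask, np.zeros(3), np.zeros(3))
```

The KL term vanishes exactly when the encoder outputs the prior itself, which is what keeps the learned shape codes in a region from which new, plausible shapes can be decoded.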
Note that simply passing in the 2D region label map as the shape encoding would be insufficient for image editing. The vast search space of all possible masks is too difficult to model, and, during editing, mask alterations could often yield unrealistic or uninterpretable “fooling” images [36, 44]. In contrast, our VAE design learns the probability distribution of the outfit shapes, and hence can generate unseen shapes corresponding to variants of features from the learned distribution. This facilitates meaningful shape edits.
Having defined the underlying image generation architecture, we next introduce our editing module for revising an input’s features (encodings) to improve fashionability.
3.2 Learning Fashionability from Web Photos
Our editing module (Sec. 3.3) requires a discriminative model of fashionability, which in turn prompts the question: how can we train a fashionability classifier for minimal edits? Perhaps the ideal training set would consist of pairs of images in which each pair shows the same person in slightly different outfits, one of them judged to be more fashionable than the other. However, such a collection is not only impractical to curate at scale, it would also become out of date as soon as styles evolve. An alternative approach is to treat a collection of images from a specific group (e.g., celebrities) as positive exemplars and another group (e.g., everyday pedestrians) as negatives. However, we found such a collection suffers from conflating identity and style, and thus the classifier finds fashion-irrelevant properties discriminative between the two groups.
Instead, we propose to bootstrap less fashionable photos automatically from Web photos of fashionable outfits. The main idea is to create “negative” outfits from fashionista photos. We start with a Chictopia full-body outfit photo (a “positive”), select one of its pieces to alter, and replace it with a piece from a different outfit. To increase the probability that the replacement piece degrades fashionability, we extract it from an outfit that is most dissimilar to the original one, as measured by Euclidean distance on CNN features. We implement the garment swap by overwriting the encoding $[t_i; s_i]$ for garment $i$ with the target’s. See Fig. 3.
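The negative-bootstrapping step above can be sketched as follows. The array layout, the function names, and the toy numbers are all hypothetical; `outfit_feats` stands in for whatever CNN features are used to compare whole outfits.

```python
import numpy as np

def most_dissimilar(index, outfit_feats):
    """Index of the outfit farthest (Euclidean) from outfit `index`."""
    dists = np.linalg.norm(outfit_feats - outfit_feats[index], axis=1)
    return int(np.argmax(dists))

def make_negative(encodings, outfit_feats, index, region):
    """Create a 'negative' by swapping one garment code from a donor outfit.

    encodings:    (num_outfits, num_regions, d) per-garment codes.
    outfit_feats: (num_outfits, k) whole-outfit features for the distance.
    Returns the altered copy of outfit `index` and the donor's index.
    """
    donor = most_dissimilar(index, outfit_feats)
    negative = encodings[index].copy()          # leave the positive intact
    negative[region] = encodings[donor, region]
    return negative, donor

outfit_feats = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
encodings = np.arange(12, dtype=float).reshape(3, 2, 2)
negative, donor = make_negative(encodings, outfit_feats, index=0, region=1)
```

The swap happens in code space rather than pixel space, so the negative keeps every other garment code of the positive unchanged.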
We use this data to train a 3-layer multilayer perceptron (MLP) fashionability classifier $f$. It is trained to map the encoding $[t; s]$ for an image $x$ to its binary fashionability label $y \in \{0, 1\}$.
The benefit of this training strategy is threefold: First, it makes curating data easy, and also refreshes easily as styles evolve—by downloading new positives. Second, by training the fashionability classifier on these decomposed (to garments) and factorized (shape vs. texture) encodings, a simple MLP effectively captures the subtle visual properties and complex garment synergies (see Supp. for ablation study). Finally, we stress that our approach learns from full-body outfit photos being worn by people on the street, as opposed to clean catalog photos of individual garments [45, 8, 10, 48, 41, 47]. This has the advantages of allowing us to learn aspects of fit and presentation (e.g., tuck in, roll up) that are absent in catalog data, as well as the chance to capture organic styles based on what outfits people put together in the wild.
3.3 Editing an Outfit
With the encoders $E_t$, $E_s$, generators $G_t$, $G_s$, and editing module $F^{++}$ in hand, we now explain how our approach performs a minimal edit. Given test image $x^{(q)}$, Fashion++ returns its edited version(s):
$$x^{++} := G\big(F^{++}(E(x^{(q)}))\big), \tag{4}$$
where $G$ and $E$ represent the models for both shape and texture. When an inventory of discrete garments is available, our approach also returns the nearest real garment $g_i$ for region $i$ that could be used to achieve that change, as we will show in results. Both outputs (the rendered outfit and the nearest real garment) are complementary ways to provide actionable advice to a user.
Computing an edit.
There are two key steps: calculating the desired edit, and generating the edited image. To calculate an edit, we take an activation maximization approach: we iteratively alter the outfit’s feature such that it increases the activation of the fashionable label according to the fashionability classifier $f$.
Formally, let $S := \{t_0, s_0, \ldots, t_{n-1}, s_{n-1}\}$ be the set of all features in an outfit, and $O \subseteq S$ be a subset of features corresponding to the target regions or aspects that are being edited (e.g., shirt region, shape of skirt, texture of pants). We update the outfit’s representation as:
$$o^{(k+1)} := o^{(k)} + \lambda \frac{\partial p_f(t^{(k)}, s^{(k)})}{\partial o^{(k)}}, \quad \forall o \in O, \tag{5}$$
where $o^{(k)}$ denotes the features after $k$ updates, $(t^{(k)}, s^{(k)})$ denotes substituting only the target features in the original features $(t^{(q)}, s^{(q)})$ with $o^{(k)}$ while keeping other features unchanged, $p_f(t^{(k)}, s^{(k)})$ denotes the probability of fashionability according to classifier $f$, and $\lambda$ denotes the update step size. Each gradient step in Eqn (5) yields an incremental adjustment to the input outfit. Fig. 4 shows the process of taking successive gradient steps at a fixed step size (see Sec. 4 for details). Presented with this spectrum of edits, a user may choose a preferred end point (i.e., his/her preferred tradeoff in the “minimality” of change vs. maximality of fashionability). Finally, as above, $F^{++}(t^{(q)}, s^{(q)})$ gives the updated $(t^{++}, s^{++})$.
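The gradient-ascent edit can be illustrated with a toy stand-in for the classifier. The sketch below swaps the paper's MLP for a single logistic unit so that the gradient has a closed form; the weights, step size, and function names are all hypothetical. Only the entries selected by `target_mask` are updated, mirroring the idea of editing a target subset of features while the rest stay fixed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fashion_prob(z, w, b):
    """Toy fashionability score: a logistic unit on the outfit code z."""
    return sigmoid(z @ w + b)

def edit_trajectory(z0, target_mask, w, b, step_size, num_steps):
    """Gradient ascent on the score, touching only the masked entries.

    Returns the list of codes z^(0), ..., z^(num_steps): the spectrum of
    edits from least changed to most fashionable.
    """
    z = z0.astype(float).copy()
    trajectory = [z.copy()]
    for _ in range(num_steps):
        p = fashion_prob(z, w, b)
        grad = p * (1.0 - p) * w              # d sigmoid(w.z + b) / dz
        z = z + step_size * grad * target_mask
        trajectory.append(z.copy())
    return trajectory

w = np.array([1.0, -2.0, 0.5, 1.5])
bias = -0.5
z0 = np.zeros(4)
mask = np.array([0.0, 0.0, 1.0, 1.0])        # edit only the last two entries
traj = edit_trajectory(z0, mask, w, bias, step_size=0.5, num_steps=6)
probs = [fashion_prob(z, w, bias) for z in traj]
```

Keeping the whole trajectory, rather than only the final code, is what lets a user pick an intermediate edit that changes less.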
To further force updates to stay close to the original, one could add a proximity objective, such as a penalty on the distance between the updated and original features, as in other editing work [25, 61]. However, balancing this smoothness term with other terms (users’ constraints in their cases, fashionability in ours) is tricky (e.g., non-convergence has been reported). We found our gradient step approach to be at least as effective at achieving gradual edits.
Optimizing where to edit.
A garment for region $i$ is represented as the concatenation of its texture and shape features: $g_i := [t_i; s_i]$. Our approach optimizes the garment that ought to be edited by cycling through all garments to find the one with most impact. By instructing the target to be that garment, we can simultaneously optimize where and how to change an outfit.
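One way to picture the search over garments is the loop below, again with a toy logistic scorer in place of the real classifier. Impact is scored here by the fashionability probability reached after a fixed number of steps on each garment alone, which is an assumption; the magnitude of the gradient with respect to each garment's code would be another plausible score. All names and values are hypothetical.

```python
import numpy as np

def prob(codes, w, b):
    """Toy logistic fashionability score on the flattened garment codes."""
    return 1.0 / (1.0 + np.exp(-(codes.reshape(-1) @ w + b)))

def edit_one_garment(codes, i, w, b, step, steps):
    """Run gradient ascent on garment i's code only."""
    z = codes.astype(float).copy()
    w_i = w.reshape(codes.shape)[i]           # weights touching garment i
    for _ in range(steps):
        p = prob(z, w, b)
        z[i] += step * p * (1.0 - p) * w_i
    return z

def best_garment_to_edit(codes, w, b, step=0.5, steps=5):
    """Try each garment in turn and keep the one with the highest score."""
    scores = [prob(edit_one_garment(codes, i, w, b, step, steps), w, b)
              for i in range(len(codes))]
    return int(np.argmax(scores)), scores

codes = np.zeros((3, 2))                      # 3 garments, 2-dim codes
w = np.array([0.1, 0.1, 2.0, -2.0, 0.5, 0.0])
best, scores = best_garment_to_edit(codes, w, b=0.0)
```

In this toy setup the second garment carries the largest weights, so editing it raises the score the most and it is selected.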
Rendering the edited image.
Then we generate the Fashion++ image output by conditioning our image generators on these edits:
$$x^{++} = G_t(m^{++}, u^{++}), \tag{7}$$
where $u^{++}$ refers to the broadcasted map of the edited texture components $t^{++}$, and $m^{++}$ refers to the VAE generated mask for the edited shape components $s^{++}$. The full edit operation is outlined in Fig. 2.
In this way, our algorithm automatically updates the latent encodings to improve fashionability, then passes its revised code to the image generator to create the appropriate image. An edit could affect as few as one garment or as many as all of them, and we can control whether edits are permitted for shape or texture or both. This is useful, for example, if we wish to insist that the garments look about the same but be edited to have different tailoring or presentation (e.g., rolled-up sleeves), i.e., shape changes only.
Retrieving a real garment matching the edit.
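Finally, we return the garment(s) that optimally achieves the edited outfit. Let $\mathcal{I}$ denote an inventory of garments. The best matching garments to retrieve from $\mathcal{I}$ are:
$$g_i^{*} := \arg\min_{g \in \mathcal{I}} \lVert g_i^{++} - g \rVert_2 \tag{8}$$
for each edited region $i$, where $g_i^{++}$ is the edited encoding for region $i$ and $g$ denotes an inventory garment’s feature. The latter is obtained by passing the real inventory garment image to the texture and shape feature encoders $E_t$ and $E_s$, and concatenating their respective results.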
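Retrieval of the closest real garment amounts to a nearest-neighbor search in the encoded space. A minimal sketch, assuming the inventory has already been encoded into an array of garment codes (the names and numbers are hypothetical):

```python
import numpy as np

def retrieve_nearest(edited_code, inventory_codes):
    """Return the index of, and distance to, the closest inventory garment.

    edited_code:     (d,) code of the edited garment.
    inventory_codes: (num_items, d) codes of real inventory garments.
    """
    dists = np.linalg.norm(inventory_codes - edited_code, axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

inventory = np.array([[0.0, 0.0],
                      [1.0, 1.0],
                      [3.0, 4.0]])
idx, dist = retrieve_nearest(np.array([0.9, 1.2]), inventory)
```

For a large inventory, an approximate nearest-neighbor index could replace the exhaustive distance computation without changing the interface.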
We now validate that Fashion++
We use the Chictopia10k  dataset for all experiments. We use images to train the generators, and to train the fashionability classifier. We use the procedure described in Sec. 3.2 to prepare positive and negative examples for training the fashionability classifier. We evaluate on such unfashionable examples. We stress that all test examples are from real world outfits, bootstrapped by swapping features (not pixels) of pieces from different outfits. This allows testing on real data while also having ground truth (see below). We use the region maps provided with Chictopia10k for all methods, though automated semantic segmentation could be used as methods continue to improve [24, 26, 27]. The model architecture and more training details are in Supp.
Since our work is the first to consider the minimal edit problem, we develop several baselines for comparison. Similarity-only selects the nearest neighbor garment in the database (Chictopia10k) to maintain the least amount of change. Fashion-only changes to the piece that gives the highest fashionability score as predicted by our classifier, using the database as candidates. Random sampling changes to a randomly sampled garment. Since all unfashionable outfits are generated by swapping out a garment, we instruct all methods to update that garment. We additionally report results where we automatically determine the garment to change, denoted auto-Fashion++.
4.1 Quantitative comparison
Minimal edits change an outfit by improving its fashionability while not changing it too much. Thus, we evaluate performance simultaneously by fashionability improvement and amount of change. We evaluate the former by how much closer the edit gets to the ground-truth (GT) outfit. Since each unfashionable outfit is generated by swapping to a garment (which we call the original) from another outfit, and the garment before the swap (which we call GT) is just one possibility for a fashionable outfit, we form a set of GT garments per test image, representing the multiple ways to improve it (see Supp. for detail). The fashion improvement metric is the ratio of the original piece’s distance to the GT versus the edited piece’s distance to the GT. Values of one or less indicate no improvement. The amount of change metric scores the edited garment’s distance to the original garment, normalized by subtracting Similarity-only’s value. All distances are Euclidean distances in the generators’ encoded space. All methods return the garment in the inventory nearest to their predicted encoding.
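The two metrics can be written down directly from their definitions. In the sketch below, the distance to the GT set is taken to be the distance to the nearest GT garment, which is an assumption (the supplementary material defines the exact handling of the set); the function names and toy codes are hypothetical.

```python
import numpy as np

def dist_to_set(code, gt_set):
    """Euclidean distance from a code to the nearest code in gt_set."""
    return float(np.min(np.linalg.norm(gt_set - code, axis=1)))

def fashion_improvement(original, edited, gt_set):
    """Ratio of original-to-GT distance over edited-to-GT distance.

    Values above one mean the edit moved closer to a GT garment.
    Undefined if the edited code coincides with a GT code.
    """
    return dist_to_set(original, gt_set) / dist_to_set(edited, gt_set)

def amount_of_change(original, edited, similarity_only_change):
    """Edited-to-original distance, offset by the Similarity-only value."""
    return float(np.linalg.norm(edited - original)) - similarity_only_change

original = np.array([4.0, 0.0])
edited = np.array([2.0, 0.0])
gt_set = np.array([[0.0, 0.0],
                   [10.0, 10.0]])
improvement = fashion_improvement(original, edited, gt_set)
change = amount_of_change(original, edited, similarity_only_change=0.5)
```

Here the edit halves the distance to the nearest GT garment, giving an improvement ratio of 2, while moving 2 units from the original.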
Fig. 5(a) shows the results. (We plot ours at a single number of gradient steps for clarity, since fashionability typically saturates soon after; results for all values are in Fig. 4 and Sec. 4.2.) Similarity-only changes the outfit the least, as expected, but it does not improve fashionability. Fashion-only improves fashionability the most, but also changes the outfit significantly. Random neither improves fashionability nor remains similar. Our Fashion++ improves fashionability nearly as well as the Fashion-only baseline, while remaining as similar to the original outfit as Similarity-only. Auto-Fashion++ performs similarly to Fashion++. These are key results to support our claim that Fashion++ makes slight yet noticeable improvements.
Fig. 4 shows that by controlling the amount of change (number of gradient steps) made by Fashion++, one can choose whether to change less (while still being more fashionable than similarity-only) or improve fashionability more (while still changing less than fashion-only).
4.2 Human perceptual study
Next we ask humans to judge the quality of Fashion++’s edits, how it compares with baselines, and whether they know what actions to take to improve outfits based on the edits. We perform three human subject test protocols; please see Supp. for all three user interfaces. We randomly sample unfashionable test outfits and post tasks on Mechanical Turk (MTurk). Each sample is answered by 7 people, and in total Turkers answered.
Fashion++ can show users a spectrum of edits (e.g., Fig. 4) from which to choose the desired version. While preference will naturally vary among users, we are interested in knowing to what extent a given degree of change is preferred and why. To this end, we show Turkers an original outfit and its edits at increasing numbers of gradient steps, and ask them to:
(i) Select all edits that are more fashionable than the original.
(ii) Choose which edit offers the best balance in improving the fashionability without changing too much.
(iii) Explain why the option selected in (ii) is best.
For (i), we found that the more we change an outfit (increasing ), the more often human judges think the changed outfit becomes fashionable, with of the changed outfits judged as more fashionable when . For (ii), no specific dominates. The top selected is preferred of the time, and to are each preferred at least of the time. This suggests that results for are similarly representative, so we use for remaining user studies. For (iii), a common reason for a preferred edit is being more attractive, catchy, or interesting. See Supp. for detailed results breaking down for (i) (ii) and more Turkers’ verbal explanations for (iii).
Next we ask human judges to compare Fashion++ to the baselines defined above. We give workers a pair of images at once: one is the original outfit and the other is edited by a method (Fashion++ or a baseline). They are asked to express their agreement with two statements on a five point Likert scale:
The changed outfit is more fashionable than the original.
The changed outfit remains similar to the original.
We do this survey for all methods. We report the median of the 7 responses for each pair.
Fig. 5(b) shows the result. It aligns very well with our quantitative evaluation in Fig. 5(a): Fashion-only is rated as improving fashionability the most, but it also changes outfits as much as Random. Similarity-only is rated as remaining most similar. Fashion++ changes more than Similarity-only but less than all others, while improving fashionability nearly as much as Fashion-only. This strongly reinforces that Fashion++ makes edits that are slight yet improve fashionability.
Finally, it is important that no matter how good the image’s exact pixel quality is, humans can get actionable information from the suggested edits to improve outfits. We thus ask Turkers how “actionable” our edit is on a five point Likert scale, and to verbally describe the edit. of the time human judges find our images actionable, rating the clarity of the actionable information as . ( for agree and for strongly agree). See Supp. for Turkers’ verbal descriptions of our edits.
4.3 Minimal edit examples
Now we show example outfit edits. We first compare side-by-side with the baselines, and then show variants of Fashion++ to demonstrate its flexibility. For all examples, we show outfits both before and after editing as reconstructed by our generator.
General minimal edits comparing with baselines.
Fig. 6 shows examples of outfit edits by all methods as well as the retrieved nearest garments. Both Fashion-only (ii) and Random (iv) change the outfit a great deal. While Random makes outfits less fashionable, Fashion-only improves them with more stylish garments. Fashion++ (i) also increases fashionability, and the recommended change bears similarity (in shape and/or texture) to the initial less-fashionable outfit. For example, the bottom two instances in Fig. 6 wear the same shorts with different shirts. Fashion-only recommends changing to the same white blouse with red floral print for both instances, which looks fashionable but is entirely different from the initial shirts; Fashion++ recommends changing to a striped shirt with a similar color palette for the first one, and changing to a sleeveless shirt with a slight blush for the second. Similarity-only (iii) indeed looks similar to the initial outfit, but stylishness also remains similar.
Minimal edits changing only shapes.
Fig. 7 shows examples when we instruct our model to just change the shape (cf. Sec. 3.3). Even with the exact same pieces and person, adjusting the clothing proportions and fit can favorably affect the style. Fig. 7 (a) shows the length of pants changing. Notice how changing where the shorts end on the wearer’s legs lengthens them. (b,c) show changes to the fit of pants/skirt: wearing pieces that fit well emphasizes wearers’ figures. (d) wears the same jacket in a more open fashion that gives character to the look. (e,f) roll the sleeves up: slight as it is, it makes an outfit more energetic (e) or dressier (f). (g,h) adjust waistlines: every top and bottom combination looks different when tucked tightly (g) or bloused out a little (h), and properly adjusting this for different ensembles gives better shapes and structures.
Minimal edits changing only textures.
Fig. 8 shows examples when we instruct our model to just change the texture. (a) polishes the outfits by changing the bottom a tint lighter. (b) changes the outfit to a monochrome set that lengthens the silhouette. (c) swaps out the incoherent color. (d)-(f) swap to stand-out pieces by adding bright colors or patterns that make a statement for the outfits. (g)-(h) are changing or removing patterns: notice how even with the same color components, changing their proportions can light up outfits in a drastic way.
Beyond changing existing pieces.
Not only can we tweak pieces that are already on outfits, but we can also take off redundant pieces and even put on new pieces. Fig. 9 shows such examples. In (a), the girl is wearing a stylish dress together with somewhat unnecessary pants, which the edit takes off. (b) suggests adding outerwear to the dress for more layers, while (c) takes off the dark outerwear for a lighter, more energetic look. (d) changes pants to a skirt, giving the entire outfit a better figure.
A minimal edit requires good outfit generation models, an accurate fashionability classifier, and robust editing operations. Failure in any of these aspects can result in worse outfit changes. Fig. 10(a) shows some failure examples as judged by Turkers: no change is made to (i), (ii) makes outfits unnatural, and (iii,iv) may make the outfits worse.
Fig. 10(b) shows Fashion++ operating on movie characters known to be unfashionable.
We introduced the minimal fashion edit problem. Minimal edits are motivated by consumers’ need to tweak existing wardrobes and designers’ desire to use familiar clothing as a springboard for inspiration. We introduced a novel image generation framework to optimize and display minimal edits yielding more fashionable outfits, accounting for essential technical issues of locality, scalable supervision, and flexible manipulation control. Our results are quite promising, both in terms of quantitative measures and human judge opinions. In future work, we plan to broaden the composition of the training source, e.g., using wider social media platforms like Instagram, bias an edit towards an available inventory, or generate improvements conditioned on an individual’s preferred style or occasion.
This supplementary file consists of:
Implementation details of the complete Fashion++ system presented in Section 4 of the main paper
Ablation study on our outfit’s representation (referenced in Section 3.2 of the main paper)
Details on shape generation
More details on the automatic evaluation metric defined in Section 4.1 of the main paper
More examples of Fashion++ edits
MTurk interfaces for the three human subject studies provided in Section 4.2 of the main paper
Full results and Turkers’ verbal rationales (as a wordcloud) for user study A (Section 4.2 of the main paper)
Examples of Turkers’ verbal descriptions of what actions to perform in user study C (Section 4.2 of the main paper)
I Implementation details
We have two generators, a GAN for texture and a VAE for shape, and a classifier for editing operations. All generation networks are trained from scratch, using the Adam solver  and a learning rate of . For VAE, we keep the same learning rate for the first epochs and linearly decay the rate to zero over the next epochs. For GAN, we keep the same learning rate for the first epochs and linearly decay the rate to zero over the next epochs. For the fashionability classifier, we train from scratch with the Adam solver with weight decay and a learning rate of . We keep the same learning rate for the first epochs and decay it times every epochs until epoch .
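The two schedule shapes described above (constant then linear decay to zero for the generators, constant then step decay for the classifier) can be sketched as follows. This is only an illustration: the function names are hypothetical, the specific epoch counts, rates, and decay factor are not reproduced in this text and must be supplied by the caller, and the choice to apply the first step decay immediately after the constant phase is an assumption.

```python
def linear_decay_lr(epoch, base_lr, n_constant, n_decay):
    """Learning rate for a 0-indexed epoch: constant, then linear decay to zero."""
    if epoch < n_constant:
        return base_lr
    # Fraction of the decay phase that is still remaining at this epoch.
    remaining = max(0.0, 1.0 - (epoch - n_constant) / float(n_decay))
    return base_lr * remaining


def step_decay_lr(epoch, base_lr, n_constant, factor, every):
    """Constant rate, then divide by `factor` once per `every` epochs.

    Assumption: the first division happens at epoch `n_constant`.
    """
    if epoch < n_constant:
        return base_lr
    n_steps = (epoch - n_constant) // every + 1
    return base_lr / (factor ** n_steps)
```

With placeholder values, `linear_decay_lr(4, 0.1, 2, 4)` is half the base rate, because epoch 4 is halfway through a four-epoch decay phase that starts at epoch 2.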
For the GAN, we adopt the architecture from . For the VAE, our architecture is defined as follows: Let denote a convolutional block with filters and stride . denotes a convolutional block with filters and stride . denotes a residual block that contains two convolutional blocks with filters. denotes a layer reflection padding on all boundaries. denotes a fully connected layer with filters. We use Instance Normalization (IN)  and ReLU activations. The VAE consists of:
where the encoder is adapted from  and decoder from .
Our MLP for the fashionability classifier is defined as:
For shape and texture features, both and are . For the fashionability classifier to perform edits, we use an SGD solver with step size .
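The editing step, in which the classifier's score is increased by gradient steps on the encoding, can be illustrated with a minimal sketch. Several things here are assumptions made for brevity: a linear logistic model stands in for the MLP, the gradient is written out by hand, and the `editable` index list is a hypothetical way of restricting the update to one garment's part of the encoding.

```python
import math


def fashionability(z, w, b):
    """Stand-in classifier: logistic score in (0, 1) for an encoding z."""
    logit = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def edit_encoding(z, w, b, editable, step_size, n_steps):
    """Gradient ascent on the score, updating only the `editable` indices."""
    z = list(z)  # copy so the original encoding is left unchanged
    for _ in range(n_steps):
        p = fashionability(z, w, b)
        for i in editable:
            # d sigmoid(w.z + b) / d z_i = p * (1 - p) * w_i
            z[i] += step_size * p * (1.0 - p) * w[i]
    return z


# Toy example: only the first two dimensions may change.
z0 = [0.0, 0.0, 0.0, 0.0]
w, b = [1.0, -2.0, 0.5, 3.0], -1.0
z_edited = edit_encoding(z0, w, b, editable=[0, 1], step_size=0.5, n_steps=10)
```

In this sketch, the number of steps controls how far the encoding moves from the original, so stopping early keeps the change small.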
Since the encodings’ distribution of inventory garments is not necessarily Gaussian, the random baseline samples from inventory garments for automatic evaluation, and from a standard Gaussian for human subject study B.
As our system did not alter clothing-irrelevant regions, and to encourage viewers to focus on clothing itself, we automatically replace the generated hair/face region with the original, using their segmentation maps.
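The replacement of clothing-irrelevant regions amounts to a per-pixel selection driven by the segmentation labels. The sketch below uses nested lists as toy images and a single label map; the function name and the label ids are hypothetical, and a real pipeline would operate on image arrays.

```python
def restore_regions(original, generated, seg_map, keep_labels):
    """Take pixels whose segmentation label is in keep_labels from the original
    image, and all other pixels from the generated image."""
    return [
        [o if s in keep_labels else g for o, g, s in zip(o_row, g_row, s_row)]
        for o_row, g_row, s_row in zip(original, generated, seg_map)
    ]


# Toy 2x2 example; labels 1 (hair) and 2 (face) are illustrative ids only.
original = [[1, 2], [3, 4]]
generated = [[9, 9], [9, 9]]
seg_map = [[0, 1], [2, 0]]
composite = restore_regions(original, generated, seg_map, keep_labels={1, 2})
```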
II Ablation study
We use throughout our paper. Here, we show the effect of the texture and shape features on their own, and how the dimension of the feature affects our model. We measure a feature’s effect by the validation accuracy of the fashionability classifier (MLP). We compare using just texture, just shape, and the concatenation of the two in Tab. 1(a): we found that shape is a more discriminative feature than texture. We tried , and found that gives qualitatively more detailed images than , but continuing to increase it beyond does not give qualitatively better results. Tab. 1(b) shows the feature dimension’s effect on the quantitative results, where left uses just the texture as the feature and right concatenates both texture and shape features. In both cases, increasing makes our features more discriminative.
III More details about shape generation
Here, we walk through the process of how our shape generator controls the silhouette of each garment. If our goal is to change an outfit’s skirt, as in Fig. 11 left, our shape encoder first encodes each garment separately, and then overwrites the skirt’s code with the skirt we intend to change to. Finally, we concatenate each garment’s code into , and our shape generator decodes it back to a region map. This process is shown in Fig. 11 right.
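The overwrite-and-concatenate step in this walkthrough can be sketched in a few lines. The garment names, code values, and function name are hypothetical, and the per-garment codes are assumed to have already been produced by the shape encoder; the decoder that maps the concatenated code back to a region map is not shown.

```python
def swap_and_concat(garment_codes, target, new_code, order):
    """Overwrite one garment's shape code, then concatenate all codes in a fixed order."""
    if len(new_code) != len(garment_codes[target]):
        raise ValueError("replacement code must match the original dimension")
    codes = dict(garment_codes)  # shallow copy so the input outfit is untouched
    codes[target] = list(new_code)
    return [value for name in order for value in codes[name]]


# Toy example with 2-dimensional codes per garment.
outfit = {"top": [1, 2], "skirt": [3, 4], "shoes": [5, 6]}
combined = swap_and_concat(outfit, "skirt", [7, 8], order=["top", "skirt", "shoes"])
```

Keeping a fixed garment order matters in this sketch because the decoder would rely on each garment occupying the same slot of the concatenated code.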
IV Automatic evaluation metric
To automatically evaluate fashionability improvement, we need ground-truth (GT) garments to evaluate against. To capture multiple ways to improve an outfit, we form a set of GT garments per outfit, as noted in Section 4.1 of the main paper. Our insight is that the garments that go well with a given blouse appear in outfits that also have blouses similar to this one. As a result, we take the corresponding region’s garments, that is the pants or skirts worn with these similar blouses, to form this set. To do so, we first find the nearest neighbors of the unfashionable outfit excluding the swapped out piece (Fig. 12 left), and then take the corresponding pieces in these neighbors (Fig. 12 right) as possible ways to make this outfit fashionable. We use the minimal distance of the original piece to all pieces in GT set as the original piece’s distance to GT. Using median or mean gives similar results.
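The construction of the GT set and the distance to it can be sketched as below. The sketch assumes each outfit is already described by two feature vectors (one for the outfit without the swapped region, one for the garment in that region) and uses Euclidean distance; the feature representation, the distance, the value of `k`, and all names are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_gt(query_context, query_piece, gallery, k):
    """gallery: list of (context_feature, piece_feature) pairs.

    context_feature describes an outfit without the swapped region;
    piece_feature describes the garment worn in that region.
    """
    # k nearest neighbors by outfit context (swapped piece excluded).
    neighbors = sorted(gallery, key=lambda item: euclidean(query_context, item[0]))[:k]
    # Their garments in the swapped region form the ground-truth set.
    gt_set = [piece for _, piece in neighbors]
    # Distance to GT is the minimum distance to any piece in the set.
    return min(euclidean(query_piece, piece) for piece in gt_set)
```

Swapping `min` for a mean or median over the same list gives the alternative aggregations mentioned above.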
V More qualitative examples
For the sake of space, we show one Fashion++ edit for each example in Section 4.3 of the main paper. In Fig. 13, we show more editing examples by Fashion++, and for each one we display the editing spectrum from to . Fig. 13(a) is the full spectrum for one of the examples in Fig. 6 of the main paper. The outfit starts changing by becoming sleeveless and tucked in, and then colors become even brighter as more edits are allowed. (b) changes the long pink skirt to black flared pants, which are actually not too different in shape, but make the outfit more energetic and better matched in color. (c) gradually shortens the jeans to shorts. (d) tucks in more of the sweater. Both (e) and (f) change the pattern of the blouses to match the bottom better. In most examples, edits typically start saturating after , and changes are less obvious after .
VI Mechanical Turk Interface
Fig. 18, Fig. 19, and Fig. 20 show our MTurk interfaces for the three human subject studies presented in the main paper. We give Turkers the definition of minimal editing and good/bad examples of edits, and tell them to ignore artifacts in synthesized images. For A, we ask them (i) whether any of the changed outfits become more fashionable, (ii) which is the best minimally edited outfit, and (iii) why. For B, we ask them two questions comparing the changed outfit to the original: (i) whether the changed outfit remains similar, and (ii) whether the changed outfit is more fashionable. For C, we ask them (i) whether they understand what to change given the original and changed outfit, and (ii) to describe the change verbally.
VII Detailed result for user study A
For question (i) in user study A, since there should be a consensus on fashionability improvement, we aggregate the responses over all subjects for each example. Each of the testing examples will be judged as either improved or not improved for every . The result is summarized in Fig. (a)a. As more changes are made (increasing ), more examples are rated as improving fashionability, with of them improved when .
Question (ii) is subjective in nature: different people prefer a different trade-off (between the amount of change versus the amount of fashionability added), so we treat response from each subject individually. The result is summarized in Fig. (b)b. No specific dominates, and a tendency of preferring is observed, in total of the time.
For question (iii), we ask users their reasons for selecting a specific in question (ii). Examples of Turkers’ responses are in Fig. 16. From phrases such as add contrast, offer focus, pop, or catchy in these examples, and a word cloud made from all responses (Fig. 15), we can tell that a common reason a user prefers an outfit is that it is more attractive/interesting.
VIII Verbal descriptions of actionable edits for user study C
In the experiment presented as user study C in the main paper, we asked Turkers to rate how actionable the suggested edit is, and briefly describe the edit in words. Fig. 17 shows example descriptions from human judges. Each example has 6 to 7 different descriptions from different people. For example, despite mild artifacts in Fig. 17(a), humans still reach consensus on the actionable information. Note that in Fig. 17(b)(c)(d), most people described the edit as changing color/pattern, while in Fig. 17(e)(f) more descriptions are about changing to/adding another garment, because Fig. 17(e)(f) changes garments in a more drastic way. Tweaking the color/pattern of a garment is essentially changing to another garment, yet humans perceived this differently. When the overall style of the outfit remains similar, changing to a garment with different colors/patterns seems like a slight change to humans.