Styling with Attention to Details

Ayushi Dalmia, Sachindra Joshi, Raghavendra Singh, Vikas Raykar IBM Research{adalmi08, jsachind, raghavsi, viraykar}

Fashion, by its very nature, is driven by style. In this paper, we propose a method that uses style information to complete a given set of selected fashion items with a complementary fashion item. Complementary items are those that can be worn along with the selected items according to the style. Addressing this problem facilitates the automatic generation of stylish fashion ensembles, leading to a richer shopping experience for users.

Recently, there has been a surge of online social websites where fashion enthusiasts post their outfit of the day and other users can like and comment on it. These posts contain a gold mine of information about style. In this paper, we exploit these posts to train a deep neural network that captures style in an automated manner. We pose the prediction of complementary fashion items as a sequence to sequence problem in which the input is the selected set of fashion items and the output is a complementary fashion item based on the style information learned by the model. We use the encoder-decoder architecture to solve this problem of completing the set of fashion items. We evaluate the proposed model through a variety of experiments. We empirically observe that our proposed model outperforms a competitive baseline, the Apriori algorithm, by ~28% in terms of accuracy for top-1 recommendation to complete the fashion ensemble. We also perform retrieval-based experiments to understand the ability of the model to learn style and rank the complementary fashion items, and find that using attention in our encoder-decoder model improves the mean reciprocal rank by ~24%. Qualitatively, we find that the complementary fashion items generated by our proposed model are richer than those of the Apriori algorithm.

complementary fashion item, fashion ensemble generation, social media mining, sequence to sequence models

1. Introduction

Fashion is a language that instantly conveys the persona. It is a choice of taste and style, varying with space and time. It is a multi-billion dollar industry providing a surging market for e-commerce retailers, fashion designers and garment companies. As in many other domains, data-driven technologies are making a difference in the fashion world (Lops2011; Aggarwal2016; McAuley_2015_C; He:2016:FFG:2872518.2890534; Jing:2015:VSP:2783258.2788621), to name a few.

However, fashion, with its variations and personalizations, is not trivial to model. In particular, modeling style is inherently difficult. Style depends on the ensemble of clothes worn, where the attributes of the clothes, such as color and pattern, demand attention. A pair of pants may look stylish with a shirt, but pink pants may not be considered stylish with a yellow shirt. An abstract notion of style is formed when users prefer certain combinations of apparels based on their attributes. This notion is hard to evaluate, all the more so because of the subjectivity involved. Additionally, traditional shopping cart methods (Sarwar) do not work as well because there may be no relationship between the clothes in a cart: pants could be bought to complement a blouse already in the wardrobe, or multiple items could be bought for a family with no apparent relationship between the purchases. Unlike books, movies and electronics, style keeps changing over time and becomes stale. Thus, purchase history for currently stylish items may not be available.

Figure 1. Completing fashion ensembles (right) based on a given set of fashion items.
Figure 2. Examples showing set of stylish fashion ensembles.
Figure 3. Example post on social media by a user: the attributes (denoted by dashed lines) along with the apparels (denoted by solid lines) are marked in the same color. The box on the right hand side lists the set of items which appear in this post.

In this work we focus on the problem of completing fashion ensembles based on their style quotient. A fashion ensemble contains complementary items that can be worn together. For example, in Figure 1, for the given set of items red floral dress, black leather bag and silver bracelet, a complementary fashion item would be black strappy heels. This is different from recommending similar items: given a red dress, we do not recommend similar dresses. Instead, given an ensemble of apparels, we find stylish complementary items that could be worn with this ensemble. Note that completion is used loosely here. It does not mean that the complement of our stylish set is not stylish, or that no more apparels can be added to it; in fact, our solution adds one apparel at a time to the ensemble, and it is for the user to decide when the look is complete. Further, we use apparels as a superset of clothes that includes accessories such as footwear, jewelry and headwear.

Style is characterized by complex inter-relationships among fashion attributes that are not straightforward to capture. Figure 2 shows examples of sets of fashion items that could be worn together according to fashion experts on Polyvore. Each of these ensembles follows a certain style rule, e.g., a black evening dress goes well with other black items, while a pink skirt goes better with a yellow sandal.

The examples in Figure 2 are curated style rules, where fashion experts have put together ensembles that would look good based on current style. Such curated advice is often expensive, inherently subjective and quickly out of date. To alleviate these problems and to quantify the abstract concept of style, we propose to use social signals such as likes and comments by users on social media. Signals can easily be weighted to make recent ones more relevant. Similarly, signals from friends or from people with good style could be up-weighted.

There has been a rapid growth of online fashion social media websites like Polyvore and Chictopia, where fashion enthusiasts share their outfit of the day. Figure 3 shows an example of one such post. The user posts an image along with text to describe the outfit of the day. These posts are a rich source of style information.

We propose a data-driven solution to the problem of completing a fashion ensemble. Our data is parsed from posts appearing on the social media websites discussed above. In this work we focus only on the textual description of the post, although images could also be used as a source for modeling style. Posts are collated based on their social signals and then mined to create frequent itemsets in which apparels that are often worn together, in a stylish look, occur together. Our problem then amounts to building conditional probability models that, given a set of apparels, can predict an apparel that is complementary to, and goes well with, the set. As stated before, we do not generate the ensemble in one shot but complete it one item at a time: given a set of fashion items $q$, the system recommends a complementary fashion item $y$, and one can then query the model again with the new set of fashion items $q \cup \{y\}$ to continue completing the ensemble. The input set can be of variable length (cardinality), not only in terms of the number of items in the set but also in terms of the number of attributes attached to each item. The presence of attributes complicates the problem: the left example in Figure 2 shows that color alone does not predict stylishness, unlike the right example in the same figure. Similarly, the predicted fashion item can be a variable-length output, depending on the number of attributes associated with it.

In order to accommodate the variable-length input and output, we pose our problem as a sequence to sequence task. The input sequence is a selected subset of fashion items while the output sequence is a complementary fashion item. To address this task, we build our predictive model using the encoder-decoder recurrent neural network (RNN) architecture (cho2014learning). RNNs are natural for variable-length sequence modeling (SutskeverVL14), and, as we shall show, using attention (bahdanau2014neural) along with an RNN allows us to take care of the details of the attributes. The encoder-decoder architecture consists of two RNNs that act as an encoder and decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence. In our case, the input set of fashion items is represented in a compact form by the encoder. We use this compact representation to generate the output sequence of a complementary apparel via the decoder. We train our model on all the parsed fashion posts, and hypothesize that the model will be able to capture the style information expressed in these posts. Additionally, our model incorporates an attention mechanism to explicitly learn the more interesting attributes of the fashion items, such as color and pattern, in order to improve performance. We use the trained model as an outfit completion system by predicting the complementary fashion item for a given set of fashion items.

In order to evaluate our system we work with two novel datasets: one curated from the social media website Chictopia and another obtained from an e-commerce website. We perform an empirical study on both datasets, showing that our proposed model outperforms a competitive baseline, the Apriori algorithm, by ~28% in accuracy for top-1 recommendation to complete the fashion ensemble, and that attention improves the mean reciprocal rank by ~24% on the retrieval task. Qualitatively, we also find that the complementary fashion items given by our proposed model are richer than those of the Apriori algorithm. To the best of our knowledge, our work is the first to use social media posts to address the problem of finding style-based complementary fashion items for a given set of fashion items.

We make the following main contributions in this paper:

  • We propose the problem of completing a fashion ensemble such that items in the ensemble go well with each other.

  • We propose an attention based recurrent neural network model that models the variable length input set of fashion items and generates a complementary apparel with varied number of attributes.

  • We compare the performance of the proposed sequence to sequence model with a baseline, the Apriori algorithm, and find that the neural network based model outperforms the pattern mining based method.

The rest of the paper is organized as follows. Section 2 formally describes the task of completing outfit ensembles, while Section 3 discusses our proposed encoder-decoder RNN model. Section 4 experimentally validates our proposed approach on two novel datasets, one curated from the social media website Chictopia and another obtained from an e-commerce website. We finally present the related work in Section 5 and conclude the paper in Section 6.

2. Problem Formulation

Let us begin by defining our dataset, $\mathcal{D} = \{(q^{(n)}, y^{(n)})\}_{n=1}^{N}$, where $q^{(n)}$ corresponds to the words in the selected set of fashion items and $y^{(n)}$ corresponds to the words of the complementary fashion item predicted by the model to complete the input set of fashion items $q^{(n)}$. Each query word is represented as an indicator vector $q_k \in \{0,1\}^{|V_Q|}$, where $|V_Q|$ is the size of the (query) word vocabulary. Similarly, each recommended item word is represented as an indicator vector $y_j \in \{0,1\}^{|V_I|}$, where $|V_I|$ is the size of the (item) word vocabulary. Our task is to generate a sequence of words $y$ for the complementary fashion item that completes the query $q$ of selected fashion items.

Consider Figure 1, where we have a sequence of words “red floral dress, black leather bag, silver charms bracelet” and the goal is to predict a fashion item such as “black strappy sandals” that complements the query set of fashion items.

Figure 4. Working Principle of the Fashion Text Annotator
Figure 5. Fashion Taxonomy for Women
Figure 6. Proposed System: At each decoding time step $j$, the model infers a variable-length attention vector $a_j$ based on the current decoder state $s_j$ and all source states $h_k$. The input vector $c_j$ of a given set of fashion items $q$ is then computed as the weighted average, according to $a_j$, over all the source states. The next word in the item is predicted based on the query vector $c_j$ and the current decoder state $s_j$. Note that each item in the encoder is separated by the ‘<eoi>’ token to mark the item boundary. [Best viewed in color.]

3. Proposed Methodology

In this section we will discuss the proposed methodology in detail. Our approach can be broadly divided into two phases:

  • Mining style rules from social media

  • Modeling style using encoder decoder based RNN.

3.1. Mining style rules from social media

From social media websites, we crawl online fashion posts, $P = \{p_1, \dots, p_M\}$, where $M$ denotes the total number of posts. To filter the noise from these posts we use social signals such as likes and comments as a measure of the goodness of the fashion ensemble in each post. We compute an average score of these social signals for every post and discard the posts that fall below an empirically determined percentile. This gives us a high-quality set of fashion posts $P'$, where $P' \subseteq P$.

We parse each of these unstructured natural language posts to obtain a structured set of fashion items using a Fashion Text Annotator (FTA). The FTA uses a Fashion Taxonomy (example shown in Figure 5) and a set of external resources to parse unstructured natural language text into structured text. It parses the post using an n-gram sliding window where $n$ ranges from 1 to 3. This range of $n$ is chosen based on the average number of attributes per apparel in the dataset. For each n-gram, the FTA looks up the fashion taxonomy and the external lists to extract the apparel along with its attributes, such as color and pattern. Figure 4 illustrates the working of the Fashion Text Annotator, where the input is the unstructured user post shown in Figure 3 and the output is structured data consisting of apparels along with their attributes.
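The n-gram lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' annotator: the tiny `APPARELS`, `COLORS` and `PATTERNS` sets are hypothetical stand-ins for the Fashion Taxonomy and external lists, and the rule that an n-gram matches when its last word is an apparel and every preceding word is a known attribute is one plausible reading of the description.

```python
import re

# Hypothetical miniature resources standing in for the Fashion Taxonomy
# and the external attribute lists, which are far larger in practice.
APPARELS = {"dress", "bag", "bracelet", "heels"}
COLORS = {"red", "black", "silver"}
PATTERNS = {"floral", "striped", "printed"}
ATTRIBUTES = COLORS | PATTERNS


def annotate(post, max_n=3):
    """Extract attributed apparels from free text using n-grams of size max_n down to 1."""
    tokens = re.findall(r"[a-z]+", post.lower())
    found, used = [], set()
    for n in range(max_n, 0, -1):  # try the longest n-grams first
        for start in range(len(tokens) - n + 1):
            positions = range(start, start + n)
            if any(pos in used for pos in positions):
                continue  # a token already belongs to a longer match
            *attrs, head = tokens[start:start + n]
            if head not in APPARELS or not all(a in ATTRIBUTES for a in attrs):
                continue
            item = {
                "apparel": head,
                "color": next((a for a in attrs if a in COLORS), None),
                "pattern": next((a for a in attrs if a in PATTERNS), None),
            }
            found.append((start, item))
            used.update(positions)
    # Return items in the order they appear in the post.
    return [item for _, item in sorted(found, key=lambda pair: pair[0])]


items = annotate("Wore my red floral dress with a black bag and heels!")
```

Matching the longest n-grams first keeps "red floral dress" as a single attributed item instead of also emitting a bare "dress".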

3.2. Modeling style using encoder decoder based RNN

Once we have the structured sets of fashion items, we choose items from each structured set to generate the tuple $(q, y)$, where $y$ denotes the fashion item going well with the set of fashion items $q$. The goal is to learn a mapping of the style information (color, pattern, apparel type) encoded in these high-quality fashion ensembles. Our proposed encoder-decoder RNN model is illustrated in Figure 6.
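The text does not spell out how items are chosen to form each (query, target) tuple. One simple possibility, shown here purely as an assumption, is to hold out each item in turn and use the remaining items as the query; `leave_one_out_pairs` is a hypothetical helper name.

```python
def leave_one_out_pairs(itemset):
    """Build (query items, target item) pairs by holding out one item at a time.

    Assumption: this hold-one-out scheme is illustrative only; the paper does
    not state which selection strategy was actually used.
    """
    pairs = []
    for idx, target in enumerate(itemset):
        query = itemset[:idx] + itemset[idx + 1:]
        if query:  # skip ensembles with a single item, which leave no query
            pairs.append((query, target))
    return pairs


pairs = leave_one_out_pairs(
    ["red floral dress", "black leather bag", "silver bracelet"]
)
```

Under this scheme an ensemble of three items yields three training pairs, one per held-out item.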

The model directly tries to maximize the conditional probability $p(y \mid q)$. Specifically, our model has two main components: (a) an encoder, which captures the essential information present in the input set of fashion items $q$, and (b) a decoder, which outputs the complementary fashion item $y$ one target item word at a time. If $m$ denotes the total number of tokens in the predicted complementary fashion item, the conditional probability can be decomposed as:

$$p(y \mid q) = \prod_{j=1}^{m} p(y_j \mid y_{<j}, q) \tag{1}$$


Our encoder and decoder will be discussed next in detail.

Encoder: The goal of the encoder (shown by the red dashed line in Figure 6) is to represent the set of input items compactly. This compact representation is accessed by the decoder every time it emits a recommended item word. There are two main challenges in designing the encoder model. Firstly, the number of input symbols in the query can be arbitrarily large. This rules out feedforward models, which cannot model long-term dependencies efficiently (MikolovJCMR14): they can only do so at the cost of a linear increase in the number of parameters with the number of input symbols considered. Secondly, the interactions of the different items in the input itemset, along with their attributes such as color and pattern, can be too complex to model simply, which calls for models with more sophisticated architectures. The solutions to these challenges are discussed next.

Consider an itemset $q = (q_1, \dots, q_T)$ consisting of $T$ words (for the sake of brevity we ignore the superscript in this subsection). For each query word $q_k$, our encoder RNN computes a dense vector called the recurrent state, denoted by $h_k$, that combines $q_k$ with the information that has already been processed, i.e. the recurrent state $h_{k-1}$. Formally:

$$h_k = f(q_k, h_{k-1}) \tag{2}$$

where $h_k \in \mathbb{R}^{d}$, $d$ is the number of dimensions of the recurrent state, and $f$ is a non-linear transformation. The recurrent state $h_k$ acts as a compact summary of the words seen up to position $k$. Once Equation 2 has been run through the entire query $q$, the last state $h_T$ may be viewed as a compact summary of the input query. We use the Long Short-Term Memory (LSTM) to reduce the fundamental difficulty in learning complex dependencies between fashion items and attributes, i.e. to store information for complex sequences.

The traditional encoder can consume a sequence of words belonging to only one item. To make our model practical, we propose a simple yet effective compositionality technique that lets the encoder consume multiple items. The idea is to concatenate all the words from the different items, separated by a special token (‘<eoi>’, marking the end of an item). Essentially, we combine all the item information to form one long input representing the itemset (user query). However, this simple strategy cannot be effective if the decoder is not able to give appropriate importance (or attention) to the input symbols consumed by the encoder.
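The concatenation step can be sketched in a few lines. This is an illustrative sketch: the function name `flatten_itemset` is hypothetical, and appending the marker after every item (including the last) is an assumption, since the text only says the token marks item boundaries.

```python
EOI = "<eoi>"  # end-of-item marker, as in Figure 6


def flatten_itemset(items):
    """Concatenate the words of all items into one encoder input sequence.

    Assumption: every item, including the last one, is followed by the marker.
    """
    tokens = []
    for item in items:
        tokens.extend(item.split())
        tokens.append(EOI)
    return tokens


encoder_input = flatten_itemset(
    ["red floral dress", "black leather bag", "silver charms bracelet"]
)
```

The boundary token lets the model tell apart, for example, the "black" that belongs to the bag from the attributes of the neighbouring items.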

Decoder: The goal of our decoder (shown by the blue dashed line in Figure 6) is to predict a complementary fashion item one word at a time by accessing the encoder’s top-layer hidden states (which together capture the latent itemset embedding), eventually returning the item. There are two main challenges in designing the decoder model. Firstly, the decoder must be equipped with an efficient mechanism to identify the area of focus for each target word. The number of encoder positions can be very large (as we have merged multiple items from the itemset). Thus, locating the encoder positions that hold the salient information for the word currently being predicted is non-trivial. For example, when the decoder is predicting the color attribute, it should emphasize the positions of the color words in the encoder. Secondly, the decoder has to sample from an extremely large set of candidate words.

Our decoder employs an LSTM model which parameterizes the probability of decoding each word $y_j$ as:

$$p(y_j \mid y_{<j}, q) = \mathrm{softmax}(W_s s_j) \tag{3}$$

with $s_j \in \mathbb{R}^{d}$ denoting the decoder hidden state at step $j$ and $W_s$ denoting the softmax weight matrix of size $|V_I| \times d$.

To utilize the information present in the input set of items, which spans several memories in the encoder (green blocks in Figure 6), the decoder derives a query embedding $c_j$ using an attention-based mechanism (discussed next). This query embedding captures the salient information in the input that is useful for predicting the current target word $y_j$. Precisely, we employ a simple concatenation layer to combine $s_j$ and $c_j$ to produce an attentional hidden state $\tilde{s}_j$, which is then fed through the softmax layer to produce the predictive distribution formulated as:

$$\tilde{s}_j = \tanh(W_c [c_j ; s_j]), \qquad p(y_j \mid y_{<j}, q) = \mathrm{softmax}(W_s \tilde{s}_j) \tag{4}$$

To derive the query embedding $c_j$, we define a variable-length attention vector $a_j$, whose size equals the number of input words consumed by the encoder. We compare the current target hidden state $s_j$ with each encoder hidden state $h_k$ using a dot product:

$$a_j(k) = \frac{\exp(s_j^{\top} h_k)}{\sum_{k'} \exp(s_j^{\top} h_{k'})} \tag{5}$$

In Figure 6, the block within the black dashed line represents the attention layer of our model. Intuitively, $a_j(k)$ captures the degree to which a particular item word (encoder input symbol) helps to predict the next target word (say, the attribute value for color). For our fashion item set completion task, this makes it possible to negate the influence of irrelevant words by setting $a_j(k)$ closer to 0 and to encourage the influence of relevant words by setting it closer to 1. Finally, the query embedding is computed as the weighted average over all the encoder hidden states, $c_j = \sum_k a_j(k)\, h_k$, where the weights are given by $a_j$.
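As a concrete illustration of the attention layer, the following sketch computes dot-product attention weights and the resulting query embedding for a single decoding step. It uses plain Python lists in place of tensors, and the function name `attend` and the toy vectors are assumptions made for the example, not part of the original model code.

```python
import math


def attend(decoder_state, encoder_states):
    """Dot-product attention for one decoding step.

    Returns the attention weights (one per encoder position) and the query
    embedding, i.e. the attention-weighted average of the encoder states.
    """
    scores = [
        sum(s * h for s, h in zip(decoder_state, h_k)) for h_k in encoder_states
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(decoder_state)
    context = [
        sum(w * h_k[d] for w, h_k in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy example: the decoder state is aligned with the first encoder position,
# so that position receives most of the attention mass.
weights, context = attend([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the weights come from a softmax they are positive and sum to one, so irrelevant positions are down-weighted toward 0 but never removed entirely.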

Model Optimization: Our model uses cross-entropy loss as the cost function and can be trained end-to-end by minimizing the negative conditional log likelihood (NLL) of the training data $\mathcal{D}$ with respect to $\theta$:

$$\mathcal{L}(\theta) = -\sum_{(q, y) \in \mathcal{D}} \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, q; \theta) \tag{6}$$

Here, $\theta$ constitutes the encoder and decoder parameters and $m$ is the number of tokens in $y$. Once the model is trained, we generate the complementary fashion item $y$ for a new set of fashion items $q$ through a word-based beam search such that $p(y \mid q)$ is maximized. The beam search largely contains the explosion in the number of candidates and is parameterized by the number of best paths that are pursued at each time step. We use Stochastic Gradient Descent (SGD) to learn the parameters of our model.
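To make the decoding procedure concrete, here is a minimal word-level beam search over a toy next-word distribution. Everything in `TOY_MODEL` is a hypothetical stand-in for the trained decoder, and `beam_search` is an illustrative sketch rather than the implementation used in the experiments.

```python
import math

EOS = "</s>"  # hypothetical end-of-sequence token

# Hypothetical next-word distributions standing in for the trained decoder.
TOY_MODEL = {
    (): {"black": 0.6, "white": 0.4},
    ("black",): {"strappy": 0.5, "heels": 0.3, EOS: 0.2},
    ("white",): {"heels": 0.9, EOS: 0.1},
    ("black", "strappy"): {"heels": 0.9, EOS: 0.1},
    ("black", "heels"): {EOS: 1.0},
    ("white", "heels"): {EOS: 1.0},
    ("black", "strappy", "heels"): {EOS: 1.0},
}


def toy_step(prefix):
    """Return the next-word probabilities given the words decoded so far."""
    return TOY_MODEL[tuple(prefix)]


def beam_search(step_fn, beam_width=2, max_len=5):
    """Keep the beam_width best partial items at every step; return the best one."""
    beams = [([], 0.0)]  # (words so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = [
            (prefix + [word], score + math.log(prob))
            for prefix, score in beams
            for word, prob in step_fn(prefix).items()
        ]
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda cand: cand[1])


best_item, best_score = beam_search(toy_step, beam_width=2)
greedy_item, greedy_score = beam_search(toy_step, beam_width=1)
```

In this toy setting a beam of width 1 (greedy decoding) commits to "black" and ends with probability 0.27, while a beam of width 2 recovers the more probable "white heels" (0.36), which is the kind of search error the beam is meant to reduce.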

4. Experimental Validation

Evaluating data-driven fashion systems has always been challenging. For our problem of fashion item set completion, the difficulty is compounded by the subjective nature of style. In this section we discuss our datasets, baseline algorithm and experimental setup, and finally present the quantitative and qualitative results.

Figure 7. Jaccard Similarity Score on (a) ED and (b) CD: the x axis indicates the number of recommendations and the y axis the JSS score.

4.1. Dataset

Currently, there is no publicly available dataset that can be used to evaluate our solution to the problem of completing a set of fashion items. In order to evaluate the proposed solution we work with two real-world datasets: social media posts from Chictopia and style tips from an e-commerce website.

Chictopia Dataset (CD): Chictopia is an online fashion portal that allows users to post their outfit of the day. Every post is associated with a free-text description along with various social activities such as votes, comments and likes. We crawl about 0.15 million posts from Chictopia and parse them using our Fashion Text Annotator. However, Chictopia, like any other social media site, is noisy. Many of these posts may be fashion blunders and therefore contain apparels that do not go well together. To filter out these noisy posts, we exploit the wisdom of the crowd by defining a fashion score for every post. The fashion score of a post is a weighted combination of its number of votes, likes and comments. We keep only those posts that are in the top 30 percentile based on this fashion score. Finally, we obtain approximately 28K automatically crowd-sourced golden fashion posts. The dataset consists of 135 unique colors, 95 unique patterns and 300 unique apparels.
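The filtering step can be sketched as below. The weights are not given in the text, so the equal weights used here are an assumption, as are the function names and the reading of "top 30 percentile" as the best-scoring 30% of posts.

```python
def fashion_score(post, weights=(1.0, 1.0, 1.0)):
    """Weighted combination of votes, likes and comments.

    Assumption: equal weights are placeholders; the actual weights are unspecified.
    """
    w_votes, w_likes, w_comments = weights
    return (
        w_votes * post["votes"]
        + w_likes * post["likes"]
        + w_comments * post["comments"]
    )


def keep_top_fraction(posts, fraction=0.30, weights=(1.0, 1.0, 1.0)):
    """Keep the best-scoring fraction of posts (here: the top 30%)."""
    ranked = sorted(posts, key=lambda p: fashion_score(p, weights), reverse=True)
    keep = round(len(ranked) * fraction)
    return ranked[:keep]


# Ten hypothetical posts with increasing popularity.
posts = [{"id": i, "votes": i, "likes": 2 * i, "comments": 0} for i in range(10)]
golden = keep_top_fraction(posts)
```

Changing `weights` is one way to emphasise a particular signal, for example giving comments more influence than votes.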

E-commerce Dataset (ED): We crawl the style tips manually curated by fashion designers for every item in the catalog of an e-commerce website. We parse these style tips using our Fashion Text Annotator to obtain 10K high-quality style tips from the catalog. The dataset consists of 90 unique colors, 40 unique patterns and 238 unique apparels. The apparel together with its style tip gives us a set of attributed items that go well together. Unlike CD, ED is noise free and does not require any filtering, as it is manually curated by experts.

Note that, for the sake of simplicity, we only consider posts and tips associated with women's fashion and focus on the attributes color and pattern. Table 1 summarizes the number of train, validation and test posts for both datasets.

Dataset | Total | Train | Test | Validate
E-commerce Dataset | 10749 | 7524 | 2149 | 1076
Chictopia Dataset | 27303 | 19112 | 5460 | 2731
Table 1. Statistics of Dataset

4.2. Baseline Model

Here we discuss the baseline algorithm which is used for comparing the performance of our encoder decoder RNN model.

Apriori Algorithm: The problem of finding items that go well together falls within the classical paradigm of frequent pattern mining. The Apriori algorithm (agrawal1993mining) is an influential algorithm for finding frequent patterns. We model the problem of generating stylish fashion ensembles as a frequent pattern mining problem. Given the dataset $\mathcal{D}$ (discussed in Section 2), we employ the Apriori algorithm to mine itemsets and build a Style Rule Lexicon. Each entry of the lexicon consists of an attributed item and a list of attributed items that go well with it, each with a support value. We build this lexicon at different levels of attribute granularity, i.e. considering all attributes, considering color or pattern only, and considering no attributes at all. We use a minimum support value of 0.6 for CD to find frequent patterns, while the support value for ED is 1, as every itemset in ED is manually curated and validated by experts.
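For readers unfamiliar with the baseline, the following is a compact sketch of level-wise Apriori mining and of turning frequent pairs into a lexicon of complementary items. It is an assumption-laden illustration: support is treated as the fraction of ensembles containing an itemset, the toy transactions are invented, and `apriori` and `style_lexicon` are hypothetical names.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, with support as a fraction of transactions."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        survivors = {}
        for candidate in level:
            support = sum(candidate <= t for t in transactions) / total
            if support >= min_support:
                survivors[candidate] = support
        frequent.update(survivors)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {
            c for c in joined
            if all(frozenset(sub) in survivors for sub in combinations(c, k - 1))
        }
    return frequent


def style_lexicon(frequent):
    """Map each item to the items it frequently co-occurs with, plus their support."""
    lexicon = {}
    for itemset, support in frequent.items():
        if len(itemset) == 2:
            first, second = tuple(itemset)
            lexicon.setdefault(first, []).append((second, support))
            lexicon.setdefault(second, []).append((first, support))
    return lexicon


# Invented toy ensembles.
ensembles = [
    {"red dress", "black heels", "black bag"},
    {"red dress", "black heels"},
    {"red dress", "black bag"},
    {"blue jeans", "white sneakers"},
]
frequent = apriori(ensembles, min_support=0.5)
lexicon = style_lexicon(frequent)
```

The sketch also shows why the baseline can return nothing for rare or heavily attributed queries: any itemset below the support threshold never enters the lexicon.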

Figure 8. Jaccard Similarity Score at different levels of attribute granularity on (a) ED and (b) CD: the x axis indicates the number of recommendations and the y axis the JSS score.

4.3. Results

In this section we compare our proposed algorithm (Seq2Seq Model) against the baseline (Apriori Model) on different quantitative tasks and present qualitative results.

4.3.1. Quantitative Results

Quantitative evaluation in the field of fashion is difficult because of its subjective and abstract nature. We devise an experimental setup with three quantitative experiments to compare our algorithm against the baseline model on both datasets: prediction accuracy, generalisability and retrievability. We discuss each of these below.

Prediction Accuracy: We use the Jaccard Similarity Score (JSS) of the predicted attributed apparel ($P$) with the actual attributed apparel ($A$) for both the baseline model and our proposed model. We compute the JSS for top-k predictions, where k ranges from 1 to 10. The JSS for the predicted apparel $P$ and the ground-truth apparel $A$, where each apparel is treated as the set of its words, is given as follows:

$$\mathrm{JSS}(P, A) = \frac{|P \cap A|}{|P \cup A|} \tag{7}$$
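A small worked example may help. The sketch below computes the Jaccard score between two attributed apparels treated as sets of words. How the score is aggregated over the top-k predictions is not stated in the text, so taking the best score among the first k predictions in `jss_at_k` is an assumption, as are the function names.

```python
def jaccard(predicted, actual):
    """Jaccard similarity between two attributed apparels, treated as word sets."""
    p, a = set(predicted.split()), set(actual.split())
    if not p and not a:
        return 1.0  # two empty descriptions are treated as identical
    return len(p & a) / len(p | a)


def jss_at_k(predictions, actual, k):
    """Best Jaccard score among the top-k predictions.

    Assumption: max-aggregation over the top k is illustrative; the paper does
    not specify how JSS@k combines the k predictions.
    """
    return max((jaccard(p, actual) for p in predictions[:k]), default=0.0)


score = jaccard("black strappy heels", "black strappy sandals")
ranked = ["white sneakers", "black strappy sandals"]
at_1 = jss_at_k(ranked, "black strappy sandals", k=1)
at_2 = jss_at_k(ranked, "black strappy sandals", k=2)
```

Here "black strappy heels" and "black strappy sandals" share two of four distinct words, giving a score of 0.5, so the metric rewards getting the attributes right even when the apparel type is wrong.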


Figure 7 illustrates the performance of the models on ED and CD. We observe that the Seq2Seq model performs consistently better than the Apriori model on both datasets, with improvements of ~40% and 16% for CD and ED respectively. This suggests that our model benefits from attending to important attributes in the input set of fashion items. The gain of the sequence to sequence model over Apriori is larger on CD: ED has fewer training examples and a fairly large number of fashion attributes, so our model is not able to learn style from ED as well as it does from CD.

In order to understand how the attributes add complexity to generating style rules, we compute JSS@k at different levels of granularity, taking a different set of attributes each time. We consider color and pattern as the attributes and examine the following combinations: color+pattern+apparel and apparel only.

Figure 8 illustrates the performance of the systems at different levels of granularity on both datasets. We find that recommending attributed items is more challenging. The Apriori mining algorithm fails to generate color+pattern+apparel recommendations, while the Seq2Seq model beats the Apriori model on both datasets. A similar trend is observed for the color+apparel and pattern+apparel tasks, where our proposed model remains ahead on both datasets (we do not include these graphs due to space constraints). The Apriori model performs well compared with our proposed model on the apparel-only prediction task, because our sequence to sequence model sees too few apparel-only examples during training to learn from. Nevertheless, the performance of the Apriori algorithm drops as more attributes are added, and our proposed model then performs best.

Figure 9. Jaccard Similarity Score for evaluating generalization: (a) performance on CD when the model is trained on ED; (b) performance on ED when the model is trained on CD. The x axis indicates the number of recommendations and the y axis the JSS score.
Figure 10. Mean Reciprocal Rank (MRR) on (a) ED and (b) CD: the x axis indicates the number of negative samples taken and the y axis the MRR.


Generalization: We study the generalization capability of the models by evaluating them on data from a different distribution within the same domain. To do this, we use the CD test data to evaluate the model trained on ED, and vice versa. This lets us measure how well a model transfers its knowledge to new test data when the underlying distribution changes. We compute JSS@k for both algorithms on both datasets. Figure 9 shows the JSS@k score on CD using the model trained on ED and the JSS@k score on ED using the model trained on CD. We find that in this transfer evaluation the Apriori algorithm performs well when trained on ED and tested on CD. This is because our model is not able to learn well from the limited data of ED, in which itemsets rarely repeat across the training data. The Apriori algorithm is less affected because we use a minimum support of 1 on ED. However, when we train on CD, which is richer in terms of both attributes and data points, our proposed model beats the Apriori model.

Retrieval Based Experiments:

While JSS@k is a good indicator of the accuracy of a model on the prediction task, we also evaluate the model on the task of ranking recommendations. The intuition behind this experiment is to evaluate the capability of the model to rank the correct fashion item higher than an arbitrary recommendation. If the model performs well on this task, one can infer that its performance is not random and that it is indeed learning from the data.

Consider a query $q$ whose corresponding ground-truth label is $y$. We perform uniform random sampling to obtain $K$ negative samples $y'_1, \dots, y'_K$ from the set of all ground-truth labels. For a given model, we compute the conditional probabilities $p(y \mid q)$ and $p(y'_k \mid q)$ of predicting the labels $y$ and $y'_k$ respectively, for the same query $q$. We rank the ground-truth label and the randomly chosen recommendations in descending order of their conditional probability scores. In an ideal case, the ground-truth label should be ranked above all of the arbitrary negative samples. We measure ranking performance using the Mean Reciprocal Rank (MRR), given by Equation 8, where $Q$ is the set of test queries and $\mathrm{rank}_i$ is the rank of the ground-truth label for the $i$-th query:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \tag{8}$$

The MRR reflects the predicted rank of the true label, and in the ideal case equals 1.
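The ranking protocol can be sketched as follows. This is an illustrative sketch with hypothetical helper names and invented scores; in the actual experiment the scores would be the conditional probabilities produced by the trained model, and ties are resolved here in favour of the ground truth, which is an assumption.

```python
import random


def sample_negatives(all_labels, truth, k, rng):
    """Uniformly sample k labels other than the ground truth."""
    pool = [label for label in all_labels if label != truth]
    return rng.sample(pool, k)


def reciprocal_rank(true_score, negative_scores):
    """1 / rank of the ground truth when sorted by descending score.

    Assumption: ties are resolved in favour of the ground truth.
    """
    rank = 1 + sum(score > true_score for score in negative_scores)
    return 1.0 / rank


def mean_reciprocal_rank(examples):
    """Average reciprocal rank over (true score, negative scores) pairs."""
    return sum(reciprocal_rank(t, negs) for t, negs in examples) / len(examples)


# Invented scores: the truth is ranked first for the first query and second
# for the second query.
mrr = mean_reciprocal_rank([(0.9, [0.1, 0.2]), (0.3, [0.5, 0.1])])

labels = ["black heels", "white sneakers", "blue bag", "red scarf"]
negatives = sample_negatives(labels, "black heels", k=2, rng=random.Random(0))
```

With the two invented queries above the reciprocal ranks are 1 and 1/2, so the MRR is 0.75.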

For the retrieval-based experiment we cannot use the Apriori algorithm because of its deterministic nature. Therefore, we compare our proposed model against a variant of it, the Seq2Seq model without attention, which does not use the attention mechanism during training.


Figure 10 illustrates the MRR on both datasets for the Seq2Seq models with and without attention. We vary $K$, the number of negative samples, from 1 to 4. Note that a randomly performing model ranks the ground-truth label first with probability $1/(K+1)$, i.e. 0.5, 0.33, 0.25 and 0.2 for $K = 1, 2, 3, 4$ respectively. We find that the Seq2Seq model is able to rank the correct complementary fashion item at rank 1, with a high MRR on both datasets. We also observe that the Seq2Seq model with attention outperforms the Seq2Seq model without attention on both datasets, in line with the ~24% improvement in MRR reported earlier. This experiment demonstrates the importance of attention while decoding the attributes of the fashion items.

4.3.2. Qualitative Results

In order to validate the quality of the generated complementary fashion items, we present qualitative results for our proposed model and the baseline method. Tables 2 and 3 show some example queries and the complementary fashion items predicted by the different models on the ED and CD datasets respectively. Note that for many esoteric attribute values, such as the color ivory, and for heavily attributed sets of fashion items, the Apriori model fails to generate any complementary fashion item. In contrast, our proposed Seq2Seq model is able to generate good-quality fashion items to add to the query set of fashion items.

| Input Set of Fashion Items | Apriori Algorithm | Seq2Seq Model w/o Attention | Seq2Seq Model with Attention |
|---|---|---|---|
| blue printed jeans | NIL | black t-shirt | black solid top |
| medium stone blue printed kurta, brown clutch | red printed skirt | sandals | copper toned sandals |
| white crop top, grey joggers | white sneakers | running shoes | black running shoes |
| maroon camisole top | NIL | casual shoes | white trousers |
| blue printed leggings, white heels | NIL | black printed kurta | white printed kurta |

Table 2. Qualitative Results for Fashion Item Prediction for ED Dataset

| Query | Apriori Algorithm | Seq2Seq Model w/o Attention | Seq2Seq Model with Attention |
|---|---|---|---|
| black polka dot tights | NIL | black dress | black lace dress |
| yellow print jacket, brown leather boots | NIL | dress | black skirt |
| black tights, mustard cardigan, brown boots, white blouse | gloves | black short | black printed skirt |
| white woven shirt, light blue trousers | black pumps | blue bag | ivory printed coat |
| navy trench coat | NIL | white top | blue dress |

Table 3. Qualitative Results for Fashion Item Prediction for CD Dataset
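
One plausible reading of the NIL entries is that rare items never reach the support threshold of a frequency-based method, so no rule fires for them. The sketch below illustrates this with a pairwise co-occurrence count and a minimum-support cut-off. It is a deliberate simplification and not the Apriori configuration used in the experiments: the outfits, the threshold and the function names are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Optional, Set


def mine_pair_counts(outfits: List[Set[str]]) -> Counter:
    """Count how often each ordered pair of items appears in the same outfit."""
    counts: Counter = Counter()
    for outfit in outfits:
        for a, b in combinations(sorted(outfit), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(
    query_item: str, pair_counts: Counter, min_support: int
) -> Optional[str]:
    """Return the most frequent co-occurring item, or None if no pair is frequent."""
    candidates = {
        b: c
        for (a, b), c in pair_counts.items()
        if a == query_item and c >= min_support
    }
    if not candidates:
        return None  # plays the role of NIL in the tables
    return max(candidates, key=candidates.get)


# Hypothetical outfits; "ivory coat" appears only once.
outfits = [
    {"blue jeans", "white t-shirt"},
    {"blue jeans", "white t-shirt"},
    {"blue jeans", "black boots"},
    {"ivory coat", "white t-shirt"},
]
pair_counts = mine_pair_counts(outfits)

print(recommend("blue jeans", pair_counts, min_support=2))
print(recommend("ivory coat", pair_counts, min_support=2))
```

The frequent item receives a recommendation, while the rare item falls below the threshold and yields nothing. A generative model that scores attributes individually is not subject to this hard cut-off.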

5. Related Work

This work is closely related to two sub-fields: application of recommender systems and deep learning.

Recommender System: The closest area of research is complementary item recommendation, which has attracted considerable interest from researchers in the field of recommender systems (McAuley_2015_C, ; Kalantidis:2013, ; McAuley_2015, ; Veit_2015_ICCV, ). Veit et al. (Veit_2015_ICCV, ) propose single item recommendation for a given item. They learn a feature transformation from images of items into a latent space that expresses compatibility, and model pairwise compatibility based on co-occurrence in large-scale user behavior data, in particular co-purchase data. McAuley et al. (McAuley_2015, ) propose joint recommendation of complementary and substitutable products by formulating it as a supervised link prediction task. They employ product reviews to build topic models that learn such relationships. McAuley et al. (McAuley_2015_C, ) propose image-based recommendation of clothes and accessories that go well together based on visual cues. In (Kalantidis:2013, ) the authors propose an approach to learn relationships between clothing items and events (e.g. birthday parties, funerals) in order to recommend event-appropriate items. They learn a supervised model using visual features on a predefined set of categories and attributes. Although related to our problem, these methods rely on handcrafted methods and carefully annotated data for recommending clothing categories for a given occasion.

Fashion is driven by style, which is set by fashion enthusiasts and evolves over time. Most of the recommender systems discussed above are based on product images, purchase history or product reviews. However, these aspects do not capture fashion aesthetics and are therefore incapable of recommending complementary stylish products. In contrast, we exploit the goldmine of social media posts from fashion experts, thereby learning the aesthetics of fashion, while leveraging the fashion taxonomy to understand style, which forms the backbone of fashion.

Deep Learning: Deep learning has provided state-of-the-art models in diverse applications such as machine reading and comprehension (NIPS2015_5945, ), machine translation (45610, ), query suggestion (Sordoni:2015, ) and summarization (Abigail, ). This technology helps in building models with multiple desirable features: (a) minimal feature engineering, (b) the ability to create expressive models and (c) minimal assumptions about the domain, thereby enabling easier portability to newer domains (Bengio:2013, ). This inspires us to tap its potential for building accurate models that capture style and trend. For our complementary item recommendation problem, we use the sequence-to-sequence model (SutskeverVL14, ), which has shown significant improvements in word error rates for conditional text generation problems such as machine translation (45610, ) and long text summarization (Abigail, ). Our expressive recommendation model provides predictions which are not only accurate but also generalizable. We believe our work will revive interest among researchers in applying deep learning to challenging problems in the fashion domain.

6. Conclusion and Future Work

In this work, we formally defined the problem of completing a set of fashion items and proposed a sequence-to-sequence algorithm to solve this task. We applied this algorithm to the task of generating stylish fashion ensembles and demonstrated the effectiveness of the system both quantitatively and qualitatively. In future, we would like to explore other information accompanying a post, such as comments from users, the sentiment in these comments, and images, thereby improving the quality of recommendation. We would also like to investigate a different loss function for the proposed model that addresses the subjective nature of the task. Finally, we want to move from predicting a single item to itemset prediction, which can enumerate all the recommended items in one shot. Itemset prediction is difficult to optimize using cross-entropy loss, and it would be interesting to explore reinforcement learning to tackle this bottleneck.


  • (1) Pasquale Lops, Marco de Gemmis, Giovanni Semeraro. 2011. Content-based Recommender Systems: State of the Art and Trends. Recommender Systems Handbook, 73–105.
  • (2) Charu C. Aggarwal. 2016. Content-Based Recommender Systems. Recommender Systems: The Textbook, 139–166.
  • (3) Julian McAuley, Christopher Targett, Qinfeng Shi, Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. SIGIR, 43–52.
  • (4) Ruining He, Chunbin Lin, Julian McAuley. 2016. Fashionista: A Fashion-aware Graphical System for Exploring Visually Similar Items. WWW Companion, 199–202.
  • (5) Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, Sarah Tavel. 2015. Visual Search at Pinterest. KDD, 1889–1898.
  • (6) B. Sarwar, G. Karypis, J. Konstan, J. Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. WWW, 285–295.
  • (7) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP.
  • (8) Ilya Sutskever, Oriol Vinyals, Quoc Le. 2014. Sequence to Sequence Learning with Neural Networks. NIPS, 3104–3112.
  • (9) Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR.
  • (10) Tomas Mikolov, Armand Joulin, Sumit Chopra, Michaël Mathieu, Marc’Aurelio Ranzato. 2014. Learning Longer Memory in Recurrent Neural Networks. JCMR.
  • (11) Rakesh Agrawal, Tomasz Imieliński, Arun Swami. 1993. Mining Association Rules between Sets of Items in Large Databases. SIGMOD, 207–216.
  • (12) Yannis Kalantidis, Lyndon Kennedy, Li-Jia Li. 2013. Getting the Look: Clothing Recognition and Segmentation for Automatic Product Suggestions in Everyday Photos. ICMR, 105–112.
  • (13) Julian McAuley, Rahul Pandey, Jure Leskovec. 2015. Inferring Networks of Substitutable and Complementary Products. KDD, 785–794.
  • (14) Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, Serge Belongie. 2015. Learning Visual Clothing Style With Heterogeneous Dyadic Co-Occurrences. ICCV, 4642–4650.
  • (15) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. NIPS, 1693–1701.
  • (16) Yonghui Wu et al. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. TACL.
  • (17) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, Jian-Yun Nie. 2015. A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. CIKM, 553–562.
  • (18) Abigail See, Peter J. Liu, Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. ACL, 1073–1083.
  • (19) Yoshua Bengio, Aaron Courville, Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 1798–1828.