MARS: Memory Attention-Aware Recommender System

Lei Zheng, Chun-Ta Lu (Department of Computer Science, University of Illinois at Chicago, Chicago, IL, U.S.A.; lzheng21,clu29@uic.edu), Lifang He (Department of Healthcare Policy & Research, Weill Cornell, Cornell University, New York, NY, U.S.A.; lifanghescut@gmail.com), Sihong Xie (Computer Science and Engineering Department, Lehigh University, Bethlehem, PA, U.S.A.; sxie@cse.lehigh.edu), and Vahid Noroozi, He Huang and Philip S. Yu (Department of Computer Science, University of Illinois at Chicago, Chicago, IL, U.S.A.; vnoroo2,hehuang,psyu@uic.edu)
Abstract.

In this paper, we study the problem of modeling users’ diverse interests. Previous methods usually learn a fixed user representation, which has a limited ability to represent the distinct interests of a user. In order to model users’ various interests, we propose a Memory Attention-aware Recommender System (MARS). MARS utilizes a memory component and a novel attention mechanism to learn deep adaptive user representations. Trained in an end-to-end fashion, MARS adaptively summarizes users’ interests. In the experiments, MARS outperforms seven state-of-the-art methods on three real-world datasets in terms of recall and mean average precision. We also demonstrate that MARS offers strong interpretability in explaining its recommendation results, which is important in many recommendation scenarios.

Recommender Systems, Deep Learning, Attention

1. Introduction

How can we accurately model users’ interests? This is a fundamental question for building recommender systems (RS). To answer it, we observe one essential characteristic of users’ interests: diversity. Users are interested in many different kinds of items. For example, a user may purchase several books while also buying an electronic gadget. Furthermore, owing to this diversity, only a subset of a user’s purchased products reveals whether the user is interested in another product. For instance, a user may purchase an iPad case because she bought an iPad, rather than because of a book or a pair of shoes in her last week’s shopping list. Hence, modeling the diversity of users’ interests is a nontrivial task.

However, devising representation vectors that express users’ diverse interests is challenging. Existing methods often project users and items into fixed low-dimensional representation vectors in a joint user-item space, where similar items are close to each other and the distance between a user and an item indicates how much the user is interested in that item. We argue that a fixed user representation largely restrains such models from accurately modeling users’ diverse interests. As shown in Figure 1, consider a user who liked a list of items of different kinds. Because of this diversity, the representation vectors of the liked items form two clusters in the space, and the fixed user representation ends up residing between the two clusters. In this case, a system employing a fixed user representation will recommend an item lying between the clusters instead of an item inside either cluster, even though the latter is obviously a much more reliable recommendation. One may increase the representation dimensionality to ease this restriction, but doing so scatters item representations further in the space and causes a huge increase in model parameters.

Moreover, for most current deep RS, interpreting their recommendations is both difficult and in demand (Jannach et al., 2010; Schafer et al., 1999). Despite the effectiveness of representation vectors for predicting interactions, these vectors reside in latent spaces and are not directly understandable. In order to enhance transparency and trust in the recommendation process, RS increasingly need to provide not only accurate but also interpretable recommendations. This is particularly important in a business-to-business setting, where recommendations are generated for experienced sales staff rather than directly for the end client.

In this paper, in order to represent users’ diverse interests and to build a recommendation model capable of interpreting its recommendations, we develop MARS: Memory Attention-Aware Recommender System. First of all, MARS exploits the power of deep learning to learn representations of items from the item content. More importantly, motivated by the observation of user behaviors, we employ a memory component and a novel item-level attention mechanism to devise a deep adaptive user representation. Unlike a fixed user representation, an adaptive user representation dynamically adapts to locally activated items. As shown in Figure 1, for one candidate item, the adaptive user representation adapts to the liked items activated by that candidate; to recommend a different candidate item, another adaptive user representation adapts to the items relevant to that candidate instead. The concept of an adaptive user representation is defined as follows:

Definition 1 ().

(Adaptive User Representation). Given a user and the list of items liked by the user, in order to score a candidate item, an adaptive user representation dynamically adapts to those items in the list that are highly relevant to the candidate item.

The contributions of this work can be summarized as follows:

  • Adaptive User Representations: To the best of our knowledge, we are the first to introduce a deep end-to-end recommendation model that learns adaptive user representations. In particular, a memory component is utilized to capture users’ interests in an end-to-end fashion, and an attention mechanism is employed to handle the diversity of users’ interests.

  • An Interpretable Model: Benefiting from the item-level attention mechanism, MARS offers good interpretability: it can explain why an item is recommended to a user by showing the relevant items liked by that user.

  • Strong Performance: In the experiments, we demonstrate that MARS can significantly outperform strong baselines on three real-world datasets.

2. Background and Preliminaries

Figure 1. A fixed user representation fails to express a user’s diverse interests. Adaptive user representations dynamically adapt to relevant items.

In this section, we present the background and preliminaries of this study. Throughout the paper, we denote scalars by either lowercase or uppercase letters, vectors by boldfaced lowercase letters, and matrices by boldfaced uppercase letters.

We consider the most common scenario with implicit feedback. For implicit feedback, only positive observations, such as clicks, purchases, or likes, are available. The non-observed user-item pairs, e.g., a user has not bought an item yet, are a mixture of real negative feedback (the user is not interested in buying the item) and missing values (the user might want to buy the item in the future).

Although, in different recommendation scenarios, the item content can come in distinct formats, such as text or images, we limit our study to recommending items associated with textual content, which usually describes important characteristics of the items. Note, however, that the proposed model can be easily generalized to recommend different types of items, such as songs or videos.

Let us denote the set of users and the set of items, where each item is associated with a document containing the words of its textual content. For each user, we denote the set of items liked by the user and the set of remaining items. Furthermore, when scoring a candidate item, all items liked by the user other than the candidate item are utilized to model the user representation.
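To make the setting concrete, the following minimal Python sketch shows one way the implicit-feedback data could be organized; the variable names and toy values are illustrative assumptions, not part of the original formulation.

# Illustrative data structures for the implicit-feedback setting (names and values are hypothetical).
liked_items = {
    "user_0": {3, 17, 42},   # items with observed positive feedback (clicks, purchases, likes)
    "user_1": {5, 17},
}
item_documents = {           # each item id maps to the tokens of its textual description
    3:  ["wireless", "mouse", "ergonomic"],
    5:  ["fantasy", "novel", "trilogy"],
    17: ["tablet", "case", "leather"],
    42: ["tablet", "retina", "display"],
}

def candidate_items(user, all_items):
    # Items the user has not interacted with yet; recommendations are ranked over these.
    return set(all_items) - liked_items.get(user, set())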

Finally, we define the recommendation problem which we study in this paper as follows:

Definition 2 ().

(Problem Definition). Given the user and item sets, for each user we aim to recommend a ranked list of items, drawn from the items the user has not yet interacted with, that are of interest to the user.

3. Proposed Model

The overall architecture of MARS is shown in Figure 3. Inspired by the recent success of Convolutional Neural Networks (CNN) (LeCun et al., 1990) in various tasks (Kim et al., 2016; Krizhevsky et al., 2012), we first employ two CNN models that serve different purposes. One CNN model is designed to learn a memory component from the documents associated with the items liked by a user. The other CNN model is responsible for learning a representation of the candidate item. A novel item-level attention mechanism is then built on top of the memory component to learn a deep adaptive user representation. Finally, with the adaptive user representation and the candidate item representation, the user’s preference score for the candidate item can be derived.

3.0.1. CNN Models

In this subsection, since the two CNN models differ only in their inputs, we focus on illustrating one of them in detail; the same process applies to the other with similar layers. The architecture of the CNN used in this paper is shown in Figure 2.

Given an item and its associated document, in order to make use of the complex lexical semantics within the document, we first utilize a word embedding layer to transform the document into a dense numeric matrix. Formally, the word embedding layer maps each word of the dictionary to a dense, low-dimensional vector, so that each word in the document is transformed into its embedding:

(1)

where each column of the resulting matrix corresponds to the vector of a word in the document. Note that the embedding matrix is randomly initialized and further trained through the optimization process.

Following the word embedding layer, a convolutional layer is built to extract important contextual features that can represent items. The convolutional layer consists of multiple neurons, each of which applies a convolution operator on the embedding matrix:

(2)

Here, each neuron is associated with a convolutional filter of a fixed window size and a bias term, and we use ReLU (Nair and Hinton, 2010) as our activation function.

After the convolutional layer, each neuron produces a vector of contextual features captured by its convolutional kernel. To extract the most important feature value from each of these vectors, we apply a max-pooling operation, as shown in Figure 2. With all neurons in the convolutional layer, we obtain a column vector consisting of the important contextual features:

(3)

Finally, the contextual features of the item are projected into a low-dimensional space through a fully connected layer:

(4)

where the fully connected layer has its own weight matrix and bias term. Since the document is associated with the item and characterizes it, the resulting vector can be regarded as a deep representation of the item.
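As a rough illustration of the encoder just described (word embedding, convolution with ReLU, max-pooling over time, and a fully connected projection), the following PyTorch sketch shows one plausible implementation; the layer sizes, names, and exact configuration are assumptions rather than the authors’ released code.

import torch
import torch.nn as nn

class TextCNNEncoder(nn.Module):
    # Sketch of a CNN that maps an item document to a low-dimensional representation.
    def __init__(self, vocab_size, embed_dim=100, num_filters=50, window_size=3, out_dim=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)        # word embedding layer
        self.conv = nn.Conv1d(embed_dim, num_filters, window_size)  # convolutional layer
        self.fc = nn.Linear(num_filters, out_dim)                   # fully connected projection

    def forward(self, word_ids):                  # word_ids: (batch, doc_len) integer tensor
        x = self.embedding(word_ids)              # (batch, doc_len, embed_dim)
        x = x.transpose(1, 2)                     # (batch, embed_dim, doc_len) for Conv1d
        x = torch.relu(self.conv(x))              # contextual features per window position
        x = x.max(dim=2).values                   # max-pooling over time -> (batch, num_filters)
        return self.fc(x)                         # deep item representation -> (batch, out_dim)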

3.1. Memory Component

A set of items liked by a user naturally reflects the user’s interests. However, it is not easy to summarize the diverse items liked by a user into a single user representation. As discussed in the introduction, a fixed user representation fails to convey the diverse interests of a user.

Different from previous methods (Wang et al., 2016; Cheng et al., 2016), in order to learn a deep and accurate user representation, as shown in Figure 3, we first utilize one of the CNN models to learn a memory component over the items liked by the user:

(5)

Each column of the memory component corresponds to the representation of one item liked by the user. Hence, the matrix characterizes the user’s interests.
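Under the same assumptions as the encoder sketch above, the memory component can be pictured as the stacked encodings of the liked items’ documents; the helper below is purely illustrative.

import torch

def build_memory(encoder, liked_docs):
    # encoder    : a TextCNNEncoder-like module (hypothetical, from the sketch above)
    # liked_docs : (num_liked, doc_len) tensor of padded word ids, one row per liked item
    # returns    : (out_dim, num_liked) memory matrix; each column represents one liked item
    item_vecs = encoder(liked_docs)   # (num_liked, out_dim)
    return item_vecs.t()              # transpose so that columns correspond to liked items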

3.1.1. Item-level Attention

Although the memory component stores the information of all items liked by a user, our goal is to learn a deep representation of the user based on it. As discussed in the Introduction, the items liked by a user are diverse and only a subset of them is relevant to a given candidate item.

Figure 2. Our CNN architecture used for learning item representations.

Motivated by the intuition above, we introduce an item-level attention mechanism over the memory component to capture the liked items that are relevant to the candidate item. To do so, as shown in Figure 3, for a given candidate item, we first feed its corresponding document into the second CNN model to derive an item representation:

(6)

where the output is a column vector serving as a representation of the candidate item.

The more liked items that have high relevance scores for, or are locally activated by, the candidate item, the more likely the user is interested in it. Thus, with the memory component and the candidate item representation, we measure the relevance score, or activation degree, between the candidate item and each item representation in the memory by taking the inner product followed by a softmax:

(7)

Defined in this way, the result is an attention column vector of the user for the candidate item. A larger value in this vector indicates higher relevance between the candidate item and the corresponding liked item, and thus a higher weight when deriving the user’s interest in the candidate item.

3.1.2. Deep Adaptive User Representations

A fixed user representation is unable to express the diverse interests of a user. In order to uncover a user’s interest in a candidate item, we introduce a deep adaptive user representation. The proposed user representation depends on the candidate item and is not fixed. By focusing on the liked items that are highly activated, it is able to model users’ diverse interests.

With the memory matrix and the attention vector, a deep attention-aware adaptive user representation is formed as a weighted sum of the memory columns:

(8)

In Eq. (8), the attention vector weights each item representation in the memory. An adaptive user representation is thus obtained by adaptively focusing on the liked items activated by the candidate item.
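Concretely, the attention of Eq. (7) and the weighted sum of Eq. (8) amount to a softmax over inner products followed by a matrix-vector product. A minimal NumPy sketch, with hypothetical dimensions, could look like this:

import numpy as np

def adaptive_user_representation(memory, candidate_vec):
    # memory        : (k, n) matrix whose columns represent the user's liked items
    # candidate_vec : (k,) representation of the candidate item
    scores = memory.T @ candidate_vec                  # inner product with each liked item
    scores -= scores.max()                             # numerical stability before the softmax
    attention = np.exp(scores) / np.exp(scores).sum()  # attention vector (Eq. 7)
    user_vec = memory @ attention                      # adaptive user representation (Eq. 8)
    return attention, user_vec

# toy usage with k = 4 features and n = 3 liked items
memory = np.random.randn(4, 3)
candidate = np.random.randn(4)
attention, user_vec = adaptive_user_representation(memory, candidate)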

3.2. Pair-wise Learning and Prediction

Figure 3. The architecture of MARS.

Recall that an adaptive user representation is obtained by placing an attention vector over our memory component. Given the representation of a candidate item, one can compute the user’s preference score for that item as:

(9)

Here, we derive a preference score that reflects the preference of the user for the item. The usual approach to item recommendation is to rank items by sorting them according to these scores. However, such approaches treat recommendation as a regression problem and are not optimized for ranking. To train MARS for ranking, inspired by BPR (Rendle et al., 2009), we formalize the training data as:

(10)

where the user, the positive item, and the negative item are uniformly sampled from the user set, the items liked by the user, and the remaining items, respectively. Note that for each quadruple in the training data, the items used to build the memory are a different sampled subset of the user’s liked items. This sampling strategy is common practice for pair-wise recommendation learning; (Rendle et al., 2009) also shows that training on all item pairs can result in slow and poor convergence.

Similar to BPR, instead of scoring single items, we use item pairs to model a user’s preference for the positive item over the negative item:

(11)

where the adaptive user representations are computed with respect to the positive and the negative item, the item representations stand for the positive and the negative item, and regularization terms are placed on the user and item representations. The sigmoid function maps the user’s preference score of the positive item over the negative item into a probability.

MARS is trained by minimizing Eq. (11). The derivatives of the parameters in the different layers can be computed by applying the chain rule of differentiation (Rumelhart et al., 1988). We optimize the model with RMSprop (Tieleman and Hinton, 2012) over batches of sampled quadruples.
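Putting the pieces together, the pair-wise objective described above (the negative log-sigmoid of the score difference between a positive and a negative item, plus regularization) can be sketched as follows; the regularization weight and the exact form of the penalty are assumptions rather than the paper’s reported settings.

import numpy as np

def pairwise_loss(user_pos, item_pos, user_neg, item_neg, reg=0.01):
    # user_pos / user_neg : adaptive user representations w.r.t. the positive / negative item
    # item_pos / item_neg : deep representations of the positive / negative item
    score_pos = user_pos @ item_pos            # preference score for the positive item (Eq. 9)
    score_neg = user_neg @ item_neg            # preference score for the negative item
    diff = score_pos - score_neg
    nll = np.logaddexp(0.0, -diff)             # -log sigmoid(diff), numerically stable
    penalty = reg * (np.sum(user_pos ** 2) + np.sum(item_pos ** 2)
                     + np.sum(user_neg ** 2) + np.sum(item_neg ** 2))
    return nll + penalty                       # quantity minimized during training (Eq. 11)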

Input: Training set, number of epochs, batch size, window size, number of convolutional neurons
Output: Model parameters
for each epoch do
        Generate a batch by uniformly sampling users, positive items, and negative items;
        Calculate the memory component according to Eq. (5);
        Calculate the positive and negative item representations according to Eq. (6);
        Calculate the attention vectors according to Eq. (7);
        Calculate the adaptive user representations according to Eq. (8);
        Calculate the loss according to Eq. (11);
        Estimate gradients by back propagation;
        Update the parameters with RMSprop (Tieleman and Hinton, 2012);
       
end for
return the learned parameters;
Algorithm 1 Training procedure of MARS

Overall, the training procedure of MARS is summarized in Algorithm 1. During testing, given a user’s past likes, the final item recommendation list for the user is generated according to the following ranking criterion:

(12)
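In code, this ranking criterion simply scores every candidate item with its adaptive user representation and sorts the scores; the sketch below reuses the hypothetical adaptive_user_representation helper from the earlier sketch and is illustrative only.

import numpy as np

def recommend_top_n(memory, candidate_vectors, candidate_ids, n=10):
    # memory            : (k, m) memory matrix built from the user's liked items
    # candidate_vectors : (num_candidates, k) representations of items the user has not interacted with
    # candidate_ids     : list of the corresponding item ids
    scores = []
    for vec in candidate_vectors:
        _, user_vec = adaptive_user_representation(memory, vec)
        scores.append(user_vec @ vec)                 # preference score used for ranking (Eq. 12)
    order = np.argsort(scores)[::-1]                  # highest score first
    return [candidate_ids[i] for i in order[:n]]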

4. Experiments

4.1. Datasets

We evaluate the proposed method on three real-world datasets with different kinds of items as the following:

  • Yahoo! Movies: this dataset consists of users rating movies on a scale of 1 to 5, with a short synopsis for each movie. To be consistent with the implicit feedback setting, as in (He et al., 2017), we extract only positive ratings (ratings of 5) for training and testing. After removing movies without a synopsis and users with fewer than 3 ratings, we obtain a dataset that contains 7,642 users, 11,915 items, and 221,367 positive ratings with 0.24% density.

  • Amazon Video Games: this dataset was collected by (McAuley et al., 2015; He and McAuley, 2016) and contains game descriptions and ratings from 1 to 5. Similarly, we transformed it into implicit data, where each entry is marked as 0 or 1 indicating whether the user has rated the game. After removing games without descriptions and users with fewer than 10 ratings, the resulting dataset contains 47,063 video games and 2,670 users with 0.037% density.

  • Amazon Movies and TV: this dataset was also collected by (McAuley et al., 2015; He and McAuley, 2016). Items in this dataset are movies and TV shows available on Amazon.com, which users rate from 1 to 5. In order to transform explicit ratings into implicit feedback, we only preserve ratings with the value of 5 and treat them as positive feedback; other entries are marked as negative. By removing users with fewer than 10 ratings, we obtain a dataset with 22,147 users, 178,086 items, and 0.0128% density. Note that, compared with the previous two datasets, this is a much harder dataset due to its sparsity.

The synopses and descriptions of items serve as the item content. After removing stop words and words that occur fewer than 5 times, the vocabulary sizes of Yahoo! Movies, Amazon Video Games and Amazon Movies and TV are 33,195, 25,035 and 68,919, respectively. All three datasets are publicly available (Yahoo! Movies can be downloaded at https://webscope.sandbox.yahoo.com; Amazon Video Games and Amazon Movies and TV are available at https://snap.stanford.edu/data/web-Amazon.html).
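The text preprocessing described above (removing stop words and rare words) can be sketched as follows; the tokenization and the stop-word list are placeholders and not the exact pipeline used in the paper.

from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}   # placeholder stop-word list

def build_vocabulary(documents, min_count=5):
    # Count words over all documents, dropping stop words and words occurring fewer than min_count times.
    counts = Counter(word for doc in documents
                     for word in doc.lower().split() if word not in STOP_WORDS)
    return {word for word, c in counts.items() if c >= min_count}

def preprocess(documents, vocab):
    # Keep only in-vocabulary tokens for each item document.
    return [[w for w in doc.lower().split() if w in vocab] for doc in documents]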

4.2. Baselines

To validate the effectiveness of MARS, we compare MARS with seven state-of-the-art baseline models. Among them, BPR and NCF completely ignore the textual content associated with items and all other baselines utilize the content information to boost their performances.

  • BPR (Rendle et al., 2009) (code: https://github.com/zenogantner/MyMediaLite): We use Bayesian Personalized Ranking based Matrix Factorization, which is based on users’ pair-wise preferences, as a purely collaborative filtering method. BPR completely ignores the item content.

  • NCF (He et al., 2017) (code: https://github.com/hexiangnan/neural_collaborative_filtering): Neural Collaborative Filtering fuses matrix factorization and a Multi-Layer Perceptron (MLP) to learn from user-item interactions. The MLP endows NCF with the ability to model non-linearities.

  • CMF (Singh and Gordon, 2008) (code: https://github.com/david-cortes/cmfrec): Collective Matrix Factorization simultaneously factorizes multiple matrices to incorporate different sources of information. We factorize the rating matrix together with a matrix of bag-of-words features. Note that we follow the work of (Hu et al., 2008) to adapt CMF for implicit feedback.

  • CTR (Wang and Blei, 2011) (code: https://github.com/blei-lab/ctr): Collaborative Topic Regression is based on topic modeling techniques and shows very good performance on recommending articles.

  • DeepFM (Sierra, 2017) (code: https://github.com/ChenglongChen/tensorflow-DeepFM): This baseline combines the power of factorization machines (Rendle, 2012) for recommendation and deep learning for feature learning in a single neural network. We concatenate a user id and the bag-of-words features of each item for training and prediction.

  • CDL (Wang et al., 2015) (code: http://www.wanghao.in/code/cdl-release.rar): Collaborative Deep Learning is a recently proposed deep recommender system. It tightly couples a Bayesian formulation of stacked denoising auto-encoders and Probabilistic Matrix Factorization (PMF) (Salakhutdinov and Mnih, 2011). The middle layer of the auto-encoders serves as a bridge between the auto-encoders and probabilistic matrix factorization.

  • Wide & Deep (Cheng et al., 2016) (code: https://www.tensorflow.org/tutorials/wide_and_deep): This model jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems. Similar to DeepFM, a user id and the bag-of-words features of each item are combined and fed into both the ”wide” and ”deep” parts of the model.

All baseline models can be categorized into three groups: (1) BPR and NCF: these two models learn only from users’ implicit feedback and ignore the textual content; (2) CMF and CTR: this group includes two ”shallow” recommendation models; (3) DeepFM, CDL and Wide & Deep: three state-of-the-art deep learning based models are included to be compared with MARS.

4.3. Experimental Setup

Following (Wang et al., 2016), we evaluate the models on held-out user-item likes. For each dataset, to evaluate MARS and the baselines in a sparse setting, we randomly select only 30% of the items associated with each user to form the training set. All remaining items are split evenly into validation and test sets. We repeat the evaluation five times with different randomly selected training sets and report the average performance in the following sections. For each dataset, to achieve the best performance for each individual model, we conduct a careful parameter study in Section 4.4.

We expect an RS not only to retrieve relevant items out of all available items but also to provide a ranking in which items of interest to the user are ranked near the top. Therefore, we use two metrics to evaluate the proposed model and the baselines. The first metric is recall@k, which measures how well a model can retrieve relevant items out of all available items. The recall@k for each user is defined as:

(13)

The other evaluation metric we use is Mean Average Precision (MAP), which measures the ranking performance of MARS and the baselines. The definition of MAP is given as:

(14)

where the precision term represents the precision of the top products recommended to a user, and the relevance indicator denotes whether the user has interacted with the item in the test set. Similar to (Van den Oord et al., 2013), we set the cut-off point to 500 for each user.
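For reference, recall@k and MAP with a cut-off of 500 can be computed as in the Python sketch below; this follows the standard formulations implied by the definitions above and is not the authors’ evaluation script.

def recall_at_k(ranked_items, relevant_items, k=50):
    # Fraction of the user's held-out relevant items that appear in the top-k recommendations.
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / max(len(relevant_items), 1)

def average_precision(ranked_items, relevant_items, cutoff=500):
    # Average of precision@i over the ranks i at which relevant items occur, up to the cut-off.
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked_items[:cutoff], start=1):
        if item in relevant_items:
            hits += 1
            precision_sum += hits / i
    return precision_sum / max(min(len(relevant_items), cutoff), 1)

def mean_average_precision(all_rankings, all_relevant, cutoff=500):
    # MAP: mean of the per-user average precision.
    aps = [average_precision(r, rel, cutoff) for r, rel in zip(all_rankings, all_relevant)]
    return sum(aps) / max(len(aps), 1)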

4.4. Hyper-Parameter Study

Although different models have different hyper-parameters, some hyper-parameters are common and play important roles in model performance. In this section, we optimize the performance of MARS and the baselines by studying the impacts of several important hyper-parameters on the validation sets of Yahoo! Movies and Amazon Video Games. The learning rate and batch size of MARS are set empirically. All other hyper-parameters of the baselines follow the original papers.

4.4.1. Dimension of Latent Vectors

A low-dimensional latent vector for users and items limits the ability to model complex user-item interactions, whereas a high-dimensional vector may harm the generalization of the model and increases the number of parameters. In order to optimize the performance of MARS as well as the baselines, we conduct an experiment to investigate the impact of the dimension of latent vectors, searching over the range shown in Figure 4 via the validation sets of Yahoo! Movies and Amazon Video Games. As shown in Figure 4, different models reach their best performance at different dimensions of the latent vectors, and we pick the best-performing dimension for each model on each dataset.

4.4.2. Regularization Terms

Figure 4. The dimension of latent vectors of users and items is varied from 5 to 100 to investigate its impact on performance; panels: (a) Yahoo! Movies, (b) Amazon Video Games.
Figure 5. The first of the two regularization terms is varied from 0.001 to 0.05 to investigate its impact on performance; panels: (a) Yahoo! Movies, (b) Amazon Video Games.
Figure 6. The second of the two regularization terms is varied from 0.001 to 0.05 to examine its impact on performance; panels: (a) Yahoo! Movies, (b) Amazon Video Games.

In order to combat over-fitting, many models place regularization terms on the representation vectors of users and items. We search both hyper-parameters over the range from 0.001 to 0.05 to optimize the performance of MARS and the baselines. In Figures 5 and 6, we can see that MARS reaches its best performance when both terms are set to the same value on the two datasets. Note that, instead of these regularization terms, NCF and DeepFM employ Dropout (Srivastava et al., 2014) to prevent over-fitting, and Wide & Deep uses regularization on its weights.

4.4.3. Network Architecture

To find a good network shape for MARS, we also investigate two hyper-parameters: the dimension of the word embeddings and the number of convolutional filters. We perform a grid search over these two hyper-parameters on the validation set of Yahoo! Movies and select the best-performing settings. For NCF, as suggested in the original paper (He et al., 2017), we employ a three-layer MLP. In Wide & Deep, a three-layer MLP is used in the ”deep” part. In DeepFM, we build an MLP network following (Sierra, 2017).

4.4.4. Other Hyper-Parameters

For CMF, the bag-of-words vectors of the items, constructed as in (Wang et al., 2015), form a content matrix that is factorized simultaneously with the rating matrix. By performing a grid search, we optimize the performance of CMF by setting the weights for the rating matrix and the content matrix to 5 and 1, respectively. The hyper-parameters of CTR are likewise tuned by grid search. To optimize the performance of CDL, we perform a grid search over its hyper-parameters.

4.5. Experimental Results

Dataset | Metric | BPR | NCF | CMF | CTR | DeepFM | CDL | Wide & Deep | MARS | MARS vs. best
Yahoo! Movies | recall@50 | 0.1756 (0.001) | 0.1649 (0.008) | 0.1532 (0.002) | 0.2067 (0.002) | 0.2416 (0.002) | 0.2534 (0.001) | 0.2611 (0.001) | 0.3230 (0.001) | 23.7%
Yahoo! Movies | MAP | 0.1045 (0.001) | 0.0973 (0.002) | 0.0894 (0.003) | 0.1223 (0.002) | 0.1389 (0.001) | 0.1452 (0.001) | 0.1523 (0.002) | 0.1692 (0.002) | 11.1%
Amazon Video Games | recall@50 | 0.0716 (0.001) | 0.0849 (0.002) | 0.0528 (0.002) | 0.0827 (0.003) | 0.1086 (0.001) | 0.1034 (0.004) | 0.1123 (0.001) | 0.1337 (0.002) | 19.1%
Amazon Video Games | MAP | 0.0625 (0.002) | 0.0654 (0.001) | 0.0517 (0.002) | 0.0658 (0.001) | 0.0815 (0.004) | 0.0746 (0.005) | 0.0821 (0.003) | 0.0934 (0.003) | 13.8%
Amazon Movies and TV | recall@50 | 0.0643 (0.001) | 0.0604 (0.002) | 0.0496 (0.002) | 0.0711 (0.002) | 0.0935 (0.001) | 0.0901 (0.001) | 0.1001 (0.003) | 0.1196 (0.002) | 19.5%
Amazon Movies and TV | MAP | 0.0543 (0.001) | 0.0551 (0.001) | 0.0493 (0.002) | 0.0659 (0.004) | 0.0771 (0.002) | 0.0746 (0.004) | 0.0785 (0.001) | 0.0895 (0.002) | 14.0%
Table 1. Performance comparison with baselines (higher is better; the best performance in each row is achieved by MARS). The standard deviation is shown in parentheses.
Figure 7. Performance comparison with variants of MARS on the three datasets in terms of recall@50 and MAP.

Table 1 shows the experimental results of MARS and the seven state-of-the-art baselines on all three datasets. Overall, MARS improves over the best baseline by 20.8% in terms of recall@50 and 13.0% in terms of MAP, averaged over the three datasets.

These experiments reveal a number of interesting points:

  • Regardless of the data sets and the evaluation metrics, our MARS always achieves the best performance. This shows that by leveraging the power of adaptive user representations, MARS can better model users’ diverse interests, resulting in better recommendations.

  • Except for CMF, models considering the textual content generally give better results than models ignoring the textual content, which validates the usefulness of the textual content.

  • Deep models perform better than ”shallow” ones in our experiments. This observation indicates the advantages of deep features extracted by various deep models.

  • Among all baseline models, Wide & Deep shows itself to be a strong performer and beats all other baselines on all three datasets in terms of recall@50 and MAP. This suggests combining a deep model with a ”shallow” one for further improvements.

  • Because of the sparsity differences, we also observe that the performance of all models degrades as the datasets become sparser.

5. Model Analysis

In this section, we conduct experiments to answer the following questions:

  • Q1: How much does MARS benefit from taking the textual content associated with items into consideration?

  • Q2: How much do the incorporated CNN models help MARS?

  • Q3: Do the proposed adaptive user representations and attention mechanism assist MARS in achieving better performances?

In order to answer questions above, we include three variants of MARS: MARS w/o Text, MARS+Embed and MARS w/o Att as the following:

  • MARS w/o Text: To answer Q1, instead of learning item embeddings from the textual content, we ignore the content information and randomly initialize item embeddings based on a Gaussian distribution. After the initialization, embeddings of items are optimized during the training.

  • MARS+Embed: In order to answer Q2, in this variant, we replace the CNN models by simply averaging the embeddings of the words contained in each document associated with every item.

  • MARS w/o Att: For Q3, we include a variant of MARS without the proposed attention mechanism. To do so, we fix all attention values to one. In this way, the attention vector is effectively canceled and the user representation becomes fixed, no longer adapting to the relevant items liked by the user.

As shown in Figure 7, MARS w/o Text performs the worst, regardless of dataset and evaluation metric. Compared with MARS w/o Text, MARS+Embed gains improvements on all three datasets, which further validates the advantages of incorporating the textual content into RS. However, averaging the embeddings of the words associated with items is not an ideal way to make use of the information in the textual content. Powered by the deep features extracted by the CNN models from the textual content, MARS w/o Att achieves additional improvements in MAP over MARS+Embed on Yahoo! Movies, Amazon Video Games and Amazon Movies and TV. Moreover, MARS beats MARS w/o Att in terms of recall@50 and MAP on all three datasets. This demonstrates that, with the proposed attention mechanism and the incorporated CNN models, MARS is able to learn an effective adaptive user representation and leverage the rich information in the textual content.

6. A case study on the interpretability of MARS

User | Recommended Movie | Because you watched (attention value)
User I | Sleepless in Seattle | 1. Where the Heart Is (0.03); 2. When Harry Met Sally… (0.02); 3. Ronnie & Julie (0.02)
User I | Enemy at the Gates | 1. X2: X-Men United (0.02); 2. Top Gun (0.015); 3. The Matrix Reloaded (0.014)
User II | The Lord of the Rings: The Two Towers | 1. Pirates of the Caribbean: The Curse of the Black Pearl (0.03); 2. Harry Potter and the Chamber of Secrets (0.02); 3. Harry Potter and the Sorcerer’s Stone (0.02)
User II | Kill Bill Vol. 1 | 1. Donnie Brasco (0.03); 2. The Italian Job (0.021); 3. S.W.A.T. (0.02)
Table 2. A case study on the interpretability of MARS. The second column displays the top two movies recommended by MARS to User I and User II. The third column lists the top three liked movies with the highest attention values; the actual attention value is shown in parentheses.

To gain better insight into the interpretation ability of MARS, we conduct a qualitative experiment in this section. We run MARS on the Yahoo! Movies dataset and examine two example users. For each of them, we show two recommended movies in the second column of Table 2, and the top three liked movies with the highest attention values in the third column, which explain why the corresponding movies are recommended by MARS. For example, MARS recommends Sleepless in Seattle to User I because movies such as Where the Heart Is and When Harry Met Sally… are locally activated with high attention values. Besides romance movies, MARS also discovers that User I is interested in action movies because of his or her past interactions with X2: X-Men United and Top Gun; as a result, MARS recommends Enemy at the Gates. For User II, MARS recommends two movies of different genres: The Lord of the Rings: The Two Towers and Kill Bill Vol. 1 (2003). This is because movies watched by User II, such as Pirates of the Caribbean: The Curse of the Black Pearl and Donnie Brasco (1997), indicate that the user is a fan of fantasy and action movies. Overall, this case study shows that MARS not only captures users’ diverse interests but also provides strong interpretability for its recommendation results.

7. Related Work

Our work is closely related to two areas: deep learning based RS and attention mechanisms employed for RS. We first give a brief review of deep learning based RS, and then cover works utilizing attention mechanisms for RS.

7.1. Deep Recommender Systems

Several studies (Zheng et al., 2016; Wang et al., 2017c; Lu et al., 2018) have recently proposed deep learning based models for recommendation tasks. One pioneering work (Salakhutdinov et al., 2007) in this area uses a Restricted Boltzmann Machine (RBM) (Hinton, 2010) based method to model users from their rating preferences. Following this trend, (Wu et al., 2016) utilizes denoising auto-encoders to learn latent vectors of users and items from the rating matrix. (He et al., 2017) and (He and Chua, 2017) leverage Multilayer Perceptrons (MLP) to learn from user-item interactions. In (Zheng et al., 2016), a CF-based Neural Autoregressive Distribution Estimator (CF-NADE) model is proposed for collaborative filtering tasks. However, all of the works above differ from MARS because they ignore the rich content associated with items.

Some studies also propose to utilize deep learning techniques to build content-based recommender systems. In (Van den Oord et al., 2013; Wang and Wang, 2014), the authors introduce Convolutional Neural Networks (CNN) (LeCun et al., 1990) and Deep Belief Networks (DBN) (Hinton et al., 2006) to learn users’ preferences from music data. (Wang et al., 2015, 2016; Bansal et al., 2016) propose deep recommendation models that learn from the textual content associated with items. (Covington et al., 2016; Wang et al., 2017a) propose deep recommender systems for video and point-of-interest recommendation. (Zhang et al., 2016, 2017a; Zheng et al., 2017) investigate how to leverage multi-view information to improve the quality of recommender systems. (Li et al., 2017) proposes a model that simultaneously predicts ratings and generates abstractive tips. For more works on deep learning based RS, readers can refer to a survey paper (Zhang et al., 2017b).

7.2. Attentional Mechanisms for RS

Recently, attention mechanisms have attracted considerable interest from researchers, owing to their ability to model users’ attention. A number of works (Vinh et al., 2018) employing attention mechanisms to build RS have been proposed. (Chen et al., 2018) introduces an attention mechanism utilizing reviews. In (Xiao et al., 2017), the authors propose a neural attention network to model the importance of each feature interaction from data. (Loyola et al., 2017) proposes an attention-based model to learn the dynamics of user interests from a sequence of items. To build an interpretable recommendation model, (Seo et al., 2017) proposes to utilize an attentional method to extract text from reviews. For the purpose of capturing editors’ dynamic criteria for selecting news articles, (Wang et al., 2017b) proposes a dynamic attentional method.

To the best of our knowledge, the two methods closest to ours are presented in (Chen et al., 2017) and (Zhou et al., 2017). Although both works place attention mechanisms over items, neither of them is an end-to-end model: they use either hand-crafted or pre-trained item features. As a result, the item features are not optimized for the recommendation task, which leads to a less accurate attention mechanism. In contrast, MARS learns item features and users’ attention in an end-to-end fashion and therefore achieves better item features and more accurate user attention.

In terms of interpretability, a deep recommender system with interpretability is presented in (Seo et al., 2017). However, it interprets recommendations based on reviews written by users, whereas we interpret recommendations based on items purchased or liked by users. In fact, interpreting recommendations based on purchased items is more common in practice (as Amazon and Netflix do).

Although all of the neural network based approaches above are deep content-based recommender systems, they differ from MARS because none of them is able to learn an adaptive user representation in an end-to-end fashion.

8. Conclusions

Although deep learning based RS have shown promising results in a variety of recommendation tasks, previous methods often focus on utilizing deep models to learn item representations while keeping a fixed user representation. However, a fixed user representation restrains models from modeling the diverse interests of users. Furthermore, while interpretability is demanded in many recommendation scenarios, most existing deep learning based RS are unable to interpret their recommendation results.

In this paper, to tackle the problems and challenges above, we present the Memory Attention-aware Recommender System (MARS). With the proposed memory component and item-level attention mechanism, instead of modeling a fixed deep user representation, MARS learns a deep adaptive user representation. For a candidate item and a set of items liked by a user, the deep adaptive user representation dynamically adapts to those items in the set that are relevant to the candidate item. Owing to this adaptability, a deep adaptive user representation can overcome the difficulty of modeling users’ diverse interests. Moreover, with the help of the proposed attention mechanism, MARS is able to interpret its recommendation results based on the items purchased by users.

In the experiments, we demonstrate that MARS achieves superior performance compared with seven state-of-the-art methods on three real-world datasets. We also demonstrate that MARS not only overcomes the difficulty of modeling the diverse interests of users but also offers strong interpretability.

References

  • Bansal et al. (2016) Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the gru: Multi-task learning for deep text recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 107–114.
  • Chen et al. (2018) Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural Attentional Rating Regression with Review-level Explanations. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1583–1592.
  • Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 507–517.
  • He and Chua (2017) Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. 355–364.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
  • Hinton (2010) Geoffrey Hinton. 2010. A practical guide to training restricted Boltzmann machines. Momentum 9, 1 (2010), 926.
  • Hinton et al. (2006) Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation 18, 7 (2006), 1527–1554.
  • Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 263–272.
  • Jannach et al. (2010) Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich. 2010. Recommender systems: an introduction. Cambridge University Press.
  • Kim et al. (2016) Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 233–240.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • LeCun et al. (1990) Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. 1990. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems. 396–404.
  • Li et al. (2017) Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural Rating Regression with Abstractive Tips Generation for Recommendation. (2017).
  • Loyola et al. (2017) Pablo Loyola, Chen Liu, and Yu Hirate. 2017. Modeling User Session and Intent with an Attention-based Encoder-Decoder Architecture. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 147–151.
  • Lu et al. (2018) Chun-Ta Lu, Lifang He, Hao Ding, Bokai Cao, and S Yu Philip. 2018. Learning from Multi-View Multi-Way Data via Structural Factorization Machines. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1593–1602.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
  • Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
  • Rendle (2012) Steffen Rendle. 2012. Factorization Machines with libFM. ACM Trans. Intell. Syst. Technol. 3, 3, Article 57 (May 2012), 22 pages.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 452–461.
  • Rumelhart et al. (1988) David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. 1988. Learning representations by back-propagating errors. Cognitive modeling 5, 3 (1988), 1.
  • Salakhutdinov and Mnih (2011) Ruslan Salakhutdinov and Andriy Mnih. 2011. Probabilistic matrix factorization. In NIPS, Vol. 20. 1–8.
  • Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning. ACM, 791–798.
  • Schafer et al. (1999) J Ben Schafer, Joseph Konstan, and John Riedl. 1999. Recommender systems in e-commerce. In Proceedings of the 1st ACM conference on Electronic commerce. ACM, 158–166.
  • Seo et al. (2017) Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable Convolutional Neural Networks with Dual Local and Global Attention for Review Rating Prediction. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 297–305.
  • Sierra (2017) Carles Sierra (Ed.). 2017. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017. ijcai.org. http://www.ijcai.org/Proceedings/2017/
  • Singh and Gordon (2008) Ajit P Singh and Geoffrey J Gordon. 2008. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 650–658.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
  • Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4, 2 (2012), 26–31.
  • Van den Oord et al. (2013) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in Neural Information Processing Systems. 2643–2651.
  • Vinh et al. (2018) Tran Dang Quang Vinh, Tuan-Anh Nguyen Pham, Gao Cong, and Xiao-Li Li. 2018. Attention-based Group Recommendation. arXiv preprint arXiv:1804.04327 (2018).
  • Wang and Blei (2011) Chong Wang and David M Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 448–456.
  • Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
  • Wang et al. (2016) Hao Wang, SHI Xingjian, and Dit-Yan Yeung. 2016. Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks. In Advances in Neural Information Processing Systems. 415–423.
  • Wang et al. (2017c) Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017c. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 515–524.
  • Wang et al. (2017a) Suhang Wang, Yilin Wang, Jiliang Tang, Kai Shu, Suhas Ranganath, and Huan Liu. 2017a. What your images reveal: Exploiting visual contents for point-of-interest recommendation. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 391–400.
  • Wang and Wang (2014) Xinxi Wang and Ye Wang. 2014. Improving content-based and hybrid music recommendation using deep learning. In Proceedings of the ACM International Conference on Multimedia. ACM, 627–636.
  • Wang et al. (2017b) Xuejian Wang, Lantao Yu, Kan Ren, Guanyu Tao, Weinan Zhang, Yong Yu, and Jun Wang. 2017b. Dynamic attention deep model for article recommendation by learning human editors’ demonstration. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2051–2059.
  • Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 153–162.
  • Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617 (2017).
  • Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
  • Zhang et al. (2017b) Shuai Zhang, Lina Yao, and Aixin Sun. 2017b. Deep Learning based Recommender System: A Survey and New Perspectives. arXiv preprint arXiv:1707.07435 (2017).
  • Zhang et al. (2017a) Yongfeng Zhang, Qingyao Ai, Xu Chen, and W Croft. 2017a. Joint representation learning for top-n recommendation with heterogeneous information sources. CIKM. ACM (2017).
  • Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S Yu. 2017. Joint Deep Modeling of Users and Items Using Reviews for Recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 425–434.
  • Zheng et al. (2016) Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. 2016. A neural autoregressive approach to collaborative filtering. arXiv preprint arXiv:1605.09477 (2016).
  • Zhou et al. (2017) Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Xingya Dai, Han Zhu, Junqi Jin, Han Li, and Kun Gai. 2017. Deep Interest Network for Click-Through Rate Prediction. arXiv preprint arXiv:1706.06978 (2017).