Attribute-aware Collaborative Filtering: Survey and Classification

Abstract.

Attribute-aware CF models aim at rating prediction given not only the historical ratings from users to items, but also the information associated with users (e.g. age), items (e.g. price), or even ratings (e.g. rating time). This paper surveys works from the past decade on attribute-aware CF systems and finds that, mathematically, they can be classified into four categories. We provide readers not only with a high-level mathematical interpretation of the existing works in this area but also with the mathematical insight behind each category of models. Finally, we provide in-depth experimental results comparing the effectiveness of the major works in each category.

attribute-aware recommender systems, matrix factorization

1. Introduction

Collaborative filtering is arguably the most effective idea in building a recommender system. It assumes that a user’s preferences on items can be inferred collaboratively from other users’ preferences. In practice, users’ past records toward items, such as explicit ratings or implicit feedback (e.g. binary access records), are typically used to infer similarity of taste among users for recommendation. In the past decade, matrix factorization (MF) has become a widely adopted realization of collaborative filtering. Specifically, MF learns a latent representation vector for each user and each item, and computes their inner product as the predicted rating. The learned latent user/item factors are supposed to embed the specific information about the corresponding user/item. That is, two users with similar latent representations should have similar tastes for items with similar latent vectors.

In the big data era, classical MF using only ratings suffers from a serious drawback: it cannot exploit other accessible information such as the attributes of users/items/ratings. For instance, data could contain the location and time at which a user rated an item. These rating-relevant attributes, or contexts, could be useful in determining how much a user likes an item. The side information or attributes relevant to users or items (e.g. the demographic information of users or the item genres) can also reveal useful information. Such side information is particularly useful when the ratings of a user or an item are sparse, which is known as the cold-start problem for recommender systems. Therefore, researchers have formulated attribute-aware recommender systems (see Figure 1), which aim at leveraging not only the rating information but also the attributes associated with ratings/users/items to improve the quality of recommendation.

Researchers have proposed different methods to extend existing collaborative filtering models in recent years, such as factorization machines, probabilistic graphical models, kernel tricks and models based on deep neural networks. We notice that those papers can also be categorized according to what kinds of attributes are incorporated into the models. If attributes are relevant to users (e.g. age, gender, occupation) or items (e.g. expiration, price), then the class of recommender systems with side information (e.g., (Adams et al., 2010; Fang and Si, 2011; Guo, 2017; Kim and Choi, 2014; Lu et al., 2016; Ning and Karypis, 2012; Park et al., 2013; Porteous et al., 2010; Xu et al., 2013; Yu et al., 2017; Zhao et al., 2016; Feipeng Zhao, 2017; Zhou et al., 2012; Tengfei Zhou, 2017)) considers such attributes when predicting ratings. On the other hand, context-aware recommender systems (e.g., (Baltrunas et al., 2011; Chen et al., 2014; Hidasi and Tikk, 2012; Hidasi, 2015; Hidasi and Tikk, 2016; Karatzoglou et al., 2010; Li et al., 2010; Liu and Aberer, 2013; Liu and Wu, 2015; Nguyen et al., 2014; Rendle et al., 2011; Shi et al., 2012a, 2014; Shin et al., 2009)) enhance their predictions by considering the attributes appended to each rating (e.g. rating time, rating location). Other terms may be used interchangeably to indicate attributes, such as metadata (Kula, 2015), features (Chen et al., 2012), taxonomy (Koenigstein et al., 2011), entities (Yu et al., 2014), demographical data (Safoury and Salah, 2013), categories (Chen et al., 2016), contextual information (Weng et al., 2009), etc. The above setups all share the same mathematical representation; thus technically we do not distinguish them in this paper. That is, we regard any information associated with a user/item/rating as user/item/rating attributes, regardless of whether it is location, time, or demographical features. Therefore, a CF model that takes advantage of not only ratings but also the associated attributes is called an attribute-aware recommender in this paper.

Figure 1. Interpretation of inputs, including ratings and attributes, in attribute-aware collaborative filtering based recommender systems.
| Difference | Previous Works (Adomavicius and Tuzhilin, 2011; Verbert et al., 2012; Bobadilla et al., 2013; Shi et al., 2014) | Our Work |
|---|---|---|
| Attribute discussions | Categories and definitions of diversified attributes | Mathematical formulations of the most general attribute vectors |
| Model introduction | High-level summary of text descriptions | Mathematical interpretation of model design criteria |
| Comparison experiments | For memory-based models in (Bobadilla et al., 2013); no experiments in others | For seven model-based models on seven benchmark datasets |

Table 1. Presentation differences between previous works and our work.

Note that the attribute-aware recommender systems discussed in this paper are not equivalent to hybrid recommender systems. The former treat additional information as attributes, while the latter emphasize the combination of collaborative filtering based methods and content based methods. To be more precise, this survey covers only works that assume unstructured and independent attributes, in either binary or numerical format, for each user, item or rating. The reviewed models do not have prior knowledge of the dependency between attributes, such as the adjacent terms in a document or user relationships in a social network.

This survey covers more than one hundred papers in this area from the past decade. We found that the majority of the works propose an extension of matrix factorization to incorporate attribute information in collaborative filtering. The main contribution of this paper is not only to provide a review report, but also to offer a means of classifying these works into four categories: (I) discriminative matrix factorization, (II) generative matrix factorization, (III) generalized factorization, and (IV) heterogeneous graphs. Within each category, we provide the probabilistic interpretation of the models. The major distinction among these four categories lies in how the interactions of users, items and attributes are represented. Discriminative matrix factorization models extend traditional MF by treating the attributes as prior knowledge for learning the latent representations of users or items. Generative matrix factorization further considers the distributions of attributes and learns them jointly with the rating distributions. Generalized factorization models view the user/item identity simply as a kind of attribute, and various models are designed for learning low-dimensional representation vectors for rating prediction. The last category of models represents the users, items and attributes using a heterogeneous graph, on which a recommendation task can be cast as a link prediction task. In the following sections, we will elaborate on the general mathematical explanations of the four types of model designs and discuss the similarities and differences among models.

There have been four prior survey works (Adomavicius and Tuzhilin, 2011; Verbert et al., 2012; Bobadilla et al., 2013; Shi et al., 2014) introducing attribute-aware recommender systems. We claim three major differences between our work and the existing papers. First, previous surveys mainly focus on grouping different types of attributes and discussing the distinctions between memory-based and model-based collaborative filtering. In contrast, we are the first to classify the existing works based on the methodology proposed, instead of the type of data used. We further provide mathematical connections between different types of models so that readers can better understand the spirit behind the design of different models as well as their technical differences. Second, we are the first to provide thorough experimental results (7 different models on 8 benchmark datasets) comparing different types of attribute-aware recommendation systems. Note that (Bobadilla et al., 2013) is the only previous survey work with experimental results. However, it performed experiments to compare different similarity measures in collaborative filtering algorithms, instead of directly verifying the effectiveness of different attribute-aware recommender systems. Finally, we cover the latest works on attribute-aware recommender systems. We have realized that the existing survey papers do not include about forty papers after . In particular, in recent years several deep neural network based solutions have provided state-of-the-art performance for this task.

Table 1 shows the comparisons between our work and previous surveys.

We introduce basic ideas about recommender systems in Section 2, followed by formal analyses of attribute-aware recommender systems in Sections 3 and 4. A series of experiments is conducted in Section 5 to compare the accuracy and parameter sensitivity of six widely adopted models. Finally, Section 6 concludes this review and outlines some tasks to be done in the future.

2. Preliminaries

2.1. Problem Definition of Recommender Systems

Recommender systems act as skilled agents to assist users in conquering information overload while making selection decisions over items by providing customized recommendations. Users and items are general phrases denoting entities actively browsing and making choices and entities being selected such as goods and services, respectively.

Formally, recommender systems leverage one or more of three information sources to discover user preferences and generate recommendations: user-item interactions, side information, and contexts. User-item interactions, or ratings, are collected explicitly by prompting users to provide numerical feedback on items, or acquired implicitly by tracking user behaviors such as clicks, browsing time, or purchase history. This information is commonly represented as a matrix that encodes the preferences of users and is naturally sparse, since users normally interact with a limited fraction of items. Side information is rich information attached to an individual user or item that depicts user characteristics such as education and job, or item properties such as descriptions and product categories. Side information can span diverse structures with rich meaning, ranging from numerical status, texts, and images to videos, locations, or networks. On the other hand, contexts refer to all the information collected when a user interacts with an item, such as timestamps, locations, or textual reviews. Such contextual information usually serves as an additional information source appended to the user-item interaction matrix.

The goal of recommender systems is to uncover unknown user preferences over items that users have never interacted with and to recommend the most preferred items to them. In practice, recommender systems learn to generate recommendations based on three types of approaches: pointwise, pairwise, and listwise. The pointwise approach is the most common and requires recommender systems to provide accurate numerical predictions of observed ratings. Items that a user has never interacted with are then sorted by their rating predictions, and a number of items with the highest predicted ratings are recommended to the user. On the other hand, the pairwise approach seeks to preserve the ordering of any pair of items based on ratings, while in the listwise approach recommender systems aim to preserve the relative order of all rated items as a list for each user. The pairwise and listwise approaches are together considered as item ranking, which only requires recommender systems to output the ordering of items rather than ratings for individual items.
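The final step of the pointwise approach described above (sorting unseen items by predicted rating and returning the top ones) can be sketched as follows. This is a minimal illustration, not a method from any surveyed work; the function name `recommend_top_n` and the dictionary-based inputs are hypothetical choices made for this example.

```python
def recommend_top_n(predicted, interacted, n):
    """Return the n unseen items with the highest predicted rating.

    predicted:  dict mapping item id -> predicted rating for one user
                (assumed to come from any pointwise rating predictor)
    interacted: set of item ids the user has already interacted with
    """
    candidates = [(score, item) for item, score in predicted.items()
                  if item not in interacted]
    # Sort by descending score; ties are broken by item id for determinism.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in candidates[:n]]


if __name__ == "__main__":
    scores = {"a": 4.5, "b": 3.0, "c": 4.8, "d": 4.8}
    print(recommend_top_n(scores, interacted={"c"}, n=2))
```

Pairwise and listwise methods would replace the rating predictor with a scoring function trained on orderings, but the sorting step stays the same.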

The problem definition of recommender systems can be defined as follows: Given users, items, and information sources user-item ratings with known entries, side information of users , side information of items , contexts , and under the assumption that ratings an item preference relation for user , a recommender system is a function that outputs a permutation of items for each user with more preferred items in front:

(1)

such that

(2)

where function moves item from index to index in the list, with respect to user , and is its inverse function. Note that the dimension of side information attribute matrix might be zero denoting that there is no side information about users or items. Likewise, if there is no contextual information about user-item interactions, will be zero.

The core techniques or algorithms to realize recommender systems are generally classified into three categories: content-based filtering, collaborative filtering, and hybrid filtering (Bobadilla et al., 2013; Shi et al., 2014; Isinkaye et al., 2015). Content-based filtering generates recommendations based on properties of items and user-item interactions. Content-based techniques exploit domain knowledge and seek to transform item properties in raw attribute structures such as texts, images, or locations into numerical item profiles. Each item is represented as a vector, and the matrix of side information of items is constructed. A representation of each user is then created by aggregating the profiles of the items that this user interacted with, and a similarity measure is leveraged to retrieve a number of the most similar items as recommendations. Note that content-based filtering does not require information from any other user to make recommendations. Collaborative filtering strives to identify, for each user, a group of users with similar preferences based on past user-item interactions, and the items preferred by these users are recommended. Since discovering users with common preferences is generally based on user-item ratings, collaborative filtering becomes the first choice when item properties are inadequate for describing content, as with movies or songs. Hybrid filtering is the extension or combination of content-based and collaborative filtering. Examples are building an ensemble from both techniques, using the item rating history of collaborative filtering as part of the item profiles for content-based filtering, or extending collaborative filtering to incorporate user characteristics or item properties. This survey focuses on attribute-aware recommender systems, which exploit not only user-item interactions but also side information of users or items and contexts, and which are therefore a subset of hybrid filtering.
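The content-based pipeline described above (item profiles, an aggregated user profile, and a similarity measure) can be sketched as follows. The sketch assumes that item profiles are already numerical vectors, uses the mean as the aggregation and cosine similarity as the measure; both are illustrative choices, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def user_profile(item_profiles, liked_items):
    """Aggregate the profiles of the items a user interacted with by averaging."""
    dim = len(next(iter(item_profiles.values())))
    profile = [0.0] * dim
    for item in liked_items:
        for k, value in enumerate(item_profiles[item]):
            profile[k] += value / len(liked_items)
    return profile


def content_based_recommend(item_profiles, liked_items, n):
    """Return the n items most similar to the user's profile, excluding liked ones."""
    profile = user_profile(item_profiles, liked_items)
    scored = [(cosine(profile, vec), item)
              for item, vec in item_profiles.items() if item not in liked_items]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:n]]
```

Note that no other user's data enters the computation, which is the property that separates this approach from collaborative filtering.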

2.2. Collaborative Filtering and Matrix Factorization

Collaborative filtering (CF) has become the most prevalent technique to realize recommender systems in recent years (Adomavicius and Tuzhilin, 2005; Shi et al., 2014; Adomavicius and Tuzhilin, 2011; Isinkaye et al., 2015). It assumes that the preferences users exhibit towards interacted items can be generalized and used to infer their preferences towards items they have never interacted with, by leveraging the records of other users with similar preferences. This section briefly introduces conventional CF techniques that assume the availability of only user-item interactions, or the rating matrix. In practice, they are commonly categorized into memory-based CF and model-based CF (Shi et al., 2014; Isinkaye et al., 2015; Adomavicius and Tuzhilin, 2005).

Memory-based CF directly exploits rows or columns in the rating matrix as representations of users or items and identifies a group of similar users or items by a pre-defined similarity measure. Commonly used similarity metrics include the Pearson correlation, the Jaccard similarity coefficient, the cosine similarity, or their variants. Memory-based CF techniques can be divided into user-based or item-based approaches indicating that a technique tries to identify a group of either similar users or similar items. For user-based approaches, nearest neighbors — or the most similar users — are extracted, and their preferences or ratings towards a target item are aggregated into a rating prediction using similarities between users as weights. The rating prediction of user to item , , can be formulated as:

(3)

where function is a similarity measure, is the normalization constant and is the set of similar users to user (Shi et al., 2014). Rating predictions of item-based approaches can be formulated in a similar way. The calculated pairwise similarities between users or items act as the memory of the recommender system since they can be saved for generating later recommendations.
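As an illustration of the user-based scheme above, the following sketch predicts a rating as a similarity-weighted average over the nearest neighbors who rated the target item. It assumes cosine similarity computed on co-rated items and a normalization constant equal to the sum of absolute similarities; both are common choices but are assumptions here, and the function names are hypothetical.

```python
import math


def cosine_on_common(ratings_u, ratings_v):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u * norm_v > 0 else 0.0


def predict_user_based(ratings, user, item, k):
    """Predict `user`'s rating of `item` from the k most similar users who rated it.

    ratings: dict user -> dict item -> rating
    Returns None when no neighbor with non-zero similarity is available.
    """
    sims = [(cosine_on_common(ratings[user], ratings[v]), v)
            for v in ratings if v != user and item in ratings[v]]
    sims.sort(key=lambda pair: (-pair[0], pair[1]))
    neighbors = sims[:k]
    normalizer = sum(abs(s) for s, _ in neighbors)
    if normalizer == 0:
        return None
    return sum(s * ratings[v][item] for s, v in neighbors) / normalizer
```

The pairwise similarities computed by `cosine_on_common` are the part that can be cached, which is why they act as the "memory" of the system.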

Model-based CF, on the other hand, takes the rating matrix to train a predictive model with a set of parameters to make recommendations (Adomavicius and Tuzhilin, 2005; Shi et al., 2014). Predictive models can be formulated as a function that outputs ratings for rating prediction or numerical preference scores for item ranking, given a user-item pair:

(4)

Model-based CF then ranks and selects items with the highest ratings or scores as recommendations. Common core algorithms for model-based CF involve Bayesian classifiers, clustering techniques, graph-based approaches, genetic algorithms, and dimension reduction methods such as Singular Value Decomposition (SVD) (Bobadilla et al., 2013; Shi et al., 2014; Adomavicius and Tuzhilin, 2011; Isinkaye et al., 2015; Adomavicius and Tuzhilin, 2005). Over the last decade, a class of latent factor models, called matrix factorization, has been popularized and is commonly adopted as the basis of advanced techniques because of its success in the development of algorithms for the Netflix competition (Koren et al., 2009; Koren and Bell, 2011). In general, latent factor models aim to learn a low-dimensional representation, or latent factor, for each entity and combine latent factors of different entities using specific methods such as inner product, bilinear map, or neural networks to make predictions. As a member of latent factor models, matrix factorization for recommender systems characterizes each user and item by a low-dimensional vector and predicts ratings based on inner product.

Matrix factorization (MF) (Shi et al., 2014; Koren et al., 2009; Paterek, 2007; Koren and Bell, 2011), in the basic form, represents each user as a parameter vector and each item as , where is the dimension of latent factors. The prediction of user ’s rating or preference towards item , denoted as , can be computed using inner product:

(5)

which captures the interaction between them. MF seeks to generate rating predictions as close as possible to those recorded ratings. In matrix form, it can be written as finding such that where . MF is essentially learning a low-rank approximation of the rating matrix since the dimension of representations is usually much smaller than the number of users and items . To learn the latent factors of users and items, the system tries to find that minimize the regularized square error on the set of known ratings :

(6)

where and are regularization parameters. MF tends to cluster users or items with similar rating configurations into groups in the latent factor space, which implies that similar users or items will be close to each other. Furthermore, MF assumes that the rank of the rating matrix, or the dimension of the vector space generated by the rating configurations of users, is far smaller than the number of users. This implies that each user’s rating configuration can be obtained by a linear combination of ratings from a group of other users, since they are all generated by a common set of principal vectors. Thus MF embodies the spirit of collaborative filtering, which is to infer a user’s unknown ratings from the ratings of several other users.
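To make the optimization concrete, the following is a minimal sketch of learning MF by stochastic gradient descent on the regularized squared error over known ratings. SGD is only one common way to minimize this objective and is an assumption of this example; the names `train_mf` and `predict`, the shared regularization weight `reg`, and all hyperparameter values are hypothetical. The updates follow the gradient of half the squared error, so constant factors are absorbed into the learning rate.

```python
import random


def train_mf(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02,
             epochs=1000, seed=0):
    """Learn user factors W and item factors H from (user, item, rating) triples."""
    rng = random.Random(seed)
    # Small random initialization of the K-dimensional latent factors.
    W = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    H = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(W[u][f] * H[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                w, h = W[u][f], H[i][f]
                # Gradient step on 0.5 * err^2 + 0.5 * reg * (w^2 + h^2).
                W[u][f] += lr * (err * h - reg * w)
                H[i][f] += lr * (err * w - reg * h)
    return W, H


def predict(W, H, u, i):
    """Predicted rating: inner product of the user and item latent factors."""
    return sum(wf * hf for wf, hf in zip(W[u], H[i]))
```

Only observed entries contribute to the loss, which is what distinguishes this from a plain low-rank approximation of a fully observed matrix.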

Biased matrix factorization (Koren et al., 2009; Paterek, 2007; Koren and Bell, 2011), as an improvement of MF, models the characteristics of each user and each item and the global tendency that are independent of user-item interactions. The obvious drawback of MF is that only user-item interactions are considered in rating predictions. However, ratings usually contain universal shifts or exhibit systematic tendencies with respect to users and items. For instance, there might be a group of users inclined to give significantly higher ratings than others, or a group of items widely considered to be of high quality and receiving higher ratings. Besides, it is common that all ratings are non-negative, which implies that the overall average might not be close to zero and makes it difficult to train representations initialized with small values. To address the issues mentioned above, biased MF augments MF rating predictions with linear biases that account for user-related, item-related, and global effects. The rating prediction is extended as follows:

(7)

where are global bias, bias of user , and bias of item , respectively. Biased MF then finds the optimal that minimize the regularized square error as follows:

(8)

where denotes the squared Frobenius norm. The regularization parameter is tuned by cross-validation.
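For reference, the biased MF prediction and objective described above can be written in a commonly used notation. The symbols are assumptions made for this sketch and may differ from those of the original equations: $\mu$ is the global bias, $b_u$ and $c_i$ are the user and item biases, $\boldsymbol{w}_u$ and $\boldsymbol{h}_i$ are the latent factors, $\delta(\boldsymbol{R})$ is the set of known ratings, and a single $\lambda$ weights all penalty terms. Whether the global bias is also regularized varies between formulations; it is left unregularized here.

```latex
\begin{align}
\hat{r}_{ui} &= \mu + b_u + c_i + \boldsymbol{w}_u^{\top} \boldsymbol{h}_i, \\
\min_{\boldsymbol{W}, \boldsymbol{H}, \boldsymbol{b}, \boldsymbol{c}, \mu}\;
  & \sum_{(u,i) \in \delta(\boldsymbol{R})}
    \left( r_{ui} - \mu - b_u - c_i - \boldsymbol{w}_u^{\top} \boldsymbol{h}_i \right)^2
  + \lambda \left( \lVert \boldsymbol{W} \rVert_F^2 + \lVert \boldsymbol{H} \rVert_F^2
  + \lVert \boldsymbol{b} \rVert_2^2 + \lVert \boldsymbol{c} \rVert_2^2 \right).
\end{align}
```

Setting $\mu$, $\boldsymbol{b}$, and $\boldsymbol{c}$ to zero recovers the basic MF prediction and objective.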

(a) PMF
(b) Biased PMF
Figure 2. Graphical interpretation of Probabilistic Matrix Factorization (PMF). User or item latent factors are used to generate observed ratings . We can add bias terms to learn the latent shifts between and . Parameters control the certainty in the generation process.

Probabilistic matrix factorization (PMF, Figure 2) (Salakhutdinov and Mnih, 2007, 2008a) is a probabilistic linear model with observed Gaussian noise and can be viewed as a probabilistic extension of MF. PMF adopts the assumption that users and items are independent and represents each user or each item with a zero-mean spherical multivariate Gaussian distribution as follows:

(9)

where and are observed user-specific and item-specific noise. PMF then formulates the conditional probability over the observed ratings as

(10)

where is the set of known ratings and denotes the Gaussian distribution with mean and variance . PMF is learned by maximum a posteriori (MAP) estimation, which is equivalent to maximizing the log of the posterior distribution of :

(11)

where is a constant independent of all parameters and is the dimension of user or item representations. With Gaussian noise observed, maximizing the log-posterior is identical to minimizing the objective function of the form:

(12)

where . Note that (12) has exactly the same form as the regularized square error of MF and gradient descent or its extensions can then be applied in training PMF.
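The step from MAP estimation to a regularized squared error can be retraced as follows, under assumed notation: $\boldsymbol{w}_u, \boldsymbol{h}_i \in \mathbb{R}^K$ are the latent factors, $\sigma_W^2, \sigma_H^2$ are the prior variances, $\sigma_R^2$ is the rating noise variance, $N_u$ and $N_i$ are the numbers of users and items, and $\delta(\boldsymbol{R})$ is the set of known ratings. These symbols are chosen for this sketch and may differ from those of equations (9) to (12).

```latex
\begin{align}
p(\boldsymbol{W} \mid \sigma_W^2) &= \prod_{u} \mathcal{N}(\boldsymbol{w}_u \mid \boldsymbol{0}, \sigma_W^2 \boldsymbol{I}), \qquad
p(\boldsymbol{H} \mid \sigma_H^2) = \prod_{i} \mathcal{N}(\boldsymbol{h}_i \mid \boldsymbol{0}, \sigma_H^2 \boldsymbol{I}), \\
p(\boldsymbol{R} \mid \boldsymbol{W}, \boldsymbol{H}, \sigma_R^2) &=
  \prod_{(u,i) \in \delta(\boldsymbol{R})} \mathcal{N}(r_{ui} \mid \boldsymbol{w}_u^{\top} \boldsymbol{h}_i, \sigma_R^2), \\
\ln p(\boldsymbol{W}, \boldsymbol{H} \mid \boldsymbol{R}, \sigma_R^2, \sigma_W^2, \sigma_H^2) &=
  -\frac{1}{2\sigma_R^2} \sum_{(u,i) \in \delta(\boldsymbol{R})} (r_{ui} - \boldsymbol{w}_u^{\top} \boldsymbol{h}_i)^2
  -\frac{1}{2\sigma_W^2} \sum_{u} \lVert \boldsymbol{w}_u \rVert_2^2
  -\frac{1}{2\sigma_H^2} \sum_{i} \lVert \boldsymbol{h}_i \rVert_2^2 \nonumber \\
  &\quad -\frac{1}{2} \left( \lvert \delta(\boldsymbol{R}) \rvert \ln \sigma_R^2
  + N_u K \ln \sigma_W^2 + N_i K \ln \sigma_H^2 \right) + C.
\end{align}
```

With the variances held fixed, the last line is constant. Multiplying the remaining terms by $-\sigma_R^2$ turns maximization into minimization of

```latex
\begin{equation}
\frac{1}{2} \sum_{(u,i) \in \delta(\boldsymbol{R})} (r_{ui} - \boldsymbol{w}_u^{\top} \boldsymbol{h}_i)^2
+ \frac{\lambda_W}{2} \sum_{u} \lVert \boldsymbol{w}_u \rVert_2^2
+ \frac{\lambda_H}{2} \sum_{i} \lVert \boldsymbol{h}_i \rVert_2^2,
\qquad \lambda_W = \frac{\sigma_R^2}{\sigma_W^2}, \quad \lambda_H = \frac{\sigma_R^2}{\sigma_H^2},
\end{equation}
```

so a tighter prior (smaller $\sigma_W^2$ or $\sigma_H^2$) corresponds to stronger regularization.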

Since collaborative filtering techniques consider only the rating matrix in making recommendations, they cannot discover the preferences of users or items with scant user-item interactions. This problem is referred to as the cold-start issue. In Section 3, we will review recommendation systems that extend CF to incorporate contexts or rich side information regarding users and items to alleviate the cold-start problem.

3. Attribute-Aware Recommender Systems

3.1. Overview

Figure 3. Model design flow of attribute-aware collaborative filtering based recommender systems. When reading ratings and attributes for a proposed approach, we have to consider the sources and the types of attributes or ratings, which could affect the recommendation goals and the commonly used model designs. The evaluation of a proposed recommender system depends heavily on the chosen recommendation goals.

Attribute-aware recommendation models are proposed to tackle the challenges of integrating additional information from user/item/rating. There are two strategies to design attribute-aware collaborative filtering-based systems. One direction is to combine content-based recommendation models with CF models, which can directly accept attributes as content to perform recommendation. On the other hand, researchers also try to extend an existing collaborative filtering algorithm such that it leverages attribute information.

We will focus on four important factors in designing an attribute-aware recommender system in current works, as shown in Figure 3. They are discussed specifically in Sections 3.2 to 3.5. With respect to input data, attribute sources determine whether an attribute vector is relevant to users, items or ratings. For example, the attribute age describes a user instead of an item; rating time must be appended to ratings, representing when the rating event occurred. Different models impose distinct strategies to integrate attributes of specific sources. Additionally, a model may constrain the attribute types that can be used. For instance, graph-based collaborative filtering realizations define attributes as node types, which is not appropriate for numerical attributes. Rating types are another factor, emphasized by most model designers. Besides the usual numerical ratings, many recommendation models concentrate on binary rating data, where the ratings represent whether users interact with items. Finally, different recommender systems emphasize different recommendation goals. One is to predict the ratings from users to items by minimizing the error between the predicted and real ratings. Another is to produce a ranking among items given a user, without caring about the real rating value of a single item. We then give a table summarizing the design categories of all the surveyed papers in Section 3.6.

Throughout this paper, we will use to denote the attribute matrix, where each column represents a -dimensional attribute vector of entity . Here an entity can refer to a user, an item or a rating, as determined by the attribute sources (discussed in Section 3.2). If attributes are categorical with a limited number of values, then can be represented by one-hot encoding (discussed in Section 3.3). Note that our survey does not include models designed specifically for a certain type of attributes; rather, it covers models that are general enough to accept different types of attributes. For example, Collaborative Topic Regression (CTR) (Wang and Blei, 2011) extends matrix factorization with Latent Dirichlet Allocation (LDA) to import text attributes. Social Regularization (Ma et al., 2011a) specifically utilizes user social networks to regularize the learning of matrix factorization. Neither model is included since they are not general enough to deal with arbitrary attributes.
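As a small illustration of the one-hot encoding mentioned above, the sketch below turns one categorical attribute into numerical attribute vectors, one per entity. The function name `one_hot_columns` and the list-of-lists layout (one inner list per entity, corresponding to one column of the attribute matrix) are hypothetical choices for this example.

```python
def one_hot_columns(values):
    """Encode one categorical attribute as one-hot vectors.

    values: list with the category of each entity (user, item, or rating)
    Returns (categories, columns), where columns[n] is the one-hot attribute
    vector of entity n and categories gives the meaning of each dimension.
    """
    categories = sorted(set(values))
    index = {category: k for k, category in enumerate(categories)}
    columns = []
    for value in values:
        column = [0] * len(categories)
        column[index[value]] = 1
        columns.append(column)
    return categories, columns


if __name__ == "__main__":
    cats, cols = one_hot_columns(["student", "engineer", "student"])
    print(cats)
    print(cols)
```

Several categorical attributes can be handled by encoding each one separately and concatenating the resulting vectors per entity.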

3.2. Sources of Attributes

Attributes usually come from a variety of sources. Typically, side information refers to the attributes appended to users or items. In contrast, the keyword contexts indicates the attributes relevant to ratings. Ratings from the same user can be attached to different contexts, such as “locations where users rate items”. Recommendation models that consider rating-relevant attributes are usually called context-aware recommender systems. Although contexts in some papers could include user-relevant or item-relevant attributes, in this paper we aim to be precise and use the term contexts only for rating-relevant attributes.

Sections 3.2.1 and 3.2.2 introduce the different attribute sources, respectively. It is worth mentioning the following observation. Even though some of the models we surveyed demand side information while others require context information, we find that the two sets of attributes can be represented in a unified manner, and thus both types of models can be applied. We will discuss such a unified representation in Sections 3.2.3 and 3.2.4.

3.2.1. Side Information: User-relevant or Item-relevant Attributes

In the surveyed papers, side information could refer to user-relevant attributes, item-relevant attributes or both. User-relevant attributes determine the characteristics of a user, such as “age”, “gender”, “education”, etc. In contrast, item-relevant attributes describe the properties of an item, like “movie running time”, “product expiration date”, etc. Below we discuss user-relevant attributes, but all the statements can be applied to item-relevant attributes. Given user-relevant attributes, we can express them with matrix where is the number of users. Each column of corresponds to the attribute values of a specific user. The most important characteristic of user-relevant attributes is that they are assumed to remain unchanged throughout the rating process of a user. For example, all ratings from the same user share the identical user-relevant attribute “age”. In other words, even without any of a user’s ratings in collaborative filtering, the user’s rating behaviors on items could still be inferred from other users that have similar user-relevant attribute values. Attribute-aware recommender systems that address the cold-start user problem (i.e., there are few ratings of a user) typically adopt user-relevant attributes as their auxiliary information under collaborative filtering. Methods for leveraging attributes are presented in Section 4.

Readers may ask why we do not distinguish user-relevant attributes from item-relevant attributes. From our observations during the survey, most of the recommendation approaches have symmetric model designs for users and items. In matrix factorization-based methods, rating matrix is factorized into two matrices and , respectively referring to user and item latent factors. However, matrix factorization does not change its learning results if we exchange the rows and columns of . After the exchange of rows and columns, and simply exchange what they learn from ratings: for items but for users.

Following the above conclusions, some of the related work could be further extended, in our opinion. If an attribute-aware recommender system claims to be designed only for user-relevant attributes, then readers could add a symmetric model design for item-relevant attributes to obtain a more general model.

3.2.2. Contexts: Rating-relevant Attributes

Collaborative filtering-based recommender systems usually define ratings as the interaction between users and items, though there is likely to be more than one type of interaction. Since ratings are still the focus of recommender systems, other types of interactions, or rating-relevant attributes, are called contexts in related work. For example, the “time” and the “location” at which a user rates an item are recorded when the rating behavior occurs. Rating-relevant attributes change with rating behaviors, and thus they could offer auxiliary data about why a user decides to give a rating to an item. Moreover, rating-relevant attributes could capture changes in a user’s rating preferences. If we have time information appended to ratings, then attribute-aware recommender systems could discover users’ preferences at different times.

The format of rating-relevant attributes is potentially more flexible than that of user-relevant or item-relevant ones. In Section 4.3, we will introduce a factorization-based generalization of matrix factorization. In this class of attribute-aware recommender systems, even the user and item latent factors are not required to predict ratings; mere rating-relevant attributes can do it using their corresponding latent factor vectors.

3.2.3. Converting Side Information to Contexts

Most attribute-aware recommender systems choose to leverage one of the attribute sources. Some proposed approaches specifically incorporate user-relevant or item-relevant attributes, while others are designed for rating-relevant attributes only. It may seem that existing works should be applied according to which attribute sources they use. However, we argue that the usage of attribute-aware recommender systems can be independent of attribute sources if we convert the sources to each other in a simple way.

Let $\mathbf{X}$ be the user-relevant attribute matrix, where each column $\mathbf{x}_i$ is the attribute set of user $i$. Similarly, let $\mathbf{Y}$ and $\mathbf{Z}$ be respectively the matrices of item-relevant attributes and rating-relevant attributes. Note that a column index of matrix $\mathbf{Z}$ is denoted by $\pi(i,j)$, which is associated with user $i$ and item $j$. To express $\mathbf{X}$ or $\mathbf{Y}$ as $\mathbf{Z}$, a simple concatenation with respect to users and items can achieve the goal, as shown below:

$$\mathbf{z}'_{\pi(i,j)} = \begin{bmatrix} \mathbf{z}_{\pi(i,j)} \\ \mathbf{x}_i \\ \mathbf{y}_j \end{bmatrix} \tag{13}$$

(13) implies that we just extend the current rating-relevant attributes $\mathbf{z}_{\pi(i,j)}$ to $\mathbf{z}'_{\pi(i,j)}$, using the attributes of the corresponding user and item. If the training data do not contain $\mathbf{X}$ or $\mathbf{Y}$, we can eliminate the corresponding notation on the right-hand side of (13). Advanced attribute selection or dimensionality reduction methods could extract the effective dimensions of $\mathbf{z}'_{\pi(i,j)}$, but such further improvement is beyond our scope. If missing attribute values exist in $\mathbf{z}'_{\pi(i,j)}$, then we suggest directly filling in zeros for these attributes. Please refer to Section 3.2.4 for our reasons.
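The concatenation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionaries `user_attrs`, `item_attrs` and `rating_attrs`, their toy values, and the function name `extend_context` are hypothetical and do not come from any surveyed model.

```python
# Hypothetical toy data: attribute vectors stored as plain Python lists.
user_attrs = {0: [25.0, 1.0], 1: [40.0, 0.0]}   # user-relevant, e.g. [age, is_student]
item_attrs = {0: [9.99], 1: [19.5]}             # item-relevant, e.g. [price]
rating_attrs = {(0, 1): [0.3], (1, 0): [0.8]}   # rating-relevant, keyed by (user, item)


def extend_context(rating_attrs, user_attrs, item_attrs):
    """Append the user's and the item's attributes to each rating's context vector."""
    extended = {}
    for (i, j), z in rating_attrs.items():
        # Concatenate in the order [rating-relevant, user-relevant, item-relevant].
        extended[(i, j)] = list(z) + list(user_attrs[i]) + list(item_attrs[j])
    return extended


extended = extend_context(rating_attrs, user_attrs, item_attrs)
print(extended[(0, 1)])  # [0.3, 25.0, 1.0, 19.5]
```

Every extended context vector has the same length, so the result can be passed to any model that expects rating-relevant attributes only.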

Converting Contexts to Side Information

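Following the topic in Section 3.2.3, readers may be curious about how to convert rating-relevant attributes back into user-relevant or item-relevant ones. In the following paragraphs, we adopt the same notation as in Section 3.2.3. Owing to the symmetric designs for user-relevant attributes $\mathbf{X}$ and item-relevant attributes $\mathbf{Y}$, we demonstrate only the conversion from rating-relevant attributes $\mathbf{Z}$ to $\mathbf{X}$. Concatenation is still the simplest way to express $\mathbf{Z}$ as one part of $\mathbf{X}$:

$$\mathbf{x}'_i = \begin{bmatrix} \mathbf{x}_i \\ \mathbf{z}_{\pi(i,1)} \\ \mathbf{z}_{\pi(i,2)} \\ \vdots \\ \mathbf{z}_{\pi(i,N)} \end{bmatrix} \tag{14}$$

where $N$ is the number of items. All the rating-relevant attributes $\mathbf{z}_{\pi(i,j)}$ from items $j = 1, \dots, N$ must be associated with user $i$. $\mathbf{x}_i$ is thus extended to $\mathbf{x}'_i$ by appending these attributes. Note that a large number of attributes on the right-hand side of (14) are missing, since most items were never rated by user $i$ in real-world data. Eliminating the missing $\mathbf{z}_{\pi(i,j)}$, as we do in Section 3.2.3, results in different dimensions for the attribute vectors $\mathbf{x}'_i$ and $\mathbf{x}'_{i'}$ of two users. To our knowledge, no user-relevant attribute-aware recommender system allows user-relevant attributes of varying dimensions.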

Readers can run attribute imputation approaches to remove the missing values. However, in our opinion, simply filling zeros in the missing elements could be satisfactory for attribute-aware recommender systems. We explain our reasons using the observations in Section 3.3. For numerical attributes, (15), (16) and (17) show the various attribute modeling methods. If attributes are mapped through a function as in (15) or (17), then zero-valued attributes cause no mapping effect (except for a constant intercept of the function). If attributes are fitted by a function of latent factors as in (16), then the objective design can typically skip the terms of the missing attributes. As for categorical attributes, we exploit one-hot encoding to represent them with numerical values, after which they can be handled as numerical attributes.
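The zero-filling idea can be illustrated with a short Python sketch. The function name `contexts_to_user_attrs`, its arguments, and the toy data are assumptions made for illustration; the sketch only shows that zero-filling gives every user an extended attribute vector of the same length.

```python
def contexts_to_user_attrs(user_attrs, rating_attrs, num_items, context_dim):
    """Append the contexts of all items to each user's attributes, zero-filling unrated items."""
    extended = {}
    for i, x in user_attrs.items():
        row = list(x)
        for j in range(num_items):
            # An unrated (user, item) pair has no context, so fill it with zeros.
            row += rating_attrs.get((i, j), [0.0] * context_dim)
        extended[i] = row
    return extended


user_attrs = {0: [25.0], 1: [40.0]}                       # hypothetical user attributes
rating_attrs = {(0, 1): [0.3, 1.0], (1, 0): [0.8, 0.0]}   # hypothetical contexts
extended = contexts_to_user_attrs(user_attrs, rating_attrs, num_items=2, context_dim=2)
print(extended[0])  # [25.0, 0.0, 0.0, 0.3, 1.0]
print(extended[1])  # [40.0, 0.8, 0.0, 0.0, 0.0]
```

Note that the extended vector grows linearly with the number of items, which is one practical reason to consider attribute selection or dimensionality reduction afterwards.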

3.3. Attribute Types

In most cases, attribute-aware recommender systems accept a real-valued attribute matrix. However, we notice that some attribute-aware recommender systems require attributes to be categorical, which are typically represented by a binary encoding. Specifically, these approaches demand a binary attribute matrix, in which attributes of value $1$ can be modeled as discrete latent information in some way. Both types of attributes are summarized in Sections 3.3.1 and 3.3.2.

It is trivial to put one-hot categorical attributes into numerical attribute-aware recommender systems, since binary values $\{0, 1\} \subset \mathbb{R}$. Nonetheless, putting numerical attributes into categorical attribute-aware recommendation approaches risks losing attribute information (e.g., through quantization).

Numerical Attributes

In this paper, numerical attributes refer to the set of real-valued attributes, i.e., a real-valued attribute matrix. We also classify integer attributes (such as movie ratings) as numerical attributes. Most of the relevant papers take numerical attributes as the default inputs of their recommender systems, as common machine learning approaches do.

There are three common model designs for numerical attributes to affect recommender systems. First, we can map to latent factor space by function with parameters , and then fit the corresponding user or item latent factor vectors:

(15)

Second, like the reverse of (15), we define a mapping function such that mapped values from user or item latent factors can be close to observed attributes:

(16)

Finally, numerical attributes can be put into function that is independent of existing user or item latent factors in matrix factorization:

(17)

(15) and (16) are typically seen in user-relevant or item-relevant attributes, while rating-relevant attributes are often put into (17)-like formats. However we emphasize that attribute-aware recommender systems are not restricted to these three model designs.
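To make the three designs more concrete, the following Python sketch instantiates each of them with a linear function. The linear forms, the squared-error penalties, and all names (`A`, `B`, `theta`, and the function names) are assumptions chosen for illustration; the surveyed models use a variety of mapping functions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def attribute_to_factor_penalty(w, x, A):
    """First design: the latent vector w should stay close to the mapped attributes A x."""
    return squared_distance(w, matvec(A, x))


def factor_to_attribute_penalty(w, x, B):
    """Second design: the mapped latent vector B w should reproduce the attributes x."""
    return squared_distance(x, matvec(B, w))


def predict_with_attributes(w, h, z, theta):
    """Third design: an attribute term is added independently of the latent factors."""
    return dot(w, h) + dot(theta, z)


w, h = [1.0, 2.0], [0.5, 0.5]               # hypothetical user and item latent factors
x = [1.0, 0.0, 1.0]                         # hypothetical user attributes
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]      # attribute space -> latent space
B = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]]    # latent space -> attribute space
z, theta = [1.0], [0.2]                     # rating-relevant attribute and its weight

print(attribute_to_factor_penalty(w, x, A))  # 1.0
print(factor_to_attribute_penalty(w, x, B))  # 4.0
```

In the first two designs the penalty would be added to the training objective, whereas in the third design the attribute term changes the rating estimate itself.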

Categorical Attributes

The values of a numerical attribute are ordered, though the values of a categorical attribute show no ordered relations of each other. Given a categorical attribute , the meanings of the attribute values do not imply which one is larger than the other. Thus, it is improper to give categorical attributes ordered dummy variables, like that could incorrectly imply , which makes machine learning models misunderstand attribute information. The most common solution to categorical attribute transformation is one-hot encoding. We generate -dimensional binary attributes that correspond to the values of a categorical attribute. Each of the binary attributes indicate the current value of a categorical attribute. For example, we express attribute . They are corresponding to the original values . Since a categorical attribute exactly equals to one value, the mapped binary attributes contain only a and others . Once all the categorical attributes are converted to one-hot encoding expressions, we are allowed to apply them to existing numerical attribute-aware recommender systems.
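One-hot encoding can be sketched as follows. The attribute values in `GENRES` are a hypothetical example chosen for illustration, not the example used in the text.

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


GENRES = ["action", "comedy", "drama"]  # hypothetical values of one categorical attribute

print(one_hot("comedy", GENRES))  # [0, 1, 0]
```

Because the three positions are separate binary attributes, no ordering among the original values is implied.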

However certain relevant papers are suitable for, or even limited to, categorical attributes. Heterogeneous graph-based methods (Section 4.4) add new nodes (e.g., three nodes named ) to represent the values of categorical attributes. Following the latent factor ideas in matrix factorization, some methods propose to assign each categorical attribute value a low-dimensional latent factor vector (e.g., each of has a latent factor vector ). Then these vectors are jointly learned with classical user or item latent factors in attribute-aware recommender systems.

3.4. Rating Types

Although this paper always uses the term ratings for the interactions between users and items, some existing works distinguish explicit opinions from implicit feedback. Taking the MovieLens dataset as an example, a user gives an item a rating value on a fixed numerical scale. The value denotes the explicit opinion, which quantifies the user’s preference for that item. How recommendation methods handle this type of rating is introduced in Section 3.4.1.

Even though modeling explicit opinions is more beneficial for future recommendation, such data are more difficult to gather from users. Users may hesitate to reveal their preferences due to privacy concerns, or they may be unwilling to spend time labeling explicit ratings. Instead, recommender system developers are more likely to collect implicit feedback, such as user browsing logs. Such datasets record a series of binary values, each of which indicates whether a user ever saw an item. Models of implicit feedback assume that a user prefers every item he or she has seen to the items never seen. We discuss this type of rating in depth in Section 3.4.2.

There exist controversial numerical rating data, like “the number of times that a user clicked the hyperlink toward the page of an item”. Some related work may define such data as implicit feedback, because the number of clicks is not equivalent to an explicit user preference. However, in this paper we still identify them as explicit opinions: with respect to model designs, related recommendation approaches make no distinction between such data and explicit opinions.

Explicit Opinions: Numerical Ratings

A numerical rating matrix expresses users’ opinions on items. In practice, numerical ratings in real-world scenarios are often represented by positive integers, as in MovieLens. Although related work seldom states it explicitly, we typically suppose that a higher rating implies a more positive opinion.

Since the gathered rating values in most datasets are positive, learning could become biased. Matrix factorization cannot learn the rating bias caused by the non-zero mean of ratings. Specifically, vanilla matrix factorization has regularization terms $\|\mathbf{W}\|_F^2$ and $\|\mathbf{H}\|_F^2$ for the user and item latent factor matrices $\mathbf{W}, \mathbf{H}$. That is, from the viewpoint of the corresponding zero-mean normal distributions, we require the expected values $\mathbb{E}[\mathbf{W}] = \mathbb{E}[\mathbf{H}] = \mathbf{0}$. Given rating $r_{ij}$ of user $i$ to item $j$, and assuming the independence of $\mathbf{W}$ and $\mathbf{H}$ as probabilistic matrix factorization does, we obtain the expected value of the rating estimate $\mathbb{E}[\hat{r}_{ij}] = \mathbb{E}[\mathbf{w}_i]^\top \mathbb{E}[\mathbf{h}_j] = 0$, which cannot closely fit the true ratings if $\mathbb{E}[r_{ij}] \neq 0$. Biased matrix factorization can alleviate the problem by absorbing the non-zero mean with additional bias terms. Alternatively, we can normalize all the ratings (subtract the rating mean from every rating) to make the matrix factorization prediction unbiased. Real-world numerical ratings also have finite maximum and minimum values. Some recommendation models choose to normalize the ratings to the range $[0, 1]$, and then constrain the range of the rating estimate using the sigmoid function $\sigma(x) = 1 / (1 + \exp(-x))$.
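The two normalization options can be sketched in Python as follows. The rating scale `R_MIN`, `R_MAX`, the toy ratings, and the helper names are assumptions for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


R_MIN, R_MAX = 1.0, 5.0          # assumed rating scale, for illustration only
ratings = [4.0, 5.0, 3.0, 4.0]   # hypothetical observed ratings

# Option 1: subtract the global mean so that the training targets are zero-mean.
mean = sum(ratings) / len(ratings)
centered = [r - mean for r in ratings]


# Option 2: rescale ratings to [0, 1] and squash the raw estimate with a sigmoid.
def to_unit(r):
    return (r - R_MIN) / (R_MAX - R_MIN)


def from_unit(p):
    return R_MIN + p * (R_MAX - R_MIN)


raw_estimate = 0.0  # the expected value of an inner product under zero-mean priors
print(centered)                          # [0.0, 1.0, -1.0, 0.0]
print(from_unit(sigmoid(raw_estimate)))  # 3.0
```

With the sigmoid mapping, a raw estimate of zero corresponds to the middle of the rating scale, and every prediction stays within the valid range.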

Implicit feedback: Binary Ratings

Today, more and more research is interested in the scenario of binary ratings (i.e., implicit feedback), since such rating data are more accessible, like “whether a user browsed the information about an item”. Online services do not have to require users to give explicit numerical ratings, which are usually gathered in smaller quantities than binary ones.

Nevertheless, we observe only positive ratings ($r_{ij} = 1$); negative ratings ($r_{ij} = 0$) do not exist in training data. Taking browsing logs as an example, the data collect the items that are browsed by a user (i.e., positive examples). An item absent from the browsing data could be either absolutely unattractive or just unknown to the user. One-class collaborative filtering methods are proposed to address the problem. Such methods often make two assumptions:

  • An item must be attractive to a user ($r_{ij} = 1$), as long as the user ever saw the item.

  • Since we cannot distinguish the two reasons (absolutely unattractive or just unknown) why an item is unseen, such methods suppose that all the unseen items are less attractive ($r_{ij} = 0$). However, the number of unseen items is in practice much larger than that of seen items. To alleviate both the learning bias toward unseen items and the slow learning speed, we exploit negative sampling, which sub-samples a part of the unseen ratings for training.
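Negative sampling for a single user can be sketched as below. Uniform sampling without replacement is an assumption made here for simplicity; the function name `sample_negatives` and the toy data are likewise hypothetical, and published methods may use other sampling distributions.

```python
import random


def sample_negatives(seen_items, num_items, num_samples, rng):
    """Draw unseen items uniformly at random to serve as assumed negative examples."""
    unseen = [j for j in range(num_items) if j not in seen_items]
    return rng.sample(unseen, min(num_samples, len(unseen)))


rng = random.Random(0)   # fixed seed so that the sketch is reproducible
seen = {1, 4}            # hypothetical items browsed by one user
negatives = sample_negatives(seen, num_items=10, num_samples=3, rng=rng)

# Training instances for this user: seen items get label 1, sampled unseen items get label 0.
training = [(j, 1) for j in sorted(seen)] + [(j, 0) for j in negatives]
print(training)
```

Only three of the eight unseen items enter the training set here, which keeps the numbers of positive and negative examples comparable.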

To build an objective function satisfying the above assumptions, we can choose either pointwise learning (Section 3.5.1) or pairwise learning (Section 3.5.2). Area Under ROC Curve (AUC), Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), precision and recall are often used to evaluate the quality of recommender systems for binary ratings.

3.5. Recommendation Goals

Any recommender system needs its developers to specify a training goal of recommendation. Since collaborative filtering-based recommender systems rely on ratings, the most straightforward goal is to infer what rating a user will give to an unseen item, named rating prediction. If the ratings of every item can be accurately predicted, then for any user, a recommender system just sorts the predicted ratings and recommends the items with the highest predicted ratings. In machine learning, such a goal for model-based recommender systems can be described as pointwise learning. That is, given a pair of user and item, a pointwise learning recommendation model directly minimizes the error between predicted ratings and true ones. The related mathematical details are given in Section 3.5.1.

In general, however, our ultimate goal is to recommend unseen items to users, regardless of how these items are rated. All unseen items in pointwise learning are eventually ranked in descending order of their predicted ratings. In other words, what we truly care about is the order of ratings, not the true rating values. Also, some research papers point out that a low error of rating prediction is not always equivalent to a high quality of recommended item lists. Recent model-based collaborative filtering models have therefore begun to set item ranking as the optimization goal. That is, for the same user, such models maximize the differences between high-rated items and low-rated ones in training data. Implementations of item ranking include pairwise learning and listwise learning from the machine learning domain. Both learning ideas compare the relative ranks of at least two items for the same user. Section 3.5.2 presents how to define optimization criteria for item ranking.

Rating Prediction: Pointwise Learning

In the training stage, given a ground-truth rating $r_{ij}$, a recommender system needs to make a rating estimate $\hat{r}_{ij}$ that is expected to predict $r_{ij}$. Model-based collaborative filtering methods (e.g., matrix factorization) build an objective function to be optimized (either maximized or minimized) for recommendation goals. For numerical ratings (Section 3.4.1) of users $i$ to items $j$, we can minimize the error between the ground truth and the estimate as follows:

$$\arg\min \sum_{(i,j) \in \delta(\mathbf{R})} (\hat{r}_{ij} - r_{ij})^2 \tag{18}$$

$\delta(\mathbf{R})$ is the set of training ratings, which are the non-missing entries in rating matrix $\mathbf{R}$. As Section 3.4.1 mentioned, if ground-truth ratings are normalized to $[0, 1]$ in data pre-processing, then in (18) we can apply the sigmoid function $\sigma$ to rating estimate $\hat{r}_{ij}$, so that $\sigma(\hat{r}_{ij})$ could better fit $r_{ij}$. From a probabilistic viewpoint, (18) is equivalent to maximizing the normal likelihood:

$$\arg\max \prod_{(i,j) \in \delta(\mathbf{R})} \mathcal{N}\left(r_{ij} \mid \hat{r}_{ij}, \sigma_R^2\right) \tag{19}$$

where $\mathcal{N}(\cdot \mid \mu, \sigma_R^2)$ means the probability density function of a normal distribution with mean $\mu$ and variance $\sigma_R^2$, the latter being a predefined uncertainty between $r_{ij}$ and $\hat{r}_{ij}$. Taking $-\log$ on (19) yields (18). Evidently both (18) and (19) let the rating prediction problem be addressed by regression models over the ratings.
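The step from the normal likelihood to the squared error can be spelled out as a short derivation. The symbol names are assumed here: $r_{ij}$ is the ground-truth rating, $\hat{r}_{ij}$ the rating estimate, $\delta(\mathbf{R})$ the set of training ratings, and $\sigma_R^2$ the fixed variance.

```latex
\begin{align*}
-\log \prod_{(i,j) \in \delta(\mathbf{R})} \mathcal{N}\!\left(r_{ij} \mid \hat{r}_{ij}, \sigma_R^2\right)
&= \sum_{(i,j) \in \delta(\mathbf{R})} \left[ \frac{(r_{ij} - \hat{r}_{ij})^2}{2\sigma_R^2}
   + \frac{1}{2} \log\left(2\pi\sigma_R^2\right) \right] \\
&= \frac{1}{2\sigma_R^2} \sum_{(i,j) \in \delta(\mathbf{R})} (r_{ij} - \hat{r}_{ij})^2 + \text{const}.
\end{align*}
```

Since $\sigma_R^2$ is predefined, the scaling factor and the constant do not change the minimizer, so maximizing the likelihood and minimizing the sum of squared errors select the same estimates.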

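For binary ratings (Section 3.4.2), besides (18) with the sigmoid function, such data can be modeled as a binary classification problem. Specifically, we model ratings $r_{ij} = 1$ as the positive set and $r_{ij} = 0$ as the negative set. Then logistic regression (or a Bernoulli likelihood) over the training ratings $\delta(\mathbf{R})$ is built for rating prediction:

$$\arg\max \prod_{(i,j) \in \delta(\mathbf{R})} \sigma(\hat{r}_{ij})^{r_{ij}} \left(1 - \sigma(\hat{r}_{ij})\right)^{1 - r_{ij}} \tag{20}$$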

The optimization of (18) and (19) corresponds to an evaluation metric, Root Mean Squared Error (RMSE), whose formal definition is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\delta(\mathbf{R})|} \sum_{(i,j) \in \delta(\mathbf{R})} (\hat{r}_{ij} - r_{ij})^2} \tag{21}$$

For the convenience of optimization, the regression models drop the root function from RMSE, i.e., they in fact optimize the Mean Squared Error (MSE). Since the root function is monotonically increasing, minimizing MSE is equivalent to minimizing RMSE (21).

Even when a recommender system chooses to optimize (20), the binary classification still corresponds to minimizing RMSE, except that rating estimate $\hat{r}_{ij}$ is replaced with its sigmoid-applied version $\sigma(\hat{r}_{ij})$. Observing the maximization of (20), we reach a conclusion: $\sigma(\hat{r}_{ij}) \to 1$ when $r_{ij} = 1$, and $\sigma(\hat{r}_{ij}) \to 0$ when $r_{ij} = 0$. In other words, (20) tries to minimize the error between $\sigma(\hat{r}_{ij})$ and $r_{ij}$, which is the same optimization goal as RMSE (21).
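The relation between the two pointwise objectives can be checked numerically with a small Python sketch. The function names and toy values are assumptions for illustration; `bernoulli_nll` is the negative logarithm of a Bernoulli likelihood over sigmoid-transformed scores.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rmse(truths, estimates):
    """Root mean squared error between ground-truth values and estimates."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truths, estimates)) / len(truths))


def bernoulli_nll(labels, scores):
    """Negative log-likelihood of binary labels under sigmoid-transformed scores."""
    total = 0.0
    for y, s in zip(labels, scores):
        p = sigmoid(s)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


labels = [1, 0]              # hypothetical binary ratings
good_scores = [3.0, -3.0]    # raw estimates that agree with the labels
flat_scores = [0.0, 0.0]     # uninformative raw estimates

for scores in (good_scores, flat_scores):
    probs = [sigmoid(s) for s in scores]
    print(bernoulli_nll(labels, scores), rmse(labels, probs))
```

In this toy case the scores with the lower negative log-likelihood also give the lower RMSE between the sigmoid outputs and the labels.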

Item Ranking: Pairwise Learning and Listwise Learning

This class of recommendation goals requires a model to correctly rank two items in the training data, even though the model could inaccurately predict the value of a single rating. Since recommender systems care more about item ranking for the same user than about ranking across different users, existing works sample item pairs $(j, k)$ where $r_{ij} > r_{ik}$ given a fixed user $i$ (i.e., item $j$ is ranked higher than item $k$ for user $i$), and then let the rating estimate pair $(\hat{r}_{ij}, \hat{r}_{ik})$ learn to rank the two items with $\hat{r}_{ij} > \hat{r}_{ik}$. In particular, we can use the sigmoid function $\sigma$ to model the probabilities in the pairwise comparison likelihood over the set $D$ of sampled triples $(i, j, k)$:

$$\arg\max \prod_{(i,j,k) \in D} \sigma(\hat{r}_{ij} - \hat{r}_{ik}) \tag{22}$$

Taking $-\log$ on objective function (22) yields the log-loss function. Bayesian Personalized Ranking (BPR) (Rendle et al., 2009) first investigates the usage and the optimization of (22) for recommender systems. BPR shows that (22) maximizes a differentiable smooth approximation of the evaluation metric Area Under ROC Curve (AUC), one of whose definitions is:

$$\mathrm{AUC} = \frac{1}{|D|} \sum_{(i,j,k) \in D} \mathbb{1}(\hat{r}_{ij} - \hat{r}_{ik} > 0) \tag{23}$$

where $|D|$ is the number of training instances $(i, j, k)$, and $\mathbb{1}(c)$ denotes an indicator function whose output is $1$ if and only if condition $c$ is judged true. We show the connection between (22) and (23) below:

(24)

Under the condition of , we make non-differentiable indicator function be approximated by differentiable sigmoid function . The maximization of (24) is equivalent to optimizing (22) because the logarithm function is monotonically increasing. AUC evaluates whether all the predicted item pairs follow the ground-truth rating comparisons in the whole item list. By our observation, most of the reviewed approaches based on item ranking build their objective functions on AUC optimization. There are other choices of optimization functions that approximately maximize AUC, like the hinge loss:

(25)
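The pairwise quantities discussed above can be compared on a toy example in Python. Each entry of `diffs` stands for the difference between the estimated rating of the preferred item and that of the other item for one sampled pair. The function names, the toy values, and the hinge margin of 1 are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_auc(diffs):
    """Fraction of sampled pairs whose preferred item receives the higher estimate."""
    return sum(1 for d in diffs if d > 0) / len(diffs)


def smoothed_auc(diffs):
    """Differentiable surrogate: the indicator is replaced by the sigmoid."""
    return sum(sigmoid(d) for d in diffs) / len(diffs)


def bpr_loss(diffs):
    """Negative log of the pairwise comparison likelihood."""
    return -sum(math.log(sigmoid(d)) for d in diffs)


def hinge_loss(diffs, margin=1.0):
    """Hinge surrogate with an assumed margin; pairs beyond the margin cost nothing."""
    return sum(max(0.0, margin - d) for d in diffs)


diffs = [2.0, 0.5, -1.0]   # hypothetical estimate differences for three pairs
print(pairwise_auc(diffs))  # two of the three pairs are ordered correctly
print(smoothed_auc(diffs), bpr_loss(diffs), hinge_loss(diffs))
```

All three surrogates improve when the misordered pair (the negative difference) is corrected, which is the behavior an AUC-oriented objective is meant to encourage.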

In the domain of top-$k$ recommendation, the item order outside the top-$k$ ranks is unimportant for recommender systems. Maximizing AUC could fail to recommend items well, since AUC gives the same penalty to all item positions. That is, a recommender system could gain high AUC when it accurately ranks the bottom-$k$ items, but this is not beneficial for real-world recommendation since a user pays attention only to the top-$k$ items. Listwise evaluation metrics like Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG) or Mean Average Precision (MAP) are proposed to give different penalty values to different item ranking positions. There have been works that optimize differentiable versions of the above metrics, such as CliMF (Shi et al., 2012b), SoftRank (Taylor et al., 2008) and TFMAP (Shi et al., 2012a).
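The difference between AUC and position-sensitive metrics can be seen on two toy ranked lists. In this Python sketch each list holds binary relevance labels in ranked order (best rank first); the lists and function names are hypothetical, and AUC is computed per list as the fraction of correctly ordered (relevant, irrelevant) pairs.

```python
import math


def auc(ranked):
    """Fraction of (relevant, irrelevant) pairs in which the relevant item ranks higher."""
    pairs = correct = 0
    for a, rel_a in enumerate(ranked):
        for b, rel_b in enumerate(ranked):
            if rel_a == 1 and rel_b == 0:
                pairs += 1
                if a < b:
                    correct += 1
    return correct / pairs


def ndcg_at_k(ranked, k):
    """Normalized discounted cumulative gain over the top-k positions."""
    def dcg(labels):
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ranked):
    """Inverse of the rank of the first relevant item, or 0 if none is relevant."""
    for pos, rel in enumerate(ranked):
        if rel == 1:
            return 1.0 / (pos + 1)
    return 0.0


list_a = [1, 0, 0, 0, 0, 1]   # one relevant item at the very top, one at the bottom
list_b = [0, 0, 1, 1, 0, 0]   # both relevant items in the middle

print(auc(list_a), auc(list_b))                        # 0.5 0.5
print(ndcg_at_k(list_a, 1), ndcg_at_k(list_b, 1))      # 1.0 0.0
print(reciprocal_rank(list_a), reciprocal_rank(list_b))
```

Both lists have the same AUC, yet only the first one places a relevant item at the top, which NDCG at cutoff 1 and the reciprocal rank reward.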

According to our observations of the surveyed papers, recommender systems that take binary ratings (Section 3.4.2) tend to optimize an item-ranking objective function. Compared with numerical ratings (Section 3.4.1), a single binary rating reveals less information about a user’s absolute preference. Pairwise learning methods could capture more information by modeling a user’s relative preferences, because for each user the number of rating pairs exceeds the number of ratings.

3.6. Summary of Related Work

After introducing the above categories that we propose for attribute-aware recommender systems, we present Table 2, which lists the categories each paper belongs to. Table 2 also covers all the publications that we have surveyed. We trace back through the publications to summarize the trend of attribute-aware recommender systems over the past ten years.

Columns: Model; Year; Attribute Source (3.2): User (3.2.1), Item (3.2.1), Rating (3.2.2); Attribute Type (3.3): Numerical (3.3.1), Categorical (3.3.2); Rating Type (3.4): Numerical (3.4.1), Binary (3.4.2); Recommendation Goal (3.5): Prediction (3.5.1), Ranking (3.5.2)
CMF (Singh and Gordon, 2008) 2008
TBM (Gunawardana and Meek, 2008) 2008
WNMCTF (Yoo and Choi, 2009) 2009
CAR-AUC (Shin et al., 2009) 2009
Multi. Recom. 3 (Weng et al., 2009) 2009
RLFM (Agarwal and Chen, 2009) 2009
Unified Boltz (Gunawardana and Meek, 2009) 2009
Matchbox (Stern et al., 2009) 2009
BMFSI (Porteous et al., 2010) 2010
wAMAN. 4 (Li et al., 2010) 2010
CACF (Lee et al., 2010) 2010
PLRM (Li et al., 2010) 2010
LAFM (Gantner et al., 2010) 2010
GPMF (Shan and Banerjee, 2010) 2010
LFL (Menon and Elkan, 2010) 2010
TF (Karatzoglou et al., 2010) 2010
GWNMTF (Gu et al., 2010) 2010
DPMF (Adams et al., 2010) 2010
SoRec (Ma et al., 2011b) 2011
UGPMF (Du et al., 2011) 2011
BMCF (Yoo and Choi, 2011) 2011
MCRI (Fang and Si, 2011) 2011
Hybrid. 5 (Menon et al., 2011) 2011
YMR (Koenigstein et al., 2011) 2011
CAMF (Baltrunas et al., 2011) 2011
GFREC (Lee et al., 2011) 2011
FM (Rendle et al., 2011) 2011
FIP (Yang et al., 2011) 2011
iTALS (Hidasi and Tikk, 2012) 2012
HVBMCF (Yoo and Choi, 2012) 2012
LCR (Weston et al., 2012) 2012
HierIntegModel (Lu et al., 2012) 2012
SVDFeature (Chen et al., 2012) 2012
SSLIM (Ning and Karypis, 2012) 2012
KPMF (Zhou et al., 2012) 2012
TFMAP (Shi et al., 2012a) 2012
CCMF (Bouchard et al., 2013) 2013
GFMF (Chen et al., 2013) 2013
KBMF (Gönen et al., 2013) 2013
HBMFSI (Park et al., 2013) 2013
DACR (Safoury and Salah, 2013) 2013
Maxide (Xu et al., 2013) 2013
MF-EFS (Koenigstein and Paquet, 2013) 2013
HeteroMF (Jamali and Lakshmanan, 2013) 2013
SoCo (Liu and Aberer, 2013) 2013
C-CTR-SMF2 (Chen et al., 2014) 2014
VBMFSI-CA (Kim and Choi, 2014) 2014
IMC (Natarajan and Dhillon, 2014) 2014
CARS (Shi et al., 2014) 2014
LLR (Ji et al., 2014) 2014
GBFM (Cheng et al., 2014) 2014
SCF (Sedhain et al., 2014) 2014
LCE (Saveski and Mantrach, 2014) 2014
CSEL (Zhang et al., 2014) 2014
GPFM (Nguyen et al., 2014) 2014
NCRPD-MF (Hu et al., 2014) 2014
HeteRec (Yu et al., 2014) 2014
CAPRF (Gao et al., 2015) 2015
mSDA-CF (Li et al., 2015) 2015
BIMC (Shin et al., 2015) 2015
Convex FM (Blondel et al., 2015) 2015
CDL (Wang et al., 2015) 2015
LightFM (Kula, 2015) 2015
DCT (Barjasteh et al., 2015) 2015
GFF (Hidasi, 2015) 2015
CALR (Liu and Wu, 2015) 2015
VBPR (He and McAuley, 2016) 2016
GFF (Hidasi and Tikk, 2016) 2016
PNFM (Blondel et al., 2016) 2016
TCRM (Kasai and Mishra, 2016) 2016
PCFSI (Zhao et al., 2016) 2016
CKE (Zhang et al., 2016) 2016
CRAE (Wang et al., 2016) 2016
SIMMCSI (Lu et al., 2016) 2016
DSR (Zheng et al., 2016) 2016
ALMM (Chou et al., 2016) 2016
FFM (Juan et al., 2016) 2016
ReMF (Yang et al., 2016) 2016
TAPER (Ge et al., 2016) 2016
LPRRM-CF (Chen et al., 2016) 2016
HeteRS (Pham et al., 2016) 2016
MVM (Cao et al., 2016) 2016
SQ (Yu et al., 2017) 2017
LoCo (Sedhain et al., 2017) 2017
aSDAE (Dong et al., 2017) 2017
CoEmbed (Guo, 2017) 2017
HMF (Brouwer and Liò, 2017) 2017
DeepFM (Guo et al., 2017) 2017
LDRSSI (Feipeng Zhao, 2017) 2017
CGSI (Tengfei Zhou, 2017) 2017
Func. Embed. 6 (Chen et al., 2017) 2017
CVAE (Li and She, 2017) 2017
entity2rec (Palumbo et al., 2017) 2017
NFM (He and Chua, 2017) 2017
MFM (Lu et al., 2017) 2017
Focused FM (Beutel et al., 2017) 2017
GB-CENT (Zhao et al., 2017) 2017
CML (Hsieh et al., 2017) 2017
ATRank (Zhou et al., 2017) 2018
Div-HeteRec (Nandanwar et al., 2018) 2018
HeteLearn (Jiang et al., 2018) 2018
RNNLatentCross (Beutel et al., 2018) 2018
DDL (Zhang et al., 2018) 2018
Table 2. List of model categories. The numbers in parentheses refer to the corresponding sections for category elaborations. All the model names come from the proposing publications, except that we use the title abbreviations if the authors do not name their approaches. Long model names are commented in footnotes.

4. Common Model Designs of Attribute-Aware Recommender Systems

In this section we formally introduce the common attribute integration methods of existing attribute-aware recommender systems. If collaborative filtering approaches are modeled by user or item latent factor structures like matrix factorization, then attribute matrices become either the prior knowledge of the latent factors (Section 4.1) or the generation outputs from the latent factors (Section 4.2). On the other hand, some of the works are actually generalizations of matrix factorization (Section 4.3). Besides, the interactions between users and items can be recorded by a heterogeneous network, which can incorporate attributes by simply adding attribute-representing nodes (Section 4.4). The major distinction among these four categories lies in how they represent the interactions of users, items and attributes. Discriminative matrix factorization models extend traditional MF by taking the attributes as prior knowledge for learning the latent representations of users or items. Generative matrix factorization further considers the distributions of attributes, and learns them together with the rating distributions. Generalized factorization models view the user or item identity simply as a kind of attribute, and various models are designed to learn the low-dimensional representation vectors for rating prediction. The last category of models proposes to represent the users, items and attributes using a heterogeneous graph, in which a recommendation task can be cast as a link prediction task.

DMF (Discriminative Matrix Factorization, Section 4.1)
  Similarity: (Li et al., 2010), (Gu et al., 2010), (Du et al., 2011), (Zhou et al., 2012), (Barjasteh et al., 2015), (Yu et al., 2017), (Adams et al., 2010), (Chen et al., 2014), (Gönen et al., 2013)
  Linear: (Porteous et al., 2010), (Menon and Elkan, 2010), (Menon et al., 2011), (He and McAuley, 2016), (Zhao et al., 2016), (Guo, 2017), (Feipeng Zhao, 2017)
  Bilinear: (Stern et al., 2009), (Li et al., 2010), (Agarwal and Chen, 2009), (Shin et al., 2015), (Yang et al., 2011), (Chen et al., 2012), (Park et al., 2013), (Xu et al., 2013), (Kim and Choi, 2014), (Natarajan and Dhillon, 2014), (Lu et al., 2016), (Chou et al., 2016)
GMF (Generative Matrix Factorization, Section 4.2)
  Multiple Matrix Factorization: (Sedhain et al., 2017), (Singh and Gordon, 2008), (Shan and Banerjee, 2010), (Ma et al., 2011b), (Yoo and Choi, 2011), (Fang and Si, 2011), (Bouchard et al., 2013), (Saveski and Mantrach, 2014), (Gao et al., 2015), (Ge et al., 2016), (Brouwer and Liò, 2017)
  Deep Neural Networks: (Li et al., 2015), (Wang et al., 2015), (Zhang et al., 2016), (Wang et al., 2016), (Dong et al., 2017), (Li and She, 2017)
GF (Generalized Factorization, Section 4.3)
  TF: (Tengfei Zhou, 2017), (Karatzoglou et al., 2010), (Hidasi and Tikk, 2012), (Hidasi, 2015), (Kasai and Mishra, 2016)
  FM: (He and Chua, 2017), (Rendle et al., 2011), (Cheng et al., 2014), (Nguyen et al., 2014), (Blondel et al., 2015), (Blondel et al., 2016), (Juan et al., 2016), (Cao et al., 2016), (Guo et al., 2017), (Lu et al., 2017)
HG (Heterogeneous Graph, Section 4.4)
  (Yu et al., 2014), (Zheng et al., 2016), (Palumbo et al., 2017)
Table 3. Classification of attribute-aware recommender systems.

4.1. Discriminative Matrix Factorization (Figure 4)

Figure 4. Graphical interpretation of discriminative probabilistic matrix factorization, whose attributes are given for ratings and latent factors. User-relevant and item-relevant attributes could affect the generation of latent factors or ratings, while rating-relevant attributes typically determine the rating prediction. The models of this class may eliminate some of the gray arrows to imply additional independence assumptions between attributes and other factors.

Intuitively, the goal of an attribute-aware recommender system is to import attributes to improve its recommendation performance (either rating prediction or item ranking). In the framework of matrix factorization, an item is rated or ranked according to the latent factors of the item and its corresponding users. In other words, the learning of latent factors in classical matrix factorization depends only on ratings. Thus the learning may fail for lack of training ratings. If we can regularize the latent factors using attributes, or let attributes determine how to rate items, then matrix factorization methods can be more robust to the lack of rating information in the training data, especially for those users or items that have very few ratings.

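In the following, we describe the participation of attributes from a probabilistic perspective. The learning of Probabilistic Matrix Factorization (PMF) tries to maximize the posterior probability of two latent factor matrices $\mathbf{W}$ (for users) and $\mathbf{H}$ (for items), given the observed entries of training rating matrix $\mathbf{R}$. Attribute-aware recommender systems additionally assume that we are given an extra attribute matrix $\mathbf{X}$. Then by Bayes’ rule, the posterior probability can be shown as follows:

$$\arg\max_{\mathbf{W},\mathbf{H}} P(\mathbf{W},\mathbf{H} \mid \mathbf{R},\mathbf{X}) = \arg\max_{\mathbf{W},\mathbf{H}} \frac{P(\mathbf{R} \mid \mathbf{W},\mathbf{H},\mathbf{X})\, P(\mathbf{W},\mathbf{H} \mid \mathbf{X})}{P(\mathbf{R} \mid \mathbf{X})} = \arg\max_{\mathbf{W},\mathbf{H}} \underbrace{P(\mathbf{R} \mid \mathbf{W},\mathbf{H},\mathbf{X})}_{\text{likelihood}}\, \underbrace{P(\mathbf{W} \mid \mathbf{X})\, P(\mathbf{H} \mid \mathbf{X})}_{\text{prior}} \tag{26}$$

We eliminate the denominator $P(\mathbf{R} \mid \mathbf{X})$ since it does not contain variables for maximization. For the prior part, we follow the independence assumption of PMF, $P(\mathbf{W},\mathbf{H} \mid \mathbf{X}) = P(\mathbf{W} \mid \mathbf{X})\, P(\mathbf{H} \mid \mathbf{X})$, though here the independence is conditioned on attribute matrix $\mathbf{X}$. Compared with classical PMF, both the likelihood $P(\mathbf{R} \mid \mathbf{W},\mathbf{H},\mathbf{X})$ and the priors $P(\mathbf{W} \mid \mathbf{X})$, $P(\mathbf{H} \mid \mathbf{X})$ could now be affected by attributes $\mathbf{X}$. Attributes in the likelihood can directly help predict or rank ratings, while attributes in the priors regularize the learning directions of latent factors. Moreover, some current works assume additional independence between attributes and the matrix factorization formulation. For ease of explanation, we suppose that all the random variables follow a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$, or a multivariate normal distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. In principle, the following models also accept other probability distributions.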

Figure 5. Graphical interpretation of the example models whose attributes serve as prior knowledge of latent factors: (a) Matchbox, (b) KPMF, (c) RLFM, (d) FIP. We eliminate all the hyperparameters for presentation simplicity.

We further generate the sub-categories as below.

Attributes in a Linear Model

This is the generalized form to utilize attributes in this category. Given the attributes, a weight vector is applied to perform linear regression together with classical matrix factorization . Its characteristic in mathematical form is shown in likelihood functions:

(27)

where , while denotes the non-missing ratings in the training data, and is the column index corresponding to user and item . respectively denote attribute matrices relevant to user, item and ratings, while are their corresponding transformation functions where attribute space is mapped toward the rating space identical with . Most early models select simple linear transformations, i.e., which has shown recommendation boosting, but recent works consider neural networks for non-linear mapping functions. A simple linear regression model can be expressed as a likelihood function of normal distribution with mean and variance . Ideally the distributions of latent factors shall have prior knowledge from attributes , but we have not yet observed an approach aiming at designing attribute-aware priors as the last two terms of (27).

  • Bayesian Matrix Factorization with Side Information (BMFSI) (Porteous et al., 2010) is an example in this sub-category. On the basis of Bayesian Probabilistic Matrix Factorization (BPMF) (Salakhutdinov and Mnih, 2008b), BMFSI uses a linear combination like (27) to introduce attribute information to rating prediction. It is formulated as: