Representation Learning and Pairwise Ranking for Implicit Feedback in Recommendation Systems
Abstract
In this paper, we propose a novel ranking framework for collaborative filtering with the overall aim of learning user preferences over items by minimizing a pairwise ranking loss. We show that the minimization problem involves dependent random variables and provide a theoretical analysis by proving the consistency of empirical risk minimization in the worst case where all users choose a minimal number of positive and negative items. We further derive a neural-network model that jointly learns a new representation of users and items in an embedded space as well as the preference relation of users over pairs of items. The learning objective is based on three scenarios of ranking losses that control the ability of the model to maintain the ordering over the items induced by the users’ preferences, as well as the capacity of the dot product defined in the learned embedded space to produce that ordering. The proposed model is by nature suitable for implicit feedback and involves the estimation of only very few parameters. Through extensive experiments on several real-world benchmarks of implicit data, we show the benefit of learning the preference and the embedding simultaneously rather than separately. We also demonstrate that our approach is very competitive with the best state-of-the-art collaborative filtering techniques proposed for implicit feedback.
CCS Concepts: Information systems → Learning to rank; Information systems → Recommender systems
1 Introduction
In recent years, recommender systems (RS) have attracted a lot of interest in both industry and academic research communities, mainly due to the new challenges that the design of a decisive and efficient RS presents. Given a set of customers (or users), the goal of an RS is to provide personalized recommendations of products that are likely to be of interest to them. Common examples of applications include the recommendation of movies (Netflix, Amazon Prime Video), music (Pandora), videos (Youtube), news content (Outbrain) or advertisements (Google). The development of an efficient RS is critical from both the company and the consumer perspective. On the one hand, users usually face a very large number of options: for instance, Amazon offers over 20,000 movies in its selection, and it is therefore important to help users make the best possible decision by narrowing down the choices they have to make. On the other hand, major companies report a significant increase in traffic and sales coming from personalized recommendations: Amazon reports that part of its sales is generated by recommendations, two-thirds of the movies watched on Netflix are recommended, and a share of ChoiceStream users said that they would buy more music provided that it met their tastes and interests. (Footnote 1: Talk of Xavier Amatriain, Recommender Systems, Machine Learning Summer School 2014 @ CMU.)
Two main approaches have been proposed to tackle this problem Ricci et al. (2010). The first one, referred to as content-based recommendation Pazzani and Billsus (2007); Lops et al. (2011), makes use of existing contextual information about the users (e.g. demographic information) or items (e.g. textual description) for recommendation. The second approach, referred to as collaborative filtering (CF) and undoubtedly the most popular one, relies on past interactions and recommends items to users based on the feedback provided by other similar users. Feedback can be explicit, in the form of ratings, or implicit, which includes clicks, browsing over an item or listening to a song. Such implicit feedback is readily available in abundance but is more challenging to take into account, as it does not clearly depict the preference of a user for an item. Explicit feedback, on the other hand, is very hard to get in abundance.
The adaptation of CF systems designed for one type of feedback to another has been shown to be suboptimal, as the basic hypothesis of these systems inherently depends on the nature of the feedback White et al. (2001). Further, learning a suitable representation of users and items has been shown to be the bottleneck of these systems Wang et al. (2015), mostly in cases where contextual information over users and items, which would allow a richer representation, is unavailable.
In this paper we are interested in the learning of user preferences mostly provided in the form of implicit feedback in RS. Our aim is twofold and concerns:

the development of a theoretical framework for learning user preference in recommender systems and its analysis in the worst case where all users provide a minimum of positive/negative feedback;

the design of a new neural-network model based on this framework that simultaneously learns the preference of users over pairs of items and their representations in an embedded space, without requiring any contextual information.
We extensively validate our proposed approach over standard benchmarks with implicit feedback by comparing it to state-of-the-art models.
The remainder of this paper is organized as follows. In Section 2, we define the notations and the proposed framework, and analyze its theoretical properties. Then, Section 3 provides an overview of existing related methods. Section 4 is devoted to numerical experiments on four real-world benchmark data sets, including binarized versions of MovieLens and Netflix, and one real data set on online advertising. We compare different versions of our model with state-of-the-art methods, showing the appropriateness of our contribution. Finally, we summarize the study and give possible future research perspectives in Section 5.
2 User preference and embedding learning with Neural nets
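We denote by $\mathcal{U}$ (resp. $\mathcal{I}$) the set of indexes over users (resp. the set of indexes over items). Further, for each user $u \in \mathcal{U}$, we consider two subsets of items $\mathcal{I}_u^- \subset \mathcal{I}$ and $\mathcal{I}_u^+ \subset \mathcal{I}$ such that:

$\mathcal{I}_u^- \neq \emptyset$ and $\mathcal{I}_u^+ \neq \emptyset$,

for any pair of items $(i, i') \in \mathcal{I}_u^+ \times \mathcal{I}_u^-$, $u$ has a preference, symbolized by $\succ_u$. Hence $i \succ_u i'$ implies that user $u$ prefers item $i$ over item $i'$.
From this preference relation, a desired output $y_{i,u,i'} \in \{-1, +1\}$ is defined over each triplet $(i, u, i')$ as:
(1) $y_{i,u,i'} = +1$ if $i \succ_u i'$, and $-1$ otherwise.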
2.1 Learning objective
The learning task we address is to find a scoring function from the class of functions that minimizes the ranking loss:
(2) 
where $|\cdot|$ measures the cardinality of sets and $\mathbb{1}_{\pi}$ is the indicator function, which is equal to $1$ if the predicate $\pi$ is true and $0$ otherwise. Here we suppose that there exists a mapping function $\Phi$ that projects a pair of user and item indices into a feature space of dimension $k$, and a function $g$ defined on that feature space, such that each function $f$ of the class can be decomposed as:
(3) $f(i, u) = g(\Phi(u, i))$
In the next section we will present a neural-network model that learns the mapping function $\Phi$ and outputs the function $f$ based on a nonlinear transformation of the user-item feature representation, defining the function $g$.
The previous loss (2) is a pairwise ranking loss and it is related to the Area under the ROC curve Usunier et al. (2005). The learning objective is, hence, to find a function from the class of functions with a small expected risk, by minimizing the empirical error over a training set
constituted over users, , and their respective preferences over items, and is given by:
(4) 
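As an illustration of the empirical error, the following minimal sketch counts, for each user, the fraction of (preferred, non-preferred) item pairs that a scoring function misorders, and averages this fraction over users. The function name, the dictionary-based data layout and the strict inequality used for a misordering are assumptions made for the example; they are not taken from the paper.

```python
from typing import Callable, Dict, Set


def empirical_ranking_loss(
    score: Callable[[int, int], float],
    positives: Dict[int, Set[int]],
    negatives: Dict[int, Set[int]],
) -> float:
    """Average, over users, of the fraction of misordered item pairs.

    score(item, user) plays the role of the scoring function; positives and
    negatives map each user to the preferred and non-preferred items
    (hypothetical layout).
    """
    per_user = []
    for user, pos in positives.items():
        neg = negatives.get(user, set())
        if not pos or not neg:
            continue  # a user needs at least one item of each kind
        errors = sum(
            1
            for i in pos
            for j in neg
            if score(i, user) - score(j, user) < 0  # preferred item scored lower
        )
        per_user.append(errors / (len(pos) * len(neg)))
    return sum(per_user) / len(per_user) if per_user else 0.0


# Toy usage: one user (0), items 0 and 1 preferred over items 2 and 3.
toy_scores = {(0, 0): 0.9, (1, 0): 0.2, (2, 0): 0.5, (3, 0): 0.1}
loss = empirical_ranking_loss(
    lambda item, user: toy_scores[(item, user)], {0: {0, 1}}, {0: {2, 3}}
)
```

In the toy usage only the pair (1, 2) is misordered out of four pairs, so the loss is 0.25.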
However, this minimization problem involves dependent random variables: for each user and item, all comparisons involved in the empirical error (4) share the same observation. Different studies proposed generalization error bounds for learning with interdependent data Amini and Usunier (2015). Among the prominent works that address this problem are a series of contributions based on the idea of graph coloring introduced in Janson (2004), which consists in dividing a graph $G$ that links dependent variables, represented by its nodes, into sets of independent variables, called the exact proper fractional cover of $G$ and defined as:
Definition (Exact proper fractional cover of $G$, Janson (2004)). Let $G = (V, E)$ be a graph. $\mathcal{C} = \{(C_k, \omega_k)\}_{k \in \{1, \dots, K\}}$, for some positive integer $K$, with $C_k \subseteq V$ and $\omega_k \in [0, 1]$, is an exact proper fractional cover of $G$ if: i) it is proper: for all $k$, $C_k$ is an independent set, i.e., there are no connections between vertices in $C_k$; ii) it is an exact fractional cover of $G$: for all $v \in V$, $\sum_{k : v \in C_k} \omega_k = 1$. The weight $W(\mathcal{C})$ of $\mathcal{C}$ is given by $W(\mathcal{C}) = \sum_{k=1}^{K} \omega_k$, and the minimum weight $\chi^*(G) = \min_{\mathcal{C}} W(\mathcal{C})$ over the set of all exact proper fractional covers of $G$ is the fractional chromatic number of $G$.
Figure 1 depicts an exact proper fractional cover corresponding to the problem we consider for a toy problem with user , and items preferred over other ones. In this case, the nodes of the dependency graph correspond to pairs constituted by the pairs of the user and each of the preferred items, together with the pairs constituted by the user and each of the non-preferred items, involved in the empirical loss (4). Among all the sets containing independent pairs of examples, the one shown in Figure 1 is the exact proper fractional cover of the dependency graph, and the fractional chromatic number is in this case .
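The two conditions of an exact proper fractional cover can be checked mechanically. The sketch below is only an illustration on a hypothetical toy problem (one user, two preferred items `a`, `b` and two non-preferred items `c`, `d`), which is not necessarily the configuration of Figure 1; the function names are ours. Each node is a comparison between a preferred and a non-preferred item, and two comparisons are linked when they share an item.

```python
from itertools import combinations


def is_exact_proper_fractional_cover(vertices, edges, cover, tol=1e-9):
    """Check properness and exactness of a list of (vertex_set, weight) pairs."""
    edge_set = {frozenset(edge) for edge in edges}
    # i) proper: no two vertices of the same subset are connected
    for subset, _weight in cover:
        for a, b in combinations(sorted(subset), 2):
            if frozenset((a, b)) in edge_set:
                return False
    # ii) exact: the weights of the subsets containing a vertex sum to 1
    for vertex in vertices:
        total = sum(weight for subset, weight in cover if vertex in subset)
        if abs(total - 1.0) > tol:
            return False
    return True


def cover_weight(cover):
    """Weight of a cover: the sum of the weights of its subsets."""
    return sum(weight for _subset, weight in cover)


# Hypothetical toy problem: comparisons (preferred, non-preferred) for one user.
nodes = [(p, n) for p in "ab" for n in "cd"]
# Two comparisons are dependent when they share an item.
edges = [(x, y) for x, y in combinations(nodes, 2) if set(x) & set(y)]
cover = [
    ({("a", "c"), ("b", "d")}, 1.0),
    ({("a", "d"), ("b", "c")}, 1.0),
]
valid = is_exact_proper_fractional_cover(nodes, edges, cover)
weight = cover_weight(cover)
```

For this toy graph, which is a cycle over four comparisons, the cover above is valid and has weight 2.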
By mixing the idea of graph coloring with the Laplace transform, Hoeffding-like concentration inequalities for the sum of dependent random variables are proposed by Janson (2004). In Usunier et al. (2006) this result is extended to provide a generalization of the bounded differences inequality of McDiarmid (1989) to the case of interdependent random variables. This extension then paved the way for the definition of the fractional Rademacher complexity, which generalizes the idea of Rademacher complexity and allows one to derive generalization bounds for scenarios where the training data are made of dependent data.
In the worst case scenario where all users provide the lowest interactions over the items, which constitutes the bottleneck of all recommendation systems:
the empirical loss (4) is upper-bounded by:
(5) 
Following (Ralaivola and Amini, 2015, Proposition 4), a generalization error bound can be derived for the second term of the inequality above based on local Rademacher complexities, which involve second-order (i.e. variance) information and induce faster convergence rates.
For the sake of presentation, and in order to be in line with the learning of representations of users and items in an embedded space introduced in Section 2.2, let us consider kernel-based hypotheses with a positive semi-definite (PSD) kernel and its associated feature mapping function. Further we consider linear functions in the feature space with bounded norm:
(6) 
where is the weight vector defining the kernel-based hypotheses and denotes the dot product. We further define the following associated function class:
and the parameterized family which, for , is defined as:
where denotes the variance. The fractional Rademacher complexity introduced in Usunier et al. (2006) entails our analysis:
where is the total number of triplets in the training set and is a sequence of independent Rademacher variables verifying .
Let be a set of independent users, such that each user prefers items over ones in a predefined set of items. Let be the associated training set, then for any the following generalization bound holds for all with probability at least :
where , and
As the set of users is supposed to be independent, the exact fractional cover of the dependency graph corresponding to the training set will be the union of the exact fractional cover associated to each user such that cover sets which do not contain any items in common are joined together.
Following (Ralaivola and Amini, 2015, Proposition 4), for any we have with probability at least :
The infimum is reached for which by plugging it back into the upper bound, and from equation (5), gives:
(7) 
Now, for all and , let and be the first and the second pair constructed from , then from the bilinearity of the dot product and the Cauchy-Schwarz inequality, is upper-bounded by:
(8) 
where the last inequality follows from Jensen’s inequality and the concavity of the square root, and
Further, for all we have , (Shawe-Taylor and Cristianini, 2004, p. 91) so:
By using Jensen’s inequality and the concavity of the square root once again, we finally get
(9) 
We conclude this section with the following two remarks:

In the case where the feature space is of finite dimension, lower values of this dimension involve a lower kernel estimation and hence a lower complexity term, which implies a tighter generalization bound.
2.2 A Neural Network model to learn user preference
Some studies proposed to find the dyadic representation of users and items in an embedded space, using neighborhood similarity information Volkovs and Yu (2015) or the Bayesian Personalized Ranking (BPR) Rendle et al. (2009). In this section we propose a feed-forward neural network, denoted as RecNet, to jointly learn the embedding representation as well as the scoring function defined previously. The input of the network is a triplet composed of the indexes of an item, a user and a second item, such that the user has a preference over the pair of items expressed by the desired output defined with respect to the preference relation (Eq. 1). Each index in the triplet is then transformed into a corresponding binary indicator vector having all its characteristics equal to $0$ except the one that indicates the position of the user or the item in its respective set, which is equal to $1$. Hence, the binary vector representation of a user is a one-hot vector of the form $(0, \dots, 0, 1, 0, \dots, 0)$, where the single $1$ sits at the position of that user.
The network then comprises three successive layers, namely Embedding (SG), Mapping and Dense hidden layers, depicted in Figure 2.

The Embedding layer transforms the sparse binary representations of the user and each of the items into denser real-valued vectors. We denote by and the transformed vectors of user and item ; and and the corresponding matrices. Note that as the binary indicator vectors of users and items contain one single non-null characteristic, each entry of the corresponding dense vector in the SG layer is connected by only one weight to that characteristic.

The Mapping layer is composed of two groups of units, each being obtained from an element-wise product between the user representation vector of a user and a corresponding item representation vector of an item, inducing the feature representation of the pair .

Each of these units is also fully connected to the units of a Dense layer composed of successive hidden layers (see Section 4 for more details on the number of hidden units and the activation function used in this layer).
The model is trained such that the output of each of the dense layers reflects the relationship between the corresponding item and the user, and is mathematically defined by a multivariate real-valued function $g$. Hence, for an input $(i, u, i')$, and writing $\Phi(u, i)$ for the feature representation of the pair $(u, i)$, the output of each of the dense layers is a real-valued score that reflects a preference associated with the corresponding pair $(u, i)$ or $(u, i')$ (i.e. $g(\Phi(u, i))$ or $g(\Phi(u, i'))$). Finally, the prediction given by RecNet for an input $(i, u, i')$ is:
(10) $f(i, u, i') = g(\Phi(u, i)) - g(\Phi(u, i'))$
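The three layers described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: the class name `TinyRecNet`, the single hidden layer, the weight initialization, the sharing of the dense weights between the two branches and the use of a score difference as the final prediction are all assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))


def init_matrix(rows, cols, rng):
    """Small random weights; the initialization scheme is an assumption."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyRecNet:
    """Illustrative forward pass: embedding, mapping and one dense hidden layer."""

    def __init__(self, n_users, n_items, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Embedding layer: multiplying a one-hot vector by a matrix is a row lookup.
        self.user_emb = init_matrix(n_users, dim, rng)
        self.item_emb = init_matrix(n_items, dim, rng)
        # Dense layer weights, shared by both (user, item) branches in this sketch.
        self.w_hidden = init_matrix(hidden, dim, rng)
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]

    def mapping(self, user, item):
        """Mapping layer: element-wise product of user and item embeddings."""
        return [a * b for a, b in zip(self.user_emb[user], self.item_emb[item])]

    def g(self, features):
        """Dense part: real-valued score of one (user, item) feature vector."""
        hidden = [
            sigmoid(sum(w * x for w, x in zip(row, features)))
            for row in self.w_hidden
        ]
        return sum(w * h for w, h in zip(self.w_out, hidden))

    def predict(self, item_i, user, item_j):
        """Score difference; positive means item_i is predicted as preferred."""
        return self.g(self.mapping(user, item_i)) - self.g(self.mapping(user, item_j))


net = TinyRecNet(n_users=3, n_items=5, dim=4, hidden=8)
forward = net.predict(0, 1, 2)
backward = net.predict(2, 1, 0)
```

With a score difference as output, swapping the two items of a triplet flips the sign of the prediction, so `forward` and `backward` are opposite numbers.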
2.3 Algorithmic implementation
We decompose the ranking loss as a linear combination of two logistic surrogates:
(11) 
where the first term reflects the ability of the nonlinear transformation of user and item feature representations to respect the relative ordering of items with respect to users’ preferences:
(12) 
The second term focuses on the quality of the compact dense vector representations of items and users that have to be found, as measured by the ability of the dot product in the resulting embedded vector space to respect the relative ordering of preferred items by users:
(13) 
where is a regularization parameter for the user and items norms. Finally, one can also consider a version in which both losses are assigned different weights:
(14) 
where the weighting coefficient is a real-valued parameter that balances ranking prediction ability against the expressiveness of the learned item and user representations. Both options will be discussed in the experimental section.
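To make the roles of the two terms concrete, the sketch below evaluates both logistic surrogates on a single triplet. It is an illustration under stated assumptions: the label is taken in {-1, +1}, the regularizer is the sum of squared norms of the three embedding vectors, and the weighted variant is written as a convex combination with a coefficient `beta`; none of these exact forms is confirmed by the text above, and all function names are ours.

```python
import math


def logistic_surrogate(margin):
    """Numerically stable log(1 + exp(-margin))."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score_loss(y, g_i, g_j):
    """Surrogate on the dense-layer scores of the two items (assumed form)."""
    return logistic_surrogate(y * (g_i - g_j))


def embedding_loss(y, user_vec, item_i_vec, item_j_vec, reg):
    """Surrogate on dot products in the embedded space plus a norm penalty."""
    diff = [a - b for a, b in zip(item_i_vec, item_j_vec)]
    penalty = (
        dot(user_vec, user_vec)
        + dot(item_i_vec, item_i_vec)
        + dot(item_j_vec, item_j_vec)
    )
    return logistic_surrogate(y * dot(user_vec, diff)) + reg * penalty


def combined_loss(l_score, l_embed, beta=None):
    """Plain sum, or a weighted combination when beta is given (assumed form)."""
    if beta is None:
        return l_score + l_embed
    return beta * l_score + (1.0 - beta) * l_embed


well_ordered = score_loss(1, 2.0, 0.0)
misordered = score_loss(1, 0.0, 2.0)
```

A correctly ordered pair (`well_ordered`) receives a smaller loss than a misordered one (`misordered`), which is the behaviour both surrogates are meant to encourage.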
Training phase
The training of RecNet is done by back-propagating Bottou (2012) the error gradients from the output to both the deep and embedding parts of the model using mini-batch stochastic optimization (Algorithm 1).
During training, the input layer takes a random set of interactions of a given size, builds triplets from this set, and generates a sparse representation from the vector of ids corresponding to the picked user and the pair of items. The binary vectors of the examples in the batch are then propagated throughout the network, and the ranking error (Eq. 11) is backpropagated.
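The construction of a mini-batch of triplets can be sketched as follows. The sampling scheme (a user drawn uniformly, then one preferred and one non-preferred item) is an assumption made for illustration and is not necessarily the one of Algorithm 1; the helper names and the dictionary layout are hypothetical.

```python
import random


def one_hot(index, size):
    """Binary indicator vector with a single 1 at the given position."""
    vec = [0] * size
    vec[index] = 1
    return vec


def sample_triplets(positives, negatives, batch_size, rng):
    """Draw (item_i, user, item_j, label) triplets with item_i preferred to item_j."""
    users = [u for u in positives if positives[u] and negatives.get(u)]
    if not users:
        raise ValueError("no user has both preferred and non-preferred items")
    batch = []
    for _ in range(batch_size):
        user = rng.choice(users)
        item_i = rng.choice(sorted(positives[user]))
        item_j = rng.choice(sorted(negatives[user]))
        batch.append((item_i, user, item_j, 1))
    return batch


positives = {0: {1, 2}, 1: {0}}
negatives = {0: {3}, 1: {2, 4}}
batch = sample_triplets(positives, negatives, batch_size=8, rng=random.Random(0))
# Sparse input of the first example: one-hot vectors for the user and both items.
first_item, first_user, second_item, _label = batch[0]
sparse_input = (one_hot(first_item, 5), one_hot(first_user, 2), one_hot(second_item, 5))
```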
Model Testing
As for the prediction phase, shown in Algorithm 2, a ranked list of the preferred items for each user in the test set is maintained while retrieving the set . Given the latent representations of the triplets, and the weights learned; the two first items in are placed in in a way which ensures that preferred one, , is in the first position. Then, the algorithm retrieves the next item, by comparing it to . This step is simply carried out by comparing the model’s output over the concatenated binary indicator vectors of and .
Hence, if the score of the new item is higher than that of the first item in the list, which from Equation 10 is equivalent to a positive prediction for the corresponding triplet, then the new item is predicted to be preferred over the first one and it is put at the first place in the list. Here we assume that the predicted preference relation is transitive, which then ensures that the predicted order in the list is respected. Otherwise, if the first item is predicted to be preferred over the new one, then the new item is compared to the second preferred item in the list, using the model’s prediction as before, and so on. The new item is inserted in the list if it is found to be preferred over another item in the list.
By repeating the process until the end of , we obtain a ranked list of the most preferred items for the user . Algorithm 2 does not require an ordering of the whole set of items, as also in most cases we are just interested in the relevancy of the top ranked items for assessing the quality of a model. Further, its complexity is at most which is convenient in the case where . The merits of a similar algorithm have been discussed in Ailon and Mohri (2008) but, as pointed out above, the basic assumption for inserting a new item in the ranked list is that the predicted preference relation induced by the model should be transitive, which may not hold in general.
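The insertion procedure described above can be sketched as follows. This is an illustrative reading of the text rather than a transcription of Algorithm 2: the function name, the `prefers` callback (standing in for the sign of the model's prediction) and the choice to append an item at the end while the list is not yet full are assumptions.

```python
from typing import Callable, Hashable, Iterable, List


def top_k_by_insertion(
    candidates: Iterable[Hashable],
    prefers: Callable[[Hashable, Hashable], bool],
    k: int,
) -> List[Hashable]:
    """Maintain a ranked list of at most k items using pairwise comparisons.

    prefers(a, b) returns True when a is predicted to be preferred over b;
    the predicted relation is assumed to be transitive.
    """
    ranked: List[Hashable] = []
    for item in candidates:
        position = len(ranked)  # default: after every item kept so far
        for idx, kept in enumerate(ranked):
            if prefers(item, kept):
                position = idx  # first kept item that the new item beats
                break
        if position < k:
            ranked.insert(position, item)
            del ranked[k:]  # keep only the k best items
    return ranked


scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}
top2 = top_k_by_insertion(["a", "b", "c", "d"], lambda x, y: scores[x] > scores[y], k=2)
```

In this sketch each candidate is compared with at most k kept items, so the number of comparisons grows linearly with the number of candidates for a fixed k. On the toy scores, `top2` is `["b", "d"]`.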
In our experiments, we also tested a more conventional inference algorithm, which for a given user , consists in the ordering of items in with respect to the output given by the function , and we did not find any substantial difference in the performance of RecNet, as presented in the following section.
3 (Un)related work
This section provides an overview of the state-of-the-art approaches that are the most similar to ours.
3.1 Neural Language Models
Neural language models have proven themselves to be successful in many natural language processing tasks, including speech recognition, information retrieval and sentiment analysis. These models are based on a distributional hypothesis stating that words occurring in the same context with the same frequency are similar. In order to capture such similarities, these approaches propose to embed the word distribution into a low-dimensional continuous space using neural networks, leading to the development of several powerful and highly scalable language models such as the word2vec Skip-Gram (SG) model Mikolov et al. (2013a); Mikolov et al. (2013b); Shazeer et al. (2016).
The recent work of Levy and Goldberg (2014) has shown new opportunities to extend word representation learning to characterize more complicated pieces of information. In fact, this paper established the equivalence between the SG model with negative sampling and the implicit factorization of a pointwise mutual information (PMI) matrix. Further, they demonstrated that word embedding can be applied to different types of data, provided that it is possible to design an appropriate context matrix for them. This idea has been successfully applied to recommendation systems, where different approaches attempted to learn representations of items and users in an embedded space in order to address the problem of recommendation more efficiently Guàrdia-Sebaoun et al. (2015); Liang et al. (2016); Grbovic et al. (2015); He et al. (2017); Covington et al. (2016).
In He et al. (2017), the authors used a bag-of-words vector representation of items and users, from which the latent representations of the latter are learned through word2vec. In Liang et al. (2016), the authors proposed a model that relies on the intuitive idea that the pairs of items which are scored in the same way by different users are similar. The approach reduces to finding the latent representations of both users and items with the traditional Matrix Factorization (MF) approach, and simultaneously learning item embeddings using a co-occurrence shifted positive PMI (SPPMI) matrix defined by items and their context. The latter is used as a regularization term in the traditional objective function of MF. Similarly, in Grbovic et al. (2015) the authors proposed Prod2Vec, which embeds items using a neural-network language model applied to a time series of user purchases. This model was further extended in Vasile et al. (2016) who, by defining appropriate context matrices, proposed a new model called Meta-Prod2Vec. Their approach learns a representation for both items and side information available in the system. The embedding of additional information is further used to regularize the item embedding. Inspired by the concept of sequences of words, the approach proposed by Guàrdia-Sebaoun et al. (2015) defined the consumption of items by users as trajectories. Then, the embedding of items is learned using the SG model and the users’ embeddings are further used to predict the next item in the trajectory. In these approaches, the learned item and user representations are employed to make predictions with predefined or fixed similarity functions (such as dot products) in the embedded space.
3.2 Learning-to-Rank with Neural Networks
Motivated by automatically tuning the parameters involved in the combination of different scoring functions, Learning-to-Rank approaches were originally developed for Information Retrieval (IR) tasks and are grouped into three main categories: pointwise, listwise and pairwise Liu (2009).
Pointwise approaches Crammer and Singer (2001); Li et al. (2007) assume that each queried document pair has an ordinal score. Ranking is then formulated as a regression problem, in which the rank value of each document is estimated as an absolute quantity. In the case where relevance judgments are given as pairwise preferences (rather than relevance degrees), it is usually not straightforward to apply these algorithms for learning. Moreover, pointwise techniques do not consider the interdependency among documents, so that the position of documents in the final ranked list is missing in the regression-like loss functions used for parameter tuning. On the other hand, listwise approaches Shi et al. (2010); Xu and Li (2007); Xu et al. (2008) take the entire ranked list of documents for each query as a training instance. As a direct consequence, these approaches are able to differentiate documents from different queries, and consider their position in the output ranked list at the training stage. Listwise techniques aim to directly optimize a ranking measure, so they generally face a complex optimization problem dealing with non-convex, non-differentiable and discontinuous functions. Finally, in pairwise approaches Cohen et al. (1999); Freund et al. (2003); Joachims (2002); Pessiot et al. (2007) the ranked list is decomposed into a set of document pairs. Ranking is therefore considered as the classification of pairs of documents, such that a classifier is trained by minimizing the number of misorderings in ranking. In the test phase, the classifier assigns a positive or negative class label to a document pair that indicates which of the documents in the pair should be ranked higher than the other one.
Perhaps the first neural-network model for ranking is RankProp, originally proposed by Caruana et al. (1995). RankProp is a pointwise approach that alternates between two phases: learning the desired real outputs by minimizing a Mean Squared Error (MSE) objective, and modifying the desired values themselves to reflect the current ranking given by the net. Later on, Burges et al. (2005) proposed RankNet, a pairwise approach that learns a preference function by minimizing a cross-entropy cost over the pairs of relevant and irrelevant examples. SortNet, proposed in Rigutini et al. (2008, 2011), also learns a preference function by minimizing a ranking loss over the pairs of examples that are selected iteratively with the overall aim of maximizing the quality of the ranking. The three approaches above consider the problem of Learning-to-Rank for IR, without learning an embedding.
4 Experimental Results
We conducted a number of experiments aimed at evaluating how the simultaneous learning of user and item representations, as well as of the preferences of users over items, can be efficiently handled with RecNet. To this end, we considered four real-world benchmarks commonly used for collaborative filtering.
We validated our approach with respect to different hyperparameters that impact the accuracy of the model and compared it with competitive state-of-the-art approaches.
4.1 Datasets
We report results obtained on three publicly available movie datasets, for the task of personalized top-N recommendation: MovieLens 100K (ML100K), MovieLens 1M (ML1M) Harper and Konstan (2015) and Netflix, and on one clicks dataset, Kasandr-Germany Sidana et al. (2017), a recently released data set for online advertising. (Footnote 2: https://movielens.org/. Footnote 3: http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a. Footnote 4: https://archive.ics.uci.edu/ml/datasets/KASANDR.)

ML100K, ML1M and Netflix consist of user-movie ratings, on a scale of one to five, collected from a movie recommendation service and the Netflix company. The latter was released to support the Netflix Prize competition (Footnote 5: B. James and L. Stan, The Netflix Prize (2007)). The ML100K dataset gathers 100,000 ratings from 943 users on 1682 movies, the ML1M dataset comprises 1,000,000 ratings from 6040 users and 3900 movies, and Netflix consists of 100 million ratings from 480,000 users and 17,000 movies. For all three datasets, we only keep users who have rated at least five movies and remove users who gave the same rating to all movies. In addition, for Netflix, we take a subset of the original data by randomly sampling a fraction of the users and of the items. In the following, as we only compare with approaches developed for ranking purposes and our model is designed to handle implicit feedback, these three data sets are made binary such that a rating higher than or equal to 4 is set to 1, and to 0 otherwise.

The original Kasandr dataset contains the interactions and clicks done by the users of Kelkoo, an online advertising platform, across twenty European countries. In this article, we used a subset of Kasandr that only considers interactions from Germany. It gathers 17,764,280 interactions from 521,685 users on 2,299,713 offers belonging to 272 categories and spanning 801 merchants.
For each dataset, we sort the interactions according to time, and take 80% for training the model and the remaining 20% for testing it. In addition, we remove all users and offers which do not occur during the training phase. Table 1 provides the basic statistics on these collections after the preprocessing discussed above.
Table 1: Basic statistics of the datasets after preprocessing.
Dataset  # of users  # of items  # of interactions  Sparsity
ML100K  943  1,682  100,000  93.685%
ML1M  6,040  3,706  1,000,209  95.530%
Netflix  90,137  3,560  4,188,098  98.700%
Kasandr  25,848  1,513,038  9,489,273  99.976%
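The preprocessing steps described before Table 1 (binarization of ratings, time-ordered 80/20 split, and removal of test interactions whose user or item is unseen in training) can be sketched as below. The tuple layout `(user, item, rating, timestamp)` and the function names are hypothetical choices made for the example.

```python
def binarize(rating, threshold=4):
    """Map a rating to 1 when it is at least the threshold, and to 0 otherwise."""
    return 1 if rating >= threshold else 0


def temporal_split(interactions, train_fraction=0.8):
    """Time-ordered split of (user, item, rating, timestamp) tuples.

    Test interactions whose user or item never occurs in the training part
    are dropped.
    """
    ordered = sorted(interactions, key=lambda record: record[3])
    cut = int(len(ordered) * train_fraction)
    train, test = ordered[:cut], ordered[cut:]
    seen_users = {record[0] for record in train}
    seen_items = {record[1] for record in train}
    test = [r for r in test if r[0] in seen_users and r[1] in seen_items]
    return train, test


# Toy usage with five interactions (hypothetical values).
data = [(1, 10, 5, 3), (2, 10, 4, 1), (1, 11, 3, 2), (2, 11, 2, 4), (1, 10, 4, 5)]
train, test = temporal_split(data)
labels = [binarize(rating) for _user, _item, rating, _ts in train]
```

Here the four earliest interactions form the training part, the latest one is kept for testing because both its user and its item were seen during training, and `labels` holds the binarized training ratings.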
4.2 Experimental setup
To validate the framework defined in the previous section, we propose to compare the following approaches.

BPRMF Rendle et al. (2009) provides an optimization criterion based on implicit feedback, which is the maximum posterior estimator derived from a Bayesian analysis of the pairwise ranking problem, and proposes an algorithm based on Stochastic Gradient Descent to optimize it. The model can further be extended to the explicit feedback case.

CoFactor Liang et al. (2016), developed for implicit feedback, constrains the objective of matrix factorization to jointly use item representations with a factorized shifted positive pointwise mutual information matrix of item co-occurrence counts. The model was found to outperform WMF Hu et al. (2008), also proposed for implicit feedback.

LightFM Kula (2015) was first proposed to deal with the cold-start problem using meta information. As with our approach, it relies on learning the embedding of users and items with the Skip-gram model and optimizes the cross-entropy loss.

RecNet focuses on the quality of the latent representation of users and items by learning the preference and the representation through the ranking loss (Eq. 13).

RecNet focuses on the accuracy of the score obtained at the output of the framework and therefore learns the preference and the representation through the ranking loss (Eq. 12).

RecNet uses a linear combination of and as the objective function, with . We study the two situations presented before (w.r.t. the presence/absence of a supplementary weighting hyperparameter).
All comparisons are done based on a common ranking metric, namely the Mean Average Precision (MAP). First, let us recall that the Average Precision (AP) is defined over the precision, , at rank .
where the relevance judgments are binary (i.e. equal to 1 when the item is clicked or preferred, and 0 otherwise). Then, the mean of these APs across all users is the MAP. In the following, we report MAP at ranks 1 and 10.
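For concreteness, the sketch below computes AP at a given rank from a list of binary relevance judgments, and MAP as its mean over users. The normalization by the number of relevant items found within the cutoff is one common convention and is an assumption here, since the normalization used in the experiments is not visible in the text; the function names are ours.

```python
from typing import List, Sequence


def average_precision_at_k(relevance: Sequence[int], k: int) -> float:
    """AP@k for one ranked list of binary relevance judgments.

    Precision is accumulated at each rank holding a relevant item and
    normalized by the number of relevant items found in the top k
    (assumed convention).
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision_at_k(rankings: List[Sequence[int]], k: int) -> float:
    """MAP@k: mean of AP@k over all users' ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision_at_k(r, k) for r in rankings) / len(rankings)


ap_example = average_precision_at_k([1, 0, 1], k=3)
map_example = mean_average_precision_at_k([[1, 0, 1], [0, 0, 0]], k=3)
```

For the list `[1, 0, 1]`, the precisions at the two relevant ranks are 1 and 2/3, so `ap_example` is their mean, 5/6.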
4.3 Performance of Our Models
Hereafter, we run experiments to compare the performance of RecNet with the baseline methods on various data sets.
Results
First, we detail the impact of the different hyperparameters involved in the proposed framework. For all datasets, we repeat the same tuning procedure (on a validation set).

Dimension of the latent representation: the size of the embedding is selected from a grid of candidate values. The results are presented in Figure 6.

Regularization parameters: the values of the regularization term are selected from a grid of candidate values using a validation set.

Number of hidden layers: it is set to 1, and the number of hidden units is varied over a set of candidate values. We use logistic activation functions for the hidden layers.
ML100K  ML1M  Netflix  Kasandr  

RecNet  RecNet  RecNet  RecNet  RecNet  RecNet  RecNet  RecNet  RecNet  RecNet  RecNet  RecNet  
18  1  18  
0.001  0.01  0.01  
# hidden units  64  64  32 
From Figure 6, it is evident that the best MAP@1 results are generally obtained with small sizes of item and user embedded vector spaces which are the same than the size of the feature vector space, . These empirical results support our theoretical analysis where we found that small induces smaller generalization bounds. Lastly, to train RecNet, we fix the number of epochs to and the size of the minibatches to . To further avoid overfitting (as shown in Figure 10 (a)), we also use earlystopping. For the optimization of the different ranking losses, we use Adam Kingma and Ba (2014) and the learning rate is set to 1e3 using a validation set. For other parameters involved in Adam, i.e., the exponential decay rates for the moment estimates, we keep the default values (, and ). For reproducibility purpose, we report the best combination of parameters for each variant of RecNet in Table 2. We run all experiments on a cluster of five 32 core Intel Xeon @ 2.6Ghz CPU (with 20MB cache per core) systems with 256 Giga RAM running Debian GNU/Linux 8.6 (wheezy) operating system. All subsequently discussed components were implemented in Python3 using the TensorFlow library.^{6}^{6}6https://www.tensorflow.org/. ^{7}^{7}7For research purpose we will make available all the codes implementing Algorithms 1 and 2 that we used in our experiments as well as all the preprocessed datasets.
Finally, we study two settings for the prediction phase: (1) for a given user, the prediction is done only on the items that were shown to him or her; (2) the prediction is done over the set of all items, regardless of any knowledge about previous interactions. In the context of movie recommendation, a shown item is defined as a movie for which the given user provided a rating. For Kasandr, the definition is straightforward, as the data were collected from an online advertising platform where items are displayed to users, who can either click on them or ignore them. The first setting is arguably the most common in academic research, but it abstracts away from the real-world problem: at the time of making the recommendation, the notion of shown items is not available, which forces the RS to consider the set of all items as potential candidates. The second setting aims to reflect this real-world scenario, and we can expect lower results than in the first setting, as the search space of items grows considerably. To summarize, predicting only among the items that were shown to the user evaluates the model’s capability of retrieving highly rated items among the shown ones, while predicting among all items measures the model’s ability to recommend offers that the user would like to engage with.
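The difference between the two settings can be illustrated with a small sketch that computes the average precision at rank k for a single user, once over the shown items and once over all items. The scores, item names and the normalization of AP@k (by the minimum of k and the number of relevant items) are assumptions made for this example; MAP@k is then the mean of these values over users.

```python
def average_precision_at_k(ranked_items, relevant, k):
    """AP@k of one ranked list, normalised by min(k, number of relevant items)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def rank_candidates(scores, candidates):
    """Sort candidate items by decreasing predicted score."""
    return sorted(candidates, key=lambda item: scores[item], reverse=True)


# Hypothetical predicted scores of one user for six items.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4}
shown = {"c", "d", "f"}   # items displayed to (or rated by) the user
clicked = {"c"}           # positive feedback among the shown items

# Setting 1: rank only the shown items. Setting 2: rank all items.
ap_shown = average_precision_at_k(rank_candidates(scores, shown), clicked, k=1)
ap_all = average_precision_at_k(rank_candidates(scores, scores.keys()), clicked, k=1)
```

Here the clicked item is ranked first among the shown items but only third among all items, so the same scores yield a lower AP@1 in the second setting.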
Table 3: MAP@1 and MAP@10 when the prediction is done over the items shown to the user.
ML100K  ML1M  Netflix  Kasandr
MAP@1  MAP@10  MAP@1  MAP@10  MAP@1  MAP@10  MAP@1  MAP@10
BPRMF  0.884  0.839
LightFM
CoFactor
RecNet
RecNet  0.839
RecNet  0.913  0.857  0.839  0.880  0.849  0.971
Table 4: MAP@1 and MAP@10 when the prediction is done over all items.
ML100K  ML1M  Netflix  Kasandr
MAP@1  MAP@10  MAP@1  MAP@10  MAP@1  MAP@10  MAP@1  MAP@10
BPRMF  0.261  0.072
LightFM  0.144
CoFactor  0.049
RecNet
RecNet  0.098  0.111
RecNet  0.228  0.240
Tables 3 and 4 report all results. In each case, we statistically compare the performance of the algorithms: bold face indicates the highest performance, and a symbol next to a value indicates that the performance is significantly worse than the best result, according to a Wilcoxon rank-sum test (Lehmann and D’Abrera, 2006) at a fixed p-value threshold.
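For readers unfamiliar with the test, the following sketch shows a two-sided Wilcoxon rank-sum test under the normal approximation, applied to two samples of per-user scores. The sample values are made up for illustration, ties receive average ranks without a variance correction, and the function names are introduced here; this is not the evaluation script used for the tables.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns the z statistic and the two-sided p-value.
    """
    n_x, n_y = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n_x])                          # rank sum of the first sample
    mean_w = n_x * (n_x + n_y + 1) / 2            # expectation under H0
    std_w = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (w - mean_w) / std_w
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided tail of N(0, 1)
    return z, p_value


# Hypothetical per-user scores of the best method and of a competitor.
best_scores = [0.90, 0.80, 0.95, 0.85, 0.92, 0.88]
other_scores = [0.60, 0.70, 0.65, 0.75, 0.55, 0.72]
z, p_value = rank_sum_test(best_scores, other_scores)
```

The competitor would be flagged as significantly worse when `p_value` falls below the chosen threshold. SciPy’s `scipy.stats.ranksums` implements the same normal-approximation test and is preferable outside of a self-contained illustration.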

Regarding hyperparameter tuning, we can see that all three versions of RecNet perform best with a small number of hidden units, only one hidden layer and a low dimension for the representation. As a consequence, they involve few parameters and have an attractive computational complexity compared to other state-of-the-art approaches. The observation on the dimension of the embedding is also confirmed by Figure 6, and is in agreement with the conclusion of Kula (2015), which uses the same technique for representation learning. For instance, one can see that on ML1M, the highest MAP is achieved with a very low embedding dimension. While this could be surprising, it is mainly due to the fact that we used a one-hot encoded input representation of both users and items, meaning that only one input value is equal to one; it therefore seems quite reasonable to associate one weight to each user and each item in this context. This conclusion is also in line with the generalization bound that we have presented previously.
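The remark on one-hot inputs can be made concrete with a small sketch: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, so an embedding of dimension 1 stores exactly one weight per user (or item). The matrix values and helper names below are hypothetical.

```python
def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(one_hot_vector, weights):
    """Multiply a one-hot row vector by a (size x dim) weight matrix."""
    dim = len(weights[0])
    return [
        sum(x * row[d] for x, row in zip(one_hot_vector, weights))
        for d in range(dim)
    ]


# Hypothetical embedding matrix for 4 users with dimension 1:
# the layer then holds exactly one weight per user.
user_weights = [[0.3], [-1.2], [0.8], [0.1]]
user_2 = embed(one_hot(2, 4), user_weights)  # selects the row of user 2
```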

In terms of the ability to recover a relevant ranked list of items for each user, we also tune the hyperparameter of Eq. 14, which balances the weight given to the two terms of the objective. These results are shown in Figure 14, where the value of this hyperparameter is varied over an interval. While it seems to play a significant role on ML100K and Kasandr, we can see that for ML1M the results in terms of MAP are stable, regardless of its value. However, we observed empirically that the version of the objective without this hyperparameter gives better results on average, and we decided to report only the latter in Tables 3 and 4.
Figure 14: MAP@1, MAP@5, MAP@10 as a function of the value of the hyperparameter of Eq. 14 for ML1M, ML100K and Kasandr.
As RecNet is meant to handle implicit feedback, we compare it with baselines that were developed for the same purpose. When the prediction is done over all offers (Table 4), RecNet outperforms all other algorithms on Kasandr and ML1M. When the prediction is done over the offers that the user interacted with (Table 3), RecNet beats all the other algorithms on Kasandr, Netflix and ML100K. Our model departs from models that learn a pairwise ranking function without any knowledge of embeddings, or that learn embeddings without learning any pairwise ranking function. While learning the pairwise ranking function, our model is aware of the embeddings learned so far, and vice versa. This simultaneous learning of two ranking functions helps in learning hidden features of implicit data and improves the performance of RecNet. We also found that the performance of RecNet is affected on some of the binarized movie datasets when the prediction is done on all movies (Table 4). A simple explanation is that these datasets are strongly biased towards popular movies: the majority of users have watched one or another of the popular movies and rated them well. For instance, in ML100K and Netflix, a large proportion of the users have given ratings greater than 4 to the top-10 movies. We believe that this phenomenon adversely affects the performance of RecNet. However, on Kasandr, which is the only true implicit dataset, RecNet significantly outperforms all other approaches.

One can note that while optimizing the ranking losses of Eq. 11, Eq. 12 or Eq. 13, we simultaneously learn the representation and the preference function; the main difference is the amount of emphasis put on learning one or the other. The results presented in both tables tend to demonstrate that, in almost all cases, optimizing the linear combination of the pairwise ranking loss and the embedding loss (RecNet) indeed yields better overall recommendations than optimizing standalone losses to learn the embeddings and the pairwise preference function.
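As a rough illustration of such a combined objective, the sketch below adds a ranking loss on the output of a preference function to a ranking loss on the dot product in the embedded space, for one (user, preferred item, non-preferred item) triplet. It assumes a logistic surrogate loss and a mixing coefficient `weight`; the function names, the `preference_score` callable and the exact form of the two terms are assumptions made for this example and may differ from Eq. 11 to Eq. 14.

```python
import math


def logistic_loss(margin):
    """Logistic surrogate log(1 + exp(-margin)), computed in a stable way."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dot(u, v):
    """Dot product of two vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def combined_ranking_loss(user, preferred, non_preferred, preference_score, weight=0.5):
    """Weighted sum of a preference-function loss and an embedding loss.

    user, preferred, non_preferred: embedding vectors (lists of floats).
    preference_score: hypothetical stand-in for the learned preference
        function; takes (user, item) embeddings and returns a score.
    weight: assumed mixing coefficient between the two terms.
    """
    # Term 1: the preference function should score the preferred item higher.
    pref_margin = preference_score(user, preferred) - preference_score(user, non_preferred)
    # Term 2: the dot product in the embedded space should give the same order.
    emb_margin = dot(user, preferred) - dot(user, non_preferred)
    return weight * logistic_loss(pref_margin) + (1 - weight) * logistic_loss(emb_margin)


# Toy usage, with the dot product standing in for the preference function.
user_vec = [1.0, 0.0]
liked, disliked = [2.0, 0.0], [0.0, 1.0]
loss_correct_order = combined_ranking_loss(user_vec, liked, disliked, dot)
loss_wrong_order = combined_ranking_loss(user_vec, disliked, liked, dot)
```

Setting `weight` to 1 or 0 recovers a standalone preference loss or a standalone embedding loss, which is the sense in which the variants differ only in the emphasis put on each term.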
5 Conclusion
We presented and analyzed a learning-to-rank framework for recommender systems, which consists in learning user preferences over items. We showed that the minimization of a pairwise ranking loss over user preferences involves dependent random variables, and we provided a theoretical analysis by proving the consistency of the empirical risk minimization in the worst case where all users choose a minimal number of positive and negative items. From this analysis we then proposed RecNet, a new neural-network-based model for learning user preferences, where both the user and item representations and the function modeling the user’s preference over pairs of items are learned simultaneously. The learning phase is guided by a ranking objective that captures the ranking ability of the prediction function as well as the expressiveness of the learned embedded space, where the preference of users over items is respected by the dot product defined over that space. The training of RecNet is carried out using the backpropagation algorithm on mini-batches defined over a user-item matrix containing implicit information in the form of subsets of preferred and non-preferred items. The learning capability of the model on both the prediction and the representation problems shows that they are interconnected, and that the proposed double ranking objective allows them to be combined effectively.
We assessed and validated the proposed approach through extensive experiments, using four popular collections proposed for the task of recommendation. Furthermore, we studied two different settings for the prediction phase and demonstrated that the performance of each approach is strongly impacted by the set of items considered for making the prediction.
For future work, we would like to extend RecNet in order to take into account additional contextual information regarding users and/or items. More specifically, we are interested in the integration of data of different natures, such as text or demographic information. We believe that this information can be taken into account without much effort and that doing so could improve the performance of our approach while also tackling the problem of providing recommendations for new users and items, also known as the cold-start problem. The second important extension will be the development of an online version of the proposed algorithm, in order to make the approach suitable for real-time applications and online advertising.
References
 Ailon and Mohri (2008) Nir Ailon and Mehryar Mohri. 2008. An Efficient Reduction of Ranking to Classification. In 21st Annual Conference on Learning Theory - COLT. 87–98.
 Amini and Usunier (2015) Massih-Reza Amini and Nicolas Usunier. 2015. Learning with Partially Labeled and Interdependent Data. Springer, New York, NY, USA.
 Bottou (2012) Léon Bottou. 2012. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade - Second Edition. 421–436.
 Burges et al. (2005) Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. 2005. Learning to rank using gradient descent. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML). 89–96.
 Caruana et al. (1995) Rich Caruana, Shumeet Baluja, and Tom M. Mitchell. 1995. Using the Future to Sort Out the Present: Rankprop and Multitask Learning for Medical Risk Evaluation. In Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, November 27-30. 959–965.
 Cohen et al. (1999) William W. Cohen, Robert E. Schapire, and Yoram Singer. 1999. Learning to Order Things. J. Artif. Intell. Res. (JAIR) 10 (1999), 243–270.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19. 191–198.
 Crammer and Singer (2001) Koby Crammer and Yoram Singer. 2001. Pranking with Ranking. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS, December 3-8, Vancouver, British Columbia, Canada]. 641–647.
 Freund et al. (2003) Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer. 2003. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research 4 (2003), 933–969.
 Grbovic et al. (2015) Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in Your Inbox: Product Recommendations at Scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13. 1809–1818.
 Guàrdia-Sebaoun et al. (2015) Élie Guàrdia-Sebaoun, Vincent Guigue, and Patrick Gallinari. 2015. Latent Trajectory Modeling: A Light and Efficient Way to Introduce Time in Recommender Systems. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys, Vienna, Austria, September 16-20. 281–284.
 Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (Dec. 2015), 19 pages.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW, Perth, Australia, April 3-7. 173–182.
 Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), December 15-19, Pisa, Italy. 263–272.
 Janson (2004) S. Janson. 2004. Large Deviations for Sums of Partly Dependent Random Variables. Random Structures and Algorithms 24, 3 (2004), 234–248.
 Joachims (2002) Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 133–142.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).
 Kula (2015) Maciej Kula. 2015. Metadata Embeddings for User and Item Cold-start Recommendations. In Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems co-located with 9th ACM Conference on Recommender Systems, RecSys. 14–21.
 Lehmann and D’Abrera (2006) E.L. Lehmann and H.J.M. D’Abrera. 2006. Nonparametrics: statistical methods based on ranks. Springer.
 Levy and Goldberg (2014) O. Levy and Y. Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, December 8-13, Montreal, Quebec, Canada. 2177–2185.
 Li et al. (2007) Ping Li, Christopher J. C. Burges, and Qiang Wu. 2007. McRank: Learning to Rank Using Multiple Classification and Gradient Boosting. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6. 897–904.
 Liang et al. (2016) Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M. Blei. 2016. Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19. 59–66.
 Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval 3, 3 (2009), 225–331.
 Lops et al. (2011) Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro. 2011. Content-based Recommender Systems: State of the Art and Trends. In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor (Eds.). Springer, 73–105.
 McDiarmid (1989) C. McDiarmid. 1989. On the method of bounded differences. Cambridge University Press, 1989, Survey in Combinatorics (1989), 148–188.
 Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
 Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26. 3111–3119.
 Pazzani and Billsus (2007) Michael J. Pazzani and Daniel Billsus. 2007. The Adaptive Web. Springer-Verlag, Berlin, Heidelberg, Chapter Content-based Recommendation Systems, 325–341.
 Pessiot et al. (2007) Jean-François Pessiot, Tuong-Vinh Truong, Nicolas Usunier, Massih-Reza Amini, and Patrick Gallinari. 2007. Learning to Rank for Collaborative Filtering. In ICEIS 2007 - Proceedings of the Ninth International Conference on Enterprise Information Systems. 145–151.
 Ralaivola and Amini (2015) Liva Ralaivola and Massih-Reza Amini. 2015. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015. 2436–2444.
 Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21. 452–461.
 Ricci et al. (2010) Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. 2010. Recommender Systems Handbook (1st ed.). Springer-Verlag New York, Inc., New York, NY, USA.
 Rigutini et al. (2008) Leonardo Rigutini, Tiziano Papini, Marco Maggini, and Monica Bianchini. 2008. A Neural Network Approach for Learning Object Ranking. In Artificial Neural Networks - ICANN, 18th International Conference, Prague, Czech Republic, September 3-6, Proceedings, Part II. 899–908.
 Rigutini et al. (2011) Leonardo Rigutini, Tiziano Papini, Marco Maggini, and Franco Scarselli. 2011. SortNet: Learning to Rank by a Neural Preference Function. IEEE Trans. Neural Networks 22, 9 (2011), 1368–1380.
 Shawe-Taylor and Cristianini (2004) John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA.
 Shazeer et al. (2016) Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving embeddings by noticing what’s missing. arXiv preprint arXiv:1602.02215 (2016).
 Shi et al. (2010) Yue Shi, Martha Larson, and Alan Hanjalic. 2010. Listwise Learning to Rank with Matrix Factorization for Collaborative Filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys ’10). 269–272.
 Sidana et al. (2017) Sumit Sidana, Charlotte Laclau, Massih-Reza Amini, Gilles Vandelle, and Andre Bois-Crettez. 2017. KASANDR: A Large-Scale Dataset with Implicit Feedback for Recommendation. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval.
 Usunier et al. (2005) Nicolas Usunier, Massih Amini, and Patrick Gallinari. 2005. A Data-dependent Generalisation Error Bound for the AUC. In ICML’05 workshop on ROC Analysis in Machine Learning. Bonn, Germany.
 Usunier et al. (2006) Nicolas Usunier, Massih-Reza Amini, and Patrick Gallinari. 2006. Generalization error bounds for classifiers trained with interdependent data. In Advances in Neural Information Processing Systems 19. 1369–1376.
 Vapnik (2000) Vladimir Vapnik. 2000. The nature of statistical learning theory. Springer Science & Business Media.
 Vasile et al. (2016) Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19. 225–232.
 Volkovs and Yu (2015) Maksims Volkovs and Guang Wei Yu. 2015. Effective Latent Models for Binary Feedback in Recommender Systems. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’15). 313–322.
 Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13. 1235–1244.
 White et al. (2001) Ryen White, Joemon M. Jose, and Ian Ruthven. 2001. Comparing Explicit and Implicit Feedback Techniques for Web Retrieval: TREC-10 Interactive Track Report. In Proceedings of The Tenth Text REtrieval Conference, TREC, Gaithersburg, Maryland, USA, November 13-16.
 Xu and Li (2007) Jun Xu and Hang Li. 2007. AdaRank: a boosting algorithm for information retrieval. In SIGIR: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 391–398.
 Xu et al. (2008) Jun Xu, Tie-Yan Liu, Min Lu, Hang Li, and Wei-Ying Ma. 2008. Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, Singapore, July 20-24. 107–114.