
AUC Optimisation and Collaborative Filtering

Abstract.

In recommendation systems, one is interested in the ranking of the predicted items as opposed to other losses such as the mean squared error. Although a variety of ways to evaluate rankings exist in the literature, here we focus on the Area Under the ROC Curve (AUC) as it is widely used and has a strong theoretical underpinning. In practical recommendation, only items at the top of the ranked list are presented to the users. With this in mind, we propose a class of objective functions over matrix factorisations which primarily represent a smooth surrogate for the real AUC, and in a special case we show how to prioritise the top of the list. The objectives are differentiable and optimised through a carefully designed stochastic gradient-descent-based algorithm which scales linearly with the size of the data. In the special case of the square loss we show how to improve computational complexity by leveraging previously computed measures. To understand the underlying matrix factorisation approaches theoretically, we study both the consistency of the loss functions with respect to AUC and generalisation using Rademacher theory. The resulting generalisation analysis gives strong motivation for the optimisation under study. Finally, we provide computational results demonstrating the efficacy of the proposed method using synthetic and real data.

Charanpal Dhanjal and Stéphan Clémençon ({charanpal.dhanjal, stephan.clemencon}@telecom-paristech.fr), Romaric Gaudel (romaric.gaudel@univ-lille3.fr)

1. Introduction

A recommendation system [1, 16] takes a set of items that a user has rated and recommends new items that the user may like in the future. Such systems have a broad range of applications such as recommending books [17], CDs [17], movies [27, 3] and news [7]. To formalise the recommendation problem, let $\mathcal{U}$ be the set of all users and let $\mathcal{I}$ be the set of items that can be recommended. Each user $u$ rates item $i$ with a value $r_{ui}$ which measures whether item $i$ is useful to $u$. In this work, we consider the implicit recommendation problem in which $r_{ui} \in \{-1, +1\}$. For example, the rating value represents whether a person has read a particular research paper, or bought a product. We are thus presented with a set of observations $\{(u, i) : r_{ui} = +1\}$ as a training sample. The aim of recommendation systems is to find a scoring function $s(u, i)$ such that the score is high when user $u$ strongly prefers item $i$. This problem is challenging for a number of reasons: we are interested only in the top few items for each user, there is often a large fraction of missing observations (irrelevant items are generally unknown) and the sample of rated items is often drawn from a non-uniform distribution.

To address these issues, we propose a general framework for obtaining strong rankings of items based on the theoretically well-studied local Area Under the ROC Curve (AUC) metric. The framework relies on matrix factorisation to represent parameters, which generally performs well and scales to large numbers of users and items. One essentially finds a low-rank approximation of the unobserved entries according to an AUC-based objective and several different surrogate loss functions. In addition we show how to focus the optimisation on the items placed at the top of the list. The resulting methods have smooth differentiable objective functions and can be solved using stochastic gradient descent. We additionally show how to perform the optimisation in parallel so it can make effective use of modern multicore CPUs. The novel algorithms are studied empirically in relation to other state-of-the-art matrix factorisation methods on a selection of synthetic and real datasets. We also answer some theoretical questions about the proposed methods. The first question is whether optimising the surrogate functions will result in improving the AUC. The second question concerns a generalisation analysis of the matrix factorisation approaches using Rademacher theory [2], a data-dependent approach to obtaining error bounds. The analysis sheds some light onto whether the quantities optimised on a training sample will generalise to unseen observations and provides a bound between these values. Note that a preliminary version of this paper was presented in [9]; here we extend that work by considering a much larger class of objectives, adding the theoretical study, and providing further empirical analysis.

This paper is organised as follows. In the next section we motivate and derive the Matrix Factorisation with AUC (MFAUC) framework and present its specialisation according to a variety of objective functions. In Section 3 we present the theoretical study on consistency and generalisation of the proposed approaches. This is followed by a review of related work on matrix factorisation methods, including those based on rank-based losses that emphasise the top of the list. The MFAUC algorithm is then evaluated empirically in Section 5, and finally we conclude with a summary and some perspectives.

Notation: A bold uppercase letter represents a matrix, e.g. X, and a column vector is displayed using a bold lowercase letter, e.g. x. The transpose of a matrix or vector is written $\mathbf{X}^T$ or $\mathbf{x}^T$. The indicator function is given by $\mathbb{I}(\cdot)$, so that it is 1 when its input is true and otherwise it is 0. The all-ones vector is denoted by j and the corresponding matrix is J.

2. Maximising AUC

The Area Under the ROC Curve is a common way to evaluate rankings. Consider a user $u$; then the AUC is the expectation that a uniformly drawn positive item is scored higher than a uniformly drawn negative item under a scoring function $s$,

$\text{AUC}(u) = \mathbb{E}_{i \sim \mathcal{D}_u^+,\, j \sim \mathcal{D}_u^-}\left[\mathbb{I}\left(s(u, i) > s(u, j)\right)\right]$,   (1)

where $\mathcal{D}_u^+$ ($\mathcal{D}_u^-$) is a distribution over the positive (negative) items for $u$, $\mathbb{I}(\cdot)$ is the indicator function that is 1 when its input is true and otherwise 0, and we assume scores are never equal. One cannot in general maximise the AUC directly since the indicator function is non-smooth and non-differentiable and hence difficult to optimise. Our observations consist only of positive relevance for items, thus we maximise a quantity closely related to the AUC for all users, which is practical to optimise. The missing observations are considered as negative, as in [6, 25] for example.

Accuracy experienced by users is closely related to performance on complete data rather than available data [25], and thus the distribution of items is an important consideration in a practical recommendation system. This implies that a non-uniform sampling of items might be beneficial. Consider a user $u$ and a scoring function $s$; then the empirical AUC for this user is:

$\widehat{\text{AUC}}(u) = \sum_{i \in \omega_u} \sum_{j \in \bar{\omega}_u} p_i\, q_j\, \mathbb{I}\left(s(u, i) > s(u, j)\right)$,   (2)

where $\omega_u$ is the set of indices of relevant items for user $u$, $\bar{\omega}_u$ is the complement of $\omega_u$, and $p$ and $q$ are distributions on the relevant and irrelevant items respectively. The most intuitive choices for $p$ and $q$ are the uniform distributions. On real datasets, however, item ratings often have a long-tail distribution so that a small number of items represent large proportions of the total number of ratings. This opens up the question as to whether the so-called popularity bias might be an advantage to recommender systems that predict using the same distribution, and we return to this point later.
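To make Equation 2 concrete, the following is a minimal Python/NumPy sketch of the empirical AUC for a single user; all variable names are ours, and `p` and `q` default to the uniform distributions mentioned above.

```python
import numpy as np

def empirical_auc(scores, relevant, p=None, q=None):
    """Empirical AUC of Equation 2 for one user.

    scores   : array of shape (n,) with s(u, i) for every item i
    relevant : boolean mask of shape (n,), True for items in omega_u
    p, q     : optional distributions over relevant/irrelevant items
               (uniform when omitted)
    """
    pos, neg = scores[relevant], scores[~relevant]
    p = np.full(len(pos), 1.0 / len(pos)) if p is None else p
    q = np.full(len(neg), 1.0 / len(neg)) if q is None else q
    # Sum of p_i * q_j over all pairs where the positive item outscores
    # the negative one (ties are assumed never to occur).
    indicator = pos[:, None] > neg[None, :]
    return p @ indicator @ q

# Example: 5 items, items 0 and 3 are relevant for this user.
scores = np.array([2.1, -0.3, 0.5, 1.7, -1.0])
relevant = np.array([True, False, False, True, False])
print(empirical_auc(scores, relevant))  # 1.0: both positives outrank all negatives
```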

The AUC certainly captures a great deal of information about the ranking of items for each user; however, as pointed out by previous authors, in practice one is only interested in the top items in that ranking. Two scoring functions $s_1$ and $s_2$ could have identical AUC values and yet, for example, $s_1$ might score badly on the top few items but recover lower down the list. One way of measuring this behaviour is through local AUC [5], which is the expectation that a positive item is scored higher than a negative one, given that the positive one is in the top $t$-th quantile. This is equivalent to saying that we ignore positive items low down in the ranking.

2.1. A General Framework

As previously stated, direct maximisation of the AUC is not practical and hence we use a selection of surrogate functions to approximate the indicator function. Here a scoring function is found for all users and items; in particular, the score for the $u$th user and $i$th item is given by $s(u, i) = \mathbf{u}_u^T \mathbf{v}_i$, where $\mathbf{U} \in \mathbb{R}^{m \times k}$ is a matrix of user factors and $\mathbf{V} \in \mathbb{R}^{n \times k}$ is a matrix of item factors. We will refer to the $u$th and $i$th rows of these matrices as $\mathbf{u}_u$ and $\mathbf{v}_i$ respectively. Furthermore, let $\mathbf{X} \in \{-1, +1\}^{m \times n}$ be a matrix of ratings such that $x_{ui} = +1$ if user $u$ finds item $i$ relevant and $x_{ui} = -1$ otherwise, i.e. if the item is missing or irrelevant. In a sense which is made clear later, X is an approximation of the complete structure of learner scores so that $\mathbf{X} \approx \mathbf{U}\mathbf{V}^T$. The advantage of this matrix factorisation strategy is that the scoring function has $(m + n)k$ parameters and hence scales linearly with both the number of users and items.

The framework we propose is composed principally of solving the following general optimisation:

$\min_{\mathbf{U}, \mathbf{V}} \;\; \frac{1}{m} \sum_{u=1}^{m} \sum_{i \in \omega_u} p_i\, \phi\!\left( \sum_{j \in \bar{\omega}_u} q_j\, L(\gamma_{uij}) \right) + \frac{\lambda}{2}\left( \|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2 \right)$,   (3)

for a user-defined regularisation parameter $\lambda$, loss function $L$, rank weighting function $\phi$, item difference $\gamma_{uij} = \mathbf{u}_u^T\mathbf{v}_i - \mathbf{u}_u^T\mathbf{v}_j$, and distributions $p$ and $q$ for relevant and irrelevant items. The relevant items for the $u$th user are given by the set $\omega_u$ and irrelevant/missing items are indexed in $\bar{\omega}_u$. The first sum in the first term averages over users and the second sum computes ranking losses over positive items for the $u$th user. The second term is a penalisation on the Frobenius norms ($\|\cdot\|_F$) of the user and item factor matrices. Concrete item distributions $p$ and $q$ are discussed later in this section, but a simple case is using the uniform distributions $p_i = 1/|\omega_u|$ and $q_j = 1/|\bar{\omega}_u|$ respectively.

The loss function $L$ can be specialised to one of the following options ($\beta > 0$ is a user-defined parameter):

- the square loss $L(\gamma) = (1 - \gamma)^2$,
- the hinge loss $L(\gamma) = \max(0, 1 - \gamma)$,
- the squared hinge loss $L(\gamma) = \max(0, 1 - \gamma)^2$,
- the logistic loss $L(\gamma) = \log(1 + e^{-\beta\gamma})$,
- the sigmoid loss $L(\gamma) = 1/(1 + e^{\beta\gamma})$.

See Figure 1 for a graphical representation of each of these loss functions. On binary classification the squared hinge loss is shown to provide a strong AUC on both training and test data in [29]. Furthermore, the objective becomes convex in U and in V (though not jointly) when $\phi$ is the identity. The square loss is shown to be consistent with the AUC in [11] and the squared hinge loss is shown to be consistent in [12]. In contrast, the hinge loss is not consistent with the AUC. The sigmoid function is perhaps the best approximation of the negative indicator function: as $\beta \to \infty$ it approaches the indicator function, and hence we get an objective exactly corresponding to maximising the AUC. The sigmoid function is used in conjunction with the AUC for bipartite ranking in [14]. As noted in that paper, if $\beta$ is too small then the corresponding objective is a poor approximation of the AUC, and if it is too large then the objective approaches a sum of step functions, making it hard to perform gradient descent. To alleviate this problem, the training data is used to choose a series of increasing $\beta$ values.
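The sketch below collects these surrogates as Python functions of the score difference $\gamma$; the exact parameterisation by $\beta$ follows our reading of the definitions above.

```python
import numpy as np

# Surrogate losses applied to the score difference gamma = s(u,i) - s(u,j).
# beta is the user-defined sharpness parameter of the text.
def square(gamma):             return (1.0 - gamma) ** 2
def hinge(gamma):              return np.maximum(0.0, 1.0 - gamma)
def squared_hinge(gamma):      return np.maximum(0.0, 1.0 - gamma) ** 2
def logistic(gamma, beta=1.0): return np.log(1.0 + np.exp(-beta * gamma))
def sigmoid(gamma, beta=1.0):  return 1.0 / (1.0 + np.exp(beta * gamma))

gammas = np.linspace(-2.0, 2.0, 5)
for loss in (square, hinge, squared_hinge, logistic, sigmoid):
    print(loss.__name__, np.round(loss(gammas), 3))
```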

Figure 1. Plot of the loss functions.

For the identity weighting function $\phi(x) = x$, one can see that the first term in the objective follows directly from the AUC (Equation 2) by replacing the indicator function with the appropriate loss. The optimisation in this case is convex, in the square and squared hinge loss cases, in U and in V but not in both simultaneously. This form of the objective, however, does not take into account the preference for ranking items correctly at the top of the list. Consider instead $\phi(x) = \tanh(\rho x)$ for $x \geq 0$ and fixed $\rho > 0$, which is a concave function. The term inside $\phi$ in Optimisation 3 represents a ranking loss for the $u$th user and $i$th item, and thus item $i$ is high up in the ranked list if this quantity is small. The effect of choosing the hyperbolic tangent is that items with small losses have larger gradients towards the optimal solution and hence are prioritised.

The distributions on the items allow some flexibility in the expectation we ultimately compute. Although the obvious choice is a uniform distribution, in practice relevant item distributions often follow the so-called power law given by $p_i \propto c_i^{\alpha}$ for some exponent $\alpha$, where $c_i$ is the number of times relevant item $i$ occurs in the complete (fully observed) data. In words, the power law says that there are a few items which are rated very frequently whereas most items are not frequently rated, corresponding to a bias towards observing a rating for popular items. However, recommendations for less well-known items are considered valuable for users. Since we have incomplete data, the generating distribution can be modelled in a similar way to that of the complete data (see e.g. [26] for further motivation),

$p_i \propto \hat{c}_i^{\,\alpha}$,

where $\hat{c}_i$ is the empirical probability of observing $x_{ui} = +1$ and $\alpha$ is an exponent used to control the weight of items in the objective. The irrelevant and missing items form the complement of the relevant ones and hence we have

$q_j \propto (1 - \hat{c}_j)^{\alpha}$.

Notice that when $\alpha = 0$ we sample from the uniform distribution. Since we expect $\hat{c}_i$ to be related to $c_i$ with a power law, this implies that when $\alpha < 0$ we give preference to items for which $\hat{c}_i$ is small and hence focus on less popular items.
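Here is a small sketch of how the two sampling weights can be built from observed frequencies. The form of `q` mirrors our reconstruction of the complement construction above; the clipping guard is an implementation detail we add, and in the text $p$ and $q$ are per-user distributions, so these global weights would be renormalised over each user's $\omega_u$ and $\bar{\omega}_u$.

```python
import numpy as np

def item_distributions(X, alpha):
    """Power-law sampling weights from empirical item frequencies."""
    c_hat = np.clip(X.mean(axis=0), 1e-6, 1 - 1e-6)  # empirical item probabilities
    p = c_hat ** alpha              # relevant items: c_hat_i ** alpha
    q = (1.0 - c_hat) ** alpha      # assumed complement for irrelevant/missing items
    return p / p.sum(), q / q.sum()

rng = np.random.default_rng(0)
X = (rng.random((100, 6)) < [0.6, 0.4, 0.3, 0.2, 0.1, 0.05]).astype(float)
p, q = item_distributions(X, alpha=-0.5)
print(np.round(p, 3))  # alpha < 0: rarer items receive larger sampling weight
```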

2.2. Optimisation Algorithms

To solve the above objectives one can use gradient descent methods which rely on computing the derivatives with respect to the parameters U and V. Here we present the derivatives for particular choices of loss $L$ and weighting function $\phi$, starting with the squared hinge loss and $\phi(x) = x$. Denote the objective function of Optimisation 3 by $\theta$; then the derivatives are

$\frac{\partial \theta}{\partial \mathbf{u}_u} = -\frac{2}{m} \sum_{i \in \omega_u} p_i \sum_{j \in \bar{\omega}_u} q_j\, h_{uij} (\mathbf{v}_i - \mathbf{v}_j) + \lambda \mathbf{u}_u$   (4)

and

$\frac{\partial \theta}{\partial \mathbf{v}_t} = -\frac{2}{m} \sum_{u=1}^{m} \left( \mathbb{I}(t \in \omega_u)\, p_t \sum_{j \in \bar{\omega}_u} q_j\, h_{utj} - \mathbb{I}(t \in \bar{\omega}_u)\, q_t \sum_{i \in \omega_u} p_i\, h_{uit} \right) \mathbf{u}_u + \lambda \mathbf{v}_t$,   (5)

where $h_{uij} = \max(0, 1 - \gamma_{uij})$ and for convenience we use the notation $\gamma_{uij} = \mathbf{u}_u^T(\mathbf{v}_i - \mathbf{v}_j)$. The squared loss is identical except that $h_{uij} = 1 - \gamma_{uij}$ in this case. It is worth noting that the ranking term of the derivative with respect to $\mathbf{u}_u$ in the squared loss case can be written as:

$\sum_{i \in \omega_u} p_i \sum_{j \in \bar{\omega}_u} q_j\, h_{uij} (\mathbf{v}_i - \mathbf{v}_j) = (\bar{\mathbf{v}}_p - \bar{\mathbf{v}}_q) - \left( \mathbf{C}_p - \bar{\mathbf{v}}_p \bar{\mathbf{v}}_q^T - \bar{\mathbf{v}}_q \bar{\mathbf{v}}_p^T + \mathbf{C}_q \right) \mathbf{u}_u$,   (6)

where $\bar{\mathbf{v}}_p = \sum_{i \in \omega_u} p_i \mathbf{v}_i$, $\bar{\mathbf{v}}_q = \sum_{j \in \bar{\omega}_u} q_j \mathbf{v}_j$, $\mathbf{C}_p = \sum_{i \in \omega_u} p_i \mathbf{v}_i \mathbf{v}_i^T$ and $\mathbf{C}_q = \sum_{j \in \bar{\omega}_u} q_j \mathbf{v}_j \mathbf{v}_j^T$ are empirical expectations. The derivative with respect to $\mathbf{v}_t$ can be treated in a similar way:

$\sum_{j \in \bar{\omega}_u} q_j\, h_{utj} = 1 - \mathbf{u}_u^T \mathbf{v}_t + \mathbf{u}_u^T \bar{\mathbf{v}}_q \quad \text{and} \quad \sum_{i \in \omega_u} p_i\, h_{uit} = 1 - \mathbf{u}_u^T \bar{\mathbf{v}}_p + \mathbf{u}_u^T \mathbf{v}_t$,   (7)

so that the inner sums of Equation 5 reduce to inner products with the same expectations $\bar{\mathbf{v}}_p$ and $\bar{\mathbf{v}}_q$. Thus we have inexpensive ways to compute derivatives in the square loss case provided we have access to the expectations of particular vectors.
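Under our reconstruction of Equation 6, the per-user gradient can be assembled from the moments $\bar{\mathbf{v}}_p$, $\bar{\mathbf{v}}_q$, $\mathbf{C}_p$ and $\mathbf{C}_q$ alone, without forming the pairwise sum. The sketch below (names are ours; the $1/m$ factor is omitted) verifies this against the naive double sum.

```python
import numpy as np

def square_loss_grad_u(u, V, relevant, p, q, lam):
    """Per-user gradient of the square-loss objective (phi(x) = x) w.r.t.
    u_u, using only first and second moments of the item factors."""
    Vp, Vq = V[relevant], V[~relevant]
    vbar_p, vbar_q = p @ Vp, q @ Vq              # E_p[v], E_q[v]
    Cp = Vp.T @ (p[:, None] * Vp)                # E_p[v v^T]
    Cq = Vq.T @ (q[:, None] * Vq)                # E_q[v v^T]
    M = Cp - np.outer(vbar_p, vbar_q) - np.outer(vbar_q, vbar_p) + Cq
    return -2.0 * ((vbar_p - vbar_q) - M @ u) + lam * u

# Sanity check against the naive double sum over item pairs.
rng = np.random.default_rng(0)
k, n = 3, 8
u, V = rng.normal(size=k), rng.normal(size=(n, k))
relevant = np.array([True, True, False, False, False, True, False, False])
p = np.full(relevant.sum(), 1.0 / relevant.sum())
q = np.full((~relevant).sum(), 1.0 / (~relevant).sum())
naive = sum(pi * qj * -2.0 * (1.0 - u @ (vi - vj)) * (vi - vj)
            for pi, vi in zip(p, V[relevant])
            for qj, vj in zip(q, V[~relevant])) + 0.1 * u
print(np.allclose(naive, square_loss_grad_u(u, V, relevant, p, q, lam=0.1)))  # True
```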

The logistic and sigmoid losses are similar functions and their derivatives are computed as follows ($\beta$ is a user-defined parameter and $\gamma_{uij}$ is as before):

$\frac{\partial \theta}{\partial \mathbf{u}_u} = -\frac{1}{m} \sum_{i \in \omega_u} p_i \sum_{j \in \bar{\omega}_u} q_j\, c_{uij} (\mathbf{v}_i - \mathbf{v}_j) + \lambda \mathbf{u}_u$   (8)

and

$\frac{\partial \theta}{\partial \mathbf{v}_t} = -\frac{1}{m} \sum_{u=1}^{m} \left( \mathbb{I}(t \in \omega_u)\, p_t \sum_{j \in \bar{\omega}_u} q_j\, c_{utj} - \mathbb{I}(t \in \bar{\omega}_u)\, q_t \sum_{i \in \omega_u} p_i\, c_{uit} \right) \mathbf{u}_u + \lambda \mathbf{v}_t$,   (9)

in which $c_{uij} = \beta / (1 + e^{\beta \gamma_{uij}})$ for the logistic loss and $c_{uij} = \beta e^{\beta \gamma_{uij}} / (1 + e^{\beta \gamma_{uij}})^2$ for the sigmoid loss.

We now consider the case in which we have the weighting function $\phi(x) = \tanh(\rho x)$ on the item losses, in conjunction with a square or squared hinge loss. The gradient with respect to $\mathbf{u}_u$ of the objective is

$\frac{\partial \theta}{\partial \mathbf{u}_u} = -\frac{2\rho}{m} \sum_{i \in \omega_u} p_i\, \mathrm{sech}^2(\rho\, \ell_{ui}) \sum_{j \in \bar{\omega}_u} q_j\, h_{uij} (\mathbf{v}_i - \mathbf{v}_j) + \lambda \mathbf{u}_u$,

where $\ell_{ui} = \sum_{j \in \bar{\omega}_u} q_j\, L(\gamma_{uij})$ is the ranking loss of item $i$ for user $u$, and the corresponding gradient with respect to $\mathbf{v}_t$ is

$\frac{\partial \theta}{\partial \mathbf{v}_t} = -\frac{2\rho}{m} \sum_{u=1}^{m} \left( \mathbb{I}(t \in \omega_u)\, p_t\, \mathrm{sech}^2(\rho\, \ell_{ut}) \sum_{j \in \bar{\omega}_u} q_j\, h_{utj} - \mathbb{I}(t \in \bar{\omega}_u)\, q_t \sum_{i \in \omega_u} p_i\, \mathrm{sech}^2(\rho\, \ell_{ui})\, h_{uit} \right) \mathbf{u}_u + \lambda \mathbf{v}_t$,

with $h_{uij} = \max(0, 1 - \gamma_{uij})$ in the squared hinge loss case and $h_{uij} = 1 - \gamma_{uij}$ for the square loss. When we compare the above derivatives to Equations 4 and 5 for the square/squared hinge loss functions we can see that there is an additional weighting on the relevant items given by $\mathrm{sech}^2(\rho\, \ell_{ui})$, which naturally increases the size of the gradient for items with small losses and hence those high up in the ranking.
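A tiny numeric illustration of this effect: the factor $\rho\,\mathrm{sech}^2(\rho x)$ decays quickly with the ranking loss $x$, so items near the top of the list dominate the gradient. The loss values below are made up for illustration.

```python
import numpy as np

# Weighting induced by phi(x) = tanh(rho * x):
# d/dx tanh(rho * x) = rho * sech(rho * x)**2, largest when the loss x is small.
rho = 1.0
losses = np.array([0.05, 0.5, 2.0, 5.0])     # hypothetical per-item ranking losses
weights = rho / np.cosh(rho * losses) ** 2   # sech^2 written via cosh
print(np.round(weights, 4))                  # [0.9975 0.7864 0.0707 0.0002]
```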

Thus we have presented the derivatives for each of the loss functions we proposed earlier in this section, as well as the hyperbolic tangent weighting scheme. The derivative with respect to $\mathbf{v}_t$ is generally the most costly to find: its cost is linear in the number of users and in the sizes of $\omega_u$ and $\bar{\omega}_u$, and hence the derivative with respect to the complete matrix V is a further factor of $n$ more expensive. The complexity of the derivative with respect to U is identical, although in practice it is quicker to compute. Fortunately, the expectations for both derivatives can be estimated effectively using a subsample of the users and of the items from $\omega_u$ and $\bar{\omega}_u$. This reduces the complexity per iteration of computing the derivative with respect to V to a quantity depending only on the subsample sizes and $k$.

Using these approximate derivatives we can apply stochastic gradient descent to solve Optimisation 3. Define $\hat{\theta}$ as the approximate subsampled objective; one then walks in the negative direction of the approximate derivatives using average stochastic gradient descent (average SGD, [22]). Algorithm 1 shows the pseudo code for solving the optimisation. Given initial values of U and V (using random Gaussian elements, for example), we define an iteration as a sequence of gradient steps using random permutations of the user and item indices. It follows that each iteration updates all of the rows of U and V at least once according to the learning rate $\alpha$ and the corresponding derivative. Note in line 6 that if the counter exceeds the size of the corresponding permutation vector we wrap around to its start. Under average SGD one maintains, during the optimisation, matrices $\bar{\mathbf{U}}$ and $\bar{\mathbf{V}}$ which are weighted averages of the user and item factors. These matrices can be shown to converge more rapidly than U and V; see [22] for further details. The algorithm concludes when the maximum number of iterations has been reached or convergence is achieved, judged by the average objective value over the most recent iterations. In the following text we refer to this algorithm as Matrix Factorisation with AUC (MFAUC).

0:  Ratings X, initial solutions U, V, maximum iterations $T$, average start $t_0$, learning rate $\alpha$, convergence threshold $\epsilon$
1:  $t = 0$ and $\delta = \infty$
2:  Vector z is a permutation of $\{1, \ldots, m\}$ and vector w is a permutation of $\{1, \ldots, n\}$
3:  $\bar{\mathbf{U}} = \mathbf{U}$, $\bar{\mathbf{V}} = \mathbf{V}$, $s = \max(m, n)$
4:  while  $t \leq T$ and $\delta > \epsilon$  do
5:     for  $r = 1, \ldots, s$  do
6:        Let $u = z_r$ and $i = w_r$
7:        $\mathbf{u}_u = \mathbf{u}_u - \alpha\, \partial\hat{\theta}/\partial\mathbf{u}_u$ and $\mathbf{v}_i = \mathbf{v}_i - \alpha\, \partial\hat{\theta}/\partial\mathbf{v}_i$
8:        if  $t \geq t_0$  then
9:           update the running averages $\bar{\mathbf{u}}_u$ and $\bar{\mathbf{v}}_i$
10:        else
11:           $\bar{\mathbf{u}}_u = \mathbf{u}_u$ and $\bar{\mathbf{v}}_i = \mathbf{v}_i$
12:        end if
13:     end for
14:     Update $t = t + 1$ and $\delta$, the change in the average objective over the most recent iterations
15:  end while
16:  return  Solutions $\bar{\mathbf{U}}$, $\bar{\mathbf{V}}$
Algorithm 1 Pseudo code for Matrix Factorisation with AUC
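To make the pseudo code concrete, here is a minimal single-threaded Python sketch of the MFAUC loop under simplifying assumptions stated in the docstring: logistic surrogate, uniform sampling of one positive/negative pair per user visit, and no Polyak averaging. All names are ours.

```python
import numpy as np

def mfauc_sgd(X, k=8, T=100, alpha=0.05, lam=0.01, beta=2.0, seed=0):
    """Sketch of Algorithm 1: SGD on the logistic surrogate of the AUC.
    The averaging step (lines 8-12 of the pseudo code) is omitted."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    for _ in range(T):
        for u in rng.permutation(m):                 # permuted user indices
            omega = np.flatnonzero(X[u] > 0)
            omega_bar = np.flatnonzero(X[u] <= 0)
            if omega.size == 0 or omega_bar.size == 0:
                continue
            i, j = rng.choice(omega), rng.choice(omega_bar)
            gamma = U[u] @ (V[i] - V[j])
            c = beta / (1.0 + np.exp(beta * gamma))  # -L'(gamma), logistic loss
            du = c * (V[i] - V[j]) - lam * U[u]      # negative approximate gradients
            dvi = c * U[u] - lam * V[i]
            dvj = -c * U[u] - lam * V[j]
            U[u] += alpha * du                       # gradient steps (line 7)
            V[i] += alpha * dvi
            V[j] += alpha * dvj
    return U, V

# Toy usage: two user groups with disjoint tastes.
X = np.zeros((50, 20)); X[:25, :10] = 1; X[25:, 10:] = 1
U, V = mfauc_sgd(X)
print(U[0] @ V[5] > U[0] @ V[15])  # a relevant item should outscore an irrelevant one
```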

Parallel Algorithm

A disadvantage of the optimisation just described is that it cannot easily be implemented to make use of modern parallel computer architectures, as each intermediate solution depends on the previous one. To compensate, we build upon the Distributed SGD (DSGD) algorithm of [13]. The algorithm works on the assumption that the loss function we are optimising can be split up into a sum of local losses, each local loss working on a subset of the elements of X. The algorithm proceeds by partitioning X into fixed blocks (sub-matrices), and each block is processed by a separate process/thread, ensuring that no two concurrent blocks share the same row or column. This constraint is required so that updates to the rows of U and V are independent; see Figure 2. The blocks do not have to be contiguous or of uniform size, and in our implementation blocks are sized so that the number of nonzero elements within each one is approximately constant. The algorithm terminates after every block has been processed at least a given number of times.

Figure 2. Illustration of the principle behind DSGD. Each of the two large squares represents a matrix divided into blocks, and each black square represents a process working on a block.

For AUC-based losses we require a pairwise comparison of items for each row, and therefore the loss cannot easily be split into local losses. To get around this issue we modify DSGD as follows: at each iteration we randomly assign rows/columns to blocks, i.e. blocks are no longer fixed throughout the algorithm. We then concurrently compute empirical estimates of the loss above within all blocks by restricting the corresponding positive and negative items, and repeat this process until convergence or until a maximum number of iterations is reached.
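The sketch below shows the block scheduling implied by this modification: rows and columns are re-permuted each pass, and tasks within a round touch disjoint rows and columns so they can run concurrently. The loss estimate computed inside each block is left abstract, and all names are ours.

```python
import numpy as np

def dsgd_rounds(m, n, n_blocks, rng):
    """One pass of the modified DSGD schedule: the b-th task of each round
    pairs row block b with column block (b + shift) mod n_blocks, so no
    two concurrent tasks share rows or columns of X."""
    row_groups = np.array_split(rng.permutation(m), n_blocks)
    col_groups = np.array_split(rng.permutation(n), n_blocks)
    for shift in range(n_blocks):
        # Each yielded list of tasks can be dispatched to worker processes.
        yield [(row_groups[b], col_groups[(b + shift) % n_blocks])
               for b in range(n_blocks)]

rng = np.random.default_rng(0)
for round_id, tasks in enumerate(dsgd_rounds(m=6, n=6, n_blocks=3, rng=rng)):
    print(round_id, [(r.tolist(), c.tolist()) for r, c in tasks])
```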

3. Consistency and Generalisation

In this section we will discuss issues relating to the consistency of the above optimisation, as well as the generalisation performance on unseen data. First we address the question of whether optimising a surrogate of the AUC will lead to improving the AUC itself. We draw upon the work of [12] to show that our chosen loss functions are consistent. In the bipartite ranking case the squared hinge loss is shown to be consistent in [12] and a similar result for the square loss is proven in [11]. Therefore we concentrate on the sigmoid and logistic functions and additionally show how the consistency results apply to the matrix factorisation scenario we consider in this paper.

To begin we define consistency in a more formal sense, starting with a more general definition of the AUC than Equation 1. In this case, the scoring function can result in the same scores for different items,

$\text{AUC}(f) = \mathbb{E}\left[ \mathbb{I}(f(i) > f(j)) + \tfrac{1}{2}\, \mathbb{I}(f(i) = f(j)) \,\middle|\, y_i = +1,\, y_j = -1 \right]$,

where $y_i, y_j$ are the labels respectively of items $i, j$. We can alternatively state the AUC in terms of a risk, so that maximising AUC corresponds to minimising this quantity:

$R(f) = \mathbb{E}\left[ \ell_{0/1}(f, i, j) \,\middle|\, y_i = +1,\, y_j = -1 \right]$,

in which $\ell_{0/1}(f, i, j) = \mathbb{I}(f(i) < f(j)) + \tfrac{1}{2}\, \mathbb{I}(f(i) = f(j))$. If we define $\eta(i) = P(y_i = +1 \mid i)$ then we see that the risk can be expressed (up to a normalisation constant) as

$R(f) = \mathbb{E}_{i, j \sim \mu}\left[ \eta(i)(1 - \eta(j))\, \ell_{0/1}(f, i, j) + \eta(j)(1 - \eta(i))\, \ell_{0/1}(f, j, i) \right]$,

where $\eta$ is the conditional probability of a positive label and $\mu$ is the marginal distribution over items. Notice that if we assume that $\eta(i) > \eta(j)$ then we would prefer a function such that $f(i) > f(j)$, and a similar result holds if we swap $i$ and $j$. This allows us to introduce the Bayes risk $R^*$, which is defined as follows:

$R^* = \inf_{f} R(f)$,

where $f$ is chosen from all measurable functions. Since we replace the indicator with a surrogate loss function $\ell$, we can write the corresponding risk as

$R_\ell(f) = \mathbb{E}_{i, j \sim \mu}\left[ \eta(i)(1 - \eta(j))\, \ell(f(i) - f(j)) + \eta(j)(1 - \eta(i))\, \ell(f(j) - f(i)) \right]$,

and the optimal scoring function with respect to this risk attains $R^*_\ell = \inf_f R_\ell(f)$. With these definitions, we can define consistency.

Definition 3.1 (Consistency, [12]).

The surrogate loss $\ell$ is said to be consistent with AUC if for every sequence $\{f_n\}_{n \geq 1}$, the following holds over all distributions on the items and labels:

$R_\ell(f_n) \to R^*_\ell \quad \Rightarrow \quad R(f_n) \to R^*$.

Thus, a consistent loss function will lead to an optimal Bayes risk in the limit of an infinite sample size. Furthermore, a sufficient condition is given for AUC consistency as follows.

Theorem 3.1 (Sufficiency for AUC consistency, [12]).

The surrogate loss $\ell$ is consistent with AUC if $\ell$ is a convex, differentiable and non-increasing function with $\ell'(0) < 0$.

These theorems allow us to prove that the sigmoid and logistic losses are consistent with AUC.

Theorem 3.2.

Both the sigmoid loss $\ell(\gamma) = 1/(1 + e^{\beta\gamma})$, $\beta > 0$, and the logistic loss $\ell(\gamma) = \log(1 + e^{-\beta\gamma})$ are consistent with AUC.

Proof.

We will start with the logistic function. Assume that $\beta > 0$ and $\gamma$ is finite; then

$\ell'(\gamma) = \frac{-\beta e^{-\beta\gamma}}{1 + e^{-\beta\gamma}} = \frac{-\beta}{1 + e^{\beta\gamma}} < 0$,

which implies a non-increasing function; in particular the derivative at zero is $\ell'(0) = -\beta/2 < 0$. In addition, the logistic loss is convex since for all finite $\gamma$,

$\ell''(\gamma) = \frac{\beta^2}{1 + e^{\beta\gamma}} \left( \frac{e^{\beta\gamma}}{1 + e^{\beta\gamma}} \right) > 0$,

where the first term in the derivative is greater than zero and the second term in parenthesis is between 0 and 1. An application of Theorem 3.1 gives the required result.

For the sigmoid function, consider a set of items $i_1, \ldots, i_n$ with corresponding conditional probabilities $\eta_1, \ldots, \eta_n$ of being positive; then we can write the risk as (we use the notation $f_a = f(i_a)$ for convenience, and recall $\ell(\gamma) = 1/(1 + e^{\beta\gamma})$)

$R_\ell(f) = \sum_{a < b} \left[ \eta_a (1 - \eta_b)\, \ell(f_a - f_b) + \eta_b (1 - \eta_a)\, \ell(f_b - f_a) \right]$.

For each pair of terms in this sum, notice first that $\ell(\gamma) + \ell(-\gamma) = 1$ for all $\gamma$. It follows that if $\eta_a (1 - \eta_b) > \eta_b (1 - \eta_a)$ then the first loss should be sent to zero by increasing $f_a - f_b$, otherwise the second loss should be sent to zero, in order to minimise the risk. Therefore, in the limiting case

$f_a > f_b \iff \eta_a (1 - \eta_b) > \eta_b (1 - \eta_a) \iff \eta_a > \eta_b$,

and we have $f_a > f_b$ when $\eta_a > \eta_b$, as in the Bayes risk. ∎
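As a quick sanity check of the logistic part of this proof, the sufficiency conditions of Theorem 3.1 can be verified symbolically. The sketch below uses SymPy and our assumed form of the logistic loss.

```python
import sympy as sp

# Check the Theorem 3.1 conditions for l(gamma) = log(1 + exp(-beta*gamma)),
# beta > 0: non-increasing, convex, and negative derivative at zero.
gamma = sp.Symbol('gamma', real=True)
beta = sp.Symbol('beta', positive=True)
l = sp.log(1 + sp.exp(-beta * gamma))

d1 = sp.simplify(sp.diff(l, gamma))      # -beta/(1 + exp(beta*gamma)) < 0
d2 = sp.simplify(sp.diff(l, gamma, 2))   # beta**2*exp(beta*gamma)/(1+exp(beta*gamma))**2 > 0
print("l'(gamma) =", d1)
print("l'(0)     =", sp.simplify(d1.subs(gamma, 0)))  # -beta/2, strictly negative
print("l''(gamma) =", d2)                # strictly positive for finite gamma: convex
```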

As we have not made any assumptions on scoring functions in our previous results, the AUC is simply generalised to the expectation over users as follows:

$\text{AUC} = \mathbb{E}_{u, i, j \sim \mathcal{D}}\left[ \mathbb{I}(s(u, i) > s(u, j)) \,\middle|\, y_{ui} = +1,\, y_{uj} = -1 \right]$,

where $\mathcal{D}$ is now a distribution over users and items and we redefine the scoring function as $s(u, i) = \mathbf{u}_u^T \mathbf{v}_i$. Thus we have shown that all the loss functions we consider are consistent with the AUC in the multi-user scenario.

3.1. Generalisation Bound

We now turn to another critical question about our algorithm relating to generalisation. In particular, we look at Rademacher theory, which is a data-dependent way of studying generalisation performance. We assume that the observations are sampled from a distribution $\mathcal{D}$ over $\mathcal{U} \times \mathcal{I}$. Recall that only positive ratings are observed; hence the sample $S$ is drawn from the distribution $\mathcal{D}^+$, where we use the shorthand $\mathcal{D}^+$ for $\mathcal{D}$ conditioned on relevant pairs. Now consider a class $\mathcal{F}$ of functions mapping from $\mathcal{U} \times \mathcal{I}$ to $\mathbb{R}$; then the AUC can be maximised by minimising

$\hat{R}(f) = \frac{1}{m} \sum_{u=1}^{m} \frac{1}{|\omega_u||\bar{\omega}_u|} \sum_{i \in \omega_u} \sum_{j \in \bar{\omega}_u} \mathbb{I}\left( f(u, i) < f(u, j) \right)$   (10)

in the case that $p$ and $q$ are uniform. Likewise, we can substitute any class of loss functions to be minimised, for example the surrogate functions defined above. Noting this, we can now present the idea of Rademacher variables, which are independent random variables distributed uniformly on $\{-1, +1\}$. The following definition is an empirical Rademacher average for all users,

$\hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{u=1}^{m} \frac{1}{|\omega_u|} \sum_{i \in \omega_u} \sigma_{ui}\, f(u, i) \right]$,

where the $\sigma_{ui}$'s are independent Rademacher random variables (note the close connection to Equation 10). This expectation is an empirical measure of the capacity of a function class because it measures how well functions in that class can correlate with random data. We define the Rademacher complexity of $\mathcal{F}$ as

$\mathcal{R}(\mathcal{F}) = \mathbb{E}_{S}\left[ \hat{\mathcal{R}}_S(\mathcal{F}) \right]$,

where $S$ is drawn over $\mathcal{D}^+$. In the following lemma we provide a relationship between these two quantities using McDiarmid's Theorem.

Theorem 3.3 (McDiarmid, [19]).

Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$ and assume $f : A^n \to \mathbb{R}$ satisfies

$\sup_{x_1, \ldots, x_n,\, \hat{x}_i \in A} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, \hat{x}_i, \ldots, x_n) \right| \leq c_i$

for all $1 \leq i \leq n$. Then for all $\epsilon > 0$,

$P\left( f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)] \geq \epsilon \right) \leq \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2} \right)$

and

$P\left( \mathbb{E}[f(X_1, \ldots, X_n)] - f(X_1, \ldots, X_n) \geq \epsilon \right) \leq \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2} \right)$.
We show that the Rademacher complexities are related as follows

Theorem 3.4.

Let $\mathcal{F}$ be a function class mapping from $\mathcal{U} \times \mathcal{I}$ to $[-1, 1]$ and let $S$ be a sample of $m$ users drawn from distribution $\mathcal{D}^+$. Then with probability at least $1 - \delta$ the following is true:

$\mathcal{R}(\mathcal{F}) \leq \hat{\mathcal{R}}_S(\mathcal{F}) + \sqrt{\frac{2 \ln(1/\delta)}{m}}$.

Proof.

In addition to $S$, consider a second sample $S'$ that differs from $S$ in a single observation. For notational convenience we denote the nonzero (relevant) elements of the two samples by $\omega_u$ and $\omega'_u$ respectively, and likewise the zero elements by $\bar{\omega}_u$ and $\bar{\omega}'_u$. Define the difference of the empirical Rademacher averages on $S$ and $S'$; the remainder of the proof bounds this difference so that Theorem 3.3 can be applied.