Recommendations and User Agency: The Reachability of Collaboratively-Filtered Information

Abstract

Recommender systems often rely on models which are trained to maximize accuracy in predicting user preferences. When the systems are deployed, these models determine the availability of content and information to different users. The gap between these objectives gives rise to a potential for unintended consequences, contributing to phenomena such as filter bubbles and polarization. In this work, we consider directly the information availability problem through the lens of user recourse. Using ideas of reachability, we propose a computationally efficient audit for top-$N$ linear recommender models. Furthermore, we describe the relationship between model complexity and the effort necessary for users to exert control over their recommendations. We use this insight to provide a novel perspective on the user cold-start problem. Finally, we demonstrate these concepts with an empirical investigation of a state-of-the-art model trained on a widely used movie ratings dataset.

1 Introduction

Recommendation systems influence the way information is presented to individuals for a wide variety of domains including music, videos, dating, shopping, and advertising. On one hand, the near-ubiquitous practice of filtering content by predicted preferences makes the digital information overload possible for individuals to navigate. By exploiting the patterns in ratings or consumption across users, preference predictions are useful in surfacing relevant and interesting content. On the other hand, this personalized curation is a potential mechanism for social segmentation and polarization. The exploited patterns across users may in fact encode undesirable biases which become self-reinforcing when used in feedback to make recommendations.

Recent empirical work shows that personalization on the Internet has a limited effect on political polarization [17], and in fact it can increase the diversity of content consumed by individuals [33]. However, these observations follow by comparison to non-personalized baselines of cable news or well known publishers. In a digital world where all content is algorithmically sorted by default, how do we articulate the tradeoffs involved? In the past year, YouTube has come under fire for promoting disturbing children's content and working as an engine of radicalization [47, 34, 7]. This comes after a push in algorithm development towards reaching 1 billion hours of watchtime per day; over 70% of views now come from recommended videos [14].

The YouTube controversy is an illustrative example of potential pitfalls when putting large scale machine learning-based systems in feedback with people, and it highlights the importance of creating analytical tools to anticipate and prevent undesirable behavior. Such tools should seek to quantify the degree to which a recommender system will meet the information needs of its users or of society as a whole, where these "information needs" must be carefully defined to include notions like relevance, coverage, and diversity. An important approach involves the empirical evaluation of these metrics by simulating recommendations made by models once they are trained [15]. In this work we develop a complementary approach which differs in two major ways: first, we directly analyze the predictive model, making it possible to understand underlying mechanisms; second, our evaluation considers a range of possible user behaviors rather than a static snapshot.

Drawing conclusions about the likely effects of recommendations involves treating humans as a component within the system, and the validity of these conclusions hinges on modeling human behavior. We propose an alternative evaluation that favors the agency of individuals over the limited perspective offered by behavioral predictions. Our main focus is on questions of possibility: to what extent can someone be pigeonholed by their viewing history? What videos may they never see, even after a drastic change in viewing behavior? And how might a recommender system encode biases in a way that effectively limits the available library of content?

This perspective brings user agency into the center, prioritizing the ability of models to be as adaptable as they are accurate, able to accommodate arbitrary changes in the interests of individuals. Studies find positive effects of allowing users to exert greater control in recommendation systems [52, 23]. While there are many system-level or post-hoc approaches to incorporating user feedback, we focus directly on the machine learning model that powers recommendations.

Contributions

In this paper, we propose a definition of user recourse and item availability for recommender systems. This perspective extends the notion of recourse proposed by Ustun et al. [48] to multiclass classification settings and specializes to concerns most relevant for information retrieval systems. We focus our analysis on top-$N$ recommendations made using linear predictions, a broad class including matrix factorization models. In Section 3 we show how properties of latent user and item representations interact to limit or ensure recourse and availability. This yields a novel perspective on user cold-start problems, where a user with no rating history is introduced to a system. In Section 4, we propose a computationally efficient model audit. Finally, in Section 5, we demonstrate how the proposed analysis can be used as a tool to interpret how learned models will interact with users when deployed.

1.1 Related Work

Recommendation models that incorporate user feedback for online updates have been considered from several different angles. The computational perspective focuses on ensuring that model updates are efficient and fast [24]. The statistical perspective articulates the sampling bias induced by recommendation [5], while practical perspectives identify ways to discard user interactions that are not informative for model updates [8]. Another body of work focuses on the learning problem, seeking to improve the predictive accuracy of models by exploiting the sequential nature of information. This includes strategies like Thompson sampling [26], upper confidence bound approximations for contextual bandits [6, 30], and reinforcement learning [51, 29].

This body of work, and indeed much work on recommender systems, focuses on the accuracy of the model. This encodes an implicit assumption that the primary information needs of users or society are described by predictive performance. Alternative measures proposed in the literature include concepts related to diversity or novelty of recommendations [9]. Directly incorporating these objectives into a recommender system might include further predictive models of users, e.g. to determine whether they are “challenge averse” or “diversity seeking” [46]. Further alternative criteria arise from concerns of fairness and bias, and recent work has sought to empirically quantify parity metrics on recommendations [15, 16]. In this work, we focus more directly on agency rather than predictive models or observations.

Most similar to our work is a handful of papers focusing on decision systems through the lens of the agency of individuals. We are most directly inspired by the work of Ustun et al. [48] on actionable recourse for binary decisions, where users seek to change negative classifications through modifications to their features. This work has connections to concepts in explainability and transparency via the idea of counterfactual explanations [42, 49], which provide statements of the form: if a user had features $x'$, then they would have been assigned the alternate outcome $y'$. Work in strategic manipulation studies nearly the same problem with the goal of creating a decision system that is robust to malicious changes in features [21, 32].

Applying these ideas to recommender systems is complex because while they can be viewed as classifiers or decision systems, there are as many outcomes as pieces of content. Computing precise action sets for recourse for every user-item pair is unrealistic; we don’t expect a user to even become aware of the majority of items. Instead, we consider the “reachability” of items by users, drawing philosophically from the fields of formal verification and dynamical system analysis [3, 36].

2 Problem Setting

A recommender system considers a population of users and a collection of items. We denote a rating by user $u$ of item $i$ as $r_{ui}$. This value can be either explicit (e.g. star-ratings for movies) or implicit (e.g. number of listens). Let $n$ denote the number of users in the system and $m$ denote the number of items. Though these are both generally quite large, the number of observed ratings is much smaller. Let $\Omega_u$ denote the set of items whose ratings by user $u$ have been observed. We collect these observed ratings into a sparse vector $r_u$ whose values are defined on $\Omega_u$ and undefined elsewhere. Then a system makes recommendations with a policy $\pi(u)$ which returns a subset of items.¹ We will denote the size of the returned subset as $N$, which is a parameter of the system.

We are now ready to define the reachability sub-problem for a recommender system. We say that a user $u$ can reach item $i$ if there is some allowable modification to their rating history that causes item $i$ to be recommended. The reachability problem for user $u$ and item $i$ is defined as

$$\underset{r \in \mathcal{A}_u}{\text{minimize}} \;\; c_u(r) \quad \text{subject to} \quad i \in \pi(r), \qquad (1)$$

where the modification set $\mathcal{A}_u$ describes how users are allowed to modify their rating history and the cost $c_u$ describes how "difficult" or "unlikely" it is for a user to make this change. This notion of difficulty might relate discretely to the total number of changes, or to the amount that these changes deviate from the existing preferences of the user. By defining the cost with respect to user behavior, the reachability problem encodes both the possibilities of recommendations through its feasibility, as well as the relative likelihood of different outcomes as modeled by the cost.

The ways that users can change their rating histories, described by the modification set $\mathcal{A}_u$, depend on the design of user input to the system. We consider a single round of user reactions to recommendations and focus on two models of user behavior: changes to existing ratings, which we will refer to as "history edits," and reaction to a batch of recommended items, which we will refer to as "reactions." In the first case, $\mathcal{A}_u$ consists of all possible ratings on the support $\Omega_u$. In the second, it consists of all new ratings on the support $\pi(u)$ combined with the existing rating history.

The reachability problem (1) defines a quantity for each user and item in the system. To use this problem as a metric for evaluating recommender systems, we consider both user- and item-centric perspectives. For users, this is a notion of recourse.

Definition 2.1.

The amount of recourse available to a user is defined as the percentage of unseen items that are reachable, i.e. for which (1) is feasible. The difficulty of recourse is defined by the average value of the recourse problem over all reachable items.

On the other hand, the item-centric perspective centers around notions of availability and representation.

Definition 2.2.

The availability of items in a recommender system is defined as the percentage of items that are reachable by some user.

These definitions are important from the perspective of guaranteeing fair representation of content within recommender systems. This is significant for users – for example, to what extent have their previously expressed preferences limited the content that is currently reachable? It is equally important to content creators, for whom the ability to build an audience depends on the availability of their content in the recommender system overall.

In what follows, we turn our attention to specific classes of preference models, for which the reachability problem is analytically tractable. However, we note that estimation via sampling can provide lower bounds on availability and recourse even for black-box models, and further explore this observation in Section 4.1. We keep the main focus on analysis rather than sampling because it allows us to crisply distinguish between unlikely and impossible.

2.1 Linear Preference Models

While many different approaches to recommender systems exist [41], ranging from classical neighborhood models [19] to more recent deep neural networks [12], we focus our attention on linear preference models. These are models which predict user ratings as the dot product between user and item vectors plus bias terms:

$$\hat r_{ui} = q_i^\top p_u + b_i + b_u.$$

We will refer to $q_i$ and $p_u$ as item and user representations. This broad class includes both item- and user-based neighborhood models, sparse approaches like SLIM [35], and matrix factorization, which differ only in how the user and item representations are determined from the rating data. For ease of exposition, we drop the bias terms for the body of the paper and focus on matrix factorization, deferring our general results and explanation to Appendix A. Matrix factorization is a classical approach to recommendation which became prominent during the Netflix Prize competition [50]. It can be specialized to different assumptions about data and user behavior, including constrained approaches like non-negative matrix factorization [18] or augmentations like the inclusion of implicit information about preferences and additional features [37, 25, 40, 24, 28, 27]. Due to its power and simplicity, the matrix factorization approach is still widely used; indeed, it has recently been shown to be capable of attaining state-of-the-art results [38, 13].

In this setting, the item and user representations are referred to as factors, lying in a latent space of specified dimension $d$ which controls the complexity of the model. The factors can be collected into matrices $Q \in \mathbb{R}^{m \times d}$ and $P \in \mathbb{R}^{n \times d}$. Fitting the model most commonly entails solving the nonconvex minimization:

$$\underset{P,\,Q}{\text{minimize}} \;\; \sum_{u} \sum_{i \in \Omega_u} \big(r_{ui} - q_i^\top p_u\big)^2 + \lambda R(P, Q), \qquad (2)$$

where $R$ is a regularizer weighted by $\lambda$.

The predicted ratings of unseen items are used to make recommendations. Specifically, we consider top-$N$ recommenders which return

$$\pi(u) = \operatorname*{argmax}^{(N)}_{i \notin \Omega_u} \; \hat r_{ui},$$

the $N$ unseen items with the highest predicted ratings. For linear preference models, the condition $i \in \pi(u)$ reduces to

$$q_i^\top p_u \geq q_j^\top p_u \quad \text{for all but at most } N-1 \text{ unseen items } j.$$

Thus, for fixed item vectors, a user's recommendations are determined by their representation along with a list of unseen items. In a slight abuse of notation, we will use this fact to write the recommender policy $\pi(p_u)$ instead of $\pi(u)$.

When users' ratings change, their representations change as well. While there are a variety of possible strategies for performing online updates, we focus on the least squares approach, where

$$p_u \in \operatorname*{argmin}_{p} \; \sum_{i \in \Omega_u} \big(r_{ui} - q_i^\top p\big)^2 + \lambda R(p).$$

This is similar to continuing an alternating least-squares (ALS) minimization of (2), a common strategy [53]. Because we analyze a single round of recommendations, we do not consider simultaneous updates to the item representations in $Q$.

In what follows, we focus on the canonical case of $\ell_2$ regularization on user and item factors, with $R(P, Q) = \|P\|_F^2 + \|Q\|_F^2$. With this, the user factor calculation is given by:

$$p_u = \big(Q_u^\top Q_u + \lambda I\big)^{-1} Q_u^\top r_u, \qquad (3)$$

where $Q_u$ collects the factors of the items in $\Omega_u$ as its rows and $r_u$ the corresponding ratings. This update is linear in the rating vector. In Appendix A, we show that several other linear preference models have similar user updates.
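As a concrete illustration, the following is a minimal NumPy sketch of the update in (3); the function and variable names are ours, not part of any released implementation.

```python
import numpy as np

def user_update(Q_u, r_u, lam):
    """Regularized least-squares user factor update, as in (3).

    Q_u : (k, d) array whose rows are the factors of the k items the user has rated.
    r_u : (k,) array of the corresponding ratings.
    lam : l2 regularization weight lambda.
    """
    d = Q_u.shape[1]
    # p_u = (Q_u^T Q_u + lam * I)^{-1} Q_u^T r_u -- linear in the rating vector r_u.
    return np.linalg.solve(Q_u.T @ Q_u + lam * np.eye(d), Q_u.T @ r_u)
```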

3 Recourse and Availability

We begin by reformulating the reachability problem for the case of recommendations made by matrix factorization models (the general reformulation for linear models is presented in Appendix A). We focus this initial exposition on the simplifying case that $N = 1$ and make direct connections between model factors and the recourse and availability provided by the recommender system.

First, we consider what needs to be true for an item to be recommended. For top-$1$, the constraint $i = \pi(p_u)$ is equivalent to requiring that

$$B_i p_u > 0,$$

where we define $B_i$ to be a matrix with rows given by $(q_i - q_j)^\top$ for unseen items $j \neq i$. This is a linear constraint on the user factor $p_u$, and the set of user factors which satisfy this constraint makes up an open convex polytopic cone. We refer to this set as the item-region $\mathcal{R}_i$ for item $i$, since any user whose latent representation falls within this region will be recommended item $i$. The top-$1$ regions partition the latent space, as illustrated by Figure 1 for a toy example with latent dimension $d = 2$.

(a) Items in Latent Space
(b) Availability of Items to a User
Figure 1: An example of item factors (indicated by colored points) in $\mathbb{R}^2$. In (a), the top-1 regions are indicated by shaded colors. The teal item is unavailable, and though the yellow item is reachable, it is not aligned-reachable. In (b), the availability of items for a user who has seen the blue and the green items (now in grey), with the blue item's rating fixed. The black line indicates how the user's representation can change depending on their rating of the green item. The red region indicates the constraining effect of requiring bounded and integer-valued ratings, which affect the reachability of the yellow region.
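For concreteness, the top-$1$ membership test described above can be sketched as follows; this is a minimal illustration under our notation, and the helper name is ours.

```python
import numpy as np

def in_item_region(Q, i, unseen, p):
    """Check whether a user factor p lies in the top-1 item-region of item i.

    Q      : (m, d) array of item factors.
    i      : index of the candidate item.
    unseen : iterable of indices of the user's unseen items (including i).
    p      : (d,) user factor.
    """
    others = [j for j in unseen if j != i]
    B_i = Q[i] - Q[others]               # rows are (q_i - q_j)^T
    return bool(np.all(B_i @ p > 0))     # open polytopic cone membership
```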

If item factors define regions within the latent space, user factors are points that may move between regions. The constraints on user actions are described by the modification set $\mathcal{A}_u$. We will distinguish between mutable and immutable ratings of items within a rating vector $r$. Let $\Omega^{\text{fix}}$ denote the set of items with immutable ratings and let $r^{\text{fix}}$ denote the corresponding ratings. Then let $\Omega^{\text{mut}}$ denote the set of items with mutable ratings. In what follows, we will write the full set of observed ratings as $\Omega_u = \Omega^{\text{fix}} \cup \Omega^{\text{mut}}$. Then the modification set $\mathcal{A}_u$ is all rating vectors $r$ with:

  1. fixed immutable ratings, $r_j = r^{\text{fix}}_j$ for $j \in \Omega^{\text{fix}}$,

  2. mutable ratings $r_j = a_j$ for some value $a_j \in \mathcal{V}$, for $j \in \Omega^{\text{mut}}$,

  3. unseen items $j \notin \Omega_u$ with no rating.

The variable $a$ is the decision variable in the reachability problem.

Then a user's latent factor can change as

$$p_u(a) = x_u + G_u a,$$

where we define

$$x_u = \big(Q_u^\top Q_u + \lambda I\big)^{-1} Q_{\text{fix}}^\top r^{\text{fix}}, \qquad G_u = \big(Q_u^\top Q_u + \lambda I\big)^{-1} Q_{\text{mut}}^\top,$$

with $Q_{\text{fix}}$ and $Q_{\text{mut}}$ collecting the factors of the immutable and mutable items, respectively. It is thus clear that this latent factor lies in an affine subspace. This space is anchored at $x_u$ by the immutable ratings, while the mutable ratings determine the directions of possible movement. This idea is illustrated in Figure 1b, which further demonstrates the limitations due to bounded or discrete ratings, as encoded in the rating set $\mathcal{V}$.

We are now able to specialize the reachability problem (1) for matrix factorization models:

$$\underset{a \in \mathcal{V}^{|\Omega^{\text{mut}}|}}{\text{minimize}} \;\; c(a) \quad \text{subject to} \quad B_i\big(x_u + G_u a\big) > 0. \qquad (4)$$

If the cost $c$ is a convex function and $\mathcal{V}$ is a convex set, this is a convex optimization problem which can be solved efficiently. If $\mathcal{V}$ is a discrete set or if the cost function incorporates nonconvex phenomena like sparsity, then this problem can be formulated as a mixed-integer program (MIP). Despite bad worst-case complexity, MIPs can generally be solved quickly with modern software [20, 48].
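In the convex case, problem (4) can be posed directly in an off-the-shelf modeling language. The sketch below uses cvxpy with an example $\ell_2$ cost of our choosing; the strict inequality is approximated with a small margin, and bound constraints on the ratings themselves are omitted for brevity.

```python
import cvxpy as cp
import numpy as np

def solve_reachability(B_i, x_u, G_u, a0, eps=1e-6):
    """Convex instance of the reachability problem (4).

    B_i : (k, d) matrix with rows (q_i - q_j)^T over unseen competitors j.
    x_u : (d,) anchor point induced by the immutable ratings.
    G_u : (d, n_mut) control matrix over the mutable ratings.
    a0  : (n_mut,) nominal ratings the cost penalizes deviation from.
    Returns (optimal cost, optimal action), or (None, None) if infeasible.
    """
    a = cp.Variable(G_u.shape[1])
    objective = cp.Minimize(cp.norm(a - a0, 2))        # example l2 cost
    # Strict inequality B_i (x_u + G_u a) > 0 handled with a small margin.
    constraints = [B_i @ (x_u + G_u @ a) >= eps]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    if problem.status != cp.OPTIMAL:
        return None, None
    return problem.value, a.value
```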

3.1 Item Availability

Beyond defining the reachability problem, we seek to derive properties of recommender systems based on their underlying preference models. We begin by considering the feasibility of (4) with respect to its linear inequality constraints. For now, we focus on the item-regions, ignoring the effects of the user history $\Omega_u$, anchor point $x_u$, and control matrix $G_u$. In the following result, we consider the convex hull of unseen item factors, which is the smallest convex set that contains the item factors. This can be formally written as

$$\operatorname{conv}\big(\{q_j\}\big) = \Big\{ \textstyle\sum_j \theta_j q_j \;:\; \theta_j \geq 0, \; \sum_j \theta_j = 1 \Big\}.$$

Furthermore, we will consider vertices of the convex hull, which are item factors that are not contained in the convex hull of the other factors, i.e. $q_i \notin \operatorname{conv}\big(\{q_j : j \neq i\}\big)$.

Result 1.

In a top-$1$ recommender system, the available items are those whose factors are vertices of the convex hull of all item factors.

As a result, the availability of items in a top-$1$ recommender system is determined by the way the item factors are distributed in space: it is simply the percentage of item factors that are vertices of their convex hull. The proof is provided in Appendix A, along with proofs of all results to follow.
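Since this characterization reduces to a convex hull computation, it can be checked numerically. A minimal SciPy sketch follows; exact hull computation scales poorly with dimension, so this is practical only for models of small latent dimension $d$.

```python
import numpy as np
from scipy.spatial import ConvexHull

def top1_availability(Q):
    """Estimate item availability under Result 1 (top-1 case).

    Q : (m, d) array of item factors; qhull requires the points to be
        full-dimensional in their ambient space.
    Returns the fraction of items whose factors are hull vertices.
    """
    hull = ConvexHull(Q)            # hull.vertices indexes the vertex points
    return len(hull.vertices) / Q.shape[0]
```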

We can further understand the effect of limited user movement in the case that ratings are real-valued, i.e. $\mathcal{V} = \mathbb{R}$. In this case, we consider both the control matrix $G_u$ and the anchor point $x_u$. For a fixed item $i$, this anchor point determines the set of items necessary for comparison: $\{j : q_j^\top x_u \geq q_i^\top x_u\}$, i.e. those that are more similar to the anchor point than item $i$ is. We will refer to these items as the anchor-similar items. Furthermore, we consider multiplication of item factors by the transpose of the control matrix, the multiplied factors $\bar q_j = G_u^\top q_j$.

Result 2.

In a top-$1$ recommender system, a user can reach any item whose multiplied factor is a vertex of the convex hull of all unseen anchor-similar multiplied item factors.

Furthermore, if the factors of the items with mutable ratings are full rank, i.e. $Q_{\text{mut}}$ has rank equal to the latent dimension $d$ of the model, then item availability implies user recourse.

The second statement in this result means that for a model with item availability, having as many mutable ratings as latent dimensions is sufficient for ensuring that users have full recourse (so long as the associated item factors are linearly independent). This observation highlights that increased model complexity calls for more comprehensive user controls to maintain the same level of recourse.

Of course, this conclusion follows only from considering the possibilities of user action – to consider a notion of likelihood for various outcomes we need to consider the cost.

3.2 Bound on Difficulty of Recourse

We now propose a simple model for the cost of user actions, and use this to show a bound on the difficulty of recourse for users. The cost of user actions can be modeled as a penalty on change from existing ratings. For items whose ratings have not already been observed, we penalize instead the change from predicted ratings. For simplicity, we will represent this penalty as the $\ell_2$ norm of the difference.

For history edits, all mutable items have been observed, so we simply have

$$c(a) = \|a - r_u\|_2.$$

Additionally, all existing ratings are mutable, so the mutable set is $\Omega^{\text{mut}} = \Omega_u$ and the immutable set is $\Omega^{\text{fix}} = \emptyset$. For reactions, the ratings for the new recommended items have not been observed, so

$$c(a) = \|a - \hat r_{\pi(u)}\|_2,$$

where $\hat r_{\pi(u)}$ collects the predicted ratings of the recommended items. Additionally, the rating history is immutable, so $\Omega^{\text{fix}} = \Omega_u$, while the mutable ratings are the recommendations, $\Omega^{\text{mut}} = \pi(u)$.

We note briefly that our choice of handling the cost of unobserved ratings as “change from predicted ratings” assumes a level of model validity. While it does allow us to avoid any external behavioral modeling, it represents a simple case that perhaps over-emphasizes the role of the model. Exploring alternative cost functions is important for future work.

Under this model, we provide an upper bound on the difficulty of recourse. This result holds for the case that ratings are real-valued, i.e. $\mathcal{V} = \mathbb{R}$, and that the reachable items satisfy an alignment condition (defined in (8) of Appendix A).

Result 3.

Let $p_u^0$ indicate the user's latent factor as in (3) before any actions are taken or the next set of recommendations are added to the user history. Then, both in the case of full history edits and reactions, the difficulty of recourse is bounded above by the average, over the set of reachable items $\mathcal{I}$, of the distance between the (suitably scaled) item factor and $p_u^0$, inflated by the inverse of the smallest nonzero singular value of the control matrix $G_u$.

This bound depends on how far item factors are from the initial latent representation of the user. When latent representations are close together, recourse is easier or more likely – an intuitive relationship. This quantity will be large in situations where a user is in an isolated niche, far from most of the items in latent space. The bound also depends on the conditioning of the user control matrix $G_u$, which is related to the similarity between mutable items: the right hand side of the bound will be larger for sets of mutable items that are more similar to each other.

The proof of this result hinges on showing the existence of a specific feasible point for the optimization problem in (4). In Section 4 we will further explore this idea of feasibility to develop lower bounds on availability and recourse when $N > 1$. First, we describe how the presented results can be used to evaluate solutions to the user cold-start problem.

3.3 User Cold-Start

The amount and difficulty of recourse for a user yields a novel perspective on how to incorporate new users into a recommender system. The user cold-start problem is the challenge of selecting items to show a user who enters a system with no rating history from which to predict their preferences. This is a major issue with collaboratively filtered recommendations, and systems often rely on incorporating extraneous information [43]. These strategies focus on presenting items which are most likely to be rated highly or to be most informative about user preferences [4].

The idea of recourse offers an alternative point of view. Rather than evaluating a potential "onboarding set" only for its contribution to model accuracy, we can choose a set which additionally ensures some amount of recourse. Looking to Result 2, we can evaluate an onboarding set by the geometry of the multiplied factors in latent space. In the case of onboarding, $\Omega^{\text{fix}} = \emptyset$ and $x_u = 0$, so the recourse evaluation involves considering the vertices of the convex hull of the multiplied factors $G_u^\top q_i$, i.e. of the columns of the matrix $G_u^\top Q^\top$.

An additional perspective is offered by considering the difficulty of recourse. In this case, we focus on the conditioning term in the bound of Result 3. If we consider an $\ell_2$ norm, then it reduces to

$$\prod_k \sigma_k^{-1},$$

where $\sigma_k$ are the nonzero singular values of $G_u$. Minimizing this quantity is hard [11], though the hardness of selecting informative item sets is unsurprising as it has been discussed in related settings [4]. Due to their computational challenges, we primarily propose that these metrics be used to distinguish between candidate onboarding sets, based on the ways these sets provide user control. We leave to future work the task of generating candidate sets based on these recourse properties.
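As an illustration, the conditioning-based difficulty proxy can be computed directly from the factors of a candidate onboarding set. The sketch below assumes the product-of-inverse-singular-values form given above and the $\ell_2$-regularized user update (3); the function name is ours.

```python
import numpy as np

def onboarding_difficulty(Q_onboard, lam):
    """Score a candidate onboarding set by the conditioning term above.

    Q_onboard : (k, d) factors of the candidate onboarding items.
    lam       : l2 regularization weight from the user update (3).
    Smaller values suggest easier recourse for the new user.
    """
    d = Q_onboard.shape[1]
    # Control matrix for a new user whose only (mutable) ratings are the
    # onboarding items: G = (Q^T Q + lam I)^{-1} Q^T.
    G = np.linalg.solve(Q_onboard.T @ Q_onboard + lam * np.eye(d), Q_onboard.T)
    sigma = np.linalg.svd(G, compute_uv=False)
    sigma = sigma[sigma > 1e-12]          # keep only nonzero singular values
    return np.prod(1.0 / sigma)
```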

4 Sufficient Conditions for Top-N

In the previous section, we developed a characterization of reachability for top-$1$ recommender systems. However, most real world applications involve serving several items at once. Furthermore, using $N > 1$ can approximate the availability of items to a user over time, as they see more items and increase the size of the set which is excluded from the selection. In this section, we focus on sufficient conditions to develop a computationally efficient model audit that provides lower bounds on the availability of items in a model. We further provide approximations for computing a lower bound on the recourse available to users.

We can define an item-region for the top-$N$ case: $i \in \pi(p)$ for any user factor $p$ in the set

$$\mathcal{R}_i^N = \big\{p \;:\; q_i^\top p \geq q_j^\top p \;\text{ for all but at most } N-1 \text{ unseen items } j\big\}.$$

As in the previous section, this region is contained within the latent space, which is generally of relatively small dimension. However, its description depends on the number of items, which will generally be quite large. In the case of $N = 1$, this dependence is linear and therefore manageable. For $N > 1$, the item region is the union over polytopic cones for subsets describing "all but at most $N-1$ items." Therefore, the description of each item region requires a number of linear inequalities that grows combinatorially in $N$. For systems with tens of thousands of items, even considering $N = 2$ becomes prohibitively expensive.

To ease the notational burden of discussing the ranking logic around top-$N$ selection in what follows, we define the operator $\max^{(N)}$, which selects the $N$th largest value from a set. This allows us to write, for example,

$$\mathcal{R}_i^N = \Big\{p \;:\; q_i^\top p \geq {\max}^{(N)}_{j \notin \Omega_u} \; q_j^\top p \Big\}.$$

4.1 Sufficient Condition for Availability

To bypass the computational concerns, we focus on a sufficient condition for item availability. The full description of the region is not necessary to verify non-emptiness; rather, showing the existence of any point in the latent space that satisfies $p \in \mathcal{R}_i^N$ is sufficient. Using this insight, we propose a sampling approach to determining the availability of an item. For a fixed item $i$ and any candidate point $p$, it is necessary only to compute and sort the scores $q_j^\top p$, an operation of complexity $O(m(d + \log m))$.

While this sampling approach could make use of gridding, randomness, or empirical user factor distributions, we propose choosing the sample point $p = q_i$.

Result 4.

The item-region $\mathcal{R}_i^N$ is nonempty if

$$q_i^\top q_i \;\geq\; {\max}^{(N)}_{j \notin \Omega_u} \; q_i^\top q_j. \qquad (5)$$

When this condition holds, we say that item $i$ is aligned-reachable. The percent of items that are aligned-reachable is a lower bound on the availability of items.

Note that this is a sufficient rather than a necessary condition; it is possible for condition (5) to fail even when $\mathcal{R}_i^N$ is nonempty. Figure 1 illustrates an example where this is the case: the yellow item is not aligned-reachable even though its region is nonempty. As a result, aligned-reachability yields an underestimate of the availability of items in a system.

4.2 Model Audit

To use the aligned-reachable condition (5) as a generic model audit, we need to sidestep the specificity of the set of seen items $\Omega_u$. We propose an audit based on this condition with $\Omega_u = \emptyset$ and an increased value for $N$, where increasing $N$ compensates for discarding the effect of the items which have been seen. This audit is described in Algorithm 1. If we set $N = N' + H$, where $N'$ is the number of items recommended by the system, then item availability has the following interpretation: if an item is not top-$N$ available, then that item will never be recommended to a user who has seen fewer than $H$ items.

Result: Lower bound on item availability
initialize count ← 0
for each item i = 1, …, m do
      compute s_i ← max^(N) { q_i^T q_j : j = 1, …, m }
      if q_i^T q_i ≥ s_i then
            count ← count + 1
      end if
end for
return count / m
Algorithm 1: Item-Based Model Audit
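A minimal vectorized sketch of Algorithm 1 follows; it assumes the bias-free matrix factorization setting of the body (the biased variant of Appendix A adds the item biases to the scores), and the function name is ours.

```python
import numpy as np

def item_audit(Q, N):
    """Algorithm 1: lower bound item availability via condition (5)
    with an empty seen-set.

    Q : (m, d) array of item factors.
    N : inflated top-N cutoff.
    Returns the fraction of aligned-reachable items. For very large m,
    the m x m score matrix should be computed in row blocks instead.
    """
    scores = Q @ Q.T                                  # scores[i, j] = q_i^T q_j
    # N-th largest entry of each row; the row includes j = i itself, so
    # q_i^T q_i >= nth is exactly "q_i scores among the top N at p = q_i".
    nth = -np.partition(-scores, N - 1, axis=1)[:, N - 1]
    aligned = np.diag(scores) >= nth
    return float(aligned.mean())
```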

If we consider the set of all possible users to be users with a history of at most $H$ items, this model audit counts the number of aligned-unreachable items, returning a lower bound on the overall availability of items. We can further use this model audit to propose constraints or penalties on the model during training. Ensuring aligned-reachability is equivalent to imposing linear constraints on the Gram matrix $QQ^\top$,

$$\big(QQ^\top\big)_{ii} \;\geq\; {\max}^{(N)}_{j} \; \big(QQ^\top\big)_{ij} \quad \text{for all items } i.$$

While this constraint is not convex in $Q$, relaxed versions of it could be incorporated into the optimization problem (2) to ensure reachability during training. We point this out as a potential avenue for future work.

4.3 Sufficient Condition for Recourse

User recourse inherits the computational problems described above for $N > 1$. We note that the region $\mathcal{R}_i^N$ is not necessarily convex, though it is the union of convex regions. While the problem could be solved by first minimizing within each region and then choosing the minimum value over all regions, this would not be practical for large values of $N$. Therefore, we continue with the sampling perspective to develop an efficient sufficient condition for verifying the feasibility of (4). We propose testing feasibility with the condition

$$q_i^\top \tilde p_i \;\geq\; {\max}^{(N)}_{j \notin \Omega_u} \; q_j^\top \tilde p_i, \qquad \tilde p_i = \Pi q_i + (I - \Pi)\, x_u, \qquad (6)$$

where $\Pi$ denotes the projection onto the directions of possible user movement (defined formally before Result 5 below). By checking feasibility for each unseen item $i$, we verify a lower bound on the amount of recourse available to a user, considering their specific rating history and the allowable actions.

Note that if the control matrix $G_u$ is full rank, then we can find an action $a$ such that $x_u + G_u a = q_i$, meaning that items which are aligned-reachable are also reachable by users. The rank of $G_u$ is equal to the rank of $Q_{\text{mut}}$, so as previously observed, item availability implies recourse for any user with control over at least $d$ ratings whose corresponding item factors are linearly independent.

Even users with incomplete control have some level of recourse. For the following result, we define $\Pi$ as the projection matrix onto the subspace spanned by the columns of $G_u$. Then let $q_i^\parallel = \Pi q_i$ be the component of the target item factor that lies in the space spanned by the control matrix $G_u$, and let $x_u^\perp = (I - \Pi)\, x_u$ be the component of the anchor point that cannot be affected by user control.

Result 5.

When $\mathcal{V} = \mathbb{R}$, a lower bound on the amount of recourse for a user is given by the percent of unseen items that satisfy:

$$q_i^\top \big(q_i^\parallel + x_u^\perp\big) \;\geq\; {\max}^{(N)}_{j \notin \Omega_u} \; q_j^\top \big(q_i^\parallel + x_u^\perp\big).$$

Note that this statement mirrors the sufficient condition for items, with modifications relating both to the directions of user control and the anchor point. In short, user recourse follows from the ability to modify ratings for a set of diverse items, and immutable ratings ensure the reachability of some items, potentially at the expense of others.
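The following sketch checks this sufficient condition for each unseen item, using the projection-based test point as reconstructed above; it is an illustration under our notation rather than a reference implementation.

```python
import numpy as np

def recourse_lower_bound(Q, G_u, x_u, seen, N):
    """Check the Result 5 condition for each unseen item.

    Q    : (m, d) item factors.
    G_u  : (d, k) control matrix of the user's mutable ratings.
    x_u  : (d,) anchor point from the immutable ratings.
    seen : (m,) boolean mask of items already seen.
    N    : recommendation set size.
    Returns the fraction of unseen items passing the test.
    """
    Pi = G_u @ np.linalg.pinv(G_u)        # projection onto the range of G_u
    x_perp = x_u - Pi @ x_u               # anchor component outside user control
    unseen = np.where(~seen)[0]
    Q_unseen = Q[unseen]
    passed = 0
    for i in unseen:
        p = Pi @ Q[i] + x_perp            # test point q_i^par + x_u^perp
        scores = Q_unseen @ p
        nth = -np.partition(-scores, N - 1)[N - 1]
        if Q[i] @ p >= nth:
            passed += 1
    return passed / len(unseen)
```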

Figure 2: The test RMSE of the matrix factorization models on the MovieLens dataset.

5 Experimental Demonstration

In this section, we demonstrate how our proposed analyses can be used as a tool to audit and interpret characteristics of a matrix factorization model. We use the MovieLens 10M dataset, which comes from an online movie recommender service called MovieLens [22]. The dataset² contains approximately 10 million ratings applied to 10,681 movies by 71,567 users. The ratings fall between 0.5 and 5 in half-star increments.

We chose this dataset because it is a common benchmark for evaluating rating predictions. Using the method described by Rendle et al. [38] in their recent work on baselines for recommender systems, we train a regularized matrix factorization model. This model incorporates item, user, and overall bias terms (Appendix A includes a full description of adapting our proposed audits to this model).

We examine models across a range of latent dimensions. The models were trained using the libfm³ library [39]. We use the regression objective and optimize using SGD, verifying accuracy on held-out data with a random global test/train split. These methods match those presented by Rendle et al. [38] and reproduce their reported accuracies (Figure 2).

In Appendix B, we present a similar set of experiments on the LastFM dataset.

Figure 3: Only some of the 10,681 total movies are aligned-reachable, especially for models with smaller complexity and for smaller recommendation set sizes $N$.
Figure 4: Unavailable items are systematically less popular than available items: they are rated less frequently and have lower average ratings in the training data. Each curve represents the cumulative density function (CDF) of the popularity measure within the available (green) and unavailable (red) items. The black line represents the CDF of the combined population. This trend is true for models of varying complexity.

5.1 Item-Based Audit

We begin by performing the item-based audit as described in Section 4.2. Figure 3 displays the total number of aligned-reachable movies for various values of $N$. It is immediately clear that all items are baseline-reachable only in the models with the largest latent dimension. Indeed, we can conclude that the smallest model provides only partial availability for users with a history of under 100 movies. On the other hand, the models with the highest complexity have near-full availability for even the smallest values of $N$.

We further examine the characteristics of the items that are unavailable compared with those that are available. We examine two notions of popularity: total number of ratings and average rating. In Figure 4, we compare the distributions of the available and unavailable items (for a fixed value of $N$) in the training set on these measures. The unavailable items have systematically lower popularity for various latent dimensions. This observation has implications for the outcome of putting these models in feedback with users. If unavailable items are never recommended, they will be less likely to be rated, which may exacerbate their unavailability if models are updated online. We highlight this phenomenon as a potential avenue for future work.

While the difference in popularity is true across all models, it is important to note that there is still overlap in the support of both distributions. For a given number of ratings or average rating, some items will be available while others will not, meaning that popularity alone does not determine reachability.

Figure 5: The proportion of unseen items reachable by users varies with their history length. A LOESS regressed curve illustrates the trend. Less complex models are better for shorter history lengths, while more complex models reach higher overall values.

5.2 System Recourse for Users

Next, we examine the users in this dataset. We use the combined testing and training data to determine user ratings $r_u$ and histories $\Omega_u$. For this section, we examine 100 randomly selected users and only the 1,000 most rated items. Sub-selecting items, and especially choosing them based on popularity, means that these experimental results are an overestimation of the amount of recourse available to users. Additionally, we allow ratings on a continuous interval rather than enforcing integer constraints, meaning that our results represent the recourse available to users if they were able to precisely rate items on a continuous scale. Despite these two approximations, several interesting trends on the limits of recourse appear.

We begin with history edits, and compute the amount of recourse that the system provides to users using the sufficient condition in Result 5. Figure 5 shows the relationship with the length of user history for several different latent dimensions. First, note the shape of the curves: a fast increase and then a leveling off for each dimension $d$. For short histories, we see the limiting effect of projection onto the control matrix $G_u$. For longer histories, as the rank of $G_u$ approaches or exceeds $d$, the baseline item-reachability determines the effect. The transition between these two regimes differs depending on the latent dimension of the model. Smaller models reach their maximum quickly, while models of higher complexity provide a larger amount of recourse to users with long histories. This is an interesting distinction that connects to the idea of "power users."

Figure 6: When actions are constrained to reaction to a set of items, lower complexity models provide higher reachability. A random set of items provides slightly more recourse to users than if the set is selected based on predicted user preferences. Furthermore, there is a slight trend that users with smaller history lengths have more available recourse.

Next, we consider reactions, where user input comes only through reaction to a new set of items while the existing ratings are fixed. Figure 6 displays the amount of recourse for two different types of new items: first, the case that users are shown a completely random set of unseen items, and second, the case that they are shown the items with the highest predicted ratings. The top panel displays the amount of recourse provided by each model and each type of recommendation. There are two important trends. First, smaller models offer larger amounts of recourse – this is because we are in the regime of few mutable ratings, analogous to the availability of items to users with short histories in the previous figure. Second, for each model size, the random recommendations provide more recourse than the top-$N$ recommendations, and though the gap is not large, it is consistent.

In the bottom panel of Figure 6, we further examine how the length of user history interacts with this model of user behavior. For both the smallest and the largest latent dimensions, there is a downwards trend between reachability and history length. This does not contradict the trend displayed in Figure 5: in the reactions setting, the rating history manifests as the anchor point $x_u$ rather than as additional degrees of freedom in the control matrix $G_u$. It is interesting in light of recent work examining the usefulness of recency bias in recommender systems [31].

Finally, we investigate the difficulty of recourse over all users and a single item. In this case, we consider top-$1$ recommendations to reduce the computational burden of computing the exact item-region. We pose the cost as the size of the difference between the user input and the predicted ratings in the $\ell_2$ norm. Figure 7 shows the difficulty of recourse via reaction for the two types of new items: a completely random set of unseen items and items with the highest predicted ratings. We note two interesting trends. First, the difficulty of recourse does not increase with model size (even though the amount of recourse is lower). Second, difficulty is lower for the random set of items than for the top-$N$ items. Along with the trend in availability, this suggests a benefit of suggesting items to users based on metrics other than predicted rating. Future work should more carefully examine methods for constructing recommended sets that trade off predicted ratings against measures like diversity under the lens of user recourse.

Figure 7: The difficulty of reaching a single item across 100 users for different sets of new items. The difficulty of recourse does not increase for the larger models, despite the previously observed decrease in availability. Furthermore, we note that random items have lower difficulty.

6 Discussion

In this paper, we consider the effects of using predicted user preferences to recommend content, a practice prevalent throughout the Internet today. By defining a reachability problem for top-$N$ recommenders, we provide a way to evaluate the impact of using these predictive models to mediate the discovery of content. In applying these insights to linear preference models, we see several interesting phenomena. The first is simple but worth stating: good predictive models, when used to moderate information, can unintentionally make portions of the content library inaccessible to users. This is illustrated in practice in our study of the MovieLens and LastFM datasets.

To some extent, the differences in the availability of items are related to their unpopularity within training data. Popularity bias is a well known phenomenon in which systems fail to personalize [44], and instead over-recommend the most popular pieces of content. Empirical work shows connections between popularity bias and undesirable demographic biases, including the under-recommendation of female authors [16]. YouTube was long known to have a popularity bias problem (known as the “Gangnam Style Problem”), until the recommendations began optimizing for predicted “watch time” over “number of views.” Their new model has been criticized for its radicalization and political polarization [47, 34]. The choice of prediction target can have a large effect on the types of content users can or are likely to discover, motivating the use of analytic tools like the ones proposed here to reason about these trade-offs before deployment.

While the reachability criteria proposed in this work form an important basis for reasoning about the availability of content within a recommender system, they do not guarantee less biased behavior on their own. Many of the audits consider the feasibility of the recourse problem rather than its cost; thus confirming possible outcomes rather than distinguishing probable ones. Furthermore, the existence of recourse does not fix problems of filter bubbles or toxic content. Rather, it illuminates limitations inherent in recommender systems for organizing and promoting content. There is an important distinction between technically providing recourse and the likelihood that people will actually avail themselves of it. If the cost function is not commensurate with actual user behavior this analysis may lend an appearance of fairness without substance.

With these limitations in mind, we mention several ways to extend the ideas presented in this work. On the technical side, there are different models for rating predictions, especially those that incorporate implicit feedback or perform different online update strategies for users. Not all simple models are linear – for example, subset-based recommendations offer greater scrutability and thus user agency by design [2]. Furthermore, top-$N$ recommendation is not the only option. Post-processing approaches to the recommender policy could work with existing models to modify their reachability properties. For example, Steck [45] proposed a method to ensure that the distribution of recommendations over genres remains the same despite model predictions.

One avenue for addressing more generic preference models and recommender policies is to extend the sampling perspective introduced in Section 4.1 to develop a general framework for black-box recommender evaluation. By sampling with respect to a user transition model, the evaluation could incorporate notions of dynamics and user agency similar to those presented in this work.

Further future work could push the scope of the problem setting to understand the interactions between users and models over time. Analyzing connections between training data and the resulting reachability properties of the model would give context to empirical work showing how biases can be reproduced in the way items are recommended [16, 15]. Similarly, directly considering multiple rounds of interactions between users and recommendation systems would shed light on how these models evolve over time. This is a path towards understanding phenomena like filter bubbles and polarization.

More broadly, we emphasize the importance of auditing systems with learning-based components in ways that directly consider the models’ behavior when put into feedback with humans. In the field of formal verification, making guarantees about the behavior of complex dynamical systems over time has a long history. There are many existing tools [1], though they are generally specialized to the case of physical systems and suffer from the curse of dimensionality. We accentuate the abundance of opportunity for developing novel approximations and strategies for evaluating large scale machine learning systems.

Acknowledgements

Thanks to everyone at Canopy for feedback and support. SD is supported by an NSF Graduate Research Fellowship under Grant No. DGE 175281. BR is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, and an Amazon AWS AI Research Award.

Appendix A Full Results

Here, we develop our main results in full generality. First, we specify the full form of the linear setting. We consider the preference model with bias terms:

$$\hat r_{ui} = q_i^\top p_u + b_i + b_u + b_0.$$

Here, $b_i$ is a bias on each item, $b_u$ is a bias for each user, and $b_0$ is the overall bias. In this setting the item-regions are now defined as:

$$\mathcal{R}_i = \{p : B_i p + \beta_i > 0\}.$$

We define $B_i$ to be a matrix with rows given by $(q_i - q_j)^\top$ for unseen items $j \neq i$, and $\beta_i$ to be the vector with entries given by $b_i - b_j$ for unseen items $j \neq i$.

Lastly, we consider any model that updates only the user representations online, and where this update is affine:

$$p_u = W r_u + w.$$

As developed in Section 3, a user's representation then changes as a result of their actions as

$$p_u(a) = x_u + G_u a,$$

where we define $x_u = W_{\text{fix}}\, r^{\text{fix}} + w$ and $G_u = W_{\text{mut}}$, with $W_{\text{fix}}$ and $W_{\text{mut}}$ denoting the columns of $W$ corresponding to immutable and mutable items, respectively. Similar to before, the reachability problem becomes:

$$\underset{a \in \mathcal{V}^{|\Omega^{\text{mut}}|}}{\text{minimize}} \;\; c(a) \quad \text{subject to} \quad B_i\big(x_u + G_u a\big) + \beta_i > 0. \qquad (7)$$

A.1 Examples of Linear Preference Models

In this section, we outline several models that satisfy the form of linear preference model introduced above. References can be found in chapters 2 and 3 of [41].

Example A.2 (Item-based neighborhood methods).

In this model,

$$\hat r_{ui} = \sum_{j \in \Omega_u} w_{ij}\, r_{uj},$$

where $w_{ij}$ measures the similarity between items $i$ and $j$. Regardless of how these weights are defined, this fits into the linear preference model with $q_i$ given by the $i$th row of the similarity matrix $W$, $p_u = r_u$, and zero biases. As long as we don't consider an update to $W$, the update model holds with the identity map, $p_u(r) = r$.

Example A.3 (SLIM).

Here, the model predicts

$$\hat r_{ui} = w_i^\top r_u,$$

where the $w_i$ are sparse row vectors of a matrix $W$ learned via

$$\underset{W}{\text{minimize}} \;\; \|R - R W^\top\|_F^2 + \lambda_1 \|W\|_1 + \lambda_2 \|W\|_F^2 \quad \text{s.t.} \quad W \geq 0, \;\; \operatorname{diag}(W) = 0.$$

Again, the update model holds with the identity map, $p_u(r) = r$.

Example A.4 (Matrix Factorization).

The only modification from the body of the paper is in the update equation,

$$p_u = \big(Q_u^\top Q_u + \lambda I\big)^{-1} Q_u^\top \big(r_u - b_{\Omega_u} - (b_u + b_0)\mathbf 1\big),$$

where $b_{\Omega_u}$ collects the biases of the items in $\Omega_u$ and $\mathbf 1$ is the all-ones vector. Thus $W = (Q_u^\top Q_u + \lambda I)^{-1} Q_u^\top$ and $w = -W\big(b_{\Omega_u} + (b_u + b_0)\mathbf 1\big)$.

A.5 Main Results

We now restate the main results in this more general setting, and provide proofs.

Proposition A.1 (Result 1 with bias).

A top-$1$ item-region $\mathcal{R}_i$ is nonempty if and only if there is no convex combination of the other unseen item factors with $q_i = \sum_j \theta_j q_j$ whose biases satisfy $\sum_j \theta_j b_j \geq b_i$. When all biases are equal, this is exactly the condition that $q_i$ is a vertex of the convex hull of all item factors.

Proof.

The item region is described by $\mathcal{R}_i = \{p : B_i p + \beta_i > 0\}$. We relate its non-emptiness to the feasibility of the linear program, for an arbitrarily small $\epsilon > 0$,

$$\underset{p}{\text{minimize}} \;\; 0 \quad \text{s.t.} \quad B_i p \geq \epsilon \mathbf 1 - \beta_i.$$

The dual of this program is

$$\underset{\lambda \geq 0}{\text{maximize}} \;\; \lambda^\top \big(\epsilon \mathbf 1 - \beta_i\big) \quad \text{s.t.} \quad B_i^\top \lambda = 0.$$

The primal problem is feasible if and only if the dual is bounded. For any feasible $\lambda$, $c\lambda$ is also feasible for any scalar $c > 0$. Then letting $c \to \infty$, we see that the dual objective is unbounded whenever there is any feasible nonzero $\lambda$ with $\lambda^\top(\epsilon \mathbf 1 - \beta_i) > 0$. The set of feasible $\lambda$ is

$$\Big\{\lambda \geq 0 \;:\; \textstyle\sum_j \lambda_j (q_i - q_j) = 0\Big\}.$$

The dual is thus unbounded whenever there is some nonzero feasible $\lambda$ with

$$\textstyle\sum_j \lambda_j \big(\epsilon - b_i + b_j\big) > 0.$$

Rearranging the expression and normalizing $\theta_j = \lambda_j / \sum_k \lambda_k$, we see that it is equivalent to $q_i = \sum_j \theta_j q_j$ with $\sum_j \theta_j b_j \geq b_i$ (taking $\epsilon$ arbitrarily small). Whenever this is the case, the dual program is unbounded above and therefore the item region is empty. ∎

Proposition A.2 (Result 2 with bias).

Suppose that $\mathcal{V} = \mathbb{R}$, and a user has control matrix $G_u$ and anchor point $x_u$. Then the top-$1$ reachability problem for item $i$ is feasible if and only if there is no convex combination of the other unseen multiplied item factors with $G_u^\top q_i = \sum_j \theta_j G_u^\top q_j$ whose biases and anchor point satisfy $\sum_j \theta_j \big(b_j + q_j^\top x_u\big) \geq b_i + q_i^\top x_u$.

Furthermore, for matrix factorization, if $Q_{\text{mut}}$ has rank equal to the latent dimension $d$ of the model, then item availability implies user recourse.

Proof.

We follow the argument for the previous result, considering the alternate linear region $\{a : B_i(x_u + G_u a) + \beta_i > 0\}$. In this case, the linear coefficients are given by the multiplied factors $G_u^\top(q_i - q_j)$, while the right-hand terms are $(b_i - b_j) + (q_i - q_j)^\top x_u$. Thus the first statement follows.

The second statement follows because, for matrix factorization, $G_u$ has the same rank as $Q_{\text{mut}}$ because $(Q_u^\top Q_u + \lambda I)^{-1}$ is invertible. If $G_u$ has rank $d$, then it has a right inverse $G_u^\dagger$ with $G_u G_u^\dagger = I$, so any target user factor $p$ can be achieved by the action $a = G_u^\dagger(p - x_u)$. ∎

Recall that $\Pi$ is the projection matrix onto the subspace spanned by the columns of $G_u$, $\Pi = G_u G_u^\dagger$, while $I - \Pi$ is the projection onto the orthogonal complement. For this result we define an aligned reachability condition for item $i$,

$$q_i^\top \big(\Pi q_i + (I - \Pi)\, x_u\big) \;\geq\; {\max}^{(N)}_{j \notin \Omega_u} \; q_j^\top \big(\Pi q_i + (I - \Pi)\, x_u\big). \qquad (8)$$
Theorem A.3 (Result 3 with Bias).

Suppose that $\mathcal{V} = \mathbb{R}$, and assume the aligned reachability condition (8) holds for all reachable items $i \in \mathcal{I}$.

Let $p_u^0$ indicate the user's latent factor before any actions are taken or the next set of recommendations are added to the user history. Then, in both the case of full history edits and the case of reactions, the difficulty of recourse is bounded above by the average, over reachable items $i \in \mathcal{I}$, of the distance between the aligned test point for item $i$ and $p_u^0$, inflated by the inverse of the smallest nonzero singular value of $G_u$. In the case of reactions, the bound includes an additional term representing the effect of the bias of the new items on the user factor.

Proof.

We begin in the case of history edits. Here, all ratings are mutable, so $p_u(a) = G_u a$ and the cost is $c(a) = \|a - r_u\|_2$. By assumption (8), the aligned test point for item $i$ lies in $\mathcal{R}_i^N$, and since it lies in the range of $G_u$, there is a feasible action mapping to it; we select the minimum-norm such action. The cost of this action is bounded by the distance between the test point and $p_u^0$, scaled by the inverse of the smallest nonzero singular value of $G_u$. Averaging over reachable items $i \in \mathcal{I}$ yields the result.

Now we consider the case of reactions. Here, the rating history is fixed and contributes the anchor point $x_u$, while the newly recommended items contribute both directions of control and a bias term to the user factor. Choosing the same aligned test point, the feasible action decomposes into a component reaching the test point and a component induced by the biases of the new items. Collecting terms, the cost is again bounded by the scaled distance between the test point and $p_u^0$, plus the bias contribution, and averaging over reachable items gives the bound. ∎

Proposition A.4 (Result 4 with bias).

The item-region $\mathcal{R}_i^N$ is nonempty if

$$q_i^\top q_i + b_i \;\geq\; {\max}^{(N)}_{j \notin \Omega_u} \; \big(q_i^\top q_j + b_j\big). \qquad (9)$$

The percent of aligned-reachable items lower bounds the percent of baseline-reachable items.

Proof.

When the condition is true, $q_i \in \mathcal{R}_i^N$, so the region is nonempty. Because this is only a sufficient condition, the number of aligned-reachable items is less than or equal to the number of reachable items. ∎

Proposition A.5.

When $\mathcal{V} = \mathbb{R}$, the reachability problem (1) is feasible if

$$q_i^\top \big(\Pi q_i + (I - \Pi)\, x_u\big) + b_i \;\geq\; {\max}^{(N)}_{j \notin \Omega_u} \; \Big(q_j^\top \big(\Pi q_i + (I - \Pi)\, x_u\big) + b_j\Big). \qquad (10)$$

Proof.

If $\mathcal{V} = \mathbb{R}$, we have the test point

$$\tilde p = \Pi q_i + (I - \Pi)\, x_u = x_u + G_u a \quad \text{for} \quad a = G_u^\dagger (q_i - x_u).$$

Then verifying its feasibility:

$$q_i^\top \tilde p + b_i \;\geq\; {\max}^{(N)}_{j \notin \Omega_u} \; \big(q_j^\top \tilde p + b_j\big),$$

where the inequality follows from property (10), so we have that $\tilde p \in \mathcal{R}_i^N$. ∎

Appendix B Experiments with Additional Dataset

In this section, we present results on an additional dataset. We use the LastFM⁴ 1K dataset [10]. The dataset contains the listening records of 992 users covering songs by 177,023 artists. We aggregate listens by artist, and remove artists with fewer than 50 total listens, resulting in a smaller dataset with 23,835 artists, 638,677 datapoints, and the same number of users. We further transform the play counts logarithmically into "log-listens." We plot the distribution of ratings in the MovieLens dataset and log-listens in this dataset in Figure 8.

Figure 8: Distribution of prediction targets for MovieLens and LastFM datasets (ratings and log-listens, respectively).

As with the MovieLens data, we train a regularized matrix factorization model and examine models across a range of latent dimensions. The models were trained using libfm and optimized using SGD, verifying accuracy on held-out data with a random global test/train split.

Figure 9: The test RMSE of the matrix factorization models on the LastFM dataset.

Figure 10 displays the result of the item-based audit. Few of the items are baseline-reachable for models with smaller latent dimension; this trend is more exaggerated than with the MovieLens dataset. In Figure 11, we compare the popularity of the available and unavailable items (for a fixed value of $N$) in the training set on these measures. As before, unavailable items tend to have systematically lower popularity, but the trend is less consistent than it was for MovieLens. This may be due to the more heavily skewed target distribution (Figure 8).

Figure 10: Only some of the 23,835 total artists are aligned-reachable, especially for models with smaller complexity and for smaller recommendation set sizes $N$.
Figure 11: Unavailable items tend to be less popular than available items. Each curve represents the cumulative density function (CDF) of the popularity measure within the available (green) and unavailable (red) items. The black line represents the CDF of the combined population. This trend is true for models of varying complexity.

Next, we examine the users in this dataset, combining testing and training data to determine user ratings $r_u$ and histories $\Omega_u$. For this section, we examine 100 randomly selected users and only the 1,000 most popular items, and allow ratings on a continuous interval.

Figure 12 shows the amount of recourse that the system provides to users via history edits. We observe a similar relationship with the length of user history as with the MovieLens data. Figure 13 displays the amount of recourse via reactions for random and recommended items. The previously observed trends are less definitive, but present: smaller models tend to offer more recourse, as do random recommendations. Furthermore, the bottom panels suggest a negative relationship between recourse and history length, also less definitively than in the MovieLens data.

Figure 12: The proportion of unseen items reachable by users varies with their history length. A LOESS regressed curve illustrates the trend.
Figure 13: When actions are constrained to reaction to a set of items, lower complexity models provide higher reachability. A random set of items provides slightly more recourse to users than if the set is selected based on predicted user preferences. Furthermore, there is a slight trend that users with smaller history lengths have more available recourse.

Footnotes

  1. While we focus on the case of deterministic policies, the analyses can be extended to randomized policies which sample from a subset of items based on their ratings. It is only necessary to define reachability with respect to probabilities of seeing an item, and then to carry through terms related to the sampling distribution.
  2. http://grouplens.org/datasets/movielens/10m/
  3. http://www.libfm.org/
  4. Last.fm

References

  1. E. Asarin, O. Bournez, T. Dang and O. Maler (2000) Approximate reachability analysis of piecewise-linear dynamical systems. In International Workshop on Hybrid Systems: Computation and Control, pp. 20–31.
  2. K. Balog, F. Radlinski and S. Arakelyan (2019) Transparent, scrutable and explainable user models for personalized recommendation.
  3. S. Bansal, M. Chen, S. Herbert and C. J. Tomlin (2017) Hamilton-Jacobi reachability: a brief overview and recent advances. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 2242–2253.
  4. S. Biswas, L. V. Lakshmanan and S. B. Ray (2017) Combating the cold start user problem in model based collaborative filtering. arXiv preprint arXiv:1703.00397.
  5. S. Bonner and F. Vasile (2018) Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112.
  6. D. Bouneffouf, A. Bouzeghoub and A. L. Gançarski (2012) A contextual-bandit algorithm for mobile context-aware recommender system. In International Conference on Neural Information Processing, pp. 324–331.
  7. J. Bridle (2017-11) Something is wrong on the internet. Medium.
  8. A. Burashnikova, Y. Maximov and M. Amini (2019) Sequential learning over implicit feedback for robust large-scale recommender systems. arXiv preprint arXiv:1902.08495.
  9. P. Castells, S. Vargas and J. Wang (2011) Novelty and diversity metrics for recommender systems: choice, discovery and relevance.
  10. O. Celma (2010) Music Recommendation and Discovery in the Long Tail. Springer.
  11. A. Çivril and M. Magdon-Ismail (2009) On selecting a maximum volume sub-matrix of a matrix and related problems. Theoretical Computer Science 410 (47-49), pp. 4801–4811.
  12. P. Covington, J. Adams and E. Sargin (2016) Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198.
  13. M. F. Dacrema, P. Cremonesi and D. Jannach (2019) Are we really making much progress? A worrying analysis of recent neural recommendation approaches. arXiv preprint arXiv:1907.06902.
  14. S. E. (2018-01-10) YouTube's AI is the puppet master over most of what you watch. CNET.
  15. M. D. Ekstrand, M. Tian, I. M. Azpiazu, J. D. Ekstrand, O. Anuyah, D. McNeill and M. S. Pera (2018) All the cool kids, how do they fit in? Popularity and demographic biases in recommender evaluation and effectiveness. In Conference on Fairness, Accountability and Transparency, pp. 172–186.
  16. M. D. Ekstrand, M. Tian, M. R. I. Kazi, H. Mehrpouyan and D. Kluver (2018) Exploring author gender in book rating and recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 242–250.
  17. S. Flaxman, S. Goel and J. M. Rao (2016) Filter bubbles, echo chambers, and online news consumption. Public Opinion Quarterly 80 (S1), pp. 298–320.
  18. N. Gillis (2014) The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines 12 (257).
  19. D. Goldberg, D. Nichols, B. M. Oki and D. Terry (1992) Using collaborative filtering to weave an information tapestry. Communications of the ACM 35 (12), pp. 61–71.
  20. Gurobi Optimization, LLC (2019) Gurobi optimizer reference manual.
  21. M. Hardt, N. Megiddo, C. Papadimitriou and M. Wootters (2016) Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pp. 111–122.
  22. F. M. Harper and J. A. Konstan (2016) The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 19.
  23. F. M. Harper, F. Xu, H. Kaur, K. Condiff, S. Chang and L. Terveen (2015) Putting users in control of their recommendations. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 3–10.
  24. X. He, H. Zhang, M. Kan and T. Chua (2016) Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 549–558.
  25. Y. Hu, Y. Koren and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. In ICDM, Vol. 8, pp. 263–272.
  26. J. Kawale, H. H. Bui, B. Kveton, L. Tran-Thanh and S. Chawla (2015) Efficient Thompson sampling for online matrix factorization recommendation. In Advances in Neural Information Processing Systems, pp. 1297–1305.
  27. Y. Koren, R. Bell and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37.
  28. Y. Koren (2009) The BellKor solution to the Netflix Grand Prize. Netflix prize documentation 81 (2009), pp. 1–10.
  29. Y. Lei and W. Li (2019) When collaborative filtering meets reinforcement learning. arXiv preprint arXiv:1902.00715.
  30. J. Mary, R. Gaudel and P. Preux (2015) Bandits and recommender systems. In International Workshop on Machine Learning, Optimization and Big Data, pp. 325–336.
  31. P. Matuszyk, J. Vinagre, M. Spiliopoulou, A. M. Jorge and J. Gama (2015) Forgetting methods for incremental matrix factorization in recommender systems. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, pp. 947–953.
  32. S. Milli, J. Miller, A. D. Dragan and M. Hardt (2018) The social cost of strategic classification. arXiv preprint arXiv:1808.08460.
  33. T. T. Nguyen, P. Hui, F. M. Harper, L. Terveen and J. A. Konstan (2014) Exploring the filter bubble: the effect of using recommender systems on content diversity. In Proceedings of the 23rd International Conference on World Wide Web, pp. 677–686.
  34. J. Nicas (2018-02) How YouTube drives people to the internet's darkest corners. The Wall Street Journal.
  35. X. Ning and G. Karypis (2011) SLIM: sparse linear methods for top-N recommender systems. In 2011 IEEE 11th International Conference on Data Mining, pp. 497–506.
  36. S. Osher and R. Fedkiw (2006) Level Set Methods and Dynamic Implicit Surfaces. Vol. 153, Springer Science & Business Media.
  37. A. Paterek (2007) Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, Vol. 2007, pp. 5–8.
  38. S. Rendle, L. Zhang and Y. Koren (2019) On the difficulty of evaluating baselines: a study on recommender systems. arXiv preprint arXiv:1905.01395.
  39. S. Rendle (2012-05) Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology 3 (3), pp. 57:1–57:22.
  40. S. Rendle (2013) Scaling factorization machines to relational data. In Proceedings of the VLDB Endowment, Vol. 6, pp. 337–348.
  41. F. Ricci, L. Rokach and B. Shapira (2011) Introduction to recommender systems handbook. In Recommender Systems Handbook, pp. 1–35.
  42. C. Russell (2019) Efficient search for diverse coherent explanations. arXiv preprint arXiv:1901.04909.
  43. A. I. Schein, A. Popescul, L. H. Ungar and D. M. Pennock (2002) Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 253–260.
  44. H. Steck (2011) Item popularity and recommendation accuracy. In Proceedings of the Fifth ACM Conference on Recommender Systems, pp. 125–132.
  45. H. Steck (2018) Calibrated recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 154–162.
  46. N. Tintarev (2017) Presenting diversity aware recommendations: making challenging news acceptable.
  47. Z. Tufekci (2018-03) YouTube, the great radicalizer. The New York Times.
  48. B. Ustun, A. Spangher and Y. Liu (2019) Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 10–19.
  49. S. Wachter, B. Mittelstadt and C. Russell (2017) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law & Technology 31 (2).
  50. B. Webb (2006) Netflix update: try this at home.
  51. Z. Wei, J. Xu, Y. Lan, J. Guo and X. Cheng (2017) Reinforcement learning to rank with Markov decision process. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 945–948.
  52. L. Yang, M. Sobolev, Y. Wang, J. Chen, D. Dunne, C. Tsangouri, N. Dell, M. Naaman and D. Estrin (2019) How intention informed recommendations modulate choices: a field study of spoken word content. In The World Wide Web Conference, pp. 2169–2180.
  53. Y. Zhou, D. Wilkinson, R. Schreiber and R. Pan (2008) Large-scale parallel collaborative filtering for the Netflix prize. In International Conference on Algorithmic Applications in Management, pp. 337–348.