###### Abstract

The gang of bandits (GOB) model [7] is a recent contextual bandits framework that shares information between a set of bandit problems, related by a known (possibly noisy) graph. This model is useful in problems like recommender systems where the large number of users makes it vital to transfer information between users. Despite its effectiveness, the existing GOB model can only be applied to small problems due to its quadratic time-dependence on the number of nodes. Existing solutions to combat the scalability issue require an often-unrealistic clustering assumption. By exploiting a connection to Gaussian Markov random fields (GMRFs), we show that the GOB model can be made to scale to much larger graphs without additional assumptions. In addition, we propose a Thompson sampling algorithm which uses the recent GMRF sampling-by-perturbation technique, allowing it to scale to even larger problems (leading to a “horde” of bandits). We give regret bounds and experimental results for GOB with Thompson sampling and epoch-greedy algorithms, indicating that these methods are as good as or significantly better than ignoring the graph or adopting a clustering-based approach. Finally, when an existing graph is not available, we propose a heuristic for learning it on the fly and show promising results.

Horde of Bandits using Gaussian Markov Random Fields

Sharan Vaswani, Mark Schmidt, Laks V.S. Lakshmanan

University of British Columbia

## 1 Introduction

Consider a newly established recommender system (RS) which has little or no information about the users’ preferences or any available rating data. The unavailability of rating data implies that we cannot use traditional collaborative-filtering-based methods [41]. Furthermore, in the scenario of personalized news recommendation or for recommending trending Facebook posts, the set of available items is not fixed but instead changes continuously. This new RS can recommend items to the users and observe their ratings to learn their preferences from this feedback (“exploration”). However, in order to retain its users, it should at the same time recommend “relevant” items that will be liked by and elicit higher ratings from users (“exploitation”). Assuming each item can be described by its content (like tags describing a news article or video), the contextual bandits framework [29] offers a popular approach for addressing this exploration-exploitation trade-off.

However, this framework assumes that users interact with the RS in an isolated manner, when in fact an RS might have an associated social component. In particular, given the large number of users on such systems, we may be able to learn their preferences more quickly by leveraging the relations between them. One way to use a social network of users to improve recommendations is with the recent gang of bandits (GOB) model [7]. In particular, the GOB model exploits the homophily effect [35] that suggests users with similar preferences are more likely to form links in a social network. In other words, user preferences vary smoothly across the social graph and tend to be similar for users connected with each other. This allows us to transfer information between users; we can learn about a user from his or her friends’ ratings. However, the existing recommendation algorithm in the GOB framework has a quadratic time-dependence on the number of nodes (users) and thus can only be used for a small number of users. Several recent works have tried to improve the scaling of the GOB model by clustering the users into groups [17, 36], but this limits the flexibility of the model and loses the ability to model individual users’ preferences.

In this paper, we cast the GOB model in the framework of Gaussian Markov random fields (GMRFs) and show how to exploit this connection to scale it to much larger graphs. Specifically, we interpret the GOB model as the optimization of a Gaussian likelihood on the users’ observed ratings and interpret the user-user graph as the prior inverse-covariance matrix of a GMRF. From this perspective, we can efficiently estimate the users’ preferences by performing MAP estimation in a GMRF. In addition, we propose a Thompson sampling GOB variant that exploits the recent sampling-by-perturbation idea from the GMRF literature [37] to scale to even larger problems. This idea is fairly general and might be of independent interest in the efficient implementation of other Thompson sampling methods. We establish regret bounds (Section 4) and provide experimental results (Section 5) for Thompson sampling as well as an epoch-greedy strategy. These experiments indicate that our methods are as good as or significantly better than approaches which ignore the graph or that cluster the nodes. Finally, when the graph of users is not available, we propose a heuristic for learning the graph and user preferences simultaneously in an alternating minimization framework (Appendix A).

## 2 Related Work

Social Regularization: Using social information to improve recommendations was first introduced by Ma et al. [31]. They used matrix factorization to fit existing rating data but constrained a user’s latent vector to be similar to their friends in the social network. Other methods based on collaborative filtering followed [38, 13], but these works assume that we already have rating data available. Thus, these methods do not address the exploration-exploitation trade-off faced by a new RS that we consider.

Bandits: The multi-armed bandit problem is a classic approach for trading off exploration and exploitation as we collect data [26]. When features (context) for the “arms” are available and changing, it is referred to as the contextual bandit problem [4, 29, 9]. The contextual bandit framework is important for the scenario we consider where the set of items available is constantly changing, since the features allow us to make predictions about items we have never seen before. Algorithms for the contextual bandits problem include epoch-greedy methods [27], those based on upper confidence bounds (UCB) [9, 1], and Thompson sampling methods [2]. Note that these standard contextual bandit methods do not model the user-user dependencies that we want to exploit.

Several graph-based methods to model dependencies between the users have been explored in the (non-contextual) multi-armed bandit framework [6, 33, 3, 32], but the GOB model of Cesa-Bianchi et al. [7] is the first to exploit the network between users in the contextual bandit framework. They proposed a UCB-style algorithm and showed that using the graph leads to lower regret from both a theoretical and practical standpoint. However, their algorithm has a time complexity that is quadratic in the number of users. This makes it infeasible for typical RS that have tens of thousands (or even millions) of users.

To scale up the GOB model, several recent works propose to cluster the users and assume that users in the same cluster have the same preferences [17, 36]. But this solution loses the ability to model individual users’ preferences, and indeed our experiments indicate that in some applications clustering significantly hurts performance. In contrast, we want to scale up the original GOB model that learns more fine-grained information in the form of a preference-vector specific to each user.

Another interesting approach to relax the clustering assumption is to cluster both items and users [30], but this only applies if we have a fixed set of items. Some works consider item-item similarities to improve recommendations [42, 23], but this again requires a fixed set of items while we are interested in RS where the set of items may constantly be changing. There has also been work on solving a single bandit problem in a distributed fashion [24], but this differs from our approach where we are solving an individual bandit problem on each of the nodes. Finally, we note that all of the existing graph-based works consider relatively small RS datasets ( users), while our proposed algorithms can scale to much larger RS.

## 3 Scaling up Gang of Bandits

In this section we first describe the general GOB framework, then discuss the relationship to GMRFs, and finally show how this leads to a more scalable method. In this paper $\mathrm{Tr}(A)$ denotes the trace of matrix $A$, $A \otimes B$ denotes the Kronecker product of matrices $A$ and $B$, $I_d$ is used for the $d$-dimensional identity matrix, and $\mathrm{vec}(A)$ is the stacking of the columns of a matrix $A$ into a vector.

### 3.1 Gang of Bandits Framework

The contextual bandits framework proceeds in rounds. In each round $t$, a set of items $\mathcal{C}_t$ becomes available. These items could be movies released in a particular week, news articles published on a particular day, or trending stories on Facebook. We assume that for all . We assume that each item $j$ can be described by a context (feature) vector $x_j \in \mathbb{R}^d$. We use $n$ as the number of users, and denote the (unknown) ground-truth preference vector for user $i$ as $w^*_i \in \mathbb{R}^d$. Throughout the paper, we assume there is only a single target user per round. It is straightforward to extend our results to multiple target users.

Given a target user $i_t$, our task is to recommend an available item $j_t \in \mathcal{C}_t$ to them. User $i_t$ then provides feedback on the recommended item in the form of a rating $r_{i_t,j_t}$. Based on this feedback, the estimated preference vector for user $i_t$ is updated. The recommendation algorithm must trade off between exploration (learning about the users’ preferences) and exploitation (obtaining high ratings). We evaluate performance using the notion of regret, which is the loss in recommendation performance due to lack of knowledge of user preferences. In particular, the regret after $T$ rounds is given by:

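$$R(T) = \sum_{t=1}^{T} \left[ \max_{j \in \mathcal{C}_t} \left( {w^*_{i_t}}^{\top} x_j \right) - {w^*_{i_t}}^{\top} x_{j_t} \right]. \tag{1}$$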
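As a small illustration of (1), the following sketch computes the cumulative regret of a toy run. It is only an example under stated assumptions: the function name `cumulative_regret` and the array layout (one row of `w_star` per user, one array of item features per round) are hypothetical choices made here, not part of the original work.

```python
import numpy as np

def cumulative_regret(w_star, contexts, users, chosen):
    """Regret (1): per round, best expected rating minus that of the chosen item.

    w_star:   (n, d) true preference vectors, one row per user (assumed layout)
    contexts: list of (K_t, d) arrays, features of the items available in round t
    users:    list of target-user indices i_t
    chosen:   list of indices j_t (rows of contexts[t]) recommended in round t
    """
    total = 0.0
    for X, i, j in zip(contexts, users, chosen):
        expected = X @ w_star[i]              # expected rating of every available item
        total += expected.max() - expected[j]
    return total

# Two users with orthogonal tastes, the same two items in both rounds.
w_star = np.array([[1.0, 0.0], [0.0, 1.0]])
contexts = [np.array([[1.0, 0.0], [0.0, 1.0]])] * 2
print(cumulative_regret(w_star, contexts, users=[0, 1], chosen=[0, 0]))
```

In this toy run the first recommendation is optimal and the second is not, so all of the regret comes from the second round.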

In our analysis we make the following assumptions:

###### Assumption 1.

The -norms of the true preference vectors and item feature vectors are bounded from above. Without loss of generality we’ll assume for all and for all . Also without loss of generality, we assume that the ratings are in the range .

###### Assumption 2.

The true ratings can be given by a linear model [29], meaning that for some noise term .

These are standard assumptions in the literature. We denote the history of observations until round as and the union of the set of available items until round along with their corresponding features as .

###### Assumption 3.

The noise is conditionally sub-Gaussian [2][7] with zero mean and bounded variance, meaning that and that there exists a such that for all , we have .

This assumption implies that for all and , the conditional mean is given by and that the conditional variance satisfies .

In the GOB framework, we assume access to a (fixed) graph of users in the form of a social network (or “trust graph”). Here, the nodes correspond to users, whereas the edges correspond to friendships or trust relationships. The homophily effect implies that the true user preferences vary smoothly across the graph, so we expect the preferences of users connected in the graph to be close to each other. Specifically,

###### Assumption 4.

The true user preferences vary smoothly according to the given graph, in the sense that we have a small value of

$$\sum_{(i_1, i_2) \in E} \left\| w^*_{i_1} - w^*_{i_2} \right\|^2.$$

Hence, we assume that the graph acts as a correctly-specified prior on the users’ true preferences. Note that this assumption implies that nodes in dense subgraphs will have a higher similarity than those in sparse subgraphs (since they will have a larger number of neighbours).
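The smoothness measure in Assumption 4 can be checked numerically on a toy graph. The sketch below assumes, purely for illustration, the unnormalized Laplacian $D - A$ (the paper itself works with a modified normalized Laplacian); for that choice the edge sum coincides with the quadratic form $w^\top (L \otimes I_d) w$. The helper names are hypothetical.

```python
import numpy as np

def edge_smoothness(W, edges):
    """Sum of squared differences between preference vectors of linked users."""
    return sum(float(np.sum((W[i] - W[j]) ** 2)) for i, j in edges)

def combinatorial_laplacian(n, edges):
    """Unnormalized Laplacian D - A of an undirected graph (illustrative choice)."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

edges = [(0, 1), (1, 2)]                          # a path graph on three users
W = np.array([[1.0, 0.0], [0.8, 0.1], [-1.0, 0.5]])  # row i = preferences of user i
L = combinatorial_laplacian(3, edges)
w = W.ravel()                                     # concatenation of the user vectors
quad_form = w @ np.kron(L, np.eye(2)) @ w         # w^T (L kron I_d) w
print(edge_smoothness(W, edges), quad_form)
```

Users 0 and 1 have similar preferences and contribute little to the sum, while the dissimilar pair (1, 2) dominates it.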

This assumption is violated in some datasets. For example, in our experiments we consider one dataset in which the available graph is imperfect, in that user preferences do not seem to vary smoothly across all graph edges. Intuitively, we might expect the GOB model to be harmful in this case (compared to ignoring the graph structure). However, in our experiments, we observe that even in these cases, the GOB approach still leads to results as good as ignoring the graph.

The GOB model [7] solves a contextual bandit problem for each user, where the mean vectors in the different problems are related according to the Laplacian $L$ of the graph (to ensure invertibility, we set where is the normalized graph Laplacian). Let $w_{i,t}$ be the preference vector estimate for user $i$ at round $t$. Let $w_t$ and $w^*$ (respectively) be the concatenation of the vectors $w_{i,t}$ and $w^*_i$ across all users. The GOB model solves the following regression problem to find the mean preference vector estimate at round $t$,

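$$w_t = \arg\min_{w} \left[ \sum_{i=1}^{n} \sum_{k \in \mathcal{M}_{i,t}} \left( w_i^{\top} x_k - r_{i,k} \right)^2 + \lambda\, w^{\top} (L \otimes I_d)\, w \right], \tag{2}$$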

where $\mathcal{M}_{i,t}$ is the set of items rated by user $i$ up to round $t$. The first term is a data-fitting term and models the observed ratings. The second term is the Laplacian regularization and equal to . This term models smoothness across the graph with $\lambda$ giving the strength of this regularization. Note that the same objective function has also been explored for graph-regularized multi-task learning [14].
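To make objective (2) concrete, the sketch below minimizes it on a two-user toy problem by solving its normal equations with a dense $dn \times dn$ matrix. This is the naive baseline whose cost motivates the rest of the section, not the scalable method. The function name `gob_mean_dense`, the observation format, and the choice of Laplacian (a single edge plus the identity, to keep it invertible) are assumptions made for this example.

```python
import numpy as np

def gob_mean_dense(obs, L, d, lam):
    """Minimize (2) by solving its normal equations with a dense dn x dn matrix.

    obs: list of (user index i, feature vector x of shape (d,), rating r)
    L:   (n, n) positive-definite graph Laplacian
    Returns an (n, d) array whose row i is the estimate for user i.
    """
    n = L.shape[0]
    A = lam * np.kron(L, np.eye(d))           # lambda * (L kron I_d)
    b = np.zeros(n * d)
    for i, x, r in obs:
        blk = slice(i * d, (i + 1) * d)       # entries of w belonging to user i
        A[blk, blk] += np.outer(x, x)         # data-fitting term
        b[blk] += r * x
    return np.linalg.solve(A, b).reshape(n, d)

# Two linked users; only user 0 has rated anything.
L = np.array([[2.0, -1.0], [-1.0, 2.0]])      # edge Laplacian plus identity (assumed)
obs = [(0, np.array([1.0, 0.0]), 1.0)]
W_hat = gob_mean_dense(obs, L, d=2, lam=1.0)
print(W_hat)
```

Although user 1 has no ratings, the regularizer pulls their estimate toward that of user 0, which is the information transfer the GOB model relies on.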

### 3.2 Connection to GMRFs

Unfortunately, the approach of Cesa-Bianchi et al. [7] for solving (2) has a computational complexity of $O(d^2 n^2)$. To solve (2) more efficiently, we now show that it can be interpreted as performing MAP estimation in a GMRF. This will allow us to apply the GOB model to much larger datasets, and lead to an even more scalable algorithm based on Thompson sampling (Section 4).

Consider the following generative model for the ratings and the user preference vectors ,

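$$r_{i,j} \sim \mathcal{N}\left( w_i^{\top} x_j,\ \sigma^2 \right), \qquad w \sim \mathcal{N}\left( 0,\ (\lambda L \otimes I_d)^{-1} \right).$$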

This GMRF model assumes that the ratings are independent given $w_i$ and $x_j$, which is the standard regression assumption. Under this independence assumption the first term in (2) is equal to the negative log-likelihood for all of the observed ratings at time $t$, up to an additive constant and assuming $\sigma = 1$. Similarly, the negative log-prior in this model gives the second term in (2) (again, up to an additive constant that does not depend on $w$). Thus, by Bayes rule minimizing (2) is equivalent to maximizing the posterior in this GMRF model.
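Written out under the generative model above, with $r_t$ denoting the ratings observed up to round $t$ and $\mathcal{M}_{i,t}$ the set of items rated by user $i$ as in (2), the negative log-posterior is

$$-\log p(w \mid r_t) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{k \in \mathcal{M}_{i,t}} \left( w_i^{\top} x_k - r_{i,k} \right)^2 + \frac{\lambda}{2}\, w^{\top} (L \otimes I_d)\, w + \text{const}.$$

With $\sigma = 1$ the right-hand side is one half of the objective in (2) plus a constant, so the two have the same minimizer.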

To characterize the posterior, it is helpful to introduce the notation $\phi_{i,j} \in \mathbb{R}^{dn}$ to represent the “global” feature vector corresponding to recommending item $j$ to user $i$. In particular, let $\phi_{i,j}$ be the concatenation of $n$ $d$-dimensional vectors where the $i$-th vector is equal to $x_j$ and the others are zero. The rows of the $t \times dn$ dimensional matrix $\Phi_t$ correspond to these “global” features for all the recommendations made until time $t$. Under this notation, the posterior is given by a $\mathcal{N}(w_t, \Sigma_t^{-1})$ distribution with $\Sigma_t = \frac{1}{\sigma^2} \Phi_t^{\top} \Phi_t + \lambda (L \otimes I_d)$ and $w_t = \frac{1}{\sigma^2} \Sigma_t^{-1} b_t$ with $b_t = \Phi_t^{\top} r_t$. We can view the approach in [7] as explicitly constructing the dense $dn \times dn$ matrix $\Sigma_t^{-1}$, leading to an $O(d^2 n^2)$ memory requirement. A new recommendation at round $t$ is thus equivalent to a rank-1 update to $\Sigma_t$, and even with the Sherman-Morrison formula this leads to an $O(d^2 n^2)$ time requirement for each iteration.

### 3.3 Scalability

Rather than treating $\Sigma_t$ as a general matrix, we propose to exploit its structure to scale up the GOB framework to problems where $n$ is very large. In particular, solving (2) corresponds to finding the mean vector of the GMRF, which corresponds to solving the linear system $\Sigma_t w_t = \frac{1}{\sigma^2} b_t$. Since $\Sigma_t$ is positive-definite, the linear system can be solved using conjugate gradient [20]. Conjugate gradient notably does not require $\Sigma_t^{-1}$, but instead uses matrix-vector products $\Sigma_t v = \frac{1}{\sigma^2} (\Phi_t^{\top} \Phi_t) v + \lambda (L \otimes I_d) v$ for vectors $v \in \mathbb{R}^{dn}$. Note that $\Phi_t^{\top} \Phi_t$ is block diagonal and has only $O(nd^2)$ non-zeroes. Hence, $\Phi_t^{\top} \Phi_t v$ can be computed in $O(nd^2)$ time. For computing $(L \otimes I_d) v$, we use that $(B^{\top} \otimes A) v = \mathrm{vec}(AVB)$, where $V$ is a $d \times n$ matrix such that $\mathrm{vec}(V) = v$. This implies $(L \otimes I_d) v$ can be written as $\mathrm{vec}(V L^{\top})$ which can be computed in $O(d \cdot \mathrm{nnz}(L))$ time, where $\mathrm{nnz}(L)$ is the number of non-zeroes in $L$. This approach thus has a memory requirement of $O(nd^2 + \mathrm{nnz}(L))$ and a time complexity of $O(\kappa (nd^2 + d \cdot \mathrm{nnz}(L)))$ per mean estimation. Here, $\kappa$ is the number of conjugate gradient iterations which depends on the condition number of the matrix (we used warm-starting by the solution in the previous round for our experiments, which meant that was enough for convergence). Thus, the algorithm scales linearly in $n$ and in the number of edges of the network (which tends to be linear in $n$ due to the sparsity of social relationships). This enables us to scale to large networks, of the order of K nodes and millions of edges.
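The matrix-free idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unit noise variance, stores one $d \times d$ block per user for the block-diagonal data term, and uses a dense array for the Laplacian (a sparse matrix would be used at scale). The names `make_matvec` and `conjugate_gradient` are hypothetical, and the `x0` argument is where a warm start from the previous round would go.

```python
import numpy as np

def make_matvec(blocks, L, lam):
    """Return v -> (blockdiag(blocks) + lam * (L kron I_d)) v without forming it.

    blocks: (n, d, d) array, blocks[i] = sum of x x^T over items rated by user i
    L:      (n, n) Laplacian
    """
    n, d, _ = blocks.shape
    def matvec(v):
        V = v.reshape(n, d)                               # row i = user i's block of v
        data_term = np.einsum("iab,ib->ia", blocks, V)    # block-diagonal product
        graph_term = lam * (L @ V)                        # same as lam * (L kron I_d) v
        return (data_term + graph_term).ravel()
    return matvec

def conjugate_gradient(matvec, b, x0=None, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A given only v -> A v."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n, d, lam = 4, 3, 1.0
adj = np.ones((n, n)) - np.eye(n)                 # complete graph on four users
L = np.diag(adj.sum(axis=1)) - adj + np.eye(n)    # Laplacian plus identity (assumed)
X = rng.normal(size=(n, 5, d))                    # five rated items per user
blocks = np.einsum("ika,ikb->iab", X, X)
b = rng.normal(size=n * d)
w = conjugate_gradient(make_matvec(blocks, L, lam), b)
```

Here `L @ V` on the row-wise layout is the same computation as $\mathrm{vec}(V L^{\top})$ on the column-wise layout used in the text; only the storage order differs.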

## 4 Alternative Bandit Algorithms

The above structure can be used to speed up the mean estimation for any algorithm in the GOB framework. However, the LINUCB-like algorithm in [7] needs to estimate the confidence intervals for each available item $j \in \mathcal{C}_t$. Using the GMRF connection, estimating these requires $|\mathcal{C}_t|$ times the cost of a single mean estimation, since we need to solve the linear system with $|\mathcal{C}_t|$ right-hand sides, one for each available item. But this becomes impractical when the number of available items in each round is large.

We propose two approaches for mitigating this: first, in this section we adapt the epoch-greedy [27] algorithm to the GOB framework. Epoch-greedy doesn’t require confidence intervals and is thus very scalable, but unfortunately it doesn’t achieve the optimal regret of $\tilde{O}(\sqrt{T})$. To achieve the optimal regret, we also propose a GOB variant of Thompson sampling [29]. In this section we further exploit the connection to GMRFs to scale Thompson sampling to even larger problems by using the recent sampling-by-perturbation trick [37]. This GMRF connection and scalability trick might be of independent interest for Thompson sampling in other large-scale problems.

### 4.1 Epoch-Greedy

Epoch-greedy [27] is a variant of the popular $\epsilon$-greedy algorithm that explicitly differentiates between exploration and exploitation rounds. An “exploration” round consists of recommending a random item from $\mathcal{C}_t$ to the target user $i_t$. The feedback from these exploration rounds is used to learn $w^*$. An “exploitation” round consists of choosing the available item which maximizes the expected rating, . Epoch-greedy proceeds in epochs, where each epoch consists of 1 exploration round and exploitation rounds.
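The epoch structure described above can be sketched as a simple schedule. The sketch is illustrative only: the number of exploitation rounds per epoch is supplied through a hypothetical callable `exploit_rounds`, and `scores` stands in for the estimated ratings of the available items, however they are computed.

```python
import random

def epoch_greedy_schedule(T, exploit_rounds):
    """Yield 'explore' / 'exploit' labels for T rounds.

    exploit_rounds(q) is a hypothetical callable giving the number of
    exploitation rounds in epoch q; every epoch starts with one exploration round.
    """
    t, q = 0, 1
    while t < T:
        yield "explore"
        t += 1
        for _ in range(exploit_rounds(q)):
            if t >= T:
                return
            yield "exploit"
            t += 1
        q += 1

def choose_item(mode, scores, rng=random):
    """Pick an item index: uniformly at random when exploring, greedily otherwise."""
    if mode == "explore":
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=scores.__getitem__)

schedule = list(epoch_greedy_schedule(6, lambda q: q))
print(schedule)
print(choose_item("exploit", [0.1, 0.9, 0.3]))
```

Only the ratings gathered in exploration rounds are unbiased samples, which is why the analysis below counts exploration rounds when bounding the generalization error.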

Scalability: The time complexity for Epoch-Greedy is dominated by the exploitation rounds that require computing the mean and estimating the expected rating for all the available items. Given the mean vector, this estimation takes time. The overall time complexity per exploitation round is thus .

Regret: We assume that we incur a maximum regret of in an exploration round, whereas the regret incurred in an exploitation round depends on how well we have learned . The attainable regret is thus proportional to the generalization error for the class of hypothesis functions mapping the context vector to an expected rating [27]. In our case, the class of hypotheses is a set of linear functions (one for each user) with Laplacian regularization. We characterize the generalization error in the GOB framework in terms of its Rademacher complexity [34], and use this to bound the expected regret leading to the result below. For ease of exposition in the regret bounds, we suppress the factors that don’t depend on either , , or . The complete bound is stated in the supplementary material (Appendix B).

###### Theorem 1.

Under the additional assumption that for all rounds , the expected regret obtained by epoch-greedy in the GOB framework is given as:

$$R(T) = \tilde{O}\left( n^{1/3} \left( \frac{\mathrm{Tr}(L^{-1})}{\lambda n} \right)^{1/3} T^{2/3} \right)$$

###### Proof Sketch.

Let be the class of valid hypotheses of linear functions coupled with Laplacian regularization. Let be the generalization error for after obtaining unbiased samples in the exploration rounds. We adapt Corollary 3.1 from [27] to our context:

###### Lemma 1.

If and is the smallest such that , the regret obtained by Epoch-Greedy can be bounded as .

We use [34] to bound the generalization error of our class of hypotheses in terms of its empirical Rademacher complexity . With probability ,

$$\mathrm{Err}(q, \mathcal{H}) \le \hat{\mathcal{R}}^{n}_{q}(\mathcal{H}) + \sqrt{\frac{9 \ln(2/\delta)}{2q}}. \tag{3}$$

Using Theorem 2 in [34] and Theorem 12 from [5], we obtain

$$\hat{\mathcal{R}}^{n}_{q}(\mathcal{H}) \le \frac{2}{\sqrt{q}} \sqrt{\frac{12\, \mathrm{Tr}(L^{-1})}{\lambda}}. \tag{4}$$

Using (3) and (4) we obtain

$$\mathrm{Err}(q, \mathcal{H}) \le \frac{2 \sqrt{12\, \mathrm{Tr}(L^{-1}) / \lambda} + \sqrt{9 \ln(2/\delta) / 2}}{\sqrt{q}}. \tag{5}$$

The theorem follows from (5) along with Lemma 1. ∎

The effect of the graph on this regret bound is reflected through the term . For a connected graph, we have the following upper-bound  [34]. Here, is the second smallest eigenvalue of the Laplacian. The value represents the algebraic connectivity of the graph [15]. For a more connected graph, is higher, the value of is lower, resulting in a smaller regret. Note that although this result leads to a sub-optimal dependence on ( instead of ), our experiments incorporate a small modification that gives similar performance to the more-expensive LINUCB.

### 4.2 Thompson sampling

A common alternative to LINUCB and Epoch-Greedy is Thompson sampling (TS). At each iteration TS uses a sample from the posterior distribution at round , . It then selects the item based on the obtained sample, . We show below that the GMRF connection makes TS scalable, but unlike Epoch-Greedy it also achieves the optimal regret.

Scalability: The conventional approach for sampling from a multivariate Gaussian posterior involves forming the Cholesky factorization of the posterior covariance matrix. But in the GOB model the posterior covariance matrix is a $dn$-dimensional matrix where the fill-in from the Cholesky factorization can lead to a computational complexity of $O(d^3 n^3)$. In order to implement Thompson sampling for large values of $n$, we adapt the recent sampling-by-perturbation approach [37] to our setting, and this allows us to sample from a Gaussian prior and then solve a linear system to sample from the posterior.

Let $\tilde{w}_0$ be a sample from the prior distribution and let $\tilde{r}_t$ be the perturbed (with standard normal noise) rating vector at round $t$, meaning that $\tilde{r}_t = r_t + y_t$ for $y_t \sim \mathcal{N}(0, I_t)$. In order to obtain a sample $\tilde{w}_t$ from the posterior, we can solve the linear system

$$\Sigma_t \tilde{w}_t = (L \otimes I_d)\, \tilde{w}_0 + \Phi_t^{\top} \tilde{r}_t. \tag{6}$$

Let be the Cholesky factor of so that . Note that . If , we can obtain a sample from the prior by solving . Since tends to be sparse (using for example [12, 25]), this equation can be solved efficiently using conjugate gradient. We can pre-compute and store and thus obtain a sample from the prior in time . Using that in (6) and simplifying we obtain

$$\Sigma_t \tilde{w}_t = (L \otimes I_d)\, \tilde{w}_0 + b_t + \Phi_t^{\top} y_t \tag{7}$$

As before, this system can be solved efficiently using conjugate gradient. Note that solving (7) results in an exact sample from the -dimensional posterior. Computing has a time complexity of . Thus, this approach is faster than the original GOB framework whenever . Since we focus on the case of large graphs, this condition will tend to hold in our setting.
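The sampling-by-perturbation step can be illustrated on a tiny problem. The sketch below is written under explicit assumptions that differ from a large-scale implementation: the rating noise has unit variance, the prior precision is passed in as a generic matrix `Q0` (playing the role of the graph term), and dense solves replace conjugate gradient for brevity. The function name `sample_posterior` is hypothetical. With $\Sigma = Q_0 + \Phi^{\top} \Phi$, the returned vector has mean $\Sigma^{-1} \Phi^{\top} r$ and covariance $\Sigma^{-1} (Q_0 + \Phi^{\top} \Phi) \Sigma^{-1} = \Sigma^{-1}$, so it is an exact posterior sample.

```python
import numpy as np

def sample_posterior(Phi, r, Q0, rng):
    """Draw one sample from N(Sigma^{-1} Phi^T r, Sigma^{-1}), Sigma = Q0 + Phi^T Phi.

    Q0 is the prior precision; unit rating-noise variance is assumed.
    """
    S = np.linalg.cholesky(Q0)                     # Q0 = S S^T
    z = rng.standard_normal(Q0.shape[0])
    w0 = np.linalg.solve(S.T, z)                   # prior sample, w0 ~ N(0, Q0^{-1})
    r_tilde = r + rng.standard_normal(len(r))      # perturbed ratings
    Sigma = Q0 + Phi.T @ Phi
    return np.linalg.solve(Sigma, Q0 @ w0 + Phi.T @ r_tilde)

Q0 = np.array([[2.0, -1.0], [-1.0, 2.0]])          # toy prior precision
Phi = np.array([[1.0, 0.0]])                       # one observed recommendation
r = np.array([1.0])
print(sample_posterior(Phi, r, Q0, np.random.default_rng(0)))
```

The only factorization involved is that of the prior precision, which does not change across rounds; everything that depends on the observed data enters through a single linear solve.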

We now describe an alternative method of constructing the right side of (7) that doesn’t depend on . Observe that computing is equivalent to sampling from the distribution . To sample from this distribution, we maintain the Cholesky factor of . Recall that the matrix is block diagonal (one block for every user) for all rounds . Hence, its Cholesky factor also has a block diagonal structure and requires storage. In each round, we make a recommendation to a single user and thus make a rank- update to only one block of . This is an order operation. Once we have an updated , sampling from and constructing the right side of (7) is an operation. The per-round computational complexity for our TS approach is thus for forming the right side in (7), for solving the linear system in (7) as well as for computing the mean, and for selecting the item. Thus, our proposed approach has a complexity linear in the number of nodes and edges and can scale to large networks.

Regret: To analyze the regret with TS, observe that TS in the GOB framework is equivalent to solving a single -dimensional contextual bandit problem, but with a modified prior covariance equal to instead of . We obtain the result below by following a similar argument to Theorem 1 in [2]. The main challenge in the proof is to make use of the available graph to bound the variance of the arms. We first state the result and then sketch the main differences from the original proof.

###### Theorem 2.

Under the following additional technical assumptions: (a) , (b) , and (c) , with probability , the regret obtained by Thompson Sampling in the GOB framework is given as:

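$$R(T) = \tilde{O}\left( \frac{dn \sqrt{T}}{\sqrt{\lambda}} \sqrt{ \log\left( \frac{3\, \mathrm{Tr}(L^{-1})}{n} + \frac{\mathrm{Tr}(L^{-1})\, T}{\lambda d n^2 \sigma^2} \right) } \right)$$

###### Proof Sketch.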

To make the notation cleaner, for the round and target user under consideration, we use to index the available items. Let the index of the optimal item at round be whereas the index of the item chosen by our algorithm is denoted . Let be the standard deviation in the estimated rating of item at round . It is given as . Further, let . Let be the event such that for all ,

$$E^{\mu}(t): \quad \left| \langle w_t, \phi_j \rangle - \langle w^*, \phi_j \rangle \right| \le l_t\, s_t(j)$$

We prove that, for , . Define , where . Let . Given that the event holds with high probability, we follow an argument similar to Lemma 4 of [2] and obtain the following bound:

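$$R(T) \le \frac{3 g_T}{\gamma} \sum_{t=1}^{T} s_t(j_t) + \frac{2 g_T}{\gamma} \sum_{t=1}^{T} \frac{1}{t^2} + \frac{6 g_T}{\gamma} \sqrt{2T \ln(2/\delta)} \tag{8}$$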

To bound the variance of the selected items, , we extend the analysis in [11, 43] to include the prior covariance term. We thus obtain the following inequality:

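$$\sum_{t=1}^{T} s_t(j_t) \le \sqrt{dnT} \times \sqrt{ C \log\left( \frac{\mathrm{Tr}(L^{-1})}{n} \right) + \log\left( 3 + \frac{T}{\lambda d n \sigma^2} \right) } \tag{9}$$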

where . Substituting this into (8) completes the proof. ∎

Note that since is large in our case, assumption (a) for the above theorem is reasonable. Assumptions (b) and (c) define the upper and lower bounds on the regularization parameter . Similar to epoch-greedy, transferring information across the graph reduces the regret by a factor dependent on . Note that compared to epoch-greedy, the regret bound for Thompson sampling has a worse dependence on , but its dependence on is optimal. If , we match the regret bound for a -dimensional contextual bandit problem [1]. Note that we have a dependence on and similar to the original GOB paper [7] and that this method performs similarly in practice in terms of regret. However, as we will see, our algorithm is much faster.

## 5 Experiments

### 5.1 Experimental Setup

Data: We first test the scalability of various algorithms using synthetic data and then evaluate their regret performance on two real datasets. For synthetic data we generate random -dimensional context vectors and ground-truth user preferences, and generate the ratings according to the linear model. We generated a random Kronecker graph with sparsity (which is approximately equal to the sparsity of our real datasets). It is well known that such graphs capture many properties of real-world social networks [28].

For the real data, we use the Last.fm and Delicious datasets which are available as part of the HetRec 2011 workshop. Last.fm is a music streaming website where each item corresponds to a music artist and the dataset consists of the set of artists each user has listened to. The associated social network consists of K users (nodes) and K friendship relations (edges). Delicious is a social bookmarking website, where an item corresponds to a particular URL and the dataset consists of the set of websites bookmarked by each user. Its corresponding social network consists of K users and K user-user relations. Similar to [7], we use the set of associated tags to construct the TF-IDF vector for each item and reduce the dimension of these vectors to . An artist (or URL) that a user has listened to (or has bookmarked) is said to be “liked” by the user. In each round, we select a target user uniformly at random and make the set consist of randomly chosen items such that there is at least 1 item liked by the target user. An item liked by the target user is assigned a reward of whereas other items are assigned a zero reward. We use a total of = thousand recommendation rounds and average our results across runs.

Algorithms: We denote our graph-based epoch-greedy and Thompson sampling algorithms as G-EG and G-TS, respectively. For epoch-greedy, although the theory suggests that we update the preference estimates only in the exploration rounds, we observed better performance by updating the preference vectors in all rounds (we use this variant in our experiments). We use of the total number of rounds for exploration, and we “exploit” in the remaining rounds. Similar to [17], all hyper-parameters are set using an initial validation set of thousand rounds. The best validation performance was observed for and . To control the amount of exploration for Thompson sampling, we use the posterior reshaping trick [8] which reduces the variance of the posterior by a factor of .

Baselines: We consider two variants of graph-based UCB-style algorithms: GOBLIN is the method proposed in the original GOB paper [7] while we use GOBLIN++ to refer to a variant that exploits the fast mean estimation strategy we develop in Section 3.3. Similar to [7], for both variants we discount the confidence bound term by a factor of .

We also include baselines which ignore the graph structure and make recommendations by solving independent linear contextual bandit problems for each user. We consider 3 variants of this baseline: the LINUCB-IND proposed in [29], an epoch-greedy variant of this approach (EG-IND), and a Thompson sampling variant (TS-IND). We also compared to a baseline that does no personalization and simply considers a single bandit problem across all users (LINUCB-SIN). Finally, we compared against the state-of-the-art online clustering-based approach proposed in [17], denoted CLUB. This method starts with a fully connected graph and iteratively deletes edges from the graph based on UCB estimates. CLUB considers each connected component of this graph as a cluster and maintains one preference vector for all the users belonging to a cluster. Following the original work, we make CLUB scalable by generating a random Erdős–Rényi graph with . (We reimplemented CLUB. Note that one of the datasets from our experiments was also used in that work and we obtain similar performance to that reported in the original paper.) In all, we compare our proposed algorithms G-EG and G-TS with 7 reasonable baseline methods.

### 5.2 Results

Scalability: We first evaluate the scalability of the various algorithms with respect to the number of network nodes . Figure 1 shows the runtime in seconds/iteration when we fix and vary the size of the network from thousand to thousand nodes. Compared to GOBLIN, our proposed GOBLIN++ is more efficient in terms of both time (almost 2 orders of magnitude faster) and memory. Indeed, the existing GOBLIN method runs out of memory even on very small networks and thus we do not plot it for larger networks. Further, our proposed G-EG and G-TS methods scale even more gracefully in the number of nodes and are much faster than GOBLIN++ (although not as fast as the clustering-based CLUB or methods that ignore the graph).

We next consider scalability with respect to . Figure 1 fixes and varies from to . In this figure it is again clear that our proposed GOBLIN++ scales much better than the original GOBLIN algorithm. The EG and TS variants are again even faster, and other key findings from this experiment are (i) it was not faster to ignore the graph and (ii) our proposed G-EG and G-TS methods scale better with than CLUB.

Regret Minimization: We follow [17] in evaluating recommendation performance by plotting the ratio of cumulative regret incurred by the algorithm divided by the regret incurred by a random selection policy. Figure 2(a) plots this measure for the Last.fm dataset. In this dataset we see that treating the users independently (LINUCB-IND) takes a long time to drive down the regret (we do not plot EG-IND and TS-IND as they had similar performance) while simply aggregating across users (LINUCB-SIN) performs well initially (but eventually stops making progress). We see that the approaches exploiting the graph help learn the user preferences faster than the independent approach, and we note that on this dataset our proposed G-TS method performed similarly to or slightly better than the state-of-the-art CLUB algorithm.

Figure 2(b) shows performance on the Delicious dataset. On this dataset personalization is more important and we see that the independent method (LINUCB-IND) outperforms the non-personalized (LINUCB-SIN) approach. The need for personalization in this dataset also leads to worse performance of the clustering-based CLUB method, which is outperformed by all methods that model individual users. On this dataset the advantage of using the graph is less clear, as the graph-based methods perform similarly to the independent method. Thus, these two experiments suggest that (i) the scalable graph-based methods do no worse than ignoring the graph in cases where the graph is not helpful and (ii) the scalable graph-based methods can do significantly better on datasets where the graph is helpful. Similarly, when user preferences naturally form clusters our proposed methods perform similarly to CLUB, whereas on datasets where individual preferences are important our methods are significantly better.

## 6 Discussion

This work draws a connection between the GOB framework and GMRFs, and uses this to scale up the existing GOB model to much larger graphs. We also proposed and analyzed Thompson sampling and epoch-greedy variants. Our experiments on recommender systems datasets indicate that the Thompson sampling approach in particular is much more scalable than existing GOB methods, obtains theoretically optimal regret, and performs similarly to or better than other existing scalable approaches.

In many practical scenarios we do not have an explicit graph structure available. In the supplementary material we consider a variant of the GOB model where we use L1-regularization to learn the graph on the fly. Our experiments there show that this approach works similarly to or much better than approaches which use the fixed graph structure. It would be interesting to explore the theoretical properties of this approach.

## References

• [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
• [2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352, 2012.
• [3] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. arXiv preprint arXiv:1409.8428, 2014.
• [4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
• [5] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
• [6] Stéphane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 2012.
• [7] Nicolo Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. In Advances in Neural Information Processing Systems, pages 737–745, 2013.
• [8] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
• [9] Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
• [10] Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc.
• [11] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 355–366, 2008.
• [12] Timothy A Davis. Algorithm 849: A concise sparse Cholesky factorization package. ACM Transactions on Mathematical Software (TOMS), 31(4):587–591, 2005.
• [13] Julien Delporte, Alexandros Karatzoglou, Tomasz Matuszczyk, and Stéphane Canu. Socially enabled preference learning from implicit feedback data. In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer, 2013.
• [14] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
• [15] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak mathematical journal, 23(2):298–305, 1973.
• [16] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
• [17] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 757–765, 2014.
• [18] André R Gonçalves, Puja Das, Soumyadeep Chatterjee, Vidyashankar Sivakumar, Fernando J Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 451–460. ACM, 2014.
• [19] André R Gonçalves, Fernando J Von Zuben, and Arindam Banerjee. Multi-label structure learning with Ising model selection. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3525–3531. AAAI Press, 2015.
• [20] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems, volume 49. 1952.
• [21] Cho-Jui Hsieh, Inderjit S Dhillon, Pradeep K Ravikumar, and Mátyás A Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330–2338, 2011.
• [22] Cho-Jui Hsieh, Mátyás A Sustik, Inderjit S Dhillon, Pradeep K Ravikumar, and Russell Poldrack. Big & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pages 3165–3173, 2013.
• [23] Tomáš Kocák, Michal Valko, Rémi Munos, and Shipra Agrawal. Spectral Thompson sampling. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
• [24] Nathan Korda, Balázs Szörényi, and Shuai Li. Distributed clustering of linear bandits in peer to peer networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1301–1309, 2016.
• [25] Rasmus Kyng and Sushant Sachdeva. Approximate Gaussian elimination for Laplacians - fast, sparse, and simple. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 573–582. IEEE, 2016.
• [26] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
• [27] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
• [28] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. The Journal of Machine Learning Research, 11:985–1042, 2010.
• [29] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
• [30] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, 2016, pages 539–548, 2016.
• [31] Hao Ma, Dengyong Zhou, Chao Liu, Michael R Lyu, and Irwin King. Recommender systems with social regularization. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 287–296. ACM, 2011.
• [32] Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 136–144, 2014.
• [33] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
• [34] Andreas Maurer. The Rademacher complexity of linear transformation classes. In Learning Theory, pages 65–78. Springer, 2006.
• [35] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, pages 415–444, 2001.
• [36] Trong T Nguyen and Hady W Lauw. Dynamic clustering of contextual multi-armed bandits. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 1959–1962. ACM, 2014.
• [37] George Papandreou and Alan L Yuille. Gaussian sampling by local perturbations. In Advances in Neural Information Processing Systems, pages 1858–1866, 2010.
• [38] Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems, pages 2098–2106, 2015.
• [39] Havard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. CRC Press, 2005.
• [40] Avishek Saha, Piyush Rai, Suresh Venkatasubramanian, and Hal Daume. Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics, pages 643–651, 2011.
• [41] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009:4, 2009.
• [42] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In 31st International Conference on Machine Learning, 2014.
• [43] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1113–1122, 2015.

Supplementary Material

## Appendix A Learning the Graph

In the main paper, we assumed that the graph is known, but in practice such a user-user graph may not be available. In such a case, we explore a heuristic to learn the graph on the fly. The computational gains described in the main paper make it possible to simultaneously learn the user preferences and infer the graph between users in an efficient manner. Our approach for learning the graph is related to methods proposed for multitask and multilabel learning in the batch setting [19, 18] and multitask learning in the online setting [40]. However, prior works that learn the graph in related settings only tackle problems with tens or hundreds of tasks/labels, while we learn the graph and preferences across thousands of users.

Let be the inverse covariance matrix corresponding to the graph inferred between users at round . Since zeroes in the inverse covariance matrix correspond to conditional independences between the corresponding nodes (users) [39], we use L1 regularization on for encouraging sparsity in the inferred graph. We use an additional regularization term to encourage the graph to change smoothly across rounds. This encourages to be close to according to a distance metric . Following [40], we choose to be the log-determinant Bregman divergence given by . If corresponds to the matrix of user preference estimates, the combined objective can be written as:

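$$[w_t, V_t] = \operatorname*{argmin}_{w, V} \; \|r_t - \Phi_t w\|_2^2 + \operatorname{Tr}\left(V\left(\lambda W^T W + V_{t-1}^{-1}\right)\right) + \lambda_2 \|V\|_1 - (dn+1)\ln|V| \tag{10}$$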

The first term in (10) is the data fitting term. The second term imposes the smoothness constraint across the graph and ensures that the changes in are smooth. The third term ensures that the learnt precision matrix is sparse, whereas the last term penalizes the complexity of the precision matrix. This function is independently convex in both and (but not jointly convex), and we alternate between solving for and in each round. With a fixed , the sub-problem is the same as the MAP estimation in the main paper and can be done efficiently. For a fixed , the sub-problem is given by

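$$V_t = \operatorname*{argmin}_{V} \; \operatorname{Tr}\left(V\left[\lambda \overline{W}_t^T \overline{W}_t + V_{t-1}^{-1}\right]\right) + \lambda_2 \|V\|_1 - (dn+1)\ln|V| \tag{11}$$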

Here refers to the mean subtracted (for each dimension) matrix of user preferences. This problem can be written as a graphical lasso problem [16], , where the empirical covariance matrix is equal to . We use the highly-scalable second order methods described in [21, 22] to solve (11). Thus, both sub-problems in the alternating minimization framework at each round can be solved efficiently.
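To make the graph sub-problem concrete, the sketch below evaluates objective (11) at a candidate precision matrix and also computes a log-determinant Bregman divergence. It is an illustration under stated assumptions, not the solver used in the paper (which relies on the second-order methods of [21, 22]): the function names are hypothetical, the preference matrix is assumed to be d × n with one column per user, the L1 norm is assumed to be element-wise over the whole matrix, and the divergence uses the standard definition D(A, B) = Tr(A B⁻¹) − ln|A B⁻¹| − n, which may differ in normalization from the one in the paper.

```python
import numpy as np


def logdet_divergence(A, B):
    """Standard log-det Bregman divergence between SPD matrices A and B."""
    n = A.shape[0]
    # B^{-1} A has the same trace and determinant as A B^{-1}.
    M = np.linalg.solve(B, A)
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:
        raise ValueError("both matrices must be positive definite")
    return np.trace(M) - logdet - n


def graph_subproblem_objective(V, W, V_prev, lam, lam2):
    """Evaluate objective (11) at a candidate n x n precision matrix V.

    W is assumed to be d x n (one column per user).
    """
    d, n = W.shape
    # Mean-subtract each dimension (row) across users.
    W_bar = W - W.mean(axis=1, keepdims=True)
    S = lam * W_bar.T @ W_bar + np.linalg.inv(V_prev)  # n x n
    sign, logdet = np.linalg.slogdet(V)
    if sign <= 0:
        return float("inf")  # outside the positive definite cone
    # Assumption: element-wise L1 norm, diagonal included.
    return np.trace(V @ S) + lam2 * np.abs(V).sum() - (d * n + 1) * logdet


# Hypothetical toy problem: d = 2 features, n = 3 users.
W = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
I3 = np.eye(3)
objective_at_identity = graph_subproblem_objective(I3, W, I3, lam=1.0, lam2=0.1)
divergence_to_self = logdet_divergence(I3, I3)
divergence_scaled = logdet_divergence(2.0 * I3, I3)
```

Evaluating the objective this way is mainly useful for monitoring whether an alternating-minimization step actually decreased it; a graphical lasso solver would do the minimization itself.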

For our preliminary experiments in this direction, we use the most scalable epoch-greedy algorithm for learning the graph on the fly and denote this version as L-EG. We also consider another variant, U-EG in which we start from the Laplacian matrix corresponding to the given graph and allow it to change by re-estimating the graph according to (11). Since U-EG has the flexibility to infer a better graph than the one given, such a variant is important for cases where the prior is meaningful but somewhat misspecified (the given graph accurately reflects some but not all of the user similarities). Similar to [40], we start off with an empty graph and start learning the graph only after the preference vectors have become stable, which happens in this case after each user has received recommendations. We update the graph every K rounds. For both datasets, we allow the learnt graph to contain at most K edges and tune to achieve a sparsity level equal to 0.05 in both cases.

To avoid clutter, we plot all the variants of the EG algorithm, L-EG and U-EG, and use EG-IND, G-EG, EG-SIN as baselines. We also plot CLUB as a baseline. For the Last.fm dataset (Figure 3(a)), U-EG performs slightly better than G-EG, which already performed well. The regret for L-EG is lower than that of LINUCB-IND, indicating that learning the graph helps, but it is worse than both CLUB and LINUCB-SIN. On the other hand, for Delicious (Figure 3(b)), L-EG and U-EG are the best performing methods. L-EG slightly outperforms EG-IND, underscoring the importance of learning the user-user graph and transferring information between users. It also outperforms G-EG, which implies that it is able to learn a graph which reflects user similarities better than the existing social network between users. For both datasets, U-EG is among the top performing methods, which suggests that allowing modifications to a good initial graph (one that reflects user similarities reasonably well) to model the obtained data might be a good way to overcome prior misspecification. From a scalability point of view, for Delicious the running time for L-EG is seconds/iteration (averaged across ) as compared to seconds/iteration for G-EG. This shows that even in the absence of an explicit user-user graph, it is possible to achieve a low regret in an efficient manner.

## Appendix B Regret bound for Epoch-Greedy

###### Theorem 1.

Under the additional assumption that for all rounds , the expected regret obtained by epoch-greedy in the GOB framework is given as:

$$R(T) = \tilde{O}\left(n^{1/3}\left(\frac{\operatorname{Tr}(L^{-1})}{\lambda n}\right)^{1/3} T^{2/3}\right) \tag{12}$$
###### Proof.

Let be the class of hypotheses of linear functions (one for each user) coupled with Laplacian regularization. Let represent the regret or cost of performing exploitation steps in epoch . Let the number of exploitation steps in epoch be .

###### Lemma 2 (Corollary 3.1 from [27]).

If and is the minimum such that , then the regret obtained by Epoch Greedy is bounded by .

We now bound the quantity . Let be the generalization error for after obtaining unbiased samples in the exploration rounds. Clearly,

$$\mu(H, q, s) = s \cdot \operatorname{Err}(q, H). \tag{13}$$

Let be the least squares loss. Let the number of unbiased samples per user be equal to . The empirical Rademacher complexity for our hypotheses class under can be given as . The generalization error for can be bounded as follows:

###### Lemma 3 (Theorem 1 from [34]).

With probability ,

$$\operatorname{Err}(q, H) \le \hat{R}^n_p(\ell_{LS} \circ H) + \sqrt{\frac{9\ln(2/\delta)}{2pn}} \tag{14}$$

Assume that the target user is chosen uniformly at random. This implies that the expected number of samples per user is at least . For simplicity, assume is exactly divisible by so that (this only affects the bound by a constant factor). Substituting in (14), we obtain

$$\operatorname{Err}(q, H) \le \hat{R}^n_p(\ell_{LS} \circ H) + \sqrt{\frac{9\ln(2/\delta)}{2q}}. \tag{15}$$

The Rademacher complexity can be bounded using Lemma 4 (see below) as follows:

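$$\hat{R}^n_p(\ell_{LS} \circ H) \le \frac{1}{\sqrt{p}}\sqrt{\frac{48\operatorname{Tr}(L^{-1})}{\lambda n}} = \frac{1}{\sqrt{q}}\sqrt{\frac{48\operatorname{Tr}(L^{-1})}{\lambda}} \tag{16}$$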

Substituting this into (15) we obtain

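$$\operatorname{Err}(q, H) \le \frac{1}{\sqrt{q}}\left[\sqrt{\frac{48\operatorname{Tr}(L^{-1})}{\lambda}} + \sqrt{\frac{9\ln(2/\delta)}{2}}\right]. \tag{17}$$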

We set . Denoting as , .

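Recall that from Lemma 2, we need to determine $Q_T$ such that

$$Q_T + \sum_{q=1}^{Q_T} s_q \ge T \implies \sum_{q=1}^{Q_T} (1 + s_q) \ge T$$

Since $s_q \ge 1$, this implies that $\sum_{q=1}^{Q_T} 2 s_q \ge T$. Substituting the value of $s_q$ and observing that for all $q$, $s_{q+1} \ge s_q$, we obtain the following:

$$2 Q_T s_{Q_T} \ge T \implies \frac{2 Q_T^{3/2}}{C} \ge T \implies Q_T \ge \left(\frac{CT}{2}\right)^{2/3}$$

$$Q_T = \left[\sqrt{\frac{12\operatorname{Tr}(L^{-1})}{\lambda}} + \sqrt{\frac{9\ln(2/\delta)}{8}}\right]^{2/3} T^{2/3} \tag{19}$$

Using the above equation with Lemma 2, we can bound the regret as

$$R(T) \le 2\left[\sqrt{\frac{12\operatorname{Tr}(L^{-1})}{\lambda}} + \sqrt{\frac{9\ln(2/\delta)}{8}}\right]^{2/3} T^{2/3} \tag{20}$$

To simplify this expression, we suppress the term $\sqrt{\frac{9\ln(2/\delta)}{8}}$ in the $\tilde{O}$ notation, implying that

$$R(T) = \tilde{O}\left(2\left[\frac{12\operatorname{Tr}(L^{-1})}{\lambda}\right]^{1/3} T^{2/3}\right) \tag{21}$$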
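The choice of the number of epochs in (19) can be checked numerically. The sketch below is illustrative only: the function names and input values are hypothetical, the constant C is taken to be the bracketed term in (17), and the schedule s_q = √q / C is an assumption inferred from the step from 2 Q_T s_{Q_T} ≥ T to 2 Q_T^{3/2} / C ≥ T.

```python
import math


def epoch_schedule(T, trace_L_inv, lam, delta):
    """Return (C, Q_T), with Q_T the smallest integer with 2 Q^{3/2} / C >= T."""
    # Assumption: C is the bracketed constant from (17).
    C = math.sqrt(48.0 * trace_L_inv / lam) + math.sqrt(9.0 * math.log(2.0 / delta) / 2.0)
    Q_T = math.ceil((C * T / 2.0) ** (2.0 / 3.0))
    return C, Q_T


def exploitation_steps(q, C):
    # Assumed schedule s_q = sqrt(q) / C (not stated explicitly above).
    return math.sqrt(q) / C


# Hypothetical problem sizes, for illustration only.
T = 10_000
C, Q_T = epoch_schedule(T=T, trace_L_inv=50.0, lam=1.0, delta=0.05)
covered = 2.0 * Q_T * exploitation_steps(Q_T, C)
```

Here `covered` is the left-hand side 2 Q_T s_{Q_T}, which should be at least T, and Q_T grows as T^{2/3}, matching the rate in (19) to (21).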

To present and interpret the result, we keep only the factors which are dependent on , , and . We then obtain

$$R(T) = \tilde{O}\left(n^{1/3}\left(\frac{\operatorname{Tr}(L^{-1})}{\lambda n}\right)^{1/3} T^{2/3}\right) \tag{22}$$

This proves Theorem 1. We now prove Lemma 4, which was used to bound the Rademacher complexity.

###### Lemma 4.

The empirical Rademacher complexity for under on observing unbiased samples for each of the users can be given as:

$$\hat{R}^n_p(\ell_{LS} \circ H) \le \frac{1}{\sqrt{p}}\sqrt{\frac{48\operatorname{Tr}(L^{-1})}{\lambda n}} \tag{23}$$
###### Proof.

The Rademacher complexity for a class of linear predictors with graph regularization for a loss function can be bounded using Theorem 2 of [34]. Specifically,

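$$\hat{R}^n_p(\ell_{0,1} \circ H) \le \frac{2M}{\sqrt{p}}\sqrt{\frac{\operatorname{Tr}\left((\lambda L)^{-1}\right)}{n}} \tag{24}$$

where $M$ is the upper bound on the value of $\frac{\|L^{1/2} W^*\|_2}{\sqrt{n}}$ and $W^*$ is the $d \times n$ matrix corresponding to the true user preferences.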

We now upper bound .

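$$\|L^{1/2} W^*\|_2 \le \|L^{1/2}\|_2 \, \|W^*\|_2$$

$$\|W^*\|_2 \le \|W^*\|_F = \sqrt{\sum_{i=1}^{n} \|w_i^*\|_2^2}$$

$$\|W^*\|_2 \le \sqrt{n}$$

(Using assumption 1: for all $i$, $\|w_i^*\|_2 \le 1$.)

$$\|L^{1/2}\| \le \nu_{\max}(L^{1/2}) = \sqrt{\nu_{\max}(L)} \le \sqrt{3}$$

(The maximum eigenvalue of any normalized Laplacian $L_G$ is at most 2 [10], and recall that $L = L_G + I_n$.)

$$\implies \frac{\|L^{1/2} W^*\|_2}{\sqrt{n}} \le \sqrt{3} \implies M = \sqrt{3} \tag{26}$$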
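The eigenvalue bound used for M can be verified on a small example. The sketch below is a sanity check rather than part of the proof: the helper name and the 4-node path graph are hypothetical, and the adjacency matrix is assumed to be symmetric with no isolated nodes so that the normalized Laplacian is well defined.

```python
import numpy as np


def regularized_normalized_laplacian(A):
    """Return L = L_G + I_n for a symmetric adjacency matrix A.

    Assumes every node has at least one edge.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    n = len(A)
    # Normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
    L_G = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return L_G + np.eye(n)


# Hypothetical 4-node path graph.
A = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
L = regularized_normalized_laplacian(A)
eigenvalues = np.linalg.eigvalsh(L)
nu_max = eigenvalues.max()
M_bound = np.sqrt(nu_max)  # spectral norm of L^{1/2}
```

Because the eigenvalues of a normalized Laplacian lie in [0, 2], those of L lie in [1, 3], so `M_bound` cannot exceed √3 for any graph.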

Since we perform regression using a least squares loss function instead of classification, the Rademacher complexity in our case can be bounded using Theorem 12 from [5]. Specifically, if is the Lipschitz constant of the least squares problem,

$$\hat{R}^n_p(\ell_{LS} \circ H) \le 2\rho \cdot \hat{R}^n_p(\ell_{0,1} \circ H) \tag{27}$$

Since the estimates are bounded from above by (additional assumption in the theorem), . From Equations (24) and (27) and the bound on , we obtain that

$$\hat{R}^n_p(\ell_{LS} \circ H) \le \frac{4}{\sqrt{p}}\sqrt{\frac{3\operatorname{Tr}(L^{-1})}{\lambda n}} \tag{28}$$

which proves the lemma. ∎
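As a quick arithmetic check, the constants in (24), (27) and (28) can be seen to combine into the bound (23), and the substitution p = q/n gives the form used in (16). In the sketch below, all numeric inputs are hypothetical and the Lipschitz constant ρ = 1 is an assumption chosen so that the constants match.

```python
import math


def lemma4_bound(p, n, trace_L_inv, lam):
    """Right-hand side of (23)."""
    return math.sqrt(48.0 * trace_L_inv / (lam * n)) / math.sqrt(p)


def classification_bound(p, n, trace_L_inv, lam, M):
    """Right-hand side of (24), using Tr((lam L)^{-1}) = Tr(L^{-1}) / lam."""
    return 2.0 * M / math.sqrt(p) * math.sqrt(trace_L_inv / (lam * n))


# Hypothetical values, for illustration only.
p, n, trace_L_inv, lam = 25, 100, 40.0, 0.5
rho = 1.0  # assumed Lipschitz constant of the least squares loss

via_27 = 2.0 * rho * classification_bound(p, n, trace_L_inv, lam, M=math.sqrt(3.0))
via_23 = lemma4_bound(p, n, trace_L_inv, lam)

# With q = p * n total samples, (23) becomes the right-hand side of (16).
q = p * n
via_16 = math.sqrt(48.0 * trace_L_inv / lam) / math.sqrt(q)
```

All three quantities agree up to floating-point error, since 4√3 = √48 and 1/√p = √n/√q.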

###### Theorem 2.

Under the following additional technical assumptions: (a) (b) (c) , with probability , the regret obtained by Thompson Sampling in the GOB framework is given as:

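$$R(T) = \tilde{O}\left(\frac{dn}{\sqrt{\lambda}}\sqrt{T}\sqrt{\log\left(\frac{\operatorname{Tr}(L^{-1})}{n}\right) + \log\left(3 + \frac{T}{\lambda d n \sigma^2}\right)}\right) \tag{29}$$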
###### Proof.

We can interpret graph-based TS as being equivalent to solving a single -dimensional contextual bandit problem, but with a modified prior covariance ( instead of ). Our argument closely follows the proof structure in [2], but is modified to include the prior covariance. For ease of exposition, assume that the target user at each round is implicit. We use to index the available items. Let the index of the optimal item at round be , whereas the index of the item chosen by our algorithm is denoted .

Let $\hat{r}_t(j)$ be the estimated rating of item $j$ at round $t$. Then, for all $j$,

$$\hat{r}_t(j) \sim N\left(\langle w_t, \phi_j \rangle, s_t(j)\right) \tag{31}$$

Here, $s_t(j)$ is the standard deviation in the estimated rating for item $j$ at round $t$. Recall that $\Sigma_{t-1}$ is the covariance matrix at round $t$. $s_t(j)$ is given as:

$$s_t(j) = \sqrt{\phi_j^T \Sigma_{t-1}^{-1} \phi_j} \tag{32}$$

We drop the argument in $s_t(j_t)$ to denote the standard deviation and estimated rating for the selected item $j_t$, i.e. $s_t = s_t(j_t)$ and $\hat{r}_t = \hat{r}_t(j_t)$. Let $\Delta_t$ measure the immediate regret at round $t$ incurred by selecting item $j_t$ instead of the optimal item $j_t^*$. The immediate regret is given by:

$$\Delta_t = \langle w^*, \phi_{j_t^*} \rangle - \langle w^*, \phi_{j_t} \rangle \tag{33}$$

Define $E^{\mu}(t)$ as the event such that for all $j$,

$$E^{\mu}(t): \; |\langle w_t, \phi_j \rangle - \langle w^*, \phi_j \rangle| \le l_t s_t(j) \tag{34}$$

Here $l_t = \sqrt{dn\log(3 + t/\lambda dn\delta)} + \sqrt{3\lambda}$. If the event $E^{\mu}(t)$ holds, it implies that the expected rating at round $t$ is close to the true rating with high probability.

Recall that $|C_t| = K$ and that $\tilde{w}_t$ is a sample drawn from the posterior distribution at round $t$. Define $\rho_t = \sqrt{9dn\log\left(\frac{t}{\delta}\right)}$ and $g_t = \min\{\sqrt{4dn\ln(t)}, \sqrt{4\log(tK)}\}\rho_t + l_t$. Define $E^{\theta}(t)$ as the event such that for all $j$,

$$E^{\theta}(t): \; |\langle \tilde{w}_t, \phi_j \rangle - \langle w_t, \phi_j \rangle| \le \min\{\sqrt{4dn\ln(t)}, \sqrt{4\log(tK)}\}\rho_t s_t(j) \tag{35}$$

If the event $E^{\theta}(t)$ holds, it implies that the estimated rating using the sample $\tilde{w}_t$ is close to the expected rating at round $t$.
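The quantities in (31) and (32) can be computed as in the sketch below. This is a simplified illustration, not the algorithm analyzed here: the function name and toy inputs are hypothetical, the matrix `Sigma` is used exactly as it appears in (32), and each rating is drawn independently from its marginal in (31), whereas the analysis considers a single posterior sample of the weight vector shared by all items.

```python
import numpy as np


def thompson_select(w_mean, Sigma, Phi, rng):
    """Sample a rating per item and return (chosen index, per-item s_t(j)).

    Phi is K x D with one row per candidate item.
    """
    means = Phi @ w_mean
    # phi_j^T Sigma^{-1} phi_j for every item, without forming the inverse.
    solved = np.linalg.solve(Sigma, Phi.T)  # D x K
    s = np.sqrt(np.einsum("kd,dk->k", Phi, solved))
    # Simplification: independent draws from the marginals in (31).
    sampled = rng.normal(means, s)
    return int(np.argmax(sampled)), s


# Hypothetical toy problem: 2 items with 3-dimensional features.
rng = np.random.default_rng(0)
Phi = np.eye(3)[:2]
chosen, s = thompson_select(np.zeros(3), 4.0 * np.eye(3), Phi, rng)
```

Items with larger s_t(j) have more spread in their sampled ratings and are therefore explored more often, which is the mechanism the events above control.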

In Lemma 7, we prove that the event holds with high probability. Formally, for ,

$$\Pr\left(E^{\mu}(t)\right) \ge 1 - \delta \tag{37}$$

To show that the event holds with high probability, we use the following lemma from [2].

###### Lemma 5 (Lemma 2 of [2]).
$$\Pr\left(E^{\theta}(t) \mid \mathcal{F}_{t-1}\right) \ge 1 - \frac{1}{t^2} \tag{38}$$

Next, we use the following lemma to bound the immediate regret at round .

###### Lemma 6 (Lemma 4 in [2]).

Let . If the events and