Multi-Armed Bandits on Partially Revealed
Unit Interval Graphs
Abstract
A stochastic multi-armed bandit problem with side information on the similarity and dissimilarity across different arms is considered. The action space of the problem can be represented by a unit interval graph (UIG) where each node represents an arm and the presence (absence) of an edge between two nodes indicates similarity (dissimilarity) between their mean rewards. Two settings of complete and partial side information, based on whether the UIG is fully revealed, are studied, and a general two-step learning structure consisting of an offline reduction of the action space and online aggregation of reward observations from similar arms is proposed to fully exploit the topological structure of the side information. In both cases, the computational efficiency and the order optimality of the proposed learning policies, in terms of both the size of the action space and the time length, are established.
1 Introduction
A number of emerging applications involve large-scale online learning in which the objective is to learn, in real time, the most rewarding actions among a large number of options. Example applications include various socioeconomic applications (e.g., ad display in search engines, product/news recommendation systems, targeted marketing and political campaigns) and networking issues (e.g., dynamic channel access and route selection) in large-scale communication systems such as the Internet of Things. For such problems, a linear scaling of the learning cost with the problem size, resulting from exploring every option to identify the optimal one, is undesirable, if not infeasible. The key to achieving a sublinear scaling with the problem size is to exploit the inherent structure of the action space, i.e., various relations among the vast number of options.
A classic framework for online learning and sequential decision-making under unknown models is the multi-armed bandit (MAB) formulation. In the classic setting, a player chooses one arm (or more generally, a fixed number of arms) from a set of arms (representing all possible options) at each time and obtains a reward drawn i.i.d. over time from an unknown distribution specific to the chosen arm. The design objective is a sequential arm selection policy that maximizes the total expected reward over a time horizon of length $T$ by striking a balance between learning the unknown reward models of all arms (exploration) and capitalizing on this information to maximize the instantaneous gain (exploitation). The performance of an arm selection policy is measured by regret, defined as the expected cumulative reward loss against an omniscient player who knows the reward models and always plays the best arm.
A traditionally adopted assumption in MAB is that arms are independent and that there is no structure in the set of reward distributions. In this case, reward observations from one arm provide no information on other arms, resulting in a linear regret order in the number of arms. The main focus of the classic MAB problems has been on the regret order in $T$, which measures the learning efficiency over time. The seminal work by Lai and Robbins showed that the minimum regret has a logarithmic order in $T$ [1]. A number of learning policies have since been developed that offer the optimal regret order in $T$ (see [2, 3, 4] and references therein). Developed under the assumption of independent arms and relying on exploring every arm sufficiently often, however, these learning policies are not suitable for applications involving a massive number of arms.
1.1 Main Results
In addressing the challenge of a massive number of arms, there has been a growing body of studies aiming at exploiting certain side information on the relations among the large number of arms. Among various formulations of the side information (see a more detailed discussion in Sec. 1.2), one notable example is the statistical similarity and dissimilarity among arms. For instance, in recommendation systems and information retrieval, products, ads, and documents in the same category (more generally, close in some feature space) have similar expected rewards. At the same time, it may also be known a priori that some arms have considerably different mean rewards, e.g., news with drastically different opinions, products with opposite usages, or documents associated with keywords belonging to distant categories in the taxonomy. Such side information opens the possibility of efficient solutions that scale well with the large action space.
In this paper, we study a bandit problem with side information on similarity and dissimilarity relations across actions. We first show that the similarity-dissimilarity structure of the action space can be represented by a unit interval graph (UIG) where the presence (absence) of an edge between two arms indicates that the difference of their mean rewards is within (beyond) a given threshold. Based on whether the UIG is fully revealed to the player, we consider two cases of complete and partial side information. For both cases, we propose a general two-step learning structure, LSDT (Learning from Similarity-Dissimilarity Topology), to achieve a full exploitation of the topological structure of the side information. The first step is an offline reduction of the action space to a candidate set, which consists of arms that can assume the largest mean rewards under certain assignments of reward distributions without violating the side information. Arms outside the candidate set are suboptimal and eliminated from online exploration. The second step carries out an online learning algorithm that further exploits the similarity structure through collective exploration by aggregating reward observations from similar arms.
In the case of complete side information, we show that the candidate set is given by the set of left anchors of the UIG, which can be identified by a Breadth-First Search (BFS) based algorithm in polynomial time. By defining an equivalence relation between arms through the neighbor sets in the UIG, we obtain an equivalence class partition of the arms. We show that the candidate set consists of at most two equivalence classes if the UIG is connected. We exploit this topological structure by maintaining two UCB (upper confidence bound) indices, one at the class level aggregating observations from arms within the same class, the other at the arm level. At each time, the arm with the largest arm index within the class with the largest class index is played. We establish the order optimality of the proposed policy in terms of both the number of arms and the time horizon $T$ by deriving an upper bound on regret and a matching lower bound on the regret achievable by uniformly good policies.
In the case of partial side information, we represent the partially revealed UIG by a multigraph with two types of edges indicating the presence and the absence of the corresponding UIG edges. We show the NP-completeness of finding the candidate set and propose a polynomial-time approximation algorithm to reduce the action space. We show that under certain probabilistic assumptions on the partial side information, the size of the reduced action space is comparable to that of the ground-truth candidate set as determined by the underlying UIG. In the second step of online learning, the key to a full exploitation of the similarity relation is to determine the frequency of exploring an arm based on its exploration value, which measures the topological significance of the node in a similarity graph. By sequentially eliminating arms less likely to be optimal through a UCB index aggregating observations from similar arms, only arms close to the optimal one remain after a sufficient number of plays. We provide performance guarantees for the proposed policy and establish its order optimality under certain probabilistic assumptions on the side information.
It should be noted that the main issue and main contribution of this paper are on how to succinctly model and fully exploit the side information on the similarity and dissimilarity relations across arms. The solution to the former is the UIG representation of the action space, and to the latter the two-step learning structure LSDT, which is independent of the specific arm selection rule adopted at the online learning step. In particular, different arm selection techniques developed for the original bandit problems may be incorporated into the second step of LSDT, provided that they operate on aggregated observations. In Sec. 5.3, we discuss the use of Thompson Sampling (TS), one of the most well-known learning techniques in bandit problems (see [27, 25, 28, 29] and references therein), with LSDT to fully exploit the side information.
In summary, we develop a UIG formulation of side information on arm similarity and dissimilarity in MAB problems and consider two cases of complete and partial side information. We propose a general and computationally efficient two-step learning structure achieving full exploitation of the side information and establish the order optimality of the proposed learning policies through theoretical upper bounds on regret as well as matching lower bounds in both cases.
1.2 Related Work
Existing studies on MAB with structured reward models can be categorized based on the types of arm relations adopted in the MAB models. The first type is realization-based relation, which assumes a certain known probabilistic dependency across arms. Examples include combinatorial bandits [5, 6, 7, 8], linearly parameterized bandits [9, 10, 11], and spectral bandits for smooth graph functions [12, 13]. The second type of arm relation can be termed observation-based relation [14, 15, 16]. Specifically, playing an arm provides additional side observations about its neighboring arms. See [17] for a survey on various bandit models with structured action spaces.
The problem studied in this paper considers another type of relation among arms: ensemble-based relation, which aims to capture relations on ensemble behaviors (i.e., mean rewards) across arms, rather than probabilistic dependencies in their realizations. (The mean rewards of different arms exhibit certain relations, e.g., closeness or certain orderings, but the realized random rewards of arms being played need not exhibit any probabilistic dependency.) Related work includes Lipschitz bandits [18, 19, 20], taxonomy bandits [21], and unimodal bandits [22]. Specifically, in Lipschitz bandits, the mean reward is assumed to be a Lipschitz function of the arm parameter. Taxonomy bandits have a tree-structured action space where arms in the same subtree are close in their mean rewards. In unimodal bandits, the action space is represented by a graph where from every suboptimal arm, there exists a path to the optimal arm along which the mean reward increases. Different from these existing studies, the bandit model studied in this paper considers an action space represented by a UIG indicating not only similarity but also dissimilarity relations across actions. Besides, the proposed learning policy carries out a two-level exploitation of the UIG structure, which is fundamentally different from the existing approaches. Recently, a general formulation of structured bandits was proposed in [23], which includes a variety of known bandit models (e.g., Lipschitz bandits, unimodal bandits, linear bandits, etc.) as well as the bandit model studied in this work as special cases. The learning policy developed in [23], however, was given only implicitly in the form of a linear program (LP) that needs to be solved at every time step. For the problem studied in this paper, the LP does not admit polynomial-time solutions (unless P=NP).
Side information has also been used to refer to context information in the so-called contextual bandits (see [24, 25, 26] and references therein). Under this formulation, context information is revealed at each time, which affects the arm reward distributions. A contextual bandit problem can thus be viewed as multiple simple bandits, one for each context, that are interleaved in time according to the context stream. The complexity of the problem comes from the coupling of these simple bandits by assuming various models on how context affects the arm reward distributions. The problem is fundamentally different from the one studied here.
2 Multi-Armed Bandits on Unit Interval Graphs
2.1 Problem Formulation
Consider a stochastic $K$-armed bandit problem. At each time $t$, a player chooses one arm $k$ to play. Playing arm $k$ yields a reward drawn i.i.d. from an unknown distribution $f_k$ with mean $\mu_k$. We assume that $f_k$ belongs to the family of sub-Gaussian distributions for all $k$. (A random variable $X$ with mean $\mu$ is sub-Gaussian with parameter $\sigma$, or $\sigma$-sub-Gaussian, if $\mathbb{E}[e^{\lambda(X-\mu)}] \le e^{\sigma^2 \lambda^2/2}$ for all $\lambda \in \mathbb{R}$ [30].) Extensions to other distribution types are discussed in Sec. 5.
Across arms, the similarity and dissimilarity relations are defined through a parameter $\varepsilon > 0$: two arms are similar (dissimilar) if the difference between their mean rewards is below (above) $\varepsilon$. The similarity-dissimilarity structure of the action space can be represented by an undirected graph $G = (V, E)$. In the graph representation, every node $k \in V$ represents an arm with reward distribution $f_k$, and the presence (absence) of an edge corresponds to a similar (dissimilar) arm pair. Throughout the paper, $k$ is used to refer to an arm or a node, interchangeably. We first show that $G$ is a UIG.
Definition 1 (Unit interval graph and unit interval model).
A graph is a unit interval graph if there exists a set of unit-length intervals $\{I_k\}$ on the real line such that each interval $I_k$ corresponds to a node $k$ and there exists an edge between two nodes if and only if their intervals overlap. The set of intervals is a unit interval model (UIM) for the UIG. (If a UIG is finite, i.e., with a finite number of nodes, there is no difference between taking open intervals or closed intervals to represent nodes [31]. Without loss of generality, we assume closed intervals $I_k = [l_k, r_k]$, where $l_k$ and $r_k = l_k + 1$ are the left and right coordinates of interval $I_k$.)
Through the mapping from every node $k$ to a unit-length interval $I_k = [\mu_k/\varepsilon,\ \mu_k/\varepsilon + 1]$, it is not difficult to see that

$$(i, j) \in E \iff |\mu_i - \mu_j| < \varepsilon \iff I_i \cap I_j \neq \emptyset, \qquad (1)$$

which indicates that $G$ is a UIG (see an example in Fig. 1). Without loss of generality, we assume that $G$ is connected. Extensions to the disconnected case are discussed in Sec. 5.
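The correspondence in (1) can be checked in code. The following sketch is a hypothetical illustration (with `eps` playing the role of the similarity threshold): it builds the edge set from the mean rewards and verifies that it agrees with the overlap of the mapped unit intervals.

```python
import itertools

def build_uig(means, eps):
    """Edge (i, j) present iff |mu_i - mu_j| < eps, i.e., arms i and j are similar."""
    return {(i, j) for i, j in itertools.combinations(range(len(means)), 2)
            if abs(means[i] - means[j]) < eps}

def intervals_overlap(means, eps, i, j):
    """Map arm k to the unit interval starting at mu_k / eps and test overlap."""
    li, lj = means[i] / eps, means[j] / eps
    # two unit-length intervals overlap iff their left endpoints differ by less than 1
    return max(li, lj) < min(li, lj) + 1
```

For example, with mean rewards `[0.1, 0.3, 0.9]` and threshold `0.3`, only arms 0 and 1 are similar, so the UIG has a single edge, and the edge test agrees with the interval-overlap test for every pair.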
We define the edge sets $E_S$ and $E_D$ as the side information on arm similarity and dissimilarity, respectively. Based on whether $(E_S, E_D)$ fully reveals the UIG $G$, we consider the following two cases separately. In the case of complete side information, $E_S$ and $E_D$ are identical to the edge set and the complement edge set of $G$, i.e., $E_S = E$ and $E_D = \bar{E}$. In the case of partial side information, they are subsets of the latter, i.e., $E_S \subseteq E$ and $E_D \subseteq \bar{E}$.
The objective is an online learning policy $\pi$ that specifies the arm $\pi(t)$ to play at each time $t$ based on both past observations of selected arms and the side information $(E_S, E_D)$. The performance of policy $\pi$ is measured by regret, defined as the expected reward loss against a player who knows the reward model and always plays the best arm (chosen arbitrarily in the case of multiple optimal arms), i.e.,

$$R_\pi(T) = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_{\pi(t)}(t)\right], \qquad (2)$$

where $\mu^* = \max_k \mu_k$ is the largest mean reward and $X_{\pi(t)}(t)$ is the reward obtained from the arm selected by policy $\pi$ at time $t$. The dependency of regret on the unknown reward distributions is omitted in the notation. When there is no ambiguity, the notation is simplified to $R(T)$.

Let $\tau_k(t)$ denote the number of times that arm $k$ has been selected up to time $t$. We rewrite the regret as

$$R_\pi(T) = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}[\tau_k(T)], \qquad (3)$$

where $\Delta_k = \mu^* - \mu_k$. The objective of maximizing the expected cumulative reward is equivalent to minimizing the regret over a time horizon of length $T$. It can be inferred from (3) that, to minimize regret, every suboptimal arm ($\Delta_k > 0$) should be distinguished from the optimal one with the least number of plays.
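As a quick numerical illustration of the decomposition in (3), the following sketch (with made-up example values) computes regret as the sum of per-arm gaps weighted by play counts.

```python
def regret(means, counts):
    """Regret as the sum over arms of (mu* - mu_k) times the plays of arm k."""
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, counts))
```

For example, with mean rewards `[0.9, 0.5, 0.2]` and play counts `[80, 15, 5]`, the optimal arm contributes nothing and the regret is 0.4 * 15 + 0.7 * 5 = 9.5.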
2.2 Two-Step Learning Structure
While classic bandit algorithms have to try out every arm sufficiently often to distinguish the suboptimal arms from the optimal one, which induces a linear scaling of regret in the number of arms, the side information on arm similarity and dissimilarity allows the possibility of identifying a set of suboptimal arms without even playing them. Specifically, we define a candidate set determined by the side information as follows.
Definition 2 (Candidate Arm and Candidate Set).
Given the side information $(E_S, E_D)$, an arm $k$ is a candidate arm if there exists an assignment of reward distributions with means conforming to $(E_S, E_D)$ under which arm $k$ has the largest mean reward. The candidate set is the set consisting of all candidate arms.
Note that the optimal arm under the ground-truth assignment of reward distributions in the bandit problem always belongs to the candidate set. It is clear that if we can find the candidate set from the side information efficiently, the action space can be reduced to it: only arms in the candidate set need to be explored. Furthermore, certain topological structures of the revealed UIG on the reduced action space can be further exploited to accelerate learning. In estimating the mean reward of every arm in the candidate set, observations from similar arms can also be leveraged as approximations, which reduces the number of plays required to distinguish suboptimal arms from the optimal one.
The aforementioned facts motivate a general two-step learning structure, Learning from Similarity-Dissimilarity Topology (LSDT), for both cases of complete and partial side information. Specifically, LSDT consists of (1) an offline elimination step that reduces the action space to the candidate set and (2) online learning of the optimal arm by aggregating observations from similar arms. We specify each step for the cases of complete and partial side information separately in Sec. 3 and Sec. 4.
3 Complete Side Information
We first consider the case of complete side information that fully reveals the UIG $G$. We follow the two-step learning structure proposed in Sec. 2.2 and develop a learning policy, LSDT-CSI (Learning from Similarity-Dissimilarity Topology with Complete Side Information), along with a theoretical analysis of its regret performance. While restrictive in applications, this case provides useful insights for tackling the general case of partial side information addressed in Sec. 4.
3.1 Offline Elimination
The first step of LSDT-CSI is an offline preprocessing step that aims at identifying the candidate set from the complete side information. Since the UIG is fully revealed in this case, the resulting candidate set is distinguished from that of the partial side information case. We show that it is identical to the set of left anchors of the UIG $G$.
Definition 3 (Left Anchor).
Given a UIG $G$, a node $v$ is a left anchor if there exists a UIM for $G$ in which the interval corresponding to $v$ is the leftmost interval along the real line.
Since the mirror image of a UIM with respect to the origin is also a UIM for the same UIG, the node corresponding to the rightmost interval in a UIM is also a left anchor. Based on the definition of the UIG representing the similarity-dissimilarity structure of the arm set in Sec. 2.1, it is not difficult to see that the candidate set is identical to the set of left anchors of $G$, which can be identified through a BFS-based algorithm proposed in [32]. The BFS-based algorithm starts from an arbitrary node in a UIG and returns a set of left anchors. We apply the algorithm twice: the first time, we start from an arbitrary node in $G$ and obtain a set of left anchors; the second time, we reapply the algorithm starting from one of the nodes returned by the first pass. One can directly infer from Proposition 2.1 and Theorem 2.3 in [32] that the resulting set is the candidate set. The detailed algorithm is summarized below. Note that the computational complexity of the offline elimination step is polynomial in the problem size.
3.2 Online Aggregation
We now present the second step of online learning, which further exploits topological structures of the candidate set. We first introduce an equivalence relation between nodes in the UIG $G$.
Definition 4 (Neighborhood Equivalence).
Two nodes $u$ and $v$ in $G$ are (neighborhood) equivalent if $N[u] = N[v]$, where $N[v]$ is the set of neighbors of $v$ in $G$, including $v$ itself. Moreover, the neighborhood equivalence relation induces a partition of the arm set in $G$ into equivalence classes.
Note that arms within the same equivalence class have the same set of neighbors and are thus topologically indistinguishable in the UIG. Based on the equivalence class partition, we obtain a closed-form expression for the candidate set.
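Computing the partition of Definition 4 is straightforward: group nodes by their closed neighborhoods. A minimal sketch, assuming the graph is encoded as an edge list over integer node ids:

```python
from collections import defaultdict

def equivalence_classes(n, edges):
    """Partition nodes 0..n-1 by closed neighborhood N[v] (neighbors of v plus v)."""
    nbr = {v: {v} for v in range(n)}
    for u, w in edges:
        nbr[u].add(w)
        nbr[w].add(u)
    classes = defaultdict(list)
    for v in range(n):
        classes[frozenset(nbr[v])].append(v)
    return sorted(classes.values())
```

On the graph with edges (0,1), (0,2), (1,2), (2,3), nodes 0 and 1 share the closed neighborhood {0, 1, 2} and form one class, while nodes 2 and 3 are singleton classes.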
Theorem 1.
When the side information fully reveals the UIG $G$ (assumed to be connected), the candidate set $\mathcal{C}$ is the union of the two equivalence classes containing the optimal arm and the worst arm (the one with the minimum mean reward), i.e.,

$$\mathcal{C} = [k^*] \cup [k_\dagger], \qquad (4)$$

where $[k]$ denotes the equivalence class containing arm $k$ and

$$k^* \in \arg\max_{k} \mu_k, \qquad (5)$$

$$k_\dagger \in \arg\min_{k} \mu_k. \qquad (6)$$

(Note that the two equivalence classes containing the optimal arm and the worst arm are identical in the special case where $G$ is fully connected. The proposed algorithm and analysis still apply in this case. Without loss of generality, we assume that $G$ is not fully connected.)
Proof.
See Appendix B in the supplementary material. ∎
The result is also illustrated in Fig. 2: the candidate set is the union of the two equivalence classes containing the optimal and the worst arms, and it can be directly obtained through the offline elimination step.
Based on the topological structure of the candidate set, we develop a hierarchical online learning policy that aggregates observations from arms within the same equivalence class. By considering each class as a super node (arm), we reduce the problem to a simple two-armed bandit problem.
Specifically, the second step of LSDT-CSI carries out hierarchical UCB-based online learning on the candidate set by maintaining a class index for each equivalence class and an arm index for each individual arm in the candidate set. The arm index is defined as

$$I_k(t) = \bar{x}_k(t) + \sqrt{\frac{2\log t}{\tau_k(t)}}, \qquad (7)$$

where $\bar{x}_k(t)$ and $\tau_k(t)$ are the empirical average of observations from arm $k$ and the number of times that arm $k$ has been played up to time $t$, respectively. The class index aggregates the same statistics across arms in the class:

$$I_{[k]}(t) = \frac{\sum_{j \in [k]} \tau_j(t)\,\bar{x}_j(t)}{\sum_{j \in [k]} \tau_j(t)} + \sqrt{\frac{2\log t}{\sum_{j \in [k]} \tau_j(t)}}. \qquad (8)$$
At each time, the online learning procedure selects the equivalence class with the largest class index and plays the arm with the largest arm index within the selected class. Once the reward has been observed, both class indices and arm indices are updated.
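The two-level selection rule can be sketched as follows. This is a hypothetical illustration, not the paper's exact procedure: it assumes a standard UCB bonus of sqrt(2 log t / n), and `sums`/`counts` are assumed per-arm running statistics.

```python
import math

def ucb(total, n, t):
    """UCB index: empirical mean plus an exploration bonus (constant assumed)."""
    return float('inf') if n == 0 else total / n + math.sqrt(2 * math.log(t) / n)

def select_arm(classes, sums, counts, t):
    """Pick the class with the largest aggregated index, then the arm with the
    largest individual index inside that class."""
    def class_index(cls):
        n = sum(counts[k] for k in cls)
        s = sum(sums[k] for k in cls)
        return ucb(s, n, t)  # class statistics pool all member arms
    best_class = max(classes, key=class_index)
    return max(best_class, key=lambda k: ucb(sums[k], counts[k], t))
```

With two classes [0, 1] and [2] and the example statistics below, the aggregated statistics of the first class dominate, and within it arm 0 has the larger index.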
3.3 Order Optimality
We first present the regret analysis of LSDT-CSI, which focuses on upper bounding the expected number of times that each suboptimal arm has been played up to time $T$. We show that once the total number of times that arms in the suboptimal equivalence class have been played exceeds a logarithmic threshold, the corresponding class index will not be chosen with high probability. Besides, if each suboptimal arm has been played sufficiently often, its arm index will not be chosen with high probability. The following theorem provides the performance guarantee for LSDT-CSI.
Theorem 2.
Suppose that $G$ is connected. Assume that the reward distribution of each arm is sub-Gaussian with a fixed parameter (see Sec. 5 for extensions to general sub-Gaussian parameters). Then the regret of LSDT-CSI up to time $T$ is upper bounded as follows:

(9)

where $\mathcal{A}^*$ denotes the set of arms with the largest mean rewards.
Proof.
See Appendix C in the supplementary material. ∎
Remark 1.
For fixed reward distributions, the regret of LSDT-CSI is of order

(10)

as $T \to \infty$. In certain scenarios (e.g., when $G$ is a line graph), the candidate set contains only a constant number of arms, which indicates a sublinear scaling of regret in terms of the number of arms given such side information.
Remark 2.
If $G$ is fully connected (e.g., when the similarity threshold is large), the candidate set is the entire arm set. In this case, LSDT-CSI degenerates to the classic UCB policy and the regret scales linearly with the number of arms.
We discuss in Sec. 6 that if the mean reward of each arm is independently and uniformly chosen from a bounded interval and the similarity threshold is bounded away from zero and the interval length, the expected size of the candidate set grows sublinearly with the number of arms, which indicates a sublinear scaling of regret in terms of the size of the action space. We also verify this result through a numerical example in Sec. 6.
To establish the order optimality of LSDT-CSI, we further derive a matching lower bound on regret. We focus here on the case where the unknown mean reward of each arm is unbounded (i.e., can be any value on the real line). We adopt the same parametric setting as in [1] on the classic MAB where the rewards are drawn from a specific parametric family of distributions with known distribution type. (Although the upper bound on the regret of LSDT-CSI is derived in the nonparametric setting where the distribution type is unknown, the parametric lower bound suffices to show the order optimality of LSDT-CSI, since the nonparametric lower bound can be no smaller than the parametric one.) Specifically, the reward distribution of arm $k$ has a univariate density function $f(x; \theta_k)$ with an unknown parameter $\theta_k$ from a set of parameters $\Theta$. Let $I(\theta, \theta')$ denote the Kullback-Leibler (KL) divergence between the two distributions with density functions $f(\cdot; \theta)$ and $f(\cdot; \theta')$ and with means $\mu(\theta)$ and $\mu(\theta')$, respectively. We assume the same regularity assumptions on the finiteness of the KL divergence and its continuity with respect to the mean values as in [1].
Assumption 1.
For every $\theta, \theta' \in \Theta$ such that $\mu(\theta') > \mu(\theta)$, we have $0 < I(\theta, \theta') < \infty$.
Assumption 2.
For every $\eta > 0$ and every $\theta, \theta' \in \Theta$ with $\mu(\theta') > \mu(\theta)$, there exists $\delta > 0$ for which $|I(\theta, \theta') - I(\theta, \theta'')| < \eta$ whenever $\mu(\theta') \le \mu(\theta'') \le \mu(\theta') + \delta$.
The following theorem provides a lower bound on regret for uniformly good policies. (A policy $\pi$ is uniformly good if for every reward model, the regret of $\pi$ satisfies $R_\pi(T) = o(T^a)$ for every $a > 0$ as $T \to \infty$ [1].)
Theorem 3.
Suppose that $G$ is connected. Assume that Assumptions 1 and 2 hold and that the mean reward of each arm can be any value in $\mathbb{R}$. For any uniformly good policy, the regret up to time $T$ is lower bounded as follows:

(11)

where the constant on the right-hand side is the optimal value of an LP that depends only on the reward distributions and the UIG $G$ (see (53) in Appendix D for details). It can be shown that for fixed reward distributions, the regret of any uniformly good policy grows at least logarithmically in $T$ as $T \to \infty$.
Proof.
See Appendix D in the supplementary material. ∎
Remark 3.
LSDT-CSI is order optimal, since its upper bound on regret matches the lower bound shown in Theorem 3.
Remark 4.
If there is a unique optimal arm, i.e., the set of optimal arms is a singleton, the lower bound simplifies accordingly as $T \to \infty$.
4 Partial Side Information
In this section, we consider the general case of partial side information where the UIG $G$ is partially revealed. We develop a learning policy, LSDT-PSI (Learning from Similarity-Dissimilarity Topology with Partial Side Information), following the two-step structure proposed in Sec. 2.2, and provide a theoretical analysis of its regret performance.
4.1 Offline Elimination
A partially revealed UIG can be represented by an undirected edge-labeled multigraph (see Fig. 3). Specifically, the multigraph consists of two types of edges: type-S edges ($E_S$) and type-D edges ($E_D$), indicating the presence and the absence of the corresponding UIG edges, respectively. The absence of any edge between two nodes in the multigraph indicates an unknown relation between the two arms.
We first show that finding the candidate set under partial side information is NP-complete. We notice that finding the candidate set is equivalent to considering every node $v$ individually and deciding whether $v$ can be a left anchor of a UIG $\widetilde{G} = (V, \widetilde{E})$ consisting of the same set of nodes as $G$, with the potential edge set $\widetilde{E}$ satisfying

$$E_S \subseteq \widetilde{E}, \qquad (12)$$

$$E_D \cap \widetilde{E} = \emptyset. \qquad (13)$$
Specifically, we show the NP-completeness of the following decision problem.

LEFT-ANCHOR

[INPUT]: A multigraph $(V, E_S, E_D)$ for which there exists a UIG $G = (V, E)$ with $E_S \subseteq E$ and $E_D \cap E = \emptyset$, and a specific node $v \in V$.

[QUESTION]: Does there exist a UIG $\widetilde{G} = (V, \widetilde{E})$ with $E_S \subseteq \widetilde{E}$ and $E_D \cap \widetilde{E} = \emptyset$ such that node $v$ is a left anchor of $\widetilde{G}$?
Theorem 4.
LEFT-ANCHOR is NP-complete.
Proof.
To show the NP-completeness of LEFT-ANCHOR, we give a reduction from a variant of the 3-SAT problem: CONSISTENT-NAE-3SAT. Due to the page limit, we include the definition of CONSISTENT-NAE-3SAT as well as its proof of NP-completeness in Appendix E in the supplementary material. The reduction to LEFT-ANCHOR and the remaining proof are presented in Appendix F. ∎
It should be noted that LEFT-ANCHOR is similar to the so-called UIG Sandwich Problem [33], where two graphs $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$ are given satisfying $E_1 \subseteq E_2$, and the question is whether a UIG $G_s = (V, E_s)$ exists satisfying $E_1 \subseteq E_s \subseteq E_2$. It is not difficult to see that the type-S edge set $E_S$ corresponds to $E_1$ in the sandwich problem and the complement of the type-D edge set $E_D$ corresponds to $E_2$. However, LEFT-ANCHOR differs from the sandwich problem: we already know that the sandwich problem is satisfied by the ground-truth UIG $G$, and what we are interested in is whether a specific node can be a left anchor.
To address the challenge of finding the candidate set in polynomial time, we exploit the following topological property of the UIG to obtain an approximate solution.
Proposition 1.
Given $(E_S, E_D)$, an arm $k$ is suboptimal if it is similar to two mutually dissimilar arms, i.e., if there exist arms $i$ and $j$ such that $(i, k) \in E_S$ and $(j, k) \in E_S$ but $(i, j) \in E_D$, then arm $k$ cannot have the largest mean reward.
Based on this property, we develop the offline elimination step of LSDT-PSI, whose complexity is polynomial in the problem size.
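Proposition 1 yields a simple elimination pass over the type-D edges: any arm whose type-S neighborhood contains both endpoints of a type-D edge is removed. A minimal sketch, assuming edge lists over integer arm ids:

```python
def offline_eliminate(arms, sim_edges, dis_edges):
    """Keep only arms that are not similar to two mutually dissimilar arms."""
    sim = {v: set() for v in arms}
    for u, w in sim_edges:
        sim[u].add(w)
        sim[w].add(u)
    eliminated = set()
    for u, w in dis_edges:
        # any arm similar to both u and w lies between them and is suboptimal
        eliminated |= sim[u] & sim[w]
    return [v for v in arms if v not in eliminated]
```

For example, with arms 0, 1, 2, type-S edges (0,1) and (1,2), and type-D edge (0,2), arm 1 is eliminated and arms 0 and 2 remain as potential candidates.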
It is clear that, in general, the set returned by this approximation is a superset of the candidate set. However, in certain scenarios, the partially revealed UIG provides sufficient topological information to identify the ground-truth candidate set obtained from the fully revealed UIG. We show that such information is fully exploited by the offline elimination step of LSDT-PSI to achieve the same performance as that of LSDT-CSI in the case of complete side information.
Specifically, we make the following assumptions on the UIG $G$ and its equivalence classes, requiring that the neighborhood of every arm be sufficiently diverse. Without loss of generality, we assume an increasing order of the equivalence classes along the real line, i.e., in a UIM of $G$, the intervals of arms in a lower-indexed class lie to the left of those in a higher-indexed class.
Assumption 3.
For every equivalence class, assume that there exist two other equivalence classes that are both connected to it but mutually disconnected in $G$. (Two equivalence classes are connected if and only if at least one pair of arms from the two classes is adjacent in the UIG. It can be inferred from the equivalence relation that if one arm pair from the two classes is adjacent, then all such arm pairs are adjacent.)
Assumption 4.
Assume that there exists a constant $c > 0$ that lower bounds the size of every equivalence class.
We further make a probabilistic assumption on the partial side information.
Assumption 5.
The presence and the absence of each edge in the UIG are revealed by the partial side information $E_S$ and $E_D$ independently with probabilities $p$ and $q$, respectively. Assume that $p$ and $q$ are bounded below in terms of the constant $c$ defined in Assumption 4.
Note that as the sizes of the equivalence classes increase, for every suboptimal arm, the number of dissimilar arm pairs that are both similar to it increases. Therefore, smaller probabilities of observing edges can still guarantee that the arm is eliminated with high probability.
Based on these assumptions, we provide performance guarantees for the offline elimination step of LSDT-PSI through the following theorem. We also verify the results through numerical examples in Sec. 6.
Theorem 5.
Proof.
See Appendix G in the supplementary material. ∎
4.2 Online Aggregation
Now we present the second step, the online learning procedure of LSDT-PSI. We first define a similarity graph $G_S = (V', E'_S)$ restricted to the remaining arm set $V'$ after offline elimination, where $E'_S$ consists of the type-S edges with both endpoints in $V'$. For every arm $v \in V'$, we define an exploration value $w_v$, which measures the topological significance of node $v$ in the similarity graph and determines the frequency of playing arm $v$. Intuitively, a node with a higher degree has a higher exploration value, since playing this node provides information about more (neighboring) nodes. Specifically, we define the exploration values as an optimal solution to the following LP:

$$\min_{\{w_v\}} \sum_{v \in V'} w_v \quad \text{s.t.} \quad \sum_{u \in N'(v)} w_u \ge 1 \;\; \forall v \in V', \qquad w_v \ge 0 \;\; \forall v \in V', \qquad (15)$$

where $N'(v)$ is the set of neighbors of node $v$ in $G_S$ (including $v$ itself). In the online learning procedure, the number of times arm $v$ is played is proportional to its exploration value $w_v$. Note that if at least $L$ plays are necessary to distinguish a suboptimal arm from the optimal one in the classic MAB problem, it now suffices to play arm $v$ only on the order of $w_v L$ times, by aggregating observations from every neighboring arm $u$, which is itself played on the order of $w_u L$ times. Note that the optimal value of (15) is upper bounded by the size of the minimum dominating set of $G_S$.
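Solving the LP in (15) exactly requires an LP solver. Since its optimal value is bounded by the size of a minimum dominating set of the similarity graph, a greedy dominating-set computation (shown below as a hypothetical stand-in, not the paper's procedure) already yields a feasible set of exploration values: set the value to 1 on the dominating set and 0 elsewhere, so every node's neighborhood sums to at least 1.

```python
def greedy_dominating_set(nodes, edges):
    """Greedily pick the node covering the most not-yet-dominated nodes."""
    nbr = {v: {v} for v in nodes}
    for u, w in edges:
        nbr[u].add(w)
        nbr[w].add(u)
    undominated, chosen = set(nodes), []
    while undominated:
        v = max(nodes, key=lambda u: len(nbr[u] & undominated))
        chosen.append(v)
        undominated -= nbr[v]
    return chosen

def exploration_values(nodes, edges):
    """Feasible (not necessarily optimal) exploration values for the LP."""
    dom = set(greedy_dominating_set(nodes, edges))
    return {v: (1.0 if v in dom else 0.0) for v in nodes}
```

On a star graph, the greedy pass selects the center alone, so the total exploration budget is 1 rather than the number of arms, illustrating the collective-exploration saving.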
We briefly summarize the second step of LSDT-PSI. The algorithm proceeds in epochs, and during each epoch, every remaining arm is played a number of times proportional to its exploration value. Arms less likely to be optimal are eliminated at the end of every epoch, and only two types of arms are played in the next epoch: (1) non-eliminated arms and (2) arms with non-eliminated neighbors. After a sufficient number of epochs, only arms close to the optimal one remain, and we switch to single-arm indices for selection. Let $\bar{x}_v^{(m)}$ denote the average reward from arm $v$ up to epoch $m$.
(16)  
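One elimination round of the epoch-based procedure can be sketched as follows. The confidence-radius form and the pooling of observations over similarity neighborhoods are assumed simplifications of (16); in particular, the sketch ignores the bias of up to the similarity threshold introduced by treating neighbors' rewards as approximations.

```python
import math

def eliminate_round(active, nbr, sums, counts, delta):
    """Drop active arms whose aggregated UCB falls below the best aggregated LCB.

    nbr[v] is v's similarity neighborhood (including v itself); sums and counts
    hold the running reward totals and play counts of every arm."""
    stats = {}
    for v in active:
        n = sum(counts[u] for u in nbr[v])
        s = sum(sums[u] for u in nbr[v])
        radius = math.sqrt(math.log(1.0 / delta) / (2.0 * max(n, 1)))
        stats[v] = (s / max(n, 1), radius)
    best_lcb = max(mean - rad for mean, rad in stats.values())
    return [v for v in active if stats[v][0] + stats[v][1] >= best_lcb]
```

With two isolated arms whose empirical means are far apart relative to the confidence radius, the round keeps only the better arm.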
4.3 Order Optimality
The following theorem provides an upper bound on the regret of LSDT-PSI for any given partially revealed UIG.
Theorem 6.
Given a partially revealed UIG, assume that the reward distribution of each arm is sub-Gaussian. (Certain sub-Gaussian distributions, e.g., the Bernoulli distribution and the uniform distribution on $[0,1]$, have parameter $\sigma \le 1/2$. See Sec. 5 for extensions to general sub-Gaussian parameters.) Then the regret of LSDT-PSI up to time $T$ is upper bounded by:
(17)  
where .
Proof.
See Appendix H in the supplementary material. ∎
Remark 5.
For a fixed input parameter, the regret of LSDT-PSI is of order
(18) 
as the time horizon grows, where the two leading quantities are the size of the minimum dominating set of the similarity graph and the number of suboptimal arms that are close to the optimal one. It is not difficult to see that as the input parameter increases, the former decreases and the latter increases. For an appropriate choice of the parameter, a sublinear scaling of regret in terms of the number of arms can be achieved.
Recall that in Theorem 5, we show that under certain assumptions, the offline elimination step of LSDT-PSI achieves the same performance as LSDT-CSI for the case of complete side information. The following corollary further shows the order optimality of LSDT-PSI in terms of both the size of the action space and the time length.
Corollary 1.
Assume that Assumptions 3-5 hold. For a fixed input parameter, the expectation of the regret of LSDT-PSI, taken over random realizations of the partial side information, is upper bounded as follows:
(19) 
as , which matches the lower bound on regret for the case of complete side information established in Theorem 3.
Proof.
See Appendix I in the supplementary material. ∎
5 Extensions
In this section, we discuss extensions of the proposed policies, LSDT-CSI and LSDT-PSI, as well as their regret analysis, to cases with disconnected UIGs and other reward distributions. We also discuss how Thompson Sampling techniques can be applied within the LSDT learning structure.
5.1 Extensions to Disconnected UIGs
Suppose that the UIG has multiple connected components. It is not difficult to see that every connected component of a UIG is still a UIG, and that the set of left anchors of the whole graph is the union of the left anchors of all components. Therefore, in the case of complete side information, the offline elimination step of LSDT-CSI outputs a limited number of equivalence classes, and the second step of LSDT-CSI can be directly applied by maintaining a class index for every equivalence class as defined in (8). Moreover, by extending the regret analysis of LSDT-CSI in Theorem 2, as well as the lower bound on regret for uniformly good policies in Theorem 3, to the disconnected case, we can show that LSDT-CSI achieves an order-optimal regret, i.e.,
(20) 
as the time horizon grows. In the extreme case where every arm forms its own connected component (i.e., no similarity edges exist), LSDT-CSI degenerates to the classic UCB policy and the regret bound reduces to that of the classic setting.
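Since each connected component of the UIG is itself a UIG, the offline step can be run per component. A minimal union-find sketch for splitting the arms into components from the revealed similarity edges (function names are ours):

```python
def connected_components(n, edges):
    """Union-find over the similarity edges of n arms; returns the list of
    connected components, each a list of arm indices."""
    parent = list(range(n))

    def find(x):
        # path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)   # union the two components
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())
```

Each returned component can then be fed to the offline elimination step independently, and the left anchors of the whole graph are the union of the per-component anchors.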
In the case of partial side information, the LSDT-PSI policy, along with its regret analysis, applies to any partially revealed UIG without assumptions on the connectivity of the graph. The upper bound on regret in Theorem 6 still holds when the UIG has multiple connected components. In the extreme case where no side information is revealed, the size of the minimum dominating set of the similarity graph equals the number of arms, and the bound again reduces to that of the classic setting.
To show the order optimality of LSDT-PSI in the disconnected case, we need certain modifications to the assumptions on the UIG. We consider every connected component of the UIG together with its equivalence classes. We assume that Assumptions 3 and 4 hold for every connected component and, without loss of generality, that the optimal arm lies in the first component. Then, under Assumption 5, we can extend the regret analysis in Corollary 1 to the case where the UIG has multiple connected components. It can be shown that the expected regret of LSDT-PSI is upper bounded by
(21) 
as , which matches the lower bound in the case of complete side information.
5.2 Extensions to Other Distributions
Recall that in the regret analysis of LSDT-CSI and LSDT-PSI, we assume sub-Gaussian reward distributions with specific parameters (e.g., the standard normal distribution or the Bernoulli distribution). We first discuss extensions to general sub-Gaussian distributions with arbitrary parameters.
In LSDT-CSI, by replacing the second terms of the UCB indices defined in (7) and (8) with appropriately scaled versions parameterized by an input constant, the regret analysis in Theorem 2 still applies, and the upper bound on regret is only affected up to a constant scaling factor, as long as the input constant is sufficiently large. A similar extension also applies to LSDT-PSI if we change the second terms of the UCB indices in (16) in the same way.
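A small illustration of such a scaled index (our notation, not the paper's exact constants from (7), (8), or (16)): the exploration bonus is multiplied by the sub-Gaussian scale, with an input constant that must be large enough for the analysis to hold.

```python
import math

def scaled_ucb_index(emp_mean, n_pulls, t, sigma, a):
    """UCB index with an explicit sub-Gaussian parameter sigma; the input
    constant `a` must exceed a threshold given by the analysis (not
    reproduced here) for the regret bound to carry over."""
    return emp_mean + sigma * math.sqrt(a * math.log(t) / n_pulls)
```

Changing `sigma` rescales only the exploration term, which is why the regret bound is affected only up to a constant factor.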
Furthermore, we can extend the results from sub-Gaussian reward distributions to other distribution types such as light-tailed and heavy-tailed distributions. There are standard techniques for such extensions: one replaces the concentration result with the corresponding one for light-tailed or heavy-tailed distributions (the latter also requires replacing sample means with truncated sample means). Similar extensions for classic MAB problems without side information are discussed in [34, 4]. To illuminate the main ideas without excessive technicality, most existing work adopts the even stronger assumption of bounded support (see [2], [3], [24], etc.).
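For concreteness, a truncated sample mean of the kind referred to above can be sketched as follows. The threshold schedule is a hypothetical placeholder: for heavy-tailed rewards with a finite moment of order 1 + eps, standard choices grow polynomially in the sample index.

```python
def truncated_mean(samples, threshold):
    """Truncated empirical mean for heavy-tailed rewards: the s-th sample is
    zeroed out when its magnitude exceeds a (typically growing) threshold
    u(s), which tames the influence of rare extreme observations."""
    kept = [x if abs(x) <= threshold(s) else 0.0
            for s, x in enumerate(samples, start=1)]
    return sum(kept) / len(kept)
```

With a constant threshold of 10, an outlier of 100 is discarded while ordinary observations pass through unchanged.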
5.3 Extensions to Thompson Sampling Techniques
The two-step learning structure LSDT is in general independent of the specific arm selection rule adopted in the online learning step. We discuss here how Thompson Sampling (TS) techniques can be incorporated into the basic structure with aggregation of reward observations. Specifically, in the case of complete side information, after reducing the action space to the candidate set via the offline step, we adopt a hierarchical online learning policy similar to that used in LSDT-CSI, maintaining two posterior distributions on the reward parameters: one at the equivalence-class level and the other at the arm level. At each time, the policy first randomly selects an equivalence class according to its class-level probability of containing the optimal arm and then randomly draws an arm within the class according to its arm-level probability of being optimal. In the case of partial side information, similar to LSDT-PSI, an elimination strategy is carried out to sequentially remove arms that are less likely to be optimal. At each time, an arm is randomly drawn according to its arm-level posterior probability of being optimal. The observation from the selected arm is also used to update the higher-level posterior distributions of its neighbors, which aggregate observations from all similar arms. According to the higher-level posterior distribution, the arm that is least likely to be optimal is eliminated once it has been explored a sufficient number of times. Simulation results in Appendix A.1 show a similar performance gain from exploiting the side information on arm similarity and dissimilarity through the two-step learning structure when TS is incorporated in both cases. To achieve full exploitation of the side information and establish order optimality of regret, however, further study is required.
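The class-then-arm sampling scheme for the complete-side-information case can be sketched as follows (a hedged Beta/Bernoulli illustration; the function name, priors, and the simple class-level update are ours, not the paper's exact construction):

```python
import random

def hierarchical_ts(pull, classes, horizon, seed=0):
    """Two-level Thompson Sampling: sample an equivalence class from
    class-level Beta posteriors, then an arm within it from arm-level
    posteriors. `classes` is a list of lists of arm indices; `pull(i)`
    returns a 0/1 reward."""
    rng = random.Random(seed)
    n = sum(len(c) for c in classes)
    a = [1] * n                 # arm-level Beta(1, 1) priors
    b = [1] * n
    A = [1] * len(classes)      # class-level Beta(1, 1) priors
    B = [1] * len(classes)
    counts = [0] * n
    for _ in range(horizon):
        k = max(range(len(classes)), key=lambda c: rng.betavariate(A[c], B[c]))
        i = max(classes[k], key=lambda j: rng.betavariate(a[j], b[j]))
        r = pull(i)
        counts[i] += 1
        a[i] += r; b[i] += 1 - r    # arm-level update
        A[k] += r; B[k] += 1 - r    # class-level update aggregates the class
    return counts
```

The class-level posterior pools rewards from every arm played inside the class, which is the aggregation idea described above.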
6 Numerical Examples
In this section, we illustrate the advantages of our policies through numerical examples on both synthetic data and a real dataset from recommendation systems. All the experiments are run 100 times using Monte Carlo simulation in MATLAB R2014b.
6.1 Reduction of the Action Space
6.1.1 Complete Side Information
We use two experiments to show how much the action space can be reduced by exploiting the complete side information. In the first experiment, we fix the number of arms, with mean rewards uniformly chosen from the unit interval, and let the distance parameter defining the UIG vary over a range of values. For every value of the parameter, we obtain a UIG, apply the offline elimination step of LSDT-CSI to it, and compare the size of the candidate set with the total number of arms. In the second experiment, we fix the parameter and let the number of arms increase. We generate arms and UIGs in the same way as in the first experiment and show how the size of the candidate set varies as the number of arms increases. The results are shown in Figs. 3(a) and 3(b).
As we can see from Fig. 3(a), when the distance parameter is small, the graph is disconnected. As the parameter increases, the number of connected components decreases and thus the size of the candidate set decreases. When the graph becomes connected, the candidate set contains only two equivalence classes and is thus much smaller than the full arm set. When the parameter is large, the probability that the graph is complete increases; in this case, the candidate set contains all the arms, so its size grows back to the total number of arms. In Fig. 3(b), we notice that the candidate set has a diminishing cardinality compared with the full arm set. Since the mean rewards are uniformly chosen from the unit interval, the set of arms becomes denser on the interval as the number of arms (denoted N here) grows. It can be inferred from [35] that the maximum distance between two consecutive points among N points uniformly chosen from the unit interval is of order log(N)/N with high probability. If the distance parameter is chosen proportional to log(N)/N with a large enough constant, the UIG will be connected with high probability. Moreover, it can be shown that the cardinality of each equivalence class is smaller than the number of nodes within the distance parameter of the corresponding anchor. It follows that the cardinality of the candidate set in this setting grows only logarithmically in N.
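The log(N)/N scaling of the maximum spacing can be checked empirically with a quick Monte Carlo sketch (ours, not the experiment from the paper):

```python
import random

def max_spacing(n, rng):
    """Largest gap between consecutive order statistics of n uniform draws
    on [0, 1]; classical order-statistics results give Theta(log(n)/n)
    with high probability."""
    pts = sorted(rng.random() for _ in range(n))
    return max(b - a for a, b in zip(pts, pts[1:]))

gap = max_spacing(1000, random.Random(0))
```

Repeating this over many seeds and several values of n shows the maximum gap shrinking roughly like log(n)/n, which is what makes a distance parameter of that order sufficient for connectivity.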
6.1.2 Partial Side Information
We use two other experiments to show the reduction of the action space with partial side information. In the first experiment, we fix the number of arms, with mean rewards uniformly chosen from the unit interval, and obtain the corresponding UIG. We let the revelation probability vary over a range of values, and for every value, we observe the presence and the absence of each edge in the UIG independently with that probability. We apply the offline elimination step of LSDT-PSI and compare the size of the output set with the total number of arms. Note that when the revelation probability equals one, the UIG is fully revealed and we use the offline elimination step of LSDT-CSI instead. In the second experiment, we fix the revelation probability and let the number of arms increase. We generate arms and side-information graphs in the same way as in the first experiment and show how the size of the output set varies as the number of arms increases. The results of the two experiments are shown in Figs. 4(a) and 4(b).
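The sampling model for the partially revealed graph used in this experiment can be sketched as follows (names are ours): each pair's similarity or dissimilarity status is revealed independently with the given probability.

```python
import random

def reveal_edges(uig_edges, n, eps, rng):
    """Reveal each of the n*(n-1)/2 pair relations independently with
    probability eps: present edges go to the similarity set E_s, absent
    edges to the dissimilarity set E_d; unrevealed pairs stay unknown."""
    E_s, E_d = set(), set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < eps:
                (E_s if (i, j) in uig_edges else E_d).add((i, j))
    return E_s, E_d
```

With eps equal to one, every relation is revealed and the partial graph coincides with the full UIG, matching the remark about falling back to the LSDT-CSI offline step.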
It can be seen from Fig. 4(a) that as the revelation probability increases, the size of the output set decreases. Moreover, at full revelation, the performance of the offline elimination step of LSDT-PSI is as good as that of LSDT-CSI, which is optimal, i.e., only arms in the candidate set remain. In Fig. 4(b), we see that the ratio of the output set size to the number of arms decreases as the number of arms increases, which indicates a diminishing relative cardinality of the reduced action space.
6.2 Regret on Randomly Generated Graphs
6.2.1 Complete Side Information
We compare LSDT-CSI with existing algorithms on a set of randomly generated arms. We obtain a UIG whose node means are uniformly chosen from the unit interval. Every time an arm is played, a random reward is drawn independently from a Gaussian distribution centered at the arm's mean. We let the time horizon vary and compare the regret of LSDT-CSI with that of four baseline algorithms:

UCB1: the classic UCB policy proposed in [2], assuming no relations among arms.

TS: the classic Thompson Sampling algorithm proposed in [27], assuming a Beta prior and a Bernoulli likelihood on the reward model.

CKL-UCB: proposed in [20] for Lipschitz bandits, exploiting only similarity relations.

OSUB: proposed in [22] for unimodal bandits. Note that if the UIG is connected, it satisfies the unimodal structure: for every suboptimal arm, there exists a path from it to the optimal arm along which the mean rewards are increasing.

OSSB: proposed in [23] for general structured bandits. At each time, OSSB estimates the minimum number of times every arm has to be played by solving an LP.
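As a reference point for the baselines above, the UCB1 rule can be sketched as follows (a minimal version of the policy from [2]; the function names are ours):

```python
import math

def ucb1(pull, n_arms, horizon):
    """Classic UCB1: play each arm once, then repeatedly play the arm
    maximizing empirical mean + sqrt(2 log t / n_i)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for i in range(n_arms):                 # initialization: one pull per arm
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(n_arms + 1, horizon + 1):
        i = max(range(n_arms),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2 * math.log(t) / counts[j]))
        sums[i] += pull(i)
        counts[i] += 1
    return counts
```

Because UCB1 ignores all similarity and dissimilarity relations, every arm must be explored individually, which is exactly the linear-in-arms cost the LSDT structure avoids.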
The results shown in Fig. 5(a) indicate that LSDT-CSI outperforms the baseline algorithms. In particular, LSDT-CSI starts to exploit the optimal arm early on, while the other algorithms are still exploring. We also compare LSDT-CSI with an intuitive algorithm that applies UCB1 to the candidate set, in Fig. 5(b). With the same setup, we see a performance gain due to the online step.
We also evaluate the time complexity of the learning policies. Due to the page limit, we summarize the running times of LSDT-CSI and the other baseline algorithms in Table I in Appendix A.2. It is shown that LSDT-CSI has a relatively low computation cost compared with the algorithms of comparable performance, i.e., CKL-UCB and OSSB.
6.2.2 Partial Side Information
We compare LSDT-PSI with existing algorithms. We obtain a UIG whose node means are uniformly chosen from the unit interval and generate the partially observed UIG based on Assumption 5. The random rewards for every arm are independently generated from a Bernoulli distribution with the arm's mean.
Given that finding the candidate set is NP-complete, the OSSB policy is not applicable, since its LP is unspecified. Besides, OSUB is also inapplicable, since the partially revealed graph is not unimodal in general. Therefore, we only compare LSDT-PSI with three baseline algorithms: UCB1, TS, and CKL-UCB. In LSDT-PSI, we choose a small value for the input parameter. Note that this choice does not affect the theoretical upper bound on regret; in practice, however, it is better to use a smaller value to avoid excessive plays of suboptimal arms. The simulation results shown in Fig. 6(a) indicate that LSDT-PSI outperforms the other algorithms. Besides, similar to the case of complete side information, we compare LSDT-PSI with a heuristic algorithm applying UCB1 to the reduced arm set, and a similar performance gain is observed in Fig. 6(b). Moreover, the computational efficiency of LSDT-PSI is also verified in Table II in Appendix A.2.
6.3 Online Recommendation Systems
In this subsection, we apply LSDT-PSI to a problem in online recommendation systems. We test our policy on a dataset from Jester, an online joke recommendation and rating system [36], consisting of 100 jokes and 25K users, where every joke has been rated by at least 34% of the entire population.^10 Ratings are real values between -10 and 10. In the experiment, we recommend a joke (modeled as an arm) to a new user at each time and observe the rating, which corresponds to playing an arm and receiving the reward. Note that although different users have different preferences toward items, every item exhibits a certain internal quality that is represented by the mean reward, i.e., the average rating over all users. The variations in ratings across different users correspond to the randomness of rewards. Since the algorithms we propose work for any sub-Gaussian reward distribution, and any distribution with bounded support is sub-Gaussian, Jester is a suitable dataset for evaluating their performance. In accordance with the assumptions of the policy, all ratings are normalized to the unit interval.
^10 Available at http://eigentaste.berkeley.edu/dataset/.
To test our policy using side information, we partition the dataset into a training set (a fraction of the users) and a test set (20K users). We obtain the partially revealed UIG from the training set as follows: we estimate the distance between two jokes by calculating the difference between their average ratings from users in the training set who have rated both jokes. We define a confidence parameter. If the estimated distance between two jokes is sufficiently large relative to this parameter, we add the pair to the dissimilarity set; otherwise, if the distance is sufficiently small, we add the pair to the similarity set. It is clear that there exist certain pairs of arms whose relations remain unknown. We use a larger confidence parameter for the smaller training set and a smaller one for the larger training set. Note that as the size of the training set increases, the estimation of distances between jokes becomes more accurate and thus the confidence parameter can be smaller. As a consequence, the number of joke pairs whose relations are known increases. For the distance hyperparameter defining the UIG, we use an iterative approach to find the value that minimizes the size of the set of arms that need to be explored. Intuitively, as the parameter increases, this size first decreases, since the side-information graph becomes more connected and more similarity relations are observed; the probability of eliminating suboptimal arms in the offline step therefore becomes higher. When the parameter is large, the graph approaches a complete graph and fewer dissimilarity relations are observed; as a consequence, the probability of eliminating suboptimal arms decreases and the size increases. A similar trend can be observed in the overall regret performance of the learning policy. Based on this, the iterative approach starts from a small value of the parameter and keeps doubling it at each step until the objective stops improving. Then a binary search method is applied, down to a given resolution (i.e., the minimum increment of the parameter), to find the best value within the resulting bracket.
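The doubling-then-search tuning procedure can be sketched as follows. This is a hedged illustration: the names are ours, and the refinement phase uses a ternary-style bracket search, which is one standard way to realize the binary-search step on a unimodal objective.

```python
def tune_parameter(objective, d0, resolution):
    """Doubling phase: grow d until the objective stops improving; then
    shrink the bracket down to `resolution`, assuming the objective
    (e.g., the number of arms left to explore) is unimodal in d."""
    d_prev, d = d0, 2 * d0
    while objective(d) < objective(d_prev):   # keep doubling while improving
        d_prev, d = d, 2 * d
    lo, hi = d_prev / 2, d                    # the minimizer lies in this bracket
    while hi - lo > resolution:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

On a toy unimodal objective with its minimum at 5, the procedure doubles past the minimum, brackets it, and then narrows down to the requested resolution.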
We use the unbiased offline evaluation method introduced in [26] and [37] to evaluate the algorithms, LSDT-PSI, UCB1, TS, CKL-UCB, and UCB1 on the reduced arm set, on the test set. Fig. 8 shows the average rating per user, with confidence intervals (scaled back to the original rating scale), for every policy. Note that CKL-UCB needs to estimate the KL-divergence between two distributions. Since the distribution type in the real dataset is unknown, we can only use a surrogate to approximate the KL-divergence, where