Multi-Armed Bandits on Partially Revealed Unit Interval Graphs

Xiao Xu, Sattar Vakili, Qing Zhao, and Ananthram Swami. Xiao Xu and Qing Zhao are with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, 14853, USA. E-mail: {xx243;qz16}@cornell.edu. Sattar Vakili is with Prowler.io, Cambridge, UK. E-mail: sv388@cornell.edu. Ananthram Swami is with the CCDC Army Research Laboratory, Adelphi, MD, 20783, USA. E-mail: a.swami@ieee.org. This work was supported in part by the Army Research Laboratory Network Science CTA under Cooperative Agreement W911NF-09-2-0053. The work of Qing Zhao was supported in part by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 754412, during her visit at Chalmers University, Sweden. Parts of this work were presented at the 36th IEEE Military Communications Conference (MILCOM), October 2017, and the 52nd Asilomar Conference on Signals, Systems and Computers, October 2018.
Abstract

A stochastic multi-armed bandit problem with side information on the similarity and dissimilarity across different arms is considered. The action space of the problem can be represented by a unit interval graph (UIG) where each node represents an arm and the presence (absence) of an edge between two nodes indicates similarity (dissimilarity) between their mean rewards. Two settings, complete and partial side information, are studied based on whether the UIG is fully revealed, and a general two-step learning structure, consisting of an offline reduction of the action space and online aggregation of reward observations from similar arms, is proposed to fully exploit the topological structure of the side information. In both settings, the computational efficiency of the proposed learning policies and their order optimality in terms of both the size of the action space and the length of the time horizon are established.

Multi-armed bandits, unit interval graph, side information.

1 Introduction

A number of emerging applications involve large-scale online learning in which the objective is to learn, in real time, the most rewarding actions among a large number of options. Example applications include various socio-economic applications (e.g., ad display in search engines, product/news recommendation systems, targeted marketing and political campaigns) and networking issues (e.g., dynamic channel access and route selection) in large-scale communication systems such as the Internet of Things. For such problems, a linear scaling of the learning cost with the problem size, resulting from exploring every option to identify the optimal one, is undesirable, if not infeasible. The key to achieving a sublinear scaling with the problem size is to exploit the inherent structure of the action space, i.e., the various relations among the vast number of options.

A classic framework for online learning and sequential decision-making under unknown models is the multi-armed bandit (MAB) formulation. In the classic setting, a player chooses one arm (or more generally, a fixed number of arms) from a set of arms (representing all possible options) at each time and obtains a reward drawn i.i.d. over time from an unknown distribution specific to the chosen arm. The design objective is a sequential arm selection policy that maximizes the total expected reward over a time horizon of length T by striking a balance between learning the unknown reward models of all arms (exploration) and capitalizing on this information to maximize the instantaneous gain (exploitation). The performance of an arm selection policy is measured by regret, defined as the expected cumulative reward loss against an omniscient player who knows the reward models and always plays the best arm.

A traditionally adopted assumption in MAB is that arms are independent and that there is no structure in the set of reward distributions. In this case, reward observations from one arm provide no information on other arms, resulting in a regret that grows linearly with the number of arms K. The main focus of the classic MAB literature has been on the regret order in T, which measures the learning efficiency over time. The seminal work by Lai and Robbins showed that the minimum regret has a logarithmic order in T [1]. A number of learning policies have since been developed that offer the optimal regret order in T (see [2, 3, 4] and references therein). Developed under the assumption of independent arms and relying on exploring every arm sufficiently often, however, these learning policies are not suitable for applications involving a massive number of arms, especially in the regime where the number of arms is large relative to the time horizon.
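As a point of reference for the index policies discussed above, a generic UCB1-style strategy can be sketched as follows. This is a textbook variant, not the specific policies of [2, 3, 4], and the Gaussian reward model below is purely illustrative:

```python
import math
import random

def ucb1(mean_rewards, horizon, seed=0):
    """Generic UCB1-style policy: after one initial pull per arm, play the
    arm maximizing empirical mean + sqrt(2 log t / n_i).
    Returns per-arm play counts."""
    rng = random.Random(seed)
    k = len(mean_rewards)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:                      # play each arm once first
            arm = t - 1
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        # hypothetical sub-Gaussian (here Gaussian) rewards
        reward = rng.gauss(mean_rewards[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.9], horizon=5000)
# the optimal arm (index 2) should dominate the play counts
```

Note the linear dependence on the number of arms: every arm must be pulled at least once, which is exactly what the side-information structure studied in this paper avoids.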

1.1 Main Results

In addressing the challenge of a massive number of arms, there has been a growing body of studies aiming at exploiting side information on the relations among the large number of arms. Among the various formulations of side information (see a more detailed discussion in Sec. 1.2), one notable example is the statistical similarity and dissimilarity among arms. For instance, in recommendation systems and information retrieval, products, ads, and documents in the same category (more generally, close in some feature space) have similar expected rewards. At the same time, it may also be known a priori that some arms have considerably different mean rewards, e.g., news articles with drastically different opinions, products with opposite usage, or documents associated with keywords belonging to distant categories in the taxonomy. Such side information opens the possibility of efficient solutions that scale well with the large action space.

In this paper, we study a bandit problem with side information on similarity and dissimilarity relations across actions. We first show that the similarity-dissimilarity structure of the action space can be represented by a unit interval graph (UIG) where the presence (absence) of an edge between two arms indicates that the difference of their mean rewards is within (beyond) a given threshold. Based on whether the UIG is fully revealed to the player, we consider two cases of complete and partial side information. For both cases, we propose a general two-step learning structure—LSDT (Learning from Similarity-Dissimilarity Topology)—to achieve a full exploitation of the topological structure of the side information. The first step is an offline reduction of the action space to a candidate set, which consists of arms that can assume the largest mean rewards under certain assignments of reward distributions without violating the side information. Arms outside the candidate set are sub-optimal and eliminated from online exploration. The second step carries out an online learning algorithm that further exploits the similarity structure through collective exploration by aggregating reward observations from similar arms.

In the case of complete side information, we show that the candidate set is given by the set of left anchors of the UIG, which can be identified by a Breadth-First-Search (BFS) based algorithm in polynomial time. By defining an equivalence relation between arms through their neighbor sets in the UIG, we obtain an equivalence-class partition of the arms. We show that the candidate set consists of at most two equivalence classes if the UIG is connected. We exploit this topological structure by maintaining two UCB (upper confidence bound) indices: one at the class level, aggregating observations from arms within the same class, and the other at the arm level. At each time, the arm with the largest arm index within the class with the largest class index is played. We establish the order optimality of the proposed policy in terms of both K and T by deriving an upper bound on regret and a matching lower bound for uniformly good policies.

In the case of partial side information, we represent the partially revealed UIG by a multigraph with two types of edges indicating the presence and the absence of the corresponding UIG edges. We show the NP-completeness of finding the candidate set and propose a polynomial-time approximation algorithm to reduce the action space. We show that under certain probabilistic assumptions on the partial side information, the size of the reduced action space is comparable to that of the ground truth candidate set as determined by the underlying UIG. In the second step of online learning, the key to a full exploitation of the similarity relation is to determine the frequency of exploring an arm based on its exploration value, which measures the topological significance of the node in a similarity graph. By sequentially eliminating arms less likely to be optimal through a UCB index aggregating observations from similar arms, only arms close to the optimal one remain after a sufficient number of plays. We provide a performance guarantee for the proposed policy and establish its order optimality under certain probabilistic assumptions on the side information.

It should be noted that the main issue and main contribution of this paper are on how to succinctly model and fully exploit the side information on the similarity and dissimilarity relations across arms. The solution to the former is the UIG representation of the action space, and to the latter is the two-step learning structure LSDT, which is independent of the specific arm selection rule adopted in the online learning step. In particular, different arm selection techniques developed for the original bandit problems may be incorporated into the second step of LSDT, except that they operate on aggregated observations. In Sec. 5.3, we discuss the use of Thompson Sampling (TS), one of the most well-known learning techniques for bandit problems (see [27, 25, 28, 29] and references therein), within LSDT to fully exploit the side information.

In summary, we develop a UIG formulation of side information on arm similarity and dissimilarity in MAB problems and consider two cases of complete and partial side information in this paper. We propose a general and computationally efficient two-step learning structure achieving full exploitation of side information and establish the order optimality of the proposed learning policies through theoretical upper bounds on regret as well as matching lower bounds in both cases.

1.2 Related Work

Existing studies on MAB with structured reward models can be categorized based on the types of arm relations adopted in the MAB models. The first type is realization-based relation that assumes a certain known probabilistic dependency across arms. Examples include combinatorial bandits [5, 6, 7, 8], linearly parameterized bandits [9, 10, 11], and spectral bandits for smooth graph functions [12, 13]. The second type of arm relation can be termed as observation-based relation [14, 15, 16]. Specifically, playing an arm provides additional side observations about its neighboring arms. See [17] for a survey on various bandit models with structured action spaces.

The problem studied in this paper considers another type of relation among arms: an ensemble-based relation that captures relations between the ensemble behaviors (i.e., mean rewards) of arms, rather than probabilistic dependencies in their realizations (the mean rewards of different arms exhibit certain relations, e.g., closeness or certain orderings, but the realized random rewards of the arms being played need not exhibit any probabilistic dependency). Related work includes Lipschitz bandits [18, 19, 20], taxonomy bandits [21] and unimodal bandits [22]. Specifically, in Lipschitz bandits, the mean reward is assumed to be a Lipschitz function of the arm parameter. Taxonomy bandits have a tree-structured action space where arms in the same subtree are close in their mean rewards. In unimodal bandits, the action space is represented by a graph where from every sub-optimal arm, there exists a path to the optimal arm along which the mean reward increases. Different from these existing studies, the bandit model studied in this paper considers an action space represented by a UIG indicating not only similarity but also dissimilarity relations across actions. Besides, the proposed learning policy carries out a two-level exploitation of the UIG structure, which is fundamentally different from existing approaches. Recently, a general formulation of structured bandits was proposed in [23], which includes a variety of known bandit models (e.g., Lipschitz bandits, unimodal bandits, linear bandits) as well as the bandit model studied in this work as special cases. The learning policy developed in [23], however, was given only implicitly in the form of a linear program (LP) that needs to be solved at every time step. For the problem studied in this paper, that LP does not admit polynomial-time solutions (unless P=NP).

Side information has also been used to refer to context information in the so-called contextual bandits (see [24, 25, 26] and references therein). Under this formulation, context information is revealed at each time, which affects the arm reward distributions. A contextual bandit problem can thus be viewed as multiple simple bandits, one for each context, that are interleaved in time according to the context stream. The complexity of the problem comes from the coupling of these simple bandits by assuming various models on how context affects the arm reward distributions. The problem is fundamentally different from the one studied here.

2 Multi-Armed Bandits on Unit Interval Graphs

2.1 Problem Formulation

Consider a stochastic K-armed bandit problem. At each time t, a player chooses one arm to play. Playing an arm yields a reward drawn i.i.d. from an unknown distribution specific to that arm, with mean μ_i for arm i. We assume that every reward distribution belongs to the family of sub-Gaussian distributions (a random variable X with mean μ is sub-Gaussian with parameter σ, or σ-sub-Gaussian, if E[exp(λ(X − μ))] ≤ exp(σ²λ²/2) for all λ ∈ ℝ [30]). Extensions to other distribution types are discussed in Sec. 5.

Across arms, the similarity and dissimilarity relations are defined through a parameter ε > 0: two arms are similar (dissimilar) if the difference between their mean rewards is below (above) ε. The similarity-dissimilarity structure of the action space can be represented by an undirected graph G. In the graph representation, every node represents an arm, and the presence (absence) of an edge between two nodes corresponds to a similar (dissimilar) arm pair. Throughout the paper, i is used to refer to an arm or a node, interchangeably. We first show that G is a UIG.

Definition 1 (Unit interval graph and unit interval model).

A graph is a unit interval graph if there exists a set of unit-length intervals {I_i} on the real line (if a UIG is finite, i.e., has a finite number of nodes, there is no difference between taking open or closed intervals to represent nodes [31]; without loss of generality, we write I_i = [l_i, r_i], where l_i and r_i are the left and right endpoints of interval I_i) such that each interval I_i corresponds to a node i and there exists an edge between nodes i and j if and only if I_i ∩ I_j ≠ ∅. The set of intervals is a unit interval model (UIM) for the UIG.

Through a mapping from every node i to an ε-length interval I_i = [μ_i, μ_i + ε), it is not difficult to see that

|\mu_i - \mu_j| < \epsilon \;\Longleftrightarrow\; I_i \cap I_j \neq \emptyset, \qquad (1)

which indicates that G is a UIG (see an example in Fig. 1). Without loss of generality, we assume that G is connected. Extensions to the disconnected case are discussed in Sec. 5.
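Equivalence (1) is easy to verify computationally: mapping each arm i to the half-open interval [μ_i, μ_i + ε) makes interval overlap coincide with |μ_i − μ_j| < ε. A minimal sketch with hypothetical mean rewards:

```python
def uig_edges(means, eps):
    """Edge (i, j) present iff |mu_i - mu_j| < eps."""
    k = len(means)
    return {(i, j) for i in range(k) for j in range(i + 1, k)
            if abs(means[i] - means[j]) < eps}

def interval_overlap(means, eps):
    """Edge (i, j) present iff the half-open intervals
    [mu_i, mu_i + eps) and [mu_j, mu_j + eps) intersect."""
    k = len(means)
    return {(i, j) for i in range(k) for j in range(i + 1, k)
            if max(means[i], means[j]) < min(means[i], means[j]) + eps}

# hypothetical means: only arms 0 and 1 are within eps of each other
means, eps = [0.1, 0.3, 0.8], 0.3
assert uig_edges(means, eps) == interval_overlap(means, eps) == {(0, 1)}
```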

We define E^S_ε ⊆ E and E^D_ε ⊆ E^c as the side information on arm similarity and dissimilarity, respectively, where E is the edge set of G and E^c its complement. Based on whether (E^S_ε, E^D_ε) fully reveals the UIG G, we consider the following two cases separately. In the case of complete side information, E^S_ε and E^D_ε are identical to the edge set and the complement edge set of G, i.e., E^S_ε = E and E^D_ε = E^c. In the case of partial side information, they are subsets, i.e., E^S_ε ⊆ E, E^D_ε ⊆ E^c.

The objective is an online learning policy π that specifies a sequential arm selection rule at each time t based on past observations of selected arms and the side information (E^S_ε, E^D_ε). The performance of policy π is measured by the regret, defined as the expected reward loss against a player who knows the reward model and always plays the best arm (chosen arbitrarily in the case of multiple optimal arms), i.e.,

R_\pi(T; E^S_\epsilon, E^D_\epsilon) = \mathbb{E}_\pi\left[ \sum_{t=1}^{T} \mu_{i_{\max}} - \sum_{t=1}^{T} X_{\pi(t)}(t) \right], \qquad (2)

where μ_{i_max} is the largest mean reward, i_max is an optimal arm, and π(t) is the arm selected by policy π at time t. The dependency of the regret on the unknown reward distributions is omitted in the notation. When there is no ambiguity, the notation is simplified to R(T).

Let τ_i(t) denote the number of times that arm i has been selected up to time t. We rewrite the regret as:

R(T) = \mu_{i_{\max}} T - \sum_{i=1}^{K} \mu_i \, \mathbb{E}[\tau_i(T)] = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[\tau_i(T)], \qquad (3)

where Δ_i = μ_{i_max} − μ_i. The objective of maximizing the expected cumulative reward is equivalent to minimizing the regret over a time horizon of length T. In order to minimize the regret, it can be inferred from (3) that every sub-optimal arm i (with Δ_i > 0) should be distinguished from the optimal one with the least number of plays.
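Identity (3) simply regroups the regret arm by arm; a quick numerical sanity check with made-up means and play counts:

```python
means = [0.9, 0.5, 0.2]          # hypothetical mean rewards; arm 0 optimal
plays = [900, 70, 30]            # hypothetical expected play counts
T = sum(plays)
mu_max = max(means)

# left form of (3): optimal total reward minus expected collected reward
lhs = mu_max * T - sum(m * n for m, n in zip(means, plays))
# right form of (3): per-arm gaps Delta_i weighted by play counts
rhs = sum((mu_max - m) * n for m, n in zip(means, plays))
assert abs(lhs - rhs) < 1e-9     # the two forms of (3) agree
```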

2.2 Two-Step Learning Structure

While classic bandit algorithms have to try out every arm sufficiently often to distinguish the sub-optimal arms from the optimal one, which induces a linear scaling of regret in the number of arms, the side information on arm similarity and dissimilarity allows the possibility of identifying a set of sub-optimal arms without even playing them. To be specific, we define a candidate set determined by the side information as follows.

Definition 2 (Candidate Arm and Candidate Set).

Given the side information (E^S_ε, E^D_ε), an arm i is a candidate arm if there exists an assignment of reward distributions with means conforming to E^S_ε and E^D_ε under which arm i attains the largest mean reward. The candidate set is the set consisting of all candidate arms.

Note that the optimal arm under the ground truth assignment of reward distributions in the bandit problem always belongs to the candidate set. It is clear that if we can find the candidate set from the side information efficiently, the action space can be reduced to the candidate set: only arms in it need to be explored. Furthermore, certain topological structures of the revealed UIG on the reduced action space can be further exploited to accelerate learning. In estimating the mean reward of every arm in the candidate set, observations from similar arms can also be leveraged as approximations, which reduces the number of plays required to distinguish sub-optimal arms from the optimal one.

The aforementioned facts motivate a general two-step learning structure: Learning from Similarity-Dissimilarity Topology (LSDT) for both cases of complete and partial side information. Specifically, LSDT consists of (1) an offline elimination step that reduces the action space to the candidate set and (2) online learning of the optimal arm by aggregating observations from similar ones. We specify each step for the cases of complete and partial side information separately in Sec. 3 and Sec. 4.

3 Complete Side Information

We first consider the case of complete side information that fully reveals the UIG . We follow the two-step learning structure proposed in Sec. 2.2 and develop a learning policy: LSDT-CSI (Learning from Similarity-Dissimilarity Topology with Complete Side Information) along with theoretical analysis on its regret performance. While restrictive in applications, this case provides useful insights for tackling the general case of partial side information addressed in Sec. 4.

3.1 Offline Elimination

The first step of LSDT-CSI is an offline preprocessing step that aims at identifying the candidate set from the complete side information. Since the UIG is fully revealed, we denote the candidate set in this case by B* to distinguish it from the case of partial side information. We show that B* is identical to the set of left anchors of the UIG G.

Definition 3 (Left Anchor).

Given a UIG G, a node i is a left anchor if there exists a UIM for G in which the interval corresponding to i is the leftmost interval along the real line.

Since the mirror image of a UIM with respect to the origin is also a UIM for the same UIG, the node corresponding to the rightmost interval in a UIM is also a left anchor. Based on the definition of the UIG that represents the similarity-dissimilarity structure of the arm set in Sec. 2.1, it is not difficult to see that the candidate set B* is identical to the set of left anchors of G, which can be identified through a BFS-based algorithm proposed in [32]. The BFS-based algorithm starts from an arbitrary node in a UIG and returns a set of left anchors. We apply the algorithm twice: the first time, we start from an arbitrary node in G and obtain a set of left anchors; the second time, we re-apply the algorithm starting from one of the nodes returned by the first application. One can directly infer from Proposition 2.1 and Theorem 2.3 in [32] that the set so obtained is the candidate set B*. Note that the computational complexity of the offline elimination step is polynomial in the problem size.
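As a rough illustration of the double-application idea (a heuristic sketch only; the exact BFS-based algorithm and its correctness proof are in [32]), the following finds the two extremes of a path-shaped UIG by running BFS twice, in the spirit of diameter-finding in graphs:

```python
from collections import deque

def bfs_dist(adj, src):
    """Standard BFS distances from src in an adjacency-list graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def extremal_nodes(adj, start):
    """Heuristic double BFS: BFS from an arbitrary node to find one
    extreme of the UIG, then BFS again from that extreme to find the
    other end.  Illustrative only; see [32] for the exact algorithm."""
    d1 = bfs_dist(adj, start)
    left = max(d1, key=d1.get)      # a farthest node: one extreme
    d2 = bfs_dist(adj, left)
    right = max(d2, key=d2.get)     # farthest from that extreme
    return left, right

# path-shaped UIG 0-1-2-3: the extremes are nodes 0 and 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assert set(extremal_nodes(adj, 1)) == {0, 3}
```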

3.2 Online Aggregation

We now present the second step, the online learning procedure, which further exploits the topological structure of the candidate set B*. We first introduce an equivalence relation between nodes in the UIG G.

Definition 4 (Neighborhood Equivalence).

Two nodes i and j in G are (neighborhood) equivalent if N[i] = N[j], where N[i] is the set of neighbors of node i in G, including i itself. Moreover, let B_1, …, B_M denote the resulting partition of the arm set in G with respect to the neighborhood equivalence relation.

Note that arms within the same equivalence class have the same set of neighbors and are thus topologically indistinguishable in the UIG. Based on the equivalence-class partition, we obtain a closed-form expression for B*.

Theorem 1.

When the side information fully reveals the UIG G (assumed to be connected), the candidate set B* is the union of the two equivalence classes containing the optimal arm i_max and the worst arm i_min, i.e., the arm with the minimum mean reward. (These two equivalence classes coincide in the special case where G is fully connected; the proposed algorithm and analysis still apply in that case, and without loss of generality we assume that G is not fully connected.) That is,

\mathcal{B}^* = \mathcal{B}^*_{i_{\max}} \cup \mathcal{B}^*_{i_{\min}}, \qquad (4)

where

\mathcal{B}^*_{i_{\max}} = \{ j : N[j] = N[i_{\max}] \}, \qquad (5)
\mathcal{B}^*_{i_{\min}} = \{ j : N[j] = N[i_{\min}] \}. \qquad (6)
Proof.

See Appendix B in the supplementary material. ∎

The result is also illustrated in Fig. 2: the candidate set is the union of the two equivalence classes B*_{i_max} and B*_{i_min}, which can be directly obtained through the offline elimination step.
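Definition 4 and the closed form (4)-(6) can be illustrated on a small fully revealed UIG; the sketch below computes the neighborhood-equivalence classes and the candidate set, using the true means only to name i_max and i_min for the check:

```python
def closed_neighborhoods(k, edges):
    """N[i] including i itself, from an undirected edge list."""
    n = {i: {i} for i in range(k)}
    for i, j in edges:
        n[i].add(j)
        n[j].add(i)
    return n

def equivalence_classes(k, edges):
    """Partition arms by identical closed neighborhoods N[i] (Definition 4)."""
    n = closed_neighborhoods(k, edges)
    classes = {}
    for i in range(k):
        classes.setdefault(frozenset(n[i]), set()).add(i)
    return list(classes.values())

def candidate_set(k, edges, means):
    """Theorem 1: B* = class of i_max union class of i_min."""
    n = closed_neighborhoods(k, edges)
    i_max = max(range(k), key=lambda i: means[i])
    i_min = min(range(k), key=lambda i: means[i])
    return {j for j in range(k) if n[j] == n[i_max] or n[j] == n[i_min]}

# line-graph UIG 0-1-2-3 with increasing means: only the two end arms survive
means = [0.1, 0.4, 0.7, 1.0]
edges = [(0, 1), (1, 2), (2, 3)]
assert candidate_set(4, edges, means) == {0, 3}
```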

Based on the topological structure of the candidate set, we develop a hierarchical online learning policy that aggregates observations from arms within the same equivalence class. By considering each class as a super node (arm), we reduce the problem to a simple two-armed bandit problem.

Specifically, the second step of LSDT-CSI carries out hierarchical UCB-based online learning on the candidate set by maintaining a class index for each equivalence class and an arm index for each individual arm in B*. The arm index is defined as:

L_j(t) = \bar{x}_j(t) + \sqrt{\frac{8 \log t}{\tau_j(t)}}, \qquad (7)

where \bar{x}_j(t) and τ_j(t) are, respectively, the empirical average of the observations from arm j and the number of times that arm j has been played up to time t. The class index aggregates the same statistics across arms in the class:

H_i(t) = \frac{\sum_{j \in \mathcal{B}^*_i} \bar{x}_j(t) \, \tau_j(t)}{\sum_{j \in \mathcal{B}^*_i} \tau_j(t)} + \sqrt{\frac{8 \log t}{\sum_{j \in \mathcal{B}^*_i} \tau_j(t)}}. \qquad (8)

At each time, the online learning procedure selects the equivalence class with the largest class index and plays the arm with the largest arm index within the selected class. Once the reward has been observed, both class indices and arm indices are updated.
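One decision step of the hierarchical rule based on (7) and (8) can be sketched as follows (the per-arm statistics below are hypothetical):

```python
import math

def arm_index(x_bar, tau, t):
    # arm-level UCB index (7)
    return x_bar + math.sqrt(8 * math.log(t) / tau)

def class_index(stats, t):
    # class-level UCB index (8): play-count-weighted average of the
    # class members plus a width shrinking in the total class plays
    total = sum(tau for _, tau in stats)
    avg = sum(x * tau for x, tau in stats) / total
    return avg + math.sqrt(8 * math.log(t) / total)

def select(classes, t):
    """classes: list of equivalence classes, each a list of
    (empirical mean, play count) pairs.  Pick the class with the
    largest index (8), then the arm in it with the largest index (7)."""
    c = max(range(len(classes)), key=lambda i: class_index(classes[i], t))
    a = max(range(len(classes[c])),
            key=lambda j: arm_index(classes[c][j][0], classes[c][j][1], t))
    return c, a

# two hypothetical classes; within the winning class the
# less-explored arm has the larger arm index
classes = [[(0.9, 50), (0.85, 40)], [(0.2, 60)]]
assert select(classes, 150) == (0, 1)
```

Note how the class index pools plays across the class, so the class-level decision converges faster than any single arm index could.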

3.3 Order Optimality

We first present the regret analysis of LSDT-CSI, which focuses on upper bounding the expected number of times that each suboptimal arm has been played up to time T. We show that when the total number of times that arms in the suboptimal class B*_{i_min} have been played exceeds the order of log T, the class index of B*_{i_min} will not be the largest with high probability. Besides, if each suboptimal arm in B*_{i_max} has been played on the order of log T times, its arm index will not be the largest with high probability. The following theorem provides the performance guarantee for LSDT-CSI.

Theorem 2.

Suppose that G is connected. Assume that the reward distribution of each arm is sub-Gaussian with a common fixed parameter (see Sec. 5 for extensions to general sub-Gaussian parameters). Then the regret of LSDT-CSI up to time T is upper bounded as follows:

R(T) \le \left( \frac{32 \max_{i \in \mathcal{B}^*_{i_{\min}}} \Delta_i}{\big( \min_{j \in \mathcal{B}^*_{i_{\min}}} \Delta_j - \max_{k \in \mathcal{B}^*_{i_{\max}}} \Delta_k \big)^2} + \sum_{i \in \mathcal{B}^*_{i_{\max}} \setminus \mathcal{A}} \frac{32}{\Delta_i} \right) \log T + O(|\mathcal{B}^*|), \qquad (9)

where 𝒜 denotes the set of arms with the largest mean reward (Δ_i = 0).

Proof.

See Appendix C in the supplementary material. ∎

Remark 1.

For fixed mean rewards and ε, the regret of LSDT-CSI is of order

O\big( (1 + |\mathcal{B}^*_{i_{\max}} \setminus \mathcal{A}|) \log T \big), \qquad (10)

as T → ∞. In certain scenarios (e.g., when G is a line graph), |B*_{i_max} ∖ 𝒜| = O(1), which indicates a sublinear scaling of regret in the number of arms given such side information.

Remark 2.

If G is fully connected (e.g., when ε is large), then B* = V. In this case, LSDT-CSI degenerates to the classic UCB policy and the regret scales linearly with the number of arms.

We discuss in Sec. 6 that if the mean reward of each arm is independently and uniformly chosen from an interval and ε is bounded away from 0 and the interval length, the expected size of the candidate set grows sublinearly in K, which indicates a sublinear scaling of regret in terms of the size of the action space. We also use a numerical example to verify this result in Sec. 6.

To establish the order optimality of LSDT-CSI, we further derive a matching lower bound on regret. We focus here on the case where the unknown mean reward of each arm is unbounded (i.e., it can be any value on the real line). We adopt the same parametric setting as in [1] for the classic MAB, where rewards are drawn from a specific parametric family of distributions with known distribution type (although the upper bound on the regret of LSDT-CSI is derived in the non-parametric setting where the distribution type is unknown, the parametric lower bound suffices to show the order optimality of LSDT-CSI, since the non-parametric lower bound can be no smaller than the parametric one). Specifically, the reward distribution of arm i has a univariate density function f(x; θ_i) with an unknown parameter θ_i from a set of parameters Θ. Let I(θ, λ) denote the Kullback-Leibler (KL) divergence between the two distributions with density functions f(·; θ) and f(·; λ) and means μ(θ) and μ(λ), respectively. We assume the same regularity assumptions on the finiteness of the KL divergence and its continuity with respect to the mean values as in [1].

Assumption 1.

For every θ, λ ∈ Θ such that μ(λ) > μ(θ), we have 0 < I(θ, λ) < ∞.

Assumption 2.

For every λ ∈ Θ and every δ > 0, there exists η > 0 for which |I(θ, λ) − I(θ, λ′)| < δ whenever μ(λ) ≤ μ(λ′) ≤ μ(λ) + η.

The following theorem provides a lower bound on regret for uniformly good policies (a policy π is uniformly good if, for every assignment of reward distributions, its regret satisfies R_π(T) = o(T^a) for every a > 0 as T → ∞ [1]).

Theorem 3.

Suppose G is connected. Assume that Assumptions 1 and 2 hold and that the mean reward of each arm can be any value on the real line. For any uniformly good policy, the regret up to time T is lower bounded as follows:

\liminf_{T \to \infty} \frac{R(T)}{\log T} \ge C_1, \qquad (11)

where C_1 is the optimal value of an LP that depends only on the mean rewards and ε (see (53) in Appendix D for details). It can be shown that for fixed mean rewards and ε, the regret of any uniformly good policy is of order Ω((1 + |B*_{i_max} ∖ 𝒜|) log T) as T → ∞.

Proof.

See Appendix D in the supplementary material. ∎

Remark 3.

LSDT-CSI is order optimal since its upper bound on regret matches the lower bound shown in Theorem 3.

Remark 4.

If there is a unique optimal arm, i.e., |𝒜| = 1, the upper and lower bounds match order-wise as T → ∞.

4 Partial Side Information

In this section, we consider the general case of partial side information where the UIG is partially revealed. We develop a learning policy: LSDT-PSI (Learning from Similarity-Dissimilarity Topology with Partial Side Information) following the two-step structure proposed in Sec. 2.2 and provide theoretical analysis on the regret performance.

4.1 Offline Elimination

A partially revealed UIG can be represented by an undirected edge-labeled multigraph (see Fig. 3). Specifically, the multigraph consists of two types of edges: type-S edges (E^S_ε) and type-D edges (E^D_ε), indicating, respectively, the presence and the absence of the corresponding UIG edges. The absence of any edge between two nodes in the multigraph indicates an unknown relation between the two arms.

We first show that finding the candidate set under partial side information is NP-complete. Notice that finding the candidate set is equivalent to considering every node i individually and deciding whether i can be a left anchor of a UIG G_P consisting of the same set of nodes as G and a potential edge set E^P_ε satisfying

E^S_\epsilon \subseteq E^P_\epsilon, \qquad (12)
E^P_\epsilon \cap E^D_\epsilon = \emptyset. \qquad (13)

Specifically, we show the NP-completeness of the following decision problem.

• LEFTANCHOR

• [INPUT]: A multigraph with type-S edge set E^S_ε and type-D edge set E^D_ε, for which it is known that there exists a UIG G = (V, E) with E^S_ε ⊆ E and E^D_ε ∩ E = ∅, and a specific node i ∈ V.

• [QUESTION]: Does there exist a UIG G_P = (V, E^P_ε) with E^S_ε ⊆ E^P_ε and E^P_ε ∩ E^D_ε = ∅ such that node i is a left anchor of G_P?

Theorem 4.

LEFTANCHOR is NP-complete.

Proof.

To show the NP-completeness of LEFTANCHOR, we give a reduction from a variant of the 3-SAT problem: CONSISTENT-NAE-3SAT. Due to the page limit, we include the definition of CONSISTENT-NAE-3SAT as well as its proof of NP-completeness in Appendix E in the supplementary material. The reduction to LEFTANCHOR and the remaining proof are presented in Appendix F. ∎

It should be noted that LEFTANCHOR is similar to the so-called UIG Sandwich Problem [33], where two graphs G¹ = (V, E¹) and G² = (V, E²) are given satisfying E¹ ⊆ E². The question is whether a UIG G_S = (V, E_S) exists satisfying E¹ ⊆ E_S ⊆ E². It is not difficult to see that the type-S edge set E^S_ε corresponds to E¹ in the sandwich problem and the complement of the type-D edge set E^D_ε corresponds to E². However, LEFTANCHOR is different from the sandwich problem: we already know that the sandwich problem is satisfiable by the ground truth UIG G, and what we are interested in is whether a specific node can be a left anchor.

To address the challenge of finding the candidate set in polynomial time, we exploit the following topological property of UIGs to obtain an approximate solution.

Proposition 1.

Given the side information (E^S_ε, E^D_ε), an arm j is sub-optimal if it is similar to two mutually dissimilar arms, i.e., if there exist arms i and k such that (i, j), (j, k) ∈ E^S_ε but (i, k) ∈ E^D_ε, then arm j does not belong to the candidate set.

Based on this property, we develop the offline elimination step of LSDT-PSI, whose complexity is polynomial in the problem size.
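The elimination rule of Proposition 1 admits a direct polynomial-time implementation; a minimal sketch of the idea (not the full LSDT-PSI offline step):

```python
def offline_eliminate(k, s_edges, d_edges):
    """Drop every arm j that is similar (type-S) to two arms i, k
    known to be dissimilar (type-D); by Proposition 1, such a j is
    sub-optimal.  Returns the surviving arm set B^0."""
    s_nbrs = {i: set() for i in range(k)}
    for i, j in s_edges:
        s_nbrs[i].add(j)
        s_nbrs[j].add(i)
    d = {frozenset(e) for e in d_edges}
    survivors = set()
    for j in range(k):
        nbrs = list(s_nbrs[j])
        # does j sit in a similar-similar-dissimilar wedge?
        wedged = any(frozenset((nbrs[a], nbrs[b])) in d
                     for a in range(len(nbrs))
                     for b in range(a + 1, len(nbrs)))
        if not wedged:
            survivors.add(j)
    return survivors

# arms 0-1-2 on a line: arm 1 is similar to both 0 and 2, which are dissimilar
assert offline_eliminate(3, s_edges=[(0, 1), (1, 2)], d_edges=[(0, 2)]) == {0, 2}
```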

It is clear that, in general, the arm set B^0 that survives the offline elimination step is a superset of the ground truth candidate set B*. However, in certain scenarios, the partially revealed UIG provides sufficient topological information to identify the ground truth candidate set obtained from the fully revealed UIG. We show that such information is fully exploited by the offline elimination step of LSDT-PSI to achieve the same performance as that of LSDT-CSI in the case of complete side information.

Specifically, we make the following assumptions on G and its equivalence classes, which require the neighbor set of every arm to be sufficiently diverse. Without loss of generality, we assume an increasing order of the equivalence classes B_1, …, B_M along the real line, i.e., for every i ∈ B_m and j ∈ B_n with m < n, we have μ_i < μ_j.

Assumption 3.

For every arm i ∉ B*, assume that there exist arms j and k whose equivalence classes are both connected to that of i but are mutually disconnected in G. (Two equivalence classes are connected if and only if at least one pair of arms from the two classes is adjacent in the UIG; it can be inferred from the equivalence relation that if one such arm pair is adjacent, then all such pairs are.)

Assumption 4.

Assume that there exists a constant c > 0 such that every equivalence class contains at least cK arms, i.e., |B_m| ≥ cK for every m.

We further make a probabilistic assumption on the partial side information.

Assumption 5.

The presence and the absence of each edge in the UIG G are revealed in E^S_ε and E^D_ε independently, with probabilities p and q, respectively. Assume that p and q are bounded below by a threshold determined by the constant c defined in Assumption 4.

Note that as c increases, for every arm i ∉ B*, the number of dissimilar arm pairs that are both similar to i increases. Therefore, smaller probabilities of observing edges can still guarantee that arm i is eliminated with high probability.

Based on these assumptions, we provide a performance guarantee for the offline elimination step of LSDT-PSI through the following theorem. We also verify the result through numerical examples in Sec. 6.

Theorem 5.

Given a UIG G, under Assumptions 3-5, with probability approaching 1 as K → ∞, every arm i ∉ B* is eliminated by the offline elimination step of LSDT-PSI, and thus,

\mathbb{E}_{E^S_\epsilon, E^D_\epsilon}\big[ |\mathcal{B}^0| \big] = |\mathcal{B}^*| + o(1), \qquad (14)

as K → ∞, where B^0 is the arm set remaining after the offline elimination step of LSDT-PSI.

Proof.

See Appendix G in the supplementary material. ∎

4.2 Online Aggregation

We now present the second step, the online learning procedure of LSDT-PSI. We first define a similarity graph $G'_{\epsilon}$ restricted to the remaining arm set: its vertex set $V'$ consists of the arms surviving the offline elimination step, and its edges are the revealed similarity relations among these arms. For every arm $i \in V'$, we define an exploration value $z_i$, which measures the topological significance of node $i$ in the similarity graph and determines the frequency of playing arm $i$. Intuitively, a node with a higher degree has a higher exploration value, since playing this node provides information about more (neighboring) nodes. Specifically, we define the exploration values as the optimal solution to the following LP.

$$\text{P2:}\qquad C_2=\min_{\{z_i\}_{i\in V'}}\ \sum_{i\in V'} z_i, \qquad (15)$$
$$\text{s.t.}\quad \sum_{j\in N'[i]} z_j \ge 1,\ \ \forall i\in V';\qquad z_i \ge 0,\ \ \forall i\in V',$$

where $N'[i]$ is the set of neighbors of node $i$ in $G'_{\epsilon}$, including $i$ itself. In the online learning procedure, the number of times arm $i$ is played is proportional to its exploration value $z_i$. Note that if a certain number of plays is necessary to distinguish a suboptimal arm from the optimal one in the classic MAB problem, it now suffices to play each arm only a $z_i$-fraction of that number, by aggregating observations from every neighboring arm, each played in proportion to its own exploration value. Note that $C_2=\sum_{i\in V'} z_i$ is upper bounded by the size of the minimum dominating set of $G'_{\epsilon}$.
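To make LP (15) concrete, the following sketch computes the exploration values as the fractional covering LP above. The function name `exploration_values` and the use of `scipy.optimize.linprog` are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def exploration_values(adj):
    """Solve P2: minimize sum_i z_i subject to, for every node i, the sum of
    z over the closed neighborhood N'[i] being at least 1, with z >= 0.
    `adj` is a 0/1 adjacency matrix of the similarity graph."""
    n = adj.shape[0]
    closed = adj.astype(float) + np.eye(n)   # closed neighborhoods N'[i]
    # linprog solves min c^T z s.t. A_ub @ z <= b_ub, so flip signs for ">=".
    res = linprog(c=np.ones(n), A_ub=-closed, b_ub=-np.ones(n),
                  bounds=[(0, None)] * n, method="highs")
    return res.x
```

On a star graph, for instance, the optimum places all weight on the hub ($C_2=1$), reflecting that playing the hub informs every leaf; $C_2$ never exceeds the size of a minimum dominating set because the indicator vector of any dominating set is feasible for P2.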

We briefly summarize the second step of LSDT-PSI: the algorithm proceeds in epochs, and during each epoch, every active arm is played up to a number of times proportional to its exploration value. Arms less likely to be optimal are eliminated at the end of every epoch, and only two types of arms are played in the next epoch: 1) non-eliminated arms and 2) arms with non-eliminated neighbors. After a sufficient number of epochs, only arms close to the optimal one remain, and we switch to single-arm indices for selection, tracking the average reward of every arm up to the current epoch.
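The epoch structure can be sketched as follows. This is a simplified illustration rather than the paper's exact pseudocode: the elimination threshold is a fixed multiple of $\epsilon$ instead of the epoch-dependent confidence width used by LSDT-PSI, and all names (`epoch_elimination`, `pull`, `z`) are hypothetical.

```python
import math

def epoch_elimination(neighbors, z, pull, num_epochs, eps):
    """Play arms in epochs with budgets proportional to their exploration
    values, aggregate rewards over closed neighborhoods of the similarity
    graph, and drop arms whose aggregated estimate falls too far behind."""
    arms = list(neighbors)
    counts = {i: 0 for i in arms}
    sums = {i: 0.0 for i in arms}
    active = set(arms)
    for n in range(1, num_epochs + 1):
        budget = {i: math.ceil(z[i] * 2 ** n) for i in arms}
        for i in arms:
            # play non-eliminated arms and arms with non-eliminated neighbors
            if i in active or any(j in active for j in neighbors[i]):
                while counts[i] < budget[i]:
                    sums[i] += pull(i)
                    counts[i] += 1
        # aggregate observations over the closed neighborhood of each arm
        agg = {}
        for i in active:
            tot_c = counts[i] + sum(counts[j] for j in neighbors[i])
            tot_s = sums[i] + sum(sums[j] for j in neighbors[i])
            agg[i] = tot_s / tot_c
        best = max(agg.values())
        # simplified fixed threshold: aggregated means of similar arms can
        # deviate from the true mean by at most eps
        active = {i for i in active if agg[i] >= best - 2 * eps}
    return active
```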

4.3 Order Optimality

The following theorem provides an upper bound on the regret of LSDT-PSI for any given partially revealed UIG.

Theorem 6.

Consider a given partially revealed UIG. Assume that the reward distribution of each arm is sub-Gaussian. (Certain sub-Gaussian distributions, e.g., the Bernoulli distribution and the uniform distribution on a bounded interval, have parameter $\sigma\le 1/2$; see Sec. 5 for extensions to general $\sigma$.) Then the regret of LSDT-PSI up to time $T$ is upper bounded by:

$$R(T)\le \sum_{j\in B_0\setminus(Q\cup A)}\Delta_j\max\!\left\{\frac{8\log T}{\Delta_j^2},\ \frac{32\,z_j\log(T\epsilon^2)}{\epsilon^2}\right\} + \sum_{i\in Q}\Delta_i z_i\,\frac{32\log(T\hat{\Delta}_i^2)}{\hat{\Delta}_i^2}+O(\lvert V'\rvert), \qquad (17)$$

where $\hat{\Delta}_i$ is a truncated version of the gap $\Delta_i$ specified in the proof.

Proof.

See Appendix H in the supplementary material. ∎

Remark 5.

For fixed $\epsilon$, the regret of LSDT-PSI is of order

$$O\!\left(\left(\gamma(G'_{\epsilon})+\lvert B_0\setminus(Q\cup A)\rvert\right)\log T\right), \qquad (18)$$

as $T\to\infty$, where $\gamma(G'_{\epsilon})$ is the size of the minimum dominating set of the graph $G'_{\epsilon}$ and $\lvert Q\rvert$ is the number of sub-optimal arms that are $\epsilon$-close to the optimal one. It is not difficult to see that as $\epsilon$ increases, $\gamma(G'_{\epsilon})$ decreases and $\lvert Q\rvert$ increases. For an appropriate $\epsilon$, a sublinear scaling of regret in terms of the number of arms can be achieved.

Recall that in Theorem 5, we showed that under certain assumptions, the offline elimination step of LSDT-PSI achieves the same performance as LSDT-CSI in the case of complete side information. The following corollary further shows the order optimality of LSDT-PSI in terms of both the size of the action space and the time horizon $T$.

Corollary 1.

Assume that Assumptions 3-5 hold. For fixed $\epsilon$, the expected regret of LSDT-PSI, taken over random realizations of the partial side information $(E_S^{\epsilon}, E_D^{\epsilon})$, is upper bounded as follows:

$$\mathbb{E}_{E_S^{\epsilon},E_D^{\epsilon}}\!\left[R(T)\right]\le O\!\left(\left(1+\lvert B^{*}_{i_{\max}}\setminus A\rvert\right)\log T\right), \qquad (19)$$

as $T\to\infty$, which matches the lower bound on regret for the case of complete side information established in Theorem 3.

Proof.

See Appendix I in the supplementary material. ∎

5 Extensions

In this section, we discuss extensions of the proposed policies, LSDT-CSI and LSDT-PSI, and of their regret analyses to cases with disconnected UIGs and other reward distributions. We also discuss how Thompson Sampling techniques can be incorporated into the LSDT learning structure.

5.1 Extensions to disconnected UIG

Suppose that the UIG has $M$ ($M>1$) connected components. It is not difficult to see that every connected component is still a UIG and that the set of left anchors of the whole graph is the union of the left anchors of all components. Therefore, in the case of complete side information, the offline elimination step of LSDT-CSI outputs at most $2M$ equivalence classes, and the second step of LSDT-CSI can be applied directly by maintaining a class index for every equivalence class as defined in (8). Moreover, by extending the regret analysis of LSDT-CSI in Theorem 2 as well as the lower bound on regret for uniformly good policies in Theorem 3 to the disconnected case, we can show that LSDT-CSI achieves an order-optimal regret, i.e.,

$$R(T)\sim\Theta\!\left(\left(M+\lvert B^{*}_{i_{\max}}\setminus A\rvert\right)\log T\right), \qquad (20)$$

as $T\to\infty$. In the extreme case when every arm forms its own component (e.g., when no edges are present), LSDT-CSI degenerates to the classic UCB policy and the regret grows linearly with the number of arms.
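Since the policies operate per component when the UIG is disconnected, the component decomposition can be sketched with a standard breadth-first search; the function name and the adjacency-dict representation are illustrative.

```python
from collections import deque

def connected_components(neighbors):
    """Connected components of an undirected graph given as an adjacency
    dict {node: iterable of neighbors}; used to run the offline elimination
    step separately on each component of a disconnected UIG."""
    seen, comps = set(), []
    for start in neighbors:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in neighbors[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps
```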

In the case of partial side information, the LSDT-PSI policy along with its regret analysis applies to any partially revealed UIG without assumptions on the connectivity of the graph. The upper bound on regret in Theorem 6 still holds when the UIG has $M$ connected components. In the extreme case where no edges are revealed, the size of the minimum dominating set of the similarity graph equals the number of remaining arms, and the bound reduces to that of the classic MAB problem.

To show the order optimality of LSDT-PSI in the disconnected case, we need certain modifications to the assumptions on the UIG. We consider every connected component of the UIG together with its equivalence classes. We assume that Assumptions 3 and 4 hold for every connected component and, without loss of generality, that the optimal arm is in a fixed component. Then under Assumption 5, we can extend the regret analysis in Corollary 1 to the case where the UIG has $M$ connected components. It can be shown that the expected regret of LSDT-PSI is upper bounded by

$$O\!\left(\left(M+\lvert B^{*}_{i_{\max}}\setminus A\rvert\right)\log T\right), \qquad (21)$$

as $T\to\infty$, which matches the lower bound in the case of complete side information.

5.2 Extensions to Other Distributions

Recall that in the regret analysis of LSDT-CSI and LSDT-PSI, we assume sub-Gaussian reward distributions with parameter $\sigma=1$ (e.g., the standard normal distribution) or $\sigma=1/2$ (e.g., the Bernoulli distribution). We first discuss extensions to general sub-Gaussian distributions with an arbitrary parameter $\sigma$.

In LSDT-CSI, by rescaling the second terms of the UCB indices defined in (7) and (8) according to an input parameter upper-bounding the sub-Gaussian parameter, the regret analysis in Theorem 2 still applies and the upper bound on regret is only affected up to a constant scaling factor, as long as the input parameter is no smaller than $\sigma$. A similar extension also applies to LSDT-PSI by rescaling the second terms of the UCB indices in (16).
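As an illustration of this rescaling, a UCB index for $\sigma$-sub-Gaussian rewards might look as follows. The Hoeffding-style constant 2 is a common choice and an assumption here; the paper's indices (7), (8), and (16) may use different constants.

```python
import math

def ucb_index(sample_mean, n_pulls, t, sigma=1.0):
    """UCB index with an exploration bonus scaled by the sub-Gaussian
    parameter sigma; valid as long as sigma upper-bounds the true parameter."""
    return sample_mean + math.sqrt(2 * sigma ** 2 * math.log(t) / n_pulls)
```

Passing a `sigma` larger than the true parameter keeps the confidence bound valid and only inflates the constant factor in the regret.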

Furthermore, we can extend the results for sub-Gaussian reward distributions to other distribution types such as light-tailed and heavy-tailed distributions. There are standard techniques for such extensions: one replaces the concentration results with the corresponding ones for light-tailed and heavy-tailed distributions (the latter also requires replacing sample means with truncated sample means). Similar extensions for classic MAB problems without side information are discussed in [34, 4]. To illuminate the main ideas without excessive technicality, most existing work adopts the even stronger assumption of rewards with bounded support (see [2], [3], [24], etc.).

5.3 Extensions to Thompson Sampling Techniques

The two-step learning structure LSDT is in general independent of the specific arm selection rule adopted in the online learning step. We discuss here how Thompson Sampling (TS) techniques can be incorporated into the basic structure with aggregation of reward observations. Specifically, in the case of complete side information, after reducing the action space to the candidate set via the offline step, we adopt a hierarchical online learning policy similar to that used in LSDT-CSI by maintaining two posterior distributions on the reward parameters: one at the equivalence-class level and one at the arm level. At each time, the policy first randomly selects an equivalence class according to its class-level probability of containing the optimal arm and then randomly draws an arm within the class according to its arm-level probability of being optimal. In the case of partial side information, similar to LSDT-PSI, an elimination strategy is carried out to sequentially remove arms less likely to be optimal. At each time, an arm is randomly drawn according to its arm-level posterior probability of being optimal. The observation from the selected arm is also used to update the higher-level posterior distributions of its neighbors, which aggregate observations from all similar arms. According to the higher-level posterior distribution, the arm least likely to be optimal is eliminated once it has been explored sufficiently many times. Simulation results in Appendix A.1 show a similar performance gain from exploiting the side information on arm similarity and dissimilarity through the two-step learning structure when TS is incorporated in both cases. Establishing full exploitation of the side information and the order optimality of regret under TS, however, requires further study.
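As a concrete sketch of the class-then-arm sampling described above (the Beta-Bernoulli posteriors and all names are our assumptions; the paper leaves the posterior model unspecified):

```python
import random

def hierarchical_ts_select(classes, class_post, arm_post, rng=random):
    """Two-level Thompson Sampling: sample a score for every equivalence
    class from its class-level Beta posterior, pick the best class, then
    sample arm-level scores within that class and pick the best arm.
    `class_post[k]` and `arm_post[i]` hold (alpha, beta) parameters."""
    class_samples = {k: rng.betavariate(a, b)
                     for k, (a, b) in class_post.items()}
    k_star = max(class_samples, key=class_samples.get)
    arm_samples = {i: rng.betavariate(*arm_post[i]) for i in classes[k_star]}
    return max(arm_samples, key=arm_samples.get)
```

After each play, the reward would update both the arm-level posterior of the selected arm and the class-level (or neighbor-level) posteriors that aggregate observations from similar arms.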

6 Numerical Examples

In this section, we illustrate the advantages of our policies through numerical examples on both synthesized data and a real dataset from recommendation systems. All experiments are averaged over 100 Monte Carlo runs in MATLAB R2014b.

6.1 Reduction of the action space

6.1.1 Complete Side Information

We use two experiments to show how much the action space can be reduced by exploiting complete side information. In the first experiment, we fix a set of arms with mean rewards uniformly drawn from the unit interval and let $\epsilon$ vary. For every $\epsilon$, we obtain a UIG, apply the offline elimination step of LSDT-CSI, and compare the size of the candidate set with the total number of arms. In the second experiment, we fix $\epsilon$ and let the number of arms $N$ increase. We generate arms and UIGs in the same way as in the first experiment and show how the relative size of the candidate set varies as $N$ increases. The results are shown in Figs. 3(a) and 3(b).

As we can see from Fig. 3(a), when $\epsilon$ is small, the graph is disconnected. As $\epsilon$ increases, the number of connected components decreases and thus the size of the candidate set decreases. Once the graph is connected, the candidate set contains only two equivalence classes and is thus much smaller than the full arm set. When $\epsilon$ is large, the probability that the graph is complete increases as $\epsilon$ increases; in this case, the candidate set contains all the arms, so its size grows back to $N$. In Fig. 3(b), we notice that the candidate set has a diminishing cardinality compared with $N$. Since the mean rewards are uniformly drawn from the unit interval, the set of arms becomes denser on the interval as $N$ grows. It can be inferred from [35] that the maximum distance between two consecutive points uniformly drawn from the unit interval is of order $\log N/N$ with high probability. If we choose $\epsilon$ proportional to $\log N/N$, the UIG will be connected with high probability. Moreover, it can be shown that the cardinality of each boundary equivalence class is at most the number of nodes within distance $\epsilon$ of the corresponding anchor. Therefore, the cardinality of the candidate set in this setting is of order $\log N$.
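The connectivity argument can be checked empirically. The following sketch (illustrative, not the paper's experiment) verifies that with $\epsilon$ a modest multiple of $\log N/N$, the maximum gap between consecutive uniform points stays below $\epsilon$, so the resulting UIG is connected.

```python
def max_gap(points):
    """Largest distance between consecutive points after sorting; the UIG
    with parameter eps on these points is connected iff max_gap < eps."""
    pts = sorted(points)
    return max(b - a for a, b in zip(pts, pts[1:]))

# With eps = 4*log(n)/n, the event {max_gap >= eps} has probability roughly
# n^(-3), so a single draw of n uniform points is almost surely connected.
```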

6.1.2 Partial Side Information

We use two other experiments to show the reduction of the action space with partial side information. In the first experiment, we fix a set of arms with mean rewards uniformly drawn from the unit interval, choose $\epsilon$, and obtain the corresponding UIG. We let the observation probability vary and, for every value, reveal the presence and the absence of each edge independently with that probability. We apply the offline elimination step of LSDT-PSI and compare the size of the output set $B_0$ with the size of the candidate set. Note that when the observation probability equals one, the UIG is fully revealed and we use the offline elimination step of LSDT-CSI to obtain the candidate set. In the second experiment, we fix the observation probability and let the number of arms $N$ increase. We generate arms and side information graphs in the same way as in the first experiment and show how the relative size of $B_0$ varies as $N$ increases. The results of the two experiments are shown in Figs. 4(a) and 4(b).

It can be seen from Fig. 4(a) that as the observation probability increases, $\lvert B_0\rvert$ decreases toward the size of the candidate set. Moreover, even before the UIG is fully revealed, the offline elimination step of LSDT-PSI performs as well as that of LSDT-CSI, which is optimal, i.e., only arms in the candidate set remain. In Fig. 4(b), we see that the relative size of $B_0$ decreases as $N$ increases, which indicates a diminishing cardinality of the reduced action space in terms of $N$.

6.2 Regret on Randomly Generated Graphs

6.2.1 Complete Side Information

We compare LSDT-CSI with existing algorithms on a set of randomly generated arms. We obtain a UIG on a set of nodes with mean rewards uniformly drawn from the unit interval and a fixed $\epsilon$. Every time an arm is played, a random reward is drawn independently from a Gaussian distribution with the corresponding mean and fixed variance. We let the time horizon $T$ vary and compare the regret of LSDT-CSI with five baseline algorithms:

1. UCB1: classic UCB policy proposed in [2] assuming no relation among arms.

2. TS: classic Thompson Sampling algorithm proposed in [27] assuming Beta prior and Bernoulli likelihood on the reward model.

3. CKL-UCB: proposed in [20] for Lipschitz bandits, exploiting only similarity relations.

4. OSUB: proposed in [22] for unimodal bandits. Note that if the UIG is connected, it satisfies the unimodal structure: for every sub-optimal arm, there exists a path from it to the optimal arm along which the mean rewards are non-decreasing.

5. OSSB: proposed in [23] for general structured bandits. At each time, OSSB estimates the minimum number of times every arm has to be played by solving an LP.

The results shown in Fig. 5(a) indicate that LSDT-CSI outperforms the baseline algorithms. In particular, early in the horizon, LSDT-CSI has already started to exploit the optimal arm while the other algorithms are still exploring. We also compare LSDT-CSI with an intuitive algorithm that applies UCB1 directly to the candidate set in Fig. 5(b). With the same setup, we see a performance gain due to the online step.

We also evaluate the time complexity of the learning policies. Due to the page limit, we summarize the running times of LSDT-CSI and the other baseline algorithms in Table I in Appendix A.2. It is shown that LSDT-CSI has a relatively low computation cost in contrast to algorithms with comparable performance, i.e., CKL-UCB and OSSB.

6.2.2 Partial Side Information

We compare LSDT-PSI with existing algorithms. We obtain a UIG on a set of arms with mean rewards uniformly drawn from the unit interval and a fixed $\epsilon$, and generate the partially observed UIG according to Assumption 5. The random rewards for every arm are independently generated from a Bernoulli distribution with the corresponding mean.

Given that finding the candidate set under partial side information is NP-complete, the OSSB policy is not applicable since its LP is unspecified. Besides, OSUB is also inapplicable since the partially revealed graph is not unimodal in general. Therefore, we only compare LSDT-PSI with three baseline algorithms: UCB1, TS, and CKL-UCB. In LSDT-PSI, we choose a small value for the input parameter. Note that this choice does not affect the theoretical upper bound on regret; in practice, however, a smaller value avoids excessive plays of suboptimal arms. The simulation results shown in Fig. 6(a) indicate that LSDT-PSI outperforms the baseline algorithms. Besides, similar to the case of complete side information, we compare LSDT-PSI with a heuristic algorithm applying UCB1 to the reduced arm set $B_0$, and a similar performance gain is observed in Fig. 6(b). Moreover, the computational efficiency of LSDT-PSI is also verified in Table II in Appendix A.2.

6.3 Online Recommendation Systems

In this subsection, we apply LSDT-PSI to a problem in online recommendation systems. We test our policy on a dataset from Jester, an online joke recommendation and rating system [36], consisting of 100 jokes and 25K users, in which every joke has been rated by at least 34% of the users. (The dataset is available at http://eigentaste.berkeley.edu/dataset/.) Ratings are real values between $-10$ and $10$. In the experiment, we recommend a joke (modeled as an arm) to a new user at each time and observe the rating, which corresponds to playing an arm and receiving the reward. Note that although different users have different preferences towards items, every item exhibits a certain internal quality represented by its mean reward, i.e., the average rating from all users; the variations of ratings across users correspond to the randomness of rewards. Since our algorithms work for any sub-Gaussian reward distribution and any distribution with bounded support is sub-Gaussian, Jester is a suitable dataset for evaluating their performance. In accordance with the assumptions of the policy, all ratings are normalized to the unit interval.

To test our policy using side information, we partition the dataset into a training set (a small fraction of the users) and a test set (20K users). We obtain the partially revealed UIG from the training set as follows: we estimate the distance between two jokes by the difference between their average ratings from users in the training set who have rated both jokes. We further define a confidence parameter $\delta$. If the estimated distance between two jokes is larger than $\epsilon+\delta$, we add the pair to the dissimilarity set; if it is smaller than $\epsilon-\delta$, we add the pair to the similarity set. Hence, there exist pairs of arms whose relations remain unknown. We use a larger confidence parameter for the smaller training set and a smaller one for the larger training set: as the size of the training set increases, the estimation of distances between jokes becomes more accurate, the confidence parameter can be smaller, and consequently the number of joke pairs with known relations increases. For the hyper-parameter $\epsilon$, we use an iterative approach to find the value that minimizes the size of $B_0$, i.e., the set of arms that need to be explored. Intuitively, as $\epsilon$ increases, $\lvert B_0\rvert$ first decreases, since the side information graph becomes more connected and more similarity relations can be observed, so the probability of eliminating sub-optimal arms in the offline step becomes higher. When $\epsilon$ is large, the graph approaches a complete graph and fewer dissimilarity relations are observed; consequently, the probability of eliminating sub-optimal arms decreases and $\lvert B_0\rvert$ increases. A similar tendency can be observed in the overall regret of the learning policy. Based on this, the iterative approach starts from a small $\epsilon$ and evaluates $\lvert B_0\rvert$. It keeps doubling $\epsilon$ until $\lvert B_0\rvert$ stops decreasing. Then a binary search (with a fixed resolution, i.e., a minimum increment of $\epsilon$) is applied between the last two values to find the $\epsilon$ that achieves the minimum $\lvert B_0\rvert$.
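The graph-construction procedure above can be sketched as follows; the threshold rule with the confidence parameter and all names are our reading of the description and may differ from the authors' implementation.

```python
import numpy as np

def build_partial_uig(ratings, eps, delta):
    """Construct similarity/dissimilarity edge sets from a training matrix
    of ratings (rows: users, cols: items; np.nan marks unrated items).
    Pairs whose estimated distance falls within delta of eps, or that share
    no common raters, are left with an unknown relation."""
    n = ratings.shape[1]
    sim, dissim = [], []
    for i in range(n):
        for j in range(i + 1, n):
            both = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])
            if not both.any():
                continue  # relation unknown: no common raters
            dist = abs(ratings[both, i].mean() - ratings[both, j].mean())
            if dist > eps + delta:
                dissim.append((i, j))
            elif dist < eps - delta:
                sim.append((i, j))
            # otherwise the relation stays unknown
    return sim, dissim
```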

We use an unbiased offline evaluation method introduced in [26] and [37] to evaluate LSDT-PSI, UCB1, TS, CKL-UCB, and UCB1 on the reduced arm set $B_0$, on the test set. Fig. 8 shows the average rating per user with confidence intervals (scaled back to the original rating range) for every policy. Note that CKL-UCB needs to estimate the KL-divergence between two distributions. Since the distribution type in the real dataset is unknown, we can only approximate the KL-divergence where