
Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints

David Simchi-Levi

Institute for Data, Systems and Society, Department of Civil and Environmental Engineering, and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu

Yunzong Xu

Institute for Data, Systems and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, yxu@mit.edu

We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. We prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.

 

The multi-armed bandit (MAB) problem is one of the most fundamental problems in online learning, with diverse applications ranging from pricing and online advertising to clinical trials. Over the past several decades, it has been a very active research area spanning different disciplines, including computer science, operations research, statistics and economics.

In a traditional multi-armed bandit problem, the learner (i.e., decision-maker) is allowed to switch freely between actions, and an effective learning policy may incur frequent switching — indeed, the learner’s task is to balance the exploration-exploitation trade-off, and both exploration (i.e., acquiring new information) and exploitation (i.e., optimizing decisions based on up-to-date information) require switching. However, in many real-world scenarios, it is costly to switch between different alternatives, and a learning policy with limited switching behavior is preferred. The learner thus has to consider the cost of switching in her learning task.

In this paper, we introduce the Bandits with Switching Constraints (BwSC) problem. We note that most previous research in multi-armed bandits has modeled the switching cost as a penalty in the learner’s objective, and hence the learner’s switching behavior is a complete output of the learning algorithm. However, in many real-world applications, there are strict limits on the learner’s switching behavior, which should be modeled as a hard constraint, and hence the learner’s allowable level of switching is an input to the algorithm. In addition, while most prior research assumes specific structures on switching costs (e.g., unit or homogeneous costs), in reality, switching between different pairs of actions may incur heterogeneous costs that do not follow any parametric form. These gaps motivate us to propose the BwSC framework, which includes a hard constraint acting on the total switching cost.

In addition to its strong modeling power and practical significance, the BwSC problem is theoretically important, as it is a natural framework to study the fundamental trade-off between the regret and the maximum incurred switching cost of any policy in the classical multi-armed bandit problem. In particular, it enables characterizing important switching patterns associated with any effective exploration-exploitation policy. Thus, the study of the BwSC problem leads to a series of new results for the classical multi-armed bandit problem.

The BwSC framework has numerous applications, including dynamic pricing, online assortment optimization, online advertising, clinical trials, labor markets and vehicle routing. We describe a representative motivating example below.

Dynamic pricing with demand learning. Dynamic pricing with demand learning has proven its effectiveness in online retailing. However, it is well known that in practice, sellers often face business constraints that prevent them from conducting extensive price experimentation and making frequent price changes, see Cheung et al. (2017) and Chen and Chao (2019). The seller’s sequential decision-making problem can be modeled as a BwSC problem, where changing from each price to another price incurs some cost, and there is a limit on the total cost incurred by price changes. Here, a high switching cost between two prices implies that the corresponding price change is highly undesirable, while a low switching cost implies that the corresponding price change is generally acceptable.

In Section id1, we propose the BwSC framework. In Section id1, we review related literature. In Section id1, we discuss the unit-switching-cost model. In Section id1, we discuss the general-switching-cost model. Finally, in Section id1, we conclude.

For all n_{1},n_{2}\in\mathbb{N} such that n_{1}\leq n_{2}, we use [n_{1}] to denote the set \{1,\dots,n_{1}\}, and use [n_{1}:n_{2}] (resp. (n_{1}:n_{2}]) to denote the set \{n_{1},n_{1}+1,\dots,n_{2}\} (resp. \{n_{1}+1,\dots,n_{2}\}). For all x\geq 0, we use \lfloor x\rfloor to denote the largest integer less than or equal to x. For ease of presentation, we define \lfloor x\rfloor=0 for all x<0. Throughout the paper, we use big O,\Omega,\Theta notations to hide constant factors, and use \tilde{O},\tilde{\Omega},\tilde{\Theta} notations to hide constant factors and logarithmic factors.

Consider a k-armed bandit problem where a learner chooses actions from a fixed set [k]=\{1,\dots,k\}. There is a total of T rounds. In each round t\in[T], the learner first chooses an action i_{t}\in[k], then observes a reward r_{t}(i_{t})\in\mathbb{R}. For each action i\in[k], the reward of action i is i.i.d. drawn from an (unknown) distribution \mathcal{D}_{i} with (unknown) expected value \mu_{i}. We assume that the distributions \mathcal{D}_{i} are standardized sub-Gaussian. (This is a standard assumption in the stochastic bandit literature; the class of sub-Gaussian distributions is sufficiently wide, as it contains Gaussian, Bernoulli and all bounded distributions.) Without loss of generality, we assume \sup_{i,j\in[k]}|\mu_{i}-\mu_{j}|\in[0,1].

In our problem, the learner incurs a switching cost c_{i,j}=c_{j,i}\geq 0 each time she switches between action i and action j (i,j\in[k]); we allow c_{i,j}=\infty, which means that switching between i and j is prohibited. In particular, c_{i,i}=0 for all i\in[k]. There is a pre-specified switching budget S\geq 0 representing the maximum amount of switching costs that the learner can incur in total. Once the total switching cost exceeds the switching budget S, the learner cannot switch her actions any more. The learner’s goal is to maximize the expected total reward over T rounds.

Let \pi denote the learner’s (non-anticipating) learning policy, and \pi_{t}\in[k] denote the action chosen by policy \pi at round t\in[T]. More formally, \pi_{t} establishes a probability kernel acting from the space of historical actions and observations to the space of actions at round t. Let \mathbb{P}^{\pi}_{\mathcal{D}} and \mathbb{E}^{\pi}_{\mathcal{D}} be the probability measure and expectation induced by policy \pi and latent distributions \mathcal{D}=(\mathcal{D}_{1},\dots,\mathcal{D}_{k}). According to Section id1, we only need to restrict our attention to the S-switching-budget policies, which take S, k and T as input and are defined as below. (Note that we do not make any assumption on the learner’s behavior here; in particular, we do not require the learner to intentionally pick an S-switching-budget policy, since the switching constraint automatically makes the learner’s policy equivalent to an S-switching-budget policy.)

Definition 1

A policy \pi is said to be an S-switching-budget policy if for all \mathcal{D},

\mathbb{P}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T-1}c_{\pi_{t},\pi_{t+1}}\leq S\right]=1.

Let \Pi_{S} denote the set of all S-switching-budget policies, which is also the admissible policy class of the BwSC problem.
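
To make Definition 1 concrete, the constraint can be checked mechanically on any realized action path. The following is a minimal Python sketch; the function name and the matrix representation of (c_{i,j}) are our own illustration, not part of the model.

import numpy as np

def within_switching_budget(actions, costs, S):
    # actions: realized action path (pi_1, ..., pi_T), with 0-indexed arms
    # costs:   k x k symmetric matrix with costs[i][j] = c_{i,j} and zero diagonal
    # S:       switching budget
    total = sum(costs[actions[t]][actions[t + 1]] for t in range(len(actions) - 1))
    return total <= S

# Unit switching costs with k = 3 arms and budget S = 2.
k, S = 3, 2
costs = np.ones((k, k)) - np.eye(k)
print(within_switching_budget([0, 0, 1, 1, 2, 2, 2], costs, S))  # True: 2 switches
print(within_switching_budget([0, 1, 0, 1], costs, S))           # False: 3 switches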

The performance of a learning policy is measured against a clairvoyant policy that maximizes the expected total reward given foreknowledge of the environment (i.e., latent distributions) \mathcal{D}. Let i^{*}=\arg\max_{i\in[k]}\mu_{i} and \mu^{*}=\max_{i\in[k]}\mu_{i}. If a clairvoyant knows \mathcal{D} in advance, then she would choose the “optimal” action i^{*} for every round and her expected total reward would be T\mu^{*}. We define the regret of policy \pi as the worst-case difference between the expected performance of the optimal clairvoyant policy and the expected performance of policy \pi:

R^{\pi}(T)=\sup_{\mathcal{D}}\left\{T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right]\right\}.

The minimax (optimal) regret of BwSC is defined as R_{S}^{*}(T)=\inf_{\pi\in\Pi_{S}}R^{\pi}(T).
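
The minimax regret involves a supremum over all environments \mathcal{D} and cannot be computed by simulation; what can be estimated is the inner quantity T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}[\sum_{t}\mu_{\pi_{t}}] for a fixed \mathcal{D}. The sketch below is our own illustration (the helper name and Gaussian reward model are assumptions for the example), not part of the paper.

import numpy as np

def estimate_regret(policy, mu, T, n_runs=200, seed=0):
    # Monte Carlo estimate of T*mu_star - E[sum_t mu_{pi_t}] for a fixed Gaussian
    # environment with means `mu` (unit variance); `policy(t, history)` returns an arm.
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    mu_star = mu.max()
    gaps = []
    for _ in range(n_runs):
        history, pseudo_reward = [], 0.0
        for t in range(T):
            arm = policy(t, history)
            history.append((arm, rng.normal(mu[arm], 1.0)))   # (action, observed reward)
            pseudo_reward += mu[arm]                          # sum of true means of chosen arms
        gaps.append(T * mu_star - pseudo_reward)
    return float(np.mean(gaps))

# Example: a policy that always plays arm 0 in a two-armed instance with gap 0.2.
print(estimate_regret(lambda t, history: 0, mu=[0.4, 0.6], T=100))  # approximately 20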

In our paper, when we say a policy is “near-optimal” or “optimal up to logarithmic factors”, we mean that its regret bound is optimal in T up to logarithmic factors of T, irrespective of whether the bound is optimal in k, since typically k is much smaller than T (e.g., k=O(1)). Still, our derived bounds are actually quite tight in k.

Remark. There are two notions of regret in the stochastic bandit literature. The R^{\pi}(T) regret that we consider is called distribution-free, as it does not depend on \mathcal{D}. On the other hand, one can also define the distribution-dependent regret R_{\mathcal{D}}^{\pi}(T)=T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right] that depends on \mathcal{D}. This second notion of regret is only meaningful when \mu_{1},\dots,\mu_{k} are well-separated. Unlike the classical MAB problem where there are policies that simultaneously achieve near-optimal bounds under both regret notions, in the BwSC problem, due to the limited switching budget, finding a policy that simultaneously achieves near-optimal bounds under both regret notions is usually impossible. Thus in the main body of the paper, we focus on the distribution-free regret. However, in Appendix A, we extend our results to the distribution-dependent regret.

As Section id1 and Section id1 show, BwSC and MAB share the same definition of R^{\pi}(T), and the only difference between BwSC and MAB is the existence of a switching constraint \pi\in\Pi_{S}, determined by (c_{i,j})\in\overline{\mathbb{R}}_{\geq 0}^{k\times k} and S\in\overline{\mathbb{R}}_{\geq 0} (when S=\infty, BwSC degenerates to MAB). This makes BwSC a natural framework to study the trade-off between regret and incurred switching cost in MAB. That is, the trade-off between the optimal regret R_{S}^{*}(T) and the switching budget S in BwSC completely characterizes the trade-off between a policy’s best achievable regret and its worst possible incurred switching cost in MAB. We are interested in how R_{S}^{*}(T) behaves over a range of switching budgets S, and how it is affected by the structure of the switching costs (c_{i,j}).

The stochastic MAB problem has been extensively studied for more than fifty years. Seminal results include the \Theta(\sqrt{T}) distribution-free regret bound of Vogel (1960) and the \Theta(\log T) distribution-dependent regret bound of Lai and Robbins (1985). We refer the reader to the excellent surveys by Lattimore and Szepesvári (2018) and Slivkins (2019) for further background on this topic.

There is a rich literature on stochastic MAB with switching costs. Most of the papers model the switching cost as a penalty in the learner’s objective, i.e., they measure a policy’s regret and incurred switching cost using the same metric and the objective is to minimize the sum of these two terms (e.g., Agrawal et al. 1988, 1990, Brezzi and Lai 2002, Cesa-Bianchi et al. 2013; there are other variations with discounted rewards, see Banks and Sundaram 1994, Asawa and Teneketzis 1996, Bergemann and Välimäki 2001, and Jun 2004 for a survey). Though this conventional “switching penalty” model has attracted significant research interest in the past, it has two limitations. First, under this model, the learner’s total switching cost is an output determined by the algorithm. However, in many real-world applications, there are strict limits on the learner’s total switching cost, which should be modeled as a hard constraint, and hence the learner’s switching budget should be an input that helps determine the algorithm. In particular, while the algorithm in Cesa-Bianchi et al. (2013), developed for the “switching penalty” model, can achieve \tilde{O}(\sqrt{T}) (near-optimal) regret with O(\log\log T) switches, if the learner wants a policy that always incurs a finite switching cost independent of T, then prior literature does not provide an answer. Second, the “switching penalty” model has a fundamental weakness in studying the trade-off between regret and incurred switching cost in stochastic MAB: since the O(\log\log T) bound on the incurred switching cost of a policy is negligible compared with the \tilde{O}(\sqrt{T}) bound on its optimal regret, when adding the two terms up, the term associated with the incurred switching cost is always dominated by the regret, and thus no trade-off can be identified. As a result, to the best of our knowledge, prior literature has not characterized the fundamental trade-off between regret and incurred switching cost in stochastic MAB.

The BwSC framework addresses the issues associated with the “switching penalty” model in several ways. First, it introduces a hard constraint on the total switching cost, enabling us to design good policies that guarantee limited switching cost. While O(\log\log T) switches have proven to be sufficient for a learning policy to achieve near-optimal regret in MAB, in BwSC, we are mostly interested in the setting of a finite or o(\log\log T) switching budget, which is highly relevant in practice. Second, by focusing on rewards in the objective function and incurred switching cost in the switching constraint, the BwSC framework enables the characterization of the fundamental trade-off between regret and maximum incurred switching cost in MAB. Third, while most prior research assumes specific structures on switching costs (e.g., unit or homogeneous costs), BwSC allows general switching costs, which makes it a powerful modeling framework.

This paper is not the first one to study online learning problems with limited switches. Indeed, a few authors have realized the practical significance of limited switching budget. For example, Cheung et al. (2017) consider a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy is allowed to make at most m price changes. Their constraint on the total number of price changes is motivated by collaboration with Groupon, a major e-commerce marketplace in North America. In such an environment, Groupon limits the number of price changes, either because of implementation constraints, or for fear of confusing customers and receiving negative customer feedback. They propose a pricing policy that guarantees O(\log^{(m)}T) (or m iterations of the logarithm) regret with at most m price changes, and report that in a field experiment, this pricing policy with a single price change increases revenue and market share significantly. Chen and Chao (2019) study a multi-period stochastic inventory replenishment and pricing problem with unknown demand and limited price changes. Assuming that the demand function is drawn from a parametric class of functions, they develop a finite-price-change policy based on maximum likelihood estimation that achieves optimal regret.

We note that both Cheung et al. (2017) and Chen and Chao (2019) only focus on specific decision-making problems, and their results rely on strong assumptions about the unknown environment. Cheung et al. (2017) assume a known finite set of potential demand functions, and require the existence of discriminative prices that can efficiently differentiate all potential demand functions. Chen and Chao (2019) assume a known parametric form of the demand function, and also require a separation condition. By contrast, the BwSC model in our paper is generic and assumes no prior knowledge of the environment. The learning task in the BwSC problem is thus more challenging than in previous models. Also, the switching constraint in the BwSC problem is more general than the price-change constraints in previous models.

In the Bayesian bandit setting, Guha and Munagala (2013) study the “bandits with metric switching costs” problem that allows a constraint involving metric switching costs. Using competitive ratio as the performance metric and assuming Bayesian priors, they develop a 4-approximation algorithm for the problem. The competitive ratio is measured against an optimal online policy that does not know the true distributions. As pointed out by the authors, the optimal online policy can be directly determined by a dynamic program. So the main challenge in their model is a computational one. Our work is different, as we are using regret as our performance metric, and we are competing with an optimal clairvoyant policy that knows the true distributions — a much stronger benchmark. Our problem thus involves both statistical and computational challenges. In fact, the algorithm in Guha and Munagala (2013) cannot avoid a linear regret when applied to the BwSC problem.

In the adversarial bandit setting, Altschuler and Talwar (2018) study the adversarial MAB problem with a limited number of switches, which can be viewed as an adversarial counterpart of the unit-switching-cost BwSC problem. For any policy that makes no more than S\leq T switches, they prove that the optimal regret is \tilde{\Theta}(T\sqrt{k}/\sqrt{S}). Since we consider a different setting (our problem is stochastic while theirs is adversarial), the results and methodologies in our paper are fundamentally different from theirs. In particular, while finite-switch policies cannot avoid linear regret in the adversarial setting, in the stochastic setting, finitely many switches are already able to guarantee sublinear regret. Moreover, while the optimal regret in Altschuler and Talwar (2018) decreases smoothly as S increases from 0 to T, in the stochastic setting, we identify surprising behavior of the optimal regret as S increases from 0 to \Theta(\log\log T), which, to the best of our knowledge, has not been identified in the bandit literature before.

The BwSC problem is also related to the batched bandit problem proposed by Perchet et al. (2016). The M-batched bandit problem is defined as follows: given a classical bandit problem, the learner must split her learning process into M batches and is only able to observe data (i.e., realized rewards) from a given batch after the entire batch is completed. This implies that all actions within a batch are determined at the beginning of that batch. Here M can be viewed as a quantity measuring the learner’s adaptivity, i.e., her ability to learn from her data and adapt to the environment. An M-batch policy is defined as a policy that observes realized data only M-1 times over the entire horizon. Perchet et al. (2016) study the problem in the case of two arms, and prove that the optimal regret for the M-batched bandit problem is \tilde{\Theta}(T^{1/(2-2^{1-M})}). Very recently, Gao et al. (2019) extend these results to general k arms.

On the surface, the batched bandit problem and the BwSC problem seem like two different problems: the batched bandit problem limits observation and allows unlimited switching, while the BwSC problem limits switching and allows unlimited observation. Surprisingly, in this paper, we discover some non-trivial connections between the batched bandit problem and the unit-switching-cost BwSC problem. The connections will be further discussed in Section id1.

In this section, we consider the BwSC problem with unit switching costs, where c_{i,j}=1 for all i\neq j. In this case, since every switch incurs a unit cost, the switching budget S can be interpreted as the maximum number of switches that the learner can make in total. Without loss of generality, in this section we assume that S is a non-negative integer, and refer to an S-switching-budget policy as an S-switch policy. Note that the unit-switching-cost BwSC problem can be simply interpreted as “MAB with limited number of switches”.

The section is organized as follows. In Section id1, we propose a simple and intuitive policy that provides an upper bound on regret. In Section id1, we give a matching lower bound, indicating that our policy is rate-optimal up to logarithmic factors. In Section id1, we discuss several surprising findings in BwSC, which we call “phase transitions” and “cyclic phenomena” of the optimal regret, and further quantify the trade-off between regret and the maximum number of switches in MAB, contributing new insights to this classical problem. In Section id1, we discuss a surprising relationship between limited switches and limited adaptivity in bandit problems.

We first propose a simple and intuitive policy that provides an upper bound on regret. Our policy, called the S-Switch Successive Elimination (SS-SE) policy, is described in Algorithm 1. The design philosophy behind the SS-SE policy is to divide the entire horizon into several pre-determined intervals (i.e., batches) and to control the number of switches in each interval. The policy thus has some similarities with the 2-armed batched policy of Perchet et al. (2016) and the k-armed batched policy of Gao et al. (2019), which have been shown to be near-optimal in the batched bandit problem. However, since we are studying a different problem, directly applying a batched policy to the BwSC problem does not work. In particular, in the batched bandit problem, the number of intervals (i.e., batches) is a given constraint, while in the BwSC problem, the switching budget is the given constraint. We thus add two key ingredients to the SS-SE policy: (1) an index m(S) suggesting how many intervals should be used to partition the entire horizon; (2) a switching rule ensuring that the total number of switches across all k actions cannot exceed the switching budget S. These two ingredients make the SS-SE policy substantially different from an ordinary batched policy. The two ingredients are simple yet powerful — they enable the transformation of any limited-batch policy into a corresponding limited-switch policy.

Algorithm 1 S-Switch Successive Elimination (SS-SE)

Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
    Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}], where the endpoints are defined by t_{0}=1 and

t_{i}=\left\lfloor k^{1-\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}\,T^{\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}\right\rfloor,\quad\forall i=1,\dots,m(S)+1.

Initialization: Let the set of all active actions in the l-th interval be A_{l}. Set A_{1}=[k].
Policy:

1:  for l=1,\dots,m(S) do
2:     if a_{t_{l-1}}\in A_{l} then
3:        Let a_{t_{l-1}+1}=a_{t_{l-1}}. Starting from this action, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}. (We overlook rounding issues in lines 3 and 5 of Algorithm 1; they are easy to fix in the regret analysis.)
4:     else if a_{t_{l-1}}\notin A_{l} then
5:        Starting from an arbitrary active action in A_{l}, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
6:     end if
7:     Statistical test: deactivate all actions i s.t. \exists action j with \mathtt{UCB}_{t_{l}}(i)<\mathtt{LCB}_{t_{l}}(j), where
\mathtt{UCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}},
\mathtt{LCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}.
8:  end for
9:  In the last interval, choose the action with the highest empirical mean (up to round t_{m(S)}).
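
A minimal Python simulation sketch of Algorithm 1 may help make the interval partition and the switching rule concrete. The sketch below is our own illustration and not the authors' implementation: the reward interface pull(arm), the variable names, and the crude handling of rounding (permitted by the remark on rounding in line 3) are assumptions, and it presumes k >= 2 and T much larger than k.

import numpy as np

def ss_se(k, S, T, pull):
    # SS-SE sketch: pull(arm) returns a stochastic reward for the chosen arm.
    m = max((S - 1) // (k - 1), 0)                                   # m(S), assumes k >= 2
    expo = [(2 - 2.0 ** -(i - 1)) / (2 - 2.0 ** -m) for i in range(1, m + 2)]
    grid = [1] + [int(np.floor(k ** (1 - e) * T ** e)) for e in expo]
    grid[-1] = T                                                     # force t_{m(S)+1} = T

    sums, counts = np.zeros(k), np.zeros(k)
    active, last_arm, t = list(range(k)), None, 0

    def play(arm, n):
        nonlocal t, last_arm
        for _ in range(max(n, 0)):
            sums[arm] += pull(arm)
            counts[arm] += 1
            t += 1
        last_arm = arm

    for l in range(1, m + 1):
        per_arm = max((grid[l] - t) // len(active), 1)
        if last_arm in active:                     # keep playing the previous arm first
            order = [last_arm] + [a for a in active if a != last_arm]
        else:                                      # otherwise start from any active arm
            order = list(active)
        for arm in order[:-1]:
            play(arm, per_arm)
        play(order[-1], grid[l] - t)               # last active arm absorbs rounding slack
        # Statistical test: deactivate arms whose UCB is below the best LCB.
        mean = sums / np.maximum(counts, 1)
        rad = np.sqrt(2 * np.log(T) / np.maximum(counts, 1))
        best_lcb = max(mean[a] - rad[a] for a in active)
        active = [a for a in active if mean[a] + rad[a] >= best_lcb]

    # Last interval: commit to the empirically best remaining arm.
    best = max(active, key=lambda a: sums[a] / max(counts[a], 1))
    play(best, T - t)
    return sums.sum()

# Example: k = 3 arms, switching budget S = 5 (so m(S) = 2), Gaussian rewards.
rng = np.random.default_rng(0)
mu = [0.2, 0.5, 0.3]
print(ss_se(3, 5, 10_000, lambda a: rng.normal(mu[a], 1.0)))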

Intuition about the Policy. The policy divides the T rounds into \lfloor\frac{S-1}{k-1}\rfloor+1 intervals in advance. The sizes of the intervals are designed to balance the exploration-exploitation trade-off. An active set of “effective” actions A_{l} is maintained for each interval l. The policy has the following key features:

  • Limited switches and no adaptivity within each interval: In interval l, only |A_{l}|-1 switches happen. Within an interval, decisions on switches are determined at the beginning of the interval and do not depend on the rewards observed in this interval — thus there is no adaptivity.

  • Successive elimination between intervals: At the end of interval l (l<m(S)), actions that perform poorly are eliminated from the active set A_{l+1}, and will not be chosen in interval l+1.

  • At most one switch between two consecutive intervals: If the last action chosen in interval l remains in A_{l+1} (l<m(S)), then it will be the first action chosen in interval l+1, and no switch occurs between these two intervals. If the last action chosen in interval l is eliminated from A_{l+1}, then interval l+1 starts from another action in A_{l+1}, and one switch occurs between these two intervals.

  • Exploitation in the last interval: In the last interval, only the empirical best action is chosen.

We show that the SS-SE policy is indeed an S-switch policy and establish the following upper bound on its regret. See Appendix B for a proof.

Theorem 1

Let \pi be the SS-SE policy, then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 1 and T\geq k,

R^{\pi}(T)\leq C(\log k\log T)\,k^{1-\frac{1}{2-2^{-m(S)}}}\,T^{\frac{1}{2-2^{-m(S)}}},

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Theorem 1 provides an upper bound on the optimal regret of the unit-switching-cost BwSC problem:

R^{*}_{S}(T)=\tilde{O}\left(T^{1/\left(2-2^{-\lfloor(S-1)/(k-1)\rfloor}\right)}\right).
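
For a concrete feel for this bound, the exponent 1/(2-2^{-m(S)}) can be tabulated directly. The short sketch below is our own illustration (the helper name is hypothetical); it prints the exponent for a few budgets with k = 3.

def regret_exponent(k, S):
    # Exponent of T in Theorem 1: 1 / (2 - 2^{-m(S)}), with m(S) = floor((S-1)/(k-1)).
    m = max((S - 1) // (k - 1), 0)
    return 1.0 / (2 - 2.0 ** (-m))

# With k = 3 arms: S = 2 gives exponent 1 (linear regret), S = 3 gives 2/3,
# S = 5 gives 4/7, S = 7 gives 8/15, S = 9 gives 16/31.
for S in (2, 3, 5, 7, 9):
    print(S, regret_exponent(3, S))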

The SS-SE policy, though it achieves sublinear regret, seems to have several limitations that could weaken its performance, and on the surface this may suggest that the regret bound is not optimal. Specifically:

  • The SS-SE policy does not make full use of its switching budget. Consider the case of 11 actions and a switching budget of 20. Since m(20)=\lfloor(20-1)/(11-1)\rfloor=1=m(11), the SS-SE policy will just run as if it could only make 11 switches, despite the fact that it has 9 additional switches in its budget (which will never be used). Intuitively, an effective learning policy should make full use of its switching budget. It seems that by tracking and allocating the switching budget in a more careful way, one could achieve lower regret.

  • The SS-SE policy has low adaptivity. Note that the SS-SE policy pre-determines the number, sizes and locations of its intervals before seeing any data, and executes actions within each interval based on a pre-determined schedule. Indeed, the SS-SE policy only learns from data at the end of each interval, at most \lfloor(S-1)/(k-1)\rfloor times — consider the case of 11 actions and a switching budget of 20: the SS-SE policy will split the entire horizon into two intervals and will only learn at the end of the first interval, after which it will choose a single action to be applied throughout the entire second interval. Intuitively, data should be utilized to save switches and reduce regret, and one would expect that an effective policy has a high degree of adaptivity, that is, it should learn from the available data and adapt to the environment more frequently than our policy. Put differently, it seems that by utilizing full adaptivity and learning from data in every round, one could achieve lower regret.

  • Besides the above limitations, the \tilde{O}\left(T^{1/\left(2-2^{-\lfloor(S-1)/(k-1)\rfloor}\right)}\right) bound provided by the SS-SE policy also seems a little clumsy. In particular, the m(S)=\lfloor(S-1)/(k-1)\rfloor term looks like an artificial term (it is intentionally designed to fit the switching rule in SS-SE), and does not look like a natural term that should appear in the true optimal regret R^{*}_{S}(T).

While the above arguments are based on our first instinct and seem very reasonable, surprisingly, all of them prove to be wrong: no S-switch policy can theoretically do better! In fact, we match the upper bound provided by SS-SE by showing an information-theoretic lower bound in Theorem 2. This indicates that the SS-SE policy is rate-optimal up to logarithmic factors, and R^{*}_{S}(T)=\tilde{\Theta}\left(T^{1/\left(2-2^{-\lfloor(S-1)/(k-1)\rfloor}\right)}\right). Note that the tightness in T is achieved per instance, i.e., for every k and every S. That is, our lower bound is substantially stronger than a single lower bound demonstrated for specific k and S.

Theorem 2

There exists an absolute constant C>0 such that for all k\geq 1, S\geq 1, T\geq k and for all policies \pi\in\Pi_{S},

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Proof Idea. Our proof involves a novel “tracking the cover time” argument that (to the best of our knowledge) has not appeared in previous lower-bound proofs in the bandit literature and may be of independent interest. Specifically, we track a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)+1}, some of which may be \infty, that are recursively defined as follows:

  • \tau_{1} is the first time that all the actions in [k] have been chosen in period [1:\tau_{1}],

  • \tau_{2} is the first time that all the actions in [k] have been chosen in period [\tau_{1}:\tau_{2}],

  • Generally, \tau_{i} is the first time that all the actions in [k] have been chosen in period [\tau_{i-1}:\tau_{i}], for i=2,\dots,m(S)+1.

The structure of the series is carefully designed, enabling the realization of any two consecutive stopping times \tau_{i-1},\tau_{i} to convey the important message that there exists a specific (possibly unknown) action that has never been chosen in period [\tau_{i-1}:\tau_{i}-1]. This information in turn helps us to bound the difference of several key probabilities and derive the desired lower bound via information-theoretic arguments. For a complete proof of Theorem 2, see Appendix C.
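
To illustrate the construction (this is our own illustration, not part of the formal proof), these stopping times can be computed from any realized action path; the function below uses 0-indexed rounds and returns inf once the path no longer covers all actions.

def cover_times(actions, k, n_times):
    # tau_1 is the first time all k actions appear in [0:tau_1]; tau_i is the first
    # time all k actions appear in [tau_{i-1}:tau_i] (inclusive at both ends).
    taus, start = [], 0
    for _ in range(n_times):
        seen, tau = set(), float('inf')
        for t in range(start, len(actions)):
            seen.add(actions[t])
            if len(seen) == k:
                tau = t
                break
        taus.append(tau)
        if tau == float('inf'):
            break
        start = tau                       # the next covering period starts at tau_i
    return taus

# Example with k = 3: the path covers all arms twice, then never again.
print(cover_times([0, 1, 1, 2, 0, 2, 1, 0, 0], 3, 3))  # [3, 6, inf]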

Combining Theorem 1 and Theorem 2, we have

Corollary 1

For any fixed k\geq 1, for any S\geq 1,

R^{*}_{S}(T)=\tilde{\Theta}\left(T^{1/\left(2-2^{-\lfloor(S-1)/(k-1)\rfloor}\right)}\right).

Remark. We briefly explain why the upper and lower bounds in Theorem 1 and Theorem 2 match in T. When m(S)\leq\log_{2}\log_{2}(T/k), which is the case we are mostly interested in, (m(S)+1)^{2}=o(\log T), thus the upper and lower bounds match within o((\log T)^{2}). When m(S)>\log_{2}\log_{2}(T/k), the upper bound is O(\sqrt{T}\log T), thus the upper and lower bounds directly match within O(\log T). We also argue that the slightly different terms of k appearing in the upper and lower bounds do not play an important role. In fact, the gap associated with k between the upper and lower bounds is O(\min\{k^{2.5},(T/k)^{m(S)-0.5}\}). Since we are mostly interested in the case of k\ll T (e.g., k=O(1) or k=O(\log T)), the O(k^{2.5}) gap is not important relative to T.

Corollary 1 allows us to characterize the trade-off between the switching budget S and the optimal regret R^{*}_{S}{(T)}. To illustrate this trade-off, Figure 1 and Table 1 depict the behavior of R^{*}_{S}{(T)} as a function of S given a fixed k. Note that as discussed in Section id1, the relationship between R^{*}_{S}{(T)} and S also characterizes the inherent trade-off between regret and maximum number of switches in the classical MAB problem.

Figure 1: An Illustration of the Switching Budget-Regret Trade-off.

We observe several surprising phenomena regarding the trade-off between S and R_{S}^{*}(T) for any given k.

Phase Transitions. (We borrow this terminology from statistical physics; see Domb (2000).) As we have shown, R^{*}_{S}(T)=\tilde{\Theta}\left(T^{1/\left(2-2^{-\lfloor(S-1)/(k-1)\rfloor}\right)}\right). To the best of our knowledge, this is the first time that a floor function naturally arises in the order of T in the optimal regret of an online learning problem. As a direct consequence of this floor function, the optimal regret of BwSC exhibits surprising phase transitions, described below.

Definition 2

(Phases and Critical Points) For a k-armed unit-switching-cost BwSC, we call the interval [(j-1)(k-1)+1,j(k-1)+1) the j-th phase, and call j(k-1)+1 the j-th critical point (j\in\mathbb{Z}_{>0}).

Fact 1

(Phase Transitions) As S increases from 0 to \Theta(\log\log T), S leaves the j-th phase and enters the (j+1)-th phase at the j-th critical point (j\in\mathbb{Z}_{>0}). Each time S arrives at a critical point, R_{S}^{*}(T) drops significantly, and it stays at the same level until S arrives at the next critical point.

Phase transitions are clearly visible in Figure 1. This phenomenon seems counter-intuitive, as it suggests that increasing the switching budget does not help decrease the best achievable regret, as long as the budget does not reach the next critical point.

Note that phase transitions are only exhibited when S is in the range of 0 to \Theta(\log\log T). After S exceeds \Theta(\log\log T), R_{S}^{*}(T) will remain unchanged at the level of \tilde{\Theta}(\sqrt{T}) — the optimal regret will only vary within logarithmic factors and there is no significant regret drop any more. Therefore, one can also view \Theta(\log\log T) as a “final critical point” after which phase transitions disappear. This additional “final phase transition” reveals a subtle and intriguing nature of phase transitions in BwSC.

Table 1: Regret as a Function of Switching Budget
S [0,k) [k,2k-1) [2k-1,3k-2) [3k-2,4k-3) [4k-3,5k-4) [5k-4,6k-5)
R_{S}^{*}(T) \tilde{\Theta}(T) \tilde{\Theta}(T^{2/3}) \tilde{\Theta}(T^{4/7}) \tilde{\Theta}(T^{8/15}) \tilde{\Theta}(T^{16/31}) \tilde{\Theta}(T^{32/63})
R_{S}^{*}(T)/R_{\infty}^{*}(T) \tilde{\Theta}(T^{1/2}) \tilde{\Theta}(T^{1/6}) \tilde{\Theta}(T^{1/14}) \tilde{\Theta}(T^{1/30}) \tilde{\Theta}(T^{1/62}) \tilde{\Theta}(T^{1/126})
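
The entries of Table 1 follow mechanically from Corollary 1: for S in the j-th phase, m(S)=j-1, so the regret exponent is 1/(2-2^{-(j-1)}). The short sketch below is our own illustration reproducing the exponents and their ratios to the \sqrt{T} benchmark.

from fractions import Fraction

# For S in the j-th phase, m(S) = j - 1 and R_S^*(T) = Theta~(T^{1/(2 - 2^{-(j-1)})}).
for j in range(1, 7):
    exponent = Fraction(1) / (2 - Fraction(1, 2 ** (j - 1)))
    ratio = exponent - Fraction(1, 2)     # exponent of R_S^*(T) / R_inf^*(T)
    print(f"phase {j}: regret exponent {exponent}, ratio exponent {ratio}")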

Cyclic Phenomena. Along with phase transitions, we also observe the following phenomena.

Fact 2

(Cyclic Phenomena) The length of each phase is always equal to k-1, independent of S and T. We call this quantity k-1 the budget cycle.

The cyclic phenomena indicate that, if the learner’s switching budget is at a critical point, then the extra switching budget that the learner needs to achieve the next regret drop (i.e., to arrive at the next critical point) is always k-1. The cyclic phenomena also seem counter-intuitive: when the learner has more switching budget, she can conduct more experiments and statistical tests, eliminate more poorly performing actions (which can be thought of as reducing k) and allocate her switching budget in a more flexible way — all of these suggest that the budget cycle should be a quantity decreasing with S. However, the cyclic phenomena tell us that the budget cycle is always a constant, and no learning policy in the unit-switching-cost BwSC (and in MAB) can escape this cycle, no matter how large S is, as long as S=o(\log\log T).

On the other hand, as S contains more and more budget cycles, the gap between R_{S}^{*}(T) and R_{\infty}^{*}(T)=\tilde{\Theta}(\sqrt{T}) does decrease dramatically. In fact, R_{S}^{*}(T) decreases doubly exponentially fast as S contains more budget cycles. For example, when S contains more than 2 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{4/7}); and when S contains more than 3 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{8/15}). From both Figure 1 and Table 1, we can verify that 3 or 4 budget cycles are already enough for an S-switch policy to achieve close-to-optimal regret in MAB (compared with the optimal policy with unlimited switching budget).

To sum up, the above analysis generates both “positive” and “negative” insights for decision-makers that face BwSC-type problems. On the one hand, the unavoidable phase transitions and cyclic phenomena reflect fundamental limits brought about by the switching constraint, making it hopeless for decision-makers to reduce regret within each phase. On the other hand, once the decision-makers have enough switching budget to bring them to a new phase, they enjoy a substantial regret drop. In particular, 3 or 4 budget cycles are already enough to guarantee extraordinary regret performance.

The lower bound in Theorem 2 also leads to new results for the classical MAB problem.

Corollary 2

(The switching complexity of MAB) For the k-armed bandit problem, N(k-1)+1 switches are necessary and sufficient for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret for any fixed N\in\mathbb{Z}_{>0}, and \Theta(\log\log T) switches are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret.

Note that the number of switches stated in Corollary 2 refers to the maximum number of switches that a policy can make. While Cesa-Bianchi et al. (2013) and Perchet et al. (2016) have proposed policies that achieve \tilde{O}(\sqrt{T}) regret with O(\log\log T) switches, no prior work has answered the question of how many switches are necessary for a near-optimal learning policy in MAB. To the best of our knowledge, we are the first to show an \Omega(\log\log T) lower bound on the number of switches.

Based on our “tracking the cover time” argument, we can prove further results regarding the number of re-switches of each arm (including the worst arm in hindsight) that are necessary for an effective learning policy.

Definition 3

The number of re-switches of action i\in[k] is the total number of times that the learner switches to i from another action j\neq i. For the action chosen in round 1 (where there is no preceding action), the round-1 choice also counts as a re-switch.

Proposition 1

For the k-armed bandit problem, \lceil N/2\rceil re-switches of each action are necessary for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret for any fixed N\in\mathbb{Z}_{>0}, and \Theta(\log\log T) re-switches of each action are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret.

Note that if the learner is not allowed to re-choose an action that was chosen earlier and discarded later (i.e., if the number of re-switches of each action is at most 1), then the corresponding bandit problem is exactly the “irrevocable MAB problem” proposed by Farias and Madan (2011). Farias and Madan (2011) and Guha and Munagala (2013) study the price of irrevocability in the Bayesian bandit setting. Using competitive ratio (measured against the optimal online policy that does not know the true environment) as the performance metric, they show that the price of irrevocability is limited. Our results on the necessity of re-switching contradict this idea: in the setting of regret minimization, where we are competing with the optimal clairvoyant policy (a much stronger benchmark), our results indicate that an irrevocable policy must incur linear regret, and any effective policy cannot avoid “departing from and re-switching to” each action many times.

Specifically, Proposition 1 indicates that, for each learning policy that achieves near-optimal regret in MAB, we can always find an environment \mathcal{D} such that the policy departs from the optimal action \Theta(\log\log T) times and moves to the worst action \Theta(\log\log T) times — surprisingly, the necessary number of re-switches of the worst action is essentially the same as that of the optimal action. Put differently, it is inevitable for any effective policy to repeatedly make some switching decisions that would prove to be ineffective in hindsight.

In this subsection, we discuss the relationship between limited switches and limited adaptivity in bandit problems. As discussed in Section id1, in the unit-switching-cost BwSC problem, the constraint is on the number of switches and is defined in the “action world”, hence the learner has full adaptivity. By contrast, in the batched bandit problem, the constraint is on adaptivity and is defined in the “observation world”, hence the learner has full switching power. Since the two constraints in the two problems are defined in two different “worlds”, it seems that the two problems are not directly related. However, the results in our paper suggest that the two problems are actually deeply related. For this purpose, we provide an alternative way of understanding the results of Section id1 through examining the relationship between the two problems given a fixed k.

The SS-SE policy in Section id1 helps us establish a one-sided relationship between the two problems. Specifically, any M-batch policy that achieves a certain regret upper bound in the M-batched k-armed bandit problem can be transformed, using the SS-SE policy ingredients and randomization, to an S-switch policy that achieves the same regret upper bound in the S-budget k-armed unit-cost BwSC problem, as long as S\in[(M-1)(k-1)+1:M(k-1)]. This implies the following fact:

Fact 3

For any fixed k\geq 1,

  1. Any upper bound on the regret of the M-batched k-armed bandit problem serves as an upper bound on the regret of the S-budget k-armed unit-cost BwSC problem (S\in[(M-1)(k-1)+1:M(k-1)]).

  2. Any lower bound on the regret of the S-budget k-armed unit-cost BwSC problem serves as a lower bound on the regret of the (m(S)+1)-batched k-armed bandit problem.
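
To make the index bookkeeping in Fact 3 concrete, the following small Python helpers (hypothetical names, our own illustration) map a batch number M to the switching budgets it covers and vice versa.

def switch_budgets_for_batches(M, k):
    # Switching budgets S whose optimal regret matches the M-batched problem.
    return range((M - 1) * (k - 1) + 1, M * (k - 1) + 1)

def batches_for_switch_budget(S, k):
    # The batch number m(S) + 1 corresponding to a switching budget S.
    return (S - 1) // (k - 1) + 1

k = 4
print(list(switch_budgets_for_batches(2, k)))                   # [4, 5, 6]
print([batches_for_switch_budget(S, k) for S in range(4, 7)])   # [2, 2, 2]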

By contrast, given an arbitrary S-switch policy, it is generally impossible to transform it into an (m(S)+1)-batch policy, as an S-switch policy may utilize data an unlimited number of times. For example, consider the following naïve policy: choose actions based on the celebrated UCB1 policy (Auer et al. 2002) until the number of switches exceeds S. This policy is clearly an S-switch policy, but cannot be transformed into any finite-batch policy. Therefore, the SS-SE policy and Fact 3 are not enough for establishing a two-sided relationship between limited switches and limited adaptivity.

Surprisingly, our strong lower bound in Section id1 directly closes the gap between the regret upper bound of the batched bandit problem and the regret lower bound of a corresponding unit-switching-cost BwSC problem. Thus, we completely establish the two-sided relationship and essentially prove the following fact:

Fact 4

For any fixed k, let R_{M}^{*}(T) be the optimal (minimax) regret of the M-batched k-armed bandit problem, and R_{S}^{*}(T) be the optimal (minimax) regret of the S-budget k-armed unit-cost BwSC. As long as S\in[(M-1)(k-1)+1,M(k-1)], the two problems have near-equal optimal regret (up to logarithmic factors), i.e., R_{M}^{*}(T) and R_{S}^{*}(T) are both \tilde{\Theta}(T^{\frac{1}{2-2^{1-M}}}).

In essence, our results reveal a surprising “regret equivalence” between limited switches (even with full adaptivity) and limited adaptivity (even with full switching power) in bandit problems: limiting switches (in the “action” world) implicitly limits adaptivity (in the “observation” world), and limiting adaptivity (in the “observation” world) implicitly limits switches (in the “action” world). Put differently, in an MAB problem, when the number of switches is limited, becoming more adaptive and using data more frequently may not lead to a reduction in regret.

Before closing Section id1, we give some comments on the practical implications and the scope of the results obtained in this section. First, it is worth noting that the phase transitions and cyclic phenomena discovered in this section are associated with theoretical bounds (in the minimax sense), not with the empirical performance of policies. While these phenomena are theoretically interesting and provide novel and important insights for decision-makers who want theoretical guarantees, in reality, when decision-makers are not worried about the regret incurred in the worst case, they can apply more adaptive policies than the SS-SE policy to achieve better performance, and it is possible that they observe a much smoother empirical performance improvement as their switching budgets grow.

Second, in addition to performance improvement, decision-makers may value low-adaptive policies due to their simplicity and ease of implementation. From this perspective, the SS-SE policy proposed in this section should be practically appealing, as it is theoretically optimal under both a limited number of switches and limited adaptivity. Indeed, the main concern with low-adaptive policies is typically their potential performance loss. The results developed in this section show that, theoretically, given a limited number of switches, low-adaptive policies are able to achieve the optimal performance in the worst case. Thus, our results provide a strong validation of low-adaptive policies. For example, in dynamic pricing problems, sellers prefer both a small number of price changes (to reduce customers’ confusion) and low adaptivity (for ease of implementation), see Cheung et al. (2017). The SS-SE policy developed in this section is thus attractive from both points of view.

We now proceed to the general case of BwSC, where c_{i,j} (=c_{j,i}) can be any non-negative real number and even \infty. The problem is significantly more challenging in this general setting. For this purpose, we need to enhance the framework of Section id1 to better characterize the structure of switching costs. We do this by representing switching costs via a weighted graph.

Let G=(V,E) be a (weighted) complete graph, where V=[k] (i.e., each vertex corresponds to an action), and the edge between i and j is assigned a weight c_{i,j} (\forall i\neq j). We call the weighted graph G the switching graph. In this section, we assume the switching costs satisfy the triangle inequality: \forall i,j,l\in[k], c_{i,j}\leq c_{i,l}+c_{l,j}. We relax this assumption in Appendix E.
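
The switching graph can be stored simply as a symmetric cost matrix; the sketch below is our own illustration, with a made-up three-action graph, and checks the triangle-inequality assumption.

import numpy as np

def satisfies_triangle_inequality(costs):
    # Check c_{i,j} <= c_{i,l} + c_{l,j} for all i, j, l in a symmetric cost matrix.
    k = len(costs)
    return all(costs[i][j] <= costs[i][l] + costs[l][j]
               for i in range(k) for j in range(k) for l in range(k))

# Example switching graph on 3 actions (symmetric, zero diagonal).
costs = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.5],
                  [2.0, 1.5, 0.0]])
print(satisfies_triangle_inequality(costs))  # True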

The results for the unit-switching-cost model suggest that an effective policy that minimizes the worst-case regret must repeatedly visit all actions, in a manner similar to the SS-SE policy. This indicates that in the general-switching-cost model, an effective policy should repeatedly visit all vertices in the switching graph, in the most economical way possible so as to stay within budget. This insight is formalized in Appendix F, where we establish an interesting connection between bandit problems and graph traversal problems. Applying the result to the general BwSC problem, we discover a connection between the general BwSC problem and the celebrated shortest Hamiltonian path problem.

Motivated by this connection, we propose the Hamiltonian-Switching Successive Elimination (HS-SE) policy, and present it in Algorithm 2. The policy enhances the original SS-SE policy by adding an additional ingredient: a pre-specified switching order determined by the shortest Hamiltonian path of the switching graph G. Note that while the shortest Hamiltonian path problem is NP-hard, solving this problem is entirely an “offline” step in the HS-SE policy. That is, for a given switching graph, the learner only needs to solve this problem once.
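
Since the offline step is solved only once per switching graph, a brute-force search over permutations suffices for small k. Below is a minimal Python sketch of this offline step and of the index m_G^U(S) used by HS-SE; the helper names and example graph are our own, and the clamping follows the convention \lfloor x\rfloor=0 for x<0 stated in the notation paragraph.

import itertools
import numpy as np

def shortest_hamiltonian_path(costs):
    # Brute-force shortest Hamiltonian path; acceptable offline for small k,
    # since it is solved only once per switching graph.
    k = len(costs)
    best_path, best_weight = None, float('inf')
    for perm in itertools.permutations(range(k)):
        weight = sum(costs[perm[i]][perm[i + 1]] for i in range(k - 1))
        if weight < best_weight:
            best_path, best_weight = perm, weight
    return best_path, best_weight

def m_G_upper(costs, S):
    # m_G^U(S) = floor((S - max_{i,j} c_{i,j}) / H), clamped at 0 for negative arguments.
    _, H = shortest_hamiltonian_path(costs)
    return max(int(np.floor((S - np.max(costs)) / H)), 0)

costs = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.5],
                  [2.0, 1.5, 0.0]])
print(shortest_hamiltonian_path(costs))  # ((0, 1, 2), 2.5), so H = 2.5
print(m_G_upper(costs, S=8.0))           # floor((8 - 2) / 2.5) = 2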

Algorithm 2 Hamiltonian-Switching Successive Elimination (HS-SE)

Input: Switching Graph G, Switching budget S, Horizon T
Offline Step: Find the shortest Hamiltonian path in G: {i_{1}}\rightarrow\dots\rightarrow{i_{k}}. Denote the total weight of the shortest Hamiltonian path as H. Calculate m_{G}^{U}(S)=\left\lfloor\frac{S-\max_{i,j\in[k]}c_{i,j}}{H}\right\rfloor.
Partition: Run the partition step in the SS-SE policy with m(S)=m_{G}^{U}(S).
Initialization: Let the set of all active actions in the l-th interval be A_{l}. Set A_{1}=[k], a_{0}={i_{1}}.
Policy:

1:  for l=1,\dots,m_{G}^{U}(S) do
2:     if a_{t_{l-1}}\in A_{l} and l is odd then
3:        Let a_{t_{l-1}+1}=a_{t_{l-1}}. Starting from this action, along the direction of {i_{1}}\rightarrow\dots\rightarrow{i_{k}}, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
4:     else if a_{t_{l-1}}\in A_{l} and l is even then
5:        Let a_{t_{l-1}+1}=a_{t_{l-1}}. Starting from this action, along the direction of {i_{k}}\rightarrow\dots\rightarrow{i_{1}}, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
6:     else if a_{t_{l-1}}\notin A_{l} and l is odd then
7:        Along the direction of {i_{1}}\rightarrow\dots\rightarrow{i_{k}}, find the first action that still remains in A_{l}. Starting from this action, along the direction of {i_{1}}\rightarrow\dots\rightarrow{i_{k}}, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
8:     else if a_{t_{l-1}}\notin A_{l} and l is even then
9:        Along the direction of {i_{k}}\rightarrow\dots\rightarrow{i_{1}}, find the first action that still remains in A_{l}. Starting from this action, along the direction of {i_{k}}\rightarrow\dots\rightarrow{i_{1}}, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
10:     end if
11:     Statistical test: deactivate all actions i s.t. \exists action j with \mathtt{UCB}_{t_{l}}(i)<\mathtt{LCB}_{t_{l}}(j), where
\mathtt{UCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}},
\mathtt{LCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}.
12:  end for
13:  In the last interval, choose the action with the highest empirical mean (up to round t_{m_{G}^{U}(S)}).

Let H denote the total weight of the shortest Hamiltonian path of G. We give an upper bound on regret of the HS-SE policy. See Appendix H for a proof.

Theorem 3

Let \pi be the HS-SE policy, then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all G, k=|G|, S\geq 0, T\geq k,

R^{\pi}(T)\leq C(\log k\log T)\,k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}\,T^{\frac{1}{2-2^{-m_{G}^{U}(S)}}},

where m_{G}^{U}(S)=\left\lfloor\frac{S-\max_{i,j\in[k]}{c_{i,j}}}{H}\right\rfloor.

We then give a lower bound that is close to the above upper bound. See Appendix I for a proof.

Theorem 4

There exists an absolute constant C>0 such that for all G, k=|G|, S\geq 0, T\geq k and for all policies \pi\in\Pi_{S},

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases}

where {m_{G}^{L}}(S)=\left\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\right\rfloor.

When the switching costs satisfy the condition \max_{i,j\in[k]}{c_{i,j}}=\max_{i\in[k]}\min_{j\neq i}c_{i,j}, the two bounds directly match. When this condition is not satisfied, for any switching graph G, the above two bounds still match for a wide range of S:

\left[0,\;H+\max_{i\in[k]}\min_{j\neq i}c_{i,j}\right)\bigcup\left\{\bigcup_{n=1}^{\infty}\left[nH+\max_{i,j\in[k]}c_{i,j},\;(n+1)H+\max_{i\in[k]}\min_{j\neq i}c_{i,j}\right)\right\}.

Even when S is not in this range, we still have m_{G}^{U}(S)\leq m_{G}^{L}(S)\leq m_{G}^{U}(S)+1 for any G and any S, which means that the difference between the two indices is at most 1 and the regret bounds are always very close. In fact, it can be shown that as S increases, the gap between the upper and lower bounds decreases doubly exponentially. Therefore, the HS-SE policy is quite effective for the general BwSC problem. See Figure 2 for an illustration.
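
For a concrete sense of how close the two indices are, the short sketch below (ours, reusing the example three-action graph from the sketch above) evaluates m_G^U(S) and m_G^L(S) at a few budgets; the indices coincide at the first two budgets, which lie in the range displayed above, and differ by exactly one at the last two, which do not.

import numpy as np

costs = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.5],
                  [2.0, 1.5, 0.0]])
H = 2.5                                   # shortest Hamiltonian path weight (0 -> 1 -> 2)
c_max = costs.max()                       # max_{i,j} c_{i,j} = 2.0
c_minmax = max(costs[i][costs[i] > 0].min() for i in range(3))  # max_i min_{j != i} c_{i,j} = 1.5

def m_upper(S): return max(int(np.floor((S - c_max) / H)), 0)
def m_lower(S): return max(int(np.floor((S - c_minmax) / H)), 0)

for S in (3.0, 4.5, 6.5, 9.0):
    print(S, m_upper(S), m_lower(S))      # equal at 3.0 and 4.5; differ by one at 6.5 and 9.0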

Figure 2: An Illustration for the Upper and Lower Bounds in Theorem 3 and Theorem 4

From Figure 2, we can easily observe that in the general-switching-cost BwSC problem, there are still phase transitions, as there are still phases where the regret upper and lower bounds remain the same within each phase and drop significantly between phases. An unanswered question, however, is whether the critical points between phases still exist. Indeed, when S\in[nH+\max_{i\in[k]}\min_{j\neq i}c_{i,j},\;nH+\max_{i,j\in[k]}c_{i,j}) for some n\in\mathbb{Z}_{>0} (which is the range where the upper and lower bounds in Theorem 3 and Theorem 4 do not match), the current results cannot recover the exact order of T in R_{S}^{*}(T), so it remains unknown whether the optimal regret drops abruptly within this range.

Theorem 5 answers the question in the affirmative, that is, we prove the existence of critical points for any fixed switching graph G. While the result successfully recovers the exact dependence on T of the optimal regret, it is only of theoretical interest, for two reasons. First, the gap between the upper and lower bounds employed in our proof is of the order of O(k!). Second, computing the key index m_{G}(S) is highly difficult. For these reasons, we defer the detailed description of this result to Appendix J.

Theorem 5

Let k\geq 1 be an arbitrary given constant. For any given switching graph G such that |G|=k, for any S\geq 0,

R_{S}^{*}(T)=\tilde{\Theta}(T^{\frac{1}{2-2^{-m_{G}(S)}}}),

where m_{G}(S)\in\{m_{G}^{U}(S),m_{G}^{L}(S)\} is an integer completely determined by the switching graph G. As a result, given G, we can define the j-th critical point as \inf\{S\mid m_{G}(S)\geq j\} for j\in\mathbb{Z}_{>0}, where the order of T in R_{S}^{*}(T) drops from \tilde{\Theta}(T^{\frac{1}{2-2^{1-j}}}) to \tilde{\Theta}(T^{\frac{1}{2-2^{-j}}}).

We make a final remark on whether some of the results obtained in Section id1 still hold in the general-switching-cost model. First, there are still phase transitions, as there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. Second, the cyclic phenomena do not exist, since there are counter-examples where the distances between critical points are not equal. Third, there is no regret equivalence between limited (general) switching cost and limited adaptivity.

We consider the stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. For the unit-switching-cost model, we prove matching upper and lower bounds on regret and provide a near-optimal algorithm for the problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. We also show a regret equivalence between MAB with limited switches and MAB with limited adaptivity. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. For the general-switching-cost model, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.

References

  • Agrawal et al. (1988) Agrawal, R., M. Hegde, D. Teneketzis. 1988. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control 33(10) 899–906.
  • Agrawal et al. (1990) Agrawal, R., M. Hegde, D. Teneketzis. 1990. Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic Reports 29(4) 437–459.
  • Altschuler and Talwar (2018) Altschuler, J., K. Talwar. 2018. Online learning over a finite action set with limited switching. Conference on Learning Theory. 1569–1573.
  • Asawa and Teneketzis (1996) Asawa, M., D. Teneketzis. 1996. Multi-armed bandits with switching penalties. IEEE Transactions on Automatic Control 41(3) 328–348.
  • Auer et al. (2002) Auer, P., N. Cesa-Bianchi, P. Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3) 235–256.
  • Banks and Sundaram (1994) Banks, J. S., R. K. Sundaram. 1994. Switching costs and the gittins index. Econometrica 62(3) 687–694.
  • Bergemann and Välimäki (2001) Bergemann, D., J. Välimäki. 2001. Stationary multi-choice bandit problems. Journal of Economic Dynamics and Control 25(10) 1585–1594.
  • Brezzi and Lai (2002) Brezzi, M., T. L. Lai. 2002. Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27(1) 87–108.
  • Cesa-Bianchi et al. (2013) Cesa-Bianchi, N., O. Dekel, O. Shamir. 2013. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems. 1160–1168.
  • Chen and Chao (2019) Chen, B., X. Chao. 2019. Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions 1–9.
  • Cheung et al. (2017) Cheung, W. C., D. Simchi-Levi, H. Wang. 2017. Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6) 1722–1731.
  • Christofides (1976) Christofides, N. 1976. Worst-case analysis of a new heuristic for the travelling salesman problem. Tech. rep., Carnegie-Mellon University Pittsburgh PA Management Sciences Research Group.
  • Cormen et al. (2009) Cormen, T. H., C. E. Leiserson, R. L. Rivest, C. Stein. 2009. Introduction to algorithms. MIT Press.
  • Domb (2000) Domb, C. 2000. Phase Transitions and Critical Phenomena, vol. 1. Elsevier.
  • Farias and Madan (2011) Farias, V. F., R. Madan. 2011. The irrevocable multiarmed bandit problem. Operations Research 59(2) 383–399.
  • Gao et al. (2019) Gao, Z., Y. Han, Z. Ren, Z. Zhou. 2019. Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763 .
  • Guha and Munagala (2013) Guha, S., K. Munagala. 2013. Approximation algorithms for bayesian multi-armed bandit problems. arXiv preprint arXiv:1306.3525 .
  • Jun (2004) Jun, T. 2004. A survey on the bandit problem with switching costs. De Economist 152(4) 513–541.
  • Lai and Robbins (1985) Lai, T. L., H. Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1) 4–22.
  • Lattimore and Szepesvári (2018) Lattimore, T., C. Szepesvári. 2018. Bandit algorithms. preprint .
  • Lawler et al. (1985) Lawler, E. L., J. K. Lenstra, A. R. Kan, D. B. Shmoys. 1985. The traveling salesman problem: a guided tour of combinatorial optimization, vol. 3. New York: Wiley.
  • Perchet et al. (2016) Perchet, V., P. Rigollet, S. Chassang, E. Snowberg. 2016. Batched bandit problems. The Annals of Statistics 44(2) 660–681.
  • Slivkins (2019) Slivkins, A. 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272 .
  • Vogel (1960) Vogel, W. 1960. An asymptotic minimax theorem for the two armed bandit problem. The Annals of Mathematical Statistics 31(2) 444–451.
  • Wainwright (2019) Wainwright, M. J. 2019. High-dimensional statistics: A non-asymptotic viewpoint, vol. 48. Cambridge University Press.
Appendices

For simplicity, we only present the distribution-dependent regret bounds for the unit-switching-cost BwSC problem. Extensions to the general-switching-cost BwSC problem are analogous to Section 5 of the main article.

To achieve tight distribution-dependent regret bounds, we propose the S-Switch Successive Elimination 2 (SS-SE-2) policy, stated in Algorithm 3. Note that the only difference between the SS-SE-2 policy and the SS-SE policy lies in the partition of intervals.

Algorithm 3 S-Switch Successive Elimination 2 (SS-SE-2)

Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
    Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}],
    where the endpoints are defined by t_{0}=1 and

t_{i}=\lfloor k^{1-\frac{i}{m(S)+1}}T^{\frac{i}{m(S)+1}}\rfloor,\quad\forall i=1,\dots,m(S)+1.

Initialization: The same as the SS-SE policy.
Policy:

1:  The same as the SS-SE policy.
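To make the partition concrete, the following short Python sketch computes m(S) and the endpoints t_{0},\dots,t_{m(S)+1} exactly as specified above; the function and variable names are illustrative only, and we assume k\geq 2 so that m(S) is well defined.

    import math

    def ssse2_partition(k, S, T):
        # m(S) = floor((S - 1) / (k - 1)); assumes k >= 2
        m = (S - 1) // (k - 1)
        # t_0 = 1 and t_i = floor(k^{1 - i/(m+1)} * T^{i/(m+1)}), i = 1, ..., m + 1
        endpoints = [1]
        for i in range(1, m + 2):
            endpoints.append(math.floor(k ** (1 - i / (m + 1)) * T ** (i / (m + 1))))
        return m, endpoints

Note that the exponent of k vanishes at i=m(S)+1, so the grid always ends at t_{m(S)+1}=T.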

For any environment \mathcal{D}, let i^{*}=\arg\max_{i\in[k]}\mu_{i} denote the optimal action, and \Delta=\Delta(\mathcal{D})=\min_{i\neq i^{*}}|\mu_{i^{*}}-\mu_{i}|>0 denote the gap between the mean rewards of the optimal action and the best sub-optimal action. We have the following results.

Theorem 6

Let \pi be the SS-SE-2 policy. There exists an absolute constant C\geq 0 such that for all \mathcal{D}, for all k\geq 1, S\geq 0 and T\geq k,

R_{\mathcal{D}}^{\pi}(T)\leq C\left(k^{\frac{m(S)}{m(S)+1}}\log k\right)\frac{T^{\frac{1}{m(S)+1}}\log T}{\Delta},

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Theorem 7

There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq 1 and for any policy \pi\in\Pi_{S}, if m(S)\leq{\log_{2}(T/k)}, then

\sup\limits_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\geq C\left(k^{-\frac{3}{2}-\frac{1}{m(S)+1}}(m(S)+1)^{-2}\right)T^{\frac{1}{m(S)+1}},

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Note that when m(S)\leq{\log_{2}(T/k)}, the upper and lower bounds match in the minimax sense (up to logarithmic factors), thus the SS-SE-2 policy can be viewed as near-optimal. When m(S)>\log_{2}(T/k), the upper bound is O(\log T/\Delta), and we can directly use the seminal instance-dependent lower bound of Lai and Robbins (1985) to show the asymptotic optimality of the SS-SE-2 policy.

We omit the proofs of Theorem 6 and Theorem 7. The proof of Theorem 6 resembles the proof of Theorem 1 in Appendix B, and the proof of Theorem 7 resembles the proof of Theorem 2 in Appendix C. The difference is mainly on the partition of intervals.

Besides the regret upper and lower bounds, we also establish Corollary 3, which can be viewed as the distribution-dependent counterpart of the corresponding result in Section 4.3.2 of the main article.

Corollary 3

(The switching complexity of MAB - distribution-dependent regret version)
For any k\geq 1, for any environment \mathcal{D}, let \Delta=\min\limits_{i\in[k],i\neq i^{*}}|\mu_{i^{*}}-\mu_{i}| denote the gap between the mean rewards of the optimal action and the best sub-optimal action.

  1. N(k-1)+1 switches are necessary and sufficient for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB (N\in\mathbb{Z}_{>0}).

  2. \Omega(\frac{\log T}{\log\log T}) switches are necessary for uniformly achieving \tilde{O}(\log{T}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.

The proof of Corollary 3 is deferred to Appendix D.

From round 1 to round t_{1}, the SS-SE policy makes k-1 switches.

For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}:

  • If the last action in interval l remains active in interval l+1, then it will be the first action in interval l+1, and no switch occurs between round t_{l} and round t_{l}+1. Since the SS-SE policy makes at most k-1 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}, the SS-SE policy makes at most 0+(k-1)=k-1 switches from round t_{l} to round t_{l+1}.

  • If the last action in interval l is eliminated before the start of interval l+1, then interval l+1 starts from another active action, and one switch occurs between round t_{l} and round t_{l}+1. The elimination implies that |A_{l+1}|\leq k-1, thus the SS-SE policy makes |A_{l+1}|-1\leq(k-1)-1=k-2 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}. Therefore, the SS-SE policy makes at most 1+(k-2)=k-1 switches from round t_{l} to round t_{l+1}.

From round t_{m(S)} to round T, since the SS-SE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the SS-SE policy makes at most 1 switch from round t_{m(S)} to round T.

Summarizing the above arguments, we find that the SS-SE policy makes at most m(S)(k-1)+1\leq S switches from round 1 to round T. Thus it is indeed an S-switching-budget policy.
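For completeness, the final inequality is simply the definition of m(S):

m(S)=\left\lfloor\frac{S-1}{k-1}\right\rfloor\ \Longrightarrow\ m(S)(k-1)\leq S-1\ \Longrightarrow\ m(S)(k-1)+1\leq S.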

We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of times action i is chosen in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as

r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},\quad\forall i\in[k],\ t\in[T].

Define the clean event as

\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],\ |\bar{\mu}_{t}(i)-\mu_{i}|\leq r_{t}(i)\}.

By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mbox{$\mathcal{D}$}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.

The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 1 can be expressed as

\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),\quad\forall l\in[m(S)+1],\ i\in[k],
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),\quad\forall l\in[m(S)+1],\ i\in[k].
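As an illustration, the elimination test based on these bounds can be written in a few lines of Python; this is only a sketch with hypothetical names (stats[i] is assumed to hold the running reward sum and pull count of action i), not the exact implementation of Algorithm 1.

    import math

    def confidence_interval(reward_sum, pulls, T):
        # empirical mean +/- the radius r_t(i) = sqrt(2 log T / n_t(i))
        mean = reward_sum / pulls
        radius = math.sqrt(2.0 * math.log(T) / pulls)
        return mean - radius, mean + radius  # (LCB, UCB)

    def eliminate(active, stats, T):
        # keep action i iff UCB(i) >= max_j LCB(j), as in the elimination step above
        intervals = {i: confidence_interval(*stats[i], T) for i in active}
        best_lcb = max(lcb for lcb, _ in intervals.values())
        return [i for i in active if intervals[i][1] >= best_lcb]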

Let \pi denote the SS-SE policy. First, observe that for any environment \mathcal{D},

\displaystyle R_{\mathcal{D}}^{\pi}(T)=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\bar{\mathcal{E}})
\displaystyle\leq\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}}
\displaystyle=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1), (1)

so in order to bound R^{\pi}(T)=\sup_{\mbox{$\mathcal{D}$}}R_{\mbox{$\mathcal{D}$}}^{\pi}(T), we only need to focus on the clean event.

Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval when i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval when we did not eliminate action i yet (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the SS-SE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of round \eta_{i}-1 must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,

\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-1}}(i)=4r_{t_{\eta_{i}-1}}(i), (2)

where the last equality holds because i^{*} and i are chosen an equal number of times in each interval up to interval \eta_{i}, which implies n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 1, for simplicity, we overlook the rounding issues of \frac{t_{l+1}-t_{l}}{|A_{l}|} for each interval l. Considering the rounding issues does not bring additional difficulty to our analysis, as in the policy we can always design a rounding rule to keep the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)

Since i is never chosen after the \eta_{i}-th interval, we have n_{\eta_{i}}(i)=n_{T}(i), and therefore r_{\eta_{i}}(i)=r_{T}(i).

The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the SS-SE policy and (2), we can bound this quantity as

\displaystyle R(T;i)=n_{T}(i)\Delta(i)
\displaystyle\leq 4n_{\eta_{i}}(i)\sqrt{\frac{2\log T}{n_{\eta_{i}-1}(i)}}
\displaystyle\leq C_{0}\sqrt{2\log T}\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{t_{\eta_{i}-1}/k}}
\displaystyle\leq 4C_{0}\sqrt{2\log T}\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|}

for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,

\displaystyle\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]=\sum_{i\in[k]}R(T;i)
\displaystyle\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}k(T/k)^{1/(2-2^{-m(S)})}\frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{1}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}\frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{2}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}\frac{1}{j}
\displaystyle\leq C_{3}(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus, by (1) and R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T), we have

R^{\pi}(T)\leq C(\log k\log T)k^{2-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constant C\geq 0.\hfill\Box

Given any k\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mbox{$\mathcal{D}$}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \mbox{\boldmath$\mu$}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.

For any environment \mu, let X_{\boldsymbol{\mu}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\boldsymbol{\mu}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.

For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\boldsymbol{\mu}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t}); then \mathbb{F}=(\mathcal{F}_{t})_{t\in[T]} is a filtration.

For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}, see detailed definitions in Chapter 15 of Wainwright (2019).
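Two standard facts are used repeatedly below: Pinsker's inequality and the closed form of the KL divergence between unit-variance Gaussians,

D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q})\leq\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q})},\qquad D_{\mathrm{KL}}\big(\mathcal{N}(\mu_{1},1)\,\|\,\mathcal{N}(\mu_{2},1)\big)=\frac{(\mu_{1}-\mu_{2})^{2}}{2}.

Since KL divergence is additive over independent coordinates, two environments that differ only in the mean of a single action by \delta, observed over n rounds of that action, are at KL divergence n\delta^{2}/2; this is exactly the quantity that appears in the chain of inequalities below.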

For any environment \mu, for any policy \pi\in\Pi_{S}, we make some key definitions as below.

1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\dots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

  • \tau_{1}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[1:t]$}\} if the set is non-empty and \tau_{1}=\infty otherwise.

  • \tau_{2}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{1}:t]$}\} if the set is non-empty and \tau_{2}=\infty otherwise.

  • Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{j-1}:t]$}\} if the set is non-empty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1.

It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.

2. We then define a series of random variables (depending on the stopping times).

  • S(1,\tau_{1}) is the number of switches that occur in [1:\tau_{1}] (if a switch happens between round \tau_{1} and round \tau_{1}+1, we do not count it in S(1,\tau_{1})).

  • For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the number of switches that occur in [\tau_{j-1}:\tau_{j}] (if a switch happens between round \tau_{j-1}-1 and round \tau_{j-1}, or between round \tau_{j} and round \tau_{j}+1, we do not count it in S(\tau_{j-1},\tau_{j})).

  • S(\tau_{m(S)},T) is the number of switches that occur in [\tau_{m(S)}:T] (if a switch happens between round \tau_{m(S)}-1 and round \tau_{m(S)}, we do not count it in S(\tau_{m(S)},T)).

3. Next we define a series of events.

  • E_{1}=\{\tau_{1}>t_{1}\}.

  • For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

  • E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.

Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 1.

4. Finally we define a series of shrinking errors.

  • \Delta_{1}=1.

  • For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}; see the short verification after this list.)

  • \Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)

5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking value in [k] such that {\pi_{\infty}=i} with probability 1/k (i\in[k]).
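To verify the approximations above, recall that (ignoring floors) the endpoints of Algorithm 1 take the form t_{j}=k(T/k)^{\frac{2-2^{1-j}}{2-2^{-m(S)}}}, which is also the form used in the regret calculation of Case 2 below. Then

\sqrt{t_{j-1}}=k^{\frac{1}{2}}(T/k)^{\frac{1-2^{1-j}}{2-2^{-m(S)}}},\qquad\text{so}\qquad\Delta_{j}=\frac{k^{-1/2}(k/T)^{\frac{1-2^{1-j}}{2-2^{-m(S)}}}}{k(m(S)+1)}=\frac{1}{k(m(S)+1)\sqrt{t_{j-1}}},

and similarly \Delta_{m(S)+1}=\frac{1}{2k(m(S)+1)\sqrt{t_{m(S)}}}.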

Lemma 1

For any environment \mu, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches that occur in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely.

Proof of Lemma 1. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1}:\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, we know that S(1,\tau_{1})\geq k-1, S(\tau_{1},\tau_{2})\geq k-1, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq k-1. Thus we have

S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S)(k-1).

Since \pi\in\Pi_{S}, we further know that

S(\tau_{m(S)},T)\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})]\leq S-m(S)(k-1)\leq k-1

happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches that occur in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely. \hfill\Box

Consider a class of environments \Lambda=\{\boldsymbol{\mu}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i}\leq\frac{\Delta_{m(S)+1}}{2},\forall i\neq 1\}. Pick an arbitrary environment \alpha from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, by the union bound, we have

\sum_{j=1}^{m(S)+1}\mathbb{P}_{\alpha}^{\pi}(E_{j})\geq\mathbb{P}_{\alpha}^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1.

Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1).

Case 1: j^{*}=1. Since \mathbb{P}_{\alpha}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have

\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k]; i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] only depends on the policy \pi and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}. (3)

We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]). (4)

But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (3) and (4) we have

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.

However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]\geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1/(2-2^{-m(S)})}.

Case 2: 2\leq j^{*}\leq m(S). Since \mathbb{P}_{\alpha}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{j^{*}})}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have

\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]; i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] only depends on the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}. (5)

We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]. Letting \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi, the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]). (6)

We now bound the difference between the left-hand side (LHS) of (5) and the LHS of (6). We have

\displaystyle\left|\text{LHS of }(5)-\text{LHS of }(6)\right|
\displaystyle\leq D_{\mathrm{TV}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)
\displaystyle\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)}
\displaystyle\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\right)}
\displaystyle=\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}
\displaystyle=\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}\right)^{2}}{2}\right]}
\displaystyle\leq\frac{\sqrt{t_{j^{*}-1}}\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)},

where the first inequality is by the definition of the total variation distance between two probability measures, the second inequality is by Pinsker's inequality from information theory, and the third inequality is by the data-processing inequality in information theory.

Combining the above inequality with (5) and (6), we have

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}.

However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} will incur at least a \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] indicates that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have

\displaystyle R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{*}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]
\displaystyle\geq\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)}}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}\left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}
\displaystyle\geq\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{-m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-m(S)}}{2-2^{-m(S)}}}\right)
\displaystyle\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right).

When m(S)\leq\log_{2}\log_{2}(T/k), we have

(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T/k}2}}=\frac{1}{2}.

Thus we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

when m(S)\leq\log_{2}\log_{2}(T/k).

Case 3: j^{*}=m(S)+1. Since \mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(S)+1)}.

Thus either

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}, (7)

or

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}. (8)

If (7) holds true, then we consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and all other components are the same as \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (7) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those in Case 2, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}

and

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Now we consider the case that (8) holds true. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the number of switches that occur in [\tau_{m(S)}:T] is no more than k-1. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the number of switches that occur in [\tau_{m(S)}:\tau_{m(S)+1}] is at least k-1. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T].

Suppose that i^{\prime}\neq 1. Then the event \mathcal{E}_{i^{\prime}}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\} implies that action 1 is not chosen in [\tau_{m(S)+1}:T]. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 incurs at least a \Delta_{m(S)+1}/4 term in regret. As a result, we know that

R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{4}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Thus we only need to consider the sub-case i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} only depends on the policy \pi, the random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:{(t_{m(S)}+T)}/{2}]} for i\neq 1. Consider a new environment \beta such that its first component is \alpha_{1}-\Delta_{m(S)+1} and all other components are the same as \alpha. Using arguments analogous to those in Case 2, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}.

However, action 1 is the worst action in environment \beta, and every round in which action 1 is chosen incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T]. Thus the event \mathcal{E}_{1} actually implies that action 1 is chosen in every round from round \tau_{m(S)+1} (\leq\frac{t_{m(S)}+T}{2}) to round T, i.e., action 1 is continuously chosen in the last (T-\frac{t_{m(S)}+T}{2}+1) rounds. As a result, we know that

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Combining Cases 1, 2 and 3, we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with unlimited switching budget) is \Omega(\sqrt{kT}), we know that

R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT}

for some absolute constant C>0. To sum up, we have

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}

for some absolute constant C>0, where m(S)=\lfloor\frac{S-1}{k-1}\rfloor. \hfill\Box

We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 1, the SS-SE policy guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully designed policy (e.g., the SS-SE policy) to achieve \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB. On the other hand, suppose that there exists a policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB with S<N(k-1)+1 switches almost surely. Since m(S)\leq N-1, by Theorem 2, its regret in BwSC is \Omega(T^{\frac{1}{2-2^{-N+1}}}), whose order of T is strictly higher than \tilde{O}(T^{\frac{1}{2-2^{-N}}}) (as N is a fixed integer independent of T), which is a contradiction. Thus for any policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB, there must exist an environment such that the policy makes at least N(k-1)+1 switches with some positive probability.

We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 6, the SS-SE-2 policy guarantees \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully designed policy (e.g., the SS-SE-2 policy) to achieve \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret in MAB. On the other hand, given any fixed k\geq 1, for any fixed N\geq 1, suppose that there exists a policy \pi that uniformly achieves \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} with S<N(k-1)+1 switches almost surely. Then there exists a constant C_{k,N}\geq 0 (which may depend on k,N) such that for all \mathcal{D} and for all T\geq 1,

R_{\mathcal{D}}^{\pi}(T)\leq C_{k,N}\,\mathrm{polylog}(T)\frac{T^{\frac{1}{N+1}}}{\Delta},

which means that for all T\geq 1,

\sup_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\leq C_{k,N}\,\mathrm{polylog}(T)\,T^{\frac{1}{N+1}}.

However, since m(S)<N, by Theorem 7, we know that there exists an absolute constant C>0 such that for all T\geq 1,

\sup\limits_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\geq C\left(k^{-\frac{3}{2}-\frac{1}{N}}(m(S)+1)^{-2}\right)T^{\frac{1}{N}}>C\left(k^{-\frac{3}{2}-\frac{1}{N}}(N+1)^{-2}\right)T^{\frac{1}{N}}.

Letting T be large enough yields a contradiction. As a result, N(k-1)+1 switches are necessary for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.

Consider an arbitrary switching graph G with k=|G|\geq 1. In the following, we show that, even without the triangle inequality assumption, a modified version of the results in Section 5 still holds.

Assume that the switching costs associated with G do not satisfy the triangle inequality. We then run the Floyd-Warshall algorithm (see Cormen et al. 2009) on G to efficiently find the shortest paths between all pairs of vertices. For any i,j\in[k] such that i\neq j, let p_{i,j}=i\rightarrow\dots\rightarrow j denote the shortest path between i and j, and c_{i,j}^{\prime} denote the total weight of the shortest path between i and j. We construct a new switching graph G^{\prime}=(V,E^{\prime}) — the vertices in G^{\prime} are the same as G, while the edge between i and j in G^{\prime} is assigned a weight c_{i,j}^{\prime}, which is the total weight of the shortest path between i and j in G. Obviously, G^{\prime} is a switching graph whose switching costs satisfy the triangle inequality. Therefore, for BwSC problems defined with G^{\prime}, we can apply the HS-SE policy, and the regret upper and lower bounds in Theorem 3 and Theorem 4 in Section 5 hold.
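For concreteness, the following Python sketch computes the costs c_{i,j}^{\prime} and the associated shortest paths p_{i,j} with the Floyd-Warshall algorithm; the data layout (a dense cost matrix indexed by 0,\dots,k-1) and the function names are illustrative and not part of the HS-SE policy itself.

    def floyd_warshall(cost):
        # cost[i][j] is the switching cost c_{i,j} (cost[i][i] = 0); returns
        # dist[i][j] = c'_{i,j} and a next-hop table for reconstructing p_{i,j}
        k = len(cost)
        dist = [row[:] for row in cost]
        nxt = [[j for j in range(k)] for _ in range(k)]
        for m in range(k):
            for i in range(k):
                for j in range(k):
                    if dist[i][m] + dist[m][j] < dist[i][j]:
                        dist[i][j] = dist[i][m] + dist[m][j]
                        nxt[i][j] = nxt[i][m]
        return dist, nxt

    def shortest_path(i, j, nxt):
        # reconstruct p_{i,j} = i -> ... -> j from the next-hop table
        path = [i]
        while path[-1] != j:
            path.append(nxt[path[-1]][j])
        return path

The modified policy described in Appendix E.2 below simply replays shortest_path(i, j, nxt) on G whenever the HS-SE policy on G^{\prime} switches from i to j.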

In this part we assume that k=o(\sqrt{T}). This assumption automatically holds when k is a fixed constant independent of T.

For any BwSC problem defined with switching graph G (whose switching costs do not satisfy the triangle inequality) and switching budget S, we construct a new switching graph G^{\prime} according to Appendix E.1, and construct a new BwSC problem defined with switching graph G^{\prime} and switching budget S. Let \pi^{\prime} denote the HS-SE policy running on the new BwSC problem. Obviously \pi^{\prime} is an S-switching-budget policy for the new problem. We construct \pi by modifying \pi^{\prime}, aiming to obtain an S-switching-budget policy for the original BwSC problem. Let \pi switch (on G) following \pi^{\prime} (on G^{\prime}): every time \pi^{\prime} switches from i to j on G^{\prime}, let \pi switch according to the path p_{i,j}=i\rightarrow\dots\rightarrow j on G, visiting each vertex in p_{i,j} once (since in the HS-SE policy, every active action is chosen for at least \Omega(T^{1/2}) consecutive rounds in each interval, while p_{i,j} contains at most k=o(\sqrt{T}) vertices, \pi is a valid policy). Since the total weight of p_{i,j} is c^{\prime}_{i,j} and \pi^{\prime} is an S-switching-budget policy for G^{\prime}, we know that \pi is an S-switching-budget policy for G.

As mentioned before, Theorem 3 and Theorem 4 in Section 5 hold for the new BwSC problem (defined with G^{\prime}) in Appendix E.2. Based on these two theorems, we give upper and lower bounds on regret for the original BwSC problem (defined with G). The upper and lower bounds are very close to each other (in fact, when k=O(T^{1/4}), the bounds are essentially the same as the bounds in Section 5).

Theorem 8

Let G be a switching graph and G^{\prime} be the corresponding new graph defined in Appendix E.1. Let H denote the total weight of the shortest Hamiltonian path of G^{\prime}. Let \pi be the modified HS-SE policy in Appendix E.2; then \pi is an S-switching-budget policy for G. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0, T\geq k^{2},

R^{\pi}(T)\leq C(\log k\log T)k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}T^{\frac{1}{2-2^{-m_{G}^{U}(S)}}}+Ck^{2}\log\log T,

where m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}{c_{i,j}^{\prime}}}{H}\rfloor.

Theorem 9

Let H be the total weight of the shortest Hamiltonian path of G^{\prime}. There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq k and for any policy \pi\in\Pi_{S},

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases}

where {m_{G}^{L}}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}^{\prime}}{H}\rfloor.

Note that the only difference between the upper bound in Theorem 8 and the upper bound in Theorem 3 is an O(k^{2}\log\log T) term, which can be neglected as long as k is much smaller than T, e.g., k=O(T^{1/4}). To see why Theorem 8 holds, just note that (1) when k is much smaller than T, the modification of the HS-SE policy does not affect the learning rate of the HS-SE policy, and (2) since there are m_{G}^{U}(S)+1=O(\log\log T) intervals in \pi, and in each interval the behavior of \pi (running on G) is different from the behavior of \pi^{\prime} (running on G^{\prime}) for at most O(k^{2}) rounds, the additional regret loss compared to Theorem 3 is at most O(k^{2}\log\log T). Theorem 9 is essentially the same as Theorem 4 — in fact, a lower bound proved for a BwSC problem with the triangle inequality assumption is a natural lower bound for a corresponding BwSC problem without the triangle inequality assumption.

Intuitively, an effective policy in BwSC should identify what type of switching behavior is necessary and sufficient for achieving low regret in MAB, and switch in the most efficient way. Thus, before studying the general BwSC, we first revisit the classical MAB to further understand the relationship between switching and regret. In Section 4 of the main article, we established the trade-off between the number of switches and regret in MAB. Unfortunately, this does not provide enough insight for the general BwSC, and hence we need to connect the combinatorics of switching patterns with regret in MAB. In this subsection, we prove the following result: there are some inherent switching patterns associated with any effective learning policy in MAB.

Definition 4

Consider a k-armed bandit problem. For any learning policy \pi, any environment \mathcal{D} and any T\geq 1, the stochastic process \{\pi_{t}\}_{t\in[T]}=\pi_{1},\dots,\pi_{T} constitutes a random walk (with a random starting point) on [k]. We call \{\pi_{t}\}_{t\in[T]} the bandit random walk generated by (\pi,\mbox{$\mathcal{D}$},T).

Definition 5

A bandit random walk on an action set [k] finishes a cover in period [T_{1}:T_{2}] if all actions in [k] were chosen between round T_{1} and round T_{2}; here T_{1} is called the starting round of this cover, and T_{2} is called the ending round of this cover.

Definition 6

A bandit random walk on an action set [k] finishes N\geq 0 asynchronous covers in period [T_{1}:T_{2}] if it finishes N covers in period [T_{1}:T_{2}], and the ending round of the j-th cover is no larger than the starting round of the (j+1)-th cover, for all j=1,\dots,N-1.
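As a concrete illustration of Definitions 5 and 6, the following Python sketch greedily counts asynchronous covers in a realized play sequence; the function name and input format are hypothetical.

    def count_asynchronous_covers(actions, k):
        # actions: a realized sequence pi_1, ..., pi_T with values in {0, ..., k-1}
        # a cover closes at the first round where all k actions have appeared since
        # the current cover started; the next cover may start at that same round,
        # which Definition 6 explicitly permits
        covers, seen = 0, set()
        for a in actions:
            seen.add(a)
            if len(seen) == k:
                covers += 1
                seen = {a}  # the closing round also serves as the start of the next cover
        return covers

For example, with k=3 the sequence 0,1,2,1,0,2 contains two asynchronous covers (rounds 1–3 and rounds 3–5).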

By using the “tracking the cover time” argument, we establish the following result.

Theorem 10

Consider a k-armed bandit problem. For any fixed N\geq 0, for any policy \pi that achieves \tilde{O}(T^{\frac{1}{2-2^{-N}}-\epsilon}) regret for some \epsilon>0, there exists an environment \mathcal{D} and T\geq 1 such that the bandit random walk generated by (\pi,\mathcal{D},T) must “finish N+1 asynchronous covers and then switch to the optimal action in period [1:T]” with probability at least \max\{N/(N+1),1/2\}. (If the bandit random walk happens to be at the optimal action when it just finishes N+1 asynchronous covers, then the event directly occurs.)

Theorem 10 holds for any MAB problem, and reveals some fundamental switching patterns in MAB that any effective learning policy has to exhibit under certain environments and horizons. Intuitively, the patterns can be summarized as “finishing multiple covers and then switching to the optimal action”. For example, if a policy \pi achieves sublinear regret in MAB, then there must be some environment \mathcal{D} and T such that the policy first chooses all actions (i.e., \pi_{1},\dots,\pi_{T} finishes a cover) and then switches to the optimal action with certain probability (even though the policy does not know the optimal action). Also, if a policy \pi achieves near-optimal regret in MAB, then there must be some environment \mathcal{D} and T such that \pi_{1},\dots,\pi_{T} first finishes \Omega(\log\log T) asynchronous covers and then switches to the optimal action with certain probability.

Theorem 10 indicates that the switching ability of “finishing multiple covers then switching to the optimal action” is necessary for any effective learning policy in MAB. It thus reveals a deep connection between bandit problems and graph traversal problems, since in graph traversal problems there are also requirements for “cover”, i.e., visiting all vertices. Motivated by this connection, in Section 5 of the main article, we design an intuitive policy for the general BwSC problem by leveraging ideas from the shortest Hamiltonian path problem, and give upper and lower bounds on regret that are close to each other.

The proof of Theorem 10 is based on the “tracking the cover time” argument: we first suppose that the switching patterns do not occur with certain probability, then use the “tracking the cover time” argument to establish an \tilde{\Omega}(T^{\frac{1}{2-2^{-N}}}) lower bound on the regret of \pi, which contradicts the condition in Theorem 10. We omit the detailed proof here, as the essential idea of the proof is similar to the proof of Theorem 2 in Appendix C and the proof of Theorem 4 in Appendix I.

The HS-SE policy is practical — for any given switching graph G, the policy only involves solving the shortest Hamiltonian path problem once, which can be finished offline. Thus, the computational complexity of the shortest Hamiltonian path problem does not affect the online decision-making process of the HS-SE policy at all.

Moreover, under the condition that the switching costs satisfy the triangle inequality, the shortest Hamiltonian path problem can be reduced to the celebrated metric traveling salesman problem (metric TSP), see Lawler et al. (1985). This means that we can directly apply many commercial solvers for TSP to solve (or approximately solve) the shortest Hamiltonian path problem efficiently. The reduction also indicates that any approximation algorithm designed for metric TSP can be adapted to be an approximation algorithm for the shortest Hamiltonian path problem. In particular, the celebrated Christofides algorithm for the metric TSP (Christofides 1976) can be used to compute a good approximation of H in polynomial time.
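When k is small, H can also be computed exactly with a Held-Karp-style dynamic program over subsets of actions. The following Python sketch (with illustrative names and O(k^{2}2^{k}) running time) is an alternative to the approximation route described above and is only practical for modest k.

    from itertools import combinations

    def shortest_hamiltonian_path_weight(cost):
        # cost[i][j] = c_{i,j}; returns H, the weight of the shortest Hamiltonian path
        k = len(cost)
        # dp[(mask, j)]: cheapest path visiting exactly the vertices in mask, ending at j
        dp = {(1 << j, j): 0.0 for j in range(k)}
        for size in range(2, k + 1):
            for subset in combinations(range(k), size):
                mask = sum(1 << v for v in subset)
                for j in subset:
                    dp[(mask, j)] = min(dp[(mask ^ (1 << j), i)] + cost[i][j]
                                        for i in subset if i != j)
        return min(dp[((1 << k) - 1, j)] for j in range(k))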

Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.

From round 1 to round t_{1}, the HS-SE policy incurs H switching cost.

For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}, no matter whether l is odd or even, no matter whether the last action in interval l is eliminated before the start of interval l+1 or not, by the switching order (determined by the shortest Hamiltonian path of G) and the triangle inequality, the HS-SE policy always incurs at most H switching cost.

From round t_{m(S)} to round T, since the HS-SE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the HS-SE policy incurs at most \max_{i,j\in[k]}c_{i,j} switching cost from round t_{m(S)} to round T.

Summarizing the above arguments, we find that the HS-SE policy incurs at most m(S)H+\max_{i,j\in[k]}c_{i,j}\leq S switching cost from round 1 to round T. Thus it is indeed an S-switching-budget policy.
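As in the unit-switching-cost case, the last inequality follows directly from the definition of m(S)=m_{G}^{U}(S):

m(S)=\left\lfloor\frac{S-\max_{i,j\in[k]}c_{i,j}}{H}\right\rfloor\ \Longrightarrow\ m(S)H\leq S-\max_{i,j\in[k]}c_{i,j}\ \Longrightarrow\ m(S)H+\max_{i,j\in[k]}c_{i,j}\leq S.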

We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of times action i is chosen in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as

r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},\quad\forall i\in[k],\ t\in[T].

Define the clean event as

\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],\ |\bar{\mu}_{t}(i)-\mu_{i}|\leq r_{t}(i)\}.

By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mbox{$\mathcal{D}$}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.

The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 2 can be expressed as

\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),\quad\forall l\in[m(S)+1],\ i\in[k],
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),\quad\forall l\in[m(S)+1],\ i\in[k].

Let \pi denote the HS-SE policy. First, observe that for any environment \mathcal{D},

\displaystyle R_{\mathcal{D}}^{\pi}(T)=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\bar{\mathcal{E}})
\displaystyle\leq\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}}
\displaystyle=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1), (9)

so in order to bound R^{\pi}(T)=\sup_{\mbox{$\mathcal{D}$}}R_{\mbox{$\mathcal{D}$}}^{\pi}(T), we only need to focus on the clean event.

Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval when i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval when we did not eliminate action i yet (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the HS-SE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of round \eta_{i}-1 must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,

\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-1}}(i)=4r_{t_{\eta_{i}-1}}(i), (10)

where the last equality holds because i^{*} and i are chosen an equal number of times in each interval up to interval \eta_{i}, which implies n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 2, for simplicity, we overlook the rounding issues of \frac{t_{l+1}-t_{l}}{|A_{l}|} for each interval l. Considering the rounding issues does not bring additional difficulty to our analysis, as in the policy we can always design a rounding rule to keep the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)

Since i is never chosen after the \eta_{i}-th interval, we have n_{\eta_{i}}(i)=n_{T}(i), and therefore r_{\eta_{i}}(i)=r_{T}(i).

The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the HS-SE policy and (10), we can bound this quantity as

\displaystyle R(T;i)=n_{T}(i)\Delta(i)
\displaystyle\leq 4n_{\eta_{i}}(i)\sqrt{\frac{2\log T}{n_{\eta_{i}-1}(i)}}
\displaystyle\leq C_{0}\sqrt{2\log T}\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{t_{\eta_{i}-1}/k}}
\displaystyle\leq 4C_{0}\sqrt{2\log T}\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|}

for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,

\displaystyle\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]=\sum_{i\in[k]}R(T;i)
\displaystyle\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}k(T/k)^{1/(2-2^{-m(S)})}\frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{1}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}\frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{2}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}\frac{1}{j}
\displaystyle\leq C_{3}(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus, by (9) and R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T), we have

R^{\pi}(T)\leq C(\log k\log T)k^{2-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constant C\geq 0, where m(S)=m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.\hfill\Box

Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Without loss of generality, we assume that \arg\max_{i\in[k]}(\min_{j\neq i}c_{i,j})=1, i.e., \min_{j\neq 1}c_{1,j}\geq\min_{j\neq i}c_{i,j} for all i\in[k]. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{L}(S)=\lfloor(S-\max_{i\in[k]}\min_{j\neq i}c_{i,j})/H\rfloor.

Given any k=|G|\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mbox{$\mathcal{D}$}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \mbox{\boldmath$\mu$}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.

For any environment \mu, let X_{\boldsymbol{\mu}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\boldsymbol{\mu}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.

For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\boldsymbol{\mu}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t}); then \mathbb{F}=(\mathcal{F}_{t})_{t\in[T]} is a filtration.

For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}, see detailed definitions in Chapter 15 of Wainwright (2019).

For any environment \boldsymbol{\mu} and any policy \pi\in\Pi_{S}, we make the following key definitions.

1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

  • \tau_{1}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[1:t]$}\} if the set is non-empty and \tau_{1}=\infty otherwise.

  • \tau_{2}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{1}:t]$}\} if the set is non-empty and \tau_{2}=\infty otherwise.

  • Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{j-1}:t]$}\} if the set is non-empty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1.

It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.

2. We then define a series of random variables (depending on the above stopping times).

  • S(1,\tau_{1}) is the total switching cost incurred in [1:\tau_{1}] (note that if there is a switch happening between \tau_{1} and \tau_{1}+1, we do not count its cost in S(1,\tau_{1})).

  • For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the total switching cost incurred in [\tau_{j-1}:\tau_{j}] (note that if there is a switch happening between \tau_{j-1}-1 and \tau_{j-1}, or between \tau_{j} and \tau_{j}+1, we do not count its cost in S(\tau_{j-1},\tau_{j})).

  • S(\tau_{m(S)},T) is the total switching cost incurred in [\tau_{m(S)}:T] (note that if there is a switch happening between \tau_{m(S)}-1 and \tau_{m(S)}, we do not count its cost in S(\tau_{m(S)},T)).

3. Next we define a series of events.

  • E_{1}=\{\tau_{1}>t_{1}\}.

  • For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

  • E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.

Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 2.

4. Finally we define a series of shrinking errors.

  • \Delta_{1}=1.

  • For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}; see the verification sketch after this list.)

  • \Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)

5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking values in [k], i.e., \pi_{\infty}=i with probability 1/k for each i\in[k].
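As a quick sanity check: if, consistent with the substitutions made later in this proof (see Case 2 below), t_{j}=k(T/k)^{(2-2^{1-j})/(2-2^{-m(S)})} for j\in[m(S)], then the approximation for \Delta_{j} in item 4 is in fact an identity. Indeed,

\[
\frac{1}{\sqrt{t_{j-1}}}=k^{-1/2}(T/k)^{-\frac{1-2^{1-j}}{2-2^{-m(S)}}}=k^{-1/2}(k/T)^{\frac{1-2^{1-j}}{2-2^{-m(S)}}},\qquad\text{so}\qquad\Delta_{j}=\frac{1}{k(m(S)+1)\sqrt{t_{j-1}}}\quad\text{for }j=2,\dots,m(S).
\]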

Lemma 2

For any environment \boldsymbol{\mu}, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is strictly less than }H+\bar{c}\} almost surely.

Proof of Lemma 2. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1}:\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, and (by the triangle inequality) any trajectory that visits all k actions incurs a total switching cost of at least H, the weight of the shortest Hamiltonian path, we know that S(1,\tau_{1})\geq H, S(\tau_{1},\tau_{2})\geq H, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq H. Thus we have

S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S)H.

Since \pi\in\Pi_{S}, the total switching cost over [1:T] is at most S, and we further know that

\displaystyle S(\tau_{m(S)},T)\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})]
\displaystyle\leq S-m(S)H<H+\max_{i\in[k]}\min_{j\neq i}{c_{i,j}}=H+\min_{j\neq 1}{c_{1,j}}

happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is strictly less than }H+\min_{j\neq 1}{c_{1,j}}\} almost surely. \hfill\Box
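For completeness, the strict inequality S-m(S)H<H+\max_{i\in[k]}\min_{j\neq i}c_{i,j} used in the display above follows directly from unwinding the floor in the definition of m(S):

\[
m(S)=\left\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\right\rfloor>\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}-1\quad\Longrightarrow\quad S-m(S)H<H+\max_{i\in[k]}\min_{j\neq i}c_{i,j}.
\]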

Consider a class of environments \Lambda=\{\boldsymbol{\mu}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i}\leq\frac{\Delta_{m(S)+1}}{2},\ \forall i\neq 1\}. Pick an arbitrary environment \alpha from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, since the events E_{1},\dots,E_{m(S)+1} together cover all possibilities for (\tau_{1},\dots,\tau_{m(S)}), the union bound gives

\sum_{j=1}^{m(S)+1}\mathbb{P}_{{\alpha}}^{\pi}(E_{j})\geq\mathbb{P}_{{\alpha}}^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1.

Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1). We distinguish three cases: j^{*}=1, 2\leq j^{*}\leq m(S), and j^{*}=m(S)+1.

Case 1: j^{*}=1. In this case \mathbb{P}_{{\alpha}}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})\geq 1/(m(S)+1). Since

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have

\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k]; i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] only depends on the policy \pi and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}; then we have

\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}. (11)

We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for all i\in[k]. Let \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}; then we have

\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]). (12)

But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (11) and (12) we have

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.

However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]\geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1/(2-2^{-m(S)})}.
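To see the last inequality, a short verification sketch (assuming t_{1}=k(T/k)^{1/(2-2^{-m(S)})}, consistent with the values of t_{j} substituted in Case 2 below, and t_{1}\geq 2 so that t_{1}-1\geq t_{1}/2):

\[
\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{t_{1}}{4k(m(S)+1)}=\frac{(T/k)^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)}=\frac{k^{-\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)}T^{\frac{1}{2-2^{-m(S)}}}.
\]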

Case 2: 2\leq j^{*}\leq m(S). In this case \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1). Since

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have

\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]; i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] only depends on the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}; then we have

\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}. (13)

We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]. Let \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi, the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}; then we have

\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]). (14)

We now bound the difference between the left-hand side (LHS) of (13) and the LHS of (14). We have

\displaystyle|\text{LHS in }(13)-\text{LHS in }(14)|
\displaystyle\leq \displaystyle D_{\mathrm{TV}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)
\displaystyle\leq \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)}
\displaystyle\leq \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\right)}
\displaystyle= \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}
\displaystyle= \displaystyle\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}\right)^{2}}{2}\right]}
\displaystyle\leq \displaystyle\frac{\sqrt{t_{j^{*}-1}}\,\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)},

where the first inequality is by the definition of the total variation distance between two probability measures, the second inequality is by Pinsker’s inequality, and the third inequality is by the data-processing inequality in information theory; the two equalities follow from the chain rule for the KL divergence (the two measures differ only in the rewards of action i^{\prime} before round t_{j^{*}-1}) and the closed-form KL divergence between unit-variance Gaussian distributions.
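The final step can be checked directly, using the identity \Delta_{j^{*}}=\frac{1}{k(m(S)+1)\sqrt{t_{j^{*}-1}}} for 2\leq j^{*}\leq m(S) noted in the sanity check following the definitions above:

\[
\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\Delta_{j^{*}}^{2}}{2}\right]}\leq\frac{\sqrt{t_{j^{*}-1}}\,\Delta_{j^{*}}}{2}=\frac{1}{2k(m(S)+1)}.
\]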

Combining the above inequality with (13) and (14), we have

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}.

However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} will incur at least a \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] indicates that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have

\displaystyle R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{*}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]
\displaystyle\geq \displaystyle\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)}}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}\left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{-m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-m(S)}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right).

When m(S)\leq\log_{2}\log_{2}(T/k), we have

(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T/k}(2)}}=\frac{1}{2}.

Thus we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

when m(S)\leq\log_{2}\log_{2}(T/k).

Case 3: j^{*}=m(S)+1. In this case \mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1). Since

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(S)+1)}.

Thus either

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}, (15)

or

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}. (16)

If (15) holds, then we consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and all other components are the same as \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (15) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those in Case 2, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}

and

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Now we consider the case that (16) holds. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the total switching cost incurred in [\tau_{m(S)}:T] is strictly less than H+\min_{j\neq 1}{c_{1,j}}. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the total switching cost incurred in [\tau_{m(S)}:\tau_{m(S)+1}] is at least H. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}.

Suppose that i^{\prime}\neq 1. Then the event \mathcal{E}_{i^{\prime}}=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\} implies that action 1 is not chosen in [\tau_{m(S)+1}:T], since switching into action 1 from any other action would incur a switching cost of at least \min_{j\neq 1}{c_{1,j}}, violating the requirement that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 will incur at least a \Delta_{m(S)+1}/4 term in regret. As a result, we know that

R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{4}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Thus we only need to consider the sub-case i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} only depends on the policy \pi, the random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:{(t_{m(S)}+T)}/{2}]} for i\neq 1. Consider a new environment \beta such that its first component is \alpha_{1}-\Delta_{m(S)+1} and all other components are the same as \alpha. Using arguments analogous to those in Case 2, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\,\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}.
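Here the middle inequality uses \frac{\sqrt{t_{m(S)}}\,\Delta_{m(S)+1}}{2}\leq\frac{1}{4k(m(S)+1)}; as a quick check (assuming t_{m(S)}=k(T/k)^{(2-2^{1-m(S)})/(2-2^{-m(S)})}, consistent with the values of t_{j} substituted in Case 2), this in fact holds with equality:

\[
\sqrt{t_{m(S)}}=k^{1/2}(T/k)^{\frac{1-2^{-m(S)}}{2-2^{-m(S)}}},\qquad\text{so}\qquad\frac{\sqrt{t_{m(S)}}\,\Delta_{m(S)+1}}{2}=\frac{1}{2}\cdot\frac{1}{2k(m(S)+1)}=\frac{1}{4k(m(S)+1)}.
\]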

However, action 1 is the worst action in environment \beta, and each round in which action 1 is chosen incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. Since switching from action 1 to any other action incurs a cost of at least \min_{j\neq 1}{c_{1,j}}, the event \mathcal{E}_{1} actually implies that action 1 is chosen in every round from round \tau_{m(S)+1} (\leq\frac{t_{m(S)}+T}{2}) to round T, i.e., action 1 is continuously chosen in the last (T-\frac{t_{m(S)}+T}{2}+1) rounds. As a result, we know that

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Combining Cases 1, 2 and 3, we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with infinite switching budget) is \Omega(\sqrt{kT}), we know that

R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT}

for some absolute constant C>0. To sum up, we have

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}

for some absolute constant C>0, where m(S)=m_{G}^{L}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\rfloor. \hfill\Box
