
Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints

David Simchi-Levi

Institute for Data, Systems and Society, Department of Civil and Environmental Engineering, and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu

Yunzong Xu

Institute for Data, Systems and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, yxu@mit.edu

We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. We prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.

 

The multi-armed bandit (MAB) problem is one of the most fundamental problems in online learning, with diverse applications ranging from pricing and online advertising to clinical trials. Over the past several decades, it has been a very active research area spanning different disciplines, including computer science, operations research, statistics and economics.

In a traditional multi-armed bandit problem, the learner (i.e., decision-maker) is allowed to switch freely between actions, and an effective learning policy may incur frequent switching — indeed, the learner’s task is to balance the exploration-exploitation trade-off, and both exploration (i.e., acquiring new information) and exploitation (i.e., optimizing decisions based on up-to-date information) require switching. However, in many real-world scenarios, it is costly to switch between different alternatives, and a learning policy with limited switching behavior is preferred. The learner thus has to consider the cost of switching in her learning task.

There is rich literature studying stochastic MAB with switching costs. Most of the papers model the switching cost as a penalty in the learner’s objective, i.e., they measure a policy’s regret and incurred switching cost using the same metric and the objective is to minimize the sum of these two terms (e.g., Agrawal et al. (1988, 1990), Brezzi and Lai (2002), Cesa-Bianchi et al. (2013); there are other variations with discounted rewards Banks and Sundaram (1994), Asawa and Teneketzis (1996), Bergemann and Välimäki (2001), see Jun (2004) for a survey).

Though this conventional “switching penalty” model has attracted significant research interest in the past, it has two limitations.

First, under this model, the learner’s total switching cost is purely an output of the learning algorithm. However, in many real-world applications, there are strict limits on the learner’s switching behavior, which should be modeled as a hard constraint, and hence the learner’s total budget of switching cost should be an input that helps determine the algorithm. In particular, while the algorithm in Cesa-Bianchi et al. (2013) developed for the “switching penalty” model can achieve \tilde{O}(\sqrt{T}) (distribution-free) regret with O(\log\log T) switches, if the learner wants a policy that always incurs a finite switching cost independent of T, prior literature does not provide an answer.

Second, the “switching penalty” model has a fundamental weakness when it comes to studying the trade-off between regret and incurred switching cost in stochastic MAB: since the O(\log\log T) bound on the incurred switching cost of a policy is negligible compared with the \tilde{O}(\sqrt{T}) bound on its optimal regret, when adding the two terms up, the term associated with incurred switching cost is always dominated by the regret, and thus no trade-off can be identified. As a result, to the best of our knowledge, prior literature has not characterized the fundamental trade-off between regret and incurred switching cost in stochastic MAB.

In this paper, we introduce the Bandits with Switching Constraints (BwSC) problem. The BwSC model addresses the issues associated with the “switching penalty” model in several ways.

First, it introduces a hard constraint on the total switching cost, making the switching budget an input to learning policies and enabling us to design good policies that guarantee limited switching cost. While O(\log\log T) switches have proven to be sufficient for a learning policy to achieve near-optimal regret in MAB, in BwSC we are mostly interested in the setting of a finite or o(\log\log T) switching budget, which is highly relevant in practice.

Second, by focusing on rewards in the objective function and incurred switching cost in the switching constraint, the BwSC framework enables the characterization of the fundamental trade-off between regret and maximum incurred switching cost in MAB.

Third, while most prior research assumes specific structures on switching costs (e.g., unit or homogeneous costs), in reality, switching between different pairs of actions may incur heterogeneous costs that do not follow any parametric form. The BwSC model allows general switching costs, which makes it a powerful modeling framework.

The BwSC framework has numerous applications, including dynamic pricing, online assortment optimization, online advertising, clinical trials and vehicle routing. A representative example is the dynamic pricing problem. Dynamic pricing with demand learning has proven its effectiveness in online retailing. However, it is well known that in practice, sellers often face business constraints that prevent them from conducting extensive price experimentation and making frequent price changes. For example, according to Cheung et al. (2017), Groupon limits the number of price changes, either because of implementation constraints, or for fear of confusing customers and receiving negative customer feedback. In such scenarios, the seller’s sequential decision-making problem can be modeled as a BwSC problem, where changing from one price to another incurs some cost, and there is a limit on the total cost incurred by price changes.

The paper’s contributions are along four dimensions.

On the modeling side, we introduce the BwSC model, a general framework with strong modeling power. The model overcomes the limitations of the prior “switching penalty” model and has both practical and theoretical value.

The second dimension of contribution lies in the analysis. For the unit-switching-cost BwSC problem (studied in detail below), we obtain an upper bound on regret by proposing a simple and intuitive policy with carefully designed switching rules, and we prove a strong information-theoretic lower bound that matches this upper bound, indicating that our policy is rate-optimal up to logarithmic factors. Methodologically, the proof of the lower bound involves a novel “tracking the cover time” argument that has not appeared in prior literature and may be of independent interest.

With the analysis described above, we obtain a series of surprising and insightful results for both BwSC and MAB. Among the most important discoveries are the phase transitions and cyclic phenomena exhibited by the optimal regret in BwSC and MAB. That is, we show that associated with these problems there are equal-length phases, defined by the number of arms and the switching costs, such that the regret upper and lower bounds remain the same within each phase and drop significantly between phases (precise definitions are given below). The tight regret bounds in BwSC also motivate new insights into the classical MAB problem. In particular, we show that \Theta(\log\log T) switches are necessary and sufficient to achieve near-optimal regret in MAB.

Finally, we study the general-switching-cost BwSC problem. We make a conceptual contribution by revealing a deep connection between bandit problems and graph traversal problems. In fact, we characterize some important switching patterns associated with any effective learning policy in MAB, which in turn lead to regret upper and lower bounds for the general BwSC problem.

For all n_{1},n_{2}\in\mathbb{N} such that n_{1}\leq n_{2}, we use [n_{1}] to denote the set \{1,\dots,n_{1}\}, and use [n_{1}:n_{2}] (resp. (n_{1}:n_{2}]) to denote the set \{n_{1},n_{1}+1,\dots,n_{2}\} (resp. \{n_{1}+1,\dots,n_{2}\}). Throughout the paper, we will use the big O notation to hide constant factors, and use the \tilde{O} notation to hide constant factors and logarithmic factors.

Consider a k-armed bandit problem where a learner chooses actions from a fixed set [k]=\{1,\dots,k\}. There is a total of T rounds. In each round t\in[T], the learner first chooses an action i_{t}\in[k], then observes a reward r_{t}(i_{t})\in\mathbb{R}. For each action i\in[k], the rewards of action i are drawn i.i.d. from an (unknown) distribution \mathcal{D}_{i} with (unknown) expected value \mu_{i}. We assume that the distributions \mathcal{D}_{i} are standardized sub-Gaussian. Without loss of generality, we assume \sup_{i,j\in[k]}|\mu_{i}-\mu_{j}|\in[0,1].

In our problem, the learner incurs a switching cost c_{i,j}=c_{j,i}\geq 0 each time she switches between action i and action j (i,j\in[k]). In particular, c_{i,i}=0 for i\in[k]. There is a pre-specified switching budget S\geq 0 representing the maximum amount of switching costs that the learner can incur in total. Once the total switching cost exceeds the switching budget S, the learner cannot switch her actions any more. The learner’s goal is to maximize the expected total reward over T rounds.

Let \pi denote the learner’s (non-anticipating) learning policy, and \pi_{t}\in[k] denote the action chosen by policy \pi at round t\in[T]. More formally, \pi_{t} establishes a probability kernel acting from the space of historical actions and observations to the space of actions at round t. Let \mathbb{P}^{\pi}_{\mathcal{D}} and \mathbb{E}^{\pi}_{\mathcal{D}} be the probability measure and expectation induced by policy \pi and latent distributions \mathcal{D}=(\mathcal{D}_{1},\dots,\mathcal{D}_{k}). According to the problem formulation, we only need to restrict our attention to the S-switching-budget policies defined below. (Note that here we do not make any assumption on the learner’s behavior. In particular, we do not require the learner to intentionally pick an S-switching-budget policy — the switching constraint makes the learner’s policy automatically equivalent to an S-switching-budget policy.)

Definition 1

A policy \pi is said to be an S-switching-budget policy if for all \mathcal{D} and T\geq 1,

\mathbb{P}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T-1}c_{\pi_{t},\pi_{t+1}}\leq S\right]=1.

Let \Pi_{S} denote the set of all S-switching-budget policies, which is also the admissible policy class of the BwSC problem.
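As a small illustration of Definition 1 and the admissible class \Pi_{S}, the following minimal Python sketch checks whether a realized action sequence respects a switching budget; the function name, the cost matrix and the example sequences are our own hypothetical choices, not part of the model.

def is_budget_feasible(actions, c, S):
    """Check whether an action sequence respects the switching budget S.

    actions: list of chosen arms (0-indexed); c: symmetric k x k cost matrix
    with c[i][i] = 0; S: switching budget.
    """
    total_cost = sum(c[actions[t]][actions[t + 1]] for t in range(len(actions) - 1))
    return total_cost <= S

# Hypothetical example: k = 3 arms with unit switching costs.
c = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
print(is_budget_feasible([0, 0, 1, 1, 2, 2], c, S=2))  # True: exactly 2 switches
print(is_budget_feasible([0, 1, 0, 1], c, S=2))        # False: 3 switches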

The performance of a learning policy is measured against a clairvoyant policy that maximizes the expected total reward given foreknowledge of the environment (i.e., latent distributions) \mathcal{D}. Let \mu^{*}=\max_{i\in[k]}\mu_{i}. We define the regret of policy \pi as the worst-case difference between the expected performance of the optimal clairvoyant policy and the expected performance of policy \pi:

R^{\pi}(T)=\sup_{\mathcal{D}}\left\{T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right]\right\}.

The minimax (optimal) regret of BwSC is defined as R_{S}^{*}(T)=\inf_{\pi\in\Pi_{S}}R^{\pi}(T).

In our paper, when we say a policy is “near-optimal” or “optimal up to logarithmic factors”, we mean that its regret bound is optimal in T up to logarithmic factors of T, irrespective of whether the bound is optimal in k, since typically k is much smaller than T. Still, our derived bounds are actually quite tight in k.

Remark. There are two notions of regret in the stochastic bandit literature. The R^{\pi}(T) regret that we consider is called distribution-free, as it does not depend on \mathcal{D}. On the other hand, one can also define the distribution-dependent regret R_{\mathcal{D}}^{\pi}(T)=T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right] that depends on \mathcal{D}. This second notion of regret is only meaningful when \mu_{1},\dots,\mu_{k} are well-separated. Unlike the classical MAB problem where there are policies simultaneously achieving near-optimal bounds under both regret notions, in the BwSC problem, due to the limited switching budget, finding a policy that simultaneously achieves near-optimal bounds under both regret notions is usually impossible. Thus in the main body of the paper, we focus on the distribution-free regret. However, in Appendix A, we extend our results to the distribution-dependent regret.

BwSC and MAB share the same definition of R^{\pi}(T); the only difference between BwSC and MAB is the existence of the switching constraint \pi\in\Pi_{S}, determined by (c_{i,j})\in\overline{\mathbb{R}}_{\geq 0}^{k\times k} and S\in\overline{\mathbb{R}}_{\geq 0} (when S=\infty, BwSC degenerates to MAB). This makes BwSC a natural framework for studying the trade-off between regret and incurred switching cost in MAB. That is, the trade-off between the optimal regret R_{S}^{*}(T) and the switching budget S in BwSC completely characterizes the trade-off between a policy’s best achievable regret and its worst possible incurred switching cost in MAB. We are interested in how R_{S}^{*}(T) behaves over a range of switching budgets S, and how it is affected by the structure of the switching costs (c_{i,j}).

This paper is not the first one to study online learning problems with limited switches. Indeed, a few authors have realized the practical significance of limited switching budget. For example, Cheung et al. (2017) consider a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy is allowed to make at most m price changes. Their constraint on the total number of price changes is motivated by collaboration with Groupon, a major e-commerce marketplace in North America. In such an environment, Groupon limits the number of price changes, either because of implementation constraints, or for fear of confusing customers and receiving negative customer feedback. They propose a pricing policy that guarantees O(\log^{(m)}T) (or m iterations of the logarithm) regret with at most m price changes, and report that in a field experiment, this pricing policy with a single price change increases revenue and market share significantly. Chen and Chao (2019) study a multi-period stochastic inventory replenishment and pricing problem with unknown demand and limited price changes. Assuming that the demand function is drawn from a parametric class of functions, they develop a finite-price-change policy based on maximum likelihood estimation that achieves optimal regret.

We note that both Cheung et al. (2017) and Chen and Chao (2019) focus only on specific decision-making problems, and their results rely on strong assumptions about the unknown environment. Cheung et al. (2017) assume a known finite set of potential demand functions, and require the existence of discriminative prices that can efficiently differentiate all potential demand functions. Chen and Chao (2019) assume a known parametric form of the demand function, and also require a well-separatedness condition. By contrast, the BwSC model in our paper is generic and assumes no prior knowledge of the environment. The learning task in the BwSC problem is thus more challenging than in previous models. Also, the switching constraint in the BwSC problem is more general than the price-change constraints in previous models.

In the Bayesian bandit setting, Guha and Munagala (2013) study the “bandits with metric switching costs” problem that allows a constraint involving metric switching costs. Using competitive ratio as the performance metric and assuming Bayesian priors, they develop a 4-approximation algorithm for the problem. The competitive ratio is measured against an optimal online policy that does not know the true distributions. As pointed out by the authors, the optimal online policy can be directly determined by a dynamic program. So the main issue in their model is a computational one. Our work is different, as we are using regret as our performance metric, and we are competing with an optimal clairvoyant policy that knows the true distributions — a much stronger benchmark. Our problem thus involves both statistical and computational challenges. In fact, the algorithm in Guha and Munagala (2013) cannot avoid a linear regret when applied to the BwSC problem.

In the adversarial bandit setting, Altschuler and Talwar (2018) study the adversarial MAB problem with a limited number of switches, which can be viewed as an adversarial counterpart of the BwSC problem with unit switching costs (studied below). For any policy that makes no more than S\leq T switches, they prove that the optimal regret is \tilde{\Theta}(T\sqrt{k}/\sqrt{S}). Since we consider a different setting (our problem is stochastic while theirs is adversarial), the methodologies and results in our paper are fundamentally different from theirs. In particular, while finite-switch policies cannot avoid linear regret in the adversarial setting, in the stochastic setting finite switches already suffice to guarantee sublinear regret. Moreover, while the optimal regret in Altschuler and Talwar (2018) decreases smoothly as S increases from 0 to T, in the stochastic setting we identify very surprising behavior of the optimal regret as S increases from 0 to \Theta(\log\log T), which, to the best of our knowledge, has never been identified in the bandit literature before.

The BwSC problem is also related to the batched bandit problem proposed by Perchet et al. (2016). The M-batched bandit problem is defined as follows: given a classical bandit problem, the learner must split her learning process into M batches and is only able to observe data (i.e., realized rewards) from a given batch after the entire batch is completed. This implies that all actions within a batch are determined at the beginning of that batch. Here M can be viewed as a quantity measuring the learner’s adaptivity, i.e., her ability to learn from her data and adapt to the environment. An M-batch policy is defined as a policy that observes realized data only M-1 times over the entire horizon. Perchet et al. (2016) study the problem in the case of two arms, and prove that the optimal regret for the M-batched bandit problem is \tilde{\Theta}(T^{1/(2-2^{1-M})}). Very recently, Gao et al. (2019) extend these results to general k arms.

On the surface, the batched bandit problem and the BwSC problem seem like two totally different problems: the batched bandit problem limits observation and allows unlimited switching, while the BwSC problem limits switching and allows unlimited observation. Surprisingly, in this paper, we discover some deep and non-trivial connections between the batched bandit problem and the unit-switching-cost BwSC problem. These connections are discussed further below.

In this section, we consider the BwSC problem with unit switching costs, where c_{i,j}=1 for all i\neq j. In this case, since every switch incurs a unit cost, the switching budget S can be interpreted as the maximum number of switches that the learner can make in total. Thus, the unit-switching-cost BwSC problem can be simply interpreted as “MAB with limited number of switches”.

We first propose a simple and intuitive policy that provides an upper bound on the regret. Our policy, called the S-Switch Successive Elimination (SS-SE) policy, is described in Algorithm 1. The design philosophy behind the SS-SE policy is to divide the entire horizon into several pre-determined intervals (i.e., batches) and to control the number of switches in each interval. The policy thus has some similarities with the 2-armed batched policy of Perchet et al. (2016) and the k-armed batched policy of Gao et al. (2019), which have been shown to be near-optimal in the batched bandit problem. However, since we are studying a different problem, directly applying a batched policy to the BwSC problem does not work. In particular, in the batched bandit problem, the number of intervals (i.e., batches) is a given constraint, while in the BwSC problem, the switching budget is the given constraint. We thus add two key ingredients to the SS-SE policy: (1) an index m(S) suggesting how many intervals should be used to partition the entire horizon; (2) a switching rule ensuring that the total number of switches among the k actions cannot exceed the switching budget S. These two ingredients make the SS-SE policy substantially different from an ordinary batched policy.

Algorithm 1 S-Switch Successive Elimination (SS-SE)

Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
    Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}], where the endpoints are defined by t_{0}=1 and

t_{i}=\lfloor k^{1-\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}T^{\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}\rfloor,\quad\forall i=1,\dots,m(S)+1.

Initialization: Let the set of all active actions in the l-th interval be A_{l}. Set A_{1}=[k].
Policy:

1:  for l=1,\dots,m(S) do
2:     if a_{t_{l-1}}\in A_{l} then
3:        Let a_{t_{l-1}+1}=a_{t_{l-1}}. Starting from this action, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
4:     else if a_{t_{l-1}}\notin A_{l} then
5:        Starting from an arbitrary active action in A_{l}, choose each action in A_{l} for \frac{t_{l}-t_{l-1}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
6:     end if
7:     Statistical test: deactivate all actions i s.t. \exists action j with \mathtt{UCB}_{t_{l}}(i)<\mathtt{LCB}_{t_{l}}(j), where
\texttt{UCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}},
\texttt{LCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}.
8:  end for
9:  In the last interval, choose the action with the highest empirical mean (up to round t_{m(S)}).

Intuition about the Policy. The policy divides the T rounds into \lfloor\frac{S-1}{k-1}\rfloor+1 intervals in advance. The sizes of the intervals are designed to balance the exploration-exploitation trade-off. An active set of “good” actions A_{l} is maintained for each interval l and at the end of each interval some “bad” actions are eliminated before the start of the next interval. The policy controls the number of switches by ensuring that only |A_{l}|-1 switches happen within each interval l and at most one switch happens between two consecutive intervals. Finally, in the last interval only the empirical best action is chosen.
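To fix ideas, the following minimal Python sketch mimics the SS-SE policy. It is our own illustration, not the paper's implementation: the function ss_se, the reward simulator pull and the Gaussian example are hypothetical, rounding of the interval lengths is handled only crudely, and the final committed interval is not simulated (the sketch simply returns the arm that would be played there).

import numpy as np

def ss_se(k, S, T, pull):
    """Sketch of S-Switch Successive Elimination: batched elimination with a switching rule."""
    m = (S - 1) // (k - 1)                       # m(S) = floor((S-1)/(k-1))
    a = lambda i: (2 - 2.0 ** (-(i - 1))) / (2 - 2.0 ** (-m))
    t = [1] + [int(k ** (1 - a(i)) * T ** a(i)) for i in range(1, m + 2)]
    t[-1] = T                                    # the grid ends at round T
    sums, plays = np.zeros(k), np.zeros(k, dtype=int)
    active, last = list(range(k)), None
    for l in range(1, m + 1):                    # intervals 1, ..., m(S)
        per_arm = max((t[l] - t[l - 1]) // len(active), 1)
        # start from the previous interval's last arm if it is still active
        order = ([last] + [i for i in active if i != last]) if last in active else list(active)
        for arm in order:                        # at most |A_l| - 1 switches inside the interval
            for _ in range(per_arm):
                sums[arm] += pull(arm)
                plays[arm] += 1
        last = order[-1]
        # statistical test: keep arm i only if UCB(i) >= max_j LCB(j)
        rad = np.sqrt(2 * np.log(T) / np.maximum(plays, 1))
        mean = sums / np.maximum(plays, 1)
        best_lcb = max(mean[i] - rad[i] for i in active)
        active = [i for i in active if mean[i] + rad[i] >= best_lcb]
    mean = sums / np.maximum(plays, 1)
    return max(active, key=lambda i: mean[i])    # arm committed to in the last interval

# Hypothetical usage: 3 Gaussian arms with means 0.5, 0.6, 0.7 and switching budget 7.
rng = np.random.default_rng(0)
means = [0.5, 0.6, 0.7]
print(ss_se(k=3, S=7, T=100000, pull=lambda i: rng.normal(means[i], 1.0)))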

We show that the SS-SE policy is indeed an S-switching-budget policy and establish the following upper bound on its regret; see Appendix B for a proof.

Theorem 1

Let \pi be the SS-SE policy, then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0 and T\geq k,

R^{\pi}(T)\leq C(\log k\log T)\,k^{1-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}},

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Theorem 1 provides an upper bound on the optimal regret of the unit-switching-cost BwSC problem:

R^{*}_{S}(T)=\tilde{O}\left(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}\right).
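As a quick worked example of this bound (using only the formula above): with k=11 actions and a switching budget of S=20,

m(S)=\left\lfloor\frac{20-1}{11-1}\right\rfloor=1,\qquad R^{*}_{S}(T)=\tilde{O}\left(T^{1/(2-2^{-1})}\right)=\tilde{O}(T^{2/3}),

while S=31 would give m(S)=3 and the bound \tilde{O}(T^{8/15}).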

The SS-SE policy, though it achieves sublinear regret, seems to have several limitations that could weaken its performance, and on the surface this may suggest that the regret bound is not optimal. Specifically:

  • The SS-SE policy abandons some of its switching budget directly. Consider the case of 11 actions and a switching budget of 20. The SS-SE policy directly gives up 9 units of switching budget — it runs as if it could make only 11 switches. Intuitively, an effective learning policy should treasure its switching budget. It seems that by making full use of the switching budget, one can achieve much lower regret.

  • The SS-SE policy runs on pre-determined intervals. The sizes and locations of intervals have nothing to do with observed data or real-time data. It seems that by using data-driven intervals that are determined by observed and real-time data, one can achieve much lower regret.

  • The number of intervals, m(S)+1=\lfloor(S-1)/(k-1)\rfloor+1, is pre-determined by the worst case, i.e., as if no actions would be eliminated in any interval. Since SS-SE is a “successive elimination” policy, the actual number of actions that it needs to consider should shrink as the policy runs, and we should be able to use many more than m(S)+1 intervals. It seems that by tracking the size of the active set A_{l} and adaptively determining the number of intervals, one can achieve much lower regret (note that by Theorem 1, the regret decreases dramatically as m(S) increases).

  • More importantly, the idea of designing S-switching-budget policies based on (m(S)+1)-interval policies (or, in the words of Perchet et al. (2016), (m(S)+1)-batch policies) itself seems questionable. Note that the SS-SE policy runs deterministically within each interval based on a pre-determined schedule, and it only learns from data at the end of each interval, for at most \lfloor(S-1)/(k-1)\rfloor times — in the case of 11 actions and a switching budget of 20, the SS-SE policy splits the entire horizon into two intervals and only learns at the end of the first interval, after which it chooses a single action to be applied throughout the entire second interval. (In fact, the SS-SE policy allocates its switching budget before seeing any data, and does not try to save switches after data becomes available, which means that data is not utilized for saving switches.) Intuitively, data should be utilized to save switches, and one would expect that an effective policy has a high degree of adaptivity, that is, it learns from the available data and adapts to the environment more frequently than our policy does. Put differently, it seems that by utilizing full adaptivity and learning from data in every round, one can achieve much lower regret.

  • Besides the above limitations, the \tilde{O}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}) bound provided by the SS-SE policy also looks a little clumsy. In particular, the m(S)=\lfloor(S-1)/(k-1)\rfloor term looks like an artificial quantity (it is intentionally designed to fit the switching rule in SS-SE), and does not look like a natural term that should appear in the true optimal regret R^{*}_{S}(T).

While the above arguments are based on our first instinct and seem very reasonable, surprisingly, all of them prove to be wrong: no S-switching-budget policy can do better! In fact, we match the upper bound provided by SS-SE by showing a strong information-theoretic lower bound in Theorem 2. This indicates that the SS-SE policy is indeed rate-optimal up to logarithmic factors, and R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}). Note that the tightness in T is achieved for every instance, i.e., for every k and every S. That is, our lower bound is substantially stronger than demonstrating specific k and S for which the upper bound cannot be improved.

Theorem 2

There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq k and for all policies \pi\in\Pi_{S},

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Proof Idea. Our proof involves a novel “tracking the cover time” argument that (to the best of our knowledge) has not appeared in previous lower-bound proofs in the bandit literature. Specifically, we track a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)+1}, some of which may be \infty, that are recursively defined as follows: \tau_{1} is the first time that all the actions in [k] have been chosen in period [1:\tau_{1}], \tau_{2} is the first time that all the actions in [k] have been chosen in period [\tau_{1}:\tau_{2}], and generally, \tau_{i} is the first time that all the actions in [k] have been chosen in period [\tau_{i-1}:\tau_{i}], for i=2,\dots,m(S)+1. The structure of the series is carefully designed, enabling the realization of any two consecutive stopping times \tau_{i-1},\tau_{i} to convey the important message that there exists a specific (possibly unknown) action that has never been chosen in period [\tau_{i-1}:\tau_{i}-1]. This information in turn helps us to bound the difference of several key probabilities and derive the desired lower bound via information-theoretic arguments. For a complete proof of Theorem 2, see Appendix C.

Combining Theorem 1 and Theorem 2, we have

Corollary 1

For any fixed k, for any S,

R^{*}_{S}(T)=\tilde{\Theta}\left(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}\right).

Remark. We briefly explain why the upper and lower bounds in Theorem 1 and Theorem 2 match in T. When m(S)\leq\log_{2}\log_{2}(T/k), which is the case we are mostly interested in, (m(S)+1)^{2}=o(\log T), thus the upper and lower bounds match within o((\log T)^{2}). When m(S)>\log_{2}\log_{2}(T/k), the upper bound is O({\sqrt{T}\log T}), thus the upper and lower bounds directly match within O(\log T). We also argue that the slightly different terms of k appearing in the upper and lower bounds do not play an important role. In fact, the gap associated with k between the upper and lower bounds is O(\min\{k^{2.5},(T/k)^{m(S)-0.5}\}). Since we are mostly interested in the case of k<<T (or k=O(1)), the O(k^{2.5}) gap is not important relative to T.

Corollary 1 allows us to characterize the trade-off between the switching budget S and the optimal regret R^{*}_{S}(T). To illustrate this trade-off, Figure 1 and Table 1 depict the behavior of R^{*}_{S}(T) as a function of S for a fixed k. Note that, as discussed earlier, the relationship between R^{*}_{S}(T) and S also characterizes the inherent trade-off between regret and the maximum number of switches in the classical MAB problem.

Figure 1: An Illustration of the Switching Budget-Regret Trade-off.

We observe several surprising phenomena regarding the trade-off between S and R_{S}^{*}(T).

Phase Transitions. As we have shown, R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}). To the best of our knowledge, this is the first time that a floor function naturally arises in the order of T in the optimal regret of an online learning problem. As a direct consequence of this floor function, the optimal regret of BwSC exhibits surprising phase transitions, described below.

Definition 2

(Phases and Critical Points) For a k-armed unit-cost BwSC (or a k-armed MAB), we call the interval [(j-1)(k-1)+1,j(k-1)+1) the j-th phase, and call j(k-1)+1 the j-th critical point (j\in\mathbb{Z}_{>0}).

Fact 1

(Phase Transitions) As S increases from 0 to \Theta(\log\log T), S leaves the j-th phase and enters the (j+1)-th phase at the j-th critical point (j\in\mathbb{Z}_{>0}). Each time S arrives at a critical point, R_{S}^{*}(T) drops significantly, and it stays at the same level until S arrives at the next critical point.

Phase transitions are clearly visible in Figure 1. This phenomenon seems counter-intuitive, as it suggests that increasing the switching budget does not help to decrease the best achievable regret as long as the budget does not reach the next critical point.

Note that phase transitions are only exhibited when S is in the range of 0 to \Theta(\log\log T). After S exceeds \Theta(\log\log T), R_{S}^{*}(T) remains at the level of \tilde{\Theta}(\sqrt{T}) — the optimal regret only varies within logarithmic factors and there is no significant regret drop any more. Therefore, one can also view \Theta(\log\log T) as a “final critical point” that marks the disappearance of phase transitions. This additional “final phase transition” reveals the subtle and intriguing nature of phase transitions in BwSC.

Table 1: Regret as a Function of Switching Budget
S [0,k) [k,2k-1) [2k-1,3k-2) [3k-2,4k-3) [4k-3,5k-4) [5k-4,6k-5)
R_{S}^{*}(T) \tilde{\Theta}(T) \tilde{\Theta}(T^{2/3}) \tilde{\Theta}(T^{4/7}) \tilde{\Theta}(T^{8/15}) \tilde{\Theta}(T^{16/31}) \tilde{\Theta}(T^{32/63})
R_{S}^{*}(T)/R_{\infty}^{*}(T) \tilde{\Theta}(T^{1/2}) \tilde{\Theta}(T^{1/6}) \tilde{\Theta}(T^{1/14}) \tilde{\Theta}(T^{1/30}) \tilde{\Theta}(T^{1/62}) \tilde{\Theta}(T^{1/126})
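The entries of Table 1 follow directly from Corollary 1: the exponent of T is 1/(2-2^{-m}) with m=\lfloor(S-1)/(k-1)\rfloor, and the ratio exponent is 1/(2-2^{-m})-1/2. A short Python check (ours, purely for verification):

from fractions import Fraction

for m in range(6):
    regret_exp = 1 / (2 - Fraction(1, 2 ** m))    # exponent of T in R_S^*(T)
    ratio_exp = regret_exp - Fraction(1, 2)       # exponent of T in R_S^*(T)/R_infty^*(T)
    print(f"m = {m}:  T^({regret_exp})   ratio T^({ratio_exp})")
# Prints T^1, T^(2/3), T^(4/7), T^(8/15), T^(16/31), T^(32/63)
# with ratio exponents 1/2, 1/6, 1/14, 1/30, 1/62, 1/126.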

Cyclic Phenomena. Along with phase transitions, we also observe the following phenomena.

Fact 2

(Cyclic Phenomena) The length of each phase is always equal to k-1, independent of S and T. We call this common length k-1 the budget cycle.

The cyclic phenomena indicate that if the learner’s switching budget is at a critical point, then the extra switching budget that she needs in order to achieve the next regret drop (i.e., to arrive at the next critical point) is always k-1. The cyclic phenomena also seem counter-intuitive: when the learner has more switching budget, she can conduct more experiments and statistical tests, eliminate more bad actions (which can be thought of as reducing k) and allocate her switching budget in a more flexible way — all of this suggests that the budget cycle should be a quantity decreasing in S. However, the cyclic phenomena tell us that the budget cycle is always a constant, and no learning policy in the unit-cost BwSC (and in MAB) can escape this cycle, no matter how large S is, as long as S=o(\log\log T).

On the other hand, as S contains more and more budget cycles, the gap between R_{S}^{*}(T) and R_{\infty}^{*}(T)=\tilde{\Theta}(\sqrt{T}) does decrease dramatically. In fact, R_{S}^{*}(T) decreases doubly exponentially fast as S contains more budget cycles. For example, when S contains more than 2 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{4/7}); and when S contains more than 3 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{8/15}). From both Figure 1 and Table 1, we can verify that 3 or 4 budget cycles are already enough for an S-switching-budget policy to achieve close-to-optimal regret in MAB (compared with the optimal policy with unlimited switching budget).

To sum up, the above analysis generates both “positive” and “negative” insights for decision-makers who face BwSC-type problems. On the one hand, the unavoidable phase transitions and cyclic phenomena reveal fundamental limits imposed by the switching constraint, making it hopeless for decision-makers to reduce regret within each phase. On the other hand, once decision-makers have enough switching budget to bring them to a new phase, they enjoy a substantial regret drop. In particular, 3 or 4 budget cycles are already enough to guarantee extraordinary regret performance.

The lower bound in Theorem 2 also leads to new results for the classical MAB problem.

Corollary 2

(The switching complexity of MAB) For the k-armed bandit problem, N(k-1)+1 switches are necessary and sufficient for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret for any fixed N\in\mathbb{Z}_{>0}, and \Theta(\log\log T) switches are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret.

Note that the number of switches stated in Corollary 2 refers to the maximum number of switches that a policy can make. While Cesa-Bianchi et al. (2013) and Perchet et al. (2016) have proposed policies that achieve \tilde{O}(\sqrt{T}) regret with O(\log\log T) switches, no prior work has answered the question of how many switches are necessary for a near-optimal learning policy in MAB. To the best of our knowledge, we are the first to show an \Omega(\log\log T) lower bound on the number of switches.
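To make the sufficiency side of Corollary 2 concrete, the following short calculation (ours, based only on Corollary 1) shows why \Theta(\log\log T) switches suffice: for any N\in\mathbb{Z}_{>0},

T^{\frac{1}{2-2^{-N}}}=\sqrt{T}\cdot T^{\frac{2^{-N}}{2(2-2^{-N})}}\leq\sqrt{T}\cdot T^{2^{-N}},

so taking N=\lceil\log_{2}\log_{2}T\rceil gives T^{2^{-N}}\leq T^{1/\log_{2}T}=2; for fixed k, the corresponding budget N(k-1)+1=\Theta(\log\log T) already guarantees \tilde{O}(\sqrt{T}) regret. The necessity direction is exactly the content of Theorem 2.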

Based on our “tracking the cover time” argument, we can prove further results regarding how many re-switches of each arm (including the worst arm in hindsight) are necessary for an effective learning policy.

Definition 3

The number of re-switches of action i\in[k] is the total number of times that the learner switches to i from another action j\neq i. If an action is chosen in round 1, this initial choice also counts as a re-switch.

Proposition 1
  1. \lceil N/2\rceil re-switches of each action are necessary for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in the k-armed MAB (N\in\mathbb{Z}_{>0}).

  2. \Theta(\log\log T) re-switches of each action are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret in the k-armed MAB.

Note that if the learner is not allowed to re-choose an action that was chosen earlier and discarded later (i.e., if the number of re-switches of each action is at most 1), then the corresponding bandit problem is exactly the “irrevocable MAB problem” proposed by Farias and Madan (2011). Farias and Madan (2011) and Guha and Munagala (2013) study the price of irrevocability in the Bayesian bandit setting. Using competitive ratio (measured against the optimal online policy that does not know the true environment) as the performance metric, they show that the price of irrevocability is limited. Our results on the necessity of re-switching strongly contradict this idea: in the setting of regret minimization, where we are competing with the optimal clairvoyant policy (a much stronger benchmark), our results indicate that an irrevocable policy must incur linear regret, and any effective policy cannot avoid “switching to, revoking, and re-switching to” each action many times.

Specifically, Proposition 1 indicates that, for each learning policy that achieves \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB, we can always find an environment \mathcal{D} such that the worst action in hindsight is “switched to, revoked, and re-switched to” at least \lceil N/2\rceil times with some positive probability. Also, for each learning policy that achieves near-optimal regret in MAB, we can always find an environment \mathcal{D} such that the worst action in hindsight is “switched to, revoked, and re-switched to” at least \Theta(\log\log T) times with some positive probability — surprisingly, the necessary number of re-switches of the worst action is essentially the same as that of the best action. This reveals a fundamental trade-off between regret and re-switching in MAB. Put differently, it is inevitable for any good policy to repeatedly choose actions that will prove to be ineffective in hindsight.

We close this section by discussing a surprising relationship between limited switches and limited adaptivity in bandit problems. As discussed earlier, the unit-switching-cost BwSC problem limits switching and the batched bandit problem limits observation — on the surface, the two problems seem unrelated. However, the results in our paper establish a non-trivial connection between the two problems. The SS-SE policy helps establish a one-sided relationship: using ingredients (1) and (2) of the SS-SE policy, it is easy to see that any limited-batch policy can be transformed into a limited-switch policy, and thus any regret upper bound for the batched bandit problem can be adapted into a regret upper bound for the unit-switching-cost BwSC problem.

On the other hand, it is generally impossible to transform an arbitrary limited-switch policy into a limited-batch policy, as a limited-switch policy may utilize data an unlimited number of times. Thus finding a regret lower bound for BwSC is fundamentally harder than finding a regret lower bound for the batched bandit problem. Surprisingly, our strong lower bound in Theorem 2 directly closes the gap between the regret upper bound of the batched bandit problem and the regret lower bound of the corresponding unit-switching-cost BwSC problem. Thus, we essentially prove the following fact: for any fixed k, both the M-batched k-armed bandit problem and the S-budget k-armed unit-cost BwSC have \tilde{\Theta}(T^{\frac{1}{2-2^{1-M}}}) optimal regret, as long as S\in[(M-1)(k-1)+1,M(k-1)+1). In essence, our results reveal a surprising “near-equivalence” between limited switches (even with full adaptivity) and limited adaptivity (even with full switching power) in bandit problems.
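The stated range of S translates into the batched exponent through a one-line calculation:

\left\lfloor\frac{S-1}{k-1}\right\rfloor=M-1\quad\text{for all }S\in[(M-1)(k-1)+1,\,M(k-1)+1),

so the exponent in Corollary 1 equals \frac{1}{2-2^{-(M-1)}}=\frac{1}{2-2^{1-M}}, which is exactly the optimal rate of the M-batched problem.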

We now proceed to the general case of BwSC, where c_{i,j} (=c_{j,i}) can be any non-negative real number and even \infty. The problem is significantly more challenging in this general setting. To better characterize the structure of switching costs, we represent switching costs by a weighted graph. Let G=(V,E) be a complete graph, where V=[k] (i.e., each vertex corresponds to an action), and the edge between i and j is assigned a weight c_{i,j} (\forall i\neq j). We call the weighted graph G the switching graph. In this section, we assume the switching costs satisfy the triangle inequality: \forall i,j,l\in[k], c_{i,j}\leq c_{i,l}+c_{l,j}. We relax this assumption in Appendix E.

In Appendix F, we establish an interesting connection between bandit problems and graph traversal problems. Applying this result to the general BwSC problem, we discover a connection between the general BwSC problem and the celebrated shortest Hamiltonian path problem. Motivated by this connection, we propose the Hamiltonian-Switching Successive Elimination (HS-SE) policy, and present its details in Algorithm 2 in Appendix G. The policy enhances the original SS-SE policy by adding an additional ingredient: a pre-specified switching order determined by the shortest Hamiltonian path of the switching graph G. Note that while the shortest Hamiltonian path problem is NP-hard, solving this problem is entirely an “offline” step in the HS-SE policy. That is, for a given switching graph, the learner only needs to solve this problem once. We give an upper bound on the regret of the HS-SE policy, as well as a lower bound that is close to the upper bound; see Appendices H and I for proofs.
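Since solving the shortest Hamiltonian path problem is a one-time offline step, even brute-force enumeration is viable for a moderate number of arms. The Python sketch below (ours; the 4-arm cost matrix is hypothetical and satisfies the triangle inequality) illustrates this offline step:

from itertools import permutations

def shortest_hamiltonian_path(c):
    """Brute-force shortest Hamiltonian path of a complete weighted graph.

    c is a symmetric k x k cost matrix; returns (total weight H, visiting order).
    """
    k = len(c)
    best_weight, best_path = float("inf"), None
    for perm in permutations(range(k)):
        weight = sum(c[perm[i]][perm[i + 1]] for i in range(k - 1))
        if weight < best_weight:
            best_weight, best_path = weight, perm
    return best_weight, best_path

# Hypothetical 4-arm switching graph.
c = [[0, 1, 2, 2],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [2, 2, 1, 0]]
H, order = shortest_hamiltonian_path(c)
print(H, order)  # 3 (0, 1, 2, 3): the HS-SE policy would sweep the arms in this order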

Theorem 3

Let H denote the total weight of the shortest Hamiltonian path of G. Let \pi be the HS-SE policy, then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0, T\geq k,

R^{\pi}(T)\leq C(\log k\log T)\,k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}T^{\frac{1}{2-2^{-m_{G}^{U}(S)}}},

where m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}{c_{i,j}}}{H}\rfloor.

Theorem 4

Let H be the total weight of the shortest Hamiltonian path of G. There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq k and for all policies \pi\in\Pi_{S},

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases}

where {m_{G}^{L}}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\rfloor.

When the switching costs satisfy the condition \max_{i,j\in[k]}{c_{i,j}}=\max_{i\in[k]}\min_{j\neq i}c_{i,j}, the two bounds directly match. When this condition is not satisfied, for any switching graph G, the above two bounds still match for a wide range of S. Even when S is not in this range, we still have m_{G}^{U}(S)\leq m_{G}^{L}(S)\leq m_{G}^{U}(S)+1 for any G and any S, which means that the difference between the two indices is at most 1 and the regret bounds are always very close. In fact, it can be shown that as S increases, the gap between the upper and lower bounds decreases doubly exponentially. Therefore, the HS-SE policy is quite effective for the general BwSC problem. We leave closing the remaining gap between the upper and lower bounds for future research.
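To illustrate the two indices, the following sketch (ours) computes m_{G}^{U}(S) and m_{G}^{L}(S) for the hypothetical 4-arm graph from the previous snippet, where H=3, \max_{i,j}c_{i,j}=2 and \max_{i}\min_{j\neq i}c_{i,j}=1, and checks that they never differ by more than 1:

H, c_max, c_minmax = 3, 2, 1   # values for the hypothetical 4-arm graph above

def m_upper(S):                # m_G^U(S) in Theorem 3
    return (S - c_max) // H

def m_lower(S):                # m_G^L(S) in Theorem 4
    return (S - c_minmax) // H

for S in range(2, 20):
    assert m_upper(S) <= m_lower(S) <= m_upper(S) + 1
    print(S, m_upper(S), m_lower(S))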

References

  • Agrawal et al. (1988) Agrawal, R., M. Hegde, D. Teneketzis. 1988. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control 33(10) 899–906.
  • Agrawal et al. (1990) Agrawal, R., M. Hegde, D. Teneketzis. 1990. Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic Reports 29(4) 437–459.
  • Altschuler and Talwar (2018) Altschuler, J., K. Talwar. 2018. Online learning over a finite action set with limited switching. arXiv preprint arXiv:1803.01548.
  • Asawa and Teneketzis (1996) Asawa, M., D. Teneketzis. 1996. Multi-armed bandits with switching penalties. IEEE Transactions on Automatic Control 41(3) 328–348.
  • Banks and Sundaram (1994) Banks, J. S., R. K. Sundaram. 1994. Switching costs and the Gittins index. Econometrica 62(3) 687–694.
  • Bergemann and Välimäki (2001) Bergemann, D., J. Välimäki. 2001. Stationary multi-choice bandit problems. Journal of Economic Dynamics and Control 25(10) 1585–1594.
  • Brezzi and Lai (2002) Brezzi, M., T. L. Lai. 2002. Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27(1) 87–108.
  • Cesa-Bianchi et al. (2013) Cesa-Bianchi, N., O. Dekel, O. Shamir. 2013. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems. 1160–1168.
  • Chen and Chao (2019) Chen, B., X. Chao. 2019. Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions 1–9.
  • Cheung et al. (2017) Cheung, W. C., D. Simchi-Levi, H. Wang. 2017. Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6) 1722–1731.
  • Christofides (1976) Christofides, Nicos. 1976. Worst-case analysis of a new heuristic for the travelling salesman problem. Tech. rep., Carnegie-Mellon University Pittsburgh PA Management Sciences Research Group.
  • Cormen et al. (2009) Cormen, Thomas H, Charles E Leiserson, Ronald L Rivest, Clifford Stein. 2009. Introduction to algorithms. MIT Press.
  • Farias and Madan (2011) Farias, V. F., R. Madan. 2011. The irrevocable multiarmed bandit problem. Operations Research 59(2) 383–399.
  • Gao et al. (2019) Gao, Z., Y. Han, Z. Ren, Z. Zhou. 2019. Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763 .
  • Guha and Munagala (2013) Guha, S., K. Munagala. 2013. Approximation algorithms for bayesian multi-armed bandit problems. arXiv preprint arXiv:1306.3525 .
  • Jun (2004) Jun, T. 2004. A survey on the bandit problem with switching costs. De Economist 152(4) 513–541.
  • Lai and Robbins (1985) Lai, T. L., H. Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1) 4–22.
  • Lawler et al. (1985) Lawler, E. L., J. K. Lenstra, A. H. G. Rinnooy Kan, D. B. Shmoys. 1985. The traveling salesman problem: a guided tour of combinatorial optimization, vol. 3. New York: Wiley.
  • Perchet et al. (2016) Perchet, V., P. Rigollet, S. Chassang, E. Snowberg. 2016. Batched bandit problems. The Annals of Statistics 44(2) 660–681.
  • Slivkins (2019) Slivkins, A. 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272 .
  • Wainwright (2019) Wainwright, M. J. 2019. High-dimensional statistics: A non-asymptotic viewpoint, vol. 48. Cambridge University Press.

For simplicity, we only present the results of distribution-dependent regret bounds for the unit-switching-cost BwSC problem. Extensions to the general-switching-cost BwSC problem are analogous to Section 5 of the main article.

To achieve tight distribution-dependent regret bounds, we propose the S-Switch Successive Elimination 2 (SS-SE-2) policy, which is stated in Algorithm 2. Note that the difference between the SS-SE-2 policy and the SS-SE policy is the partition of intervals.

Algorithm 2 S-Switch Successive Elimination 2 (SS-SE-2)

Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
    Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}],
    where the endpoints are defined by t_{0}=1 and

t_{i}=\lfloor k^{1-\frac{i}{m(S)+1}}T^{\frac{i}{m(S)+1}}\rfloor,\quad\forall i=1,\dots,m(S)+1.

Initialization: The same as the SS-SE policy.
Policy:

1:  The same as the SS-SE policy.
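The two policies differ only in their interval endpoints; a quick side-by-side computation of the two grids (ours, with hypothetical values k=3, S=7, T=10^{6}) makes the difference concrete:

import math

k, S, T = 3, 7, 10**6
m = (S - 1) // (k - 1)       # m(S) = 3

def endpoint_ss_se(i):       # minimax grid of the SS-SE policy (Algorithm 1 of the main article)
    a = (2 - 2.0 ** (-(i - 1))) / (2 - 2.0 ** (-m))
    return math.floor(k ** (1 - a) * T ** a)

def endpoint_ss_se_2(i):     # distribution-dependent grid of the SS-SE-2 policy
    a = i / (m + 1)
    return math.floor(k ** (1 - a) * T ** a)

print([endpoint_ss_se(i) for i in range(1, m + 2)])     # grid growing as T^{(2-2^{1-i})/(2-2^{-m})}
print([endpoint_ss_se_2(i) for i in range(1, m + 2)])   # grid growing as T^{i/(m+1)}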

For any environment \mathcal{D}, let i^{*}=\arg\max_{i\in[k]}\mu_{i} denote the optimal action, and \Delta=\Delta(\mathcal{D})=\min_{i\neq i^{*}}|\mu_{i^{*}}-\mu_{i}|>0 denote the gap between the mean rewards of the optimal action and the best sub-optimal action. We have the following results.

Theorem 5

Let \pi be the SS-SE-2 policy. There exists an absolute constant C\geq 0 such that for all \mathcal{D}, for all k\geq 1, S\geq 0 and T\geq k,

R_{\mathcal{D}}^{\pi}(T)\leq C\left(k^{\frac{m(S)}{m(S)+1}}\log k\right)\frac{T^{\frac{1}{m(S)+1}}\log T}{\Delta},

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Theorem 6

There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq 1 and for all policies \pi\in\Pi_{S}, if m(S)\leq\log_{2}(T/k), then

\sup\limits_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\geq C\left(k^{-\frac{3}{2}-\frac{1}{m(S)+1}}(m(S)+1)^{-2}\right)T^{\frac{1}{m(S)+1}},

where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.

Note that when m(S)\leq{\log_{2}(T/k)}, the upper and lower bounds match in the minimax sense (up to logarithmic factors), thus the SS-SE-2 policy can be viewed as near-optimal. When m(S)>\log_{2}(T/k), the upper bound is O(\log T/\Delta), and we can directly use the seminal instance-dependent lower bound of Lai and Robbins (1985) to show the asymptotic optimality of the SS-SE-2 policy.

We omit the proofs of Theorem 5 and Theorem 6. The proof of Theorem 5 resembles the proof of Theorem 1 in Appendix B, and the proof of Theorem 6 resembles the proof of Theorem 2 in Appendix C. The difference is mainly on the partition of intervals.

Besides results on regret upper and lower bounds, we also establish Corollary 3, which can be viewed as a parallel result for Corollary 2 in Section 4.3.2 of the main article.

Corollary 3

(The switching complexity of MAB - distribution-dependent regret version)
For any k\geq 1, for any environment \mathcal{D}, let \Delta=\min\limits_{i\in[k],i\neq i^{*}}|\mu_{i^{*}}-\mu_{i}| denote the gap between the mean rewards of the optimal action and the best sub-optimal action.

  1. N(k-1)+1 switches are necessary and sufficient for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB (N\in\mathbb{Z}_{>0}).

  2. \Omega(\frac{\log T}{\log\log T}) switches are necessary for uniformly achieving \tilde{O}(\log{T}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.

The proof of Corollary 2 is deferred to Appendix D.

From round 1 to round t_{1}, the SS-SE policy makes k-1 switches.

For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}:

  • If the last action in interval l remains active in interval l+1, then it will be the first action in interval l+1, and no switch occurs between round t_{l} and round t_{l}+1. Since the SS-SE policy makes at most k-1 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}, the SS-SE policy makes at most 0+(k-1)=k-1 switches from round t_{l} to round t_{l+1}.

  • If the last action in interval l is eliminated before the start of interval l+1, then interval l+1 starts from another active action, and one switch occurs between round t_{l} and round t_{l}+1. The elimination implies that |A_{l+1}|\leq k-1, thus the SS-SE policy makes |A_{l+1}|-1\leq(k-1)-1=k-2 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}. Therefore, the SS-SE policy makes at most 1+(k-2)=k-1 switches from round t_{l} to round t_{l+1}.

From round t_{m(S)} to round T, since the SS-SE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the SS-SE policy makes at most 1 switch from round t_{m(S)} to round T.

Summarizing the above arguments, we find that the SS-SE policy makes at most m(S)(k-1)+1\leq S switches from round 1 to round T. Thus it is indeed an S-switching-budget policy.
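Explicitly, the switch count above can be written as

\underbrace{(k-1)}_{\text{rounds }1\text{ to }t_{1}}+\underbrace{(m(S)-1)(k-1)}_{\text{rounds }t_{l}\text{ to }t_{l+1},\ 1\leq l\leq m(S)-1}+\underbrace{1}_{\text{rounds }t_{m(S)}\text{ to }T}=m(S)(k-1)+1\leq\left\lfloor\frac{S-1}{k-1}\right\rfloor(k-1)+1\leq S.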

We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of chosen samples of action i in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as

r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},~{}~{}\forall i\in[k],t\in[T].

Define the clean event as

\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],~|\bar{\mu}_{t}(i)-\mu_{i}|\leq r_{t}(i)\}.

By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.

The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 1 can be expressed as

\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),\quad\forall l\in[m(S)+1],i\in[k],
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),\quad\forall l\in[m(S)+1],i\in[k].

Let \pi denote the SS-SE policy. First, observe that for any environment \mathcal{D},

\displaystyle R_{\mathcal{D}}^{\pi}(T) \displaystyle=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\bar{\mathcal{E}})
\displaystyle\leq\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}}
\displaystyle=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1), (1)

so in order to bound R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T), we only need to focus on the clean event.

Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval with i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval in which action i has not yet been eliminated (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the SS-SE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of interval \eta_{i}-1 (i.e., at round t_{\eta_{i}-1}) must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,

\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-1}}(i)=4r_{t_{\eta_{i}-1}}(i), (2)

where the last equality holds because i^{*} and i are chosen an equal number of times in each of the first \eta_{i}-1 intervals, which implies that n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 1, for simplicity, we overlook the rounding issues of \frac{t_{l}-t_{l-1}}{|A_{l}|} for each interval l. Considering the rounding issues will not bring additional difficulty to our analysis, as in the policy we can always design a rounding rule to keep the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)

Since i is never chosen after the \eta_{i}-th interval, we have n_{t_{\eta_{i}}}(i)=n_{T}(i), and therefore r_{t_{\eta_{i}}}(i)=r_{T}(i).

The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the SS-SE policy and (2), we can bound this quantity as

\displaystyle R(T;i) \displaystyle=n_{T}(i)\Delta(i)
\displaystyle\leq 4n_{\eta_{i}}(i)\sqrt{\frac{2\log T}{n_{\eta_{i}-1}(i)}}
\displaystyle\leq C_{0}\sqrt{2\log T}\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{% t_{\eta_{i}-1}/k}}
\displaystyle\leq 4C_{0}\sqrt{2\log T}\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|}

for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,

\displaystyle\mathbb{E}_{\mbox{$\mathcal{D}$}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right] \displaystyle=\sum_{i\in[k]}R(T;i)
\displaystyle\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}k(T/k)^{1/(2-2^{-m(S)})}% \frac{1}{{|A_{\eta_{i}}|}}
\displaystyle\leq C_{1}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}% \frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{2}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}% \frac{1}{j}
\displaystyle\leq C_{3}(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus by (1) and R^{\pi}(T)=\sup_{\mbox{$\mathcal{D}$}}R_{\mbox{$\mathcal{D}$}}^{\pi}(T), we have

R^{\pi}(T)\leq C(\log k\log T)k^{2-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constant C\geq 0.\hfill\Box

Given any k\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mbox{$\mathcal{D}$}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \mbox{\boldmath$\mu$}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.

For any environment \mu, let X_{\mbox{\boldmath$\mu$}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of each action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\mbox{\boldmath$\mu$}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.

For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{% \boldmath$\mu$}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{\boldmath$\mu$}}^{t}(\pi% _{t}), then \mathbb{F}=(\mathcal{F}_{t})_{t\in T} is a filtration.

For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}, see detailed definitions in Chapter 15 of Wainwright (2019).

For any environment \mu, for any policy \pi\in\Pi_{S}, we make some key definitions as below.

1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\dots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

  • \tau_{1}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[1:t]$}\} if the set is non-empty and \tau_{1}=\infty otherwise.

  • \tau_{2}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{1}:t]$}\} if the set is non-empty and \tau_{2}=\infty otherwise.

  • Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{j-1}:t]$}\} if the set is non-empty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1. (A small computational sketch of these cover times is given after this list.)

It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.

2. We then define a series of random variables (depending on the stopping times).

  • S(1,\tau_{1}) is the number of switches that occur in [1:\tau_{1}] (note that if there is a switch happening between \tau_{1} and \tau_{1}+1, we do not count its cost in S(1,\tau_{1})).

  • For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the number of switches that occur in [\tau_{j-1}:\tau_{j}] (note that if there is a switch happening between \tau_{j-1}-1 and \tau_{j-1}, or between \tau_{j} and \tau_{j}+1, we do not count its cost in S(\tau_{j-1},\tau_{j})).

  • S(\tau_{m(S)},T) is the number of switches that occur in [\tau_{m(S)}:T] (note that if there is a switch happening between \tau_{m(S)}-1 and \tau_{m(S)}, we do not count its cost in S(\tau_{m(S)},T)).

3. Next we define a series of events.

  • E_{1}=\{\tau_{1}>t_{1}\}.

  • For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

  • E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.

Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 1.

4. Finally we define a series of shrinking errors.

  • \Delta_{1}=1.

  • For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+% 1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}.)

  • \Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{% 2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)

5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking values in [k], i.e., \pi_{\infty}=i with probability 1/k for each i\in[k].
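
To make the stopping times above concrete, the following Python sketch (ours, purely illustrative) computes the cover times \tau_{1},\tau_{2},\dots from a realized action sequence \pi_{1},\dots,\pi_{T}; a returned value of None plays the role of \tau_{j}=\infty.

def cover_times(actions, k, num_covers):
    """Compute the cover times tau_1, ..., tau_{num_covers} of an action sequence.

    actions    : realized choices pi_1, ..., pi_T (values in 1..k), one per round
    num_covers : how many stopping times to compute (e.g., m(S)+1)
    Returns a list where None stands for tau_j = infinity (cover never completed).
    """
    taus = []
    seen = set()
    for t, a in enumerate(actions, start=1):   # rounds are 1-based
        seen.add(a)
        if len(seen) == k:                     # all k actions appear in the current window
            taus.append(t)                     # this round is tau_j
            seen = {a}                         # the next window starts at tau_j (closed interval)
            if len(taus) == num_covers:
                break
    return taus + [None] * (num_covers - len(taus))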

Lemma 1

For any environment \mu, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches that occur in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely.

Proof of Lemma 1. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1},\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, we know that S(1,\tau_{1})\geq k-1, S(\tau_{1},\tau_{2})\geq k-1, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq k-1. Thus we have

S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S% )(k-1).

Since \pi\in\Pi_{S}, we further know that

S(\tau_{m(S)},T)\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)% -1},\tau_{m(S)})]\leq S-m(S)(k-1)\leq k-1

happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches that occur in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely. \hfill\Box

Consider a class of environments \Lambda=\{\mbox{\boldmath$\mu$}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i% }\leq\frac{\Delta_{m(S)+1}}{2},\forall i\neq 1\}. Pick an arbitrary environment {\alpha} from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, by the union bound, we have

\sum_{j=1}^{m(S)+1}\mathbb{P}_{{\alpha}}^{\pi}(E_{j})\geq\mathbb{P}_{{\alpha}}% ^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1.

Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1).

Case 1 (j^{*}=1). Since \mathbb{P}_{{\alpha}}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{% \pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{% \mathbb{P}_{{\alpha}}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have

\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq% \mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1% }{k(m(S)+1)}.

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] only depends on policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{% \pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(% \mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}. (3)

We now consider a new environment {\beta} such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}; then we have

\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]). (4)

But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (3) and (4) we have

\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$% \mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k% (m(S)+1)}.

However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mbox{$\mathbb{P}$}_{\beta}^{\pi}(% \mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]% \geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1% /(2-2^{-m(S)})}.

Case 2 (2\leq j^{*}\leq m(S)). Since \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}% })=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau% _{j^{*}}>t_{j^{*}},\pi_{\tau_{j}^{*}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}% },\pi_{\tau_{j}^{*}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}% })}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have

\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{% *}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}% >t_{j^{*}},\pi_{\tau_{j}^{*}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}+1}:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] only depends on policy \pi and random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t% }(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t% }(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{% \prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E% }_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}. (5)

We now consider a new environment {\beta} such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}+1}:T]} for all i\in[k]. Let \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(% i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(% i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime% }}[t_{j^{*}-1}:t_{j^{*}}])=\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{% \prime}}[t_{j^{*}-1}:t_{j^{*}}]). (6)

We now bound the difference between the left-hand side (LHS) of (5) and the left-hand side (LHS) of (6). We have

\displaystyle|\text{LHS in }(5)-\text{LHS in }(6)|
\displaystyle\leq \displaystyle{D_{\mathrm{TV}}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~{}{\huge{\parallel}}~{}\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)
\displaystyle\leq \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t% }(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text% { for }i\neq i^{\prime}}^{\pi}~{}{\huge{\parallel}}~{}\mathbb{P}_{\{X_{\beta}^% {t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}% \text{ for }i\neq i^{\prime}}^{\pi}\right)}
\displaystyle\leq \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t% }(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text% { for }i\neq i^{\prime}}~{}{\huge{\parallel}}~{}\mathbb{P}_{\{X_{\beta}^{t}(i^% {\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for% }i\neq i^{\prime}}\right)}
\displaystyle= \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t% }(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}~{}{\huge{\parallel}}~{}\mathbb{P}_{\{X_{% \beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}
\displaystyle= \displaystyle\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}% \right)^{2}}{2}\right]}
\displaystyle\leq \displaystyle\frac{\sqrt{t_{j^{*}-1}}\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)},

where the first inequality is by the definition of the total variation distance between two probability measures, the second inequality is by Pinsker’s inequality in information theory, and the third inequality is by the data-processing inequality in information theory.

Combining the above inequality with (5) and (6), we have

\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}.

However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} will incur at least a \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] indicates that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have

\displaystyle R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mbox{$\mathbb{P}$}_{\beta}% ^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{% *}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]
\displaystyle\geq \displaystyle\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)% }}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}% \left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{% -m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S% )}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S% )}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-m(S)}}{2-2^{-m(S)}}}\right)
\displaystyle= \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S% )}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right).

When m(S)\leq\log_{2}\log_{2}(T/k), we have

(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T% /k}(2)}}=\frac{1}{2}.

Thus we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(% S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2% }-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

when m(S)\leq\log_{2}\log_{2}(T/k).

Case 3 (j^{*}=m(S)+1). Since \mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{% \alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{% \prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(% S)+1)}.

Thus either

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_% {m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}, (7)

or

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac% {t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}. (8)

If (7) holds true, then we consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and all other components are the same as \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (7) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those in Case 2 above, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])% \geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2% ])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}

and

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)% }}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Now we consider the case that (8) holds true. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the number of switches that occur in [\tau_{m(S)}:T] is no more than k-1. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the number of switches that occur in [\tau_{m(S)}:\tau_{m(S)+1}] is at least k-1. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T].

Suppose that i^{\prime}\neq 1, then the event \mathcal{E}_{i^{\prime}}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_% {m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\} implies that action 1 is not chosen in [\tau_{m(S)+1}:T]. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 will incur at least a \Delta_{m(S)+1}/4 term in regret. As a result, we know that

R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i% ^{\prime}})\left[(T-\frac{t_{m(S)}+T}{2}+1)\frac{\Delta_{m(S)+1}}{4}\right]% \geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2% -2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Thus we only need to consider the sub-case of i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} only depends on policy \pi and random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:{(t_{m(S)}+T)}/{2}]} for i\neq 1. Consider a new environment \beta such that its first component is \alpha_{1}-\Delta_{m(S)+1} and all other components are the same as \alpha. Using arguments analogous to those in Case 2 above, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(% \mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{% \alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}.

However, action 1 is the worst action in environment \beta, and each round in which action 1 is chosen incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T]. Thus the event \mathcal{E}_{1} implies that action 1 is chosen in every round from round \tau_{m(S)+1} (\leq\frac{t_{m(S)}+T}{2}) to round T, i.e., action 1 is chosen in the last (T-\frac{t_{m(S)}+T}{2}+1) rounds. As a result, we know that

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\left[(T-\frac{t_{m(S)}+T}{2}+1)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Combining Cases 1, 2 and 3, we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{% \frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with an unlimited switching budget) is \Omega(\sqrt{kT}), we know that

R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT}

for some absolute constant C>0. To sum up, we have

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+% 1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/% k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}

for some absolute constant C>0, where m(S)=\lfloor\frac{S-1}{k-1}\rfloor. \hfill\Box

We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 1, the SS-SE policy guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully-designed policy (e.g., the SS-SE policy) to achieve \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB. On the other hand, suppose that there exists a policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB with S<N(k-1)+1 switches almost surely. Since m(S)\leq N-1, by Theorem 2, its regret in BwSC is \Omega(T^{\frac{1}{2-2^{-N+1}}}), whose order in T is strictly higher than T^{\frac{1}{2-2^{-N}}} (as N is a fixed integer independent of T), which is a contradiction. Thus for any policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB, there must exist an environment such that the policy makes at least N(k-1)+1 switches with some positive probability.
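
As a purely illustrative instance of this trade-off, take k=3 and N=2. Then

N(k-1)+1=5,\qquad m(5)=\left\lfloor\frac{5-1}{3-1}\right\rfloor=2,\qquad \frac{1}{2-2^{-2}}=\frac{4}{7},

so 5 switches suffice for \tilde{O}(T^{4/7}) regret, whereas any policy restricted to at most 4 switches has m(S)\leq\lfloor 3/2\rfloor=1 and therefore, by Theorem 2, suffers regret \Omega(T^{\frac{1}{2-2^{-1}}})=\Omega(T^{2/3}).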

We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 5, the SS-SE-2 policy guarantees \tilde{O}(T^{\frac{1}{N+1}}) distribution-dependent regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully-designed policy (e.g., the SS-SE-2 policy) to achieve \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret in MAB. On the other hand, given any fixed k\geq 1, for any fixed N\geq 1, suppose that there exists a policy \pi that uniformly achieves \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} with S<N(k-1)+1 switches almost surely. Then there exists a constant C_{k,N}\geq 0 (which may depend on k,N) such that for all \mathcal{D} and for all T\geq 1,

R_{\mbox{$\mathcal{D}$}}^{\pi}(T)\leq C_{k,N}{\mathrm{polylog}}(T)\frac{T^{% \frac{1}{N+1}}}{\Delta},

which means that for all T\geq 1,

\sup_{\Delta\in(0,1]}\Delta R_{\mbox{$\mathcal{D}$}}^{\pi}(T)\leq C_{k,N}{% \mathrm{polylog}}(T)T^{\frac{1}{N+1}}.

However, since m(S)<N, by Theorem 6 we know that there exists an absolute constant C>0 such that for all T\geq 1,

\sup\limits_{\Delta\in(0,1]}\Delta R_{\mbox{$\mathcal{D}$}}^{\pi}(T)\geq C% \left(k^{-\frac{3}{2}-\frac{1}{N}}(m(S)+1)^{-2}\right){T^{\frac{1}{N}}}>C\left% (k^{-\frac{3}{2}-\frac{1}{N}}(N+1)^{-2}\right){T^{\frac{1}{N}}}.

Letting T be large enough yields a contradiction. As a result, N(k-1)+1 switches are necessary for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.

Consider an arbitrary switching graph G with k=|G|\geq 1. In the following, we show that, even without the triangle inequality assumption, a modified version of the results in Section 5 still holds.

Assume that the switching costs associated with G do not satisfy the triangle inequality. We then run the Floyd-Warshall algorithm (see Cormen et al. (2009)) on G to efficiently find the shortest paths between all pairs of vertices. For any i,j\in[k] such that i\neq j, let p_{i,j}=i\rightarrow\dots\rightarrow j denote the shortest path between i and j, and c_{i,j}^{\prime} denote the total weight of the shortest path between i and j. We construct a new switching graph G^{\prime}=(V,E^{\prime}) — the vertices in G^{\prime} are the same as G, while the edge between i and j in G^{\prime} is assigned a weight c_{i,j}^{\prime}, which is the total weight of the shortest path between i and j in G. Obviously, G^{\prime} is a switching graph whose switching costs satisfy the triangle inequality. Therefore, for BwSC problems defined with G^{\prime}, we can apply the HS-SE policy (see Algorithm 3 in Appendix G), and the regret upper and lower bounds in Theorem 3 and Theorem 4 in Section 5 hold.
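
The construction of G^{\prime} can be carried out with a standard all-pairs shortest-path computation. The following Python sketch is ours and purely illustrative (it takes a dense symmetric cost matrix c with c[i][i]=0 as input); it computes the new weights c^{\prime}_{i,j} together with a next-hop table from which the shortest paths p_{i,j}, used by the modified policy described below, can be reconstructed. The computation runs in O(k^{3}) time and is done entirely offline.

def build_metric_closure(c):
    """Floyd-Warshall on a dense symmetric cost matrix c (with c[i][i] = 0).

    Returns (cp, nxt): cp[i][j] is the shortest-path weight c'_{i,j}, and
    nxt[i][j] is the next vertex after i on a shortest path from i to j,
    which lets us reconstruct p_{i,j}.
    """
    k = len(c)
    cp = [row[:] for row in c]
    nxt = [[j for j in range(k)] for _ in range(k)]
    for m in range(k):
        for i in range(k):
            for j in range(k):
                if cp[i][m] + cp[m][j] < cp[i][j]:
                    cp[i][j] = cp[i][m] + cp[m][j]
                    nxt[i][j] = nxt[i][m]
    return cp, nxt

def shortest_path(nxt, i, j):
    """Reconstruct p_{i,j} = i -> ... -> j from the next-hop table."""
    path = [i]
    while path[-1] != j:
        path.append(nxt[path[-1]][j])
    return path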

In this part we assume that k=o(\sqrt{T}). This assumption is mild, as k is a fixed integer while T grows large.

For any BwSC problem defined with switching graph G (whose switching costs do not satisfy the triangle inequality) and switching budget S, we construct a new switching graph G^{\prime} according to Appendix E.1, and construct a new BwSC problem defined with switching graph G^{\prime} and switching budget S. Let \pi^{\prime} denote the HS-SE policy running on the new BwSC problem. Obviously \pi^{\prime} is an S-switching-budget policy for the new problem. We construct \pi by modifying \pi^{\prime}, aiming to obtain an S-switching-budget policy for the original BwSC problem. Let \pi switch (on G) following \pi^{\prime} (on G^{\prime}): every time \pi^{\prime} switches from i to j on G^{\prime}, let \pi switch according to the path p_{i,j}=i\rightarrow\dots\rightarrow j on G, visiting each vertex in p_{i,j} once (since in the HS-SE policy, every active action is chosen for at least \Omega(T^{1/2}) consecutive rounds in each interval, while p_{i,j} contains at most k=o(\sqrt{T}) vertices, \pi is a valid policy). Since the total weight of p_{i,j} is c^{\prime}_{i,j} and \pi^{\prime} is an S-switching-budget policy for G^{\prime}, we know that \pi is an S-switching-budget policy for G.

As mentioned before, Theorem 3 and Theorem 4 in Section 5 hold for the new BwSC problem (defined with G^{\prime}) in Appendix E.2. Based on these two theorems, we give upper and lower bounds on regret for the original BwSC problem (defined with G). The upper and lower bounds are very close to each other (in fact, when k=O(T^{1/4}), the bounds are essentially the same as the bounds in Section 5).

Theorem 7

Let G be a switching graph and G^{\prime} be the corresponding new graph defined in Appendix E.1. Let H denote the total weight of the shortest Hamiltonian path of G^{\prime}. Let \pi be the modified HS-SE policy in Appendix E.2, then \pi is an S-switching-budget policy for G. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0, T\geq k^{2},

R^{\pi}(T)\leq C{(\log k\log T)}k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}T^{\frac{1}% {2-2^{-m_{G}^{U}(S)}}}+Ck^{2}\log\log T,

where m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}{c_{i,j}^{\prime}}}{H}\rfloor.

Theorem 8

Let H be the total weight of the shortest Hamiltonian path of G^{\prime}. There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq k and for all policies \pi\in\Pi_{S},

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases}

where {m_{G}^{L}}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}^{\prime}}{H}\rfloor.

Note that the only difference between the upper bound in Theorem 7 and the upper bound in Theorem 3 is an O(k^{2}\log\log T) term, which can be neglected as long as k is much smaller than T, e.g., k=O(T^{1/4}). To see why Theorem 7 holds, just note that (1) when k is much smaller than T, the modification of the HS-SE policy does not affect the learning rate of the HS-SE policy, and (2) since there are m_{G}^{U}(S)+1=O(\log\log T) intervals in \pi, and in each interval the behavior of \pi (running on G) is different from the behavior of \pi^{\prime} (running on G^{\prime}) for at most O(k^{2}) rounds, the additional regret loss compared to Theorem 3 is at most O(k^{2}\log\log T). Theorem 8 is essentially the same as Theorem 4 — in fact, a lower bound proved for a BwSC problem with the triangle inequality assumption is a natural lower bound for a corresponding BwSC problem without the triangle inequality assumption.
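
As a purely hypothetical numerical illustration of the two quantities m_{G}^{U}(S) and m_{G}^{L}(S), consider k=3 with c^{\prime}_{1,2}=c^{\prime}_{2,3}=1 and c^{\prime}_{1,3}=2 (these numbers are ours, chosen only for illustration). The shortest Hamiltonian path of G^{\prime} is 1\rightarrow 2\rightarrow 3, so H=2, \max_{i,j\in[k]}c^{\prime}_{i,j}=2 and \max_{i\in[k]}\min_{j\neq i}c^{\prime}_{i,j}=1. For a budget S=7,

m_{G}^{U}(7)=\left\lfloor\frac{7-2}{2}\right\rfloor=2,\qquad m_{G}^{L}(7)=\left\lfloor\frac{7-1}{2}\right\rfloor=3,

and the corresponding exponents of T in Theorems 7 and 8 are \frac{1}{2-2^{-2}}=\frac{4}{7} and \frac{1}{2-2^{-3}}=\frac{8}{15}, respectively, so the two bounds are indeed close to each other.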

Intuitively, an effective policy in BwSC should identify what type of switching behavior is necessary and sufficient for achieving low regret in MAB, and switch in the most efficient way. Thus, before studying the general BwSC, we first revisit the classical MAB to further understand the relationship between switching and regret. Earlier in Section 4 of the main article, we establish the trade-off between the number of switches and regret in MAB. Unfortunately, this does not provide enough insights for the general BwSC, and hence we need to connect the combinatorics of switching patterns with regret in MAB. In this subsection, we prove the following result: there are some inherent switching patterns that are associated with any effective learning policy in MAB.

Definition 4

Consider a k-armed bandit problem. For any learning policy \pi, any environment \mathcal{D} and any T\geq 1, the stochastic process \{\pi_{t}\}_{t\in[T]}=\pi_{1},\dots,\pi_{T} constitutes a random walk (with a random starting point) on [k]. We call \{\pi_{t}\}_{t\in[T]} the bandit random walk generated by (\pi,\mbox{$\mathcal{D}$},T).

Definition 5

A bandit random walk on an action set [k] finishes a cover in period [T_{1}:T_{2}] if all actions in [k] were chosen between round T_{1} and round T_{2}, where T_{1} is called the starting round of this cover, and T_{2} is called the ending round of this cover.

Definition 6

A bandit random walk on an action set [k] finishes N\geq 0 asynchronous covers in period [T_{1}:T_{2}] if it finishes N covers in period [T_{1}:T_{2}], and the ending round of the j-th cover is no larger than the starting round of the (j+1)-th cover, for all j=1,\dots,N-1.
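
For concreteness, the following Python sketch (ours, purely illustrative) greedily counts how many asynchronous covers a realized walk \pi_{1},\dots,\pi_{T} finishes in [1:T]; consecutive covers are allowed to share their boundary round, in line with Definition 6.

def asynchronous_covers(actions, k):
    """Greedily count asynchronous covers of [k] finished by the walk.

    actions : list of chosen actions (values in 1..k) for rounds 1..T
    Each cover is closed at the earliest possible ending round, and the
    next cover may start at that same round, since the ending round of
    one cover only needs to be no larger than the start of the next.
    """
    covers = 0
    seen = set()
    for a in actions:
        seen.add(a)
        if len(seen) == k:
            covers += 1
            seen = {a}   # the next cover may start at this ending round
    return covers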

By using the “tracking the cover time” argument, we establish the following result.

Theorem 9

Consider a k-armed bandit problem. For any fixed N\geq 0, for any policy \pi that achieves \tilde{O}(T^{\frac{1}{2-2^{-N}}-\epsilon}) regret for some \epsilon>0, there exists an environment \mathcal{D} and T\geq 1 such that the bandit random walk generated by (\pi,\mbox{$\mathcal{D}$},T) must “finish N+1 asynchronous covers and then switch to the optimal action in period [1:T]” with probability at least \max\{N/(N+1),1/2\}. (If the bandit random walk happens to be at the optimal action when it just finishes N+1 asynchronous covers, then the event directly occurs.)

Theorem 9 holds true for any MAB problem, and reveals some fundamental switching patterns that any effective learning policy has to exhibit under certain environments and horizons. Intuitively, the patterns can be summarized as “finishing multiple covers and then switching to the optimal action”. For example, if a policy \pi achieves sublinear regret in MAB, then there must be some environment \mathcal{D} and T such that the policy first chooses all actions (i.e., \pi_{1},\dots,\pi_{T} finishes a cover) and then switches to the optimal action with certain probability (even if the policy does not know the optimal action). Also, if a policy \pi achieves near-optimal regret in MAB, then there must be some environment \mathcal{D} and T such that \pi_{1},\dots,\pi_{T} first finishes \Omega(\log\log T) asynchronous covers and then switches to the optimal action with certain probability.

Theorem 9 indicates that the switching ability of “finishing multiple covers then switching to the optimal action” is necessary for any effective learning policy in MAB. It thus reveals a deep connection between bandit problems and graph traversal problems, since in graph traversal problems there are also requirements for “cover”, i.e., visiting all vertices. Motivated by this connection, in Section 5 of the main article, we design an intuitive policy for the general BwSC problem by leveraging ideas from the shortest Hamiltonian path problem, and give upper and lower bounds on regret that are close to each other.

The proof of Theorem 9 is based on the “tracking the cover time” argument: we first suppose that the switching patterns do not occur with certain probability, then use the “tracking the cover time” argument to establish an \tilde{\Omega}(T^{\frac{1}{2-2^{-N}}}) lower bound on the regret of \pi, which contradicts the condition in Theorem 9. We omit the detailed proof here, as the essential idea of the proof is similar to the proof of Theorem 2 in Appendix C and the proof of Theorem 4 in Appendix I.

See the description of the HS-SE policy in Algorithm 3. The algorithm addresses the challenges posed by heterogeneous switching costs by leveraging ideas from the celebrated shortest Hamiltonian path problem.

The HS-SE policy is highly practical — for any given switching graph G, the policy only involves solving the shortest Hamiltonian path problem once, which can be finished offline. Thus, the computational complexity of the shortest Hamiltonian path problem does not affect the online decision-making process of the HS-SE policy at all.

Moreover, under the condition that the switching costs satisfy the triangle inequality, the shortest Hamiltonian path problem can be reduced to the celebrated metric traveling salesman problem ( metric TSP), see Lawler et al. (1985). This means that we can directly apply many commercial solvers for TSP to solve (or approximately solve) the shortest Hamiltonian path problem efficiently. The reduction also indicates that any approximation algorithm designed for metric TSP can be adapted to be an approximation algorithm for the shortest Hamiltonian path problem. In particular, the celebrated Christofides algorithm for the metric TSP Christofides (1976) can be used to compute a good approximation of H in polynomial time.
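
Since the shortest Hamiltonian path only needs to be computed once, offline, even a brute-force enumeration is viable for small k. The sketch below is ours and purely illustrative (exponential in k; for larger instances one would use a TSP solver or an approximation such as the Christofides-based one mentioned above); it returns the path and its total weight H.

from itertools import permutations

def shortest_hamiltonian_path(c):
    """Brute-force shortest Hamiltonian path for a dense cost matrix c.

    c[i][j] is the switching cost between actions i and j (0-indexed).
    Returns (best_path, H) where H is the total weight of the path.
    Only suitable for small k, as it enumerates all k! orderings.
    """
    k = len(c)
    best_path, H = None, float("inf")
    for perm in permutations(range(k)):
        weight = sum(c[perm[t]][perm[t + 1]] for t in range(k - 1))
        if weight < H:
            best_path, H = list(perm), weight
    return best_path, H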

Algorithm 3 Hamiltonian-Switching Successive Elimination (HS-SE)

Input: Switching Graph G, Switching budget S, Horizon T
Offline Step: Find the shortest Hamiltonian path in G: {i_{1}}\rightarrow\dots\rightarrow{i_{k}}. Denote the total weight of the shortest Hamiltonian path as H. Calculate m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}c_{i,j}}{H}\rfloor.
Partition: Run the partition step in the SS-SE policy with m(S)=m_{G}^{U}(S).
Initialization: Let the set of all active actions in the l-th interval be A_{l}. Set A_{1}=[k], a_{0}={i_{1}}.
Policy:

1:  for l=1,\dots,m_{G}^{U}(S) do
2:     if a_{t_{l-1}}\in A_{l} and l is odd then
3:        Let a_{t_{l-1}+1}=a_{t_{l-1}}. Starting from this action, along the direction of {i_{1}}\rightarrow\dots\rightarrow{i_{k}}, choose each action in A_{l} for \frac{t_{l+1}-t_{l}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
4:     else if a_{t_{l-1}}\in A_{l} and l is even then
5:        Let a_{t_{l-1}+1}=a_{t_{l-1}}. Starting from this action, along the direction of {i_{k}}\rightarrow\dots\rightarrow{i_{1}}, choose each action in A_{l} for \frac{t_{l+1}-t_{l}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
6:     else if a_{t_{l-1}}\notin A_{l} and l is odd then
7:        Along the direction of {i_{1}}\rightarrow\dots\rightarrow{i_{k}}, find the first action that still remains in A_{l}. Starting from this action, along the direction of {i_{1}}\rightarrow\dots\rightarrow{i_{k}}, choose each action in A_{l} for \frac{t_{l+1}-t_{l}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
8:     else if a_{t_{l-1}}\notin A_{l} and l is even then
9:        Along the direction of {i_{k}}\rightarrow\dots\rightarrow{i_{1}}, find the first action that still remains in A_{l}. Starting from this action, along the direction of {i_{k}}\rightarrow\dots\rightarrow{i_{1}}, choose each action in A_{l} for \frac{t_{l+1}-t_{l}}{|A_{l}|} consecutive rounds. Mark the last chosen action as a_{t_{l}}.
10:     end if
11:     Statistical test: deactivate all actions i s.t. \exists action j with \mathtt{UCB}_{t_{l}}(i)<\mathtt{LCB}_{t_{l}}(j), where
\texttt{UCB}_{t_{l}}(i)={\text{empirical mean of action }i\text{ in }[1:t_{l}]}+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}},
\texttt{LCB}_{t_{l}}(i)={\text{empirical mean of action }i\text{ in }[1:t_{l}]}-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}.
12:  end for
13:  In the last interval, choose the action with the highest empirical mean (up to round t_{m_{G}^{U}(S)}).
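
For completeness, here is a sketch (ours) of the partition step referenced in Algorithm 3. We take t_{j}=\lfloor k(T/k)^{(2-2^{1-j})/(2-2^{-m})}\rfloor for j=1,\dots,m with m=m_{G}^{U}(S), which is consistent with the quantities used in the analysis above, though the exact rounding convention is ours and is not essential.

def partition(T, k, m):
    """Checkpoints t_1 < ... < t_m of the interval partition.

    In our reading of the policy, interval l ends at round t_l and the
    last interval runs from round t_m + 1 to round T.  The exponents
    follow the quantities appearing in the regret analysis.
    """
    denom = 2.0 - 2.0 ** (-m)
    return [int(k * (T / k) ** ((2.0 - 2.0 ** (1 - j)) / denom)) for j in range(1, m + 1)]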

Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.

From round 1 to round t_{1}, the HS-SE policy incurs H switching cost.

For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}, no matter whether l is odd or even, no matter whether the last action in interval l is eliminated before the start of interval l+1 or not, by the switching order (determined by the shortest Hamiltonian path of G) and the triangle inequality, the HS-SE policy always incurs at most H switching cost.

From round t_{m(S)} to round T, since the HS-SE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the HS-SE policy incurs at most \max_{i,j\in[k]}c_{i,j} switching cost from round t_{m(S)} to round T.

Summarizing the above arguments, we find that the HS-SE policy incurs at most m(S)H+\max_{i,j\in[k]}c_{i,j}\leq S switching cost from round 1 to round T, where the inequality holds because m(S)=\lfloor(S-\max_{i,j\in[k]}c_{i,j})/H\rfloor implies m(S)H\leq S-\max_{i,j\in[k]}c_{i,j}. Thus it is indeed an S-switching-budget policy.

We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of chosen samples of action i in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as

r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},~{}~{}\forall i\in[k],t\in[T].

Define the clean event as

\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],~{}~{}|\bar{\mu}_{t}(i)-\mu_{i}% |\leq r_{t}(i)\}.

By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mbox{$\mathcal{D}$}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.

The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 3 can be expressed as

\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),~{}~{}\forall l\in[m% (S)+1],i\in[k],
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),~{}~{}\forall l\in[m% (S)+1],i\in[k].

Let \pi denote the HS-SE policy. First, observe that for any environment \mathcal{D},

\displaystyle R_{\mbox{$\mathcal{D}$}}^{\pi}(T) \displaystyle=\mathbb{E}_{\mbox{$\mathcal{D}$}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mbox{$\mathcal{D}$}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mbox{$\mathcal{D}$}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mbox{$\mathcal{D}$}}^{\pi}(\bar{\mathcal{E}})
\displaystyle\leq\mathbb{E}_{\mbox{$\mathcal{D}$}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}}
\displaystyle=\mathbb{E}_{\mbox{$\mathcal{D}$}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1), (9)

so in order to bound R^{\pi}(T)=\sup_{\mbox{$\mathcal{D}$}}R_{\mbox{$\mathcal{D}$}}^{\pi}(T), we only need to focus on the clean event.

Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval when i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval when we did not eliminate action i yet (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the HS-SE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of round \eta_{i}-1 must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,

\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-% 1}}(i)=4r_{t_{\eta_{i}-1}}(i), (10)

where the last equality is because i^{*} and i are chosen an equal number of times in each interval up to interval \eta_{i}, which implies that n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 3, for simplicity, we ignore the rounding issues of \frac{t_{l+1}-t_{l}}{|A_{l}|} for each interval l. Handling the rounding carefully adds no difficulty to the analysis, as the policy can always adopt a rounding rule that keeps the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)

Since i is never chosen after the \eta_{i}-th interval, we have n_{\eta_{i}}(i)=n_{T}(i), and therefore r_{\eta_{i}}(i)=r_{T}(i).

The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the HS-SE policy and (10), we can bound this quantity as

\displaystyle R(T;i) \displaystyle=n_{T}(i)\Delta(i)
\displaystyle\leq 4n_{\eta_{i}}(i)\sqrt{\frac{2\log T}{n_{\eta_{i}-1}(i)}}
\displaystyle\leq C_{0}\sqrt{2\log T}\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{% t_{\eta_{i}-1}/k}}
\displaystyle\leq 4C_{0}\sqrt{2\log T}\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|}

for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,

\displaystyle\mathbb{E}_{\mbox{$\mathcal{D}$}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right] \displaystyle=\sum_{i\in[k]}R(T;i)
\displaystyle\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}k(T/k)^{1/(2-2^{-m(S)})}% \frac{1}{{|A_{\eta_{i}}|}}
\displaystyle\leq C_{1}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}% \frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{2}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}% \frac{1}{j}
\displaystyle\leq C_{3}(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus by (9) and R^{\pi}(T)=\sup_{\mbox{$\mathcal{D}$}}R_{\mbox{$\mathcal{D}$}}^{\pi}(T), we have

R^{\pi}(T)\leq C(\log k\log T)k^{2-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}

for some absolute constant C\geq 0, where m(S)=m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.\hfill\Box

Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Without loss of generality, we assume that \arg\max_{i\in[k]}(\min_{j\neq i}c_{i,j})=1, i.e., \min_{j\neq 1}c_{1,j}\geq\min_{j\neq i}c_{i,j} for all i\in[k]. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{L}(S)=\lfloor(S-\max_{i\in[k]}\min_{j\neq i}c_{i,j})/H\rfloor.

Given any k=|G|\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mbox{$\mathcal{D}$}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \mbox{\boldmath$\mu$}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.

For any environment \mu, let X_{\mbox{\boldmath$\mu$}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of each action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\mbox{\boldmath$\mu$}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.

For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{% \boldmath$\mu$}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{\boldmath$\mu$}}^{t}(\pi% _{t}), then \mathbb{F}=(\mathcal{F}_{t})_{t\in T} is a filtration.

For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}, see detailed definitions in Chapter 15 of Wainwright (2019).

For any environment \mu, for any policy \pi\in\Pi_{S}, we make some key definitions as below.

1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\dots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

  • \tau_{1}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[1:t]$}\} if the set is non-empty and \tau_{1}=\infty otherwise.

  • \tau_{2}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{1}:t]$}\} if the set is non-empty and \tau_{2}=\infty otherwise.

  • Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{j-1}:t]$}\} if the set is non-empty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1.

It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.

2. We then define a series of random variables (depending on the stopping times).

  • S(1,\tau_{1}) is the total switching cost incurred in [1:\tau_{1}] (note that if there is a switch happening between \tau_{1} and \tau_{1}+1, we do not count its cost in S(1,\tau_{1})).

  • For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the total switching cost incurred in [\tau_{j-1}:\tau_{j}] (note that if there is a switch happening between \tau_{j-1}-1 and \tau_{j-1}, or between \tau_{j} and \tau_{j}+1, we do not count its cost in S(\tau_{j-1},\tau_{j})).

  • S(\tau_{m(S)},T) is the total switching cost incurred in [\tau_{m(S)}:T] (note that if there is a switch happening between \tau_{m(S)}-1 and \tau_{m(S)}, we do not count its cost in S(\tau_{m(S)},T)).

3. Next we define a series of events.

  • E_{1}=\{\tau_{1}>t_{1}\}.

  • For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

  • E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.

Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 3.

4. Finally we define a series of shrinking errors.

  • \Delta_{1}=1.

  • For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+% 1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}.)

  • \Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{% 2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)

5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking values in [k], i.e., \pi_{\infty}=i with probability 1/k for each i\in[k].

Lemma 2

For any environment \mu, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is strictly less than }H+\max_{i\in[k]}\min_{j\neq i}c_{i,j}\} almost surely.

Proof of Lemma 2. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1},\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, we know that S(1,\tau_{1})\geq H, S(\tau_{1},\tau_{2})\geq H, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq H. Thus we have

S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S% )H.

Since \pi\in\Pi_{S}, we further know that

\displaystyle S(\tau_{m(S)},T) \displaystyle\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1}% ,\tau_{m(S)})]
\displaystyle\leq S-m(S)H<H+\max_{i\in[k]}\min_{j\neq i}{c_{i,j}}=H+\min_{j% \neq 1}{c_{1,j}}

happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is % strictly less than }H+\min_{j\neq 1}{c_{1,j}}\} almost surely. \hfill\Box

Consider a class of environments \Lambda=\{\mbox{\boldmath$\mu$}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i% }\leq\frac{\Delta_{m(S)+1}}{2},\forall i\neq 1\}. Pick an arbitrary environment {\alpha} from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, by the union bound, we have

\sum_{j=1}^{m(S)+1}\mathbb{P}_{{\alpha}}^{\pi}(E_{j})\geq\mathbb{P}_{{\alpha}}% ^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1.

Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1).

Since \mathbb{P}_{{\alpha}}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})% \geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{% \pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{% \mathbb{P}_{{\alpha}}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have

\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq% \mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1% }{k(m(S)+1)}.

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] only depends on policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have

\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{% \pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(% \mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}. (11)

We now consider a new environment {\beta} such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}; then we have

\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]). (12)

But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (11) and (12) we have

\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$% \mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k% (m(S)+1)}.

However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mbox{$\mathbb{P}$}_{\beta}^{\pi}(% \mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]% \geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1% /(2-2^{-m(S)})}.

Since \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1% }\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}% })=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau% _{j^{*}}>t_{j^{*}},\pi_{\tau_{j}^{*}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}% },\pi_{\tau_{j}^{*}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}% })}{k}\geq\frac{1}{k(m(S)+1)}.

Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have

\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.
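For clarity, the event inclusion behind the first inequality above can be spelled out (this is only a restatement of the preceding paragraph): since \tau_{j^{*}-1}\leq t_{j^{*}-1} and \tau_{j^{*}}>t_{j^{*}} together give [t_{j^{*}-1}:t_{j^{*}}]\subseteq[\tau_{j^{*}-1}:\tau_{j^{*}}-1], we have

\{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\}\subseteq\{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}}\}\cap\{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}\subseteq\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}],

and the inequality follows by monotonicity of probability.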

Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]; that is, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] only depends on the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} denote the probability measure induced by the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. We then have

\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}. (13)

We now consider a new environment \beta whose i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and whose other components are the same as those of \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]. Let \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} denote the probability measure induced by the policy \pi, the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. We then have

\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]). (14)

We now bound the difference between the left-hand sides (LHS) of (13) and (14). We have

\displaystyle|\text{LHS of }(13)-\text{LHS of }(14)|
\displaystyle\leq \displaystyle D_{\mathrm{TV}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)
\displaystyle\leq \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)}
\displaystyle\leq \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\right)}
\displaystyle= \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}
\displaystyle= \displaystyle\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}\right)^{2}}{2}\right]}
\displaystyle\leq \displaystyle\frac{\sqrt{t_{j^{*}-1}}\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)},

where the first inequality is by the definition of the total variation distance between two probability measures, the second inequality is by Pinsker's inequality, and the third inequality is by the data-processing inequality in information theory.
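The two equalities in the above chain can be justified as follows; this is a sketch under the assumption (suggested by the (\Delta_{j^{*}})^{2}/2 term) that each reward X^{t}(i) is Gaussian with unit variance. Since \beta differs from \alpha only in its i^{\prime}-th component, the KL divergence between the two product measures tensorizes across coordinates, every coordinate with i\neq i^{\prime} contributes zero, and the i^{\prime}-th coordinate contributes over the t_{j^{*}-1}-1 rounds in [1:t_{j^{*}-1}-1]:

D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)=\sum_{t=1}^{t_{j^{*}-1}-1}D_{\mathrm{KL}}\left(\mathcal{N}(\alpha_{i^{\prime}},1)~\big\|~\mathcal{N}(\alpha_{i^{\prime}}+\Delta_{j^{*}},1)\right)=(t_{j^{*}-1}-1)\frac{(\Delta_{j^{*}})^{2}}{2},

using the standard formula D_{\mathrm{KL}}(\mathcal{N}(\mu,1)\,\|\,\mathcal{N}(\mu+\Delta,1))=\Delta^{2}/2.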

Combining the above inequality with (13) and (14), we have

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}.

However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} incurs at least \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 regret per round. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] means that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have

\displaystyle R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{*}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]
\displaystyle\geq \displaystyle\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)}}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}\left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{-m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{\frac{-2^{1-m(S)}}{2-2^{-m(S)}}}\right)
\displaystyle\geq \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right).
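The last two inequalities in the above chain can be spelled out as follows (assuming, as in the case under consideration, that j^{*}\leq m(S), so that 2^{1-j^{*}}\geq 2^{1-m(S)}): since T/k>1,

(T/k)^{\frac{-2^{1-j^{*}}}{2-2^{-m(S)}}}\leq(T/k)^{\frac{-2^{1-m(S)}}{2-2^{-m(S)}}}\leq(T/k)^{-2^{-m(S)}},

where the second inequality uses \frac{2^{1-m(S)}}{2-2^{-m(S)}}=2^{-m(S)}\cdot\frac{2}{2-2^{-m(S)}}\geq 2^{-m(S)}.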

When m(S)\leq\log_{2}\log_{2}(T/k), we have

(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T/k}(2)}}=\frac{1}{2}.
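Spelling out the first inequality above: the condition m(S)\leq\log_{2}\log_{2}(T/k) gives 2^{m(S)}\leq\log_{2}(T/k), hence

2^{-m(S)}\geq\frac{1}{\log_{2}(T/k)},

and since T/k>1 in this regime, the map x\mapsto(T/k)^{-x} is nonincreasing, which yields (T/k)^{-2^{-m(S)}}\leq(T/k)^{-1/\log_{2}(T/k)}.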

Thus we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

when m(S)\leq\log_{2}\log_{2}(T/k).

Since \mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1) and

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i),

we know that there exists i^{\prime}\in[k] such that

\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(S)+1)}.

Since the two events below partition the event \{\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime}\} according to whether \tau_{m(S)+1} exceeds \frac{t_{m(S)}+T}{2}, at least one of them must have probability at least \frac{1}{2k(m(S)+1)}. Thus either

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}, (15)

or

\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}. (16)

If (15) holds true, then we consider a new environment \beta whose i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and whose other components are the same as those of \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (15) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those for Case 2, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}

and

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Now we consider the case that (16) holds true. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the total switching cost incurred in [\tau_{m(S)}:T] is strictly less than H+\min_{j\neq 1}{c_{1,j}}. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the total switching cost incurred in [\tau_{m(S)}:\tau_{m(S)+1}] is at least H. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}.
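The budget accounting in the preceding paragraph can be summarized by the following chain; here \mathrm{cost}[a:b] is shorthand (introduced only for this display, not notation used elsewhere) for the total switching cost incurred in rounds [a:b]. On the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\},

H+\mathrm{cost}[\tau_{m(S)+1}:T]\leq\mathrm{cost}[\tau_{m(S)}:\tau_{m(S)+1}]+\mathrm{cost}[\tau_{m(S)+1}:T]=\mathrm{cost}[\tau_{m(S)}:T]<H+\min_{j\neq 1}c_{1,j},

which yields \mathrm{cost}[\tau_{m(S)+1}:T]<\min_{j\neq 1}c_{1,j}.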

Suppose that i^{\prime}\neq 1. Then the event \mathcal{E}_{i^{\prime}} implies that action 1 is not chosen in [\tau_{m(S)+1}:T], as incurring c_{i^{\prime},1}\geq\min_{j\neq 1}{c_{1,j}} would violate the requirement that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 incurs at least a \Delta_{m(S)+1}/4 term in regret per round. As a result, we know that

R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{4}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Thus we only need to consider the sub-case i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} only depends on the policy \pi, the random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:{(t_{m(S)}+T)}/{2}]} for i\neq 1. Consider a new environment \beta whose first component is \alpha_{1}-\Delta_{m(S)+1} and whose other components are the same as those of \alpha. Using arguments analogous to those for Case 2, we can derive that

\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}.

However, action 1 is the worst action in environment \beta, and each round in which action 1 is chosen incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. Since switching from action 1 to any other action incurs a cost of at least \min_{j\neq 1}{c_{1,j}}, the event \mathcal{E}_{1} implies that action 1 is chosen in every round from round \tau_{m(S)+1}\ (\leq\frac{t_{m(S)}+T}{2}) to round T, i.e., action 1 is chosen in at least the last T-\frac{t_{m(S)}+T}{2}+1=\frac{T-t_{m(S)}}{2}+1 rounds. As a result, we know that

R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k).

Combining Cases 1, 2 and 3, we know that

R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}

for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with infinite switching budget) is \Omega(\sqrt{kT}), we know that

R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT}

for some absolute constant C>0. To sum up, we have

R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}

for some absolute constant C>0, where m(S)=m_{G}^{L}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\rfloor. \hfill\Box
