Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints
David Simchi-Levi
Institute for Data, Systems and Society, Department of Civil and Environmental Engineering, and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu
Yunzong Xu
Institute for Data, Systems and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, yxu@mit.edu
We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. We prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the tradeoff between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.
The multi-armed bandit (MAB) problem is one of the most fundamental problems in online learning, with diverse applications ranging from pricing and online advertising to clinical trials. Over the past several decades, it has been a very active research area spanning different disciplines, including computer science, operations research, statistics and economics.
In a traditional multi-armed bandit problem, the learner (i.e., decision-maker) is allowed to switch freely between actions, and an effective learning policy may incur frequent switching. Indeed, the learner’s task is to balance the exploration-exploitation tradeoff, and both exploration (i.e., acquiring new information) and exploitation (i.e., optimizing decisions based on up-to-date information) require switching. However, in many real-world scenarios, it is costly to switch between different alternatives, and a learning policy with limited switching behavior is preferred. The learner thus has to consider the cost of switching in her learning task.
In this paper, we introduce the Bandits with Switching Constraints (BwSC) problem. We note that most previous research in multi-armed bandits has modeled the switching cost as a penalty in the learner’s objective, and hence the learner’s switching behavior is a complete output of the learning algorithm. However, in many real-world applications, there are strict limits on the learner’s switching behavior, which should be modeled as a hard constraint, and hence the learner’s allowable level of switching is an input to the algorithm. In addition, while most prior research assumes specific structures on switching costs (e.g., unit or homogeneous costs), in reality, switching between different pairs of actions may incur heterogeneous costs that do not follow any parametric form. These gaps motivate us to propose the BwSC framework, which includes a hard constraint acting on the total switching cost.
In addition to its strong modeling power and practical significance, the BwSC problem is theoretically important, as it is a natural framework to study the fundamental tradeoff between the regret and the maximum incurred switching cost of any policy in the classical multi-armed bandit problem. In particular, it enables characterizing important switching patterns associated with any effective exploration-exploitation policy. Thus, the study of the BwSC problem leads to a series of new results for the classical multi-armed bandit problem.
The BwSC framework has numerous applications, including dynamic pricing, online assortment optimization, online advertising, clinical trials, labor markets and vehicle routing. We describe a representative motivating example below.
Dynamic pricing with demand learning. Dynamic pricing with demand learning has proven its effectiveness in online retailing. However, it is well known that in practice, sellers often face business constraints that prevent them from conducting extensive price experimentation and making frequent price changes; see Cheung et al. (2017) and Chen and Chao (2019). The seller’s sequential decision-making problem can be modeled as a BwSC problem, where changing from each price to another price incurs some cost, and there is a limit on the total cost incurred by price changes. Here, a high switching cost between two prices implies that the corresponding price change is highly undesirable, while a low switching cost implies that the corresponding price change is generally acceptable.
In Section id1, we propose the BwSC framework. In Section id1, we review related literature. In Section id1, we discuss the unit-switching-cost model. In Section id1, we discuss the general-switching-cost model. Finally, in Section id1, we conclude.
For all n_{1},n_{2}\in\mathbb{N} such that n_{1}\leq n_{2}, we use [n_{1}] to denote the set \{1,\dots,n_{1}\}, and use [n_{1}:n_{2}] (resp. (n_{1}:n_{2}]) to denote the set \{n_{1},n_{1}+1,\dots,n_{2}\} (resp. \{n_{1}+1,\dots,n_{2}\}). For all x\geq 0, we use \lfloor x\rfloor to denote the largest integer less than or equal to x. For ease of presentation, we define \lfloor x\rfloor=0 for all x<0. Throughout the paper, we use big O,\Omega,\Theta notations to hide constant factors, and use \tilde{O},\tilde{\Omega},\tilde{\Theta} notations to hide constant factors and logarithmic factors.
Consider a k-armed bandit problem where a learner chooses actions from a fixed set [k]=\{1,\dots,k\}. There is a total of T rounds. In each round t\in[T], the learner first chooses an action i_{t}\in[k], then observes a reward r_{t}(i_{t})\in\mathbb{R}. For each action i\in[k], the reward of action i is drawn i.i.d. from an (unknown) distribution \mathcal{D}_{i} with (unknown) expected value \mu_{i}. We assume that the distributions \mathcal{D}_{i} are standardized sub-Gaussian. [Footnote 1: This is a standard assumption in the stochastic bandit literature. Note that the class of sub-Gaussian distributions is sufficiently wide, as it contains Gaussian, Bernoulli and all bounded distributions.] Without loss of generality, we assume \sup_{i,j\in[k]}\mu_{i}-\mu_{j}\in[0,1].
In our problem, the learner incurs a switching cost c_{i,j}=c_{j,i}\geq 0 each time she switches between action i and action j (i,j\in[k]). [Footnote 2: We allow c_{i,j}=\infty, which means that switching from i to j is prohibited.] In particular, c_{i,i}=0 for i\in[k]. There is a pre-specified switching budget S\geq 0 representing the maximum amount of switching costs that the learner can incur in total. Once the total switching cost exceeds the switching budget S, the learner cannot switch her actions any more. The learner’s goal is to maximize the expected total reward over T rounds.
Let \pi denote the learner’s (non-anticipating) learning policy, and \pi_{t}\in[k] denote the action chosen by policy \pi at round t\in[T]. More formally, \pi_{t} establishes a probability kernel acting from the space of historical actions and observations to the space of actions at round t. Let \mathbb{P}^{\pi}_{\mathcal{D}} and \mathbb{E}^{\pi}_{\mathcal{D}} be the probability measure and expectation induced by policy \pi and latent distributions \mathcal{D}=(\mathcal{D}_{1},\dots,\mathcal{D}_{k}). According to Section id1, we only need to restrict our attention to the S-switching-budget policies, which take S, k and T as input and are defined as below. [Footnote 3: Note that here we do not make any assumption on the learner’s behavior. In particular, we do not require the learner to intentionally pick an S-switching-budget policy: the switching constraint makes the learner’s policy automatically equivalent to an S-switching-budget policy.]
Definition 1
A policy \pi is said to be an S-switching-budget policy if for all \mathcal{D},
\mathbb{P}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T-1}c_{\pi_{t},\pi_{t+1}}\leq S\right]=1.
Let \Pi_{S} denote the set of all S-switching-budget policies, which is also the admissible policy class of the BwSC problem.
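As a minimal sketch of the constraint in Definition 1, the following Python snippet sums the switching costs along one realized action sequence and compares the total with the budget S. The function names and the cost-matrix representation (a list of lists, with math.inf for prohibited switches) are illustrative assumptions. Definition 1 requires the inequality to hold with probability one for every \mathcal{D}, whereas this check covers a single realized path only.

```python
import math


def total_switching_cost(actions, cost):
    """Sum of cost[a][b] over consecutive rounds of one realized action sequence."""
    return sum(cost[a][b] for a, b in zip(actions, actions[1:]))


def within_budget(actions, cost, S):
    """True if this realized sequence satisfies the switching constraint."""
    return total_switching_cost(actions, cost) <= S


# Unit switching costs on k = 3 actions (hypothetical example data).
k = 3
unit_cost = [[0 if i == j else 1 for j in range(k)] for i in range(k)]
path = [0, 0, 1, 1, 1, 2, 2]  # two switches: 0 -> 1 and 1 -> 2

# A cost matrix in which switching between actions 0 and 2 is prohibited.
blocked = [row[:] for row in unit_cost]
blocked[0][2] = blocked[2][0] = math.inf
```

With these inputs, `path` incurs a total cost of 2 under `unit_cost`, so it is admissible for S = 2 but not for S = 1. Any sequence that moves directly between actions 0 and 2 violates every finite budget under `blocked`.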
The performance of a learning policy is measured against a clairvoyant policy that maximizes the expected total reward given foreknowledge of the environment (i.e., latent distributions) \mathcal{D}. Let i^{*}=\arg\max_{i\in[k]}\mu_{i} and \mu^{*}=\max_{i\in[k]}\mu_{i}. If a clairvoyant knows \mathcal{D} in advance, then she would choose the “optimal” action i^{*} for every round and her expected total reward would be T\mu^{*}. We define the regret of policy \pi as the worst-case difference between the expected performance of the optimal clairvoyant policy and the expected performance of policy \pi:
R^{\pi}(T)=\sup_{\mathcal{D}}\left\{T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right]\right\}.
The minimax (optimal) regret of BwSC is defined as R_{S}^{*}(T)=\inf_{\pi\in\Pi_{S}}R^{\pi}(T).
In our paper, when we say a policy is “near-optimal” or “optimal up to logarithmic factors”, we mean that its regret bound is optimal in T up to logarithmic factors of T, irrespective of whether the bound is optimal in k, since typically k is much smaller than T (e.g., k=O(1)). Still, our derived bounds are actually quite tight in k.
Remark. There are two notions of regret in the stochastic bandit literature. The R^{\pi}(T) regret that we consider is called distribution-free, as it does not depend on \mathcal{D}. On the other hand, one can also define the distribution-dependent regret R_{\mathcal{D}}^{\pi}(T)=T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right] that depends on \mathcal{D}. This second notion of regret is only meaningful when \mu_{1},\dots,\mu_{k} are well-separated. Unlike the classical MAB problem, where there are policies that simultaneously achieve near-optimal bounds under both regret notions, in the BwSC problem, due to the limited switching budget, finding a policy that simultaneously achieves near-optimal bounds under both regret notions is usually impossible. Thus in the main body of the paper, we focus on the distribution-free regret. However, in Appendix A, we extend our results to the distribution-dependent regret.
As Section id1 and Section id1 show, BwSC and MAB share the same definition of R^{\pi}(T), and the only difference between BwSC and MAB is the existence of a switching constraint \pi\in\Pi_{S}, determined by (c_{i,j})\in\overline{\mathbb{R}}_{\geq 0}^{k\times k} and S\in\overline{\mathbb{R}}_{\geq 0} (when S=\infty, BwSC degenerates to MAB). This makes BwSC a natural framework to study the tradeoff between regret and incurred switching cost in MAB. That is, the tradeoff between the optimal regret R_{S}^{*}(T) and switching budget S in BwSC completely characterizes the tradeoff between a policy’s best achievable regret and its worst possible incurred switching cost in MAB. We are interested in how R_{S}^{*}(T) behaves over a range of switching budget S, and how it is affected by the structure of switching costs (c_{i,j}).
The stochastic MAB problem has been extensively studied for more than fifty years. Seminal results include the \Theta(\sqrt{T}) distribution-free regret bound in Vogel (1960) and the \Theta(\log T) distribution-dependent regret bound in Lai and Robbins (1985). We point out two excellent surveys written by Lattimore and Szepesvári (2018) and Slivkins (2019) for more references on this topic.
There is a rich literature focusing on stochastic MAB with switching costs. Most of the papers model the switching cost as a penalty in the learner’s objective, i.e., they measure a policy’s regret and incurred switching cost using the same metric and the objective is to minimize the sum of these two terms (e.g., Agrawal et al. 1988, 1990, Brezzi and Lai 2002, Cesa-Bianchi et al. 2013; there are other variations with discounted rewards, e.g., Banks and Sundaram 1994, Asawa and Teneketzis 1996, Bergemann and Välimäki 2001; see Jun 2004 for a survey). Though this conventional “switching penalty” model has attracted significant research interest in the past, it has two limitations. First, under this model, the learner’s total switching cost is an output determined by the algorithm. However, in many real-world applications, there are strict limits on the learner’s total switching cost, which should be modeled as a hard constraint, and hence the learner’s switching budget should be an input that helps determine the algorithm. In particular, while the algorithm in Cesa-Bianchi et al. (2013) developed for the “switching penalty” model can achieve \tilde{O}(\sqrt{T}) (near-optimal) regret with O(\log\log T) switches, if the learner wants a policy that always incurs finite switching cost independent of T, then prior literature does not provide an answer. Second, the “switching penalty” model has a fundamental weakness in studying the tradeoff between regret and incurred switching cost in stochastic MAB: since the O(\log\log T) bound on the incurred switching cost of a policy is negligible compared with the \tilde{O}(\sqrt{T}) bound on its optimal regret, when adding the two terms up, the term associated with incurred switching cost is always dominated by the regret, and thus no tradeoff can be identified. As a result, to the best of our knowledge, prior literature has not characterized the fundamental tradeoff between regret and incurred switching cost in stochastic MAB.
The BwSC framework addresses the issues associated with the “switching penalty” model in several ways. First, it introduces a hard constraint on the total switching cost, enabling us to design good policies that guarantee limited switching cost. While O(\log\log T) switches have proven to be sufficient for a learning policy to achieve near-optimal regret in MAB, in BwSC, we are mostly interested in the setting of finite or o(\log\log T) switching budget, which is highly relevant in practice. Second, by focusing on rewards in the objective function and incurred switching cost in the switching constraint, the BwSC framework enables the characterization of the fundamental tradeoff between regret and maximum incurred switching cost in MAB. Third, while most prior research assumes specific structures on switching costs (e.g., unit or homogeneous costs), BwSC allows general switching costs, which makes it a powerful modeling framework.
This paper is not the first one to study online learning problems with limited switches. Indeed, a few authors have realized the practical significance of a limited switching budget. For example, Cheung et al. (2017) consider a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy is allowed to make at most m price changes. Their constraint on the total number of price changes is motivated by collaboration with Groupon, a major e-commerce marketplace in North America. In such an environment, Groupon limits the number of price changes, either because of implementation constraints, or for fear of confusing customers and receiving negative customer feedback. They propose a pricing policy that guarantees O(\log^{(m)}T) (or m iterations of the logarithm) regret with at most m price changes, and report that in a field experiment, this pricing policy with a single price change increases revenue and market share significantly. Chen and Chao (2019) study a multi-period stochastic inventory replenishment and pricing problem with unknown demand and limited price changes. Assuming that the demand function is drawn from a parametric class of functions, they develop a finite-price-change policy based on maximum likelihood estimation that achieves optimal regret.
We note that both Cheung et al. (2017) and Chen and Chao (2019) focus only on specific decision-making problems, and their results rely on strong assumptions about the unknown environment. Cheung et al. (2017) assume a known finite set of potential demand functions, and require the existence of discriminative prices that can efficiently differentiate all potential demand functions. Chen and Chao (2019) assume a known parametric form of the demand function, and also require a well-separated condition. By contrast, the BwSC model in our paper is generic and assumes no prior knowledge of the environment. The learning task in the BwSC problem is thus more challenging than in previous models. Also, the switching constraint in the BwSC problem is more general than the price-change constraints in previous models.
In the Bayesian bandit setting, Guha and Munagala (2013) study the “bandits with metric switching costs” problem that allows a constraint involving metric switching costs. Using competitive ratio as the performance metric and assuming Bayesian priors, they develop a 4-approximation algorithm for the problem. The competitive ratio is measured against an optimal online policy that does not know the true distributions. As pointed out by the authors, the optimal online policy can be directly determined by a dynamic program. So the main challenge in their model is a computational one. Our work is different, as we are using regret as our performance metric, and we are competing with an optimal clairvoyant policy that knows the true distributions, which is a much stronger benchmark. Our problem thus involves both statistical and computational challenges. In fact, the algorithm in Guha and Munagala (2013) cannot avoid a linear regret when applied to the BwSC problem.
In the adversarial bandit setting, Altschuler and Talwar (2018) study the adversarial MAB problem with a limited number of switches, which can be viewed as an adversarial counterpart of the unit-switching-cost BwSC problem. For any policy that makes no more than S\leq T switches, they prove that the optimal regret is \tilde{\Theta}(T\sqrt{k}/\sqrt{S}). Since we are considering a different setting (our problem is stochastic while theirs is adversarial), the results and methodologies in our paper are fundamentally different from those in their paper. In particular, while finite-switch policies cannot avoid linear regret in the adversarial setting, in the stochastic setting, finite switches are already able to guarantee sublinear regret. Moreover, while the optimal regret in Altschuler and Talwar (2018) decreases smoothly as S increases from 0 to T, in the stochastic setting, we identify surprising behavior of the optimal regret as S increases from 0 to \Theta(\log\log T), which, to the best of our knowledge, has not been identified in the bandit literature before.
The BwSC problem is also related to the batched bandit problem proposed by Perchet et al. (2016). The M-batched bandit problem is defined as follows: given a classical bandit problem, assume that the learner must split her learning process into M batches and is only able to observe data (i.e., realized rewards) from a given batch after the entire batch is completed. This implies that all actions within a batch are determined at the beginning of this batch. Here M can be viewed as a quantity measuring the learner’s adaptivity, i.e., her ability to learn from her data and adapt to the environment. An M-batch policy is defined as a policy that only observes realized data M-1 times through the entire horizon. Perchet et al. (2016) study the problem in the case of two arms, and prove that the optimal regret for the M-batched bandit problem is \tilde{\Theta}(T^{1/(2-2^{1-M})}). Very recently, Gao et al. (2019) extend these results to general k arms.
On the surface, the batched bandit problem and the BwSC problem seem like two different problems: the batched bandit problem limits observation and allows unlimited switching, while the BwSC problem limits switching and allows unlimited observation. Surprisingly, in this paper, we discover some nontrivial connections between the batched bandit problem and the unit-switching-cost BwSC problem. The connections will be further discussed in Section id1.
In this section, we consider the BwSC problem with unit switching costs, where c_{i,j}=1 for all i\neq j. In this case, since every switch incurs a unit cost, the switching budget S can be interpreted as the maximum number of switches that the learner can make in total. Without loss of generality, in this section we assume that S is a non-negative integer, and refer to an S-switching-budget policy as an S-switch policy. Note that the unit-switching-cost BwSC problem can be simply interpreted as “MAB with limited number of switches”.
The section is organized as follows. In Section id1, we propose a simple and intuitive policy that provides an upper bound on regret. In Section id1, we give a matching lower bound, indicating that our policy is rate-optimal up to logarithmic factors. In Section id1, we discuss several surprising findings in BwSC, named “phase transitions” and “cyclic phenomena” of the optimal regret, and further quantify the tradeoff between regret and maximum number of switches in MAB, contributing new insights to this classical problem. In Section id1, we discuss a surprising relationship between limited switches and limited adaptivity in bandit problems.
We first propose a simple and intuitive policy that provides an upper bound on regret. Our policy, called the S-Switch Successive Elimination (SS-SE) policy, is described in Algorithm 1. The design philosophy behind the SS-SE policy is to divide the entire horizon into several pre-determined intervals (i.e., batches) and to control the number of switches in each interval. The policy thus has some similarities with the 2-armed batched policy of Perchet et al. (2016) and the k-armed batched policy of Gao et al. (2019), which prove to be near-optimal in the batched bandit problem. However, since we are studying a different problem, directly applying a batched policy to the BwSC problem does not work. In particular, in the batched bandit problem, the number of intervals (i.e., batches) is a given constraint, while in the BwSC problem, the switching budget is the given constraint. We thus add two key ingredients into the SS-SE policy: (1) an index m(S) suggesting how many intervals should be used to partition the entire horizon; (2) a switching rule ensuring that the total number of switches across all k actions cannot exceed the switching budget S. These two ingredients make the SS-SE policy substantially different from an ordinary batched policy. The two ingredients are simple yet powerful: they actually enable transformation from any limited-batch policy to a corresponding limited-switch policy.
Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}], where the endpoints are defined by t_{0}=1 and
t_{i}=\lfloor k^{1-\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}T^{\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}\rfloor,~~\forall i=1,\dots,m(S)+1.
Initialization: Let the set of all active actions in the l-th interval be A_{l}. Set A_{1}=[k].
Policy:
\texttt{UCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}},
\texttt{LCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}.
Intuition about the Policy. The policy divides the T rounds into \lfloor\frac{S-1}{k-1}\rfloor+1 intervals in advance. The sizes of the intervals are designed to balance the exploration-exploitation tradeoff. An active set of “effective” actions A_{l} is maintained for each interval l. The policy has the following key features:

Limited switches and no adaptivity within each interval: In interval l, only |A_{l}|-1 switches happen. Within an interval, decisions on switches are determined at the beginning of the interval and do not depend on the rewards observed in this interval, so there is no adaptivity.

Successive elimination between intervals: At the end of interval l (l<m(S)), actions that perform poorly are eliminated from the active set A_{l+1}, and will not be chosen in interval l+1.

At most one switch between two consecutive intervals: If the last action chosen in interval l remains in A_{l+1} (l<m(S)), then it will be the first action chosen in interval l+1, and no switch occurs between these two intervals. If the last action chosen in interval l is eliminated from A_{l+1}, then interval l+1 starts from another action in A_{l+1}, and one switch occurs between these two intervals.

Exploitation in the last interval: In the last interval, only the empirical best action is chosen.
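The features above can be sketched as a small simulation. This is an illustrative reading of the description, not the authors' Algorithm 1: the equal split of each interval among active actions, the elimination rule (drop action i when UCB(i) falls below the largest LCB among active actions), the unit-variance Gaussian rewards, and all function names are assumptions, and k \geq 2 is assumed so that k-1 is nonzero.

```python
import math
import random


def m_of(S, k):
    # m(S) = floor((S-1)/(k-1)), with floor(x) = 0 for x < 0 as in the paper.
    return max(0, (S - 1) // (k - 1))


def grid(S, k, T):
    # Endpoints t_0 = 1, t_1, ..., t_{m(S)+1} = T of the m(S)+1 intervals.
    m = m_of(S, k)
    denom = 2 - 2.0 ** (-m)
    ends = [1]
    for i in range(1, m + 2):
        e = (2 - 2.0 ** (-(i - 1))) / denom
        ends.append(min(T, math.floor(k ** (1 - e) * T ** e)))
    return ends


def ss_se(means, S, T, seed=0):
    # Simulate one run and return the list of chosen actions (length T).
    rng = random.Random(seed)
    k = len(means)
    ends = grid(S, k, T)
    m = len(ends) - 2
    bounds = [0] + ends[1:]  # interval l covers rounds bounds[l-1]+1 .. bounds[l]
    active, last, actions = list(range(k)), None, []
    sums, plays = [0.0] * k, [0] * k

    def mean(i):
        return sums[i] / plays[i] if plays[i] else 0.0

    def radius(i):
        return math.sqrt(2 * math.log(T) / plays[i]) if plays[i] else math.inf

    for l in range(1, m + 2):
        if l == m + 1:
            order = [max(active, key=mean)]  # last interval: exploit
        else:
            order = active[:]
            if last in order:  # start from the last action if it survived
                order.remove(last)
                order.insert(0, last)
        # Assumption: each action in `order` gets one consecutive chunk.
        base, extra = divmod(bounds[l] - bounds[l - 1], len(order))
        for idx, i in enumerate(order):
            n = base + (1 if idx < extra else 0)
            for _ in range(n):
                sums[i] += rng.gauss(means[i], 1.0)
                plays[i] += 1
                actions.append(i)
            if n:
                last = i
        if l <= m:  # assumed elimination rule: keep i if UCB(i) >= max LCB
            best_lcb = max(mean(i) - radius(i) for i in active)
            active = [i for i in active if mean(i) + radius(i) >= best_lcb]
    return actions
```

Under these assumptions, interval 1 uses at most k-1 switches, each later exploration interval together with the switch that enters it uses at most k-1, and entering the last interval costs at most one, so the number of switches is at most m(S)(k-1)+1 \leq S for S \geq 1, whatever the rewards are.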
We show that the SS-SE policy is indeed an S-switch policy and establish the following upper bound on its regret. See Appendix B for a proof.
Theorem 1
Let \pi be the SS-SE policy, then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 1 and T\geq k,
R^{\pi}(T)\leq C(\log k\log T)k^{1-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}},
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Theorem 1 provides an upper bound on the optimal regret of the unit-switching-cost BwSC problem:
R^{*}_{S}(T)=\tilde{O}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}).
The SS-SE policy, though it achieves sublinear regret, seems to have many limitations that could weaken its performance, which on the surface may suggest that the regret bound is not optimal. Specifically:

The SS-SE policy does not make full use of its switching budget. Consider the case of 11 actions and a switching budget of 20. Since m(20)=\lfloor(20-1)/(11-1)\rfloor=1=m(11), the SS-SE policy will just run as if it could only make 11 switches, despite the fact that it has 9 additional units of switching budget (which will never be used). Intuitively, an effective learning policy should make full use of its switching budget. It seems that by tracking and allocating the switching budget in a more careful way, one can achieve lower regret.

The SS-SE policy has low adaptivity. Note that the SS-SE policy pre-determines the number, sizes and locations of its intervals before seeing any data, and executes actions within each interval based on a pre-determined schedule. Indeed, the SS-SE policy only learns from data at the end of each interval, at most \lfloor(S-1)/(k-1)\rfloor times. In the case of 11 actions and a switching budget of 20, the SS-SE policy will split the entire horizon into two intervals and will only learn at the end of the first interval, after which it will choose a single action to be applied throughout the entire second interval. Intuitively, data should be utilized to save switches and reduce regret, and one would expect that an effective policy will have a high degree of adaptivity, that is, it should learn from the available data and adapt to the environment more frequently than our policy. Put differently, it seems that by utilizing full adaptivity and learning from data in every round, one can achieve lower regret.

Besides the above limitations, the \tilde{O}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}) bound provided by SS-SE also seems a little clumsy. In particular, the m(S)=\lfloor(S-1)/(k-1)\rfloor term looks like an artificial term (it is intentionally designed to fit the switching rule in SS-SE), and does not look like a natural term that should appear in the true optimal regret R^{*}_{S}(T).
While the above arguments are based on our first instinct and seem very reasonable, surprisingly, all of them prove to be wrong: no S-switch policy can theoretically do better! In fact, we match the upper bound provided by SS-SE by showing an information-theoretic lower bound in Theorem 2. This indicates that the SS-SE policy is rate-optimal up to logarithmic factors, and R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}). Note that the tightness in T is achieved per instance, i.e., for every k and every S. That is, our lower bound is substantially stronger than a single lower bound demonstrated for specific k and S.
Theorem 2
There exists an absolute constant C>0 such that for all k\geq 1, S\geq 1, T\geq k and for every policy \pi\in\Pi_{S},
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Proof Idea. Our proof involves a novel “tracking the cover time” argument that (to the best of our knowledge) has not appeared in previous lower-bound proofs in the bandit literature and may be of independent interest. Specifically, we track a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)+1}, some of which may be \infty, that are recursively defined as follows:

\tau_{1} is the first time that all the actions in [k] have been chosen in period [1:\tau_{1}],

\tau_{2} is the first time that all the actions in [k] have been chosen in period [\tau_{1}:\tau_{2}],

Generally, \tau_{i} is the first time that all the actions in [k] have been chosen in period [\tau_{i-1}:\tau_{i}], for i=2,\dots,m(S)+1.
The structure of the series is carefully designed, enabling the realization of any two consecutive stopping times \tau_{i-1},\tau_{i} to convey the important message that there exists a specific (possibly unknown) action that has never been chosen in period [\tau_{i-1}:\tau_{i}-1]. This information in turn helps us to bound the difference of several key probabilities and derive the desired lower bound via information-theoretic arguments. For a complete proof of Theorem 2, see Appendix C.
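To make the recursive definition concrete, here is a small sketch that computes the cover times \tau_{1},\tau_{2},\dots of a given action sequence. The function name, the 1-indexed rounds and the use of math.inf for stopping times that are never reached are illustrative assumptions; the only part taken from the text is that each period [\tau_{i-1}:\tau_{i}] includes its left endpoint.

```python
import math


def cover_times(actions, k, n_times):
    """Return [tau_1, ..., tau_{n_times}] for a sequence over actions 0..k-1.

    Rounds are 1-indexed; a stopping time that is never reached is math.inf.
    """
    taus = []
    start = 1  # the period for tau_1 starts at round 1
    for _ in range(n_times):
        seen = set()
        tau = math.inf
        for t in range(start, len(actions) + 1):
            seen.add(actions[t - 1])
            if len(seen) == k:  # all k actions chosen in [start : t]
                tau = t
                break
        taus.append(tau)
        if tau == math.inf:
            # Later stopping times cannot be finite either.
            taus.extend([math.inf] * (n_times - len(taus)))
            break
        start = tau  # the next period [tau_{i-1} : tau_i] includes tau_{i-1}
    return taus


# Hypothetical sequence over k = 3 actions.
example = [0, 0, 1, 2, 2, 1, 0, 0]
taus = cover_times(example, 3, 3)
```

For this sequence the first two cover times are rounds 4 and 7, and the third is infinite. In each period [\tau_{i-1}:\tau_{i}-1] one action is missing (action 2 in rounds 1 to 3, action 0 in rounds 4 to 6), which is the property the proof idea relies on.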
Corollary 1
For any fixed k\geq 1, for any S\geq 1,
R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}).
Remark. We briefly explain why the upper and lower bounds in Theorem 1 and Theorem 2 match in T. When m(S)\leq\log_{2}\log_{2}(T/k), which is the case we are mostly interested in, (m(S)+1)^{2}=o(\log T), thus the upper and lower bounds match within o((\log T)^{2}). When m(S)>\log_{2}\log_{2}(T/k), the upper bound is O({\sqrt{T}\log T}), thus the upper and lower bounds directly match within O(\log T). We also argue that the slightly different terms of k appearing in the upper and lower bounds do not play an important role. In fact, the gap associated with k between the upper and lower bounds is O(\min\{k^{2.5},(T/k)^{m(S)0.5}\}). Since we are mostly interested in the case of k<<T (e.g., k=O(1) or k=O(\log T)), the O(k^{2.5}) gap is not important relative to T.
Corollary 1 allows us to characterize the tradeoff between the switching budget S and the optimal regret R^{*}_{S}{(T)}. To illustrate this tradeoff, Figure 1 and Table 1 depict the behavior of R^{*}_{S}{(T)} as a function of S given a fixed k. Note that as discussed in Section id1, the relationship between R^{*}_{S}{(T)} and S also characterizes the inherent tradeoff between regret and maximum number of switches in the classical MAB problem.
We observe several surprising phenomena regarding the tradeoff between S and R_{S}^{*}(T) for any given k.
Phase Transitions. [Footnote 5: We borrow this terminology from statistical physics, see Domb (2000).] As we have shown, R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}). To the best of our knowledge, this is the first time that a floor function naturally arises in the order of T in the optimal regret of an online learning problem. As a direct consequence of this floor function, the optimal regret of BwSC exhibits surprising phase transitions, described below.
Definition 2
(Phases and Critical Points) For a k-armed unit-switching-cost BwSC, we call the interval [(j-1)(k-1)+1,j(k-1)+1) the j-th phase, and call j(k-1)+1 the j-th critical point (j\in\mathbb{Z}_{>0}).
Fact 1
(Phase Transitions) As S increases from 0 to \Theta(\log\log T), S leaves the j-th phase and enters the (j+1)-th phase at the j-th critical point (j\in\mathbb{Z}_{>0}). Each time S arrives at a critical point, R_{S}^{*}(T) drops significantly, and then stays at the same level until S arrives at the next critical point.
Phase transitions are clearly presented in Figure 1. This phenomenon seems counterintuitive, as it suggests that increasing the switching budget does not help decrease the best achievable regret, as long as the budget does not reach the next critical point.
Note that phase transitions are only exhibited when S is in the range of 0 to \Theta(\log\log T). After S exceeds \Theta(\log\log T), R_{S}^{*}(T) will remain unchanged at the level of \tilde{\Theta}(\sqrt{T}): the optimal regret will only vary within logarithmic factors and there is no significant regret drop any more. Therefore, one can also view \Theta(\log\log T) as a “final critical point” after which phase transitions disappear. This additional “final phase transition” reveals a subtle and intriguing nature of phase transitions in BwSC.
Table 1: The optimal regret R_{S}^{*}(T) as a function of S for a fixed k.
S | [0,k) | [k,2k-1) | [2k-1,3k-2) | [3k-2,4k-3) | [4k-3,5k-4) | [5k-4,6k-5)
R_{S}^{*}(T) | \tilde{\Theta}(T) | \tilde{\Theta}(T^{2/3}) | \tilde{\Theta}(T^{4/7}) | \tilde{\Theta}(T^{8/15}) | \tilde{\Theta}(T^{16/31}) | \tilde{\Theta}(T^{32/63})
R_{S}^{*}(T)/R_{\infty}^{*}(T) | \tilde{\Theta}(T^{1/2}) | \tilde{\Theta}(T^{1/6}) | \tilde{\Theta}(T^{1/14}) | \tilde{\Theta}(T^{1/30}) | \tilde{\Theta}(T^{1/62}) | \tilde{\Theta}(T^{1/126})
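The phase structure can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and clipping m(S) at 0 (so that S=0 falls in the first column of the table) is an assumption. It only describes the regime m(S)\leq\log_{2}\log_{2}(T/k), and it requires k\geq 2.

```python
from fractions import Fraction


def phase_index(S: int, k: int) -> int:
    """m(S) = floor((S-1)/(k-1)), clipped at 0 (assumption for S = 0)."""
    if k < 2:
        raise ValueError("the phase structure needs at least two arms")
    return max(0, (S - 1) // (k - 1))


def regret_exponent(S: int, k: int) -> Fraction:
    """Exponent of T in the optimal regret: 1 / (2 - 2^{-m(S)})."""
    m = phase_index(S, k)
    return 1 / (2 - Fraction(1, 2 ** m))


def critical_point(j: int, k: int) -> int:
    """The j-th critical point j(k-1)+1 from Definition 2."""
    return j * (k - 1) + 1


if __name__ == "__main__":
    k = 5
    for S in range(0, 3 * k):
        print(S, phase_index(S, k), regret_exponent(S, k))
```

Using exact fractions makes the plateaus visible: the exponent is constant inside each phase and changes only when S reaches a critical point.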
Cyclic Phenomena. Along with phase transitions, we also observe the following phenomena.
Fact 2
(Cyclic Phenomena) The length of each phase is always equal to k-1, independent of S and T. We call the quantity k-1 the budget cycle.
Cyclic phenomena indicate that, if the learner’s switching budget is at a critical point, then the extra switching budget that the learner needs to achieve the next regret drop (i.e., to arrive at the next critical point) is always k-1. Cyclic phenomena also seem counterintuitive: when the learner has more switching budget, she can conduct more experiments and statistical tests, eliminate more poorly performing actions (which can be thought of as reducing k) and allocate her switching budget in a more flexible way. All of these suggest that the budget cycle should be a quantity decreasing with S. However, the cyclic phenomena tell us that the budget cycle is always a constant, and no learning policy in the unit-switching-cost BwSC (and in MAB) can escape this cycle, no matter how large S is, as long as S=o(\log\log T).
On the other hand, as S contains more and more budget cycles, the gap between R_{S}^{*}(T) and R_{\infty}^{*}(T)=\tilde{\Theta}(\sqrt{T}) does decrease dramatically. In fact, R_{S}^{*}(T) decreases doubly exponentially fast as S contains more budget cycles. For example, when S contains more than 2 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{4/7}); and when S contains more than 3 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{8/15}). From both Figure 1 and Table 1, we can verify that 3 or 4 budget cycles are already enough for an S-switch policy to achieve close-to-optimal regret in MAB (compared with the optimal policy with unlimited switching budget).
To sum up, the above analysis generates both “positive” and “negative” insights for decision-makers that face BwSC-type problems. On the one hand, the unavoidable phase transitions and cyclic phenomena show some fundamental limits imposed by the switching constraint, making it hopeless for decision-makers to reduce regret within each phase. On the other hand, once the decision-makers have enough switching budget to bring them to a new phase, they can enjoy a substantial regret drop. In particular, 3 or 4 budget cycles are already enough to guarantee extraordinary regret performance.
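To make the doubly exponential decay explicit, the ratio of exponents can be worked out directly. This is a short derivation under the assumption that m denotes the number of completed budget cycles, m=\lfloor(S-1)/(k-1)\rfloor, and that R_{\infty}^{*}(T)=\tilde{\Theta}(T^{1/2}):

```latex
\frac{R_{S}^{*}(T)}{R_{\infty}^{*}(T)}
  = \tilde{\Theta}\!\left(T^{\,e(m)}\right),
\qquad
e(m) = \frac{1}{2-2^{-m}} - \frac{1}{2}
     = \frac{2^{-m}}{2\,(2-2^{-m})}
     = \frac{1}{2\,(2^{m+1}-1)} .
```

For m=0,1,2,3 this gives e(m)=1/2, 1/6, 1/14, 1/30, so the excess exponent roughly halves with every additional budget cycle.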
The lower bound in Theorem 2 also leads to new results for the classical MAB problem.
Corollary 2
(The switching complexity of MAB) For the k-armed bandit problem, N(k-1)+1 switches are necessary and sufficient for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret for any fixed N\in\mathbb{Z}_{>0}, and \Theta(\log\log T) switches are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret.
Note that the number of switches stated in Corollary 2 refers to the maximum number of switches that a policy can make. While Cesa-Bianchi et al. (2013) and Perchet et al. (2016) have proposed policies that achieve \tilde{O}(\sqrt{T}) regret with O(\log\log T) switches, no prior work has answered the question of how many switches are necessary for a near-optimal learning policy in MAB. To the best of our knowledge, we are the first to show an \Omega(\log\log T) lower bound on the number of switches.
Based on our “tracking the cover time” argument, we can prove further results regarding the number of reswitches of each arm (including the worst arm in hindsight) that are necessary for an effective learning policy.
Definition 3
The number of reswitches of action i\in[k] is the total number of times that the learner switches to i from another action j\neq i. For the action chosen in round 1 (where there is no preceding action), the round-1 choice also counts as a reswitch.
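As an illustration of Definition 3, the sketch below counts reswitches from a recorded action sequence. The function name and the list-based input format are assumptions made for this example.

```python
from collections import Counter


def count_reswitches(actions):
    """Count, for each action, how often the learner switches to it.

    The first round has no preceding action, so the round-1 choice
    is counted as a reswitch, following Definition 3.
    """
    counts = Counter()
    previous = None
    for action in actions:
        if action != previous:
            counts[action] += 1
        previous = action
    return counts


if __name__ == "__main__":
    history = [1, 1, 2, 1, 3, 3, 1]
    print(count_reswitches(history))
```

Since every switch is a reswitch of its destination action, the total number of switches equals the sum of all reswitch counts minus one (the round-1 choice).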
Proposition 1
For the k-armed bandit problem, \lceil N/2\rceil reswitches of each action are necessary for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret for any fixed N\in\mathbb{Z}_{>0}, and \Theta(\log\log T) reswitches of each action are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret.
Note that if the learner is not allowed to rechoose an action that was chosen earlier and discarded later (i.e., if the number of reswitches of each action is at most 1), then the corresponding bandit problem is exactly the “irrevocable MAB problem” proposed by Farias and Madan (2011). Farias and Madan (2011) and Guha and Munagala (2013) study the price of irrevocability in the Bayesian bandit setting. Using competitive ratio (measured against the optimal online policy that does not know the true environment) as the performance metric, they show that the price of irrevocability is limited. Our results on the necessity of reswitching contrast with this finding: in the setting of regret minimization, where we compete with the optimal clairvoyant policy (a much stronger benchmark), our results indicate that an irrevocable policy must incur linear regret, and any effective policy cannot avoid “departing from and reswitching to” each action many times.
Specifically, Proposition 1 indicates that, for each learning policy that achieves near-optimal regret in MAB, we can always find an environment \mathcal{D} such that the policy departs from the optimal action \Theta(\log\log T) times and moves to the worst action \Theta(\log\log T) times. Surprisingly, the necessary number of reswitches of the worst action is essentially the same as that of the optimal action. Put differently, it is inevitable for any effective policy to repeatedly make some switching decisions that prove to be ineffective in hindsight.
In this subsection, we discuss the relationship between limited switches and limited adaptivity in bandit problems. As discussed in Section id1, in the unit-switching-cost BwSC problem, the constraint is on the number of switches and is defined in the “action world”, hence the learner has full adaptivity. By contrast, in the batched bandit problem, the constraint is on adaptivity and is defined in the “observation world”, hence the learner has full switching power. Since the two constraints are defined in two different “worlds”, it seems that the two problems are not directly related. However, the results in our paper suggest that the two problems are actually deeply related. To show this, we provide an alternate way of understanding the results of Section id1 by examining the relationship between the two problems given a fixed k.
The SSSE policy in Section id1 helps us establish a one-sided relationship between the two problems. Specifically, any M-batch policy that achieves a certain regret upper bound in the M-batched k-armed bandit problem can be transformed, using the SSSE policy ingredients and randomization, into an S-switch policy that achieves the same regret upper bound in the S-budget k-armed unit-cost BwSC problem, as long as S\in[(M-1)(k-1)+1:M(k-1)]. This implies the following fact:
Fact 3
For any fixed k\geq 1,

Any upper bound on the regret of the M-batched k-armed bandit problem serves as an upper bound on the regret of the S-budget k-armed unit-cost BwSC problem (S\in[(M-1)(k-1)+1:M(k-1)]).

Any lower bound on the regret of the S-budget k-armed unit-cost BwSC problem serves as a lower bound on the regret of the (m(S)+1)-batched k-armed bandit problem.
By contrast, given an arbitrary S-switch policy, it is generally impossible to transform it into an (m(S)+1)-batch policy, as an S-switch policy may use the observed data an unlimited number of times. For example, consider the following naïve policy: choose actions according to the celebrated UCB1 policy (Auer et al. 2002) until the number of switches exceeds S. The policy is clearly an S-switch policy, but cannot be transformed into any finite-batch policy. Therefore, the SSSE policy and Fact 3 are not enough for establishing a two-sided relationship between limited switches and limited adaptivity.
Surprisingly, our strong lower bound in Section id1 directly closes the gap between the regret upper bound of the batched bandit problem and the regret lower bound of a corresponding unit-switching-cost BwSC problem. Thus, we completely establish the two-sided relationship and essentially prove the following fact:
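The naïve policy mentioned above can be sketched as follows. This is only an illustration under stated assumptions: rewards are Bernoulli, the function name and the rule "stay on the current action once the budget is used up" are choices made for this example, and the index is the standard UCB1 index. The point of the sketch is that the index is recomputed from all past observations in every round, which is why such a policy is fully adaptive even though it makes at most S switches.

```python
import math
import random


def budgeted_ucb1(means, T, S, seed=0):
    """Follow UCB1, but stay on the current action once S switches are used.

    `means` are hypothetical Bernoulli reward means used only for simulation.
    Returns the action sequence and the number of switches made.
    """
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    actions, switches = [], 0
    for t in range(1, T + 1):
        # UCB1 index; unplayed actions get an infinite index.
        index = [
            math.inf if counts[i] == 0
            else sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
            for i in range(k)
        ]
        choice = max(range(k), key=index.__getitem__)
        if actions and choice != actions[-1]:
            if switches >= S:
                choice = actions[-1]  # budget exhausted: no more switching
            else:
                switches += 1
        reward = 1.0 if rng.random() < means[choice] else 0.0
        counts[choice] += 1
        sums[choice] += reward
        actions.append(choice)
    return actions, switches


if __name__ == "__main__":
    history, used = budgeted_ucb1([0.2, 0.5, 0.8], T=500, S=7)
    print(used, history[-1])
```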
Fact 4
For any fixed k, let R_{M}^{*}(T) be the optimal (minimax) regret of the M-batched k-armed bandit problem, and R_{S}^{*}(T) be the optimal (minimax) regret of the S-budget k-armed unit-cost BwSC. As long as S\in[(M-1)(k-1)+1,M(k-1)], the two problems have near-equal optimal regret (up to logarithmic factors), i.e., R_{M}^{*}(T) and R_{S}^{*}(T) are both \tilde{\Theta}(T^{\frac{1}{2-2^{1-M}}}).
In essence, our results reveal a surprising “regret equivalence” between limited switches (even with full adaptivity) and limited adaptivity (even with full switching power) in bandit problems: limiting switches (in the “action” world) implicitly limits adaptivity (in the “observation” world), and limiting adaptivity (in the “observation” world) implicitly limits switches (in the “action” world). Put differently, in an MAB problem, when the number of switches is limited, becoming more adaptive and using data more frequently may not lead to a reduction in regret.
Before closing Section id1, we give some comments on the practical implications and the scope of the results obtained in this section. First, it is worth noting that the phase transitions and cyclic phenomena discovered in this section are associated with theoretical bounds (in the minimax sense), not with the empirical performance of policies. While these phenomena are theoretically interesting and provide novel and important insights for decision-makers who want theoretical guarantees, in reality, when decision-makers are not worried about the regret incurred in the worst case, they can apply policies that are more adaptive than the SSSE policy to achieve better performance, and it is possible that they observe a much smoother empirical performance improvement as their switching budgets grow.
Second, in addition to performance improvement, decision-makers may value low-adaptive policies due to their simplicity and ease of implementation. From this perspective, the SSSE policy proposed in this section should be practically appealing, as it is theoretically optimal under both a limited number of switches and limited adaptivity. Indeed, the main concern about low-adaptive policies is typically their potential performance loss. The results developed in this section show that theoretically, given a limited number of switches, low-adaptive policies are able to achieve the optimal performance in the worst case. Thus, our results provide a strong validation of low-adaptive policies. For example, in dynamic pricing problems, sellers prefer both a small number of price changes (to reduce customers’ confusion) and low adaptivity (for ease of implementation); see Cheung et al. (2017). The SSSE policy developed in this section is thus attractive from both points of view.
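The correspondence in Fact 4 between batch counts and switching budgets can be tabulated with a few lines of code. This is a minimal sketch; the function names are hypothetical, and m(S) is clipped at 0 by assumption so that S=0 maps to a single batch.

```python
from fractions import Fraction


def batches_for_budget(S: int, k: int) -> int:
    """Number of batches M = m(S) + 1 matched to a switching budget S."""
    return max(0, (S - 1) // (k - 1)) + 1


def budget_range_for_batches(M: int, k: int):
    """Budgets S in [(M-1)(k-1)+1, M(k-1)] share the regret order of M batches."""
    return (M - 1) * (k - 1) + 1, M * (k - 1)


def exponent_batched(M: int) -> Fraction:
    """Exponent of T in the M-batched optimal regret: 1 / (2 - 2^{1-M})."""
    return 1 / (2 - Fraction(2) ** (1 - M))


if __name__ == "__main__":
    k = 4
    for M in range(1, 5):
        print(M, budget_range_for_batches(M, k), exponent_batched(M))
```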
We now proceed to the general case of BwSC, where c_{i,j} (=c_{j,i}) can be any nonnegative real number and even \infty. The problem is significantly more challenging in this general setting. For this purpose, we need to enhance the framework of Section id1 to better characterize the structure of switching costs. We do this by representing switching costs via a weighted graph.
Let G=(V,E) be a (weighted) complete graph, where V=[k] (i.e., each vertex corresponds to an action), and the edge between i and j is assigned a weight c_{i,j} (\forall i\neq j). We call the weighted graph G the switching graph. In this section, we assume the switching costs satisfy the triangle inequality: \forall i,j,l\in[k], c_{i,j}\leq c_{i,l}+c_{l,j}. We relax this assumption in Appendix E.
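Whether a given cost matrix satisfies the triangle-inequality assumption can be checked by brute force. The sketch below assumes the switching costs are stored as a symmetric k-by-k list of lists with zeros on the diagonal; the function name and the tolerance parameter are choices made for this example.

```python
import itertools


def satisfies_triangle_inequality(c, tol=1e-12):
    """Check c[i][j] <= c[i][l] + c[l][j] for all triples (i, j, l)."""
    k = len(c)
    return all(
        c[i][j] <= c[i][l] + c[l][j] + tol
        for i, j, l in itertools.product(range(k), repeat=3)
    )


if __name__ == "__main__":
    unit = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    skewed = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
    print(satisfies_triangle_inequality(unit))
    print(satisfies_triangle_inequality(skewed))
```

The check runs in O(k^3) time, which is negligible next to the offline Hamiltonian-path computation discussed later in this section.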
The results for the unit-switching-cost model suggest that an effective policy that minimizes the worst-case regret must repeatedly visit all actions, in a manner similar to the SSSE policy. This indicates that in the general-switching-cost model, an effective policy should repeatedly visit all vertices in the switching graph, in the most economical way, to stay within budget. This insight is proven formally in Appendix F, where we establish an interesting connection between bandit problems and graph traversal problems. Applying the result to the general BwSC problem, we discover a connection between the general BwSC problem and the celebrated shortest Hamiltonian path problem.
Motivated by this connection, we propose the Hamiltonian-Switching Successive Elimination (HSSE) policy, and present it in Algorithm 2. The policy enhances the original SSSE policy by adding an additional ingredient: a pre-specified switching order determined by the shortest Hamiltonian path of the switching graph G. Note that while the shortest Hamiltonian path problem is NP-hard, solving this problem is entirely an “offline” step in the HSSE policy. That is, for a given switching graph, the learner only needs to solve this problem once.
Input: Switching Graph G, Switching budget S, Horizon T
Offline Step: Find the shortest Hamiltonian path in G: i_{1}\rightarrow\dots\rightarrow i_{k}. Denote the total weight of the shortest Hamiltonian path as H. Calculate m_{G}^{U}(S)=\left\lfloor\frac{S-\max_{i,j\in[k]}c_{i,j}}{H}\right\rfloor.
Partition: Run the partition step in the SSSE policy with m(S)=m_{G}^{U}(S).
Initialization: Let the set of all active actions in the lth interval be A_{l}. Set A_{1}=[k], a_{0}={i_{1}}.
Policy:
\texttt{UCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}},
\texttt{LCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}.
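The offline step of Algorithm 2 can be sketched by brute force for small k. This is an illustration only: enumeration over all k! orderings is a stand-in for whatever exact or approximate solver one would use in practice, the function names are hypothetical, and clipping the index at 0 for very small budgets is an assumption. The sketch requires k\geq 2 and H>0.

```python
import itertools
import math


def shortest_hamiltonian_path(c):
    """Brute-force search over all vertex orderings (feasible for small k only)."""
    k = len(c)
    best_order, best_weight = None, math.inf
    for order in itertools.permutations(range(k)):
        weight = sum(c[a][b] for a, b in zip(order, order[1:]))
        if weight < best_weight:
            best_order, best_weight = order, weight
    return best_order, best_weight


def upper_index(c, S):
    """m_G^U(S) = floor((S - max_{i,j} c_ij) / H), clipped at 0 (assumption)."""
    k = len(c)
    _, H = shortest_hamiltonian_path(c)
    c_max = max(c[i][j] for i in range(k) for j in range(k) if i != j)
    return max(0, math.floor((S - c_max) / H))


if __name__ == "__main__":
    costs = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
    print(shortest_hamiltonian_path(costs))
    print([upper_index(costs, S) for S in range(0, 12)])
```

For unit switching costs, H=k-1 and the maximum cost is 1, so the index reduces to \lfloor(S-1)/(k-1)\rfloor, the quantity m(S) used by the SSSE policy.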
Let H denote the total weight of the shortest Hamiltonian path of G. We give an upper bound on regret of the HSSE policy. See Appendix H for a proof.
Theorem 3
Let \pi be the HSSE policy, then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all G, k=|G|, S\geq 0, T\geq k,
R^{\pi}(T)\leq C(\log k\log T)k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}T^{\frac{1}{2-2^{-m_{G}^{U}(S)}}},
where m_{G}^{U}(S)=\left\lfloor\frac{S-\max_{i,j\in[k]}c_{i,j}}{H}\right\rfloor.
We then give a lower bound that is close to the above upper bound. See Appendix I for a proof.
Theorem 4
There exists an absolute constant C>0 such that for all G, k=|G|, S\geq 0, T\geq k and for every policy \pi\in\Pi_{S},
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases}
where m_{G}^{L}(S)=\left\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\right\rfloor.
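To see how close the two bounds are when the indices coincide, suppose m_{G}^{U}(S)=m_{G}^{L}(S)=m\leq\log_{2}\log_{2}(T/k) and write a=1/(2-2^{-m}). The constants C_{U} and C_{L} below are labels introduced here for the constants of Theorem 3 and Theorem 4, respectively:

```latex
\frac{C_{U}\,(\log k\,\log T)\,k^{1-a}\,T^{a}}
     {C_{L}\,k^{-\frac{3}{2}-a}\,(m+1)^{-2}\,T^{a}}
  = \frac{C_{U}}{C_{L}}\;k^{5/2}\,(m+1)^{2}\,\log k\,\log T .
```

The power of T cancels, so in this case the bounds agree in T up to logarithmic factors and differ in k by a factor of order k^{2.5}.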
When the switching costs satisfy the condition \max_{i,j\in[k]}{c_{i,j}}=\max_{i\in[k]}\min_{j\neq i}c_{i,j}, the two bounds directly match. When this condition is not satisfied, for any switching graph G, the above two bounds still match for a wide range of S:
\left[0,H+\max_{i\in[k]}\min_{j\neq i}c_{i,j}\right)\bigcup\left\{\bigcup_{n=1}^{\infty}\left[nH+\max_{i,j\in[k]}c_{i,j},(n+1)H+\max_{i\in[k]}\min_{j\neq i}c_{i,j}\right)\right\}.
Even when S is not in this range, we still have m_{G}^{U}(S)\leq m_{G}^{L}(S)\leq m_{G}^{U}(S)+1 for any G and any S, which means that the difference between the two indices is at most 1 and the regret bounds are always very close. In fact, it can be shown that as S increases, the gap between the upper and lower bounds decreases doubly exponentially. Therefore, the HSSE policy is quite effective for the general BwSC problem. See Figure 2 for an illustration.
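The two indices and the range on which they agree can be explored numerically. In the sketch below, H, the largest switching cost, and the max-min switching cost are passed in directly as hypothetical inputs (they are assumed to come from a switching graph satisfying the triangle inequality), and both indices are clipped at 0 by assumption.

```python
import math


def switching_indices(S, H, c_max, c_maxmin):
    """Return (m_G^U(S), m_G^L(S)), both clipped at 0 (assumption).

    H        : weight of the shortest Hamiltonian path
    c_max    : max over i, j of c_ij
    c_maxmin : max over i of min over j != i of c_ij
    """
    m_upper = max(0, math.floor((S - c_max) / H))
    m_lower = max(0, math.floor((S - c_maxmin) / H))
    return m_upper, m_lower


def bounds_match(S, H, c_max, c_maxmin):
    """True when the upper and lower bounds have the same order in T."""
    m_upper, m_lower = switching_indices(S, H, c_max, c_maxmin)
    return m_upper == m_lower


if __name__ == "__main__":
    H, c_max, c_maxmin = 10, 6, 2
    gaps = [S for S in range(0, 40) if not bounds_match(S, H, c_max, c_maxmin)]
    print(gaps)
```

With these example values the mismatched budgets are exactly the integers in [nH+2, nH+6) for n\geq 1, in line with the range described above.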
From Figure 2, we can easily observe that in the general-switching-cost BwSC problem, there are still phase transitions, as there are still phases where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. An unanswered question, however, is whether the critical points between phases still exist. Indeed, when S\in[nH+\max_{i\in[k]}\min_{j\neq i}c_{i,j},nH+\max_{i,j\in[k]}c_{i,j}) for some n\in\mathbb{Z}_{>0} (which is the range in which the upper and lower bounds in Theorem 3 and Theorem 4 do not match), the current results cannot recover the exact order of T in R_{S}^{*}(T), so it remains unknown whether the optimal regret drops abruptly within this range.
Theorem 5 answers the question in the affirmative, that is, we prove the existence of critical points for any fixed switching graph G. While the result successfully recovers the exact dependency on T of the optimal regret, it is only of theoretical interest, for two reasons. First, the gap between the upper and lower bounds employed in our proof is of the order of O(k!). Also, computing the key index m_{G}(S) is highly difficult. For these reasons, we defer the detailed description of this result to Appendix J.
Theorem 5
Let k\geq 1 be an arbitrary given constant. For any given switching graph G such that |G|=k, for any S\geq 0,
R_{S}^{*}(T)=\tilde{\Theta}(T^{\frac{1}{2-2^{-m_{G}(S)}}}),
where m_{G}(S)\in\{m_{G}^{U}(S),m_{G}^{L}(S)\} is an integer completely determined by the switching graph G. As a result, given G, we can define the j-th critical point as \inf\{S\mid m_{G}(S)\geq j\} for j\in\mathbb{Z}_{>0}, where the order of T in R_{S}^{*}(T) drops from \tilde{\Theta}(T^{\frac{1}{2-2^{1-j}}}) to \tilde{\Theta}(T^{\frac{1}{2-2^{-j}}}).
We make a final remark on whether some of the results obtained in Section id1 still hold in the general-switching-cost model. First, there are still phase transitions, as there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. Second, the cyclic phenomena do not exist, since there are counterexamples where the distances between critical points are not equal. Third, there is no regret equivalence between limited (general) switching cost and limited adaptivity.
We consider the stochastic multiarmed bandit problem with a constraint on the total cost incurred by switching between actions. For the unit-switching-cost model, we prove matching upper and lower bounds on regret and provide near-optimal algorithms for the problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. We also show the regret equivalence between MAB with limited switches and MAB with limited adaptivity. The results enable us to fully characterize the tradeoff between regret and incurred switching cost in the stochastic multiarmed bandit problem, contributing new insights to this fundamental problem. For the general-switching-cost model, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.
References
 Agrawal et al. (1988) Agrawal, R., M. Hegde, D. Teneketzis. 1988. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control 33(10) 899–906.
 Agrawal et al. (1990) Agrawal, R., M. Hegde, D. Teneketzis. 1990. Multiarmed bandit problems with multiple plays and switching cost. Stochastics and Stochastic Reports 29(4) 437–459.
 Altschuler and Talwar (2018) Altschuler, J., K. Talwar. 2018. Online learning over a finite action set with limited switching. Conference on Learning Theory. 1569–1573.
 Asawa and Teneketzis (1996) Asawa, M., D. Teneketzis. 1996. Multiarmed bandits with switching penalties. IEEE Transactions on Automatic Control 41(3) 328–348.
 Auer et al. (2002) Auer, P., N. Cesa-Bianchi, P. Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3) 235–256.
 Banks and Sundaram (1994) Banks, J. S., R. K. Sundaram. 1994. Switching costs and the Gittins index. Econometrica 62(3) 687–694.
 Bergemann and Välimäki (2001) Bergemann, D., J. Välimäki. 2001. Stationary multichoice bandit problems. Journal of Economic Dynamics and Control 25(10) 1585–1594.
 Brezzi and Lai (2002) Brezzi, M., T. L. Lai. 2002. Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27(1) 87–108.
 Cesa-Bianchi et al. (2013) Cesa-Bianchi, N., O. Dekel, O. Shamir. 2013. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems. 1160–1168.
 Chen and Chao (2019) Chen, B., X. Chao. 2019. Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions 1–9.
 Cheung et al. (2017) Cheung, W. C., D. Simchi-Levi, H. Wang. 2017. Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6) 1722–1731.
 Christofides (1976) Christofides, N. 1976. Worst-case analysis of a new heuristic for the travelling salesman problem. Tech. rep., Carnegie-Mellon University Pittsburgh PA Management Sciences Research Group.
 Cormen et al. (2009) Cormen, T. H., C. E. Leiserson, R. L. Rivest, C. Stein. 2009. Introduction to algorithms. MIT Press.
 Domb (2000) Domb, C. 2000. Phase Transitions and Critical Phenomena, vol. 1. Elsevier.
 Farias and Madan (2011) Farias, V. F., R. Madan. 2011. The irrevocable multiarmed bandit problem. Operations Research 59(2) 383–399.
 Gao et al. (2019) Gao, Z., Y. Han, Z. Ren, Z. Zhou. 2019. Batched multiarmed bandits problem. arXiv preprint arXiv:1904.01763 .
 Guha and Munagala (2013) Guha, S., K. Munagala. 2013. Approximation algorithms for bayesian multiarmed bandit problems. arXiv preprint arXiv:1306.3525 .
 Jun (2004) Jun, T. 2004. A survey on the bandit problem with switching costs. De Economist 152(4) 513–541.
 Lai and Robbins (1985) Lai, T. L., H. Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1) 4–22.
 Lattimore and Szepesvári (2018) Lattimore, T., C. Szepesvári. 2018. Bandit algorithms. preprint .
 Lawler et al. (1985) Lawler, E. L., J. K. Lenstra, A. H. G. Rinnooy Kan, D. B. Shmoys. 1985. The traveling salesman problem: a guided tour of combinatorial optimization, vol. 3. New York: Wiley.
 Perchet et al. (2016) Perchet, V., P. Rigollet, S. Chassang, E. Snowberg. 2016. Batched bandit problems. The Annals of Statistics 44(2) 660–681.
 Slivkins (2019) Slivkins, A. 2019. Introduction to multiarmed bandits. arXiv preprint arXiv:1904.07272 .
 Vogel (1960) Vogel, W. 1960. An asymptotic minimax theorem for the two armed bandit problem. The Annals of Mathematical Statistics 31(2) 444–451.
 Wainwright (2019) Wainwright, M. J. 2019. High-dimensional statistics: A non-asymptotic viewpoint, vol. 48. Cambridge University Press.
For simplicity, we only present the results on distribution-dependent regret bounds for the unit-switching-cost BwSC problem. Extensions to the general-switching-cost BwSC problem are analogous to Section 5 of the main article.
To achieve tight distribution-dependent regret bounds, we propose the S-Switch Successive Elimination 2 (SSSE2) policy, which is stated in Algorithm 3. Note that the difference between the SSSE2 policy and the SSSE policy is the partition of intervals.
Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}],
where the endpoints are defined by t_{0}=1 and
t_{i}=\lfloor k^{1-\frac{i}{m(S)+1}}T^{\frac{i}{m(S)+1}}\rfloor,\quad\forall i=1,\dots,m(S)+1.
Initialization: The same as the SSSE policy.
Policy:
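The partition step of Algorithm 3 can be sketched as follows. This is a minimal illustration: the function name is hypothetical, m(S) is clipped at 0 by assumption, and the last endpoint is set to T explicitly to guard against floating-point rounding.

```python
import math


def ssse2_grid(S: int, k: int, T: int):
    """Endpoints t_1, ..., t_{m(S)+1} with t_i = floor(k^{1-i/(m+1)} T^{i/(m+1)})."""
    m = max(0, (S - 1) // (k - 1))
    grid = [
        math.floor(k ** (1 - i / (m + 1)) * T ** (i / (m + 1)))
        for i in range(1, m + 2)
    ]
    grid[-1] = T  # t_{m(S)+1} = T exactly; avoids floating-point drift
    return grid


if __name__ == "__main__":
    print(ssse2_grid(S=2, k=2, T=1024))
    print(ssse2_grid(S=7, k=3, T=10000))
```

The endpoints form a geometric grid between k and T, in contrast to the doubly exponential grid implied by the exponents 1/(2-2^{-m(S)}) in the minimax analysis.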
For any environment \mathcal{D}, let i^{*}=\arg\max_{i\in[k]}\mu_{i} denote the optimal action, and \Delta=\Delta(\mathcal{D})=\min_{i\neq i^{*}}(\mu_{i^{*}}-\mu_{i})>0 denote the gap between the mean rewards of the optimal action and the best suboptimal action. We have the following results.
Theorem 6
Let \pi be the SSSE2 policy. There exists an absolute constant C\geq 0 such that for all \mathcal{D}, for all k\geq 1, S\geq 0 and T\geq k,
R_{\mathcal{D}}^{\pi}(T)\leq C\left(k^{\frac{m(S)}{m(S)+1}}\log k\right)\frac{T^{\frac{1}{m(S)+1}}\log T}{\Delta},
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Theorem 7
There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq 1 and for every policy \pi\in\Pi_{S}, if m(S)\leq\log_{2}(T/k), then
\sup_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\geq C\left(k^{-\frac{3}{2}-\frac{1}{m(S)+1}}(m(S)+1)^{-2}\right)T^{\frac{1}{m(S)+1}},
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Note that when m(S)\leq\log_{2}(T/k), the upper and lower bounds match in the minimax sense (up to logarithmic factors), thus the SSSE2 policy can be viewed as near-optimal. When m(S)>\log_{2}(T/k), the upper bound is O(\log T/\Delta), and we can directly use the seminal instance-dependent lower bound of Lai and Robbins (1985) to show the asymptotic optimality of the SSSE2 policy.
We omit the proofs of Theorem 6 and Theorem 7. The proof of Theorem 6 resembles the proof of Theorem 1 in Appendix B, and the proof of Theorem 7 resembles the proof of Theorem 2 in Appendix C. The difference lies mainly in the partition of intervals.
Besides results on regret upper and lower bounds, we also establish Corollary 3, which can be viewed as a parallel result to Corollary 2 in Section 4.3.2 of the main article.
Corollary 3
(The switching complexity of MAB: distribution-dependent regret version)
For any k\geq 1, for any environment \mathcal{D}, let \Delta=\min_{i\in[k],i\neq i^{*}}(\mu_{i^{*}}-\mu_{i}) denote the gap between the mean rewards of the optimal action and the best suboptimal action.

N(k-1)+1 switches are necessary and sufficient for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB (N\in\mathbb{Z}_{>0}).

\Omega(\frac{\log T}{\log\log T}) switches are necessary for uniformly achieving \tilde{O}(\log{T}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.
The proof of Corollary 3 is deferred to Appendix D.
From round 1 to round t_{1}, the SSSE policy makes k-1 switches.
For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}:

If the last action in interval l remains active in interval l+1, then it will be the first action in interval l+1, and no switch occurs between round t_{l} and round t_{l}+1. Since the SSSE policy makes at most k-1 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}, the SSSE policy makes at most 0+(k-1)=k-1 switches from round t_{l} to round t_{l+1}.

If the last action in interval l is eliminated before the start of interval l+1, then interval l+1 starts from another active action, and one switch occurs between round t_{l} and round t_{l}+1. The elimination implies that |A_{l+1}|\leq k-1, thus the SSSE policy makes |A_{l+1}|-1\leq(k-1)-1=k-2 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}. Therefore, the SSSE policy makes at most 1+(k-2)=k-1 switches from round t_{l} to round t_{l+1}.
From round t_{m(S)} to round T, since the SSSE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the SSSE policy makes at most 1 switch from round t_{m(S)} to round T.
Summarizing the above arguments, we find that the SSSE policy makes at most m(S)(k-1)+1\leq S switches from round 1 to round T. Thus it is indeed an S-switching-budget policy.
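The counting argument can be double-checked mechanically. The sketch below simply adds up the per-interval bounds stated above; the function name is hypothetical, and the check is restricted to budgets S\geq k (so that m(S)\geq 1), which is an assumption made to match the three cases of the argument.

```python
def worst_case_switches(m: int, k: int) -> int:
    """Add up the per-interval switch bounds of the SSSE policy for m = m(S) >= 1."""
    first = k - 1               # rounds 1 .. t_1
    middle = (m - 1) * (k - 1)  # each of the m-1 segments t_l .. t_{l+1}
    last = 1                    # at most one switch entering the final interval
    return first + middle + last


if __name__ == "__main__":
    for k in range(2, 6):
        for S in range(k, 30):
            m = (S - 1) // (k - 1)
            total = worst_case_switches(m, k)
            assert total == m * (k - 1) + 1 <= S
    print("all budgets respected")
```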
We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of times action i is chosen in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as
r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},\quad\forall i\in[k],t\in[T].
Define the clean event as
\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],\quad|\bar{\mu}_{t}(i)-\mu_{i}|\leq r_{t}(i)\}.
By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.
The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 1 can be expressed as
\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),\quad\forall l\in[m(S)+1],i\in[k],
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),\quad\forall l\in[m(S)+1],i\in[k].
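The confidence bounds translate directly into an elimination check at the end of an interval. The sketch below is an assumption-laden illustration: Algorithm 1 is not reproduced in this section, so the rule used here (keep action i only if its UCB is at least the largest LCB among active actions) is the standard successive-elimination rule, chosen to be consistent with the overlap condition used in the proof. All names are hypothetical.

```python
import math


def confidence_radius(n: int, T: int) -> float:
    """r = sqrt(2 log T / n) for an action played n times."""
    return math.sqrt(2 * math.log(T) / n)


def surviving_actions(active, mean, plays, T):
    """Keep actions whose UCB is not below the best LCB among active actions."""
    ucb = {i: mean[i] + confidence_radius(plays[i], T) for i in active}
    lcb = {i: mean[i] - confidence_radius(plays[i], T) for i in active}
    best_lcb = max(lcb.values())
    return [i for i in active if ucb[i] >= best_lcb]


if __name__ == "__main__":
    mean = {0: 0.9, 1: 0.5, 2: 0.1}
    few = {0: 100, 1: 100, 2: 100}
    many = {0: 10000, 1: 10000, 2: 10000}
    print(surviving_actions([0, 1, 2], mean, few, T=1000))
    print(surviving_actions([0, 1, 2], mean, many, T=1000))
```

With few samples the radius is wide and only the clearly inferior action is removed; with more samples the radius shrinks and the remaining suboptimal action is eliminated as well.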
Let \pi denote the SSSE policy. First, observe that for any environment \mathcal{D},
R_{\mathcal{D}}^{\pi}(T)=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\bar{\mathcal{E}})
\leq\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}}
=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1),  (1)
so in order to bound R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T), we only need to focus on the clean event.
Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval for which i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval in which action i has not yet been eliminated (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the SSSE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of interval \eta_{i}-1 must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,
\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-1}}(i)=4r_{t_{\eta_{i}-1}}(i),  (2)
where the last equality holds because i^{*} and i are chosen equally often in each interval up to interval \eta_{i}, which implies that n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 1, for simplicity, we overlook the rounding issues of \frac{t_{l+1}-t_{l}}{|A_{l}|} for each interval l. Considering the rounding issues does not bring additional difficulty to our analysis, as in the policy we can always design a rounding rule that keeps the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)
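For completeness, the inequality in (2) can be unpacked step by step. Writing t=t_{\eta_{i}-1} for brevity, the chain below uses only the clean event and the overlap condition \texttt{UCB}_{t}(i)\geq\texttt{LCB}_{t}(i^{*}):

```latex
\mu_{i^{*}}
  \le \bar{\mu}_{t}(i^{*}) + r_{t}(i^{*})
  =   \texttt{LCB}_{t}(i^{*}) + 2r_{t}(i^{*})
  \le \texttt{UCB}_{t}(i) + 2r_{t}(i^{*})
  =   \bar{\mu}_{t}(i) + r_{t}(i) + 2r_{t}(i^{*})
  \le \mu_{i} + 2r_{t}(i) + 2r_{t}(i^{*}) .
```

The first and last inequalities are the clean event applied to i^{*} and i respectively, and the middle inequality is the overlap condition.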
Since i is never chosen after the \eta_{i}th interval, we have n_{\eta_{i}}(i)=n_{T}(i), and therefore r_{\eta_{i}}(i)=r_{T}(i).
The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the SSSE policy and (2), we can bound this quantity as
\displaystyle R(T;i)  \displaystyle=n_{T}(i)\Delta(i)  
\displaystyle\leq 4n_{\eta_{i}}(i)\sqrt{\frac{2\log T}{n_{\eta_{i}-1}(i)}}  
\displaystyle\leq C_{0}\sqrt{2\log T}\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{t_{\eta_{i}-1}/k}}  
\displaystyle\leq 4C_{0}\sqrt{2\log T}\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|} 
for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,
\displaystyle\mathbb{E}_{\mbox{$\mathcal{D}$}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]  \displaystyle=\sum_{i\in[k]}R(T;i)  
\displaystyle\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}k(T/k)^{1/(2-2^{-m(S)})}\frac{1}{|A_{\eta_{i}}|}  
\displaystyle\leq C_{1}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}\frac{1}{|A_{\eta_{i}}|}  
\displaystyle\leq C_{2}\sqrt{\log T}k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}\frac{1}{j}  
\displaystyle\leq C_{3}(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})} 
for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus by (1) and R^{\pi}(T)=\sup_{\mbox{$\mathcal{D}$}}R_{\mbox{$\mathcal{D}$}}^{\pi}(T) we have
R^{\pi}(T)\leq C(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})} 
for some absolute constant C\geq 0.\hfill\Box
Given any k\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mbox{$\mathcal{D}$}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \mbox{\boldmath$\mu$}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.
For any environment \mu, let X_{\mbox{\boldmath$\mu$}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of each action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\mbox{\boldmath$\mu$}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.
For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t}); then \mathbb{F}=(\mathcal{F}_{t})_{t\in[T]} is a filtration.
For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}; see the detailed definitions in Chapter 15 of Wainwright (2019).
For any environment \mu, for any policy \pi\in\Pi_{S}, we make some key definitions as below.
1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

\tau_{1}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[1:t]$}\} if the set is nonempty and \tau_{1}=\infty otherwise.

\tau_{2}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{1}:t]$}\} if the set is nonempty and \tau_{2}=\infty otherwise.

Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{j-1}:t]$}\} if the set is nonempty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1.
It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.
2. We then define a series of random variables (which depend on the stopping times).

S(1,\tau_{1}) is the number of switches occurring in [1:\tau_{1}] (note that if a switch happens between \tau_{1} and \tau_{1}+1, we do not count its cost in S(1,\tau_{1})).

For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the number of switches occurring in [\tau_{j-1}:\tau_{j}] (note that if a switch happens between \tau_{j-1}-1 and \tau_{j-1}, or between \tau_{j} and \tau_{j}+1, we do not count its cost in S(\tau_{j-1},\tau_{j})).

S(\tau_{m(S)},T) is the number of switches occurring in [\tau_{m(S)}:T] (note that if a switch happens between \tau_{m(S)}-1 and \tau_{m(S)}, we do not count its cost in S(\tau_{m(S)},T)).
3. Next we define a series of events.

E_{1}=\{\tau_{1}>t_{1}\}.

For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.
Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 1.
4. Finally we define a series of shrinking errors.

\Delta_{1}=1.

For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}.)

\Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)
5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking value in [k] such that {\pi_{\infty}=i} with probability 1/k (i\in[k]).
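The scaling of the shrinking errors in item 4 can be checked numerically. The following is a minimal sketch, assuming the unrounded grid t_{j}=k(T/k)^{(2-2^{1-j})/(2-2^{-m(S)})} (the form that appears later in this proof), k\geq 2 and S\geq 1; the function names are hypothetical and the rounding used in Algorithm 1 is ignored.

```python
import math

def num_phases(S: int, k: int) -> int:
    """m(S) = floor((S - 1) / (k - 1)); assumes k >= 2 and S >= 1."""
    return (S - 1) // (k - 1)

def grid_point(j: int, m: int, k: int, T: int) -> float:
    """Unrounded t_j = k * (T/k)^((2 - 2^(1-j)) / (2 - 2^(-m))) (assumed form)."""
    return k * (T / k) ** ((2 - 2.0 ** (1 - j)) / (2 - 2.0 ** (-m)))

def gap(j: int, m: int, k: int, T: int) -> float:
    """Delta_j from item 4: Delta_1 = 1, with an extra factor 1/2 when j = m + 1."""
    if j == 1:
        return 1.0
    base = k ** -0.5 * (k / T) ** ((1 - 2.0 ** (1 - j)) / (2 - 2.0 ** (-m)))
    scale = k * (m + 1) * (2 if j == m + 1 else 1)
    return base / scale

if __name__ == "__main__":
    k, T, S = 4, 10**6, 10
    m = num_phases(S, k)
    for j in range(2, m + 2):
        # sqrt(t_{j-1}) * Delta_j should equal 1/(k(m+1)), halved for j = m + 1.
        print(j, math.sqrt(grid_point(j - 1, m, k, T)) * gap(j, m, k, T))
```

Under these assumptions the product \sqrt{t_{j-1}}\Delta_{j} does not depend on T, which is the property the KL-divergence step of the proof relies on.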
Lemma 1
For any environment \mu, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches occurring in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely.
Proof of Lemma 1. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1}:\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, we know that S(1,\tau_{1})\geq k-1, S(\tau_{1},\tau_{2})\geq k-1, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq k-1. Thus we have
S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S)(k-1). 
Since \pi\in\Pi_{S}, we further know that
S(\tau_{m(S)},T)\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})]\leq S-m(S)(k-1)\leq k-1 
happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches occurring in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely. \hfill\Box
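The last inequality in the proof of Lemma 1, S-m(S)(k-1)\leq k-1, follows from m(S)=\lfloor\frac{S-1}{k-1}\rfloor. As an illustrative sanity check (not part of the proof), the sketch below verifies it by brute force over a range of budgets; the function names and the ranges are hypothetical choices, and k\geq 2, S\geq 1 are assumed.

```python
def num_phases(S: int, k: int) -> int:
    """m(S) = floor((S - 1) / (k - 1)); assumes k >= 2 and S >= 1."""
    return (S - 1) // (k - 1)

def leftover_switches(S: int, k: int) -> int:
    """Budget left after m(S) covers that each use at least k - 1 switches."""
    return S - num_phases(S, k) * (k - 1)

def lemma1_counting_holds(max_k: int = 30, max_S: int = 500) -> bool:
    """Check 1 <= S - m(S)(k - 1) <= k - 1 on a finite grid of (k, S)."""
    return all(
        1 <= leftover_switches(S, k) <= k - 1
        for k in range(2, max_k + 1)
        for S in range(1, max_S + 1)
    )

if __name__ == "__main__":
    print(lemma1_counting_holds())
```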
Consider a class of environments \Lambda=\{\mbox{\boldmath$\mu$}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i}\leq\frac{\Delta_{m(S)+1}}{2},\forall i\neq 1\}. Pick an arbitrary environment {\alpha} from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, by the union bound, we have
\sum_{j=1}^{m(S)+1}\mathbb{P}_{{\alpha}}^{\pi}(E_{j})\geq\mathbb{P}_{{\alpha}}^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1. 
Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1).
Case 1: j^{*}=1. Since \mathbb{P}_{{\alpha}}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i), 
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}. 
Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have
\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}. 
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] only depends on policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.  (3) 
We now consider a new environment {\beta} such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for all i\in[k]. Letting \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]).  (4) 
But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (3) and (4) we have
\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}. 
However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]\geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1/(2-2^{-m(S)})}. 
Case 2: 2\leq j^{*}\leq m(S). Since \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i), 
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})}{k}\geq\frac{1}{k(m(S)+1)}. 
Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have
\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}. 
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] only depends on policy \pi and random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}.  (5) 
We now consider a new environment {\beta} such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]. Letting \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]).  (6) 
We now bound the difference between the left-hand side (LHS) of (5) and the left-hand side of (6). We have
\displaystyle\left|\text{LHS in }(5)-\text{LHS in }(6)\right|  
\displaystyle\leq  \displaystyle{D_{\mathrm{TV}}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)  
\displaystyle\leq  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)}  
\displaystyle\leq  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\right)}  
\displaystyle=  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\,\Big\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}  
\displaystyle=  \displaystyle\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}\right)^{2}}{2}\right]}  
\displaystyle\leq  \displaystyle\frac{\sqrt{t_{j^{*}-1}}\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)}, 
where the first inequality is by the definition of the total variation distance between two probability measures, the second inequality is by Pinsker’s inequality, and the third inequality is by the data-processing inequality. The first equality holds because the rewards of every action i\neq i^{\prime} have the same distribution in \alpha and \beta, and the second equality uses the fact that the KL divergence between \mathcal{N}(\mu,1) and \mathcal{N}(\mu+\Delta_{j^{*}},1) is \Delta_{j^{*}}^{2}/2.
Combining the above inequality with (5) and (6), we have
\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mbox{$\mathbb{P}$}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}. 
However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} will incur at least a \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] indicates that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have
\displaystyle R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mbox{$\mathbb{P}$}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{*}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]  
\displaystyle\geq  \displaystyle\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)}}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}\left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{-m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-j^{*}}}{2-2^{-m(S)}}}\right)  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-m(S)}}{2-2^{-m(S)}}}\right)  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right). 
When m(S)\leq\log_{2}\log_{2}(T/k), we have
(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T/k}(2)}}=\frac{1}{2}. 
Thus we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
when m(S)\leq\log_{2}\log_{2}(T/k).
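The condition m(S)\leq\log_{2}\log_{2}(T/k) is exactly what makes the correction term (T/k)^{-2^{-m(S)}} at most 1/2. The short sketch below illustrates this numerically for a few hypothetical values of T/k; the function names are illustrative only, and T/k\geq 2 is assumed so that the double logarithm is defined.

```python
import math

def correction_term(ratio: float, m: int) -> float:
    """(T/k)^(-2^(-m)) for ratio = T/k > 1."""
    return ratio ** (-(2.0 ** (-m)))

def largest_admissible_m(ratio: float) -> int:
    """Largest integer m with m <= log2(log2(T/k)); assumes ratio >= 2."""
    return math.floor(math.log2(math.log2(ratio)))

if __name__ == "__main__":
    for ratio in (2.0, 16.0, 1000.0, 1e6):
        m_max = largest_admissible_m(ratio)
        # Every admissible m keeps the correction term at or below 1/2.
        print(ratio, m_max, [round(correction_term(ratio, m), 4) for m in range(m_max + 1)])
```

Once m exceeds the threshold the term rises above 1/2, which is why the bound switches to the \sqrt{kT} regime there.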
Case 3: j^{*}=m(S)+1. Since \mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i), 
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(S)+1)}. 
Thus either
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)},  (7) 
or
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}.  (8) 
If (7) holds true, then we consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and all other components are the same as \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (7) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those in Case 2, we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)} 
and
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k).
Now we consider the case that (8) holds true. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the number of switches occurring in [\tau_{m(S)}:T] is no more than k-1. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the number of switches occurring in [\tau_{m(S)}:\tau_{m(S)+1}] is at least k-1. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T].
Suppose that i^{\prime}\neq 1. Then the event \mathcal{E}_{i^{\prime}} implies that action 1 is not chosen in [\tau_{m(S)+1}:T]. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 will incur at least a \Delta_{m(S)+1}/4 term in regret. As a result, we know that
R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{4}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k).
Thus we only need to consider the subcase of i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} only depends on policy \pi and random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:{(t_{m(S)}+T)}/{2}]} for i\neq 1. Consider a new environment \beta such that its first component is \alpha_{1}-\Delta_{m(S)+1} and all other components are the same as \alpha. Using arguments analogous to those in Case 2, we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}. 
However, action 1 is the worst action in environment \beta, and each time action 1 is chosen it incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T]. Thus the event \mathcal{E}_{1} actually implies that action 1 is chosen in every round from round \tau_{m(S)+1} (\leq\frac{t_{m(S)}+T}{2}) to round T, i.e., action 1 is chosen throughout the last (T-\frac{t_{m(S)}+T}{2}+1) rounds. As a result, we know that
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k).
Combining Cases 1, 2 and 3, we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with unlimited switching budget) is \Omega(\sqrt{kT}), we know that
R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT} 
for some absolute constant C>0. To sum up, we have
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases} 
for some absolute constant C>0, where m(S)=\lfloor\frac{S-1}{k-1}\rfloor. \hfill\Box
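To make the phase structure of this bound concrete, the sketch below tabulates the exponent of T as a function of the switching budget S. It is only an illustration of the two-case formula above: the function name is hypothetical, constants and the dependence on k are ignored, and k\geq 2, S\geq 1, T\geq 2k are assumed.

```python
import math

def regret_exponent(S: int, k: int, T: int) -> float:
    """Exponent of T in the lower bound, as a function of the budget S."""
    m = (S - 1) // (k - 1)  # m(S), assuming k >= 2 and S >= 1
    if m > math.log2(math.log2(T / k)):
        return 0.5  # second case: the sqrt(kT) regime
    return 1.0 / (2.0 - 2.0 ** (-m))

if __name__ == "__main__":
    k, T = 3, 3 * 2**16
    for S in range(1, 13):
        # The exponent stays constant within a phase of k - 1 consecutive budgets.
        print(S, round(regret_exponent(S, k, T), 4))
```

In this example the exponent changes only when S crosses a value of the form N(k-1)+1, which is the phase transition described in the text.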
We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 1, the SS-SE policy guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully designed policy (e.g., the SS-SE policy) to achieve \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB. On the other hand, suppose that there exists a policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB with S<N(k-1)+1 switches almost surely. Since m(S)\leq N-1, by Theorem 2, its regret in BwSC is \Omega(T^{\frac{1}{2-2^{-N+1}}}), whose order of T is strictly higher than \tilde{O}(T^{\frac{1}{2-2^{-N}}}) (as N is a fixed integer independent of T), which is a contradiction. Thus for any policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB, there must exist an environment such that the policy makes at least N(k-1)+1 switches with some positive probability.
We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 6, the SS-SE-2 policy guarantees \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully designed policy (e.g., the SS-SE-2 policy) to achieve \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret in MAB. On the other hand, given any fixed k\geq 1, for any fixed N\geq 1, suppose that there exists a policy \pi that uniformly achieves \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} with S<N(k-1)+1 switches almost surely. Then there exists a constant C_{k,N}\geq 0 (which may depend on k,N) such that for all \mathcal{D} and for all T\geq 1,
R_{\mbox{$\mathcal{D}$}}^{\pi}(T)\leq C_{k,N}{\mathrm{polylog}}(T)\frac{T^{\frac{1}{N+1}}}{\Delta}, 
which means that for all T\geq 1,
\sup_{\Delta\in(0,1]}\Delta R_{\mbox{$\mathcal{D}$}}^{\pi}(T)\leq C_{k,N}{\mathrm{polylog}}(T)T^{\frac{1}{N+1}}. 
However, since m(S)<N, by Theorem 7, we know that there exists an absolute constant C>0 such that for all T\geq 1,
\sup\limits_{\Delta\in(0,1]}\Delta R_{\mbox{$\mathcal{D}$}}^{\pi}(T)\geq C\left(k^{-\frac{3}{2}-\frac{1}{N}}(m(S)+1)^{-2}\right){T^{\frac{1}{N}}}>C\left(k^{-\frac{3}{2}-\frac{1}{N}}(N+1)^{-2}\right){T^{\frac{1}{N}}}. 
Taking T large enough yields a contradiction. As a result, N(k-1)+1 switches are necessary for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.
Consider an arbitrary switching graph G with k=|G|\geq 1. In the following, we show that, even without the triangle inequality assumption, a modified version of the results in Section 5 still holds.
Assume that the switching costs associated with G do not satisfy the triangle inequality. We then run the Floyd-Warshall algorithm (see Cormen et al. 2009) on G to efficiently find the shortest paths between all pairs of vertices. For any i,j\in[k] such that i\neq j, let p_{i,j}=i\rightarrow\dots\rightarrow j denote the shortest path between i and j, and let c_{i,j}^{\prime} denote the total weight of this path. We construct a new switching graph G^{\prime}=(V,E^{\prime}): the vertices in G^{\prime} are the same as in G, while the edge between i and j in G^{\prime} is assigned the weight c_{i,j}^{\prime}. Since shortest-path distances satisfy the triangle inequality, G^{\prime} is a switching graph whose switching costs satisfy the triangle inequality. Therefore, for BwSC problems defined with G^{\prime}, we can apply the HS-SE policy, and the regret upper and lower bounds in Theorem 3 and Theorem 4 in Section 5 hold.
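The construction of G^{\prime} can be sketched as follows. This is a minimal illustration, assuming the switching costs are given as a symmetric k\times k matrix with nonnegative entries and that G is connected; the function names metric_closure and shortest_path are hypothetical and not taken from the paper.

```python
def metric_closure(cost):
    """Floyd-Warshall on a cost matrix; returns (dist, nxt).

    dist[i][j] is the shortest-path weight c'_{i,j}, and nxt[i][j] is the
    vertex that follows i on one shortest path p_{i,j}.
    """
    k = len(cost)
    dist = [list(row) for row in cost]       # copy, so the input is not mutated
    nxt = [list(range(k)) for _ in range(k)]  # initially the direct edge i -> j
    for v in range(k):
        dist[v][v] = 0
    for mid in range(k):
        for i in range(k):
            for j in range(k):
                through = dist[i][mid] + dist[mid][j]
                if through < dist[i][j]:
                    dist[i][j] = through
                    nxt[i][j] = nxt[i][mid]
    return dist, nxt

def shortest_path(nxt, i, j):
    """Recover the vertex sequence of p_{i,j} (assumes j is reachable from i)."""
    path = [i]
    while i != j:
        i = nxt[i][j]
        path.append(i)
    return path

if __name__ == "__main__":
    # Direct switch 0 -> 2 costs 10, but going through vertex 1 costs only 2.
    cost = [[0, 1, 10], [1, 0, 1], [10, 1, 0]]
    dist, nxt = metric_closure(cost)
    print(dist[0][2], shortest_path(nxt, 0, 2))
```

The matrix dist plays the role of the weights c_{i,j}^{\prime} of G^{\prime}, and the recovered paths are the routes p_{i,j} that a policy on G would follow.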
In this part we assume that k=o(\sqrt{T}). This assumption is reasonable when k is a known fixed integer.
For any BwSC problem defined with switching graph G (whose switching costs do not satisfy the triangle inequality) and switching budget S, we construct a new switching graph G^{\prime} according to Appendix E.1, and construct a new BwSC problem defined with switching graph G^{\prime} and switching budget S. Let \pi^{\prime} denote the HS-SE policy running on the new BwSC problem. Obviously \pi^{\prime} is an S-switching-budget policy for the new problem. We construct \pi by modifying \pi^{\prime}, aiming to obtain an S-switching-budget policy for the original BwSC problem. Let \pi switch (on G) following \pi^{\prime} (on G^{\prime}): every time \pi^{\prime} switches from i to j on G^{\prime}, let \pi switch according to the path p_{i,j}=i\rightarrow\dots\rightarrow j on G, visiting each vertex in p_{i,j} once (since in the HS-SE policy, every active action is chosen for at least \Omega(T^{1/2}) consecutive rounds in each interval, while p_{i,j} contains at most k=o(\sqrt{T}) vertices, we know that \pi is a valid policy). Since the total weight of p_{i,j} is c^{\prime}_{i,j} and \pi^{\prime} is an S-switching-budget policy for G^{\prime}, we know that \pi is an S-switching-budget policy for G.
As mentioned before, Theorem 3 and Theorem 4 in Section 5 hold for the new BwSC problem (defined with G^{\prime}) in Appendix E.2. Based on these two theorems, we give upper and lower bounds on regret for the original BwSC problem (defined with G). The upper and lower bounds are very close to each other (in fact, when k=O(T^{1/4}), the bounds are essentially the same as the bounds in Section 5).
Theorem 8
Let G be a switching graph and G^{\prime} be the corresponding new graph defined in Appendix E.1. Let H denote the total weight of the shortest Hamiltonian path of G^{\prime}. Let \pi be the modified HS-SE policy in Appendix E.2; then \pi is an S-switching-budget policy for G. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0, T\geq k^{2},
R^{\pi}(T)\leq C{(\log k\log T)}k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}T^{\frac{1}{2-2^{-m_{G}^{U}(S)}}}+Ck^{2}\log\log T, 
where m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}{c_{i,j}^{\prime}}}{H}\rfloor.
Theorem 9
Let H be the total weight of the shortest Hamiltonian path of G^{\prime}. There exists an absolute constant C>0 such that for all k\geq 1,S\geq 0,T\geq k and for all policies \pi\in\Pi_{S},
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases} 
where {m_{G}^{L}}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}^{\prime}}{H}\rfloor.
Note that the only difference between the upper bound in Theorem 8 and the upper bound in Theorem 3 is an O(k^{2}\log\log T) term, which can be neglected as long as k is much smaller than T, e.g., k=O(T^{1/4}). To see why Theorem 8 holds, just note that (1) when k is much smaller than T, the modification of the HS-SE policy does not affect the learning rate of the HS-SE policy, and (2) since there are m_{G}^{U}(S)+1=O(\log\log T) intervals in \pi, and in each interval the behavior of \pi (running on G) is different from the behavior of \pi^{\prime} (running on G^{\prime}) for at most O(k^{2}) rounds, the additional regret loss compared to Theorem 3 is at most O(k^{2}\log\log T). Theorem 9 is essentially the same as Theorem 4: a lower bound proved for a BwSC problem with the triangle inequality assumption is a natural lower bound for a corresponding BwSC problem without the triangle inequality assumption.
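For a small graph, the quantities H, m_{G}^{U}(S) and m_{G}^{L}(S) in Theorems 8 and 9 can be computed directly. The sketch below is illustrative only: it assumes dist is the matrix of weights c_{i,j}^{\prime} of G^{\prime} with k\geq 2, finds the shortest Hamiltonian path by brute force over all k! orderings (feasible only for small k, since the problem is NP-hard in general), and uses hypothetical function names.

```python
import math
from itertools import permutations

def shortest_hamiltonian_path_weight(dist):
    """Total weight H of a shortest Hamiltonian path, by brute force."""
    k = len(dist)
    return min(
        sum(dist[p[t]][p[t + 1]] for t in range(k - 1))
        for p in permutations(range(k))
    )

def phase_indices(dist, S):
    """Return (H, m_G^U(S), m_G^L(S)) for weights dist and budget S."""
    k = len(dist)
    H = shortest_hamiltonian_path_weight(dist)
    # Largest pairwise cost, used in the upper-bound index m_G^U.
    c_max = max(dist[i][j] for i in range(k) for j in range(k) if i != j)
    # Largest "cheapest way out of a vertex", used in the lower-bound index m_G^L.
    c_maxmin = max(min(dist[i][j] for j in range(k) if j != i) for i in range(k))
    m_upper = math.floor((S - c_max) / H)
    m_lower = math.floor((S - c_maxmin) / H)
    return H, m_upper, m_lower

if __name__ == "__main__":
    dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
    print(phase_indices(dist, 7))
```

Because \max_{i}\min_{j\neq i}c_{i,j}^{\prime}\leq\max_{i,j}c_{i,j}^{\prime}, the index used in the lower bound is never smaller than the one used in the upper bound, which is the source of the small gap between the two theorems.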
Intuitively, an effective policy in BwSC should identify what type of switching behavior is necessary and sufficient for achieving low regret in MAB, and switch in the most efficient way. Thus, before studying the general BwSC, we first revisit the classical MAB to further understand the relationship between switching and regret. Earlier in Section 4 of the main article, we establish the tradeoff between the number of switches and regret in MAB. Unfortunately, this does not provide enough insights for the general BwSC, and hence we need to connect the combinatorics of switching patterns with regret in MAB. In this subsection, we prove the following result: there are some inherent switching patterns that are associated with any effective learning policy in MAB.
Definition 4
Consider a k-armed bandit problem. For any learning policy \pi, any environment \mathcal{D} and any T\geq 1, the stochastic process \{\pi_{t}\}_{t\in[T]}=\pi_{1},\dots,\pi_{T} constitutes a random walk (with a random starting point) on [k]. We call \{\pi_{t}\}_{t\in[T]} the bandit random walk generated by (\pi,\mathcal{D},T).
Definition 5
A bandit random walk on an action set [k] finishes a cover in period [T_{1}:T_{2}] if all actions in [k] are chosen between round T_{1} and round T_{2}. Here T_{1} is called the starting round of this cover, and T_{2} is called the ending round of this cover.
Definition 6
A bandit random walk on an action set [k] finishes N\geq 0 asynchronous covers in period [T_{1}:T_{2}] if it finishes N covers in period [T_{1}:T_{2}], and the ending round of the j-th cover is no larger than the starting round of the (j+1)-th cover, for all j=1,\dots,N-1.
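Definitions 5 and 6 can be made concrete with a small sketch. The Python helper below is an illustration only: the name cover_ending_rounds and the list-based input are assumptions, not part of the original analysis. It scans a realized action sequence and greedily records the ending round of each successive cover, letting the next cover start at the round where the previous one ended, as Definition 6 allows. It assumes k\geq 2 and 1-based rounds.

```python
def cover_ending_rounds(actions, k):
    """Greedily find the ending rounds (1-based) of successive covers of [k].

    actions[t-1] is the action chosen in round t; the sketch assumes that
    exactly k distinct action labels are possible.
    """
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    ends = []
    start = 0  # 0-based index at which the current cover may start
    while True:
        seen = set()
        end = None
        for idx in range(start, len(actions)):
            seen.add(actions[idx])
            if len(seen) == k:
                end = idx
                break
        if end is None:
            return ends
        ends.append(end + 1)
        # Definition 6 lets the next cover start at this ending round.
        start = end


walk = [1, 2, 3, 3, 1, 2, 2]
print(cover_ending_rounds(walk, k=3))  # [3, 6]
```

In this example the walk finishes two asynchronous covers, one ending at round 3 and one ending at round 6.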
By using the “tracking the cover time” argument, we establish the following result.
Theorem 10
Consider a k-armed bandit problem. For any fixed N\geq 0 and any policy \pi that achieves \tilde{O}(T^{\frac{1}{2-2^{-N}}-\epsilon}) regret for some \epsilon>0, there exist an environment \mathcal{D} and T\geq 1 such that the bandit random walk generated by (\pi,\mathcal{D},T) must “finish N+1 asynchronous covers and then switch to the optimal action in period [1:T]” with probability at least \max\{N/(N+1),1/2\}. (Footnote 6: If the bandit random walk happens to be at the optimal action when it just finishes N+1 asynchronous covers, then the event directly occurs.)
Theorem 10 holds for any MAB problem and reveals fundamental switching patterns that any effective learning policy has to exhibit under certain environments with certain horizons. Intuitively, the patterns can be summarized as “finishing multiple covers then switching to the optimal action”. For example, if a policy \pi achieves sublinear regret in MAB, then there must be some environment \mathcal{D} and T such that the policy first chooses all actions (i.e., \pi_{1},\dots,\pi_{T} finishes a cover) and then switches to the optimal action with certain probability (even if the policy does not know the optimal action). Also, if a policy \pi achieves near-optimal regret in MAB, then there must be some environment \mathcal{D} and T such that \pi_{1},\dots,\pi_{T} first finishes \Omega(\log\log T) asynchronous covers and then switches to the optimal action with certain probability.
Theorem 10 indicates that the switching ability of “finishing multiple covers then switching to the optimal action” is necessary for any effective learning policy in MAB. It thus reveals a deep connection between bandit problems and graph traversal problems, since in graph traversal problems there are also requirements for “cover”, i.e., visiting all vertices. Motivated by this connection, in Section 5 of the main article, we design an intuitive policy for the general BwSC problem by leveraging ideas from the shortest Hamiltonian path problem, and give upper and lower bounds on regret that are close to each other.
The proof of Theorem 10 is based on the “tracking the cover time” argument: we first suppose that the switching patterns do not occur with certain probability, then use the “tracking the cover time” argument to establish an \tilde{\Omega}(T^{\frac{1}{2-2^{-N}}}) lower bound on the regret of \pi, which contradicts the condition in Theorem 10. We omit the detailed proof here, as the essential idea of the proof is similar to the proof of Theorem 2 in Appendix C and the proof of Theorem 4 in Appendix I.
The HSSE policy is practical: for any given switching graph G, the policy only needs to solve the shortest Hamiltonian path problem once, which can be done offline. Thus, the computational complexity of the shortest Hamiltonian path problem does not affect the online decision-making process of the HSSE policy at all.
Moreover, under the condition that the switching costs satisfy the triangle inequality, the shortest Hamiltonian path problem can be reduced to the celebrated metric traveling salesman problem (metric TSP), see Lawler et al. (1985). This means that we can directly apply many commercial solvers for TSP to solve (or approximately solve) the shortest Hamiltonian path problem efficiently. The reduction also indicates that any approximation algorithm designed for metric TSP can be adapted to be an approximation algorithm for the shortest Hamiltonian path problem. In particular, the celebrated Christofides algorithm for the metric TSP (Christofides 1976) can be used to compute a good approximation of H in polynomial time.
Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.
From round 1 to round t_{1}, the HSSE policy incurs H switching cost.
For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}, regardless of whether l is odd or even, and regardless of whether the last action in interval l is eliminated before the start of interval l+1, by the switching order (determined by the shortest Hamiltonian path of G) and the triangle inequality, the HSSE policy always incurs at most H switching cost.
From round t_{m(S)} to round T, since the HSSE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the HSSE policy incurs at most \max_{i,j\in[k]}c_{i,j} switching cost from round t_{m(S)} to round T.
Summarizing the above arguments, we find that the HSSE policy incurs at most m(S)H+\max_{i,j\in[k]}c_{i,j}\leq S switching cost from round 1 to round T. Thus it is indeed an S-switching-budget policy.
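The budget accounting above can be checked numerically on a small instance. The sketch below is an illustration only: it computes H by brute force over all orderings of the actions (feasible only for small k), and the function names and the 3-action cost matrix are hypothetical. It assumes k\geq 2, symmetric costs satisfying the triangle inequality, and S\geq\max_{i,j}c_{i,j}.

```python
from itertools import permutations
from math import floor


def shortest_hamiltonian_path_weight(cost):
    """Brute-force H: minimum total weight over all orderings of the k actions."""
    k = len(cost)
    best = float("inf")
    for order in permutations(range(k)):
        weight = sum(cost[order[i]][order[i + 1]] for i in range(k - 1))
        best = min(best, weight)
    return best


def hsse_budget_check(cost, budget):
    """Return (H, m(S), worst-case switching cost m(S) * H + max c_ij)."""
    k = len(cost)
    H = shortest_hamiltonian_path_weight(cost)
    c_max = max(cost[i][j] for i in range(k) for j in range(k) if i != j)
    m = floor((budget - c_max) / H)  # m(S) = floor((S - max c_ij) / H)
    return H, m, m * H + c_max


# Hypothetical symmetric costs that satisfy the triangle inequality.
cost = [[0, 2, 4],
        [2, 0, 3],
        [4, 3, 0]]
print(hsse_budget_check(cost, budget=20))  # (5, 3, 19)
```

Here the worst-case cost 19 stays within the budget S=20, matching the bound m(S)H+\max_{i,j\in[k]}c_{i,j}\leq S.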
We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of times action i is chosen in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as
r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},~{}~{}\forall i\in[k],t\in[T]. 
Define the clean event as
\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],~~|\bar{\mu}_{t}(i)-\mu_{i}|\leq r_{t}(i)\}.
By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.
The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 3 can be expressed as
\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),~~\forall l\in[m(S)+1],i\in[k],
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),~~\forall l\in[m(S)+1],i\in[k].
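As a rough illustration of these quantities, the sketch below computes the confidence radius and the two confidence bounds for given sample statistics. The function names are hypothetical, and the overlap test is an assumption based on the overlap condition used in the proof that follows; Algorithm 3 itself is not reproduced here.

```python
import math


def confidence_radius(n_pulls, horizon):
    """r_t(i) = sqrt(2 log T / n_t(i)); requires n_pulls >= 1."""
    return math.sqrt(2.0 * math.log(horizon) / n_pulls)


def confidence_bounds(mean_reward, n_pulls, horizon):
    """Return (LCB, UCB) = (mean - r, mean + r)."""
    r = confidence_radius(n_pulls, horizon)
    return mean_reward - r, mean_reward + r


def intervals_overlap(mean_i, mean_best, n_pulls, horizon):
    """Assumed survival test: action i stays active while UCB(i) >= LCB(best)."""
    _, ucb_i = confidence_bounds(mean_i, n_pulls, horizon)
    lcb_best, _ = confidence_bounds(mean_best, n_pulls, horizon)
    return ucb_i >= lcb_best


# With few samples the intervals overlap; with many samples they separate.
print(intervals_overlap(0.2, 0.9, n_pulls=100, horizon=1000))
print(intervals_overlap(0.2, 0.9, n_pulls=10_000, horizon=1000))
```

The radius shrinks at rate 1/\sqrt{n_{t}(i)}, which is what drives the elimination of suboptimal actions as intervals grow longer.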
Let \pi denote the HSSE policy. First, observe that for any environment \mathcal{D},
R_{\mathcal{D}}^{\pi}(T)=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\bar{\mathcal{E}})
\leq\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}}
=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1),  (9)
so in order to bound R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T), we only need to focus on the clean event.
Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval when i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval in which action i has not yet been eliminated (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the HSSE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of interval \eta_{i}-1 must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,
\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-1}}(i)=4r_{t_{\eta_{i}-1}}(i),  (10)
where the last equality holds because i^{*} and i are chosen equally many times in each interval until interval \eta_{i}, which indicates that n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 2, for simplicity, we overlook the rounding issues of \frac{t_{l+1}-t_{l}}{|A_{l}|} for each interval l. Considering the rounding issues will not bring additional difficulty to our analysis, as in the policy we can always design a rounding rule that keeps the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)
Since i is never chosen after the \eta_{i}-th interval, we have n_{t_{\eta_{i}}}(i)=n_{T}(i), and therefore r_{t_{\eta_{i}}}(i)=r_{T}(i).
The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the HSSE policy and (10), we can bound this quantity as
R(T;i)=n_{T}(i)\Delta(i)
\leq 4n_{t_{\eta_{i}}}(i)\sqrt{\frac{2\log T}{n_{t_{\eta_{i}-1}}(i)}}
\leq C_{0}\sqrt{2\log T}\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{t_{\eta_{i}-1}/k}}
\leq 4C_{0}\sqrt{2\log T}\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|}
for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,
\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]=\sum_{i\in[k]}R(T;i)
\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\frac{1}{|A_{\eta_{i}}|}
\leq C_{1}\sqrt{\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}\frac{1}{|A_{\eta_{i}}|}
\leq C_{2}\sqrt{\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}\frac{1}{j}
\leq C_{3}(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}
for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus by (9) and R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T) we have
R^{\pi}(T)\leq C(\log k\log T)k^{2-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}
for some absolute constant C\geq 0, where m(S)=m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.\hfill\Box
Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Without loss of generality, we assume that \arg\max_{i\in[k]}(\min_{j\neq i}c_{i,j})=1, i.e., \min_{j\neq 1}c_{1,j}\geq\min_{j\neq i}c_{i,j} for all i\in[k]. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{L}(S)=\lfloor(S-\max_{i\in[k]}\min_{j\neq i}c_{i,j})/H\rfloor.
Given any k=|G|\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mathcal{D}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \boldsymbol{\mu}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.
For any environment \mu, let X_{\boldsymbol{\mu}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of each action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\boldsymbol{\mu}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.
For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\boldsymbol{\mu}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t}); then \mathbb{F}=(\mathcal{F}_{t})_{t\in[T]} is a filtration.
For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}; see detailed definitions in Chapter 15 of Wainwright (2019).
For any environment \mu, for any policy \pi\in\Pi_{S}, we make some key definitions as below.
1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

\tau_{1}=\min\{1\leq t\leq T:\text{all the actions in }[k]\text{ have been chosen in }[1:t]\} if the set is nonempty and \tau_{1}=\infty otherwise.

\tau_{2}=\min\{1\leq t\leq T:\text{all the actions in }[k]\text{ have been chosen in }[\tau_{1}:t]\} if the set is nonempty and \tau_{2}=\infty otherwise.

Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in }[k]\text{ have been chosen in }[\tau_{j-1}:t]\} if the set is nonempty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1.
It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.
2. We then define a series of random variables (which depend on the stopping times).

S(1,\tau_{1}) is the total switching cost incurred in [1:\tau_{1}] (note that if there is a switch happening between \tau_{1} and \tau_{1}+1, we do not count its cost in S(1,\tau_{1})).

For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the total switching cost incurred in [\tau_{j-1}:\tau_{j}] (note that if there is a switch happening between \tau_{j-1}-1 and \tau_{j-1}, or between \tau_{j} and \tau_{j}+1, we do not count its cost in S(\tau_{j-1},\tau_{j})).

S(\tau_{m(S)},T) is the total switching cost incurred in [\tau_{m(S)}:T] (note that if there is a switch happening between \tau_{m(S)}-1 and \tau_{m(S)}, we do not count its cost in S(\tau_{m(S)},T)).
3. Next we define a series of events.

E_{1}=\{\tau_{1}>t_{1}\}.

For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.
Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 2.
4. Finally we define a series of shrinking errors.

\Delta_{1}=1.

For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}.)

\Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)
5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking value in [k] such that {\pi_{\infty}=i} with probability 1/k (i\in[k]).
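The relation between the shrinking errors in item 4 and the grid points t_{j} can be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the grid t_{j}=k(T/k)^{(2-2^{1-j})/(2-2^{-m(S)})} is taken without rounding, in the form in which it is used later in this proof (Algorithm 2 itself is not reproduced here).

```python
import math


def grid_point(j, k, T, m):
    """Assumed unrounded grid t_j = k (T/k)^{(2 - 2^{1-j}) / (2 - 2^{-m})}."""
    return k * (T / k) ** ((2 - 2.0 ** (1 - j)) / (2 - 2.0 ** (-m)))


def shrinking_error(j, k, T, m):
    """Delta_j for j = 1, ..., m+1 as defined in item 4."""
    if j == 1:
        return 1.0
    exponent = (1 - 2.0 ** (1 - j)) / (2 - 2.0 ** (-m))
    numerator = k ** (-0.5) * (k / T) ** exponent
    scale = 2 if j == m + 1 else 1  # the last error carries an extra factor 1/2
    return numerator / (scale * k * (m + 1))


k, T, m = 4, 10_000, 3
for j in range(2, m + 2):
    product = math.sqrt(grid_point(j - 1, k, T, m)) * shrinking_error(j, k, T, m)
    # Normalized product: close to 1 for j <= m and close to 0.5 for j = m + 1.
    print(j, product * k * (m + 1))
```

With the unrounded grid the approximations in item 4 hold with equality, which is the property the change-of-measure steps below rely on.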
Lemma 2
For any environment \mu, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is strictly less than }H+\bar{c}\} almost surely.
Proof of Lemma 2. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1}:\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, we know that S(1,\tau_{1})\geq H, S(\tau_{1},\tau_{2})\geq H, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq H. Thus we have
S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S)H.
Since \pi\in\Pi_{S}, we further know that
S(\tau_{m(S)},T)\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})]
\leq S-m(S)H<H+\max_{i\in[k]}\min_{j\neq i}{c_{i,j}}=H+\min_{j\neq 1}{c_{1,j}}
happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is strictly less than }H+\min_{j\neq 1}{c_{1,j}}\} almost surely. \hfill\Box
Consider a class of environments \Lambda=\{\boldsymbol{\mu}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i}\leq\frac{\Delta_{m(S)+1}}{2},\forall i\neq 1\}. Pick an arbitrary environment \alpha from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, by the union bound, we have
\sum_{j=1}^{m(S)+1}\mathbb{P}_{\alpha}^{\pi}(E_{j})\geq\mathbb{P}_{\alpha}^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1.
Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1).
Case 1: j^{*}=1. Since \mathbb{P}_{\alpha}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i),
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}.
Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have
\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] only depends on policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. We have
\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.  (11)
We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for all i\in[k]. Let \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. We have
\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]).  (12)
But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (11) and (12) we have
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.
However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]\geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1/(2-2^{-m(S)})}.
Case 2: 2\leq j^{*}\leq m(S). Since \mathbb{P}_{\alpha}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i),
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{j^{*}})}{k}\geq\frac{1}{k(m(S)+1)}.
Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have
\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] only depends on policy \pi, random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Let \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi, random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. We have
\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}.  (13)
We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]. Let \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi, random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]} and random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. We have
\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]).  (14)
We now bound the difference between the left-hand side (LHS) of (13) and the left-hand side of (14). We have
\left|\text{LHS in }(13)-\text{LHS in }(14)\right|
\leq D_{\mathrm{TV}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\,\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)
\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\,\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)}
\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\,\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\right)}
=\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\,\|\,\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}
=\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}\right)^{2}}{2}\right]}
\leq\frac{\sqrt{t_{j^{*}-1}}\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)},
where the first inequality is by the definition of the total variation distance between two probability measures, the second inequality is by Pinsker’s inequality, and the third inequality is by the data-processing inequality. The first equality holds because the rewards of every action i\neq i^{\prime} have the same distribution under \alpha and \beta.
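The last two steps of this chain are simple enough to verify numerically. The sketch below (function names are hypothetical) uses the standard closed form for the KL divergence between n i.i.d. samples from \mathcal{N}(\mu,1) and n i.i.d. samples from \mathcal{N}(\mu+\Delta,1), namely n\Delta^{2}/2, and then applies Pinsker's inequality.

```python
import math


def kl_unit_gaussian_shift(n_samples, shift):
    """KL between n i.i.d. N(mu, 1) samples and n i.i.d. N(mu + shift, 1) samples."""
    return n_samples * shift ** 2 / 2.0


def pinsker_bound(kl):
    """Pinsker's inequality: total variation distance <= sqrt(KL / 2)."""
    return math.sqrt(kl / 2.0)


# Illustrative values: n plays the role of t_{j*-1} - 1, delta the role of Delta_{j*}.
n, delta = 399, 0.01
tv_bound = pinsker_bound(kl_unit_gaussian_shift(n, delta))
print(tv_bound, math.sqrt(n + 1) * delta / 2)
```

The first printed value equals \sqrt{n}\,\Delta/2 and is therefore no larger than the second, which corresponds to \sqrt{t_{j^{*}-1}}\Delta_{j^{*}}/2.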
Combining the above inequality with (13) and (14), we have
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}.
However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} will incur at least a \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] indicates that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{*}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]
\geq\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)}}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}\left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}
\geq\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{-m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-m(S)}}{2-2^{-m(S)}}}\right)
\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right).
When m(S)\leq\log_{2}\log_{2}(T/k), we have
(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T/k}(2)}}=\frac{1}{2}.
Thus we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
when m(S)\leq\log_{2}\log_{2}(T/k).
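The threshold \log_{2}\log_{2}(T/k) is exactly the point up to which the factor 1-(T/k)^{-2^{-m(S)}} stays at least 1/2. The short check below illustrates this for one hypothetical choice of T and k; the function names are assumptions for illustration.

```python
import math


def correction_factor(T, k, m):
    """The factor 1 - (T/k)^(-2^(-m)) appearing in the lower bound."""
    return 1.0 - (T / k) ** (-(2.0 ** (-m)))


def largest_admissible_m(T, k):
    """Largest integer m with m <= log2(log2(T/k)); requires T/k >= 2."""
    return math.floor(math.log2(math.log2(T / k)))


T, k = 10 ** 6, 10
m_max = largest_admissible_m(T, k)
for m in range(m_max + 2):
    # The factor is at least 1/2 for every m <= m_max.
    print(m, m <= m_max, correction_factor(T, k, m))
```

For this choice of T and k the factor falls below 1/2 at m_max+1, though this need not happen for every T and k, since m_max is obtained by rounding down.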
Case 3: j^{*}=m(S)+1. Since \mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i),
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(S)+1)}.
Thus either
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)},  (15)
or
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}.  (16)
If (15) holds true, then we consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and all other components are the same as \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (15) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those in Case 2, we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}
and
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k).
Now we consider the case that (16) holds true. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the total switching cost incurred in [\tau_{m(S)}:T] is strictly less than H+\min_{j\neq 1}{c_{1,j}}. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the total switching cost incurred in [\tau_{m(S)}:\tau_{m(S)+1}] is at least H. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}.
Suppose that i^{\prime}\neq 1. Then the event \mathcal{E}_{i^{\prime}}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\} implies that action 1 is not chosen in [\tau_{m(S)+1}:T], as incurring c_{i^{\prime},1}\geq\min_{j\neq 1}{c_{1,j}} would violate the requirement that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 will incur at least a \Delta_{m(S)+1}/4 term in regret. As a result, we know that
R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{4}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k).
Thus we only need to consider the subcase of i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}% {2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} only depends on policy \pi and random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]} and random vectors \{X_{\alpha}^{t}(i)\}_{[1:{(t_{m(S)}+T)}/{2}]} for i\neq 1. Consider a new environment \beta such that its first component is \alpha_{1}\Delta_{m(S)+1} and all other components are the same as \alpha. Using analogous arguments like Case 2 (Appendix id1), we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}.
However, action 1 is the worst action in environment \beta, and each round in which action 1 is chosen incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. Since switching from action 1 to any other action incurs a cost of at least \min_{j\neq 1}{c_{1,j}}, the event \mathcal{E}_{1} implies that action 1 is chosen in every round from round \tau_{m(S)+1} (\leq\frac{t_{m(S)}+T}{2}) to round T, and hence in each of the last (T-\frac{t_{m(S)}+T}{2}+1) rounds. As a result, we know that
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k).
Combining Cases 1, 2 and 3, we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with infinite switching budget) is \Omega(\sqrt{kT}), we know that
R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT} 
for some absolute constant C>0. To sum up, we have
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}
for some absolute constant C>0, where m(S)=m_{G}^{L}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\rfloor. \hfill\Box
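To see how the bound above behaves as the switching budget grows, the following minimal Python sketch evaluates the phase index m(S) and the exponent of T in the lower bound. It is an illustration only: the function names `phase_index` and `regret_exponent` are hypothetical, the constant H is passed in by the caller rather than computed from the cost matrix, and the sketch assumes S is large enough that m(S) is nonnegative and that T > 2k so the threshold \log_{2}\log_{2}(T/k) is well defined.

```python
import math


def phase_index(S, costs, H):
    """Return m(S) = floor((S - max_i min_{j != i} c_{i,j}) / H).

    `costs` is a k x k matrix of switching costs c_{i,j}; `H` is supplied
    by the caller (assumption: it is the constant H used in the proof).
    """
    k = len(costs)
    max_min = max(
        min(costs[i][j] for j in range(k) if j != i) for i in range(k)
    )
    m = math.floor((S - max_min) / H)
    if m < 0:
        raise ValueError("budget S is too small for m(S) to be nonnegative")
    return m


def regret_exponent(m, T, k):
    """Exponent of T in the lower bound for a given phase index m."""
    threshold = math.log2(math.log2(T / k))
    if m <= threshold:
        return 1.0 / (2.0 - 2.0 ** (-m))
    # Beyond the threshold the classical sqrt(kT) bound takes over.
    return 0.5


if __name__ == "__main__":
    # Hypothetical instance: 3 arms, unit switching costs, H = 2.
    unit_costs = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    for S in range(1, 12):
        m = phase_index(S, unit_costs, H=2)
        print(S, m, regret_exponent(m, T=10**6, k=3))
```

In this hypothetical instance the exponent stays constant while S moves within a phase and drops only when m(S) increases, which is the phase-transition behavior described by the bound.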