Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints
David Simchi-Levi
Institute for Data, Systems and Society, Department of Civil and Environmental Engineering, and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu
Yunzong Xu
Institute for Data, Systems and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, yxu@mit.edu
We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. We prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the tradeoff between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.
The multi-armed bandit (MAB) problem is one of the most fundamental problems in online learning, with diverse applications ranging from pricing and online advertising to clinical trials. Over the past several decades, it has been a very active research area spanning different disciplines, including computer science, operations research, statistics and economics.
In a traditional multi-armed bandit problem, the learner (i.e., decision-maker) is allowed to switch freely between actions, and an effective learning policy may incur frequent switching. Indeed, the learner’s task is to balance the exploration-exploitation tradeoff, and both exploration (i.e., acquiring new information) and exploitation (i.e., optimizing decisions based on up-to-date information) require switching. However, in many real-world scenarios, it is costly to switch between different alternatives, and a learning policy with limited switching behavior is preferred. The learner thus has to consider the cost of switching in her learning task.
There is a rich literature studying stochastic MAB with switching costs. Most of the papers model the switching cost as a penalty in the learner’s objective, i.e., they measure a policy’s regret and incurred switching cost using the same metric and the objective is to minimize the sum of these two terms (e.g., Agrawal et al. (1988, 1990), Brezzi and Lai (2002), Cesa-Bianchi et al. (2013); there are other variations with discounted rewards, e.g., Banks and Sundaram (1994), Asawa and Teneketzis (1996), Bergemann and Välimäki (2001); see Jun (2004) for a survey).
Though this conventional “switching penalty” model has attracted significant research interest in the past, it has two limitations.
First, under this model, the learner’s total switching cost is an output determined entirely by the learning algorithm. However, in many real-world applications, there are strict limits on the learner’s switching behavior, which should be modeled as a hard constraint, and hence the learner’s total budget of switching cost should be an input that helps determine the algorithm. In particular, while the algorithm in Cesa-Bianchi et al. (2013) developed for the “switching penalty” model can achieve \tilde{O}(\sqrt{T}) (distribution-free) regret with O(\log\log T) switches, if the learner wants a policy that always incurs finite switching cost independent of T, then prior literature does not provide an answer.
Second, the “switching penalty” model has a fundamental weakness in studying the tradeoff between regret and incurred switching cost in stochastic MAB: since the O(\log\log T) bound on the incurred switching cost of a policy is negligible compared with the \tilde{O}(\sqrt{T}) bound on its optimal regret, when the two terms are added up, the term associated with incurred switching cost is always dominated by the regret, and thus no tradeoff can be identified. As a result, to the best of our knowledge, prior literature has not characterized the fundamental tradeoff between regret and incurred switching cost in stochastic MAB.
In this paper, we introduce the Bandits with Switching Constraints (BwSC) problem. The BwSC model addresses the issues associated with the “switching penalty” model in several ways.
First, it introduces a hard constraint on the total switching cost, making the switching budget an input to learning policies and enabling us to design good policies that guarantee limited switching cost. While O(\log\log T) switches have proven to be sufficient for a learning policy to achieve near-optimal regret in MAB, in BwSC, we are mostly interested in the setting of a finite or o(\log\log T) switching budget, which is highly relevant in practice.
Second, by focusing on rewards in the objective function and incurred switching cost in the switching constraint, the BwSC framework enables the characterization of the fundamental tradeoff between regret and maximum incurred switching cost in MAB.
Third, while most prior research assumes specific structures on switching costs (e.g., unit or homogeneous costs), in reality, switching between different pairs of actions may incur heterogeneous costs that do not follow any parametric form. The BwSC model allows general switching costs, which makes it a powerful modeling framework.
The BwSC framework has numerous applications, including dynamic pricing, online assortment optimization, online advertising, clinical trials and vehicle routing. A representative example is the dynamic pricing problem. Dynamic pricing with demand learning has proven its effectiveness in online retailing. However, it is well known that in practice, sellers often face business constraints that prevent them from conducting extensive price experimentation and making frequent price changes. For example, according to Cheung et al. (2017), Groupon limits the number of price changes, either because of implementation constraints, or for fear of confusing customers and receiving negative customer feedback. In such scenarios, the seller’s sequential decision-making problem can be modeled as a BwSC problem, where changing from one price to another incurs some cost, and there is a limit on the total cost incurred by price changes.
The paper’s contributions are along four dimensions.
On the modeling side, we introduce the BwSC model, a general framework with strong modeling power. The model overcomes the limitations of the prior “switching penalty” model and has both practical and theoretical value.
The second dimension of contribution lies in the analysis domain. In Section id1, we study the unit-switching-cost BwSC problem. We obtain an upper bound on regret by proposing a simple and intuitive policy with carefully designed switching rules, and prove a strong information-theoretic lower bound that matches the above upper bound, indicating that our policy is rate-optimal up to logarithmic factors. Methodologically, the proof of the lower bound involves a novel “tracking the cover time” argument that has not appeared in prior literature and may be of independent interest.
With the analysis described above, we obtain a series of surprising and insightful results for both BwSC and MAB. Among the most important discoveries are the phase transitions and cyclic phenomena exhibited by the optimal regret in BwSC and MAB. That is, we show that associated with these problems, there are equal-length phases, defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases; see the precise definitions in Section id1. The tight regret bounds in BwSC motivate new insights about the classical MAB problem. In particular, we show that \Theta(\log\log T) switches are necessary and sufficient to achieve near-optimal regret in MAB.
Finally, we study the general-switching-cost BwSC problem in Section id1. We make a conceptual contribution by revealing a deep connection between bandit problems and graph traversal problems. In fact, we characterize some important switching patterns associated with any effective learning policy in MAB, which in turn lead to regret upper and lower bounds for the general BwSC problem.
For all n_{1},n_{2}\in\mathbb{N} such that n_{1}\leq n_{2}, we use [n_{1}] to denote the set \{1,\dots,n_{1}\}, and use [n_{1}:n_{2}] (resp. (n_{1}:n_{2}]) to denote the set \{n_{1},n_{1}+1,\dots,n_{2}\} (resp. \{n_{1}+1,\dots,n_{2}\}). Throughout the paper, we will use the big O notation to hide constant factors, and use the \tilde{O} notation to hide constant factors and logarithmic factors.
Consider a k-armed bandit problem where a learner chooses actions from a fixed set [k]=\{1,\dots,k\}. There is a total of T rounds. In each round t\in[T], the learner first chooses an action i_{t}\in[k], then observes a reward r_{t}(i_{t})\in\mathbb{R}. For each action i\in[k], the reward of action i is drawn i.i.d. from an (unknown) distribution \mathcal{D}_{i} with (unknown) expected value \mu_{i}. We assume that the distributions \mathcal{D}_{i} are standardized sub-Gaussian. Without loss of generality, we assume \sup_{i,j\in[k]}|\mu_{i}-\mu_{j}|\in[0,1].
In our problem, the learner incurs a switching cost c_{i,j}=c_{j,i}\geq 0 each time she switches between action i and action j (i,j\in[k]). In particular, c_{i,i}=0 for i\in[k]. There is a prespecified switching budget S\geq 0 representing the maximum amount of switching costs that the learner can incur in total. Once the total switching cost exceeds the switching budget S, the learner cannot switch her actions any more. The learner’s goal is to maximize the expected total reward over T rounds.
Let \pi denote the learner’s (non-anticipating) learning policy, and \pi_{t}\in[k] denote the action chosen by policy \pi at round t\in[T]. More formally, \pi_{t} establishes a probability kernel acting from the space of historical actions and observations to the space of actions at round t. Let \mathbb{P}^{\pi}_{\mathcal{D}} and \mathbb{E}^{\pi}_{\mathcal{D}} be the probability measure and expectation induced by policy \pi and latent distributions \mathcal{D}=(\mathcal{D}_{1},\dots,\mathcal{D}_{k}). According to the problem formulation, we only need to restrict our attention to the S-switching-budget policies defined below. (Footnote 1: Note that here we do not make any assumption on the learner’s behavior. In particular, we do not require the learner to intentionally pick an S-switching-budget policy; the switching constraint makes the learner’s policy automatically equivalent to an S-switching-budget policy.)
Definition 1
A policy \pi is said to be an S-switching-budget policy if for all \mathcal{D} and T\geq 1,
\mathbb{P}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T-1}c_{\pi_{t},\pi_{t+1}}\leq S\right]=1.
Let \Pi_{S} denote the set of all S-switching-budget policies, which is also the admissible policy class of the BwSC problem.
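To make the switching constraint concrete, the following minimal Python sketch computes the total switching cost of a realized action sequence and compares it with a budget. The function names and the toy cost matrix are illustrative assumptions rather than part of the model, and actions are indexed from 0 instead of 1.

```python
from typing import Sequence


def total_switching_cost(actions: Sequence[int],
                         cost: Sequence[Sequence[float]]) -> float:
    """Sum cost[a_t][a_{t+1}] over consecutive rounds.

    The diagonal of `cost` is zero, so repeating an action is free.
    """
    return sum(cost[a][b] for a, b in zip(actions, actions[1:]))


def within_budget(actions: Sequence[int],
                  cost: Sequence[Sequence[float]],
                  budget: float) -> bool:
    """True if the realized sequence respects the switching budget S."""
    return total_switching_cost(actions, cost) <= budget


# Hypothetical symmetric cost matrix for k = 3 actions (zero diagonal).
COST = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]

if __name__ == "__main__":
    path = [0, 0, 1, 1, 2, 2]  # one switch 0 -> 1 and one switch 1 -> 2
    print(total_switching_cost(path, COST))
    print(within_budget(path, COST, budget=3.0))
```

An S-switching-budget policy is one for which this check succeeds with probability one, for every environment and every horizon.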
The performance of a learning policy is measured against a clairvoyant policy that maximizes the expected total reward given foreknowledge of the environment (i.e., latent distributions) \mathcal{D}. Let \mu^{*}=\max_{i\in[k]}\mu_{i}. We define the regret of policy \pi as the worst-case difference between the expected performance of the optimal clairvoyant policy and the expected performance of policy \pi:
R^{\pi}(T)=\sup_{\mathcal{D}}\left\{T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right]\right\}.
The minimax (optimal) regret of BwSC is defined as R_{S}^{*}(T)=\inf_{\pi\in\Pi_{S}}R^{\pi}(T).
In our paper, when we say a policy is “near-optimal” or “optimal up to logarithmic factors”, we mean that its regret bound is optimal in T up to logarithmic factors of T, irrespective of whether the bound is optimal in k, since typically k is much smaller than T. Still, our derived bounds are actually quite tight in k.
Remark. There are two notions of regret in the stochastic bandit literature. The R^{\pi}(T) regret that we consider is called distribution-free, as it does not depend on \mathcal{D}. On the other hand, one can also define the distribution-dependent regret R_{\mathcal{D}}^{\pi}(T)=T\mu^{*}-\mathbb{E}_{\mathcal{D}}^{\pi}\left[\sum_{t=1}^{T}\mu_{\pi_{t}}\right] that depends on \mathcal{D}. This second notion of regret is only meaningful when \mu_{1},\dots,\mu_{k} are well-separated. Unlike the classical MAB problem where there are policies simultaneously achieving near-optimal bounds under both regret notions, in the BwSC problem, due to the limited switching budget, finding a policy that simultaneously achieves near-optimal bounds under both regret notions is usually impossible. Thus in the main body of the paper, we focus on the distribution-free regret. However, in Appendix A, we extend our results to the distribution-dependent regret.
Clearly, BwSC and MAB share the same definition of R^{\pi}(T), and the only difference between BwSC and MAB is the existence of a switching constraint \pi\in\Pi_{S}, determined by (c_{i,j})\in\overline{\mathbb{R}}_{\geq 0}^{k\times k} and S\in\overline{\mathbb{R}}_{\geq 0} (when S=\infty, BwSC degenerates to MAB). This makes BwSC a natural framework to study the tradeoff between regret and incurred switching cost in MAB. That is, the tradeoff between the optimal regret R_{S}^{*}(T) and switching budget S in BwSC completely characterizes the tradeoff between a policy’s best achievable regret and its worst possible incurred switching cost in MAB. We are interested in how R_{S}^{*}(T) behaves over a range of switching budget S, and how it is affected by the structure of switching costs (c_{i,j}).
This paper is not the first one to study online learning problems with limited switches. Indeed, a few authors have realized the practical significance of a limited switching budget. For example, Cheung et al. (2017) consider a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy is allowed to make at most m price changes. Their constraint on the total number of price changes is motivated by collaboration with Groupon, a major e-commerce marketplace in North America. In such an environment, Groupon limits the number of price changes, either because of implementation constraints, or for fear of confusing customers and receiving negative customer feedback. They propose a pricing policy that guarantees O(\log^{(m)}T) (or m iterations of the logarithm) regret with at most m price changes, and report that in a field experiment, this pricing policy with a single price change increases revenue and market share significantly. Chen and Chao (2019) study a multi-period stochastic inventory replenishment and pricing problem with unknown demand and limited price changes. Assuming that the demand function is drawn from a parametric class of functions, they develop a finite-price-change policy based on maximum likelihood estimation that achieves optimal regret.
We note that both Cheung et al. (2017) and Chen and Chao (2019) focus only on specific decision-making problems, and their results rely on strong assumptions about the unknown environment. Cheung et al. (2017) assume a known finite set of potential demand functions, and require the existence of discriminative prices that can efficiently differentiate all potential demand functions. Chen and Chao (2019) assume a known parametric form of the demand function, and also require a well-separated condition. By contrast, the BwSC model in our paper is generic and assumes no prior knowledge of the environment. The learning task in the BwSC problem is thus more challenging than in previous models. Also, the switching constraint in the BwSC problem is more general than the price-change constraints in previous models.
In the Bayesian bandit setting, Guha and Munagala (2013) study the “bandits with metric switching costs” problem that allows a constraint involving metric switching costs. Using competitive ratio as the performance metric and assuming Bayesian priors, they develop a 4-approximation algorithm for the problem. The competitive ratio is measured against an optimal online policy that does not know the true distributions. As pointed out by the authors, the optimal online policy can be directly determined by a dynamic program. So the main issue in their model is a computational one. Our work is different, as we are using regret as our performance metric, and we are competing with an optimal clairvoyant policy that knows the true distributions, which is a much stronger benchmark. Our problem thus involves both statistical and computational challenges. In fact, the algorithm in Guha and Munagala (2013) cannot avoid a linear regret when applied to the BwSC problem.
In the adversarial bandit setting, Altschuler and Talwar (2018) study the adversarial MAB problem with a limited number of switches, which can be viewed as an adversarial counterpart of the BwSC problem with unit switching costs (see Section id1). For any policy that makes no more than S\leq T switches, they prove that the optimal regret is \tilde{\Theta}(T\sqrt{k}/\sqrt{S}). Since we are considering a different setting from theirs (our problem is stochastic while their problem is adversarial), the methodologies and results in our paper are fundamentally different from those in their paper. In particular, while finite-switch policies cannot avoid linear regret in the adversarial setting, in the stochastic setting, finite switches are already able to guarantee sublinear regret. Moreover, while the optimal regret in Altschuler and Talwar (2018) decreases smoothly as S increases from 0 to T, in the stochastic setting, we identify very surprising behavior of the optimal regret as S increases from 0 to \Theta(\log\log T), which, to the best of our knowledge, has never been identified in the bandit literature before.
The BwSC problem is also related to the batched bandit problem proposed by Perchet et al. (2016). The M-batched bandit problem is defined as follows: given a classical bandit problem, assume that the learner must split her learning process into M batches and is only able to observe data (i.e., realized rewards) from a given batch after the entire batch is completed. This implies that all actions within a batch are determined at the beginning of this batch. Here M can be viewed as a quantity measuring the learner’s adaptivity, i.e., her ability to learn from her data and adapt to the environment. An M-batch policy is defined as a policy that only observes realized data M-1 times through the entire horizon. Perchet et al. (2016) study the problem in the case of two arms, and prove that the optimal regret for the M-batched bandit problem is \tilde{\Theta}(T^{1/(2-2^{1-M})}). Very recently, Gao et al. (2019) extend these results to general k arms.
On the surface, the batched bandit problem and the BwSC problem seem like two totally different problems: the batched bandit problem limits observation and allows unlimited switching, while the BwSC problem limits switching and allows unlimited observation. Surprisingly, in this paper, we discover some deep non-trivial connections between the batched bandit problem and the unit-switching-cost BwSC problem. The connections will be further discussed in Section id1.
In this section, we consider the BwSC problem with unit switching costs, where c_{i,j}=1 for all i\neq j. In this case, since every switch incurs a unit cost, the switching budget S can be interpreted as the maximum number of switches that the learner can make in total. Thus, the unit-switching-cost BwSC problem can be simply interpreted as “MAB with limited number of switches”.
We first propose a simple and intuitive policy that provides an upper bound on the regret. Our policy, called the S-Switch Successive Elimination (SS-SE) policy, is described in Algorithm 1. The design philosophy behind the SS-SE policy is to divide the entire horizon into several predetermined intervals (i.e., batches) and to control the number of switches in each interval. The policy thus has some similarities with the 2-armed batched policy of Perchet et al. (2016) and the k-armed batched policy of Gao et al. (2019), which prove to be near-optimal in the batched bandit problem. However, since we are studying a different problem, directly applying a batched policy to the BwSC problem does not work. In particular, in the batched bandit problem, the number of intervals (i.e., batches) is a given constraint, while in the BwSC problem, the switching budget is the given constraint. We thus add two key ingredients into the SS-SE policy: (1) an index m(S) suggesting how many intervals should be used to partition the entire horizon; (2) a switching rule ensuring that the total number of switches within k actions cannot exceed the switching budget S. These two ingredients make the SS-SE policy substantially different from an ordinary batched policy.
Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}], where the endpoints are defined by t_{0}=1 and
t_{i}=\lfloor k^{1-\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}T^{\frac{2-2^{-(i-1)}}{2-2^{-m(S)}}}\rfloor,\quad\forall i=1,\dots,m(S)+1.
Initialization: Let the set of all active actions in the lth interval be A_{l}. Set A_{1}=[k].
Policy:
\texttt{UCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}},
\texttt{LCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}.
Intuition about the Policy. The policy divides the T rounds into \lfloor\frac{S-1}{k-1}\rfloor+1 intervals in advance. The sizes of the intervals are designed to balance the exploration-exploitation tradeoff. An active set of “good” actions A_{l} is maintained for each interval l, and at the end of each interval some “bad” actions are eliminated before the start of the next interval. The policy controls the number of switches by ensuring that only |A_{l}|-1 switches happen within each interval l and at most one switch happens between two consecutive intervals. Finally, in the last interval only the empirical best action is chosen.
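The Python sketch below is one possible reading of the partition, the UCB/LCB elimination and the switching rule described above; it is not the authors' exact procedure. It makes several assumptions that go beyond the text: arms are indexed from 0, rewards are unit-variance Gaussian, the rounds of each interval are split evenly across active arms, each interval starts with the arm played last whenever that arm is still active, and m(S) is clamped at 0 so that S = 0 yields a single interval. All helper names are ours.

```python
import math
import random


def m_of_s(S: int, k: int) -> int:
    """m(S) = floor((S-1)/(k-1)), clamped at 0 (assumption) to cover S = 0."""
    return max(0, (S - 1) // (k - 1))


def endpoints(k: int, S: int, T: int) -> list:
    """Interval endpoints t_1 <= ... <= t_{m(S)+1} = T from the partition step."""
    m = m_of_s(S, k)
    denom = 2 - 2.0 ** (-m)
    ts = []
    for i in range(1, m + 2):
        e = (2 - 2.0 ** (-(i - 1))) / denom
        t = math.floor(k ** (1 - e) * T ** e)
        ts.append(min(T, max(k, t)))  # guard against floating-point rounding
    ts[-1] = T
    return ts


def ss_se_sketch(means, S, T, rng):
    """Simulate the sketch on Gaussian arms; return the list of chosen arms."""
    k = len(means)
    assert k >= 2 and T >= k and S >= 0
    active = list(range(k))
    plays, sums = [0] * k, [0.0] * k
    actions = []

    def pull(arm, n):
        for _ in range(n):
            sums[arm] += rng.gauss(means[arm], 1.0)
            plays[arm] += 1
            actions.append(arm)

    def mean(i):
        return sums[i] / plays[i] if plays[i] else 0.0

    for t_end in endpoints(k, S, T)[:-1]:  # the first m(S) intervals explore
        # Assumed switching rule: begin with the arm played last if it survived.
        if actions and actions[-1] in active:
            active.remove(actions[-1])
            active.insert(0, actions[-1])
        base, extra = divmod(t_end - len(actions), len(active))
        for pos, arm in enumerate(active):
            pull(arm, base + (1 if pos < extra else 0))
        # Eliminate every arm whose UCB lies below the largest LCB.
        rad = {i: math.sqrt(2 * math.log(T) / plays[i]) for i in active}
        best_lcb = max(mean(i) - rad[i] for i in active)
        active = [i for i in active if mean(i) + rad[i] >= best_lcb]
    # Last interval: commit to the empirical best active arm.
    pull(max(active, key=mean), T - len(actions))
    return actions


def count_switches(actions):
    """Number of rounds in which the chosen arm differs from the previous one."""
    return sum(a != b for a, b in zip(actions, actions[1:]))
```

Under these assumptions, each exploration interval costs at most k-1 switches (including the switch into it) and the last interval costs at most one, so the sketch uses at most m(S)(k-1)+1 switches, which does not exceed S whenever S is at least 1. A run such as `ss_se_sketch([0.9, 0.5, 0.1], S=5, T=2000, rng=random.Random(0))` can be checked with `count_switches`.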
We show that the SS-SE policy is indeed an S-switching-budget policy and establish the following upper bound on its regret; see Appendix B for a proof.
Theorem 1
Let \pi be the SS-SE policy. Then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0 and T\geq k,
R^{\pi}(T)\leq C(\log k\log T)k^{1-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}},
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Theorem 1 provides an upper bound on the optimal regret of the unit-switching-cost BwSC problem:
R^{*}_{S}(T)=\tilde{O}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}).
The SS-SE policy, though it achieves sublinear regret, seems to have many limitations that could weaken its performance, and on the surface this may suggest that the regret bound is not optimal. Specifically:

The SS-SE policy abandons some switching budget directly. Consider the case of 11 actions and a switching budget of 20. The SS-SE policy will directly give up 9 units of the switching budget: it just runs as if it could only make 11 switches. Intuitively, an effective learning policy should treasure its switching budget. It seems that by making full use of the switching budget, one can achieve much lower regret.

The SS-SE policy runs on predetermined intervals. The sizes and locations of intervals have nothing to do with observed data or real-time data. It seems that by using data-driven intervals that are determined by observed and real-time data, one can achieve much lower regret.

The number of intervals, m(S)+1=\lfloor(S-1)/(k-1)\rfloor+1, is predetermined by the worst case, i.e., as if no actions would be eliminated in each interval. Since SS-SE is a “successive elimination” policy, the actual number of actions that it needs to consider should become smaller and smaller as it runs, and we should be able to use many more intervals than just m(S)+1 intervals. It seems that by tracking the size of the active set A_{l} and adaptively determining the number of intervals, one can achieve much lower regret (note that by Theorem 1, the regret dramatically decreases as m(S) increases).

More importantly, the idea of designing S-switching policies based on (m(S)+1)-interval policies (or in the words of Perchet et al. (2016), (m(S)+1)-batch policies) itself seems questionable. Note that the SS-SE policy runs deterministically within each interval based on a predetermined schedule, and it only learns from data at the end of each interval, at most \lfloor(S-1)/(k-1)\rfloor times. For instance, in the case of 11 actions and a switching budget of 20, the SS-SE policy will split the entire horizon into two intervals and will only learn at the end of the first interval, after which it will choose a single action to be applied throughout the entire second interval. (Footnote 2: In fact, the SS-SE policy allocates its switching budget before seeing any data, and does not try to save switches after data becomes available, which means that data is not utilized for saving switches.) Intuitively, data should be utilized to save switches, and one would expect that an effective policy will have a high degree of adaptivity, that is, it should learn from the available data and adapt to the environment more frequently than our policy. Put differently, it seems that by utilizing full adaptivity and learning from data in every round, one can achieve much lower regret.

Besides the above limitations, the \tilde{O}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}) bound provided by the SS-SE policy also seems a little clumsy. In particular, the m(S)=\lfloor(S-1)/(k-1)\rfloor term looks like an artificial term (it is intentionally designed to fit the switching rule in SS-SE), and does not look like a natural term that should appear in the true optimal regret R^{*}_{S}(T).
While the above arguments are based on our first instinct and seem very reasonable, surprisingly, all of them prove to be wrong: no S-switch policy can theoretically do better! In fact, we match the upper bound provided by SS-SE by showing a strong information-theoretic lower bound in Theorem 2. This indicates that the SS-SE policy is indeed rate-optimal up to logarithmic factors, and R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}). Note that the tightness in T is achieved per instance, i.e., for every k and every S. That is, our lower bound is substantially stronger than demonstrating specific k and S for which the upper bound cannot be improved.
Theorem 2
There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq k and for every policy \pi\in\Pi_{S},
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Proof Idea. Our proof involves a novel “tracking the cover time” argument that (to the best of our knowledge) has not appeared in previous lower-bound proofs in the bandit literature. Specifically, we track a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)+1}, some of which may be \infty, that are recursively defined as follows: \tau_{1} is the first time that all the actions in [k] have been chosen in period [1:\tau_{1}], \tau_{2} is the first time that all the actions in [k] have been chosen in period [\tau_{1}:\tau_{2}], and generally, \tau_{i} is the first time that all the actions in [k] have been chosen in period [\tau_{i-1}:\tau_{i}], for i=2,\dots,m(S)+1. The structure of the series is carefully designed, enabling the realization of any two consecutive stopping times \tau_{i-1},\tau_{i} to convey the important message that there exists a specific (possibly unknown) action that has never been chosen in period [\tau_{i-1}:\tau_{i}-1]. This information in turn helps us to bound the difference of several key probabilities and derive the desired lower bound via information-theoretic arguments. For a complete proof of Theorem 2, see Appendix C.
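The recursive definition of the stopping times can be illustrated on a realized action sequence. The sketch below is only an illustration of the definition, not part of the proof; the function name `cover_times` is ours, actions are indexed from 0, and the sketch assumes at least two actions.

```python
import math
from typing import List, Sequence


def cover_times(actions: Sequence[int], k: int, count: int) -> List[float]:
    """Return tau_1, ..., tau_count as 1-indexed rounds (math.inf if never reached).

    tau_1 is the first round by which every action in {0, ..., k-1} has been
    chosen. tau_i is the first round by which every action has been chosen in
    the period [tau_{i-1} : tau_i], which includes round tau_{i-1} itself.
    """
    if k < 2:
        raise ValueError("the sketch assumes k >= 2")
    taus: List[float] = []
    seen = set()
    for t, a in enumerate(actions, start=1):
        seen.add(a)
        if len(seen) == k:
            taus.append(t)
            if len(taus) == count:
                return taus
            # The next period starts at round t, so its action already counts.
            seen = {a}
    return taus + [math.inf] * (count - len(taus))
```

For example, `cover_times([0, 1, 2, 2, 1, 0, 0, 1, 2], k=3, count=4)` returns `[3, 6, 9, inf]`. In each completed period, the action chosen at the stopping time is the one that completes the cover, so it was not chosen earlier in that period, which is the property the proof idea relies on.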
Corollary 1
For any fixed k, for any S,
R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}).
Remark. We briefly explain why the upper and lower bounds in Theorem 1 and Theorem 2 match in T. When m(S)\leq\log_{2}\log_{2}(T/k), which is the case we are mostly interested in, (m(S)+1)^{2}=o(\log T), thus the upper and lower bounds match within o((\log T)^{2}). When m(S)>\log_{2}\log_{2}(T/k), the upper bound is O(\sqrt{T}\log T), thus the upper and lower bounds directly match within O(\log T). We also argue that the slightly different terms of k appearing in the upper and lower bounds do not play an important role. In fact, the gap associated with k between the upper and lower bounds is O(\min\{k^{2.5},(T/k)^{\frac{1}{2-2^{-m(S)}}-0.5}\}). Since we are mostly interested in the case of k\ll T (or k=O(1)), the O(k^{2.5}) gap is not important relative to T.
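As a worked check of the gap in k, the two ratios can be computed directly from the bounds in Theorem 1 and Theorem 2, ignoring the logarithmic factor and the (m(S)+1)^{2} factor. The shorthand a is ours and is introduced only for this calculation. The second ratio uses the \sqrt{kT} case of Theorem 2; combining both cases for every S relies on the fact that the classical \Omega(\sqrt{kT}) lower bound for MAB applies to any policy.

```latex
\begin{align*}
a &:= \frac{1}{2-2^{-m(S)}}, \\
\frac{k^{1-a}\,T^{a}}{k^{-\frac{3}{2}-a}\,T^{a}} &= k^{(1-a)+\frac{3}{2}+a} = k^{\frac{5}{2}}, \\
\frac{k^{1-a}\,T^{a}}{\sqrt{kT}} &= k^{\frac{1}{2}-a}\,T^{a-\frac{1}{2}} = \left(\frac{T}{k}\right)^{a-\frac{1}{2}}.
\end{align*}
```

Taking the smaller of the two ratios gives a gap of order \min\{k^{5/2},(T/k)^{a-1/2}\}.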
Corollary 1 allows us to characterize the tradeoff between the switching budget S and the optimal regret R^{*}_{S}{(T)}. To illustrate this tradeoff, Figure 1 and Table 1 depict the behavior of R^{*}_{S}{(T)} as a function of S given a fixed k. Note that as discussed in Section id1, the relationship between R^{*}_{S}{(T)} and S also characterizes the inherent tradeoff between regret and maximum number of switches in the classical MAB problem.
We observe several surprising phenomena regarding the tradeoff between S and R_{S}^{*}(T).
Phase Transitions. As we have shown, R^{*}_{S}(T)=\tilde{\Theta}(T^{1/(2-2^{-\lfloor(S-1)/(k-1)\rfloor})}). To the best of our knowledge, this is the first time that a floor function naturally arises in the order of T in the optimal regret of an online learning problem. As a direct consequence of this floor function, the optimal regret of BwSC exhibits the surprising phase transitions described below.
Definition 2
(Phases and Critical Points) For a k-armed unit-cost BwSC (or a k-armed MAB), we call the interval [(j-1)(k-1)+1,j(k-1)+1) the j-th phase, and call j(k-1)+1 the j-th critical point (j\in\mathbb{Z}_{>0}).
Fact 1
(Phase Transitions) As S increases from 0 to \Theta(\log\log T), S will leave the j-th phase and enter the (j+1)-th phase at the j-th critical point (j\in\mathbb{Z}_{>0}). Each time S arrives at a critical point, R_{S}^{*}(T) will drop significantly, and stay at the same level until S arrives at the next critical point.
Phase transitions are clearly presented in Figure 1. This phenomenon seems counterintuitive, as it suggests that increasing switching budget would not help to decrease the best achievable regret, as long as the budget does not reach the next critical point.
Note that phase transitions are only exhibited when S is in the range of 0 to \Theta(\log\log T). After S exceeds \Theta(\log\log T), R_{S}^{*}(T) will remain unchanged at the level of \tilde{\Theta}(\sqrt{T}): the optimal regret will only vary within logarithmic factors and there is no significant regret drop any more. Therefore, one can also view \Theta(\log\log T) as a “final critical point” that marks the disappearance of phase transitions. This additional “final phase transition” reveals the subtle and intriguing nature of phase transitions in BwSC.
S | [0,k) | [k,2k-1) | [2k-1,3k-2) | [3k-2,4k-3) | [4k-3,5k-4) | [5k-4,6k-5)
R_{S}^{*}(T) | \tilde{\Theta}(T) | \tilde{\Theta}(T^{2/3}) | \tilde{\Theta}(T^{4/7}) | \tilde{\Theta}(T^{8/15}) | \tilde{\Theta}(T^{16/31}) | \tilde{\Theta}(T^{32/63})
R_{S}^{*}(T)/R_{\infty}^{*}(T) | \tilde{\Theta}(T^{1/2}) | \tilde{\Theta}(T^{1/6}) | \tilde{\Theta}(T^{1/14}) | \tilde{\Theta}(T^{1/30}) | \tilde{\Theta}(T^{1/62}) | \tilde{\Theta}(T^{1/126})
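The phase structure in the table can be reproduced numerically. The following is a minimal Python sketch, assuming the exponent formula 1/(2-2^{-m}) with m=\lfloor(S-1)/(k-1)\rfloor stated above. The function names and the convention that every S<1 is treated as part of the first (linear-regret) phase are illustrative choices, not part of the original analysis.

```python
from fractions import Fraction


def regret_exponent(S: int, k: int) -> Fraction:
    """Exponent a such that the optimal regret is of order T^a (up to logs).

    Assumes k >= 2. Budgets S < 1 are mapped to the first phase (exponent 1),
    matching the first column of the table.
    """
    if S < 1:
        return Fraction(1)
    m = (S - 1) // (k - 1)              # number of completed budget cycles
    return 1 / (2 - Fraction(1, 2 ** m))


def gap_exponent(S: int, k: int) -> Fraction:
    """Exponent of R*_S(T) / R*_inf(T), using R*_inf(T) of order sqrt(T)."""
    return regret_exponent(S, k) - Fraction(1, 2)


if __name__ == "__main__":
    k = 3
    for S in range(0, 13):
        print(S, regret_exponent(S, k), gap_exponent(S, k))
```

With exact rational arithmetic the exponents stay constant inside each phase and change only at the critical points j(k-1)+1, which makes the role of the floor function easy to see.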
Cyclic Phenomena. Along with phase transitions, we also observe the following phenomena.
Fact 2
(Cyclic Phenomena) The length of each phase is always equal to k-1, independent of S and T. We call the quantity k-1 the budget cycle, which is the length of each phase.
Cyclic phenomena indicate that, if the learner’s switching budget is at a critical point, then the extra switching budget that the learner needs to achieve the next regret drop (i.e., to arrive at the next critical point) is always k-1. Cyclic phenomena also seem counterintuitive: when the learner has more switching budget, she can conduct more experiments and statistical tests, eliminate more bad actions (which can be thought of as reducing k) and allocate her switching budget in a more flexible way. All of these suggest that the budget cycle should be a quantity decreasing with S. However, the cyclic phenomena tell us that the budget cycle is always a constant and no learning policy in the unit-cost BwSC (and in MAB) can escape this cycle, no matter how large S is, as long as S=o(\log\log T).
On the other hand, as S contains more and more budget cycles, the gap between R_{S}^{*}(T) and R_{\infty}^{*}(T)=\tilde{\Theta}(\sqrt{T}) does decrease dramatically. In fact, R_{S}^{*}(T) decreases doubly exponentially fast as S contains more budget cycles. For example, when S contains more than 2 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{4/7}); and when S contains more than 3 budget cycles, R_{S}^{*}(T)=\tilde{\Theta}(T^{8/15}). From both Figure 1 and Table 1, we can verify that 3 or 4 budget cycles are already enough for an S-switching-budget policy to achieve close-to-optimal regret in MAB (compared with the optimal policy with unlimited switching budget).
To sum up, the above analysis generates both “positive” and “negative” insights for decision-makers that face BwSC-type problems. On the one hand, the unavoidable phase transitions and cyclic phenomena show some fundamental limits imposed by the switching constraint, making it hopeless for decision-makers to reduce regret within each phase. On the other hand, once the decision-makers have enough switching budget to reach a new phase, they can enjoy a substantial regret drop. In particular, 3 or 4 budget cycles are already enough to guarantee extraordinary regret performance.
The lower bound in Theorem 2 also leads to new results for the classical MAB problem.
Corollary 2
(The switching complexity of MAB) For the k-armed bandit problem, N(k-1)+1 switches are necessary and sufficient for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret for any fixed N\in\mathbb{Z}_{>0}, and \Theta(\log\log T) switches are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret.
Note that the number of switches stated in Corollary 2 refers to the maximum number of switches that a policy can make. While Cesa-Bianchi et al. (2013) and Perchet et al. (2016) have proposed policies that achieve \tilde{O}(\sqrt{T}) regret with O(\log\log T) switches, no prior work has answered the question of how many switches are necessary for a near-optimal learning policy in MAB. To the best of our knowledge, we are the first to show an \Omega(\log\log T) lower bound on the number of switches.
Based on our “tracking the cover time” argument, we can prove further results regarding how many reswitches of each arm (including the worst arm in hindsight) are necessary for an effective learning policy.
Definition 3
The number of reswitches of action i\in[k] is the total number of times that the learner switches to i from another action j\neq i. If an action is switched to in round 1, this switch also counts as a reswitch.
Proposition 1

\lceil N/2\rceil reswitches of each action are necessary for achieving \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in the k-armed MAB (N\in\mathbb{Z}_{>0}).

\Theta(\log\log T) reswitches of each action are necessary and sufficient for achieving \tilde{O}(\sqrt{T}) (near-optimal) regret in the k-armed MAB.
Note that if the learner is not allowed to re-choose an action that was chosen earlier and discarded later (i.e., if the number of reswitches of each action is at most 1), then the corresponding bandit problem is exactly the “irrevocable MAB problem” proposed by Farias and Madan (2011). Farias and Madan (2011) and Guha and Munagala (2013) study the price of irrevocability in the Bayesian bandit setting. Using competitive ratio (measured against the optimal online policy that does not know the true environment) as the performance metric, they show that the price of irrevocability is limited. Our results on the necessity of reswitching strongly contradict this idea: in the setting of regret minimization, where we compete with the optimal clairvoyant policy (a much stronger benchmark), our results indicate that an irrevocable policy must incur linear regret, and any effective policy cannot avoid “switching to, revoking, and reswitching to” each action many times.
Specifically, Proposition 1 indicates that, for each learning policy that achieves \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB, we can always find an environment \mathcal{D} such that the worst action in hindsight is “switched to, revoked, and reswitched to” at least \lceil N/2\rceil times with some positive probability. Also, for each learning policy that achieves near-optimal regret in MAB, we can always find an environment \mathcal{D} such that the worst action in hindsight is “switched to, revoked, and reswitched to” at least \Theta(\log\log T) times with some positive probability. Surprisingly, the necessary number of reswitches of the worst action is essentially the same as that of the best action. This reveals a fundamental tradeoff between regret and reswitching in MAB. Put differently, it is inevitable for any good policy to repeatedly choose actions that prove ineffective in hindsight.
We close this section by discussing a surprising relationship between limited switches and limited adaptivity in bandit problems. As discussed in Section id1, the unit-switching-cost BwSC problem limits switches and the batched bandit problem limits observation; on the surface, the two problems seem unrelated. However, the results in our paper establish a non-trivial connection between the two problems. The SSSE policy in Section id1 helps establish a one-sided relationship: by using ingredients (1) and (2) of the SSSE policy, it is easy to see that any limited-batch policy can be transformed into a limited-switch policy, thus any regret upper bound for the batched bandit problem can be adapted into a regret upper bound for the unit-switching-cost BwSC problem.
On the other hand, it is generally impossible to transform an arbitrary limited-switch policy into a limited-batch policy, as a limited-switch policy may use data an unlimited number of times. Thus finding a regret lower bound for BwSC is fundamentally harder than finding a regret lower bound for the batched bandit problem. Surprisingly, our strong lower bound in Section id1 directly closes the gap between the regret upper bound of the batched bandit problem and the regret lower bound of a corresponding unit-switching-cost BwSC problem. Thus, we essentially prove the following fact: for any fixed k, both the M-batched k-armed bandit problem and the S-budget k-armed unit-cost BwSC have \tilde{\Theta}(T^{\frac{1}{2-2^{1-M}}}) optimal regret, as long as S\in[(M-1)(k-1)+1,M(k-1)+1). In essence, our results reveal a surprising “near-equivalence” between limited switches (even with full adaptivity) and limited adaptivity (even with full switching power) in bandit problems.
We now proceed to the general case of BwSC, where c_{i,j} (=c_{j,i}) can be any nonnegative real number and even \infty. The problem is significantly more challenging in this general setting. To better characterize the structure of switching costs, we represent switching costs by a weighted graph. Let G=(V,E) be a complete graph, where V=[k] (i.e., each vertex corresponds to an action), and the edge between i and j is assigned a weight c_{i,j} (\forall i\neq j). We call the weighted graph G the switching graph. In this section, we assume the switching costs satisfy the triangle inequality: \forall i,j,l\in[k], c_{i,j}\leq c_{i,l}+c_{l,j}. We relax this assumption in Appendix E.
In Appendix F, we establish an interesting connection between bandit problems and graph traversal problems. Applying the result to the general BwSC problem, we discover a connection between the general BwSC problem and the celebrated shortest Hamiltonian path problem. Motivated by this connection, we propose the Hamiltonian-Switching Successive Elimination (HSSE) policy, and present its details in Algorithm 2 in Appendix G. The policy enhances the original SSSE policy with one additional ingredient: a pre-specified switching order determined by the shortest Hamiltonian path of the switching graph G. Note that while the shortest Hamiltonian path problem is NP-hard, solving this problem is entirely an “offline” step in the HSSE policy. That is, for a given switching graph, the learner only needs to solve this problem once. We give an upper bound on the regret of the HSSE policy, and a lower bound that is close to the upper bound; see the proofs in Appendices H and I.
Theorem 3
Let H denote the total weight of the shortest Hamiltonian path of G. Let \pi be the HSSE policy; then \pi\in\Pi_{S}. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0, T\geq k,
R^{\pi}(T)\leq C(\log k\log T)\,k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}T^{\frac{1}{2-2^{-m_{G}^{U}(S)}}},
where m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}c_{i,j}}{H}\rfloor.
Theorem 4
Let H be the total weight of the shortest Hamiltonian path of G. There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq k and for all policies \pi\in\Pi_{S},
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases}
where m_{G}^{L}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\rfloor.
When the switching costs satisfy the condition \max_{i,j\in[k]}{c_{i,j}}=\max_{i\in[k]}\min_{j\neq i}c_{i,j}, the two bounds directly match. When this condition is not satisfied, for any switching graph G, the above two bounds still match for a wide range of S. Even when S is not in this range, we still have m_{G}^{U}(S)\leq m_{G}^{L}(S)\leq m_{G}^{U}(S)+1 for any G and any S, which means that the difference between the two indices is at most 1 and the regret bounds are always very close. In fact, it can be shown that as S increases, the gap between the upper and lower bounds decreases doubly exponentially. Therefore, the HSSE policy is quite effective for the general BwSC problem. We leave closing the remaining gap between the upper and lower bounds for future research.
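To make the two phase indices concrete, the following Python sketch computes H by brute force over all vertex orderings and then evaluates m_{G}^{U}(S) and m_{G}^{L}(S) as defined in Theorems 3 and 4. It assumes a small symmetric cost matrix with finite positive off-diagonal entries and k\geq 2; the function names and the example matrices are hypothetical, and brute-force enumeration is only a stand-in for whatever offline solver one would actually use.

```python
from itertools import permutations
from math import floor


def shortest_hamiltonian_path_weight(c):
    """Total weight H of the shortest Hamiltonian path (brute force, small k only)."""
    k = len(c)
    return min(
        sum(c[p[i]][p[i + 1]] for i in range(k - 1))
        for p in permutations(range(k))
    )


def phase_indices(S, c):
    """Return (m_G^U(S), m_G^L(S)) for budget S and symmetric cost matrix c."""
    k = len(c)
    H = shortest_hamiltonian_path_weight(c)
    # Largest switching cost over all pairs of distinct actions.
    c_max = max(c[i][j] for i in range(k) for j in range(k) if i != j)
    # Largest, over actions i, of the cheapest switch out of i.
    c_max_min = max(min(c[i][j] for j in range(k) if j != i) for i in range(k))
    return floor((S - c_max) / H), floor((S - c_max_min) / H)


if __name__ == "__main__":
    unit = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]        # unit-cost case, k = 3
    general = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]     # satisfies the triangle inequality
    print(phase_indices(5, unit))
    print(phase_indices(8, general))
```

In the unit-cost example both indices reduce to \lfloor(S-1)/(k-1)\rfloor, while in the second example they differ by exactly one, which is the largest gap allowed by the inequality m_{G}^{U}(S)\leq m_{G}^{L}(S)\leq m_{G}^{U}(S)+1.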
References
 Agrawal et al. (1988) Agrawal, R., M. Hegde, D. Teneketzis. 1988. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control 33(10) 899–906.
 Agrawal et al. (1990) Agrawal, R., M. Hegde, D. Teneketzis. 1990. Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic Reports 29(4) 437–459.
 Altschuler and Talwar (2018) Altschuler, J., K. Talwar. 2018. Online learning over a finite action set with limited switching. arXiv preprint arXiv:1803.01548 .
 Asawa and Teneketzis (1996) Asawa, M., D. Teneketzis. 1996. Multi-armed bandits with switching penalties. IEEE Transactions on Automatic Control 41(3) 328–348.
 Banks and Sundaram (1994) Banks, Jeffrey S., Rangarajan K. Sundaram. 1994. Switching costs and the Gittins index. Econometrica 62(3) 687–694.
 Bergemann and Välimäki (2001) Bergemann, D, J Välimäki. 2001. Stationary multichoice bandit problems. Journal of Economic dynamics and Control 25(10) 1585–1594.
 Brezzi and Lai (2002) Brezzi, Monica, Tze Leung Lai. 2002. Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27(1) 87–108.
 Cesa-Bianchi et al. (2013) Cesa-Bianchi, N., O. Dekel, O. Shamir. 2013. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems. 1160–1168.
 Chen and Chao (2019) Chen, B., X. Chao. 2019. Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions 1–9.
 Cheung et al. (2017) Cheung, W. C., D. Simchi-Levi, H. Wang. 2017. Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6) 1722–1731.
 Christofides (1976) Christofides, Nicos. 1976. Worst-case analysis of a new heuristic for the travelling salesman problem. Tech. rep., Carnegie-Mellon University Pittsburgh PA Management Sciences Research Group.
 Cormen et al. (2009) Cormen, Thomas H, Charles E Leiserson, Ronald L Rivest, Clifford Stein. 2009. Introduction to algorithms. MIT Press.
 Farias and Madan (2011) Farias, V. F., R. Madan. 2011. The irrevocable multiarmed bandit problem. Operations Research 59(2) 383–399.
 Gao et al. (2019) Gao, Z., Y. Han, Z. Ren, Z. Zhou. 2019. Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763 .
 Guha and Munagala (2013) Guha, S., K. Munagala. 2013. Approximation algorithms for Bayesian multi-armed bandit problems. arXiv preprint arXiv:1306.3525 .
 Jun (2004) Jun, T. 2004. A survey on the bandit problem with switching costs. De Economist 152(4) 513–541.
 Lai and Robbins (1985) Lai, T. L., H. Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1) 4–22.
 Lawler et al. (1985) Lawler, E L, J K Lenstra, A R Kan, D B Shmoys. 1985. The traveling salesman problem: a guided tour of combinatorial optimization, vol. 3. New York: Wiley.
 Perchet et al. (2016) Perchet, V., P. Rigollet, S. Chassang, E. Snowberg. 2016. Batched bandit problems. The Annals of Statistics 44(2) 660–681.
 Slivkins (2019) Slivkins, A. 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272 .
 Wainwright (2019) Wainwright, M. J. 2019. High-dimensional statistics: A non-asymptotic viewpoint, vol. 48. Cambridge University Press.
For simplicity, we only present the results of distribution-dependent regret bounds for the unit-switching-cost BwSC problem. Extensions to the general-switching-cost BwSC problem are analogous to Section 5 of the main article.
To achieve tight distribution-dependent regret bounds, we propose the S-Switch Successive Elimination 2 (SSSE2) policy, which is stated in Algorithm 2. Note that the difference between the SSSE2 policy and the SSSE policy is the partition of intervals.
Input: Number of arms k, Switching budget S, Horizon T
Partition: Calculate m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Divide the entire time horizon 1,\dots,T into m(S)+1 intervals: [t_{0}:t_{1}],(t_{1}:t_{2}],\dots,(t_{m(S)}:t_{m(S)+1}],
where the endpoints are defined by t_{0}=1 and
t_{i}=\lfloor k^{1-\frac{i}{m(S)+1}}T^{\frac{i}{m(S)+1}}\rfloor,\quad\forall i=1,\dots,m(S)+1.
Initialization: The same as the SSSE policy.
Policy:
For any environment \mathcal{D}, let i^{*}=\arg\max_{i\in[k]}\mu_{i} denote the optimal action, and \Delta=\Delta(\mathcal{D})=\min_{i\neq i^{*}}(\mu_{i^{*}}-\mu_{i})>0 denote the gap between the mean rewards of the optimal action and the best suboptimal action. We have the following results.
Theorem 5
Let \pi be the SSSE2 policy. There exists an absolute constant C\geq 0 such that for all \mathcal{D}, for all k\geq 1, S\geq 0 and T\geq k,
R_{\mathcal{D}}^{\pi}(T)\leq C\left(k^{\frac{m(S)}{m(S)+1}}\log k\right)\frac{T^{\frac{1}{m(S)+1}}\log T}{\Delta},
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Theorem 6
There exists an absolute constant C>0 such that for all k\geq 1, S\geq 0, T\geq 1 and for all policies \pi\in\Pi_{S}, if m(S)\leq\log_{2}(T/k), then
\sup\limits_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\geq C\left(k^{-\frac{3}{2}-\frac{1}{m(S)+1}}(m(S)+1)^{-2}\right)T^{\frac{1}{m(S)+1}},
where m(S)=\lfloor\frac{S-1}{k-1}\rfloor.
Note that when m(S)\leq\log_{2}(T/k), the upper and lower bounds match in the minimax sense (up to logarithmic factors), thus the SSSE2 policy can be viewed as near-optimal. When m(S)>\log_{2}(T/k), the upper bound is O(\log T/\Delta), and we can directly use the seminal instance-dependent lower bound of Lai and Robbins (1985) to show the asymptotic optimality of the SSSE2 policy.
We omit the proofs of Theorem 5 and Theorem 6. The proof of Theorem 5 resembles the proof of Theorem 1 in Appendix B, and the proof of Theorem 6 resembles the proof of Theorem 2 in Appendix C. The difference lies mainly in the partition of intervals.
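Since the partition of intervals is the only ingredient that distinguishes SSSE2 from SSSE, it may help to see the SSSE2 grid computed explicitly. The sketch below is a hedged illustration of the endpoint formula t_{i}=\lfloor k^{1-i/(m(S)+1)}T^{i/(m(S)+1)}\rfloor from Algorithm 2; the function name ssse2_grid is an assumption, the code requires S\geq 1 and k\geq 2, and it relies on floating-point powers, so very large T could call for exact integer arithmetic instead.

```python
from math import floor


def ssse2_grid(k: int, S: int, T: int) -> list:
    """Interval endpoints [t_0, t_1, ..., t_{m(S)+1}] of the SSSE2 partition.

    Assumes S >= 1 and k >= 2 so that m(S) = floor((S-1)/(k-1)) >= 0.
    """
    m = (S - 1) // (k - 1)
    grid = [1]                                  # t_0 = 1
    for i in range(1, m + 2):
        frac = i / (m + 1)                      # geometric interpolation weight
        grid.append(floor(k ** (1 - frac) * T ** frac))
    return grid


if __name__ == "__main__":
    # k = 2 arms, budget S = 2 gives m(S) = 1 and hence two intervals.
    print(ssse2_grid(2, 2, 10_000))
    # A larger budget yields a finer, geometrically spaced grid ending at T.
    print(ssse2_grid(2, 4, 10_000))
```

The endpoints interpolate geometrically between k and T, so each additional budget cycle adds one more interval and lowers the exponent 1/(m(S)+1) in Theorem 5.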
Besides results on regret upper and lower bounds, we also establish Corollary 3, which can be viewed as a parallel result for Corollary 2 in Section 4.3.2 of the main article.
Corollary 3
(The switching complexity of MAB: distribution-dependent regret version)
For any k\geq 1, for any environment \mathcal{D}, let \Delta=\min\limits_{i\in[k],i\neq i^{*}}(\mu_{i^{*}}-\mu_{i}) denote the gap between the mean rewards of the optimal action and the best suboptimal action.

N(k-1)+1 switches are necessary and sufficient for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB (N\in\mathbb{Z}_{>0}).

\Omega(\frac{\log T}{\log\log T}) switches are necessary for uniformly achieving \tilde{O}(\log{T}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.
The proof of Corollary 2 is deferred to Appendix D.
From round 1 to round t_{1}, the SSSE policy makes k-1 switches.
For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}:

If the last action in interval l remains active in interval l+1, then it will be the first action in interval l+1, and no switch occurs between round t_{l} and round t_{l}+1. Since the SSSE policy makes at most k-1 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}, the SSSE policy makes at most 0+(k-1)=k-1 switches from round t_{l} to round t_{l+1}.

If the last action in interval l is eliminated before the start of interval l+1, then interval l+1 starts from another active action, and one switch occurs between round t_{l} and round t_{l}+1. The elimination implies that |A_{l+1}|\leq k-1, thus the SSSE policy makes |A_{l+1}|-1\leq(k-1)-1=k-2 switches within interval l+1, i.e., from round t_{l}+1 to round t_{l+1}. Therefore, the SSSE policy makes at most 1+(k-2)=k-1 switches from round t_{l} to round t_{l+1}.
From round t_{m(S)} to round T, since the SSSE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the SSSE policy makes at most 1 switch from round t_{m(S)} to round T.
Summarizing the above arguments, we find that the SSSE policy makes at most m(S)(k-1)+1\leq S switches from round 1 to round T. Thus it is indeed an S-switching-budget policy.
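The counting argument above can be checked mechanically. The following Python sketch adds up the three contributions (first interval, middle intervals, last interval) and verifies over a range of small instances that the total never exceeds S. It assumes S\geq k so that m(S)\geq 1, and the function names are illustrative only.

```python
def phase_index(S: int, k: int) -> int:
    """m(S) = floor((S-1)/(k-1)), assuming k >= 2."""
    return (S - 1) // (k - 1)


def switch_upper_bound(S: int, k: int) -> int:
    """Worst-case number of switches from the counting argument (needs m(S) >= 1)."""
    m = phase_index(S, k)
    first = k - 1                  # rounds 1 .. t_1: visit all k actions once
    middle = (m - 1) * (k - 1)     # at most k-1 switches from t_l to t_{l+1}, l = 1..m-1
    last = 1                       # at most one switch entering the final interval
    return first + middle + last   # equals m * (k - 1) + 1


# Collect any (S, k) pair for which the bound would exceed the budget.
violations = [
    (S, k)
    for k in range(2, 10)
    for S in range(k, 500)
    if switch_upper_bound(S, k) > S
]

if __name__ == "__main__":
    print("violations:", violations)
```

The check passes because m(S)(k-1)\leq S-1 by the definition of the floor function.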
We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of times action i is chosen in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as
r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},\quad\forall i\in[k],t\in[T].
Define the clean event as
\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],\quad|\bar{\mu}_{t}(i)-\mu_{i}|\leq r_{t}(i)\}.
By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.
The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 1 can be expressed as
\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),\quad\forall l\in[m(S)+1],i\in[k],
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),\quad\forall l\in[m(S)+1],i\in[k].
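The confidence radius and the two bounds translate directly into code. The sketch below is a minimal illustration under stated assumptions: the helper names are hypothetical, and the elimination test (drop an action whose upper bound falls strictly below another action's lower bound) is the standard successive-elimination rule, which is assumed here rather than copied from Algorithm 1.

```python
from math import log, sqrt


def confidence_radius(n: int, T: int) -> float:
    """r = sqrt(2 log T / n) for an action sampled n >= 1 times, horizon T."""
    return sqrt(2 * log(T) / n)


def confidence_bounds(mean: float, n: int, T: int) -> tuple:
    """Return (LCB, UCB) around the empirical mean."""
    r = confidence_radius(n, T)
    return mean - r, mean + r


def is_eliminated(mean_i: float, n_i: int, mean_j: float, n_j: int, T: int) -> bool:
    """True if action i's UCB lies strictly below action j's LCB (assumed rule)."""
    _, ucb_i = confidence_bounds(mean_i, n_i, T)
    lcb_j, _ = confidence_bounds(mean_j, n_j, T)
    return ucb_i < lcb_j


if __name__ == "__main__":
    T = 1000
    print(confidence_bounds(0.5, 50, T))
    print(is_eliminated(0.1, 1000, 0.9, 1000, T))   # well separated after many samples
    print(is_eliminated(0.5, 10, 0.6, 10, T))       # too few samples to separate
```

On the clean event every true mean lies inside its interval, so under this rule an optimal action is never eliminated and a surviving suboptimal action must have overlapping intervals with it.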
Let \pi denote the SSSE policy. First, observe that for any environment \mathcal{D},
\begin{aligned}
R_{\mathcal{D}}^{\pi}(T) &=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\bar{\mathcal{E}}) \\
&\leq\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}} \\
&=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1), \qquad (1)
\end{aligned}
so in order to bound R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T), we only need to focus on the clean event.
Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval when i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval in which action i has not yet been eliminated (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the SSSE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of interval \eta_{i}-1 must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,
\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-1}}(i)=4r_{t_{\eta_{i}-1}}(i), \qquad (2)
where the last equality holds because i^{*} and i are chosen an equal number of times in each interval until interval \eta_{i}, which indicates that n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 1, for simplicity, we overlook the rounding issues of \frac{t_{l+1}-t_{l}}{|A_{l}|} for each interval l. Considering the rounding issues will not bring additional difficulty to our analysis, as in the policy we can always design a rounding rule to keep the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)
Since i is never chosen after the \eta_{i}-th interval, we have n_{t_{\eta_{i}}}(i)=n_{T}(i), and therefore r_{t_{\eta_{i}}}(i)=r_{T}(i).
The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the SSSE policy and (2), we can bound this quantity as
\begin{aligned}
R(T;i) &=n_{T}(i)\Delta(i) \\
&\leq 4n_{t_{\eta_{i}}}(i)\sqrt{\frac{2\log T}{n_{t_{\eta_{i}-1}}(i)}} \\
&\leq C_{0}\sqrt{2\log T}\,\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{t_{\eta_{i}-1}/k}} \\
&\leq 4C_{0}\sqrt{2\log T}\,\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|}
\end{aligned}
for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,
\begin{aligned}
\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right] &=\sum_{i\in[k]}R(T;i) \\
&\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\frac{1}{|A_{\eta_{i}}|} \\
&\leq C_{1}\sqrt{\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}\frac{1}{|A_{\eta_{i}}|} \\
&\leq C_{2}\sqrt{\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}\frac{1}{j} \\
&\leq C_{3}(\log k\log T)\,k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}
\end{aligned}
for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus by (1) and R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T) we have
R^{\pi}(T)\leq C(\log k\log T)\,k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}
for some absolute constant C\geq 0. \Box
Given any k\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mathcal{D}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \boldsymbol{\mu}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.
For any environment \mu, let X_{\boldsymbol{\mu}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of each action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\boldsymbol{\mu}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.
For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\boldsymbol{\mu}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\boldsymbol{\mu}}^{1}(\pi_{1}),\dots,X_{\boldsymbol{\mu}}^{t}(\pi_{t}); then \mathbb{F}=(\mathcal{F}_{t})_{t\in[T]} is a filtration.
For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}; see the detailed definitions in Chapter 15 of Wainwright (2019).
For any environment \mu, for any policy \pi\in\Pi_{S}, we make some key definitions as below.
1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\dots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

\tau_{1}=\min\{1\leq t\leq T:\text{all the actions in }[k]\text{ have been chosen in }[1:t]\} if the set is nonempty and \tau_{1}=\infty otherwise.

\tau_{2}=\min\{1\leq t\leq T:\text{all the actions in }[k]\text{ have been chosen in }[\tau_{1}:t]\} if the set is nonempty and \tau_{2}=\infty otherwise.

Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in }[k]\text{ have been chosen in }[\tau_{j-1}:t]\} if the set is nonempty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1.
It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.
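These successive cover times are easy to compute from a realized action sequence. The following Python sketch is an illustration only: the function name cover_times and the 0-based action labels are assumptions, and each window is taken to start at the previous cover time itself, following the closed intervals [\tau_{j-1}:\tau_{j}] used in the definition.

```python
from math import inf


def cover_times(actions, k, count):
    """Successive cover times tau_1, ..., tau_count of a sequence of actions.

    actions[t - 1] in {0, ..., k-1} is the action chosen at round t (rounds are
    1-indexed). A cover time that is never reached is reported as infinity.
    """
    taus = []
    start = 1
    for _ in range(count):
        tau = inf
        if start != inf:
            seen = set()
            # Scan forward from the start of the window until all k actions appear.
            for t in range(start, len(actions) + 1):
                seen.add(actions[t - 1])
                if len(seen) == k:
                    tau = t
                    break
        taus.append(tau)
        start = tau            # the next window is [tau_j : ...], closed on the left
    return taus


if __name__ == "__main__":
    sequence = [0, 0, 1, 2, 2, 0, 1, 1, 2]   # rounds 1..9 with k = 3 actions
    print(cover_times(sequence, 3, 3))
```

Each finite cover time forces at least k-1 switches inside its window, which is the quantity that the lower-bound argument tracks against the budget S.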
2. We then define a series of random variables (which depend on the stopping times).

S(1,\tau_{1}) is the number of switches that occur in [1:\tau_{1}] (note that if there is a switch happening between \tau_{1} and \tau_{1}+1, we do not count its cost in S(1,\tau_{1})).

For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the number of switches that occur in [\tau_{j-1}:\tau_{j}] (note that if there is a switch happening between \tau_{j-1}-1 and \tau_{j-1}, or between \tau_{j} and \tau_{j}+1, we do not count its cost in S(\tau_{j-1},\tau_{j})).

S(\tau_{m(S)},T) is the number of switches that occur in [\tau_{m(S)}:T] (note that if there is a switch happening between \tau_{m(S)}-1 and \tau_{m(S)}, we do not count its cost in S(\tau_{m(S)},T)).
3. Next we define a series of events.

E_{1}=\{\tau_{1}>t_{1}\}.

For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.
Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 1.
4. Finally we define a series of shrinking errors.

\Delta_{1}=1.

For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}.)

\Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)
5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking value in [k] such that {\pi_{\infty}=i} with probability 1/k (i\in[k]).
Lemma 1
For any environment \mu, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches that occur in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely.
Proof of Lemma 1. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1}:\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, we know that S(1,\tau_{1})\geq k-1, S(\tau_{1},\tau_{2})\geq k-1, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq k-1. Thus we have
S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S)(k-1).
Since \pi\in\Pi_{S}, we further know that
S(\tau_{m(S)},T)\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})]\leq S-m(S)(k-1)\leq k-1
happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the number of switches that occur in }[\tau_{m(S)}:T]\text{ is no more than }k-1\} almost surely. \Box
Consider a class of environments \Lambda=\{\boldsymbol{\mu}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i}\leq\frac{\Delta_{m(S)+1}}{2},\forall i\neq 1\}. Pick an arbitrary environment \alpha from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, by the union bound, we have
\sum_{j=1}^{m(S)+1}\mathbb{P}_{\alpha}^{\pi}(E_{j})\geq\mathbb{P}_{\alpha}^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1.
Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1).
Suppose first that j^{*}=1. Since \mathbb{P}_{\alpha}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i),
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}.
Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have
\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] only depends on policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.  (3) 
We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for all i\in[k]. Letting \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi and random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]).  (4) 
But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (3) and (4) we have
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}. 
However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]\geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1/(2-2^{-m(S)})}. 
Since \mathbb{P}_{\alpha}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i), 
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{j^{*}})}{k}\geq\frac{1}{k(m(S)+1)}. 
Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have
\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}. 
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] only depends on policy \pi, random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}, and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi, random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}, and random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}.  (5) 
We now consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]. Letting \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by policy \pi, random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}, and random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]).  (6) 
We now bound the difference between the left-hand side (LHS) in (5) and the LHS in (6). We have
\displaystyle\left|\text{LHS in }(5)-\text{LHS in }(6)\right|  
\displaystyle\leq  \displaystyle D_{\mathrm{TV}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)  
\displaystyle\leq  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)}  
\displaystyle\leq  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}~\big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\right)}  
\displaystyle=  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}~\big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}  
\displaystyle=  \displaystyle\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}\right)^{2}}{2}\right]}  
\displaystyle\leq  \displaystyle\frac{\sqrt{t_{j^{*}-1}}\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)}, 
where the first inequality is by the definition of total variation distance of two probability measures, the second inequality is by Pinsker’s inequality in information theory, and the third inequality is by the data-processing inequality in information theory.
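As a worked instance of the two equalities in this chain, assume (this is an assumption about the reward model, suggested by the per-sample factor \Delta^{2}/2 but not stated in this passage) that rewards are unit-variance Gaussians whose means for action i^{\prime} differ by \Delta=\Delta_{j^{*}} between \alpha and \beta. Writing n=t_{j^{*}-1}-1 for the number of samples of action i^{\prime} involved, the KL divergence adds up over independent samples:

```latex
D_{\mathrm{KL}}\left(\mathcal{N}(\mu,1)^{\otimes n}\,\big\|\,\mathcal{N}(\mu+\Delta,1)^{\otimes n}\right)
  = n\cdot\frac{\Delta^{2}}{2},
\qquad\text{so}\qquad
\sqrt{\frac{1}{2}\cdot n\,\frac{\Delta^{2}}{2}}
  = \frac{\sqrt{n}\,\Delta}{2}
  \leq \frac{\sqrt{t_{j^{*}-1}}\,\Delta_{j^{*}}}{2}.
```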
Combining the above inequality with (5) and (6), we have
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}. 
However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} will incur at least a \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] indicates that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have
\displaystyle R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{*}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]  
\displaystyle\geq  \displaystyle\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)}}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}\left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{-m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-j^{*}}}{2-2^{-m(S)}}}\right)  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-m(S)}}{2-2^{-m(S)}}}\right)  
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right). 
When m(S)\leq\log_{2}\log_{2}(T/k), we have
(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T/k}(2)}}=\frac{1}{2}. 
Thus we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
when m(S)\leq\log_{2}\log_{2}(T/k).
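The threshold m(S)\leq\log_{2}\log_{2}(T/k) can be checked numerically. The following is a minimal sketch; the function names are hypothetical and chosen only for illustration. It evaluates the residual term (T/k)^{-2^{-m}} and confirms that it stays at or below 1/2 exactly up to the threshold.

```python
import math


def residual_term(T: float, k: int, m: int) -> float:
    """Return (T/k)^(-2^(-m)), the term subtracted from 1 in the bound."""
    return (T / k) ** (-(2.0 ** (-m)))


def max_phase_index(T: float, k: int) -> int:
    """Largest integer m with m <= log2(log2(T/k)); assumes T/k >= 2."""
    return math.floor(math.log2(math.log2(T / k)))


if __name__ == "__main__":
    T, k = 2 ** 18, 4  # T/k = 2^16, so log2(log2(T/k)) = 4
    for m in range(0, 7):
        print(m, residual_term(T, k, m))
```

For T/k=2^{16} the residual equals 1/2 at m=4 and exceeds 1/2 from m=5 onward, which is where the \sqrt{kT} branch of the bound takes over.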
Since \mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i), 
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime})\geq\frac{\mathbb{P}_{\alpha}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(S)+1)}. 
Thus either
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)},  (7) 
or
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}.  (8) 
If (7) holds true, then we consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and all other components are the same as \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (7) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those in Case 2, we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)} 
and
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k).
Now we consider the case that (8) holds true. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the number of switches occurring in [\tau_{m(S)}:T] is no more than k-1. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the number of switches occurring in [\tau_{m(S)}:\tau_{m(S)+1}] is at least k-1. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T].
Suppose that i^{\prime}\neq 1. Then the event \mathcal{E}_{i^{\prime}}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\} implies that action 1 is not chosen in [\tau_{m(S)+1}:T]. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 will incur at least a \Delta_{m(S)+1}/4 term in regret. As a result, we know that
R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{4}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k).
Thus we only need to consider the subcase of i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} only depends on policy \pi, random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]}, and random vectors \{X_{\alpha}^{t}(i)\}_{[1:(t_{m(S)}+T)/2]} for i\neq 1. Consider a new environment \beta such that its first component is \alpha_{1}-\Delta_{m(S)+1} and all other components are the same as \alpha. Using arguments analogous to those in Case 2, we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}. 
However, action 1 is the worst action in environment \beta, and each choice of action 1 incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 1, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that no switch occurs in [\tau_{m(S)+1}:T]. Thus the event \mathcal{E}_{1} actually implies that action 1 is chosen in every round from round \tau_{m(S)+1} (\leq\frac{t_{m(S)}+T}{2}) to round T, i.e., action 1 is chosen continuously in the last (T-\frac{t_{m(S)}+T}{2}+1) rounds. As a result, we know that
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k).
Combining Cases 1, 2 and 3, we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}} 
for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with unlimited switching budget) is \Omega(\sqrt{kT}), we know that
R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT} 
for some absolute constant C>0. To sum up, we have
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases} 
for some absolute constant C>0, where m(S)=\lfloor\frac{S-1}{k-1}\rfloor. \hfill\Box
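To make the phase structure of this bound concrete, the sketch below computes m(S)=\lfloor(S-1)/(k-1)\rfloor and the resulting exponent of T in the lower bound. The function names are hypothetical, and constants and the k-dependent factor are deliberately ignored; only the order in T is illustrated.

```python
import math


def phase_index(S: int, k: int) -> int:
    """m(S) = floor((S - 1) / (k - 1)); this sketch assumes k >= 2 and S >= 1."""
    if k < 2 or S < 1:
        raise ValueError("sketch assumes k >= 2 and S >= 1")
    return (S - 1) // (k - 1)


def lower_bound_exponent(S: int, k: int, T: float) -> float:
    """Exponent of T in the lower bound: 1/(2 - 2^(-m(S))) or 1/2."""
    if T <= k:
        raise ValueError("sketch assumes T > k")
    m = phase_index(S, k)
    if m <= math.log2(math.log2(T / k)):
        return 1.0 / (2.0 - 2.0 ** (-m))
    return 0.5


if __name__ == "__main__":
    k, T = 3, 3 * 2 ** 16
    for S in range(1, 12):
        print(S, phase_index(S, k), lower_bound_exponent(S, k, T))
```

With k=3 the exponent stays constant for S=1,2, drops at S=3, stays constant for S=3,4, drops again at S=5, and so on: the budget only matters each time it crosses a multiple of k-1.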
We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 1, the SSSE policy guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully designed policy (e.g., the SSSE policy) to achieve \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB. On the other hand, suppose that there exists a policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB with S<N(k-1)+1 switches almost surely. Since m(S)\leq N-1, by Theorem 2, its regret in BwSC is \Omega(T^{\frac{1}{2-2^{-N+1}}}), whose order in T is strictly higher than \tilde{O}(T^{\frac{1}{2-2^{-N}}}) (as N is a fixed integer independent of T), which is a contradiction. Thus for any policy that guarantees \tilde{O}(T^{\frac{1}{2-2^{-N}}}) regret in MAB, there must exist an environment such that the policy makes at least N(k-1)+1 switches with some positive probability.
We only prove the first part here, as the proof of the second part is analogous. Since m(N(k-1)+1)=\lfloor(N(k-1)+1-1)/(k-1)\rfloor=N, by Theorem 5, the SSSE2 policy guarantees \tilde{O}(T^{\frac{1}{N+1}}) distribution-dependent regret in BwSC. Thus N(k-1)+1 switches are sufficient for a carefully designed policy (e.g., the SSSE2 policy) to achieve \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret in MAB. On the other hand, given any fixed k\geq 1, for any fixed N\geq 1, suppose that there exists a policy \pi that uniformly achieves \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} with S<N(k-1)+1 switches almost surely. Then there exists a constant C_{k,N}\geq 0 (which may depend on k,N) such that for all \mathcal{D} and for all T\geq 1,
R_{\mathcal{D}}^{\pi}(T)\leq C_{k,N}\,\mathrm{polylog}(T)\frac{T^{\frac{1}{N+1}}}{\Delta}, 
which means that for all T\geq 1,
\sup_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\leq C_{k,N}\,\mathrm{polylog}(T)T^{\frac{1}{N+1}}. 
However, since m(S)<N, by Theorem 6 we know that there exists an absolute constant C>0 such that for all T\geq 1,
\sup\limits_{\Delta\in(0,1]}\Delta R_{\mathcal{D}}^{\pi}(T)\geq C\left(k^{-\frac{3}{2}-\frac{1}{N}}(m(S)+1)^{-2}\right)T^{\frac{1}{N}}>C\left(k^{-\frac{3}{2}-\frac{1}{N}}(N+1)^{-2}\right)T^{\frac{1}{N}}. 
Letting T be large enough yields a contradiction. As a result, N(k-1)+1 switches are necessary for uniformly achieving \tilde{O}(T^{\frac{1}{N+1}}/\Delta) distribution-dependent regret for all \mathcal{D} in the k-armed MAB.
Consider an arbitrary switching graph G with k=|G|\geq 1. In the following, we show that, even without the triangle inequality assumption, modified versions of the results in Section 5 still hold.
Assume that the switching costs associated with G do not satisfy the triangle inequality. We then run the Floyd-Warshall algorithm (see Cormen et al. (2009)) on G to efficiently find the shortest paths between all pairs of vertices. For any i,j\in[k] such that i\neq j, let p_{i,j}=i\rightarrow\dots\rightarrow j denote the shortest path between i and j, and let c_{i,j}^{\prime} denote the total weight of the shortest path between i and j. We construct a new switching graph G^{\prime}=(V,E^{\prime}): the vertices in G^{\prime} are the same as in G, while the edge between i and j in G^{\prime} is assigned the weight c_{i,j}^{\prime}, which is the total weight of the shortest path between i and j in G. By construction, G^{\prime} is a switching graph whose switching costs satisfy the triangle inequality. Therefore, for BwSC problems defined with G^{\prime}, we can apply the HSSE policy (see Algorithm 3 in Appendix G), and the regret upper and lower bounds in Theorem 3 and Theorem 4 in Section 5 hold.
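The construction of G^{\prime} can be sketched as follows. This is a minimal illustration, assuming the switching costs are given as a k-by-k matrix with zeros on the diagonal; the names `metric_closure` and `shortest_path` are hypothetical. The standard Floyd-Warshall recursion yields the closure weights c_{i,j}^{\prime}, and a next-hop table recovers each path p_{i,j}.

```python
def metric_closure(cost):
    """Floyd-Warshall: return (dist, nxt), where dist[i][j] is the shortest-path
    weight from i to j and nxt[i][j] is the first vertex after i on that path."""
    k = len(cost)
    dist = [row[:] for row in cost]
    nxt = [[j for j in range(k)] for _ in range(k)]
    for mid in range(k):
        for i in range(k):
            for j in range(k):
                through = dist[i][mid] + dist[mid][j]
                if through < dist[i][j]:
                    dist[i][j] = through
                    nxt[i][j] = nxt[i][mid]
    return dist, nxt


def shortest_path(nxt, i, j):
    """Reconstruct the vertex sequence of the shortest path from i to j."""
    path = [i]
    while i != j:
        i = nxt[i][j]
        path.append(i)
    return path


if __name__ == "__main__":
    # Direct switch 0 -> 2 costs 5, but going through 1 costs only 2.
    G = [[0, 1, 5],
         [1, 0, 1],
         [5, 1, 0]]
    dist, nxt = metric_closure(G)
    print(dist)
    print(shortest_path(nxt, 0, 2))
```

The closure weights satisfy the triangle inequality by construction, which is what allows the metric results to be applied to G^{\prime}.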
In this part we assume that k=o(\sqrt{T}). This assumption is reasonable when k is a known fixed integer.
For any BwSC problem defined with switching graph G (whose switching costs do not satisfy the triangle inequality) and switching budget S, we construct a new switching graph G^{\prime} according to Appendix E.1, and construct a new BwSC problem defined with switching graph G^{\prime} and switching budget S. Let \pi^{\prime} denote the HSSE policy running on the new BwSC problem. By construction, \pi^{\prime} is an S-switching-budget policy for the new problem. We construct \pi by modifying \pi^{\prime}, aiming to obtain an S-switching-budget policy for the original BwSC problem. Let \pi switch (on G) following \pi^{\prime} (on G^{\prime}): every time \pi^{\prime} switches from i to j on G^{\prime}, let \pi switch according to the path p_{i,j}=i\rightarrow\dots\rightarrow j on G, visiting each vertex in p_{i,j} once (since in the HSSE policy every active action is chosen for at least \Omega(T^{1/2}) consecutive rounds in each interval, while p_{i,j} contains at most k=o(\sqrt{T}) vertices, we know that \pi is a valid policy). Since the total weight of p_{i,j} is c^{\prime}_{i,j} and \pi^{\prime} is an S-switching-budget policy for G^{\prime}, we know that \pi is an S-switching-budget policy for G.
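The modification of \pi^{\prime} into \pi can be illustrated on a fixed action sequence. The sketch below is only one possible reading of the construction: it assumes (this is not specified in the text) that each intermediate vertex of p_{i,j} is played for exactly one round, taken from the start of the block in which \pi^{\prime} plays j, so that the horizon is unchanged. The names `expand_switches` and `switching_cost`, and the `paths` dictionary, are hypothetical.

```python
def expand_switches(actions, paths):
    """Turn the action sequence of pi' (on G') into a sequence for pi (on G).

    paths[(i, j)] is the vertex list of the shortest path i -> ... -> j in G.
    Intermediate vertices overwrite the first rounds of the block of j.
    """
    out = list(actions)
    for t in range(1, len(actions)):
        i, j = actions[t - 1], actions[t]
        if i == j:
            continue
        inner = paths[(i, j)][1:-1]  # vertices strictly between i and j
        block = actions[t:t + len(inner) + 1]
        if len(block) < len(inner) + 1 or any(a != j for a in block):
            raise ValueError("block of j is too short to absorb the detour")
        out[t:t + len(inner)] = inner
    return out


def switching_cost(actions, cost):
    """Total cost of all switches along an action sequence."""
    return sum(cost[a][b] for a, b in zip(actions, actions[1:]) if a != b)


if __name__ == "__main__":
    G = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]        # original costs
    G_prime = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # shortest-path costs
    pi_prime = [0, 0, 0, 2, 2, 2, 2]
    pi = expand_switches(pi_prime, {(0, 2): [0, 1, 2]})
    print(pi, switching_cost(pi, G), switching_cost(pi_prime, G_prime))
```

In this example the cost paid by \pi on G equals the cost paid by \pi^{\prime} on G^{\prime}, which is the property used to conclude that \pi respects the budget S.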
As mentioned before, Theorem 3 and Theorem 4 in Section 5 hold for the new BwSC problem (defined with G^{\prime}) in Appendix E.2. Based on these two theorems, we give upper and lower bounds on regret for the original BwSC problem (defined with G). The upper and lower bounds are very close to each other (in fact, when k=O(T^{1/4}), the bounds are essentially the same as the bounds in Section 5).
Theorem 7
Let G be a switching graph and G^{\prime} be the corresponding new graph defined in Appendix E.1. Let H denote the total weight of the shortest Hamiltonian path of G^{\prime}. Let \pi be the modified HSSE policy in Appendix E.2; then \pi is an S-switching-budget policy for G. There exists an absolute constant C\geq 0 such that for all k\geq 1, S\geq 0, T\geq k^{2},
R^{\pi}(T)\leq C(\log k\log T)k^{1-\frac{1}{2-2^{-m_{G}^{U}(S)}}}T^{\frac{1}{2-2^{-m_{G}^{U}(S)}}}+Ck^{2}\log\log T, 
where m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}{c_{i,j}^{\prime}}}{H}\rfloor.
Theorem 8
Let H be the total weight of the shortest Hamiltonian path of G^{\prime}. There exists an absolute constant C>0 such that for all k\geq 1,S\geq 0,T\geq k and for all policies \pi\in\Pi_{S},
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m_{G}^{L}(S)}}}(m_{G}^{L}(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m_{G}^{L}(S)}}},&\text{if }m_{G}^{L}(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m_{G}^{L}(S)>\log_{2}\log_{2}(T/k),\end{cases} 
where m_{G}^{L}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}^{\prime}}{H}\rfloor.
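The two phase indices m_{G}^{U}(S) and m_{G}^{L}(S) differ only in the cost subtracted from S. The sketch below computes both from a shortest-path cost matrix; it finds H by brute force over all vertex orderings, which is an illustrative choice suitable only for small k, and the function names are hypothetical.

```python
import math
from itertools import permutations


def hamiltonian_path_weight(dist):
    """Weight H of the shortest Hamiltonian path, by brute force (small k only)."""
    k = len(dist)
    return min(
        sum(dist[p[t]][p[t + 1]] for t in range(k - 1))
        for p in permutations(range(k))
    )


def phase_indices(dist, S):
    """Return (m_G^U(S), m_G^L(S)); assumes k >= 2 and positive off-diagonal costs."""
    k = len(dist)
    H = hamiltonian_path_weight(dist)
    c_max = max(dist[i][j] for i in range(k) for j in range(k) if i != j)
    c_near = max(min(dist[i][j] for j in range(k) if j != i) for i in range(k))
    return math.floor((S - c_max) / H), math.floor((S - c_near) / H)


if __name__ == "__main__":
    G_prime = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
    print(phase_indices(G_prime, 7))
```

Since \max_{i}\min_{j\neq i}c_{i,j}^{\prime}\leq\max_{i,j}c_{i,j}^{\prime}, the index in the upper bound never exceeds the index in the lower bound, and the two coincide whenever S falls away from the boundary between consecutive multiples of H.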
Note that the only difference between the upper bound in Theorem 7 and the upper bound in Theorem 3 is an O(k^{2}\log\log T) term, which can be neglected as long as k is much smaller than T, e.g., k=O(T^{1/4}). To see why Theorem 7 holds, just note that (1) when k is much smaller than T, the modification of the HSSE policy does not affect the learning rate of the HSSE policy, and (2) since there are m_{G}^{U}(S)+1=O(\log\log T) intervals in \pi, and in each interval the behavior of \pi (running on G) is different from the behavior of \pi^{\prime} (running on G^{\prime}) for at most O(k^{2}) rounds, the additional regret loss compared to Theorem 3 is at most O(k^{2}\log\log T). Theorem 8 is essentially the same as Theorem 4 — in fact, a lower bound proved for a BwSC problem with the triangle inequality assumption is a natural lower bound for a corresponding BwSC problem without the triangle inequality assumption.
Intuitively, an effective policy in BwSC should identify what type of switching behavior is necessary and sufficient for achieving low regret in MAB, and switch in the most efficient way. Thus, before studying the general BwSC, we first revisit the classical MAB to further understand the relationship between switching and regret. Earlier in Section 4 of the main article, we establish the tradeoff between the number of switches and regret in MAB. Unfortunately, this does not provide enough insights for the general BwSC, and hence we need to connect the combinatorics of switching patterns with regret in MAB. In this subsection, we prove the following result: there are some inherent switching patterns that are associated with any effective learning policy in MAB.
Definition 4
Consider a k-armed bandit problem. For any learning policy \pi, any environment \mathcal{D} and any T\geq 1, the stochastic process \{\pi_{t}\}_{t\in[T]}=\pi_{1},\dots,\pi_{T} constitutes a random walk (with a random starting point) on [k]. We call \{\pi_{t}\}_{t\in[T]} the bandit random walk generated by (\pi,\mathcal{D},T).
Definition 5
A bandit random walk on an action set [k] finishes a cover in period [T_{1}:T_{2}] if all actions in [k] were chosen between round T_{1} and round T_{2}; here T_{1} is called the starting round of this cover, and T_{2} is called the ending round of this cover.
Definition 6
A bandit random walk on an action set [k] finishes N\geq 0 asynchronous covers in period [T_{1}:T_{2}] if it finishes N covers in period [T_{1}:T_{2}], and the ending round of the j-th cover is no larger than the starting round of the (j+1)-th cover, for all j=1,\dots,N-1.
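Definitions 5 and 6 can be made concrete on a realized action sequence. The sketch below counts asynchronous covers greedily: it closes a cover at the earliest round by which all k actions have appeared, and, because Definition 6 allows the ending round of one cover to equal the starting round of the next, it lets the next cover start at that same round. The function name is hypothetical, and the sketch assumes k\geq 2 with actions labeled 0,\dots,k-1.

```python
def count_asynchronous_covers(actions, k):
    """Greedy count of asynchronous covers in a realized action sequence."""
    covers = 0
    seen = set()
    for a in actions:
        seen.add(a)
        if len(seen) == k:
            covers += 1
            seen = {a}  # the next cover may start at this same round
    return covers


if __name__ == "__main__":
    print(count_asynchronous_covers([0, 1, 2, 1, 0, 0, 2], 3))
    print(count_asynchronous_covers([0, 1, 0, 1], 3))
```

Closing each cover as early as possible leaves the most rounds for later covers, so this greedy count is the largest N for which the sequence finishes N asynchronous covers.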
By using the “tracking the cover time” argument, we establish the following result.
Theorem 9
Consider a k-armed bandit problem. For any fixed N\geq 0, for any policy \pi that achieves \tilde{O}(T^{\frac{1}{2-2^{-N}}-\epsilon}) regret for some \epsilon>0, there exist an environment \mathcal{D} and T\geq 1 such that the bandit random walk generated by (\pi,\mathcal{D},T) must “finish N+1 asynchronous covers and then switch to the optimal action in period [1:T]” with probability at least \max\{N/(N+1),1/2\}. (If the bandit random walk happens to be at the optimal action when it just finishes N+1 asynchronous covers, then the event directly occurs.)
Theorem 9 holds true for any MAB problem, and reveals some fundamental switching patterns in MAB that any effective learning policy has to exhibit under certain environments and horizons. Intuitively, the patterns can be summarized as “finishing multiple covers then switching to the optimal action”. For example, if a policy \pi achieves sublinear regret in MAB, then there must be some environment \mathcal{D} and T such that the policy first chooses all actions (i.e., \pi_{1},\dots,\pi_{T} finishes a cover) and then switches to the optimal action with certain probability (even if the policy does not know the optimal action). Also, if a policy \pi achieves near-optimal regret in MAB, then there must be some environment \mathcal{D} and T such that \pi_{1},\dots,\pi_{T} first finishes \Omega(\log\log T) asynchronous covers and then switches to the optimal action with certain probability.
Theorem 9 indicates that the switching ability of “finishing multiple covers then switching to the optimal action” is necessary for any effective learning policy in MAB. It thus reveals a deep connection between bandit problems and graph traversal problems, since in graph traversal problems there are also requirements for “cover”, i.e., visiting all vertices. Motivated by this connection, in Section 5 of the main article, we design an intuitive policy for the general BwSC problem by leveraging ideas from the shortest Hamiltonian path problem, and give upper and lower bounds on regret that are close to each other.
The proof of Theorem 9 is based on the “tracking the cover time” argument: we first suppose that the switching patterns do not occur with certain probability, then use the “tracking the cover time” argument to establish an \tilde{\Omega}(T^{\frac{1}{2-2^{-N}}}) lower bound on the regret of \pi, which contradicts the condition in Theorem 9. We omit the detailed proof here, as the essential idea of the proof is similar to the proof of Theorem 2 in Appendix C and the proof of Theorem 4 in Appendix I.
See the description of the HSSE policy in Algorithm 3. The algorithm addresses the challenges brought up by heterogeneous switching costs by leveraging the ideas from the celebrated shortest Hamiltonian path problem.
The HSSE policy is highly practical: for any given switching graph G, the policy only involves solving the shortest Hamiltonian path problem once, which can be done offline. Thus, the computational complexity of the shortest Hamiltonian path problem does not affect the online decision-making process of the HSSE policy at all.
Moreover, under the condition that the switching costs satisfy the triangle inequality, the shortest Hamiltonian path problem can be reduced to the celebrated metric traveling salesman problem (metric TSP); see Lawler et al. (1985). This means that we can directly apply many commercial solvers for TSP to solve (or approximately solve) the shortest Hamiltonian path problem efficiently. The reduction also indicates that any approximation algorithm designed for metric TSP can be adapted to be an approximation algorithm for the shortest Hamiltonian path problem. In particular, the celebrated Christofides algorithm for the metric TSP (Christofides 1976) can be used to compute a good approximation of H in polynomial time.
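For small k, the offline step can also be solved exactly. The sketch below uses a Held-Karp style dynamic program over vertex subsets; this is an illustrative alternative to the TSP solvers and the Christofides algorithm mentioned above, not the method prescribed by the paper, and the function name is hypothetical. Its running time is O(2^{k}k^{2}), so it is only practical for modest k.

```python
def shortest_hamiltonian_path(cost):
    """Return (H, path): the weight and vertex order of a shortest Hamiltonian
    path with free endpoints, via dynamic programming over vertex subsets."""
    k = len(cost)
    inf = float("inf")
    # best[mask][j]: cheapest path visiting exactly the vertices in mask, ending at j
    best = [[inf] * k for _ in range(1 << k)]
    parent = [[-1] * k for _ in range(1 << k)]
    for j in range(k):
        best[1 << j][j] = 0
    for mask in range(1 << k):
        for j in range(k):
            if best[mask][j] == inf:
                continue
            for nxt in range(k):
                if mask & (1 << nxt):
                    continue
                new_mask = mask | (1 << nxt)
                cand = best[mask][j] + cost[j][nxt]
                if cand < best[new_mask][nxt]:
                    best[new_mask][nxt] = cand
                    parent[new_mask][nxt] = j
    full = (1 << k) - 1
    end = min(range(k), key=lambda j: best[full][j])
    H = best[full][end]
    # Walk the parent pointers back from the last vertex.
    path, mask, j = [], full, end
    while j != -1:
        path.append(j)
        prev = parent[mask][j]
        mask ^= 1 << j
        j = prev
    return H, path[::-1]


if __name__ == "__main__":
    costs = [[0, 1, 9, 9], [1, 0, 1, 9], [9, 1, 0, 1], [9, 9, 1, 0]]
    print(shortest_hamiltonian_path(costs))
```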
Input: Switching Graph G, Switching budget S, Horizon T
Offline Step: Find the shortest Hamiltonian path in G: {i_{1}}\rightarrow\dots\rightarrow{i_{k}}. Denote the total weight of the shortest Hamiltonian path as H. Calculate m_{G}^{U}(S)=\lfloor\frac{S-\max_{i,j\in[k]}c_{i,j}}{H}\rfloor.
Partition: Run the partition step in the SSSE policy with m(S)=m_{G}^{U}(S).
Initialization: Let the set of all active actions in the lth interval be A_{l}. Set A_{1}=[k], a_{0}={i_{1}}.
Policy:
\texttt{UCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]+\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}, 
\texttt{LCB}_{t_{l}}(i)=\text{empirical mean of action }i\text{ in }[1:t_{l}]-\sqrt{\frac{2\log T}{\text{number of plays of action }i\text{ in }[1:t_{l}]}}. 
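The two confidence bounds can be computed as in the sketch below. The elimination rule shown (keep an action only if its UCB is at least the largest LCB among active actions) is the usual successive-elimination rule and is an assumption here, since the policy steps of Algorithm 3 are not reproduced in this passage; the function names and the `stats` layout are likewise hypothetical.

```python
import math


def confidence_bounds(total_reward: float, plays: int, T: int):
    """Return (LCB, UCB) = empirical mean -/+ sqrt(2 log T / plays)."""
    mean = total_reward / plays
    radius = math.sqrt(2.0 * math.log(T) / plays)
    return mean - radius, mean + radius


def surviving_actions(stats, T: int):
    """stats maps action -> (total_reward, plays) for the active actions.

    Assumed rule: action i stays active iff UCB(i) >= max_j LCB(j).
    """
    bounds = {a: confidence_bounds(r, n, T) for a, (r, n) in stats.items()}
    best_lcb = max(lcb for lcb, _ in bounds.values())
    return sorted(a for a, (_, ucb) in bounds.items() if ucb >= best_lcb)


if __name__ == "__main__":
    stats = {0: (90.0, 100), 1: (10.0, 100), 2: (60.0, 100)}
    print(surviving_actions(stats, T=1000))
```

In this example action 1 is eliminated because its upper bound falls below the lower bound of action 0, while action 2 remains active because its confidence interval still overlaps that of action 0.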
Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.
From round 1 to round t_{1}, the HSSE policy incurs H switching cost.
For 1\leq l\leq m(S)-1, from round t_{l} to round t_{l+1}, no matter whether l is odd or even, and no matter whether the last action in interval l is eliminated before the start of interval l+1 or not, by the switching order (determined by the shortest Hamiltonian path of G) and the triangle inequality, the HSSE policy always incurs at most H switching cost.
From round t_{m(S)} to round T, since the HSSE policy does not switch within interval m(S)+1, i.e., from round t_{m(S)}+1 to round T, the only possible switch is between round t_{m(S)} and t_{m(S)}+1. Thus the HSSE policy incurs at most \max_{i,j\in[k]}c_{i,j} switching cost from round t_{m(S)} to round T.
Summarizing the above arguments, we find that the HSSE policy incurs at most m(S)H+\max_{i,j\in[k]}c_{i,j}\leq S switching cost from round 1 to round T. Thus it is indeed an S-switching-budget policy.
We start the proof of the upper bound on regret with some definitions. Let n_{t}(i) be the number of chosen samples of action i in period [1:t], and \bar{\mu}_{t}(i) be the average collected reward from action i in period [1:t] (i\in[k],t\in[T]). Define the confidence radius as
r_{t}(i)=\sqrt{\frac{2\log T}{n_{t}(i)}},~{}~{}\forall i\in[k],t\in[T]. 
Define the clean event as
\mathcal{E}:=\{\forall i\in[k],\forall t\in[T],~~|\bar{\mu}_{t}(i)-\mu_{i}|\leq r_{t}(i)\}. 
By Lemma 1.5 in Slivkins (2019), since T\geq k, for any policy \pi and any environment \mathcal{D}, we always have \mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})\geq 1-\frac{2}{T^{2}}. Define the bad event \bar{\mathcal{E}} as the complement of the clean event.
The \texttt{UCB}_{t_{l}}(i) and \texttt{LCB}_{t_{l}}(i) confidence bounds defined in Algorithm 3 can be expressed as
\texttt{UCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)+r_{t_{l}}(i),~~\forall l\in[m(S)+1],i\in[k], 
\texttt{LCB}_{t_{l}}(i)=\bar{\mu}_{t_{l}}(i)-r_{t_{l}}(i),~~\forall l\in[m(S)+1],i\in[k]. 
Let \pi denote the HSSE policy. First, observe that for any environment \mathcal{D},
\displaystyle R_{\mathcal{D}}^{\pi}(T)  \displaystyle=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\mathcal{E})+\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\bar{\mathcal{E}}\right]\mathbb{P}_{\mathcal{D}}^{\pi}(\bar{\mathcal{E}})  
\displaystyle\leq\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+T\cdot\frac{2}{T^{2}}  
\displaystyle=\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]+o(1),  (9) 
so in order to bound R^{\pi}(T)=\sup_{\mbox{$\mathcal{D}$}}R_{\mbox{$\mathcal{D}$}}^{\pi}(T), we only need to focus on the clean event.
Consider an arbitrary environment \mathcal{D} and assume the occurrence of the clean event. Let i^{*} be an optimal action, and consider any action i such that \mu_{i}<\mu_{i^{*}}. Let \eta_{i} denote the index of the last interval when i\in A_{\eta_{i}}, i.e., the \eta_{i}-th interval is the last interval when we did not eliminate action i yet (in particular, \eta_{i}=m(S)+1 if and only if i is the only action chosen in the last interval). By the HSSE policy, if \eta_{i}\geq 2, then the confidence intervals of the two actions i^{*} and i at the end of interval \eta_{i}-1 must overlap, i.e., \texttt{UCB}_{t_{\eta_{i}-1}}(i)\geq\texttt{LCB}_{t_{\eta_{i}-1}}(i^{*}). Therefore,
\Delta(i):=\mu_{i^{*}}-\mu_{i}\leq 2r_{t_{\eta_{i}-1}}(i^{*})+2r_{t_{\eta_{i}-1}}(i)=4r_{t_{\eta_{i}-1}}(i),  (10) 
where the last equality holds because i^{*} and i are chosen equally many times in each interval until interval \eta_{i}, which indicates that n_{t_{\eta_{i}-1}}(i^{*})=n_{t_{\eta_{i}-1}}(i). (Note that in Algorithm 3, for simplicity, we overlook the rounding issues of \frac{t_{l+1}-t_{l}}{|A_{l}|} for each interval l. Considering the rounding issues will not bring additional difficulty to our analysis, as in the policy we can always design a rounding rule to keep the difference between n_{t_{\eta_{i}-1}}(i^{*}) and n_{t_{\eta_{i}-1}}(i) within 1.)
Since i is never chosen after the \eta_{i}th interval, we have n_{\eta_{i}}(i)=n_{T}(i), and therefore r_{\eta_{i}}(i)=r_{T}(i).
The contribution of action i to regret in the entire horizon [1:T], denoted R(T;i), can be expressed as the sum of \Delta(i) for each round that this action is chosen. By the HSSE policy and (10), we can bound this quantity as
\displaystyle R(T;i)  \displaystyle=n_{T}(i)\Delta(i)
\displaystyle\leq 4n_{\eta_{i}}(i)\sqrt{\frac{2\log T}{n_{\eta_{i}-1}(i)}}
\displaystyle\leq C_{0}\sqrt{2\log T}\frac{t_{\eta_{i}}/|A_{\eta_{i}}|}{\sqrt{t_{\eta_{i}-1}/k}}
\displaystyle\leq 4C_{0}\sqrt{2\log T}\frac{k(T/k)^{1/(2-2^{-m(S)})}}{|A_{\eta_{i}}|},
for some absolute constant C_{0}\geq 0. Then for any \mathcal{D}, conditioned on the clean event,
\displaystyle\mathbb{E}_{\mathcal{D}}^{\pi}\left[T\mu^{*}-\sum_{t=1}^{T}\mu_{\pi_{t}}\mid\mathcal{E}\right]  \displaystyle=\sum_{i\in[k]}R(T;i)
\displaystyle\leq\sum_{i\in[k]}4C_{0}\sqrt{2\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{1}\sqrt{\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\sum_{i=1}^{k}\frac{1}{|A_{\eta_{i}}|}
\displaystyle\leq C_{2}\sqrt{\log T}\,k(T/k)^{1/(2-2^{-m(S)})}\sum_{j=1}^{k}\frac{1}{j}
\displaystyle\leq C_{3}(\log k\log T)k^{1-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}
for some absolute constants C_{1},C_{2},C_{3}\geq 0. Thus by (9) and R^{\pi}(T)=\sup_{\mathcal{D}}R_{\mathcal{D}}^{\pi}(T) we have
R^{\pi}(T)\leq C(\log k\log T)k^{2-1/(2-2^{-m(S)})}T^{1/(2-2^{-m(S)})}
for some absolute constant C\geq 0, where m(S)=m_{G}^{U}(S)=\lfloor(S-\max_{i,j\in[k]}{c_{i,j}})/H\rfloor.\hfill\Box
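The step from the third to the fourth line of the bound on R(T;i) relies on the ratio t_{j}/\sqrt{t_{j-1}/k} being the same for every interval. The sketch below checks this numerically. It assumes the real-valued grid t_{j}=k(T/k)^{(2-2^{1-j})/(2-2^{-m(S)})} that appears in the lower-bound computation later in this appendix, ignores rounding, and uses the formula at j=0 (giving t_{0}=k) as a convenience. The function names are illustrative and not part of the paper.

```python
def grid(T, k, m):
    """Real-valued grid t_0, ..., t_{m+1} with
    t_j = k * (T/k) ** ((2 - 2**(1-j)) / (2 - 2**(-m))).
    Assumption: rounding is ignored and t_0 = k is taken from the same formula."""
    denom = 2 - 2.0 ** (-m)
    return [k * (T / k) ** ((2 - 2.0 ** (1 - j)) / denom) for j in range(m + 2)]


def interval_ratios(T, k, m):
    """Return t_j / sqrt(t_{j-1} / k) for j = 1, ..., m+1."""
    t = grid(T, k, m)
    return [t[j] / (t[j - 1] / k) ** 0.5 for j in range(1, m + 2)]


def common_ratio(T, k, m):
    """The value k * (T/k) ** (1 / (2 - 2**(-m))) that every ratio should equal."""
    return k * (T / k) ** (1.0 / (2 - 2.0 ** (-m)))
```

For example, `interval_ratios(10_000, 5, 3)` can be compared entry by entry with `common_ratio(10_000, 5, 3)`; under the stated assumptions all entries agree up to floating-point error, and the last grid point equals T.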
Consider an arbitrary switching graph G whose switching costs satisfy the triangle inequality. Without loss of generality, we assume that \arg\max_{i\in[k]}(\min_{j\neq i}c_{i,j})=1, i.e., \min_{j\neq 1}c_{1,j}\geq\min_{j\neq i}c_{i,j} for all i\in[k]. Recall that H is the total weight of the shortest Hamiltonian path in G. For simplicity, in this proof we use m(S) to denote m_{G}^{L}(S)=\lfloor(S-\max_{i\in[k]}\min_{j\neq i}c_{i,j})/H\rfloor.
Given any k=|G|\geq 1, S\geq 0 and T\geq 2k, we focus on the setting of \mathcal{D}_{i}=\mathcal{N}(\mu_{i},1) (\forall i\in[k]), as this is enough for us to prove the desired lower bound. Note that now the environment of latent distributions \mathcal{D} can be completely determined by a vector \mbox{\boldmath$\mu$}=(\mu_{1},\cdots,\mu_{k})\in\mathbb{R}^{k}. For simplicity, in this proof we will directly use the vector \mu to represent the environment of latent distributions.
For any environment \mu, let X_{\mbox{\boldmath$\mu$}}^{t}(i)\sim\mathcal{N}(\mu_{i},1) denote the i.i.d. random reward of each action i at round t (i\in[k],t\in[T]). For any i\in[k] and n_{1},n_{2}\in[T], let \{X_{\mbox{\boldmath$\mu$}}^{t}(i)\}_{t\in[n_{1}:n_{2}]} denote the random vector whose components are the random rewards of action i from round n_{1} to round n_{2}.
For any environment \mu, for any policy \pi\in\Pi_{S}, with some abuse of notation we let X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t}) denote the learner’s (random) collected reward at round t under policy \pi in environment \mu. Let \mathcal{F}_{t}:=\sigma(X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t})) denote the \sigma-algebra generated by the random variables X_{\mbox{\boldmath$\mu$}}^{1}(\pi_{1}),\dots,X_{\mbox{\boldmath$\mu$}}^{t}(\pi_{t}); then \mathbb{F}=(\mathcal{F}_{t})_{t\in[T]} is a filtration.
For any two probability measures \mathbb{P} and \mathbb{Q} defined on the same measurable space, let D_{\mathrm{TV}}(\mathbb{P}\|\mathbb{Q}) denote the total variation distance between \mathbb{P} and \mathbb{Q}, and D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q}) denote the Kullback-Leibler (KL) divergence between \mathbb{P} and \mathbb{Q}; see Chapter 15 of Wainwright (2019) for detailed definitions.
For any environment \mu, for any policy \pi\in\Pi_{S}, we make the following key definitions.
1. We first define a series of ordered stopping times \tau_{1}\leq\tau_{2}\leq\cdots\leq\tau_{m(S)}\leq\tau_{m(S)+1}.

\tau_{1}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[1:t]$}\} if the set is nonempty and \tau_{1}=\infty otherwise.

\tau_{2}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{1}:t]$}\} if the set is nonempty and \tau_{2}=\infty otherwise.

Generally, \tau_{j}=\min\{1\leq t\leq T:\text{all the actions in $[k]$ have been chosen in $[\tau_{j-1}:t]$}\} if the set is nonempty and \tau_{j}=\infty otherwise, for all j=2,\dots,m(S)+1.
It can be verified that \tau_{1},\dots,\tau_{m(S)+1} are stopping times with respect to the filtration \mathbb{F}.
2. We then define a series of random variables (which depend on the stopping times).

S(1,\tau_{1}) is the total switching cost incurred in [1:\tau_{1}] (note that if there is a switch happening between \tau_{1} and \tau_{1}+1, we do not count its cost in S(1,\tau_{1})).

For all j=2,\dots,m(S), S(\tau_{j-1},\tau_{j}) is the total switching cost incurred in [\tau_{j-1}:\tau_{j}] (note that if there is a switch happening between \tau_{j-1}-1 and \tau_{j-1}, or between \tau_{j} and \tau_{j}+1, we do not count its cost in S(\tau_{j-1},\tau_{j})).

S(\tau_{m(S)},T) is the total switching cost incurred in [\tau_{m(S)}:T] (note that if there is a switch happening between \tau_{m(S)}-1 and \tau_{m(S)}, we do not count its cost in S(\tau_{m(S)},T)).
3. Next we define a series of events.

E_{1}=\{\tau_{1}>t_{1}\}.

For all j=2,\dots,m(S), E_{j}=\{\tau_{j-1}\leq t_{j-1},\tau_{j}>t_{j}\}.

E_{m(S)+1}=\{\tau_{m(S)}\leq t_{m(S)}\}.
Note that t_{1},\dots,t_{m(S)}\in[T] are fixed values specified in Algorithm 3.
4. Finally we define a series of shrinking errors.

\Delta_{1}=1.

For j=2,\dots,m(S), \Delta_{j}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{1-j})/(2-2^{-m(S)})}}{k(m(S)+1)}\in(0,1). (That is, \Delta_{j}\approx\frac{1}{k(m(S)+1)}\frac{1}{\sqrt{t_{j-1}}}.)

\Delta_{m(S)+1}=\frac{k^{-1/2}\left(k/T\right)^{(1-2^{-m(S)})/(2-2^{-m(S)})}}{2k(m(S)+1)}\in(0,1). (That is, \Delta_{m(S)+1}\approx\frac{1}{2k(m(S)+1)}\frac{1}{\sqrt{t_{m(S)}}}.)
5. For notational convenience, define \pi_{\infty} as an independent uniform random variable taking values in [k], i.e., \pi_{\infty}=i with probability 1/k for each i\in[k].
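To make the stopping times in item 1 concrete, the following sketch computes \tau_{1},\tau_{2},\dots from a given sequence of chosen actions. It is only an illustration under assumptions: the action sequence is supplied as a finite list whose length plays the role of T, rounds are 1-indexed, each window [\tau_{j-1}:t] includes round \tau_{j-1} itself, and the helper name cover_times is hypothetical.

```python
import math


def cover_times(actions, k, count):
    """Return [tau_1, ..., tau_count] for a 1-indexed action sequence.

    tau_1 is the first round t by which all k actions appear in [1:t];
    tau_j is the first round t by which all k actions appear in [tau_{j-1}:t].
    A stopping time is math.inf when no such round exists within the horizon.
    """
    taus = []
    start = 1  # left end of the current window (1-indexed, inclusive)
    for _ in range(count):
        tau = math.inf
        if start != math.inf:
            seen = set()
            for t in range(start, len(actions) + 1):
                seen.add(actions[t - 1])
                if len(seen) == k:
                    tau = t
                    break
        taus.append(tau)
        start = tau  # the next window starts at tau itself
    return taus
```

For instance, with three actions labelled 1, 2, 3 and the sequence `[1, 2, 3, 3, 2, 1, 1]`, the first window is covered at round 3, the window starting at round 3 is covered at round 6, and the window starting at round 6 is never covered, so the third stopping time is infinite.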
Lemma 2
For any environment \mu, for any policy \pi\in\Pi_{S}, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is strictly less than }H+\bar{c}\} almost surely.
Proof of Lemma 2. When E_{m(S)+1} happens, \tau_{m(S)}\leq t_{m(S)}\leq T, thus all \tau_{1},\dots,\tau_{m(S)}\leq T. Since in each of [1:\tau_{1}],[\tau_{1}:\tau_{2}],\dots,[\tau_{m(S)-1}:\tau_{m(S)}], all k actions were visited, we know that S(1,\tau_{1})\geq H, S(\tau_{1},\tau_{2})\geq H, \dots, S(\tau_{m(S)-1},\tau_{m(S)})\geq H. Thus we have
S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})\geq m(S)H.
Since \pi\in\Pi_{S}, we further know that
\displaystyle S(\tau_{m(S)},T)  \displaystyle\leq S-[S(1,\tau_{1})+S(\tau_{1},\tau_{2})+\cdots+S(\tau_{m(S)-1},\tau_{m(S)})]
\displaystyle\leq S-m(S)H<H+\max_{i\in[k]}\min_{j\neq i}{c_{i,j}}=H+\min_{j\neq 1}{c_{1,j}}
happens almost surely. As a result, the occurrence of E_{m(S)+1} implies the occurrence of the event \{\text{the total switching cost incurred in }[\tau_{m(S)}:T]\text{ is strictly less than }H+\min_{j\neq 1}{c_{1,j}}\} almost surely. \hfill\Box
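The budget arithmetic behind Lemma 2 can be checked on a small example. The sketch below computes H by brute force over all orderings of the actions (feasible only for small k), evaluates m(S)=\lfloor(S-\max_{i}\min_{j\neq i}c_{i,j})/H\rfloor, and returns the residual budget S-m(S)H, which the proof shows is strictly less than H+\max_{i}\min_{j\neq i}c_{i,j}. The cost matrix format and the function names are assumptions made for illustration.

```python
from itertools import permutations
from math import floor


def shortest_hamiltonian_path(c):
    """Weight H of the shortest Hamiltonian path, by brute force over all orderings.
    c is a k-by-k matrix of switching costs with k >= 2 (assumed input format)."""
    k = len(c)
    return min(
        sum(c[p[i]][p[i + 1]] for i in range(k - 1))
        for p in permutations(range(k))
    )


def max_min_cost(c):
    """max_i min_{j != i} c[i][j], the quantity subtracted from S in m(S)."""
    k = len(c)
    return max(min(c[i][j] for j in range(k) if j != i) for i in range(k))


def residual_budget(S, c):
    """Return (m, H, c_bar, S - m*H) with m = floor((S - c_bar) / H)."""
    H = shortest_hamiltonian_path(c)
    c_bar = max_min_cost(c)
    m = floor((S - c_bar) / H)
    return m, H, c_bar, S - m * H
```

With the illustrative three-action cost matrix `[[0, 1, 2], [1, 0, 1], [2, 1, 0]]` and budget S=7, the shortest Hamiltonian path has weight 2 and the subtracted cost is 1, so m(S)=3 and the residual budget is 1, which is below H+1=3 as the lemma requires.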
Consider a class of environments \Lambda=\{\mbox{\boldmath$\mu$}\mid\frac{\Delta_{m(S)+1}}{4}\leq\mu_{1}-\mu_{i}\leq\frac{\Delta_{m(S)+1}}{2},\forall i\neq 1\}. Pick an arbitrary environment {\alpha} from \Lambda (e.g., \alpha=(\frac{\Delta_{m(S)+1}}{2},0,\dots,0)). For any policy \pi\in\Pi_{S}, by the union bound, we have
\sum_{j=1}^{m(S)+1}\mathbb{P}_{{\alpha}}^{\pi}(E_{j})\geq\mathbb{P}_{{\alpha}}^{\pi}(\cup_{j=1}^{m(S)+1}E_{j})=1.
Therefore, there exists j^{*}\in[m(S)+1] such that \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})\geq 1/(m(S)+1).
Case 1: j^{*}=1. Since \mathbb{P}_{{\alpha}}^{\pi}(E_{1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i),
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{1})}{k}\geq\frac{1}{k(m(S)+1)}.
Note that since \tau_{1} is the first time that all actions in [k] have been chosen in [1:\tau_{1}], the event \{\pi_{\tau_{1}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[1:\tau_{1}-1]\}. Thus, the event \{\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1]:=\{i^{\prime}\text{ was not chosen in }[1:t_{1}-1]\}. Therefore, we have
\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{1}>t_{1},\pi_{\tau_{1}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[t_{1}:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] depends only on the policy \pi and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.  (11)
We now consider a new environment {\beta} such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{1} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[1:t_{1}-1] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{1}-1]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{1}:T]} for all i\in[k]. Letting \mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1]).  (12)
But note that \{X_{\beta}^{t}(i)\}_{[1:t_{1}-1]} and \{X_{\alpha}^{t}(i)\}_{[1:t_{1}-1]} have exactly the same distribution for all i\neq i^{\prime}. Thus from (11) and (12) we have
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\geq\frac{1}{k(m(S)+1)}.
However, in environment \beta, i^{\prime} is the unique optimal action, and choosing any action other than i^{\prime} will incur at least a \Delta_{1}-\Delta_{m(S)+1}/2\geq\Delta_{1}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[1:t_{1}-1] indicates that the policy does not choose i^{\prime} for at least t_{1}-1 rounds, we have
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[1:t_{1}-1])\left[(t_{1}-1)\frac{\Delta_{1}}{2}\right]\geq\frac{t_{1}-1}{2k(m(S)+1)}\geq\frac{k^{-1-1/(2-2^{-m(S)})}}{4(m(S)+1)}T^{1/(2-2^{-m(S)})}.
Case 2: j^{*}\in[2:m(S)]. Since \mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})=\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i),
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{j^{*}})}{k}\geq\frac{1}{k(m(S)+1)}.
Note that since \tau_{j^{*}} is the first time that all actions in [k] have been chosen in [\tau_{j^{*}-1}:\tau_{j^{*}}], the event \{\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \{i^{\prime}\text{ was not chosen in }[\tau_{j^{*}-1}:\tau_{j^{*}}-1]\}. Thus, the event \{\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime}\} must imply the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]:=\{i^{\prime}\text{ was not chosen in }[t_{j^{*}-1}:t_{j^{*}}]\}. Therefore, we have
\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\tau_{j^{*}-1}\leq t_{j^{*}-1},\tau_{j^{*}}>t_{j^{*}},\pi_{\tau_{j^{*}}}=i^{\prime})\geq\frac{1}{k(m(S)+1)}.
Meanwhile, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\alpha}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k], i.e., the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] depends only on the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}, and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}. Letting \mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi, the random vector \{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}, and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\frac{1}{k(m(S)+1)}.  (13)
We now consider a new environment {\beta} such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{j^{*}} and all other components are the same as \alpha. Again, the occurrence of the event \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] is independent of the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[t_{j^{*}-1}:t_{j^{*}}]} and the random vectors \{X_{\beta}^{t}(i)\}_{[t_{j^{*}}+1:T]} for all i\in[k]. Letting \mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi} be the probability measure induced by the policy \pi, the random vector \{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}, and the random vectors \{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]} for i\neq i^{\prime}, we have
\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])=\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}]).  (14)
We now bound the difference between the left-hand side (LHS) of (13) and the LHS of (14). We have
\displaystyle\left|\text{LHS in }(13)-\text{LHS in }(14)\right|
\displaystyle\leq  \displaystyle{D_{\mathrm{TV}}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)
\displaystyle\leq  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}^{\pi}\right)}
\displaystyle\leq  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\alpha}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]},\{X_{\beta}^{t}(i)\}_{[1:t_{j^{*}}]}\text{ for }i\neq i^{\prime}}\right)}
\displaystyle=  \displaystyle\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\mathbb{P}_{\{X_{\alpha}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}~\Big\|~\mathbb{P}_{\{X_{\beta}^{t}(i^{\prime})\}_{[1:t_{j^{*}-1}-1]}}\right)}
\displaystyle=  \displaystyle\sqrt{\frac{1}{2}\left[(t_{j^{*}-1}-1)\frac{\left(\Delta_{j^{*}}\right)^{2}}{2}\right]}
\displaystyle\leq  \displaystyle\frac{\sqrt{t_{j^{*}-1}}\Delta_{j^{*}}}{2}\leq\frac{1}{2k(m(S)+1)},
where the first inequality is by the definition of the total variation distance between two probability measures, the second inequality is by Pinsker’s inequality, and the third inequality is by the data-processing inequality. The first equality holds because the rewards of all actions i\neq i^{\prime} have the same distribution under \alpha and \beta, and the second equality uses the fact that the KL divergence between \mathcal{N}(\mu,1) and \mathcal{N}(\mu+\Delta,1) is \Delta^{2}/2.
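The last two steps of the chain above can be verified numerically. The sketch below evaluates the Pinsker-type bound \sqrt{\frac{1}{2}\cdot n\Delta^{2}/2} for n unit-variance Gaussian observations whose means differ by \Delta, and checks that with n=t_{j-1}-1 and \Delta=\Delta_{j} it stays below 1/(2k(m(S)+1)). It assumes the real-valued grid t_{j}=k(T/k)^{(2-2^{1-j})/(2-2^{-m(S)})} without rounding; the function names are illustrative.

```python
import math


def grid_point(T, k, m, j):
    """t_j = k * (T/k) ** ((2 - 2**(1-j)) / (2 - 2**(-m))); rounding is ignored (assumption)."""
    return k * (T / k) ** ((2 - 2.0 ** (1 - j)) / (2 - 2.0 ** (-m)))


def delta(T, k, m, j):
    """Delta_j for j = 2, ..., m, as defined in item 4 of the key definitions."""
    exponent = (1 - 2.0 ** (1 - j)) / (2 - 2.0 ** (-m))
    return k ** (-0.5) * (k / T) ** exponent / (k * (m + 1))


def tv_bound(n, gap):
    """Pinsker bound sqrt(KL / 2), where KL = n * gap**2 / 2 is the KL divergence
    between n i.i.d. N(mu, 1) samples and n i.i.d. N(mu + gap, 1) samples."""
    kl = n * gap ** 2 / 2
    return math.sqrt(kl / 2)
```

Under these assumptions, \sqrt{t_{j-1}}\,\Delta_{j} equals 1/(k(m(S)+1)) exactly, so `tv_bound(grid_point(T, k, m, j - 1) - 1, delta(T, k, m, j))` is slightly below 1/(2k(m(S)+1)) for every j between 2 and m(S).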
Combining the above inequality with (13) and (14), we have
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])-\frac{1}{2k(m(S)+1)}\geq\frac{1}{2k(m(S)+1)}.
However, i^{\prime} is the unique optimal action in environment \beta, and choosing any action other than i^{\prime} will incur at least a \Delta_{j^{*}}-\Delta_{m(S)+1}/2\geq\Delta_{j^{*}}/2 term in regret. Since \mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}] indicates that the policy does not choose i^{\prime} for at least t_{j^{*}}-t_{j^{*}-1}+1 rounds, we have
\displaystyle R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{j^{*}-1}:t_{j^{*}}])\left[(t_{j^{*}}-t_{j^{*}-1}+1)\frac{\Delta_{j^{*}}}{2}\right]
\displaystyle\geq  \displaystyle\frac{1}{2k(m(S)+1)}\left(k(T/k)^{\frac{2-2^{1-j^{*}}}{2-2^{-m(S)}}}-k(T/k)^{\frac{2-2^{2-j^{*}}}{2-2^{-m(S)}}}\right)\frac{k^{-\frac{1}{2}}\left(k/T\right)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}}{2k(m(S)+1)}
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}}}{4(m(S)+1)^{2}}\left((T/k)^{\frac{1}{2-2^{-m(S)}}}-(T/k)^{\frac{1-2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-j^{*}}}{2-2^{-m(S)}}}\right)
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-\frac{2^{1-m(S)}}{2-2^{-m(S)}}}\right)
\displaystyle\geq  \displaystyle\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right).
When m(S)\leq\log_{2}\log_{2}(T/k), we have
(T/k)^{-2^{-m(S)}}\leq(T/k)^{-\frac{1}{\log_{2}(T/k)}}=\frac{1}{(T/k)^{\log_{T/k}(2)}}=\frac{1}{2}.
Thus we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}T^{\frac{1}{2-2^{-m(S)}}}}{4(m(S)+1)^{2}}\left(1-(T/k)^{-2^{-m(S)}}\right)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{8(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
when m(S)\leq\log_{2}\log_{2}(T/k).
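The threshold m(S)\leq\log_{2}\log_{2}(T/k) is what keeps the factor 1-(T/k)^{-2^{-m(S)}} at least 1/2. A minimal numerical check of this step, assuming T\geq 2k so that \log_{2}\log_{2}(T/k) is well defined, could look as follows; the function names are illustrative.

```python
import math


def tail_factor(T, k, m):
    """(T/k) ** (-2 ** (-m)), the term subtracted from 1 in the Case 2 bound."""
    return (T / k) ** (-(2.0 ** (-m)))


def in_first_regime(T, k, m):
    """True when m <= log2(log2(T/k)); requires T/k >= 2 (assumption)."""
    return m <= math.log2(math.log2(T / k))
```

Whenever `in_first_regime(T, k, m)` holds, `tail_factor(T, k, m)` is at most 1/2, with equality exactly at m=\log_{2}\log_{2}(T/k) (for example T/k=16 and m=2).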
Case 3: j^{*}=m(S)+1. Since \mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})=\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})\geq 1/(m(S)+1) and
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)})=\sum_{i=1}^{k}\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i),
we know that there exists i^{\prime}\in[k] such that
\mathbb{P}_{\alpha}^{\pi}(\tau_{m(S)}\leq t_{m(S)},\pi_{\tau_{m(S)+1}}=i^{\prime})\geq\frac{\mathbb{P}_{{\alpha}}^{\pi}(E_{m(S)+1})}{k}\geq\frac{1}{k(m(S)+1)}.
Thus either
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}>\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)},  (15)
or
\mathbb{P}_{\alpha}^{\pi}\left(\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\right)\geq\frac{1}{2k(m(S)+1)}.  (16)
If (15) holds true, then we consider a new environment \beta such that its i^{\prime}-th component is \alpha_{i^{\prime}}+\Delta_{m(S)+1} and all other components are the same as \alpha. Define the event \mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2]:=\{i^{\prime}\text{ was not chosen in }[t_{m(S)}:(t_{m(S)}+T)/2]\}. From (15) we know that \mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq 1/(2k(m(S)+1)). Using arguments analogous to those in Case 2, we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}}[t_{m(S)}:(t_{m(S)}+T)/2])-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}
and
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k).
Now we consider the case that (16) holds true. Let \mathcal{E}_{i^{\prime}} denote the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=i^{\prime}\}. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)}\} implies that the total switching cost incurred in [\tau_{m(S)}:T] is strictly less than H+\min_{j\neq 1}{c_{1,j}}. Meanwhile, the event \{\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}<\infty\} implies that the total switching cost incurred in [\tau_{m(S)}:\tau_{m(S)+1}] is at least H. As a result, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}.
Suppose that i^{\prime}\neq 1. Then the event \mathcal{E}_{i^{\prime}} implies that action 1 is not chosen in [\tau_{m(S)+1}:T], as incurring c_{i^{\prime},1}\geq\min_{j\neq 1}{c_{1,j}} would violate the requirement that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. However, action 1 is the unique optimal action in environment \alpha, and choosing any action other than action 1 will incur at least a \Delta_{m(S)+1}/4 term in regret. As a result, we know that
R^{\pi}(T)\geq R_{\alpha}^{\pi}(T)\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{i^{\prime}})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{4}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k).
Thus we only need to consider the subcase of i^{\prime}=1. Define the event \mathcal{E}_{1}:=\{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2},\pi_{\tau_{m(S)+1}}=1\}. Note that the occurrence of the event \mathcal{E}_{1} depends only on the policy \pi, the random vector \{X_{\alpha}^{t}(1)\}_{[1:t_{m(S)}]}, and the random vectors \{X_{\alpha}^{t}(i)\}_{[1:{(t_{m(S)}+T)}/{2}]} for i\neq 1. Consider a new environment \beta such that its first component is \alpha_{1}-\Delta_{m(S)+1} and all other components are the same as \alpha. Using arguments analogous to those in Case 2, we can derive that
\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{\sqrt{t_{m(S)}}\Delta_{m(S)+1}}{2}\geq\mathbb{P}_{\alpha}^{\pi}(\mathcal{E}_{1})-\frac{1}{4k(m(S)+1)}\geq\frac{1}{4k(m(S)+1)}.
However, action 1 is the worst action in environment \beta, and each time action 1 is chosen, it incurs at least a \Delta_{m(S)+1}/2 term in regret. According to Lemma 2, the event \{\tau_{m(S)}\leq t_{m(S)},\tau_{m(S)+1}\leq\frac{t_{m(S)}+T}{2}\} implies that the total switching cost incurred in [\tau_{m(S)+1}:T] is strictly less than \min_{j\neq 1}{c_{1,j}}. Since switching from action 1 to any other action incurs at least \min_{j\neq 1}{c_{1,j}} cost, the event \mathcal{E}_{1} actually implies that action 1 is continuously chosen in every round from round \tau_{m(S)+1} (\leq\frac{t_{m(S)}+T}{2}) to round T, which means that action 1 is continuously chosen in the last (T-\frac{t_{m(S)}+T}{2}+1) rounds. As a result, we know that
R^{\pi}(T)\geq R_{\beta}^{\pi}(T)\geq\mathbb{P}_{\beta}^{\pi}(\mathcal{E}_{1})\left[\left(T-\frac{t_{m(S)}+T}{2}+1\right)\frac{\Delta_{m(S)+1}}{2}\right]\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k).
Combining Cases 1, 2 and 3, we know that
R^{\pi}(T)\geq\frac{k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}}{64(m(S)+1)^{2}}T^{\frac{1}{2-2^{-m(S)}}}
for m(S)\leq\log_{2}\log_{2}(T/k). On the other hand, since the minimax lower bound for the classical MAB problem (which is equivalent to a BwSC problem with infinite switching budget) is \Omega(\sqrt{kT}), we know that
R^{\pi}(T)\geq R_{\infty}^{*}\geq C\sqrt{kT} 
for some absolute constant C>0. To sum up, we have
R^{\pi}(T)\geq\begin{cases}C\left(k^{-\frac{3}{2}-\frac{1}{2-2^{-m(S)}}}(m(S)+1)^{-2}\right)T^{\frac{1}{2-2^{-m(S)}}},&\text{if }m(S)\leq\log_{2}\log_{2}(T/k),\\ C\sqrt{kT},&\text{if }m(S)>\log_{2}\log_{2}(T/k),\end{cases}
for some absolute constant C>0, where m(S)=m_{G}^{L}(S)=\lfloor\frac{S-\max_{i\in[k]}\min_{j\neq i}c_{i,j}}{H}\rfloor. \hfill\Box
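To see how the two regimes of the final bound behave, the sketch below evaluates the exponent 1/(2-2^{-m(S)}) of T and the resulting rate with the absolute constant C dropped. It is an illustration only: the constant is unspecified in the proof, T\geq 2k is assumed, and the function names are hypothetical.

```python
import math


def regret_exponent(m):
    """Exponent of T in the lower bound: 1 / (2 - 2**(-m))."""
    return 1.0 / (2.0 - 2.0 ** (-m))


def lower_bound_rate(T, k, m):
    """Lower-bound rate with the absolute constant C omitted (assumption: T >= 2k)."""
    if m <= math.log2(math.log2(T / k)):
        a = regret_exponent(m)
        return k ** (-1.5 - a) * (m + 1) ** (-2) * T ** a
    return math.sqrt(k * T)
```

The exponent equals 1 at m(S)=0, 2/3 at m(S)=1, 4/7 at m(S)=2, and decreases towards 1/2 as m(S) grows, which is the phase structure described in the abstract: the rate changes only when m(S) changes.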