Exploiting Correlation in Finite-Armed Structured Bandits
Abstract
We consider a correlated multi-armed bandit problem in which the rewards of the arms are correlated through a hidden parameter. Our approach exploits the correlation among arms to identify some arms as suboptimal and pulls them only $O(1)$ times. This results in a significant reduction in cumulative regret, and in fact our algorithm achieves bounded (i.e., $O(1)$) regret whenever possible; explicit conditions needed for bounded regret to be possible are also provided by analyzing regret lower bounds. We propose several variants of our approach that generalize classical bandit algorithms such as UCB, Thompson sampling and KL-UCB to the structured bandit setting, and empirically demonstrate their superiority via simulations.
1 Introduction
The multi-armed bandit (MAB) problem [1] falls under the umbrella of sequential decision-making problems. In the classical $K$-armed bandit formulation, a player is presented with $K$ arms. At each time step $t$, she decides to pull an arm $k_t$ and receives a random reward with unknown mean $\mu_{k_t}$. The goal of the player is to maximize her cumulative reward. In order to do so, the player must balance exploration and exploitation of arms. This classical $K$-armed bandit formulation assumes independence of the rewards of different arms. However, in many online learning problems, such as dynamic pricing and drug dosage optimization, there is a correlation between the rewards of different actions.
Motivated by this, we consider a correlated multi-armed bandit problem in which the mean rewards of different arms are related through a common hidden parameter $\theta^*$. Specifically, the expected reward of arm $k$ is $\mu_k(\theta^*)$. In the setting considered, the mean reward functions $\mu_1(\cdot), \ldots, \mu_K(\cdot)$ are known to the player, but the true value of the shared parameter $\theta^*$ is unknown. The dependence on the common parameter introduces a structure in this MAB problem. This makes the model interesting, as the rewards observed from one arm can provide information about the mean rewards of other arms. Similar models have been considered in [2, 3, 4], but as explained in Section 2.2, we consider a more general setting that subsumes the models in [2, 3, 5].
There are many applications where the structured bandit problem described above can be useful. For instance, in the dynamic pricing problem [2], a player needs to select the price of a product from a finite set of prices, and the average revenue in time slot $t$ is a function of the selected price and the market size $\theta^*$. These functions are typically known from the literature [6], but the pricing decisions need to be made without knowing the market size, such that the total revenue is maximized; hence, this problem fits perfectly in our setting. The authors of [4] provide a similar example for the purpose of advertising, in which a company needs to decide what form of advertising to purchase so as to maximize its profit. The problem setup is also relevant in system diagnosis, where $\theta^*$ represents the unknown cause of a failure in the system, and the functions $\mu_k(\cdot)$ represent the responses of the system to different actions. Other applications of this model include cellular coverage optimization [7] and drug dosage optimization [3]. Our general treatment of the structured bandit setting will allow our work to be helpful in all these problems.
Main Contributions.
1) We consider a general setting for the problem which subsumes previously considered models [2, 3, 5].
2) We develop a novel approach that exploits the structure of the bandit problem to identify suboptimal arms. In particular, we generate an estimate of $\theta^*$ at each round to identify competitive and non-competitive arms. The non-competitive arms are identified as suboptimal without having to pull them. We refer to this identification as implicit exploration. This implicit exploration is combined with traditional bandit algorithms such as UCB and Thompson sampling to design the UCB-C and TS-C algorithms.
3) Our finite-time regret analysis reveals how this idea leads to a smaller regret as compared to UCB. In fact, the proposed UCB-C algorithm ends up pulling non-competitive arms only $O(1)$ times. Due to this, only $C$ out of the $K$ arms are pulled $O(\log T)$ times. The value of $C$ can be much smaller than $K$ and can even be $1$, in which case our proposed algorithm achieves bounded regret! Our analysis reveals that the proposed algorithm achieves bounded regret whenever possible.
4) The design of UCB-C makes it easy to extend other classical bandit algorithms (such as Thompson sampling [8], KL-UCB [9], etc.) to the structured bandit setting. This extension was deemed to be not easily possible for the UCB-S algorithm proposed in [4].
5) We design two variants of UCB-C, namely UCB-int and UCB-min, and demonstrate the empirical superiority of the proposed algorithms in different scenarios.
2 Problem Formulation
2.1 System Model
We consider a $K$-armed bandit setting in which the rewards of the arms are correlated. As shown in Figure 1, we assume that the mean reward of each arm depends on a common hidden parameter $\theta^*$. At each time step $t$, the player pulls an arm $k_t \in \{1, 2, \ldots, K\}$ and observes the reward $r_t$.
The reward obtained at time step $t$ is a random variable with mean $\mathbb{E}[r_t \mid k_t = k] = \mu_k(\theta^*)$, where $\theta^*$ is a fixed unknown parameter which lies in a known set $\Theta$; our formulation allows the set $\Theta$ to be countable or uncountable. The functions $\mu_1(\cdot), \ldots, \mu_K(\cdot)$ are known to the player, but the true value of the parameter, i.e., $\theta^*$, is unknown. The parameter $\theta^*$ can also be a vector.
The objective of the player is to maximize her cumulative reward in $T$ rounds. If a player had known the true value $\theta^*$, then she would always pull the arm having the highest mean reward for the parameter $\theta^*$, as that would lead to the maximum cumulative reward in expectation. Motivated by this, we denote the optimal arm by $k^* = \arg\max_{k} \mu_k(\theta^*)$, i.e., the best arm for the true parameter $\theta^*$. The suboptimality gap of arm $k$, $\Delta_k$, is defined as the difference between the mean reward of the optimal arm and that of arm $k$; i.e., $\Delta_k = \mu_{k^*}(\theta^*) - \mu_k(\theta^*)$. The performance of a player is evaluated by the cumulative regret, defined as:
$$Reg(T) = \sum_{k=1}^{K} \Delta_k \, n_k(T).$$
Here $n_k(T)$ is a random variable denoting the number of times arm $k$ is pulled in a total of $T$ time slots. The cumulative regret quantifies the performance of a player in comparison to an oracle that pulls the optimal arm at each time slot. Thus, the smaller the regret, the better the performance of the player.
As mentioned earlier, the player only knows the mean reward functions and not the conditional distribution of rewards; i.e., the distribution of $r_t$ given $k_t$ and $\theta^*$ is not known. Throughout the paper, we assume that the rewards are sub-Gaussian with variance proxy $\sigma^2$, i.e., $\mathbb{E}\left[\exp\left(\lambda\left(r_t - \mu_{k_t}(\theta^*)\right)\right)\right] \le \exp\left(\lambda^2\sigma^2/2\right)$ for all $\lambda \in \mathbb{R}$, and that $\sigma^2$ is known to the player. Both of these assumptions are common in the multi-armed bandit literature [4, 10, 11, 12, 13]. In particular, the sub-Gaussianity of rewards enables us to apply Hoeffding's inequality, which is essential for the regret analysis.
We would like to highlight that we make no assumptions on the functions $\mu_k(\cdot)$, unlike some previous works ([2, 3, 5]) that place restrictive assumptions on these functions. Due to the general nature of our setup, our model subsumes these previously studied frameworks and is applicable to much more general scenarios as well. The similarities and differences between our model and existing studies are discussed next.
2.2 Connections with Previously Studied Bandit Models
Classical MAB. Under the classical multi-armed bandit setting, the rewards obtained from each arm are independent. By considering $\theta = (\theta_1, \ldots, \theta_K)$ with $\mu_k(\theta) = \theta_k$, our setting reduces to the classical MAB setting. Our proposed algorithm will in fact perform UCB/Thompson sampling ([1, 8]) in this special case.
Global Bandits [2]. In [2], a model where the mean reward functions depend on a common scalar parameter is studied. A key assumption in [2] is that the mean reward functions are invertible and Hölder-continuous. Under these assumptions, they demonstrate that it is possible to achieve bounded regret through a greedy policy. In contrast, our work makes no assumptions on the nature of the functions $\mu_k(\cdot)$. In fact, when the reward functions are invertible, our proposed algorithm also achieves bounded regret. Hence, our formulation covers the setting described in [2].
Regional Bandits [3]. The paper [3] studies a setting in which there are multiple common unknown parameters. The mean reward function of each arm depends on one of these parameters, and these mean reward functions are assumed to be invertible and Hölder-continuous as functions of the corresponding parameter. The setting described in [3] is captured in our formulation by setting $\theta$ to be the vector of all these parameters. In fact, our problem setup allows the mean reward function of arm $k$ to be a function of a combination of all of these parameters, and these mean reward functions need not be invertible.
Structured bandits with linear functions [5]. In [5], the authors consider a model in which the rewards of all arms depend on a common parameter. However, they assume that the mean reward functions $\mu_k(\theta)$ are linear in $\theta$. Under this assumption, they design a greedy policy that achieves bounded regret. Our formulation places no such restriction on the reward functions, and thus is more general. In the specific cases where the reward functions are linear, our proposed algorithm also achieves bounded regret.
Finite-armed generalized linear bandits [14]. Under the finite-armed linear bandit setting [14], the reward function of arm $k$ is $\mu_k(\theta) = \langle x_k, \theta \rangle$ for a known feature vector $x_k$. Here, $\theta$ is the shared unknown parameter. Similarly, when $\mu_k(\theta) = g(\langle x_k, \theta \rangle)$ for some known function $g$, this becomes the generalized linear bandit setting [15]. One can easily see that our setting captures both of them.
Minimal Exploration in Structured Bandits [16]. The authors of [16] consider a problem formulation that is more general than the setting described in this paper. However, the focus of [16] is to obtain asymptotically optimal regret in the regimes where regret scales as $O(\log T)$. When all suboptimal arms are non-competitive, the solution to the optimization problem described in [16, Theorem 1] becomes $0$, causing the algorithm to get stuck in the exploitation phase and not perform properly in such settings. Also, [16] assumes knowledge of the shape of the reward distribution (for example, Gaussian with unknown mean), while we only assume that the rewards are sub-Gaussian. Moreover, they assume $\theta \mapsto \mu_k(\theta)$ is continuous, while we make no such assumption.
Finite-armed structured bandits [4]. The work closest to ours is [4]. The authors in [4] consider the same model that we consider and propose the UCB-S algorithm, which is a UCB-style algorithm for this setting. We take a different approach to this problem and propose a novel algorithm that separates implicit and explicit exploration, which allows us to extend our UCB-style algorithm to other classical bandit algorithms such as Thompson sampling. A Thompson sampling style algorithm was not proposed in [4]. Through simulations, we compare our proposed algorithms against the UCB-S algorithm proposed in [4].
2.3 Intuitions for Developing an Algorithm
Classic multi-armed bandit algorithms such as UCB [1] and Thompson sampling [8] rely on explicit exploration of empirically suboptimal arms to learn the optimal action. In our framework, since the mean rewards of all arms depend on a common parameter, obtaining an estimate of $\theta^*$ from the samples observed till slot $t$ can give us some information on the mean rewards of all arms. This additional knowledge can then be used to reduce the exploration needed when designing bandit algorithms. Identifying suboptimal arms through this estimate of $\theta^*$ can be thought of as implicit exploration.
Consider the example shown in Figure 2. In this case, the true parameter $\theta^*$ is equal to 3, and the mean rewards of Arms 1, 2 and 3 are given by the functions $\mu_1(\theta)$, $\mu_2(\theta)$ and $\mu_3(\theta)$ plotted in Figure 2. Thus, the optimal arm in this setup is Arm 2. Assume now that the player has obtained a large number of samples of Arm 2 at a given time step $t$. Based on the samples observed from Arm 2, the player has an empirical estimate of its mean reward as
$$\hat{\mu}_2(t) = \frac{1}{n_2(t)} \sum_{s=1}^{t} r_s \, \mathbb{1}\{k_s = 2\}. \qquad (1)$$
Using this empirical estimate, the player can construct an interval in which $\mu_2(\theta^*)$ lies with high probability. Figure 3 illustrates such a region in shaded pink. This region can then be used to identify the set of $\theta$ values within which the true parameter $\theta^*$ lies with high probability; Figure 3 shows that set on the $\theta$-axis. Upon identifying this set, we can now see that if $\theta^*$ indeed lies in this set, then Arm 3 cannot be optimal, as it is suboptimal compared to Arm 2 for all values of $\theta$ in the set. However, Arm 1 may still be better than Arm 2, as it has a higher mean reward than Arm 2 for some values of $\theta$ in the set. This provides an example where we implicitly explore Arm 3 without pulling it. As Arm 3 cannot be optimal within this set, we refer to it as non-competitive with respect to the set. On the other hand, we call Arms 1 and 2 competitive with respect to the set, as each is optimal for at least one $\theta$ in it.
We formalize this idea of identifying non-competitive arms into an online algorithm that performs both implicit and explicit exploration. The proposed algorithm, presented in the next section, successfully reduces a $K$-armed bandit problem to a $C$-armed bandit problem, where $C$ is the number of competitive arms, defined formally in Section 3. More interestingly, this algorithm can lead to bounded (i.e., not scaling with $T$) regret in certain regimes, as we will show in Section 4.
3 Proposed Algorithms: UCB-C and TS-C
Classical bandit algorithms such as Thompson sampling and Upper Confidence Bound (UCB) are often termed index-based policies. At every time instant, these policies maintain an index for each arm and select the arm with the highest index in the next time slot. More specifically, at each round $t$, UCB selects the arm
$$k_{t+1} = \arg\max_{k} \left(\hat{\mu}_k(t) + \sqrt{\frac{2\sigma^2 \log t}{n_k(t)}}\right),$$
where $\hat{\mu}_k(t)$ is the empirical mean of arm $k$ obtained from the samples observed till round $t$, and $n_k(t)$ is the number of times arm $k$ has been pulled. Under Thompson sampling, we select the arm $k_{t+1} = \arg\max_k S_k(t)$ at time step $t+1$. Here, $S_k(t)$ is the sample obtained from the posterior distribution of the mean reward of arm $k$, i.e.,
$$S_k(t) \sim \mathcal{N}\left(\hat{\mu}_k(t), \frac{\sigma^2}{n_k(t)}\right).$$
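To make the two index computations concrete, the following minimal sketch (our own illustration, not code from the paper) computes both rules, assuming sub-Gaussian rewards with a known variance proxy and a Gaussian posterior for Thompson sampling; all variable names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                          # known variance proxy

def ucb_arm(emp_mean, pulls, t):
    """Classical UCB: pick the arm with the highest index at round t."""
    index = emp_mean + np.sqrt(2 * sigma**2 * np.log(t) / pulls)
    return int(np.argmax(index))

def ts_arm(emp_mean, pulls):
    """Thompson sampling with a Gaussian posterior for each arm's mean."""
    samples = rng.normal(emp_mean, sigma / np.sqrt(pulls))
    return int(np.argmax(samples))

emp_mean = np.array([0.4, 0.6, 0.5])                 # toy empirical means
pulls = np.array([5, 5, 5])                          # pulls of each arm so far
print(ucb_arm(emp_mean, pulls, t=15), ts_arm(emp_mean, pulls))
```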
Since the mean rewards are correlated through the hidden parameter $\theta^*$ in the structured bandit model, obtaining an estimate of $\theta^*$ can help identify the optimal arm. In our approach, we will identify a subset of arms, called the competitive arms, through the estimate of $\theta^*$, and then perform UCB or TS over that set of arms. We now define the notions of competitive and non-competitive arms, which are a key component in the design of the UCB-C and TS-C algorithms.
3.1 Competitive and Non-Competitive Arms
From the samples observed till time step $t$, one can construct a confidence set $\hat{\Theta}_t \subseteq \Theta$. The set $\hat{\Theta}_t$ represents the set of values in which the true parameter $\theta^*$ lies with high confidence, based on the rewards observed until time $t$. Next, we define the notions of Competitive and Non-competitive arms.
Definition 1 (Competitive arm).
An arm $k$ is said to be Competitive if $\mu_k(\theta) = \max_{k'} \mu_{k'}(\theta)$ for some $\theta \in \hat{\Theta}_t$.
Intuitively, an arm is Competitive if it is optimal for some $\theta$ in the confidence set $\hat{\Theta}_t$. Similarly, we define a Non-competitive arm as follows.
Definition 2 (Non-competitive arm).
An arm $k$ is said to be Non-competitive if $\mu_k(\theta) < \max_{k'} \mu_{k'}(\theta)$ for all $\theta \in \hat{\Theta}_t$.
Intuitively, if an arm is Non-competitive, it cannot be optimal whenever the true parameter lies inside the confidence set $\hat{\Theta}_t$. This allows us to identify a Non-competitive arm as suboptimal, under the assumption that the true parameter is in the set $\hat{\Theta}_t$.
We now introduce the notion of an $\epsilon$-non-competitive arm.
Definition 3 ($\epsilon$-non-competitive arm).
We call an arm $\epsilon$-non-competitive if it is Non-competitive with respect to the set $\Theta_\epsilon = \{\theta \in \Theta : |\mu_{k^*}(\theta) - \mu_{k^*}(\theta^*)| \le \epsilon\}$.
Informally, this means that if an arm is $\epsilon$-non-competitive, then it is Non-competitive with $\hat{\Theta}_t$ replaced by $\Theta_\epsilon$, the set of $\theta$ that do not change the true mean of the optimal arm by more than $\epsilon$.
Throughout, we say that an arm is competitive if there is no $\epsilon > 0$ for which it is $\epsilon$-non-competitive. The set of all competitive arms is denoted by $\mathcal{C}$, and the number of competitive arms by $C = |\mathcal{C}|$.
Let $\Theta_0$ be defined as the set $\{\theta \in \Theta : \mu_{k^*}(\theta) = \mu_{k^*}(\theta^*)\}$. We can view $\Theta_0$ as the confidence set obtained after the optimal arm is sampled infinitely many times. This definition ensures that if an arm is Competitive with respect to $\Theta_0$, then it is competitive.
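As an illustration of Definitions 1 and 2, the sketch below identifies the Competitive arms when $\Theta$ is approximated by a finite grid; the grid, the linear reward functions, and the confidence interval used here are hypothetical stand-ins rather than instances from the paper:

```python
import numpy as np

theta_grid = np.linspace(0, 6, 601)                  # finite stand-in for Theta
mu = np.stack([2 - 0.3 * theta_grid,                 # hypothetical mu_1(theta)
               1 + 0.2 * theta_grid,                 # hypothetical mu_2(theta)
               0.5 + 0.1 * theta_grid])              # hypothetical mu_3(theta)

def competitive_arms(conf_mask):
    """Arms that are optimal for at least one theta in the confidence set
    (Definition 1); all other arms are Non-competitive (Definition 2)."""
    best = mu[:, conf_mask].max(axis=0)              # pointwise-best mean reward
    return [k for k in range(mu.shape[0]) if np.any(mu[k, conf_mask] >= best)]

# suppose the current confidence set is the interval [2.5, 3.5]
conf_mask = (theta_grid >= 2.5) & (theta_grid <= 3.5)
print(competitive_arms(conf_mask))                   # here only arm index 1 survives
```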
3.2 Components of Our Algorithm
Motivated by the above discussion, we propose the following algorithm. At each step $t$, we:
1. Construct a confidence set $\hat{\Theta}_t$ from the samples observed till time step $t$.
2. Identify the Non-competitive arms.
3. Play a bandit algorithm (say UCB or Thompson sampling) among the arms that are Competitive, and choose the next arm accordingly.
The formal description of this algorithm, with UCB and Thompson sampling as the final step, is given in Algorithm 1 and Algorithm 2, respectively. Below, we explain the three key components of these algorithms.
Constructing a confidence set $\hat{\Theta}_t$. From the samples observed till time step $t$, we identify the arm that has been selected the maximum number of times so far, namely $k_{\max}(t) = \arg\max_k n_k(t)$. We define the confidence set as follows:
$$\hat{\Theta}_t = \left\{\theta \in \Theta : \left|\mu_{k_{\max}}(\theta) - \hat{\mu}_{k_{\max}}(t)\right| \le \sqrt{\frac{2\alpha\sigma^2 \log t}{n_{k_{\max}}(t)}}\right\},$$
where $\alpha > 2$ is a fixed constant. Here $\hat{\mu}_{k_{\max}}(t)$ is the empirical mean of the rewards obtained in the $n_{k_{\max}}(t)$ samples of arm $k_{\max}$. We construct the set with samples of $k_{\max}$ as it has the smallest variance in its empirical estimate among all arms. In our regret analysis, we show that using samples of just one arm suffices to achieve the desired dimension-reduction and bounded-regret properties. In Section 5, we present and discuss the UCB-int algorithm, which constructs the confidence set using samples of all arms.
Identifying Non-competitive arms. At each time step $t$, we define the set $\mathcal{C}_t$ as the set of Competitive arms, which includes all arms $k$ that satisfy $\mu_k(\theta) = \max_{k'} \mu_{k'}(\theta)$ for some $\theta \in \hat{\Theta}_t$. The rest of the arms, termed Non-competitive, are eliminated for round $t+1$ and are not considered in the next part of the algorithm.
Play a bandit algorithm among the Competitive arms. After identifying the Competitive arms, we use classical bandit algorithms such as UCB and Thompson sampling to decide which arm to play at time step $t+1$. For example, in the case of UCB-C, the next arm is selected as $k_{t+1} = \arg\max_{k \in \mathcal{C}_t} I_k(t)$, with
$$I_k(t) = \hat{\mu}_k(t) + \sqrt{\frac{2\alpha\sigma^2 \log t}{n_k(t)}}.$$
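Putting the three components together, a self-contained sketch of one UCB-C round is given below. It assumes a finite grid standing in for $\Theta$, toy reward functions of our own choosing, and $\alpha = 3$; it is meant only to mirror the structure of Algorithm 1, not to reproduce it exactly:

```python
import numpy as np

sigma, alpha, K = 1.0, 3.0, 3
theta_grid = np.linspace(0, 6, 601)                  # finite stand-in for Theta
mu = np.stack([2 - 0.3 * theta_grid,                 # hypothetical mu_1(theta)
               1 + 0.2 * theta_grid,                 # hypothetical mu_2(theta)
               0.5 + 0.1 * theta_grid])              # hypothetical mu_3(theta)

def ucb_c_round(emp_mean, pulls, t):
    # Step 1: confidence set built from the most-pulled arm k_max
    k_max = int(np.argmax(pulls))
    radius = np.sqrt(2 * alpha * sigma**2 * np.log(t) / pulls[k_max])
    conf = np.abs(mu[k_max] - emp_mean[k_max]) <= radius
    # Step 2: Competitive arms = optimal for some theta in the confidence set
    best = mu[:, conf].max(axis=0)
    competitive = [k for k in range(K) if np.any(mu[k, conf] >= best)]
    # Step 3: UCB restricted to the Competitive set
    index = emp_mean + np.sqrt(2 * alpha * sigma**2 * np.log(t) / pulls)
    return max(competitive, key=lambda k: index[k])

emp_mean = np.array([1.1, 1.6, 0.8])                 # toy sufficient statistics
pulls = np.array([4, 12, 4])
print(ucb_c_round(emp_mean, pulls, t=20))            # index of the next arm to pull
```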
It is important to note that the last step of our algorithm can utilize any one of the classical bandit algorithms. This allows us to easily define a Thompson sampling algorithm, which has attracted great attention [17, 18, 19, 20, 21], for the structured bandit problem considered in this paper. The ability to employ any bandit algorithm in its last step is an important advantage of our approach. For instance, the extension to Thompson sampling was deemed to be not possible for the UCB-S algorithm proposed in [4].
The idea of eliminating non-competitive arms was initially proposed in [22] for studying multi-armed bandits with a latent random source. However, given the different nature of the problem studied in [22], an entirely different definition of arm competitiveness was used.
4 Regret Analysis and Bounds
In this section, we analyze the performance of the UCB-C algorithm through a finite-time analysis of the cumulative expected regret, defined as
$$\mathbb{E}[Reg(T)] = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}[n_k(T)]. \qquad (2)$$
Here, $\Delta_k = \mu_{k^*}(\theta^*) - \mu_k(\theta^*)$ and $n_k(T)$ is the number of times arm $k$ is pulled in a total of $T$ time steps.
To analyze the expected regret of the proposed algorithm (as given by (2) above), we need to determine $\mathbb{E}[n_k(T)]$ for each suboptimal arm $k$. Our first result shows that the expected number of pulls of any arm is $O(\log T)$.
Theorem 1 (Expected pulls for any arm).
The expected number of times any suboptimal arm $k$ is pulled by the UCB-C algorithm is upper bounded as
$$\mathbb{E}[n_k(T)] \le \frac{8\alpha\sigma^2}{\Delta_k^2}\log T + O(1). \qquad (3)$$
Our next result shows that the expected number of pulls of an $\epsilon$-non-competitive arm is bounded.
Theorem 2 (Expected pulls of Non-competitive Arms).
If an arm $k$ is $\epsilon$-non-competitive, then the expected number of times it is pulled by UCB-C is upper bounded by a constant, i.e.,
$$\mathbb{E}[n_k(T)] = O(1),$$
where the constant depends on $\epsilon$, $\sigma^2$ and $\alpha$, but not on $T$.
Plugging the results of Theorem 1 and Theorem 2 into (2) yields the following bound on the expected regret of the UCB-C algorithm.
Theorem 3 (Regret upper bound).
The expected regret of the UCB-C algorithm is upper bounded as
$$\mathbb{E}[Reg(T)] \le \sum_{k \in \mathcal{C}\setminus\{k^*\}} \Delta_k \left(\frac{8\alpha\sigma^2}{\Delta_k^2}\log T + O(1)\right) + \sum_{k \notin \mathcal{C}} \Delta_k \cdot O(1) = O\big((C-1)\log T\big).$$
Dimension Reduction. The classic UCB algorithm, which is agnostic to the structure of the problem, pulls each of the $K-1$ suboptimal arms $O(\log T)$ times. In contrast, our algorithm pulls only $C-1$ suboptimal arms $O(\log T)$ times, where $C \le K$. In fact, when $C = 1$, all suboptimal arms are pulled only $O(1)$ times, leading to a bounded regret. Such cases can arise quite often in practical settings. For example, when the reward function of the optimal arm is invertible around $\theta^*$, the set $\Theta_0$ becomes a singleton; i.e., there is just a single $\theta \in \Theta$ for which $\mu_{k^*}(\theta) = \mu_{k^*}(\theta^*)$. In that case, all suboptimal arms become non-competitive and our UCB-C algorithm achieves bounded (i.e., $O(1)$) regret.
We now show that the UCB-C algorithm achieves bounded regret whenever possible. We do so by analyzing a lower bound obtained in [16].
Proposition 1 (Lower bound).
For any uniformly good algorithm [1], and for any $\theta^* \in \Theta$, we have:
$$\liminf_{T \to \infty} \frac{\mathbb{E}[Reg(T)]}{\log T} \ge C(\theta^*),$$
where the constant $C(\theta^*)$, defined formally in Appendix A, equals $0$ only when every suboptimal arm is non-competitive.
An algorithm is uniformly good if $\mathbb{E}[Reg(T)] = o(T^a)$ for all $a > 0$ and all $\theta^* \in \Theta$.
The proof of this proposition, given in the Appendix, follows from a lower bound derived in [16]. This lower bound leads us to the following observation.
Remark 1 (Bounded regret whenever possible).
The result on the lower bound in Proposition 1 shows that sub-logarithmic regret is possible only when $C = 1$. In the case when $C = 1$, our proposed algorithm achieves a bounded regret (see Theorem 3). This implies that the UCB-C algorithm is able to achieve bounded regret whenever possible, and achieves dimensionality reduction in the cases where the regret is logarithmic.
5 Variants of UCB-C
Recall that the design of the UCB-C and TS-C algorithms involved the construction of the confidence set $\hat{\Theta}_t$. This confidence set was constructed by selecting all $\theta$ for which $\mu_{k_{\max}}(\theta)$ is inside a ball of size $\sqrt{2\alpha\sigma^2\log t / n_{k_{\max}}(t)}$ around the empirical mean $\hat{\mu}_{k_{\max}}(t)$. In this section, we discuss two other methods of constructing the confidence set to show that our idea of separating implicit and explicit exploration can be easily extended to design new algorithms. These extensions can lead to lower empirical regret than the UCB-C algorithm in some cases. We now describe the two algorithms and evaluate their regret bounds.
5.1 The UCB-int Algorithm
The UCB-C algorithm constructs the confidence set using just the samples of arm $k_{\max}(t)$. We now present the UCB-int algorithm, which differs from UCB-C in this aspect. At an additional computation cost, it constructs the confidence set using the samples of all the arms pulled so far. More specifically, UCB-int constructs $\hat{\Theta}_t$ as follows:
$$\hat{\Theta}_t = \bigcap_{k : n_k(t) > 0} \left\{\theta \in \Theta : \left|\mu_k(\theta) - \hat{\mu}_k(t)\right| \le \sqrt{\frac{2\alpha\sigma^2 \log t}{n_k(t)}}\right\}. \qquad (4)$$
Similar to UCB-C, the UCB-int algorithm works for any functions $\mu_k(\cdot)$ and any set $\Theta$. UCB-int also has performance guarantees similar to UCB-C, i.e., $\mathbb{E}[Reg(T)] = O((C-1)\log T)$. It also enjoys the same property of achieving bounded regret whenever possible. We present the exact regret bound of UCB-int in the Appendix. Since we use samples of all arms in constructing the confidence set for UCB-int, the confidence set is smaller for UCB-int than for UCB-C. Hence, UCB-int is more aggressive in the removal of non-competitive arms and can obtain better empirical performance than the UCB-C algorithm.
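A sketch of the intersection in (4) over a finite grid standing in for $\Theta$ (the function name and conventions are ours) could look as follows; arms that have never been pulled contribute no constraint:

```python
import numpy as np

def confidence_set_int(mu, emp_mean, pulls, t, sigma=1.0, alpha=3.0):
    """UCB-int: intersect the per-arm confidence sets over a finite theta grid.
    mu has shape (K, |grid|); mu[k] holds mu_k(theta) for every grid point."""
    conf = np.ones(mu.shape[1], dtype=bool)          # start from the whole grid
    for k in range(mu.shape[0]):
        if pulls[k] == 0:
            continue                                 # unpulled arm: no constraint
        radius = np.sqrt(2 * alpha * sigma**2 * np.log(t) / pulls[k])
        conf &= np.abs(mu[k] - emp_mean[k]) <= radius
    return conf                                      # boolean mask over the grid
```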
5.2 The UCB-min Algorithm
The UCB-C algorithm was designed to accommodate all types of functions $\mu_k(\cdot)$ and all sets $\Theta$. In this section, we design a more aggressive algorithm, named UCB-min, for a class of structured bandit settings. The problem settings for which UCB-min is designed include all cases where $\Theta$ is a countable set.
The UCB-min algorithm has the same three components as UCB-C. It differs from UCB-C only in the first component, i.e., the construction of the confidence set. Instead of considering a ball of size $\sqrt{2\alpha\sigma^2\log t / n_{k_{\max}}(t)}$ around $\hat{\mu}_{k_{\max}}(t)$, we directly choose $\hat{\Theta}_t$ by selecting the $\theta$ whose value of $\mu_{k_{\max}}(\theta)$ is closest to the empirical mean. More specifically, we construct $\hat{\Theta}_t$ as follows:
$$\hat{\theta}_t = \arg\min_{\theta \in \Theta} \left|\mu_{k_{\max}}(\theta) - \hat{\mu}_{k_{\max}}(t)\right|, \qquad (5)$$
$$\hat{\Theta}_t = \left\{\hat{\theta}_t\right\}. \qquad (6)$$
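For a finite grid standing in for a countable $\Theta$, the nearest-value selection in (5)–(6) can be sketched as follows (hypothetical names; ties are all kept rather than broken arbitrarily):

```python
import numpy as np

def confidence_set_min(mu, emp_mean, pulls):
    """UCB-min: keep only the theta whose mu_{k_max}(theta) is closest to
    the empirical mean of the most-pulled arm."""
    k_max = int(np.argmax(pulls))
    gap = np.abs(mu[k_max] - emp_mean[k_max])        # distance per candidate theta
    return gap == gap.min()                          # boolean mask over the grid
```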
The UCB-min algorithm is designed for structured bandit settings that satisfy the following assumption.
Assumption 1.
There exists $\epsilon_0 > 0$ such that for all suboptimal arms $k$,
$$\mu_k(\theta) < \mu_{k^*}(\theta) \quad \text{for all } \theta \in \Theta \text{ with } |\theta - \theta^*| \le \epsilon_0.$$
Informally, this means that the optimal arm at $\theta^*$ remains Competitive as long as $\theta$ lies within a distance $\epsilon_0$ of $\theta^*$. Assumption 1 is always satisfied when $\Theta$ is a countable set. Under this assumption, UCB-min enjoys regret guarantees similar to those of UCB-C. We have $\mathbb{E}[Reg(T)] = O((C-1)\log T)$, with $C$ denoting the number of competitive arms. As was the case with UCB-C, it also achieves bounded regret whenever possible. The regret upper bound for UCB-min, which depends on $\epsilon_0$, is given in the appendix.
6 Simulation Results
We now study the empirical performance of the proposed algorithms. For all simulations, rewards are drawn from a Gaussian distribution; i.e., the reward of arm $k$ is distributed as $\mathcal{N}(\mu_k(\theta^*), \sigma^2)$. In each result, we average the regret over independent experiments. We first show how UCB-C is able to achieve dimension reduction.
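For reference, a minimal simulation harness in the spirit of this section is sketched below. The three reward functions and the choice $\theta^* = 3$ (so that the true means are $1.1$, $1.6$ and $0.8$) are our own toy instance rather than the exact examples of Figures 4–9, and a single run per algorithm replaces the averaging used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, alpha, T = 1.0, 3.0, 5000
theta_grid = np.linspace(0, 6, 601)
mu = np.stack([2 - 0.3 * theta_grid,                 # toy mu_1(theta)
               1 + 0.2 * theta_grid,                 # toy mu_2(theta)
               0.5 + 0.1 * theta_grid])              # toy mu_3(theta)
true_mean = np.array([1.1, 1.6, 0.8])                # mu_k(theta*) at theta* = 3

def run(use_ucb_c):
    emp = rng.normal(true_mean, sigma)               # one initial pull per arm
    pulls = np.ones(3, dtype=int)
    regret = 0.0
    for t in range(3, T):
        arms = list(range(3))
        if use_ucb_c:                                # implicit exploration step
            k_max = int(np.argmax(pulls))
            rad = np.sqrt(2 * alpha * sigma**2 * np.log(t) / pulls[k_max])
            conf = np.abs(mu[k_max] - emp[k_max]) <= rad
            if conf.any():                           # guard: keep all arms if empty
                best = mu[:, conf].max(axis=0)
                arms = [k for k in range(3) if np.any(mu[k, conf] >= best)]
        idx = emp + np.sqrt(2 * alpha * sigma**2 * np.log(t) / pulls)
        k = max(arms, key=lambda a: idx[a])          # explicit exploration (UCB)
        r = rng.normal(true_mean[k], sigma)
        emp[k] = (emp[k] * pulls[k] + r) / (pulls[k] + 1)
        pulls[k] += 1
        regret += true_mean.max() - true_mean[k]
    return regret

print("UCB:", run(False), " UCB-C:", run(True))
```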
Dimension Reduction. In Figure 5, we compare the regret of the UCB-C algorithm with classic UCB for the example considered in Figure 4. In the setting of Figure 5(a), Arm 2 is optimal and Arms 1 and 3 are non-competitive. As expected from our regret analysis, in Figure 5(a) the algorithms UCB-C and UCB-int achieve bounded regret, while the regret of UCB grows logarithmically in the number of time steps $T$. In the setting of Figure 5(b), Arm 3 is optimal, Arm 2 becomes competitive and Arm 1 is non-competitive. In this case, we expect UCB-C and UCB-int to pull Arm 1 only $O(1)$ times, due to which we notice significantly reduced regret with UCB-C and UCB-int as compared to UCB in Figure 5(b). Figure 5(c) shows a case where Arm 1 is optimal and all the arms are competitive. Since UCB-C performs UCB over the set of Competitive arms at each round, its performance is similar to that of the UCB algorithm in this case. The UCB-int algorithm uses samples from all arms to generate the confidence set $\hat{\Theta}_t$, which helps it achieve empirically smaller regret in this setting.
Comparison with UCB-S. We now compare the performance of our UCB-C and TS-C algorithms against UCB and the UCB-S algorithm proposed in [4]. We consider the example shown in Figure 7. For one range of values of the parameter $\theta^*$, Arm 2 is optimal and Arm 1 is non-competitive; for the remaining values, Arm 1 is the optimal arm while Arm 2 is competitive. We plot the cumulative regret of the UCB, UCB-S, UCB-C and TS-C algorithms over $T$ time steps for a range of values of $\theta^*$ in Figure 8. When $\theta^*$ is such that Arm 1 is non-competitive, UCB-S, UCB-C and TS-C all obtain smaller regret than UCB, as they are able to identify the suboptimal arm as non-competitive. When the suboptimal arm is competitive, we see that UCB-C has performance similar to that of UCB.
We also see values of $\theta^*$ at which UCB-S achieves a regret that is quite large compared to even UCB. This is because UCB-S selects its arm optimistically over its confidence set, which causes the algorithm to prefer certain arms over others; this is the reason for its large regret at some values of $\theta^*$ and relatively smaller regret at others. We notice that TS-C achieves significantly smaller regret than the other algorithms, as Thompson sampling can offer a significant empirical improvement over UCB. This highlights that the ability to incorporate Thompson sampling in our algorithm is beneficial; such an extension to Thompson sampling was not possible in [4].
Performance of UCB-min. We now compare the performance of the UCB-min algorithm against the UCB-C and UCB algorithms for the example considered in Figure 7. In this example, $\Theta$ is a countable set, which allows us to use the UCB-min algorithm here. The aggressive nature of UCB-min in the selection of $\hat{\Theta}_t$ helps it achieve a smaller bounded regret than UCB-C, as shown in Figure 9.
7 Concluding Remarks
In this work, we studied a correlated bandit problem in which the rewards of different arms are correlated through a common shared parameter. By using the reward samples of one arm, we were able to generate estimates of the mean rewards of other arms. This approach allowed us to identify some suboptimal arms without having to explore them explicitly. The finite-time regret analysis of the proposed UCB-C algorithm reveals that it is able to reduce a $K$-armed bandit problem to a $C$-armed bandit problem. In addition, we showed that UCB-C achieves bounded regret whenever possible. Ongoing work includes the finite-time regret analysis of the TS-C algorithm. An interesting future direction is to study this problem when the number of arms is large. We also plan to study the best-arm identification version of this problem.
SUPPLEMENTARY MATERIAL
Appendix A Lower bound
We use the following result of [16] to establish Proposition 1.
Theorem 4 (Lower bound, Theorem 1 in [16]).
For any uniformly good algorithm [1], and for any $\theta^* \in \Theta$, we have:
$$\liminf_{T \to \infty} \frac{\mathbb{E}[Reg(T)]}{\log T} \ge C(\theta^*),$$
where $C(\theta^*)$ is the solution of the optimization problem:
$$C(\theta^*) = \min_{\eta \ge 0} \; \sum_{k \ne k^*} \eta_k \Delta_k \qquad (7)$$
$$\text{subject to} \quad \sum_{k \ne k^*} \eta_k \, KL(\theta^*, \lambda, k) \ge 1 \quad \text{for all } \lambda \in \Lambda(\theta^*). \qquad (8)$$
Here, $KL(\theta^*, \lambda, k)$ is the KL-divergence between the reward distributions of arm $k$ under the parameters $\theta^*$ and $\lambda$, and $\Lambda(\theta^*)$ is the set of confusing parameters $\lambda \in \Theta$ under which the optimal arm differs from $k^*$ while the reward distribution of arm $k^*$ is unchanged. An algorithm is uniformly good if $\mathbb{E}[Reg(T)] = o(T^a)$ for all $a > 0$ and all $\theta^* \in \Theta$.
We see that the solution to the optimization problem (7) is $0$ only when the set $\Lambda(\theta^*)$ is empty. The set $\Lambda(\theta^*)$ being empty corresponds to a case where all suboptimal arms are non-competitive. This implies that sub-logarithmic regret is possible only when $C = 1$, i.e., there is only one competitive arm, which is the optimal arm, and all other arms are non-competitive.
Appendix B Results for UCB-int and UCB-min
B.1 Regret bounds for UCB-int
We state the regret bounds for the UCB-int algorithm here.
Theorem 5 (Expected pulls for any arm).
The expected number of times any suboptimal arm $k$ is pulled by the UCB-int algorithm is upper bounded as
$$\mathbb{E}[n_k(T)] \le \frac{8\alpha\sigma^2}{\Delta_k^2}\log T + O(1). \qquad (9)$$
Theorem 6 (Expected pulls for non-competitive arms).
The expected number of times an $\epsilon$-non-competitive arm is pulled by the UCB-int algorithm is upper bounded by a constant, i.e.,
$$\mathbb{E}[n_k(T)] = O(1), \qquad (10)$$
where the constant depends on $\epsilon$, $\sigma^2$ and $\alpha$, but not on $T$.
Combining the above two results gives us the following regret bound for the UCB-int algorithm.
Theorem 7 (Regret upper bound).
The expected regret of the UCB-int algorithm is upper bounded as
$$\mathbb{E}[Reg(T)] \le \sum_{k \in \mathcal{C}\setminus\{k^*\}} \Delta_k \left(\frac{8\alpha\sigma^2}{\Delta_k^2}\log T + O(1)\right) + \sum_{k \notin \mathcal{C}} \Delta_k \cdot O(1).$$
Observe that this regret bound is similar to the regret bound for the UCB-C algorithm and enjoys the same properties of achieving bounded regret whenever possible and performing dimension reduction.
B.2 Regret bounds for UCB-min
We now present the performance guarantees of the UCB-min algorithm, under Assumption 1 and with $\epsilon_0$ as defined there. We have the following results on the expected number of pulls of arms.
Theorem 8 (Expected pulls for any arm).
The expected number of times any suboptimal arm $k$ is pulled by the UCB-min algorithm is upper bounded as
$$\mathbb{E}[n_k(T)] \le \frac{8\alpha\sigma^2}{\Delta_k^2}\log T + O(1). \qquad (15)$$
Theorem 9 (Expected pulls for non-competitive arms).
The expected number of times an $\epsilon$-non-competitive arm is pulled by the UCB-min algorithm is upper bounded by a constant that depends on $\epsilon_0$, $\sigma^2$ and $\alpha$, but not on $T$, i.e.,
$$\mathbb{E}[n_k(T)] = O(1). \qquad (16)$$
Combining the above two results yields the following regret bound for the UCB-min algorithm.
Theorem 10 (Regret upper bound).
The expected regret of the UCB-min algorithm is upper bounded as
$$\mathbb{E}[Reg(T)] \le \sum_{k \in \mathcal{C}\setminus\{k^*\}} \Delta_k \left(\frac{8\alpha\sigma^2}{\Delta_k^2}\log T + O(1)\right) + \sum_{k \notin \mathcal{C}} \Delta_k \cdot O(1).$$
Observe that this regret bound is similar to the regret bound for the UCB-C algorithm and enjoys the same properties of achieving bounded regret whenever possible and performing dimension reduction.
Appendix C Proofs for the UCB-C Algorithm
Fact 1 (Hoeffding’s inequality).
Let $X_1, \ldots, X_n$ be i.i.d. random variables, where each $X_i$ is sub-Gaussian with variance proxy $\sigma^2$ and mean $\mu$. Then, for any $\delta > 0$,
$$P\left(\left|\bar{X} - \mu\right| \ge \delta\right) \le 2\exp\left(-\frac{n\delta^2}{2\sigma^2}\right).$$
Here $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the empirical mean of the $X_i$.
Lemma 1.
The probability that the true parameter lies outside the confidence set decays with the total number of pulls, i.e., $t$, as
$$P\left(\theta^* \notin \hat{\Theta}_t\right) \le 2t^{1-\alpha}.$$
Proof.
Observe that,
$$P\left(\theta^* \notin \hat{\Theta}_t\right) = P\left(\left|\mu_{k_{\max}}(\theta^*) - \hat{\mu}_{k_{\max}}(t)\right| > \sqrt{\frac{2\alpha\sigma^2\log t}{n_{k_{\max}}(t)}}\right)$$
$$\le \sum_{s=1}^{t} P\left(\left|\mu_{k_{\max}}(\theta^*) - \bar{\mu}_{k_{\max},s}\right| > \sqrt{\frac{2\alpha\sigma^2\log t}{s}}\right) \qquad (19)$$
$$\le \sum_{s=1}^{t} 2\exp\left(-\alpha\log t\right) \qquad (20)$$
$$= 2t^{1-\alpha}. \qquad (21)$$
We have (19) from the union bound; this is a standard trick to deal with the random variable $n_{k_{\max}}(t)$, as it can take any value from $1$ to $t$. We use this trick repeatedly in the paper whenever we encounter such expressions. The true mean of arm $k_{\max}$ is $\mu_{k_{\max}}(\theta^*)$. Therefore, if $\bar{\mu}_{k_{\max},s}$ denotes the empirical mean of arm $k_{\max}$ taken over $s$ pulls of the arm, then (20) follows from Fact 1 with $\delta$ in Fact 1 equal to $\sqrt{2\alpha\sigma^2\log t / s}$. ∎
Lemma 2.
Define $E(t)$ to be the event that the optimal arm $k^*$ is Non-competitive in round $t+1$. Then,
$$P\left(E(t)\right) \le 2t^{1-\alpha}.$$
Proof.
If $\theta^* \in \hat{\Theta}_t$, then arm $k^*$ is Competitive, since $\mu_{k^*}(\theta^*) = \max_{k} \mu_k(\theta^*)$. Hence $E(t) \subseteq \{\theta^* \notin \hat{\Theta}_t\}$, and the result follows from Lemma 1. ∎
Lemma 3.
If $s \ge \frac{8\alpha\sigma^2\log t}{\Delta_k^2}$, then,
$$P\left(k_{t+1} = k \mid n_k(t) = s\right) = O\left(t^{1-\alpha}\right),$$
where $k$ is a suboptimal arm.
Proof.
The probability that arm $k$ is pulled at step $t+1$, given that it has been pulled $s$ times, can be bounded as follows:
$$P\left(k_{t+1} = k \mid n_k(t) = s\right) \le P\left(k \in \mathcal{C}_t, \; I_k(t) \ge I_{k^*}(t) \mid n_k(t) = s\right) \qquad (25)$$
$$\le P\left(E(t) \cup \left\{I_k(t) \ge I_{k^*}(t)\right\} \mid n_k(t) = s\right) \qquad (26)$$
$$\le P\left(E(t)\right) + P\left(I_k(t) \ge I_{k^*}(t) \mid n_k(t) = s\right) \qquad (27)$$
$$\le 2t^{1-\alpha} + P\left(I_k(t) \ge I_{k^*}(t) \mid n_k(t) = s\right). \qquad (28)$$
Here, (26) holds because arm $k$ can be pulled only if the optimal arm $k^*$ is eliminated as Non-competitive (the event $E(t)$) or if $k$ beats $k^*$ in the index comparison; (27) follows from the union bound and (28) follows from Lemma 2. We now bound the second term. The event $\{I_k(t) \ge I_{k^*}(t)\}$ implies that at least one of the following three events occurs:
$$\hat{\mu}_k(t) \ge \mu_k(\theta^*) + \sqrt{\frac{2\alpha\sigma^2\log t}{s}}, \qquad \hat{\mu}_{k^*}(t) \le \mu_{k^*}(\theta^*) - \sqrt{\frac{2\alpha\sigma^2\log t}{n_{k^*}(t)}},$$
$$\text{or} \qquad \mu_{k^*}(\theta^*) < \mu_k(\theta^*) + 2\sqrt{\frac{2\alpha\sigma^2\log t}{s}}.$$
The third event is impossible when $s \ge \frac{8\alpha\sigma^2\log t}{\Delta_k^2}$, since then $2\sqrt{2\alpha\sigma^2\log t / s} \le \Delta_k$. The probability of the first event is at most $t^{-\alpha}$ by Fact 1. The probability of the second event is bounded using Fact 1, together with the union bound over the possible values of $n_{k^*}(t)$, and the fact that