Bandit Convex Optimization in Non-stationary Environments
Abstract
Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the function value at one or two points. In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence, which reflects the non-stationarity of environments. We propose a novel algorithm that achieves $\mathcal{O}(T^{3/4}(1+P_T)^{1/2})$ and $\mathcal{O}(\sqrt{T(1+P_T)})$ dynamic regret for the one-point and two-point feedback models, respectively. The latter result is optimal, matching the lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length, which is generally unknown ahead of time.
Keywords: Bandit Convex Optimization, Dynamic Regret, Non-stationary Environments
1 Introduction
Online Convex Optimization (OCO) is a powerful tool for modeling sequential decision-making problems, which can be regarded as an iterative game between the player and the environments (Shalev-Shwartz, 2012). At iteration $t$, the player commits to a decision $x_t$ from a convex feasible set $\mathcal{X} \subseteq \mathbb{R}^d$; simultaneously, a convex function $f_t: \mathcal{X} \mapsto \mathbb{R}$ is revealed by the environments, and then the player suffers an instantaneous loss $f_t(x_t)$. The standard performance measure is the regret,
$$\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \tag{1}$$
which is the difference between the cumulative loss of the player and that of the best fixed decision in hindsight. To emphasize the fact that the comparator in (1) is fixed, it is called static regret.
There are two setups for online convex optimization according to the information that the environments reveal (Hazan, 2016). In the full-information setup, the player has all the information of the function $f_t$, including its gradients over $\mathcal{X}$. By contrast, in the bandit setup, the instantaneous loss $f_t(x_t)$ is the only feedback available to the player. In this paper, we focus on the latter case, which is referred to as bandit convex optimization (BCO).
BCO has attracted considerable attention because it successfully models many real-world scenarios where the feedback available to the decision maker is partial or incomplete (Hazan, 2016). The key challenge lies in the limited feedback, i.e., the player has no access to gradients of the function. In the standard one-point feedback model, the only feedback is the one-point function value, based on which Flaxman et al. (2005) constructed an unbiased estimator of the gradient and then appealed to the online gradient descent algorithm developed in the full-information setting (Zinkevich, 2003) to establish an expected $\mathcal{O}(T^{3/4})$ regret. Another common variant is the two-point feedback model, where the player is allowed to query function values of two points at each iteration. Agarwal et al. (2010) demonstrated an optimal $\mathcal{O}(\sqrt{T})$ regret for convex functions under this feedback model. Algorithms and regret bounds are further developed in later studies (Saha and Tewari, 2011; Hazan and Levy, 2014; Bubeck et al., 2015; Dekel et al., 2015; Yang and Mohri, 2016; Bubeck et al., 2017).
Feedback model  Dynamic regret  Type  Param-free  Reference
one-point  $\mathcal{O}(T^{3/4}(1+P_T^*)^{1/2})$  worst-case  no  Chen and Giannakis (2019)
one-point  $\mathcal{O}(T^{3/4}(1+P_T)^{1/2})$  universal  yes  This work
two-point  $\mathcal{O}(\sqrt{T(1+P_T^*)})$  worst-case  no  Yang et al. (2016)
two-point  $\mathcal{O}(\sqrt{T(1+P_T^*)})$  worst-case  no  Chen and Giannakis (2019)
two-point  $\mathcal{O}(\sqrt{T(1+P_T)})$  universal  yes  This work

Table 1: Comparisons of dynamic regret bounds for BCO problems, where $P_T$ is the path-length of an arbitrary feasible comparator sequence and $P_T^*$ is that of the minimizer sequence.
Note that the static regret in (1) compares with a fixed benchmark, so it implicitly assumes that there is a reasonably good decision over all iterations. Unfortunately, this may not be true in non-stationary environments, where the underlying distribution of online functions changes. To address this limitation, the notion of dynamic regret was introduced by Zinkevich (2003) and defined as the difference between the cumulative loss of the player and that of a comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$,
$$\text{D-Regret}_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t). \tag{2}$$
In contrast to the fixed benchmark in the static regret, dynamic regret compares with a changing comparator sequence and is therefore more suitable in non-stationary environments. We remark that (2) is also called the universal dynamic regret, since it holds universally for any feasible comparator sequence. In the literature, there is a variant named the worst-case dynamic regret (Besbes et al., 2015), which specifies the comparator sequence to be the minimizers of the online functions, namely, $u_t \in \arg\min_{x \in \mathcal{X}} f_t(x)$. As pointed out by Zhang et al. (2018a), the universal dynamic regret is more desirable, because the worst-case dynamic regret is typically too pessimistic while the universal one is more adaptive to the non-stationarity of environments. Moreover, the universal dynamic regret is more general since it accommodates the worst-case dynamic regret and the static regret as special cases.
Recently, there have been some studies on the dynamic regret of BCO problems (Yang et al., 2016; Chen and Giannakis, 2019). They provide worst-case dynamic regret bounds only, and their algorithms require as input certain quantities that are generally unknown in advance. Therefore, it is desirable to design algorithms that enjoy universal dynamic regret for BCO problems.
In this paper, we start with the bandit gradient descent (BGD) algorithm of Flaxman et al. (2005) and analyze its universal dynamic regret. We demonstrate that the optimal parameter configuration of vanilla BGD also requires prior information of the unknown path-length. To address this issue, we propose the Parameter-free Bandit Gradient Descent algorithm (PBGD), which is inspired by the strategy of maintaining multiple learning rates (van Erven and Koolen, 2016). Our approach is essentially an online ensemble method (Zhou, 2012), consisting of a meta-algorithm and expert-algorithms. The basic idea is to maintain a pool of candidate parameters and invoke multiple instances of the expert-algorithm simultaneously, where each expert-algorithm is associated with a candidate parameter. Next, the meta-algorithm combines predictions from the expert-algorithms by an expert-tracking algorithm (Cesa-Bianchi and Lugosi, 2006). However, running multiple expert-algorithms with different parameters simultaneously is prohibited in BCO problems, since the player is only allowed to query one/two points in the bandit setup. To overcome this difficulty, we carefully design a surrogate function, as the linearization of the smoothed version of the loss function in the sense of expectation, and make the strategy suitable for bandit convex optimization. Our algorithm and analysis accommodate one-point and two-point feedback models, and Table 1 summarizes existing dynamic regret for BCO problems and our results. The main contributions of this work are listed as follows.

We establish the first universal dynamic regret for the bandit gradient descent algorithm, which supports comparison with any feasible comparator sequence, in a unified analysis framework.

We propose a parameter-free algorithm, which does not require knowing the upper bound of the path-length ahead of time, and meanwhile enjoys state-of-the-art dynamic regret.

We establish the first minimax lower bound of the universal dynamic regret for BCO problems.
The rest of the paper is structured as follows. Section 2 briefly reviews related work. In Section 3, we introduce the bandit gradient descent algorithm for BCO problems and provide the dynamic regret analysis. Section 4 presents the parameterfree BGD algorithm, the main contribution of this paper, with dynamic regret analysis. Next, in Section 5, we establish the lower bound and provide several extensions. Section 6 and Section 7 present the proofs of main results. Section 8 concludes the paper and discusses future directions.
2 Related Work
We briefly introduce related work of bandit convex optimization and dynamic regret.
2.1 Bandit Convex Optimization
In the bandit convex optimization setting, the player is only allowed to query function values of one point or two points, and the gradient information is not accessible, as opposed to the full-information setting.
For the one-point feedback model, the seminal work of Flaxman et al. (2005) constructed an unbiased gradient estimator and established an expected $\mathcal{O}(T^{3/4})$ regret for convex and Lipschitz functions. A similar result was independently obtained by Kleinberg (2004). Later, an $\mathcal{O}(T^{2/3})$ rate was shown to be attainable with either strong convexity (Agarwal et al., 2010) or smoothness (Saha and Tewari, 2011). When functions are both strongly convex and smooth, Hazan and Levy (2014) designed a novel algorithm that achieves a regret of $\widetilde{\mathcal{O}}(\sqrt{T})$ based on the follow-the-regularized-leader framework with self-concordant barriers, matching the $\Omega(\sqrt{T})$ lower bound (Shamir, 2013) up to logarithmic factors. Furthermore, recent breakthroughs (Bubeck et al., 2015, 2017) showed that an $\widetilde{\mathcal{O}}(\sqrt{T})$ regret is attainable for convex and Lipschitz functions, though with a high dependence on the dimension $d$.
BCO with two-point feedback was proposed and studied by Agarwal et al. (2010), and was also independently studied in the context of stochastic optimization (Nesterov, 2011). Agarwal et al. (2010) first established expected regrets of $\mathcal{O}(\sqrt{T})$ and $\mathcal{O}(\log T)$ for convex Lipschitz and strongly convex Lipschitz functions, respectively. These bounds were proved to be minimax optimal in terms of $T$ (Agarwal et al., 2010), and the dependence on the dimension $d$ was later improved to be optimal (Shamir, 2017).
2.2 Dynamic Regret
There are two types of dynamic regret as aforementioned. The universal dynamic regret holds universally for any feasible comparator sequence, while the worst-case one only compares with the sequence of minimizers of the online functions.
For the universal dynamic regret, existing results are only limited to the full-information setting. Zinkevich (2003) showed that OGD achieves an $\mathcal{O}(\sqrt{T}(1+P_T))$ regret, where $P_T$ is the path-length of the comparator sequence $u_1, \ldots, u_T$,
$$P_T = \sum_{t=2}^{T} \|u_t - u_{t-1}\|_2. \tag{3}$$
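As a concrete illustration, the path-length in (3) is a one-liner in code; a minimal sketch (the function name is ours):

```python
import numpy as np

def path_length(comparators):
    """P_T = sum_{t=2}^T ||u_t - u_{t-1}||_2 for a sequence of comparators."""
    u = np.asarray(comparators, dtype=float)
    return float(np.sum(np.linalg.norm(u[1:] - u[:-1], axis=1)))

# A fixed comparator has zero path-length, recovering static regret as a special case.
print(path_length([[1.0, 2.0]] * 5))            # 0.0
print(path_length([[0.0, 0.0], [3.0, 4.0]]))    # 5.0
```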
Recently, Zhang et al. (2018a) demonstrated that this upper bound is not optimal by establishing an $\Omega(\sqrt{T(1+P_T)})$ lower bound, and further proposed an algorithm that attains an optimal $\mathcal{O}(\sqrt{T(1+P_T)})$ dynamic regret for convex functions. However, there is no universal dynamic regret guarantee in the bandit setting.
For the worst-case dynamic regret, there are many studies in the full-information setting (Besbes et al., 2015; Jadbabaie et al., 2015; Yang et al., 2016; Mokhtari et al., 2016; Zhang et al., 2017) as well as a few works in the bandit setting (Gur et al., 2014; Yang et al., 2016; Luo et al., 2018; Auer et al., 2019; Cheung et al., 2019; Chen and Giannakis, 2019; Zhao et al., 2020). In bandit convex optimization, when an upper bound of $P_T^*$ is known, Yang et al. (2016) established an $\mathcal{O}(\sqrt{T(1+P_T^*)})$ dynamic regret for the two-point feedback model. Here, $P_T^* = \sum_{t=2}^{T}\|x_t^* - x_{t-1}^*\|_2$ is the path-length of the minimizer sequence, where $x_t^* \in \arg\min_{x \in \mathcal{X}} f_t(x)$. Later, Chen and Giannakis (2019) applied BCO techniques to dynamic Internet-of-Things management, showing $\mathcal{O}(T^{3/4}(1+P_T^*)^{1/2})$ and $\mathcal{O}(\sqrt{T(1+P_T^*)})$ dynamic regret bounds respectively for the one-point and two-point feedback models.
Another closely related performance measure for online convex optimization in non-stationary environments is the adaptive regret (Hazan and Seshadhri, 2009), which is defined as the maximum of the "local" static regret over every time interval $[s, e] \subseteq [T]$,
$$\max_{[s, e] \subseteq [T]} \left\{ \sum_{t=s}^{e} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=s}^{e} f_t(x) \right\}.$$
Hazan and Seshadhri (2009) proposed an efficient algorithm that enjoys $\mathcal{O}(\sqrt{T}\log^3 T)$ and $\mathcal{O}(d \log^2 T)$ adaptive regret for convex and exponentially concave functions, respectively. The rate for convex functions was later improved (Daniely et al., 2015; Jun et al., 2017). Moreover, Zhang et al. (2018b) investigated the relation between adaptive regret and the worst-case dynamic regret.
3 Bandit Gradient Descent (BGD)
In this section, we provide assumptions used in the paper, then present the bandit gradient descent (BGD) algorithm for BCO problems, as well as its universal dynamic regret. To the best of our knowledge, this is the first work that analyzes the universal dynamic regret of BGD.
3.1 Assumptions
We make the following common assumptions for bandit convex optimization (Flaxman et al., 2005; Agarwal et al., 2010).
Assumption 1 (Bounded Region).
The feasible set $\mathcal{X}$ contains the ball of radius $r$ centered at the origin and is contained in the ball of radius $R$, namely,
$$r\mathbb{B} \subseteq \mathcal{X} \subseteq R\mathbb{B}, \tag{4}$$
where $\mathbb{B} = \{x \in \mathbb{R}^d \mid \|x\|_2 \le 1\}$ denotes the unit Euclidean ball.
Assumption 2 (Bounded Function Value).
The absolute values of all the functions are bounded by $C$, namely,
$$|f_t(x)| \le C, \quad \forall x \in \mathcal{X},\ t \in [T]. \tag{5}$$
Assumption 3 (Lipschitz Continuity).
All the functions are $L$-Lipschitz continuous over the domain $\mathcal{X}$, that is, for all $x, y \in \mathcal{X}$, we have
$$|f_t(x) - f_t(y)| \le L\|x - y\|_2, \quad \forall t \in [T]. \tag{6}$$
Meanwhile, we assume that the loss functions and the comparator sequence are chosen by an oblivious adversary.
3.2 Algorithm and Regret Analysis
In this part, we present the algorithm and regret analysis of bandit gradient descent.
We start from online gradient descent (OGD), developed in the full-information setting (Zinkevich, 2003). OGD begins with an arbitrary $x_1 \in \mathcal{X}$ and performs
$$x_{t+1} = \Pi_{\mathcal{X}}\big[x_t - \eta \nabla f_t(x_t)\big], \tag{7}$$
where $\eta > 0$ is the step size and $\Pi_{\mathcal{X}}[\cdot]$ denotes the projection onto the nearest point in $\mathcal{X}$, i.e., $\Pi_{\mathcal{X}}[x] = \arg\min_{y \in \mathcal{X}} \|x - y\|_2$.
The key challenge of BCO problems is the lack of gradients. Therefore, Flaxman et al. (2005) and Agarwal et al. (2010) proposed to replace $\nabla f_t(x_t)$ in (7) with a gradient estimator $g_t$, obtained by evaluating the function at one (in the one-point feedback model) or two random points (in the two-point feedback model) around $x_t$. Details will be presented later. We unify their algorithms in Algorithm 1, called Bandit Gradient Descent (BGD). Notice that in lines 8 and 14 of the algorithm, the iterate is projected onto a slightly smaller set $(1-\alpha)\mathcal{X}$ instead of $\mathcal{X}$, to ensure that the final decision, a perturbation of the iterate, lies in the feasible set $\mathcal{X}$. In the following, we describe the gradient estimator and analyze the universal dynamic regret for each model.
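The update just described can be sketched in a few lines. Below is a minimal illustration of one BGD round with one-point feedback, assuming for simplicity that the feasible set is a Euclidean ball of radius $R$ (the function names and the shrinkage parameter `alpha` are ours; this is not Algorithm 1's exact pseudocode):

```python
import numpy as np

def project_ball(x, radius):
    # Euclidean projection onto a centered ball (our stand-in for the projection step).
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def bgd_one_point_step(f, x, eta, delta, d, R, alpha, rng):
    """One BGD round with one-point feedback (sketch, ball-shaped domain assumed).

    Plays y = x + delta * s, observes only f(y), forms the one-point estimator
    g = (d / delta) * f(y) * s, and projects the update onto the shrunk set (1 - alpha) * X
    so that the perturbed point played next round stays feasible.
    """
    s = rng.standard_normal(d)
    s /= np.linalg.norm(s)               # uniform random direction on the unit sphere
    y = x + delta * s                    # the point actually played / queried
    g = (d / delta) * f(y) * s           # unbiased gradient of the smoothed loss
    x_next = project_ball(x - eta * g, (1.0 - alpha) * R)
    return y, x_next
```

For a small perturbation $\delta \le \alpha r$, the played point `y` remains inside the original feasible set even though the iterate lives in the shrunk set.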
One-Point Feedback Model.
Flaxman et al. (2005) proposed the following gradient estimator,
$$g_t = \frac{d}{\delta}\, f_t(x_t + \delta s_t)\, s_t, \tag{8}$$
where $s_t \in \mathbb{S}$ is a unit vector selected uniformly at random and $\delta > 0$ is the perturbation parameter. Then, the following lemma (Flaxman et al., 2005, Lemma 2.1) guarantees that (8) is an unbiased gradient estimator of a smoothed version of the loss function $f_t$.
Lemma 1.
For any convex (but not necessarily differentiable) function $f$, define its smoothed version $\hat{f}(x) = \mathbb{E}_{v \in \mathbb{B}}[f(x + \delta v)]$. Then, for any $x$,
$$\nabla \hat{f}(x) = \frac{d}{\delta}\, \mathbb{E}_{s \in \mathbb{S}}\big[f(x + \delta s)\, s\big], \tag{9}$$
where $\mathbb{S}$ is the unit sphere centered at the origin, namely, $\mathbb{S} = \{x \in \mathbb{R}^d \mid \|x\|_2 = 1\}$.
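The lemma can be checked numerically. A Monte-Carlo sketch for the quadratic $f(x) = \|x\|_2^2$ (our choice of test function): ball-smoothing a quadratic by a symmetric perturbation only adds a constant, so $\nabla \hat{f}(x) = \nabla f(x) = 2x$, and the one-point estimator should average to exactly that.

```python
import numpy as np

rng = np.random.default_rng(42)
d, delta, n = 3, 0.1, 200_000
x = np.array([0.5, -0.2, 0.1])

S = rng.standard_normal((n, d))
S /= np.linalg.norm(S, axis=1, keepdims=True)         # n directions, uniform on the sphere
vals = np.sum((x + delta * S) ** 2, axis=1)           # f evaluated at the perturbed points
est = (d / delta) * (vals[:, None] * S).mean(axis=0)  # Monte-Carlo mean of the estimator
print(np.round(est, 2))                               # close to 2*x = [1.0, -0.4, 0.2]
```

Note the large single-sample magnitude, of order $d/\delta$, which foreshadows the variance issue discussed for the two-point model below.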
Therefore, we adopt $g_t$ in (8) to perform the online gradient descent update (7). The main update procedures of the one-point feedback model are summarized in case 1 of Algorithm 1. We have the following result regarding its universal dynamic regret.
Theorem 1.
Remark 1.
By setting $\eta = \mathcal{O}\big(T^{-3/4}(1+P_T)^{3/4}\big)$ and $\delta = \mathcal{O}\big(T^{-1/4}(1+P_T)^{1/4}\big)$, we obtain an $\mathcal{O}\big(T^{3/4}(1+P_T)^{1/4}\big)$ dynamic regret. However, such a configuration requires prior knowledge of $P_T$, which is generally unavailable. We will develop a parameter-free algorithm to eliminate this undesired dependence later.
Two-Point Feedback Model.
In this setup, the player is allowed to query two points, $x_t + \delta s_t$ and $x_t - \delta s_t$. Then, the function values $f_t(x_t + \delta s_t)$ and $f_t(x_t - \delta s_t)$ are revealed as the feedback. We use the following gradient estimator (Agarwal et al., 2010),
$$g_t = \frac{d}{2\delta}\Big(f_t(x_t + \delta s_t) - f_t(x_t - \delta s_t)\Big)\, s_t. \tag{11}$$
The major limitation of the one-point gradient estimator (8) is that it has a potentially large magnitude, proportional to $dC/\delta$, which is usually quite large since the perturbation parameter $\delta$ is typically small. This is avoided by the two-point gradient estimator (11), whose magnitude can be upper bounded by $dL$, independent of the perturbation parameter $\delta$. This crucial advantage leads to a substantial improvement in the dynamic regret (and also the static regret).
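The $\delta$-independence of the two-point estimator's norm is easy to verify numerically: for an $L$-Lipschitz function, $|f(x + \delta s) - f(x - \delta s)| \le 2\delta L$, so the $2\delta$ cancels. A small sketch (function name is ours):

```python
import numpy as np

def two_point_estimator(f, x, delta, s, d):
    # g_t = (d / (2*delta)) * (f(x + delta*s) - f(x - delta*s)) * s
    return (d / (2.0 * delta)) * (f(x + delta * s) - f(x - delta * s)) * s

# For an L-Lipschitz f the estimator norm is at most d*L no matter how small delta is,
# unlike the one-point estimator whose magnitude scales as 1/delta.
rng = np.random.default_rng(0)
d, L = 5, 1.0
f = lambda z: float(np.linalg.norm(z))   # a 1-Lipschitz test function
x = rng.standard_normal(d)
for delta in (1e-1, 1e-4, 1e-8):
    s = rng.standard_normal(d)
    s /= np.linalg.norm(s)
    g = two_point_estimator(f, x, delta, s, d)
    assert np.linalg.norm(g) <= d * L + 1e-6
print("norm bound d*L holds for every tested delta")
```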
Theorem 2.
Remark 2.
By setting $\eta = \mathcal{O}\big(\sqrt{(1+P_T)/T}\big)$ and a sufficiently small $\delta$, the BGD algorithm achieves an $\mathcal{O}\big(\sqrt{T(1+P_T)}\big)$ dynamic regret. However, this configuration has an unpleasant dependence on the unknown quantity $P_T$, which will be removed in the next part.
4 Parameter-Free BGD
From Theorems 1 and 2, we observe that the optimal parameter configurations of the BGD algorithm require knowing the path-length $P_T$ in advance, which is generally unknown. In this section, we develop a parameter-free algorithm to address this limitation.
The fundamental obstacle in obtaining universal dynamic regret guarantees is that the path-length $P_T$ remains unknown even after all iterations, since the comparator sequence can be chosen arbitrarily from the feasible set. Therefore, the well-known doubling trick (Cesa-Bianchi et al., 1997) is not applicable for removing the dependence on the unknown path-length. Another possible technique is to grid-search the optimal parameter by maintaining multiple learning rates in parallel and using expert-tracking algorithms to combine predictions and track the best parameter (van Erven and Koolen, 2016). However, it is infeasible to directly apply this method to bandit convex optimization because of the inherent difficulty of the bandit setting: the player is only allowed to query the function value once (or twice) at each iteration.
To address this issue, we need a closer investigation of the dynamic regret analysis of BCO problems. Taking the one-point feedback model as an example, the expected dynamic regret can be decomposed into three terms,
$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(y_t)\Big] - \sum_{t=1}^{T} f_t(u_t) = \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} \hat{f}_t(x_t) - \sum_{t=1}^{T} \hat{f}_t(v_t)\Big]}_{\text{term (a)}} + \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} f_t(y_t) - \sum_{t=1}^{T} \hat{f}_t(x_t)\Big]}_{\text{term (b)}} + \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} \hat{f}_t(v_t)\Big] - \sum_{t=1}^{T} f_t(u_t)}_{\text{term (c)}} \tag{13}$$
where $y_t = x_t + \delta s_t$ is the point actually queried and $v_1, \ldots, v_T$ is the scaled comparator sequence, set as $v_t = (1-\alpha)u_t$. It turns out that term (b) and term (c) can be bounded by $\mathcal{O}(\delta T)$ and $\mathcal{O}\big((\delta + \alpha)T\big)$ respectively, without involving the unknown path-length; the rigorous argument can be found in (23) and (24) of Section 6.1. Hence, it suffices to design parameter-free algorithms to optimize term (a), i.e., the dynamic regret of the smoothed loss functions $\hat{f}_t$.
However, it remains infeasible to maintain multiple learning rates for optimizing the dynamic regret of $\hat{f}_t$. Suppose there are $N$ experts in total, where each expert is associated with a learning rate (step size); then at iteration $t$, the expert-algorithms would require the gradient information at $N$ different points to perform bandit gradient descent. This necessitates querying function values of the original loss $f_t$ at $N$ points, which is prohibited in bandit convex optimization.
Fortunately, we discover that the expected dynamic regret of can be upper bounded by that of a linear function, as demonstrated in the following proposition.
Proposition 1.
For any feasible comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$,
$$\mathbb{E}\Big[\sum_{t=1}^{T} \hat{f}_t(x_t) - \sum_{t=1}^{T} \hat{f}_t(u_t)\Big] \le \mathbb{E}\Big[\sum_{t=1}^{T} \langle g_t, x_t - u_t \rangle\Big]. \tag{14}$$
This feature motivates us to design the following surrogate loss function $\ell_t$,
$$\ell_t(x) = \langle g_t, x \rangle, \tag{15}$$
which can be regarded as a linearization of the smoothed function $\hat{f}_t$ at the point $x_t$ in terms of expectation. Furthermore, the surrogate loss function enjoys the following two properties.
Property 1.
$\nabla \ell_t(x) = g_t$, $\forall x \in \mathcal{X}$.
Property 2.
For any feasible comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$,
$$\mathbb{E}\Big[\sum_{t=1}^{T} \hat{f}_t(x_t) - \sum_{t=1}^{T} \hat{f}_t(u_t)\Big] \le \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(x_t) - \sum_{t=1}^{T} \ell_t(u_t)\Big]. \tag{16}$$
Property 1 follows from the definition of surrogate loss, and Proposition 1 immediately implies Property 2. These two properties are simple yet quite useful, and they together make the grid search feasible in bandit convex optimization. Concretely speaking,

Property 1 implies that we can now initialize multiple experts to perform bandit gradient descent over the surrogate loss, where each expert is associated with a specific learning rate, since all the gradients essentially equal $g_t$, which can be obtained by querying the function value of $f_t$ only once.

Property 2 guarantees that the expected dynamic regret of the smoothed functions $\hat{f}_t$ is upper bounded by that of the surrogate losses $\ell_t$.
Consequently, we propose to optimize the surrogate loss instead of the original loss (or its smoothed version $\hat{f}_t$). We note that the idea of constructing a surrogate loss for maintaining multiple learning rates was originally proposed by van Erven and Koolen (2016), but for a different purpose: they construct a quadratic upper bound of the original loss as the surrogate, with the aim of adapting to the potential curvature of online functions in full-information online convex optimization. In this paper, we design the surrogate loss as the linearization of the smoothed function in terms of expectation, to make the grid search of the optimal parameter doable in bandit convex optimization. To the best of our knowledge, this is the first work to optimize a surrogate loss for maintaining multiple learning rates in the bandit setup.
In the following, we describe the design details of the parameter-free algorithm for the one-point feedback model, and present the configurations for BCO with the two-point feedback model later (in Section 7.3).
In the one-point feedback model, the optimal step size is $\eta_* = \mathcal{O}\big(T^{-3/4}(1+P_T)^{3/4}\big)$, whose value is unavailable due to the unknown path-length $P_T$. Nevertheless, we confirm that
$$\eta_{\min} \le \eta_* \le \eta_{\max}, \quad \text{with } \eta_{\min} = \mathcal{O}\big(T^{-3/4}\big) \text{ and } \eta_{\max} = \mathcal{O}\big(T^{-3/4}(1+2RT)^{3/4}\big), \tag{17}$$
always holds from the non-negativity and boundedness of the path-length ($0 \le P_T \le 2RT$). Hence, we first construct the following pool of candidate step sizes to discretize the range of the optimal parameter in (17),
$$\mathcal{H} = \Big\{ \eta_i = 2^{i-1} \eta_{\min} \;\Big|\; i = 1, \ldots, N \Big\}, \tag{18}$$
where $N = \lceil \log_2(\eta_{\max}/\eta_{\min}) \rceil + 1 = \mathcal{O}(\log T)$. The above configuration ensures there exists an index $k$ such that $\eta_k \le \eta_* \le 2\eta_k$. More intuitively, there is a step size in the pool that is not optimal but sufficiently close to $\eta_*$. Next, we instantiate $N$ expert-algorithms, where the $i$-th expert is a BGD algorithm with parameters $\eta_i$ and $\delta$. Finally, we adopt an expert-tracking algorithm as the meta-algorithm to combine predictions from all the experts and produce the final decision. Owing to the nice theoretical guarantees of the meta-algorithm, the dynamic regret of the final decisions is comparable to that of the best expert, i.e., the expert-algorithm with the near-optimal step size.
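The geometric grid of candidate step sizes can be built in a few lines. A sketch (names and the endpoint values are ours) illustrating the two facts that matter: the pool size is only logarithmic, and every step size in the covered range has a 2-approximation in the pool:

```python
import math

def step_size_pool(eta_min, eta_max):
    """Geometric grid {eta_min * 2^(i-1)} covering [eta_min, eta_max].

    The pool has O(log(eta_max / eta_min)) elements, and for any eta* in the range
    there is some eta_i in the pool with eta_i <= eta* < 2 * eta_i.
    """
    N = math.ceil(math.log2(eta_max / eta_min)) + 1
    return [eta_min * (2 ** (i - 1)) for i in range(1, N + 1)]

pool = step_size_pool(1e-4, 1.0)
print(len(pool))  # 15 -- logarithmic in the range, not linear in T
```

Since running one expert per candidate costs only a projected gradient step, the whole ensemble adds a mere $\mathcal{O}(\log T)$ overhead per round.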
We present descriptions of the expert-algorithm and meta-algorithm of PBGD as follows.
Expert-algorithm.
For each candidate step size $\eta_i$ from the pool $\mathcal{H}$, we initialize an expert, and the expert performs online gradient descent over the surrogate loss defined in (15),
$$x_{t+1}^{i} = \Pi_{(1-\alpha)\mathcal{X}}\big[x_t^{i} - \eta_i \nabla \ell_t(x_t^{i})\big] = \Pi_{(1-\alpha)\mathcal{X}}\big[x_t^{i} - \eta_i g_t\big], \tag{19}$$
where $\eta_i$ is the step size of expert $i$, shown in (18).
The above update procedure once again demonstrates the necessity of constructing the surrogate loss. Due to the nice property of the surrogate loss (Property 1), at each iteration all the experts can perform exact online gradient descent along the same direction $g_t$. By contrast, suppose each expert were run over the smoothed loss function $\hat{f}_t$; then at each iteration it would require querying multiple gradients $\nabla \hat{f}_t(x_t^i)$, or equivalently, multiple function values of $f_t$, which are unavailable in bandit convex optimization.
Meta-algorithm.
To combine predictions returned from the various experts, we adopt the exponentially weighted average forecaster (Cesa-Bianchi and Lugosi, 2006) with nonuniform initial weights as the meta-algorithm, whose inputs are the pool of candidate step sizes $\mathcal{H}$ in (18) and its own learning rate $\varepsilon$. The nonuniform initialization of weights makes the regret analysis tighter, which will be clear in the proof. Algorithm 2 presents the detailed procedures. Note that the meta-algorithm itself does not require any prior information of the unknown path-length $P_T$.
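Putting the meta-algorithm and the experts together, a round of PBGD can be sketched as follows. This is our simplified skeleton on a ball-shaped feasible set, not the paper's exact pseudocode: the names, the dimension default, and the particular nonuniform prior $w_i \propto 1/(i(i+1))$ are our illustrative assumptions.

```python
import numpy as np

def pbgd_sketch(grad_estimates, pool, R, eps, d=2):
    """Simplified PBGD skeleton (our sketch). grad_estimates yields the single shared
    gradient estimate g_t per round, obtained from the one/two bandit queries at the
    combined decision. Expert i runs OGD with step size pool[i] on the surrogate loss
    l_t(x) = <g_t, x>; the meta-algorithm is an exponentially weighted average forecaster
    with learning rate eps and a nonuniform prior w_i proportional to 1 / (i * (i + 1))."""
    N = len(pool)
    etas = np.asarray(pool, dtype=float)
    w = np.array([1.0 / (i * (i + 1)) for i in range(1, N + 1)])
    w /= w.sum()
    X = np.zeros((N, d))                      # expert iterates x_t^i
    plays = []
    for g in grad_estimates:
        x = w @ X                             # meta decision: weighted average of experts
        plays.append(x)
        losses = X @ g                        # surrogate loss of expert i: <g, x_t^i>
        w = w * np.exp(-eps * (losses - losses.min()))   # exponential weights (stabilized)
        w /= w.sum()
        X = X - etas[:, None] * g             # every expert descends along the SAME g_t,
        norms = np.linalg.norm(X, axis=1)     # so one bandit query serves all N experts
        clip = norms > R
        X[clip] *= (R / norms[clip])[:, None] # project iterates back onto the ball
    return plays
```

The key point the sketch makes concrete: the loop touches the bandit feedback only through the single shared `g`, so maintaining $N = \mathcal{O}(\log T)$ step sizes costs no extra queries.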
The meta-algorithm in Algorithm 2, together with the expert-algorithm (19), gives PBGD (short for Parameter-free Bandit Gradient Descent). The following theorem states the dynamic regret of the proposed PBGD algorithm.
Theorem 3.
Under Assumptions 1, 2, and 3, with a proper setting of the pool of candidate step sizes $\mathcal{H}$ and the learning rate $\varepsilon$, the PBGD algorithm enjoys the following expected dynamic regret,

One-Point Feedback Model: $\mathcal{O}\big(T^{3/4}\sqrt{1+P_T}\big)$;

Two-Point Feedback Model: $\mathcal{O}\big(\sqrt{T(1+P_T)}\big)$.
The above results hold universally for any feasible comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.
Remark 3.
Theorem 3 shows that the dynamic regret can be improved from $\mathcal{O}\big(T^{3/4}\sqrt{1+P_T}\big)$ to $\mathcal{O}\big(\sqrt{T(1+P_T)}\big)$ when it is allowed to query two points at each iteration. The attained dynamic regret (though in expectation) of BCO with two-point feedback is, surprisingly, of the same order as that of the full-information setting (Zhang et al., 2018a). This extends the claim argued by Agarwal et al. (2010), that knowing the value of each loss function at two points is almost as useful as knowing the value of each function everywhere, to dynamic regret analysis. Furthermore, we will show that the obtained dynamic regret for the two-point feedback model is minimax optimal in the next section.
5 Lower Bound and Extensions
In this section, we investigate the attainable dynamic regret for BCO problems, and then extend our algorithm to an anytime version, that is, an algorithm that does not require the time horizon $T$ in advance. Furthermore, we study the adaptive regret for BCO problems, another measure for online learning in non-stationary environments.
5.1 Lower Bound
We have the following minimax lower bound of universal dynamic regret for BCO problems.
Theorem 4.
Suppose Assumption 1 holds. For any $0 \le \tau \le 2RT$ and any bandit algorithm, there exist a sequence of loss functions $f_1, \ldots, f_T$ satisfying Assumptions 2 and 3 and a comparator sequence $u_1, \ldots, u_T$ with path-length $P_T \le \tau$ such that
$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(x_t)\Big] - \sum_{t=1}^{T} f_t(u_t) \ge \Omega\big(\sqrt{T(1+\tau)}\big).$$
The proof is detailed in Appendix B. From the above lower bound and the upper bounds in Theorem 3, we know that our $\mathcal{O}\big(\sqrt{T(1+P_T)}\big)$ dynamic regret for the two-point feedback model is optimal, while the rate for the one-point feedback model remains suboptimal, where the desired rate is of order $\mathcal{O}\big(T^{3/4}(1+P_T)^{1/4}\big)$ as demonstrated in Remark 1. Note that the desired bound does not contradict the minimax lower bound, since $T^{3/4}(1+P_T)^{1/4}$ is larger than $\sqrt{T(1+P_T)}$ by noticing that $P_T \le 2RT = \mathcal{O}(T)$.
Our attained dynamic regret exhibits a square-root dependence on the path-length, and it becomes vacuous when $P_T \ge \sqrt{T}$, though the path-length is typically small. The challenge is that the grid search technique cannot approximate the optimal perturbation parameter $\delta$, which also depends on $P_T$; otherwise, we would have to query the function more than once at each iteration. We will investigate a sharper bound for BCO with one-point feedback in the future.
Remark 4.
The lower bound holds even when all the functions $f_t$ are strongly convex and smooth in BCO with one-point feedback. This is in contrast with the full-information setting. The reason is that the minimax static regret of BCO with one-point feedback can benefit from neither strong convexity nor smoothness (Shamir, 2013). This implies the inherent difficulty of learning with bandit feedback.
5.2 Extension to Anytime Algorithm
Notice that the proposed PBGD algorithm requires the time horizon $T$ as an input, which may not be available in advance. We now remove this undesired dependence and develop an anytime algorithm.
Our method is essentially a standard implementation of the doubling trick (Cesa-Bianchi et al., 1997). Specifically, the idea is to initialize the horizon estimate to $1$, and once the actual number of iterations exceeds the current estimate, double the estimate and restart the algorithm. So there are at most $\lceil \log_2 T \rceil + 1$ epochs, and the $k$-th epoch contains $2^{k-1}$ iterations. We have the following regret guarantees for the above anytime algorithm.
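The doubling-trick wrapper described above is a few lines of code; a sketch (names are ours) showing the epoch lengths it produces for a horizon of 20 rounds:

```python
def doubling_trick(run_epoch, total_rounds):
    """Anytime wrapper: restart the fixed-horizon base algorithm on epochs of length 1, 2, 4, ...

    run_epoch(length) runs a fresh instance of the base algorithm for `length` rounds and
    returns its plays; the wrapper never needs the true horizon in advance (here we truncate
    the final epoch only to stop the demo at total_rounds).
    """
    plays, t, length = [], 0, 1
    while t < total_rounds:
        m = min(length, total_rounds - t)   # last epoch may be cut short
        plays.extend(run_epoch(m)[:m])
        t += m
        length *= 2                          # double the horizon estimate and restart
    return plays

sizes = []
doubling_trick(lambda m: sizes.append(m) or [0] * m, 20)
print(sizes)  # [1, 2, 4, 8, 5]
```

Since the per-epoch regret bounds are summed over only logarithmically many epochs, the wrapper preserves the rate up to the extra term discussed below Theorem 5.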
Theorem 5.
Under the same conditions with Theorem 3, the anytime version of PBGD enjoys the following expected dynamic regret,

OnePoint Feedback Model: ;

TwoPoint Feedback Model: .
The above results hold universally for any feasible comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.
We take the one-point feedback model as an example and provide a brief analysis as follows. By the strategy of the doubling trick, we can bound the dynamic regret of the anytime algorithm by
$$\sum_{k=1}^{K} \mathcal{O}\big(T_k^{3/4}(1 + P_{T_k})^{1/2}\big) \le \mathcal{O}\big(T^{3/4}(\log T + P_T)^{1/2}\big),$$
where $K = \mathcal{O}(\log T)$ is the number of epochs, $T_k = 2^{k-1}$ is the length of the $k$-th epoch, and $P_{T_k}$ is the path-length of the comparators within that epoch; the inequality follows from the Cauchy–Schwarz inequality and $\sum_k T_k \le 2T$.
Compared with the $\mathcal{O}\big(T^{3/4}\sqrt{1+P_T}\big)$ rate of the original PBGD algorithm, we observe that an extra $\mathcal{O}\big(T^{3/4}\sqrt{\log T}\big)$ term is suffered due to the anytime demand.
5.3 Adaptive Regret
In this part, we investigate the adaptive regret. Following the seminal work of Hazan and Seshadhri (2009), we define the expected adaptive regret for BCO as
$$\max_{[s, e] \subseteq [T]} \left\{ \mathbb{E}\Big[\sum_{t=s}^{e} f_t(x_t)\Big] - \min_{x \in \mathcal{X}} \sum_{t=s}^{e} f_t(x) \right\}.$$
We note that, in the full-information setting, a stronger version of adaptive regret named strongly adaptive regret was introduced by Daniely et al. (2015). However, they prove that it is impossible to achieve meaningful strongly adaptive regret in bandit settings, so we focus on the notion defined by Hazan and Seshadhri (2009).
To minimize the above measure, we propose an algorithm called Minimizing Adaptive regret in Bandit Convex Optimization (MABCO). Our algorithm follows a framework similar to that of the Coin Betting for Changing Environment (CBCE) algorithm (Jun et al., 2017), which achieves the state-of-the-art adaptive regret in the full-information setting. However, we note that a direct reduction of the CBCE algorithm to the bandit setting requires querying the loss function multiple times at each iteration, which is invalid in the bandit feedback model. To address this difficulty, similar to PBGD, we introduce a new surrogate loss function, which can be constructed using only the one-point or two-point function values. We provide algorithmic details and proofs of theoretical results in Appendix C.
Theorem 6.
With a proper setting of surrogate loss functions and parameters, the proposed MABCO algorithm enjoys the following expected adaptive regret,

One-Point Feedback Model: $\widetilde{\mathcal{O}}\big(T^{3/4}\big)$;

Two-Point Feedback Model: $\widetilde{\mathcal{O}}\big(\sqrt{T}\big)$.
6 Analysis of BGD Algorithm
In this section, we provide the proofs of the theoretical guarantees for the BGD algorithm, including Theorem 1 (one-point feedback model) and Theorem 2 (two-point feedback model).
Before presenting rigorous proofs, we first highlight the main idea and procedures of the argument as follows.

Guarantee that for any $t \in [T]$, the queried point $y_t$ is a feasible point in $\mathcal{X}$, because the projection in Algorithm 1 is onto $(1-\alpha)\mathcal{X}$ instead of $\mathcal{X}$.

Analyze the dynamic regret of the smoothed functions $\hat{f}_t$ in terms of a certain comparator sequence.

Check the gap between the dynamic regret of the smoothed functions $\hat{f}_t$ and that of the original functions $f_t$.
6.1 Proof of Theorem 1
Proof.
Notice that the projection in Algorithm 1 only guarantees that $x_t$ lies in the slightly smaller set $(1-\alpha)\mathcal{X}$, so we first need to prove that for all $t \in [T]$, the queried point $y_t = x_t + \delta s_t$ is a feasible point in $\mathcal{X}$. This is confirmed by Lemma 3, since we know that $\delta \le \alpha r$ from the parameter setting.
Next, as demonstrated in (13), the expected dynamic regret can be decomposed into three terms. So we will bound the three terms separately.
The term (a) is essentially the dynamic regret of the smoothed functions. In the one-point feedback model, the gradient estimator is set according to (8), and we know that $\mathbb{E}[g_t] = \nabla \hat{f}_t(x_t)$ due to Lemma 1. Therefore, the update of $x_t$ is actually randomized online gradient descent over the smoothed function $\hat{f}_t$. So term (a) can be upper bounded by using Theorem 8,
$$\text{term (a)} \le \mathcal{O}\Big(\frac{R^2 + R P_T}{\eta} + \eta G^2 T\Big), \tag{21}$$
where $G$ is an upper bound of the gradient norm $\|g_t\|_2$, and by noticing
$$\|g_t\|_2 = \Big\|\frac{d}{\delta}\, f_t(x_t + \delta s_t)\, s_t\Big\|_2 \le \frac{dC}{\delta}, \tag{22}$$
we can take $G = dC/\delta$.
Now, it suffices to bound term (b) and term (c). By Assumption 3 and Lemma 4, we have
$$\text{term (b)} = \mathbb{E}\Big[\sum_{t=1}^{T} f_t(y_t) - \sum_{t=1}^{T} \hat{f}_t(x_t)\Big] \le 2\delta L T. \tag{23}$$
And term (c) can be bounded by
$$\text{term (c)} = \mathbb{E}\Big[\sum_{t=1}^{T} \hat{f}_t(v_t)\Big] - \sum_{t=1}^{T} f_t(u_t) \le \delta L T + \alpha R L T, \tag{24}$$
where the inequality holds due to Lemma 4 and Assumption 3.
By combining the upper bounds of the three terms in (21), (23) and (24), we obtain the dynamic regret of the original functions over the comparator sequence $u_1, \ldots, u_T$,
$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(y_t)\Big] - \sum_{t=1}^{T} f_t(u_t) \le \mathcal{O}\Big(\frac{R^2 + R P_T}{\eta} + \frac{\eta\, d^2 C^2 T}{\delta^2} + \delta T\Big) \le \mathcal{O}\big(T^{3/4}(1+P_T)^{1/4}\big), \tag{25}$$
where (25) follows from the setting of $\alpha = \delta/r$; the last inequality is obtained by the AM-GM inequality via optimizing the values of $\eta$ and $\delta$. The optimal parameter configuration is $\eta_* = \mathcal{O}\big(T^{-3/4}(1+P_T)^{3/4}\big)$ and $\delta_* = \mathcal{O}\big(T^{-1/4}(1+P_T)^{1/4}\big)$.
∎
6.2 Proof of Theorem 2
Proof.
In the two-point feedback model, the gradient estimator is constructed according to (11), whose norm can be upper bounded as follows,
$$\|g_t\|_2 = \frac{d}{2\delta}\big|f_t(x_t + \delta s_t) - f_t(x_t - \delta s_t)\big| \cdot \|s_t\|_2 \le \frac{d}{2\delta} \cdot 2\delta L = dL, \tag{26}$$
where in the last inequality we utilize the Lipschitz property due to Assumption 3. Hence, $G = dL$. We remark that, in contrast with the one-point feedback model as shown in (22), the upper bound of the gradient norm here is independent of $1/\delta$, which leads to a substantially improved regret bound.
Meanwhile, by exploiting the Lipschitz property, we have
$$\big|f_t(x_t + \delta s_t) - f_t(x_t)\big| \le \delta L, \tag{27}$$
and a similar result holds for $f_t(x_t - \delta s_t)$. We can thus bound the expected regret as follows,
$$\mathbb{E}\Big[\sum_{t=1}^{T} \frac{f_t(x_t + \delta s_t) + f_t(x_t - \delta s_t)}{2}\Big] - \sum_{t=1}^{T} f_t(u_t) \le \mathcal{O}\Big(\frac{R^2 + R P_T}{\eta} + \eta\, d^2 L^2 T + \delta T\Big) \tag{28}$$
$$\le \mathcal{O}\big(\sqrt{T(1+P_T)}\big). \tag{29}$$
The core characteristic of the analysis of the two-point feedback model lies in the second term of (28), which is independent of $1/\delta$ and thus much smaller than that of (25). This owes to the benefit of the gradient estimator evaluated at two points per iteration. Notice that (29) is obtained by setting $\eta = \mathcal{O}\big(\sqrt{(1+P_T)/T}\big)$ and a sufficiently small $\delta$. ∎
7 Analysis of PBGD Algorithm
In this section, we provide the proofs of the theoretical guarantees for the PBGD algorithm, including Proposition 1 and Theorem 3 (both the one-point and two-point feedback models). Besides, we present the algorithmic details for BCO with two-point feedback.