Bandit Convex Optimization in Non-stationary Environments
Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence that reflects the non-stationarity of environments. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length $P_T$, which is generally unknown.
Bandit Convex Optimization, Dynamic Regret, Non-stationary Environments
Online Convex Optimization (OCO) is a powerful tool for modeling sequential decision-making problems, which can be regarded as an iterative game between the player and environments (Shalev-Shwartz, 2012). At iteration $t$, the player commits to a decision $x_t$ from a convex feasible set $\mathcal{X} \subseteq \mathbb{R}^d$; simultaneously, a convex function $f_t: \mathcal{X} \to \mathbb{R}$ is revealed by environments, and the player then suffers an instantaneous loss $f_t(x_t)$. The standard performance measure is the regret,
$$\text{S-Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \tag{1}$$
which is the difference between the cumulative loss of the player and that of the best fixed decision in hindsight. To emphasize the fact that the comparator in (1) is fixed, it is called static regret.
There are two setups for online convex optimization according to the information that environments reveal (Hazan, 2016). In the full-information setup, the player has all the information of the function $f_t$, including the gradients of $f_t$ over $\mathcal{X}$. By contrast, in the bandit setup, the instantaneous loss $f_t(x_t)$ is the only feedback available to the player. In this paper, we focus on the latter case, which is referred to as bandit convex optimization (BCO).
BCO has attracted considerable attention because it successfully models many real-world scenarios where the feedback available to the decision maker is partial or incomplete (Hazan, 2016). The key challenge lies in the limited feedback, i.e., the player has no access to gradients of the function. In the standard one-point feedback model, the only feedback is the one-point function value, based on which Flaxman et al. (2005) constructed an unbiased estimator of the gradient and then appealed to the online gradient descent algorithm developed in the full-information setting (Zinkevich, 2003) to establish an $O(T^{3/4})$ expected regret. Another common variant is the two-point feedback model, where the player is allowed to query function values of two points at each iteration. Agarwal et al. (2010) demonstrated an optimal $O(\sqrt{T})$ regret for convex functions under this feedback model. Algorithms and regret bounds are further developed in later studies (Saha and Tewari, 2011; Hazan and Levy, 2014; Bubeck et al., 2015; Dekel et al., 2015; Yang and Mohri, 2016; Bubeck et al., 2017).
|Feedback model||Dynamic regret||Type||Parameter-free||Reference|
|one-point|| ||worst-case||NO||(Chen and Giannakis, 2019)|
|two-point|| ||worst-case||NO||(Yang et al., 2016)|
|two-point|| ||worst-case||NO||(Chen and Giannakis, 2019)|
Note that the static regret in (1) compares with a fixed benchmark, so it implicitly assumes that there is a reasonably good decision over all iterations. Unfortunately, this may not be true in non-stationary environments, where the underlying distribution of online functions changes. To address this limitation, the notion of dynamic regret is introduced by Zinkevich (2003) and defined as the difference between the cumulative loss of the player and that of a comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$,
$$\text{D-Regret}_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t). \tag{2}$$
In contrast to a fixed benchmark in the static regret, dynamic regret compares with a changing comparator sequence and therefore is more suitable in non-stationary environments. We remark that (2) is also called the universal dynamic regret, since it holds universally for any feasible comparator sequence. In the literature, there is a variant named the worst-case dynamic regret (Besbes et al., 2015), which specifies the comparator sequence to be minimizers of online functions, namely, $u_t = x_t^* \in \arg\min_{x \in \mathcal{X}} f_t(x)$. As pointed out by Zhang et al. (2018a), the universal dynamic regret is more desirable, because the worst-case dynamic regret is typically too pessimistic while the universal one is more adaptive to the non-stationarity of environments. Moreover, the universal dynamic regret is more general since it accommodates the worst-case dynamic regret and static regret as special cases.
Recently, there have been some studies on the dynamic regret of BCO problems (Yang et al., 2016; Chen and Giannakis, 2019). They provide the worst-case dynamic regret only, and the algorithms require as input some quantities that are generally unknown in advance. Therefore, it is desirable to design algorithms that enjoy universal dynamic regret for BCO problems.
In this paper, we start with the bandit gradient descent (BGD) algorithm of Flaxman et al. (2005), and analyze its universal dynamic regret. We demonstrate that the optimal parameter configuration of vanilla BGD also requires prior information of the unknown path-length. To address this issue, we propose the Parameter-free Bandit Gradient Descent algorithm (PBGD), which is inspired by the strategy of maintaining multiple learning rates (van Erven and Koolen, 2016). Our approach is essentially an online ensemble method (Zhou, 2012), consisting of a meta-algorithm and expert-algorithms. The basic idea is to maintain a pool of candidate parameters, and then invoke multiple instances of the expert-algorithm simultaneously, where each expert-algorithm is associated with a candidate parameter. Next, the meta-algorithm combines predictions from expert-algorithms by an expert-tracking algorithm (Cesa-Bianchi and Lugosi, 2006). However, it is not possible to run multiple expert-algorithms with different parameters simultaneously in BCO problems, since the player is only allowed to query one or two points per iteration in the bandit setup. To overcome this difficulty, we carefully design a surrogate function, as the linearization of the smoothed version of the loss function in the sense of expectation, and make the strategy suitable for bandit convex optimization. Our algorithm and analysis accommodate one-point and two-point feedback models, and Table 1 summarizes existing dynamic regret bounds for BCO problems and our results. The main contributions of this work are listed as follows.
We establish the first universal dynamic regret for the bandit gradient descent algorithm, which supports comparison with any feasible comparator sequence, in a unified analysis framework.
We propose a parameter-free algorithm, which does not require knowing an upper bound of the path-length ahead of time, and meanwhile enjoys the state-of-the-art dynamic regret.
We establish the first minimax lower bound of universal dynamic regret for BCO problems.
The rest of the paper is structured as follows. Section 2 briefly reviews related work. In Section 3, we introduce the bandit gradient descent algorithm for BCO problems and provide the dynamic regret analysis. Section 4 presents the parameter-free BGD algorithm, the main contribution of this paper, with dynamic regret analysis. Next, in Section 5, we establish the lower bound and provide several extensions. Section 6 and Section 7 present the proofs of main results. Section 8 concludes the paper and discusses future directions.
2 Related Work
We briefly introduce related work of bandit convex optimization and dynamic regret.
2.1 Bandit Convex Optimization
In the bandit convex optimization setting, the player is only allowed to query function values of one point or two points, and the gradient information is not accessible as opposed to the full-information setting.
For the one-point feedback model, the seminal work of Flaxman et al. (2005) constructed an unbiased gradient estimator and established an $O(T^{3/4})$ expected regret for convex and Lipschitz functions. A similar result was independently obtained by Kleinberg (2004). Later, an $\tilde{O}(T^{2/3})$ rate was shown to be attainable with either strong convexity (Agarwal et al., 2010) or smoothness (Saha and Tewari, 2011). When functions are both strongly convex and smooth, Hazan and Levy (2014) designed a novel algorithm that achieves a regret of $\tilde{O}(\sqrt{T})$ based on the follow-the-regularized-leader framework with self-concordant barriers, matching the lower bound (Shamir, 2013) up to logarithmic factors. Furthermore, recent breakthroughs (Bubeck et al., 2015, 2017) showed that $\tilde{O}(\sqrt{T})$ regret is attainable for convex and Lipschitz functions, though with a high dependence on the dimension $d$.
BCO with two-point feedback was proposed and studied by Agarwal et al. (2010), and was also independently studied in the context of stochastic optimization (Nesterov, 2011). Agarwal et al. (2010) first established expected regret bounds of $O(d^2\sqrt{T})$ and $O(d^2 \log T)$ for convex Lipschitz and strongly convex Lipschitz functions, respectively. These bounds are proved to be minimax optimal in $T$ (Agarwal et al., 2010), and the dependence on $d$ was later improved to be optimal (Shamir, 2017).
2.2 Dynamic Regret
There are two types of dynamic regret as aforementioned. The universal dynamic regret holds universally for any feasible comparator sequence, while the worst-case one only compares with the sequence of the minimizers of online functions.
For the universal dynamic regret, existing results are limited to the full-information setting. Zinkevich (2003) showed that OGD achieves an $O(\sqrt{T}(1+P_T))$ regret, where $P_T$ is the path-length of the comparator sequence $u_1, \ldots, u_T$,
$$P_T(u_1, \ldots, u_T) = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2.$$
Recently, Zhang et al. (2018a) demonstrated that this upper bound is not optimal by establishing an $\Omega(\sqrt{T(1+P_T)})$ lower bound, and further proposed an algorithm that attains an optimal $O(\sqrt{T(1+P_T)})$ dynamic regret for convex functions. However, there is no universal dynamic regret in the bandit setting.
For the worst-case dynamic regret, there are many studies in the full-information setting (Besbes et al., 2015; Jadbabaie et al., 2015; Yang et al., 2016; Mokhtari et al., 2016; Zhang et al., 2017) as well as a few works in the bandit setting (Gur et al., 2014; Yang et al., 2016; Luo et al., 2018; Auer et al., 2019; Cheung et al., 2019; Chen and Giannakis, 2019; Zhao et al., 2020). In the bandit convex optimization, when the upper bound of is known, Yang et al. (2016) established an dynamic regret for the two-point feedback model. Here, is the longest path-length of the feasible comparator sequence. Later, Chen and Giannakis (2019) applied BCO techniques in the dynamic Internet-of-Things management, showing and dynamic regret bounds respectively for one-point and two-point feedback models.
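Another closely related performance measure for online convex optimization in non-stationary environments is the adaptive regret (Hazan and Seshadhri, 2009), which is defined as the maximum of "local" static regret in every time interval $[s, q] \subseteq [T]$,
$$\text{A-Regret}_T = \max_{[s,q] \subseteq [T]} \left( \sum_{t=s}^{q} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=s}^{q} f_t(x) \right).$$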
Hazan and Seshadhri (2009) proposed an efficient algorithm that enjoys and regrets for convex and exponentially concave functions, respectively. The rate for convex functions was improved later (Daniely et al., 2015; Jun et al., 2017). Moreover, Zhang et al. (2018b) investigated the relation between adaptive regret and the worst-case dynamic regret.
3 Bandit Gradient Descent (BGD)
In this section, we provide assumptions used in the paper, then present the bandit gradient descent (BGD) algorithm for BCO problems, as well as its universal dynamic regret. To the best of our knowledge, this is the first work that analyzes the universal dynamic regret of BGD.
Assumption 1 (Bounded Region).
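The feasible set $\mathcal{X}$ contains the ball of radius $r$ centered at the origin and is contained in the ball of radius $R$, namely,
$$r\mathbb{B} \subseteq \mathcal{X} \subseteq R\mathbb{B},$$
where $\mathbb{B} = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ is the unit ball.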
Assumption 2 (Bounded Function Value).
The absolute values of all the functions are bounded by $C$, namely,
$$\max_{x \in \mathcal{X}} |f_t(x)| \le C, \quad \forall t \in [T].$$
Assumption 3 (Lipschitz Continuity).
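All the functions are $L$-Lipschitz continuous over the domain $\mathcal{X}$, that is, for all $x, y \in \mathcal{X}$ and $t \in [T]$, we have
$$|f_t(x) - f_t(y)| \le L \|x - y\|_2.$$
Meanwhile, we assume that the loss functions and the comparator sequence are chosen by an oblivious adversary.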
3.2 Algorithm and Regret Analysis
In this part, we present algorithm and regret analysis of the bandit gradient descent.
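We start from the online gradient descent (OGD) developed in the full-information setting (Zinkevich, 2003). OGD begins with any $x_1 \in \mathcal{X}$ and performs
$$x_{t+1} = \Pi_{\mathcal{X}}\left[x_t - \eta \nabla f_t(x_t)\right], \tag{7}$$
where $\eta > 0$ is the step size and $\Pi_{\mathcal{X}}[\cdot]$ denotes the projection onto the nearest point in $\mathcal{X}$.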
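As a minimal sketch of the projected update above, the following Python snippet runs online gradient descent on a Euclidean ball. The choice of a ball as the feasible set, the helper names `project_onto_ball` and `ogd`, and the quadratic test loss are illustrative assumptions and are not taken from the paper.

```python
import math

def project_onto_ball(x, radius):
    """Euclidean projection of x onto the ball {z : ||z||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def ogd(gradients, x1, eta, radius):
    """Run x_{t+1} = Proj[x_t - eta * grad_t(x_t)] and return all iterates."""
    xs = [list(x1)]
    for grad in gradients:
        x = xs[-1]
        g = grad(x)
        step = [xi - eta * gi for xi, gi in zip(x, g)]
        xs.append(project_onto_ball(step, radius))
    return xs

# Usage: a stationary loss f_t(x) = ||x - target||^2 / 2 inside the unit ball.
target = [0.5, -0.25]
grads = [lambda x: [xi - ci for xi, ci in zip(x, target)]] * 200
iterates = ogd(grads, x1=[0.0, 0.0], eta=0.1, radius=1.0)
```

In the bandit setting the call to `grad` is exactly what is unavailable, which is why a gradient estimator is needed.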
The key challenge of BCO problems is the lack of gradients. Therefore, Flaxman et al. (2005) and Agarwal et al. (2010) propose to replace $\nabla f_t(x_t)$ in (7) with a gradient estimator $\tilde{g}_t$, obtained by evaluating the function at one (in the one-point feedback model) or two (in the two-point feedback model) random points around the current iterate $y_t$. Details will be presented later. We unify their algorithms in Algorithm 1, called the Bandit Gradient Descent (BGD). Notice that in lines 8 and 14 of the algorithm, the projection of $y_{t+1}$ is onto a slightly smaller set $(1-\alpha)\mathcal{X}$ instead of $\mathcal{X}$, to ensure that the final decision $x_t$ lies in the feasible set $\mathcal{X}$. In the following, we describe the gradient estimator and analyze the universal dynamic regret for each model.
One-Point Feedback Model.
Flaxman et al. (2005) propose the following gradient estimator,
$$\tilde{g}_t = \frac{d}{\delta} f_t(y_t + \delta s_t)\, s_t, \tag{8}$$
where $s_t$ is a unit vector selected uniformly at random and $\delta > 0$ is the perturbation parameter. Then, the following lemma (Flaxman et al., 2005, Lemma 2.1) guarantees that (8) is an unbiased gradient estimator of the smoothed version of the loss function $f_t$.
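Lemma 1. For any convex (but not necessarily differentiable) function $f: \mathcal{X} \to \mathbb{R}$, define its smoothed version $\hat{f}(x) = \mathbb{E}_{v \in \mathbb{B}}[f(x + \delta v)]$, where $v$ is drawn uniformly from the unit ball $\mathbb{B}$. Then, for any $\delta > 0$,
$$\mathbb{E}_{s \in \mathbb{S}}\left[\frac{d}{\delta} f(x + \delta s)\, s\right] = \nabla \hat{f}(x),$$
where $\mathbb{S}$ is the unit sphere centered around the origin, namely, $\mathbb{S} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$.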
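The one-point estimator $\frac{d}{\delta} f(y + \delta s)\, s$ of Flaxman et al. (2005) can be illustrated with a short Python sketch. The helper names, the linear test loss, and the sample size are assumptions made for illustration only. For a linear loss the smoothed function coincides with the loss itself, so the Monte Carlo average of the estimator should approach the true gradient.

```python
import math
import random

def sample_unit_vector(d, rng):
    """Draw s uniformly from the unit sphere by normalising a Gaussian vector."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in z))
        if norm > 0.0:
            return [v / norm for v in z]

def one_point_estimate(f, y, delta, rng):
    """Return (x, g): the queried point x = y + delta*s and g = (d/delta) f(x) s."""
    d = len(y)
    s = sample_unit_vector(d, rng)
    x = [yi + delta * si for yi, si in zip(y, s)]
    value = f(x)  # the only feedback used in this round
    g = [(d / delta) * value * si for si in s]
    return x, g

# Monte Carlo check on a linear loss f(x) = <a, x>, whose gradient is a.
rng = random.Random(0)
a = [1.0, -2.0]
f = lambda x: sum(ai * xi for ai, xi in zip(a, x))
n = 20000
mean = [0.0, 0.0]
for _ in range(n):
    _, g = one_point_estimate(f, [0.0, 0.0], 0.5, rng)
    mean = [m + gi / n for m, gi in zip(mean, g)]
```

Note that the magnitude of each individual estimate scales with $d/\delta$, so the estimator is unbiased for the smoothed function but can have large variance when $\delta$ is small.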
Therefore, we adopt $\tilde{g}_t$ to perform the online gradient descent in (7). The main update procedures of the one-point feedback model are summarized in Case 1 (lines 4-7) of Algorithm 1. We have the following result regarding its universal dynamic regret.
By setting and , we obtain an dynamic regret. However, such a configuration requires prior knowledge of , which is generally unavailable. We will develop a parameter-free algorithm to eliminate the undesired dependence later.
Two-Point Feedback Model.
In this setup, the player is allowed to query two points, $y_t + \delta s_t$ and $y_t - \delta s_t$. Then, the function values $f_t(y_t + \delta s_t)$ and $f_t(y_t - \delta s_t)$ are revealed as the feedback. We use the following gradient estimator (Agarwal et al., 2010),
$$\tilde{g}_t = \frac{d}{2\delta}\left(f_t(y_t + \delta s_t) - f_t(y_t - \delta s_t)\right) s_t. \tag{11}$$
The major limitation of the one-point gradient estimator (8) is that it has a potentially large magnitude, proportional to $1/\delta$, which is usually quite large since the perturbation parameter $\delta$ is typically small. This is avoided in the two-point gradient estimator (11), whose magnitude can be upper bounded by $Ld$, independent of the perturbation parameter $\delta$. This crucial advantage leads to a substantial improvement in the dynamic regret (and also the static regret).
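The bounded magnitude of the two-point estimator $\frac{d}{2\delta}\left(f(y + \delta s) - f(y - \delta s)\right) s$ can be checked numerically. The sketch below is illustrative only: the helper names and the test loss $f(x) = \|x - c\|_2$ (which is 1-Lipschitz) are assumptions, not part of the paper.

```python
import math
import random

def sample_unit_vector(d, rng):
    """Draw s uniformly from the unit sphere by normalising a Gaussian vector."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in z))
        if norm > 0.0:
            return [v / norm for v in z]

def two_point_estimate(f, y, delta, rng):
    """Return g = (d / (2*delta)) * (f(y + delta*s) - f(y - delta*s)) * s."""
    d = len(y)
    s = sample_unit_vector(d, rng)
    x_plus = [yi + delta * si for yi, si in zip(y, s)]
    x_minus = [yi - delta * si for yi, si in zip(y, s)]
    diff = f(x_plus) - f(x_minus)  # two queries in the same round
    return [(d / (2.0 * delta)) * diff * si for si in s]

# f(x) = ||x - c||_2 is 1-Lipschitz, so ||g||_2 <= L * d = d for every delta.
c = [0.3, -0.1, 0.2]
f = lambda x: math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
rng = random.Random(0)
max_norm = 0.0
for delta in (0.1, 0.001):
    for _ in range(1000):
        g = two_point_estimate(f, [0.0, 0.0, 0.0], delta, rng)
        max_norm = max(max_norm, math.sqrt(sum(v * v for v in g)))
```

In contrast to the one-point case, shrinking `delta` does not inflate the norm of the estimate.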
By setting and , BGD algorithm achieves an dynamic regret. However, this configuration has an unpleasant dependence on the unknown quantity , which will be removed in the next part.
4 Parameter-Free BGD
From Theorems 1 and 2, we observe that the optimal parameter configurations of the BGD algorithm require knowing the path-length $P_T$ in advance, which is generally unknown. In this section, we develop a parameter-free algorithm to address this limitation.
The fundamental obstacle in obtaining universal dynamic regret guarantees is that the path-length remains unknown even after all iterations, since the comparator sequence can be chosen arbitrarily from the feasible set. Therefore, the well-known doubling trick (Cesa-Bianchi et al., 1997) is not applicable to remove the dependence on the unknown path-length. Another possible technique to overcome this difficulty is to grid search the optimal parameter by maintaining multiple learning rates in parallel and using expert-tracking algorithms to combine predictions and track the best parameter (van Erven and Koolen, 2016). However, it is infeasible to directly apply this method to bandit convex optimization because of the inherent difficulty of the bandit setting: the player is only allowed to query the function value once at each iteration.
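To address this issue, we need a closer investigation of the dynamic regret analysis of BCO problems. Taking the one-point feedback model as an example, the expected dynamic regret can be decomposed into three terms,
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t)\right] = \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \left(\hat{f}_t(y_t) - \hat{f}_t(v_t)\right)\right]}_{\text{term (a)}} + \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \left(f_t(x_t) - \hat{f}_t(y_t)\right)\right]}_{\text{term (b)}} + \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \left(\hat{f}_t(v_t) - f_t(u_t)\right)\right]}_{\text{term (c)}}, \tag{13}$$
where $v_1, \ldots, v_T$ is the scaled comparator sequence set as $v_t = (1-\alpha) u_t$. It turns out that term (b) and term (c) can be bounded without involving the unknown path-length; the rigorous argument can be found in (23) and (24) of Section 6.1. Hence, it suffices to design parameter-free algorithms to optimize term (a), i.e., the dynamic regret of the smoothed loss function $\hat{f}_t$.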
However, it remains infeasible to maintain multiple learning rates for optimizing the dynamic regret of $\hat{f}_t$. Suppose there are in total $N$ experts, where each expert is associated with a learning rate (step size). Then at iteration $t$, the expert-algorithms will require the information of $\nabla \hat{f}_t(y_t^1), \ldots, \nabla \hat{f}_t(y_t^N)$ to perform the bandit gradient descent. This necessitates querying $N$ function values of the original loss $f_t$, which is prohibited in bandit convex optimization.
Fortunately, we discover that the expected dynamic regret of can be upper bounded by that of a linear function, as demonstrated in the following proposition.
This feature motivates us to design the following surrogate loss function $\ell_t$,
$$\ell_t(y) = \langle \tilde{g}_t, y - y_t \rangle, \tag{15}$$
which can be regarded as a linearization of the smoothed function $\hat{f}_t$ at the point $y_t$ in terms of expectation. Furthermore, the surrogate loss function enjoys the following two properties.
Property 1 follows from the definition of surrogate loss, and Proposition 1 immediately implies Property 2. These two properties are simple yet quite useful, and they together make the grid search feasible in bandit convex optimization. Concretely speaking,
Property 1 implies that we can now initialize $N$ experts to perform the bandit gradient descent over the surrogate loss, where each expert is associated with a specific learning rate, since all the gradients $\nabla \ell_t(y_t^i)$ are equal to $\tilde{g}_t$, which can be obtained by querying the function value of $f_t$ only once.
Property 2 guarantees that the expected dynamic regret of the smoothed functions $\hat{f}_t$ is upper bounded by that of the surrogate losses $\ell_t$.
Consequently, we propose to optimize the surrogate loss $\ell_t$ instead of the original loss $f_t$ (or its smoothed version $\hat{f}_t$). We note that the idea of constructing a surrogate loss for maintaining multiple learning rates was originally proposed by van Erven and Koolen (2016), but for a different purpose. They construct a quadratic upper bound for the original loss as the surrogate loss, with the aim of adapting to the potential curvature of online functions in full-information online convex optimization. In this paper, we design the surrogate loss as a linearization of the smoothed function in terms of expectation, to make the grid search of the optimal parameter feasible in bandit convex optimization. To the best of our knowledge, this is the first work that optimizes a surrogate loss for maintaining multiple learning rates in the bandit setup.
In the following, we describe the design details of parameter-free algorithms for the one-point feedback model, and present configurations of BCO with two-point feedback model later (in Section 7.3).
In the one-point feedback model, the optimal step size is , whose value is unavailable due to the unknown path-length . Nevertheless, we confirm
always holds from the non-negativity and boundedness of the path-length (). Hence, we first construct the following pool of candidate step sizes to discretize the range of optimal parameter in (17),
where . The above configuration ensures there exists an index such that . More intuitively, there is a step size in the pool that is not optimal but sufficiently close to . Next, we instantiate expert-algorithms, where the -th expert is a BGD algorithm with parameters and . Finally, we adopt an expert-tracking algorithm as the meta-algorithm to combine predictions from all the experts to produce the final decision. Owing to nice theoretical guarantees of the meta-algorithm, dynamic regret of final decisions is comparable to that of the best expert, i.e., the expert-algorithm with near-optimal step size.
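The discretization idea can be sketched with a geometric grid: if the candidate step sizes double from a lower end to an upper end of the possible range, then any optimal value in that range is within a factor of two of some candidate. The snippet below is a generic illustration; the endpoints `eta_min` and `eta_max`, the doubling ratio, and the function names are assumptions and do not reproduce the exact pool used in the paper.

```python
import math

def step_size_pool(eta_min, eta_max):
    """Geometric grid eta_i = 2**i * eta_min, i = 0..n-1, covering [eta_min, eta_max]."""
    n = math.ceil(math.log2(eta_max / eta_min)) + 1
    return [eta_min * 2 ** i for i in range(n)]

def covering_index(pool, eta_star):
    """Largest index k with pool[k] <= eta_star, so that eta_star <= 2 * pool[k]."""
    k = 0
    for i, eta in enumerate(pool):
        if eta <= eta_star:
            k = i
    return k

# Usage: any target in [eta_min, eta_max] is sandwiched by a candidate.
pool = step_size_pool(0.001, 1.0)
checks = []
for eta_star in (0.001, 0.0037, 0.02, 0.5, 1.0):
    k = covering_index(pool, eta_star)
    checks.append(pool[k] <= eta_star <= 2 * pool[k])
```

The pool size grows only logarithmically in the ratio `eta_max / eta_min`, which keeps the number of experts small.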
We present descriptions for expert-algorithm and meta-algorithm of PBGD as follows.
For each candidate step size from the pool , we initialize an expert, and the expert performs the online gradient descent over the surrogate loss defined in (15),
where is the step size of the expert , shown in (18).
The above update procedure once again demonstrates the necessity of constructing the surrogate loss. Due to the nice property of the surrogate loss (Property 1), at each iteration, all the experts can perform the exact online gradient descent in the same direction $\tilde{g}_t$. By contrast, if each expert were run over the smoothed loss function $\hat{f}_t$, then at each iteration it would be necessary to query multiple gradients $\nabla \hat{f}_t(y_t^i)$, $i = 1, \ldots, N$, or equivalently, to query multiple function values of $f_t$, which are unavailable in bandit convex optimization.
To combine predictions returned from various experts, we adopt the exponentially weighted average forecaster algorithm (Cesa-Bianchi and Lugosi, 2006) with nonuniform initial weights as the meta-algorithm, whose input is the pool of candidate step sizes in (18) and its own learning rate . The nonuniform initialization of weights aims to make regret analysis tighter, which will be clear in the proof. Algorithm 2 presents detailed procedures. Note that the meta-algorithm itself does not require any prior information of the unknown path-length .
The meta-algorithm in Algorithm 2, together with the expert-algorithm (19), gives PBGD (short for Parameter-free Bandit Gradient Descent). The following theorem states the dynamic regret of the proposed PBGD algorithm.
One-Point Feedback Model: $O(T^{3/4}(1+P_T)^{1/2})$;
Two-Point Feedback Model: $O(T^{1/2}(1+P_T)^{1/2})$.
The above results hold universally for any feasible comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.
Theorem 3 shows that the dynamic regret can be improved from $O(T^{3/4}(1+P_T)^{1/2})$ to $O(T^{1/2}(1+P_T)^{1/2})$ when it is allowed to query two points at each iteration. The attained dynamic regret (though in expectation) of BCO with two-point feedback, surprisingly, is of the same order as that of the full-information setting (Zhang et al., 2018a). This extends to dynamic regret analysis the claim of Agarwal et al. (2010) that "knowing the value of each loss function at two points is almost as useful as knowing the value of each function everywhere". Furthermore, we will show in the next section that the obtained dynamic regret for the two-point feedback model is minimax optimal.
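To make the interplay of meta-algorithm and expert-algorithms concrete, here is a minimal Python sketch of the one-point variant on a Euclidean ball. It is a sketch under stated assumptions rather than the paper's Algorithm 2: the feasible set is taken to be the ball of radius `R` (so $r = R$ and the shrinkage is `delta / R`), the nonuniform prior proportional to $1/(i(i+1))$, the step-size pool, and the meta learning rate `eps` are all hypothetical choices.

```python
import math
import random

def unit_vector(d, rng):
    """Draw a uniformly random unit vector by normalising a Gaussian vector."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in z))
        if norm > 0.0:
            return [v / norm for v in z]

def project(x, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= radius else [v * radius / norm for v in x]

def pbgd_one_point(losses, d, R, delta, etas, eps, seed=0):
    """Sketch of PBGD with one-point feedback; returns played points and final weights."""
    rng = random.Random(seed)
    alpha = delta / R  # shrinkage so that y + delta*s stays inside the ball (assumes delta < R)
    n = len(etas)
    # Nonuniform prior over experts (assumed form, proportional to 1/(i(i+1))).
    raw = [1.0 / ((i + 1) * (i + 2)) for i in range(n)]
    weights = [w / sum(raw) for w in raw]
    experts = [[0.0] * d for _ in range(n)]
    played = []
    for f in losses:
        # Meta-algorithm: weighted average of the experts' predictions.
        y = [sum(w * e[j] for w, e in zip(weights, experts)) for j in range(d)]
        s = unit_vector(d, rng)
        x = [yj + delta * sj for yj, sj in zip(y, s)]
        played.append(x)
        value = f(x)  # the single bandit query of this round
        g = [(d / delta) * value * sj for sj in s]
        # Surrogate loss ell_t(y^i) = <g, y^i - y>; every expert shares the gradient g.
        surr = [sum(g[j] * (e[j] - y[j]) for j in range(d)) for e in experts]
        low = min(surr)  # shift for numerical stability; cancels after normalisation
        weights = [w * math.exp(-eps * (l - low)) for w, l in zip(weights, surr)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Expert-algorithms: projected gradient step with each candidate step size.
        experts = [project([e[j] - eta * g[j] for j in range(d)], (1.0 - alpha) * R)
                   for e, eta in zip(experts, etas)]
    return played, weights

# Usage on a stationary quadratic loss (illustrative values only).
c = [0.4, -0.3]
T = 200
losses = [lambda x: sum((xi - ci) ** 2 for xi, ci in zip(x, c))] * T
played, weights = pbgd_one_point(losses, d=2, R=1.0, delta=0.1,
                                 etas=[0.001 * 2 ** i for i in range(8)], eps=0.01)
```

Only one function value is queried per round, yet all experts are updated, because the surrogate loss has the same gradient at every point.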
5 Lower Bound and Extensions
In this section, we investigate the attainable dynamic regret for BCO problems, and then extend our algorithm to an anytime version, that is, an algorithm without requiring the time horizon in advance. Furthermore, we study the adaptive regret for BCO problems, another measure for online learning in non-stationary environments.
5.1 Lower Bound
We have the following minimax lower bound of universal dynamic regret for BCO problems.
The proof is detailed in Appendix B. From the above lower bound and the upper bounds in Theorem 3, we know that our dynamic regret for the two-point feedback model is optimal, while the rate for the one-point feedback model remains sub-optimal, where the desired rate is of order $O(T^{3/4}(1+P_T)^{1/4})$ as demonstrated in Remark 1. Note that the desired bound does not contradict the minimax lower bound, since $T^{3/4}(1+P_T)^{1/4}$ is larger than the lower bound by noticing that $P_T = O(T)$.
Our attained dynamic regret exhibits a square-root dependence on the path-length, and it will become vacuous when $P_T = \Omega(\sqrt{T})$, though the path-length is typically small. The challenge is that the grid search technique cannot be used to approximate the optimal perturbation parameter, which also depends on $P_T$. Otherwise, we would have to query the function more than once at each iteration. We will investigate a sharper bound for BCO with one-point feedback in the future.
The lower bound holds even when all the functions $f_t$ are strongly convex and smooth in BCO with one-point feedback. This is to be contrasted with the full-information setting. The reason is that the minimax static regret of BCO with one-point feedback can benefit from neither strong convexity nor smoothness (Shamir, 2013). This implies the inherent difficulty of learning with bandit feedback.
5.2 Extension to Anytime Algorithm
Notice that the proposed PBGD algorithm requires the time horizon $T$ as an input, which is generally not available in advance. We remove this undesired dependence and develop an anytime algorithm.
Our method is essentially a standard implementation of the doubling trick (Cesa-Bianchi et al., 1997). Specifically, the idea is to initialize the interval by , and once the actual number of iterations exceeds the current counts, double the counts and restart the algorithm. So there will be epochs and the -th epoch contains iterations. We have the following regret guarantees for the above anytime algorithm.
Under the same conditions with Theorem 3, the anytime version of PBGD enjoys the following expected dynamic regret,
One-Point Feedback Model: ;
Two-Point Feedback Model: .
The above results hold universally for any feasible comparator sequence .
We take the one-point feedback model as an example and provide a brief analysis as follows. Actually, by the strategy of doubling trick, we can bound the dynamic regret of the anytime algorithm by
Compared with the rate of the original PBGD algorithm, we observe that an extra term is suffered due to the anytime demand.
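The restart schedule behind the doubling trick can be sketched as follows. This is a generic illustration: the initial epoch length of one round and the function name `doubling_epochs` are assumptions, and in an actual anytime algorithm each epoch would restart the base algorithm with the epoch length as its horizon.

```python
def doubling_epochs(horizon):
    """Split rounds 1..horizon into epochs of lengths 1, 2, 4, ... (last one truncated)."""
    epochs = []
    start, length = 1, 1
    while start <= horizon:
        end = min(start + length - 1, horizon)
        epochs.append((start, end))
        start, length = end + 1, 2 * length
    return epochs

# Usage: the schedule itself never needs the true horizon in advance;
# it is passed here only to stop the illustration after 10 rounds.
schedule = doubling_epochs(10)
```

With this schedule the number of epochs is logarithmic in the number of rounds played, which is the source of the extra term in the anytime bound.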
5.3 Adaptive Regret
In this part, we investigate the adaptive regret. Following the seminal work of Hazan and Seshadhri (2009), we define the expected adaptive regret for BCO as
We note that, in the full-information setting, a stronger version of adaptive regret named strongly adaptive regret is introduced by Daniely et al. (2015). However, they prove that it is impossible to achieve meaningful strongly adaptive regret in bandit settings, so we focus on the notion defined by Hazan and Seshadhri (2009).
To minimize the above measure, we propose an algorithm called Minimizing Adaptive regret in Bandit Convex Optimization (MABCO). Our algorithm follows a framework similar to that of the Coin Betting for Changing Environment (CBCE) algorithm (Jun et al., 2017), which achieves the state-of-the-art adaptive regret in the full-information setting. However, we note that a direct reduction of the CBCE algorithm to the bandit setting requires querying the loss function multiple times at each iteration, which is invalid in the bandit feedback model. To address this difficulty, similar to PBGD, we introduce a new surrogate loss function, which can be constructed by using only the one-point or two-point function values. We provide algorithmic details and proofs of theoretical results in Appendix C.
With a proper setting of surrogate loss functions and parameters, the proposed MABCO algorithm enjoys the following expected adaptive regret,
One-Point Feedback Model: ;
Two-Point Feedback Model: .
6 Analysis of BGD Algorithm
Before presenting rigorous proofs, we first highlight the main idea and procedures of the argument as follows.
Guarantee that for any $t \in [T]$, $x_t$ is a feasible point in $\mathcal{X}$, because the projection in Algorithm 1 is over $(1-\alpha)\mathcal{X}$ instead of $\mathcal{X}$.
Analyze the dynamic regret of the smoothed functions $\hat{f}_t$ in terms of a certain comparator sequence.
Check the gap between the dynamic regret of the smoothed functions $\hat{f}_t$ and that of the original functions $f_t$.
6.1 Proof of Theorem 1
Notice that the projection in Algorithm 1 only guarantees that $y_t$ is in a slightly smaller set $(1-\alpha)\mathcal{X}$, so we first need to prove that, for all $t \in [T]$, $x_t$ is a feasible point in $\mathcal{X}$. This is guaranteed by Lemma 3 under the chosen parameter setting.
Next, as demonstrated in (13), the expected dynamic regret can be decomposed into three terms. So we will bound the three terms separately.
Term (a) is essentially the dynamic regret of the smoothed functions. In the one-point feedback model, the gradient estimator is set according to (8), and we know that $\mathbb{E}[\tilde{g}_t] = \nabla \hat{f}_t(y_t)$ due to Lemma 1. Therefore, the update $y_{t+1} = \Pi_{(1-\alpha)\mathcal{X}}[y_t - \eta \tilde{g}_t]$ is actually the randomized online gradient descent over the smoothed function $\hat{f}_t$. So term (a) can be upper bounded by using Theorem 8.
where , and by noticing
And term (c) can be bounded by
where (25) follows from the setting of ; the last equation is obtained by the AM-GM inequality via optimizing values of and . The optimal parameter configuration is
6.2 Proof of Theorem 2
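In the two-point feedback model, the gradient estimator is constructed according to (11), whose norm can be upper bounded as follows,
$$\|\tilde{g}_t\|_2 = \frac{d}{2\delta} \left| f_t(y_t + \delta s_t) - f_t(y_t - \delta s_t) \right| \cdot \|s_t\|_2 \le \frac{d}{2\delta} \cdot L \|2\delta s_t\|_2 = Ld,$$
where in the inequality we utilize the Lipschitz property due to Assumption 3. Hence, the gradient norm is at most $Ld$. We remark that, in contrast to the one-point feedback model as shown in (22), the upper bound of the gradient norm here is independent of the perturbation parameter $\delta$, which leads to a substantially improved regret bound.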
Meanwhile, by exploiting the Lipschitz property, we have
and similar result holds for . We can thus bound the expected regret as follows,
The core characteristic of analysis of the two-point feedback model lies in the second term of (28), which is independent of , and thus is much smaller than that of (25). This owes to the benefit of the gradient estimator evaluated by two points at each iteration. Notice that (29) is obtained by setting and . ∎
7 Analysis of PBGD Algorithm
In this section, we provide the proofs of theoretical guarantees for the PBGD algorithm including Proposition 1 and Theorem 3 (both one-point and two-point feedback models). Besides, we present the algorithmic details for BCO with two-point feedback.