Strongly Adaptive Regret Implies Optimally Dynamic Regret
To cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret independently. In this paper, we illustrate an intrinsic connection between these two concepts by showing that the dynamic regret can be upper bounded in terms of the adaptive regret and the functional variation. This observation implies that strongly adaptive algorithms can be directly leveraged to minimize the dynamic regret. As a result, we present a series of strongly adaptive algorithms whose dynamic regrets are minimax optimal for convex functions, exponentially concave functions, and strongly convex functions, respectively. To the best of our knowledge, this is the first time that this kind of dynamic regret bound has been established for exponentially concave functions. Moreover, none of these adaptive algorithms needs any prior knowledge of the functional variation, which is a significant advantage over previous specialized methods for minimizing dynamic regret.
Online convex optimization is a powerful paradigm for sequential decision making (Shalev-Shwartz, 2011). It can be viewed as a game between a learner and an adversary: in the $t$-th round, the learner selects a decision $w_t \in \Omega$, simultaneously the adversary chooses a function $f_t(\cdot): \Omega \mapsto \mathbb{R}$, and then the learner suffers an instantaneous loss $f_t(w_t)$. This study focuses on the full-information setting (Cesa-Bianchi and Lugosi, 2006), where the function $f_t(\cdot)$ is revealed to the learner at the end of each round. The goal of the learner is to minimize the cumulative loss over $T$ periods. The standard performance measure is regret, which is the difference between the loss incurred by the learner and that of the best fixed decision in hindsight, i.e.,
$$\text{S-Regret}(T) = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \Omega} \sum_{t=1}^{T} f_t(w).$$
The above regret is typically referred to as static regret in the sense that the comparator is time-invariant. The rationale behind this evaluation metric is that one of the decisions in the domain $\Omega$ is reasonably good over the $T$ rounds. However, when the underlying distribution of loss functions changes, the static regret may be too optimistic and fail to capture the hardness of the problem.
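To address this limitation, new forms of performance measure, including adaptive regret (Hazan and Seshadhri, 2007, 2009) and dynamic regret (Zinkevich, 2003), have been proposed and have received significant interest recently. Given a parameter $\tau$, which is the length of the interval, the strong version of adaptive regret is defined as
$$\text{SA-Regret}(T, \tau) = \max_{[s, s+\tau-1] \subseteq [T]} \left( \sum_{t=s}^{s+\tau-1} f_t(w_t) - \min_{w \in \Omega} \sum_{t=s}^{s+\tau-1} f_t(w) \right). \tag{1}$$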
From the definition, we observe that minimizing the adaptive regret forces the learner to have small static regret over any interval of length $\tau$. Since the best decision for different intervals could be different, the learner is essentially competing with a changing comparator.
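A parallel line of research introduces the concept of dynamic regret, where the cumulative loss of the learner is compared against a comparator sequence $u_1, \ldots, u_T \in \Omega$, i.e.,
$$\text{D-Regret}(u_1, \ldots, u_T) = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(u_t). \tag{2}$$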
It is well known that in the worst case, a sublinear dynamic regret is impossible unless we impose some regularity on the comparator sequence or the function sequence (Jadbabaie et al., 2015). A representative example is the functional variation defined below
$$V_T = \sum_{t=2}^{T} \max_{w \in \Omega} |f_t(w) - f_{t-1}(w)|.$$
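To make the three quantities concrete, the following minimal sketch evaluates the static regret, the dynamic regret against the per-round minimizers, and the functional variation on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-dimensional quadratic losses $f_t(w) = (w - c_t)^2$, the domain $[-1, 1]$ discretized on a grid, the projected online gradient descent learner, and the name `regret_measures`.

```python
import numpy as np

def regret_measures(centers, step=0.1, grid_size=201):
    """Toy example: f_t(w) = (w - c_t)^2 on a discretized domain [-1, 1].

    Returns (static regret, dynamic regret w.r.t. per-round minimizers,
    functional variation), all evaluated on the grid.
    """
    grid = np.linspace(-1.0, 1.0, grid_size)            # discretized domain
    centers = np.asarray(centers, dtype=float)
    losses = (grid[None, :] - centers[:, None]) ** 2    # losses[t, i] = f_t(grid[i])

    # Learner: projected online gradient descent (an illustrative choice only).
    w, learner_loss = 0.0, 0.0
    for c in centers:
        learner_loss += (w - c) ** 2                     # suffer f_t(w_t)
        w = float(np.clip(w - step * 2.0 * (w - c), -1.0, 1.0))

    # Static regret: compare with the best fixed grid point in hindsight.
    static_regret = learner_loss - losses.sum(axis=0).min()
    # Dynamic regret: compare with the minimizer of each round separately.
    dynamic_regret = learner_loss - losses.min(axis=1).sum()
    # Functional variation: sum over t of max_w |f_t(w) - f_{t-1}(w)|.
    variation = np.abs(np.diff(losses, axis=0)).max(axis=1).sum()
    return static_regret, dynamic_regret, variation
```

For instance, `regret_measures([0.5] * 20 + [-0.5] * 20)` describes a sequence that changes once. The dynamic regret is never smaller than the static regret, because the sum of per-round minima cannot exceed the minimum of the sum.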
Besbes et al. (2015) have proved that as long as the functional variation $V_T$ is sublinear in $T$, there exists an algorithm that achieves a sublinear dynamic regret. Furthermore, under noisy gradient feedback, a general restarting procedure is developed, and it enjoys $O(T^{2/3} V_T^{1/3})$ and $O(\log T \sqrt{T V_T})$ rates for convex functions and strongly convex functions, respectively. This result is very strong in the sense that these rates are (almost) minimax optimal. However, the restarting procedure can only be applied when an upper bound of $V_T$ is known beforehand, thus limiting its application in practice.
While both the adaptive and dynamic regrets aim at coping with changing environments, little is known about their relationship. This paper makes a step towards understanding their connections. Specifically, we show that the strongly adaptive regret in (1), together with the functional variation, can be used to upper bound the dynamic regret in (2). Thus, an algorithm with a small strongly adaptive regret is automatically equipped with a tight dynamic regret. As a result, we obtain a series of algorithms for minimizing the dynamic regret that do not need any prior knowledge of the functional variation. The main contributions of this work are summarized below.
We provide a general theorem that upper bounds the dynamic regret in terms of the strongly adaptive regret and the functional variation.
For convex functions, we show that the strongly adaptive algorithm of Jun et al. (2016) has a dynamic regret of , which matches the minimax rate, up to a polylogarithmic factor.
For exponentially concave functions, we propose a strongly adaptive algorithm that allows us to control the tradeoff between the adaptive regret and the computational cost explicitly. Furthermore, we demonstrate that its dynamic regret is , and this is the first time such kind of dynamic regret bound is established for exponentially concave functions.
Since strongly convex functions with bounded gradients are also exponentially concave, our previous result immediately implies a dynamic regret of , which is also minimax optimal up to a polylogarithmic factor. It also indicates our bound for exponentially concave functions is almost optimal.
2 Related Work
In this section, we give a brief introduction to previous work on static, adaptive, and dynamic regrets in the context of online convex optimization.
2.1 Static Regret
The majority of studies in online learning focus on static regret (Shalev-Shwartz and Singer, 2007; Langford et al., 2009). The classical online gradient descent achieves $O(\sqrt{T})$ and $O(\log T)$ regret bounds for convex and strongly convex functions, respectively (Zinkevich, 2003; Hazan et al., 2007; Shalev-Shwartz et al., 2007). Both the $O(\sqrt{T})$ and $O(\log T)$ rates are known to be minimax optimal (Abernethy et al., 2009). When functions are exponentially concave, a different algorithm, named online Newton step, is developed and enjoys an $O(d \log T)$ regret bound, where $d$ is the dimensionality (Hazan et al., 2007).
2.2 Adaptive Regret
The concept of adaptive regret was introduced by Hazan and Seshadhri (2007), and later strengthened by Daniely et al. (2015). To distinguish between them, we refer to the definition of Hazan and Seshadhri (2007) as weakly adaptive regret and the one of Daniely et al. (2015) as strongly adaptive regret. The weak version is given by
$$\text{WA-Regret}(T) = \max_{[s, q] \subseteq [T]} \left( \sum_{t=s}^{q} f_t(w_t) - \min_{w \in \Omega} \sum_{t=s}^{q} f_t(w) \right).$$
To minimize the adaptive regret, Hazan and Seshadhri (2007) have developed two meta-algorithms: an efficient algorithm with computational complexity per iteration and an inefficient one with computational complexity per iteration. These meta-algorithms use an existing online method (that was possibly designed to have small static regret) as a subroutine. [Footnote 1: For brevity, we ignored the factor of subroutine in the statements of computational complexities. The computational complexity should be interpreted as space complexity and time complexity, where and are space and time complexities of the subroutine per iteration, respectively.] For convex functions, the efficient and inefficient meta-algorithms have and regret bounds, respectively. For exponentially concave functions, those rates are improved to and , respectively. We can see that the price paid for the adaptivity is very small: the rates of weakly adaptive regret differ from those of static regret only by logarithmic factors.
A major limitation of weakly adaptive regret is that it does not respect short intervals well. Taking convex functions as an example, the and bounds are meaningless for intervals of length . To overcome this limitation, Daniely et al. (2015) proposed a refined adaptive regret that takes the length of the interval as a parameter , as indicated in (1). If the strongly adaptive regret is small for all , we can guarantee the learner has small regret over any interval of any length. In particular, Daniely et al. (2015) introduced the following definition.
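Definition 1. Let $R(\tau)$ be the minimax static regret bound of the learning problem over $\tau$ periods. An algorithm is strongly adaptive, if
$$\text{SA-Regret}(T, \tau) = O\big(\text{poly}(\log T) \cdot R(\tau)\big), \quad \forall \tau.$$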
It is easy to verify that the meta-algorithms of Hazan and Seshadhri (2007) are strongly adaptive for exponentially concave functions, but not for convex functions. [Footnote 2: That is because (i) $\text{SA-Regret}(T, \tau) \le \text{WA-Regret}(T)$, and (ii) there is a $\text{poly}(\log T)$ factor in the definition of strong adaptivity.] Thus, Daniely et al. (2015) developed a new meta-algorithm that satisfies for convex functions, and thus is strongly adaptive. The algorithm is also efficient and the computational complexity per iteration is . Later, the strongly adaptive regret of convex functions was improved to by Jun et al. (2016).
2.3 Dynamic Regret
In a seminal work, Zinkevich (2003) proposed to use the path-length defined as
$$\mathcal{P}(u_1, \ldots, u_T) = \sum_{t=2}^{T} \|u_t - u_{t-1}\|_2$$
to upper bound the dynamic regret. Specifically, Zinkevich (2003) proved that for any sequence of convex functions, the dynamic regret of online gradient descent can be upper bounded by $O(\sqrt{T}\,[1 + \mathcal{P}(u_1, \ldots, u_T)])$. Another regularity of the comparator sequence, which is similar to the path-length, is defined as
$$\mathcal{P}'(u_1, \ldots, u_T) = \sum_{t=2}^{T} \|u_t - \Phi_t(u_{t-1})\|_2$$
where $\Phi_t(\cdot)$ is a dynamic model that predicts a reference point for the $t$-th round. Hall and Willett (2013) developed a novel algorithm named dynamic mirror descent and proved that its dynamic regret is on the order of $\sqrt{T}\,[1 + \mathcal{P}'(u_1, \ldots, u_T)]$. The advantage of $\mathcal{P}'(u_1, \ldots, u_T)$ is that when the comparator sequence follows the dynamical model closely, it can be much smaller than the path-length $\mathcal{P}(u_1, \ldots, u_T)$.
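Let $w_t^* \in \operatorname{argmin}_{w \in \Omega} f_t(w)$ be a local minimizer of $f_t(\cdot)$. For any sequence of $u_1, \ldots, u_T \in \Omega$, we have
$$\text{D-Regret}(u_1, \ldots, u_T) = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(u_t) \le \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} \min_{w \in \Omega} f_t(w) = \text{D-Regret}(w_1^*, \ldots, w_T^*).$$
Thus, $\text{D-Regret}(w_1^*, \ldots, w_T^*)$ can be treated as the worst case of the dynamic regret, and many works have been devoted to minimizing it (Jadbabaie et al., 2015; Mokhtari et al., 2016; Yang et al., 2016; Zhang et al., 2016).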
When a prior knowledge of is available, can be upper bounded by (Yang et al., 2016). If all the functions are strongly convex and smooth, the upper bound can be improved to (Mokhtari et al., 2016). The rate is also achievable when all the functions are convex and smooth, and all the minimizers ’s lie in the interior of (Yang et al., 2016). In a recent study, Zhang et al. (2016) introduced a new regularity—squared path-length
which could be much smaller than the path-length when the difference between successive local minimizers is small. Zhang et al. (2016) developed a novel algorithm named online multiple gradient descent, and proved that is on the order of for (semi-)strongly convex and smooth functions.
Although closely related, adaptive regret and dynamic regret have been studied independently, and there is little discussion of their relationship. In the literature, dynamic regret is also referred to as tracking regret or shifting regret (Littlestone and Warmuth, 1994; Herbster and Warmuth, 1998, 2001). In the setting of “prediction with expert advice”, Adamskiy et al. (2012) have shown that the tracking regret can be derived from the adaptive regret. In the setting of “online linear optimization in the simplex”, Cesa-Bianchi et al. (2012) introduced a generalized notion of shifting regret which unifies adaptive regret and shifting regret. Different from previous work, this paper considers the setting of online convex optimization, and illustrates that the dynamic regret can be upper bounded by the adaptive regret and the functional variation.
3 From Adaptive to Dynamic
In this section, we first introduce a general theorem that bounds the dynamic regret by the adaptive regret, and then derive specific regret bounds for convex functions, exponentially concave functions, and strongly convex functions.
3.1 Adaptive-to-Dynamic Conversion
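Let $\mathcal{I}_1 = [s_1, q_1], \mathcal{I}_2 = [s_2, q_2], \ldots, \mathcal{I}_k = [s_k, q_k]$ be a partition of $[1, T]$. That is, they are successive intervals such that
$$s_1 = 1, \quad q_i + 1 = s_{i+1}, \ i \in [k-1], \quad \text{and} \quad q_k = T. \tag{4}$$
Define the local functional variation of the $i$-th interval as
$$V_T(i) = \sum_{t=s_i+1}^{q_i} \max_{w \in \Omega} |f_t(w) - f_{t-1}(w)|$$
and it is obvious that $\sum_{i=1}^{k} V_T(i) \le V_T$, where $V_T = \sum_{t=2}^{T} \max_{w \in \Omega} |f_t(w) - f_{t-1}(w)|$ is the total functional variation. [Footnote 3: Note that in certain cases, the sum of local functional variation $\sum_{i=1}^{k} V_T(i)$ can be much smaller than the total functional variation $V_T$. For example, when the sequence of functions only changes $k-1$ times, we can construct the intervals based on the changing rounds such that $\sum_{i=1}^{k} V_T(i) = 0$.] Then, we have the following theorem for bounding the dynamic regret in terms of the strongly adaptive regret and the functional variation.
Theorem 1. Let $w_t^* \in \operatorname{argmin}_{w \in \Omega} f_t(w)$. We have
$$\text{D-Regret}(w_1^*, \ldots, w_T^*) \le \min_{\mathcal{I}_1, \ldots, \mathcal{I}_k} \sum_{i=1}^{k} \Big( \text{SA-Regret}(T, |\mathcal{I}_i|) + 2 |\mathcal{I}_i| \cdot V_T(i) \Big)$$
where the minimization is taken over any sequence of intervals that satisfy (4).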
The above theorem is analogous to Proposition 2 of Besbes et al. (2015), which provides an upper bound for a special choice of the interval sequence. The main difference is that there is a minimization operation in our bound, which allows us to get rid of the issue of parameter selection. For a specific type of problem, we can plug in the corresponding upper bound of strongly adaptive regret, and then choose any sequence of intervals to obtain a concrete upper bound. In particular, the choice of the intervals may depend on the (possibly unknown) functional variation.
Before proceeding to specific bounds, we introduce the following common assumption.
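Assumption 1. Both the gradient and the domain are bounded.
The gradients of all the online functions are bounded by $G$, i.e., $\max_{w \in \Omega} \|\nabla f_t(w)\| \le G$ for all $f_t$.
The diameter of the domain $\Omega$ is bounded by $B$, i.e., $\max_{w, w' \in \Omega} \|w - w'\| \le B$.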
3.2 Convex Functions
For convex functions, we choose the meta-algorithm of Jun et al. (2016) and take the online gradient descent as its subroutine. The following theorem regarding the adaptive regret can be obtained from that paper.
According to Theorem 2 of Besbes et al. (2015), we know that the minimax dynamic regret of convex functions is $O(T^{2/3} V_T^{1/3})$. Thus, our upper bound is minimax optimal up to a polylogarithmic factor. The key advantage of the meta-algorithm of Jun et al. (2016) over the restarted online gradient descent of Besbes et al. (2015) is that the former does not need any prior knowledge of the functional variation $V_T$. Notice that the meta-algorithm of Daniely et al. (2015) can also be used here, and its dynamic regret is on the order of .
3.3 Exponentially Concave Functions
We first provide the definition of exponentially concave (abbr. exp-concave) functions (Cesa-Bianchi and Lugosi, 2006).
A function $f(\cdot): \Omega \mapsto \mathbb{R}$ is $\alpha$-exp-concave if $\exp(-\alpha f(\cdot))$ is concave over the domain $\Omega$.
Exponential concavity is stronger than convexity but weaker than strong convexity. It can be used to model many popular losses used in machine learning, such as the square loss in regression, logistic loss in classification and negative logarithm loss in portfolio management (Koren, 2013).
For exp-concave functions, Hazan and Seshadhri (2007) have developed two meta-algorithms that take the online Newton step as their subroutine, and proved the following properties.
The inefficient one has computational complexity per iteration, and its weakly adaptive regret is .
The efficient one has computational complexity per iteration, and its weakly adaptive regret is .
As can be seen, there is a tradeoff between the computational complexity and the weakly adaptive regret: A lighter computation incurs a looser bound and a tighter bound requires a higher computation. In Section 4, we develop a unified approach, i.e., Algorithm 1, that allows us to trade effectiveness for efficiency explicitly. Lemma 6 indicates the proposed algorithm has
computational complexity per iteration, where is a tunable parameter. On the other hand, Theorem 6 implies that for -exp-concave functions that satisfy Assumption 1, the strongly adaptive regret of Algorithm 1 is
where is the dimensionality and .
We list several choices of and the resulting theoretical guarantees in Table 1, and have the following observations.
When , we recover the guarantee of the efficient algorithm of Hazan and Seshadhri (2007), and when , we obtain the inefficient one.
By setting where is a small constant, such as , the strongly adaptive regret can be viewed as , and at the same time, the computational complexity is also very low for a large range of .
According to Definition 1, Algorithm 1 in this paper, as well as the two meta-algorithms of Hazan and Seshadhri (2007), is strongly adaptive. Based on Theorem 1, we derive the dynamic regret of the proposed algorithm.
To the best of our knowledge, this is the first time this kind of dynamic regret bound has been established for exp-concave functions. Furthermore, the discussion in Section 3.4 implies that our upper bound is minimax optimal, up to a polylogarithmic factor.
3.4 Strongly Convex Functions
In the following, we study strongly convex functions, defined below.
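A function $f(\cdot): \Omega \mapsto \mathbb{R}$ is $\lambda$-strongly convex if
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2} \|y - x\|^2, \quad \forall x, y \in \Omega.$$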
It is easy to verify that strongly convex functions with bounded gradients are also exp-concave (Hazan et al., 2007).
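Suppose $f(\cdot): \Omega \mapsto \mathbb{R}$ is $\lambda$-strongly convex and $\|\nabla f(w)\| \le G$ for all $w \in \Omega$. Then, $f(\cdot)$ is $\frac{\lambda}{G^2}$-exp-concave.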
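A short worked derivation may help to see where the constant $\lambda / G^2$ comes from. It is only a sketch under the extra assumption, made here for illustration, that $f$ is twice differentiable, so that $\lambda$-strong convexity reads $\nabla^2 f(w) \succeq \lambda I$; the auxiliary function $g$ is a name introduced for this sketch.

```latex
% Sketch, assuming f is twice differentiable. Let g(w) = \exp(-\alpha f(w)).
\begin{align*}
\nabla g(w)   &= -\alpha\, g(w)\, \nabla f(w), \\
\nabla^2 g(w) &= \alpha\, g(w) \left( \alpha\, \nabla f(w) \nabla f(w)^\top - \nabla^2 f(w) \right).
\end{align*}
% Since \alpha g(w) > 0, g is concave if and only if
%   \nabla^2 f(w) \succeq \alpha\, \nabla f(w) \nabla f(w)^\top  for all w.
% With \nabla^2 f(w) \succeq \lambda I and \|\nabla f(w)\| \le G we have
\begin{align*}
\alpha\, \nabla f(w) \nabla f(w)^\top \preceq \alpha\, \|\nabla f(w)\|^2 I \preceq \alpha\, G^2 I ,
\end{align*}
% so the choice \alpha = \lambda / G^2 gives
%   \alpha\, \nabla f(w) \nabla f(w)^\top \preceq \lambda I \preceq \nabla^2 f(w),
% and hence \exp(-\alpha f) is concave, i.e., f is (\lambda / G^2)-exp-concave.
```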
Thus, Corollary 4 can be directly applied to strongly convex functions, and yields a dynamic regret of . According to Theorem 4 of Besbes et al. (2015), the minimax dynamic regret of strongly convex functions is , which implies our upper bound is almost minimax optimal.
A limitation of Corollary 4 is that the constant in the upper bound depends on the dimensionality $d$. In the following, we show that when the functions are strongly convex and online gradient descent is used as the subroutine of Algorithm 1, both the adaptive and dynamic regrets are independent of $d$.
4 A Unified Adaptive Algorithm
In this section, we introduce a unified approach for minimizing the adaptive regret of exp-concave functions, as well as strongly convex functions.
Let $\mathcal{A}$ be an online learning algorithm that is designed to minimize the static regret of exp-concave functions or strongly convex functions, e.g., online Newton step (Hazan et al., 2007) or online gradient descent (Zinkevich, 2003). Similar to the approach of following the leading history (FLH) (Hazan and Seshadhri, 2007), at any time $t$, we will instantiate an expert by applying the online learning algorithm $\mathcal{A}$ to the sequence of loss functions $f_t, f_{t+1}, \ldots$, and utilize the strategy of learning from expert advice to combine solutions of different experts (Herbster and Warmuth, 1998). Our method is named improved following the leading history (IFLH), and is summarized in Algorithm 1.
Let $E^j$ be the expert that starts to work at time $j$. To control the computational complexity, we associate an ending time $e^j$ with each $E^j$. The expert $E^j$ is alive during the period $[j, e^j - 1]$. In each round $t$, we maintain a working set of experts $\mathcal{S}_t$, which contains all the alive experts, and assign a probability $p_t^j$ to each $E^j \in \mathcal{S}_t$. In Steps 6 and 7, we remove all the experts whose ending times are no larger than $t$. Since the number of alive experts has changed, we need to update the probabilities assigned to them, which is performed in Steps 12 to 14. In Steps 15 and 16, we add a new expert $E^t$ to $\mathcal{S}_t$, calculate its ending time according to Definition 5 introduced below, and set its initial probability $p_t^t$. It is easy to verify that the probabilities $p_t^j$ over $\mathcal{S}_t$ sum to one. Let $w_t^j$ be the output of $E^j$ at the $t$-th round, where $t \ge j$. In Step 17, we submit the weighted average of the $w_t^j$ with coefficients $p_t^j$ as the output $w_t$, and suffer the loss $f_t(w_t)$. In Steps 18 to 25, we use the exponential weighting scheme to update the weight of each expert $E^j$ based on its loss $f_t(w_t^j)$. In Step 21, we pass the loss function $f_t$ to all the alive experts so that they can update their predictions for the next round.
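The round structure described above can be sketched in a few lines of Python. This is a hedged illustration, not a transcription of Algorithm 1. The following choices are assumptions made for the sketch: the new expert receives weight $1/t$ and the surviving experts are rescaled by $1 - 1/t$ (the convention of FLH), the new expert receives weight 1 when no earlier expert survives, the subroutine is a one-dimensional projected gradient descent on $[-1, 1]$ (the hypothetical class `OGDExpert`), and the learning rate is a free parameter `alpha`. The helper `base_k_ending_time` implements the ending-time rule that the text defines right after this paragraph.

```python
import math

def base_k_ending_time(t, K):
    """Zero the lowest nonzero base-K digit of t and add 1 to the next digit."""
    power = 1
    while (t // power) % K == 0:
        power *= K
    return (t // (power * K) + 1) * (power * K)

class OGDExpert:
    """Hypothetical 1-D subroutine: projected gradient descent on [-1, 1]."""
    def __init__(self, step=0.1):
        self.w, self.step = 0.0, step
    def predict(self):
        return self.w
    def update(self, grad):
        self.w = min(1.0, max(-1.0, self.w - self.step * grad(self.w)))

def iflh(losses, grads, K=2, alpha=1.0):
    """losses[t-1] and grads[t-1] are callables for round t.

    Returns the submitted decisions and the working-set size of each round.
    """
    experts, p_hat = {}, {}          # start time j -> (expert, e^j); j -> weight
    decisions, sizes = [], []
    for t, (f, g) in enumerate(zip(losses, grads), start=1):
        # Remove experts whose ending time is no larger than t.
        for j in [j for j, (_, end) in experts.items() if end <= t]:
            del experts[j], p_hat[j]
        # Renormalize the survivors and add the new expert (assumed weights).
        Z = sum(p_hat.values())
        if Z > 0:
            p = {j: p_hat[j] / Z * (1.0 - 1.0 / t) for j in p_hat}
            p[t] = 1.0 / t
        else:
            p = {t: 1.0}             # assumption: a lone expert gets all mass
        experts[t] = (OGDExpert(), base_k_ending_time(t, K))
        # Submit the weighted average of the experts' predictions.
        preds = {j: experts[j][0].predict() for j in experts}
        decisions.append(sum(p[j] * preds[j] for j in experts))
        sizes.append(len(experts))
        # Exponential weighting based on each expert's own loss.
        raw = {j: p[j] * math.exp(-alpha * f(preds[j])) for j in experts}
        Z = sum(raw.values())
        p_hat = {j: v / Z for j, v in raw.items()}
        # Pass the loss (here, its gradient oracle) to all alive experts.
        for expert, _ in experts.values():
            expert.update(g)
    return decisions, sizes
```

A possible call is `iflh(losses, grads, K=2)` with `losses = [lambda w, c=c: (w - c) ** 2 for c in centers]` and the matching gradients. Keeping the weights in a dictionary keyed by start time makes the removal and renormalization steps explicit, at the cost of some speed.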
The difference between our IFLH and the original FLH is how to decide the ending time $e^j$ of expert $E^j$. In this paper, we propose the following base-$K$ ending time.
Definition 5 (Base-$K$ Ending Time)
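Let $K \ge 2$ be an integer, and the representation of $t$ in the base-$K$ number system as
$$t = \sum_{\tau \ge 0} \beta_\tau K^\tau$$
where $0 \le \beta_\tau < K$, for all $\tau \ge 0$. Let $k$ be the smallest integer such that $\beta_k > 0$, i.e.,
$$k = \min\{\tau : \beta_\tau > 0\}.$$
Then, the base-$K$ ending time of $t$ is defined as
$$E_K(t) = \sum_{\tau \ge k+1} \beta_\tau K^\tau + K^{k+1}.$$
In other words, the ending time is the number represented by the new sequence obtained by setting the first nonzero element in the sequence $\beta_0, \beta_1, \ldots$ to be $0$ and adding $1$ to the element after it.
Let’s take the decimal system as an example (i.e., $K = 10$). Then,
$$E_{10}(1) = E_{10}(2) = \cdots = E_{10}(9) = 10, \quad E_{10}(11) = E_{10}(12) = \cdots = E_{10}(19) = 20, \quad E_{10}(10) = E_{10}(20) = \cdots = E_{10}(90) = 100.$$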
We note that a similar strategy for deciding the ending time was proposed by György et al. (2012), and a discussion about the difference is given in the supplementary.
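The verbal rule in the definition (clear the lowest nonzero digit, then add one to the next digit) translates directly into integer arithmetic. The function below is a minimal sketch; the name `base_k_ending_time` and the argument checks are choices made here, not part of the paper.

```python
def base_k_ending_time(t: int, K: int) -> int:
    """Base-K ending time of a start time t >= 1, for an integer base K >= 2.

    Finds the position k of the lowest nonzero base-K digit of t, clears all
    digits up to and including position k, and adds K**(k+1).
    """
    if t < 1 or K < 2:
        raise ValueError("require t >= 1 and K >= 2")
    power = 1                          # K**k for the digit being inspected
    while (t // power) % K == 0:       # skip trailing zero digits
        power *= K
    next_power = power * K             # K**(k+1)
    # Integer division drops digits 0..k; adding 1 bumps the next digit,
    # and any carry is handled by ordinary integer arithmetic.
    return (t // next_power + 1) * next_power
```

In the decimal system this gives, for example, `base_k_ending_time(7, 10) == 10`, `base_k_ending_time(11, 10) == 20`, and `base_k_ending_time(90, 10) == 100`.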
When the base-$K$ ending time is used in Algorithm 1, we have the following properties.
Suppose we use the base-$K$ ending time in Algorithm 1.
For any , we have
For any interval , we can always find segments
with , such that
The first part of Lemma 6 implies that the size of is . An example of in the decimal system is given below.
The second part of Lemma 6 implies that for any interval , we can find experts such that their survival periods cover . Again, we present an example in the decimal system: The interval can be covered by
which are the survival periods of experts , , and , respectively. Recall that , , and .
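Both properties can be explored numerically. The sketch below lists the experts that are alive in a given round and builds a chain of survival periods covering a given interval, by starting an expert at the left endpoint and repeatedly jumping to the ending time of the previous one. The function names and the example values (round 486, interval $[486, 1234]$, base 10) are chosen here for illustration and are not the examples of the paper.

```python
def ending_time(t, K):
    """Base-K ending time: clear the lowest nonzero digit, bump the next one."""
    power = 1
    while (t // power) % K == 0:
        power *= K
    return (t // (power * K) + 1) * (power * K)

def alive_experts(t, K):
    """Start times j <= t whose survival period [j, ending_time(j) - 1] contains t."""
    return [j for j in range(1, t + 1) if ending_time(j, K) > t]

def covering_segments(r, s, K):
    """Survival periods of a chain of experts that together cover [r, s]."""
    segments, start = [], r
    while start <= s:
        end = ending_time(start, K)
        segments.append((start, end - 1))   # expert `start` is alive on this period
        start = end                          # the next expert starts where this one ends
    return segments
```

With these definitions, `alive_experts(486, 10)` contains 18 start times (481 to 486, the multiples of 10 from 410 to 480, and 100 to 400), and `covering_segments(486, 1234, 10)` returns `[(486, 489), (490, 499), (500, 999), (1000, 9999)]`.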
Based on Lemma 6, we have the following theorem regarding the adaptive regret of exp-concave functions.
From Lemma 6 and Theorem 6, we observe that the adaptive regret is a decreasing function of $K$, while the computational cost is an increasing function of $K$. Thus, we can control the tradeoff by tuning the value of $K$.
For strongly convex functions, we have a similar guarantee but without any dependence on the dimensionality $d$, as indicated below.
We here present the proofs of main theorems. The omitted proofs are provided in the supplementary.
5.1 Proof of Theorem 1
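First, we upper bound the dynamic regret in the following way. Here $\mathcal{I}_i = [s_i, q_i]$ denotes the $i$-th interval of the partition, $V_T(i) = \sum_{t=s_i+1}^{q_i} \max_{w \in \Omega} |f_t(w) - f_{t-1}(w)|$ its local functional variation, and $w_t^*$ a minimizer of $f_t(\cdot)$ over $\Omega$.
$$\text{D-Regret}(w_1^*, \ldots, w_T^*) = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} \min_{w \in \Omega} f_t(w) = \sum_{i=1}^{k} \underbrace{\left( \sum_{t=s_i}^{q_i} f_t(w_t) - \min_{w \in \Omega} \sum_{t=s_i}^{q_i} f_t(w) \right)}_{:= a_i} + \sum_{i=1}^{k} \underbrace{\left( \min_{w \in \Omega} \sum_{t=s_i}^{q_i} f_t(w) - \sum_{t=s_i}^{q_i} \min_{w \in \Omega} f_t(w) \right)}_{:= b_i}. \tag{5}$$
From the definition of strongly adaptive regret, we can upper bound $a_i$ by
$$a_i \le \text{SA-Regret}(T, |\mathcal{I}_i|).$$
To upper bound $b_i$, we follow the analysis of Proposition 2 of Besbes et al. (2015).
$$b_i = \min_{w \in \Omega} \sum_{t=s_i}^{q_i} f_t(w) - \sum_{t=s_i}^{q_i} f_t(w_t^*) \le \sum_{t=s_i}^{q_i} \left( f_t(w_{s_i}^*) - f_t(w_t^*) \right) \le |\mathcal{I}_i| \cdot \max_{t \in [s_i, q_i]} \left( f_t(w_{s_i}^*) - f_t(w_t^*) \right).$$
Furthermore, for any $t \in \mathcal{I}_i$, we have
$$f_t(w_{s_i}^*) - f_t(w_t^*) = f_t(w_{s_i}^*) - f_{s_i}(w_{s_i}^*) + f_{s_i}(w_{s_i}^*) - f_t(w_t^*) \le f_t(w_{s_i}^*) - f_{s_i}(w_{s_i}^*) + f_{s_i}(w_t^*) - f_t(w_t^*) \le 2 V_T(i),$$
where the first inequality uses the optimality of $w_{s_i}^*$ for $f_{s_i}(\cdot)$, and the second holds because, by the triangle inequality, $\max_{w \in \Omega} |f_t(w) - f_{s_i}(w)| \le V_T(i)$ for any $t \in \mathcal{I}_i$. Hence $b_i \le 2 |\mathcal{I}_i| \cdot V_T(i)$.
Substituting the upper bounds of $a_i$ and $b_i$ into (5), we arrive at
$$\text{D-Regret}(w_1^*, \ldots, w_T^*) \le \sum_{i=1}^{k} \Big( \text{SA-Regret}(T, |\mathcal{I}_i|) + 2 |\mathcal{I}_i| \cdot V_T(i) \Big).$$
Since the above inequality holds for any partition of $[1, T]$, we can take the minimum over partitions to get a tight bound.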
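The step that controls the second term of the decomposition can be checked numerically on a finite domain. The sketch below computes, for each interval of a partition, the gap between the best fixed loss and the sum of per-round minima, together with twice the interval length times the local functional variation. The function name, the 0-indexed inclusive interval convention, and the discretized domain are assumptions made for this illustration.

```python
import numpy as np

def variation_gap_check(losses, intervals):
    """Compare the per-interval gap with the variation-based bound.

    losses:    array of shape (T, n), losses[t, i] = loss of round t at grid point i
    intervals: list of (start, end) pairs, 0-indexed and inclusive, forming a partition
    Returns a list of (gap, bound) pairs, one per interval, where
      gap   = min_w sum_t f_t(w) - sum_t min_w f_t(w)
      bound = 2 * (interval length) * (local functional variation).
    """
    results = []
    for start, end in intervals:
        block = losses[start:end + 1]
        gap = block.sum(axis=0).min() - block.min(axis=1).sum()
        # Local variation only counts changes between rounds inside the interval.
        local_variation = np.abs(np.diff(block, axis=0)).max(axis=1).sum()
        results.append((float(gap), float(2 * len(block) * local_variation)))
    return results
```

For example, with slowly drifting quadratic losses on a grid over $[-1, 1]$ and three intervals of equal length, every returned `gap` should be nonnegative and no larger than the matching `bound`, in line with the argument of the proof.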
5.2 Proof of Corollary 3
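To simplify the upper bound in Theorem 1, we restrict to intervals of the same length $\tau$, and in this case $k = T/\tau$ (assuming for simplicity that $\tau$ divides $T$). Then, we have
$$\text{D-Regret}(w_1^*, \ldots, w_T^*) \le \min_{1 \le \tau \le T} \sum_{i=1}^{k} \Big( \text{SA-Regret}(T, \tau) + 2 \tau V_T(i) \Big) = \min_{1 \le \tau \le T} \left( \frac{\text{SA-Regret}(T, \tau)\, T}{\tau} + 2 \tau \sum_{i=1}^{k} V_T(i) \right) \le \min_{1 \le \tau \le T} \left( \frac{\text{SA-Regret}(T, \tau)\, T}{\tau} + 2 \tau V_T \right).$$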
Combining with Theorem 2, we have
In the following, we consider two cases. If , we choose
Otherwise, we choose , and have
6 Proof of Theorem 6
From the second part of Lemma 6, we know that there exist segments
with , such that
Furthermore, the expert is alive during the period .
Using Claim 3.1 of Hazan and Seshadhri (2009), we have
where is the sequence of solutions generated by the expert . Similarly, for the last segment, we have
By adding things together, we have
According to the property of online Newton step (Hazan et al., 2007, Theorem 2), we have, for any ,