Predictive Online Convex Optimization

Abstract

We incorporate future information in the form of the estimated value of future gradients in online convex optimization. This is motivated by demand response in power systems, where forecasts about the current round, e.g., the weather or the loads’ behavior, can be used to improve on predictions made with only past observations. Specifically, we introduce an additional predictive step that follows the standard online convex optimization step when certain conditions on the estimated gradient and descent direction are met. We show that under these conditions and without any assumptions on the predictability of the environment, the predictive update strictly improves on the performance of the standard update. We give two types of predictive update for various families of loss functions. We provide a regret bound for each of our predictive online convex optimization algorithms. Finally, we apply our framework to an example based on demand response, which demonstrates its superior performance over a standard online convex optimization algorithm.

Antoine Lesage-Landry, Iman Shames, Joshua A. Taylor

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada 

Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Australia 


Key words:  Convex optimization; learning algorithms; machine learning; power systems; renewable energy systems; load dispatching

 

This work was partly done while A. Lesage-Landry was visiting The University of Melbourne, Australia. This work was funded by the Fonds de recherche du Québec – Nature et technologies, the Ontario Ministry of Research, Innovation and Science and the Natural Sciences and Engineering Research Council of Canada. This paper was not presented at any IFAC meeting. Corresponding author: A. Lesage-Landry. Tel.: +1 416-978-6842; fax: +1 416-978-1145.

1 Introduction

Online convex optimization (OCO) has found applications in fields like network resource allocation [8, 7, 6] and demand response in power systems [20, 21]. It is used for sequential decision-making when contextual information or feedback is only revealed to the decision maker at the end of the current round. Theoretical results showing that OCO algorithms have bounded regret guarantee the performance of these algorithms under mild assumptions.

In many applications, the decision maker has access to both revealed past information and estimates about future rounds. For example, in power systems, weather forecasts or historical load patterns can be used to estimate the future regulation needs [22, 4]. In this work, we present the predictive online convex optimization (POCO) framework. POCO works under the assumption that an estimate of the gradient of the loss function for the next round is available to the decision maker. In POCO, a standard OCO update is first applied using past information to compute the next decision. Then, the decision maker checks the quality of the estimated information available to them. If the estimated gradient is considered accurate enough, the decision maker implements an additional projected gradient step based on the estimated gradient to improve their decision for this round. This last step is referred to as the predictive update.

We introduce explicit criteria for determining if the quality of the estimated gradient is high enough to guarantee an improvement over a standard OCO step when the predictive update is applied. A regret bound is obtained for all our algorithms. We conclude this work by presenting numerical examples where a POCO algorithm is used to improve on the performance of demand response with standard OCO. This example is motivated by the fact that a load aggregator often has access to an estimate of the power imbalance they have to counteract for regulation purposes.

Literature review. Recent work in online convex optimization has focused on including prior or future information. Reference [28], which builds on [9], assumes that the problem’s unknown and uncertain parameters follow a predictable process plus some noise [27] for their OCO algorithm. As in our setting, a second update with an estimated gradient-like term follows a mirror descent update. This second update is used by the algorithm in every step regardless of the quality of the estimated gradient. For this reason, the algorithm is referred to as optimistic. Optimistic algorithms were also studied in [31, 23, 34]. No conditions are provided about the estimated gradient in this case except that it comes from past observations and/or side information via an oracle. The authors of [28] show that the optimistic mirror descent can lead to a tighter bound than a standard online mirror descent algorithm if the process is indeed predictable. In [19], the authors provide a dynamic regret bound for the optimistic mirror descent. There is, however, no guarantee that in a given round the optimistic update does not do worse than the standard OCO update. An algorithm similar to [28] is given in [18]. In their work, they make the stronger assumption that the exact gradient of the next round loss function is available and then provide a static regret bound for their setting. This differs from our setting in that we provide dynamic regret-bounded algorithms and use an estimated gradient, which entails less restrictive assumptions. Several other authors have studied different ways to incorporate future information in OCO, like using state information [17] or the direction of the loss function’s gradient in an online linear optimization setting [10].

The projected gradient descent, inexact gradient descent, and proximal algorithms [1, 29, 2] from conventional convex optimization resemble our setting. These algorithms differ from ours because they aim to minimize the same objective function throughout all descent steps. In OCO, we minimize a sequence of objective functions and at each time provide a decision to minimize the current loss function. The loss function in a given round is only observed after we have committed to a decision. OCO will be introduced formally in Section 2.

Model predictive control (MPC) [14, 3] is another widely-used sequential decision-making framework. In MPC, the decision maker solves to optimality a receding horizon optimization problem that relies on models of future round loss functions. This thus requires significantly more contextual information and computational resources. These limitations are absent in OCO, making it a more suitable tool for real-time decision making with small computational resources.

Because we characterize conditions under which the predictive step improves performance, we guarantee improvement over conventional OCO and require no predictability assumptions. These conditions can be checked at each round of OCO, and if satisfied, the predictive update is implemented. In sum, in this work we make the following contributions:

  • We introduce a novel predictive online convex optimization framework and provide conditions for when to use side information.

  • We propose a predictive update with a predetermined step size for loss functions that have a Lipschitz gradient. We show that this update leads to a strict improvement over an OCO update when used (Section 4).

  • We give a predictive update with backtracking line search that applies to a broader family of problems. We show that it leads to strict improvement over an OCO update (Section 5).

  • We obtain sublinear regret bounds in the number of rounds for all algorithms.

  • We apply our framework to demand response in power systems and find that it outperforms a standard OCO algorithm (Section 6).

2 Background

In OCO, one must make a decision at each round to minimize their cumulative loss [30, 16]. The current round’s loss function and any other round-dependent parameters are not available at the moment when the decision is made. Only information about previous rounds can be used to make the decision. Once the decision has been made, information about the current round is observed.

Let denote the current round index and be the time horizon. Let , , be the decision set, and let be the decision variable at time . We denote the differentiable convex loss function by for . Let be the Euclidean norm. We denote the projection operator onto the set as .

The goal of the decision maker is to sequentially solve the following sequence of problems:

(1)

for . The decision maker observes the loss function after choosing . For this reason, even if the loss function has a simple form, an analytical solution to the round optimization problem (1) is not obtainable. The decision is computed using a gradient descent-based [35], mirror descent-based [13], or Newton step-based [15] rule. For example, in the online gradient descent (OGD) algorithm [35], the decision at round , , is given by the update:

(2)

where .
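As an illustration of the OGD update, the following minimal sketch runs a few rounds of projected online gradient descent. The box-shaped decision set, the quadratic tracking losses, the step size, and all function and variable names are assumptions made for this example only; they are not taken from the paper's experiments.

```python
import numpy as np

def project_box(x, lower, upper):
    """Euclidean projection onto the box {x : lower <= x <= upper}."""
    return np.clip(x, lower, upper)

def ogd_step(x, grad, step_size, lower, upper):
    """One OGD step: move against the observed gradient, then project."""
    return project_box(x - step_size * grad, lower, upper)

# Hypothetical losses f_t(x) = 0.5 * ||x - target_t||^2 on the box [-1, 1]^2.
targets = [np.array([0.5, -0.2]), np.array([2.0, 0.3]), np.array([-3.0, 0.0])]
x = np.zeros(2)
decisions = []
for target in targets:
    decisions.append(x.copy())   # commit to x before the loss is revealed
    grad = x - target            # gradient of f_t at the committed decision
    x = ogd_step(x, grad, step_size=0.5, lower=-1.0, upper=1.0)
```

Each decision is committed before its loss is observed, so the gradient used in a step always belongs to the round that has just ended.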

Throughout this work, we make the following assumptions, which are standard in the OCO literature [35, 30, 16].

Assumption 1

The set is convex and compact.

The decision set represents all constraints on . In this version of OCO, we only consider time-invariant constraints.

Assumption 2

The loss function is -bounded: for and .

Assumption 3

The gradient of the loss function is -bounded: for and .

As a consequence of Assumption 1, the decision variable is also -bounded: for . We define the diameter of the compact set as and let , a positive scalar. The remainder of the assumptions will be stated when a specific technical result requires it.

The design tool of OCO algorithms is the regret [30, 16]. In this work, we use the dynamic regret [8, 19, 35, 24]:

(3)

where . The dynamic regret compares the loss suffered by the decision maker to optimal performance in each round. Other versions of the regret exist, e.g., static regret [30, 16, 35], which is defined in terms of the optimal stationary decision, in (2). In this work, we only consider the dynamic regret because it yields a stronger theoretical guarantee. This theoretical guarantee is also more relevant in the context of time-varying optimization. For this reason, we refer to the dynamic regret, , simply as the regret. Note that a bounded dynamic regret implies a bounded static regret [8]. The goal when designing an OCO algorithm is to show that the regret is sublinearly bounded above in the number of rounds. An OCO algorithm with a sublinearly bounded dynamic regret in the number of rounds will on average perform at least as well as the round optimal decision at each round [30, 16, 15].

We conclude this section by defining the quantity

The term quantifies the variation of the optimal predictions through all rounds.
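To make these two quantities concrete, the sketch below computes a dynamic regret and a variation term for a toy sequence. It assumes that the variation term is the summed Euclidean distance between consecutive round optima, a common choice in the dynamic regret literature; the losses, decisions, and function names are hypothetical.

```python
import numpy as np

def dynamic_regret(losses, decisions, minimizers):
    """Sum over rounds of f_t(x_t) - f_t(x_t^*)."""
    return sum(f(x) - f(x_star)
               for f, x, x_star in zip(losses, decisions, minimizers))

def path_variation(minimizers):
    """Summed distance between consecutive round optima (assumed definition)."""
    return sum(np.linalg.norm(b - a)
               for a, b in zip(minimizers[:-1], minimizers[1:]))

# Hypothetical tracking losses f_t(x) = 0.5 * ||x - c_t||^2, minimized at c_t.
centers = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
losses = [lambda x, c=c: 0.5 * float(np.sum((x - c) ** 2)) for c in centers]
decisions = [np.zeros(2) for _ in centers]   # a decision maker that never moves

regret = dynamic_regret(losses, decisions, centers)
variation = path_variation(centers)
```

In this toy case the stationary decision maker accumulates regret as the round optima drift away from the origin, and the variation term measures that drift.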

3 Predictive OCO

We now introduce our POCO framework. We let be the decision computed by an OCO algorithm in round . This OCO algorithm can be, for example, the aforementioned OGD. The decision is then given by the update (2). In POCO, we consider an -forecaster introduced in Assumption 4. Let be the estimated gradient of the loss function at .

Assumption 4 (-forecaster)

The -forecaster has access to an estimated gradient, , such that where is a positive scalar, for and all time .

In other words, we consider a forecaster that has access to some information about the next round, an estimated gradient , in addition to the standard OCO assumptions. For conciseness, we denote the estimated gradient by . We omit its dependency on because it is always evaluated at the OCO update output, , and no other points. The decision maker can meet Assumption 4 by relying on an exogenous model to estimate the gradient . In the context of demand response, historical data of the load’s consumption and generator output’s patterns, weather history and the historical values of the gradient, for example, can be used to build a statistical model to estimate the value of the at the decision given by OCO update. The parameter can then be set according to, for example, a high confidence interval or a worst-case performance parameter. The forecaster would then provide using this model.

Then, if certain conditions are met, the following update rule for our proposed POCO algorithm is used.

Definition 1 (Predictive update)

Let be an appropriately chosen step size. The predictive update is

(4)

The predictive update is to be used directly after the OCO update and will lead to a strict improvement over the OCO update under certain conditions. The aforementioned conditions will be discussed in the next sections and depend on the properties of the loss function. If the conditions are not met, is directly used. Let be the desired improvement when using the predictive update. We define the counter :

with . The variable represents the number of predictive updates as described in Definition 1. Let be the ratio of rounds using the predictive update to the total number of rounds.

Depending on the loss function, any regret-bounded OCO update can be used in the POCO framework. Back to the OGD example, the predictive OGD uses the update (2) and if certain conditions are met,

and if not, . We write as a function of the step size and let be the descent direction.
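A minimal sketch of this two-stage logic is given below. The boolean flag `use_prediction` stands in for the conditions on the estimated gradient, which are developed in the following sections and are not reproduced here; the box decision set and all names are assumptions for illustration.

```python
import numpy as np

def project_box(x, lower, upper):
    """Euclidean projection onto the box {x : lower <= x <= upper}."""
    return np.clip(x, lower, upper)

def predictive_step(x_oco, grad_estimate, step_size, lower, upper, use_prediction):
    """Apply a projected step along the estimated gradient if the check passed.

    `use_prediction` is a placeholder for the round-wise conditions on the
    estimated gradient; when it is False the OCO decision is played unchanged.
    """
    if not use_prediction:
        return x_oco
    return project_box(x_oco - step_size * grad_estimate, lower, upper)

x_oco = np.array([0.2, 0.9])            # output of the OCO update (hypothetical)
grad_estimate = np.array([1.0, -1.0])   # forecast of the next gradient (hypothetical)

x_pred = predictive_step(x_oco, grad_estimate, 0.5, -1.0, 1.0, use_prediction=True)
x_kept = predictive_step(x_oco, grad_estimate, 0.5, -1.0, 1.0, use_prediction=False)
```

Keeping the check outside the update function mirrors the structure of the framework: the OCO update is always computed, and the predictive step is an optional refinement.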

Next, we provide sufficient conditions for the estimated gradient to be a feasible descent direction. Later, we consider the step size selection problem. Particularly, two cases are considered where (i) the step sizes are constant and chosen a priori based on a property of the loss functions, or (ii) the step sizes are selected through the application of a backtracking line search that enforces a modified online version of the Armijo condition [2].

The following lemma introduces a sufficient condition for the estimated gradient to be a descent direction of the OCO problem (1).

Lemma 1 (Estimated descent direction)

The estimated gradient provided by the -forecaster is a descent direction for if .

The proof of Lemma 1 is presented in Appendix A. The next lemma is adapted from [2] and ensures that the predictive step follows a feasible descent direction.

Lemma 2 (Feasible estimated descent direction)

For all and , if and , then is a feasible descent direction at and

The proof is provided in Appendix B.

4 POCO with fixed step size

We now present a predictive update where step sizes are fixed and based on a property of the sequence of loss functions. We conclude this section by providing regret-bounded algorithms using these updates. In this section, we add the following assumption:

Assumption 5

Let , the loss function has an -Lipschitz gradient:

for all and .
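One standard consequence of a Lipschitz gradient is a quadratic upper bound on the function: f(y) <= f(x) + grad f(x)^T (y - x) + (L/2) ||y - x||^2. The sketch below checks this bound, together with the Lipschitz property of the gradient, on a hypothetical quadratic loss whose Lipschitz constant is the largest eigenvalue of its Hessian. The matrix, the symbol `L`, and the sampling scheme are assumptions for illustration.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical symmetric positive definite Hessian
L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient x -> A x

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

rng = np.random.default_rng(0)
max_gap = -np.inf      # largest violation of the quadratic upper bound
max_ratio = 0.0        # largest observed ||grad(x) - grad(y)|| / ||x - y||
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    max_gap = max(max_gap, f(y) - upper)
    max_ratio = max(max_ratio,
                    np.linalg.norm(grad(x) - grad(y)) / np.linalg.norm(x - y))

bound_holds = max_gap <= 1e-9
```

A bound of this type is what allows a step size to be fixed in advance from the Lipschitz constant rather than searched for at each round.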

We propose a predictive update with fixed step size next. We state sufficient conditions that guarantee a strict improvement over an OCO update. These sufficient conditions can be checked at each round to determine if the estimated information is accurate enough, and therefore if the predictive update should be used in the current round.

Lemma 3 (Predictive update with fixed step size)

Suppose that Assumption 5 holds and . If and , then the predictive update (4) used by the -forecaster strictly improves on the OCO update and the improvement is bounded below by .

The proof of Lemma 3 is provided in Appendix C. We now present a regret bound for POCO algorithms that use the predictive update with fixed step size to improve the performance of OCO algorithms.

Theorem 1 (POCO regret bound)

Consider an OCO algorithm with a sublinear regret upper bound. Suppose that the forecaster uses the predictive update (4) only at rounds when the estimated gradient and feasible descent direction satisfy the assumptions of Lemma 3. If the ratio of rounds satisfying these assumptions is greater than , then the regret of the POCO algorithm is bounded above by

Proof. Let denote the decision variable with for all . In other words, represents the decision variable computed without the predictive algorithm. Denote the set of assumptions of Lemma 3 at round by . Let be the indicator function where if the assumptions are satisfied and otherwise. Observe that the improvement, , is given by

(5)

where is the improvement when . The regret of the POCO algorithm is

(6)

Using (5), we re-express in (6):

(7)

By Lemma 3, the improvement is bounded below by . We rewrite (7) as

A minimum of rounds satisfy and hence

(8)

This theorem leads to the following corollary which provides a regret bound for the OGD with predictive updates (POGD).

Corollary 1 (Regret bound for POGD)

Suppose that the ratio of rounds that respects the assumptions of Lemma 3 is . Then the predictive OGD algorithm’s regret is bounded above by

which is sublinear and tighter than the OGD regret bound.

Proof. The dynamic regret bound for the OGD algorithm, , is given in [35]. The result then follows from substituting and in Theorem 1.

5 POCO with backtracking line search

In this section, we do not require Assumption 5 to hold. We however use the following proposition:

Proposition 1

The loss function is -time-Lipschitz with , that is:

for all at all , and all .

Proposition 1 always holds because Assumption 2 implies that is sufficient. Under Proposition 1, we consider functions that are -locally and globally Lipschitz in their time argument, respectively, for the intermediary bound () and the upper bound (). This can represent, for example, loss functions like squared tracking error functions, in which the time-varying targets are always contained in a closed set.

In the case of the POCO with backtracking (POCOb), we re-express the update (4) as given in Definition 2. The backtracking line search for the predictive update is given in Algorithm 1.

Definition 2 (POCOb update)

Let be a positive scalar and be determined by a backtracking line search algorithm. The predictive update with backtracking line search is

(9)
1:  Parameters: Given and .
2:  Initialization: Set .
3:  
4:  .
5:  while  and  do
6:     .
7:  end while
8:  if  then
9:     .
10:  end if
Algorithm 1 Backtracking algorithm for predictive gradient projection
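For orientation, the sketch below implements the classical, offline Armijo backtracking rule along the projection arc for gradient projection. It is not the modified online condition used by Algorithm 1, which additionally accounts for the gradient estimation error and for the time variation of the loss. The box decision set, the parameters `alpha0`, `beta`, `sigma`, and `max_iter`, and the test function are assumptions for illustration.

```python
import numpy as np

def project_box(x, lower, upper):
    """Euclidean projection onto the box {x : lower <= x <= upper}."""
    return np.clip(x, lower, upper)

def backtracking_projection(f, x, direction, lower, upper,
                            alpha0=1.0, beta=0.5, sigma=1e-4, max_iter=50):
    """Shrink the step until the classical Armijo rule along the projection arc holds.

    Returns (step, candidate). If no step is accepted within max_iter trials,
    returns (0.0, x), i.e., the current point is kept.
    """
    fx = f(x)
    alpha = alpha0
    for _ in range(max_iter):
        candidate = project_box(x - alpha * direction, lower, upper)
        # Sufficient decrease: f(x) - f(candidate) >= sigma * direction^T (x - candidate)
        if f(candidate) <= fx - sigma * direction @ (x - candidate):
            return alpha, candidate
        alpha *= beta
    return 0.0, x

# Hypothetical one-dimensional example: f(x) = 5 * x^2 on the interval [-1, 1].
f = lambda x: 5.0 * float(x @ x)
x0 = np.array([0.5])
step, x_new = backtracking_projection(f, x0, 10.0 * x0, -1.0, 1.0)
```

In this example the initial step overshoots the minimizer, so the step size is halved three times before the sufficient decrease test passes. Returning a zero step when the search fails corresponds to falling back on the unmodified decision.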

The next lemma shows that the backtracking line search-based predictive update improves on the OCO update. Our claim relies on the modified Armijo condition for gradient projection. This condition ensures a sufficient decrease in the objective when using an estimated gradient projection descent direction [33]. We adapt this condition to the estimated gradient and online setting. The modified Armijo condition for gradient projection [2] on and feasible descent direction for some with step size is given by:

(10)
Lemma 4 (Sufficient decrease of POCOb update)

Suppose . If Algorithm 1 terminates with a step size , then the predictive update with backtracking line search (9) used by the -forecaster satisfies the modified Armijo condition (10), and will thus lead to a sufficient decrease in the loss function, outperforming the OCO update.

The proof of the previous lemma is given in Appendix D.

Remark 1

Algorithm 1 ensures that when , satisfied:

(11)

Every element of (11) is available at time , which is not the case in (10). This allows us to use a backtracking line search algorithm to determine in an OCO setting. Algorithm 1 also ensures that the step size is not too small (cf. [33, Section 3.1]).

Note that there is an additional term in the modified Armijo condition for estimated gradient projection. This is a consequence of not having access to the exact gradient of . Hence, to ensure that the update is valid, the modified Armijo condition is augmented by a term proportional to the error of the estimated gradient. The second additional term, , is due to the time-varying setting of OCO.

We now discuss the existence of step sizes that satisfy (11). Before stating the main result, for a given and , define the set of step sizes that comply with line 5 in the line search algorithm, which is the modified Armijo condition for online settings (11):

Theorem 2

Suppose is a feasible descent direction and is bounded below for all . Then there exists such that if and only if .

Proof. Assume . This assumption implies that by Proposition 1. Thus, is not the minimum point of . It follows that . By assumption, is a feasible descent direction and we have

(12)

Let . Subtracting from both sides of (12), we obtain

(13)

If the following condition holds, then (13) also holds:

(14)

Under Assumption 3, for all we have

and by Assumption 1, we have . Then, if

(15)

holds, then so does (14). We rewrite (15) as

(16)

Recalling Taylor’s Theorem [33, Theorem 2.1]:

where , and for some . We let and . We have,

(17)

We bound above the last term of (17) using (16) and obtain

(18)

By setting in (18), we then have . This shows that there always exists at least one point which satisfies the assumption on the existence of such that that is along the feasible descent direction from .

Next, adapting the proof of [33, Lemma 3.1] for the modified Armijo condition for online settings (11), it follows that there exists such that

The set is therefore non-empty if there exists such that .

We now show the converse. Assuming , then there exists and

(19)

holds since by Lemma 2 and . Thus, (19) implies that there exists such that and one such point is . This completes the proof.

We note that Theorem 2 does not guarantee that the backtracking algorithm, Algorithm 1, will find a non-zero step size. Other techniques, such as exact line searches, might be required to identify an adequate step size in some problem instances. Using Theorem 2, we can provide a lower bound on the improvement of the predictive update with backtracking line search.

Corollary 2 (POCOb update improvement)

Suppose that the assumptions of Lemma 4 hold and , then the predictive update with backtracking line search improves on the OCO update by a minimum of .

Proof. Since , then . By the converse of Theorem 2, we have where , the decision played by the predictive update (9). The predictive update hence improves on the OCO update by at least .

We now state a regret bound for the POCOb algorithm.

Theorem 3 (POCOb regret bound)

Consider an OCO algorithm with bounded regret. Suppose that the assumptions of Lemma 4 are met. If the ratio of rounds with and satisfying these assumptions to is greater than , then the regret of the POCO algorithm with backtracking used by the -forecaster is bounded above by

(20)

and thus outperforms the OCO algorithm.

Proof. Let be the indicator function where if at round , and or otherwise. Using the same approach as in Theorem 1’s proof with Corollary 2, we obtain the regret bound. The last term of (20) is strictly positive and thus the POCOb regret is always bounded above by the OCO algorithm regret.

Remark 2

Note that if the locally Lipschitz statement of Proposition 1 is used, then is replaced by in the modified Armijo condition for online settings (11), and the bound (20) can be recomputed accordingly.

6 Example

In this section, we apply POCO algorithms to demand response in power systems [5, 26], specifically regulation and curtailment. At each time step, a demand response (DR) aggregator sends instructions to their loads to follow a regulation signal, e.g., a power imbalance due to a sudden change in renewable power generation [4, 32]. A second example of a regulation signal is the area control error (ACE). Each load responds to the signal by adjusting its power consumption. The power consumption is constrained by a storage capacity, which could represent physical storage like a battery or the load’s limits, e.g., thermal constraints. The regulation signal is unknown at the time the DR instructions are sent. This can be due, for example, to a drop in renewable power generation which is only assessed after the generator has committed to some amount of power. The objective of the DR aggregator is, therefore, to predict the DR dispatch at each time instance. This problem can be formulated as POCO, in which an estimate of the regulation signal is available to the load aggregator.

6.1 Setting

We consider loads. We denote as the decision variable at round . The variable represents the instructions sent to the loads. Let be the regulation at time . Let be the maximum and minimum power that can be consumed or delivered for all loads. Define the decision set , a convex and compact set. We denote as the state of charge vector of the loads at time and as the vector of load energy capacities. The state of charge of a load at time is . In the current case, we assume that there is neither leakage nor energy loss.

The OCO problem takes the following form:

(21)

The loss function has two terms: (i) a regulation term where the aggregated loads are dispatched to follow a regulation signal and (ii) a state of charge objective added to keep the loads near half their energy capacity. The loss function given in (21) is -strongly convex. For this reason, we use the OGD for strongly convex functions (OGD) proposed in [24], which offers a tighter regret bound than the standard OGD. The following corollary gives an upper bound on the regret of predictive OGD for strongly convex functions (POGD).

Corollary 3 (POGD for strongly convex functions)

Suppose is -strongly convex and satisfies Assumption 5 for all . Consider the OGD update

where and . Then, the POGD with fixed step size, given that the assumptions of Lemma 3 hold for a ratio of the total rounds greater than , has a regret bounded above by

Proof. For Corollary 3, we follow the proof of Theorem 1 and obtain

(22)

where we have substituted the OGD algorithm in (8). From [24], we have

(23)

Combining (22) and (23) leads to our result.

We now present simulation results. All optimizations are solved using CVXPY [11] and the ECOS [12] solver.

6.2 Fixed step size numerical examples

The load and numerical parameters for this example are gathered in Table 1. The initial state of charge of each load is set to half its capacity. The regulation signal is . The parameter is a Gaussian noise used to model sudden changes. We assume that the aggregator has access to an estimated gradient for different levels of accuracy . This represents, for example when , a relative error of at least of the actual gradient norm. The parameter is set to achieve adequate regulation performance without deviating too much from each load’s desired state of charge.

Parameter Value Unit
loads
seconds
kW
kW
kWh
, &
Table 1: Parameters for POCO numerical simulations
(a) Regret comparison (log scale)
(b) Regulation using POCO with fixed step size
Fig. 1: Experimental comparison between the POCO with fixed step size and OCO

We now present the performance of our POCO algorithm with a fixed step size. We implement the OMD from [28] for comparison. This algorithm uses without validating the estimated information. Figure 1(a) shows an instance of the experimental regret for the POCO with three different values of , the conventional OCO algorithm, their respective regret bounds and OMD’s regret. POCO outperforms its bound, the OCO and the OMD algorithms. The improved performance of the POCO algorithm is also seen in the comparison with the OCO and OMD algorithms in Table 2. We remark that, as expected, the number of predictive updates increases with the accuracy of the estimated gradient, and the performance of the POCO algorithm improves accordingly.

(out of ) reduction
OMD
Table 2: Comparison between POCO, OCO, and OMD algorithms

Lastly, Figure 1(b) presents the regulation services provided by the DR aggregator for . In this figure, the tracking done by the POCO and the OCO algorithm are shown in blue and orange, respectively. The POCO algorithm accurately follows the regulation signal and consequently is almost always superimposed on it in Figure 1(b). The high performance of the POCO algorithm can be observed in the zoomed subplot of Figure 1(b).

6.3 Backtracking line search numerical examples

We now present an example of POCO with backtracking. We consider a curtailment scenario. We let be the total power to be curtailed by the loads at time for . When a contingency occurs in the network, flexible loads are called to curtail their power consumption, e.g., by using their battery energy storage or temporarily shutting down their HVAC system. Contrary to the regulation case, the loads are not contracted to follow a setpoint and no penalties are assessed on loads curtailing more than asked. Similar to the regulation setting, the curtailment signal is unknown until immediately after the current round. This setting can be modeled as POCO where an estimated curtailment signal is available to the aggregator at each round. We use the same notation as the previous examples. Let . This curtailment scenario is modeled by loss function given below:

(24)

where we have added a recovery coefficient to the state of charge objective term used previously. This coefficient models the usual evolution of the load (e.g., ambient temperature heating for a thermostatic load). We let . This is equivalent to a recovery coefficient of per hour. The function given in (24) is not gradient Lipschitz and Assumption 5 does not hold. We model the curtailment signal to be quickly increasing at first and then slowly plateauing to represent a new level of available generation. This event is assumed to be limited in time, after which the network goes back to its normal state and no curtailment is then required. We let where for and then where for . The noise variance is equivalent to approximately of the curtailment signal’s value at first and then about .

Parameter Value
, &
Table 3: Different parameters for POCOb numerical simulations

We use the same parameters as in the previous section, except for the ones shown in Table 3. Figure 2 shows the performance of our algorithm. The POCOb experimental regret shown in Figure 2(a) is sublinear in the number of rounds and outperforms the OCO’s regret. While the performance is not as high as in the fixed step size case, this algorithm can be applied to a broader family of functions since it does not require the loss function to be gradient Lipschitz continuous. We note that POCOb performs better when large variations of are registered and for smaller values of . Similarly to the fixed step size case, the POCO allows better curtailment than its OCO counterpart in the context of DR, as presented in Figure 2(b) for .

(out of ) reduction
Table 4: Comparison between POCOb and OCO algorithms
(a) Regret comparison (log scale)
(b) Curtailment using POCO with backtracking
Fig. 2: Experimental comparison between the POCO with backtracking and OCO

7 Conclusion

In this work, we have presented the predictive online convex optimization framework. In POCO, a second update is used after the OCO update to improve performance using an estimated gradient. We have presented three versions of the predictive update that can be used under different assumptions. We have shown a regret upper bound for all of our POCO algorithms. We have applied POCO to demand response in electric power systems and found that it outperforms conventional OCO using commonly available forecast information. In the case of the fixed step size update, we observed an improvement of in the final regret and of in the backtracking case when having access to a -forecaster.

Acknowledgements

A.L.-L. thanks P. Mancarella for his support and for co-hosting him at The University of Melbourne.

A Proof of Lemma 1

Define as where by the definition of the -forecaster. is a descent direction if . From this, we have

Equivalently, we have . Taking the norm of both sides and dividing by the norm of gives

(A.1)

where is the angle between and

By assumption, and . Therefore (A.1) always holds and we have proved the lemma.
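The argument can be checked numerically under one reading of the lemma. The sketch below assumes that the forecaster's error is bounded in Euclidean norm by a scalar `eps` and that the sufficient condition is that the norm of the estimated gradient exceeds `eps`; both are assumptions, since the exact statement is not reproduced here. Under them, the Cauchy-Schwarz inequality gives that the inner product of the estimated and true gradients is at least ||g_hat|| (||g_hat|| - eps), which is positive.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.5                 # assumed bound on the estimation error norm
min_inner = np.inf        # smallest inner product between estimate and true gradient
min_margin = np.inf       # smallest slack with respect to the Cauchy-Schwarz bound
for _ in range(10000):
    g_hat = rng.normal(size=3)                         # hypothetical estimated gradient
    n = np.linalg.norm(g_hat)
    if n <= eps:
        continue                                       # assumed condition not met, skip
    err = rng.normal(size=3)
    err *= eps * rng.uniform() / np.linalg.norm(err)   # error with norm below eps
    g_true = g_hat - err                               # true gradient consistent with the bound
    inner = g_hat @ g_true
    min_inner = min(min_inner, inner)
    min_margin = min(min_margin, inner - n * (n - eps))
```

A positive inner product means that moving against the estimated gradient also decreases the true loss to first order, which is the descent property claimed by the lemma.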

B Proof of Lemma 2

The identity follows from [2, Proposition 6.1.1] with instead of the gradient of the loss function.

It then follows from Lemma 1 that with is a descent direction at . Thus, is a feasible descent direction because and for all and .

C Proof of Lemma 3

By Assumption 5, has an -Lipschitz gradient. We use the following inequality from [25, Theorem 2.1.5]

(C.1)

for all . We substitute and into (C.1) to obtain

For the remainder of the proof, we use to simplify the notation. We rewrite the gradient in terms of the estimated gradient, which yields

(C.2)

By assumption, , which ensures that Lemma 2 holds. We use Lemma 2 to upper bound the second term of the right-hand side of (C.2). We then have

Therefore, the predictive update with fixed step size will improve on the OCO update by a minimum of if the following condition is satisfied:

(C.3)

Assuming , then , and if