Personalized Optimization with User’s Feedback

Andrea Simonetto, Emiliano Dall’Anese, Julien Monteil, Andrey Bernstein
Version: July 18, 2019
A. Simonetto and J. Monteil are with the Optimization and Control Group of IBM Research Ireland, Dublin, Ireland. Email: andrea.simonetto@ibm.com, julien.monteil@ie.ibm.com. E. Dall’Anese is with the department of Electrical, Computer, and Energy Engineering at the University of Colorado Boulder, Boulder, CO, USA; email: emiliano.dallanese@colorado.edu. A. Bernstein is with the National Renewable Energy Laboratory (NREL), Golden, CO, USA; email: andrey.bernstein@nrel.gov. The work of A. Bernstein was supported in part by the Laboratory Directed Research and Development (LDRD) Program at NREL.
Abstract

This paper develops an online algorithm to solve a time-varying optimization problem with an objective that comprises a known time-varying cost and an unknown function. This problem structure arises in a number of engineering systems and cyber-physical systems where the known function captures time-varying engineering costs, and the unknown function models user's satisfaction; in this context, the objective is to strike a balance between given performance metrics and user's satisfaction. Key challenges of the problem at hand are (1) the time variability of the problem, and (2) the fact that the user's utility function is learned concurrently with the execution of the online algorithm. This paper leverages Gaussian processes (GP) to learn the unknown cost function from noisy functional evaluations and to build pertinent upper confidence bounds. Using the GP formalism, the paper then advocates time-varying optimization tools to design an online algorithm that exhibits tracking of the oracle-based optimal trajectory within an error ball, while learning the user's satisfaction function with no regret. The algorithmic steps are inexact, to account for possible limited computational budgets or real-time implementation considerations. Numerical examples are illustrated based on a problem related to vehicle platooning.

I Introduction

Optimization is ubiquitous in engineering systems and cyber-physical systems, including smart homes, energy grids, and intelligent transportation systems. As an example of the latter, optimization toolboxes are advocated for intelligent traffic light management and for congestion-aware systems for highways, where a networked control system could control vehicles and form a platoon optimized for lowering fuel consumption and emissions and for increasing vehicle spatial density. In many applications involving (and affecting) end-users, optimization problems are formulated with the objective of striking a balance between given engineered performance metrics and user's satisfaction. Engineered metrics may include, e.g., operational cost and efficiency; on the other hand, metrics to be addressed for the users may be related to comfort (e.g., temperature in a home or building), perceived safety (e.g., distance from preceding vehicles while driving on a freeway), or simply preferences (e.g., taking a route with the least number of traffic lights).

While engineered performance goals may be synthesized based on well-defined metrics emerging from physical models or control structures, the utility function to be optimized for the users is primarily based on synthetic models postulated for comfort and satisfaction; these synthetic functions are constructed based on generic welfare models, averaged or statistical human-perception models estimated over a sufficiently large population of users' responses, or just simplified mathematical models that make the problem tractable. However, oftentimes, these strategies do not lead to meaningful optimization outcomes, since (1) generic welfare or averaged models may not fully capture preferences, comfort, or satisfaction of individual users, and (2) synthetic utility functions may bear no relevance to a number of individuals and users.

This paper investigates time-varying optimization problems with an objective that comprises a known time-varying cost and an unknown utility function. The known function captures time-varying engineering costs, where time variability emerges from underlying dynamics of the systems. The second term pertains to the users, and models personalized utility functions; these functions are in general unknown, and they must be learned concurrently with the solution of the optimization problem. Typically, three separate time-scales are considered in this setting: (i) temporal variations of the engineering cost; (ii) learning rates for the user’s utility function; and, (iii) convergence rate of the optimization algorithm. In the paper, we compress these time-scales by advocating time-varying optimization tools to design an online algorithm that tracks time-varying optimal decision variables emerging from time-varying engineering costs within an error bound, while concurrently learning the user’s satisfaction with no-regret. We model the user’s satisfaction as a Gaussian process (GP) and learn its parameters from user’s feedback [1, 2]; in particular, feedback is in the form of noisy functional evaluations of the (unknown) utility function. The user’s satisfaction is learned by implementing the approximate decisions and measuring the user’s feedback to update the user’s GP model.

Overall, the main contributions of the paper are as follows:

The paper extends the works [1, 2] on GP to handle a sum of a convex engineering cost and an unknown user utility function by considering approximate online algorithms; the term “approximate” refers to the fact that intermediate optimization sub-problems in the algorithmic steps are not assumed to be solved to convergence due to underlying complexity limits. This is key in many time-varying optimization applications and increasingly important for large-scale systems [3, 4, 5, 6].

We devise an approximate upper confidence bound algorithm in compact continuous spaces and provide its regret analysis to solve the formulated problem. As in [1], for the time-invariant and squared exponential kernel case we recover a cumulative regret of the form $O^*(\sqrt{T (\log T)^{d+1}})$, where $d$ is the dimension of the problem and $T$ is the horizon (the notation $O^*(\cdot)$ means a big-$O$ result up to polylog factors). As in [2], the time variation of the cost yields an extra term in the cumulative regret. We explicitly show how the regret depends on the convergence rate of the algorithm that we use, and how it depends on the choice of the kernel.

We further detail the regret analysis for the case where a projected gradient method is used to compute the approximate optimizers of the upper confidence bound algorithm, and we extend the analysis to vanishing cost changes.

We demonstrate the performance of the proposed algorithm in a numerical example derived from vehicle platooning, where the inter-vehicle distances are computed in real time from both an engineering and a user-comfort perspective.

Incorporating user’s satisfaction in the decision-making process is not a new concept; yet, it is starting to play a major role in emerging data-driven optimization and control paradigms, especially within the context of cyber-physical-social systems or cyber-physical-and-human systems. For example, [7] employs control-theoretic techniques to model human behavior and drive systems to suitable working conditions. In this context, human behavior has been captured as a stochastic process [8] and discrete decision models [9], among others.

User's satisfaction and preferences have been modelled as a GP in the machine learning community; see, e.g., [10, 11]. For an account of Gaussian processes we refer to [12], while for their use in control we refer the reader to [13, 14, 15, 16]. User's satisfaction and comfort have been taken into account from a control perspective in, e.g., control systems for houses, electric vehicles, and routing [17, 18, 19].

The techniques and ideas that are expressed in this paper are related to inverse reinforcement learning [20], restless bandit problems [21] (even though we do not use their machinery and they are typically in discrete spaces, while we are in compact continuous spaces), and (partially) to socially-guided machine learning [22].

Organization. The remainder of the paper is organized as follows. Section II presents the mathematical formulation and the proposed approach using an approximate upper confidence bound algorithm. The convergence properties of said algorithm are analyzed in Section III, along with two specifications of the algorithm in cases where we use a projected gradient method and the cost changes are vanishing. In Section IV, we report numerical examples. Proofs of all the results are in the Appendix.

II Formulation and Approach

Consider a decision variable $x \in \mathbb{R}^d$ and a possibly time-varying objective (value) function $f(x;t)$ that has to be maximized. The function $f(x;t)$, where $t \ge 0$ represents time, is given by the sum of two terms: a concave engineering value function $U(x;t)$ and a user's satisfaction function $g(x)$. The function $U(x;t)$ is known (or can be easily evaluated at time $t$), whereas $g(x)$ is unknown and has to be learned. Accordingly, the problem to be solved amounts to:

$\max_{x \in X} \; f(x;t) := U(x;t) + g(x), \qquad t \ge 0,$   (1)

that is, the objective is to find an optimal decision $x^*(t)$ for each time $t$ over a compact convex set $X \subseteq \mathbb{R}^d$. Here, we implicitly assume that the user's satisfaction function does not change in time; however, slow changes of $g$ would not affect the reasoning and the technical approach. Furthermore, $g$ is not necessarily a concave function, thus rendering (1) a challenging problem even in the static setting. The structure of (1) naturally suggests that optimal decisions strike a balance between design choices (which may be time-varying, e.g., in tracking problems) and user's preferences, which are often slowly varying and not known beforehand.

As a first step, consider sampling problem (1) at discrete time instances $t_k = kh$, $k \in \mathbb{N}$, with $h$ a given sampling period. This leads to a sequence of time-invariant problems as:

$\max_{x \in X} \; f(x;t_k) := U(x;t_k) + g(x), \qquad k \in \mathbb{N},$   (2)

which we want to solve approximately within the sampling period $h$ to generate a sequence of approximate optimizers $\{x_k\}_{k \in \mathbb{N}}$ (one for each problem (2)) that eventually converges to an optimal decision trajectory up to a bounded error. When only the known engineering value function $U$ is considered, this tracking problem has been considered in various prior works; see, e.g., [23, 24, 5, 25, 26] and pertinent references therein. A key difference in the setting proposed in this paper is that we construct approximate optimizers concurrently with the learning of the (unknown) user's function $g$. The main operating principles of the algorithm to be explained shortly are, qualitatively, as follows: (i) at time $t_k$, an approximate optimizer $x_k$ is computed based on a partial knowledge of $g$; (ii) the optimizer $x_k$ is implemented, and it generates some "feedback" from the user in the form of, for example, $y_k = g(x_k) + \epsilon_k$, where $\epsilon_k$ is noise; and, (iii) $y_k$ is collected and utilized to "refine" the knowledge of $g$. At time $t_{k+1}$, the process is then repeated.

To this end, the paper leverages a Gaussian process (GP) model for the unknown user function $g$. Such a non-parametric model is advantageous in the present setting because of (1) the simplicity of the online updates of both mean and covariance; (2) the inherent ability to handle asynchronous and intermittent updates (which is an important feature in user's feedback systems); and, (3) the implicit and smooth handling of measurement (i.e., feedback) noise. Accordingly, let $g(x)$ be specified by a GP with mean function $\mu(x)$ and covariance (or kernel) function $k(x,x')$. We assume bounded variance; i.e., $k(x,x) \le 1$. For GPs not conditioned on data, we assume without loss of generality that $\mu(x) \equiv 0$, i.e., $g(x) \sim \mathrm{GP}(0, k(x,x'))$. We follow a frequentist perspective and assume the existence of a true data-generating $\bar g(x)$. Let $X_i = \{x_1, \dots, x_i\}$ be a set of sample points and let $y_j = \bar g(x_j) + \epsilon_j$, with $\epsilon_j \sim \mathcal{N}(0, \sigma^2)$ i.i.d. Gaussian noise, be the noisy measurements at the sample points $x_j$, $j = 1, \dots, i$. Let $\mathbf{y}_i = [y_1, \dots, y_i]^\top$. Then the posterior distribution of $g$ is a GP distribution with mean $\mu_i(x)$, covariance $k_i(x,x')$, and variance $\sigma_i^2(x)$ given by:

$\mu_i(x) = \mathbf{k}_i(x)^\top (K_i + \sigma^2 I)^{-1}\, \mathbf{y}_i,$   (3)

$k_i(x,x') = k(x,x') - \mathbf{k}_i(x)^\top (K_i + \sigma^2 I)^{-1}\, \mathbf{k}_i(x'),$   (4)

$\sigma_i^2(x) = k_i(x,x),$   (5)

where $\mathbf{k}_i(x) = [k(x_1,x), \dots, k(x_i,x)]^\top$, and $K_i$ is the positive definite kernel matrix $[k(x,x')]_{x,x' \in X_i}$.
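For concreteness, a minimal NumPy sketch of the Bayesian update (3)-(5) is reported below. This is an illustrative implementation, not the one used in the paper; the kernel function k, the sample points X, and the noise variance sigma2 are placeholders for the quantities defined above.

import numpy as np

def gp_posterior(k, X, y, sigma2, x_star):
    # Posterior mean and variance of a zero-mean GP at a query point x_star,
    # given sample points X (i x d), noisy measurements y (i,), and noise
    # variance sigma2; implements Eqs. (3)-(5).
    i = X.shape[0]
    K = np.array([[k(X[p], X[q]) for q in range(i)] for p in range(i)])
    kx = np.array([k(X[p], x_star) for p in range(i)])
    A = K + sigma2 * np.eye(i)
    alpha = np.linalg.solve(A, y)        # (K_i + sigma^2 I)^{-1} y_i
    v = np.linalg.solve(A, kx)           # (K_i + sigma^2 I)^{-1} k_i(x*)
    mu = kx @ alpha                      # Eq. (3)
    var = k(x_star, x_star) - kx @ v     # Eqs. (4)-(5)
    return mu, var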

With this in place, to approximately solve the sequence of optimization problems (2) (where we recall that the objective function is partially unknown), we utilize an online approximate Gaussian process upper confidence bound (AGP-UCB) algorithm, as described next:

 

 

AGP-UCB algorithm

Initialize $\mu_0$ and $\sigma_0$ from average user's profiles; choose confidence parameters $\{\beta_i\}_{i \in \mathbb{N}}$. Set $i = 1$.

At each time $t_k$, $k \in \mathbb{N}$, perform the following steps [S1]-[S3]:

[S1] Set

$\hat f_i(x;t_k) := U(x;t_k) + \mu_{i-1}(x) + \beta_i^{1/2}\,\sigma_{i-1}(x).$   (6)

[S2] Find a possibly approximate optimizer:

$x_k \approx \arg\max_{x \in X} \hat f_i(x;t_k),$   (7)

by running a finite number of steps of a given algorithmic scheme, and implement $x_k$;

[S3] Collect the user's feedback in the form of

$y_i = \bar g(x_k) + \epsilon_i,$

and perform a Bayesian update to obtain $\mu_i$ and $\sigma_i$ according to (3)-(5); set $i$ to $i+1$. If the feedback is not received, no actions are performed.
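In pseudocode form, one run of AGP-UCB can be sketched as follows (reusing the gp_posterior sketch above). The routines U, beta_schedule, approx_maximize, and get_feedback are hypothetical placeholders for the engineering cost, the confidence schedule, the inexact solver of [S2], and the user's feedback channel, respectively; this is a schematic sketch, not the reference implementation.

import numpy as np

def agp_ucb(x0, T, kernel, sigma2, U, beta_schedule, approx_maximize, get_feedback):
    # Schematic AGP-UCB loop with one (possibly skipped) GP update per time t_k.
    X_data, y_data, x = [], [], x0
    for k in range(1, T + 1):
        beta = beta_schedule(k)
        # [S1] surrogate: engineering cost + GP upper confidence bound, Eq. (6)
        def f_hat(xq):
            if not X_data:
                mu, var = 0.0, 1.0   # prior: zero mean, unit variance bound
            else:
                mu, var = gp_posterior(kernel, np.array(X_data),
                                       np.array(y_data), sigma2, xq)
            return U(xq, k) + mu + np.sqrt(beta * max(var, 0.0))
        # [S2] approximate maximization of (7), e.g., a few gradient steps
        x = approx_maximize(f_hat, x)
        # [S3] implement x and collect (possibly intermittent) noisy feedback
        y = get_feedback(x)          # y = g(x) + noise, or None if missing
        if y is not None:
            X_data.append(x)
            y_data.append(y)
    return x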

In (6), a proxy for the unknown function $g$ is built using an upper confidence bound. Upper confidence bound methods are popular in stochastic bandit settings; see, e.g., [1, 27]. The aim of an upper confidence bound method is to trade off exploitation (areas with high mean but low variance) and exploration (areas with low mean but high variance).

Step (7) is utilized to find $x_k$ based on the (current) upper confidence bound. In [1], it is assumed that (7) can be solved to optimality; on the other hand, we consider a setting where only an approximate optimizer of (7) can be obtained at each time $t_k$. In particular, we consider a case where it might not be possible to execute the algorithm until convergence to an optimal solution within a period of time $h$ [23, 24, 5, 25, 26]; this is the case where, for example, the sampling period $h$ is too short (relative to the time required to perform an algorithmic step and the number of iterations required to converge) or the problem is computationally demanding. Thus, denoting as $\mathcal{T}$ the map of a given algorithmic step (e.g., gradient descent, proximal method, etc.) and assuming that one can perform $C_k$ steps within an interval $h$, $x_k$ is obtained as:

$x_k = \mathcal{T}^{C_k}(x_{k-1}; t_k),$   (8)

where $\mathcal{T}^{C_k}$ denotes the $C_k$-fold composition of $\mathcal{T}$ applied to the surrogate function $\hat f_i(x;t_k)$, which is defined as:

$\hat f_i(x;t_k) := U(x;t_k) + \mu_{i-1}(x) + \beta_i^{1/2}\,\sigma_{i-1}(x).$   (9)

For example, if $\mathcal{T}$ is the map of a gradient method, then (8) implies that one runs $C_k$ gradient steps; this example will be explained shortly in Section III-B. See also the works [23, 24, 5, 25, 26] (and pertinent references therein) for this running, finite-iterations setting. From a more utilitarian perspective, this setting allows one to allocate computational resources to solve (7) parsimoniously; spending resources to solve (7) to optimality might not provide performance gains, since the user's function $g$ is not known accurately.

Finally, step [S3] involves the gathering of the user's feedback in the form of a measurement of $\bar g(x_k)$. This feedback is in general noisy (e.g., it may be collected by sensors or be quantized), and it could be intermittent, in the sense that it might not be available at every time step (this explains why we utilize the subscript $k$ for the temporal index and $i$ for the updates of the GP; of course, $i = k$ if feedback is collected at each time $t_k$). If the feedback is available, the GP model is updated via (3)-(4); otherwise it is not.

The AGP-UCB algorithm presented here can be seen as a generalization of the algorithm in [1], with two key distinctions: (1) the optimization step (7) is performed in an approximate fashion, and (2) the problem features a time-varying function $U(x;t)$. These two features are accounted for in the regret analysis presented next.

III Regret Bounds

In this section, we establish cumulative regret bounds for the AGP-UCB algorithm. A critical quantity affecting these bounds is the maximum information gain $\gamma_i$ after $i$ feedback rounds, which is defined as

$\gamma_i := \max_{A \subset X:\, |A| = i} I(\mathbf{y}_A; \mathbf{g}_A),$   (10)

with $I(\mathbf{y}_A; \mathbf{g}_A) = \frac{1}{2}\log\det(I + \sigma^{-2} K_A)$, and where $A$ represents the set of points $\{x_1, \dots, x_i\}$ which maximizes the expression [cf. [1]].

The regret bounds are of the form $O^*(\sqrt{T\,\beta_T\,\gamma_T}) + O(K\,T)$, where the first term is sub-linear in the horizon $T$ and depends on how fast one can learn the unknown user's profile, while the second term is due to the time drift (i.e., the temporal variability $K$, formalized in Assumption 3 below) of the design function. These regret bounds resemble [1] for the time-invariant case and [2] for the time-varying one. (We leave for future research the use of more sophisticated methods that learn, predict, or bound the variations of the cost function and might deliver no-regret results also in the time-varying setting; see [28, 29].) The main proof techniques that we use are a mix of GP regret results, convex analysis, and kernel ridge regression theory.
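For a GP with Gaussian measurement noise, the mutual information in (10) has the closed form $I(\mathbf{y}_A;\mathbf{g}_A) = \frac{1}{2}\log\det(I + \sigma^{-2}K_A)$, and the maximization can be approximated greedily, with near-optimality guaranteed by submodularity [1]. A minimal sketch over a finite candidate grid follows; kernel and grid are placeholder names.

import numpy as np

def info_gain(kernel, A, sigma2):
    # I(y_A; g_A) = 0.5 * logdet(I + sigma^{-2} K_A) for the points in A.
    K = np.array([[kernel(a, b) for b in A] for a in A])
    return 0.5 * np.linalg.slogdet(np.eye(len(A)) + K / sigma2)[1]

def greedy_info_gain(kernel, grid, i, sigma2):
    # Greedy (near-optimal, by submodularity) surrogate for gamma_i.
    A = []
    for _ in range(i):
        A.append(max(grid, key=lambda x: info_gain(kernel, A + [x], sigma2)))
    return info_gain(kernel, A, sigma2)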

III-A Main result

The following assumptions are imposed.

Assumption 1

The function $-U(x;t)$ is convex and $L$-strongly smooth over $X$ (i.e., the gradient $\nabla_x U(x;t)$ is $L$-Lipschitz continuous), uniformly in $t$.

Assumption 2

Let $X \subset \mathbb{R}^d$ be compact and convex, $d \in \mathbb{N}$, $d \ge 1$. Let the kernel $k(x,x')$ satisfy the following high probability bound on the derivatives of the GP sample paths $g$: for some constants $a, b > 0$,

$\Pr\Big\{ \sup_{x \in X} \big| \partial g / \partial x_j \big| > \varsigma \Big\} \le a\, e^{-(\varsigma/b)^2}, \quad j = 1, \dots, d.$

Assumption 3

The changes in time of the function $U$ are bounded, in the sense that at two subsequent sampling times $t_k$ and $t_{k+1}$, the following bound holds:

$\sup_{x \in X} \big| U(x;t_{k+1}) - U(x;t_k) \big| \le K,$

for a given $K \ge 0$.

Assumption 4

Recall the definition of $\hat f_i(x;t_k)$ in (9), and let its maximizer be $\hat x_k^* \in \arg\max_{x \in X} \hat f_i(x;t_k)$. There exists an algorithmic map $\mathcal{T}$ with linear convergence; that is, when applied to $\hat f_i(\cdot;t_k)$, the map yields a point $\mathcal{T}(x;t_k)$ such that, for some $\varrho \in (0,1)$,

$\hat f_i(\hat x_k^*;t_k) - \hat f_i(\mathcal{T}(x;t_k);t_k) \le \varrho\, \big( \hat f_i(\hat x_k^*;t_k) - \hat f_i(x;t_k) \big), \quad \forall\, x \in X.$

Assumption 5

The kernel $k(x,x')$ is not degenerate; the mean $\mu(x)$ and variance $\sigma^2(x)$ are well-behaved, so that $\bar g$ can be represented by a generalized Fourier series in the kernel basis.

Most of the assumptions are standard in either time-varying optimization or Gaussian process analysis; see, e.g., [1, 23]. From Assumption 1, one has that the following holds for any $x, x' \in X$ and $t$:

$\big| U(x';t) - U(x;t) - \nabla_x U(x;t)^\top (x'-x) \big| \le \frac{L}{2}\, \|x'-x\|^2.$

Further, since $X$ is compact, one has that:

$\|\nabla_x U(x;t)\| \le G, \quad \forall\, x \in X,$   (11)

where $G$ is defined (uniformly in $t$) as:

$G := \sup_{t \ge 0}\, \max_{x \in X} \|\nabla_x U(x;t)\|.$   (12)

Assumption 2 is standard in the analysis of GPs (see, e.g., [1]); it holds true for four-times differentiable stationary kernels, such as the widely used squared exponential and (some) Matérn kernels. This assumption is needed when $X$ is compact in order to ensure smoothness of the GP samples.
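For reference, the squared exponential kernel with length scale $\ell > 0$ reads

$k(x,x') = \exp\Big( -\frac{\|x-x'\|^2}{2\ell^2} \Big),$

which is stationary and infinitely differentiable, so the high-probability derivative bound of Assumption 2 applies.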

Assumption 3 is required in time-varying settings to bound the temporal variability of the cost function; specifically, it presupposes that any decision $x$ yields similar values of $U$ at two subsequent times $t_k$ and $t_{k+1}$ (with their difference being upper bounded by $K$). This assumption is also common in online optimization and machine learning; see, e.g., [28, 29].

For Assumption 4, two cases are in order, based on whether the assumption holds locally around the optimal trajectory (in this paper, the term optimum refers to the global one) or globally. In the first case, results will be valid around the optimal trajectory (provided that the algorithm is started close enough). In the second case, results will be valid in a general sense. Since the conditions are the same in both cases, the paper will hereafter focus on the second case. In a convex setting, an algorithm with $\varrho$-linear convergence can be found when the function is strongly convex and strongly smooth; when the function is nonconvex, results are available for special cases. A noteworthy example (which is also explored in the simulation results) is when $\hat f_i$ has a Lipschitz-continuous gradient and satisfies the Polyak-Lojasiewicz (PL) inequality; in this case, the (projected) gradient method has a global linear convergence rate and Assumption 4 is satisfied [30]. This case implies that the function is invex (i.e., any stationary point must be a global optimizer). Such a PL-type function arises, e.g., when $U$ is dominant over $g$; that is, when i) $U$ is strongly concave and $g$ represents small (yet non-negligible for the user) variations, or ii) when $\mu_{i-1}$ is PL and the variance $\sigma_{i-1}^2$ is almost constant. (Other cases in which Assumption 4 is verified for the general nonconvex setting could be found by using recent global convergence analyses of ADMM [31, 32]; this is left for future research. Note that if one has a method with a sublinear $O(1/C)$ or $O(1/C^2)$ convergence rate, then one can emulate a linearly converging method by running at least of the order of $1/\epsilon$ or $1/\sqrt{\epsilon}$ iterations per time step, respectively, for a target accuracy $\epsilon$.) With this in mind, one could substitute Assumption 4 with the following (more restrictive) assumption.

Assumption 6

The function $\hat f_i(x;t_k)$ in (9) has a Lipschitz continuous gradient with coefficient $\tilde{L}$ and satisfies the Polyak-Lojasiewicz (PL) inequality as,

$\frac{1}{2}\, \mathcal{D}(x;\tilde{L}) \ge \mu\, \Big( \max_{x' \in X} \hat f_i(x';t_k) - \hat f_i(x;t_k) \Big),$   (13)

with

$\mathcal{D}(x;\tilde{L}) := -2\tilde{L}\, \min_{x' \in X} \Big[ \nabla_x(-\hat f_i(x;t_k))^\top (x'-x) + \frac{\tilde{L}}{2}\, \|x'-x\|^2 \Big],$   (14)

over $X$, for some $\mu > 0$, $\tilde{L} \ge \mu$, uniformly in time (i.e., for all $t_k$).

Assumption 6 can also be required in a local sense only (around the optimal trajectory), if necessary. (Condition (13) imposes that the gradient of the cost function is properly lower bounded. If $X$ is the whole space, $\mathcal{D}(x;\tilde{L})$ reduces to $\|\nabla_x \hat f_i(x;t_k)\|^2$ and the condition becomes easier to interpret. Condition (13) also implies invexity. See [30, 33] for detailed discussions.)
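As a simple unconstrained illustration of (13)-(14) (with $X = \mathbb{R}^d$, so that $\mathcal{D}$ reduces to the squared gradient norm), consider the strongly concave quadratic $\hat f(x) = -\frac{1}{2}x^\top A x + b^\top x$ with $A \succeq \mu I$ and maximizer $x^* = A^{-1}b$; then

$\frac{1}{2}\, \|\nabla \hat f(x)\|^2 = \frac{1}{2}\, \|A(x-x^*)\|^2 \ge \frac{\mu}{2}\, (x-x^*)^\top A\, (x-x^*) = \mu\, \big( \hat f(x^*) - \hat f(x) \big),$

so the PL inequality holds with constant $\mu$.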

Assumption 5 is a mild technical assumption, needed in the frequentist view to determine the learning rates of the proposed method; see also [12, 34]. The non-degeneracy of the kernel holds true for squared exponential kernels [35]. If the assumption is not satisfied, then one would learn a smoother version of $\bar g$.

To simplify the exposition and the notation, hereafter we set the time indexes $i$ and $k$ in AGP-UCB to be the same; that is, $i = k$. This is done without loss of generality, since $i = k$ represents the worst case for the regret analysis; in fact, for $i = k$ the time scales of the variation of $U$ and of the convergence of the optimization algorithm are the same, which causes coupling of the two. For $i \ll k$, the convergence of the optimization algorithm is basically achieved for any $t_k$ and we are back to the time-scale separation of [1].

Define the cumulative regret as:

$R_T := \sum_{k=1}^{T} r_k = \sum_{k=1}^{T} \big[ f(x_k^*;t_k) - f(x_k;t_k) \big],$

where $f(x_k^*;t_k)$ is the maximum of the function $f(\cdot;t_k)$ at time $t_k$, provided by an oracle. The following result holds for the regret.

Theorem 1 (Regret bound)

Let Assumptions 1-5 hold. Pick $\delta \in (0,1)$ and set the parameter $\beta_k$ as

$\beta_k = 2\log(4\pi_k/\delta) + 4d\log\big( d\,k\,b\,r\,\sqrt{\log(4da/\delta)} \big),$

for $\sum_{k \ge 1} \pi_k^{-1} = 1$, $\pi_k > 0$, where $r := \max_{x,x' \in X} \|x - x'\|$, and $a$ and $b$ are the parameters defined in Assumption 2. Running AGP-UCB with $\beta_k$ for a sample $g$ of a GP with mean function zero and covariance function $k(x,x')$, we obtain a regret bound of $O^*(\sqrt{T\,\beta_T\,\gamma_T} + K\,T)$ with high probability. In particular, with probability at least $1-\delta$,

$R_T \le \sqrt{C_1\, T\, \beta_T\, \gamma_T} + C_2\, K\, T + C_3,$   (15)

where $C_1 := 8/\log(1+\sigma^{-2})$, and where $C_2$ and $C_3$ are explicit constants, derived in the proof, that depend on the Lipschitz constant $L$ and the gradient bound $G$, and on the convergence rate $\varrho$ of the algorithmic map $\mathcal{T}$ and the learning rate errors, respectively.

Proof. See Appendix A.

Theorem 1 asserts that the average regret $R_T/T$ converges to an error bound with high probability. It can be noticed that the error bound is a function of the variability $K$ of the function $U$; if $U$ is time-invariant, then a no-regret result can be recovered. The term $\sqrt{C_1 T \beta_T \gamma_T}$ in the bound can be found in the result of [1, Theorem 2] too; in particular, the constant $C_1$ depends only on the variance $\sigma^2$ of the measurement noise. On the other hand, the term $C_2 K T$ is due to the time-varying concave function $U(x;t)$, and it explicitly shows the linear dependence on the Lipschitz constant $L$ of the gradient of $U$ as well as on the bound $G$ on the gradient of $U$. The term $C_3$ is also due to the time variation of the surrogate function $\hat f_k$.

The term $C_3$ emerges from the approximation error in the step [S2] of the algorithm, and it is key in our analysis; that is because only a limited number of algorithmic steps is performed, instead of running the optimization algorithm to convergence. It is also due to the fact that the surrogate function $\hat f_k$ changes every time a new measurement is collected. Define the learning rate error as

$\nu_k := \sup_{x \in X} \big| \mu_{k-1}(x) + \beta_k^{1/2}\sigma_{k-1}(x) - \mu_{k-2}(x) - \beta_{k-1}^{1/2}\sigma_{k-2}(x) \big|.$

Then, the error term comes from the sum

$\sum_{k=1}^{T} e_k \le \frac{2\varrho}{1-\varrho} \sum_{k=1}^{T} (K + \nu_k) + \frac{\varrho}{1-\varrho}\, e_0,$   (18)

whose proof is deferred to the Appendix (with $e_k$ the per-step approximation error defined in (23)). From (18), we can see that if the step [S2] is carried out exactly, $\varrho = 0$, then this term vanishes. On the other hand, if $\varrho > 0$, then the error is weighted by the changes in the surrogate function $\hat f_k$. Because of the Bayesian update, $\nu_k$ converges to zero fast enough so that its cumulative error $\sum_k \nu_k$ is constant (i.e., bounded uniformly in $T$).

The following result pertains to squared exponential kernels.

Theorem 2 (Squared exponential kernels)

Under the same assumptions of Theorem 1, consider a squared exponential kernel. Then $\gamma_T = O((\log T)^{d+1})$ and the cumulative regret is of the order of $O^*\big( \sqrt{T (\log T)^{d+1}} + K\,T \big)$.

Proof. See Appendix A.

Theorem 2 is a customized version of Theorem 1 for a squared exponential kernel. Other special cases can be derived for other kernels as in [1], but are omitted here due to space constraints. In the following, we exemplify the results of Theorem 2 for the case where the algorithmic map $\mathcal{T}$ represents a projected gradient method.

III-B Example: AGP-UCB with projected gradient method

Consider a projected gradient method, applied to a time-varying problem for which Assumption 6 holds. Consider further the case where only one step of the projected gradient method can be performed in [S2] (i.e., $C_k = 1$); then, denoting as $P_X$ the projection operator onto $X$, in this case [S2] is replaced with:

[S2'] Update $x_k$ as:

$x_k = P_X\big[ x_{k-1} + \alpha\, \nabla_x \hat f_k(x_{k-1};t_k) \big],$   (19)

and implement $x_k$.

Step [S2'] encodes a projected gradient method on the function $\hat f_k(\cdot;t_k)$ with stepsize $\alpha$. Note that in the Bayesian framework the derivative of $\hat f_k$ is straightforward to compute and no approximation has to be made; this is in contrast with online bandit methods, where the gradient is estimated from functional evaluations (see, e.g., [36, 37, 38] and references therein). This is one of the strengths of Bayesian modelling, which however comes at the cost of an increase in computational complexity due to the GP updates.
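To illustrate that the gradient of the surrogate is indeed available in closed form, note that $\nabla_x \mu_{i-1}(x) = \nabla_x \mathbf{k}_i(x)^\top (K_i + \sigma^2 I)^{-1}\mathbf{y}_i$, and that for the squared exponential kernel $\nabla_x k(x,x_j) = -\ell^{-2}(x-x_j)\,k(x,x_j)$. A minimal sketch, with illustrative helper names:

import numpy as np

def se_kernel(x, xp, ell=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * ell ** 2))

def grad_posterior_mean(X, y, sigma2, x, ell=1.0):
    # Gradient of the GP posterior mean at x for the squared exponential
    # kernel; the gradient of sigma_{i-1}(x) follows from Eq. (4) similarly.
    i = X.shape[0]
    K = np.array([[se_kernel(X[p], X[q], ell) for q in range(i)]
                  for p in range(i)])
    alpha = np.linalg.solve(K + sigma2 * np.eye(i), y)
    grads = np.array([-(x - X[j]) / ell ** 2 * se_kernel(x, X[j], ell)
                      for j in range(i)])          # d k(x, x_j) / dx
    return grads.T @ alpha

A projected gradient step [S2'] then adds the gradients of $U$ and of $\beta_k^{1/2}\sigma_{k-1}$ (the latter obtained analogously from (4)-(5)) and projects onto $X$, e.g., by componentwise clipping when $X$ is a box.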

Under Assumption 6 and with the choice of stepsize $\alpha = 1/\tilde{L}$, a suitable extension of the results in [30] (considering the proximal-gradient method with the nonsmooth term being the indicator function of $X$; see Appendix B) yields the following convergence result for the iteration (19):

$\hat f_k(\hat x_k^*;t_k) - \hat f_k(x_k;t_k) \le \Big( 1 - \frac{\mu}{\tilde{L}} \Big) \big( \hat f_k(\hat x_k^*;t_k) - \hat f_k(x_{k-1};t_k) \big).$   (20)
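Applying (20) recursively over $C_k$ gradient steps within one sampling period gives

$\hat f_k(\hat x_k^*;t_k) - \hat f_k\big( \mathcal{T}^{C_k}(x_{k-1};t_k); t_k \big) \le \Big( 1 - \frac{\mu}{\tilde{L}} \Big)^{C_k} \big( \hat f_k(\hat x_k^*;t_k) - \hat f_k(x_{k-1};t_k) \big),$

so that Assumption 4 holds with $\varrho = (1-\mu/\tilde{L})^{C_k}$; Corollary 1 corresponds to the single-step case $C_k = 1$.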

In this context, the following corollary is therefore in order.

Corollary 1 (AGP-UCB with projected gradient method)

Consider the modified step [S2'] as in (19), with $\alpha = 1/\tilde{L}$. Under the same assumptions of Theorem 1, but with Assumption 4 replaced by Assumption 6 with $\tilde{L} \ge \mu$, Theorem 1 holds with the specific value $\varrho = 1 - \mu/\tilde{L}$.

Proof. See Appendix B.

III-C Vanishing Changes

Consider the case where $K$ in Assumption 3 vanishes in time; that is, denoting its value at time $t_k$ as $K_k$, $K_k \to 0$ as $k \to \infty$. This case is important when the variations in the engineering cost function eventually vanish (for example, if the cost function is derived from a stationary process that is learned while the algorithm is run, as typically done in online convex optimization [39]).

Theorem 3 (Vanishing changes)

Under the same assumptions of Theorem 1, if $K_k \to 0$ as $k \to \infty$, then the term $C_2 K T$ in Theorem 1 can be upper bounded by a sublinear function of $T$, and we obtain a no-regret result.

Furthermore, if $K_k$ decays at least as $O(1/\sqrt{k})$, then the result of Theorem 1 on the regret is indistinguishable from the static result of [1, Theorem 2] in an $O^*(\cdot)$ sense.

Proof. See Appendix C.

IV Numerical evaluation

This section considers an application of the proposed framework to a vehicle platooning problem, based on [40, 41], and it provides illustrative numerical results. This problem includes all the modeling elements discussed in the paper, and its real-time implementation requirements are aligned with the design principles of the proposed framework.

Consider then $n+1$ automated vehicles that are grouped in a platoon. The leading vehicle is labeled as $0$, while the vehicles following the leading one are indexed with increasing numbers from $1$ to $n$. The platoon leading and desired velocity is $v_0$, while the inter-vehicle distances are denoted as $d_i$ for $i = 1, \dots, n$ (cf. Fig. 1). Consider the problem of deciding the best inter-vehicle distances, such that they are as close as possible to some desired values dictated by road and aerodynamic considerations and by the platoon's speed, while being comfortable for the car riders; e.g., the automated vehicles' distances should not be too different from the distances that users would naturally adopt in human-driven vehicles.

Let $x = [d_1, \dots, d_n]^\top$, and denote as $x_{\rm d}(t)$ the time-varying vector of distances that one would obtain by considering only the engineering cost. Then, the problem considered in this section is of the form:

$\max_{x \in X} \; -\| x - x_{\rm d}(t) \|_W^2 + \sum_{i=1}^{n} g_i(x),$   (21)

where $X$ is a compact set representing allowed distances between vehicles; $g_i$ is the function capturing the "comfort" of user $i$; and $\|\cdot\|_W$ is the weighted norm based on the positive definite matrix $W$. We further consider the case where $W$ is not diagonal (this way, the decision variables are coupled). We also set $g_i(x) = g_i(d_i)$; that is, the comfort of the $i$-th user depends only on the distance from the preceding vehicle (however, more general models can be easily adopted).

Fig. 1: Simulation setup.

The true data-generating functions $\bar g_i$ are modeled using log-normal functions; the maximizer of each $\bar g_i$ does not coincide with the one of the engineering function, to avoid a trivial solution. Further, the functions $\bar g_i$ are different for each vehicle. Using log-normal functions for users' comfort was motivated in, e.g., [42], by observing that inter-vehicle times and distances follow log-normal distributions. Intuitively, smaller distances are more critical than larger ones, and there is a given distance after which the comfort decreases, since the users feel that they are too slow relative to the preceding vehicle. In the simulations, we learn the true data-generating model via a GP with a squared exponential kernel (with a fixed length-scale parameter).
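A minimal sketch of the simulation ingredients under the stated modeling choices is reported below; the specific constants (mode, s, noise_std) are illustrative placeholders, not the values used in the experiments.

import numpy as np

def comfort_lognormal(d, mode=1.0, s=0.5):
    # Log-normal-shaped comfort of one user as a function of the
    # inter-vehicle distance d > 0: low for distances that are too small,
    # peaked near a preferred distance, and decaying for large distances.
    return np.exp(-0.5 * (np.log(d / mode) / s) ** 2) / d

def engineering_value(x, x_des, W):
    # U(x; t) = -||x - x_des(t)||_W^2 with a positive definite, non-diagonal W.
    e = x - x_des
    return -e @ W @ e

def noisy_feedback(comfort, d, rng, noise_std=0.1):
    # User feedback: a noisy sample of the comfort at the implemented
    # distance, e.g., with rng = np.random.default_rng(0).
    return comfort(d) + noise_std * rng.standard_normal()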

It is important to notice that, with the selected parameters, the approximate function $\hat f_k$ is invex, has a Lipschitz continuous gradient, and also satisfies the Polyak-Lojasiewicz inequality over $X$. With this in mind, we can readily apply a projected gradient method with linear convergence.

For the numerical tests, we fix the confidence probability $\delta$ and the step size $\alpha$ of the gradient method used to approximately solve (7); we run one step of the gradient for each iteration $k$ (i.e., $C_k = 1$). The set $X$ is a box of allowed inter-vehicle distances (properly scaled so that the set maps to real distances), and a small number $n$ of vehicles follows the leading one.

The desired distances are set to a common time-varying profile $x_{\rm d}(t_k)$ for all the vehicles, whose variation between two subsequent sampling times is controlled by a tunable parameter $K$. The rationale for modeling the $x_{\rm d}(t_k)$ in this way is to capture (1) a stationary case ($K = 0$), and (2) a dynamic case where the distances change because of dynamic traffic conditions (and therefore a time-varying $x_{\rm d}$), road changes, etc. Users' feedback comes as a noisy sample of their comfort function, and the noise is modeled as a zero-mean Gaussian variable with variance $\sigma^2$. Feedback in this example can come in different ways: it can come at low frequency, if the users are asked to hit the brake or the accelerator every time they feel too close or too far from the vehicle in front, or it can come at higher frequency, if the users are equipped with heart-rate/breathing-rate sensors (which can be in smartwatches or incorporated in the seat of the vehicles [43]), which may be used as proxies of stress and discomfort.

Fig. 2: Average regret for different $K$'s.

In Figure 2, we show the performance of the AGP-UCB algorithm for varying $K$. On the vertical axis, we plot the average regret $R_k/k$, averaged over 10 different runs of the algorithm. As expected, we observe that when $K = 0$, we obtain a no-regret scenario, with $R_k/k$ eventually going to zero; in this case, it goes to zero as $O^*(\sqrt{(\log k)^{d+1}/k})$ (note that the learning dimension is $d = 1$ in our example). In the time-varying scenarios instead, there is an asymptotic error bound, thus corroborating the theoretical results.

In this example, the vehicles are supposed to follow the new set-points (i.e., the new inter-vehicle distances) rapidly, i.e., faster than the sampling period $h$. If this is not the case, the analysis of the algorithm could be extended to include an actuation error; we plan to investigate this in future research.

V Conclusions

We developed an online algorithm to solve an optimization problem with a cost function comprising a time-varying engineering function and a user's utility function. The algorithm is an approximate upper confidence bound algorithm, and it provably generates a sequence of optimizers that remains within a ball around the optimal trajectory, while learning the user's satisfaction function with no regret. We have illustrated the results with a numerical example derived from vehicle platooning, which offered some additional inspiration for future research.

Appendix A Proof of Theorems 1-2

A-A Preliminaries

We start by defining some quantities that will be subsequently used in the proofs. We rewrite the main algorithmic step (7) as

$x_k = \mathcal{T}^{C_k}(x_{k-1};t_k) \approx \arg\max_{x \in X} \hat f_k(x;t_k),$   (22)

where we recall that $\hat f_k(x;t_k) = U(x;t_k) + \mu_{k-1}(x) + \beta_k^{1/2}\sigma_{k-1}(x)$; we simplify the notation from $\hat f_i$ to $\hat f_k$, since there is no confusion when $i = k$.

Define the error $e_k$ as the difference between the optimum of the function $\hat f_k(\cdot;t_k)$ and the approximate solution computed via (22), as

$e_k := \hat f_k(\hat x_k^*;t_k) - \hat f_k(x_k;t_k),$   (23)

where we use the superscript $^*$ for quantities that are related to the optimizer/optimum of $\hat f_k$, so that the optimizer of $\hat f_k(\cdot;t_k)$ is $\hat x_k^*$ and its optimum is $\hat f_k(\hat x_k^*;t_k)$.

Define the instantaneous regret as the difference between the optimum of the function $f(\cdot;t_k)$, which is denoted as $f(x_k^*;t_k)$, and the approximate optimum $f(x_k;t_k)$, where $x_k$ is computed via (22), as

$r_k := f(x_k^*;t_k) - f(x_k;t_k).$   (24)

Further, define the instantaneous optimal regret as the difference between $f(x_k^*;t_k)$ and the optimum $f(\hat x_k^*;t_k)$, where $\hat x_k^*$ is the optimizer of $\hat f_k(\cdot;t_k)$, as

$\bar r_k := f(x_k^*;t_k) - f(\hat x_k^*;t_k).$   (25)

Lemma 5.5 in [1] is applicable here: pick a $\delta \in (0,1)$ and set $\beta_k = 2\log(\pi_k/\delta)$, where $\sum_{k \ge 1} \pi_k^{-1} = 1$, $\pi_k > 0$. Then, for all $k \ge 1$,

$|g(x_k) - \mu_{k-1}(x_k)| \le \beta_k^{1/2}\,\sigma_{k-1}(x_k),$   (26)

$|g(\hat x_k^*) - \mu_{k-1}(\hat x_k^*)| \le \beta_k^{1/2}\,\sigma_{k-1}(\hat x_k^*),$   (27)

hold with probability $\ge 1-\delta$.

Similarly, Lemma 5.6 and Lemma 5.7 of [1] hold. For Lemma 5.7, we choose a regular discretization $X_k \subset X$, with $|X_k| = (\tau_k)^d$ points and discretization size $\tau_k = d k^2 b r \sqrt{\log(2da/\delta)}$. Similar to [1] we have that, with high probability,

$|g(x) - g(x')| \le b\sqrt{\log(2da/\delta)}\; \|x - x'\|_1, \quad \forall\, x, x' \in X,$   (28)

$\|x - [x]_k\|_1 \le \frac{r\,d}{\tau_k}, \quad \forall\, x \in X,$   (29)

$|g(x) - g([x]_k)| \le \frac{1}{k^2}, \quad \forall\, x \in X,$   (30)

where $[x]_k$ is the closest point in $X_k$ to $x$. We are now able to bound the instantaneous optimal regret $\bar r_k$.

Lemma 1 (Extension of Lemma 5.8 of [1])

Let Assumptions 1 and 2 hold true. Pick a $\delta \in (0,1)$ and define a sequence $\{\pi_k\}$ such that $\sum_{k \ge 1} \pi_k^{-1} = 1$, $\pi_k > 0$. Set the parameter $\beta_k$ as

$\beta_k = 2\log(4\pi_k/\delta) + 4d\log\big( d\,k\,b\,r\,\sqrt{\log(4da/\delta)} \big),$

where $a, b$ are the parameters defined in Assumption 2 and $r := \max_{x,x' \in X} \|x - x'\|$. Define the parameters $a_k$ and $b_k$ as

$a_k := G\,\frac{r\,d}{\tau_k}, \qquad b_k := \frac{L}{2}\Big( \frac{r\,d}{\tau_k} \Big)^2.$

Then the instantaneous optimal regret is bounded for all $k \ge 1$ as follows:

$\bar r_k \le 2\beta_k^{1/2}\sigma_{k-1}(\hat x_k^*) + \frac{1}{k^2} + a_k + b_k,$

with probability $\ge 1-\delta$.

Proof:

The proof is a modification of the one of [1, Lemma 5.8]; also in this case we use $\delta/2$ in both Lemmas 5.5 and 5.7 of [1] (which are valid here by the discussion above). We report only the parts that are different. Let $\hat x_k^*$ be the optimizer of $\hat f_k(\cdot;t_k)$, i.e., $\hat x_k^* \in \arg\max_{x \in X} \hat f_k(x;t_k)$.

By definition of $\hat x_k^*$ and optimality, we have that $\hat f_k(\hat x_k^*;t_k) \ge \hat f_k([x_k^*]_k;t_k)$. By [1, Lemma 5.7], we have that $g(x_k^*) \le \mu_{k-1}([x_k^*]_k) + \beta_k^{1/2}\sigma_{k-1}([x_k^*]_k) + 1/k^2$. These two combined yield

$f(x_k^*;t_k) \le \hat f_k(\hat x_k^*;t_k) + \frac{1}{k^2} + U(x_k^*;t_k) - U([x_k^*]_k;t_k).$   (31)

Now, by convexity of $-U$ and the Lipschitz condition on its gradient (Assumption 1 and Eq. (11)),

$\big| U(x_k^*;t_k) - U([x_k^*]_k;t_k) \big| \le G\,\|x_k^* - [x_k^*]_k\| + \frac{L}{2}\,\|x_k^* - [x_k^*]_k\|^2.$   (32)

Therefore, by plugging (29) and (30),

$\big| U(x_k^*;t_k) - U([x_k^*]_k;t_k) \big| \le a_k + b_k,$   (33)

where, by the definition of the discretization in [1, Lemma 5.7] and Eqs. (29)-(30), we derive the expressions of $a_k$ and $b_k$ reported in Lemma 1.

Therefore, by (31) and (33),

$f(x_k^*;t_k) \le \hat f_k(\hat x_k^*;t_k) + \frac{1}{k^2} + a_k + b_k.$   (34)

Putting the above results together, and by using (27),

$\bar r_k = f(x_k^*;t_k) - f(\hat x_k^*;t_k) \le 2\beta_k^{1/2}\sigma_{k-1}(\hat x_k^*) + \frac{1}{k^2} + a_k + b_k,$   (35)

from which the claim follows. ∎

Lemma 2

Under the same assumptions and definitions of Lemma 1, the instantaneous regret is bounded for all $k \ge 1$ as follows:

$r_k \le 2\beta_k^{1/2}\sigma_{k-1}(\hat x_k^*) + 2\beta_k^{1/2}\sigma_{k-1}(x_k) + \frac{1}{k^2} + a_k + b_k + e_k,$

with probability $\ge 1-\delta$.

Proof:

By definition of $r_k$, by calling $e_k = \hat f_k(\hat x_k^*;t_k) - \hat f_k(x_k;t_k)$, and by (35), we have

$r_k = \bar r_k + f(\hat x_k^*;t_k) - f(x_k;t_k) \le \bar r_k + e_k + 2\beta_k^{1/2}\sigma_{k-1}(x_k),$

where we have used (26) in the last inequality. ∎

We now move to the analysis of the error sequence $\{e_k\}$.

Lemma 3

Let Assumptions 1 to 4 hold true. Define the learning rate error as

$\nu_k := \sup_{x \in X} \big| \mu_{k-1}(x) + \beta_k^{1/2}\sigma_{k-1}(x) - \mu_{k-2}(x) - \beta_{k-1}^{1/2}\sigma_{k-2}(x) \big|.$

Then, the error sequence $\{e_k\}$ is upper bounded as

$e_k \le \varrho\, \big( e_{k-1} + 2(K + \nu_k) \big).$

Proof:

From the definition of $e_k$ we obtain,

$e_k = \hat f_k(\hat x_k^*;t_k) - \hat f_k(x_k;t_k) \le \varrho\, \big( \hat f_k(\hat x_k^*;t_k) - \hat f_k(x_{k-1};t_k) \big),$   (36)

where the inequality is due to Assumption 4 on the convergence rate of the method $\mathcal{T}$. In addition,

$e_{k-1} = \hat f_{k-1}(\hat x_{k-1}^*;t_{k-1}) - \hat f_{k-1}(x_{k-1};t_{k-1}).$

By algebraic manipulations, the error can be written as,

$\hat f_k(\hat x_k^*;t_k) - \hat f_k(x_{k-1};t_k) = e_{k-1} + \big[ \hat f_k(\hat x_k^*;t_k) - \hat f_{k-1}(\hat x_{k-1}^*;t_{k-1}) \big] + \big[ \hat f_{k-1}(x_{k-1};t_{k-1}) - \hat f_k(x_{k-1};t_k) \big].$

The first term in the right hand side can be written as

$\hat f_k(\hat x_k^*;t_k) - \hat f_{k-1}(\hat x_{k-1}^*;t_{k-1}) \le \hat f_k(\hat x_k^*;t_k) - \hat f_{k-1}(\hat x_k^*;t_{k-1}) \le \sup_{x \in X} \big| \hat f_k(x;t_k) - \hat f_{k-1}(x;t_{k-1}) \big|,$

where the first inequality holds by optimality of $\hat x_{k-1}^*$ for $\hat f_{k-1}$. The second term is upper bounded by $\sup_{x \in X} | \hat f_k(x;t_k) - \hat f_{k-1}(x;t_{k-1}) |$ as well. Putting things together, $e_k \le \varrho\, \big( e_{k-1} + 2 \sup_{x \in X} | \hat f_k(x;t_k) - \hat f_{k-1}(x;t_{k-1}) | \big)$. Furthermore, by Assumption 3,