# New Potential-Based Bounds for Prediction with Expert Advice

Vladimir A. Kobzar (vladimir.kobzar@nyu.edu), Center for Data Science, New York University, 60 Fifth Ave., New York, New York

Robert V. Kohn (kohn@cims.nyu.edu) and Zhilei Wang (zhilei@cims.nyu.edu), Courant Institute of Mathematical Sciences, 251 Mercer St., New York, New York

## Abstract

This work addresses the classic machine learning problem of online prediction with expert advice. We consider the finite-horizon version of this zero-sum, two-person game.

Using verification arguments from optimal control theory, we view the task of finding better lower and upper bounds on the value of the game (regret) as the problem of finding better sub- and supersolutions of certain partial differential equations (PDEs). These sub- and supersolutions serve as the potentials for player and adversary strategies, which lead to the corresponding bounds. Our techniques extend in a nonasymptotic setting the recent work of Drenska and Kohn (J. Nonlinear Sci. 2020), which showed that the asymptotically optimal value function is the unique solution of an associated nonlinear PDE.

To get explicit bounds, we use closed-form solutions of specific PDEs. Our bounds hold for any fixed number of experts and any time-horizon; in certain regimes (which we identify) they improve upon the previous state-of-the-art.

For up to three experts, our bounds provide the asymptotically optimal leading order term. Therefore, in this setting, we provide a continuum perspective on recent work on optimal strategies.

Keywords: Online learning, expert advice framework, regret minimization, fixed horizon, upper and lower bounds, potential-based strategies, subsolutions and supersolutions of partial differential equations, optimal control, dynamic programming, verification argument, linear heat equation, comb adversary

## 1 Introduction

The classic machine learning problem of online prediction with expert advice (the expert problem) is a repeated two-person zero-sum game with the following structure. At each round, the predictor (player) uses guidance from a collection of experts with the goal of minimizing the difference (regret) between the player’s loss and that of the best performing expert in hindsight. The environment (adversary) determines the losses of each expert for that round. The player’s selection of the experts and the adversary’s choice of the loss for each expert are revealed to both parties, and this prediction process is repeated until the final round.

The expert problem arises in the context of applications of data science and machine learning in adversarial environments, such as aggregation of political polls (Roughgarden and Schrijvers, 2017), portfolio allocation and trading (Agarwal et al., 2010; Dayri and Phadnis, 2016), cybersecurity (Truong et al., 2018) and cancer screening (Zhdanov et al., 2009; Morino et al., 2015). The experts framework also appears in other contexts where the data have no obvious distributional assumptions, such as neural architecture search (Nayman et al., 2019), online shortest path in graphs (Kalai and Vempala, 2005), signal processing (Singer and Kozat, 2010; Harrington, 2003), memory caching and energy saving (Gramacy et al., 2003; Helmbold et al., 2000). This framework has also been used to design approximation algorithms for provably hard off-line problems, such as the similarity mapping (Rakhlin et al., 2007). More broadly, the expert framework has been viewed as a meta-learning algorithm seeking to achieve the performance of the best among several constituent learning algorithms (robust model selection) (Bubeck, 2011).

The problem has several formulations, which reflect, among other things, differences in the flow of information, classes of loss functions, randomization of the strategies, as well as whether or not the regret is assessed in expectation. We will focus on the following representative definition of the expert problem, which mirrors (up to a trivial translation and rescaling of the loss) the version considered in recent work on optimal strategies (Gravin et al., 2016; Abbasi-Yadkori et al., 2017).

Prediction with expert advice: At each period $t$,

• the player determines which of the $N$ experts to follow by selecting a discrete probability distribution $p_t$;

• the adversary determines the allocation of losses to the experts by selecting a probability distribution $a_t$ over the hypercube $[-1,1]^N$; and

• the expert losses $q_t$ and the player’s choice of the expert $I_t$ are sampled from $a_t$ and $p_t$, respectively, and revealed to both parties.

We consider the finite horizon version, where the number of periods is fixed and the regret is where the joint distributions and refer to, respectively, the adversary and player strategies or simply the adversary and player.

Numerous strategies attain vanishing per-round regret. For example, the exponentially weighted forecaster provides the non-asymptotic upper bound . Also for all , there exist and sufficiently large, such that the randomized adversary (which chooses each vertex with equal probability) approaches that bound: .
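As a hedged illustration of the exponentially weighted forecaster mentioned above, the following minimal sketch runs it against i.i.d. uniform $\pm 1$ losses and compares the regret with the standard Hoeffding-type guarantee $\sqrt{2 T \log N}$ for losses in $[-1,1]$. The function name, the learning rate `eta`, the seed and the random loss sequence are assumptions made for illustration only; the regret is computed with the player's expected loss $p_t \cdot q_t$ rather than a sampled expert.

```python
import math
import random


def exp_weights_regret(n_experts, n_rounds, seed=0):
    """Regret of exponential weights on random +/-1 losses (illustrative only)."""
    rng = random.Random(seed)
    # Assumed tuning for losses with range 2: eta = sqrt(2 log N / T).
    eta = math.sqrt(2.0 * math.log(n_experts) / n_rounds)
    cum_loss = [0.0] * n_experts
    player_loss = 0.0
    for _ in range(n_rounds):
        # Weights proportional to exp(-eta * cumulative loss); shift for stability.
        shift = min(cum_loss)
        weights = [math.exp(-eta * (c - shift)) for c in cum_loss]
        total = sum(weights)
        p = [w / total for w in weights]
        # Hypothetical adversary: each expert's loss is an independent fair +/-1.
        q = [rng.choice((-1.0, 1.0)) for _ in range(n_experts)]
        player_loss += sum(pi * qi for pi, qi in zip(p, q))
        cum_loss = [c + qi for c, qi in zip(cum_loss, q)]
    # Regret relative to the best expert in hindsight.
    return player_loss - min(cum_loss)


regret = exp_weights_regret(n_experts=5, n_rounds=1000)
bound = math.sqrt(2.0 * 1000 * math.log(5))
print(regret, bound)
```

The guarantee used for comparison holds for every loss sequence in $[-1,1]^N$, so the random adversary here is only one convenient test case.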

A minmax optimal player (optimal player) is a player that minimizes the regret over all possible adversary strategies and a minmax optimal adversary (optimal adversary) is an adversary that maximizes the regret over all possible player strategies. Thus, and are optimal asymptotically in and .

Nonasymptotic minmax optimal strategies were determined explicitly using random walk methods for (a) , and (b) , up to the leading order term (Cover, 1966; Gravin et al., 2016; Abbasi-Yadkori et al., 2017). For general , optimal strategies are given by a recursion, but they have not been determined explicitly.

In a related line of work, strategies that are optimal asymptotically in were determined by PDE-based methods. For , Zhu (2014) established that the value function is given by the solution of a 1D linear heat equation, which provides a continuous perspective on the earlier random walk characterization of the non-asymptotic problem. Drenska and Kohn (2020) showed that, for any fixed , the value function, in the scaling limit, is the unique solution of an associated nonlinear PDE. Bayraktar et al. (2019) determined the closed-form solutions of the PDEs for and .

Due to the complexity of determining optimal strategies for an arbitrary fixed , it is common to use potential functions to bound the regret above for all possible adversary strategies. For example, could be viewed as a descent strategy for the entropy potential; the corresponding upper bound is obtained by bounding the evolution of this potential for all possible adversaries.

Rakhlin et al. (2012) proposed a principled way of deriving potential-based player strategies by bounding above the value function, conditional on the realized losses, in a manner that is consistent with its recursive minmax form. Rokhlin (2017) suggested using supersolutions of the asymptotic PDE as potentials for player strategies in the scaling limit. The present paper extends these ideas by applying related arguments to the original problem (not a scaling limit), and by providing numerous examples (including lower as well as upper bounds).

Adversary strategies have been commonly studied as random processes. For example, guarantees that the regret is given by the expectation of the maximum of i.i.d. Gaussians with mean zero and variance . This guarantee is based on the central limit theorem and is therefore asymptotic in . Nonasymptotic lower bounds have been established using random walk methods (Orabona and Pal, 2015; György et al., ).

While the player and the adversary may use randomization in their strategies, the deterministic control paradigm fully describes the adversarial expert problem. Accordingly, in this paper we propose a control-based framework for designing strategies for the expert problem using sub- and supersolutions of certain PDEs. Our principal contributions are the following:

1. The potential-based framework is extended to adversary strategies, leading to lower bounds (Section 3).

2. The bounds hold for any fixed number of experts and are nonasymptotic in ; their rate of convergence to the asymptotic (in ) value is determined explicitly using error estimates similar to those applied to finite difference schemes in numerical analysis (Theorems 3 and 4). For lower bounds, this rate is determined by the smoothness of the relevant potentials (Remark 3).

3. The task of finding better regret bounds reduces to the mathematical problem of finding better sub- and supersolutions of certain PDEs (see Equations (2) and (5)).

4. Our framework is based on elementary "verification" arguments from optimal control theory and does not rely on a scaling argument (Appendices A and B). Therefore, the final value function no longer needs to be homogeneous to satisfy the scaling property, which increases the range of possible applications of our methods.

5. To get explicit bounds, we use the classical solution of the linear heat equation with suitable diffusion factors as lower and upper bound potentials (Section 5).

   1. The resulting lower bound is expressed as the expectation of the maximum of i.i.d. Gaussians with mean zero, and is therefore similar to the existing lower bound proved using . However, the constant factor of the leading order term (i.e., the standard deviation of the Gaussians) is, to the best of our knowledge, state-of-the-art (Section 7).

   2. Accordingly, the resulting lower bound improves the existing non-asymptotic lower bounds for general fixed and relatively large, but fixed, (Section 7).

6. To get another family of bounds, we introduce new upper and lower bound potentials using a closed-form solution of a nonlinear PDE based on the largest diagonal entry of the Hessian (Section 6).

   1. For up to three experts, the lower and upper bounds for this potential provide a matching leading order term. Therefore, the corresponding strategies are optimal at the leading order (Section 6).

   2. The same leading order constant for three experts was determined in Abbasi-Yadkori et al. (2017) with the regret scaling as (for our loss). Our approach, however, provides a smaller error term: (Section 6).

   3. The resulting upper bound is tighter than the bound obtained using for small and relatively large but fixed (Section 7).

7. Lastly, our framework leads to efficient strategies. For example, the explicit adversary strategies in this paper do not require runtime computations involving the potential or its derivatives; moreover, those strategies are time-independent. This illustrates the feasibility of the framework for high-dimensional problems.

## 2 Notation

We will use the following notation. For a multi-index , refers to the partial derivative and refers to the differential with respect to the spatial variable(s) in , and refers to the differential with respect to all except the spatial variables in I. and denote the third and fourth derivatives in the direction of given by the linear forms and , respectively. Whenever the region of integration is omitted, it is assumed to be .

denotes the set if or if . is a vector in with all components equal to 1, and refers to the indicator function of the set . refers to a discrete probability distribution over outcomes. Whenever the feasible set of is omitted, it is assumed to be .

A classical solution of a partial differential equation (PDE) on a specified region is a solution such that all derivatives appearing in the statement of the PDE exist and are continuous on the specified region.

To bound the fixed horizon regret, we will apply the dynamic programming principle backwards from the final time. Since in this setting it is convenient to denote the time by nonpositive numbers such that the starting time is and the final time is zero, we will use this convention in the remainder of this paper.

Let the vector $r_t$ denote the player’s losses realized in round $t$ relative to those of each expert (instantaneous regret) and let the vector $x$ denote the player’s cumulative losses realized before the outcome of round $t$ relative to those of each expert (cumulative regret or simply the regret).
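The bookkeeping for the instantaneous and cumulative regret can be sketched in a few lines. This is a minimal illustration under our reading that the instantaneous regret is $r_t = (q_t)_{I_t}\mathbb{1} - q_t$, where $q_t$ is the vector of expert losses and $I_t$ is the expert followed by the player; the helper names and the two sample rounds are hypothetical.

```python
def instantaneous_regret(q, chosen):
    """Player's loss minus each expert's loss: r = q[chosen] * 1 - q."""
    return tuple(q[chosen] - qi for qi in q)


def accumulate(x, r):
    """Add the instantaneous regret r to the cumulative regret x."""
    return tuple(xi + ri for xi, ri in zip(x, r))


# Two hypothetical rounds with three experts: (expert losses, index followed).
rounds = [((1, -1, 1), 1), ((-1, 1, 1), 2)]

x = (0, 0, 0)
for q, chosen in rounds:
    x = accumulate(x, instantaneous_regret(q, chosen))

# Regret with respect to the best expert in hindsight is the largest component.
final_regret = max(x)
print(x, final_regret)
```

Taking the maximum component of the cumulative regret vector at the final time matches the final-time condition $\max_i x_i$ used in the dynamic programs of Sections 3 and 4.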

## 3 Lower Bound

When the prediction process starts at given and , the fixed horizon value function reflecting the worst-case (smallest) regret at the final time for a given adversary is constructed by a dynamic program (DP) backwards from the final time.

$$
\begin{aligned}
v_a(x,0) &= \max_i x_i \\
v_a(x,t) &= \min_{p_t} \mathbb{E}_{a_t,p_t}\, v_a(x+r_t,t+1) \quad \text{for } t\le -1
\end{aligned}
\tag{1}
$$

This reflects the fact that an optimal player against depends only on and the cumulative history represented by , rather than the full history .
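The backward recursion can be evaluated exactly for very small instances. The sketch below is an illustration under stated assumptions rather than the authors' code: it takes the instantaneous regret to be $r_t = (q_t)_{I_t}\mathbb{1} - q_t$, fixes a time-independent adversary that is uniform on the balanced cuts defined in Section 5, and uses the fact that the expectation is linear in $p_t$, so the minimum over $p_t$ is attained by following a single expert. The function names are hypothetical.

```python
from functools import lru_cache
from itertools import product


def balanced_cuts(n):
    """Sign vectors whose entries sum to 0 (n even) or to +/-1 (n odd)."""
    targets = {0} if n % 2 == 0 else {1, -1}
    return [q for q in product((-1, 1), repeat=n) if sum(q) in targets]


def value_against_adversary(n_experts, horizon):
    """Value at x = 0 when `horizon` rounds remain, by backward recursion."""
    support = balanced_cuts(n_experts)

    @lru_cache(maxsize=None)
    def v(x, t):
        if t == 0:
            return max(x)  # final-time condition: max_i x_i
        best = float("inf")
        for i in range(n_experts):  # pure player strategy: follow expert i
            total = 0.0
            for q in support:  # adversary: uniform over the support
                nxt = tuple(xk + q[i] - qk for xk, qk in zip(x, q))
                total += v(nxt, t + 1)
            best = min(best, total / len(support))
        return best

    # Time runs over nonpositive integers; the game starts at t = -horizon.
    return v((0,) * n_experts, -horizon)


print(value_against_adversary(2, 4))
```

For two experts the difference of the two regret components performs a symmetric random walk regardless of the player's choice, so the value equals the expected absolute value of a simple random walk after the given number of steps.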

In the context of lower bounds in this paper, we will only consider those adversary strategies that assign the same probability to each component of : for some (balanced strategies).

Note that for all . To bound below, we introduce the following potential function, or simply potential.

Lower-bound potential : We will use this term for a function , such that, for every and , there exists a balanced strategy on ensuring that is a classical solution of

$$
\begin{aligned}
& u_t+\frac{1}{2}\mathbb{E}_{a_t}\langle D^2u\cdot q,q\rangle \ge 0 \\
& u(x,0)\le \max_i x_i \\
& u(x+c\mathbb{1},t)=u(x,t)+c
\end{aligned}
\tag{2}
$$

Adversary strategy : A strategy associated with as above is: At each , the adversary selects a balanced strategy such that (2) is satisfied at . At , the adversary selects an arbitrary distribution over .

As confirmed in Appendix A, the dependence of on is eliminated due to the fact that is balanced, and the PDE (2) controls the decrease of at each step, leading to the following result.

Theorem 3 (Lower bound). If for all , (i) , and (ii) for , , are Lipschitz continuous, and for any sampled from ,

$$
\frac{1}{6}\operatorname*{ess\,sup}_{y\in[x,x-q]} D^3u(y,t+1)[q,q,q]+\frac{1}{2}\operatorname*{ess\,sup}_{\tau\in[t,t+1]}u_{tt}(x,\tau)\le K(t)
\tag{3}
$$

then , where .

Under the stronger assumptions that (a) for any given , the adversary assigns the same probability to and (symmetric strategies) and (b) has Lipschitz continuous higher-order derivatives, we obtain the following.

Remark 3 (Lower bound - Lipschitz continuous higher-order derivatives). If has higher-order Lipschitz continuous derivatives, they could be used to bound the value function of a symmetric adversary . For example, if, for all and , exists and is Lipschitz continuous, and for any sampled from ,

$$
-\frac{1}{24}\operatorname*{ess\,inf}_{y\in[x,x-q]} D^4u(y,t+1)[q,q,q,q]+\frac{1}{2}\operatorname*{ess\,sup}_{\tau\in[t,t+1]}u_{tt}(x,\tau)\le K(t)
$$

then the conclusion of Theorem 3 still holds.

Remark 3 can be used to obtain a better error estimate. For example, in the context of the heat potential discussed below, we can bound the error uniformly in .

Our lower bound potentials defined by (2) have a PDE interpretation: they are subsolutions of the nonlinear PDE obtained by the arguments in Drenska and Kohn (2020).

## 4 Upper Bound

In parallel to the discussion above, the value function reflecting the worst-case (largest) regret at the final time inflicted on a given player is constructed by the following DP:

$$
\begin{aligned}
v_p(x,0) &= \max_i x_i \\
v_p(x,t) &= \max_{a_t} \mathbb{E}_{a_t,p_t}\, v_p(x+r_t,t+1)
\end{aligned}
\tag{4}
$$

As also noted in connection with (1), this reflects the fact that an optimal adversary against depends only on and the cumulative history represented by .

Note that for all . To bound above, we use the following potential.

Upper-bound potential : We will use this term for a function , which is nondecreasing as a function of each , and which, for all and , is a classical solution of

$$
\begin{aligned}
& w_t+\frac{1}{2}\max_{q\in[-1,1]^N}\langle D^2w\cdot q,q\rangle \le 0 \\
& w(x,0)\ge \max_i x_i \\
& w(x+c\mathbb{1},t)=w(x,t)+c
\end{aligned}
\tag{5}
$$

Player strategy : The player strategy associated with as above is: At each period , the player selects . At , the player selects an arbitrary distribution in . Since is nondecreasing in each and by linearity of along , .

As confirmed in Appendix B, this player strategy eliminates the first-order spatial derivative of for all choices of when Taylor expansion is used to estimate how changes at each time step. We use the PDE (5) to control the other terms of the expansion, leading to the following result.

Theorem 4 (Upper bound). If for all , (i) , and (ii) for , and are Lipschitz continuous and for all

$$
-\frac{1}{6}\operatorname*{ess\,inf}_{y\in[x,x-q_t]} D^3w(y,t+1)[q,q,q]-\frac{1}{2}\operatorname*{ess\,inf}_{\tau\in[t,t+1]}w_{tt}(x,\tau)\le K(t)
$$

then where .

Our upper bound potentials defined by (5) also have a PDE interpretation: they are supersolutions of the nonlinear PDE obtained by the arguments in Drenska and Kohn (2020).

As an example, using our framework in Appendix C, we recover the classic upper bound for .

## 5 Heat Potentials

In this section, we consider a specific potential given by

$$
u(x,t)=\alpha\int e^{-\frac{\|y\|^2}{2\sigma^2}}\max_k\,(x_k-y_k)\,dy
\tag{6}
$$

where and . This potential is the classical solution, on , of the following linear heat equation

$$
\begin{cases}
u_t+\kappa\Delta u=0\\
u(x,0)=\max_i x_i
\end{cases}
$$

The linearity of the function in the direction of confirms that . This implies that , , and therefore . Appendix D confirms

###### Claim 5

If , then for

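$$
\kappa_h=
\begin{cases}
1 & \text{if } N=2\\
\frac{1}{2}+\frac{1}{2N} & \text{if } N \text{ is odd}\\
\frac{1}{2}+\frac{1}{2N-2} & \text{otherwise}
\end{cases}
\tag{7}
$$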

The proof of this Claim is short and elementary. But to put the result in context, we note that since the function is convex, is convex as a function of . Therefore, a maximum of this quadratic form is attained on the vertices of the hypercube . Appendix E confirms that for . Thus, we can consider to be the Laplacian of an undirected weighted graph with vertices, and where is the maximum cut of .

For an unweighted graph with vertices and edges, it is known that (Haglin and Venkatesan, 1991). Claim 5 provides a similar result for using only the fact that it is symmetric and has in the kernel.

We define the adversary to be a uniform distribution on the set of “balanced cuts”

$$
S=
\begin{cases}
\left\{q\in\{-1,1\}^N \,\middle|\, \sum_{i=1}^N q_i=\pm 1\right\} & \text{for } N \text{ odd}\\
\left\{q\in\{-1,1\}^N \,\middle|\, \sum_{i=1}^N q_i=0\right\} & \text{for } N \text{ even}
\end{cases}
$$

which were used in the proof of Claim 5; is symmetric because it is the uniform distribution over the symmetric set .

Heat-based adversary : At each , the adversary samples uniformly from .

The potential given by (6) with the diffusion factor (7), combined with the adversary , satisfies (2). The resulting nonasymptotic lower bound is described in Example 5 below.

Moreover, Appendix D shows that (2) is satisfied with equalities instead of inequalities. Therefore, approaches asymptotically in , i.e., .

Note that does not require any runtime computations of or its derivatives; moreover does not depend on or .
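The diffusion factor (7) can be checked numerically on small instances. The sketch below assumes, as our reading of Claim 5 and Appendix D, that for a symmetric matrix $M$ with the all-ones vector in its kernel the uniform distribution on the balanced cuts gives $\frac{1}{2}\mathbb{E}\langle Mq,q\rangle = \kappa_h \operatorname{tr} M$; the helper names and the random test matrices are hypothetical. Exact rational arithmetic avoids rounding issues.

```python
import random
from fractions import Fraction
from itertools import product


def balanced_cuts(n):
    """Sign vectors whose entries sum to 0 (n even) or to +/-1 (n odd)."""
    targets = {0} if n % 2 == 0 else {1, -1}
    return [q for q in product((-1, 1), repeat=n) if sum(q) in targets]


def kappa_h(n):
    """Diffusion factor from Equation (7)."""
    if n == 2:
        return Fraction(1)
    if n % 2 == 1:
        return Fraction(1, 2) + Fraction(1, 2 * n)
    return Fraction(1, 2) + Fraction(1, 2 * n - 2)


def random_laplacian_like(n, rng):
    """Symmetric integer matrix with zero row sums (a weighted graph Laplacian)."""
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = rng.randint(1, 5)  # hypothetical positive edge weight
            m[i][j] = m[j][i] = -w
            m[i][i] += w
            m[j][j] += w
    return m


def mean_half_quadratic_form(m, cuts):
    """(1/2) * average of <M q, q> over the given cuts, as an exact fraction."""
    n = len(m)
    total = Fraction(0)
    for q in cuts:
        total += sum(m[i][j] * q[i] * q[j] for i in range(n) for j in range(n))
    return total / (2 * len(cuts))


rng = random.Random(0)
m = random_laplacian_like(5, rng)
trace = sum(m[i][i] for i in range(5))
print(mean_half_quadratic_form(m, balanced_cuts(5)), kappa_h(5) * trace)
```

Sampling from the balanced cuts needs only the number of experts, which is consistent with the observation above that the adversary requires no runtime evaluation of the potential.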

Similar ideas can be used to give an upper bound. Appendix E confirms that for and . Also the fact , as noted above, implies that . Therefore,

$$
\frac{1}{2}\max_{q\in[-1,1]^N}\langle D^2u\cdot q,q\rangle \le \frac{1}{2}\Delta u-\frac{1}{2}\sum_{i\neq j}\partial_{ij}u=\Delta u
$$

We also proved in Appendix E. Thus, given by with satisfies (5).

Heat-based player : At each , the player selects and, at , the player selects an arbitrary distribution in .

In Appendix F, we compute the error estimates for the heat potential. The theorems presented above lead to the upper and lower bounds on the relevant value functions in Example 5. Note that since the heat potential is smooth, we can bound uniformly in using Remark 3. Theorem 3 is also available and provides ; as a result, is .

Example 5 (Heat potential bounds).

1. where is the value function of and ; and

2. where is the value function of and .

Since where is a Gaussian N-dimensional vector , the bounds on the value function lead to the following bounds on the regret

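$$
\sqrt{2\kappa_h|T|}\;\mathbb{E}_G\max_i G_i-O\!\left(\min\left(N\sqrt{N},\ \sqrt{N}\log N+\sqrt{N}\log|T|\right)\right)\le v_{a^h}(0,T)=\min_p R_T(a^h,p)
$$

and

$$
\max_a R_T(a,p^h)=v_{p^h}(0,T)\le \sqrt{2|T|}\;\mathbb{E}_G\max_i G_i+O\!\left(\sqrt{N}\log N+\sqrt{N}\log|T|\right)
$$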

For two experts, the lower and upper bounds provide a matching leading order term . Therefore, the corresponding strategies are minmax optimal asymptotically in .
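As a numerical sanity check on these leading-order terms, the following sketch evaluates the expected maximum of $N$ i.i.d. standard Gaussians by quadrature and forms the coefficients of $\sqrt{|T|}$ in the two heat-based bounds. It assumes our reading of the displays above, namely that the lower bound uses the diffusion factor $\kappa_h$ from (7) and the upper bound uses diffusion factor 1; the function names and the quadrature parameters are illustrative choices.

```python
import math


def expected_max_gaussians(n, half_width=10.0, steps=20000):
    """E[max of n i.i.d. N(0,1)] = integral of x * n * pdf(x) * cdf(x)^(n-1)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * h
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid rule end points
        total += weight * x * n * pdf * cdf ** (n - 1)
    return total * h


def kappa_h(n):
    """Diffusion factor from Equation (7)."""
    if n == 2:
        return 1.0
    if n % 2 == 1:
        return 0.5 + 1.0 / (2 * n)
    return 0.5 + 1.0 / (2 * n - 2)


def heat_leading_constants(n):
    """Coefficients of sqrt(|T|) in the heat-based lower and upper bounds."""
    emax = expected_max_gaussians(n)
    lower = math.sqrt(2.0 * kappa_h(n)) * emax
    upper = math.sqrt(2.0) * emax  # assumes diffusion factor 1 for the player
    return lower, upper


for n in (2, 3, 10):
    print(n, heat_leading_constants(n))
```

For two experts both coefficients reduce to $\sqrt{2/\pi}$, while for larger $N$ a gap remains between the two heat-based constants because $\kappa_h < 1$.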

Our heat lower bound is conceptually related to the previously known randomized adversary . The solution of our heat equation with replaced by the smaller value of can be viewed as a potential associated with . We explain this and recover the classic lower bound for in Appendix G.1.

Several recent papers discuss the so-called comb adversary. To define it, we introduce for any , its ranked coordinates , such that . The comb adversary is then defined as follows.

Comb adversary : At each , the adversary assigns probability to each of and where if is odd and if is even.

Gravin et al. (2016) suggested that might be optimal asymptotically in for any fixed , and Abbasi-Yadkori et al. (2017) and Bayraktar et al. (2019) showed that to be the case for and , respectively.

Although we do not resolve this conjecture for any other fixed , in Appendix G.2, we show that is doubly asymptotically optimal as first and then (previously this was only known for ). Specifically, our heat potential with replaced by can be also viewed as a potential associated with . Therefore, the lower bound obtained by is at least as large as the classic lower bound obtained by (which is doubly asymptotically optimal).

## 6 Max Potentials

In this section, we consider the potential given by the solution of

$$
\begin{cases}
u_t+\kappa\max_i\partial_i^2u=0\\
u(x,0)=\max_i x_i
\end{cases}
\tag{8}
$$

Abbasi-Yadkori et al. (2017), using probabilistic methods, showed that the strategy associated with this potential (the max adversary, which we formally define below) is optimal asymptotically in for . A PDE perspective on this result is given at the end of this section.

The building blocks of are functions of the form , which are self-similar solutions of the linear 1D heat equation with the final value . In this setting, we have

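$$
f(z)=\sqrt{\frac{2}{\pi}}\,e^{-\frac{z^2}{2}}+z\operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)
\quad\text{and}\quad
\operatorname{erf}(y)=\frac{2}{\sqrt{\pi}}\int_0^y e^{-s^2}\,ds.
\tag{9}
$$

As confirmed in Appendix H, solves

$$
\begin{cases}
f(z)=f''(z)+zf'(z)\\
\lim_{|z|\to\infty}\frac{f(z)}{|z|}=1.
\end{cases}
$$

Therefore, solves the one-dimensional linear heat equation on

$$
\begin{cases}
g_t+\kappa g_{xx}=0\\
g(x,0)=|x|
\end{cases}
$$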
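The self-similar structure can be checked by finite differences. The sketch below assumes the building block has the form $g(x,t)=\sqrt{-2\kappa t}\, f\!\left(x/\sqrt{-2\kappa t}\right)$, which is our reading based on the scaling that appears in (10); the step sizes, sample points and function names are illustrative.

```python
import math


def f(z):
    """Profile from Equation (9)."""
    return (math.sqrt(2.0 / math.pi) * math.exp(-0.5 * z * z)
            + z * math.erf(z / math.sqrt(2.0)))


def g(x, t, kappa):
    """Assumed self-similar form sqrt(-2*kappa*t) * f(x / sqrt(-2*kappa*t)), t < 0."""
    s = math.sqrt(-2.0 * kappa * t)
    return s * f(x / s)


def heat_residual(x, t, kappa, hx=1e-3, ht=1e-4):
    """Central-difference estimate of g_t + kappa * g_xx at (x, t)."""
    g_t = (g(x, t + ht, kappa) - g(x, t - ht, kappa)) / (2.0 * ht)
    g_xx = (g(x + hx, t, kappa) - 2.0 * g(x, t, kappa)
            + g(x - hx, t, kappa)) / (hx * hx)
    return g_t + kappa * g_xx


print(heat_residual(0.3, -1.0, 2.0))  # close to zero
print(g(0.7, -1e-12, 2.0))            # close to |0.7| near the final time
```

The residual vanishes up to discretization error because $f = f'' + z f'$, and the final value $|x|$ is recovered because $f(z)/|z| \to 1$ as $|z| \to \infty$.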

The ranked coordinates of allow us to define globally in a uniform manner, which we shall call the max potential: in Appendix H, we confirm that

###### Claim

The classical solution of (8) on is given by

$$
u(x,t)=\frac{1}{N}\sum_i x_{(i)}+\sqrt{-2\kappa t}\sum_{l=1}^{N-1}c_l f(z_l)
\tag{10}
$$

where , is given by (9) and .

Since does not change when a multiple of is added to , we have . This implies that , and therefore . The corresponding strategy is:

max adversary : At each , the adversary selects the distribution by assigning probability to each of and where the entry of corresponding to the largest component of is set to 1 and the remaining entries are set to .

Suppose . In Appendix H.3 we confirm that ; thus

$$
\langle D^2u\cdot(\pm q^m),\pm q^m\rangle=\langle D^2u\cdot\pm(q^m+\mathbb{1}),\pm(q^m+\mathbb{1})\rangle=4\partial_{ii}u=4\max_j\partial_{jj}u.
$$

Therefore, given by (10) with satisfies (2) for the adversary strategy . The resulting lower bound is given below in Example 6.
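The support of the max adversary and the identity used above can be illustrated on a small matrix. This is a hedged sketch: it assumes that the remaining entries of the adversary's vector are set to $-1$ (our reading, consistent with the quadratic form reducing to $4\partial_{ii}u$) and it uses a hypothetical symmetric matrix with zero row sums as a stand-in for $D^2u$. Ties for the largest component are broken by taking the first index.

```python
def max_adversary_support(x):
    """Return (q, -q): q has +1 at the largest component of x, -1 elsewhere."""
    i = max(range(len(x)), key=lambda k: x[k])
    q = tuple(1 if k == i else -1 for k in range(len(x)))
    return q, tuple(-qk for qk in q)


def quadratic_form(m, q):
    """Compute <M q, q> for a square matrix given as a list of rows."""
    n = len(q)
    return sum(m[i][j] * q[i] * q[j] for i in range(n) for j in range(n))


# Hypothetical stand-in for the Hessian: symmetric, rows summing to zero.
hessian = [
    [2, -1, -1],
    [-1, 3, -2],
    [-1, -2, 3],
]

x = (0.3, 1.2, -0.5)
q_plus, q_minus = max_adversary_support(x)
print(q_plus, quadratic_form(hessian, q_plus), 4 * hessian[1][1])
```

Because the all-ones vector lies in the kernel of the matrix, adding it to the adversary's vector leaves the quadratic form unchanged, which is why only the diagonal entry at the leading index survives.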

To determine an upper bound, we note that since is convex, is convex. Therefore, is attained at the vertices of the hypercube .

Also from Appendix H, we see that has a special structure: for all and for . In Appendix I, we use this structure to confirm that a class of simple rank-based strategies maximizes the quadratic form as follows.

###### Claim

for

$$
\kappa_m=
\begin{cases}
\frac{N^2}{2(N-1)} & \text{for } N \text{ even}\\
\frac{N+1}{2} & \text{for } N \text{ odd}
\end{cases}
\tag{11}
$$

Also in Appendix H.1 we show . Therefore, we obtain an upper bound potential by using $u$ given by (10) with $\kappa=\kappa_m$ given by (11). This satisfies (5); the associated player strategy is:

player : At each , the player selects and, at , the player selects an arbitrary .

Since is constructed by reflection of a smooth function whose first derivatives normal to the reflection boundary vanish, its third spatial derivatives are bounded almost everywhere on but are discontinuous at the reflection boundary. Therefore, in this setting, the tighter control of the lower bound error described in Remark 3 is not available. Since the error terms in our upper and lower bounds are obtained by the same arguments, we denote both of them by .

Example 6 (max-based bounds).

1. where is the value function of ; and

2. where is the value function of ,

and, in each case, (for more details on E(t), see Appendix J).

Since , we obtain the following bounds on the regret

$$
\frac{2(N-1)}{N}\sqrt{\frac{2}{\pi}|T|}-O(N\log|T|)\le v_{a^m}(0,T)=\min_p R_T(a^m,p)
$$

and

$$
\max_a R_T(a,p^m)=v_{p^m}(0,T)\le \frac{2(N-1)}{N}\sqrt{\frac{\kappa_m}{\pi}|T|}+O(N\log|T|)
$$

The lower and upper bounds have the matching leading order term of and for, respectively, two and three experts. Therefore, the corresponding strategies are minmax optimal asymptotically in .
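The matching of the leading terms for two and three experts can be verified directly from (11) and the two displays above. The following sketch computes the coefficient of $\sqrt{|T|}$ in each bound; the function names are hypothetical, and the constants are taken from the displayed bounds as reconstructed here.

```python
import math
from fractions import Fraction


def kappa_m(n):
    """Diffusion factor from Equation (11)."""
    if n % 2 == 0:
        return Fraction(n * n, 2 * (n - 1))
    return Fraction(n + 1, 2)


def max_leading_constants(n):
    """Coefficients of sqrt(|T|) in the max-based lower and upper bounds."""
    scale = 2.0 * (n - 1) / n
    lower = scale * math.sqrt(2.0 / math.pi)
    upper = scale * math.sqrt(float(kappa_m(n)) / math.pi)
    return lower, upper


for n in range(2, 7):
    lower, upper = max_leading_constants(n)
    print(n, kappa_m(n), round(lower, 4), round(upper, 4))
```

Since $\kappa_m = 2$ for both $N=2$ and $N=3$, the two coefficients coincide there, while for $N \ge 4$ the factor $\kappa_m$ exceeds 2 and a gap opens between the upper and lower leading terms.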

The same leading order constant for three experts was determined in Abbasi-Yadkori et al. (2017) with the regret scaling as (for our loss function). Our strategies and , however, give improved guarantees with respect to the lower order (error) term: .

The fact that our bounds for match asymptotically can be understood from a PDE perspective. Indeed, our upper-bound and lower-bound max potentials for are the same; they solve the PDE derived as in Drenska and Kohn (2020) that characterizes the asymptotically optimal result. This observation can also be found in Bayraktar et al. (2019) (for , however, the solution of the relevant PDE is different from our max potential).

## 7 Related work

As noted at the end of Section 5, is strictly larger than for any fixed . Therefore, asymptotically in , the lower bound attained by our heat-based adversary is tighter than the one attained by the classic randomized adversary .

When and are fixed, a bound obtained using is provided by Theorem 8 in Orabona and Pal (2015); their argument involves lower bounding the maximum of independent symmetric random walks of length .

Another lower bound is given in Chapter 7 of György et al. () for an adversary strategy constructed from a single random walk of length . This provides a tighter lower bound than our when is relatively small. However, as illustrated by Figure LABEL:fig:l, when is large, our strategy improves on the lower bound obtained by . (The lower bound given by Orabona and Pal (2015) is not shown because its value is negative for the given range of and .)

Turning to the upper bounds: when is small and is large, as illustrated by Figure LABEL:fig:2, the max player improves on the upper bound given by the exponential weights ( remains advantageous in this setting relative to ).

See Appendix K for details regarding the numerical computation of these bounds.

## 8 Conclusions

We establish that potentials can be used to design effective strategies leading to lower bounds as well as upper bounds. We also demonstrate that solutions of certain PDEs are good candidates for such potentials, which improve on the existing bounds.

While this paper focuses on the fixed horizon version of the expert problem, Kobzar et al. (2019) extended our framework to the geometric stopping version, where the final time is not fixed but is rather random, chosen from the geometric distribution.

We expect that our framework could be used to systematize and advance theory and practice of online learning in other settings as well.

## Acknowledgments

V.A.K. and R.V.K. are supported, in part, by NSF grant DMS-1311833. V.A.K. is also supported by the Moore-Sloan Data Science Environment at New York University.

## Appendix A Proof of Theorem 3 and Remark 3

Proof of Theorem 3. Since is characterized by the dynamic program (1), we confirm that by induction starting from the final time. The initial step follows from the inequality between and at .

To prove the inductive step, as a preliminary result, we bound below the difference in terms of and .

At , the conditions of the theorem already provide

$$
\min_{p_{-1}}\mathbb{E}_{a_{-1},p_{-1}}\left[u(x+r_{-1},0)\right]-u(x,-1)\ge -C
$$

For , we decompose the difference first with respect to the change of , holding fixed, and with respect to , holding fixed:

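$$
\begin{aligned}
\min_{p_t}\mathbb{E}_{p_t,a_t}\left[u(x+r_t,t+1)-u(x,t)\right]
={}&\min_{p_t}\mathbb{E}_{p_t,a_t}\left[u(x-q_t,t+1)+(q_t)_{I_t}\right]-u(x,t+1)\\
&+u(x,t+1)-u(x,t)
\end{aligned}
$$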

Since is with Lipschitz continuous second order derivatives, we use Taylor’s theorem with the integral remainder

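$$
\begin{aligned}
u(x-q_t,t+1)={}&u(x,t+1)-\nabla u(x,t+1)\cdot q_t+\frac{1}{2}\langle D^2u(x,t+1)\cdot q_t,q_t\rangle\\
&-\int_0^1 D^3u(x-\mu q_t,t+1)[q_t,q_t,q_t]\,\frac{(1-\mu)^2}{2}\,d\mu
\end{aligned}
$$

Thus,

$$
\begin{aligned}
u(x-q_t,t+1)-u(x,t+1)\ge{}&-\nabla u(x,t+1)\cdot q_t+\frac{1}{2}\langle D^2u(x,t+1)\cdot q_t,q_t\rangle\\
&-\frac{1}{6}\operatorname*{ess\,sup}_{y\in[x,x-q_t]}D^3u(y,t+1)[q_t,q_t,q_t]
\end{aligned}
$$

and similarly for

$$
u(x,t+1)-u(x,t)\ge u_t(x,t+1)-\frac{1}{2}\operatorname*{ess\,sup}_{\tau\in[t,t+1]}u_{tt}(x,\tau)
$$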

Note that for all because is balanced and by linearity of along . Therefore we eliminated the first order spatial derivative and consequently the dependence on .

Also we use the condition on the potential

$$
u_t(x,t+1)+\frac{1}{2}\mathbb{E}_{a_t}\langle D^2u(x,t+1)\cdot q_t,q_t\rangle\ge 0
$$

Collecting the above inequalities and using the fact that is symmetric we have

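$$
\min_{p_t}\mathbb{E}_{p_t,a_t}\left[u(x+r_t,t+1)-u(x,t)\right]\ge -K(t)=E(t+1)-E(t)
$$

Finally, using the inductive hypothesis , and the dynamic program formulation of , we obtain

$$
\begin{aligned}
u(x,t)-E(t)&\le u(x,t)+\min_{p_t}\mathbb{E}_{p_t,a_t}\left[u(x+r_t,t+1)-u(x,t)\right]-E(t+1)\\
&\le \min_{p_t}\mathbb{E}_{p_t,a_t}\left[v_a(x+r_t,t+1)\right]=v_a(x,t)
\end{aligned}
$$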

The proof of Remark 3 is the same except that we expand up to fourth order spatial derivatives and use the fact that by symmetry ( and have the same probability).

## Appendix B Proof of Theorem 4

Proof of Theorem 4. Since is characterized by the dynamic program (4), we confirm by induction that . The initial step follows from the inequality between and at , and the rest of the proof is similar to Appendix A. To prove the inductive step, we first note that .

For , we decompose the difference as follows:

$$
\begin{aligned}
\max_{a_t}\mathbb{E}_{p_t,a_t}\left[w(x+r_t,t+1)-w(x,t)\right]
={}&\max_{a_t}\mathbb{E}_{p_t,a_t}\left[w(x-q_t,t+1)+(q_t)_{I_t}\right]-w(x,t+1)+w(x,t+1)-w(x,t)\\
={}&\max_{a_t}\mathbb{E}_{a_t}\left[w(x-q_t,t+1)-w(x,t+1)+p_t\cdot q_t\right]+w(x,t+1)-w(x,t)
\end{aligned}
\tag{12}
$$

where we applied the linearity along in the first equality.

Since is with Lipschitz continuous second order derivatives, we again use Taylor’s theorem with the integral remainder

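$$
\begin{aligned}
w(x-q_t,t+1)={}&w(x,t+1)-\nabla w(x,t+1)\cdot q_t+\frac{1}{2}\langle D^2w(x,t+1)\cdot q_t,q_t\rangle\\
&-\int_0^1 D^3w(x-\mu q_t,t+1)[q_t,q_t,q_t]\,\frac{(1-\mu)^2}{2}\,d\mu
\end{aligned}
\tag{13}
$$

Thus

$$
\begin{aligned}
w(x-q_t,t+1)-w(x,t+1)+p_t\cdot q_t\le{}&\frac{1}{2}\langle D^2w(x,t+1)\cdot q_t,q_t\rangle\\
&-\frac{1}{6}\operatorname*{ess\,inf}_{y\in[x,x-q_t]}D^3w(y,t+1)[q_t,q_t,q_t]
\end{aligned}
$$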

We eliminated the dependence on using the fact that , which gives the cancellation between in (12) with (13).

Similarly for