
# Excess Risk Bounds for Exponentially Concave Losses

Mehrdad Mahdavi
Michigan State University
mahdavim@cse.msu.edu
Rong Jin
Michigan State University
rongjin@cse.msu.edu
###### Abstract

The overarching goal of this paper is to derive excess risk bounds for learning from exp-concave loss functions in passive and sequential learning settings. Exp-concave loss functions encompass several fundamental problems in machine learning, such as the squared loss in linear regression, the logistic loss in classification, and the negative logarithm loss in portfolio management. In the batch setting, we obtain sharp bounds on the performance of empirical risk minimization performed in a linear hypothesis space with respect to exp-concave loss functions. We also extend the results to the online setting, where the learner receives the training examples in a sequential manner. We propose an online learning algorithm that is a properly modified version of the online Newton method to obtain sharp risk bounds. Under an additional mild assumption on the loss function, we show that in both settings we are able to achieve an excess risk bound of Õ(d log n/n) that holds with high probability.

## 1 Introduction

We investigate excess risk bounds for learning a linear classifier using an exponentially concave (abbr. exp-concave) loss function (see, e.g., [1] and [2]). More specifically, let {(x₁, y₁), …, (xₙ, yₙ)} be a set of i.i.d. training examples sampled from an unknown distribution P over the instance space X × Y, where x ∈ X with ∥x∥ ≤ 1, and y ∈ {−1, +1} in classification and y ∈ ℝ in regression problems, respectively. Let W = {w ∈ ℝᵈ : ∥w∥ ≤ R} be our domain of linear classifiers with bounded norm, where R determines the size of the domain. We aim at finding a learner, with the help of the training samples, that generalizes well on unseen instances.

Let ℓ(·) be the convex surrogate loss function used to measure the classification error. In this work, we are interested in learning problems where the loss function ℓ is a one-dimensional exponentially concave function with constant β > 0 (i.e., exp(−βℓ(z)) is concave in z). Examples of such loss functions are the squared loss used in regression, the logistic loss used in classification, and the negative logarithm loss used in portfolio management [3, 1, 4, 5]. As in most analyses of generalization performance, we assume ℓ to be Lipschitz continuous with constant G, i.e., |ℓ(z) − ℓ(z′)| ≤ G|z − z′|. Define L(w) to be the expected loss of an arbitrary classifier w, i.e.

 L(w)=E(x,y)∼P[ℓ(yw⊤x)].
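Since exp-concavity drives the whole analysis, a quick numerical sanity check can make the definition concrete. The sketch below is an illustration, not part of the paper's analysis: the bound B and the constant β = exp(−B) are assumptions that happen to be admissible for the logistic loss, for which exp-concavity on |z| ≤ B is equivalent to the curvature condition ℓ″(z) ≥ β[ℓ′(z)]².

```python
import numpy as np

# Illustrative check: the logistic loss l(z) = log(1 + exp(-z)) is
# beta-exp-concave on |z| <= B, i.e. exp(-beta*l(z)) is concave there,
# which is equivalent to l''(z) >= beta * l'(z)^2. For this loss the
# condition holds with beta = exp(-B) (an assumed admissible constant).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

B = 1.0                        # assumed bound on |y * w^T x|
beta = np.exp(-B)              # exp-concavity constant on [-B, B]

z = np.linspace(-B, B, 2001)
l1 = -sigmoid(-z)              # l'(z)
l2 = sigmoid(z) * sigmoid(-z)  # l''(z)

ok = np.all(l2 - beta * l1**2 >= -1e-12)
print(ok)  # True: logistic loss is exp-concave with beta = exp(-B)
```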

Let w∗ be the optimal solution that minimizes L(w) over the domain W, i.e., w∗ = arg min_{w∈W} L(w). We note that the exp-concavity of the individual loss functions also implies the exp-concavity of the expected loss function (a straightforward proof can be found in [3, Lemma 1]). Our goal is to efficiently learn, with the help of the training set, a classifier ŵ with small excess risk, defined by:

 L(ŵ)−L(w∗).

While the main focus of statistical learning theory has been on understanding learnability and sample complexity by investigating the complexity of the hypothesis class in terms of known combinatorial measures, recent advances in online learning and optimization theory have opened a new trend of understanding the generalization ability of learning algorithms in terms of the characteristics of the loss functions used in convex learning problems. In particular, a large number of results have focused on strong convexity of the loss function (a stronger condition than exp-concavity) and obtained better generalization bounds, referred to as fast rates [6, 7]. Regarding smoothness of the loss function, a recent result [8] has shown that under a smoothness assumption it is possible to obtain optimistic rates (in the sense that smooth losses yield better generalization bounds when the problem is easier and the expected loss of the optimal classifier is small), which are more appealing than those for general Lipschitz continuous losses. This work extends these results to exp-concave loss functions and investigates how to obtain sharper excess risk bounds for learning from such functions. We note that although the online Newton method [1] yields an O(d log n) regret bound, it is only able to achieve an O(d log n/n) bound for the excess risk in expectation. In contrast, the excess risk bounds analyzed in this work all hold with high probability.

We consider two settings for learning a classifier from the provided training set. In the statistical setting (also called batch learning) [9], we assume that the learner has access to all training examples in advance, while in the online setting the examples become available to the learner one at a time. We show that, with an additional assumption on the exp-concave loss function, we are able to achieve an excess risk bound of Õ(d log n/n), which is significantly faster than the O(1/√n) rate for general convex Lipschitz loss functions. The proof for the batch setting utilizes the notion of local Rademacher complexities and involves novel ingredients tailored to exp-concave functions in order to obtain sharp convergence rates. In the online setting, the results follow from the Bernstein inequality for martingales and a peeling process. We note that fast rates are possible and well known in sequential prediction via the notion of mixable losses [10], and in the batch setting under Tsybakov's margin condition [11]; the relation between these two settings has recently been investigated via the notion of stochastic mixability [12]. However, our analysis and conditions are different and focus only on the exp-concavity property of the loss to derive an Õ(d log n/n) risk bound.

## 2 The Algorithms

We study two algorithms for learning with exp-concave loss functions. The first algorithm, devised for the batch setting, is simply empirical risk minimization. More specifically, it learns a classifier ŵ from the space W of linear classifiers by solving the following optimization problem

$$\min_{w\in W}\ \frac{1}{n}\sum_{i=1}^{n}\ell(y_i w^{\top}x_i). \qquad (1)$$

The optimal solution to (1) is denoted by ŵ. Here, we are not concerned with the optimization procedure used to find ŵ and only investigate the excess risk of the obtained classifier with respect to the optimal classifier w∗.
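For concreteness, the ERM problem (1) can be solved by any convex solver; the following sketch (the synthetic data, step size, and iteration count are illustrative assumptions, not the paper's prescription) minimizes the empirical logistic loss over the ball ∥w∥ ≤ R by projected gradient descent:

```python
import numpy as np

# A minimal sketch of the ERM problem (1): minimize the empirical logistic
# loss over {w : ||w|| <= R} by projected gradient descent. Data and
# hyperparameters below are assumptions for illustration.

rng = np.random.default_rng(0)
n, d, R = 200, 5, 1.0
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x_i|| <= 1
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

def emp_risk(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

w = np.zeros(d)
for _ in range(500):
    margins = y * (X @ w)
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n  # logistic ERM gradient
    w -= 0.5 * grad
    norm = np.linalg.norm(w)
    if norm > R:                                       # project onto ||w|| <= R
        w *= R / norm
print(emp_risk(w) < emp_risk(np.zeros(d)))  # True: risk decreased from w = 0
```

The projection step keeps the iterates inside the domain W; any other convex optimization routine for (1) would serve equally well for the analysis.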

Our second algorithm is a modified online Newton method [1]. Algorithm 1 gives the detailed steps. The key difference between Algorithm 1 and the online Newton algorithm [1] is that, at each iteration, it estimates a smoothed version of the covariance matrix using the training examples received in the past, whereas the online Newton method uses the gradients when updating its second-order matrix. It is this difference that allows us to derive an excess risk bound for the learned classifier. The classifier learned by the online algorithm is simply the average of the solutions obtained over all iterations. We also note that the idea of using an estimated covariance matrix for online learning and optimization has been examined by several studies [13, 14, 15, 16]. It is also closely related to the technique of time-varying potentials discussed in [17, 2] for regression. Unlike these studies, which are mostly focused on obtaining regret bounds, we aim to study the excess risk bound for the learned classifier.
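Algorithm 1 itself is not reproduced in this excerpt, but its main ingredients (a smoothed covariance matrix built from past examples rather than gradients, a preconditioned gradient step, projection onto the domain, and averaging of the iterates) can be sketched as follows. The data, the step size eta, and the identity initialization of M are assumptions for illustration, not the algorithm's exact constants:

```python
import numpy as np

# Hedged sketch in the spirit of Algorithm 1: the second-order matrix M_i is
# built from the examples x_1,...,x_i (not from gradient outer products), the
# step is preconditioned by M_i^{-1}, and the average iterate is returned.

rng = np.random.default_rng(1)
n, d, R, eta = 300, 4, 1.0, 1.0
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
y = np.sign(X @ np.ones(d) + 0.1 * rng.normal(size=n))

M = np.eye(d)                 # smoothed covariance estimate, M_0 = I (assumed)
w = np.zeros(d)
w_sum = np.zeros(d)
for x, yi in zip(X, y):
    M += np.outer(x, x)       # covariance of past examples, not of gradients
    g = -yi * x / (1.0 + np.exp(yi * (w @ x)))   # logistic gradient at (x, yi)
    w = w - eta * np.linalg.solve(M, g)          # preconditioned step
    norm = np.linalg.norm(w)
    if norm > R:
        w *= R / norm                            # project onto ||w|| <= R
    w_sum += w
w_avg = w_sum / n             # averaged solution returned by the algorithm
err = np.mean(np.sign(X @ w_avg) != y)
print(err < 0.5)
```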

## 3 Main Results

We first state the result for the batch learning problem in (1), and then the result for the online learning algorithm detailed in Algorithm 1. In order to achieve an excess risk bound better than O(1/√n), we introduce the following key assumption for the analysis of the empirical risk minimization problem in (1):

$$\textbf{Assumption I:}\ \text{there exists a constant }\theta>0\ \text{s.t.}\ \mathrm{E}\big[[\ell'(y\,w_*^{\top}x)]^{2}\,xx^{\top}\big]\succeq\theta\,\mathrm{E}[xx^{\top}].$$

For the online learning method in Algorithm 1, we strengthen Assumption (I) as:

$$\textbf{Assumption II:}\ \text{there exists a constant }\theta>0\ \text{s.t.}\ \mathrm{E}\big[[\ell'(y\,w^{\top}x)]^{2}\,xx^{\top}\big]\succeq\theta\,\mathrm{E}[xx^{\top}],\quad\forall w\in W.$$

Note that unlike Assumption (I), which only requires the property to hold with respect to the optimal solution w∗, Assumption (II) requires the property to hold for any w ∈ W, making it a stronger assumption than Assumption (I). We also note that Assumption (II) is closely related to the strong convexity assumption. In particular, it is easy to verify that when E[xx⊤] is strictly positive definite, the expected loss L(w) will be strongly convex in w by using the property of exponentially concave functions.
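Assumption (I) can also be checked empirically on data: the largest admissible θ is the smallest generalized eigenvalue of the pair (E[[ℓ′(yw∗⊤x)]²xx⊤], E[xx⊤]). A minimal sketch, in which the data distribution and the choice w∗ = 0 are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import eigh

# Empirical check of Assumption (I): estimate A = E[l'(y w*^T x)^2 x x^T] and
# H = E[x x^T] from samples; the largest theta with A >= theta * H is the
# smallest generalized eigenvalue of (A, H).

rng = np.random.default_rng(2)
n, d = 5000, 3
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
y = rng.choice([-1.0, 1.0], size=n)
w_star = np.zeros(d)                          # assumed optimum for the sketch

def logistic_grad(z):                         # l'(z) for l(z) = log(1 + exp(-z))
    return -1.0 / (1.0 + np.exp(z))

s = logistic_grad(y * (X @ w_star)) ** 2      # [l'(y w*^T x)]^2, = 1/4 at w* = 0
A = (X * s[:, None]).T @ X / n
H = X.T @ X / n
theta = eigh(A, H, eigvals_only=True)[0]      # smallest generalized eigenvalue
print(theta > 0)  # True: Assumption (I) holds empirically with this theta
```

Here, since s is identically 1/4 at w∗ = 0, the computed θ equals 1/4 exactly; with a nontrivial w∗ the same recipe gives a data-dependent θ.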

The following lemma shows a general scenario when both Assumptions (I) and (II) hold.

###### Lemma 1.

Suppose (i) Pr(y = 1 | x) ≥ q and Pr(y = −1 | x) ≥ q for any x, where q > 0, and (ii) ℓ′(0) ≠ 0. Then Assumptions (I) and (II) hold with θ = q[ℓ′(0)]².

###### Proof.

We first bound E[[ℓ′(yw⊤x)]² | x] for any given x by

$$\mathrm{E}\big[[\ell'(y w^{\top}x)]^{2}\,\big|\,x\big]\ge q\big([\ell'(w^{\top}x)]^{2}+[\ell'(-w^{\top}x)]^{2}\big)$$

Since ℓ′ is a monotonically non-decreasing function (by convexity of ℓ), we have

$$|\ell'(0)|\le\max\big(|\ell'(w^{\top}x)|,\ |\ell'(-w^{\top}x)|\big)$$

and therefore

$$\mathrm{E}\big[[\ell'(y w^{\top}x)]^{2}\,\big|\,x\big]\ge q\,[\ell'(0)]^{2}$$

implying that Assumptions (I) and (II) hold with θ = q[ℓ′(0)]², as desired. ∎

We note that ℓ′(0) < 0 is the necessary and sufficient condition for a convex surrogate loss function ℓ for the 0-1 loss to be classification-calibrated [18], and it is therefore almost unavoidable if our final goal is to minimize the binary classification error.

The excess risk bound for the batch learning algorithm is given in the following theorem.

###### Theorem 1.

Suppose Assumption (I) holds. Let ŵ be the solution to the convex optimization problem in (1). Define

$$\gamma=\max\!\left(1,\ \frac{G}{2\theta},\ \frac{G\alpha}{\theta R}\right),\qquad \rho_0=\max\!\left(32\gamma,\ \sqrt{\gamma\left(28+\frac{3}{GR}\right)}\right).$$

Then, with a probability 1 − 2me⁻ᵗ, where m is the number of segments used in the proof, we have

$$L(\hat{w})-L(w_*)\le \big(GR[32\rho_0+28]+3\big)\,\frac{t+2+d\log n}{n}=\tilde{O}\!\left(\frac{d\log n}{n}\right).$$

The following theorem provides the excess risk bound for Algorithm 1, where the training examples are received in an online fashion and the final solution is reported as the average of all the intermediate solutions.

###### Theorem 2.

Suppose Assumption (II) holds. Let ŵ be the average solution returned by Algorithm 1, with the step size η₁ and the parameter γ set as in the analysis below. Then, with a probability 1 − 2me⁻ᵗ, we have

$$L(\hat{w})-L(w_*)\le \frac{\rho_0 GRd}{n}\log\!\left(1+\frac{4n\gamma^{2}}{d^{2}}\right)+\frac{3GR(2\gamma+1)\,t}{n}=\tilde{O}\!\left(\frac{d\log n}{n}\right).$$
###### Remark 1.

As indicated in Theorems 1 and 2, the excess risk for both batch learning and online learning is reduced at the rate of Õ(d log n/n), which is consistent with the O(d log n) regret bound for online optimization of exponentially concave loss functions [1]. We note that the linear dependence on d is in general unavoidable. This is because when E[xx⊤] is strictly positive definite, the function L(w) will be strongly convex with modulus proportional to the minimum eigenvalue of E[xx⊤]; we would then expect a linear dependence on d based on the minimax convergence rate of stochastic optimization for strongly convex functions. Finally, we note that for strongly convex loss functions, it is known that an O(1/n) excess risk bound can be achieved without the log n factor. It is, however, unclear whether the log n factor can be removed from the excess risk bounds for exponentially concave functions, a question to be investigated in the future.

Comparing the result for online learning with that for batch learning, we observe that, although both achieve similar excess risk bounds, the batch learning algorithm is advantageous in two respects. First, the batch learning algorithm makes a weaker assumption about the data (i.e., Assumption (I) vs. Assumption (II)). Second, the batch learning algorithm does not need to know the parameters θ and β in advance, which the online learning method requires in order to determine the step size η₁.

## 4 Analysis

We now turn to the proofs of our main results. The main steps in each proof are provided in the main text, with some of the more technical results deferred to the appendix.

### 4.1 Proof of Theorem 1

Our analysis for the batch setting is based on Talagrand's inequality, in particular its variant (the Klein-Rio bound) with improved constants derived in [19] (see also [20, Chapter 2]). To do so, we define

$$\|P_n-P\|_{W}=\sup_{w\in W}\left|\frac{1}{n}\sum_{i=1}^{n}\big[\ell(y_i w^{\top}x_i)-\ell(y_i w_*^{\top}x_i)\big]-\mathrm{E}_{(x,y)}\big[\ell(y w^{\top}x)-\ell(y w_*^{\top}x)\big]\right|$$

and

$$U(W)=\max_{w\in W,\,\|x\|\le 1}\ \ell(y w^{\top}x)-\ell(y w_*^{\top}x),\qquad \sigma_P^{2}(W)=\sup_{w\in W}\mathrm{E}_{(x,y)}\big[(\ell(y w^{\top}x)-\ell(y w_*^{\top}x))^{2}\big].$$

The analysis is rooted in the following concentration inequality:

###### Theorem 3.

We have

$$\Pr\left\{\|P_n-P\|_{W}\ge 2\,\mathrm{E}\|P_n-P\|_{W}+\sigma_P(W)\sqrt{\frac{2t}{n}}+\frac{(U(W)+3)t}{3n}\right\}\le e^{-t}.$$

The following property of exponentially concave loss functions, from [1], will be used throughout the paper.

###### Theorem 4.

If a function f defined over W is such that exp(−βf(w)) is concave and f has a gradient bounded by G, then there exists a constant β > 0, depending on the exp-concavity constant, G, and the diameter of W, for which the following holds

$$f(w)\ge f(w')+(w-w')^{\top}\nabla f(w')+\frac{\beta}{2}\big[\nabla f(w')^{\top}(w-w')\big]^{2},\qquad\forall w,w'\in W.$$
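As a numeric spot-check of this lower bound (illustrative, not from the paper), the squared loss f(w) = (w⊤x − y)² satisfies the inequality with β = 1/(2(R+1)²) when ∥w∥ ≤ R, ∥x∥ ≤ 1, and |y| ≤ 1; the constant and the data below are assumptions:

```python
import numpy as np

# Spot-check of the exp-concave second-order lower bound for the squared loss
# f(w) = (w^T x - y)^2:
#   f(w) >= f(w') + (w - w')^T grad_f(w') + (beta/2)[grad_f(w')^T (w - w')]^2
# with beta = 1 / (2 (R + 1)^2), admissible since |w^T x - y| <= R + 1.

rng = np.random.default_rng(3)
d, R = 4, 1.0
x = rng.normal(size=d); x /= max(1.0, np.linalg.norm(x))
y = 0.5
beta = 1.0 / (2.0 * (R + 1.0) ** 2)

def f(w):
    return (w @ x - y) ** 2

def grad(w):
    return 2.0 * (w @ x - y) * x

ok = True
for _ in range(1000):
    w = rng.normal(size=d);  w *= R / max(R, np.linalg.norm(w))
    wp = rng.normal(size=d); wp *= R / max(R, np.linalg.norm(wp))
    g = grad(wp)
    rhs = f(wp) + g @ (w - wp) + 0.5 * beta * (g @ (w - wp)) ** 2
    ok &= f(w) >= rhs - 1e-12
print(ok)  # True: the lower bound holds at all sampled pairs
```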

The key quantity for our analysis is the following random variable:

$$\rho(w)=\frac{1}{2R}\sqrt{\mathrm{E}\big[|x^{\top}(w-w_*)|^{2}\big]}.$$

Evidently, ρ(w) ≤ 1 for any w ∈ W, since ∥w − w∗∥ ≤ 2R and ∥x∥ ≤ 1. The following lemma deals with the concentration of ρ(w).

###### Lemma 2.

Define Δ = {w ∈ W : ρ(w) ≤ ρ}. Then, with a probability 1 − e⁻ᵗ, we have,

$$\frac{1}{n}\sup_{w\in\Delta}\sum_{i=1}^{n}\big[x_i^{\top}(w-w_*)\big]^{2}\le 10R^{2}\left(\rho^{2}+\frac{t+1+d\log n}{n}\right)$$
###### Proof.

Fix a w ∈ Δ. Using the standard Bernstein inequality [21], we have, with a probability 1 − e⁻ᵗ,

$$\left|\frac{1}{n}\sum_{i=1}^{n}\big[x_i^{\top}(w-w_*)\big]^{2}-\mathrm{E}\big[((w-w_*)^{\top}x)^{2}\big]\right|\le \frac{16R^{2}t}{3n}+2R\sqrt{\frac{2\,\mathrm{E}[((w-w_*)^{\top}x)^{2}]\,t}{n}}$$

By the definition of the domain Δ, i.e., E[((w−w∗)⊤x)²] ≤ 4R²ρ², and the above concentration result, we obtain:

$$\frac{1}{n}\sum_{i=1}^{n}\big[x_i^{\top}(w-w_*)\big]^{2}\le 4R^{2}\left(\rho^{2}+\frac{4t}{3n}+\rho\sqrt{\frac{2t}{n}}\right)\le 10R^{2}\left(\rho^{2}+\frac{t}{n}\right)$$
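The Bernstein step above can be sanity-checked by simulation: over many draws of n samples, the empirical average of [xᵢ⊤(w − w∗)]² should exceed the stated deviation bound only about an e⁻ᵗ fraction of the time. The data distribution below is an assumption for illustration:

```python
import numpy as np

# Monte Carlo check of the Bernstein-type deviation bound for a fixed w:
# |empirical mean - expectation| <= 16 R^2 t/(3n) + 2R sqrt(2 E[.] t / n)
# should fail with probability about e^{-t}.

rng = np.random.default_rng(4)
d, R, n, t, trials = 3, 1.0, 200, 3.0, 2000
delta = rng.normal(size=d)
delta *= 2 * R / np.linalg.norm(delta)       # w - w*, norm at most 2R

def sample_sq(m):
    X = rng.normal(size=(m, d))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
    return (X @ delta) ** 2

expect = sample_sq(200000).mean()            # accurate estimate of E[(x^T delta)^2]
bound = 16 * R**2 * t / (3 * n) + 2 * R * np.sqrt(2 * expect * t / n)
viol = 0
for _ in range(trials):
    viol += abs(sample_sq(n).mean() - expect) > bound
print(viol / trials <= np.exp(-t))  # empirically within the e^{-t} failure rate
```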

Next, we consider a discrete version of the space Δ. Let Δ̃ be a proper (1/n)-net of Δ. Since Δ ⊆ {w : ∥w∥ ≤ R}, we have log|Δ̃| ≤ d log(3Rn) = O(d log n).

Using the union bound over Δ̃, we have, with a probability 1 − e⁻ᵗ, for any w ∈ Δ̃,

$$\frac{1}{n}\sum_{i=1}^{n}\big[x_i^{\top}(w-w_*)\big]^{2}\le 10R^{2}\left(\rho^{2}+\frac{t+d\log n}{n}\right)$$

Since for any w ∈ Δ there exists w′ ∈ Δ̃ such that ∥w − w′∥ ≤ 1/n, we have, with a probability 1 − e⁻ᵗ, for any w ∈ Δ,

$$\frac{1}{n}\sum_{i=1}^{n}\big[x_i^{\top}(w-w_*)\big]^{2}\le 10R^{2}\left(\rho^{2}+\frac{t+1+d\log n}{n}\right),$$

as desired. ∎

Define ρ̂ = ρ(ŵ). The next theorem allows us to bound the excess risk using the random variable ρ̂.

###### Theorem 5.

With a probability 1 − 2me⁻ᵗ, where m is the number of segments defined in the proof, we have

$$L(\hat{w})-L(w_*)\le GR\left(26\left[\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\frac{\tilde{t}}{n}\right]+6\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\frac{2\tilde{t}}{n}\right)+\frac{3\tilde{t}}{n}$$

where ˜t = t + 1 + d log n.

Taking this statement as given for the moment, we proceed with the proof of Theorem 1, returning later to establish the claim stated in Theorem 5. Our overall strategy for proving Theorem 1 is to first bound ρ̂ by using the property of exp-concave functions and the result of Theorem 5, and then to bound the excess risk. More specifically, using the result of Theorem 5, we have, with high probability,

$$L(\hat{w})-L(w_*)\le GR\left(26\left[\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\frac{\tilde{t}}{n}\right]+6\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\frac{2\tilde{t}}{n}\right)+\frac{3\tilde{t}}{n}\qquad(3)$$

Using the property of exp-concave loss functions stated in Theorem 4, we have

$$L(\hat{w})-L(w_*)\ \ge\ (\hat{w}-w_*)^{\top}\nabla L(w_*)+\frac{\beta}{2}\,\mathrm{E}\Big[\big[\ell'(y\,w_*^{\top}x)(\hat{w}-w_*)^{\top}x\big]^{2}\Big]\ \ge\ \frac{\beta}{2}\,\mathrm{E}\Big[\big[\ell'(y\,w_*^{\top}x)(\hat{w}-w_*)^{\top}x\big]^{2}\Big]$$

where the second step follows from the fact that w∗ minimizes L(w) over the domain W, so that (ŵ − w∗)⊤∇L(w∗) ≥ 0. We then use Assumption (I) to get

$$\mathrm{E}\Big[\big[\ell'(y\,w_*^{\top}x)(\hat{w}-w_*)^{\top}x\big]^{2}\Big]\ \ge\ \theta\,\mathrm{E}\big[((\hat{w}-w_*)^{\top}x)^{2}\big]\ =\ 4\theta R^{2}\hat{\rho}^{2},$$

and therefore

$$L(\hat{w})-L(w_*)\ \ge\ 2\beta\theta R^{2}\hat{\rho}^{2}.\qquad(4)$$

Combining the bounds in (3) and (4), we have, with high probability,

$$\hat{\rho}^{2}\le \frac{G}{2\beta\theta R}\left(32\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+28\frac{\tilde{t}}{n}\right)+\frac{3\tilde{t}}{2\beta\theta R^{2}n}$$

implying that

$$\hat{\rho}\le\max\!\left(\frac{32G}{\beta\theta R},\ \sqrt{\frac{G}{\beta\theta R}\left(28+\frac{3}{GR}\right)}\right)\sqrt{\frac{\tilde{t}}{n}}.$$
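The passage from a quadratic inequality in ρ̂ to a linear bound of this form uses the following elementary fact, spelled out here for completeness: for b, c ≥ 0,

```latex
\rho^{2}\le b\rho+c
\;\Longrightarrow\;
\rho\le\frac{b+\sqrt{b^{2}+4c}}{2}
\le b+\sqrt{c}
\le\max\bigl(2b,\;2\sqrt{c}\bigr),
```

applied with b proportional to √(˜t/n) and c proportional to ˜t/n, so that both terms scale as √(˜t/n).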

We derive the final bound for ρ̂ by plugging in γ and ρ₀. The proof of the excess risk bound is completed by substituting this bound for ρ̂ into (3).

We now turn to proving the result stated in Theorem 5.

###### Proof of Theorem 5.

Our analysis is based on the technique of local Rademacher complexity [22, 23, 20]. The notion of local Rademacher complexity works by considering Rademacher averages of smaller subsets of the hypothesis set. It generally leads to sharper learning bounds which, under certain general conditions, guarantee a faster convergence rate. Define ρ₀ = 1/n. We divide the range [0, 1] of ρ(ŵ) into m segments [0, ρ₀], (ρ₀, 2ρ₀], …, (2^{m−1}ρ₀, 2^{m}ρ₀], where m = ⌈log₂(1/ρ₀)⌉ = ⌈log₂ n⌉ and ρₖ = 2^{k−1}ρ₀. Let k be the index of the segment containing ρ(ŵ). Note that k is a random variable depending on the sampled training examples.

As the first step, we assume that ρ(ŵ) ∈ (ρₖ, ρₖ₊₁] for some fixed k. Define the domain Δ as

 Δ={w∈W:ρ(w)≤ρk+1}

Using the Talagrand inequality (Theorem 3), with a probability at least 1 − e⁻ᵗ, we have

$$L(\hat{w})-L(w_*)\le 2\,\mathrm{E}\|P_n-P\|_{\Delta}+\sigma_P(\Delta)\sqrt{\frac{2t}{n}}+\frac{(U(\Delta)+3)t}{n}\qquad(5)$$

We now bound each term on the right-hand side of (5). First, we bound E∥Pn − P∥Δ as

$$\mathrm{E}\|P_n-P\|_{\Delta}\ \le\ \frac{2}{n}\,\mathrm{E}\left[\sup_{w\in\Delta}\sum_{i=1}^{n}\sigma_i\big(\ell(y_i w^{\top}x_i)-\ell(y_i w_*^{\top}x_i)\big)\right]\ \le\ \frac{4G}{n}\,\mathrm{E}\left[\sup_{w\in\Delta}\sum_{i=1}^{n}y_i\sigma_i x_i^{\top}(w-w_*)\right]$$

where σ₁, …, σₙ are Rademacher random variables and the second step utilizes the contraction property of the Rademacher complexity.

To bound E∥Pn − P∥Δ further, we need to bound sup_{w∈Δ} Σᵢ₌₁ⁿ [xᵢ⊤(w − w∗)]². Using Lemma 2, we have, with a probability 1 − e⁻ᵗ,

$$\mathrm{E}\|P_n-P\|_{\Delta}\ \le\ \frac{4G}{n}\sqrt{\sup_{w\in\Delta}\sum_{i=1}^{n}\big[x_i^{\top}(w-w_*)\big]^{2}}\ \le\ \frac{13GR}{\sqrt{n}}\sqrt{\rho_{k+1}^{2}+\frac{t+1+d\log n}{n}}\ \le\ 13GR\left(\sqrt{\frac{t+1+d\log n}{n}}+\frac{\rho_{k+1}}{\sqrt{n}}\right)$$

Next, we bound σ²_P(Δ) and U(Δ), i.e.

$$\sigma_P^{2}(\Delta)\ \le\ \sup_{w\in\Delta}\mathrm{E}\big[(\ell(y w^{\top}x)-\ell(y w_*^{\top}x))^{2}\big]\ \le\ \sup_{w\in\Delta}G^{2}\,\mathrm{E}\big[((w-w_*)^{\top}x)^{2}\big]\ =\ 4R^{2}G^{2}\rho_{k+1}^{2}$$

and U(Δ) ≤ 2GR. Putting the above results together, under the assumption ρ(ŵ) ∈ (ρₖ, ρₖ₊₁], we have, with a probability 1 − 2e⁻ᵗ,

$$L(\hat{w})-L(w_*)\le 26GR\left[\frac{\rho_{k+1}}{\sqrt{n}}+\sqrt{\frac{t+1+d\log n}{n}}\right]+2GR\rho_{k+1}\sqrt{\frac{2t}{n}}+\frac{(2GR+3)t}{n}\qquad(6)$$

Define ˜t = t + 1 + d log n. Using the fact that ρₖ₊₁ ≤ 2ρ(ŵ) = 2ρ̂, we can rewrite the bound in (6) as

$$L(\hat{w})-L(w_*)\le 26GR\left[\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\sqrt{\frac{\tilde{t}}{n}}\right]+6GR\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\frac{(2GR+3)\tilde{t}}{n}$$

By taking the union bound over all the segments, with probability 1 − 2me⁻ᵗ, for any k, we have

$$L(\hat{w})-L(w_*)\le 26GR\left[\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\sqrt{\frac{\tilde{t}}{n}}\right]+6GR\hat{\rho}\sqrt{\frac{\tilde{t}}{n}}+\frac{(2GR+3)\tilde{t}}{n}\qquad(7)$$

Finally, when ρ(ŵ) ≤ ρ₀ = 1/n, by the Lipschitz continuity of ℓ we obtain

$$L(\hat{w})-L(w_*)\le 2R\rho_0 G\le\frac{2RG}{n}\qquad(8)$$

We complete the proof by combining the bounds in (7) and (8). ∎

### 4.2 Proof of Theorem 2

We now turn to proving the main result on the excess risk for the online setting. Define the covariance matrix H = E[xx⊤]. The following lemma bounds L(wᵢ) − L(w∗) by exploiting the property of exponentially concave functions (i.e., Theorem 4) and Assumption (II). Define δᵢ as

$$\delta_i=\nabla L(w_i)-\ell'(y_i x_i^{\top}w_i)\,x_i.\qquad(9)$$
###### Lemma 3.

Suppose Assumption (II) holds. We have

$$\begin{aligned}&L(w_i)-L(w_*)+\frac{\theta\beta}{3}\|w_i-w_*\|_{H}^{2}\\ &\le \frac{\|w_i-w_*\|_{M_{i-1}}^{2}}{2\eta_1}-\frac{\|w_{i+1}-w_*\|_{M_i}^{2}}{2\eta_1}+\frac{\eta_1 G^{2}}{2}\,x_i^{\top}M_i^{-1}x_i+(w_i-w_*)^{\top}\delta_i\\ &\quad+\frac{\theta\beta}{6}\Big(\big[(w_i-w_*)^{\top}x_i\big]^{2}-\|w_i-w_*\|_{H}^{2}\Big),\end{aligned}\qquad(10)$$

where ∥v∥²_M = v⊤Mv for a positive definite matrix M.

###### Lemma 4.

We have

$$x_i^{\top}M_i^{-1}x_i\le\ln\det(M_i)-\ln\det(M_{i-1})$$
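Lemma 4 is the standard log-determinant telescoping bound for rank-one updates Mᵢ = Mᵢ₋₁ + xᵢxᵢ⊤; it can be verified numerically (the random data below are an illustration, assuming a positive definite M₀):

```python
import numpy as np

# Numeric check of the log-det telescoping bound: for M_i = M_{i-1} + x_i x_i^T,
#   x_i^T M_i^{-1} x_i <= ln det(M_i) - ln det(M_{i-1}).
# It follows from 1 - 1/u <= ln u with u = det(M_i)/det(M_{i-1}).

rng = np.random.default_rng(5)
d = 4
M_prev = np.eye(d)           # assumed positive definite initialization
ok = True
for _ in range(50):
    x = rng.normal(size=d)
    M = M_prev + np.outer(x, x)
    lhs = x @ np.linalg.solve(M, x)
    rhs = np.linalg.slogdet(M)[1] - np.linalg.slogdet(M_prev)[1]
    ok &= lhs <= rhs + 1e-10
    M_prev = M
print(ok)  # True
```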

By using Lemma 4 and summing the inequalities in (10) over all iterations, we have

$$\begin{aligned}&\sum_{i=1}^{n}L(w_i)-L(w_*)+\frac{\theta\beta}{3}\sum_{i=1}^{n}\|w_i-w_*\|_{H}^{2}\\ &\le \underbrace{\frac{\|w_1-w_*\|_{M_0}^{2}}{2\eta_1}+\frac{\eta_1 G^{2}}{2}\big(\log\det(M_n)-\log\det(M_0)\big)}_{\Delta_1}+\underbrace{\sum_{i=1}^{n}(w_i-w_*)^{\top}\delta_i}_{\Delta_2}\\ &\quad+\underbrace{\frac{\theta\beta}{6}\sum_{i=1}^{n}\Big(\big[(w_i-w_*)^{\top}x_i\big]^{2}-\|w_i-w_*\|_{H}^{2}\Big)}_{\Delta_3}.\end{aligned}\qquad(11)$$

We will bound Δ₁, Δ₂, and Δ₃ separately. We start by bounding Δ₁, as indicated by the following lemma.

###### Lemma 5.

To bound Δ₂ and Δ₃, we define A = Σᵢ₌₁ⁿ ∥wᵢ − w∗∥²_H. Using the Bernstein inequality for martingales [21] and a peeling process [20], we have the following lemmas for bounding Δ₂ and Δ₃.

###### Lemma 6.

We have

$$\Pr\!\left(A\le 4R^{2}n\right)+\Pr\!\left(\Delta_2\le\left[\frac{6G^{2}}{\theta\beta}+GR\right]t+\frac{\theta\beta}{6}A\right)\ge 1-me^{-t}$$

where m is the number of segments used in the peeling process.

###### Lemma 7.

We have

$$\Pr\!\left(A\le 4R^{2}n\right)+\Pr\!\left(\Delta_3\le 8R^{2}t+A\right)\ge 1-me^{-t}$$

where m is the number of segments used in the peeling process.

First, we consider the first case and show the following bound.

###### Lemma 8.

Assume that the condition holds. We have

$$\sum_{i=1}^{n}L(w_i)-L(w_*)+\frac{\theta\beta}{2}\sum_{i=1}^{n}\|w_i-w_*\|_{H}^{2}\le 2RG.\qquad(12)$$

Second, we assume that the following two conditions hold

$$\Delta_2\le\left[\frac{6G^{2}}{\theta\beta}+GR\right]t+\frac{\theta\beta}{6}A,\qquad \Delta_3\le 8R^{2}t+A$$

Combining the above conditions with Lemma 5 and using the inequality in (11), we have

Setting the step size η₁ appropriately, we have

$$\sum_{i=1}^{n}L(w_i)-L(w_*)\le\frac{3G^{2}}{\theta\beta}\,d\log\!\left(1+\frac{4R^{2}\theta\beta^{2}n}{G^{2}d^{2}}\right)+\left[\frac{6G^{2}}{\theta\beta}+GR+\frac{4\theta\beta R^{2}}{3}\right]t.$$

We complete the proof by combining the two cases.

## 5 Conclusions and Future Work

In this work, we addressed the generalization ability of learning from exp-concave loss functions in the batch and online settings. For both cases, we showed that the excess risk can be bounded by Õ(d log n/n) when learning is performed in a linear hypothesis space of dimension d with n training examples.

One open question to be addressed in the future is whether the log n factor can be removed from the excess risk bound for exponentially concave loss functions by a more careful analysis. Another open question is to improve the dependence on d when we are after a sparse solution. According to the literature on sparse recovery [24] and optimization [25], we should be able to replace the linear dependence on d with a milder dependence (involving the sparsity level and log d) in the excess risk bound if we restrict the optimal solution to be sparse. In the future, we plan to explore the techniques of sparse recovery in analyzing the generalization performance of exp-concave functions to reduce the dependence on d.

## References

• [1] E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Mach. Learn., vol. 69, no. 2-3, pp. 169–192, 2007.
• [2] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
• [3] T. Koren, “Open problem: Fast stochastic exp-concave optimization,” in COLT, 2013.
• [4] H. B. McMahan and M. J. Streeter, “Open problem: Better bounds for online logistic regression,” in COLT, pp. 44.1–44.3, 2012.
• [5] A. Agarwal, E. Hazan, S. Kale, and R. E. Schapire, “Algorithms for portfolio management based on the newton method,” in Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pp. 9–16, 2006.
• [6] S. M. Kakade and A. Tewari, “On the generalization ability of online strongly convex programming algorithms,” in Advances in Neural Information Processing Systems, pp. 801–808, 2008.
• [7] K. Sridharan, S. Shalev-Shwartz, and N. Srebro, “Fast rates for regularized objectives,” in Advances in Neural Information Processing Systems, pp. 1545–1552, 2008.
• [8] N. Srebro, K. Sridharan, and A. Tewari, “Smoothness, low noise and fast rates,” in Advances in Neural Information Processing Systems, pp. 2199–2207, 2010.
• [9] O. Bousquet, S. Boucheron, and G. Lugosi, “Introduction to statistical learning theory,” in Advanced Lectures on Machine Learning, pp. 169–207, Springer, 2004.
• [10] V. Vovk, “A game of prediction with expert advice,” in Proceedings of the eighth annual conference on Computational learning theory, pp. 51–60, ACM, 1995.
• [11] A. B. Tsybakov, “Optimal aggregation of classifiers in statistical learning,” The Annals of Statistics, vol. 32, no. 1, pp. 135–166, 2004.
• [12] T. Van Erven, P. D. Grünwald, M. D. Reid, R. C. Williamson, et al., “Mixability in statistical learning,” in Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012.
• [13] K. Crammer and D. D. Lee, “Learning via gaussian herding,” in Advances in neural information processing systems, pp. 451–459, 2010.
• [14] K. Crammer, A. Kulesza, and M. Dredze, “Adaptive regularization of weight vectors,” Machine Learning, pp. 1–33, 2009.
• [15] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” The Journal of Machine Learning Research, vol. 11, pp. 2121–2159, 2011.
• [16] F. Orabona and K. Crammer, “New adaptive algorithms for online classification,” in Advances in Neural Information Processing Systems, 2010.
• [17] N. Cesa-Bianchi, A. Conconi, and C. Gentile, “A second-order perceptron algorithm,” in Computational Learning Theory, pp. 121–137, Springer, 2002.
• [18] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
• [19] T. Klein and E. Rio, “Concentration around the mean for maxima of empirical processes,” The Annals of Probability, vol. 33, no. 3, pp. 1060–1077, 2005.
• [20] V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.
• [21] S. Boucheron, G. Lugosi, and O. Bousquet, “Concentration inequalities,” in Advanced Lectures on Machine Learning, pp. 208–240, Springer, 2004.
• [22] P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,” The Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005.
• [23] V. Koltchinskii, “Local rademacher complexities and oracle inequalities in risk minimization,” The Annals of Statistics, vol. 34, no. 6, pp. 2593–2656, 2006.
• [24] V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.
• [25] A. Agarwal, S. Negahban, and M. J. Wainwright, “Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions,” in NIPS, pp. 1547–1555, 2012.

## Appendix A. Proof of Lemma 3

From the exp-concavity of the expected loss function, we have

$$L(w_*)\ge L(w_i)+(w_*-w_i)^{\top}\nabla L(w_i)+\frac{\beta}{2}(w_i-w_*)^{\top}\mathrm{E}\big[[\ell'(y\,w_i^{\top}x)]^{2}xx^{\top}\big](w_i-w_*).$$

Combining the above inequality with our assumption

$$\mathrm{E}\big[[\ell'(y\,w_i^{\top}x)]^{2}xx^{\top}\big]\succeq\theta\,\mathrm{E}[xx^{\top}],$$

and rearranging the terms results in the following inequality

$$L(w_i)-L(w_*)+\frac{\theta\beta}{2}\|w_i-w_*\|_{H}^{2}\le(w_i-w_*)^{\top}\nabla L(w_i).$$

Applying the fact that

$$\begin{aligned}(w_i-w_*)^{\top}\nabla L(w_i)&=\ell'(y_i w_i^{\top}x_i)(w_i-w_*)^{\top}x_i+(w_i-w_*)^{\top}\big(\nabla L(w_i)-\ell'(y_i w_i^{\top}x_i)x_i\big)\\ &\le\frac{\|w_i-w_*\|_{Z_i}^{2}}{2\eta_i}-\frac{\|w_{i+1}-w_*\|_{Z_i}^{2}}{2\eta_i}+\frac{\eta_i G^{2}}{2}x_i^{\top}Z_i^{-1}x_i+(w_i-w_*)^{\top}\delta_i\end{aligned}$$

we obtain

$$L(w_i)-L(w_*)+\frac{\theta\beta}{2}\|w_i-w_*\|_{H}^{2}\le\frac{\|w_i-w_*\|_{Z_i}^{2}}{2\eta_i}-\frac{\|w_{i+1}-w_*\|_{Z_i}^{2}}{2\eta_i}+\frac{\eta_i G^{2}}{2}x_i^{\top}Z_i^{-1}x_i+(w_i-w_*)^{\top}\delta_i.$$

Using the relation between Zᵢ, ηᵢ, and the matrices Mᵢ maintained by the algorithm,

$$\frac{\|w_i-w_*\|_{Z_i}^{2}}{2\eta_i}=\frac{\|w_i-w_*\|_{M_i}^{2}}{2\eta_1}$$

and

$$\frac{\|w_{i+1}-w_*\|_{Z_i}^{2}}{2\eta_i}=\frac{\|w_{i+1}-w_*\|_{M_i}^{2}}{2\eta_1}=\frac{\|w_{i+1}-w_*\|_{M_{i+1}}^{2}}{2\eta_1}-\frac{1}{2\eta_1}\big[(w_{i+1}-w_*)^{\top}x_{i+1}\big]^{2},$$

we get

$$\begin{aligned}&L(w_i)-L(w_*)+\frac{\theta\beta}{3}\|w_i-w_*\|_{H}^{2}\\ &\le\frac{\|w_i-w_*\|_{M_i}^{2}}{2\eta_1}-\frac{\|w_{i+1}-w_*\|_{M_{i+1}}^{2}}{2\eta_1}+\frac{\eta_1 G^{2}}{2}\,x_i^{\top}M_i^{-1}x_i+(w_i-w_*)^{\top}\delta_i\\ &\quad+\frac{\theta\beta}{3}\Big(\big[(w_i-w_*)^{\top}x_i\big]^{2}-\|w_i-w_*\|_{H}^{2}\Big).\end{aligned}$$

## Appendix B. Proof of Lemma 4

Since Mᵢ = Mᵢ₋₁ + xᵢxᵢ⊤, we have

 x⊤iM−1ix⊤i=trace(M−1i(Mi−M