A Simple Analysis for Exp-concave Empirical Minimization with Arbitrary Convex Regularizer

# A Simple Analysis for Exp-concave Empirical Minimization with Arbitrary Convex Regularizer

Tianbao Yang, Zhe Li, Lijun Zhang
Department of Computer Science, The University of Iowa
National Key Laboratory for Novel Software Technology, Nanjing University
tianbao-yang, zhe-li-1@uiowa.edu, zhanglj@lamda.nju.edu.cn
###### Abstract

In this paper, we present a simple analysis of fast rates with high probability of empirical minimization for stochastic composite optimization over a finite-dimensional bounded convex set with exponential concave loss functions and an arbitrary convex regularization. To the best of our knowledge, this result is the first of its kind. As a byproduct, we directly obtain the fast rate with high probability for exponential concave empirical risk minimization with and without any convex regularization, which not only extends existing results on empirical risk minimization but also provides a unified framework for analyzing both cases. Our proof is very simple, exploiting only the covering number of a finite-dimensional bounded set and a concentration inequality for random vectors.

## 1 Introduction

Stochastic minimization with exponential concave (exp-concave for short) loss functions has many applications in machine learning, e.g., linear regression, logistic regression, support vector machines with squared hinge loss, and portfolio optimization. There are two popular approaches to stochastic optimization. The first is Sample Average Approximation (also known as empirical risk minimization in machine learning [19]), in which a set of i.i.d. examples is drawn from the underlying distribution and an empirical risk minimization problem is solved. The second is Stochastic Approximation [8] (closely related to online optimization), which iteratively learns the model from randomly sampled examples. Compared with stochastic approximation, empirical risk minimization is deemed more general and usually achieves better performance in practice. Importantly, it is amenable to any optimization algorithm.

Fast rates of optimization with exp-concave functions in the online or stochastic setting have attracted a large body of work. In the seminal paper by Hazan et al. [5], the authors proposed the online Newton step (ONS) algorithm, reminiscent of the Newton-Raphson method for offline optimization, which achieves an $O(d\log T)$ regret bound, with $d$ being the dimension of the problem and $T$ being the total number of iterations. With the standard trick of online-to-batch conversion [1, 6], one can obtain a fast convergence rate of $O(d\log n/n)$, recently achieved in the stochastic setting [11, 10]. However, the computational cost of ONS scales badly with the dimensionality of the problem, which may prohibit its application to high-dimensional problems.

In terms of empirical risk minimization (ERM), it was not until recently that fast rates for exp-concave risk minimization were established. Koren & Levy [7] obtained the first result showing that ERM is able to attain fast generalization rates, i.e., an $O(d/n)$ expected convergence bound on the difference between the risk of the learned model and the risk of the optimal model. Strictly speaking, their guarantee is not for the solution to ERM but for the solution to a penalized ERM obtained by adding a strongly convex regularizer to the ERM objective. Gonen and Shalev-Shwartz [4] derived a similar in-expectation fast rate for supervised learning with exp-concave losses. Recently, Mehta [11] established high probability fast rates for exp-concave empirical risk minimization, which are only worse by a factor of $\log n$ than the in-expectation rate.

This paper is motivated by solving the following stochastic composite optimization problem:

$$\min_{w\in\mathcal{W}} P(w) \triangleq \mathbb{E}_{z}[f(w,z)] + R(w), \tag{1}$$

where the objective consists of a stochastic component, namely the expectation of a random function $f(w,z)$, and a deterministic component $R(w)$. In this paper, we assume: (i) $\mathcal{W}$ is a compact and convex set; (ii) $f(w,z)$ is a smooth and $\beta$-exp-concave function of $w$ for any $z$, and is Lipschitz continuous over the bounded domain $\mathcal{W}$. To keep the setting general, we impose no strong convexity, exp-concavity, or smoothness assumption on $R(w)$; only convexity is required.

We study the convergence of the empirical minimizer of (1), i.e.,

$$\widehat{w} = \arg\min_{w\in\mathcal{W}} \left[ P_n(w) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(w, z_i) + R(w) \right], \tag{2}$$

where $z_1,\ldots,z_n$ are i.i.d. samples from the underlying distribution. Our major goal here is to establish a fast convergence rate of the empirical minimizer in terms of $P(\widehat{w}) - \min_{w\in\mathcal{W}} P(w)$. This is in contrast to many previous works focusing on the convergence analysis of stochastic approximation algorithms [9] for solving (1). Notably, many efficient optimization algorithms are available for solving (2) [20, 2]. In machine learning applications, the deterministic component $R(w)$ is usually a regularizer that enforces some kind of structure on the model $w$. Many studies in machine learning and statistics have found that a regularizer that incorporates prior knowledge about the model can lead to great improvements in performance in many applications.
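To make the empirical minimizer (2) concrete, the following is a minimal sketch that approximates it by projected subgradient descent. Everything specific here is an assumption made for illustration and is not taken from the paper: the logistic loss as $f$, the regularizer $R(w)=\lambda\|w\|_1$, an $\ell_2$ ball as $\mathcal{W}$, the step size $1/\sqrt{t}$, and the synthetic data generator `make_data`.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(w, x, y):
    """Loss f(w, z) for z = (x, y) with label y in {-1, +1}."""
    return math.log1p(math.exp(-y * dot(w, x)))


def empirical_objective(w, data, lam):
    """P_n(w) = (1/n) * sum_i f(w, z_i) + R(w), with R(w) = lam * ||w||_1."""
    avg_loss = sum(logistic_loss(w, x, y) for x, y in data) / len(data)
    return avg_loss + lam * sum(abs(wj) for wj in w)


def project_l2_ball(w, radius):
    """Euclidean projection onto W = {w : ||w||_2 <= radius}."""
    norm = math.sqrt(dot(w, w))
    return list(w) if norm <= radius else [wj * radius / norm for wj in w]


def empirical_minimizer(data, lam, radius, steps=300):
    """Approximate the minimizer of P_n over W by projected subgradient descent."""
    n, d = len(data), len(data[0][0])
    w = [0.0] * d
    best_w, best_val = list(w), empirical_objective(w, data, lam)
    for t in range(1, steps + 1):
        # Gradient of the averaged logistic loss.
        g = [0.0] * d
        for x, y in data:
            coef = -y / (1.0 + math.exp(y * dot(w, x)))
            for j in range(d):
                g[j] += coef * x[j] / n
        # Add a subgradient of lam * ||w||_1 (taking 0 at w_j = 0).
        for j in range(d):
            g[j] += lam * ((w[j] > 0) - (w[j] < 0))
        step = 1.0 / math.sqrt(t)
        w = project_l2_ball([wj - step * gj for wj, gj in zip(w, g)], radius)
        val = empirical_objective(w, data, lam)
        if val < best_val:
            best_w, best_val = list(w), val
    return best_w, best_val


def make_data(n, seed=0):
    """Hypothetical synthetic data: x uniform in [-1, 1]^5, y = sign(x . w_true)."""
    rng = random.Random(seed)
    w_true = [1.0, -1.0, 0.5, 0.0, 0.0]
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        data.append((x, 1 if dot(x, w_true) >= 0 else -1))
    return data
```

Calling `empirical_minimizer(make_data(200), lam=0.01, radius=2.0)` returns the best iterate found and its empirical objective value. Any solver for (2) could be substituted; the analysis in this paper concerns the minimizer itself, not the algorithm used to compute it.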

To establish the convergence rate of the empirical minimizer (2), one may consider defining a new loss that absorbs the regularizer, i.e., $f(w,z)+R(w)$, whose expectation is $P(w)$, and then leveraging the existing theory to prove the convergence rate. However, the combined function is not necessarily an exp-concave function (see Example 2 below). Therefore, the previous fast rate analyses for exp-concave empirical risk minimization do not carry over to the considered problem. As a consequence, the standard generalization theory of ERM [19] applied to (2) can only guarantee a slow convergence rate of order $1/\sqrt{n}$, which is worse than the $1/n$-type fast rate that we aim to establish.

Contributions. The main contribution of this work is a simple analysis of a fast rate of $O\left(\frac{d\log n}{n}+\frac{d\log(1/\delta)}{n\sigma}\right)$ with high probability (see Theorem 1) for the empirical minimization (2) with $\beta$-exp-concave losses and an arbitrary convex regularizer, in terms of $P(\widehat{w})-\min_{w\in\mathcal{W}}P(w)$. Our proof is simple and elementary: it only uses the covering number of $\mathcal{W}$ and a concentration inequality for random vectors.

## 2 Comparison with Previous Works

There are extensive studies about fast rates of ERM. Due to space limits, the review below focuses on closely related work. In the three recent studies [7, 4, 11], the focus is to establish fast rates in terms of risk minimization without a regularizer, i.e.,

$$\min_{w\in\mathcal{W}} F(w) \triangleq \mathbb{E}_{z}[f(w,z)], \tag{3}$$

where $f(w,z)$ is a $\beta$-exp-concave function. As reasoned above, the fast rates in these studies do not carry over to the minimization problem (1) with an arbitrary regularizer.

Koren & Levy [7] studied the convergence of a penalized/regularized empirical risk minimizer given by

$$\widetilde{w} = \arg\min_{w\in\mathcal{W}} \left[ \widehat{F}_n(w) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(w, z_i) + \frac{1}{n} g(w) \right]. \tag{4}$$

They assumed that $g(w)$ is a strongly convex regularizer w.r.t. the Euclidean norm and is bounded over $\mathcal{W}$. Gonen and Shalev-Shwartz [4] focused on risk minimization with a generalized linear model:

$$\min_{w\in\mathcal{W}} L(w) \triangleq \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\phi_y(w^\top x)\right], \tag{5}$$

which is a special case of the general minimization problem (1). Under the assumption that $\phi_y(\cdot)$ is an exp-concave function, they provided an expected convergence rate of the empirical risk minimizer, which is of the same order as the result in [7], i.e., $O(d/n)$.

There are three differences between our work and these two works: (i) their fast rate results are with respect to $F(w)$, which does not include any regularizer; in contrast, our result is with respect to $P(w)$; (ii) the strongly convex penalization term in [7] is artificially added to the ERM to facilitate the analysis; in contrast, the arbitrary convex regularizer in this paper is built into the objective; (iii) their fast rate guarantee is in expectation, while ours holds with high probability. In light of these differences, our result is more general and much stronger. In particular, when setting $R(w)=0$ in our problem, we obtain the fast rate with high probability of the empirical risk minimizer, which is only worse by a factor of $\log n$ than the in-expectation rate in [4, 7]. Additionally, a similar high probability risk bound with respect to $F(w)$ for any regularized empirical risk minimizer (4) can be easily derived in our framework as long as $g(w)$ is convex and bounded over $\mathcal{W}$ (see Theorem 2).

A more recent work by Mehta [11] establishes a high probability fast rate for exp-concave ERM. The analysis for the empirical risk minimizer is based on the connection between exp-concavity and the stochastic mixability condition, and it exploits the heavy machinery developed in earlier work on fast rates of the empirical risk minimizer under the stochastic mixability condition [12]. However, this analysis does not apply to the regularized empirical risk minimizer (4) with an arbitrary convex regularizer. Admittedly, [11] makes the additional contribution of removing the $\log n$ factor by using boosting techniques to boost the in-expectation results of [7, 4].

We comment on the extra conditions on the loss functions $f(w,z)$. [7] requires that $f(w,z)$ is a smooth function of $w$ for any $z$ and is bounded over $\mathcal{W}$. Both [4] and [11] require $\phi_y(\cdot)$ or $f(w,z)$ to be Lipschitz continuous. Note that Lipschitz continuity implies a bounded range of the loss function over the bounded domain $\mathcal{W}$. In the present paper, we assume that $f(w,z)$ is Lipschitz continuous and smooth over $\mathcal{W}$ for all $z$. Both conditions are necessary for us to deliver a simple analysis for exp-concave empirical minimization with an arbitrary convex regularizer. We also note that for a twice differentiable, smooth and exp-concave function, Lipschitz continuity is automatically satisfied.

Next, we briefly mention several results regarding fast rates of ERM under the strong convexity condition, which is stronger than exp-concavity. Shalev-Shwartz et al. [16] established an in-expectation convergence bound of ERM over any bounded convex set for (3), which requires each individual loss function $f(w,z)$ to be a strongly convex function of $w$. However, in machine learning applications, individual loss functions are usually not strongly convex. Recently, Zhang et al. [21] developed optimistic rates of ERM over a bounded convex set for (3), where they assumed $f(w,z)$ to be non-negative and smooth. Under a strong convexity assumption, they established a fast rate with high probability and a faster rate when the optimal risk is small and the number of samples is sufficiently large. In [18], the authors considered the composite problem (1) with $f(w,z)$ having a generalized linear form and established a fast rate with high probability for strongly convex objectives.

We can also compare with stochastic approximation algorithms. Lan [9] presented an optimal method for solving (1) without the exp-concavity assumption, which employs a proximal mapping to handle $R(w)$ and has a convergence rate that depends on the noise in the stochastic gradient. In contrast, the convergence rate of the empirical minimizer shown in this work has a better dependence on $n$. One may also apply the online-to-batch conversion to a variant of ONS that employs the proximal mapping to handle $R(w)$ to obtain a fast convergence rate with high probability. Nonetheless, the resulting algorithm will be at least as expensive as ONS. Finally, it is worth mentioning that the linear dependence on the dimensionality $d$ of the convergence rate of the empirical minimizer for the minimization problem (1) is unavoidable even with smooth functions [3].

## 3 Preliminaries

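In this section, we present some preliminaries. Let $z$ denote a random variable following the underlying distribution. Denote by $\nabla f(w,z)$ the partial gradient with respect to $w$. Define

$$F(w) = \mathbb{E}_{z}[f(w,z)], \qquad P(w) = F(w) + R(w). \tag{6}$$

Let $\|\cdot\|_2$ denote the Euclidean norm of a vector. For a positive definite matrix $H$, define the $H$-norm $\|w\|_H=\sqrt{w^\top H w}$ and its dual norm $\|w\|_{H^{-1}}=\sqrt{w^\top H^{-1} w}$. By Hölder's inequality, we have $u^\top v\le\|u\|_H\|v\|_{H^{-1}}$.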

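A function $f(w)$ is $\beta$-exp-concave over the domain $\mathcal{W}$ for some $\beta>0$ if the function $\exp(-\beta f(w))$ is concave over $\mathcal{W}$. If $f(w)$ is twice differentiable and $\beta$-exp-concave, it follows that

$$\nabla^2 f(w) \succeq \beta \nabla f(w) \nabla f(w)^\top. \tag{7}$$

A function $f(w)$ is an $L$-smooth function with respect to $\|\cdot\|_2$ over $\mathcal{W}$ if, for some $L>0$, the following inequality holds for all $w,u\in\mathcal{W}$:

$$f(w) \le f(u) + \nabla f(u)^\top (w-u) + \frac{L}{2}\|w-u\|_2^2. \tag{8}$$

A function $f(w)$ is $G$-Lipschitz continuous if $\|\nabla f(w)\|_2\le G$ for all $w\in\mathcal{W}$.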

We will make the following assumptions regarding the loss function $f(w,z)$ and the regularizer $R(w)$.

###### Assumption 1.

We assume that (i) $\mathcal{W}$ is a closed and bounded convex set, i.e., there exists $R>0$ such that $\|w\|_2\le R$ for all $w\in\mathcal{W}$. (ii) $f(w,z)$ is a $G$-Lipschitz continuous, $L$-smooth and $\beta$-exp-concave function of $w$ for any $z$. (iii) $R(w)$ is a convex function.

Remark 1: Note that if $f(w,z)$ is twice differentiable, smoothness and exp-concavity naturally imply Lipschitz continuity. This can be seen from (7) by noting that $\nabla^2 f(w,z)\preceq LI$. As a result, $\beta\|\nabla f(w,z)\|_2^2\le L$, which implies that $f(w,z)$ is $\sqrt{L/\beta}$-Lipschitz continuous.

There are many machine learning problems satisfying the above assumptions. If we consider a loss function $f(w,z)$ in supervised learning, where $z=(x,y)$ denotes a random feature vector and label, then the square loss, logistic loss and squared hinge loss are exp-concave functions under appropriate boundedness conditions on the data. Let us consider the square loss as an example.

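Example 1. Suppose $\|x\|_2$ and $|y|$ are bounded, so that the residual of the square loss $f(w,z)=(w^\top x-y)^2$ is bounded over the bounded domain, i.e., $|w^\top x-y|\le c$ for all $w\in\mathcal{W}$ and some constant $c>0$ (the symbol $c$ is introduced here only for exposition). Then $\nabla f(w,z)=2(w^\top x-y)x$ and $\nabla^2 f(w,z)=2xx^\top$. It then follows that for any $w\in\mathcal{W}$ and any $\beta\le 1/(2c^2)$,

$$\beta\nabla f(w,z)\nabla f(w,z)^\top = 4\beta(w^\top x-y)^2\,xx^\top \preceq 2xx^\top = \nabla^2 f(w,z),$$

which guarantees that $f(w,z)$ is a $\beta$-exp-concave function of $w$.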
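The condition (7) for the square loss can also be probed numerically. The sketch below evaluates the quadratic form $v^\top(\nabla^2 f-\beta\nabla f\nabla f^\top)v$ at random points. The normalization $\|x\|_2\le 1$, $|y|\le 1$, the radius of $\mathcal{W}$, the resulting choice $\beta=1/(2(R+1)^2)$, and all function names are assumptions made for this illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def exp_concavity_gap(w, x, y, beta, v):
    """Quadratic form v^T (Hessian - beta * grad grad^T) v for f(w) = (w.x - y)^2."""
    residual = dot(w, x) - y
    hess_term = 2.0 * dot(v, x) ** 2                 # v^T (2 x x^T) v
    grad_term = (2.0 * residual * dot(v, x)) ** 2    # (v^T grad)^2 with grad = 2 r x
    return hess_term - beta * grad_term


def random_in_ball(rng, d, radius):
    """Rejection-sample a point of the Euclidean ball of the given radius."""
    while True:
        p = [rng.uniform(-radius, radius) for _ in range(d)]
        if dot(p, p) <= radius * radius:
            return p


def min_gap(beta, radius, d=3, trials=2000, seed=0):
    """Smallest gap over random w in W, ||x||_2 <= 1, |y| <= 1, and directions v."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        w = random_in_ball(rng, d, radius)
        x = random_in_ball(rng, d, 1.0)
        y = rng.uniform(-1.0, 1.0)
        v = random_in_ball(rng, d, 1.0)
        worst = min(worst, exp_concavity_gap(w, x, y, beta, v))
    return worst
```

With radius $R=2$ the residual is at most $3$ in absolute value, so `min_gap(1.0 / 18.0, 2.0)` should be non-negative up to rounding, whereas a larger $\beta$ can make the form negative at a point where the residual is extreme.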

Next, we give an example showing that the sum of an exp-concave function and a convex function is not necessarily an exp-concave function.

Example 2. Let , and . To see is an exp-concave function, we can show that and , then for any and

$$\beta\nabla f(w)\nabla f(w)^\top = \beta\begin{pmatrix} 4(w_1-1)^2 & 0 \\ 0 & 0 \end{pmatrix} \preceq \nabla^2 f(w).$$

To see is not an exp-concave function, we can show that and , then for any the following matrix is not positive semi-definite

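$$\nabla^2 P(w) - \beta\nabla P(w)\nabla P(w)^\top = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} - \beta\begin{pmatrix} 4(w_1-1)^2 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 2-4\beta(w_1-1)^2 & 0 \\ 0 & -\beta \end{pmatrix},$$

which would contradict (7) if $P(w)$ were an exp-concave function, since the second diagonal entry $-\beta$ is negative. As a result, $P(w)$ is not an exp-concave function.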
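The failure of exp-concavity for a sum can also be seen directly from the definition, by testing whether $\exp(-\beta P(w))$ is midpoint-concave. The sketch below takes $f(w)=(w_1-1)^2$, which matches the gradient and Hessian displayed in Example 2, and uses $R(w)=w_2$ as a hypothetical convex regularizer; that choice of $R$, the box $[-1,1]^2$ standing in for the domain, and the value $\beta=0.1$ are assumptions of this illustration.

```python
import math
import random


def f(w):
    """Exp-concave part: depends on w_1 only, as in Example 2."""
    return (w[0] - 1.0) ** 2


def composite(w):
    """f plus a hypothetical convex regularizer R(w) = w_2 chosen for this sketch."""
    return f(w) + w[1]


def midpoint_violation(func, beta, u, v):
    """Average of exp(-beta*func) at the endpoints minus its value at the midpoint.

    For a concave function this quantity is never positive.
    """
    mid = [(a + b) / 2.0 for a, b in zip(u, v)]
    h = lambda w: math.exp(-beta * func(w))
    return (h(u) + h(v)) / 2.0 - h(mid)


def worst_violation(func, beta, trials=5000, seed=0):
    """Largest midpoint violation over random pairs in the box [-1, 1]^2."""
    rng = random.Random(seed)
    worst = -float("inf")
    for _ in range(trials):
        u = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        v = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        worst = max(worst, midpoint_violation(func, beta, u, v))
    return worst
```

On this box, `worst_violation(f, 0.1)` stays non-positive up to rounding, while `midpoint_violation(composite, 0.1, [1.0, -1.0], [1.0, 1.0])` is strictly positive because $\exp(-\beta w_2)$ is convex along the $w_2$ direction.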

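In our analysis, we will use the covering number of $\mathcal{W}$. A subset $\mathcal{N}\subseteq\mathcal{W}$ is called an $\varepsilon$-net of $\mathcal{W}$ if for every $w\in\mathcal{W}$ one can find $\widetilde{w}\in\mathcal{N}$ so that $\|w-\widetilde{w}\|_2\le\varepsilon$. The minimal cardinality of an $\varepsilon$-net of $\mathcal{W}$ is called the covering number and denoted by $N(\mathcal{W},\varepsilon)$. The covering number of the Euclidean ball $B_2^d(R)$ of radius $R$ can be estimated using a standard volume comparison argument [14], as follows:

$$N(B_2^d(R),\varepsilon) \le \left(\frac{3R}{\varepsilon}\right)^d.$$

The covering numbers are (almost) increasing by inclusion [15]: $\mathcal{W}\subseteq B$ implies $N(\mathcal{W},\varepsilon)\le N(B,\varepsilon/2)$. Since $\mathcal{W}\subseteq B_2^d(R)$, then

$$N(\mathcal{W},\varepsilon) \le N(B_2^d(R),\varepsilon/2) \le \left(\frac{6R}{\varepsilon}\right)^d. \tag{9}$$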
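As a small sanity check of the bound (9), the sketch below builds a greedy $\varepsilon$-net of a finite sample of points from a two-dimensional disk and compares its size with $(6R/\varepsilon)^d$. The finite sample is only a stand-in for $\mathcal{W}$ (a net of the sample is not a net of the whole set), and the greedy construction and all function names are assumptions of this illustration rather than part of the paper's argument.

```python
import math
import random


def covering_bound(d, radius, eps):
    """Right-hand side of (9): (6 * radius / eps) ** d."""
    return (6.0 * radius / eps) ** d


def greedy_net(points, eps):
    """Greedily pick centers so that every point is within eps of some center."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def sample_disk(n, radius, seed=0):
    """Hypothetical finite stand-in for W: n random points of a 2-D disk."""
    rng = random.Random(seed)
    pts = []
    while len(pts) < n:
        p = (rng.uniform(-radius, radius), rng.uniform(-radius, radius))
        if math.hypot(*p) <= radius:
            pts.append(p)
    return pts
```

For example, `len(greedy_net(sample_disk(2000, 1.0), 0.5))` can be compared against `covering_bound(2, 1.0, 0.5)`. Since the greedy centers are pairwise more than $\varepsilon$ apart, a packing argument keeps their number well below the bound.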

Finally, we present two basic lemmas, which will be useful in our analysis.

###### Lemma 1.

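Under Assumption 1, for any $w,w'\in\mathcal{W}$ and any $z$, the following holds for a sufficiently small constant $\sigma>0$ depending on $\beta$, $G$ and $R$:

$$f(w,z) \ge f(w',z) + (w-w')^\top\nabla f(w',z) + \frac{\sigma}{2}(w-w')^\top\nabla f(w',z)\nabla f(w',z)^\top(w-w').$$

###### Lemma 2.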

Let $w_*$ be an optimal solution to (1). Then for any $w\in\mathcal{W}$ we have

$$(w-w_*)^\top\nabla F(w_*) \ge R(w_*) - R(w).$$

The proof of Lemma 1 can be found in [5] and is thus omitted. The proof of Lemma 2 is simple and presented below.

###### Proof of Lemma 2.

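By the optimality condition of (1), there exists a subgradient $v\in\partial R(w_*)$ such that $(w-w_*)^\top(\nabla F(w_*)+v)\ge 0$ for any $w\in\mathcal{W}$. Since $R(w)$ is a convex function, we have $R(w)\ge R(w_*)+(w-w_*)^\top v$. Combining the above two inequalities, we have $(w-w_*)^\top\nabla F(w_*)\ge -(w-w_*)^\top v\ge R(w_*)-R(w)$. ∎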

## 4 Main Result and Analysis

In the sequel, we let $\sigma$ be a constant as in Lemma 1 and fix $w_*$, an optimal solution of (1). Our main result is stated in the following theorem.

###### Theorem 1.

For the stochastic composite minimization problem (1), we consider the empirical minimizer $\widehat{w}$ obtained by solving (2). Under Assumption 1, with probability at least $1-2\delta$, we have

$$P(\widehat{w}) - P(w_*) \le O\left(\frac{d\log n}{n} + \frac{d\log(1/\delta)}{n\sigma}\right).$$

Remark 2: Note that when $R(w)=0$, we directly obtain a fast rate with high probability of the empirical risk minimizer for the exp-concave risk minimization problem (3). We can also obtain a similar result for (3) regarding the regularized empirical risk minimizer (4), which provides a different way of solving (3). This is usually preferred over solving the empirical risk minimization problem without any regularization because (i) regularization can lead to better conditioning from the perspective of optimization complexity; and (ii) prior knowledge about the model can be encoded into the regularizer.

###### Theorem 2.

For the risk minimization problem (3), we consider the regularized empirical risk minimizer $\widetilde{w}$ obtained by solving (4). Under Assumptions 1(i), (ii), and the assumption that $g(w)$ is convex and bounded over $\mathcal{W}$ such that $\sup_{w,w'\in\mathcal{W}}|g(w)-g(w')|\le B$, with probability at least $1-2\delta$, we have

$$F(\widetilde{w}) - \min_{w\in\mathcal{W}}F(w) \le O\left(\frac{d\log n}{n} + \frac{d\log(1/\delta)}{n\sigma}\right).$$

Remark 3: This new result not only addresses the open problem raised in [7] about the high probability bound for the strongly regularized empirical risk minimizer but also extends the fast rate to any regularized empirical risk minimizer as long as the regularizer is convex. In comparison, [7] only provides the in-expectation fast rate for the regularized empirical risk minimizer using a strongly convex regularizer. The additional assumption used in our analysis compared to [7] is the Lipschitz continuity of the loss functions over the domain $\mathcal{W}$, which is mild.

Below, we will prove Theorem 1 and Theorem 2. To prove the theorems, we first establish several lemmas.

###### Lemma 3.

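Suppose Assumption 1 holds. For any $w\in\mathcal{W}$ and any $\alpha>0$ we have

$$\frac{\alpha}{2}\|w-w_*\|_H^2 \le P(w) - P(w_*) + \frac{\alpha}{2}\|w-w_*\|_2^2, \tag{10}$$

where $H = I + \frac{\sigma}{\alpha}\mathbb{E}\left[\nabla f(w_*,z)\nabla f(w_*,z)^\top\right]$.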

###### Proof.

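We begin with the following inequality from Lemma 1, valid for any $w,w'\in\mathcal{W}$:

$$f(w,z) \ge f(w',z) + (w-w')^\top\nabla f(w',z) + \frac{\sigma}{2}(w-w')^\top\nabla f(w',z)\nabla f(w',z)^\top(w-w').$$

Setting $w'=w_*$ and taking expectation of both sides over the random variable $z$, we have

$$F(w) \ge F(w_*) + (w-w_*)^\top\nabla F(w_*) + \frac{\sigma}{2}(w-w_*)^\top\mathbb{E}\left[\nabla f(w_*,z)\nabla f(w_*,z)^\top\right](w-w_*).$$

Adding up the above inequality and the inequality in Lemma 2, we have

$$P(w) \ge P(w_*) + \frac{\sigma}{2}(w-w_*)^\top\mathbb{E}\left[\nabla f(w_*,z)\nabla f(w_*,z)^\top\right](w-w_*).$$

Adding $\frac{\alpha}{2}\|w-w_*\|_2^2$ to both sides and using the definition of $H$, we finish the proof. ∎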

###### Lemma 4.

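Let $H = I + \frac{\sigma}{\alpha}\mathbb{E}\left[\nabla f(w_*,z)\nabla f(w_*,z)^\top\right]$. Under Assumption 1, for any $\alpha>0$, with probability at least $1-\delta$, we have

$$\|\nabla P(w_*) - \nabla P_n(w_*)\|_{H^{-1}} \le \frac{2G\log(2/\delta)}{n} + \sqrt{\frac{2\alpha d\log(2/\delta)}{n\sigma}}. \tag{11}$$

###### Proof.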

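To prove the above lemma, we need the following concentration result for random vectors.

###### Proposition 1.

[17]. Let $\mathcal{H}$ be a Hilbert space equipped with a norm $\|\cdot\|$ and let $\xi$ be a random variable with values in $\mathcal{H}$. Assume $\|\xi\|\le M<\infty$ almost surely. Denote $\sigma^2(\xi)=\mathbb{E}[\|\xi\|^2]$. Let $\{\xi_i\}_{i=1}^m$ be $m$ ($m<\infty$) independent draws of $\xi$. For any $0<\delta<1$, with confidence $1-\delta$,

$$\left\|\frac{1}{m}\sum_{i=1}^{m}\left[\xi_i - \mathbb{E}[\xi_i]\right]\right\| \le \frac{2M\log(2/\delta)}{m} + \sqrt{\frac{2\sigma^2(\xi)\log(2/\delta)}{m}}.$$

To use the above proposition, we consider $\nabla f(w_*,z)$ as a random variable in a Hilbert space equipped with the norm $\|\cdot\|_{H^{-1}}$. Then we have $\mathbb{E}[\nabla f(w_*,z)]=\nabla F(w_*)$. To prove Lemma 4, we need upper bounds on $\|\nabla f(w_*,z)\|_{H^{-1}}$ and $\mathbb{E}[\|\nabla f(w_*,z)\|_{H^{-1}}^2]$. First, we note that $H\succeq I$, so $\|\nabla f(w_*,z)\|_{H^{-1}}\le\|\nabla f(w_*,z)\|_2\le G$. Second,

$$\mathbb{E}\left[\|\nabla f(w_*,z)\|_{H^{-1}}^2\right] = \mathrm{tr}\left(H^{-1}\mathbb{E}\left[\nabla f(w_*,z)\nabla f(w_*,z)^\top\right]\right) \le \frac{\alpha}{\sigma}d,$$

where $\mathrm{tr}(\cdot)$ denotes the trace function and the last inequality uses $H\succeq\frac{\sigma}{\alpha}\mathbb{E}\left[\nabla f(w_*,z)\nabla f(w_*,z)^\top\right]$. Then, according to Proposition 1, with probability at least $1-\delta$, we have

$$\|\nabla P(w_*) - \nabla P_n(w_*)\|_{H^{-1}} = \left\|\nabla F(w_*) - \frac{1}{n}\sum_{i=1}^{n}\nabla f(w_*,z_i)\right\|_{H^{-1}} \le \frac{2G\log(2/\delta)}{n} + \sqrt{\frac{2\alpha d\log(2/\delta)}{n\sigma}}.$$

∎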
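The vector concentration inequality used in this proof has the form $\frac{2M\log(2/\delta)}{m}+\sqrt{\frac{2\sigma^2(\xi)\log(2/\delta)}{m}}$, with $M$ an almost-sure bound on $\|\xi\|$ and $\sigma^2(\xi)=\mathbb{E}\|\xi\|^2$. The sketch below is a small Monte Carlo check of that bound in the Euclidean norm. The distribution (uniform on $[-1,1]^d$, so that $M=\sqrt{d}$ and $\sigma^2(\xi)=d/3$), the sample sizes, and the function names are assumptions chosen for illustration.

```python
import math
import random


def deviation_bound(m, big_m, second_moment, delta):
    """Deviation bound for the mean of m bounded random vectors at confidence 1 - delta."""
    log_term = math.log(2.0 / delta)
    return 2.0 * big_m * log_term / m + math.sqrt(2.0 * second_moment * log_term / m)


def mean_norm(m, d, rng):
    """Norm of the average of m i.i.d. vectors uniform on [-1, 1]^d (mean zero)."""
    total = [0.0] * d
    for _ in range(m):
        for j in range(d):
            total[j] += rng.uniform(-1.0, 1.0)
    return math.sqrt(sum((t / m) ** 2 for t in total))


def violation_rate(m=50, d=3, delta=0.1, trials=500, seed=0):
    """Fraction of trials in which the empirical mean deviates beyond the bound."""
    rng = random.Random(seed)
    # For xi uniform on [-1, 1]^d: ||xi||_2 <= sqrt(d) and E||xi||_2^2 = d / 3.
    bound = deviation_bound(m, math.sqrt(d), d / 3.0, delta)
    misses = sum(mean_norm(m, d, rng) > bound for _ in range(trials))
    return misses / trials
```

The returned rate from `violation_rate()` should not exceed `delta`; in practice the bound is conservative for such light-tailed vectors, so misses are rare.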

###### Lemma 5.

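Under Assumption 1, with probability at least $1-\delta$, for any $\varepsilon>0$ and any $w\in\mathcal{W}$, we have

$$\left\|\nabla P(w) - \nabla P(w_*) - \left[\nabla P_n(w) - \nabla P_n(w_*)\right]\right\|_2 \le \frac{LC(\varepsilon)\|w-w_*\|_2}{n} + \frac{LC(\varepsilon)\varepsilon}{n} + \sqrt{\frac{LC(\varepsilon)\left(P(w)-P(w_*)\right)}{n}} + \sqrt{\frac{LGC(\varepsilon)\varepsilon}{n}} + 2L\varepsilon,$$

where $C(\varepsilon) = 2\left(\log(2/\delta) + d\log(6R/\varepsilon)\right)$.

The proof of the above lemma is similar to the proof of Lemma 1 in [21] and is deferred to the supplement (Appendix A). The idea of the proof is as follows: first we establish an upper bound for a fixed $w$ using Proposition 1, and then we use the union bound and the covering number of $\mathcal{W}$ to get an upper bound for any point in the $\varepsilon$-net. Finally, we use the property of the $\varepsilon$-net to prove the inequality in the lemma for any $w\in\mathcal{W}$.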

### 4.1 Proof of Theorem 1

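Let $H$ be defined as in Lemma 3 with $\alpha>0$, and let $\varepsilon>0$. The values of $\alpha$ and $\varepsilon$ will be decided later. We have

$$\begin{aligned}
P(\widehat{w}) - P(w_*) &\le \nabla P(\widehat{w})^\top(\widehat{w}-w_*) = \left(\nabla P(\widehat{w}) - \nabla P(w_*)\right)^\top(\widehat{w}-w_*) + \nabla P(w_*)^\top(\widehat{w}-w_*) \\
&= \left(\nabla P(\widehat{w}) - \nabla P(w_*) - \left[\nabla P_n(\widehat{w}) - \nabla P_n(w_*)\right]\right)^\top(\widehat{w}-w_*) \\
&\quad + \left(\nabla P(w_*) - \nabla P_n(w_*)\right)^\top(\widehat{w}-w_*) + \nabla P_n(\widehat{w})^\top(\widehat{w}-w_*) \\
&\le \left(\nabla P(\widehat{w}) - \nabla P(w_*) - \left[\nabla P_n(\widehat{w}) - \nabla P_n(w_*)\right]\right)^\top(\widehat{w}-w_*) \\
&\quad + \left(\nabla P(w_*) - \nabla P_n(w_*)\right)^\top(\widehat{w}-w_*),
\end{aligned}$$

where the first inequality uses the convexity of $P(w)$ and the second inequality uses the optimality condition of $\widehat{w}$, i.e., $\nabla P_n(\widehat{w})^\top(w-\widehat{w})\ge 0$ for all $w\in\mathcal{W}$. Then we have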

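$$\begin{aligned}
P(\widehat{w}) - P(w_*) &\le \left\|\nabla P(\widehat{w}) - \nabla P(w_*) - \left[\nabla P_n(\widehat{w}) - \nabla P_n(w_*)\right]\right\|_2\|\widehat{w}-w_*\|_2 \\
&\quad + \|\nabla P(w_*) - \nabla P_n(w_*)\|_{H^{-1}}\|\widehat{w}-w_*\|_H.
\end{aligned}$$

By Lemma 5 and Lemma 4, with probability at least $1-2\delta$, we have

$$\begin{aligned}
P(\widehat{w}) - P(w_*) &\le 2L\varepsilon\|\widehat{w}-w_*\|_2 + \frac{LC(\varepsilon)\|\widehat{w}-w_*\|_2^2}{n} + \frac{LC(\varepsilon)\varepsilon\|\widehat{w}-w_*\|_2}{n} \\
&\quad + \|\widehat{w}-w_*\|_2\sqrt{\frac{LC(\varepsilon)\left(P(\widehat{w})-P(w_*)\right)}{n}} + \|\widehat{w}-w_*\|_2\sqrt{\frac{LGC(\varepsilon)\varepsilon}{n}} \\
&\quad + \frac{2G\log(2/\delta)\|\widehat{w}-w_*\|_H}{n} + \|\widehat{w}-w_*\|_H\sqrt{\frac{2\alpha d\log(2/\delta)}{n\sigma}}.
\end{aligned} \tag{12}$$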

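Next, we bound the last four terms on the R.H.S. using Young's inequality, i.e., $ab\le\frac{a^2}{2c}+\frac{cb^2}{2}$ for any $c>0$:

$$\begin{aligned}
\|\widehat{w}-w_*\|_2\sqrt{\frac{LC(\varepsilon)\left(P(\widehat{w})-P(w_*)\right)}{n}} &\le \frac{3LC(\varepsilon)\|\widehat{w}-w_*\|_2^2}{2n} + \frac{P(\widehat{w})-P(w_*)}{6}, &(13)\\
\|\widehat{w}-w_*\|_2\sqrt{\frac{LC(\varepsilon)G\varepsilon}{n}} &\le \frac{LC(\varepsilon)\|\widehat{w}-w_*\|_2^2}{n} + \frac{G\varepsilon}{4}, &(14)\\
\|\widehat{w}-w_*\|_H\sqrt{\frac{2\alpha d\log(2/\delta)}{n\sigma}} &\le \frac{\alpha}{12}\|w_*-\widehat{w}\|_H^2 + \frac{6d\log(2/\delta)}{n\sigma}, &(15)\\
\frac{2G\log(2/\delta)\|\widehat{w}-w_*\|_H}{n} &\le \frac{\alpha}{12}\|w_*-\widehat{w}\|_H^2 + \frac{12G^2\log^2(2/\delta)}{n^2\alpha}. &(16)
\end{aligned}$$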

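Combining the inequalities in (12), (13), (14), (15), and (16), with probability at least $1-2\delta$ we have

$$\begin{aligned}
P(\widehat{w}) - P(w_*) &\le 2L\varepsilon\|\widehat{w}-w_*\|_2 + \frac{4LC(\varepsilon)\|\widehat{w}-w_*\|_2^2}{n} + \frac{LC(\varepsilon)\varepsilon\|\widehat{w}-w_*\|_2}{n} \\
&\quad + \frac{G\varepsilon}{4} + \frac{P(\widehat{w})-P(w_*)}{6} + \frac{\alpha}{6}\|\widehat{w}-w_*\|_H^2 + \frac{12G^2\log^2(2/\delta)}{n^2\alpha} + \frac{6d\log(2/\delta)}{n\sigma} \\
&\le 2L\varepsilon\|\widehat{w}-w_*\|_2 + \frac{4LC(\varepsilon)\|\widehat{w}-w_*\|_2^2}{n} + \frac{LC(\varepsilon)\varepsilon\|\widehat{w}-w_*\|_2}{n} \\
&\quad + \frac{G\varepsilon}{4} + \frac{P(\widehat{w})-P(w_*)}{2} + \frac{\alpha}{6}\|\widehat{w}-w_*\|_2^2 + \frac{12G^2\log^2(2/\delta)}{n^2\alpha} + \frac{6d\log(2/\delta)}{n\sigma},
\end{aligned}$$

where the second inequality uses Lemma 3, which gives $\frac{\alpha}{6}\|\widehat{w}-w_*\|_H^2\le\frac{1}{3}\left(P(\widehat{w})-P(w_*)\right)+\frac{\alpha}{6}\|\widehat{w}-w_*\|_2^2$. Then we have

$$\begin{aligned}
\frac{P(\widehat{w}) - P(w_*)}{2} &\le \frac{4LC(\varepsilon)\|\widehat{w}-w_*\|_2^2}{n} + \frac{LC(\varepsilon)\varepsilon\|\widehat{w}-w_*\|_2}{n} \\
&\quad + \frac{6d\log(2/\delta)}{n\sigma} + \frac{12G^2\log^2(2/\delta)}{n^2\alpha} + \frac{G\varepsilon}{4} + 2L\varepsilon\|\widehat{w}-w_*\|_2 + \frac{\alpha}{6}\|\widehat{w}-w_*\|_2^2.
\end{aligned}$$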

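Let $\varepsilon=1/n$ and $\alpha=\log(2/\delta)/n$. Noting that $\|\widehat{w}-w_*\|_2\le 2R$ and $C(\varepsilon)=2\left(\log(2/\delta)+d\log(6Rn)\right)$, we have

$$\begin{aligned}
P(\widehat{w}) - P(w_*) &\le \frac{64LR^2\left(\log(2/\delta)+d\log(6Rn)\right)}{n} + \frac{8LR\left(\log(2/\delta)+d\log(6Rn)\right)}{n^2} \\
&\quad + \frac{12d\log(2/\delta)}{n\sigma} + \frac{24G^2\log(2/\delta)}{n} + \frac{G}{2n} + \frac{8LR}{n} + \frac{4R^2\log(2/\delta)}{3n} \\
&= O\left(\frac{d\log n}{n} + \frac{d\log(1/\delta)}{n\sigma}\right).
\end{aligned}$$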
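To get a feel for the constants, the explicit bound at the end of the proof can be evaluated numerically. The sketch below codes the seven terms of that bound under the choices $\varepsilon=1/n$ and $\alpha=\log(2/\delta)/n$, which are the values consistent with the displayed constants. The function name, argument names, and the sample parameter values are assumptions made for illustration.

```python
import math


def theorem1_bound(n, d, radius, smooth, lipschitz, sigma, delta):
    """Explicit right-hand side from the end of the proof of Theorem 1.

    Uses eps = 1/n and alpha = log(2/delta)/n, so that
    C(eps) = 2 * (log(2/delta) + d * log(6 * radius * n)).
    """
    log_d = math.log(2.0 / delta)
    half_c = log_d + d * math.log(6.0 * radius * n)  # equals C(eps) / 2
    return (
        64.0 * smooth * radius ** 2 * half_c / n
        + 8.0 * smooth * radius * half_c / n ** 2
        + 12.0 * d * log_d / (n * sigma)
        + 24.0 * lipschitz ** 2 * log_d / n
        + lipschitz / (2.0 * n)
        + 8.0 * smooth * radius / n
        + 4.0 * radius ** 2 * log_d / (3.0 * n)
    )
```

For instance, comparing `theorem1_bound(1000, 10, 1.0, 1.0, 1.0, 0.1, 0.05)` with the same call at `n=2000` shows the bound shrinking roughly like $\log n/n$, with the $d\log(1/\delta)/(n\sigma)$ term dominating when $\sigma$ is small.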

### 4.2 Proof of Theorem 2

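We use Theorem 1 to prove Theorem 2. First, we define and recall some notation:

$$\begin{aligned}
&\widehat{F}(w) = F(w) + \frac{1}{n}g(w), \qquad \widehat{F}_n(w) = F_n(w) + \frac{1}{n}g(w), \\
&\widehat{w}_* = \arg\min_{w\in\mathcal{W}}\widehat{F}(w), \qquad \widetilde{w} = \arg\min_{w\in\mathcal{W}}\widehat{F}_n(w), \qquad w_* = \arg\min_{w\in\mathcal{W}}F(w),
\end{aligned}$$

where $F(w)=\mathbb{E}_z[f(w,z)]$ and $F_n(w)=\frac{1}{n}\sum_{i=1}^{n}f(w,z_i)$. According to Theorem 1, applied with the convex regularizer $\frac{1}{n}g(w)$, the following inequality holds with probability at least $1-2\delta$:

$$\widehat{F}(\widetilde{w}) - \widehat{F}(\widehat{w}_*) \le O\left(\frac{d\log n}{n} + \frac{d\log(1/\delta)}{n\sigma}\right).$$

Plugging in the definition of $\widehat{F}$, we have

$$F(\widetilde{w}) - F(\widehat{w}_*) \le \frac{1}{n}\left(g(\widehat{w}_*) - g(\widetilde{w})\right) + O\left(\frac{d\log n}{n} + \frac{d\log(1/\delta)}{n\sigma}\right) \le O\left(\frac{d\log n}{n} + \frac{d\log(1/\delta)}{n\sigma}\right), \tag{17}$$

where the last inequality uses the boundedness assumption on $g$, i.e., $g(\widehat{w}_*)-g(\widetilde{w})\le B$.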

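Since $\widehat{w}_*$ is the minimizer of $\widehat{F}(w)$, we have

$$F(\widehat{w}_*) + \frac{1}{n}g(\widehat{w}_*) \le F(w_*) + \frac{1}{n}g(w_*).$$

Then

$$F(\widehat{w}_*) \le F(w_*) + \frac{1}{n}\left(g(w_*) - g(\widehat{w}_*)\right) \le F(w_*) + \frac{B}{n}. \tag{18}$$

Combining (17) and (18), the following inequality holds with high probability:

$$F(\widetilde{w}) - F(w_*) \le O\left(\frac{d\log n}{n} + \frac{d\log(1/\delta)}{n\sigma}\right),$$

which finishes the proof.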

## 5 Conclusion

In this paper, we have developed a simple analysis of fast rates for empirical minimization with exponential concave loss functions and an arbitrary convex regularizer. This represents the first result of its kind. The proof is elementary, exploiting only the covering number of a finite-dimensional bounded set and a concentration inequality for random vectors. Our framework also yields unified fast rate results for exponential concave empirical risk minimization with and without any convex regularizer. An open problem that remains is whether the $\log n$ factor can be removed without using the boosting technique.

## References

[1] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[2] A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1646–1654, 2014.
[3] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems 29, pages 3576–3584, 2016.
[4] A. Gonen and S. Shalev-Shwartz. Average stability is invariant to data preconditioning. Implications to exp-concave empirical risk minimization. CoRR, abs/1601.04011, 2016.
[5] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
[6] S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21 (NIPS), pages 801–808, 2008.
[7] T. Koren and K. Y. Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems 28 (NIPS), pages 1477–1485, 2015.
[8] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition, 2003.
[9] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 2010.
[10] M. Mahdavi, L. Zhang, and R. Jin. Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of The 28th Conference on Learning Theory (COLT), pages 1305–1320, 2015.
[11] N. A. Mehta. Fast rates with high probability in exp-concave statistical learning. ArXiv e-prints, arXiv:1605.01288, 2016.
[12] N. A. Mehta and R. C. Williamson. From stochastic mixability to fast rates. In Advances in Neural Information Processing Systems 27 (NIPS), pages 1197–1205, 2014.
[13] Y. Nesterov. Introductory lectures on convex optimization: a basic course, volume 87 of Applied optimization. Kluwer Academic Publishers, 2004.
[14] G. Pisier. The volume of convex bodies and Banach space geometry. Cambridge Tracts in Mathematics (No. 94). Cambridge University Press, 1989.
[15] Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
[16] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[17] S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
[18] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21, pages 1545–1552, 2009.
[19] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[20] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[21] L. Zhang, T. Yang, and R. Jin. Empirical risk minimization for stochastic convex optimization: O(1/n)- and o(1/n)-type of risk bounds. CoRR, abs/1702.02030, 2017.

## Appendix A Proof of Lemma 5

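The proof is similar to the proof of Lemma 1 in [21]. Denote by $\mathcal{N}(\mathcal{W},\varepsilon)$ the $\varepsilon$-net of $\mathcal{W}$ with minimal cardinality. By the covering number bound (9), we have $|\mathcal{N}(\mathcal{W},\varepsilon)|\le(6R/\varepsilon)^d$. To prove the upper bound for all $w\in\mathcal{W}$, we first consider a fixed point in $\mathcal{N}(\mathcal{W},\varepsilon)$, denoted by $w$. Since $f(\cdot,z)$ is $L$-smooth for any $z$, we have

$$\|\nabla f(w,z) - \nabla f(w_*,z)\|_2 \le L\|w-w_*\|_2. \tag{19}$$

Because $f(\cdot,z)$ is both convex and $L$-smooth, by (2.1.7) of [13], we have

$$\|\nabla f(w,z) - \nabla f(w_*,z)\|_2^2 \le 2L\left(f(w,z) - f(w_*,z) - \langle\nabla f(w_*,z), w-w_*\rangle\right).$$

Taking expectation of both sides, we have

$$\begin{aligned}
\mathbb{E}\left[\|\nabla f(w,z) - \nabla f(w_*,z)\|_2^2\right] &\le 2L\left(F(w) - F(w_*) - \langle\nabla F(w_*), w-w_*\rangle\right) \\
&\le 2L\left(F(w) - F(w_*) - \left(R(w_*) - R(w)\right)\right) = 2L\left(P(w) - P(w_*)\right),
\end{aligned}$$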

where the second inequality uses Lemma 2. Following Proposition 1, with probability at least , we have

 ∥∇P(w)−∇P(w∗)−[∇Pn(w)−∇Pn(w∗)]∥2=∥∥ ∥∥∇F(w)−∇F(w∗)−1n