# Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

## Abstract

We study the conditions under which one can efficiently apply variance-reduction and acceleration schemes to finite sum optimization problems. First, we show that, perhaps surprisingly, the finite sum structure by itself is not sufficient for obtaining a complexity bound of $\tilde{O}((n+L/\mu)\ln(1/\epsilon))$ for $L$-smooth and $\mu$-strongly convex individual functions: one must also know which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sum algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an ‘accelerated’ complexity bound of $\tilde{O}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$ unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing $L$-smooth and convex finite sums, the iteration complexity is bounded from below by $\Omega(n+L/\epsilon)$, assuming that (on average) the same update rule is used in any iteration, and by $\Omega(n+\sqrt{nL/\epsilon})$ otherwise.

## 1 Introduction

An optimization problem principal to machine learning and statistics is that of finite sums:

$$\min_{w\in\mathbb{R}^d}F(w)\coloneqq\frac{1}{n}\sum_{i=1}^{n}f_i(w), \tag{1}$$

where the individual functions $f_i$ are assumed to possess some favorable analytical properties, such as Lipschitz-continuity, smoothness or strong convexity (see [Nes04] for details). We measure the iteration complexity of a given optimization algorithm by determining how many evaluations of individual functions (via some external oracle procedure, along with their gradient, Hessian, etc.) are needed in order to obtain an $\epsilon$-solution, i.e., a point $w$ which satisfies $\mathbb{E}[F(w)-\min_{w}F(w)]\le\epsilon$ (where the expectation is taken w.r.t. the algorithm and the oracle randomness).

Arguably, the simplest way of minimizing finite sum problems is by using optimization algorithms for general optimization problems. For concreteness of the following discussion, let us assume for the moment that the individual functions are $L$-smooth and $\mu$-strongly convex. In this case, by applying vanilla Gradient Descent (GD) or Accelerated Gradient Descent (AGD, [Nes04]), one obtains iteration complexity of

$$\tilde{O}(n\kappa\ln(1/\epsilon))\quad\text{or}\quad\tilde{O}(n\sqrt{\kappa}\ln(1/\epsilon)), \tag{2}$$

respectively, where $\kappa\coloneqq L/\mu$ denotes the condition number of the problem and $\tilde{O}$ hides logarithmic factors in the problem parameters. However, whereas such bounds enjoy logarithmic dependence on the accuracy level, the multiplicative dependence on $n$ renders this approach unsuitable for modern applications where $n$ is very large.

A different approach to tackling a finite sum problem is to reformulate it as a stochastic optimization problem, i.e., $\min_{w\in\mathbb{R}^d}\mathbb{E}_{i\sim\mathcal{U}([n])}[f_i(w)]$, and then apply a general stochastic method, such as SGD, which allows iteration complexity of $O(1/\epsilon)$ or $O(1/\epsilon^2)$ (depending on the problem parameters). These methods offer rates which do not depend on $n$, and are therefore attractive for situations where one seeks a solution of relatively low accuracy. An evident drawback of these methods is their broad applicability for stochastic optimization problems, which may conflict with the goal of efficiently exploiting the unique noise structure of finite sums (indeed, in the general stochastic setting, these rates cannot be improved, e.g., [AWBR09, RR11]).

In recent years, a major breakthrough was made when stochastic methods specialized in finite sums (first SAG [SLRB13] and SDCA [SSZ13], and then SAGA [DBLJ14], SVRG [JZ13], SDCA without duality [SS15], and others) were shown to obtain iteration complexity of

$$\tilde{O}((n+\kappa)\ln(1/\epsilon)). \tag{3}$$

The ability of these algorithms to enjoy both logarithmic dependence on the accuracy parameter and an additive dependence on $n$ is widely attributed to the fact that the noise of finite sum problems is distributed over a finite set of size $n$. Perhaps surprisingly, in this paper we show that another key ingredient is crucial, namely, a means of knowing which individual function is being referred to by the oracle at each iteration. In particular, this shows that variance-reduction mechanisms (see, e.g., [DBLJ14, Section 3]) cannot be applied without explicitly knowing the ‘identity’ of the individual functions. On the more practical side, this result shows that when data augmentation (e.g., [LCB07]) is done without an explicit enumeration of the added samples, it is impossible to obtain iteration complexity as stated in (3) (see [BM16] for relevant upper bounds).

Although variance-reduction mechanisms are essential for obtaining an additive dependence on $n$ (as shown in (3)), they do not necessarily yield ‘accelerated’ rates which depend on the square root of the condition number (as shown in (2) for AGD). Recently, generic acceleration schemes were used by [LMH15] and accelerated SDCA [SSZ16] to obtain iteration complexity of

$$\tilde{O}((n+\sqrt{n\kappa})\ln(1/\epsilon)). \tag{4}$$

The question of whether this rate is optimal was answered affirmatively by [WS16, Lan15, AS16c, AS16a]. The first category of lower bounds exploits the degree of freedom offered by a $d$- (or an infinite-) dimensional space to show that any first-order and a certain class of second-order methods cannot obtain better rates than (4) in the regime where the number of iterations is less than $O(d/n)$. The second category of lower bounds is based on maintaining the complexity of the functional form of the iterates, thereby establishing bounds for first-order and coordinate-descent algorithms whose step sizes are oblivious to the problem parameters (e.g., SAG, SAGA, SVRG, SDCA, SDCA without duality) for any number of iterations, regardless of $d$ and $n$.
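To make the regimes in (2), (3) and (4) concrete, the following sketch evaluates the leading terms of the four bounds with all constants and hidden logarithmic factors dropped. The function names and the sample values of $n$, $\kappa$ and $\epsilon$ are illustrative assumptions, not quantities taken from the paper.

```python
import math


def gd_bound(n, kappa, eps):
    """Leading term of the GD bound in (2): n * kappa * ln(1/eps)."""
    return n * kappa * math.log(1.0 / eps)


def agd_bound(n, kappa, eps):
    """Leading term of the AGD bound in (2): n * sqrt(kappa) * ln(1/eps)."""
    return n * math.sqrt(kappa) * math.log(1.0 / eps)


def variance_reduced_bound(n, kappa, eps):
    """Leading term of (3): (n + kappa) * ln(1/eps)."""
    return (n + kappa) * math.log(1.0 / eps)


def accelerated_bound(n, kappa, eps):
    """Leading term of (4): (n + sqrt(n * kappa)) * ln(1/eps)."""
    return (n + math.sqrt(n * kappa)) * math.log(1.0 / eps)


if __name__ == "__main__":
    # Hypothetical ill-conditioned regime (kappa > n), where (4) improves on (3).
    n, kappa, eps = 10**4, 10**6, 1e-6
    for name, bound in [("GD", gd_bound), ("AGD", agd_bound),
                        ("variance-reduced", variance_reduced_bound),
                        ("accelerated", accelerated_bound)]:
        print(f"{name:>16}: {bound(n, kappa, eps):.3e}")
```

Note that $\sqrt{n\kappa}\le\kappa$ only when $n\le\kappa$, so in this simplified comparison the accelerated bound (4) improves on (3) only in the ill-conditioned regime.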

In this work, we further extend the theory of oblivious finite sum algorithms, by showing that if a first-order and a coordinate-descent oracle are used, then acceleration is not possible without an explicit knowledge of the strong convexity parameter. This implies that in cases where only poor estimation of the strong convexity is available, faster rates may be obtained through ‘adaptive’ algorithms (see relevant discussions in [SLRB13, AS16b]).

Next, we show that in the smooth and convex case, oblivious finite sum algorithms which, on average, apply the same update rule at each iteration (e.g., SAG, SDCA, SVRG, SVRG++ [AZY16], and typically, other algorithms with a variance-reduction mechanism as described in [DBLJ14, Section 3]), are bound to an iteration complexity of $\Omega(n+L/\epsilon)$, where $L$ denotes the smoothness parameter (rather than $\Omega(n+\sqrt{nL/\epsilon})$). To show this, we employ a restarting scheme (see [AS16b]) which explicitly introduces the strong convexity parameter into algorithms that are designed for smooth and convex functions. Finally, we use this scheme to establish a tight dimension-free lower bound for smooth and convex finite sums which holds for oblivious algorithms with a first-order and a coordinate-descent oracle.

To summarize, our contributions (in order of appearance) are the following:

• In Section 2, we prove that in the setting of stochastic optimization, having finitely supported noise (as in finite sum problems) is not sufficient for obtaining linear convergence rates with a linear dependence on $n$: one must also know exactly which individual function is being referred to by the oracle at each iteration. Deriving similar results for various settings, we show that SDCA, accelerated SDCA, SAG, SAGA, SVRG, SVRG++ and other finite sum algorithms must have a proper enumeration of the individual functions in order to obtain their stated convergence rate.

• In Section 3.1, we lay the foundations of the framework of general CLI algorithms (see [AS16a]), which enables us to formally address oblivious algorithms (e.g., when step sizes are scheduled regardless of the function at hand). In Section 3.2, we improve upon [AS16b], by showing that (in this generalized framework) the iteration complexity of oblivious, deterministic or stochastic, finite sum algorithms with both first-order and coordinate-descent oracles cannot be better than $\tilde{\Omega}(n+\kappa\ln(1/\epsilon))$, unless the strong convexity parameter is provided explicitly. In particular, the richer expressive power of this framework allows addressing incremental gradient methods, such as Incremental Gradient Descent [Ber97] and Incremental Aggregated Gradient (IAG, [BHG07]).

• In Section 3.3, we show that, in the $L$-smooth and convex case, the optimal complexity bound (in terms of the accuracy parameter) of oblivious algorithms whose update rules are (on average) fixed for any iteration is $\Omega(n+L/\epsilon)$ (rather than $\tilde{O}(n+\sqrt{nL/\epsilon})$, as obtained, e.g., by accelerated SDCA). To show this, we first invoke a restarting scheme (used by [AS16b]) to explicitly introduce strong convexity into algorithms for finite sums with smooth and convex individuals, and then apply the result derived in Section 3.2.

• In Section 3.4, we use the reduction introduced in Section 3.3 to show that the optimal iteration complexity of minimizing $L$-smooth and convex finite sums using oblivious algorithms equipped with a first-order and a coordinate-descent oracle is $\Omega(n+\sqrt{nL/\epsilon})$.

## 2 The Importance of Individual Identity

In the following, we address the stochastic setting of finite sum problems (1) where one is equipped with a stochastic oracle which, upon receiving a call, returns some individual function chosen uniformly at random and hides its index. We show that not knowing the identity of the function returned by the oracle (as opposed to an incremental oracle which addresses the specific individual functions chosen by the user) significantly harms the optimal attainable performance. To this end, we reduce the statistical problem of estimating the bias of a noisy coin to that of optimizing finite sums. This reduction (presented below) makes extensive use of elementary definitions and tools from information theory, all of which can be found in [CT12].

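First, given $\sigma\in\{-1,1\}$, we define the following finite sum problem

$$F_\sigma\coloneqq\frac{1}{n}\left(\frac{n-\sigma}{2}f^{+}+\frac{n+\sigma}{2}f^{-}\right), \tag{5}$$

where $n$ is w.l.o.g. assumed to be odd, and $f^{+},f^{-}$ are some functions (to be defined later). We then define the following discrepancy measure between $F_1$ and $F_{-1}$ for different values of $n$ (see also [AWBR09]),

$$\delta(n)=\min_{w\in\mathbb{R}^d}\left\{F_1(w)+F_{-1}(w)-F_1^*-F_{-1}^*\right\}, \tag{6}$$

where $F_\sigma^*\coloneqq\inf_{w}F_\sigma(w)$. It is easy to verify that no point can be better than $\delta(n)/2$-optimal for both $F_1$ and $F_{-1}$ at the same time. Thus, by running a given optimization algorithm long enough to obtain a solution of accuracy better than $\delta(n)/2$ w.h.p., we can deduce the value of $\sigma$. Also, note that one can simplify the computation of $\delta(n)$ by choosing convex $f^{+},f^{-}$ such that $f^{+}(w)=f^{-}(-w)$. Indeed, in this case, we have $F_1(w)=F_{-1}(-w)$ (in particular, $F_1^*=F_{-1}^*$), and since $F_1(w)+F_{-1}(w)$ is convex and symmetric about the origin, it must attain its minimum at $w=0$, which yields

$$\delta(n)=2\left(F_1(0)-F_1^*\right). \tag{7}$$

Next, we let $\sigma\in\{-1,1\}$ be drawn uniformly at random, and then use the given optimization algorithm to estimate the bias of a random variable $X$ which, conditioned on $\sigma$, takes the value $1$ w.p. $1/2+\sigma/(2n)$, and $-1$ w.p. $1/2-\sigma/(2n)$. To implement the stochastic oracle described above, conditioned on $\sigma$, we draw $k$ i.i.d. copies of $X$, denoted by $X_1,\dots,X_k$, and return $f^{-}$ if $X_i=1$, and $f^{+}$ otherwise. Now, if $k$ is such that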

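$$\mathbb{E}\left[F_\sigma(w^{(k)})-F_\sigma^*\,\middle|\,\sigma\right]\le\frac{\delta(n)}{40},$$

for both $\sigma=-1$ and $\sigma=1$, then by Markov's inequality, we have that

$$P\left(F_\sigma(w^{(k)})-F_\sigma^*\ge\delta(n)/4\,\middle|\,\sigma\right)\le1/10 \tag{8}$$

(note that $F_\sigma(w^{(k)})-F_\sigma^*$ is a non-negative random variable). We may now try to guess the value of $\sigma$ using the following estimator

$$\hat{\sigma}(w^{(k)})=\operatorname*{argmin}_{\sigma'\in\{-1,1\}}\left\{F_{\sigma'}(w^{(k)})-F_{\sigma'}^*\right\},$$

whose probability of error, as follows by Inequality (8), is

$$P(\hat{\sigma}\neq\sigma)\le1/10. \tag{9}$$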
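The reduction above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the paper's construction in full: it works in one dimension with $f^{\pm}(w)=\frac{L}{2}(w\pm1)^2$, for which $F_\sigma(w)-F_\sigma^*=\frac{L}{2}(w-\sigma/n)^2$, and it uses a hypothetical plug-in "algorithm" that simply counts the oracle answers. All function names are illustrative.

```python
import random


def hidden_index_oracle(sigma, n, rng):
    """Return the label of f^- w.p. (n + sigma) / (2n) and of f^+ otherwise,
    matching the weights in (5); the index of the individual function is hidden."""
    x = 1 if rng.random() < 0.5 + sigma / (2.0 * n) else -1
    return "f_minus" if x == 1 else "f_plus"


def suboptimality(w, sigma, n, L=1.0):
    """F_sigma(w) - F_sigma^* for the 1-D pair f^{+-}(w) = (L/2)(w +- 1)^2."""
    return 0.5 * L * (w - sigma / n) ** 2


def estimate_sigma(w, n):
    """The estimator sigma-hat: the sign whose objective w fits best."""
    return min((-1, 1), key=lambda s: suboptimality(w, s, n))


def plug_in_minimizer(sigma, n, k, rng):
    """Hypothetical algorithm: after k oracle calls, the difference of the
    empirical frequencies is an unbiased estimate of the minimizer sigma / n."""
    answers = [hidden_index_oracle(sigma, n, rng) for _ in range(k)]
    minus = answers.count("f_minus")
    return (minus - (k - minus)) / k


if __name__ == "__main__":
    rng = random.Random(0)
    n, sigma = 11, rng.choice([-1, 1])
    for k in (n, n * n, 100 * n * n):
        w = plug_in_minimizer(sigma, n, k, rng)
        print(k, estimate_sigma(w, n) == sigma)
```

The plug-in estimate has standard deviation of roughly $1/\sqrt{k}$, while the two candidate minimizers $\pm1/n$ are only $2/n$ apart, which is consistent with the $n^2$ scaling derived next.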

Lastly, we show that the existence of an estimator for $\sigma$ with high probability of success implies that $k=\Omega(n^2)$. To this end, note that the corresponding conditional dependence structure of this probabilistic setting can be modeled as follows: $\sigma\to X_1,\dots,X_k\to\hat{\sigma}$. Thus, we have

$$H(\sigma\mid X_1,\dots,X_k)\overset{(a)}{\le}H(\sigma\mid\hat{\sigma})\overset{(b)}{\le}H_b\left(P(\hat{\sigma}\neq\sigma)\right)\overset{(c)}{\le}\frac{1}{2}, \tag{10}$$

where $H(\cdot)$ and $H_b(\cdot)$ denote the Shannon entropy function and the binary entropy function, respectively, $(a)$ follows by the data processing inequality (in terms of entropy), $(b)$ follows by Fano’s inequality and $(c)$ follows from Inequality (9). Applying standard entropy identities, we get

$$\begin{aligned}
H(\sigma\mid X_1,\dots,X_k)&\overset{(d)}{=}H(X_1,\dots,X_k\mid\sigma)+H(\sigma)-H(X_1,\dots,X_k)\\
&\overset{(e)}{=}kH(X_1\mid\sigma)+1-H(X_1,\dots,X_k)\\
&\overset{(f)}{\ge}kH(X_1\mid\sigma)+1-kH(X_1),
\end{aligned} \tag{11}$$

where $(d)$ follows from Bayes rule, $(e)$ follows by the fact that $X_1,\dots,X_k$, conditioned on $\sigma$, are i.i.d. and $(f)$ follows from the chain rule and the fact that conditioning reduces entropy. Combining this with Inequality (10) and rearranging, we have

$$k\ge\frac{1}{2\left(H(X_1)-H(X_1\mid\sigma)\right)}\ge\frac{1}{2(1/n)^2}=\frac{n^2}{2},$$

where the last inequality follows from the fact that $H(X_1)=1$ and $H(X_1\mid\sigma)=H_b(1/2+1/(2n))$, and the following estimation for the binary entropy function: $H_b(p)\ge1-4(p-1/2)^2$ (see Lemma 2, Appendix A). Thus, we arrive at the following statement.

###### Lemma 1.

The minimal number of stochastic oracle calls required to obtain a $\delta(n)/40$-optimal solution for problem (5) is at least $n^2/2$.
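As a numerical sanity check of the last chain of inequalities, the sketch below evaluates $1/\big(2(H(X_1)-H(X_1\mid\sigma))\big)$ directly, using $H(X_1)=1$ and $H(X_1\mid\sigma)=H_b(1/2+1/(2n))$, and compares it with $n^2/2$. The helper names are illustrative.

```python
import math


def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with the convention 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def oracle_call_lower_bound(n):
    """Evaluate 1 / (2 (H(X_1) - H(X_1 | sigma))) for the coin with bias 1/(2n)."""
    information_per_call = 1.0 - binary_entropy(0.5 + 1.0 / (2.0 * n))
    return 1.0 / (2.0 * information_per_call)


if __name__ == "__main__":
    for n in (3, 11, 101, 1001):
        print(n, oracle_call_lower_bound(n), n * n / 2.0)
```

For large $n$ the exact expression behaves like $(\ln 2)\,n^2\approx0.69\,n^2$, so the simplified bound $n^2/2$ loses only a constant factor.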

Instantiating this scheme for $f^{+},f^{-}$ of various analytical properties yields the following.

###### Theorem 1.

When solving a finite sum problem (defined in (1)) with a stochastic oracle, one needs at least $n^2/2$ oracle calls in order to obtain an accuracy level of:

1. $\frac{\kappa+1}{40n^2}$ for smooth and strongly convex individuals with condition number $\kappa$.

2. $\frac{L}{40n^2}$ for $L$-smooth and convex individuals.

3. $\frac{M^2}{40\lambda n^2}$ if $\frac{M}{\lambda n}\le1$, and $\frac{M}{20n}-\frac{\lambda}{40}$, otherwise, for $M$-Lipschitz continuous and $\lambda$-strongly convex individuals.

###### Proof.
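1. Define,

$$f^{\pm}(w)=\frac{1}{2}(w\pm q)^\top A(w\pm q),$$

where $A$ is a $d\times d$ diagonal matrix whose diagonal entries are $\kappa,1,\dots,1$, and $q\coloneqq(1,1,0,\dots,0)^\top$ is a $d$-dimensional vector. One can easily verify that $f^{\pm}$ are smooth and strongly convex functions with condition number $\kappa$, and that

$$F_\sigma(w)=\frac{1}{2}w^\top Aw-\frac{\sigma}{n}q^\top Aw+\frac{1}{2}q^\top Aq.$$

Therefore, the minimizer of $F_\sigma$ is $(\sigma/n)q$, and using Equation (7), we see that $\delta(n)=\frac{q^\top Aq}{n^2}=\frac{\kappa+1}{n^2}$.

2. We define

$$f^{\pm}(w)=\frac{L}{2}\|w\pm e_1\|^2.$$

One can easily verify that $f^{\pm}$ are $L$-smooth and convex functions, and that the minimizer of $F_\sigma$ is $(\sigma/n)e_1$. By Equation (7), we get $\delta(n)=\frac{L}{n^2}$.

3. We define

$$f^{\pm}(w)=M\|w\pm e_1\|+\frac{\lambda}{2}\|w\|^2,$$

over the unit ball. Clearly, $f^{\pm}$ are $M$-Lipschitz continuous and $\lambda$-strongly convex functions. It can be verified that the minimizer of $F_\sigma$ is $\sigma\min\{\frac{M}{\lambda n},1\}e_1$. Therefore, by Equation (7), we see that in this case we have

$$\delta(n)=\begin{cases}\dfrac{M^2}{\lambda n^2}&\dfrac{M}{\lambda n}\le1,\\[2mm]\dfrac{2M}{n}-\lambda&\text{o.w.}\end{cases}$$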
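The closed forms for $\delta(n)$ in cases 2 and 3 can be checked numerically against definition (6). The sketch below restricts the constructions to one dimension (so $e_1=1$ and the unit ball is $[-1,1]$) and minimizes the convex objectives by ternary search; the helper names and the parameter values in the demo are illustrative assumptions.

```python
def make_F(sigma, n, f_plus, f_minus):
    """F_sigma from (5), built from a pair of individual functions."""
    return lambda w: ((n - sigma) * f_plus(w) + (n + sigma) * f_minus(w)) / (2.0 * n)


def minimize_convex(g, lo, hi, iters=200):
    """Minimal value of a convex 1-D function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return g(0.5 * (lo + hi))


def delta(n, f_plus, f_minus, lo=-1.0, hi=1.0):
    """Discrepancy measure (6), evaluated numerically on [lo, hi]."""
    F1 = make_F(1, n, f_plus, f_minus)
    Fm1 = make_F(-1, n, f_plus, f_minus)
    joint = minimize_convex(lambda w: F1(w) + Fm1(w), lo, hi)
    return joint - minimize_convex(F1, lo, hi) - minimize_convex(Fm1, lo, hi)


def smooth_pair(L):
    """Case 2 in one dimension: f^{+-}(w) = (L/2)(w +- 1)^2."""
    return (lambda w: 0.5 * L * (w + 1.0) ** 2,
            lambda w: 0.5 * L * (w - 1.0) ** 2)


def lipschitz_pair(M, lam):
    """Case 3 in one dimension: f^{+-}(w) = M|w +- 1| + (lam/2) w^2 on [-1, 1]."""
    return (lambda w: M * abs(w + 1.0) + 0.5 * lam * w * w,
            lambda w: M * abs(w - 1.0) + 0.5 * lam * w * w)


if __name__ == "__main__":
    n = 5
    print(delta(n, *smooth_pair(2.0)), 2.0 / n**2)             # L / n^2
    print(delta(n, *lipschitz_pair(1.0, 0.5)), 1.0 / (0.5 * n**2))  # M^2 / (lam n^2)
    print(delta(n, *lipschitz_pair(1.0, 0.1)), 2.0 / n - 0.1)      # 2M/n - lam
```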

A few conclusions can be readily drawn from Theorem 1. First, if a given optimization algorithm obtains an iteration complexity of an order of $c(n,\kappa)\ln(1/\epsilon)$, up to logarithmic factors (including the norm of the minimizer which, in our construction, is of an order of $1/n$ and coupled with the accuracy parameter), for solving smooth and strongly convex finite sum problems with a stochastic oracle, then

$$c(n,\kappa)=\tilde{\Omega}\left(\frac{n^2}{\ln\left(n^2/(\kappa+1)\right)}\right).$$

Thus, the following holds for optimization of finite sums with smooth and strongly convex individuals.

###### Corollary 1.

In order to obtain a linear convergence rate with linear dependence on $n$, one must know the index of the individual function addressed by the oracle.

This implies that variance-reduction methods such as SAG, SAGA, SDCA and SVRG (possibly combined with acceleration schemes), which exhibit linear dependence on $n$, cannot be applied when data augmentation is used without an explicit enumeration of the added samples. In general, this conclusion also holds for cases when one applies general first-order optimization algorithms, such as AGD, on finite sums, as this typically results in a linear dependence on $n$. Secondly, if a given optimization algorithm obtains an iteration complexity of an order of for solving smooth and convex finite sum problems with a stochastic oracle, then . Therefore, and , indicating that an iteration complexity of an order of , as obtained by, e.g., SVRG++, is not attainable with a stochastic oracle. Similar reasoning based on the Lipschitz and strongly convex case in Theorem 1 shows that the iteration complexity guaranteed by accelerated SDCA is also not attainable in this setting.

## 3 Oblivious Optimization Algorithms

In the previous section, we discussed different situations under which variance-reduction schemes are not applicable. Now, we turn to study under what conditions one can apply acceleration schemes. First, we define the framework of oblivious CLI algorithms. Next, we show that, for this family of algorithms, knowing the strong convexity parameter is crucial for obtaining accelerated rates. We then describe a restarting scheme through which we establish that stationary algorithms (whose update rules are, on average, the same for every iteration) for smooth and convex functions are sub-optimal. Finally, we use this reduction to derive a tight lower bound for smooth and convex finite sums on the iteration complexity of any oblivious algorithm (not just stationary).

### 3.1 Framework

In the sequel, following [AS16a], we present the analytic framework through which we derive iteration complexity bounds. This, perhaps pedantic, formulation will allow us to study somewhat subtle distinctions between optimization algorithms. First, we give a rigorous definition for a class of optimization problems which emphasizes the role of prior knowledge in optimization.

###### Definition 1 (Class of Optimization Problems).

A class of optimization problems is an ordered triple $(\mathcal{F},\mathcal{I},O_f)$, where $\mathcal{F}$ is a family of functions defined over some domain designated by $\mathrm{dom}(\mathcal{F})$, $\mathcal{I}$ is the side-information given prior to the optimization process and $O_f$ is a suitable oracle procedure which upon receiving $w\in\mathrm{dom}(\mathcal{F})$ and $\theta$ in some parameter set $\Theta$, returns $O_f(w,\theta)\subseteq\mathrm{dom}(\mathcal{F})$ for a given $f\in\mathcal{F}$ (we shall omit the subscript in $O_f$ when $f$ is clear from the context).

In finite sum problems, $\mathcal{F}$ comprises functions as defined in (1); the side-information may contain the smoothness parameter $L$, the strong convexity parameter $\mu$ and the number of individual functions $n$; and the oracle may allow one to query about a specific individual function (as in the case of an incremental oracle, and as opposed to the stochastic oracle discussed in Section 2). We now turn to define CLI optimization algorithms (see [AS16a] for a more comprehensive discussion).

###### Definition 2 (CLI).

An optimization algorithm is called a Canonical Linear Iterative (CLI) optimization algorithm over a class of optimization problems $(\mathcal{F},\mathcal{I},O_f)$, if given an instance $f\in\mathcal{F}$ and initialization points $\{w_i^{(0)}\}_{i\in J}\subseteq\mathrm{dom}(\mathcal{F})$, where $J$ is some index set, it operates by iteratively generating points such that for any $i\in J$,

$$w_i^{(k+1)}\in\sum_{j\in J}O_f\left(w_j^{(k)};\theta_{ij}^{(k)}\right),\quad k=0,1,\dots \tag{12}$$

holds, where $\theta_{ij}^{(k)}\in\Theta$ are parameters chosen, stochastically or deterministically, by the algorithm, possibly based on the side-information. If the parameters do not depend on previously acquired oracle answers, we say that the given algorithm is oblivious. For notational convenience, we assume that the solution returned by the algorithm is stored in $w_1^{(k)}$.

Throughout the rest of the paper, we shall be interested in oblivious CLI algorithms (for brevity, we usually omit the ‘CLI’ qualifier) equipped with the following two incremental oracles:

$$\begin{aligned}
\text{Generalized first-order oracle:}\quad&O(w;A,B,c,i)\coloneqq A\nabla f_i(w)+Bw+c,\\
\text{Steepest coordinate-descent oracle:}\quad&O(w;j,i)\coloneqq w+t^*e_j,
\end{aligned} \tag{13}$$

where $A,B\in\mathbb{R}^{d\times d}$, $c\in\mathbb{R}^d$, $e_j$ denotes the $j$’th $d$-dimensional unit vector and $t^*\in\operatorname*{argmin}_{t\in\mathbb{R}}f_i(w+te_j)$. We restrict the oracle parameters such that only one individual function is allowed to be accessed at each iteration. We remark that the family of oblivious algorithms with a first-order and a coordinate-descent oracle is wide and subsumes SAG, SAGA, SDCA, SDCA without duality, SVRG and SVRG++, to name a few. Also, note that coordinate-descent steps w.r.t. partial gradients can be implemented using the generalized first-order oracle by setting $A$ to be some principal minor of the unit matrix (see, e.g., RDCM in [Nes12]). Further, similarly to [AS16a], we allow both first-order and coordinate-descent oracles to be used during the same optimization process.
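The two oracles in (13) are easy to state in code. The following is a minimal sketch for small dense problems using plain Python lists; the quadratic form $f(w)=\frac12w^\top Qw-q^\top w$ used for the coordinate-descent oracle, and all function names, are illustrative assumptions rather than part of the framework itself.

```python
def mat_vec(M, v):
    """Dense matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def first_order_oracle(grad_fi, w, A, B, c):
    """Generalized first-order oracle: A * grad f_i(w) + B * w + c."""
    Ag = mat_vec(A, grad_fi(w))
    Bw = mat_vec(B, w)
    return [a + b + ci for a, b, ci in zip(Ag, Bw, c)]


def coordinate_descent_oracle_quadratic(Q, q, w, j):
    """Steepest coordinate-descent oracle w + t* e_j for the quadratic
    f(w) = 0.5 w^T Q w - q^T w, where t* minimizes f(w + t e_j) over t."""
    t_star = (q[j] - sum(Q[j][l] * w[l] for l in range(len(w)))) / Q[j][j]
    out = list(w)
    out[j] += t_star
    return out


if __name__ == "__main__":
    Q, q = [[2.0, 0.0], [0.0, 1.0]], [2.0, 1.0]
    grad = lambda w: [g - qi for g, qi in zip(mat_vec(Q, w), q)]
    eta = 0.5
    # A plain gradient step w - eta * grad f(w) corresponds to A = -eta I, B = I, c = 0.
    A = [[-eta, 0.0], [0.0, -eta]]
    B = [[1.0, 0.0], [0.0, 1.0]]
    print(first_order_oracle(grad, [0.0, 0.0], A, B, [0.0, 0.0]))
    print(coordinate_descent_oracle_quadratic(Q, q, [0.0, 0.0], 0))
```

An algorithm built from these calls is oblivious when `A`, `B`, `c` and `j` are scheduled in advance and never chosen by inspecting earlier oracle answers.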

### 3.2 No Strong Convexity Parameter, No Acceleration for Finite Sum Problems

Having described our analytic approach, we now turn to present some concrete applications. Below, we show that in the absence of a good estimation for the strong convexity parameter, the optimal iteration complexity of oblivious algorithms is $\tilde{\Omega}(n+\kappa\ln(1/\epsilon))$. Our proof is based on the technique used in [AS16a, AS16b] (see [AS16a, Section 2.3] for a brief introduction of the technique).

Given $0<\epsilon<L$, we define the following set of optimization problems (over $\mathbb{R}^d$ with $d>1$)

$$F_\mu(w)\coloneqq\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{2}w^\top Q_\mu w-q^\top w\right),\quad\text{where} \tag{14}$$

$$Q_\mu\coloneqq\begin{pmatrix}\frac{L+\mu}{2}&\frac{\mu-L}{2}&&&\\\frac{\mu-L}{2}&\frac{L+\mu}{2}&&&\\&&\mu&&\\&&&\ddots&\\&&&&\mu\end{pmatrix},\qquad q\coloneqq\frac{\epsilon R}{\sqrt{2}}\begin{pmatrix}1\\1\\0\\\vdots\\0\end{pmatrix},$$

parametrized by $\mu\in(\epsilon,L)$ (note that the individual functions are identical; we elaborate more on this below). It can be easily verified that the condition number of $F_\mu$, which we denote by $\kappa(F_\mu)$, is $L/\mu$, and that the corresponding minimizers are $w^*(\mu)=\frac{\epsilon R}{\sqrt{2}\mu}(1,1,0,\dots,0)^\top$ with norm $\epsilon R/\mu\le R$.

If we are allowed to use a different optimization algorithm for each $\mu$ in this setting, then we know that the optimal iteration complexity is of an order of $(n+\sqrt{n\kappa(F_\mu)})\ln(1/\epsilon)$. However, if we are allowed to use only one single algorithm, then we show that the optimal iteration complexity is of an order of $n+\kappa(F_\mu)\ln(1/\epsilon)$. The proof goes as follows. First, note that in this setting, the oracles defined in (13) take the following form,

$$\begin{aligned}
\text{Generalized first-order oracle:}\quad&O(w;A,B,c,i)=A(Q_\mu w-q)+Bw+c,\\
\text{Steepest coordinate-descent oracle:}\quad&O(w;j,i)=\left(I-\frac{1}{(Q_\mu)_{jj}}e_j(Q_\mu)_{j,*}\right)w+\frac{q_j}{(Q_\mu)_{jj}}e_j.
\end{aligned} \tag{15}$$

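Now, since the oracle answers are linear in $\mu$ and the $k$’th iterate is a $k$-fold composition of sums of the oracle answers, it follows that $w_1^{(k)}$ forms a $d$-dimensional vector of univariate polynomials in $\mu$ of degree $\le k$ with (possibly random) coefficients (formally, see Lemma 3, Appendix A). Denoting the polynomial of the first coordinate of $\mathbb{E}w_1^{(k)}(\mu)$ by $s(\mu)$, we see that for any $\mu\in(\epsilon,L)$,

$$\mathbb{E}\left\|w_1^{(k)}(\mu)-w^*(\mu)\right\|\ge\left\|\mathbb{E}w_1^{(k)}(\mu)-w^*(\mu)\right\|\ge\left|s(\mu)-\frac{R\epsilon}{\sqrt{2}\mu}\right|\ge\frac{R\epsilon}{\sqrt{2}L}\left|\frac{\sqrt{2}s(\mu)\mu}{R\epsilon}-1\right|,$$

where the first inequality follows by Jensen's inequality and the second inequality by focusing on the first coordinate of $\mathbb{E}w_1^{(k)}(\mu)$ and $w^*(\mu)$ (the last step factors out $R\epsilon/(\sqrt{2}\mu)$ and uses $\mu\le L$). Lastly, since the coefficients of $s(\mu)$ do not depend on $\mu$, we have by Lemma 4 in Appendix A, that there exists $\delta>0$, such that for any $\mu\in(L-\delta,L)$ it holds that

$$\frac{R\epsilon}{\sqrt{2}L}\left|\frac{\sqrt{2}s(\mu)\mu}{R\epsilon}-1\right|\ge\frac{R\epsilon}{\sqrt{2}L}\left(1-\frac{1}{\kappa(F_\mu)}\right)^{k+1},$$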

by which we derive the following.

###### Theorem 2.

The iteration complexity of oblivious finite sum optimization algorithms equipped with a first-order and a coordinate-descent oracle whose side-information does not contain the strong convexity parameter is $\tilde{\Omega}(n+\kappa\ln(1/\epsilon))$.

The $\Omega(n)$ part of the lower bound holds for any type of finite sum algorithm and is proved in [AS16a, Theorem 5]. The lower bound stated in Theorem 2 is tight up to logarithmic factors and is attained by, e.g., SAG [SLRB13]. Although relying on a finite sum with identical individual functions may seem somewhat disappointing, it suggests that some variance-reduction schemes can only give optimal dependence in terms of $n$, and that obtaining optimal dependence in terms of the condition number needs to be done through other (acceleration) mechanisms (e.g., [LMH15]). Lastly, note that this bound holds for any number of iterations (regardless of the problem parameters).
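The polynomial structure behind this argument can be observed on the simplest oblivious method. The sketch below runs gradient descent with the fixed step size $1/L$ (which does not use $\mu$) on the leading $2\times2$ block of problem (14), starting from the origin. Since $q$ is an eigenvector of $Q_\mu$ with eigenvalue $\mu$, the first coordinate after $k$ steps equals $\frac{\epsilon R}{\sqrt{2}\mu}\big(1-(1-\mu/L)^k\big)$, a polynomial in $\mu$, and the relative error is exactly $(1-1/\kappa(F_\mu))^k$. The restriction to $d=2$ and the parameter values in the demo are illustrative assumptions.

```python
import math


def q_mu_block(L, mu):
    """Leading 2x2 block of Q_mu from (14)."""
    return [[(L + mu) / 2.0, (mu - L) / 2.0],
            [(mu - L) / 2.0, (L + mu) / 2.0]]


def gd_first_coordinate(L, mu, eps, R, k):
    """First coordinate of the k'th gradient-descent iterate with step 1/L,
    started at the origin; the step size is oblivious to mu."""
    c = eps * R / math.sqrt(2.0)
    Q = q_mu_block(L, mu)
    w = [0.0, 0.0]
    for _ in range(k):
        grad = [Q[0][0] * w[0] + Q[0][1] * w[1] - c,
                Q[1][0] * w[0] + Q[1][1] * w[1] - c]
        w = [w[0] - grad[0] / L, w[1] - grad[1] / L]
    return w[0]


def relative_error(L, mu, eps, R, k):
    """|w_1 - w*_1| / |w*_1| for the first coordinate of the minimizer."""
    w_star = eps * R / (math.sqrt(2.0) * mu)
    return abs(gd_first_coordinate(L, mu, eps, R, k) - w_star) / w_star


if __name__ == "__main__":
    L, eps, R, k = 1.0, 0.05, 1.0, 50
    for mu in (0.5, 0.1, 0.06):
        print(mu, relative_error(L, mu, eps, R, k), (1.0 - mu / L) ** k)
```

Reaching a fixed relative error therefore takes on the order of $\kappa\ln(1/\epsilon)$ steps, with no square root on $\kappa$, which matches the shape of the bound in Theorem 2 for this particular method.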

### 3.3 Stationary Algorithms for Smooth and Convex Finite Sums are Sub-optimal

In the previous section, we showed that not knowing the strong convexity parameter reduces the optimal attainable iteration complexity. In this section, we use this result to show that whereas general optimization algorithms for smooth and convex finite sum problems obtain iteration complexity of $\tilde{O}(n+\sqrt{nL/\epsilon})$, the optimal iteration complexity of stationary algorithms (whose expected update rules are fixed) is $\Omega(n+L/\epsilon)$.

The proof (presented below) is based on a general restarting scheme (see Scheme 1) used in [AS16b]. The scheme allows one to apply algorithms which are designed for $L$-smooth and convex problems on smooth and strongly convex finite sums by explicitly incorporating the strong convexity parameter. The key feature of this reduction is its ability to ‘preserve’ the exponent of the iteration complexity from an order of $C(f)(L/\epsilon)^\alpha$ in the non-strongly convex case to an order of $C(f)\kappa^\alpha\ln(1/\epsilon)$ in the strongly convex case, where $C(f)$ denotes some quantity which may depend on $f$ but not on $k$, and $\alpha$ is some positive constant.
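A generic restarting scheme of this kind can be sketched as follows. This is not a transcription of Scheme 1 from [AS16b]; it is a minimal illustration under the assumption that the base algorithm needs at most $C\,(L\|w^{(0)}-w^*\|^2/\epsilon)^\alpha$ iterations to reach accuracy $\epsilon$. Under $\mu$-strong convexity, $\|w-w^*\|^2\le2(F(w)-F^*)/\mu$, so an epoch of $\lceil C(4\kappa)^\alpha\rceil$ iterations halves the suboptimality, and $\log_2(1/\epsilon)$ epochs give a total of order $\kappa^\alpha\ln(1/\epsilon)$. The names `restart`, `epoch_length` and `gradient_descent_quadratic` are hypothetical.

```python
import math


def epoch_length(C, kappa, alpha):
    """Iterations per epoch that suffice to halve the suboptimality, assuming
    the base complexity C * (L * ||w0 - w*||^2 / eps)^alpha."""
    return math.ceil(C * (4.0 * kappa) ** alpha)


def restart(base_algorithm, w0, C, kappa, alpha, num_epochs):
    """Run the base algorithm in epochs, each a fresh run started at the
    output of the previous one (no memory is carried across epochs)."""
    w = w0
    for _ in range(num_epochs):
        w = base_algorithm(w, epoch_length(C, kappa, alpha))
    return w


def gradient_descent_quadratic(w, num_iters, L=10.0, mu=1.0):
    """Toy base algorithm: GD with step 1/L on f(w) = (mu / 2) w^2,
    for which the smooth convex rate L * w0^2 / (2k) gives C = 1/2, alpha = 1."""
    for _ in range(num_iters):
        w = w - (mu * w) / L
    return w


if __name__ == "__main__":
    L, mu = 10.0, 1.0
    w = restart(gradient_descent_quadratic, 1.0, C=0.5, kappa=L / mu,
                alpha=1.0, num_epochs=5)
    print(0.5 * mu * w * w)  # suboptimality after 5 epochs
```

The epoch length depends on $\kappa$, which is how the scheme injects the strong convexity parameter into an algorithm that was designed without it.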

The proof goes as follows. Suppose is a stationary CLI optimization algorithm for $L$-smooth and convex finite sum problems equipped with oracles (13). Also, assume that its convergence rate for is of an order of , for some . First, observe that in this case we must have . For otherwise, we get , implying that, simply by scaling , one can optimize to any level of accuracy using at most iterations, which contradicts [AS16a, Theorem 5]. Now, by [AS16b, Lemma 1], Scheme 1 produces a new algorithm whose iteration complexity for smooth and strongly convex finite sums with condition number $\kappa$ is

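$$O\left(N+n^\gamma(L/\epsilon)^\alpha\right)\longrightarrow\tilde{O}\left(n+n^\gamma\kappa^\alpha\ln(1/\epsilon)\right). \tag{16}$$

Finally, stationary algorithms are invariant under this restarting scheme. Therefore, the new algorithm cannot depend on $\mu$. Thus, by Theorem 2, it must hold that $\alpha\ge1$ and that $\max\{N,n^\gamma\}=\Omega(n)$, proving the following.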

###### Theorem 3.

If the iteration complexity of a stationary optimization algorithm for smooth and convex finite sum problems equipped with a first-order and a coordinate-descent oracle is of the form of the l.h.s. of (16), then it must be at least $\Omega(n+L/\epsilon)$.

We note that this lower bound is tight and is attained by, e.g., SDCA.

### 3.4 A Tight Lower Bound for Smooth and Convex Finite Sums

We now turn to derive a lower bound for finite sum problems with smooth and convex individual functions using the restarting scheme shown in the previous section. Note that here we allow any oblivious optimization algorithm, not just stationary ones. The technique shown in Section 3.2 of reducing an optimization problem to a polynomial approximation problem was used in [AS16a] to derive lower bounds for various settings. The smooth and convex case was proved only for $n=1$, and a generalization for $n>1$ seems to reduce to a non-trivial approximation problem. Here, using Scheme 1, we are able to avoid this difficulty by reducing the non-strongly convex case to the strongly convex case, for which a lower bound for a general $n$ is known.

The proof follows the same lines as the proof of Theorem 3. Given an oblivious optimization algorithm for finite sums with smooth and convex individuals equipped with oracles (13), we again apply Scheme 1 to get an algorithm for the smooth and strongly convex case, whose iteration complexity is as in (16). Now, crucially, oblivious algorithms are invariant under Scheme 1 (that is, when applied on a given oblivious algorithm, Scheme 1 produces another oblivious algorithm). Therefore, using [AS16a, Theorem 2], we obtain the following.

###### Theorem 4.

If the iteration complexity of an oblivious optimization algorithm for smooth and convex finite sum problems equipped with a first-order and a coordinate-descent oracle is of the form of the l.h.s. of (16), then it must be at least

$$\Omega\left(n+\sqrt{\frac{nL}{\epsilon}}\right).$$

This bound is tight and is obtained by, e.g., accelerated SDCA [SSZ16]. Optimality in terms of $L$ and $\epsilon$ can be obtained simply by applying Accelerated Gradient Descent [Nes04], or alternatively, by using an accelerated version of SVRG as presented in [Nit16]. More generally, one can apply acceleration schemes, e.g., [LMH15], to get an optimal dependence on $\epsilon$.

#### Acknowledgments

We thank Raanan Tvizer and Maayan Maliach for several helpful and insightful discussions.

## Appendix A Technical Lemmas

###### Lemma 2.

Let $H_b(p)\coloneqq-p\log_2p-(1-p)\log_2(1-p)$ be the binary entropy function. Then, for any $p\in[0,1]$,

$$H_b(p)\ge1-4(p-1/2)^2.$$

###### Proof.

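First, note that the first two derivatives of $H_b$ are

$$H_b'(p)=\log_2(1-p)-\log_2p,\qquad H_b''(p)=-\frac{1}{\ln(2)\,p(1-p)}.$$

We show that the following function

$$\varphi(p)\coloneqq H_b(p)-\left(1-4\left(p-\tfrac{1}{2}\right)^2\right),$$

is non-negative on $[0,1]$ (note that, since $\varphi$ is continuous, it is bounded from below on $[0,1]$ and its minimum is attained either at an endpoint or at some local minimum in $(0,1)$). Let us locate all the extrema points of $\varphi$ in $(0,1)$. We have that,

$$\varphi'(p)=\log_2\left(\frac{1-p}{p}\right)+8\left(p-\tfrac{1}{2}\right).$$

Therefore, $\varphi'(1/2)=0$, and since

$$\varphi''(p)=-\frac{1}{\ln(2)\,p(1-p)}+8,$$

it follows that $\varphi''(1/2)=8-4/\ln(2)>0$, which implies that $p=1/2$ is a local minimum of $\varphi$. We claim that there are exactly two more extrema points of $\varphi$ which are in fact local maximum points. To this end, note that

$$\varphi''(p)\begin{cases}>0&|p-1/2|<c,\\=0&|p-1/2|=c,\\<0&|p-1/2|>c,\end{cases}$$

where $c\coloneqq\sqrt{1/4-1/(8\ln2)}$. Therefore, by Rolle’s Theorem, $\varphi'$ does not vanish in $(1/2-c,1/2+c)\setminus\{1/2\}$, and vanishes exactly once in $(0,1/2-c)$ and exactly once in $(1/2+c,1)$. Since $\varphi''$ is strictly negative in $(0,1/2-c)\cup(1/2+c,1)$, it follows that the other two stationary points of $\varphi$ are local maxima of $\varphi$. All in all, we have that if $p\in(0,1)$ is a local minimum of $\varphi$, then $p=1/2$, which implies that

$$\varphi(p)\ge\min\{\varphi(0),\varphi(1/2),\varphi(1)\}=0,$$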

concluding the proof. ∎
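The inequality of Lemma 2 and the sign pattern of $\varphi''$ used in the proof can be checked numerically on a grid. The sketch below is only a sanity check, not a substitute for the proof; the grid resolution and helper names are arbitrary choices.

```python
import math


def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with the convention 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def phi(p):
    """phi(p) = H_b(p) - (1 - 4 (p - 1/2)^2), claimed non-negative on [0, 1]."""
    return binary_entropy(p) - (1.0 - 4.0 * (p - 0.5) ** 2)


def phi_second(p):
    """Second derivative of phi on (0, 1)."""
    return -1.0 / (math.log(2.0) * p * (1.0 - p)) + 8.0


if __name__ == "__main__":
    grid = [i / 1000.0 for i in range(1001)]
    print(min(phi(p) for p in grid))          # smallest value on the grid
    c = math.sqrt(0.25 - 1.0 / (8.0 * math.log(2.0)))
    print(c, phi_second(0.5), phi_second(0.5 + c + 0.01))
```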

###### Lemma 3.

When applied on problem (14) with oracles (15), the coordinates of iterates produced by oblivious stochastic CLIs form polynomials in $\mu$ with random coefficients (which do not depend on $\mu$) and whose degrees do not exceed the iteration number.

###### Proof.

Let $\mathcal{A}$ be an oblivious stochastic CLI, and suppose we apply $\mathcal{A}$ on the class of problems (14) parametrized by $\mu$, using oracles (15). We use mathematical induction to show that for any $k$, the coordinates of the $k$’th iterate produced by such a process can be expressed as distributions over $\mathcal{P}_k$, where $\mathcal{P}_k$ denotes the set of all real polynomials with degree $\le k$.

As the first iterate is not allowed to depend on $\mu$, the base case is trivial. For the inductive step, assume that any coordinate of $w_i^{(k)}$ can be expressed as a distribution over $\mathcal{P}_k$. Now, for any $i\in J$, the oracle answers of

$$\begin{aligned}
\text{Generalized first-order oracle:}\quad&O(w_i^{(k)};A,B,c,i)=A(Q_\mu w_i^{(k)}-q)+Bw_i^{(k)}+c\\
\text{Steepest coordinate-descent oracle:}\quad&O(w_i^{(k)};j,i)=\left(I-\frac{1}{(Q_\mu)_{jj}}e_j(Q_\mu)_{j,*}\right)w_i^{(k)}+\frac{q_j}{(Q_\mu)_{jj}}e_j
\end{aligned} \tag{17}$$

form a distribution over $\mathcal{P}_{k+1}$, as the random quantities involved in the expressions (the oracle parameters $A,B,c$ and the indices $i,j$) do not depend on $\mu$ (due to obliviousness) and the rest of the terms are either constant or linear in $\mu$. Lastly, $w_i^{(k+1)}$ are computed by simply summing up all the oracle answers, and as such, form again distributions over $\mathcal{P}_{k+1}$. ∎

###### Lemma 4.

Let $s(\mu)$ be a real polynomial of degree $\le k$, and let $L>0$. Then, there exists $\delta>0$ such that for any $\mu\in(L-\delta,L)$ it holds that

$$|s(\mu)\mu+1|\ge(1-\mu/L)^{k+1}.$$

###### Proof.

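Assume for the sake of contradiction that for any $\delta>0$, there exists $\mu\in(L-\delta,L)$ such that

$$|s(\mu)\mu+1|<\left(1-\frac{\mu}{L}\right)^{k+1}.$$

Define

$$q(\mu)\coloneqq s(L(1-\mu))L(1-\mu)+1 \tag{18}$$

and denote the corresponding coefficients by $q_0,\dots,q_{k+1}$. We show by induction that $q_m=0$ for all $m=0,\dots,k$. For $m=0$ we have that since for any $\delta>0$ there exists some $\hat{\mu}\in(0,\delta)$ such that

$$|q(\hat{\mu})|<\left(1-\frac{L(1-\hat{\mu})}{L}\right)^{k+1}=\hat{\mu}^{k+1},$$

it holds, by continuity, that

$$|q_0|=|q(0)|=\left|\lim_{\mu\to0^+}q(\mu)\right|\le\lim_{\mu\to0^+}\mu^{k+1}=0.$$

Now, if $q_0=\dots=q_{m-1}=0$ for $m\le k$ then

$$|q_m|=\left|\lim_{\mu\to0^+}\frac{q(\mu)}{\mu^m}\right|\le\lim_{\mu\to0^+}\mu^{k+1-m}=0.$$

Thus, proving the induction claim. This, in turn, implies that $q(\mu)=q_{k+1}\mu^{k+1}$. Now, by Equation (18), it follows that $q(1)=1$. Hence, $q_{k+1}=1$. Lastly, using Equation (18) again yields

$$s(\mu)\mu+1=q\left(1-\frac{\mu}{L}\right)=\left(1-\frac{\mu}{L}\right)^{k+1},$$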

which contradicts our assumption, thus concluding the proof. ∎

### References

1. [AS16a] Yossi Arjevani and Ohad Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016.
2. [AS16b] Yossi Arjevani and Ohad Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In Proceedings of the 33rd International Conference on Machine Learning, pages 908–916, 2016.
3. [AS16c] Yossi Arjevani and Ohad Shamir. Oracle complexity of second-order methods for finite-sum problems. arXiv preprint arXiv:1611.04982, 2016.
4. [AWBR09] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
5. [AZY16] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. Technical report, arXiv preprint, 2016.
6. [Ber97] Dimitri P Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
7. [BHG07] Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.
8. [BM16] Alberto Bietti and Julien Mairal. Stochastic optimization with variance reduction for infinite datasets with finite-sum structure. arXiv preprint arXiv:1610.00970, 2016.
9. [CT12] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
10. [DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
11. [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
12. [Lan15] Guanghui Lan. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.
13. [LCB07] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, pages 301–320, 2007.
14. [LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
15. [Nes04] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
16. [Nes12] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
17. [Nit16] Atsushi Nitanda. Accelerated stochastic gradient descent for minimizing finite sums. In Artificial Intelligence and Statistics, pages 195–203, 2016.
18. [RR11] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. Information Theory, IEEE Transactions on, 57(10):7036–7056, 2011.
19. [SLRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
20. [SS15] Shai Shalev-Shwartz. SDCA without duality. arXiv preprint arXiv:1502.06177, 2015.
21. [SSZ13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
22. [SSZ16] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
23. [WS16] Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.