# Parallelization does not Accelerate Convex Optimization: Adaptivity Lower Bounds for Non-smooth Convex Minimization

Eric Balkanski
Harvard University
ericbalkanski@g.harvard.edu
Yaron Singer
Harvard University
yaron@seas.harvard.edu
###### Abstract

In this paper we study the limitations of parallelization in convex optimization. A convenient approach to study parallelization is through the prism of adaptivity which is an information theoretic measure of the parallel runtime of an algorithm. Informally, adaptivity is the number of sequential rounds an algorithm needs to make when it can execute polynomially-many queries in parallel at every round. For combinatorial optimization with black-box oracle access, the study of adaptivity has recently led to exponential accelerations in parallel runtime and the natural question is whether dramatic accelerations are achievable for convex optimization.

Our main result is a spoiler. We show that, in general, parallelization does not accelerate convex optimization. In particular, for the problem of minimizing a non-smooth Lipschitz and strongly convex function with black-box oracle access, we give information theoretic lower bounds showing that the number of adaptive rounds required by any randomized algorithm exactly matches the upper bounds of single-query-per-round (i.e. non-parallel) algorithms.

## 1 Introduction

In this paper we study the limitations of parallelization in convex optimization. Since applications of convex optimization are ubiquitous across machine learning and data sets become larger, there is consistent demand for accelerating convex optimization. For over 40 years computer science has formally studied acceleration of computation via parallelization [F78, G78, S79]. Our goal in this paper is to study whether parallelization can generally accelerate convex optimization.

A convenient approach to study parallelization is through the prism of adaptivity. Adaptivity is an information theoretic measure of the parallel runtime of an algorithm, which in many cases also translates to a computational upper bound, up to lower order terms. Informally, adaptivity is the number of sequential rounds of an algorithm where every round allows for polynomially-many parallel queries. Adaptivity is studied across a wide variety of areas, including sorting [Val75, Col88, BMW16], communication complexity [papadimitriou1984communication, duris1984lower, nisan1991rounds], multi-armed bandits [AAAK17], sparse recovery [HNC09, IPW11, haupt2009compressive], and property testing [CG17, chen2017settling]. In the celebrated PRAM model adaptivity is the depth of the computation tree. More generally, in any parallel computation model, adaptivity lower bounds the runtime of algorithms that make polynomially-many parallel queries.

For combinatorial optimization, the study of adaptivity has recently led to dramatic accelerations in parallel runtime. For the canonical problem of maximizing a monotone submodular function under a cardinality constraint, a recent line of work initiated in [BS18] introduces techniques that produce exponential speedups in parallel runtime. Until very recently the best known adaptivity (and hence best parallel running time) for obtaining constant factor approximations for a submodular function was linear in . In contrast, [BS18] and the line of work that follows [BS18b, BBS18, EN18, BRS18, fahrbach2018submodular, chekuri2018submodular] achieve constant, and even optimal, approximation guarantees in adaptive steps.

For convex minimization, in some special cases, parallelization provides non-trivial speedups. An important example that is well studied in machine learning is when the objective function is decomposable, i.e. when . In this case parallelism allows computing stochastic subgradients of the convex functions simultaneously during iterations of stochastic gradient descent [DGSX12, DBW12, RRWN11]. Another special case is when the function is low dimensional. Recently, Duchi et al. showed that when the number of queries can be exponential in the dimension , parallelization can accelerate minimization when is either Lipschitz convex or strongly convex and strongly smooth [duchi2018minimax]. The natural question is whether algorithms that can execute function evaluations in each iteration can achieve faster convergence rates than those that make a single evaluation in every iteration.

Can parallelization accelerate convex optimization?

### 1.1 Main result

Our main result is a spoiler. We show that, in general, parallelization does not accelerate convex optimization. In particular, for the problem of minimizing a Lipschitz and strongly convex function over a convex body, we give a tight lower bound showing that even when queries can be executed in parallel, no randomized algorithm has a convergence rate better than that achievable with a one-query-per-round algorithm [nesterov2013introductory, shamir2013stochastic].

###### Theorem 1.

For any and , there exists a family of convex functions with for all subgradients of all , such that for any $r$-adaptive algorithm $\mathcal{A}$, there exists $f$ for which $\mathcal{A}$ returns $x_r$ such that

$$f(x_r) - \min_{x \in W} f(x) \ge GD\left(\frac{1}{2\sqrt{r+1}} - \frac{(r+1/2)\log n}{\sqrt{n}}\right)$$

with probability over the randomization of $\mathcal{A}$ and with domain $W$ of diameter $D$. In particular, for any ,

$$f(x_r) - \min_{x \in W} f(x) \ge GD \cdot (1 - o(1)) \cdot \frac{1}{2\sqrt{r+1}}.$$

This convergence rate matches (up to lower order terms) the convergence rate of standard sequential algorithms for Lipschitz convex functions [nesterov2013introductory, shamir2013stochastic].

###### Theorem 2.

For any and , there exists a family of $\lambda$-strongly convex functions , with for all subgradients of all over domain , such that for any $r$-adaptive algorithm $\mathcal{A}$, there exists $f$ for which $\mathcal{A}$ returns $x_r$ such that

$$f(x_r) - \min_{x \in W} f(x) \ge \frac{G^2}{\lambda}\left(\frac{1}{8(r+1)} - \sqrt{\frac{r+1}{n}}\,\frac{\log n}{2}\right)$$

with probability over the randomization of $\mathcal{A}$ and with the box as domain $W$. In particular, for any ,

$$f(x_r) - \min_{x \in W} f(x) \ge \frac{G^2}{\lambda} \cdot (1 - o(1)) \cdot \left(\frac{1}{8(r+1)}\right).$$

Again, this convergence rate matches (up to lower order terms) the convergence rate of standard sequential algorithms with one query per round for $\lambda$-strongly convex functions [nesterov2013introductory, shamir2013stochastic].

##### Some remarks.

The lower bounds hold for both deterministic and randomized algorithms for optimizing non-stochastic functions . The lower bounds thus trivially hold for the stochastic case as well, since it is strictly harder. Similarly, these lower bounds also hold for decomposable convex functions since a decomposable function composed of a single function is a special case. The lower bounds hold with high probability over the randomization of the algorithm, and trivially also hold in expectation with an additional multiplicative term. Finally, the lower bounds hold for both zeroth and first-order oracles. We present these lower bounds for zeroth-order oracles, which extend to first-order oracles since first-order oracles can be obtained from zeroth-order oracles when poly-many queries are allowed per round by querying a small ball around the point of interest in a single round.

### 1.2 Technical overview

Previous hardness constructions for convex optimization that bound the query complexity do not apply since they assume one query per round and at most linearly-many total queries. In particular these constructions break, even against -adaptive algorithms with a super-linear number of queries. For this reason we introduce a novel class of families of functions that are hard to optimize for not only one but queries per round. As such, this requires a new framework to argue about the information theoretic limitations of algorithms when given access to polynomially-many queries in each round.

We begin by describing a simple class of functions that can be used to show the lower bound. We then reduce the problem to showing that the class of functions we define respects two conditions: indistinguishability and gap. The main technical challenge is in proving that the family of functions we construct satisfies these indistinguishability and gap conditions.

Satisfying these two conditions requires finding, for any algorithm , two functions and in the family of functions which have equal value over queries by but different optima. The main difficulty is that we pick and depending on the queries of the algorithm , but can learn partial information about and from those queries. Thus, the queries of the algorithm are dependent on and , which creates a cycle of dependence.

The main conceptual part of the analysis is in finding such and . We do so using an oracle which is defined adaptively and over multiple rounds as it receives queries from an algorithm . We call such an oracle which is dependent on an algorithm an obfuscating oracle. This construction contains multiple subtleties due to the complex dependencies between and . Showing the existence of an obfuscating oracle with the desired properties is also non-trivial. It requires a probabilistic argument that derandomizes an algorithm by showing that it is sufficient to argue about properties of a deterministic query by an algorithm to a random function instead of random queries to a deterministic function.

### 1.3 Related work

The study of the hardness of convex optimization was initiated in the seminal work of [nemirovskii1983problem] which introduced the standard model for lower bounds in convex optimization (see [nesterov2013introductory, bubeck2015convex] for a simplified presentation). In that model, there is a black-box oracle for a convex function such that the algorithm queries points and receives answers from the oracle. There is a rich line of work on information theoretic lower bounds for the number of sequential queries needed for convex optimization in the setting where the oracle is stochastic, e.g. [raginsky2009information, raginsky2011information], or non-stochastic, e.g. [agarwal2009information, woodworth2016tight, braun2017lower]. In this paper, we consider the basic case where the oracle is not stochastic and note that any lower bound in the non-stochastic setting trivially extends to the stochastic setting. Since the standard model for lower bounds in convex optimization uses a black-box oracle access setting, adaptivity is well-suited for the study of lower bounds for parallel convex optimization.

Very recent work has obtained lower bounds on the convergence rates of adaptive algorithms for convex optimization [smith2017interaction, woodworth2018graph, duchi2018minimax]. The exact settings vary, but the high level goal is the same as ours, which is to derive convergence rates for algorithms which allow multiple parallel queries in every round. We give lower bounds which improve over these previous lower bounds. In particular, the convergence rates in [duchi2018minimax] exponentially decrease in the dimension . The lower bounds in [woodworth2018graph] have a dependence term where is the number of queries, while our lower bounds are independent of and we only assume that the number of queries is at most . Motivated by applications to local differential privacy, [smith2017interaction] obtained lower bounds on the convergence rate that have an exponential dependence on the number of rounds , while we obtain the optimal and rates. Also related to adaptivity for convex optimization is the work of [perchet2016batched], which studies adaptivity in a bandit setting and obtains regret bounds for strategies that can be updated only a limited number of times.

Non-adaptivity, i.e. -adaptive algorithms, has been studied for convex optimization in [BS17a], where it has been shown that there is no algorithm that can obtain even a constant error using fewer than exponentially-many (in the dimension ) samples drawn from any distribution. This hardness result for non-adaptive algorithms for convex optimization motivated our study of algorithms with rounds of adaptivity. More generally, non-adaptive algorithms have also been studied for combinatorial optimization to study the power and limitations of algorithms whose input is learned from observational data [balkanski2017limitations, BS17, BIS17, BRS16, RBGS18].

The adaptivity of an algorithm is the number of sequential rounds of queries it makes, where every round allows for parallel queries where is the dimension of the problem.

###### Definition.

Given an oracle , an algorithm is -adaptive if every query to the oracle occurs at a round such that is independent of the answers to all other queries at round .

We note that the definition is stated for zero-order oracles (given the oracle returns ), but as previously mentioned, we emphasize that it is equivalent to assuming first-order oracle access since queries to are allowed in every round, and thus subgradients can be obtained in one round of querying .
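To make the round structure concrete, here is a minimal Python sketch of the interaction model, assuming a zeroth-order oracle. The names `run_adaptive`, `choose_queries`, `sum_of_squares`, and `shrinking_axes` are hypothetical and do not come from the paper. The only property the sketch is meant to show is that the batch of a round is fixed before any answer from that round is seen.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Round = List[Tuple[Point, float]]  # (query, answer) pairs from one round


def run_adaptive(
    oracle: Callable[[Point], float],
    choose_queries: Callable[[List[Round]], List[Point]],
    rounds: int,
) -> List[Round]:
    """Run `rounds` adaptive rounds against a zeroth-order oracle.

    `choose_queries` only sees answers from earlier rounds, so every query
    in a batch is independent of the other answers in the same batch.
    """
    history: List[Round] = []
    for _ in range(rounds):
        batch = choose_queries(history)       # fixed before any answer arrives
        answers = [oracle(x) for x in batch]  # conceptually evaluated in parallel
        history.append(list(zip(batch, answers)))
    return history


# Hypothetical usage: two queries per round on f(x) = sum of squares.
def sum_of_squares(x: Point) -> float:
    return sum(v * v for v in x)


def shrinking_axes(history: List[Round]) -> List[Point]:
    scale = 0.5 ** len(history)  # depends only on how many rounds have passed
    return [[scale, 0.0], [0.0, scale]]
```

Calling `run_adaptive(sum_of_squares, shrinking_axes, 3)` performs three rounds of two queries each. The number of rounds, not the number of queries, is what the adaptivity of the procedure counts.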

### 1.5 Paper organization

In Section 2, we construct the family of Lipschitz convex functions that is hard to optimize in adaptive rounds of queries and present two simple sufficient conditions, called the indistinguishability and gap conditions, on a class of functions to obtain the lower bound. In Section 3, we present the obfuscating oracle, which is used to find two functions in the hard family of functions that satisfy these two conditions. We show that these two functions satisfy the indistinguishability and gap conditions in Section 4. Finally, in Section 4.5, we extend the construction and the lower bound to -strongly convex functions.

## 2 The Construction of the Hard Family of Functions

In this section, we construct the family of functions that cannot be optimized in rounds of queries. We then describe two simple conditions that together are sufficient for showing the hardness of optimizing a class of functions in rounds.

### 2.1 The hard family of functions

We give the construction of the family of Lipschitz convex functions for the lower bound for -adaptive algorithms. In Section 4.5, we extend this construction to obtain a family of functions which is Lipschitz and -strongly convex. The functions are parameterized by a binary vector and optimized over domain , which is the box of diameter . For a vector (and similarly for ), we often break into blocks of consecutive entries of , where

$$x_i := x\left[(i-1)\tfrac{n}{r+1} + 1 : i\tfrac{n}{r+1}\right] \in \mathbb{R}^{n/(r+1)}.$$

The functions are in terms of some which we later define. Formally, the function is defined as

$$f_y(x) := \gamma \cdot \max_{i \in [r+1]} \left(x_i^\top y_i - 2i\epsilon\right)$$

where . The family of functions for which we show a lower bound for -adaptive algorithms is

$$\mathcal{F}_r := \{f_y : y \in \{-1,1\}^n\}.$$

We discuss some informal intuition for the hardness of optimizing in -adaptive rounds before giving the formal argument. The main idea behind these functions is that an algorithm needs to learn all , , to optimize within good accuracy, but that it cannot learn before round . The reason is that for a query by an algorithm at any round , if and concentrate, i.e., and , then

$$x_j^\top y_j - 2j\epsilon > x_i^\top y_i - 2i\epsilon.$$

Note that by the definition of , conditioned on , the value of is independent of and the algorithm does not learn . Informally, if an algorithm has not yet learned at some round , is likely to concentrate for the queries by this algorithm at round .

Observe that a minimizer for over is such that

$$x^\star_j = \begin{cases} \frac{D}{2\sqrt{n}} & \text{if } y_j = -1 \\ -\frac{D}{2\sqrt{n}} & \text{if } y_j = 1 \end{cases}$$

If an algorithm cannot learn in -adaptive rounds, then is likely to concentrate for the solution returned by the algorithm. If concentrates, then is a bad solution compared to .
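The definition of $f_y$ and of its minimizer can be sketched in a few lines of Python. This is an illustration under stated assumptions: the values of $\gamma$ and $\epsilon$ are passed in as free parameters `gamma` and `eps` rather than fixed, and the helper names `dot`, `blocks`, `f_y`, and `minimizer` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def blocks(v, r):
    """Split v into r + 1 consecutive blocks of equal length n / (r + 1)."""
    m = len(v) // (r + 1)
    return [v[i * m:(i + 1) * m] for i in range(r + 1)]


def f_y(x, y, r, gamma, eps):
    """gamma * max_i (x_i^T y_i - 2 * i * eps), with blocks indexed i = 1..r+1."""
    return gamma * max(
        dot(xi, yi) - 2 * i * eps
        for i, (xi, yi) in enumerate(zip(blocks(x, r), blocks(y, r)), start=1)
    )


def minimizer(y, D):
    """Box corner with x_j = D / (2 sqrt(n)) if y_j = -1, else -D / (2 sqrt(n))."""
    c = D / (2 * math.sqrt(len(y)))
    return [-c * yj for yj in y]
```

For example, with `y = [1, -1, -1, 1]`, `r = 1`, `D = 2.0`, `gamma = 1.0`, and `eps = 0.25`, the point `minimizer(y, D)` is `[-0.5, 0.5, 0.5, -0.5]`. Every block inner product equals $-1$ there, so the maximum is attained by the first block, which carries the smallest penalty $2\epsilon$.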

### 2.2 Two sufficient conditions for hardness

We reduce the analysis of the lower bound for to showing that for any algorithm , there exists that satisfy two simple conditions. Informally, the first condition, called indistinguishability, states that the functions and have, with high probability, equal value over all queries of the algorithm . On the other hand, the second condition, called -gap, states that there is no solution which simultaneously -approximates the optimal solutions of both functions and . It is easy to show that if a class of functions contains two such functions and for any algorithm , then is hard to optimize since and need to be distinguished for an algorithm to have good performance.

###### Theorem 3.

Let be some algorithm for a class of functions . Assume there exists with the properties of

• Indistinguishability: for all rounds , let be the queries at round by , which are adaptive to the answers by to queries by to at rounds . Then, with probability over , for all ,

$$f_y(x) = f_z(x).$$
• Gap: Minimizers for $f_y$ and $f_z$ have equal value, i.e.,

$$\min_{x \in W} f_y(x) = \min_{x \in W} f_z(x),$$

but for all $x \in W$:

$$\max(f_y(x), f_z(x)) - \min_{x \in W} f_y(x) > \alpha.$$

Then, there is no -adaptive algorithm that finds for all , with probability strictly larger than over the randomization of the algorithm, a solution s.t. .

###### Proof.

Consider an algorithm for . Let be the functions satisfying the indistinguishability and -gap conditions.

Pick the function oracle to be either or with probability each. By the indistinguishability property, with probability over the queries , the answers of the oracle to all queries by are independent of whether the oracle is for or . Thus, the decisions by the algorithm are independent of the randomization over and . By the gap condition, with probability over the algorithm, we conclude that the algorithm returns a (possibly randomized) such that either or . ∎

## 3 The Obfuscating Oracle

In this section, we construct two functions which satisfy the indistinguishability and gap conditions for Theorem 3. This construction relies on a tool called an obfuscating oracle. The definition and construction of an obfuscating oracle for is the main conceptual part of the analysis. Recall that to obtain the two desired conditions, we need to show that for any -adaptive algorithm , there exist two functions that have equal value over all queries by but do not have a common minimizer.

We start with a high level overview of the structure of this pair of functions . Recall that a function in is defined by a binary vector broken into vectors . For our construction of and , for but . The identical first blocks imply the indistinguishability condition and the different last block implies the gap condition. More precisely, we wish to pick such that for all queries at round , for all . Note that by the definition of , this implies that and thus indistinguishability. Intuitively, the consequence is that algorithm does not learn before round , and in particular does not distinguish and at the end of the rounds of the algorithm.

An important subtlety which complicates the analysis is that algorithm can learn some information about before round . This is because with query at round , with the answer of the oracle, learns that . Thus, we cannot argue that does not learn any information about and that queries at round are completely independent of . The remainder of this section is devoted to finding such that and , where for and , satisfy the indistinguishability and gap conditions. The main difficulty is that we wish to pick depending on the queries of the algorithm , but since is adaptive and it can learn partial information about , the queries of the algorithm are dependent on , which creates a cycle of dependence. Some subtle dependencies between and are needed.

### 3.1 The definition of an obfuscating oracle

Instead of a function oracle which is defined before the algorithm starts querying , an obfuscating oracle is an oracle which is adaptively defined as it interacts with the queries of an algorithm. In particular, the answers of an obfuscating oracle might be dependent on the previous queries by and on the round in which a query occurs, which is of course not possible for a function oracle.

In our case, we construct an obfuscating oracle which, similarly to function oracles in , depends on points . The main idea is to define point for obfuscating oracle depending on the queries of at rounds , as illustrated in Figure 1. Deferring the choice of for until round of is the key part of the obfuscating oracle which allows us to argue about indistinguishability, and involves important subtleties which are discussed in Section 3.2 where we formally construct the obfuscating oracle. First, we formally define obfuscating oracles. An obfuscating oracle to is an oracle that is defined by its interactions with .

###### Definition 1.

Let be some algorithm. An obfuscating oracle is defined inductively on the th round of queries by to . At round , is undefined for all . At round , we assume that is defined for all queries at rounds . We consider queries at round by algorithm , which has received answers for all queries at rounds from the oracle. For all , is defined independently of all queries at rounds .

Next, we formalize an obfuscating condition which implies the indistinguishability property. The obfuscating condition states that, with high probability, there exist two functions in which have equal value with over all queries by .

###### Lemma 1.

Assume that for all -adaptive algorithms , there exists, with probability , an obfuscating oracle such that, for some ,

• Obfuscating condition: for all queries by to obfuscating oracle ,

$$f_{\mathcal{A}}(x) = f_y(x) = f_z(x).$$

Then and satisfy the indistinguishability property.

###### Proof.

Assume that the obfuscating condition holds, which occurs with probability . We show that and satisfy the indistinguishability property by induction on the round . Consider round and assume that for all queries by to function oracle from previous rounds, . Since and have equal value over all previous queries, the algorithm cannot distinguish between if it is querying or . Thus, the queries by at round to function oracle are identical to the queries by at round to obfuscating oracle . By the obfuscating condition, for all , we have that . Since this holds with probability for all rounds , we get that and satisfy the indistinguishability property. ∎

### 3.2 The construction of the obfuscating oracle

We construct an obfuscating oracle for . Let be any -adaptive algorithm. We construct inductively on the round of queries. At round , let be the (possibly randomized) collection of queries by , after having received answers from for queries at rounds . Let be the points previously constructed by . Then,

• Let be a binary vector such that

$$\Pr_{x \in \cup_{j=1}^{i} Q_j}\left[|x_i^\top y_i| < \epsilon, \text{ for all } x \in \cup_{j=1}^{i} Q_j\right] = 1 - n^{-\omega(1)},$$

i.e., for all queries , concentrates with high probability.

• The obfuscating oracle at round answers, for all ,

$$f_{\mathcal{A}}(x) = \gamma \cdot \max_{j \in [i]} \left(x_j^\top y_j - 2j\epsilon\right).$$

Finally, let be a binary vector such that

$$\Pr_{x \in \cup_{j=1}^{r} Q_j}\left[|x_{r+1}^\top y_{r+1}| < \epsilon, \text{ for all } x \in \cup_{j=1}^{r} Q_j\right] = 1 - n^{-\omega(1)}.$$

Thanks to the obfuscating oracle, we are now ready to define and for the two functions and for which we show the indistinguishability and gap conditions. The function is defined by the blocks constructed by the obfuscating oracle and is defined by the blocks such that for and .
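The construction can be mimicked in code to see how deferring the choice of each block works. The following Python sketch is an illustration only and makes simplifying assumptions that are not in the paper: it picks each block by rejection sampling after seeing the realized queries (whereas the paper proves existence of a suitable block through a probabilistic argument over the distribution of the queries), it treats $\gamma$ and $\epsilon$ as free parameters, and it takes the last block of $z$ to be the negation of the last block of $y$. The names `ObfuscatingOracle`, `answer_round`, `finish`, and `f_blocks` are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def f_blocks(x, ys, gamma, eps):
    """gamma * max_j (x_j^T y_j - 2 j eps) over the blocks in ys (j = 1..len(ys))."""
    m = len(ys[0])
    return gamma * max(
        dot(x[j * m:(j + 1) * m], yj) - 2 * (j + 1) * eps
        for j, yj in enumerate(ys)
    )


class ObfuscatingOracle:
    def __init__(self, n, r, gamma, eps, seed=0, max_tries=10_000):
        assert n % (r + 1) == 0, "n must split into r + 1 equal blocks"
        self.m, self.r = n // (r + 1), r
        self.gamma, self.eps = gamma, eps
        self.rng = random.Random(seed)
        self.max_tries = max_tries
        self.ys = []    # blocks y_1, ..., y_i fixed so far
        self.seen = []  # every query from rounds 1..i

    def _fix_next_block(self):
        """Sample sign vectors until the new block concentrates on all seen queries."""
        j = len(self.ys)
        for _ in range(self.max_tries):
            y = [self.rng.choice((-1, 1)) for _ in range(self.m)]
            if all(abs(dot(x[j * self.m:(j + 1) * self.m], y)) < self.eps
                   for x in self.seen):
                self.ys.append(y)
                return
        raise RuntimeError("no concentrating block found; eps may be too small")

    def answer_round(self, queries):
        """Round i: fix y_i, then answer with the max over the first i blocks only."""
        assert len(self.ys) < self.r, "all r rounds already answered"
        self.seen.extend(queries)
        self._fix_next_block()
        return [f_blocks(x, self.ys, self.gamma, self.eps) for x in queries]

    def finish(self):
        """After round r: fix y_{r+1}; z equals y except for a negated last block."""
        assert len(self.ys) == self.r, "call answer_round exactly r times first"
        self._fix_next_block()
        y = [s for block in self.ys for s in block]
        z = y[:-self.m] + [-s for s in self.ys[-1]]
        return y, z
```

When every inner product stays below `eps` in absolute value, the answers given during the rounds coincide with the values of both finished functions on the same queries, because the penalty $2j\epsilon$ makes later blocks strictly smaller than earlier ones.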

The crucial part of is that the maximum is over instead of . There are multiple subtleties and difficulties with the above construction of the obfuscating oracle, which we discuss next.

• The definition of at round must be carefully constructed to not contradict an answer of to a query from a round . In other words, since answered to a query at round , we wish to have such that

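$$\gamma \cdot \max_{\ell \in [i]} \left(x_\ell^\top y_\ell - 2\ell\epsilon\right) = \gamma \cdot \max_{\ell \in [j]} \left(x_\ell^\top y_\ell - 2\ell\epsilon\right)$$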

for the obfuscating condition. It is for this reason that is defined so that the concentration of not only holds for queries at round , but for all queries at rounds .

• The obfuscating oracle does not always construct such that for for all queries . This is because the concentration property of only holds with high probability over the queries of the algorithm. Thus, for a query at round answered with by , only holds with high probability and the answers of might not correspond to a function in . This even implies that for the same query at two different rounds, might answer differently.

• Note that for all , the queries at round are not independent of and is not independent of . Thus, there are multiple layers of dependencies between and .

• Finally, and most importantly, it is not trivial that there exists satisfying the concentration condition for the definition of at round . Showing that for any randomized queries at rounds by , there exists such that, with high probability, for all these queries , concentrates is an important part of the analysis which is shown in Lemma 2.

We use the term , and more precisely for our purposes, to apply union bounds over at most events each happening with probability . This is useful since the number of queries is at most .

## 4 Proof of the Main Theorem

In this section, we show that the functions and constructed in the previous section satisfy the indistinguishability and gap conditions. The indistinguishability condition is satisfied by showing that the obfuscating oracle with and from the previous section satisfies the obfuscating condition. The main hardness result then immediately follows by Theorem 3. In Section 4.1, we show the existence of the blocks needed for the obfuscating oracle, which is the main technical part of this section. Then, we show the obfuscating condition in Section 4.2 and the gap condition in Section 4.3. We bound the subgradients of any function in Section 4.4. Finally, we conclude with the main result in Section 4.5.

### 4.1 The existence of yi for the obfuscating oracle

We show the existence of the blocks needed for the obfuscating oracle defined in Section 3.

###### Lemma 2.

For any and randomized collection of queries , there exists such that

$$\Pr_{x \in \cup_{j=1}^{i} Q_j}\left[|x^\top y_i| < \epsilon, \text{ for all } x \in \cup_{j=1}^{i} Q_j\right] = 1 - n^{-\omega(1)}.$$

The remainder of Section 4.1 is devoted to proving Lemma 2. The main idea of the proof is that instead of considering all possible randomized collections of queries , we consider a random where is the uniform distribution over all binary vectors . The next lemma switches the randomization from the queries to the randomization of . This is done by reducing the problem of showing the claim for any randomized collection of queries to showing the claim for any and random . This is useful since standard concentration bounds apply more easily to than to .

###### Lemma 3.

Assume that for all , w.p. over , we have that . Then, for any (possibly randomized) collection of queries , there exists a deterministic such that with probability over the randomization of , for all queries , .

###### Proof.

We denote by the event that for all . Let be a randomized collection of queries and let be such that for all , w.p. over the randomization of , we have that .

Let be any realization of the randomized collection of queries . By a union bound over the queries , we obtain

$$\max_{y_i \in \{-1,1\}^{n/(r+1)}} \Pr_{Q}[I(y_i, Q)] \ge \Pr_{y_i \sim U} \Pr_{Q}[I(y_i, Q)] \ge 1 - n^{-\omega(1)}$$

and thus there exists some such that w.p. over the randomization of , for all queries , . ∎
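The averaging step used in this proof (the best fixed sign vector does at least as well as a uniformly random one) can be checked by brute force on a toy instance. The sketch below is an illustration under assumptions that are not in the paper: the randomized collection of queries is modeled as a uniform choice among a short explicit list of realizations, and the helper names `success_prob` and `best_vs_average` are hypothetical.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def success_prob(y, query_sets, eps):
    """Pr over a uniformly random collection Q that |x^T y| < eps for all x in Q."""
    good = sum(all(abs(dot(x, y)) < eps for x in Q) for Q in query_sets)
    return good / len(query_sets)


def best_vs_average(query_sets, m, eps):
    """Return (best y, its success probability, average success probability)."""
    probs = {y: success_prob(y, query_sets, eps)
             for y in product((-1, 1), repeat=m)}
    y_best = max(probs, key=probs.get)
    return y_best, probs[y_best], sum(probs.values()) / len(probs)
```

Since a maximum is never smaller than an average, `best_vs_average` always returns a best probability at least as large as the average one. This is the sense in which a guarantee for a random sign vector yields a deterministic one.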

Before showing the condition needed for Lemma 3, we state the following version of Hoeffding’s inequality.

###### Lemma 4 (Hoeffding’s inequality).

Let be independent random variables with values in . Let and . Then for every ,

$$\Pr[|S - \mu| \ge t] \le 2e^{-\frac{2t^2}{n(b-a)^2}}.$$
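As a quick numerical sanity check of this inequality in the setting used here (inner products of a fixed vector with uniformly random signs, so that the mean is zero), the following Python sketch compares an empirical tail frequency with the bound. The function names, the seed, and the number of trials are illustrative assumptions.

```python
import math
import random


def hoeffding_bound(n, t, a, b):
    """Right-hand side 2 * exp(-2 t^2 / (n (b - a)^2)) of Hoeffding's inequality."""
    return 2 * math.exp(-2 * t ** 2 / (n * (b - a) ** 2))


def empirical_tail(x, t, trials, seed=0):
    """Fraction of uniformly random sign vectors y with |x^T y| >= t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(xj * rng.choice((-1, 1)) for xj in x)
        hits += abs(s) >= t
    return hits / trials
```

For instance, with `x = [1.0] * 100` each term lies in $[-1, 1]$, so `hoeffding_bound(100, 30, -1.0, 1.0)` evaluates $2e^{-4.5}$, and the empirical frequency returned by `empirical_tail(x, 30, 2000)` should fall below it.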

Next, we show the condition needed for Lemma 3, namely that for all , w.p. over , concentrates. This follows from a straightforward application of Hoeffding’s inequality.

###### Lemma 5.

For all , with probability over , we have that

###### Proof.

Consider with . By ignoring indices such that and considering indices such that with probability each independently, is the sum of independent random variables with values in and expected value , by Hoeffding’s inequality (Lemma 4), we get

 Pryi∼U[|x⊺yi|<ϵ] =Pryi∼U[|x⊺yi|

Lemma 2 then follows immediately from Lemmas 3 and 5.

###### Proof of Lemma 2.

We combine Lemma 3 and Lemma 5. ∎

### 4.2 The obfuscating condition

We show that the obfuscating oracle together with and defined in Section 3 satisfy the obfuscating condition. By Lemma 1, this implies the indistinguishability condition for and . The main idea to show the obfuscating condition for queries at round is that for any , concentrates with high probability.

###### Lemma 6.

Let be an -adaptive algorithm, then , , and satisfy the obfuscating condition: with probability , for all rounds and all queries by at round to obfuscating oracle ,

$$f_{\mathcal{A}}(x) = \gamma \cdot \max_{j \in [i]} \left(x_j^\top y_j - 2j\epsilon\right) = f_y(x) = f_z(x).$$
###### Proof.

Consider round of the algorithm querying the obfuscating oracle . By definition of , for ,

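$$\Pr_{x \in \cup_{\ell=1}^{j} Q_\ell}\left[\left|x_j^\top y_j\right| < \epsilon, \text{ for all } x \in \cup_{\ell=1}^{j} Q_\ell\right] = 1 - n^{-\omega(1)}.$$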

In particular, this implies that for , we have

$$\Pr_{x \in Q_i}\left[\left|x_j^\top y_j\right| < \epsilon, \text{ for all } x \in Q_i\right] = 1 - n^{-\omega(1)}.$$

By a union bound, we get

$$\Pr_{x \in Q_i}\left[\left|x_j^\top y_j\right| < \epsilon, \text{ for all } x \in Q_i \text{ and for all } j \ge i\right] = 1 - n^{-\omega(1)}.$$

Assume that . This implies that

$$x_i^\top y_i - 2i\epsilon > x_j^\top y_j - 2j\epsilon$$

for all and for all . If then it is also the case that . Thus,

$$\gamma \cdot \max_{\ell \in [i]} \left(x_\ell^\top y_\ell - 2\ell\epsilon\right) = f_y(x) = f_z(x).$$

### 4.3 The gap condition

We show that and satisfy the -gap condition with . The main observation for the gap condition is that for all and ,

$$\max\left(x_{r+1}^\top y_{r+1}, x_{r+1}^\top z_{r+1}\right) = \max\left(x_{r+1}^\top y_{r+1}, -x_{r+1}^\top y_{r+1}\right) \ge 0.$$

Thus, for all there is no which is a good solution to both and .

###### Lemma 7.

Assume . For any , and for all ,

$$\max(f_y(x), f_z(x)) - \min_x f_y(x) \ge GD\left(\frac{1}{2\sqrt{r+1}} - \frac{(r+1/2)\log n}{\sqrt{n}}\right).$$
###### Proof.

A minimizer for is such that if and if . With ,

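$$f(x^\star) = \gamma \cdot \left((x_1^\star)^\top y_1 - \frac{D\log n}{2\sqrt{r+1}}\right) = \gamma \cdot \left(-\frac{\sqrt{n}D}{2(r+1)} - \frac{D\log n}{2\sqrt{r+1}}\right).$$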

We construct a minimizer for similarly and get

$$\gamma \cdot \left(-\frac{\sqrt{n}D}{2(r+1)} - \frac{D\log n}{2\sqrt{r+1}}\right) = \min_x f_y(x) = \min_x f_z(x).$$

By the definition of , for all , we have

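$$\max(f_y(x), f_z(x)) \ge \gamma \cdot \max\left(x_{r+1}^\top y_{r+1}, -x_{r+1}^\top y_{r+1}\right) - \gamma \cdot 2(r+1)\epsilon \ge -\gamma \cdot 2(r+1)\epsilon.$$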

Thus, for all , with and ,

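$$\begin{aligned}
\max(f_y(x), f_z(x)) - f_y(x^\star) &\ge \gamma\left(-\sqrt{r+1}\,D\log n + \frac{\sqrt{n}D}{2(r+1)} + \frac{D\log n}{2\sqrt{r+1}}\right) \\
&= GD\left(\frac{1}{2\sqrt{r+1}} - \frac{(r+1/2)\log n}{\sqrt{n}}\right).
\end{aligned}$$

∎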
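The last equality in this proof can be verified term by term. The value of $\gamma$ is not visible in the text as extracted here, so the computation below assumes $\gamma = G\sqrt{(r+1)/n}$. This is an inference: it is the choice under which the stated equality holds, not a quotation from the paper.

```latex
\begin{aligned}
\gamma \cdot \frac{\sqrt{n}\,D}{2(r+1)}
  &= G\sqrt{\frac{r+1}{n}} \cdot \frac{\sqrt{n}\,D}{2(r+1)}
   = \frac{GD}{2\sqrt{r+1}}, \\
\gamma \left(-\sqrt{r+1}\,D\log n + \frac{D\log n}{2\sqrt{r+1}}\right)
  &= \frac{G D \log n}{\sqrt{n}} \left(-(r+1) + \frac{1}{2}\right)
   = -GD \cdot \frac{(r+1/2)\log n}{\sqrt{n}}.
\end{aligned}
```

Adding the two lines gives $GD\left(\frac{1}{2\sqrt{r+1}} - \frac{(r+1/2)\log n}{\sqrt{n}}\right)$, the right-hand side of the lemma.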

### 4.4 The subgradients of Fr

It remains to bound the subgradients of the functions in . We use and the following standard lemma for subdifferentials.

###### Lemma 8 ([nesterov2013introductory], Lemma 3.1.10).

Let the function be closed and convex. Then the function is also closed and convex. For any we have where .

We bound the norm of subgradients of functions using the above lemma.

###### Lemma 9.

Let for any and any . If , then .

###### Proof.

Let for some , then by Lemma 8, Thus,

$$g = \gamma \sum_{i \in [r]} \alpha_i y_i$$

for some such that , and we get

 ∥g∥2=∑i∈[r+1]∑j∈[(i−1)nr+1+1:inr+1](γαi)2=nr+1γ2∑i∈[r+1](αi)2≤nr+1γ2∑i∈[r+1]α