More Iterations per Second, Same Quality – Why Asynchronous Algorithms may Drastically Outperform Traditional Ones


Robert Hannah (Email: RobertHannah89@math.ucla.edu) and Wotao Yin (Email: wotaoyin@math.ucla.edu), Department of Mathematics, University of California, Los Angeles, CA 90095, USA
July 26, 2019
Abstract

In this paper, we consider the convergence of a very general asynchronous-parallel algorithm called ARock [PengXuYanYin2016_arock], which takes many well-known asynchronous algorithms as special cases (gradient descent, proximal gradient, Douglas-Rachford, ADMM, etc.). In asynchronous-parallel algorithms, the computing nodes simply use the most recent information that they have access to, instead of waiting for a full update from all nodes in the system. This means that nodes do not have to waste time waiting for information, which can be a major bottleneck, especially in distributed systems. When the system has p nodes, asynchronous algorithms may complete \Theta(\ln(p)) more iterations than synchronous algorithms in a given time period (“more iterations per second”).

Although asynchronous algorithms may compute more iterations per second, there is error associated with using outdated information. How many more iterations in total are needed to compensate for this error is still an open question. The main results of this paper aim to answer this question. We prove, loosely, that as the size of the problem becomes large, the number of additional iterations that asynchronous algorithms need becomes negligible compared to the total number (“same quality” of the iterations). Taking these facts together, our results provide solid evidence of the potential of asynchronous algorithms to vastly speed up certain distributed computations.

1 Introduction

Designing efficient algorithms to solve large-scale optimization problems is an increasingly important area of research. However, parallel algorithms are more challenging to analyze and implement because a host of additional considerations and issues arise only in parallel settings.

The vast majority of parallel algorithms are synchronous. At each iteration, all processors compute an update and then share it with all others. The next iteration can proceed only when all processors have finished computing an update. This synchronization can be extremely expensive at scale or on a congested network. Network latency, packet loss, the loss of a node, or unexpected drains on computational resources that affect even one node will cause the entire system to slow down. Asynchronous algorithms overcome the problem of synchronization by simply computing the next update with the most recent information available. Though this eliminates the synchronization penalty, there is a slowdown or penalty associated with using outdated information. It is not immediately clear whether asynchronous versions of algorithms will be faster, or will even converge to a solution.

In this paper, we examine the theoretical performance of asynchronous algorithms compared to traditional synchronous ones. We do this by analyzing a very general asynchronous-parallel algorithm called ARock [PengXuYanYin2016_arock]. ARock takes many popular algorithms as special cases, such as asynchronous block gradient descent, forward backward, proximal point, etc. Hence our results will apply to all of these algorithms.

1.1 Main argument

This paper aims to provide solid theoretical evidence that asynchronous algorithms will drastically outperform synchronous ones at scale under a wide range of scenarios that ARock encompasses. These include gradient descent, forward-backward, etc. for strongly convex objectives with Lipschitz gradient, or any other algorithm that can be written in the form of a block fixed-point algorithm on a contractive operator. Our argument involves a series of steps that together bolster our conclusion.

  1. We first argue that synchronous algorithms incur a significant synchronization penalty at scale under a well-justified model. That is, as the number of processing nodes p increases, a larger and larger portion of time will be spent waiting instead of computing. In fact, it will be shown that synchronous algorithms complete at least \Theta(\ln(p)) fewer epochs per second than asynchronous algorithms. (We measure the iteration complexity in terms of epochs. This is a context-dependent unit of computation which loosely corresponds to one computation of Sx for some operator S and vector x, e.g., the calculation of a full gradient for gradient descent.)

  2. It has long been plausible that asynchronous algorithms may progress more epochs per second. However, it has remained an open question whether this increased number of iterations per second is worth a potential iteration complexity penalty for using outdated information. In this paper, we provide a surprising answer: the iteration complexity of ARock is asymptotically the same as that of the corresponding synchronous algorithm, even under certain kinds of unbounded delays.

  3. Since asynchronous algorithms allow for far more iterations per second (more iterations), and these iterations make nearly the same progress as synchronous ones per iteration (same quality), asynchronous algorithms may drastically outperform synchronous ones in large-scale applications. Since ARock is extremely general, this argument applies to a wide variety of asynchronous algorithms, such as block gradient descent, proximal gradient, Douglas-Rachford, etc.

The remainder of this section will introduce the general setting, notation, and background for our results. We will discuss related work in Section 2.5.

1.2 Fixed-point algorithms, and their generality

ARock is a fixed-point algorithm, which allows it to be very general. This is because most optimization problems and algorithms can be written in the fixed-point form. Many popular algorithms such as asynchronous block gradient descent, forward backward, proximal point, etc. are special cases of ARock. They differ only in their choice of fixed-point operator T (see [HannahYin2016_unbounded] for a list of applications and special cases). Hence our results will apply to all of these algorithms.

Take an operator T:\mathbb{H}\to\mathbb{H} with Lipschitz constant 0<r\leq 1. Such an operator is called nonexpansive. The aim is to find a fixed point of this operator: that is, a point x^{*}\in\mathbb{H} such that Tx^{*}=x^{*}. For example, smooth minimization of a convex function f:\mathbb{H}\to\mathbb{R} with L-Lipschitz gradient \nabla f is equivalent to finding a fixed point of the nonexpansive operator T=I-\gamma\nabla f, where I is the identity and 0<\gamma\leq\frac{2}{L}. The set of fixed points of an operator T is denoted \text{Fix}(T). In this paper, we consider the case where 0<r<1, that is, T is contractive, since this leads to linear convergence (the case r=1 was considered in [HannahYin2016_unbounded]).

The most common fixed-point algorithm is the Krasnosel’skiĭ-Mann (KM) algorithm. ARock is essentially an asynchronous block-coordinate version of the KM iteration.

Definition 1 (Krasnosel’skiĭ-Mann algorithm).

Let \epsilon>0, and let \eta^{k} be a series of step lengths in (\epsilon,1-\epsilon). Let T be a nonexpansive operator with at least one fixed point, and

S:=I-T. (1.1)

Starting from x^{0}, the KM algorithm is defined by the following:

x^{k+1}=x^{k}-\eta^{k}S(x^{k}) (1.2)
Remark 1.

Gradient descent is equivalent to the KM algorithm with T=I-\gamma\nabla f. If the fixed-point framework is unfamiliar, it may be helpful to mentally replace S with \gamma\nabla f, and view ARock as simply asynchronous block gradient descent.
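
To make this concrete, here is a minimal Python sketch (ours, not from the paper) of the KM iteration (1.2) with T=I-\gamma\nabla f on a simple quadratic; the quadratic, the step sizes, and all numerical values are illustrative choices.

import numpy as np

# Minimal sketch: KM iteration x^{k+1} = x^k - eta * S(x^k) with S = I - T,
# where T = I - gamma * grad_f, so S(x) = gamma * grad_f(x) and the KM step
# reduces to gradient descent with step size eta * gamma.

def grad_f(x, A):
    """Gradient of the quadratic f(x) = 0.5 * x^T A x."""
    return A @ x

def km_iteration(x0, A, gamma, eta, num_iters):
    x = x0.copy()
    for _ in range(num_iters):
        Sx = gamma * grad_f(x, A)   # S(x) = (I - T)(x) = gamma * grad f(x)
        x = x - eta * Sx            # KM step (1.2)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.diag(rng.uniform(1.0, 10.0, size=50))   # eigenvalues in [mu, L]
    mu, L = A.diagonal().min(), A.diagonal().max()
    x0 = rng.standard_normal(50)
    x = km_iteration(x0, A, gamma=2.0 / (mu + L), eta=1.0, num_iters=200)
    print("final error:", np.linalg.norm(x))       # the fixed point is x* = 0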

1.3 The ARock algorithm

ARock is an asynchronous-parallel, block, fixed-point algorithm, in which a shared solution vector x^{k} is updated by a collection of p computing nodes.

Take a space \mathbb{H} on which to solve an optimization problem. \mathbb{H} can be the real space \mathbb{R}^{N} or a separable Hilbert space. Break this space into m orthogonal subspaces: \mathbb{H}=\mathbb{H}_{1}\times\ldots\times\mathbb{H}_{m}, so that vectors x\in\mathbb{H} can be written as (x_{1},x_{2},\ldots,x_{m}), where each x_{i} is x’s component in subspace \mathbb{H}_{i}. Take an operator T:\mathbb{H}\to\mathbb{H} that is r-Lipschitz for 0<r<1. Let S=I-T and Sx=(S_{1}x,\ldots,S_{m}x), where S_{j}x denotes the j’th block of Sx.

Remark 2 (Conventions).

Superscripts denote the iteration number of a sequence of points x^{0},x^{1},x^{2},\ldots. Subscripts denote different blocks of a vector or operator, e.g., x=(x_{1},x_{2},\ldots,x_{m}) and Sx=(S_{1}x,\ldots,S_{m}x). For instance, x^{k}_{l} is the lth block of the kth iterate x^{k}, and S_{l}x^{k} is the lth block of S(x^{k}).

Definition 2 (The ARock algorithm).

Let \eta^{k}\in\mathbb{R} be a series of step lengths and i(k)\in\{1,\ldots,m\} be a series of block indices. Let T be a nonexpansive operator with at least one fixed point x^{*}, and S=I-T. Take a starting point x^{0}\in\mathbb{H}. Then the ARock algorithm [PengXuYanYin2016_arock] is defined via the iteration:

\mbox{for}~i=1,\ldots,m,\quad x_{i}^{k+1}\leftarrow\begin{cases}x_{i}^{k}-\eta^{k}S_{i}(\hat{x}^{k}),&i=i(k),\\ x_{i}^{k},&i\neq i(k),\end{cases} (1.3)

where the delayed iterate \hat{x}^{k} represents a possibly outdated version of the iteration vector x^{k} used to make an update, and the block index sequence i(k) specifies which block of x^{k} is updated to produce the next iterate x^{k+1}.

The ARock algorithm resembles a block KM iteration. However we use a delayed iterate \hat{x}^{k} because of asynchronicity. In Section 1.4 we precisely define the block sequence i(k), and the delayed iterate \hat{x}^{k}.
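
The following Python sketch is a serial simulation of the update rule (1.3), intended only to make the roles of i(k) and \hat{x}^{k} concrete; the bounded random delay model, the operator S, and all parameter values are our own illustrative assumptions, not the asynchronous implementation analyzed in the paper.

import numpy as np

# Minimal serial simulation of the ARock update (1.3): at step k a random
# block i(k) is updated using a delayed iterate x_hat, whose blocks are drawn
# from the stored history of past iterates.  The delay model and the operator
# S below are illustrative assumptions.

def arock_simulation(S, x0, m, eta, num_iters, max_delay, rng):
    n = x0.size
    assert n % m == 0
    blk = n // m                              # block size
    history = [x0.copy()]                     # x^0, x^1, ..., x^k
    x = x0.copy()
    for k in range(num_iters):
        # delay vector j(k): one bounded random delay per block
        j = rng.integers(0, min(k, max_delay) + 1, size=m)
        x_hat = np.concatenate(
            [history[k - j[i]][i * blk:(i + 1) * blk] for i in range(m)])
        i_k = rng.integers(m)                 # uniform IID block index i(k)
        x = x.copy()
        sl = slice(i_k * blk, (i_k + 1) * blk)
        x[sl] -= eta * S(x_hat)[sl]           # update only block i(k)
        history.append(x)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    r = 0.9
    S = lambda z: (1.0 - r) * z               # S = I - T with T = r * I (r-Lipschitz)
    x0 = rng.standard_normal(100)
    x = arock_simulation(S, x0, m=20, eta=1.0, num_iters=3000, max_delay=10, rng=rng)
    print("||x^k|| =", np.linalg.norm(x))     # far smaller than ||x^0||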

1.4 Setup

We work with a probability measure space (\Omega,\Sigma,\mathbb{P}), where \Omega is the underlying sample space, \Sigma is a sigma algebra, and \mathbb{P} is a corresponding probability measure. We now describe the delayed iterates (which are part of our model of asynchronicity) and the block index sequence.

1.4.1 Delayed iterates

Let \vec{j}=(j_{1},\ldots,j_{m})\in\mathbb{N}^{m} be a vector, and x^{0},x^{1},x^{2},\ldots a series of iterates. To model outdated solution data, we find it convenient to define:

x^{k-\vec{j}}=\left(x_{1}^{k-j_{1}},x_{2}^{k-j_{2}},\ldots,x_{m}^{k-j_{m}}\right). (1.4)

Hence we define a series of delay vectors \vec{j}(0),\vec{j}(1),\vec{j}(2),\ldots in \mathbb{N}^{m}, corresponding to x^{0},x^{1},x^{2},\ldots respectively. The components of the delay vector \vec{j}(k)=(j(k,1),j(k,2),\ldots,j(k,m)) represent the staleness of the components of the solution vector x^{k}.

Definition 3 (Delayed iterate).

The delayed iterate \hat{x}^{k} is defined as follows. (Stronger asynchronicity: it is possible to allow more general asynchronicity, where different components of the same block, x_{l}\in\mathbb{H}_{l}, have different ages. This leads to similar results, and a similar proof, but the current setup was chosen for simplicity.)

\hat{x}^{k}=x^{k-\vec{j}(k)},~\mbox{or equivalently,} (1.5)
\hat{x}^{k}=\left(\hat{x}_{1}^{k},\hat{x}_{2}^{k},\ldots,\hat{x}_{m}^{k}\right)=\left(x_{1}^{k-j(k,1)},x_{2}^{k-j(k,2)},\ldots,x_{m}^{k-j(k,m)}\right). (1.6)

Lastly, we define the current delay j(k) as follows (notice the lack of a vector symbol, which distinguishes the current delay from the delay vector):

j(k)=\max_{i}\left\{j(k,i)\right\}. (1.7)

These delay vectors depend on the model of asynchronicity chosen. We consider two possibilities in this paper: stochastic and deterministic delays.

1.4.2 Block sequence

We define the following filtrations to represent the information that is accumulated as the algorithm runs.

{\mathcal{F}}^{k}=\sigma\left(x^{0},x^{1},\ldots,x^{k},\vec{j}(0),\vec{j}(1),\ldots,\vec{j}(k)\right) (1.8)
{\mathcal{G}}^{k}=\sigma\left(x^{0},x^{1},\ldots,x^{k}\right) (1.9)

Here \sigma(a,b,c,\ldots) represents the sigma algebra generated by a,b,c,\ldots.

Assumption 1 (IID block sequence).

The sequence in which blocks of the solution vector are updated, i(k), is a series of uniform IID random variables that take the values 1,2,\ldots,m, each with probability 1/m (nonuniform probabilities are a simple extension; however, for simplicity we assume a uniform distribution). i(k) is independent of {\mathcal{F}}^{k}. That is, i(k) is independent of the sequence of iterates (x^{0},x^{1},\ldots,x^{k}) and the sequence of delays (\vec{j}(0),\vec{j}(1),\ldots,\vec{j}(k)) jointly (clearly this makes i(k) independent of {\mathcal{G}}^{k} as well).

It is very difficult to remove the assumption that the block sequence is independent of the sequence of delays. Only a few papers that we are aware of make progress in eliminating this assumption [SunHannahYin2017_asynchronous, LeblondPedregosaLacoste-Julien2017_asaga, CannelliFacchineiKungurtsevScutari2017_asynchronous]. However obtaining good convergence rates remains elusive.

Also, removing the assumption of a random block sequence and assuming, say, a cyclic choice as in [SunYe2016_worstcase, ChowWuYin2017_cyclic] leads to at least an m-times slowdown of the algorithm in the worst case for smooth minimization [SunYe2016_worstcase]. The block sequence will be IID if we allow all nodes to update any block chosen in a uniform IID fashion, and computing each block is of equal difficulty. Future work may involve finding intermediate scenarios between IID and cyclic block choices that still result in adequate rates.

1.5 Lyapunov functions

In this framework, it is easy to generate an example where we have the conditional expectation bound:

{\mathbb{E}}\left[\left\|x^{k+1}-x^{*}\right\|^{2}\,\big{|}\,\sigma\left(x^{0},x^{1},\ldots,x^{k},\vec{j}(0),\vec{j}(1),\ldots,\vec{j}(k)\right)\right]>\left\|x^{k}-x^{*}\right\|^{2} (1.10)

for any nonzero step size. However usually some kind of monotonicity is necessary to prove convergence, especially linear convergence. In [HannahYin2016_unbounded], following on from [PengXuYanYin2016_arock], the authors propose an asynchronicity error term to add to the classical error:

\underbrace{\xi^{k}}_{\text{Total error}}=\underbrace{\left\|x^{k}-x^{*}\right\|^{2}}_{\text{Classical error}}+\underbrace{\frac{1}{m}\sum_{i=1}^{\infty}c_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}}_{\text{Asynchronicity error}} (1.11)

for positive decreasing coefficients (c_{1},c_{2},\ldots). Using carefully chosen coefficients and step size, they are able to prove convergence of ARock under unbounded delay for T with Lipschitz constant r=1. This Lyapunov function appears naturally in the proof, and much like a well chosen basis in linear algebra, it seems to be the most natural error to consider when proving convergence. Refer to [HannahYin2016_unbounded] for further motivation, and the general strategy for generating useful Lyapunov functions.

We use the same Lyapunov function in this paper to derive the main results. However our choice of coefficients is drastically different. Choosing the coefficients that yield strong results is a very involved process and is part of the technical innovation of this paper.
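
As a small illustration (ours, not the paper’s), the following sketch evaluates the Lyapunov function (1.11) along a stored sequence of iterates; the geometric coefficients c_i used here are placeholders, while the coefficients actually used in the analysis are given later in Equation 3.1.

import numpy as np

# Illustration only: evaluate the Lyapunov function (1.11) along a run,
#   xi^k = ||x^k - x*||^2 + (1/m) * sum_i c_i ||x^{k+1-i} - x^{k-i}||^2,
# for some decreasing positive coefficients c_i.  The geometric choice of
# c_i below is a placeholder, not the choice made in Section 3.

def lyapunov(history, k, x_star, m, c):
    """history[j] = x^j; c[i-1] = c_i.  Terms with k - i < 0 are dropped."""
    xi = np.linalg.norm(history[k] - x_star) ** 2
    for i in range(1, min(k, len(c)) + 1):
        xi += c[i - 1] * np.linalg.norm(history[k + 1 - i] - history[k - i]) ** 2 / m
    return xi

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    m, n = 10, 50
    x_star = np.zeros(n)
    # fake history standing in for ARock iterates
    history = [rng.standard_normal(n) * 0.9 ** j for j in range(200)]
    c = [5.0 * 0.8 ** i for i in range(50)]        # placeholder coefficients
    print([round(lyapunov(history, k, x_star, m, c), 3) for k in (0, 50, 100, 150)])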

1.6 Structure of the paper

In Section 2, we systematically work through the points of the main argument in Section 1.1, which bolsters our main thesis that asynchronous algorithms are drastically faster at scale, and discuss related work in Section 2.5. This culminates in our most important contribution: Theorem 1, which essentially proves that asynchronous ARock has the same iteration complexity as its synchronous counterpart. This result is proven in Section 3, and Section 4 introduces and proves a similar result for deterministic unbounded delays.

2 New results

In this section, we present our main results, which justify the propositions of the main argument in Section 1.1. First we describe the implementation setup in Section 2.1, that is, the kind of context where our main argument applies. Next, in Section 2.2, we describe some factors that cause synchronous algorithms to perform fewer epochs in a given time period. For instance, we prove that as the number of nodes increases, under a reasonable model synchronous algorithms suffer a \Theta(\ln(p)) slowdown, whereas asynchronous algorithms suffer no such penalty (more iterations). In Section 2.3, we derive a sharp convergence rate for synchronous-parallel ARock, so that we can compare iteration complexities. This appears to be a new result, with many implications in parallel optimization. In Section 2.4, we present our main results: the iteration complexity of (asynchronous) ARock is essentially the same as that of its synchronous counterpart. This result is our main theoretical contribution. The other subsections are independent contributions whose role is to justify our argument and bolster the importance of the main theorems. Finally, in Section 2.5, we discuss related work and compare results.

2.1 Implementation setup

We now describe the implementation setup that we will be considering. The results and analysis may be more general than this setup, but being concrete about the setup allows us to compare synchronous vs. asynchronous implementations of the same algorithm. Our implementation setup will consist of:

  1. Central Parameter Server: This is a central node that maintains the solution vector x and applies updates that are supplied to it: x\leftarrow x-\eta S_{i}\hat{x}. (We could have considered a shared-memory architecture; however, we chose this setting since we are more focused on algorithm performance at scale.)

  2. Computing Nodes: There are p nodes that read the solution vector \hat{x} into local memory, calculate S_{i}\hat{x}, and send this update back to the central server.

We now compare and contrast how a synchronous and asynchronous version of ARock would function:

  • Synchronous-parallel: All computing nodes read the same solution vector x from the server. Each computing node computes an update S_{i}x, where i depends on the node, and sends this update to the server. The server saves all received updates and, only after an update has arrived from every node, applies them to the solution vector. After this, all computing nodes read the same updated solution vector again, and the process repeats.

  • Asynchronous-parallel: All computing nodes will continually read the central solution vector x into their local memory. This yields \hat{x}, a potentially inconsistently read solution vector. Each node will then calculate an update S_{i}\hat{x} and send it back to the server. The server will apply any updates to the solution vector as they arrive.

Notice that in the synchronous implementation, a single slow node can hold the entire system up. In the asynchronous implementation, all nodes function independently, and never wait for any other node. The central server does not wait to receive an update from every node, but simply applies them as they come. We now analyze the synchronization penalty of these two implementation setups.

2.2 Synchronization penalty

In this section we discuss point 1 of the main argument in Section 1.1. We consider a simple and well-justified model of our implementation setup to investigate the synchronization penalty. Even under perfect load balancing, this model implies a slowdown for synchronous algorithms that increases as \Theta(\ln(p)), where p is the number of processors. This may be even worse in a real implementation, where other factors may further disadvantage synchronous algorithms (such as the overhead associated with message passing and read/write locks). For the corresponding asynchronous algorithm, we show these problems do not occur. Hence asynchronous algorithms may compute far more iterations per second.

Following [SerpedinChaudhari2009_synchronization] (p. 43), with some modifications, we model the time taken for node l’s update as follows:

\displaystyle P_{l}=R_{l}+C_{l}(i_{l},m)+S_{l}. (2.1)

Here i_{l} is the block of the solution vector that node l updates, and C_{l}(i,m) represents the “predictable” portion of the update time (it is merely a function, not a random variable). This includes computation time, and the delay because of the limited bandwidth of the network. C_{l}(i,m) is a function of i because different blocks may have different sizes and difficulties. R_{l} is the random delay involved in receiving the solution vector from the central server, and S_{l} is the random delay involved in sending an update back to the server. R_{l} and S_{l} are assumed IID with exponential distribution of mean \lambda_{l}. This exponential model for the random portion of the delay has extensive theoretical and empirical justifications (see [SerpedinChaudhari2009_synchronization], pp. 44-45 for a discussion of the evidence).

2.2.1 Synchronous algorithms and random delays

Let’s first consider the effect of random delays on the synchronization penalty. For simplicity, assume that C_{l}(i,m) is constant over i and l, and hence we can write this function as C(m). Also assume we have \lambda_{l}=\lambda for all l. This situation would occur if all blocks were of equal difficulty to update, and all nodes had the same computational power and network delay distribution. This is the ideal scenario, and yet we will observe a growing synchronization penalty with scale.

Because all nodes must finish updating for the next iteration to start, the iteration time P is given by:

P=C(m)+\max_{l=1,2,\ldots,p}\left\{R_{l}+S_{l}\right\}. (2.2)

Hence we have (using [Eisenberg2008_expectation]):

{\mathbb{E}}P-C(m)\geq{\mathbb{E}}\left(\max_{l=1,\ldots,p}\{R_{l}\}\right)=\lambda\sum_{l=1}^{p}\frac{1}{l}\geq\lambda\ln(p).

Let’s now look at the time \mathcal{T}(K) required for K epochs, which corresponds to Km/p iterations. Let P^{1},P^{2},\ldots\sim P, where we write A\sim B for random variables A and B if these variables have the same distribution. Then:

\mathcal{T}(K)=\sum_{k=1}^{\lceil Km/p\rceil}P^{k},
{\mathbb{E}}\mathcal{T}(K)\geq(Km/p)\,{\mathbb{E}}P\geq(Km/p)\left(C(m)+\lambda\ln(p)\right).

Hence for small values of p, the expected time to reach K epochs will decrease linearly with the number of nodes p. However as p becomes larger, there is at least a \Theta(\ln(p)) penalty in how long this will take compared to a linear speedup.
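
The following short Monte Carlo sketch (illustrative values only, of our own choosing) checks the quantity used above: the expected maximum of p exponentials equals \lambda H_{p}=\lambda\sum_{l=1}^{p}1/l, which is at least \lambda\ln(p), and the full round delay \max_{l}(R_{l}+S_{l}) is at least as large.

import numpy as np

# Monte Carlo check (illustrative values) of the synchronization penalty:
# under the model P = C(m) + max_l (R_l + S_l) with R_l, S_l ~ Exp(mean lambda),
# the random part of the iteration time grows like the harmonic number H_p.

rng = np.random.default_rng(3)
lam, trials = 1.0, 20000

for p in (1, 4, 16, 64, 256):
    R = rng.exponential(lam, size=(trials, p))
    S = rng.exponential(lam, size=(trials, p))
    emp_R = R.max(axis=1).mean()                    # E[max_l R_l], equals lambda * H_p
    emp_RS = (R + S).max(axis=1).mean()             # E[max_l (R_l + S_l)] >= E[max_l R_l]
    h_p = lam * sum(1.0 / l for l in range(1, p + 1))
    print(f"p={p:4d}  E[max R]={emp_R:6.3f}  lam*H_p={h_p:6.3f}  E[max(R+S)]={emp_RS:6.3f}")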

2.2.2 Asynchronous algorithms

Using the same model, we now show that asynchronous algorithms have no such \Theta(\ln(p)) scaling penalty. The time taken for node l to complete k iterations is given by:

S^{k}_{l}=\sum_{j=1}^{k}P_{l}^{j} (2.3)

where P_{l}^{k}\sim P. This is a renewal process with interarrival time P_{l} (see [MitovOmey2014_renewal, KellaStadje2006_superposition]). If we instead consider the total number of iterations completed by all nodes together, the combined process is known as a superposition of renewal processes, and we let S^{k} denote the time of the k’th iteration overall. From [KellaStadje2006_superposition] 1.4, as k\to\infty we have:

\frac{{\mathbb{E}}S^{k}}{k}\to\frac{{\mathbb{E}}P}{p}, (2.4)
\text{(by the convergence in the previous step)}\quad{\mathbb{E}}S^{k}=k\frac{C(m)+2\lambda}{p}(1+o_{m,\lambda,p}(1)). (2.5)
Remark 3 (Notation).

The subscripts in o_{m,\lambda,p}\MT@delim@Auto\p@star{1} denote that this term converges to 0 as k\to\infty in a way that depends on m, p, and \lambda.

Hence the expected time to complete K epochs is given by:

{\mathbb{E}}\mathcal{T}(K)=\frac{Km}{p}\left(C(m)+2\lambda\right)(1+o_{m,\lambda,p}(1)) (2.6)

as K\to\infty. Hence asynchronous algorithms do not suffer a \ln(p) penalty as p becomes larger, provided K is sufficiently large. For large K, asynchronous algorithms therefore compute at least \Theta(\ln(p)) more epochs per second than synchronous algorithms.
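
The following simulation sketch (our illustration under the model of this subsection, with arbitrary values of C(m), \lambda, and the time budget) compares how many epochs the synchronous and asynchronous schemes complete in the same amount of time.

import numpy as np

# Illustrative comparison of block updates completed in a fixed time budget:
# synchronously, each round takes C(m) + max_l (R_l + S_l) and yields p updates;
# asynchronously, each node produces an update every C(m) + R_l + S_l seconds,
# independently of the others.

rng = np.random.default_rng(4)
C, lam, horizon, m = 1.0, 1.0, 10000.0, 1024

for p in (4, 16, 64, 256):
    # synchronous: rounds until the time budget is exhausted
    t, sync_updates = 0.0, 0
    while True:
        round_time = C + (rng.exponential(lam, p) + rng.exponential(lam, p)).max()
        if t + round_time > horizon:
            break
        t += round_time
        sync_updates += p

    # asynchronous: each node runs through the same budget on its own clock
    async_updates = 0
    for _ in range(p):
        t_node = 0.0
        while True:
            t_node += C + rng.exponential(lam) + rng.exponential(lam)
            if t_node > horizon:
                break
            async_updates += 1

    print(f"p={p:4d}  sync epochs={sync_updates / m:7.1f}  async epochs={async_updates / m:7.1f}")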

2.2.3 Heterogeneity and synchronous algorithms

Sometimes a parallel problem cannot be split into m blocks in a way that updating each block is of equal difficulty (as was previously assumed in this subsection). This can cause significant synchronization penalty in the synchronous case, but has no such effect on asynchronous algorithms because computing nodes do not have to wait for slower nodes or blocks to complete.

Let us assume for the moment that there is no random component of the update time for a single node, and that all nodes have the same computational power. This means that the update time for node l at iteration k is simply:

\displaystyle P^{k}_{l} \displaystyle=C(i_{l},m) (2.7)

where i_{l} is the block that node l updates at iteration k. Assume also that at every iteration, each node l chooses a random block to update, and hence i_{l} is a uniform random variable on \{1,2,\ldots,m\}. For the synchronous algorithm, we have an update time:

P=\max_{l=1,2,\ldots,p}\left\{C(i_{l},m)\right\} (2.8)

Clearly then, as p increases, we have:

\displaystyle{\mathbb{E}}P \displaystyle\to\max_{i}C(i,m) (2.9)

That is, the update time is determined by the most difficult block to update. Hence the expected time for K epochs is:

{\mathbb{E}}\mathcal{T}(K)=\frac{Km}{p}\left(\max_{i}C(i,m)+o_{m}(1)\right) (2.10)

as p\to\infty.

2.2.4 Heterogeneity and asynchronous algorithms

Now consider an asynchronous algorithm. The update time of a single node is:

\displaystyle{\mathbb{E}}P \displaystyle={\mathbb{E}}C(i_{l},m)=\frac{1}{m}\sum_{i=1}^{m}C(i,m) (2.11)

Yet again we have a superposition of renewal processes, and hence from [KellaStadje2006_superposition], we have as k\to\infty:

\displaystyle\frac{{\mathbb{E}}S^{k}}{k} \displaystyle\to\frac{{\mathbb{E}}P}{p} (2.12)
{\mathbb{E}}S^{k}=\frac{k}{p}\left(\frac{1}{m}\sum_{i=1}^{m}C(i,m)\right)\left(1+o_{m,p}(1)\right) (2.13)

Hence the expected time for K epochs is given by:

{\mathbb{E}}\mathcal{T}(K)=\frac{Km}{p}\left(\frac{1}{m}\sum_{i=1}^{m}C(i,m)\right)\left(1+o_{m,p}(1)\right) (2.14)

as K\to\infty. Notice that the time taken for an asynchronous algorithm is determined by the average difficulty of updating a block. Compare this to synchronous algorithms where the most difficult block determines the time complexity. If the difficulty of blocks is highly heterogeneous, asynchronous algorithms may complete far more iterations per second, even without considering network effects.
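
A short numerical illustration of this contrast (the cost vector below is an arbitrary example of ours): as p grows, the synchronous per-round time approaches \max_{i}C(i,m), while the asynchronous per-update time stays at the average \frac{1}{m}\sum_{i}C(i,m).

import numpy as np

# Illustrative check of Sections 2.2.3-2.2.4: with heterogeneous block costs
# C(i, m) and no random delays, the synchronous per-round time approaches the
# cost of the hardest block, while the asynchronous per-update time is the
# average block cost.

rng = np.random.default_rng(5)
m = 128
cost = rng.uniform(0.1, 1.0, size=m)      # C(i, m) for i = 1..m
cost[0] = 10.0                            # one unusually expensive block

for p in (4, 16, 64, 256):
    draws = cost[rng.integers(m, size=(100000, p))]
    sync_round = draws.max(axis=1).mean()             # E max_l C(i_l, m)
    print(f"p={p:4d}  sync E[P]={sync_round:6.3f}  max_i C={cost.max():5.2f}  "
          f"async E[P]={cost.mean():5.3f}")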

2.2.5 Additional factors

In the preceding subsections, we analyzed a couple of factors that decrease the number of epochs that synchronous solvers complete. Another factor is the heterogeneous computing power of the nodes themselves, which causes a disadvantage even if all blocks are of equal difficulty. There is also significant overhead associated with enforcing synchronization, as well as read and write locks.

2.3 Iteration complexity for synchronous Block KM

In this subsection, we start to look at point 2 of the main argument in Section 1.1. In order to prove there is no iteration complexity penalty, we need to obtain tight rates of convergence for synchronous ARock (which is merely a synchronous block KM iteration). At every step, each of the p processing nodes is given a random block to update with a KM-style iteration. Hence we have:

\displaystyle x^{k+1} \displaystyle=x^{k}-\eta^{k}P^{k}Sx^{k} (2.15)

Here P^{k} is a projection onto a random subset of \{1,2,\ldots,m\} of size p (we assume p\leq m). We note that each block has a \frac{p}{m} probability of being updated on a given iteration.

Definition 4 (Convergence rate).

An algorithm is said to linearly converge if the error E(k)={\mathcal{O}}(R^{k}) for 0<R<1. R is called the convergence rate.

Definition 5 (Epoch iteration complexity).

The epoch iteration complexity I(\epsilon) is the number of epochs required to decrease the error below \epsilon E(0), where E(0) is the initial error.

This error could be the distance from the solution \left\|x^{k}-x^{*}\right\|^{2} or the gap between the function value and its optimal value f(x^{k})-f^{*}.

Proposition 6 (Convergence rate of block KM iterations).

Let T be an r-Lipschitz operator for 0<r<1. Consider the random subset KM iteration defined in Equation 2.15 for 1\leq p\leq m. A step size of \eta^{k}=1 optimizes the convergence rate. When this optimal step size is chosen, we have convergence rate:

\displaystyle R \displaystyle=1-\frac{p}{m}\MT@delim@Auto\p@star{1-r^{2}}

and corresponding epoch iteration complexity:

I(\epsilon)=\left(\frac{1}{1-r^{2}}-\theta\frac{p}{m}\right)\ln(1/\epsilon) (2.16)

for some \theta\in\left[\frac{1}{2},1\right].

For problems of interest, the first term (1/(1-r^{2})) will dominate the second. We are interested in huge-scale problems, which will usually have m\gg p or r\approx 1.

This is proven in Section A.1. Hence if we have either r\to 1 or p/m\to 0, then:

I=\left(1+o(1)\right)\frac{1}{1-r^{2}}\ln(1/\epsilon) (2.17)

We will eventually prove that ARock has essentially the same iteration complexity.
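
The rate in Proposition 6 can be checked numerically on a toy example. The sketch below (our own illustration) runs the random-subset block KM iteration (2.15) with the linear contraction Tx=rx, for which the expected squared error contracts by exactly R=1-\frac{p}{m}(1-r^{2}) per iteration; the values of m, p, and r are arbitrary.

import numpy as np

# Numerical check (illustrative) of Proposition 6: random-subset block KM
# iteration (2.15) with T x = r x (so S x = (1 - r) x) and step size 1.

rng = np.random.default_rng(6)
m, p, r, iters, trials = 64, 8, 0.95, 200, 500
R = 1.0 - (p / m) * (1.0 - r ** 2)

err = 0.0
for _ in range(trials):
    x = np.ones(m)                                    # one coordinate per block
    for _ in range(iters):
        blocks = rng.choice(m, size=p, replace=False) # random subset of size p
        x[blocks] -= 1.0 * (1.0 - r) * x[blocks]      # eta = 1, S x = (1-r) x
    err += np.sum(x ** 2) / trials

print("empirical rate per iteration:", (err / m) ** (1.0 / iters))
print("predicted  R =", R)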

This result allows us to obtain a sharp convergence rate and epoch iteration complexity for synchronous-parallel block gradient descent (block gradient descent is the special case p=1). This appears to be a new result that extends recent work in [TaylorHendrickxGlineur2017_exact] (which corresponds to the special case m=1, p=1) to random-subset block gradient descent.

Corollary 7 (Sharp convergence rate of synchronous-parallel block gradient descent).

Let f be a \mu-strongly convex function with L-Lipschitz gradient \nabla f(x). Let \kappa=L/\mu be the condition number. The operator T=I-\frac{2}{\mu+L}\nabla f is r-Lipschitz with r=1-\frac{2}{\kappa+1}. The corresponding block KM iteration Equation 2.15 with optimal step size \eta^{k}=1 is equivalent to synchronous-parallel block gradient descent with step size 2/(\mu+L). The linear convergence rate R and epoch iteration complexity I(\epsilon) with respect to the error \left\|x^{k}-x^{*}\right\|^{2} are given by the following:

R=1-4\frac{p}{m}\frac{\kappa}{(\kappa+1)^{2}}=1-4\frac{p}{m\kappa}\left(1+{\mathcal{O}}(1/\kappa)\right) (2.18)
I(\epsilon)=\frac{1}{4}\left(\kappa+{\mathcal{O}}(1)\right)\ln(1/\epsilon) (2.19)

as \kappa\to\infty. Lastly, this convergence rate is sharp.

This is proven in Section A.2. Among other things, our main results will show that asynchronous-parallel block gradient descent has epoch iteration complexity that is asymptotically equal to \frac{1}{4}\kappa\ln(1/\epsilon), which is the complexity of synchronous-parallel block coordinate descent.
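
Analogously, the rate in Corollary 7 can be observed on a separable quadratic whose coordinate curvatures are all \mu or L; the sketch below (an illustration with arbitrary values of \mu, L, m, and p) runs synchronous-parallel block gradient descent with step size 2/(\mu+L) and compares the empirical contraction factor with R in (2.18).

import numpy as np

# Illustrative check of Corollary 7 on a separable quadratic with coordinate
# curvatures mu or L, so each selected coordinate contracts by exactly r^2.

rng = np.random.default_rng(7)
m, p, mu, L, iters, trials = 64, 8, 1.0, 20.0, 200, 500
kappa = L / mu
gamma = 2.0 / (mu + L)
a = np.where(np.arange(m) % 2 == 0, mu, L)          # curvature of each 1-D block
R = 1.0 - 4.0 * (p / m) * kappa / (kappa + 1.0) ** 2

err = 0.0
for _ in range(trials):
    x = np.ones(m)
    for _ in range(iters):
        blocks = rng.choice(m, size=p, replace=False)
        x[blocks] -= gamma * a[blocks] * x[blocks]   # block gradient step
    err += np.sum(x ** 2) / trials

print("empirical rate per iteration:", (err / m) ** (1.0 / iters))
print("predicted  R =", R)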

Remark 4 (Composite objectives).

Let g(x)=\sum_{i=1}^{n}g_{i}(x_{i}) be separable, with each component convex and subdifferentiable. The exact same convergence rate and complexity clearly hold for block proximal gradient descent. This is because the corresponding nonexpansive operator T=(I+\partial g)^{-1}\circ\left(I-\frac{2}{\mu+L}\nabla f\right) is also \left(1-\frac{2}{\kappa+1}\right)-Lipschitz, and hence all preceding theory applies.

2.4 Iteration complexity for ARock

In this subsection we present our theoretical results for ARock, which complete part 2 of the main argument in Section 1.1. That is, ARock (and hence all of its special cases) has essentially the same iteration complexity as its synchronous counterpart. However, we present the results in simplified form so as to more effectively communicate the main message. Full versions of these results that contain more technical details are given and proven in the proof sections.

2.4.1 Stochastic delays

The first result is convergence under stochastic unbounded delays from a fixed distribution (though this can be weakened to a changing distribution).

Assumption 2 (Stochastic unbounded delays).

The sequence of delay vectors \vec{j}(0),\vec{j}(1),\vec{j}(2),\ldots is IID, and \vec{j}(k) is independent of {\mathcal{G}}^{k}. That is, \vec{j}(k) is independent of the iterates (x^{0},x^{1},x^{2},\ldots,x^{k}).

We can imagine a large optimization problem being solved on a busy network. Nodes are continually sending and receiving their updates. Traffic is chaotic, and there is some kind of distribution of how long the information takes to get from one node to the rest.

The delay vectors \vec{j}(k) have a fixed distribution (though this can be relaxed). Define

P_{l}=\mathbb{P}\left[j(k)\geq l\right] (2.20)

We let \rho be defined by:

\rho=1-\frac{1}{m}\left(1-r^{2}\right) (2.21)

which is the linear convergence rate of the corresponding synchronous KM algorithm (Equation 2.15) with p=1, and ideal step size. We also define probability moments:

M_{1}=\sum_{l=1}^{\infty}P_{l}\rho^{-l/2},\qquad M_{2}=\sum_{l=1}^{\infty}P_{l}^{1/2}\rho^{-l/2} (2.22)

and step size:

\eta^{k}=\eta_{1}\triangleq\left(1+m^{-1/2}\left(\left(1-r^{2}\right)^{1/2}M_{1}+2M_{2}\right)\right)^{-1} (2.23)

Notice that these moments are a function of m. This is immediately clear because \rho is a function of m. Also the probability distribution of the delays may depend on m in a way that depends on the network and how you decide to scale up the computation to a higher number of nodes or blocks.
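
For instance, if the delay tail is geometric, P_{l}=\beta^{l} with \beta<\rho, then M_{1} and M_{2} remain bounded as m grows, so they are {\mathcal{O}}(m^{q}) with q=0. The sketch below (the geometric tail and all numerical values are illustrative assumptions of ours) computes M_{1}, M_{2}, and the step size \eta_{1} of (2.23) for increasing m.

import numpy as np

# Illustrative computation of the moments M_1, M_2 of (2.22) and the step size
# eta_1 of (2.23) for a geometric delay tail P_l = P[j(k) >= l] = beta**l.
# The tail parameter beta and the value of r are arbitrary choices; the sums
# converge because beta < rho.

def moments(m, r, beta, max_l=100000):
    rho = 1.0 - (1.0 - r ** 2) / m                     # ideal synchronous rate (2.21)
    l = np.arange(1, max_l + 1)
    P = beta ** l
    M1 = np.sum(P * rho ** (-l / 2.0))
    M2 = np.sum(np.sqrt(P) * rho ** (-l / 2.0))
    eta1 = 1.0 / (1.0 + m ** -0.5 * (np.sqrt(1.0 - r ** 2) * M1 + 2.0 * M2))
    return M1, M2, eta1

for m in (100, 10000, 1000000):
    M1, M2, eta1 = moments(m=m, r=0.99, beta=0.5)
    print(f"m={m:8d}  M1={M1:7.3f}  M2={M2:7.3f}  eta_1={eta1:.4f}")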

Theorem 1 (Linear convergence for stochastic delays).

Let Assumption 1 and Assumption 2 hold. Let M_{1} and M_{2} be finite and {\mathcal{O}}(m^{q}) for 0\leq q<1/2. Let \eta^{k}=\eta_{1}. Then there exist positive coefficients (c_{i})_{i=1}^{\infty} of the Lyapunov function \xi^{k} such that we have the following linear convergence rate and iteration complexity, respectively:

{\mathbb{E}}\left[\xi^{k+1}\,\big{|}\,{\mathcal{G}}^{k}\right]\leq\Big(\underbrace{1-\frac{1}{m}\left(1-r^{2}\right)}_{\text{Ideal synchronous rate}}+\underbrace{{\mathcal{O}}\left(m^{q-3/2}\right)}_{\text{Asynchronicity penalty}}\Big)\xi^{k}, (2.24)
I(\epsilon)=\left(1+{\mathcal{O}}\left(m^{-1/2+q}\right)\right)\left(\frac{1}{1-r^{2}}\right)\ln(1/\epsilon), (2.25)

as m\to\infty.

The full version of this is Theorem 2, which is proven in Section 3.5 after a series of results built up in Section 3.

Comparing this to Proposition 6, we can see that as m\to\infty, the iteration complexities of asynchronous and synchronous ARock approach the same value. Thus at scale there is no iteration complexity penalty for using outdated information, so long as that information does not become too old as you scale up the problem (i.e. M_{1} and M_{2} don’t grow too fast as m increases.).

Remark 5 (Reduction to synchronous case).

We will see that when there is no asynchronicity, the Lyapunov function reduces to the classical error \xi^{k}=\left\|x^{k}-x^{*}\right\|^{2}, and \eta_{1}=1. Also, the convergence rate and iteration complexity will exactly equal the values obtained in Proposition 6 (again for p=1).

2.4.2 Deterministic delays

We also prove a similar convergence result for deterministic unbounded delays. However, we defer this result to Section 4, because the specifics of the theorem are a little subtle. The proof of the stochastic delay result can be seen as a warmup to the deterministic result.

2.4.3 Completing the main argument

Hence our convergence rate results complete point 2 of the main argument. We have more epochs per given time period, and the epochs make the same progress because iterations are of the same quality. Hence asynchronous algorithms may drastically outperform synchronous ones in this setting.

2.5 Related Work

Though asynchronous algorithms were invented long ago, there has been a lot of recent interest, especially for random-block-coordinate-type algorithms, such as “Hogwild!” [RechtReWrightNiu2011_hogwild]. In [AvronDruinskyGupta2014], the authors prove linear convergence for an asynchronous stochastic linear solver. In [LiuWrightReBittorfSridhar2015_asynchronous], the authors prove function-value convergence for asynchronous stochastic coordinate descent. They prove {\mathcal{O}}\MT@delim@Auto\p@star{1/k} convergence for f convex with \nabla f Lipschitz, and linear convergence when f is also strongly convex. This was extended in [LiuWright2015_asynchronous] to composite objective functions.

For condition number \kappa, they report a per-iteration linear convergence rate of

\displaystyle 1-\frac{1}{2m\kappa} (2.26)

This implies an iteration complexity approximately 8 times higher than our result (which matches asymptotically the complexity of the corresponding synchronous algorithm). For asynchronicity to be useful under their analysis, the elimination of the synchronization penalty would need to compensate for this 8-fold increase in iterations. Like almost all recent work except for [HannahYin2016_unbounded, PengXuYanYin2016_convergence], they assume a bounded delay \tau. For linear speedup, they require \tau={\mathcal{O}}(m^{1/2}), and \tau={\mathcal{O}}(m^{1/4}) for composite objectives. We do not require bounded delays for linear speedup, only sufficiently slowly growing moments of the delay. For bounded delay, our condition is \tau={\mathcal{O}}(m^{q}) for 0\leq q<\frac{1}{2}, for both composite and non-composite objectives.

Our work is also more general, since we use the operator setting. So not only do the main results apply to gradient descent, and proximal gradient, but any other algorithm that can be written in a block fixed-point form.

In [ManiaPanPapailiopoulosRechtRamchandranJordan2015_perturbed], the authors achieve a linear speedup but with a far higher iteration complexity of \Theta\left(\kappa^{2}\ln(1/\epsilon)\right). When the iteration complexity is increased by a factor of \kappa, the linear speedup may be of limited utility. We also note that our conditions for linear speedup are weaker than theirs (which is \tau={\mathcal{O}}(m^{1/6})), and they need to assume that x^{k} remains bounded, which is unjustified.

The authors of [PengXuYanYin2016_convergence] prove function-value linear convergence of an asynchronous block proximal gradient algorithm under unbounded delays. However, it is unclear how the iteration complexity they obtain compares to that of the corresponding synchronous algorithm. Also, our result applies to a KM iteration, of which block proximal gradient is a special case.

In [LianZhangHsiehHuangLiu2016_comprehensive], the authors review a number of asynchronous algorithm analyses and collect conditions necessary for linear speedup on a fixed problem. In light of potentially increased complexity, as seen in [ManiaPanPapailiopoulosRechtRamchandranJordan2015_perturbed], a linear speedup does not imply that asynchronous algorithms will run faster in time. It only implies that the potential slowdown factor for using asynchronicity is bounded for a given problem, but this bound may be as large as \kappa. What we prove in this paper is much stronger, that the slowdown factor (in terms of iterations) for using asynchronicity is asymptotically 1, i.e. that the slowdown is negligible.

2.6 Unbounded delays

Almost all work on asynchronous algorithms except for [HannahYin2016_unbounded, PengXuYanYin2016_convergence] assumed bounded delays. That is, there was a limit \tau such that j(k)\leq\tau for all k. However there may be no bound on the delay in practice: There is always the possibility of an arbitrarily large network delay. Also this \tau needs to be known in advance in order to set the step size, which is impractical. A large \tau will hinder the convergence rate by limiting the step size, even if a delay of \tau is extremely unlikely. In this work we assume no bound on the delay. In Theorem 1, we are able to obtain fast convergence if the delay distribution is not too spread out. In Theorem 3 we are able to determine a convergence rate that depends only on the current delay conditions, not on the worst case behavior of delay on the entire time that the algorithm runs.

3 Analysis of ARock under Stochastic Delays

We now give the full version of Theorem 1. We let the coefficients in \xi^{k} be given by:

c_{i}=m^{1/2}\left(1+\left(1-r^{2}\right)^{1/2}\right)\sum^{\infty}_{l=i}P_{l}^{1/2}\rho^{-(l/2-i+1)} (3.1)

and define the constant:

\eta_{2}=m^{1/2}\left(1-r^{2}\right)^{-1/2}M_{1}^{-1} (3.2)

Finally, define the following convergence rate function:

R(\eta,\gamma)=1-\frac{\eta}{m}\left(1-r^{2}\right)\left(1-\eta/\gamma\right) (3.3)

Note that we have R<1 when 0<\eta<\gamma, and the rate is optimized when \eta=(1/2)\gamma. Also \rho\leq R for \eta\leq 1, where \rho=1-(1/m)(1-r^{2}) is the optimal rate that we wish to prove R is close to.

Theorem 2 (Linear convergence for stochastic delays).

Let Assumption 1 and Assumption 2 hold. Let the step size \eta^{k} be {\mathcal{F}}^{k}-measurable, and satisfy \eta^{k}\leq\eta_{1}. Let the probability moments M_{1} and M_{2} defined in Equation 2.22 be finite. Consider the Lyapunov function defined in Equation 1.11 with constants given by Equation 3.1. Then we have the following linear convergence rate:

{\mathbb{E}}\left[\xi^{k+1}\,\big{|}\,{\mathcal{G}}^{k}\right]\leq R\left(\eta^{k},\eta_{2}\right)\xi^{k} (3.4)

Additionally, let \eta^{k}=\eta_{1} and assume that M_{1},M_{2}={\mathcal{O}}(m^{q}) for 0\leq q<1/2. As the number of blocks m approaches \infty, we have the following linear convergence rate and iteration complexity, respectively:

R\left(\eta_{1},\eta_{2}\right)=\underbrace{\left(1-\frac{1}{m}\left(1-r^{2}\right)\right)}_{\text{Ideal synchronous rate}}+\underbrace{\frac{2}{m^{3/2}}\left(1-r^{2}\right)^{3/2}\left(\left(1-r^{2}\right)^{1/2}M_{1}+M_{2}\right)}_{\text{Asynchronous rate penalty}}\left(1+o(1)\right) (3.5)
I(\epsilon)=\left(1+\underbrace{2m^{-1/2}\left(\left(1-r^{2}\right)^{1/2}M_{1}+M_{2}\right)}_{\text{Highest order penalty term}}+o\left(m^{-1/2+q}\right)\right)\left(\frac{1}{1-r^{2}}\right)\ln(1/\epsilon) (3.6)

We chose to include the exact form of the highest order iteration complexity penalty. This allows us to calculate approximately how large m has to be in order for asynchronicity to cause negligible penalty.

We now prove Theorem 2 in a way that emphasizes the reasons and intuition behind our approach – especially the strategic way in which the coefficients are chosen.

3.1 Preliminary results

Let x^{*} be any solution, and set x^{*}=0 with no loss of generality, to make the notation more compact. This can be achieved by translating the origin of the coordinate system to x^{*}. Hence \left\|x^{k}\right\| is the distance from the solution. (We use an abuse of notation in this paper: we equate S_{i}(x)\in\mathbb{H}_{i}, the components of S(x) in the ith block, with (0,\ldots,0,S_{i}(x),0,\ldots,0)\in\mathbb{H}_{1}\times\ldots\times\mathbb{H}_{m}, the projection of S(x) onto the i’th subspace. Hence we can write the ARock iteration more compactly as x^{k+1}=x^{k}-\eta^{k}S_{i(k)}\hat{x}^{k}.) The starting point of our analysis is the following:

{\mathbb{E}}\left[\left\|x^{k+1}\right\|^{2}\,|\,{\mathcal{F}}^{k}\right]={\mathbb{E}}\left[\left\|x^{k}-\eta^{k}S_{i(k)}\hat{x}^{k}\right\|^{2}\,|\,{\mathcal{F}}^{k}\right]
=\left\|x^{k}\right\|^{2}+{\mathbb{E}}\left[-2\eta^{k}\left\langle x^{k},S_{i(k)}\hat{x}^{k}\right\rangle+\left(\eta^{k}\right)^{2}\left\|S_{i(k)}\hat{x}^{k}\right\|^{2}\,|\,{\mathcal{F}}^{k}\right].

Here the expectation is taken over only the block index i(k) (recall Assumption 1). We let the step size \eta^{k} be {\mathcal{G}}^{k}-measurable (and hence {\mathcal{F}}^{k}-measurable). Essentially this means that the step size \eta^{k} can depend only on the sequence (x^{0},x^{1},\ldots,x^{k}), and not on the block index or delay. Hence

{\mathbb{E}}\left[\left\|x^{k+1}\right\|^{2}\,|\,{\mathcal{F}}^{k}\right]=\left\|x^{k}\right\|^{2}\underbrace{-2\frac{\eta^{k}}{m}\left\langle x^{k},S\hat{x}^{k}\right\rangle}_{\text{cross term}}+\frac{\left(\eta^{k}\right)^{2}}{m}\left\|S\hat{x}^{k}\right\|^{2}. (3.7)

We now present a simple lemma on the operator S that will be used in the convergence proof (This is the operator version of Theorem 2.1.12 in [Nesterov2013_introductory]).

Lemma 8.

Let S=I-T, where T is an r-Lipschitz operator. Then for all x,y\in\mathbb{H} we have:

\left\langle Sy-Sx,y-x\right\rangle\geq\frac{1}{2}\left\|Sy-Sx\right\|^{2}+\frac{1}{2}\left(1-r^{2}\right)\left\|y-x\right\|^{2} (3.8)
Proof.
\text{($T$ is $r$-Lipschitz)}\quad r^{2}\left\|y-x\right\|^{2}\geq\left\|Ty-Tx\right\|^{2}
=\left\|(I-S)y-(I-S)x\right\|^{2}
=\left\|Sy-Sx\right\|^{2}-2\left\langle Sy-Sx,y-x\right\rangle+\left\|y-x\right\|^{2}
\text{(rearrange)}\quad\left\langle Sy-Sx,y-x\right\rangle\geq\frac{1}{2}\left\|Sy-Sx\right\|^{2}+\frac{1}{2}\left(1-r^{2}\right)\left\|y-x\right\|^{2}\qed

3.2 The cross term

Remark 6 (Strategy).

Consider Equation 3.7 again. The \left\|S\hat{x}^{k}\right\|^{2} term in Equation 3.7 can be thought of as a “waste” term that has to be negated. In light of Lemma 8, a -\left\langle S\hat{x}^{k},\hat{x}^{k}\right\rangle term can be used to generate a -\left\|S\hat{x}^{k}\right\|^{2} term to clean up this waste. In addition to cleaning this waste, ideally we would have a -\left\|x^{k}\right\|^{2} term to help prove linear convergence, but instead Lemma 8 produces -\left\|\hat{x}^{k}\right\|^{2}.

The strategy we pursue is as follows: the cross term -\left\langle S\hat{x}^{k},x^{k}\right\rangle is approximately equal to -\left\langle S\hat{x}^{k},\hat{x}^{k}\right\rangle, which allows us to clean up the \left\|S\hat{x}^{k}\right\|^{2} term. The -\left\|\hat{x}^{k}\right\|^{2} that is also generated is approximately equal to -\left\|x^{k}\right\|^{2}, which helps prove linear convergence. However, there is an error associated with this “conversion”. Finally, this conversion error is negated by the use of a Lyapunov function (see eq. 1.11).

Lemma 9 will eventually allow us to quantify the error associated with converting -\left\langle S\hat{x}^{k},x^{k}\right\rangle to -\left\langle S\hat{x}^{k},\hat{x}^{k}\right\rangle, and the error associated with converting -\left\|\hat{x}^{k}\right\|^{2} to -\left\|x^{k}\right\|^{2}, mentioned in Remark 6.

Lemma 9.

Let a>0, j(k) be the current delay, \eta^{k} be the current step size, and \epsilon_{1},\epsilon_{2},\ldots>0 be a series of parameters. Then we have:

a\left\|x^{k}-\hat{x}^{k}\right\|\leq\frac{1}{2}a^{2}\eta^{k}\left(\sum_{i=1}^{j(k)}\frac{1}{\epsilon_{i}}\right)+\frac{1}{2}\frac{1}{\eta^{k}}\sum_{i=1}^{j(k)}\left(\epsilon_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}\right) (3.9)
Proof.

See [HannahYin2016_unbounded, PengXuYanYin2016_arock]. ∎

Remark 7 (Free parameters).

Lemma 10 generates some positive parameters \epsilon_{1},\epsilon_{2},\ldots>0 and \delta_{1},\delta_{2},\ldots>0. It is not immediately clear what these parameters should be set to. However, we will see in Section 3.4 that, if they are properly chosen, they can be used to construct a Lyapunov function that will allow us to prove linear convergence.

We will make use of Lemma 9 twice, with parameter sets (\epsilon_{1},\epsilon_{2},\ldots) and (\delta_{1},\delta_{2},\ldots) respectively. To simplify notation, we define:

E_{j}=\sum_{i=1}^{j}\frac{1}{\epsilon_{i}},\qquad D_{j}=\sum_{i=1}^{j}\frac{1}{\delta_{i}} (3.10)
Lemma 10.

Let Assumption 1 hold. Let \epsilon_{1},\epsilon_{2},\ldots>0 and \delta_{1},\delta_{2},\ldots>0 be a sequence of parameters, and E_{j}, D_{j} defined as above. Let \eta^{k} be {\mathcal{G}}^{k}-measurable. ARock yields the following inequality:

{\mathbb{E}}\left[\left\|x^{k+1}\right\|^{2}\,\big{|}\,{\mathcal{F}}^{k}\right]\leq\left(1-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left(1-\eta^{k}D_{j(k)}\right)\right)\left\|x^{k}\right\|^{2}+\frac{1}{m}\sum_{i=1}^{j(k)}\left(\delta_{i}\left(1-r^{2}\right)+\epsilon_{i}\right)\left\|x^{k+1-i}-x^{k-i}\right\|^{2}
-\frac{\eta^{k}}{m}\left\|S\hat{x}^{k}\right\|^{2}\left(1-\eta^{k}\left(1+E_{j(k)}\right)\right)
Proof.

We make use of Lemma 9 twice in this proof, with parameter sets (\epsilon_{1},\epsilon_{2},\ldots) and (\delta_{1},\delta_{2},\ldots) respectively.

-2\frac{\eta^{k}}{m}\left\langle x^{k},S\hat{x}^{k}\right\rangle
=-2\frac{\eta^{k}}{m}\left\langle\hat{x}^{k},S\hat{x}^{k}\right\rangle-2\frac{\eta^{k}}{m}\left\langle x^{k}-\hat{x}^{k},S\hat{x}^{k}\right\rangle
\leq-\frac{\eta^{k}}{m}\left(\left\|S\hat{x}^{k}\right\|^{2}+\left(1-r^{2}\right)\left\|\hat{x}^{k}\right\|^{2}\right)+2\frac{\eta^{k}}{m}\left\|x^{k}-\hat{x}^{k}\right\|\cdot\left\|S\hat{x}^{k}\right\|
\leq-\frac{\eta^{k}}{m}\left(\left\|S\hat{x}^{k}\right\|^{2}+\left(1-r^{2}\right)\left\|\hat{x}^{k}\right\|^{2}\right)+2\frac{\eta^{k}}{m}\left(\frac{1}{2}\left\|S\hat{x}^{k}\right\|^{2}\eta^{k}E_{j(k)}+\frac{1}{2}\frac{1}{\eta^{k}}\sum_{i=1}^{j(k)}\epsilon_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}\right)
=-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left\|\hat{x}^{k}\right\|^{2}+\frac{1}{m}\sum_{i=1}^{j(k)}\epsilon_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}-\frac{\eta^{k}}{m}\left\|S\hat{x}^{k}\right\|^{2}\left(1-\eta^{k}E_{j(k)}\right) (3.11)

For a sufficiently small step size, this inequality allows us to negate the \left\|S\hat{x}^{k}\right\|^{2} terms. Now let’s examine -\left\|\hat{x}^{k}\right\|^{2}, which we convert to a -\left\|x^{k}\right\|^{2} term (and some error) for linear convergence.

-\left\|\hat{x}^{k}\right\|^{2}=-\left\|x^{k}\right\|^{2}-2\left\langle\hat{x}^{k}-x^{k},x^{k}\right\rangle-\left\|x^{k}-\hat{x}^{k}\right\|^{2}
\leq-\left\|x^{k}\right\|^{2}+2\left\|\hat{x}^{k}-x^{k}\right\|\left\|x^{k}\right\|
\text{(Lemma 9)}\quad\leq-\left\|x^{k}\right\|^{2}+\left\|x^{k}\right\|^{2}\eta^{k}D_{j(k)}+\frac{1}{\eta^{k}}\sum_{i=1}^{j(k)}\left(\delta_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}\right)
=-\left(1-\eta^{k}D_{j(k)}\right)\left\|x^{k}\right\|^{2}+\frac{1}{\eta^{k}}\sum_{i=1}^{j(k)}\left(\delta_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}\right)

Hence substituting into (3.11), we have

-2\frac{\eta^{k}}{m}\left\langle x^{k},S\hat{x}^{k}\right\rangle\leq-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left(1-\eta^{k}D_{j(k)}\right)\left\|x^{k}\right\|^{2}+\frac{\eta^{k}}{m}\left(1-r^{2}\right)\frac{1}{\eta^{k}}\sum_{i=1}^{j(k)}\left(\delta_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}\right)
+\frac{1}{m}\sum_{i=1}^{j(k)}\epsilon_{i}\left\|x^{k+1-i}-x^{k-i}\right\|^{2}-\frac{\eta^{k}}{m}\left\|S\hat{x}^{k}\right\|^{2}\left(1-\eta^{k}E_{j(k)}\right)
=-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left(1-\eta^{k}D_{j(k)}\right)\left\|x^{k}\right\|^{2}+\frac{1}{m}\sum_{i=1}^{j(k)}\left(\delta_{i}\left(1-r^{2}\right)+\epsilon_{i}\right)\left\|x^{k+1-i}-x^{k-i}\right\|^{2}
-\frac{\eta^{k}}{m}\left\|S\hat{x}^{k}\right\|^{2}\left(1-\eta^{k}E_{j(k)}\right)

Using (3.7) immediately yields the result. ∎

3.3 The Lyapunov function

We now consider how the Lyapunov function defined in Equation 1.11 changes in size from step to step. The reason that a Lyapunov function is needed is to deal with the \left\|x^{k+1-i}-x^{k-i}\right\|^{2} terms. They cannot be negated like the \left\|S\hat{x}^{k}\right\|^{2} terms, and so must be incorporated into the error by using a Lyapunov function. Let P_{l}=\mathbb{P}\left[j(k)\geq l\right].

Lemma 11.

Let the conditions of Lemma 10 and Assumption 2 hold. Define

\eta_{1}=\left(1+\frac{c_{1}}{m}+\left\|\frac{1}{\epsilon_{i}}\right\|_{\ell^{1}}\right)^{-1} (3.12)
\eta_{2}=\left(\sum_{i=1}^{\infty}\frac{P_{i}}{\delta_{i}}\right)^{-1} (3.13)
R(\eta,\gamma)=\left(1-\frac{\eta}{m}\left(1-r^{2}\right)\left(1-\eta/\gamma\right)\right) (3.14)

Let \eta^{k} be {\mathcal{G}}^{k}-measurable, and \eta^{k}\leq\eta_{1}. Then ARock satisfies:

{\mathbb{E}}\left[\xi^{k+1}\,\big{|}\,{\mathcal{G}}^{k}\right]\leq\left\|x^{k}\right\|^{2}R\left(\eta^{k},\eta_{2}\right)+\frac{1}{m}\sum_{i=1}^{\infty}\left(\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)P_{i}+c_{i+1}\right)\left\|x^{k+1-i}-x^{k-i}\right\|^{2}

Notice that we have defined \eta_{1} and \eta_{2} in terms of the unspecified parameters (\epsilon_{1},\epsilon_{2},\ldots) and (\delta_{1},\delta_{2},\ldots). Eventually, we will set \epsilon_{i}=m^{1/2}P_{i}^{-1/2}\rho^{i/2} and \delta_{i}=m^{1/2}(1-r^{2})^{-1/2}\rho^{i/2} for reasons that will be explained in Section 3.5. With this parameter choice, the definitions of \eta_{1} and \eta_{2} will match eq. 2.23 and eq. 3.2 respectively.

Proof.
{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{F}}^{k}\big] =\underbrace{{\mathbb{E}}\big[\|x^{k+1}\|^{2}\,\big|\,{\mathcal{F}}^{k}\big]}_{A}+\underbrace{\frac{c_{1}}{m}{\mathbb{E}}\big[\|x^{k+1}-x^{k}\|^{2}\,\big|\,{\mathcal{F}}^{k}\big]}_{B}+\underbrace{\frac{1}{m}\sum_{i=1}^{\infty}c_{i+1}\|x^{k+1-i}-x^{k-i}\|^{2}}_{C} (3.15)

We obtain a bound on A from Lemma 10. B follows by the definition of ARock:

B =\frac{c_{1}}{m}\frac{\left(\eta^{k}\right)^{2}}{m}\|S\hat{x}^{k}\|^{2}.

C contains no expectation because it is {\mathcal{F}}^{k} measurable. Hence we have:

\begin{aligned} {\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{F}}^{k}\big]&\leq\|x^{k}\|^{2}\left(1-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left(1-\eta^{k}D_{j(k)}\right)\right)-\frac{\eta^{k}}{m}\|S\hat{x}^{k}\|^{2}\left(1-\eta^{k}\left(1+\frac{c_{1}}{m}+E_{j(k)}\right)\right)\\ &+\frac{1}{m}\sum_{i=1}^{j(k)}\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)\|x^{k+1-i}-x^{k-i}\|^{2}+\frac{1}{m}\sum_{i=1}^{\infty}c_{i+1}\|x^{k+1-i}-x^{k-i}\|^{2}\end{aligned} (3.16)

Notice that E_{j}=\sum_{i=1}^{j}1/\epsilon_{i}\leq\left\|1/\epsilon_{i}\right\|_{\ell^{1}} for all j, and that therefore the step size condition eliminates the \|S\hat{x}^{k}\|^{2} term.

Now it becomes necessary to take expectations over the delay distribution (by taking the expectation with respect to {\mathcal{G}}^{k} instead of {\mathcal{F}}^{k}). Notice that for a sequence (\gamma_{1},\gamma_{2},\ldots), we have {\mathbb{E}}\big[\sum_{i=1}^{j(k)}\gamma_{i}\,\big|\,{\mathcal{G}}^{k}\big]=\sum_{i=1}^{\infty}P_{i}\gamma_{i}. This yields:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{G}}^{k}\big] \leq\|x^{k}\|^{2}\left(1-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left(1-\eta^{k}\sum_{i=1}^{\infty}\frac{P_{i}}{\delta_{i}}\right)\right)
+\frac{1}{m}\sum_{i=1}^{\infty}\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)P_{i}\|x^{k+1-i}-x^{k-i}\|^{2}+\frac{1}{m}\sum_{i=1}^{\infty}c_{i+1}\|x^{k+1-i}-x^{k-i}\|^{2}

which completes the proof. ∎
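The identity used above, {\mathbb{E}}\big[\sum_{i=1}^{j(k)}\gamma_{i}\big]=\sum_{i=1}^{\infty}P_{i}\gamma_{i}, is easy to sanity-check numerically. The minimal sketch below does so for an assumed geometric delay distribution; the sequence \gamma_{i} and the distribution are illustrative choices, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: delays j ~ Geometric(p) on {1, 2, ...},
# and a summable positive sequence gamma_i = 0.8^i.
p, n_samples, L = 0.4, 200_000, 60
gamma = 0.8 ** np.arange(1, L + 1)

j = rng.geometric(p, size=n_samples)          # random delays
partial = np.cumsum(gamma)                    # partial sums of gamma_i
lhs = partial[np.minimum(j, L) - 1].mean()    # Monte Carlo E[ sum_{i<=j} gamma_i ]

P = (1 - p) ** np.arange(0, L)                # P_i = P[j >= i] = (1-p)^(i-1)
rhs = np.sum(P * gamma)                       # sum_i P_i * gamma_i

print(f"Monte Carlo: {lhs:.5f}   formula: {rhs:.5f}")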

3.4 Linear convergence

The right-hand side in Lemma 11 closely resembles \xi^{k}. Ideally, we have:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{G}}^{k}\big] \leq\gamma\xi^{k} (3.17)

for 0<\gamma<1, and some choice of parameters (\epsilon_{1},\epsilon_{2},\ldots), (\delta_{1},\delta_{2},\ldots) and coefficients (c_{1},c_{2},\ldots). In this section we will derive such a result by carefully choosing these parameters. First, however, we need the following lemma, which supplies a coefficient formula.

Lemma 12 (Coefficient formula).

Let 0<\rho<1 and let (s_{1},s_{2},\ldots) be a positive sequence. Consider the coefficient formula:

c_{i} =\sum^{\infty}_{l=i}s_{l}\rho^{-(l-i+1)}. (3.18)

If c_{1}<\infty, then we have c_{i}\downarrow 0 and:

\rho c_{i} =c_{i+1}+s_{i} (3.19)
Proof.
\rho c_{i} =\sum_{l=i}^{\infty}s_{l}\rho^{-\left(l-i\right)}=\sum_{l=i+1}^{\infty}s_{l}\rho^{-\left(l-\left(i+1\right)+1\right)}+s_{i}=c_{i+1}+s_{i}

Since the coefficients are nonnegative and \rho<1, this recursion gives c_{i+1}\leq\rho c_{i}, and hence c_{i}\downarrow 0. ∎
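As a quick numerical sanity check of (3.18) and (3.19), the sketch below builds truncated coefficients c_i for an assumed geometrically decaying sequence s_l and verifies the recursion; the specific \rho and s_l are illustrative assumptions.

import numpy as np

# Illustrative assumptions: rho < 1 and s_l decaying fast enough that
# c_1 = sum_l s_l * rho^{-l} is finite (here s_l = 0.5^l with rho = 0.9).
rho, L = 0.9, 400                       # L = truncation length
s = 0.5 ** np.arange(1, L + 1)          # s_1, ..., s_L

def c(i):
    """Truncated version of (3.18): c_i = sum_{l>=i} s_l rho^{-(l-i+1)}."""
    l = np.arange(i, L + 1)
    return np.sum(s[l - 1] * rho ** -(l - i + 1.0))

for i in (1, 2, 5, 10):
    lhs, rhs = rho * c(i), c(i + 1) + s[i - 1]
    print(f"i={i:2d}  rho*c_i={lhs:.10f}  c_(i+1)+s_i={rhs:.10f}")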

Recall that \rho is defined in eq. 2.21.

Proposition 13 (Linear convergence for stochastic delays).

Let Assumption 1 hold. Let \eta^{k}\leq\eta_{1}, and let \epsilon_{1},\epsilon_{2},\ldots>0 and \delta_{1},\delta_{2},\ldots>0 be sequences of parameters. Let

\sum^{\infty}_{l=1}\left(\epsilon_{l}+\left(1-r^{2}\right)\delta_{l}\right)P_{l}\rho^{-l}<\infty (3.20)

With the choice of coefficients (this formula will eventually match eq. 3.1 once the parameters \epsilon_{1},\epsilon_{2},\ldots>0 and \delta_{1},\delta_{2},\ldots>0 are chosen later):

c_{i} =\sum^{\infty}_{l=i}\left(\epsilon_{l}+\left(1-r^{2}\right)\delta_{l}\right)P_{l}\rho^{-(l-i+1)} (3.21)

We have the following linear convergence result:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{G}}^{k}\big] \leq R\left(\eta^{k},\eta_{2}\right)\xi^{k} (3.22)
Proof.

By applying Lemma 12 with s_{l}=P_{l}\left(\epsilon_{l}+\left(1-r^{2}\right)\delta_{l}\right), we obtain:

\rho c_{i} =\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)P_{i}+c_{i+1}

Hence from Lemma 11, we have:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{G}}^{k}\big] \leq\|x^{k}\|^{2}R\left(\eta^{k},\eta_{2}\right)+\frac{1}{m}\sum_{i=1}^{\infty}\left(\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)P_{i}+c_{i+1}\right)\|x^{k+1-i}-x^{k-i}\|^{2}
\leq\|x^{k}\|^{2}R\left(\eta^{k},\eta_{2}\right)+\frac{1}{m}\sum_{i=1}^{\infty}\rho c_{i}\|x^{k+1-i}-x^{k-i}\|^{2}\leq\max\left(\rho,R\left(\eta^{k},\eta_{2}\right)\right)\xi^{k}=R\left(\eta^{k},\eta_{2}\right)\xi^{k}\qed

The last line follows, because 0\leq\eta^{k}\leq\eta_{1}\leq 1 implies \rho\leq R(\eta^{k},\eta_{2}).

3.5 Proof of Theorem 2

Recall we have the following step size restriction \eta^{k}\leq\eta_{1} (with \eta_{1} defined in eq. 3.12) coupled with the convergence rate:

R\left(\eta^{k},\eta_{2}\right) =1-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left(1-\eta^{k}/\eta_{2}\right)\text{, for }\eta_{2}=\left(\sum_{i=1}^{\infty}\frac{P_{i}}{\delta_{i}}\right)^{-1}

We now prove Theorem 2. However, we do so in a way that justifies the choice of parameters that we use. In the case of (\epsilon_{1},\epsilon_{2},\ldots) there is a best choice; for the other parameters, we simply pick a sensible though not necessarily optimal choice. First, though, a couple of lemmas are needed to describe the asymptotic convergence rate and iteration complexity.

Lemma 14.

Say that as m\to\infty, we have x(m)={\mathcal{O}}(1), y(m)={\mathcal{O}}(1), and:

t_{1} =1-x(m)m^{-a}+o\left(m^{-a}\right),\quad 1>a>0
t_{2}^{-1} =y(m)m^{-b}+o\left(m^{-b}\right),\quad 1>b>0

Then

R\left(t_{1},t_{2}\right) =1-\frac{1}{m}\left(1-r^{2}\right)\left(1-\left(x(m)m^{-a}+y(m)m^{-b}\right)+o\left(m^{-a}+m^{-b}\right)\right)
Proof.
R\left(t_{1},t_{2}\right) =1-\frac{1}{m}\left(1-r^{2}\right)t_{1}\left(1-t_{1}t_{2}^{-1}\right)
t_{1}\left(1-t_{1}t_{2}^{-1}\right) =\left(1-xm^{-a}+o\left(m^{-a}\right)\right)\left(1-\left(1-xm^{-a}+o\left(m^{-a}\right)\right)\left(ym^{-b}+o\left(m^{-b}\right)\right)\right)
=\left(1-xm^{-a}+o\left(m^{-a}\right)\right)\left(1-\left(1+{\mathcal{O}}\left(m^{-a}\right)\right)\left(ym^{-b}+o\left(m^{-b}\right)\right)\right)
=\left(1-xm^{-a}+o\left(m^{-a}\right)\right)\left(1-ym^{-b}+o\left(m^{-b}\right)\right)
=1-xm^{-a}-ym^{-b}+o\left(m^{-a}+m^{-b}\right)
R\left(t_{1},t_{2}\right) =1-\frac{1}{m}\left(1-r^{2}\right)\left(1-\left(xm^{-a}+ym^{-b}\right)+o\left(m^{-a}+m^{-b}\right)\right)\qed
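The key expansion step above is easy to check numerically: the factor t_1(1-t_1 t_2^{-1}) approaches 1-(xm^{-a}+ym^{-b}) as m grows. The sketch below does this comparison for illustrative values of x(m), y(m), a, and b (all are assumptions for illustration, not values from the paper).

import numpy as np

# Illustrative assumptions: x(m) = 2, y(m) = 1, a = b = 0.25.
x, y, a, b = 2.0, 1.0, 0.25, 0.25

for m in (1e3, 1e5, 1e7):
    t1 = 1 - x * m**-a                    # t_1 = 1 - x m^{-a}
    t2_inv = y * m**-b                    # t_2^{-1} = y m^{-b}
    exact = t1 * (1 - t1 * t2_inv)        # factor appearing in R(t_1, t_2)
    approx = 1 - (x * m**-a + y * m**-b)  # leading-order factor from Lemma 14
    print(f"m={m:.0e}  exact factor={exact:.4f}  approx={approx:.4f}")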
Lemma 15.

Let x(m)={\mathcal{O}}(1) and a\geq 0. Suppose that, as m\to\infty, the linear convergence rate R satisfies:

R=1-\frac{1}{m}\left(1-r^{2}\right)\left(1-\left(xm^{-a}+o\left(m^{-a}\right)\right)\right) (3.23)

Then we have:

I\left(\epsilon\right) =\left(1+xm^{-a}+o\left(m^{-a}\right)\right)\left(\frac{1}{1-r^{2}}\right)\ln\left(1/\epsilon\right) (3.24)
Proof.
\epsilon =R^{I(\epsilon)m}
\ln\left(1/\epsilon\right) =-I(\epsilon)m\ln R
=I(\epsilon)m\left(\frac{1}{m}\left(1-r^{2}\right)\left(1-xm^{-a}+o\left(m^{-a}\right)\right)+{\mathcal{O}}\left(\left(\frac{1}{m}\left(1-r^{2}\right)\left(1-xm^{-a}+o\left(m^{-a}\right)\right)\right)^{2}\right)\right)
=I(\epsilon)m\left(\frac{1}{m}\left(1-r^{2}\right)\left(1-xm^{-a}+o\left(m^{-a}\right)\right)+{\mathcal{O}}\left(\left(\frac{1}{m}\left(1-r^{2}\right){\mathcal{O}}\left(1\right)\right)^{2}\right)\right)
=I(\epsilon)m\,\frac{1}{m}\left(1-r^{2}\right)\left(1-xm^{-a}+o\left(m^{-a}\right)\right)
I(\epsilon) =\left(1+xm^{-a}+o\left(m^{-a}\right)\right)\left(\frac{1}{1-r^{2}}\right)\ln\left(1/\epsilon\right)\qed
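As an illustration of Lemma 15, the sketch below compares the exact epoch count \ln(1/\epsilon)/(-m\ln R) against the approximation (3.24) for a rate of the assumed form (3.23); the numerical values of r, \epsilon, x, and a are illustrative assumptions, and the two quantities converge as m grows.

import numpy as np

# Illustrative assumptions: r, accuracy eps, penalty term x*m^-a with x=2, a=0.25.
r, eps, x, a = 0.9, 1e-6, 2.0, 0.25

for m in (1e3, 1e5, 1e7):
    R = 1 - (1 - r**2) / m * (1 - x * m**-a)                  # rate of the form (3.23)
    exact = np.log(1 / eps) / (-m * np.log(R))                # epochs solving eps = R^(I*m)
    approx = (1 + x * m**-a) / (1 - r**2) * np.log(1 / eps)   # eq. (3.24)
    print(f"m={m:.0e}  exact={exact:.3f}  approx={approx:.3f}")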
Proof of Theorem 2.

We start with the conditions of Proposition 13. M_{1} and M_{2} being finite corresponds to Equation 3.20 in Proposition 13.

It is immediately possible to maximize \eta_{1} over the sequence \epsilon_{i} by letting \epsilon_{i}=\sqrt{m}P_{i}^{-1/2}\rho^{i/2}. All things being equal, increasing \eta_{1} allows for a better convergence rate by increasing the range of possible step sizes. This leads to:

\eta_{1} =\left(1+m^{-1}\left(1-r^{2}\right)\sum_{l=1}^{\infty}\delta_{l}P_{l}\rho^{-l}+2m^{-1/2}\sum_{l=1}^{\infty}P_{l}^{1/2}\rho^{-l/2}\right)^{-1} (3.25)

and leaves \eta_{2} unchanged. We also let \delta_{l}=d\rho^{l/2}, with d to be determined later. This yields:

\eta_{1} =\left(1+m^{-1}\left(1-r^{2}\right)dM_{1}+2m^{-1/2}M_{2}\right)^{-1}, \qquad \eta_{2}^{-1} =d^{-1}M_{1}

for M_{1}=\sum_{l=1}^{\infty}P_{l}\rho^{-l/2} and M_{2}=\sum_{l=1}^{\infty}P_{l}^{1/2}\rho^{-l/2}. Now we set d=am^{1/2}\left(1-r^{2}\right)^{-1/2}, for a to be determined later. We make this choice so that 1-\eta_{1} and \eta_{2}^{-1} are both {\mathcal{O}}\left(m^{q-1/2}\right) for large m, which balances the two contributions to the asymptotic penalty. Recall that the moments M_{1} and M_{2} vary with m, and satisfy M_{1},M_{2}={\mathcal{O}}\left(m^{q}\right) for 0\leq q<1/2. This yields:

\eta_{1} =\left(1+am^{-1/2}\left(1-r^{2}\right)^{1/2}M_{1}+2m^{-1/2}M_{2}\right)^{-1}
=1-m^{-1/2}\left(a\left(1-r^{2}\right)^{1/2}M_{1}+2M_{2}\right)+{\mathcal{O}}\left(m^{2q-1}\right)
\eta_{2}^{-1} =a^{-1}m^{-1/2}\left(1-r^{2}\right)^{1/2}M_{1}

It is clear that M_{1}, M_{2} and c_{i} match the formulas given in Theorem 2 (see eq. 2.22 and eq. 3.1). We now use Lemma 14 with t_{1}=\eta_{1}, t_{2}=\eta_{2}, a=b=\frac{1}{2}-q, etc.

R\left(\eta_{1},\eta_{2}\right) =1-\frac{1}{m}\left(1-r^{2}\right)\left(1-m^{-1/2}\left(\left(a+a^{-1}\right)\left(1-r^{2}\right)^{1/2}M_{1}+2M_{2}\right)+o\left(m^{-1/2+q}\right)\right)

Clearly, to minimize the lowest-order asynchronicity penalty we should let a=1, which minimizes a+a^{-1}. With this choice, we have proven Equation 3.4. Using this in conjunction with Lemma 15 completes the proof of Theorem 2. ∎
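To illustrate the size of the resulting asynchronicity penalty, the sketch below computes M_1, M_2 and the leading penalty term m^{-1/2}(2(1-r^{2})^{1/2}M_{1}+2M_{2}) (the a=1 choice above) for an assumed geometric delay distribution, showing that it vanishes as m grows. The distribution, the value of r, and the form \rho=1-(1-r^{2})/m are illustrative assumptions.

import numpy as np

# Illustrative assumptions: geometric tail P_l = p^(l-1) and contraction factor r.
r, p, L = 0.9, 0.5, 400
l = np.arange(1, L + 1)
P = p ** (l - 1)

for m in (1e2, 1e4, 1e6):
    rho = 1 - (1 - r**2) / m                       # assumed form of eq. 2.21
    M1 = np.sum(P * rho ** (-l / 2))               # M_1 = sum_l P_l rho^(-l/2)
    M2 = np.sum(np.sqrt(P) * rho ** (-l / 2))      # M_2 = sum_l sqrt(P_l) rho^(-l/2)
    penalty = m**-0.5 * (2 * np.sqrt(1 - r**2) * M1 + 2 * M2)
    print(f"m={m:.0e}  M1={M1:.3f}  M2={M2:.3f}  penalty={penalty:.2e}")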

4 Deterministic Unbounded Delays

We now present a convergence result for deterministic unbounded delays.

Assumption 3 (Deterministic unbounded delays).

The sequence of delay vectors \vec{j}(0),\vec{j}(1),\vec{j}(2),\ldots is an arbitrary sequence in \mathbb{N}^{m}.

If no assumption can be made on the distribution of the delays, we have a slightly weaker result. Whereas before it was possible to use a constant step size, here the step size and convergence rate at step k depend on the current delay j(k). Results that assume a bounded delay \tau need to know \tau in advance to set the step size. If \tau is very large, this will decrease the allowable step size, which will slow convergence even if a delay of \tau is very rare. Here we make no such assumption, and the step size and convergence rate adapt to whatever delay conditions exist in the system. The step size \eta^{k} is a function of the current delay j(k), which must be measured (or upper bounded) at each iteration k.

The delay is allowed to be unbounded. The larger the delay, the less progress is made, but some progress is always made at every step. We define a good behavior boundary

T(m) =bm^{q}+d (4.1)

for 0\leq q<1/2 and arbitrary parameters b,d>0. (These parameters are arbitrary, but setting them involves a trade-off, as we will later see: the larger b is, the larger the asynchronicity penalty, and the larger m has to be to ensure a negligible penalty; see Theorem 3.) Any delay less than this will not create a noticeable penalty in the progress of the algorithm in a large system (i.e. as m\to\infty) as compared to synchronous ARock. However, if the delay is much larger than T(m), the progress can eventually become vanishingly small. T(m) can become very large, since it grows as \Theta\left(m^{q}\right).

We let \rho again be defined as in Equation 2.21. For arbitrary c>0 (again, c can be set freely, but the larger c is, the more the convergence rate is penalized for exceeding T(m); the smaller c is, the larger the asynchronicity penalty will be, and the larger m needs to be to ensure a negligible penalty, much like b), let

\gamma =\rho-cm^{-q} (4.2)

We define the coefficients in the Lyapunov function from Equation 1.11:

c_{i} =m^{1/2}\left(1+2\left(1-r^{2}\right)\right)\frac{\left(\gamma/\rho\right)^{i}}{1-\gamma/\rho} (4.3)

We also define the following step size functions:

H_{1}\left(j\right) =\left(1+c^{-1}m^{q-1/2}\left(3+\gamma^{-j}\right)\right)^{-1}, \qquad H_{2}\left(j\right)=2cm^{\frac{1}{2}-q}\gamma^{j} (4.4)

These functions play similar roles to \eta_{1} and \eta_{2} from Theorem 2. We again use the same convergence-rate function R as defined in Equation 3.3. Let \eta^{k} be {\mathcal{F}}^{k} measurable.
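The sketch below evaluates the adaptive step size H_1(j) from (4.4) and the normalized progress implied by the rate R(H_1(j), H_2(j)) for a few delays j, illustrating that small delays cost almost nothing while very large delays shrink the step size. The parameter values (m, r, q, c) and the form \rho=1-(1-r^{2})/m are illustrative assumptions.

import numpy as np

# Illustrative assumptions for the deterministic-delay setting.
m, r, q, c = 1e6, 0.9, 0.25, 1.0
rho = 1 - (1 - r**2) / m                  # assumed form of eq. 2.21
gamma = rho - c * m**-q                   # eq. (4.2)

def H1(j):                                # step size rule, eq. (4.4)
    return 1.0 / (1 + m**(q - 0.5) / c * (3 + gamma**-j))

def H2(j):                                # eq. (4.4)
    return 2 * c * m**(0.5 - q) * gamma**j

def R(eta, g):                            # rate function, eq. (3.14)
    return 1 - (eta / m) * (1 - r**2) * (1 - eta / g)

for j in (0, 10, int(m**q), int(10 * m**q)):
    eta = H1(j)
    progress = (1 - R(eta, H2(j))) * m / (1 - r**2)   # = eta*(1 - eta/H2(j))
    print(f"j={j:7d}  H1(j)={eta:.4f}  normalized progress={progress:.4f}")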

Theorem 3.

Let Assumption 1 and Assumption 3 hold. Consider the Lyapunov function defined in Equation 1.11 with coefficients given by Equation 4.3. Let \eta^{k}\leq H_{1}\left(j(k)\right). Then we have:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{G}}^{k}\big] \leq R\left(\eta^{k},H_{2}\left(j(k)\right)\right)\xi^{k} (4.5)

Additionally, let \eta^{k}=H_{1}\left(j(k)\right). If we have j(k)\leq T(m)=bm^{q}+d for some q\in[0,\frac{1}{2}) and b,d>0, then the convergence rate satisfies:

R \leq 1-\frac{1}{m}\left(1-r^{2}\right)\left(1-\frac{1}{c}m^{q-1/2}\left(3+\frac{3}{2}\exp\left(bc\right)\right)+o\left(m^{q-\frac{1}{2}}\right)\right)

which corresponds to iteration complexity:

I\left(\epsilon\right) =\left(1+\frac{1}{c}m^{q-1/2}\left(3+\frac{3}{2}\exp\left(bc\right)\right)+o\left(m^{q-\frac{1}{2}}\right)\right)\left(\frac{1}{1-r^{2}}\right)\ln\left(1/\epsilon\right)
Remark 8.

Much like in the stochastic delay case (see Remark 7), we prove a more general result where the Lyapunov function and step size depend on a sequence of parameters \epsilon_{1},\epsilon_{2},\ldots>0 and \delta_{1},\delta_{2},\ldots>0. We then make a sensible choice of these parameters to make the result simpler and more interpretable.

4.1 Starting point

The starting point is Equation 3.16. However, in this result we assume that \eta^{k} is actually {\mathcal{F}}^{k}-measurable (that is, it is now allowed to also depend on the delays \left(\vec{j}(0),\vec{j}(1),\vec{j}(2),\ldots,\vec{j}(k)\right), whereas before it could only depend on the iterates (x^{0},x^{1},x^{2},\ldots)). Recall the definitions of E_{j} and D_{j} given in eq. 3.10.

Lemma 16.

Let Assumption 1 and Assumption 3 hold. Let \epsilon_{1},\epsilon_{2},\ldots>0 and \delta_{1},\delta_{2},\ldots>0 be sequences of parameters. Let \eta^{k} be {\mathcal{F}}^{k}-measurable. Define the step size functions:

h_{1}(j) =\left(1+\frac{c_{1}}{m}+E_{j}\right)^{-1}, \qquad h_{2}(j)=D^{-1}_{j}

Let \eta^{k}\leq h_{1}\left(j(k)\right). Then ARock yields the following inequality:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{F}}^{k}\big] \leq\|x^{k}\|^{2}R\left(\eta^{k},h_{2}\left(j(k)\right)\right)+\frac{1}{m}\sum_{i=1}^{\infty}\left(\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)+c_{i+1}\right)\|x^{k+1-i}-x^{k-i}\|^{2}

The step size functions h_{1} and h_{2} are complex expressions. When we eventually set the free parameters \epsilon_{1},\epsilon_{2},\ldots>0 and \delta_{1},\delta_{2},\ldots>0, we simplify these functions with inequalities to yield H_{1} and H_{2}, which are much easier to interpret.

Proof.

From Equation 3.16 we have:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{F}}^{k}\big] \leq\|x^{k}\|^{2}\left(1-\frac{\eta^{k}}{m}\left(1-r^{2}\right)\left(1-\eta^{k}D_{j(k)}\right)\right)-\frac{\eta^{k}}{m}\|S\hat{x}^{k}\|^{2}\left(1-\eta^{k}\left(1+\frac{c_{1}}{m}+E_{j(k)}\right)\right)
+\frac{1}{m}\sum_{i=1}^{j(k)}\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)\|x^{k+1-i}-x^{k-i}\|^{2}+\frac{1}{m}\sum_{i=1}^{\infty}c_{i+1}\|x^{k+1-i}-x^{k-i}\|^{2}
\leq\|x^{k}\|^{2}R\left(\eta^{k},h_{2}\left(j(k)\right)\right)-\frac{\eta^{k}}{m}\|S\hat{x}^{k}\|^{2}\left(1-\eta^{k}/h_{1}\left(j(k)\right)\right)
+\frac{1}{m}\sum_{i=1}^{\infty}\left(\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)+c_{i+1}\right)\|x^{k+1-i}-x^{k-i}\|^{2}

Setting \eta^{k}\leq h_{1}\left(j(k)\right) eliminates the \|S\hat{x}^{k}\|^{2} term and yields the desired result. ∎

4.2 Linear convergence

Proposition 17 (Linear convergence for deterministic delays).

Let the conditions of Lemma 16 hold. Also assume:

\sum_{i=1}^{\infty}\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)\rho^{-i}<\infty (4.6)

Let the coefficients of the Lyapunov function be given by:

c_{i} =\sum_{l=i}^{\infty}\left(\epsilon_{l}+\left(1-r^{2}\right)\delta_{l}\right)\rho^{-(l-i+1)} (4.7)

Then we have the following linear convergence rate at step k:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{F}}^{k}\big] \leq R\left(\eta^{k},h_{2}\left(j(k)\right)\right)\xi^{k} (4.8)
Proof.

We apply Lemma 12 to Lemma 16 with s_{i}=\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}, and we obtain:

\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}+c_{i+1} \leq\rho c_{i} (4.9)

Hence we have:

{\mathbb{E}}\big[\xi^{k+1}\,\big|\,{\mathcal{F}}^{k}\big] \leq\|x^{k}\|^{2}R\left(\eta^{k},h_{2}\left(j(k)\right)\right)+\frac{1}{m}\sum_{i=1}^{\infty}\left(\left(\epsilon_{i}+\left(1-r^{2}\right)\delta_{i}\right)+c_{i+1}\right)\|x^{k+1-i}-x^{k-i}\|^{2} (4.10)
\leq\|x^{k}\|^{2}R\left(\eta^{k},h_{2}\left(j(k)\right)\right)+\frac{1}{m}\sum_{i=1}^{\infty}\rho c_{i}\|x^{k+1-i}-x^{k-i}\|^{2} (4.11)
\leq\max\left\{R\left(\eta^{k},h_{2}\left(j(k)\right)\right),\rho\right\}\xi^{k} (4.12)
=R\left(\eta^{k},h_{2}\left(j(k)\right)\right)\xi^{k} (4.13)

The last line follows because 0\leq\eta^{k}\leq h_{1}\left(j(k)\right)\leq 1 implies \rho\leq R\left(\eta^{k},h_{2}\left(j(k)\right)\right). ∎

4.3 Proof of Theorem 3

Proof of Theorem 3.

For the first part of the theorem, we simply set

\epsilon_{i} =m^{1/2}\gamma^{i}
\delta_{i} =2m^{1/2}\gamma^{i}

This choice automatically satisfies the conditions of Proposition 17 (i.e. Equation 4.6), and the coefficient formula that arises matches Equation 4.3.

The step size functions h_{1} and h_{2} are given by complicated expressions. Notice that the linear convergence result still holds if we replace them by smaller, more conservative functions H_{1}\leq h_{1} and H_{2}\leq h_{2}. Hence we will simplify these step size expressions to give a more concise step size rule. We have:

\sum_{i=1}^{j}\gamma^{-i} \leq\gamma^{-j}\sum_{i=0}^{\infty}\gamma^{i}=\gamma^{-j}\frac{1}{1-\gamma}\leq c^{-1}m^{q}\gamma^{-j}

which yields:

\left(h_{2}\left(j\right)\right)^{-1} =\sum_{i=1}^{j}\frac{1}{\delta_{i}}\leq\frac{1}{2c}m^{q-\frac{1}{2}}\gamma^{-j}=\left(H_{2}\left(j\right)\right)^{-1}

which matches Equation 4.4. We also have:

c_{1} =\sum_{l=1}^{\infty}\left(\epsilon_{l}+\left(1-r^{2}\right)\delta_{l}\right)\rho^{-l}=m^{1/2}\left(1+2\left(1-r^{2}\right)\right)\sum_{l=1}^{\infty}\left(\gamma/\rho\right)^{l}=m^{1/2}\left(1+2\left(1-r^{2}\right)\right)\frac{\gamma}{\rho-\gamma}\leq 3c^{-1}m^{1/2+q}

And hence:

h_{1}(j) =\left(1+\frac{c_{1}}{m}+\sum_{i=1}^{j}\frac{1}{\epsilon_{i}}\right)^{-1}\geq\left(1+\frac{3}{c}m^{q-1/2}+c^{-1}m^{q-1/2}\gamma^{-j}\right)^{-1}=\left(1+\frac{1}{c}m^{q-1/2}\left(3+\gamma^{-j}\right)\right)^{-1}=H_{1}\left(j\right)

which matches Equation 4.4. Hence Equation 4.5 is proven.

For the second part of the theorem, we need asymptotic expressions for H_{1}\left(T\right) and H_{2}\left(T\right) to determine an asymptotic convergence rate and iteration complexity. We first bound \gamma^{-T}:

\gamma^{-T} =\left(1-\left(1-\gamma\right)\right)^{-T}=\exp\left(T\left(1-\gamma\right)+{\mathcal{O}}\left(T\left(1-\gamma\right)^{2}\right)\right)=\exp\left(bc+{\mathcal{O}}\left(m^{-q}\right)\right)=\exp\left(bc\right)+{\mathcal{O}}\left(m^{-q}\right)

Hence this yields:

\left(h_{2}\left(T\right)\right)^{-1} \leq\frac{1}{2c}m^{q-\frac{1}{2}}\left(\exp\left(bc\right)+{\mathcal{O}}\left(m^{-q}\right)\right)
h_{1}(T) \geq\left(1+\frac{1}{c}m^{q-1/2}\left(3+\exp\left(bc\right)+{\mathcal{O}}\left(m^{-q}\right)\right)\right)^{-1}=1-\frac{1}{c}m^{q-1/2}\left(3+\exp\left(bc\right)\right)+o\left(m^{q-\frac{1}{2}}\right)

Therefore, if j(k)\leq T, we have convergence rate:

R\left(H_{1}\left(j(k)\right),h_{2}\left(j(k)\right)\right) \leq R\left(H_{1}\left(T\right),H_{2}\left(T\right)\right)
=1-\frac{1}{m}\left(1-r^{2}\right)\left(1-\frac{1}{c}m^{q-\frac{1}{2}}\left(3+\frac{3}{2}\exp\left(bc\right)\right)+o\left(m^{q-\frac{1}{2}}\right)\right)
=1-\frac{1}{m}\left(1-r^{2}\right)\left(1-{\mathcal{O}}\left(m^{q-\frac{1}{2}}\right)\right)

by Lemma 14. By Lemma 15 this corresponds to iteration complexity:

I\left(\epsilon\right) =\left(1+\frac{1}{c}m^{q-1/2}\left(3+\frac{3}{2}\exp\left(bc\right)\right)+o\left(m^{q-\frac{1}{2}}\right)\right)\left(\frac{1}{1-r^{2}}\right)\ln\left(1/\epsilon\right)
=\left(1+{\mathcal{O}}\left(m^{q-\frac{1}{2}}\right)\right)\left(\frac{1}{1-r^{2}}\right)\ln\left(1/\epsilon\right)

which completes the proof. ∎
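The parenthetical remarks after (4.1) and (4.2) mention a trade-off in choosing b and c. The sketch below evaluates the leading penalty term \frac{1}{c}m^{q-1/2}\left(3+\frac{3}{2}\exp(bc)\right) from Theorem 3 over a small grid of b and c; the values of m and q are illustrative assumptions.

import numpy as np

# Illustrative assumptions: system size m and growth exponent q.
m, q = 1e6, 0.25

def penalty(b, c):
    """Leading asynchronicity penalty in Theorem 3's iteration complexity."""
    return (1.0 / c) * m**(q - 0.5) * (3 + 1.5 * np.exp(b * c))

for b in (0.5, 1.0, 2.0):
    row = "  ".join(f"c={c:.1f}: {penalty(b, c):.3e}" for c in (0.5, 1.0, 2.0))
    print(f"b={b:.1f}  {row}")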

\printbibliography

Appendix A Auxiliary Results

A.1 Block KM Iterations

Proof of Proposition 6.

Taking conditional expectation on

\|x^{k+1}\|^{2} =\|x^{k}\|^{2}-2\eta^{k}\langle x^{k},P^{k}x^{k}\rangle+\left(\eta^{k}\right)^{2}\|P^{k}x^{k}\|^{2}

with respect to i(k) yields

{\mathbb{E}}\big[\|x^{k+1}\|^{2}\,\big|\,x^{k}\big] =\|x^{k}\|^{2}-2\eta^{k}\frac{p}{m}\langle x^{k},Sx^{k}\rangle+\left(\eta^{k}\right)^{2}\frac{p}{m}\|Sx^{k}\|^{2}
(By Lemma 8) \leq\|x^{k}\|^{2}-\eta^{k}\frac{p}{m}\left(\|Sx^{k}\|^{2}+\left(1-r^{2}\right)\|x^{k}\|^{2}\right)+\left(\eta^{k}\right)^{2}\frac{p}{m}\|Sx^{k}\|^{2}
=\left(1-\eta^{k}\frac{p}{m}\left(1-r^{2}\right)\right)\|x^{k}\|^{2}-\eta^{k}\frac{p}{m}\left(1-\eta^{k}\right)\|Sx^{k}\|^{2}. (A.1)

First let \eta^{k}\leq 1, and continue from (A.1):

(By (1-r)-strong monotonicity of S) \leq\left(1-\eta^{k}\frac{p}{m}\left(1-r^{2}\right)\right)\|x^{k}\|^{2}-\left(1-r\right)^{2}\eta^{k}\frac{p}{m}\left(1-\eta^{k}\right)\|x^{k}\|^{2}
=\left(1-\frac{p}{m}+\frac{p}{m}\left(1-\eta^{k}\left(1-r\right)\right)^{2}\right)\|x^{k}\|^{2}.

Now let \eta^{k}\geq 1, and continue again from (A.1):

{\mathbb{E}}\big[\|x^{k+1}\|^{2}\,\big|\,x^{k}\big]
(Since S is (1+r)-Lipschitz) \leq\left(1-\eta^{k}\frac{p}{m}\left(1-r^{2}\right)\right)\|x^{k}\|^{2}-\eta^{k}\frac{p}{m}\left(1-\eta^{k}\right)\left(1+r\right)^{2}\|x^{k}\|^{2}
=\left(1-\frac{p}{m}+\frac{p}{m}\left(1-\eta^{k}\left(1+r\right)\right)^{2}\right)\|x^{k}\|^{2}.

It can be verified that every single inequality for \eta^{k}\leq 1 is an equality for T=rI, and every single inequality for \eta^{k}\geq 1 is an equality for T=-rI. Therefore the inequalities give a sharp rate of convergence. This rate is optimized when \eta^{k}=1, and matches the rate given in Equation 2.21.

Let’s now look at the corresponding iteration complexity:

\left(1-\frac{p}{m}\left(1-r^{2}\right)\right)^{I(\epsilon)\frac{m}{p}} =\epsilon
I(\epsilon) =\frac{p}{m}\left(\frac{\ln\left(1/\epsilon\right)}{-\ln\left(1-\frac{p}{m}\left(1-r^{2}\right)\right)}\right)

Note that:

1-x \leq\frac{-x}{\ln\left(1-x\right)}\leq 1-\frac{1}{2}x

and hence

I(\epsilon)=\frac{1}{1-r^{2}}\left(1-\theta\frac{p}{m}\left(1-r^{2}\right)\right)\ln\left(1/\epsilon\right)

where \theta\in\left[\frac{1}{2},1\right]. This matches Equation 2.16, hence the proof is complete. ∎
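The elementary bound used above, 1-x\leq -x/\ln(1-x)\leq 1-x/2 for 0<x<1, is easy to verify numerically; the minimal sketch below also recovers the corresponding \theta\in[\frac{1}{2},1] for a few values of x (the grid of x values is an illustrative assumption).

import numpy as np

# Check 1 - x <= -x / ln(1 - x) <= 1 - x/2 on a grid in (0, 1),
# and report theta defined by -x/ln(1-x) = 1 - theta*x.
for x in (0.01, 0.1, 0.5, 0.9, 0.99):
    mid = -x / np.log(1 - x)
    theta = (1 - mid) / x
    assert 1 - x <= mid <= 1 - 0.5 * x
    print(f"x={x:.2f}  -x/ln(1-x)={mid:.4f}  theta={theta:.4f}")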

A.2 Random-Subset Block Gradient Descent

Proof of Corollary 7.

First we consider the operator T=I-\frac{2}{L+\mu}\nabla f:

\|T(y)-T(x)\|^{2} =\|y-x\|^{2}-\frac{4}{\mu+L}\langle\nabla f\left(y\right)-\nabla f\left(x\right),y-x\rangle+\left(\frac{2}{L+\mu}\right)^{2}\|\nabla f\left(y\right)-\nabla f\left(x\right)\|^{2}
(By Theorem 2.1.12 of [Nesterov2013_introductory]) \leq\|y-x\|^{2}-\frac{4}{\mu+L}\left(\frac{\mu L}{\mu+L}\|x-y\|^{2}+\frac{1}{\mu+L}\|\nabla f\left(y\right)-\nabla f\left(x\right)\|^{2}\right)
+\left(\frac{2}{L+\mu}\right)^{2}\|\nabla f\left(y\right)-\nabla f\left(x\right)\|^{2}
=\|y-x\|^{2}\left(1-\frac{4\mu L}{\left(\mu+L\right)^{2}}\right)=\|y-x\|^{2}\left(1-\frac{4\kappa}{\left(1+\kappa\right)^{2}}\right)=\|y-x\|^{2}\left(1-\frac{2}{\kappa+1}\right)^{2}

Hence T is \left(1-\frac{2}{\kappa+1}\right)-Lipschitz.

We now determine the linear convergence rate:

r =1-\frac{2}{\kappa+1}
1-r^{2} =\frac{4}{\kappa+1}\left(\frac{\kappa}{\kappa+1}\right)=\frac{4\kappa}{\left(\kappa+1\right)^{2}}

And hence using Proposition 6 we have:

R =1-\frac{p}{m}\left(1-r^{2}\right)=1-\frac{p}{m}\frac{4\kappa}{\left(\kappa+1\right)^{2}}=1-4\frac{p}{m\kappa}\left(1+{\mathcal{O}}\left(1/\kappa\right)\right)

which matches Equation 2.18. Next we determine the iteration complexity using Proposition 6 again:

I(\epsilon) =\left(\frac{1}{1-r^{2}}-\theta\frac{p}{m}\right)\ln\left(1/\epsilon\right)=\left(\frac{\left(\kappa+1\right)^{2}}{4\kappa}-\theta\frac{p}{m}\right)\ln\left(1/\epsilon\right)=\frac{1}{4}\left(\kappa+2+\frac{1}{\kappa}-4\theta\frac{p}{m}\right)\ln\left(1/\epsilon\right)
=\frac{1}{4}\left(\kappa+{\mathcal{O}}\left(1\right)\right)\ln\left(1/\epsilon\right)

This matches Equation 2.19.

Lastly, we prove that the convergence rate is sharp. Recall the proof of Proposition 6. Now consider the function f\left(x\right)=\frac{1}{2}\mu\|x\|^{2}. This leads to T=\left(1-\frac{2}{\kappa+1}\right)I, which is equal to the worst-case example for \eta\leq 1. Additionally, consider f\left(x\right)=\frac{1}{2}L\|x\|^{2}, which leads to T=-\left(1-\frac{2}{\kappa+1}\right)I. This is the worst-case example for \eta\geq 1. Therefore, since the worst-case examples in the previous proof are attained by \mu-strongly convex functions with L-Lipschitz gradients, we can see that the convergence rate for the error \|x^{k}-x^{*}\| is sharp. ∎
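The Lipschitz constant derived above is easy to confirm numerically: for a quadratic f(x)=\frac{1}{2}x^{\top}Ax with eigenvalues of A in [\mu,L], the operator T=I-\frac{2}{L+\mu}\nabla f contracts distances by at most 1-\frac{2}{\kappa+1}, with equality attained in the worst case. The sketch below checks this on a randomly generated A; the dimension and the values of \mu and L are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: dimension n, strong convexity mu, smoothness L.
n, mu, L = 50, 1.0, 10.0
kappa = L / mu

# Random symmetric A with eigenvalues spread over [mu, L] (both endpoints included).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate(([mu, L], rng.uniform(mu, L, n - 2)))
A = Q @ np.diag(eigs) @ Q.T

# T(x) = x - 2/(L+mu) * grad f(x) = (I - 2/(L+mu) * A) x is linear,
# so its Lipschitz constant is the spectral norm of that matrix.
M = np.eye(n) - 2.0 / (L + mu) * A
lipschitz = np.linalg.norm(M, 2)

print(f"measured: {lipschitz:.6f}   predicted 1 - 2/(kappa+1): {1 - 2/(kappa+1):.6f}")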
