Stochastic Optimization from Distributed, Streaming Data in Rate-limited Networks

# Stochastic Optimization from Distributed, Streaming Data in Rate-limited Networks

Matthew Nokleby and Waheed U. Bajwa Matthew Nokleby is with Wayne State University, Detroit, MI (email: matthew.nokleby@wayne.edu), and Waheed U. Bajwa is with Rutgers University, Piscataway, NJ (email: waheed.bajwa@rutgers.edu). The work of WUB was supported in part by the NSF under grant CCF-1453073 and by the ARO under grant W911NF-14-1-0295.
###### Abstract

Motivated by machine learning applications in networks of sensors, internet-of-things (IoT) devices, and autonomous agents, we propose techniques for distributed stochastic convex learning from high-rate data streams. The setup involves a network of nodes—each one of which has a stream of data arriving at a constant rate—that solve a stochastic convex optimization problem by collaborating with each other over rate-limited communication links. To this end, we present and analyze two algorithms—termed distributed stochastic approximation mirror descent (D-SAMD) and accelerated distributed stochastic approximation mirror descent (AD-SAMD)—that are based on two stochastic variants of mirror descent. The main collaborative step in the proposed algorithms is approximate averaging of the local, noisy subgradients using distributed consensus. While distributed consensus is well suited for collaborative learning, its use in our setup results in perturbed subgradient averages due to rate-limited links, which may slow down or prevent convergence. Our main contributions in this regard are: (i) bounds on the convergence rates of D-SAMD and AD-SAMD in terms of the number of nodes, network topology, and ratio of the data streaming and communication rates, and (ii) sufficient conditions for order-optimum convergence of D-SAMD and AD-SAMD. In particular, we show that there exist regimes under which AD-SAMD, when compared to D-SAMD, achieves order-optimum convergence with slower communications rates. This is in contrast to the centralized setting in which use of accelerated mirror descent results in a modest improvement over regular mirror descent for stochastic composite optimization. Finally, we demonstrate the effectiveness of the proposed algorithms using numerical experiments.

## I Introduction

Machine learning at its core involves solving stochastic optimization (SO) problems of the form

 \min_{\mathbf{x}\in X}\psi(\mathbf{x})\triangleq\min_{\mathbf{x}\in X}E_{\xi}[% \phi(\mathbf{x},\xi)] (1)

to learn a “model” \mathbf{x}\in X\subset\mathbb{R}^{n} that is then used for tasks such as dimensionality reduction, classification, clustering, regression, and/or prediction. A primary challenge of machine learning is to find a solution to the SO problem (1) without knowledge of the distribution P(\xi). This involves finding an approximate solution to (1) using a sequence of T samples \{\xi(t)\in\Upsilon\}_{t=1}^{T}, termed training data, that are typically drawn independently from the distribution P(\xi), which is supported on a subset of \Upsilon. There are, in particular, two main categorizations of training data that, in turn, determine the types of methods that can be used to find approximate solutions to the SO problem. These are (i) batch training data and (ii) streaming training data.

In the case of batch training data, where all T samples \{\xi(t)\} are pre-stored and simultaneously available, sample average approximation (SAA) (also referred to as empirical risk minimization (ERM)) is commonly employed. In SAA/ERM, one minimizes the empirical average of the “risk” function \phi(\cdot,\cdot) in lieu of the true expectation:

 \mathbf{x}_{\mathrm{SAA}}^{*}=\arg\min_{\mathbf{x}\in X}\frac{1}{T}\sum_{t=1}^% {T}\phi(\mathbf{x},\xi(t)). (2)

In contrast, the samples \{\xi(t)\} in the case of streaming training data arrive one-by-one, cannot be stored in memory for long, and should be processed as soon as possible. In this setting, stochastic approximation (SA) methods—the most well known of which is stochastic gradient descent (SGD)—are preferred over SAA/ERM techniques. The main idea in SA is to update the current iterate using only the most recent data sample(s) and terminate on exhaustion of training samples.111It is important to note here that the distinction between SAA and SA methods is not always as sharp as this discussion might suggest. Indeed, SGD is used routinely for SAA in deep learning not because training data are streaming, but to reduce computation and memory burdens (see, e.g. [1, Chapter 8.1.3]). Both SAA and SA have a long history in the literature; see [2] for a historical survey of SA methods, [3] for a comparative review of SAA and SA techniques, and [4] for a recent survey of SO techniques.

Despite the success of classical SO methods for machine learning, there has been a renewed interest in development and analysis of new methods due to the advent of fast-streaming and distributed data. Indeed, rapid proliferation of sensing and wearable devices, emergence of the internet-of-things (IoT), storage of data across geographically-distributed data centers, etc., are increasingly leading to scenarios in which data samples both continuously arrive at high rates and are distributed across physical entities. The goal of this paper is to provide an approximate solution to the SO problem (1) in this setting of distributed, streaming data. In particular, we focus on geographically-distributed entities (nodes) that collaborate over rate-limited communication links (e.g., wireless links within an IoT infrastructure) and obtain independent streams of training data arriving at a constant rate. The relationship between the rate at which communication takes place between nodes and the rate at which streaming data arrive at individual nodes plays a critical role in this setting. If, for example, data samples arrive much faster than nodes can communicate among themselves, it is difficult for the nodes to exchange enough information to enable an SA iteration on existing data in the network before new data arrives, thereby overwhelming the network. In order to address the challenge of distributed SO in the presence of a mismatch between the communications and streaming rates, we propose and analyze two distributed SA techniques, each based on distributed averaging consensus and stochastic mirror descent. In particular, we present bounds on the convergence rates of these techniques and derive conditions—involving the number of nodes, network topology, the streaming rate, and the communications rate—under which our solutions achieve order-optimum convergence speed.

### I-A Relationship to Prior Work

The idea of solving SO problems using SA methods dates back to the seminal work of Robbins and Monro [5]. Until recently, it was commonly believed that SAA/ERM methods converge faster than SA methods. This changed with [6, 7], which established that SA techniques can in fact outperform SAA/ERM methods for stochastic convex learning. Lan [8] improved on these results further using an accelerated stochastic mirror descent algorithm for stochastic composite optimization. This method, which makes use of noisy subgradients of \psi(\cdot) computed using incoming training samples, satisfies

 E[\psi(\mathbf{x}(T))-\psi(\mathbf{x}^{*})]\leq O(1)\left[\frac{L}{T^{2}}+% \frac{\mathcal{M}+\sigma}{\sqrt{T}}\right], (3)

where \sigma^{2} denotes variance of the subgradient noise, and \mathcal{M} and L denote the Lipschitz constants associated with the non-smooth (convex) component of \psi and the gradient of the smooth (convex) component of \psi, respectively. In the case of general stochastic composite optimization, the rate in (3) is in fact the best possible rate for any iterative algorithm. Further assumptions such as smoothness and strong convexity of \psi(\cdot) and/or presence of a structured regularizer term in \psi(\cdot) can remove the dependence of the convergence rate on \mathcal{M} and/or improve the convergence rate to O(\sigma/T) [9, 6, 10, 11].

The problem of distributed SO goes as far back as the seminal work of Tsitsiklis et al. [12], which presents distributed first-order methods for SO and gives proofs of their asymptotic convergence. Myriad works since then have applied these ideas to other settings, each with different assumptions about the type of data, how the data are distributed across the network, and how distributed units process data and share information among themselves. In order to put our work in context, we review a representative sample of these works.

A setting closely related to [12] is distributed optimization under the shared-memory model, in which distributed processors access optimization variables stored at a central, shared location. Synchronization of reads/writes to memory is a critical issue in this case; recent SO methods such as HogWild [13] and KroMagnon [14] offer convergence guarantees under asynchronous and/or sparse access to memory.

A line of work more closely related to this paper involves incremental (sub)gradient methods, in which a single node takes an optimization step based on its local data and passes that solution to a neighbor, which takes another optimization step based on its data, and so on. The message passing in incremental methods may suppose a ring network topology or use a randomized message-passing rule over an arbitrary topology. Incremental methods that are based on either SAA/ERM [15, 16, 17, 18, 19] or SA [20] have been published. But, since only node at a time processes data, these methods are unsuitable for rate-limited networks with fast streaming data.

Another related line of work is distributed optimization of linearly separable objective functions. The objective function in this case is the sum of constituent functions, each one of which is available at a single node. This setup is equivalent to SAA/ERM, where each node’s objective function is the empirical risk evaluated over local data. Solutions to this problem minimize the global objective function via iterate, dual, or (sub)gradient averaging [21, 22, 23, 24, 25] or via alternating direction method of multipliers or augmented Lagrangian methods [26, 27], often using gossip algorithms or average consensus [28] to average variables.

A final line of work directly addresses distributed SA. In [29], nodes carry out projected stochastic subgradient descent via a weighted average of neighbor nodes’ iterates. Similarly, in [30], nodes perform distributed stochastic dual averaging via a weighted average of neighbors’ dual variables. These works require nodes to engage in one round of information exchange per stochastic subgradient sample. Equivalently, they require the communications rate to be identical to the data streaming rate, which allows the nodes to exchange messages before the next set of subgradients arrives. By contrast, this paper allows the communications and streaming rates to be different; in particular, our solutions enable distributed SA even when the communications rate is slower than the streaming rate.

We conclude by discussing two works [31, 32] that are most closely related to this work. In [31], nodes perform distributed SA by forming distributed mini-batch averages of stochastic gradients and using stochastic dual averaging. The main assumption in this work is that nodes can compute exact stochastic gradient averages (e.g., via AllReduce in parallel computing architectures). Under this assumption, it is shown in this work that there is an appropriate mini-batch size for which the nodes’ iterates converge at the optimum (centralized) rate. However, the need for exact averages in this work is not suited to rate-limited (e.g., wireless) networks, in which mimicking the AllReduce functionality can be costly and challenging.

The need for exact stochastic gradient averages in [31] has been relaxed recently in [32], in which nodes carry out distributed stochastic dual averaging by computing approximate mini-batch averages of dual variables via distributed consensus. In addition, and similar to our work, [32] allows for a mismatch between the communications rate and the data streaming rate. Nonetheless, there are four main distinctions between [32] and our work. First, we provide results for stochastic composite optimization, in which the objective function has a non-smooth component, whereas [31, 32] consider a differentiable objective. Second, we study distributed mirror descent, which allows for a limited generalization to non-Euclidean settings, whereas [32] (and [31]) consider only Euclidean settings. Third, we explicitly examine the impact of slow communications rate on performance, in particular highlighting the need for large mini-batches and their impact on convergence speed when the communications rate is slow. In [32], on the other hand, the optimum mini-batch size is first derived from [31], after which the communications rate needed to facilitate distributed consensus at the optimum mini-batch size is specified. While it appears to be possible to derive some of our results from a re-framing of the results of [32], it is crucial to highlight the trade-offs necessary under slow communications, which is not done in prior works. Finally, this work presents a distributed accelerated mirror descent approach to distributed SA, whereas [32] (and [31]) consider only standard/“non-accelerated” methods. A somewhat surprising outcome of this is that acceleration substantially improves the convergence rate in rate-limited networks.

### I-B Our Contributions

In this paper, we present two strategies for distributed SA over networks with fast streaming data and slow communications links: distributed stochastic approximation mirror descent (D-SAMD) and accelerated distributed stochastic approximation mirror descent (AD-SAMD). In both cases, nodes first locally compute mini-batch stochastic subgradient averages to accommodate a fast streaming rate (or, equivalently, a slow communications rate), and then they collaboratively compute approximate network subgradient averages via distributed consensus. Finally, nodes individually employ mirror descent and accelerated mirror descent, respectively, on the approximate averaged subgradients for the next set of iterates.

Our main theoretical contribution is the derivation of bounds on the convergence rates of D-SAMD and AD-SAMD. These bounds involve a careful analysis of the impact of imperfect subgradient averaging on individual nodes’ iterates. In addition, we derive sufficient conditions for order-optimum convergence of D-SAMD and AD-SAMD in terms of the streaming and communications rates, the size and topology of the network, and the data statistics.

A key finding of this paper is that AD-SAMD achieves order-optimum convergence in a larger regime than D-SAMD, thus accommodating slower communications links relative to the streaming rate. By contrast, the convergence speeds of centralized stochastic mirror descent and accelerated stochastic mirror descent differ only in higher order terms.

### I-C Notation and Organization

We typically use boldfaced lowercase and boldfaced capital letters (e.g., \mathbf{x} and \mathbf{W}) to denote (possibly random) vectors and matrices, respectively. Unless otherwise specified, all vectors are assumed to be column vectors. The transpose operation is denoted by (\cdot)^{T}, while \mathbf{1} denotes the vector of all ones. Further, we use E[\cdot] to denote the expectation operation and \mathbb{R} to denote the field of real numbers. The gradient operation is denoted by \nabla, while \odot denotes the Hadamard product. Finally, given two functions p(r) and q(r), we write p(r)=O(q(r)) if there exists a constant C such that \forall r,p(r)\leq Cq(r), and we write p(r)=\Omega(q(r)) if q(r)=O(p(r)).

The rest of this paper is organized as follows. In Section II, we formalize the problem of distributed stochastic composite optimization. In Sections III and IV, we describe D-SAMD and AD-SAMD, respectively, and also derive performance guarantees for these two methods. We demonstrate the effectiveness of D-SAMD and AD-SAMD via numerical experiments in Section V, and we conclude the paper in Section VI. Proofs are provided in the Appendix.

## II Problem Formulation

The objective of this paper is order-optimal, distributed minimization of the composite function

 \psi(\mathbf{x})=f(\mathbf{x})+h(\mathbf{x}), (4)

where \mathbf{x}\in X\subset\mathbb{R}^{n} and X is convex and compact. The space \mathbb{R}^{n} is endowed with an inner product \langle\cdot,\cdot\rangle that need not be the usual one and a norm \left\|{\cdot}\right\| that need not be the one induced by the inner product. In the following, the minimizer and the minimum value of \psi are denoted as:

We now make a few assumptions on the smooth (f(\cdot)) and non-smooth (h(\cdot)) components of \psi. The function f:X\to\mathbb{R} is convex with Lipschitz continuous gradients, i.e.,

 \left\|{\nabla f(\mathbf{x})-\nabla f(\mathbf{y})}\right\|_{*}\leq L\left\|{% \mathbf{x}-\mathbf{x}}\right\|,\ \forall\ \mathbf{x},\mathbf{y}\in X, (6)

where \left\|{\cdot}\right\|_{*} is the dual norm associated with \langle\cdot,\cdot\rangle and \left\|{\cdot}\right\|:

 \left\|{\mathbf{g}}\right\|_{*}\triangleq\sup_{\left\|{\mathbf{x}}\right\|\leq 1% }\langle\mathbf{g},\mathbf{x}\rangle. (7)

The function h:X\to\mathbb{R} is convex and Lipschitz continuous:

 \left\|{h(\mathbf{x})-h(\mathbf{y})}\right\|\leq\mathcal{M}\left\|{\mathbf{x}-% \mathbf{y}}\right\|,\forall\ \mathbf{x},\mathbf{y}\in X. (8)

Note that h need not have gradients; however, since it is convex we can consider its subdifferential, denoted by \partial h(\mathbf{y}):

 \partial h(\mathbf{y})=\{\mathbf{g}:h(\mathbf{z})\geq h(\mathbf{y})+\mathbf{g}% ^{T}(\mathbf{z}-\mathbf{y}),\forall\ \mathbf{z}\in X\}. (9)

An important fact that will be used in this paper is that the subgradient \mathbf{g}\in\partial h of a Lipschitz-continuous convex function h is bounded [33, Lemma 2.6]:

 \left\|{\mathbf{g}}\right\|_{*}\leq\mathcal{M},\ \forall\mathbf{g}\in\partial h% (\mathbf{y}),\ \mathbf{y}\in X. (10)

Consequently, the gap between the subgradients of \psi is bounded: \forall\mathbf{x},\mathbf{y}\in X and \mathbf{g}_{\mathbf{x}}\in\partial h(\mathbf{x}), \mathbf{g}_{\mathbf{y}}\in\partial h(\mathbf{y}), we have

 \displaystyle\left\|{\partial\psi(\mathbf{x})-\partial\psi(\mathbf{y})}\right% \|_{*} \displaystyle=\left\|{\nabla f(\mathbf{x})-\nabla f(\mathbf{y})+\mathbf{g}_{% \mathbf{x}}-\mathbf{g}_{\mathbf{y}}}\right\|_{*} \displaystyle\leq\left\|{\nabla f(\mathbf{x})-\nabla f(\mathbf{y})}\right\|_{*% }+\left\|{\mathbf{g}_{\mathbf{x}}-\mathbf{g}_{\mathbf{y}}}\right\|_{*} \displaystyle\leq L\left\|{\mathbf{x}-\mathbf{y}}\right\|+2\mathcal{M}. (11)

### II-A Distributed Stochastic Composite Optimization

Our focus in this paper is the minimization of \psi(\mathbf{x}) over a network of m nodes, represented by the undirected graph G=(V,E). To this end, we suppose that nodes minimize \psi collaboratively by exchanging subgradient information with their neighbors at each communications round. Specifically, each node i\in V transmits a message at each communications round to each of its neighbors, defined as

 \mathcal{N}_{i}=\{j\in V:(i,j)\in E\}, (12)

where we suppose that a node is in its own neighborhood, i.e. i\in\mathcal{N}_{i}. We assume that this message passing between nodes takes place without any error or distortion. Further, we constrain the messages between nodes to be members of the dual space of X and to satisfy causality; i.e., messages transmitted by a node can depend only on its local data and previous messages received from its neighbors.

Next, in terms of data generation, we suppose that each node i\in V queries a first-order stochastic “oracle” at a fixed rate—which may be different from the rate of message exchange—to obtain noisy estimates of the subgradient of \psi at different query points in X. Formally, we use ‘t’ to index time according to data-acquisition rounds and define \{\xi_{i}(t)\in\Upsilon\}_{t\geq 1} to be a sequence of independent (with respect to i and t) and identically distributed (i.i.d.) random variables with unknown probability distribution P(\xi). At each data-acquisition round t, node i queries the oracle at search point \mathbf{x}_{i}(s) to obtain the point G(\mathbf{x}_{i}(s),\xi_{i}(t)) that is a noisy version of the subgradient of \psi at \mathbf{x}_{i}(s). Here, we use ‘s’ to index time according to search-point update rounds, with possibly multiple data-acquisition rounds per search-point update. The reason for allowing the search-point update index s to be different from the data-acquisition index t is to accommodate the setting in which data (equivalently, subgradient estimates) arrive at a much faster rate than the rate at which nodes can communicate with each other; we will elaborate further on this in the next subsection.

Formally, G(\mathbf{x},\xi) is a Borel function that satisfies the following properties:

 \displaystyle E[G(\mathbf{x},\xi)] \displaystyle\triangleq\mathbf{g}(\mathbf{x})\in\partial\psi(\mathbf{x}),\quad% \text{and} (13) \displaystyle E[\left\|{G(\mathbf{x},\xi)-\mathbf{g}(\mathbf{x})}\right\|_{*}^% {2}] \displaystyle\leq\sigma^{2}, (14)

where the expectation is with respect to the distribution P(\xi). We emphasize that this formulation is equivalent to that in which the objective function is \psi(\mathbf{x})\triangleq E[\phi(\mathbf{x},\xi)], and where nodes in the network acquire data point \{\xi_{i}(t)\}_{i\in V} at each data-acquisition round t that are then used to compute the subgradients of \phi(\mathbf{x},\xi_{i}(t)), which—in turn—are noisy subgradients of \psi(\mathbf{x}).

### II-B Mini-batching for Rate-Limited Networks

A common technique to reduce the variance of the (sub)gradient noise and/or reduce the computational burden in centralized SO is to average “batches” of oracle outputs into a single (sub)gradient estimate. This technique, which is referred to as mini-batching, is also used in this paper; however, its purpose in our distributed setting is to both reduce the subgradient noise variance and manage the potential mismatch between the communications rate and the data streaming rate. Before delving into the details of our mini-batch strategy, we present a simple model to parametrize the mismatch between the two rates. Specifically, let \rho>0 be the communications ratio, i.e. the fixed ratio between the rate of communications and the rate of data acquisition. That is, \rho\geq 1 implies nodes engage in \rho rounds of message exchanges for every data-acquisition round. Similarly, \rho<1 means there is one communications round for every 1/\rho data-acquisition rounds. We ignore rounding issues for simplicity.

The mini-batching in our distributed problem proceeds as follows. Each mini-batch round spans b\geq 1 data-acquisition rounds and coincides with the search-point update round, i.e., each node i updates its search point at the end of a mini-batch round. In each mini-batch round s, each node i uses its current search point \mathbf{x}_{i}(s) to compute an average of oracle outputs

 \theta_{i}(s)=\frac{1}{b}\sum_{t=(s-1)b+1}^{sb}G(\mathbf{x}_{i}(s),\xi_{i}(t)). (15)

This is followed by each node computing a new search point \mathbf{x}_{i}(s+1) using \theta_{i}(s) and messages received from its neighbors.

In order to analyze the mini-batching distributed SA techniques proposed in this work, we need to generalize the usual averaging property of variances to non-Euclidean norms.

###### Lemma 1

Let \mathbf{z}_{1},\dots,\mathbf{z}_{k} be i.i.d. random vectors in \mathbb{R}^{n} with E[\mathbf{z}_{i}]=0 and E[\left\|{\mathbf{z}_{i}}\right\|^{2}_{*}]\leq\sigma^{2}. There exists a constant C_{*}\geq 0, which depends only on \left\|{\cdot}\right\| and \langle\cdot,\cdot\rangle, such that

 E\left[\left\|{\frac{1}{k}\sum_{i=1}^{k}\mathbf{z}_{i}}\right\|_{*}^{2}\right]% \leq\frac{C_{*}\sigma^{2}}{k}. (16)
###### Proof:

This follows directly from the property of norm equivalence in finite-dimensional spaces. \qed

In order to illustrate Lemma 1, notice that when \left\|{\cdot}\right\|=\left\|{\cdot}\right\|_{1}, i.e., the l_{1} norm, and \langle\cdot,\cdot\rangle is the standard inner product, the associated dual norm is the l_{\infty} norm: \left\|{\cdot}\right\|_{*}=\left\|{\cdot}\right\|_{\infty}. Since \left\|{\mathbf{x}}\right\|^{2}_{\infty}\leq\left\|{\mathbf{x}}\right\|^{2}_{2% }\leq n\left\|{\mathbf{x}}\right\|_{\infty}^{2}, we have C_{*}=n in this case. Thus, depending on the norm in use, the extent to which averaging reduces subgradient noise variance may depend on the dimension of the optimization space.

In the following, we will use the notation \mathbf{z}_{i}(s)\triangleq\theta_{i}(s)-\mathbf{g}(\mathbf{x}_{i}(s)). Then, E[\left\|{\mathbf{z}_{i}(s)}\right\|_{*}^{2}]\leq C_{*}\sigma^{2}/b. We emphasize that the subgradient noise vectors \mathbf{z}_{i}(s) depend on the search points \mathbf{x}_{i}(s); we suppress this notation for brevity.

### II-C Problem Statement

It is straightforward to see that mini-batching induces a performance trade-off: Averaging reduces subgradient noise and processing time, but it also reduces the rate of search-point updates (and hence, slows down convergence). This trade-off depends on the relationships between the streaming and communications rates. In order to carry out distributed SA in an order-optimal manner, we will require that the nodes collaborate by carrying out r\geq 1 rounds of averaging consensus on their mini-batch averages \theta_{i}(s) in each mini-batch round s (see Section III for details). In order to complete the r communication rounds in time for the next mini-batch round, we have the constraint

 r\leq b\rho. (17)

If communications is faster, or if the mini-batch rounds are longer, nodes can fit in more rounds of information exchange between each mini-batch round or, equivalently, between each search-point update. But when the mismatch factor \rho is small, the mini-batch size b needed to enable sufficiently many consensus rounds may be so large that the reduction in subgradient noise is outstripped by the reduction in search-point updates and the resulting convergence speed is sub-optimum. In this context, our main goal is specification of sufficient conditions for \rho such that the resulting convergence speeds of the proposed distributed SA techniques are optimum.

## III Distributed stochastic approximation mirror descent

In this section we present our first distributed SA algorithm, called distributed stochastic approximation mirror descent (D-SAMD). This algorithm is based upon stochastic approximated mirror descent, which is a generalized version of stochastic subgradient descent. Before presenting D-SAMD, we review a few concepts that underlie mirror descent.

### III-A Stochastic Mirror Descent Preliminaries

Stochastic mirror descent, presented in [8], is a generalization of stochastic subgradient descent. This generalization is characterized by a distance-generating function \omega:X\to\mathbb{R} that generalizes the Euclidean norm. The distance-generating function must be continuously differentiable and strongly convex with modulus \alpha, i.e.

 \langle\nabla\omega(\mathbf{x})-\nabla\omega(\mathbf{y}),\mathbf{x}-\mathbf{y}% \rangle\geq\alpha\left\|{\mathbf{x}-\mathbf{y}}\right\|^{2},\forall\ \mathbf{x% },\mathbf{y}\in X. (18)

In the convergence analysis, we will require two measures of the “radius” of X that will arise in the convergence analysis, defined as follows:

The distance-generating function induces the prox function, or the Bregman divergence V:X\times X\to\mathbb{R}_{+}, which generalizes the Euclidean distance:

 V(\mathbf{x},\mathbf{z})=\omega(\mathbf{z})-(\omega(\mathbf{x})+\langle\nabla% \omega(\mathbf{x}),\mathbf{z}-\mathbf{x}\rangle). (19)

The prox function V(\mathbf{x},\cdot) inherits strong convexity from \omega(\cdot), but it need not be symmetric or satisfy the triangle inequality. We define the prox mapping P_{\mathbf{x}}:\mathbb{R}^{n}\to X as

 P_{\mathbf{x}}(\mathbf{y})=\operatorname*{arg\,min}_{\mathbf{z}\in X}\langle% \mathbf{y},\mathbf{z}-\mathbf{x}\rangle+V(\mathbf{x},\mathbf{z}). (20)

The prox mapping generalizes the usual subgradient descent step, in which one minimizes the local linearization of the objective function regularized by the Euclidean distance of the step taken. In (centralized) stochastic mirror descent, one computes iterates of the form

 \displaystyle\mathbf{x}(s+1) \displaystyle=P_{\mathbf{x}}(s)(\gamma_{s}\mathbf{g}(s))

where \mathbf{g}(s) is a stochastic subgradient of \psi(\mathbf{x}(s)), and \gamma_{s} is a stepsize. These iterates have the same form as (stochastic) subgradient descent; indeed, choosing \omega(\mathbf{x})=\frac{1}{2}\left\|{\mathbf{x}}\right\|^{2}_{2} as well as \langle\cdot,\cdot\rangle and \left\|{\cdot}\right\| to be the usual ones results in subgradient descent iterations.

One can speed up convergence by choosing \omega(\cdot) to match the structure of X and \psi. For example, if the optimization space X is the unit simplex over \mathbb{R}^{n}, one can choose \omega(\mathbf{x})=\sum_{i}x_{i}\log(x_{i}) and \left\|{\cdot}\right\| to be the l_{1} norm. This leads to V(\mathbf{x},\mathbf{z})=D(\mathbf{z}||\mathbf{x})), where D(\cdot||\cdot) denotes the Kullback-Leibler divergence between \mathbf{z} and \mathbf{x}. This choice speeds up convergence on the order of O\left(\sqrt{n/\log(n)}\right) over using the Euclidean norm throughout. Along a similar vein, when \psi includes an l_{1} regularizer to promote a sparse minimizer, one can speed up convergence by choosing \omega(\cdot) to be a p-norm with p=\log(n)/(\log(n)-1).

In order to guarantee convergence for D-SAMD, we need to further restrict the distance-generating function \omega(\cdot). In particular, we require that the resulting prox mapping be 1-Lipschitz continuous in \mathbf{x},\mathbf{y} pairs, i.e.

 \left\|{P_{\mathbf{x}}(\mathbf{y})-P_{\mathbf{x}^{\prime}}(\mathbf{y}^{\prime}% )}\right\|\leq\left\|{\mathbf{x}-\mathbf{x}^{\prime}}\right\|+\left\|{\mathbf{% y}-\mathbf{y}\prime}\right\|,\forall\ \mathbf{x},\mathbf{x}^{\prime},\mathbf{y% },\mathbf{y}^{\prime}\in\mathbb{R}^{n}.

This condition holds whenever the prox mapping is the projection of a 1-Lipschitz function of \mathbf{x},\mathbf{y} onto X. For example, it is easy to verify that this condition holds in the Euclidean setting. However, some important Bregman divergences, including the KL-divergence, result in prox mappings that are not Lipschitz. Consequently, while we present our results in terms of general prox functions, we emphasize that the results do not apply in all cases.222We note further that it is possible to relax the constraint that the best Lipschitz constant be no larger than unity. This worsens the scaling laws—in particular, the required communications ratio \rho grows in T rather than decreases—and we omit this case for brevity’s sake. One can think of the results primarily in the setting of Euclidean (accelerated) stochastic subgradient descent—for which case they are guaranteed to hold—with the understanding that one can check on a case-by-case basis to see if they hold for a particular non-Euclidean setting.

### III-B Description of D-SAMD

Here we present D-SAMD in detail, which generalizes stochastic mirror descent to the setting of distributed, streaming data. Nodes in this case carry out iterations similar to stochastic mirror descent as presented in [8], but instead of using local stochastic subgradients associated with the local search points, they carry out approximate consensus to estimate the average of stochastic subgradients across the network. This reduces the subgradient noise at each node and speeds up convergence.

Let \mathbf{W} be a symmetric, doubly-stochastic matrix consistent with the network graph G, i.e., [\mathbf{W}]_{ij}=0 if (i,j)\notin E. Also set the constant step size 0<\gamma\leq\alpha/(2L).333It is shown in [8] that a constant step size is sufficient for order-optimal performance, so we adopt such a rule here. For simplicity, we suppose that there is a predetermined number of data-acquisition rounds T, which leads to S=T/b mini-batch rounds. We detail the steps of D-SAMD in Algorithm 1.

In D-SAMD, each node i initializes its iterate at the minimizer of \omega(\cdot), which is guaranteed to be unique due to strong convexity. At each mini-batch round s, each node i obtains its mini-batched subgradient and nodes engage in r rounds of averaging consensus to produce the (approximate) average subgradients \mathbf{h}_{i}^{r}(s). Then, each node i takes a mirror prox step, using \mathbf{h}^{r}_{i}(s) instead of its own mini-batched estimate. Finally, each node keeps a running average of its iterates, which is well-known to speed up convergence [34].

### III-C Convergence Analysis

The convergence rate of D-SAMD depends on the bias and variance of the approximate subgradient averages \mathbf{h}_{i}^{r}(s). In principle, averaging subgradients together reduces the noise variance and speeds up convergence. However, because averaging consensus using only r communications rounds results in approximate averages, each node takes a slightly different mirror prox step and therefore ends up with a different iterate. At each mini-batch round s, nodes then compute subgradients at different search points, leading to bias in the averages \mathbf{h}^{r}_{i}(s). This bias accumulates at a rate that depends on the subgradient noise variance, the topology of the network, and the number of consensus rounds per mini-batch round.

Therefore, the first step in bounding the convergence speed of D-SAMD is to bound the bias and the variance of the subgradient estimates \mathbf{h}_{i}(s), which we do in the following lemma.

###### Lemma 2

Let 0\leq\lambda_{2}<1 denote the second-largest eigenvalue of \mathbf{W}. Define the matrices

 \displaystyle\mathbf{H}(s) \displaystyle\triangleq[\mathbf{h}_{1}^{r}(s),\dots,\mathbf{h}_{m}^{r}(s)] \displaystyle\mathbf{G}(s) \displaystyle\triangleq[\mathbf{g}(\mathbf{x}_{1}(s)),\dots,\mathbf{g}(\mathbf% {x}_{m}(s))] \displaystyle\mathbf{Z}(s) \displaystyle\triangleq[\mathbf{z}_{1}(s),\dots,\mathbf{z}_{m}(s)],

recalling that the subgradient noise \mathbf{z}_{i}(s) is defined with respect to the mini-batched subgradient \theta_{i}(s). Also define

where the matrices \overline{\mathbf{G}}(s),\overline{\mathbf{Z}}(s)\in\mathbb{R}^{n\times n} have identical columns. Finally, define the matrices

 \displaystyle\mathbf{E}(s) \displaystyle\triangleq\mathbf{G}(s)\mathbf{W}^{r}-\overline{\mathbf{G}}(s) \displaystyle\tilde{\mathbf{Z}}(s) \displaystyle\triangleq\mathbf{Z}(s)\mathbf{W}^{r}-\overline{\mathbf{Z}}(s)

of average consensus error on the subgradients and subgradient noise, respectively. Then, the following facts are true. First, one can write \mathbf{H}(s) as

 \mathbf{H}(s)=\overline{\mathbf{G}}(s)+\mathbf{E}(s)+\overline{\mathbf{Z}}(s)+% \tilde{\mathbf{Z}}(s). (21)

Second, the columns of \overline{\mathbf{Z}}(s) satisfy

 E[\left\|{\overline{\mathbf{z}}_{i}(s)}\right\|_{*}^{2}]\leq\frac{C_{*}\sigma^% {2}}{mb}. (22)

Finally, the ith columns of \mathbf{E}(s) and \tilde{\mathbf{Z}}(s), denoted by \mathbf{e}_{i}(s) and \tilde{\mathbf{z}}_{i}(s), respectively, satisfy

 \left\|{\mathbf{e}_{i}(s)}\right\|_{*}\leq\max_{j,k}m^{2}\sqrt{C_{*}}\lambda_{% 2}^{r}\left\|{\mathbf{g}_{j}(s)-\mathbf{g}_{k}(s)}\right\|_{*} (23)

and

 E[\left\|{\tilde{\mathbf{z}}_{i}(s)}\right\|_{*}^{2}]\leq\frac{\lambda^{2r}m^{% 2}C_{*}\sigma^{2}}{b}. (24)

The next step in the convergence analysis is to bound the distance between iterates at different nodes. As long as iterates are not too far apart, the subgradients computed at different nodes have sufficiently similar means that averaging them together reduces the overall subgradient noise.

###### Lemma 3

Let a_{s}\triangleq\max_{i,j}\left\|{\mathbf{x}_{i}(s)-\mathbf{x}_{j}(s)}\right\|. The moments of a_{s} follow:

 \displaystyle E[a_{s}] \displaystyle\leq\frac{\mathcal{M}+\sigma/\sqrt{b}}{L}((1+\alpha m^{2}\sqrt{C_% {*}}\lambda_{2}^{r})^{s}-1), (25) \displaystyle E[a_{s}^{2}] \displaystyle\leq\frac{(\mathcal{M}+\sigma/\sqrt{b})^{2}}{L^{2}}((1+\alpha m^{% 2}\sqrt{C_{*}}\lambda_{2}^{r})^{s}-1)^{2}. (26)

Now, we bound D-SAMD’s expected gap to optimality.

###### Theorem 1

For D-SAMD, the expected gap to optimality at each node i satisfies

 \displaystyle E[\psi(\mathbf{x}_{i}^{\mathrm{av}}(S+1))]-\psi^{*}\leq\\ \displaystyle\frac{2L\Omega_{\omega}^{2}}{\alpha S}+\sqrt{\frac{2(4\mathcal{M}% ^{2}+2\Delta_{S}^{2})}{\alpha S}}+\sqrt{\frac{\alpha}{2}}\frac{\Xi_{S}D_{% \omega}}{L}, (27)

where

 \Xi_{s}\triangleq(\mathcal{M}+\sigma/\sqrt{b})(1+m^{2}\sqrt{C_{*}}\lambda_{2}^% {r})((1+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{s}-1)+2\mathcal{M} (28)

and

 \displaystyle\Delta_{s}^{2}\triangleq 2(\mathcal{M}+\sigma/\sqrt{b})^{2}(1+m^{% 4}C_{*}\lambda_{2}^{2r})((1+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{s}-1)^{2% }+\\ \displaystyle 4C_{*}\sigma^{2}/(mb)+4\lambda_{2}^{2r}C_{*}\sigma^{2}m^{2}/b+4% \mathcal{M} (29)

quantify the moments of the effective subgradient noise.

The convergence rate proved in Theorem 1 is akin to that provided in [8], with \Delta_{s}^{2} taking the role of the subgradient noise variance. A crucial difference is the presence of the final term involving \Xi_{s}. In [8], this term vanishes because the intrinsic subgradient noise has zero mean. However, the equivalent gradient error in D-SAMD does not have zero mean in general. As nodes’ iterates diverge, their subgradients differ, and the nonlinear mapping between iterates and subgradients results in noise with nonzero mean.

The critical question is how fast communications needs to be for order-optimum convergence speed, i.e., the convergence speed that one would obtain if nodes had access to other nodes’ subgradient estimates at each round. After S mini-batch rounds, the network has processed mT data samples. Centralized mirror descent, with access to all mT data samples in sequence, achieves the convergence rate [8]

 O(1)\left[\frac{L}{mT}+\frac{\mathcal{M}+\sigma}{\sqrt{mT}}\right].

The final term dominates the error as a function of m and T if \sigma^{2}>0. In the following corollary we derive conditions under which the convergence rate of D-SAMD matches this term.

###### Corollary 1

The optimality gap for D-SAMD satisfies

 E[\psi(\mathbf{x}_{i}^{\mathrm{av}}(t+1))]-\psi^{*}=O\left(\frac{\mathcal{M}+% \sigma}{\sqrt{mT}}\right), (30)

provided the mini-batch size b, the communications ratio \rho, the number of users m, and the Lipschitz constant \mathcal{M} satisfy

###### Remark 1

Corollary 1 gives new insights into the interdependence between rate, network topology, and mini-batch size. In [31], a mini-batch size of b=O(T^{1/2}) is prescribed—which is sufficient whenever gradient averages are perfect—and in [32] the number of imperfect consensus rounds needed to facilitate the mini-batch size b is derived. By contrast, we derive a mini-batch condition sufficient to drive the effective noise variance to O(\sigma^{2}/(mT)) while taking into consideration the impact of imperfect subgradient averaging. This condition depends not only on T but also on m, \rho, and \sigma^{2}—indeed, for all else constant, the optimum mini-batch size is merely \Omega(\log(T)). Then, the condition on \rho essentially ensures that b=O(T^{1/2}) as specified in [31].

###### Remark 2

Corollary 1 imposes a strict requirement on \mathcal{M}, the Lipschitz constant of the non-smooth part of \psi. Essentially the non-smooth part must vanish as m, T, or \sigma^{2} becomes large. This is because the contribution of h(\mathbf{x}) to the convergence rate depends only on the number of iterations taken, not on the noise variance. Reducing the effective subgradient noise via mini-batching has no impact on this contribution, so we require the Lipschitz constant \mathcal{M} to be small to compensate.

## IV Accelerated Distributed Stochastic Approximation Mirror Descent

In this section, we present accelerated distributed stochastic approximation mirror descent (AD-SAMD), which distributes the accelerated stochastic approximation mirror descent proposed in [8]. The centralized version of accelerated mirror descent achieves the optimum convergence rate of

 O(1)\left[\frac{L}{T^{2}}+\frac{\mathcal{M}+\sigma^{2}}{\sqrt{T}}\right].

Consequently, we will see that the convergence rate of AD-SAMD has 1/S^{2} as its first term. This faster convergence in S allows for more aggressive mini-batching, and the resulting conditions for order-optimal convergence are less stringent.

The setting for AD-SAMD is the same as in Section III. We again suppose a distance function \omega:X\to\mathbb{R}, its associated prox function/Bregman divergence V:X\times X\to\mathbb{R}, and the resulting (Lipschitz) prox mapping P_{x}:\mathbb{R}^{n}\to X.

We again suppose a mixing matrix \mathbf{W}\in\mathbb{R}^{m\times m} that is symmetric, doubly stochastic, and consistent with G. The main distinction between accelerated and standard mirror descent is the way one averages iterates. Rather than simply average the sequence of iterates, one maintains several distinct sequences of iterates, carefully averaging them along the way. This involves two sequences of step sizes \beta_{s}\in[1,\infty) and \gamma_{s}\in\mathbb{R}, which are not held constant. Again we suppose that the number of mini-batch rounds S=T/b is predetermined. We detail the steps of AD-SAMD in Algorithm 2.

The sequences of iterates \mathbf{x}_{i}(s), \mathbf{x}_{i}^{\mathrm{md}}(s), and \mathbf{x}^{\mathrm{ag}}_{i}(s) are interrelated in complicated ways; we refer the reader to [8] for an intuitive explanation of these iterations.

### IV-B Convergence Analysis

As with D-SAMD, the convergence analysis relies on bounds on the bias and variance of the averaged subgradients. To this end, we note first that Lemma 2 holds for AD-SAMD, where \mathbf{H}(s) has columns corresponding to noisy subgradients evaluated at \mathbf{x}_{i}^{\mathrm{md}}(s). Next, we bound the distance between iterates at different nodes. This analysis is somewhat more complicated due to the relationships between the three iterate sequences.

###### Lemma 4

Let

 \displaystyle a_{s} \displaystyle\triangleq\max_{i,j}\left\|{\mathbf{x}^{\mathrm{ag}}_{i}(s)-% \mathbf{x}^{\mathrm{ag}}_{j}(s)}\right\| \displaystyle b_{s} \displaystyle\triangleq\max_{i,j}\left\|{\mathbf{x}_{i}(s)-\mathbf{x}_{j}(s)}\right\| \displaystyle c_{s} \displaystyle\triangleq\max_{i,j}\left\|{\mathbf{x}^{\mathrm{md}}_{i}(s)-% \mathbf{x}^{\mathrm{md}}_{j}(s)}\right\|.

Then, the moments of a_{s}, b_{s}, and c_{s} follow:

 \displaystyle E[a_{s}],E[b_{s}],E[c_{s}] \displaystyle\leq\frac{\mathcal{M}+\sigma/\sqrt{b}}{L}((1+2\gamma_{s}m^{2}% \sqrt{C_{*}}L\lambda_{2}^{r})^{s}-1), \displaystyle E[a_{s}^{2}],E[b_{s}^{2}],E[c_{s}^{2}] \displaystyle\leq\frac{(\mathcal{M}+\sigma/\sqrt{b})^{2}}{L^{2}}((1+2\gamma_{s% }m^{2}\sqrt{C_{*}}L\lambda_{2}^{r})^{s}-1)^{2}.

Now, we bound the expected gap to optimality of the AD-SAMD iterates.

###### Theorem 2

For AD-SAMD, the expected gap to optimality satisfies

 \displaystyle E[\Psi(\mathbf{x}_{i}^{\mathrm{ag}}(S+1))]-\Psi^{*}\leq\frac{8LD% _{\omega,X}^{2}}{\alpha S^{2}}+\\ \displaystyle 4D_{\omega,X}\sqrt{\frac{4M+\Delta_{S}^{2}}{\alpha S}}+\sqrt{% \frac{32}{\alpha}}D_{\omega,X}\Xi_{S}, (31)

where

 \displaystyle\Delta_{\tau}^{2}=2(\mathcal{M}+\sigma/\sqrt{b})^{2}((1+2\gamma_{% \tau}m^{2}\sqrt{C_{*}}L\lambda_{2}^{r})^{\tau}-1)^{2}+\\ \displaystyle\frac{4C_{*}\sigma^{2}}{b}(\lambda_{2}^{2r}m^{2}+1/m)+4\mathcal{M}.

and

 \Xi_{\tau}=(\mathcal{M}+\sigma/\sqrt{b})(1+\sqrt{C_{*}}m^{2}\lambda_{2}^{r})((% 1+2\gamma_{\tau}m^{2}\sqrt{C_{*}}L\lambda_{2}^{r})^{\tau}-1)+2\mathcal{M}.

As with D-SAMD, we study the conditions under which AD-SAMD achieves order-optimum convergence speed. The centralized version of accelerated mirror descent, after processing the mT data samples that the network sees after S mini-batch rounds, achieves the convergence rate

 O(1)\left[\frac{L}{(mT)^{2}}+\frac{\mathcal{M}+\sigma}{\sqrt{mT}}\right].

This is the optimum convergence rate under any circumstance. In the following corollary, we derive the conditions under which the convergence rate matches the second term, which usually dominates the error when \sigma^{2}>0.

###### Corollary 2

The optimality gap satisfies

 E[\psi(\mathbf{x}^{\mathrm{ag}}_{i}(S+1)]-\psi^{*}=O\left(\frac{\mathcal{M}+% \sigma}{\sqrt{mT}}\right),

provided

###### Remark 3

The crucial difference between the two schemes is that AD-SAMD has a convergence rate of 1/S^{2} in the absence of noise and non-smoothness. This faster term, which is often negligible in centralized mirror descent, means that AD-SAMD tolerates more aggressive mini-batching without impact on the order of the convergence rate. As a result, while the condition on the mini-batch size b is the same in terms of \rho, the condition on \rho is relaxed by 1/4 in the exponents of m and T. This is because the condition b=O(T^{1/2}), which holds for standard stochastic SO methods, is relaxed to b=O(T^{3/2}) for accelerated stochastic mirror descent.

## V Numerical Example: Logistic Regression

We demonstrate the performance of D-SAMD and AD-SAMD on supervised learning by considering binary logistic regression. A learning machine observes a stream of pairs \xi(t)=(y(t),l(t)) of data points y(t)\in\mathbb{R}^{d} and their labels l(t)\in\{0,1\}, from which it learns a classifier of the form

 F(\mathbf{x},x_{0},\mathbf{y})=\frac{1}{1+\exp(-\mathbf{x}^{T}\mathbf{y}(t)+x_% {0})},

where \mathbf{x}\in\mathbb{R}^{d} and x_{0}\in\mathbb{R} are regression coefficients. The function F(\mathbf{x},x_{0},\mathbf{y}(t)) represents the posterior probability that l(t)=1 given \mathbf{y}(t).

The SO task is to learn the optimum regression coefficients \mathbf{x},x_{0}. In terms of the framework of this paper, \Upsilon=(\mathbb{R}^{d}\times\{0,1\}), and X=\mathbb{R}^{d+1}. We use the Euclidean norm, inner product, and distance-generating function to compute the prox mapping. The objective function is the cross-entropy loss

 \psi(\mathbf{x})=E_{\mathbf{y},l}[l\log(F(\mathbf{x},x_{0},\mathbf{y}))+(1-l)% \log(1-F(\mathbf{x},x_{0},\mathbf{y}))].

Minimizing \psi is equivalent to minimizing the KL divergence between the posterior indicated by F and the true posterior p(l|\mathbf{y}). Minimizing the empirical loss \psi over a training set is equivalent to performing maximum likelihood estimation of the regression coefficients.

Although it can be used more generally, logistic regression is well-known to follow from the assumptions of Gaussian naive Bayes [35]. Therefore, we consider a synthetic example in which the data follow a Gaussian distribution. For l(t)~{}\in~{}\{0,1\}, we let \mathbf{y}(t)~{}\sim~{}\mathcal{N}(\mu_{l(t)},\sigma_{r}^{2}\mathbf{I}), where \mu_{l(t)} is one of two mean vectors, and \sigma_{r}^{2}>0 is the noise variance.444The variance \sigma_{r}^{2} is distinct from the resulting gradient noise variance \sigma^{2}.

For this experiment, we draw the elements \mu_{0} and \mu_{1} randomly from the standard normal distribution and choose \sigma^{2}=1. We pick m=20 and draw a graph at random from the Erdős-Rényi distribution with connection probability 0.1. We choose \mathbf{W} to be the Metropolis weights [36] associated with the resulting graph; it turns out here that \lambda_{2}~{}=~{}0.9436. For T=5000, we choose step-size \gamma~{}=~{}0.005 and b~{}=~{}\log(Tm^{2})/(\rho\log(1/\lambda_{2})), which guarantees that the equivalent gradient noise variance is O(\sigma^{2}/(mT)).

In Figure 1 we plot the optimality gap for communications ratios \rho\in\{1,10\}. For \rho=1 the mini-batch size b is rather large, so D-SAMD and AD-SAMD only update their search points every few data-acquisition rounds. While one can clearly see the advantage of AD-SAMD over D-SAMD, both centralized algorithms converge much more quickly. By contrast, for \rho=10 the resulting mini-batch size is smaller, and the gap between centralized and decentralized performance is comparable. Because the other parameters are fixed and b=O(T^{1/2}), each of these schemes has an optimality gap of O(1/\sqrt{T}).

## VI Conclusion

We have presented two distributed schemes, D-SAMD and AD-SAMD, for convex stochastic optimization over networks of nodes that collaborate via rate-limited links. Further, we have derived sufficient conditions for the order-optimum convergence of D-SAMD and AD-SAMD, showing that accelerated mirror descent provides a foundation for distributed SO that better tolerates slow communications links. These results characterize relationships between network communications speed and the convergence speed of stochastic optimization.

A limitation of this work is that we are restricted to settings in which the prox mapping is Lipschitz continuous, which excludes important Bregman divergences such as the Kullbeck-Liebler divergence. Further, the conditions for optimum convergence restrict the Lipschitz constant of non-smooth component of the objective function to be small. Future work includes study of the limits on convergence speed for more general divergences and composite objective functions.

### -A Proof of Lemma 2

The first claim follows immediately from the definitions of the constituent matrices. The second claim follows from Lemma 1. To establish the final claim, we first bound the norm of the columns of \mathbf{E}(s):

 \displaystyle\mathbf{E}(s) \displaystyle\triangleq\mathbf{G}(s)\mathbf{W}^{r}-\overline{\mathbf{G}}(s) (32) \displaystyle=\mathbf{G}(s)(\mathbf{W}^{r}-1/n\mathbf{1}\mathbf{1}^{T}), (33) \displaystyle=(\mathbf{G}(s)-\overline{\mathbf{G}}(s))(\mathbf{W}^{r}-1/n% \mathbf{1}\mathbf{1}^{T}), (34)

where the first equality follows from \bar{\mathbf{G}}(s)=\mathbf{G}(s)1/n\mathbf{1}\mathbf{1}^{T} by definition, and the second equality follows from the fact that the dominant eigenspace of a column-stochastic matrix \mathbf{W} is the subspace of constant vectors, thus the rows of \bar{\mathbf{G}}(s) lie in the null space of \mathbf{W}^{r}-1/n\mathbf{1}\mathbf{1}^{T}. We bound the norm of the columns of \mathbf{E}(s) via the Frobenius norm of the entire matrix:

 \displaystyle\left\|{\mathbf{e}_{i}(s)}\right\|_{*} \displaystyle\leq\sqrt{C_{1}}\left\|{\mathbf{E}(s)}\right\|_{F} \displaystyle=\sqrt{C_{1}}\left\|{(\mathbf{G}(s)-\overline{\mathbf{G}}(s))(% \mathbf{W}^{r}-1/n\mathbf{1}\mathbf{1}^{T})}\right\|_{F} \displaystyle\leq\sqrt{C_{1}}m\lambda_{2}^{r}\left\|{\mathbf{G}(s)-\overline{% \mathbf{G}}(s)}\right\|_{F} \displaystyle\leq\max_{j,k}m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\left\|{\mathbf{g}_% {j}(s)-\mathbf{g}_{k}(s)}\right\|_{*},

where C_{1} bounds the gap between the Euclidean and dual norms, C_{2} bounds the gap in the opposite direction, and thus C_{*}\leq C_{1}C_{2}. Then, we bound the variance of the subgradient noise. Similar to the case of \mathbf{E}(s), we can rewrite

 \tilde{\mathbf{Z}}(s)=\mathbf{Z}(s)(\mathbf{W}^{r}-\mathbf{1}\mathbf{1}^{T}).

We bound the expected squared norm of each column \tilde{\mathbf{z}}_{i}(s) in terms of the expected Frobenius norm:

 \displaystyle E[\left\|{\mathbf{\tilde{z}}_{i}(s)}\right\|_{*}^{2}] \displaystyle\leq C_{1}E[\left\|{\mathbf{Z}(s)(\mathbf{W}^{r}-\mathbf{1}% \mathbf{1}^{T})}\right\|^{2}_{F}] \displaystyle\leq m\lambda_{2}^{2r}C_{1}E[\left\|{\mathbf{Z}(s)}\right\|_{F}^{% 2}] \displaystyle\leq\lambda_{2}^{2r}m^{2}C_{*}\sigma^{2}/b.

### -B Proof of Lemma 3

By the Lipschitz condition on P_{\mathbf{x}}(\mathbf{y}),

 \displaystyle a_{s+1} \displaystyle=\max_{i,j}\left\|{P_{\mathbf{x}_{i}(s)}(\gamma\mathbf{h}_{i}(s))% -P_{\mathbf{x}_{j}(s)}(\gamma\mathbf{h}_{j}(s))}\right\| \displaystyle\leq\max_{i,j}\left\|{\mathbf{x}_{i}(s)-\mathbf{x}_{j}(s)}\right% \|+\gamma\left\|{\mathbf{h}_{i}(s)-\mathbf{h}_{j}(s)}\right\|_{*} \displaystyle\leq\max_{i,j}\left\|{\mathbf{x}_{i}(s)-\mathbf{x}_{j}(s)}\right% \|+\frac{\alpha}{2L}\left\|{\mathbf{h}_{i}(s)-\mathbf{h}_{j}(s)}\right\|_{*},

where the final inequality is due to the constraint \gamma\leq\alpha/(2L). Applying Lemma 2, we obtain

 a_{s+1}\leq a_{s}+\max_{i,j}\frac{\alpha}{2L}\left\|{\mathbf{e}_{i}(s)-\mathbf% {e}_{j}(s)+\tilde{\mathbf{z}}_{i}(s)-\tilde{\mathbf{z}}_{j}(s)}\right\|_{*},

which, applying (23), yields

 \displaystyle a_{s+1}\leq a_{s}+\max_{i,j}\frac{\alpha m^{2}\sqrt{C_{*}}% \lambda_{2}^{r}}{2L}\left\|{\mathbf{g}_{i}(s)-\mathbf{g}_{j}(s)}\right\|_{*}+% \\ \displaystyle\frac{\alpha}{2L}\left\|{\tilde{\mathbf{z}}_{i}(s)-\tilde{\mathbf% {z}}_{j}(s)}\right\|_{*}.

Applying (11), we get

 \displaystyle a_{s+1} \displaystyle\leq a_{s}+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r}(a_{s}+\mathcal% {M}/L)+\max_{i}\frac{\alpha}{L}\left\|{\tilde{\mathbf{z}}_{i}(s)}\right\|_{*} \displaystyle=(1+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})a_{s}+\frac{\alpha}{L% }\left(m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\mathcal{M}+\left\|{\tilde{\mathbf{z}}_% {i^{*}}(s)}\right\|_{*}\right), (35)

where i^{*}\in V is the index that maximizes \left\|{\tilde{\mathbf{z}}_{i}}\right\|_{*}.

The final expression (35) gives the recurrence relationship that governs the divergence of the iterates, and it takes the form of a standard first-order difference equation. Recalling the base case a_{1}=0, we obtain the solution

 a_{s}\leq\frac{\alpha}{L}\sum_{\tau=0}^{s-1}(1+\alpha m^{2}\sqrt{C_{*}}\lambda% _{2}^{r})^{s-1-\tau}\left(m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\mathcal{M}+\left\|{% \tilde{\mathbf{z}}_{i^{*}}(s)}\right\|_{*}\right). (36)

To prove the lemma, we bound the moments of a_{s} using the solution (36). First, we bound E[a_{s}^{2}]:

 \displaystyle E[a_{s}^{2}]\leq E\Bigg{[}\Bigg{(}\frac{\alpha}{L}\sum_{\tau=0}^% {s-1}(1+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{s-1-\tau}\times\\ \displaystyle\left(m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\mathcal{M}+\left\|{\tilde{% \mathbf{z}}_{i^{*}}(s)}\right\|_{*}\right)\Bigg{)}^{2}\Bigg{]},

which we can rewrite via a change of variables as

 \displaystyle E[a_{s}^{2}]\leq\left(\frac{\alpha}{L}\right)^{2}\sum_{\tau=0}^{% s-1}\sum_{q=0}^{s-1}(1+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{\tau+q}\times% \\ \displaystyle E\big{[}(m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\mathcal{M}+\left\|{% \tilde{\mathbf{z}}_{i^{*}}(s-1-\tau)}\right\|_{*})\times\\ \displaystyle(m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\mathcal{M}+\left\|{\tilde{% \mathbf{z}}_{i^{*}}(s-1-q)}\right\|_{*})\big{]}.

Then, we apply the Cauchy-Schwartz inequality and (24) to obtain

 \displaystyle E[a_{s}^{2}]\leq\lambda_{2}^{2r}C_{*}m^{4}(\mathcal{M}+\sigma/% \sqrt{b})^{2}\left(\frac{\alpha}{L}\right)^{2}\times\\ \displaystyle\sum_{\tau=0}^{s-1}(1+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{% \tau}\sum_{q=0}^{s-1}(1+\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{q}.

Finally, we apply the geometric sum identity and simplify:

 \displaystyle E[a_{s}^{2}] \displaystyle\leq\!\lambda_{2}^{2r}C_{*}m^{4}(\mathcal{M}\!+\!\!\sigma/\sqrt{b% })^{2}\!\left(\frac{\alpha}{L}\right)^{2}\left(\frac{1\!-\!\!(1+\alpha m^{2}% \sqrt{C_{*}}\lambda_{2}^{r})^{s}}{1\!-\!\!(1+\alpha m^{2}\sqrt{C_{*}}\lambda_{% 2}^{r})}\right)^{2} \displaystyle=\frac{(\mathcal{M}+\sigma/\sqrt{b})^{2}}{L^{2}}((1+\alpha m^{2}% \sqrt{C_{*}}\lambda_{2}^{r})^{s}-1)^{2}.

This establishes the result for E[a_{s}^{2}], and the bound on E[a_{s}] follows a fortiori from Jensen’s inequality.

### -C Proof of Theorem 1

The proof involves an analysis of mirror descent, with the subgradient error quantified by Lemmas 2 and 3. As necessary, we will cite results from the proof of [8, Theorem 1] rather than reproduce its analysis.

Define

 \displaystyle\boldsymbol{\delta}_{i}(\tau) \displaystyle\triangleq\mathbf{h}_{i}(\tau)-\mathbf{g}(\mathbf{x}_{i}(\tau)) \displaystyle\boldsymbol{\eta}_{i}(\tau) \displaystyle\triangleq\bar{\mathbf{g}}(\tau)+\mathbf{e}_{i}(\tau)-\mathbf{g}(% \mathbf{x}_{i}(\tau))=\boldsymbol{\delta}_{i}(\tau)-\bar{\mathbf{z}}(\tau)-% \tilde{\mathbf{z}}_{i}(\tau) \displaystyle\zeta_{i}(\tau) \displaystyle\triangleq\gamma\langle\boldsymbol{\delta}_{i}(\tau),\mathbf{x}^{% *}-\mathbf{x}_{i}(\tau)\rangle.

Applying Lemmas 2 and 3, we bound E[\left\|{\boldsymbol{\eta}_{i}(\tau)}\right\|_{*}]:

 \displaystyle E[\left\|{\boldsymbol{\eta}_{i}(\tau)}\right\|_{*}] \displaystyle=\!E\left[\left\|{\overline{\mathbf{g}}(\tau)+\mathbf{e}_{i}(\tau% )-\mathbf{g}(\mathbf{x}_{i}(\tau))}\right\|_{*}\right] \displaystyle\leq\!E[\left\|{\overline{\mathbf{g}}(\tau)-\mathbf{g}(\mathbf{x}% _{i}(\tau))}\right\|_{*}]+E[\left\|{\mathbf{e}_{i}(\tau)}\right\|_{*}] \displaystyle\leq\!LE[a_{\tau}]+Lm^{2}\sqrt{C_{*}}\lambda_{2}^{r}E[a_{\tau}]+2% \mathcal{M} \displaystyle\leq\!L(1+m^{2}\sqrt{C_{*}}\lambda_{2}^{r})E[a_{\tau}]+2\mathcal{M} \displaystyle\leq\!(\mathcal{M}\!+\!\!\sigma/\sqrt{b})(1\!+\!\!m^{2}\sqrt{C_{*% }}\lambda_{2}^{r})((1\!+\!\!\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{s}\!-\!% \!1)\!+\!\!2\mathcal{M}.

Next, we bound the expected value of \zeta_{i}(\tau):

 \displaystyle E[\zeta_{i}(\tau)] \displaystyle=\gamma E[\langle\boldsymbol{\eta}_{i}(\tau)+\bar{\mathbf{z}}(% \tau)-\tilde{\mathbf{z}}_{i}(\tau),\mathbf{x}^{*}-\mathbf{x}_{i}(\tau)\rangle] \displaystyle=\gamma E[\langle\boldsymbol{\eta}_{i}(\tau),\mathbf{x}^{*}-% \mathbf{x}_{i}(\tau)\rangle]+ \displaystyle\quad\quad\quad\gamma E[\langle\bar{\mathbf{z}}(\tau)-\tilde{% \mathbf{z}}_{i}(\tau),\mathbf{x}^{*}-\mathbf{x}_{i}(\tau)\rangle] \displaystyle=\gamma E[\langle\boldsymbol{\eta}_{i}(\tau),\mathbf{x}^{*}-% \mathbf{x}_{i}(\tau)\rangle] \displaystyle\leq\gamma E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|_{*}% \left\|{\mathbf{x}^{*}-\mathbf{x}_{i}(\tau)}\right\|] \displaystyle\leq\gamma\Xi_{\tau}\Omega_{\omega} \displaystyle\leq\sqrt{\frac{\alpha}{2}}\frac{\Xi_{\tau}D_{\omega}}{L},

where the first inequality is due to the definition of the dual norm, and the third equality follows because \bar{\mathbf{z}}(\tau) and \tilde{\mathbf{z}}_{i}(\tau) have zero mean and are independent of \mathbf{x}^{*}-\mathbf{x}_{i}(\tau).

Next, we bound E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}_{*}]. By Lemma 2,

 \displaystyle E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}_{*}]=E\left% [\left\|{\overline{\mathbf{g}}(\tau)+\mathbf{e}_{i}(\tau)+\overline{\mathbf{z}% }(\tau)+\tilde{\mathbf{z}}_{i}(\tau)-\mathbf{g}(\mathbf{x}_{i}(\tau))}\right\|% ^{2}_{*}\right],

which we can bound via repeated uses of the triangle inequality:

 \displaystyle E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}_{*}]\leq 2E% [\left\|{\bar{g}(\tau)-g(\mathbf{x}_{i}(\tau))}\right\|_{*}^{2}]+\\ \displaystyle 2E[\left\|{\mathbf{e}_{i}(\tau)}\right\|_{*}^{2}]+4E[\left\|{% \bar{\mathbf{z}}(\tau)}\right\|_{*}^{2}+\left\|{\tilde{\mathbf{z}}_{t}(\tau)}% \right\|_{*}^{2}].

Next, we apply Lemma 2:

 \displaystyle E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}_{*}]\leq 2L% ^{2}(1+m^{4}C_{*}\lambda_{2}^{2r})E[a_{\tau}^{2}]+\\ \displaystyle 4C_{*}\sigma^{2}/(mb)+4\lambda_{2}^{2r}C_{*}\sigma^{2}m^{2}/b+4% \mathcal{M}.

Finally, we apply Lemma 3:

 \displaystyle E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}_{*}]\leq 2L% ^{2}(1+m^{4}C_{*}\lambda_{2}^{2r})E[a_{\tau}^{2}]+\\ \displaystyle 4C_{*}\sigma^{2}/(mb)+4\lambda_{2}^{2r}C_{*}\sigma^{2}m^{2}/b+4% \mathcal{M}=\Delta_{\tau}^{2}.

Applying the proof of Theorem 1 of [8] to the D-SAMD iterates, we have that

 \displaystyle S\gamma(\psi(\mathbf{x}_{i}^{\mathrm{av}}(S+1))-\psi^{*})\leq D_% {\omega}^{2}+\\ \displaystyle\sum_{\tau=1}^{S}\left[\zeta_{i}(\tau)+\frac{2\gamma^{2}}{\alpha}% \left(4\mathcal{M}^{2}+\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}_{*}% \right)\right].

We take the expectation to obtain

 \displaystyle E[\psi(\mathbf{x}_{i}^{\mathrm{av}}(S+1))]-\psi^{*}\leq\frac{D_{% \omega}^{2}}{S\gamma}+\frac{1}{S}\sum_{\tau=1}^{S}\sqrt{\frac{\alpha}{2}}\frac% {\Xi_{\tau}D_{\omega}}{L}+\\ \displaystyle\frac{2\gamma}{\alpha S}\left(4S\mathcal{M}^{2}+\sum_{\tau=1}^{S}% \Delta_{\tau}^{2}\right).

Because \{\Xi_{\tau}\} and \{\Delta_{\tau}^{2}\} are increasing sequences, we can bound the preceding by

 \displaystyle E[\psi(\mathbf{x}_{i}^{\mathrm{av}}(S+1))]-\psi^{*}\leq\frac{D_{% \omega}^{2}}{S\gamma}+\frac{2\gamma}{\alpha}(4\mathcal{M}^{2}+\Delta_{S}^{2})+% \sqrt{\frac{\alpha}{2}}\frac{\Xi_{S}D_{\omega}}{L},

Minimizing this bound over \gamma in the interval (0,\alpha/(2L)], we obtain the following optimum value:

 \gamma^{*}=\min\left\{\frac{\alpha}{2L},\sqrt{\frac{\alpha D_{\omega}^{2}}{2S(% 4\mathcal{M}^{2}+2\Delta_{S}^{2})}}\right\},

which results in the bound

 \displaystyle E[\psi(\mathbf{x}_{i}^{\mathrm{av}}(s+1))]-\psi^{*}\leq\frac{2LD% _{\omega}^{2}}{\alpha S}+\\ \displaystyle\sqrt{\frac{2(4\mathcal{M}^{2}+2\Delta_{S}^{2})}{\alpha S}}+\sqrt% {\frac{\alpha}{2}}\frac{\Xi_{S}D_{\omega}}{L}.

### -D Proof of Corollary 1

First, we make a few simplifying approximations to \Xi_{\tau}:

 \displaystyle\Xi_{\tau} \displaystyle=(\mathcal{M}\!+\!\!\sigma/\sqrt{b})(1\!+\!\!m^{2}\sqrt{C_{*}}% \lambda_{2}^{r})((1\!+\!\!\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{s}\!-\!\!1% )\!+\!\!2\mathcal{M} \displaystyle\leq 2(\mathcal{M}+\sigma/\sqrt{b})\sqrt{C_{*}m}((1+\alpha m^{2}% \sqrt{C_{*}}\lambda_{2}^{r})^{\tau})+2\mathcal{M} \displaystyle=2(\mathcal{M}+\sigma/\sqrt{b})\sqrt{C_{*}m}\sum_{k=0}^{\tau}% \binom{s}{k}(\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{k})+2\mathcal{M} \displaystyle\leq 2(\mathcal{M}+\sigma/\sqrt{b})\sqrt{C_{*}m}\sum_{k=0}^{\tau}% (\tau\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r})^{k})+2\mathcal{M} \displaystyle\leq 2(\mathcal{M}+\sigma/\sqrt{b})\sqrt{C_{*}m}(\tau+1)(\tau% \alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r}))+2\mathcal{M} \displaystyle=O(\tau^{2}\sqrt{m^{5}}\lambda_{2}^{r}(\mathcal{M}+\sigma/\sqrt{b% })+\mathcal{M}),

where the first inequality is trivial, the next two statements are due to the binomial theorem and the exponential upper bound on the binomial coefficient, and the final inequality is true if we constrain S\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\leq 1, which is consistent with the order-wise constraints listed in the statement of this corollary. Recalling that r\leq b\rho, we obtain

 \Xi_{S}=O\left(S^{2}\lambda_{2}^{b\rho}\sqrt{m^{5}}(\mathcal{M}+\sigma/\sqrt{b% })+\mathcal{M}^{2}\right). (37)

Now, we find the optimum scaling law on the mini-batch size b. To ensure S\alpha m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\leq 1 for all s, we need

 b=\Omega\left(\frac{\log(mT)}{\rho\log(1/\lambda_{2})}\right). (38)

In order to get optimum scaling we need \Xi_{S}~{}=~{}O(\sigma/\sqrt{mT}+\mathcal{M}), which yields a slightly stronger necessary condition:

 b=\Omega\left(1+\frac{\log(mT)}{\rho\log(1/\lambda_{2})}\right), (39)

where the first term is necessary because b cannot be smaller than unity.

Similar approximations establish that

 \Delta^{2}_{S}=O(S^{4}m^{5}\lambda_{2}^{2b\rho}(\mathcal{M}+\sigma^{2}/b)+% \sigma^{2}/(mb)+\lambda_{2}^{2b\rho}\sigma^{2}m^{2}/b+\mathcal{M}^{2}).

Thus, (39) ensures \Delta_{S}^{2}=O(\sigma^{2}/(mb)+\mathcal{M}^{2}). The resulting gap to optimality scales as

 E[\Psi(\mathbf{x}_{i}^{\mathrm{av}}(S+1))]-\Psi^{*}=O\left(\frac{b}{T}+\sqrt{% \frac{b\mathcal{M}^{2}}{T}}+\frac{\sigma}{\sqrt{mT}}+\mathcal{M}\right).

Substituting (39) yields

 \displaystyle E[\Psi(\mathbf{x}_{i}^{\mathrm{av}}(S+1))]-\Psi^{*}=O\Bigg{(}% \max\left\{\frac{\log(mT)}{\rho T\log(1/\lambda_{2})},\frac{1}{T}\right\}+\\ \displaystyle\max\left\{\sqrt{\frac{\mathcal{M}^{2}\log(mT)}{\rho T\log(1/% \lambda_{2})}},\sqrt{\frac{\mathcal{M}^{2}}{T}}\right\}+\frac{\sigma}{\sqrt{mT% }}+\mathcal{M}\Bigg{)}.

In order to achieve order optimality, we need the first term to be O((\mathcal{M}+\sigma)/\sqrt{mT}), which requires

We also need the second and fourth terms to be O((\mathcal{M}~{}+~{}\sigma)/\sqrt{mT}), which requires

 \mathcal{M}=O\left(\min\left\{\frac{1}{m},\frac{1}{\sqrt{m\sigma^{2}T}}\right% \}\right). (40)

This establishes the result.

### -E Proof of Lemma 4

The sequences a_{s}, b_{s}, c_{s} are interdependent. Rather than solving the complex recursion directly, we bound the dominant sequence. Define d_{s}\triangleq\max\{a_{s},b_{s},c_{s}\}. First, we bound a_{s}:

 \displaystyle a_{s+1}=\max_{i,j}||\beta_{s+1}^{-1}(\mathbf{x}_{i}(s+1)-\mathbf% {x}_{j}(s+1)+\\ \displaystyle(1-\beta_{s+1}^{-1})(\mathbf{x}_{i}^{\mathrm{ag}}(s)-\mathbf{x}_{% j}^{\mathrm{ag}(s)})||,

which, via the triangle inequality, is bounded by

 \displaystyle a_{s+1}\leq\max_{i,j}\beta_{s+1}^{-1}\left\|{P_{\mathbf{x}_{i}(s% )}(\gamma_{s}\mathbf{h}_{i}(s))-P_{\mathbf{x}_{j}(s)}(\gamma_{s}\mathbf{h}_{j}% (s))}\right\|+\\ \displaystyle(1-\beta_{s+1}^{-1})a_{s}.

Because the prox-mapping is Lipschitz,

 \displaystyle a_{s+1}\leq\max_{i,j}\beta_{s+1}^{-1}(\left\|{\mathbf{x}_{i}(s)-% \mathbf{x}_{j}(s)}\right\|+\gamma_{s}\left\|{\mathbf{h}_{i}(s)-\mathbf{h}_{j}(% s)}\right\|_{*})\\ \displaystyle+(1-\beta_{s+1}^{-1})a_{s}.

Applying the first part of Lemma 2 yields

 \displaystyle a_{s+1}\leq\max_{i,j}\beta_{s+1}^{-1}b_{s}+(1-\beta_{s+1}^{-1})a% _{s}+\\ \displaystyle\beta_{s+1}^{-1}\gamma_{s}\left\|{\mathbf{e}_{i}(s)-\mathbf{e}_{j% }(s)+\tilde{\mathbf{z}}_{i}(s)-\tilde{\mathbf{z}}_{j}(s)}\right\|_{*},

and applying the second part and the triangle inequality yields

 \displaystyle a_{s+1}\leq\max_{i,j}d_{s}+2\gamma_{s}m^{2}\sqrt{C_{*}}\lambda_{% 2}^{r}\left\|{\mathbf{g}_{i}(s)-\mathbf{g}_{j}(s)}\right\|_{*}+\\ \displaystyle 2\gamma_{s}\left\|{\tilde{\mathbf{z}}_{i}(s)}\right\|_{*}.

We apply (11) and collect terms to obtain

Next, we bound b_{s} via similar steps:

Finally, we bound c_{s}:

 \displaystyle c_{s+1}=\max_{i,j}||\beta_{s}^{-1}(\mathbf{x}_{i}(s+1)-\mathbf{x% }_{j}(s+1))+\\ \displaystyle(1-\beta_{s}^{-1})(\mathbf{x}_{i}^{\mathrm{ag}}(s)-\mathbf{x}_{j}% ^{\mathrm{ag}}(s))||.

We bound this sequence in terms of a_{s} and b_{s}:

Comparing the three bounds, we observe that

 \displaystyle d_{s+1}\leq\max_{i}(1+2\gamma_{s}m^{2}\sqrt{C_{*}}L\lambda_{2}^{% r})d_{s}+\\ \displaystyle 2\gamma_{s}(2m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\mathcal{M}+\left\|% {\tilde{\mathbf{z}}_{i}(s)}\right\|_{*}).

Solving the recursion yields

 d_{s}\leq\sum_{\tau=0}^{s-1}(1+2m^{2}\gamma_{s}\sqrt{C_{*}}L\lambda_{2}^{r})^{% \tau}2\gamma_{s}(2m^{2}\sqrt{C_{*}}\lambda_{2}^{r}\mathcal{M}+\left\|{\tilde{% \mathbf{z}}_{i}(s)}\right\|_{*}). (41)

Following the same argument as in Lemma 3, we obtain

 \displaystyle E[d_{s}^{2}] \displaystyle\leq\frac{(\mathcal{M}+\sigma/\sqrt{b})^{2}}{L^{2}}((1+2\gamma_{s% }m^{2}\sqrt{C_{*}}L\lambda_{2}^{r})^{s}-1)^{2}.

The bound on E[d_{s}] follows a fortiori.

### -F Proof of Theorem 2

Similar to the proof of Theorem 1, the proof involves an analysis of accelerated mirror descent, and as necessary we will cite results from [8, Theorem 2]. Define

 \displaystyle\boldsymbol{\delta}_{i}(\tau) \displaystyle=\mathbf{h}_{i}(\tau)-g(\mathbf{x}_{i}^{\mathrm{md}}(\tau)) \displaystyle\boldsymbol{\eta}_{i}(\tau) \displaystyle\triangleq\bar{\mathbf{g}}(\tau)+\mathbf{e}_{i}(\tau)-\mathbf{g}(% \mathbf{x}^{\mathrm{md}}_{i}(\tau))=\boldsymbol{\delta}_{i}(\tau)-\bar{\mathbf% {z}}(\tau)-\tilde{\mathbf{z}}_{i}(\tau) \displaystyle\zeta_{i}(\tau) \displaystyle=\gamma_{\tau}\langle\delta_{i}(\tau),\mathbf{x}^{*}-\mathbf{x}_{% i}(\tau)\rangle.

Applying (11) and Lemmas 2 and 4, we bound E[\left\|{\boldsymbol{\eta}_{i}(\tau)}\right\|]:

By the definition of the dual norm:

 \displaystyle E[\zeta_{i}(\tau)]\leq\gamma_{\tau}E[\left\|{\eta_{i}(\tau)}% \right\|_{*}\left\|{\mathbf{x}^{*}-\mathbf{x}_{i}(\tau)}\right\|]\leq\\ \displaystyle\gamma_{\tau}\Xi_{\tau}\Omega_{\omega}\leq\sqrt{\frac{\alpha}{2}}% \frac{\Xi_{\tau}D_{\omega}}{L}.

We bound E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}] via the triangle inequality:

 \displaystyle E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}]\leq 2E[% \left\|{\bar{\mathbf{g}}(\tau)-g(\mathbf{x}_{t}^{\mathrm{md}}(\tau))}\right\|_% {*}^{2}]+2E[\left\|{\mathbf{e}_{i}(\tau)}\right\|_{*}^{2}]+\\ \displaystyle 4E[\left\|{\tilde{\mathbf{z}}_{i}(\tau)}\right\|_{*}^{2}]+4E[% \left\|{\bar{\mathbf{z}}(\tau)}\right\|_{*}^{2}],

which by (11) and Lemma 2 is bounded by

 \displaystyle E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}]\leq 2L^{2}% (1+m^{4}C_{*}\lambda_{2}^{2r})E[c_{\tau}^{2}]+\\ \displaystyle\frac{4C_{*}\sigma^{2}}{b}(\lambda_{2}^{2r}m^{2}+1/m)+4\mathcal{M}.

Applying Lemma 4, we obtain

 \displaystyle E[\left\|{\boldsymbol{\delta}_{i}(\tau)}\right\|^{2}]\leq 2(% \mathcal{M}+\sigma/\sqrt{b})^{2}((1+2\gamma_{\tau}m^{2}\sqrt{C_{*}}L\lambda_{2% }^{r})^{\tau}-1)^{2}+\\ \displaystyle\frac{4C_{*}\sigma^{2}}{b}(\lambda_{2}^{2r}m^{2}+1/m)+4\mathcal{M}.

From the proof of [8, Theorem 2], we observe that

 \displaystyle(\beta_{S+1}-1)\gamma_{S+1}[\Psi(\mathbf{x}_{i}^{\mathrm{ag}}(S+1% ))-\Psi^{*}]\leq D^{2}{\omega,X}+\\ \displaystyle\sum_{\tau=1}^{S}\zeta_{\tau}+\frac{2}{\alpha}(4M^{2}+\left\|{% \boldsymbol{\delta}_{i}(\tau)}\right\|_{*}^{2})\gamma_{\tau}^{2})].

Taking the expectation yields

 \displaystyle(\beta_{S+1}-1)\gamma_{S+1}[E[\Psi(\mathbf{x}_{i}^{\mathrm{ag}}(S% +1))]-\Psi^{*}]\leq D^{2}_{\omega,X}+\\ \displaystyle\sqrt{\frac{2}{\alpha}}D_{\omega,X}\sum_{\tau=1}^{S}\gamma_{\tau}% \Xi_{\tau}+\frac{2}{\alpha}\sum_{\tau=1}^{S}\gamma_{\tau}^{2}(4M^{2}+\Delta_{% \tau}^{2}).

Letting \beta_{\tau}=\frac{\tau+1}{2} and \gamma_{\tau}=\frac{\tau+1}{2}\gamma, observing that \Delta_{\tau} and \Xi_{\tau} are increasing in \tau, and simplifying, we obtain

 \displaystyle E[\Psi(\mathbf{x}_{i}^{\mathrm{ag}}(S+1))]-\Psi^{*}\leq\frac{4D_% {\omega,X}^{2}}{(S^{2}+1)\gamma}+\\ \displaystyle\sqrt{\frac{32}{\alpha}}D_{\omega,X}\Xi_{S}+\frac{4\gamma S}{% \alpha}(4M^{2}+\Delta_{S}^{2}).

Solving for the optimum \gamma in the interval 0\leq\gamma\leq\alpha/(2L), we obtain:

 \gamma^{*}=\min\left\{\frac{\alpha}{2L},\sqrt{\frac{\alpha D_{\omega,X}^{2}}{S% (S^{2}+1)(4M^{2}+\Delta_{S}^{2})}}\right\}.

This gives the bound

 \displaystyle E[\Psi(\mathbf{x}_{i}^{\mathrm{ag}}(S+1))]-\Psi^{*}\leq\frac{8LD% _{\omega,X}^{2}}{\alpha(S^{2}+1)}+\\ \displaystyle\frac{16SD_{\omega,X}^{2}(4M^{2}+\Delta_{S}^{2})}{\alpha(S+1)^{2}% }+\sqrt{\frac{32}{\alpha}}D_{\omega,X}\Xi_{S},

which simplifies to the desired bound

 \displaystyle E[\Psi(\mathbf{x}_{i}^{\mathrm{ag}}(S+1))]-\Psi^{*}\leq\frac{8LD% _{\omega,X}^{2}}{\alpha S^{2}}+\\ \displaystyle 4D_{\omega,X}\sqrt{\frac{4M+\Delta_{S}^{2}}{\alpha S}}+\sqrt{% \frac{32}{\alpha}}D_{\omega,X}\Xi_{S}.

### -G Proof of Corollary 2

First, we bound \Xi_{\tau} and \Delta_{\tau}^{2} order-wise. Using steps similar to those in the proof of Corollary 1, one can show that

 \Xi_{\tau}=O\left(\tau^{3}m^{3}\lambda_{2}^{r}(\mathcal{M}+\sigma/\sqrt{b})+% \mathcal{M}\right),

and

 \Delta_{\tau}^{2}=O\left(\tau^{6}m^{6}\lambda_{2}^{2r}(\mathcal{M}^{2}+\sigma^% {2}/b)+\frac{\sigma^{2}}{mb}+\mathcal{M}^{2}\right).

For order optimum convergence, we need \Delta_{S}^{2}~{}=~{}O(\sigma^{2}/(mb)+\mathcal{M}^{2}). Combining the above equation with the constraint r\leq b\rho, we obtain the condition

 b=\Omega\left(1+\frac{\log(mT)}{\rho\log(1/\lambda_{2})}\right).

This condition also ensures that \Xi_{S}=O(\sigma/\sqrt{mT}+\mathcal{M}), as is necessary for optimum convergence speed. This leads to a convergence speed of

 E[\Psi(\mathbf{x}_{i}^{\mathrm{av}}(S+1))]-\Psi^{*}=O\left(\frac{b^{2}}{T^{2}}% +\sqrt{\frac{b\mathcal{M}^{2}}{T}}+\frac{\sigma}{\sqrt{mT}}+\mathcal{M}\right).

In order to make the first term O((\mathcal{M}+\sigma)/\sqrt{mT}), we need

To make the second and fourth terms O((\mathcal{M}+\sigma)/\sqrt{mT}), we need, as before,

 \mathcal{M}=O\left(\min\left\{\frac{1}{m},\frac{1}{\sqrt{m\sigma^{2}T}}\right% \}\right).

## References

• [1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016, http://www.deeplearningbook.org.
• [2] H. Kushner, “Stochastic approximation: a survey,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 1, pp. 87–96, 2010.
• [3] S. Kim, R. Pasupathy, and S. G. Henderson, “A guide to sample average approximation,” in Handbook of Simulation Optimization.   Springer, 2015, pp. 207–243.
• [4] M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. O. Hero, and S. McLaughlin, “A survey of stochastic simulation and optimization methods in signal processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 2, pp. 224–241, 2016.
• [5] H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951.
• [6] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
• [7] A. Juditsky, A. Nemirovski, and C. Tauvel, “Solving variational inequalities with stochastic mirror-prox algorithm,” Stochastic Systems, vol. 1, no. 1, pp. 17–58, 2011.
• [8] G. Lan, “An optimal method for stochastic composite optimization,” Mathematical Programming, vol. 133, no. 1, pp. 365–397, 2012.
• [9] C. Hu, W. Pan, and J. T. Kwok, “Accelerated gradient methods for stochastic optimization and online learning,” in Advances in Neural Information Processing Systems, 2009, pp. 781–789.
• [10] L. Xiao, “Dual averaging methods for regularized stochastic learning and online optimization,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2543–2596, 2010.
• [11] X. Chen, Q. Lin, and J. Pena, “Optimal regularized dual averaging methods for stochastic optimization,” in Advances in Neural Information Processing Systems, 2012, pp. 395–403.
• [12] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Trans. Automat. Control, vol. 31, no. 9, pp. 803–812, Sep. 1986.
• [13] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011, pp. 693–701.
• [14] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan, “Perturbed iterate analysis for asynchronous stochastic optimization,” arXiv preprint arXiv:1507.06970, 2015.
• [15] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proc. 3rd Intl. Symp. Information Processing in Sensor Networks (IPSN’04), Berkeley, CA, Apr. 2004, pp. 20–27.
• [16] M. G. Rabbat and R. D. Nowak, “Quantized incremental algorithms for distributed optimization,” IEEE J. Select. Areas Commun., vol. 23, no. 4, pp. 798–808, Apr. 2005.
• [17] D. Blatt, A. O. Hero, and H. Gauchman, “A convergent incremental gradient method with a constant step size,” SIAM J. Optim., vol. 18, no. 1, pp. 29–51, 2007.
• [18] A. Nedic and D. P. Bertsekas, “Incremental subgradient methods for nondifferentiable optimization,” SIAM Journal on Optimization, vol. 12, no. 1, pp. 109–138, 2001.
• [19] B. Johansson, M. Rabi, and M. Johansson, “A randomized incremental subgradient method for distributed optimization in networked systems,” SIAM J. Optim., vol. 20, no. 3, pp. 1157–1170, 2010.
• [20] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Incremental stochastic subgradient algorithms for convex optimization,” SIAM J. Optim., vol. 20, no. 2, pp. 691–717, 2009.
• [21] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Automat. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
• [22] K. Srivastava and A. Nedic, “Distributed asynchronous constrained stochastic optimization,” IEEE J. Select. Topics Signal Processing, vol. 5, no. 4, pp. 772–790, Aug. 2011.
• [23] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Push-sum distributed dual averaging for convex optimization,” in Proc. 51st IEEE Conf. Decision and Control (CDC’12), Maui, HI, Dec. 2012, pp. 5453–5458.
• [24] A. Mokhtari and A. Ribeiro, “DSA: Decentralized double stochastic averaging gradient algorithm,” J. Mach. Learn. Res., vol. 17, no. 61, pp. 1–35, 2016.
• [25] A. S. Bijral, A. D. Sarwate, and N. Srebro, “On data dependence in distributed stochastic optimization,” arXiv preprint arXiv:1603.04379, 2016.
• [26] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Trans. Signal Processing, vol. 62, no. 7, pp. 1750–1761, Apr. 2014.
• [27] D. Jakovetić, J. M. F. Moura, and J. Xavier, “Linear convergence rate of a class of distributed augmented lagrangian algorithms,” IEEE Trans. Automat. Control, vol. 60, no. 4, pp. 922–936, Apr. 2015.
• [28] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
• [29] S. Sundhar Ram, A. Nedic, and V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” J. Optim. Theory App., vol. 147, no. 3, pp. 516–545, Jul. 2010.
• [30] J. Duchi, A. Agarwal, and M. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Trans. Automat. Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
• [31] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal distributed online prediction using mini-batches,” Journal of Machine Learning Research, vol. 13, no. Jan, pp. 165–202, 2012.
• [32] K. I. Tsianos and M. G. Rabbat, “Efficient distributed online prediction and stochastic optimization with approximate distributed averaging,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 489–506, Dec 2016.
• [33] S. Shalev-Shwartz, Online learning and online convex optimization, ser. Foundations and Trends in Machine Learning.   Now Publishers, Inc., 2012, vol. 4, no. 2.
• [34] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
• [35] C. M. Bishop, Pattern Recognition and Machine Learning.   New York, NY: Springer, 2006.
• [36] L. Xiao, S. Boyd, and S. Lall, “A scheme for robust distributed sensor fusion based on average consensus,” in Proceedings of the 4th International Symposium on Information Processing in Sensor Networks, 2005, p. 9.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters