Excess-Risk of Distributed Stochastic Learners
This work studies the learning ability of consensus and diffusion distributed learners from continuous streams of data arising from different but related statistical distributions. Four distinctive features for diffusion learners are revealed in relation to other decentralized schemes even under left-stochastic combination policies. First, closed-form expressions for the evolution of their excess-risk are derived for strongly-convex risk functions under a diminishing step-size rule. Second, using these results, it is shown that the diffusion strategy improves the asymptotic convergence rate of the excess-risk relative to non-cooperative schemes. Third, it is shown that when the in-network cooperation rules are designed optimally, the performance of the diffusion implementation can outperform that of naive centralized processing. Finally, the arguments further show that diffusion outperforms consensus strategies asymptotically, and that the asymptotic excess-risk expression is invariant to the particular network topology. The framework adopted in this work studies convergence in the stronger mean-square-error sense, rather than in distribution, and develops tools that enable a close examination of the differences between distributed strategies in terms of asymptotic behavior, as well as in terms of convergence rates.
distributed stochastic optimization, diffusion strategies, consensus strategies, centralized processing, excess-risk, asymptotic behavior, convergence rate, combination policy
Machine learning applications rely on the premise that it is possible to benefit from leveraging information collected from different users. The range of benefits, and the computational cost necessary to analyze the data, depend on how the information is mined. It is sometimes advantageous to aggregate the information from all users at a central location for processing and analysis. Many current implementations rely on this centralized approach. However, the rapid increase in the number of users, coupled with privacy and communication constraints related to transmitting, storing, and analyzing huge amounts of data at remote central locations, has strongly motivated the development of decentralized solutions to learning and data mining [3, 4, 5, 6, 7, 8, 9, 10, 11].
In this work, we study the distributed real-time prediction problem over a network of learners. We assume the network is connected, meaning that any two arbitrary agents are either connected directly or by means of a path passing through other agents. We do not expect the agents to share their data sets but only a parameter vector (or a statistic) that is representative of their local information. Such networks serve as useful models for peer-to-peer networks and social networks. The objective of the learning process is for all nodes to minimize some objective function, termed the risk function, in a distributed manner. We shall compare the performance of cooperative and non-cooperative solutions by examining the gap between the risk achieved by the distributed implementations and the risk achieved by an oracle solution with access to the true distribution of the input data; we shall refer to this gap as the excess-risk.
Among other contributions, this work studies stochastic gradient-based distributed strategies that are shown here to converge in the mean-square-error sense when a decaying step-size sequence is used, and that are also shown to outperform other implementations, even under left-stochastic combination rules [12, 13, 14, 15]. Specifically, we will show that the strategies under study achieve a better convergence rate than non-cooperative algorithms, and we will also explain why diffusion strategies outperform other distributed solutions such as those relying on consensus constructions or on doubly-stochastic combination policies [16, 4, 17, 18, 19], as well as naïve centralized algorithms . It was previously shown that the diffusion strategies outperform their consensus-based counterparts in the constant step-size scenario . We analytically show that the same conclusion holds in the diminishing step-size scenario even as the step-size decays. We also illustrate in the simulations that while diffusion and consensus-based algorithms have the same computational complexity, it turns out that diffusion algorithms reduce the overshoot during the transient phase. In comparison to the useful work , our formulation studies convergence in the stronger mean-square-error sense, and develops analysis tools that do not depend on using the central limit theorem or on studying convergence in a weaker distributional sense. In addition, unlike the works [21, 10, 3, 22], we are not solely interested in bounding the excess-risk. Instead, we are interested in obtaining a closed-form expression for the asymptotic excess-risk of distributed and non-distributed strategies in order to compare and optimize their absolute asymptotic performance.
Recently, there has also been interest in primal-dual approaches for distributed optimization [23, 24, 25, 6, 26]. Generally, these approaches are studied in the deterministic optimization context where the iterates are not prone to noise or when the risk function is non-differentiable. It was demonstrated in  that the primal diffusion strategy studied in this manuscript also outperforms augmented Lagrangian and Arrow-Hurwicz primal-dual algorithms in the stochastic constant-step-size setting in both stability and steady-state performance. It is possible to carry out similar comparisons in the diminishing step-size scenario, but this manuscript is focused on the study of primal approaches. As the extended analysis and derivation in later sections and appendices show, this case is already demanding enough to warrant separate consideration in this work.
The techniques developed will allow us to examine analytically and closely the differences between distributed strategies in terms of asymptotic behavior, as well as in terms of rates of convergence, by exploiting properties of Gamma functions and the convergence properties of products of infinitely many scaling coefficients. For instance, when the noise profile is uniform across all agents, one of our conclusions will be to show that the convergence rate of diffusion strategies is on the order of $\Theta(1/(Ni))$, where $N$ is the number of agents and the notation $a(i) = \Theta(b(i))$ means that the sequence $a(i)$ decays at the same rate as $b(i)$ for sufficiently large $i$, i.e., there exist positive constants $c_1$ and $c_2$ and an integer $i_o$ such that $c_1 b(i) \le a(i) \le c_2 b(i)$ for all $i \ge i_o$. This rate is consistent with the result established for consensus implementations under doubly-stochastic combination policies in [17, 19, 16], albeit under a weaker convergence-in-distribution sense, where it was argued that the estimation error approaches a Gaussian distribution whose covariance matrix scales as $1/(Ni)$. On the other hand, when the noise profile is non-uniform across the agents, the analysis will show that diffusion methods can surpass this rate. These and other useful conclusions will follow from the detailed mean-square and convergence analyses that are carried out in the sequel. The theoretical findings are illustrated by simulations in the last section.
For ease of reference, we summarize here the main conclusions in the manuscript:
We derive a closed-form expression (and not only a bound) for the asymptotic excess-risk curve of the distributed strategies.
We analyze the derived expression to conclude that the asymptotic performance depends on the network topology solely through the Perron vector of the combination matrix used in the strategy. In this way, different topology structures with the same Perron vector are shown to attain the same asymptotic performance. That is, the full eigen-structure of the topology becomes irrelevant in the asymptotic regime.
We show that once the Perron vector is optimized to minimize the asymptotic excess-risk, it is possible to construct a combination matrix with that Perron vector in order to attain the optimal performance in a fully distributed manner.
We compare the asymptotic excess-risk performance of the diffusion strategy to centralized and non-cooperative strategies to conclude that the diffusion strategy can attain the performance of a weighted centralized strategy asymptotically.
We compare the asymptotic excess-risk performance of the diffusion strategy to consensus distributed strategies to conclude that the asymptotic excess-risk curve of the consensus strategy will be worse than that of the diffusion strategy.
We verify our conclusions through simulations.
Notation: Random quantities are denoted in boldface. Throughout the manuscript, all vectors are column vectors. Matrices are denoted in capital letters, while vectors and scalars are denoted in lowercase letters. Network variables that aggregate variables across the network are denoted in calligraphic letters. Unless otherwise noted, the notation $\|\cdot\|$ refers to the Euclidean norm for a vector and, for a matrix, to the matrix norm that is induced by the Euclidean vector norm. Furthermore, the notation $\otimes$ denotes the Kronecker product operation [27, p. 139]. The notation $\mathbb{1}_N$ denotes a vector of dimension $N$ with all its elements equal to one.
II Problem Formulation and Algorithms
Consider a network of learners. Each learner is subject to a streaming sequence of independent data samples , for , arising from some fixed distribution . The goal of each agent is to learn the vector that optimizes the average of some loss function, say, , where the expectation is over the distribution of the data and is the vector variable of optimization. For example, in order to learn the hyper-plane that best separates feature data belonging to one of two classes , a regularized logistic-regression (RLR) algorithm would minimize the expected value of the following loss function over (with the expectation computed over the distribution of the data ) :
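As a concrete illustration of such a loss, the sketch below uses one common form of the regularized logistic loss, $Q(w; h, y) = \frac{\rho}{2}\|w\|^2 + \ln(1 + e^{-y h^T w})$, with labels $y \in \{-1, +1\}$. The symbols `h`, `y`, and `rho` and the function names are assumptions made for illustration; the exact expression used in the manuscript may differ.

```python
import numpy as np

def rlr_loss(w, h, y, rho=0.1):
    """Regularized logistic loss for one sample (h, y), with y in {-1, +1}.

    rho is a hypothetical regularization weight chosen for illustration.
    """
    # logaddexp(0, z) evaluates log(1 + exp(z)) in a numerically stable way
    return 0.5 * rho * np.dot(w, w) + np.logaddexp(0.0, -y * np.dot(h, w))

def rlr_grad(w, h, y, rho=0.1):
    """Gradient of rlr_loss with respect to w."""
    # d/dw log(1 + exp(-y h^T w)) = -y h / (1 + exp(y h^T w))
    return rho * w - y * h / (1.0 + np.exp(y * np.dot(h, w)))
```

The quadratic regularization term is what makes the resulting risk strongly convex, which is the property the analysis relies on.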
The expectation of the loss function over the distribution of the data is referred to as the risk function [30, p. 16]:
and we denote the optimizer of (3) by :
where is unique when is strongly-convex, which we shall assume for the remainder of the manuscript. The assumption of strong-convexity of the risk function is important in practice since the convergence rate of most stochastic approximation strategies will be significantly reduced when the condition does not hold . This is not a limitation in most problems arising in the context of adaptation and learning since regularization (such as ) is often used and it helps ensure strong convexity. The risk function can be viewed as a measure of the “prediction-error” of a classifier or regression method since it evaluates the performance of the method against samples taken from the distribution of the input data that have not yet been observed by the classifier/regressor [30, p. 16]. The risk serves as a measure of how well an estimate will perform on a new sample on average. For this reason, the risk is also referred to as the generalization ability of the classifier.
We will assume for the remainder of this exposition that the optimizer is the same for all nodes . This case is common in both machine learning (where, for example, for all ) and distributed inference applications where the distributions are dependent on a common parameter vector to be optimized (see Sec. V further ahead). In order to measure the performance of each learner, we define the excess-risk (ER) at node as:
where denotes the estimator of that is computed by node at time (i.e., it is the estimator that is generated observing past data within the neighborhood of node ). The excess-risk is non-negative because is strongly-convex and, therefore, for all . The expectation in (5) is over the data since is a random quantity that depends on all the data samples up to time (i.e., ). The dependence on the data from the other agents arises from the network topology. Our interest in this work is to characterize the convergence rate, to zero, of the excess-risk for various distributed and non-distributed strategies of learning for a given loss function. We also derive closed-form expressions for the asymptotic excess-risk and compare the absolute-value of the excess-risk curves for algorithms that converge at the same rate.
There are various approaches for optimizing (4). We concentrate on fully-distributed strategies that operate over sparsely-connected networks. The concept of a fully distributed strategy is used here to mean the following:
There is no central node that is coordinating the communication and computation during the learning process.
A node does not need to be connected to all other nodes. Indeed, as long as the network is connected (and it can be sparsely connected), the algorithm is able to approach the solution to the global learning problem.
Only one-hop communication is allowed during the learning process. That is, we do not allow the routing of a data packet over the network. Instead, each agent/node is only allowed to communicate directly with its immediate neighbors.
Figure 1 illustrates the types of topologies examined in this manuscript. It is important to notice that the centralized and fully-connected topologies are theoretically equivalent, but practically different as the centralized topology greatly reduces the amount of information that is communicated throughout the network. The centralized topology, however, is not robust to node failure since the entire solution breaks down if the central node fails. Throughout the remainder of the manuscript, we will refer to the centralized and fully-connected approaches interchangeably since they have identical excess-risk performance.
II-A Non-Cooperative Strategy
where denotes the gradient vector of the loss function, and is a step-size sequence. The gradient vector employed in (6) is an approximation for the actual gradient vector, , of the risk function. The difference between the true gradient vector and its approximation used in (6) is called gradient noise. Due to the presence of the gradient noise, the estimate generated by (6) becomes a random quantity; we use boldface letters to refer to random variables throughout our manuscript, which is already reflected in our notation in (6).
It is shown in [31, 35] that for strongly-convex risk functions, the non-cooperative scheme (6) achieves an asymptotic convergence rate on the order of $O(1/i)$ under some conditions on the gradient noise and the step-size sequence, where the notation $a(i) = O(b(i))$ means that the sequence $a(i)$ decays at a rate that is at most the rate of decay of $b(i)$ for sufficiently large $i$, i.e., there exist a positive constant $c$ and an integer $i_o$ such that $a(i) \le c\, b(i)$ for all $i \ge i_o$. In this way, in order to achieve an excess-risk accuracy on the order of $O(\epsilon)$, the non-cooperative algorithm (6) would require $O(1/\epsilon)$ samples. It is further shown in [31, 36] that no algorithm can improve upon this rate under the same conditions. This implies that if no cooperation is to take place between the nodes, then the best asymptotic rate each learner would hope to achieve is on the order of $O(1/i)$.
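The stand-alone recursion can be sketched as follows for a streaming least-squares problem. This is an illustrative sketch under stated assumptions, not the manuscript's experiment: the quadratic loss $\frac{1}{2}(d - h^T w)^2$, the Gaussian data model, the step-size $\mu/i$ with $\mu = 2$, and the name `sgd_decaying` are all choices made here for illustration.

```python
import numpy as np

def sgd_decaying(w_true, sigma_v=0.1, mu=2.0, n_iter=5000, seed=0):
    """Stand-alone stochastic-gradient learner with step-size mu / i.

    Data model (assumed): d = h^T w_true + v, h ~ N(0, I), v ~ N(0, sigma_v^2).
    Returns the squared error ||w_true - w_i||^2 for i = 0, ..., n_iter.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros_like(w_true, dtype=float)
    sq_err = [float(np.sum((w_true - w) ** 2))]
    for i in range(1, n_iter + 1):
        h = rng.standard_normal(w_true.size)              # regressor
        d = h @ w_true + sigma_v * rng.standard_normal()  # noisy observation
        grad = -h * (d - h @ w)       # instantaneous gradient of 0.5*(d - h^T w)^2
        w = w - (mu / i) * grad       # diminishing step-size update
        sq_err.append(float(np.sum((w_true - w) ** 2)))
    return np.array(sq_err)

errors = sgd_decaying(np.array([1.0, -1.0]))
```

Plotting `errors` against the iteration index on a log-log scale is one way to inspect the decay rate empirically; averaging over several seeds gives a smoother curve.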
II-B Centralized Strategy
Now, in place of the non-cooperative strategy, let us assume that the nodes transmit their samples to a central processor, which executes the following algorithm:
It can be shown that this implementation will have an asymptotic convergence rate on the order of $O(1/(Ni))$ for step-size sequences of the form $\mu(i) = \mu/i$ (see Corollary 2). In other words, the centralized implementation (7) provides an $N$-fold increase in convergence rate relative to the non-cooperative solution (6). One of the questions we wish to answer in this work is whether it is possible to derive a fully distributed algorithm that allows every node in the network to converge (in the mean-square-error sense) at the same rate as the centralized solution, i.e., $O(1/(Ni))$, with only communication between neighboring nodes and for general ad-hoc networks. We show that this task is indeed possible. We will additionally show that the distributed strategy can outperform the naïve centralized implementation (7) when the gradient noise profile across the agents is non-uniform, but that it will match the performance of a weighted version of (7), namely, the following weighted centralized strategy:
where the are convex combination weights that satisfy:
and are meant to discount gradients with higher noise power compared to others. We next describe two popular fully-distributed strategies.
II-C Diffusion Strategy
where denotes the current data sample available at node . Each node begins with an estimate and employs a diminishing positive step-size sequence . The non-negative coefficients , which form a left-stochastic combination matrix , are used to scale information arriving at node from its neighbors. These coefficients are chosen to satisfy:
We emphasize that we are only requiring to be left-stochastic, meaning that each of its columns should add up to one, while its rows need not. The neighborhood for each node is defined as the set of nodes for which . The neighborhood is typically known to agent . The main difference between the above algorithm and the original adapt-then-combine (ATC) diffusion strategy studied in [14, 34, 32] is that we are employing a diminishing step-size sequence as opposed to a constant step-size. Constant step-sizes have the distinct advantage that they allow nodes to continue adapting their estimates in response to drifts in the underlying data distribution . In this work, we are interested in examining the generalization ability of distributed learners asymptotically when the underlying distribution, , remains stationary, in which case the use of decaying step-size sequences is justified. If the statistical distribution of the data were subject to drifts, then constant step-sizes would become a necessity, and this scenario is already studied in some detail in [14, 15, 34, 32].
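A minimal sketch of an adapt-then-combine recursion with a decaying step-size is given below. It assumes a least-squares loss at every agent, a step-size $\mu/i$, and a small three-agent line network with uniform averaging weights; the function name `atc_diffusion`, the data model, and all numerical values are illustrative assumptions rather than the manuscript's setup.

```python
import numpy as np

def atc_diffusion(A, w_true, sigma_v, mu=2.0, n_iter=3000, seed=0):
    """Adapt-then-combine diffusion with step-size mu / i (illustrative sketch).

    A[l, k] is the weight that agent k assigns to neighbor l, so every
    column of A sums to one (left-stochastic). sigma_v holds one noise
    standard deviation per agent. Returns an N x M array whose k-th row
    is the final estimate at agent k.
    """
    rng = np.random.default_rng(seed)
    N, M = A.shape[0], w_true.size
    W = np.zeros((N, M))
    for i in range(1, n_iter + 1):
        H = rng.standard_normal((N, M))                     # one regressor per agent
        d = H @ w_true + sigma_v * rng.standard_normal(N)   # noisy measurements
        resid = d - np.sum(H * W, axis=1)                   # d_k - h_k^T w_k
        Psi = W + (mu / i) * H * resid[:, None]             # adapt: local gradient step
        W = A.T @ Psi                                       # combine: w_k = sum_l a_lk psi_l
    return W

# Line network 1 - 2 - 3 with uniform averaging over each neighborhood.
# Columns sum to one but rows do not, so A is left-stochastic only.
A = np.array([[0.5, 1/3, 0.0],
              [0.5, 1/3, 0.5],
              [0.0, 1/3, 0.5]])
w_true = np.array([1.0, -1.0])
W_final = atc_diffusion(A, w_true, sigma_v=np.array([0.1, 0.1, 0.1]))
```

Each row of `W_final` should be close to `w_true`, reflecting that all agents approach the common minimizer even though the combination matrix is not doubly-stochastic.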
II-D Consensus Strategy
The diffusion and consensus strategies (10)-(13) have exactly the same computational complexity, except that the computations are performed in a different order. We will see in Sec. IV-D that this difference enhances the performance of diffusion over consensus. Moreover, in the constant step-size case, the difference in the order in which the operations are performed causes an anomaly in the behavior of consensus solutions in that they can become unstable even when all individual nodes are able to solve the inference task in a stable manner; see [20, 34, 32]. Furthermore, consensus strategies of the form (13) are usually limited to employing a doubly-stochastic combination matrix . The analysis in the sequel will show that left-stochastic matrices actually lead to improved excess-risk performance (see Eqs. (58)–(60)) while convergence of the distributed implementation continues to be guaranteed (see Theorem 1).
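The difference in the order of operations can be made concrete with the sketch below, which assumes the standard forms of the two updates: diffusion combines the adapted iterates, whereas consensus combines the previous iterates and then adds the local gradient correction. The function names and the array layout (one row per agent) are assumptions for illustration.

```python
import numpy as np

def diffusion_step(W, grads, A, step):
    """Adapt-then-combine: w_k = sum_l a_lk * (w_l - step * grad_l)."""
    return A.T @ (W - step * grads)

def consensus_step(W, grads, A, step):
    """Consensus: w_k = sum_l a_lk * w_l - step * grad_k."""
    return A.T @ W - step * grads
```

Both updates use one gradient evaluation and one neighborhood combination per agent. They differ only in that diffusion also averages the gradient corrections across neighbors, so the two results differ by `step * (grads - A.T @ grads)`.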
III Main Assumptions
Before proceeding with the analysis, we list in this section the main assumptions that are needed to facilitate the exposition. The conditions listed below are common in the broad stochastic optimization literature — see the explanations in [31, 33, 34, 32]. The first condition assumes that the are strongly-convex, with a common minimizer for . This condition ensures that the optimization problem (4) is well-conditioned.
Assumption 1 (Properties of risk functions)
Each risk function is twice continuously-differentiable and its Hessian matrix is uniformly bounded from below and from above, namely,
Furthermore, the risks at the various agents are minimized at the same location:
and the Hessian matrices are locally Lipschitz continuous at , i.e., for all , there exists some such that:
We denote the value of the Hessian matrices at (assumed uniform across the agents) by
where . We let denote the smallest eigenvalue of .
One useful implication that follows from Assumption 1 is the following. Consider the expected excess-risk (5) at node . Using the following sequence of inequalities, we can bound the excess-risk by a square weighted norm:
where , denotes expectation over the distribution of , and the weighting matrix is defined as
Expression (20a) shows that the expected excess-risk at node is equal to a weighted mean-square-error with weight matrix (21). This means that one way to compute or bound the expected excess-risk is by evaluating weighted mean-square-error quantities of the form (20a) or (20b). This is the route that we will follow in this manuscript. We will analyze the right-hand side of (20a) in order to draw conclusions regarding the evolution of the expected excess-risk. In particular, once we establish that the distributed algorithm converges in the mean-square-error sense, then inequality (20b) will immediately allow us to conclude that the algorithm also converges in excess-risk. Similarly, we can obtain the asymptotic expression for the excess-risk by leveraging the weighted-mean-square-error analysis developed for constant step-size distributed strategies [14, 15, 32], adjusted for the decaying step-size case. Observe that these conclusions are different than the useful results in , which focused on studying convergence in distribution. The mean-square-error results will enable us to expose analytically various interesting differences in the performance of distributed strategies, such as diffusion and consensus.
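The link between excess-risk and weighted squared error can be checked numerically in the simplest setting. The sketch below assumes a purely quadratic risk with constant Hessian `H` (an assumption made only for illustration; in general the weighting matrix depends on the Hessian along the segment between the estimate and the minimizer). In that case the excess-risk equals half the `H`-weighted squared error and is sandwiched between multiples of the plain squared error.

```python
import numpy as np

def quadratic_risk(w, w_opt, H):
    """Quadratic risk J(w) = 0.5 (w - w_opt)^T H (w - w_opt), minimized at w_opt."""
    e = w - w_opt
    return 0.5 * e @ H @ e

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive-definite Hessian
w_opt = np.array([1.0, -1.0])
w = np.array([0.2, 0.4])

err = w_opt - w
excess = quadratic_risk(w, w_opt, H) - quadratic_risk(w_opt, w_opt, H)
lam = np.linalg.eigvalsh(H)              # eigenvalues in ascending order
lower = 0.5 * lam[0] * (err @ err)       # (lambda_min / 2) * ||err||^2
upper = 0.5 * lam[-1] * (err @ err)      # (lambda_max / 2) * ||err||^2
```

The upper bound is what allows mean-square-error convergence to imply convergence in excess-risk.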
Our second condition is on the gradient noise process, which is defined, for a generic vector , as
We collect the noises from across the network into a column vector
where we are introducing the vector notation
for the collection of parameters across the agents. We denote the covariance matrix of the gradient noise vector by
where the conditioning is in relation to the past history of the estimators, . The following conditions are relaxations of assumptions that are regularly considered in the stochastic approximation literature; they are generally satisfied in important scenarios, such as logistic regression or quadratic loss functions of the form (1)–(2) — see .
Assumption 2 (Gradient noise model)
We assume the gradient noise process satisfies:
for some , , as well as:
for some , , and where
and (30) is assumed to hold for , for some small .
Observe that Assumption 2 implies that:
for some and . In addition, the local Lipschitz condition (30) of order [38, p. 53] (sometimes referred to as the Hölder condition of order [39, p. 110]) implies, under Assumption 1, that the following global condition also holds [32, 15]:
Since nodes sample the data in an independent fashion, it is reasonable to expect the gradient noise to be uncorrelated across all nodes, as required by (29).
Our third condition relates to the structure of the network topology. We will assume that the network is strongly-connected, which means that (a) there exists at least one nontrivial self-loop, for some , and (b) for any two agents and , there exists a path with nonzero weights from to , either directly if they are neighbors or through other agents. It is well-known that the combination matrix for such networks is primitive [40, p. 516]. That is, all entries of are non-negative and there exists some positive integer such that all entries of are strictly positive. One important property of primitive matrices follows from the Perron-Frobenius theorem [40, p. 534]; will have a single eigenvalue at one, while all other eigenvalues of will lie strictly inside the unit circle. Moreover, if we let denote the right-eigenvector associated with the eigenvalue at one, and normalize its entries to add up to one, i.e.,
then all entries of will be strictly positive. We shall refer to as the Perron eigenvector of . We formalize this assumption in the following:
Assumption 3 (Network Topology)
The network is strongly-connected so that the combination matrix is primitive with and , where denotes the Perron eigenvector of .
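As an illustration of this assumption, the sketch below computes the Perron eigenvector of a small left-stochastic matrix and checks primitivity by raising the matrix to a power. The three-agent line network with uniform averaging weights and the helper name `perron_vector` are assumptions chosen for illustration.

```python
import numpy as np

def perron_vector(A):
    """Right eigenvector of A for the eigenvalue at one, scaled to sum to one."""
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmin(np.abs(eigvals - 1.0))   # locate the eigenvalue at one
    p = np.real(eigvecs[:, k])
    return p / p.sum()                     # normalization also fixes the sign

# Line network 1 - 2 - 3, each agent averaging uniformly over its neighborhood
A = np.array([[0.5, 1/3, 0.0],
              [0.5, 1/3, 0.5],
              [0.0, 1/3, 0.5]])

p = perron_vector(A)                               # [2/7, 3/7, 2/7]
is_primitive = bool(np.all(np.linalg.matrix_power(A, 2) > 0))
other_moduli = np.sort(np.abs(np.linalg.eigvals(A)))[:-1]
```

For this example the Perron entries are proportional to the neighborhood sizes, and the remaining eigenvalues (0.5 and -1/6) lie strictly inside the unit circle.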
IV Main Results
In this section, we list the main results and defer the detailed proofs to the appendices.
IV-A Convergence Properties
Our first result provides conditions on the step-size sequence under which the diffusion strategy converges both in the mean-square-error (MSE) sense and also almost surely. The difference between the two sets of conditions that appear below is that in one case the step-size sequence is additionally required to be square-summable.
Theorem 1 (Convergence rates)
If the step-size sequence satisfies the additional square-summability condition:
then converges to almost surely (i.e., with probability one) for all . Furthermore, when the step-size sequence is of the form with satisfying , then the second and fourth-order moments of the error vector converge at the rates of and , respectively:
where was introduced in (33).
See Appendix A. Observe that (41) implies that each node converges in the mean-square-error sense at the rate . Combining this result with (20b), we conclude that each node also converges in excess-risk at this rate:
Note that this conclusion does not yet reveal the benefit of cooperation (for example, it does not show how the convergence rate depends on ). In the next section, we will derive closed-form asymptotic expressions for the mean-square-error and excess-risk, and from these expressions we will be able to highlight the benefit of network cooperation.
IV-B Evolution of Excess-Risk Measure
We continue to assume that the step-size sequence is selected as for some . This sequence satisfies conditions (38) and (40). Observe that in order to evaluate the excess-risk at node , we must evaluate (20a). To do so, we first form the following network-wide error quantity:
and let denote the matrix with a single entry equal to one at the th location and all other entries equal to zero. Then, using (20a), we can write:
In order to facilitate the analysis, we introduce the eigenvalue decomposition of matrix :
where is an orthogonal matrix and is diagonal with positive entries . Moreover, since the matrix is left-stochastic and primitive (by Assumption 3), we can express its Jordan decomposition in the form:
where represent the remaining left and right eigenvectors while represents the Jordan structure associated with the eigenvalues inside the unit disc.
Theorem 2 (Asymptotic Convergence of )
Then, when , it holds asymptotically that
where the notation implies that . Moreover, is the -th eigenvalue of and the matrix is defined as:
where the notation denotes the -th diagonal element of the matrix .
See Appendix B. Theorem 2 establishes a closed-form expression for the asymptotic excess-risk of the diffusion algorithm. We observe that the slowest rate at which the asymptotic term converges depends on the smallest eigenvalue of , which is , and the constant . Interestingly, the only dependence on the topology of the network asymptotically is encoded in the Perron vector of the combination matrix —i.e., most of the eigen-structure of the topology matrices becomes irrelevant asymptotically and only influences the convergence rate in the transient regime. We will see further ahead that the Perron vector can be optimized to minimize the excess-risk in the asymptotic regime. It is natural that the transient stage should depend on the network geometry because the networked agents are propagating their information over the entire network. The speed of information propagation over a sparsely connected network is determined by the second largest eigenvalue of the combination matrix , which is influenced by the degree of network connectivity. Our results show, however, that there is an asymptotic regime where the performance of the diffusion strategy can be made invariant to the specific network topology since the Perron vector can be designed for general connected networks, as we will see further ahead in (60). Finally, we observe that all agents participating in the network will achieve the same asymptotic performance given by the right-hand-side of (49) as this asymptotic expression for the excess-risk does not depend on any particular node index but only on the network-wide quantity .
When is not known, and thus it is not clear how to choose to satisfy , it is common to choose a large that forces . In this case, we obtain from (50) that
This approximation is close in form to the original steady-state performance expression derived for the diffusion algorithm when a constant step-size is used . The main difference is that the “steady-state” term will now diminish at the rate when and .
By specializing the previous results to the case (a stand-alone node), we obtain as a corollary the following result for the expected excess-risk that is delivered by the non-cooperative stochastic gradient algorithm (6).
Corollary 1 (Stochastic gradient approximation)
Observe that (53)–(54) are stronger than those in Theorem 1 since we are not only stating that the convergence rate is but we are also giving the exact constant that multiplies the factor . In the next section, we examine the relationship between the derived constant and the network size and noise parameters across the network. Following this presentation, we will utilize our mean-square-error expressions to examine the differences between the diffusion strategy (10)–(11) and the consensus strategy (13).
IV-C Benefit of Cooperation
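Table I: Fully-distributed combination rules (including the Metropolis rule and the Hastings rule) and their corresponding Perron vectors.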
Up to this point in the discussion, the benefit of cooperation has not yet manifested itself explicitly; this benefit is actually already included in the vector . Optimization over will help bring forth these advantages. Thus, observe that the expression for the asymptotic term in (49) is quadratic in . We can optimize the asymptotic expression over in order to reduce the excess-risk. We re-write the asymptotic excess-risk (49) as:
Then, we consider the problem of optimizing (56) over :
where denotes the set of left-stochastic and primitive combination matrices that satisfy the network topology structure. It is generally not clear how to solve this optimization problem over both and . We pursue an indirect route. We first remove the optimization over and determine an optimal . Subsequently, given the optimal , we show that a left-stochastic and primitive matrix in can be constructed. The relaxed problem is:
whose solution is
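To make the structure of this solution concrete, the sketch below assumes that the part of the asymptotic term that depends on the Perron vector is proportional to $\sum_k p_k^2 \sigma_k^2$, where $\sigma_k^2$ is a hypothetical scalar measure of the gradient noise power at agent $k$. Under that assumption, minimizing over the probability simplex by a Lagrange-multiplier argument gives weights inversely proportional to the noise powers. The function names and numerical values are illustrative.

```python
import numpy as np

def optimal_perron(sigma2):
    """Minimizer of sum_k p_k^2 * sigma2_k subject to sum_k p_k = 1, p_k > 0."""
    inv = 1.0 / np.asarray(sigma2, dtype=float)
    return inv / inv.sum()                 # p_k proportional to 1 / sigma2_k

def weighted_noise(p, sigma2):
    """Value of the assumed objective sum_k p_k^2 * sigma2_k."""
    return float(np.sum(np.asarray(p) ** 2 * np.asarray(sigma2)))

sigma2 = np.array([1.0, 4.0, 0.25])        # non-uniform noise profile
p_opt = optimal_perron(sigma2)
p_uniform = np.full(sigma2.size, 1.0 / sigma2.size)

best = weighted_noise(p_opt, sigma2)       # equals 1 / sum_k (1 / sigma2_k)
naive = weighted_noise(p_uniform, sigma2)  # equals mean(sigma2) / N
```

The optimized value is never larger than the uniform-weight value, with equality only when all noise powers coincide.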
It is possible to see that for agent to implement the Hastings rule, it needs to know its neighborhood (which is known to agent ), as well as the number of neighbors that each of its neighbors has (this information is easily obtained from the immediate neighbors), and the Perron vector that the network wishes to obtain. Therefore, the design of the weighting matrix can be done in a fully distributed manner. Table I lists three combination rules that can be implemented in a fully-distributed manner, together with their corresponding Perron vectors.
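One way such a construction can be carried out is sketched below using a Metropolis-Hastings-type rule. The specific formula, $a_{\ell k} = p_k^{-1} / \max\{n_k p_k^{-1},\, n_\ell p_\ell^{-1}\}$ for neighbors $\ell \neq k$ (with $n_k$ the neighborhood size including agent $k$ itself) and $a_{kk}$ chosen so that each column sums to one, is an assumption here and may not match the entries of Table I exactly. The name `hastings_matrix` and the example graph are hypothetical.

```python
import numpy as np

def hastings_matrix(adj, p):
    """Left-stochastic matrix on the graph `adj` whose Perron vector is p.

    adj : symmetric N x N boolean adjacency matrix (diagonal ignored)
    p   : target Perron vector, positive entries summing to one
    """
    adj = np.asarray(adj, dtype=bool) & ~np.eye(len(p), dtype=bool)
    p = np.asarray(p, dtype=float)
    N = p.size
    n = adj.sum(axis=1) + 1                  # neighborhood sizes, self included
    A = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            if adj[l, k]:
                # uses only p_k, p_l, n_k, n_l: information local to the edge
                A[l, k] = (1.0 / p[k]) / max(n[k] / p[k], n[l] / p[l])
        A[k, k] = 1.0 - A[:, k].sum()        # self-weight completes the column
    return A

adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=bool)      # line network 1 - 2 - 3
p_target = np.array([0.5, 0.3, 0.2])
A = hastings_matrix(adj, p_target)
```

The construction satisfies the balance condition `A[l, k] * p[k] == A[k, l] * p[l]`, which is what makes `p_target` the eigenvector of `A` at eigenvalue one while respecting the sparsity of the graph.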
To see the effectiveness of this choice for , we consider the case where , so that
Corollary 2 (Un-weighted Centralized processing)
The centralized algorithm (7) is a special case of the diffusion algorithm when , where , which yields a network that satisfies Assumption 3. To see this, consider the diffusion algorithm (10)–(11) with :
First, observe that after the first iteration, the estimates across the network are now uniform since (67) does not depend on . We can therefore drop the subscript from :
where step is due to and step is obtained by substituting . Observe that (70) is identical to (7), the un-weighted centralized algorithm. Then, using the analysis of the diffusion algorithm in Theorem 2, we have that
Since the right-hand-side of (71) does not depend on the agent index (all agents will achieve the same asymptotic performance), we have that the average excess-risk remains the same:
where step is due to (61), which is the desired result.
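The reduction used in this argument can be verified numerically. The sketch below assumes the standard adapt-then-combine form of the diffusion update and an un-weighted centralized update that averages the agents' gradients; the function names and the random test data are illustrative assumptions.

```python
import numpy as np

def diffusion_step(W, grads, A, step):
    """Adapt-then-combine: w_k = sum_l a_lk * (w_l - step * grad_l)."""
    return A.T @ (W - step * grads)

def centralized_step(w, grads, step):
    """Un-weighted centralized update using the average of all gradients."""
    return w - step * grads.mean(axis=0)

N, M = 4, 3
rng = np.random.default_rng(1)
A = np.full((N, N), 1.0 / N)               # fully-connected, uniform weights
w = rng.standard_normal(M)
W = np.tile(w, (N, 1))                     # all agents hold the same iterate
grads = rng.standard_normal((N, M))        # one gradient sample per agent

W_next = diffusion_step(W, grads, A, step=0.1)
w_next = centralized_step(w, grads, step=0.1)

# Even from non-uniform iterates, one combination step equalizes the agents
W_mixed = diffusion_step(rng.standard_normal((N, M)), grads, A, step=0.1)
```

Every row of `W_next` coincides with `w_next`, and every row of `W_mixed` is identical, which mirrors the observation that the estimates become uniform after the first iteration.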
Comparing (63) to (54) in the special case when for all , we find that the diffusion algorithm offers an -fold improvement in the excess-risk over the non-cooperative solution. Also, comparing (63) to (65) in this case, we observe that asymptotically the diffusion algorithm achieves the same performance as the centralized algorithm (7). More generally, let us consider the case in which the noise profile is not uniform across the agents. We call upon the following inequality:
which follows from the fact that the harmonic mean of a set of numbers is upper-bounded by their arithmetic mean. Then, we conclude from (73) that the excess-risk of the diffusion strategy is upper-bounded by that of the centralized strategy (7), and equality holds only when the network experiences a spatially uniform gradient noise profile. This implies that the diffusion strategy actually outperforms the implementation studied in , which uses a doubly-stochastic combination matrix. Furthermore, in this case of non-uniform noise profile and since