# Swarming for Faster Convergence in Stochastic Optimization

Accepted in SIAM Journal on Control and Optimization. The authors gratefully acknowledge partial support from AFOSR [FA9550-15-1-0504] and NSF [CMMI-1829552].

###### Abstract

We study a distributed framework for stochastic optimization, inspired by models of collective motion found in nature (e.g., swarming), that has mild communication requirements. Specifically, we analyze a scheme in which each of $N$ independent threads implements, in a distributed and unsynchronized fashion, a stochastic gradient-descent algorithm perturbed by a swarming potential. Assuming the overhead caused by synchronization is not negligible, we show that the swarming-based approach outperforms a centralized algorithm (based upon the average of $N$ observations) in terms of (real-time) convergence speed. We also derive an error bound that is monotonically decreasing in network size and connectivity. Finally, we characterize the scheme’s finite-time performance for both convex and non-convex objective functions.

Key words. stochastic optimization, swarming-based framework, distributed optimization

AMS subject classifications. 90C15, 90C25, 68Q25

## 1 Introduction

Consider the optimization problem

$$\min_{x\in\mathbb{R}^m} f(x), \tag{1}$$

where $f:\mathbb{R}^m\to\mathbb{R}$ is a differentiable function. In many applications, the gradient information is not available in closed form and only noisy estimates can be computed by time-consuming simulations (see [2, 22] for a survey of gradient estimation techniques). Noise can come from many sources such as incomplete convergence, modeling and discretization error, and finite sample size for Monte-Carlo methods (see for instance [11]).

In this paper we consider the case in which $N$ independent and symmetric stochastic oracles ($\mathcal{SO}$s) can be queried in order to obtain noisy gradient samples of the form $g(x,\xi)$ that satisfy the following condition:

###### Assumption 1.1.

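Each random vector $\xi$ is independent. $\mathbb{E}_{\xi}[g(x,\xi)]=\nabla f(x)$, $\mathbb{E}_{\xi}[\|g(x,\xi)-\nabla f(x)\|^{2}]\le\sigma^{2}$ for some $\sigma>0$ for all $x\in\mathbb{R}^m$.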

We assume the time needed to generate a gradient sample is random and not negligible (e.g., samples are obtained by executing time-consuming simulations). In this setting, a centralized implementation of the stochastic gradient-descent algorithm, in which a batch of $N$ samples is gathered at every iteration, incurs an overhead that is increasing in $N$. If sampling is undertaken in parallel by simultaneously querying $N$ $\mathcal{SO}$s, the overhead burden is reduced but is still non-negligible due to the need for synchronization. In this paper we study a distributed optimization scheme that does not require synchronization. In the scheme, $N$ independent computing threads each implement a stochastic gradient-descent algorithm in which a noisy gradient sample is perturbed by an attractive term, a function of the relative distance between solutions identified by neighboring threads (where the notion of neighborhood is related to a given network topology). This coupling is similar to those found in mathematical models of swarming (see [4]). Complex forms of collective motion such as swarms can be found in nature in many organisms ranging from simple bacteria to mammals (see [17, 16, 20] for references). Such forms of collective behavior rely on limited communication amongst individuals and are believed to be effective for avoiding predators and/or for increasing the chances of finding food (foraging) (see [7, 19]).

We show that the proposed scheme has an important noise reduction property: noise realizations that induce individual trajectories differing too much from the group average are likely to be discarded because of the attractive term, which aims to maintain cohesion. In contrast to the centralized sample average approach, the noise reduction obtained in a swarming-based framework with $N$ threads does not require synchronization, since each thread only needs the information on the current solutions identified by neighboring threads. When sampling times are not negligible and exhibit large variation, synchronization may result in significant overhead, so that the real-time performance of stochastic gradient-descent algorithms based upon the average of $N$ samples obtained in parallel is strongly affected by large sampling time variability. In contrast, a swarming-based implementation with $N$ threads performs better in real time, as each thread can asynchronously update its solution based upon a small sample size and still reap the benefits of noise reduction stemming from the swarming discipline.

The main contribution of this paper is the formalization of the benefits of the swarming-based approach for stochastic optimization. Specifically, we show that the approach outperforms a centralized algorithm (based upon the average of $N$ observations) in terms of (real-time) convergence speed. We derive an error bound that is monotonically decreasing in network size and connectivity. Finally, we characterize the scheme’s finite-time performance for both convex and non-convex objective functions.

The structure of this paper is as follows. We introduce two candidate algorithms for solving problem (1) in Section 2 along with the related literature. In Section 3 we perform preliminary analysis on the swarming-based approach. In Section 4 we present our main results. A numerical example is provided in Section 5. We conclude the paper in Section 6.

## 2 Preliminaries

We now present two algorithms for solving problem (1). First, we introduce a centralized stochastic gradient-descent algorithm. Then we propose the corresponding swarming-based approach.

### A Centralized Algorithm

A centralized gradient-descent algorithm is of the form (see for instance [12]):

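$$x_{k+1} = x_k - \gamma_k u_k. \tag{2}$$

Suppose in a single step, $N$ $\mathcal{SO}$s each generate a noisy gradient sample $g(x_k,\xi_{i,k})$ in parallel, and $u_k=\frac{1}{N}\sum_{i=1}^{N} g(x_k,\xi_{i,k})$. Then (2) can be rewritten as

$$x_{k+1} = x_k - \gamma_k\left(\nabla f(x_k)+\epsilon_k\right), \tag{3}$$

where $\epsilon_k=\frac{1}{N}\sum_{i=1}^{N} g(x_k,\xi_{i,k})-\nabla f(x_k)$. The step size satisfies $\gamma_k>0$. In this paper we consider constant step size policies, i.e., $\gamma_k=\gamma$ for some $\gamma>0$.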
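As a minimal sketch of the centralized scheme, the following Python snippet averages one noisy gradient sample per oracle before each update. The quadratic objective $\tfrac12\|x\|^2$, the additive Gaussian noise model, and all function and parameter names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def noisy_gradient(x, rng, sigma=1.0):
    """Hypothetical oracle: gradient of 0.5*||x||^2 plus Gaussian noise."""
    return x + sigma * rng.standard_normal(x.shape)

def centralized_sgd(x0, n_oracles, step, n_steps, rng, sigma=1.0):
    """Constant-step SGD; every step waits for one sample from each oracle."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        samples = [noisy_gradient(x, rng, sigma) for _ in range(n_oracles)]
        x = x - step * np.mean(samples, axis=0)  # averaged gradient estimate
    return x
```

With `sigma=0.0` the recursion reduces to deterministic gradient descent on the assumed quadratic, which contracts the iterate by a factor `1 - step` per step.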

### A Swarming-Based Approach

A swarming-based asynchronous implementation has $N$ computing threads. In contrast to the centralized approach, each thread queries one $\mathcal{SO}$ and independently implements a gradient-descent algorithm based on only one sample per step:

(4) |

where , and denotes the solution of thread at the time of thread ’s -th update.

The last term on the right-hand side of equation (4) represents the function of mutual attraction between individual threads, in which $a>0$ measures the degree of attraction (see [3] for a reference). Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote the graph of all threads, where $\mathcal{V}$ stands for the set of vertices (threads), and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of edges connecting vertices. Let $A=[\alpha_{ij}]$ be the adjacency matrix of $\mathcal{G}$, where $\alpha_{ij}\in\{0,1\}$. Here $\alpha_{ij}=1$ indicates that thread $i$ is informed of the solution identified by thread $j$, or $(i,j)\in\mathcal{E}$. In this paper we assume the following condition regarding the network structure amongst threads:

###### Assumption 2.1.

The graph corresponding to the network of threads is undirected (i.e., its adjacency matrix is symmetric) and connected, i.e., there is a path between every pair of vertices.
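The swarming update described above can be sketched as follows. Since the update equation is not reproduced here, the snippet assumes a common form in which thread $i$ subtracts the step size times the sum of its noisy gradient sample and an attraction term, the latter being the attraction strength times the adjacency-weighted sum of differences to neighboring solutions. The function name `swarm_update` and its parameters are hypothetical.

```python
import numpy as np

def swarm_update(X, i, adjacency, step, attraction, grad_sample):
    """Return a copy of X in which only thread i has taken one step.

    X           : (N, m) array, row j is the current solution of thread j
    adjacency   : (N, N) symmetric 0/1 matrix with zero diagonal
    grad_sample : noisy gradient evaluated at X[i]
    """
    # Attractive term: adjacency-weighted sum of (x_i - x_j) over neighbours j.
    pull = adjacency[i] @ (X[i] - X)
    X_new = X.copy()
    X_new[i] = X[i] - step * (grad_sample + attraction * pull)
    return X_new
```

With a zero gradient sample, the update moves thread `i` toward its neighbors and leaves every other row untouched, which is the cohesion effect the attraction term is meant to provide.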

### 2.1 Sampling Times and Time-Scales

In our comparison of convergence speeds in real-time we will need to describe the time-scales in which the algorithms described above operate. These time scales depend upon the following standing assumption regarding sampling times:

###### Assumption 2.2.

The times needed for generating gradient samples by oracles are independent and exponentially distributed with mean .

It follows that for the centralized implementation, the time in between updates is . To describe the time-scale for the swarming-based scheme, let . In this time-scale, the time in between any two updates (by possibly different threads) is exponentially distributed with mean . Suppose there is a (virtual) global clock that ticks whenever a thread updates its solution, and let be the indicator random variable for the event that thread provides the ()-th update. We rewrite the asynchronous algorithm as follows:

(5) |

where

### 2.2 Swarming for Faster Convergence: A Preview of the Main Results

To illustrate the benefits of the swarming-based approach, we will compare the performance of the centralized approach (3) and the swarming-based approach (4) in Section 4. In what follows we provide a brief introduction to the main results.

Denote . Let (respectively, ) measure the quality of solutions under the centralized scheme (respectively, swarming-based approach).

Assuming is -strongly convex with Lipschitz continuous gradients, for sufficiently small , we will show that and both have an upper bound in the order of .

Regarding real-time performance, we will show that and converge at a similar rate. However, the time needed to complete iterations in the swarming-based method is approximately

while the time needed for the centralized algorithm to complete steps is approximately . The speed-up achieved through the swarming-based framework is due to the fact that . Since the relation holds true in general, this property is likely to be preserved under other distributions of sampling times as well. As we shall see in Section 4, the ratio of convergence speeds between the swarming-based approach and the centralized algorithm is approximately . In other words, the convergence speed is inversely proportional to the rate at which gradient samples are generated.
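The timing comparison rests on a standard fact about exponential random variables: the minimum of $n$ independent exponentials with a common mean has mean equal to that common mean divided by $n$, whereas the maximum has mean equal to the common mean times the harmonic number $1+\tfrac12+\dots+\tfrac1n$. The sketch below checks this numerically. It assumes that a synchronized step waits for the slowest of the oracles, while an asynchronous tick occurs as soon as any thread finishes. The names `mean_time`, `expected_min_and_max` and `simulate_min_and_max` are hypothetical.

```python
import numpy as np

def expected_min_and_max(n_oracles, mean_time=1.0):
    """Exact means of the min and max of n i.i.d. exponential sampling times."""
    harmonic = sum(1.0 / i for i in range(1, n_oracles + 1))
    return mean_time / n_oracles, mean_time * harmonic

def simulate_min_and_max(n_oracles, mean_time=1.0, n_trials=200_000, seed=0):
    """Monte Carlo estimates of the same two means."""
    rng = np.random.default_rng(seed)
    times = rng.exponential(mean_time, size=(n_trials, n_oracles))
    return times.min(axis=1).mean(), times.max(axis=1).mean()
```

The gap between the two means widens as the number of oracles grows, since the harmonic number grows roughly like the natural logarithm of the number of oracles.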

### 2.3 Literature Review

Our work is linked with the extensive literature in stochastic approximation (SA) methods dating back to [21] and [10]. These works include the analysis of convergence (conditions and rates for convergence, proper step size choices) in the context of diverse noise models (see [12]). Recently there has been considerable interest in parallel or distributed implementations of stochastic gradient-descent methods (see [1, 24, 13, 23, 25] for examples). They mainly focus on minimizing a finite sum of convex functions: . Notice that we may write , so that problem (1) can be seen as a special case of the finite-sum formulation, and the swarming-based scheme (4) resembles some of the algorithms in the literature (see [23] for example). However, to the best of our knowledge, this literature does not address the noise reduction properties stemming from multi-agent coordination. Moreover, it does not consider random sampling times or the real-time performance of the algorithms.

Our work is also related to population-based algorithms for simulation-based optimization. In these approaches, at every iteration, the quality of each solution in the population is assessed, and a new population of solutions is randomly generated based upon a given rule that is devised for achieving an acceptable trade-off between “exploration” and “exploitation”. Recent efforts have focused on model-based approaches (see [9]) which differ from population-based methods in that candidate solutions are generated at each round by sampling from a “model”, i.e., a probability distribution over the space of solutions. The basic idea is to adjust the model based on the sampled solutions in order to bias the future search towards regions containing solutions of higher qualities (see [8] for a recent survey). These methods are inherently centralized in that the updating of populations (or models) is performed after the quality of all candidate solutions is assessed.

In [18] we also considered a swarming-type stochastic optimization method. In that paper we used stochastic differential equations to approximate the real-time optimization process. This approach relies on the assumption that step sizes are arbitrarily close to zero. Moreover, finite-time performance was obtained only for strongly convex functions.

## 3 Preliminary Analysis

In this section we study the stochastic processes associated with each one of the threads in the swarming-based approach. The average solution will play an important role in characterizing the performance. This part of the analysis demonstrates the cohesiveness among solutions identified by different threads. To this end, we will analyze the process defined as

Let and . Then .

###### Lemma 3.1.

Under Algorithm (4), suppose Assumption 2.2 holds. Then

(6) |

where

(7) |

with

(8) |

###### Proof.

See Section LABEL:subsec_proof_lem1.

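Let $\mathbf{L}=[l_{ij}]$ be the Laplacian matrix associated with the adjacency matrix $A=[\alpha_{ij}]$, where $l_{ii}=\sum_{j\neq i}\alpha_{ij}$ and $l_{ij}=-\alpha_{ij}$ when $i\neq j$. For an undirected graph, the Laplacian matrix is symmetric positive semi-definite [6]. Let .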

(9) |

where $\lambda_2$ is the second-smallest eigenvalue of the Laplacian matrix, also called the algebraic connectivity of the graph (see [6]).
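For concreteness, the algebraic connectivity of a given network can be computed directly from its adjacency matrix. The following is a small sketch using NumPy; the function name `algebraic_connectivity` is hypothetical, and the input is assumed to be a symmetric 0/1 adjacency matrix with zero diagonal.

```python
import numpy as np

def algebraic_connectivity(adjacency):
    """Second-smallest eigenvalue of the Laplacian of an undirected graph."""
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A      # degree matrix minus adjacency
    eigenvalues = np.linalg.eigvalsh(laplacian)  # ascending order for symmetric input
    return eigenvalues[1]
```

For a complete graph the value equals the number of vertices, for a path on three vertices it equals one, and for a disconnected graph it is zero, so larger values correspond to better connected networks.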

###### Lemma 3.2.

Suppose Assumptions 1.1, 2.1 and 2.2 hold. Under Algorithm (4), let , and let denote the conditional expectation of given . Then,

(10) |

###### Proof.

See the Appendix.

###### Remark 3.3.

Lemma 3.2 sheds light on the cohesiveness of the solutions obtained by different threads. When satisfies proper regularity conditions, is expected to decrease once exceeding a certain value. As a result, is bounded in expectation so that are not too different from each other. As we shall see in the next section, Lemma 3.2 helps us characterize the superior performance of the swarming-based approach.

## 4 Main Results

In this section we formalize the superior properties of the swarming-based framework. The following additional assumptions will be used.

###### Assumption 4.1.

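(Lipschitz) $\|\nabla f(x)-\nabla f(x')\|\le L\|x-x'\|$ for some $L>0$ and for all $x,x'$.

###### Assumption 4.2.

(Strong convexity) $(\nabla f(x)-\nabla f(x'))^{T}(x-x')\ge\kappa\|x-x'\|^{2}$ for some $\kappa>0$ and for all $x,x'$.

###### Assumption 4.3.

(Convexity) $(\nabla f(x)-\nabla f(x'))^{T}(x-x')\ge 0$ for all $x,x'$.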

Let us introduce a measure , of the distance between the average solution identified by all threads at step and the unique optimal solution . We present some additional lemmas below.

###### Lemma 4.4.

Suppose Assumptions 1.1 and 2.2 hold. Under Algorithm (4),

(11) |

###### Proof.

See the Appendix.

###### Lemma 4.5.

Let denote the degree of vertex in graph and let .

(12) |

###### Proof.

See the Appendix.

###### Lemma 4.6.

Suppose Assumptions 1.1, 2.1, 2.2 and 4.1 hold. Under Algorithm (4), let be arbitrary. Then,

(13) |

###### Proof.

See the Appendix.

### 4.1 Strongly-Convex Objective Function

In this section we characterize and compare the performance of the swarming-based approach (4) and the centralized method (3) when the objective function is strongly convex. Specifically, we discuss the ultimate error bounds and convergence speeds achieved under the two approaches. We will show that the two approaches are comparable in their ultimate error bounds, while the swarming-based method enjoys faster convergence.

The following theorem characterizes the performance of the swarming-based approach when the objective function is strongly convex with Lipschitz continuous gradients.

###### Theorem 4.7.

Suppose Assumptions 1.1, 2.1, 2.2, 4.1 and 4.2 hold. Under Algorithm (4) with step size satisfying

(14) |

we have

(15) |

Here is the solution to

(16) |

and

(17) |

###### Proof.

See the Appendix.

###### Remark 4.8.

There are two terms on the right-hand side of inequality (15). The first term is a positive constant, while the second term converges to zero exponentially fast. determines the convergence speed. In the long run (),

The following corollary characterizes the dependency relationship between and several parameters including the step size , algebraic connectivity , noise variance , and the convexity factor .

###### Corollary 4.9.

Suppose all the conditions in Theorem 4.7 hold, and

(18) |

Then,

If in addition , i.e., is bounded above and below by asymptotically (e.g., when is a complete graph),

###### Proof.

By (18),

According to definition (16),

Hence , and

When , follows immediately.

###### Remark 4.10.

From Corollary 4.9, is decreasing in the algebraic connectivity of the swarming network. In particular, with strong connectivity (), decreases in the number of computing threads. This demonstrates the noise reduction property of the swarming-based scheme.

#### 4.1.1 Comparison with Centralized Implementation

In this section, we compare the performance of the swarming-based approach (4) and the centralized method (3) in terms of their ultimate error bounds and convergence speeds.

First we derive the convergence results for the centralized algorithm (3):

Suppose Assumptions 4.1 and 4.2 hold and . Define to measure the quality of solutions under this algorithm. Then,

Taking expectation on both sides,

We have

(19) |

The long-run error bound of scheme (3) is

By Corollary 4.9, when and is sufficiently small, and are of the same order.

We now compare the real-time convergence speeds of the two approaches. In the swarming-based implementation, by (15),

Under a centralized algorithm, from (19),

For large ,

We can see that converges at a similar rate to . However, the time needed to complete iterations in the swarming-based method is approximately

while the time needed for the centralized algorithm to complete steps is approximately , where (see [14])

Clearly the swarming-based scheme converges faster than the centralized one in real time. The ratio of convergence speeds is approximately , or .

###### Remark 4.11.

Another potential benchmark for assessing relative performance is the average solution of threads implementing independently the stochastic gradient-descent method (without an attraction term). However, this approach is deficient as the expected value of distance between the average solution and the optimal solution is bounded away from zero whenever the function is not symmetric with respect to the optimal solution.

In what follows we show that the swarming-based framework can also be applied to the optimization of general convex and non-convex objective functions.

### 4.2 General Convex Optimization

The following theorem characterizes the performance of the swarming-based approach for general convex objective functions. For the use of averaged past iterates, see [15].

###### Theorem 4.12.

Suppose Assumptions 1.1, 2.1, 2.2, 4.1 and 4.3 hold. Define

(20) |

Let step size be such that and

(21) |

Let and

(22) |

Under Algorithm (LABEL:opt_eq:_x_i,k_basic),

(23) |

###### Proof.

See the Appendix.

###### Remark 4.13.

To compute efficiently, notice that

where . At each step , each thread need only perform a local update to keep track of its own running average.
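The kind of local bookkeeping this remark refers to can be sketched as an incremental weighted average, which needs only the previous average and the accumulated weight. The class below is a generic illustration. The weighting actually used in the theorem is not reproduced here, so the `weight` argument and the class name `RunningAverage` are assumptions.

```python
import numpy as np

class RunningAverage:
    """Weighted running average updated in O(1) memory per thread."""

    def __init__(self, dim):
        self.total_weight = 0.0
        self.value = np.zeros(dim)

    def update(self, x, weight=1.0):
        # New average = old average shifted toward x by weight / total weight.
        self.total_weight += weight
        shift = np.asarray(x, dtype=float) - self.value
        self.value += (weight / self.total_weight) * shift
        return self.value
```

Each thread would hold one such object and call `update` after every local step, so no history of iterates has to be stored or exchanged.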

It is clear that the upper bound is decreasing in and . We now consider a possible strategy for choosing step size .

###### Corollary 4.14.

Suppose Assumptions 1.1, 2.1, 2.2, 4.1 and 4.3 hold. Let step size and satisfy the conditions in Theorem 4.12 and

(24) |

for some . Let , , and be set to (LABEL:omega_k)-(LABEL:tilde_x_k). Under Algorithm (4),

###### Proof.

See the Appendix.

###### Remark 4.15.

. If in addition , then . Similar to the strongly-convex case, the error bound obtained here when is comparable with that achieved by a centralized method with iterations (see [15, 5] for reference). Nevertheless, the time needed in a swarming-based approach is considerably less than that in a centralized one (the time ratio is the same as in the strongly-convex case).

### 4.3 Non-Convex Optimization

We further show that the swarming-based approach can be used for optimizing non-convex objective functions. The randomization technique was introduced in [5].

###### Theorem 4.16.

Suppose Assumptions 1.1, 2.1, 2.2, and 4.1 hold. Let be the same as in (LABEL:mu_k). Define

(25) |

and define a new random variable as the following (see [5]):

Then,

(26) |

where is given by (LABEL:omega_k).

###### Proof.

See the Appendix.
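The randomization technique can be illustrated by drawing the index of the reported iterate from a probability mass function over the iterations. The sketch below assumes that the probabilities are proportional to a given nonnegative weight sequence. The actual distribution used in the theorem is not reproduced here, and the function name `sample_output_index` is hypothetical.

```python
import numpy as np

def sample_output_index(weights, rng):
    """Draw an iterate index R with P(R = k) proportional to weights[k]."""
    w = np.asarray(weights, dtype=float)
    return int(rng.choice(len(w), p=w / w.sum()))
```

The algorithm would then report the iterate stored at the sampled index, so the guarantee holds in expectation over both the gradient noise and this extra draw.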

###### Remark 4.17.

Notice that the right-hand sides of (26) and (23) coincide. Hence the discussions in Section 4.2 apply here.

## 5 Numerical Example

In this section, we provide a numerical example to illustrate our theoretical findings. Consider the online ridge regression problem, i.e.,

(27) |

where is a penalty parameter. Samples in the form of are gathered continuously with representing the features and being the observed outputs. We assume that each is uniformly distributed, and is drawn according to . Here is a predefined parameter, and are independent Gaussian noises with mean and variance . We further assume that the time needed to gather each sample is independent and exponentially distributed with mean .

Given a pair , we can calculate an estimated gradient of :

(28) |

This is an unbiased estimator of since

Furthermore,

for some . As long as is bounded (which can be verified in the experiments), is uniformly bounded, which satisfies Assumption 1.1. Notice that the Hessian matrix of is . Therefore is strongly convex, and problem (27) has a unique solution , which solves

We get

In the experiments, we consider instances with different combinations of and . For each instance, we run simulations with , and is drawn uniformly randomly. Penalty parameter and step size . Under the swarming-based approach, we assume that threads constitute a random network, in which each two threads are linked with probability . Attraction parameter is set to .
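To make the gradient estimator concrete, the following sketch assumes the objective is the expected squared prediction error plus the penalty parameter times the squared norm of the solution, that features are uniform on $[-1,1]$ in each coordinate, and that outputs are a linear function of the features plus Gaussian noise. The feature range, the exact form of the objective, and all names (`x_true`, `rho`, `noise_std`) are assumptions made for illustration.

```python
import numpy as np

def ridge_gradient_sample(x, x_true, rho, noise_std, rng):
    """One noisy gradient of E[(u^T x - v)^2] + rho*||x||^2 from a fresh (u, v)."""
    u = rng.uniform(-1.0, 1.0, size=x.shape)            # assumed feature range
    v = u @ x_true + noise_std * rng.standard_normal()  # noisy observed output
    return 2.0 * (u @ x - v) * u + 2.0 * rho * x

def ridge_true_gradient(x, x_true, rho):
    """Exact gradient, using E[u u^T] = I/3 for u uniform on [-1, 1]^m."""
    return (2.0 / 3.0) * (x - x_true) + 2.0 * rho * x

def ridge_optimum(x_true, rho):
    """Closed-form minimizer under the same assumption."""
    return x_true / (1.0 + 3.0 * rho)
```

Under these assumptions the sample gradient is unbiased for the exact gradient, and the exact gradient vanishes at the closed-form minimizer, which is shrunk toward the origin relative to the underlying parameter.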

Figure 1 visualizes three sample simulations with and chosen from respectively. In all three cases,