Distributed Stochastic Approximation for Solving Network Optimization Problems Under Random Quantization

Thinh T. Doan, Siva Theja Maguluri, Justin Romberg
School of Electrical and Computer Engineering
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology, GA, 30332, USA
{thinhdoan, siva.theja}@gatech.edu, jrom@ece.gatech.edu.
Abstract.

We study distributed optimization problems over a network when the communication between the nodes is constrained, so that the information exchanged between the nodes must be quantized. This imperfect communication poses a fundamental challenge and, if not properly accounted for, prevents the convergence of these algorithms. Our first contribution in this paper is to propose a modified consensus-based gradient method for solving such problems using random (dithered) quantization. This algorithm can be interpreted as a distributed variant of a well-known two-time-scale stochastic algorithm. We then study its convergence and derive upper bounds on its rates of convergence as a function of the bandwidths available between the nodes and the underlying network topology, for both convex and strongly convex objective functions. Our results complement the existing literature, where such convergence and explicit formulas for the convergence rates are missing. Finally, we provide numerical simulations comparing the convergence properties of distributed gradient methods with and without quantization for solving regression problems over networks, for both quadratic and absolute loss functions.


1. Introduction

In this paper, we consider optimization problems that are defined over a network of nodes; throughout, a node may represent a processor, a robot, or a sensor. The objective function is a sum of local functions, where each function is known by only one node. In addition, each node is only allowed to interact with the neighboring nodes connected to it through the network. Since there is no central coordination and each node knows only its local function, the nodes are required to cooperatively solve the problem. This necessitates the development of distributed algorithms that operate under communication and computation constraints.

We are motivated by various applications of such problems within engineering. A standard example is spectrum estimation in a wireless network of sensors, where the goal is to cooperatively estimate the radio-frequency power spectral density by solving a regression problem (Mateos et al., 2010). In this application, the objective function is the total loss over the data measured by the sensors, which are scattered across a large geographical area. Due to privacy concerns, the sensors may not be willing to share their measurements, but only their own estimates, making distributed algorithms necessary.

Another possible application is distributed information processing in edge (fog) computing, which has recently received a surge of interest (Chiang and Zhang, 2016). This new technology, emerging from the rapid development of the Internet of Things, aims to reduce the burden of communication and computation at cloud or centralized servers by shifting the computing infrastructure closer to the sources of data (e.g., smart devices, wireless sensors, or mobile robots). In this context, distributed algorithms provide a promising solution for coping with large-scale complex networks while handling massive amounts of generated data.

Distributed algorithms for these problems have received wide attention during the last decade, mostly focusing on three classes of algorithms, namely, the alternating direction method of multipliers (ADMM) (Wei and Ozdaglar, 2012; Boyd et al., 2011; Shi et al., 2014; Makhdoumi and Ozdaglar, 2017), distributed dual methods (mirror descent/dual averaging) (Duchi et al., 2012; Tsianos et al., 2012; Doan et al., 2019; Li et al., 2016a; Yuan et al., 2018), and distributed gradient algorithms (Shi et al., 2015; Qu and Li, 2017; Nedić and Ozdaglar, 2009; Nedić et al., 2017; Yuan et al., 2016; Lorenzo and Scutari, 2016; Doan, 2018; Doan et al., 2018a, 2017; Lan et al., 2017; Shah and Borkar, 2018). The focus in this paper will be on distributed (sub)gradient algorithms, as they have the benefits (in terms of convergence rates and simplicity) of both ADMM and dual methods. We refer interested readers to the recent survey paper (Nedić et al., 2018) for a summary of existing results in this area.

In distributed algorithms, the nodes are required to communicate and exchange information while cooperatively solving the problems. Thus, communication constraints, such as delays and finite bandwidth, are critical issues in distributed systems. For this reason, there has been recent interest in studying the convergence of distributed gradient methods under these communication constraints. The convergence rates of such methods in the presence of communication delays have been studied in (Wu et al., 2018; Tian et al., 2018; Doan et al., 2018a, 2017), while the works in (Lan et al., 2017; Notarnicola et al., 2017) focus on reducing the number of communication rounds between nodes.

Our focus is on studying the convergence properties of distributed gradient methods when the nodes are only allowed to exchange quantized values due to the finite bandwidths shared between them. Different variants of distributed gradient methods under quantized communication have been studied in (Pu et al., 2017; Li et al., 2016b; Yi and Hong, 2014; Nedić et al., 2008; Doan et al., 2018b; Reisizadeh et al., 2018). In (Li et al., 2016b; Nedić et al., 2008) the authors only show convergence to a neighborhood of the optimal solution due to the quantization error. On the other hand, asymptotic convergence to the optimal solution has been studied in (Pu et al., 2017; Doan et al., 2018b; Yi and Hong, 2014); however, these works assume that the communication bandwidth grows over time in order to eliminate the quantization error. Recently, the authors in (Reisizadeh et al., 2018) study distributed gradient methods with random quantization using finite bandwidths, and show a convergence rate for unconstrained problems with strongly convex and smooth objective functions.

We consider in this paper a stochastic variant of distributed gradient methods with random quantization, which can be viewed as a distributed version of the well-known two-time-scale stochastic approximation method. Similar to (Reisizadeh et al., 2018), we consider problems in which the nodes only share a finite communication bandwidth. However, unlike (Reisizadeh et al., 2018), we consider a constrained problem with nonsmooth objective functions. We derive explicit formulas for the rates of convergence of the algorithm, which show the dependence on the network topology and the communication capacity, for both convex and strongly convex objective functions. It is worth noting that the techniques used to derive the convergence rates in this paper are different from the ones in (Reisizadeh et al., 2018). While the authors in (Reisizadeh et al., 2018) use a dual approach in their convergence analysis, we utilize standard techniques from two-time-scale stochastic approximation studied in (Borkar, 2008; Kushner and Yin, 2003; Wang et al., 2017; Konda and Tsitsiklis, 2004). This allows us to clearly show the impact of the network topology and the communication bandwidths on the convergence of the algorithm.

Main Contributions. The main contributions of this paper are twofold. First, we propose a distributed variant of the well-known two-time-scale stochastic approximation method for solving network optimization problems under random quantization. Second, we study the convergence of this method and derive upper bounds on its rates of convergence. In particular, when the objective function is convex we first show the almost sure convergence of the nodes' variables to the optimal solution of the problem. Then, under an appropriate choice of the step sizes, we derive the convergence of the objective function to the optimal value in expectation, at a rate that depends on the number of iterations, on the connectivity of the underlying network, and on the quantization error, which is determined by the size of the communication bandwidths. When the objective function is strongly convex, we further show that this rate improves. We then conclude our paper with numerical experiments comparing the performance of distributed subgradient methods with and without quantization for solving regression problems.

1.1. Notation And Definition

Notation: We first introduce a set of notation and definitions used throughout this paper. We use boldface to distinguish between vectors in $\mathbb{R}^d$ and scalars in $\mathbb{R}$. Given a collection of vectors $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ in $\mathbb{R}^d$, we denote by $\mathbf{X}$ the matrix in $\mathbb{R}^{n \times d}$ whose $i$-th row is $\mathbf{x}_i^{\top}$. We then denote by $\|\mathbf{x}\|$ and $\|\mathbf{X}\|_F$ the Euclidean norm of $\mathbf{x}$ and the Frobenius norm of $\mathbf{X}$, respectively. Let $\mathbf{1}$ be the vector whose entries are all $1$ and $\mathbf{I}$ the identity matrix. Given a closed convex set $\mathcal{X}$, we denote by $\mathcal{P}_{\mathcal{X}}[\mathbf{x}]$ the projection of $\mathbf{x}$ onto $\mathcal{X}$.

Given a nonsmooth convex function $f: \mathbb{R}^d \to \mathbb{R}$, we denote by $\partial f(\mathbf{x})$ its subdifferential at $\mathbf{x}$, i.e., $\partial f(\mathbf{x})$ is the set of subgradients of $f$ at $\mathbf{x}$. Since $f$ is convex, $\partial f(\mathbf{x})$ is nonempty. The function $f$ is $L$-Lipschitz continuous if and only if

(1)  $|f(\mathbf{x}) - f(\mathbf{y})| \le L\,\|\mathbf{x} - \mathbf{y}\|, \quad \forall\, \mathbf{x}, \mathbf{y}.$

Note that the $L$-Lipschitz continuity of $f$ is equivalent to the subgradients of $f$ being uniformly bounded by $L$ (Shalev-Shwartz, 2012). A function $f$ is $\mu$-strongly convex if and only if it satisfies

(2)  $f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{g}^{\top}(\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\,\|\mathbf{y} - \mathbf{x}\|^2, \quad \forall\, \mathbf{x}, \mathbf{y},\ \forall\, \mathbf{g} \in \partial f(\mathbf{x}).$
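
For instance, consider the absolute value function $f(x) = |x|$:

```latex
% f(x) = |x| is 1-Lipschitz and nonsmooth, with subdifferential
\partial f(x) =
  \begin{cases}
    \{-1\},    & x < 0, \\
    [-1,\, 1], & x = 0, \\
    \{+1\},    & x > 0,
  \end{cases}
% so every subgradient g \in \partial f(x) satisfies |g| \le 1 = L,
% matching the equivalence between Lipschitz continuity and uniformly
% bounded subgradients noted above. Note f is convex but not strongly
% convex: it satisfies Eq. (2) only with \mu = 0.
```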

Random Quantization: We now present a brief review of random quantization adopted from (Aysal et al., 2008), which is equivalent to dithered quantization in signal processing. In particular, given a finite interval $[l, u]$, we divide this interval into a number of bins with endpoints $l = \tau_1 < \tau_2 < \dots < \tau_B = u$. We assume that the points $\tau_\ell$ are uniformly spaced with a distance $\Delta$, i.e., $\tau_{\ell+1} - \tau_\ell = \Delta$ for all $\ell$, implying that $\Delta = (u - l)/(B - 1)$. Thus, to represent the points $\{\tau_\ell\}$ we need a finite number of bits $b$, where $b = \lceil \log_2 B \rceil$.

Next, given $x \in [l, u]$, we denote by $\tau_\ell$ the quantization point satisfying $x \in [\tau_\ell, \tau_{\ell+1})$. Then the random quantization $Q(x)$ of $x$ is defined as

(3)  $Q(x) = \begin{cases} \tau_\ell, & \text{with probability } \frac{\tau_{\ell+1} - x}{\Delta}, \\ \tau_{\ell+1}, & \text{with probability } \frac{x - \tau_\ell}{\Delta}. \end{cases}$

As shown in (Aysal et al., 2008), the random quantization in Eq. (3) satisfies

(4)  $\mathbb{E}[Q(x)\,|\,x] = x \quad \text{and} \quad \mathbb{E}\big[(Q(x) - x)^2\,\big|\,x\big] \le \frac{\Delta^2}{4}.$

In addition, we have $Q(x) = \tau_\ell$ a.s. if $x = \tau_\ell$ for some $\ell$. Thus, $Q(x) = x$ a.s. if $x$ coincides with one of the quantization points.

Finally, we consider random quantization in the vector case. In particular, consider a compact set $\mathcal{X}$ contained in a $d$-dimensional box, i.e.,

$\mathcal{X} \subseteq [l_1, u_1] \times \dots \times [l_d, u_d].$

With some abuse of notation, given a vector $\mathbf{x} = (x_1, \dots, x_d) \in \mathcal{X}$ we denote by $Q(\mathbf{x}) = (Q(x_1), \dots, Q(x_d))$ its quantization, where $Q(x_t)$ is the quantization of the $t$-th coordinate of $\mathbf{x}$, for $t = 1, \dots, d$. Here, each $Q(x_t)$ is defined by using Eq. (3) with a uniform distance $\Delta_t$ associated with each interval $[l_t, u_t]$, for all $t$.
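
As a minimal illustration of the quantizer above, the following Python sketch applies Eq. (3) coordinate-wise; the interval endpoints, the number of bits, and all names are our own illustrative choices, not quantities fixed by the analysis.

```python
import numpy as np

def random_quantize(x, low=-1.0, high=1.0, bits=4, rng=None):
    """Dithered (random) quantization of Eq. (3), applied coordinate-wise.

    Each coordinate is rounded to one of the 2**bits uniformly spaced
    points in [low, high]; rounding up or down is random, with
    probabilities chosen so that E[Q(x) | x] = x, as in Eq. (4).
    """
    rng = rng or np.random.default_rng()
    x = np.clip(np.asarray(x, dtype=float), low, high)
    num_points = 2 ** bits                    # B quantization points
    delta = (high - low) / (num_points - 1)   # spacing between points
    lower_idx = np.minimum((x - low) // delta, num_points - 2).astype(int)
    tau_lower = low + lower_idx * delta       # tau_l, the point just below x
    prob_up = (x - tau_lower) / delta         # round up w.p. (x - tau_l)/delta
    return tau_lower + (rng.random(x.shape) < prob_up) * delta

# Empirical check of the unbiasedness property in Eq. (4).
x = np.array([0.3, -0.7])
print(np.mean([random_quantize(x) for _ in range(20000)], axis=0))  # ~ [0.3, -0.7]
```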

2. Problem Formulation

We consider an optimization problem defined over a network of $n$ nodes. Associated with each node $i$ is a nonsmooth convex function $f_i: \mathbb{R}^d \to \mathbb{R}$ over a convex compact set $\mathcal{X} \subset \mathbb{R}^d$. The goal is to solve

(5)  $\min_{\mathbf{x} \in \mathcal{X}}\ f(\mathbf{x}) \triangleq \sum_{i=1}^{n} f_i(\mathbf{x}).$

Each node $i$ knows only its local function $f_i$, and since there is no central coordination, the nodes are required to cooperatively solve the problem. We will use a distributed consensus-based (sub)gradient method in which each node $i$ maintains its own version $\mathbf{x}_i$ of the decision variable; the goal is to have all the $\mathbf{x}_i$ converge to $\mathbf{x}^*$, a solution of problem (5). Each node can exchange a quantized version of $\mathbf{x}_i$ with its neighbors, as defined through a connected and undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \dots, n\}$ and $\mathcal{E}$ are the vertex and edge sets, respectively. We denote by $\mathcal{N}_i = \{j \in \mathcal{V} \mid (i,j) \in \mathcal{E}\}$ the set of node $i$'s neighbors.

A concrete motivating example for this problem is a distributed linear regression problem solved over a network of processors. Regression problems involving massive amounts of data are common in machine learning; see, for example, (Shalev-Shwartz and Ben-David, 2014; Hastie et al., 2009). Each function $f_i$ is the empirical loss over the local data stored at processor $i$. The objective is to minimize the total loss over the entire dataset. Due to the difficulty of storing the enormous amount of data at a central location, the processors perform local computations over the local data, which are then exchanged to arrive at the globally optimal solution. Distributed gradient methods are a natural choice to solve such problems since they have been observed to be both fast and easily parallelizable in the case where the processors can exchange data instantaneously. The goal of this paper is to show that the algorithm continues to be convergent even when the nodes exchange only quantized values of their variables due to the finite bandwidths shared between them. In particular, we derive expressions for the convergence rate as a function of the communication bandwidths and the underlying network topology.

In the sequel, we will use $f^*$ to denote the optimal value of problem (5), i.e., $f^* = f(\mathbf{x}^*)$ where $\mathbf{x}^*$ is a solution of problem (5). We denote by $\mathcal{X}^*$ the solution set of problem (5), which is nonempty due to the compactness of $\mathcal{X}$. In addition, since $\mathcal{X}$ is compact, each $f_i$ is Lipschitz continuous with some positive constant $L_i$, as stated in the following proposition.

Proposition 1.

Each function $f_i$, for $i = 1, \dots, n$, is $L_i$-Lipschitz continuous, i.e., Eq. (1) holds for some $L_i > 0$ for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$.

3. Distributed Gradient Methods Under Random Quantization

Distributed subgradient (DSG) methods, Eq. (6), for solving problem (5) were first studied and analyzed rigorously in (Nedić and Ozdaglar, 2009; Nedić et al., 2010). In these methods, each node $i$ iteratively updates its iterate $\mathbf{x}_i(k)$ as

(6)  $\mathbf{x}_i(k+1) = \mathcal{P}_{\mathcal{X}}\Big[\sum_{j=1}^{n} a_{ij}\,\mathbf{x}_j(k) - \alpha(k)\,\mathbf{g}_i(k)\Big],$

where $\{\alpha(k)\}$ is some sequence of stepsizes and $\mathbf{g}_i(k) \in \partial f_i(\mathbf{x}_i(k))$. Here, $a_{ij}$ is some positive weight which node $i$ assigns to $\mathbf{x}_j(k)$. We assume that these weights, which capture the topology of $\mathcal{G}$, satisfy the following condition.

Assumption 1.

The matrix $\mathbf{A}$, whose $(i,j)$-th entries are $a_{ij}$, is doubly stochastic, i.e., $\mathbf{A}\mathbf{1} = \mathbf{A}^{\top}\mathbf{1} = \mathbf{1}$. Moreover, $\mathbf{A}$ is irreducible and aperiodic. Finally, the weights satisfy $a_{ij} > 0$ if and only if $(i,j) \in \mathcal{E}$, and $a_{ij} = 0$ otherwise.

This assumption also implies that $\mathbf{A}$ has $1$ as its largest singular value, while all other singular values are strictly less than $1$; see, for example, the Perron–Frobenius theorem (Horn and Johnson, 1985). Also, we denote by $\sigma_2$ the second largest singular value of $\mathbf{A}$, which by the Courant–Fischer theorem (Horn and Johnson, 1985) gives

(7)  $\|\mathbf{A}\mathbf{x} - \bar{x}\mathbf{1}\| \le \sigma_2\,\|\mathbf{x} - \bar{x}\mathbf{1}\|, \quad \text{where } \bar{x} = \frac{1}{n}\mathbf{1}^{\top}\mathbf{x}.$

Our focus in this section is to study DSG under random quantization of the communication between nodes. In particular, at any iteration the nodes are only allowed to send and receive quantized values of their local copies to and from their neighboring nodes. Due to such quantized communication, we modify the update in Eq. (6); that is, each node $i$ now considers the following update

$\mathbf{x}_i(k+1) = \mathcal{P}_{\mathcal{X}}\Big[\mathbf{x}_i(k) - \beta(k)\sum_{j \in \mathcal{N}_i} a_{ij}\big(Q(\mathbf{x}_i(k)) - Q(\mathbf{x}_j(k))\big) - \alpha(k)\,\mathbf{g}_i(k)\Big],$

where $Q(\cdot)$ is the random quantization given in Eq. (3), applied coordinate-wise, and $\mathbf{g}_i(k) \in \partial f_i(\mathbf{x}_i(k))$. Here, in addition to $\alpha(k)$ we introduce a new stepsize $\beta(k)$ due to the random quantization exchanged between the nodes.

This update has a simple interpretation. At any time $k$, each node $i$ first obtains the quantized value $Q(\mathbf{x}_i(k))$ of its current iterate $\mathbf{x}_i(k)$. Each node then forms a convex combination of its own value and the weighted quantized values received from its neighbors $j \in \mathcal{N}_i$, with the goal of seeking a consensus on their estimates. Each node then moves along the subgradient of its respective objective function to update its estimate, pushing the consensus point toward the optimal set $\mathcal{X}^*$. The distributed subgradient algorithm under random quantization is formally stated in Algorithm 1.

3.1. The Role of $\beta(k)$

We discuss in this section some aspects of the new stepsize $\beta(k)$. First, one can interpret Eq. (8) as a distributed two-time-scale stochastic algorithm (Borkar, 2008; Kushner and Yin, 2003; Wang et al., 2017; Konda and Tsitsiklis, 2004), where the consensus sum plays the role of the fast time scale while the gradient step is the slow time scale. Due to the random quantization, each node first uses the fast time scale to estimate the true average of the nodes' estimates. Each node then applies the gradient step to slowly push its estimate toward a solution of problem (5). As will be seen, $\beta(k)$ will be chosen relatively large compared to $\alpha(k)$ to guarantee the convergence of Algorithm 1.

Second, when $\beta(k) = 1$ for all $k$ and the communication is exact (no quantization), we recover the update in Eq. (6). In a sense, introducing $\beta(k)$ gives us one more degree of freedom in designing our algorithm, especially when dealing with communication constraints. This has also been observed in our previous works (Doan et al., 2017, 2018a), where we use a constant $\beta$ to study the impact of network latencies on the performance of distributed gradient methods.

Finally, one can view $\beta(k)$, in addition to $a_{ij}$, as a weight which each node $i$ uses to indicate that it “trusts” its own value more than the values received from its neighbors. As will be seen, to guarantee the convergence of the algorithm we will let $\beta(k)$ go to zero at a proper rate, implying that eventually each node only uses its own value.

1. Initialize: Each node $i$ arbitrarily initializes $\mathbf{x}_i(0) \in \mathcal{X}$.
2. Iteration: For $k = 0, 1, \dots$, each node $i$ implements
(8)  $\mathbf{x}_i(k+1) = \mathcal{P}_{\mathcal{X}}\Big[\mathbf{x}_i(k) - \beta(k)\sum_{j \in \mathcal{N}_i} a_{ij}\big(Q(\mathbf{x}_i(k)) - Q(\mathbf{x}_j(k))\big) - \alpha(k)\,\mathbf{g}_i(k)\Big]$
Algorithm 1 Distributed Subgradient Algorithm Under Random Quantization
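
For concreteness, the sketch below implements one iteration of an update of the form in Eq. (8): a damped consensus step on quantized values followed by a projected subgradient step. The stepsize schedules, the projection set (a Euclidean ball), and the subgradient oracle are our own illustrative assumptions.

```python
import numpy as np

def algorithm1_step(X, A, k, subgrad, quantize, radius=10.0):
    """One illustrative iteration of Algorithm 1 for all n nodes.

    X        : (n, d) array; row i is node i's current iterate x_i(k)
    A        : (n, n) doubly stochastic weight matrix (Assumption 1)
    subgrad  : subgrad(i, x) returns a subgradient of f_i at x
    quantize : coordinate-wise random quantizer, e.g., random_quantize above
    radius   : X is taken to be the Euclidean ball of this radius (assumed)
    """
    n, _ = X.shape
    alpha = 1.0 / (k + 1)              # slow time scale (illustrative choice)
    beta = 1.0 / (k + 1) ** (2 / 3)    # fast time scale, decays more slowly
    Q = np.array([quantize(X[i]) for i in range(n)])  # quantized exchanges
    X_new = np.empty_like(X)
    for i in range(n):
        # damped consensus step on the quantized values
        consensus = sum(A[i, j] * (Q[i] - Q[j]) for j in range(n) if j != i)
        v = X[i] - beta * consensus - alpha * subgrad(i, X[i])
        # Euclidean projection onto the ball of the given radius
        norm = np.linalg.norm(v)
        X_new[i] = v if norm <= radius else (radius / norm) * v
    return X_new
```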

4. Convergence Results

The focus of this section is to study the convergence properties of Algorithm 1 for solving problem (5), both when the objective functions are convex and when they are strongly convex. The key idea of our analysis is to utilize standard techniques from centralized subgradient methods and from stochastic approximation. In particular, for convex objective functions we first show that $\mathbf{x}_i(k)$, for all $i$, converges almost surely to a solution of problem (5) under a proper choice of the stepsizes $\alpha(k)$ and $\beta(k)$. We next show the convergence in expectation of the objective function, evaluated at a time-weighted average of each $\mathbf{x}_i(k)$, to the optimal value, at a rate that depends on $1 - \sigma_2$, the spectral gap of the network connectivity. Finally, when the objective functions are strongly convex, we derive the convergence in expectation of the time-weighted average of each $\mathbf{x}_i(k)$ to an optimal solution of problem (5) at an improved rate.

We start our analysis by introducing some more notation. Given the nodes' estimates $\mathbf{x}_i(k)$ in $\mathbb{R}^d$, we denote by $\mathbf{X}(k)$ the matrix in $\mathbb{R}^{n \times d}$ whose $i$-th row is $\mathbf{x}_i(k)^{\top}$. Let $\bar{\mathbf{x}}(k)$ be the average of the $\mathbf{x}_i(k)$, i.e.,

$\bar{\mathbf{x}}(k) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i(k) = \frac{1}{n}\,\mathbf{X}(k)^{\top}\mathbf{1}.$

Moreover, let $\mathcal{F}_k$ be the filtration containing all the history generated by Eq. (8) up to time $k$, i.e.,

$\mathcal{F}_k = \{\mathbf{X}(0), \mathbf{X}(1), \dots, \mathbf{X}(k)\}.$

Finally, given a vector $\mathbf{v}$, let $\mathbf{e}$ denote the error due to the projection of $\mathbf{v}$ onto $\mathcal{X}$, i.e.,

(9)  $\mathbf{e} = \mathcal{P}_{\mathcal{X}}[\mathbf{v}] - \mathbf{v}.$

Thus, Eq. (8) can now be rewritten as

(10)  $\mathbf{x}_i(k+1) = \mathbf{x}_i(k) - \beta(k)\sum_{j \in \mathcal{N}_i} a_{ij}\big(Q(\mathbf{x}_i(k)) - Q(\mathbf{x}_j(k))\big) - \alpha(k)\,\mathbf{g}_i(k) + \mathbf{e}_i(k),$

which in matrix form is given as

(11)  $\mathbf{X}(k+1) = \mathbf{X}(k) - \beta(k)\,(\mathbf{I} - \mathbf{A})\,Q(\mathbf{X}(k)) - \alpha(k)\,\mathbf{G}(k) + \mathbf{E}(k),$

where $\mathbf{G}(k)$ (resp. $\mathbf{E}(k)$) is the matrix whose $i$-th row is $\mathbf{g}_i(k)^{\top}$ (resp. $\mathbf{e}_i(k)^{\top}$). In addition, since $\mathbf{A}$ is doubly stochastic, we have $\mathbf{1}^{\top}(\mathbf{I} - \mathbf{A}) = \mathbf{0}$ and therefore

(12)  $\bar{\mathbf{x}}(k+1) = \bar{\mathbf{x}}(k) - \frac{\alpha(k)}{n}\,\mathbf{G}(k)^{\top}\mathbf{1} + \frac{1}{n}\,\mathbf{E}(k)^{\top}\mathbf{1}.$
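
As a quick numerical check of Eq. (12): since $\mathbf{A}$ is doubly stochastic, the consensus term $(\mathbf{I} - \mathbf{A})\,Q(\mathbf{X}(k))$ has zero column average, so it drops out of the dynamics of $\bar{\mathbf{x}}(k)$. The sketch below verifies this with arbitrary illustrative data.

```python
import numpy as np

n, d = 5, 3
rng = np.random.default_rng(1)
A = np.full((n, n), 1.0 / n)    # a simple doubly stochastic matrix
QX = rng.random((n, d))         # stand-in for the quantized iterates Q(X(k))
# average (over the nodes) of the consensus term (I - A) Q(X(k))
avg_consensus = np.ones(n) @ (np.eye(n) - A) @ QX / n
print(np.allclose(avg_consensus, 0.0))  # True: the term has zero average
```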

4.1. Preliminaries

In this section, we present some preliminary results which are essential to the analysis given in the next section. For ease of exposition, we defer the proofs of all results in this section to the appendix; however, we present proof sketches to explain the intuition behind our analysis.

We first provide an upper bound for the consensus error in the following lemma.

Lemma 1.

Suppose that Assumption 1 holds. Let the sequences $\{\mathbf{x}_i(k)\}$, for all $i$, be generated by Algorithm 1. In addition, let $\{\alpha(k)\}$ and $\{\beta(k)\}$ be two sequences of nonnegative and nonincreasing stepsizes. Then we have

(13)

Moreover, we also obtain

(14)
Sketch of Proof.

To show Eq. (13), we first use Eqs. (11) and (12) to obtain

Next, we use the following consequence of the Cauchy–Schwarz inequality, which holds for any vectors $\mathbf{u}, \mathbf{v}$ and any $\eta > 0$:

$\|\mathbf{u} + \mathbf{v}\|^2 \le (1 + \eta)\,\|\mathbf{u}\|^2 + \Big(1 + \frac{1}{\eta}\Big)\|\mathbf{v}\|^2.$

Thus, by taking the squared norm of the first equation and using the preceding inequality with a proper choice of $\eta$, we have

(15)

We next analyze each term on the right-hand side of Eq. (15). First, Proposition 1 gives

Second, using the projection lemma, Lemma 5(b) in the Appendix, one can show

Third, by Eq. (4) we have

Fourth, by Eq. (7) we have

Thus, taking the conditional expectation of Eq. (41) w.r.t. $\mathcal{F}_k$ and using the last four inequalities, we obtain Eq. (13).

Finally, taking the expectation on both sides of Eq. (13) and summing up over $k$ immediately gives Eq. (14). ∎

We next provide conditions on the stepsizes which guarantee that the nodes achieve a consensus on their estimates. The analysis of this lemma is a consequence of Lemma 1 and of Lemma 4, the almost supermartingale convergence theorem given later.

Lemma 2.

Suppose that Assumption 1 holds. Let the sequences $\{\mathbf{x}_i(k)\}$, for all $i$, be generated by Algorithm 1. In addition, let $\{\alpha(k)\}$ and $\{\beta(k)\}$ satisfy

(16)

Then we have

(17)

Furthermore, the following condition holds

(18)
Remark 1.

One example of stepsizes $\{\alpha(k)\}$ and $\{\beta(k)\}$ satisfying Eq. (16) is the following:

(19)

Third, we study an upper bound on the distance to the optimal solution in the following lemma.

Lemma 3.

Suppose that Assumption 1 holds. Let the sequences $\{\mathbf{x}_i(k)\}$, for all $i$, be generated by Algorithm 1. Let $\{\alpha(k)\}$ and $\{\beta(k)\}$ be two sequences of nonnegative and nonincreasing stepsizes. Let $\mathbf{x}^*$ be a solution of problem (5). Then we have

(20)
Sketch of Proof.

First, we use Eq. (12) to obtain

which by expanding the right-hand side and taking the conditional expectation w.r.t. $\mathcal{F}_k$ yields

The next step is to provide an upper bound for each term on the right-hand side of the preceding relation to obtain Eq. (20). This step follows a similar argument to the one given in Lemma 1. ∎

Finally, we utilize the almost supermartingale convergence result of (Robbins and Siegmund, 1971), stated as follows.

Lemma 4 (Robbins and Siegmund, 1971).

Let $\{y_k\}$, $\{w_k\}$, and $\{z_k\}$ be non-negative sequences of random variables satisfying

$\mathbb{E}\big[y_{k+1}\,\big|\,\mathcal{H}_k\big] \le y_k - w_k + z_k \quad \text{with} \quad \sum_{k=0}^{\infty} z_k < \infty \ \text{a.s.},$

where $\mathcal{H}_k$ denotes the history of $\{(y_t, w_t, z_t)\}$ up to time $k$. Then $\{y_k\}$ converges a.s., and $\sum_{k=0}^{\infty} w_k < \infty$ a.s.

4.2. Convergence Results for Convex Functions

In this section, we study the convergence and the rate of convergence of Algorithm 1 when the objective functions are convex. For ease of explanation, we provide only proof sketches for the main results in this section and the next; their full details are presented in Section 6.

Our first main result shows that if the stepsizes satisfy Eq. (16), then $\mathbf{x}_i(k)$, for all $i$, converges almost surely to $\mathbf{x}^*$, a solution of problem (5). The following theorem states this result.

Theorem 1.

Suppose that Assumption 1 holds. Let the sequences $\{\mathbf{x}_i(k)\}$, for all $i$, be generated by Algorithm 1. Let $\{\alpha(k)\}$ and $\{\beta(k)\}$ be two sequences of nonnegative and nonincreasing stepsizes satisfying Eq. (16), e.g., the choice in Eq. (19). Then we have

(21)

for some $\mathbf{x}^*$ that is a solution of problem (5).

Proof Sketch.

The main idea of this proof is to first use the convexity of the functions $f_i$ in Eq. (20) of Lemma 3 to obtain

(22)

Second, since the stepsizes satisfy the conditions in Eq. (16), Eq. (18) holds. Thus, we can now apply Lemma 4 to the preceding relation to have

Thus, using these relations and a standard analysis of convergent subsequences of $\{\mathbf{x}_i(k)\}$, we obtain Eq. (21). ∎

We now study the rate of convergence of Algorithm 1 to the optimal value in expectation, where we utilize a technique similar to the one used to establish the convergence rates of centralized subgradient methods. In particular, we show that if each node maintains a variable used to estimate the time-weighted average of its local copy $\mathbf{x}_i(k)$, then the function value evaluated at this variable converges in expectation to the optimal value. The dependence of the resulting bound on the variance of the quantization error is natural, as we often observe in stochastic gradient descent, where such dependence is on the variance of the gradient noise. This result is derived under different assumptions on the stepsizes, as shown in the following theorem. (We note that the conditions on the stepsizes in Theorems 1 and 2 are common choices used to derive the asymptotic convergence and the convergence rates of centralized subgradient methods, respectively; see for example (Nesterov, 2004).) Note that while the previous theorem studies almost sure convergence of the local copies, this theorem studies convergence in expectation of the function value, and so it is not surprising that the stepsizes are different.

Theorem 2.

Suppose that Assumption 1 holds. Let the sequences $\{\mathbf{x}_i(k)\}$, for all $i$, be generated by Algorithm 1. Let the stepsizes $\{\alpha(k)\}$ and $\{\beta(k)\}$ be defined as

(23)

In addition, suppose that each node $i$ maintains a variable $\mathbf{z}_i(k)$, initialized arbitrarily in $\mathcal{X}$ and updated as the stepsize-weighted time average

$\mathbf{z}_i(k) = \frac{\sum_{t=0}^{k} \alpha(t)\,\mathbf{x}_i(t)}{\sum_{t=0}^{k} \alpha(t)}.$

Then we have, for all $i$ and $k$,

(24)
Proof Sketch.

The analysis of this theorem is divided into three main steps. First, we fix some $i$ and utilize Eq. (22) and Proposition 1 to have

Second, we utilize Eq. (14) to obtain the following bound

Third, using the preceding relation and the integral test with the stepsizes in Eq. (23), some algebraic manipulation immediately gives us Eq. (24). ∎

Remark 2.

We note that $\mathbf{z}_i(k)$ in Theorem 2 can be updated iteratively as follows:

$\mathbf{z}_i(k+1) = \mathbf{z}_i(k) + \frac{\alpha(k+1)}{S(k+1)}\big(\mathbf{x}_i(k+1) - \mathbf{z}_i(k)\big),$

where $S(k+1) = \sum_{t=0}^{k+1} \alpha(t)$ and $\mathbf{z}_i(0) = \mathbf{x}_i(0)$, for all $k \ge 0$.
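
In code, this recursion lets each node maintain $\mathbf{z}_i(k)$ with $\mathcal{O}(d)$ memory; a minimal sketch, assuming the stepsize-weighted average form above:

```python
import numpy as np

class RunningWeightedAverage:
    """Maintains z(k) = sum_t alpha(t) x(t) / sum_t alpha(t) iteratively,
    so no history of past iterates needs to be stored."""

    def __init__(self, x0, alpha0):
        self.z = np.array(x0, dtype=float)   # z(0) = x(0)
        self.weight_sum = alpha0             # S(0) = alpha(0)

    def update(self, x_new, alpha_new):
        self.weight_sum += alpha_new         # S(k+1) = S(k) + alpha(k+1)
        # z(k+1) = z(k) + (alpha(k+1) / S(k+1)) * (x(k+1) - z(k))
        self.z += (alpha_new / self.weight_sum) * (np.asarray(x_new) - self.z)
        return self.z
```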

4.3. Convergence Results for Strongly Convex Functions

We study here the convergence rate of Algorithm 1 when the functions $f_i$ are strongly convex; that is, we consider the following assumption.

Assumption 2.

Each function $f_i$, for all $i$, is $\mu_i$-strongly convex, i.e., Eq. (2) holds for some $\mu_i > 0$.

Note that this assumption implies that $f$ is $\mu$-strongly convex with $\mu = \sum_{i=1}^{n} \mu_i$. Under this assumption, we show the rate of convergence of Algorithm 1 to an optimal solution of problem (5) in expectation, stated in the following theorem.

Theorem 3.

Suppose that Assumptions 1 and 2 hold. Let the sequences $\{\mathbf{x}_i(k)\}$, for all $i$, be generated by Algorithm 1. Let $\mathbf{x}^*$ be a solution of problem (5) and let the stepsizes $\{\alpha(k)\}$ and $\{\beta(k)\}$ be defined as

(25)

In addition, suppose that each node $i$ maintains a variable $\mathbf{z}_i(k)$, initialized arbitrarily and updated as a suitably weighted time average of $\{\mathbf{x}_i(t)\}$, analogous to Theorem 2. Then we have, for all $i$ and $k$,

(26)
Proof Sketch.

The first step in this analysis is to use the strong convexity of the functions $f_i$ in Eq. (20) to obtain

The rest of this proof is similar to that of Theorem 2. ∎

5. Simulations

In this section, we apply Algorithm 1 to solve linear regression problems, among the most popular techniques for data fitting in statistical machine learning (Hastie et al., 2009; Shalev-Shwartz and Ben-David, 2014), over a network of processors under random quantization. The goal of this problem is to find a linear relationship between a set of variables and some real-valued outcome. That is, given a training set $\{(\mathbf{a}_i, b_i)\}$, where $\mathbf{a}_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$, we want to learn a parameter $\mathbf{x}$ that minimizes

$\sum_{i} \ell_i(\mathbf{a}_i^{\top}\mathbf{x},\, b_i),$

where the $\ell_i$ are the loss functions defined over the dataset. For the purpose of our simulation, we consider two loss functions, namely, the quadratic loss and the absolute loss. While the quadratic loss is strongly convex, the absolute loss is only convex.

First, when the $\ell_i$ are quadratic, we have the well-known least-squares problem

$\min_{\mathbf{x} \in \mathcal{X}}\ \sum_{i} \big(\mathbf{a}_i^{\top}\mathbf{x} - b_i\big)^2.$
Second, regression with the absolute loss (or $\ell_1$ loss), often referred to as robust regression since it is known to be robust to outliers (Karst, 1958), is given as follows:

$\min_{\mathbf{x} \in \mathcal{X}}\ \sum_{i} \big|\mathbf{a}_i^{\top}\mathbf{x} - b_i\big|.$
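
For reference, a sketch of the local (sub)gradient oracles induced by these two losses; the data layout and function names are ours:

```python
import numpy as np

def quadratic_subgrad(A_i, b_i, x):
    """Gradient of node i's least-squares loss ||A_i x - b_i||^2,
    where A_i stacks node i's feature vectors and b_i its outcomes."""
    return 2.0 * A_i.T @ (A_i @ x - b_i)

def absolute_subgrad(A_i, b_i, x):
    """A subgradient of node i's robust-regression loss sum_j |a_j^T x - b_j|;
    np.sign returns 0 at the kink, which is a valid subgradient there."""
    return A_i.T @ np.sign(A_i @ x - b_i)
```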
We consider simulated training datasets, i.e., the pairs $(\mathbf{a}_i, b_i)$ are generated randomly from a uniform distribution over a bounded interval. We study the performance of the distributed subgradient methods on a connected undirected graph $\mathcal{G}$ of $n$ nodes. Our graph is generated as follows.

  1. In each network, we first randomly generate the nodes’ coordinates in the plane with uniform distribution.

  2. Then any two nodes are connected if their distance is less than a reference radius $r$, which is fixed in our simulations.

  3. Finally, we check whether the network is connected. If not, we return to step 1 and run the program again.
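
A minimal sketch of this construction (the node count, radius, and random seed are illustrative):

```python
import numpy as np

def random_geometric_graph(n=50, r=0.5, rng=np.random.default_rng(0)):
    """Steps 1-3 above: sample coordinates uniformly in the unit square,
    connect nodes closer than r, and resample until connected."""
    while True:
        pts = rng.random((n, 2))
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        adj = (dist < r) & ~np.eye(n, dtype=bool)
        # connectivity check by graph search from node 0
        seen, stack = {0}, [0]
        while stack:
            v = stack.pop()
            for u in np.flatnonzero(adj[v]):
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        if len(seen) == n:
            return adj
```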

To implement our algorithm, the weight matrix $\mathbf{A}$ is chosen as the lazy Metropolis matrix corresponding to $\mathcal{G}$, i.e.,

$a_{ij} = \begin{cases} \frac{1}{2\max\{d_i,\, d_j\}}, & \text{if } (i,j) \in \mathcal{E},\ j \ne i, \\ 1 - \sum_{j \in \mathcal{N}_i} a_{ij}, & \text{if } j = i, \\ 0, & \text{otherwise}, \end{cases}$

where $d_i$ denotes the degree of node $i$.
It is straightforward to verify that the lazy Metropolis matrix satisfies Assumption 1.
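
The sketch below builds the lazy Metropolis matrix from an adjacency matrix and numerically checks the double stochasticity required by Assumption 1, along with the second largest singular value $\sigma_2$ appearing in Eq. (7); the small path-graph example is illustrative.

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis weights: a_ij = 1 / (2 max(d_i, d_j)) on edges, with
    the self-weight a_ii absorbing the remainder of each row."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((n, n))
    for i in range(n):
        for j in np.flatnonzero(adj[i]):
            A[i, j] = 1.0 / (2.0 * max(deg[i], deg[j]))
        A[i, i] = 1.0 - A[i].sum()
    return A

# Example on a 4-node path graph; A is symmetric, hence doubly stochastic.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=bool)
A = lazy_metropolis(adj)
assert np.allclose(A.sum(axis=0), 1.0) and np.allclose(A.sum(axis=1), 1.0)
sigma2 = np.linalg.svd(A, compute_uv=False)[1]  # second largest singular value
print(f"spectral gap 1 - sigma_2 = {1 - sigma2:.4f}")
```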

5.1. Convergence of Function Values

(a) Quadratic loss functions. (b) Absolute loss functions.
Figure 1. The convergence of function values using distributed subgradient methods without quantization, with time-varying quantization, and with random quantization, for quadratic and absolute loss functions.

In this simulation, we apply variants of distributed subgradient methods to solve the linear regression problems. In particular, we compare the performance of these methods in three different scenarios, namely, DSG with no quantization (i.e., Eq. (6)), DSG with time-varying quantization in (Doan et al., 2018b), and the proposed stochastic variant of Eq. (6) in Algorithm 1. The plots in Fig. 1 show the convergence of these three methods for both quadratic and absolute loss functions.

Note that, to achieve asymptotic convergence, the work in (Doan et al., 2018b) requires that the nodes eventually exchange an infinite number of bits. On the other hand, Algorithm 1 in this paper assumes the nodes use a fixed, finite number of bits in their communication. However, as observed both in Fig. 1(a) for the quadratic loss and in Fig. 1(b) for the absolute loss, Algorithm 1 performs almost as well as the method in (Doan et al., 2018b).

5.2. Impact of the Number of Bits

Here, we consider the impact of the number of communication bits on the performance of Algorithm 1. In Fig. 2 we plot the number of iterations needed to reach a given relative error as a function of the number of bits $b$. As we can see, the more bits we use, the faster the algorithm converges. Moreover, the number of iterations required by the algorithm appears to saturate once $b$ is sufficiently large, which makes sense given the finite numerical precision of the computer program. Finally, the curves in Fig. 2 seem to reflect the dependence of the rate of Algorithm 1 on the quantization variance up to some constant factor, which agrees with our results in Theorems 2 and 3.

(a) Quadratic loss functions. (b) Absolute loss functions.
Figure 2. The number of iterations needed to reach a given relative error, as a function of the number of bits $b$, using distributed subgradient methods with random quantization, for quadratic and absolute loss functions.

6. Proofs of Main Results

In this section, we present the proofs of our main results given in Section 4.

6.1. Proof of Theorem 1

By the convexity of each $f_i$ we have

which by the $L_i$-Lipschitz continuity of $f_i$ in Proposition 1 yields