Distributed Stochastic Approximation for Solving Network Optimization Problems Under Random Quantization
We study distributed optimization problems over a network when communication between the nodes is constrained, so that information exchanged between the nodes must be quantized. This imperfect communication poses a fundamental challenge and, if not properly accounted for, prevents the convergence of these algorithms. Our first contribution in this paper is to propose a modified consensus-based gradient method for solving such problems using random (dithered) quantization. This algorithm can be interpreted as a distributed variant of a well-known two-time-scale stochastic approximation algorithm. We then study its convergence and derive upper bounds on its rates of convergence as a function of the bandwidths available between the nodes and the underlying network topology, for both convex and strongly convex objective functions. Our results complement the existing literature, where such convergence guarantees and explicit formulas for the convergence rates are missing. Finally, we provide numerical simulations comparing the convergence properties of distributed gradient methods with and without quantization for solving the well-known regression problems over networks, for both quadratic and absolute loss functions.
In this paper, we consider optimization problems that are defined over a network of nodes (here, nodes may represent processors, robots, or sensors). The objective function is a sum of local functions, where each local function is known by only one node. In addition, each node is only allowed to interact with the neighboring nodes connected to it through the network. We assume no central coordination between the nodes, and since each node knows only its local function, the nodes are required to cooperatively solve the problem. This necessitates the development of distributed algorithms that operate under communication and computation constraints.
We are motivated by various applications of such problems within engineering. A standard example is the problem of estimating the radio-frequency power spectrum density in a wireless network of sensors, where the goal is to cooperatively estimate this density by solving a regression problem (Meteos et al., 2010). In this application, the objective function is the total loss over all the data measured by the sensors, which are scattered across a large geographical area. Due to privacy concerns, the sensors may not be willing to share their measurements, only their own estimates, making distributed algorithms necessary.
Another possible application is the problem of distributed information processing in edge (fog) computing, which has recently received a surge of interest (Chiang and Zhang, 2016). This new technology, emerging from the rapid development of the Internet of Things, aims to reduce the burden of communication and computation at cloud or centralized servers by shifting the computing infrastructure closer to the source of data (e.g., smart devices, wireless sensors, or mobile robots). In this context, distributed algorithms provide a promising solution for coping with large-scale complex networks while handling massive amounts of generated data.
Distributed algorithms for these problems have received wide attention during the last decade, mostly focusing on three classes of algorithms, namely, the alternating direction method of multipliers (ADMM) (Wei and Ozdaglar, 2012; Boyd et al., 2011; Shi et al., 2014; Makhdoumi and Ozdaglar, 2017), distributed dual methods (mirror descent/dual averaging) (Duchi et al., 2012; Tsianos et al., 2012; Doan et al., 2019; Li et al., 2016a; Yuan et al., 2018), and distributed gradient algorithms (Shi et al., 2015; Qu and Li, 2017; Nedić and Ozdaglar, 2009; Nedić et al., 2017; Yuan et al., 2016; Lorenzo and Scutari, 2016; Doan, 2018; Doan et al., 2018a, 2017; Lan et al., 2017; Shah and Borkar, 2018). The focus of this paper is on distributed (sub)gradient algorithms, as they combine the benefits (in terms of convergence rates and simplicity) of both ADMM and dual methods. We refer interested readers to the recent survey paper (Nedić et al., 2018) for a summary of existing results in this area.
In distributed algorithms, the nodes are required to communicate and exchange information while cooperatively solving the problem. Thus, communication constraints, such as delays and finite bandwidth, are critical issues in distributed systems. For this reason, there has been recent interest in studying the convergence of distributed gradient methods while taking into account these communication constraints. The convergence rates of such methods in the presence of communication delays have been studied in (Wu et al., 2018; Tian et al., 2018; Doan et al., 2018a, 2017), while the works in (Lan et al., 2017; Notarnicola et al., 2017) focus on reducing the number of communication rounds between nodes.
Our focus is on studying the convergence properties of distributed gradient methods when the nodes are only allowed to exchange quantized values due to the finite bandwidths shared between them. Different variants of distributed gradient methods under quantized communication have been studied in (Pu et al., 2017; Li et al., 2016b; Yi and Hong, 2014; Nedić et al., 2008; Doan et al., 2018b; Reisizadeh et al., 2018). In (Li et al., 2016b; Nedić et al., 2008) the authors only show convergence to a neighborhood of the optimal solution of the problem due to the quantization error. On the other hand, asymptotic convergence to the optimal solution has been studied in (Pu et al., 2017; Doan et al., 2018b; Yi and Hong, 2014); however, these works assume a growing communication bandwidth in order to remove the quantization error. Recently, the authors in (Reisizadeh et al., 2018) study distributed gradient methods with random quantization using finite bandwidths, and establish a convergence rate for unconstrained problems with strongly convex and smooth objective functions.
In this paper, we consider a stochastic variant of distributed gradient methods with random quantization, which can be viewed as a distributed version of the well-known two-time-scale stochastic approximation. Similar to (Reisizadeh et al., 2018), we consider problems where the nodes only share a finite communication bandwidth. However, unlike (Reisizadeh et al., 2018), we consider a constrained problem with nonsmooth objective functions. We derive explicit formulas for the rates of convergence of the algorithm, which show the dependence on the network topology and the communication capacity, for both convex and strongly convex objective functions. It is worth noting that the techniques used to derive the convergence rates in this paper are different from the ones in (Reisizadeh et al., 2018). While the authors in (Reisizadeh et al., 2018) use a dual approach in their convergence analysis, we utilize standard techniques from two-time-scale stochastic approximation studied in (Borkar, 2008; Kushner and Yin, 2003; Wang et al., 2017; Konda and Tsitsiklis, 2004). This allows us to clearly show the impact of the network topology and communication bandwidths on the convergence of the algorithm.
Main Contributions. The main contributions of this paper are twofold. First, we propose a distributed variant of the well-known two-time-scale stochastic approximation for solving network optimization problems under random quantization. Second, we study its convergence and derive upper bounds on its rates of convergence. In particular, when the objective function is convex, we first show the almost sure convergence of the nodes' variables to the optimal solution of the problem. Then, under an appropriate choice of step sizes, we derive the convergence of the objective function to the optimal value in expectation at a rate that depends on the number of iterations, the connectivity of the underlying network, and the quantization error, which in turn depends on the size of the communication bandwidths. When the objective function is strongly convex, we further show that this convergence occurs at a faster rate. We then conclude the paper with numerical experiments comparing the performance of distributed subgradient methods with and without quantization for solving the well-known regression problems.
1.1. Notation And Definition
Notation: We first introduce the notation and definitions used throughout this paper. We use boldface to distinguish between vectors and scalars. Given a collection of vectors , we denote by a matrix whose -th row is . We then denote by and the Euclidean norm and the Frobenius norm of and , respectively. Let be the vector whose entries are and the identity matrix. Given a closed convex set , we denote by the projection of onto .
Given a nonsmooth convex function , we denote by its subdifferential estimated at , i.e., is the set of subgradients of at . Since is convex, is nonempty. The function is -Lipschitz continuous if and only if
Note that the -Lipschitz continuity of is equivalent to the subgradients of are uniformly bounded by (Shalev-Shwartz, 2012). A function is -strongly convex if and only if satisfies
Random Quantization: We now present a brief review of random quantization adopted from (Aysal et al., 2008), which is also equivalent to dithered quantization in signal processing. In particular, given a finite interval we divide this interval into a number of bins . We assume that the points are uniformly spaced with a distance , i.e., for all , implying that . Thus, to represent the points we need a finite number of bits , where .
Next given we denote by . Then the random quantization of is defined as
In addition, we have a.s. if for some Thus, a.s. if .
Finally, we consider the random quantization for the vector case. In particular, consider a compact set defined as
With some abuse of notation, given a vector we denote by , where , the quantization of -th coordinate of , for . Here, each is defined by using Eq. (3) with a uniform distance associated with each interval , for all .
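Since the displayed equations defining the quantizer are not reproduced above, the following Python sketch shows one standard implementation of such an unbiased (dithered) quantizer over a uniform grid; the helper names and the bin-counting convention are our own assumptions:

```python
import numpy as np

def quantize(x, lo, hi, num_bits, rng):
    """Random (dithered) quantization of scalar x in [lo, hi].

    The interval is covered by 2**num_bits uniformly spaced grid points
    with spacing delta; x is rounded to a neighboring grid point with
    probabilities chosen so that the quantizer is unbiased: E[Q(x)] = x.
    """
    levels = 2 ** num_bits               # number of grid points
    delta = (hi - lo) / (levels - 1)     # uniform spacing between points
    idx = int((x - lo) // delta)         # grid point just below x
    idx = min(idx, levels - 2)           # keep the upper neighbor in range
    floor_pt = lo + idx * delta
    # round up with probability proportional to the distance from floor_pt
    p_up = (x - floor_pt) / delta
    if rng.random() < p_up:
        return floor_pt + delta
    return floor_pt

def quantize_vec(x, lo, hi, num_bits, rng):
    """Coordinate-wise random quantization of a vector."""
    return np.array([quantize(xi, lo, hi, num_bits, rng) for xi in x])
```

Because the quantizer rounds up with probability proportional to the distance from the lower grid point, its output is unbiased, and its error is bounded by the grid spacing almost surely.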
2. Problem Formulation
We consider an optimization problem defined over a network of nodes. Associated with each node is a nonsmooth convex function over a convex compact set . The goal is to solve
Each node knows only its local function , and since there is no central coordination, the nodes are required to cooperatively solve the problem. We will use a distributed consensus-based (sub)gradient method where each node maintains its own version of the decision variables ; the goal is to have all the converge to , a solution of problem (5). Each node can exchange a quantized version of with its neighbors, as defined through a connected and undirected graph , where and are the vertex and edge sets, respectively. We denote by the set of node ’s neighbors.
A concrete motivating example for this problem is distributed linear regression solved over a network of processors. Regression problems involving massive amounts of data are common in machine learning; see, for example, (Shalev-Shwartz and Ben-David, 2014; Hastie et al., 2009). Each function is the empirical loss over the local data stored at processor . The objective is to minimize the total loss over the entire dataset. Due to the difficulty of storing the enormous amount of data at a central location, the processors perform local computations over their local data, and the results are then exchanged to arrive at the globally optimal solution. Distributed gradient methods are a natural choice for such problems since they have been observed to be both fast and easily parallelizable in the case where the processors can exchange data instantaneously. The goal of this paper is to show that the algorithm remains convergent even when the nodes only exchange quantized values of their variables due to the finite bandwidths shared between them. In particular, we derive expressions for the convergence rate as a function of the communication bandwidths and the underlying network topology.
In the sequel, we will use to denote the optimal value of problem (5), i.e., where is a solution of problem (5). We denote by the solution set of problem (5), which is nonempty due to the compactness of . In addition, since is compact it is obvious that each is Lipschitz continuous with some positive constant , as stated in the following proposition.
Proposition 1.
Each function , for all , is -Lipschitz continuous, i.e., Eq. (1) holds for some for all .
3. Distributed Gradient Methods Under Random Quantization
Distributed subgradient (DSG) methods, Eq. (6), for solving problem (5) were first studied and analyzed rigorously in (Nedić and Ozdaglar, 2009; Nedić et al., 2010). In these methods each node iteratively updates as
where is some sequence of stepsizes and . Here, is some positive weight which node assigns to . We assume that these weights, which capture the topology of , satisfy the following condition.
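As a concrete illustration, a minimal sketch of one round of the update in Eq. (6) might look as follows; since the displayed equation is elided above, the order of the consensus and subgradient steps here is an assumption based on the surrounding description:

```python
import numpy as np

def dsg_step(X, W, subgrads, alpha, project):
    """One round of the distributed subgradient update sketched in Eq. (6).

    X[i] is node i's current estimate, W the doubly stochastic weight
    matrix, subgrads[i] a subgradient oracle for f_i, alpha the stepsize,
    and project the Euclidean projection onto the constraint set.
    """
    n = X.shape[0]
    X_new = W @ X                      # consensus (weighted averaging) step
    for i in range(n):
        # projected subgradient step, with the subgradient taken at X[i]
        X_new[i] = project(X_new[i] - alpha * subgrads[i](X[i]))
    return X_new
```

With quadratic local losses, repeated application of this step with a diminishing stepsize drives every node's estimate toward the minimizer of the sum of the local functions.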
Assumption 1.
The matrix , whose -th entries are , is doubly stochastic, i.e., . Moreover, is irreducible and aperiodic. Finally, the weights are positive if and only if the corresponding nodes are neighbors in the graph, and are zero otherwise.
This assumption also implies that has 1 as its largest singular value, while all other singular values are strictly less than 1; see, for example, the Perron-Frobenius theorem (Horn and Johnson, 1985). Also, we denote by the second largest singular value of , which by the Courant-Fischer theorem (Horn and Johnson, 1985) gives
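For a small numerical check of these spectral facts, the singular values of a doubly stochastic mixing matrix can be computed directly; the 3-node path-graph matrix below is our own illustrative choice (its lazy Metropolis weights), not one taken from the paper:

```python
import numpy as np

# Doubly stochastic mixing matrix for a 3-node path graph (lazy Metropolis
# weights: 1/4 on each edge, remaining mass on the diagonal).
W = np.array([[0.75, 0.25, 0.0],
              [0.25, 0.5,  0.25],
              [0.0,  0.25, 0.75]])

# Singular values in descending order; the largest is 1, and the second
# largest governs the speed of consensus.
sigma = np.linalg.svd(W, compute_uv=False)
sigma_1, sigma_2 = sigma[0], sigma[1]
```

For this matrix the singular values are 1, 0.75, and 0.25, so the consensus error contracts by a factor of 0.75 per mixing step.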
Our focus in this section is to study DSG under random quantization in communication between nodes. In particular, at any iteration the nodes are only allowed to send and receive the quantized values of their local copies to their neighboring nodes. Due to such quantized communication, we modify the update in Eq. (6), that is, each node now considers the following update
where is given in Eq. (3). Here, in addition to , we introduce a new stepsize due to the random quantization of the values exchanged between nodes.
This update has a simple interpretation. At any time , each node first obtains the quantized version of its value . Each node then forms a convex combination of its own value and the weighted quantized values received from its neighbors, with the goal of seeking a consensus on their estimates. Each node then moves along the subgradient of its respective objective function to update its estimate, pushing the consensus point toward the optimal set . The distributed subgradient algorithm under random quantization is formally stated in Algorithm 1.
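A minimal sketch of one round of the quantized update follows; the displayed Eq. (8) is elided above, so the convex-combination form used here is our reconstruction from the surrounding description, with the quantizer passed in as a function:

```python
import numpy as np

def quantized_dsg_step(X, W, subgrads, alpha, beta, quantizer, project):
    """One round of the quantized update sketched in Eq. (8) (algebraic
    form assumed): each node mixes its own value with the quantized values
    of its neighbors using the consensus stepsize beta, then takes a
    projected subgradient step with stepsize alpha."""
    # nodes only observe quantized values of each other's estimates
    Q = np.array([quantizer(x) for x in X])
    X_new = (1.0 - beta) * X + beta * (W @ Q)
    for i in range(len(X)):
        X_new[i] = project(X_new[i] - alpha * subgrads[i](X[i]))
    return X_new
```

Setting the quantizer to the identity and the consensus stepsize beta to one recovers the unquantized update in Eq. (6).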
3.1. The Role of
We discuss in this section some aspects of the new stepsize . First, one can interpret Eq. (8) as a distributed two-time-scale stochastic algorithm (Borkar, 2008; Kushner and Yin, 2003; Wang et al., 2017; Konda and Tsitsiklis, 2004), where the first sum plays the role of the fast time scale while the gradient step is the slow time scale. Due to the random quantization, each node first uses the fast time scale to estimate the true average of the estimates. Each node then applies the gradient step to slowly push its estimate toward a solution of problem (5). As will be seen, will be chosen relatively large compared to in order to guarantee the convergence of Algorithm 1.
Second, it is obvious that when , for all , we recover the update in Eq. (6). In a sense, introducing gives us one more degree of freedom in designing our algorithm, especially when dealing with communication constraints. This has also been observed in our previous works (Doan et al., 2017, 2018a), where we use a constant to study the impact of network latencies on the performance of distributed gradient methods.
Finally, one can view , in addition to , as a weight which each node uses to indicate that it “trusts” its own value more than the value received from its neighbor . As will be seen, to guarantee the convergence of the algorithm we will let go to zero at some proper rate, implying that eventually each node only uses its own value.
4. Convergence Results
The focus of this section is to study the convergence properties of Algorithm 1 for solving problem (5), when the objective functions are both convex and strongly convex. The key idea of our analysis is to utilize the standard techniques used in centralized subgradient methods and stochastic approximation approach. In particular, for convex objective functions we first show that , for all , converges almost surely to a solution of problem (5) under some proper choice of stepsizes . We next show the convergence of the function estimated at the time weighted average of each to the optimal value in expectation at a rate , where is the spectral gap of the network connectivity. Finally, when the objective functions are strongly convex, we derive the convergence of the time-weighted average of each to an optimal solution of problem (5) in expectation at a rate .
We start our analysis by introducing more notation. Given the nodes’ estimates in we denote by a matrix whose -th rows are , i.e.,
Let be the average of , i.e.,
For convenience, we use the following notation
Moreover, let be the filtration containing all the history generated by Eq. (8) up to time , i.e.,
Finally, given a vector let denote the error due to the projection of to , i.e.,
Thus, Eq. (8) now can be rewritten as
which by using the matrix form of Eq. (10) is given as
where is the matrix whose -th row is . In addition, since is doubly stochastic, we have
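The identity above reflects the fact that a doubly stochastic mixing matrix preserves the average of the nodes' estimates, which is easy to verify numerically; the 3-node matrix below is an illustrative choice of ours:

```python
import numpy as np

# A doubly stochastic mixing matrix preserves the average of the nodes'
# estimates: the mean row of X is unchanged by the map X -> W X.
rng = np.random.default_rng(0)
W = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])
X = rng.random((3, 2))   # three nodes, two-dimensional estimates
X_mixed = W @ X          # one round of mixing
```

This is the property that lets the analysis track the average iterate separately from the consensus error.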
In this section, we present some preliminary results, which are essential for the analysis given in the next section. For ease of exposition, we defer the proofs of all results in this section to the appendix. However, we present sketches of the proofs to explain some of the intuition behind our analysis.
We first provide an upper bound for the consensus error in the following lemma.
Lemma 1.
Sketch of Proof.
Next, we use the following Cauchy-Schwarz inequality with some and
Thus, by taking the -norm square of the first equation and using the preceding Cauchy-Schwarz inequality with we have
Second, using the projection lemma, Lemma 5(b) in the Appendix, one can show
Third, by Eq. (4) we have
Fourth, by Eq. (7) we have
We next provide proper conditions on the stepsizes , which guarantee that the nodes achieve a consensus on their estimates . The analysis of this lemma is a consequence of Lemma 1 and Lemma 4, the almost supermartingale convergence theorem given later.
Lemma 2.
Remark 1.
One example of stepsizes , which satisfies Eq. (16), can be chosen as follows
Third, we study an upper bound for the optimal distance in the following lemma.
Lemma 3.
Sketch of Proof.
First, we use Eq. (12) to obtain
which, by expanding the right-hand side and taking the conditional expectation, yields
Finally, we utilize the result on almost supermartingale convergence studied in (Robbins and Siegmund, 1971), stated as follows.
Lemma 4 (Robbins and Siegmund, 1971).
Let , , and be non-negative sequences of random variables and satisfy
where , the history of up to time . Then converges a.s., and a.s.
4.2. Convergence Results for Convex Functions
In this section, we study the convergence and the rate of convergence of Algorithm 1 when the objective functions are convex. For ease of explanation, we provide only sketches of the proofs for the main results in this section and the next; their full details are presented in Section 6.
Theorem 1.
Thus, using these relations and a standard analysis of the convergence of subsequences of we can obtain Eq. (21). ∎
We now study the rate of convergence of Algorithm 1 to the optimal value in expectation, where we utilize a technique similar to the one used to establish the convergence rate of centralized subgradient methods. In particular, we show that if each node maintains a variable used to estimate the time-weighted average of its local copy , then the function value estimated at each converges in expectation to the optimal value. The dependence of the upper bound on the variance of the quantization error is natural, as we often observe in stochastic gradient descent, where such dependence is on the variance of the gradient noise. This result is derived under different assumptions on the stepsizes, as shown in the following theorem. We note that the conditions on the stepsizes in Theorems 1 and 2 are common choices used to derive the asymptotic convergence and the convergence rate of centralized subgradient methods, respectively; see, for example, (Nesterov, 2004). Note that while the previous theorem studies almost sure convergence of the local copies, this theorem studies convergence in expectation of the function value, so it is not surprising that the stepsizes are different.
Theorem 2.
Second, we utilize Eq. (14) to obtain the following for some
Remark 2.
We note that in Theorem 2 can be iteratively updated as follows
where and for .
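Since the displayed recursion in Remark 2 is elided, the following sketch shows one standard way to maintain such a stepsize-weighted running average without storing the whole history; the function and variable names are our own:

```python
import numpy as np

def running_weighted_average(xs, alphas):
    """Iteratively maintain the stepsize-weighted average
    z_K = (sum_k alpha_k * x_k) / (sum_k alpha_k).

    Each step only needs the running weight sum S and the current
    average z, so no iterate history is stored.
    """
    S, z = 0.0, None
    for x, a in zip(xs, alphas):
        S += a
        # incremental update: z <- z + (a / S) * (x - z)
        z = x.copy() if z is None else z + (a / S) * (x - z)
    return z
```

A short induction shows the recursion reproduces the weighted average exactly: if z holds the average over the first k terms with total weight S, then z + (a/(S+a))(x - z) = (S z + a x)/(S + a).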
4.3. Convergence Results for Strongly Convex Functions
We study here the convergence rate of Algorithm 1 when are strongly convex, that is, we consider the following assumption.
Assumption 2.
Each function , for all , is -strongly convex, i.e., Eq. (2) holds for some .
Note that this assumption implies that is strongly convex where . Under this assumption, we show the rate of convergence of Algorithm 1 to an optimal solution of problem (5) in expectation, stated in the following theorem.
Theorem 3.
5. Numerical Experiments
In this section, we apply Algorithm 1 to solve linear regression problems, the most popular technique for data fitting (Hastie et al., 2009; Shalev-Shwartz and Ben-David, 2014) in statistical machine learning, over a network of processors under random quantization. The goal of this problem is to find a linear relationship between a set of variables and some real-valued outcome. That is, given a training set for , we want to learn a parameter that minimizes
where and , i.e., . Here, are the loss functions defined over the dataset. For the purpose of our simulation, we will consider two loss functions, namely, quadratic loss and absolute loss functions. While the quadratic loss is strongly convex, the absolute loss is only convex.
First, when are quadratic, we have the well-known least square problem given as
Second, regression problems with absolute loss functions (the ℓ1 norm) are often referred to as robust regression, which is known to be robust to outliers (Karst, 1958); the problem is given as follows
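For concreteness, the two loss functions and their (sub)gradients can be written as follows; the function names and the matrix-vector data layout are our own conventions, not the paper's:

```python
import numpy as np

def quadratic_loss_grad(theta, A, b):
    """Least-squares loss (1/2)||A theta - b||^2 and its gradient."""
    r = A @ theta - b
    return 0.5 * r @ r, A.T @ r

def absolute_loss_subgrad(theta, A, b):
    """Robust (l1) regression loss ||A theta - b||_1 and a subgradient.

    np.sign picks the subgradient 0 at the kinks where a residual is
    exactly zero, which is a valid element of the subdifferential.
    """
    r = A @ theta - b
    return np.abs(r).sum(), A.T @ np.sign(r)
```

The quadratic loss is strongly convex when A has full column rank, while the absolute loss is convex but nonsmooth, matching the two regimes analyzed in Theorems 2 and 3.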
We consider simulated training data sets, i.e., the data are generated randomly with a uniform distribution. We evaluate the performance of the distributed subgradient methods on an undirected connected graph of nodes. Our graph is generated as follows.
In each network, we first randomly generate the nodes’ coordinates in the plane with uniform distribution.
Then any two nodes are connected if their distance is less than a reference number , e.g., for our simulations.
Finally, we check whether the network is connected. If it is not, we return to the first step and run the program again.
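The three steps above can be sketched as follows; the unit square for the coordinates and the helper names are our assumptions, and the reference radius is left as a parameter since its value is elided above:

```python
import numpy as np

def is_connected(adj):
    """Graph search from node 0 must reach every node."""
    n = adj.shape[0]
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(adj[u]):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

def random_geometric_graph(n, radius, rng):
    """Generate a connected random geometric graph.

    Sample n points uniformly in the unit square, connect pairs closer
    than `radius`, and resample until the resulting graph is connected.
    """
    while True:
        pts = rng.random((n, 2))
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        adj = (d < radius) & ~np.eye(n, dtype=bool)
        if is_connected(adj):
            return adj
```

Note that the resampling loop can run for a long time if the radius is small relative to the number of nodes, since sparse geometric graphs are rarely connected.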
To implement our algorithm, the adjacency matrix is chosen as a lazy Metropolis matrix corresponding to , i.e.,
It is straightforward to verify that the lazy Metropolis matrix satisfies Assumption 1.
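A sketch of the lazy Metropolis construction, using the common convention that the weight on each edge (i, j) is 1/(2 max(d_i, d_j)) for node degrees d_i, d_j; the displayed formula is elided above, so this convention is an assumption:

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis weights for an undirected graph.

    Given a boolean adjacency matrix, set W_ij = 1/(2 max(d_i, d_j)) for
    each edge and let W_ii absorb the remaining mass so each row sums to
    one. The result is symmetric and doubly stochastic.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (2.0 * max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # remaining mass on the diagonal
    return W
```

One can check that the resulting matrix is symmetric and doubly stochastic with diagonal entries at least one half, consistent with Assumption 1.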
5.1. Convergence of Function Values
In this simulation, we apply variants of distributed subgradient methods to solving the linear regression problems. In particular, we compare the performance of these methods in three different scenarios, namely, DSG with no quantization (i.e., Eq. (6)), DSG with time-varying quantization in (Doan et al., 2018b), and the proposed stochastic variant of Eq. (6) in Algorithm 1. The plots in Fig. 1 show the convergence of these three methods for both quadratic and absolute loss functions.
Note that, to achieve asymptotic convergence, the work in (Doan et al., 2018b) requires the nodes to eventually exchange an infinite number of bits. On the other hand, Algorithm 1 in this paper assumes the nodes use a fixed finite number of bits in their communication. However, as observed both in Fig. 1(a) for the quadratic loss and in Fig. 1(b) for the absolute loss, Algorithm 1 performs almost as well as the one in (Doan et al., 2018b).
5.2. Impacts of the Number of Bits
Here, we consider the impact of the number of communication bits on the performance of Algorithm 1. In Fig. 2 we plot the number of iterations needed to reach a given relative error as a function of the number of bits. As we can see, the more bits we use, the faster the algorithm converges. Moreover, the number of iterations required by the algorithm appears to be the same once the number of bits exceeds a threshold, which makes sense due to numerical rounding in the computer program. Finally, the curves in Fig. 2 seem to reflect the dependence of the rate of Algorithm 1 on the quantization variance within some constant factor, which agrees with our results in Theorems 2 and 3.
6. Proofs of Main Results
In this section, we present the proofs of our main results given in Section 4.
6.1. Proof of Theorem 1
By the convexity of we have
which by the -Lipschitz continuity of in Proposition 1 yields