Distributed Semi-Stochastic Optimization with Quantization Refinement
Abstract
We consider the problem of regularized regression in a network of communication-constrained devices. Each node has local data and objectives, and the goal is for the nodes to optimize a global objective. We develop a distributed optimization algorithm that is based on recent work on semi-stochastic proximal gradient methods. Our algorithm employs iteratively refined quantization to limit message size. We present theoretical analysis and conditions for the algorithm to achieve a linear convergence rate. Finally, we demonstrate the performance of our algorithm through numerical simulations.
I Introduction
We consider the problem of distributed optimization in a network where communication is constrained, for example a wireless sensor network. In particular, we focus on problems where each node has local data and objectives, and the goal is for the nodes to learn a global objective that includes this local information. Such problems arise in networked systems, in tasks such as estimation, prediction, resource allocation, and control.
Recent works have proposed distributed optimization methods that reduce communication by using quantization. For example, in [1], the authors propose a distributed algorithm to solve unconstrained problems based on a centralized inexact proximal gradient method [2]. In [3], the authors extend their work to constrained optimization problems. In these algorithms, the nodes compute a full gradient step in each iteration, requiring quantized communication between every pair of neighboring nodes. Quantization has been applied in distributed consensus algorithms [4, 5, 6] and distributed subgradient methods [7].
In this work, we address the specific problem of distributed regression with regularization over the variables across all nodes. Applications of our approach include distributed compressed sensing, LASSO, group LASSO, and regression with Elastic Net regularization, among others. Our approach is inspired by [1, 3]. We seek to further reduce per-iteration communication by using an approach based on a stochastic proximal gradient algorithm. This approach only requires communication between a small subset of nodes in each iteration. In general, stochastic gradient methods may suffer from slow convergence. Thus any per-iteration communication savings could be counteracted by an extended number of iterations. Recently, however, several works have proposed semi-stochastic gradient methods [8, 9, 10]. To reduce the variance of the iterates generated by a stochastic approach, these algorithms periodically incorporate a full gradient computation. It has been shown that these algorithms achieve a linear rate of convergence to the optimal solution.
We propose a distributed algorithm for regularized regression based on the centralized semi-stochastic proximal gradient algorithm of [10]. In most iterations, only a subset of nodes need communicate. We further reduce communication overhead by employing quantized messaging. Our approach reduces both the length of the messages sent between nodes and the total number of messages needed to converge to the optimal solution. The detailed contributions of our work are as follows:

We extend the centralized semi-stochastic proximal gradient algorithm to include errors in the gradient computations and show the convergence rate of this inexact algorithm.

We propose a distributed optimization algorithm based on this centralized algorithm that uses iteratively refined quantization to limit message size.

We show that our distributed algorithm is equivalent to the centralized algorithm, where the errors introduced by quantization can be interpreted as inexact gradient computations. We further design quantizers that guarantee a linear convergence rate to the optimal solution.

We demonstrate the performance of the proposed algorithm in numerical simulations.
The remainder of this paper is organized as follows. In Section II, we present the centralized inexact proximal gradient algorithm and give background on quantization. In Section III, we give the system model and problem formulation. Section IV details our distributed algorithm. Section V provides theoretical analysis of our proposed algorithm. Section VI presents our simulation results, and we conclude in Section VII.
II Preliminaries
II-A Inexact Semi-Stochastic Proximal Gradient Algorithm
We consider an optimization problem of the form:
(1) 
where , and the following assumptions are satisfied.
Assumption 1
Each is differentiable, and its gradient is Lipschitz continuous with constant , i.e., for all ,
(2) 
Assumption 2
The function is lower semicontinuous, convex, and its effective domain, , is closed.
Assumption 3
The function is strongly convex with parameter , i.e., for all and for all ,
(3) 
where is the subdifferential of at . This strong convexity may come from either or (or both).
Problem (1) can be solved using a stochastic proximal gradient algorithm [11] where, in each iteration, a single is computed for a randomly chosen , and the iterate is updated accordingly as,
Here, is the proximal operator
While stochastic methods offer the benefit of reduced per-iteration computation over standard gradient methods, the iterates may have high variance. These methods typically use a decreasing step size to compensate for this variance, resulting in slow convergence. Recently, Xiao and Zhang proposed a semi-stochastic proximal gradient algorithm, Prox-SVRG, that reduces the variance by periodically incorporating a full gradient computation [10]. This modification allows Prox-SVRG to use a constant step size, and thus, Prox-SVRG achieves a linear convergence rate.
We extend Prox-SVRG to include a zero-mean error in the gradient computation. Our resulting algorithm, Inexact Prox-SVRG, is given in Algorithm 1. The algorithm consists of an outer loop where the full gradient is computed and an inner loop where the iterate is updated based on both the stochastic and full gradients.
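The two-loop structure described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the smooth terms are taken to be least-squares losses, the regularizer is an L1 penalty (so the proximal step is soft-thresholding), the gradient error is modeled as zero-mean Gaussian noise whose standard deviation decays geometrically per outer iteration, and the outer iterate is the average of the inner iterates as in Prox-SVRG [10]. All function and parameter names (`inexact_prox_svrg`, `err_std`, `err_decay`, and so on) are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (component-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inexact_prox_svrg(A, b, lam, eta, num_outer, num_inner,
                      err_std=0.0, err_decay=1.0, seed=0):
    """Sketch of Prox-SVRG with a zero-mean gradient error.

    Assumed objective: (1/N) sum_i 0.5*(a_i^T x - b_i)^2 + lam*||x||_1.
    """
    rng = np.random.default_rng(seed)
    num_terms, dim = A.shape
    x_tilde = np.zeros(dim)
    for s in range(num_outer):
        # Outer loop: full gradient at the reference point x_tilde.
        full_grad = A.T @ (A @ x_tilde - b) / num_terms
        x = x_tilde.copy()
        inner_iterates = []
        for _ in range(num_inner):
            # Inner loop: one randomly chosen term, variance-reduced.
            i = rng.integers(num_terms)
            grad_i = A[i] * (A[i] @ x - b[i])
            grad_i_ref = A[i] * (A[i] @ x_tilde - b[i])
            # Assumed error model: zero-mean noise, shrinking with s.
            error = err_std * err_decay**s * rng.standard_normal(dim)
            v = grad_i - grad_i_ref + full_grad + error
            x = soft_threshold(x - eta * v, eta * lam)
            inner_iterates.append(x)
        # New reference point: average of the inner iterates.
        x_tilde = np.mean(inner_iterates, axis=0)
    return x_tilde

def objective(A, b, lam, x):
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Small synthetic problem (all values are illustrative).
data_rng = np.random.default_rng(1)
A = data_rng.standard_normal((50, 5))
x_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
b = A @ x_true + 0.01 * data_rng.standard_normal(50)
lam = 0.01
# Constant step size, scaled by the largest per-term Lipschitz constant.
eta = 0.1 / np.max(np.sum(A**2, axis=1))
x_hat = inexact_prox_svrg(A, b, lam, eta, num_outer=30, num_inner=300,
                          err_std=0.05, err_decay=0.7)
```

Setting `err_std=0.0` recovers the exact variant, which makes it easy to compare the two on the same data.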
The following theorem states the convergence behavior of Algorithm 1.
Theorem 1
The proof is given in the appendix.
From this theorem, we can derive conditions for the algorithm to converge to the optimal . Let the sequence decrease linearly at a rate . Then

If , then converges linearly with a rate of .

If , then converges linearly with a rate of .

If , then converges linearly with a rate in .
II-B Subtractively Dithered Quantization
We employ a subtractively dithered quantizer to quantize values before transmission. We use a subtractively dithered quantizer rather than a non-subtractively dithered quantizer because the quantization error of the subtractively dithered quantizer is not correlated with its input. We briefly summarize the quantizer and its key properties below.
Let be a real number to be quantized into bits. The quantizer is parameterized by an interval size and a midpoint value . Thus the quantization interval is , and the quantization step size is . We first define the uniform quantizer,
(4) 
In subtractively dithered quantization, a dither is added to , the resulting value is quantized using a uniform quantizer, and then transmitted. The recipient then subtracts from this value. The subtractively dithered quantized value of , denoted , is thus
(5) 
Note that this quantizer requires both the sender and recipient to use the same value for , for example, by using the same pseudorandom number generator.
The following theorem describes the statistical properties of the quantization error.
Theorem 2 (See [12])
Let and , for in (5). Further, let be a real number drawn uniformly at random from the interval . The quantization error satisfies the following:

.



For and in the interval ,
With some abuse of notation, we also write where is a vector. In this case, the quantization operator is applied to each component of independently, using a vector-valued midpoint and the same scalar-valued interval bounds.
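The quantizer described in this subsection can be sketched as below. This is a minimal illustration, not the exact quantizer of (4) and (5): it assumes the step size is the interval size divided by 2^bits - 1, rounds to the nearest level around the midpoint, and omits clipping of the transmitted index to a fixed bit range. The sender and recipient are given identically seeded generators to model the shared pseudorandom dither. The names `encode`, `decode`, `mid`, `interval`, and `bits` are hypothetical.

```python
import numpy as np

def encode(x, mid, step, rng):
    """Sender: add dither, then map to an integer level index."""
    dither = rng.uniform(-step / 2, step / 2, size=np.shape(x))
    return np.rint((x + dither - mid) / step).astype(np.int64)

def decode(index, mid, step, rng):
    """Recipient: rebuild the level and subtract the same dither."""
    dither = rng.uniform(-step / 2, step / 2, size=np.shape(index))
    return mid + index * step - dither

# Illustrative parameters (assumptions, not values from the paper).
bits, interval, mid = 4, 2.0, 0.5
step = interval / (2**bits - 1)

# Both ends seed their generators identically to share the dither.
sender_rng = np.random.default_rng(42)
recipient_rng = np.random.default_rng(42)

# Vector input: the quantizer acts on each component independently.
x = np.random.default_rng(7).uniform(mid - interval / 2,
                                     mid + interval / 2, size=100_000)
x_hat = decode(encode(x, mid, step, sender_rng), mid, step, recipient_rng)

err = x_hat - x
max_abs_err = np.max(np.abs(err))   # bounded by step / 2
mean_err = np.mean(err)             # close to zero
var_ratio = np.var(err) / (step**2 / 12)  # close to one
```

The empirical mean near zero and variance near step^2 / 12 are the well-known error statistics of subtractive dithering with a uniform dither over one quantization step.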
III Problem Formulation
We consider a similar system model to that in [1]. The network is a connected graph of nodes where inter-node communication is limited to the local neighborhood of each node. The neighbor set consists of node ’s neighbors and itself. The neighborhoods are determined by the fixed undirected graph . We denote as the maximum degree of the graph .
Each node has a state vector with dimension . The state of the system is . We let be the vector consisting of the concatenation of states of all nodes in . For ease of exposition, we define the selecting matrices , , where and the matrices , where . Each of these matrices has norm 1.
Every node has a local objective function over the states in . The distributed optimization problem is thus,
(6) 
where . We assume that Assumptions 1 and 3 are satisfied. Further, we require that the following assumptions hold.
Assumption 4
For all , is linear or constant. This implies that, for a zero-mean random variable , .
Assumption 5
The proximal operation can be performed by each node locally, i.e.,
We note that Assumption 5 holds for standard regularization functions, including those used in LASSO (), group LASSO where each is its own group, and Elastic Net regularization ().
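To illustrate why such regularizers allow a local proximal step, the sketch below compares a proximal operator applied to the stacked state with the same operator applied node by node. The closed forms used are the standard ones for the L1 norm, the Euclidean (group) norm, and an Elastic Net penalty. The Elastic Net parametrization `lam1 * ||x||_1 + 0.5 * lam2 * ||x||_2^2`, the node dimensions, and all names are assumptions made for illustration.

```python
import numpy as np

def prox_l1(v, t):
    """prox of t * ||x||_1: component-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_group(v, t):
    """prox of t * ||x||_2 for a single group: block shrinkage."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def prox_elastic_net(v, t, lam1, lam2):
    """prox of t * (lam1 * ||x||_1 + 0.5 * lam2 * ||x||_2^2)."""
    return prox_l1(v, t * lam1) / (1.0 + t * lam2)

# Three hypothetical nodes with state dimensions 2, 3, and 4.
rng = np.random.default_rng(0)
node_states = [rng.standard_normal(d) for d in (2, 3, 4)]
stacked = np.concatenate(node_states)
t, lam1, lam2 = 0.3, 1.0, 0.5

# Each node applies the operator to its own block only.
local_l1 = np.concatenate([prox_l1(v, t) for v in node_states])
local_en = np.concatenate(
    [prox_elastic_net(v, t, lam1, lam2) for v in node_states])

# The same operators applied to the full stacked vector.
global_l1 = prox_l1(stacked, t)
global_en = prox_elastic_net(stacked, t, lam1, lam2)

l1_is_separable = np.allclose(local_l1, global_l1)
en_is_separable = np.allclose(local_en, global_en)

# Group LASSO with one group per node: each block norm shrinks by t.
group_out = [prox_group(v, t) for v in node_states]
```

Because each penalty is a sum of terms that involve only one node's block, no communication is needed for the proximal step.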
In the next section, we present our distributed implementation of Prox-SVRG to solve Problem (6).
IV Algorithm
Our distributed algorithm is given in Algorithm 2. In each outer iteration , node quantizes its iterate and the gradient and sends them to all of its neighbors. These values are quantized using two subtractively dithered quantizers, and , whereby the sender (node ) sends an bit representation and the recipient reconstructs the value from this representation and subtracts the dither. The midpoints for and are set to be the quantized values from the previous iteration. Thus, the recipients already know these midpoints. The quantized values (after the dither is subtracted) are denoted by and , and the quantization errors are and , respectively.
For every iteration of the outer loop of the algorithm, there is an inner loop of iterations. In each inner iteration, a single node , chosen at random, computes its gradient. To do this, node and its neighbors exchange their states and gradients . These values are quantized using two subtractively dithered quantizers, and . The midpoints for these quantizers are and . Each node sends these values to its neighbors before the inner loop, so all nodes are aware of the midpoints. The quantized values (after the dither is subtracted) are denoted by and , and their quantization errors are and , respectively. The quantization interval bounds , , , and , are initialized to , , , and , respectively, and in each iteration, the bounds are multiplied by . Thus the quantizers are refined in each iteration.
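The refinement idea, in which the midpoint is the previously reconstructed value and the interval shrinks by a constant factor each iteration, can be illustrated on a scalar sequence. The sketch below is a toy example under stated assumptions rather than Algorithm 2 itself: the tracked sequence converges linearly by construction, the initial midpoint is assumed to be known to both ends, and the names `refine_and_track`, `interval0`, and `shrink` are hypothetical.

```python
import numpy as np

def refine_and_track(values, interval0, shrink, bits, mid0, seed=0):
    """Quantize a sequence with a midpoint that follows the last
    reconstruction and an interval multiplied by `shrink` each step."""
    # Identically seeded generators model the shared dither.
    sender_rng = np.random.default_rng(seed)
    recipient_rng = np.random.default_rng(seed)
    mid, interval = mid0, interval0
    recon, in_range = [], []
    for x in values:
        step = interval / (2**bits - 1)
        # Record whether x lies inside the current interval.
        in_range.append(abs(x - mid) <= interval / 2)
        # Sender side: dither, then integer level index.
        d_send = sender_rng.uniform(-step / 2, step / 2)
        index = int(np.rint((x + d_send - mid) / step))
        # Recipient side: rebuild and subtract the same dither.
        d_recv = recipient_rng.uniform(-step / 2, step / 2)
        x_hat = mid + index * step - d_recv
        recon.append(x_hat)
        # Refinement: new midpoint and a smaller interval.
        mid = x_hat
        interval *= shrink
    return np.array(recon), np.array(in_range)

# A linearly converging sequence (illustrative values).
k = np.arange(30)
values = 3.0 + 0.8**k
bits, interval0, shrink = 4, 2.0, 0.8
recon, in_range = refine_and_track(values, interval0, shrink, bits, mid0=4.0)

errors = np.abs(recon - values)
# Half of the step size at iteration k bounds the error at iteration k.
bounds = 0.5 * interval0 * shrink**k / (2**bits - 1)
```

With a fixed number of bits, the error bound shrinks at the same rate as the interval, provided the values stay inside the interval, which is the condition the analysis section requires.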
The quantizers limit the length of a single variable transmission to bits. In the outer loop of the algorithm, each node sends its local variable, consisting of quantized components, to every neighbor. It also sends its gradient, consisting of quantized components, to every neighbor. Thus the number of bits exchanged by all nodes is bits. In each inner iteration, only nodes exchange messages. Each node quantizes state variables and sends them to node . This yields a transmission of bits in total. In turn, node quantizes its gradient and sends it to all of its neighbors, for a total of bits. Thus, in each inner iteration bits are transmitted. The total number of bits transmitted in a single outer iteration is therefore,
Let and . An upper bound on the number of bits transmitted by the algorithm in each outer iteration is .
V Algorithm Analysis
We now present our analysis of Algorithm 2. First we show that the algorithm is equivalent to Algorithm 1, where the quantization errors are encapsulated in the error term . We also give an explicit expression for this error term.
Lemma 1
The error is:
We note that all quantization errors are zero-mean. Further, by Assumption 4, , for a zero-mean random variable . Therefore, .
We now show that is uncorrelated with and the gradients , . Clearly, and are uncorrelated with the terms of containing , , and . In accordance with Assumption 4, the gradients and are either linear or constant. If they are constant, then and . Thus, the terms in containing these differences are also 0. If they are linear, e.g., , for an appropriately sized matrix and vector (possibly 0), then,
By Theorem 2, is uncorrelated with . It is clearly also uncorrelated with . Similar arguments can be used to show that and are uncorrelated with the remaining terms in .
With respect to , we have
The first term on the right-hand side can be bounded using the fact that , as
We now bound the first term in this expression,
where the first inequality follows from Assumptions 1 and 4 and the fact that . The second inequality follows from the independence of quantization errors (Theorem 2). Next we bound the second term,
where the first inequality uses the fact that for a random variable , . The remaining inequalities follow from Assumptions 1 and 4, the fact that , and the independence of the quantization errors.
Finally, again from the independence of the quantization errors, we have,
Combining these bounds, we obtain the desired result,
We next show that, if all of the values fall within their respective quantization intervals, then the error term decreases linearly with rate , and thus the algorithm converges to the optimal solution linearly with rate .
Theorem 3
Given , if for all , the values of , , , and fall inside of the respective quantization intervals , , , and , then , where,
with and .
It follows that, for