# Distributed Semi-Stochastic Optimization with Quantization Refinement

Neil McGlohon and Stacy Patterson

*This work was funded in part by NSF grants 1553340 and 1527287. N. McGlohon and S. Patterson are with the Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA (mcglon@rpi.edu, sep@cs.rpi.edu).

###### Abstract

We consider the problem of regularized regression in a network of communication-constrained devices. Each node has local data and objectives, and the goal is for the nodes to optimize a global objective. We develop a distributed optimization algorithm that is based on recent work on semi-stochastic proximal gradient methods. Our algorithm employs iteratively refined quantization to limit message size. We present theoretical analysis and conditions for the algorithm to achieve a linear convergence rate. Finally, we demonstrate the performance of our algorithm through numerical simulations.

## I Introduction

We consider the problem of distributed optimization in a network where communication is constrained, for example a wireless sensor network. In particular, we focus on problems where each node has local data and objectives, and the goal is for the nodes to learn a global objective that includes this local information. Such problems arise in networked systems, for example in estimation, prediction, resource allocation, and control.

Recent works have proposed distributed optimization methods that reduce communication by using quantization. For example, in [1], the authors propose a distributed algorithm to solve unconstrained problems based on a centralized inexact proximal gradient method [2]. In [3], the authors extend their work to constrained optimization problems. In these algorithms, the nodes compute a full gradient step in each iteration, requiring quantized communication between every pair of neighboring nodes. Quantization has been applied in distributed consensus algorithms [4, 5, 6] and distributed subgradient methods [7].

In this work, we address the specific problem of distributed regression with regularization over the variables across all nodes. Applications of our approach include distributed compressed sensing, LASSO, group LASSO, and regression with Elastic Net regularization, among others. Our approach is inspired by [1, 3]. We seek to further reduce per-iteration communication by using an approach based on a stochastic proximal gradient algorithm. This approach only requires communication between a small subset of nodes in each iteration. In general, stochastic gradient methods may suffer from slow convergence. Thus, any per-iteration communication savings could be counteracted by a larger number of iterations. Recently, however, several works have proposed semi-stochastic gradient methods [8, 9, 10]. To reduce the variance of the iterates generated by a stochastic approach, these algorithms periodically incorporate a full gradient computation. It has been shown that these algorithms achieve a linear rate of convergence to the optimal solution.

We propose a distributed algorithm for regularized regression based on the centralized semi-stochastic proximal gradient algorithm of [10]. In most iterations, only a subset of nodes need communicate. We further reduce communication overhead by employing quantized messaging. Our approach reduces both the length of the messages sent between nodes and the total number of messages needed to converge to the optimal solution. The detailed contributions of our work are as follows:

• We extend the centralized semi-stochastic proximal gradient algorithm to include errors in the gradient computations and show the convergence rate of this inexact algorithm.

• We propose a distributed optimization algorithm based on this centralized algorithm that uses iteratively refined quantization to limit message size.

• We show that our distributed algorithm is equivalent to the centralized algorithm, where the errors introduced by quantization can be interpreted as inexact gradient computations. We further design quantizers that guarantee a linear convergence rate to the optimal solution.

• We demonstrate the performance of the proposed algorithm in numerical simulations.

The remainder of this paper is organized as follows. In Section II, we present the centralized inexact proximal gradient algorithm and give background on quantization. In Section III, we give the system model and problem formulation. Section IV details our distributed algorithm. Section V provides theoretical analysis of our proposed algorithm. Section VI presents our simulation results, and we conclude in Section VII.

## II Preliminaries

### II-A Inexact Semi-Stochastic Proximal Gradient Algorithm

We consider an optimization problem of the form:

$$\underset{x \in \mathbb{R}^P}{\text{minimize}} \quad G(x) = F(x) + R(x), \tag{1}$$

where $F(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)$, and the following assumptions are satisfied.

###### Assumption 1

Each $f_i$ is differentiable, and its gradient is Lipschitz continuous with constant $L_i$, i.e., for all $x, y$,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|. \tag{2}$$

###### Assumption 2

The function $R$ is lower semicontinuous, convex, and its effective domain, $\mathrm{dom}(R)$, is closed.

###### Assumption 3

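The function $G$ is strongly convex with parameter $\mu > 0$, i.e., for all $x, y$ and for all $\xi \in \partial G(y)$,

$$G(x) - G(y) - \frac{1}{2}\mu \|x - y\|^2 \ge \xi^T (x - y), \tag{3}$$

where $\partial G(y)$ is the subdifferential of $G$ at $y$. This strong convexity may come from either $F$ or $R$ (or both).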

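Problem (1) can be solved using a stochastic proximal gradient algorithm [11] where, in each iteration, a single gradient $\nabla f_\ell$ is computed for a randomly chosen index $\ell$, and the iterate is updated as

$$x^{(t+1)} = \mathrm{prox}_{\eta R}\left(x^{(t)} - \eta^{(t)} \nabla f_\ell(x^{(t)})\right).$$

Here, $\mathrm{prox}_{\eta R}$ is the proximal operator

$$\mathrm{prox}_{\eta R}(v) = \underset{y \in \mathbb{R}^P}{\arg\min} \; \frac{1}{2}\|y - v\|^2 + \eta R(y).$$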
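As a concrete instance of the proximal operator, the following minimal sketch evaluates it for the $\ell_1$ regularizer $R(y) = \lambda \|y\|_1$, for which the minimization has the well-known closed form of component-wise soft-thresholding. The function name `prox_l1` and the parameter `lam` are illustrative choices, not notation from the paper.

```python
def prox_l1(v, eta, lam=1.0):
    """Proximal operator of R(y) = lam * ||y||_1 with step size eta.

    For this R, argmin_y 0.5*||y - v||^2 + eta*R(y) is soft-thresholding
    of each component of v by the threshold eta * lam.
    """
    thresh = eta * lam
    out = []
    for vi in v:
        if vi > thresh:
            out.append(vi - thresh)   # shrink positive entries toward zero
        elif vi < -thresh:
            out.append(vi + thresh)   # shrink negative entries toward zero
        else:
            out.append(0.0)           # entries within the threshold become zero
    return out


print(prox_l1([3.0, -0.5, -2.0], eta=1.0))
```

Because the operation acts on each component independently, it can be evaluated block by block on a partitioned vector.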

While stochastic methods offer the benefit of reduced per-iteration computation over standard gradient methods, the iterates may have high variance. These methods typically use a decreasing step size to compensate for this variance, resulting in slow convergence. Recently, Xiao and Zhang proposed a semi-stochastic proximal gradient algorithm, Prox-SVRG, that reduces the variance by periodically incorporating a full gradient computation [10]. This modification allows Prox-SVRG to use a constant step size, and thus, Prox-SVRG achieves a linear convergence rate.

We extend Prox-SVRG to include a zero-mean error in the gradient computation. Our resulting algorithm, Inexact Prox-SVRG, is given in Algorithm 1. The algorithm consists of an outer loop where the full gradient is computed and an inner loop where the iterate is updated based on both the stochastic and full gradients.
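The outer/inner loop structure described above can be sketched as follows. This is not a transcription of Algorithm 1; it is a minimal sketch that assumes the standard Prox-SVRG update of [10] with an additive zero-mean Gaussian perturbation standing in for the gradient error, least-squares terms $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$, and an $\ell_1$ regularizer. The names `inexact_prox_svrg`, `noise`, `S`, and `lam` are hypothetical.

```python
import random


def soft_threshold(v, thresh):
    """Prox of thresh * ||.||_1, applied component-wise."""
    return [vi - thresh if vi > thresh else vi + thresh if vi < -thresh else 0.0
            for vi in v]


def inexact_prox_svrg(A, b, lam, eta, T, S, noise=0.0, seed=0):
    """Sketch of Prox-SVRG with a zero-mean error added to each gradient estimate.

    Assumes f_i(x) = 0.5 * (a_i . x - b_i)^2 and R(x) = lam * ||x||_1.
    """
    rng = random.Random(seed)
    N, P = len(A), len(A[0])

    def grad_i(i, x):
        r = sum(A[i][k] * x[k] for k in range(P)) - b[i]
        return [r * A[i][k] for k in range(P)]

    x_tilde = [0.0] * P
    for _ in range(S):                       # outer loop: full gradient at snapshot
        full = [0.0] * P
        for i in range(N):
            g = grad_i(i, x_tilde)
            full = [full[k] + g[k] / N for k in range(P)]
        x = list(x_tilde)
        avg = [0.0] * P
        for _ in range(T):                   # inner loop: one random component
            l = rng.randrange(N)
            g_x, g_snap = grad_i(l, x), grad_i(l, x_tilde)
            # variance-reduced estimate plus an (assumed Gaussian) zero-mean error
            v = [g_x[k] - g_snap[k] + full[k] + rng.gauss(0.0, noise)
                 for k in range(P)]
            x = soft_threshold([x[k] - eta * v[k] for k in range(P)], eta * lam)
            avg = [avg[k] + x[k] / T for k in range(P)]
        x_tilde = avg                        # next snapshot: average of inner iterates
    return x_tilde
```

Setting `noise=0.0` recovers the exact variant; the step size stays constant across all iterations, in contrast to the decreasing step size of the purely stochastic method.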

The following theorem states the convergence behavior of Algorithm 1.

###### Theorem 1

Let be the sequence generated by Algorithm 1, with , where . Assume that the functions , , and , , satisfy Assumptions 1, 2, and 3, and that the errors are zero-mean and uncorrelated with the iterates and their gradients . Let , and let be such that,

$$\alpha = \frac{1}{\mu \eta (1 - 4\bar{L}\eta) T} + \frac{4\bar{L}\eta (T+1)}{(1 - 4\bar{L}\eta) T} < 1.$$

Then,

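$$\mathbb{E}\left[G(\tilde{x}^{(s)}) - G(x^\star)\right] \le \alpha^s \left( G(\tilde{x}^{(0)}) - G(x^\star) + \beta \sum_{i=1}^{s} \alpha^{-i} \Gamma^{(i)} \right)$$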

where and .

The proof is given in the appendix.
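To get a feel for the condition $\alpha < 1$, the expression for $\alpha$ in Theorem 1 can be evaluated numerically. The sketch below is illustrative only: the function name and the sample parameter values are assumptions, and the check $0 < \eta < 1/(4\bar{L})$ is the step-size requirement of Prox-SVRG in [10], which keeps the denominator positive.

```python
def contraction_factor(mu, L_bar, eta, T):
    """Evaluate alpha from Theorem 1 for strong convexity parameter mu,
    Lipschitz constant L_bar, step size eta, and inner loop length T."""
    if not 0.0 < eta < 1.0 / (4.0 * L_bar):
        raise ValueError("eta must lie in (0, 1/(4*L_bar))")
    denom = (1.0 - 4.0 * L_bar * eta) * T
    return 1.0 / (mu * eta * denom) + 4.0 * L_bar * eta * (T + 1) / denom


# Hypothetical values: a longer inner loop drives alpha down toward
# 4*L_bar*eta / (1 - 4*L_bar*eta).
for T in (50, 100, 1000):
    print(T, contraction_factor(mu=0.5, L_bar=1.0, eta=0.05, T=T))
```

For these sample values, $T = 50$ gives $\alpha > 1$ while $T = 100$ gives $\alpha < 1$, which illustrates that $T$ must be chosen large enough relative to $1/(\mu\eta)$.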

From this theorem, we can derive conditions for the algorithm to converge to the optimal . Let the sequence decrease linearly at a rate . Then

1. If , then converges linearly with a rate of .

2. If , then converges linearly with a rate of .

3. If , then converges linearly with a rate in .

### II-B Subtractively Dithered Quantization

We employ a subtractively dithered quantizer to quantize values before transmission. We use a subtractively dithered quantizer rather than a non-subtractively dithered quantizer because the quantization error of the subtractively dithered quantizer is not correlated with its input. We briefly summarize the quantizer and its key properties below.

Let be a real number to be quantized into bits. The quantizer is parameterized by an interval size and a midpoint value . Thus the quantization interval is , and the quantization step-size is . We first define the uniform quantizer,

 (4)

In subtractively dithered quantization, a dither is added to , the resulting value is quantized using a uniform quantizer, and then transmitted. The recipient then subtracts from this value. The subtractively dithered quantized value of , denoted , is thus

 ^z=Q(z)≜q(z+ν)−ν. (5)

Note that this quantizer requires both the sender and recipient to use the same value for , for example, by using the same pseudorandom number generator.
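The sender and recipient roles in (5) can be sketched as below. Several details are assumptions made for illustration: an interval of width `U` centered at `mid`, `2**n_bits - 1` quantization steps, rounding to the nearest grid point, a dither drawn uniformly from one step width centered at zero, and a shared integer `seed` standing in for the common pseudorandom number generator. The function names are hypothetical.

```python
import random


def _dither(seed, delta):
    # Both ends derive the same dither from the shared seed.
    return random.Random(seed).uniform(-delta / 2.0, delta / 2.0)


def encode(z, mid, U, n_bits, seed):
    """Sender: add the dither, then map to the index of the nearest grid point."""
    levels = 2 ** n_bits - 1
    delta = U / levels
    low = mid - U / 2.0
    k = round((z + _dither(seed, delta) - low) / delta)
    return max(0, min(levels, k))          # clip if the value leaves the interval


def decode(k, mid, U, n_bits, seed):
    """Recipient: rebuild the grid point from the index, then subtract the dither."""
    levels = 2 ** n_bits - 1
    delta = U / levels
    low = mid - U / 2.0
    return low + k * delta - _dither(seed, delta)


k = encode(0.3, mid=0.0, U=2.0, n_bits=4, seed=7)
z_hat = decode(k, mid=0.0, U=2.0, n_bits=4, seed=7)
print(k, z_hat, abs(z_hat - 0.3))
```

Only the integer index `k` crosses the network, so the message length is `n_bits` regardless of the precision of `z`. As long as the dithered value stays inside the interval, the reconstruction error is at most half a quantization step.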

The following theorem describes the statistical properties of the quantization error.

###### Theorem 2 (See [12])

Let and , for in (5). Further, let be a real number drawn uniformly at random from the interval . The quantization error satisfies the following:

1. .

2. For and in the interval ,

With some abuse of notation, we also write where is a vector. In this case, the quantization operator is applied to each component of independently, using a vector-valued midpoint and the same scalar-valued interval bounds.

## III Problem Formulation

We consider a similar system model to that in [1]. The network is a connected graph of nodes where inter-node communication is limited to the local neighborhood of each node. The neighbor set consists of node ’s neighbors and itself. The neighborhoods are determined by the fixed undirected graph . We denote as the maximum degree of the graph .

Each node has a state vector with dimension . The state of the system is . We let be the vector consisting of the concatenation of states of all nodes in . For ease of exposition, we define the selecting matrices , , where and the matrices , where . These matrices each have -norm of 1.

Every node has a local objective function over the states in . The distributed optimization problem is thus,

$$\underset{x \in \mathbb{R}^P}{\text{minimize}} \quad G(x) = F(x) + R(x), \tag{6}$$

where . We assume that Assumptions 1 and 3 are satisfied. Further, we require that the following assumptions hold.

###### Assumption 4

For all $i$, $\nabla f_i$ is linear or constant. This implies that, for a zero-mean random variable $\delta$, $\mathbb{E}[\nabla f_i(x + \delta)] = \nabla f_i(x)$.

###### Assumption 5

The proximal operation can be performed by each node locally, i.e.,

$$\mathrm{prox}_R(x) = \left[\mathrm{prox}_R(x_1)^T \;\; \mathrm{prox}_R(x_2)^T \;\; \ldots \;\; \mathrm{prox}_R(x_N)^T\right]^T.$$

We note that Assumption 5 holds for standard regularization functions used in LASSO (), group LASSO where each $x_i$ is its own group, and Elastic Net regularization ().

In the next section, we present our distributed implementation of Prox-SVRG to solve Problem (6).

## IV Algorithm

Our distributed algorithm is given in Algorithm 2. In each outer iteration , node quantizes its iterate and the gradient and sends it to all of its neighbors. These values are quantized using two subtractively dithered quantizers, and , whereby the sender (node ) sends an bit representation and the recipient reconstructs the value from this representation and subtracts the dither. The midpoints for and are set to be the quantized values from the previous iteration. Thus, the recipients already know these midpoints. The quantized values (after the dither is subtracted) are denoted by and , and the quantization errors are and , respectively.

For every iteration of the outer loop of the algorithm, there is an inner loop of iterations. In each inner iteration, a single node , chosen at random, computes its gradient. To do this, node and its neighbors exchange their states and gradients . These values are quantized using two subtractively dithered quantizers, and . The midpoints for these quantizers are and . Each node sends these values to their neighbors before the inner loop, so all nodes are aware of the midpoints. The quantized values (after the dither is subtracted) are denoted by and , and their quantization errors are and , respectively. The quantization interval bounds , , , and , are initialized to , , , and , respectively, and each iteration, the bounds are multiplied by . Thus the quantizers are refined in each iteration.
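The refinement idea, where each quantizer is centered at the previously quantized value and its interval shrinks by a constant factor per iteration, can be illustrated on a scalar sequence. The sketch below is not Algorithm 2: the transmitted sequence $z_k = 0.5^k$, the shrink factor `shrink`, the initial width `U0`, and the single generator standing in for the shared dither source are all hypothetical choices, picked so that every value stays inside its quantization interval.

```python
import random


def refine_demo(n_bits=4, U0=4.0, shrink=0.7, steps=12, seed=0):
    """Quantize a converging sequence with an iteratively refined quantizer."""
    rng = random.Random(seed)          # stands in for the PRNG shared by both ends
    levels = 2 ** n_bits - 1
    mid, U = 0.0, U0
    errors, half_steps = [], []
    for k in range(steps):
        z = 0.5 ** k                   # value to transmit; converges to 0
        delta = U / levels
        nu = rng.uniform(-delta / 2.0, delta / 2.0)
        low = mid - U / 2.0
        idx = max(0, min(levels, round((z + nu - low) / delta)))
        z_hat = low + idx * delta - nu  # recipient's reconstruction
        errors.append(abs(z_hat - z))
        half_steps.append(delta / 2.0)
        mid, U = z_hat, U * shrink      # recenter and shrink for the next round
    return errors, half_steps


errors, half_steps = refine_demo()
for e, h in zip(errors, half_steps):
    print(f"error {e:.2e}  bound {h:.2e}")
```

With a fixed number of bits per message, the error bound (half a quantization step) decays geometrically at the shrink rate. The shrink factor cannot be faster than the rate at which successive values approach one another, otherwise values would fall outside the interval and be clipped.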

The quantizers limit the length of a single variable transmission to bits. In the outer loop of the algorithm, each node sends its local variable, consisting of quantized components, to every neighbor. It also sends its gradient, consisting of quantized components to every neighbor. Thus the number of bits exchanged by all nodes is bits. In each inner iteration, only nodes exchange messages. Each node quantizes state variables and sends them to node . This yields a transmission of bits in total. In turn, node quantizes its gradient and sends it to all of its neighbors, which is total bits. Thus, in each inner iteration bits are transmitted. The total number of bits transmitted in a single outer iteration is therefore,

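$$n \left( \sum_{i=1}^{N} \left( |\mathcal{N}_i| \, m_i \, (1 + |\mathcal{N}_i|) \right) + \sum_{t=0}^{T-1} \left( |\mathcal{N}_\ell|^2 m_\ell + \sum_{j \in \mathcal{N}_\ell} m_j \right) \right).$$

Let and . An upper bound on the number of bits transmitted by the algorithm in each outer iteration is .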
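The per-outer-iteration bit count can be tallied directly from the expression above. The following sketch assumes that the garbled term in the inner sum reads $|\mathcal{N}_\ell|^2 m_\ell$, that each neighbor set includes the node itself (as stated in Section III), and that `sampled` lists the node chosen in each inner iteration. The function name and the example graph are illustrative.

```python
def bits_per_outer_iteration(neighbors, m, n_bits, sampled):
    """Total bits sent in one outer iteration.

    neighbors[i]: list of nodes in N_i (including i itself)
    m[i]:         dimension of node i's state vector
    n_bits:       bits per quantized component
    sampled:      node index chosen in each of the T inner iterations
    """
    # Outer loop: every node sends its state and its gradient to each neighbor.
    outer = sum(len(neighbors[i]) * m[i] * (1 + len(neighbors[i]))
                for i in range(len(m)))
    # Inner loop: only the sampled node and its neighborhood exchange messages.
    inner = 0
    for l in sampled:
        inner += len(neighbors[l]) ** 2 * m[l]      # node l sends its gradient
        inner += sum(m[j] for j in neighbors[l])    # neighbors send their states
    return n_bits * (outer + inner)


# Path graph 0 - 1 - 2 with scalar states and 8-bit quantization.
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
print(bits_per_outer_iteration(neighbors, m=[1, 1, 1], n_bits=8, sampled=[1, 0]))
```

The outer term is paid once per outer iteration by all nodes, whereas each inner iteration costs only what a single neighborhood exchanges, which is where the communication savings of the semi-stochastic scheme come from.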

## V Algorithm Analysis

We now present our analysis of Algorithm 2. First we show that the algorithm is equivalent to Algorithm 1, where the quantization errors are encapsulated in the error term . We also give an explicit expression for this error term.

###### Lemma 1

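Algorithm 2 is equivalent to the Inexact Prox-SVRG method in Algorithm 1, with

$$\begin{aligned}
e^{(st)} = {} & A_\ell^T\left(\nabla f_\ell(\hat{x}^{(st)}_{\mathcal{N}_\ell}) - \nabla f_\ell(x^{(st)}_{\mathcal{N}_\ell})\right) + A_\ell^T d^{(st)}_\ell \\
& - A_\ell^T\left(\nabla f_\ell(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_\ell}) - \nabla f_\ell(\tilde{x}^{(s)}_{\mathcal{N}_\ell})\right) - A_\ell^T b^{(s)}_\ell \\
& + \frac{1}{N}\sum_{i=1}^{N} A_i^T\left(\nabla f_i(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_i}) - \nabla f_i(\tilde{x}^{(s)}_{\mathcal{N}_i})\right) + \frac{1}{N}\sum_{i=1}^{N} A_i^T b^{(s)}_i.
\end{aligned}$$

Further, $\mathbb{E}\|e^{(st)}\|^2$ is upper-bounded by,

$$\mathbb{E}\|e^{(st)}\|^2 \le 2\bar{L}^2 \sum_{j \in \mathcal{N}_\ell} \mathbb{E}\|c^{(st)}_j\|^2 + 2\bar{L}^2 \sum_{j \in \mathcal{N}_\ell} \mathbb{E}\|a^{(s)}_j\|^2 + \mathbb{E}\|d^{(st)}_\ell\|^2 + 2\mathbb{E}\|b^{(s)}_\ell\|^2 + \frac{2}{N^2}\sum_{i=1}^{N} \mathbb{E}\|b^{(s)}_i\|^2.$$

###### Proof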

The error is:

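$$\begin{aligned}
e^{(st)} = {} & A_\ell^T \hat{\nabla} f^{(st)}_\ell - A_\ell^T \hat{\nabla} f^{(s)}_\ell + \frac{1}{N}\sum_{i=1}^{N} A_i^T \hat{\nabla} f^{(s)}_i \\
& - \left( A_\ell^T \nabla f_\ell(x^{(st)}_{\mathcal{N}_\ell}) - A_\ell^T \nabla f_\ell(\tilde{x}^{(s)}_{\mathcal{N}_\ell}) + \frac{1}{N}\sum_{i=1}^{N} A_i^T \nabla f_i(\tilde{x}^{(s)}_{\mathcal{N}_i}) \right) \\
= {} & A_\ell^T\left(\nabla f_\ell(\hat{x}^{(st)}_{\mathcal{N}_\ell}) - \nabla f_\ell(x^{(st)}_{\mathcal{N}_\ell})\right) + A_\ell^T d^{(st)}_\ell \\
& - A_\ell^T\left(\nabla f_\ell(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_\ell}) - \nabla f_\ell(\tilde{x}^{(s)}_{\mathcal{N}_\ell})\right) - A_\ell^T b^{(s)}_\ell \\
& + \frac{1}{N}\sum_{i=1}^{N} A_i^T\left(\nabla f_i(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_i}) - \nabla f_i(\tilde{x}^{(s)}_{\mathcal{N}_i})\right) + \frac{1}{N}\sum_{i=1}^{N} A_i^T b^{(s)}_i.
\end{aligned}$$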

We note that all quantization errors are zero-mean. Further, by Assumption 4, $\mathbb{E}[\nabla f_i(x + \delta)] = \nabla f_i(x)$ for a zero-mean random variable $\delta$. Therefore, $\mathbb{E}[e^{(st)}] = 0$.

We now show that $e^{(st)}$ is uncorrelated with and the gradients , . Clearly, and are uncorrelated with the terms of containing , , and . In accordance with Assumption 4, the gradients and are either linear or constant. If they are constant, then and . Thus, the terms in containing these differences are also 0. If they are linear, e.g., $\nabla f_\ell(x) = Hx + h$ for an appropriately sized matrix $H$ and vector $h$ (possibly 0), then

$$\nabla f_\ell(\hat{x}^{(st)}_{\mathcal{N}_\ell}) - \nabla f_\ell(x^{(st)}_{\mathcal{N}_\ell}) = \left(H(x^{(st)}_{\mathcal{N}_\ell} + c^{(st)}_i) + h\right) - \left(Hx^{(st)}_{\mathcal{N}_\ell} + h\right) = Hc^{(st)}_i.$$

By Theorem 2, is uncorrelated with . It is clearly also uncorrelated with . Similar arguments can be used to show that and are uncorrelated with the remaining terms in .

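With respect to $\mathbb{E}\|e^{(st)}\|^2$, we have

$$\begin{aligned}
\mathbb{E}\|e^{(st)}\|^2 = {} & \mathbb{E}\Big\| A_\ell^T\left(\nabla f_\ell(\hat{x}^{(st)}_{\mathcal{N}_\ell}) - \nabla f_\ell(x^{(st)}_{\mathcal{N}_\ell})\right) - A_\ell^T\left(\nabla f_\ell(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_\ell}) - \nabla f_\ell(\tilde{x}^{(s)}_{\mathcal{N}_\ell})\right) \\
& \quad + \frac{1}{N}\sum_{i=1}^{N} A_i^T\left(\nabla f_i(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_i}) - \nabla f_i(\tilde{x}^{(s)}_{\mathcal{N}_i})\right) \Big\|^2 \\
& + \mathbb{E}\Big\| A_\ell^T d^{(st)}_\ell + \frac{1}{N}\sum_{i=1}^{N} A_i^T b^{(s)}_i - A_\ell^T b^{(s)}_\ell \Big\|^2.
\end{aligned}$$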

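The first term on the right-hand side can be bounded using the fact that $\|u + v\|^2 \le 2\|u\|^2 + 2\|v\|^2$, as

$$\begin{aligned}
\le {} & 2\mathbb{E}\Big\| A_\ell^T\left(\nabla f_\ell(\hat{x}^{(st)}_{\mathcal{N}_\ell}) - \nabla f_\ell(x^{(st)}_{\mathcal{N}_\ell})\right) \Big\|^2 \\
& + 2\mathbb{E}\Big\| A_\ell^T\left(\nabla f_\ell(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_\ell}) - \nabla f_\ell(\tilde{x}^{(s)}_{\mathcal{N}_\ell})\right) + \frac{1}{N}\sum_{i=1}^{N} A_i^T\left(\nabla f_i(\hat{\tilde{x}}^{(s)}_{\mathcal{N}_i}) - \nabla f_i(\tilde{x}^{(s)}_{\mathcal{N}_i})\right) \Big\|^2.
\end{aligned}$$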

We now bound the first term in this expression,

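$$2\mathbb{E}\Big\| A_\ell^T\left(\nabla f_\ell(\hat{x}^{(st)}_{\mathcal{N}_\ell}) - \nabla f_\ell(x^{(st)}_{\mathcal{N}_\ell})\right) \Big\|^2 \le 2\mathbb{E}\left( L_i^2 \big\|\hat{x}^{(st)}_{\mathcal{N}_\ell} - x^{(st)}_{\mathcal{N}_\ell}\big\|^2 \right) \le 2\bar{L}^2 \sum_{j \in \mathcal{N}_\ell} \mathbb{E}\|c^{(st)}_j\|^2,$$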

where the first inequality follows from Assumptions 1 and 5 and the fact that . The second inequality follows from the independence of quantization errors (Theorem 2). Next we bound the second term,

 2E∥ATℓ(∇fℓ(^~x(s)Nℓ)−∇fℓ(~x(s)Nℓ)) +1N∑Ni=1ATi(∇fi(,^~x(s)Ni)−∇fi(~x(s)Ni))∥2 =2E∥ATℓ(∇fℓ(^~x(s)Nℓ)−∇fℓ(~x(s)Nℓ)) ≤2E∥ATℓ(∇fℓ(^~x(s)Nℓ)−∇fℓ(~x(s)Nℓ))∥2 ≤2E(L2i∥(^~x(s)Nℓ−~x(s)Nℓ∥2) ≤2¯¯¯¯L2∑j∈NℓE∥a(s)j∥2,

where the first inequality uses the fact that for a random variable , . The remaining inequalities follow from Assumptions 1 and 5, the fact that , and the independence of the quantization errors.

Finally, again from the independence of the quantization errors, we have,

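$$\mathbb{E}\Big\| A_\ell^T d^{(st)}_\ell + \frac{1}{N}\sum_{i=1}^{N} A_i^T b^{(s)}_i - A_\ell^T b^{(s)}_\ell \Big\|^2 \le \mathbb{E}\|d^{(st)}_\ell\|^2 + 2\mathbb{E}\|b^{(s)}_\ell\|^2 + \frac{2}{N^2}\sum_{i=1}^{N} \mathbb{E}\|b^{(s)}_i\|^2.$$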

Combining these bounds, we obtain the desired result,

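$$\mathbb{E}\|e^{(st)}\|^2 \le 2\bar{L}^2 \sum_{j \in \mathcal{N}_\ell} \mathbb{E}\|c^{(st)}_j\|^2 + 2\bar{L}^2 \sum_{j \in \mathcal{N}_\ell} \mathbb{E}\|a^{(s)}_j\|^2 + \mathbb{E}\|d^{(st)}_\ell\|^2 + 2\mathbb{E}\|b^{(s)}_\ell\|^2 + \frac{2}{N^2}\sum_{i=1}^{N} \mathbb{E}\|b^{(s)}_i\|^2.$$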

We next show that, if all of the values fall within their respective quantization intervals, then the error term decreases linearly with rate , and thus the algorithm converges to the optimal solution linearly with rate .

###### Theorem 3

Given , if for all , the values of , , , and fall inside of the respective quantization intervals , , , and , then , where,

 C=DT¯¯¯¯¯m12(2ℓ−1)2(2¯¯¯¯L2(Ca+Cb)+2(N+1N)Cb+Cd),

with and .

It follows that, for