Communication-Efficient Accurate Statistical EstimationSupported by NSF grants DMS-1662139 and DMS-1712591, NIH grant 2R01-GM072611-14, and ONR grant N00014-19-1-2120.

# Communication-Efficient Accurate Statistical Estimation††thanks: Supported by NSF grants DMS-1662139 and DMS-1712591, NIH grant 2R01-GM072611-14, and ONR grant N00014-19-1-2120.

Jianqing Fan, Yongyi Guo and Kaizheng Wang
###### Abstract

When the data are stored in a distributed manner, direct application of traditional statistical inference procedures is often prohibitive due to communication cost and privacy concerns. This paper develops and investigates two Communication-Efficient Accurate Statistical Estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts aggregated information to node machines for new updates. The algorithms adapt to the similarity among loss functions on node machines, and converge rapidly when each node machine has large enough sample size. Moreover, they do not require good initialization and enjoy linear converge guarantees under general conditions. The contraction rate of optimization errors is presented explicitly, with dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multi-step statistical estimator, we show that statistical efficiency can be achieved in finite steps in typical statistical applications. In addition, we give the conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms.

## 1 Introduction

Statistical inference in modern era faces tremendous challenge on computation and storage. The exceedingly large size of data often makes it impossible to store all of them on a single machine. Moreover, many applications have individual agents (e.g. local governments, research labs, hospitals, smart phones) collecting data independently. Communication between them is prohibitively expensive due to the limited bandwidth, and direct data sharing has also raised privacy and lost of ownership concerns. These constraints make it necessary to develop methodologies for distributed systems, solving statistical problems with divide-and-conquer procedures and communicating only certain summary statistics. In modern distributed computing architectures, the speeds of intra-node computation and inter-node communication may differ by a factor of 1000 (Lan et al., 2018). It is then desirable to conduct expensive computation on node machines and communicate as few rounds as possible.

Distributed statistical inference has received considerable attention in the past few years, covering a wide spectrum of topics including -estimation (Zhang et al., 2013; Chen and Xie, 2014; Shamir et al., 2014; Wang et al., 2017a; Lee et al., 2017b; Battey et al., 2018; Jordan et al., 2018; Wang et al., 2018a; Shi et al., 2018; Chen et al., 2018; Banerjee et al., 2019), principal component analysis (Fan et al., 2017; Garber et al., 2017), nonparametric regression (Zhang et al., 2015; Shang and Cheng, 2017; Szabo and van Zanten, 2017; Han et al., 2018), quantile regression (Volgushev et al., 2017; Chen et al., 2018), bootstrap (Kleiner et al., 2014), confidence intervals (Jordan et al., 2018; Wang et al., 2018b; Chen et al., 2018), Bayesian methods (Suchard et al., 2010; Wang and Dunson, 2013; Jordan et al., 2018), and so on. In the commonly-used setting, the overall dataset is partitioned and stored on node machines, which are connected to a central processor. Most of the approaches studied in this literature only require one round of communication: the node machines conduct inference in parallel and send their results to the central processor, which then aggregates the information and outputs a final result. As typical examples, Zhang et al. (2013) average the -estimators obtained by node machines; Battey et al. (2018) average debiased estimators; and Fan et al. (2017) define an average for subspaces and compute it via eigen-decomposition. While these one-shot methods are communication-efficient, they only work with a small number of node machines (e.g. , where is the total sample size) and require large sample size on each of them, as their theories heavily rely on asymptotic expansions of certain estimators (Zhang et al., 2013; Rosenblatt and Nadler, 2016). Since the conditions are easily violated, their performance may well be sub-optimal.

Multi-round procedures come as a remedy for this, where local computation and global aggregation are repeatedly performed. On the one hand, the central processor gathers and broadcasts overall information for node machines to improve their estimates accordingly. On the other hand, the similar structures of data on node machines as well as their computational power are exploited by the algorithm. It is then possible to achieve optimal statistical precision after a few rounds of communication, under broader settings than those for one-shot procedures. Shamir et al. (2014) proposes a Distributed Approximate NEwton (DANE) algorithm where, in each iteration, each node machine minimizes a modified loss function based on its own samples and the gradient information from all other machines obtained through communication. However, for non-quadratic losses, the analysis in Shamir et al. (2014) does not imply any advantage of DANE in terms of communication over distributed implementation of gradient descent. Other approximate Newton algorithms include Zhang and Xiao (2015), Mahajan et al. (2015), Wang et al. (2018a) and Crane and Roosta (2019). A recent work by Chen et al. (2018) approximate the Newton step by first-order stochastic algorithms. In addition, Jordan et al. (2018) develops a Communication-efficient Surrogate Likelihood (CSL) framework for estimation and inference in regular parametric models, high-dimensional penalized regression, and Bayesian statistics. A similar method for penalized regression also appear independently in Wang et al. (2017a). These methods no longer have restrictions on the number of machines such as .

Due to the nature of Newton-type methods, existing theories for these algorithms heavily rely on good initialization or even self-concordance assumption on loss functions. They essentially focus on improving an initial estimator that is already consistent but not efficient, whose ideas coincide with the classical one-step estimator (Bickel, 1975). Such initialization itself needs additional efforts and assumptions. Moreover, current results still require each machine to have sufficiently many samples so that loss functions on different machines are similar to each other. These all make the proposed methods unreliable in practice.

Aside from distributed statistical inference, there has also been a vast literature in distributed optimization since the pioneering works in the 80s (Bertsekas and Tsitsiklis, 1989). The ADMM (Boyd et al., 2011) is a celebrated example among the numerous algorithms designed to handle deterministic optimization problems with minimum structural assumption. While having theoretical guarantees under general conditions (Deng and Yin, 2016; Hong and Luo, 2017), the convergence can be quite slow. Recent developments in distributed optimization include dual algorithms (Yang, 2013; Jaggi et al., 2014; Smith et al., 2018), algorithms for feature-partitioned problems Recht et al. (2011); Richtárik and Takáč (2016), resampling-based algorithms (Lee et al., 2017a; Wang et al., 2017b), decentralized algorithms (Duchi et al., 2012; Lan et al., 2018), federated optimization (Konečnỳ et al., 2015; Chen et al., 2017), communication complexity (Arjevani and Shamir, 2015; Woodworth et al., 2018), to name a few. This list is by no means exhaustive for this booming area. In the statistical setting we are interested in, most of the algorithms above cannot fully utilize the similarity among loss functions on node machines.

In this paper, we develop and study two Communication-Efficient Accurate Statistical Estimators (CEASE) based on multi-round algorithms for distributed statistical estimation. The samples are stored on node machines connected to a central processor. For simplicity, we assume that all the node machines have the same sample size . Each node machine has a regularized empirical risk function defined by the samples stored there, and the goal is to compute the minimizer of the overall regularized risk function to statistical precision. The algorithms alternate between computation on node machines and aggregation on the central processor in a communication-efficient way. When is sufficiently large, their rates of convergence are better than or comparable to existing statistical methods designed for this large-sample regime. Even for moderate or small , they are still guaranteed to converge linearly even in the absence of good initializations, while other statistical methods fail. In addition, our algorithms take advantage of the similarity among in statistical applications, and thus improve over general-purpose distributed optimization algorithms like ADMM. To some extent, they interpolate between distributed algorithms for statistical estimation and general deterministic problems. Theoretical findings are verified by extensive numerical experiments.

From a technical point of view, our algorithms use the proximal point algorithm (Rockafellar, 1976; Parikh and Boyd, 2014) as the backbone and obtain inexact one-step updates in a distributed manner. This turns out to be crucial for proving convergence under general conditions, without good initialization or large sample size on each node machine. Moreover, it makes our algorithms reliable in practice. Our perspective and techniques are potentially useful for analyzing other distributed optimization algorithms.

The rest of this paper is organized as follows. Section 2 introduces the problem setup and presents two vanilla algorithms for the large-sample regime. Section 3 proposes two advanced algorithms and analyzes their theoretical properties under general conditions. Section 4 uses numerical experiments on both synthetic and real data to validate the theoretical results. Section 5 finally concludes the paper and discusses possible future directions.

### Notations

Here we list notations to be used throughout the paper. denotes the set for any positive integer . For two sequences and , we write or if there exists a constant such that holds for sufficiently large ; and if and . Given a Euclidean space where is clear from the context, , and , we define to be a closed ball and to be the inner product. For a convex function on , we let be its sub-differential set at , and be the set of its minimizers if . We use to denote the norm of a vector or operator norm of a matrix. For two sequences of random variables and where , we write if for any there exists such that for sufficiently large . We use to refer to the sub-Gaussian norm of random variable , and to denote the sub-Gaussian norm of random vector .

## 2 Distributed estimation in large sample regimes

### 2.1 Problem setup

Suppose there is an unknown probability distribution over some sample space . For any parameter , define its population risk based on a loss function . In parametric inference problems, is often chosen as the negative log-likelihood function of some parametric family. Under mild conditions, is well-defined and has a unique minimizer . A ubiquitous problem in statistics and machine learning is to estimate given i.i.d. samples from , and the minimizer of the empirical risk becomes a natural candidate. To achieve desirable precision in high-dimensional problems, it is often necessary to incorporate prior knowledge of into the estimation procedure. To this end, the regularized empirical risk minimization

 minθ∈Rp{f(θ)+g(θ)}, (2.1)

provides a principled approach, where is a deterministic panelty function. Common choices for include the penalty (Hoerl and Kennard, 1970), the penalty (Tibshirani, 1996), and a family of folded concave penalty functions such as SCAD (Fan and Li, 2001) and MCP (Zhang et al., 2010), where is a regularization parameter. Throughout the paper, we assume that both and are convex in , and is twice continuously differentiable in . We allow to be non-smooth (e.g. the penalty).

Consider the distributed setting where the samples are stored on machines connected to a central processor. Define to be the indices of samples on the -th machine, and the associated empirical loss. For simplicity, we assume that are disjoint, is a multiple of , and for all . Then (2.1) can be rewritten as

 minθ∈Rp{f(θ)+g(θ)},f(θ)=1mm∑k=1fk(θ). (2.2)

Each machine only has access to its local data and hence local loss function and the penalty . We aim to solve (2.2) in a distributed manner with both statistical efficiency and communication-efficiency.

We drop the regularization term for now and consider the empirical risk minimization problem for estimating . In some problems, direct minimization of is costly, while it is easy to obtain some rough estimate that is close to but not as accurate as the global minimimizer . Bickel (1975) proposes the one-step estimator based on the local quadratic approximation and shows that it is as efficient as if the initial estimator is accuracy enough. Iterating this further results in multiple-step estimators that improve the optimization error and hence statistical errors when the initial estimator is not good enough (Robinson, 1988). This inspires us to refine an existing estimator using some proxy of .

In the distributed environment, starting from an initial estimator , only the gradient vector can easily be communicated and hence the linear function , the first-order Taylor expansion of around . The object function to be minimized can be written as

 f(θ)=f(1)(θ)+R(θ),whereR(θ)=f(θ)−f(1)(θ).

Since the linear function can easily be communicated to each node machine whereas can not, the latter is naturally replaced by its subsampled version at node :

 Rk(θ)=fk(θ)−[fk(¯θ)+⟨∇fk(¯θ),θ−¯θ⟩],

where is the loss function based on the data at node . With this replacement, the target of optimization at node becomes , which equals to

 fk(θ)−⟨∇fk(¯θ)−∇f(¯θ),θ⟩

up to an additive constant. This function will be called gradient-enhenced loss (GEL) function, in which the gradient at point based on the local data is replaced by the global one. This function has one very nice fixed point at the global minimum : the minimizer of the adaptive gradient-enhanced function at is still . This can easily be seen by computing the gradient at the point .

The idea of using such an adaptive enhanced function has been proposed in Shamir et al. (2014) and Jordan et al. (2018), though the motivations are different. Jordan et al. (2018) develop a Commmunication-efficient Surrogate Likelihood (CSL) method using the GEL function on the first machine, uses the minimizer on that machine as a new estimate, and iterates these steps until convergence. In the presence of a regularizer in (2.1), one simply adds to the gradient-enhanced loss; see the Algorithm 1 below. It is also studied by Shamir et al. (2014) and Wang et al. (2017a) under certain settings.

Note that in Algorithm 1, only the first machine solves optimization problems and others just evaluate gradients. These machines are idling while the first one is working hard. To fully utilize the computing power of machines and accelerate convergence, all the machines can optimize their corresponding GEL functions in parallel and the central processor then aggregates the results. This is motivated by the Distributed Approximate NEwton (DANE) algorithm (Shamir et al., 2014). Algorithm 2 describes the procedure in detail. Intuitively, the averaging step requires little computation but helps reduce the variance of estimators on node machines and enhance the accuracy.

### 2.3 Contracting optimization errors

In this subsection, we first present deterministic (almost sure) results for Algorithms 1 and 2 based on high-level structural assumptions. We will then apply the results to the statistical setting in the next subsection.

###### Definition 2.1.

Let be a convex function, be a convex set, and . is said to be -strongly convex in if , and .

###### Assumption 2.1 (Strong convexity).

has a unique minimizer , and is -strongly convex in for some and .

###### Assumption 2.2 (Homogeneity).

holds for all and .

We will refer to as a homogeneity parameter. Based on both assumptions, we define

 ρ0=sup{c∈[0,ρ]:{fk+g}mk=1 are % c-strongly convex in B(ˆθ,R)}. (2.3)

A simple but useful fact is . In most interesting problems, the population risk is smooth and strongly convex on any compact set. When are i.i.d. and the total sample size is large, the empirical risk concentrates around its population counterpart and inherits nice properties from the latter, making Assumption 2.1 hold easily.

Since are i.i.d. stochastic approximations of , they should not be too far away from their average provided that is not too small. Assumption 2.2 is a natural way of characterizing this similarity. It is a generalization of the concept “-related functions” for quadratic losses in Arjevani and Shamir (2015). With high probability, it holds with reasonably small and large under general conditions (Mei et al., 2018). Large implies small homogeneity parameter and thus similar . Assumption 2.2 always holds with .

In this section, we restrict ourselves to the large-sample regime where the local sample size is sufficiently large such that , where is the strong convexity parameter in (2.3). General cases will be discussed in Section 3 where additional local regularization is needed.

The following additional assumption on smoothness of the Hessian matrix of is not necessary for contraction, but it helps us obtain a much stronger result on the contraction rate of Algorithm 2, justifying the power of the simple averaging step.

###### Assumption 2.3 (Smoothness of Hessian).

, and there exists such that , .

Now for , define

 φk(ξ)=argminθ∈Rp{fk(θ)+g(θ)−⟨∇fk(ξ)−∇f(ξ),θ⟩}.

In Algorithm 1, we have ; in Algorithm 2, we have . Note that and is a fixed point of . This is the key to success of the algorithms. The following theorem describes the contraction of optimization errors of Algorithms 1 and 2. It is deterministic and non-asymptotic by nature.

###### Theorem 2.1.

Let Assumptions 2.1 and 2.2 hold, and . Consider the iterates produced by Algorithm 1 or 2, with . Then

 ∥θt+1−ˆθ∥2≤(δ/ρ0)∥θt−ˆθ∥2,∀t≥0.

In addition, if Assumption 2.3 also holds, then for Algorithm 2 we have

 ∥θt+1−ˆθ∥2≤δρ0∥θt−ˆθ∥2⋅min{1,δρ(1+Mρ0∥θt−ˆθ∥2)},∀t≥0.

Theorem 2.1 shows the -linear convergence***According to Nocedal and Wright (2006), a sequence in is said to converge -linearly to if there exists such that for sufficiently large. of the sequence generated by both Algorithms 1 and 2. The contraction rate depends explicitly on homogeneity parameter . With an additional standard assumption on Hessian smoothness, we further show that the averaging step alone in Algorithm 2 is almost as powerful as an optimization step in terms of contraction: The contracting constant will eventually be . With negligible computational cost, averaging significantly improves upon individual solutions by doubling the speed of convergence. In short, one iteration in Algorithm 2 is approximately the same as two iterations in Algorithm 1 under suitable conditions.

The first part of result is a refinement of that in Jordan et al. (2018). In particular, we allow initial estimator to be inaccurate and we have more explicit rates of contraction of optimization errors. This will be demonstrated in Section 2.4 below. The second part points out benefits of the averaging step, which is a novel result.

### 2.4 Multi-step estimators and statistical analysis

While we present the CSL methods as two algorithms, they are really -step estimators, starting from the initial estimator . The question is then the effect of iterations in the multiple step estimators and the role of the initial estimator. In this section, we show that each iteration makes is closer to the global minimum by a factor of order for CSL and by a factor of by the GEL method. This is done by explicitly finding the rate of convergence of . Thus, through finite steps, the optimization errors are eventually negligible in comparison with statistical errors (assuming is of order for a finite in typical applications) and the distributed multi-step estimator will work as well as the global minimum as if the data were aggregated in the central server.

The deterministic analysis above applies to a wide range of statistical models. As an illustration, we consider the generalized linear model with canonical link, where our samples are i.i.d. pairs of covariates and responses and the conditional density of given is given by

 h(yi;xi,θ∗)=c(xi,yi)exp(yi(x⊤iθ∗)−b(x⊤iθ∗)).

Here for simplicity we let the dispersion parameter to be 1 as we do not consider the issue of over-dispersion; is some known convex function, and is a known function such that is a valid probability density function. The negative log-likelihood of the whole data is an affine transformation of with

 fk(θ)=1n∑i∈Ik[b(x⊤iθ)−yi(x⊤iθ)].

It’s easy to verify that

 ∇fk(θ)=1n∑i∈Ik[b′(x⊤iθ)−yi]xiand∇2fk(θ)=1n∑i∈Ikb′′(x⊤iθ)xix⊤i.

Assume that , where are i.i.d. random covariate vectors with zero mean and covariance matrix . Suppose there exist universal positive constants , and such that . Let , be a deterministic penalty function, and be the population risk function. Below we impose some standard regularity assumptions.

###### Assumption 2.4.
• are i.i.d. sub-Gaussian random vectors.

• For all , and are bounded by some constant.

• is bounded by some constant.

As in Assumptions 2.1 and 2.3, the following general assumptions is also needed for our analysis. Here is some positive quantity that satisfies for some universal constants and .

###### Assumption 2.5.

There exists a universal constant such that is -strongly convex in .

The following smoothness assumption is only needed for part of our theory; it is used to show that the averaging step in Algorithm 2 can significantly enhance the accuracy.

###### Assumption 2.6.

, and there exists a universal constant such that , .

Under the model assumptions above, we can explicitly determine rate for in Assumption 2.2. In particular, we will show in Section A.6 that

 maxk∈[m]maxθ∈B(ˆθ,R)∥∇2fk(θ)−∇2f(θ)∥2=OP⎛⎝∥Σ∥2√p(logp+logN)n⎞⎠,

provided that for an arbitrary positive constant . Therefore, with high probability, . Omitting the logarithmic terms, we see that the contraction factor is approximately , where can be viewed as condition number. This rate is more explicit on and than that in Jordan et al. (2018), where finite and are assumed. In addition, with a smooth regularization, Algorithm 2 benefits from the averaging step in that it improves the contraction rate to approximately .

Let be the -th step estimator of Algorithm 1 or Algorithm 2 with some initialization . It is clear that the statistical error of the estimator is upper bounded by its optimization error and the statistical error of :

 ∥θt−θ∗∥2≤∥θt−ˆθ∥2+∥ˆθ−θ∗∥2.

The second term is well-studied in statistics, which is of order under mild conditions. The following theorem controls the magnitude of the first term, which is the optimization error.

###### Theorem 2.2.

Suppose that Assumptions 2.4 and 2.5 hold and with probability tending to one for some . For Algorithms 1 and 2, we have

 ∥θt−ˆθ∥2=OP(ηt/2∥θ0−ˆθ∥2),∀t≥0,

where . In addition, let Assumption 2.6 also hold. There exists some constant such that for Algorithm 2 we have

 ∥θt−ˆθ∥2=OP(ηt−t0∥θt0−ˆθ∥2),∀t≥t0, (2.4)

where .

Theorem 2.2 explicitly describes how Algorithms 1 and 2 depend on structural parameters of the problem. First, the condition on initialization is mild, since is usually a consistent estimate and is bounded (Assumption 2.4). In contract with Jordan et al. (2018), we allow inaccurate initial value such as and give more explicit rates of contraction even when and diverge.

In contrast to fixed contraction derived by Shamir et al. (2014), Theorem 2.2 explains the significant benefits of large local sample size even in the presence of a non-smooth penalty: optimization error is shrunken by a factor that converges to zero explicitly in . When and are bounded, and within a finite step , the optimization error can be much smaller than statistical error , so long as for some .

The accuracy of initial estimator helps reducing the number of iterations. As an example of this, consider the simple average of individual estimator for node machine as for smooth loss function with no regularization. By Corollary 2 in Zhang et al. (2013), this simple divide-and-conquer estimator has accuracy . Using the explicit expression of , we can easily show that the one-step estimator obtained by Algorithm 1 behaves the same as the global minimizer if the local sample size

 n3≫Nκ2plogN(p+κ2logp). (2.5)

In this case, the local optimization in Algorithm 1 can further be replaced by using the explicit one-step estimator as in Bickel (1975) and Jordan et al. (2018). A similar remark applies to the GEL method (Algorithm  2).

## 3 Distributed estimation in general regimes

Algorithms 1 and 2 and their analysis in the last section are built upon the large-sample regime, with sufficiently strong convexity of and small discrepancy between them. This requires the local sample size to be large enough, which may not be the case in practice. Even worse, the required local sample size depends on structural parameters, making such a condition unverifiable. Our numerical experiments confirm the instability of Algorithms 1 and 2 even for moderate . A naive method of remedy is to add strict convex quadratic regularization . While this remedy can make the algorithm converges rapidly, the nonadaptive nature of will make the convergence to a wrong target. Instead of using a fixed , we will adjust the regularization function according to current solutions. The idea stems from the proximal point algorithm (Rockafellar, 1976; Parikh and Boyd, 2014).

### 3.1 Distributed approximate proximal point algorithms

###### Definition 3.1.

For any convex function , define the proximal mapping , .

For a given , the proximal point algorithm for minimizing iteratively computes

starting from some initial value . Under mild conditions, converges linearly to some (Rockafellar, 1976).

Now we take and write the proximal point iteration for our problem (2.2):

 θt+1=proxα−1(f+g)(θt)=argminθ∈Rp{f(θ)+g(θ)+α2∥θ−θt∥22}. (3.1)

Each iteration (3.1) is a distributed optimization problem, whose object function is not available to node machines. But it can be solved by Algorithms 1 and 2. Specifically, suppose we have already obtained and aim for in (3.1). Letting , Algorithm 2 starting from produces iterations over

 ~θs,k =argminθ∈Rp{fk(θ)+~g(θ)+⟨∇fk(~θs)−∇f(~θs),θ⟩},k∈[m], ~θs+1 =1mm∑k=1~θs,k.

When , converges -linearly to . On the other hand, there is no need to solve (3.1) exactly, as is merely an intermediate quantity for computing . We therefore only run one iteration of the GEL Algorithm 2 and use the resulting approximate solution as . This simplifies the algorithm, reducing double loops to a single loop, and enhances statistical interpretation of the method as a multi-step estimator. However, it makes technical arguments more challenging. Similarly, we may also use one step of Algorithm 1 to compute the inexact proximal update.

The above discussions lead us to propose two Communication-Efficient Accurate Statistical Estimators (CEASE) in Algorithms 3 and 4, which use the proximal point algorithm as the backbone and obtain inexact updates in a distributed manner. They are regularized versions of Algorithms 1 and 2, with an additional proximal term in the objective functions. The term reduces relative differences of the local loss functions on individual machines, and is particularly crucial for convergence when are not similar enough. Ideas from the proximal point algorithm have appeared in the literature of distributed stochastic optimization for different purposes such as accelerating first-order algorithms (Lee et al., 2017a) and regularizing sizes of updates (Wang et al., 2017b).

In each iteration, Algorithm 3 has one round of communication and one optimization problem to solve. Although Algorithm 4 has two rounds of communication per iteration, only one round involves parallel optimization and the other is simply averaging. We will compare their theoretical guarantees as well as practical performances in the sequel.

### 3.2 Contraction of optimization errors

Theorem 3.1 gives contraction guarantees for Algorithms 3 and 4.

###### Theorem 3.1.

Let Assumptions 2.1 and 2.2 hold. Consider the multi-step estimators generated by Algorithm 3 or 4. Suppose that and .

• For both Algorithms 3 and 4, we have

 ∥θt+1−ˆθ∥2≤∥θt−ˆθ∥2⋅δρ0+α√ρ2+2αρ+αρ+α,0≤t≤T−1; (3.2)
• If Assumption 2.3 also holds, then for Algorithm 4 we have

 ∥θt+1−ˆθ∥2≤∥θt−ˆθ∥2⋅γt√ρ2+2αρ+αρ+α,0≤t≤T−1, (3.3)

where we define ;

• Both multiplicative factors in (3.2) and (3.3) are strictly less than 1.

In the contraction factor in (3.2), the two summands and come from the error of the inexact proximal update and the residual of the proximal point , respectively. Similar results hold for (3.3).

Theorem 3.1 justifies the linear convergence of Algorithms 3 and 4 under quite general settings. The local loss functions just need to be convex and smooth, and the convex penalty is allowed to be non-smooth, e.g. the norm. On the contrary, most algorithms for distributed statistical estimation are only designed for smooth problems, and many of them are only rigorously studied when the loss functions are quadratic or self-concordant (Shamir et al., 2014; Zhang and Xiao, 2015; Wang et al., 2017b). This is another important aspect of our contributions.

Note that the convergence of the vanilla Algorithms 1 and 2 hinges on the homogeneity assumption in Theorem 2.1, i.e. the functions must be similar enough. In the statistical setting, this requires to be large. Algorithms 3 and 4 no longer need such a condition and converge linearly as long as , which is guaranteed to hold by choosing sufficiently large . Hence proper regularization provides a safety net for the algorithms. Corollary 3.1 below gives a guideline for choosing to make Algorithms 3 and 4 converge.

###### Corollary 3.1.

Let Assumptions 2.1 and 2.2 hold, , and be the iterates of Algorithm 3 or 4. With any , both algorithms converge with contraction factors in (3.2) and (3.3) bounded by . Hence to reach the statistically negligible accuracy of for a constant , we need at most iterations.

Consider again the case where the local loss functions have small relative difference . In this case, Theorem 2.1 states that the contraction factors for unregularized versions () of Algorithms 3 and 4 are in the same order of . The following corollary tells us how large can be so that the contraction factors are still of that order. It provides an upper bound for the amount of regularization to make the algorithms converge rapidly in nice scenarios.

###### Corollary 3.2.

Let Assumptions 2.1 and 2.2 hold, , and suppose for some constant . There exist constants and such that the followings hold when is sufficiently small:

• Algorithms 3 and 4 have the contraction property

 ∥θt+1−ˆθ∥2≤C1δ/ρ∥θt−ˆθ∥2,0≤t≤T−1;
• if Assumption 2.3 also holds and , then

 ∥θt+1−ˆθ∥2≤C2(δ/ρ)2∥θt−ˆθ∥2

holds for Algorithm 4.

Consequently, suffices for both algorithms to achieve a statistically negligible accuracy of with a small constant .

Corollary 3.2 suggests choosing a small regularizer when is small. The contraction factors in Corollary 3.2 go to zero if does, indicating both algorithms’ ability to utilize the similarity among local loss functions. With a regularizer up to the order of , the contraction factors are essentially the same as those of the unregularized () algorithms. If is smooth and is reasonably close to , then Corollary 3.2 shows that each iteration of Algorithm 4 is roughly equivalent to two iterations of Algorithm 3, although the former only has one round of optimization. The averaging step in Algorithm 4 reduces the error as much as the optimization step, while taking much less time. In this case, Algorithm 4 is preferable, and our numerical experiments also confirm this.

By combining Corollaries 3.1 and 3.2, we get

 α≍δ2/ρ

as a default choice for Algorithms 3 and 4 to become fast and robust. They are reliable in general cases and efficient in nice cases.

According to the results above, Algorithms 3 and 4 achieve -accuracy within

 O(max{1,(δ/ρ)2}log(∥θ0−ˆθ∥2ε)) (3.4)

rounds of communication. In contrast, the distributed accelerated gradient descent requires rounds of communication to achieve -accuracy (Shamir et al., 2014), with being the condition number of , which does not take advantage of sample size . As long as , our Algorithms 3 and 4 communicate less than the distributed accelerated gradient descent. This is achieved by leveraging the similarity among . And again, our general results for Algorithms 3 and 4 also apply to the case with nonsmooth penalty functions while those for distributed accelerated gradient descent do not.

For unregularized empirical risk minimization, i.e. in (2.2), Algorithm 4 reduces to an extension or a useful case of the DANE algorithm (Shamir et al., 2014). DANE is motivated from the mirror descent and only deals with smooth optimization problems. Theoretical analysis of DANE beyond quadratic loss require extremal choice of tuning parameters and does not show any advantage over distributed implementation of the gradient descent. On the other hand, we derive Algorithm 4 from the proximal point algorithm, handling both smooth and nonsmooth problems. This new perspective leads to sharp analysis of Algorithm 4 along with suggestions on choosing the tuning parameter . As a by-product, we close a gap in the theory of DANE in non-quadratic settings. Our analysis techniques are potentially useful for other distributed optimization algorithms, especially when the loss is not quadratic.

### 3.3 Multi-step estimators and their statistical properties

#### 3.3.1 General case

Consider again the generalized linear model with the canonical link as in Section 2.4. In the following theorem, we specify the correct order of the regularization parameter such that Algorithms 3 and 4 not only overcome the difficulties with a small local sample size , but also inherit all the advantages of previous algorithms in the large- regime. This leverages upon the explicit homogeneity rate .

###### Theorem 3.2.

Suppose that Assumptions 2.4 and 2.5 hold, and with high probability . Let and . For any , there exists such that the followings hold with high probability:

• if and , then both Algorithms 3 and 4 have linear convergence

 ∥θt−ˆθ∥2≤[1−ρ10(α+ρ)]t∥θ0−ˆθ∥2,∀t≥0;
• if is sufficiently small and , then for both algorithms

 ∥θt−ˆθ∥2=OP(ηt/2∥θ0−ˆθ∥2),∀t≥0;

in addition, if Assumption 2.6 also holds, then for Algorithm 4 we have

 ∥θt−ˆθ∥2=OP(ηt−t0∥θt0−ˆθ∥2),∀t≥t0,

where .

For many big-data problems of interest, it is reasonable to assume that is bounded away from 0 by some small constant. Then Theorem 3.2 indicates that by choosing , Algorithms 3 and 4 inherit all the merits of Algorithms 1 and 2 in the large regime – fast linear contraction of rate , and for Algorithm 4, a even faster rate of to when the loss functions and the penalty are smooth. These facts also guarantee that Algorithms 3 and 4 reach the statistical efficiency in iterations. While it is hard to check whether is sufficiently large in practice, proper choice of always guarantees linear convergence, and the contraction rates adapt to the sample size . In this way, Algorithms 3 and 4 perfectly resolve the main issue of their vanilla versions.

Similar to the discussion at the end of Seciton 2.4, produces by CEASE and CEASE with averaging can be regarded as multi-step statistical estimators. As the contraction of optimization error is at least at the order of , it only takes finite steps to achieve negligible optimization error and hence achieves statistical efficiency in typical statistical applications. In particular, when satisfies (2.5), the one-step estimator from the one-shot average (Zhang et al., 2013) is statistically efficient and its asymptotic inference follows from that based on the empirical minimizer based on all the data.