One-shot distributed ridge regression in high dimensions

Edgar Dobriban (Wharton Statistics Department, University of Pennsylvania; e-mail: dobriban@wharton.upenn.edu) and Yue Sheng (Graduate Group in Applied Mathematics and Computational Science, Department of Mathematics, University of Pennsylvania; e-mail: yuesheng@sas.upenn.edu)
August 4, 2019
Abstract

In many areas, practitioners need to analyze large datasets that challenge conventional single-machine computing. To scale up data analysis, distributed and parallel computing approaches are increasingly needed. Datasets are spread out over several computing units, which do most of the analysis locally, and communicate short messages.

Here we study a fundamental and highly important problem in this area: How should we do ridge regression in a distributed computing environment? Ridge regression is an extremely popular method for supervised learning, and has several optimality properties, which makes it important to study. We study one-shot methods that construct weighted combinations of ridge regression estimators computed on each machine. By analyzing the mean squared error in a high-dimensional random-effects model where each predictor has a small effect, we discover several new phenomena.

Infinite-worker limit: The distributed estimator works well for very large numbers of machines, a phenomenon we call the “infinite-worker limit”.

Optimal weights: The optimal weights for combining local estimators sum to more than unity, due to the downward bias of ridge. Thus, all averaging methods are suboptimal.

We also propose a new optimally weighted one-shot ridge regression algorithm. We confirm our results in simulation studies and using the Million Song Dataset as an example. There we can save at least 100x in computation time, while nearly preserving test accuracy.


1 Introduction

Computers have changed all aspects of our world. Importantly, computing has made data analysis more convenient than ever before. However, computers also pose limitations and challenges for data science. For instance, hardware architecture is based on a model of a universal computer—a Turing machine—but in fact has physical limitations of storage, memory, processing speed, and communication bandwidth over a network. As large datasets become more and more common in all areas of human activity, we need to think carefully about working with these limitations.

How can we design methods for data analysis (statistics and machine learning) that scale to large datasets? A general approach is distributed and parallel computing. Roughly speaking, the data is divided up among computing units, which perform most of the computation locally, and synchronize by passing relatively short messages. While the idea is simple, a good implementation can be hard and nontrivial. Moreover, different problems have different inherent needs in terms of local computation and global communication resources. For instance, in statistical problems with high levels of noise, simple one-shot schemes like averaging estimators computed on local datasets can sometimes work well.

In this paper, we study a fundamental problem in this area. We are interested in linear regression, which is arguably one of the most important problems in statistics and machine learning. A popular method for this model is ridge regression (aka Tikhonov regularization), which regularizes the estimates using a quadratic penalty to improve estimation and prediction accuracy. We aim to understand how to do ridge regression in a distributed computing environment. We are also interested in the important high-dimensional setting, where the number of features can be very large. In fact our approach allows the dimension and sample size to have any ratio. We also work in a random-effects model where each predictor has a small effect on the outcome, which is the model for which ridge regression is best suited.

We consider the simplest and most fundamental method, which performs ridge regression locally on each dataset housed on the individual machines or other computing units, sends the estimates to a global datacenter (or parameter server), and then constructs a final one-shot estimator by taking a linear combination of the local estimates. As mentioned, such methods are sometimes near-optimal, and it is therefore well-justified to study them. We will later give several additional justifications for our work.

However, in contrast to existing work, we introduce a completely new mathematical approach to the problem, which has never been used for studying distributed ridge regression before. Specifically, we leverage and further develop sophisticated recent techniques from random matrix theory and free probability theory in our analysis. This enables us to make important contributions that were simply unattainable using more “traditional” mathematical approaches.

To give a sense of our results, we provide a brief discussion here. We have a dataset consisting of $n$ datapoints, for instance 1000 heart disease patients. Each datapoint has an outcome $y_i$, such as blood pressure, and $p$ features $x_i$, such as age, height, electronic health records, lab results, and genetic variables. Our goal is to predict the outcome of interest (i.e., blood pressure) for new patients based on their features, and to estimate the relationship of the outcome to the features.

The samples are distributed across several sites, for instance patients from different countries are housed in different data centers. We will refer to the sites as “machines”, though they may actually be other computing entities, such as entire computer networks or data centers. In many important settings, it can be impossible to share the data across the different sites, for instance due to logistical or privacy reasons.

Therefore, we assume that each site has a subset of the samples. Our approach is to train ridge regression on this local data. As usual, we can arrange the local dataset (say on the $i$-th machine) into a feature matrix $X_i$, where each row contains a sample (i.e., datapoint), and an outcome vector $Y_i$ where each entry is an outcome. We compute the local ridge regression estimates

$$\hat\beta_i = (X_i^\top X_i + n_i \lambda_i I_p)^{-1} X_i^\top Y_i,$$

where $\lambda_i > 0$ are regularization parameters and $n_i$ is the local sample size. We then aggregate them by a weighted combination, constructing the final one-shot distributed ridge estimator (where $k$ is the number of sites)

$$\hat\beta_{\mathrm{dist}} = \sum_{i=1}^{k} w_i \hat\beta_i.$$

The important questions here are:

  1. How well does this work?

  2. How should we tune the parameters (such as the regularization parameters and the weights)?

Question (1) is of interest because we wish to know when one-shot methods are a good approach, and when they are not. For this we need to understand the performance as a function of the key problem parameters, such as the signal strength, sample size, and dimension. For question (2), the challenge is posed by the constraints of the distributed computing environment, where standard methods for parameter tuning such as cross-validation may be expensive.

In this work we are able to make several crucial contributions to these questions. We work in an asymptotic setting where $n$ and $p$ grow to infinity at the same rate, which effectively gives good results for any ratio $p/n$. We study a linear random-effects model, where each regressor has a small random effect on the outcome. This is a good model for the applications where ridge regression is used, because ridge does not assume sparsity, and has optimality properties in certain dense random-effects models. Importantly, this analysis does not assume any sparsity in a high-dimensional setting. Sparsity has been one of the biggest driving forces in statistics and machine learning in the last 20 years. Our work belongs to a different line of research, and shows that meaningful results are available without sparsity.

We find the limiting mean squared error of the one-shot distributed ridge estimator. This enables us to characterize the optimal weights and tuning parameters, as well as the relative efficiency compared to centralized ridge regression, meaning the ratio of the risk of the usual centralized ridge estimator to that of the distributed estimator. This can precisely pinpoint the computation-accuracy tradeoff achieved via one-shot distributed estimation. See Figure 1 for an illustration.

Figure 1: Efficiency loss due to one-shot distributed learning. (a) Surface and (b) contour plots of the asymptotic relative efficiency (ARE). This quantity is the relative mean squared error of centralized ridge regression compared to optimally weighted one-shot distributed ridge regression. It is at most unity, and the larger it is, the better distributed ridge works. Specifically, the model is asymptotic, and we show the dependence of the ARE on the aspect ratio $\gamma = p/n$ (where $n$ is the sample size and $p$ is the dimension) and on the signal strength, in the infinite-worker limit where we distribute our data over many machines. See the text for details.

As a consequence of our detailed and precise risk analysis, we make several qualitative discoveries that we find quite striking:

  1. Efficiency depends strongly on signal strength. The statistical efficiency of the one-shot distributed ridge estimator depends strongly on signal strength. The efficiency is generally high (meaning distributed ridge regression works well) when the signal strength is low.

  2. Infinite-worker limit. The one-shot distributed estimator does not lose all efficiency compared to the ridge estimator even in the limit of infinitely many machines. Somewhat surprisingly, this suggests that simple one-shot weighted combination methods for distributed ridge regression can work well even for very large numbers of machines. The statement that this can be achieved by communication-efficient methods is nontrivial. This finding is clearly important from a practical perspective.

  3. Decoupling. When the features are uncorrelated, the problem of choosing the optimal regularization parameters decouples over the different machines. We can choose them in a locally optimal way, and they are also globally optimal. We emphasize that this is a very delicate result, and is not true in general for correlated features. Moreover, this discovery is also important in practice, because it gives conditions under which we can choose the regularization parameters separately for each machine, thus saving valuable computational resources.

  4. Optimal weights do not sum to unity. Our work uncovers unexpected properties of the optimal weights. Naively, one may think that the weights need to sum to unity, meaning that we need a weighted average. However, it turns out the optimal weights sum to more than unity, because of the negative bias of the ridge estimator. This means that any type of averaging method is suboptimal. We characterize the optimal weights and under certain conditions find their explicit analytic form.

Based on these results, we propose a new optimally weighted one-shot ridge regression algorithm. We also confirm these results in detailed simulation studies and on an empirical data example, using the Million Song Dataset.

We also emphasize that some aspects of our work can help practitioners directly (e.g., our new algorithm), while others are developed for deepening our understanding of the nature of the problem. We discuss the practical implications of our work in Section 4.6.

The paper is structured as follows. We discuss some related work in Section 1.1. We start with finite sample results in Section 2. We provide asymptotic results for features with an arbitrary covariance structure in Section 3. We consider the special case of an identity covariance in Section 4. Here we provide an explicit algorithm for optimally weighted one-shot distributed ridge. We also study in detail the properties of the estimation error, relative efficiency (including minimax properties in Section 4.7), tuning parameters (and decoupling), as well as optimal weights, including answers to the questions above. We provide numerical simulations throughout the paper, and additional ones in Section 5, along with an example using an empirical dataset. The code for our paper is available at github.com/dobriban/dist_ridge.

1.1 Related work

Here we discuss some related work. Historically, distributed and parallel computation has first been studied in computer science and optimization (see e.g., Bertsekas and Tsitsiklis, 1989; Lynch, 1996; Blelloch and Maggs, 2010; Boyd et al., 2011; Rauber and Rünger, 2013; Koutris et al., 2018). However, the problems studied there are quite different from the ones that we are interested in. Those works often focus on problems where correct answers are required within numerical precision, e.g., 16 bits of accuracy. However, when we have noisy datasets, such as in statistics and machine learning, numerical precision is neither needed nor usually possible. We may only hope for 3-4 bits of accuracy, and thus the problems are different.

The area of distributed statistics and machine learning has attracted increasing attention only relatively recently, see for instance Mcdonald et al. (2009); Zhang et al. (2012, 2013b); Li et al. (2013); Zhang et al. (2013a); Duchi et al. (2014); Chen and Xie (2014); Mackey et al. (2011); Zhang et al. (2015); Braverman et al. (2016); Jordan et al. (2016); Rosenblatt and Nadler (2016); Smith et al. (2016); Banerjee et al. (2016); Zhao et al. (2016); Xu et al. (2016); Fan et al. (2017); Lin et al. (2017); Lee et al. (2017); Volgushev et al. (2017); Shang and Cheng (2017); Battey et al. (2018); Zhu and Lafferty (2018); Chen et al. (2018a, b); Wang et al. (2018); Shi et al. (2018); Duan et al. (2018); Liu et al. (2018), and the references therein. See Huo and Cao (2018) for a review. We can only discuss the most closely related papers due to space limitations.

Zhang et al. (2013b) study the MSE of averaged estimation in empirical risk minimization. Later Zhang et al. (2015) study divide and conquer kernel ridge regression, showing that the partition-based estimator achieves the statistical minimax rate over all estimators, when the number of machines is not too large. These results are very general, however they are not as explicit or precise as our results. In addition they consider fixed dimensions, whereas we study increasing dimensions under random effects models. Lin et al. (2017) improve the above results, removing certain eigenvalue assumptions on the kernel, and sharpening the rate.

Guo et al. (2017) study regularization kernel networks, and propose a debiasing scheme that can improve the behavior of distributed estimators. This work is also in the same framework as those above (general kernel, fixed dimension). Xu et al. (2016) propose a distributed General Cross-Validation method to choose the regularization parameter.

Rosenblatt and Nadler (2016) consider averaging in distributed learning in fixed and high-dimensional M-estimation, without studying regularization. Lee et al. (2017) study sparse linear regression, showing that averaging debiased lasso estimators can achieve the optimal estimation rate if the number of machines is not too large. A related work is Battey et al. (2018), which also includes hypothesis testing under more general sparse models. These last two works are on a different problem (sparse regression), whereas we study ridge regression in random-effects models.

2 Finite sample results

We start our study of distributed ridge regression by a finite sample analysis of estimation error in linear models. Consider the standard linear model

$$Y = X\beta + \varepsilon. \qquad (1)$$

Here $Y$ is the $n$-dimensional continuous outcome vector of $n$ independent samples (e.g., the blood pressure level of patients, or the amount of time spent on an activity by internet users), and $X$ is the $n \times p$ design matrix containing the values of $p$ features for each sample (e.g., demographic and genetic variables of each patient). Moreover, $\beta$ is the $p$-dimensional vector of unknown regression coefficients.

Our goals are to predict the outcome variable for future samples, and also to estimate the regression coefficients. The outcome vector is affected by the random noise $\varepsilon$. We assume that the coordinates of $\varepsilon$ are independent random variables with mean zero and variance $\sigma^2$.

The ridge regression (or Tikhonov regularization) estimator is one of the most popular methods for estimation and prediction in linear models. Recall that the ridge estimator of $\beta$ is

$$\hat\beta(\lambda) = (X^\top X + n\lambda I_p)^{-1} X^\top Y,$$

where $\lambda > 0$ is a tuning parameter. This estimator has many justifications. It shrinks the coefficients of the usual ordinary least squares estimator, which can lead to improved estimation and prediction. When the entries of $\beta$ and $\varepsilon$ are i.i.d. Gaussian, and for a suitable choice of $\lambda$, it is the posterior mean of $\beta$ given the outcomes, and hence is a Bayes optimal estimator for any quadratic loss function, including estimation and prediction error.
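For concreteness, here is a minimal sketch of this estimator in Python (NumPy), using the $n$-scaled regularization convention above. The function name and the simulated data are illustrative only.

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator (X^T X + n * lam * I_p)^{-1} X^T Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# Small illustration on simulated data.
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)   # many small effects
Y = X @ beta + rng.standard_normal(n)
beta_hat = ridge(X, Y, lam=0.5)
print(np.linalg.norm(beta_hat - beta))
```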

Suppose now that we are in a distributed computation setting. The samples are distributed across different sites or machines. For instance, the data of users from a particular country may be stored in a separate datacenter. This may happen due to memory or storage limitations of individual data storage facilities, or may be required by data usage agreements. As mentioned, for simplicity we call the sites “machines”.

We can write the partitioned data as

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{pmatrix}.$$

Thus the $i$-th machine contains $n_i$ samples whose features are stored in the $n_i \times p$ matrix $X_i$, along with the corresponding outcome vector $Y_i$.

Since the ridge regression estimator is a widely used gold standard method, we would like to understand how we can approximate it in a distributed setting. Specifically, we will focus on one-shot weighting methods, where we perform ridge regression locally on each subset of the data, and then aggregate the regression coefficients by a weighted sum. There are several reasons to consider weighting methods:

  1. This is a practical method with minimal communication cost. When communication is expensive, it is imperative to develop methods that minimize communication cost. In this case, one-shot weighting methods are attractive, and so it is important to understand how they work. In a well-known course on scalable machine learning, Alex Smola calls such methods “idiot-proof” (Smola, 2012), meaning that they are straightforward to implement (unlike some of the more sophisticated methods).

  2. Averaging (which is a special case of one-shot weighting) has already been studied in several works on distributed ridge regression (e.g., Zhang et al. (2015); Lin et al. (2017)), and much more broadly in distributed learning, see the related work section for details. Such methods are known to be rate-optimal under certain conditions.

  3. However, in our setting, we are able to discover several new phenomena about one-shot weighting. For instance, we can quantify in a much more nuanced way the accuracy loss compared to centralized ridge regression.

  4. Weighting may serve as a useful initialization to iterative methods. In practical distributed learning problems, iterative optimization algorithms such as distributed gradient descent or ADMM (Boyd et al., 2011) may be used. However, there are examples where the first step of the iterative method has worse performance than a simple averaging (Pourshafeie et al., 2018). Therefore, we can imagine hybrid or warm start methods that use weighting as an initialization. This also suggests that studying one-shot weighting is important.

Therefore, we define local ridge estimators for each dataset $(X_i, Y_i)$, with regularization parameters $\lambda_i > 0$, as

$$\hat\beta_i = \hat\beta_i(\lambda_i) = (X_i^\top X_i + n_i\lambda_i I_p)^{-1} X_i^\top Y_i.$$

We consider combining the local ridge estimators at a central server via a one-step weighted summation. We will find the optimally weighted one-shot distributed estimator

$$\hat\beta_{\mathrm{dist}} = \sum_{i=1}^{k} w_i \hat\beta_i.$$

Note that, unlike ordinary least squares (OLS), the local ridge estimators are always well-defined; in particular, the local sample sizes $n_i$ can be smaller than the dimension $p$. Also, for the distributed OLS estimator averaging local OLS solutions, it is natural to require $\sum_{i=1}^k w_i = 1$, because this ensures unbiasedness (Dobriban and Sheng, 2018). However, the ridge estimators are biased, so it is not clear if we should put any constraints on the weights. In fact we will find that the optimal weights typically do not sum to unity. These features distinguish our work from prior art, and lead to some surprising consequences.
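The following sketch (Python/NumPy, continuing the conventions above) illustrates the one-shot procedure: local ridge fits followed by a weighted combination at the center. The equal split of the rows and the placeholder regularization parameters and weights are illustrative assumptions; choosing them optimally is the subject of the rest of the paper.

```python
import numpy as np

def local_ridge(X_i, Y_i, lam_i):
    """Local ridge estimator on machine i."""
    n_i, p = X_i.shape
    return np.linalg.solve(X_i.T @ X_i + n_i * lam_i * np.eye(p), X_i.T @ Y_i)

def one_shot_distributed_ridge(X_blocks, Y_blocks, lams, weights):
    """Weighted combination of the local ridge estimators."""
    betas = [local_ridge(X_i, Y_i, lam_i)
             for X_i, Y_i, lam_i in zip(X_blocks, Y_blocks, lams)]
    return sum(w_i * b_i for w_i, b_i in zip(weights, betas))

# Illustration: k machines with an equal split of the rows.
rng = np.random.default_rng(1)
n, p, k = 600, 300, 4
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
Y = X @ beta + rng.standard_normal(n)
X_blocks, Y_blocks = np.array_split(X, k), np.array_split(Y, k)
# Placeholder choices; see later sections for optimal lambdas and weights.
beta_dist = one_shot_distributed_ridge(X_blocks, Y_blocks,
                                       lams=[1.0] * k, weights=[1.0 / k] * k)
print(np.linalg.norm(beta_dist - beta))
```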

Throughout the paper, we will frequently use the notations $\hat\Sigma = n^{-1}X^\top X$ and $\hat\Sigma_i = n_i^{-1}X_i^\top X_i$ for the global and local sample covariance matrices. A stepping stone to our analysis is the following key result.

Theorem 2.1 (Finite sample risk and efficiency of optimally weighted distributed ridge for fixed regularization parameters).

Consider the distributed ridge regression problem described above. Suppose we have a dataset with $n$ datapoints (samples), each with an outcome and $p$ features. The dataset is distributed across $k$ sites. Each site $i$ has a subset of the data, with the $n_i \times p$ matrix of features $X_i$ of $n_i$ samples, and the corresponding outcomes $Y_i$. We compute the local ridge regression estimators $\hat\beta_i = \hat\beta_i(\lambda_i)$ with fixed regularization parameters $\lambda_i$ on each dataset. We send the local estimates to a central location, and combine them via a weighted sum, i.e., $\hat\beta_{\mathrm{dist}} = \sum_{i=1}^k w_i\hat\beta_i$.

Under the linear regression model (1), the optimal weights that minimize the mean squared error of the distributed estimator are

$$w^* = (A + R)^{-1} v,$$

where the quantities $v$, $A$, and $R$ are defined below.

  1. $v$ is a $k$-dimensional vector with $i$-th coordinate $v_i = \beta^\top Q_i \beta$, where $Q_i$ are the $p \times p$ matrices $Q_i = (X_i^\top X_i + n_i\lambda_i I_p)^{-1} X_i^\top X_i$,

  2. $A$ is a $k \times k$ matrix with $(i,j)$-th entry $A_{ij} = \beta^\top Q_i Q_j \beta$, and

  3. $R$ is a $k \times k$ diagonal matrix with $i$-th diagonal entry $R_{ii} = \sigma^2 \operatorname{tr}\!\big[(X_i^\top X_i + n_i\lambda_i I_p)^{-2} X_i^\top X_i\big]$.

The mean squared error of the optimally weighted distributed ridge regression estimator with $k$ sites equals

$$\mathbb{E}\,\|\hat\beta_{\mathrm{dist}} - \beta\|^2 = \|\beta\|^2 - v^\top (A + R)^{-1} v.$$

See Section 6.1 for the proof. The argument proceeds via a direct calculation, recognizing that finding the optimal weights for combining the local estimators can be viewed as a $k$-parameter regression problem of $\beta$ on the local estimators $\hat\beta_i$, for $i = 1, \ldots, k$.

This result quantifies the mean squared error of the optimally weighted distributed ridge estimator for fixed regularization parameters $\lambda_i$. Later we will study how to choose the regularization parameters optimally. The result also gives an exact formula for the optimal weights. However, the optimal weights depend on the unknown regression coefficients $\beta$, and are thus not directly usable in practice. Instead, our approach is to make stronger assumptions on $\beta$ under which we can develop estimators for the weights.
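In simulations, where $\beta$ and $\sigma^2$ are known, the optimal weights can be computed directly by solving the $k$-parameter least-squares problem described above. The sketch below (Python/NumPy; the helper name and settings are ours) does this for a small synthetic example; it is an oracle check, not a practical procedure.

```python
import numpy as np

def oracle_weights(X_blocks, lams, beta, sigma2):
    """MSE-optimal weights for combining local ridge estimators when beta and
    sigma^2 are known: minimizes E_noise || beta - sum_i w_i betahat_i ||^2."""
    k, p = len(X_blocks), X_blocks[0].shape[1]
    Q, noise_tr = [], []
    for X_i, lam_i in zip(X_blocks, lams):
        n_i = X_i.shape[0]
        G = X_i.T @ X_i
        Ginv = np.linalg.inv(G + n_i * lam_i * np.eye(p))
        Q.append(Ginv @ G)                          # (X_i^T X_i + n_i lam_i I)^{-1} X_i^T X_i
        noise_tr.append(np.trace(Ginv @ G @ Ginv))  # variance contribution of machine i
    v = np.array([beta @ Q_i @ beta for Q_i in Q])
    M = np.array([[beta @ Q[i] @ Q[j] @ beta for j in range(k)] for i in range(k)])
    M += sigma2 * np.diag(noise_tr)
    return np.linalg.solve(M, v)

rng = np.random.default_rng(4)
n, p, k = 900, 300, 3
X_blocks = [rng.standard_normal((n // k, p)) for _ in range(k)]
beta = rng.standard_normal(p) / np.sqrt(p)
w = oracle_weights(X_blocks, lams=[0.5] * k, beta=beta, sigma2=1.0)
print(w, w.sum())   # the optimal weights typically sum to more than one
```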

Computational efficiency. We take a short detour here to discuss computational efficiency. Computing one ridge regression estimator for a fixed regularization parameter $\lambda$ and an $n \times p$ design matrix $X$ can be done in $O(np\min(n,p))$ time by first computing the SVD of $X$. This automatically gives the ridge estimator for all values of $\lambda$.

How much computation can we save by distributing the data? Suppose first that $p \le n_i$ for all machines, in which case the global cost is $O(np^2)$. Computing ridge locally on the $i$-th machine takes $O(n_i p^2)$ time. Suppose next that we distribute the data equally over $k$ machines, so that $n_i = n/k$, and we also have $p \le n/k$. Then the cost on each machine is reduced to $O(np^2/k)$. In this case we can say that the computational cost decreases proportionally to the number of machines. This shows the benefit of parallel data processing.

On the other extreme, if $p \ge n$, then also $p \ge n_i$, and the per-machine cost is reduced from $O(n^2 p)$ to $O(n_i^2 p) = O(n^2 p/k^2)$. This shows that the computational cost decreases quadratically in the number of machines (albeit of course the constant is much worse). If we are in an intermediate case where $n/k \le p \le n$, then the cost decreases at a rate between linear and quadratic.
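A sketch of the SVD-based computation (Python/NumPy; variable names are illustrative): after one SVD of $X$, ridge estimates for an entire grid of $\lambda$ values are obtained with only cheap additional work per value.

```python
import numpy as np

def ridge_path_via_svd(X, Y, lambdas):
    """Ridge estimators for many regularization values from a single SVD,
    using beta(lam) = V diag(s / (s^2 + n*lam)) U^T Y."""
    n, p = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # cost O(n p min(n, p))
    UtY = U.T @ Y
    # Each additional lambda costs only O(p min(n, p)).
    return [Vt.T @ (s / (s**2 + n * lam) * UtY) for lam in lambdas]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))
Y = rng.standard_normal(200)
betas = ridge_path_via_svd(X, Y, lambdas=np.logspace(-2, 2, 20))
print(len(betas), betas[0].shape)
```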

2.1 Addressing reader concerns

At this stage, our readers may have several concerns about our approach. We address some concerns in turn below.

  1. Does it make sense to average ridge estimators, which can be biased?

    A possible concern is that we are working with biased estimators. Would it make sense to debias them first, before weighting? A similar approach has been used for sparse regression, with the debiased Lasso estimators (Lee et al., 2017; Battey et al., 2018). However, our results allow the regularization parameters to be arbitrarily close to zero, which leads to least squares estimators, computed with a matrix inverse or pseudoinverse of $X_i^\top X_i$. These are the “natural” debiasing estimators for ridge regression. For OLS, these are exactly unbiased, while the pseudoinverse-based estimators are approximately so. Hence our approach allows nearly unbiased estimators, and we automatically discover when this is the optimal method.

  2. Is it possible to improve the weighted sum of local ridge estimators in trivial ways?

    One-shot weighting is merely a heuristic. If it were possible to improve it in a simple way, then it would make sense to study those methods instead of weighting. However, we are not aware of such methods. For instance, one possibility is to add a constant (intercept) term to the regression on the global parameter server, because this may help reduce the bias. In simulation studies, we have observed that this approach does not usually lead to a perceptible decrease in MSE. Specifically, we have found that under the simulation setting common throughout the paper, the MSEs with and without a constant term are close (see Section 6.2 for details).

3 Asymptotics under linear random-effects models

The finite sample results obtained so far can be hard to interpret, and do not allow us to directly understand the performance of the optimal one-shot distributed estimator. Therefore, we will consider an asymptotic setting that leads to more insightful results.

Recall that our basic linear model is $Y = X\beta + \varepsilon$, where the error $\varepsilon$ is random. Next, we also assume that a random-effects model holds. We assume $\beta$ is random, independently of $X$ and $\varepsilon$, with coordinates that are themselves independent random variables with mean zero and variance $\sigma^2\alpha^2/p$. Thus, each feature contributes a small random amount to the outcome. Ridge regression is designed to work well in such a setting, and has several optimality properties in variants of this model. The parameters are now $(\sigma^2, \alpha^2)$: the noise level and the signal-to-noise ratio, respectively. This parametrization is standard and widely used (e.g., Searle et al. (2009); Dicker and Erdogdu (2017); Dobriban and Wager (2018)).

To get more insight into the performance of ridge regression in a distributed environment, we will take an asymptotic approach. Notice from Theorem 2.1 that the mean squared error depends on the data only through simple functionals of the sample covariance matrices $\hat\Sigma$ and $\hat\Sigma_i$, such as the quadratic forms $\beta^\top Q_i \beta$.

When the coordinates of $\beta$ are i.i.d., the means of these quadratic functionals become proportional to the traces of functions of the sample covariance matrices. This motivates us to adopt models from asymptotic random matrix theory, where the asymptotics of such quantities are a central topic.

We begin by introducing some key concepts from random matrix theory (RMT) which will be used in our analysis. We will focus on “Marchenko–Pastur” (MP) type sample covariance matrices, which are fundamental and popular in statistics (see e.g., Bai and Silverstein (2009); Anderson (2003); Paul and Aue (2014); Yao et al. (2015)). A key concept is the spectral distribution, which for a $p \times p$ symmetric matrix $A$ is the probability distribution that places mass $1/p$ on each of its eigenvalues, with a corresponding cumulative distribution function (CDF). A central result in the area is the Marchenko–Pastur theorem, which states that the eigenvalue distributions of sample covariance matrices converge (Marchenko and Pastur, 1967; Bai and Silverstein, 2009). We state the required assumptions below:

Assumption 1.

Consider the following conditions:

  1. The design matrix $X$ is generated as $X = Z\Sigma^{1/2}$ for an $n \times p$ matrix $Z$ with i.i.d. entries (viewed as coming from an infinite array), satisfying $\mathbb{E}[Z_{ij}] = 0$ and $\mathbb{E}[Z_{ij}^2] = 1$, and a deterministic $p \times p$ positive semidefinite population covariance matrix $\Sigma$.

  2. The sample size $n$ grows to infinity proportionally with the dimension $p$, i.e., $n, p \to \infty$ with $p/n \to \gamma \in (0, \infty)$.

  3. The sequence of spectral distributions of $\Sigma$ converges weakly to a limiting distribution $H$ supported on $[0, \infty)$, called the population spectral distribution.

Then, the Marchenko–Pastur theorem states that, with probability one, the spectral distribution of the sample covariance matrix $\hat\Sigma = n^{-1}X^\top X$ also converges weakly (in distribution) to a limiting distribution supported on $[0,\infty)$ (Marchenko and Pastur, 1967; Bai and Silverstein, 2009). The limiting distribution is determined uniquely by a fixed-point equation for its Stieltjes transform, which is defined for any distribution $G$ supported on $[0,\infty)$ as

$$m_G(z) = \int \frac{dG(x)}{x - z}, \qquad z \in \mathbb{C} \setminus \operatorname{supp}(G).$$

With this notation, the Stieltjes transform of the limiting spectral measure of $\hat\Sigma$ satisfies a fixed-point equation involving the Stieltjes transform of the population spectral distribution $H$; this is the Marchenko–Pastur–Silverstein equation. In addition, we denote by $m'$ the derivative of the Stieltjes transform. Then, it is also known that normalized traces of squared resolvents, $p^{-1}\operatorname{tr}\big[(\hat\Sigma - zI_p)^{-2}\big]$, converge almost surely to the corresponding derivative of the limiting Stieltjes transform.

The results stated above can be expressed in a different, and perhaps slightly more modern, language using deterministic equivalents (Serdobolskii, 2007; Hachem et al., 2007; Couillet et al., 2011; Dobriban and Sheng, 2018). For instance, the Marchenko–Pastur law is a consequence of the following result. For any $z$ where it is well-defined, consider the resolvent $(\hat\Sigma - zI_p)^{-1}$. This random matrix is equivalent to a deterministic matrix of the form $(x_p\Sigma - zI_p)^{-1}$ for a certain scalar $x_p = x_p(n, p, z)$, and we write

$$(\hat\Sigma - zI_p)^{-1} \asymp (x_p\Sigma - zI_p)^{-1}.$$

Here two sequences $A_p, B_p$ of matrices (not necessarily symmetric) of growing dimensions are equivalent, and we write $A_p \asymp B_p$, if

$$\lim_{p\to\infty} \operatorname{tr}\big[C_p(A_p - B_p)\big] = 0$$

almost surely, for any sequence of deterministic matrices $C_p$ (not necessarily symmetric) with bounded trace norm, i.e., such that $\limsup_p \|C_p\|_{\mathrm{tr}} < \infty$ (Dobriban and Sheng, 2018). Informally, any linear combination of the entries of $A_p$ can be approximated by the corresponding linear combination of the entries of $B_p$. This can also be viewed as a kind of weak convergence in the matrix space equipped with an inner product (trace). From this, it also follows that the normalized traces of the two matrices are equivalent, from which we can recover the MP law.

In Dobriban and Sheng (2018), we collected some useful properties of the calculus of deterministic equivalents. In this work, we use those properties extensively. We also develop and use a new differentiation rule for the calculus of deterministic equivalents (see Section 6.3).

We are now ready to study the asymptotics of the risk. We express the limits of interest in two equivalent forms, one in terms of population quantities (such as the limiting population spectral distribution $H$ of $\Sigma$), and one in terms of sample quantities (such as the limiting spectral distribution of $\hat\Sigma$). Moreover, we will denote by $T$ a random variable distributed according to $H$, so that $\mathbb{E}_H f(T)$ denotes the mean of $f(T)$ when $T$ is a random variable distributed according to the limiting population spectral distribution $H$.

The key to obtaining the results based on population quantities is that the quadratic forms involving $\beta$ have asymptotic equivalents that depend only on traces, based on the concentration of quadratic forms. Specifically, we have

$$\beta^\top M \beta - \frac{\sigma^2\alpha^2}{p}\operatorname{tr}(M) \to 0 \quad \text{almost surely}$$

for suitable matrices $M$ (see the proof of Theorem 3.1 for details). The key to the results based on sample quantities is the MP law and the calculus of deterministic equivalents.

Theorem 3.1 (Asymptotics for distributed ridge, arbitrary regularization).

In the linear random-effects model under Assumption 1, suppose in addition that the eigenvalues of $\Sigma$ are uniformly bounded away from zero and infinity, and that the entries of $Z$ have finite moments of a sufficiently high order. Suppose moreover that the local sample sizes $n_i$ grow proportionally to $n$, so that the local aspect ratios converge, $p/n_i \to \gamma_i \in (0, \infty)$.

Then the optimal weights for distributed ridge regression, and its MSE, converge to definite limits. Recall from Theorem 2.1 that we have the formulas $w^* = (A+R)^{-1}v$ and $\mathrm{MSE}^* = \|\beta\|^2 - v^\top(A+R)^{-1}v$ for the optimal finite-sample weights and risk, and thus it is enough to find the limits of $v$ and $A + R$. These have the following limits:

  1. With probability one, $v$ converges to a deterministic limit. The $i$-th coordinate of the limit has two equivalent forms, in terms of population and sample quantities, respectively.

    Recall that $H$ is the limiting population spectral distribution of $\Sigma$, and $T$ is a random variable distributed according to $H$. Among the empirical quantities, the relevant objects are the limiting empirical spectral distribution of $\hat\Sigma_i$ and a scalar that is the unique solution of an associated fixed-point equation.

    It is part of the theorem’s claim that such a solution exists.

  2. With probability one, $A$ converges to a deterministic limit. For $i \neq j$, the $(i,j)$-th entry of the limit can be expressed in terms of the population spectral distribution $H$.

    The $i$-th diagonal entry of the limit can be expressed in terms of population and sample quantities, respectively.

  3. With probability one, the diagonal matrix $R$ converges to a deterministic limit, which of course is also diagonal. The $i$-th diagonal entry of the limit can be expressed in terms of population and sample quantities, respectively.

The limiting weights and MSE are then obtained by substituting these limits into the formulas for $w^*$ and $\mathrm{MSE}^*$ from Theorem 2.1.

See Section 6.4 for the proof. The statement may look complicated, but the formulas simplify considerably in the uncorrelated case $\Sigma = I_p$, on which we will focus later. Moreover, these limiting formulas are also fundamental for developing consistent estimators for the optimal weights.

Now we discuss the problem of estimating the optimal weights, which is crucial for developing practical methods. Recall that the optimal weights are $w^* = (A+R)^{-1}v$, where $v$, $A$, and $R$ are given in Theorem 2.1, and depend on unknown parameters. The results in Theorem 3.1 show that to estimate the weights consistently, we only need to estimate $(\sigma^2, \alpha^2)$ consistently. The reason is that the other components only depend on the local sample covariance matrices $\hat\Sigma_i$, which are empirically observable.

Estimating these two parameters is a well-known problem, and several approaches have been proposed, for instance restricted maximum likelihood (REML) estimators (Jiang, 1996; Searle et al., 2009; Dicker, 2014; Dicker and Erdogdu, 2016; Jiang et al., 2016). We can use, for instance, results from Dicker and Erdogdu (2017), who showed that the Gaussian MLE is consistent and asymptotically efficient for $(\sigma^2, \alpha^2)$ even in the non-Gaussian setting of this paper (see Section 6.5 for a summary).
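As a simpler illustration of why $(\sigma^2, \alpha^2)$ are estimable from a single machine, here is a method-of-moments sketch (our own construction, not the MLE used in the text), based on the conditional expectations of $\|Y\|^2$ and $\|X^\top Y\|^2$ under the random-effects model with nearly isotropic, standardized features.

```python
import numpy as np

def moment_estimates(X, Y):
    """Method-of-moments estimates of (sigma^2, alpha^2) in the model
    Y = X beta + eps with Var(beta_j) = sigma^2 * alpha^2 / p. Conditional on X:
        E ||Y||^2 / n       = sigma^2 + (sigma^2 alpha^2) * t1,
        E ||X^T Y||^2 / n^2 = sigma^2 * gamma * t1 + (sigma^2 alpha^2) * t2,
    with t1 = tr(Sighat)/p, t2 = tr(Sighat^2)/p, gamma = p/n."""
    n, p = X.shape
    gamma = p / n
    Sighat = X.T @ X / n
    t1 = np.trace(Sighat) / p
    t2 = np.trace(Sighat @ Sighat) / p
    m1 = np.sum(Y**2) / n
    m2 = np.sum((X.T @ Y)**2) / n**2
    # Solve the 2x2 linear system for (sigma^2, sigma^2 * alpha^2).
    sigma2, tau2 = np.linalg.solve(np.array([[1.0, t1], [gamma * t1, t2]]),
                                   np.array([m1, m2]))
    return sigma2, tau2 / sigma2

rng = np.random.default_rng(0)
n, p, sigma2, alpha2 = 2000, 800, 1.0, 2.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) * np.sqrt(sigma2 * alpha2 / p)
Y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
print(moment_estimates(X, Y))   # roughly (1.0, 2.0)
```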

4 Special case: identity covariance

When the population covariance matrix is the identity, that is $\Sigma = I_p$, the results simplify considerably. In this case the features are nearly uncorrelated. It is known that the limiting Stieltjes transform $m_\gamma$ of the spectral distribution of $\hat\Sigma$ has the explicit form (Marchenko and Pastur, 1967):

$$m_\gamma(z) = \frac{(1 - \gamma - z) - \sqrt{(1 - \gamma - z)^2 - 4\gamma z}}{2\gamma z}. \qquad (2)$$

As usual in the area, we use the principal branch of the square root of complex numbers.
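A quick numerical sanity check of this closed form (Python/NumPy; the comparison setup is ours): for $\lambda > 0$, $m_\gamma(-\lambda)$ should be close to the average resolvent trace $p^{-1}\operatorname{tr}[(\hat\Sigma + \lambda I_p)^{-1}]$ when $n$ and $p$ are large.

```python
import numpy as np

def mp_stieltjes(z, gamma):
    """Stieltjes transform of the Marchenko-Pastur law with aspect ratio gamma,
    via the closed form in equation (2); z should lie outside the support."""
    z = complex(z)
    return ((1 - gamma - z) - np.sqrt((1 - gamma - z)**2 - 4 * gamma * z)) / (2 * gamma * z)

rng = np.random.default_rng(0)
n, p, lam = 2000, 1000, 0.7
gamma = p / n
X = rng.standard_normal((n, p))
Sighat = X.T @ X / n
empirical = np.trace(np.linalg.inv(Sighat + lam * np.eye(p))) / p
print(empirical, mp_stieltjes(-lam, gamma).real)   # should be close
```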

4.1 Properties of the estimation error and asymptotic relative efficiency

We can use the closed form expression for the Stieltjes transform to get explicit formulas for the optimal weights. From Theorem 3.1, we conclude the following simplified result:

Theorem 4.1 (Asymptotics for isotropic population covariance, arbitrary regularization parameters).

In addition to the assumptions of Theorem 3.1, suppose that the population covariance matrix $\Sigma = I_p$. Then the limits of $v$, $A$, and $R$ have simple explicit forms:

  1. The $i$-th coordinate of the limit of $v$ has an explicit form in terms of the Stieltjes transform $m_{\gamma_i}$ given above in equation (2).

  2. The entries of the limit of $A$ have explicit forms of the same type; for $i \neq j$ they factor into products of terms depending on machines $i$ and $j$ separately.

  3. The $i$-th diagonal entry of the limit of $R$ has an explicit form as well.

The limiting optimal weights for combining the local ridge estimators, and the MSE of the optimally weighted distributed estimator, follow by substituting these limits into the formulas of Theorem 2.1.

See Section 6.6 for the proof. This theorem shows the surprising fact that the limiting risk decouples over the different machines. By this we mean that the limiting risk can be written in a simple form, involving a sum of terms depending on each machine, without any interaction. This is a striking phenomenon.

To explain the decoupling phenomenon in more detail, let us study how the local risks are related to the distributed risk. Consider the limiting scalar quantity from above in the special case $k = 1$: explicitly, this is the limit of the corresponding quantity in Theorem 2.1 applied with $k = 1$ to the full dataset, and we define the analogous scalar for the $i$-th local dataset in the same way. With these notations, the risk of ridge regression computed on the entire dataset equals

Moreover, the risk of optimally weighted one-shot distributed ridge over $k$ subsets, with arbitrary regularization parameters $\lambda_1, \ldots, \lambda_k$, equals

Then one can check that we have the following equations connecting the risk computed on the entire dataset and the distributed risk:

These equations are precisely what we mean by decoupling. The distributed risk can be written in terms of functions of the same type as the local risks. Therefore, there are no “interactions” between the different risk functions. Similar expressions have been obtained for linear regression (Dobriban and Sheng, 2018).

Next, we discuss in more depth why the limiting risk decouples. Mathematically, the key reason is that when $\Sigma = I_p$, the limit of the off-diagonal entries $A_{ij}$, for $i \neq j$, decouples into a product of two terms. Therefore, the distributed risk function involves a quadratic form with zero off-diagonal interaction terms. This is not the case for a general population covariance $\Sigma$. We provide an explanation via free probability theory in Section 6.7.

An important consequence of the decoupling is that we can optimize the individual risks over the tuning parameters separately.

Proposition 4.2 (Optimal regularization (tuning) parameters, and risk).

Under the assumptions of Theorem 4.1, the optimal regularization (tuning) parameters $\lambda_i$ that minimize the local MSEs also minimize the distributed risk. They have the form $\lambda_i^* = \gamma_i/\alpha^2$, so that $n_i\lambda_i^*$ is the same for all machines.

Moreover, the risk of distributed ridge regression with optimally tuned regularization parameters has an explicit form as well.

See Section 6.8 for the proof.

The main goal of our paper is to study the behavior of the one-shot distributed ridge estimator and compare it with the centralized estimator. It is helpful to first understand the properties of the optimal risk function $\phi$. The optimal risk function equals the optimally tuned global risk up to a constant factor, and it has an explicit form.

Proposition 4.3 (Properties of the optimal risk function).

The optimal risk function $\phi$ has the following properties:

  1. Monotonicity: $\phi$ is an increasing function of the aspect ratio $\gamma$, starting from zero at $\gamma = 0$ and approaching a finite limit as $\gamma \to \infty$.

  2. Concavity: When the signal-to-noise ratio $\alpha^2$ is small, $\phi$ is a concave function of $\gamma$. When $\alpha^2$ is large, $\phi$ is convex for small $\gamma$ (close to zero), and concave for large $\gamma$.

See Section 6.9 for the proof. See also Figure 2 for plots of $\phi$ for different values of $\alpha$, which show its monotonicity and convexity properties. The aspect ratio $\gamma$ characterizes the dimensionality of the problem. It makes sense that $\phi$ is increasing, since the regression problem should become more difficult as the dimension increases. For the second property, the concavity of the function means that it grows very fast to approach its limit. When the signal-to-noise ratio is small, the risk is concave, so it grows fast with the dimension. But when the signal-to-noise ratio becomes large, the risk grows much more slowly at the beginning. The phase transition happens at a critical value of the signal-to-noise ratio. This gives insight into the effect of the signal-to-noise ratio on the regression problem.

Figure 2: Plots of the optimal risk function $\phi$ as a function of the aspect ratio $\gamma$, for different signal strength parameters $\alpha$.

To compare the distributed and centralized estimators, we will study their (asymptotic) relative efficiency (ARE), which is the (limit of the) ratio of their mean squared errors. Here we assume each estimator is optimally tuned. This quantity, which is at most unity, captures the loss of efficiency due to the distributed setting. An ARE close to one is “good”, while an ARE close to zero is “bad”. From the results above, it follows that the ARE has an explicit form.

We have the following properties of the ARE.

Theorem 4.4 (Properties of the asymptotic relative efficiency (ARE)).

The asymptotic relative efficiency (ARE) has the following properties:

  1. Worst case is equally distributed data: For fixed $n$, $p$, and $k$, the ARE attains its minimum when the samples are equally distributed across machines, i.e., $n_i = n/k$ for all $i$. We denote this minimal value by $\mathrm{ARE}(k)$.

  2. Adding more machines leads to efficiency loss: For fixed $\gamma$ and $\alpha$, $\mathrm{ARE}(k)$ is a decreasing function of $k$, with $\mathrm{ARE}(1) = 1$ and a strictly positive infinite-worker limit as $k \to \infty$.

    Here we can view $\mathrm{ARE}(k)$ as a continuous function of $k$ for convenience, although originally it is only well-defined for integer $k \ge 1$. We emphasize that the infinite-worker limit tells us how much efficiency we retain for a very large number of machines. It is a nontrivial result that this quantity is strictly positive.

  3. Form of the infinite-worker limit: As a function of $\gamma$ and $\alpha$, the infinite-worker limit has an explicit closed form.

  4. Edge cases of the infinite-worker limit: For fixed $\alpha$, the infinite-worker limit is an increasing function of $\gamma$, with a well-defined limit as $\gamma \to \infty$. On the other hand, for fixed $\gamma$, it is a decreasing function of $\alpha$, with a well-defined limit as $\alpha \to \infty$.

See Section 6.10 for the proof. See Figure 3 for some plots of the evenly distributed ARE for various $\gamma$ and $\alpha$, and Figure 1 for the surface and contour plots of the infinite-worker limit. The efficiency loss tends to be larger (the ARE is smaller) when the signal-to-noise ratio is larger. The plots confirm the theoretical result that the efficiency always decreases with the number of machines. Relatively speaking, the distributed problem becomes easier and easier as the dimension increases, compared to the aggregated problem (i.e., the ARE increases in $\gamma$ for fixed remaining parameters). This can be viewed as a blessing of dimensionality.

We also observe a nontrivial infinite-worker limit. Even in the limit of many machines, distributed ridge does not lose all efficiency. This is in contrast to doing linear regression on each machine, where all efficiency is lost when the local sample sizes are less than the dimension (Dobriban and Sheng, 2018). This is one of the few results in the distributed learning literature where one-shot weighting gives nontrivial results for an arbitrarily large number of machines, i.e., we can take $k \to \infty$ and still obtain nontrivial results. We find this quite remarkable.

Overall, the ARE is generally large, except when $\gamma$ is small and $\alpha$ is large. This is a setting with strong signal and relatively low dimension, which is also the “easiest” setting from a statistical point of view. In this case, perhaps we should use other techniques for distributed estimation, such as iterative methods.
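The efficiency comparison can also be checked by direct simulation. The sketch below (Python/NumPy; our own setup, not the code released with the paper) compares the MSE of centralized ridge with an oracle-weighted one-shot distributed ridge on equally split data, averaging over random-effects draws. The tuning rule $n\lambda = p/\alpha^2$ used here is an assumption, in line with the discussion in Section 4.5.

```python
import numpy as np

def ridge(X, Y, nlam):
    """Ridge estimator (X^T X + nlam * I)^{-1} X^T Y, with nlam = n * lambda."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + nlam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
n, p, k, sigma2, alpha2 = 1200, 600, 6, 1.0, 2.0
reps, err_central, err_dist, weight_sums = 20, [], [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p) * np.sqrt(sigma2 * alpha2 / p)   # random effects
    Y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    b_central = ridge(X, Y, p / alpha2)                 # centralized, n*lambda = p/alpha^2
    B = np.column_stack([ridge(X_i, Y_i, p / alpha2)    # same rule locally
                         for X_i, Y_i in zip(np.array_split(X, k),
                                             np.array_split(Y, k))])
    w = np.linalg.lstsq(B, beta, rcond=None)[0]         # oracle weights (simulation only)
    err_central.append(np.sum((b_central - beta) ** 2))
    err_dist.append(np.sum((B @ w - beta) ** 2))
    weight_sums.append(w.sum())
print("ARE estimate:", np.mean(err_central) / np.mean(err_dist))
print("average sum of oracle weights:", np.mean(weight_sums))   # typically above one
```

The last line also illustrates the weighting phenomenon discussed in the next subsection: the oracle weights tend to sum to more than one.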

Figure 3: Plots of the asymptotic relative efficiency when the datasets are evenly distributed, for different $\gamma$ and $\alpha$. See Theorem 4.4 for the properties of the ARE.

4.2 Properties of the optimal weights

Next, we study properties of the optimal weights. This is important, because choosing them is a crucial practical question. The literature on distributed regression typically considers simple averages of local estimators, for which $w_i = 1/k$ (see, e.g., Zhang et al. (2015); Lee et al. (2017); Battey et al. (2018)). In contrast, we will find that the optimal weights do not sum to unity.

Formally, we have the following properties of the optimal weights.

Theorem 4.5 (Properties of the optimal weights).

The asymptotically optimal weights have the following properties:

  1. Form of the optimal weights: The $i$-th coordinate of the limiting optimal weight vector has an explicit form, and the sum of the limiting weights is always greater than or equal to one, and is typically strictly greater than one.

  2. Evenly distributed optimal weights: When the samples are evenly distributed, so that all limiting local aspect ratios are equal, $\gamma_i = k\gamma$, then all coordinates equal a common optimal weight function, which has an explicit form. This can also be written in terms of the optimal risk function $\phi$ defined above.

  3. Limiting cases: For fixed $k$ and $\alpha$, the optimal weight function is an increasing function of $\gamma$, with limits $1/k$ as $\gamma \to 0$ and $1$ as $\gamma \to \infty$.

Figure 4: Plots of the optimal weights for different parameter settings.
(a) Surface
(b) Contour
Figure 5: Surface and contour plots of the optimal weights.

See Section 6.11 for the proof. See Figures 4 and 5 for some plots of the optimal weight function in the evenly distributed case. We can see that the optimal weights are usually large, and always greater than $1/k$. When the signal-to-noise ratio is small, the weight function is concave and increases fast to approach one. In the low-dimensional setting where $\gamma \to 0$, the weights tend to the uniform average $1/k$. Hence in this setting we recover the classical uniform averaging methods, which makes sense, because ridge regression with an optimally chosen regularization parameter tends to linear regression in this regime.

Figure 6: Distributed risk as a function of the regularization parameter. We plot both the risk with optimal weights (MSE opt) and the risk obtained from sub-optimal averaging (MSE avg), for a fixed setting of $k$, $\gamma$, and $\alpha$.

How much does optimal weighting help? It is both interesting and important to know this, especially compared to naive uniform weighting, because it allows us to compare our proposed weighting method to the “baseline”. See Figure 6. We have plotted the risk of distributed ridge regression for both the optimally weighted version and the simple average, as a function of the regularization parameter. We observe that optimal weighting can lead to a 30-40% decrease in the risk. Therefore, our proposed weighting scheme can lead to a substantial benefit.

Why are the weights large, and why do they sum to a quantity greater than one? The short intuitive answer is that ridge regression is negatively (or downward) biased, and so we must counter the effect of bias by upweighting. This also can be viewed as a way of debiasing. In different contexts, it is already well known that debiasing can play a key role in distributed learning (Lee et al. (2017); Battey et al. (2018)). We provide a slightly more detailed intuitive explanation in Section 6.12.

4.3 An algorithm for distributed ridge regression

Algorithm 1: Optimally weighted one-shot distributed ridge regression.

For identity covariance, our results lead to a very concrete algorithm for optimally weighted distributed ridge regression. Recall that we have $n$ samples distributed across $k$ machines. On the $i$-th machine, we compute a local ridge estimator $\hat\beta_i$, local estimators $\hat\alpha_i^2$ and $\hat\sigma_i^2$ of the signal-to-noise ratio and the noise level, and the scalar quantities needed to find the optimal weights. Then, we send them to a global machine or data center, and aggregate them to compute a weighted ridge estimator. See Algorithm 1. This algorithm is communication-efficient, as the local machines only need to send the local ridge estimator and a few scalars to the global data center.

We assume the data are already mean-centered, which can be performed exactly in one additional round of communication, or approximately by centering the individual datasets. Our algorithm is designed for the scenario when the population covariance matrix of the design is close to the identity, meaning the features are nearly uncorrelated. In some cases, we can broaden the applicability of the algorithm by using techniques like whitening or variable pruning. For example, if we have a good estimator of the population covariance matrix $\Sigma$, then we can whiten the local design matrices by multiplying them on the right by the inverse square root of this estimator.

In the algorithm, we combine the local estimators of the noise level and signal strength to find global estimators $\hat\sigma^2$ and $\hat\alpha^2$. A simple method is to take the averages, e.g., $\hat\alpha^2 = k^{-1}\sum_{i=1}^k \hat\alpha_i^2$. Another option is to use inverse-variance weighting, based on the asymptotic variance of the MLE (which then of course has to be estimated).

Based on the results so far, it follows that our algorithm can consistently estimate the limiting optimal weights, and moreover it has asymptotically optimal mean squared error among all weighted distributed ridge estimators, for identity covariance. We omit the details.
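To make the pipeline concrete, here is a sketch of such an algorithm in Python/NumPy. It follows the structure described above, but the specific plug-in formulas for the weights (based on the random-effects moments, with $\mathbb{E}[\beta\beta^\top] = (\sigma^2\alpha^2/p)I_p$, local trace summaries, and the asymptotic decoupling of the cross terms) and all function names are our own simplifications, not the exact steps of Algorithm 1. Here $\sigma^2$ and $\alpha^2$ are taken as given, e.g., from local estimates combined as described above.

```python
import numpy as np

def local_summary(X_i, Y_i, lam_i):
    """Computed on machine i: the local ridge estimator plus a few scalar
    summaries that the center needs in order to build the weights."""
    n_i, p = X_i.shape
    G = X_i.T @ X_i
    Ginv = np.linalg.inv(G + n_i * lam_i * np.eye(p))
    Q = Ginv @ G                          # (X_i^T X_i + n_i lam_i I)^{-1} X_i^T X_i
    return {"beta": Ginv @ (X_i.T @ Y_i),      # local ridge estimator
            "t": np.trace(Q) / p,              # tr(Q_i) / p
            "t2": np.trace(Q @ Q) / p,         # tr(Q_i^2) / p
            "r": np.trace(Ginv @ Q)}           # noise-variance summary

def combine_at_center(summaries, sigma2, alpha2, p):
    """At the center: plug-in optimal weights, assuming Var(beta_j) = sigma2*alpha2/p
    and nearly isotropic features, then the weighted combination."""
    tau2 = sigma2 * alpha2                     # E ||beta||^2
    t = np.array([s["t"] for s in summaries])
    v = tau2 * t
    M = tau2 * np.outer(t, t)                  # decoupled off-diagonal terms
    for i, s in enumerate(summaries):
        M[i, i] = tau2 * s["t2"] + sigma2 * s["r"]
    w = np.linalg.solve(M, v)
    beta_dist = sum(w_i * s["beta"] for w_i, s in zip(w, summaries))
    return beta_dist, w

# Illustration with sigma2 and alpha2 treated as known (in practice, estimated
# locally and combined as described in the text).
rng = np.random.default_rng(2)
n, p, k, sigma2, alpha2 = 1000, 500, 5, 1.0, 1.5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) * np.sqrt(sigma2 * alpha2 / p)
Y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
X_blocks, Y_blocks = np.array_split(X, k), np.array_split(Y, k)
lams = [p / (X_i.shape[0] * alpha2) for X_i in X_blocks]   # assumed tuning rule
summaries = [local_summary(X_i, Y_i, l) for X_i, Y_i, l in zip(X_blocks, Y_blocks, lams)]
beta_dist, w = combine_at_center(summaries, sigma2, alpha2, p)
print("weights:", np.round(w, 3), "sum:", round(float(w.sum()), 3))
```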

4.4 Out-of-sample prediction

So far, we have discussed the estimation problem. In real applications, out-of-sample prediction is also of interest. We consider a test datapoint $(x_t, y_t)$, generated from the same model $y_t = x_t^\top\beta + \varepsilon_t$, where $x_t$ and $\varepsilon_t$ are independent of the training data. We want to use $x_t^\top\hat\beta_{\mathrm{dist}}$ to predict $y_t$, and the out-of-sample prediction error is defined as $\mathbb{E}\,(y_t - x_t^\top\hat\beta_{\mathrm{dist}})^2$. Then we have the following proposition.
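Since the test features are isotropic in this setting, the prediction error of any estimator decomposes as the noise level plus its estimation error. The following small simulation (our own illustration) checks this identity numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 800, 400, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
Y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
beta_hat = np.linalg.solve(X.T @ X + p * np.eye(p), X.T @ Y)   # any estimator works here

# Out-of-sample prediction error, estimated on fresh isotropic test points.
n_test = 20000
X_test = rng.standard_normal((n_test, p))
y_test = X_test @ beta + np.sqrt(sigma2) * rng.standard_normal(n_test)
pred_err = np.mean((y_test - X_test @ beta_hat) ** 2)

# For isotropic test features: E (y_t - x_t' beta_hat)^2 = sigma^2 + ||beta_hat - beta||^2.
print(pred_err, sigma2 + np.sum((beta_hat - beta) ** 2))
```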

Proposition 4.6 (Out-of-sample prediction error (test error) and relative efficiency).

Under the conditions of Theorem 4.1, the limiting out-of-sample prediction error of the optimal distributed estimator equals $\sigma^2$ plus its limiting estimation error.

Thus, the asymptotic out-of-sample relative efficiency, meaning the ratio of the prediction errors of the centralized and distributed estimators, is always closer to one than the estimation ARE, so the efficiency for prediction is higher than for estimation. Furthermore, when the samples are equally distributed, the relative efficiency has an explicit form, and the corresponding infinite-worker limit (taking $k \to \infty$) can also be computed explicitly.

Figure 7: Limit of the out-of-sample relative efficiency (OE): (a) surface and (b) contour plots of its infinite-worker limit.

See Section 6.13 for the proof and Figure 7 for some plots. This proposition implies that, for the identity covariance case, the efficiency loss of the distributed estimator in terms of the test error is always less than the loss in terms of the estimation error. When the signal-to-noise ratio is small, the relative efficiency is always very large and close to one. This observation encourages the use of our distributed methods for out-of-sample prediction.

4.5 Choosing the regularization parameter

Previous work found that, under certain conditions, the regularization parameters on the individual machines should be chosen as if each machine contained all the samples (Zhang et al., 2015). Our findings are consistent with these results. However, the reasons behind our findings are very different from prior work. The intuition for the previous results is that the variance of distributed estimators averages out, while the bias does not. Therefore, the regularization parameters should be chosen such that the local bias is lower than for locally optimal tuning. This means that we should use smaller regularization parameters locally.

In our case, we find that for isotropic covariance, the optimal risk decouples across machines. Hence, the regularization parameters on the machines can be chosen optimally for each machine. Moreover, in our asymptotics the locally optimal choice is a constant multiple of the globally optimal choice; namely, the multiple in front of the identity matrix in the local ridge estimator, $n_i\lambda_i$, should equal the globally optimal multiple $n\lambda$.

Roughly speaking, this derivation reaches the same conclusion as prior work about the choice of regularization parameters, namely that the regularization parameters on the machines should be chosen as if each machine contained all the samples. However, we emphasize that our results are very different, because the optimal weighting procedure has weights summing to more than unity. Moreover, we also consider the proportional-limit case, and the conclusion for regularization parameters only applies to the isotropic case.

4.6 Implications and practical relevance

We discuss some of the implications of our results. Our finite-sample results show that the optimal way to weight the estimators depends on functionals of the unknown parameter $\beta$, while the asymptotic results in general depend on the eigenvalues of $\Sigma$ (or of $\hat\Sigma$). These are unavailable in practice, and hence these results can typically not be used directly on real datasets. However, since our results are precise and accurate (they capture the truth about the problem), we interpret this as saying that the problem is hard in general: optimal weighting for ridge regression is a challenging statistical problem. In practice this means that we may be content with uniform weighting. It remains to be investigated in future work how much we should up-adjust those equal weights.

The optimal weights become usable only in the case of spherical data, when $\Sigma = I_p$ (or, more accurately, when the limiting spectral distribution of $\Sigma$ is the point mass at unity). In practice, we can get closer to this assumption by using some form of whitening on the data, for instance by scaling all variables to the same scale, or by estimating $\Sigma$ over restricted classes, such as assuming block-covariance structures. Alternatively, we can use correlation screening, where we remove features with high correlation. At this stage, all these approaches are heuristic, but we include them to explain how our results can be relevant in practice. It is a topic of future research to make these ideas more concrete.

On the theoretical side, our results can also be interpreted as a form of reduction between statistical problems. If we can estimate the quadratic functionals of the unknown regression parameter involved in our weights, then we can do optimally weighted ridge regression. In this sense, we reduce distributed ridge regression to the estimation of those quadratic functionals. We think that in the challenging and novel setting of distributed learning, such reductions can be both interesting and potentially useful.

An important question is: “Should we use distributed linear or ridge regression?” If the local sample sizes exceed the dimension, so that linear regression is well-defined on each local machine, then we can use either distributed linear regression (Dobriban and Sheng, 2018) or ridge regression. Linear regression has the advantage that the optimal weights are easy to find. Therefore, if we cannot reasonably reduce to the nearly isotropic case, it seems we should use linear regression.

4.7 Minimax optimality of the optimal distributed estimator

To deepen our understanding of the distributed problem, we next show that the optimal distributed ridge estimator is asymptotically rate-minimax. Suppose without loss of generality that the noise level is $\sigma^2 = 1$, and let $S(\alpha)$ denote the sphere of radius $\alpha$ in $\mathbb{R}^p$ centered at the origin. Then the minimax risk for estimating $\beta$ over the sphere is defined in the usual way, where the expectation is over both the noise and the design matrix $X$. This problem has been well studied by Dicker (2016), who reduced it to the following Bayes problem. Let $\pi$ be the uniform measure on the sphere $S(\alpha)$. Then the Bayes risk with respect to $\pi$ is the minimal expected risk over all estimators, averaged over $\beta \sim \pi$.

The Bayes estimator is the posterior mean, and it attains the Bayes risk. Moreover, the Bayes estimator also minimizes the original minimax risk, and the minimax risk equals the Bayes risk (Dicker, 2016).

Recall that the ridge estimator with optimally tuned regularization parameter can be interpreted as the posterior mean of $\beta$ under a normal prior with mean zero and covariance $(\alpha^2/p) I_p$ (recall that $\sigma^2 = 1$). When $p$ is very large, this normal distribution concentrates near the uniform distribution on the sphere $S(\alpha)$, so we would expect the optimally tuned ridge estimator to be close to the Bayes estimator. With this intuition, Dicker (2016) further showed that, as $n, p \to \infty$ proportionally, the risk of the optimally tuned global ridge estimator approaches the minimax risk.

So the global ridge estimator is asymptotically exact minimax.

We call an estimator is