An asymptotic analysis of distributed nonparametric methods
We investigate and compare the fundamental performance of several distributed learning methods that have been proposed recently. We do this in the context of a distributed version of the classical signal-in-Gaussian-white-noise model, which serves as a benchmark model for studying performance in this setting. The results show how the design and tuning of a distributed method can have great impact on convergence rates and validity of uncertainty quantification. Moreover, we highlight the difficulty of designing nonparametric distributed procedures that automatically adapt to smoothness.
Both in statistics and machine learning there has been substantial interest in the design and study of distributed statistical or learning methods in recent years. One driving reason is the fact that in certain applications datasets have become so large that it is often unfeasible, or computationally undesirable, to carry out the analysis on a single machine. In a distributed method the data are divided over a cluster consisting of several machines and/or cores. The machines in the cluster then process their data locally, after which the local results are somehow aggregated on a central machine to finally produce the overall outcome of the statistical analysis. Distributed methods are not only used for computational reasons, but are for instance also of interest in situations where privacy is important and it is undesirable that all data are handled at a single location. Moreover, there are applications in which data are by construction gathered at multiple locations and first processed locally, before being combined at a central location.
Over the last years a variety of distributed methods have been proposed. Recent examples include Consensus Monte Carlo (Scott et al. (2016)), WASP (Srivastava et al. (2015)), and distributed GP’s (Deisenroth and Ng (2015)), to mention but a few. Most papers on distributed methods do extensive experiments on simulated, benchmark and real data to numerically assess and compare the performance of the various methods. Some papers also derive a number of theoretical properties. Theoretical results on the performance of distributed methods are not yet widely available however and there is certainly no common theoretical framework in place that allows a clear theoretical comparison of methods and the development of an understanding of fundamental performance guarantees and limitations.
Since a better theoretical understanding of distributed methods can help to pinpoint fundamental difficulties and opportunities, we develop a framework in this paper which allows us to study and compare the performance of various methods. We are in particular interested in high-dimensional, or nonparametric problems. It is by now well known that the performance of learning or statistical methods in such settings depends crucially on wether or not a method succeeds in realizing the correct bias-variance trade-off, or, in different terminology, succeeds in balancing under- and overfitting. For classical, non-distributed settings we have a rather well-developed understanding of how methods should be tuned to achieve a proper bias-variance trade-off. For distributed methods however, such theory is currently not yet available.
To be able to develop relevant theory we study an idealized model, which is a distributed version of the canonical “signal-in-white-noise” model that serves as an important benchmark model in mathematical statistics (see for instance Tsybakov (2009); Johnstone (2017); Giné and Nickl (2016)). The model is on the one hand rich enough to be interesting, in the sense that it is really distributed in nature and the unknown object that needs to be learned is truly infinite-dimensional. On the other hand it is tractable enough to allow detailed mathematical analysis. In the non-distributed case the signal-in-white-noise model is well known to be very closely related to other nonparametric models, such as nonparametric regression and density estimation. (This can be made very precise in the context of Le Cam’s theory of limits of experiments (e.g. Le Cam (2012), Brown and Low (1996), Nussbaum (1996)).) Similarly, the distributed signal-in-white-noise model that we consider in this paper provides a unified framework to compare methods that were originally introduced in different settings. We introduce the model in Section 2.
It is not difficult to see that if the number of machines is relatively large with respect to the total sample size, or signal-to-noise-ratio , then doing things completely naively in the distributed case leads to a sub-optimal bias-variance trade-off (see also the simulation example in Section 2). In particular, just computing the “usual” estimators on every local machine and then averaging them on the central machine typically leads to a global estimator with a bias that is too large. To achieve good performance, the trade-off has to be adjusted somehow. This can in principle be done in various ways. For instance by locally choosing the “wrong” settings for tuning parameters on purpose, or, in a Bayesian setting, by adjusting the likelihood (e.g. raising it to some power) or by adjusting the prior. In Section 3 we study to what degree various methods that have been proposed in the literature succeed in ultimately achieving the right trade-off. We will see that some are more successful than others in this respect.
An important observation that we make is that the methods that are shown to work well in Section 3 all use information on aspects of the true signal that are in principle unknown, such as its degree of regularity. A key question is whether in distributed settings it is fundamentally possible to set tuning parameters correctly in a purely data-driven way, without using such information. In the non-distributed setting it is well known that such adaptive methods indeed exist (e.g. Tsybakov (2009) or Giné and Nickl (2016)). In the distributed case that we study here however, this is much less clear. In Section 4 we show that using a distributed version of a standard adaptation method that is known to work in the non-distributed case, such as maximum marginal likelihood empirical Bayes, can lead to sub-optimal results in the distributed setting. We will argue that this seems to be a fundamental issue and that we expect that correct automatic setting of tuning parameters in distributed methods is fundamentally more challenging than in the classical, non-distributed case. We believe this is an important issue and want to highlight it as an important and interesting topic for future research.
The remainder of the paper is organized as follows. In the next section we introduce the distributed version of the signal-in-white-noise model and provide a simple simulation example to show that in a distributed setting, naively combining inferences from local machines into a global estimator may produce misleading results. In Section 3 we study the performance of a number of Bayesian procedures for signal reconstruction in the distributed signal-in-white-noise model introduced in Section 2. We include a number of methods that have recently been proposed in the literature. We show that some succeed in obtaining the appropriate bias-variance trade-off, but others do not. Moreover, the ones that do produce good results are all non-adaptive, in the sense that they use knowledge of the smoothness of the unkown signal to set their tuning parameters. In the final Section 4 we consider the more realistic setting in which this smoothness is unknown. We study a distributed method that has been proposed for data-driven tuning of the hyperparameters and show that there exist “difficult signals”, which this method can not recover in the distributed model at an optimal rate. We argue that this appears to be a fundamental issue, and that designing procedures that automatically adapt to smoothness is fundamentally more challenging in the distributed framework. Mathematical proofs are collected in appendix Sections A and B.
2 Distributed signal-in-white-noise model
Consider the problem of estimating a signal in Gaussian white noise. In the usual setting there is a single observer that observes every coefficient with additive Gaussian noise with variance , say. In the distributed version of the model we divide the “precision budget” over different observers, so that each one observes the signal in Gaussian noise with variance , independent of the others. In other words, observer has data satisfying
where the are independent, standard Gaussian random variables. We call the independent sub-problems in which the signal-to-noise ratio is the “local” problems.
The classical, non-distributed signal-in-white-noise model is obtained from the distributed model by aggregating all the local data. Indeed, if for we define , then
where the are independent standard normal variables. This model has been studied extensively in the literature, serving as a canonical model for understanding the performance of high-dimensional or nonparametric statistical procedures. It is well known for instance that if the true signal belongs to an ellipsoid or a hyper rectangle of the form
for some , then the optimal rate of convergence of estimators (relative to the -norm) is of the order . Moreover, there exist so-called adaptive estimators, which achieve this rate without using knowledge about the parameters or that describe the complexity, or regularity of the true signal. See, for instance, Tsybakov (2009) or Giné and Nickl (2016). Our central question is whether or not the same results can be obtained in the distributed setting in which each of the different observers first separately make inference about the signal, and then the local estimates are aggregated into one joint estimator.
The specific examples of distributed procedures that we consider in this paper are about distributed Bayesian methods. These methods have in common that each local observer first chooses a prior distribution and computes the corresponding local posterior distribution using the local data (or an appropriate modification). In the next step the local posteriors are somehow aggregated into a global posterior-type distribution, which is then used to produce an estimate of the signal and/or a quantification of the associated uncertainty. In general there is no guarantee that this “aggregated posterior” resembles the posterior distribution that would be obtained in the non-distributed setting, using all the data at once. In particular, it is not clear beforehand how a distributed Bayes method should be constructed in order to have good theoretical properties, like optimal convergence rates or adaptation properties. In this paper we investigate various distributed methods that have been proposed from this point of view.
To see that interesting things can happen it is exemplifying to compare the results of a distributed and a non-distributed (Bayesian) analysis of simulated data. Concretely, we consider a true signal consisting of the Fourier coefficients of the function shown in Figure 1.
For this signal we simulate data according to (2.1), with , and . Then for every local observer a Bayesian procedure is carried out with a Gaussian prior on , postulating that the coordinates are independent and -distributed. The hyperparameter , which describes the regularity of the prior, is determined using a distributed version of maximum marginal likelihood, as described in Section 4. This analysis leads to local posterior distributions. These are then combined to produce an overall posterior distribution for the signal. The precise procedure is described in Section 4. The resulting estimator for the signal, together with pointwise credible intervals, is shown in the left plot in Figure 2.
The corresponding non-distributed result is obtained by first aggregating all local data as in (2.2) and then carrying out the same Bayesian procedure on these complete data. The resulting non-distributed reconstruction of the signal is shown on the right in Figure 2.
The non-distributed version of this method was studied theoretically for instance in Knapik et al. (2016) and Szabó et al. (2015), where it was shown that the method is adaptive and rate-optimal. The simulation suggests however that an apparently reasonable distributed analogue of the method does not necessarily inherit these favourable properties. The procedure seems to be underfitting and the credible intervals appear to be too narrow. We will argue that this is in some sense a fundamental issue and in the next sections we will study various proposed distributed methods to investigate to what degree they succeed in avoiding or solving these problems.
3 Results for non-adaptive procedures
In this section we study the performance of a number of Bayesian procedures for signal reconstruction in the distributed signal-in-white-noise model introduced in Section 2. All methods involve putting a prior distribution on the unknown signal in each local problem and then combining the resulting local posteriors into one global posterior-type distribution. To be able to compare the various methods we consider the same Gaussian process (GP) prior in every case, namely the prior
which postulates that the coefficients of the signal are independent and -distributed. The hyper parameter controls the regularity of the prior. (Some of the methods we consider use exactly this prior, others modify it in a certain way with the aim of achieving better performance.) The global posterior-type distribution depends on all the data and is denoted by . It is generally some type of average of the local posteriors, but its precise construction differs between proposed methods. We will see that this can have a significant effect on performance.
We take an asymptotic perspective and investigate in every case the rate at which the global posterior contracts around the true signal as relative to the -norm, which is as usual defined by . For a sequence of positive numbers we say that the global posterior contracts at the rate around the true signal if for all sequences ,
as . This means that asymptotically, all posterior mass is concentrated in balls around the true signal with -radius of the order .
Additionally, we study how well the posterior quantifies the remaining uncertainty. Specifically, we consider the coverage probabilities of credible balls around the global posterior mean. These credible sets are constructed by first computing the mean of the global “posterior” . Then for a level , the posterior is used to determine the radius such that the ball around with radius receives posterior mass, i.e.
For , the credible set is subsequently defined by
(The extra constant gives some added flexibility, for we obtain an exact credible set.) We are interested in the coverage probabilities . If this tends to as , the credible sets are asymptotically not frequentist confidence sets, hence give a misleading quantification of the uncertainty. Ideally, the coverage probabilities stay bounded away from as .
In the non-distributed case it is well known that both the rate at which the posterior contracts around the truth and the behaviour of the coverage probabilities of credible sets depend crucially on how the hyper parameter is tuned. The correct bias-variance trade-off is achieved if is in accordance with the regularity of the unknown signal. To make this precise, we will consider signals belonging to hyper rectangles of the form
for some . It is shown for instance in Knapik et al. (2011) for the non-distributed case that if and we set , then the posterior contracts around at the optimal rate . Moreover, for large enough it then holds that . Hence, in the non-distributed case it is optimal to choose the hyper parameter in such a way that the regularity of the prior matches the regularity of the true signal. Moreover, this choice leads to a contraction rate that is optimal in a minimax sense.
In the remainder of this section we investigate distributed methods from this point of view. We will see that the different proposed methods lead to different behaviours in terms of contraction rates and coverage. We stress that the results in this section are non-adaptive, in the sense that we allow the tuning parameter and other aspects of the constructions to use knowledge of the regularity of the true signal. This is of course not realistic. It is however important to first understand for every method whether ideally, if the value of is given to us by an oracle, it is possible to tune the method optimally. Whether this is also possible adaptively, without knowing , is then the next natural question, which we address in Section 4.
3.1 Naive averaging of local posterior draws
Recall that we have local observers that each have a dataset of noisy coefficients satisfying (2.1). The aim is to recover the true sequence of coefficients .
As a starting point, and to have a baseline case to compare the other methods to, we analyse the naive distributed approach in which in every local problem we simply use the prior defined by (3.1), with equal to the regularity of the true sequence , in the sense of (3.3). Every local observer then computes its corresponding local posterior, . By Bayes’ formula this is given by
where the likelihood for the th local problem is given by
Finally these local posteriors are combined into a global, “average posterior” by postulating that a draw from this global posterior is generated by first drawing once from each local posterior and then averaging these independent draws. (Formally, this means that the global “posterior” is the convolution of the rescaled local posteriors , …, ).
This distributed method is conceptually very simple, but it turns out that neither from the point of view of contraction rates, nor from the point of view of uncertainty quantification it performs very well. The reason is basically that although the choice of the tuning parameter of the prior correctly matches squared bias, variance and posterior spread in the local problems, the averaging procedure results in a global “posterior” for which the spread and the variance of the mean are too small relative to the squared bias. The following theorem asserts that for every smoothness level there exist -regular truths for which the contraction rate of the posterior deteriorates substantially and for which the uncertainty quantification by the credible sets (3.2) constructed from the global posterior is useless, no matter how far they are blown up by a constant .
Theorem 3.1 (naive averaging).
For every there exists a such that for small enough ,
as and . Furthermore, for all it holds that
The proof of the theorem is given in Section A.1. ∎
In the literature several less naive distributed strategies have been proposed. These methods either change the local likelihoods in a certain way, and/or the priors that are locally used, and/or the way that the local posteriors are aggregated. In the next few sections we investigate whether such strategies can improve the bad asymptotic performance of the naive averaging method.
3.2 Adjusted local likelihoods and averaging
One perspective on the bad performance of the naive method is to say that since the “sample size” in the local problems is too small, the influence of the data on the local posterior is too small, resulting in a variance (and spread) that is too small relative to the squared bias. A possible way to remedy this that has been proposed in several papers is to raise the local likelihoods to the power , in order to mimic the situation that we have sample size in the local problems. This generalized Bayesian approach for the local problems has for instance been considered in the distributed context by Srivastava et al. (2015). They combine it with a different aggregation method however, which we consider in Section 3.4. In this section we still consider the simple averaging scheme, in order to isolate the effect of adjusting the local likelihoods.
So in method II all local observers use the prior again, with equal to the regularity of the truth. They now each compute a generalized local posterior , defined by
As before the global “posterior” is defined by postulating that a draw from this global posterior is generated by first drawing once from each local generalized posterior and then averaging these independent draws.
The following theorem states that this method indeed improves the naive approach of Section 3.1. The global posterior now contracts at the optimal rate for every -regular truth. Unfortunately, the bad behaviour of the credible sets has not been remedied. For this approach the uncertainty quantification is in fact misleading for all -regular truths.
Theorem 3.2 (adjusted likelihoods + averaging).
For all and all sequences ,
as . However, for all and all it holds that
The proof is given in Section A.2. ∎
3.3 Adjusted priors and averaging
Adjusting the likelihood as in the preceding section resulted in a correct trade-off between the bias and the variance of the global posterior mean, yielding an optimal posterior contraction rate. The spread of the posterior remained too small in comparison however, resulting in credible sets with zero asymptotic coverage. Instead of raising the local posteriors to the power , as considered in the preceding section, we could alternatively raise the prior density to the power . This has for instance been proposed in the context of the “Consensus Monte Carlo” approach by Scott et al. (2016), in combination with simple averaging of the local posteriors. In this section we investigate the performance of this method in terms of posterior contraction and uncertainty quantification in our distributed signal-in-white-noise model.
The prior that we use in the local problems is again a product of centered Gaussians with variance . Raising the corresponding densities to the power has the effect of multiplying the th prior variance by . Hence, in our case raising the prior density to the power is the same as multiplicative rescaling, postulating that is a-priori distributed according to , where
for . Rescaled GPs have also been considered by Shang and Cheng (2015), who have used them in the distributed setting to construct global credible sets from local ones.
Using rescaling we can actually obtain good results if the prior regularity is not exactly equal to the true regularity . By using a scaling different from we can somehow compensate for the mismatch between and , at least in the range . In the non-distributed setting this is a well-known phenomenon, see for instance van der Vaart and van Zanten (2007); Knapik et al. (2011); Szabó et al. (2013).
The distributed procedure that we consider in this section then takes the following form. Every local observer uses the rescaled prior defined by (3.5), with and
where is the regularity of the truth. Next the (normal, unadjusted) corresponding posteriors are computed and they are averaged into a global “posterior” as in the preceding sections. (Note that if in the local problems the prior regularity is used, then , so the method corresponds to raising the prior density to the power .)
The following theorem gives the posterior contraction and coverage results for this method.
Theorem 3.3 (adjusted priors + averaging).
Suppose and . Then for all sequences ,
as . Moreover, for all it holds that
for large enough .
See Section A.3. ∎
So adjusting the prior in this way actually works better than adjusting the likelihood. Not only do we get optimal contraction rates, but the credible sets that this method produces have asymptotic coverage too. The proof shows that the credible sets have optimal radius of the order as well.
3.4 Adjusted local likelihoods and Wasserstein barycenters
In Section 3.2 we saw that raising the local likelihoods to the power and then averaging the corresponding generalized posteriors yields optimal contraction rates, but can produce badly performing credible sets. In this section we study the approach considered by Minsker et al. (2014); Srivastava et al. (2015) in the context of their “WASP” method, which consists in aggregating the local posteriors not by simple averaging, but by computing their Wasserstein barycenter.
The generalized local posteriors , as defined in Section 3.2, are (Gaussian) measures on . The -Wasserstein distance between two probability measures and on is defined by
where the infimum is over all measures on with marginals and . The corresponding 2-Wasserstein barycenter of probability measures on is then defined by
where the minimum is over all probability measures on with finite second moments. There exist effective algorithms to compute Wasserstein barycenters in many cases, see for instance Cuturi and Doucet (2014) and the references therein.
Having this notion at our disposal the distributed method we consider in this section proceeds as follows. In every local problem the prior is used, with equal to the regularity of the truth. Next, the corresponding generalized posteriors are computed locally, which involves raising the likelihood to the power as described in Section 3.2. Finally, the global “posterior” is constructed as the 2-Wasserstein barycenter of the local measures .
The following theorem asserts that this method results in optimal posterior contraction rates and correct quantification of uncertainty.
Theorem 3.4 (adjusted likelihoods + barycenters).
For all and all sequences ,
as . Moreover, for all it holds that
for large enough .
See Section A.4. ∎
3.5 Product of Gaussian process experts
The proofs of the theorems presented so far show that since in our context the global “posterior” is always a Gaussian measure, the behaviour of the procedure can be understood by analyzing three central quantities: the bias of the posterior mean, the variance of the posterior mean, and the spread of the posterior. Depending on how these quantities are related we have found different behaviours: sub-optimal posterior contraction and bad coverage of credible sets (Section 3.1), optimal posterior contraction but bad coverage of credible sets (Section 3.2), and optimal posterior contraction and also good coverage of credible sets (Sections 3.3 and 3.4).
In principle it is now straightforward to analyze different methods as well, provided the three central quantities can be controlled. As an illustration we consider in this section the single-layer version of the product-of-Gaussian-process-expert (PoE) model, introduced in Ng and Deisenroth (2014) and a generalization proposed in Cao and Fleet (2014). An interesting fact is that we will encounter a combination of behaviours that we have not seen yet: sub-optimal contraction rates, but good coverage of credible sets. These methods were introduced to deal with the distributed non-parametric regression model, but for the sake of comparison we analyze them in the context of our distributed signal-in-white-noise model, which can be thought of as an idealized version of the regression model.
The idea of the basic version of the Gaussian PoE model is to employ a Gaussian prior in every local machine, compute the corresponding posterior densities and approximate the global posterior density by multiplying and normalizing these. In our infinite-dimensional setting this does not make sense strictly speaking, since we can not express priors and posteriors on in terms of densities with respect to some generic dominating measure. We could remedy this by considering a truncated version of our distributed model, where we assume we only observe the first noisy coefficients in every machine, say, and focus on making inference about the first true coefficients . This would make the setting finite-dimensional, allowing us to write prior and posterior densities with respect to Lebesgue measure. Alternatively, we can stay in the infinite-dimensional setting of the paper and just reason formally and still arrive at a well-defined global PoE “posterior”. This is the approach we follow here.
Indeed, say that as before we use the prior given by (3.1) in every local machine, with equal to the regularity of the true signal. This prior has formal “density” proportional to
By completing the square we see that the product of this expression with the local likelihood given by (3.4) is, still formally, proportional to
Taking the product over we then obtain the formal density of the PoE posterior, which is proportional to
Now this last expression is, up to a constant, the density of a product of Gaussians with means and variances given by
The latter is in fact a well-defined Gaussian measure on , so we can now simply define the global PoE “posterior” as the latter measure.
We see that the expressions for the global mean and spread are in fact the same as what we found in Section A.1 for the naive averaging method. As a consequence, the negative result of Theorem 3.1 holds for the basic version of the Gaussian PoE model as well.
Theorem 3.5 (product of Gaussian experts).
For every there exists a such that for small enough ,
as and . Furthermore, for all it holds that
One can generalize the PoE model by raising the local posterior densities to some power before multiplying and normalizing them, as proposed in Cao and Fleet (2014). In the subsequent analysis we consider the choice to for the power, as suggested in Deisenroth and Ng (2015). Adapting the preceding analysis for the ordinary PoE model we see that for this generalized PoE model the global “posterior” is in our setting again a product of Gaussian, but now with means and variances given by
So the global posterior mean is unaltered compared to the basic PoE model, but the global posterior spread has been blown up by a factor . As a result, there still exists the same class of truths as in Section A.1 for which the squared bias and the variance of the posterior mean will be incorrectly balanced, resulting in a sub-optimal rate of posterior contraction. However, the larger posterior spread ensures that we do have asymptotic coverage of credible sets. It should be noted however that these sets have a diameter that is sub-optimal, i.e. they are too conservative.
Theorem 3.6 (generalized product of Gaussian experts).
For every there exists a such that for small enough ,
as and . However, for all it holds that
for large enough .
The proof of the theorem can be found in Section A.5. ∎
3.6 Summary of results for non-adaptive methods
We have seen that the various methods for aggregation of the local posteriors can give quite different results. The methods we considered produce different global “posterior” measures. Depending on the relation between the bias and variance of the global posterior mean and the spread of this global posterior, the posterior contraction rate and coverage probabilities of credible sets can have different behaviours. We summarize our findings in Table 1. This is certainly not meant to be an exhaustive list of methods, but rather an illustration of how the design of distributed procedures can affect their fundamental performance.
|II||adjusted likelihoods, averaging||yes||no|
|III||adjusted priors, averaging||yes||yes|
|IV||adjusted likelihoods, barycenter||yes||yes|
|V||product of experts||no||no|
|VI||generalized product of experts||no||yes|
Simulations further illustrate the theoretical results. We have considered a true signal consisting of the Fourier coefficients of the function shown in the left panel of Figure 3. This is a signal which has regularity in the sense of (3.3). For this signal we simulated data according to (2.1), with , and , i.e. we considered a distributed setting with machines. For the sake of comparison, the right panel of Figure 3 shows the signal reconstruction and uncertainty quantification for the non-distributed method which first aggregates all data in a single machine and then computes the posterior corresponding to the prior defined by (3.1), with . This is a method which is known to have an optimal convergence rate and correct quantification of uncertainty. This classical, non-distributed result should be compared to Figure 4, which visualizes the “posteriors” generated by each of the distributed methods I–VI.
In accordance with our theoretical results, we see that the results of methods III and IV are comparable with the non-distributed method. Methods I, V and VI have worse signal reconstruction. The posterior mean of Method II is comparable to that of the optimal methods, but the uncertainty is underestimated.
An important observation to make is that the methods that achieve the same optimal performance as non-distributed methods, all use information about the regularity of the unkown signal, mostly through the setting of tuning parameters in the priors. In that sense, they are non-adaptive. They serve as useful results that indicate what is possible in principle if we have certain oracle knowledge about the truth we are trying to learn. To understand what realistic procedures can achieve this has to be combined with insight into what can be learned about this oracle knowledge from the data. In the next section we address this issue in the context of our distributed signal-in-white-noise model.
4 Results for adaptive procedures
In the non-distributed case it is well known that there exist adaptive methods that achieve the same optimal performance as non-adaptive procedures, without using knowledge of the regularity of the unkown signal. These methods somehow succeed in correctly trading off bias, variance (and spread in Bayesian methods) in a purely data-driven manner. For several such result in the context of the signal-in-white-noise model, see, for instance, Giné and Nickl (2016) and the references therein. For distributed methods the issue of adaptation appears to be a lot more subtle. In this paper we only have a first, negative result on adaptive properties of distributed methods.
So now we do not assume that we know the true regularity of the unknown signal. As before we employ the prior in the local machines. To tune the regularity parameter of the prior we consider a distributed version of maximum marginal likelihood, as proposed by Deisenroth and Ng (2015). The usual, non-distributed version of that method would use the maximizer of the map
as tuning parameter. Maximizing this function however requires having all data available in a central machine. In the distributed setting, Deisenroth and Ng (2015) argue that this map is well approximated by the map
Now every term in the sum just depends on one of the local machines and this function can be maximized on the central machine by repeatedly asking the local machines for function evaluations and gradients of the local log-marginal likelihoods
The resulting estimator is denoted by , i.e.
(We maximize over a compact interval to ensure that the maximizer exists.)
It turns out that in the distributed setting, the local machines are in general not able to learn enough about the true signal regularity . The following lemma asserts that there exist “difficult” signals for which the estimator overestimates the regularity.
For , consider a signal such that
Then and if is small enough, then
if and .
The proof is given in Section B.1. ∎
In view of Lemma 4.1 it is perhaps not surprising that if the approximated maximum marginal likelihood estimator is used to tune the local prior that is used in every machine, sub-optimal performance is obtained for certain truths. Intuitively this is because due to the smaller signal-to-noise ratio, or “sample size” in the local machines, certain truths may appear more regular than they really are. It turns out that using the estimator in combination with any of the methods considered in the preceding section indeed leads to sub-optimal rates and bad coverage probabilities for certain truths. As an illustration we present a rigorous statement for the method of Section 3.4, but similar results can be derived for the others methods as well.
So suppose that in every local problem the prior is used, the corresponding generalized posterior is computed locally (which involves raising the local likelihood to the power ), and then the tuning parameter is substituted by the estimator defined above. In the central machine, the global “posterior” is constructed as the 2-Wasserstein barycenter of the local “posterior” measures .
For and as in Lemma 4.1 we have, for some ,
as and . Furthermore, for all it holds that
See Section B.2. ∎
A simulation illustrating the theoretical result of theorem is given in Figure 2. The left panel visualizes the “posterior” generated by method VII, in the same distributed setting, and using the same simulated data as considered in Section 3.6.
So when combined with a data-driven tuning method like the distributed version of maximum marginal likelihood considered here, even the distributed methods that perform well in the non-adaptive setting loose their favourable properties. None of the methods yields a procedure that automatically adapts to regularity and achieves the optimal non-distributed rate. This does not imply of course that such an adaptive method does not exist. We expect however that the matter is delicate and that fundamental limitations exist.
The issue appears to be similar to that of the existence of adaptive confidence sets. To achieve adaptation in our distributed setting the local machines must be able to learn the “global” regularity of the signal from the limited local data that they have available. Analogous to the adaptive confidence problem we expect that this is in general only possible under additional assumptions on the true signal, like the self-similarity or polished tail conditions proposed for instance in Giné and Nickl (2010), Bull (2012), Szabó et al. (2015), Nickl and Szabó (2016), Belitser et al. (2017). Making these admittedly somewhat loose claims mathematically precise will take considerably more effort, but seems an important and interesting direction for future work.
Appendix A Proofs for Section 3
a.1 Proof of Theorem 3.1
By completing the square we see that under the local posterior the coefficients are independent and Gaussian, with mean and variance given by
Hence the global “posterior” is Gaussian as well, and under that measure the coefficients are independent and have mean and variance given by
For the global posterior mean we have, for every ,
By Lemma A.1 of Szabó et al. (2013) the second, variance term is of the order
as . For , by the same lemma, the first, squared bias term is proportional to . For the global spread, we have
By the triangle inequality we have
It follows from the bounds on the variance and squared bias of the posterior mean that for as chosen above, the quantity
appearing in the posterior probability is with -probability tending to one bounded from below by for small enough. Then by the upper bound for the posterior spread and Chebyshev’s inequality we obtain the first statement of the theorem.
For the coverage we note that the radius of the credible set is a multiple of , which follows from the Gaussianity of the posterior and . Then by similar computations as above we get that for the same truth ,
This completes the proof of the theorem.
a.2 Proof of Theorem 3.2
Raising the local likelihood (3.4) to the power makes it proportional to
which is the likelihood for the case . It follows that under the generalized local posterior the coefficients are independent and Gaussian, with mean and variance given by
Hence the global “posterior” is again Gaussian, and under this global measure the coefficients are independent and have mean and variance given by
For the global posterior mean we have in this case, for every ,
For all , the squared bias term is bounded by
for large , and the variance term behaves like a constant times as well. The global spread is of the order for large .
For and we now have, by the triangle inequality,
By the bounds on the bias and the variance of the posterior mean derived above the quantity on the right of the inequality in the last posterior probability is bounded from below by with -probability tending to one as , uniformly in . By Chebychev’s inequality, and the bound on the posterior spread, we conclude that the first statement of the theorem holds.
For the second statement we first note that by Chebychev’s inequality and by the upper bound on the posterior spread the radius of the credible set is for large bounded by for some . Hence, since the posterior mean is Gaussian, Anderson’s inequality implies that
By Chebychev’s inequality,
for all , where . Above we saw that . Similarly, it is easily seen that . Hence by taking , for instance, we see that for small enough,
as . But then also
a.3 Proof of Theorem 3.3
In this case the th local posterior is a product of Gaussians with means and variances given by
As before the global “posterior” is Gaussian as well, and under that measure the coefficients are independent and have mean and variance given by
For the global posterior mean we have, for every ,
By considering Riemann sums as in Lemma A.1 of Szabó et al. (2013) we see that for and uniformly for , the squared bias term is bounded by a constant times
Similarly, the variance term and the posterior spread both behave like a constant times
as . The choice balances these quantities, so that all three are of the order .
By exactly the same reasoning as in Section A.2, the fact that the squared bias bound and the variance and spread are of the same order implies the first statement of the theorem. For the coverage statement we first note that the squared credible set radius is the quantile of the distribution of , with as above and independent standard normals. This distribution has mean and variance
as can be seen by considering Riemann sums again. As the standard deviation is of smaller order than the mean, it follows from Chebychev’s inequality that for some . For the coverage probability we then have