Bayesian Learning with Wasserstein Barycenters

Gonzalo Rios,  Julio Backhoff-Veraguas,  Joaquin Fontbona,  Felipe Tobar
Center for Mathematical Modeling University of Chile Dept. of Mathematical Engineering University of Chile Inst. of Stats. and Math. Methods in Economics TU Vienna
Abstract

In this work we introduce a novel paradigm for Bayesian learning based on optimal transport theory. Namely, we propose to use the Wasserstein barycenter of the posterior law on models, as an alternative to the maximum a posteriori estimator (MAP) and Bayes predictive distributions. We exhibit conditions granting the existence and consistency of this estimator, discuss some of its basic and specific properties, and propose a numerical approximation relying on standard posterior sampling in general finite-dimensional parameter spaces. We thus also contribute to the recent blooming of applications of optimal transport theory in machine learning, beyond the discrete and semidiscrete settings so far considered. Advantages of the proposed estimator are discussed and illustrated with numerical simulations.

 

Preprint. Work in progress.

1 Model Selection

In probabilistic modeling, given a finite sample in a data space (typically ), the learning or fitting task consists in selecting one single model or distribution from the model space that best explains the data. Here and in the sequel, denotes the set of probability measures on , and we assume that , with the subset of measures absolutely continuous with respect to a common reference measure on .

We say that is finitely parametrized if there is an integer , a set termed parameter space and a (measurable) function , called the parametrization mapping, such that . This is typically the case in common applications, and we will usually use the notation in such a setting. The learning task explained in abstract terms in the previous paragraph boils down, in the parametric case, to finding the best parameters to feed into the model. The maximum likelihood estimator (MLE) is the standard frequentist choice in this regard. We illustrate with a familiar application how the several objects introduced so far may look:

Example: In standard linear regression we have for and the model space is given by the set of joint distributions with linear relationship between and . If we moreover assume that is normally distributed, then for some fixed and . In this setting we need to choose , and to obtain the joint distribution (aka the generative model, though one often needs to deal with the conditional distribution , aka discriminative model). Hence, for each fixed , the parameter space induces a model space through the mapping , where has the density , . Denoting and , the MLE is then given by and . A similar treatment can be given to general regression models (non-linear, non-normal and/or multivariate), classification models and general generative models.
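To make the MLE concrete, the following minimal Python sketch fits the weights and noise variance of a linear-Gaussian regression by maximum likelihood; the data, variable names (X, y, w_mle, sigma2_mle) and dimensions are our own illustration, not taken from the paper.

import numpy as np

# Minimal MLE sketch for linear regression with Gaussian noise (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=100)    # noisy observations

# The MLE of the weights coincides with ordinary least squares.
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
# The MLE of the noise variance is the mean squared residual.
sigma2_mle = np.mean((y - X @ w_mle) ** 2)
print(w_mle, sigma2_mle)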

Although considering the finitely parametrized case is illustrative, our work is actually conceived to cover non-parametric situations. Additionally, we will adopt the Bayesian viewpoint, which provides a probabilistic framework to deal with model uncertainty, in terms of a prior distribution on the space of models; we refer the reader to Ghosal and van der Vaart (2017); Murphy (2012) and references therein for mathematical background on Bayesian statistics and methods. A critical challenge in the Bayesian perspective is that of selecting or constructing a predictive law on from the posterior distribution on . This shall be the actual learning task that will occupy us in this work, and our motivation is to find alternative learning strategies which can cope with drawbacks of classic approaches such as maximum a posteriori (MAP) estimation or the Bayesian model average.

As a convention, we shall use the same notation for an element and its density . Moreover, we assume that there exists , the true model (which may in general not be an element of ), such that are i.i.d. according to . We now summarize the main contributions of the article.

2 Main Contribution and Outline of the Article

In the context of model selection, the main contribution of our work is to introduce the concept of Bayesian Wasserstein barycenter estimator, as well as a novel selection criterion based on optimal transport theory. See Ambrosio and Gigli (2013); Ambrosio et al. (2004); Villani (2003, 2008); McCann and Guillen (2011) for a thorough introduction to optimal transport. The article is structured in two parts as follows.

In Section 3 we propose a general framework for Bayesian estimation based on loss functions over probability distributions. This allows us to cover finitely parametrized model spaces, as well as parameter-free settings. It also allows us to retrieve classical selection criteria including MAP and Bayes estimators. We further show that this framework recovers the Bayesian model average estimator and generalizations of it, as particular instances of “Fréchet means” with respect to suitable metrics on spaces of probability distributions.

In the second part, starting from Section 4, we recall the notions of Wasserstein distances and, relying on the previously developed framework, we introduce our Bayesian Wasserstein barycenter estimator. At a theoretical level, we study existence, uniqueness and statistical consistency for this estimator. At a practical level, we provide methods and numerical evidence supporting the potential of this estimator, highlighting the Gaussian setting, where available closed-form expressions allow for a simple numerical implementation.

3 Bayesian learning

Consider a fixed prior probability measure $\Pi$ on the model space $\mathcal{M}$. We assume as customary that, conditionally on the choice of model $\nu \in \mathcal{M}$, the data $x_1,\dots,x_n$ is distributed as i.i.d. observations from the common law $\nu$. As a consequence we can write

$\mathbb{P}(dx_1,\dots,dx_n) = \int_{\mathcal{M}} \nu(dx_1)\cdots\nu(dx_n)\,\Pi(d\nu).$   (1)

By the Bayes rule, the posterior distribution on models given the data, denoted for simplicity $\Pi_n := \Pi(\cdot \mid x_1,\dots,x_n)$, is given by

$\Pi_n(d\nu) = \frac{\prod_{i=1}^{n}\nu(x_i)}{\int_{\mathcal{M}}\prod_{i=1}^{n}\bar{\nu}(x_i)\,\Pi(d\bar{\nu})}\,\Pi(d\nu).$   (2)

The density of $\Pi_n$ with respect to the prior $\Pi$ is called the likelihood function.
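As a discrete illustration of the posterior in (2), the sketch below places a uniform prior on a finite grid of candidate Gaussian models and computes the posterior weights from an i.i.d. sample; the grid, the data-generating model and all variable names are assumptions of ours.

import numpy as np
from scipy.stats import norm

# Posterior (2) over a finite grid of candidate models N(mu, 1) (illustrative).
rng = np.random.default_rng(1)
data = rng.normal(loc=0.7, scale=1.0, size=50)       # i.i.d. sample
mus = np.linspace(-2.0, 2.0, 81)                     # candidate models
log_prior = np.full(mus.shape, -np.log(len(mus)))    # uniform prior

# Log-likelihood of each candidate model: sum_i log nu(x_i).
log_lik = np.array([norm.logpdf(data, loc=m, scale=1.0).sum() for m in mus])

# Posterior weights via the Bayes rule, normalized in log-space for stability.
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()
print(mus[np.argmax(post)])                          # posterior mode over the grid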

Given the model space $\mathcal{M}$, a loss function $\ell: \mathcal{M}\times\mathcal{M} \to [0,\infty)$ is a non-negative functional. We interpret $\ell(\nu,\bar{\nu})$ as the cost of selecting model $\nu$ when the true model is $\bar{\nu}$. With a loss function $\ell$ and the posterior distribution $\Pi_n$ over models, we define the Bayes risk (or expected loss) and the Bayes estimator as follows:

$\mathcal{R}_{\ell}(\nu) := \int_{\mathcal{M}} \ell(\nu,\bar{\nu})\,\Pi_n(d\bar{\nu}),$   (3)
$\hat{\nu}_n := \operatorname{arg\,min}_{\nu \in \mathcal{M}} \mathcal{R}_{\ell}(\nu).$   (4)

This approach allows us to perform Bayesian estimation in general model spaces without depending on geometric aspects of parameter spaces, by focusing directly on predictive distributions. Moreover, it will allow us to define loss functions in terms of various metrics/divergences on the space , which will enable us to enlarge the classical Bayesian estimation framework by using transportation distances. Before further developing these ideas, we briefly describe how this general framework includes finitely parametrized model spaces, and discuss some classical choices in that setting, together with their appealing features and drawbacks.

3.1 Parametric setting

Given a prior distribution $\pi$ over a parameter space $\Theta$, its push forward Bogachev (2007) through the parametrization map $\lambda: \Theta \to \mathcal{M}$ is the probability measure $\Pi = \lambda_{\#}\pi$ given by $\Pi(A) = \pi(\lambda^{-1}(A))$. Expressing the likelihood function in terms of the parameter $\theta$ such that $\nu = \lambda(\theta)$, we then easily recover from (2) the standard posterior distribution over the parameter space, $\pi_n := \pi(\cdot \mid x_1,\dots,x_n)$. Moreover, any loss function $\ell$ induces a functional defined on $\Theta \times \Theta$ (or a subset thereof) by $\ell_{\Theta}(\theta,\bar{\theta}) := \ell(\lambda(\theta),\lambda(\bar{\theta}))$, interpreted as the cost of choosing parameter $\theta$ when the actual true parameter is $\bar{\theta}$. The Bayes risk Berger (2013) of $\theta$ is then defined by

$\mathcal{R}_{\ell_{\Theta}}(\theta) := \int_{\Theta} \ell_{\Theta}(\theta,\bar{\theta})\,\pi_n(d\bar{\theta}),$   (5)

where $\pi_n$ is the posterior obtained as in (2) with the prior distribution $\pi$. The associated Bayes estimator is of course given by $\hat{\theta}_n := \operatorname{arg\,min}_{\theta \in \Theta} \mathcal{R}_{\ell_{\Theta}}(\theta)$.

For instance, the 0-1 loss yields as Bayes estimator the posterior mode, also referred to as the maximum a posteriori estimator (MAP). The 0-1 loss penalizes all incorrect parameters in the same way, so there is no partial credit under this loss function. For continuous-valued quantities the use of a quadratic loss is often preferred, as it corresponds to the ordinary Euclidean distance in the parameter space. The corresponding Bayes estimator is the posterior mean. In a one-dimensional parameter space, the absolute loss yields the posterior median estimator, and is sometimes preferred to the quadratic one for being more robust in the presence of outliers.
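The correspondence between these losses and their Bayes estimators can be checked numerically from posterior samples; the sketch below uses a stand-in gamma "posterior" and our own variable names, purely for illustration.

import numpy as np
from scipy.stats import gaussian_kde

# Bayes estimators from samples of a scalar posterior (illustrative stand-in).
rng = np.random.default_rng(2)
theta_samples = rng.gamma(shape=3.0, scale=1.0, size=20_000)

post_mean = theta_samples.mean()            # quadratic loss -> posterior mean
post_median = np.median(theta_samples)      # absolute loss  -> posterior median
kde = gaussian_kde(theta_samples)           # 0-1 loss       -> posterior mode (approx.)
grid = np.linspace(theta_samples.min(), theta_samples.max(), 1000)
post_mode = grid[np.argmax(kde(grid))]
print(post_mode, post_median, post_mean)    # mode < median < mean for this skewed law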

The MAP approach is computationally appealing as it reduces to an optimization problem in a finite-dimensional space. The performance of this method might however be highly sensitive to the choice of the initial condition used in the optimization algorithm Wright and Nocedal (1999). This is a critical drawback of MAP estimation, since likelihood functions of expressive models may be populated with numerous local optima. A second drawback of this method is that it fails to capture global information of the model space or to provide a measure of uncertainty of the parameters, which might result in an overfit of the predictive distribution. Indeed, the mode can often be a very poor summary or untypical choice of the posterior distribution (e.g. the mode of an exponential density is 0, irrespective of its parameter). Yet another serious failure of MAP estimation is its dependence on the parameterization. Indeed, for instance, in the case of a Bernoulli distribution on with and a uniform prior on for , the mode can be anything in . On the other hand, parameterizing the model by yields the mode , while parametrizing it by yields as mode.

Using general Bayes estimators on parametrized models of course enables a richer choice of criteria for model selection (by integrating global information of the parameter space) while providing a measure of uncertainty (through the Bayes risk value). However, the approach might also neglect parameterization-related issues, such as overparametrization of the model space (we say that the parametrization overparametrizes the model space if it is not one-to-one). The latter might result in a multi-modal posterior distribution over parameters. For example, take , and . If we choose a symmetric prior, e.g. , then with enough data the posterior distribution is symmetric with modes near , so the corresponding estimators are close to .

3.2 Posterior average estimators

The next result, proved in Appendix A.1, illustrates the fact that many Bayesian estimators, including the classic model average estimator, correspond to finding a so-called Fréchet mean Panaretos and Zemel (2017), or barycenter, under a suitable metric/divergence on probability measures.

Proposition 1.

Let and consider the loss functions given by:

  • The -distance:

  • The reverse Kullback-Leibler divergence:

  • The forward Kullback-Leibler divergence

  • The squared Hellinger distance

Then, in cases i) and ii) the corresponding Bayes estimators of Equation (4) coincide with the Bayesian model average:

$\hat{\nu}_{\mathrm{BMA}}(x) := \int_{\mathcal{M}} \bar{\nu}(x)\,\Pi_n(d\bar{\nu}).$   (6)

Furthermore, with $Z$ denoting a normalizing constant, the Bayes estimators corresponding to the cases iii) and iv) are given by the exponential model average and the square model average, respectively:

$\hat{\nu}_{\exp}(x) := \frac{1}{Z}\exp\Big(\int_{\mathcal{M}} \log\bar{\nu}(x)\,\Pi_n(d\bar{\nu})\Big), \qquad \hat{\nu}_{\mathrm{sq}}(x) := \frac{1}{Z}\Big(\int_{\mathcal{M}} \sqrt{\bar{\nu}(x)}\,\Pi_n(d\bar{\nu})\Big)^{2}.$   (7)

All the Bayesian estimators described above (eq. (6) and (7)) share a common feature: their values at each point are computed in terms of some posterior average of the values of certain functions evaluated at that point. This is due to the fact that all the above distances are vertical Santambrogio (2015), in the sense that computing the distance between two densities involves integrating vertical displacements between their graphs. An undesirable fact about vertical averages is that they do not preserve properties of the original model space. For example, if the posterior distribution is equally concentrated on two different models $\mathcal{N}(m_1,1)$ and $\mathcal{N}(m_2,1)$ with $m_1 \neq m_2$, that is, both models are unimodal (Gaussian) with unit variance, the model average is in turn a bimodal (non-Gaussian) distribution with variance strictly greater than 1. More generally, model averages might yield intractable representations or be hardly interpretable in terms of the prior and parameters.
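A one-line check of the variance claim, under the unit-variance two-Gaussian parametrization assumed above (equal posterior weights on $\mathcal{N}(m_1,1)$ and $\mathcal{N}(m_2,1)$):

$\operatorname{Var}\Big(\tfrac{1}{2}\mathcal{N}(m_1,1) + \tfrac{1}{2}\mathcal{N}(m_2,1)\Big) = 1 + \frac{(m_1-m_2)^2}{4} > 1 \quad \text{whenever } m_1 \neq m_2,$

since the variance of a mixture is the average of the component variances plus the variance of the component means.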

We shall next introduce the analogous objects in the case of Wasserstein distances, which are horizontal distances Santambrogio (2015), in the sense that they involve integrating horizontal displacements between the graphs of the densities. We will further develop the theory of the corresponding Bayes estimators, which will correspond to Wasserstein barycenters Agueh and Carlier (2011); Pass (2013); Kim and Pass (2017); Le Gouic and Loubes (2017) arising in optimal transport theory. In Fig. 1 we illustrate a vertical (left) and a horizontal (right) interpolation between two Gaussian densities, for the reader’s convenience.

Figure 1: Vertical interpolation (left) and horizontal interpolation (right) of two Gaussian densities.

4 Optimal Transport and Wasserstein Distances: a summary

Optimal transport has become increasingly popular within the machine learning community Kolouri et al. (2017), though most of the published works have focused on the discrete setting (e.g. comparing histograms Cuturi (2013); Cuturi and Doucet (2014), classification Frogner et al. (2015) and images Courty et al. (2017); Arjovsky et al. (2017), among others). Since we will focus on continuous distributions, we next review definitions and results needed to present our approach.

Let $(\mathcal{X},d)$ be a complete and separable metric space. Given two measures $\mu,\nu$ over $\mathcal{X}$ we denote by $\Gamma(\mu,\nu)$ the set of couplings with marginals $\mu$ and $\nu$, i.e. $\gamma \in \Gamma(\mu,\nu)$ if $\gamma \in \mathcal{P}(\mathcal{X}\times\mathcal{X})$ and for every measurable $A \subseteq \mathcal{X}$ we have $\gamma(A\times\mathcal{X}) = \mu(A)$ and $\gamma(\mathcal{X}\times A) = \nu(A)$. We say that $\gamma$ is a deterministic coupling if $\gamma = (\mathrm{id},T)_{\#}\mu$, where $T$ is a (measurable) so-called transport map satisfying $T_{\#}\mu = \nu$.

Given a real $p \geq 1$ we define the $p$-Wasserstein transportation metric between measures $\mu$ and $\nu$ by

$W_p(\mu,\nu) := \Big(\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathcal{X}\times\mathcal{X}} d(x,y)^p\,\gamma(dx,dy)\Big)^{1/p},$   (8)

which is a complete metric on $\mathcal{W}_p(\mathcal{X})$, the space of probability measures with finite $p$-th moment. Since $\Gamma(\mu,\nu)$ is not empty (it contains the product coupling $\mu\otimes\nu$), convex and compact with respect to the weak (also called weak-*) topology, minima always exist under mild assumptions on the cost (joint lower semicontinuity). If $\mu$ is non-atomic, then deterministic couplings are extreme points of $\Gamma(\mu,\nu)$. In the case $p = 2$ with $\mathcal{X} = \mathbb{R}^q$ and $d(x,y) = \|x-y\|$, if $\mu$ has a density, then the minimizer of (8) is unique and is induced by an optimal map $T$ which is characterized as $T = \nabla\varphi$ for a lower semicontinuous convex function $\varphi$. In some cases there exists an explicit form for the optimal transport and the Wasserstein distance, e.g. for the generic one-dimensional case, and for the multivariate Gaussian case with $p = 2$ (see Cuestaalbertos et al. (1993)).
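Two of the closed forms just mentioned can be evaluated directly; the following Python sketch (our own illustration, with made-up distributions) approximates the one-dimensional distance through quantile functions and computes the Gaussian case through the usual mean/covariance formula.

import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import norm

# 1-D case: W_p(mu, nu)^p = int_0^1 |F_mu^{-1}(u) - F_nu^{-1}(u)|^p du, here p = 2,
# approximated on a uniform grid of quantile levels.
u = np.arange(1, 1000) / 1000.0
q_mu = norm.ppf(u, loc=0.0, scale=1.0)
q_nu = norm.ppf(u, loc=1.0, scale=2.0)
w2_1d = np.sqrt(np.mean((q_mu - q_nu) ** 2))

# Gaussian case, p = 2: W_2^2 = |m1 - m2|^2 + tr(S1 + S2 - 2 (S2^1/2 S1 S2^1/2)^1/2).
def w2_gaussian(m1, S1, m2, S2):
    S2h = sqrtm(S2)
    cross = sqrtm(S2h @ S1 @ S2h)
    return np.sqrt(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross).real)

print(w2_1d, w2_gaussian(np.array([0.0]), np.array([[1.0]]),
                         np.array([1.0]), np.array([[4.0]])))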

5 Bayesian Wasserstein Barycenter Estimator

In this section we propose a novel Bayesian estimator obtained by using the Wasserstein distance as a loss function. This approach will yield an estimator given by a Fréchet mean in the Wasserstein metric, termed Wasserstein barycenter Agueh and Carlier (2011). We state conditions for the statistical consistency of our estimator which ensure that it has clear advantages over the Bayesian model average. Finally we show that our estimator has a consistent finite approximation, which can be computed by an iterative method. In the Gaussian case we provide an explicit algorithm.

Throughout we assume that and that is a separable locally-compact geodesic space. For a more detailed summary of the notion of Wasserstein barycenters we refer to Appendix A.2.

5.1 Wasserstein population barycenter

We assume that the support of the prior is contained in the subspace of probability measures with a density with respect to a fixed reference measure . Moreover, we assume that for some (and then all) it satisfies

These conditions are summarized in the statement that . We now propose to use the Wasserstein barycenter as the model selection criterion from the posterior , namely:

Definition 1.

Given a prior $\Pi$ and a data set which determines the posterior $\Pi_n$ as in (2), the $p$-Wasserstein Bayes risk and a corresponding Bayes estimator over a model space $\mathcal{M}$ are defined respectively by:

$\mathcal{R}_{p}(\nu) := \int_{\mathcal{M}} W_p^p(\nu,\bar{\nu})\,\Pi_n(d\bar{\nu}),$   (9)
$\hat{\nu}_n := \operatorname{arg\,min}_{\nu \in \mathcal{M}} \mathcal{R}_{p}(\nu).$   (10)

In the case $\mathcal{M} = \mathcal{W}_p(\mathcal{X})$, any $\hat{\nu}_n$ that is a minimum of $\mathcal{R}_p$ is referred to as a $p$-Wasserstein population barycenter of $\Pi_n$ Bigot et al. (2012); Bigot and Klein (2012).

Remark 1.

The existence of a population barycenter follows from the assumptions on $\mathcal{X}$ provided $\Pi_n \in \mathcal{W}_p(\mathcal{W}_p(\mathcal{X}))$. The latter is guaranteed as soon as

$\int_{\mathcal{M}} \int_{\mathcal{X}} d(x_0,x)^p\,\nu(dx)\,\Pi_n(d\nu)$   (11)
$= \int_{\mathcal{X}} d(x_0,x)^p\,\Big(\int_{\mathcal{M}} \nu\,\Pi_n(d\nu)\Big)(dx) < \infty$   (12)

for some $x_0 \in \mathcal{X}$, namely, that the model average has a finite $p$-moment. See (Le Gouic and Loubes, 2017, Theorem 2) for the original result, and Appendix A.2 for our own argument.

As a consequence of Remark 1 we obtain the following important result:

Proposition 2.

Assume that $\Pi_n \in \mathcal{W}_p(\mathcal{W}_p(\mathcal{X}))$; then the $p$-Wasserstein barycenter of $\Pi_n$ exists.

In Appendix A.3 we provide conditions on the prior ensuring that

and therefore the existence of the barycenter estimator.

From now on and throughout this section, we assume implicitly that $\Pi_n \in \mathcal{W}_p(\mathcal{W}_p(\mathcal{X}))$ for all $n$.

Remark 2.

In the case $p = 2$ with $\mathcal{X} = \mathbb{R}^q$, and provided $\Pi_n$ gives positive mass to the set of absolutely continuous measures, the population barycenter of $\Pi_n$ is unique; the argument is almost identical to (Le Gouic and Loubes, 2017, Proposition 6) so we skip it.

5.2 Statistical Consistency

A natural question is whether our estimator is consistent in the statistical sense (see Schwartz (1965); Diaconis and Freedman (1986); Ghosal and van der Vaart (2017)). In a nutshell, consistency corresponds to the convergence of our estimator towards the true model , as we observe more data.

Definition 2.

The prior is said to be strongly consistent at if for each open neighbourhood of in the weak topology of , we have

Remark 3.

As mentioned in (Ghosal and van der Vaart, 2017, Proposition 6.2), strong consistency is equivalent to the -almost sure weak convergence of to . We do not examine in this work the notion of weak consistency.

The celebrated Schwartz’s theorem provides sufficient conditions for strong consistency. See the original Schwartz (1965) or the more modern treatment (Ghosal and van der Vaart, 2017, Proposition 6.16). A key ingredient is:

Definition 3.

A density belongs to the Kullback-Leibler support of , denoted , if for every , where .

Remark 4.

As can be derived from (Ghosal and van der Vaart, 2017, Example 6.20), in our particular setting, we have

We now come to the important question of whether our Wasserstein barycenter estimator converges to the true model. We assume from now on that

(Note that the first condition implies that the model is “correct” or “well specified”, in the sense discussed in Grendár and Judge (2009); Kleijn et al. (2004, 2012), though this could be slightly relaxed in the “misspecified” setting dealt with in those works).

We first provide a rather direct sufficient condition for the convergence of to . Throughout, we use the same symbol for the Wasserstein distance both on $\mathcal{X}$ and on $\mathcal{W}_p(\mathcal{X})$, so as not to make the notation heavier.

Proposition 3.

If -almost surely, then -almost surely.

Proof.

We have, by minimality of the barycenter

It follows that the last term goes to too. On the other hand,

where the constant only depends on . Thus

so we conclude. ∎

The next result states that if the posterior distribution is consistent in the Wasserstein sense (a weaker condition than above), then our estimator is consistent (under some conditions) for models in the Kullback-Leibler support of .

Proposition 4.

If the posterior distribution is $p$-Wasserstein strongly consistent at the true model, then the barycenter is strongly consistent at it under some additional regularity conditions, e.g. for some and . It is possible to relax the condition to if the likelihood function converges uniformly to as , on the sets for every .

We finally specialize to the consistency of the 2-Wasserstein barycenter in the case of (sub-)Gaussian models:

Proposition 5.

In the case that , , Lebesgue, and assuming that consists of measures which have uniformly sub-Gaussian tails (i.e. with decay or lighter), then the posterior distribution is strongly 2-Wasserstein consistent for all .

The proofs of the previous two propositions will be provided in the next version of this work. In any case, we shall not need these results in the sequel.

5.3 Bayesian Wasserstein barycenter vs Bayesian model average

It is illustrative to compare the Bayesian model average with our barycenter estimators. First we show that if the Bayesian model average converges, then our estimator converges too. We even have the stronger condition that the posterior distribution converges in . Recall that the model average is given by $\hat{\nu}_{\mathrm{BMA}}(x) = \int_{\mathcal{M}} \bar{\nu}(x)\,\Pi_n(d\bar{\nu})$.

Proposition 6.

If -a.s. the -moments of the model average converge to those of , then -a.s. In particular, also -almost surely.

Proof.

We already know that the prior is strongly consistent at with respect to the weak topology. Notice that

from which it follows that the convergence holds not only weakly but also in , almost surely. We conclude by Proposition 3. ∎

Remark 5.

Since by Remark 4, the prior is strongly consistent at with respect to the weak topology, (Ghosal and van der Vaart, 2017, Theorem 6.8) and the discussion thereafter imply that the model average is consistent at too.

We now briefly consider the case of , , and Lebesgue. Let be its unique population barycenter, and denote the unique optimal transport map from to . In Appendix A.2 we will prove that . Thanks to this fixed-point property, it follows

where is the Bayesian model average. By similar arguments, we can show that has the same mean as . All in all, we have proved that the -Wasserstein barycenter estimator is less dispersed than the Bayesian model average:

Proposition 7.

Given a prior , let be the Bayesian model average and the -Wasserstein barycenter. Then we have

In particular, the -Wasserstein barycenter estimator has less variance than .

5.4 Wasserstein Empirical Barycenter

In practice, except in special cases, we cannot calculate integrals over the whole model space . Thus we must approximate such integrals by e.g. Monte Carlo methods. For this reason, we now discuss the empirical Wasserstein barycenter and its usefulness as an estimator.

Definition 4.

Given i.i.d. samples $\nu_1,\dots,\nu_N \sim \Pi_n$, we define the empirical distribution as

$\hat{\Pi}_n^N := \frac{1}{N}\sum_{i=1}^{N} \delta_{\nu_i}.$   (13)

Using $\hat{\Pi}_n^N$ instead of $\Pi_n$, we define the $p$-Wasserstein empirical Bayes risk, as well as a corresponding empirical Bayes estimator, which in the case $\mathcal{M} = \mathcal{W}_p(\mathcal{X})$ is referred to as a $p$-Wasserstein empirical barycenter of $\hat{\Pi}_n^N$ (see Bigot et al. (2012); Bigot and Klein (2012)).
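In one dimension the empirical barycenter can be computed by averaging quantile functions, which for Gaussian posterior draws reduces to averaging means and standard deviations; the sketch below illustrates this with a stand-in posterior of our own choosing.

import numpy as np
from scipy.stats import norm

# Empirical 2-Wasserstein barycenter of N one-dimensional Gaussian models
# N(m_i, s_i^2) drawn from a stand-in posterior (illustrative only).
rng = np.random.default_rng(3)
N = 2000
means = rng.normal(loc=0.5, scale=0.1, size=N)            # draws of the mean
stds = np.sqrt(rng.gamma(shape=5.0, scale=0.2, size=N))   # draws of the std

m_bc, s_bc = means.mean(), stds.mean()                    # barycenter N(m_bc, s_bc^2)

# Generic 1-D route: average the quantile functions on a grid of levels.
u = np.arange(1, 200) / 200.0
q_bar = np.mean([norm.ppf(u, loc=m, scale=s) for m, s in zip(means, stds)], axis=0)
print(np.allclose(q_bar, norm.ppf(u, loc=m_bc, scale=s_bc)))   # True: same model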

Remark 6.

It is known that a.s. $\hat{\Pi}_n^N$ converges weakly to $\Pi_n$ as $N \to \infty$. If $\Pi_n$ has finite $p$-th moments, then by the strong law of large numbers we have convergence of $p$-th moments too:

$\int_{\mathcal{M}} W_p^p(\bar{\nu},\nu)\,\hat{\Pi}_n^N(d\nu) \xrightarrow[N\to\infty]{} \int_{\mathcal{M}} W_p^p(\bar{\nu},\nu)\,\Pi_n(d\nu) \quad \text{for a fixed } \bar{\nu} \in \mathcal{W}_p(\mathcal{X}).$   (14)

Thus we have that a.s. $\hat{\Pi}_n^N \to \Pi_n$ in $\mathcal{W}_p(\mathcal{W}_p(\mathcal{X}))$. Thanks to (Le Gouic and Loubes, 2017, Theorem 3), any sequence of empirical barycenters of $\hat{\Pi}_n^N$ converges (up to extraction of a subsequence) in $p$-Wasserstein distance to a (population) barycenter of $\Pi_n$. Combining these facts, the following result is immediate:

Proposition 8.

If in -probability (a.s. ), there exists a data-dependent sequence such that the empirical barycenters satisfy in -probability (a.s. ).

6 The Gaussian Case: Examples, Methods and Experiments

If , and consists of Gaussian distributions, then exact results are available, thanks to the barycenter of non-degenerate Gaussian distributions being a Gaussian distribution Alvarez-Esteban et al. (2015); Álvarez-Esteban et al. (2016). We develop an explicit univariate example to calculate the population barycenter, show methods to calculate the empirical barycenter in the multivariate case, and present an example with real-world data.

6.1 Population barycenter of the conjugate prior for Gaussian distributions

Let and assume as done so far that . Given and we can choose the Normal-inverse-gamma distribution as the prior distribution over , which induces a prior over . This prior is conjugate, and the posterior distribution is given by where , , and with and , see Murphy (2007).

As , the posterior converges almost surely in the weak topology to a point mass at which, by the strong law of large numbers, are the mean and the variance of the true model , so the posterior is consistent in the weak topology. Thanks to Remark 4, we have that . Moreover, by Proposition 6 and the fact that the moments of the model average (see below) converge to those of the true model, the Wasserstein barycenter of the posterior converges a.s. to .

The MAP estimator is given by , while the model average is Murphy (2012, 2007), whose variance is .

Thanks to (Álvarez Esteban et al., 2018, Theorem 3.10), denoting by $\sigma^2$ the variance of a model, the $2$-Wasserstein barycenter is a Gaussian whose standard deviation $\sigma_{\mathrm{bc}}$ satisfies

$\sigma_{\mathrm{bc}} = \int \sigma\,\Pi_n(d\mu,d\sigma^2).$   (15)

Since the posterior distribution over the variance is inverse-gamma, denoting by $a_n$ and $b_n$ its shape and scale parameters we have

$\sigma_{\mathrm{bc}} = \sqrt{b_n}\,\frac{\Gamma(a_n - \tfrac{1}{2})}{\Gamma(a_n)}.$   (16)

Thus, all selected models have equal mean but different variance. In Fig. 2 (left) we compare their variance in an example with , and . Note that our estimator has higher variance than MAP, but less than the model average.
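A minimal numerical check of this ordering, with made-up inverse-gamma hyperparameters (the values of a_n and b_n are our own illustrative assumptions, not induced by any particular dataset), using the facts that in one dimension the barycenter standard deviation is the posterior mean of the standard deviation and that the model-average variance is at least the posterior mean of the variance:

import numpy as np

# Compare the variances of the three selected models when the posterior over
# sigma^2 is inverse-gamma (hyperparameter values are illustrative assumptions).
rng = np.random.default_rng(4)
a_n, b_n = 6.0, 10.0
sigma2 = b_n / rng.gamma(shape=a_n, scale=1.0, size=200_000)   # sigma^2 ~ IG(a_n, b_n)

var_map = b_n / (a_n + 1)                 # variance at the posterior mode of sigma^2
var_bc = np.mean(np.sqrt(sigma2)) ** 2    # barycenter: (posterior mean of sigma)^2
var_avg_lb = np.mean(sigma2)              # lower bound on the model-average variance
print(var_map, var_bc, var_avg_lb)        # increasing order in this example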

Figure 2: Left: Variance of the selected model under the three criteria. Right: Gaussian process regression with a cosine kernel, learned with the Wasserstein barycenter.

6.2 Empirical barycenter of multivariate Gaussian distributions

Assume $\Sigma_1,\dots,\Sigma_N$ are symmetric positive semidefinite matrices, with at least one of them positive definite, and let $\nu_i := \mathcal{N}(0,\Sigma_i)$ for $i = 1,\dots,N$. Given weights $\lambda_1,\dots,\lambda_N \geq 0$ with $\sum_i \lambda_i = 1$, by Alvarez-Esteban et al. (2015) the barycenter of $\sum_i \lambda_i \delta_{\nu_i}$ is $\mathcal{N}(0,S)$, where $S$ is the unique positive definite root of the matrix equation $S = \sum_{i=1}^{N} \lambda_i \big(S^{1/2}\Sigma_i S^{1/2}\big)^{1/2}$. Following Álvarez-Esteban et al. (2016), consider some symmetric, positive definite $S_0$ and define the fast sequence

$S_{k+1} := S_k^{-1/2}\Big(\sum_{i=1}^{N} \lambda_i \big(S_k^{1/2}\Sigma_i S_k^{1/2}\big)^{1/2}\Big)^{2} S_k^{-1/2}.$   (17)

Then $S_k \to S$ as $k \to \infty$. The standard fixed-point approximation $S_{k+1} := \sum_{i=1}^{N} \lambda_i \big(S_k^{1/2}\Sigma_i S_k^{1/2}\big)^{1/2}$, for $k \geq 0$, converges more slowly than the one in Eq. (17). Fig. 3 shows the computation of the covariance matrix barycenter between two covariance matrices, and the distance to the barycenter along the standard and fast sequences.
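A direct NumPy sketch of the fast iteration in Eq. (17), with made-up input matrices and our own function name:

import numpy as np
from scipy.linalg import sqrtm, inv

def gaussian_barycenter_cov(covs, weights, iters=50):
    # Fast fixed-point iteration (Eq. (17)) for the barycenter covariance.
    S = np.eye(covs[0].shape[0])              # symmetric positive definite start
    for _ in range(iters):
        Sh = sqrtm(S).real
        M = sum(w * sqrtm(Sh @ C @ Sh).real for w, C in zip(weights, covs))
        Shi = inv(Sh)
        S = Shi @ M @ M @ Shi                 # S_{k+1} = S_k^{-1/2} M^2 S_k^{-1/2}
    return S

# Illustrative input: two 2x2 covariance matrices with equal weights.
covs = [np.array([[2.0, 0.5], [0.5, 1.0]]), np.array([[1.0, -0.3], [-0.3, 3.0]])]
S = gaussian_barycenter_cov(covs, [0.5, 0.5])
# Residual of the fixed-point equation S = sum_i w_i (S^1/2 Sigma_i S^1/2)^1/2.
Sh = sqrtm(S).real
print(np.abs(S - sum(0.5 * sqrtm(Sh @ C @ Sh).real for C in covs)).max())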

Figure 3: Barycenter (first), covariance matrices (second, third) and convergence speeds (fourth).
Remark 7.

In order to compute the empirical Wasserstein barycenters, we can use the existing methods presented in the works Alvarez-Esteban et al. (2015); Álvarez-Esteban et al. (2016) along with efficient Markov Chain Monte Carlo (MCMC) techniques Goodman and Weare (2010), or efficient sampling procedures El Moselhy and Marzouk (2012); Parno (2015); Kim et al. (2015); Marzouk et al. (2016).

6.3 Wasserstein estimator in a real world problem

We first considered the Sunspot time series (from the scikit-learn toolbox Pedregosa et al. (2011)), which represents the yearly number of sunspots between 1700 and 2008. We use 154 randomly selected observations as the training set, and the remaining 155 as the test set. We assume a Gaussian process (GP) Williams and Rasmussen (1996); Rasmussen and Williams (2006) prior over the data, with a constant mean function and a cosine-plus-noise covariance function Wilson and Adams (2013).

We tested the proposed methodology as follows. We set a flat prior Gelman et al. (2008) over the parameters, and then we sampled from the posterior distribution with MCMC to generate independent parameter samples, to obtain the mean vectors and covariance matrices. We averaged the mean vectors and applied the fast sequence in Eq. (17) to obtain the covariance matrix of the barycenter model. To determine the required number of samples (which, according to Prop. 8, is data-dependent), we started with a default amount and then increased it until the empirical barycenter converged to a stable model.
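A schematic sketch of this pipeline is given below; gp_predictive is a hypothetical placeholder (not from the paper or any specific library) that would return the GP predictive mean vector and covariance matrix for one hyperparameter sample, and gaussian_barycenter_cov is the fixed-point routine sketched after Eq. (17).

import numpy as np

# Hypothetical assembly of the barycenter predictive model from MCMC samples of
# the GP hyperparameters (illustrative sketch only).
def barycenter_predictive(posterior_samples, gp_predictive, X_train, y_train, X_test):
    means, covs = [], []
    for theta in posterior_samples:
        m, C = gp_predictive(theta, X_train, y_train, X_test)
        means.append(m)
        covs.append(C)
    weights = [1.0 / len(covs)] * len(covs)
    m_bc = np.mean(means, axis=0)                   # average the predictive means
    S_bc = gaussian_barycenter_cov(covs, weights)   # fixed-point iteration, Eq. (17)
    return m_bc, S_bc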

In Fig. 2 (right) we can see the posterior predictive mean and confidence interval of the barycenter model. Note that our model is able to find two different frequencies simultaneously in the data, thus being more expressive than the single-cosine covariance function. This shows that our estimator has a desirable behavior in misspecified cases, in the sense that the true model does not belong to the model space, but the barycenter estimator is more expressive than the model space, so it is possible to get a model closer to the true model than any model within it. Table 1 shows that the model selected with the Wasserstein barycenter has a better performance in mean absolute error (MAE) and root mean squared error (RMSE) on observed data and test data, compared with the MAP and model average estimators.

Score                  MAE                        RMSE
Model / Dataset        Obs      Test     Total    Obs      Test     Total
MAP                    29.380   29.067   29.223   37.515   36.483   37.001
Model Average          27.631   25.057   26.340   35.460   31.648   33.602
Wasserstein            23.143   22.552   22.846   30.874   28.977   29.937
Table 1: Results of model selection on the Sunspots dataset.

7 Discussion and Future Work

In this work we proposed a unifying framework for Bayesian estimation, covering common selection criteria and allowing us to introduce the novel Bayesian Wasserstein barycenter estimator. This estimator has good statistical properties, and we have a general procedure to calculate it in the Gaussian case. Our simulations illustrate its good performance as a selection criterion. Interesting directions of future work are finding more efficient computational methods, and extending our estimator to more expressive families of distributions that better explain complex real-life data.

References

  • Agueh and Carlier (2011) M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • Alvarez-Esteban et al. (2015) P. C. Alvarez-Esteban, E. Del Barrio, J. Cuesta-Albertos, and C. Matrán. A note on the computation of Wasserstein barycenters. 2015.
  • Álvarez-Esteban et al. (2016) P. C. Álvarez-Esteban, E. del Barrio, J. Cuesta-Albertos, and C. Matrán. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744–762, 2016.
  • Ambrosio and Gigli (2013) L. Ambrosio and N. Gigli. A user’s guide to optimal transport. Springer, 2013.
  • Ambrosio et al. (2004) L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows with metric and differentiable structures, and applications to the Wasserstein space. Atti Accad. Naz. Lincei Cl. Sci. Fis. Mat. Natur. Rend. Lincei (9) Mat. Appl, 15(3-4):327–343, 2004.
  • Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • Berger (2013) J. O. Berger. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.
  • Bigot and Klein (2012) J. Bigot and T. Klein. Characterization of barycenters in the Wasserstein space by averaging optimal transport maps. arXiv preprint arXiv:1212.2562, 2012.
  • Bigot et al. (2012) J. Bigot, T. Klein, et al. Consistent estimation of a population barycenter in the Wasserstein space. ArXiv e-prints, 2012.
  • Bogachev (2007) V. I. Bogachev. Measure theory, volume 1. Springer Science & Business Media, 2007.
  • Courty et al. (2017) N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2017.
  • Cuestaalbertos et al. (1993) J. A. Cuestaalbertos, L. Ruschendorf, and A. Tuerodiaz. Optimal coupling of multivariate distributions and stochastic processes. Journal of Multivariate Analysis, 46(2):335–361, 1993.
  • Cuturi (2013) M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • Cuturi and Doucet (2014) M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
  • Diaconis and Freedman (1986) P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The Annals of Statistics, pages 1–26, 1986.
  • El Moselhy and Marzouk (2012) T. A. El Moselhy and Y. M. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
  • Frogner et al. (2015) C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
  • Gelman et al. (2008) A. Gelman, A. Jakulin, M. G. Pittau, Y.-S. Su, et al. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–1383, 2008.
  • Ghosal and van der Vaart (2017) S. Ghosal and A. van der Vaart. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017.
  • Goodman and Weare (2010) J. Goodman and J. Weare. Ensemble samplers with affine invariance. Communications in applied mathematics and computational science, 5(1):65–80, 2010.
  • Grendár and Judge (2009) M. Grendár and G. Judge. Asymptotic equivalence of empirical likelihood and Bayesian MAP. Ann. Statist., 37(5A):2445–2457, 10 2009. doi: 10.1214/08-AOS645. URL https://doi.org/10.1214/08-AOS645.
  • Kim et al. (2015) S. Kim, D. Mesa, R. Ma, and T. P. Coleman. Tractable fully Bayesian inference via convex optimization and optimal transport theory. arXiv preprint arXiv:1509.08582, 2015.
  • Kim and Pass (2017) Y.-H. Kim and B. Pass. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640–683, 2017.
  • Kleijn et al. (2012) B. Kleijn, A. Van der Vaart, et al. The Bernstein-von Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381, 2012.
  • Kleijn et al. (2004) B. J. K. Kleijn et al. Bayesian asymptotics under misspecification. PhD thesis, Vrije Universiteit Amsterdam, 2004.
  • Kolouri et al. (2017) S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
  • Le Gouic and Loubes (2017) T. Le Gouic and J.-M. Loubes. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168(3-4):901–917, 2017.
  • Marzouk et al. (2016) Y. Marzouk, T. Moselhy, M. Parno, and A. Spantini. An introduction to sampling via measure transport. arXiv preprint arXiv:1602.05023, 2016.
  • McCann and Guillen (2011) R. J. McCann and N. Guillen. Five lectures on optimal transportation: geometry, regularity and applications. Analysis and geometry of metric measure spaces: lecture notes of the séminaire de Mathématiques Supérieure (SMS) Montréal, pages 145–180, 2011.
  • Murphy (2007) K. P. Murphy. Conjugate Bayesian analysis of the Gaussian distribution. def, 1(22):16, 2007.
  • Murphy (2012) K. P. Murphy. Machine learning: a probabilistic perspective. Cambridge, MA, 2012.
  • Panaretos and Zemel (2017) V. Panaretos and Y. Zemel. Fréchet means and Procrustes analysis in Wasserstein space. Bernoulli, 2017.
  • Parno (2015) M. D. Parno. Transport maps for accelerated Bayesian computation. PhD thesis, Massachusetts Institute of Technology, 2015.
  • Pass (2013) B. Pass. Optimal transportation with infinitely many marginals. Journal of Functional Analysis, 264(4):947–963, 2013.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Rasmussen and Williams (2006) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT, 2006.
  • Santambrogio (2015) F. Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, pages 99–102, 2015.
  • Schwartz (1965) L. Schwartz. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4(1):10–26, 1965.
  • Villani (2003) C. Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • Villani (2008) C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Williams and Rasmussen (1996) C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520. MIT Press, 1996. URL http://papers.nips.cc/paper/1048-gaussian-processes-for-regression.pdf.
  • Wilson and Adams (2013) A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1067–1075, 2013.
  • Wright and Nocedal (1999) S. J. Wright and J. Nocedal. Numerical optimization. Springer Science, 35(67-68):7, 1999.
  • Álvarez Esteban et al. (2018) P. C. Álvarez Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. Wide consensus aggregation in the Wasserstein space. Application to location-scatter families. Bernoulli, 24(4A):3147–3179, 11 2018. doi: 10.3150/17-BEJ957. URL https://doi.org/10.3150/17-BEJ957.

Appendix A Appendix

A.1 Bayes Estimators as Generalized Model Averages

Let and consider the squared -distance between densities as loss function

by Fubini’s theorem we have that

where, by the fundamental lemma of calculus of variations, denoting

the extrema of are weak solutions of the Euler-Lagrange equation