On the consistency of graph-based Bayesian learning and the scalability of sampling algorithms

On the consistency of graph-based Bayesian learning and the scalability of sampling algorithms

N. García Trillos, Z. Kaplan, T. Samakhoana, and D. Sanz-Alonso Division of Applied Mathematics, Brown University, Providence, RI, 02912, USA.
Abstract.

A popular approach to semi-supervised learning proceeds by endowing the input data with a graph structure in order to extract geometric information and incorporate it into a Bayesian framework. We introduce new theory that gives appropriate scalings of graph parameters that provably lead to a well-defined limiting posterior as the size of the unlabeled data set grows. Furthermore, we show that these consistency results have profound algorithmic implications. When consistency holds, carefully designed graph-based Markov chain Monte Carlo algorithms are proved to have a uniform spectral gap, independent of the number of unlabeled inputs. Several numerical experiments corroborate both the statistical consistency and the algorithmic scalability established by the theory.

Division of Applied Mathematics, Brown University, Providence, RI, 02912, USA

1. Introduction

The principled learning of functions is at the core of machine learning, statistics, artificial intelligence, and inverse problems. This paper is concerned with semi-supervised Bayesian learning, where unlabeled data is used to build a prior distribution on the unknown function, that is updated into a posterior distribution by the use of labeled data. Labeled data consists of pairs of function inputs and outputs, whereas unlabeled data consists only of inputs. We focus on graph-based methods, in which the geometry of the input space is learned by means of a graph, and incorporated into the prior covariance. The posterior is defined by combining the prior with a likelihood function that involves the labeled data. Our main goal is to contribute to the understanding of graph-based semi-supervised learning and its Bayesian formulation. We consider -graphs, built by connecting any two inputs whose distance is less than . Our results guarantee that, provided that the connectivity parameter is suitably scaled with the number of inputs, the graph-based posteriors converge, as the size of the unlabeled data set grows, to a continuum posterior. Moreover we show that, under the existence of a continuum limit, carefully designed graph-based Markov chain Monte Carlo (MCMC) sampling algorithms have a uniform spectral gap, independent of the number of unlabeled data. The continuum limit theory is thus of interest in three distinct ways. First, it establishes the statistical consistency of graph-based methods in machine learning; second, it suggests suitable scalings of graph parameters of practical interest; and third, statistical consistency is shown to go hand in hand with algorithmic scalability: when graph-based learning problems have a continuum limit, algorithms that exploit this limit structure have a convergence rate that is independent of the size of the unlabeled data set. The theoretical understanding of these questions relies heavily on recently developed bounds for the asymptotic behavior of the spectra of graph-Laplacians. Our presentation aims to bring together intrinsic and extrinsic approaches to semi-supervised learning, and to highlight the similarities and differences between optimization and Bayesian formulations. We include a computational study that suggests directions for further theoretical developments, and illustrates the non-asymptotic relevance of our asymptotic results.

1.1. Problem Description

We now provide a brief intuitive problem description. For expository purposes we interpret here the unknown function of interest as heat on a manifold; a full account of our setting is given in section 2 below. We remark that our framework covers a wide class of regression and classification problems, e.g. probit and logistic, as well as a variety of Bayesian inverse problems.

We assume that we are given inputs lying on an unknown -dimensional manifold , of which are labeled. The collection of input data will be denoted by and we denote by the vector of labels. Each label is interpreted here as a noisy measurement of the temperature at time at one of the inputs. The ideal but unattainable goal would be to learn the initial temperature in the unknown manifold , described by a function Within a Bayesian setting, this would boil down to putting a prior distribution on and assuming a statistical model for the data encoded in a negative log-likelihood in order to define a posterior distribution over functions on by

 μ(du)∝exp(−Φ(u;→y))π(du). (1.1)

However, as is assumed to be unknown, we follow an intrinsic approach and aim to recover the initial temperature in the given point cloud , described by a discrete function . This will be achieved by using a graph Laplacian to define a prior measure on the temperatures and also to incorporate within the likelihood a surrogate graph-based heat equation; in this way geometric information from extracted from is taken into account both in the prior and the likelihood. The solution of the graph-based Bayesian approach will be a posterior distribution over discrete functions

 μn(dun)∝exp(−Φn(un;→y))πn(dun). (1.2)

The details on how we construct —without use of the ambient space or — the prior and likelihood surrogates and are given in section 2. In the case where , observations are taken directly from the unknown function of interest, and our problem directly corresponds to semi-supervised regression or classification.

Two interpretations of our learning problem will be useful. The first is to see (1.2) as a surrogate, graph-based discretization of a Bayesian inverse problem over functions on whose posterior solution is given by equation (1.1). The second is to interpret it as a classical Bayesian linear regression problem. In the latter interpretation, may represent a low-dimensional manifold sufficient to represent features living in an extremely high dimensional ambient space (), perhaps upon some dimensionality reduction of the given inputs; in the former, may represent the unknown physical domain of a differential equation. A direct application area of the above setting is in the recovery of an image that has been subject to noisy blurring. More broadly our framework covers —by the flexibility in the choice of misfit functional — a wide class of classification and regression learning problems that including Bayesian probit and logistic models.

Our theory on statistical consistency and algorithmic scalability concerns regimes of large number of input data with moderate number of labeled data. Such regime is of particular interest in semi-supervised learning applications where labeled data is often expensive to collect, but unlabeled data abounds. In a way to be made precise, consistency results guarantee that graph-based posteriors of the form (1.2) are close to a ground truth posterior of the form (1.1), while the algorithmic scalability that we establish ensures the convergence, in an -independent number of iterations, of certain MCMC methods for graph posterior sampling. The computational cost per iteration may, however, grow with . These MCMC methods are in principle applicable in fully supervised learning, but their performance would deteriorate if both and are allowed to grow. Finally, we note that although our exposition is focused on semi-supervised regression, our conclusions are equally relevant for semi-supervised classification.

1.2. Literature

Here we put into perspective our framework by contrasting it with optimization and extrinsic approaches to semi-supervised learning, and by relating it to other methods for surrogate Bayesian inversion. We also give some background on MCMC algorithms.

1.2.1. Graph-Based Semi-supervised Learning

We refer to [56] for an introductory tutorial on semi-supervised learning, containing useful pointers to the literature. The question of when and how unlabeled data matters is addressed in [37]. Some key papers on graph-based methods are [57], [33], and [12]. Spectral properties of graph Laplacians, that form the core of a theoretical justification for graph-based approaches, are studied in [8], [34], [14], [30], [47]. As already noted, a key motivation for graph-based semi-supervised learning is that high dimensional inputs can often be represented in a low-dimensional manifold, whose local geometry may be learned by imposing a graph structure on the inputs. In practice, features may be close to but not exactly on an underlying manifold. The question of how to find suitable manifold representations has led to a vast literature on dimensionality reduction techniques and manifold learning, e.g. [43], [52], [20], [7].

1.2.2. Bayesian vs. Optimization, and Intrinsic vs. Extrinsic

In this subsection we focus on the regression interpretation, with labels directly obtained from noisy observation of the unknown function. The Bayesian formulation that we consider has the advantage over traditional optimization formulations that it allows for uncertainty quantification in the recovery of the unknown function [10]. Moreover, from a computational viewpoint, we shall show that certain sampling algorithms have desirable scaling properties —these algorithms, in the form of simulated annealing, may also find application within optimization formulations [29].

The Bayesian update (1.2) is intimately related to the optimization problem

 minun⟨ΔsMnun,un⟩+Φn(un;→y). (1.3)

Here represents the graph-Laplacian, as defined in equation (2.8) below, and the minimum is taken over square integrable functions on the point cloud Precisely, the solution to (1.3) is the mode (or MAP for maximum a posteriori) of the posterior distribution in (1.2) with a Gaussian prior .

The Bayesian problem (1.2) and the variational problem (1.3) are intrinsic in the sense that they are constructed without reference to the ambient space (other than through its metric), working in the point cloud In order to address the generalization problem of assigning labels to points we use interpolation operators that turn functions defined on the point cloud into functions defined on the ambient space. We will restrict our attention to the family of -NN interpolants defined by

 [Ikn(un)](x):=1k∑xi∈Nk(x)un(xi),x∈\mathdsRd, (1.4)

where is the set of -nearest neighbors in to ; here the distance used to define nearest neighbors is that of the ambient space. Within our Bayesian setting we consider , the push-forward of by , as the fundamental object that allows us to assign labels to inputs , and quantify the uncertainty in such inference. The need of interpolation maps also appears in the context of intrinsic variational approaches to binary classification [22] or in the context of variational problems of the form (1.3): the function is only defined on and hence should be extended to the ambient space via an interpolation map .

Intrinsic approaches contrast with extrinsic ones, such as manifold regularization [8], [9]. This method solves a variational problem of the form

 minu⟨ΔsMnu|Mn,u|Mn⟩+Φ(u;→y)+ζ∥u∥2HK, (1.5)

where now the minimum is taken over functions in a reproducing kernel Hilbert space defined over the ambient space and denotes the restriction of to The kernel is defined in and the last term in the objective functional, not present in (1.3), serves as a regularizer in the ambient space; the parameter controls the weight given to this new term. Bayesian and extrinsic formulations may be combined in future work.

In short, extrinsic variational approaches solve a problem of the form (1.5), and intrinsic ones solve (1.3) and then generalize by using an appropriate interpolation map. In the spirit of the latter, the intrinsic Bayesian approach of this paper defines an intrinsic graph-posterior by (1.2) and then this posterior is pushed-forward by an interpolation map. What are the advantages and disadvantages of each approach? Intuitively, the intrinsic approach seems more natural for label inference of inputs on or close to the underlying manifold However, the extrinsic approach is appealing for problems where no low-dimensional manifold structure is present in the input space.

1.2.3. Surrogate Bayesian Learning

Our learning problem can be seen as a surrogate for a ground-truth Bayesian inverse problem over functions in the underlying manifold [50], [18], [19], [25]. Traditional problem formulations and sampling algorithms require repeated evaluation of the likelihood, often making naive implementations impractical. For this reason, there has been recent interest in reduced order models [45], [35], [3], [17], and in defining surrogate likelihoods in terms of Gaussian processes [42], [49], [51], or polynomial chaos expansions [54], [39]. Pseudo-marginal [5], [2] and approximate Bayesian computation methods [6] have become popular in intractable problems where evaluation of the likelihood is not possible. Bounds in Kullback-Leibler [40] and Hellinger metric [51] between surrogate and ground-truth posteriors have been established. There are two distinctive features of the graph-based models employed here. First, they involve surrogates for both the prior and the likelihood; and second, the surrogate and ground-truth posteriors live in different spaces: the former is a measure over functions on a point cloud, while the latter is a measure over functions on the continuum. The paper [23] studied the continuum limits of surrogate posteriors to the ground-truth continuum posterior. This was achieved by using a suitable topology inspired by the analysis of functionals over functions in point clouds arising in machine learning [27] [26], [28], [48].

1.2.4. Markov Chain Monte Carlo

MCMC is a popular class of algorithms for posterior sampling. Here we consider certain Metropolis–Hastings MCMC methods that construct a Markov chain that has the posterior as invariant distribution by sampling from a user-chosen proposal and accepting/rejecting the samples using a general recipe. Posterior expectations are then approximated by averages with respect to the chain’s empirical measure. The generality of Metropolis–Hastings algorithms is a double-edged sword: the choice of proposal may have a dramatic impact on the convergence of the chain. Even for a given form of proposal, parameter tuning is often problematic. These issues are exacerbated in learning problems over functions, as traditional algorithms often break-down.

The pCN algorithm, introduced in [11], allows for robust sampling of infinite dimensional functions provided that the target is suitably defined as a change of measure. Indeed, the key idea of the method is to exploit this change of measure structure, that arises naturally in Bayesian nonparameterics but also in the sampling of conditioned diffusions and elsewhere. Robustness is understood in the sense that, when pCN is used to sample functions projected onto a finite -dimensional space, the rate of convergence of the chain is independent of This was already observed in [11] and [16], and was further understood in [32] by showing that projected pCN methods have a uniform spectral gap, while traditional random walk does not.

In this paper we substantiate the use of graph-based pCN MCMC algorithms [10] in semi-supervised learning. The key idea is that our continuum limit results provide the necessary change of measure structure for the robustness of pCN. This allows us to establish their uniform spectral gap in the regime where the continuum limit holds. Namely, we show that if the number of labeled data is fixed, then the rate of convergence of graph pCN methods for sampling graph posterior distributions is independent of . We remark that pCN addresses some of the challenges arising from sampling functions, but fails to address challenges arising from tall data. Some techniques to address this complementary difficulty are reviewed in [4].

1.3. Paper Organization and Main Contributions

A thorough description of our setting is given in section 2. Algorithms for posterior sampling are presented in section 3. Section 4 contains our main theorems on continuum limits and uniform spectral gaps. Finally, a computational study is conducted in section 5. All proofs and technical material are collected in an appendix.

The two main theoretical contributions of this paper are Theorem 4.2 —establishing statistical consistency of intrinsic graph methods generalized by means of interpolants— and Theorem 4.6 —establishing the uniform spectral gap for graph-based pCN methods under the conditions required for the existence of a continuum limit. Both results require appropriate scalings of the graph connectivity with the number of inputs. An important contribution of this paper is the analysis of truncated graph-priors that retain only the portion of the spectra that provably approximates that of the ground-truth continuum. From a numerical viewpoint, our experiments illustrate parameter choices that lead to successful graph-based inversion, highlight the need for a theoretical understanding of the spectrum of graph Laplacians and of regularity of functions on graphs, and show that the asymptotic consistency and scalability analysis set forth in this paper is of practical use outside the asymptotic regime.

2. Setting

Throughout, will denote an -dimensional, compact, smooth manifold embedded in . We let be a collection of i.i.d. samples from the uniform distribution on . We are interested in learning functions defined on by using the inputs and some output values, obtained by noisy evaluation at inputs of a transformation of the unknown function. The learning problem in the discrete space is defined by means of a surrogate, graph-based discretization of a continuum learning problem defined over functions on We view the continuum problem as a ground-truth case where full geometric information of the input space is available. We describe the continuum learning setting in subsection 2.1, followed by the discrete learning setting in subsection 2.2. We will denote by the space of functions on the underlying manifold that are square integrable with respect to the uniform measure . We use extensively that functions in can be written in terms of the (normalized) eigenfunctions of the Laplace Beltrami operator . We denote by the associated eigenvalues of , assumed to be in non-decreasing order and repeated according to multiplicity. Analogous notations will be used in the graph-based setting, with scripts .

2.1. Continuum Learning Setting

Our ground-truth continuum learning problem consists of the recovery of a function from data The data is assumed to be a noisy observation of a vector obtained indirectly from the function of interest as follows:

 u∈L2(γ)↦→v:=O∘F(u)↦→y.

Here is interpreted as a forward map representing, for instance, a mapping from inputs to outputs of a differential equation. As a particular case of interest, may be the identity map in The map is interpreted as an observation map, and is assumed to be linear and continuous. The Bayesian approach that we will now describe proceeds by specifying a prior on the unknown function , and a noise model for the generation of data given the vector The solution is a posterior measure over functions on , supported on

2.1.1. Continuum Prior

We assume a Gaussian prior distribution on the unknown initial condition :

 π=N(0,Cu),Cu=(αI−ΔM)−s/2, (2.1)

where , and so that belongs to almost surely, and denotes the Laplace Beltrami operator. The parameter characterizes, in a well-understood manner, the almost sure regularity of draws from [50].

Draws can be obtained, for instance, via the Karhunen-Loève expansion

 u(x)=∞∑i=1 (α+λi)−s/4ξiψi(x),ξii.i.d∼N(0,1). (2.2)

Equation (2.1) corresponds to the covariance operator description of the measure The covariance function representation of this operator may be advantageous in the derivation of regression formulae —see the appendix.

2.1.2. Continuum Forward and Observation Maps

In what follows we take, for concreteness, the forward map to be the solution of the heat equation on up to a given time That is, we set

 Fu≡Ftu:=etΔMu. (2.3)

Note that plays two roles in definition of : it determines both the physical domain of the heat equation and the Laplace Beltrami operator . If , the forward map is the identity operator in

We now describe our choice and interpretation of observation maps. Let , and let be small. For we define the -th coordinate of the vector by

 [Ow](j):=1γ(Bδ(xj)∩M)∫Bδ(xj)∩Mw(x)γ(dx),1≤j≤p, (2.4)

where denotes the Euclidean ball of radius centered at At an intuitive level, and in our numerical investigations, we see as the point-wise evaluation map at the inputs :

 Ow=[w(x1),…,w(xp)]′∈\mathdsRp.

The definition in (2.4) allows us, however, to perform rigorous analysis in an sense. Henceforth we denote

2.1.3. Data and Noise Models

Having specified the forward and observation maps and we assume that the label vector arises from noisy measurement of A noise-model will be specified via a function We postpone the precise statement of assumptions on to section 4. Two guiding examples, covered by the theory, are given by

 ϕ→y(→w):=12σ2|→y−→w|2,%orϕ→y(→w):=−p∑i=1log(Ψ(→y(i)→w(i);σ)), (2.5)

where denotes the CDF of a centered univariate Gaussian with variance The former noise model corresponds to Gaussian i.i.d. noise in the observation of each of the coordinates of The latter corresponds to probit classification, and a noise model of the form with i.i.d. and the sign function. For label inference in Bayesian classification the posterior obtained below needs to be pushed-forward via the sign function, see [10].

2.1.4. Continuum Posterior

The Bayesian solution to the ground-truth continuum learning problem is a continuum posterior measure

 μ(du)∝exp(−ϕ→y(Gu))π(du)=:exp(−Φ(u;→y))π(du), (2.6)

that represents the conditional distribution of given the data Equation (2.6) defines the negative log-likelihood function , that represents the conditional distribution of given The posterior contains —in a precise sense [55]— all the information on the unknown input available in the prior and the data.

2.2. Discrete Learning Setting

We consider the learning of functions defined on a point cloud The underlying manifold is assumed to be unknown. We suppose to have access to the same data as in the continuous setting, and that the inputs in the definition of correspond to the first points in . Thus, the data may be interpreted as noisy measurements of the true temperature at the first points in the cloud at time The aim is to construct —without knowledge of — a surrogate “posterior” measure over functions in representing the initial temperatures at each point in the cloud.

Analogous to the continuous setting, we will denote by the space of functions on the cloud that are square integrable with respect to the uniform measure on . It will be convenient to view, formally, functions as vectors in . We then write and think of as evaluation of the function at

The surrogate posteriors are built by introducing a surrogate prior, and surrogate forward and observation maps and for and The same noise-model and data as in the continuum case will be used. We start by introducing a graph structure in the point cloud. Surrogate priors and forward maps are defined via a graph-Laplacian that summarizes the geometric information available in the point cloud

2.2.1. Geometric Graph and Graph-Laplacian

We endow the point cloud with a graph structure. We focus on -neighborhood graphs: an input is connected to every input within a distance of A popular alternative are -nearest neighbor graphs, where an input is connected to its nearest neighbors. The influence of the choice of graphs in unsupervised learning is studied in [38].

First, consider the kernel function defined by

 K(r):={1 if r≤1,0otherwise. (2.7)

For we let be the rescaled version of given by

 Kε(r):=m+2n2αmεm+2K(rε),

where denotes the volume of the -dimensional unit ball. We then define the weight between by

 Wn(xi,xj):=Kεn(|xi−xj|),

for a given choice of parameter , where we have made the dependence of the connectivity rate on explicit. In order for the graph-based learning problems to be consistent in the large limit, should be scaled appropriately with —see subsection 4.1. Figure 1 shows three geometric graphs with fixed and different choices of connectivity

We now define the graph Laplacian of the geometric graph by

 ΔMn:=Dn−Wn, (2.8)

where is the degree matrix of the weighted graph, i.e., the diagonal matrix with diagonal entries . We remark that several definitions of graph Laplacian co-exist in the literature; the one above is some times referred to as the unnormalized graph Laplacian [53]. As will be made precise, the performance of the learning problems considered here is largely determined by the behavior of the spectrum of the graph Laplacian. Throughout we denote its eigenpairs by , and assume that the eigenvalues are in non-decreasing order.

2.2.2. Graph Prior

A straight-forward discrete analogue to (2.1) suggests to endow the unknown function with a prior

 ˜πn=N(0,Cun),Cun:=(αIn+ΔMn)−s/2, (2.9)

where and are chosen as in (2.1). The graph Laplacian, contrary to the regular Laplacian, is positive semi-definite, and hence the change in sign with respect to (2.1). This choice of graph prior was considered in [23]. In this paper we introduce and study priors defined in terms of truncation of the priors , retaining only the portion of the spectra of that provably approximates that of

Precisely, we define the graph priors as the law of given by

 un=kn∑i=1(α+λni)−s/4ξiψni,ξii.i.d∼N(0,1), (2.10)

where may be chosen freely with the restrictions that and . Such choice is possible as long as the connectivity decays with .

2.2.3. Graph Forward and Observation Maps

We define a surrogate forward map by

 Fnun≡Ftnun:=e−tΔMnun, (2.11)

where is given as in the continuum setting. Likewise, for as in (2.4) we define a surrogate observation map by

 [Onw](j):=1nγn(Bδ(xj))∑k:xk∈Bδ(xj)∩Mnw(k),1≤j≤p.

As in the continuum setting, should be thought of as point-wise evaluation at the inputs and we denote

2.2.4. Data and Likelihood

For the construction of graph posteriors we use the same data and noise model as in the continuum case —see subsection 2.1.3.

2.2.5. Graph Posterior

We define the graph-posterior measure by

 μn(du)∝exp(−ϕ→y(Gnun))πn(dun)=:exp(−Φn(un;→y))πn(dun), (2.12)

where is the (truncated) graph prior defined as the law of (2.10), and the above expression defines the function , interpreted as a surrogate negative log-likelihood.

In subsection 4.1 we will contrast the above “truncated” graph-posteriors to the “untruncated” graph-posteriors

 ˜μn(du)∝exp(−ϕ→y(Gnun))˜πn(dun)=:exp(−Φn(un;→y))˜πn(du), (2.13)

obtained by using the prior in equation 2.1.

3. Posterior Sampling: pCN and Graph-pCN

The continuum limit theory developed in [23] and stated in subsection 4.1 suggests viewing graph posteriors as discretizations of a posterior measure over functions on the underlying manifold. Again, these discretizations are robust for fixed and growing number of total inputs . This observation substantiates the idea introduced in [10] of using certain version of the pCN MCMC method [11] for robust sampling of graph posteriors. We review the continuum pCN method in subsection 3.1, and the graph pCN counterpart in 3.2.

3.1. Continuum pCN

In practice, sampling of functions on the continuum always requires a discretization of the infinite dimensional function, usually defined in terms of a mesh and possibly a series truncation. A fundamental idea is that algorithmic robustness with respect to discretization refinement can be guaranteed by ensuring that the algorithm is well defined in function space, prior to discretization [50]. This insight led to the formulation of the pCN method for sampling of conditioned diffusions [11], and of measures arising in Bayesian nonparametrics in [15]. The pCN method for sampling the continuum posterior measure (2.6) is given in Algorithm 1.

Posterior expectations of suitable test functions can then be approximated by empirical averages

 μ(f)≈1JJ∑j=1f(u(j))=SN(f). (3.1)

The user-chosen parameter in Algorithm 2 monitors the step-size of the chain jumps: larger leads to larger jumps, and hence to more state space exploration, more rejections and slower probing of high probability regions. Several robust discretization properties of Algorithm 1 —that contrast with the deterioration of traditional random walk approaches— have been proved in [32]. Note that the acceptance probability is determined by the potential (here interpreted as the negative log-likelihood) that defines the density of posterior with respect to prior. In the extreme case where is constant, moves are always accepted. However, if the continuum posterior is far from the continuum prior, the density will be far from constant. This situation may arise, for instance, in cases where is large or the observation noise is small. A way to make posterior informed proposals that may lead to improved performance in these scenarios has been proposed in [44].

3.2. Graph pCN

The graph pCN method is described in Algorithm 2, and is defined in complete analogy to the continuum pCN, Algorithm 1. When considering a sequence of problems with fixed and increasing the continuum theory intuitively supports the robustness of the method. Moreover, as indicated in [10] the parameter may be chosen independently of the value of

Our experiments in section 5 confirm this robustness, and also investigate the deterioration of the acceptance rate when both and are large.

4. Main Results

4.1. Continuum Limits

The paper [23] established large asymptotic convergence of the untruncated graph-posteriors in (2.13) to the continuum posterior in (2.6). The convergence was established in a topology that combines Wasserstein distance and an -type term in order to compare measures over functions in the continuum with measures over functions in graphs.

Proposition 4.1 (Theorem 4.4 in [23]).

Suppose that and that

 (log(n))pmn1/m≪εn≪1n1/s, as n→∞, (4.1)

where for and for Then, the untruncated graph-posteriors converge towards the posterior in the sense.

We refer to [27] for the construction of the space that allows to compare functions on with functions on , and to [23] for the construction of space which allows us to compare probabilities over functions on with probabilities over functions on It is important to note that here convergence refers to the limit of fixed labeled data set of size , and growing size of unlabeled data. In order for the continuum limit to hold, the connectivity of the graph needs to be carefully scaled with as in (4.1).

At an intuitive level, the lower bound on guarantees that there is enough averaging in the limit to recover a meaningful deterministic quantity. The upper bound ensures that the graph priors converge appropriately towards the continuum prior. At a deeper level, the lower bound is an order one asymptotic estimate for the -optimal transport distance between the uniform and uniform empirical measure on the manifold [26], that hinges on the points lying on the manifold : if the inputs were sampled from a distribution whose support is close to , but whose intrinsic dimension is and not , then the lower bound would be written in terms of instead of . The upper bound, on the other hand, relies on the approximation bounds (5.1) of the continuum spectrum of the Laplace-Beltrami by the graph Laplacian.

We now present a new result on the stability of intrinsically constructed posteriors, generalized to by interpolation via the map —see (1.4); this is the most basic interpolant that can be constructed exclusively from the point cloud and the metric on the ambient space. Other than extending the theory to cover the important question of generalization, there is another layer of novelty in Theorem 4.2: graph-posteriors are constructed with truncated priors, and the upper-bound in the connectivity in (4.1) is removed.

Theorem 4.2.

Suppose that and that

 (log(n))pmn1/m≪εn≪1, as n→∞, (4.2)

where is as in Proposition 4.1 Then, with probability one,

 In♯μn→P(L2)μ, as n→∞.

The proof is presented in Appendix B. Similar results hold for more general interpolants as long as they are uniformly bounded and consistent when evaluated at the eigenfunctions of graph Laplacians (see Remark B.1.)

4.2. Uniform Spectral Gaps for Graph-pCN Algorithms

The aim of this subsection is to establish that, in a precise and rigorous sense, the graph-pCN method in Algorithm 2 is insensitive to the increase of the number of input data provided that the number of labeled data is fixed and a continuum limit exists. This behavior contrasts dramatically with other sampling methodologies such as the random walk sampler. One could characterize the robustness of MCMC algorithms in terms of uniform spectral gaps.

We start by defining the spectral gap for a single Markov chain with state space an arbitrary separable Hilbert space . We consider two notions of spectral gap, one with respect to a Wasserstein distance with respect to some distance like function and one in terms of .

Definition 4.3 (Wasserstein spectral gaps).

Let be the transition kernel for a discrete time Markov chain with state space . Let be a distance like function (i.e. a symmetric, lower-semicontinuous function satisfying if and only if ). Without the loss of generality we also denote by the Wasserstein distance (1-OT distance) on induced by (see (C.1)). Then has spectral gap if there exists positive constants such that

 ~d(Pjμ,Pjν)≤Cexp(−λj)~d(μ,ν),∀μ,ν∈P(H),∀j∈\mathdsN.

In the above stands for the set of Borel probability measures on .

Definition 4.4 (L2-spectral gaps).

Let be the transition kernel for a discrete time Markov chain with state space and suppose that is invariant under . is said to have -spectral gap (for ) if for every we have

 ∥Pf−μ(f)∥2L2(H;μ)∥f−μ(f)∥2L2(H;μ)≤exp(−λ).

In the above, and .

Having defined the notion of spectral gap for a single Markov chain, the notion of uniform spectral gap for a family of Markov chains is defined in an obvious way. Namely, if is a family of Markov chains, with perhaps different state spaces , we say that the family of Markov chains has uniform Wasserstein spectral gap with respect to a family of distance like functions , if the Markov chains have spectral gaps with constants which can be uniformly controlled, respectively, from above and away from zero. Likewise the chains are said to have uniform -gaps (with respect to respective invariant measures) if the constant can be uniformly bounded away from zero. We remark that Wasserstein spectral gaps imply uniqueness of invariant measures.

In what follows we use the following assumption:

Assumption 4.5.

Let . For a certain fixed we assume the following conditions on .

1. For every there exists such that if satisfy

 |→w−√1−β2→v|≤K

then,

 ϕy(→v)−ϕ→y(→w)≥c.
2. (Linear growth of local Lipschitz constant) There exists a constant such that

 |ϕ→y(→v1)−ϕ→y(→v2)|≤Lmax{|→v1|,|→v2|,1}|→v1−→v2|,∀→v1,→v2∈\mathdsRp.

In Appendix D we show that the Gaussian model and the probit model satisfy these assumptions.

In what follows it is convenient to use as a placeholder for one of the spaces or the space . Likewise is a placeholder for the transition kernel associated to the pCN scheme from section 3 defined on for each choice of . We are ready to state our second main theorem:

Theorem 4.6 (Uniform Wasserstein spectral gap).

Let , . For each choice of let ,

 d(u,v):=min{1,¯¯¯d(u,v)θ},u,v∈H

be a rescaled and truncated version of the distance

 ¯¯¯d(u,v):=infT,ψ∈A(T,u,v)∫T0exp(η∥ψ∥)dt,
 A(T,u,v):={ψ∈C1([0,T];H):ψ(0)=u,ψ(T)=v,∥˙ψ∥=1}.

Finally, let be the distance-like function

 ~d(x,y):=√d(x,y)(1+V(x)+V(y)),u,v∈H

where

 V(u):=∥u∥2,u∈H.

Then, under the assumptions of Theorem 4.2 and the assumptions 4.5, and can be chosen independently of the specific choice of in such a way that

 ~d(Pjν1,Pjν2)≤Cexp(−λj)~d(ν1,ν2),∀ν1,ν2∈P(H),∀j∈\mathdsN,

for constants that are independent of the choice of .

A few remarks help clarify our results.

Remark 4.7.

Notice that the distance is a Riemannian distance whose metric tensor changes in space and in particular gets larger for points that are far away from the origin (notice that the choice returns the canonical distance on ). In particular, points that are far away from the origin have to be very close in the canonical distance in order to be close in the distance. This distance was considered in [32]. We would also like to point out that the exponential form of the metric tensor can be changed to one with polynomial growth given the choice of .

Remark 4.8.

Theorem 4.6 is closely related to Theorem 2.14 in [32]. There, uniform spectral gaps are obtained for the family of pCN kernels indexed by the different truncation levels of the Karhunen Loève expansion of the continuum prior. For that type of discretization, all distributions are part of the same space; this contrasts with our set-up where the discretizations of the continuum prior are the graph priors.

Due to the reversibility of the kernels associated to the pCN algorithms, Theorem 4.6 implies uniform -spectral gaps. Notice that the Wasserstein gaps imply uniqueness of invariant measures (which are precisely the graph and continuum posteriors for each setting) and hence there is no ambiguity when talking about -spectral gaps.

Corollary 4.9.

Under the assumptions of Theorem 4.2 and the assumptions 4.5 the kernel associated to the pCN algorithm has an -spectral gap independent of the choice of .

The proof of Theorem 4.6 and its corollary are presented in Appendix C.

Remark 4.10.

Uniform spectral gaps may be used to find uniform bounds on the asymptotic variance of empirical averages [36].

5. Numerical Study

In the numerical experiments that follow we take to be the two-dimensional sphere in Our main motivation for this choice of manifold is that it allows us to expediently make use of well-known closed formulae [41] for the spectrum of the spherical Laplacian in the continuum setting that serves as our ground truth model. We recall that admits eigenvalues , with corresponding eigenspaces of dimension . These eigenspaces are spanned by spherical harmonics [41]. In subsections 5.1, 5.2, and 5.3 we study, respectively, the spectrum of graph Laplacians, continuum limits, and the scalability of pCN methods.

5.1. Spectrum of Graph Laplacians

The asymptotic behavior of the spectra of graph-Laplacians is crucial in the theoretical study of consistency of graph-based methods. In subsection 5.1.1 we review approximation bounds that motivate our truncation of graph-priors, and in subsection 5.1.2 we comment on the theory of regularity of functions on graphs.

5.1.1. Approximation Bounds

Quantitative error bounds for the difference of the spectrum of the graph Laplacian and the spectrum of the Laplace-Beltrami operator are given in [13] and [21]. Those results imply that, with very high probability,

 ∣∣1−λniλi∣∣≤C(δnεn+εn√λi),∀i, (5.1)

where denotes the -optimal transport distance [26] between the uniform and the uniform empirical measure on the underlying manifold. The important observation here is that the above estimates are only relevant for the first portion of the spectra (in particular for those indices for which is small). The truncation point at which the estimates stop being meaningful can then be estimated combining (5.1) and Weyl’s formula for the growth of eigenvalues of the Laplace Beltrami operator on a compact Riemannian manifold of dimension [23]. Namely, from we see that as long as and

 1≪kn≪1εmn.

This motivates our truncation point for graph priors in equation (2.10).

Figure 2 illustrates the approximation bounds (5.1). The figure shows the eigenvalues of the graph Laplacian for three different choices of connectivity length scale and three different choices of number of inputs in the graph; superimposed is the spectra of the spherical Laplacian. We notice the flattening of the spectra of the graph Laplacian and, in particular, how the eigenvalues of the graph Laplacian start deviating substantially from those of the Laplace-Beltrami operator after some point in the -axis. As discussed in [21], the estimates (5.1) are not necessarily sharp, and may be conservative in suggesting where the deviations start to happen.

5.1.2. Regularity of Discrete Functions

We numerically investigate the role of the parameter in the discrete regularity of functions sampled from . We focus on studying the oscillations of a function within balls of radius . More precisely, we consider

 [oscεn(un)](xi):=maxx,z∈Bεn(xi)∩Mn|un(x)−un(z)|,i=1,…,n.

For given we take samples and we normalize so that

 ⟨Δsnun,un⟩L2(γn)=1.

We then compute the maximum value of over all and over all samples and plot the outcome against . The results are shown in Figure 3.

This experiment illustrates the regularity of functions with bounded semi-norm

 ∥un∥2Hsn:=kn∑i=1(λni)s⟨un,ψni⟩2L2(γn).

As expected, higher values of enforce more regularity on the functions. Notice that here we only consider functions in the support of and hence we remove the effect of high eigenfunctions of (which may be irregular). In particular, the regularity of the functions must come from the regularity of the first eigenvectors of together with the growth of . To the best of our knowledge nothing is known about regularity of eigenfunctions of graph Laplacians. Studying such regularity properties is an important direction to explore in the future as we believe it would allow us to go beyond the set-up that we consider for the theoretical results in this paper. In that respect we would like to emphasize that the observation maps considered for the theory of this work are defined in terms of averages and not in terms of pointwise evaluations, but that for our numerical experiments we have used the latter.

A closely related setting in which discrete regularity has been mathematically studied is in the context of graph -Laplacian semi-norm (here denotes an arbitrary number greater than one, and is not to be confused with the number of labeled data points). Lemma 4.1 in [48] states that, under the assumptions on from Theorem 4.2, for all large enough and for every discrete function satisfying

 E(p)n(un):=1n2εpn∑i,jK(|xi−xj|εn)|un(xi)−un(xj)|p=1,

it holds

 [oscεn(un)](xi)≤C1/pn1/pεn,∀i=1,…,n.

This estimate allows to establish uniform convergence (and not simply convergence in ) of discrete functions towards functions defined at the continuum level. More precisely, suppose that and that . Let be a sequence with converging to a function in the sense and for which

 supn∈\mathdsNE(p)n(un)<∞.

Then, must be continuous (in fact Hölder continuous with Hölder constant obtained from the Sobolev embedding theorem) and moreover

 maxi=1,…,n|un(xi)−u(xi)|→0,as n→∞.

This is the content of Lemma 4.5 in [48]. This type of result rigorously justifies pointwise evaluation of discrete functions with bounded graph -Laplacian seminorm and the stability of this operation as .

5.2. Continuum Limits

5.2.1. Set-up

For the remainder of section 5 we work under the assumption of Gaussian observation noise, so that

 Φ(u;y)=12σ2|y−G(u)|2,Φn(un,y)=12σ2|y−Gn(un)|2. (5.2)

The synthetic data in our numerical experiments is generated by drawing a sample , and setting

 y=G(u†)+η,

where is the function in the left panel of Figure 4. We consider several choices of number of labeled data points, and size of observation noise The parameters and in the prior measures are fixed to , throughout.

The use of Gaussian observation noise, combined with the linearity of our forward and observation maps, allows us to derive closed formulae for the graph and continuum posteriors. We do so in the the appendix.

5.2.2. Numerical Results

Here we complement the theory by studying the effect that various model parameters have in the accurate approximation of continuum posteriors by graph posteriors. We emphasize that the continuum posteriors serve as a gold standard for our learning problem: graph posteriors built with appropriate choices of connectivity result in good approximations to continuum posteriors; however, reconstruction of the unknown function is not accurate if the data is not informative enough. In such case, MAPs constructed with graph or continuum posteriors may be far from

In all the figures involving graph-posterior means, these are represented by using a -NN interpolant, as defined in equation (1.4), with Figure 5 shows a graph-prior draw represented in the point cloud (left), and the associated -NN interpolant (right). All the plots have been obtained using the (graph) pCN algorithm. The pCN algorithm was run for iterations, and the last samples were used to compute quantities of interest (e.g means and variances).

Figure 6 shows graph and continuum posteriors with and For these plots, a suitable choice of graph connectivity was taken. In all three cases we see remarkable similarity between the graph and continuum posterior means. However, recovery of the initial condition with is unsuccessful: the data does not contain enough information to accurately reconstruct . Figure 7 shows graph-posterior means computed in the regime of the first row of Figure 6 using the three graphs in Figure 1. Note that the spectra of the associated graph-Laplacians is represented in Figure 2. It is clear that inappropriate choice of leads to poor approximation of the continuum posterior, and here also to poor recovery of the initial condition This is unsurprising in view of the dramatic effect of the choice of in the approximation properties of the spectrum of the spherical Laplacian, as shown in Figure 2. Note that while the numerical results are outside the asymptotic regime ( throughout), they illustrate the role of Theorem 4.2 establishes appropriate scalings for successful graph-learning in the large asymptotic.

5.3. Algorithmic Scalability

It is important to stress that the large robust performance of pCN methods established in this paper hinges on the existence of a continuum limit for the measures Indeed, the fact that the limit posterior over infinite dimensional functions can be written as a change of measure from the limit prior has been rigorously shown to be equivalent to the limit learning problem having finite intrinsic dimension [1]. In such a case, a key principle for the robust large sampling of the measures is to exploit the existence of a limit density, and use some variant of the dominating measure to obtain proposal samples. It has been established —and we do so here in the context of graph-based methods— that careful implementation of this principle leads to robust MCMC and importance sampling methodologies [32], [1].

A further point to note is that —even though from a theoretical and applied viewpoint it is clearly desirable that the data is informative— computational challenges in Bayesian settings often arise when the data is highly informative. This is so in the context of importance sampling and particle filters [1], [46], where certain notion of distance between prior and proposal characterizes the algorithmic complexity. In the context of the pCN MCMC algorithms, if is constant then the algorithm has acceptance probability On the other hand, large Lipschitz constant of (which translates to a posterior that is far from the prior) leads to small spectral gap. Indeed, tracking the spectral gap of pCN in terms of model parameters via the understanding of Lipschitz constants is in principle possible, and will be the subject of further work. In particular, small observation noise leads to deterioration of the pCN performance, see Figure 8. This issue may be alleviated by the use of the generalized version of pCN introduced in [44]. Figures 9 and 10 investigate the role of the parameters and . All these figures show the posterior mean at one of the inputs, and the true graph posterior means have been computed with the formulae in the appendix.

Table 1 shows the large robustness of pCN methods, while also exhibits its deterioration in the fully supervised case The table shows the average acceptance probability with model parameters for the semi-supervised setting, and same parameters but with for the fully supervised case. The corresponding graph-posterior means are shown in Figure 11.

6. Acknowledgements

ZK was funded by, and would like to acknowledge, NSF grant IDyaS; TS would like to thank the Brown Division of Applied Mathematics for providing funds for the research. The authors are thankful to Dejan Slepčev for a careful reading of a first version of this manuscript.

References

• [1] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405–431, 2017.
• [2] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, pages 697–725, 2009.
• [3] S. R. Arridge, J. P. Kaipio, V. Kolehmainen, M. Schweiger, E. Somersalo, T. Tarvainen, and M. Vauhkonen. Approximation errors and model reduction with an application in optical diffusion tomography. Inverse Problems, 22(1):175, 2006.
• [4] R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1–43, 2017.
• [5] M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003.
• [6] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.
• [7] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine learning, 56(1-3):209–239, 2004.
• [8] M. Belkin and P. Niyogi. Towards a theoretical foundation for laplacian-based manifold methods. In COLT, volume 3559, pages 486–500. Springer, 2005.
• [9] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
• [10] A. L. Bertozzi, X. Luo, A. M. Stuart, and K. C. Zygalakis. Uncertainty quantification in the classification of high dimensional data.
• [11] A. Beskos, G. O. Roberts, A. M. Stuart, and J. Voss. MCMC methods for diffusion bridges. Stochastics and Dynamics, 8(03):319–350, 2008.
• [12] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. 2001.
• [13] D. Burago, S. Ivanov, and Y. Kurylev. A graph discretization of the Laplace-Beltrami operator. J. Spectr. Theory, 4:675â714, 2014.
• [14] R. R. Coifman and S. Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
• [15] S. L. Cotter, M. Dashti, J. C. Robinson, and A. M. Stuart. Bayesian inverse problems for functions and applications to fluid mechanics. Inverse problems, 25(11):115008, 2009.
• [16] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
• [17] T. Cui, Y. M. Marzouk, and K. E. Willcox. Data-driven model reduction for the Bayesian solution of inverse problems. International Journal for Numerical Methods in Engineering, 102(5):966–990, 2015.
• [18] M. Dashti and A. M. Stuart. The bayesian approach to inverse problems. Handbook of Uncertainty Quantification.
• [19] M. Dashti and A. M. Stuart. Uncertainty quantification and weak approximation of an elliptic inverse problem. SIAM Journal on Numerical Analysis, 49(6):2524–2542, 2011.
• [20] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.
• [21] N. García Trillos, M. Gerlach, M. Hein, and D. Slepčev. Spectral convergence of empirical graph Laplacians. In preparation.
• [22] N. García Trillos and R. Murray. A new analytical approach to consistency and overfitting in regularized empirical risk minimization. European Journal of Applied Mathematics, pages 1–36, 2017.
• [23] N. García Trillos and D. Sanz-Alonso. Continuum limit of posteriors in graph Bayesian inverse problems. arXiv preprint arXiv:1706.07193, 2017.
• [24] N. García Trillos and D. Sanz-Alonso. Gradient Flows: Applications to Classification, Image Denoising, and Riemannian MCMC. arXiv preprint arXiv:1705.07382, 2017.
• [25] N. García Trillos and D. Sanz-Alonso. The Bayesian formulation and well-posedness of fractional elliptic inverse problems. Inverse Problems, 33(6):065006, 2017.
• [26] N. García Trillos and D. Slepčev. On the rate of convergence of empirical measures in -transportation distance. Canadian Journal of Mathematics, 67:1358–1383, 2014.
• [27] N. García Trillos and D. Slepčev. Continuum limit of total variation on point clouds. Archive for rational mechanics and analysis, 220(1):193–241, 2016.
• [28] N. García Trillos and D. Slepčev. A variational approach to the consistency of spectral clustering. Applied and Computational Harmonic Analysis, 2016.
• [29] C. J. Geyer and E. A. Thompson. Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90(431):909–920, 1995.
• [30] E. Giné and V. Koltchinskii. Empirical graph Laplacian approximation of Laplace–Beltrami operators: Large sample results. In High dimensional probability, pages 238–259. Institute of Mathematical Statistics, 2006.
• [31] M. Hairer, J. C. Mattingly, and M. Scheutzow. Asymptotic coupling and a general form of Harrisâ theorem with applications to stochastic delay equations. Probability theory and related fields, 149(1):223–259, 2011.
• [32] M. Hairer, A. M. Stuart, and S. J. Vollmer. Spectral gaps for a Metropolis–Hastings algorithm in infinite dimensions. The Annals of Applied Probability, 24(6):2455–2490, 2014.
• [33] J. Hartog and H. van Zanten. Nonparametric Bayesian label prediction on a graph. arXiv preprint arXiv:1612.01930, 2016.
• [34] M. Hein, J.-Y. Audibert, and U. Von Luxburg. From graphs to manifolds–weak and strong pointwise consistency of graph Laplacians. In International Conference on Computational Learning Theory, pages 470–485. Springer, 2005.
• [35] M. C. Kennedy and A. O’Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425–464, 2001.
• [36] C. Kipnis and S. R. S. Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Communications in Mathematical Physics, 104(1):1–19, 1986.
• [37] F. Liang, S. Mukherjee, and M. West. The use of unlabeled data in predictive modeling. Statistical Science, pages 189–205, 2007.
• [38] M. Maier, U. Von Luxburg, and M. Hein. Influence of graph construction on graph-based clustering measures. In Advances in neural information processing systems, pages 1025–1032, 2009.
• [39] Y. M. Marzouk, H. N. Najm, and L. A. Rahn. Stochastic spectral methods for