Variational limits of k-NN graph-based functionals on data clouds
This paper studies the large sample asymptotics of data analysis procedures based on the optimization of functionals defined on k-NN graphs on point clouds. The paper is framed in the context of minimization of balanced cut functionals, but our techniques, ideas and results can be adapted to other functionals of relevance. We rigorously show that provided the number of neighbors in the graph scales with the number of points in the cloud as , then with probability one, the solution to the graph cut optimization problem converges towards the solution of an analogous variational problem at the continuum level.
Key words and phrases: k-NN graph, discrete to continuum limit, Gamma-convergence, graph cut, Cheeger cut, spectral clustering.
This paper studies the large sample asymptotics of data analysis procedures based on the optimization of functionals defined on graphs; the procedures of interest include graph-based methods for clustering, classification, and semi-supervised learning. The set of vertices of the graph is a random data set while the edges capture the level of similarity among points. In this work our focus is on k-NN graphs, where one puts an edge between a pair of points whenever one of them is among the k-nearest neighbors of the other; we will assume from here on that the data set is a subset of Euclidean space. Our main results rigorously show that when k scales like , then the solutions of the optimization problems of interest at the graph level are consistent and converge towards the solutions of analogous variational problems at the continuum level. Spectral clustering, total variation clustering, diffusion maps, and p-Laplacian regularization are all examples of graph-based data analysis procedures with a variational flavor (see [7, 9, 21, 22, 26, 27, 32, 1]), and the results and ideas in this paper can be applied to analyze their large sample limit in this k-NN setting. For expository purposes we focus on the concrete example of minimizing balanced graph cuts as described below.
To get a flavor of our main results, consider a set of random points uniformly distributed on the region as depicted in Figure 2. A k-NN graph for a certain choice of k is then obtained as shown in Figure 2. We introduce a functional on partitions of by
The numerator favors low interaction between the two sets in a partition, while the denominator forces it to be balanced in terms of size. It is thus natural to consider the optimization problem
as a sensible approach for data clustering; (1.1) is known as the Cheeger cut functional. In Figure 4 we illustrate the minimizer of (1.1) for the graph in Figure 2. We observe the close resemblance between the discrete minimizer in Figure 4 and the partition of the region in Figure 4 which can be described as a solution to a variational problem at the continuum level of the form
where the cut functional is defined (at least for with smooth boundary) as
The main theorem of this paper is a rigorous mathematical statement of the previous observation. We show that with probability one, and in a very precise sense, the solution to (1.1) converges as towards the solution to a continuum variational problem (see Theorem 2.4 below), provided that (the number of neighbors in the definition of the graph) scales like
Phrased in the language of data clustering, our results establish the statistical consistency of a series of clustering procedures based on minimization of graph cuts when these come from k-NN graphs. Moreover, our results show that (at least in terms of scaling with the number of points), the condition on k needed to establish consistency is dimension free.
To the best of our knowledge this work is the first one to rigorously address the stability of variational problems on k-NN graphs such as the one defined in (1.1) in the large sample limit. Most of the theoretical works found in the literature addressing similar questions assume an ε-graph construction on the data set (i.e. there is an edge between two points if they are within distance ε of each other); an exception to this is the work  where pointwise convergence of graph Laplacians on k-NN graphs (among other constructions) is analyzed. We find the absence of theoretical results in the k-NN setting to be a strong motivation for our work given their frequent use by practitioners due to their nicer regularity properties and their robustness to data dimensionality; see  for a more complete discussion on this matter. Notice that k-NN graphs can be constructed completely from ordinal information about the data points and in particular exact values of interpoint distances are not needed (this is a property that adds to the robustness of the k-NN construction). Figure 5 illustrates the difference between ε-graphs and k-NN graphs.
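The two constructions compared above can be sketched in a few lines. The following is a minimal illustration, not the construction used in the paper's experiments: the function names, the toy point set, and the values of k and ε are all assumptions made for the example. It also checks the ordinal property mentioned above, namely that rescaling all coordinates leaves the k-NN graph unchanged while the ε-graph changes.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix of a small point cloud."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def knn_graph(points, k):
    """Symmetric k-NN graph: i ~ j if j is among the k nearest
    neighbors of i, or i is among the k nearest neighbors of j."""
    dist = pairwise_distances(points)
    n = len(points)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def eps_graph(points, eps):
    """epsilon-graph: i ~ j if the two points are within distance eps."""
    dist = pairwise_distances(points)
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] <= eps}


# Hypothetical toy data: a dense group on the left and two sparse outliers.
points = [(0.0, 0.0), (0.1, 0.0), (0.25, 0.0), (1.0, 0.0), (3.0, 0.0)]
knn_edges = knn_graph(points, k=1)
eps_edges = eps_graph(points, eps=0.2)

# Rescaling the data changes distances but not their ordering.
scaled = [(10 * x, 10 * y) for x, y in points]
knn_edges_scaled = knn_graph(scaled, k=1)
eps_edges_scaled = eps_graph(scaled, eps=0.2)
```

In this toy example the k-NN graph connects the sparse outliers to the rest of the cloud, whereas the ε-graph with a fixed ε leaves them isolated.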
defined for functions ; this is the functional that appears in the numerator in (1.1) (restricting to ); we will denote it by and refer to it as the graph total variation (see (2.4) below for its precise definition). The Γ-limit of is shown to be a weighted (local) total variation functional at the continuum level. The notion of Γ-convergence provides precise sufficient conditions implying the stability of minimizers of variational problems; we review its definition in Section 2.4 and we refer the interested reader to  for a more complete discussion on the topic.
Most of the technical work in this paper is devoted to analysing the “bias” of the random functional . We establish the Γ-convergence of a kernel based functional with inhomogeneous bandwidth (a sort of average of ) towards a (local) weighted total variation (see Propositions 3.1 and 3.2 below). The inhomogeneity associated to the kernel based functional is intrinsic to the k-NN graph construction, where length-scales are determined by the Euclidean distance and the data density around each point . The fact that the resulting kernel based approximation has a varying bandwidth makes our analysis different from that in . At each point in space, the bandwidth depends inversely on the density of the ground-truth measure generating the data.
It is possible to reinterpret this feature and think of it as fixing a bandwidth but now measuring distances with an effective metric induced by the ground-truth on the ambient space; this metric in particular shrinks distances on regions with low density. In this work we will not pursue this idea any further, but we anticipate that there are several advantages of doing so. In particular, it is of relevance to further investigate the dimension free condition , and understand better the constants appearing in these asymptotic inequalities. This is of special relevance if we want to extend our results to settings where the ground-truth distribution is for example a Gaussian measure in an infinite dimensional Hilbert space. Both the unboundedness of the support of the ground truth and the infinite dimension of the ambient space are settings that are not covered by the results in this paper. We believe that a better understanding of this and other related issues is of importance for a better understanding of k-NN graphs and their benefits.
We would like to finish this introduction by mentioning some of the growing literature on large sample asymptotic analysis of operators and functionals constructed from ε-graphs. Convergence of graph Laplacians can be found in the work by Belkin and Niyogi , Coifman and Lafon , Giné and Koltchinskii , Hein, Audibert and von Luxburg , and Singer . These works deal mostly with pointwise consistency of graph Laplacians. The work of Arias-Castro, Pelletier and Pudlo  studies the pointwise convergence of Cheeger energy and consequently of total variation, as well as variational convergence when the discrete functional is considered over an admissible set of characteristic functions which satisfy a “regularity” requirement. Spectral convergence of graph Laplacians (relevant for data clustering) has been studied, among others, by Huang, and Jordan , Belkin and Niyogi , von Luxburg, Belkin and Bousquet , and Singer and Wu . Most of the results previously mentioned are deduced using tools from perturbation theory of linear operators. A different set of tools was introduced in  and later used in [12, 16, 17]. The notion of Γ-convergence (a.k.a. epi-convergence) and the introduction of a suitable metric (the TL^1 metric; see  and Section 2.4 below), allowing one to compare functions at the graph level with functions at the continuum level, were crucial tools to deduce statistical consistency for a large class of balanced graph cuts  and for spectral clustering . In all these results, sharp convergence rates for guaranteeing the consistency were provided. In the context of graph-based approaches to semi-supervised learning, it is worth mentioning the work  where the p-Laplacian regularization (for p large enough) is studied, as well as the Bayesian approach in [13, 11] where the convergence of graph posteriors is analyzed.
The rest of the paper is organized as follows. In Section 2 we make our assumptions precise, present some of the examples of variational problems that are relevant for important tasks in data analysis, and present the main results of the paper. In Section 2.4 we give the definitions of the TL^1 space and the notion of Γ-convergence; we also present some auxiliary results that are needed in the remainder. Finally, in Section 3 we present the proofs of the main results.
2. Set-up and main results
We assume that the data are i.i.d. samples from a distribution on . is assumed to be an absolutely continuous measure with respect to the Lebesgue measure, with density where is a bounded, connected, open set with Lipschitz boundary, and is a continuous function bounded above and below by positive constants. Namely, we assume that there are positive constants such that for every we have
We denote by the empirical measure
whenever is among the k-nearest neighbors of or vice-versa.
Because has a density, with probability one, the condition is equivalent to
where . Without loss of generality we assume this fact in the remainder.
Let us introduce some functions that we use in the remainder. Let be the Heaviside function given by
For an arbitrary we let be the function
We let be the quantity
where is the first coordinate of the vector . Notice that in the above expression can be replaced with for any vector with unit norm.
2.1. Graph total variation and total variation in the continuum
In order to introduce the graph total variation functional (an appropriately rescaled version of the numerator in (1.1)), it will be convenient to define
and for , let be the number for which
The graph total variation functional is then defined as
Notice that when for some , is just a rescaled version of the numerator of the functional in (1.1). At the discrete level we are interested in the following optimization problem
The continuum counterpart of the graph total variation is a functional that takes the following form.
Let be a continuous function bounded below and above by positive constants. For an arbitrary function , we define its weighted (by ) total variation as
In the above and in the remainder of the paper, we use to represent the space of functions with respect to the Lebesgue measure restricted to . In addition, when we write instead of and we denote by the space of functions for which . Notice that for any continuous and bounded above and below by positive constants, the condition is equivalent to the condition .
If is smooth, then
Also, if for some open set with smooth boundary, then
where is the -dimensional Hausdorff measure.
At the continuum level we will be interested in a variational problem of the form
for an appropriately chosen function .
2.2. Main results
Our main result establishes the convergence of minimizers of (2.5) towards minimizers of (2.7). Notice however that solutions to (2.5) are discrete sets, whereas solutions to (2.7) are continuum sets, and so we need to clarify the sense in which we will establish the convergence of discrete minimizers towards continuum ones. The convergence is taken in the space introduced in , where in particular functions on the point cloud and functions on are seen as elements of the same space (see section 2.4 below for details).
We are ready to establish our main result.
Let and let be an open, connected, bounded domain with Lipschitz boundary. Let be a continuous density function satisfying (2.1) and let be the probability measure . Let be a sequence of natural numbers satisfying
Let be i.i.d. points from and let be a solution to (2.5).
Then, with probability one, along every subsequence of there is a further subsequence converging in the -sense towards a minimizer of (2.7), where
In the above theorem if the minimizer of (2.7) is unique, then the convergence is along the entire sequence of discrete minimizers.
As discussed in the introduction, and especially as shown in , our results are a direct corollary of Theorems 2.6 and 2.7 below. We show that the Γ-limit (in the TL^1-sense) of the functional (for the same scaling as in Theorem 2.4) is the functional . The notion of Γ-convergence and its connection to the stability of minimizers of functionals are reviewed in Section 2.4.
(Γ-convergence) Let and let be an open, connected, bounded domain with Lipschitz boundary. Let be a continuous density function satisfying (2.1). Let be a sequence of natural numbers satisfying
Then, the functionals Γ-converge towards in the TL^1-sense, where is defined in (2.2), and is the volume of the unit ball in . Furthermore, when restricted to indicator functions, the energies Γ-converge to the functional restricted to indicator functions.
(Compactness) Under the same assumptions in Theorem 2.6, the following statement holds with probability one: If is a sequence with for which
then is pre-compact in . That is, every subsequence of has a further subsequence converging in .
The above theorems hold for domains in provided we replace the condition with the condition . This can be seen directly from our proofs, and follows from the fact that the rate of convergence of the -transportation distance between and (for ) scales like and not like (see ).
The above set-up and results can be extended to the setting in which the support of is not a domain contained in the ambient space but a compact manifold with intrinsic dimension . When that is the case, the appropriate scalings for the functionals are obtained using the intrinsic dimension and not the dimension of the ambient space . Although in this paper we omit the details of such an extension, we point out that the results in  (later used in the proofs in ) can be adapted to the manifold setting by establishing results analogous to those in  in the manifold setting, using the geodesic flow in , and the fact that the Euclidean distance (distance in the ambient space ) is a third order approximation for the geodesic distance (intrinsic distance), i.e. ,
The details can be presented elsewhere in a more general setting which is also of interest; see Remark 2.10 below.
Suppose that . One of the important consequences of Theorem 2.6 is that the admissible regimes for that guarantee the recovery of a non-trivial variational limit for the energies do not depend on (we impose ). As a consequence, notice that despite the fact that the scaling factor appearing in depends on , the minimizers of the Cheeger cut are unaffected by a rescaling of the energy by a positive constant. In other words, in principle we do not need to know the dimensionality of the data beforehand in order to obtain good clusters using the graph total variation associated to a k-NN graph. Notice that the same is not true for ε-graphs. In particular, it is plausible to extend the consistency of Cheeger cuts to the setting in which the dimensionality of the data changes in space, i.e., if ’s support is something like a union of manifolds with different intrinsic dimensions. This problem will be explored in the future.
2.3. Other examples of relevant variational problems on graphs
In this section we present a variety of examples of other optimization problems on graphs that are relevant for data analysis. The common structure in all these optimization problems is that in their objective functions the highest order term is either a graph total variation or a version of it; for this reason Theorems 2.6 and 2.7 (in conjunction with Proposition 2.18) are of relevance beyond the setting of Theorem 2.4.
For the rest of this section represents a similarity graph on a data set (not necessarily a k-NN graph).
(Ratio graph cuts for clustering). A functional closely related to the Cheeger cut in (1.1) is the ratio cut defined by
Both the above functional and the Cheeger cut can be seen as examples of a larger class of functionals known as balanced cuts. Functionals of this type penalize partitions of that either have a big interface separating and or are highly unbalanced in the sense that and are of dissimilar size; intuitively, these are desirable features for a good partitioning of the data and motivate considering the minimization problem
to obtain a good partition of . Notice that the numerator in both functionals is, up to a multiplicative factor, the graph total variation defined by
Assuming that the set of vertices consists of samples from the distribution and that the weights of the graph are those obtained from an ε-graph (with ε chosen appropriately), the results in  state that minimizers of the Cheeger and ratio cuts converge, as , towards solutions to analogous variational problems in the continuum. This result can then be interpreted as a consistency result for clustering using balanced cuts.
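To make the balanced cut minimization concrete, here is a minimal brute-force sketch on a tiny weighted graph. It assumes the standard forms of the two objectives (cut weight divided by the size of the smaller part for the Cheeger cut, and cut weight divided by the product of the two sizes for the ratio cut); the function names and the toy graph (two triangles joined by a single edge) are illustrative assumptions. Exhaustive search is only feasible for very small graphs and is not meant as a practical algorithm.

```python
from itertools import combinations


def cut_value(weights, subset):
    """Total weight of edges with exactly one endpoint in `subset`."""
    inside = set(subset)
    return sum(w for (i, j), w in weights.items() if (i in inside) != (j in inside))


def cheeger_cut(weights, subset, n):
    """Assumed Cheeger objective: cut weight over the smaller part's size."""
    size = len(subset)
    return cut_value(weights, subset) / min(size, n - size)


def ratio_cut(weights, subset, n):
    """Assumed ratio cut objective: cut weight over the product of sizes."""
    size = len(subset)
    return cut_value(weights, subset) / (size * (n - size))


def best_partition(weights, n, objective):
    """Exhaustive search over subsets of size at most n // 2."""
    candidates = (
        subset
        for size in range(1, n // 2 + 1)
        for subset in combinations(range(n), size)
    )
    return min(candidates, key=lambda subset: objective(weights, subset, n))


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
weights = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
           (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
           (2, 3): 1.0}
n = 6
cheeger_minimizer = best_partition(weights, n, cheeger_cut)
ratio_minimizer = best_partition(weights, n, ratio_cut)
```

Both objectives select one of the two triangles: cutting only the bridge gives a small interface and a perfectly balanced partition.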
(Graph total variation in the context of classification). Suppose that
are samples from some distribution supported on . The pair is interpreted as the feature values ( variable) and label ( variable) of an individual from a given population. Based on the observed data (training data) the idea is to construct a “good” classifier assigning labels to every potential individual in the population. A typical choice of risk functional used to define “good” classifiers is the average misclassification error. Unfortunately, when the distribution is unknown (as is usually the case) it is not possible to determine the classifier that minimizes the average misclassification error (Bayes classifier) and hence an approximation to it based exclusively on the observed data is the best thing one can hope for. Such an approximation can be obtained using the graph total variation as we describe below.
Given the weighted graph , consider the energy
The functional is the empirical risk and the parameter is introduced so as to emphasize or deemphasize the regularizing effect of the graph total variation. In , the weights for the graph are assumed to be those coming from an ε-graph and the problem of obtaining an approximation to the Bayes classifier is divided into first solving the minimization problem
and then extending the minimizer to the whole ambient space appropriately. With the right scaling for , one can establish the asymptotic consistency of the constructed approximation. See  for more details.
(Spectral clustering and spectral embeddings). Undoubtedly, one of the most popular graph based methods for data clustering is spectral clustering. In the two way clustering setting, spectral clustering can be seen as a relaxation of the ratio cut minimization problem mentioned in Example 2.11. Indeed, the relaxed problem takes the form
where is the average ; the numerator of the above objective function is an L^2-version of the graph total variation. It is well known that the above optimization problem is actually an eigenvalue problem for the graph Laplacian and that the first non-trivial eigenvector of the graph Laplacian is a minimizer. In the context of multi-way clustering, higher eigenvectors of the graph Laplacian are used to define an embedding of the data cloud into a Euclidean space with low dimension. In turn, the embedded data can be clustered using an algorithm like k-means, inducing in this way a partition for the original data. Among the many results in the literature addressing the consistency of spectral clustering (see the Introduction for some references), we highlight the work in  which exploits the variational characterization of eigenvectors/eigenvalues. In that work the graph on the cloud is assumed to be an ε-graph.
(p-Laplacian regularization for semi-supervised learning) Let
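The eigenvalue characterization can be illustrated with a short sketch. It assumes the relaxed objective is the usual Rayleigh quotient of the unnormalized graph Laplacian L = D - W over mean-zero vectors; the function names, the toy graph, and the use of shifted power iteration (chosen only to keep the example dependency-free) are assumptions of this example rather than anything prescribed by the text.

```python
import math


def fiedler_vector(weights, n, iterations=500):
    """Approximate the first non-trivial eigenvector of L = D - W by power
    iteration on shift * I - L, projecting out constant vectors each step."""
    degree = [0.0] * n
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w
    shift = 2.0 * max(degree)  # upper bound on the largest eigenvalue of L
    u = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        lu = [degree[i] * u[i] for i in range(n)]
        for (i, j), w in weights.items():
            lu[i] -= w * u[j]
            lu[j] -= w * u[i]
        u = [shift * x - y for x, y in zip(u, lu)]
        mean = sum(u) / n
        u = [x - mean for x in u]  # remove the trivial (constant) eigenvector
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u


def rayleigh_quotient(weights, u):
    """Sum of w_ij (u_i - u_j)^2 over edges, divided by sum of (u_i - mean)^2."""
    mean = sum(u) / len(u)
    numerator = sum(w * (u[i] - u[j]) ** 2 for (i, j), w in weights.items())
    denominator = sum((x - mean) ** 2 for x in u)
    return numerator / denominator


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
weights = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
           (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
           (2, 3): 1.0}
u = fiedler_vector(weights, 6)
eigenvalue = rayleigh_quotient(weights, u)
positive_cluster = {i for i, value in enumerate(u) if value > 0}
```

Thresholding the computed vector at zero recovers the two triangles, which is also the partition that minimizes the ratio cut on this graph. For graphs of realistic size one would use a sparse eigensolver instead of power iteration.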
be labeled data points and let
be a set of unlabeled data points. We think of as being much larger than .
In  the authors consider the optimization problem
for the purposes of semi-supervised learning; here p is a user chosen parameter, whose role is to impose regularity on candidate functions (the higher p, the more regular the functions will be). In two independent contributions [8, 30] the authors study the large data limit of this variational problem and by studying the discrete regularity induced by the term in (2.8), they are able to rigorously show that there is a phase transition in the value of p at which solutions to (2.8) “stop forgetting” the labeled data points; the transition occurs at (where is the intrinsic dimension of the data set), a result that is reminiscent of Sobolev’s embedding theorem. The setting in which these results are shown is that of ε-graphs.
(Bayesian formulation of semi-supervised learning) In the context of Example 2.14, a related optimization problem is
where is the graph Laplacian associated to the graph and where is a positive number whose role is to enforce more or less regularity on a candidate function (a higher value results in more regularity). The minimizer of this functional can be seen as the MAP of the posterior distribution
where is a prior distribution (in this case Gaussian with covariance matrix ) and is a negative log-likelihood model (in this case additive Gaussian noise).
This Bayesian point of view to graph based semi-supervised learning was introduced in . In  the authors study the passage to the large limit ( fixed) and study the consistency of posterior measures. Moreover, from said consistency result and from the properties of the limiting posterior distribution, the authors in  provide some theory supporting the MCMC algorithm to sample from the posterior that was proposed in  . This algorithm was introduced so as to alleviate the curse of dimensionality when sampling from (here the curse of dimensionality arises from the large number ). The setting in which all these results are shown is that of ε-graphs.
The purpose of this subsection is to present some definitions and preliminary results that we use in the proof of our main theorems. In particular we present the definitions of TL^1-space and Γ-convergence.
is defined as the set of all pairs where is a Borel probability measure on and is a function in (written from now on). This set can be endowed with the -metric defined by
where stands for the set of couplings between and , that is, the set of all probability measures on the product whose first and second marginals are and respectively.
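As a sanity check on this kind of metric, the following sketch evaluates a distance of this type between two functions defined on small point clouds. It assumes the usual form of the TL^1 cost, in which a coupling pays the spatial displacement |x - y| plus the mismatch |f(x) - g(y)|, and it assumes both measures are uniform empirical measures with the same number of atoms. Under that assumption the infimum over couplings is attained at a permutation (the extreme points of the set of such couplings are permutation matrices), so brute force over permutations suffices for tiny examples. All names and data are hypothetical.

```python
import math
from itertools import permutations


def tl1_distance(xs, f_values, ys, g_values):
    """Brute-force TL^1-type distance between (mu, f) and (nu, g), where mu
    and nu are uniform empirical measures on xs and ys (same number of atoms)."""
    n = len(xs)
    assert len(ys) == n

    def cost(i, j):
        # spatial displacement plus mismatch of function values
        return math.dist(xs[i], ys[j]) + abs(f_values[i] - g_values[j])

    best = min(
        sum(cost(i, sigma[i]) for i in range(n))
        for sigma in permutations(range(n))
    )
    return best / n


xs = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
f_values = [0.0, 0.0, 1.0]

# Same labeled cloud listed in a different order: the distance is zero.
ys_relabelled = [(2.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
g_relabelled = [1.0, 0.0, 0.0]
d_same = tl1_distance(xs, f_values, ys_relabelled, g_relabelled)

# Same points, but the function differs at one of the three atoms.
g_changed = [0.0, 1.0, 1.0]
d_changed = tl1_distance(xs, f_values, xs, g_changed)
```

The first distance vanishes because the metric compares the pairs (measure, function) and not the order in which points are listed; the second equals 1/3, the mass of the single atom where the two functions disagree times the size of the disagreement.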
Given that with probability one the empirical measure converges weakly to as , and given that has a density, we may use the characterization of -convergence from Proposition 3.12 in  to conclude that converges to in if and only if there exists a sequence of transportation maps with ( pushes forward into ), satisfying
In turn, this holds if and only if for all sequences of transportation maps with that satisfy (2.9) one has (2.10). Because of this characterization, we may abuse notation slightly and simply write without specifying the corresponding attached measures whenever it is clear from context that they can be omitted.
A special choice of transportation maps between and that we use in the remainder are provided by . We may consider transportation maps between the measures and satisfying the condition
for some constant (provided ). Indeed, the results in  show that with probability one, there exist transportation maps for which and for which (2.11) holds for all large enough . Since all of our results are asymptotic in nature, we may as well assume for the remainder that with probability one (2.11) holds for all . One last relevant property of the map , which follows directly from the fact that it transports into , is the change of variables formula
which allows us to write integrals with respect to in terms of integrals with respect to .
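A one-dimensional toy case makes the change of variables concrete. The sketch below is an illustration under simplifying assumptions and not the construction of maps used in the paper: the ground-truth measure is taken to be uniform on (0, 1), and the transportation map sends each interval ((i-1)/n, i/n] to the i-th smallest sample, which pushes the uniform measure forward to the empirical measure. The names `transport_map` and `f`, the seed, and the sample size are hypothetical.

```python
import math
import random


def transport_map(sorted_sample):
    """Monotone map from (0, 1) to the sample: the interval ((i-1)/n, i/n]
    is sent to the i-th smallest sample point."""
    n = len(sorted_sample)

    def T(x):
        index = min(n - 1, max(0, math.ceil(n * x) - 1))
        return sorted_sample[index]

    return T


random.seed(0)
n = 50
sample = sorted(random.random() for _ in range(n))
T = transport_map(sample)


def f(x):
    """Arbitrary test function."""
    return math.sin(3.0 * x) + x * x


# Integral of f(T(x)) against the uniform measure, by the midpoint rule on a
# grid that refines the intervals ((i-1)/n, i/n] (T is constant on each cell).
cells = 20 * n
integral_via_map = sum(f(T((c + 0.5) / cells)) for c in range(cells)) / cells

# Integral of f against the empirical measure.
empirical_average = sum(f(x) for x in sample) / n
```

Since the map is piecewise constant and each sample point receives mass exactly 1/n, the two quantities agree up to floating-point rounding.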
We now present the definition of Γ-convergence in the context of a general metric space. This is a notion of convergence for functionals which, together with a coercivity assumption, guarantees the stability of minimizers in the limit. A standard reference for Γ-convergence is .
Let be a metric space and let be a sequence of functionals. The sequence Γ-converges with respect to the metric to the functional as if the following properties hold:
Liminf inequality: For every and every sequence converging to ,
Limsup inequality: For every there exists a sequence converging to satisfying
Compactness: Every bounded sequence satisfying
We say that is the Γ-limit of the sequence of functionals (with respect to the metric ).
A sequence like the one appearing in the limsup inequality is said to be a recovery sequence for . It is straightforward to show that if one can find recovery sequences for all elements in a set satisfying:
For all there exists a sequence such that and ,
then one can find recovery sequences for all elements in ; a set satisfying (1) above is said to be dense in with respect to . This fact follows from a simple diagonal argument (see ).
The most relevant property of Γ-convergence (in particular for our purposes) is presented in the following proposition which can be found in .
Let be a metric space and let be a sequence of functionals that are not identically equal to . If Γ-converges towards then,
Any sequence where is a minimizer of is precompact, and each of its accumulation points is a minimizer of . In particular, if has a unique minimizer, then converges towards it.
We have presented the notion of convergence in the above generality because some of the Γ-limits we will consider in this paper are taken in the context of the metric space whereas others are taken in the context of the metric space . Also, notice that when the functionals are allowed to be random (as is the case for the graph total variation), Γ-convergence has to be interpreted as in Definition 2.11 in .
To conclude this section we list two important properties of the weighted total variation functional . Let be a continuous function which is bounded above and below by positive constants. The first property is a representation formula for in terms of the distributional derivative of . Indeed, it follows from the work in  that for every one can write
where in the above stands for the distributional derivative of (which in general is a signed measure) and stands for the total variation measure associated to .
The second property that is relevant for our purposes is the coarea formula which states that for every ,
This formula says that the total variation of can be written in terms of the total variation of its level sets. See Theorem 13.25 in .
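The same layer-cake identity holds exactly at the graph level, which gives a quick way to check it numerically. The sketch below is a discrete analogue and not the continuum statement: it assumes an unscaled graph total variation of the form sum of w_ij |u_i - u_j| over edges, and the names, weights, and function values are hypothetical. Because the indicator of {u > t} only changes when t crosses a value taken by u, the integral over t reduces to a finite sum.

```python
def graph_tv(weights, u):
    """Unscaled graph total variation: sum over edges of w_ij * |u_i - u_j|."""
    return sum(w * abs(u[i] - u[j]) for (i, j), w in weights.items())


def coarea_sum(weights, u):
    """Integral over t of the graph total variation of the indicator of
    {u > t}, computed exactly as a sum over consecutive levels of u."""
    levels = sorted(set(u))
    total = 0.0
    for lower, upper in zip(levels, levels[1:]):
        # for every t in [lower, upper) the set {u > t} equals {u > lower}
        indicator = [1.0 if value > lower else 0.0 for value in u]
        total += (upper - lower) * graph_tv(weights, indicator)
    return total


# Hypothetical weighted graph and function on its six vertices.
weights = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 2.0, (2, 3): 0.25,
           (3, 4): 1.0, (3, 5): 1.5, (4, 5): 1.0}
u = [0.0, 0.5, 2.0, 1.25, 3.0, 0.5]

tv_direct = graph_tv(weights, u)
tv_layers = coarea_sum(weights, u)
```

The identity follows edge by edge from the fact that |a - b| is the length of the set of thresholds t separating a from b.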
3. Proofs of Theorems
The main auxiliary result that we need to establish Theorem 2.6 is the following.
For every let be functions satisfying
For consider the energy defined by
Then, as , Γ-converges in the sense towards , where is given by
Proposition 3.1 is a consequence of the next lemma.
Let be two Lipschitz continuous functions which are bounded above and below by positive constants in . Let be the Borel measure on given by . Let be functions satisfying
Let be defined by
Then, as , Γ-converges in the sense towards
Let us start by proving the liminf inequality. That is, let us prove that for every and every sequence satisfying , we have
The following simplifications are standard. First, working along subsequences we may assume without loss of generality that the liminf is actually a limit. In addition, we may assume that both of the terms involved in inequality (3.1) are finite. We now split the proof of (3.1) into several steps.
Step 1: Instead of working with the energy directly, we first consider a simpler related energy defined by
Notice that for every we have
From the previous inequality, the Lipschitz continuity of and the assumptions on and , we deduce that for every and ,
where is a constant that does not depend on or . It then follows that
In particular, to obtain (3.1) we may as well show that
Step 2: Let be a closed ball contained in . We claim that
where is defined as
where in the above stands for the interior of the closed ball (the ball without its boundary).
The claim follows from Theorem 8 in . The only difference between the setting we consider here and the setting considered in  is that in our case the length-scale depends on location . The fact that converges to zero uniformly over as is enough to make the proof in  carry through with essentially no modifications. We remark that the arguments in  were also used in ; in  the presence of a non-constant density makes computations a bit more tedious.
Step 3: Suppose that is of the form for some measurable set . That is, suppose that is the indicator function of a set with finite perimeter (with respect to ). In the Appendix we show the following fact: there exists a sequence of collections of closed balls contained in satisfying:
For every , the family is finite.
For every , the balls in are pairwise disjoint.
For every , and for every , .
With the families at hand, we may now consider the functions
Then, for every