Variational limits of $k$-NN graph-based functionals on data clouds
Abstract.
This paper studies the large sample asymptotics of data analysis procedures based on the optimization of functionals defined on $k$-NN graphs on point clouds. The paper is framed in the context of minimization of balanced cut functionals, but our techniques, ideas and results can be adapted to other functionals of relevance. We rigorously show that provided the number of neighbors in the graph $k := k_n$ scales with the number of points in the cloud as $n \gg k_n \gg \log(n)$, then with probability one, the solution to the graph cut optimization problem converges towards the solution of an analogous variational problem at the continuum level.
Key words and phrases:
$k$-NN graph, discrete to continuum limit, Gamma-convergence, graph cut, Cheeger cut, spectral clustering.
1. Introduction
This paper studies the large sample asymptotics of data analysis procedures based on the optimization of functionals defined on graphs; the procedures of interest include graph-based methods for clustering, classification, and semi-supervised learning. The set of vertices of the graph is a random data set while the edges capture the level of similarity among points. In this work our focus is on $k$-NN graphs, where one puts an edge between a pair of points whenever one of them is among the $k$ nearest neighbors of the other; we will assume from here on that the data set is a subset of Euclidean space. Our main results rigorously show that when $k$ scales like $n \gg k \gg \log(n)$, where $n$ is the number of data points, the solutions of the optimization problems of interest at the graph level are consistent and converge towards the solutions of analogous variational problems at the continuum level. Spectral clustering, total variation clustering, diffusion maps, and Laplacian regularization are all examples of graph-based data analysis procedures with a variational flavor (see [7, 9, 21, 22, 26, 27, 32, 1]), and the results and ideas in this paper can be applied to analyze their large sample limit in this $k$-NN setting. For expository purposes we focus on the concrete example of minimizing balanced graph cuts as described below.
To get a flavor of our main results, consider a set of random points uniformly distributed on the region as depicted in Figure 2. A $k$-NN graph for a certain choice of $k$ is then obtained as shown in Figure 2. We introduce a functional on partitions of the data set by
(1.1) 
The numerator favors low interaction between the two sets in a partition, while the denominator forces it to be balanced in terms of size. It is thus natural to consider the optimization problem
as a sensible approach for data clustering; (1.1) is known as the Cheeger cut functional. In Figure 4 we illustrate the minimizer of (1.1) for the graph in Figure 2. We observe the close resemblance between the discrete minimizer in Figure 4 and the partition of the region in Figure 4 which can be described as a solution to a variational problem at the continuum level of the form
where the cut functional is defined (at least for with smooth boundary) as
The main theorem of this paper is a rigorous mathematical statement of the previous observation. We show that with probability one, and in a very precise sense, the solution to (1.1) converges as $n \to \infty$ towards the solution to a continuum variational problem (see Theorem 2.4 below), provided that $k$ (the number of neighbors in the definition of the graph) scales like $n \gg k \gg \log(n)$.
Phrased in the language of data clustering, our results establish the statistical consistency of a series of clustering procedures based on minimization of graph cuts when these come from $k$-NN graphs. Moreover, our results show that (at least in terms of scaling with $n$), the condition on $k$ needed to establish the consistency is dimension free.
To the best of our knowledge this work is the first one to rigorously address the stability of variational problems on $k$-NN graphs such as the one defined in (1.1) in the large sample limit. Most of the theoretical works found in the literature addressing similar questions assume an $\varepsilon$-graph construction on the data set (i.e. there is an edge between two points if they are within distance $\varepsilon$ of each other); an exception to this is the work [31] where pointwise convergence of graph Laplacians on $k$-NN graphs (among other constructions) is analyzed. We find the absence of theoretical results in the $k$-NN setting to be a strong motivation for our work given the frequent use of $k$-NN graphs by practitioners due to their nicer regularity properties and their robustness to data dimensionality; see [32] for a more complete discussion on this matter. Notice that $k$-NN graphs can be constructed completely from ordinal information about the data points and in particular exact values of interpoint distances are not needed (this is a property that adds to the robustness of the $k$-NN construction). Figure 5 illustrates the difference between $\varepsilon$-graphs and $k$-NN graphs.
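The two graph constructions contrasted above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names `knn_adjacency` and `eps_adjacency`, the sample size, and the values of `k` and `eps` are hypothetical choices, not taken from the paper. The last check illustrates the remark on ordinal information: applying a monotone transformation to the distances leaves the $k$-NN graph unchanged.

```python
import numpy as np

def knn_adjacency(dist, k):
    """Symmetric k-NN graph: i ~ j if j is among the k nearest
    neighbors of i, or i is among the k nearest neighbors of j."""
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbor
    nearest = np.argsort(d, axis=1)[:, :k]    # indices of the k closest points
    n = d.shape[0]
    directed = np.zeros((n, n), dtype=bool)
    directed[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return directed | directed.T              # "or vice versa"

def eps_adjacency(dist, eps):
    """epsilon-graph: i ~ j if the two points are within distance eps."""
    adj = dist <= eps
    np.fill_diagonal(adj, False)
    return adj

# Hypothetical data: 200 uniform points in the unit square.
rng = np.random.default_rng(0)
points = rng.uniform(size=(200, 2))
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

knn = knn_adjacency(dist, k=8)
eps_adj = eps_adjacency(dist, eps=0.1)
print("k-NN degrees: min", knn.sum(axis=1).min(), "max", knn.sum(axis=1).max())
print("eps  degrees: min", eps_adj.sum(axis=1).min(), "max", eps_adj.sum(axis=1).max())

# Only the ordering of distances matters for the k-NN graph.
same_graph = (knn_adjacency(dist ** 3, k=8) == knn).all()
print("invariant under monotone transformation of distances:", same_graph)
```

Every vertex of the symmetric $k$-NN graph has degree at least $k$, whereas degrees in the $\varepsilon$-graph fluctuate with the local density of points.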
In this paper we follow the same line of thought described in [17], and in particular, reduce our analysis to obtaining the $\Gamma$-limit (see definitions in Section 2.4) of a rescaled version of the energy
defined for functions ; this is the functional that appears in the numerator in (1.1) (restricting to indicator functions). We refer to it as the graph total variation; see (2.4) below for its precise definition. The $\Gamma$-limit of the graph total variation is shown to be a weighted (local) total variation functional at the continuum level. The notion of $\Gamma$-convergence provides precise sufficient conditions implying the stability of minimizers of variational problems; we review its definition in Section 2.4 and we refer the interested reader to [10] for a more complete discussion on the topic.
Most of the technical work in this paper is devoted to analyzing the “bias” of this random functional. We establish the $\Gamma$-convergence of a kernel-based functional with inhomogeneous bandwidth (a sort of average of the random functional) towards a (local) weighted total variation (see Proposition 3.1 and Lemma 3.2 below). The inhomogeneity associated to the kernel-based functional is intrinsic to the $k$-NN graph construction, where length scales are determined by the Euclidean distance and the data density around each point. The fact that the resulting kernel-based approximation has a varying bandwidth makes our analysis different from that in [14]. At each point in space, the bandwidth depends inversely on the density of the ground-truth measure generating the data.
It is possible to reinterpret this feature and think of it as fixing a bandwidth but now measuring distances with an effective metric induced by the ground truth on the ambient space; this metric in particular shrinks distances on regions with low density. In this work we will not pursue this idea any further, but we anticipate that there are several advantages of doing so. In particular, it is of relevance to further investigate the dimension-free condition on $k$, and to understand better the constants appearing in these asymptotic inequalities. This is of special relevance if we want to extend our results to settings where the ground-truth distribution is, for example, a Gaussian measure in an infinite dimensional Hilbert space. Both the unboundedness of the support of the ground truth and the infinite dimension of the ambient space are settings that are not covered by the results in this paper. We believe that a better understanding of this and other related issues is of importance for a better understanding of $k$-NN graphs and their benefits.
We would like to finish this introduction by mentioning some of the growing literature on large sample asymptotic analysis of operators and functionals constructed from graphs. Convergence of graph Laplacians can be found in the work by Belkin and Niyogi [5], Coifman and Lafon [9], Giné and Koltchinskii [18], Hein, Audibert and von Luxburg [20], and Singer [28]. These works deal mostly with pointwise consistency of graph Laplacians. The work of Arias-Castro, Pelletier and Pudlo [2] studies the pointwise convergence of Cheeger energy and consequently of total variation, as well as variational convergence when the discrete functional is considered over an admissible set of characteristic functions which satisfy a “regularity” requirement. Spectral convergence of graph Laplacians (relevant for data clustering) has been studied, among others, by Ting, Huang, and Jordan [31], Belkin and Niyogi [4], von Luxburg, Belkin and Bousquet [33], and Singer and Wu [29]. Most of the results previously mentioned are deduced using tools from perturbation theory of linear operators. A different set of tools was introduced in [14] and later used in [12, 16, 17]. The notion of $\Gamma$-convergence (a.k.a. epi-convergence) and the introduction of a suitable metric (the $TL^1$ metric; see [14] and Section 2.4 below) allowing one to compare functions at the graph level with functions at the continuum level, were crucial tools to deduce statistical consistency for a large class of balanced graph cuts [17] and for spectral clustering [16]. In all these results, sharp convergence rates for guaranteeing the consistency were provided. In the context of graph-based approaches to semi-supervised learning, it is worth mentioning the work [30] where the $p$-Laplacian regularization (for $p$ large enough) is studied, as well as the Bayesian approach in [13, 11] where the convergence of graph posteriors is analyzed.
1.1. Outline
The rest of the paper is organized as follows. In Section 2 we make our assumptions precise, present some of the examples of variational problems that are relevant for important tasks in data analysis, and present the main results of the paper. In Section 2.4 we give the definitions of the $TL^1$ space and the notion of $\Gamma$-convergence; we also present some auxiliary results that are needed in the remainder. Finally, in Section 3 we present the proofs of the main results.
2. Setup and main results
We assume that the data $x_1, \dots, x_n$ are i.i.d. samples from a distribution $\nu$ on $D$. $\nu$ is assumed to be an absolutely continuous measure with respect to the Lebesgue measure, with density $\rho$, where $D \subset \mathbb{R}^d$ is a bounded, connected, open set with Lipschitz boundary, and $\rho$ is a continuous function bounded above and below by positive constants. Namely, we assume that there are positive constants $m \leq M$ such that for every $x \in D$ we have
(2.1) $m \leq \rho(x) \leq M.$
We denote by the empirical measure
and write
whenever $x_i$ is among the $k$ nearest neighbors of $x_j$ or vice versa.
Remark 2.1.
Because has a density, with probability one, the condition is equivalent to
where . Without loss of generality we assume this fact in the remainder.
Let us introduce some functions that we use in the remainder. Let be the Heaviside function given by
For an arbitrary we let be the function
We let be the quantity
(2.2) 
where is the first coordinate of the vector . Notice that in the above expression can be replaced with for any vector with unit norm.
2.1. Graph total variation and total variation in the continuum
In order to introduce the graph total variation functional (an appropriate rescaled version of the numerator in (1.1)), it will be convenient to define
and for , let be the number for which
(2.3) 
The graph total variation functional is then defined as
(2.4) 
Notice that when $u$ is the indicator function of some set of data points, the graph total variation of $u$ is just a rescaled version of the numerator of the functional in (1.1). At the discrete level we are interested in the following optimization problem
(2.5) 
The continuum counterpart of the graph total variation is a functional that takes the following form.
Definition 2.2.
Let be a continuous function bounded below and above by positive constants. For an arbitrary function , we define its weighted (by ) total variation as
(2.6) 
In the above and in the remainder of the paper, we use to represent the space of functions with respect to the Lebesgue measure restricted to . In addition, when we write instead of and we denote by the space of functions for which . Notice that for any continuous and bounded above and below by positive constants, the condition is equivalent to the condition .
Remark 2.3.
If is smooth, then
Also, if for some open set with smooth boundary, then
where $\mathcal{H}^{d-1}$ is the $(d-1)$-dimensional Hausdorff measure.
At the continuum level we will be interested in a variational problem of the form
(2.7) 
for an appropriately chosen function .
2.2. Main results
Our main result establishes the convergence of minimizers of (2.5) towards minimizers of (2.7). Notice however that solutions to (2.5) are discrete sets, whereas solutions to (2.7) are continuum sets, and so we need to clarify the sense in which we will establish the convergence of discrete minimizers towards continuum ones. The convergence is taken in the $TL^1$ space introduced in [14], where in particular functions on the point cloud and functions on the domain are seen as elements of the same space (see Section 2.4 below for details).
We are ready to establish our main result.
Theorem 2.4.
Let and let be an open, connected, bounded domain with Lipschitz boundary. Let be a continuous density function satisfying (2.1) and let be the probability measure . Let be a sequence of natural numbers satisfying
Let be i.i.d. points from and let be a solution to (2.5).
Then, with probability one, along every subsequence of the discrete minimizers there is a further subsequence converging in the $TL^1$ sense towards a minimizer of (2.7), where
Remark 2.5.
In the above theorem if the minimizer of (2.7) is unique, then the convergence is along the entire sequence of discrete minimizers.
As discussed in the introduction, and especially as shown in [17], our results are a direct corollary of Theorems 2.6 and 2.7 below. We show that the $\Gamma$-limit (in the $TL^1$ sense) of the graph total variation functional (for the same scaling as in Theorem 2.4) is a weighted total variation functional. The notion of $\Gamma$-convergence and its connection to the stability of minimizers of functionals are reviewed in Section 2.4.
Theorem 2.6.
($\Gamma$-convergence) Let and let be an open, connected, bounded domain with Lipschitz boundary. Let be a continuous density function satisfying (2.1). Let be a sequence of natural numbers satisfying
Then, the functionals $\Gamma$-converge towards in the $TL^1$ sense, where is defined in (2.2), and is the volume of the unit ball in . Furthermore, when restricted to indicator functions, the energies $\Gamma$-converge to the functional restricted to indicator functions.
Theorem 2.7.
(Compactness) Under the same assumptions as in Theorem 2.6, the following statement holds with probability one: If is a sequence with for which
and
then is precompact in $TL^1$. That is, every subsequence of has a further subsequence converging in $TL^1$.
Remark 2.8.
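The above theorems hold for domains in $\mathbb{R}^2$ provided we replace the condition $k_n \gg \log(n)$ with the condition $k_n \gg \log(n)^{3/2}$. This can be seen directly from our proofs, and follows from the fact that the rate of convergence of the $\infty$-transportation distance between the ground-truth measure and the empirical measure (for $d = 2$) scales like $\frac{\log(n)^{3/4}}{n^{1/2}}$ and not like $\left( \frac{\log(n)}{n} \right)^{1/2}$ (see [15]).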
Remark 2.9.
The above setup and results can be extended to the setting in which the support of the ground-truth measure is not a domain contained in the ambient space but a compact manifold $\mathcal{M}$ with intrinsic dimension $m$. When that is the case, the appropriate scalings for the functionals are obtained using the intrinsic dimension $m$ and not the dimension of the ambient space. Although in this paper we omit the details of such an extension, we point out that the results in [25] (later used in the proofs in [14]) can be adapted to the manifold setting by establishing results analogous to those in [15] in the manifold setting, using the geodesic flow in $\mathcal{M}$, and the fact that the Euclidean distance (distance in the ambient space) is a third order approximation of the geodesic distance $d_{\mathcal{M}}$ (intrinsic distance), i.e., $|x - y| = d_{\mathcal{M}}(x, y) + O(d_{\mathcal{M}}(x, y)^3)$.
The details can be presented elsewhere in a more general setting which is also of interest; see Remark 2.10 below.
Remark 2.10.
Suppose that $d \geq 3$. One of the important consequences of Theorem 2.6 is that the admissible regimes for $k_n$ that guarantee the recovery of a non-trivial variational limit for the energies do not depend on $d$ (we impose $n \gg k_n \gg \log(n)$). As a consequence, notice that despite the fact that the scaling factor appearing in the graph total variation depends on $d$, the minimizers of the Cheeger cut are unaffected by a rescaling of the energy by a positive constant. In other words, in principle we do not need to know the dimensionality of the data beforehand in order to obtain good clusters using the graph total variation associated to a $k$-NN graph. Notice that the same is not true for $\varepsilon$-graphs. In particular, it is plausible to extend the consistency of Cheeger cuts to the setting in which the dimensionality of the data changes in space, i.e., if the support of the ground-truth measure is something like a union of manifolds with different intrinsic dimensions. This problem will be explored in the future.
2.3. Other examples of relevant variational problems on graphs
In this section we present a variety of examples of other optimization problems on graphs that are relevant for data analysis. The common structure in all these optimization problems is that in their objective functions the highest order term is either a graph total variation or a version of it; for this reason Theorems 2.6 and 2.7 (in conjunction with Proposition 2.18) are of relevance beyond the setting of Theorem 2.4.
For the rest of this section represents a similarity graph on a data set (not necessarily a $k$-NN graph).
Example 2.11.
(Ratio graph cuts for clustering). A functional closely related to the Cheeger cut in (1.1) is the ratio cut defined by
Both the above functional and the Cheeger cut can be seen as examples of a larger class of functionals known as balanced cuts. Functionals of this type penalize partitions of the data set that either have a big interface separating $A$ and $A^c$ or are highly unbalanced in the sense that $A$ and $A^c$ are of dissimilar size; intuitively, these are desirable features for a good partitioning of the data and motivate considering the minimization problem
to obtain a good partition of the data set. Notice that the numerator in both functionals is, up to a multiplicative factor, the graph total variation defined by
Assuming that the set of vertices consists of samples from the distribution and that the weights of the graph are those obtained from an $\varepsilon$-graph (with $\varepsilon$ chosen appropriately), the results in [17] state that minimizers of the Cheeger and ratio cuts converge, as $n \to \infty$, towards solutions to analogous variational problems in the continuum. This result can then be interpreted as a consistency result for clustering using balanced cuts.
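To make the two balanced cuts of this example concrete, the following sketch evaluates them on a small symmetric $k$-NN graph and minimizes them by exhaustive search. It is a minimal illustration under stated assumptions, not the procedure analyzed in the paper: it assumes the standard normalizations $\mathrm{Cut}(A, A^c) / \min\{|A|, |A^c|\}$ for the Cheeger cut and $\mathrm{Cut}(A, A^c) / (|A| |A^c|)$ for the ratio cut, and the names `knn_weights`, `cheeger_cut`, `ratio_cut` and `brute_force_minimizer` are hypothetical. Exhaustive search is only feasible for a handful of points.

```python
import itertools
import numpy as np

def knn_weights(points, k):
    """0/1 weights of the symmetric k-NN graph on the given points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :k]
    n = len(points)
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(w, w.T)

def cut(w, in_a):
    """Total weight of edges between A and its complement."""
    return w[np.ix_(in_a, ~in_a)].sum()

def cheeger_cut(w, in_a):
    # Assumed normalization: Cut(A, A^c) / min(|A|, |A^c|).
    return cut(w, in_a) / min(in_a.sum(), (~in_a).sum())

def ratio_cut(w, in_a):
    # Assumed normalization: Cut(A, A^c) / (|A| |A^c|).
    return cut(w, in_a) / (in_a.sum() * (~in_a).sum())

def brute_force_minimizer(w, functional):
    """Exhaustive search over all partitions {A, A^c} with both sets nonempty."""
    n = w.shape[0]
    best_value, best_set = np.inf, None
    # Vertex 0 is kept in A^c so that each partition is visited exactly once.
    for size in range(1, n):
        for subset in itertools.combinations(range(1, n), size):
            in_a = np.zeros(n, dtype=bool)
            in_a[list(subset)] = True
            value = functional(w, in_a)
            if value < best_value:
                best_value, best_set = value, in_a
    return best_value, best_set

# Hypothetical data: two well-separated groups of six points each.
rng = np.random.default_rng(1)
left = rng.uniform(-1.0, 1.0, size=(6, 2))
right = rng.uniform(-1.0, 1.0, size=(6, 2)) + np.array([10.0, 0.0])
w = knn_weights(np.vstack([left, right]), k=3)

cheeger_value, cheeger_set = brute_force_minimizer(w, cheeger_cut)
ratio_value, ratio_set = brute_force_minimizer(w, ratio_cut)
print("Cheeger cut minimizer:", cheeger_set.astype(int), "value", cheeger_value)
print("ratio cut minimizer:  ", ratio_set.astype(int), "value", ratio_value)
```

In this toy configuration the two groups are not joined by any edge, so both functionals are minimized (with value zero) by the partition into the two groups.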
Example 2.12.
(Graph total variation in the context of classification). Suppose that
are samples from some distribution supported on . The pair is interpreted as the feature values ($x$ variable) and label ($y$ variable) of an individual from a given population. Based on the observed data (training data) the idea is to construct a “good” classifier assigning labels to every potential individual in the population. A typical choice of risk functional used to define “good” classifiers is the average misclassification error. Unfortunately, when the distribution is unknown (as is usually the case) it is not possible to determine the classifier that minimizes the average misclassification error (Bayes classifier) and hence an approximation to it based exclusively on the observed data is the best thing one can hope for. Such an approximation can be obtained using the graph total variation as we describe below.
Given the weighted graph , consider the energy
The functional is the empirical risk and the parameter is introduced so as to emphasize or deemphasize the regularizing effect of the graph total variation. In [12], the weights for the graph are assumed to be those coming from an $\varepsilon$-graph and the problem of obtaining an approximation to the Bayes classifier is divided into first solving the minimization problem
and then extending the minimizer to the whole ambient space appropriately. With the right scaling of the parameters, one can establish the asymptotic consistency of the constructed approximation. See [12] for more details.
Example 2.13.
(Spectral clustering and spectral embeddings). Undoubtedly, one of the most popular graph based methods for data clustering is spectral clustering. In the two way clustering setting, spectral clustering can be seen as a relaxation of the ratio cut minimization problem mentioned in Example 2.11. Indeed, the relaxed problem takes the form
where is the average ; the numerator of the above objective function is a version of the graph total variation. It is well known that the above optimization problem is actually an eigenvalue problem for the graph Laplacian and that the first non-trivial eigenvector of the graph Laplacian is a minimizer. In the context of multi-way clustering, higher eigenvectors of the graph Laplacian are used to define an embedding of the data cloud into a Euclidean space with low dimension. In turn, the embedded data can be clustered using an algorithm like $k$-means, inducing in this way a partition for the original data. Among the many results in the literature addressing the consistency of spectral clustering (see the Introduction for some references), we highlight the work in [16] which exploits the variational characterization of eigenvectors/eigenvalues. In that work the graph on the cloud is assumed to be an $\varepsilon$-graph.
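The link between the relaxed problem and the graph Laplacian can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the relaxed objective to be $\sum_{i,j} w_{ij} (u_i - u_j)^2 / \sum_i (u_i - \bar{u})^2$ with the unnormalized Laplacian $L = D - W$, and it uses a fully connected graph with Gaussian weights rather than a $k$-NN graph so that the graph is connected. The helper names and the data are hypothetical.

```python
import numpy as np

def graph_laplacian(w):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(w.sum(axis=1)) - w

def relaxed_objective(w, u):
    """Assumed form of the relaxed ratio cut objective (a Rayleigh quotient)."""
    numerator = (w * (u[:, None] - u[None, :]) ** 2).sum()
    denominator = ((u - u.mean()) ** 2).sum()
    return numerator / denominator

def fiedler_vector(w):
    """Second smallest eigenvalue of L and a corresponding eigenvector."""
    eigenvalues, eigenvectors = np.linalg.eigh(graph_laplacian(w))  # ascending
    return eigenvalues[1], eigenvectors[:, 1]

# Hypothetical data: two groups of 20 points, Gaussian similarity weights.
rng = np.random.default_rng(0)
left = rng.uniform(-1.0, 1.0, size=(20, 2))
right = rng.uniform(-1.0, 1.0, size=(20, 2)) + np.array([6.0, 0.0])
points = np.vstack([left, right])
diff = points[:, None, :] - points[None, :, :]
w = np.exp(-(diff ** 2).sum(axis=-1) / 2.0)
np.fill_diagonal(w, 0.0)

lam2, u = fiedler_vector(w)
labels = u > 0                         # two-way clustering by sign
print("objective at Fiedler vector:", relaxed_objective(w, u))
print("2 * second eigenvalue:      ", 2.0 * lam2)
print("labels:", labels.astype(int))

# Any other non-constant vector gives a larger (or equal) objective value.
other = rng.normal(size=len(points))
print("objective at a random vector:", relaxed_objective(w, other))
```

The factor 2 appears because the double sum counts every edge twice, so that $\sum_{i,j} w_{ij} (u_i - u_j)^2 = 2 u^\top L u$.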
Example 2.14.
(Laplacian regularization for semi-supervised learning) Let
be labeled data points and let
be a set of unlabeled data points. We think of as being much larger than .
In [1] the authors consider the optimization problem
(2.8) 
for the purposes of semi-supervised learning; here $p$ is a user-chosen parameter, whose role is to impose regularity on candidate functions (the higher $p$ is, the more regular the functions will be). In two independent contributions [8, 30] the authors study the large data limit of this variational problem and, by studying the discrete regularity induced by the term in (2.8), they are able to rigorously show that there is a phase transition in the value of $p$ at which solutions to (2.8) “stop forgetting” the labeled data points; the transition occurs at $p = d$ (where $d$ is the intrinsic dimension of the data set), a result that is reminiscent of Sobolev’s embedding theorem. The setting in which these results are shown is that of $\varepsilon$-graphs.
Example 2.15.
(Bayesian formulation of semi-supervised learning) In the context of Example 2.14, a related optimization problem is
where is the graph Laplacian associated to the graph and where is a positive number whose role is to enforce more or less regularity on a candidate function (a higher value results in more regularity). The minimizer of this functional can be seen as the MAP of the posterior distribution
where is a prior distribution (in this case Gaussian with covariance matrix ) and is a negative loglikelihood model (in this case additive Gaussian noise).
This Bayesian point of view on graph-based semi-supervised learning was introduced in [6]. In [13] the authors study the passage to the limit of a large number of unlabeled data points (with the number of labeled data points fixed) and study the consistency of posterior measures. Moreover, from said consistency result and from the properties of the limiting posterior distribution, the authors in [11] provide some theory supporting the MCMC algorithm to sample from the posterior that was proposed in [6]. This algorithm was introduced so as to alleviate the curse of dimensionality when sampling from the posterior (here the curse of dimensionality arises from the large number of data points). The setting in which all these results are shown is that of $\varepsilon$-graphs.
2.4. Preliminaries
The purpose of this subsection is to present some definitions and preliminary results that we use in the proof of our main theorems. In particular we present the definitions of the $TL^1$ space and of $\Gamma$-convergence.
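$TL^1(D)$ is defined as the set of all pairs $(\mu, f)$ where $\mu$ is a Borel probability measure on $D$ and $f$ is a function in $L^1(D, \mu)$ (written $L^1(\mu)$ from now on). This set can be endowed with the metric $d_{TL^1}$ defined by
$$d_{TL^1}((\mu, f), (\tilde{\mu}, \tilde{f})) := \inf_{\pi \in \Gamma(\mu, \tilde{\mu})} \iint_{D \times D} \left( |x - y| + |f(x) - \tilde{f}(y)| \right) d\pi(x, y),$$
where $\Gamma(\mu, \tilde{\mu})$ stands for the set of couplings between $\mu$ and $\tilde{\mu}$, that is, the set of all probability measures on the product $D \times D$ whose first and second marginals are $\mu$ and $\tilde{\mu}$ respectively.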
Given that with probability one the empirical measure converges weakly to as $n \to \infty$, and given that has a density, we may use the characterization of $TL^1$ convergence from Proposition 3.12 in [14] to conclude that converges to in $TL^1$ if and only if there exists a sequence of transportation maps with ( pushes forward into ), satisfying
(2.9) 
and
(2.10) 
In turn, this holds if and only if for all sequences of transportation maps with that satisfy (2.9) one has (2.10). Because of this characterization, we may abuse notation slightly and simply write without specifying the corresponding attached measures whenever it is clear from context that they can be omitted.
A special choice of transportation maps between and that we use in the remainder are provided by [15]. We may consider transportation maps between the measures and satisfying the condition
(2.11) 
for some constant (provided ). Indeed, the results in [15] show that with probability one, there exist transportation maps for which and for which (2.11) holds for all large enough $n$. Since all of our results are asymptotic in nature, we may as well assume for the remainder that with probability one (2.11) holds for all $n$. One last relevant property of the map , which follows directly from the fact that it transports into , is the change of variables formula
(2.12) 
which allows us to write integrals with respect to in terms of integrals with respect to .
We now present the definition of $\Gamma$-convergence in the context of a general metric space. This is a notion of convergence for functionals which, together with a coercivity assumption, guarantees the stability of minimizers in the limit. A standard reference for $\Gamma$-convergence is [10].
Definition 2.16.
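Let $(X, d_X)$ be a metric space and let $F_n : X \to [0, \infty]$ be a sequence of functionals. The sequence $\{F_n\}_{n \in \mathbb{N}}$ $\Gamma$-converges with respect to the metric $d_X$ to the functional $F : X \to [0, \infty]$ as $n \to \infty$ if the following properties hold:
(1) Liminf inequality: For every $x \in X$ and every sequence $\{x_n\}_{n \in \mathbb{N}}$ converging to $x$,
$$\liminf_{n \to \infty} F_n(x_n) \geq F(x).$$
(2) Limsup inequality: For every $x \in X$ there exists a sequence $\{x_n\}_{n \in \mathbb{N}}$ converging to $x$ satisfying
$$\limsup_{n \to \infty} F_n(x_n) \leq F(x).$$
(3) Compactness: Every bounded sequence $\{x_n\}_{n \in \mathbb{N}}$ satisfying
$$\sup_{n \in \mathbb{N}} F_n(x_n) < \infty$$
is precompact.
We say that $F$ is the $\Gamma$-limit of the sequence of functionals $\{F_n\}_{n \in \mathbb{N}}$ (with respect to the metric $d_X$).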
Remark 2.17.
A sequence like the one appearing in the limsup inequality is said to be a recovery sequence for . It is straightforward to show that if one can find recovery sequences for all elements in a set satisfying:

For all there exists a sequence such that and ,
then one can find recovery sequences for all elements in ; a set satisfying (1) above is said to be dense in with respect to . This fact follows from a simple diagonal argument (see [10]).
The most relevant property of $\Gamma$-convergence (in particular for our purposes) is presented in the following proposition which can be found in [10].
Proposition 2.18.
Let be a metric space and let be a sequence of functionals that are not identically equal to . If converges towards then,

Any sequence where is a minimizer of is precompact, and each of its accumulation points is a minimizer of . In particular, if has a unique minimizer, then converges towards it.

We have
We have presented the notion of $\Gamma$-convergence in the above generality because some of the $\Gamma$-limits we will consider in this paper are taken in the context of the metric space whereas others are taken in the context of the metric space . Also, notice that when the functionals are allowed to be random (as is the case for the graph total variation), $\Gamma$-convergence has to be interpreted as in Definition 2.11 in [14].
To conclude this section we list two important properties of the weighted total variation functional . Let be a continuous function which is bounded above and below by positive constants. The first property is a representation formula for in terms of the distributional derivative of . Indeed, it follows from the work in [3] that for every one can write
(2.13) 
where in the above stands for the distributional derivative of (which in general is a signed measure) and stands for the total variation measure associated to .
The second property that is relevant for our purposes is the coarea formula which states that for every ,
(2.14) 
This formula says that the total variation of can be written in terms of the total variation of its level sets. See Theorem 13.25 in [23].
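The same layer-cake structure holds at the graph level, which gives a quick numerical sanity check. The sketch below verifies a discrete counterpart of the coarea formula, $\sum_{i,j} w_{ij} |u_i - u_j| = \int_{\mathbb{R}} \sum_{i,j} w_{ij} |\mathbf{1}_{\{u_i > t\}} - \mathbf{1}_{\{u_j > t\}}| \, dt$, on a random weighted graph. It is an illustration of the graph analogue only, not of the continuum statement (2.14); the unnormalized form of the graph total variation and the names `graph_tv` and `coarea_integral` are assumptions made for this example.

```python
import numpy as np

def graph_tv(w, u):
    """Unnormalized graph total variation: sum_{i,j} w_ij |u_i - u_j|."""
    return (w * np.abs(u[:, None] - u[None, :])).sum()

def coarea_integral(w, u):
    """Integral over t of the graph total variation of the indicator of {u > t}.

    The integrand is constant for t between consecutive values of u,
    so the integral reduces to a finite sum over those intervals.
    """
    levels = np.unique(u)                      # sorted distinct values of u
    total = 0.0
    for lower, upper in zip(levels[:-1], levels[1:]):
        indicator = (u > lower).astype(float)  # equals {u > t} for t in [lower, upper)
        total += (upper - lower) * graph_tv(w, indicator)
    return total

# Hypothetical data: random symmetric weights and an integer-valued function.
rng = np.random.default_rng(0)
n = 30
w = np.triu(rng.uniform(size=(n, n)), 1)
w = w + w.T
u = rng.integers(0, 5, size=n).astype(float)

print("graph total variation:", graph_tv(w, u))
print("coarea integral:      ", coarea_integral(w, u))
```

The two printed numbers agree up to rounding, since $|a - b| = \int_{\mathbb{R}} |\mathbf{1}_{\{a > t\}} - \mathbf{1}_{\{b > t\}}| \, dt$ for all real $a, b$.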
3. Proofs of Theorems
The main auxiliary result that we need to establish Theorem 2.6 is the following.
Proposition 3.1.
For every let be functions satisfying
and
For consider the energy defined by
where
Then, as , converges in the sense towards , where is given by
Proposition 3.1 is a consequence of the next lemma.
Lemma 3.2.
Let be two Lipschitz continuous functions which are bounded above and below by positive constants in . Let be the Borel measure on given by . Let be functions satisfying
and
Let be defined by
where
Then, as , converges in the sense towards
Proof.
Let us start by proving the liminf inequality. That is, let us prove that for every and every sequence satisfying , we have
(3.1) 
The following simplifications are standard. First, working along subsequences we may assume without loss of generality that the liminf is actually a limit. In addition, we may assume that both of the terms involved in inequality (3.1) are finite. We now split the proof of (3.1) into several steps.
Step 1: Instead of working with the energy directly, we first consider a simpler related energy defined by
where
Notice that for every we have
From the previous inequality, the Lipschitz continuity of and the assumptions on and , we deduce that for every and ,
where is a constant that does not depend on or . It then follows that
In particular, to obtain (3.1) we may as well show that
Step 2: Let be a closed ball contained in . We claim that
where is defined as
(3.2) 
where in the above stands for the interior of the closed ball (the ball without its boundary).
The claim follows from Theorem 8 in [25]. The only difference between the setting we consider here and the setting considered in [25] is that in our case the length scale depends on the location. The fact that this length scale converges to zero uniformly is enough to make the proof in [25] carry through with essentially no modifications. We remark that the arguments in [25] were also used in [14]; in [14] the presence of a non-constant density makes computations a bit more tedious.
Step 3: Suppose that is of the form for some measurable set . That is, suppose that is the indicator function of a set with finite perimeter (with respect to ). In the Appendix we show the following fact: there exists a sequence of collections of closed balls contained in satisfying:

For every , the family is finite.

For every , the balls in are pairwise disjoint.

.

For every , and for every , .

.
With the families at hand, we may now consider the functions
Then, for every