The k-PDTM: a coreset for robust geometric inference
Analyzing the sub-level sets of the distance to a compact sub-manifold of $\mathbb{R}^d$ is a common method in TDA to understand its topology. The distance to measure (DTM) was introduced by Chazal, Cohen-Steiner and Mérigot in [Merigot1] to address the non-robustness of the distance to a compact set with respect to noise and outliers. This function makes it possible to infer the topology of a compact subset of $\mathbb{R}^d$ from a noisy cloud of $n$ points lying nearby in the Wasserstein sense. In practice, these sub-level sets may be computed using approximations of the DTM such as the $q$-witnessed distance [Merigot2] or other power distances [Buchet16]. These approaches ultimately require computing the homology of unions of $n$ growing balls, which may become intractable when $n$ is large.
To simultaneously address the two problems of a large number of points and noise, we introduce the $k$-power distance to measure ($k$-PDTM). This new approximation of the distance to measure may be thought of as a $k$-coreset based approximation of the DTM. Its sublevel sets are unions of $k$ balls, with $k \ll n$, and this distance is also proved to be robust to noise. We assess the quality of this approximation for $k$ possibly dramatically smaller than $n$; for instance, $k = n^{1/3}$ is proved to be optimal for $2$-dimensional shapes. We also provide an algorithm to compute this $k$-PDTM.
Keywords: distance to a measure, geometric inference, coreset, power function, weighted Voronoï tessellation, empirical approximation
1.1 Background on robust geometric inference
Let $M \subset \mathbb{R}^d$ be a compact set included in the closed Euclidean ball $\bar{B}(0,K)$, for $K > 0$, whose topology is to be inferred. A common approach is to sample $\mathbb{X}_n = \{X_1, \dots, X_n\}$ on $M$, and approximate the distance to $M$ via the distance to the sample points. As emphasized in [Merigot1], such an approach suffers from non-robustness to outliers. To address this issue, [Merigot1] introduces the distance to measure as a robust surrogate of the distance to $M$, when $\mathbb{X}_n$ is considered as an $n$-sample, that is, $n$ independent realizations of a distribution $P$ whose support is $M$, possibly corrupted by noise. Namely, for a Borel probability measure $P$ on $\mathbb{R}^d$, a mass parameter $h \in (0,1]$ and $x \in \mathbb{R}^d$, the distance of $x$ to the measure $P$, denoted $d_{P,h}(x)$, is defined by
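Definition 1 (DTM).
$$d^2_{P,h}(x) = \frac{1}{h} \int_{0}^{h} \delta^2_{P,l}(x)\, dl, \quad \text{with } \delta_{P,l}(x) = \inf\{ r > 0 \mid P(\bar{B}(x,r)) > l \},$$
where $\bar{B}(x,r)$ denotes the closed Euclidean ball with radius $r$ centred at $x$. When $P$ is uniform enough on a compact set with positive reach $\rho$, this distance is proved to approximate well the distance to $M$ ([Merigot1, Proposition 4.9]) and is robust to noise ([Merigot1, Theorem 3.5]). The distance to measure is usually inferred from $\mathbb{X}_n$ via its empirical counterpart, also called the empirical DTM, obtained by replacing $P$ by the empirical distribution $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$, where $\delta_x$ is the Dirac mass on $x$.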
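To make the empirical DTM concrete, here is a minimal Python sketch. It assumes that $q = hn$ is an integer, in which case the squared empirical DTM at $x$ reduces to the mean of the squared distances from $x$ to its $q$ nearest sample points. The function name `empirical_dtm` and the toy data are illustrative choices, not part of the paper.

```python
import numpy as np

def empirical_dtm(points, x, q):
    """Empirical DTM at x with mass parameter h = q / n.

    Assumes q = h * n is an integer, so the DTM is the root mean
    squared distance from x to its q nearest sample points.
    """
    points = np.asarray(points, dtype=float)
    sq_dists = np.sum((points - np.asarray(x, dtype=float)) ** 2, axis=1)
    nearest = np.partition(sq_dists, q - 1)[:q]  # q smallest squared distances
    return np.sqrt(nearest.mean())

# Toy data: 200 points on the unit circle plus a single outlier at the centre.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
sample = np.vstack([circle, [[0.0, 0.0]]])

# The distance to the sample vanishes at the outlier, the DTM does not.
dist_to_sample = np.min(np.linalg.norm(sample, axis=1))
dtm_at_origin = empirical_dtm(sample, [0.0, 0.0], q=20)
```

In this toy example the plain distance to the sample is zero at the origin because of the outlier, whereas the empirical DTM stays close to the distance to the circle, which illustrates the robustness discussed above.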
Mérigot et al. noted in [Merigot2] that the sublevel sets of the empirical DTM are unions of around $\binom{n}{q}$ balls with $q = hn$, which makes their computation intractable in practice. To bypass this issue, approximations of the empirical DTM have been proposed in [Merigot2] ($q$-witnessed distance) and [Buchet16] (power distance). To our knowledge, these are the only available approximations of the empirical DTM. The sublevel sets of these two approximations are unions of $n$ balls. This makes the computation of topological invariants more tractable for small data sets, from alpha-shapes for instance; see [Ede92]. Nonetheless, when $n$ is large, there is still a need for a coreset that allows an approximation of the DTM to be computed efficiently, as pointed out in [Phillips]. In [MerigotLB], Mérigot proves that such a coreset cannot be too small in high dimension.
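The following short sketch gives an order of magnitude for the number of balls involved. It assumes that the count referred to above is the binomial coefficient $\binom{n}{q}$ with $q = hn$ (one ball per subset of $q$ sample points); the function name `dtm_ball_count` is hypothetical.

```python
from math import comb

def dtm_ball_count(n, h):
    """Number of q-subsets of an n-sample, with q = h * n rounded to an integer.

    Assumption: each ball in the sublevel sets of the empirical DTM is
    indexed by a subset of q sample points, so this is an upper bound.
    """
    q = round(h * n)
    return comb(n, q)

# Compare with the n balls used by the witnessed and power-distance approximations.
for n in (10, 100, 1000):
    print(n, dtm_ball_count(n, 0.1))
```

Even for moderate sample sizes the count grows far beyond $n$, which is why approximations with $n$ balls, and then with $k \ll n$ balls, are of practical interest.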
This paper aims at providing such a coreset for the DTM, to handle the case where there are many observations, possibly corrupted by noise. We introduce the $k$-power distance to a measure ($k$-PDTM), which is defined as the square root of one of the best $k$-power functions approximating the square of the DTM from above, for the $L_1(P)$ norm. Roughly, we intend to approximate the DTM of a point $x$ with a power distance of the form
$$\sqrt{\min_{i \in \{1, \dots, k\}} \|x - \tau_i\|^2 + \omega_i^2},$$
where the $\tau_i$'s and corresponding $\omega_i$'s are suitably chosen. Its sub-level sets are unions of $k$ balls. Thus, the study of the associated topological invariants becomes tractable in practice, even for massive data.
We begin by providing some theoretical guarantees on the $k$-PDTM we introduce. For instance, we prove that it can be expressed as a power distance from a coreset of $k$ points that are local means of the measure $P$. The proofs rely on a geometric study of local sub-measures of $P$ with fixed mass $h$, showing that such a coreset makes sense whenever $P$ is supported on a compact set. In particular, we prove that the set of means of local sub-measures of $P$ is convex. The discrete case relies on the duality between a weighted Delaunay diagram and its associated weighted Voronoï diagram.
Once the $k$-PDTM is properly defined, the main contributions of our paper are the following. First, we show that the $k$-PDTM is a good approximation of the DTM in the $L_1$ sense (Proposition 18), showing for instance that whenever $M$ has dimension $d'$
where $Pf(u)$ stands for the integration of $f$ with respect to the measure $P$. As mentioned in Proposition 22, this allows topological guarantees to be inferred from the sublevel sets of the $k$-PDTM.
Second, we prove that the $k$-PDTM shares the robustness properties of the DTM with respect to Wasserstein deformations (Proposition 21). Namely, if $Q$ is a sub-Gaussian deformation of $P$ such that the Wasserstein distance , it holds
ensuring that the approximation guarantees of our $k$-PDTM are stable with respect to Wasserstein noise. As for the DTM, this also guarantees that an empirical $k$-PDTM, that is, one built on $\mathbb{X}_n$, is a consistent approximation of the true $k$-PDTM.
Finally, we provide more insight into the construction of the empirical $k$-PDTM from a point cloud $\mathbb{X}_n$, addressing the practical situation where only a corrupted sample is at hand. We present a $k$-means-like algorithm with complexity , and we analyze the approximation performance of such an empirical output. Theorem 24 shows that, with high probability,
Combining this estimation result with the approximation results between the $k$-PDTM and the DTM mentioned above suggests that an optimal choice for $k$ is , whenever $M$ has dimension $d'$, resulting in a deviation between the empirical $k$-PDTM and the DTM of order . This has to be compared with the approximation that the empirical DTM achieves in such cases. In the case where $n$ is large, this approximation suffices for topological inference. Thus, topological inference built on significantly fewer points might provide guarantees almost similar to those of the DTM.
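The $k$-means-like algorithm mentioned above is not detailed in this introduction. The sketch below is therefore only one plausible Lloyd-type iteration, written under explicit assumptions: each candidate point $t_i$ is replaced by the barycentre and variance of its $q$ nearest sample points, sample points are assigned to the weighted Voronoï cell minimizing $\|x - m_i\|^2 + v_i$, and each $t_i$ is then moved to the mean of its cell. All function names, the initialization and the stopping rule are illustrative choices and should not be read as the authors' algorithm.

```python
import numpy as np

def local_mean_var(points, t, q):
    """Barycentre and variance of the q sample points nearest to t."""
    sq_dists = np.sum((points - t) ** 2, axis=1)
    neighbors = points[np.argpartition(sq_dists, q - 1)[:q]]
    mean = neighbors.mean(axis=0)
    var = np.mean(np.sum((neighbors - mean) ** 2, axis=1))
    return mean, var

def centers_and_weights(points, t, q):
    """Local means and local variances attached to the candidate points t."""
    stats = [local_mean_var(points, ti, q) for ti in t]
    means = np.array([m for m, _ in stats])
    variances = np.array([v for _, v in stats])
    return means, variances

def lloyd_k_pdtm(points, q, k, n_iter=100, seed=0):
    """Hypothetical Lloyd-type heuristic returning k centers and squared weights."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    t = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        means, variances = centers_and_weights(points, t, q)
        # Power cost of every sample point with respect to every center.
        cost = np.sum((points[:, None, :] - means[None, :, :]) ** 2, axis=2) + variances
        labels = np.argmin(cost, axis=1)
        new_t = t.copy()
        for i in range(k):
            cell = points[labels == i]
            if len(cell) > 0:  # keep the old point if its cell is empty
                new_t[i] = cell.mean(axis=0)
        converged = np.allclose(new_t, t)
        t = new_t
        if converged:
            break
    return centers_and_weights(points, t, q)

def k_power_sq(x, means, variances):
    """Squared k-power distance at x for the given centers and squared weights."""
    return np.min(np.sum((means - x) ** 2, axis=1) + variances)

rng = np.random.default_rng(1)
cloud = rng.normal(size=(60, 2))
means, variances = lloyd_k_pdtm(cloud, q=5, k=4)
```

Whatever the candidate points are, the resulting squared $k$-power distance lies above the squared empirical DTM, since the mean squared distance from $x$ to any $q$ sample points is at least the mean squared distance to its $q$ nearest ones. This is consistent with the approximation "from above" described earlier.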
1.3 Organization of the paper
This paper is organized as follows. In Section 2, we recall some definitions for the DTM, which can be expressed as a power distance, and study the set of local means. Section 3 is devoted to the $k$-PDTM, a $k$-power distance which approximates the DTM. We relate two equivalent definitions of the $k$-PDTM, derive some stability results, and prove its proximity to the DTM, highlighting its interest for topological inference. The case of noisy point clouds is addressed in Section 4, where an algorithm to approximate the $k$-PDTM comes with theoretical guarantees.
2 Some background about the DTM
2.1 Notation and definitions for the DTM
In the paper, we denote by the -dimensional space equipped with the Euclidean norm . For and any space , stands for , where two elements are identified whenever they are equal up to a permutation of the coordinates. Also, denotes the Euclidean sphere of radius , the Euclidean ball centred at , and for and , denotes the half-space . Also, for any subset of , stands for its closure, for its interior, its boundary and its complementary set in .
In the following, stands for the set of Borel probability distributions , with support , and, for any -integrable function , denotes the expectation of with respect to . The following sets of distributions are of particular interest: we denote by for , and is the set of which put mass neither on the boundaries of balls nor on the half-spaces of -mass . We also allow perturbations of measures in . A sub-Gaussian measure with variance is a measure such that for all . The set of such measures is denoted by . As well we can define . The set might be thought of as perturbations of . Indeed, if , where has distribution in and is Gaussian with variance , then has distribution in , with . All these sets of distributions are included in , that denotes the set of distributions with finite second moment.
For all and , denotes a -sample from , meaning that the ’s are independent and sampled according to . Also, denotes the empirical measure associated to , where is such that . Then is the set of uniform on a set of points.
An alternative definition to Definition 1, for the distance to measure, might be stated in terms of sub-measures. Let . We define as the set of distributions , for a sub-measure of coinciding with on , and such that and . Note that when , is reduced to a singleton with defined for all Borel sets by . From [Merigot1, Proposition 3.3], it holds, for any and ,
with the mean of and its variance. For convenience, we denote by the second moment of , so that . Whenever is in , satisfies the following property.
Let , then and , .
2.2 From balls to half-spaces: structure of the local means set
In the previous part, we have seen that the DTM is built from sub-measures of $P$ supported on balls of $P$-mass $h$. Now, by making the center of a ball go to $\infty$ along a direction $v$ such that the ball keeps a fixed mass $h$, we obtain a sub-measure of $P$ supported on a half-space, as follows.
For , we denote by the infinite point associated to the direction . It can be seen as a limit point .
Then, we denote .
Note that we can equip with the metric defined by , with when and for all .
Also, for this metric, a sequence of converges to if and only if and with the convention for all .
Let , set . Then, corresponds to the largest (for the inclusion order) half-space directed by with -mass at most , which contains all the ’s for large enough.
Let and . Assume that . If for all , then for -almost all , we have:
If is a sequence of such that , then, the result holds up to a subsequence.
The proof of Lemma 3 is given in the Appendix, Section A.2. For all , we can generalize the definition of , , , and to the elements for all . Note that when , is reduced to the singleton with equal to for all Borel sets . Intuitively, the distributions behave like extreme points of . This intuition is formalized by the following Lemma. Denote .
Let , the set is equal to .
Let , then
The proofs of Lemmas 4 and 5 are given in Sections A.3 and A.4. A key property of the local means sets is convexity. This will be of particular interest in Section 3.1. We begin with the finite-sample case.
Let such that is a set of points in general position, as described in [Boissonat, Section 3.1.4], meaning that any subset of with size at most is a set of affinely independent points, set . Then, the set is convex.
[Proof of lemma 6] Let
with the collection of all sets of -nearest neighbors associated to . Note that different may be associated to the same , and also note that . Moreover, since any , for , can be expressed as a convex combination of the ’s.
Then, breaks down into a finite number of weighted Voronoï cells , with the weight associated to any point in . According to [Boissonat, Theorem 4.3], the weighted Delaunay triangulation partitions the convex hull of any finite set of weighted points in general position by -dimensional simplices with vertices in , provided that the associated weighted Voronoï cells of all the points in are non empty. By duality, (also see [Boissonat, Lemma 4.5]) these vertices are associated to weighted Voronoï cells that have non-empty common intersection. Thus, any satisfies for some ’s in and some non negative ’s such that . Also, there exists some in the intersection of the weighted Voronoï cells, .
Set , with when . Then, is a probability measure such that () coincides with on and is supported on . Thus it belongs to . Moreover, its mean . Thus, .
If , convexity of might be deduced from the above Lemma 6 using the convergence of the empirical distribution towards in a probabilistic sense. This is summarized by the following Lemma.
Let and . There exist sequences , , with the points in in general position, and such that
If for is such that and for all , , , , then for all , is convex.
2.3 The DTM defined as a power distance
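A power distance indexed on a set $I$ is the square root of a power function $f_{\tau,\omega}$ defined on $\mathbb{R}^d$ from a family of centers $\tau = (\tau_i)_{i \in I}$ and weights $\omega = (\omega_i)_{i \in I}$ by $f_{\tau,\omega} : x \mapsto \inf_{i \in I} \|x - \tau_i\|^2 + \omega_i^2$. A $k$-power distance is a power distance indexed on a finite set of cardinality $k$.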
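As an illustration of this definition, the sketch below evaluates a power distance for a finite family of centers and weights, assuming the power function has the form $\min_i \|x - \tau_i\|^2 + \omega_i^2$. It also checks membership in the sublevel set at level $r$, which under this assumption is the union of the balls centred at $\tau_i$ with squared radius $r^2 - \omega_i^2$ (empty when $\omega_i > r$). The function names are hypothetical.

```python
import numpy as np

def power_distance(x, centers, weights):
    """Square root of min_i ||x - tau_i||^2 + omega_i^2 (assumed form)."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    weights = np.asarray(weights, dtype=float)
    values = np.sum((centers - x) ** 2, axis=1) + weights ** 2
    return np.sqrt(values.min())

def in_union_of_balls(x, centers, weights, r):
    """True if x lies in some ball B(tau_i, sqrt(r^2 - omega_i^2))."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    weights = np.asarray(weights, dtype=float)
    radii_sq = r ** 2 - weights ** 2  # negative entries encode empty balls
    sq_dists = np.sum((centers - x) ** 2, axis=1)
    return bool(np.any(sq_dists <= radii_sq))

centers = [[0.0, 0.0], [3.0, 0.0]]
weights = [0.0, 1.0]
x = [2.0, 0.0]
value = power_distance(x, centers, weights)
```

With these toy inputs, the point `x` belongs to the sublevel set at level `r` exactly when `power_distance(x, centers, weights) <= r`, which is what `in_union_of_balls` tests ball by ball.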
In [Merigot1, Proposition 3.3], the authors point out that for all such that is a sub-measure of . This remark, together with (1), provides an expression for the DTM as a power distance.
Proposition 9 ([Merigot1, Proposition 3.3]).
If , then for all , we have:
and the infimum is attained at and any measure .
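As noted by Mérigot et al. in [Merigot2], this expression holds for the empirical DTM. In this case, with $q = nh$, the mean corresponds to the barycentre of the $q$ nearest neighbors of $x$ in $\mathbb{X}_n$, and the variance to the mean squared distance from these $q$ neighbors to their barycentre, at least for points $x$ whose set of $q$ nearest neighbors is uniquely defined.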
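The following sketch checks this decomposition numerically on a random sample, assuming $q = nh$ is an integer: the mean squared distance from $x$ to its $q$ nearest neighbors should equal the squared distance from $x$ to their barycentre plus their variance. The helper name `local_mean_and_variance` is illustrative.

```python
import numpy as np

def local_mean_and_variance(points, x, q):
    """Barycentre and variance of the q sample points nearest to x."""
    points = np.asarray(points, dtype=float)
    x = np.asarray(x, dtype=float)
    sq_dists = np.sum((points - x) ** 2, axis=1)
    neighbors = points[np.argpartition(sq_dists, q - 1)[:q]]
    mean = neighbors.mean(axis=0)
    variance = np.mean(np.sum((neighbors - mean) ** 2, axis=1))
    return mean, variance

rng = np.random.default_rng(0)
sample = rng.normal(size=(100, 3))
x = np.array([0.5, -0.2, 0.1])
q = 10

# Left-hand side: squared empirical DTM as a mean of squared distances.
sq_dists = np.sum((sample - x) ** 2, axis=1)
dtm_sq = np.sort(sq_dists)[:q].mean()

# Right-hand side: power-distance form with local mean and local variance.
mean, variance = local_mean_and_variance(sample, x, q)
power_form = np.sum((x - mean) ** 2) + variance
```

The two quantities `dtm_sq` and `power_form` agree up to floating-point error, which is the bias-variance identity underlying the power-distance expression of the DTM.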
2.4 Semiconcavity and DTM
We will often use the following lemma, which is connected to the concavity of the function $x \mapsto d^2_{P,h}(x) - \|x\|^2$.
Lemma 10 ([Merigot1, Proposition 3.6]).
If , then for all and ,
with equality if and only if .
3 The $k$-PDTM: a coreset for the DTM
In Proposition 9, we have written the DTM as a power distance. This remark has already been exploited in [Merigot2] and [Buchet16], where the DTM has been approximated by $n$-power distances. In this paper, we propose to keep only $k$ centers.
For any , we define by:
A notion closely related to Definition 11 is that of weighted Voronoï measures.
A set of weighted Voronoï measures associated to a distribution , and is a set of positive sub-measures of such that and
We denote by the expectation of , with the convention when .
Note that a set of weighted Voronoï measures can always be assigned to any and , it suffices to split in weighted Voronoï cells associated to the centers and weights , see [Boissonat, Section 4.4.2], and split the remaining mass on the border of the cells in a measurable arbitrary way.
For all , and for some , such that for all , the set is not empty. Moreover, there is some such that for all .
[Sketch of proof] For , set . Then, Lemma 3 and the dominated convergence theorem yield .
Set for all , then and .
3.1 Two equivalent definitions for the $k$-PDTM
Let , the $k$-power distance to a measure ($k$-PDTM) is defined for any by:
An -approximation of the -PDTM, denoted by an is a function defined by the previous expression but for some satisfying
Theorem 13 states that the $k$-PDTM is well defined when and satisfies for all . Nonetheless, whenever is not a singleton, the $k$-PDTM is not unique. Note that for all , .
The set is defined by:
with for , that is: