Coarse-Refinement Dilemma: On Generalization Bounds for Data Clustering


Yule Vaz yule.vaz@usp.br Rodrigo Fernandes de Mello mello@icmc.usp.br Carlos Henrique Grossi grossi@icmc.usp.br Institute of Mathematical and Computer Sciences, University of São Paulo, Trabalhador São Carlense 400, São Carlos, SP, Brazil
Abstract

The Data Clustering (DC) problem is of central importance to Machine Learning (ML), given its usefulness in representing the structural similarities of data from input spaces. Unlike Supervised Machine Learning (SML), which relies on the theoretical frameworks of Statistical Learning Theory (SLT) and Algorithmic Stability (AS), DC has scarce literature on general-purpose learning guarantees, which hinders conclusive remarks on how such algorithms should be designed as well as on the validity of their results. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions under which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of the DC problem by relying on topological spaces, instead of the metric ones typically approached in the literature. From that, we show that the DC problem presents a dilemma analogous to the Bias-Variance one, which is here referred to as the Coarse-Refinement (CR) dilemma. The CR dilemma is intended to clarify the contrast between: (i) highly refined partitions and clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting); consequently, the CR dilemma suggests the need for a relaxation of Kleinberg's richness axiom. Experimental results illustrate that multidimensional persistent homology supports the measurement of divergences among DC models, leading to a consistency criterion.

keywords:
Data Clustering, Topology, Persistent Homology, Multidimensional Persistence, Algorithm Stability

1 Introduction

Machine Learning (ML) is among the most important concepts to be considered while designing real-world applications from different domains (Zhao:2003; larkshmanan15; Zhan:2018; Tang:2018; Bablani:2019), mainly relying on two paradigms: (i) Supervised Machine Learning (SML), and (ii) Unsupervised Machine Learning (UML). SML counts on fundamental proofs provided by the Statistical Learning Theory (SLT) (Vapnik95; luxburg09) while estimating a classification/regression function in the form $f \colon \mathcal{X} \to \mathcal{Y}$, given an input space $\mathcal{X}$ and class labels in an output space $\mathcal{Y}$. All those proofs are driven to ensure that the empirical risk probabilistically converges to its expected value, so that it can be used to assess multiple learning models. This SLT framework cannot be employed to formulate or ensure learning in the context of UML, since class labels are not available but only the inputs $x \in \mathcal{X}$. In this sense, UML attempts to represent data spatial structures according to the features composing $\mathcal{X}$, with Data Clustering (DC) being the most iconic approach of this branch. As a matter of fact, some specific proofs have already been formulated (kleinberg02; bousquet02; ben-david08; ackerman10; carlsson10), although UML still requires advances in order to have a stronger, preferably general, theoretical foundation to ensure learning.

As a step in that direction, the concept of Algorithmic Stability (AS) (bousquet02) supports learning guarantees in terms of bounded perturbations on the domain of a measurable function of random variables, even in the absence of labeled data (kutin02; poggio04). In (mello09), Algorithmic Stability was employed in an attempt to characterize learning on unsupervised online data, which resulted in the development of a method for concept drift detection. We therefore consider Algorithmic Stability an appropriate framework to study Data Clustering problems.

Considering clustering partitioning, kleinberg02 formalizes the DC problem according to three necessary properties: (i) scale-invariance – the partitions formed by a clustering algorithm should not depend on the distance scale among elements; (ii) consistency – the partitions must not change whenever intra-cluster distances decrease and inter-cluster distances increase; and (iii) richness – a clustering algorithm should be capable of producing all possible partitions for a distance function. kleinberg02 proved that those properties cannot be simultaneously satisfied, despite the attempt to unify intuitive clustering notions; therefore, even those basic axiomatic statements need some sort of relaxation in order to perform clustering.

Based on kleinberg02's results, ben-david08 propose the Clustering Quality Measure, which guarantees that all of kleinberg02's axioms are simultaneously satisfiable. However, we believe that richness is actually not mandatory, given its impact on algorithm stability, as discussed in Sections 3 and 5. Roughly speaking, richness imposes that some irrelevant and unstable partitions must also be produced by clustering algorithms, something that may not be desirable. In addition, ackerman10 develop a taxonomy of clustering properties, from which we adopt the isomorphic invariance between clustering models.

Assuming that Hierarchical Clustering (HC) relaxes Kleinberg's axioms, carlsson10 designed a theoretical framework to ensure data-permutation stability by taking advantage of ultrametric spaces built upon HC algorithms. The authors firstly proved that, after some modifications, the Single-Linkage (SL) agglomerative criterion is enough to ensure the same clustering model (dendrogram) for all input permutations, and secondly confirmed the same result for different inputs following the same data distribution.

From those two main papers, this manuscript firstly consolidates those concepts by providing a general description of the Data Clustering (DC) problem using topological spaces, thus complementing carlsson10's study, which assumes data in some metric space. Secondly, we discuss the practical usefulness of kleinberg02's richness property, given its impacts on clustering algorithm consistency, as later discussed in Sections 3 and 5; roughly speaking, richness imposes that either irrelevant or unstable partitions must also be produced, which is not desirable. Finally, we conclude that over-refined or over-coarse HC partitions tend to be either unstable or irrelevant when data is subject to bounded perturbations, something associated with the Bias-Variance dilemma (Geman92) in terms of the space of admissible partitions, thus suggesting that Kleinberg's richness should be relaxed anyway.

To complement carlsson10's study, this paper shows that topological spaces suffice to model the DC problem, allowing us to derive consistency results that ensure clustering generalization from a more general point of view (Sections 3 and 5). This consistency considers topological features primarily associated with connected components, holes and voids (munkres2000), which cannot be directly represented once the underlying topological space is unknown. However, given that we have access to some point cloud, the consistency of topological structures can be assessed by evaluating isomorphisms between homology groups (homology equivalences) (carlsson09).

Equivalent spaces from the homological perspective may not be homeomorphic (hatcher2002). Even so, homology equivalence preserves the holes, voids and connected components of geometric objects, which are defined by homology classes, i.e., by elements of the homology group. In this particular context, we claim that inferior (fine-grained, e.g., at the bottom level of a dendrogram) hierarchies of some HC model are not consistent in terms of homology classes as data points are subject to inclusion. In this scenario, persistent homology (Edelsbrunner2000) is suitable to study the homology groups extracted from the hierarchies of an HC model, as it allows analyzing changes in the number of connected components and voids, which is formally defined in terms of inclusions of the corresponding topological spaces. Hence, our goal is to verify how persistent homology is affected after the acquisition of new data, so that we find the collection of hierarchies satisfying generalization bounds for the HC problem.

That motivated us to study the stability proofs for persistence diagrams by Cohen07 and Chazal09. Such diagrams describe how homology classes change throughout a sequence of inclusions of topological spaces which, in the context of this paper, are associated with the hierarchies of a given HC model (or dendrogram). Despite the relevant contributions on the stability of persistence diagrams by Cohen07 and Chazal09, their approaches did not consider variations of the topological spaces produced by new data. Such data insertions required us to employ the concept of multifiltration (bifiltration, specifically), introduced in (carlsson2009multi), in order to represent persistent homology along data inclusions (ville1939).

In summary, the main contributions of this paper are: i) the conclusion that Kleinberg's richness property (kleinberg02) may lead to inconsistent partitions; ii) the conclusion that Topological Spaces produce relevant features for DC consistency analysis; iii) a new DC problem formulation based on topological spaces; iv) a new HC problem formulation based on topological spaces; v) the formulation of the Coarse-Refinement dilemma based on homology groups, which is analogous to the Bias-Variance dilemma from supervised learning (Vapnik95); vi) the formulation of generalization bounds for homology groups in clustering structures, based on the Coarse-Refinement dilemma; vii) the proof that our proposed DC generalization bound is an upper limit for carlsson10's consistency; viii) besides the theoretical contributions, experimental results confirming that over-refined clusters produce inconsistent homology groups.

This paper is organized as follows: Section 2 briefly introduces studies related to the formalization of theoretical frameworks in the context of the Data Clustering (DC) problem; Section 3 introduces a general formulation for the DC and HC problems; Section 4 discusses the Coarse-Refinement dilemma considering the homology group $H_0$; Section 5 shows that homology groups of degree greater than zero are affected by over-refined and over-coarse topologies; Section 6 compares our proposed generalization bounds to carlsson10's consistency; finally, conclusions and future directions are provided in Section 8.

2 Related work

Data Clustering (DC) faces many challenges in defining and guaranteeing generalization from datasets, as it does not rely on labels and, consequently, cannot take advantage of any evident error measurement such as risk (luxburg09). While studying this issue, kleinberg02 considered a clustering model to be the result of applying a mapping $f$ to a distance function $d \colon S \times S \to \mathbb{R}_{\geq 0}$, where $S$ contains the indices of data points in some fixed-size set, disregarding their ambient space (ambientspace). From this initial setup, kleinberg02 defined three properties to be respected in order to assess clustering algorithms and models:

  • Scale-invariance: Given a distance function $d$, a clustering function $f$, and a scalar $\alpha > 0$, the following must hold: $f(d) = f(\alpha \cdot d)$. Thus, the similarity representation over $S$ must be consistent with the units of measurement;

  • Consistency: Let $\Gamma$ be a partition of $S$ and $d, d'$ two distance functions. Function $d'$ is referred to as a $\Gamma$-transformation of $d$ if: (i) for all $i, j \in S$ belonging to the same cluster, $d'(i,j) \leq d(i,j)$; and (ii) for all $i, j \in S$ belonging to different clusters, $d'(i,j) \geq d(i,j)$. Consistency holds if $f(d') = \Gamma$ whenever $f(d) = \Gamma$ and $d'$ is a $\Gamma$-transformation of $d$. Intuitively, this property is assured when the partition is maintained whenever intra-cluster distances decrease and inter-cluster distances increase;

  • Richness: Let $\mathrm{Range}(f)$ be the set of all partitions $\Gamma$ such that $f(d) = \Gamma$ for some distance function $d$. The richness property is guaranteed if and only if $\mathrm{Range}(f)$ is equal to the set of all partitions of $S$. In other words, for any partition $\Gamma$ of $S$, there exists some distance function $d$ such that $f(d) = \Gamma$.

The author also proves that these three properties cannot be simultaneously ensured, and claims that one of them is always somehow relaxed to obtain a clustering model. Nonetheless, if we consider $f$ as a statistical model, richness does not appear suitable for the Data Clustering scenario as, in the spirit of the Bias-Variance dilemma, it would permit the production of degenerate models, thus leading to phenomena similar to underfitting (a single cluster) and overfitting (every point is a cluster), as illustrated below.
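For concreteness, the following minimal sketch (ours, not from kleinberg02) exhibits the two degenerate partitions that richness forces a clustering function to admit, using SciPy's single-linkage: cutting the dendrogram at height zero yields singletons (the overfitting extreme), while cutting above the root yields a single cluster (the underfitting extreme).

```python
# Minimal sketch (illustrative, not from the original paper): the two
# degenerate partitions that the richness axiom forces into Range(f).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                 # 30 points in R^2

Z = linkage(X, method="single")              # single-linkage dendrogram

singletons = fcluster(Z, t=0.0, criterion="distance")     # cut at height 0
one_blob = fcluster(Z, t=np.inf, criterion="distance")    # cut above the root

print(len(set(singletons)))  # 30 clusters -> every point is a cluster
print(len(set(one_blob)))    # 1 cluster   -> a single all-encompassing cluster
```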

In addition, carlsson10 relax Kleinberg's properties to prove the stability and consistency of their adapted single-linkage HC strategy. They assume hierarchical methods on metric spaces to build up dendrograms, i.e., structures that map the groups of an HC model into the real line, associating every cluster with the radius required to form it. They also show that dendrograms are equivalent to ultrametric spaces, allowing them to compare HC models using the Gromov-Hausdorff distance (gromov1981), and confirm that their adapted single-linkage strategy is the only one in the class of linkage algorithms (uniqueness) to be stable and consistent. More precisely, they conclude that two ultrametric spaces built from independently and identically distributed samples of the same data distribution give rise to isometric spaces as the sample size goes to infinity.

Complementary to those relevant studies, topological features could be used to derive other theoretical results for the DC and HC problems. In that sense, persistent homology is particularly useful as it describes when homology classes appear or vanish throughout a sequence of topological inclusions, in an attempt to represent hierarchical clusterings (Edelsbrunner2002). The persistence diagram is typically employed to represent the birth and death of homology classes, and its stability was proven by considering changes in tame functions (Cohen07). To prove such stability, Cohen07 assumed that: (i) the topological space must be triangulable; (ii) the topological space must be fixed; and (iii) tame functions must be continuous. This motivated Chazal09 to prove that persistence diagrams are stable even when topological spaces are not triangulable and tame functions are not continuous, yet topological spaces must still be fixed. In our scenario, neighborhoods may change when data is subject to perturbations, so we cannot assume the topology to be fixed. Therefore, the main goal of this paper is to study consistency in the presence of adaptable topological spaces, guaranteeing the property of isomorphism invariance as in (ackerman10).

Hence, we adopted the concept of multifiltration (bifiltration, precisely), introduced in (carlsson2009multi), from which the persistence of homology groups can be studied along variations of two or more parameters. For instance, a bifiltration can be used to study the radius and the density parameters of the DBSCAN algorithm (Ester96) to obtain adequate clusters given a target application. In our work, we consider the inclusion of new data points as a second dimension of the bifiltration, such that homology classes are associated with a pair whose first coordinate is a dataset after the inclusion of new samples and whose second coordinate is a variable related to the level of the HC (e.g., an acceptable radius), as sketched below. In order to study the persistent homology of such a bifiltration and the consistency of its $k$-th Betti-numbers, the DC and HC problems must be formulated in terms of topological spaces, as shown in the next section.
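The following sketch illustrates the bifiltration idea numerically, assuming the GUDHI library for Vietoris-Rips complexes; the datasets, radii, and function names are ours and purely illustrative. One axis varies the radius (the HC level), the other the inclusion of new samples, and Betti-numbers are compared across both.

```python
# Illustrative sketch of the (radius x data-inclusion) bifiltration,
# assuming the GUDHI library; all names and values here are ours.
import numpy as np
import gudhi

rng = np.random.default_rng(1)
X_n = rng.uniform(size=(50, 2))                       # initial sample
X_nm = np.vstack([X_n, rng.uniform(size=(10, 2))])    # X_n plus m new points

def betti_at(points, radius):
    """Betti-numbers of the Vietoris-Rips complex at a fixed radius."""
    rips = gudhi.RipsComplex(points=points, max_edge_length=radius)
    st = rips.create_simplex_tree(max_dimension=2)
    st.persistence()          # computes the persistence pairs
    return st.betti_numbers()

for r in (0.05, 0.15, 0.30):          # first axis: the refinement index
    # second axis: inclusion of new samples at the same refinement index
    print(r, betti_at(X_n, r), betti_at(X_nm, r))
```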

3 Generalizing the Data Clustering problem

The Data Clustering (DC) problem typically relies on metric spaces to describe similarities among dataset instances, unfortunately losing the inherent abstraction it could gain from topological spaces. As discussed in this section, the DC problem should be represented in a more general mathematical space, which can then be restricted when and if necessary, according to the target learning task. For example, if no restriction is needed, one can analyze topological features to study the data space and understand more general stability/consistency criteria on top of structures such as holes, voids and connected components (carlsson10). However, if some restriction is part of the target task, one can still endow such a space with some topology and take advantage of the same general stability/consistency criteria, as formulated in this manuscript.

In this context, we consider data points to be acquired from some unknown topological space $\mathcal{X}$, whose modeling is the goal of data clustering. To produce such a model, the DC problem assesses the similarities among dataset instances in order to organize them into clusters, each one corresponding to a set of neighborhoods of data elements. A neighborhood of a point $x \in X$, where $(X, \tau)$ is the topological space sampled from $\mathcal{X}$, defines an open set of $\tau$ necessarily containing $x$.

Remark 3. The goal of Data Clustering is to approximate topological features of an unknown topology on a space $\mathcal{X}$ from which data is sampled. In practice, $\mathcal{X}$ is a subspace of a larger topological space $Z$ (typically, a closed cube in $\mathbb{R}^d$). From the data instances $x \in X$, one attempts to approximate topological features of $\mathcal{X}$ by means of a neighborhood map $\eta$, in such a way that, ideally, the resulting space and $\mathcal{X}$ are homeomorphic. Here, $\eta(x)$ is an open neighborhood of $x$, and $X$ is endowed with the subspace topology generated by the neighborhoods $\eta(x)$ (we denote by $\bar{\tau}$ the closure of a topology $\tau$). We call the pair $(X, \tau)$ a neighborhood topological space.

For the sake of illustration, the DBSCAN algorithm (Ester96) employs a neighborhood map that uses open balls around every data instance in some metric space, considering that the density of points in every neighborhood is enough to form a new cluster. Observe that this concept differs from ours, in which all open sets are taken into account.
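As a small illustration of such metric-ball neighborhoods (a sketch using scikit-learn, with illustrative parameter values):

```python
# Sketch: DBSCAN's neighborhood map via open balls of radius eps, with a
# density requirement of min_samples points; values are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.1, size=(40, 2)),
               rng.normal(1.0, 0.1, size=(40, 2))])

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks low-density (noise) points
```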

Neighborhood topologies induce equivalence relations on the dataset (an example is given below) which determine the pertinence of every data instance to a given cluster. For example, the single-linkage algorithm induces an equivalence relation $\sim_r$, referred to as the $r$-relation by carlsson10, in which a metric space is built from some dataset $X$ using a distance function $d$, so that $x \sim_r x'$ holds when there is a finite chain $\{x = x_0, x_1, \ldots, x_k = x'\} \subseteq X$ such that $d(x_i, x_{i+1}) \leq r$ for every $i$.

We assume Data Clustering models to be built upon independently and identically sampled instances from which some topological space $(X, \tau)$ is obtained, so that there is a probability measure over some unknown underlying topological space $\mathcal{X}$. In order to ensure such an assumption, the set $X$ must also be endowed with a probability space structure. In this scenario, we need to endow the Borel $\sigma$-algebra of the larger topological space $Z$ (see Remark 3) with a probability measure that is supported on $\mathcal{X}$.

Summarizing, in this paper, we define the Data Clustering (DC) problem as follows.

Definition 1 (Data Clustering problem)

The Data Clustering problem consists in finding an adequate neighborhood topology $\tau$ of $X$ (see Remark 3), where $X \subset \mathcal{X}$ and $(X, \tau)$ is a known topological space sampled from the unknown underlying topological space $\mathcal{X}$. The neighborhood topology $\tau$ should approximate topological features of the unknown topology of $\mathcal{X}$. Random variables are independently and identically sampled from some unknown probability distribution which is supported on $\mathcal{X}$. Clusters are obtained from an equivalence relation derived from $\tau$.

As claimed in (carlsson09), the use of hierarchical schemes is more informative than choosing a single neighborhood topology. The Hierarchical Clustering (HC) problem considered in this paper is defined as follows.

Definition 2 (Hierarchical Clustering problem)

The Hierarchical Clustering problem consists in finding, for each $r \in \mathbb{R}_{\geq 0}$, in which $r$ is a refinement index (e.g., the radius in a metric space), an adequate neighborhood topology $\tau_r$ as in the previous definition. Moreover, we require that the neighborhood topology determined by $r_1$ be contained in the one determined by $r_2$ whenever $r_1 \leq r_2$.

In this context, the comparison among neighborhood topologies produced from different datasets sampled i.i.d. from the same distribution can be performed by verifying homeomorphisms among them. Two problems occur in this sense, though:

  • As the underlying topological space $\mathcal{X}$ is unknown, it is actually impossible to verify when a neighborhood topology is homeomorphic to $\mathcal{X}$; this precludes a direct design of an adequate neighborhood topology;

  • Consider two subsets $X, X' \subset \mathcal{X}$ and the respective neighborhood topologies $\tau$ and $\tau'$ (see Remark 3). Since the probability distribution is unknown, homeomorphisms between $(X, \tau)$ and $(X', \tau')$ cannot be explicitly studied.

Then, some notion of “approximately” equal topologies is required in order to analyze whether a (hierarchical) clustering algorithm is capable of consistently grouping data. To this end, we use homology: in some sense, homology allows one to compare some features of topological spaces by mapping topological structures into algebraic ones (hatcher2002).

In this paper, we are mainly interested in singular homology. This particular kind of homology uses the continuous images of standard simplices into a topological space $X$ in order to "get a sense" of the topology of $X$. More precisely, the standard $n$-simplex $\Delta^n$ is the convex hull of the standard unit vectors in $\mathbb{R}^{n+1}$, that is, $\Delta^n = \{(t_0, \ldots, t_n) \in \mathbb{R}^{n+1} : \sum_i t_i = 1,\ t_i \geq 0\}$, and a singular $n$-simplex in $X$ is a continuous map $\sigma \colon \Delta^n \to X$. The boundary of the standard $n$-simplex is made up of $(n-1)$-simplices; so, we define the boundary $\partial\sigma$ of the singular $n$-simplex $\sigma$ as the formal sum of the singular $(n-1)$-simplices given by restricting $\sigma$ to the faces of $\Delta^n$. In this formal sum, signs alternate so that orientation is taken into account.

The homology groups can now be constructed as follows. First, one takes the free abelian group $C_n(X)$ generated by all singular $n$-simplices on $X$, i.e., the formal finite sums of singular $n$-simplices with integer coefficients. Elements of $C_n(X)$ are called singular $n$-chains. The boundary operator immediately extends to an operator $\partial_n \colon C_n(X) \to C_{n-1}(X)$, giving rise to the chain complex

$$\cdots \longrightarrow C_{n+1}(X) \xrightarrow{\ \partial_{n+1}\ } C_n(X) \xrightarrow{\ \partial_n\ } C_{n-1}(X) \longrightarrow \cdots \longrightarrow C_0(X) \longrightarrow 0.$$

The image of the boundary operator $\partial_{n+1}$ provides the $n$-boundaries $B_n(X) = \operatorname{im} \partial_{n+1}$, while the kernel of $\partial_n$ provides the $n$-cycles $Z_n(X) = \ker \partial_n$. It is not difficult to show (hatcher2002) that the composition $\partial_n \circ \partial_{n+1} = 0$. So, $B_n(X) \subseteq Z_n(X)$, and the $n$-th homology group of the topological space $X$ is defined as the quotient

$$H_n(X) = Z_n(X) / B_n(X) = \ker \partial_n / \operatorname{im} \partial_{n+1}. \qquad (1)$$

The inclusion $B_n(X) \subseteq Z_n(X)$ says that the "boundary of a boundary" is trivially empty. So, homology groups detect the cycles in $X$ whose boundaries are empty, but not for the trivial reason of already being a boundary (of a higher-dimensional singular simplex) themselves. The presence of a cycle in $X$ that is not the boundary of a higher-dimensional singular simplex can be seen as the presence of an $n$-dimensional void (or hole) in $X$.

The $n$-dimensional voids in $X$ are characterized by its $n$-th Betti-number $\beta_n$, which is, by definition, the rank of the abelian group $H_n(X)$. One can also define the $n$-th Betti-number as the dimension of the vector space $H_n(X; \mathbb{Q})$; the definition of $H_n(X; \mathbb{Q})$ is essentially the same as that of $H_n(X)$, but we take formal sums of singular simplices with coefficients in the field $\mathbb{Q}$ of rational numbers – a subtle difference allowing one to endow $H_n(X; \mathbb{Q})$ with a linear structure. For example, for a topological space $T$ built up from a two-dimensional torus, $H_0(T) \cong \mathbb{Z}$, $H_1(T) \cong \mathbb{Z}^2$, and $H_2(T) \cong \mathbb{Z}$. So, the corresponding Betti-numbers are $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 1$.

In these settings, topological spaces built up from datasets can be compared in terms of their homology groups without the introduction of a function to represent perturbations on such sets. More precisely, given a dataset $X_n$ and a perturbed version $X_{n+m}$, which consists of $X_n$ with $m$ added points, one expects the probability that the corresponding homology groups are not isomorphic to tend to zero as $n \to \infty$ (consistency).
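A minimal empirical sketch of this consistency notion for $k = 0$ follows, taking the number of connected components of an $\epsilon$-neighborhood graph as a stand-in for $\beta_0$ of a neighborhood topology; all names and values are ours.

```python
# Sketch: Monte Carlo estimate of how often H_0(X_n) and H_0(X_{n+m})
# fail to be isomorphic, with beta_0 computed as the number of connected
# components of an epsilon-neighborhood graph (illustrative values).
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def beta0(points, eps):
    adj = squareform(pdist(points)) <= eps     # epsilon-ball adjacency
    return connected_components(csr_matrix(adj))[0]

rng = np.random.default_rng(3)
trials, eps, mismatches = 200, 0.25, 0
for _ in range(trials):
    X_n = rng.uniform(size=(100, 2))
    X_nm = np.vstack([X_n, rng.uniform(size=(10, 2))])   # m = 10 added points
    mismatches += beta0(X_n, eps) != beta0(X_nm, eps)

# fraction of trials with non-isomorphic 0-th homology groups
print(mismatches / trials)
```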

4 Coarse-Refinement dilemma of the homology group $H_0$

In the previous section, we argued that isomorphisms at the level of (singular) homology groups can be useful to compare topological spaces constructed by some Data Clustering algorithm. Homology groups encode how many connected components, holes, or voids (topological structures) are present in a topological space. If two topological spaces have the same topological structure, i.e., if they are homeomorphic, then their homology groups are isomorphic. The converse is not true.

In spite of that, homology is used in Topological Data Analysis (TDA) to study the topology of point clouds, because it does not require any assumption on the unknown underlying topology (carlsson09; Chazal17). In fact, neighborhood topologies (see Remark 3), typically based on metric spaces (Ester96; carlsson10), are used in TDA to extract topological features from point clouds. A more general analysis can be performed by considering arbitrary open neighborhoods instead of metric ones. In addition, homology groups can be used to compare how a neighborhood topology $\tau$ changes with respect to a perturbed version of itself, analyzing the stability of $\tau$ by comparing its homology with the expected one.

If $\tau$ is refined enough to represent each element as a single cluster (singletons), a problem occurs at the level of topological spaces: if the cardinalities of two samples differ, then the resulting spaces are not homeomorphic, as their numbers of connected components differ; hence their $0$-dimensional homology groups are not isomorphic. So, equal cardinality should hold when over-refined spaces are compared. However, no assumption about the cardinality of a sample can be made, since $\mathcal{X}$ is unknown and, therefore, such equality may not hold.

Theorem 3

Two over-refined topologies produce isomorphic $0$-dimensional homology groups if and only if the underlying samples have the same cardinality.

Proof. Assume that, given a neighborhood topology $\tau$, the space $Z$ (we remind the reader that, typically, $Z$ is a closed cube in $\mathbb{R}^d$) is endowed with a probability measure (on the Borel $\sigma$-algebra of $Z$) supported in $\mathcal{X}$ and such that each corresponding cluster has positive measure. Consider the presence of some sampled perturbation, i.e., an element of a measurable set endowed with a probability function. One issue that motivated the dilemma on over-coarse versus over-refined topologies (in the same sense as the Bias-Variance dilemma) in this paper is that, whenever an element of any cluster becomes unrelated to any neighborhood in $\tau$, the topology changes as well as its homology group $H_0$. Then, we formulate the probability of the topology being "cut" (or divided) as

(2)

Note that, if the neighborhoods shrink, i.e., if $\tau$ tends to over-refinement, and the measure of each cluster does not depend on the number of points, then the probability of a cut tends to one for every perturbation. In this case, the topological space determined by the perturbed sample is not homeomorphic to the original one, nor are their $H_0$ homology groups isomorphic. On the other hand, if the clusters cover $\mathcal{X}$, then the probability of a cut is zero for every perturbation and, therefore, $H_0$ is preserved.

The neighborhood topology could also produce an over-coarse cluster consisting only of the universal set, giving rise to homeomorphic spaces for every perturbation, but overlooking homology classes of degree greater than zero that could appear if the topology were refined. Thus, we conclude that there is a trade-off between probabilistic consistency and data representation, as depicted in the Bias-Variance dilemma (mello18). This leads to the same common issue resulting from SLT, in which an SML algorithm should never produce all possible classification functions, given that this would lead to overfitting (Vapnik95). In the Data Clustering context, a clustering algorithm must not produce all possible neighborhood topologies, because this would lead to the unstable clusterings related to over-refined topologies.

5 Generalization bounds for homology groups

The trade-off resulting from the refinement of the neighborhood topology can be studied using persistent homology, from the most refined topological space to the coarsest one. In this sense, we have inclusions of topological spaces $\mathbb{X}^{a_0} \subseteq \mathbb{X}^{a_1} \subseteq \cdots \subseteq \mathbb{X}^{a_t}$, known as a filtration, which correspond to the levels of a hierarchical clustering (here, each level arises from a neighborhood topology as in Remark 3). Our proposal is to find suitable neighborhood topologies, resulting from hierarchical clusterings, that consistently represent the persistence of homology classes even when data is subject to perturbations.

Along the filtration, new simplices may fill or create cycles, or merge or create connected components, according to a tame function:

Definition 4 (Tame functions)

Let $\mathbb{X}$ be a topological space. A tame function is a continuous map $f \colon \mathbb{X} \to \mathbb{R}$ such that $\mathbb{X}^a = f^{-1}((-\infty, a]) \subseteq \mathbb{X}^b = f^{-1}((-\infty, b])$ whenever $a \leq b$, where each $\mathbb{X}^a$ is taken with the subspace topology. Moreover, a tame function must satisfy the following properties:

  • The homology groups $H_k(\mathbb{X}^a)$ are of finite rank for every $a \in \mathbb{R}$ and $k \in \mathbb{N}$;

  • There are finitely many values $a_i \in \mathbb{R}$ such that $H_k(\mathbb{X}^{a_i - \epsilon})$ and $H_k(\mathbb{X}^{a_i + \epsilon})$ are not isomorphic for arbitrarily small $\epsilon > 0$; the $a_i$'s are called the critical values of $f$.

Whenever a homology class emerges at $b$ and vanishes at $d$, given $b \leq d$, its persistence is defined as $d - b$. Note that this process generalizes the notion of a Morse function (morse1947).

Roughly speaking, given $a \leq b$, a persistent homology group identifies the homology classes which are present along the interval $[a, b]$ of the co-domain of the tame function. More precisely, persistent homology groups are defined as follows.

Definition 5 (Persistent Homology Group)

Let $\mathbb{X}$ be a topological space equipped with the filtration that arises from a tame function $f$, as in the previous definition. Given $a \leq b$, we have the inclusion $\mathbb{X}^a \hookrightarrow \mathbb{X}^b$. The persistent homology group of degree $k$ is the image of the induced homomorphism

$$h_k^{a,b} \colon H_k(\mathbb{X}^a) \longrightarrow H_k(\mathbb{X}^b). \qquad (3)$$

The rank of the image of $h_k^{a,b}$ is called the $(a,b)$-persistent Betti-number, i.e., $\beta_k^{a,b} = \operatorname{rank} \operatorname{im} h_k^{a,b}$. It allows us to compute the number of homology classes persisting in the interval $[a, b]$. In the context of Hierarchical Clustering, we consider a model to be adequate whenever the corresponding homology classes are likely to persist. We claim that over-refined topologies do not provide homology classes that tend to persist when data is subject to perturbations given by instance inclusions.
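For illustration, persistent Betti-numbers over a chosen interval can be computed with GUDHI (a sketch under our own sampling and interval choices; a noisy circle should typically yield one persistent component and one persistent loop):

```python
# Sketch: (a,b)-persistent Betti-numbers of a Rips filtration, assuming
# the GUDHI library; the point cloud and interval [a, b] are illustrative.
import numpy as np
import gudhi

rng = np.random.default_rng(4)
theta = rng.uniform(0.0, 2.0 * np.pi, 60)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0.0, 0.05, (60, 2))

rips = gudhi.RipsComplex(points=X, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
st.persistence()   # computes the persistence pairs

a, b = 0.3, 1.0    # classes alive over the whole interval [a, b]
print(st.persistent_betti_numbers(a, b))   # e.g. [1, 1]: component + loop
```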

In order to prove that homology groups are not stable for over-refined topologies, let us consider perturbations on some dataset $X_n$ given by the inclusion of $m$ new instances, such that $X_n \subseteq X_{n+m}$. Let $\tau$ be a neighborhood topology with its corresponding topological space, and assume that there exists a tame function giving rise to a filtration as in Definition 4. New data must be identically and independently sampled from the unknown underlying space $\mathcal{X}$ according to an unknown probability distribution, producing a new neighborhood topology; if the $k$-dimensional homology group of the new space is not isomorphic to that of the original one, then the clusters are not consistent according to these homology groups ($k$-homology consistency). For simplicity and without loss of generality, from now on we denote the topological spaces produced from the neighborhood topologies of $X_n$ and $X_{n+m}$ by $\mathbb{X}_n$ and $\mathbb{X}_{n+m}$, respectively.

The impact of acquiring new samples can be analyzed by considering how the corresponding topological space coarsens as each new element is sampled, forming the inclusions $\mathbb{X}_n \subseteq \mathbb{X}_{n+1} \subseteq \cdots \subseteq \mathbb{X}_{n+m}$. This sequence of topological inclusions induces another filtration; assume that there exists some tame function realizing it. The inclusions along the tame-function levels and along the data inclusions define a bifiltration (carlsson2009multi), as follows:

such that, for every pair of levels $a$ and $b$, with $a \leq b$, and every data inclusion, the following diagram is commutative:
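The original diagram did not survive extraction; the following is a standard reconstruction under the notation assumed in this rewrite ($\mathbb{X}^a_n$ denotes the sublevel space at filtration value $a$ built from $n$ samples, $h_k^{a,b}$ the map induced by the filtration inclusion, and $\mu_k$ the map induced by data inclusion):

$$\begin{array}{ccc}
H_k(\mathbb{X}^{a}_{n}) & \xrightarrow{\ h_k^{a,b}\ } & H_k(\mathbb{X}^{b}_{n}) \\
\big\downarrow {\scriptstyle \mu_k^{a}} & & \big\downarrow {\scriptstyle \mu_k^{b}} \\
H_k(\mathbb{X}^{a}_{n+m}) & \xrightarrow{\ h_k^{a,b}\ } & H_k(\mathbb{X}^{b}_{n+m})
\end{array}$$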

In this work, we consider that every simplicial complex of $\mathbb{X}_n$ and $\mathbb{X}_{n+m}$ in the filtrations must be associated with the same pre-image value of the tame function in order to allow their comparison. In this sense, note that such a value may not correspond to a critical point for both filtrations. For instance, consider the case of $k = 0$ when a neighborhood topology is constructed using open balls, and let a perturbed neighborhood topology be built with the same radius. If the diameter of the data space is sufficiently greater than that radius, the probability that a new instance forms another cluster is close to one, because the added data is not likely to belong to an existing cluster. This implies that $\mu_0$ is not an isomorphism, where $\mu_k$ is the map induced on homology by the data inclusions at a fixed level of the tame function.

In this sense, $\mathbb{X}_n$ will never represent $\mathbb{X}_{n+m}$ well, and the analyzed topological features will change as new data are included. In order for $\mathbb{X}_n$ to represent $\mathbb{X}_{n+m}$ well, $\mu_k$ must, almost surely, be an isomorphism. Therefore, a DC model can be considered consistent as follows:

Definition 6 (DC $k$-homology consistency)

A DC model associated with a neighborhood topology $\tau$ is $k$-homology consistent if, as $n \to \infty$, for all $m \in \mathbb{N}$, $\mu_k$ is almost surely an isomorphism.

However, hierarchical structures also require the study of a filtration throughout the domain of the tame function. In this sense, we have to consider a set of morphisms $\mu_k^a$ applied over the levels $a$ of the filtration in order to define isomorphisms among them. In the ideal scenario, $\mu_k^a$ is an isomorphism for every level $a$. But note that, in hierarchical clustering, the indices of the filtrations are chosen beforehand, hence the set of morphisms is limited by this choice. Critical points lying in any interval between chosen indices will not be considered and, therefore, there is a resolution problem (in terms of the chosen levels) inherent to this analysis. Still, assuming this limited set of morphisms, the behavior of the persistent homology can be studied as follows:

Lemma 7

Given the persistent homology groups at levels $a \leq b$ of a bifiltration, if $\mu_k^a$ and $\mu_k^b$ are isomorphisms, then the corresponding persistent homology groups of $\mathbb{X}_n$ and $\mathbb{X}_{n+m}$ are isomorphic and their persistent Betti-numbers coincide.

Proof. The proof follows from the commutativity of the bifiltration diagram.

Note that, considering Lemma 7, if the isomorphisms at levels $a$ and $b$ are almost certain as $n \to \infty$, then the persistence is likely to be preserved for any included data, i.e., $\mathbb{X}_n$ consistently represents $\mathbb{X}_{n+m}$ for those values of the tame function. In sequence, we define HC $k$-homology consistency:

Definition 8 (HC $k$-homology consistency)

An HC model associated with a filtration is $k$-homology consistent if, for every level of the filtration, the corresponding DC model is $k$-homology consistent (Definition 6).

In addition, evaluating whether $\mu_k$ is non-isomorphic not only allows one to study the model locally, but also to assess how well $\mathbb{X}_n$ represents $\mathbb{X}_{n+m}$ as $n \to \infty$. In order to prove that, consider the following theorem.

Theorem 9

Given the associated simplicial complexes built up from $X_n$ and $X_{n+m}$, if $\mu_k$ is not an isomorphism, then their persistent Betti-numbers differ as $n \to \infty$.

Proof. Given that simplicial complexes are considered, the homology groups of the sublevel spaces approximate those of an arbitrary unknown compact topological space as $n \to \infty$. Therefore, if $\mu_k$ is not an isomorphism, the persistent Betti-numbers differ as $n \to \infty$, i.e., the homology classes of $\mathbb{X}_{n+m}$ which die, or are created, after the considered level are different when compared to those of $\mathbb{X}_n$.

Corollary 1

If there is at least one critical point inside the considered interval, then the persistent homology groups of $\mathbb{X}_n$ and $\mathbb{X}_{n+m}$ differ as $n \to \infty$.

In this sense, considering levels $a \leq b$, the persistent homology groups of $\mathbb{X}_n$ and $\mathbb{X}_{n+m}$ will be equivalent if $\mu_k$ is an isomorphism. However, even if $\mu_k$ is an isomorphism, there is no guarantee that a critical point between $a$ and $b$ will not produce a non-isomorphic persistent homology group. An adequate representation is then assured when enough points between $a$ and $b$ are chosen so as to characterize all possible critical points in the filtration.

The analysis of the morphism $\mu_k$ determines how the topological features change along the levels of a hierarchical clustering model (given by the persistent homology) when the HC model is subject to data inclusions. Then, a probability measure, somehow associated with $k$-homology groups, is required in order to study the probability that $\mu_k$ is an isomorphism as $n$ increases. In this sense, the $k$-th Betti-number can be taken as such a measure, as follows:

Theorem 10

The $k$-th Betti-number is a measure over a $\sigma$-algebra given by a collection of $k$-homology groups.

Proof. Let $\mathbb{X}$ be a topological space; recovering the definition of the $k$-th Betti-number, we have $\beta_k = \operatorname{rank} H_k(\mathbb{X})$. Considering a finite-dimensional simplicial complex, $H_k(\mathbb{X})$ is a finitely generated abelian group and, therefore, its free part decomposes as a finite direct sum of copies of $\mathbb{Z}$. Hence, a $\sigma$-algebra closed under countable disjoint unions can be constructed from the one-to-one correspondence between disjoint unions of spaces and direct sums of their homology groups (for disjoint spaces, $H_k(\mathbb{X} \sqcup \mathbb{Y}) \cong H_k(\mathbb{X}) \oplus H_k(\mathbb{Y})$). From this construction, we have the following measure function:

which respects

  • Non-negativity, as the rank of an abelian group is non-negative;

  • Null empty set, as the trivial group has rank zero;

  • Countable additivity, as $\operatorname{rank}\big(\bigoplus_i H_k(\mathbb{X}_i)\big) = \sum_i \operatorname{rank} H_k(\mathbb{X}_i)$ for disjoint spaces.

It is worth mentioning that these properties are in one-to-one correspondence with properties of the rank of the abelian group (the $k$-th Betti-number): the rank of the trivial group is zero, and the rank of a finite direct sum equals the sum of the ranks.

Therefore, $\beta_k$ is a measure on a $\sigma$-algebra of disjoint sets, from which a probability measure can be defined with expected value $\mathbb{E}[\beta_k]$.

Hence, topological features can be measured by $k$-th Betti-numbers, allowing the comparison among topological spaces built from a Data Clustering or Hierarchical Clustering algorithm. The associated $k$-th Betti-number depends on the inclusion of new i.i.d. samples, leading to the stochastic process $\{\beta_k(\mathbb{X}_i)\}_{i \in \mathbb{N}}$. In this sense, as bounded variation is required to guarantee that a DC or HC model is consistent, $\beta_k$ should behave as a martingale (ville1939), respecting the following properties

$$\mathbb{E}\big[\,|\beta_k(\mathbb{X}_i)|\,\big] < \infty, \qquad \mathbb{E}\big[\beta_k(\mathbb{X}_{i+1}) \mid \beta_k(\mathbb{X}_1), \ldots, \beta_k(\mathbb{X}_i)\big] = \beta_k(\mathbb{X}_i),$$

which are associated with a difference of martingales in the form

$$\mathbb{E}\big[\beta_k(\mathbb{X}_{i+1}) - \beta_k(\mathbb{X}_i) \mid \beta_k(\mathbb{X}_1), \ldots, \beta_k(\mathbb{X}_i)\big] = 0.$$

Consequently, Equation 4 calculates the expectation of arising and vanishing $k$-th degree homology classes produced by the inclusion of new data points:

(4)

which implies

(5)

As foretold, $\beta_k$ is considered to behave as a martingale; hence Azuma's Inequality (azuma1967) (Inequality 12 in Appendix C) allows us to study the convergence of the following generalization measure:

(6)

and, therefore, of $\mu_k$. Employing such a generalization measure, we formulate the DC $k$-homology convergence lemma (the proof is given in Appendix C):

Lemma 11 (DC $k$-homology convergence)

Let $X_n$ and $X_{n+m}$ be two datasets with i.i.d. samples such that $X_n \subseteq X_{n+m}$, let $\mathbb{X}$ denote the neighborhood topological space built up from a point cloud, and let $\beta_k$ be the $k$-th Betti-number calculated upon the neighborhood topologies. The probability that the average absolute difference of Betti-numbers exceeds $\epsilon$ decays exponentially with respect to

(7)

given that the martingale differences are bounded.

Then, whenever the difference between the numbers of $k$-homology classes of $\mathbb{X}_n$ and $\mathbb{X}_{n+m}$ increases asymptotically faster than this bound allows, there is no guarantee that, on average, $\mathbb{X}_n$ approximates $\mathbb{X}_{n+m}$. Conversely, if that difference is asymptotically smaller, the bound converges and, therefore, $\mu_k$ is likely to be an isomorphism.
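To make the role of Azuma's inequality concrete, the sketch below tracks $\beta_0$ along single-point inclusions and evaluates the standard two-sided Azuma bound $P(|S_m - S_0| \geq \epsilon) \leq 2\exp\!\left(-\epsilon^2 / (2 \sum_i c_i^2)\right)$, with $c_i$ bounding the $i$-th martingale difference; the dataset, $\epsilon$, and radius are ours and purely illustrative.

```python
# Sketch: empirical martingale differences of beta_0 under data inclusion
# and the corresponding (standard, two-sided) Azuma bound. Illustrative.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def beta0(points, eps_ball):
    adj = squareform(pdist(points)) <= eps_ball
    return connected_components(csr_matrix(adj))[0]

rng = np.random.default_rng(5)
X, eps_ball, m = rng.uniform(size=(100, 2)), 0.2, 30

diffs, prev = [], beta0(X, eps_ball)
for _ in range(m):                       # include one sample at a time
    X = np.vstack([X, rng.uniform(size=(1, 2))])
    cur = beta0(X, eps_ball)
    diffs.append(abs(cur - prev))        # observed martingale differences
    prev = cur

c = max(diffs) or 1                      # empirical bound on the differences
epsilon = 5.0
bound = 2.0 * np.exp(-epsilon**2 / (2.0 * m * c**2))
print(sum(diffs), bound)                 # total drift vs. Azuma's bound
```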

Remark. For instance, considering the case of $k = 0$: if each new sample increases the number of connected components by one, i.e., if an over-refinement occurs, the martingale differences satisfy $|\beta_0(\mathbb{X}_{i+1}) - \beta_0(\mathbb{X}_i)| = 1$ for every inclusion, so their sum grows linearly with the number of included samples; consequently, in this case, the right-hand side of Inequality 7 diverges and, hence, there is no guarantee that $\mu_0$ is an isomorphism.

Corollary 2

An over-refinement occurs whenever

and convergence of the right-hand term in Inequality 7 occurs whenever

Therefore, whenever the growth of $\beta_k$ stays within the bound described above, $\mu_k$ is likely to be an isomorphism. In this sense, a maximum $\beta_k$ bounded as described in Corollary 2 also guarantees consistency.

Corollary 3

Let $X_n$ be a dataset and $m$ be the number of new samples added to it,

(8)

The result of Lemma 11 can, therefore, be extended to the HC problem. Remember that every simplicial complex of $\mathbb{X}_n$ and $\mathbb{X}_{n+m}$ in the filtrations must be associated with the same pre-image of the tame function. In this setting, the convergence for HC models in terms of $k$-th Betti-numbers is given by the following theorem.

Theorem 12 (HC $k$-homology convergence)

Let the pre-image of the tame function be associated with the filtrations, and consider a bijective function which maps the points of that pre-image onto the indices of such filtrations. Then, the DC $k$-homology convergence for each one of those points is given by

with

where the leading factor is the cardinality of the pre-image of the tame function.

Corollary 4

Considering Theorem 12, and taking the corresponding limit, the HC $k$-homology convergence is guaranteed whenever

(9)
Corollary 5

If is constant, convergence of Equation 9 is equivalent to

Remark. This parameter can be taken as constant, for example, in partitional clustering algorithms such as K-means (Steinhaus56; Macqueen67), K-medoids (kaufmanl1987), and Self-Organizing Maps (Kohonen1982), which rely on optimization procedures, typically based on distances between data points and pre-determined centroids. Considering an i.i.d. data distribution and enough samples, those centroids will not vary substantially as new data are acquired, hence producing the same number of connected components, as sketched below. However, as the number of centroids increases with respect to the newly acquired samples, those algorithms will likely identify every point as a single cluster, and HC $0$-homology consistency is not guaranteed.
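The following sketch illustrates the remark (our own synthetic setup): with a fixed number $k$ of centroids and i.i.d. samples from the same mixture, K-means keeps producing $k$ components as the sample grows.

```python
# Sketch of the remark above: a fixed number of centroids k yields a
# constant number of connected components (beta_0 = k) as n grows.
# The mixture and parameter values are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
k = 3
centers = rng.uniform(-5.0, 5.0, size=(k, 2))

def sample(n):
    """i.i.d. draws around the same fixed mixture centers."""
    return np.vstack([c + rng.normal(0.0, 0.3, size=(n, 2)) for c in centers])

for n in (50, 200, 800):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sample(n))
    print(n, len(set(labels)))   # stays k as the sample size increases
```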

Considering the $H_0$ homology group, generalization is guaranteed for every critical point greater than the first one ensuring DC $0$-homology consistency, as proved in the following lemma:

Lemma 13

Assuming the $H_0$ homology group, we have that, for every pair of critical points $a \leq b$: if the model at level $a$ is DC $0$-homology consistent, then the model at level $b$ is also DC $0$-homology consistent.

Proof. As the level $a$ is DC $0$-homology consistent, following Lemma 11, the corresponding Betti-number differences must be bounded and, therefore, the differences at level $b$ are also bounded, since coarser levels can only merge connected components. Hence, the level $b$ is stable and, thus, DC $0$-homology consistent.

Following these consistency results, we conclude that kleinberg02's richness axiom must be relaxed, as it admits partitions which do not guarantee $0$-homology consistency. Section 6 discusses how our results relate to carlsson10's consistency for their adapted single-linkage method.

6 An ultrametric perspective of Coarse-Refinement dilemma

In (carlsson10), the hierarchical clusters of the agglomerative single-linkage algorithm are described as dendrograms, which are equivalent to ultrametric spaces. Those ultrametric spaces were employed to support carlsson10's study of the convergence and stability of their proposed version of the single-linkage algorithm. In this sense, an HC model built by single-linkage approximates the unknown underlying ultrametric space in the Gromov-Hausdorff distance (gromov1981) as the number of samples increases. Hence, as the sample size increases, the HC model becomes isometric to the underlying ultrametric space.
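The dendrogram/ultrametric equivalence can be checked numerically: for single-linkage, the cophenetic distance between two points (the merge height of their clusters) satisfies the strong triangle inequality $u(x, z) \leq \max(u(x, y), u(y, z))$. A sketch with SciPy, under our own random data:

```python
# Sketch: single-linkage cophenetic distances form an ultrametric.
import numpy as np
from itertools import permutations
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 2))

Z = linkage(pdist(X), method="single")
U = squareform(cophenet(Z))      # merge heights as a distance matrix

# verify the strong (ultrametric) triangle inequality on all triples
ok = all(U[i, k] <= max(U[i, j], U[j, k]) + 1e-12
         for i, j, k in permutations(range(len(X)), 3))
print(ok)   # True
```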

More precisely, in (carlsson10), a hierarchical clustering algorithm produces a compact metric space from a finite collection of disjoint compact subsets of a compact metric space. The distance function of such a metric space is defined by a linkage function, from which