Intrinsic dimension of concept lattices
Geometric analysis is a powerful theory for understanding the influence of the high dimensionality of input data in machine learning (ML) and knowledge discovery (KD). With our approach we can assess to what extent the application of a specific KD/ML algorithm to a concrete data set is prone to the curse of dimensionality. To this end we extend V. Pestov’s axiomatic approach to the intrinsic dimension of data sets, based on the seminal work by M. Gromov on concentration phenomena, and provide an adaptable and computationally feasible model for studying observable geometric invariants associated to features that are natural to both the data and the learning procedure. In detail, we investigate data represented by formal contexts and give first theoretical as well as experimental insights into the intrinsic dimension of a concept lattice. Because of the correspondence between formal concepts and maximal cliques in graphs, applications to social network analysis are at hand.
Keywords: Intrinsic Dimension, Formal Concept Analysis, Geometric Analysis, Curse of Dimensionality
1 Introduction
One of the essential challenges in data-driven research is to cope with sparse and high-dimensional data sets. Various machine learning (ML) and knowledge discovery (KD) procedures are susceptible to the so-called curse of dimensionality, an effect that is closely linked to the phenomenon of concentration of measure. The latter was discovered by V. Milman [Milman2010, Milman1983, Milman1988] and is also known as the Lévy property. Despite its frequent occurrence, there is so far no comprehensive computable approach to decide beforehand if and to what extent a data set will be affected by the phenomenon of measure concentration.
A valuable step towards an indicator for concentration is the axiomatic approach to intrinsic dimension of data by V. Pestov [Pestov0, Pestov1, Pestov2], which involves modeling data sets as metric spaces with measures and utilizing geometric analysis for their quantitative assessment. His work is based on M. Gromov’s observable distance between metric measure spaces [Gromov99, Chapter 3.H] and uses observable invariants to define concrete instances of dimension functions. However, despite its mathematical elegance, this approach is computationally infeasible, as discussed in [Pestov1, Section IV] and [Pestov2, Sections 5, 8], since it amounts to computing the set of all real-valued 1-Lipschitz functions on a metric space.
In the present paper, we overcome the issue above by modifying the considered mathematical model and, accordingly, enlarging the axiomatic system. More precisely, instead of a measured metric space, our model for a data structure consists of a measured topological space together with a set of functions, features, which are supposed to be both computationally feasible and adaptable to the representation of data as well as the respective ML or KD procedure. On this class of data structures, we propose an extended version of V. Pestov’s axiomatic system, i.e., a notion of a dimension function. We then adapt M. Gromov’s notion of observable diameters to our data structures to establish a concrete instance of such a dimension function.
As a first illustration of our approach, we show how to represent concept lattices as data structures of the above kind and calculate their intrinsic dimension. Although the latter is a geometric quantity whose origin is unrelated to FCA, it does reflect the sensitivity of a concept lattice to a natural order-theoretic deformation. We conclude our work by computing and discussing this empirically for seven real-world data sets and corresponding null models. These first computations suggest that the intrinsic dimension of concept lattices carries information not captured by other invariants of FCA.
2 Related Work
The term intrinsic dimension is widely used in various signal or data related fields. Most of them consider only the distribution of variables for their respective notion of intrinsic dimension. One essential new idea by V. Pestov [Pestov0, Pestov1, Pestov2], apart from utilizing concentration invariants, was to incorporate a general notion of features together with the distributions.
In the field of FCA there are various quantities of interest on formal contexts and particular formal concepts, such as concept stability or concept probability; for more on this we refer the reader to [KuznetsovM16]. Most of them can be lifted in order to describe the whole concept lattice, e.g., average stability. The majority of those quantities are either motivated from within FCA or were identified in related fields like database theory, by translating similar conceptual ideas to FCA. The remaining ones are usually introduced empirically into the realm of FCA by showing useful correlations. The order dimension of posets, which translates to the Ferrers dimension of a formal context [fca-book], is a common dimension concept in FCA. However, it is not considered in this work.
3 Data structures and concentration
In this section, we propose a mathematical model for data structures (Definition 3.1), which is accessible to methods of geometric analysis. Subsequently, we introduce a pseudo-metric on the collection of all data structures (Definition 3.3), a variant of Gromov’s observable distance [Gromov99, Chapter 3.H], which was proposed as a tool for analyzing high-dimensional data by Pestov [Pestov1, Pestov2].
Before stating our approach to data structures, we briefly recollect some bits of terminology and notation. Let be a topological space. Recall that is said to be Polish if is separable and there exists a complete metric generating the topology of . Furthermore, as usual, a set will be called (pointwise) equicontinuous if, for any and , there exists a neighborhood of in such that for all . Given two measurable spaces and , the push-forward measure of a measure on with respect to a measurable map is the measure on defined by for every measurable . For a measure on a measurable space and a measurable , the measure on the induced measure space is given by for every measurable . We denote by the normalized counting measure on a finite non-empty set , i.e., for .
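For reference, the two measure-theoretic constructions recalled above are usually written as follows. The symbols $S$, $T$, $f$, $\mu$, $f_*(\mu)$, $X$ and $\nu_X$ are notation assumed for this note, chosen to match the standard textbook definitions of the push-forward measure and the normalized counting measure.

```latex
f_*(\mu)(B) := \mu\bigl(f^{-1}(B)\bigr)
  \quad \text{for every measurable } B \subseteq T ,
\qquad
\nu_X(B) := \frac{|B|}{|X|}
  \quad \text{for } B \subseteq X .
```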
Definition 3.1 (data structure).
A data structure is a triple consisting of a Polish space together with a Borel probability measure on and an equicontinuous set of real-valued functions on , where the elements of will be referred to as the features of . We call the trivial data structure. Given a data structure , we let
We will call two data structures isomorphic and write if there exists a homeomorphism such that and .
Since any separable metric space has cardinality less than or equal to , the collection of all isomorphism classes of data structures constitutes a set, which we denote by . Before introducing the above-mentioned distance function on , let us provide some necessary prerequisites.
It is a well-known fact that every Borel probability measure on a Polish space admits a parametrization, i.e., a Borel map with for the Lebesgue measure on (see, e.g., [ShioyaBook, Lemma 4.2]). With regard to data structures, we note the following observation.
Let be a data structure and let be any two parametrizations of . Then, for every , there exist Borel isomorphisms with and .
Let . Since is equicontinuous and is second-countable, we find a sequence of pairwise disjoint Borel subsets such that
for all .
Let . For each , let and . According to [KechrisBook, (17.41)], for each there exists a Borel isomorphism such that . The map defined by for all is a Borel isomorphism with and for each . Similarly, we find a Borel isomorphism with and for all . It remains to show that . Indeed, for every , there is with , whence and therefore . ∎
Given a pseudo-metric space , the Hausdorff distance between two subsets is defined as
where for any , . For a probability measure on a measurable space , we define the pseudo-metric on the set of all measurable real-valued functions on by
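For orientation, the standard forms of these two notions are sketched below. The symbols $(X,d)$, $A$, $B$, $B_d(A,\varepsilon)$ and $\mathrm{me}_\mu$ are assumptions of this note; in particular, the second formula is the Ky Fan-type pseudo-metric commonly used for convergence in measure, which takes values in $[0,1]$ for a probability measure, and it is given here only as a plausible reading.

```latex
d_H(A,B) := \inf\bigl\{ \varepsilon > 0 \bigm|
    A \subseteq B_d(B,\varepsilon),\ B \subseteq B_d(A,\varepsilon) \bigr\},
\qquad
B_d(A,\varepsilon) := \{ x \in X \mid \exists a \in A \colon d(a,x) < \varepsilon \},

\mathrm{me}_\mu(f,g) := \inf\bigl\{ \varepsilon \geq 0 \bigm|
    \mu\bigl(\{ x \mid |f(x)-g(x)| > \varepsilon \}\bigr) \leq \varepsilon \bigr\}.
```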
Everything is prepared to introduce the following adaptation of Gromov’s observable distance [Gromov99, Chapter 3.H] to our setup of data structures.
Definition 3.3 (observable distance).
We define the observable distance between two data structures and to be
Note that is symmetric, ranges in (as does), and assigns the value to identical pairs. Moreover, it is easy to see that is invariant under isomorphisms of data structures, i.e., for any two pairs of isomorphic data structures . Henceforth, we will identify with the induced function on . This map is a pseudo-metric, thanks to the following fact.
For any three data structures ,
We need to prove that for all . To this end, let and pick parametrizations for , and for , and for such that and . Thanks to Lemma 3.2, there exist Borel isomorphisms with and
Evidently, is a parametrization for , while is a parametrization for . In turn,
The pseudo-metric induces a topology on , the concentration topology.
Definition 3.5 (concentration of data).
A sequence of data structures is said to concentrate to a data structure if
The concentration topology is a conceptual extension of measure concentration phenomena, described by the Lévy property.
A sequence of data structures is said to have the Lévy property or to be a Lévy family, resp., if
Let us point out the connection between the Lévy property and the observable distance of data structures.
For every data structure ,
In particular, a sequence of data structures has the Lévy property if and only if concentrates to the trivial data structure.
4 Observable diameters of data
In this section, we adapt the concept of observable diameters [Gromov99, Chapter 3] to our data structure setup and study its behavior with respect to the concentration topology. This is a necessary preparatory step towards the results of Section 5.
Definition 4.1 (observable diameter).
Let . The -partial diameter of a Borel probability measure on is defined as
We define the -observable diameter of a data structure to be
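To make the two notions concrete, the following minimal Python sketch computes them for a finite space. It assumes Gromov's usual form of the definitions: the partial diameter of a measure on the real line at level $1-\alpha$ is the least diameter of a Borel set of mass at least $1-\alpha$, and the observable diameter is the supremum of the partial diameters of the push-forward measures over all features. The function names and the toy data are hypothetical.

```python
from fractions import Fraction


def partial_diameter(values, weights, alpha):
    """Least width of a run of atoms carrying mass >= 1 - alpha.

    `values` and `weights` describe a finitely supported probability measure
    on the real line. For such a measure an optimal Borel set can be taken to
    be a run of consecutive atoms, so scanning all windows suffices.
    """
    atoms = sorted(zip(values, weights))
    best = None
    for i in range(len(atoms)):
        mass = 0
        for j in range(i, len(atoms)):
            mass += atoms[j][1]
            if mass >= 1 - alpha:
                width = atoms[j][0] - atoms[i][0]
                if best is None or width < best:
                    best = width
                break  # extending the window further only widens it
    return best


def observable_diameter(points, mu, features, alpha):
    """Largest partial diameter among the push-forwards of mu by the features."""
    return max(
        (partial_diameter([f(x) for x in points], [mu[x] for x in points], alpha)
         for f in features),
        default=0,
    )


# Toy example (hypothetical data): four equally weighted points, two features.
points = [0, 1, 2, 3]
mu = {x: Fraction(1, 4) for x in points}
features = [lambda x: x, lambda x: 5]
```

Exact rational weights are used so that the comparison with $1-\alpha$ is not affected by floating-point rounding. The constant feature contributes diameter zero, which illustrates why constant features are negligible.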
Observable diameters are invariant under isomorphisms of data structures, that is, for any pair of isomorphic data structures and . Furthermore, we have the following continuity with respect to the pseudo-metric .
Consider any two data structures and let . For all and ,
Let . It suffices to check that
Let . Choose two parametrizations, for and for , such that . Let . Then there exists some such that . Fix any Borel subset with
and . For the open subset ,
and , which proves that . ∎
In Proposition 4.3 below, we introduce a quantity for data structures, which is well defined due to the following fact.
Given an arbitrary data structure , the function
is antitone, thus Borel measurable.
The map defined by
is Lipschitz with respect to .
Let for data structures . Without loss of generality, we assume that . For every ,
due to Lemma 4.2. Hence, . Thanks to symmetry, it follows that , i.e., is -Lipschitz with respect to . ∎
Observable diameters reflect the Lévy property in a natural manner.
Let be a sequence of data structures. Then the following are equivalent.
has the Lévy property.
for every .
(1)⟹(2). Let . To see that , consider any . By assumption, there exists such that
We argue that for all with . Let . For every , there exists such that , whence for the Borel subset , which has . Therefore, for all , that is, .
(2)⟹(1). Let . Thanks to our hypothesis, there exists such that for all . We will show that
Let . For any and , we find a (necessarily non-empty) Borel subset with and , and observe that for any . Thus, .
(2)⟹(3). This follows from Lebesgue’s dominated convergence theorem.
(3)⟹(2). By Remark 1, we have for any data structure and any . Consequently, if , then for every , as desired. ∎
5 Intrinsic dimension
Below we propose an axiomatic approach to intrinsic dimension of data structures (Definition 5.1), a modification of Pestov’s ideas [Pestov1] suited for our concept of data structures. As argued in [Pestov1, Pestov2], it is desirable for a reasonable notion of intrinsic dimension to agree with our geometric intuition in the way that the value assigned to the Euclidean -sphere , viewed as a data structure, would be in the order of . To turn this idea into an axiom, let us fix some additional notation. For an integer , we consider the data structure where is the unique rotation invariant Borel probability measure on and the set of all real-valued -Lipschitz functions on .
A mapping is called a dimension function if it satisfies the following conditions:
(1) Axiom of concentration: A sequence has the Lévy property if and only if .
(2) Axiom of continuity: If a sequence concentrates to , then as .
(3) Axiom of antitonicity: If and are data structures with , then .
(4) Axiom of negligibility of constant features: for all .
(5) Axiom of normalization: . (Footnote 1: Given two functions , we say that and asymptotically have the same order of magnitude and write if there exist and such that for all .)
In presence of the axioms (2) and (4), the axiom of concentration may be replaced equivalently by the condition that . This is due to Proposition 3.7.
The map defined by
is a dimension function.
Note that is well defined on , since is invariant under isomorphisms of data structures, i.e., for any pair of isomorphic data structures . Moreover, satisfies the axiom of concentration by Proposition 4.4 and the axiom of continuity by Proposition 4.3. Also, it is not difficult to see that, for every and data structures and with , we have as well as , which readily implies that satisfies the axioms of antitonicity and negligibility of constant features. Finally, a straightforward argument building on [ShioyaBook, Proposition 2.26] and [Pestov2, Example 7] shows that , i.e., satisfies the axiom of normalization. ∎
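As a heuristic consistency check for the axiom of normalization, suppose that the dimension function is the inverse square of the integrated observable diameter. This form, as well as the symbols $\Delta$, $\partial_\Delta$ and the constant $c$, are assumptions of this note. The classical fact that observable diameters of the unit sphere $\mathbb{S}^n$ are of order $n^{-1/2}$ then yields linear growth in $n$.

```latex
\Delta(\mathscr{D}) := \int_0^1 \mathrm{ObsDiam}(\mathscr{D}; -\alpha)\, d\alpha ,
\qquad
\partial_\Delta(\mathscr{D}) := \frac{1}{\Delta(\mathscr{D})^{2}} ,

\Delta(\mathbb{S}^n) \approx c \, n^{-1/2}
\quad \Longrightarrow \quad
\partial_\Delta(\mathbb{S}^n) \approx \frac{n}{c^{2}} .
```

The squaring is what turns the $n^{-1/2}$ decay of the observable diameter into a dimension of order $n$. A sequence with the Lévy property has $\Delta \to 0$ and hence dimension tending to infinity.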
6 Intrinsic dimension of concept lattices
As a first application of the introduced intrinsic dimension setup we choose formal contexts as used in Formal Concept Analysis (FCA). These data tables are natural in the sense that they are widely used in data science far beyond FCA. Let us recall the basic notions of FCA that are essential for the rest of this work. For a detailed introduction to FCA we refer to [fca-book]. Let be a formal context, i.e., an ordered triple consisting of two non-empty sets and and a relation . As usual, the elements of are called the objects of and the elements of are called the attributes of , while is referred to as the incidence relation of . The usual visualization for such a formal context is a context table as shown in Figure 1. We call a formal context empty if its incidence relation is empty. For subsets and , put
The elements of are called the formal concepts of . We endow with the partial order given by
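The notions just recalled can be illustrated with a small Python sketch using the standard FCA definitions: the derivation of an object set is the set of attributes shared by all its objects (and dually for attribute sets), a formal concept is a pair (extent, intent) of sets that are each other's derivation, and concepts are ordered by inclusion of extents. The function names are hypothetical and the enumeration is a brute-force pass over all attribute subsets, so it is only suitable for tiny contexts.

```python
from itertools import combinations


def intent(objs, attributes, incidence):
    """Attributes shared by every object in `objs`."""
    return frozenset(m for m in attributes
                     if all((g, m) in incidence for g in objs))


def extent(attrs, objects, incidence):
    """Objects having every attribute in `attrs`."""
    return frozenset(g for g in objects
                     if all((g, m) in incidence for m in attrs))


def concepts(objects, attributes, incidence):
    """All formal concepts (A, B), found by closing every attribute subset."""
    found = set()
    for k in range(len(attributes) + 1):
        for subset in combinations(attributes, k):
            a = extent(subset, objects, incidence)
            b = intent(a, attributes, incidence)
            found.add((a, b))
    return found


def leq(c1, c2):
    """Concept order: (A, B) <= (C, D) iff A is a subset of C."""
    return c1[0] <= c2[0]


# Three small contexts on objects and attributes {1, 2, 3}.
N = [1, 2, 3]
nominal = {(i, i) for i in N}                            # g I m iff g == m
contranominal = {(g, m) for g in N for m in N if g != m}  # g I m iff g != m
empty = set()                                             # empty incidence
```

For these three contexts the enumeration returns five, eight and two concepts, respectively. The empty context only has the top concept (all objects, no attributes) and the bottom concept (no objects, all attributes).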
6.0.1 Concept lattices as data structures.
In order to assign an intrinsic dimension to a concept lattice, we need to transform a formal context into a data structure in accordance with Definition 3.1. The crucial step here is a meaningful choice for the set of features, which should reflect both the distribution of data as well as the variables essential for the utilized ML procedure, or investigated KD process, respectively. Holding on to this idea, we propose the following construction.
We define the data structure associated to a finite formal context to be with the feature set
Let us unravel Definition 4.1 for data structures arising from formal contexts.
Let be a finite formal context and let . For every ,
Note that in the special case of an empty context the observable diameter of the associated data structure is zero, in accordance with Definition 4.1.
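The following Python sketch evaluates the observable diameter of a context under one specific reading, which is an assumption of this note: the underlying space is the attribute set with the normalized counting measure, and each concept (A, B) contributes the feature that takes the value |A|/|G| on B and 0 elsewhere. Such a feature pushes the measure forward to at most two atoms, so its partial diameter is |A|/|G| exactly when the relative size of B lies strictly between $\alpha$ and $1-\alpha$, and 0 otherwise. The function name and the example data are hypothetical.

```python
from fractions import Fraction


def context_observable_diameter(concepts, n_objects, n_attributes, alpha):
    """Assumed closed form: max |A|/|G| over concepts (A, B)
    with alpha < |B|/|M| < 1 - alpha, and 0 if no concept qualifies."""
    candidates = [
        Fraction(len(a), n_objects)
        for a, b in concepts
        if alpha < Fraction(len(b), n_attributes) < 1 - alpha
    ]
    return max(candidates, default=Fraction(0))


# Concepts of the nominal scale and of the empty context on {1, 2, 3}.
full = frozenset({1, 2, 3})
nominal_concepts = [(full, frozenset()), (frozenset(), full)] + [
    (frozenset({i}), frozenset({i})) for i in (1, 2, 3)
]
empty_concepts = [(full, frozenset()), (frozenset(), full)]
```

In this reading the empty context has observable diameter zero for every $\alpha$, because its two intents have relative size 0 and 1 and therefore never lie strictly between $\alpha$ and $1-\alpha$.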
6.0.2 Intrinsic dimension of scales.
There are various formal contexts used for scaling non-binary attributes into binary ones. Those contexts usually do not reflect real-world data themselves. However, investigating them provides a first intuition for the intrinsic dimension of concept lattices.
The most common scales are the nominal scale, , and the contranominal scale, , where for a natural number . For , they are shown in Figure 1, accompanied by the empty formal context on three objects and three attributes. The intrinsic dimension for those contexts can be computed as follows. For the contranominal scale, a straightforward application of the trapezoidal rule reveals that
Hence, . For the nominal scale, in a similar manner, we see that , which diverges to infinity as the size of the scale grows. In the latter case, we observe that our intrinsic dimension appropriately reflects the curse of dimensionality as the number of attributes increases.
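The computation for scales can be sketched in Python under two assumptions of this note: the observable diameter of a context at level $\alpha$ is the maximum of |A|/|G| over concepts (A, B) with $\alpha$ < |B|/|M| < $1-\alpha$, and the intrinsic dimension is the inverse square of the integral of this step function over $\alpha$ in [0, 1]. Only the sizes of extents and intents matter, so the scales are described by size pairs. All names are hypothetical, and the values produced are those of this sketch.

```python
from fractions import Fraction


def integrated_observable_diameter(sizes, n_objects, n_attributes):
    """Exact integral over alpha in [0, 1] of the assumed step function
    alpha -> max{|A|/|G| : alpha < |B|/|M| < 1 - alpha}.

    `sizes` lists the pair (|A|, |B|) for every formal concept (A, B).
    """
    # A concept is active exactly while alpha < min(nu_M(B), 1 - nu_M(B)).
    levels = []
    for a, b in sizes:
        nu_b = Fraction(b, n_attributes)
        levels.append((min(nu_b, 1 - nu_b), Fraction(a, n_objects)))
    cuts = sorted({t for t, _ in levels if t > 0})
    total, left = Fraction(0), Fraction(0)
    for right in cuts:
        # On [left, right) the active concepts have threshold >= right.
        height = max(v for t, v in levels if t >= right)
        total += (right - left) * height
        left = right
    return total


def intrinsic_dimension(sizes, n_objects, n_attributes):
    """Assumed form: inverse square of the integrated observable diameter."""
    delta = integrated_observable_diameter(sizes, n_objects, n_attributes)
    return float("inf") if delta == 0 else 1 / delta ** 2


def nominal_sizes(n):
    """Concept sizes of the nominal scale on n >= 2 elements."""
    return [(n, 0), (0, n)] + [(1, 1)] * n


def contranominal_sizes(n):
    """Concept sizes of the contranominal scale: (A, N \\ A) for all A."""
    return [(k, n - k) for k in range(n + 1)]
```

Under these assumptions the nominal scale on n ≥ 2 elements integrates to 1/n², so the sketch returns n⁴, which grows without bound, while the empty context yields an infinite value because its integrated observable diameter vanishes.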
7 Computing and Experiments
We computed the intrinsic dimension for various real-world data sets in order to provide a first impression of how variable the values of are for different kinds of real-world data. For brevity we reuse data sets investigated in [Borchmann2017] and refer the reader there for an elaborate discussion of those. All but one of the data sets are scaled versions of downloads from the UCI Machine Learning Repository [Lichman:2013]. In short, we consider the following data sets: the Zoo data set (zoo), describing 101 animals by fifteen attributes; the Breast Cancer data set (cancer), representing 699 clinical cases of cell classification; the Southern Women data set (southern), an (offline) social network consisting of fourteen women attending eighteen different events; the Brunson Club Membership Network data set (brunson), another (offline) social network describing the affiliations of a set of 25 corporate executive officers to a set of 40 social organizations; the Facebook-like Forum Network data set (facebooklike), an (online) social network from an online community linking 377 users to 522 topics; a data set from an annual cultural event organized in the city of Munich in 2013, the so-called Lange Nacht der Musik, an (online/offline) social network linking 79 users to 188 events; and, finally, the well-known Mushroom data set, a collection of 8124 mushrooms described by 119 attributes. Additionally, for all those data sets except the Mushroom data set, we consider a randomized version. Those are indicated by the suffix r.
We conducted our experiments in a straightforward manner, utilizing Proposition 6.2. This was done using conexp-clj (https://github.com/exot/conexp-clj). The intermediate results for can be seen in Figure 2 and the final results for are listed in Table 1.
| Name | # Objects | # Attributes | Density | # Concepts | I-dim |
All curves in Figure 2 show a different behavior, resulting in different values for . The overall descending monotonicity is expected; however, the average as well as the local slopes differ considerably. The general trend that less dense contexts receive a higher intrinsic dimension is also expected, taking into account the results for the empty context as well as the overall motivation of the curse of dimensionality. Considering the random data sets in Table 1, we observe that neither the density nor the number of formal concepts (features) is an indicator for the intrinsic dimension. This supports the belief that the introduced intrinsic dimension is independent of the usual descriptive properties.
8 Conclusion
In this work we introduced a new method for utilizing observable geometric invariants to measure the intrinsic dimension of a data structure. Depending on the particular set of features – which reflect the concrete KD or ML procedure in question – this notion of dimension is computationally feasible. From this we derived a dimension function for the data structure related to concept lattices. For thirteen different concept lattices we observed that the intrinsic dimension of the concept lattice does in fact reflect aspects of measure concentration, as predicted by theory. This result raises several research questions for future investigation. Which insights does this provide for the proneness of the various problem classes in ML and KD? Which features are natural to the algorithm at hand, e.g., clustering algorithms? What result is to be expected when exchanging the observable diameter for one of the other geometric invariants proposed by Gromov?
F.M.S. acknowledges funding of the Excellence Initiative by the German Federal and State Governments, as well as the Brazilian CNPq, processo 150929/2017-0.