Basic statistics for distributional symbolic variables: a new metric-based approach


Abstract

In data mining it is usual to describe a group of measurements using summary statistics or through their empirical distribution functions. Each summary of a group of measurements is the representation of a typology of individuals (sub-populations) or of the evolution of the observed variable for each individual. Therefore, typologies or individuals are expressible through multi-valued descriptions (intervals, frequency distributions). Symbolic Data Analysis, a relatively new statistical approach, aims at the treatment of such kinds of data.
In the conceptual framework of Symbolic Data Analysis, the paper aims at presenting new basic statistics for numeric multi-valued data. First of all, we propose how to consider all numerical multi-valued descriptions as special cases of distributional data, i.e. as data described by distributions. Secondly, we extend some classic univariate (mean, variance, standard deviation) and bivariate (covariance and correlation) basic statistics taking into account the nature, the source and the interpretation of the variability of such data. As opposed to those proposed in the literature, the novel statistics are based on a distance between distributions, the Wasserstein distance.
Using a clinical dataset, we compare the proposed approach with the existing one, showing the main differences in the interpretation of the results.
Keywords: Wasserstein metric, symbolic data, mean, variance, dependence measures, distributional data, modal variables.

1 Introduction

In many real-world settings, data are collected or represented by multi-valued descriptions: intervals, frequency distributions, histograms, density distributions, and so on. Typical examples are the description of macrodata in official statistics; basic statistics, frequency distributions or parameter estimates computed from data referring to a group of units or to the same unit observed on multiple occasions; and data directly measured under conditions of uncertainty. In these cases, even if we are observing a single variable, the information coming from the observations cannot be conveniently expressed by only one number or category. Thus, it is usual to represent such information by multiple values, and such a multi-valued description is the starting point of new kinds of analysis. Several proposals have appeared in the literature for processing such data, according to their nature, source or mathematical modeling. For example, when data are expressed by intervals of $\mathbb{R}$, interval arithmetic (Moore, 1996), fuzzy sets (Moore, 2003), or a Symbolic Data Analysis approach (Bock and Diday, 2000) provide useful tools for their statistical treatment. When the data domain is categorical, a noteworthy approach is that of compositional data (Aitchison, 1986). Among the methods listed above, the Symbolic Data Analysis (SDA) approach (Bock and Diday, 2000; Billard and Diday, 2003, 2006; Noirhomme and Brito, 2011) provides models and techniques for generalizing the statistical treatment of most of them. In fact, SDA is a relatively new statistical approach designed for processing data described by set-valued variables (or symbolic variables) such as interval, multi-valued discrete, multi-categorical, histogram and modal variables. In particular, modal variables can model the description of an individual, of a group, or of a concept by distributions of probabilities or frequencies or, in general, by random variables.
In recent years, several authors have proposed and defined new statistics and new techniques for the analysis of a particular case of modal data description: histogram-valued data. Bertrand and Goupil (2000) proposed a first set of basic univariate and bivariate statistics that was integrated and extended by Billard and Diday (2006). Further developments for quantifying the variability of a set of multi-valued data and the dependence between variables can be found in Billard (2007) and Brito (2007). For interval data, the statistics proposed by Billard (2007) and Bertrand and Goupil (2000) start from the assumption that an interval-valued datum is a uniform distribution over the interval. Considering histograms as weighted collections of intervals, Bertrand and Goupil (2000) and Billard and Diday (2006) extended the univariate (mean, variance and standard deviation) and bivariate (covariance and correlation) statistics to histogram data.
A set of multi-valued data holds two kinds of variability: a variability internal to the data and a variability between the data. The first is related to the multiplicity of values that describe a single observation: for example, an interval description has its own variability, related to its width. The latter is related to the differences among the multi-valued descriptions: two intervals can differ in position, width or both. The basic univariate statistics proposed by Bertrand and Goupil (2000) and Billard and Diday (2006) do not separate the roles of the two sources of variability (for example, the variance of a set of identical multi-valued data is in general positive). In this paper, we propose a novel set of univariate and bivariate statistics that better take into account the two sources of variability and that extend some properties of the classic basic statistics to multi-valued numeric data.

The paper is organized as follows. In section 2 we present the different multi-valued numerical data according to their definition in SDA, and we show how to consider them as a single, more general, type of data described by distributions. In section 3, we show the state-of-the-art basic univariate statistics and their coincidence with the basic statistics of a finite mixture of distributions, we reflect on their use, and we propose new basic statistics that resolve some discrepancies present in the former approach. The novel univariate statistics emerge from the definition of a measure of variability that is related to a distance between distributions. Among the different distances presented in the literature, we motivate the choice of the Wasserstein distance (Rüschendorf, 2001), showing the gain in interpretability of the results provided by this choice and its consistency with the double source of variability of a set of multi-valued data.
The choice of the Wasserstein distance allows the use of a novel product operator between two distributions. Using such an operator, in section 4 we propose an extension of the classical covariance and correlation measures between two standard variables to the case of numeric modal variables. Also in this case, we show that it is possible to take into account the different sources of variability of the multi-valued data.
Using a clinical dataset presented in Billard and Diday (2006), in section 5 we present an application of the proposed statistics and a comparison with those proposed in the same book. The results give evidence of the interpretative properties of the novel statistics. Section 6 ends the paper with some comments and suggestions for future research.

2 Numerical symbolic modal data

The definition of the types of data used in this paper is presented consistently with the Symbolic Data Analysis (SDA) terminology (Bock and Diday, 2000; Billard and Diday, 2006). SDA aims to extend classical data analysis and statistical methods to more complex data, called symbolic data, that are realizations of so-called symbolic variables. In SDA a symbolic datum describes an individual through a set of numbers or categories, possibly equipped with a set of weights, whereas a standard datum describes an individual by assigning a single measurement (number or category) of a standard variable. Bock and Diday (2000) defined symbolic variables as follows:

Definition 1.

Let $E$ be a set of objects; a variable $Y$ is termed set-valued with domain $\mathcal{Y}$ if, for all $i \in E$,

$$Y : E \to D, \qquad i \mapsto Y(i) \tag{1}$$

where the description set $D$ is defined by $D = \mathcal{P}(\mathcal{Y}) = \{U \neq \emptyset \mid U \subseteq \mathcal{Y}\}$. A set-valued variable $Y$ is called multi-valued if its description set $D_c$ is the set of all finite subsets of the underlying domain $\mathcal{Y}$, such that $|Y(i)| < \infty$ for all $i \in E$.

A set-valued variable $Y$ is called categorical multi-valued if it has a finite set $\mathcal{Y}$ of categories, and quantitative multi-valued if the values $Y(i)$ are finite sets of real numbers.

A set-valued variable $Y$ is called interval-valued if its description set $D_I$ is the set of intervals of $\mathbb{R}$.

Bock and Diday (2000) also defined modal (symbolic) variables as follows:

Definition 2.

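A modal variable $Y$ on a set $E$ of objects with domain $\mathcal{Y}$ is a mapping

$$Y(i) = \left(S(i), \pi_i\right) \quad \text{for } i \in E \tag{2}$$

where $\pi_i$ is a measure or a (frequency, probability or weight) distribution on the domain $\mathcal{Y}$ of possible observation values (completed by a $\sigma$-field), and $S(i) \subseteq \mathcal{Y}$ is the support of $\pi_i$ in the domain $\mathcal{Y}$. The description set of a modal variable is denoted with $D_m$.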

In the present paper, the multi-categorical case is not considered; we deal only with descriptions based on a numerical support. We propose to treat all numerical (single-valued or set-valued) variables as particular cases of modal variables. In particular, we propose to treat data from a probabilistic perspective, as distributional data. In order to follow the terminology adopted in SDA, the variables that allow distributions as descriptions of individuals are termed modal-numeric (probabilistic) variables.

Definition 3.

Given a set of objects with domain and support partitioned into subsets, a probability measure associated with a density function and with the respective distribution function , such that

(3)

where , a modal (probabilistic) variable is a mapping

(4)

In the following, we consider the main types of symbolic numeric variables. For each of them, after defining the support $S(i)$, the density function $f_i(y)$ and the distribution function $F_i(y)$, we show how to consider it as a particular modal-numeric description.

Classic single valued data

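$Y(i) = y_i$ such that $y_i \in \mathbb{R}$, and $S(i) = \{y_i\}$.

In this case, the individual $i$ is described by a single value $y_i$. The value $y_i$ is treated as a modal-numeric datum associated with a density function given by a Dirac delta function shifted to $y_i$:

$$f_i(y) = \delta(y - y_i)$$

subject to the constraint that $\int_{-\infty}^{+\infty} \delta(y - y_i)\, dy = 1$.

The corresponding distribution function is:

$$F_i(y) = \begin{cases} 0 & \text{if } y < y_i \\ 1 & \text{if } y \ge y_i \end{cases}$$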

Therefore the modal-numeric description is:

Multi-valued discrete description

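A modal multi-valued discrete description can be considered as a mixture of Dirac delta distributions, where $Y(i)$ is a set of distinct single values.

The support can be written as $S(i) = \{y_{1i}, \ldots, y_{H_i i}\}$ where each element $y_{hi}$ of the support is associated with a weight $\pi_{hi} \ge 0$, such that $\sum_{h=1}^{H_i} \pi_{hi} = 1$ (the mixing weights). We then consider the function:

$$f_i(y) = \sum_{h=1}^{H_i} \pi_{hi}\, \delta(y - y_{hi})$$

where $f_i$ is a density function associated with the description of $i$, and the corresponding distribution function is:

$$F_i(y) = \sum_{h \,:\, y_{hi} \le y} \pi_{hi}$$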

In this case, the modal-numeric description is:

Interval description

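$Y(i) = [a_i, b_i]$ such that $a_i \le b_i$, $S(i) = [a_i, b_i]$ and, assuming a uniform distribution in $S(i)$, we can write $f_i$ as

$$f_i(y) = \begin{cases} \dfrac{1}{b_i - a_i} & \text{if } a_i \le y \le b_i \\ 0 & \text{otherwise} \end{cases}$$

The corresponding distribution function is:

$$F_i(y) = \begin{cases} 0 & \text{if } y < a_i \\ \dfrac{y - a_i}{b_i - a_i} & \text{if } a_i \le y < b_i \\ 1 & \text{if } y \ge b_i \end{cases}$$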

In this case, the modal-numeric description is:

If the distribution of the data on the interval is known, we may take $F_i$ to be the (cumulative) distribution function corresponding to that distribution.

Histogram valued description

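We assume that $S(i) = [\underline{y}_i, \overline{y}_i]$ (the support is bounded in $\mathbb{R}$). The support is partitioned into a set of $H_i$ intervals $\{I_{1i}, \ldots, I_{H_i i}\}$, where $I_{hi} = [\underline{y}_{hi}, \overline{y}_{hi})$ and $h = 1, \ldots, H_i$, i.e.

$$I_{hi} \cap I_{h'i} = \emptyset \ \ (h \neq h'), \qquad \bigcup_{h=1}^{H_i} I_{hi} = [\underline{y}_i, \overline{y}_i].$$

Histograms suppose that each interval is uniformly dense. It is possible to define the modal description of $i$ as follows:

$$Y(i) = \left\{ (I_{hi}, \pi_{hi}) : h = 1, \ldots, H_i \right\}$$

where $\pi_{hi} \ge 0$ and $\sum_{h=1}^{H_i} \pi_{hi} = 1$.
Given the generic interval $I_{hi} = [\underline{y}_{hi}, \overline{y}_{hi})$ where $\underline{y}_{hi} \le \overline{y}_{hi}$, and denoting with $U(\underline{y}_{hi}, \overline{y}_{hi})$ the continuous uniform density defined between $\underline{y}_{hi}$ and $\overline{y}_{hi}$, we may rewrite a histogram as a linear combination of uniform distributions (a mixture) as follows:

$$f_i(y) = \sum_{h=1}^{H_i} \pi_{hi}\, U(\underline{y}_{hi}, \overline{y}_{hi})$$

where $f_i$ is a density function associated with the description of $i$, and the corresponding distribution function is:

$$F_i(y) = \int_{-\infty}^{y} f_i(u)\, du$$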

In this case, the modal-numeric description is:

Continuous random variable

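$S(i)$ corresponds to the support of the random variable and $f_i$ corresponds to its density function.

We can consider, then, the density as

$$f_i(y) = f(y \mid \theta_i)$$

where $\theta_i$ is a vector of parameters, and the distribution function as

$$F_i(y) = \int_{-\infty}^{y} f(u \mid \theta_i)\, du$$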

In this case the modal-numeric description is:

In conclusion, the numeric set-valued variables (interval-valued and single-valued) are considered distributional variables (or numeric modal symbolic variables) whose distribution function is, respectively, a uniform or a Dirac delta distribution. While the first assumption is accepted in the SDA literature (Bertrand and Goupil, 2000), the latter corresponds to the same assumption applied to a thin interval: a point value can be considered a zero-width interval.
The proposed reformulation of the different types of numeric symbolic variables into a single, more general type of distributional symbolic variable (a numerical modal probabilistic symbolic variable) makes it possible to adopt a single approach for computing univariate and bivariate statistics for a wide class of symbolic numeric data. The rest of the paper discusses the proposal of new statistics for distributional variables.
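As an illustration of this unified view, the following minimal Python sketch represents a single value, an interval and a histogram through one common object, the quantile function of the associated distribution. It is only a sketch under stated assumptions: the names `point_qf`, `interval_qf` and `histogram_qf` are hypothetical and do not come from the SDA literature, intervals are assumed uniform, and histogram bins are assumed contiguous and uniformly dense.

```python
from bisect import bisect_right
from typing import Callable, Sequence

# A quantile function maps a probability level t in [0, 1] to a real value.
QuantileFn = Callable[[float], float]


def point_qf(y: float) -> QuantileFn:
    """Single-valued datum: a Dirac mass at y has a constant quantile function."""
    return lambda t: y


def interval_qf(a: float, b: float) -> QuantileFn:
    """Interval [a, b] read as a uniform distribution (linear quantile function)."""
    if a > b:
        raise ValueError("interval bounds must satisfy a <= b")
    return lambda t: a + t * (b - a)


def histogram_qf(edges: Sequence[float], weights: Sequence[float]) -> QuantileFn:
    """Histogram with contiguous bins, each assumed uniformly dense."""
    if len(edges) != len(weights) + 1:
        raise ValueError("need exactly one more edge than weights")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    cum = [0.0]  # cumulative weight at the left edge of each bin
    for w in weights:
        cum.append(cum[-1] + w)

    def qf(t: float) -> float:
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        # Last bin whose left cumulative weight does not exceed t.
        h = min(bisect_right(cum, t) - 1, len(weights) - 1)
        while weights[h] == 0 and h > 0:  # step back over empty trailing bins
            h -= 1
        # Linear interpolation inside the bin (uniform density in the bin).
        return edges[h] + (t - cum[h]) / weights[h] * (edges[h + 1] - edges[h])

    return qf
```

With this representation an interval is the same object as a one-bin histogram, and a point is the limiting case of a zero-width interval, which mirrors the argument made above.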

3 Basic univariate statistics for numerical symbolic data

Bertrand and Goupil (2000) were the first to propose a set of univariate and bivariate statistics for symbolic data, which Billard and Diday (2006) subsequently improved. The Bertrand and Goupil (2000) approach relies on the so-called two-level paradigm presented in SDA by Bock and Diday (2000): the set-valued description of a statistical unit of a higher order is the generalization of the values observed for a class of lower order units. For example, the income distribution of a nation (the higher order unit) is the empirical distribution of the incomes of each citizen (the lower order units) of that nation. Naturally, other grouping criteria can be taken into consideration for the generalization.
The generalization process from lower to higher order units considered by Bertrand and Goupil (2000) and by Billard and Diday (2006) implies the following assumption: given two symbolic data $Y(i)$ and $Y(i')$ described by the frequency distributions $f_i$ and $f_{i'}$, a lower order unit can be described by a single value $y$ that has a probability of occurring equal to $\frac{1}{2} f_i(y) + \frac{1}{2} f_{i'}(y)$. The univariate statistics proposed by Bertrand and Goupil (2000) and by Billard and Diday (2006) for a symbolic variable (namely, a variable describing higher order units, or classes of units) correspond to those of the classic variable used for describing the (unknown) lower order units. Thus, given a set of $n$ higher order units described by the numerical symbolic variable $Y$, the mean, the variance and the standard deviation proposed by Bertrand and Goupil (2000) and extended by Billard and Diday (2006) correspond to those of a finite mixture of $n$ density (or frequency) functions with mixing weights equal to $1/n$. Consider $n$ density functions denoted with $f_i(y)$, with respective means $\mu_i$ and variances $\sigma_i^2$, and the finite mixture density $f(y)$ defined as follows:

$$f(y) = \sum_{i=1}^{n} \frac{1}{n} f_i(y) \tag{5}$$

Frühwirth-Schnatter (2006) shows that the mean $\mu$ and the variance $\sigma^2$ of $f(y)$ are the following:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} \mu_i \tag{6}$$
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \mu_i^2 + \sigma_i^2 \right) - \mu^2 \tag{7}$$

It is worth noting that the two statistics in eqs. (6) and (7) are the same as those proposed by Billard and Diday (2006) for a numeric symbolic variable, except for a different notation.
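For the sake of simplicity, we show only the formulas related to interval-valued data. Let $Y$ be an interval-valued variable; the generic symbolic datum is then $Y(i) = [a_i, b_i]$ with $a_i \le b_i$ belonging to $\mathbb{R}$. According to Bertrand and Goupil (2000), $Y(i)$ is considered as a uniform distribution in $[a_i, b_i]$, with mean equal to $\mu_i = \frac{a_i + b_i}{2}$ and variance equal to $\sigma_i^2 = \frac{(b_i - a_i)^2}{12}$. Given a set of $n$ units described by an interval-valued variable, the symbolic sample mean (Billard and Diday, 2006, eq. (3.22)) is:

$$\bar{Y} = \frac{1}{2n} \sum_{i=1}^{n} (a_i + b_i) \tag{8}$$

It is straightforward to show its equivalence with $\mu$ in eq. (6), indeed:

$$\bar{Y} = \frac{1}{2n} \sum_{i=1}^{n} (a_i + b_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{a_i + b_i}{2} = \frac{1}{n} \sum_{i=1}^{n} \mu_i = \mu$$

In (Billard and Diday, 2006, eq. (3.22)) the symbolic sample variance is also proposed, as follows:

$$S^2 = \underbrace{\frac{1}{3n} \sum_{i=1}^{n} \left( a_i^2 + a_i b_i + b_i^2 \right)}_{(I)} - \underbrace{\frac{1}{4n^2} \left[ \sum_{i=1}^{n} (a_i + b_i) \right]^2}_{(II)} \tag{9}$$

Considering that:

$$\mu_i^2 + \sigma_i^2 = \frac{(a_i + b_i)^2}{4} + \frac{(b_i - a_i)^2}{12} = \frac{a_i^2 + a_i b_i + b_i^2}{3}$$

the term (I) of eq. (9) can be expressed as follows:

$$\frac{1}{3n} \sum_{i=1}^{n} \left( a_i^2 + a_i b_i + b_i^2 \right) = \frac{1}{n} \sum_{i=1}^{n} \left( \mu_i^2 + \sigma_i^2 \right)$$

The term (II) is clearly $\mu^2$, indeed:

$$\frac{1}{4n^2} \left[ \sum_{i=1}^{n} (a_i + b_i) \right]^2 = \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{a_i + b_i}{2} \right]^2 = \mu^2$$

Thus, $S^2$ in eq. (9) corresponds to $\sigma^2$ in eq. (7), indeed:

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \mu_i^2 + \sigma_i^2 \right) - \mu^2 \tag{10}$$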

The same correspondences also hold for the mean and the variance of the other numerical modal symbolic variables.
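The correspondence between eqs. (8)-(9) and eqs. (6)-(7) can be checked numerically. The following Python sketch is only illustrative: the function names and the three intervals are hypothetical, and each interval $[a_i, b_i]$ is assumed to be uniform, so that its mean is $(a_i + b_i)/2$ and its variance is $(b_i - a_i)^2/12$.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]


def symbolic_mean(intervals: Sequence[Interval]) -> float:
    """Symbolic sample mean of interval data, as in eq. (8)."""
    n = len(intervals)
    return sum(a + b for a, b in intervals) / (2 * n)


def symbolic_variance(intervals: Sequence[Interval]) -> float:
    """Symbolic sample variance of interval data, as in eq. (9)."""
    n = len(intervals)
    term_1 = sum(a * a + a * b + b * b for a, b in intervals) / (3 * n)
    term_2 = sum(a + b for a, b in intervals) ** 2 / (4 * n * n)
    return term_1 - term_2


def mixture_mean_variance(intervals: Sequence[Interval]) -> Tuple[float, float]:
    """Mean and variance of the equal-weight mixture of uniforms, eqs. (6)-(7)."""
    n = len(intervals)
    means = [(a + b) / 2 for a, b in intervals]
    variances = [(b - a) ** 2 / 12 for a, b in intervals]
    mu = sum(means) / n
    var = sum(m * m + v for m, v in zip(means, variances)) / n - mu * mu
    return mu, var


# Hypothetical data, used only to exercise the formulas.
data = [(1.0, 3.0), (2.0, 6.0), (0.0, 4.0)]
mu, var = mixture_mean_variance(data)
print(symbolic_mean(data), mu)      # the two means coincide
print(symbolic_variance(data), var)  # the two variances coincide

# Three identical intervals still give a positive variance under eq. (9).
print(symbolic_variance([(2.0, 6.0)] * 3))
```

The last line shows the behaviour discussed in the introduction: a set of identical intervals has a strictly positive symbolic variance, equal to the internal variance of the single interval.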
This approach is particularly useful and coherent when the symbolic data, referring to higher order units, are descriptions of groups of lower level units (the income distribution of a nation is described by the incomes of its citizens). In general, all the symbolic data have the same weight, but if the cardinality of the groups is known in advance, it is possible to estimate an unbiased mean or standard deviation of the variable describing all the lower order units (the per-capita income in Europe is the mean of the per-capita incomes of the single nations, weighted by the respective populations).
In some situations, this approach to the definition of the univariate basic statistics hides some peculiarities present in the data. For example, to describe the pulse rate of a patient during a particular activity (while walking, running, swimming, sleeping, etc.), we can model this information using the distribution of the pulse rates recorded during that activity. If we collect the same information from $n$ people, we obtain $n$ symbolic data described respectively by $n$ pulse rate distributions. Using the mean and the standard deviation of a symbolic variable like those proposed by Billard and Diday (2006), we obtain measures related to all the pulse rate measurements, regardless of which individual of the group they belong to. Indeed, since these are the basic statistics of a mixture, the pulse rate measurements can be permuted among the individuals without changing the basic statistics, which rules out any comparison between individuals. In such a case, we may instead be interested in studying the variability of the individuals according to their pulse rate distributions, so that the more the pulse rate distributions differ, the more variability there is in the data. Extending the concept of variability as a measure of divergence of the observed data from an average datum, the mean individual should have a distribution that is as close as possible to all the observed distributions: the average should be expressed by a distribution. In this case, as in the classic case, if the data are identical (and thus identical to the mean), the variability of that symbolic variable should be zero. Other dissimilarities for interval-valued data, treated as distributions, have been considered in (Irpino and Verde, 2008).
Therefore, in SDA the source of variability of a symbolic variable is twofold:

internal to data

each symbolic datum has an inherent variability due to the summarization process of lower order units into higher order ones, and that is a possible element of the domain of the symbolic variable: each individual is described by the distribution of the recorded pulse rate;

between data

a set of units described by a symbolic variable is a set of multi-valued observations: each individual may have a different pulse rate distribution.

In contrast to (Billard and Diday, 2006), where the sample variance of a symbolic variable is the sum of the internal and the between-data variability, we here treat the internal variability as a characteristic that pertains to the mean unit (the mean individual is described by a distribution that is as close as possible to all the observed pulse rate distributions), while the variability of a set of symbolic data is related to the diversity of their symbolic descriptions.

3.1 The mean and the variability of a set of data described by distributions

While in probability theory the mean corresponds to the expected value of a random variable, in descriptive statistics the mean can assume several definitions. Starting from proximity relations among data it is possible to define the so called Fréchet means, while starting from the definition of a function of the observed data it is possible to define the so called Chisini means. More formally:

Fréchet (or Karcher) mean

according to Ginestet et al. (2012), given a set of $n$ elements described by the variable $Y$, a distance $d$ between two descriptions and a set of $n$ real numbers $w_i$, a Fréchet type mean (barycenter) $M$ is the argmin of the following minimization problem:

$$M = \arg\min_{x} \sum_{i=1}^{n} w_i\, d^2\left(Y(i), x\right) \tag{11}$$

provided that a unique minimizer exists.

Chisini mean

according to Chisini (1929), given a set of $n$ units described by the single real-valued variable $Y$, with observed values $y_1, \ldots, y_n$, and a function $g$, a Chisini type mean $M$ must satisfy the following condition:

$$g(y_1, \ldots, y_n) = g(M, \ldots, M) \tag{12}$$

for example, the arithmetic mean is invariant with respect to the sum function, i.e.:

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} M = nM \quad \Longrightarrow \quad M = \frac{1}{n} \sum_{i=1}^{n} y_i$$

To extend Chisini type means to multi-valued numeric data, it is important to define functions and operators for multi-valued data.

The definition of a Fréchet and Chisini compatible mean of distributional variables requires two conditions: the definition of a distance between distributions (or random variables) and the definition of, at least, the sum of distributions and the product of a distribution and a scalar.

A variety of dissimilarities for symbolic data are presented in (Bock and Diday, 2000, Chap. 8). For continuous and multi-valued categorical data, several component-wise dissimilarities are presented: the Gowda-Diday, Ichino-Yaguchi and De Carvalho dissimilarities. Unfortunately, none of those are formulated for comparing data described by frequency or probability distribution functions with a numeric and continuous support. In the same chapter, for comparing multi-valued modal data, the authors presented a review of dissimilarities based on particular families of divergence indices for probability distributions. Such divergences are based on a function of the likelihood ratio between two probability measures and, therefore, they are not symmetric. The well-known Kullback-Leibler (KL) divergence suffers from this drawback, too. A well-known symmetric version of the KL divergence is the Jensen-Shannon (JS) dissimilarity, which corresponds to the mean of the KL divergences between two probability distributions and their mixture. Nielsen and Nock (2009), in a k-means framework, studied the minimization of the sum of squared divergences based on information theory (KL, JS, and their generalizations into Bregman divergences). They showed that, in minimizing such a criterion, a single centroid distribution of a set of probability distributions cannot be obtained. They solved this problem by proposing a pair of centroids (a left and a right centroid for each class) according to the direction of the computed divergence.
Starting from the study of Gibbs and Su (2002), Verde and Irpino (2007) considered a set of dissimilarity and distance measures for probability distributions. They observed that not all the considered probabilistic distances and dissimilarities allow the identification of a unique distribution as the center of a set of distributions, and that in some cases the resulting center cannot be expressed as a distribution. The authors noticed that only two distances make it possible to define a single center in the form of a distribution: the Euclidean and the Wasserstein distance between distributions.
In the rest of the paper, we use the following notation: given $n$ probability distributions with density (or probability) functions denoted with $f_i$, the respective expected value is denoted with $\mu_i$ and the standard deviation is denoted with $\sigma_i$; each $f_i$ is in a one-to-one correspondence with a cumulative distribution function (cdf) denoted with $F_i$ and with a quantile function (qf) denoted with $F_i^{-1}$.

The sample mean based on Euclidean distance.

The $L_2$ distance between two density functions $f_i$ and $f_{i'}$ (for continuous distributions) is:

$$d_{L_2}(f_i, f_{i'}) = \sqrt{\int_{-\infty}^{+\infty} \left[ f_i(y) - f_{i'}(y) \right]^2 dy} \tag{13}$$

It is straightforward to prove that the Fréchet mean associated with $d_{L_2}$ (assuming equal weights $w_i = 1/n$) is given by the finite mixture of the $n$ density functions as follows:

$$f_{L_2}(y) = \frac{1}{n} \sum_{i=1}^{n} f_i(y) \tag{14}$$

and, thus, the mean and the variance of $f_{L_2}$ correspond to those presented above in eqs. (8) and (9). $f_{L_2}$ is a density function, but in general it has a different shape from the summarized densities: for example, a mixture of non-identical Normal distributions is not itself a Normal distribution. As we can see in Fig. 1, the representation is coherent with the aggregation criterion where each distribution $f_i$ comes from a sub-population and $f_{L_2}$ is the distribution of $Y$ for the whole population.


Figure 1: The mean according to the $L_2$ distance

Verde and Irpino (2007) considered another distance: the $L_2$ version of the Wasserstein distance. The literature provides different formulations of the Wasserstein distance; we use the formalization of Rüschendorf (2001) (which also contains the main references to the Wasserstein metric), which expresses the distance through the quantile functions associated with the respective cdfs as follows:

$$d_{W_p}(f_i, f_{i'}) = \left( \int_0^1 \left| F_i^{-1}(t) - F_{i'}^{-1}(t) \right|^p dt \right)^{1/p} \tag{15}$$

The formulation proposed in Rüschendorf (2001) shows that the $d_{W_p}$ distance can be considered as an extension of the classic Minkowski distance to quantile functions (the inverses of the cdfs). Quantile functions (qfs) have several useful statistical properties (Gilchrist, 2000). The most interesting for the arguments of this paper are: qfs are in a one-to-one correspondence with the corresponding density functions, qfs have a finite domain ($[0, 1]$), and qfs are non-decreasing functions.

The sample mean based on Wasserstein distance.

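To avoid multiple indices, we denote with $d_W$ the following formulation of the ($L_2$) Wasserstein distance between two probability distributions:

$$d_W(f_i, f_{i'}) = \sqrt{\int_0^1 \left[ F_i^{-1}(t) - F_{i'}^{-1}(t) \right]^2 dt} \tag{16}$$

In this case, the Fréchet mean with respect to $d_W$ (assuming equal weights $w_i = 1/n$) is the distribution corresponding to the mean quantile function $\bar{F}^{-1}$ that solves the following optimization problem:

$$\bar{F}^{-1} = \arg\min_{F^{-1}} \sum_{i=1}^{n} \frac{1}{n}\, d_W^2(f_i, f) = \arg\min_{F^{-1}} \frac{1}{n} \sum_{i=1}^{n} \int_0^1 \left[ F_i^{-1}(t) - F^{-1}(t) \right]^2 dt \tag{17}$$

Assuming that $f$ is a density function and $F^{-1}$ is the corresponding quantile function, and considering the integral operator, the solution of the optimization problem in eq. (17) is obtained for each $t \in [0, 1]$ according to the classic first order condition, as follows:

$$\bar{F}^{-1}(t) = \frac{1}{n} \sum_{i=1}^{n} F_i^{-1}(t) \qquad \forall t \in [0, 1] \tag{18}$$

where $\bar{F}^{-1}(t)$ indicates the mean quantile function evaluated at $t$.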
The Fréchet mean distribution $\bar{f}$ is the distribution that is in a one-to-one correspondence with $\bar{F}^{-1}$, i.e.

$$\bar{f} \longleftrightarrow \bar{F} \longleftrightarrow \bar{F}^{-1} \tag{19}$$

Figure 2 shows $\bar{f}$ for the same three Normal distributions represented in Fig. 1. Differently from $f_{L_2}$, we observe that $\bar{f}$ has a central position and an intermediate shape with respect to the observed distributional data.


Figure 2: The mean according to Wasserstein distance
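The pointwise average of quantile functions in eq. (18) can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `uniform_qf` and `mean_qf` and the three intervals are hypothetical, and each interval is assumed to be uniformly distributed.

```python
from typing import Callable, Sequence

QuantileFn = Callable[[float], float]


def uniform_qf(a: float, b: float) -> QuantileFn:
    """Quantile function of a uniform distribution on [a, b]."""
    return lambda t: a + t * (b - a)


def mean_qf(qfs: Sequence[QuantileFn]) -> QuantileFn:
    """Equal-weight average of quantile functions, level by level (eq. (18))."""
    n = len(qfs)
    return lambda t: sum(q(t) for q in qfs) / n


# Three hypothetical interval data read as uniform distributions.
qfs = [uniform_qf(1.0, 3.0), uniform_qf(2.0, 6.0), uniform_qf(0.0, 4.0)]
bar = mean_qf(qfs)

# The average of linear quantile functions is linear, so the mean datum is
# again an interval, with bounds equal to the mean lower and upper bounds.
print(bar(0.0), bar(1.0))
```

For these data the mean according to the Wasserstein distance is the interval from $1$ to $13/3$, whereas the mixture of the three uniform densities has the union of the supports, from $0$ to $6$, as its support.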

To show the centrality properties of $\bar{f}$, the quantity $\rho_{i,i'}$, the correlation coefficient between two quantile functions, plays an important role. It is defined as follows:

$$\rho_{i,i'} = \frac{\int_0^1 \left( F_i^{-1}(t) - \mu_i \right) \left( F_{i'}^{-1}(t) - \mu_{i'} \right) dt}{\sigma_i\, \sigma_{i'}} = \frac{\int_0^1 F_i^{-1}(t)\, F_{i'}^{-1}(t)\, dt - \mu_i \mu_{i'}}{\sigma_i\, \sigma_{i'}} \tag{20}$$

It is worth noting that $\rho_{i,i'}$ is never negative, being the correlation coefficient between two non-decreasing functions, and is exactly equal to 0 when at least one distribution has no variability (i.e., is a single-valued datum). Imagining a QQ (quantile-quantile) plot, $\rho_{i,i'}$ is the correlation of the scattered points; therefore it can be considered a measure of the similarity of the shapes of two distribution functions. In fact, $\rho_{i,i'} = 1$ only if the two distributions have the same quantiles once these are standardized by the respective mean and standard deviation, which occurs when the two distributions have the same shape.
According to Barrio et al. (1999) and using eq. (20), it is possible to prove (see Appendix A) that the squared Wasserstein distance can be decomposed as follows:

$$d_W^2(f_i, f_{i'}) = \underbrace{(\mu_i - \mu_{i'})^2}_{\text{Location}} + \underbrace{\underbrace{(\sigma_i - \sigma_{i'})^2}_{\text{Size}} + \underbrace{2\, \sigma_i \sigma_{i'} \left( 1 - \rho_{i,i'} \right)}_{\text{Shape}}}_{\text{Variability}} \tag{21}$$

The decomposition permits the interpretation of the (squared) distance between two distribution functions according to two additive aspects. The Location aspect emphasizes the difference in position of the two distributions through the (squared Euclidean) distance between the respective means. The second aspect is related to the different Variability structure of the compared distributions, due to the different standard deviations (the Size component) and to the different shapes of the density functions (the Shape component). While the Size component is expressed by the (squared Euclidean) distance between the standard deviations, the Shape component is fundamentally governed by the value of $\rho_{i,i'}$.
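The decomposition in eq. (21) can be verified numerically on any pair of quantile functions. The Python sketch below is an assumption-laden illustration, not the procedure used in the paper: integrals over $[0, 1]$ are approximated by a midpoint rule, the function names are hypothetical, and the two example distributions (a uniform on $[0, 2]$ and the distribution with density $2y$ on $[0, 1]$, whose quantile function is $\sqrt{t}$) are chosen only because they have different shapes.

```python
import math
from typing import Callable, List, Tuple

QuantileFn = Callable[[float], float]


def _midpoints(m: int) -> List[float]:
    """Midpoints of m equal cells of [0, 1] (midpoint quadrature nodes)."""
    return [(k + 0.5) / m for k in range(m)]


def qf_moments(qf: QuantileFn, m: int = 10_000) -> Tuple[float, float]:
    """Approximate mean and standard deviation from the quantile function."""
    vals = [qf(t) for t in _midpoints(m)]
    mu = sum(vals) / m
    var = sum(v * v for v in vals) / m - mu * mu
    return mu, math.sqrt(max(var, 0.0))


def squared_wasserstein(q1: QuantileFn, q2: QuantileFn, m: int = 10_000) -> float:
    """Approximate squared L2 Wasserstein distance, i.e. the square of eq. (16)."""
    return sum((q1(t) - q2(t)) ** 2 for t in _midpoints(m)) / m


def decompose(q1: QuantileFn, q2: QuantileFn, m: int = 10_000) -> Tuple[float, float, float]:
    """Location, Size and Shape components of eq. (21)."""
    mu1, s1 = qf_moments(q1, m)
    mu2, s2 = qf_moments(q2, m)
    cross = sum(q1(t) * q2(t) for t in _midpoints(m)) / m
    # Correlation between the two quantile functions, eq. (20); taken as 0
    # when one distribution is degenerate, as stated in the text.
    rho = (cross - mu1 * mu2) / (s1 * s2) if s1 > 0 and s2 > 0 else 0.0
    return (mu1 - mu2) ** 2, (s1 - s2) ** 2, 2 * s1 * s2 * (1 - rho)


uniform_0_2 = lambda t: 2.0 * t        # uniform distribution on [0, 2]
density_2y = lambda t: math.sqrt(t)    # density 2y on [0, 1]

location, size, shape = decompose(uniform_0_2, density_2y)
print(location + size + shape, squared_wasserstein(uniform_0_2, density_2y))
```

The three components sum to the squared distance up to floating-point error, because eq. (21) is an algebraic identity that also holds for the discretized integrals; here the Shape component is small but not zero, since the two distributions do not have the same shape.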
The decomposition in eq. (21) suggests that the optimization problem in eq. (17) leads to a solution where $\bar{f}$ has the minimum Location difference with respect to all the locations of the $n$ distributions and the minimum Variability difference with respect to all the variabilities of the $n$ distributions. Given the optimization problem in eq. (17), considering that the quantities in eq. (21) cannot be negative, and denoting with $\bar{\mu}$ and $\bar{\sigma}$ respectively the mean and the standard deviation of $\bar{f}$, and with $\rho_{i,\bar{f}}$ the correlation between the qf of the $i$-th observation and the qf of $\bar{f}$, it is possible to show that:

$$\bar{\mu} = \frac{1}{n} \sum_{i=1}^{n} \mu_i, \qquad \bar{\sigma} = \frac{1}{n} \sum_{i=1}^{n} \rho_{i,\bar{f}}\; \sigma_i$$

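In order to show that $\bar{f}$ is also a Chisini mean, we introduce the sum operator between qfs and the product of a qf by a scalar. Let $\mathcal{F}$ be the set of functions of the kind $g : [0, 1] \to \mathbb{R}$, with bounded domain and image in $\mathbb{R}$. Let $\mathcal{Q} \subset \mathcal{F}$ be the set of all possible quantile functions, i.e. the set containing only non-decreasing functions with bounded domain in $[0, 1]$. Let $+$ be the sum between two elements of $\mathcal{F}$ and $\cdot$ the product of a scalar by a function; it is known that $(\mathcal{F}, +, \cdot)$ is a vector space. Restricted to $\mathcal{Q}$, however, the sum is still an internal operation, because the sum of two non-decreasing functions is non-decreasing, while the product between a scalar $\lambda$ and a qf is internal (i.e. returns a qf) only if $\lambda \ge 0$. Using these operators, and considering the sum of quantile functions as the function $g$ in eq. (12), it is possible to affirm that $\bar{F}^{-1}$ (associated with $\bar{f}$) is the Chisini mean of a set of quantile functions which is invariant with respect to the sum of qfs.
Since $\bar{f}$ is a density function, we may derive its basic statistics, like the mean and the variance, as follows:

$$\bar{\mu} = \int_0^1 \bar{F}^{-1}(t)\, dt = \frac{1}{n} \sum_{i=1}^{n} \int_0^1 F_i^{-1}(t)\, dt = \frac{1}{n} \sum_{i=1}^{n} \mu_i \tag{22}$$

which corresponds to the mean of the means of the $n$ distributions. The variance of $\bar{f}$ is formulated as follows:

$$\bar{\sigma}^2 = \int_0^1 \left[ \bar{F}^{-1}(t) \right]^2 dt - \bar{\mu}^2 = \int_0^1 \left[ \frac{1}{n} \sum_{i=1}^{n} F_i^{-1}(t) \right]^2 dt - \bar{\mu}^2 \tag{23}$$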

For simplifying the last formula, we present the following formulation of the product of two quantile functions.

Definition 4.

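Given two quantile functions $F_i^{-1}$ and $F_{i'}^{-1}$, associated with two pdfs $f_i$ and $f_{i'}$ with means $\mu_i$ and $\mu_{i'}$ and standard deviations $\sigma_i$ and $\sigma_{i'}$, the product is defined as follows:

$$\left\langle F_i^{-1}, F_{i'}^{-1} \right\rangle = \int_0^1 F_i^{-1}(t)\, F_{i'}^{-1}(t)\, dt = \rho_{i,i'}\, \sigma_i\, \sigma_{i'} + \mu_i\, \mu_{i'} \tag{24}$$

The last equality follows by straightforward algebra from eq. (20).
Using this result, we obtain a final formulation of the variance of $\bar{f}$ as follows:

$$\bar{\sigma}^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \left\langle F_i^{-1}, F_{i'}^{-1} \right\rangle - \bar{\mu}^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \rho_{i,i'}\, \sigma_i\, \sigma_{i'} \tag{25}$$

It is worth noting that if all the distributions have the same shape, then $\rho_{i,i'} = 1$ for each pair of distributions and the variance of $\bar{f}$ reaches its maximum value. The minimum value is clearly obtained when all the observed data are points (i.e. $\sigma_i = 0$ for each $i$), thus:

$$0 \le \bar{\sigma}^2 \le \left( \frac{1}{n} \sum_{i=1}^{n} \sigma_i \right)^2 \tag{26}$$

We do not investigate the computation of the further moments of the distribution associated with $\bar{f}$ because it requires considerations beyond the topic of this paper. However, it is worth noting that $\bar{f}$ is a distribution having a shape similar to all the $n$ distributions: if we have single-valued data (points), $\bar{f}$ is a point (i.e. it generalizes the arithmetic mean of a set of standard data); if we have interval-valued data, $\bar{f}$ is an interval-valued description; if we have histogram-valued data, $\bar{f}$ is a histogram.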

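The variance of $Y$ with respect to $d_W$.

Given $\bar{f}$, the mean of a set of $n$ units described by the distributional symbolic variable $Y$, we define the variance of $Y$ as the mean of the squared Wasserstein distances between each distribution and $\bar{f}$. In this sense, the variance of $Y$ corresponds to the Fréchet criterion in eq. (11) with $w_i = 1/n$. Using the definition of product between two qfs in definition 4, we denote with $s_W^2$ the variance of $Y$, which is computed as follows:

$$s_W^2 = \frac{1}{n} \sum_{i=1}^{n} d_W^2(f_i, \bar{f}) = \frac{1}{n} \sum_{i=1}^{n} \left\langle F_i^{-1}, F_i^{-1} \right\rangle - \left\langle \bar{F}^{-1}, \bar{F}^{-1} \right\rangle = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \mu_i^2 - \bar{\mu}^2}_{s_\mu^2} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sigma_i^2 - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \rho_{i,i'}\, \sigma_i\, \sigma_{i'}}_{s_V^2} \tag{27}$$

We note that $s_W^2$ is the sum of two non-negative independent sources of variability:

$s_\mu^2$ is the variance of the means of the distributions;

$s_V^2$ is a measure of variance related to the (squared) differences of the internal variability of the distributions. $s_V^2$ is never negative. Considering that $\bar{\sigma}^2$ is maximum when, for each pair of distributions, $\rho_{i,i'} = 1$ (i.e., all the distributions have the same shape), we observe that, for given standard deviations, the minimum value of $s_V^2$ is:

$$\frac{1}{n} \sum_{i=1}^{n} \sigma_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} \sigma_i \right)^2$$

that is, the variance of the standard deviations of the distributions. Finally, $s_V^2$ is equal to zero in two cases: when all the distributions are identically distributed except for their means, or when all the data are single-valued.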

Differently from in eq. 10, is also equal to zero when all the distributions are identical and positive ’s. Secondly, comparing in eq. (9) (or its simplified version of eq. (10)) with , it is clear that depends only from the means and the standard deviations of the compared distributions, while depends also from the different shapes of the compared distributions. Finally, is generally greater than . In fact, rewriting we observe that:

where the difference between the two indices is equal to:

The difference depends on the shapes and the standard deviations of the distributions.

The standard deviation

According to the variance defined above, a generalization of the standard deviation of the numerical modal multi-valued variable Y observed on a set of units is the following:

(28)

Using the sum and the product by a scalar for quantile functions, it is straightforward to show that, given the variable Y, its standard deviation satisfies the following properties:

  1. Positivity: .

  2. If all data are identically distributed (i.e. have the same modal multi-valued numerical description) then

  3. Given two real numbers and and being a transformation of the variable, the corresponding standard deviation is:

The two novel basic statistics, the mean and the standard deviation, achieve the objective of better accounting for the double source of variability of a symbolic variable observed on a set of units: the mean variability internal to the data is expressed by the variability of the mean distribution, while the standard deviation takes into consideration only the differences among the distributions, i.e. it measures the between variability.
From a computational point of view, the difficulty of computing an exact value for is related to the possibility of computing the . Irpino and Verde (2006); Irpino et al. (2006) proposed a closed form for computing the squared Wasserstein distance between two histogram-valued data, and from that formulation it is also possible to derive the closed form related to . The computation takes time that is linear in the number of bins of the histograms. The same holds for interval-valued data, considered as data described by trivial histograms (i.e., histograms having only one bin). For data described by different types of density functions, the can be derived analytically only when all the qfs can be expressed in closed form (for example, this is not possible for the Normal distribution). In all the other cases, numerical methods can be applied for approximating .
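For the interval case, a minimal sketch follows, assuming each interval is a one-bin histogram, i.e. a uniform distribution on [a, b]. Its quantile function is Q(t) = c + r(2t - 1), with centre c and half-width r, so the squared L2 Wasserstein distance between two intervals is (c1 - c2)^2 + (r1 - r2)^2 / 3. The function names are illustrative, and the numerical routine is a plain midpoint rule of the kind that could be used when no closed form is available.

```python
import numpy as np

def sq_wasserstein_intervals(a1, b1, a2, b2):
    """Closed form for two intervals treated as uniform distributions (assumption)."""
    c1, r1 = (a1 + b1) / 2.0, (b1 - a1) / 2.0
    c2, r2 = (a2 + b2) / 2.0, (b2 - a2) / 2.0
    return (c1 - c2) ** 2 + (r1 - r2) ** 2 / 3.0

def sq_wasserstein_numeric(q1, q2, m=100_000):
    """Midpoint-rule approximation of the integral of (q1(t) - q2(t))^2 over (0, 1)."""
    t = (np.arange(m) + 0.5) / m
    return np.mean((q1(t) - q2(t)) ** 2)

# Intervals [0, 2] and [3, 7], with their uniform quantile functions.
exact = sq_wasserstein_intervals(0.0, 2.0, 3.0, 7.0)
approx = sq_wasserstein_numeric(lambda t: 2.0 * t, lambda t: 3.0 + 4.0 * t)
print(exact, approx)   # both should be close to 16 + 1/3
```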
Another result presented in Irpino and Verde (2006); Irpino et al. (2006) concerns the variance decomposition in the framework of cluster analysis of histogram-valued data. After showing that the Wasserstein distance is an extension of the Euclidean distance to quantile functions, the authors showed that it is possible to decompose the variability of a set of histograms according to the Huygens theorem on the decomposition of inertia, and used this property to extend some clustering methods for standard data to histogram-valued data.
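The Huygens-type decomposition can be illustrated with a small numerical check. The sketch assumes quantile functions sampled on a common grid (sorted random rows stand in for them) and an arbitrary two-cluster labelling; the names `inertia` and `huygens_decomposition` are hypothetical and not taken from the cited papers.

```python
import numpy as np

def inertia(Q):
    """Sum over units of the squared L2 distance from the pointwise mean quantile function."""
    return ((Q - Q.mean(axis=0)) ** 2).mean(axis=1).sum()

def huygens_decomposition(Q, labels):
    """Return (total, within, between) inertia for the clusters given by `labels`."""
    overall_mean = Q.mean(axis=0)
    within, between = 0.0, 0.0
    for k in np.unique(labels):
        members = Q[labels == k]
        within += inertia(members)
        # Squared distance of the cluster mean from the overall mean, weighted by cluster size.
        between += len(members) * ((members.mean(axis=0) - overall_mean) ** 2).mean()
    return inertia(Q), within, between

rng = np.random.default_rng(1)
Q = np.sort(rng.normal(size=(6, 100)), axis=1)   # stand-ins for six quantile functions
labels = np.array([0, 0, 0, 1, 1, 1])
total, within, between = huygens_decomposition(Q, labels)
print(total, within + between)                   # the two values should agree
```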

4 Measures of interdependence

In this section, we consider how to extend the classic measures of association between two single real-valued variables, such as the covariance and the correlation indices, to a pair of numeric modal variables.
Starting from the approach of Bertrand and Goupil (2000), Billard and Diday (2006) proposed a formulation of the covariance between interval- or histogram-valued variables. For example, let and be two interval-valued variables, such that the generic -th unit () is described by the ordered pair of descriptions where and are intervals; the covariance index then has the following formulation:

(29)

where

(30)

and the two means are calculated according to eq. (8). The authors (Billard and Diday, 2006, Eq. 4.19, p. 136) extended this index to data described by a pair of histogram-valued variables, treating each histogram as a weighted combination of intervals. As with the univariate statistics, these statistics can be traced back to an approach based on a mixture of bivariate distributions. Furthermore, the proposed measures do not clearly account for the different sources of variability of a set of multi-valued symbolic data. Billard and Diday (2006) also propose a measure of correlation, computed as follows:

(31)

The covariance based on Wasserstein metric

Using the Wasserstein metric and the associated product of qfs defined in eq. (24), we propose an alternative approach for measuring the covariance and the correlation between two symbolic variables and , which addresses some of the above-mentioned deficiencies of the Billard and Diday (2006) approach. Let and be two modal numeric variables describing a set of units; the generic unit is described by the ordered pair where and are density functions, with respective means equal to and , and standard deviations