Copula Correlation: An Equitable Dependence Measure and Extension of Pearson’s Correlation.
Abstract
In Science, Reshef et al. (2011) proposed the concept of equitability for measures of dependence between two random variables. To this end, they proposed a novel measure, the maximal information coefficient (MIC). Recently a PNAS paper (Kinney and Atwal, 2014) gave a mathematical definition for equitability. They proved that MIC in fact is not equitable, while a fundamental information-theoretic measure, the mutual information (MI), is self-equitable. In this paper, we show that MI also does not correctly reflect the proportion of deterministic signals hidden in noisy data. We propose a new equitability definition based on this scenario. The copula correlation (Ccor), based on the distance of copula density, is shown to be equitable under both definitions. We also prove theoretically that Ccor is much easier to estimate than MI. Numerical studies illustrate the properties of the measures.
This research project is supported by NSF grant CCF1442728.
MSC classification: Primary 62H20, 62B10; secondary 62C20, 62G99, 94A17.
Keywords: equitability, copula, mutual information, rate of convergence, distance correlation.
1 Introduction
With the advance of modern technology, the size of available data keeps exploding. Data mining is increasingly used to keep up with the trend, and to explore complex relationships among a vast number of variables. The nonlinear relationships are as important as the linear relationship in data exploration. Hence the traditional measure such as Pearson’s linear correlation coefficient is no longer adequate for today’s big data analysis. Reshef et al. (2011) proposed the concept of equitability. That is, a dependence measure should give equal importance to linear and nonlinear relationships. For this purpose, they proposed a novel maximal information coefficient (MIC) measure.
The MIC measure stimulated great interest and further studies in the statistical community. Speed (2011) praised it as “a correlation for the 21st century”. It has been quickly adopted by many researchers in data analysis. However, its mathematical and statistical properties are still not studied very well. There are also criticisms on the measure based on those properties.
MIC has been criticised for its low power in detecting dependence (Simon and Tibshirani, 2011; de Siqueira Santos et al., 2013; Heller, Heller and Gorfine, 2013), in comparison to existing measures and tests. Particularly, Simon and Tibshirani (2011) recommended the distance correlation (dcor) by Székely, Rizzo and Bakirov (2007) over MIC. However, dcor does not have the equitable property. The equitable dependence measure is needed to properly rank the strength of relationships in data exploration. As we will discuss in detail later, the equitability is a different feature from the power of dependence testing.
Kinney and Atwal (2014) gave a strict mathematical definition of the $R^2$-equitability described in Reshef et al. (2011). They discovered that no nontrivial statistic can be $R^2$-equitable; thus MIC is in fact not $R^2$-equitable. They further proposed a replacement definition, self-equitability. Interestingly, MIC is also not self-equitable. Kinney and Atwal (2014) recommended a fundamental measure from information theory, the mutual information (MI), which is self-equitable.
While the estimation of MI has been studied extensively in the literature, practitioners are often frustrated by the unreliability of these estimates (Fernandes and Gloor, 2010; Reshef et al., 2011). We show that this is in fact due to a problem in the MI measure's definition: it does not correctly reflect the strength of deterministic relationships hidden in noise. We propose a new equitability definition to clarify the issue.
We relate the study of equitability to another popular line of research on the copula, a joint probability distribution with uniform marginals. Sklar's Theorem decomposes any joint probability distribution into two components: the marginal distributions and the copula. The copula captures all the dependence information among the variables. Hence an equitable dependence measure should be copula-based. Copula-based dependence measures have been studied for a long time. An earlier classic work by Schweizer and Wolff (1981) proved many mathematical properties for several copula-based dependence measures. With the advance of modern computing power, there is renewed interest in copula-based dependence measures (Schmid et al., 2010; Póczos, Ghahramani and Schneider, 2012; Lopez-Paz, Hennig and Schölkopf, 2013).
Using the copula, we mathematically define the robust-equitability condition: a dependence measure should equal the proportion of deterministic relationship (linear or nonlinear) hidden in uniform background noise. Hence such measures equal Pearson's correlation for a linear relationship hidden in uniform background noise, and extend Pearson's correlation to all deterministic relationships hidden in uniform background noise. We propose a new robust-equitable measure, the copula correlation (Ccor), which is defined as half the $L_1$ distance of the copula density function from independence. This measure has been used before as a test statistic for independence testing (Chan and Tran, 1992; Tjøstheim, 1996; Bagnato, De Capitani and Punzo, 2013). For discrete random variables, it is also known as the Kolmogorov dependence measure in the pattern recognition literature (Vilmansen, 1972, 1973; Ekdahl and Koski, 2006) and as the Mortara dependence index (Bagnato, De Capitani and Punzo, 2013). We consider the measure for continuous variables, and refer to it as the copula correlation. The name emphasizes that it is a copula-based dependence measure and that it is an extension of Pearson's correlation. $L_1$-distance-based statistics are robust in many statistical applications. The $L_1$-distance-based dependence measure here is robust to a mixture of deterministic data with continuous data, and properly reflects the dependence strength in the mixture.
We shall show that Ccor is both self-equitable and robust-equitable. On the other hand, MI is not robust-equitable. This also provides insight into the difficulty of estimating MI. Some authors (Pál, Póczos and Szepesvári, 2010; Liu, Lafferty and Wasserman, 2012) studied the convergence of MI estimators by imposing the Hölder condition on the copula density. This Hölder condition, while a standard condition for density estimation, does not hold for any commonly used copula (Omelka, Gijbels and Veraverbeke, 2009; Segers, 2012). Under a more realistic Hölder condition on the bounded region of the copula density, we provide a theoretical proof that the minimax risk of MI is infinite. This provides a theoretical explanation of the statistical difficulty of estimating MI observed by practitioners. In contrast, Ccor is consistently estimable under the same condition.
Section 2 prepares the notation by defining several dependence measures and relating equitability to the copula. A weak-equitability definition is introduced which relates to copula-based measures. We define our new measure Ccor and review some existing dependence measures in the literature, including MIC, MI, dcor, etc. We review the copula-based measures of Schweizer and Wolff (1981), and their modified version of Rényi's Axioms (Rényi, 1959). We clarify the relationship between these Axioms and equitability. Section 3.1 reviews the equitability definitions of Kinney and Atwal (2014), and studies the self-equitability of these dependence measures. Self-equitable measures such as MI may not correctly reflect the proportion of deterministic signal in data. This motivates our definition of an equitable extension of Pearson's linear correlation coefficient. Section 3.2 formulates this mathematically as our robust-equitability definition. Ccor is the only measure proven to be both self-equitable and robust-equitable. The multivariate extension is also discussed. Section 4 further studies the convergence of estimators for the two self-equitable measures MI and Ccor. Ccor is shown to be theoretically easier to estimate than MI. This, together with its equitability, provides the desirable theoretical properties for applications of Ccor in big data exploration. The estimation of MI has been studied extensively in the literature. MI can be estimated using methods including the kernel density estimation (KDE) method (Moon, Rajagopalan and Lall, 1995), the nearest-neighbor (KNN) method (Kraskov, Stögbauer and Grassberger, 2004), the maximum likelihood estimation of density ratio method (Suzuki, Sugiyama and Tanaka, 2009), etc. We advocate that more attention should be paid to estimating Ccor instead. In this paper, we propose a KDE-based estimator for Ccor. Section 5 compares the numerical performance of this estimator with other dependence measures through simulation studies and a real data analysis.
Ccor is shown to rank the strength of dependence relationships better than other measures. It also performs well on the real data. We end the paper with proofs and summary discussions.
2 Copula and Dependence Measures
We review several classes of dependence measures between two random variables $X$ and $Y$ in the literature, and introduce our proposed new measure. For simplicity, we will focus on the dependence measures for two continuous univariate random variables $X$ and $Y$ in most of the paper. The multivariate extension will be discussed in Section 3.3.
2.1 Weak-equitability and Copula-based Dependence Measures
The most commonly used dependence measure is Pearson's linear correlation coefficient $\rho(X,Y) = \mathrm{Cov}(X,Y)/\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$, where $\mathrm{Cov}(X,Y)$ denotes the covariance between $X$ and $Y$, and $\mathrm{Var}(X)$ denotes the variance of $X$. The linear correlation coefficient is good at characterizing linear relationships between $X$ and $Y$: $|\rho| = 1$ for a perfectly deterministic linear relationship, and $\rho = 0$ when $X$ and $Y$ are independent. However, it does not measure nonlinear relationships between $X$ and $Y$ well.
To motivate the equitability concept, we can look at three examples in the left half of Table 1, where the two continuous random variables $X$ and $Y$ are related by deterministic relationships: linear in (A); nonlinear in (B) and (C). These examples illustrate two deficiencies of Pearson's linear correlation coefficient $\rho$:

(D1) It is not invariant to monotone transformations of the variables. The value of $\rho$ would change, say, under a logarithmic or exponential scale. The value of $\rho$ is lower in example (B) than in (A) of Table 1 under a logarithm transformation of $X$.

(D2) $\rho$ does not treat all deterministic relationships equally, and cannot capture some non-monotone nonlinear relationships. In example (C), $\rho = 0$ although $X$ and $Y$ are related by a deterministic nonlinear relationship, in contrast to $\rho = 1$ in the linear relationship of example (A).
[Table 1: scatter plots of examples A., B. and C., shown on the raw data scale (left half) and after copula transformation (right half).]
Kinney and Atwal (2014) mathematically define equitability of a dependence measure through its invariance under certain transformations of the random variables $X$ and $Y$. The deficiency (D1) above provides the original motivation for considering invariance. For example, if we change the unit of $X$ (or $Y$), the values of $X$ (or $Y$) change by a constant multiple, but this should not affect the dependence measure at all. Similarly, if we apply a monotone transformation to $X$ (e.g. the commonly used logarithmic or exponential transformation), then the dependence with $Y$ should not be affected and the measure should remain the same. For dependence scanning in data mining and variable selection, invariance to monotone transformations of the variables is very important, since we do not know beforehand the appropriate scale of each variable. This leads to the following definition of weak-equitability.
Definition 1.
A dependence measure $D[X;Y]$ is weakly-equitable if and only if $D[X;Y] = D[f(X);Y]$ whenever $f$ is a strictly monotone continuous deterministic function.
The weak-equitability property relates to the popular copula concept. Sklar's theorem ensures that, for any joint distribution function $F_{X,Y}(x,y) = \Pr(X \le x, Y \le y)$, there exists a copula $C$ (a probability distribution on the unit square $[0,1]^2$) such that
(1) $F_{X,Y}(x,y) = C[F_X(x), F_Y(y)]$.
Here $F_X(x) = \Pr(X \le x)$ and $F_Y(y) = \Pr(Y \le y)$ are the marginal cumulative distribution functions (CDFs) of $X$ and $Y$ respectively. The copula $C$ captures all the dependence between $X$ and $Y$.
The copula decomposition separates the dependence (copula) from any marginal effects. Figure 1 shows the data from two distributions with different marginals but the same dependence structure.
We call a dependence measure symmetric if $D[X;Y] = D[Y;X]$ for all random variables $X$ and $Y$. A symmetric weakly-equitable measure then satisfies the monotone-invariance property: $D[X;Y]$ is invariant to strictly monotone continuous transformations of both $X$ and $Y$. A symmetric dependence measure $D[X;Y]$ is weakly-equitable if and only if it depends on the copula $C$ only and is not affected by the marginals $F_X$ and $F_Y$. In other words, the symmetric weakly-equitable dependence measures are defined on the copula-transformed, uniformly distributed variables $U = F_X(X)$ and $V = F_Y(Y)$. The right half of Table 1 shows the copula-transformed variables for Examples (A), (B) and (C), in contrast to the original variables on the left. Calculating the linear correlation coefficient on the copula-transformed variables leads to Spearman's Rho, which is weakly-equitable. This remedies the first deficiency (D1) above, as shown in Examples (A) and (B) in Table 1 after copula transformation. The deficiency (D2) is still not solved by copula transformation in example (C). We will address this in Section 3.1, as it relates to the equitability concept of treating all deterministic relationships equally.
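The effect of the copula transformation on deficiency (D1) can be illustrated numerically. The following is a minimal sketch, assuming the copula transform is approximated by normalized ranks; the helper names `copula_transform` and `pearson` and the simulated data are hypothetical and not part of the paper.

```python
import numpy as np

def copula_transform(x):
    """Approximate U = F_X(X) by normalized ranks (assumes no ties)."""
    ranks = np.argsort(np.argsort(x)) + 1  # ranks 1..n
    return ranks / (len(x) + 1.0)

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 3.0, size=500)
y = x + 0.3 * rng.normal(size=500)

# Pearson's coefficient changes when x is put on an exponential scale ...
rho_raw = pearson(x, y)
rho_exp = pearson(np.exp(x), y)

# ... while the same coefficient on copula-transformed data (Spearman's Rho)
# depends only on the ranks and is therefore unchanged.
spearman_raw = pearson(copula_transform(x), copula_transform(y))
spearman_exp = pearson(copula_transform(np.exp(x)), copula_transform(y))

print(rho_raw, rho_exp, spearman_raw, spearman_exp)
```

Because a strictly increasing transformation leaves the ranks untouched, the two Spearman values coincide, whereas the two Pearson values do not.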
2.2 Rényi’s Axioms for Nonlinear Dependence Measures
Schweizer and Wolff (1981) showed that several copula-based dependence measures $D[X;Y]$ satisfy a modified version of Rényi's Axioms on two continuously distributed random variables $X$ and $Y$.

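A1. $D[X;Y]$ is defined for any $X$ and $Y$.

A2. $D[X;Y] = D[Y;X]$.

A3. $0 \le D[X;Y] \le 1$.

A4. $D[X;Y] = 0$ if and only if $X$ and $Y$ are statistically independent.

A5. $D[X;Y] = 1$ if and only if each of $X$, $Y$ is a.s. a strictly monotone function of the other.

A6. If $f$ and $g$ are strictly monotone a.s. on $\mathrm{Range}(X)$ and $\mathrm{Range}(Y)$, respectively, then $D[f(X);g(Y)] = D[X;Y]$.

A7. If the joint distribution of $X$ and $Y$ is bivariate Gaussian, with linear correlation coefficient $\rho$, then $D[X;Y]$ is a strictly increasing function of $|\rho|$.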
Rényi (1959)'s original axioms differ from the Schweizer and Wolff (1981) version in that: (i) they were not restricted to continuously distributed random variables; (ii) Axioms A5, A6 and A7 are replaced by:

A5a. $D[X;Y] = 1$ if either $X = f(Y)$ or $Y = g(X)$ for some Borel-measurable functions $f$ and $g$.

A6a. If $f$ and $g$ are Borel-measurable, one-one mappings of the real line into itself then $D[f(X);g(Y)] = D[X;Y]$.

A7a. If the joint distribution of $X$ and $Y$ is bivariate Gaussian, with linear correlation coefficient $\rho$, then $D[X;Y] = |\rho|$.
We will mostly stick with continuous random variables as in Schweizer and Wolff (1981) so that we can relate to the copula representation. But we will also discuss the original A5a, A6a and A7a as they relate to the discussions on the equitability concept. The original Rényi's Axioms are too strong for nonparametric measures (Schweizer and Wolff, 1981). The only known measure shown to satisfy all seven original Rényi's Axioms is Rényi's maximum correlation coefficient (Rcor). Rcor has a number of major drawbacks, e.g., it equals 1 too often and is generally not effectively computable (Schweizer and Wolff, 1981; Székely and Rizzo, 2009). We will discuss this further in Section 3.1. In Section 5, we will numerically study a recently proposed estimator for Rcor by Lopez-Paz, Hennig and Schölkopf (2013).
Axiom A4 partially addresses the deficiency (D2) in example (C) above. Axiom A2 states that the measure is symmetric. Hence under Axiom A2, the weak-equitability of Definition 1 is equivalent to Axiom A6. The self-equitability definition (Kinney and Atwal, 2014) is stronger than Axiom A6 (weak-equitability), and is weaker than the original Axiom A6a.
2.3 Some Dependence Measures and Independence Characterization
One common class of copula-based measures are the concordance measures (Nelsen, 2006, chapter 5). In the bivariate case, let $c(u,v)$ denote the density function of the copula $C(u,v)$, for $(u,v) \in [0,1]^2$. Then, in the standard forms given in Nelsen (2006), Spearman's Rho is $12\int\!\!\int C(u,v)\,du\,dv - 3$; Kendall's Tau is $4\int\!\!\int C(u,v)\,c(u,v)\,du\,dv - 1$; Gini's Gamma is $4\int_0^1 [C(u,1-u) + C(u,u) - u]\,du$; Blomqvist's Beta is $4C(\tfrac{1}{2},\tfrac{1}{2}) - 1$.
However, those concordance measures all suffer from the deficiency (D2) above: they all equal zero for the deterministic relationship in example (C) of Table 1. Naturally we want dependence measures that satisfy Rényi's Axiom A4. Several classes of dependence measures satisfy Axiom A4 using different but equivalent mathematical characterizations of the statistical independence between $X$ and $Y$, all with a similar form:
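(2) $f_{X,Y}(x,y) = f_X(x)\,f_Y(y)$ for all $x$ and $y$.
Here $f_{X,Y}$ can be either the joint CDF $F_{X,Y}(x,y) = \Pr(X \le x, Y \le y)$, or the joint characteristic function $\phi_{X,Y}(s,t) = E[e^{i(sX+tY)}]$ with $E$ denoting the expectation, or the joint probability density function $p_{X,Y}(x,y)$. Then $f_X$ and $f_Y$ are the corresponding marginal functions: CDFs $F_X$ and $F_Y$, or characteristic functions $\phi_X$ and $\phi_Y$, or probability density functions $p_X$ and $p_Y$.
Due to the characterization (2), it is natural to define $D[X;Y]$ through a discrepancy measure between the joint function $f_{X,Y}$ and the product of the marginal functions $f_X f_Y$. Such a $D[X;Y]$ equals zero if and only if $f_{X,Y} = f_X f_Y$ always, i.e., $X$ and $Y$ are independent.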
The first class of dependence measures uses CDFs in the characterization (2). Denote by $\Pi(u,v) = uv$ the independence copula on $[0,1]^2$. Then using the $L_\infty$ and $L_2$ distances between $C$ and $\Pi$, we get the commonly used Kolmogorov-Smirnov criterion $\sup_{u,v} |C(u,v) - uv|$ and Cramér-von Mises criterion $\int\!\!\int [C(u,v) - uv]^2\,du\,dv$. These criteria are often used for independence testing (Genest and Rémillard, 2004; Genest, Quessy and Rémillard, 2007; Kojadinovic and Holmes, 2009).
We notice that, to satisfy Axiom A3 ($0 \le D[X;Y] \le 1$), these criteria need to be scaled with appropriate constants. The scaling does not affect the results for independence testing, but only affects the numerical values of the dependence measures. Schweizer and Wolff (1981) studied dependence measures in this class using the $L_p$ distance. The $L_1$, $L_2$ and $L_\infty$ distances result in, respectively, Wolff's $\sigma$, Hoeffding's $\Phi^2$ and Wolff's $\kappa$ measures:
(3) $\sigma = 12\int\!\!\int |C(u,v) - uv|\,du\,dv$,
(4) $\Phi^2 = 90\int\!\!\int [C(u,v) - uv]^2\,du\,dv$,
(5) $\kappa = 4\sup_{u,v} |C(u,v) - uv|$.
This class of dependence measures satisfies the modified Rényi's Axioms A1 to A7 (Schweizer and Wolff, 1981).
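For the second class of dependence measures, using the characteristic functions in the characterization (2) leads to the distance covariance (Székely, Rizzo and Bakirov, 2007; Székely and Rizzo, 2009):
(6) $\mathrm{dcov}^2(X,Y) = \frac{1}{\pi^2}\int\!\!\int \frac{|\phi_{X,Y}(s,t) - \phi_X(s)\phi_Y(t)|^2}{s^2 t^2}\,ds\,dt$.
To satisfy Axiom A3, the distance correlation is defined as
(7) $\mathrm{dcor}(X,Y) = \mathrm{dcov}(X,Y)/\sqrt{\mathrm{dcov}(X,X)\,\mathrm{dcov}(Y,Y)}$.
The dcor does not satisfy Axiom A6. This can be remedied by defining the distance correlation on the copula-transformed variables $U$ and $V$. That is, we use the rank-based version of dcor that replaces $\phi_{X,Y}$, $\phi_X$ and $\phi_Y$ with $\phi_{U,V}$, $\phi_U$ and $\phi_V$ in (6). This will be assumed in the rest of the paper.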
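To make the rank-based remedy concrete, here is a minimal sketch of the sample distance correlation computed from double-centered pairwise distance matrices, following the sample statistic of Székely, Rizzo and Bakirov (2007). The function names `dcor` and `ranks` and the simulated data are illustrative assumptions, and the quadratic-memory implementation is only meant for small samples.

```python
import numpy as np

def _centered_dist(a):
    """Pairwise |a_i - a_j| with row, column and grand means removed."""
    d = np.abs(a[:, None] - a[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcor(x, y):
    """Sample distance correlation of two univariate samples."""
    a = _centered_dist(np.asarray(x, dtype=float))
    b = _centered_dist(np.asarray(y, dtype=float))
    dcov2_xy = (a * b).mean()
    dcov2_xx = (a * a).mean()
    dcov2_yy = (b * b).mean()
    return float(np.sqrt(dcov2_xy / np.sqrt(dcov2_xx * dcov2_yy)))

def ranks(x):
    """Normalized ranks, a sample stand-in for the copula transform."""
    return (np.argsort(np.argsort(x)) + 1) / (len(x) + 1.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=300)
y = x ** 2 + 0.5 * rng.normal(size=300)

raw_before = dcor(x, y)
raw_after = dcor(np.exp(x), y)                  # changes under a monotone map
rank_before = dcor(ranks(x), ranks(y))
rank_after = dcor(ranks(np.exp(x)), ranks(y))   # unchanged
print(raw_before, raw_after, rank_before, rank_after)
```

The raw statistic moves when $X$ is rescaled by a strictly increasing map, while the rank-based version does not, which is the invariance required by Axiom A6.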
The third class of dependence measures uses the probability density functions $p_{X,Y}$, $p_X$ and $p_Y$ in the characterization (2). The copula-based version then involves only the copula density $c(u,v)$. This class includes many information-theoretical measures such as Rényi's mutual information
(8) $MI_\alpha = \frac{1}{\alpha - 1}\log \int\!\!\int c^\alpha(u,v)\,du\,dv$, for $\alpha > 0$, $\alpha \ne 1$.
In the limit of $\alpha \to 1$, $MI_\alpha$ becomes the popular Shannon's mutual information (MI) criterion
(9) $MI = \int\!\!\int c(u,v)\log[c(u,v)]\,du\,dv$.
MI is the recommended measure in Kinney and Atwal (2014). For Axiom A3, we can define the mutual information correlation (Joe, 1989)
(10) $MIcor = \sqrt{1 - e^{-2\,MI}}$.
We use the name MIcor to indicate that it is the scaled version of MI. It is also known as the Linfoot correlation in the literature (Speed, 2011).
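The scaling from MI to MIcor can be checked on the bivariate Gaussian case. The sketch below assumes the Linfoot form $\sqrt{1 - e^{-2\,MI}}$ for the scaled measure and uses the standard closed form $MI = -\frac{1}{2}\log(1-\rho^2)$ for a bivariate Gaussian; the function names are hypothetical.

```python
import math

def gaussian_mi(rho):
    """Mutual information (in nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)

def mi_cor(mi):
    """Scaled mutual information, assumed here to be sqrt(1 - exp(-2 MI))."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi))

for rho in (-0.9, 0.3, 0.5):
    # The scaled value recovers |rho|, in line with Axiom A7a.
    print(rho, mi_cor(gaussian_mi(rho)))
```

Under this scaling the measure lies in $[0,1]$ and equals $|\rho|$ for Gaussian data, which is why the scaled version is comparable to a correlation.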
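Other information measures include the Tsallis entropy (Tsallis, 1988):
(11) $T_\alpha = \frac{1}{\alpha - 1}\left[\int\!\!\int c^\alpha(u,v)\,du\,dv - 1\right]$, for $\alpha > 0$, $\alpha \ne 1$.
In the limit of $\alpha \to 1$, $T_\alpha$ becomes MI. When $\alpha = 1/2$, $T_\alpha$ becomes the Hellinger distance $\int\!\!\int [\sqrt{c(u,v)} - 1]^2\,du\,dv$. Its scaled version is the Hellinger dependence measure (Tjøstheim, 1996; Granger, Maasoumi and Racine, 2004).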
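Also in this class are measures using the $L_\alpha$ distance between the copula density $c(u,v)$ and the independence copula density, which is identically 1. Hence we call them the Copula-Distance
(12) $CD_\alpha = \int\!\!\int |c(u,v) - 1|^\alpha\,du\,dv$, for $\alpha > 0$.
Again, we can scale $CD_\alpha$ to satisfy Axiom A3. $CD_2$ is Pearson's $\phi^2$, whose scaled version is given in Joe (1989).
In particular, we call the scaled version of $CD_1$ the copula correlation
(13) $Ccor = \frac{1}{2}\,CD_1 = \frac{1}{2}\int\!\!\int |c(u,v) - 1|\,du\,dv$.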
We defined the third class of dependence measures through the copula density $c(u,v)$. For some important cases, such as when $Y$ is a deterministic function of $X$, the copula density does not exist with respect to the two-dimensional Lebesgue measure. That is, the copula contains a singular component (Nelsen, 2006, page 27). For a copula with a singular component, we define the dependence measures on it as the limits of the dependence measures on continuous copulas approaching it. Let $C_1, C_2, \ldots$ be a sequence of continuous copulas that converges to the copula $C$. The convergence can be defined in any distance for probability distributions; here we take the distance defined through a supremum over all Borel sets $A \subseteq [0,1]^2$. Then the value of the dependence measure $D$ under the copula $C$ is defined as $\lim_{n \to \infty} D(C_n)$. Using such a definition, if $Y$ is a deterministic function of $X$, then clearly $MI = \infty$, $MIcor = 1$, and $Ccor = 1$.
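As a rough numerical illustration of a measure defined as half the $L_1$ distance of the copula density from independence, the sketch below uses a plain histogram on an $m \times m$ grid over the rank-transformed data. This is not the KDE-based estimator proposed in this paper; the function names, the grid size `m`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def ranks(x):
    """Normalized ranks in (0, 1), a sample stand-in for the copula transform."""
    return (np.argsort(np.argsort(x)) + 0.5) / len(x)

def ccor_hist(x, y, m=10):
    """Histogram plug-in for 0.5 * integral of |c(u, v) - 1| on an m x m grid."""
    u, v = ranks(x), ranks(y)
    counts, _, _ = np.histogram2d(u, v, bins=m, range=[[0, 1], [0, 1]])
    p = counts / counts.sum()            # estimated cell probabilities
    # Under independence every cell has probability 1 / m^2.
    return float(0.5 * np.abs(p - 1.0 / m ** 2).sum())

rng = np.random.default_rng(0)
x = rng.uniform(size=5000)
noise = rng.uniform(size=5000)

print(ccor_hist(x, x))                   # deterministic, linear
print(ccor_hist(x, (x - 0.5) ** 2))      # deterministic, non-monotone
print(ccor_hist(x, noise))               # independent
```

For a deterministic relationship the histogram value approaches 1 only as the grid is refined (for $Y = X$ it equals $1 - 1/m$), which mirrors the limiting definition used for copulas with a singular component. The independent case is not exactly zero because of sampling noise in the cell counts.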
2.4 Parameters, Estimators and MIC
The dependence measures in Section 2.3 are all parameters. Sometimes the same names also refer to the corresponding sample statistics. Let $(X_1,Y_1)$, …, $(X_n,Y_n)$ be a random sample of size $n$ from the joint distribution of $(X,Y)$. Then the sample statistic $r$ is also called Pearson's correlation coefficient. In fact, $r$ is an estimator for $\rho$, and converges at the parametric rate of $n^{-1/2}$. The first two classes of measures have natural empirical estimators, replacing CDFs and characteristic functions by their empirical versions. In particular, Székely, Rizzo and Bakirov (2007) showed that the resulting dcor statistic is the sample correlation of the centered pairwise distances among the $X_i$ and among the $Y_i$. The last class of dependence measures uses the probability density functions instead, and is harder to estimate. For continuous $X$ and $Y$, simply plugging in empirical density functions may not result in good estimators for the dependence measures. However, we will see in Section 3.1 that the first two classes of measures do not have the equitability property. Hence we need to study the harder-to-estimate measures such as MIcor and Ccor.
The MIC introduced in Reshef et al. (2011) is in fact a definition of a sample statistic, not a parameter. On the data set $(X_1,Y_1)$, …, $(X_n,Y_n)$, they first consider putting these $n$ data points into a grid of $n_X \times n_Y$ bins. Then the mutual information $I(n_X,n_Y)$ for the grid is computed from the empirical frequencies of the data on the grid. The MIC statistic is defined as the maximum value of $I(n_X,n_Y)/\log[\min(n_X,n_Y)]$ over all possible grids with the total number of bins bounded by $B$. That is,
(14) $MIC = \max_{n_X n_Y < B} \frac{I(n_X,n_Y)}{\log[\min(n_X,n_Y)]}$.
The MIC statistic is always bounded between $0$ and $1$ since $0 \le I(n_X,n_Y) \le \log[\min(n_X,n_Y)]$.
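The normalization that keeps the statistic in $[0,1]$ can be seen on a single grid. The sketch below computes the grid mutual information divided by $\log \min(n_X, n_Y)$ for one fixed equal-frequency grid; it does not perform the maximization over all grids that defines MIC, and the function name and the equal-frequency binning are assumptions made for illustration.

```python
import numpy as np

def grid_mi_normalized(x, y, nx, ny):
    """MI of an nx-by-ny equal-frequency grid, divided by log2(min(nx, ny))."""
    n = len(x)
    rx = np.argsort(np.argsort(x))       # ranks 0..n-1
    ry = np.argsort(np.argsort(y))
    bx = (rx * nx) // n                  # equal-frequency bin index of each x
    by = (ry * ny) // n
    counts = np.zeros((nx, ny))
    np.add.at(counts, (bx, by), 1)
    p = counts / n
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0                         # convention: 0 log 0 = 0
    mi = (p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum()
    return float(mi / np.log2(min(nx, ny)))

rng = np.random.default_rng(0)
x = rng.uniform(size=400)
print(grid_mi_normalized(x, x, 4, 4))                      # noiseless: upper bound 1
print(grid_mi_normalized(x, rng.uniform(size=400), 4, 4))  # independent: near 0
```

A noiseless monotone relationship on a square grid attains the upper bound, since the grid MI then equals the entropy of a uniform distribution over $\min(n_X, n_Y)$ bins.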
The corresponding parameter MIC for the joint distribution of $X$ and $Y$ can be defined as the limit of the sample statistic as the sample size $n$ grows. We notice that this definition depends on the tuning parameter $B$ and on the implicit assumption that the limit exists. Hence the MIC parameter may change with different selections of $B$. This is in contrast to the usual statistical literature, where the parameter definition is fixed but its estimator may contain some tuning parameter. Because the MIC parameter is only defined as a limit, the theoretical study of its mathematical properties is very hard.
As we introduce the strict mathematical definition for the equitability in next subsection 3.1, we can see that equitability should be a property for the parameter but not for the statistic.
3 Equitable measures
3.1 Equitability and Selfequitability
We first describe the theoretical results on equitability by Kinney and Atwal (2014). Reshef et al. (2011) proposed that an equitable measure should treat all deterministic relationships similarly under noisy situations. In particular, they focused on the nonlinear regression setting for motivation: $Y = f(X) + \varepsilon$, where $\varepsilon$ denotes the random noise that is independent of $X$ conditional on $f(X)$. The squared Pearson's coefficient $R^2$ reflects the proportion of variance in $Y$ explained by the regression on $f(X)$. They want the nonlinear dependence measure to be close to $R^2$ regardless of the specific form of $f$. To formalize this concept, Kinney and Atwal (2014) used the condition "$X \leftrightarrow f(X) \leftrightarrow Y$ forms a Markov chain" to characterize the nonlinear regression model. This condition means that, in the model $Y = f(X) + \varepsilon$ with deterministic $f$, $\varepsilon$ is the random noise variable, which may depend on $f(X)$ as long as $\varepsilon$ has no additional dependence on $X$. Then Kinney and Atwal (2014) defined the $R^2$-equitability as
Definition 2.
A dependence measure $D[X;Y]$ is $R^2$-equitable if and only if $D[X;Y] = g(R^2[f(X);Y])$. Here, $g$ is a function that does not depend on the distribution $p(X,Y)$, $f$ is a deterministic function and $X \leftrightarrow f(X) \leftrightarrow Y$ forms a Markov chain.
Given the joint distribution $p(X,Y)$, the function $f$ in the regression model is not uniquely specified. This implies that any $R^2$-equitable measure must be a trivial constant measure. Therefore, Kinney and Atwal (2014) proposed a replacement definition of equitability by extending the invariance property (of weak-equitability, or Axiom A6) in the regression model.
Definition 3.
A dependence measure $D[X;Y]$ is self-equitable if and only if $D[X;Y] = D[f(X);Y]$ whenever $f$ is a deterministic function and $X \leftrightarrow f(X) \leftrightarrow Y$ forms a Markov chain.
Self-equitability turns out to be characterized by a commonly used inequality in information theory.
Definition 4.
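A dependence measure $D[X;Y]$ satisfies the Data Processing Inequality (DPI) if and only if $D[X;Z] \le D[Y;Z]$ whenever the random variables $X$, $Y$, $Z$ form a Markov chain $X \leftrightarrow Y \leftrightarrow Z$.
Kinney and Atwal (2014, SI, Theorem 3) showed that every DPI-satisfying measure is self-equitable. Kinney and Atwal (2014, SI, Theorem 4) proved that measures of the following form must satisfy DPI:
$D_g[X;Y] = \int\!\!\int p_X(x)\,p_Y(y)\,g\!\left(\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\right) dx\,dy$, with $g$ a convex function on the nonnegative real numbers. In terms of the copula density, $D_g[X;Y] = \int\!\!\int g[c(u,v)]\,du\,dv$.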
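The data processing inequality for measures of this convex-function form can be checked on a small discrete Markov chain. The sketch below is an illustration with made-up transition matrices; the names `f_dependence`, `g_ccor` and `g_mi` are hypothetical, with $g(r) = \frac{1}{2}|r - 1|$ standing in for the discrete analogue of the copula correlation and $g(r) = r\log r$ for mutual information.

```python
import numpy as np

def f_dependence(joint, g):
    """Sum over cells of p(a) p(b) g(p(a, b) / (p(a) p(b))) for a discrete joint."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    prod = pa * pb
    return float((prod * g(joint / prod)).sum())

def g_ccor(r):
    return 0.5 * np.abs(r - 1.0)          # convex: half the L1 distance

def g_mi(r):
    # r log r with the convention 0 log 0 = 0 (convex on [0, inf))
    return np.where(r > 0, r * np.log(np.where(r > 0, r, 1.0)), 0.0)

# Hypothetical Markov chain X -> Y -> Z on small finite alphabets.
px = np.array([0.2, 0.3, 0.5])
p_y_given_x = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_z_given_y = np.array([[0.7, 0.3],
                        [0.4, 0.6],
                        [0.1, 0.9]])

joint_xy = px[:, None] * p_y_given_x
py = joint_xy.sum(axis=0)
joint_yz = py[:, None] * p_z_given_y
joint_xz = joint_xy @ p_z_given_y         # Z depends on X only through Y

for name, g in (("Ccor-type", g_ccor), ("MI", g_mi)):
    print(name, f_dependence(joint_xz, g), "<=", f_dependence(joint_yz, g))
```

In both cases the dependence between the endpoints $X$ and $Z$ does not exceed the dependence between $Y$ and $Z$, as the DPI requires.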
Therefore, due to the convexity of functions (when ) and (when ) on , we get the following proposition.
Proposition 1.
The CopulaDistance with and the Tsallis entropy with are selfequitable.
As a direct result of Proposition 1, the copula correlation and the Hellinger dependence measure are both self-equitable.
Rényi's Axiom A6a is a stronger condition than self-equitability, as no Markov chain condition is required. Therefore, Rényi's maximum correlation coefficient Rcor is also self-equitable. However, Rcor equals one too often. We illustrate this deficiency of Rcor, and the self-equitability of the dependence measures, on some examples of simple probability distributions on the unit square. These examples are modified from those in Kinney and Atwal (2014), and the results are displayed in Table 2.
Examples | MIcor | Ccor | cor | Rcor | MIC | dcor | | |
A | 0.94 | 0.63 | 0.82 | 1 | 1 | 0.56 | 0.75 | 0.31 | 0.53
B | 0.94 | 0.63 | 0.82 | 1 | 0.95 | 0.82 | 0.75 | 0.66 | 0.84
C | 0.94 | 0.63 | 0.82 | 1 | 1 | 0.87 | 1 | 0.75 | 0.84
D | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
E | 0.97 | 0.75 | 0.87 | 1 | 1 | 0.94 | 1 | 0.88 | 0.94
F | 0.87 | 0.50 | 0.71 | 1 | 1 | 0.79 | 1 | 0.63 | 0.75
A selfequitable measure will equal the same value in the first three examples A, B and C in Table 2 due to the existence of an invertible transformation satisfying the Markov chain condition (Kinney and Atwal, 2014). We can see that MIcor (or MI), Ccor, cor (or ) and Rcor all remain constants for the first three examples A, B and C. In contrast, the MIC, dcor, and those measures of the first class (, and ) are not selfequitable.
The next three examples D, E and F show increasing noise levels. However, Rcor, MIC and always equal one across Examples D, E and F, failing to correctly reflect the noise levels here. Particularly, Rcor equals one in all six examples here, failing to distinguish the strengths of deterministic signals among them.
3.2 Robust-equitability
An equitable dependence measure should reflect the strength of the deterministic signal in data, regardless of the form of the relationship. However, what quantity is the proper measure of the signal's strength? Reshef et al. (2011) proposed to use the nonlinear $R^2$ to measure the signal strength, which could not lead to a proper equitability definition (Kinney and Atwal, 2014). One reason for the failure is the incompatibility of the nonlinear regression model with the joint Gaussian distribution. (The linear correlation is the natural measure for the Gaussian distribution, as in Rényi's Axiom A7.) However, $Y = f(X) + \varepsilon$ would result in a joint Gaussian distribution only for linear $f$, not for any nonlinear $f$.
For a better equitability definition, we consider a different situation: a mixture distribution with a proportion $p$ of deterministic relationship hidden in continuous background noise. This situation can be expressed rigorously through the mixture copula. A copula can always be separated into a singular component and an absolutely continuous component (Nelsen, 2006, page 27). The absolutely continuous component corresponds to the background noise. Independent background noise must correspond to the independence copula $\Pi$ (the uniform distribution on the unit square). Therefore, data with a proportion $p$ of hidden deterministic relationship have copula $pC_s + (1-p)\Pi$. Here $C_s$ is a singular copula representing the deterministic relationship, so that its support has Lebesgue measure zero. Clearly the signal strength in this situation should equal $p$, regardless of the specific form of the deterministic relationship. Hence we have the following equitability definition.
Definition 5.
A dependence measure $D[X;Y]$ is robust-equitable if and only if $D[X;Y] = p$ whenever $(X,Y)$ follows a distribution whose copula is $pC_s + (1-p)\Pi$, for a singular copula $C_s$.
We note that a robust-equitable measure is an extension of Pearson's linear correlation. When the proportion $p$ of deterministic relationship is linear, $C_s$ has its support on the diagonal of the unit square, and hence Pearson's correlation of the copula-transformed variables equals $p$. A robust-equitable dependence measure treats a hidden linear deterministic relationship the same as a nonlinear one. Among the dependence measures mentioned above, only the copula correlation is known to be robust-equitable.
Proposition 2.
The copula correlation is robustequitable.
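Proposition 2 follows from a direct calculation. Under the mixture copula $pC_s + (1-p)\Pi$, the singular component carries mass $p$ on a set of Lebesgue measure zero, where the independence copula carries none, while elsewhere the density is $1-p$ instead of $1$. Hence $Ccor = \frac{1}{2}\left[p + \int\!\!\int |(1-p) - 1|\,du\,dv\right] = \frac{1}{2}(p + p) = p$.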
Most self-equitable measures discussed above are not robust-equitable. Direct calculations show that the mutual information MI and the copula distance $CD_\alpha$ for all $\alpha > 1$ equal $\infty$ for the mixture copula with any $p > 0$. Hence they are not robust-equitable, and neither are their scaled versions (MIcor and the other scaled versions all equal $1$). On the mixture copula, the Tsallis entropies likewise fail to equal $p$; hence they are also not robust-equitable.
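The mixture setting behind robust-equitability can be simulated directly. In the sketch below, a fraction `p` of points lies exactly on a curve and the rest is uniform noise on the unit square; the dependence value is approximated by a coarse histogram version of half the $L_1$ distance from independence, not by the KDE-based estimator of this paper. The sampler, the tent-shaped curve, and the grid size `m` are illustrative assumptions.

```python
import numpy as np

def ranks(x):
    return (np.argsort(np.argsort(x)) + 0.5) / len(x)

def ccor_hist(x, y, m=10):
    """Histogram plug-in for half the L1 distance of the copula density from 1."""
    counts, _, _ = np.histogram2d(ranks(x), ranks(y), bins=m, range=[[0, 1], [0, 1]])
    p = counts / counts.sum()
    return float(0.5 * np.abs(p - 1.0 / m ** 2).sum())

def sample_mixture(n, p, curve, rng):
    """With probability p put the point on v = curve(u); otherwise uniform noise."""
    u = rng.uniform(size=n)
    v = rng.uniform(size=n)
    signal = rng.uniform(size=n) < p
    v[signal] = curve(u[signal])
    return u, v

rng = np.random.default_rng(1)
p = 0.5
u1, v1 = sample_mixture(20000, p, lambda u: u, rng)                       # linear
u2, v2 = sample_mixture(20000, p, lambda u: 1 - np.abs(2 * u - 1), rng)   # tent

est_linear, est_tent = ccor_hist(u1, v1), ccor_hist(u2, v2)
pearson_linear = float(np.corrcoef(u1, v1)[0, 1])
pearson_tent = float(np.corrcoef(u2, v2)[0, 1])
print(est_linear, est_tent, pearson_linear, pearson_tent)
```

Both histogram values sit somewhat below $p$ because a finite grid smooths the singular component (roughly $p(1 - 1/m)$ for the diagonal and $p(1 - 2/m)$ for the tent curve), and they move toward $p$ as the grid is refined with more data. Pearson's correlation recovers $p$ only in the linear case and is near zero for the tent curve.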
We do not have a proof of whether Rcor is robust-equitable. However, Rcor has many drawbacks, as mentioned earlier. As shown in the examples in Table 2, Rcor equals one too often. Because Rcor's definition involves taking the supremum over all Borel functions, its theoretical properties are often hard to analyze. Another drawback of Rcor is that it is very difficult to estimate. There is no commonly accepted estimator for Rcor.
The difference between selfequitable and robustequitable measures is illustrated through examples in Figure 2. Figures 1(a) and 1(b) shows of data coming from two deterministic curves, and in Figures 1(c) and 1(d) the of data is nearly deterministic around the curve in a very small strip of area . In Figure 2, MI and Ccor are selfequitable, (their values are the same on (a) and (b), and the same on (c) and (d)), whereas Pearson’s correlation coefficient is not. However, the data distributions in (a) and (b) () are in fact very close to the corresponding cases of (c) and (d) (), Ccor reflects this with (differ only in order) in all cases but MI does not.
From the examples, we see that self-equitability is not sufficient for a good dependence measure. While self-equitability ensures the measure's invariance under the transformation between panels (a) and (b), MI would equal $\infty$, an unreasonable value for those cases. In fact, MI would equal $\infty$ for an arbitrarily tiny amount of hidden deterministic relationship in the data. Therefore, its value is very unstable. This instability makes the consistent estimation of MI impossible, as we will show in Section 4.
3.3 Multivariate Extensions
We have so far concentrated on the simple bivariate case. The dependence measure can be extended to the multivariate case.
There are two possible directions of extending dependence measures to the multivariate case. In the first direction, we are interested in any dependence among $d$ variables $X_1$, …, $X_d$. Therefore, the divergence of their joint distribution from the independent joint distribution (the product of marginals) can be used to measure such dependence. Schmid et al. (2010) provided higher-dimensional extensions of many copula-based dependence measures along this direction. We define a multivariate version of Ccor as half the $L_1$ distance between the $d$-dimensional joint copula density $c(u_1,\ldots,u_d)$ and the independence copula density:
(15) $Ccor = \frac{1}{2}\int_{[0,1]^d} |c(u_1,\ldots,u_d) - 1|\,du_1 \cdots du_d$.
The corresponding robustequitability definition becomes
Definition 6.
A dependence measure $D$ is robustequitable if and only if $D(X_1, \ldots, X_d) = p$ whenever $(X_1, \ldots, X_d)$ follows a distribution whose copula is $C = pC_s + (1-p)\Pi$, for a singular copula $C_s$.
Here $\Pi$ is the independence copula of dimension $d$.
It is easy to check that Ccor is robustequitable for this $d$-dimensional extension.
In the second direction, we can divide the $d$-dimensional vector into a $d_1$-dimensional vector $X$ and a $d_2$-dimensional vector $Y$ with $d_1 + d_2 = d$. We then want a dependence measure between $X$ and $Y$ that does not reflect the dependence within $X$ or within $Y$. The dcor (Székely and Rizzo, 2009) is a dependence measure of this type. Along this direction, we define the multivariate version of Ccor for $X$ and $Y$ as
(16) $\mathrm{Ccor}(X; Y) = \frac{1}{2} \int_{[0,1]^d} |c(u, v) - c_X(u)\, c_Y(v)| \, du \, dv$
Here $c$ is the joint copula density of $(X, Y)$, and $c_X$ and $c_Y$ are the copula densities for $X$ and $Y$ respectively. The robustequitability definition in this direction of extension is
Definition 7.
A dependence measure $D$ is robustequitable if and only if $D(X; Y) = p$ whenever $(X, Y)$ follows a distribution whose copula is $C = pC_s + (1-p)C_X C_Y$, for a singular copula $C_s$.
Here $C_X$ and $C_Y$ are the $d_1$-dimensional and $d_2$-dimensional copulas of $X$ and $Y$ respectively. The measure Ccor is robustequitable under this definition.
4 Statistical Error in the Dependence Measure Estimation
We now turn our attention to the statistical errors in estimating the dependence measures. In particular, we focus on the two selfequitable measures, MI and Ccor.
First, we point out that the first class of dependence measures is generally estimable at the parametric rate of $n^{-1/2}$. These measures, including Hoeffding’s and Wolf’s measures, are defined through the CDFs. Writing each of them as a functional of the copula function $C$, we can estimate it by a plug-in estimator that replaces $C$ with its empirical estimator $C_n$. Since $C_n$ converges to $C$ at the parametric rate of $n^{-1/2}$ (Omelka, Gijbels and Veraverbeke, 2009; Segers, 2012), these measures can also be estimated at the parametric rate of $n^{-1/2}$.
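A minimal sketch of such a plug-in estimator follows. As an assumption for illustration, it uses the Schweizer and Wolff functional $\sigma(C) = 12 \int\!\!\int |C(u,v) - uv| \, du \, dv$ as the CDF-based measure; the function names (`pseudo_obs`, `empirical_copula`, `sigma_plugin`), the rank scaling by $n$, and the midpoint-grid integration are all choices made here, not taken from the paper.

```python
import numpy as np

def pseudo_obs(x):
    """Rank-transform a sample to (0, 1]; assumes continuous data (no ties)."""
    return (np.argsort(np.argsort(x)) + 1) / len(x)

def empirical_copula(u, v, grid):
    """Empirical copula C_n(s, t) evaluated at all pairs of grid points."""
    # Indicator matrices: rows index grid points, columns index observations.
    iu = (u[None, :] <= grid[:, None]).astype(float)
    iv = (v[None, :] <= grid[:, None]).astype(float)
    return iu @ iv.T / len(u)

def sigma_plugin(x, y, m=100):
    """Plug-in estimate of 12 * integral |C(u, v) - u v| du dv."""
    grid = (np.arange(m) + 0.5) / m              # midpoints of an m x m grid
    cn = empirical_copula(pseudo_obs(x), pseudo_obs(y), grid)
    indep = np.outer(grid, grid)                 # independence copula u * v
    return 12.0 * np.mean(np.abs(cn - indep))    # midpoint-rule integral

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(sigma_plugin(x, x))                        # perfectly dependent sample
print(sigma_plugin(x, rng.normal(size=500)))     # independent sample
```

Because the estimator depends on the data only through ranks, it is unchanged by strictly increasing transformations of either variable.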
However, the selfequitable measures come from the third class of dependence measures, which involve the density function. Hence the parametric rate of convergence can only be achieved with the plug-in density estimator for discrete distributions (e.g., Joe, 1989). The convergence rate for continuous distributions needs more care. We consider the estimation of MI and Ccor in subsections 4.1 and 4.2, respectively.
4.1 The Mutual Information Is Not Consistently Estimable
The estimation of MI has been studied extensively in the literature. Over all distributions, even discrete ones, no uniform rate of convergence is possible for MI (Antos and Kontoyiannis, 2001; Paninski, 2003). On the other hand, many estimators were shown to converge to MI for every distribution. These two results are not contradictory; rather, they reflect a phenomenon common to many parameters. The first result is about uniform convergence over all distributions, while the second is about pointwise convergence for each distribution. The first requirement is too strong, while the second is too weak. The difficulty of estimating a parameter needs to be studied through uniform convergence over a properly chosen family.
As MI is defined through the copula density, it is natural to consider the families generally used in the density estimation literature. Starting from Farrell (1972), it is standard to study the minimax rate of convergence for density estimation over the class of functions whose $m$th derivatives satisfy the Hölder condition. Since the minimax convergence rate is usually achieved by the kernel estimator, it is also the optimal convergence rate of density estimation under those Hölder classes. The optimal rate of convergence for the two-dimensional kernel density estimator, with the Hölder condition imposed on the $m$th derivatives, is well known (Silverman, 1986; Scott, 1992).
Therefore, when studying the convergence of MI estimators, it is very tempting to impose the Hölder condition on the $m$th derivatives of the copula density. In fact, under the Hölder condition on the copula density itself (i.e., on the $0$th derivative), Liu, Lafferty and Wasserman (2012) showed that the kernel density estimation (KDE) based MI estimator converges at the parametric rate of $n^{-1/2}$. Pál, Póczos and Szepesvári (2010) also considered a similar Hölder condition when they studied the convergence of the $k$-nearest-neighbor (KNN) based MI estimator. However, we argue that such conditions are too strong for copula densities, and thus these results do not reflect the true difficulty of MI estimation.
Specifically, the Hölder condition on the copula density means
(17) $|c(u_1, v_1) - c(u_2, v_2)| \le M_1 \|(u_1 - u_2, v_1 - v_2)\|$
for a constant $M_1$ and all $u_1$, $v_1$, $u_2$, $v_2$ values between $0$ and $1$. Here and in the following, $\|\cdot\|$ refers to the Euclidean norm. However, this Hölder condition (17) would exclude all commonly used continuous copula densities since they are unbounded (Omelka, Gijbels and Veraverbeke, 2009; Segers, 2012). Therefore, we need to consider the minimax convergence rate under a less restrictive condition.
When $c$ is unbounded, the Hölder condition cannot hold on the region where $c$ is big. Hence we impose it only on the region where the copula density is small. Specifically, we assume that the Hölder condition (17) holds only on the region $A_M = \{(u, v) : c(u, v) < M\}$ for a constant $M$. That is, (17) holds whenever $c(u_1, v_1) < M$ and $c(u_2, v_2) < M$. This condition is satisfied by all common continuous copulas in the book by Nelsen (2006). For example, all Gaussian copulas satisfy the Hölder condition (17) on $A_M$ for some constants $M$ and $M_1$. But no Gaussian copula, except the independence copula $\Pi$, satisfies the Hölder condition (17) over the whole $[0,1]^2$.
If (17) holds on $A_M$ for any particular $M$ and $M_1$ values, then (17) also holds on $A_M$ for all smaller $M$ values and for all bigger $M_1$ values. Without loss of generality, we assume that $M$ is close to $1$ and $M_1$ is a big constant.
Let $\mathcal{C}$ denote the class of continuous copulas whose density satisfies the Hölder condition (17) on $A_M$. We can then study the minimax risk of estimating MI for $C \in \mathcal{C}$. Without loss of generality, we consider a data set consisting of $n$ independent observations $(U_1, V_1)$, …, $(U_n, V_n)$ from a copula distribution $C \in \mathcal{C}$.
Theorem 1.
The proof of Theorem 1 uses a method of Le Cam (Le Cam, 1973, 1986) based on finding a pair of hardest-to-estimate copulas. That is, we can find a pair of copulas $C_1$ and $C_2$ in the class $\mathcal{C}$ such that $C_1$ and $C_2$ are arbitrarily close in Hellinger distance but their mutual information values are very different. Then no estimator can estimate MI well at both copulas $C_1$ and $C_2$, leading to a lower bound for the minimax risk. A detailed proof is provided in Section 6.1.
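For orientation, one standard form of the two-point bound behind this kind of argument is sketched below. The notation ($P_0$, $P_1$, $T$, $\hat T_n$, $\Delta$, $\mathrm{TV}$, $H$) is generic and assumed here; it is not necessarily the exact version used in Section 6.1.

```latex
% Generic two-point lower bound (notation assumed, not taken from the paper).
% P_0, P_1: two distributions in the class; T: the functional to estimate;
% \hat T_n: any estimator based on n i.i.d. observations;
% \Delta = |T(P_0) - T(P_1)|; TV: total variation distance;
% H: Hellinger distance with the normalization given on the last line.
\begin{align*}
\max_{j \in \{0,1\}} \mathbb{E}_{P_j^n} \bigl| \hat T_n - T(P_j) \bigr|
  &\ge \frac{\Delta}{2} \Bigl( 1 - \mathrm{TV}\bigl(P_0^n, P_1^n\bigr) \Bigr), \\
\mathrm{TV}\bigl(P_0^n, P_1^n\bigr)
  &\le \sqrt{2n}\, H(P_0, P_1),
  \qquad
  H^2(P_0, P_1) = \tfrac{1}{2} \int \bigl( \sqrt{dP_0} - \sqrt{dP_1} \bigr)^2 .
\end{align*}
```

If the Hellinger distance between the pair is small relative to $n^{-1/2}$ while $\Delta$ stays large, the right-hand side of the first line stays large for every estimator, which is the mechanism described above.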
In the literature, MI is estimated using methods including kernel density estimation (KDE) (Moon, Rajagopalan and Lall, 1995), the $k$-nearest-neighbor (KNN) method (Kraskov, Stögbauer and Grassberger, 2004), and maximum likelihood estimation of the density ratio (Suzuki, Sugiyama and Tanaka, 2009). There are also other density estimation based MI estimators (Blumentritt and Schmid, 2012) that use the Beta kernel density estimation (Chen, 1999) and the Bernstein estimator (Bouezmarni, Ghouch and Taamouti, 2013).
No matter which MI estimator above is used, Theorem 1 states that its minimax risk over the family $\mathcal{C}$ is infinite. Also, the scaled version for estimating MIcor has minimax risk bounded away from zero. That is, MI and MIcor cannot be estimated consistently over the class $\mathcal{C}$. This inconsistency is not specific to an estimation method. The estimation difficulty comes from the instability of MI due to its definition, as shown by the huge difference in MI values in Figures 1(a) and 1(c) for two virtually identical probability distributions.
Mathematically, MI is unstable because it overweights the region with large density values. From equation (9), MI is the expectation of $\log c(U, V)$ under the true copula distribution $C$. In contrast, the Ccor in (13) takes the expectation at the independence case $\Pi$ instead. This allows consistent estimation of Ccor over the family $\mathcal{C}$, as shown in the next subsection 4.2.
4.2 The Consistent Estimation Of Copula Correlation
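The proposed copula correlation measure Ccor can be consistently estimated since the region of large copula density values has little effect on it. To see this, we derive an alternative expression of (13). Let $x_+ = \max(x, 0)$ denote the nonnegative part of $x$. Then
$\int_0^1 \int_0^1 [c(u,v) - 1]_+ \, du \, dv - \int_0^1 \int_0^1 [1 - c(u,v)]_+ \, du \, dv = \int_0^1 \int_0^1 [c(u,v) - 1] \, du \, dv = 0.$
Hence $\int_0^1 \int_0^1 [c(u,v) - 1]_+ \, du \, dv = \int_0^1 \int_0^1 [1 - c(u,v)]_+ \, du \, dv$. Therefore,
$\int_0^1 \int_0^1 |c(u,v) - 1| \, du \, dv = 2 \int_0^1 \int_0^1 [1 - c(u,v)]_+ \, du \, dv.$
Then we arrive at the alternative expression
(19) $\mathrm{Ccor} = \int_0^1 \int_0^1 [1 - c(u,v)]_+ \, du \, dv$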
In the new expression (19), Ccor only depends on $[1 - c(u,v)]_+$, which is nonzero only when $c(u,v) < 1$. To estimate Ccor well, we only need the density estimator to be good for points with low copula density. Specifically, we consider the plug-in estimator
(20) $\widehat{\mathrm{Ccor}} = \int_0^1 \int_0^1 [1 - \hat c_n(u,v)]_+ \, du \, dv$
where $\hat c_n$ is a kernel density estimator with kernel $K$ and bandwidth $h$.
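A minimal numerical sketch of a plug-in estimator of this kind is given below, assuming the observations are already on the copula scale. Everything specific in it is an assumption made for illustration: the function name `ccor_plugin`, the Epanechnikov product kernel (chosen because it has compact support), the midpoint-grid integration, and the bandwidth value. It applies no boundary correction, so the density is underestimated near the edges of the unit square and the estimate is biased upward for small samples.

```python
import numpy as np

def ccor_plugin(u, v, h, m=50):
    """Plug-in estimate of the integral of [1 - c(u, v)]_+ over [0, 1]^2.

    u, v : copula-scale observations in [0, 1]
    h    : kernel bandwidth (illustrative choice, not prescribed by the paper)
    m    : number of grid points per axis for the midpoint rule
    """
    grid = (np.arange(m) + 0.5) / m               # midpoints of an m x m grid

    def epan(t):                                  # Epanechnikov kernel on [-1, 1]
        return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

    ku = epan((grid[:, None] - u[None, :]) / h) / h   # shape (m, n)
    kv = epan((grid[:, None] - v[None, :]) / h) / h   # shape (m, n)
    c_hat = ku @ kv.T / len(u)                    # product-kernel KDE on the grid
    return np.mean(np.maximum(1.0 - c_hat, 0.0))  # midpoint-rule integral

rng = np.random.default_rng(0)
u = rng.uniform(size=2000)
v = rng.uniform(size=2000)
print(ccor_plugin(u, v, h=0.15))   # independent sample: small, but not zero
print(ccor_plugin(u, u, h=0.15))   # deterministic relationship: much larger
```

Only grid points where the estimated density falls below one contribute to the result, so large errors in regions of high estimated density do not enter the estimate directly.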
To analyze the statistical error of $\widehat{\mathrm{Ccor}}$, we can look at the error in the low copula density region separately from the error in the high copula density region. Specifically, we fix a threshold strictly between $1$ and $M$ and separate the unit square into the low copula density region, where $c$ is below the threshold, and the high copula density region, where it is not. The error of $\widehat{\mathrm{Ccor}}$ then splits into the contributions from these two regions. Since the Hölder condition (17) holds on $A_M$, the classical error rate for the kernel density estimator holds for $\hat c_n$ on the low copula density region. Hence the error contribution from this region is bounded at the same rate. While the density estimation error can be unbounded on the high copula density region, it only propagates into error for $\widehat{\mathrm{Ccor}}$ when $\hat c_n < 1$. We can show that the overall propagated error is controlled at a higher order. Therefore, the error rate of $\widehat{\mathrm{Ccor}}$ can be controlled by the classical kernel density estimation error rate, as summarized in the following Theorem 2.
Theorem 2.
Let $\hat c_n$ be a kernel estimator of the copula density $c$ based on observations $(U_1, V_1)$, …, $(U_n, V_n)$. We assume the following conditions:
1. The bandwidth and .
2. The kernel has compact support .
3. , and .
Then the plugin estimator in (20) has a risk bound
(21) 
for some finite constant .
The detailed proof of Theorem 2 is provided in Section 6.2. From (21), with an appropriate choice of the bandwidth $h$, $\widehat{\mathrm{Ccor}}$ converges to the true value Ccor at the rate implied by the bound. Thus Ccor can be consistently estimated, in contrast to the results on MI and MIcor in subsection 4.1.
Theorem 2 provides only an upper bound for the statistical error of the plug-in estimator $\widehat{\mathrm{Ccor}}$. The actual error may be lower. In fact, a sharper control of the error is possible using the kernel density estimator (Bickel and Ritov, 2003). Here we did not find the optimal rate of convergence. But the upper bound already shows that Ccor is much easier to estimate than MI and MIcor. Similar to classical kernel density estimation theory, assuming that the Hölder condition holds on $A_M$ for the $m$th derivatives of the copula density, the upper bound on the convergence rate can be further improved.
The technical conditions in Theorem 2 are classical conditions on the bandwidth and the kernel. We have used the bivariate product kernel for technical simplicity. Other variations of the conditions in the literature may be used. For example, it is possible to relax the compact support condition 2 to allow using the Gaussian kernel.
Further adjustment is needed for a practical estimator for Ccor. In practice, the ’s are not observed. From the raw data of ’s, , it is conventional to estimate , and then calculate using ’s. Here is the rank of