Mutual Dependence: A Novel Method for Computing Dependencies Between Random Variables


Rahul Agarwal, Pierre Sacré, and Sridevi V. Sarma

R. Agarwal, P. Sacré, and S. V. Sarma are with the Institute for Computational Medicine and the Department of Biomedical Engineering, The Johns Hopkins University, Baltimore, MD (rahul.jhu@gmail.com, p.sacre@jhu.edu, sree@jhu.edu).
Abstract

In data science, it is often required to estimate dependencies between different data sources. These dependencies are typically quantified using Pearson’s correlation, distance correlation, and/or mutual information. However, none of these measures satisfies all of Granger’s axioms for an “ideal measure”. One such ideal measure, proposed by Granger himself, calculates the Bhattacharyya distance between the joint probability density function (pdf) and the product of the marginal pdfs. We call this measure the mutual dependence. However, to date this measure has not been directly computable from data. In this paper, we use our recently introduced maximum likelihood non-parametric estimator for band-limited pdfs to compute the mutual dependence directly from data. We construct the estimator of mutual dependence and compare its performance to standard measures (Pearson’s and distance correlation) for different known pdfs by computing convergence rates, computational complexity, and the ability to capture nonlinear dependencies. Our mutual dependence estimator requires fewer samples to converge to theoretical values, is faster to compute, and captures more complex dependencies than the standard measures.

I Introduction

In data science and modeling, it is often required to test whether two random variables are independent. Out of the several measures that quantify dependencies between random variables, the most widely used are mutual information [2], Pearson’s correlation [1], and distance correlation [3, 4].

Mutual information is generally regarded as a benchmark for quantifying dependencies between random variables; however, it can only be computed by first estimating the joint and marginal probability density functions (pdfs). Pearson’s correlation, $\rho$, can be estimated directly from data, but it does not capture nonlinear dependencies. Distance correlation, $\mathcal{R}$, can also be estimated directly from data and can capture nonlinear dependencies, but it is in general slow to compute (computational complexity $O(N^2)$, where $N$ is the number of samples). Further, distance correlation often does not reflect nonlinear dependencies correctly, in the sense described succinctly by Rényi’s axioms [5], which were slightly refined by Granger, Maasoumi, and Racine [6]; see Table I. Specifically, distance correlation is not invariant under strictly monotonic transformations (axiom 6 in Table I).

An “ideal measure” should satisfy the axioms given in Table I and should be directly estimable from data. A less popular and unnamed measure uses the Bhattacharyya distance between the joint pdf and the product of the marginals as a measure of dependence between two random variables [7, 8]. It has been shown that this measure satisfies all six axioms. Importantly, this measure is invariant under continuous and strictly increasing transformations [9, 6]. It is also closely related to mutual information, k-class entropy, and copulas [10, 11, 12]. In this paper, we call this measure mutual dependence.

#   Property
1   $D(X,Y)$ is well defined and $0 \le D(X,Y) \le 1$
2   $D(X,Y) = 0$ if and only if $X$ and $Y$ are independent
3   $D(X,Y) = D(Y,X)$
4   $D(X,Y) = 1$ if there is a strict dependence between $X$ and $Y$
5   $D(X,Y)$ is a function of the correlation coefficient $\rho$ if the joint distribution of $X$ and $Y$ is normal
6   $D(X,Y)$ is invariant under continuous and strictly monotonic transformations of $X$ and $Y$
TABLE I: Desired properties of an ideal dependency measure $D(X,Y)$.

Mutual dependence has not been widely used because, like mutual information, it requires non-parametric density estimation to compute the marginal and joint pdfs, which are then substituted into the theoretical measure and numerically integrated to yield estimates. This process is both computationally complex and inaccurate. In this paper, we develop an estimator that computes mutual dependence directly from the data. It uses our recently proposed Band-Limited Maximum Likelihood (BLML) estimator, which maximizes the data likelihood function over the set of band-limited pdfs with known cut-off frequency $f_c$. The BLML estimator is consistent, efficiently computable, and results in a smooth pdf [13]. The BLML estimator also has a faster rate of convergence and reduced computational complexity compared with other widely used non-parametric methods such as kernel density estimators. Along with these properties, if the BLML estimator is substituted into the expression for mutual dependence (see (5)), the mutual dependence can be computed directly from the data without performing numerical integration, which is often inaccurate and inefficient.
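For contrast, the sketch below spells out the naive plug-in approach just described: the joint and marginal pdfs are first estimated with Gaussian kernel density estimators, and the Bhattacharyya-type integral is then evaluated numerically on a grid, assuming the Hellinger-type form of the measure given later in Definition 1. The function name, grid size, and integration margins are illustrative choices and are not part of the method proposed in this paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mutual_dependence_plugin(x, y, grid_size=200):
    """Naive plug-in estimate of the mutual dependence D(X, Y).

    Estimates the joint and marginal pdfs with Gaussian KDEs and then
    numerically evaluates 1 - iint sqrt(f_XY(x, y) f_X(x) f_Y(y)) dx dy
    on a rectangular grid.  This is the slow, integration-based baseline
    that the BLML-based estimator developed in this paper avoids.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    kde_xy = gaussian_kde(np.vstack([x, y]))         # joint pdf estimate
    kde_x, kde_y = gaussian_kde(x), gaussian_kde(y)  # marginal pdf estimates

    # Rectangular integration grid covering the data with a margin.
    xs = np.linspace(x.min() - 3 * x.std(), x.max() + 3 * x.std(), grid_size)
    ys = np.linspace(y.min() - 3 * y.std(), y.max() + 3 * y.std(), grid_size)
    X, Y = np.meshgrid(xs, ys)

    f_xy = kde_xy(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)
    f_x_f_y = np.outer(kde_y(ys), kde_x(xs))         # rows index y, columns index x

    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    bc = np.sqrt(f_xy * f_x_f_y).sum() * dx * dy     # Bhattacharyya coefficient
    return 1.0 - bc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000)
    y = 0.8 * x + 0.6 * rng.standard_normal(1000)
    print(mutual_dependence_plugin(x, y))
```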

We show through simulations that our mutual dependence estimator, $\hat{D}$, converges faster than the Pearson’s and distance correlation estimators, $\hat{\rho}$ and $\hat{\mathcal{R}}$, for various data sets with different types of linear and nonlinear dependencies, and that the convergence rate of $\hat{D}$ is maintained across different types of nonlinearities. $\hat{D}$ is also faster to compute than $\hat{\mathcal{R}}$, as it has $O(N_b^2)$ time complexity, where $N_b$ is the number of bins containing a nonzero number of samples, which is always less than or equal to $N$ (the number of data samples).

The paper is organized as follows. Section II discusses the variation of the different measures as a function of mutual information and nonlinearity. Section III introduces the notion of mutual dependence and its estimator. Section IV uses simulations to compare the convergence of mutual dependence with Pearson’s and distance correlation for different nonlinear dependencies and marginal pdfs. We end the paper with conclusions and future work in Section V.

Fig. 1: Point clouds. Point clouds of data generated from (1) for the different nonlinearities and generating pdfs.

II A motivating example

Consider two random variables $X$ and $Y$ related by a model of the form

$$Y = g(X) + \eta, \qquad (1)$$

where $X$ and the noise $\eta$ follow either a band-limited pdf or a normal pdf, and where $g(\cdot)$ is one of four types of (nonlinear) dependence: linear, quadratic, cubic, or sinusoidal.

The ‘spread’ $\sigma$ of the noise is varied to obtain different degrees of dependence. Figure 1 illustrates the data generated in this example.
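A minimal sketch of this kind of data generation is given below. The additive-noise form, the specific nonlinear maps, the heavy-tailed stand-in for the band-limited pdf, and the spread value are illustrative assumptions rather than the exact parameterization used in the paper.

```python
import numpy as np

def generate_xy(n, nonlinearity="linear", pdf="normal", spread=0.5, seed=0):
    """Generate (x, y) pairs with y = g(x) + noise, mimicking the motivating example.

    The four dependence types (linear, quadratic, cubic, sinusoidal) and the
    choice between a normal and a band-limited generating pdf follow the text;
    the functional forms, the heavy-tailed stand-in for the band-limited pdf,
    and the constants are illustrative only.
    """
    rng = np.random.default_rng(seed)

    if pdf == "normal":
        x = rng.standard_normal(n)
        noise = rng.standard_normal(n)
    else:  # crude heavy-tailed stand-in; the paper samples from a true band-limited pdf
        x = rng.standard_t(df=5, size=n)
        noise = rng.standard_t(df=5, size=n)

    g = {
        "linear":     lambda t: t,
        "quadratic":  lambda t: t ** 2,
        "cubic":      lambda t: t ** 3,
        "sinusoidal": lambda t: np.sin(2 * np.pi * t),
    }[nonlinearity]

    y = g(x) + spread * noise          # larger spread -> weaker dependence
    return x, y

if __name__ == "__main__":
    for nl in ("linear", "quadratic", "cubic", "sinusoidal"):
        x, y = generate_xy(2000, nonlinearity=nl, spread=0.3)
        print(nl, round(np.corrcoef(x, y)[0, 1], 3))
```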

The goal of a dependency measure is to quantify the dependence between $X$ and $Y$ from the data. In cases where the underlying pdfs are known, these dependencies are captured well by mutual information.

Therefore, in Figure 2, we plot the theoretical values of Pearson’s correlation and distance correlation as a function of mutual information for the four nonlinearity types and the two generating pdfs. The three measures are defined as follows:

  • Mutual information

    $$I(X;Y) = \iint f_{XY}(x,y)\, \ln\frac{f_{XY}(x,y)}{f_X(x)\, f_Y(y)}\; dx\, dy \qquad (2)$$

  • Pearson’s correlation

    $$\rho(X,Y) = \frac{E\big[(X - E[X])(Y - E[Y])\big]}{\sqrt{E\big[(X - E[X])^2\big]\, E\big[(Y - E[Y])^2\big]}} \qquad (3)$$

  • Distance correlation

    $$\mathcal{R}^2(X,Y) = \frac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X,X)\, \mathcal{V}^2(Y,Y)}}, \quad \text{with} \quad \mathcal{V}^2(X,Y) = \iint \frac{\big|\phi_{XY}(t,s) - \phi_X(t)\,\phi_Y(s)\big|^2}{|t|^{1+p}\, |s|^{1+q}}\; dt\, ds \qquad (4)$$

    Here $\phi_{XY}$, $\phi_X$, and $\phi_Y$ are the respective characteristic functions, and $p$ and $q$ are the dimensions of $X$ and $Y$. For details see [3]. (Note that we have eliminated the constants from the definition of $\mathcal{V}^2$ as they are not needed to define $\mathcal{R}$.) A direct sample-based sketch of these estimators is given after this list.
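For reference, the sketch below gives direct sample-based versions of Pearson’s correlation (3) and of distance correlation (4), the latter through the double-centered pairwise-distance matrices of [3]; building those full $N \times N$ matrices is what makes the distance correlation estimator $O(N^2)$. The example data at the bottom are illustrative.

```python
import numpy as np

def pearson(x, y):
    """Sample Pearson's correlation, cf. (3)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def distance_correlation(x, y):
    """Sample distance correlation, cf. (4), via double-centered distance matrices.

    The full N x N pairwise-distance matrices are what make this estimator
    O(N^2) in both time and memory.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])            # pairwise distances of x
    b = np.abs(y[:, None] - y[None, :])            # pairwise distances of y

    def double_center(d):
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    A, B = double_center(a), double_center(b)
    dcov2_xy = (A * B).mean()
    dcov2_xx = (A * A).mean()
    dcov2_yy = (B * B).mean()
    return np.sqrt(dcov2_xy / np.sqrt(dcov2_xx * dcov2_yy))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal(500)
    y = x ** 2 + 0.1 * rng.standard_normal(500)    # nonlinear (quadratic) dependence
    print("pearson:", round(pearson(x, y), 3))     # near zero despite the dependence
    print("dcor:   ", round(distance_correlation(x, y), 3))
```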

Both Pearson’s and distance correlation depend largely on the nonlinearity for a given value of mutual information. This variability may occur because neither correlation measure is invariant to strictly monotonic transformations, unlike mutual information. Therefore, changing the type of nonlinearity results in different values for both Pearson’s and distance correlation, while the mutual information remains invariant. Such variability is undesirable as it may lead to incorrect inferences when comparing dependencies between data sets having different types of nonlinear dependencies. Therefore, a measure that is invariant to strictly monotonic transformations is desirable.

Fig. 2: Pearson’s and distance correlation. Theoretical values of $\rho$ and $\mathcal{R}$ as a function of mutual information for the different nonlinearities and generating pdfs used in Figure 1.

III Mutual dependence and its estimation

In this section, we introduce the mutual dependence, which is based on an existing unnamed measure, and show several properties of this measure. Then, we derive an estimator of mutual dependence computed directly from data generated from band-limited pdfs. Finally, we describe efficient algorithms to compute this estimator.

III-A Mutual dependence

Consider two random variables $X$ and $Y$, their joint pdf $f_{XY}$, and their marginal pdfs $f_X$ and $f_Y$. These random variables are independent if and only if $f_{XY}(x,y) = f_X(x)\, f_Y(y)$ for all $(x,y)$. It is therefore natural to measure dependence as the distance (in the space of pdfs) between the joint pdf and the product of the marginal pdfs. A good candidate distance is the Bhattacharyya distance (closely related to the Hellinger distance); see [6, 9] for details.

Definition 1

The mutual dependence $D(X,Y)$ between two random variables $X$ and $Y$ is defined as the Bhattacharyya distance between their joint pdf $f_{XY}$ and the product of their marginal pdfs $f_X$ and $f_Y$, that is,

$$D(X,Y) = B\big(f_{XY},\, f_X f_Y\big) \qquad (5)$$

with

$$B(f_1, f_2) = 1 - \int \sqrt{f_1(s)\, f_2(s)}\; ds. \qquad (6)$$

We call this measure ‘mutual dependence’ because it mirrors mutual information most closely: for a given value of mutual information, the value of mutual dependence remains almost the same irrespective of the nonlinearity type, which is not true for Pearson’s and distance correlation (see Figure 3).

III-B Properties of mutual dependence

Due to the symmetry of (5), it is easy to see that $D(X,Y) = D(Y,X)$. The measure satisfies $0 < D(X,Y) < 1$ if $X$ and $Y$ are partially dependent, which quantifies the degree of dependence between the two random variables. In the extreme cases, $D(X,Y) = 0$ if $X$ and $Y$ are independent, and $D(X,Y) = 1$ if either $X$ or $Y$ is a Borel-measurable function of the other. Also, it can easily be established that $D$ is invariant under strictly monotonic transformations $u(X)$ and $v(Y)$, i.e., $D(u(X), v(Y)) = D(X,Y)$. A detailed description of these properties can be found in [6, 9].

For jointly normal data, the mutual dependence can be computed by first calculating the Bhattacharyya distance between two multivariate Gaussian distributions [14],

$$D_B = \frac{1}{8}\,(\mu_1 - \mu_2)^{\top}\, \Sigma^{-1}\, (\mu_1 - \mu_2) + \frac{1}{2}\ln\frac{\det \Sigma}{\sqrt{\det \Sigma_1\, \det \Sigma_2}}, \qquad \Sigma = \frac{\Sigma_1 + \Sigma_2}{2}, \qquad (7)$$

where $\mu_1$ and $\mu_2$ are the mean vectors and $\Sigma_1$ and $\Sigma_2$ the covariance matrices. Then substituting the joint normal pdf of $(X,Y)$ (with correlation coefficient $\rho$) and the product of its marginals (the same pdf with $\rho = 0$) for the two distributions gives

$$D(X,Y) = 1 - e^{-D_B} = 1 - \frac{(1 - \rho^2)^{1/4}}{\sqrt{1 - \rho^2/4}}, \qquad (8)$$

which is a function of $\rho$ only. This shows that mutual dependence satisfies axiom 5 (see Table I).
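As a numerical illustration of this Gaussian case, the sketch below integrates (5)–(6) on a grid for a standard bivariate normal with correlation $\rho$ and compares the result with the closed form (8). The grid size and integration limits are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def mutual_dependence_gaussian_grid(rho, grid_size=400, lim=6.0):
    """D(X, Y) for a standard bivariate normal with correlation rho,
    computed by grid integration of 1 - iint sqrt(f_XY f_X f_Y) dx dy."""
    xs = np.linspace(-lim, lim, grid_size)
    X, Y = np.meshgrid(xs, xs)
    joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    f_xy = joint.pdf(np.dstack([X, Y]))           # joint pdf on the grid
    f_x_f_y = norm.pdf(X) * norm.pdf(Y)           # product of the marginals
    dx = xs[1] - xs[0]
    bc = np.sqrt(f_xy * f_x_f_y).sum() * dx * dx  # Bhattacharyya coefficient
    return 1.0 - bc

def mutual_dependence_gaussian_closed_form(rho):
    """Closed form obtained by substituting the normal pdfs into (7): D = 1 - exp(-D_B)."""
    d_b = 0.5 * np.log((1.0 - rho ** 2 / 4.0) / np.sqrt(1.0 - rho ** 2))
    return 1.0 - np.exp(-d_b)

if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6, 0.9):
        print(rho,
              round(mutual_dependence_gaussian_grid(rho), 4),
              round(mutual_dependence_gaussian_closed_form(rho), 4))
```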

Fig. 3: Mutual dependence. Theoretical values of $D$ as a function of mutual information for the different nonlinearities and generating pdfs used in Figure 1.

III-C Estimation of mutual dependence

To estimate $D$, we use the BLML method [13], which maximizes the likelihood of the observed data samples over the set of band-limited pdfs. The BLML estimator has been shown to outperform kernel density estimators (KDEs) in both convergence rate and computational time, and hence provides a better alternative for non-parametric estimation of pdfs. In addition, the structure of the BLML estimator is well suited for evaluating the integral in (5), resulting in an estimate that is a direct function of the observed data and hence avoids numerical integration errors.

Below we briefly describe the BLML estimator.

Theorem III.1

Consider $N$ independent samples drawn from an unknown band-limited pdf $f$ with assumed cut-off frequency $f_c$. Then the BLML estimator of $f$ is given by:

(9)

where $f_c$ is the assumed cut-off frequency, the vectors $x_i$, $i = 1, \ldots, N$, are the data samples, and the vector of coefficients is given by:

(10)

The remaining terms appearing in (9) and (10) are defined in [13].

See [13] for details. We now introduce the estimator of $D$, denoted $\hat{D}$, in the following theorem.

Fig. 4: Monte Carlo estimates for band-limited generating pdfs. The Monte Carlo distributions of the estimates of the different measures for different nonlinearities and band-limited generating pdfs. Different markers distinguish the two sample sizes used. $\hat{D}$ is estimated assuming a known cut-off frequency $f_c$.
Fig. 5: Monte Carlo estimates for normal generating pdfs. The Monte Carlo distributions of the estimates of the different measures for different nonlinearities and normal generating pdfs. Different markers distinguish the two sample sizes used. $\hat{D}$ is estimated assuming a known cut-off frequency $f_c$.
Theorem III.2

If $(x_i, y_i)$, $i = 1, \ldots, N$, are paired independent and identically distributed observations and $f_c$ is the cut-off frequency parameter, then the estimator of mutual dependence is given by:

(11)

where the first auxiliary term in (11) is given by:

the second is:

and the third is:

Proof:

The BLML estimators of $f_{XY}$, $f_X$, and $f_Y$ from Theorem III.1 (using the corresponding cut-off frequencies) are plugged into (5), and the resulting expression is integrated in closed form, which yields $\hat{D}$.

III-D Computation of mutual dependence

As described in [13], solving for the BLML coefficients exactly requires exponential time. Therefore, the heuristic algorithms also described in [13], such as BLMLBQP and BLMLTrivial, can be used directly to compute the required quantities approximately, for small-scale and large-scale problems, respectively.

To further improve the computational time, the BLMLQuick algorithm [13] can also be used. BLMLQuick uses binning and estimates the required quantities approximately. It is also shown in [13] that both the BLMLTrivial and BLMLQuick algorithms yield consistent estimates of pdfs if the true pdf is strictly positive; therefore, when the joint pdf is strictly positive, the estimate $\hat{D}$ is also consistent.

IV Performance of mutual dependence

In this section, we evaluate the performance of our estimator of mutual dependence, first by comparing the empirical distribution of the estimator with the empirical distributions of the estimators of Pearson’s and distance correlation for different mutual information values, nonlinearities, and generating pdfs. We then compare the convergence of these metrics to their theoretical values for different sample sizes. Finally, we compare the computational complexity of our estimator with that of the distance correlation estimator to evaluate the relative computational time needed to implement each estimator.

IV-A Comparison of convergence rates for different nonlinearities

Figures 4 and 5 plot the estimated $\hat{\rho}$, $\hat{\mathcal{R}}$, and $\hat{D}$ for the two sample sizes, obtained from about 50 Monte Carlo runs, as a function of mutual information for different nonlinearities (linear, quadratic, cubic, and sinusoidal) and generating pdfs (band-limited and normal). The respective theoretical values are underlaid. Specifically, the first row shows about 50 Monte Carlo computations of $\hat{\rho}$ for different mutual information values, nonlinearities, and generating pdfs. It can be seen that, for both sample sizes, $\hat{\rho}$ works best for linear and sinusoidal data; for quadratic data it has a larger variance, and for cubic data it has a larger bias in the band-limited case. The second row shows 50 Monte Carlo computations of $\hat{\mathcal{R}}$ for different mutual information values, nonlinearities, and generating pdfs. It can be seen that, for both sample sizes, $\hat{\mathcal{R}}$ works best for linear data; for quadratic and sinusoidal data it has a larger bias, whereas for cubic data it has a larger variance. The bottom row shows 50 Monte Carlo computations of $\hat{D}$ for different mutual information values, nonlinearities, and generating pdfs. It can be seen that $\hat{D}$ works equally well for all nonlinearities and shows less bias and variance than both $\hat{\rho}$ and $\hat{\mathcal{R}}$.

Figure 6 plots the integral (over the different mutual-information values) of the mean squared error (IMSE) between the theoretical and estimated measures, computed using about 50 Monte Carlo runs, for the different nonlinearities and generating pdf types:

$$\mathrm{IMSE}(\hat{m}) = \int \frac{1}{N_{mc}} \sum_{j=1}^{N_{mc}} \big(\hat{m}_j(I) - m(I)\big)^2 \; dI \qquad (15)$$

Here, $N_{mc}$ is the number of Monte Carlo simulations and $m \in \{\rho, \mathcal{R}, D\}$ is the dependency metric. It can be seen from Figure 6 that the convergence rate is fastest for $\hat{D}$ irrespective of the nonlinearity type and/or the generating pdf. $\hat{\rho}$ and $\hat{\mathcal{R}}$ show an equally fast convergence rate for linear and normal data, but the rate is slower for nonlinear and non-normal data. Specifically, the first row shows the convergence of $\hat{\rho}$, from which it can be seen that its convergence to the theoretical values is fastest for linear data; for nonlinear data, the convergence is slower due to either large bias or large variance, as discussed previously. The second row shows the convergence of $\hat{\mathcal{R}}$. It does well for linear data, but the rate slows down and saturates for nonlinear data, again due to either large bias or large variance. In particular, for cubic and band-limited data, the IMSE of $\hat{\mathcal{R}}$ does not decrease with an increasing number of samples; this is due to the non-decreasing variance of the estimator (see Figure 4). The bottom row shows the convergence of $\hat{D}$, which converges equally well for all data types and generating pdfs.
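A minimal sketch of this Monte Carlo IMSE computation is given below for the simplest configuration (linear dependence, normal data, Pearson’s correlation), for which the theoretical value is known in closed form: for $Y = X + \sigma \eta$ with independent standard normal $X$ and $\eta$, $\rho = 1/\sqrt{1 + \sigma^2}$. Integrating over the noise spread rather than over mutual-information values, and the particular spread grid, are simplifying assumptions.

```python
import numpy as np

def imse_pearson_linear(sample_size, spreads, n_mc=50, seed=0):
    """Monte Carlo IMSE, in the spirit of (15), for Pearson's correlation on
    linear, normally distributed data y = x + spread * noise, for which the
    theoretical value is rho = 1 / sqrt(1 + spread**2)."""
    rng = np.random.default_rng(seed)
    mse = np.empty(len(spreads))
    for k, s in enumerate(spreads):
        true_rho = 1.0 / np.sqrt(1.0 + s ** 2)
        errors = []
        for _ in range(n_mc):
            x = rng.standard_normal(sample_size)
            y = x + s * rng.standard_normal(sample_size)
            errors.append((np.corrcoef(x, y)[0, 1] - true_rho) ** 2)
        mse[k] = np.mean(errors)                  # MSE at this dependence level
    # Riemann approximation of the integral over the dependence levels.
    return mse.sum() * (spreads[1] - spreads[0])

if __name__ == "__main__":
    spreads = np.linspace(0.1, 2.0, 10)
    for n in (50, 100, 200, 400, 800):
        print(n, imse_pearson_linear(n, spreads))
```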

Fig. 6: Integrated mean squared error vs. sample size. The integrated mean squared error as a function of sample size for the different measures, nonlinearities, and generating pdfs. $\hat{D}$ is estimated assuming a known cut-off frequency $f_c$.

IV-B Comparison of computational time

The computational complexity of computing $\hat{\rho}$ is the lowest, at $O(N)$, whereas the computational complexity of computing $\hat{\mathcal{R}}$ is the highest, at $O(N^2)$. The complexity of computing $\hat{D}$ is the same as that of the BLMLQuick algorithm, $O(N_b^2)$, where $N_b$ is the number of bins containing a nonzero number of samples, which is always less than or equal to $N$. For dense data, $N_b \ll N$; therefore, computing $\hat{D}$ is much quicker than computing $\hat{\mathcal{R}}$ in such cases.
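The dependence of this cost on $N_b$ rather than on $N$ is easy to illustrate: the sketch below counts the number of non-empty two-dimensional bins for increasingly dense data and shows that $N_b/N$ shrinks as $N$ grows. The bin width here is an arbitrary illustrative choice; the actual binning used by BLMLQuick is described in [13].

```python
import numpy as np

def nonempty_bin_count(x, y, bin_width=0.1):
    """Count N_b, the number of 2-D bins containing at least one sample.

    Binning-based computation scales with N_b rather than N: for dense data
    many samples fall into the same bin, so N_b grows much more slowly than N.
    """
    ij = np.stack([np.floor(x / bin_width), np.floor(y / bin_width)], axis=1)
    return len(np.unique(ij, axis=0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (1_000, 10_000, 100_000):
        x = rng.standard_normal(n)
        y = x + 0.5 * rng.standard_normal(n)
        n_b = nonempty_bin_count(x, y)
        print(f"N = {n:6d}   N_b = {n_b:5d}   N_b / N = {n_b / n:.3f}")
```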

V Conclusions

In this paper, we introduced a novel estimator for measuring dependency that can be computed directly from the data. Our estimator computes the mutual dependence, which is an “ideal” measure of dependence between two random variables [6]. Our estimator has an advantage over mutual information estimators in that it does not require estimating the pdfs from the data. It also has an advantage over Pearson’s and distance correlation estimators in that it is invariant under strictly monotonic transformations. Further, we showed in simulation that the estimators of both Pearson’s and distance correlation require more samples to achieve the same integrated mean squared error (IMSE) as our mutual dependence estimator, indicating a slower convergence rate. The slower convergence of the Pearson’s and distance correlation estimators was due to their higher variance and bias for nonlinearly dependent data. Such nonlinearities did not affect our estimator, which showed a uniform decrease in IMSE with increasing sample size for all tested nonlinearities. Furthermore, our estimator of mutual dependence has a computational time complexity of $O(N_b^2)$, where $N_b$ is the number of non-empty bins, which is superior to the $O(N^2)$ time complexity of distance correlation and is much faster when the data are dense.

V-A Future work

Although our estimator for the mutual dependence showed some nice properties under simulation, it remains to be established that it is consistent for arbitrary nonlinearities, which would require a theoretical proof. Further, in this paper, we assumed throughout that we knew the cut-off frequency of the band-limited pdf, or an approximate cut-off frequency for the normal pdf (the band where most of the power of the pdf lies, in case it is not band-limited). However, in general this cut-off frequency is not known. A more in-depth analysis is needed to understand the behavior of our estimator as a function of the cut-off frequency.

References

  • [1] J. Lee Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician, vol. 42, no. 1, pp. 59–66, 1988.
  • [2] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
  • [3] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
  • [4] G. J. Székely and M. L. Rizzo, “Brownian distance covariance,” Ann. Appl. Stat., vol. 3, pp. 1236–1265, 12 2009.
  • [5] A. Rényi, “On measures of dependence,” Acta Math. Acad. Sci. Hung., vol. 10, pp. 441–451, 1959.
  • [6] C. W. Granger, E. Maasoumi, and J. Racine, “A dependence metric for possibly nonlinear processes,” Journal of Time Series Analysis, vol. 25, no. 5, 2004.
  • [7] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52–60, 1967.
  • [8] R. Beran, “Minimum Hellinger distance estimates for parametric models,” The Annals of Statistics, pp. 445–463, 1977.
  • [9] H. Skaug and D. Tjøstheim, “Testing for serial independence using measures of distance between densities,” in Athens Conference on Applied Probability and Time Series (P. Robinson and M. Rosenblatt, eds.), Springer, 1996.
  • [10] C. Genest and R. J. MacKay, “The joy of copulas: Bivariate distributions with uniform marginals,” The American Statistician, vol. 40, pp. 280–283, 1986.
  • [11] R. Nelsen, An Introduction to Copulas. Springer-Verlag, Berlin, 1999.
  • [12] J. Havrda and F. Charvát, “Quantification method of classification processes: Concept of structural α-entropy,” Kybernetika, vol. 3, pp. 30–35, 1967.
  • [13] R. Agarwal, Z. Chen, and S. V. Sarma, “Nonparametric estimation of bandlimited probability density functions,” arXiv:1503.06236v1, http://arxiv.org/pdf/1503.06236v1.pdf, 2015.
  • [14] L. Pardo Llorente, Statistical inference based on divergence measures, vol. 185 of Statistics, textbooks and monographs. Boca Raton, FL: Chapman & Hall/CRC, 2006.