Fast Computing for Distance Covariance
Distance covariance and distance correlation have been widely adopted in measuring dependence of a pair of random variables or random vectors. If the computation of distance covariance and distance correlation is implemented directly accordingly to its definition then its computational complexity is O() which is a disadvantage compared to other faster methods. In this paper we show that the computation of distance covariance and distance correlation of real valued random variables can be implemented by an O() algorithm and this is comparable to other computationally efficient algorithms. The new formula we derive for an unbiased estimator for squared distance covariance turns out to be a U-statistic. This fact implies some nice asymptotic properties that were derived before via more complex methods. We apply the fast computing algorithm to some synthetic data. Our work will make distance correlation applicable to a much wider class of applications.
Dr. Xiaoming Huo is a professor in the School of Industrial and Systems Engineering at the Georgia Institute of Technology. He has been serving as a rotator at the National Science Foundation since August 2013. Mailing address: 765 Ferst Dr, Atlanta, GA 30332 (email: firstname.lastname@example.org).
Dr. Gábor J. Székely is a program officer at the National Science Foundation. Mailing address: 4201 Wilson Blvd, Arlington, VA 22203 (email: email@example.com).
Keywords: distance correlation, fast algorithm, statistical dependence
Since its induction (?), distance correlation has had many applications in, e.g., life science (?) and variable selection (?), and has been analyzed (?; ?), extended (?; ?) in various aspects. If distance correlation were implemented straightforwardly from its definition, its computational complexity can be as high as a constant times for a sample size . This fact has been cited for numerous times in the literature as a disadvantage of adopting the distance correlation. In this paper, we demonstrate that an O() algorithm for a version of the distance correlation exits.
The main idea behind the proposed algorithm is to use an idea rooted in the the AVL tree structure (?). The same idea has been utilized to develop fast algorithm for computing the Kendall’s rank correlation coefficient (?) (?). We extend it to make it suitable for our purpose. The derivation of the fast algorithm also involves significant reformulation from the original version of the distance correlation. Details are presented in this paper.
In simulations, not only we demonstrate the effectiveness of the fast algorithm, but also we testify that the advantage of using distance correlation (in comparison with other existing methods) become more evident when the sample sizes increase. These experiments become feasible due to the availability of the proposed fast algorithm. In one experiment (See details in Section 5.3), we increased the sample size by fold from a previously published simulation study.
The rest of this paper is organized as follows. Section 2 reviews the distance covariance/correlation and its relevant properties. In Section 3, we consider a reformulation of the distance covariance, such that the new estimator is both unbiased and a U-statistic. In Section 4, an algorithm with the average complexity of was presented. Extensive simulations are presented in Section 5 to demonstrate the additional capability we obtained due to the proposed fast algorithm. Finally, some concluding remarks are made in Section 6. Detailed description of the algorithm is relegated to the Appendix, along with most of the technical proofs.
2 A Review of Distance Covariance
Distance covariance and distance correlation was introduced in 2005 by one of the co-authors of this paper, G. J. Székely, in several lectures to address the deficiency of Pearson’s correlation, namely that the Pearson’s correlation can be zero for dependent variables. In the following, we start with a definition of the distance covariance.
The population distance covariance between random vectors and with finite first moments is the nonnegative number defined by a weighted norm measuring the distance between the joint characteristic function (c.f.) of and , and the product of the marginal c.f.’s of and . If and take values in and , respectively, is
where . The integral exists provided that and have finite first moments.
This immediately shows that distance covariance is zero if and only if the underlying random variables are independent. The beauty of kernel is that the corresponding sample statistic has the following surprisingly simple form. Denote the pairwise distances of the observations by and the pairwise distances of the observations by for and denote the corresponding double centered distance matrices by , and where
It is clear that the row sums and column sums of these double centered matrices are 0. The squared sample distance covariance is the following simple formula
The corresponding squared sample variance is
and we can define the sample distance correlation as the standardized sample covariance:
For more details see ? and a discussion paper (?). It is clear that is rigid motion invariant and scale invariant. For recent applications of distance correlation, see e.g., ? and ?.
The population version of distance covariance and distance correlation can be defined without characteristic functions, see ?. This definition is as follows. Let and be random variables with finite expectations. The random distance functions are and . Here the primed random variable denotes an independent and identically distributed (i.i.d.) copy of the variable , and similarly are i.i.d.
Introduce the real-valued function
where is the cumulative distribution function (cdf) of , and
which is a real-valued random variable. For simplicity we write and .
Next we define the counterpart of centered distance matrices. The centered distance function is
For random variables we have
Similarly define the centered distance function and the random variable . Now for i.i.d., and i.i.d., such that and have finite expectations, the population distance covariance is defined by
We have that is always nonnegative, and equates zero if and only if and are independent.
It is clear by inspection that without further efforts the implementation of the sample distance covariance and the corresponding sample distance correlation requires O() steps. In this paper we show that for real-valued random variables and , we do not need more than O() steps.
3 The unbiased version of the squared sample distance covariance: reformulation and relation to U-statistics
We will work with the unbiased version of the squared sample distance covariance, which is published in ?. The definition is as follows.
Definition 3.1 (-centered matrix).
Let be a symmetric, real valued matrix with zero diagonal, . Define the -centered matrix as follows: the -th entry of is
Here the “-centered” is so named because as shown below, the corresponding inner product (which will be specified in (3.2)) defines an unbiased estimator of the squared distance covariance.
Let denote a sample of observations from the joint distribution of random vectors and . Let be the Euclidean distance matrix of the sample from the distribution of , and be the Euclidean distance matrix of the sample from the distribution of . Then if , for , the following
is an unbiased estimator of squared population distance covariance .
The proof of the above proposition is in the appendix of ?.
Let denote the inner product defined in (3.2). The following notations will be used. Define the column and row sums as follows:
We will need the following lemma.
If is the inner product defined in (3.2) then we have
3.2 Validating the Statistic is a U-Statistic
Suppose is a sample. For positive integer , let denote all the distinct -subsets of . For a set , we define notation . Let be a symmetric real-valued or complex-valued kernel function of variables. For each , the associated U-statistic of order , , is equal to the average over ordered samples of size of the sample values . In other words,
For U-statistics, we can verify the following lemma.
For , we denote
where is defined in (3.4) after removing the element . Then we must have
In fact, using arithmetic deduction, one can prove that the converse of the above is also true. In other words, the jackknife invariance is a necessary and sufficient condition for being U-statistics. For a very similar (equivalent) approach see ?.
A proof of the above can be found in the Appendix.
The two lemmas above show that the recursive relation (3.6) is a necessary and sufficient condition for a U-statistic. For later use, we explicitly restate the result below.
Let be a statistic of a sample . Let , , be a statistic of a reduced sample ; i.e., is the statistic after removing the observation . The necessary and sufficient condition for to be a U-statistic of order is
holds for all .
The above can be extended to a two-sample problem, in which a sample is for . By replacing with , all the previous arguments still hold.
Let denote the inner product that is defined in (3.2). Note is based on the entire sample (i.e., . For , let denote the corresponding statistic after knocking out pair from the entire sample.
The following lemma establish counterpart for , where .
For , let , , , and denote the corresponding sums after entry is removed from the sample. If is the inner product that is defined in (3.2) after knocking off the -th entry , we have
See a proof in the Appendix.
Now we know that (3.2) is a U-statistic and it is easy to see that (3.2) is in fact a U-statistic with a degenerate kernel under the null hypothesis of independence of and , thus we can see from Corollary 4.4.2 of ? that if the second moments of and are finite then under the null hypothesis, the limit distribution of has the form , where , and are i.i.d. standard normal random variables. Under the alternative hypothesis we have that , thus we can easily construct a consistent test of independence. For a technically much more difficult approach, see ? where a similar result was derived for a related V-statistic using deep results on complex-valued Gaussian processes.
4 Fast Algorithm
We now argue that when and are univariate, there is an O algorithm to implement (3.3). We start with several intermediate results, which are presented as lemmas below.
For , we also denote
A proof is relegated to the appendix.
Due to symmetry, the following is the counterpart fact for . We state it without a proof.
For , we denote
The following lemma will be used.
We define a sign function, for ,
For any sequence , for , we define
The following is true:
Proof of Lemma 4.4.
Per the definition of , one can verify that the above equates to (4.5). ∎
For any sequence , there is an O algorithm to compute for all (), where .
Again, we relegate the proof to the appendix. The main idea of the proposed algorithm is a modification as well as an extension of the idea that was used in ? and ?, which developed a fast algorithm for computing the Kendall’s rank correlation coefficient. The principle of the AVL tree structure (?) was adopted. Despite they are in a similar spirit, the algorithmic details are different. We now present the main result in the following theorem.
The unbiased estimator of the squared population distance covariance (that was defined in (3.2)) can be computed by an O() algorithm.
Proof of Theorem 4.6.
For the second term on the right hand side of (3.3), Note that quantities , and that were defined in Lemmas 4.1 and 4.2, respectively, are partial sums, which can be computed for all ’s with O() algorithms. The factor is inserted, because one may need to sort ’s or ’s in order to compute for , and . Then by (4.1) and (4.2), all and can be computed at order O(). Consequently, the second term on the right hand side of (3.3) can be computed by using an O() algorithm.
For readers’ convenience, we present a detailed algorithm description in Appendix, where Algorithm 2 realizes the idea that is described in the proof of Lemma 4.5; Algorithm 3 is a subroutine that will be called in Algorithm 2; and the Algorithm 1 is the algorithm that can compute for the distance covariance at O().
5 Numerical Experiments
In Section 5.1, we describe a MATLAB and C based implementation of the newly proposed fast algorithm. This fast algorithm enables us to run some simulations with sample sizes that were impossible to experiment with before its appearance. We report some numerical experiments in Section 5.2. Distance correlation has been found helpful in feature screening. In Section 5.3, we redo experiments on this regard, increasing the sample size from to . It is observed that the advantage of using the distance correlation is more evident when the sample size becomes larger.
5.1 Matlab Implementation
The fast algorithm was implemented in MATLAB, with a key step (of dyadic updating) being implemented in C. It was then compared against the direct (i.e., slow) implementation. Table 1 presents the average running time for the two different implementations in MATLAB with replications at each sample size. The sample size goes from () to (). In all these cases, the two methods ended with identical solutions; this validates our fast algorithm. Note a comparison in MATLAB is not desirable for our fast algorithm. The direct method calls some MATLAB functions, which achieve the speed of a low-level language implementation, while the implementation of the fast method is not. In theory, the fast algorithm will compare more favorably if both methods are implemented in a low-level language, such as in C or C++.
|Sample Size||Direct method||Fast method|
|32||0.0006 (0.0001)||0.0014 (0.0001)|
|64||0.0008 (0.0001)||0.0024 (0.0002)|
|128||0.0019 (0.0004)||0.0053 (0.0006)|
|256||0.0083 (0.0010)||0.0120 (0.0011)|
|512||0.0308 (0.0021)||0.0272 (0.0018)|
|1024||0.1223 (0.0051)||0.0647 (0.0037)|
|2048||0.4675 (0.0172)||0.1478 (0.0045)|
Fig. 1 provides a visual comparison of the two methods. All the experiments that are reported in this paper is run on a laptop (Lenovo T520, Intel Core i7-2640M CPU @ 2.80GHz) with allowable 975 MB memory in MATLAB Version 220.127.116.11 (R2013b).
When the sample size is large, e.g., when , the direct method will generate an “out-of-memory” message. Recall the direct method computes for all pairwise distances, hence it requires O memory. The fast method only requires O in memory. For illustration purpose, we run the fast algorithm for sample size going from (which is ) to (which is ). The running times are reported in Tab. 2 and Fig. 2. When , the running time is a little more than three minutes. The trend that is observable from Fig. 2 consists with our claim that the fast method is an O algorithm. It is evident that the running time scales approximately linearly with the sample size (). We did not run experiments with larger sample sizes, because their outcomes are predictable by property of the fast method.
|Sample Size||Fast method|
5.2 Measuring Effectiveness of Distance Correlation
The distance correlation is zero if and only if the corresponding two random variables are independent of each other. The Pearson’s correlation does not have such a property. There have been intuitive numerical examples to illustrate such an advantage of using the distance correlation. See the Wikipedia page on “distance correlation.” When the direct implementation of the distance correlation is adopted, the sample size (which was denoted by ) cannot be large, due to the O complexity of the direct method. In Fig. 3, we compare the Pearson’s correlation with the distance correlation in nine representative cases:
is a bivariate normal with moderate correlation;
a bivariate normal with a correlation close to ;
a thickened rippled curve;
a rotation of a uniformly filled rectangle;
a further rotation of the aforementioned uniformly filled rectangle;
a thickened quadratic curve;
bifurcated quadratic curves;
a thickened circle; and
a bivariate mixed normal with independent coordinates.
When the sample sizes are and , respectively, Fig. 3 presents the Pearson’s correlation and the distance correlation in all cases. In the cases (3) through (8), we seemingly observe the trend that the Pearson’s correlations are getting close to zero, while the distance correlations are not. However, the significance of such a pattern is not evident.
|(a) Sample size:|
With the fast method, we now can run the same experiments with larger sample sizes. In Fig. 4, we run the comparison with sample size .
This is a sample size for which the corresponding experiment cannot be done with the direct method. It is clear that the Pearson’s correlation become nearly zero in the cases of (3) through (8), even though the two random variables are not independent. The corresponding distance correlations are clearly far from zero. For readers’ convenience, we summarize the results in Table 3.
|n = 40||0.92; 0.91||0.98; 0.98||0.08; 0.35||-0.04; 0.00||-0.07; 0.00||-0.38; 0.44|
|n = 400||0.80; 0.75||0.98; 0.97||0.04; 0.30||0.10; 0.17||0.08; 0.18||-0.03; 0.40|
|n = 10,000||0.81; 0.76||0.98; 0.97||0.01; 0.33||0.00; 0.14||0.01; 0.14||0.01; 0.42|
|n = 40||0.04; 0.21||0.16; 0.22||0.10; 0.00|
|n = 400||0.10; 0.25||0.04; 0.19||0.01; 0.00|
|n = 10,000||0.01; 0.25||0.00; 0.18||-0.01; 0.00|
The fast method allows us to study how the sample distance correlation converge to the population counterpart as a function of the sample size. Fig. 5 shows the convergence of the sample distance correlation and Pearson’s correlation. It is worth noting that in cases (3)-(8), the Pearson’s correlation quickly converges to zero, while the sample distance correlation clearly stays away from zero. This experiments shows that a previous observation in Fig. 4 should occur with large probability.
5.3 Feature Screening
In ?, distance correlation has been proposed to facilitate feature screening in ultrahigh-dimensional data analysis. The proposed sure independence screening procedure based on the distance correlation (DC-SIS) has been proven to be effective in their simulation study. Due to the use of the direct method, they restricted their sample size to . We redo the simulations as in ?, however increases the sample size to , i.e., times of the originally attempted. It is observed that the use of distance correlation becomes more advantageous when the sample size increases.
The screening algorithm, which was initially advocated by ?, works as follows. For each covariate , a ‘marginal utility’ function was computed. Such a marginal utility function can be the Pearson’s correlation, the distance correlation that was discussed in this paper, or other dependence measure such as the one in ? that was also used in the simulation studies of ?. The ‘screening’ is based on the magnitude of the values of these marginal utility function. Sometimes, forward, backward, or a hybrid stepwise approach is proposed. In this paper, we refrain from further discussion in this potential research direction.
Our simulation setup follows the one in ?. Note that an alternative approach named sure independent ranking and screening (SIRS) (?) was compared against. For a sample, , of two random variable and , the SIRS dependence measure (i.e., the marginal utility function) is defined as
where is an indicator function. The formulation in the above definition seemingly hint an O algorithm. The following theorem will show that it can be computed via an O algorithm. The proof and the algorithmic details are relegated to the appendix.
For a sample, , of a bivariate random vector , the SIRS measure (?) in (5.1) can be computed via an algorithm whose average complexity is O.
For completeness, we state our simulation setup below. we generate from normal distribution with zero mean and covariance matrix , and the error term from the standard normal distribution . Two covariance matrices are considered to assess the performance of the DC-SIS and to compare with existing methods: (1) and (2) . Note that a covariance matrix with entries enjoys a known Cholesky decomposition: , where if , and , , for and , . In our simulations, we take advantage of this known decomposition. The dimension varies from to . Each experiment was repeated times, and the performance is evaluated through the following three criteria:
: the minimum model size to include all active predictors. We report the , and quantiles of out of replications.
: the proportion that an individual active predictor is selected for a given model size in the replications.
: the proportion that all active predictors are selected for a given model size in the replications.
The is used to measure the model complexity of the resulting model of an underlying screening procedure. The closer to the minimum model size the is, the better the screening procedure is. The sure screening property ensures that and are both close to one when the estimated model size is sufficiently large. Different from ?, the is chosen to be , , and throughout our simulations to empirically examine the effect of the cutoff, where denotes the integer part of .
An innovative stopping rule is introduced in ? for DC-SIS. We did not implement it here, because the new stopping rule requires a multivariate version of the distance correlation, which is not covered by this paper.
The example is designed to compare the finite sample performance of the DC-SIS with the SIS (?) and the SIRS (?). In this example, we generate the response from the following four models:
where is an indicator function.
The regression functions in models (1.a)-(1.d) are all nonlinear in . In addition, models (1.b) and (1.c) contain an interaction term , and model (1.d) is heteroscedastic. Following Fan and Lv (2008), we choose for , and , where , Bernoulli and . We set in this example to be consistent with the experiments in ?: challenging the feature screening procedures under consideration. For each independence screening procedure, we compute the associated marginal utility between each predictor and the response . That is, we regard as the predictor vector in this example.
|Case 1: and|
|Case 2: and|
|Case 3: and|
|Case 4: and|
|Case 1: and|
|Case 2: and|
|Case 3: and|
|Case 4: and|
Tables 4, 5, and 6 present the simulation results for , and . The performances of the DC-SIS, SIS, and SIRS are quite similar in model (1.a), indicating that the SIS has a robust performance if the working linear model does not deviate far from the underlying true model. The DC-SIS outperforms the SIS and the SIRS significantly in models (1.b)-(1.d). Both the SIS and the SIRS have little chance to identify the important predictors and in models (1.b) and (1.c), and in model (1.d).
Distance correlation has been found useful in many applications (?; ?). A direct implementation of the distance correlation led to an algorithm with sample size