Causal Inference on Discrete Data via Estimating Distance Correlations
Furui Liu, Laiwan Chan
{frliu,lwchan}@cse.cuhk.edu.hk
Department of Computer Science and Engineering,
The Chinese University of Hong Kong
Abstract
In this paper, we deal with the problem of inferring causal directions when the data is on a discrete domain. By considering the distribution of the cause $P(X)$ and the conditional distribution mapping cause to effect $P(Y|X)$ as independent random variables, we propose to infer the causal direction by comparing the distance correlation between $P(X)$ and $P(Y|X)$ with the distance correlation between $P(Y)$ and $P(X|Y)$. We infer “$X$ causes $Y$” if the dependence coefficient between $P(X)$ and $P(Y|X)$ is smaller. Experiments are performed to show the performance of the proposed method.
1 Introduction
Inferring the causal direction between two variables from observational data has become a hot research topic. Additive noise models (ANMs) (Zhang and Hyvärinen, 2009a; Shimizu et al., 2006; Hoyer et al., 2009; Peters et al., 2011; Hoyer and Hyttinen, 2009; Zhang and Hyvärinen, 2009b; Shimizu et al., 2011; Hyvärinen and Smith, 2013; Hyvärinen et al., 2010; Hyttinen et al., 2012; Mooij et al., 2011) are early attempts to solve this problem. They assume that the effect is governed by the cause and an additive noise, and the causal inference is done by finding the direction that admits such a model. Recently, under another view of exploiting the asymmetry between cause and effect, the linear trace method (LTr) (Zscheischler et al., 2011; Janzing et al., 2010) and information geometric causal inference (IGCI) (Janzing et al., 2012) have been proposed. Suppose $X$ is the cause and $Y$ is the effect. Based on the postulate that the generating of $P(X)$ is independent of that of $P(Y|X)$ (Janzing and Schölkopf, 2010; Lemeire and Janzing, 2013; Schölkopf et al., 2012), LTr suggests that the trace condition is fulfilled in the causal direction while violated in the anticausal direction, and IGCI shows that the density of the cause and the log slope of the function transforming cause to effect are uncorrelated while the density of the effect and the log slope of the inverse of the function are positively correlated. By assessing these so-called cause-effect asymmetries, one can determine the causal direction. Subsequently, a kernel method using the framework of IGCI to deal with high dimensional variables was developed (Chen et al., 2014), and a nonlinear extension of the trace method was presented (Chen et al., 2013).
In some situations, the variables of interest are on discrete domains, and researchers have adapted additive noise models to discrete data for causal inference (Peters et al., 2010, 2011). Given observations of the variable pair, they perform regressions in both directions and test the independence between the residuals and the regressors. The direction that admits an additive noise model is inferred as the causal direction. However, ANMs may not be suitable for modeling discrete variables in some situations. For example, it is not natural to use ANMs to model categorical variables. Methods with wider applicability would be valuable for causal inference on discrete data.
Motivated by the postulate that the generating of $P(X)$ is independent of that of $P(Y|X)$, we suggest that $\{P(x), P(Y|x)\}$ is an observation of a pair of variables that are independent of each other (here $x \in \mathcal{X}$, and $P(x)$ refers to the probability at one specified point). To infer the causal direction, we calculate the dependence coefficients between $P(X)$ and $P(Y|X)$, and between $P(Y)$ and $P(X|Y)$. Then the direction that induces the smaller correlation is inferred as the causal direction. Since it makes no functional causal model assumption, our method has wider applicability than traditional ANMs. Various experiments are conducted to demonstrate the performance of the proposed method.
This paper is organized as follows. Section 2 defines the problem. Section 3 presents the causal inference principle. Section 4 gives the detailed causal inference method. Section 5 shows the experiments, and section 6 concludes the whole paper.
2 Problem Description
Suppose we have a pair of observed variables $X$ and $Y$ with support domains $\mathcal{X}$ and $\mathcal{Y}$ respectively. $X$ is the cause and $Y$ is the effect, but we do not have prior knowledge about the causal direction. We assume that they are on discrete domains (for continuous variables we can perform discretization) and that there are no latent confounders. We want to identify the causal direction (“X causes Y” or “Y causes X”). For clarity, we list the symbols that may appear in the following sections in table 1. Since we constrain the variables to a finite range, $P(X)$ and $P(Y|X)$ can be written as matrices, and $P(Y|x)$ is a vector in the matrix $P(Y|X)$. We use these representations in the rest of the paper.
Table 1: List of symbols.
Symbol  Description
$\mathcal{X}$  support of variable $X$
$\mathcal{Y}$  support of variable $Y$
$P(X)$  distribution of $X$
$P(x)$  probability of $X = x$
$P(X, Y)$  joint distribution of $(X, Y)$
$P(Y|X)$  conditional distribution of $Y$ given $X$
$P(X|Y)$  conditional distribution of $X$ given $Y$
$|\cdot|$  cardinality of a set
3 Causal Inference Principle
In this section, we will present the principle that we are using for causal inference on discrete data. We start with the basic idea.
3.1 Basic Idea
The basic idea is to consider $\{P(x), P(Y|x)\}$ as a realization of a variable pair, where the two variables (one is one dimensional and the other is high dimensional) are independent of each other. See figure 1 for an example.
Figure 1 shows an example of $P(X)$ and $P(Y|X)$. Here $P(X)$ is a vector of length $|\mathcal{X}|$ and $P(Y|X)$ is a matrix of size $|\mathcal{Y}| \times |\mathcal{X}|$. The highlighted (red bars) grids are a pair $\{P(x), P(Y|x)\}$. Consider $P(x)$ and $P(Y|x)$ as two independent random variables. The generating of $P(X)$ and $P(Y|X)$ is done by drawing realizations at each possible value of $x$ (shifting the red bars from right to left). We have $|\mathcal{X}|$ realizations. We formalize this in postulate 1.
Postulate 1.
$P(x)$ and $P(Y|x)$ are both random variables taking realizations at different $x$. $P(x)$ is independent of $P(Y|x)$.
With this postulate in mind, one can look for properties induced by it that are useful for causal discovery. We will discuss this in the next section.
3.2 Distance Correlation
If we want to characterize the dependence between $P(x)$ and $P(Y|x)$, one measurement is the correlation. However, $P(Y|x)$ is a high dimensional random vector. Adopting traditional dependence coefficients like Pearson correlations would cause estimation bias when the sample size is not large. Moreover, it would be useful if zero correlation corresponded to independence between the variables. This is not true of traditional correlations, which can be zero for dependent variables. Here we propose to use distance correlation (Székely et al., 2007) as the dependence measurement.
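Distance correlation is a measurement of dependence between two random variables (one dimensional or high dimensional). Suppose we have two random variables $U$, $V$ with characteristic functions $f_U$ and $f_V$ respectively. Their joint characteristic function is $f_{U,V}$. The distance covariance is defined as below.
Definition 1.
The distance covariance between two random variables is
$$\mathcal{V}^2(U, V) = \| f_{U,V}(t, s) - f_U(t) f_V(s) \|^2. \tag{1}$$
Here $\|\cdot\|$ refers to the weighted $L_2$ norm, and similarly we can define the distance variance $\mathcal{V}^2(U) = \mathcal{V}^2(U, U)$ (Székely et al., 2007). Then the distance correlation is defined as
Definition 2.
The distance correlation is
$$\mathcal{R}^2(U, V) = \frac{\mathcal{V}^2(U, V)}{\sqrt{\mathcal{V}^2(U)\, \mathcal{V}^2(V)}}, \tag{2}$$
and $\mathcal{R}(U, V) = 0$ if $\mathcal{V}(U) = 0$ or $\mathcal{V}(V) = 0$.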
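This dependence measurement is a distance metric between the joint characteristic function $f_{U,V}$ and the product $f_U f_V$ of the marginal characteristic functions of two random variables $U$ and $V$. There are other methods to measure the dependence, like mutual information and kernel independence measurements (Gretton et al., 2005). However, mutual information is hard to estimate given a finite sample size. Kernel methods involve a few parameters (kernel functions, kernel widths) which are not easy to choose. So we use this metric in our paper. We now discuss how to estimate the distance correlation empirically from data (Székely et al., 2007). Suppose we have $n$ observations of the two random variables as $\{(U_k, V_k)\}_{k=1}^{n}$. For variables $U$ and $V$, we can construct the pairwise distances
$$a_{kl} = \| U_k - U_l \|, \tag{3}$$
$$b_{kl} = \| V_k - V_l \|, \tag{4}$$
and then we construct matrices $A$ and $B$, with entries
$$A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}, \tag{5}$$
$$B_{kl} = b_{kl} - \bar{b}_{k\cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot\cdot}, \tag{6}$$
where $\bar{a}_{k\cdot}$, $\bar{a}_{\cdot l}$ and $\bar{a}_{\cdot\cdot}$ denote the $k$-th row mean, the $l$-th column mean and the grand mean of the matrix $(a_{kl})$, and similarly for $b$. Then we can estimate the empirical distance covariance as follows (Székely et al., 2007).
Definition 3.
The empirical distance covariance is
$$\mathcal{V}_n^2(U, V) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl}. \tag{7}$$
We can estimate the empirical distance correlation using the empirical distance covariance. The distance correlation has the property that $\mathcal{R}(U, V) = 0$ implies independence between $U$ and $V$. We show that this helps to identify the causal direction in the next section.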
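As an illustration of the estimation procedure above, the following is a minimal NumPy sketch of the empirical distance covariance and distance correlation of Székely et al. (2007). The function names are hypothetical and chosen for this example only, and the use of the Euclidean norm for the pairwise distances is an assumption; it is not the authors' implementation.

```python
import numpy as np


def _centered_distances(z):
    """Double-centered pairwise Euclidean distance matrix of the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]  # treat a 1-D input as n observations of a scalar
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    # subtract row means and column means, add back the grand mean
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_covariance_sq(u, v):
    """Empirical squared distance covariance: mean of the entrywise product."""
    a, b = _centered_distances(u), _centered_distances(v)
    # clip tiny negative values caused by floating point rounding
    return max(float((a * b).mean()), 0.0)


def distance_correlation(u, v):
    """Empirical distance correlation; 0 if either distance variance is 0."""
    var_u = distance_covariance_sq(u, u)
    var_v = distance_covariance_sq(v, v)
    if var_u * var_v <= 0.0:
        return 0.0
    return float(np.sqrt(distance_covariance_sq(u, v) / np.sqrt(var_u * var_v)))
```

Each argument holds one observation per row, so a scalar variable can be passed as a 1-D array and a vector-valued variable as a 2-D array with the same number of rows.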
3.3 Inferring Causal Directions
In this section we discuss how to infer the causal directions. Suppose we have the joint distribution of the variable pair as $P(X, Y)$. We are able to factorize it in two directions and get $\{P(X), P(Y|X)\}$ and $\{P(Y), P(X|Y)\}$. Each of them is a random variable pair. We define the dependence measurements of them as below, where $\mathcal{R}(\cdot, \cdot)$ denotes the distance correlation.
Definition 4.
The dependence measurements are defined as
$$\mathcal{D}_{X \to Y} = \mathcal{R}(P(X), P(Y|X)), \tag{8}$$
$$\mathcal{D}_{Y \to X} = \mathcal{R}(P(Y), P(X|Y)). \tag{9}$$
If postulate 1 is accepted, then in the causal direction, the distance correlation between $P(X)$ and $P(Y|X)$ reaches the lower bound as
$$\mathcal{D}_{X \to Y} = 0. \tag{10}$$
Since the distance correlation is nonnegative, in the anticausal direction we have
$$\mathcal{D}_{Y \to X} \geq 0, \tag{11}$$
and now we get the causal inference principle as
$$\mathcal{D}_{X \to Y} \leq \mathcal{D}_{Y \to X}. \tag{12}$$
Intuitively speaking, in the causal direction we get a smaller dependence coefficient between the marginal distribution and the conditional distribution than in the anticausal direction. Note that the domain size should be reasonably large to generate reliable statistics, since the number of realizations used to estimate each coefficient equals the size of the corresponding support. We will give the detailed causal inference method in the next section.
4 Causal Inference Method
In this section we give a causal inference method which identifies the causal direction via estimating the distance correlations. If X causes Y, the distance correlation between $P(X)$ and $P(Y|X)$ should be smaller than that between $P(Y)$ and $P(X|Y)$. However, estimating the coefficients from samples may induce random errors. So we introduce a threshold $\epsilon$. The two coefficients are considered significantly different if their difference is larger than $\epsilon$, in which case we can decide the causal direction. Otherwise we stay undecided. We detail the inference method in table 2.
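Algorithm 1: Causal inference via estimating distance correlations (DC)

Input: sample of the discrete variables $X$ and $Y$, threshold $\epsilon$
1. Construct the vector recording the distribution $P(X)$ and the matrix recording the conditional distribution $P(Y|X)$. Calculate $\mathcal{D}_{X \to Y}$, the distance correlation between $P(X)$ and $P(Y|X)$.
2. Construct the vector recording the distribution $P(Y)$ and the matrix recording the conditional distribution $P(X|Y)$. Calculate $\mathcal{D}_{Y \to X}$, the distance correlation between $P(Y)$ and $P(X|Y)$.
3. Decide the causal direction:
If $\mathcal{D}_{Y \to X} - \mathcal{D}_{X \to Y} > \epsilon$, output “$X$ causes $Y$”.
If $\mathcal{D}_{X \to Y} - \mathcal{D}_{Y \to X} > \epsilon$, output “$Y$ causes $X$”.
Else, output “No decision made”.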
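The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the names `dcor` and `dc_infer`, the output labels, the default threshold and the use of empirical frequencies as estimates of the distributions are all choices made for this example.

```python
import numpy as np


def dcor(u, v):
    """Empirical distance correlation between paired rows of u and v."""
    def centered(z):
        z = np.asarray(z, dtype=float).reshape(len(z), -1)
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()

    a, b = centered(u), centered(v)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom <= 0.0:
        return 0.0  # a constant input has zero distance variance
    return float(np.sqrt(max((a * b).mean(), 0.0) / denom))


def dc_infer(x, y, eps=0.05):
    """Return (label, d_xy, d_yx) from paired samples of two discrete variables."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()

    # empirical joint distribution over the observed supports
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()

    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
    p_y_given_x = joint / p_x[:, None]        # row i is P(Y | x_i)
    p_x_given_y = (joint / p_y[None, :]).T    # row j is P(X | y_j)

    d_xy = dcor(p_x, p_y_given_x)  # one realization per value of x
    d_yx = dcor(p_y, p_x_given_y)  # one realization per value of y

    if d_yx - d_xy > eps:
        return "X->Y", d_xy, d_yx
    if d_xy - d_yx > eps:
        return "Y->X", d_xy, d_yx
    return "undecided", d_xy, d_yx
```

Only values that actually occur in the sample enter the supports, so every marginal probability used as a divisor is strictly positive.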
From table 2, one can see that our method identifies the cause and the effect by factorizing the joint distribution in two directions and comparing the two dependence coefficients. The direction with the smaller distance correlation (between the marginal distribution and the conditional distribution) is inferred as the causal direction. We name the method causal inference via estimating distance correlations (abbreviated as DC). Next we analyze the computational cost of our method. Suppose the sample size is $N$. The time for constructing the matrix recording the joint distribution is $O(N)$, while the times for calculating the two distance correlations depend only on the domain sizes $|\mathcal{X}|$ and $|\mathcal{Y}|$ and not on $N$. So for fixed domain sizes the total time grows linearly with the sample size, and the method is of low computational complexity. This is verified in the experiments.
5 Experiments
In this section we test the performance of our method (DC). The compared algorithm is the discrete regression (DR) (Peters et al., 2011). We perform experiments under various settings. Section 5.1 shows the performance of DC on identifying ANMs with different noise ranges. Section 5.2 presents the performance of DC when the distribution of the cause and the conditional distribution mapping cause to effect are randomly generated. Section 5.3 tests the efficiency of the algorithms. Section 5.4 discusses the choice of the threshold parameter $\epsilon$. Section 5.5 shows the performance of DC at different decision rates. In section 5.6, we apply DC to real world cause-effect pairs (with discretization) to show its capability in solving practical problems.
5.1 Additive Noise Models
We first evaluate the accuracies of DC on identifying ANMs with different ranges of the noise term $N$. The model is written as
$$Y = f(X) + N. \tag{13}$$
The function $f$ is constructed as a random mapping defined on $\mathcal{X}$. Suppose the support of the noise is $\mathcal{N}$. Four noise domains of different sizes are used, one per experimental setting.
The probability distributions of the cause are chosen as follows: (1) randomly generate a vector (of length $|\mathcal{X}|$) with each entry being an integer drawn from a fixed range; (2) normalize it to unit sum. We generate the probability distributions of the noise in the same way. In each trial, the algorithms are forced to make a decision. For each noise setting, we randomly generate 500 functions. Thus we have 500 additive noise models. The support of $Y$ for each model could be different due to the randomness of the mappings. Then we sample 200, 300, 500, 1000, 2000, 4000 points for each model, and apply DC and DR to the samples. The plots showing the accuracies of the algorithms are given in figure 2.
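The data generation described above can be sketched as follows. The specific choices here are assumptions made for illustration and are not taken from the experiments: integer weights drawn from 1 to `high`, a mapping `f` into the integers 0 to `n_x - 1`, and integer-valued noise without cyclic wrapping. All function names are hypothetical.

```python
import numpy as np


def random_distribution(size, rng, high=10):
    """Random integer weights normalized to unit sum (weight range is assumed)."""
    w = rng.integers(1, high + 1, size=size).astype(float)
    return w / w.sum()


def sample_discrete_anm(n_samples, n_x, noise_support, rng):
    """Draw samples from Y = f(X) + N with a random map f and independent noise N."""
    noise_support = np.asarray(noise_support)
    p_x = random_distribution(n_x, rng)
    p_n = random_distribution(len(noise_support), rng)
    f = rng.integers(0, n_x, size=n_x)  # f[x] is the image of the value x
    x = rng.choice(n_x, size=n_samples, p=p_x)
    noise = rng.choice(noise_support, size=n_samples, p=p_n)
    return x, f[x] + noise, f
```

Because `f` is drawn at random, different values of the cause may share an image, so the observed support of the effect varies from model to model.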
From figure 2, one can see that DR performs slightly better than DC when the noise range is smallest. For example, when the sample size is 4000, the accuracy of DR is 0.82 while that of DC is 0.78. We observe that ANMs with a small noise support could sometimes yield small distance correlations in both directions. In these situations, the decision made by DC is close to a random guess. DC performs better than DR when the noise support is larger. The accuracies of DC reach around 0.9 when the sample size is large. But DC does not correctly identify all models. This is because the difference between the estimated distance correlations is sometimes small relative to random estimation errors, which may make the decision wrong.
5.2 Models with Randomly Generated $P(X)$ and $P(Y|X)$
We now test the algorithms on models with $P(X)$ and $P(Y|X)$ being randomly generated. To be specific, we generate $P(X)$ using the method in section 5.1. Then we generate a set of distributions on $\mathcal{Y}$ as a reference set. For each $x \in \mathcal{X}$, we generate $P(Y|x)$ by randomly taking one of the distributions in the reference set. Four settings of the domain size are used (settings 1 to 4).
For each setting, we generate 500 models. For each model, we sample 200, 300, 500, 1000, 2000, 4000 points, and apply DC and DR to them. The performance is shown in figure 3.
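The model construction of this section, in which each conditional distribution is picked from a shared reference set, can be sketched as below. The size of the reference set, the weight range and all function names are assumptions made for this example.

```python
import numpy as np


def random_model(n_x, n_y, n_ref, rng, high=10):
    """Random P(X) and a P(Y|X) whose rows are drawn from a reference set."""
    def rand_dist(size):
        w = rng.integers(1, high + 1, size=size).astype(float)
        return w / w.sum()

    p_x = rand_dist(n_x)
    reference = np.stack([rand_dist(n_y) for _ in range(n_ref)])
    picks = rng.integers(0, n_ref, size=n_x)  # one reference index per value of x
    p_y_given_x = reference[picks]            # row i is P(Y | x_i)
    return p_x, p_y_given_x, reference


def sample_model(p_x, p_y_given_x, n_samples, rng):
    """Sample x from P(X), then y from the matching row of P(Y|X)."""
    n_y = p_y_given_x.shape[1]
    x = rng.choice(len(p_x), size=n_samples, p=p_x)
    cdf = np.cumsum(p_y_given_x, axis=1)
    u = rng.random(n_samples)
    # inverse-CDF sampling: count how many CDF entries lie below u
    y = (u[:, None] > cdf[x]).sum(axis=1)
    # guard against a final CDF entry slightly below 1 due to rounding
    return x, np.minimum(y, n_y - 1)
```

Such a model generally admits no additive noise structure in either direction, which is the situation this experiment is designed to probe.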
One can see that DR has unsatisfactory performance in these scenarios. This is because DR often makes a random guess, since the models do not satisfy the ANM in either direction. DC has satisfactory performance when the sample size grows large. This shows that DC works in these scenarios while DR does not.
5.3 On Efficiency
This section investigates the efficiency of the algorithms. We use the same experimental setting as that in section 5.2 (settings 2 and 4). For each sample size, we run the algorithms 100 times and record their total running time (in seconds). The records are shown in figure 4.
Figure 4 shows that DC is more efficient than DR. For example, in one setting with sample size 1000, DR uses around 400 seconds to finish the experiments while DC uses only 2 seconds. This is because DR searches the whole domain iteratively to find a function that yields minimum dependence between residuals and regressors, which could be time-consuming in practice.
5.4 On Parameter $\epsilon$
In the sections above, DC is forced to make a decision at each trial ($\epsilon = 0$). In this section, we examine the influence of $\epsilon$ on the performance of DC. This may help to set the value of $\epsilon$ in practice. We use the same experimental setting as that in section 5.2, with two choices of the domain size. The sample size is fixed to be 4000. We test three values of the parameter $\epsilon$.
For each setting, we generate 500 models and apply DC to them. The proportion of correctly identified models, the proportion of wrongly identified models and the proportion of non-identified models are shown in figure 5.
The proportion of non-identified models becomes large when the threshold is 0.1. This is because DC becomes conservative in this situation. The proportion of correctly identified models and that of wrongly identified models both decrease as the threshold grows. However, the accuracy (the number of correctly identified models divided by the number of correctly identified models plus the number of wrongly identified models) increases. This is reasonable since the decisions made by DC under a higher threshold are more reliable. Based on the plotted results, the accuracy and decision rate of DC are both acceptable at a threshold of 0.05, so we suggest that 0.05 is a reasonable choice for the parameter.
5.5 On Decision Rates
In our algorithm, the threshold parameter $\epsilon$ controls the decision rate of DC. In other words, if we increase $\epsilon$ (from 0 to 1), the decision rate decreases (from 100% to 0%). In this section, we study the influence of the parameter $\epsilon$ on the performance of DC. We follow the experimental setting in section 5.2 with a fixed domain size. We fix the sample size to be 5000, and we vary the decision rate (by changing $\epsilon$ from 0 to 1). The plots showing the percentage of correct decisions versus the decision rate are available in figure 6.
As expected, the percentage of correct decisions decreases as the decision rate increases. For example, the percentage of correct decisions is 100% when the decision rate is less than 20%, but it becomes 77% when the decision rate is 100%. This is acceptable since a decision is more reliable if the algorithm makes it under a higher $\epsilon$.
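The relation between the threshold, the decision rate and the percentage of correct decisions can be computed as in the sketch below. It assumes that, for each trial with a known true direction, the difference between the anticausal and the causal distance correlation has already been estimated; the function name and the example numbers are hypothetical and are not experimental results.

```python
import numpy as np


def accuracy_vs_decision_rate(diff, eps_grid):
    """Decision rate and accuracy among decided trials for each threshold.

    diff[i] is the estimated anticausal minus causal distance correlation of
    trial i. A trial is decided when |diff[i]| > eps, and the decision is
    correct when diff[i] > 0.
    """
    diff = np.asarray(diff, dtype=float)
    rates, accs = [], []
    for eps in eps_grid:
        decided = np.abs(diff) > eps
        rates.append(decided.mean())
        # accuracy is undefined (NaN) when no trial is decided
        accs.append((diff[decided] > 0).mean() if decided.any() else float("nan"))
    return np.array(rates), np.array(accs)


# toy differences for five trials, for illustration only
rates, accs = accuracy_vs_decision_rate([0.3, 0.2, -0.05, 0.01, -0.15],
                                        [0.0, 0.1, 0.25, 0.5])
```

In this toy input, raising the threshold removes the trials with the smallest differences first, which is why the accuracy among the remaining decisions tends to rise as the decision rate falls.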
5.6 On Real World Data
We apply DC to real world benchmark cause-effect pairs.
Determining causal directions on real world data is challenging since the causal mechanisms are often complex and the data records could be noisy (Pearl, 2000; Spirtes et al., 2000). However, figure 7 shows that DC has satisfactory performance on this task. The average accuracy of DC is around 72%, which is the highest among all compared algorithms. ANM-based methods (DR, LiNGAM, ANM) do not perform well. This may be because the assumptions of additive noise models restrict their applicability.
6 Conclusions
In this paper, we deal with the causal inference problem on discrete data. We consider the distribution of the cause and the conditional distribution mapping cause to effect as independent random variables, and propose to discover the causal direction via estimating and comparing the distance correlations. Encouraging experimental results are reported. This shows that inferring the causal direction using the independence postulate is a promising research direction. In the future we will try to extend this method to deal with high dimensional data.
Acknowledgments
The authors want to thank the editor and the anonymous reviewers for helpful comments. The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.
Footnotes
 Available at https://webdav.tuebingen.mpg.de/cause-effect/
 For pair 65, 66, 67 which contain stock returns, we process the variables by
References
 Chen, Z., Zhang, K., and Chan, L. (2013). Nonlinear causal discovery for high dimensional data: A kernelized trace method. In Proceedings of the IEEE 13th International Conference on Data Mining (ICDM), pages 1003–1008. IEEE.
 Chen, Z., Zhang, K., Chan, L., and Schölkopf, B. (2014). Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 26(7):1484–1517.
 Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129.
 Hoyer, P. O. and Hyttinen, A. (2009). Bayesian discovery of linear acyclic causal models. In Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 240–248. AUAI Press.
 Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J. R., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems (NIPS), pages 689–696.
 Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2012). Learning linear cyclic causal models with latent variables. Journal of Machine Learning Research, 13(1):3387–3439.
 Hyvärinen, A. and Smith, S. M. (2013). Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14:111–152.
 Hyvärinen, A., Zhang, K., Shimizu, S., and Hoyer, P. O. (2010). Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11:1709–1731.
 Janzing, D., Hoyer, P., and Schölkopf, B. (2010). Telling cause from effect based on high-dimensional observations. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 479–486.
 Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., and Schölkopf, B. (2012). Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31.
 Janzing, D. and Schölkopf, B. (2010). Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194.
 Lemeire, J. and Janzing, D. (2013). Replacing causal faithfulness with algorithmic independence of conditionals. Minds and Machines, 23(2):227–249.
 Mooij, J. M., Janzing, D., Heskes, T., and Schölkopf, B. (2011). On causal discovery with cyclic additive noise models. In Advances in Neural Information Processing Systems (NIPS), pages 639–647.
 Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge University Press, Cambridge, UK.
 Peters, J., Janzing, D., and Schölkopf, B. (2010). Identifying cause and effect on discrete data using additive noise models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 597–604.
 Peters, J., Janzing, D., and Schölkopf, B. (2011). Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2436–2450.
 Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J., Pineau, L. J., et al. (2012). On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1–8.
 Sgouritsa, E., Janzing, D., Hennig, P., and Schölkopf, B. (2015). Inference of cause and effect with unsupervised inverse regression. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 847–855.
 Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.
 Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. (2011). DirectLiNGAM: a direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248.
 Spirtes, P., Glymour, C. N., and Scheines, R. (2000). Causation, prediction, and search, volume 81. MIT Press.
 Székely, G. J., Rizzo, M. L., Bakirov, N. K., et al. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794.
 Zhang, K. and Hyvärinen, A. (2009a). Causality discovery with additive disturbances: an information-theoretical perspective. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 570–585. Springer.
 Zhang, K. and Hyvärinen, A. (2009b). On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 647–655. AUAI Press.
 Zscheischler, J., Janzing, D., and Zhang, K. (2011). Testing whether linear equations are causal: A free probability theory approach. In Proceedings of the 27th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 839–847. AUAI Press.