Causal Inference and Mechanism Clustering of a Mixture of Additive Noise Models
The inference of the causal relationship between a pair of observed variables is a fundamental problem in science, and most existing approaches are based on one single causal model. In practice, however, observations are often collected from multiple sources with heterogeneous causal models due to certain uncontrollable factors, which renders causal analysis results obtained by a single model skeptical. In this paper, we generalize the Additive Noise Model (ANM) to a mixture model, which consists of a finite number of ANMs, and provide the condition of its causal identifiability. To conduct model estimation, we propose Gaussian Process Partially Observable Model (GPPOM), and incorporate independence enforcement into it to learn latent parameter associated with each observation. Causal inference and clustering according to the underlying generating mechanisms of the mixture model are addressed in this work. Experiments on synthetic and real data demonstrate the effectiveness of our proposed approach.
Understanding the data generating mechanism (g.m.) has been a main theme of causal inference. To infer the causal direction between two random variables (r.v.s) and using passive observations, most existing approaches first model the relation between them using a functional model with certain assumptions (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyvärinen, 2009; Janzing et al., 2012). Then a certain asymmetric property (usually termed cause-effect asymmetry), which only holds in the causal direction, is derived to conduct inference. For example, the additive noise model (ANM) (Hoyer et al., 2009) represents the effect as a function of the cause with an additive independent noise: . It is shown in (Hoyer et al., 2009) that there is no model of the form that admits an ANM in the anticausal direction for most combinations .
Similar to ANM, most causal inference approaches based on functional models, such as LiNGAM (Shimizu et al., 2006), PNL (Zhang and Hyvärinen, 2009), and IGCI (Janzing and Scholkopf, 2010), assume a single causal model for all observations. However, there is no such a guarantee in practice, and it could be very common that the observations are generated by a mixture of causal models due to different data sources or data collection under different conditions, rendering existing single-causal-model based approaches inapplicable in many problems (e.g. Fig. 1). Recently, an approach was proposed for inferring the causal direction of mixtures of ANMs with discrete variables (Liu and Chan, 2016). However, the inference of such mixture models with continuous variables remains a challenging problem and is not yet well studied.
Another question regarding mixture models addressed in this paper is how one could reveal causal knowledge in clustering tasks. Specifically, we aim at finding clusters consistent with the causal g.m.s of a mixture model, which is usually vital in the preliminary phase of many research. For example in the analysis of air data (see section 4.2 for detail), discovering knowledge from air data combined from several different regions (i.e. mechanisms in causal perspective) is much more difficult than from data of each region separately. Most existing clustering algorithms are weak for this perspective as they typically define similarity between observations in the form of distances in some spaces or manifolds. Most of them neglect the relation among r.v.s within a feature vector (observation), and only use those feature dimensions to calculate an overall distance metric as the clustering criterion.
In this paper, we focus on analyzing observations generated by a mixture of ANMs of two r.v.s and try to answer two questions: 1) causal inference: can we infer the causal direction between the two r.v.s? 2) mechanism clustering: can we cluster the observations generated from the same g.m. together? To answer these questions, first as the main result of this paper, we show that the causal direction of the mixture of ANMs is identifiable in most cases, and we propose a variant of GP-LVM (Lawrence, 2005) named Gaussian Process Partially Observable Model (GPPOM) for model estimation, based on which we further develop the algorithms for causal inference and mechanism clustering.
The rest of the paper is organized as follows: in section 2, we formalize the model, show its identifiability and elaborate mechanism clustering; in section 3, model estimation method is proposed; we present experiments on synthetic and real world data in section 4 and conclude in section 5.
2 ANM Mixture Model
2.1 Model definition
Each observation is assumed to be generated from an ANM and the entire data set is generated by a finite number of related ANMs. They are called the ANM Mixture Model (ANM-MM) and formally defined as:
Definition 1 (ANM Mixture Model).
An ANM Mixture Model is a set of causal models of the same causal direction between two continuous r.v.s and . All causal models share the same form given by the following ANM:
where denotes the cause, denotes the effect, is a nonlinear function parameterized by and the noise . The differences between causal models in an ANM-MM stem only from different values of function parameter . In ANM-MM, is assumed to be drawn from a discrete distribution on a finite set , i.e. , where , and is the indicator function of a single value .
Obviously in ANM-MM, all observations are generated by a set of g.m.s, which share the same function form () but differ in parameter values (). This model is inspired by commonly encountered cases where the data generating process is slightly different in each independent trial due to the influence of certain external factors that one can hardly control. In addition, these factors are usually believed to be independent of the observed variables. The data generating process of ANM-MM can be represented by a directed graph in Fig. 2.
2.2 Causal inference: identifiability of ANM-MM
Let be the cause and be the effect () without loss of generality. As most recently proposed causal inference approaches, following postulate, which was originally proposed in (Daniusis et al., 2012), is adopted in the analysis of ANM-MM.
Postulate 1 (Independence of input and function).
If , the distribution of and the function mapping to are independent since they correspond to independent mechanisms of nature.
In a general perspective, postulate 1 essentially claims the independence between the cause () and mechanism mapping the cause to effect (Janzing and Scholkopf, 2010). In ANM-MM, we interpret the independence between the cause and mechanism in an intuitive way: , as the function parameter, captures all variability of mechanisms so it should be independent of the cause according to postulate 1. Based on the independence between and , cause-effect asymmetry could be derived to infer the causal direction.
Since ANM-MM consists of a set of ANMs, the identifiability result of ANM-MM can be a simple corollary of that in (Hoyer et al., 2009) when the number of ANMs () is equal and there is a one-to-one correspondence between mechanisms in the forward and backward ANM-MM. In this case the condition of ANM-MM being unidentifiable is to fulfill ordinary differential equations given in (Hoyer et al., 2009) simultaneously which can hardly happen in a generic case. However, in ANM-MM in both directions may not necessarily be equal and there may also exist many-to-one correspondence between ANMs in both directions. In this case, the identifiability result can not be derived as a simple corollary of (Hoyer et al., 2009). To analyze the identifiability result of ANM-MM, we first derive lemma 1 to find the condition of existence of many-to-one correspondence (which is a generalization of the condition given in (Hoyer et al., 2009)), then conclude the identifiability result of ANM-MM (theorem 1) based on the condition in lemma 1. The condition that there exists one backward ANM for a forward ANM-MM is:
Let and they follow an ANM-MM. If there exists a backward ANM in the anti-causal direction, i.e.
the cause distribution (), the noise distribution (), the nonlinear function () and its parameter distribution () should jointly fulfill the following ordinary differential equation (ODE)
where , and the definitions of , , and are provided in supplementary due to the page limitation.
Sketch of proof. Since and follow an ANM-MM, their joint density is factorized in the causal direction by . If there exists a backward ANM in the anti-causal direction, i.e. , then and holds, where , in the backward ANM. Since should be the same, by substituting into , the condition shown in (2) is obtained.
The proof of lemma 1 follows the idea of the identifiability of ANM in (Hoyer et al., 2009) and is provided in the supplementary. Since the condition that one backward ANM exists for an forward ANM-MM (mixture of ANMs) is more restrictive than that for a single forward ANM, which is the identifiability in (Hoyer et al., 2009), lemma 1 indicates that a backward ANM is unlikely to exist in the anticausal direction if 1) and follow an ANM-MM; 2) Postulate 1 holds. Based on lemma 1, it is reasonable to hypothesize that a stronger result, which is justified in theorem 1, is valid, i.e. if the g.m. follows an ANM-MM, then it is almost impossible to have a backward ANM-MM in the anticausal direction.
Assume that there exists ANM-MM in both directions. Then there exists a non overlapping partition of the entire data such that in each data block , there is an ANM-MM in the causal direction , where is a discrete distribution on a finite set , and an ANM in the anti-causal direction . According to lemma 1, for each data block, to ensure the existence of an ANM-MM in the causal direction and an ANM in the anti-causal direction, (, , , ) should fulfill an ordinary differential equation in the form of (2). Then the existence of backward ANM-MM requires ordinary differential equations to be fulfilled simultaneously which yields (3). ∎
Then the causal direction in ANM-MM can be inferred by investigating the independence between the hypothetical cause and the corresponding function parameter. According to theorem 1, if they are independent in the causal direction, then it is highly likely they are dependent in the anticausal direction. Therefore in practice, the inferred direction is the one that shows more evidence of independence between them.
2.3 Mechanism clustering of ANM-MM
In ANM-MM, , which represents function parameters, can be directly used to identify different g.m.s since each parameter value corresponds to one mechanism. In other words, observations generated by the same g.m. would have the same if the imposed statistical model is identifiable with respect to .
Denote the parameter associated with each observation by , we suppose a more practical inherent clustering structure behind hidden . Formally, there is a grouping indicator of integers that assign each to one of the clusters, through the th element of , e.g. belongs to cluster if Following ANM-MM, we may assume each belong to one of components and each component follows . A likelihood-based clustering scheme suggests minimizing jointly with respect to all means and
where and is the indicator function. To simplify further let’s ignore the known and minimize using coordinate descent iteratively
The minimizer of (4) is the mean where is the size of the th cluster The minimizer of (5) is group assignment through minimum Euclidean distance. Therefore, iterating between (4) and (5) coincides with applying -means algorithm on all and the goal of finding clusters consistent with the g.m.s for data from ANM-MM can be achieved by firstly estimating parameters associated with each observation and then conducting -means directly on parameters.
3 ANM-MM Estimation by GPPOM
We propose Gaussian process partially observable model (GPPOM) and incorporate Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005a) enforcement into GPPOM to estimate the model parameter . Then we summarize algorithms for causal inference and mechanism clustering of ANM-MM.
Dual PPCA. Dual PPCA (Lawrence, 2004) is a latent variable model in which maximum likelihood solution for the latent variables is found by marginalizing out the parameters. Given a set of centered -dimensional data , dual PPCA learns the -dimensional latent representation associated with each observation . The relation between and in dual PPCA is , where the matrix specifies the linear relation between and and noise . Then by placing a standard Gaussian prior on each row of , one obtains the marginal likelihood of all observations and the objective function of dual PPCA is the log-likelihood , where and .
GP-LVM. GP-LVM (Lawrence, 2005) generalizes dual PPCA to cases of nonlinear relation between and by mapping latent representations in to a feature space, i.e. , where denotes the feature map. Then and can be computed using kernel trick. GP-LVM can also be interpreted as a new class of models which consists of independent Gaussian processes (Williams, 1998) mapping from a latent space to an observed data space (Lawrence, 2005).
HSIC. HSIC (Gretton et al., 2005a), which is based on reproducing kernel Hilbert space (RKHS) theory, is widely used to measure the dependence between r.v.s. Let be an sample of size draw independently and identically distributed according to , HSIC answers the query whether . Formally, denote by and RKHSs with universal kernel , on the compact domains and , HSIC is the measure defined as , which is essentially the squared Hilbert Schmidt norm (Gretton et al., 2005a) of the cross-covariance operator from RKHS to (Fukumizu et al., 2004). It is proved in (Gretton et al., 2005a) that, under conditions specified in (Gretton et al., 2005b), if and only if . In practice, a biased empirical estimator of HSIC based on the sample is often adopted:
where , , and is a vector of ones.
3.2 Gaussian process partially observable model
Partially observable dual PPCA. Dual PPCA is not directly applicable to model ANM-MM since: 1) part of the r.v. that maps to the effect is visible (i.e. ); 2) the relation (i.e. ) is nonlinear; 3) r.v.s that contribute to the effect should be independent () in ANM-MM. To tackle 1), a latent r.v. is brought in dual PPCA.
Denote the observed effect by , observed cause by , the matrix collecting function parameters associated with each observation by and the r.v. that contribute to the effect by . Similar to dual PPCA, the relation between the latent representation and the observation is given by
where , is the matrix specifies the relation between and , is the additive noise. Then by placing a standard Gaussian prior on , i.e. , where is the th row of the matrix , the log-likelihood of the observations is given by
where is the covariance matrix after bringing in .
General nonlinear cases (GPPOM). Similar to the generalization from dual PPCA to GP-LVM, the dual PPCA with observable and latent can be easily generalized to nonlinear cases. Denote the feature map by and , then the covariance matrix is given by . The entries of can be computed using kernel trick given a selected kernel . In this paper, we adopt the radial basis function (RBF) kernel, which reads , where , are free parameters and is the dimensionality of the input. As a result of adopting RBF kernel, the covariance matrix in (7) can be computed as
where denotes the Hadamard product, the entries on th row and th column of and are given by and , respectively. After the nonlinear generalization, the relation between and reads . We call this variant of GP-LVM with partially observable latent space GPPOM. Like GP-LVM, is mapped to by the same set of Gaussian processes in GPPOM so the differences in the g.m.s is captured by , the latent representations associated with each observation.
3.3 Model estimation by independence enforcement
Both dual PPCA and GP-LVM finds the latent representations through log-likelihood maximization using scaled conjugate gradient (Møller, 1993). However, the can not be found by directly conducting likelihood maximization since the ANM-MM requires additionally the independence between and . To this end, we include HSIC (Gretton et al., 2005a) in the objective. By incorporating HSIC term into the negative log-likelihood of GPPOM, the optimization objective reads
where is the parameter which controls the importance of the HSIC term and is the set of all hyper parameters including and all kernel parameters , .
To find , we resort to the gradient descant methods. The gradient of the objective with respect to latent points in is given by
The first part on the right hand side of (9), which is the gradient of with respect to the kernel matrix , can be computed as
where is the single-entry matrix, 1 at and 0 elsewhere and . Combining with , whose entry on the th row and th column is given by , through the chain rule, all latent points in can be optimized. With , one can conduct causal inference and mechanism clustering of ANM-MM. The detailed steps are given in Algorithm 1 and 2.
In this section, experimental results on both synthetic and real data are given to show the performance of ANM-MM on causal inference and mechanism clustering tasks. The Python code of ANM-MM is available online.111 https://github.com/amber0309/ANM-MM
4.1 Synthetic data
In experiments of causal inference, ANM-MM is compared with ANM (Hoyer et al., 2009), PNL (Zhang and Hyvärinen, 2009), IGCI (Janzing et al., 2012), ECP Zhang et al. (2015) and LiNGAM (Shimizu et al., 2006). The results are evaluated using accuracy, which is the percentage of correct causal direction estimation of 50 independent experiments. Note that ANM-MM was applied using different parameter and IGCI was applied using different reference measures and estimators. Their highest accuracy is reported.
In experiments of clustering, ANM-MM is compared with well-known -means (MacQueen, 1967) (similarity-based) on both raw data (-means) and its PCA component (PCA-m), Gaussian mixture clustering (GMM) (Rasmussen, 2000) (model-based), spectral clustering (SpeClu) (Shi and Malik, 2000) (spectral graph theory-based) and DBSCAN (Ester et al., 1996) (density-based). Clustering performance is evaluated using average adjusted Rand index (Hubert and Arabie, 1985) (avgARI), which is the mean ARI over 100 experiments. High ARI () indicates good match between the clustering results and the ground truth. Sample size () is 100 in all synthetic clustering experiments. Clustering results are visualized in the supplementary222The results of PCA-m are not visualized since they are similar to and worse than those of -means..
Different g.m.s and sample sizes. We examine the performance on different g.m.s () and sample sizes (). The mechanisms adopted are the following elementary functions: 1) ; 2) ; 3) ; 4) . We tested sample size 50, 100 and 200 for each mechanism. Given and , the cause is sampled from a uniform distribution and then mapped to the effect by , where , and . Each mechanism generates half of the observations.
Causal Inference. The results are shown in Fig. 3. ANM-MM and ECP outperforms others based on a single causal model, which is consistent with our anticipation. Compared with ECP, ANM-MM shows slight advantages in 3 out of 4 settings. Clustering. The avgARI values are summarized in (i) of Table 1. ANM-MM significantly outperforms other approaches in all mechanism settings.
Different number of g.m.s.333From this part on, g.m. is fixed to be . We examine the performance on different number of g.m.s ( in Definition 1). , and are the same as in previous experiments. In the setting of three mechanisms, . In the setting of four, and . Again, the numbers of observations from each mechanism are the same.
Causal Inference. The results are given in Fig. 3(a) which shows decreasing trend for all approaches. However, ANM-MM keeps 100% when the number of mechanisms increases from 2 to 3. Clustering. The avgARI values are given in (ii) and (i) of Table 1. The performance of different approaches show different trends which is probably due to the clustering principle they are based on. Although ANM-MM is heavily influenced by , its performance is still much better than others.
Different noise standard deviations. We examine the performance on different noise standard deviations . , are the same as in the first part of experiments. Three different cases where and 0.1 are tested.
Causal Inference. The results are given in Fig. 3(b). The change in in this range does not significantly influence the performance of most causal inference approaches. ANM-MM keeps 100% accuracy for all choice of . Clustering. The avgARI values are given in (iii) and (i) of Table 1. As our anticipation, the clustering results heavily rely on and all approaches show a decreasing trend in avgARI as increases. However, ANM-MM is the most robust against large .
Different mixing proportions. We examine the performance on different mixing proportions ( in Definition 1). , and are the same as in the first part of experiments. Cases where and 0.75 (corresponding and 0.25) are tested.
Causal Inference. The results on different are given in Fig. 3(c). Approaches based on a single causal model are sensitive to the change in whereas ECP and ANM-MM are more robust and outperform others. Clustering. The avgARI values of experiments on different are given in (iv) and (i) of Table 1. The results of comparing approaches are significantly affected by and ANM-MM shows best robustness against the change in .
4.2 Real data
Causal inference on Tüebingen cause-effect pairs. We evaluate the causal inference performance of ANM-MM on real world benchmark cause-effect pairs444https://webdav.tuebingen.mpg.de/cause-effect/. (Mooij et al., 2016). Nine out of 41 data sets are excluded in our experiment because either they consists of multivariate or categorical data (pair 47, 52, 53, 54, 55, 70, 71, 101 and 105) or the estimated latent representations are extremely close555close in the sense that . (pair 12 and 17). Fifty independent experiments are repeated for each pair, and the percentage of correct inference of different approaches are recorded. Then average percentage of pairs from the same data set is computed as the accuracy of the corresponding data set. In each independent experiment, different inference approaches are applied on 90 points randomly sampled from raw data without replacement.
The results are summarized in Fig. 5 with blue solid line indicating median accuracy and red dashed line indicating mean accuracy. It shows that the performance of ANM-MM is satisfactory, with highest median accuracy of about 82%. IGCI also performs quite well, especially in terms of median, followed by PNL.
Clustering on BAFU air data. We evaluate the clustering performance of ANM-MM on real air data obtained online666https://www.bafu.admin.ch/bafu/en/home/topics/air.html. This data consists of daily mean values of ozone () and temperature () of 2009 from two distinct locations in Switzerland. In our experiment, we regard the data as generating from two mechanisms (each corresponds to a location). The clustering results are visualized in Fig. 6. The ARI values of ANM-MM is 0.503, whereas -means, GMM, spectral clustering and DBSCAN could only obtain ARI of -0.001, 0.003, 0.078 and 0.003, respectively. ANM-MM is the only one that could reveal the property related to the location of the data g.m..
In this paper, we extend the ANM to a more general model (ANM-MM) in which there are a finite number of ANMs of the same function form and differ only in parameter values. The condition of identifiability of ANM-MM is analyzed. To estimate ANM-MM, we adopt the GP-LVM framework and propose a variant of it called GPPOM to find the optimized latent representations and further conduct causal inference and mechanism clustering. Results on both synthetic and real world data verify the effectiveness of our proposed approach.
- Daniusis et al. (2012) Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. (2012). Inferring deterministic causal relations. arXiv preprint arXiv:1203.3475.
- Ester et al. (1996) Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 226–231. AAAI Press.
- Fukumizu et al. (2004) Fukumizu, K., Bach, F. R., and Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99.
- Gretton et al. (2005a) Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005a). Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer.
- Gretton et al. (2005b) Gretton, A., Smola, A. J., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B., and Logothetis, N. K. (2005b). Kernel constrained covariance for dependence measurement. In AISTATS, volume 10, pages 112–119.
- Hoyer et al. (2009) Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pages 689–696.
- Hubert and Arabie (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1):193–218.
- Janzing et al. (2012) Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., and Schölkopf, B. (2012). Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31.
- Janzing and Scholkopf (2010) Janzing, D. and Scholkopf, B. (2010). Causal inference using the algorithmic markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194.
- Lawrence (2005) Lawrence, N. (2005). Probabilistic non-linear principal component analysis with gaussian process latent variable models. Journal of machine learning research, 6(Nov):1783–1816.
- Lawrence (2004) Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. In Advances in neural information processing systems, pages 329–336.
- Liu and Chan (2016) Liu, F. and Chan, L. (2016). Causal discovery on discrete data with extensions to mixture model. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):21.
- MacQueen (1967) MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281–297, Berkeley, Calif. University of California Press.
- Møller (1993) Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural networks, 6(4):525–533.
- Mooij et al. (2016) Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204.
- Rasmussen (2000) Rasmussen, C. E. (2000). The infinite gaussian mixture model. In Advances in neural information processing systems, pages 554–560.
- Shi and Malik (2000) Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905.
- Shimizu et al. (2006) Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. (2006). A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030.
- Williams (1998) Williams, C. K. (1998). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In Learning in graphical models, pages 599–621. Springer.
- Zhang et al. (2015) Zhang, K., Huang, B., Zhang, J., Schölkopf, B., and Glymour, C. (2015). Discovery and visualization of nonstationary causal models. arXiv preprint arXiv:1509.08056.
- Zhang and Hyvärinen (2009) Zhang, K. and Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 647–655. AUAI Press.
Appendix A Proof of Lemma 1
If there exists an Additive Noise Model (ANM) in the backward direction, i.e.,
where , then we have
Denote by and . Taking partial derivative of with respect to , we get
Furthermore, we have
We find that
Let us get back to the forward model where we have
Taking of both sides of (11), we get
For notation simplicity, we drop the argument of and denote by , we get
and denote by , ,, and , , and , , , and . We have
then we have
Further denote by