
# Causal Inference on Multivariate and Mixed-Type Data

Alexander Marx and Jilles Vreeken. Max Planck Institute for Informatics and Saarland University, Saarbrücken, Germany. {amarx,jilles}@mpi-inf.mpg.de
###### Abstract

Given data over the joint distribution of two random variables X and Y, we consider the problem of inferring the most likely causal direction between X and Y. In particular, we consider the general case where both X and Y may be univariate or multivariate, and of the same or mixed data types. We take an information theoretic approach, based on Kolmogorov complexity, from which it follows that if X causes Y, first describing the data over the cause X and then the data of the effect Y given X is shorter than the reverse direction.

The ideal score is not computable, but can be approximated through the Minimum Description Length (MDL) principle. Based on MDL, we propose two scores: one for when both X and Y are of the same single data type, and one for when they are mixed-type. We model dependencies between X and Y using classification and regression trees. As inferring the optimal model is NP-hard, we propose Crack, a fast greedy algorithm to determine the most likely causal direction directly from the data.

Empirical evaluation on a wide range of data shows that Crack reliably, and with high accuracy, infers the correct causal direction on both univariate and multivariate cause-effect pairs over both single and mixed-type data.


## 1 Introduction

Telling cause from effect is one of the core problems in science. It is often difficult, expensive, or impossible to obtain data through randomized trials, and hence we often have to infer causality from, what is called, observational data [21]. We consider the setting where, given data over the joint distribution of two random variables X and Y, we have to infer the causal direction between X and Y. In other words, our task is to identify whether it is more likely that X causes Y, or vice versa, that Y causes X, or that the two are merely correlated.

In practice, X and Y do not have to be of the same type. The altitude of a location (real-valued), for example, determines whether it is a good habitat (binary) for a mountain hare. In fact, neither X nor Y has to be univariate. Whether or not a location is a good habitat for an animal is not caused by just a single aspect, but by a combination of conditions, which are not necessarily of the same type. We are therefore interested in the general case where X and Y may be of any cardinality, and may be single or mixed-type.

To the best of our knowledge there exists no method for this general setting. Causal inference based on conditional independence tests, for example, requires three variables, and cannot decide between X→Y and Y→X [21]. All existing methods that consider two variables are only defined for single-type pairs. Additive Noise Models (ANMs), for example, have only been proposed for univariate pairs of real-valued [24] or discrete variables [23], and similarly so for methods based on the independence of P(X) and P(Y∣X) [28, 16]. Trace-based methods require both X and Y to be strictly multivariate real-valued [9, 2], and whereas Ergo [33] also works for univariate pairs, these again have to be real-valued. We refer the reader to Sec. 6 for a more detailed overview of related work.

Our approach is based on algorithmic information theory. That is, we follow the postulate that if X causes Y, it will be easier—in terms of Kolmogorov complexity—to first describe X, and then describe Y given X, than vice versa [11, 33, 1]. Kolmogorov complexity is not computable, but can be approximated through the Minimum Description Length (MDL) principle [26, 5], which we use to instantiate this framework. In addition, we develop a causal indicator that is able to handle multivariate and mixed-type data.

To this end, we define an MDL score for coding forests, a model class where a model consists of classification and regression trees. By allowing dependencies from X to Y, or vice versa, we can measure the difference in complexity between X→Y and Y→X. Discovering a single optimal decision tree is already NP-hard [20], and hence we cannot efficiently discover the coding forest that describes the data most succinctly. We therefore propose Crack, an efficient greedy algorithm for discovering good models directly from data.

Through extensive empirical evaluation on synthetic, benchmark, and real-world data, we show that Crack performs very well in practice. It performs on par with existing methods for univariate single-type pairs, is the first to handle pairs of mixed data type, and outperforms the state of the art on multivariate pairs by a large margin. It is also very fast, taking less than 4 seconds over any pair in our experiments.

## 2 Preliminaries

First, we introduce notation and give brief primers to Kolmogorov complexity and the MDL principle.

### 2.1 Notation

In this work we consider data D over the joint distribution of random variables X and Y. Such data contains records over a set A of attributes. An attribute a has a type, where type(a) ∈ {binary, categorical, numeric}. We will refer to binary and categorical attributes as nominal attributes. The size of the domain of an attribute a is defined as

(2.1) $|D(a)| = \begin{cases} \#\text{values} & \text{if } \mathit{type}(a) \text{ is nominal,} \\ \dfrac{\max(a) - \min(a)}{\mathit{res}(a)} + 1 & \text{if } \mathit{type}(a) \text{ is numeric,} \end{cases}$

where res(a) is the resolution at which the data over attribute a was recorded. For example, a resolution of 1 means that we consider integers, while a resolution of 0.01 means that a was recorded with a precision of up to a hundredth.
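To make the definition concrete, here is a minimal Python sketch of Eq. (2.1); the function name and interface are our own, not part of Crack.

```python
import math

def domain_size(values, attr_type, res=1.0):
    """Domain size |D(a)| as in Eq. (2.1): the number of distinct values
    for nominal attributes, and the number of representable values at
    resolution res(a) for numeric attributes."""
    if attr_type == "nominal":
        return len(set(values))
    # round() guards against floating-point error in the division
    return int(round((max(values) - min(values)) / res)) + 1

# A resolution of 1 corresponds to integers; 0.01 to hundredths.
assert domain_size(["a", "b", "b", "c"], "nominal") == 3
assert domain_size([1, 5, 3], "numeric", res=1.0) == 5       # integers 1..5
assert domain_size([0.10, 0.25], "numeric", res=0.01) == 16  # 0.10..0.25
```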

We will consider decision and regression trees. In general, a tree T consists of nodes. We identify the internal nodes of T as int(T), and its leaf nodes as lvs(T). A leaf node l contains |l| data points.

All logarithms are to base 2, and we use the convention 0 log 0 = 0.

### 2.2 Kolmogorov Complexity, a brief primer

The Kolmogorov complexity of a finite binary string x is the length of the shortest binary program p* for a universal Turing machine U that generates x, and then halts [13, 15]. Formally, we have

 K(x)=min{|p|∣p∈{0,1}∗,U(p)=x}.

Simply put, p* is the most succinct algorithmic description of x, and the Kolmogorov complexity of x is the length of its ultimate lossless compression. Conditional Kolmogorov complexity, K(x∣y), is then the length of the shortest binary program that generates x, and halts, given y as input. For more details see [15].

### 2.3 MDL, a brief primer

The Minimum Description Length (MDL) principle [26, 5] is a practical variant of Kolmogorov Complexity. Intuitively, instead of all programs, it considers only those programs that we know output the data D and halt. Formally, given a model class M, MDL identifies the best model M ∈ M for data D as the one minimizing

 L(D,M)=L(M)+L(D∣M),

where L(M) is the length, in bits, of the description of M, and L(D∣M) is the length, in bits, of the description of the data given M. This is known as two-part MDL. There also exists one-part, or refined MDL, where we encode data and model together. Refined MDL is superior in that it avoids arbitrary choices in the description language, but in practice it is only computable for certain model classes. Note that in either case we are only concerned with code lengths — our goal is to measure the complexity of a dataset under a model class, not to actually compress it [5].
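As a toy illustration of two-part MDL (not part of the paper's model class), consider encoding a binary sequence with a Bernoulli model: the model cost pays for the maximum-likelihood parameter, and the data cost is the negative log-likelihood under it.

```python
import math

def two_part_mdl(bits):
    """Two-part MDL cost L(D,M) = L(M) + L(D|M) for a Bernoulli model.
    L(M) encodes the maximum-likelihood parameter (the count of ones,
    one of n+1 possibilities); L(D|M) is the negative log-likelihood."""
    n, k = len(bits), sum(bits)
    model_cost = math.log2(n + 1)  # which of the n+1 possible counts
    p = k / n
    data_cost = 0.0
    if 0 < p < 1:
        data_cost = -(k * math.log2(p) + (n - k) * math.log2(1 - p))
    return model_cost + data_cost

# Strongly biased data admits a shorter total description than fair-coin data.
biased = [1] * 15 + [0]
fair = [1, 0] * 8
assert two_part_mdl(biased) < two_part_mdl(fair)
```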

## 3 Causal Inference by Compression

We pursue the goal of causal inference by compression. Below we give a short introduction to the key concepts.

### 3.1 Causal Inference by Complexity

The problem we consider is to infer, given data over two correlated variables X and Y, whether X caused Y, whether Y caused X, or whether X and Y are only correlated. As is common, we assume causal sufficiency. That is, we assume there exists no hidden confounding variable that is the common cause of both X and Y.

The Algorithmic Markov condition, as recently postulated by Janzing and Schölkopf [11], states that factorizing the joint distribution over X and Y into P(cause) and P(effect ∣ cause) will lead to simpler—in terms of Kolmogorov complexity—models than factorizing it into P(effect) and P(cause ∣ effect). Formally, they postulate that if X causes Y,

 (3.2) K(P(X))+K(P(Y∣X))≤K(P(Y))+K(P(X∣Y)).

While in general the symmetry of information, K(x) + K(y ∣ x) = K(y) + K(x ∣ y), holds up to an additive constant [15], Janzing and Schölkopf [11] showed that it does not hold when X causes Y, or vice versa. Based on this, Budhathoki & Vreeken [1] proposed

(3.3) $\Delta^{*}_{X\to Y} = \dfrac{K(P(X)) + K(P(Y\mid X))}{K(P(X)) + K(P(Y))}\,,$

as a causal indicator that uses this asymmetry, inferring X→Y as the most likely causal direction if Δ*_{X→Y} < Δ*_{Y→X}, and vice versa. The normalisation has no function during inference, but does help to interpret the confidence of the indicator.

Both scores assume access to the true distribution P(X,Y), whereas in practice we only have access to empirical data. Moreover, following from the halting problem, Kolmogorov complexity is not computable. We can approximate it, however, via MDL [15, 5], which also allows us to directly work with empirical distributions.

### 3.2 Causal Inference by MDL

For causal inference by MDL, we will need to approximate both K(P(X)) and K(P(Y∣X)). For the former, we need to consider the model classes M_X and M_Y, while for the latter we need to consider the class M_{Y|X} of models that describe the data of Y dependent on the data of X.

That is, we are after the causal model M_{X→Y} = (M_X, M_{Y|X}) that best describes the data of Y by exploiting as much as possible of the structure of X to save bits. By MDL, we identify the optimal model M_{X→Y} for data D over X and Y as the one minimizing

$L(D, M_{X\to Y}) = L(X, M_X) + L(Y, M_{Y\mid X} \mid X)\,,$

where L(X, M_X) is the encoded length of the data of X under its model, encoded using two-part MDL; similarly so for L(D, M_{Y→X}) if we consider the inverse direction.

To identify the most likely causal direction between X and Y by MDL, we can now simply rewrite Eq. (3.3) as

$\Delta_{X\to Y} = \dfrac{L(X, M_X) + L(Y, M_{Y\mid X}\mid X)}{L(X, M_X) + L(Y, M_Y)}\,.$

Similar to the original score, we infer that X is a likely cause of Y if Δ_{X→Y} < Δ_{Y→X}, that Y is a likely cause of X if Δ_{Y→X} < Δ_{X→Y}, and that X and Y are only correlated or might have a common cause if Δ_{X→Y} = Δ_{Y→X}.
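The inference rule can be sketched in a few lines of Python; the encoded lengths are assumed to come from some MDL encoder, such as the tree models of Sec. 4.

```python
def delta_indicator(L_X, L_Y_given_X, L_Y, L_X_given_Y):
    """Compute the normalized scores Delta_{X->Y} and Delta_{Y->X} from
    encoded lengths (in bits) and infer a direction.  Both scores share
    the denominator L(X,M_X) + L(Y,M_Y)."""
    d_xy = (L_X + L_Y_given_X) / (L_X + L_Y)
    d_yx = (L_Y + L_X_given_Y) / (L_X + L_Y)
    if d_xy < d_yx:
        return "X->Y", d_xy, d_yx
    if d_yx < d_xy:
        return "Y->X", d_xy, d_yx
    return "undecided", d_xy, d_yx

# Conditioning Y on X saves many bits, but not vice versa: infer X->Y.
direction, _, _ = delta_indicator(L_X=1000, L_Y_given_X=400,
                                  L_Y=900, L_X_given_Y=990)
assert direction == "X->Y"
```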

### 3.3 Normalized Causal Indicator

Although Δ has nice theoretical properties, it has a major drawback. It assumes that a bit gained in the description of the data over one attribute has the same importance as a bit gained in the description of the data over another attribute. This does not hold true if these attributes have different intrinsic complexities, such as when their domain sizes strongly differ. For example, a continuous-valued attribute is very likely to have a much higher intrinsic complexity than a binary attribute. This means that gaining bits from an attribute with a large domain is not comparable to gaining bits from an attribute with a small domain. Since the Δ indicator compares the absolute difference in bits, it does not account for differences w.r.t. the intrinsic complexity. Hence, Δ is highly likely to be a bad choice when X and Y are of different, or of mixed-type data.

We therefore propose an alternative indicator for causal inference on mixed-type data. Instead of taking the absolute difference between the conditioned and unconditioned score, we consider relative differences w.r.t. the marginal. We can derive the Normalized Causal Indicator (NCI) starting from the numerator of the Δ indicator. By subtracting the conditional costs on both sides, we have that X→Y if

$L(X, M_X) - L(X, M_{X\mid Y}\mid Y) \;<\; L(Y, M_Y) - L(Y, M_{Y\mid X}\mid X)\,.$

Since the aim of the NCI is to measure the relative gain, we divide by the costs of the unconditioned data, obtaining

$\dfrac{L(X, M_X) - L(X, M_{X\mid Y}\mid Y)}{L(X, M_X)} = 1 - \dfrac{L(X, M_{X\mid Y}\mid Y)}{L(X, M_X)}\,.$

After this step, we can conclude that, in terms of the relative gain, X→Y holds if

$\dfrac{L(X, M_{X\mid Y}\mid Y)}{L(X, M_X)} > \dfrac{L(Y, M_{Y\mid X}\mid X)}{L(Y, M_Y)}\,.$

This score can be understood as an instantiation of the Ergo indicator proposed by Vreeken [33]. From the derivation, we can easily see that the difference between the two indicators depends only on the normalization factor, and hence both are based on the Algorithmic Markov condition. It turns out, however, that the Ergo indicator is also biased. Although it balances the gain between X and Y, we need a score that does not impose prior assumptions on the individual attributes of X and Y. With the Ergo indicator, it could happen that a single attribute Y_i ∈ Y dominates the whole score for Y. To account for this, we assume independence among the variables within X and Y, meaning that the domains of two individual attributes within X or Y are allowed to differ. Hence, we formulate the NCI, which we from now on denote by δ, from X to Y as

$\delta_{X\to Y} = \dfrac{1}{|Y|} \sum_{Y_i \in Y} \dfrac{L(Y_i, M_{Y_i\mid X}\mid X)}{L(Y_i, M_{Y_i})}\,,$

and analogously δ_{Y→X}. To avoid a bias towards dimensionality, we normalize by the number of attributes. As above, we infer X→Y if δ_{X→Y} < δ_{Y→X}, and vice versa.
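A minimal sketch of the NCI, assuming the per-attribute marginal and conditional code lengths have already been computed by some encoder:

```python
def nci(marginal_costs, conditional_costs):
    """Normalized Causal Indicator delta_{X->Y}: average, over the
    attributes Y_i of Y, of the ratio L(Y_i, M_{Y_i|X} | X) / L(Y_i, M_{Y_i})."""
    ratios = [c / m for m, c in zip(marginal_costs, conditional_costs)]
    return sum(ratios) / len(ratios)

# Two attributes of Y with very different intrinsic complexities: the
# per-attribute normalization keeps the large-domain attribute from
# dominating the score.
d_xy = nci(marginal_costs=[1000, 10], conditional_costs=[900, 2])
d_yx = nci(marginal_costs=[800, 20], conditional_costs=[790, 19])
assert d_xy < d_yx  # X compresses Y relatively better: infer X->Y
```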

In practice, we expect that Δ performs well on data where X and Y are of the same type, especially when their dimensions and the domain sizes of their attributes are balanced. For unbalanced domains, unbalanced dimensionality, and especially for mixed-type data, we expect δ to perform much better. The experiments indeed confirm this.

## 4 MDL for Tree Models

To use the above defined causal indicators in practice, we need to define a causal model class M, how to encode a model M ∈ M in bits, and how to encode a dataset D using a model M. As models we consider tree models, or, coding forests. A coding forest M contains one coding tree T_a per attribute a ∈ A. A coding tree T_a encodes the values of a in its leaves, splitting or regressing the data of a on another attribute in its internal nodes to encode the data of a more succinctly.

We encode the data over attribute a with the corresponding coding tree T_a. The encoded length of data D under model M then is the sum of the costs of the individual trees.

To ensure lossless decoding, there needs to exist an order on the trees such that we can transmit them one by one. In other words, in a valid tree model there are no cyclic dependencies between the trees, and a valid model can hence be represented by a DAG. Let M be the set of all valid tree models for D; that is, each M ∈ M is a set of trees T_a such that the data type of the leaves of T_a corresponds to the data type of attribute a, and the dependency graph of M is acyclic.

We write M_X and M_Y to denote the subsets of valid coding forests for X and Y, where we do not allow dependencies. To describe the set of models where we allow attributes of Y to depend only on attributes of X, we write M_{Y|X}, and accordingly M_{X|Y} for X depending only on Y. If an attribute does not have any incoming dependencies, its tree is a stump. Fig. 1 shows the DAG for a toy data set, and an example tree. From the DAG, the set of purple edges forms one valid model, and the set of orange edges another.

#### Cost of a Tree

The encoded cost of a tree consists of two parts. First, we transmit the topology of the tree: from the root on, we indicate with one bit per node whether it is a leaf or an internal node, and, for the latter, one further bit to identify whether it is a split or a regression node. Formally, we have

 L(T)=|T|+∑v∈int(T)(1+L(v))+∑l∈lvs(T)L(l).

Next, we explain how we encode internal nodes and then specify the encoding for leaf nodes.

#### Cost of a Single Split

The length of a split node is

$L^{1}_{\mathit{split}}(v) = 1 + \log|A| + \begin{cases} \log|D(a_j)| & \text{if } a_j \text{ is categorical,} \\ \log\left(|D(a_j)| - 1\right) & \text{otherwise,} \end{cases}$

where we first need one bit to indicate that this is a single-split node, then log|A| bits to identify the attribute a_j on which we split, and third the split condition.

The split condition can be any value in the domain of a categorical attribute, and can lie between two consecutive values of a numeric attribute (|D(a_j)| − 1 choices). For a binary attribute we only have one option, resulting in zero cost.
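A sketch of the single-split cost; the argument names are ours, and binary attributes fall into the non-categorical branch so that the single possible split costs zero bits, as described above.

```python
import math

def single_split_cost(n_attributes, domain_size, categorical):
    """Cost in bits of a single-split node: 1 bit for the node kind,
    log|A| bits for the split attribute, plus the split condition
    (|D(a_j)| choices if categorical, |D(a_j)|-1 cut points otherwise)."""
    cost = 1 + math.log2(n_attributes)
    if categorical:
        cost += math.log2(domain_size)
    elif domain_size > 1:
        cost += math.log2(domain_size - 1)
    return cost

# A binary attribute has only one possible split: zero extra bits.
assert single_split_cost(8, 2, categorical=False) == 4.0  # 1 + log2(8) + log2(1)
```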

#### Costs of a Multiway split

A multiway split is only possible for categorical and real-valued data. As there are exponentially many possible multiway splits, we consider only a subset. The costs for a multiway split are

$L^{k}_{\mathit{split}}(v) = 1 + \log|A| + \begin{cases} 0 & \text{if } a_j \text{ is categorical,} \\ L_{\mathbb{N}}(k) & \text{if } a_j \text{ is numeric,} \end{cases}$

where the first two terms are as above. For categorical data, we only consider splitting on all values, and hence have no further cost. For numeric data, we only split non-deterministic cases, i.e. if there exist duplicate values. To do so, we split on every such value that occurs sufficiently often, and add one residual split for all remaining data points. To encode such a k-way split, we transmit k using L_N(k) bits, where L_N is the MDL-optimal encoding for integers [27].

#### Cost of Regressing

For a regression node we also first encode the target attribute, and then the parameters of the regression, i.e.

$L_{\mathit{reg}}(v) = \log|A| + \sum_{\phi \in \Phi(v)} \left(1 + L_{\mathbb{N}}(s) + L_{\mathbb{N}}\!\left(\lfloor \phi \cdot 10^{s} \rfloor\right)\right)\,,$

where Φ(v) denotes the set of parameters of the regression. For linear regression, it consists of the slope and the intercept, while for quadratic regression it further contains the quadratic coefficient. To describe each parameter φ up to a user-defined precision, we first encode the corresponding number of significant digits s, and then the shifted parameter value ⌊φ · 10^s⌋.
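A sketch of the universal integer code L_N [27] and of the per-parameter cost; the rounding helper and the extra sign bit below are our own illustrative assumptions, not the paper's exact encoding.

```python
import math

def L_N(n):
    """MDL-optimal universal code length for an integer n >= 1 [27]:
    L_N(n) = log* n + log c0, where log* n = log n + log log n + ...
    (summing only the positive terms) and c0 is approximately 2.865064."""
    assert n >= 1
    cost = math.log2(2.865064)
    term = math.log2(n)
    while term > 0:
        cost += term
        term = math.log2(term)
    return cost

def parameter_cost(phi, s):
    """Hypothetical cost of one regression parameter phi at precision
    10^-s: one bit for the sign, the number of significant digits s,
    and the shifted value (mirrors L_reg in the text)."""
    shifted = max(1, abs(int(round(phi * 10 ** s))))
    return 1 + L_N(max(1, s)) + L_N(shifted)

assert L_N(1) == math.log2(2.865064)  # log* 1 = 0
assert parameter_cost(0.5, 2) > 1
```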

Next, we describe how to encode the data in a leaf l. As we consider both nominal and numeric attributes, we need to define L_nom for nominal and L_num for numeric data.

#### Cost of a Nominal Leaf

To encode the data in a leaf of a nominal attribute, we can use refined MDL [14]. That is, we encode it minimax optimally, without having to make design choices [5]. In particular, we encode the data of a nominal leaf l using the normalized maximum likelihood (NML) distribution as

(4.4) $L_{\mathit{nom}}(l) = \log\!\left( \sum_{h_1 + \cdots + h_k = |l|} \frac{|l|!}{h_1!\, h_2! \cdots h_k!} \prod_{j=1}^{k} \left(\frac{h_j}{|l|}\right)^{h_j} \right) - |l| \sum_{c \in D(a_i)} \Pr(a_i = c \mid l) \log \Pr(a_i = c \mid l)\,.$

Kontkanen & Myllymäki [14] derived a recursive formula to calculate this in linear time.
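A sketch of that recursion together with the resulting leaf cost; for simplicity we take k to be the number of distinct values in the leaf rather than the full domain size, which is an assumption on our part.

```python
import math
from collections import Counter

def multinomial_regret(n, k):
    """Parametric complexity C(n, k) of the multinomial NML distribution,
    via the linear-time recursion of Kontkanen & Myllymaki [14]:
    C(n,1) = 1,
    C(n,2) = sum_h binom(n,h) (h/n)^h ((n-h)/n)^(n-h),
    C(n,k) = C(n,k-1) + n/(k-2) * C(n,k-2)  for k > 2."""
    if k == 1:
        return 1.0
    c_prev_prev = 1.0
    c_prev = sum(math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
                 for h in range(n + 1))
    for j in range(3, k + 1):
        c_prev_prev, c_prev = c_prev, c_prev + n / (j - 2) * c_prev_prev
    return c_prev

def nominal_leaf_cost(values):
    """L_nom(l): log of the regret plus |l| times the empirical entropy."""
    n = len(values)
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(multinomial_regret(n, len(counts))) + n * entropy

assert multinomial_regret(10, 1) == 1.0
assert nominal_leaf_cost(["a"] * 10) == 0.0  # pure leaf: zero entropy, zero regret
```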

#### Cost of a Numerical Leaf

For numeric data, existing refined MDL encodings have a high computational complexity [14]. Hence, we encode the data in numeric leaves using two-part MDL, using point models with Gaussian or uniform noise. A split or a regression on an attribute aims to reduce the variance or the domain in the leaf. We encode the costs of a numeric leaf as

(4.6) $L_{\mathit{num}}(l \mid \sigma, \mu) = \frac{|l|}{2}\left(\frac{1}{\ln 2} + \log 2\pi\sigma^2\right) - |l| \log \mathit{res}(a_i)\,,$

given the empirical mean μ and variance σ², or as uniform, given the empirical minimum and maximum, as

(4.7) $L_{\mathit{num}}(l \mid \min, \max) = |l| \cdot \log\left(\frac{\max - \min}{\mathit{res}(a_i)} + 1\right)\,.$

We encode the data as Gaussian if this costs fewer bits than encoding it as uniform. To indicate this decision, we use one bit, and encode the minimum of both costs plus the corresponding parameters. As we consider empirical data, we can safely assume that all parameters lie in the domain of the given attribute. Since we do not have any preference on the parameter values, we encode each of the two parameters in log|D(a_i)| bits. The encoded costs of a numeric leaf hence are

(4.8) $L_{\mathit{num}}(l) = 1 + 2\log|D(a_i)| + \min\left\{ L_{\mathit{num}}(l \mid \sigma, \mu),\; L_{\mathit{num}}(l \mid \min, \max) \right\}\,.$
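Putting Eqs. (4.6)-(4.8) together, a numeric leaf can be costed as follows (a sketch with our own interface):

```python
import math

def numeric_leaf_cost(values, res, domain_size):
    """L_num(l): 1 bit for the Gaussian-vs-uniform choice, two parameters
    at log|D(a_i)| bits each, plus the cheaper of the two data encodings
    of Eqs. (4.6) and (4.7)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    lo, hi = min(values), max(values)
    uniform = n * math.log2((hi - lo) / res + 1)
    if var > 0:
        gauss = (n / 2) * (1 / math.log(2) + math.log2(2 * math.pi * var)) \
                - n * math.log2(res)
    else:
        gauss = float("inf")  # degenerate: fall back to the uniform code
    return 1 + 2 * math.log2(domain_size) + min(gauss, uniform)

# Low-variance data is cheaper to encode than widely spread data.
tight = [10.0, 10.1, 10.2, 10.1, 10.0]
spread = [1.0, 20.0, 7.0, 15.0, 3.0]
assert numeric_leaf_cost(tight, 0.1, 200) < numeric_leaf_cost(spread, 0.1, 200)
```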

Putting it all together, we now know how to compute L(D, M), by which we can formally define the Minimal Coding Forest problem.

Minimal Coding Forest Problem. Given a data set D over a set of attributes A, and a valid model class M for D, find the smallest model M ∈ M such that L(D, M) is minimal.

From the fact that both inferring optimal decision trees and structure learning of Bayesian networks—to which our tree-models reduce for nominal-only data and splitting on all values—are NP-hard [20], it trivially follows that the Minimal Coding Forest problem is also NP-hard. Hence, we resort to heuristics.

## 5 The Crack Algorithm

Knowing the score and the problem, we can now introduce the Crack algorithm, which stands for classification and regression based packing of data. Crack is an efficient greedy heuristic for discovering a coding forest M from a given model class with low L(D, M). It builds upon the well-known ID3 algorithm [25]. Below, we explain the main aspects of the algorithm.

#### Greedy algorithm

We give the pseudocode of Crack as Algorithm 1. Before running the algorithm, we set the resolution per attribute; for nominal data the resolution is trivially 1. For numeric data, we calculate the differences between adjacent values and base the resolution on the smallest of these differences, in a way that reduces sensitivity to outliers; this works well in practice.

Crack starts with an empty model consisting of only trivial trees, i.e. one leaf node containing all records, per attribute. The given model class implicitly defines a graph of dependencies between attributes that we are allowed to consider. To make sure the returned model is valid, we maintain a graph representing its dependencies. We iteratively discover the refinement of the current model that maximizes compression. To find the best refinement, we consider every attribute, and every legal additional split or regression of its corresponding tree. A refinement is only legal when the dependency is allowed by the model family, the dependency graph remains acyclic, and we do not split or regress twice on the same attribute. We keep track of the best found refinement.

The key subroutine of Crack is RefineLeaf, in which we discover the optimal refinement of a leaf l in tree T_a. That is, it finds the optimal split of l over all candidate attributes such that we minimize the encoded length. In case both the target and the candidate attribute are numeric, RefineLeaf also considers the best linear and quadratic regression, and decides for the variant with the best compression—choosing to split in case of a tie. In the interest of efficiency, we do not allow splitting or regressing multiple times on the same candidate.

Since we use a greedy heuristic to construct the coding trees, we have an exponential worst-case runtime, where the relevant quantities are the number of attributes m and the number of rows n. Although the worst-case runtime is exponential, in practice Crack takes only a few seconds.
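The greedy loop can be illustrated with a heavily simplified sketch: we use plain empirical entropy as a stand-in leaf cost and a fixed per-refinement cost, whereas Crack itself uses the full encoding of Sec. 4 and also considers regressions.

```python
import math
from collections import Counter

def leaf_cost(values):
    """Stand-in leaf cost: |l| times the empirical entropy (Crack would
    use the full leaf encodings of Sec. 4 instead)."""
    n = len(values)
    return n * -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def greedy_refine(target, candidates, split_cost=2.0):
    """Crack-style greedy loop (a sketch): repeatedly apply the split on
    a candidate attribute that saves the most bits for the target
    attribute, stopping when no refinement compresses further."""
    groups = [list(range(len(target)))]  # start with a single leaf
    used, total = set(), leaf_cost(target)
    while True:
        best = None
        for name, cand in candidates.items():
            if name in used:  # never split twice on the same candidate
                continue
            new_cost = split_cost
            for g in groups:
                for val in set(cand[i] for i in g):
                    part = [target[i] for i in g if cand[i] == val]
                    new_cost += leaf_cost(part)
            if best is None or new_cost < best[1]:
                best = (name, new_cost)
        if best is None or best[1] >= total:
            return total, used  # no refinement saves bits: stop
        name, total = best
        used.add(name)
        groups = [[i for i in g if candidates[name][i] == v]
                  for g in groups for v in set(candidates[name][i] for i in g)]

# y is fully determined by x1, so splitting on x1 compresses y maximally.
y  = ["a", "a", "b", "b", "a", "a", "b", "b"]
x1 = [0, 0, 1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
cost, used = greedy_refine(y, {"x1": x1, "x2": x2})
assert "x1" in used
```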

#### Causal Inference with Crack

To compute our causal indicators, we have to run Crack twice on D: first with model class M_X ∪ M_{Y|X}, to obtain the costs for direction X→Y, and second with M_Y ∪ M_{X|Y}, for direction Y→X. To estimate the unconditioned costs, we assume a uniform prior. We can use these scores to calculate both the Δ score and the δ score. We will refer to Crack using the Δ indicator as Crack_Δ, and to Crack with the δ indicator as Crack_δ.

## 6 Related Work

Causal inference on observational data is a challenging problem, and has recently attracted a lot of attention [21, 11, 29, 1]. Most existing proposals, however, are highly specific in the type of causal dependencies and type of variables they can consider.

Classical constraint-based approaches, such as conditional independence tests, require three observed random variables [30, 21], cannot distinguish Markov-equivalent causal DAGs [32], and hence cannot decide between X→Y and Y→X. Recent approaches use properties of the joint distribution to break the symmetry.

Additive Noise Models (ANMs) [29], for example, assume that the effect is a function of the cause and cause-independent additive noise. ANMs exist for univariate real-valued [29, 8, 34, 24] and discrete data [22]. A related approach considers asymmetries in the joint distribution of X and Y for causal inference. The linear trace method (LTR) [9] and the kernelized trace method (KTR) [2] aim to find a structure matrix A to express Y as AX. Both methods are restricted to multivariate continuous data. Sgouritsa et al. [28] show that the marginal distribution of the cause is independent of the conditional distribution of the effect given the cause. They proposed Cure, using unsupervised reverse regression on univariate continuous pairs. Liu et al. [16] use distance correlation to identify the weakest dependency between univariate pairs of discrete data.

The algorithmic information-theoretic approach views causality in terms of Kolmogorov complexity. The key idea is that if X causes Y, the shortest description of the joint distribution P(X,Y) is given by the separate descriptions of the distributions P(X) and P(Y∣X) [11]; this also justifies additive noise model based causal inference [12]. However, as Kolmogorov complexity is not computable [15], causal inference using algorithmic information theory requires practical instantiations, or notions of independence. For instance, the information-geometric approach [10] defines independence via orthogonality in information space for univariate continuous pairs. Vreeken [33] instantiates this framework with the cumulative entropy to infer the causal direction in continuous univariate and multivariate data. Mooij [18] instantiated the first practical compression-based approach using the Minimum Message Length. Budhathoki and Vreeken approximate K(P(X)) and K(P(Y∣X)) through MDL, and propose Origo, a decision tree based approach for causal inference on multivariate binary data [1]. Marx and Vreeken [17] proposed Slope, an MDL-based method employing local and global regression for univariate numeric data.

In contrast to all methods above, Crack can consider pairs of any cardinality, univariate or multivariate, and of same, different, or even mixed-type data.

## 7 Experiments

In this section, we evaluate Crack empirically. We implemented Crack in C++, and provide the source code, including the synthetic data generator, along with the tested datasets for research purposes. The experiments concerning Crack were executed single-threaded. All tested data sets could be processed within seconds; over all pairs the longest runtime for Crack was under four seconds.

We compare Crack to Cure [28], IGCI [10], LTR [9], Origo [1], Ergo [33] and Slope [17] using their publicly available implementations and recommended parameter settings.

### 7.1 Synthetic data

The aim of our experiments on synthetic data is to show the advantages of either score. In particular, we expect Crack_Δ to perform well on nominal data and on numeric data with balanced domain sizes and dimensions. On the other hand, Crack_δ should have an advantage when it comes to numeric data with varying domain sizes, and to mixed-type data.

We generate synthetic data with assumed ground truth X→Y in the following way. First, we randomly assign the type of each attribute in X. For nominal data, we randomly draw the number of classes between two (binary) and five and distribute the classes uniformly. Numeric data is generated by taking a normal distribution to a power while keeping the sign, leading to a sub-Gaussian or super-Gaussian distribution. (We use super- and sub-Gaussians to ensure identifiability.)

To create data with the true causal direction X→Y, we introduce dependencies from X to Y, where we distinguish between splits and refinements. We call the probability threshold to create a dependency the dependency level. For each attribute of Y, we throw a biased coin for each attribute of X that determines whether we model a dependency from that attribute of X to the attribute of Y. A split means that we find a category (nominal) or a split-point (numeric) on the X attribute to split the Y attribute into two groups, for which we model its distribution independently. As refinement, we either do a multiway split, or model the Y attribute as a linear or quadratic function of the X attribute plus independent Gaussian noise.
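A toy version of such a generator, with assumed coefficients and noise level, might look as follows:

```python
import random

def super_sub_gaussian(q, n, rng):
    """Sign-preserving power transform of Gaussian samples: |z|^q * sign(z)
    yields a sub-Gaussian (q < 1) or super-Gaussian (q > 1) distribution,
    which keeps the cause-effect pair identifiable."""
    return [abs(z) ** q * (1 if z >= 0 else -1)
            for z in (rng.gauss(0, 1) for _ in range(n))]

def generate_pair(n=1000, q=1.7, seed=0):
    """Toy generator with ground truth X -> Y: X is the cause, and Y is a
    quadratic refinement of X plus independent Gaussian noise (the
    coefficients and noise level are illustrative assumptions)."""
    rng = random.Random(seed)
    x = super_sub_gaussian(q, n, rng)
    y = [0.5 * v * v + v + rng.gauss(0, 0.2) for v in x]
    return x, y

x, y = generate_pair()
assert len(x) == len(y) == 1000
```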

##### Accuracy

First, we compare the accuracies of Crack_Δ and Crack_δ with regard to single-type and mixed-type data. To do so, we generate synthetic data sets for a range of dependency levels. Figure 2 shows the results for numeric, nominal and mixed-type data. At high dependency levels, both approaches reach nearly perfect accuracy on single-type data, and for single-type data the accuracy of both methods increases with the dependency. At low dependency levels, both approaches correctly abstain instead of taking wrong decisions. As expected, Crack_δ strongly outperforms Crack_Δ on mixed-type data, reaching near-perfect accuracy, whereas Crack_Δ reaches clearly lower accuracy. On nominal data, Crack_Δ picks up the correct signal faster than Crack_δ.

##### Dimensionality

Next, we check how sensitive both scores are to dimensionality, where we discriminate between asymmetric (|X| ≠ |Y|) and symmetric (|X| = |Y|) pairs. We evaluated multiple data sets per dimensionality. For the symmetric case, both methods are near-perfect on single-type data, whereas only Crack_δ also reaches this target on mixed-type data, as can be seen in the appendix. We now discuss the more interesting case of asymmetric pairs in detail.

To test asymmetric pairs, we keep the dimension of one variable at three, while we increase the dimension of the second variable from one to eleven. To avoid bias, we alternately assigned the larger dimension to X and to Y, swapping the dimensions in every other test. We show the results in Figure 3. As expected, we observe that Crack_δ has much fewer difficulties with the asymmetric data sets than Crack_Δ, staying close to its initial accuracy as the asymmetry grows. On nominal data, it performs near-perfect and also has the clear advantage on mixed-type data.

### 7.2 Real world data

Based on the evaluation on synthetic data, we test our approach on univariate benchmark data and on multivariate data consisting of known test sets and of new causal pairs with known ground truth that we present in this paper.

##### Univariate benchmark

To evaluate Crack on univariate data, we apply it to the well-known Tuebingen benchmark of univariate cause-effect pairs. The pairs mainly consist of numeric data, with a few categoric instances; since these are single-type pairs, we apply Crack_Δ. We compare to the state-of-the-art methods that are applicable to both multivariate and univariate data, Origo [1] and Ergo [33], and to methods specialized for univariate pairs, Cure [28], IGCI [10] and Slope [17]. For each approach, we sort the results by their confidence. According to this order, we calculate for each position the percentage of correct inferences up to this point, called the decision rate. We weigh the decisions as specified by the benchmark, plot the results in Fig. 4, and show the confidence interval of a fair coin flip as a grey area. Except for Crack and Slope, no method performs significantly better than a fair coin flip. In particular, Crack is highly accurate over its most confident decisions, and remains accurate over the full decision rate, where it is nearly on par with Slope, which is, as far as we know, the current state of the art for univariate continuous data.
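The decision-rate computation (here without the benchmark's pair weights, which we omit for simplicity) amounts to:

```python
def decision_rate_curve(results):
    """Accuracy vs. decision rate: sort inferences by confidence, most
    confident first, and report the fraction of correct decisions among
    the top-k for each prefix length k.  Each result is a
    (confidence, correct) pair with correct in {0, 1}."""
    ordered = sorted(results, key=lambda r: -r[0])
    curve, correct = [], 0
    for k, (_, ok) in enumerate(ordered, start=1):
        correct += ok
        curve.append(correct / k)
    return curve

# Three confident correct calls, then one wrong low-confidence call.
results = [(0.9, 1), (0.2, 0), (0.8, 1), (0.7, 1)]
assert decision_rate_curve(results) == [1.0, 1.0, 1.0, 0.75]
```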

##### Multivariate data

To test on multivariate mixed-type and single-type data, we collected 17 data sets. The dimensionality of each data set is listed in Table 1. The first six data sets belong to the Tuebingen benchmark [19], and the next four were published by Janzing et al. [9]. Further, we extracted cause-effect pairs from the Haberman [6], Iris [3], Mammals [7] and Octet [4, 31] data sets. Those are described in more detail in the appendix.

We compare with LTR, Ergo and Origo. Ergo and LTR do not consider categoric data, and are hence not applicable to all data sets. In addition, LTR is only applicable to strictly multivariate data sets. Crack is applicable to all data sets, and infers all but two causal directions correctly. Importantly, the two wrong decisions have low confidences compared to the correct inferences.

## 8 Conclusion

We considered the problem of inferring the causal direction from the joint distribution of two univariate or multivariate random variables X and Y consisting of single- or mixed-type data. We pointed out weaknesses of known causal indicators, and proposed the Normalized Causal Indicator for mixed-type data and data with highly unbalanced domains. Further, we proposed a practical encoding based on classification and regression trees to instantiate these causal indicators, and provide a fast greedy heuristic to compute good solutions.

In the experiments we evaluated the advantages of the NCI and the common indicator, and give advice on when to use which. On real-world benchmark data, we are on par with the state of the art for univariate continuous data, and beat the state of the art on multivariate data by a wide margin.

For future work, we aim to investigate the application of Crack for causal discovery, meaning that we would like to infer causal networks. In addition, we only selected a subset of possible refinements to exploit dependencies from candidates. This choice could be expanded by considering more complex functions, or by finding combinations of categories for splitting. However, unless specific care is taken, many such extensions will likely have repercussions on the runtime of our algorithm, which is why, besides being out of scope here, we leave this for future work.

## Acknowledgements

The authors wish to thank Kailash Budhathoki for insightful discussions. Alexander Marx is supported by the International Max Planck Research School for Computer Science (IMPRS-CS). Both authors are supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.

## References

• [1] K. Budhathoki and J. Vreeken. Causal inference by compression. In ICDM, pages 41–50. IEEE, 2016.
• [2] Z. Chen, K. Zhang, and L. Chan. Nonlinear causal discovery for high dimensional data: A kernelized trace method. In ICDM, pages 1003–1008. IEEE, 2013.
• [3] R. A. Fisher. The use of multiple measurements in taxonomic problems. Ann Eugen, 7(2):179–188, 1936.
• [4] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler. Big data of materials science: Critical role of the descriptor. PRL, 114, 2015.
• [5] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
• [6] S. J. Haberman. Generalized residuals for log-linear models. In Proceedings of the 9th international biometrics conference, pages 104–122, 1976.
• [7] H. Heikinheimo, M. Fortelius, J. Eronen, and H. Mannila. Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J. Biogeogr., 34:1053–1064, 2007.
• [8] P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, pages 689–696, 2009.
• [9] D. Janzing, P. Hoyer, and B. Schölkopf. Telling cause from effect based on high-dimensional observations. In ICML, pages 479–486. JMLR, 2010.
• [10] D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. AIJ, 182-183:1–31, 2012.
• [11] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE TIT, 56(10):5168–5194, 2010.
• [12] D. Janzing and B. Steudel. Justifying additive noise model-based causal discovery via algorithmic information theory. OSID, 17(2):189–212, 2010.
• [13] A. Kolmogorov. Three approaches to the quantitative definition of information. Problemy Peredachi Informatsii, 1(1):3–11, 1965.
• [14] P. Kontkanen and P. Myllymäki. MDL histogram density estimation. In AISTATS, pages 219–226, 2007.
• [15] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, 1993.
• [16] F. Liu and L. Chan. Causal inference on discrete data via estimating distance correlations. Neur. Comp., 28(5):801–814, 2016.
• [17] A. Marx and J. Vreeken. Telling Cause from Effect by MDL-based Local and Global Regression. In ICDM. IEEE, 2017.
• [18] J. Mooij, O. Stegle, D. Janzing, K. Zhang, and B. Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. NIPS, 2010.
• [19] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: Methods and benchmarks. JMLR, 17(32):1–102, 2016.
• [20] K. V. S. Murthy. On Growing Better Decision Trees from Data. PhD thesis, Johns Hopkins, 1997.
• [21] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
• [22] J. Peters, D. Janzing, and B. Schölkopf. Identifying cause and effect on discrete data using additive noise models. In AISTATS, pages 597–604, 2010.
• [23] J. Peters, D. Janzing, and B. Schölkopf. Causal Inference on Discrete Data using Additive Noise Models. IEEE TPAMI, 33(12):2436–2450, 2011.
• [24] J. Peters, J. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. JMLR, 15:2009–2053, 2014.
• [25] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, 1986.
• [26] J. Rissanen. Modeling by shortest data description. Automatica, 14(1):465–471, 1978.
• [27] J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals Stat., 11(2):416–431, 1983.
• [28] E. Sgouritsa, D. Janzing, P. Hennig, and B. Schölkopf. Inference of cause and effect with unsupervised inverse regression. AISTATS, 38:847–855, 2015.
• [29] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. JMLR, 7:2003–2030, 2006.
• [30] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT press, 2000.
• [31] J. A. Van Vechten. Quantum dielectric theory of electronegativity in covalent systems. i. electronic dielectric constant. Physical Review, 182(3):891, 1969.
• [32] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In UAI, pages 255–270, 1991.
• [33] J. Vreeken. Causal inference by direction of information. In SDM, pages 909–917. SIAM, 2015.
• [34] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In UAI, pages 647–655, 2009.