# Conditionally-additive-noise Models for Structure Learning

Daniel Chicharro
Neural Computation Laboratory
Center for Neuroscience and Cognitive Systems@UniTn
Istituto Italiano di Tecnologia
38068 Rovereto, Italy
Department of Neurobiology
Harvard Medical School
Boston, MA 02115
daniel.chicharro@iit.it
daniel_chicharro@hms.harvard.edu
Stefano Panzeri
Neural Computation Laboratory
Center for Neuroscience and Cognitive Systems@UniTn
Istituto Italiano di Tecnologia
38068 Rovereto, Italy
stefano.panzeri@iit.it
Ilya Shpitser
Department of Computer Science
Whiting School of Engineering
Johns Hopkins University
ilyas@cs.jhu.edu
###### Abstract

Constraint-based structure learning algorithms infer the causal structure of multivariate systems from observational data by determining an equivalence class of causal structures compatible with the conditional independencies in the data. Methods based on additive-noise (AN) models have been proposed to further discriminate between causal structures that are equivalent in terms of conditional independencies. These methods rely on a particular form of the generative functional equations, with an additive noise structure, which allows inferring the directionality of causation by testing the independence between the residuals of a nonlinear regression and the predictors (nrr-independencies). Full causal structure identifiability has been proven for systems that contain only additive-noise equations and have no hidden variables. We extend the AN framework in several ways. We introduce alternative regression-free tests of independence based on conditional variances (cv-independencies). We consider conditionally-additive-noise (CAN) models, in which the equations may have the AN form only after conditioning. We exploit asymmetries in nrr-independencies or cv-independencies resulting from the CAN form to derive a criterion that infers the causal relation between a pair of variables in a multivariate system without any assumption about the form of the equations or the presence of hidden variables.

## 1 Introduction

Inferring the causal structure of multivariate systems from observational data has become indispensable in many domains of science, from physics to neuroscience to finance (Lütkepohl, 2006; Wibral et al., 2014; Peters et al., 2017). Constraint-based structure learning algorithms have been used to infer the causal structure by determining an equivalence class of causal structures compatible with the conditional independencies in the data (Spirtes et al., 2000; Pearl, 2009). Additive-noise (AN) models were proposed as powerful solutions that allow further discriminating between structures within these equivalence classes (Hoyer et al., 2009; Peters et al., 2014; Mooij et al., 2016). A pure AN functional equation requires that the noise is additively separable from the causes of a variable, and in the standard approach this property is exploited by testing the independence of the residuals of a nonlinear regression from the regression predictors. For multivariate systems, algorithms testing these nonlinear regression residuals independencies (nrr-independencies) proceed by inferring a global causal ordering (Mooij et al., 2009) under the assumption that the noise is separable in all equations. This approach has been mostly studied in the case of causal sufficiency (no hidden variables; but see Janzing et al., 2009, for an exception). In this work we extend the AN framework on four fronts. First, we allow for the presence of hidden variables. Second, we consider functional equations that have the AN form only after conditioning on certain variables. Third, we introduce an alternative regression-free test to infer causality that exploits the independencies present in AN models. Fourth, we propose a criterion to infer the causal relation between a specific pair of variables in a multivariate system with hidden variables, without restrictions on the form of the functional equations and without involving the inference of a global causal ordering.

In more detail, we generalize AN models to partial conditionally-additive-noise (CAN) models with hidden variables. These models contain both equations reducible and irreducible to the AN form, and the AN form may only be obtained after conditioning on some of the observable variables. We show how structure learning for partial CAN models can be formulated in terms of asymmetries of nrr-independencies, analogously to AN models (Hoyer et al., 2009; Peters et al., 2014). Furthermore, we introduce a regression-free test to detect additive noise. This test assesses the independence of the residuals' second-order moments from the predictors indirectly, estimating conditional variance independencies (cv-independencies) that do not require an actual reconstruction of the noise variables. We formulate a criterion to infer a potential cause from one particular variable to another in the presence of hidden variables, which does not require inferring a global causal ordering. Finally, we discuss the extension of CAN models by generalizing post-nonlinear AN models (Zhang and Hyvärinen, 2009), which allow for the presence in the functional equations of a global invertible nonlinear transformation of the AN terms. We believe that this work will lead to a structure learning algorithm that is an alternative to those existing for AN models with no hidden variables (Mooij et al., 2009; Peters et al., 2014; Bühlmann et al., 2014). The proposal of such an algorithm, exploiting the new criterion, is left for a future contribution.

This paper is organized as follows. In Section 2, we review previous work on AN models and post-nonlinear AN models. In Section 3, we describe the regression-free test based on cv-independence. In Section 4, we extend the AN models to CAN models, providing conditions for the existence of cv-independencies and nrr-independencies that appear after conditioning. We introduce a criterion that exploits these independencies to infer causal relations in the presence of hidden variables and for systems that may be only partially CAN models. In Section 5, we examine examples of concrete systems. In Section 6, we extend our approach to post-nonlinear AN models.

## 2 Previous work on additive-noise models

We start with some basic notions for graphs. We use capital letters for random variables and bold letters for sets and vectors. Consider a set of random variables . A graph consists of nodes and edges between the nodes. for any . We write for . We refer to as both variable and its corresponding node. A node is called a parent of if . The set of parents of is denoted by . A path in is a sequence of (at least two) distinct nodes such that there is an edge between and for all . If all edges are the path is a causal or directed path. A node is a collider in a path if it has incoming arrows and is a noncollider otherwise. The set of descendants of node comprises those variables that can be reached going forward through causal pathways from . The set of non-descendants of , is complementary to , including . In Directed Acyclic Graphs (DAGs) no node is its own descendant. Two nodes and are adjacent if either , , or there is a hidden common parent between them (i.e. and and is not observable). is a potential cause of if it is a parent or they share a hidden common parent. Conditional independence between two variables is equivalent to d-separation (Pearl, 2009) of their corresponding nodes under the faithfulness assumption (Spirtes et al., 2000), which ensures that the probability distribution contains only independencies induced by the causal structure. Accordingly, a conditional dependence between and given , i.e. , exists iff the nodes are connected by a path that is active when blocking the nodes in (S-active path). See Spirtes et al. (2000) for a more detailed description.

The functional equation generating variable has the AN form if it conforms to

$$V_i = f_i(\mathbf{V}_i, \varepsilon_i) = f_i(V_{i,1}, \dots, V_{i,n}) + \varepsilon_i, \tag{1}$$

with and noise by definition independent of the parents. Most work on AN models assumes that all variables are observable. Under the assumption of no hidden variables, since the noise is additively separable from the parents, an estimate can be obtained by nonlinear regression as . If is properly reconstructed, , that is, the independence of the noise from the parents is recovered. Consider a particular variable and parent . If all variables are observed and the equation of has the AN form, it is guaranteed that an independent noise can be reconstructed.

Proposition   Nrr-independence with AN functional equations: ‘If the functional equation of has the AN form, then and such that , with .’

The existence of at least one set is guaranteed because leads to . If we knew that reconstructs a truly generative noise variable, Proposition would suffice to infer a cause from to (assuming no hidden variables). This is because, if is adjacent to both and (there are edges and ), the fact that and is a sufficient condition for to be a collider () (Pearl, 2009). However, because is only a reconstruction of the presumed underlying noise variable, extra checks are required: the question is if the nrr-independence could also occur when is estimated but the generative model contains the reverse causal relation.

Hoyer et al. (2009) proved that, if has a generative AN functional equation and , nrr-independence holds for , given fixed, and in general not for the direction opposite to causality, that is, there is no nrr-independence for . However, they also showed that nrr-independence in both directions holds for a family of distributions which, for fixed, is characterized as the solutions of a third-order linear inhomogeneous differential equation. For example, Gaussian distributions belong to that family. Accordingly, if a system only contained AN equations with no hidden variables, an asymmetry and , would suffice to infer a cause from to . This is because always holds given the AN form of the functional equation of and only if the data generating distribution is within the special family of Hoyer et al. (2009) nrr-independence holds in both directions, in which case nothing can be concluded.
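
The asymmetry of nrr-independencies can be illustrated numerically. The following is a minimal sketch, not the procedure of the cited works: the cubic mechanism, the uniform distributions, the polynomial regression of degree 5, and the biased HSIC statistic with a median-heuristic bandwidth are all assumptions made for illustration. It regresses in both directions and scores the dependence between residuals and predictor.

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC estimate with Gaussian kernels (median-heuristic bandwidth)."""
    def centered_gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        k = np.exp(-d2 / np.median(d2[d2 > 0]))
        # Double centering: remove column means and row means, add the grand mean.
        return k - k.mean(axis=0) - k.mean(axis=1, keepdims=True) + k.mean()
    ka, kb = centered_gram(a), centered_gram(b)
    return float((ka * kb).sum()) / len(a) ** 2

def residuals(predictor, target, degree=5):
    """Residuals of a polynomial least-squares regression (an assumed regression model)."""
    coeffs = np.polyfit(predictor, target, degree)
    return target - np.polyval(coeffs, predictor)

# Illustrative AN system X -> Y with non-Gaussian cause and noise.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-1.5, 1.5, n)
y = x**3 + x + rng.uniform(-1.0, 1.0, n)

hsic_forward = hsic(x, residuals(x, y))   # residuals of Y given X, scored against X
hsic_backward = hsic(y, residuals(y, x))  # residuals of X given Y, scored against Y
print(f"forward  (regress Y on X): {hsic_forward:.5f}")
print(f"backward (regress X on Y): {hsic_backward:.5f}")
```

In this sketch the forward score should stay close to its value under independence, while the backward score should be clearly larger. A calibrated test (for example, a permutation test on the statistic) would be needed to turn the scores into decisions.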

However, generally not all functional equations have an AN form. Focusing on the bivariate case, Janzing and Steudel (2010) discussed the necessary assumptions for structure learning based on asymmetries of nrr-independencies. They indicated that, for a generative functional equation with the opposite direction of causality , it has to be assumed that will not hold for any , except within the family of Hoyer et al. (2009). Janzing and Steudel (2010) justified the fulfillment of this assumption because would impose constraints making and dependent. This dependence requires a fine tuning of the distribution of the cause, given the mechanism , and hence is fragile to changes in if the cause distribution changes independently of the causal mechanism, as expected. These arguments are closely related to the justification of the faithfulness assumption for conditional independencies based on stability. In particular, stability rules out ‘pathological parameterizations’ (Pearl, 2009) in which a conditional independence does not correspond to a d-separation present in the causal structure, because such independencies also require tuning the parameters of the functional equation and will vanish with small changes of these parameters.

When testing for nrr-independencies in multivariate systems, the common procedure starts by inferring a global causal ordering of the variables (Mooij et al., 2009). This step already uses nrr-independencies, and relies on the fact that conditioning on a descendant introduces a dependence between and its noise variable. Subsequently, nrr-independencies are tested with regression models that, if the causal ordering is correct, do not take descendants as arguments. This allows removing superfluous edges from non-descendants that are not parents of . To our knowledge, for the multivariate case an analogous assumption of faithfulness has not been formulated explicitly. For the sake of comparison with our results we here explicitly state the following assumption:

Assumption  Nrr-independence faithfulness for non-additive-noise functional equations: ‘if the generative functional equation of , with , does not have an AN form, then with , and .’

This assumption is a multivariate version of the bivariate one discussed in Janzing and Steudel (2010). It considers that all other parents of are included in the regression, and that only and are exchanged. The assumption can be used iteratively when determining the causal ordering. It ensures that, if a functional equation does not have the AN form and hence Proposition does not guarantee independence in the right direction, an asymmetry of independence does not appear in the wrong direction. The assumption focuses on equations without an AN form because, by Proposition , with the AN form nrr-independence in the wrong direction only leads to symmetric nrr-independencies.

Finally, we also review post-nonlinear AN models, where a global nonlinearity transforms the AN equation (Zhang and Hyvärinen, 2009):

$$V_i = f_i(\mathbf{V}_i, \varepsilon_i) = h_{i,2}\left(h_{i,1}(\mathbf{V}_i) + \varepsilon_i\right). \tag{2}$$

Here is an invertible nonlinear function. For the bivariate case, with , Zhang and Hyvärinen (2009) generalized the work of Hoyer et al. (2009) extending the characterization of the special family of distributions that admits a statistical post-nonlinear AN model in both directions. Furthermore, they showed how to fit a nonlinear model to extract residuals to test nrr-independencies. For the multivariate case, assuming no hidden variables, Zhang and Hyvärinen (2009) used regressions to evaluate nrr-independencies given sets of candidate parents previously determined examining conditional independencies between the variables (Spirtes et al., 2000).
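
As a short worked illustration of why the invertibility of the outer function matters in Eq. 2, applying its inverse to both sides returns an equation of the AN form of Eq. 1 for the transformed variable. The exponential choice in the second line is only an assumed example.

```latex
\begin{align*}
h_{i,2}^{-1}(V_i) &= h_{i,1}(\mathbf{V}_i) + \varepsilon_i, \\
\text{e.g. } h_{i,2} = \exp: \quad V_i = \exp\!\left(h_{i,1}(\mathbf{V}_i) + \varepsilon_i\right)
\;&\Longrightarrow\; \log V_i = h_{i,1}(\mathbf{V}_i) + \varepsilon_i .
\end{align*}
```

The residuals to be tested are therefore those of the transformed variable, which is why the outer nonlinearity has to be estimated together with the regression function.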

Additive-noise models are a well-established approach for structure learning, which has been mostly studied in the case of causal sufficiency (no hidden variables). A pure AN functional equation requires that the noise is separable as in Eq. 1, and in the standard approach this property is exploited by testing the independence of the residuals of a nonlinear regression from the predictors. For multivariate systems, the application of these tests proceeds by inferring a global causal ordering. We extend the AN framework on four fronts: allowing for the presence of hidden variables, considering functional equations that have the AN form only after conditioning on certain variables, introducing an alternative regression-free test to infer causality, and modifying the procedure so that it does not rely on the inference of a global causal ordering.

## 3 Conditional variance independencies

We start by introducing a regression-free test for causal directionality as an alternative to the regression-based analysis of nrr-independencies. For this purpose, we continue to consider the pure AN equations of the form of Eq. 1. The key property of AN functional equations is that the independent noise is additively separable from the parents. For a particular variable and a parent , define . We can study the conditional variance as a variable which is a function only of , with fixed. For AN functional equations, the independence and separability of the noise lead to , which is independent of (). This independence reflects the independence of the second-order moments of the residuals from the predictors indirectly, and does not require an actual reconstruction of the noise variables. Analogously to Proposition , the AN form suffices for this type of independence, which we call conditional variance independence (cv-independence).

Proposition   Cv-independence with AN functional equations: ‘If the functional equation of has the AN form, then .’
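
The reasoning behind this proposition can be made explicit in a simplified bivariate setting. The symbols $X$, $Y$, $f$ and $\varepsilon$ below are illustrative shorthand for a single parent, its child, the regression function and the noise; the multiplicative-noise equation in the last line is an assumed counterexample.

```latex
\begin{align*}
Y = f(X) + \varepsilon,\ \varepsilon \perp X
  \;&\Longrightarrow\;
  \mathrm{Var}(Y \mid X = x) = \mathrm{Var}\big(f(x) + \varepsilon \mid X = x\big)
  = \mathrm{Var}(\varepsilon), \\
Y = X\,\varepsilon,\ \varepsilon \perp X
  \;&\Longrightarrow\;
  \mathrm{Var}(Y \mid X = x) = x^{2}\,\mathrm{Var}(\varepsilon).
\end{align*}
```

In the first case the conditional variance is a constant and hence trivially independent of $X$; in the second the noise is not additively separable and the conditional variance changes with $x$.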

The existence of at least one set is guaranteed because leads to . Because cv-independence follows from the fact that the noise is independent and separable from the other arguments of the equation, for the special family characterized by Hoyer et al. (2009) in which an AN statistical model can also be constructed in the reverse direction, cv-independence holds in both directions. For general systems, possibly containing functional equations without the AN form, an assumption analogous to Assumption is required to ensure that the asymmetry of cv-independencies does not hold in the direction inconsistent with the causal relation. As for Assumption 1, to formulate this assumption of faithfulness we consider a functional equation with the opposite causal direction for and , and compare the cv-dependence of with respect to the independence stated in Proposition 2.

Assumption  Cv-independence faithfulness for non-additive-noise functional equations: ‘if the functional equation of , with , does not have an AN form, then .’

The two faithfulness assumptions are related by the following conditions:

Proposition  Relation between cv-independence faithfulness and nrr-independence faithfulness: ‘The fulfillment of Assumption implies the one of Assumption , but not the opposite.’

Proof of Proposition : See Appendix.

Despite this theoretical asymmetry between the two faithfulness assumptions, the fulfillment of Assumption and not Assumption would impose further constraints on the probability distributions. It would require that is such that dependencies appear in third or higher-order moments, so that cv-independence holds despite nrr-dependence. Furthermore, because we are considering a functional equation where is a parent of , the fulfillment of faithfulness regards , which does not correspond to the generative direction. Accordingly, cases in which nrr-independence faithfulness is violated and cv-independence faithfulness holds require a specific tuning introducing a dependence between the probability of the causes and the causal mechanism (Janzing and Steudel, 2010). The necessity of this tuning renders these cases fragile to changes in the distribution of the causes, and hence unstable.

Testing nrr-independencies intrinsically requires a regression-based approach, fitting a (nonlinear) regression model. On the other hand, while cv-independencies can also be evaluated using the variance of the residuals, they can alternatively be tested in a regression-free approach, estimating the conditional variance of the variables without reconstructing the noise variables. The latter has the advantage that it does not rely on a particular model of regression. However, in some cases, for example if is high-dimensional, a test of variance homogeneity may require more data than the nonlinear regression approach. These practical issues are beyond the scope of this work. As we will see below, the cv-independencies formulation is particularly intuitive for deriving an extension of AN models to partial CAN models.
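
A regression-free check of this kind can be sketched as follows. This is only an illustration under stated assumptions: the conditional variance is estimated inside equal-count bins of the conditioning variable, the max/min ratio of the binned variances is used as a crude heterogeneity score rather than a calibrated test, and the cubic mechanism and uniform distributions are chosen for the example.

```python
import numpy as np

def binned_variances(target, cond, n_bins=30):
    """Variance of `target` inside equal-count (quantile) bins of `cond`."""
    edges = np.quantile(cond, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, cond, side="right") - 1, 0, n_bins - 1)
    return np.array([target[idx == b].var() for b in range(n_bins)])

def variance_ratio(target, cond, n_bins=30):
    """Largest over smallest binned conditional variance (1 means homogeneous)."""
    v = binned_variances(target, cond, n_bins)
    return float(v.max() / v.min())

# Illustrative AN system X -> Y; no regression model is fitted anywhere.
rng = np.random.default_rng(1)
n = 30_000
x = rng.uniform(-1.5, 1.5, n)
y = x**3 + x + rng.uniform(-1.0, 1.0, n)

ratio_forward = variance_ratio(y, x)   # Var(Y | X in bin): roughly constant
ratio_backward = variance_ratio(x, y)  # Var(X | Y in bin): strongly bin-dependent
print(f"forward ratio:  {ratio_forward:.2f}")
print(f"backward ratio: {ratio_backward:.2f}")
```

The forward ratio is not exactly one because the regression function still varies inside each bin, which inflates the binned variance where the function is steep. Narrower bins reduce this bias at the cost of noisier variance estimates, which is one face of the data requirements mentioned above.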

## 4 Conditionally-additive-noise models

For systems in which all functional equations have the AN form, full identifiability of the causal structure has been proven when there are no hidden variables (Peters et al., 2014). For partially AN models, for which only some of the equations have the AN form, asymmetries in nrr-independencies have been used (Tillman et al., 2009) as a method to complement algorithms of constraint-based causal discovery such as the PC algorithm (Spirtes et al., 2000), which exploit conditional independencies between the variables. However, to our knowledge, it has not been examined how extra inferential power can be gained from functional equations that, although not having a pure AN form, are converted to the AN form after conditioning on some variables. We call this type of equation a conditionally-additive-noise (CAN) functional equation. We derive the conditions on the form of a functional equation so that it can be converted to the CAN form in order to test nrr-independencies or cv-independencies. Furthermore, we now drop the assumption of causal sufficiency and consider also the existence of hidden variables.

To derive which equations have the conditionally-additive-noise form, we start by expressing a generic functional equation as:

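$$V_i = f_i(\mathbf{V}_i, \varepsilon_i) = f_{i,1}(V_{i,1,1}, \dots, V_{i,1,n_1}) + f_{i,2}(V_{i,2,1}, \dots, V_{i,2,n_2}, \varepsilon_i) + f_\varepsilon(\varepsilon_i). \tag{3}$$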

Here the form allowed for and should be understood as complementary to simpler terms. That is, comprises any function of only . Function comprises any function that contains as an argument, but excluding terms that only contain . The sets and can overlap. Any functional equation can be expressed in this form. In particular, if the equation reduces to the AN form.
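
As a worked illustration of the decomposition in Eq. 3, consider an assumed equation with two hypothetical observable parents $V_a$ and $V_b$. It is not of the AN form, because $V_b$ multiplies the noise, yet its conditional variance does not depend on $V_a$ once $V_b$ is held fixed.

```latex
\begin{align*}
V_i &= \underbrace{\sin(V_a)}_{f_{i,1}(V_a)}
     + \underbrace{V_b\,\varepsilon_i}_{f_{i,2}(V_b,\,\varepsilon_i)}
     + \underbrace{\varepsilon_i}_{f_\varepsilon(\varepsilon_i)}, \\
\mathrm{Var}(V_i \mid V_a = a,\, V_b = b) &= (1 + b)^{2}\,\mathrm{Var}(\varepsilon_i).
\end{align*}
```

The second line uses only that the noise is independent of the parents. For fixed $b$ the right-hand side is constant in $a$, which is the kind of structure that conditioning is meant to expose.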

Consider and a particular parent . We want to determine under which conditions cv-independencies or nrr-independencies can occur. As a first remark, if , for any set , since is an argument of and modulates its variance. For the same reason the residuals cannot be independent from when . Subsequently, we focus on variables . Taking a particular variable as reference, Eq. 3 can be expanded into the following subterms, where we also differentiate between observable variables (V) and hidden variables (U):

 (4)

We dropped the subindex from all variables and functions to simplify the notation. As in Eq. 3, the meaning of each function is determined by opposition to simpler terms explicitly separated. For example, is any function that does not have as an argument and does not include the other explicit simpler terms that do not include either. As will be appreciated below, we only separate those terms that are subject to different constraints in the conditions to obtain the CAN form. Only the function has as an argument. Function is linear in some observable variables with a coefficient that is a function of other observable variables. Function is linear in each hidden variable of , with a coefficient that is a function of observable variables . Here , and . Similarly and . contains all other observable parents apart from , and all hidden parents. There can be overlaps between subgroups of or of .

We determine the conditions that lead to cv-independencies and nrr-independencies. We will focus on the case in which, for a certain variable , whose causal relation with is examined, is adjacent to all other potential causes of , i.e., parents and variables sharing a hidden common cause with . This is because, as discussed above, it suffices that two observable potential causes are nonadjacent to infer that is a collider for them using conditional independencies (Spirtes et al., 2000). This means that the conditions we derive could be relaxed, but the knowledge obtained would be redundant to the one provided by conditional independencies. Because cv-independencies only rely on second-order moments, there is a difference in the conditions needed to obtain cv-independence and nrr-independence. We start with cv-independencies, which lead to less restrictive conditions.

### 4.1 The CAN form with cv-independence

We define the cv-CAN form as the form of a functional equation leading to cv-independence:

Definition   Cv-independence with cv-CAN functional equations: ‘The functional equation of has the cv-CAN form for when conditioning on if .’

We now enunciate when a functional equation can be set into the cv-CAN form. For this purpose, expressing the functional equation of as in Eq. 4, we define the functions and , where and play the role of fixed parameters, and we also define . The cv-CAN form is characterized as follows.

Theorem   Functional equations with the cv-CAN form: ‘Consider an and a set . For the case in which is adjacent to all other potential causes of , the functional equation of has the cv-CAN form with respect to given the set if and only if the hidden variables fulfill the following conditions

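$$
\begin{aligned}
&\text{i)}\ \mathbf{U}_{1,1} = \emptyset; \qquad
\text{ii)}\ X \perp U_k \mid \mathbf{S} \quad \forall\, U_k \in \{\mathbf{U}_{1,2}, \mathbf{U}_{2}\}; \\
&\text{iii)}\ \sigma_{U_k \mid X, \mathbf{S}} \perp X \quad \forall\, U_k \in \{\mathbf{U}_{1,4}, \mathbf{U}_{3}\}; \qquad
\text{iv)}\ \sigma_{Z_i Z_j \mid X, \mathbf{S}} \perp X \quad \forall\, Z_i, Z_j \in \mathbf{S}_2,
\end{aligned}
\tag{5}
$$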

the set is such that

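$$\{\mathbf{V}_{1,1}, \mathbf{V}_{1,2}, \tilde{\mathbf{V}}_{1,3}, \tilde{\mathbf{V}}_{1,4}, \mathbf{V}_{2}, \mathbf{V}_{1,3,2}, \mathbf{V}_{3,2}\} \subseteq \mathbf{S},$$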

where is defined as , with such that , is defined as with such that , and the unconditioned observable variables also fulfill the following conditions

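$$
\text{v)}\ \sigma_{V_i V_j \mid X, \mathbf{S}} \perp X \quad \forall\, V_i, V_j \in \mathbf{S}_3; \qquad
\text{vi)}\ \sigma_{V_i Z_j \mid X, \mathbf{S}} \perp X \quad \forall\, V_i \in \mathbf{S}_3,\ Z_j \in \mathbf{S}_2,
\tag{6}
$$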

where .’

Proof of Theorem : See Appendix.

To understand the logic of these conditions, we rewrite Eq. 4 as

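$$
\begin{aligned}
Y = f_{1,1}(X; \mathbf{V}_{1,1}) + \Big[ & f_{1,2}(\mathbf{U}_{1,2}; \mathbf{V}_{1,2}) + \sum_j \tilde{\beta}_j V_{1,3,1,j} + \sum_j \beta_j V_{3,1,j} + \sum_j \tilde{\alpha}_j U_{1,4,j} \\
& + \sum_j \alpha_j U_{3,j} + f_2(\mathbf{U}_2, \varepsilon_y; \mathbf{V}_2) + f_\varepsilon(\varepsilon_y) + c \Big].
\end{aligned}
\tag{7}
$$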

and are constant coefficients because . The constant equals because . Eq. 7 can be summarized as:

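$$Y = f_{1,1}(X; \mathbf{S}) + g(\mathbf{V}_{1,3,1}, \mathbf{V}_{3,1}, \mathbf{U}_{1,2}, \mathbf{U}_{1,4}, \mathbf{U}_{2}, \mathbf{U}_{3}; \mathbf{S}) = f_{1,1}(X; \mathbf{S}) + \xi_{y \mid \mathbf{S}}, \tag{8}$$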

where the function plays the role of a noise analogous to the additive noise term of a pure AN equation, and hence the equation has the additive-noise form when seen as a function of . and are conditionally independent of given , and is linear in all the other arguments, with their variances and covariances conditionally independent of given . This leads to . Note that, to fulfill the conditions in Eqs. 5 and 6, may need to include other variables that are not parents of . Furthermore, the constraints are intertwined because independencies change depending on which variables are included in . Since all variables in and are observable, it is always possible to try to find a valid set with and . In that case, the constraints of Eq. 6 vanish. It is also possible to formulate a simpler sufficient condition by demanding .
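
The effect of conditioning can be illustrated with a small simulation. The system below is an assumption made for the example, not one of the systems analyzed in this paper: an observable variable `v` depends on `x` and scales the noise of `y`, so the equation of `y` is not AN, but it takes the AN form once `v` is included in the conditioning set. Discrete `x` and `v` are used so that conditioning is exact and no regression or binning is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60_000

# Hypothetical system: X -> V, and Y = X^2 + V * eps with eps independent of (X, V).
x = rng.integers(0, 3, size=n)                   # X takes values 0, 1, 2
p_v3 = np.array([0.2, 0.5, 0.8])[x]              # P(V = 3 | X) grows with X
v = np.where(rng.random(n) < p_v3, 3.0, 1.0)     # V takes values 1 or 3
y = x**2 + v * rng.standard_normal(n)

def variance_by_x(target, mask):
    """Conditional variance of `target` for each value of X, restricted to `mask`."""
    return np.array([target[mask & (x == k)].var() for k in range(3)])

everything = np.ones(n, dtype=bool)
marginal = variance_by_x(y, everything)              # Var(Y | X): depends on X
conditional = {val: variance_by_x(y, v == val)       # Var(Y | X, V = val): flat in X
               for val in (1.0, 3.0)}

marginal_ratio = marginal.max() / marginal.min()
conditional_ratios = {val: c.max() / c.min() for val, c in conditional.items()}
print("Var(Y | X):       ", np.round(marginal, 2))
for val, c in conditional.items():
    print(f"Var(Y | X, V={val}):", np.round(c, 2))
```

Without conditioning, the variance of `y` changes with `x` because `x` shifts the distribution of the noise scale. Within each value of `v` the variance is approximately `v**2` for every value of `x`, so the cv-independence appears only after conditioning on `v`.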

Note that the cv-CAN form is obtained relative to a certain variable. The existence of a valid set to place an equation in the CAN form relative to a variable is not guaranteed for all the observable parents. This is for two reasons. First, it may be due to the presence of hidden variables that for a certain do not fulfill the conditions of Theorem . This limitation is common to pure AN functional equations if hidden variables are allowed, since AN equations are CAN equations with . Second, even with no hidden variables, . That is, certain parents are not additively separable from the noise and cannot lead to any cv-independence. The fact that only some equations in the system, and only relative to certain variables, have the CAN form hinders the application of structure learning algorithms in which a global causal ordering is inferred by searching for the ordering that leads to the highest estimates of residuals independence (Mooij et al., 2009; Peters et al., 2014), which are designed for systems in which all equations have the pure AN form. This is because now a lack of independence can be due not to the wrong order, but to the lack of separability of the noise, for the reasons mentioned above.

Theorem 1 states which form of a functional equation will create a cv-independence. Assuming that a certain functional equation is known or hypothesized, and for a certain context in which the existence of certain hidden variables is known or hypothesized, the theorem allows determining if a cv-independence exists. However, the theorem cannot be applied for inference, given that the conditions in Eqs. 5 and 6 involve hidden variables and hence their fulfillment cannot be tested from data. To derive a criterion applicable for inference, we identify the assumptions required so that a specific asymmetry of cv-independencies provides information about the causal relation between the corresponding pair of variables, without inferring a global causal ordering.

Assumption  Cv-independence faithfulness for non-conditionally-additive-noise functional equations: ‘if the generative functional equation of , with , does not have the cv-CAN form for conditioned on , then .’

In comparison to the previous assumptions of faithfulness, here there is no restriction of to non-descendants of . is not limited based on any causal knowledge. The assumption again focuses on functional equations which do not have the CAN form. In the Appendix we indicate that, as for pure AN equations, a special family of joint distributions as described by Hoyer et al. (2009) allows a CAN statistical form in both directions. Assumption can be used to infer a potential cause from to , that is, to infer that causes or there is a latent common cause:

Proposition   Inferring noncausality with cv-independence asymmetries: ‘Consider two adjacent variables and . Under the assumption of cv-independence faithfulness for non-cv-CAN functional equations (Assumption ), if and , then there is no causality from to , that is, is a potential cause of .’

Proof of Proposition : If , it does not hold that . By Assumption , this implies that, either or the functional equation of has the cv-CAN form for conditioning on for some . The latter is discarded since we have .

We now provide some intuition about this criterion. First, if there is only a latent common cause between and , it is valid to infer a potential cause in either direction. Therefore, what we need is to avoid inferring the potential cause in the wrong direction when there is a genuine cause. For the bivariate case, the asymmetry of cv-independencies suffices if we assume faithfulness for non-CAN functional equations. However, conditioning on some set can not only convert an equation to the CAN form, but can also introduce cv-dependencies that were not present when conditioning only on a subset of . An asymmetry could appear in the following way: for a certain , not only does the functional equation of have the CAN form relative to , but the conditional joint distribution also belongs to the special family that allows a CAN statistical model in both directions. For , a symmetry of cv-independencies is obtained. However, conditioning on a larger set () can introduce a cv-dependence that only appears in the direction in which the independence given was consistent with the causal structure. Accordingly, for an unfaithful asymmetry is obtained. See Section 5 for an example of a system in which this type of unfaithful asymmetry occurs. Checking if , we can find the for which symmetric independencies were obtained, showing that the observed asymmetry is not reliable.

Altogether, Theorem 1 states when cv-independencies occur as a consequence of the causal structure, and Assumption 3 specifies the faithfulness assumption required so that cv-independencies do not occur inconsistently with the causal structure, which allows formulating the criterion of Proposition to infer noncausality from data. That is, Theorem 1 provides an analytical tool to establish cv-independencies from a known or hypothesized functional equation, and Proposition provides an empirical tool to infer the causal information from data.
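
The logic of the criterion, as read from the proof above, can be sketched as a search over conditioning sets: a cv-independence must exist in one direction for some set, and in the reverse direction for no set. The sketch below is an illustration and not an algorithm proposed in this paper. The function names are hypothetical, the statistical test is passed in as a user-supplied callable `cv_independent(target, predictor, cond)`, and the exhaustive enumeration of subsets is only practical for a handful of variables.

```python
from itertools import chain, combinations

def all_subsets(variables):
    """All subsets of `variables`, from the empty set to the full set."""
    variables = list(variables)
    return chain.from_iterable(
        combinations(variables, r) for r in range(len(variables) + 1)
    )

def is_potential_cause(x, y, others, cv_independent):
    """True if some set S gives cv-independence of y's variance from x,
    and no set S' gives cv-independence of x's variance from y."""
    forward = any(cv_independent(y, x, frozenset(s)) for s in all_subsets(others))
    backward = any(cv_independent(x, y, frozenset(s)) for s in all_subsets(others))
    return forward and not backward

# Toy stand-in for a statistical test: a lookup of (target, predictor, S) triples
# for which cv-independence is declared. These outcomes are invented for the demo.
toy_results = {("Y", "X", frozenset({"V"}))}

def toy_test(target, predictor, cond):
    return (target, predictor, cond) in toy_results

x_to_y = is_potential_cause("X", "Y", ["V", "W"], toy_test)
y_to_x = is_potential_cause("Y", "X", ["V", "W"], toy_test)
print(x_to_y, y_to_x)  # True False
```

Because the reverse direction is checked over every set and not only over the set that produced the forward independence, a symmetric independence found for a smaller set blocks the inference, which is the safeguard against unfaithful asymmetries discussed above.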

### 4.2 The CAN form with nrr-independence

We now define the nrr-CAN form as the form of a functional equation leading to nrr-independence:

Definition   Nrr-independence with nrr-CAN functional equations: ‘The functional equation of has the nrr-CAN form for when conditioning on if such that , with .’

We distinguish and as an argument and constant parameters of the function , since is fixed when conditioning. We now enunciate the conditions in which a functional equation can be set into the nrr-CAN form. Similarly to Theorem , we focus on conditions for the case that is adjacent to all other potential causes of , since otherwise the rules based on conditional independencies would already be applicable to extract the same causal information. For this purpose, we first introduce some further notation. Consider a variable . This variable has a linear additive contribution to the functional equation of (Eq. 4), and hence will contain an additive component associated with the term in which appears. This component corresponds to the conditional mean of given and , scaled by its coefficient in Eq. 4. The contribution of this term to the residual of is hence proportional to the residual that would result from a separate regression to estimate . Therefore, we define for . We use an analogous definition in relation to the part of the residual of associated with when, after conditioning on and , respectively, they also have linearly additive contributions in Eq. 4. The nrr-CAN form is characterized as follows:

Theorem   Functional equations with nrr-CAN form: ‘Consider an and the case in which is adjacent to all other potential causes of . Express the functional equation of as in Eq. 4. The equation has the nrr-CAN form with respect to given if and only if the hidden variables fulfill the following conditions

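$$\text{i) } U_{1,1}=\emptyset;\quad \text{ii) } X \perp U_k \mid S \;\; \forall U_k\in\{U_{1,2},U_2\};\quad \text{iii) } \varepsilon_{U_k|X;S} \perp X \;\; \forall U_k\in\{U_{1,4},U_3\}, \tag{9}$$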

the set is such that

$$\{V_{1,1},V_{1,2},\tilde{V}_{1,3},\tilde{V}_{1,4},V_2,V_{1,3,2},V_{3,2}\}\subseteq S,$$

where is defined as , with such that , is defined as with such that .

Proof of Theorem : See Appendix.

The correspondence between Theorems and can be understood considering that the conditional variances only quantify, in a regression-free way, dependencies of with the second-order moments of the residuals of . On the other hand, nrr-independencies are also sensitive to dependencies of with the higher-order moments of the residuals. Accordingly, while the conditions i-ii) of Theorem requiring conditional independencies are preserved in Theorem , the remaining conditions iii-vi), specific to second-order moments, are modified. Condition iii) of Theorem is analogous to condition iii) of Theorem . It indicates that for a dependence with can exist in the mean , which will be captured by the regression function, but any other dependence with in will also create an nrr-dependence between and the residuals . The other conditions of Theorem , iv-vi), are already fulfilled given the standard assumption of faithfulness for conditional independencies (Spirtes et al., 2000). This is because in Theorem condition iii) and the requirements in the selection of and only involve conditional variances, and the conditional variance of also depends on the covariance between the different linear contributions in its functional equation. Conversely, in Theorem condition iii) and the requirements in the selection of and are conditional independence constraints. Any dependence between and a subset of variables in or which exists despite being independent of each of these single variables would violate the standard assumption of faithfulness for conditional independencies.

For most functional equations, either both or neither of the CAN forms are obtainable, because the existence of higher-order dependencies without second-order dependencies imposes restrictive constraints on the form of the functional equations. However, the specific cases in which the cv-CAN form holds and the nrr-CAN form does not may still be stable, in the sense that they do not depend on a specific tuning of the distribution of the causes (Janzing and Steudel, 2010). This is because the independencies required in Theorems 1 and 2 may depend exclusively on the form of the functional equations. The relation between the fulfillment of the cv-CAN form and the nrr-CAN form is thus qualitatively different from the relation between cv-independence faithfulness and nrr-independence faithfulness, as discussed in relation to Proposition 3. In the latter case, because the violation of faithfulness concerns dependencies with residuals extracted in the direction opposite to the generative functional equation, cases in which cv-independence faithfulness is violated and nrr-independence faithfulness is not will occur only for specific tunings of the distribution of the causes, as discussed above.

As in the formulation based on cv-independencies, the conditions in Eq. 5 are not testable experimentally, since they involve hidden variables. Again, Theorem 2 serves to identify for which types of functional equations nrr-independencies will exist as a consequence of the form of the equation, but a criterion for inference from data must additionally be introduced. For this purpose we formulate for nrr-independence an assumption of faithfulness analogous to Assumption :

Assumption  Nrr-independence faithfulness for non-conditionally-additive-noise functional equations: ‘if the generative functional equation of , with , does not have the nrr-CAN form for conditioned on , then  for any regression , with .’

Based on this assumption, we can state a criterion of noncausality using nrr-independencies analogous to Proposition :

Proposition   Inferring noncausality with nrr-independence asymmetries: ‘Under the assumption of nrr-independence faithfulness for non-nrr-CAN functional equations (Assumption ), if and with and  for any regression , with , then there is no causality from to .’

Proof of Proposition : If and with , it does not hold that for any . By assumption , this implies that, either or the functional equation of has the nrr-CAN form for conditioning on for some . The latter is discarded since we have  for any regression .

This criterion is analogous to the one with cv-independencies. However, because nrr-independencies are assessed with a regression-based approach, there is an extra condition requiring that dependencies hold for any possible regression. Theoretically, this is an extra requirement for applying nrr-independencies to causal discovery, as opposed to cv-independencies. In practice, this reduces to the requirement of a good regression model, in the same way that we need a good estimate of the conditional variances. Note that the use of nonlinear regressions differs from their common use in algorithms that infer a global causal ordering (Mooij et al., 2009). In that approach, a regression takes as predictors all the candidate parents of a variable. Conversely, here the regression operates on with all variables in conditioned, or at least, regarding the terms in Eq. 7, it has to estimate as a function of and the subset of variables in which does not appear in any other term, while conditioning on the rest. The relation between a formulation of nrr-independence in terms of conditional regressions and of multivariate regressions will be further addressed in future work.
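As a minimal sketch of what estimating an nrr-(in)dependence can look like in practice, the following fits a regression, extracts the residuals, and scores their dependence on the predictor. Everything specific here is an assumption made for illustration and is not taken from the text: a cubic polynomial stands in for the nonlinear regression, a biased HSIC estimate with Gaussian kernels stands in for the independence test, no conditioning set is used, and the function names and the two toy equations are hypothetical.

```python
import numpy as np

def gaussian_gram(values):
    """Gaussian-kernel Gram matrix of a 1-D sample, bandwidth = median pairwise distance."""
    sq_dists = (values[:, None] - values[None, :]) ** 2
    bandwidth_sq = np.median(sq_dists[sq_dists > 0])
    return np.exp(-sq_dists / (2.0 * bandwidth_sq))

def hsic_statistic(a, b):
    """Biased empirical HSIC, trace(K H L H) / n^2; larger values suggest dependence."""
    n = a.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gaussian_gram(a) @ centering @ gaussian_gram(b) @ centering) / n ** 2

def nrr_dependence(x, y, degree=3):
    """Dependence score between the predictor x and the residuals of a polynomial fit of y on x."""
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    return hsic_statistic(x, residuals)

rng = np.random.default_rng(0)
n = 800
x = rng.uniform(-2.0, 2.0, n)

# Additive-noise equation: the residuals should look independent of x.
y_additive = x ** 3 + 0.5 * rng.normal(size=n)
# Equation with a hidden random coefficient multiplying x: the residual spread grows with |x|.
y_random_coeff = (1.0 + rng.normal(size=n)) * x + 0.3 * rng.normal(size=n)

score_additive = nrr_dependence(x, y_additive)
score_random_coeff = nrr_dependence(x, y_random_coeff)
print(f"additive noise: {score_additive:.5f}, random coefficient: {score_random_coeff:.5f}")
```

In an actual test the statistic would be compared against a null distribution (for instance from permutations) rather than against another data set; the comparison above only illustrates the contrast between the two kinds of equation.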

## 5 Examples

We now examine some concrete examples to understand the different effects that conditioning on an extra variable can have, either conferring the CAN form on a functional equation or removing it. For that purpose, we first consider systems within the class of linear mixed models (LMM) (West et al., 2007). This widely applied type of model takes into account the existence of random effects, that is, coefficients of the predictors which are themselves random variables. A functional equation in a linear mixed model has the form:

$$V_i=\sum_k b_{ik}V_{1k}+\sum_k \epsilon_{ik}V_{2k}+\xi_i. \tag{10}$$

The sets of parents and can overlap. Here indicates a constant fixed coefficient, while indicates a random coefficient, that is, is itself a random variable. For example, can represent across-subjects variability in the influence strength of a parent variable. All random coefficients are hidden variables. Furthermore, only a subset of may be observed. For simplicity, we restrict the examples to Gaussian linear mixed models. Because linear Gaussian models belong to the special family of Hoyer et al. (2009) for which cv-independencies hold symmetrically, this has the advantage that in these examples we can relate cv(nrr)-dependencies only to the presence of random effects introducing nonlinearities in the equations. LMM equations are in the AN form only if the random coefficients vanish. A CAN form can be obtained by conditioning on the parents in . We use LMM models for illustrative purposes because the connection between random effects and cv(nrr)-dependencies facilitates the explanation. However, as is clear from the general form of the functional equations that can have the cv(nrr)-CAN forms according to Theorems 1 and 2, cv(nrr)-independencies will exist in a much wider class of systems than LMM models. We will later discuss general versions of these examples, sharing the same causal structures as in Figure 1, but with a more general form of the functional equations. Furthermore, note that the random coefficients do not play any special role other than being hidden variables which appear multiplicatively with the observed variables.
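To make the link between random effects and conditional variances concrete, consider the simplest instance of Eq. 10, with one fixed-effect parent and one random-effect parent. The derivation below is a sketch under assumptions that are ours and not stated in the text: the random coefficient and the noise are zero-mean, independent of each other and of both parents, with variances written here as $\sigma_\epsilon^2$ and $\sigma_\xi^2$.

```latex
\begin{aligned}
V_i &= b\,V_1 + \epsilon\,V_2 + \xi, \\
\mathbb{E}[V_i \mid V_1, V_2] &= b\,V_1, \\
\operatorname{Var}(V_i \mid V_1, V_2) &= \sigma_\epsilon^2\,V_2^2 + \sigma_\xi^2 .
\end{aligned}
```

Under these assumptions the conditional variance changes with $V_2$ unless $\sigma_\epsilon^2 = 0$, while for any fixed value of $V_2$ it is constant in $V_1$. This is the sense in which conditioning on the random-effect parent restores an additive-noise form with respect to the fixed-effect parent.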

Figure 1 shows examples of different effects that conditioning has on the cv(or nrr)-independencies asymmetry. For simplicity, from now on we describe these examples referring only to cv-independencies, but the same reasoning holds for the nrr-independencies. To reflect the form of the equation in the graphical representation, we indicate by an arrow the presence of in the equation of , but as mentioned above the random effects are just hidden variables. We focus on cv-dependencies between and , conditioned or unconditioned on , which are collected in Table 1. In Figure 1A, conditioning on does not alter the asymmetry. This is because it is the influence of on that leads to a cv-dependence in the direction . Because is independent of , acts effectively as a source of noise on and the equation of has the CAN form for , conditioned or unconditioned on . does not have a Gaussian distribution, which brings the distribution of and out of the special family of Hoyer et al. (2009) and leads to cv-dependencies in the direction opposite to causality. In this case, if is observable, the collider can be identified using conditional independencies. Otherwise, , provides new causal information.

In Figure 1B, conditioning on activates the collider , activating a path of dependence between and . Changes in the mean of modulate the variance of , leading to . In the opposite direction, again acts as a source of non-Gaussian noise, leading to . Conditioning on inactivates the alternative path between and , providing the CAN form to the equation of . The non-Gaussian influence from results in the asymmetry , .

In Figure 1C, conditioning does not help to find an asymmetry. When is not conditioned, either conditioning on or changes the mean of , which modulates the variance of . After conditioning , the system is reduced to a linear Gaussian model, leading to a symmetry of cv-independencies. Finally, in Figure 1D, conditioning creates a misleading asymmetry. Because the random effect only affects , without conditioning the system is linear Gaussian, resulting in a symmetric cv-independence. After conditioning , a dependence is created between the random effect and both and . Because , this dependence is inactivated when further conditioning on , leading to . That is, in this case the cv-independence results from a more general conditional independence. In the opposite direction, cannot inactivate the dependence between and (), and the effect of leads to . Because the asymmetry only appears after conditioning on , the extra check of Proposition can detect that it is not reliable to infer a potential cause from to .

These examples do not cover all possible effects of conditioning, but indicate that conditioning can maintain an informative asymmetry (Figure 1A), create an informative asymmetry (Figure 1B), exchange symmetries of cv-dependencies and cv-independencies (Figure 1C), and create a misleading asymmetry that has to be detected by the extra checks of Proposition (Figure 1D). Note that the graphs of Figure 1 do not have the structure of DAGs for all variables, since the random effect variables are assigned to edges instead of nodes. However, the way they provide information about cv(nrr)-independencies suggests that graphical criteria can be used to read cv(nrr)-independencies. A formal introduction of graphical criteria will be described in forthcoming work.

We now discuss more general forms of systems that would lead to the cv-independence asymmetries reported in Table 1A-B, corresponding to the causal structures of Figure 1A-B, that is, the examples for which it is possible to infer a potential cause from to . The pattern of cv-independencies of Table 1A is more generally compatible with any system of the form:

$$Z=\eta_z;\quad X=b_{xz}Z+\eta_x;\quad Y=b_{yx}X+b_{yz}Z+f_y(V,\epsilon,\varepsilon_y), \tag{11}$$

where indicates a Gaussian noise. We follow the same notational rule as in Section 4, writing the functional equations in the most generic form possible given the constraints we require. This class of systems is more general than Gaussian LMM models since can have any form, including nonlinearities, and can be non-Gaussian. This is because, with respect to , the component acts as an additive noise, in agreement with the CAN form of Eq. 8. Furthermore, the pattern and of Table 1A, which by itself allows inferring the potential cause from to , holds for a larger class of systems compatible with the causal structure of Figure 1A:

$$Z=\varepsilon_z;\quad X=f_x(Z,\varepsilon_x);\quad Y=f_{y,1}(X,Z)+f_{y,2}(Z,V,\epsilon,\varepsilon_y), \tag{12}$$

where all noises can have generic distributions and , , and are generic and can be nonlinear. Again, after conditioning , given the causal structure of Figure 1A and the form of the functional equation of in Eq. 12, the cv-CAN form holds according to Theorem 1.
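The asymmetry of conditional variances can be checked numerically on a simulated instance of Eq. 11. The sketch below is illustrative only: the coefficient values, the choice of an LMM-type term (a hidden Gaussian random coefficient multiplying a hidden Gaussian parent) for the generic function in the equation of Y, and the crude equal-count binning estimator are all assumptions introduced here, and the helper names are hypothetical.

```python
import numpy as np

def binned_conditional_variance(target, given, n_bins=20):
    """Variance of `target` inside equal-count bins of `given` (a crude regression-free estimate)."""
    order = np.argsort(given)
    return np.array([chunk.var() for chunk in np.array_split(target[order], n_bins)])

def spread(variances):
    """Ratio of the largest to the smallest binned variance (close to 1 means roughly flat)."""
    return variances.max() / variances.min()

rng = np.random.default_rng(0)
n = 400_000

# Hypothetical coefficients; Gaussian noises for Z and X as in Eq. 11.
b_xz, b_yx, b_yz = 0.8, 1.0, 0.5
z = rng.normal(size=n)
x = b_xz * z + rng.normal(size=n)

# Assumed LMM-type instance of f_y: random coefficient times a hidden parent, plus Gaussian noise.
v = rng.normal(size=n)
random_coeff = rng.normal(scale=2.0, size=n)
f_y = random_coeff * v + 0.3 * rng.normal(size=n)
y = b_yx * x + b_yz * z + f_y

forward_spread = spread(binned_conditional_variance(y, x))  # variance of Y across bins of X
reverse_spread = spread(binned_conditional_variance(x, y))  # variance of X across bins of Y
print(f"Var(Y|X) spread: {forward_spread:.2f}   Var(X|Y) spread: {reverse_spread:.2f}")
```

The binned estimate in the causal direction is not perfectly flat even in this idealized setting, because the conditional mean still varies inside the outermost bins; a finer or smoothed estimator would reduce that artifact.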

In the same way, the pattern of Table 1B is also obtained for a much wider class of functional equations compatible with the causal structure of Figure 1B:

 (13)

where all noises have generic distributions and , , and are generic.

The analysis of these concrete examples illustrates how, when the functional equations are known or hypothesized, Theorem 1 (or Theorem 2) allows us to determine which cv(or nrr)-independencies exist. In application to data, the criterion of Proposition 4 (or Proposition 5) would be applied after estimating the cv(nrr)-independencies, and the patterns displayed in Table 1 determine whether a potential cause would be inferred.

## 6 Post-nonlinear CAN functions

Finally, we briefly consider how post-nonlinear AN equations (Zhang and Hyvärinen, 2009) can be extended to a post-nonlinear CAN form. From Theorem and , it is straightforward to derive the same conditions for CAN post-nonlinear forms, simply considering that the conditions apply to in Eq. 2. However, this class of models can be further generalized. To see this, consider a functional equation of the form

$$Y=h_4\big(h_2(h_1(X,V_1,U_1,\varepsilon_y))+h_3(X,V_3,U_3)\big), \tag{14}$$

where both and are nonlinear invertible functions and is a function that has the CAN form for given a certain conditioning set , where is the parent of interest for which we examine the causal relation with . The equation can be reexpressed as

$$h_2^{-1}\big(h_4^{-1}(Y)-h_3(X,V_3,U_3)\big)=h_1(X,V_1,U_1,\varepsilon_y). \tag{15}$$

If , considering the set , and using the same notation as in Eq. 8 for the CAN function , Eq. 15 has the form

$$h(Y,X;S')=f(X;S')+\xi_{y|S'}. \tag{16}$$

Exploiting a model of this type requires estimating the functions and to minimize the information between and . If is not an argument of this reduces to the same estimation problem studied in Zhang and Hyvärinen (2009), with .
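The algebra that takes Eq. 14 to Eq. 15 can be verified numerically. In the sketch below every concrete function is a hypothetical choice made only for illustration: `sinh` and the cube serve as the invertible nonlinearities, the inner function is an arbitrary example that is additive in its noise once its observed argument other than X is fixed, and the second inner function is assumed to have no hidden arguments so that it can be evaluated from observed variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Observed variables X, V1, V3 and hidden inputs U1, eps_y (all standard normal here).
x, v1, v3 = rng.normal(size=(3, n))
u1, eps_y = rng.normal(size=(2, n))

def h1(x, v1, u1, eps_y):
    """Hypothetical inner function: additive in (u1 + eps_y) once v1 is fixed."""
    return np.tanh(x * v1) + u1 + eps_y

def h3(x, v3):
    """Hypothetical second inner function with observed arguments only (no hidden U3)."""
    return x * v3

h2, h2_inv = np.sinh, np.arcsinh          # invertible nonlinearity and its inverse
h4, h4_inv = (lambda t: t ** 3), np.cbrt  # invertible nonlinearity and its inverse

inner = h1(x, v1, u1, eps_y)
y = h4(h2(inner) + h3(x, v3))              # composition of Eq. 14

recovered = h2_inv(h4_inv(y) - h3(x, v3))  # left-hand side of Eq. 15
print("max abs error:", np.max(np.abs(recovered - inner)))
```

In an inference setting the nonlinearities are of course unknown and would have to be estimated, which is the estimation problem described above; the sketch only confirms that the inversion isolates the inner function when they are known.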

The form of Eq. 14 suggests a generalization by an iterative composition of two operations. Consider the operation consisting in an invertible nonlinear univariate transformation and the operation consisting in the bivariate sum . Starting from a function that has the CAN form for given a certain conditioning set , a set of invertible nonlinear functions and a set of arguments , the functional equation of can be constructed by the iterative composition starting as , with . Because all functions are invertible, the functional equation of can be expressed in the form of Eq. 16 by inverting the operations. As in the case of Eq. 14, if is not an argument of the functions , the expression further simplifies to the form studied in Zhang and Hyvärinen (2009). The required conditioning set is . The same procedure can be followed replacing the sum operation by the product. This procedure results in increasingly complex functional equations for which in principle cv-independencies and nrr-independencies can be tested. In practice, the difficulty of the estimation problem of Eq. 16 will depend on the number of these operations, the extra variables introduced in the functions analogous to , as well as on the number of variables in , and the complexity of the functions.

## 7 Conclusions

In this paper we extended the theory behind the AN framework for structure learning in several ways. We first introduced an alternative regression-free test of independence. This test does not require the reconstruction of the additive noise using the residuals of a nonlinear regression. Instead of testing the independence between the residuals and the parents of a variable (nrr-independencies), it indirectly evaluates the independence between the noise variance and the parents using conditional variances (cv-independencies). The use of cv-independencies is expected to be especially useful when the form of the functional equation is complex. In that case, the family of regression models used may not be powerful enough to capture the form of the actual dependencies, and thus our indirect estimate of independencies may be particularly beneficial. On the other hand, examining cv-independencies and examining nrr-independencies are not mutually exclusive, and the two could be combined to improve learning.

We formulated all the other contributions of this work both for cv-independencies and for nrr-independencies. In the latter case, the implementations of nonlinear regressions developed in previous work (see the implementations provided by Hoyer et al., 2009; Mooij et al., 2009; Peters et al., 2014; Bühlmann et al., 2014) can already be applied to implement this extended framework. We generalized AN models to partial conditionally-additive-noise (CAN) models with hidden variables. In these models, only some functional equations and only for certain parents have the AN form, possibly after conditioning. We determined when a functional equation has the CAN form that results in cv(or nrr)-independencies. Exploiting asymmetries in cv(or nrr)-independencies, we then introduced a criterion to infer the causal relation between specific pairs of variables in a multivariate system with hidden variables, without restrictions on the form of the functional equations. The criterion can be applied locally, if the CAN form holds for a certain functional equation, and without inferring a global causal ordering (Mooij et al., 2009). Because the class of functional equations that have a CAN form is substantially larger than the class of pure additive-noise functional equations, we can expect that cv(nrr)-independencies induced by the CAN form will exist more often and hence that in more practical cases the AN framework will increase the inferential power of standard methods based on conditional independencies. The magnitude of this increase will be specific to each domain of application, depending on the properties of the generative functional equations.

The new criterion can readily be applied to complement the existing algorithms that, in the presence of hidden variables, extract equivalence classes of causal structures given conditional independencies (Spirtes et al., 2000; Drton and Maathuis, 2017; Heinze-Deml et al., 2018). As with any standard rule of causal orientation used in constraint-based structure learning algorithms (e.g., Spirtes et al., 2000), this new criterion relies on faithfulness assumptions. While understanding when faithfulness holds is an ongoing subject of research (Uhler et al., 2013), the corresponding analysis of independencies can be applied for structure learning only under these types of assumptions. In future work we will address in full detail how to exploit the new criterion in combination with conditional independencies as part of a structure learning algorithm.

## Acknowledgments

This research was supported by the NIH Brain Initiative (Grant No. U19 NS107464) and by the Fondation Bertarelli.

## Appendix

### Proof of Proposition 3

Proof of Proposition : We first prove that cv-independence faithfulness implies nrr-independence faithfulness. Consider that a nonlinear regression is implemented such that is independent of despite . Then the statistical model has the AN form and it follows that . Given that implies , inversely implies . Because cv-independence faithfulness assumes , this implies , which corresponds to the assumption of nrr-independence faithfulness. We now justify that nrr-independence faithfulness does not imply cv-independence faithfulness. To see this, it suffices to realize that nrr-independence requires that all moments of the residuals variable are independent of . On the other hand, cv-independence only requires that the variance of the residuals variable is independent. The distribution can be such that the dependence only appears in the third and higher moments. In that case, cv-independence holds despite nrr-dependence.
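The last step of this argument, a dependence that appears only in the third and higher moments, can be illustrated with a toy construction. The sketch below is an assumption-laden example and not part of the proof: the residual is a centered exponential variable whose sign is flipped according to the sign of the predictor, so that its conditional mean and variance are constant while its conditional skewness is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.normal(size=n)
# Centered Exp(1) noise: mean 0, variance 1, third central moment 2.
skewed_noise = rng.exponential(size=n) - 1.0
# Flipping the sign with x keeps the mean and variance but reverses the skewness.
residual = np.sign(x) * skewed_noise

positive_side = residual[x > 0]
negative_side = residual[x < 0]

var_pos, var_neg = positive_side.var(), negative_side.var()
third_pos, third_neg = np.mean(positive_side ** 3), np.mean(negative_side ** 3)

print(f"conditional variances: {var_pos:.3f} vs {var_neg:.3f}")
print(f"conditional third moments: {third_pos:.3f} vs {third_neg:.3f}")
```

A test based only on conditional variances would see no dependence here, whereas a test sensitive to the full distribution of the residuals would, which is the situation described in the final sentences of the proof.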

### Proof of Theorem 1 and Theorem 2

We first prove the if and only if conditions of Theorem 1 for the functional equation of to be in the cv-CAN form with respect to a parent given the set when is adjacent to all other potential causes of .

Proof of Theorem : We proceed justifying the necessary and sufficient requirements for each set of hidden and observed variables of Eq. 4. First, we need because modulates the variance of any , since they appear together as arguments of in Eq. 4, which is nonlinear. Also for