Identifiability of an Integer Modular Acyclic Additive Noise Model and its Causal Structure Discovery

# Identifiability of an Integer Modular Acyclic Additive Noise Model and its Causal Structure Discovery

Joe Suzuki Takanori Inazumi11footnotemark: 1 Takashi Washio Shohei Shimizu Department of Mathematics, Graduate School of Science, Osaka University, 1-1, Machikaneyamacho, Toyonaka, Osaka 560-0043, Japan Department of Reasoning for Intelligence, the Institute of Scientific and Industrial Research, Osaka University, 8-1, Mihogaoka, Ibaraki, Osaka 567-0047, Japan
###### Abstract

The notion of causality is used in many situations dealing with uncertainty. We consider the problem whether causality can be identified given data set generated by discrete random variables rather than continuous ones. In particular, for non-binary data, thus far it was only known that causality can be identified except rare cases. In this paper, we present necessary and sufficient condition for an integer modular acyclic additive noise (IMAN) of two variables. In addition, we relate bivariate and multivariate causal identifiability in a more explicit manner, and develop a practical algorithm to find the order of variables and their parent sets. We demonstrate its performance in applications to artificial data and real world body motion data with comparisons to conventional methods.

###### keywords:
statistical causal inference, causal ordering, acyclic causal structure, integer modular variable, discrete variable
label2label2footnotetext: Currently in NTT Communications Corp.

## 1 Introduction

We consider the problem of inferring causal relation between two random variables from a finite number of samples that have been generated according to the joint distribution (Spirtes et al. 2000).

Solving the problem in a general setting is rather hard, and we need some assumptions to find the causal relation: suppose are related by

 Y=f(X)+e , (1)

where is a function from the range of to that of , and the noise is independent of , and suppose further that there is no function from the range of to that of such that

 X=g(Y)+h , (2)

where the noise is independent of . Then, we can infer that causes but does not cause , and say that the causality is identifiable. On the other hand, if such a function exists, then we conclude that we cannot infer causality, and say that are reversible. This principle (additive noise model) was proposed by Shimizu et. al, who demonstrated that causality can be found if the joint distribution of is not Gaussian when is a linear, i.e., with some constant (LiNGAM) Shimizu et al. (2006, 2009, 2011).

The same principle applies to searching (acyclic) causal relation

 Xi=fi(X1,⋯,Xi−1)+ei (3)

among random variables , where is independent of , and is a linear function of , . The estimated directed acyclic graph (DAG) is found from a finite number of samples  Dodge and Rousson (2001); Kano and Shimizu (2003); Shimizu and Kano (2008). The idea Mooij et al. (2009) is to find and such that

 ei=Xi−f(X1,⋯,Xi−1,Xi+1,⋯,Xd)

is independent of , and remove such an (sink variable); starting from , if we repeat the process (removing a sink variable from to obtain , ), we obtain an order of , and can rename the indexes of so that Eq. (3) holds for some .

The references Hoyer et al. (2009); Zhang and Hyvarinen (2009); Mooij et al. (2009) address using nonlinear functions as in Eq. (1). In another direction, Lacerda et al. (2008) extended Eq. (3) to the case:

 Xi=fi(X1,⋯,Xi−1,Xi+1,⋯,Xd)+ei .

However, those results assumed that the random variables are continuous.

This paper addresses the case that the random variables take a finite number of values: suppose each random variable takes a value in the set for an integer , and define arithmetic over as follows: for , takes the value if divides . Such a exists and is unique for any . For example, if , then in . When , this amounts to binary data with the exclusive-or arithmetic. Such random cyclic values are abundant in our daily life. For example, directions in , months in . For two random variables that take values in , we consider the additive noise model expressed by Eqs. (1)(2). The idea can be extended to the multivariate case using Eq. (3) (integer modulus acyclic additive noise (IMAN) model). Recently, several papers deal with such discrete cases, and we discuss those related results in the next section.

Our contributions in this paper are

1. to express necessary and sufficient conditions on causal identifiability in a bivariate IMAN model in terms of the probabilities of and , and

2. to develop a practical algorithm for identifying a causal structure in a multivariate IMAN model under the identifiability.

This IMAN often appears in circular/directional statistics Fisher (1995); Mardia and Jupp (2009). This is used in time series analysis of phase angles in the frequency domain. It has been extensively used for angular data representing an object’s shape and motion as observed in ubiquitous sensing systems Pons-Moll et al. (2010).

In Section 2, we discuss existing results related to this paper. In Section 3, we state theorems on necessary and sufficient conditions of reversibility that is equivalent to non-identifiability for a bi-variate IMAN, and show examples illustrating those theorems. These results show that the causal identifiability of an IMAN actually holds except in rare situations. In Section 4, we propose an algorithm for identifying a causal structure in a multivariate IMAN. In Sections 5 and 6, we show numerical experiments by using artificial examples and real-world data of human body motions to compare with conventional approaches, which suggests that the proposed algorithm is actually useful in many situations.

## 2 Related Works and Discussion

Peters et. al Peters et al. (2011a) first considered the IMAN model: let and with , and if we assumed take values in and , respectively, then the causal identifiability is defined by non-existence of in Eq. (2) such that is independent of assuming existence of in Eq. (1) such that is independent of .

The notions of causal identifiability and reversibility are the same even if are discrete. Let be the sets of elements such that , respectively, and denote the number of elements in set by . They proved that for reversibility of a bivariate IMAN model, the following conditions are necessary (Theorem 4, Peters et al. (2011a)):

• divides .

• If and , then at least one additional equality constraint on and over and is required.

assuming that none of , , are uniformly distributed and is not constant.

Although the above result suggests that it is unlikely that are reversible in general situations, no essence has been captured: exactly when causality is identified for IMAN ? We would be very pleased if we had a result on necessary and sufficient conditions of reversibility in terms of and over and , respectively, and would feel safe because we would know exactly when reversibility occurs beforehand. We know that earthquakes occur very rare even in Japan but would be much happier if we knew exactly when they occur beforehand.

When , assuming that is injective, we derive the necessary and sufficient conditions in the next section. Thus far, the condition was obtained for : either or . The condition we consider in this paper extends the existing result, and eventually, the proposed algorithm will have more applications. We notice that the assumption of injectivity can be seen in many situations including the circular/directional problems. One of the most common cases is that is a composite function of a monotonic periodic function of discrete angles and a labeling function of the angles. This frequently appears in angle relations observed in different coordinates in mechanical sensing Pons-Moll et al. (2010).

On the other hand, in order to establish relation between bivariate and multivariate causal identifiability, Perters et. al Peters et al. (2011b) proposed -identifiable functional model classes (IFMOCs): Suppose that each such that belongs to a subset of for some . Let , and the set of the distribution functions. Let be any set of such that , is independent of , and is not independent of , where are the distribution functions of , respectively. For example, for the original LiNGAM, we may take the as the set of such that , and both of should not be Gaussian. Then, they prove that if the data generated process belongs to any -IFMOC, we can identify the exact causal graph from data (Theorem 2).

Our result in this paper does not contradict to the theorem. Instead, we show relation between bivariate and multivariate causalities in a more specific manner (Propositions 1 and 2), and propose a method to find a sink based on bivariate causality verification. More precisely, we obtain a bi-variate IMAN for any pair of variables by conditioning all the other variables except . In fact, Peters et al. (2011b) has not addressed any method to find a sink variable uniquely from bi-variate independence relation between and as demonstrated in DirectLiNGAM Shimizu et al. (2009, 2011).

On the other hand, Sun and Janzing (2007) proposed a causal ordering method of binary variables. However, its identifiability is not insured, and its applicability is limited because the computational complexity is rather high.

Recently, Inazumi et al. (2011) showed a necessary and sufficient condition on the bi-variate causal identifiability of Eq.(1) for binary variables. Given a value of , coincides with either or irrespective of the value of . They showed the reverse model Eq.(2) satisfying the same condition among , and exists if and only if when . They proposed an efficient algorithm to identify a unique causal structure in a multivariate binary acyclic additive noise model named BExSAM, i.e., Eq.(3) modulo , under the identifiability condition. However, BExSAM and its algorithm are not suitable for generic modular model. Our study indicates that a nontrivial condition different from the uniform is a necessary and sufficient condition for reversibility in some generic cases, discussed in the next section, and further establishes a generic condition where a uniform is a necessary and sufficient condition for reversibility.

## 3 Analysis on Bi-variate Identifiability

We show necessary and sufficient conditions for the reversibility of a bi-variate IMAN (1)(2), where take values in . For simplicity, its modulus is a prime or its power, and is injective. The notation and is used for brevity, and is assumed while for , which does not loose any generality since is not constant in general situations. A typical real example arises from human body motion data demonstrated in section 6. We first present some lemmas on the reversibility for couple moduli to help understanding theorems presented later.

###### Lemma 1 (Reversibility For m=2)

A bi-variate IMAN modulo with an injective is reversible if and only if one of the following four equalities holds: , , , .

Proof. See Appendix A.

###### Lemma 2 (Reversibility For m=3)

A bi-variate IMAN modulo with an injective is reversible if and only if one of the following five equalities holds: , , , , .

Proof. See Appendix B.

###### Lemma 3 (Reversibility For m=4)

A bi-variate IMAN modulo with an injective is reversible if and only if either one of the following ten holds: , , (, , ), (, , ), (, , ), (, , ), , , , , where expresses the condition .

Proof. See Appendix C.

These lemmas are now extended to a theorem on a necessary and sufficient condition for bi-variate causal reversibility of an IMAN covering more generic moduli . Before presenting the theorem, we need to introduce the notion of “balanced distribution” of . We say is balanced with respect to dividing , if all the rows in the following matrix are identical for some constants and .

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝p0+g(0)/C0p0+c+g(0)/C0…p0+m−c+g(0)/C0p1+g(1)/C1p1+c+g(1)/C1…p1+m−c+g(1)/C1⋯⋯⋯⋯pc−2+g(c−2)/Cc−2p2c−2+g(c−2)/Cc−2…pm−2+g(c−2)/Cc−2pc−1+g(c−1)/Cc−1p2c−1+g(c−1)/Cc−1…pm−1+g(c−1)/Cc−1⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

We denote the condition by . For example, suppose as in Lemma 3, says the rows should coincide in either

 (p0/(p0+p2)p2/(p0+p2)p1/(p1+p3)p3/(p1+p3)) or (p2/(p0+p2)p0/(p0+p2)p1/(p1+p3)p3/(p1+p3)),

corresponding to the such that either or , which is equivalent to either or , respectively. On the other hand, we define by the smallest such that . For example, for ,

1. ,

2. ,

3. ,

4. for some ,

5. for just one ,

and for as in Lemma 3, if and , then .

###### Theorem 1 (Necessary and sufficient condition for reversibility)

Assume that is a power of a prime number. Let . Then, and are reversible in a bi-variate IMAN modulo if and only if for all or for all and ).

Proof. See Appendix D.
Lemmas 1 and 2 are easily derived using this theorem. Furthermore, applying to this theorem, we obtain the ten conditions in Lemma 3 as follows.

###### Example 1 (m=4)

:

or

:

1.  and (( and ) or ( and ))
2.  and (( and ) or ( and ))

:

or or or

The following two corollaries, which are easily derived from Theorem 1, show simpler necessary and sufficient conditions under some practical assumptions.

###### Corollary 0

Given a prime number , and in bivariate IMAN modulo are reversible if and only if either of the following two conditions are met:

:

or .

:

for some .

Proof. When is a prime, is either 1 or in Theorem 1. If , then does not require any condition, so that Theorem 1 reads either or . If , which is equivalent to for some , no condition is required other than this.

###### Corollary 0

Given a power of some prime number and , and in bivariate IMAN modulo are reversible if and only if either or .

Proof. When , which means by the definition of and Theorem 1, does not require any condition. Thus, Theorem 1 reads either or .

These results ensure that the causal identifiability of a bi-variate IMAN holds except for a finite number of special conditions to occur in practice. If the modulus does not meet the condition in Theorem 1, there are some cases where the reversibility holds even if both and are nonuniform and nonzero.

###### Example 2 (m=6)

, , .

 R = ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝p0q0C0p1q5C0p2q4C0p3q3C0p4q2C0p5q1C0p0q1C0p1q0C0p2q5C0p3q4C0p4q3C0p5q2C0p0q2C0p1q1C0p2q0C0p3q5C0p4q4C0p5q3C0p0q3C0p1q2C0p2q1C0p3q0C0p4q5C0p5q4C0p0q4C0p1q3C0p2q2C0p3q1C0p4q0C0p5q5C0p0q5C0p1q4C0p2q3C0p3q2C0p4q1C0p5q0C0⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠ = p0q0r2+3r+2⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1rrr1r2rr1r21r1r21rrr1rrr1r2rr1r21r1r21rrr⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠

We find that and are reversible by .

In Example 2, reversibility is due to shared parameter , which is consistent with Peters et. al [15] who suggested that reversibility requires additional equality condition among and .

## 4 IMAN Algorithm

We assume that there exist and such that

 ei:=Xi−f({Xj}j≠i)

is independent of . In [14], we speculate that the condition (multivariate causal identifiability) reduces to bivariate causal identifiability that is independent of given for all . We say that such and a minimal subset of on which depends are a sink and a parent set, respectively. In this paper, we show in Proposition 1 that the claim is true as long as the probabilities of are positive. Besides, based on the strong support for bi-variate causal identifiability in Section 3 and Proposition 1, we ignore the reversible cases. From those considerations, we propose an algorithm to find a unique causal structure in an IMAN from a given modular data set .

Figure 1 outlines the proposed algorithm which is an extension of [17] to cover the IMAN modulo . The algorithm uniquely find a sink variable, which is different from [9][15]. The first step calculates a frequency table from . If we have samples of sufficiently large size, the values of relative frequency converge to the true probabilities. Steps 3 and 4 find a sink and its parent set given and , respectively, where is a subset of . Step 5 removes the estimated sink from , and update so that the frequency values can be expressed for the updated set excluding (marginalization). This reduces the size of the model by one in the next cycle. The entire list in the output expresses a DAG structure of the IMAN.

### 4.1 Finding Sink and Parent Set

The proposed method is based on the following observation:

###### Lemma 4
1. If there is no functional relation between , then the converse is also true.

(Proof: see Pearl (1988) for example.)

###### Proposition 1

Suppose have no deterministic relation. Then the following conditions are equivalent:

1. is a sink

2. is independent of given for all

3. is independent of

(Proof: immediate from Lemma 4).

Proposition 1 implies that we can check that is a sink by verifying to be independent of , and that finding the sink node is as likely as bivariate causal identifiability.

In find_sink, the conditional probability is estimated from FT (we write the value by ). Suppose that is a sink node. Then, the probability of is the same for and . If we choose and such that and are maximized, respectively. Then, it is likely that

 x′i−x′′i=f({x′j}j≠i)−f({x′′j}j≠i) (4)

if the sample size is large. Let be the value of (4) Then, the distributions of and given and , respectively, should be the same if the estimation of is correct.

To this end, we apply the data to the G-test Sokal and Rohlf (1994) which distinguishes whether or not for disjoint events from data. The G-test calculates the G-value:

 2∑kcn(Ck∩A)log{cn(Ck∩A)cn(A)/cn(Ck)n} (5) + 2∑kcn(Ck∩B)log{cn(Ck∩B)cn(B)/cn(Ck)n} ,

where is the frequency of the event, and are disjoint events covering the whole events (). The G-test is more correct than the -test that calculates an approximation of (5). In our case, in order to prove Independence, we compare the values for all in .

Once the sink is obtained, we find the parent set assuming that the estimated is correct. We find the parent based on the following observation:

###### Proposition 2

Suppose have no deterministic relation. Then, for any and ,

 Xi⊥⊥Xj|{Xh}h≠i,j,j∈V−π−{i}⟺Xi⊥⊥{Xj}j∈V−π−{i}|{Xh}h∈π

(Proof: immediate from Lemma 4). To this end, for each , we compute for all to test if and are independent via the G-test given . We see that if and only if and are not independent given at least one . We test all the tables of by multiple comparison tests Benjamini and Hochberg (1995). Throughout this paper, find_parent uses the significance level and repeats the procedure for all to enumerate all the values of .

### 4.2 Computational Complexity

The table FT is of size . Hence, the space complexity is . The critical task in find_sink is to compute conditional probability tables for , and this is repeated at most times. The critical task in find_parent is the times computation of conditional probability tables for , and this is repeated times at most. These are further repeated times in the main algorithm shown in Fig.1. Accordingly, the total time complexity is . We require that the data size is near the size of , i.e., . Therefore, when , the complexity is virtually which is comparable or better than previous work. For example, DirectLiNGAM Shimizu et al. (2009, 2011) which is one of the most efficient causal inference algorithm requires .

## 5 Experimental Evaluation

In this section, we evaluate the basic performance of our algorithm by using artificial data. Let be the number of variables , the size of the modulus domain , the number of samples, the probability that is a parent of for each , and with the noise distribution. For each sink with parent set , is a function of . We add noise to to obtain given . By repeating the process for , we obtain one realization . For simplicity, we assume that all the noise share the same distribution. Furthermore, by generating times and randomly changing the indexes of into some , we obtain data set . In our experiments, we estimate from the data set obtained above.

We evaluate performance of IMANN in terms of several measures. Suppose we estimate to obtain from D. Then, we can obtain the adjacency matrices and for and , respectively. If we change the orders of rows and columns so that the matrix becomes lower triangular, the matrix does not become lower triangular unless the estimation is correct. Let be the number of nonzero elements in the upper triangular in . Then, we define by

 Ero=UT(B,^B)d(d−1)/2 ,

which expresses the ratio of the number of inconsistent edges to the number of the whole possible edges. This measure has been used in many studies including LiNGAM Shimizu et al. (2006). We also evaluate in how many elements and coincide, i.e., the ratio of the number of elements matched between and to the number of the whole elements except the diagonal elements, . This measure () expresses accuracy of the estimated causal structure. (The less Ero and the larger ACC, the better performance.) We also evaluate computation time (msec) which expresses its algorithm scalability. We randomly executed 1000 trials and took average among them. For the experiments, we installed MATLAB R2011a on a Windows 7 machine with Xeon W3565 (3.2GHz, 4 core, 8MB cache), 6GB RAM and 500GB HDD.

Based on the result in section 4, we are particularly interested in the effects of and on the estimation accuracy. The parameter is also an important factor for scalability. Table 1 shows the results under which each is set uniformly random, , . In this case, can be mutually close by chance when , is small. According to our Theorem 1 and Corollary 2, such a condition violates the bi-variate causal identifiability of the IMAN. However, the chance is reduced as grows. This is reflected in and when and . Although is not a power of a prime number, its causal identifiability holds similarly to the other numbers. This is consistent with the observation in Example 2 and Peters et al. (2011a).

On the other hand, if is large, the size of frequency table relative to is large, and we might not have enough samples to estimate the conditional probabilities from via the G-test. Because the FT size is , the critical size of should be at least to compute statistically accurate . For example, the errors can be seen to be significantly large when (). On the other hand, from Table 1, our algorithm seems to provide practical accuracy if is larger than . Also, we find that reflects of the algorithm as analyzed in Section 4.2. Table 2 indicates the results when , , , for , , and . In this case, are always far from any reversible conditions shown in our theorem and corollaries. Thus, it is reasonable to think that accuracy is obtained as long as is large enough compared with .

Table 3 and 4 show the performance in terms of and under , , and uniform . The error is reduced as grows. However, again we observe the critical size of is . When is large (the structure is dense), wrong selection of of a sink variable in causal ordering affects the ordering of all the remaining variables. Therefore, increases as density grows whereas does not decrease.

## 6 Real-world Applications

We analyzed human body orientation data of MPI08_Database222The data is accessible at . Pons-Moll et al. (2010). Various indoor motions of a human measured with eight movie cameras and five orientation sensors are stored in the data. Five angle sensors are attached to various parts of the body between knees and ankles, between wrists and hands and between chest and neck, and angles measured with respect to a global inertial coordinate frame at 40Hz which is suitable to describe human body orientation in his/her view. We obtained angles from each snapshot output: is the rotation angle around the rotation axis; is the look up angle between the axis and a horizontal plain; and is the horizontal direction angle of the axis. and take cyclic values in , whereas takes values in .

We applied our IMAN algorithm to of the data sets; “ab_01_01” and “ab_10_01.” The former records a counterclockwise walking motion over 406 time steps. The latter records two cartwheel motions over 580 time steps, and we analyzed the first 200 time steps on the leftward cartwheel motion. We discretized each into 3 equi-width intervals of , , . The critical is comparable with and for both data sets. Figure 3 shows a causal network on ab_01_01 by the IMAN algorithm. This result is well interpreted in that the right leg, its orientation being measured by rknee_rankle senor, takes the initiative, and the left leg and right hand, as measured by lknee_lankle and rwrist_rhand sensors respectively, follow the right leg motion, and are further followed by the neck and left hand, as measured by chest_neck and lwrist_lhand sensors respectively. Figure 3 shows the result on ab_10_01. To initiate the leftward cartwheel, the right leg is used first to push off from the floor. The motion then influences the orientation of his left leg, and subsequently planting of these left hand on the floor hand to support the body in the rotation. In turn, the right hand is planted in similar manner as the body continues to rotate. The neck always follows these motions to maintain body balance.

Because these are time series representing the body motion dynamics, we also applied VAR (Vector Auto-Regressive) (Fig. 5Box et al. (2008) and DirectLiNGAM (Fig. 5Shimizu et al. (2009, 2011) to by assuming shares similar causality with , because no approaches are applicable to the modular . We used the 1st order VAR model selected by the final prediction error (FPE) in Fig. 5. In both figures, the causal networks are drawn by the matrix elements above a certain threshold level. The dotted edges are outputs of VAR and DirectLiNGAM that is not in the IMAN output; the dashed edges are the output of IMAN but not in the others. Though VAR indicates cycles, 4 out of 6 edges of IMAN are supported. DirectLiNGAM’s causal order is consistent except lknee_lankle. Though these are not from , they are quite consistent with IMAN. We analyzed some other data obtained from kicking motions, and produced a similar consistency.

## 7 Discussion and Conclusion

In this paper, we presented necessary and sufficient conditions for bivariate causal identifiability in IMAN, and actually affirm that causality can be identified except in rare cases. Our result locates exactly when the reversible cases occur. In addition, we relate bivariate and multivariate causal identifiability in more precise manner for IMANN (Propositions 1 and 2).

As a result, we developed a practical way to find a sink and its parent set. The algorithm needs a sufficient number of samples compared with to verify independence when each of takes a value among values. The computational complexity is , and it is reasonable to say that the value should be at most 10 for small , which is according to our experiments in this paper (the value can be small in practical situations by reducing the quantization level).

If the sample size is small, we need to improve estimation of FT. One possibility is to construct a Bayesian measure to deal with each sample set, and we expect to obtain more robust results even for small . Then, we can avoid checking independence via the G-test.

The most important direction for future study is to seek a causal model and its causal identifiability conditions on continuous and cyclic data, such as that used in section 6. If we address these issues, our approach can be extended so that we do not need to discretize these data to integers. Recent studies in directional statistics provided some analyses on distributions of circular/directional variables Fisher (1995); Mardia and Jupp (2009).

## Appendix A Proof of Lemma 1

There are four functions , where only and are injective. Let be such that

 P(X|Y):=(P(X=0|Y=0)P(X=1|Y=0)P(X=0|Y=1)P(X=1|Y=1)). (6)

If and are reversible, there exists such that , , which is equivalent to . Accordingly,

 P(h):=(P(X−g(0)=0|Y=0)P(X−</