Validity of time reversal for testing Granger causality

# Validity of time reversal for testing Granger causality

Irene Winkler, Danny Panknin, Daniel Bartz, Klaus-Robert Müller and Stefan Haufe This work was supported by a Marie Curie International Outgoing Fellowship (grant No. 625991) within the 7th European Community Framework Program, the BMBF project ALICE II, Autonomous Learning in Complex Environments (01IB15001B), and the Brain Korea 21 Plus Program as well as the SGER Grant 2014055911 through the National Research Foundation of Korea funded by the Ministry of Education. I. Winkler, D. Panknin, D. Bartz, K.-R. Müller and S. Haufe are with the Machine Learning Group, Technische Universität Berlin, Germany. K.-R. Müller is also with the Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea. S. Haufe is also with the Laboratory for Intelligent Imaging and Neural Computing, Columbia University, New York, USA. Correspondence to: {i.winkler,stefan.haufe}@tu-berlin.de.
###### Abstract

Inferring causal interactions from observed data is a challenging problem, especially in the presence of measurement noise. To alleviate the problem of spurious causality, Haufe et al. (2013) proposed to contrast measures of information flow obtained on the original data against the same measures obtained on time-reversed data. They show that this procedure, time-reversed Granger causality (TRGC), robustly rejects causal interpretations on mixtures of independent signals. While promising results have been achieved in simulations, it was so far unknown whether time reversal leads to valid measures of information flow in the presence of true interaction. Here we prove that, for linear finite-order autoregressive processes with unidirectional information flow between two variables, the application of time reversal for testing Granger causality indeed leads to correct estimates of information flow and its directionality. Using simulations, we further show that TRGC is able to infer correct directionality with similar statistical power as the net Granger causality between two variables, while being much more robust to the presence of measurement noise.

Granger causality, time reversal, noise, TRGC

## I Introduction

The estimation of causal relations between time series is a signal processing topic promising to enhance our understanding of dynamical systems in numerous application domains. For data with time structure, the concept of Granger causality (GC) has gained popularity as a simple testable definition of causality based on temporal precedence. Signal processing techniques based on Granger-causality have been studied in a variety of fields such as econometrics [1], neuroscience [2, 3, 4, 5], and climate science [6, 7].

In its original formulation, a time series is said to Granger-cause a time series , if the past of helps to predict above what can be predicted by using ‘all other information in the universe’ besides the past of [8]. In the bivariate framework, it is common to consider only the information contained in the past of and (cf. [9]).

A serious problem for the estimation of information flow using Granger causality is that spurious Granger causality can occur due to measurement noise. On one hand, if two sensors measuring the same signal are superimposed with noise, they mutually help predicting each other’s future [10, 11]. This is a problem especially in the study of brain connectivity using non-invasive electrophysiology, where the activity at a given sensor is typically a mixture of contributions from several neuronal sources due to the volume conduction of electric currents in the head [12, 13, 14, 15, 16]. On the other hand, noise that is correlated across sensors has a similar adverse effect on estimates of directed interaction even if the actual signals-of-interest are not mixed into different sensors [17, 18]. Such spurious causality can occur in any measure based on the concept of Granger causality, including multivariate [19, 20] and non-linear [21, 22, 23] variants.

Recently, a number of ways to make causality estimates more robust to the presence of mixed signals and noise have been proposed. These include novel measures of directed information flow [12, 10, 11] as well as novel ways of assessing their statistical significance [24, 23, 17, 16, 18]. Recently, Haufe et. al [17, 16] suggested to contrast causality scores obtained on the original time series to those obtained on time-reversed signals. The intuitive idea behind this approach is that, if temporal order is crucial to tell a driver from a recipient, directed information flow should be reduced (if not reversed) if the temporal order is reversed. In fact, Haufe et al. showed that for correlated, but non-interacting signals, the use of time reversal for testing Granger causality scores (here referred to as time-reversed Granger causality, TRGC) and other metrics based on cross-spectral estimates or linear autoregressive modeling correctly leads to rejection of causal interpretations. This was confirmed for Granger causality in an independent simulation study [18] showing that TRGC leads to a much smaller fraction of false positive detections compared to the original Granger causality index, and also compares favorably against the Phase Slope Index (PSI) [11].

While time-reversed Granger causality thus displays an intriguing noise robustness property, and yields very encouraging results in simulations, its behavior in the presence of causal interactions is still poorly understood. In particular, it is currently unclear how Granger causality scores computed on time-reversed signals link to the causal interactions on the original time-series, and therefore whether TRGC correctly indicates the direction of causality. Theoretical guarantees have only been derived for special cases in which either the signal’s auto- and cross-covariances are very small in magnitude, or in which both signals have very similar autocorrelations [18].

The aim of this paper is two-fold. In the theory section, we provide new theoretical insights on time-reversal for testing Granger causality between two variables. After introducing the concepts of linear autoregressive modeling, Granger causality, and time-reversed Granger causality (Section II-A and II-B), we elaborate on the existing result of Haufe et al. [16] showing that, for mixtures of independent signals, causality measures based on cross-covariances are invariant to the reversal of the temporal order (Section II-C). This is the theoretical basis for the noise-robustness property of time-reversal testing of causality scores. We then investigate the time-reversal of a process fulfilling the assumptions typically made by Granger causality estimators: a finite-order vector autoregressive (VAR) process that is unaffected by measurement noise. We review what is known about the time-reversal of a VAR process (Section II-D), based on what we provide an analytic description of Granger causality scores of time-reversed signals in terms of their autoregressive coefficients (Section II-E) and a minimal example (Section II-F). Using these insights, we prove our main result stating that, in the case of unambiguous unidirectional information flow from to , time reversal leads to a decrease of the Granger-causal net information flow relative to the original time series. The difference of net Granger causality scores obtained on original and time-reversed data thus indicates the correct direction of interaction (Section II-G).

In the second part of the paper (Section III), we revisit scenarios known to cause problems for conventional Granger causality. Using simulations, we illustrate when and how the theoretical guarantees of TRGC lead to measurable performance increases in practice. We point out the implications of our theoretical and empirical results in Section IV, along with a discussion of ambiguities in causal interpretation caused by the presence of correlated residuals in VAR models.

## Ii Theory

Vectors are considered to be column vectors (unless otherwise stated), and are generally typed in bold. The symbol denotes the transpose operator, the identity matrix, and concatenation. The symbol refers to the Kronecker product, and to the vectorization operator, which converts a matrix into a column vector. The symbol denotes expectation. The cross-covariance matrices of a stationary process are denoted by

 Cz(h):=⟨(zt−⟨z⟩)(zt−h−⟨z⟩)⊤⟩∀h∈Z.

We use the notation both for an observed time series and its underlying data generating process. We denote all quantities related to the time-reversed process with a tilde.

A process is said to be white noise if it is stationary with mean zero, finite covariance and zero autocorrelation; that is, if . Note that the covariance matrix is not necessarily diagonal, and that neither independence nor joint Gaussianity is required.

### Ii-a Granger causality and the linear VAR model

Consider a stable bivariate vector autoregressive process of lag order (VAR() process), ,

 (1)

where is a -dimensional white noise process (that is, , for , and for ) with residual covariance matrix

 Σ=⟨ϵtϵ⊤t⟩=[ΣxxΣxyΣxyΣyy]. (2)

The noise variables are also called innovations or residuals. Stability requires that for all with .

Following [25], and possess themselves autoregressive (AR) representations, which we denote by

 xt =∞∑k=1akxt−k+ξxt,Var(ξxt)=:Σxand (3) yt =∞∑k=1bkyt−k+ξyt,Var(ξyt)=:Σy. (4)

The residuals and of these two univariate processes are each serially uncorrelated, but may be correlated with each other at various lags. Importantly, even though the bivariate autoregressive process (1) is of finite order, the univariate processes (3) and (4) are in general of infinite order. We refer to (1) as the unrestricted or full model, while (3) and (4) contain the restricted models.

Directed Granger-causal information flow is defined based on the so-called Granger-scores [25]

 Fy→x:=log(ΣxΣxx)andFx→y:=log(ΣyΣyy). (5)

Granger causality from to implies that information from the past of improve the prediction of the present of compared to what can be predicted by the past of alone. That is, the residual variance of the unrestricted model is required to be smaller than the residual variance of the restricted model. Under the assumption of Gaussian-distributed residuals, and are asymptotically distributed, giving rise to an analytical test of their significance [25]. An asymptotically equivalent test is given by an F-test of the goodness-of-fit of the two models (cf. [9, 4]). We refer to this approach as standard Granger causality (standard GC).

As variables in physical systems often mutually influence each other, it is also of interest to determine the net driver of the interaction by assessing whether more information is flowing from to then from to or vice versa. Following [11, 26], net Granger causality (Net-GC) is defined as the difference of the Granger causality scores, that is

 F(net)x→y:=Fx→y−Fy→xandF(net)y→x:=−F(net)x→y. (6)

As the analytical distributions of these differences are unknown, statistical significance of Net-GC scores needs to be assessed using resampling methods as outlined in Section III-A.

### Ii-B Time-reversed Granger causality (TRGC)

To avoid false detections of causal interactions, Haufe et al. proposed to contrast causality measures applied to the original time series with the same measures obtained from time-reversed signals [17, 16]. Here, we formalize this idea in the context of Granger causality.

Given a bivariate VAR() process, its time-reversed process also possesses a VAR() representation, which we derive in Section II-D. We denote the residual covariance matrix of the time-reversed process by

 ~Σ=[~Σxx~Σxy~Σxy~Σyy]. (7)

The restricted AR models of the time-reversed data have a simple structure, as they are concerned with univariate time series. The autocovariance function of a univariate time series is symmetric, i.e., we have and for all . As a result of this and (19) (Section II-C), the time-reversed signals will have the same autocovariances as the original series. Because the AR representation is uniquely determined by the autocovariance function (cf. Section II-D1), they also share the same AR representation. The restricted models of the time-reversed univariate processes are thus given by

 xt =∞∑k=1akxt+k+~ξxt,Var(~ξxt)=:~Σxand (8) yt =∞∑k=1bkyt+k+~ξyt,Var(~ξyt)=:~Σy (9)

with

 ~Σx=Σx and ~Σy=Σy. (10)

In analogy to the original time series, we define the time-reversed Granger scores as

 ~F~y→~x:=log(~Σx~Σxx)and~F~x→~y:=log(~Σy~Σyy), (11)

and the net Granger causality scores as

 ~F(net)~x→~y :=~F~x→~y−~F~y→~xand~F(net)~y→~x:=−~F(net)~x→~y. (12)

Finally, the differences of the Granger scores obtained on original and time-reversed signals are given by

 ~Dy→x :=Fy→x−~F~y→~x, (13) ~Dx→y :=Fx→y−~F~x→~y,and (14) ~D(net)x→y :=F(net)x→y−~F(net)~x→~y. (15)

Time-reversed Granger causality can be applied in the following variants.

##### Conjunction-based time-reversed Granger causality (Conj-TRGC)

Here, net information flow from to is inferred if

 F(net)x→y>0and~F(net)~x→~y<0, (16)

that is, if the directionality of net Granger causality reverses for time-reversed signals. This variant has been investigated in [18].

##### Difference-based time-reversed Granger causality (Diff-TRGC)

Here, net information flow from to is inferred if

 ~D(net)x→y>0, (17)

that is, we require that net Granger causality from to is reduced on the time-reversed signals. Note that this is a weaker requirement than conjunction-based TRGC, as all signals for which (16) holds also fulfill (17).

##### Conjunction of Net-GC and Diff-TRGC

Finally, we can require both the time-reversed net difference and the net Granger score to be significantly larger than zero in order to infer net information flow from to , that is

 ~D(net)x→y>0andF(net)x→y<0. (18)

Just as for Net-GC, statistical significance of Conj-TRGC and Diff-TRGC, as well as the combination of Net-GC and Diff-TRGC can be assessed using resampling techniques (see Section III-A).

### Ii-C Robustness of time-reversed Granger causality (TRGC)

In [16] it is pointed out that time-reversed Granger causality robustly rejects causal interpretations for mixtures of non-interacting signals such as correlated noise sources. The mathematical basis for this noise robustness property is the fact that the cross-covariance matrices of the time-reversed signals are equal to the transposed cross-covariance matrices of the original signals, that is

 ~C~z(h)=⟨~zt~z⊤t−h⟩=⟨ztz⊤t+h⟩=Cz(−h)=(Cz(h))⊤ (19)

for all . If a series only contains a mixture of independent signals, all its cross-covariance matrices are symmetric [27]: consider where contains a number of independent sources. Then, for all , and thus is symmetric. For mixtures of independent noise sources, any causality measure that is solely based on a series’ cross-covariance matrices therefore yields the same result on the original and the time-reversed signals. This includes Granger causality, but also other popular variants such as directed transfer function (DTF) [19] and partial directed coherence (PDC) [20]. Given sufficient amounts of data, the conditions for Conj-TRGC and Diff-TRGC cannot be fulfilled for mixtures of independent sources using these measures, preventing the detection of spurious interaction.

### Ii-D The VAR representation of a time-reversed process

There is so far no theoretical argument guaranteeing that time-reversed Granger causality correctly indicates the presence of information flow as well as its direction in the presence of actual interaction. In order to provide such a guarantee, we here study the time-reversal of (linear) finite-order VAR processes. Note that studying this case is sufficient since, as a results of Wold’s decomposition theorem, every stationary, purely nondeterministic, process can be approximated well by a finite order VAR process [28, 1].

We start by briefly revisiting the link between cross-covariance matrices and VAR representation, which we use throughout the paper, in Section II-D1. In Sections II-D2 and II-D3, we then review the theoretical result of Andel [29] stating that the time-reversed signal of any VAR() process has again a VAR() representation that can be expressed analytically in terms of the original process. As the description for is mathematically involved, we only treat the case in the main paper, while the proof for arbitrary  is presented in Appendix A.

We use these results to provide an analytic description of difference-based TRGC scores in terms of their autoregressive coefficients (Section II-E), give a minimal example (Section II-F), and prove our main result stating that, in the case of unambiguous unidirectional information flow, difference-based time-reversed Granger causality indeed yields the correct result (Section II-G).

#### Ii-D1 The cross-covariance function of a VAR process

Most of the insights in this paper are based on the direct link between autoregressive coefficient matrices and residual covariance matrices on one hand, and cross-covariance matrix on the other hand. This link is established by the Yule-Walker equations as follows (see e.g. [1]). For a VAR(1) process

 zt=A1zt−1+ϵt, (20)

 Cz(0) =A1⋅Cz(0)⋅A⊤1+Σand (21) Cz(h) =A1⋅Cz(h−1)(∀h∈N∖{0}). (22)

Given and , the cross-covariances are uniquely determined from (21) through

 vec(Cz(0))=(I−A1\raisebox1.0pt$⊗$A1)−1vecΣ, (23)

while higher-order cross-covariances can be recursively computed using (22). Conversely, and are uniquely determined by the cross-covariances through

 A1 =Cz(1)Cz(0)−1and (24) Σ =Cz(0)−A1Cz(0)A⊤1. (25)

Results on VAR() processes can typically be extended to higher-order VAR() processes by reducing VAR() processes to their VAR() form. The VAR() representation of a VAR() process as well as the Yule-Walker equations for general VAR() processes are provided in Appendix A-A.

#### Ii-D2 The VAR representation of a time-reversed VAR(1) process

The time-reversed autoregressive representation of a VAR() process has been derived by Bartlett in 1955 [30]. Suppose we generate an infinite sequence of according to the VAR(1) process (20). The VAR representation of the time-reversed or backward process is given by

 zt=~A1zt+1+~ϵt, (26)

where

 ~A1=Cz(0)⋅A⊤1⋅Cz(0)−1, (27)

and where the reversed residuals are calculated from as

 ~ϵt:=zt−~A1zt+1 (28)

with residual covariance matrix

 ~Σ=⟨~ϵt~ϵ⊤t⟩=Cz(0)−Cz(0)⋅A⊤1⋅Cz(0)−1⋅A1⋅Cz(0). (29)

It is easy to show that the sequence is indeed white noise, that is for all : and for all : .

From (27), we see that the time-reversed coefficient matrix is similar to , and thus shares some of its properties, notably its eigenvalues, determinant, trace and rank. However, in the context of Granger causality, it is important to note that many properties of do not transfer to . In particular, if is triangular, diagonal, or symmetric, this is not generally the case for .

#### Ii-D3 The VAR representation of a time-reversed VAR(p) process

The result of Bartlett on the time-reversed VAR() process has been generalized to VAR() processes by Andel in 1972 [29], in a paper that received, so far, little attention. Andel showed that any stable VAR() process (1) has a time-reversed representation

 zt=~A1zt+1+~A2zt+2+…+~Apzt+p+~ϵt (30)

that is again of order with uniquely defined autoregressive coefficients and residual covariance matrix . We reproduce this result in Appendix A-B. Note that, while we only treat bivariate VAR processes in this paper, the analytic description of the time-reversed VAR process holds for processes of arbitrary dimensionality.

### Ii-E Analytic description of Diff-TRGC

Contrasting Granger scores obtained on original with those obtained on time-reversed signals is simplified by the fact that the AR representation of a univariate time series does not depend on the direction of time. It follows immediately from (10), that the differences of the Granger scores related to original and time-reversed data do not depend on the restricted models:

 ~Dy→x=Fy→x−~F~y→~x=log~Σxx−logΣxx~Dx→y=Fx→y−~F~x→~y=log~Σyy−logΣyy~D(net)x→y=F(net)x→y−~F(net)~x→~y=(Fx→y−Fy→x)−(~F~x→~y−~F~y→~x)=log~Σyy−log~Σxx−logΣyy+logΣxx. (31)

The Granger score differences , and thus only depend on the residual covariance matrices of the full models of the original and time-reversed data. For the VAR(1) process, these are given in (25) and (29). For VAR() processes, the residual covariance matrices can be obtained through (56) and (59) as described in Appendix A-A and A-B.

Please note that while (31) implies that the unrestricted models can be neglected when computing Granger scores differences, we might gain from including them in finite sample settings. We investigate this issue through simulations in Section III.

### Ii-F A minimal example

It is not intuitive to see how the residual variance of the time-reversed process, and thus Granger causality, depends on the autoregressive coefficients of the model. Interpretation is made difficult by the occurrence of in (29).

Let us therefore consider the following minimal case: a VAR(1) process with . In that case, and for all (from (22)). All asymmetries in the cross-covariance matrices are thus due to asymmetries in .

Furthermore, time-reversing the signal leads to transposition of the autoregressive coefficient matrix as a result of (27). The residual covariance matrices (25) and (29) are now given by

 Σ=I−A1⋅A⊤1and~Σ=I−A⊤1⋅A1.

Denote with the autoregressive coefficients. We then have

 Σxx=1−a211−a212,~Σxx=1−a211−a221,

and

 ~Dy→x=log~Σxx−logΣxx>0 ⇔Σxx<~Σxx ⇔a212>a221.

The difference of the Granger scores computed on the original and time-reversed time series thus indicates the correct net direction of information flow. We will in general not be able to infer whether  has a Granger-causal influence on . However, we will be able to tell whether  Granger-causes  more than  Granger-causes , or vice versa.

While this simple case will almost never occur in practice, we give theoretical guarantees for more general cases in the next section.

### Ii-G Validity of TRGC for unidirectional information flow

We now prove our main result, the validity of difference-based time-reversed Granger causality in the presence of unidirectional information flow. Consider a bivariate VAR() process with unambiguous unidirectional information flow. This is the case when all coefficient matrices are triangular and the residual covariance matrix is diagonal. Then the following theorem holds.

###### Theorem 1.

Let be a stable bivariate VAR() process (1) with the time-reversed representation (30). Under the assumptions

1. are lower triangular matrices (i.e., may Granger-cause , but does not Granger-cause ), and

2. is a diagonal matrix, i.e. (the residuals are uncorrelated), and

3. is invertible  ,

it holds that

 ~Σxx ≤Σxx, (32) and that ~Σyy ≥Σyy. (33)
###### Corollary 1.

Under assumptions (A1)-(A3), Theorem 1 and (31) immediately imply the following inequalities for the differences of Granger scores:

 ~Dy→x=Fy→x−~F~y→~x ≤0 (34) ~Dx→y=Fx→y−~F~x→~y ≥0 (35) ~D(net)x→y=(Fx→y−Fy→x)−(~F~x→~y−~F~y→~x) ≥0. (36)

As a result of Corollary 1, net Granger-causal information flow from to is reduced or remains the same when the signal is time-reversed. Thus, in the case of unambiguous unidirectional information flow, difference-based time-reversed Granger causality yields the correct result. Note that it is not true in general that the net flow between the time-reversed signals and , , is negative (reverses compared to the original series). That is, conjunction-based TRGC might in some cases incorrectly reject the presence of true causal interaction.

Corollary 1 states that each of the three difference scores, , , and alone is sufficient to infer the correct directionality under assumptions (A1)–(A3). As (A1) requires information flow to be unidirectional, the individual scores and only indicate net information flow, which is what is also observed in Section II-F.

The three scores will behave differently if the assumption of uncorrelated residuals (A2) is violated. Then, and still hold, but the inequalities , and do not. On average, the net difference (which equals ) is less affected by the presence of correlations in the residuals than any of the individual scores, which is why we defined difference-based TRGC based on in (17). Nevertheless, all three scores are valid measures for net information flow, as residuals should be uncorrelated if the VAR model accurately describes a physical process. The significance of uncorrelated as opposed to correlated residuals is discussed in Section IV-A.

Sketch of the proof. The first inequality (32) is relatively easy to prove. The intuition is the following: Since does not Granger-cause , the prediction of is only based on past . In contrast, the coefficient matrices , …, of the time-reversed representation are in general not triangular. This means that prediction of the time-reversed signals is not only based on past , but can also use information from past . We would thus expect that can be better predicted than , and that the corresponding residuals are smaller.

The proof of the second inequality (33) is more involved. The intuition is the following: we would expect that the ‘amount’ of unexplainable variance is the same for both the original and the time-reversed process. Thus, since the residual variance of decreases, the residual variance of should increase. Mathematically, we prove that

 det(Σ)=det(~Σ). (37)

The proof of (37) is the only part that requires the analytic description of , and is the main difficulty of the overall proof. It is not straightforward, because depends on the inverse of the covariance matrix , while we only have an analytic description of . From (37), it is easy to infer , which completes the proof. It is only in this final step that we need assumption (A2) that is diagonal.

###### Proof:

As are lower triangular matrices (assumption (A1)), is an autoregressive process of order ,

 xt=a1xt−1+…+apxt−p+ξxt with Var(ξxt)=Σxx (38)

Its time-reversed representation (cf. Section II-B) is

 xt=a1xt+1+…+apxt+p+~ξxt, with Var(~ξxt)=Σxx (39)

Because the unrestricted (or full) model (30) extends the restricted model by including , (32) follows:

 ~Σxx≤Var(~ξxt)=Σxx. (40)

###### Proof:

As mentioned in the proof sketch, we need to derive (37), the equality of the determinants and . To improve readability, we here treat only the case , and derive (37) for general in Appendix  A-C.

The proof relies on Sylvester’s determinant theorem [31], which states that for any matrices , :

 det(I+KL)=det(I+LK). (41)

We then have:

 detΣ det(Cz(0)−A1⋅Cz(0)⋅A⊤1) = det(Cz(0))⋅det(I−A1⋅Cz(0)⋅A⊤1⋅Cz(0)−1) det(Cz(0))⋅det(I−Cz(0)⋅A⊤1⋅Cz(0)−1A1) = det(Cz(0)−Cz(0)⋅A⊤1⋅Cz(0)−1A1⋅Cz(0)) det~Σ.

From the result of Part 1 (32), the equality of residual covariance determinants (37) (derived for general in Appendix A-C), and assumption (A2) of uncorrelated residuals in , we then obtain:

 ΣxxΣyy(A2)=detΣdet~Σ = ~Σxx~Σyy−~Σxy~Σxy ≤ ~Σxx~Σyy Σxx~Σyy ⇔Σyy ≤ ~Σyy.

Strict inequality. Let us further note that inequality (36) for difference-based TRGC is strict in the presence of causal interaction. The following theorem holds.

###### Theorem 2.

Under assumptions (A1)-(A3), it holds that

 ~D(net)x→y=0⇔A1,…,Ap % are diagonal. (42)

The proof is provided in Appendix A-D. Combined with Corollary 1, Theorem 2 immediately implies that in the presence of unidirectional information flow from to . That is, net Granger-causal information flow from to is truly reduced and cannot remain the same when the signal is time-reversed.

## Iii Experiments

In this section, we provide an empirical investigation of model violations and other factors influencing the performance of Granger causal measures using numerical simulations. After describing the tested methods and performance measures (Section III-A), we compare several variants of TRGC in either the presence or absence of noise (Section III-B). We then investigate the influence of common drivers, various types of noise (Section III-C and III-D) and downsampling (Section III-E) on standard Granger causality and Diff-TRGC.

### Iii-a Experimental setup

We consider bivariate time series in the presence of unidirectional information flow () as well as in the absence of causal interaction. Unless otherwise stated, time series of length are generated from stationary VAR(5) processes, whose autoregressive coefficients are drawn from a normal distribution with mean  and standard deviation . The absence of causal interaction is modeled by setting respective AR coefficients to zero. Residuals are generated from a normal distribution with diagonal covariance matrix, whose entries are drawn from the standard uniform distribution.

We compare standard GC as well as Net-GC to Diff-TRGC (see (17)). In Section III-B, we also include Conj-TRGC (see (16)), the conjunction of Net-GC and Diff-TRGC (see (18)), and a variation of Diff-TRGC, in which is computed using only the full bivariate models according to (31). This variant is denoted by Diff-TRGC (full).

All statistical tests are performed at significance level . For standard GC, we perform two separate F-tests, one to assess whether Granger-causes , and one to assess whether Granger-causes . It is possible that both variables are estimated to Granger-cause each other. In contrast, all other metrics indicate net directionality. We assess their statistical significance by bootstrapping residuals from the regression model: We regress on its past and future values , and retain the fitted values and residuals . In each bootstrap repetition, causality metrics are computed on synthetic variables , where is selected randomly for each . Percentile confidence intervals are then constructed from the bootstrap sampling distribution. Significance is determined by evaluating if the confidence interval does not contain 0. We use 500 bootstrap samples and select the number of lags as the optimizer of Schwarz’s Bayesian Information Criterion (BIC) [32].

All experiments are repeated 300 times. In each run, a true positive (TP) is defined as a significant detection of the true direction of interaction. The true positive rate (TPR) is the fraction of true positives among all runs. It is here also referred to as the sensitivity or power. A false positive (FP) is defined as a significant detection of the wrong direction of interaction, or a significant detection of causal interaction in the absence of any causal interaction. The false positive rate (FPR) is the fraction of false positives among all tested runs.

### Iii-B Comparison of TRGC variants under interaction

We assess Granger causality and time-reversed Granger causality in the presence of unidirectional interaction considering differing sample sizes, standard deviations of the AR parameters, noise types and signal-to-noise ratios (SNR).

In a first experiment, we consider the noiseless case, and vary the sample size from 400 to 4000 for a fixed standard deviation of the AR coefficients. In a second experiment, we vary the standard deviation at a constant sample size of . This experiment thus tests the impact of the strength of the causal connections relative to the innovation noise. The standard deviations tested are 0.05, 0.1, 0.2, …, and 0.6. Finally, for a fixed standard deviation , and a fixed sample size , we add linearly mixed, autocorrelated measurement noise to each system according to

 [xtyt]=(1−γ)[x(l)ty(l)t]+γ⋅ηt, (43)

where the subscript denotes the underlying latent variables and defines the signal-to-noise ratio (SNR). Noise is generated by multiplying two independent AR(5) time-series with a random matrix , with . We consider the signal-to-noise ratios 0, 0.25, 0.5, 0.75, 0.9 and 1.

The TP and FP rates attained in the three experiments are depicted in Figure 1. From Figure 1(a), we see that Diff-TRGC (full), which computes the difference score only using the full model according to (31), seems to be suboptimal for finite samples. While we have demonstrated the equivalence of (31) to the original definition (15) for infinite samples in Section II-E, this equivalence does not hold for the finite samples studied here. Estimating residuals from the restricted models increased the power of the test for all investigated parameter settings.

Conj-TRGC has lower power relative to Diff-TRGC. This is particularly so for high , which corresponds to a dominance of the dynamical and causal aspects of the model comprised in the AR coefficients relative to the innovation noise. This result is not unexpected, as time-reversing the signals does not necessarily reverse the direction of information flow. Note that, on the other hand, Conj-TRGC is the more conservative measure compared to Diff-TRGC and could be expected to produce fewer spurious results in the presence of noise. However, as we see in Figure 1(b), both variants yield almost no spurious results in the presence of measurement noise. We will therefore use Diff-TRGC in the remaining experiments.

### Iii-C Impact of latent variables and measurement noise in the absence of causal interaction

Already Granger pointed out that standard Granger causality can lead to spurious results if not all relevant variables are incorporated in the model [8]. In a bivariate system, GC cannot determine whether the observed variables and are both driven by a third common cause. This argument extends to multivariate systems, if a relevant confounding variable is not part of the measurement. Furthermore, standard GC is susceptible to measurement noise [33, 10, 11, 26, 34, 18] and to instantaneous linear mixing of activity, which is a major problem for example in the analysis of electroencephalographic (EEG) recordings [13, 14, 16]. We demonstrate these effects here in additional simulations, in all of which no actual interaction occurs. We consider three different scenarios.

(A) Linear mixing. The observed time series and are a linear mixture of two independent signals , , that is

 [xtyt]=M[x(l)ty(l)t], (44)

where denotes the mixing matrix. and  were generated as two independent univariate AR(5) processes.

(B) Common hidden cause. The observed time series and  are driven by a common unobserved cause . Time series , and are generated from a three-dimensional VAR(5) model with , in which Granger-causes  and , with no causal interaction between and as modeled by the respective AR coefficients being set to zero.

(C) Additive noise. The observed time series and are a superposition of two independent univariate AR(5) processes , and additive noise as in (43), with adjusting the SNR. We consider three different types of noise. Independent white noise is generated from a normal distribution with diagonal covariance matrix, whose entries are drawn from the standard uniform distribution. Mixed white noise is created by multiplying independent noise with a random matrix with . Mixed autocorrelated noise is created by multiplying two independent AR(5) time-series with .

Figure 2 illustrates the behavior of standard Granger causality, Net-GC and Diff-TRGC in the various simulation settings. Values on the y-axis indicate the FP rate at significance level . As all experiments are characterized by the absence of any interaction between and , any significant detection of information flow either from to or to is counted as a false positive.

It is apparent from Figure 2 that standard GC and Net-GC lead to spurious detection of causality in all tested scenarios. Their behavior in the presence of noise (panel C) depends on the properties of that noise. Mixed noise (left and center plots of panel C) is generally very problematic, especially if it is also autocorrelated (left part). As and are already independent, adding independent noise (obviously) does not pose a problem here (right part of panel C).

In contrast to standard GC and Net-GC, time-reversed Granger causality implemented through Diff-TRGC is insensitive to mixtures of independent sources regardless of their spatial and temporal correlation structure (see panels A and C). This behavior thus reflects its known theoretical properties discussed in Section II-C. The presence of a hidden common confounder, however, cannot be ruled out by using time-reversed Granger causality (panel B).

### Iii-D Impact of noise in the presence of causal interaction

We further study the behavior of standard GC, Net-GC and Diff-TRGC in the presence of unidirectional causal interactions superimposed with noise. Four different scenarios are considered. In all cases, data are generated according to (43) with Granger-causing . In the first three scenarios, (A-C), interacting signals from bivariate VAR(5) models are superimposed with noise. As in Section III-C, we use mixed autocorrelated noise (scenario A), mixed white noise (B), and independent white noise (C). The same signal to noise ratios as in Section III-C are used.

In the fourth scenario, (D), we simulate the following VAR(1) process with long memory:

 [x(l)ty(l)t] =[0.95010.5]⎡⎣x(l)t−1y(l)t−1⎤⎦+ϵt ϵt∼N(0,I) (45) xt =(1−γ)⋅x(l)t+γ⋅ηt ηt∼N(0,1) yt =y(l)t,

adopted from [26], where denotes the normal distribution.

True positive and false positive rates as estimated from 300 simulation runs are reported in Figure 3 (A-D). Just as in the absence of causality (cf. Section III-C), we observe that linearly mixed, autocorrelated noise leads to the highest numbers of false detections for standard GC, while independent white noise leads to lowest FP rates. Diff-TRGC is characterized by negligible amounts of false positives in all cases at the cost of slightly decreased sensitivity as compared to standard GC in scenarios (A-C). Interestingly, Net-GC behaves very similar to Diff-TRGC in the presence of non-autocorrelated noise both in terms of sensitivity and specificity (B-C). In these settings, spurious causality could already be almost entirely eliminated by testing for net Granger causality. This result, however, does not imply that Net-GC cannot be affected by non-autocorrelated noise in general. A counterexample is the system with long memory studied in scenario (D). Here, Net-GC (as well as standard GC) fails, because contains delayed but cleaner information about than itself and thus may help to predict future . Diff-TRGC, however, robustly identifies as the driver.

Our examples show time-reversed Granger causality almost completely eliminates spurious causalities arising from any kind of additive noise. At the same time, it exhibits similar statistical power as net Granger causality. We also observe that net Granger causality is typically more robust with respect to additive noise than standard Granger causality.

### Iii-E Impact of downsampling and temporal aggregation

Spurious Granger causality has also been reported to arise due to downsampling and temporal aggregation [35, 36, 37], posing serious problems, for example, in functional magnetic resonance imaging (fMRI) [38, 39].

We generate data using a VAR(5) model with random coefficients with , in which Granger-causes . These data are decimated at different factors in two ways. In the downsampling scenario (E), causal measures are applied to time series of length constructed from the original time series by skipping time points in between sampled data points. In the temporal aggregation scenario (F), time series of length are constructed from the original time series by averaging over data points. No noise was added.

Figure 3 (E-F) depicts TP and FP rates attained in the two scenarios as a function of . We see that Net-GC and Diff-TRGC are more robust then standard GC. Both Net-GC and Diff-TRGC did not result in spurious causality.

## Iv Discussion

We established the theoretical guarantee that difference-based time-reversed Granger causality (Diff-TRGC) indicates the correct direction of causality in bivariate autoregressive processes characterized by unambiguous unidirectional information flow. Our results complement previous work by [16, 17] showing that TRGC in general correctly rejects causal interpretations for mixtures of non-interacting sources (thus, in the absence of any causality). While further compelling intuitive ideas for robust causality measures have been presented [11, 23, 18], our result provides, to the best of our knowledge, the first proof of the correctness of one of such techniques (Diff-TRGC) for a relatively general class of time-series models.

Our theory is accompanied by simulations, in which we confirmed that time-reversed Granger causality robustly detects the presence of true causal interactions in various realistic scenarios including mixed noise and downsampling. We showed that Diff-TRGC is able to infer correct directionality with similar power as net Granger causality, while at the same time producing fewer (in most cases, negligible amounts of) false alarms than Net-GC and standard GC. We therefore suggest to use Diff-TRGC whenever the data under study are likely to be corrupted by noise.

### Iv-a Correlated residuals

To define an unambiguous uni-directional information flow, our theory assumes uncorrelated residuals, as is common in the literature. Correlated residuals indicate instantaneous effects that the variables exert on each other. While we would not expect correlated residuals if the VAR model accurately describes the data generating process, such effects are likely to occur in practice (e.g., if the sampling rate of the acquired data falls below the time scale of the causal interactions). They pose severe problems for causal estimation, because they can be explained by several possible data generating models, the coefficients of which cannot be uniquely identified using second order information only.

Data generating models. Instantaneous interactions can be modeled implicitly through correlated residuals in classical VAR processes, or explicitly, for example using so-called ‘structural’ VAR (SVAR) processes [1, 40, 41]. By augmenting the VAR model with an instantaneous mixing matrix , the SVAR model

 zt=p∑h=0Γhzt−h+¯ϵt, (46)

achieves that the residuals are uncorrelated. Here, the diagonal of is assumed to be zero.

Correlated residuals emerge naturally in electrophysiological neuroimaging data, where the signals observable at the sensors (e.g., EEG electrodes) are a linear mixture of the latent activity of possibly interacting neuronal populations within the brain. A model for such mixtures of potentially interacting sources is given by

 zt =Mz(l)t,z(l)t =p∑h=1Bhz(l)t−h+´ϵt, (47)

where denotes the observed data, denotes the activity of underlying latent variables (e.g., brain sources) following a VAR() process with uncorrelated residuals , and is an unknown mixing matrix (representing, e.g., the volume conduction effect of the human head). We call (47) the ‘mixture of interacting sources’ model.

Note that VAR models with correlated residuals, SVAR models, and mixture of interacting sources models can be used interchangeably to represent the same statistical process. For example, an interacting sources model (47) can be equivalently written as a VAR() process (1) with coefficients

 Ah=MBhM−1,h∈{1,…,p} (48)

and correlated residuals . Likewise, an SVAR() process (46) can be converted into a VAR() process with correlated residuals and coefficients

 Ah=(I−Γ0)−1Γh,h∈{1,…,p}. (49)

The reverse transformations from VAR models to SVAR or interacting source models, as well as the transformations between SVAR and interacting source models, are, however, not unique (see Model identifiability).

Ambiguous causal interpretations can emerge in cases where one of the three models indicates time-delayed causal interactions through non-zero off-diagonal coefficients in the , or , while another one does not. This ambiguity can in general only be resolved if the model generating the data is known a-priori. In case of EEG data, for example, (47) reflects the true data-generating process. Therefore, only the parameters of the source VAR process (47) permit meaningful causal interpretation (wrt. to the source variables ), while, for example, the VAR parameters in (48) are distorted by the mixing matrix .

Model identifiability. A further complication in the presence of instantaneous effects in the data is that for mixture of interacting sources as well as SVAR models, the parameters are not uniquely defined from second order information only. This can be best seen for the latter model (47). Identifying the model parameters requires the estimation of a full factorization of the data into a mixing matrix and source time series . This means that the estimation problem falls into the blind source separation (BSS) setting, in which Gaussianity of the factors is not sufficient for their identification. The classical approach to BSS, independent component analysis (ICA) assumes statistical independence and non-Gaussianity of the sources to ensure identifiability. This concept can be adopted in the context of source AR models by enforcing independence/non-Gaussianity of the residuals of the source AR process in (47) [13, 15, 42]. In a similar way, independence of residuals has been used in the identification of SVAR models [40, 43].

Example. Consider the following VAR(1) process with correlated residuals:

 [xtyt] =[0.70−0.120.9][xt−1yt−1]+ϵt,⟨ϵtϵ⊤t⟩=[10.60.61].

This process can also be represented by the SVAR(1) model

 [xtyt] =[000.60][xtyt]+[0.70−0.540.9][xt−1yt−1]+¯ϵt

as well as the mixtures of interacting sources model

 [xtyt] =[100.60.8][x(l)ty(l)t],[x(l)ty(l)t]=[0.7000.9]⎡⎣x(l)t−1y(l)t−1⎤⎦+´ϵt,

with uncorrelated residuals , . Note that both the VAR(1) and the SVAR(1) representation indicate unidirectional causal interaction between the observed variables and , whereas the mixture model suggests that the observed data can also arise from a mixture of two independent latent sources and . However, another equivalent mixture model

 [xtyt] =[−√0.2√0.8√0.2√0.8][x(l)ty(l)t],[x(l)ty(l)t]=[0.860.080.080.74]⎡⎣x(l)t−1y(l)t−1⎤⎦+~~ϵt

with suggests bidirectional informational flow on the source level. Similarly, the following SVAR(1) model indicates bidirectional flow

 [xtyt] =[00.600][xtyt]+[0.772−0.54−0.120.9][xt−1yt−1]+¯¯ϵt, ⟨¯¯ϵt¯¯ϵ⊤t⟩ =[0.64001].

### Iv-B Future work

Further effort is required to investigate the behavior of TRGC in the presence of bidirectional information flow. Also, our theoretical analysis only covers the bivariate framework. Both Granger causality and TRGC can result in spurious causality when relevant variables are not included (cf. Fig. 2, panel B). Therefore, an extension of the analysis of time-reversal to general multivariate signals would be very interesting. Furthermore, it would be desirable to obtain theoretical guarantees for the performance of TRGC in the presence of true interaction superimposed by noise in the form of bounds on the false positive rate. A major difficulty here is to obtain the residual covariance of the superposition of a VAR process and additive noise. Analytically computing Granger causality in the presence of noise is mathematically involved even for special cases [44].

Finally, [16] showed that for any causality measure based on cross-covariances, differences of the scores obtained on the original and time-reversal signals correctly indicate the absence of causality on mixtures of independent sources. While we focused here on Granger causality, it remains to be shown whether validity of time-reversal in the presence of causal interaction can also be demonstrated for other causality measures.

## Appendix A Proofs for VAR(p)

### A-a The VAR(p) process and its cross-covariance function

Consider a stable bivariate VAR() process, , as defined in (1),

where is a -dimensional white noise process (i.e.