Identifying Coupling Structure in Complex Systems through the Optimal Causation Entropy Principle
Abstract
Inferring the coupling structure of complex systems from time series data in general by means of statistical and informationtheoretic techniques is a challenging problem in applied science. The reliability of statistical inferences requires the construction of suitable informationtheoretic measures that take into account both direct and indirect influences, manifest in the form of information flows, between the components within the system. In this work, we present an application of the optimal causation entropy (oCSE) principle to identify the coupling structure of a synthetic biological system, the repressilator. Specifically, when the system reaches an equilibrium state, we use a stochastic perturbation approach to extract time series data that approximate a linear stochastic process. Then, we present and jointly apply the aggregative discovery and progressive removal algorithms based on the oCSE principle to infer the coupling structure of the system from the measured data. Finally, we show that the success rate of our coupling inferences not only improves with the amount of available data, but it also increases with a higher frequency of sampling and is especially immune to false positives.
I Introduction
Deducing equations of dynamics from empirical observations is fundamental in science. In realworld experiments, we gather data of the state of a system. Then, to achieve the comprehension of the mechanisms behind the system dynamics, we often need to reconstruct the underlying dynamical equations from the measured data. For example, the laws of celestial mechanics were deduced based on observations of planet trajectories Fitzpatrick (); the forms of chemical equations were inferred upon empirical reaction relations and kinetics Atkins (); the principles of economics were uncovered through market data analysis Mankiw (). Despite such important accomplishments, the general problem of identifying dynamical equations from data is a challenging one. Early efforts, such as the one presented in Crutchfield1987 (), utilized embedding theory and relative entropy to reconstruct the deterministic component of a lowdimensional dynamical system.
Later, more systematic methods were developed to design optimal models from a set of basis functions, where model quality was quantified in various ways, such as the Euclidean norm of the error Yao2007 (), the length of the prediction time window (chaotic shadowing) Sun2011 () or the sparsity in the model terms (compressive sensing) Wang2011 (). Each method achieved success in a range of systems, but none is generally applicable. This is, in fact, not surprising in light of the recent work that showed that the identification of exact dynamical equations from data is NP hard and, therefore, unlikely to be solved efficiently Cubitt2012 ().
Fortunately, in many applications, the problem we face is not necessarily the extraction of exact equations, but rather, uncovering the causeandeffect relationships (i.e., direct coupling structure) among the components within a complex system. For example, in medical diagnosis, the primary goal is to identify the causes of a disease and/or the roots of a disorder, so as to prescribe effective treatments. In structural health monitoring, the main objective is to locate the defects that could cause abrupt changes of the connectivity structure and adverse the performance of the system. Due to its wide range of applications, the problem of inferring causal relationships from observational data has attracted broad attention over the past few decades Heider1944 (); Granger1969 (); Granger1988 (); Spirtes2000 (); Rothman2005 (); Caticha2007 (); Frenzel2007 (); Schindlera2007 (); Guo2008 (); Heckman2008 (); Pearl2009 (); Gao2011 (); Vicente2011 (); Runge2012PRL (); Runge2012PRE (); Sun2014PhysicaD (); Sun2014arXiv ().
Among the various notions of causality, we adopt the one originally proposed by Granger, which relies on two basic principles Granger1969 (); Granger1988 ():

The cause should occur before the effect (caused);

The causal process should carry information (unavailable in other processes) about the effect.
A causal relationship needs to satisfy both requirements. See Figure 1 as a schematic illustration.
The classical Granger causality is limited to coupled linear systems, while most recently developed methods based on informationtheoretic measures are applicable to virtually any model, although their effectiveness relies on the abundance of data. Notably, transfer entropy (TE), a particular type of conditional mutual information, was introduced to quantify the (asymmetric) information flow between pairs of components within a system Schreiber2000 (); Kaiser2002 () and further developed to detect the directionality of coupling in systems with two components Palus2001 (); Vejmelka2008 (); Bollt2012 (). The application of these pairwise inference methods to identify direct coupling in systems with more than two components warrants caution: the inferred couplings are often incorrect regardless of the amount of available data Smirnov2013 (); Sun2014PhysicaD (); Sun2014arXiv (). In fact, Granger’s second principle implies that a true causal relationship should remain effective upon the removal of all other possible causes. As a consequence, an inference method cannot correctly identify the direct coupling between two components in a system without appropriate conditioning on the remaining components Frenzel2007 (); Sun2014PhysicaD (); Sun2014arXiv ().
Inferring causal relationships in largescale complex systems is challenging. For a candidate causal relationship, one needs to effectively determine whether the cause and effect is real or is due to the presence of other variables in the system. A common approach is to test the relative independence between the potential cause and effect conditioned on all other variables, as demonstrated in linear models Guo2008 (); Barrett2010 (); Papana2013 (). Although analytically rigorous, this approach requires the estimation of the related inference measures for high dimensional random variables from (often limited available) data, and therefore, suffers the curse of dimensionality Runge2012PRL (). Many heuristic alternatives have been proposed to address this issue. The essence of these alternative approaches is to repeatedly measure the relative independence between the cause and effect conditioned on a combination of the other variables, often starting with subsets of only a few variables Marinazzo2012 (); Runge2012PRL (); Runge2012PRE (). As soon as one such combination renders the proposed cause and effect independent, the proposed relationship is rejected, and there is no need to condition on subsets with more variables. The advantage of such an approach is that it reduces the dimensionality of the sample space if the proposed relationship is rejected at some point. However, if the relationship is not rejected, then the algorithm will continue, potentially having to enumerate over all subsets of the other variables. In this scenario, regardless of the dimensionality of the sample space, the combinatorial search itself is often computationally infeasible for large or even moderatesize systems.
In our recent work in Sun2014PhysicaD (); Sun2014arXiv (), we introduced the concept of causation entropy (CSE), proposed and proved the optimal causation entropy (oCSE) principle and presented efficient algorithms to infer causal relationships within large complex systems. In particular, we showed in Sun2014arXiv () that by a combination of the aggregate discovery and progressive removal of variables, all causal relationships can be correctly inferred in an computational feasible and dataefficient manner. We repeat for emphasis that without properly accounting for multiple interactions and conditioning accordingly, erroneous or irrelevant casual influences may be inferred, and specifically, any pairwiseonly method will inherent the problem that many falsepositive connections will arise. The design of oCSE specifically addresses this issue.
In this paper, we focus on the problem of inferring the coupling structure in synthetic biological systems. When the system reaches an equilibrium state, we employ random perturbations to extract time series that approximate a linear Gaussian stochastic process. Then, we apply the oCSE principle to infer the system’s coupling structure from the measured data. Finally, we show that the success rate of causal inferences not only improves with the amount of available data, but it also increases with the higher frequency of sampling.
Ii Inference of Direct Coupling Structure through Optimal Causation Entropy
To infer causal structures in complex systems, we clearly need to specify the mathematical assumptions under which the task is to be accomplished. Accurate mathematical modeling of complex systems demands taking into account the coupling of neglected degrees of freedom or, more generally, the fluctuations of external fields that describe the environment interacting with the system itself Bazzani2003 (). This requirement can be addressed in a phenomenological manner by adding noise in deterministic dynamical models. The addition of noise, in turn, generally leads to the stochastic process formalism used in the modeling of natural phenomena Doukhan2008 (). We study the system in a probabilistic framework. Suppose that the system contains components, . Each component is assumed to be influenced by a unique set of components, denoted by . Let:
(1) 
be a random variable that describes the state of the system at time . For a subset , we define:
(2) 
ii.1 Markov Conditions
We assume that the system undergoes a stochastic process with the following Markov conditions Sun2014arXiv ().
(3) 
Here, denotes conditional probability. The relationship between two probability density functions and is denoted as iff they equal almost everywhere, and iff there is a set of positive measure on which the two functions do not equal. Note that the Markov conditions stated in Equation (3) ensure that for each component , there is a unique set of components that renders the rest of the system irrelevant in making inference about and each individual component in presents an observable cause regardless of the presence or absence of the other components.
Although several complex systems can be properly modeled in terms of Markov processes Eichler2012 (), we cannot avoid recalling that, after all, nonMarkov is the rule and Markov is only the exception in nature Kampen1998 (). Therefore, it is important to develop a theoretical framework suitable for identifying coupling structures in complex systems driven by correlated fluctuations where memory effects cannot be neglected. It is in fact possible to relax the assumptions in Equation (3) to a stochastic process with finite or vanishing memory, as discussed in the last section of the paper.
The problem of inferring direct (or causal) couplings can be stated as follows. Given time series data () drawn from a stochastic process that fulfills the Markov conditions in Equation (3), the goal is to uncover the set of causal components for each . We solve this problem by means of algorithms that implement the oCSE principle.
ii.2 Causation Entropy
We follow Shannon’s definition of entropy to quantify the uncertainty of a random variable. In particular, the entropy of a continuous random variable is defined as Shannon1948 ():
(4) 
where is the probability density function of . The conditional entropy of given is defined as Cover ():
(5) 
Equations (4) and (5) are valid for both univariate and multivariate random variables. Consider a stochastic process. Let , and be arbitrary sets of components within the system. We propose to quantify the influence of on conditioned upon via the CSE:
(6) 
provided that the limit exists Sun2014PhysicaD (); Sun2014arXiv (). Since, in general, , where the latter defines the conditional mutual information between and conditioned on , it follows from the nonnegativity of conditional mutual information that CSE is nonnegative. When , we omit the conditioning part and simply use the notation . Notice that if , CSE reduces to TE. On the other hand, CSE generalizes TE by allowing to be an arbitrary set of components.
ii.3 Optimal Causation Entropy Principle
In our recent work Sun2014arXiv (), we revealed the equivalence between the problem of identifying the causal components and the optimization of CSE. In particular, we proved that for an arbitrary component , its unique set of causal components is the minimal set of components that maximizes CSE. Given the collection of sets (of components) with maximal CSE with respect to component ,
(7) 
it was shown that is the unique set in with minimal cardinality, i.e.,
(8) 
We refer to this minimax principle as the oCSE principle Sun2014arXiv ().
ii.4 Computational Causality Inference
Based on the oCSE principle, we developed two algorithms whose joint sequential application allows the inference of causal relationships within a system Sun2014arXiv (). The goal of these algorithms is to effectively and efficiently identify for a given component the direct causal set of components . The first algorithm aggregatively identifies the potential causal components of . The outcome is a set of components that includes as its subset, with possibly additional components. These additional components are then progressively removed by applying the second algorithm.
For a given component , the first algorithm, referred to as aggregative discovery, starts by selecting a component that maximizes CSE, i.e.,
(9) 
Then, at each step (), a new component is identified among the rest of the components to maximize the CSE conditioned on the previously selected components:
(10) 
Recall that CSE is nonnegative. The above iterative process is terminated when the corresponding maximum CSE equals zero, i.e., when:
(11) 
and the outcome is the set of components .
Next, to remove noncausal components (including indirect and spurious ones) that are in , but not in , we employ the second algorithm, referred to as progressive removal. A component in is removed when:
(12) 
and is updated accordingly^{1}^{1}1An alternative way of removing noncausal components is to test conditioning on subsets of with increasing cardinality, potentially reducing the dimensionality of sample space at the expense of increased computational burden. This is similar to the PCalgorithm originally proposed in Ref. Spirtes2000 () and successfully employed in Ref. Runge2012PRL (), with the key difference that here the enumeration of conditioning subsets only needs to be performed for instead of for the entire system .. After removing all such components, the resulting set is identified as .
ii.5 Estimation of CSE in Practice
In practice, causation entropy needs to be estimated from data. We define the Gaussian estimator of causation entropy as:
(13) 
Here:
(14) 
denotes a covariance matrix where , and denotes the corresponding sample covariance matrix estimated from the data Sun2014arXiv (). The estimation if: (i) the underlying random variables are Gaussians; and (ii) the amount of data is sufficient for the relevant sample covariances to be close to their true values. When the underlying random variables are nonGaussian, an efficient estimator for CSE is yet to be fully developed. As previously pointed out, binningrelated estimators, although conceptually simple, are generally inefficient, unless the sample space is lowdimensional Schindlera2007 (). To gain efficiency and to minimize bias for a large system where the sample space is highdimensional, an idea would be to build upon the nearest neighbor estimators of entropic measures Vejmelka2008 (); Kraskov2004 (); Vlachos2010 (); Tsimpiris2012 ().
Regardless of the method that is being adopted for the estimation of , the estimated value is unlikely to be exactly zero due to limited data and numerical precision. In practice, it is necessary to decide whether or not the estimated quantity should be regarded as zero. Such a decision impacts the termination criterion Equation (11) in the aggregative discovery algorithm and determines which components need to be removed based on Equation (12) in the progressive removal algorithm. As discussed in Sun2014arXiv (), such a decision problem can be addressed via a nonparametric statistical test, called the permutation test, as described below.
For given time series data (), let denote the estimated value of . Based on the set of components and a permutation function on the set of integers from one to , the corresponding permuted time series is defined as:
(15) 
To apply the permutation test, we generate a number of randomly permuted time series (the number will be denoted by ). We then compute the causation entropy from to conditioned on for each permuted time series to obtain values of the estimates, which are consequently used to construct an empirical cumulative distribution as:
(16) 
where are estimates from the permuted time series with and denotes the cardinality of a set. Finally, with the null hypothesis that , we regard as strictly positive if and only if:
(17) 
where is the prescribed significance level. In other words, the null hypothesis is rejected at level if the above inequality holds. Therefore, the permutation test relies on two input parameters: (i) the number of random permutations ; and (ii) the significance level . In practice, the increase of improves the accuracy of the empirical distribution at the expense of additional computational cost. A reasonable balance is often achieved when Sun2014arXiv (). The value of sets a lower bound on the false positive ratio and should be chosen to be close to one (for example, ) Sun2014arXiv ().
Iii Extracting Stochastic Time Series from Deterministic Orbits
Biological systems can exhibit both regular and chaotic features Murray2002 (). For instance, a healthy cardiac rhythm is characterized by a chaotic time series, whereas a pathological rhythm often exhibits regular dynamics Garfinkel1992 (). Therefore, a decrease of the chaoticity of the cardiac rhythm is an alarming clinical signature. A similar connection between the regularity of the time series and pathology is observed in spontaneous neuronal bursting in the brain Schiff1994 (); Lesne2006 (). When a system settles into a periodic or equilibrium state, it becomes nearly impossible to infer the coupling structure among the variables, as the system is not generating any information to be utilized for inference. To overcome this difficulty, we propose to apply small stochastic perturbations to the system while in an equilibrium state (this is equivalent to adding dynamic noise to a system to facilitate coupling inference, as shown in Pompe2011PRE ()). Then, we measure the system’s response over short time intervals. Finally, we follow the oCSE principle and apply the aggregative discovery and progressive removal algorithms to the measured data to infer the couplings among the variables.
iii.1 Dynamical System and Equilibrium States
Consider a continuous dynamical system:
(18) 
where is the dimensional state variable, the symbol denotes transpose, and is a smooth vector field, which models the dynamic rules of the system. A trajectory of the system (18) is a solution to the differential equation, Equation (18), with a given initial condition .
An equilibrium of the system is a state , such that . When a system reaches an equilibrium, the time evolution of the state ceases. An equilibrium is called stable if nearby trajectories approach forward in time, i.e., there exists a constant , such that whenever , where denotes the standard Euclidean norm. Otherwise, the equilibrium is called unstable.
iii.2 Response of the System to Stochastic Perturbations
To gain information about the coupling structure of a system, it is necessary to apply external perturbations to “knock” the system out of an equilibrium state and observe how it responds to these perturbations. Suppose that we apply and record a sequence of random perturbations to the system in such a manner that before each perturbation, the system is given sufficient time to evolve back to its stable equilibrium. In addition, the response of the system is measured shortly after each perturbation, but before the system reaches the equilibrium again. Denote the stable equilibrium of interest as ; we propose to repeatedly apply the following steps.
Step 1. Allow the system to (spontaneously) reach .
Step 2. At time , apply and record a random perturbation to the system, i.e., .
Step 3. At time , measure the rate of response, defined as .
Repeated application of these steps times results in a collection of perturbations, denoted as , and rates of response, denoted as , where . Here, each perturbation is assumed to be drawn independently from the same multivariate Gaussian distribution with zero mean and covariance matrix . For a given equilibrium , in addition to —the number of times the perturbations are applied—there are two more parameters in the stochastic perturbation process: the sample frequency defined as and the variance of perturbation defined as . To ensure that the perturbation approximates the linearized dynamics of the system, we require that and . The influence of these parameters will be studied in the next section with a concrete example.
We remark here that the choice of Gaussian distribution is a practical rather than a conceptual one. In theory, any multivariate distribution can be used to generate the perturbation vector as long as the componentwise distributions of and are identical and independent whenever or . In practice, since the effectiveness of any information theoretic method (including the one proposed here) depends on the estimation of entropies, choosing the perturbations to be Gaussian greatly improves the reliability of estimation and, thus, the accuracy of the inference. This is because the entropy of Gaussian variables depends only on covariances, rendering its estimation relatively straightforward Ahmed1989 (); Barnett2009PRL (); Sun2014arXiv ().
Note that each perturbation and its response are related through the underlying nonlinear differential equation , where the nonlinearity is encoded in the function , which is assumed to be unknown. For an equilibrium , the dynamics of nearby trajectories can be approximated by its linearized system, as follows. Consider a state and define the new variable . To the first order, the time evolution of can be approximated by the following linear system:
(19) 
where is the by Jacobian matrix of , defined as . A sufficient condition for to be stable is that all eigenvalues of must have negative real parts. From Equation (19), with the additional assumption that , the relationship between the perturbation and response is approximated by the following equation:
(20) 
Note that since is a multivariate normal random variable and is a constant matrix, the variable is also (approximately) a multivariate normal random variable. Equation (20) therefore represents a driveresponse type of Gaussian process.
Iv Application to Synthetic Biology
iv.1 The Repressilator
Cellular dynamics is centrally important in biology Ellner2006 (). To describe and, to a certain extent, to understand what happens in cells, fluctuating chemical concentrations and reaction rates need to be measured experimentally. However, constructing dynamic models that accurately reproduce the observed phenomena in cells is quite challenging. An alternative approach consists in engineering synthetic biological systems that follow prescribed dynamic rules. An important example is the socalled repressilator (or repressiondriven oscillator) presented in Elowitz2000 (). The repressilator is based upon three transcriptional repressors inserted into the E. coli bacteria with a plasmid. The three repressors, , and , are related as follows: inhibits the transcription of the gene coding for ; inhibits the transcription of the gene coding for ; inhibits the transcription of the gene coding for . In the absence of inhibition, each of the three proteins reaches a steadystate concentration resulting from a balance between its production and degradation rates within the bacterium. In the presence of crossinhibition by the other two repressors, this network architecture potentially allows oscillations and other interesting dynamical behaviors and serves as a prototype of modeling the quorum sensing among bacteria species Ullner2007 ().
The repressilator dynamics can be modeled by a system of coupled differential equations, which describe the rates of change for the concentration of each protein repressor and the concentration of its associated mRNA in their network, as: Elowitz2000 ()
(21) 
where (), () and () represent the mRNA (protein) concentration of the genes , and , respectively. See Figure 2a for a schematic representation of the system. Each ODE in Equation (21) consists of positive terms modeling the production rate and a negative term representing degradation. There are four parameters in the ODEs in Equation (21), namely: is the ratio of the protein decay rate to the mRNA decay rate; is the socalled Hill coefficient and describes the cooperativity of the binding of repressor to promoter; , the leakiness of the promoter, is the rate of transcription of mRNA in the presence of saturating concentration of the repressor; is the additional rate of transcription of mRNA in the absence of the inhibitor. Note that units of time and concentration in Equation (21) have been rescaled in order to make these equations nondimensional Elowitz2000 ().
As shown in Elowitz2000 (), there is an extended region in the parameter space for which the system described in Equation (21) exhibits a single stable equilibrium. The Jacobian matrix at the equilibrium is:
(22) 
Problem statement: The goal of coupling inference is to identify the location of the nonzero entries of the Jacobian through time series generated by the system near equilibrium.
iv.2 Inference of Coupling Structure via the Repressilator Dynamics
We consider the repressilator dynamics as modeled by Equation (21) with the parameters , , and . Under this setting, the system exhibits a single stable equilibrium Elowitz2000 () at which . Typical time series are shown in Figure 2b. After the system settles at the equilibrium, we apply the stochastic perturbations as described in Section III and obtain a time series of the perturbations , as well as responses . The goal is to identify, for each component (the mRNA or protein of a gene) in the system, the set of components that determine its dynamics (i.e., the variables that appear on the righthand side of each relation in Equation (21)).
To apply our algorithms according to the oCSE principle, we define the set of random variables () as:
(23) 
The approximate relationship between the perturbation and response as in Equation (20) can be expressed as:
(24) 
where , and . Since this equation defines a stochastic process that satisfies the Markov assumptions in Equation (3), the oCSE principle applies. This implies that the direct couplings can be correctly inferred, at least in principle, by jointly applying the aggregative discovery and progressive removal algorithms.
Since the perturbations and responses depend on the number of samples , the rate of perturbation and the variance of perturbation , so is the accuracy of the inference. Next, we explore the change in performance of our approach by varying these parameters. We use two quantities to measure the accuracy of the inferred coupling structure, namely, the false positive ratio and the false negative ratio . Since our goal is to infer the structure rather than the weights of the couplings, we focus on the structure of the Jacobian matrix , encoded in the binary matrix , where:
(25) 
On the other hand, applying the oCSE principle, the inferred direct coupling structure gives rise to the estimated binary matrix , where:
(26) 
Given matrices and , the false positive and false negative ratios are defined, respectively, as:
(27) 
It follows that and when exact (errorfree) inference is achieved.
Figure 3a,b shows that both the false positives and false negatives converge as the number of samples increases. However, they converge to zero only if the rate of perturbation is sufficiently high. Figure 3c,d supports this observation and, in addition, shows that in the high rate of perturbation regime, exact inference is achieved with a sufficient number of samples (in this case, ). The combined effects of and are shown in Figure 4. In all simulations, the variation of the perturbation is set to be and is found to have little effect on the resulting inference, provided that it is sufficiently small to keep the linearization in Equations (20) and (24) valid.
V Discussion and Conclusions
v.1 Results Summary
In this paper, we considered the challenging problem of inferring the causal structure of complex systems from limited available data (enjoy Figure 1 for a cartoon depiction of the concept of causality during a soccer game). Specifically, we presented an application of the socalled oCSE principle to identify the coupling structure of a synthetic biological system, the repressilator. First, we briefly reviewed the main points of the oCSE principle (Equations (7) and (8)), which states that for an arbitrary component of a complex system, its unique set of causal components is the minimal set of components that maximizes CSE (causation entropy, defined in Equation (6) is a generalization of transfer entropy). We strengthen in this work our claim that CSE is a suitable informationtheoretic measure for reliable statistical inferences, since it takes into account both direct and indirect influences that appear in the form of information flows between nodes of networks underlying complex systems. We also devoted some attention to the implementation of the oCSE principle. This task is accomplished by means of the joint sequential application of two algorithms, aggregative discovery and progressive removal, respectively. Second, having introduced the main theoretical and computational frameworks, we used a stochastic perturbation approach to extract time series data approximating a Gaussian process when the model system—the repressilator (see Equation (21))—reaches an equilibrium configuration. We then applied the abovementioned algorithms implementing the oCSE principle in order to infer the coupling structure of the model system from the observed data. Finally, we numerically showed that the success rate of our causal entropic inferences not only improves with the amount of available measured data (Figure 2a), but it also increases with a higher frequency of sampling (Figure 2b).
One especially important feature of our oCSEbased causality inference approach is that it is immune (in principle) to false positives. When data is sufficient, false positives are eliminated by sequential joint application of the aggregative discovery and progressive removal algorithms, as well as raising the threshold used in the permutation test (see Sun2014arXiv () for a more detailed investigation of this point). In contrast, any pairwise causality measure without additional appropriate conditioning will in principle be susceptible to false positives, result in too many connections and, sometimes, even, implies that everything causes everything. Such a limitation is intrinsic and cannot be overcome merely by gathering more data Sun2014arXiv (). On the other hand, false negatives are less common in practice and are usually caused by ineffective statistical estimation orinsufficient data.
v.2 oCSE for General Stochastic Processes
We presented the oCSE principle for stochastic processes under the Markov assumptions stated in Equation (3). From an inference point of view, each and every one of these assumptions has a welldefined interpretation. For instance, the temporal Markov condition implies the stationarity of causal dependences between nodes. The loss of temporal stationarity could be addressed by partitioning the time series data into stationary segments and, then, performing timedependent inferences. Clearly, such inferences would require extra care. Furthermore, the loss of the spatially and/or faithfully Markov condition would imply that the set of nodes that directly influence any given node of the network describing the complex system is no longer minimal and unique. Causality inference in this case becomes an illposed problem. These issues will be formally addressed in a forthcoming work.
Note that a finite th order Markov process can always be converted into a firstorder (memoryless) one by lifting, or in other words, defining new random variables Gallager2013 (). In this regard, the oCSE principle extends naturally to an arbitrary finiteorder Markov process. On the other hand, for a general stationary stochastic process that is not necessarily Markov (i.e., considering a process with potentially infinite memory), there might exist an infinite number of components from the past that causally influence the current state of a given component. However, under an assumption of vanishing (or fading) memory, such influences decay rapidly as a function of the time lag, and consequently, the process itself can be approximated by a finiteorder Markov chain Harris1955 (); Berbee1987 (); Fernandez2002 (); Fernandez2005 (). We plan to leave such generalizations for forthcoming investigations.
Acknowledgments We thank Samuel Stanton from the Army Research Office (ARO) Complex Dynamics and Systems Program for his ongoing and continuous support. This work was funded by ARO Grant No. W911NF1210276.
Author Contributions Jie Sun and Erik M. Bollt designed and supervised the research. Jie Sun and Carlo Cafaro performed the analytical calculations and the numerical simulations. All authors contributed to the writing of the paper.
Conflicts of Interest The authors declare no conflict of interest.
References
 (1) Fitzpatrick, R. An Introduction to Celestial Mechanics; Cambridge University Press: New York, NY, USA, 2012.
 (2) Atkins, P.; de Paula, J. Physical Chemistry, 9th ed.; W. H. Freeman: New York, NY, USA, 2010.
 (3) Mankiw, N.G. Principles of Economics, 6th ed.; Cengage Learning: Stamford, CT, USA, 2011.
 (4) Crutchfield, J.P.; McNamara, B.S. Equations of motions from a data series. Complex Syst. 1987, 1, 417.
 (5) Yao, C.; Bollt, E.M. Modeling and nonlinear parameter estimation with Kronecker product representation for coupled oscillators and spatiotemporal systems. Phys. D: Nonlinear Phenom. 2007, 227, 78–99.
 (6) Sun, J.; Bollt, E.M.; Nishikawa, T. Judging model reduction of complex systems. Phys. Rev. E 2011, 83, 046125.
 (7) Wang, W.X.; Yang, R.; Lai, Y.C.; Kovanis, V.; Harrison, M.A.F. Timeseries based prediction of complex oscillator networks via compressive sensing. Europhys. Lett. 2011, 94, 48006.
 (8) Cubitt, T.S.; Eisert, J.; Wolf, M.W. Extracting dynamical equations from experimental data is NP hard. Phys. Rev. Lett. 2012, 108, 120503.
 (9) Heider, F. Social perception and phenomenal causality. Psychol. Rev. 1944, 51, 358–374.
 (10) Granger, C.W.J. Investigating causal relations by econometric models and crossspectral methods. Econometrica 1969, 37, 425–438.
 (11) Granger, C.W.J. Some recent developments in a concept of causality. J. Econom. 1988, 39, 199–211.
 (12) Spirtes, P.; Glymour, C.N.; Scheines, R. Causation, Prediction, and Search, 2nd ed.; MIT Press: Cambridge, MA, USA, 2000.
 (13) Rothman, K.J.; Greenland, S. Causation and causal inference in epidemiology. Am. J. Public Health 2005, 95, S144–S150.
 (14) Caticha, A.; Cafaro, C. From information geometry to Newtonian dynamics. In Proceedings of the 27th International Workshop on Bayesian Inference and Maximum Entropy Methods, Saratoga Springs, NY, USA, 8–13 July 2007; Volume 954, pp. 165–174.
 (15) Frenzel, S.; Pompe, B. Partial mutual information for coupling analysis of multivariate time series. Phys. Rev. Lett. 2007, 99, 204101.
 (16) HlaváčkováSchindlera, K.; Paluš, M.; Vejmelka, M.; Bhattacharya, J. Causality detection based on informationtheoretic approaches in time series analysis. Phys. Rep. 2007, 441, 1–46.
 (17) Guo, S.; Seth, A.K.; Kendrick, K.M.; Zhou, C.; Feng, J. Partial Granger causality—Eliminating exogenous inputs and latent variables. J. Neurosci. Methods 2008, 172, 79–93.
 (18) Heckman, J.J. Econometric causality. Int. Stat. Rev. 2008, 76, 1–27.
 (19) Pearl, J. Causality: Models, Reasoning and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009.
 (20) Gao, Q.; Duan, X.; Chen, H. Evaluation of effective connectivity of motor areas during motor imagery and execution using conditional Granger causality. NeuroImage 2011, 54, 1280–1288.
 (21) Vicente, R.; Wibral, M.; Lindner, M.; Pipa, G. Transfer entropy—A modelfree measure of effective connectivity for the neurosciences. J. Comput. Neurosci. 2011, 30, 45–67.
 (22) Runge, J.; Heitzig, J.; Petoukhov, V.; Kurths, J. Escaping the curse of dimensionality in estimating multivariate transfer entropy. Phys. Rev. Lett. 2012, 108, 258701.
 (23) Runge, J.; Heitzig, J.; Marwan, N.; Kurths, J. Quantifying causal coupling strength: A lagspecific measure for multivariate time series related to transfer entropy. Phys. Rev. E 2012, 86, 061121.
 (24) Sun, J.; Bollt, E.M. Causation entropy identifies indirect influences, dominance of neighbors and anticipatory couplings. Phys. D: Nonlinear Phenom. 2014, 267, 49–57.
 (25) Sun, J.; Taylor, D.; Bollt, E.M. Causal network inference by optimal causation entropy. 2014, arXiv: 1401.7574.
 (26) Schreiber, T. Measuring information transfer. Phys. Rev. Lett. 2000, 85, 461–464.
 (27) Kaiser, A.; Schreiber, T. Information transfer in continuous processes. Phys. D: Nonlinear Phenom. 2002, 166, 43–62.
 (28) Palus, M.; Komarek, V.; Hrncir, Z.; Sterbova, K. Synchronization as adjustment of information rates: Detection from bivariate time series. Phys. Rev. E 2001, 63, 046211.
 (29) Vejmelka, M.; Palus, M. Inferring the directionality of coupling with conditional mutual information. Phys. Rev. E 2008, 77, 026214.
 (30) Bollt, E.M. Synchronization as a process of sharing and transferring information. Int. J. Bifurc. Chaos 2012, 22, 1250261.
 (31) Smirnov, D.A. Spurious causalities with transfer entropy. Phys. Rev. E 2013, 87, 042917.
 (32) Nevill, A.M; Balmer, N.J; Mark Williams, A. The influence of crowd noise and experience upon refereeing decisions in football. Psychol. Sport Exer. 2002, 3, 261–272.
 (33) Barrett, A.B.; Barnett, L.; Seth, A.K. Multivariate granger causality and generalized variance. Phys. Rev. E 2010, 81, 041907.
 (34) Papana, A.; Kyrtsou, C.; Kugiumtzis, D.; Diks, C. Simulation study of direct causality measures in multivariate time series. Entropy 2013, 15, 2635–2661.
 (35) Marinazzo, D.; Pellicoro, M.; Stramaglia, S. Causal information approach to partial conditioning in multivariate data sets. Comput. Math. Methods Med. 2012, 2012, 303601:1–303601:8.
 (36) Bazzani, A.; Bassi, G.; Turchetti, G. Diffusion and memory effects for stochastic processes and fractional Langevin equations. Phys. A 2003, 324, 530–550.
 (37) Doukhan, P.; Wintenberger, O. Weakly dependent chains with infinite memory. Stoch. Processes Appl. 2008, 118, 1997–2013.
 (38) Eichler, M. Graphical modelling of multivariate time series. Probab. Theory Relat. Fields 2012, 153, 233–268.
 (39) Van Kampen, N.G. Remarks on nonMarkov processes. Braz. J. Phys. 1998, 28, 90–96.
 (40) Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
 (41) Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley Son, Inc.: Hoboken, NJ, USA, 2006.
 (42) Kraskov, A.; Stogbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
 (43) Vlachos, I.; Kugiumtzis, D. Nonuniform statespace reconstruction and coupling detection. Phys. Rev. E 2010, 82, 016207.
 (44) Tsimpiris, A.; Vlachos, I.; Kugiumtzis, D. Nearest neighbor estimate of conditional mutual information in feature selection. Expert Syst. Appl. 2012, 39, 12697.
 (45) Murray, J.D. Mathematical Biology: I. An Introduction; Springer: New York, NY, USA, 2002.
 (46) Garfinkel, A.; Spano, M.L.; Ditto, W.L.; Weiss, J.N. Controlling cardiac chaos. Science 1992, 257, 1230–1235.
 (47) Schiff, S.J.; Jerger, K.; Duong, D.H.; Chang, T.; Spano, M.L.; Ditto, W.L. Controlling chaos in the brain. Nature 1994, 370, 615–620.
 (48) Lesne, A. Chaos in biology. Riv. Biol. 2006, 99, 467–482.
 (49) Pompe, B.; Runge, J. Momentary information transfer as a coupling measure of time series. Phys. Rev. E 2011, 83, 051122.
 (50) Ahmed, N.A.; Gokhale, D.V. Entropy expressions and their estimators for multivariate distributions. IEEE Trans. Inf. Theory 1989, 35, 688.
 (51) Barnett, L.; Barrett, A.B.; Seth, A.K. Granger causality and transfer entropy are equivalent for gaussian variables. Phys. Rev. Lett. 2009, 103, 238701.
 (52) Ellner, S.P.; Guckenheimer, J. Dynamic Models in Biology; Princeton University Press: Princeton, NJ, USA, 2006.
 (53) Elowitz, M.B.; Leibler, S. A synthetic oscillatory network of transcriptional regulators. Nature 2000, 403, 335–338.
 (54) Ullner, E.; Zaikin, A.; Volkov, E.I.; GarcíaOjalvo, J. Multistability and clustering in a population of synthetic genetic oscillators via phaserepulsive celltocell communication. Phys. Rev. Lett. 2007, 99, 148103.
 (55) Gallager, R.G. Stochastic Processes, Theory for Applications; Cambridge University Press: Cambridge, UK, 2013.
 (56) Harris, T.E. On chains of infinite order. Pac. J. Math. 1955, 5, 707–724.
 (57) Berbee, H. Chains with infinite connections: Uniqueness and Markov representation. Probab. Theory Relat. Fields 1987, 76, 243–253.
 (58) Fernández, R.; Galves, A. Markov approximations of chains of infinite order. Bull. Braz. Math. Soc. 2002, 33, 1–12.
 (59) Fernández, R.; Maillard, G. Chains with complete connections: General theory, uniqueness, loss of memory and mixing properties. J. Stat. Phys. 2005, 118, 555–588.