On the Relationship between Mutual Information and Minimum Mean-Square Errors in Stochastic Dynamical Systems
Abstract
We consider a general stochastic input-output dynamical system whose output evolves in time as the solution to a functional-coefficients Itô stochastic differential equation, excited by an input process. This general class of stochastic systems encompasses not only the classical communication channel models, but also a wide variety of engineering systems appearing throughout a whole range of applications. For this general setting we find analogues of known relationships linking input-output mutual information and causal and noncausal minimum mean-square errors, previously established in the context of additive Gaussian noise communication channels. Relationships are established not only in terms of time-averaged quantities; their time-instantaneous, dynamical counterparts are presented as well. We also address the problem of appropriately introducing in this general framework a signal-to-noise ratio notion expressed through a signal-to-noise ratio parameter, identifying conditions for a proper and meaningful interpretation.
Index Terms. Stochastic dynamical systems, stochastic differential equations (SDE), mutual information, minimum mean-square errors (MMSE), nonlinear estimation, smoothing, optimal filtering.
1 Introduction
Consider the widely used communication system model known as the standard additive white Gaussian noise channel, described by
(1) \[ Y_t = \sqrt{\mathrm{snr}} \int_0^t X_s \, ds + W_t, \qquad 0 \le t \le T, \]
where is the signal-to-noise ratio parameter, is a fixed time horizon, is the transmitted random signal or channel input, is an independent standard Brownian motion or Wiener process representing the noisy transmission environment, and is the received random signal or channel output, corresponding to the respective value of the signal-to-noise ratio parameter .
Of central importance from an information-theoretic point of view is the input-output mutual information, i.e., the mutual information between the processes and , denoted by . (Precise mathematical definitions are deferred to the next section.) On the other hand, of central importance from an estimation-theoretic point of view are the causal and noncausal minimum mean-square errors in estimating or smoothing at time , denoted by and , respectively. Input-output mutual information quantifies how much coded information can be reliably transmitted through the channel for the given input source, whereas the causal and noncausal minimum mean-square errors indicate the level of accuracy that can be reached in the estimation of the transmitted message at the receiver, based on the causal or noncausal observation of an output sample path, respectively.
Interesting results on the relationship between filter maps and likelihood ratios in the context of the additive white Gaussian noise channel have been available in the literature for a while (see for example [RMAB1995] and references therein). A specific result linking information theory and estimation theory in this same Gaussian channel context, concretely input-output mutual information and causal minimum mean-square error, is Duncan's theorem [D1970], stating, under appropriate finite average power conditions, the relationship
(2) \[ I(X_0^T; Y_0^T) = \frac{\mathrm{snr}}{2} \int_0^T \mathbb{E}\big[(X_t - \mathbb{E}[X_t \mid Y_0^t])^2\big] \, dt, \]
i.e., after dividing both sides by , stating the proportionality (through the factor ) of mutual information rate per unit time and time-averaged causal minimum mean-square error. It was recently shown by Guo et al. [GSV2005] that the previous relationship is not the only link between information theory and estimation theory in this Gaussian channel setting: there also exists an important result involving input-output mutual information and noncausal minimum mean-square error, namely
(3) \[ \frac{\partial}{\partial\,\mathrm{snr}} I(X_0^T; Y_0^T) = \frac{1}{2} \int_0^T \mathbb{E}\big[(X_t - \mathbb{E}[X_t \mid Y_0^T])^2\big] \, dt. \]
As pointed out by Guo et al. [GSV2005], an interesting relationship between causal and noncausal minimum mean-square errors can then be directly deduced from (2) and (3), giving
(4) \[ \int_0^T \mathbb{E}\big[(X_t - \mathbb{E}[X_t \mid Y_0^t])^2\big] \, dt = \frac{1}{\mathrm{snr}} \int_0^{\mathrm{snr}} \int_0^T \mathbb{E}\big[(X_t - \mathbb{E}[X_t \mid Y_0^T(\gamma)])^2\big] \, dt \, d\gamma, \]
i.e., after dividing both sides by as before, stating the equality between the time-averaged causal minimum mean-square error and the time-averaged noncausal minimum mean-square error averaged in turn over the signal-to-noise ratio. Equations (2) to (4) can for example be used to study asymptotics of input-output mutual information and minimum mean-square errors, and to find new representations of information measures [GSV2005].
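Although the setting above is the continuous-time channel (1), the differential relationship (3) already holds in the one-shot Gaussian channel Y = sqrt(snr) X + N, where for X ~ N(0, 1) both sides are available in closed form: the mutual information is (1/2) ln(1 + snr) and the noncausal MMSE is 1/(1 + snr). The following sketch (purely illustrative, not from the paper) checks the identity numerically:

```python
import numpy as np

def mmse(snr):
    # Noncausal MMSE of X ~ N(0, 1) observed as Y = sqrt(snr) * X + N, N ~ N(0, 1)
    return 1.0 / (1.0 + snr)

def mutual_info(snr):
    # I(X; Y) in nats for the same scalar Gaussian channel
    return 0.5 * np.log1p(snr)

snr, h = 2.0, 1e-6
# Central finite difference approximating dI/dsnr
dI = (mutual_info(snr + h) - mutual_info(snr - h)) / (2.0 * h)
# One-shot version of relationship (3): dI/dsnr = (1/2) * mmse(snr)
assert abs(dI - 0.5 * mmse(snr)) < 1e-8
```

The same closed forms also verify the integral version (4) in this degenerate one-shot case, since averaging 1/(1 + gamma) over gamma in [0, snr] reproduces ln(1 + snr)/snr.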
An increasing need to consider general stochastic models has arisen during the last decades in the stochastic systems modelling community, not just from a communication systems standpoint, but from a wide variety of applications demanding the consideration of general stochastic input-output dynamical systems described by Itô stochastic differential equations of the form
(5) 
with the input stochastic process to the system, a nonnegative real parameter (to be interpreted further in subsequent sections), the corresponding system output stochastic process^{1}^{1}1To ease notation we simply write , instead of for example , the input process being clear from the context., and and given (time-varying) nonanticipative functionals, i.e., with depending on the random paths of and only up to time , and similarly for . Note that since is a process of infinite variation, the integral
is an Itô stochastic integral and not a standard pathwise Lebesgue–Stieltjes integral. For the input process , the corresponding system output then evolves in time as the solution to the stochastic differential equation (5). (Once again, we defer mathematical preciseness to subsequent sections.) From a modelling point of view, the flexibility offered by the general model (5) captures a vast collection of system output stochastic behaviors, as for example the class of strong Markov processes [PP2004]. As mentioned, general stochastic input-output dynamical systems such as the one portrayed by (5) appear in a wide variety of stochastic modelling applications. They are usually obtained by a weak-limit approximation procedure, where a sequence of properly scaled and normalized subjacent stochastic models is considered and shown to converge, in a weak or in-distribution stochastic process convergence sense [PB1999, WW2002, JJAS2003, HK1984], to the solution of a corresponding stochastic differential equation. Just to name a few, some examples are applications to adaptive antennas, channel equalizers, adaptive quantizers, hard limiters, and synchronization systems such as standard phase-locked loops and phase-locked loops with limiters [HK1984]. They have also become extremely useful in heavy-traffic approximations of stochastic networks of queues in operations research and communications [WW2002, SRAM2000, Re1984, W98, KLL2004, HK2001, PR2003, HCDY2001], where they are usually brought into the picture along with the Skorokhod (or reflection) map constraining a given process to stay inside a certain domain or spatial region [WW2002, W2001], and in mathematical economics (option pricing and the Black–Scholes formula, arbitrage theory, consumption and investment problems, insurance and risk theory, etc.) and stochastic control theory [KS1991, S22005, S22005II, BOAS2007].
The so-obtained diffusion^{2}^{2}2A strong Markov process with continuous sample paths is generally termed a diffusion. models offer two main modelling advantages. On one hand, they usually wash out in the limit non-fundamental model details, accounting for mathematical tractability and leading to a diffusion model that captures the main aspects and trade-offs involved. On the other, they have the enormous advantage of taking the modelling setting to the stochastic analysis framework, where the whole machinery of stochastic calculus is available.
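As a concrete illustration of how a path-dependent system of the form (5) can be simulated, the following minimal sketch discretizes an SDE with nonanticipative drift and diffusion functionals by the Euler–Maruyama scheme. The particular functionals A and B below are hypothetical stand-ins chosen only for illustration; they are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def euler_maruyama(A, B, X_path, T=1.0, n=1000, y0=0.0):
    """Simulate dY_t = A_t(X, Y) dt + B_t(Y) dW_t by Euler-Maruyama.

    A and B are nonanticipative functionals: at step k they may look only
    at the paths up to index k (the past), never ahead.
    """
    dt = T / n
    Y = np.empty(n + 1)
    Y[0] = y0
    for k in range(n):
        t = k * dt
        dW = rng.normal(0.0, np.sqrt(dt))
        Y[k + 1] = Y[k] + A(t, X_path[:k + 1], Y[:k + 1]) * dt \
                        + B(t, Y[:k + 1]) * dW
    return Y

# Hypothetical example functionals: drift reverts toward the current input
# value, diffusion depends on the running maximum of |Y| (path dependence).
n = 1000
X_path = np.sin(np.linspace(0.0, 2 * np.pi, n + 1))   # a fixed input trajectory
A = lambda t, x, y: x[-1] - y[-1]
B = lambda t, y: 0.1 * (1.0 + np.max(np.abs(y)))
Y = euler_maruyama(A, B, X_path, n=n)
```

Replacing A and B with a system's actual functionals, and X_path with a sampled input process, yields approximate sample paths of the corresponding output process.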
From a purely communication systems modelling viewpoint, it is worth emphasizing that a general stochastic input-output dynamical system such as (5) encompasses all standard Gaussian communication channel models as particular cases, such as the white Gaussian noise channel (with/without feedback) or its extension to the colored Gaussian noise case. These particular instances will be mathematically described in subsequent sections. It is also worth mentioning that though more sophisticated mathematical frameworks have been considered in the literature, as for example an infinite dimensional Gaussian setting [MZ2005] with the associated Malliavin stochastic analysis tools [DN1995, IS2004], the essentially white Gaussian nature of the noise has remained untouched by most. In this regard, the main tools considered to establish relationships such as (3) and (4) usually depend critically on a Lévy structure^{3}^{3}3Recall a process with stationary independent increments is termed a Lévy process. for the noisy term^{4}^{4}4Following the communication systems jargon, we refer to the integral as the noise term. Further interpretations on this line are discussed in the next section. and, specifically, on its independent increment property, such as in the purely Brownian motion noisy term case where^{5}^{5}5The process is not a Lévy process unless is a fixed constant. (a constant) in (5). The flexibility of an Itô stochastic integral with general functional in (5) allows for a much wider range of stochastic behaviors, including non-Lévy ones.
The main objective of this paper is to establish links between information theory and estimation theory in the general setting of a stochastic input-output dynamical system described by (5). Specifically, it is shown that a relationship analogous to (2) can be written in this setting, thus extending classical Duncan's theorem for standard additive white Gaussian noise channels with and without feedback [D1970, KZZ1971] to this generalized model. Proofs are in the framework of absolute continuity properties of stochastic process measures, underlying Girsanov's theorem [LS1977, PP2004]. Relationships (3) and (4) are also studied in this generalized setting. As mentioned, they were shown to hold in the context of the additive white Gaussian noise channel in the work of Guo et al. [GSV2005]. However, as also pointed out in that work, they fail to hold when feedback is allowed in that purely Gaussian noise framework. We show that this failure stems from the fact that a proper notion of a signal-to-noise ratio expressed through a parameter such as in (1) cannot be properly introduced in that case, and, by adequately identifying conditions for a signal-to-noise ratio parameter to have a meaningful interpretation, we find relationships analogous to (3) and (4) holding for a subclass of models contained in the general setting of (5). The analysis includes the identification and proper definition of three important classes of related systems, namely what we will come to call quasi-signal-to-noise, signal-to-noise, and strong-signal-to-noise systems.
Another aspect adding scope of applicability to the results presented here, in addition to the generality of the system model considered, is that not only are relationships involving time-averaged quantities such as (2) and (3) above extended to this general setting, but their time-instantaneous counterparts are provided as well. This brings dynamical relationships into the picture, allowing one to write general integro-partial-differential equations characterizing the different information- and estimation-theoretic quantities involved. Dynamical relationships are usually absent in the information theory context, being in general difficult to find. The results provided thus extend not only the traditional Gaussian system framework, but also the customary time-independent, static relationships setting where information- and estimation-theoretic quantities are studied for stationary (usually Gaussian) system input processes [CES1949, NW1942, MCYJLJ1955], or for nonstationary system inputs but in terms of time-averaged quantities [D1970, KZZ1971].
Finally, we mention that for the sake of simplicity in the exposition of the results we consider throughout the paper one-dimensional systems and processes. However, all the results presented in the paper have indeed multidimensional counterparts. These and further possible extensions, with the corresponding generalized results, will not be difficult for the reader to carry out in light of the computations developed in the paper, and therefore we will only mention the main ideas involved by the end of the paper without giving the corresponding proofs.
The organization of the paper is as follows. In Section 2 we introduce the mathematically rigorous system model setup, including the model definition, the main general assumptions, and the different information- and estimation-theoretic quantities involved, such as input-output mutual information and causal and noncausal minimum mean-square errors, as well as important concepts from the general theory of stochastic processes such as the absolute continuity of stochastic process measures. In Section 3 we establish the relationship linking input-output mutual information and causal minimum mean-square error for the general dynamical input-output stochastic system considered in the paper, generalizing the known result for the standard additive white Gaussian noise channel with/without feedback. In Section 4 we identify conditions under which a proper notion of a signal-to-noise ratio parameter can be introduced in our general system setting. We distinguish three major subclasses of systems and give appropriate characterizations. In Section 5 we establish the corresponding generalization of the relationship linking input-output mutual information and noncausal minimum mean-square error for an appropriate subclass of system models. In Section 6 we provide the corresponding time-instantaneous counterparts of the previous results. In Section 7 we comment on further model extensions and related results. Finally, in Section 8 we briefly comment on the scope of the results presented.
2 Preliminary Elements
This section provides the precise mathematical framework upon which the present work is elaborated. In addition to introducing a thorough mathematical definition of the dynamical system model to be considered throughout, it also introduces the main concepts from information theory and statistical signal processing appearing in subsequent sections, such as the notion of mutual information between stochastic processes, the accompanying notion of absolute continuity of measures induced by stochastic processes, and minimum mean-square errors in estimating and smoothing stochastic processes.
2.1 System Model Definition
Let be a probability space, be fixed throughout, and be a filtration on , i.e., a nondecreasing family of sub-σ-algebras of . We assume the filtration satisfies the usual hypotheses [PP2004], i.e., contains all the null sets of and is right-continuous. Also, let be a one-dimensional standard Brownian motion^{6}^{6}6The notation indicates the stochastic process is adapted, i.e., is measurable for each . In the case of a Brownian motion , it also indicates is a martingale on that filtration, coinciding then with what is also called in the literature a Wiener process relative to [LS1977]. [KS1991], and be the measurable space of functions in , the space of all functions continuous on , equipped with the σ-algebra of finite-dimensional cylinder sets in [KS1991], i.e.^{7}^{7}7We write, as usual, for the corresponding generated σ-algebra.,
where denotes the collection of Borel sets in , , and
for each , , and . In a similar way we introduce, for each , the σ-algebra of finite-dimensional cylinder sets in the space of all functions continuous on , and, for a given family of functions , the σ-algebras and of finite-dimensional cylinder sets in and , respectively, with
and the restriction of to the subinterval .
For each we consider a stochastic process , with paths or trajectories in the measurable space , and having Itô’s stochastic differential
(6) 
with , where

the stochastic process , with trajectories in a given measurable space of functions , is independent of , and

the functionals and are measurable and nonanticipative, i.e., they are  and measurable^{8}^{8}8Similarly as for , denotes the collection of Borel sets in the interval ., respectively, and, for each , and are  and measurable, respectively, as well. In other words, the functionals and are jointly measurable with respect to (w.r.t.) all their corresponding arguments, and depend at each time on and only through and , i.e., only on the pieces of trajectories
Conditions for properly interpreting as a signal-to-noise ratio (SNR) parameter for system (6) will be discussed in Section 4.
As discussed in Section 1, we may interpret equation (6) as a general stochastic input-output dynamical system with input stochastic process and output stochastic process , for each given value of the parameter , the output process evolving in time as an Itô process [BO1998] with differential given by (6). Though the scope of applicability of a general dynamical system model such as (6) exceeds by far a purely communication system setting, it is worth mentioning that from a classical communication channels point of view we shall interpret as a random input message being imprinted on the “channel signal component” , received at the channel output embedded in the additive “channel noisy term” . The standard additive white Gaussian noise channel (AWGNC) is obtained from (6) by taking
for each , , and , i.e., with the corresponding output process or “random received signal” evolving for according to
(7) 
and the channel SNR^{9}^{9}9The interpretation of as an SNR parameter is discussed in full in Section 4.. Along this same line, note that when in (6) is allowed to depend only on , and not on , the noise term
is a zero-mean Gaussian process with covariance function given by [FC1998]
, provided is square-integrable on , i.e.,
This case is usually known in the literature as the additive colored Gaussian noise channel.
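For a deterministic, square-integrable integrand, the Itô isometry gives the covariance structure just described: the noise term evaluated at times s and t with s less than t has covariance equal to the integral of the squared integrand over [0, s]. A small Monte Carlo sketch (illustrative only; the integrand B(u) = 1 + u is an arbitrary choice, not from the paper) confirms this:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, paths = 1.0, 200, 20000
dt = T / n
t_grid = np.linspace(dt, T, n)
B = 1.0 + t_grid                       # deterministic, square-integrable integrand

# Simulate N_t = \int_0^t B(s) dW_s for many independent paths
dW = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
N = np.cumsum(B * dW, axis=1)

i, j = n // 4 - 1, 3 * n // 4 - 1      # indices for s = T/4, t = 3T/4
emp_cov = np.mean(N[:, i] * N[:, j])   # zero-mean process, so plain product average
s = t_grid[i]
theory = s + s**2 + s**3 / 3.0         # \int_0^s (1+u)^2 du, with s = min(s, t)
assert abs(emp_cov - theory) < 0.05
```

The empirical covariance depends only on the earlier of the two times, as expected from the independent increments of the driving Brownian motion.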
It is technically suitable to treat in (6) as a system input too, as is sometimes the case when the stochastic system at hand is obtained by a weak limit procedure from a properly scaled and normalized sequence of subjacent system models [HK1984, HK2001]. The principle of causality for dynamical systems [KS1991] requires the output process at time , (), to depend only on the values
i.e., only on the past history of and up to time . (This requirement finds a precise mathematical expression in the adaptedness condition (I) imposed below.) Hence the nonanticipative nature imposed on the functionals and .
For a fixed deterministic trajectory in place of in (6), we have the corresponding output stochastic process, denoted as for each , evolving as a solution of the stochastic differential equation (SDE) [PP2004]
(8) 
. When for each and we have and , for some Borel-measurable functions and , is indeed a diffusion process, i.e., a strong Markov process with continuous sample paths on [RY1999]. Though we are of course interested in the general case when the input to the system is a stochastic process as in (6), rather than a fixed trajectory as in (8), we refer to (6) as an SDE system motivated by the above discussion. In fact, for and related as in (6), we may look at as solving the SDE with random drift coefficient
, where the random drift functional is given by
(9) 
for each and . Note that is not only measurable, but also, for each , is measurable, where
, is the history of up to time , i.e., the minimal σ-algebra on making all the random variables measurable.
Throughout we shall assume the following conditions are satisfied.

For each the stochastic process is the pathwise unique strong solution of equation (6) [FG1997, OK2002]. It is strong in the sense that, for each , is measurable w.r.t. the σ-algebra
which represents the joint history of and up to time , i.e., the minimal σ-algebra on making all the random variables measurable. Equivalently, the stochastic process is adapted to the filtration . It is pathwise unique in the sense that if and are two strong solutions of (6), then for all , almost surely, i.e.,
(See Remark 2.1 below for the existence and uniqueness of such a solution.)

The nonanticipative functionals and are such that
for each and .

For each ,
where is the pathwise unique strong solution of the equation
(Existence and uniqueness of follow from condition (III) and [LS1977, Theorem 4.6, p.128].)

For each ,
(13) and
where, for each and ,
the history of up to time . Here, and throughout, denotes conditional expectation, as usual.
Remark 2.1.
If the random drift functional in (9) satisfies Lipschitz and linear growth conditions similar to those imposed on in (III), on an almost sure basis of course, and with and being a random variable and a stochastic process, respectively, then the existence of a pathwise unique strong solution of (6) can be read off from [LS1977, Theorem 4.6, p.128]. We do not explicitly require such conditions though, but just assume the corresponding existence and uniqueness in condition (I).
Remark 2.2.
As the reader will easily verify, all the results in the paper hold if condition (I) is weakened to just ask that, for each , be any strong solution of (6), i.e., to just assume the existence of each as a given strong solution of equation (6). We demand uniqueness in condition (I) for the sake of preciseness, as well as to properly interpret (6) as an input-(unique-)output dynamical system.
As will be detailed in subsequent sections, conditions (I) to (V), as well as the assumed stochastic independence of the processes and , ensure the existence of several densities or Radon–Nikodym derivatives between the measures induced by the stochastic processes involved on their corresponding sample spaces of functions. These Radon–Nikodym derivatives are introduced in the following subsection.
2.2 Absolute Continuity of Stochastic Process Measures
Recall from the previous subsection that the stochastic processes and (each ) have trajectories, or sample paths, in the measurable spaces of functions and , respectively. In the same way, the auxiliary process , introduced previously in condition (IV), has sample paths in the measurable space . We denote by
the corresponding measures they induce on the measurable spaces , , and , respectively. Analogously, we denote by
the (joint) measure induced by the pair of processes in the measurable space .
As mentioned by the end of the previous subsection, and as will be detailed further in subsequent sections, conditions (I) to (V), as well as the assumed stochastic independence of the processes and , ensure the absolute continuity, in fact the mutual absolute continuity, of several of the aforementioned measures, and therefore the existence of the corresponding Radon–Nikodym derivatives. In particular,
(14) 
where, as usual, “” denotes mutual absolute continuity of the corresponding measures and the product measure on obtained from and on and , respectively. From (14) it then follows that
too. We denote the corresponding RadonNikodym derivatives by
, . Note they are , , and measurable functionals, respectively. For product measures, such as for example , the differential is sometimes also written in the literature as .
In addition, for each , we denote by and the measures the restricted processes and induce on , respectively, by
(15) 
the corresponding RadonNikodym derivative, and similarly for all the other measures and processes above. In accordance with our previous notation, we omit in expressions of the form (15) when .
Finally, we denote by
the  and measurable random variables, , obtained from the corresponding substitution of in (15) by each sample path , , of the process , and similarly for all other processes and measures above.
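On a discretized output path, the log of such a Radon–Nikodym derivative between a drifted process and the driving Wiener measure takes, by Girsanov's theorem, the familiar form: the stochastic integral of the drift against the observations minus one half the time integral of the squared drift. The following sketch (an illustration under simplifying assumptions, with a constant drift chosen so that a closed form is available for comparison) shows the discretized estimator reproducing it:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 100000
dt = T / n

def log_rn_derivative(A_vals, dY, dt):
    """Discretized Girsanov log-density: \\int A dY - (1/2) \\int A^2 dt."""
    return np.sum(A_vals * dY) - 0.5 * np.sum(A_vals**2) * dt

# Constant drift a: dY_t = a dt + dW_t.  Closed form: a * Y_T - a^2 T / 2.
a = 0.7
dW = rng.normal(0.0, np.sqrt(dt), n)
dY = a * dt + dW
Y_T = np.sum(dY)
est = log_rn_derivative(np.full(n, a), dY, dt)
assert abs(est - (a * Y_T - 0.5 * a**2 * T)) < 1e-8
```

For a path-dependent drift functional, A_vals would instead be evaluated along the past of the observed path at each grid point, preserving nonanticipativity.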
2.3 InputOutput Mutual Information
Let , be the space of all valued random variables on , and be the space of all having finite expectation, i.e.,
with denoting expectation w.r.t. and the usual measure theoretic convention .
We make the following definition involving the processes and , . Here, and throughout, logarithms are understood to be, without loss of generality, to the natural base , with the convention .
Definition 2.1.
If for each the condition
(16) 
is satisfied^{10}^{10}10Note that, for each , the left hand side of (16) is measurable, therefore measurable too, and hence an element of ., we define the input-output mutual information, , by
(17) 
In the same way, we define the instantaneous input-output mutual information, , by^{11}^{11}11Note condition (16) also implies the well-definedness of .
Note that for each . Note also that we may alternatively write as
, and similarly for , .
Remark 2.3.
For a given input process , changing the value of in (6) changes the output process , and thus changes the right hand side of (17) too. Hence the notation , treating as the variable for a given input process . The notation follows the same reasoning. We find this notation more appealing than, for example, or , especially in identifying the relevant variables to compute quantities such as
in subsequent sections.
Sufficient conditions for (16) to be satisfied will be discussed in subsequent sections.
It is easy to check that and are indeed nonnegative-valued, i.e.,
Definition 2.1 is motivated from the classical definition of mutual information in the context of stochastic processes and stochastic systems [GSV2005, K1956, P1964], such as the AWGNC.
2.4 Minimum MeanSquare Errors
A central role will be played in all the results to be stated in the paper by the measurable nonanticipative functional
given by
for each , , and . Note from condition (III), equation (12), we have , and therefore is well defined.
Remark 2.4.
From condition (V), equation (13), it follows that, for each ,
for Lebesgue-almost every . Since also, from condition (III), equation (12), we have , we conclude that, for each ,
for Lebesgue-almost every too. Therefore, for any sub-σ-algebra of and each , the conditional expectation
is a well-defined and finite measurable random variable (in fact an element of , with denoting the restriction of to [DW1991]), for Lebesgue-almost every as well. By defining it as on the remaining Lebesgue-null subset of , henceforth we treat it as a real-valued function in , for each .
Having made the previous remark, we now introduce the following definition involving the functional introduced above, and the accompanying stochastic processes , .
Definition 2.2.
For each we define the causal minimum mean-square error (CMMSE) in estimating the stochastic process at time from the observations , , denoted , by
Similarly, for each we define the noncausal minimum mean-square error (NCMMSE) in smoothing the stochastic process at time from the observations , , denoted , by
In the same way, and slightly abusing notation, for each , , and we set
the NCMMSE in smoothing the stochastic process at time from the observations , , with and the convention of omitting the first of its three arguments when it equals , i.e., . Note that the quantities just defined differ through the conditioning σ-algebras, and that for each and .
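The distinction between the two error notions, conditioning on the past of the observations versus on the entire observation record, is conveniently illustrated in a discrete-time linear-Gaussian stand-in (not the paper's continuous-time model), where the causal error variances come from the Kalman filter and the noncausal ones from the Rauch–Tung–Striebel smoother; the smoothed variances never exceed the filtered ones:

```python
import numpy as np

# Discrete-time linear-Gaussian stand-in: X_k = phi X_{k-1} + w_k, w ~ N(0, q)
#                                         Y_k = X_k + v_k,       v ~ N(0, r)
phi, q, r, n = 0.9, 0.19, 1.0, 200

# Kalman filter: P_filt[k] is the causal (filtering) error variance at step k
P_pred = np.empty(n); P_filt = np.empty(n)
for k in range(n):
    P_pred[k] = 1.0 if k == 0 else phi**2 * P_filt[k - 1] + q
    K = P_pred[k] / (P_pred[k] + r)    # Kalman gain
    P_filt[k] = (1.0 - K) * P_pred[k]

# Rauch-Tung-Striebel smoother: P_sm[k] is the noncausal (smoothing) variance
P_sm = P_filt.copy()
for k in range(n - 2, -1, -1):
    C = P_filt[k] * phi / P_pred[k + 1]
    P_sm[k] = P_filt[k] + C**2 * (P_sm[k + 1] - P_pred[k + 1])

# Smoothing never does worse than filtering: NCMMSE <= CMMSE pointwise
assert np.all(P_sm <= P_filt + 1e-12)
```

The inequality reflects exactly the nesting of the conditioning σ-algebras in the definitions above: conditioning on more observations can only decrease the mean-square error.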
Remark 2.5.
From Remark 2.4 it follows that, for any sub-σ-algebra of and each ,
is a well defined nonnegative random variable for each , and therefore each of the three quantities introduced in Definition 2.2 above is a well defined valued function of its corresponding arguments, clearly jointly measurable. Note the domain of is the set given by
3 InputOutput Mutual Information and CMMSE
In this section we provide a result relating input-output mutual information, , and CMMSE, , for the general dynamical input-output system (6). The result generalizes the classical Duncan's theorem for AWGNCs with or without feedback [D1970, KZZ1971]. It also provides a general condition guaranteeing the fulfilment of requirement (16) in Definition 2.1.
Theorem 3.1.
Assume that for each we have
Then for each we have
and the following relationship between and ,
(18) 
holds for each as well.
Before giving the proof of the theorem we make the following remark.
Remark 3.1.
Under a finite average power condition
(19) 
it follows that
Indeed, from (19) and condition (III), equation (12), we have
which implies, by standard properties of expectations and conditional expectations for random variables with finite second-order moments [DW1991], and with and , , , that
(20) 
Relationship (18) had been previously proved in the special case of AWGNCs (with or without feedback [D1970, KZZ1971]), and under condition (19).
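A simple closed-form instance of the AWGNC special case of such a relationship (purely illustrative, not taken from the paper) is a single Gaussian random variable X ~ N(0, 1) held constant over [0, T] in the channel (1): the posterior of X given the output up to time t is Gaussian with variance 1/(1 + snr t), so the causal MMSE is deterministic, and integrating it as in Duncan's theorem reproduces the mutual information (1/2) ln(1 + snr T):

```python
import numpy as np

# Constant Gaussian input X ~ N(0, 1) over [0, T] in dY_t = sqrt(snr) X dt + dW_t.
# The posterior variance of X given Y_0^t is 1/(1 + snr t), hence:
def cmmse(t, snr):
    return 1.0 / (1.0 + snr * t)

snr, T, n = 3.0, 2.0, 200000
t = np.linspace(0.0, T, n + 1)
f = cmmse(t, snr)

# Duncan's relationship: I = (snr / 2) * \int_0^T cmmse(t) dt  (trapezoidal rule)
dt_step = T / n
integral = 0.5 * dt_step * np.sum(f[1:] + f[:-1])
duncan_rhs = 0.5 * snr * integral

# Direct mutual information: Y_0^T carries X at effective SNR snr * T, so
# I(X; Y_0^T) = (1/2) ln(1 + snr T) nats.
I = 0.5 * np.log1p(snr * T)
assert abs(duncan_rhs - I) < 1e-6
```

This toy case also shows why the causal error can exceed the noncausal one pointwise while still matching the mutual information on average over [0, T].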
Proof.
Let be fixed throughout the proof. From conditions (I) to (V), the fact that the processes and are independent, and [LS1977, Lemma 7.6, p.292] and [LS1977, Lemma 7.7, p.293], we have that
Therefore
(21) 
too, and, by [LS1977, Theorem 7.23, p.289],
with the right hand side of the above expression equaling