On the Relationship between Mutual Information and Minimum Mean-Square Errors in Stochastic Dynamical Systems

Francisco J. Piera and Patricio Parada

F. J. Piera is with the Department of Electrical Engineering, University of Chile, Av. Tupper 2007, Santiago, 8370451, Chile (e-mail: fpiera@ing.uchile.cl). P. Parada is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1406 W. Green St., Urbana, IL, 61801-2918 USA, and the Department of Electrical Engineering, University of Chile, Av. Tupper 2007, Santiago, 8370451, Chile (e-mail: paradasa@uiuc.edu).
October 4, 2007
Abstract

We consider a general stochastic input-output dynamical system whose output evolves in time as the solution to an Itô stochastic differential equation with functional coefficients, excited by an input process. This general class of stochastic systems encompasses not only the classical communication channel models, but also a wide variety of engineering systems appearing across a whole range of applications. For this general setting we find analogues of known relationships linking input-output mutual information and causal and non-causal minimum mean-square errors, previously established in the context of additive Gaussian noise communication channels. Relationships are established not only in terms of time-averaged quantities; their time-instantaneous, dynamical counterparts are presented as well. The problem of appropriately introducing in this general framework a signal-to-noise ratio notion, expressed through a signal-to-noise ratio parameter, is also taken into account, identifying conditions for a proper and meaningful interpretation.

Index Terms. Stochastic dynamical systems, stochastic differential equations (SDE), mutual information, minimum mean square errors (MMSE), non-linear estimation, smoothing, optimal filtering.

1 Introduction

Consider the widely used communication system model known as the standard additive white Gaussian noise channel, described by

(1)

where is the signal-to-noise ratio parameter, is a fixed time-horizon, is the transmitted random signal or channel input, is an independent standard Brownian motion or Wiener process representing the noisy transmission environment, and is the received random signal or channel output, corresponding to the respective value of the signal-to-noise ratio parameter .
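As a concrete illustration of the channel just described, the following minimal sketch simulates one discretized output path. It assumes the standard form of the model, Y_t = sqrt(snr) * int_0^t X_s ds + W_t on [0, T]; the names `simulate_awgn_output`, `snr`, `n_steps` and the sinusoidal input are hypothetical choices made for illustration only.

```python
import numpy as np

def simulate_awgn_output(x, snr, T, rng):
    """Discretized path of Y_t = sqrt(snr) * int_0^t X_s ds + W_t on a uniform grid.

    x holds the input samples at the left endpoints t_0, ..., t_{n-1}.
    Returns Y at the grid points t_0 = 0, t_1, ..., t_n = T.
    """
    n = len(x)
    dt = T / n
    dw = rng.normal(0.0, np.sqrt(dt), size=n)   # Brownian increments over each step
    dy = np.sqrt(snr) * x * dt + dw             # signal part plus noise part
    return np.concatenate(([0.0], np.cumsum(dy)))

rng = np.random.default_rng(0)
T, n_steps, snr = 1.0, 1000, 4.0                # hypothetical values
t = np.linspace(0.0, T, n_steps, endpoint=False)
x = np.sin(2 * np.pi * t)                       # hypothetical input signal
y = simulate_awgn_output(x, snr, T, rng)
```

Setting `snr` to zero leaves only the Brownian path, which makes the role of the parameter as a gain on the signal component easy to see.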

Of central importance from an information theoretical point of view is the input-output mutual information, i.e., the mutual information between the processes and , denoted by . (Precise mathematical definitions are deferred to the next section.) On the other hand, of central importance from an estimation theoretical point of view are the causal and non-causal minimum mean square errors, in estimating or smoothing at time , denoted by and , respectively. Input-output mutual information provides a measure of how much coded information can be reliably transmitted through the channel for the given input source, whereas the causal and non-causal minimum mean square errors indicate the level of accuracy that can be reached in the estimation of the transmitted message at the receiver, based on the causal or non-causal observation of an output sample path, respectively.

Interesting results on the relationship between filter maps and likelihood ratios in the context of the additive white Gaussian noise channel have been available in the literature for a while (see for example [RMAB1995] and references therein). An interesting specific result linking information theory and estimation theory in this same Gaussian channel context, concretely, input-output mutual information and causal minimum mean square error, is Duncan’s theorem [D1970] stating, under appropriate finite average power conditions, the relationship

(2)

i.e., after dividing both sides by , stating the proportionality (through the factor ) of the mutual information rate per unit time and the time-averaged causal minimum mean square error. It was recently shown by Guo et al. [GSV2005] that the previous relationship is not the only link between information theory and estimation theory in this Gaussian channel setting: there also exists an important result involving input-output mutual information and non-causal minimum mean square error, namely

(3)

As pointed out by Guo et al. [GSV2005], an interesting relationship between causal and non-causal minimum mean square errors can then be directly deduced from (2) and (3), giving

(4)

i.e., after dividing, as before, both sides by , stating the equality between the time-averaged causal minimum mean square error and the time-averaged non-causal minimum mean square error, the latter averaged in turn over the signal-to-noise ratio. Equations (2) to (4) can, for example, be used to study asymptotics of input-output mutual information and minimum mean square errors, and to find new representations of information measures [GSV2005].
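Relationships (2) to (4) can be checked numerically in a case where everything is available in closed form. The sketch below assumes their standard forms for the additive white Gaussian noise channel, namely I(snr) = (snr/2) * int_0^T cmmse(t, snr) dt, dI/dsnr = (1/2) * int_0^T mmse(t, snr) dt, and int_0^T cmmse(t, snr) dt = (1/snr) * int_0^snr int_0^T mmse(t, g) dt dg, together with a hypothetical input that is a single standard Gaussian random variable held constant in time. For that input, I(snr) = 0.5 * log(1 + snr * T), cmmse(t, snr) = 1 / (1 + snr * t) and mmse(t, snr) = 1 / (1 + snr * T). All function and variable names are illustrative.

```python
import numpy as np

T, snr = 2.0, 3.0   # hypothetical horizon and SNR value

def mutual_info(s):
    """I(s) = 0.5 * log(1 + s*T) in nats, for the constant Gaussian input."""
    return 0.5 * np.log1p(s * T)

def cmmse(t, s):
    """Causal error: variance of X given the output on [0, t]."""
    return 1.0 / (1.0 + s * t)

def mmse(t, s):
    """Non-causal error: variance of X given the output on [0, T] (constant in t)."""
    return 1.0 / (1.0 + s * T) + 0.0 * t

def trapezoid(values, grid):
    """Trapezoidal rule along the last axis."""
    return np.sum(0.5 * (values[..., 1:] + values[..., :-1]) * np.diff(grid), axis=-1)

t_grid = np.linspace(0.0, T, 2001)
g_grid = np.linspace(0.0, snr, 2001)

# (2): I(snr) versus (snr / 2) * int_0^T cmmse(t, snr) dt
int_cmmse = trapezoid(cmmse(t_grid, snr), t_grid)
duncan_gap = abs(mutual_info(snr) - 0.5 * snr * int_cmmse)

# (3): dI/dsnr (central difference) versus (1 / 2) * int_0^T mmse(t, snr) dt
h = 1e-5
dI = (mutual_info(snr + h) - mutual_info(snr - h)) / (2.0 * h)
gsv_gap = abs(dI - 0.5 * trapezoid(mmse(t_grid, snr), t_grid))

# (4): int_0^T cmmse dt versus (1 / snr) * int_0^snr int_0^T mmse(t, g) dt dg
inner = trapezoid(mmse(t_grid[None, :], g_grid[:, None]), t_grid)
avg_gap = abs(int_cmmse - trapezoid(inner, g_grid) / snr)

print(duncan_gap, gsv_gap, avg_gap)
```

The three printed gaps should be small, limited only by the quadrature and finite-difference step sizes.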

An increasing necessity of considering general stochastic models has arisen during the last decades in the stochastic systems modelling community, not just from a communication systems standpoint, but from a wide variety of applications demanding the consideration of general stochastic input-output dynamical systems described by Itô’s stochastic differential equations of the form

(5)

with the input stochastic process to the system, a non-negative real parameter (to be interpreted further in subsequent sections), the corresponding system output stochastic process (to ease notation we simply write , instead of for example , the input process being clear from the context), and and given (time-varying) non-anticipative functionals, i.e., with depending on the random paths of and only up to time , and similarly for . Note that, since is an infinite variation process, the integral

is an Itô stochastic integral and not a standard pathwise Lebesgue-Stieltjes integral. For the input process , the corresponding system output evolves in time then as the solution to the stochastic differential equation (5). (Once again, we defer mathematical precision to subsequent sections.) From a modelling point of view, the flexibility offered by the general model (5) captures a vast collection of system output stochastic behaviors, as for example the class of strong Markov processes [PP2004]. As mentioned, general stochastic input-output dynamical systems such as the one portrayed by (5) appear in a wide variety of stochastic modelling applications. They are usually obtained by a weak-limit approximation procedure, where a sequence of properly scaled and normalized underlying stochastic models is considered and shown to converge, in the sense of weak convergence (convergence in distribution) of stochastic processes [PB1999, WW2002, JJAS2003, HK1984], to the solution of a corresponding stochastic differential equation. Just to name a few, some examples are applications to adaptive antennas, channel equalizers, adaptive quantizers, hard limiters, and synchronization systems such as standard phase-locked loops and phase-locked loops with limiters [HK1984]. They have also become extremely useful in heavy-traffic approximations of stochastic networks of queues in operations research and communications [WW2002, SRAM2000, Re1984, W98, KLL2004, HK2001, PR2003, HCDY2001], where they are usually brought into the picture along with the Skorokhod (or reflection) map constraining a given process to stay inside a certain domain or spatial region [WW2002, W2001], and in mathematical economics (option pricing and the Black-Scholes formula, arbitrage theory, consumption and investment problems, insurance and risk theory, etc.) and stochastic control theory [KS1991, S22005, S22005II, BOAS2007].

The diffusion models so obtained (a strong Markov process with continuous sample paths is generally termed a diffusion) offer two main modelling advantages. On the one hand, they usually wash out, in the limit, non-fundamental model details, which provides mathematical tractability and leads to a diffusion model that captures the main aspects and trade-offs involved. On the other, they have the enormous advantage of taking the modelling setting to the stochastic analysis framework, where the whole machinery of stochastic calculus is available.

From a purely communication systems modelling viewpoint, it is worth emphasizing that a general stochastic input-output dynamical system such as (5) encompasses all standard communication Gaussian channel models as particular cases, such as the white Gaussian noise channel (with/without feedback) or its extension to the colored Gaussian noise case. These particular instances will be mathematically described in subsequent sections. It is also worth mentioning that though more sophisticated mathematical frameworks have been considered in the literature, as for example an infinite dimensional Gaussian setting [MZ2005] with the associated Malliavin’s stochastic analysis tools [DN1995, IS2004], the essentially white Gaussian nature of the noise has remained untouched by most. In this regard, the main tools considered to establish relationships such as (3) and (4) usually depend critically on a Lévy structure for the noisy term (recall that a process with stationary independent increments is termed a Lévy process; following the communication systems jargon, we refer to the integral as the noise term, and further interpretations along this line are discussed in the next section) and, specifically, on its independent increment property, such as in the purely Brownian motion noisy term case where (a constant) in (5) (the process is not a Lévy process unless is a fixed constant). The flexibility of an Itô stochastic integral with general functional in (5) allows for a much greater generality of stochastic behaviors, including non-Lévy ones.

The main objective of this paper is to establish links between information theory and estimation theory in the general setting of a stochastic input-output dynamical system described by (5). Specifically, it is shown that a relationship analogous to (2) can be written in this setting, thus extending the classical Duncan’s theorem for standard additive white Gaussian noise channels with and without feedback [D1970, KZZ1971] to this generalized model. Proofs rely on absolute continuity properties of stochastic process measures, which underlie Girsanov’s theorem [LS1977, PP2004]. Relationships (3) and (4) are also studied in this generalized setting. As mentioned, they were shown to hold in the context of the additive white Gaussian noise channel in the work of Guo et al. [GSV2005]. However, as also pointed out in that work, they fail to hold when feedback is allowed in that purely Gaussian noise framework. We show that this failure is due to the fact that a proper notion of a signal-to-noise ratio expressed through a parameter such as in (1) cannot be properly introduced in that case, and, by adequately identifying conditions for a signal-to-noise ratio parameter to have a meaningful interpretation, we find relationships analogous to (3) and (4) holding for a subclass of models contained in the general setting of (5). The analysis includes the identification and proper definition of three important classes of related systems, namely what we will come to call quasi-signal-to-noise, signal-to-noise and strong-signal-to-noise systems.

Another aspect broadening the applicability of the results presented in this paper, in addition to the generality of the system model considered here, is that not only are relationships involving time-averaged quantities such as in (2) and (3) above extended to this general setting, but time-instantaneous counterparts are provided as well. This brings dynamical relationships into the picture, allowing one to write general integro-partial-differential equations characterizing the different information and estimation theoretic quantities involved. Dynamical relationships are usually absent in the information theory context, being in general difficult to find. The results provided thus extend not only the traditional Gaussian system framework, but also the customary time-independent, static relationships setting where information and estimation theoretic quantities are studied for stationary (usually Gaussian) system input processes [CES1949, NW1942, MCYJLJ1955], or for non-stationary system inputs but in terms of time-averaged quantities [D1970, KZZ1971].

Finally, we mention that, for the sake of simplicity in the exposition of the results, we will consider throughout the paper one-dimensional systems and processes. However, all the results presented in the paper have multi-dimensional counterparts. These and further possible extensions, with the corresponding related generalized results, will not be difficult for the reader to carry out in light of the computations developed in the paper, and therefore we will only mention the main ideas involved at the end of the paper without giving the corresponding proofs.

The organization of the paper is as follows. In Section 2 we introduce the mathematically rigorous system model setup, including the model definition, the main general assumptions, and the different information and estimation theoretic quantities involved, such as input-output mutual information and causal and non-causal minimum mean square errors, as well as important concepts from the general theory of stochastic processes such as the absolute continuity of stochastic process measures. In Section 3 we establish the relationship linking input-output mutual information and causal minimum mean square error for the general dynamical input-output stochastic system considered in the paper, generalizing the known result for the standard additive white Gaussian noise channel with/without feedback. In Section 4 we identify conditions under which a proper notion of a signal-to-noise ratio parameter can be introduced in our general system setting. We distinguish three major subclasses of systems and give appropriate characterizations. In Section 5 we establish the corresponding generalization of the relationship linking input-output mutual information and non-causal minimum mean square error for an appropriate subclass of system models. In Section 6 we provide the corresponding time-instantaneous counterparts of the previous results. In Section 7 we comment on further model extensions and related results. Finally, in Section 8 we briefly comment on the scope of the results presented.

2 Preliminary Elements

This section provides the precise mathematical framework upon which the present work is built. In addition to introducing a rigorous mathematical definition of the dynamical system model to be considered throughout, it also introduces the main concepts from information theory and statistical signal processing appearing in subsequent sections, such as the notion of mutual information between stochastic processes, the accompanying notion of absolute continuity of measures induced by stochastic processes, and minimum mean-square errors in estimating and smoothing stochastic processes.

2.1 System Model Definition

Let be a probability space, be fixed throughout, and be a filtration on , i.e., a nondecreasing family of sub-σ-algebras of . We assume the filtration satisfies the usual hypotheses [PP2004], i.e., contains all the -null sets of and is right-continuous. Also, let be a one-dimensional standard Brownian motion [KS1991] (the notation indicates the stochastic process is -adapted, i.e., is -measurable for each ; in the case of a Brownian motion , it also indicates is a martingale on that filtration, coinciding then with what is also called in the literature a Wiener process relative to [LS1977]), and be the measurable space of functions in , the space of all functions continuous on , equipped with the σ-algebra of finite-dimensional cylinder sets in [KS1991], i.e. (we write, as usual, for the corresponding generated σ-algebra),

where denotes the collection of Borel sets in , , and

for each , , and . In a similar way we introduce, for each , the σ-algebra of finite-dimensional cylinder sets in the space of all functions continuous on , and, for a given family of functions , the σ-algebras and of finite-dimensional cylinder sets in and , respectively, with

and the restriction of to the subinterval .

For each we consider a stochastic process , with paths or trajectories in the measurable space , and having Itô’s stochastic differential

(6)

with , where

  • the stochastic process , with trajectories in a given measurable space of functions , is independent of , and

  • the functionals and are measurable and non-anticipative, i.e., they are - and -measurable, respectively (similarly as for , denotes the collection of Borel sets in the interval ), and, for each , and are - and -measurable, respectively, as well. In other words, the functionals and are jointly measurable with respect to (w.r.t.) all their corresponding arguments, and depend at each time on and only through and , i.e., only on the pieces of trajectories

Conditions for properly interpreting as a signal-to-noise ratio (SNR) parameter for system (6) will be discussed in Section 4.
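The non-anticipative structure described above can be made concrete with a discretized simulation. The following is a minimal Euler-Maruyama sketch, assuming the system has the form dY_t = sqrt(delta) * a_t(X, Y) dt + b_t(Y) dW_t; this shape, the symbol `delta`, and the names `euler_maruyama`, `a_awgn` and `b_awgn` are assumptions made for the sketch. The functionals receive only the path prefixes up to the current grid index, which is the discrete counterpart of non-anticipativity.

```python
import numpy as np

def euler_maruyama(a, b, x, delta, T, rng):
    """Euler-Maruyama path for dY = sqrt(delta) * a_t(X, Y) dt + b_t(Y) dW (assumed form).

    a(k, x_hist, y_hist) and b(k, y_hist) only see the paths up to grid index k.
    """
    n = len(x)
    dt = T / n
    dw = rng.normal(0.0, np.sqrt(dt), size=n)   # Brownian increments
    y = np.zeros(n + 1)
    for k in range(n):
        drift = np.sqrt(delta) * a(k, x[:k + 1], y[:k + 1])
        y[k + 1] = y[k] + drift * dt + b(k, y[:k + 1]) * dw[k]
    return y

# Special case: a_t(x, y) = x_t and b_t(y) = 1, i.e. a white Gaussian noise channel.
def a_awgn(k, x_hist, y_hist):
    return x_hist[-1]

def b_awgn(k, y_hist):
    return 1.0

n_steps, T = 500, 1.0                             # hypothetical values
t = np.linspace(0.0, T, n_steps, endpoint=False)
x = np.cos(2 * np.pi * t)                         # hypothetical input path
y = euler_maruyama(a_awgn, b_awgn, x, 4.0, T, np.random.default_rng(1))
```

Feedback or state-dependent noise can be explored by letting `a` or `b` read `y_hist`, for instance `lambda k, y_hist: 1.0 + 0.1 * abs(y_hist[-1])` as a hypothetical diffusion coefficient.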

As discussed in Section 1, we may interpret equation (6) as a general stochastic input-output dynamical system with input stochastic process and output stochastic process , for each given value of the parameter , the output process evolving in time as an Itô process [BO1998] with differential given by (6). Though the scope of applicability of a general dynamical system model such as (6) far exceeds a purely communication system setting, it is worth mentioning that from a classical communication channels point of view we shall interpret as a random input message imprinted on the “channel signal component” , received at the channel output embedded in the additive “channel noisy term” . The standard additive white Gaussian noise channel (AWGNC) is obtained from (6) by taking

for each , , and , i.e., with the corresponding output process or “random received signal” evolving for according to

(7)

and the channel SNR (the interpretation of as an SNR parameter is discussed in full in Section 4). Along the same lines, note that when in (6) is allowed to depend only on , and not on , the noisy term

is a zero-mean Gaussian process with covariance function given by [FC1998]

, provided is square-integrable on , i.e.,

This case is usually known in the literature as the additive colored Gaussian noise channel.
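For a deterministic, square-integrable coefficient, the Itô isometry gives the covariance of the noise term N_t = int_0^t b_u dW_u as Cov(N_s, N_t) = int_0^min(s,t) b_u^2 du. The sketch below checks this by Monte Carlo for the hypothetical choice b(u) = 1 + u, for which the integral equals ((1 + m)^3 - 1) / 3 with m = min(s, t). The function name, grid sizes and sample count are illustrative assumptions.

```python
import numpy as np

def colored_noise_paths(b, T, n_steps, n_paths, rng):
    """Sample paths of N_t = int_0^t b(u) dW_u for a deterministic coefficient b.

    Column k of the result approximates N at time (k + 1) * T / n_steps.
    """
    dt = T / n_steps
    t_mid = (np.arange(n_steps) + 0.5) * dt          # midpoints of the grid cells
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    return np.cumsum(b(t_mid) * dw, axis=1)

rng = np.random.default_rng(0)
b = lambda u: 1.0 + u                                # hypothetical coefficient
paths = colored_noise_paths(b, 1.0, 200, 20000, rng)

n_s, n_t = paths[:, 99], paths[:, 199]               # N at s = 0.5 and t = 1.0
empirical = float(np.mean(n_s * n_t))                # the process has zero mean
theoretical = ((1.0 + 0.5) ** 3 - 1.0) / 3.0         # int_0^0.5 (1 + u)^2 du
print(empirical, theoretical)
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as the number of paths grows.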

It is technically suitable to treat in (6) as a system input too, as is sometimes the case when the stochastic system at hand is obtained by a weak limit procedure of a properly scaled and normalized sequence of underlying system models [HK1984, HK2001]. The principle of causality for dynamical systems [KS1991] requires the output process at time , (), to depend only on the values

i.e., only on the past history of and up to time . (This requirement finds a precise mathematical expression in the adaptability condition (I) imposed below.) Hence the non-anticipative nature imposed on the functionals and .

For a fixed deterministic trajectory in place of in (6), we have the corresponding output stochastic process, denoted as for each , evolving as a solution of the stochastic differential equation (SDE) [PP2004]

(8)

. When for each and we have and , for some Borel-measurable functions and , is indeed a diffusion process, i.e., a strong Markov process with continuous sample paths on [RY1999]. Though we are of course interested in the general case when the input to the system is a stochastic process as in (6), rather than a fixed trajectory as in (8), we refer to (6) as an SDE system, motivated by the above discussion. In fact, for and related as in (6), we may look at as solving the SDE with random drift coefficient

, where the random drift functional is given by

(9)

for each and . Note that is not only -measurable, but also, for each , is -measurable, where

, is the history of up to time , i.e., the minimal σ-algebra on making all the random variables measurable.

Throughout we shall assume the following conditions are satisfied.

  (I) For each the stochastic process is the pathwise unique strong solution of equation (6) [FG1997, OK2002]. It is strong in the sense that, for each , is measurable w.r.t. the σ-algebra

    which represents the joint history of and up to time , i.e., the minimal σ-algebra on making all the random variables measurable. Equivalently, the stochastic process is adapted to the filtration . It is pathwise unique in the sense that if and are two strong solutions of (6), then for all , -almost surely, i.e.,

    (See Remark 2.1 below for the existence and uniqueness of such a solution.)

  (II) The non-anticipative functionals and are such that

    for each and .

  (III) For each and ,

    (10)
    (11)

    and

    (12)

    where is a non-decreasing, right-continuous function satisfying for each , and , and are finite constants. Equations (10), (11) and (12) correspond to Lipschitz, linear growth and non-degeneracy conditions on the non-anticipative functional , respectively.

  (IV) For each ,

    where is the pathwise unique strong solution of the equation

    (Existence and uniqueness of follow from condition (III) and [LS1977, Theorem 4.6, p.128].)

  (V) For each ,

    (13)

    and

    where, for each and ,

    the history of up to time . Here, and throughout, denotes conditional expectation, as usual.

Remark 2.1.

If the random drift functional in (9) satisfies Lipschitz and linear growth conditions similar to those imposed on in (III), on a -almost sure basis of course and with and and random variables and stochastic process, respectively, then the existence of a pathwise unique strong solution of (6) can be read off from [LS1977, Theorem 4.6, p.128]. We do not explicitly require such conditions, though, but simply assume the corresponding existence and uniqueness in condition (I).

Remark 2.2.

As the reader will easily verify, all the results in the paper hold if condition (I) is weakened to just ask that, for each , is any strong solution of (6), i.e., to just assume the existence of each as any given strong solution of equation (6). We demand uniqueness in condition (I) for the sake of precision, as well as to properly interpret (6) as an input-(unique)output dynamical system.

As will be detailed in subsequent sections, conditions (I) to (V), as well as the assumption on the stochastic independence of processes and , ensure the existence of several densities or Radon-Nikodym derivatives between the measures induced by the stochastic processes involved on their corresponding sample spaces of functions. These Radon-Nikodym derivatives are introduced in the following subsection.

2.2 Absolute Continuity of Stochastic Process Measures

Recall from the previous subsection that the stochastic processes and (each ) have trajectories, or sample paths, in the measurable spaces of functions and , respectively. In the same way, the auxiliary process , introduced previously in condition (IV), has sample paths in the measurable space . We denote by

the corresponding measures they induce on the measurable spaces , , and , respectively. Analogously, we denote by

the (joint) measure induced by the pair of processes in the measurable space .

As mentioned at the end of the previous subsection, and as will be detailed further in subsequent sections, conditions (I) to (V), as well as the assumption on the stochastic independence of processes and , ensure the absolute continuity, in fact the mutual absolute continuity, of several of the aforementioned measures, and therefore the existence of the corresponding Radon-Nikodym derivatives. In particular,

(14)

where, as usual, “” denotes mutual absolute continuity of the corresponding measures and the product measure in obtained from and in and , respectively. From (14) it then follows that

too. We denote the corresponding Radon-Nikodym derivatives by

, . Note they are -, -, and -measurable functionals, respectively. For product measures, such as for example , the differential is sometimes written in the literature also as .

In addition, for each , we denote by and the measures the restricted processes and induce on , respectively, by

(15)

the corresponding Radon-Nikodym derivative, and similarly for all the other measures and processes above. In accordance with our previous notation, we omit in expressions of the form (15) when .

Finally, we denote by

the - and -measurable random variables, , obtained from the corresponding substitution of in (15) by each sample path , , of the process , and similarly for all other processes and measures above.

2.3 Input-Output Mutual Information

Let , be the space of all -valued random variables on , and be the space of all having finite expectation, i.e.,

with denoting expectation w.r.t. and the usual measure theoretic convention .

We make the following definition involving the processes and , . Here, and throughout, logarithms are understood to be, without loss of generality, to the natural base , with the convention .

Definition 2.1.

If for each the condition

(16)

is satisfied (note that, for each , the left hand side of (16) is -measurable, therefore -measurable too, and hence an element of ), we define the input-output mutual information, , by

(17)

In the same way, we define the instantaneous input-output mutual information, , by (note that condition (16) also implies the well-definedness of )

Note that for each . Note also that we may alternatively write as

, and similarly for , .

Remark 2.3.

For a given input process , changing the value of in (6) changes the output process , and thus changes the right hand side of (17) too. Hence the notation , which treats as the variable for a given input process . The notation follows the same reasoning. We find this notation more appealing than, for example, or , especially in identifying the relevant variables when computing quantities such as

in subsequent sections.

Sufficient conditions for (16) to be satisfied will be discussed in subsequent sections.

It is easy to check that and are indeed non-negative-valued, i.e.,

Definition 2.1 is motivated by the classical definition of mutual information in the context of stochastic processes and stochastic systems [GSV2005, K1956, P1964], such as the AWGNC.
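The classical definition expresses mutual information as the expected logarithm of the density of the joint measure with respect to the product of the marginals. A finite-alphabet analogue makes the structure easy to inspect; the sketch below is only such an analogue and is not the process-level definition used in the paper. The function name and the binary symmetric channel example are hypothetical, and the usual convention 0 log 0 = 0 is assumed.

```python
import numpy as np

def mutual_information(joint):
    """E[log(p_xy / (p_x * p_y))] in nats for a finite joint pmf (rows: x, columns: y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)    # marginal of x
    py = joint.sum(axis=0, keepdims=True)    # marginal of y
    mask = joint > 0                         # convention 0 * log 0 = 0
    ratio = joint[mask] / (px * py)[mask]    # discrete Radon-Nikodym derivative
    return float(np.sum(joint[mask] * np.log(ratio)))

eps = 0.1                                    # hypothetical crossover probability
joint_bsc = 0.5 * np.array([[1 - eps, eps], [eps, 1 - eps]])
info = mutual_information(joint_bsc)
print(info)
```

For a product joint distribution the ratio is identically one and the function returns zero, mirroring the non-negativity noted above.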

2.4 Minimum Mean-Square Errors

A central role will be played in all the results to be stated in the paper by the measurable non-anticipative functional

given by

for each , , and . Note from condition (III), equation (12), we have , and therefore is well defined.

Remark 2.4.

From condition (V), equation (13), it follows that, for each ,

for Lebesgue almost-every . Since also, from condition (III), equation (12), we have , we conclude that, for each ,

for Lebesgue-almost every too. Therefore, for any sub-σ-algebra of and each the conditional expectation

is a well defined and finite -measurable random variable (in fact an element of with denoting the restriction of to [DW1991]), for Lebesgue-almost every as well. By defining it as on the remaining Lebesgue-null subset of , henceforth we treat it as a real-valued function in , for each .

Having made the previous remark, we now introduce the following definition involving the above introduced functional , and the accompanying stochastic processes , .

Definition 2.2.

For each we define the causal minimum mean-square error (CMMSE) in estimating the stochastic process at time from the observations , , denoted , by

Similarly, for each we define the non-causal minimum mean-square error (NCMMSE) in smoothing the stochastic process at time from the observations , , denoted , by

In the same way, and slightly abusing notation, for each , , and we set

the NCMMSE in smoothing the stochastic process at time from the observations , , with and the convention of omitting the first of its three arguments when it equals , i.e., . Note that the quantities just defined differ through the conditioning -algebras, and that for each and .
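The difference between causal and non-causal errors can be illustrated by Monte Carlo in the white Gaussian noise special case. The sketch assumes the output Y_t = sqrt(snr) * X * t + W_t with a hypothetical input X that is standard Gaussian and constant in time, so that the quantity being estimated is X itself. In that case the conditional mean of X given the output up to time u is sqrt(snr) * Y_u / (1 + snr * u), with error 1 / (1 + snr * u). All names and numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
snr, t, T, n = 3.0, 0.5, 1.0, 200_000                 # hypothetical values

x = rng.normal(size=n)                                # X ~ N(0, 1), held constant in time
w_t = rng.normal(0.0, np.sqrt(t), size=n)             # W_t
w_T = w_t + rng.normal(0.0, np.sqrt(T - t), size=n)   # W_T = W_t + independent increment
y_t = np.sqrt(snr) * x * t + w_t                      # output at time t
y_T = np.sqrt(snr) * x * T + w_T                      # output at time T

# Conditional means of X given the output up to t (causal) and up to T (non-causal).
x_causal = np.sqrt(snr) * y_t / (1.0 + snr * t)
x_smooth = np.sqrt(snr) * y_T / (1.0 + snr * T)

cmmse_hat = float(np.mean((x - x_causal) ** 2))       # estimates 1 / (1 + snr * t)
ncmmse_hat = float(np.mean((x - x_smooth) ** 2))      # estimates 1 / (1 + snr * T)
print(cmmse_hat, ncmmse_hat)
```

Conditioning on the longer observation window can only reduce the error, which is why the smoothed estimate is the more accurate of the two.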

Remark 2.5.

From Remark 2.4 it follows that, for any sub-σ-algebra of and each ,

is a well defined non-negative random variable for each , and therefore each of the three quantities introduced in Definition 2.2 above is a well defined -valued function of its corresponding arguments, clearly jointly measurable. Note the domain of is the set given by

3 Input-Output Mutual Information and CMMSE

In this section we provide a result relating input-output mutual information, , and CMMSE, , for the general dynamical input-output system (6). The result generalizes the classical Duncan’s theorem for AWGNCs with or without feedback [D1970, KZZ1971]. It also provides a general condition guaranteeing the fulfilment of requirement (16) in Definition 2.1.

Theorem 3.1.

Assume that for each we have

Then for each we have

and the following relationship between and ,

(18)

holds for each as well.

Before giving the proof of the theorem we make the following remark.

Remark 3.1.

Under a finite average power condition

(19)

it follows that

Indeed, from (19) and condition (III), equation (12), we have

which implies, by standard properties of expectations and conditional expectations for finite second order moment random variables [DW1991], and with and , , , that

(20)

Relationship (18) had been previously proved in the special case of AWGNCs (with or without feedback [D1970, KZZ1971]), and under condition (19).

Proof.

Let be fixed throughout the proof. From conditions (I) to (V), the fact that the processes and are independent, and [LS1977, Lemma 7.6, p.292] and [LS1977, Lemma 7.7, p.293], we have that

Therefore

(21)

too, and, by [LS1977, Theorem 7.23, p.289],

with the right hand side of the above expression equaling