Hidden Markov Models with
Multiple Observation Processes
James Yuanjie Zhao
Submitted in total fulfilment of the requirements
of the degree of Master of Philosophy
Submitted 18 August 2010
Revised 18 January 2011
Department of Mathematics and Statistics
Department of Electrical and Electronic Engineering
University of Melbourne
We consider a hidden Markov model with multiple observation processes, one of which is chosen at each point in time by a policy—a deterministic function of the information state—and attempt to determine which policy minimises the limiting expected entropy of the information state. Focusing on a special case, we prove analytically that the information state always converges in distribution, and derive a formula for the limiting entropy which can be used for calculations with high precision. Using this formula, we find computationally that the optimal policy is always a threshold policy, allowing it to be easily found. We also find that the greedy policy is almost optimal.
This is to certify that:
The thesis comprises only my original work towards the MPhil;
Due acknowledgement has been made to all other material used; and
The thesis is less than 50,000 words in length.
My deepest and sincerest appreciation goes to my supervisors, Bill Moran and Peter Taylor, for their countless hours of guidance, both in relation to this thesis and in more general matters.
List of Figures
- 2.1 A hidden Markov model
- 3.1 Space of all threshold policies
- 3.2 Example of a Region I policy
- 3.3 Example of a Region II policy
- 3.4 Example of a Region III policy
- 3.5 Example of a Region IV policy
- 3.6 Example of a Region V policy
- 3.7 Example of a Region VI policy
- 3.8 Location of the optimal threshold
- 3.9 Proof of Proposition 3.8
- 3.10 Locally optimal policies
- 3.11 Optimality of the greedy policy
A hidden Markov model is an underlying Markov chain together with an imperfect observation on this chain. In the case of multiple observations, the classical model assumes that they can be observed simultaneously, and considers them as a single vector of observations. However, the case where not all the observations can be used at each point in time often arises in practical problems, and in this situation, one is faced with the challenge of choosing which observation to use.
We consider the case where the choice is made as a deterministic function of the previous information state, which is a sufficient statistic for the sequence of past observations. This function is called the policy; we rank policies according to the information entropy of the information state that arises under each policy.
Our main results are:
The information state converges in distribution for almost every underlying Markov chain, as long as each observation process gives a perfect information observation with positive probability;
In a special case (see Section 2.3 for a precise definition), we can write down the limiting entropy explicitly as a rational function of subgeometric infinite series, which allows the calculation of limiting entropy to very good precision;
Computational results suggest that the optimal policy is a threshold policy, hence finding the optimal threshold policy is sufficient for finding the optimal policy in general;
Finding a locally optimal threshold policy is also sufficient, while finding a locally optimal general policy is sufficient with average probability 0.98; and
The greedy policy is optimal 96% of the time, and close to optimal the remaining times, giving a very simple yet reasonably effective suboptimal alternative.
The theory of hidden Markov models was first introduced in a series of papers from 1966 by Leonard Baum and others, under the more descriptive name of Probabilistic Functions of Markov Chains. An application of this theory was soon found in speech recognition, spurring development, and the three main problems—probability calculation, state estimation and parameter estimation—had essentially been solved by the time of Lawrence Rabiner's influential 1989 tutorial paper.
The standard hidden Markov model consists of an underlying state which is described by a Markov chain, and an imperfect observation process which is a probabilistic function of this underlying state. In most practical examples, this single observation is equivalent to having multiple observations, since we can simply consider them as a single vector of simultaneous observations. However, this requires that these multiple observations can be made and processed simultaneously, which is often not the case.
Sometimes, physical constraints may prevent the simultaneous use of all of the available observations. This is most evident with a sensor which can operate in multiple modes. For example, a radar antenna must choose a waveform to transmit; each possible waveform results in a different distribution of observations, and only one waveform can be chosen for each pulse. Another example might be in studying animal populations, where a researcher must select locations for a limited pool of detection devices such as traps and cameras.
Even when simultaneous observations are physically possible, other constraints may restrict their availability. For example, in an application where processors are much more expensive than sensors, a sensor network might reasonably consist of a large number of sensors and insufficient processing power to analyse the data from every sensor, in which case the processor must choose a subset of sensors from which to receive data. Similarly, a system where multiple sensors share a limited communication channel must decide how to allocate bandwidth, in a situation where each bit of bandwidth can be considered a virtual sensor, not all of which can be simultaneously used.
Another example is the problem of searching for a target which moves according to a Markov chain, where observation processes represent possible sites to be searched. Indeed, MacPhee and Jordan’s  special case of this problem exactly corresponds to the special case we consider in Section 2.3, although with a very different cost function. Johnston and Krishnamurthy  show that this search problem can be used to model file transfer over a fading channel, giving yet another application for an extended hidden Markov model with multiple observation processes.
Note that in the problem of choosing from multiple observation processes, it suffices to consider the case where only one observation is chosen, by regarding each allowable subset of sensors as a single combined observation process. The three main hidden Markov model problems of probability calculation, state estimation and parameter estimation remain essentially the same, as the standard algorithms can easily be adapted by replacing the parameters of the single observation process by those of whichever observation process is chosen at each point in time.
Thus, the main interesting problem in the hidden Markov model with multiple observation processes is that of determining the optimal choice of observation process, which cannot be adapted from the standard theory of hidden Markov models since it is a problem that does not exist in that framework. It is this problem which will be the focus of our work.
We will use information entropy of the information state as our measure of optimality. While Evans and Krishnamurthy  use a distance between the information state and the underlying state, it is not necessary to consider this underlying state explicitly, since the information state is by definition an unbiased estimator of the distribution of the underlying state. We choose entropy over other measures such as variance since it is a measure of uncertainty which requires no additional structure on the underlying set.
The choice of an infinite time horizon is made in order to simplify the problem, as is our decision to neglect sensor usage costs. These variables can be considered in future work.
1.2 Past Work
The theory of hidden Markov models is already well-developed . On the other hand, very little research has been done into the extended model with multiple observation processes. The mainly algorithmic solutions in the theory of hidden Markov models with a single observation process cannot be extended to our problem, since the choice of observation process does not exist in the unextended model.
Similarly, there is a significant amount of work in the sensor scheduling literature, but mostly considering autoregressive Gaussian processes such as in . The case of hidden Markov sensors was considered by Jamie Evans and Vikram Krishnamurthy in 2001 , using policies where an observation process is picked as a deterministic function of the previous observation, and with a finite time horizon. They transformed the problem of choosing an observation into a control problem in terms of the information state, thereby entering the framework of stochastic control. They were able to write down the optimal policy as an intractable dynamic programming problem, and suggested the use of approximations to find the solution.
Krishnamurthy  followed up this work by showing that this dynamic programming problem could be solved using the theory of Partially Observed Markov Decision Processes when the cost function is of the form
where is the information state, is the Dirac measure and is a piecewise constant norm. It was then shown that such piecewise linear cost functions could be used to approximate quadratic cost functions, in the sense that a sufficiently fine piecewise linear approximation must have the same optimal policy. In particular, this includes the Euclidean norm on the information state space, which corresponds to the expected mean-square distance between the information state and the distribution of the underlying chain. However, no bounds were found on how fine an approximation is needed.
The problem solved by Evans and Krishnamurthy is similar to, but distinct from, ours. We consider policies based on the information state, which we expect to perform better than policies based on only the previous observation, as the information state is a sufficient statistic for the sample path of observations (see Proposition 2.8, also ). We also consider an infinite time horizon, and specify information entropy of the information state as our cost function. Furthermore, while Evans and Krishnamurthy consider the primary tradeoff as that between the precision of the sensors and the cost of using them, we do not consider usage costs and only aim to minimise the uncertainty associated with the measurements.
Further work by Krishnamurthy and Djonin  extended the set of allowable cost functions to a Lipschitz approximation to the entropy function, and proved that threshold policies are optimal under certain very restrictive assumptions. Their breakthrough uses lattice theory methods  to show that the cost function must be monotonic in a certain way with respect to the information state, and thus the optimal choice of observation process must be characterised by a threshold. However, this work still does not solve our problem, as their cost function, a time-discounted infinite sum of expected costs, differs significantly from our limiting expected entropy, and furthermore their assumptions are difficult to verify in practice.
Another similar problem was also considered by Mohammad Rezaeian , who redefined the information state as the posterior distribution of the underlying chain given the sample path of observations up to the previous, as opposed to current, time instant, which allowed for a simplification in the recursive formula for the information state. Rezaeian also transformed the problem into a Markov Decision Process, but did not proceed further in his description.
The model for the special case we consider in Section 2.3 is an instance of the problem of searching for a moving target, which was partially solved by MacPhee and Jordan  with a very different cost function – the expected cumulative sum of prescribed costs until the first certain observation. They proved that threshold policies are optimal for certain regions of parameter space by analysing the associated fractional linear transformations. Unfortunately, similar approaches have proved fruitless for our problem due to the highly non-algebraic nature of the entropy function.
Our problem as it appears here was first studied in unpublished work by Bill Moran and Sofia Suvorova, who conjectured that the optimal policy is always a threshold policy. More extensive work was done in , where it was shown that the information state converges in distribution in the same special case that we consider in Section 2.3. It was also conjectured that threshold policies are optimal in this special case, although the argument provided was difficult to work into a full proof. However,  contains a mistake in the recurrence formula for the information state distribution, a corrected version of which appears as Lemma 2.13. The main ideas of the convergence proof still work, and are presented in corrected and improved form in Section 2.2.
2 Analytic Results
We begin by precisely defining the model we will use. In particular, we will make all our definitions within this section, in order to expedite referencing. For the reader's convenience, Table 2.1 at the end of this section lists the symbols we will use for our model.
For a sequence and any non-negative integer , we will use the notation to represent the vector .
A Markov Chain  is a stochastic process , such that for all times , all states and all measurable sets ,
where denotes the canonical filtration. We will consistently use the symbol to refer to an underlying Markov chain, and to denote its distribution.
In the case of a time-homogeneous, finite state and discrete time Markov chain, this simplifies to a sequence of random variables taking values in a common finite state space , such that for all times , is conditionally independent of given , and the distribution of given does not depend on .
In this case, there exists a matrix , called the Transition Matrix, such that for all and ,
Since we mainly consider Markov chains which are time-homogeneous and finite state, we will henceforth refer to them as Markov chains without the additional qualifiers.
An Observation Process on the Markov chain is a sequence of random variables given by , where is a deterministic function and is a sequence of independent and identically distributed random variables which is also independent of the Markov chain .
As before, we will only consider observation processes which take values in a finite set . Similarly to before, there exists an matrix , which we call the Observation Matrix, such that for all , and ,
Heuristically, these two conditions can be seen as requiring that observations depend only on the current state, and do not affect future states. A diagrammatic interpretation is provided in Figure 2.1.
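To make these definitions concrete, the following sketch simulates a small hidden Markov model; the two-state matrices are hypothetical, chosen purely for illustration and not taken from the thesis:

```python
import random

# Hypothetical 2-state, 2-symbol hidden Markov model (numbers for
# illustration only): A is the transition matrix, B the observation matrix.
A = [[0.9, 0.1],
     [0.4, 0.6]]
B = [[0.8, 0.2],
     [0.3, 0.7]]

def sample(row, rng):
    """Draw an index from the probability vector `row`."""
    u = rng.random()
    cumulative = 0.0
    for j, p in enumerate(row):
        cumulative += p
        if u < cumulative:
            return j
    return len(row) - 1  # guard against floating-point round-off

rng = random.Random(0)
x = 0
states, observations = [], []
for _ in range(8):
    x = sample(A[x], rng)    # state evolves by the Markov chain
    y = sample(B[x], rng)    # observation depends only on the current state
    states.append(x)
    observations.append(y)
```

Note that each observation is drawn from the row of the observation matrix indexed by the current state only, reflecting the two conditions above.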
Traditionally, a hidden Markov model is defined as the pair of a Markov chain and an observation process on that Markov chain. Since we will consider hidden Markov models with multiple observation processes, this definition does not suffice. We adjust it as follows.
A Hidden Markov Model is the triple of a Markov chain , a finite collection of observation processes on , and an additional sequence of random variables , called the Observation Index, mapping into the index set .
Note that this amends the standard definition of a hidden Markov model. For convenience, we will no longer explicitly specify our hidden Markov models to have multiple observation processes.
It makes sense to think of as the state of a system under observation, as a collection of potential observations that can be made on this system, and as a choice of observation for each point in time.
Since our model permits only one observation to be made at each point in time, and we will wish to determine which one to use based on past observations, it makes sense to define as a sequence of random variables on the same probability space as the hidden Markov model.
We will discard the potential observations which are not used, leaving us with a single sequence of random variables representing the observations which are actually made.
The Actual Observation of a hidden Markov model is the sequence of random variables .
We will write to mean , noting that this is consistent with our notation for a hidden Markov model with a single observation process . On the other hand, for a hidden Markov model with multiple observation processes, the actual observation is not itself an observation process in general.
Since our goal is to analyse a situation in which only one observation can be made at each point in time, we will consider our hidden Markov model as consisting only of the underlying state and the actual observation . Where convenient, we will use the abbreviated terms state and observation at time to mean and respectively.
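The construction of the actual observation can be sketched as follows; the observation matrices and the fixed index sequence below are hypothetical, and in our model the index would instead be produced by a policy:

```python
import random

# Two hypothetical observation matrices, one per observation process:
# Bs[j][i][y] = P(Y^j_t = y | X_t = i).  Numbers are illustrative only.
Bs = [
    [[0.9, 0.1], [0.2, 0.8]],   # process 0: fairly informative
    [[0.6, 0.4], [0.4, 0.6]],   # process 1: less informative
]

def sample(row, rng):
    """Draw an index from the probability vector `row`."""
    u = rng.random()
    cumulative = 0.0
    for y, p in enumerate(row):
        cumulative += p
        if u < cumulative:
            return y
    return len(row) - 1

def actual_observation(state, index, rng):
    """Sample Y_t = Y^{I_t}_t: draw from the matrix of the chosen process
    only; the unused process's potential observation is never generated."""
    return sample(Bs[index][state], rng)

rng = random.Random(0)
indices = [0, 1, 0, 1, 0, 1]   # a fixed observation-index sequence
ys = [actual_observation(0, j, rng) for j in indices]
```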
For any practical application of this model to a physical system, the underlying state cannot be determined, otherwise there would be no need to take non-deterministic observations. Therefore, we need a way of estimating the underlying state from the observations.
The Information State Realisation of a hidden Markov model at time is the posterior distribution of given the actual observations and observation indices up to time .
To make this definition more precise, we introduce some additional notation.
First, recall that has state space , and define the set of probability measures on ,
Second, for a random variable with state space , and an event , define the posterior distribution of given ,
Although we make this definition in general, we purposely choose the letters and , coinciding with the letters used to represent the underlying Markov chain and the state space, as this is the context in which we will use this definition. Then, the information state realisation is a function
This extends very naturally to a random variable.
The Information State Random Variable is
Its distribution is the Information State Distribution , taking values in , the space of Radon probability measures on , which is a subset of the real Banach space of signed Radon measures on .
Thus, the information state realisation is exactly a realisation of the information state random variable. It is useful because it represents the maximal information we can deduce about the underlying state from the observation index and the actual observation, as shown in Proposition 2.8.
For the purpose of succinctness, we will refer to any of , and as simply the Information State when the context is clear.
A random variable is a sufficient statistic for a parameter given data if for any values and of and respectively, the probability is independent of . As before, we make the definition in general, but purposely choose the symbols , and to coincide with symbols already defined.
In our case, , which is a random variable, is used in the context of a parameter. Our problem takes place in a Bayesian framework, where the information state represents our belief about the underlying state, and is updated at each observation.
The information random variable is a sufficient statistic for the underlying state , given the actual observations and the observation indices .
By Definition 2.7, we need to prove that for all and ,
is independent of .
First, note that the event is the disjoint union of events over all such that .
Next, if , then by definition of , for all ,
Then, by definition of conditional probability,
Each sum above is taken over all such that . This expression is clearly independent of , which completes the proof that is a sufficient statistic for . ∎
Since the information state represents all information that can be deduced from the past, it makes sense to use it to determine which observation process to use in future.
A policy on a hidden Markov model is a deterministic function , such that for all , . We will use the symbol to denote the preimage of the observation method under the policy, that is, the subset of on which observation method is prescribed by the policy. We will always consider the policy as fixed.
Since is a function of and , this means that is a function of and . Then by induction, we see that is a function of and . Therefore, if we prescribe some fixed , then is a function of .
For fixed , we can write
Hence, the information random variable is a deterministic function of only . In particular, the information state can be written with only one argument, that is, .
Since our aim is to determine the underlying state with the least possible uncertainty, we need to introduce a quantifier of uncertainty. There are many possible choices, especially if the state space has additional structure. For example, variance would be a good candidate in an application where the state space embeds naturally into a real vector space.
However, in the general case, there is no particular reason to suppose our state space has any structure; our only assumption is that it is finite, in which case information entropy is the most sensible choice, being a natural, axiomatically-defined quantifier of uncertainty for a distribution on a countable set without any additional structure .
The Information Entropy of a discrete probability measure is given by
We will use the natural logarithm, and define 0 log 0 = 0, in accordance with the fact that x log x → 0 as x → 0.
Since takes values in , is well-defined, and by definition measures the uncertainty in given , and therefore by Proposition 2.8, measures the uncertainty in our best estimate of . Thus, the problem of minimising uncertainty becomes quantified as one of minimising .
We are particularly interested in the limiting behaviour, and thus, the main questions we will ask are:
Under what conditions, and in particular what policies, does converge as ?
Among the policies under which converges, which policy gives the minimal limiting value of ?
Are there interesting cases where does not converge, and if so, can we generalise the above results?
| Type | Meaning |
|------|---------|
| set | probability measures on |
| set | probability measures on |
| set | region of observation process |
| random variable | observation index |
| finite set | set of observation processes |
| finite set | state space of Markov chain |
| finite set | observation space |
| random variable | observation randomness |
| random variable | Markov chain |
| random variable | observation process |
| random variable | actual observation |
| random variable | information state random variable |
| integer | number of observation values |
| integer | number of states |
| integer | position in time |
| distribution | information state realisation |
| distribution | Markov chain distribution |
| distribution | information state distribution |
In this section, we will prove that under certain conditions, the information state converges in distribution. This fact is already known for classical hidden Markov models, and is quite robust: LeGland and Mevel  prove geometric ergodicity of the information state even when calculated from incorrectly specified parameters, while Cappé, Moulines and Rydén  prove Harris recurrence of the information state for certain uncountable state underlying chains. We will present a mostly elementary proof of convergence in the case of multiple observation processes.
To determine the limiting behaviour of the information state, we begin by finding an explicit form for its one-step time evolution.
For each observation process and each observed state , the -function is the function given by
where is the Dirac measure on and is the th component of .
In a hidden Markov model with multiple observation processes and a fixed policy , the information state satisfies the recurrence relation
Note that for each information state and each observation process , there are at most possible information states at the next step, which are given explicitly by for each observation .
The information distribution satisfies the recurrence relation
where the sum is taken over all observation processes and all observation states , is the Dirac measure on , and is the matrix product considering as a row vector.
Since is a deterministic function of , given that ,
This depends only on and , so given that and ,
Integration over gives
By Definition 2.5, is the posterior distribution of given the observations up to time , so , the th coordinate of the vector . Since is a function of , which is a function of and the observation randomness , by the Markov property as in Definition 2.2,
Note that Lemma 2.13 shows that the information distribution is given by a linear dynamical system on , and therefore the information state is a Markov chain with state space . We will use tools in Markov chain theory to analyse the convergence of the information state, for which it will be convenient to give a name to this recurrence.
The transition function of the information distribution is the deterministic function given by , extended linearly to all of by the recurrence in Lemma 2.13. The coefficients are called the -functions.
We now give a criterion under which the information state is always positive recurrent.
A discrete state Markov Chain is called Ergodic if it is irreducible, aperiodic and positive recurrent. Such a chain has a unique invariant measure , which is a limiting distribution in the sense that converges to in total variation norm .
A discrete state Markov Chain is called Positive if every transition probability is strictly positive, that is, for all , . This is a stronger condition than ergodicity.
We shall call a hidden Markov model Anchored if the underlying Markov chain is ergodic, and for each observation process , there is a state and an observation such that and for all . The pair is called an Anchor Pair.
Heuristically, the latter condition allows for perfect information whenever the observation is made using observation process . This anchors the information chain in the sense that this state can be reached with positive probability from any other state, thus resulting in a recurrent atom in the uncountable state chain . On the other hand, since each information state can make a transition to only finitely many other information states, starting the chain at results in a discrete state Markov chain, for which it is much easier to prove positive recurrence.
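The anchoring condition on a single observation matrix can be checked mechanically; in the following sketch, the helper `anchor_pairs` (a hypothetical name) returns every (state, observation) pair for which the observation identifies the state exactly:

```python
def anchor_pairs(B):
    """Return all (state, observation) pairs (s, y) such that observing y
    under this matrix identifies the state exactly: B[s][y] > 0 while
    B[r][y] == 0 for every r != s."""
    n, m = len(B), len(B[0])
    pairs = []
    for y in range(m):
        support = [s for s in range(n) if B[s][y] > 0.0]
        if len(support) == 1:
            pairs.append((support[0], y))
    return pairs

# Hypothetical 2-state, 3-symbol observation matrix: symbol 0 is only
# ever emitted from state 0, and symbol 2 only from state 1.
B_example = [[0.5, 0.5, 0.0],
             [0.0, 0.5, 0.5]]
pairs = anchor_pairs(B_example)
```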
In an anchored hidden Markov model, for any anchor pair , for all .
In a positive, anchored hidden Markov model, the -functions , for each , are uniformly bounded below by some , that is, for all and .
For each state , the Orbit of under the -functions is
By requiring the -functions to be positive, we exclude points in the orbit which are reached with zero probability. Let .
In a positive, anchored hidden Markov model, there exists a constant such that for all measures , the mass of the measure outside is bounded by , that is, .
We can rewrite Definition 2.14 as
In this notation, the integral is the Lebesgue integral of the function with respect to the measure . Since takes values in the and is a probability, the integral also takes values in , thus maps the information state space to itself.
Since is a measure supported on the set of points reachable from via an -function, and is a union of orbits of -functions and therefore closed under -functions, it follows that all mass in is mapped back into under the evolution function, that is
On the other hand, by Lemma 2.18, for all , hence
Putting these together gives
Setting gives , hence by induction. By Lemma 2.19, , while since we can always choose a larger value. ∎
Up to this point, we have considered the evolution function as a deterministic function . However, we can also consider it as a probabilistic function . By Definition 2.14, maps points in to , hence the restriction gives a probabilistic function, and therefore a Markov chain, with countable state space .
By Proposition 2.21, the limiting behaviour of the information chain takes place almost entirely in in some sense, so we would expect that convergence of the restricted information chain is sufficient for convergence of the full information chain . This is proved below.
In a positive, anchored hidden Markov model, under any policy, the chain has at least one state of the form which is positive recurrent, that is, whose expected return time is finite.
Construct a Markov chain on the set , with transition probabilities for all , whenever is nonempty, and all other transition probabilities zero. We note that this is possible since we allow each state a positive probability transition to some other state.
Since is a finite state Markov chain, it must have a recurrent state. Each state can reach some state , so some state is recurrent; call it .
Consider a state of the chain which is reachable from , where is a composition of -functions with corresponding -functions nonzero. Since the sets partition , one of them must contain ; call it . We will assume ; the proof follows the same argument and is simpler in the case when .
By definition of the -functions,
This means that is reachable from in the chain , hence in the chain , is reachable from , and by recurrence of , must also be reachable from via some sequence of positive probability transitions
By definition of , is nonempty, and thus contains some point , where is a composition of -functions with corresponding nonzero.
By Definition 2.20, each transition to in the information chain occurs with positive probability, so
Since , by anchoredness and positivity,
The Markov property then gives
Continuing via the sequence (2.23), we obtain
Thus, for every state reachable from , we have found constants and such that
By Lemma 2.19, is uniformly bounded below, while depends only on the directed path (2.23) and not on , and thus is also uniformly bounded below since there are only finitely many , and hence it suffices to choose finitely many such paths.
Similarly, also depends only on the directed path (2.23), and thus is uniformly bounded above. In particular, it is possible to pick and such that and .
Let be the first entry time into the state . By the above bound, we have for any initial state reachable from . Letting and be independent copies of and , the time-homogeneous Markov property gives
By induction, for all . Dropping the condition on the initial distribution for convenience, we have
In particular, , so is a positive recurrent state. ∎
The transition function , considered as an operator on the real Banach space of signed Radon measures on with the total variation norm, is linear with operator norm .
Linearity follows immediately from the fact that is defined as a finite sum of integrals.
For each , let be the Hahn decomposition, so that by definition of the total variation norm.
If , then by linearity of . Otherwise, let , so that . Since maps probability measures to probability measures,