# Sequential Tracking of a Hidden Markov Chain Using Point Process Observations

## Abstract.

We study finite horizon optimal switching problems for hidden Markov chain models with point process observations. The controller possesses a finite range of strategies and attempts to track the state of the unobserved state variable using Bayesian updates over the discrete observations. Such a model has applications in economic policy making, staffing under variable demand levels and generalized Poisson disorder problems. We show regularity of the value function and explicitly characterize an optimal strategy. We also provide an efficient numerical scheme and illustrate our results with several computational examples.

###### Key words and phrases:
Markov modulated Poisson processes, optimal switching
###### 2000 Mathematics Subject Classification:
Primary 62L10; Secondary 62L15, 62C10, 60G40
E. Bayraktar is supported in part by the National Science Foundation, under grant DMS-0604491.

## 1. Introduction

An economic agent (henceforth the controller) observes a compound Poisson process $X$ with arrival rate $\lambda$ and mark/jump distribution $\nu$. The local characteristics $(\lambda,\nu)$ of $X$ are determined by the current state of an unobservable Markov jump process $M$ with finite state space $E \triangleq \{1,\ldots,m\}$. More precisely, the characteristics are $(\lambda_i,\nu_i)$ whenever $M$ is at state $i$, for $i \in E$.

The objective of the controller is to track the state of $M$ given the information in $X$. To do so, the controller possesses a range of policies $a$ in the finite alphabet $\mathcal{A}$. The policies are sequentially adopted starting from time 0 and until some fixed horizon $T$. The infinite horizon case is treated in Section 5.1. The selected policy $a$ leads to running costs (benefits) at instantaneous rate

$$\sum_{i\in E} c_i(a)\,1_{\{M_t=i\}}\,dt.$$

The controller's overall strategy consists of a double sequence $(\xi_k,\tau_k)_{k\ge 0}$, with $\xi_k \in \mathcal{A}$ representing the sequence of chosen policies and $\tau_k$ representing the times of policy changes (from now on termed switching times). We denote the entire strategy by the right-continuous piecewise constant process $\xi$, with $\xi_t=\xi_k$ if $\tau_k \le t < \tau_{k+1}$, or

$$\xi_t=\sum_{\tau_{k+1}\le T}\xi_k\cdot 1_{[\tau_k,\tau_{k+1})}(t). \tag{1.1}$$

Beyond running benefits, the controller also faces switching costs when changing her policy, which lead to inertia and hysteresis. If at time $t$ the controller changes her policy from $a$ to $b$ and $M_t=i$, then an immediate cost $K_i(a,b)$ is incurred. The overall objective of the controller is to maximize the total present value of all tracking benefits minus the switching costs, which is given by

$$\int_0^T e^{-\rho t}\Big(\sum_{i\in E}c_i(\xi_t)\,1_{\{M_t=i\}}\Big)dt-\sum_k e^{-\rho\tau_k}\Big(\sum_{i\in E}K_i(\xi_{\tau_k-},\xi_{\tau_k})\cdot 1_{\{M_{\tau_k}=i\}}\Big),$$

where $\rho$ is the discount factor.

Since $M$ is unobserved, the controller must carry out a filtering procedure. We postulate that she collects information about $M$ via a Bayesian framework. Let $\vec\pi=(\pi_1,\ldots,\pi_m)$ be the initial (prior) beliefs of the controller about $M$ and $\mathbb{P}^{\vec\pi}$ the corresponding conditional probability law. The controller starts with beliefs $\vec\pi$, observes $X$, updates her beliefs and adjusts her policy accordingly. Because only $X$ is observable, the strategy $\xi$ should be determined by the information generated by $X$; namely, each $\tau_k$ must be a stopping time of the filtration $\mathbb{F}^X$ of $X$. Similarly, the value of each $\xi_k$ is determined by the information revealed by $X$ until $\tau_k$. These notions and the precise updating mechanism will be formalized in Section 2.3. We denote by $\mathcal{U}(T)$ the set of all such admissible strategies on a time interval $[0,T]$. Since strategies with infinitely many switches would have infinite costs, we exclude them from $\mathcal{U}(T)$.

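Starting with initial policy $a$ and beliefs $\vec\pi$, the performance of a given strategy $\xi\in\mathcal{U}(T)$ is

$$J^{\xi}(T,\vec\pi,a)\triangleq\mathbb{E}^{\vec\pi,a}\left[\int_0^T e^{-\rho t}\Big(\sum_{i\in E}c_i(\xi_t)\,1_{\{M_t=i\}}\Big)dt-\sum_k e^{-\rho\tau_k}\Big(\sum_{i\in E}K_i(\xi_{k-1},\xi_k)\cdot 1_{\{M_{\tau_k}=i\}}\Big)\right]. \tag{1.2}$$

The first argument in $J^{\xi}$ is the remaining time to maturity. The optimization problem is to compute

$$U(T,\vec\pi,a)\triangleq\sup_{\xi\in\mathcal{U}(T)}J^{\xi}(T,\vec\pi,a), \tag{1.3}$$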

and, if it exists, find an admissible strategy attaining this value. In this paper we solve (1.3), including giving a full characterization of an optimal control and a deterministic numerical method for computing $U$ to an arbitrary level of precision. The solution proceeds in two steps: an initial filtering step and a second optimization step. The filtering step is studied in Section 2, where we convert the optimal control problem with partial information (1.3) into an equivalent fully observed problem in terms of the a posteriori probability process $\vec\Pi$. The process $\vec\Pi$ summarizes the dynamic updating of the controller's beliefs about the Markov chain $M$ given her point process observations. The explicit dynamics of $\vec\Pi$ are derived in Proposition 2.2, so that the filtering step is completely solved. The main part of the paper then analyzes the resulting optimal switching problem (2.6) in Sections 3 and 4.

To our knowledge, the finite horizon partially observed switching control problem (which might be viewed as an impulse control problem in terms of ) defined in (1.3), has not been studied before. However, it is closely related to optimal stopping problems with partially observable Cox processes that have been extensively looked at starting with the Poisson Disorder problems, see e.g. Peskir and Shiryaev (2000, 2002); Bayraktar and Dayanik (2006); Bayraktar et al. (2006); Bayraktar and Sezer (2006). In particular, Bayraktar and Sezer (2006) solved the Poisson disorder problem when the change time has phase type prior distribution by showing that it is equivalent to an optimal stopping problem for a hidden Markov process (which has several transient states and one absorbing state) that is indirectly observed through a point process. Later Ludkovski and Sezer (2007) solved a similar optimal stopping problem in which all the states of the hidden Markov chain are recurrent. Both of these works can be viewed as a special case of (1.3), see Remark 3.2. Our model can also be viewed as the continuous-time counterpart of discrete-time sequential -ary detection in hidden Markov models, a topic extensively studied in sequential analysis, see e.g. Tartakovsky et al. (2006); Aggoun (2003).

Filtering with point process observations is a well-studied area; let us mention the work of Arjas et al. (1992), Ceci and Gerardi (1998) and the reference volume Elliott et al. (1995). In our model we use the previous results obtained in Bayraktar and Sezer (2006) and Ludkovski and Sezer (2007) to derive an explicit filter; this then allows us to focus on the separated fully observed optimal switching problem using the new hyper-state. Let us also mention the recent paper of Chopin and Varini (2007), who study a simulation-based method for filtering in a related model, but where an explicit filter is unavailable and must be numerically approximated.

The techniques that we use to solve the optimal switching/impulse control problem are different from the ones used in the continuous-time optimal stopping problems mentioned above. The main tool in solving the optimal stopping problems (in the multi-dimensional case; the tools in the one-dimensional case are not restricted to the one described here) is the approximating sequence that is constructed by restricting the time horizon to be less than the time of the $n$-th observation/jump of the observed point process. This sequence converges to the value function uniformly and exponentially fast. In the impulse control problem, however, the corresponding approximating sequence is constructed by restricting the sum of the number of jumps and interventions to be less than $n$. This sequence converges to the value function, but uniform convergence in both $T$ and $\vec\pi$ is not identifiable using the same techniques.

As in Costa and Davis (1989) and Costa and Raymundo (2000) (see also Mazziotto et al. (1988) for the general theory of impulse control of partially observed stochastic systems), we first characterize the value function $U$ as the smallest fixed point of two functional operators and obtain the aforementioned approximating sequence. Using one of these characterization results and the path properties of the a posteriori probability process, we obtain one of our main contributions: the regularity of the value function $U$. We show that $U$ is convex in $\vec\pi$, Lipschitz in the same variable on the closure of its domain, and Lipschitz in the $T$ variable uniformly in $\vec\pi$. Our regularity analysis leads to the proof of the continuity of $U$ in both $T$ and $\vec\pi$, which in turn lets us explicitly describe an optimal strategy.

The other characterization of $U$ as a fixed point of the first jump operator is used to numerically implement the optimal solution and find the value function. In general, very little is known about numerics for continuous-time control of general hidden Markov models, and this implementation is another one of our contributions. We combine the explicit filtering equations with special properties of piecewise deterministic processes (Davis, 1993) and the structure of general optimal switching problems to give a complete computational scheme. Our method relies only on deterministic optimization sub-problems and lets us avoid having to deal with first-order quasi-variational inequalities with integral terms that appear in related stochastic control formulations (see Remark 3.3 below). We illustrate our approach with several examples on a finite/infinite horizon and a hidden Markov chain with two or three states.

Our framework has wide-ranging applications in operations research, management science and applied probability. Specific cases are discussed in the next subsection. As these examples demonstrate, our approach leads to sensible policy advice in many scenarios. Most of the relevant applied literature treats discrete-time stationary problems, and our model can be seen as a finite-horizon, continuous-time generalization of these approaches.

The rest of the paper is organized as follows. In Section 1.1 we propose some applications of our modeling framework. In Section 2 we describe an equivalent fully observed problem in terms of the a posteriori probability process $\vec\Pi$. We also analyze the dynamics of $\vec\Pi$. In Section 3 we show that $U$ satisfies two different dynamic programming equations. The results of Section 3, along with the path description of $\vec\Pi$, allow us to study the regularity properties of $U$ and describe an optimal strategy in Section 4. Our model can be extended beyond (1.3), in particular to cover the case of infinite horizon and the case in which the costs are incurred at arrival times. The extensions are described in Section 5. Extensive numerical analysis of several illustrative examples is carried out in Section 6.

### 1.1. Applications

In this section we discuss case studies of our model and the relevant applied literature.

#### Cyclical Economic Policy Making

The economic business cycle is a basis of many policy making decisions. For instance, the country’s central bank attempts to match its monetary policy, so as to have low interest rates in periods of economic recession and high interest rates when the economy overheats. Similarly, individual firms will time their expenditures to coincide with boom times and will cut back on capital spending in unfavorable economy states. Finally, investors hope to invest in the bull market and stay on the sidelines during the bear market. In all these cases, the precise current economy state is never known. Instead, the agents collect information via economic events, surveys and news, and act based on their dynamic beliefs about the environment. Typically, such news consist of discrete events (e.g. earnings pre-announcements, geo-political news, economic polls) which cause instantaneous jumps in agents’ beliefs. Thus, it is natural to model the respective information structure by observations of a modulated compound Poisson process. Accordingly, let represent the current state of the economy and let the observation correspond to economic news. Inability to correctly identify will lead to (opportunity) costs . Hence, one may take and . The strategy represents the set of possible actions of the agent. The switching costs of the form correspond to the costly influence of the Federal Reserve changing its interest rate policy, or to the transaction costs incurred by the investor who gets in/out of the market. Depending on the particular setting, one may study this problem both in finite- and infinite-horizon setting, and with or without discounting. For instance, a firm planning its capital budgeting expenses might have a fixed horizon of one year, while a central bank has infinite horizon but discounts future costs. A corresponding numerical example is presented in Section 6.2.

#### Matching Regime-Switching Demand Levels

Many customer-oriented businesses experience stochastically fluctuating demand. Thus, internet servers face heavy/light traffic; manufacturing managers observe cyclical demand levels; customer service centers have varying frequencies of calls. Such systems can be modeled in terms of a compound Poisson request process modulated by the partially known system state . Here, serves the dual role of representing the actual demands and conveying information about . The objective of the agent is to dynamically choose her strategy , so as to track current demand level. For instance, an internet server receives asynchronous requests , (corresponding to jumps of ) that take time units to fulfill. The rate of requests and their complexity distribution depend on . In turn, the server manager can control how much processing power is devoted to the server: more processors cut down individual service times but lead to higher fixed overhead. Such a model effectively corresponds to a controlled -queue, where the arrival rate is -modulated, and where the distribution of service times depends both on and the control . A related computational example concerning a customer call center is treated in Section 6.3.

A concrete example that has been recently studied in the literature is the insurance premium problem. Insurance companies handle claims in exchange for policy premiums. A standard model asserts that claims form a compound (time-inhomogeneous) Poisson process . Suppose that the rate of claims is driven by some state variable that measures the current background risk (e.g. climate, health epidemics, etc.), with the latter being unobserved directly. In Aggoun (2003), such a model was studied (in discrete time) from the inference point of view, deriving the optimal filter for the insurance environment given the claim process. Assume now that the company can control its continuous premium rate , as well as its deductible level . High deductibles require lowering the premium rate, and are therefore only optimal in high-risk environments. Furthermore, changes to policy provisions (which has a finite expiration date ) are costly and should be undertaken infrequently. The overall objective is thus,

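$$\sup_{\xi\in\mathcal{U}(T)}\mathbb{E}^{\vec\pi,a}\left[-\sum_{j=1}^{N(T)}e^{-\rho\sigma_j}\big(Y_j-c_1(\xi_{\sigma_j})\big)_+ +\int_0^T c_2(\xi_t)\,dt-\sum_k e^{-\rho\tau_k}\Big(\sum_{i\in E}K_i(\xi_{k-1},\xi_k)\cdot 1_{\{M_{\tau_k}=i\}}\Big)\right],$$

where $N$ is the counting process for the number of claims. The resulting cost structure, which is a variant of (1.3), is described in Section 5.2.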

#### Security Monitoring

Classical models of security surveillance (radar, video cameras, communication network monitor) involve an unobserved system state representing current security (e.g. , where corresponds to a ‘normal’ state and represents a security breach) and a signal . The signal records discrete events, namely artifacts in the surveyed space (radar alarms, camera movement, etc.). Benign artifacts are possible, but the intensity of increases when . If the signal can be decomposed into further sub-types, then becomes a marked point process with marks . The goal of the monitor is to correctly identify and respond to security breaches, while minimizing false alarms and untreated security violations. Classical formulations (Tartakovsky et al., 2006; Peskir and Shiryaev, 2000) only analyze optimality of the first detection. However, in most practical problems the detection is ongoing and discrete announcement costs require studying the entire (infinite) sequence of detection decisions. Accordingly, our optimal switching framework of (1.3) is more appropriate.

As a simplest case, the monitor can either declare the system to be sound , or declare a state of alarm . This produces -dependent penalty costs at rate ; also changing the monitor state is costly and leads to costs . A typical security system is run on an infinite loop and one wishes to minimize total discounted costs, where the discounting parameter models the effective time-horizon of the controller (i.e. the trade-off between the myopically optimal announcement and long-run costs). Such an example is presented in Section 6.1.

#### Sequential Poisson Disorder Problems

Our model can also serve as a generalization of Poisson disorder problems, (Bayraktar et al., 2006; Peskir and Shiryaev, 2002). Consider a simple Poisson process whose intensity sequentially alternates between and . The goal of the observer is to correctly identify the current intensity; doing so produces a running reward at rate per unit time, otherwise a cost at rate is assessed, where is the control process. Whenever the observer changes her announcement, a fixed cost is charged in order to make sure that the agent does not vacillate. Letting , denote the intensity state, and this example yet again fits into the framework of (1.3). Obvious generalizations to multiple values of and multiple announcement options for the observer can be considered. Again, one may study the classical infinite-horizon problem, or the harder time-inhomogeneous model on finite-horizon, where the observer must also take into account time-decay costs.

## 2. Problem Statement

In this section we rigorously define the problem and show that it is equivalent to a fully observed impulse control problem using the conditional probability process $\vec\Pi$. We then derive explicitly the dynamics of $\vec\Pi$. First, however, we give a construction of the probability measure $\mathbb{P}$ and the formal description of $X$.

### 2.1. Observation Process

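Let $(\Omega,\mathcal{H},\mathbb{P}_0)$ be a probability space hosting two independent elements: (i) a continuous-time Markov process $M$ taking values in a finite set $E$, with infinitesimal generator $Q=(q_{i,j})_{i,j\in E}$; (ii) a compound Poisson process $X$ with intensity $\lambda_1$ and jump size distribution $\nu_1$ on $\mathbb{R}^d$. Let $\mathbb{F}^X=(\mathcal{F}^X_t)_{t\ge 0}$ be the natural filtration of $X$ enlarged by $\mathbb{P}_0$-null sets, and consider its initial enlargement $\mathbb{G}=(\mathcal{G}_t)_{t\ge 0}$ with $\mathcal{G}_t\triangleq\mathcal{F}^X_t\vee\sigma(M_s:s\ge 0)$ for all $t\ge 0$. The filtration $\mathbb{G}$ summarizes the information flow of a genie that observes the entire path of $M$ at time $t=0$.

Denote by $\sigma_1,\sigma_2,\ldots$ the arrival times of the process $X$,

$$\sigma_\ell\triangleq\inf\{t>\sigma_{\ell-1}:X_t\neq X_{t-}\},\quad \ell\ge 1,\quad\text{with } \sigma_0\equiv 0,$$

and by $Y_1,Y_2,\ldots$ the $\mathbb{R}^d$-valued marks observed at these arrival times:

$$Y_\ell=X_{\sigma_\ell}-X_{\sigma_\ell-},\quad \ell\ge 1.$$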

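Then in terms of the counting random measure

$$p((0,t],A)\triangleq\sum_{\ell=1}^{\infty}1_{\{\sigma_\ell\le t\}}1_{\{Y_\ell\in A\}}, \tag{2.1}$$

where $A$ is a Borel set in $\mathbb{R}^d$, we can write the observation process $X$ as

$$X_t=X_0+\int_{(0,t]\times\mathbb{R}^d}y\,p(ds,dy).$$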

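Let us introduce the positive constants $\lambda_2,\ldots,\lambda_m$ and the distributions $\nu_2,\ldots,\nu_m$. We also define the total measure $\nu\triangleq\nu_1+\cdots+\nu_m$, and let $f_i$ be the density of $\nu_i$ with respect to $\nu$. Define

$$R(t,y)\triangleq\frac{1}{\lambda_1 f_1(y)}\sum_{i\in E}1_{\{M_t=i\}}\lambda_i f_i(y),\quad t\ge 0,\ y\in\mathbb{R}^d,$$

and denote the $(\mathbb{P}_0,\mathbb{F}^X)$ (or $(\mathbb{P}_0,\mathbb{G})$)-compensator of $p$ by

$$p_0((0,t]\times A)=\lambda_1 t\int_A f_1(y)\,\nu(dy),\quad t\ge 0,\ A\in\mathcal{B}(\mathbb{R}^d). \tag{2.2}$$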

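We will use $R$ and $p_0$ to change the underlying probability measure to a new probability measure $\mathbb{P}$ on $(\Omega,\mathcal{H})$ defined by

$$\frac{d\mathbb{P}}{d\mathbb{P}_0}\bigg|_{\mathcal{G}_t}=Z_t,$$

where the stochastic exponential $Z$ given by

$$Z_t\triangleq\exp\left\{\int_{(0,t]\times\mathbb{R}^d}\log\big(R(s,y)\big)\,p(ds,dy)-\int_{(0,t]\times\mathbb{R}^d}\big[R(s,y)-1\big]\,p_0(ds,dy)\right\}$$

is a $(\mathbb{P}_0,\mathbb{G})$-martingale. Note that $\mathbb{P}$ and $\mathbb{P}_0$ coincide on $\mathcal{G}_0$ since $Z_0=1$; therefore the law of the Markov chain $M$ is the same under both probability measures. Moreover, the $(\mathbb{P},\mathbb{G})$-compensator of $p$ becomes

$$p_1((0,t],A)=\sum_{i\in E}\int_{(0,t]}1_{\{M_s=i\}}\lambda_i\int_A f_i(y)\,\nu(dy)\,ds, \tag{2.3}$$

see e.g. Jacod and Shiryaev (1987). The last statement is equivalent to saying that under this new probability, $X$ has the form

$$X_t\triangleq X_0+\int_0^t\sum_{i\in E}1_{\{M_s=i\}}\,dX^{(i)}_s,\quad t\ge 0, \tag{2.4}$$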

in which $X^{(1)},\ldots,X^{(m)}$ are independent compound Poisson processes with intensities $\lambda_1,\ldots,\lambda_m$ and jump size distributions $\nu_1,\ldots,\nu_m$, respectively. Such a process is called a Markov-modulated Poisson process (Karlin and Taylor, 1981). By construction, the observation process $X$ has independent increments conditioned on $M=(M_t)_{t\ge 0}$. Thus, conditioned on $\{M_{\sigma_\ell}=i\}$, the distribution of $Y_\ell$ is $\nu_i$ on $\mathbb{R}^d$.
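The representation (2.4) can be illustrated by simulation. The following is a minimal sketch, not part of the original development: it assumes a generator `Q` given as a list of rows, intensities `lam`, and a user-supplied mark sampler `sample_mark(i, rng)` standing in for draws from $\nu_i$ (all of these names are hypothetical). It uses the standard competing-exponentials construction: in regime $i$ the next event occurs after an exponential time with rate $-q_{i,i}+\lambda_i$ and is an arrival of $X$ with probability $\lambda_i/(-q_{i,i}+\lambda_i)$.

```python
import random

def simulate_mmpp(Q, lam, sample_mark, pi0, T, rng):
    """Simulate the pair (M, X) on [0, T).

    Returns (regime_path, arrivals), where regime_path lists (time, state)
    at the jump times of M and arrivals lists (sigma_l, Y_l) for X.
    All argument names are illustrative assumptions, not the paper's API.
    """
    m = len(lam)
    state = rng.choices(range(m), weights=pi0)[0]  # draw M_0 from the prior
    t, regime_path, arrivals = 0.0, [(0.0, state)], []
    while True:
        exit_rate = -Q[state][state]        # rate at which M leaves `state`
        total = exit_rate + lam[state]      # rate of the next event of any kind
        if total <= 0.0:                    # absorbing state with no arrivals
            break
        t += rng.expovariate(total)
        if t >= T:
            break
        if rng.random() < lam[state] / total:
            # the event is an arrival of X with a regime-dependent mark
            arrivals.append((t, sample_mark(state, rng)))
        else:
            # the event is a transition of M, chosen proportionally to q_{i,j}
            weights = [Q[state][j] if j != state else 0.0 for j in range(m)]
            state = rng.choices(range(m), weights=weights)[0]
            regime_path.append((t, state))
    return regime_path, arrivals
```

For instance, with `Q = [[-1.0, 1.0], [2.0, -2.0]]`, `lam = [1.0, 5.0]` and exponential marks, calling `simulate_mmpp(Q, lam, lambda i, r: r.expovariate(1.0), [0.5, 0.5], 20.0, random.Random(1))` produces bursts of arrivals while the hidden chain sits in the high-intensity state.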

### 2.2. Equivalent Fully Observed Problem.

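Let $D\triangleq\{\vec\pi\in[0,1]^m:\pi_1+\cdots+\pi_m=1\}$ be the space of prior distributions of the Markov process $M$. Also, let $\mathcal{S}(s)$ denote the set of all $\mathbb{F}^X$-stopping times smaller than or equal to $s$.

We define the $D$-valued conditional probability process $\vec\Pi(t)\equiv(\Pi_1(t),\ldots,\Pi_m(t))$ such that

$$\Pi_i(t)=\mathbb{P}\{M_t=i\mid\mathcal{F}^X_t\},\quad\text{for } i\in E \text{ and } t\ge 0. \tag{2.5}$$

Each component of $\vec\Pi$ gives the conditional probability that the current state of $M$ is $i$ given the information generated by $X$ until the current time $t$. Using the process $\vec\Pi$ we now convert (1.3) into a standard optimal switching problem.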

###### Proposition 2.1.

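The performance of a given strategy $\xi\in\mathcal{U}(T)$ can be written as

$$J^{\xi}(T,\vec\pi,a)=\mathbb{E}^{\vec\pi,a}\left[\int_0^T e^{-\rho t}C(\vec\Pi(t),\xi_t)\,dt-\sum_k e^{-\rho\tau_k}K(\xi_{\tau_k-},\xi_{\tau_k},\vec\Pi(\tau_k))\right], \tag{2.6}$$

in terms of the functions

$$C(\vec\pi,a)\triangleq\sum_{i\in E}c_i(a)\,\pi_i,\quad\text{and}\quad K(a,b,\vec\pi)\triangleq\sum_{i\in E}K_i(a,b)\,\pi_i. \tag{2.7}$$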

Proposition 2.1 above states that solving the problem in (1.3) is equivalent to solving an impulse control problem with state variables and . As a result, the filtering and optimization steps are completely separated. In our context with optimal switching control, the proof of this separation principle is immediate (see e.g. Shiryaev (1978, pp. 166-167)). In more general problems with continuous controls, the result is more delicate, see Ceci and Gerardi (1998).

We proceed to discuss the technical assumptions on and . Note that by construction and are linear. Moreover, is bounded since is finite, so there is a constant denoted that uniformly bounds possible rates of profit, . For the switching costs we assume that they satisfy the triangle inequality

$$K_i(a,b)+K_i(b,c)\ge K_i(a,c),\quad\text{and}\quad K_i(a,b)>k_0>0\quad\text{for } i\in E;\ a,b,c\in\mathcal{A}.$$
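As an illustration of (2.7) and of the assumptions on the switching costs, the following minimal sketch evaluates the belief-averaged reward and switching cost and checks the two conditions above for a given cost table. The data layout (`c[i][a]`, `K[i][a][b]`) and the function names are assumptions made for this example only; the lower bound and the triangle inequality are checked over pairwise distinct policies, which is one reading of the assumption.

```python
def running_reward(c, pi, a):
    """C(pi, a) = sum_i c_i(a) * pi_i, with c[i][a] the reward rate in state i."""
    return sum(c[i][a] * pi[i] for i in range(len(pi)))

def switching_cost(K, pi, a, b):
    """K(a, b, pi) = sum_i K_i(a, b) * pi_i, with K[i][a][b] the cost in state i."""
    return sum(K[i][a][b] * pi[i] for i in range(len(pi)))

def costs_admissible(K, k0):
    """Check K_i(a,b) > k0 > 0 and K_i(a,b) + K_i(b,c) >= K_i(a,c).

    Assumption of this sketch: both conditions are tested only for
    pairwise distinct policies a, b, c.
    """
    if k0 <= 0.0:
        return False
    for K_i in K:
        n = len(K_i)
        for a in range(n):
            for b in range(n):
                if a == b:
                    continue
                if K_i[a][b] <= k0:
                    return False        # strictly positive lower bound violated
                for c in range(n):
                    if c in (a, b):
                        continue
                    if K_i[a][b] + K_i[b][c] < K_i[a][c]:
                        return False    # a detour through b would be cheaper
    return True
```

The triangle inequality rules out strategies that switch twice at the same instant: a detour through an intermediate policy is never cheaper than a direct switch.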

By the above assumptions on the switching costs and because possible rewards are uniformly bounded, with probability one the controller only makes finitely many switches and does not make two switches at once. Without loss of generality we will also assume that every element $\xi\in\mathcal{U}(T)$ satisfies

$$\mathbb{E}^{\vec\pi,a}\left[\sum_k e^{-\rho\tau_k}K(\xi_{\tau_k-},\xi_{\tau_k},\vec\Pi(\tau_k))\right]<\infty. \tag{2.8}$$

Otherwise, the value associated with the strategy $\xi$ would be $-\infty$, since

$$\mathbb{E}^{\vec\pi,a}\left[\int_0^T e^{-\rho t}\,\big|C(\vec\Pi(t),\xi_t)\big|\,dt\right]\le cT,$$

and taking no action would be better than applying $\xi$.

In the sequel we will also make use of the following auxiliary problems. First, let $U_0$ be the value of no-action, i.e.,

$$U_0(T,\vec\pi,a)=\mathbb{E}^{\vec\pi,a}\left[\int_0^T e^{-\rho t}C(\vec\Pi_t,a)\,dt\right]. \tag{2.9}$$

Also in reference to (1.3), we will consider the restricted problems

$$U_n(T,\vec\pi,a)\triangleq\sup_{\xi\in\mathcal{U}_n(T)}J^{\xi}(T,\vec\pi,a),\quad n\ge 1, \tag{2.10}$$

in which $\mathcal{U}_n(T)$ is the subset of $\mathcal{U}(T)$ that contains strategies with at most $n$ interventions up to time $T$.
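The no-action value (2.9) can be computed without simulating the filter. By the tower property, $\mathbb{E}[\Pi_i(t)]$ equals the unconditional probability that $M_t=i$, so (2.9) reduces to $\int_0^T e^{-\rho t}\sum_{i\in E}c_i(a)\,(\vec\pi e^{tQ})_i\,dt$. The sketch below is a hedged illustration of that reduction (the reduction itself is a remark added here, and the function and argument names are hypothetical): it integrates the forward equation $\dot p=pQ$ with a classical Runge-Kutta scheme and applies the trapezoid rule.

```python
import math

def no_action_value(Q, c_a, pi, rho, T, n_steps=2000):
    """Approximate U_0(T, pi, a) = int_0^T e^{-rho t} sum_i c_i(a) P{M_t = i} dt.

    Q is the generator (list of rows), c_a[i] = c_i(a), pi the prior.
    """
    m = len(pi)
    h = T / n_steps

    def drift(p):
        # Kolmogorov forward equation: dp_i/dt = sum_j p_j q_{j,i}
        return [sum(p[j] * Q[j][i] for j in range(m)) for i in range(m)]

    def rk4_step(p):
        k1 = drift(p)
        k2 = drift([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = drift([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = drift([x + h * k for x, k in zip(p, k3)])
        return [x + h * (a + 2 * b + 2 * c + d) / 6
                for x, a, b, c, d in zip(p, k1, k2, k3, k4)]

    def integrand(p, t):
        return math.exp(-rho * t) * sum(ci * x for ci, x in zip(c_a, p))

    p = list(pi)
    total, prev = 0.0, integrand(p, 0.0)
    for k in range(1, n_steps + 1):
        p = rk4_step(p)
        cur = integrand(p, k * h)
        total += 0.5 * h * (prev + cur)   # trapezoid rule on [t_{k-1}, t_k]
        prev = cur
    return total
```

When $Q=0$ the chain never moves and the result should agree with the closed form $C(\vec\pi,a)(1-e^{-\rho T})/\rho$, which gives a quick sanity check.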

### 2.3. Sample paths of $\vec\Pi$.

In this section we describe the filtering procedure of the controller, i.e. the evolution of the conditional probability process . Proposition 2.2 explicitly shows that the processes and are piecewise deterministic processes and hence have the strong Markov property, Davis (1993). This description of paths of the conditional probability process is also discussed in Proposition 2.1 in Ludkovski and Sezer (2007) and Proposition 2.1 of Bayraktar and Sezer (2006). We summarize the needed results below.

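Let

$$I(t)\triangleq\int_0^t\sum_{i=1}^m\lambda_i 1_{\{M_s=i\}}\,ds, \tag{2.11}$$

so that the probability of no events for the next $u$ time units is $\mathbb{P}^{\vec\pi}\{\sigma_1>u\}=\mathbb{E}^{\vec\pi}[e^{-I(u)}]$. Then for $\sigma_\ell\le t\le t+u<\sigma_{\ell+1}$, we have

$$\Pi_i(t+u)=\frac{\mathbb{P}^{\vec\pi}\{\sigma_1>u,\,M_u=i\}}{\mathbb{P}^{\vec\pi}\{\sigma_1>u\}}\Bigg|_{\vec\pi=\vec\Pi(t)}. \tag{2.12}$$

On the other hand, upon an arrival of size $Y_{\ell+1}$, the conditional probability $\vec\Pi$ experiences a jump

$$\Pi_i(\sigma_{\ell+1})=\frac{\lambda_i f_i(Y_{\ell+1})\,\Pi_i(\sigma_{\ell+1}-)}{\sum_{j\in E}\lambda_j f_j(Y_{\ell+1})\,\Pi_j(\sigma_{\ell+1}-)},\quad\text{for } \ell\in\mathbb{N}. \tag{2.13}$$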
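The jump update (2.13) is an ordinary Bayes rule in which the likelihood of a mark $y$ in state $i$ is proportional to $\lambda_i f_i(y)$. A minimal sketch follows, with hypothetical names and with the density values $f_i(y)$ assumed to be precomputed by the caller:

```python
def bayes_jump(pi_minus, lam, f_at_y):
    """Posterior after an arrival with mark y, following (2.13).

    pi_minus[i] is the belief just before the arrival, lam[i] the intensity
    in state i, and f_at_y[i] the value f_i(y) of the mark density.
    """
    weights = [l * f * p for l, f, p in zip(lam, f_at_y, pi_minus)]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("the observed mark has zero likelihood under every state")
    return [w / total for w in weights]
```

For example, with equal mark densities, intensities `[1.0, 3.0]` and a uniform prior, a single arrival moves the belief to `[0.25, 0.75]`: an arrival is three times as likely in the high-intensity state.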

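To simplify (2.12), define $\vec x(t,\vec\pi)\equiv(x_1(t,\vec\pi),\ldots,x_m(t,\vec\pi))$ via

$$x_i(t,\vec\pi)\triangleq\frac{\mathbb{P}^{\vec\pi}\{\sigma_1>t,\,M_t=i\}}{\mathbb{P}^{\vec\pi}\{\sigma_1>t\}}=\frac{\mathbb{E}^{\vec\pi}\big[1_{\{M_t=i\}}\cdot e^{-I(t)}\big]}{\mathbb{E}^{\vec\pi}\big[e^{-I(t)}\big]},\quad\text{for } i\in E. \tag{2.14}$$

It can be checked easily that the paths $t\mapsto\vec x(t,\vec\pi)$ have the semigroup property $\vec x(t+u,\vec\pi)=\vec x(u,\vec x(t,\vec\pi))$. In fact, $\vec x$ can be described as a solution of coupled first-order ordinary differential equations. To observe this fact, first recall (Darroch and Morris, 1968; Neuts, 1989; Karlin and Taylor, 1981) that the vector

$$\vec m(t,\vec\pi)\equiv\big(m_1(t,\vec\pi),\ldots,m_m(t,\vec\pi)\big)\triangleq\Big(\mathbb{E}^{\vec\pi,a}\big[1_{\{M_t=1\}}\cdot e^{-I(t)}\big],\ldots,\mathbb{E}^{\vec\pi,a}\big[1_{\{M_t=m\}}\cdot e^{-I(t)}\big]\Big) \tag{2.15}$$

has the form

$$\vec m(t,\vec\pi)=\vec\pi\cdot e^{t(Q-\Lambda)},$$

where $\Lambda$ is the $m\times m$ diagonal matrix with $\Lambda_{i,i}=\lambda_i$. Thus, the components of $\vec m(t,\vec\pi)$ solve $dm_i(t,\vec\pi)/dt=-\lambda_i m_i(t,\vec\pi)+\sum_{j\in E}m_j(t,\vec\pi)\,q_{j,i}$, and together with the chain rule and (2.14) we obtain

$$\frac{dx_i(t,\vec\pi)}{dt}=\sum_{j\in E}q_{j,i}\,x_j(t,\vec\pi)-\lambda_i x_i(t,\vec\pi)+x_i(t,\vec\pi)\sum_{j\in E}\lambda_j x_j(t,\vec\pi). \tag{2.16}$$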

For the sequel we note again that .
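Between arrivals the belief follows the deterministic flow (2.16). The sketch below integrates that system with a classical fourth-order Runge-Kutta scheme; the integrator, step size and function names are choices made for this illustration and are not the numerical method of the paper. Since the right-hand sides of (2.16) sum to zero whenever the components of $\vec x$ sum to one, the flow stays on the simplex.

```python
import math

def flow_rhs(x, Q, lam):
    """Right-hand side of (2.16) evaluated at the belief vector x."""
    m = len(x)
    mean_rate = sum(lam[j] * x[j] for j in range(m))   # sum_j lambda_j x_j
    return [
        sum(Q[j][i] * x[j] for j in range(m)) - lam[i] * x[i] + x[i] * mean_rate
        for i in range(m)
    ]

def flow(pi, t, Q, lam, h=1e-3):
    """Approximate x(t, pi) using RK4 steps of size close to h."""
    n = max(1, round(t / h))
    step = t / n
    x = list(pi)
    for _ in range(n):
        k1 = flow_rhs(x, Q, lam)
        k2 = flow_rhs([v + 0.5 * step * k for v, k in zip(x, k1)], Q, lam)
        k3 = flow_rhs([v + 0.5 * step * k for v, k in zip(x, k2)], Q, lam)
        k4 = flow_rhs([v + step * k for v, k in zip(x, k3)], Q, lam)
        x = [v + step * (a + 2 * b + 2 * c + d) / 6
             for v, a, b, c, d in zip(x, k1, k2, k3, k4)]
    return x

# With Q = 0 the normalized form of pi * exp(-t * Lambda) is available in
# closed form, which gives a check on the integrator.
x = flow([0.5, 0.5], 1.0, [[0.0, 0.0], [0.0, 0.0]], [1.0, 3.0])
closed_form = 1.0 / (1.0 + math.exp(-2.0))
```

In this example the absence of arrivals shifts the belief toward the low-intensity state, since silence is more likely when the intensity is small.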

The preceding equations (2.12) and (2.13) imply that

###### Proposition 2.2.

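The process $\vec\Pi$ is a piecewise-deterministic, $(\mathbb{P},\mathbb{F}^X)$-Markov process. The paths have the characterization

$$\begin{cases}\vec\Pi(t)=\vec x\big(t-\sigma_\ell,\vec\Pi(\sigma_\ell)\big), & \sigma_\ell\le t<\sigma_{\ell+1},\ \ell\in\mathbb{N},\\[4pt] \vec\Pi(\sigma_\ell)=\left(\dfrac{\lambda_1 f_1(Y_\ell)\,\Pi_1(\sigma_\ell-)}{\sum_{j\in E}\lambda_j f_j(Y_\ell)\,\Pi_j(\sigma_\ell-)},\ldots,\dfrac{\lambda_m f_m(Y_\ell)\,\Pi_m(\sigma_\ell-)}{\sum_{j\in E}\lambda_j f_j(Y_\ell)\,\Pi_j(\sigma_\ell-)}\right).\end{cases} \tag{2.17}$$

Alternatively, we can describe $\vec\Pi$ in terms of the random measure $p$,

$$d\Pi_i(t)=\mu_i(\vec\Pi(t-))\,dt+\int_{\mathbb{R}^d}J_i(\vec\Pi(t-),y)\,p(dt,dy),$$

for all $i\in E$, where, consistently with (2.16) and (2.13),

$$\mu_i(\vec\pi)=\sum_{j\in E}q_{j,i}\,\pi_j+\pi_i\Big(\sum_{j\in E}\lambda_j\pi_j-\lambda_i\Big),\quad\text{and}\quad J_i(\vec\pi,y)=\pi_i\cdot\left(\frac{\lambda_i f_i(y)}{\sum_{j\in E}\lambda_j f_j(y)\,\pi_j}-1\right). \tag{2.18}$$

Here, one should also note that the $(\mathbb{P},\mathbb{F}^X)$-compensator of the random measure $p$ is

$$\tilde p((0,t]\times A)=\sum_{j\in E}\int_0^t\int_A\lambda_j f_j(y)\,\Pi_j(s)\,\nu(dy)\,ds,\quad t\ge 0,\ A \text{ Borel}.$$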

In more general models with point process observations, an explicit filter for $\vec\Pi$ would not be available and one would have to resort to simulation-based approaches, see e.g. Chopin and Varini (2007). The subsequent optimization step would then appear to be intractable, though an integrated Markov chain Monte Carlo paradigm for filtering and optimization was proposed in Muller et al. (2004).

## 3. Two Dynamic Programming Equations for the Value Function

In this section we establish two dynamic programming equations for the value function $U$. The first key equation (3.13) reduces the solution of the problem (1.3) to studying a system of coupled optimal stopping problems. The second dynamic programming principle of Proposition 3.4 shows that the value function is also the fixed point of a first jump operator. The latter representation will be useful in the numerical computations.

### 3.1. Coupled Optimal Stopping Operator

In this section we show that $U$ solves a coupled optimal stopping problem. Combined with regularity results in Section 4, this leads to a direct characterization of an optimal strategy. The analysis of this section parallels the general framework of impulse control of piecewise deterministic processes (pdp) developed by Costa and Davis (1989); Lenhart and Liao (1988). It is also related to optimal stopping of pdp's studied in Gugerli (1986); Costa and Davis (1988).

Let us introduce a functional operator $\mathcal{M}$ whose action on a test function $w$ is

$$\mathcal{M}w(T,\vec\pi,a)\triangleq\max_{b\in\mathcal{A},\,b\neq a}\big\{w(T,\vec\pi,b)-K(a,b,\vec\pi)\big\}. \tag{3.1}$$

The operator $\mathcal{M}$ is called the intervention operator and denotes the maximum value that can be achieved if an immediate best change is made to the current policy. Assuming some ordering on the finite policy set $\mathcal{A}$, let us denote the smallest policy choice achieving the maximum in (3.1) as

$$d_{\mathcal{M}w}(T,\vec\pi,a)\triangleq\min\big\{b\in\mathcal{A}:\ w(T,\vec\pi,b)-K(a,b,\vec\pi)=\mathcal{M}w(T,\vec\pi,a)\big\}. \tag{3.2}$$
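For a finite policy set, (3.1) and (3.2) amount to a maximization over the alternatives to the current policy with ties broken toward the smallest index. The following minimal sketch assumes the values $w(T,\vec\pi,b)$ and the belief-averaged costs $K(a,b,\vec\pi)$ have already been evaluated at a fixed $(T,\vec\pi)$ and are passed in as lists; the function name, the data layout, and the restriction of the minimizer to $b\neq a$ are assumptions of this example.

```python
def intervention(w_values, K_pi, a):
    """Return (Mw, d_Mw) at a fixed (T, pi) for the current policy index a.

    w_values[b] holds w(T, pi, b) and K_pi[a][b] holds K(a, b, pi).
    Policies are indexed 0, 1, ..., and there must be at least two of them.
    """
    best_val, best_b = None, None
    for b in range(len(w_values)):
        if b == a:
            continue                      # (3.1) maximizes over b != a only
        val = w_values[b] - K_pi[a][b]
        # strict inequality keeps the smallest maximizing index, as in (3.2)
        if best_val is None or val > best_val:
            best_val, best_b = val, b
    return best_val, best_b
```

For example, with `w_values = [1.0, 2.0, 2.0]` and all switching costs equal to `0.5`, the call `intervention(w_values, K_pi, 0)` returns `(1.5, 1)`: policies 1 and 2 tie and the smaller index is selected.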

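The main object of study in this section is another functional operator $\mathcal{G}$ whose action is described by the following optimal stopping problem:

$$\mathcal{G}V(T,\vec\pi,a)=\sup_{\tau\in\mathcal{S}(T)}\mathbb{E}^{\vec\pi,a}\left[\int_0^\tau e^{-\rho s}C(\vec\Pi_s,a)\,ds+e^{-\rho\tau}\mathcal{M}V(T-\tau,\vec\Pi_\tau,a)\right], \tag{3.3}$$

for $T\ge 0$, any prior $\vec\pi$, and $a\in\mathcal{A}$. We set $V_0\triangleq U_0$ from (2.9) and, iterating $\mathcal{G}$, obtain the following sequence of functions:

$$V_{n+1}\triangleq\mathcal{G}V_n,\quad n\ge 0. \tag{3.4}$$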
###### Lemma 3.1.

$(V_n)_{n\in\mathbb{N}}$ is an increasing sequence of functions.

In Section 4 we will further show that the $V_n$ are convex and continuous.

###### Proof.

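The statement follows since

$$\begin{aligned}V_1(T,\vec\pi,a)=\mathcal{G}V_0(T,\vec\pi,a)&=\sup_{\tau\in\mathcal{S}(T)}\mathbb{E}^{\vec\pi,a}\left[\int_0^\tau e^{-\rho s}C(\vec\Pi_s,a)\,ds+e^{-\rho\tau}\mathcal{M}V_0(T-\tau,\vec\Pi_\tau,a)\right]\\&\ge\mathbb{E}^{\vec\pi,a}\left[\int_0^T e^{-\rho s}C(\vec\Pi_s,a)\,ds\right]=U_0(T,\vec\pi,a)=V_0(T,\vec\pi,a),\end{aligned}$$

and since $\mathcal{G}$ is a monotone/positive operator, i.e. for any two functions $f_1\le f_2$ we have $\mathcal{G}f_1\le\mathcal{G}f_2$. ∎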

The following proposition shows that the value functions of (2.10), which correspond to the restricted control problems over , can be alternatively obtained via the sequence of iterated optimal stopping problems in (3.4).

$U_n=V_n$ for $n\in\mathbb{N}$.

###### Proof.

By definition we have that $U_0=V_0$. Let us assume that $U_n=V_n$ and show that $U_{n+1}=V_{n+1}$. We will carry out the proof in two steps.

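Step 1. First we will show that $U_{n+1}\le V_{n+1}$. Let $\xi\in\mathcal{U}_{n+1}(T)$,

$$\xi_t=\sum_{k=0}^{n+1}\xi_k\cdot 1_{[\tau_k,\tau_{k+1})}(t),\quad t\in[0,T],$$

with $\xi_0=a$ and $\tau_0=0$, be $\varepsilon$-optimal for the problem in (2.10), i.e.,

$$U_{n+1}(T,\vec\pi,a)-\varepsilon\le J^{\xi}(T,\vec\pi,a). \tag{3.5}$$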

Let $\tilde\xi$ be defined as

$$\tilde\xi_t=\sum_{k=0}^{n}\tilde\xi_k\cdot 1_{[\tilde\tau_k,\tilde\tau_{k+1})}(t),\quad t\in[0,T],$$

in which , , and , , for . Using the strong Markov property of , we can write as

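$$\begin{aligned}J^{\xi}(T,\vec\pi,a)&=\mathbb{E}^{\vec\pi,a}\left[\int_0^{\tau_1}e^{-\rho s}C(\vec\Pi_s,a)\,ds+e^{-\rho\tau_1}\Big(J^{\tilde\xi}(T-\tau_1,\vec\Pi_{\tau_1},\xi_1)-K(a,\xi_1,\vec\Pi_{\tau_1})\Big)\right]\\&\le\mathbb{E}^{\vec\pi,a}\left[\int_0^{\tau_1}e^{-\rho s}C(\vec\Pi_s,a)\,ds+e^{-\rho\tau_1}\Big(V_n(T-\tau_1,\vec\Pi_{\tau_1},\xi_1)-K(a,\xi_1,\vec\Pi_{\tau_1})\Big)\right]\\&\le\mathbb{E}^{\vec\pi,a}\left[\int_0^{\tau_1}e^{-\rho s}C(\vec\Pi_s,a)\,ds+e^{-\rho\tau_1}\mathcal{M}V_n(T-\tau_1,\vec\Pi_{\tau_1},a)\right]\\&\le\mathcal{G}V_n(T,\vec\pi,a)=V_{n+1}(T,\vec\pi,a).\end{aligned} \tag{3.6}$$

Here, the first inequality follows from the induction hypothesis, the second inequality follows from the definition (3.1) of $\mathcal{M}$ (whose third argument is the policy $a$ in force before the switch), and the last inequality from the definition of $\mathcal{G}$. As a result of (3.5) and (3.6) we have that $U_{n+1}\le V_{n+1}$, since $\varepsilon>0$ is arbitrary.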

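Step 2. To show the opposite inequality $U_{n+1}\ge V_{n+1}$, we will construct a special $\bar\xi\in\mathcal{U}_{n+1}(T)$. To this end let us introduce

$$\begin{cases}\bar\tau_1=\inf\big\{t\ge 0:\ \mathcal{M}V_n(T-t,\vec\Pi_t,a)\ge V_{n+1}(T-t,\vec\Pi_t,a)-\varepsilon\big\},\\[4pt]\bar\xi_1=d_{\mathcal{M}V_n}(T-\bar\tau_1,\vec\Pi_{\bar\tau_1},a).\end{cases} \tag{3.7}$$

Let $\hat\xi$, with switching times $\hat\tau_k$ and policies $\hat\xi_k$, be $\varepsilon$-optimal for the problem in which $n$ interventions are allowed, i.e. (2.10). Using $\hat\xi$ we now complete the description of the control $\bar\xi$ by assigning

$$\bar\tau_{n+1}=\hat\tau_n\circ\theta_{\bar\tau_1},\qquad\bar\xi_{n+1}=\hat\xi_n\circ\theta_{\bar\tau_1},\quad n\in\mathbb{N}_+, \tag{3.8}$$

in which $\theta$ is the classical shift operator used in the theory of Markov processes.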

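Note that $\bar\tau_1$ is an $\varepsilon$-optimal stopping time for the stopping problem in the definition of $\mathcal{G}V_n$. This follows from the classical optimal stopping theory since the process $\vec\Pi$ has the strong Markov property. Therefore,

$$\begin{aligned}V_{n+1}(T,\vec\pi,a)-\varepsilon&\le\mathbb{E}^{\vec\pi,a}\left[\int_0^{\bar\tau_1}e^{-\rho s}C(\vec\Pi_s,a)\,ds+e^{-\rho\bar\tau_1}\mathcal{M}V_n(T-\bar\tau_1,\vec\Pi_{\bar\tau_1},a)\right]\\&\le\mathbb{E}^{\vec\pi,a}\left[\int_0^{\bar\tau_1}e^{-\rho s}C(\vec\Pi_s,a)\,ds+e^{-\rho\bar\tau_1}\Big(U_n(T-\bar\tau_1,\vec\Pi_{\bar\tau_1},\bar\xi_1)-K(a,\bar\xi_1,\vec\Pi_{\bar\tau_1})\Big)\right],\end{aligned} \tag{3.9}$$

in which the second inequality follows from the definition of $\bar\xi_1$ and the induction hypothesis. It follows from (3.9) and the strong Markov property of $\vec\Pi$ that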

 (3.10) Vn+1(T,→π,a)−2ε≤E→π,a[∫¯¯τ10e−ρsC(→Πs,a)ds+e−ρ¯¯τ1(Un(T−¯¯¯τ1,→Π¯¯τ1,¯¯¯ξ1)−ε−K(a,ξ1,→Π¯¯τ1))]≤E→π,a[∫¯¯τ10e−ρsC(→