Bayesian sequential change diagnosis
Abstract.
Sequential change diagnosis is the joint problem of detection and identification of a sudden and unobservable change in the distribution of a random sequence. In this problem, the common probability law of a sequence of i.i.d. random variables suddenly changes at some disorder time to one of finitely many alternatives. This disorder time marks the start of a new regime, whose fingerprint is the new law of observations. Both the disorder time and the identity of the new regime are unknown and unobservable. The objective is to detect the regimechange as soon as possible, and, at the same time, to determine its identity as accurately as possible. Prompt and correct diagnosis is crucial for quick execution of the most appropriate measures in response to the new regime, as in fault detection and isolation in industrial processes, and target detection and identification in national defense. The problem is formulated in a Bayesian framework. An optimal sequential decision strategy is found, and an accurate numerical scheme is described for its implementation. Geometrical properties of the optimal strategy are illustrated via numerical examples. The traditional problems of Bayesian changedetection and Bayesian sequential multihypothesis testing are solved as special cases. In addition, a solution is obtained for the problem of detection and identification of component failure(s) in a system with suspended animation.
1. Introduction
Sequential change diagnosis is the joint problem of detection and identification of a sudden change in the distribution of a random sequence. In this problem, one observes a sequence of i.i.d. random variables , taking values in some measurable space . The common probability distribution of the ’s is initially some known probability measure on , and, in the terminology of statistical process control, the system is said to be “in control.” Then, at some unknown and unobservable disorder time , the common probability distribution changes suddenly to another probability measure for some unknown and unobservable index , and the system goes “out of control.” The objective is to detect the change as quickly as possible, and, at the same time, to identify the new probability distribution as accurately as possible, so that the most suitable actions can be taken with the least delay.
Decision strategies for this problem have a wide array of applications, such as fault detection and isolation in industrial processes, target detection and identification in national defense, pattern recognition and machine learning, radar and sonar signal processing, seismology, speech and image processing, biomedical signal processing, finance, and insurance. For example, suppose we perform a quality test on each item produced from a manufacturing process consisting of several complex processing components (labeled ). As long as each processing component is operating properly, we can expect the distribution of our quality test statistic to be stationary. Now, if there occurs a sudden fault in one of the processing components, this can change the distribution of our quality test statistic depending on the processing component which caused the fault. It may be costly to continue manufacture of the items at a substandard quality level, so we must decide when to (temporarily) shut down the manufacturing process and repair the fault. However, it may also be expensive to dissect each and every processing component in order to identify the source of the failure and to fix it. So, not only do we want to detect quickly when a fault happens, but, at the same time we want also to identify accurately which processing component is the cause. The time and the cause of the fault will be distributed independently according to a geometric and a finite distribution, respectively, if each component fails independently according to some geometric distributions, which is a reasonable assumption for highly reliable components; see Section 5.5. As another example, an insurance company may monitor reported claims not only to detect a change in its risk exposure, but also to assess the nature of the change so that it can adjust its premium schedule or rebalance appropriately its portfolio of reserves to hedge against a different distribution of loss scenarios.
Sequential change diagnosis can be viewed as the fusion of two fundamental areas of sequential analysis: change detection and multihypothesis testing. In traditional change detection problems, and there is only one change distribution, ; therefore, the focus is exclusively on detecting the change time, whereas in traditional sequential multihypothesis testing problems, there is no change time to consider. Instead, every observation has common distribution for some unknown , and the focus is exclusively on the inference of . Both change detection and sequential multihypothesis testing have been studied extensively. For recent reviews of these areas, we refer the reader to Basseville and Nikiforov [3], Dragalin, Tartakovsky and Veeravalli [8, 9], and Lai [14], and the references therein.
However, the sequential change diagnosis problem involves key tradeoff decisions not taken into account by separately applying techniques for change detection and sequential multihypothesis testing. While raising an alarm as soon as the change occurs is advantageous for the change detection task, it is undesirable for the isolation task because the longer one waits to raise the alarm, the more observations one has to use for inferring the change distribution. Moreover, the unknown change time complicates the isolation task, and, as a result, adaptation of existing sequential multihypothesis testing algorithms is problematic.
The theory of sequential change diagnosis has not been broadly developed. Nikiforov [16] provides the first results for this problem, showing asymptotic optimality for a certain nonBayesian approach, and Lai [13] generalizes these results through the development of informationtheoretic bounds and the application of likelihood methods. In this paper, we follow a Bayesian approach to reveal a new sequential decision strategy for this problem, which incorporates a priori knowledge regarding the distributions of the change time and of the change index . We prove that this strategy is optimal and we describe an accurate numerical scheme for its implementation.
In Section 2 we formulate precisely the problem in a Bayesian framework, and in Section 3 we show that it can be reduced to an optimal stopping of a Markov process whose state space is the standard probability simplex. In addition, we establish a simple recursive formula that captures the dynamics of the process and yields a sufficient statistic fit for online tracking.
In Section 4 we use optimal stopping theory to substantiate the optimality equation for the value function of the optimal stopping problem. Moreover, we prove that this value function is bounded, concave, and continuous on the standard probability simplex. Furthermore, we prove that the optimal decision strategy uses a finite number of observations on average and we establish some important characteristics of the associated optimal stopping/decision region. In particular, we show that the optimal stopping region of the state space for the problem consists of nonempty, convex, closed, and bounded subsets. Also, we consider a truncated version of the problem that allows at most observations from the sequence of random measurements. We establish an explicit bound (inversely proportional to ) for the approximation error associated with this truncated problem.
In Section 5 we show that the separate problems of change detection and sequential multihypothesis testing are solved as special cases of the overall joint solution. We illustrate some geometrical properties of the optimal method and demonstrate its implementation by numerical examples for the special cases and . Specifically, we show instances in which the convex subsets comprising the optimal stopping region are connected and instances in which they are not. Likewise, we show that the continuation region (i.e., the complement of the stopping region) need not be connected. We provide a solution to the problem of detection and identification of component failure(s) in a system with suspended animation. Finally, we outline in Section 6 how the changediagnosis algorithm may be implemented with a computer in general. Proofs of most results are deferred to the Appendix.
2. Problem statement
Let be a probability space hosting random variables and and a process taking values in some measurable space . Suppose that for every , , , and
(2.1) 
for some given probability measures on , known constants , , and such that , where and . Namely, is independent of ; it has a zeromodified geometric distribution with parameters and in the terminology of Klugman, Panjer, and Willmot [12, Sec. 3.6], which reduces to the standard geometric distribution with success probability when . Moreover, is the probability that the change type is for every .
Conditionally on and , the random variables , are independent; and are identically distributed with common distributions and , respectively. The probability measures always admit densities with respect to some finite measure on ; for example, we can take . So, we fix and denote the corresponding densities by , respectively.
Suppose now that we observe sequentially the random variables , . Their common probability density function changes at stage to some other probability density function , . Our objective is to detect the change time as quickly as possible and isolate the change index as accurately as possible. More precisely, given costs associated with detection delay, false alarm, and false isolation of the change index, we seek a strategy that minimizes the expected total change detection and isolation cost.
In view of the fact that the observations arrive sequentially, we are interested in sequential diagnosis schemes. Specifically, let denote the natural filtration of the observation process , where
A sequential decision strategy is a pair consisting of a stopping time (or stopping rule) of the filtration and a terminal decision rule measurable with respect to the history of observation process through stage . Applying a sequential decision strategy consists of announcing at the end of stage that the common probability density function has changed from to at or before stage . Let
denote the collection of all such sequential decision strategies (“” means that is a stopping time of filtration ). Let us specify the possible losses associated with a sequential decision strategy as follows:

Detection delay loss. Let us denote by a fixed positive constant the detection delay cost per period. Then the expected decision delay cost for is , possibly infinite, where .

Terminal decision loss. Here we identify two cases of isolation loss depending on whether or not the change has actually occurred at or before the stage in which we announce the isolation decision:

Loss due to false alarm. Let us denote by the isolation cost on for every . Then the expected false alarm cost for is .

Loss due to false isolation. Let us denote by the isolation cost on the event for every . Then the expected false isolation cost for is .
Here, are known nonnegative constants, and for every ; i.e., no cost incurred for making a correct terminal decision.

Accordingly, for every sequential decision strategy , we define a Bayes risk function
(2.2) 
as the expected diagnosis cost: the sum of the expected detection delay cost and the expected terminal decision cost upon alarm. The problem is to find a sequential decision strategy (if it exists) with the minimum Bayes risk
(2.3) 
3. Posterior analysis and formulation as an optimal stopping problem
In this section we show that the Bayes risk function in (2.2) can be written as the expected value of the running and terminal costs driven by a certain Markov process. We use this fact to recast the minimum Bayes risk in (2.3) as a Markov optimal stopping problem.
Let us introduce the posterior probability processes
Having observed the first observations, is the posterior probability that the change has not yet occurred at or before stage , while is the posterior joint probability that the change has occurred by stage and that the hypothesis is correct. The connection of these posterior probabilities to the loss structure for our problem is established in the next proposition.
Proposition 3.1.
For every sequential decision strategy , the Bayes risk function (2.2) can be expressed in terms of the process as
While our original formulation of the Bayes risk function (2.2) was in terms of the values of the unobservable random variables and , Proposition 3.1 gives us an equivalent version of the Bayes risk function in terms of the posterior distributions for and . This is particularly effective in light of Proposition 3.2, which we state with the aid of some additional notation that is referred to throughout the paper. Let
denote the standard dimensional probability simplex. Define the mappings and by
(3.1) 
and the operator on the collection of bounded functions by
(3.2) 
Proposition 3.2.
The process possesses the following properties:

The process is a supermartingale, and for every .

The process is a submartingale for every .

The process is a Markov process, and
(3.3) with initial state and , Moreover, for every bounded function and , we have .
Remark 3.3.
Since , the vector for every . Since is uniformly bounded, the limit exists by the martingale convergence theorem. Moreover, a.s. by Proposition 3.2(a) since .
Now, let the functions from into be defined by
respectively. Then, we note that for every , we have
where we define on the event the terminal decision rule to be any index satisfying . In other words, an optimal terminal decision depends only upon the value of the process at the stage in which we stop. Note also that the functions and are bounded on . Therefore, we have the following:
lemma 3.4.
The minimum Bayes risk (2.3) reduces to the following optimal stopping of the Markov process :
We simplify this formulation further by showing that it is enough to take the infimum over
(3.4) 
where
(3.5) 
is the minimum cost obtained by making the best terminal decision when alarm is set at time . Since is bounded on , the process consists of integrable random variables. So the expectation exists for every , and our problem becomes
(3.6) 
Observe that for every because . In fact, we have for every . Since , it is enough to consider such that . Namely, (3.6) reduces to
(3.7) 
4. Solution via optimal stopping theory
In this section we derive an optimal solution for the sequential change diagnosis problem in (2.3) by building on the formulation of (3.7) via the tools of optimal stopping theory.
4.1. The optimality equation
We begin by applying the method of truncation with a view of passing to the limit to arrive at the final result. For every and , define the subcollections
of stopping times in of (3.4). Note that . Now, consider the families of (truncated) optimal stopping problems corresponding to and , respectively, defined by
(4.1) 
Note that .
To investigate these optimal stopping problems, we introduce versions of the Snell envelope of (i.e., the smallest regular supermartingale dominating ) corresponding to and , respectively, defined by
(4.2) 
Then through the following series of lemmas, whose proofs are deferred to the Appendix, we point out several useful properties of these Snell envelopes. Finally, we extend these results to an arbitrary initial state vector and establish the optimality equation. Note that each of the ensuing (in)equalities between random variables are in the almost sure sense.
First, these Snell envelopes provide the following alternative expressions for the optimal stopping problems introduced in (4.1) above.
lemma 4.1.
For every and , we have and .
Second, we have the following backwardinduction equations.
lemma 4.2.
We have for every . For every and , we have and .
We also have that these versions of the Snell envelopes coincide in the limit as . That is,
lemma 4.3.
For every , we have .
Next, recall from (3.2) and Proposition 3.2(c) the operator and let us introduce the operator on the collection of bounded functions defined by
(4.3) 
Observe that . That is, is a nonnegative bounded function. Therefore, is welldefined. If is nonnegative and bounded, then is defined for every , with by definition. Using operator , we can express in terms of the process as stated in the following lemma.
lemma 4.4.
For every , and , we have
(4.4) 
The next lemma shows how the optimal stopping problems can be rewritten in terms of the operator . It also conveys the connection between the truncated optimal stopping problems and the initial state of the process.
lemma 4.5.
We have

for every , and

.
Observe that since , we have for some . On the other hand, for every we can construct a probability space hosting a Markov process with the same dynamics as in (3.3) and . Moreover, on such a probability space, the preceding results remain valid. So, let us denote by the expectation with respect to and rewrite (4.1) as
for every . Then Lemma 4.5 implies that
(4.5) 
for every . Taking limits as of both sides in and applying the monotone convergence theorem on the righthand side yields . Hence, we have shown the following result.
Proposition 4.6 (Optimality equation).
For every , we have
(4.6) 
4.2. Some properties of the value function
Now, we reveal some important properties of the value function of (4.5). These results help us to establish an optimal solution for , and hence an optimal solution for , in the next subsection.
lemma 4.8.
If is a bounded concave function, then so is .
Proposition 4.9.
The mappings and are concave.
Proposition 4.10.
For every and , we have
Since , uniformly in .
Proposition 4.11.
For every , the function is continuous.
Corollary 4.12.
The function is continuous.
4.3. An optimal sequential decision strategy
Finally, we describe the optimal stopping region in implied by the value function , and we present an optimal sequential decision strategy for our problem. Let us define for every ,
Theorem 4.15 below shows that it is always optimal to stop and raise an alarm as soon as the posterior probability process enters the region . Intuitively, this follows from the optimality equation (4.6). At any stage, we always have two choices: either we stop immediately and raise an alarm or we wait for at least one more stage and take an additional observation. If the posterior probability of all possibilities is given by the vector , then the costs of those competing actions equal and , respectively, and it is always better to take the action that has the smaller expected cost. The cost of stopping is less (and therefore stopping is optimal) if , equivalently, if . Likewise, if at most stages are left, then stopping is optimal if or .
For each , let denote the unit vector consisting of zero in every component except for the th component, which is equal to one. Note that are the extreme points of the closed convex set , and any vector can be expressed in terms of as .
theorem 4.13.
For every , is a decreasing sequence of nonempty, closed, convex subsets of . Moreover,
Furthermore, .
lemma 4.14.
For every , we have
theorem 4.15.
Let .

The stopped process is a martingale.

The random variable is an optimal stopping time for , and

.
Therefore, the pair is an optimal sequential decision strategy for (2.3), where the optimal stopping rule is given by Theorem 4.15, and, as in the proof of Lemma 3.4, the optimal terminal decision rule is given by
Accordingly, the set is called the stopping region implied by , and Theorem 4.13 reveals its basic structure. We demonstrate the use of these results in the numerical examples of Section 5.
Note that we can take a similar approach to prove that the stopping rules are optimal for the truncated problems in (4.5). Thus, for each , the set is called the stopping region for : it is optimal to terminate the experiments in if stages are left before truncation.
5. Special cases and examples
In this section we discuss solutions for various special cases of the general formulation given in Section 2. First, we show how the traditional problems of Bayesian sequential change detection and Bayesian sequential multihypothesis testing are formulated via the framework of Section 2. Then we present numerical examples for the cases and . In particular, we develop a geometrical framework for working with the sufficient statistic developed in Section 3 and the optimal sequential decision strategy developed in Section 4. Finally, we solve the special problem of detection and identification of primary component failure(s) in a system with suspended animation.
5.1. A. N. Shiryaev’s sequential change detection problem
5.2. Sequential multihypothesis testing
5.3. Two alternatives after the change
In this subsection we consider the special case in which we have only two possible change distributions, and . We describe a graphical representation of the stopping and continuation regions for an arbitrary instance of the special case . Then we use this representation to illustrate geometrical properties of the optimal method (Section 4.3) via model instances for certain choices of the model parameters , , , , , , , , , , , and .
Let the linear mapping be defined by . Since for every , we can recover the preimage of any point . For every point , the coordinate is given by the Euclidean distance from the image point to the edge of the image triangle that is opposite the image point , for each . For example, the distance from the image point to the edge of the image triangle opposite the lowerlefthand corner is the value of the preimage coordinate . See Figure 1.
Therefore, we can work with the mappings and of the stopping region and the continuation region , respectively. Accordingly, we depict the decision region for each instance in this subsection using the twodimensional representation as in the righthandside of Figure 1 and we drop the notation when labeling various parts of each figure to emphasize their source in .
Each of the examples in this section have the following model parameters in common:
We vary the delay cost and false alarm/isolation costs to illustrate certain geometrical properties of the continuation and stopping regions. See Figures 2, 3, and 4.
Specifically, these examples show instances in which the convex subsets comprising the optimal stopping region are connected (Figure 2) and instances in which they are not (Figures 3 and 4(a)). Figure 4(b) shows an instance in which the continuation region is disconnected.
Each of the figures in this section have certain features in common. On each subfigure there is a dashed line representing those states at which . Also, each subfigure shows a sample path of and the realizations of and for the sample. The shaded area, including its solid boundary, represents the optimal stopping region, while the unshaded area represents the continuation region.
An implementation of the optimal strategy as described in Section 4.3 is as follows: Initialize the statistic by setting as in part (c) of Proposition 3.2. Use the dynamics of (3.3) to update the statistic as each observation is realized. Stop taking observations when the statistic enters the stopping region for the first time, possibly before the first observation is taken (i.e., ). The optimal terminal decision is based upon whether the statistic is in or upon stopping. Each of the sample paths in Figures 2, 3, and 4 were generated via this algorithm. As Figure 2 shows, the sets and can intersect on their boundaries and so it is possible to stop in their intersection. In this case, either of the decisions or is optimal.
We use value iteration of the optimality equation (4.6) over a fine discretization of to compute and generate the decision region for each subfigure. Because in the expression the value for any fixed initial condition on the left depends on the entire function on on the right, we have to calculate (or approximate it by ) on the entire space . The resulting discretized decision region is mapped into the plane via .
See Bertsekas [4, Chapter 3] for techniques of computing the value function via the optimality equation such as value iteration. Solving the optimality equation by discretizing highdimensional statespace may not be the best option. Monte Carlo methods based on regression models for the value function seem to scale better as the dimension of the statespace increases; see, for example, Longstaff and Schwartz [15], Tsitsiklis and van Roy [21], Glasserman [10, Chapter 8] for details.
5.4. Three alternatives after the change
In this subsection we consider the special case in which we have three possible change distributions, , , and . Here, the continuation and stopping regions are subsets of . Similar to the twoalternatives case, we introduce the mapping of into via
Then we use this representation—actually a rotation of it—to illustrate in Figure 5 an instance with the following model parameters:
Note that Figure 5 can be interpreted in a manner similar to the figures of the previous subsection. In this case, for every point , the coordinate is given by the (Euclidean) distance from the image point to the face of the image tetrahedron that is opposite the image corner , for each .
5.5. Detection and identification of component failure(s) in a system with suspended animation
Consider a system consisting initially of two working concealed components (labeled 1 and 2) such that upon the failure of either component, the system goes into a state of suspended animation. That is, while both components are still working normally, observations of output of the system have density , but upon failure of either component the density of observations changes thereafter (until an alarm is raised) to one of two alternatives: if component fails before component , then postfailure observations have density , otherwise they have density . The problem is to detect quickly when there has been a component failure and to identify accurately which component has actually failed based only on sequential observations of output of the system.
Let the random variables
be respectively the time of failure of the first failed component of the system and the corresponding index of this component, where the failure time of the th component is a random variable having a geometric distribution with failure probability , . It can be shown easily that when the disorder times and are independent, the random variable has a geometric distribution with failure probability (or equivalently, has a zeromodified geometric distribution with parameters and ) and that it is independent of the random variable , which has distribution and . So although the failure type (i.e., which component has failed) is a function of the failure times of each component, it turns out that this problem fits properly within the Bayesian sequential change diagnosis framework.
This problem can be extended naturally to several components and solved via the technology of Sections 3 and 4. In fact, it can be configured for a variety of scenarios. For example, seriesconnected components where malfunction of one component suspends immediately the operation of all the remaining components can appear in various electronic relays and multicomponent electronic devices which have fuses to protect the system from the misbehavior of one of its components. Since the system may react differently to diagnostics run by the operators, postmalfunction behavior can differ according to the underlying cause of the malfunction. See Barlow [2, Section 8.4] for background on series systems with suspended animation. Consider also a manufacturing process where we perform a quality test on the final output produced from several processing components. If a component is highly reliable then a geometric distribution with a low failure rate can be a reasonable choice for the lifetime of the component. Moreover, since the typical duration between successive component failures widens over time we can often treat the remaining components as if they enter a state of suspended animation under certain cost structures. That is, we can expect the remaining components to outlive the alarm. For example, suppose that two independent geometric random variables have expected lifetime of each. Then the first failure will occur at about time on average, while the second failure will take an additional periods on average to occur. As illustrated in Figures 2 and 4, respectively, lower falsealarm costs promote raising the alarm earlier, while a higher delay cost discourages waiting for more than relatively few additional periods to raise the alarm.
Specifically, suppose that in a “black box” there are components whose lifetimes are independent and geometrically distributed. Observations have initially distribution while the system is working, but upon failure of a single component (or simultaneous failure of multiple components), the remaining components enter a state of suspended animation, and the postfailure distribution of observations is determined by the failed component(s). We want to detect the time when at least one of them fails as soon as possible. Moreover, when we raise an alarm we would like to be able to make as accurately as possible diagnoses such as (1) how many of the components have actually failed, and (2) which ones.
Again, let the failure time of the th component be a random variable having a geometric distribution with failure probability , , and define
as the time when at least one of the components fails. Let the mapping be a nonnegativeintegervalued measure on the discrete algebra of the set of component indices, and define the random variable