
# Finite Optimal Control for Time-Bounded Reachability in CTMDPs and Continuous-Time Markov Games

This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Center “Automatic Verification and Analysis of Complex Systems” (SFB/TR 14 AVACS) and by the Engineering and Physical Science Research Council (EPSRC) through grant EP/H046623/1 “Synthesis and Verification in Markov Game Structures”.

Markus Rabe (Universität des Saarlandes) and Sven Schewe (University of Liverpool)

###### Abstract

We establish the existence of optimal scheduling strategies for time-bounded reachability in continuous-time Markov decision processes, and of co-optimal strategies for continuous-time Markov games. Furthermore, we show that optimal control does not only exist, but has a surprisingly simple structure: The optimal schedulers from our proofs are deterministic and timed-positional, and the bounded time can be divided into a finite number of intervals, in which the optimal strategies are positional. That is, we demonstrate the existence of finite optimal control. Finally, we show that these pleasant properties of Markov decision processes extend to the more general class of continuous-time Markov games, and that both early and late schedulers show this behaviour.

## 1 Introduction

Continuous-time Markov decision processes (CTMDPs) are a widely used framework for dependability analysis and for modelling the control of manufacturing processes [12, 6], because they combine real-time aspects with probabilistic behaviour and non-deterministic choices. CTMDPs can also be viewed as a framework that unifies different stochastic model types [14, 12, 9, 7, 10].

While CTMDPs allow for analysing worst-case and best-case scenarios, they fall short of the demands that arise in many real control problems, as they disregard the different nature that non-determinism can have depending on its source: Some sources of non-determinism are supportive, while others are hostile, and in a realistic control scenario, we face both types of non-determinism at the same time: Supportive non-determinism can be used to model the influence of a controller on the evolution of a system, while hostile non-determinism can capture abstraction or unknown environments. We therefore consider a natural extension of CTMDPs: Continuous-time Markov games (CTMGs) that have two players with opposing objectives [5].

The analysis of CTMDPs and CTMGs requires resolving the non-deterministic choices by means of a scheduler (which consists of a pair of strategies in the case of CTMGs), and typically aims to optimise a given objective function.

In this paper, we study the time-bounded reachability problem, which has recently received much attention [3, 16, 10, 11, 5]. Time-bounded reachability in CTMDPs is the standard control problem of constructing a scheduler that controls the Markov decision process such that the probability of reaching a goal region within a given time bound is maximised (or minimised), and of determining the value. For CTMGs, time-bounded reachability reduces to finding a Nash equilibrium, that is, a pair of strategies for the players, such that each strategy is optimal for the chosen strategy of her opponent.

While continuous-time Markov games are a young field of research [5, 13], Markov decision processes have been studied for decades [8, 4].

Optimal control in CTMDPs clearly depends on the observational power we allow our schedulers to have when observing a run. In the literature, various classes of schedulers with different restrictions on what they can observe [15, 10, 5, 13] are considered. We focus on the most general class of schedulers, schedulers that can fully observe the system state and may change their decisions at any point in time (late schedulers, cf. [4, 10]). To be able to translate our results to the more widespread class of schedulers that fix their decisions when entering a location (early schedulers), we introduce discrete locations that allow for a translation from early to late schedulers (see Appendix D).

Due to their practical importance, time-bounded reachability for continuous-time Markov models has been studied intensively [5, 4, 11, 3, 10, 1, 2, 16]. However, most previous research focussed on approximating optimal control. (The existence of optimal control is currently only known for the artificial class of time-abstract schedulers [5, 13], which assume that the scheduler has no access whatsoever to a clock.) While an efficient approximation is of interest to a practitioner, being unable to determine whether or not optimal control exists is very dissatisfying from a scientific point of view.

#### Contributions.

This paper has three main contributions: First, we extend the common model of CTMDPs by adding discrete locations, which are passed in zero time. This generalisation of the model is mainly motivated by avoiding the discussion about the appropriate scheduler class. In particular, the widespread class of schedulers that fix their actions when entering a location can be encoded by a simple mapping.

The second contribution of this paper is the answer to an intriguing research question that remained unresolved for half a century: We show that optimal control of CTMDPs exists for time-bounded reachability and safety objectives. Moreover, we show that optimal control can always be finite.

Our third contribution is to lift these results to continuous-time Markov games.

Pursuing a different research question, we exploit proof techniques that differ from those frequently used in the analysis of CTMDPs. Our proofs build mainly on topological arguments: The proof that demonstrates the existence of measurable optimal schedulers, for example, shows that we can fix the decisions of an optimal scheduler successively on closures of open sets (yielding only measurable sets), and the lift to finiteness uses local optimality of positional schedulers in open left and right environments of arbitrary points in time and the compactness of the bounded time interval.

#### Structure of the Paper.

We follow a slightly unorthodox order of proofs for a mathematical paper: we start with a special case in Section 3 and generalise the results later. Besides keeping the proofs simple, this approach is chosen because the simplest case, CTMDPs, is the classical case, and we assume that a wider audience is interested in results for these structures. In the following section, we strengthen this result by demonstrating that optimal control does not only exist, but can be found among schedulers with finitely many switching points and positional strategies between them. In Section 5, we lift this result to single player games (thus extending it to other scheduler classes like those which fix their decision when entering a location, cf. Appendix D). In the final section, we generalise the existence theorem for finite optimal control to finite co-optimal strategies for general continuous-time Markov games.

## 2 Preliminaries

A continuous-time Markov game is a tuple , consisting of

• a finite set of locations, which is partitioned into

• a set of discrete locations and a set of continuous locations, and

• sets and of locations owned by a reachability and a safety player,

• a dedicated set of goal locations,

• a finite set of actions,

• a rate matrix ,

• a discrete transition matrix , and

• an initial distribution ,

that satisfies the following side-conditions: For all continuous locations , there must be an action such that ; we call such actions enabled. For actions enabled in continuous locations, we require , and we require for the remaining actions. For discrete locations, we require that either holds for all , or that holds true. Like in the continuous case, we call the latter actions enabled and require the existence of at least one enabled action for each discrete location .

The idea behind discrete-time locations is that they execute immediately. We therefore do not permit cycles of only discrete-time locations (counting every positive rate of any action as a transition). This restriction is stronger than it needs to be, but it simplifies our proofs, and the simpler model is sufficient for our purposes.
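The restriction on cycles of discrete locations can be checked mechanically. The following is a minimal sketch under an assumed encoding: `discrete` is the set of discrete locations and `succ` maps a location to the set of locations it can move to under some action. The names `has_discrete_cycle`, `discrete` and `succ` are illustrative and not part of the formal model.

```python
def has_discrete_cycle(discrete, succ):
    """Return True if some cycle visits only discrete locations.

    discrete: set of discrete locations (hypothetical encoding).
    succ: dict mapping a location to the set of its possible successors.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS stack, finished
    colour = {l: WHITE for l in discrete}

    def visit(l):
        colour[l] = GREY
        for nxt in succ.get(l, ()):
            if nxt not in discrete:
                continue  # a continuous location interrupts the chain
            if colour[nxt] == GREY:
                return True  # back edge: cycle of discrete locations only
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[l] = BLACK
        return False

    return any(colour[l] == WHITE and visit(l) for l in discrete)
```

A model would be rejected when this function returns `True`; a cycle that passes through at least one continuous location is accepted.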

We assume that the goal region is absorbing, that is holds for all and . See Section 7 for the extension to non-absorbing goal regions.

Intuitively, it is the objective of the reachability player to maximise the probability to reach the goal region in a predefined time , while it is the objective of the safety player to minimise this probability. (Hence, it is a zero-sum game.)

We are particularly interested in (traditional) CTMDPs. They are single player CTMGs, where either all positions belong to the reachability player (), or to the safety player (), without discrete locations ( and ).

#### Paths.

A timed path in a CTMG is a finite sequence in . We write

$$l_0 \xrightarrow{a_0,\,t_0} l_1 \xrightarrow{a_1,\,t_1} \cdots \xrightarrow{a_{n-1},\,t_{n-1}} l_n$$

for a sequence and we require for all , where is the time bound for our time-bounded reachability probability. (We are not interested in the behaviour of the system after .) The denote the system’s time when the action is selected and a discrete transition from to takes place. Concatenation of paths will be written as if the last location of is the first location of and the points of time are ordered correctly. We call a timed path a complete timed path when we want to stress that this path describes a complete system run, not to be extended by further transitions.

#### Schedulers and Strategies.

The nondeterminism in the system needs to be resolved by a scheduler which maps paths to decisions. The power of schedulers is determined by their ability to observe and distinguish paths, and thus by their domain. In this paper, we consider the following common scheduler classes:

• Timed history-dependent (TH) schedulers
that map timed paths and the remaining time to decisions.

• Timed positional (TP) schedulers
that map locations and the remaining time to decisions.

• Positional (P) or memoryless schedulers
that map locations to decisions.

Decisions are either randomised (R), in which case is the set of distributions over enabled actions, or are restricted to deterministic (D) choices, that is . Where it is necessary to distinguish randomised and deterministic versions we will add a postfix to the scheduler class, for example THD and THR.
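The three scheduler classes differ only in their domain, so a scheduler of a weaker class can be read as a scheduler of a stronger class that ignores part of its input. The sketch below illustrates this for deterministic schedulers. The type aliases, the helper names `p_as_tp` and `tp_as_th`, and the encoding of a timed path as a sequence whose last entry is the current location are assumptions made for illustration only.

```python
from typing import Callable, Hashable, Sequence

Location = Hashable
Action = Hashable

# Hypothetical encodings of the three scheduler classes (deterministic variants).
PScheduler = Callable[[Location], Action]            # location -> decision
TPScheduler = Callable[[Location, float], Action]    # location, remaining time -> decision
THScheduler = Callable[[Sequence, float], Action]    # timed path, remaining time -> decision


def p_as_tp(p: PScheduler) -> TPScheduler:
    """A positional scheduler ignores the remaining time."""
    return lambda location, remaining: p(location)


def tp_as_th(tp: TPScheduler) -> THScheduler:
    """A timed positional scheduler only inspects the last location of the path."""
    return lambda path, remaining: tp(path[-1], remaining)
```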

Strategies. In case of CTMGs, a scheduler consists of the two participating players’ strategies, which can be seen as functions , where denotes, for , the paths ending on the position of the reachability or safety player, respectively. As for general schedulers, we can introduce restrictions on what players are able to observe.

Discrete locations. The main motivation for introducing discrete locations was to avoid the discussion of whether a scheduler has to fix its decision, as to which action it chooses, upon entering a location, or whether such a decision can be revoked while staying in the location. For example, the general measurable schedulers discussed in [15] have only indirect access to the remaining time (through the timed path), and therefore have to decide upon entering a location which action they want to perform. Our definition builds on fully-timed schedulers (cf. [4]) that were recently rediscovered and formalised by Neuhäußer et al. [10], which may revoke their decision after they enter a location. (As a side result, we lift Neuhäußer’s restriction to local uniformity.) The discrete locations now allow us to encode making the decision upon entering a continuous location by mapping the decision to a discrete location that is ‘guarding the entry’ to a family of continuous locations, one for each action enabled in . (See Appendix D for details.)

#### Cylindrical Schedulers.

While it is common to refer to TH schedulers as a class, the truth is that there is no straightforward way to define a measure for the time-bounded reachability probability for the complete class (cf. [15]). We therefore turn to a natural subset that can be used as a building block for a powerful yet measurable sub-class of TH schedulers, which is based on cylindrical abstractions of paths.

Let be a finite partition of the interval into intervals and for with and for , where is the time-bound from the problem definition. Then we denote with the interval that contains , called the -cylindrification of , and we denote with the -cylindrification of the timed path .

We call a TH scheduler -cylindrical if its decisions depend only on the cylindrification and of and , respectively, and cylindrical if it is -cylindrical for some finite partition of the interval .
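Cylindrification can be illustrated with a small lookup. This is a minimal sketch that assumes the partition consists of one closed interval starting at 0 followed by left-open, right-closed intervals, described by its sorted right endpoints. The function names and the list encoding of a timed path are hypothetical, and locations are assumed not to be tuples.

```python
from bisect import bisect_left


def cylindrify_time(breakpoints, t):
    """Index of the partition interval that contains t.

    breakpoints: sorted right endpoints b_0 < b_1 < ... < b_k = T, describing
    the partition [0, b_0], (b_0, b_1], ..., (b_{k-1}, b_k] (assumed shape).
    """
    if not 0 <= t <= breakpoints[-1]:
        raise ValueError("time outside [0, T]")
    return bisect_left(breakpoints, t)


def cylindrify_path(breakpoints, path):
    """Replace every time stamp of a timed path by its interval index.

    path: list [l0, (a0, t0), l1, (a1, t1), ..., ln] (hypothetical encoding).
    """
    return [
        (x[0], cylindrify_time(breakpoints, x[1])) if isinstance(x, tuple) else x
        for x in path
    ]
```

Under this encoding, a scheduler is cylindrical for the partition if it returns the same decision for any two inputs with equal `cylindrify_path` and `cylindrify_time` results.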

#### Cylindrical Sets and Probability Space.

For a given finite partition of the interval , an -cylindrical set of timed paths is the set of timed paths with the same -cylindrification, and we call a finite partition of a refinement of if every interval in is the union of intervals in .

For an -cylindrical scheduler and an -cylindrical set of finite timed paths, where is a refinement of , the likelihood that a complete path is from this cylindrical set is easy to define: Within each interval of , the likelihood that a CTMDP with scheduler behaves in accordance with the -cylindrical set can (assuming compliance in all previous intervals) be checked as for a finite Markov chain. (Footnote 1: The restriction to partitions that refine is purely technical, because for arbitrary we can simply use a partition that refines both and , and reconstruct every -cylindrical set as a finite union of -cylindrical sets.)

The probability to comply with the -th segment of the partition of is the product of three multiplicands :

1. the probability that the actions are chosen in accordance with the -cylindrical set of timed paths (which is either 0 or 1 for deterministic schedulers, and the product of the likelihoods of the individual decisions for randomised schedulers),

2. the probability that the transitions are taken in accordance with the -cylindrical set of timed paths, provided the respective actions are chosen, which is simply the product over the individual probabilities in this sequence of the -cylindrical set of timed paths, and

3. the probability that the right number of steps is made in this sequence of the -cylindrical set of timed paths.

The latter probability is if the last location is a discrete location, as the system would leave this location at the same point in time in which it was entered. Otherwise, it is the difference between the likelihood that at least the correct number of transitions starting in continuous locations are made (), and the likelihood that at least transitions starting in continuous locations are made () in the relevant sequence of the timed path.

Let be the continuous locations (named in the required order of appearance; note that might be , and that the same location can occur multiple times), and let be the transition rate one would observe at the respective . For deterministic schedulers, this transition rate is simply , where is the decision from at the respective position in a timed path and in . For a randomised scheduler , it is the respective expected transition rate , where is the likelihood that makes the decision at the respective position in a timed path and in . Note that the locations and transition rates are fixed.

The likelihood to get a path of length is then
for for , and for . Likewise, the likelihood to get a path of length is for . (Recall that and are the upper and lower endpoints of the interval .)

The likelihood that a complete timed path is in the -cylindrical set of timed paths for is the product over the individual .

#### Probability Space.

Having defined a measure for the likelihood that, for a given cylindrical scheduler, a complete timed path is in a particular cylindrical set, we define the likelihood that it is in a finite union of disjoint sets of cylindrical paths as the sum over the likelihood for the individual cylindrical sets.

This primitive probability measure for primitive schedulers can be lifted in two steps by a standard space completion, going from these primitive measures to quotient classes of Cauchy sequences of such measures:

1. In a first step, we complete the space of measurable sets of complete timed paths from finite unions of cylindrical sets of timed paths to Cauchy sequences of finite unions of cylindrical sets of timed paths. We define the required difference measure of two sets of timed paths (each a finite disjoint union of cylindrical sets) as the measure of the symmetric difference of the two sets. This set can be represented as a finite disjoint union of cylindrical sets of timed paths, and we can use our primitive measure to define this difference measure.

2. Having lifted the measure to this completed space of paths, we lift the set of measurable schedulers in a second step from cylindrical schedulers to Cauchy sequences of cylindrical schedulers. (The difference measure between two cylindrical schedulers is the likelihood that two schedulers act observably different.)

More details of these standard constructions can be found in Appendix A.2.

#### Time-Bounded Reachability Probability.

For a given CTMG and a given measurable scheduler that resolves the non-determinism, we use the following notations for the probabilities:

• is the probability of reaching the goal region within time when starting in location ,

• denotes the probability of reaching the goal region within time .

As usual, the supremum of the time-bounded reachability probability over a particular scheduler class is called the time-bounded reachability of for this scheduler class.

## 3 Optimal Scheduling in CTMDPs

In this section, we demonstrate the existence of optimal schedulers in traditional CTMDPs. Before turning to the proof, let us first consider what happens if time runs out, that is, at time , and then develop an intuition what an optimal scheduling policy should look like.

If we are still in time () and we are in a goal location, then we reach a goal location in time with probability 1; and if time has run out () and we are not in a goal location, then we reach a goal location in time with probability 0. For ease of notation, we also fix the probability of reaching the goal location in time to 0 for all points in time strictly after . For a measurable TPR scheduler , we would get:

• holds for all goal locations and all ,

• holds for all non-goal locations , and

• holds for all locations and all .

A scheduler can, at every point in time, choose from distributions over successor locations. Such a choice should be optimal if the expected gain in the probability of reaching the goal location is maximised.

This gain has two aspects: first, the probability of reaching the goal location provided a transition is taken, and second, the likelihood of taking a transition. Both are multiplicands in the defining differential equations, assuming a cylindrical TPD scheduler

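$$-\dot{Pr}^{\mathcal{M}}_{\mathcal{S}}(l,t) = \sum_{l' \in L} \mathbf{R}\big(l, \mathcal{S}(l,t), l'\big) \cdot \Big(Pr^{\mathcal{M}}_{\mathcal{S}}(l',t) - Pr^{\mathcal{M}}_{\mathcal{S}}(l,t)\Big).$$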

(We skip the simple generalisation to measurable TPD schedulers because it is not required in the following proofs.)
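The differential equation above can be explored numerically. The following is a minimal sketch, not a method from this paper: it integrates the equation with explicit Euler steps backwards from the time bound, using the boundary values 1 for goal locations and 0 for all other locations at the time bound. The function name `reach_prob`, the callables `rate` and `scheduler`, and the step count are assumptions; the result is only an approximation of the reachability probability.

```python
import math


def reach_prob(locations, goal, rate, scheduler, t_max, steps=10_000):
    """Explicit Euler integration, backwards from t_max to 0.

    rate(l, a, l2): transition rate R(l, a, l2) (hypothetical callable).
    scheduler(l, t): action chosen in location l at time t (a TPD scheduler).
    Returns a dict with the approximate probability at time 0 per location.
    """
    h = t_max / steps
    pr = {l: (1.0 if l in goal else 0.0) for l in locations}
    for k in range(steps):
        t = t_max - k * h
        new = {}
        for l in locations:
            if l in goal:
                new[l] = 1.0  # the goal region is absorbing
                continue
            a = scheduler(l, t)
            drift = sum(rate(l, a, l2) * (pr[l2] - pr[l]) for l2 in locations)
            new[l] = pr[l] + h * drift  # step from time t to time t - h
        pr = new
    return pr


# Toy model (hypothetical): a single action moves from s to the goal g at rate 1.
RATES = {("s", "a", "g"): 1.0}
example = reach_prob(
    ["s", "g"], {"g"},
    lambda l, a, l2: RATES.get((l, a, l2), 0.0),
    lambda l, t: "a",
    t_max=1.0,
)
exact = 1.0 - math.exp(-1.0)  # closed form for this one-step chain
```

For this toy chain the Euler result `example["s"]` can be compared against the closed form `exact`; the gap shrinks as `steps` grows.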

The reachability probability of any scheduler is therefore intuitively dominated by the function and dominates the function defined by the following equations:

where . This intuitive result is not hard to prove. (Footnote 2: The systems of non-linear ordinary differential equations used in this paper are all quite obvious, and the challenge is to prove that they can be taken and not merely approximated. An approximative argument for these ODEs goes back to Bellman [4], but he uses a less powerful set of schedulers, and only proves that and can be approximated from below and above, respectively, claiming that the other direction is obvious. After starting with a similar claim, we were urged to include a full proof.)

###### Lemma 1

The reachability probability of any measurable THR scheduler is dominated by the function and dominates the function .

Proof Idea: To prove this claim for , assume that there is a scheduler that provides a better time-bounded reachability probability for some location and time (in particular for ), and hence improves over at this position by at least for some .

is a Cauchy sequence of cylindrical schedulers. Therefore we can sacrifice one and get an -close cylindrical scheduler from this sequence, which is still at least better than at position .

As the measure for this cylindrical scheduler is a Cauchy sequence of measures for sequences with a bounded number of discrete transitions, we can sacrifice another to sharpen the requirement for the scheduler to reach the goal region in time and with at most steps for an appropriate bound , still maintaining an advantage over . Hence, we can compare with a finite structure, and use an inductive argument to show for paths of shrinking length that end in any location that holds true. ∎

The full proof is moved to Appendix B.2.

###### Theorem 3.1

For a CTMDP, there is a measurable TPD scheduler optimal for maximum time-bounded reachability in the class of measurable THR schedulers.

###### Proof

We construct a measurable scheduler that always chooses an action that maximises ; by Lemma 1, this guarantees that holds true.

To construct the scheduler decisions for a location for a measurable scheduler , we partition into measurable sets , such that only makes decisions that maximise . (For positions outside of , the behaviour of the scheduler does not matter. can therefore be fixed to any constant decision for all and .)

We start with fixing an arbitrary order on the actions in and introduce, for each point , an order on the actions determined by the value of , using as a tie-breaker.

Along the order of , we construct, starting with the minimal element, for each action :

1. Open sets that contain the positions where our scheduler does not make a decision . (Footnote 3: Note that, for all , the points in time where the scheduler does make the decision have been fixed earlier by this construction.)

(We choose for the action that is minimal with respect to .)

2. A set that is open in and contains the points in time in where is maximal with respect to .

3. A set which is the closure of in .

If is not maximal, we set for the successor of with respect to .

To complete the proof, we have to show that the scheduler , which chooses for all , makes only decisions that maximise the gain, that is (and hence that holds true), and we have to show that the resulting scheduler is measurable. As an important lemma on the way, we have to demonstrate the claimed openness of the ’s and ’s in the compact Euclidean space .

This openness is provided by a simple inductive argument: First, the complete space is open in itself.

Let us assume that is maximal w.r.t.  for a in the open set . Then the following holds: is strictly greater for compared to the respective value of all other actions (because serves as tie-breaker), and hence this holds for some -environment of that is contained in the open set . For the actions in this -environment of , the respective value also cannot be strictly greater compared to , because otherwise one of these actions would have been selected before.

Note that this argument provides optimality of the choices in as well as openness of . The optimality for the choices at the fringe of (and hence the extension of the optimality argument to ) is a consequence of the continuity of .

As every open and closed set in is (Lebesgue) measurable, (or, to be precise, ) and , and hence their intersection , are measurable.

Our construction therefore provides us with a measurable scheduler, which is optimal, deterministic, and timed positional. ∎

By simply replacing maximisation by minimisation, by , and by , we can rewrite the proof to yield a similar theorem for the minimisation of time-bounded reachability, or likewise, for the maximisation of time-bounded safety.

###### Theorem 3.2

For a CTMDP, there is a measurable TPD scheduler optimal for minimum time-bounded reachability in the class of measurable THR schedulers.

## 4 Finite Optimal Control

In this section we show that, once the existence of an optimal scheduler is established, we can refine this result to the existence of a cylindrical optimal TPD scheduler, that is, a scheduler that changes only finitely many times between different positional strategies. This is as close as we can hope to get to implementability as optimal points for policy switching are—like in the example from Figure 1—almost inevitably irrational.

Our proof of Theorem 3.1 makes a purely topological existence claim, and therefore does not imply that a finite number of switching points suffices. In principle, this could mean that the required switching points have one or more limit points, and an unbounded number of switches is required to optimise time-bounded reachability. (cf. Figure 2) is an example of a continuous function for which the codomain of has a limit point at , and the right curve of Figure 2 shows the derivatives for a positional scheduler (black) and a potential comparison with a gain function such that their intersections have a limit point.

To exclude such limit points, and hence to prove the existence of an optimal scheduler with a finite number of switching points, we revisit the differential equations that define the reachability probability, but this time to answer a different question: Can we use the true values at some point in time to locally find an optimal strategy for an -environment? If yes, then we could exploit the compactness of : We could, for all points in time , fix a decision that is optimal in an -environment of . This would provide an open set with a positional optimal strategy around each , and hence an open cover of a compact set, which would imply a finite cover with segments of positional optimal strategies.

While this is the case for most points, this is not necessarily the case at our switching points. In the remainder of this section, we therefore show something similar: For every point in time, there is a positional strategy that is optimal in a left -environment of (that is, in a set ), and one that is optimal in a right -environment of . Hence, we get an open cover of strategies with at most one switching point, and thus obtain a strategy with a finite number of switching points.

###### Theorem 4.1

For every CTMDP, there is a cylindrical TPD scheduler optimal for maximum time-bounded reachability in the class of measurable THR schedulers.

###### Proof

We have seen that the true optimal reachability probability is defined by a system of differential equations. In this proof we consider the effect of starting with the ‘correct’ values for a time , but locally fix a positional strategy for a small left or right -environment of . That is, we consider only schedulers that keep their decision constant for a (sufficiently) small time before or after .

Given a CTMDP , we consider the differential equations that describe the development near the support point for each location under a positional strategy :

$$-\dot{Pr}^{D}_{l}(\tau) = \sum_{l' \in L} \mathbf{R}(l, a_l, l') \cdot \Big(Pr^{D}_{l'}(\tau) - Pr^{D}_{l}(\tau)\Big),$$

where is the action chosen at by (see Figure 3 for an example).

In contrast to the development of the true probability, the development of these linear differential equations provides us with smooth functions. This gives us more powerful techniques for comparing two locally positional strategies: Each deterministic scheduler defines a system of ordinary homogeneous linear differential equations with constant coefficients.

As a result, the solutions of these differential equations—and hence their differences —can be written as finite sums , where is a polynomial and the may be complex. Consequently, these functions are holomorphic.

Using the identity theorem for holomorphic functions, can only be a limit point of the set of points of if and are identical on an -environment of . The same applies to their derivatives: either has no limit point in , or and are identical on an -environment of .

For the remainder of the proof, we fix, for a given time , a sufficiently small such that, for each pair of schedulers and and every location , is either , , or on the complete interval , and, possibly with different sign, for the complete interval .

We argue the case for the left -environment . In the ‘’ case for a location , we say that is -better than . We call preferable over if is not -better than for any location , and better than if is preferable over and -better for some .

If is -better than in exactly a non-empty set of locations, then we can obviously use to construct a strategy that is better than by switching to the strategies of in exactly the locations .

Since we choose our strategies from a finite domain—the deterministic positional schedulers—this can happen only finitely many times. Hence we can stepwise strictly improve a strategy, until we have constructed a strategy preferable over all others.

By the definition of being preferable over all other strategies, satisfies

$$-\dot{Pr}^{D_{\max}}_{l}(\tau) = \max_{a \in Act(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \Big(Pr^{D_{\max}}_{l'}(\tau) - Pr^{D_{\max}}_{l}(\tau)\Big)$$

for all and all .

We can use the same method for the right -environment , and pick the decision for arbitrarily; we use the decision from the respective left environment.

Now we have fixed, for an -environment of an arbitrary , an optimal scheduler with at most one switching point. As this is possible for all points in , the sets define an open cover of . Using the compactness of , we infer a finite sub-cover, which establishes the existence of a strategy with a finite number of switching points. ∎

The proof for the minimisation of time-bounded reachability (or maximisation of time-bounded safety) runs accordingly.

###### Theorem 4.2

For every CTMDP, there is a cylindrical TPD scheduler optimal for minimum time-bounded reachability in the class of measurable THR schedulers.
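The equation for the locally optimal strategy in the proof of Theorem 4.1 takes a maximum over the enabled actions. The sketch below is a numerical illustration of that idea and not the construction used in the proofs: it sweeps backwards from the time bound with explicit Euler steps, picks a gain-maximising action in every step, and records where the chosen action changes. All names (`greedy_backward`, `rate`, `actions`) and the toy model are assumptions, and the reported switching times are approximations only.

```python
import math


def greedy_backward(locations, goal, actions, rate, t_max, steps=10_000):
    """Backward Euler sweep with a greedy (gain-maximising) action choice.

    actions[l]: list of enabled actions in l; the list order breaks ties.
    rate(l, a, l2): transition rate R(l, a, l2) (hypothetical callable).
    Returns (values at time 0, switches), where a switch is a tuple
    (time, location, action used after that time, action used before it).
    """
    h = t_max / steps
    pr = {l: (1.0 if l in goal else 0.0) for l in locations}
    current, switches = {}, []
    for k in range(steps):
        t = t_max - k * h
        new = dict(pr)
        for l in locations:
            if l in goal:
                continue  # the absorbing goal keeps value 1

            def gain(a):
                return sum(rate(l, a, l2) * (pr[l2] - pr[l]) for l2 in locations)

            best = max(actions[l], key=gain)  # first maximiser wins ties
            if l in current and current[l] != best:
                switches.append((t, l, current[l], best))
            current[l] = best
            new[l] = pr[l] + h * gain(best)
        pr = new
    return pr, switches


# Toy model (hypothetical): in s, action a moves to g at rate 1 and action b
# moves to m at rate 3; m has a single action c that moves to g at rate 3.
RATES = {("s", "a", "g"): 1.0, ("s", "b", "m"): 3.0, ("m", "c", "g"): 3.0}
values, switches = greedy_backward(
    ["s", "m", "g"], {"g"}, {"s": ["a", "b"], "m": ["c"]},
    lambda l, a, l2: RATES.get((l, a, l2), 0.0), t_max=1.0,
)
always_b = 1.0 - 4.0 * math.exp(-3.0)  # value of playing b throughout (Erlang(2, 3) at time 1)
```

In this toy model the greedy rule plays b while much time remains and a close to the time bound, with a single switch at a remaining time of roughly ln(1.5)/2 for the rates assumed here, which also illustrates why switching points are typically irrational.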

## 5 Discrete Locations

In this section, we treat the mildly more general case of single player CTMGs, which are traditional CTMDPs plus discrete locations. We reduce the problem of finding optimal measurable schedulers for CTMGs first to simple CTMGs, CTMGs whose discrete locations have no incoming transitions from continuous locations. (They hence can only occur initially at time .) The extension from CTMDPs to simple CTMGs is trivial.

###### Lemma 2

For a simple single player CTMG with only a reachability (or only a safety) player, there is an optimal deterministic scheduler with finitely many switching points.

###### Proof

By the definition of simple single player games, the likelihood of reaching the goal location from any continuous location and any point in time is independent of the discrete locations and their transitions. For continuous locations, we can therefore simply reuse the results from Theorems 4.1 and 4.2.

We can only be in discrete locations at time , and for every continuous location there is a fixed time-bounded reachability probability described by . We can show that there is a timed-positional (even a positional) deterministic optimal choice for the discrete locations at time by induction over the maximal distance to continuous locations: If all successors have been evaluated, we can fix an optimal timed-positional choice. We can therefore use discrete positions with maximal distance as induction basis, and then apply an induction step from positions with distance to positions with distance . ∎
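The induction over the distance to continuous locations can be sketched as a memoised evaluation of the acyclic graph of discrete locations. This is a minimal illustration for the reachability player under assumed encodings: `prob[(l, a)]` stands for the row of the discrete transition matrix for location `l` and action `a`, and `value` initially holds the known reachability probabilities of the continuous locations. The function name `solve_discrete` is hypothetical.

```python
def solve_discrete(discrete, actions, prob, value):
    """Pick, for each discrete location, an action maximising the expected value.

    discrete: iterable of discrete locations (their graph must be acyclic).
    actions[l]: enabled actions in the discrete location l.
    prob[(l, a)]: dict successor -> probability (hypothetical encoding).
    value: dict with known values of continuous locations; extended in place.
    Returns a dict mapping each discrete location to the chosen action.
    """
    choice = {}

    def evaluate(l):
        if l in value:
            return value[l]  # continuous location or already solved
        best_a, best_v = None, -1.0
        for a in actions[l]:
            # Successors are evaluated first, mirroring the induction step.
            v = sum(p * evaluate(l2) for l2, p in prob[(l, a)].items())
            if v > best_v:
                best_a, best_v = a, v
        choice[l], value[l] = best_a, best_v
        return best_v

    for l in discrete:
        evaluate(l)
    return choice
```

For a safety player, the comparison would be reversed so that the action with the smallest expected value is chosen.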

Transforming a single player CTMG into a simple single player CTMG is straightforward; it suffices to pool all transitions taken between two continuous locations. To construct the resulting simple CTMG , we add new continuous locations for each possible time-abstract path from continuous locations of the CTMG , and we add the respective actions: For continuous locations and discrete locations a timed path translates to , where the underlined part is a new continuous location. (For simplicity, we also translate a timed path to .)

The new actions of the resulting simple single player CTMG encode the sequences of actions of that a scheduler could make in the current location plus in all possible sequences of discrete locations, until the next continuous location is reached. (Note that this set is finite, and that the scheduler makes all of these transitions at the same point of time.) If encodes choices that depend only on the position (but not on this local history), is called positional. For continuous locations, all old actions are deleted, and all new continuous locations that end in a location get the same outgoing transitions as . The rate matrix is chosen accordingly.

Adding the information about the path to locations allows us to reconstruct the timed history in the single player CTMG from a history in the constructed simple CTMG.

###### Theorem 5.1

For a single player CTMG with only a reachability (or only a safety) player, there is an optimal deterministic scheduler with finitely many switching points.

###### Proof

First, every scheduler for can be naturally translated into a scheduler of of , because every timed-path in defines a timed-path in ; the resulting time-bounded reachability probability coincides.

Let us consider a cylindrical optimal deterministic scheduler for the simple Markov game, and the function defined by it. For the actions chooses, we can, for each interval in which , is positional, use an inductive argument similar to the one from the proof of Lemma 2 to show that we can choose a positional action instead. The resulting cylindrical deterministic scheduler defines the same (same differential equations).

Clearly, holds true. We use this observation to change to by choosing the action that chooses for for all locations and at each point of time. The resulting scheduler is still cylindrical and deterministic, and defines the same (same differential equations).

is also the mapping of a cylindrical optimal deterministic scheduler for . ∎

## 6 Continuous-Time Markov Games

In this section, we lift our results from single player to general continuous-time Markov games. In general continuous-time Markov games, we are faced with two players with opposing objectives: A reachability player trying to maximise the time-bounded reachability probability, and a safety player trying to minimise it; that is, we consider a zero-sum game.

Thus, all we need to do for lifting our results to games is to show that the quest for optimal strategies for single player games discussed in the previous section can be generalised to a quest for co-optimal strategies—that is, for Nash equilibria—in general games. To demonstrate this, it essentially suffices to show that it is not important whether we first fix the strategy for the reachability player and then the one for the safety player in a strategy refinement loop, or vice versa.

Let us first assume CTMGs without discrete locations.

###### Lemma 3

Using the -environments from the proof of Theorem 4.1, we can construct a Nash equilibrium that provides co-optimal deterministic strategies for both players, such that the co-optimal strategies contain at most one strategy switch on .

###### Proof

We describe the technique to find a constant co-optimal strategy on the right -environment of .

We write a constant strategy as that is composed of the actions chosen by the safety player on , and the actions chosen by the reachability player on . For this simple structure, we introduce a strategy improvement technique on the finite domain of deterministic choices for the respective player.

For a fixed strategy of the safety player, we can find an optimal counter strategy of the reachability player by applying the technique described in Theorem 4.1. (For equivalent strategies, we make an arbitrary but fixed choice.)

We call the resulting vector the quality vector of . Now, we choose an arbitrary for which this vector is minimal. (Note that there could, potentially, be multiple incomparable minimal elements.)

We now show that the following holds for and all :

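$$-\dot{\mathit{Pr}}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l,\tau) = \max_{a\in \mathit{Act}(l)} \sum_{l'\in L} \mathbf{R}(l,a,l')\cdot\Big(\mathit{Pr}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l',\tau)-\mathit{Pr}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l,\tau)\Big)$$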

for all , and

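$$-\dot{\mathit{Pr}}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l,\tau) = \min_{a\in \mathit{Act}(l)} \sum_{l'\in L} \mathbf{R}(l,a,l')\cdot\Big(\mathit{Pr}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l',\tau)-\mathit{Pr}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l,\tau)\Big)$$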

for all . (Note that the order between the derivatives is maintained on the complete right -environment .)

The first of these claims is a trivial consequence of the proof of Theorem 4.1. (The result is, for example, the same if we had a single player CTMDP that, in the locations of the safety player, has only one possible action: the one chosen by .)

Let us assume that the second claim does not hold. Then we choose a particular where it is violated. Let us consider a slightly changed setting, in which the choices in are restricted to two actions, the action chosen by , and the minimising action . Among these two, one maximises, and one minimises

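$$-\dot{\mathit{Pr}}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l,\tau) = \min_{a\in \{a_1,a_2\}} \sum_{l'\in L} \mathbf{R}(l,a,l')\cdot\Big(\mathit{Pr}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l',\tau)-\mathit{Pr}^{\mathcal{M}}_{\overline{S}+R(\overline{S})}(l,\tau)\Big).$$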

Let us fix all other choices of , and allow the reachability player to choose among and (we ‘pass control’ to the other player). As shown in Theorem 4.1, she will select an action that produces the well defined set of equations for the resulting single player game. Hence, choosing and keeping all other choices from is the optimal choice for the reachability player in this setting (as the equations are satisfied, while they are dissatisfied for ).

Consequently, the quality vector for is strictly greater than the one for the adjusted strategy. That is, the assumption that choosing an arbitrary minimal element does not satisfy the max and min equations leads to a contradiction.

We can argue symmetrically for the left -environment. Note that the satisfaction of the max and min equations implies that it does not matter if we swap the roles of the safety and reachability player in our argumentation. ∎

This lemma can easily be extended to construct simple co-optimal strategies:

###### Theorem 6.1

For CTMGs without discrete locations, there are cylindrical deterministic timed-positional co-optimal strategies for the reachability and the safety player.

###### Proof

First, Lemma 3 provides us with an open cover of co-optimal strategies that switch at most once, and we can build a strategy that switches only finitely many times from a finite sub-cover of the open space . This strategy is everywhere locally co-optimal, and forms a Nash equilibrium:

It is straightforward to cut the interval into a finite set of sub-intervals , , …, with , such that the strategy for the safety player is constant in all of these intervals. We can use the construction from Theorem 4.1 (note that the proof of Theorem 4.1 does not use that the differential equations are initialised to or at ) to construct an optimal strategy for the reachability player: We can first solve the problem for the interval , then for the interval using as initialisation, and so forth. A similar argument can be made for the other player.

This provides us with the same differential equations, namely:

$$-\dot{f}_{\mathit{opt}}(l,t) = \max_{a\in \mathit{Act}(l)} \sum_{l'\in L} \mathbf{R}(l,a,l')\cdot\big(f_{\mathit{opt}}(l',t)-f_{\mathit{opt}}(l,t)\big)$$

for and , and

$$-\dot{f}_{\mathit{opt}}(l,t) = \min_{a\in \mathit{Act}(l)} \sum_{l'\in L} \mathbf{R}(l,a,l')\cdot\big(f_{\mathit{opt}}(l',t)-f_{\mathit{opt}}(l,t)\big)$$

for and .

Note that all Nash equilibria need to satisfy these equations (with the exception of sets of measure zero, of course), because otherwise one of the players could improve her strategy. ∎
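To illustrate how the max and min equations determine both the value and the switching points, the following is a rough numerical sketch that integrates them backwards from the time bound with an explicit Euler scheme. It is an approximation for illustration only and not the construction used in the proofs; the function `integrate_game`, the dictionary layout and the three-location example are assumptions made here and are not taken from the paper.

```python
def integrate_game(rates, owner, goal, t_max, steps):
    """Explicit Euler sketch, run backwards from t_max to 0.

    rates[l][a] maps successor locations to rates, owner[l] is "reach" or
    "safe", and goal is the set of (absorbing) goal locations.
    Returns the approximate values at time 0 and the actions picked per step.
    """
    h = t_max / steps
    f = {l: (1.0 if l in goal else 0.0) for l in rates}  # values at t_max
    picked_per_step = []
    for _ in range(steps):
        new_f, picked = {}, {}
        for l, acts in rates.items():
            if l in goal:
                new_f[l] = 1.0  # goal locations are absorbing
                continue
            # Right-hand side of the differential equation for each action.
            gains = {a: sum(r * (f[l2] - f[l]) for l2, r in succ.items())
                     for a, succ in acts.items()}
            opt = max if owner[l] == "reach" else min
            best = opt(sorted(gains), key=gains.get)  # ties: alphabetical order
            picked[l] = best
            new_f[l] = f[l] + h * gains[best]  # f(t - h) ~ f(t) + h * rhs
        f = new_f
        picked_per_step.append(picked)
    return f, picked_per_step


# Hypothetical three-location example (not taken from the paper).
rates = {
    "s": {"direct": {"g": 1.0}, "detour": {"m": 3.0}},
    "m": {"go": {"g": 3.0}},
    "g": {},
}
owner = {"s": "reach", "m": "reach"}
values, picked = integrate_game(rates, owner, {"g"}, t_max=2.0, steps=2000)
actions_in_s = [p["s"] for p in picked]  # index 0 is the step closest to t_max
switches = sum(a != b for a, b in zip(actions_in_s, actions_in_s[1:]))
print(values["s"], actions_in_s[0], actions_in_s[-1], switches)
```

In this example the reachability player prefers the slow direct transition close to the time bound and the faster two-step detour earlier on, so the discretised strategy is piecewise constant. The step size trades accuracy for running time; the sketch makes no claim about error bounds.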

The extension of these results to the full class of CTMGs is straightforward: We would first reprove Theorem 5.1 in the style of the proof of Theorem 4.1 (which requires establishing the theorem in the first place). The only extension is that we additionally get an equation for every discrete location . The details are moved to Appendix C.

###### Theorem 6.2

For continuous-time Markov Games, there are cylindrical deterministic timed-positional co-optimal strategies for the reachability and the safety player.

As a small side result, these differential equations show us that we can, for each continuous location and every action , add arbitrary values to without changing the bounded reachability probability for every pair of schedulers. (Only if we change to we have to make sure that is not removed from .) In particular, this implies that we can locally and globally uniformise a continuous-time Markov game if this eases its computational analysis. (Cf. [10] for the simpler case of CTMDPs.)
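The reason behind this side result is visible in the right-hand side of the differential equations: the summand for a self-loop is multiplied by the difference of a value with itself, which is zero. The following minimal sketch checks this numerically; the helper names `gain` and `uniformise` and the dictionary layout are assumptions made for illustration.

```python
def gain(row, f, l):
    """Right-hand side sum for one action: sum of rate * (f(l') - f(l))."""
    return sum(rate * (f[l2] - f[l]) for l2, rate in row.items())


def uniformise(rates, target=None):
    """Pad every (location, action) row with a self-loop up to a common rate.

    rates[l][a] maps successor locations to rates. If no target is given,
    the largest total rate of any row is used (a hypothetical default).
    """
    def total(row):
        return sum(row.values())

    if target is None:
        target = max(total(row) for acts in rates.values() for row in acts.values())
    out = {}
    for l, acts in rates.items():
        out[l] = {}
        for a, row in acts.items():
            padded = dict(row)
            # The added self-loop contributes rate * (f(l) - f(l)) = 0.
            padded[l] = padded.get(l, 0.0) + (target - total(row))
            out[l][a] = padded
    return out


# Hypothetical example with arbitrary values f.
rates = {
    "s": {"a": {"g": 1.0}, "b": {"m": 3.0, "g": 0.5}},
    "m": {"a": {"g": 2.0, "m": 0.25}},
}
f = {"s": 0.2, "m": 0.6, "g": 1.0}
uniform = uniformise(rates)
```

Since no action is removed and no existing rate is set to zero, the set of enabled actions is unchanged in this sketch.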

## 7 Variances

In this section, we discuss the impact of small changes in the setting, namely the impact of infinitely many states or actions, and the impact of introducing a non-absorbing goal region.

#### Infinitely Many States.

If we allow for infinitely many states, optimal solutions may require infinitely many switching points. To see this, it suffices to use one copy of the CTMDP from Figure 1, but with rates and for the -th copy, and assign an initial probability distribution that assigns a weight of to the initial state of the -th copy. (If one prefers to consider only systems with bounded rates, one can choose rates and .) The switching points are then different for every copy, and an optimal strategy has to select the correct switching point for every copy.

#### Infinitely Many Actions.

If we allow for infinitely many actions, there is not even an optimal strategy if we restrict our focus to CTMDPs with two locations, an initial location and an absorbing goal location. For the CTMDP of Figure 4 with the natural numbers as actions and rate for the action if we have a reachability player and if we have a safety player, every strategy can be improved over by a strategy that always chooses the successor when of the action chosen by .
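The effect can be illustrated numerically for the reachability player. The rates used below are placeholders chosen for this sketch (strictly increasing in the action index and bounded by 2); they are an assumption and not necessarily the rates of Figure 4. The sketch only compares strategies that choose one fixed action throughout.

```python
import math


def reach_probability(rate, t_max):
    """Probability that an exponential delay with the given rate elapses before t_max."""
    return 1.0 - math.exp(-rate * t_max)


def placeholder_rate(i):
    """Hypothetical rate of action i: strictly increasing in i, bounded by 2."""
    return 2.0 - 2.0 ** (-i)


t_max = 1.0
probs = [reach_probability(placeholder_rate(i), t_max) for i in range(20)]
supremum = reach_probability(2.0, t_max)  # approached, but attained by no action
```

Under these assumptions, every action is beaten by its successor, while the supremum is not attained by any single action.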

#### Reachability at $t_{\max}$.

If we drop the assumption that the goal region is absorbing, one might be interested in the marginally more general problem of being (not being) in the goal region at time for the reachability player (safety player, respectively). For this generalisation, no substantial changes need to be made: It suffices to replace

$$f_{\mathit{opt}}(l,t) = \mathit{Pr}^{\mathcal{M}}_{S}(l,t) = 1 \text{ for all goal locations } l\in G \text{ and all } t\le t_{\max}$$

by

$$f_{\mathit{opt}}(l,t_{\max}) = \mathit{Pr}^{\mathcal{M}}_{S}(l,t_{\max}) = 1 \text{ for all goal locations } l\in G.$$

(In order to be flexible with respect to this condition, the are defined for goal locations as well. Note that, when all goal locations are absorbing, the value of is and is for all goal locations and all .)

## References

• [1] Adnan Aziz, Kumud Sanwal, Vigyan Singhal, and Robert Brayton. Model-checking continuous-time Markov chains. Transactions on Computational Logic, 1(1):162–170, 2000.
• [2] C. Baier, J.-P. Katoen, and H. Hermanns. Approximate Symbolic Model Checking of Continuous-Time Markov Chains. In Proceedings of CONCUR’99, volume 1664 of Lecture Notes in Computer Science, pages 146–161, 1999.
• [3] Christel Baier, Holger Hermanns, Joost-Pieter Katoen, and Boudewijn R. Haverkort. Efficient computation of time-bounded reachability probabilities in uniform continuous-time Markov decision processes. Theoretical Computer Science, 345(1):2–26, 2005.
• [4] Richard Bellman. Dynamic Programming. Princeton University Press, 1957.
• [5] Tomas Brazdil, Vojtech Forejt, Jan Krcal, Jan Kretinsky, and Antonin Kucera. Continuous-time stochastic games with time-bounded reachability. In Proceedings of FSTTCS’09, Leibniz International Proceedings in Informatics (LIPIcs), pages 61–72, 2009.
• [6] Eugene A. Feinberg. Continuous Time Discounted Jump Markov Decision Processes: A Discrete-Event Approach. Mathematics of Operations Research, 29(3):492–524, 2004.
• [7] H. Hermanns. Interactive Markov Chains and the Quest for Quantified Quality. LNCS 2428. Springer-Verlag, 2002.
• [8] Prasadarao Kakumanu. Continuously Discounted Markov Decision Model with Countable State and Action Space. The Annals of Mathematical Statistics, 42(3):919–926, 1971.
• [9] M. A. Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis. Modelling with Generalized Stochastic Petri Nets. SIGMETRICS Performance Evaluation Review, 26(2):2, 1998.
• [10] Martin R. Neuhäußer, Mariëlle Stoelinga, and Joost-Pieter Katoen. Delayed Nondeterminism in Continuous-Time Markov Decision Processes. In Proceedings of FOSSACS ’09, pages 364–379, 2009.
• [11] Martin R. Neuhäußer and Lijun Zhang. Time-Bounded Reachability in Continuous-Time Markov Decision Processes. Technical report, 2009.
• [12] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, April 1994.
• [13] Markus Rabe and Sven Schewe. Optimal Time-Abstract Schedulers for CTMDPs and Markov Games. In Proceedings of QAPL’10 (accepted), 2010.
• [14] William H. Sanders and John F. Meyer. Reduced Base Model Construction Methods for Stochastic Activity Networks. In Proceedings of PNPM’89, pages 74–84, 1989.
• [15] Nicolás Wolovick and Sven Johr. A Characterization of Meaningful Schedulers for Continuous-Time Markov Decision Processes. In Proceedings of FORMATS’06, pages 352–367, 2006.
• [16] L. Zhang, H. Hermanns, E. M. Hahn, and B. Wachter. Time-bounded model checking of infinite-state continuous-time Markov chains. In Proceedings of ACSD’08, pages 98–107, 2008.

## Appendix

As is to be expected from the topic, the paper is based in large parts on measure theory. While the techniques are standard and straightforward for experts in the field, we provide a short introduction to the ideas exploited. We do not use any technique beyond the standard curriculum of a mathematics degree, but we assume that a recap of the ideas behind the completion of metric spaces, and of its application to our measures, is useful; it is given in Section A. However, Section A is but a short introduction to the ideas, and cannot serve as a self-contained introduction to the techniques.

Section B contains a short recap on the differential equations that describe and , and, more generally, the development of in Subsection B.1, and a proof that truly establishes an upper bound on the performance of any measurable scheduler in Subsection B.2, which constitutes a proof of Lemma 1.

## Appendix A Completion of Metric Spaces

A metric space is called complete if every Cauchy sequence in it converges. A Cauchy sequence in a metric space is a sequence such that the following holds:

 ∀ε>0 ∃n∈N ∀l,m>n. d(s(l),s(m))<ε

Intuitively one could say that a Cauchy sequence converges, but not necessarily to a point within the space. For example, a sequence of rational numbers that converges to is a converging sequence in the real numbers, but not in the rationals—as the limit point is outside of the carrier set—but it is still a Cauchy sequence.
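A concrete instance can be sketched with exact rational arithmetic. The example below assumes the square root of 2 as the irrational limit and uses Newton iterates as the rational sequence; both choices are made here for illustration.

```python
from fractions import Fraction


def newton_sqrt2(n):
    """First n Newton iterates for x*x = 2, starting at 1, as exact rationals."""
    x = Fraction(1)
    seq = [x]
    for _ in range(n - 1):
        x = (x + 2 / x) / 2  # stays rational: sums and quotients of rationals
        seq.append(x)
    return seq


seq = newton_sqrt2(6)
gaps = [abs(b - a) for a, b in zip(seq, seq[1:])]  # distances of consecutive elements
squares_hit_two = [x * x == 2 for x in seq]        # no element is a square root of 2
```

The distances between consecutive elements shrink rapidly, as one expects from a Cauchy sequence, while no element of the sequence squares to 2, so a limit cannot lie in the rationals.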

The basic technique to complete an incomplete metric space is to use the Cauchy sequences of this space as the new carrier set , and define a distance function between two Cauchy sequences of to be . Now, is not yet a metric space, because two different Cauchy sequences—for example the constant sequence and the sequence of the rationals—can have distance .

Technically, one therefore defines equivalence classes of Cauchy sequences that have distance with respect to as the new carrier set of a metric space , where the distance function is defined by using on representatives of the quotient classes of Cauchy sequences. This also complies with the intuition: a Cauchy sequence in is meant to represent its limit point (which is not necessarily in ), and hence two Cauchy sequences with the same limit point should be identified.

The resulting metric space is complete by construction. The simplest example of such a completion is the completion of the rational numbers into the real numbers. On this level, a straightforward effect of completion can be easily explained: To end up with (or a space isomorphic to it), we can start with any dense subset of .

A subset is dense in if, for every point and every , there is a point with . Looking at the definition, one immediately sees the connection to Cauchy sequences: One could intuitively say that is dense in if, for every point , there is a Cauchy sequence with limit point .
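The definition of density can be made tangible with the dyadic rationals, that is, fractions whose denominator is a power of two. The following minimal sketch finds, for a given point and a given distance, a dyadic rational within that distance. The function name `dyadic_approx` is hypothetical, and exact fractions stand in for arbitrary points, since real numbers cannot be represented exactly.

```python
import math
from fractions import Fraction


def dyadic_approx(x, eps):
    """Return a dyadic rational k / 2**n with |x - k / 2**n| < eps (eps > 0)."""
    n = 0
    while Fraction(1, 2**n) >= eps:
        n += 1  # refine the grid until its spacing is below eps
    k = math.floor(x * 2**n)  # largest grid point not above x
    return Fraction(k, 2**n)


point = Fraction(1, 3)  # not a dyadic rational itself
approx = dyadic_approx(point, Fraction(1, 1000))
```

Repeating this for a sequence of distances that tends to zero yields a Cauchy sequence of dyadic rationals with the chosen point as its limit.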

Hence, it does not matter which dense set we use as a starting point. Of course, it works to use