Optimal Control of MDPs with Temporal Logic Constraints
Abstract
In this paper, we focus on formal synthesis of control policies for finite Markov decision processes with nonnegative real-valued costs. We develop an algorithm to automatically generate a policy that guarantees the satisfaction of a correctness specification expressed as a formula of Linear Temporal Logic, while at the same time minimizing the expected average cost between two consecutive satisfactions of a desired property. The existing solutions to this problem are suboptimal. By leveraging ideas from automata-based model checking and game theory, we provide an optimal solution. We demonstrate the approach on an illustrative example.
I Introduction
Markov Decision Processes (MDPs) are probabilistic models widely used in various areas, such as economics, biology, and engineering. In robotics, they have been successfully used to model the motion of systems with actuation and sensing uncertainty, such as ground robots [17], unmanned aircraft [21], and surgical steering needles [1]. MDPs are central to control theory [4], probabilistic model checking and synthesis in formal methods [3, 9], and game theory [13].
MDP control is a well-studied area (see e.g., [4]). The goal is usually to optimize the expected value of a cost over a finite time (e.g., stochastic shortest path problem) or an average expected cost in infinite time (e.g., average cost per stage problem). Recently, there has been increasing interest in developing MDP control strategies from rich specifications given as formulas of probabilistic temporal logics, such as Probabilistic Computation Tree Logic (PCTL) and Probabilistic Linear Temporal Logic (PLTL) [12, 17]. It is important to note that both optimal control and temporal logic control problems for MDPs have their counterpart in automata game theory. Specifically, optimal control translates to solving $1\frac{1}{2}$-player games with payoff functions, such as discounted-payoff and mean-payoff games [6]. Temporal logic control for MDPs corresponds to solving $1\frac{1}{2}$-player games with parity objectives [2].
Our aim is to optimize the behavior of a system subject to correctness (temporal logic) constraints. Such a connection between optimal and temporal logic control is an intriguing problem with potentially high impact in several applications. Consider, for example, a mobile robot involved in a persistent surveillance mission in a dangerous area under tight fuel or time constraints. The correctness requirement is expressed as a temporal logic specification, e.g., “Keep visiting A and then B and always avoid C”. The resource constraints translate to minimizing a cost function over the feasible trajectories of the robot. Motivated by such applications, in this paper we focus on correctness specifications given as LTL formulae and optimization objectives expressed as average expected cumulative costs per surveillance cycle (ACPC).
The main contribution of this work is to provide a sound and complete solution to the above problem. This paper can be seen as an extension of [18, 19, 11, 8]. In [18], we focused on deterministic transition systems and developed a finite-horizon online planner to provably satisfy an LTL constraint while optimizing the behavior of the system between every two consecutive satisfactions of a given proposition. We extended this framework in [19], where we provided an algorithm to optimize the long-term average behavior of deterministic transition systems with time-varying events of known statistics. The closest to this work is [11], where the authors focus on a problem of optimal LTL control of MDPs with real-valued costs on actions. The correctness specification is assumed to include a persistent surveillance task and the goal is to minimize the long-term expected average cost between successive visits of the locations under surveillance. Using dynamic programming techniques, the authors design a solution that is suboptimal in the general case. In [8], it is shown that, for a certain fragment of LTL, the solution becomes optimal. By using recent results from game theory [5], in this paper we provide an optimal solution for full LTL.
II Preliminaries
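For a set $S$, we use $S^{\omega}$ and $S^{+}$ to denote the set of all infinite and all non-empty finite sequences of elements of $S$, respectively. For a finite sequence $\tau = a_0 \ldots a_n \in S^{+}$, we use $|\tau| = n+1$ to denote the length of $\tau$. For $0 \le i \le n$, $\tau(i) = a_i$ and $\tau^{(i)} = a_0 \ldots a_i$ is the finite prefix of $\tau$ of length $i+1$. We use the same notation for an infinite sequence from the set $S^{\omega}$.
II-A MDP Control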
Definition 1
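A Markov decision process (MDP) is a tuple $\mathcal{M} = (S, A, P, AP, L, g)$, where $S$ is a non-empty finite set of states, $A$ is a non-empty finite set of actions, $P: S \times A \times S \to [0,1]$ is a transition probability function such that for every state $s \in S$ and action $\alpha \in A$ it holds that $\sum_{s' \in S} P(s, \alpha, s') \in \{0, 1\}$, $AP$ is a finite set of atomic propositions, $L: S \to 2^{AP}$ is a labeling function, and $g: S \times A \to \mathbb{R}^{+}_{0}$ is a cost function. An initialized Markov decision process is an MDP $\mathcal{M} = (S, A, P, AP, L, g)$ with a distinctive initial state $s_{init} \in S$.
An action $\alpha \in A$ is called enabled in a state $s \in S$ if $\sum_{s' \in S} P(s, \alpha, s') = 1$. With a slight abuse of notation, $A(s)$ denotes the set of all actions enabled in a state $s$. We assume $A(s) \neq \emptyset$ for every $s \in S$.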
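A run of an MDP $\mathcal{M}$ is an infinite sequence of states $\rho = s_0 s_1 \ldots \in S^{\omega}$ such that for every $i \ge 0$, there exists $\alpha_i \in A(s_i)$ with $P(s_i, \alpha_i, s_{i+1}) > 0$. We use $\mathrm{Run}^{\mathcal{M}}(s)$ to denote the set of all runs of $\mathcal{M}$ that start in a state $s \in S$. Let $\mathrm{Run}^{\mathcal{M}} = \bigcup_{s \in S} \mathrm{Run}^{\mathcal{M}}(s)$. A finite run $\sigma = s_0 \ldots s_n$ of $\mathcal{M}$ is a finite prefix of a run in $\mathcal{M}$ and $\mathrm{Run}_{fin}^{\mathcal{M}}(s)$ denotes the set of all finite runs of $\mathcal{M}$ starting in a state $s \in S$. Let $\mathrm{Run}_{fin}^{\mathcal{M}} = \bigcup_{s \in S} \mathrm{Run}_{fin}^{\mathcal{M}}(s)$. The length $|\sigma| = n+1$ of a finite run $\sigma$ is also referred to as the number of stages of the run. The last state of $\sigma$ is denoted by $\mathit{last}(\sigma) = s_n$.
The word induced by a run $\rho = s_0 s_1 \ldots$ of $\mathcal{M}$ is an infinite sequence $L(s_0) L(s_1) \ldots \in (2^{AP})^{\omega}$. Similarly, a finite run of $\mathcal{M}$ induces a finite word from the set $(2^{AP})^{+}$.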
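To make Definition 1 and the notions of enabled actions, runs, and induced words concrete, the following is a minimal sketch of an MDP as a Python data structure. The class name `MDP`, its field names, and the two-state example are illustrative assumptions, not part of the paper; transition probabilities are stored sparsely, so a missing `(state, action)` key means the action is not enabled.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Hypothetical container for M = (S, A, P, AP, L, g); names are assumptions."""
    states: set
    actions: set
    P: dict       # (s, a) -> {s_next: probability}; missing key = not enabled
    labels: dict  # s -> set of atomic propositions, i.e. L(s)
    cost: dict    # (s, a) -> nonnegative real cost g(s, a)

    def enabled(self, s):
        """A(s): actions whose outgoing probabilities from s sum to 1."""
        return {a for a in self.actions
                if abs(sum(self.P.get((s, a), {}).values()) - 1.0) < 1e-9}

    def is_finite_run(self, run):
        """Every step must have positive probability under some enabled action."""
        return all(
            any(self.P[(s, a)].get(t, 0.0) > 0 for a in self.enabled(s))
            for s, t in zip(run, run[1:]))

    def word(self, run):
        """Finite word induced by a finite run: L(s_0) L(s_1) ..."""
        return [self.labels[s] for s in run]


# Illustrative two-state example (assumed, not from the paper).
mdp = MDP(
    states={"s0", "s1"},
    actions={"a", "b"},
    P={("s0", "a"): {"s0": 0.5, "s1": 0.5}, ("s1", "b"): {"s0": 1.0}},
    labels={"s0": set(), "s1": {"sur"}},
    cost={("s0", "a"): 1.0, ("s1", "b"): 2.0},
)
```

With this example, `s0 s0 s1 s0` is a finite run, while `s1 s1` is not, because the only action enabled in `s1` moves to `s0`.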
Definition 2
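Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an MDP. An end component (EC) of the MDP $\mathcal{M}$ is an MDP $\mathcal{N} = (S_{\mathcal{N}}, A_{\mathcal{N}}, P|_{\mathcal{N}}, AP, L|_{\mathcal{N}}, g|_{\mathcal{N}})$ such that $\emptyset \neq S_{\mathcal{N}} \subseteq S$, $\emptyset \neq A_{\mathcal{N}} \subseteq A$. For every $s \in S_{\mathcal{N}}$ and $\alpha \in A_{\mathcal{N}}(s)$ it holds that $\{s' \in S \mid P(s, \alpha, s') > 0\} \subseteq S_{\mathcal{N}}$. For every pair of states $s, s' \in S_{\mathcal{N}}$, there exists a finite run $\sigma \in \mathrm{Run}_{fin}^{\mathcal{N}}(s)$ such that $\mathit{last}(\sigma) = s'$. We use $P|_{\mathcal{N}}$ to denote the function $P$ restricted to the sets $S_{\mathcal{N}}$ and $A_{\mathcal{N}}$. Similarly, we use $L|_{\mathcal{N}}$ and $g|_{\mathcal{N}}$ with the obvious meaning. If the context is clear, we only use $P, L, g$ instead of $P|_{\mathcal{N}}, L|_{\mathcal{N}}, g|_{\mathcal{N}}$. An EC $\mathcal{N}$ of $\mathcal{M}$ is called maximal (MEC) if there is no EC $\mathcal{N}' = (S_{\mathcal{N}'}, A_{\mathcal{N}'}, P, AP, L, g)$ of $\mathcal{M}$ such that $\mathcal{N}' \neq \mathcal{N}$, $S_{\mathcal{N}} \subseteq S_{\mathcal{N}'}$, and $A_{\mathcal{N}}(s) \subseteq A_{\mathcal{N}'}(s)$ for every $s \in S_{\mathcal{N}}$. The sets of all end components and all maximal end components of $\mathcal{M}$ are denoted by $\mathrm{EC}(\mathcal{M})$ and $\mathrm{MEC}(\mathcal{M})$, respectively.
The number of ECs of an MDP $\mathcal{M}$ can be up to exponential in the number of states of $\mathcal{M}$ and they can intersect. On the other hand, MECs are pairwise disjoint and every EC is contained in a single MEC. Hence, the number of MECs of $\mathcal{M}$ is bounded by the number of states of $\mathcal{M}$.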
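Because MECs are pairwise disjoint, they can be computed by repeatedly refining candidate state sets. The sketch below uses the textbook scheme (prune actions that leave the candidate set, split the remainder into strongly connected components, repeat); it is not necessarily the algorithm of the references cited in this paper. The function names and the sparse encoding `P[(s, a)] = {s_next: prob}` are assumptions made for illustration.

```python
def strongly_connected_components(nodes, succ):
    """Tarjan's algorithm (recursive; intended for small illustrative graphs)."""
    index, low, on_stack, stack, out = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ(v):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            out.append(comp)

    for v in nodes:
        if v not in index:
            visit(v)
    return out


def maximal_end_components(states, P):
    """Return a list of (state_set, {state: enabled_actions}) pairs, one per MEC."""
    acts = {s: {a for (t, a) in P if t == s} for s in states}
    support = lambda s, a: {t for t, p in P[(s, a)].items() if p > 0}
    mecs, work = [], [set(states)]
    while work:
        cand = work.pop()
        changed = True
        while changed:                  # drop actions that can leave the candidate
            changed = False
            for s in list(cand):
                acts[s] = {a for a in acts[s] if support(s, a) <= cand}
                if not acts[s]:
                    cand.discard(s)
                    changed = True
        if not cand:
            continue
        succ = lambda s: {t for a in acts[s] for t in support(s, a)}
        comps = strongly_connected_components(cand, succ)
        if len(comps) == 1:             # closed and strongly connected: a MEC
            mecs.append((cand, {s: set(acts[s]) for s in cand}))
        else:
            work.extend(comps)          # refine each component separately
    return mecs


# Illustrative example (assumed): MECs are {0, 1} and {3}; state 2 is transient.
P_example = {
    (0, "a"): {1: 1.0},
    (1, "a"): {0: 1.0},
    (1, "b"): {2: 1.0},
    (2, "a"): {2: 0.5, 3: 0.5},
    (3, "a"): {3: 1.0},
}
mecs = maximal_end_components({0, 1, 2, 3}, P_example)
```

In the example, action `b` of state `1` is dropped from the MEC on `{0, 1}` because it leaves the component.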
Definition 3
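Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an MDP. A control strategy for $\mathcal{M}$ is a function $C: \mathrm{Run}_{fin}^{\mathcal{M}} \to A$ such that for every $\sigma \in \mathrm{Run}_{fin}^{\mathcal{M}}$ it holds that $C(\sigma) \in A(\mathit{last}(\sigma))$.
A strategy $C$ for which $C(\sigma) = C(\sigma')$ for all finite runs $\sigma, \sigma' \in \mathrm{Run}_{fin}^{\mathcal{M}}$ with $\mathit{last}(\sigma) = \mathit{last}(\sigma')$ is called memoryless. In that case, we consider $C$ to be a function $C: S \to A$. A strategy is called finite-memory if it is defined as a tuple $C = (M, act, \Delta, start)$, where $M$ is a finite set of modes, $\Delta: M \times S \to M$ is a transition function, $act: M \times S \to A$ selects an action to be applied in $\mathcal{M}$, and $start: S \to M$ selects the starting mode for every $s \in S$.
A run $\rho = s_0 s_1 \ldots$ of an MDP $\mathcal{M}$ is called a run under a strategy $C$ for $\mathcal{M}$ if for every $i \ge 0$, it holds that $P(s_i, C(\rho^{(i)}), s_{i+1}) > 0$. A finite run under $C$ is a finite prefix of a run under $C$. The sets of all infinite and finite runs of $\mathcal{M}$ under $C$ starting in a state $s \in S$ are denoted by $\mathrm{Run}^{\mathcal{M},C}(s)$ and $\mathrm{Run}_{fin}^{\mathcal{M},C}(s)$, respectively. Let $\mathrm{Run}^{\mathcal{M},C} = \bigcup_{s \in S} \mathrm{Run}^{\mathcal{M},C}(s)$ and $\mathrm{Run}_{fin}^{\mathcal{M},C} = \bigcup_{s \in S} \mathrm{Run}_{fin}^{\mathcal{M},C}(s)$.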
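Let $\mathcal{M}$ be an MDP, $s$ a state of $\mathcal{M}$, and $C$ a strategy for $\mathcal{M}$. The following probability measure is used to argue about the possible outcomes of applying $C$ in $\mathcal{M}$ starting from $s$. Let $\sigma \in \mathrm{Run}_{fin}^{\mathcal{M},C}(s)$ be a finite run. A cylinder set $Cyl(\sigma)$ of $\sigma$ is the set of all runs of $\mathcal{M}$ under $C$ that have $\sigma$ as a finite prefix. There exists a unique probability measure $\Pr^{\mathcal{M},C}_{s}$ on the $\sigma$-algebra generated by the set of cylinder sets of all runs in $\mathrm{Run}_{fin}^{\mathcal{M},C}(s)$. For $\sigma = s_0 \ldots s_n \in \mathrm{Run}_{fin}^{\mathcal{M},C}(s)$, it holds
$$\Pr^{\mathcal{M},C}_{s}(Cyl(\sigma)) = \prod_{i=0}^{n-1} P(s_i, C(\sigma^{(i)}), s_{i+1})$$
and $\Pr^{\mathcal{M},C}_{s}(Cyl(s)) = 1$. Intuitively, given a subset $X \subseteq \mathrm{Run}^{\mathcal{M},C}(s)$, $\Pr^{\mathcal{M},C}_{s}(X)$ is the probability that a run of $\mathcal{M}$ under $C$ that starts in $s$ belongs to the set $X$.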
The following properties hold for any MDP $\mathcal{M}$ (see, e.g., [3]). For every EC $\mathcal{N}$ of $\mathcal{M}$, there exists a finite-memory strategy $C$ for $\mathcal{M}$ such that $\mathcal{M}$ under $C$ starting from any state of $\mathcal{N}$ never visits a state outside $\mathcal{N}$ and all states of $\mathcal{N}$ are visited infinitely many times with probability 1. On the other hand, given any strategy $C$, finite-memory or not, a state $s$ of $\mathcal{M}$, and a run $\rho$ of $\mathcal{M}$ under $C$ that starts in $s$, the set of states visited infinitely many times by $\rho$ forms an end component. Let $ec \subseteq \mathrm{EC}(\mathcal{M})$ be the set of all ECs of $\mathcal{M}$ that correspond, in the above sense, to at least one run under the strategy $C$ that starts in the state $s$. We say that the strategy $C$ leads $\mathcal{M}$ from the state $s$ to the set $ec$.
II-B Linear Temporal Logic
Definition 4
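Linear Temporal Logic (LTL) formulae over a set $AP$ of atomic propositions are formed according to the following grammar:
$$\phi ::= \mathrm{true} \mid a \mid \neg \phi \mid \phi \wedge \phi \mid \mathbf{X}\,\phi \mid \phi\,\mathbf{U}\,\phi \mid \mathbf{G}\,\phi \mid \mathbf{F}\,\phi,$$
where $a \in AP$ is an atomic proposition, $\neg$ and $\wedge$ are standard Boolean connectives, and $\mathbf{X}$ (next), $\mathbf{U}$ (until), $\mathbf{G}$ (always), and $\mathbf{F}$ (eventually) are temporal operators.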
Formulae of LTL are interpreted over the words from $(2^{AP})^{\omega}$, such as those induced by runs of an MDP $\mathcal{M}$ (for details see e.g., [3]). For example, a word $w \in (2^{AP})^{\omega}$ satisfies $\mathbf{G}\,\phi$ and $\mathbf{F}\,\phi$ if $\phi$ holds in $w$ always and eventually, respectively. If the word induced by a run satisfies a formula $\phi$, we say that the run satisfies $\phi$. With slight abuse of notation, we also use states or sets of states of the MDP as propositions in LTL formulae.
For every LTL formula $\phi$, the set of all runs of $\mathcal{M}$ that satisfy $\phi$ is measurable in the probability measure $\Pr^{\mathcal{M},C}_{s}$ for any $C$ and $s$ [3]. With slight abuse of notation, we use LTL formulae as arguments of $\Pr^{\mathcal{M},C}_{s}$. If for a state $s \in S$ it holds that $\Pr^{\mathcal{M},C}_{s}(\phi) = 1$, we say that the strategy $C$ almost-surely satisfies $\phi$ starting from $s$. If $\mathcal{M}$ is an initialized MDP and $\Pr^{\mathcal{M},C}_{s_{init}}(\phi) = 1$, we say that $C$ almost-surely satisfies $\phi$.
The LTL control synthesis problem for an initialized MDP $\mathcal{M}$ and an LTL formula $\phi$ over $AP$ aims to find a strategy for $\mathcal{M}$ that almost-surely satisfies $\phi$. This problem can be solved using principles from probabilistic model checking [3], [12]. The algorithm itself is based on the translation of $\phi$ to a Rabin automaton and the analysis of an MDP that combines the Rabin automaton and the original MDP $\mathcal{M}$.
Definition 5
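A deterministic Rabin automaton (DRA) is a tuple $\mathcal{A} = (Q, 2^{AP}, \delta, q_0, Acc)$, where $Q$ is a non-empty finite set of states, $2^{AP}$ is an alphabet, $\delta: Q \times 2^{AP} \to Q$ is a transition function, $q_0 \in Q$ is an initial state, and $Acc \subseteq 2^{Q} \times 2^{Q}$ is an accepting condition.
A run of $\mathcal{A}$ is a sequence $q_0 q_1 \ldots \in Q^{\omega}$ such that for every $i \ge 0$, there exists $A_i \in 2^{AP}$ with $\delta(q_i, A_i) = q_{i+1}$. We say that the word $A_0 A_1 \ldots \in (2^{AP})^{\omega}$ induces the run $q_0 q_1 \ldots$. A run of $\mathcal{A}$ is called accepting if there exists a pair $(B, G) \in Acc$ such that the run visits every state from $B$ only finitely many times and at least one state from $G$ infinitely many times.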
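The Rabin acceptance condition can be checked mechanically for ultimately periodic words, i.e. words of the form prefix followed by a cycle repeated forever. The sketch below is an illustration under that restriction; the function name `dra_accepts_lasso`, the dictionary encoding of $\delta$, and the two-state example automaton are assumptions, not taken from the paper. It determines the set of states visited infinitely often and then tests each pair $(B, G)$.

```python
def dra_accepts_lasso(delta, q0, acc, prefix, cycle):
    """Decide acceptance of the word prefix . cycle^omega (cycle must be non-empty).

    delta: dict mapping (state, letter) -> state; letters must be hashable.
    acc:   list of Rabin pairs (B, G), each a set of automaton states.
    """
    q = q0
    for letter in prefix:
        q = delta[(q, letter)]

    # Repeat the cycle until the state at the start of a pass repeats.
    seen = {}               # state at the start of a pass -> index of that pass
    visited_per_pass = []   # states entered during each pass
    while q not in seen:
        seen[q] = len(visited_per_pass)
        visited = set()
        for letter in cycle:
            q = delta[(q, letter)]
            visited.add(q)
        visited_per_pass.append(visited)

    # States visited infinitely often: everything seen from the repeating pass on.
    inf = set().union(*visited_per_pass[seen[q]:])
    return any(not (inf & B) and bool(inf & G) for B, G in acc)


# Illustrative automaton (assumed): the state records whether the last letter had "a".
A_, E_ = frozenset({"a"}), frozenset()
delta = {(q, l): ("q1" if l == A_ else "q0") for q in ("q0", "q1") for l in (A_, E_)}
acc_GF_a = [(set(), {"q1"})]     # "infinitely often a"
acc_FG_a = [({"q0"}, {"q1"})]    # "eventually always a"
```

For instance, the word that alternates the empty letter and $\{a\}$ forever is accepted with `acc_GF_a` and rejected with `acc_FG_a`.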
For every LTL formula $\phi$ over $AP$, there exists a DRA $\mathcal{A}_{\phi}$ such that all and only words from $(2^{AP})^{\omega}$ satisfying $\phi$ induce an accepting run of $\mathcal{A}_{\phi}$ [14]. For translation algorithms see e.g., [16], and their online implementations, e.g., [15].
Definition 6
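Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an initialized MDP and $\mathcal{A} = (Q, 2^{AP}, \delta, q_0, Acc)$ be a DRA. The product of $\mathcal{M}$ and $\mathcal{A}$ is the initialized MDP $\mathcal{P} = (S_{\mathcal{P}}, A, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$, where $S_{\mathcal{P}} = S \times Q$, $P_{\mathcal{P}}((s,q), \alpha, (s',q')) = P(s, \alpha, s')$ if $q' = \delta(q, L(s))$ and $0$ otherwise, $AP_{\mathcal{P}} = Q$, $L_{\mathcal{P}}((s,q)) = \{q\}$, $g_{\mathcal{P}}((s,q), \alpha) = g(s, \alpha)$. The initial state of $\mathcal{P}$ is $s_{\mathcal{P}init} = (s_{init}, q_0)$.
Using the projection on the first component, every (finite) run of $\mathcal{P}$ projects to a (finite) run of $\mathcal{M}$ and vice versa, for every (finite) run of $\mathcal{M}$, there exists a (finite) run of $\mathcal{P}$ that projects to it. Analogous correspondence exists between strategies for $\mathcal{P}$ and $\mathcal{M}$. It holds that the projection of a finite-memory strategy for $\mathcal{P}$ is also finite-memory. More importantly, for the product $\mathcal{P}$ of an MDP $\mathcal{M}$ and a DRA $\mathcal{A}_{\phi}$ for an LTL formula $\phi$, the probability of satisfying the accepting condition $Acc$ of $\mathcal{A}_{\phi}$ under a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ starting from the initial state $s_{\mathcal{P}init}$, i.e.,
$$\Pr^{\mathcal{P},C_{\mathcal{P}}}_{s_{\mathcal{P}init}}\big(\exists (B,G) \in Acc : \mathbf{F}\mathbf{G}\,\neg B \wedge \mathbf{G}\mathbf{F}\,G\big),$$
is equal to the probability of satisfying the formula $\phi$ in the MDP $\mathcal{M}$ under the projected strategy $C$ starting from the initial state $s_{init}$.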
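The product construction of Definition 6 can be sketched in a few lines. The code below is an illustration under assumptions: the function name `product_mdp` and the dictionary encodings are hypothetical, and the automaton is assumed to read the label of the state being left, so the automaton component moves along $\delta(q, L(s))$ while the MDP component follows the original transition probabilities.

```python
def product_mdp(P, labels, cost, delta, dra_states):
    """Build the product of an MDP and a DRA.

    P:      (s, a) -> {s_next: prob}       (MDP transitions, sparse)
    labels: s -> set of atomic propositions
    cost:   (s, a) -> nonnegative cost
    delta:  (q, frozenset_of_propositions) -> q_next
    Returns (P_prod, labels_prod, cost_prod) over product states (s, q).
    """
    P_prod, labels_prod, cost_prod = {}, {}, {}
    for (s, a), dist in P.items():
        for q in dra_states:
            # The automaton reads the label of the current MDP state s.
            q_next = delta[(q, frozenset(labels[s]))]
            P_prod[((s, q), a)] = {(t, q_next): p for t, p in dist.items()}
            cost_prod[((s, q), a)] = cost[(s, a)]      # costs are inherited
    for s in labels:
        for q in dra_states:
            labels_prod[(s, q)] = {q}                  # product states are labeled by q
    return P_prod, labels_prod, cost_prod


# Illustrative example (assumed): two-state MDP and a two-state automaton
# that moves to "q1" exactly when the letter read contains "sur".
P = {("s0", "a"): {"s0": 0.5, "s1": 0.5}, ("s1", "b"): {"s0": 1.0}}
labels = {"s0": set(), "s1": {"sur"}}
cost = {("s0", "a"): 1.0, ("s1", "b"): 2.0}
SUR, NONE = frozenset({"sur"}), frozenset()
delta = {(q, l): ("q1" if l == SUR else "q0") for q in ("q0", "q1") for l in (SUR, NONE)}
P_prod, labels_prod, cost_prod = product_mdp(P, labels, cost, delta, ["q0", "q1"])
```

This sketch builds the full state space $S \times Q$; in practice one would often restrict it to the states reachable from $(s_{init}, q_0)$.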
Definition 7
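Let $\mathcal{P} = (S_{\mathcal{P}}, A, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ be the product of an MDP $\mathcal{M}$ and a DRA $\mathcal{A}$. An accepting end component (AEC) of $\mathcal{P}$ is defined as an end component $\mathcal{N} = (S_{\mathcal{N}}, A_{\mathcal{N}}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ of $\mathcal{P}$ for which there exists a pair $(B, G)$ in the acceptance condition of $\mathcal{A}$ such that $L_{\mathcal{P}}(S_{\mathcal{N}}) \cap B = \emptyset$ and $L_{\mathcal{P}}(S_{\mathcal{N}}) \cap G \neq \emptyset$. We say that $\mathcal{N}$ is accepting with respect to the pair $(B, G)$. An AEC $\mathcal{N}$ is called maximal (MAEC) if there is no AEC $\mathcal{N}' = (S_{\mathcal{N}'}, A_{\mathcal{N}'}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ such that $\mathcal{N}' \neq \mathcal{N}$, $S_{\mathcal{N}} \subseteq S_{\mathcal{N}'}$, $A_{\mathcal{N}}((s,q)) \subseteq A_{\mathcal{N}'}((s,q))$ for every $(s,q) \in S_{\mathcal{N}}$, and $\mathcal{N}$ and $\mathcal{N}'$ are accepting with respect to the same pair. We use $\mathrm{AEC}(\mathcal{P})$ and $\mathrm{MAEC}(\mathcal{P})$ to denote the set of all accepting end components and maximal accepting end components of $\mathcal{P}$, respectively.
Note that MAECs that are accepting with respect to the same pair are always disjoint. However, MAECs that are accepting with respect to different pairs can intersect.
From the discussion above it follows that a necessary condition for almost-sure satisfaction of the accepting condition $Acc$ by a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ is that there exists a set $\mathrm{MAEC}^{\star} \subseteq \mathrm{MAEC}(\mathcal{P})$ of MAECs such that $C_{\mathcal{P}}$ leads the product from the initial state $s_{\mathcal{P}init}$ to $\mathrm{MAEC}^{\star}$.
III Problem Formulation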
Consider an initialized MDP $\mathcal{M} = (S, A, P, AP, L, g)$ and a specification given as an LTL formula $\phi$ over $AP$ of the form
$$\phi = \varphi \wedge \mathbf{G}\mathbf{F}\,\pi_{sur}, \qquad (1)$$
where $\pi_{sur} \in AP$ is an atomic proposition and $\varphi$ is an LTL formula over $AP$. Intuitively, a formula of such form states two partial goals: the mission goal $\varphi$ and the surveillance goal $\mathbf{G}\mathbf{F}\,\pi_{sur}$. To satisfy the whole formula the system must accomplish the mission and visit the surveillance states infinitely many times. The motivation for this form of specification comes from applications in robotics, where persistent surveillance tasks are often a part of the specification. Note that the form in Eq. (1) does not restrict the full LTL expressivity since every LTL formula $\varphi$ can be translated into a formula of the form in Eq. (1) that is associated with the same set of runs of $\mathcal{M}$. Explicitly, $\phi = \varphi \wedge \mathbf{G}\mathbf{F}\,\pi_{sur}$, where $\pi_{sur}$ is such that $\pi_{sur} \in L(s)$ for every state $s \in S$.
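In this work, we focus on a control synthesis problem, where the goal is to almost-surely satisfy a given LTL specification, while optimizing a long-term quantitative objective. The objective is to minimize the average expected cumulative cost between consecutive visits to surveillance states.
Formally, we say that every visit to a surveillance state completes a surveillance cycle. In particular, starting from the initial state, the first visit to a surveillance state after the initial state completes the first surveillance cycle of a run. We use $\sharp(\sigma)$ to denote the number of completed surveillance cycles in a finite run $\sigma$ plus one. For a strategy $C$ for $\mathcal{M}$, the cumulative cost in the first $n$ stages of applying $C$ to $\mathcal{M}$ starting from a state $s \in S$ is
$$g_{\mathcal{M},C}(s,n) = \sum_{i=0}^{n-1} g\big(\sigma_{\mathcal{M},C}(s,n)(i),\, C(\sigma_{\mathcal{M},C}(s,n)^{(i)})\big),$$
where $\sigma_{\mathcal{M},C}(s,n)$ is the random variable whose values are finite runs of length $n$ from the set $\mathrm{Run}_{fin}^{\mathcal{M},C}(s)$ and the probability of a finite run $\sigma$ is $\Pr^{\mathcal{M},C}_{s}(Cyl(\sigma))$. Note that $g_{\mathcal{M},C}(s,n)$ is also a random variable. Finally, we define the average expected cumulative cost per surveillance cycle (ACPC) in the MDP $\mathcal{M}$ under a strategy $C$ as a function $V_{\mathcal{M},C}: S \to \mathbb{R}^{+}_{0}$ such that for a state $s \in S$
$$V_{\mathcal{M},C}(s) = \limsup_{n \to \infty} E\left(\frac{g_{\mathcal{M},C}(s,n)}{\sharp(\sigma_{\mathcal{M},C}(s,n))}\right).$$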
The problem we consider in this paper can be formally stated as follows.
Problem 1
Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an initialized MDP and $\phi$ be an LTL formula over $AP$ of the form in Eq. (1). Find a strategy $C$ for $\mathcal{M}$ such that $C$ almost-surely satisfies $\phi$ and, at the same time, $C$ minimizes the ACPC value $V_{\mathcal{M},C}(s_{init})$ among all strategies almost-surely satisfying $\phi$.
The above problem was recently investigated in [11]. However, the solution presented by the authors is guaranteed to find an optimal strategy only if every MAEC of the product of the MDP and the DRA for the specification satisfies certain conditions (for details see [11]). In this paper, we present a solution to Problem 1 that always finds an optimal strategy if one exists. The algorithm is based on principles from probabilistic model checking [3] and game theory [5], whereas the authors in [11] mainly use results from dynamic programming [4].
In the special case when every state of $\mathcal{M}$ is a surveillance state, Problem 1 aims to find a strategy that minimizes the average expected cost per stage among all strategies almost-surely satisfying $\phi$. The problem of minimizing the average expected cost per stage (ACPS) in an MDP, without considering any correctness specification, is a well-studied problem in optimal control, see e.g., [4]. It holds that there always exists a stationary strategy that minimizes the ACPS value starting from the initial state. In our approach to Problem 1, we use techniques for solving the ACPS problem to find a strategy that minimizes the ACPC value.
IV Solution
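Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an initialized MDP and $\phi$ an LTL formula over $AP$ of the form in Eq. (1). To solve Problem 1 for $\mathcal{M}$ and $\phi$ we leverage ideas from game theory [5] and construct an optimal strategy for $\mathcal{M}$ as a combination of a strategy that ensures the almost-sure satisfaction of the specification $\phi$ and a strategy that guarantees the minimum ACPC value among all strategies that do not cause immediate unrepairable violation of $\phi$.
The algorithm we present in this section works with the product $\mathcal{P} = (S_{\mathcal{P}}, A, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ of the MDP $\mathcal{M}$ and a deterministic Rabin automaton $\mathcal{A}_{\phi} = (Q, 2^{AP}, \delta, q_0, Acc)$ for the formula $\phi$. We inherit the notion of a surveillance cycle in $\mathcal{P}$ by adding the proposition $\pi_{sur}$ to the set $AP_{\mathcal{P}}$ and to the set $L_{\mathcal{P}}((s,q))$ for every $(s,q) \in S_{\mathcal{P}}$ such that $\pi_{sur} \in L(s)$. Using the correspondence between strategies for $\mathcal{P}$ and $\mathcal{M}$, an optimal strategy $C$ for $\mathcal{M}$ is found as a projection of a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ which almost-surely satisfies the accepting condition $Acc$ of $\mathcal{A}_{\phi}$ and, at the same time, minimizes the ACPC value $V_{\mathcal{P},C_{\mathcal{P}}}(s_{\mathcal{P}init})$ among all strategies for $\mathcal{P}$ that almost-surely satisfy $Acc$.
Since $C_{\mathcal{P}}$ must almost-surely satisfy the accepting condition $Acc$, it leads from the initial state of $\mathcal{P}$ to a set of MAECs. For every MAEC $\mathcal{N}$, the minimum ACPC value $V^{*}_{\mathcal{N}}(s)$ that can be obtained in $\mathcal{N}$ starting from a state $s \in S_{\mathcal{N}}$ is equal for all the states of $\mathcal{N}$ and we denote this value $V^{*}_{\mathcal{N}}$. The strategy $C_{\mathcal{P}}$ is constructed in two steps.
First, we find a set $\mathrm{MAEC}^{\star}$ of MAECs of $\mathcal{P}$ and a strategy $C_0$ that leads $\mathcal{P}$ from the initial state $s_{\mathcal{P}init}$ to the set $\mathrm{MAEC}^{\star}$. We require that $\mathrm{MAEC}^{\star}$ and $C_0$ minimize the weighted average of the values $V^{*}_{\mathcal{N}}$ for $\mathcal{N} \in \mathrm{MAEC}^{\star}$. The strategy $C_{\mathcal{P}}$ applies $C_0$ from the initial state until $\mathcal{P}$ enters the set $\bigcup \mathrm{MAEC}^{\star}$.
Second, we solve the problem of how to control the product once a state of an MAEC $\mathcal{N} \in \mathrm{MAEC}^{\star}$ is visited. Intuitively, we combine two finite-memory strategies, $C_{\mathcal{N}}^{\phi}$ for the almost-sure satisfaction of the accepting condition $Acc$ and $C_{\mathcal{N}}^{V}$ for maintaining the average expected cumulative cost per surveillance cycle. To satisfy both objectives, the strategy $C_{\mathcal{P}}$ is played in rounds. In each round, we first apply the strategy $C_{\mathcal{N}}^{\phi}$ and then the strategy $C_{\mathcal{N}}^{V}$, each for a specific (finite) number of steps.
IV-A Finding an optimal set of MAECs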
Let $\mathrm{MAEC}(\mathcal{P})$ be the set of all MAECs of the product $\mathcal{P}$, which can be computed as follows. For every pair $(B, G) \in Acc$, we create a new MDP from $\mathcal{P}$ by removing all its states with label in $B$ and the corresponding actions. For the new MDP, we use one of the algorithms in [10], [9], [7] to compute the set of all its MECs. Finally, for every MEC, we check whether it contains a state with label in $G$.
In this section, the aim is to find a set $\mathrm{MAEC}^{\star} \subseteq \mathrm{MAEC}(\mathcal{P})$ and a strategy $C_0$ for $\mathcal{P}$ that satisfy conditions formally stated below. Since the strategy $C_0$ will only be used to enter the set $\bigcup \mathrm{MAEC}^{\star}$, it is constructed as a partial function.
Definition 8
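A partial strategy for the MDP $\mathcal{M}$ is a partial function $C: \mathrm{Run}_{fin}^{\mathcal{M}} \to A$, where if $C(\sigma)$ is defined for $\sigma \in \mathrm{Run}_{fin}^{\mathcal{M}}$, then $C(\sigma) \in A(\mathit{last}(\sigma))$.
A partial stationary strategy for $\mathcal{M}$ can also be considered as a partial function $C: S \to A$ or a subset $C \subseteq S \times A$. The set $\mathrm{Run}^{\mathcal{M},C}$ of runs of $\mathcal{M}$ under $C$ contains all infinite runs of $\mathcal{M}$ that follow $C$ and all those finite runs $\sigma$ of $\mathcal{M}$ under $C$ for which $C(\mathit{last}(\sigma))$ is not defined. A finite run of $\mathcal{M}$ under $C$ is then a finite prefix of a run under $C$. The probability measure $\Pr^{\mathcal{M},C}_{s}$ is defined in the same manner as in Sec. II-A. We also extend the semantics of LTL formulas to finite words. For example, a formula $\mathbf{F}\mathbf{G}\,a$ is satisfied by a finite word if in some non-empty suffix of the word $a$ always holds.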
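The conditions on $\mathrm{MAEC}^{\star}$ and $C_0$ are as follows. First, the partial strategy $C_0$ leads $\mathcal{P}$ to the set $\mathrm{MAEC}^{\star}$, i.e.,
$$\Pr^{\mathcal{P},C_0}_{s_{\mathcal{P}init}}\Big(\mathbf{F}\,\bigcup \mathrm{MAEC}^{\star}\Big) = 1. \qquad (2)$$
Second, we require that $\mathrm{MAEC}^{\star}$ and $C_0$ minimize the value
$$\sum_{\mathcal{N} \in \mathrm{MAEC}^{\star}} \Pr^{\mathcal{P},C_0}_{s_{\mathcal{P}init}}(\mathbf{F}\,\mathcal{N}) \cdot V^{*}_{\mathcal{N}}. \qquad (3)$$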
The procedure to compute the optimal ACPC value $V^{*}_{\mathcal{N}}$ for an MAEC $\mathcal{N}$ of $\mathcal{P}$ is described in the next section. Assume we already computed this value for each MAEC of $\mathcal{P}$. The algorithm to find the set $\mathrm{MAEC}^{\star}$ and partial strategy $C_0$ is based on an algorithm for the stochastic shortest path (SSP) problem. The SSP problem is one of the basic optimization problems for MDPs. Given an initialized MDP and its state $t$, the goal is to find a strategy under which the MDP almost-surely reaches the state $t$, the so-called terminal state, while minimizing the expected cumulative cost. If there exists at least one strategy almost-surely reaching the terminal state, then there exists a stationary optimal strategy. For details and algorithms see e.g., [4].
The partial strategy $C_0$ and the set $\mathrm{MAEC}^{\star}$ are computed as follows. First, we create a new MDP $\mathcal{P}'$ from $\mathcal{P}$ by considering only those states of $\mathcal{P}$ that can reach the set $\bigcup \mathrm{MAEC}(\mathcal{P})$ with probability 1 and their corresponding actions. The MDP $\mathcal{P}'$ can be computed using backward reachability from the set $\bigcup \mathrm{MAEC}(\mathcal{P})$. If $\mathcal{P}'$ does not contain the initial state $s_{\mathcal{P}init}$, there exists no solution to Problem 1. Otherwise, we add a new state $t$ and for every MAEC $\mathcal{N} \in \mathrm{MAEC}(\mathcal{P})$, we add a new action $\alpha_{\mathcal{N}}$ to $\mathcal{P}'$. From each state $s \in S_{\mathcal{N}}$, we define a transition under $\alpha_{\mathcal{N}}$ to $t$ with probability $1$ and set its cost to $V^{*}_{\mathcal{N}}$. All other costs in the MDP are set to $0$. Finally, we solve the SSP problem for $\mathcal{P}'$ and the state $t$ as the terminal state. Let $C_{SSP}$ be the resulting stationary optimal strategy for $\mathcal{P}'$. For every state $s$ of $\mathcal{P}'$, we define $C_0(s) = C_{SSP}(s)$ if the action $C_{SSP}(s)$ does not lead from $s$ to $t$; $C_0(s)$ is undefined otherwise. The set $\mathrm{MAEC}^{\star}$ is the set of all MAECs $\mathcal{N}$ for which there exists a state $s \in S_{\mathcal{N}}$ such that $C_{SSP}(s) = \alpha_{\mathcal{N}}$.
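The SSP step can be illustrated with plain value iteration. The sketch below is a generic SSP solver, not the specific routine of [4]; the function name `ssp_value_iteration`, its parameters, and the three-state example are assumptions. It presumes the standard SSP conditions (some strategy reaches the terminal state almost surely and every strategy that fails to do so accumulates infinite cost). With the zero-cost transitions used in the construction above, extracting a greedy strategy needs extra care so that it still reaches $t$ almost surely.

```python
def ssp_value_iteration(states, P, cost, terminal, tol=1e-10, max_iter=100_000):
    """Value iteration for a stochastic shortest path problem.

    P:    (s, a) -> {s_next: prob};  cost: (s, a) -> nonnegative cost.
    Returns (J, policy): expected cost-to-go and a greedy stationary strategy.
    """
    def q_value(s, a, J):
        return cost[(s, a)] + sum(p * J[t] for t, p in P[(s, a)].items())

    enabled = {s: [a for (u, a) in P if u == s] for s in states}
    J = {s: 0.0 for s in states}
    for _ in range(max_iter):
        J_new = {
            s: 0.0 if s == terminal else min(q_value(s, a, J) for a in enabled[s])
            for s in states
        }
        gap = max(abs(J_new[s] - J[s]) for s in states)
        J = J_new
        if gap < tol:
            break
    # Greedy strategy with respect to the (approximately) converged values.
    policy = {
        s: min(enabled[s], key=lambda a: q_value(s, a, J))
        for s in states if s != terminal
    }
    return J, policy


# Illustrative example (assumed): from "A", the cheap-looking route through "B"
# costs 1 + 4 = 5, while repeatedly trying "risky" costs 4 in expectation.
states = ["A", "B", "T"]
P = {
    ("A", "go"): {"B": 1.0},
    ("A", "risky"): {"T": 0.5, "A": 0.5},
    ("B", "end"): {"T": 1.0},
}
cost = {("A", "go"): 1.0, ("A", "risky"): 2.0, ("B", "end"): 4.0}
J, policy = ssp_value_iteration(states, P, cost, terminal="T")
```

In the construction of this section, the terminal actions $\alpha_{\mathcal{N}}$ would carry the costs $V^{*}_{\mathcal{N}}$, so the optimal expected cost corresponds to the weighted average in Eq. (3).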
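Proposition 1
The set $\mathrm{MAEC}^{\star}$ and the partial stationary strategy $C_0$ resulting from the above algorithm satisfy the conditions in Eq. (2) and Eq. (3).
Both conditions follow directly from the fact that the strategy $C_{SSP}$ is an optimal solution to the SSP problem for $\mathcal{P}'$ and $t$.
IV-B Optimizing ACPC value in an MAEC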
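In this section, we compute the minimum ACPC value $V^{*}_{\mathcal{N}}$ that can be attained in an MAEC $\mathcal{N} \in \mathrm{MAEC}(\mathcal{P})$ and construct the corresponding strategy for $\mathcal{N}$. Essentially, we reduce the problem of computing the minimum ACPC value to the problem of computing the minimum ACPS value by reducing $\mathcal{N}$ to an MDP such that every state of the reduced MDP is labeled with the surveillance proposition $\pi_{sur}$.
Let $\mathcal{N} = (S_{\mathcal{N}}, A_{\mathcal{N}}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ be an MAEC of $\mathcal{P}$. Since it is an MAEC, there exists a state $(s,q) \in S_{\mathcal{N}}$ with $\pi_{sur} \in L_{\mathcal{P}}((s,q))$. Let $S_{\mathcal{N}sur}$ denote the set of all such states in $S_{\mathcal{N}}$. We reduce $\mathcal{N}$ to an MDP $\mathcal{N}_{sur} = (S_{\mathcal{N}sur}, A_{sur}, P_{sur}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{sur})$ using Alg. 1.
For the sake of readability, we use singletons such as $s$ instead of pairs such as $(s,q)$ to denote the states of $\mathcal{N}$. The MDP $\mathcal{N}_{sur}$ is constructed from $\mathcal{N}$ by eliminating states from $S_{\mathcal{N}} \setminus S_{\mathcal{N}sur}$ one by one in arbitrary order. The actions $A_{sur}$ are partial stationary strategies for $\mathcal{N}$ in which we remember all the states and actions we eliminated. Later we prove that the transition probability $P_{sur}(s, C, s')$ for states $s, s' \in S_{\mathcal{N}sur}$ and an action $C \in A_{sur}(s)$ is the probability that in $\mathcal{N}$ under the partial stationary strategy $C$, if we start from the state $s$, the next state that will be visited from the set $S_{\mathcal{N}sur}$ is the state $s'$, i.e., the first surveillance cycle is completed by visiting $s'$. The cost $g_{sur}(s, C)$ is the expected cumulative cost gained in $\mathcal{N}$ using the partial stationary strategy $C$ from $s$ until we reach a state in $S_{\mathcal{N}sur}$.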
In Fig. 1, we demonstrate the reduction on an example using the notation introduced in Alg. 1. On the left side, we see a part of an MAEC $\mathcal{N}$ with five states and two actions. First, we build an MDP $\mathcal{X}$ from $\mathcal{N}$ by transforming every action of every state to a partial stationary strategy with a single pair given by the state and the action. The MDP $\mathcal{X}$ is used in the algorithm as an auxiliary MDP to store the current version of the reduced system. Assume we want to reduce the state $s$. We consider all “incoming” and “outgoing” actions of $s$ and combine them pairwise as follows. There is only one outgoing action from $s$ in $\mathcal{X}$, namely $C_{out}$, and only one incoming action, namely action $C_{in}$ of state $s_{in}$. Since $C_{in}$ and $C_{out}$ do not conflict as partial stationary strategies on any state of $\mathcal{N}$, we merge them to create a new partial stationary strategy $C_{new} = C_{in} \cup C_{out}$ that is an action of $s_{in}$. The transition probability $P_{\mathcal{X}}(s_{in}, C_{new}, s')$ for a state $s'$ of $\mathcal{X}$ is computed as the sum of the probability of transiting from $s_{in}$ to $s'$ using the old action $C_{in}$ and the probability of entering $s'$ by first transiting from $s_{in}$ to $s$ using $C_{in}$ and from $s$ eventually reaching $s'$ using $C_{out}$. The cost $g_{\mathcal{X}}(s_{in}, C_{new})$ is the expected cumulative cost gained starting from $s_{in}$ by first applying action $C_{in}$ and, if we transit to $s$, applying $C_{out}$ until a state different from $s$ is reached. Now that we have considered every pair of an incoming and outgoing action of $s$, the state $s$ and its incoming and outgoing actions are reduced. The modified MDP $\mathcal{X}$ is depicted on the right side of Fig. 1.
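The elimination of a single state described above can be sketched as follows. This is an illustration of the verbal description, not a transcription of Alg. 1: the function names, the representation of actions as frozensets of `(state, base_action)` pairs, and the choice to skip an outgoing action that loops in the eliminated state with probability 1 are all assumptions. For an incoming action with probability $p_{in}$ of entering $s$ and an outgoing action with self-loop probability $p_{loop}$, the factor $p_{in} / (1 - p_{loop})$ accounts for "eventually" leaving $s$.

```python
def conflict(C1, C2):
    """Two partial stationary strategies conflict if they disagree on some state."""
    chosen = dict(C1)
    return any(st in chosen and chosen[st] != act for st, act in C2)


def eliminate_state(P, g, s):
    """Remove state s; merge each incoming action with each outgoing action of s.

    P: (state, C) -> {next_state: prob};  g: (state, C) -> expected cost.
    Actions C are frozensets of (state, base_action) pairs.
    """
    out_actions = [(C, dist) for (u, C), dist in P.items() if u == s]
    P_new, g_new = {}, {}
    for (u, C_in), dist in P.items():
        if u == s:
            continue                          # outgoing actions of s disappear
        p_in = dist.get(s, 0.0)
        if p_in == 0.0:                       # action never enters s: keep it as is
            P_new[(u, C_in)], g_new[(u, C_in)] = dict(dist), g[(u, C_in)]
            continue
        for C_out, dist_out in out_actions:
            p_loop = dist_out.get(s, 0.0)
            if conflict(C_in, C_out) or p_loop >= 1.0:
                continue
            C_new = C_in | C_out              # merged partial stationary strategy
            scale = p_in / (1.0 - p_loop)     # enter s, then eventually leave it
            merged = {t: p for t, p in dist.items() if t != s}
            for t, p in dist_out.items():
                if t != s:
                    merged[t] = merged.get(t, 0.0) + scale * p
            P_new[(u, C_new)] = merged
            g_new[(u, C_new)] = g[(u, C_in)] + scale * g[(s, C_out)]
    return P_new, g_new


# Illustrative example (assumed): eliminate "s" from a three-state MDP.
Cu, Cs, Cv = (frozenset({(x, a)}) for x, a in (("u", "a"), ("s", "b"), ("v", "c")))
P = {
    ("u", Cu): {"s": 0.5, "v": 0.5},
    ("s", Cs): {"s": 0.5, "v": 0.25, "u": 0.25},
    ("v", Cv): {"u": 1.0},
}
g = {("u", Cu): 1.0, ("s", Cs): 2.0, ("v", Cv): 0.0}
P2, g2 = eliminate_state(P, g, "s")
```

In the example, the merged action of `u` reaches `v` with probability $0.5 + 0.25 = 0.75$ and returns to `u` with probability $0.25$, at expected cost $1 + 2 = 3$.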

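[Fig. 1: Reduction of a state of the auxiliary MDP in Alg. 1. Left: part of an MAEC before the reduction. Right: the modified MDP after the reduction.]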
Proposition 2
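Let $\mathcal{N}$ be an MAEC and $\mathcal{N}_{sur}$ its reduction resulting from Alg. 1. The minimum ACPC value that can be attained in $\mathcal{N}$ starting from any of its states is the same and we denote it $V^{*}_{\mathcal{N}}$. There exists a stationary strategy $C_{\mathcal{N}sur}$ for $\mathcal{N}_{sur}$ that attains this value regardless of the starting state in $\mathcal{N}_{sur}$. Both $V^{*}_{\mathcal{N}}$ and $C_{\mathcal{N}sur}$ can be computed as a solution to the ACPS problem for $\mathcal{N}_{sur}$. It holds that […], and from $C_{\mathcal{N}sur}$ one can construct a finite-memory strategy $C_{\mathcal{N}}^{V}$ for $\mathcal{N}$ which, regardless of the starting state in $\mathcal{N}$, attains the optimal ACPC value $V^{*}_{\mathcal{N}}$.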
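We prove the following correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$. For every $s \in S_{\mathcal{N}sur}$ and $C \in A_{sur}(s)$, it holds that $C$ is a well-defined partial stationary strategy for $\mathcal{N}$. The transition probability $P_{sur}(s, C, s')$ is the probability that in $\mathcal{N}$, when applying $C$ starting from $s$, the first surveillance cycle is completed by visiting $s'$, i.e.,
$$P_{sur}(s, C, s') = \Pr^{\mathcal{N},C}_{s}\big(\mathbf{X}(\neg S_{\mathcal{N}sur}\,\mathbf{U}\,s')\big).$$
The cost $g_{sur}(s, C)$ is the expected cumulative cost gained in $\mathcal{N}$ when applying $C$ starting from $s$ until the first surveillance cycle is completed. On the other hand, for every partial stationary strategy $C$ for $\mathcal{N}$ such that
$$\sum_{s' \in S_{\mathcal{N}sur}} \Pr^{\mathcal{N},C}_{s}\big(\mathbf{X}(\neg S_{\mathcal{N}sur}\,\mathbf{U}\,s')\big) = 1$$
for some $s \in S_{\mathcal{N}sur}$, there exists an action $C' \in A_{sur}(s)$ such that the action $C'$ corresponds to the partial stationary strategy $C$ in the above sense, i.e.,
$$P_{sur}(s, C', s') = \Pr^{\mathcal{N},C}_{s}\big(\mathbf{X}(\neg S_{\mathcal{N}sur}\,\mathbf{U}\,s')\big)$$
for every $s' \in S_{\mathcal{N}sur}$, and the cost $g_{sur}(s, C')$ is the expected cumulative cost gained in $\mathcal{N}$ when we apply $C$ starting from $s$ until we reach a state in $S_{\mathcal{N}sur}$.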
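To prove the first part of the correspondence above, we prove the following invariant of Alg. 1. Let $\mathcal{X}$, with state set $S_{\mathcal{X}}$, action sets $A_{\mathcal{X}}(\cdot)$, transition probabilities $P_{\mathcal{X}}$, and costs $g_{\mathcal{X}}$, be the MDP from the algorithm after the initialization, before the first iteration of the while cycle. It is easy to see that all actions of $\mathcal{X}$ are well-defined partial stationary strategies. For the transition probabilities, it holds that
$$P_{\mathcal{X}}(s, C, s') = \Pr^{\mathcal{N},C}_{s}\big(\mathbf{X}(\neg S_{\mathcal{X}}\,\mathbf{U}\,s')\big)$$
for every $s, s' \in S_{\mathcal{X}}$ and $C \in A_{\mathcal{X}}(s)$. The cost $g_{\mathcal{X}}(s, C)$ is the expected cumulative cost gained in $\mathcal{N}$ starting from $s$ when applying $C$ until we reach a state in $S_{\mathcal{X}}$. We show that these conditions also hold after every iteration of the while cycle.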
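Let $\mathcal{X}$ satisfy the conditions above and let $s \in S_{\mathcal{X}} \setminus S_{\mathcal{N}sur}$. By removing the state $s$ from $\mathcal{X}$, we obtain a new version $\mathcal{X}'$ of the MDP. Note that $S_{\mathcal{X}'} = S_{\mathcal{X}} \setminus \{s\}$. Let $s_{in} \in S_{\mathcal{X}'}$ be a state of $\mathcal{X}'$ and $C \in A_{\mathcal{X}'}(s_{in})$ be its action such that $P_{\mathcal{X}'}(s_{in}, C, \cdot)$ has changed in the process of removing the state $s$. The action $C$ is a well-defined partial stationary strategy because it must have been created as a union of an action of $s_{in}$ and an action of $s$, both from the previous version $\mathcal{X}$, which do not conflict on any state from $S_{\mathcal{N}}$.
Let $\psi$ denote the LTL formula $\mathbf{X}(\neg S_{\mathcal{X}'}\,\mathbf{U}\,s')$. For a state $s' \in S_{\mathcal{X}'}$, we prove that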