Optimal Receding Horizon Control for Finite Deterministic Systems
with Temporal Logic Constraints
In this paper, we develop a provably correct optimal control strategy for a finite deterministic transition system. Assuming that time-varying penalties with known probabilistic dynamics can be sensed locally at the states of the system, we derive a receding horizon strategy that minimizes the expected average cumulative penalty incurred between two consecutive satisfactions of a desired property. At the same time, we guarantee the satisfaction of correctness specifications expressed as Linear Temporal Logic formulas. We illustrate the approach with a persistent surveillance robotics application.
Temporal logics, such as Computation Tree Logic (CTL) and Linear Temporal Logic (LTL), have been customarily used to specify the correctness of computer programs and digital circuits modeled as finite-state transition systems. The problem of analyzing such a model against a temporal logic formula, known as formal analysis or model checking, has received a lot of attention during the past thirty years, and several efficient algorithms and software tools are available [2, 3, 4]. The formal synthesis problem, in which the goal is to design or control a system from a temporal logic specification, had not been studied extensively until a few years ago. Recent results include the use of model checking algorithms for controlling deterministic systems, automata games for controlling non-deterministic systems, and linear programming and value iteration for the synthesis of control policies for Markov decision processes [1, 7]. Through the use of abstractions, such techniques have also been applied to infinite systems, such as continuous and discrete-time linear systems [8, 9, 10, 11, 12].
The connection between optimal and temporal logic control is an intriguing problem with a potentially high impact in several applications. By combining these two seemingly unrelated areas, our goal is to optimize the behavior of a system subject to correctness constraints. Consider, for example, a mobile robot involved in a persistent surveillance mission in a dangerous area and under tight fuel/time constraints. The correctness requirement is expressed as a temporal logic specification, e.g., “alternately keep visiting certain regions while always avoiding others,” while the resource constraints translate to minimizing a cost function over the feasible trajectories of the robot. While optimal control is a mature discipline and formal synthesis is fairly well understood, optimal formal synthesis is a largely open area.
In this paper, we focus on finite labeled transition systems and correctness specifications given as formulas of LTL. We assume there is a penalty associated with the states of the system with a known occurrence probability and time-behavior. Motivated by persistent surveillance robotic missions, our goal is to minimize the expected average cumulative penalty incurred between two consecutive satisfactions of a desired property associated with some states of the system, while at the same time satisfying an additional temporal logic constraint. Also from robotics comes our assumption that actual penalty values can only be sensed locally in a close proximity from the current state during the execution of the system. We propose two algorithms for this problem. The first operates offline, i.e., without executions of the system, and therefore only uses the known probabilities but does not exploit actual penalties sensed during the execution. The second algorithm designs an online strategy by locally improving the offline strategy based on local sensing and simulation over a user-defined planning horizon. While both algorithms guarantee optimal expected average penalty collection, in real executions of the system, the second algorithm provides lower real average than the first algorithm. We illustrate these results on a robotic persistent surveillance case study.
This paper is closely related to [13, 14, 5], which also focused on optimal control for finite transition systems with temporal logic constraints. In , the authors developed an offline control strategy minimizing the maximum cost between two consecutive visits to a given set of states, subject to constraints expressed as LTL formulas. Time-varying, locally sensed rewards were introduced in , where a receding horizon control strategy maximizing locally collected rewards was shown to satisfy an LTL specification. This approach was generalized in  to allow for a broader class of optimization objectives and reward models. In contrast with [13, 14], we interpret the dynamic values appearing in states of the system as penalties rather than rewards, i.e., in our case the cost function is minimized rather than maximized. This permits the existence of an optimum in expected average penalty collection. In this paper, we show how this optimum can be achieved using an automata-based approach and results from game theory.
For a set S, we use S^ω and S^+ to denote the set of all infinite and all non-empty finite sequences of elements of S, respectively. For a finite or infinite sequence σ, we use σ(i) to denote the i-th element and σ^(i) for the finite prefix of σ of length i.
A weighted deterministic transition system (TS) is a tuple T = (S, R, Π, L, w), where S is a non-empty finite set of states, R ⊆ S × S is a transition relation, Π is a finite set of atomic propositions, L: S → 2^Π is a labeling function, and w is a weight function on R. We assume that for every s ∈ S there exists s' ∈ S such that (s, s') ∈ R. An initialized transition system is a TS with a distinctive initial state in S.
A run of a TS is an infinite sequence such that for every it holds . We use to denote the set of all states visited infinitely many times in the run and for the set of all runs of that start in . Let . A finite run of is a finite prefix of a run of and denotes the set of all finite runs of that start in . Let . The length , or number of stages, of a finite run is and denotes the last state of . With slight abuse of notation, we use to denote the weight of a finite run , i.e., . Moreover, denotes the minimum weight of a finite run from to . Specifically, for every and if there does not exist a run from to , then . For a set we let . We say that a state and a set is reachable from , iff and , respectively.
Every run , resp. , induces a word , resp. , over the power set of .
A cycle of the TS is a finite run of for which it holds that .
A sub-system of a is a TS , where and . We use to denote the labeling function restricted to the set . Similarly, we use with the obvious meaning. If the context is clear, we use instead of . A sub-system of is called strongly connected if for every pair of states , there exists a finite run such that . A strongly connected component (SCC) of is a maximal strongly connected sub-system of . We use to denote the set of all strongly connected components of .
Strongly connected components of a TS are pairwise disjoint. Hence, the cardinality of the set of SCCs is bounded by the number of states of the TS, and the set can be computed using Tarjan's algorithm.
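The SCC decomposition mentioned above can be sketched as follows. This is a minimal, illustrative implementation of Tarjan's algorithm over a transition system given as an adjacency map (state to set of successors); the function name and input encoding are our own, not taken from the paper.

```python
def tarjan_scc(succ):
    """Return the list of strongly connected components (as sets of states)
    of the directed graph succ: state -> iterable of successor states.
    Runs in time linear in the number of states plus transitions."""
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in succ.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in succ:
        if v not in index:
            visit(v)
    return sccs
```

For example, `tarjan_scc({'a': {'b'}, 'b': {'a', 'c'}, 'c': {'c'}})` yields the two components {a, b} and {c}.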
Let be a TS. A control strategy for is a function such that for every , it holds that .
A strategy for which , for all finite runs with , is called memoryless. In that case, is a function .
A strategy is called finite-memory if it is defined as a tuple , where is a finite set of modes, is a transition function, selects a state of to be visited next, and selects the starting mode for every .
A run induced by a strategy for is a run for which for every . For every , there is exactly one run induced by that starts in . A finite run induced by is , which is a finite prefix of a run induced by .
Let be a strategy, finite-memory or not, for a TS . For every state , the run induced by satisfies for some . We say that leads from the state to the SCC .
Linear Temporal Logic (LTL) formulas over the set Π are formed according to the following grammar:

φ ::= true | a | ¬φ | φ ∨ φ | φ ∧ φ | X φ | φ U φ | G φ | F φ,

where a ∈ Π is an atomic proposition, ¬, ∨ and ∧ are standard Boolean connectives, and X (next), U (until), G (always) and F (eventually) are temporal operators.
The semantics of LTL is defined over words over 2^Π, such as those generated by the runs of a TS (for details see, e.g., ). For example, a word satisfies G a and F a if a holds in the word always and eventually, respectively. If the word induced by a run of a TS satisfies a formula φ, we say that the run satisfies φ. We call φ satisfiable in a TS from a state if there exists a run starting in that state that satisfies φ.
Given an initialized TS and an LTL formula φ over Π, the formal synthesis problem aims to find a strategy for the TS such that the run induced by the strategy satisfies φ. In that case, we also say that the strategy satisfies φ. The formal synthesis problem can be solved using principles from model checking. Specifically, φ is translated to a Büchi automaton, and the product combining the Büchi automaton and the TS is analyzed.
A Büchi automaton (BA) is a tuple B = (Q, 2^Π, δ, q_init, F), where Q is a non-empty finite set of states, 2^Π is the alphabet, δ ⊆ Q × 2^Π × Q is a transition relation such that for every q ∈ Q and α ∈ 2^Π there exists q' ∈ Q with (q, α, q') ∈ δ, q_init ∈ Q is the initial state, and F ⊆ Q is a set of accepting states.
A run of is an infinite sequence such that for every there exists with . The word is called the word induced by the run . A run of is accepting if there exist infinitely many such that is an accepting state.
For every LTL formula φ over Π, one can construct a Büchi automaton whose accepting runs induce all and only the words over 2^Π satisfying φ. We refer readers to [17, 18] for algorithms and to online implementations such as  for translating an LTL formula to a BA.
Let be an initialized TS and be a Büchi automaton. The product of and is a tuple where , is a transition relation such that for every it holds that if and only if and , is the initial state, is a labeling function, is a set of accepting states, and is a weight function.
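The product construction can be sketched as follows. This is a minimal, illustrative version under one common convention: ((s, q), (s', q')) is a product transition iff (s, s') is a TS transition and the BA moves from q to q' on reading the label of s'. All names are our own, and the paper's exact definition may differ in which state's label drives the BA move.

```python
def build_product(ts_states, ts_trans, ts_label, ba_states, ba_trans,
                  ba_accept, ts_init, ba_init):
    """ts_trans: set of (s, s'); ts_label: s -> frozenset of propositions;
    ba_trans: set of (q, letter, q'); returns the product's states,
    transitions, accepting states, and initial state."""
    p_states = {(s, q) for s in ts_states for q in ba_states}
    p_trans = {((s, q), (s2, q2))
               for (s, s2) in ts_trans
               for (q, letter, q2) in ba_trans
               if letter == ts_label[s2]}      # BA reads the target's label
    p_accept = {(s, q) for (s, q) in p_states if q in ba_accept}
    return p_states, p_trans, p_accept, (ts_init, ba_init)
```

A product state is accepting whenever its BA component is accepting; weights and labels carry over from the TS component.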
The product can be viewed as an initialized TS with a set of accepting states. Therefore, we adopt the definitions of a run , a finite run , its weight , and sets , , and from above. Similarly, a cycle of , a strategy for and runs induced by are defined in the same way as for a TS. On the other hand, can be viewed as a weighted BA over the trivial alphabet with a labeling function, which gives us the definition of an accepting run of .
Using the projection on the first component, every run and finite run of corresponds to a run and a finite run of , respectively. Vice versa, for every run and finite run of , there exists a run and finite run . Similarly, every strategy for projects to a strategy for and for every strategy for there exists a strategy for that projects to it. The projection of a finite-memory strategy for is also finite-memory.
Since can be viewed as a TS, we also adopt the definitions of a sub-system and a strongly connected component.
Let be the product of an initialized TS and a BA . An accepting strongly connected component (ASCC) of is an SCC such that the set is non-empty and we refer to it as the set of accepting states of . We use to denote the set of all ASCCs of that are reachable from the initial state .
In our work, we always assume that is non-empty, i.e., the given LTL formula is satisfiable in the TS.
III Problem Formulation
Consider an initialized weighted transition system . The weight of a transition represents the amount of time that the transition takes and the system starts at time . We use to denote the point in time after the -th transition of a run, i.e., initially the system is in a state at time and after a finite run of length the time is .
We assume there is a dynamic penalty associated with every state. In this paper, we address the following model of penalties; nevertheless, as we discuss in Sec. V, the algorithms presented in the next section provide an optimal solution for a much broader class of penalty dynamics. The penalty is a rational number that increases by a given rate every time unit. Whenever the penalty is at its maximum value, in the next time unit it either remains at the maximum or drops to the minimum, according to a given probability distribution. Upon the visit of a state, the corresponding penalty is incurred. We assume that the visit does not affect the penalty's value or dynamics. Formally, the penalties are defined by a rate and a penalty probability function that gives, for each state, the probability that a penalty currently at its maximum remains at the maximum in the next time unit, the complementary probability being that of a drop to the minimum. The penalties are described by a function giving the penalty in a state at a given time. At time zero, the penalty in each state is a uniformly distributed random variable over the possible penalty values, and in every subsequent time unit it evolves according to the dynamics above. We use the corresponding expectation to denote the expected value of the penalty in a state; note that this expected value is well defined for every state.
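The penalty dynamics described above can be simulated as follows. This is an illustrative sketch under the assumption, consistent with the text but with the dropped constants filled in, that penalties live in [0, 1], grow by a rate `r` per time unit while below 1, and, once at 1, stay at 1 with probability `p` or drop to 0 otherwise; the function name and parameters are our own.

```python
import random

def simulate_penalty(r, p, horizon, rng=None):
    """Simulate one state's penalty over `horizon` time units.
    r: growth rate per time unit; p: probability of staying at the maximum."""
    rng = rng or random.Random(0)
    g = rng.random()                # uniformly distributed initial penalty
    trace = [g]
    for _ in range(horizon):
        if g < 1.0:
            g = min(1.0, g + r)     # deterministic growth below the maximum
        else:
            # at the maximum: stay with probability p, drop to 0 otherwise
            g = 1.0 if rng.random() < p else 0.0
        trace.append(g)
    return trace
```

Averaging many such traces at a fixed time approximates the expected penalty value used by the offline algorithm.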
In our setting, the penalties are sensed only locally in the states in close proximity from the current state. To be specific, we assume a visibility range is given. If the system is in a state at time , the penalty of a state is observable if and only if . The set is also called the set of states visible from .
The problem we consider in this paper combines the formal synthesis problem with long-term optimization of the expected amount of penalties incurred during the system’s execution. We assume that the specification is given as an LTL formula of the form
where is an LTL formula over and . This formula requires that the system satisfies and surveys the states satisfying the property infinitely often. We say that every visit of a state from the set completes a surveillance cycle. Specifically, starting from the initial state, the first visit of after the initial state completes the first surveillance cycle of a run. Note that a surveillance cycle is not a cycle in the sense of the definition of a cycle of a TS in Sec. II. For a finite run such that , denotes the number of complete surveillance cycles in , otherwise is the number of complete surveillance cycles plus one. We define a function such that is the expected average cumulative penalty per surveillance cycle (APPC) incurred under a strategy for starting from a state :
where is the run induced by starting from and denotes the expected value. In this paper, we consider the following problem:
In the next section, we propose two algorithms solving the above problem. The first algorithm operates offline, without the deployment of the system, and therefore without taking advantage of the local sensing of penalties. The second algorithm computes the strategy in real time by locally improving the offline strategy according to the penalties observed from the current state and their simulation over a user-defined planning horizon, given as a natural number of time units.
The two algorithms work with the product of the initialized TS and a Büchi automaton for the LTL formula. To project the penalties from the TS to the product, we define the penalty in a product state at a given time as the penalty in its TS component at that time. We also adopt the visibility range, and the set of all states visible from a state of the product is defined analogously. The APPC function of a strategy for the product is then defined according to Eq. (4). We use the correspondence between the strategies for the TS and the product to find a strategy for the TS that solves Problem 1. Consider a strategy for the product such that the run it induces visits the set of accepting states infinitely many times and, at the same time, its APPC value is minimal among all strategies that visit the accepting states infinitely many times. It is easy to see that this strategy projects to a strategy for the TS that solves Problem 1 with the same APPC value.
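The per-run statistic behind the APPC function can be sketched as follows: total penalty incurred along a finite run divided by the number of completed surveillance cycles, counting visits to surveillance states after the initial state, per the definition above. Names and the input encoding are illustrative.

```python
def appc(run, penalties, is_surveillance):
    """run: list of visited states; penalties: penalty incurred at each
    visit (same length as run); is_surveillance: state -> bool.
    Returns the average cumulative penalty per completed surveillance cycle."""
    # the first visit of a surveillance state after the initial state
    # completes the first surveillance cycle
    cycles = sum(1 for s in run[1:] if is_surveillance(s))
    total = sum(penalties)
    return total / cycles if cycles > 0 else float('inf')
```

The APPC value of a strategy is then the limiting expectation of this statistic along ever-longer induced runs.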
The offline algorithm leverages ideas from formal methods. Using the automata-based approach to model checking, one can construct a strategy for the product that visits at least one of the accepting states infinitely many times. On the other hand, using graph theory, we can design a strategy that achieves the minimum APPC value among all strategies that do not cause an immediate, unrepairable violation of the specification, i.e., such that the specification remains satisfiable from every state of the induced run. However, we would like a strategy satisfying both properties at the same time. To achieve that, we employ a technique from game theory presented in . Intuitively, we combine the two strategies to create a new strategy that is played in rounds, where each round consists of two phases. In the first phase, we play the first strategy until an accepting state is reached; we say that the system works to achieve the mission subgoal. The second phase applies the second strategy; its aim is to maintain the expected average cumulative penalty per surveillance cycle in the current round, and we refer to it as the average subgoal. The number of steps for which we apply the second strategy is computed individually every time we enter the second phase of a round.
The online algorithm constructs a strategy by locally improving the strategy computed by the offline algorithm. Intuitively, we compare applying for several steps to reach a specific state or set of states of , to executing different local paths to reach the same state or set. We consider a finite set of finite runs leading to the state, or set, containing the finite run induced by , choose the one that is expected to minimize the average cumulative penalty per surveillance cycle incurred in the current round and apply the first transition of the chosen run. The process continues until the state, or set, is reached, and then it starts over again.
IV-A Probability measure
Let a strategy for the product and a state of it be given. For a finite run induced by the strategy starting from the state, paired with a penalty sequence of the same length, we call the pair a finite pair. Analogously, an infinite pair consists of the run induced by the strategy and an infinite penalty sequence. A cylinder set of a finite pair is the set of all infinite pairs that extend it.
Consider the σ-algebra generated by the set of cylinder sets of all finite pairs, where the run is a finite run induced by the strategy starting from the state and the penalty sequence is of matching length. By classical results in probability theory, there exists a unique probability measure on the σ-algebra such that for a finite pair
is the probability that the penalties incurred in the first stages when applying the strategy in from the state are given by the sequence , i.e.,
for every stage. This probability is given by the penalty dynamics and can therefore be computed from the rate and the penalty probability function. For a set of infinite pairs that is an element of the above σ-algebra, its probability is the probability that, under the strategy starting from the given state, the infinite sequence of penalties received in the visited states yields an infinite pair belonging to the set.
IV-B Offline control
In this section, we construct a strategy for the product that projects to a strategy for the TS solving Problem 1. The strategy has to visit the accepting states infinitely many times and therefore must lead from the initial state to an ASCC. For an ASCC, we consider the minimum expected average cumulative penalty per surveillance cycle that can be achieved within it. Since an ASCC is strongly connected, this value is the same for all its states, and we refer to it as the value of the ASCC. It is associated with a cycle of the ASCC witnessing the value, i.e.,
where is the set of all states of labeled with . Since is an ASCC, it holds .
We design an algorithm that finds this value and a corresponding cycle for an ASCC. The algorithm first reduces the ASCC to a smaller TS and then applies Karp's algorithm  to it. Karp's algorithm finds, for a directed graph with values on its edges, a cycle with minimum mean value per edge, also called the minimum mean cycle. The desired value and cycle are then synthesized from the minimum mean cycle.
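Karp's minimum mean cycle algorithm can be sketched as follows for a strongly connected directed graph. It tabulates, for each length k, the minimum weight of an edge progression of length k ending in each vertex, and recovers the minimum cycle mean from the classical max-over-k, min-over-v formula; names are illustrative.

```python
def min_mean_cycle(vertices, edges):
    """vertices: list of vertices; edges: list of (u, v, weight).
    Returns the minimum mean edge weight over all cycles of a strongly
    connected graph, in O(|V| * |E|) time."""
    n = len(vertices)
    INF = float('inf')
    # d[k][v]: minimum weight of an edge progression of length k ending in v
    d = [{v: INF for v in vertices} for _ in range(n + 1)]
    for v in vertices:
        d[0][v] = 0.0               # every vertex may serve as a start
    for k in range(1, n + 1):
        for (u, v, w) in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = INF
    for v in vertices:
        if d[n][v] < INF:
            worst = max((d[n][v] - d[k][v]) / (n - k)
                        for k in range(n) if d[k][v] < INF)
            best = min(best, worst)
    return best
```

For instance, on the two-vertex graph with edges a→b of weight 1 and b→a of weight 3, the only cycle has mean 2. The witnessing cycle itself can be recovered by additionally storing predecessors in the table.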
Let be an ASCC of . For simplicity, we use singletons such as to denote the states of in this paragraph. We construct a TS
and a function for which it holds that a transition is present in the reduced TS if and only if there exists a finite run in the ASCC between the corresponding states containing exactly one surveillance cycle, i.e., between its endpoints no state labeled with the surveillance property is visited. Moreover, among all such runs, the function stores one minimizing the expected sum of penalties received along it. The reduced TS can be constructed from the ASCC by an iterative algorithm eliminating the states to be removed one by one, in arbitrary order. The elimination of a state proceeds as follows. Consider every pair of a predecessor and a successor of the state. If the bypassing transition is not yet present, add it and define its stored run as the concatenation of the stored runs through the eliminated state. If the transition already exists, replace its stored run with the concatenation only if the concatenation has a smaller expected sum of penalties over its states; the weight of the transition is updated accordingly. Once all pairs are handled, the state and all adjacent transitions are removed. Fig. 1 demonstrates one iteration of the algorithm.
Consequently, we apply Karp's algorithm to the directed graph whose vertices and edges are those of the reduced TS, with edge values given by the associated weights. Let a minimum mean cycle of this graph be found. Then it holds
Once the APPC value and the corresponding cycle are computed for every ASCC of the product, we choose an ASCC that minimizes the APPC value; we denote this ASCC and its value accordingly.
The mission subgoal aims to reach an accepting state of the chosen ASCC. The corresponding strategy is such that, from every state that can reach the accepting set, we follow one of the finite runs with minimum weight leading to that set. That is, it is a memoryless strategy given by
The strategy for the average subgoal is given by the cycle of the ASCC . Similarly to the mission subgoal, for a state with , the strategy follows one of the finite runs with minimum weight to . For a state , it holds . If all the states of the cycle are distinct, the strategy is memoryless, otherwise it is finite-memory.
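Both subgoal strategies follow minimum-weight finite runs to a target set. One standard way to realize such a memoryless strategy, sketched here with illustrative names, is to run Dijkstra's algorithm backwards from the target over reversed edges and then, in each state, move to any successor that lies on a minimum-weight run.

```python
import heapq

def reach_strategy(states, wtrans, target):
    """states: iterable of states; wtrans: set of (u, v, weight);
    target: set of goal states. Returns (dist, choice), where dist maps a
    state to its minimum weight to the target and choice maps a state to
    the successor prescribed by the memoryless strategy."""
    rev = {}
    for (u, v, w) in wtrans:
        rev.setdefault(v, []).append((u, w))
    dist = {s: float('inf') for s in states}
    pq = []
    for t in target:
        dist[t] = 0.0
        heapq.heappush(pq, (0.0, t))
    while pq:                                  # backward Dijkstra
        d, v = heapq.heappop(pq)
        if d > dist[v]:
            continue
        for (u, w) in rev.get(v, ()):
            if d + w < dist[u]:
                dist[u] = d + w
                heapq.heappush(pq, (dist[u], u))
    choice = {}
    for (u, v, w) in wtrans:                   # pick an optimal successor
        if u not in target and dist[v] < float('inf') \
                and dist[u] == w + dist[v]:
            choice.setdefault(u, v)
    return dist, choice
```

Following `choice` from any state with finite distance traces one of the minimum-weight finite runs to the target set.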
For the strategy and every state , it holds
Equivalently, for every state and every , there exists such that if the strategy is played from the state until at least surveillance cycles are completed, then the average cumulative penalty per surveillance cycle incurred in the performed finite run is at most with probability at least .
(Sketch.) The proof is based on the fact that the product with dynamic penalties can be translated into a Markov decision process (MDP) with static penalties (see, e.g., ). The induced run corresponds to a Markov chain of the MDP. Moreover, the cycle corresponds to the minimum mean cycle of the reduced TS. Hence, the equation in the theorem is equivalent to a property of MDPs with static penalties, proved in , regarding the minimum expected penalty incurred per stage.
Assume there exists a state in which the penalty, once at its maximum, always drops in the next time unit. The dynamics of the penalty in such a state is not probabilistic, and if we visit the state infinitely many times, the expected average penalty incurred there might differ from the expected penalty value. That can cause a violation of Prop. 1.
Now we describe the strategy . It is played in rounds, where each round consists of two phases, one for each subgoal. The first round starts at the beginning of the execution of the system in the initial state of . Let be the current round. In the first phase of the round the strategy is applied until an accepting state of the ASCC is reached. We use to denote the number of steps we played the strategy in round . Once the mission subgoal is fulfilled, the average subgoal becomes the current subgoal. In this phase, we play the strategy until the number of completed surveillance cycles in the second phase of the current round is .
The strategy projects to a strategy of that solves Problem 1.
From the fact that the ASCC is reachable from the initial state and from the construction of , it follows that is reached from in finite time. In every round of the strategy , an accepting state is visited. Moreover, from Prop. 1 and the fact that , it can be shown that the average cumulative penalty per surveillance cycle incurred in the -th round is at most with probability at least . Therefore, in the limit, the run induced by satisfies the LTL specification and reaches the optimal average cumulative penalty per surveillance cycle with probability .
Note that, in general, the strategy is not finite-memory. The reason is that in the modes of the finite-memory strategy we would need to store the number of steps spent so far in the first phase and the number of the surveillance cycles in the second phase of a given round. Since is generally increasing with , we would need infinitely many modes to be able to count the number of surveillance cycles in the second phase. However, if there exists a cycle of the SCC corresponding to that contains an accepting state, then the finite-memory strategy for the average subgoal maps to a strategy of solving Problem 1, which is therefore in the worst case finite-memory as well.
The size of a BA for an LTL formula is in the worst case exponential in the size of the formula. However, the actual size of the BA is in practice often quite small. The size of the product is proportional to the product of the sizes of the TS and the BA. To compute the minimum weights between every two states of the product, we use the Floyd-Warshall algorithm, which takes time cubic in the number of product states. Tarjan's algorithm  is used to compute the set of ASCCs in time linear in the size of the product. The reduction of an ASCC and Karp's algorithm , which finds the optimal APPC value and the corresponding cycle, both take polynomial time. The main bottleneck of the algorithm is computing the number of surveillance cycles needed in the second phase of the current round according to Prop. 1. Intuitively, we need to consider the finite runs induced by the strategy from the current state that contain a given number of surveillance cycles, and compute the sum of the probabilities of every penalty sequence for which the average cumulative penalty per surveillance cycle is less than or equal to the required bound. If the total probability is high enough, we keep the number; otherwise we increase it and repeat the process. The number of penalty sequences to examine grows exponentially with the number of stages. To partially overcome this issue, we compute the number only at the point in time when the prescribed number of surveillance cycles in the second phase of the current round has been completed and the average cumulative penalty in this round is still above the bound. As the simulation results in Sec. VI show, this happens only rarely, if ever.
IV-C Online control
The online algorithm locally improves the strategy according to the values of penalties observed from the current state and their simulation in the next time units. The resulting strategy is again played in rounds. However, in each step of the strategy , we consider a finite set of finite runs starting from the current state, choose one according to an optimization function, and apply its first transition.
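One step of this receding horizon scheme can be sketched as follows: score each candidate finite run by the (expected or simulated) penalty it collects per completed surveillance cycle, and commit only to the first transition of the best candidate. The helpers `expected_penalty` and `is_surveillance` are assumed, user-supplied models, and all names are illustrative.

```python
def online_step(candidates, expected_penalty, is_surveillance):
    """candidates: non-empty list of finite runs (lists of states), all
    starting in the current state; expected_penalty(s, t): expected penalty
    incurred in state s when visited t steps from now; is_surveillance:
    state -> bool. Returns the next state to move to."""
    def score(run):
        # expected penalty collected along the run, per surveillance cycle
        total = sum(expected_penalty(s, t) for t, s in enumerate(run[1:], 1))
        cycles = sum(1 for s in run[1:] if is_surveillance(s))
        return total / cycles if cycles else float('inf')
    best = min(candidates, key=score)
    return best[1]              # first transition of the chosen run
```

Repeating this step until the checkpoint is reached, and then starting over, yields the receding horizon behavior described above.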
Throughout the rest of the section we use the following notation. We use singletons such as to denote the states of