Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning
We present a reinforcement learning (RL) framework to synthesize a control policy from a given linear temporal logic (LTL) specification in an unknown stochastic environment that can be modeled as a Markov Decision Process (MDP). Specifically, we learn a policy that maximizes the probability of satisfying the LTL formula without learning the transition probabilities. We introduce a novel rewarding and path-dependent discounting mechanism based on the LTL formula such that (i) an optimal policy maximizing the total discounted reward effectively maximizes the probabilities of satisfying LTL objectives, and (ii) a model-free RL algorithm using these rewards and discount factors is guaranteed to converge to such a policy. Finally, we illustrate the applicability of our RL-based synthesis approach on two motion planning case studies.
Formal logics have been used to facilitate robot motion planning beyond its traditional focus on computing robot trajectories that, starting from an initial region, reach a desired goal without hitting any obstacles (e.g., [karaman2011sampling, karaman2011anytime]). Linear Temporal Logic (LTL) is a widely used framework for formal specification of high-level robotic tasks on discrete models. Thus, control synthesis on discrete-transition systems for LTL objectives has attracted a lot of attention (e.g., [vasile2013, smith2011, chen2012, kantaros2017, wolff2014]).
Another line of work considers motion planning for LTL objectives for systems that exhibit uncertainty coming from either robot dynamics or the environment, such as Markov Decision Processes (MDPs) [guo2018, guo2015multi, kantaros2019, lahijanian2012, wolff2012, kwiatkowska2013, ding2014]. One of the reasons for the focus on synthesizing control for an MDP, from a given LTL objective, is that by construction the obtained controller maximizes the probability of satisfying the specification. Furthermore, tools from probabilistic model checking [baier2008] can be directly used for synthesis. Yet, when the MDP transition probabilities are not known a priori, the control policy needs to be synthesized through learning from samples.
Accordingly, there is a recent focus on learning for control (i.e., motion planning) synthesis for LTL objectives (e.g., [fu2014, sadigh2014, li2017, li2018policy, brazdil2014]). Most model-based reinforcement learning (RL) methods are based on detection of end components, and provide estimates of satisfaction probabilities with probably approximately correct bounds (e.g., [fu2014, brazdil2014]). These approaches, however, need to first learn and store the MDP transition probabilities, and thus mostly have significant space requirements, restricting their use to systems with small and low-dimensional state spaces.
On the other hand, model-free RL methods derive the desired policies without storing a model of the MDP. The temporal logic tasks need to be represented by a reward function, possibly with finite memory, so that the optimal policy maximizing the discounted future reward also maximizes the probability of satisfying the tasks. One approach is to use time-bounded temporal logic specifications that can be directly translated to a real-valued reward function (e.g., [aksaray2016, li2017]). Alternatively, unbounded LTL formulas can be transformed into an $\omega$-automaton and the accepting condition of the automaton can be used to design the reward function.
Such reward functions based on Rabin conditions were first introduced in [sadigh2014], as part of a model-based RL method; the approach assigns a sufficiently small negative and a positive reward to the first and second sets of the Rabin pairs, respectively. A generalization of this method to deep Q-learning with a new optimization algorithm is given in [gao2019]. However, in the presence of rejecting end components and multiple Rabin pairs, optimal policies may not satisfy the LTL property almost surely, even if such a policy exists [hahn2019].
A given LTL property can also be translated into a limit-deterministic Büchi automaton (LDBA), which can be used in quantitative analysis of MDPs [hahn2015, sickert2016]. On the other hand, the problem of satisfying the Büchi condition of an LDBA can be reduced to a reachability problem by adding transitions with a positive reward from accepting states to a terminal state [hahn2019]. As the probabilities of these transitions go to zero, the probability of reaching the terminal state should capture the probability of satisfying the corresponding Büchi condition. However, model-free RL algorithms such as Q-learning may fail to converge to the correct reachability probabilities without discounting (or improper discounting) in the presence of end components [brazdil2014].
Consequently, in this paper, we propose a model-free RL algorithm that is guaranteed to find a control policy that maximizes the probability of satisfying a given LTL objective (i.e., specification) in an arbitrary unknown MDP; for the MDP, not even which probabilities are nonzero (i.e., its graph/topology) is known. We use an automata-based approach that constructs a product MDP using an LDBA of a given LTL formula and assigns rewards based on the Büchi (repeated reachability) acceptance condition. Such an optimal policy can then be derived by learning a policy maximizing the satisfaction probability of the Büchi condition on the product. Unlike [hahn2019], our approach directly assigns positive rewards to the accepting states and discounts these rewards in such a way that the values of the optimal policy are proved to converge to the maximal satisfaction probabilities as the discount factor goes beyond a threshold that is less than 1.
The rest of the paper is organized as follows. We introduce preliminaries and formalize the considered problem in Section II. In Section III, we present our model-free reinforcement learning algorithm that maximizes the probability that an LTL specification is satisfied. Finally, we evaluate our approach on two motion planning problems for mobile robots (Section IV), before concluding in Section V.
II Preliminaries and Problem Statement
In this section, we provide preliminaries on LTL, MDPs, and reinforcement learning on MDPs, and then give the problem formulation. We denote the sets of real and natural numbers by $\mathbb{R}$ and $\mathbb{N}$, respectively. For a set $S$, we denote by $S^+$ the set of all finite sequences taken from $S$.
II-A Markov Decision Processes and Reinforcement Learning
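MDPs are a common modeling formalism for systems that permit nondeterministic choices with probabilistic outcomes.
Definition 1. A (labeled) MDP is a tuple $\mathcal{M} = (S, A, P, s_0, AP, L)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P: S \times A \times S \to [0,1]$ is the transition probability function, $s_0 \in S$ is an initial state, AP is a finite set of atomic propositions, and $L: S \to 2^{AP}$ is a labeling function. For simplicity, let $A(s)$ denote the set of actions that can be taken in state $s$; then for all states $s \in S$, it holds that $\sum_{s' \in S} P(s, a, s') = 1$ if $a \in A(s)$, and $0$ otherwise.
A path is an infinite sequence of states $\sigma = s_0 s_1 s_2 \ldots$, with $s_i \in S$ such that for all $i \ge 0$, there exists $a_i \in A(s_i)$ with $P(s_i, a_i, s_{i+1}) > 0$. We use $\sigma[i]$ to denote the state $s_i$, as well as $\sigma[:i]$ and $\sigma[i+1:]$ to denote the prefix $s_0 s_1 \ldots s_i$ and the suffix $s_{i+1} s_{i+2} \ldots$ of the path, respectively.
A policy for an MDP $\mathcal{M}$ is a function $\pi: S^+ \to A$ such that $\pi(\sigma[:n]) \in A(\sigma[n])$. A policy is memoryless if it only depends on the current state, i.e., $\pi(\sigma[:n]) = \pi(\sigma[n])$ for any $\sigma$, and thus can be defined as $\pi: S \to A$. A Markov chain (MC) of an MDP $\mathcal{M}$ induced by a memoryless policy $\pi$ is a tuple $\mathcal{M}_\pi = (S, P_\pi, s_0, AP, L)$, where $P_\pi(s, s') = P(s, \pi(s), s')$ for all $s, s' \in S$. A bottom strongly connected component (BSCC) of an MC is a strongly connected component with no outgoing transitions.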
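As a rough illustration of the last two notions, the sketch below builds the Markov chain induced by a memoryless policy and lists its BSCCs. The array layout `P[s, a, s']`, the function names `induced_chain` and `bsccs`, and the three-state toy MDP are assumptions made for this example only.

```python
import numpy as np

def induced_chain(P, policy):
    """Row s of the result is P[s, policy[s], :] (hypothetical array layout)."""
    return np.array([P[s, policy[s]] for s in range(P.shape[0])])

def bsccs(P_pi):
    """Return the bottom strongly connected components of a finite Markov chain."""
    n = len(P_pi)
    # reach[i, j] is True iff j is reachable from i in zero or more steps.
    reach = np.eye(n, dtype=bool) | (P_pi > 0)
    for k in range(n):  # Warshall-style transitive closure
        reach |= reach[:, [k]] & reach[[k], :]
    found = []
    for s in range(n):
        reachable = {int(j) for j in np.flatnonzero(reach[s])}
        comp = {int(j) for j in np.flatnonzero(reach[s] & reach[:, s])}
        # s lies in a BSCC iff every state reachable from s can reach s back.
        if reachable == comp and comp not in found:
            found.append(comp)
    return found

# Toy MDP: action 0 in state 0 moves to state 1 or 2 with equal probability,
# action 1 in state 0 stays put; states 1 and 2 are absorbing.
P = np.zeros((3, 2, 3))
P[0, 0] = [0.0, 0.5, 0.5]
P[0, 1] = [1.0, 0.0, 0.0]
P[1, :, 1] = 1.0
P[2, :, 2] = 1.0
chain = induced_chain(P, policy=[0, 0, 0])
components = bsccs(chain)
```

Under the policy that always picks action 0, state 0 is transient and each absorbing state forms its own BSCC.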
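Let $R: S \to \mathbb{R}$ be a reward function of the MDP $\mathcal{M}$. Then, for a discount factor $\gamma \in (0,1)$, the $N$-step return ($N \in \mathbb{N}$ or $N = \infty$) of a path $\sigma$ from time $t \in \mathbb{N}$ is
$$G_{t:N}(\sigma) := \sum_{i=0}^{N} \gamma^i R(\sigma[t+i]), \qquad G_t(\sigma) := \lim_{N \to \infty} G_{t:N}(\sigma).$$
Under a policy $\pi$, the value of a state $s$ is defined as the expected return of a path from it, i.e.,
$$v_\pi(s) := \mathbb{E}_\pi\left[ G_t(\sigma) \mid \sigma[t] = s \right] \tag{2}$$
for any fixed $t \in \mathbb{N}$ such that $Pr_\pi(\sigma[t] = s) > 0$.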
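The return can be made concrete with a few lines of code. The following is a minimal sketch assuming the standard discounted sum of rewards; the list `rewards` stands in for the rewards of the states along a path, and the function name `n_step_return` is hypothetical.

```python
def n_step_return(rewards, gamma, t=0, N=None):
    """Sum of gamma**i * rewards[t + i] for i = 0..N.

    If N is None, the sum runs to the end of the (finite) reward list,
    which truncates the infinite-horizon return.
    """
    if N is None:
        N = len(rewards) - t - 1
    return sum(gamma ** i * rewards[t + i] for i in range(N + 1))

full = n_step_return([1, 1, 1], gamma=0.5)            # 1 + 0.5 + 0.25
shifted = n_step_return([0, 2, 4], gamma=0.5, t=1, N=1)  # 2 + 0.5 * 4
```

A state value is then the expectation of this quantity over the paths generated by the policy, which sample-based methods estimate by averaging.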
For reinforcement learning (RL), the objective is to find an optimal policy $\pi_*$ for the MDP from samples, such that the expected return $v_{\pi_*}(s)$ is maximized for all $s \in S$; we denote the maximum by $v_*(s)$. Specifically, RL is model-free if $\pi_*$ is derived without explicitly estimating the transition probabilities, in contrast to model-based RL approaches; hence it scales significantly better in large applications [sutton2018].
II-B LTL and Limit-Deterministic Büchi Automata
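LTL provides a high-level language to describe the specifications of a system. LTL formulas can be constructed inductively as combinations of Boolean operators, negation ($\neg$) and conjunction ($\wedge$), and two temporal operators, next ($\bigcirc$) and until ($\mathsf{U}$), using the following syntax:
$$\varphi := \mathrm{true} \mid a \mid \varphi_1 \wedge \varphi_2 \mid \neg\varphi \mid \bigcirc\varphi \mid \varphi_1 \,\mathsf{U}\, \varphi_2, \qquad a \in AP.$$
The satisfaction of an LTL formula $\varphi$ for a path $\sigma$ of the MDP from Def. 1 (denoted by $\sigma \models \varphi$) is defined as follows: $\sigma$ satisfies an atomic proposition $a$ if the first state of the path is labeled with $a$, i.e., $a \in L(\sigma[0])$; a path $\sigma$ satisfies $\bigcirc\varphi$ if $\sigma[1:]$ satisfies the formula $\varphi$; and finally,
$$\sigma \models \varphi_1 \,\mathsf{U}\, \varphi_2 \quad \text{if} \quad \exists i.\ \sigma[i:] \models \varphi_2 \ \text{ and } \ \forall j < i.\ \sigma[j:] \models \varphi_1.$$
Other common Boolean and temporal operators are derived as follows: (or) $\varphi_1 \vee \varphi_2 := \neg(\neg\varphi_1 \wedge \neg\varphi_2)$; (implies) $\varphi_1 \to \varphi_2 := \neg\varphi_1 \vee \varphi_2$; (eventually) $\lozenge\varphi := \mathrm{true} \,\mathsf{U}\, \varphi$; and (always) $\square\varphi := \neg(\lozenge\neg\varphi)$ [baier2008].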
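Satisfaction of an LTL formula can be evaluated on a Limit-Deterministic Büchi Automaton (LDBA) that can be directly derived from the formula [hahn2015, sickert2016].
Definition 3. An LDBA is a tuple $\mathcal{A} = (Q, \Sigma, \delta, q_0, B)$, where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $\delta: Q \times (\Sigma \cup \{\varepsilon\}) \to 2^Q$ is a (partial) transition function, $q_0 \in Q$ is an initial state, and $B \subseteq Q$ is a set of accepting states, such that (i) $\delta$ is total except for the $\varepsilon$-moves, i.e., $|\delta(q, \nu)| = 1$ for all $q \in Q$, $\nu \in \Sigma$; and (ii) there exists a bipartition of the states into a deterministic and a nondeterministic part, i.e., $Q_D \cup Q_N = Q$, where
the $\varepsilon$-moves are not allowed in the deterministic part, i.e., for any $q \in Q_D$, $\delta(q, \varepsilon) = \emptyset$;
outgoing transitions from the deterministic part stay within it, i.e., for any $q \in Q_D$ and $\nu \in \Sigma$, $\delta(q, \nu) \subseteq Q_D$;
the accepting states are in the deterministic part, i.e., $B \subseteq Q_D$.
An infinite path $\rho$ is accepted by the LDBA if it satisfies the Büchi condition, i.e., $\inf(\rho) \cap B \neq \emptyset$, where $\inf(\rho)$ denotes the set of states visited by $\rho$ infinitely many times.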
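The Büchi condition can be checked mechanically on ultimately periodic words. The sketch below is an illustration under stated assumptions: the dictionary encoding of the transition function, the helper name `inf_states`, and the small hand-written automaton (intended to capture "eventually always a") are hypothetical and are not produced by any tool mentioned in the text.

```python
HAS_A, NO_A = frozenset({"a"}), frozenset()

# Deterministic moves, one per (state, letter); q0 is the nondeterministic part.
delta = {("q0", NO_A): "q0", ("q0", HAS_A): "q0",
         ("q1", HAS_A): "q1", ("q1", NO_A): "q2",
         ("q2", HAS_A): "q2", ("q2", NO_A): "q2"}
eps = {"q0": {"q1"}}          # the only epsilon-move: q0 -> q1
accepting = {"q1"}

def inf_states(delta, eps, init, prefix, cycle, jump_at=None, jump_to=None):
    """States visited infinitely often on the word prefix . cycle^omega.

    One optional epsilon-move to `jump_to` is taken just before reading
    position `jump_at` of the prefix.
    """
    q = init
    for i, letter in enumerate(prefix):
        if i == jump_at:
            if jump_to not in eps.get(q, set()):
                raise ValueError("no such epsilon-move")
            q = jump_to
        q = delta[(q, letter)]
    starts, blocks = [], []
    while q not in starts:    # repeat the cycle until a start state recurs
        starts.append(q)
        block = []
        for letter in cycle:
            q = delta[(q, letter)]
            block.append(q)
        blocks.append(block)
    first = starts.index(q)
    return {p for block in blocks[first:] for p in block}

word = ([NO_A, HAS_A], [HAS_A])   # prefix, then the cycle repeated forever
good = inf_states(delta, eps, "q0", *word, jump_at=1, jump_to="q1")
early = inf_states(delta, eps, "q0", *word, jump_at=0, jump_to="q1")
never = inf_states(delta, eps, "q0", *word)
```

Only the run that takes the epsilon-move at the right moment visits an accepting state infinitely often, which is why resolving that choice is left to the learned policy.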
II-C Problem Statement
In this work, we consider the problem of synthesizing a robot control policy in a stochastic environment such that the probability of satisfying a desired specification is maximized. The robot environment is modeled as an MDP with unknown transition probabilities (i.e., not even which probabilities are nonzero is known), and the desired objective (i.e., specification) is given by an LTL formula. Our goal is to obtain such a policy by learning the maximal probabilities that the LTL specification is satisfied; this should be achieved by directly interacting with the environment, i.e., without constructing a model of the MDP.
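For any policy $\pi$, $Pr_\pi^{\mathcal{M}}(s \models \varphi)$ denotes the probability that a path from the state $s$ satisfies the formula $\varphi$ under the policy $\pi$:
$$Pr_\pi^{\mathcal{M}}(s \models \varphi) := Pr_\pi^{\mathcal{M}}\{\sigma \mid \sigma[0] = s,\ \sigma \models \varphi\}.$$
We omit the superscript $\mathcal{M}$ when it is clear from the context. We now formally state the problem considered in this work.
Problem 1. Given an MDP $\mathcal{M} = (S, A, P, s_0, AP, L)$ where $P$ is fully unknown and an LTL specification $\varphi$, design a model-free RL algorithm that finds a finite-memory objective policy $\pi_\varphi$ that satisfies
$$Pr_{\pi_\varphi}(s \models \varphi) = Pr_{\max}(s \models \varphi),$$
where $Pr_{\max}(s \models \varphi) := \max_\pi Pr_\pi(s \models \varphi)$ for all $s \in S$.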
III Learning-Based Synthesis from LTL Specifications
In this section, we introduce a design framework to solve Problem 1. We start by exploiting the fact that LDBAs can be used to represent LTL formulas, since any LTL formula can be transformed into an LDBA [hahn2015, sickert2016]; in such an LDBA, the only nondeterministic actions are the $\varepsilon$-moves from a given set of nondeterministic states to the complement of that set (e.g., see Fig. 1(a)). Thus, we reduce the problem of satisfying a given LTL objective in an MDP to the problem of satisfying a repeated reachability (Büchi) objective in a product MDP, computed from the MDP and the obtained LDBA. We then exploit a new rewarding and path-dependent discounting mechanism that enables the use of model-free reinforcement learning to find an objective policy with strong performance guarantees (i.e., probability maximization). Specifically, we use Q-learning [sutton2018] in this work, but other reinforcement learning methods can be applied similarly. Our overall approach is captured in Algorithm 1, and we now describe each step in detail.
III-A Design of Product MDP
Given the LTL formula $\varphi$ with atomic propositions AP, the product MDP is constructed by composing $\mathcal{M}$ with an LDBA $\mathcal{A}$ with the alphabet $2^{AP}$, which can be automatically derived from $\varphi$ [hahn2015, sickert2016]. LDBAs, similarly to deterministic Rabin automata [baier2008], can be used in quantitative analysis of MDPs if they are constructed in a certain way [hahn2019].
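A product MDP $\mathcal{M}^\times = \mathcal{M} \times \mathcal{A}$ of an MDP $\mathcal{M}$ and an LDBA $\mathcal{A}$ is defined as follows: $S^\times = S \times Q$ is the set of states, $A^\times = A \cup \{\varepsilon_q \mid q \in Q\}$ is the set of actions, $P^\times: S^\times \times A^\times \times S^\times \to [0,1]$ is the transition function
$$P^\times(\langle s, q\rangle, a, \langle s', q'\rangle) = \begin{cases} P(s, a, s') & \text{if } a \in A \text{ and } \{q'\} = \delta(q, L(s)), \\ 1 & \text{if } a = \varepsilon_{q'},\ q' \in \delta(q, \varepsilon) \text{ and } s' = s, \\ 0 & \text{otherwise,} \end{cases}$$
$s_0^\times = \langle s_0, q_0\rangle$ is the initial state, and $B^\times = \{\langle s, q\rangle \in S^\times \mid q \in B\}$ is the set of accepting states. We say that a path $\sigma^\times$ of the product MDP satisfies the Büchi condition $\varphi_B$ if $\inf(\sigma^\times) \cap B^\times \neq \emptyset$.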
The nondeterministic $\varepsilon$-moves in the LDBA are represented by $\varepsilon$-actions in the product MDP. When an $\varepsilon$-action is taken, only the state of the LDBA is updated according to the corresponding $\varepsilon$-move. When an MDP action is taken, the next MDP state is determined by the transition probabilities and the LDBA makes a transition by consuming the label of the current MDP state.
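The two kinds of product transitions described above can be sketched as a single step function. This is an illustrative sketch only: the encoding of epsilon-actions as tuples `("eps", q_next)`, the sampler `mdp_step`, and the small automaton tables are assumptions, and the product is simulated on the fly rather than built explicitly.

```python
def product_step(state, action, mdp_step, label, delta, eps):
    """One transition of the product; `state` is a pair (s, q).

    An action of the form ("eps", q_next) is an epsilon-action; anything
    else is passed to the MDP sampler `mdp_step(s, action)`.
    """
    s, q = state
    if isinstance(action, tuple) and action[0] == "eps":
        q_next = action[1]
        if q_next not in eps.get(q, set()):
            raise ValueError("epsilon-move not available in state %r" % (q,))
        return (s, q_next)              # only the automaton state changes
    q_next = delta[(q, label(s))]       # consume the label of the current MDP state
    return (mdp_step(s, action), q_next)

# Hypothetical example: proposition "a" holds only in MDP state 1.
label = lambda s: frozenset({"a"}) if s == 1 else frozenset()
mdp_step = lambda s, a: 1               # deterministic stand-in for sampling from P
delta = {("q0", frozenset()): "q0", ("q0", frozenset({"a"})): "q0",
         ("q1", frozenset()): "q2", ("q1", frozenset({"a"})): "q1",
         ("q2", frozenset()): "q2", ("q2", frozenset({"a"})): "q2"}
eps = {"q0": {"q1"}}

after_move = product_step((0, "q0"), "right", mdp_step, label, delta, eps)
after_eps = product_step((1, "q0"), ("eps", "q1"), mdp_step, label, delta, eps)
```

Because the step function only needs a sampler for the MDP, it never touches the transition probabilities, which is what makes the construction usable in a model-free setting.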
To illustrate this, an example product MDP is presented in Fig. 1. In the MDP (Fig. 1(b)), states and are labeled by atomic propositions and , respectively. In the LDBA (Fig. 1(a)), for simplicity, the transitions are labeled by Boolean formulas of the atomic propositions of and or an $\varepsilon$ label, with standing for “true”; this is equivalent to labeling the transitions using sets of atomic propositions, as in Def. 3. A transition labeled by a Boolean formula is triggered upon receiving a set of atomic propositions satisfying that formula, and the transition labeled by an $\varepsilon$ label can be (but does not have to be) triggered automatically. The product MDP is shown in Fig. 1(c). To distinguish the two transitions from to and from to in Fig. 1(a), we denote them by and in Fig. 1(c), respectively.
Now, the satisfaction of the LTL objective on the original MDP is related to the satisfaction of the Büchi objective on the product MDP , as formalized below.
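Lemma 1. A memoryless policy $\pi^\times$ that maximizes the satisfaction probability of $\varphi_B$ on $\mathcal{M}^\times$ induces a finite-memory policy $\pi$ that maximizes the satisfaction probability of $\varphi$ on $\mathcal{M}$ in Problem 1.
Proof. The proof directly follows from the proof of Theorem 3 in [sickert2016]. ∎
Therefore, the behavior of the induced policy $\pi$ can be described by the policy $\pi^\times$ and the LDBA $\mathcal{A}$ derived directly from the LTL formula $\varphi$. Initially, $\mathcal{A}$ is reset to its start state $q_0$ and whenever the MDP makes a transition from $s$ to $s'$, $\mathcal{A}$ updates its current state from $q$ to the unique state in $\delta(q, L(s))$. The action to be selected in an MDP state $s$ when $\mathcal{A}$ is in a state $q$ is determined by $\pi^\times(\langle s, q\rangle)$ as follows: if $\pi^\times(\langle s, q\rangle)$ is an $\varepsilon$-action $\varepsilon_{q'}$, $\mathcal{A}$ changes its state to $q'$ and the action $\pi^\times(\langle s, q'\rangle)$ is selected; otherwise, $\pi^\times(\langle s, q\rangle)$ is selected.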
III-B Learning for Büchi Conditions Using Path-Dependent Discounted Rewards
Following Sec. III-A, we now focus on learning an objective policy that maximizes the probability of satisfying a given Büchi objective. By Lemma 1, in what follows, we assume policies are memoryless since they are sufficient for Büchi objectives. In addition, for simplicity, we omit the superscript and we write and instead of and .
We propose a model-free learning method that uses carefully crafted rewards and path-dependent discounting based on the Büchi condition such that an optimal policy maximizing the expected return is also an objective policy maximizing the satisfaction probabilities. Specifically, we define the return of a path as a function of these rewards and discount factors in such a way that the value of a state, the expected return from that state, approaches the probability of satisfying the objective as the discount factor goes to 1.
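Theorem 1. For a given MDP $\mathcal{M}$ with $B \subseteq S$, the value function $v_\pi^\gamma$ for the policy $\pi$ and the discount factor $\gamma$ satisfies
$$\lim_{\gamma \to 1^-} v_\pi^\gamma(s) = Pr_\pi(s \models \square\lozenge B)$$
for all states $s \in S$, if the return of a path is defined as
$$G_t(\sigma) := \sum_{i=0}^{\infty} R(\sigma[t+i]) \cdot \prod_{j=0}^{i-1} \Gamma(\sigma[t+j]), \tag{9}$$
where $\prod_{j=0}^{-1} := 1$, and $R: S \to [0,1)$ and $\Gamma: S \to (0,1)$ are the reward and the discount functions defined as:
$$R(s) := \begin{cases} 1 - \gamma_B & \text{if } s \in B, \\ 0 & \text{otherwise,} \end{cases} \qquad \Gamma(s) := \begin{cases} \gamma_B & \text{if } s \in B, \\ \gamma & \text{otherwise.} \end{cases}$$
Here, we set $\gamma_B = \gamma_B(\gamma)$ as a function of $\gamma$ such that
$$\lim_{\gamma \to 1^-} \frac{1 - \gamma}{1 - \gamma_B(\gamma)} = 0.$$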
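A small numerical sketch may help make the path-dependent return concrete. It assumes, for illustration, that an accepting state yields reward `1 - gamma_B` and discounts the remainder of the path by `gamma_B`, while any other state yields no reward and discounts by `gamma`; the function name `path_return` and the Boolean encoding of the path are hypothetical.

```python
def path_return(in_B, gamma, gamma_B):
    """Return of a finite path prefix; in_B[i] is True iff state i is accepting."""
    g, discount = 0.0, 1.0
    for accepting in in_B:
        if accepting:
            g += discount * (1.0 - gamma_B)   # reward collected in accepting states
            discount *= gamma_B               # accepting states discount by gamma_B
        else:
            discount *= gamma                 # other states discount by gamma
    return g

three_hits = path_return([True, True, True], gamma=0.9, gamma_B=0.5)
no_hits = path_return([False] * 100, gamma=0.9, gamma_B=0.5)
many_hits = path_return([True] * 2000, gamma=0.99999, gamma_B=0.99)
```

With these assumptions, `n` consecutive accepting visits give a return of `1 - gamma_B**n`, so a path that keeps visiting accepting states collects a return close to 1, while a path that never does collects 0.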
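Before proving Theorem 1, we develop bounds on the return $G_t(\sigma)$.
Lemma 2. For all paths $\sigma$ and $G_t(\sigma)$ from (9), it holds that
$$0 \le \gamma\, G_{t+1}(\sigma) \le G_t(\sigma) \le 1 - \gamma_B + \gamma_B\, G_{t+1}(\sigma) \le 1.$$
Proof. Since there is no negative reward, $G_t(\sigma) \ge 0$ holds. By the return definition, replacing $\gamma$ with 1 yields a larger or equal return, which constitutes the following upper bound on the return: $G_t(\sigma) \le 1 - \gamma_B^{n} \le 1$, where $n$ is the number of $B$ states visited. Return $G_t(\sigma)$, defined in (9), can be expressed recursively as
$$G_t(\sigma) = \begin{cases} 1 - \gamma_B + \gamma_B\, G_{t+1}(\sigma) & \text{if } \sigma[t] \in B, \\ \gamma\, G_{t+1}(\sigma) & \text{otherwise.} \end{cases} \tag{13}$$
Now, it immediately follows from $G_{t+1}(\sigma) \le 1$ that $\gamma\, G_{t+1}(\sigma) \le 1 - \gamma_B + \gamma_B\, G_{t+1}(\sigma)$, which combined with (13) proves the other inequalities. ∎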
Lemma 2 implies that replacing a prefix of a path with states belonging to $B$ never decreases the return of the path, and similarly, replacing it with states that do not belong to $B$ never increases the return. The result is particularly useful when we establish upper and lower bounds on the value of a state.
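To see how such returns behave in the limit, consider a hypothetical path that visits an accepting state once every $k$ steps. The worked equation below assumes (for illustration) that accepting states yield reward $1-\gamma_B$ and discount $\gamma_B$, while all other states yield reward $0$ and discount $\gamma$. The return $G$ at an accepting visit then satisfies a fixed-point equation:

```latex
\begin{align*}
G &= (1-\gamma_B) + \gamma_B\,\gamma^{k-1}\,G
\quad\Longrightarrow\quad
G = \frac{1-\gamma_B}{1-\gamma_B\,\gamma^{k-1}}
  = \frac{1}{1 + \gamma_B\,\dfrac{1-\gamma^{k-1}}{1-\gamma_B}} .
\end{align*}
```

Since $1-\gamma^{k-1} \le (k-1)(1-\gamma)$, the return tends to 1 whenever $(1-\gamma)/(1-\gamma_B) \to 0$, regardless of the period $k$; a path that never visits an accepting state has return 0.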
The next lemma shows that under a policy, the values of states in the accepting BSCCs of the induced Markov chain approach 1 in the limit; thus, it is the key to proving Theorem 1.
Let denote the set of all BSCCs of an induced Markov chain and let denote the set of states that belong to a BSCC of – i.e.,
Then, for any state in
Proof. For any fixed , let be the stopping time of first returning to the state after leaving it at ,
Then by (2), it holds that
since once a state is visited, almost surely it is visited again [baier2008]. Using that , we obtain
where ➀ holds by the Markov property, ➁ holds by Jensen’s inequality and is a constant. From (18),
We now prove Theorem 1.
Proof of Theorem 1. First, we divide the expected return of a random path from a state by whether it visits the states infinitely often:
for some fixed . Let be the stopping time of first reaching a state in after leaving at ,
where is defined as in (14). Then, it holds that
where and $m$ is a constant. Here, ➀ holds because a path that enters an accepting BSCC almost surely eventually reaches a state in it; ➁, ➂ and ➃ hold due to Lemma 2, the Markov property and Jensen’s inequality, respectively. From (20), we have
Similarly, let be the stopping time of first reaching a rejecting BSCC of after leaving at . Then
denoting the number of time steps before a rejecting BSCC
Theorem 1 suggests that the limit of the optimal state values is equal to the maximal satisfaction probabilities as $\gamma$ goes to 1; this is captured by the next corollary, whose proof follows from the definition of the optimal policies and maximal probabilities.
For all states $s \in S$ the following holds:
$$\lim_{\gamma \to 1^-} v_*^\gamma(s) = Pr_{\max}(s \models \square\lozenge B).$$
From Theorem 1 of [jaakkola1994], ensures convergence of the model-free learning to the unique solution. With , the result may converge to a non-optimal policy [brazdil2014].
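There exists a $\gamma' < 1$ such that for all $\gamma > \gamma'$ and for all states $s \in S$, the optimal policy $\pi_*$ satisfies
$$Pr_{\pi_*}(s \models \square\lozenge B) = Pr_{\max}(s \models \square\lozenge B).$$
Proof. Let be the minimum positive difference between the satisfaction probabilities of two policies: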
and let be the discount factor such that
Now, suppose a policy that maximizes the satisfaction probability is not optimal for , then the optimal value of all states must be larger than , which is not possible due to the definition of . ∎
IV Implementation and Case Studies
We implemented our RL-based synthesis framework in Python; we use Rabinizer 4 [kretinsky2018] to map LTL formulas into LDBAs, and Q-learning for the proposed path-dependent discounting rewards. The code and videos are available at [csrl2019].
We evaluated our framework on two motion planning case studies. As shown in Figs. 2 and 3, we consider two scenarios in a grid-world where a mobile robot can take four actions: up, left, down and right. The robot moves in the intended direction with probability and it can go sideways with probability ( each). If the robot hits a wall or an obstacle, it stays in the same state.
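The grid-world dynamics described above can be sketched as a function returning the next-state distribution. The probability of moving in the intended direction is left as the parameter `p_intended` because the value used in the experiments is not reproduced here; the value 0.8 in the usage lines, the `(row, column)` coordinates, and all names are assumptions made for this illustration.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

def next_state_distribution(pos, action, shape, obstacles, p_intended):
    """Map each reachable cell to its probability for one noisy move."""
    def move(direction):
        r = pos[0] + MOVES[direction][0]
        c = pos[1] + MOVES[direction][1]
        inside = 0 <= r < shape[0] and 0 <= c < shape[1]
        # Hitting a wall or an obstacle leaves the robot where it is.
        return (r, c) if inside and (r, c) not in obstacles else pos

    p_side = (1.0 - p_intended) / 2.0   # split the remainder over both sides
    outcomes = [(action, p_intended)] + [(d, p_side) for d in SIDEWAYS[action]]
    dist = {}
    for direction, p in outcomes:
        target = move(direction)
        dist[target] = dist.get(target, 0.0) + p
    return dist

corner = next_state_distribution((0, 0), "up", (3, 3), set(), p_intended=0.8)
blocked = next_state_distribution((1, 1), "right", (3, 3), {(1, 2)}, p_intended=0.8)
```

In the corner case, both the intended move and one sideways move hit a wall, so their probabilities accumulate on the current cell.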
For Q-learning, we used an $\varepsilon$-greedy policy to choose the actions, and discount factors and . The probability that a random action is taken, $\varepsilon$, and the learning rate, $\alpha$, were gradually decreased from to and then . The objective policies and estimates of the maximal probabilities were obtained using episodes.
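The learning loop can be sketched as tabular Q-learning with a state-dependent reward and discount. This is a minimal sketch, not the released implementation: it assumes reward `1 - gamma_B` and discount `gamma_B` in accepting states and reward 0 and discount `gamma` elsewhere, keeps the exploration and learning rates constant instead of decaying them, and uses hypothetical names and a three-state toy environment.

```python
import random

def q_learning(step, accepting, n_states, n_actions, start, gamma, gamma_B,
               episodes=200, horizon=50, alpha=0.5, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            if rng.random() < epsilon:                 # explore
                a = rng.randrange(n_actions)
            else:                                      # exploit
                a = max(range(n_actions), key=lambda b: Q[s][b])
            s_next = step(s, a, rng)
            # Reward and discount depend on whether the current state is accepting.
            reward = 1.0 - gamma_B if accepting(s) else 0.0
            discount = gamma_B if accepting(s) else gamma
            target = reward + discount * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

def toy_step(s, a, rng):
    """State 0 chooses between an accepting sink (1) and a rejecting sink (2)."""
    if s == 0:
        return 1 if a == 0 else 2
    return s

Q = q_learning(toy_step, lambda s: s == 1, n_states=3, n_actions=2, start=0,
               gamma=0.99, gamma_B=0.9)
```

In this toy example the learned values separate the two choices in state 0: the action leading to the accepting sink obtains a value close to `gamma`, and the other action keeps a value of 0.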
IV-A Motion Planning with Safe Absorbing States
In this example, the robot tries to reach a safe absorbing state (states or in circle), while avoiding unsafe states (states ). This is formally specified in LTL as
The LDBA computed from has states and the product-MDP has states. All episodes started in a random state and were terminated after steps.
The optimal policy obtained for an MDP is illustrated in Fig. 2(a). The shortest way to enter a safe absorbing state from is reaching via ; yet, in that case, the robot visits an unsafe state with probability 0.2. Thus, the optimal policy tries to enter one of and by choosing up in . Under this policy, the robot eventually reaches a safe absorbing state without visiting an unsafe state almost surely. Once the robot enters an absorbing state, it chooses an $\varepsilon$-action depending on the state label, and thus the LDBA transitions to an accepting state, with positive rewards.
Fig. 2(b) shows the estimates of the maximal probabilities. Note that the approximation errors in and are due to the variance of the return caused by the unsafe states. When the robot visits an unsafe state, the LDBA makes a transition to a trap state, making it impossible for the robot to receive a positive reward. Hence, the return that can be obtained from and is either 1 or 0 with probability and , respectively. In addition, guarantees on satisfaction probabilities strictly between 0 and 1 cannot be provided with existing learning-based methods for LTL specifications.
While the values from Fig. 2(a) and 2(b) were obtained from a single run over episodes, we also investigated the impact of the number of episodes. Fig. 2(c) shows the L2 norm of the errors averaged over 100 repetitions for different numbers of episodes (the error bars show the standard deviation).
IV-B Mobile Robot in Nursery Scenario
In this scenario, the robot’s objective is to repeatedly check a baby (at state ) and go back to its charger (at state ), while avoiding the danger zone (at state ). Near the baby , the only allowed action is left and when taken the following situations can happen: (i) the robot hits the wall with probability and wakes the baby up; (ii) the robot moves left with probability or moves down with probability . If the baby has been woken up, which means the robot could not leave in a single time step (represented by LTL as ), the robot should notify the adult (at state ); otherwise, the robot should directly go back to the charger (at state ). The full objective is specified in LTL as
Here the sub-formulas mean (1) avoid the danger state; (2) if the baby is left, do not return before visiting the adult or the charger; (3) after notifying the adult, leave immediately and go for the baby; (4) after leaving the baby sleeping, go for the charger and do not notify the adult; (5) after charging, return to the baby first without visiting the adult; and (6) notify the adult if the baby has woken up.
The LDBA for this specification has 47 states and the product MDP has 940 states. The episodes were terminated after steps and the robot position was reset to charging.
Fig. 3 depicts the optimal policy for the four most visited LDBA states during the simulation. The robot follows the policy in Fig. 3(a) after it leaves the charger dock . Under this policy, the robot almost surely reaches the baby in , while successfully avoiding visiting . Similarly, the policy in Fig. 3(b) is followed by the robot to go back to the charger while the baby is sleeping. If the baby is awake, the robot takes the shortest path to reach (Fig. 3(c)).
V Conclusion
In this work, we presented a model-free learning-based method to synthesize a control policy that maximizes the probability that an LTL specification is satisfied in unknown stochastic environments that can be modeled by an MDP. We first showed that synthesizing controllers from an LTL specification on the MDP can be converted to synthesizing a memoryless policy for a Büchi objective on the product MDP. Then, we designed a path-dependent discounting reward, and showed that the memoryless policy optimizing this reward also optimizes the satisfaction probability of the Büchi objective (and thus the initial LTL specification). Finally, we evaluated our synthesis method on motion planning case studies.