
# Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees

M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, I. Lee

M. Hasanbeig, A. Abate, and D. Kroening are with the Department of Computer Science, University of Oxford, UK, and are supported by the ERC project 280053 (CPROVER) and the H2020 FET OPEN 712689 SC (@cs.ox.ac.uk). Y. Kantaros, G. J. Pappas, and I. Lee are with the School of Engineering and Applied Science (SEAS), University of Pennsylvania, PA, USA, and are supported by the AFRL and DARPA under Contract No. FA8750-18-C-0090 and the ARL RCTA under Contract No. W911NF-10-2-0016 (@seas.upenn.edu).
###### Abstract

Reinforcement Learning (RL) has emerged as an efficient method of choice for solving complex sequential decision-making problems in automatic control, computer science, economics, and biology. In this paper, we present a model-free RL algorithm to synthesize control policies that maximize the probability of satisfying high-level control objectives given as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in the workspace properties, the structure of the workspace, and the agent actions, giving rise to a Probabilistically-Labeled Markov Decision Process (PL-MDP) with unknown graph structure and stochastic behaviour, which is an even more general case than a fully unknown MDP. We first translate the LTL specification into a Limit Deterministic Büchi Automaton (LDBA), which is then used in an on-the-fly product with the PL-MDP. Thereafter, we define a synchronous reward function based on the acceptance condition of the LDBA. Finally, we show that the RL algorithm delivers a policy that maximizes the satisfaction probability asymptotically. We provide experimental results that showcase the efficiency of the proposed method.

## I Introduction

Temporal logic has been promoted as a formal task-specification language for control synthesis in Markov Decision Processes (MDPs) due to its expressive power: it can handle a richer class of tasks than classical point-to-point navigation. Such rich specifications include safety and liveness requirements, sequential tasks, coverage, and temporal ordering of different objectives [1, 2, 3, 4, 5]. Control synthesis for MDPs under Linear Temporal Logic (LTL) specifications has also been studied in [6, 7, 8, 9, 10]. Common to these works is that, in order to synthesize policies that maximize the satisfaction probability, exact knowledge of the MDP is required. Specifically, these methods construct a product MDP by composing the MDP that captures the underlying dynamics with a Deterministic Rabin Automaton (DRA) that represents the LTL specification. Then, given the product MDP, probabilistic model checking techniques are employed to design optimal control policies [11, 12].

In this paper, we address the problem of designing optimal control policies for MDPs with unknown stochastic behaviour so that the generated traces satisfy a given LTL specification with maximum probability. Unlike previous work, uncertainty is considered both in the environment properties and in the agent actions, giving rise to a Probabilistically-Labeled MDP (PL-MDP). This model further extends MDPs to provide a way to consider dynamic and uncertain environments. In order to solve this problem, we first convert the LTL formula into a Limit Deterministic Büchi Automaton (LDBA) [13]. It is known that this construction results in an exponential-sized automaton for LTL∖GU, and in nearly the same size as a DRA for the rest of LTL. LTL∖GU is a fragment of linear temporal logic with the restriction that no until operator occurs in the scope of an always operator. By contrast, the DRAs that are typically employed in related work are doubly exponential in the size of the original LTL formula [14]. Furthermore, a Büchi automaton is semantically simpler than a Rabin automaton in terms of its acceptance conditions [15, 10], which makes our algorithm much easier to implement. Once the LDBA is generated from the given LTL property, we construct on-the-fly a product between the PL-MDP and the resulting LDBA, and we define a synchronous reward function based on the acceptance condition of the Büchi automaton over the state-action pairs of the product. Using this algorithmic reward-shaping procedure, a model-free RL algorithm is introduced that generates a policy maximizing the expected accumulated reward. Finally, we show that maximizing the expected accumulated reward entails maximizing the satisfaction probability.

Related work – A model-based RL algorithm to design policies that maximize the satisfaction probability is proposed in [16, 17]. Specifically, [16] assumes that the given MDP model has unknown transition probabilities and builds a Probably Approximately Correct MDP (PAC MDP), which is composed with the DRA that expresses the LTL property. The overall goal is to calculate the finite-horizon value function for each state, such that the obtained value is within an error bound from the probability of satisfying the given LTL property. The PAC MDP is generated via an RL-like algorithm, and then value iteration is applied to update the state values. A similar model-based solution is proposed in [18]: this also hinges on approximating the transition probabilities, which limits the precision of the policy generation process. Unlike the problem that is considered in this paper, the work in [18] is limited to policies whose traces satisfy the property with probability one. Moreover, [16, 17, 18] require learning all transition probabilities of the MDP. As a result, they need a significant amount of memory to store the learned model [19]. This specific issue is addressed in [20], which proposes an actor-critic method for LTL specifications that requires the graph structure of the MDP, but not all transition probabilities. The structure of the MDP allows for the computation of Accepting Maximum End Components (AMECs) in the product MDP, while transition probabilities are generated only when needed by a simulator. By contrast, the proposed method does not require knowledge of the structure of the MDP and does not rely on computing AMECs of a product MDP. A model-free and AMEC-free RL algorithm for LTL planning is also proposed in [21]. Nevertheless, unlike our proposed method, all these cognate contributions rely on the LTL-to-DRA conversion, and uncertainty is considered only in the agent actions, but not in the workspace properties.

The works [22] and [23] address safety-critical settings in RL, in which the agent has to deal with a heterogeneous set of MDPs in the context of cyber-physical systems. [24] further employs DDL [25], a first-order multi-modal logic for specifying and proving properties of hybrid programs.

The first use of LDBAs for LTL-constrained policy synthesis in a model-free RL setup appears in [26, 27]. Specifically, [27] proposes a hybrid neural network architecture combined with LDBAs to handle MDPs with continuous state spaces. The work in [26] has been taken up more recently by [28], which has focused on model-free aspects of the algorithm and has employed a different LDBA structure and reward, which introduce extra states in the product MDP. The authors also do not discuss the complexity of the automaton construction with respect to the size of the formula; given that the resulting automaton is not a generalised Büchi automaton, it can be expected that the density of the automaton acceptance condition is quite low, which might result in a state-space explosion, particularly if the LTL formula is complex. As we show with the counterexample in Appendix E, the authors have indeed overlooked that our algorithm is episodic and allows the discount factor to be equal to one. Unlike [26, 27, 28], in this work we consider uncertainty in the workspace properties by employing PL-MDPs.

Summary of contributions – First, we propose a model-free RL algorithm to synthesize control policies for unknown PL-MDPs that maximize the probability of satisfying LTL specifications. Second, we define a synchronous reward function and show that maximizing the accumulated reward maximizes the satisfaction probability. Third, we convert the LTL specification into an LDBA which, as a result, shrinks the state-space that needs to be explored compared to relevant LTL-to-DRA-based works in finite-state MDPs. Moreover, unlike previous works, our proposed method does not require the computation of AMECs of a product MDP, which avoids the quadratic time complexity of such a computation in the size of the product MDP [11, 12].

## II Problem Formulation

Consider a robot that resides in a partitioned environment with a finite number of states. To capture uncertainty in both the robot motion and the workspace properties, we model the interaction of the robot with the environment as a PL-MDP, which is defined as follows.

###### Definition II.1 (Probabilistically-Labeled MDP [9])

A PL-MDP is a tuple $\mathcal{M} = (\mathcal{X}, x_0, \mathcal{A}, P_C, \mathcal{AP}, P_L)$, where $\mathcal{X}$ is a finite set of states; $x_0 \in \mathcal{X}$ is the initial state; $\mathcal{A}$ is a finite set of actions (with slight abuse of notation, $\mathcal{A}(x)$ denotes the set of available actions at state $x \in \mathcal{X}$); $P_C : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0, 1]$ is the transition probability function, so that $P_C(x, a, x')$ is the transition probability from state $x$ to state $x'$ via control action $a \in \mathcal{A}(x)$, with $\sum_{x' \in \mathcal{X}} P_C(x, a, x') = 1$ for all $a \in \mathcal{A}(x)$; $\mathcal{AP}$ is a set of atomic propositions; and $P_L : \mathcal{X} \times 2^{\mathcal{AP}} \to [0, 1]$ specifies the associated probability. Specifically, $P_L(x, \ell)$ denotes the probability that $\ell \in 2^{\mathcal{AP}}$ is observed at state $x$, where $\sum_{\ell \in 2^{\mathcal{AP}}} P_L(x, \ell) = 1$ for all $x \in \mathcal{X}$.

The probabilistic label map $P_L$ provides a means to model dynamic and uncertain environments. Hereafter, we assume that the PL-MDP $\mathcal{M}$ is fully observable, i.e., at any time/stage $t$ the current state, denoted by $x_t$, and the observations in state $x_t$, denoted by $\ell_t \in 2^{\mathcal{AP}}$, are known.

At any stage $T \geq 0$ we define the robot’s past path as $X_T = x_0 x_1 \cdots x_T$, the past sequence of observed labels as $L_T = \ell_0 \ell_1 \cdots \ell_T$, where $\ell_t \in 2^{\mathcal{AP}}$, and the past sequence of control actions as $A_T = a_0 a_1 \cdots a_{T-1}$, where $a_t \in \mathcal{A}(x_t)$. These three sequences can be composed into a complete past run, defined as $R_T = x_0 \ell_0 a_0\, x_1 \ell_1 a_1 \cdots x_T \ell_T$. We denote by $\mathcal{X}_T$, $\mathcal{L}_T$, and $\mathcal{R}_T$ the sets of all possible sequences $X_T$, $L_T$, and $R_T$, respectively.

The goal of the robot is to accomplish a task expressed as an LTL formula. LTL is a formal language that comprises a set of atomic propositions $\mathcal{AP}$, the Boolean operators, i.e., conjunction $\wedge$ and negation $\neg$, and two temporal operators, next $\bigcirc$ and until $\cup$. LTL formulas over a set $\mathcal{AP}$ can be constructed based on the following grammar:

$$\phi ::= \text{true} ~|~ \pi ~|~ \phi_1 \wedge \phi_2 ~|~ \neg \phi ~|~ \bigcirc \phi ~|~ \phi_1 \cup \phi_2,$$

where $\pi \in \mathcal{AP}$. The other Boolean and temporal operators, e.g., always $\Box$, have their standard syntax and meaning. An infinite word $w$ over the alphabet $2^{\mathcal{AP}}$ is defined as an infinite sequence $w = \pi_0 \pi_1 \pi_2 \cdots \in (2^{\mathcal{AP}})^{\omega}$, where $\omega$ denotes infinite repetition and $\pi_t \in 2^{\mathcal{AP}}$ for all $t$. The language $\{w \in (2^{\mathcal{AP}})^{\omega} : w \models \phi\}$ is defined as the set of words that satisfy the LTL formula $\phi$, where $\models$ is the satisfaction relation [29].
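To make the grammar concrete, the following sketch (illustrative Python, not part of the original paper; all names are ours) encodes formulas as nested tuples and evaluates them over a finite prefix of a word. Genuine LTL semantics is over infinite words, so the `Next`/`Until` clauses below are only a finite-trace approximation: obligations that fall past the end of the prefix are treated as unmet.

```python
# Core grammar constructors: formulas are nested tuples.
def Atom(p):       return ('pi', p)
def And(f, g):     return ('and', f, g)
def Not(f):        return ('not', f)
def Next(f):       return ('next', f)
def Until(f, g):   return ('until', f, g)

# Derived operators, defined from the core grammar as usual.
def Or(f, g):      return Not(And(Not(f), Not(g)))
def Eventually(f): return Until(('true',), f)
def Always(f):     return Not(Eventually(Not(f)))

def holds(f, w, i=0):
    """Finite-trace approximation of satisfaction: w is a list of label sets,
    and Next/Until obligations past the end of w are treated as violated."""
    op = f[0]
    if op == 'true':  return True
    if op == 'pi':    return f[1] in w[i]
    if op == 'and':   return holds(f[1], w, i) and holds(f[2], w, i)
    if op == 'not':   return not holds(f[1], w, i)
    if op == 'next':  return i + 1 < len(w) and holds(f[1], w, i + 1)
    if op == 'until':
        return any(holds(f[2], w, k) and
                   all(holds(f[1], w, j) for j in range(i, k))
                   for k in range(i, len(w)))
```

For example, on the word prefix `[{'a'}, {'a','b'}, {'b'}]`, the formula `Until(Atom('a'), Atom('b'))` holds, while `Always(Atom('a'))` does not.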

In what follows, we define the probability that a stationary policy for $\mathcal{M}$ satisfies the assigned LTL specification. Specifically, a stationary policy $\xi$ for $\mathcal{M}$ is defined as a map $\xi : \mathcal{X} \times \mathcal{A} \to [0, 1]$, where $\xi(x, a)$ denotes the probability that action $a \in \mathcal{A}(x)$ is selected at state $x$. Given a stationary policy $\xi$, the probability measure $P^{\xi}_{\mathcal{M}}$, defined on the smallest $\sigma$-algebra over $\mathcal{R}_{\infty}$, is the unique measure induced by the transition probabilities $P_C$, the label probabilities $P_L$, and the policy $\xi$ [11, 30]. We then define the probability of satisfying $\phi$ under policy $\xi$ as [11, 12]

$$P^{\xi}_{\mathcal{M}}(\phi) = P^{\xi}_{\mathcal{M}}\left(\mathcal{R}_{\infty} : \mathcal{L}_{\infty} \models \phi\right). \quad (1)$$

The problem we address in this paper is summarized as follows.

###### Problem 1

Given a PL-MDP with unknown transition probabilities, unknown label mapping, unknown underlying graph structure, and a task specification captured by an LTL formula $\phi$, synthesize a deterministic stationary control policy $\xi^*$ that maximizes the probability of satisfying $\phi$ captured in (1), i.e., $\xi^* = \arg\max_{\xi} P^{\xi}_{\mathcal{M}}(\phi)$.¹

¹The fact that the graph structure is unknown implies that we do not know which transition probabilities are equal to zero. As a result, relevant approaches that require the structure of the MDP, e.g., [20], cannot be applied.

## III A New Learning-for-Planning Algorithm

In this section, we first discuss how to translate the LTL formula $\phi$ into an LDBA (see Section III-A). Then, we define the product MDP $\mathcal{P} = \mathcal{M} \times \mathfrak{A}$, constructed by composing the PL-MDP $\mathcal{M}$ and the LDBA $\mathfrak{A}$ that expresses $\phi$ (see Section III-B). Next, we assign rewards to the product MDP transitions based on the accepting condition of the LDBA $\mathfrak{A}$. As we show later, this allows us to synthesize a policy for $\mathcal{P}$ that maximizes the probability of satisfying the accepting condition of the LDBA. The projection of the obtained policy onto $\mathcal{M}$ results in a policy that solves Problem 1 (Section III-C).

### III-A Translating LTL into an LDBA

An LTL formula can be translated into an automaton, namely a finite-state machine that can express the set of words that satisfy . Conventional probabilistic model checking methods translate LTL specifications into DRAs, which are then composed with the PL-MDP, giving rise to a product MDP. Nevertheless, it is known that this conversion results, in the worst case, in automata that are doubly exponential in the size of the original LTL formula [14]. By contrast, in this paper we propose to express the given LTL property as an LDBA, which results in a much more succinct automaton [13, 15]. This is the key to the reduction of the state-space that needs to be explored; see also Section V.

Before defining the LDBA, we first need to define the Generalized Büchi Automaton (GBA).

###### Definition III.1 (Generalized Büchi Automaton [11])

A GBA is a structure $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \delta)$, where $\mathcal{Q}$ is a finite set of states, $q_0 \in \mathcal{Q}$ is the initial state, $\Sigma = 2^{\mathcal{AP}}$ is a finite alphabet, $\mathcal{F} = \{\mathcal{F}_1, \dots, \mathcal{F}_f\}$ is the set of accepting conditions, where $\mathcal{F}_k \subseteq \mathcal{Q}$ for $1 \leq k \leq f$, and $\delta : \mathcal{Q} \times \Sigma \to 2^{\mathcal{Q}}$ is a transition relation.

An infinite run $\rho$ of $\mathfrak{A}$ over an infinite word $w = \pi_0 \pi_1 \pi_2 \cdots$, $\pi_t \in \Sigma$, is an infinite sequence of states $q_t \in \mathcal{Q}$, i.e., $\rho = q_0 q_1 q_2 \cdots$, such that $q_{t+1} \in \delta(q_t, \pi_t)$. The infinite run $\rho$ is called accepting (and the respective word $w$ is accepted by the GBA) if $\mathrm{Inf}(\rho) \cap \mathcal{F}_k \neq \emptyset$ for all $k \in \{1, \dots, f\}$, where $\mathrm{Inf}(\rho)$ is the set of states that are visited infinitely often by $\rho$.

###### Definition III.2 (Limit Deterministic Büchi Automaton [13])

A GBA $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \delta)$ is limit deterministic if $\mathcal{Q}$ can be partitioned into two disjoint sets, $\mathcal{Q} = \mathcal{Q}_N \cup \mathcal{Q}_D$, so that (i) $\delta(q, \pi) \subseteq \mathcal{Q}_D$ and $|\delta(q, \pi)| = 1$ for every state $q \in \mathcal{Q}_D$ and $\pi \in \Sigma$; and (ii) for every $\mathcal{F}_k \in \mathcal{F}$, it holds that $\mathcal{F}_k \subseteq \mathcal{Q}_D$, and there are $\varepsilon$-transitions from $\mathcal{Q}_N$ to $\mathcal{Q}_D$.

An $\varepsilon$-transition allows the automaton to change its state without reading any specific input. In practice, the $\varepsilon$-transitions between $\mathcal{Q}_N$ and $\mathcal{Q}_D$ reflect the “guess” on reaching $\mathcal{Q}_D$: accordingly, if after an $\varepsilon$-transition the associated labels in the accepting set of the automaton cannot be read, or if the accepting states cannot be visited, then the guess is deemed to be wrong, and the trace is disregarded and is not accepted by the automaton. However, if the trace is accepting, then it stays in $\mathcal{Q}_D$ ever after, i.e., $\mathcal{Q}_D$ is invariant.

###### Definition III.3 (Non-accepting Sink Component)

A non-accepting sink component in an LDBA $\mathfrak{A}$ is a directed graph induced by a set of states $\mathcal{Q}_{sink} \subseteq \mathcal{Q}$ such that (1) the graph is strongly connected, (2) it does not include all of the accepting sets $\mathcal{F}_k$, $k = 1, \dots, f$, and (3) there exists no other strongly connected set $\mathcal{Q}' \subseteq \mathcal{Q}$, $\mathcal{Q}' \neq \mathcal{Q}_{sink}$, such that $\mathcal{Q}_{sink} \subseteq \mathcal{Q}'$. We denote the union of all non-accepting sink components as $\mathcal{Q}_{sinks}$.
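As a concrete illustration of the generalized Büchi acceptance condition (our own sketch in Python, not taken from the paper), the following encodes a small deterministic GBA for the formula $\Box\Diamond a \wedge \Box\Diamond b$, with two accepting sets $\mathcal{F}_1$ and $\mathcal{F}_2$, and checks acceptance of an ultimately periodic word `prefix · loopω` by simulating the loop until the boundary state repeats. The state names and the alternation trick in `delta` are our own modeling choices.

```python
# Deterministic GBA for "a and b both hold infinitely often".
# States s_a / s_b record which proposition was just read; s_0 is neutral.
INIT = 's_0'
ACCEPTING = [{'s_a'}, {'s_b'}]   # F_1 = {s_a}, F_2 = {s_b}

def delta(q, label):
    """Deterministic transition relation over the alphabet 2^{a, b}."""
    if 'a' in label and 'b' in label:
        return 's_b' if q == 's_a' else 's_a'   # alternate so both sets get hit
    if 'a' in label:
        return 's_a'
    if 'b' in label:
        return 's_b'
    return 's_0'

def accepts(prefix, loop):
    """GBA acceptance on the ultimately periodic word prefix . loop^omega:
    every accepting set F_k must intersect the states visited infinitely often.
    Assumes a non-empty loop."""
    q = INIT
    for label in prefix:
        q = delta(q, label)
    # Iterate the loop until the state at the loop boundary repeats.
    seen = set()
    while q not in seen:
        seen.add(q)
        for label in loop:
            q = delta(q, label)
    # States traversed while cycling from q back to q are visited infinitely often.
    inf_states, start = set(), q
    while True:
        for label in loop:
            q = delta(q, label)
            inf_states.add(q)
        if q == start:
            break
    return all(inf_states & F for F in ACCEPTING)
```

For instance, the word $(\{a\}\{b\})^\omega$ is accepted, while $(\{a\})^\omega$ is rejected because $\mathcal{F}_2$ is never visited.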

### III-B Product MDP

Given the PL-MDP $\mathcal{M}$ and the LDBA $\mathfrak{A}$, we define the product MDP $\mathcal{P} = \mathcal{M} \times \mathfrak{A}$ as follows.

###### Definition III.4 (Product MDP)

Given a PL-MDP $\mathcal{M} = (\mathcal{X}, x_0, \mathcal{A}, P_C, \mathcal{AP}, P_L)$ and an LDBA $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \delta)$, we define the product MDP $\mathcal{P} = \mathcal{M} \times \mathfrak{A}$ as $\mathcal{P} = (\mathcal{S}, s_0, \mathcal{A}_{\mathcal{P}}, P_{\mathcal{P}}, \mathcal{F}_{\mathcal{P}})$, where (i) $\mathcal{S} = \mathcal{X} \times 2^{\mathcal{AP}} \times \mathcal{Q}$ is the set of states, so that $s = [x, \ell, q] \in \mathcal{S}$ with $x \in \mathcal{X}$, $\ell \in 2^{\mathcal{AP}}$, and $q \in \mathcal{Q}$; (ii) $s_0 = [x_0, \ell_0, q_0]$ is the initial state; (iii) $\mathcal{A}_{\mathcal{P}}$ is the set of actions inherited from the MDP, so that $\mathcal{A}_{\mathcal{P}}(s) = \mathcal{A}(x)$, where $s = [x, \ell, q]$; (iv) $P_{\mathcal{P}} : \mathcal{S} \times \mathcal{A}_{\mathcal{P}} \times \mathcal{S} \to [0, 1]$ is the transition probability function, so that

$$P_{\mathcal{P}}([x, \ell, q], a, [x', \ell', q']) = P_C(x, a, x')\, P_L(x', \ell'), \quad (2)$$

where $a \in \mathcal{A}(x)$, $\ell' \in 2^{\mathcal{AP}}$, and $q' = \delta(q, \ell)$; (v) $\mathcal{F}_{\mathcal{P}} = \{\mathcal{F}^1_{\mathcal{P}}, \dots, \mathcal{F}^f_{\mathcal{P}}\}$ is the set of accepting states, where $\mathcal{F}^k_{\mathcal{P}} = \mathcal{X} \times 2^{\mathcal{AP}} \times \mathcal{F}_k$. In order to handle $\varepsilon$-transitions in the constructed LDBA, we add the following modifications to the standard definition of the product MDP [15]. First, for every $\varepsilon$-transition to a state $q' \in \mathcal{Q}$, we add a corresponding action $\varepsilon_{q'}$ in the product MDP, i.e., $\mathcal{A}_{\mathcal{P}} := \mathcal{A}_{\mathcal{P}} \cup \{\varepsilon_{q'}\}$. Second, the transition probabilities of $\varepsilon$-transitions are given by

$$P_{\mathcal{P}}(s, \varepsilon_{q'}, s') = \begin{cases} 1, & \text{if } (x = x') \wedge (\ell = \ell') \wedge (\delta(q, \varepsilon_{q'}) = q'), \\ 0, & \text{otherwise}, \end{cases} \quad (3)$$

where $s = [x, \ell, q]$ and $s' = [x', \ell', q']$.

Given any policy $\boldsymbol{\mu}$ for $\mathcal{P}$, we define an infinite run $\rho$ of $\mathcal{P}$ to be an infinite sequence of states of $\mathcal{P}$, i.e., $\rho = s_0 s_1 s_2 \cdots$, where $P_{\mathcal{P}}(s_t, \mu(s_t), s_{t+1}) > 0$. By the definition of the accepting condition of the LDBA $\mathfrak{A}$, an infinite run $\rho$ is accepting, i.e., it satisfies $\phi$ with non-zero probability (denoted by $\rho \models \phi$), if $\mathrm{Inf}(\rho) \cap \mathcal{F}^k_{\mathcal{P}} \neq \emptyset$ for all $k \in \{1, \dots, f\}$.
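The on-the-fly product transition of Definition III.4 can be sketched as follows (illustrative Python; `mdp_step` stands in for the unknown environment simulator and `eps_moves` for the map from an automaton state to its ε-actions — both names and interfaces are our own assumptions, not from the paper):

```python
def product_step(state, action, mdp_step, delta, eps_moves):
    """One on-the-fly transition of the product MDP P = M x A.

    `state` is [x, l, q]; `mdp_step(x, a)` samples (x', l') from the unknown
    PL-MDP; `delta(q, l)` is the (limit-)deterministic LDBA transition; and
    `eps_moves[q]` maps each epsilon-action available at q to its target state.
    """
    x, l, q = state
    if action in eps_moves.get(q, {}):
        # Epsilon-action: only the automaton component moves (cf. Eq. (3)).
        return (x, l, eps_moves[q][action])
    # Ordinary action: the MDP moves and the automaton reads the current label
    # (cf. Eq. (2)).
    x2, l2 = mdp_step(x, action)
    return (x2, l2, delta(q, l))
```

Because each call only needs the sampled next MDP state and one automaton lookup, no product state-space is ever stored, matching the on-the-fly construction described above.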

In what follows, we design a synchronous reward function based on the accepting condition of the LDBA so that maximizing the expected accumulated reward implies maximizing the satisfaction probability. Specifically, we generate a control policy that maximizes (i) the probability of reaching the accepting states of $\mathcal{P}$ from the initial state $s_0$, and (ii) the probability that each accepting set $\mathcal{F}^k_{\mathcal{P}}$ will be visited infinitely often.

### III-C Construction of the Reward Function

To synthesize a policy that maximizes the probability of satisfying $\phi$, we construct a synchronous reward function for the product MDP. The main idea is that (i) visiting an accepting set $\mathcal{F}^k_{\mathcal{P}}$ yields a positive reward $r > 0$; (ii) revisiting the same set returns zero reward until all other sets $\mathcal{F}^{k'}_{\mathcal{P}}$, $k' \neq k$, have also been visited; and (iii) all remaining transitions yield zero reward. Intuitively, this reward-shaping strategy motivates the agent to visit all accepting sets of the LDBA infinitely often, as required by the acceptance condition of the LDBA; see also Section IV.

To formally present the proposed reward-shaping method, we first need to introduce the accepting frontier set $\mathbb{A}$, which is initialized as the family set

$$\mathbb{A} = \{\mathcal{F}_k\}_{k=1}^{f}. \quad (4)$$

This set is updated on-the-fly every time a state of some set $\mathcal{F}_k$ is visited, as $\mathbb{A} := AF(q, \mathbb{A})$, where $AF$ is the accepting frontier function defined as follows.

###### Definition III.5 (Accepting Frontier Function)

Given an LDBA $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \delta)$, we define $AF : \mathcal{Q} \times 2^{\mathcal{Q}} \to 2^{\mathcal{Q}}$ as the accepting frontier function, which executes the following operation over any given set $\mathbb{A}$:

$$AF(q, \mathbb{A}) = \begin{cases} \mathbb{A} \setminus \mathcal{F}_j, & (q \in \mathcal{F}_j) \wedge (\mathbb{A} \neq \mathcal{F}_j), \\ \{\mathcal{F}_k\}_{k=1}^{f} \setminus \mathcal{F}_j, & (q \in \mathcal{F}_j) \wedge (\mathbb{A} = \mathcal{F}_j), \\ \mathbb{A}, & \text{otherwise}. \end{cases}$$

In words, given a state $q \in \mathcal{F}_j$ and the set $\mathbb{A}$, $AF$ outputs a set containing the elements of $\mathbb{A}$ minus those elements that are common with $\mathcal{F}_j$ (first case). However, if $\mathbb{A} = \mathcal{F}_j$, then the output is the family set of all accepting sets of the LDBA minus those elements that are common with $\mathcal{F}_j$, resulting in a reset of $\mathbb{A}$ to (4) minus the elements common with $\mathcal{F}_j$ (second case). Intuitively, $\mathbb{A}$ always contains those accepting sets that still need to be visited at the given time, and in this sense the reward function is synchronous with the LDBA accepting condition.

Given the accepting frontier set $\mathbb{A}$, we define the following reward function:

$$R(s, a) = \begin{cases} r, & \text{if } q' \in \mathbb{A},\ s' = [x', \ell', q'], \\ 0, & \text{otherwise}. \end{cases} \quad (5)$$

In (5), $s' = [x', \ell', q']$ is the state of the product MDP that is reached from state $s$ by taking action $a$, and $r > 0$ is an arbitrary positive reward. In this way, the agent is guided to visit all accepting sets infinitely often and, consequently, to satisfy the given LTL property.
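As a concrete sketch (illustrative Python; the family of accepting sets is represented as a list of Python sets, and all variable names are ours), the frontier update of Definition III.5 and the reward of (5) can be written as:

```python
def accepting_frontier(q, frontier, all_sets):
    """Update the accepting frontier set A (Def. III.5) after reaching
    automaton state q.  `frontier` and `all_sets` are lists of sets of states."""
    containing = [F for F in frontier if q in F]
    if not containing:
        return list(frontier)                    # q hits no pending set: A unchanged
    if len(containing) == len(frontier):
        # A consisted only of sets containing q: reset to the full family
        # minus those sets (second case of Def. III.5).
        return [F for F in all_sets if q not in F]
    return [F for F in frontier if q not in F]   # first case: A \ F_j

def reward(q_next, frontier, r=1.0):
    """Synchronous reward of Eq. (5): positive iff q' lies in a pending set."""
    return r if any(q_next in F for F in frontier) else 0.0
```

With two accepting sets $\{q_1\}$ and $\{q_2\}$, visiting $q_1$ earns a reward once, then earns nothing until $q_2$ is also visited, at which point the frontier resets, mirroring the intuition stated above.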

###### Remark III.6

The initial and accepting components of the LDBA proposed in [13] (as used in this paper) are both deterministic. By Definition III.2, the discussed LDBA is indeed a limit-deterministic automaton, however notice that the obtained determinism within its initial part is stronger than that required in the definition of LDBA. Thanks to this feature of the LDBA structure, in our proposed algorithm there is no need to “explicitly build” the product MDP and to store all its states in memory. The automaton transitions can be executed on-the-fly, as the agent reads the labels of the MDP states.

Given $\mathcal{P}$, we compute a stationary deterministic policy $\boldsymbol{\mu}^*$ that maximizes the expected accumulated return, i.e.,

$$\mu^*(s) = \arg\max_{\mu \in \mathcal{D}}\ U^{\mu}(s), \quad (6)$$

where $\mathcal{D}$ is the set of all stationary deterministic policies over $\mathcal{S}$, and

$$U^{\mu}(s) = \mathbb{E}^{\mu}\left[\sum_{n=0}^{\infty} \gamma^{n}\, R(s_n, \mu(s_n)) \,\Big|\, s_0 = s\right], \quad (7)$$

where $\mathbb{E}^{\mu}[\cdot]$ denotes the expected value given that the product MDP follows the policy $\mu$ [30], $\gamma \in [0, 1]$ is the discount factor, and $s_0 s_1 \cdots s_n$ is the sequence of states generated by policy $\mu$ up to time step $n$, initialized at $s_0 = s$. Note that the optimal policy is stationary, as shown in the following result.

###### Theorem III.7 ([30])

In any finite-state MDP, such as $\mathcal{P}$, if there exists an optimal policy, then that policy is stationary and deterministic.

In order to construct $\mu^*$, we employ episodic Q-learning (QL), a model-free RL scheme described in Algorithm 1. (Note that any other off-the-shelf model-free RL algorithm can also be used within Algorithm 1, including any variant of the class of temporal-difference learning algorithms [19].) Specifically, Algorithm 1 requires as inputs (i) the LDBA $\mathfrak{A}$, (ii) the reward function defined in (5), and (iii) the hyper-parameters of the learning algorithm.

Observe that in Algorithm 1 we use an action-value function $Q(s, a)$ instead of the value function $U(s)$, since the MDP is unknown; the two are related by $U(s) = \max_{a \in \mathcal{A}_{\mathcal{P}}(s)} Q(s, a)$. The action-value function can be initialized arbitrarily. We also define a function $n(s, a)$ that counts the number of times action $a$ has been taken at state $s$. The policy is selected to be an $\epsilon$-greedy policy, which means that with probability $1 - \epsilon$ the greedy action $\arg\max_a Q(s, a)$ is taken, and with probability $\epsilon$ a random action is selected. Every episode terminates when the current state of the automaton enters $\mathcal{Q}_{sinks}$ (Definition III.3) or when the number of iterations in the episode reaches a given threshold. Note that QL asymptotically converges to the optimal greedy policy $\mu^*(s) = \arg\max_a Q^*(s, a)$, where $Q^*$ is the optimal Q-function. Further, $\max_a Q^*(s, a) = U^*(s)$, where $U^*$ is the optimal value function that could have been computed via Dynamic Programming (DP) if the MDP were fully known [19, 31, 32]. Projecting $\mu^*$ onto the state-space of the PL-MDP yields the finite-memory policy $\xi^*$ that solves Problem 1.
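The episodic QL loop described above can be sketched compactly as follows (illustrative Python; `step`, `in_sink`, and the hyper-parameter defaults are our own stand-ins for the interfaces Algorithm 1 assumes, and the frontier update inlines Definition III.5):

```python
import random
from collections import defaultdict

def episodic_ql(step, s0, actions, all_sets, in_sink,
                episodes=200, horizon=20, gamma=0.9, alpha=0.1, eps=0.1, r=1.0):
    """Episodic Q-learning over the on-the-fly product MDP (sketch).

    `step(s, a)` samples the next product state s' = (x', l', q');
    `all_sets` is the family {F_k} of LDBA accepting sets;
    `in_sink(q)` tests membership in the non-accepting sinks Q_sinks.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s, frontier = s0, list(all_sets)        # reset the accepting frontier
        for _ in range(horizon):
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(s, b)]))   # eps-greedy
            s2 = step(s, a)
            q2 = s2[2]
            hit = [F for F in frontier if q2 in F]
            rew = r if hit else 0.0             # synchronous reward, Eq. (5)
            if hit:                             # frontier update, Def. III.5
                frontier = ([F for F in all_sets if q2 not in F]
                            if len(hit) == len(frontier)
                            else [F for F in frontier if q2 not in F])
            target = rew + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
            if in_sink(q2):                     # episode ends in Q_sinks
                break
    return Q
```

On a toy product MDP where one action leads to the accepting state and the other does not, the learned Q-values rank the accepting action higher, as expected.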

**Algorithm 1:** Episodic Q-learning for LTL control synthesis (pseudocode omitted here; it appears as a float in the original manuscript).

## IV Analysis of the Algorithm

In this section, we show that the policy generated by Algorithm 1 maximizes (1), i.e., the probability of satisfying the property $\phi$. Furthermore, we show that, unlike existing approaches, our algorithm can produce the best available policy even if the property cannot be satisfied. To prove these claims, we need the following results; all proofs are presented in the Appendix. First, we show that the accepting frontier set is time-invariant. This is needed to ensure that the LTL formula can be satisfied over the product MDP by a stationary policy.

###### Proposition IV.1

For an LTL formula $\phi$ and its associated LDBA $\mathfrak{A}$, the accepting frontier set $\mathbb{A}$ is time-invariant at each state of $\mathfrak{A}$.

As stated earlier, since QL is proved to converge to the optimal Q-function [19], it can synthesize an optimal policy with respect to the given reward function. The following result shows that the optimal policy produced by Algorithm 1 satisfies the given LTL property.

###### Theorem IV.2

Assume that there exists at least one deterministic stationary policy in $\mathcal{P}$ whose traces satisfy the property $\phi$ with positive probability. Then the traces of the optimal policy $\mu^*$ defined in (6) satisfy $\phi$ with positive probability as well.

Next, we show that $\mu^*$, and subsequently its projection $\xi^*$, maximize the satisfaction probability.

###### Theorem IV.3

If an LTL property $\phi$ is satisfiable by the PL-MDP $\mathcal{M}$, then the optimal policy $\mu^*$ that maximizes the expected accumulated reward, as defined in (6), maximizes the probability of satisfying $\phi$, defined in (1), as well.

Next, we show that if there does not exist a policy that satisfies the LTL property $\phi$, Algorithm 1 will find the policy that is closest to satisfying $\phi$. To this end, we first introduce the notion of closeness to satisfaction.

###### Definition IV.4 (Closeness to Satisfaction)

Assume that two policies $\mu_1$ and $\mu_2$ do not satisfy the property $\phi$. Consequently, there are accepting sets in the automaton that have no intersection with runs of the induced Markov chains $\mathcal{P}^{\mu_1}$ and $\mathcal{P}^{\mu_2}$. The policy $\mu_1$ is closer to satisfying the property than $\mu_2$ if runs of $\mathcal{P}^{\mu_1}$ intersect more accepting sets of the automaton than runs of $\mathcal{P}^{\mu_2}$.

###### Corollary IV.5

If there does not exist a policy in the PL-MDP $\mathcal{M}$ that satisfies the property $\phi$, then the proposed algorithm yields a policy that is closest to satisfying $\phi$, in the sense of Definition IV.4.

## V Experiments

In this section, we present three case studies, implemented in MATLAB R2016a on a computer with an Intel Xeon CPU at 2.93 GHz and 4 GB of RAM. In the first two experiments, the environment is represented as a discrete grid world, as illustrated in Figure 1. The third case study is an adaptation of the well-known Atari game Pacman (Figure 2), which is initialized in a configuration that is quite hard for the agent to solve.

The first case study pertains to a temporal logic planning problem in a dynamic and unknown environment with AMECs, while the second one does not admit AMECs. Note that the majority of existing algorithms fail to provide a control policy when AMECs do not exist [8, 34, 20], or result in control policies without satisfaction guarantees [18].

The LTL formula considered in the first two case studies is the following:

$$\phi_1 = \Diamond(\text{target1}) \wedge \Box\Diamond(\text{target2}) \wedge \Box\Diamond(\text{user}) \wedge (\neg \text{user} \cup \text{target2}) \wedge \Box(\neg \text{obs}). \quad (8)$$

In words, this LTL formula requires the robot to (i) eventually visit target 1 (expressed as $\Diamond(\text{target1})$); (ii) visit target 2 infinitely often and take a picture of it ($\Box\Diamond(\text{target2})$); (iii) visit a user infinitely often where, say, the collected pictures are uploaded ($\Box\Diamond(\text{user})$); (iv) avoid visiting the user until a picture of target 2 has been taken ($\neg \text{user} \cup \text{target2}$); and (v) always avoid obstacles ($\Box(\neg \text{obs})$).

The LTL formula (8) can be expressed as a DRA with states. On the other hand, a corresponding LDBA has states (fewer, as expected), which results in a significant reduction of the state space that needs to be explored.

The interaction of the robot with the environment is modeled by a PL-MDP with states and actions per state. The action space is . We assume that the targets and the user are dynamic, i.e., their location in the environment varies probabilistically. Specifically, their presence in a given region is determined by the (unknown) label probability function $P_L$ from Definition II.1 (Figure 1).

The LTL formula specifying the task for Pacman (third case study) is:

$$\phi_2 = \Diamond\big[(\text{food1} \wedge \Diamond \text{food2}) \vee (\text{food2} \wedge \Diamond \text{food1})\big] \wedge \Box(\neg \text{ghost}). \quad (9)$$

Intuitively, the agent is tasked with (i) eventually eating food1 and then food2 (or vice versa), while (ii) avoiding any contact with the ghosts. This LTL formula corresponds to a DRA with states and to an LDBA with states. The agent can execute actions per state, and if the agent hits a wall by taking an action, it remains in its previous location. The ghosts' dynamics are stochastic: with a given probability each ghost chases Pacman (often referred to as “chase mode”), and with the complementary probability it executes a random action (“scatter mode”).

In the first case study, we assume that there is no uncertainty in the robot actions. In this case, it can be verified that AMECs exist. Figure 3(a) illustrates the evolution of the expected accumulated reward over episodes under the $\epsilon$-greedy policy. The optimal policy was constructed in approximately minutes. A sample path of the robot under the projection of the optimal control strategy onto $\mathcal{M}$, i.e., the policy $\xi^*$, is given in Figure 1 (red path).

In the second case study, we assume that the robot is equipped with a noisy controller and can therefore execute the desired action only with some probability, whereas a random action among the other available ones is taken with the complementary probability. In this case, it can be verified that AMECs do not exist. Intuitively, the reason is that there is always a non-zero probability with which the robot will hit an obstacle while it travels between the access point and the target and will, therefore, violate $\phi_1$. Figure 3(b) shows the evolution of the expected accumulated reward over episodes under the $\epsilon$-greedy policy. The optimal policy was synthesized in approximately hours.

In the third experiment, there is no uncertainty in the execution of actions, namely the motion of the Pacman agent is deterministic. Figure 3(c) shows the evolution of the expected accumulated reward over 186,000 episodes under the $\epsilon$-greedy policy. On the other hand, using standard Q-learning (without LTL guidance) would require either constructing a history-dependent reward for the PL-MDP as a proxy for the considered LTL property, which is very challenging for complex LTL formulas, or performing an exhaustive state-space search with static rewards, which is evidently quite wasteful and failed to generate an optimal policy in our experiments.

Note that, given the policy for the PL-MDP, probabilistic model checkers, such as PRISM [35], or standard Dynamic Programming methods can be employed to compute the probability of satisfying $\phi$. For instance, for the first case study, the synthesized policy satisfies $\phi_1$ with probability , while for the second case study the satisfaction probability is , since AMECs do not exist. For the same reason, even if the transition probabilities of the PL-MDP were known, PRISM could not generate a policy for the second case study. Nevertheless, the proposed algorithm can synthesize the closest-to-satisfaction policy, as shown in Corollary IV.5.

## VI Conclusions

In this paper, we have proposed a model-free reinforcement learning (RL) algorithm to synthesize control policies that maximize the probability of satisfying high-level control objectives captured by LTL formulas. The interaction of the agent with the environment has been captured by an unknown probabilistically-labeled Markov Decision Process (PL-MDP). We have shown that the proposed RL algorithm produces a policy that maximizes the satisfaction probability. We have also shown that even if the assigned specification cannot be satisfied, the proposed algorithm synthesizes the best possible policy. We have provided evidence of the efficiency of the proposed method via numerical experiments.

## References

• [1] G. E. Fainekos, H. Kress-Gazit, and G. J. Pappas, “Hybrid controllers for path planning: A temporal logic approach,” in CDC and ECC, December 2005, pp. 4885–4890.
• [2] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-logic-based reactive mission and motion planning,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, 2009.
• [3] A. Bhatia, L. E. Kavraki, and M. Y. Vardi, “Sampling-based motion planning with temporal goals,” in ICRA, 2010, pp. 2689–2696.
• [4] Y. Kantaros and M. M. Zavlanos, “Sampling-based optimal control synthesis for multi-robot systems under global temporal tasks,” IEEE Transactions on Automatic Control, 2018. [Online]. Available: DOI:10.1109/TAC.2018.2853558
• [5] ——, “Distributed intermittent connectivity control of mobile robot networks,” IEEE Transactions on Automatic Control, vol. 62, no. 7, pp. 3109–3121, 2017.
• [6] X. C. Ding, S. L. Smith, C. Belta, and D. Rus, “MDP optimal control under temporal logic constraints,” in CDC and ECC, 2011, pp. 532–538.
• [7] E. M. Wolff, U. Topcu, and R. M. Murray, “Robust control of uncertain Markov decision processes with temporal logic specifications,” in CDC, 2012, pp. 3372–3379.
• [8] X. Ding, S. L. Smith, C. Belta, and D. Rus, “Optimal control of Markov decision processes with linear temporal logic constraints,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1244–1257, 2014.
• [9] M. Guo and M. M. Zavlanos, “Probabilistic motion planning under temporal tasks and soft constraints,” IEEE Transactions on Automatic Control, 2018.
• [10] I. Tkachev, A. Mereacre, J.-P. Katoen, and A. Abate, “Quantitative model-checking of controlled discrete-time Markov processes,” Information and Computation, vol. 253, pp. 1–35, 2017.
• [11] C. Baier and J.-P. Katoen, Principles of model checking.   MIT Press, 2008.
• [12] E. M. Clarke, O. Grumberg, D. Kroening, D. Peled, and H. Veith, Model Checking, 2nd ed.   MIT Press, 2018.
• [13] S. Sickert, J. Esparza, S. Jaax, and J. Křetínskỳ, “Limit-deterministic Büchi automata for linear temporal logic,” in CAV.   Springer, 2016, pp. 312–332.
• [14] R. Alur and S. La Torre, “Deterministic generators and games for LTL fragments,” TOCL, vol. 5, no. 1, pp. 1–25, 2004.
• [15] S. Sickert and J. Křetínskỳ, “MoChiBA: Probabilistic LTL model checking using limit-deterministic Büchi automata,” in ATVA.   Springer, 2016, pp. 130–137.
• [16] J. Fu and U. Topcu, “Probably approximately correct MDP learning and control with temporal logic constraints,” in Robotics: Science and Systems X, 2014.
• [17] T. Brázdil, K. Chatterjee, M. Chmelík, V. Forejt, J. Křetínský, M. Kwiatkowska, D. Parker, and M. Ujma, “Verification of Markov decision processes using learning algorithms,” in ATVA.   Springer, 2014, pp. 98–114.
• [18] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,” in CDC.   IEEE, 2014, pp. 1091–1096.
• [19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press Cambridge, 1998, vol. 1.
• [20] J. Wang, X. Ding, M. Lahijanian, I. C. Paschalidis, and C. A. Belta, “Temporal logic motion control using actor–critic methods,” The International Journal of Robotics Research, vol. 34, no. 10, pp. 1329–1344, 2015.
• [21] Q. Gao, D. Hajinezhad, Y. Zhang, Y. Kantaros, and M. M. Zavlanos, “Reduced variance deep reinforcement learning with temporal logic specifications,” 2019 (to appear).
• [22] N. Fulton and A. Platzer, “Verifiably safe off-model reinforcement learning,” arXiv preprint arXiv:1902.05632, 2019.
• [23] N. Fulton, “Verifiably safe autonomy for cyber-physical systems,” Ph.D. dissertation, Carnegie Mellon University Pittsburgh, PA, 2018.
• [24] N. Fulton and A. Platzer, “Safe reinforcement learning via formal methods: Toward safe control through proof and learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
• [25] A. Platzer, “Differential dynamic logic for hybrid systems,” Journal of Automated Reasoning, vol. 41, no. 2, pp. 143–189, 2008.
• [26] M. Hasanbeig, A. Abate, and D. Kroening, “Logically-constrained reinforcement learning,” arXiv preprint arXiv:1801.08099, 2018.
• [27] ——, “Logically-constrained neural fitted Q-iteration,” in AAMAS, 2019, pp. 2012–2014.
• [28] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak, “Omega-regular objectives in model-free reinforcement learning,” arXiv preprint arXiv:1810.00950, 2018.
• [29] A. Pnueli, “The temporal logic of programs,” in Foundations of Computer Science.   IEEE, 1977, pp. 46–57.
• [30] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming.   John Wiley & Sons, 2014.
• [31] A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008.
• [32] A. Abate, J.-P. Katoen, J. Lygeros, and M. Prandini, “Approximate model checking of stochastic hybrid systems,” European Journal of Control, vol. 16, no. 6, pp. 624–641, 2010.
• [34] J. Fu and U. Topcu, “Probably approximately correct MDP learning and control with temporal logic constraints,” arXiv preprint arXiv:1404.7073, 2014.
• [35] M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of probabilistic real-time systems,” in CAV.   Springer, 2011, pp. 585–591.
• [36] R. Durrett, Essentials of stochastic processes.   Springer, 1999, vol. 1.
• [37] V. Forejt, M. Kwiatkowska, and D. Parker, “Pareto curves for probabilistic model checking,” in ATVA.   Springer, 2012, pp. 317–332.
• [38] E. A. Feinberg and J. Fei, “An inequality for variances of the discounted rewards,” Journal of Applied Probability, vol. 46, no. 4, pp. 1209–1212, 2009.
###### Definition .1

Given an LTL property φ and a set 𝒢 of G-subformulas, i.e., formulas of the form Gψ, we define φ[𝒢] to be the resulting formula when we substitute true for every G-subformula of φ that belongs to 𝒢 and false for every other G-subformula of φ.
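As a quick illustration of this substitution (the formula below is our own example, not from the paper, under the reading of the definition above):

```latex
% phi has two G-subformulas, G a and G c; take the set G = {G a}
\varphi \;=\; \mathrm{G}\,a \,\vee\, \bigl(b \wedge \mathrm{G}\,c\bigr),
\qquad \mathcal{G} = \{\mathrm{G}\,a\}
\;\Longrightarrow\;
\varphi[\mathcal{G}] \;=\; \mathrm{true} \,\vee\, \bigl(b \wedge \mathrm{false}\bigr)
\;\equiv\; \mathrm{true}.
```

Intuitively, φ[𝒢] evaluates φ under the assumption that exactly the G-subformulas in 𝒢 hold.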

### -A Proof of Proposition IV.1

Let 𝒢 = {Gψ_1, …, Gψ_f} be the set of all G-subformulas of φ. Since the elements of 𝒢 are subformulas of φ, we can assume an ordering over 𝒢 such that if Gψ_i is a subformula of Gψ_j then i ≤ j. The accepting component of the LDBA is a product of DBAs (called G-monitors), where the i-th G-monitor tracks the G-subformula Gψ_i [13]. The states of each G-monitor are pairs of formulas: at each state, the first formula checks whether the run satisfies the current G-subformula, while the second puts the next G-subformula in the ordering of 𝒢 on hold. All of the previous G-subformulas Gψ_1, …, Gψ_{i−1} have already been checked, and each is replaced by true in the substituted formula. The product of the G-monitors is a deterministic generalized Büchi automaton whose state space, initial state, transitions, and accepting sets are the componentwise products of those of the individual G-monitors. As shown in [13], while a word is being read by the accepting component of the LDBA, the set of G-subformulas that hold “monotonically” expands; if the word satisfies φ, then eventually all G-subformulas become true.

Assume that the current state of the automaton is one in which it is checking whether Gψ_i is satisfied, assuming that Gψ_{i−1} has already been verified, while putting Gψ_{i+1} on hold; the accepting frontier set at this state contains the accepting sets of the G-subformulas not yet verified. Also assume the automaton returns to this same state, but with at least one accepting set removed from the accepting frontier set (note that an accepting set cannot be added, since the set of satisfied G-subformulas monotonically expands). This essentially means that Gψ_{i+1} has already been checked while Gψ_i has not been checked yet, making Gψ_{i+1} a subformula of Gψ_i. This violates the ordering of 𝒢, and hence the assumption that the accepting frontier set associated with a state is time-variant is incorrect.

### -B Proof of Theorem IV.2

We prove this result by contradiction. Consider any policy μ whose traces satisfy φ with positive probability. Policy μ induces a Markov chain M_μ when it is applied over the product MDP. This Markov chain comprises a disjoint union of a set of transient states T_μ and n_μ irreducible recurrent classes R_μ^1, …, R_μ^{n_μ} [36], namely M_μ = T_μ ⊔ R_μ^1 ⊔ … ⊔ R_μ^{n_μ}. From the accepting condition of the LDBA, traces of policy μ satisfy φ with positive probability if and only if

 ∃ R̄ᵃ_μ  s.t.  ∀ j ∈ {1, …, f},  F^P_j ∩ R̄ᵃ_μ ≠ ∅.

The recurrent class R̄ᵃ_μ is called an accepting recurrent class. Note that if all recurrent classes are accepting, then traces of policy μ satisfy φ with probability one. By construction of the reward function (5), the agent keeps receiving positive rewards after it has reached an accepting recurrent class, as it visits all of the accepting sets infinitely often.
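The decomposition used in this proof can be illustrated on a toy induced chain. The chain below, its transition structure, and the accepting sets are our own illustrative assumptions, not taken from the paper:

```python
# Toy induced Markov chain: a policy applied to the product MDP yields a chain
# that splits into transient states and irreducible recurrent classes; a class
# is "accepting" if it intersects every accepting set F_j.

# support graph of a 5-state chain: state 0 is transient,
# {1, 2} and {3, 4} are the two recurrent classes
succ = {0: {1, 3}, 1: {2}, 2: {1}, 3: {4}, 4: {3}}

def reachable(s):
    # states reachable from s in one or more steps (DFS on the support graph)
    seen, stack = set(), [s]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

# a state is recurrent iff every state it can reach can reach it back
recurrent = {s for s in succ if all(s in reachable(t) for t in reachable(s))}

# group recurrent states into classes, then keep classes hitting every F_j
F = [{1}, {2}]          # accepting sets, chosen for illustration
classes = []
for s in sorted(recurrent):
    if not any(s in c for c in classes):
        classes.append(reachable(s) | {s})
accepting = [c for c in classes if all(c & Fj for Fj in F)]
print(sorted(recurrent), [sorted(c) for c in accepting])
```

Here {1, 2} is the only accepting recurrent class: a run trapped in {3, 4} visits no accepting set and, under reward function (5), eventually collects no positive reward.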

There are two other possibilities concerning the remaining recurrent classes, which are not accepting. A non-accepting recurrent class R̄_μ either (i) has no intersection with any accepting set F^P_j, or (ii) intersects some of the accepting sets but not all of them. In case (i), the agent does not visit any accepting set within the recurrent class, and the likelihood of visiting accepting sets within the transient states is zero, since R̄_μ is invariant. In case (ii), the agent is able to visit some accepting sets but not all of them. This means that there always exists at least one accepting set that has no intersection with R̄_μ, so after a finite number of steps no positive reward can be obtained, and the re-initialization of the accepting frontier set in Definition III.5 never happens. By (7), in both cases there always exists a discount factor γ such that the expected reward of a trace reaching an accepting recurrent class such as R̄ᵃ_μ, which collects an infinite number of positive rewards, is higher than the expected reward of any other trace.

Next, assume that the traces of the optimal policy μ*, defined in (6), do not satisfy the property φ; in other words, all of the recurrent classes of the induced Markov chain are non-accepting. As discussed in cases (i) and (ii) above, the accepting policy μ then has a higher expected reward than the optimal policy μ*, owing to the expected infinite number of positive rewards under policy μ. However, this contradicts the optimality of μ* in (6), completing the proof.

### -C Proof of Theorem IV.3

We first review how the satisfaction probability is traditionally calculated when the MDP is fully known, and then show that the proposed algorithm converges to the same result. When the MDP graph and transition probabilities are known, the probability of property satisfaction is often calculated via dynamic-programming methods, such as standard value iteration over the product MDP [11]. This converts the satisfaction problem into a reachability problem, whose goal is to find the maximum (or minimum) probability of reaching accepting maximal end components (AMECs).

The value function in value iteration is then initialized to 0 for non-accepting maximal end components and to 1 for the rest of the product MDP. Once value iteration converges, at any given state s the optimal policy is produced by taking the action that maximizes the one-step backup of the converged value function V, which represents the maximum probability of satisfying the property at state s.

The key to comparing standard model-checking methods with our method is the reduction of value iteration to its basic form. More specifically, quantitative model checking over an MDP with a reachability predicate can be converted to a model-checking problem with an equivalent reward predicate, which is called the basic form. This reduction is achieved by adding a one-off (sometimes called terminal) reward of 1 upon reaching an AMEC [37]. Once this reduction is done, the Bellman operation is applied to the value function (which represents the satisfaction probability), and the resulting policy maximizes the probability of satisfying the property.
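A minimal sketch of this value-iteration/reachability view (not the paper's implementation — the MDP layout, action names, and probabilities below are illustrative assumptions, with a single absorbing target state standing in for an AMEC):

```python
# Value iteration for the maximum probability of reaching a target set,
# via the "basic form" reduction: a one-off terminal reward of 1 on the target.

# states: 0 (start), 1 (risky), 2 (target, stands in for an AMEC), 3 (trap)
# transitions[s][a] = list of (next_state, probability)
transitions = {
    0: {"a": [(1, 1.0)], "b": [(3, 1.0)]},
    1: {"a": [(2, 0.8), (3, 0.2)]},
    2: {"a": [(2, 1.0)]},   # absorbing target
    3: {"a": [(3, 1.0)]},   # absorbing trap (non-accepting MEC)
}
TARGET, TRAP = {2}, {3}

def value_iteration(eps=1e-10):
    # terminal reward of 1 encoded as V = 1 on the target; trap pinned to 0
    V = {s: (1.0 if s in TARGET else 0.0) for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            if s in TARGET or s in TRAP:
                continue
            best = max(sum(p * V[t] for t, p in succ)
                       for succ in transitions[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

def greedy_policy(V):
    # argmax over actions of the one-step backup, as described in the text
    return {s: max(transitions[s],
                   key=lambda a: sum(p * V[t] for t, p in transitions[s][a]))
            for s in transitions}

V = value_iteration()
print(V[0], greedy_policy(V)[0])
```

The converged value V(s) is exactly the maximum probability of reaching the target from s, and the greedy policy realizes it (here, action "a" at state 0, with probability 0.8).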

In the proposed method, when an AMEC is reached, all of the automaton accepting sets have surely been visited by the policy, and an infinite number of positive rewards will be given to the agent, as shown in Theorem IV.2.

There are two natural ways to define the total discounted reward [38]: (i) to interpret the discounting as a coefficient in front of the reward; and (ii) to define the total discounted reward via a terminal reward after which no reward is given, treating the update rule as if it were undiscounted. It is well known that the expected total discounted rewards corresponding to these two definitions are the same; see, e.g., [38]. Therefore, without loss of generality, given any discount factor γ and any positive reward component, the expected discounted reward in the discounted case (the proposed algorithm) is c times that of the undiscounted case (value iteration), where c is a positive constant. Consequently, maximizing one is equivalent to maximizing the other.

### -D Proof of Corollary IV.5

Assume that there exists no policy in the product MDP whose traces can satisfy the property φ. Construct the induced Markov chain M_μ for an arbitrary policy μ, with its associated set of transient states T_μ and its irreducible recurrent classes R_μ^1, …, R_μ^{n_μ}: M_μ = T_μ ⊔ R_μ^1 ⊔ … ⊔ R_μ^{n_μ}. By assumption, policy μ cannot satisfy the property, and thus none of the recurrent classes is accepting. Following the same logic as in the proof of Theorem IV.2, after a finite number of steps no positive reward is given to the agent. However, by the convergence guarantees of QL, the proposed algorithm will generate a policy with the highest expected accumulated reward. By construction of the reward function in (5), this policy has the highest number of intersections with accepting sets.

### -E Counter-example

We would like to emphasise that the algorithm proposed in this work and in [26] is “episodic” and thus covers the undiscounted case as well. This has unfortunately been overlooked in [28]. In the following, we examine the general cases of discounted and undiscounted learning, and we show that our algorithm, being episodic, is able to output the correct action for the example provided in [28] (Fig. 4). For the sake of generality, we have parameterised the probabilities associated with action right by 1−ν and ν, respectively.

Recall that for a policy μ on an MDP, and given a reward function R, the expected discounted reward at state s by taking action a is defined as [19]:

 U^μ(s,a) = E^μ[ ∑_{n=0}^{∞} γⁿ R(s_n, μ(s_n)) | s_0 = s, a_0 = a ], (10)

where E^μ denotes the expected value obtained by following policy μ, and s_0 a_0 s_1 a_1 … s_n a_n is the sequence of state-action pairs generated by policy μ up to time step n.

We would like to show that for some γ, U^μ(s_0, left) > U^μ(s_0, right). From (10), at state s_0 the expected return for each action is:

 U^μ(s_0, right) = (1−ν)[r + γr + γ²r + …], (11)
 U^μ(s_0, left) = γ²r + γ⁵r + γ⁸r + …

Notice that ν has no effect on the expected return after the agent chooses to go right or left, as there is only one action available in the subsequent states. Let us first consider γ < 1. The RHS of U^μ(s_0, right) is a geometric series with initial term r and ratio γ, truncated after n steps of the episode. Thus,

 U^μ(s_0, right) = (1−ν) r (1−γⁿ)/(1−γ). (12)

The expected return U^μ(s_0, left) is also a (truncated) geometric series, such that:

 U^μ(s_0, left) = γ² r (1−γ^{3n})/(1−γ³). (13)
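The closed forms (12) and (13) can be checked numerically against the truncated sums in (11); the values of r, ν, γ, and n below are arbitrary assumptions for the check:

```python
# Numerical sanity check: both expected returns are finite geometric series
# over an episode of length n, matching closed forms (12) and (13).
r, nu, gamma, n = 1.0, 0.1, 0.9, 50

# direct partial sums: right pays r every step (w.p. 1-nu),
# left pays r every third step starting at step 2
u_right = (1 - nu) * sum(r * gamma**k for k in range(n))
u_left = sum(r * gamma**(3 * k + 2) for k in range(n))

# closed forms (12) and (13)
u_right_cf = (1 - nu) * r * (1 - gamma**n) / (1 - gamma)
u_left_cf = gamma**2 * r * (1 - gamma**(3 * n)) / (1 - gamma**3)

print(abs(u_right - u_right_cf) < 1e-9, abs(u_left - u_left_cf) < 1e-9)
```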

Consider two cases: (1) γ < 1, and (2) γ = 1.

In the first case, γ < 1; as n → ∞, γⁿ → 0 and γ^{3n} → 0, and therefore the following inequality can be solved for γ:

 γ²r/(1−γ³) > (1−ν)r/(1−γ)  ⟶  γ²/(1+γ+γ²) > 1−ν  ⟶ (14)
 γ < ((−√(1/ν² + 2/ν − 3) − 1)ν + 1)/(2ν),  or  γ > ((√(1/ν² + 2/ν − 3) − 1)ν + 1)/(2ν).

Thus, for some ν, the discounted case is sufficient provided that γ satisfies (14). However, it is possible that for some ν both conditions push γ outside of its range (0, 1) in the first case. Therefore, γ in the learning algorithm needs to be equal to 1, which brings us to the second case; this is allowed in our work thanks to the episodic nature of our algorithm.
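The upper branch of (14), under our reading of the garbled original, is the positive root of the quadratic νγ² − (1−ν)γ − (1−ν) = 0; a quick check (with illustrative values of ν) confirms both the threshold and the fact that for some ν it lies outside (0, 1):

```python
import math

# Sanity check of the threshold in (14): for gamma in (0, 1), the inequality
# gamma^2 / (1 + gamma + gamma^2) > 1 - nu holds iff gamma exceeds the
# positive root of nu*g^2 - (1-nu)*g - (1-nu) = 0.
def threshold(nu):
    # positive root, written as in (14)
    return ((math.sqrt(1 / nu**2 + 2 / nu - 3) - 1) * nu + 1) / (2 * nu)

def holds(gamma, nu):
    return gamma**2 / (1 + gamma + gamma**2) > 1 - nu

nu = 0.9
g = threshold(nu)
print(round(g, 3), holds(g + 1e-6, nu), holds(g - 1e-6, nu))
# for nu = 0.5 the threshold exceeds 1, so no discount factor in (0, 1) works
print(threshold(0.5) > 1)
```

For ν = 0.9 the threshold is about 0.393, so a discounted agent can prefer left; for ν = 0.5 the threshold is about 1.618 > 1, which is exactly the situation forcing γ = 1.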

Note that when γ = 1 we cannot derive (14), since 1 − γ = 0 and the common factor cannot be cancelled from both sides of the inequality. Further to this, (12) and (13) become undefined when γ = 1. From (11), though, we know that with γ = 1 both summations go to infinity as n → ∞. The question is whether we can still show that U^μ(s_0, left) > U^μ(s_0, right).

Recall that the convergence of QL is asymptotic; hence, if we can show that U^μ(s_0, left) > U^μ(s_0, right) after a number of episodes, then our algorithm can output the correct result and will choose action left once QL has converged. To prove this claim, let us consider the following limit as we push γ towards 1:

 lim_{γ→1} U^μ(s_0, left)/U^μ(s_0, right) = lim_{γ→1} [γ² r (1−γ^{3n})/(1−γ³)] / [(1−ν) r (1−γⁿ)/(1−γ)] = lim_{γ→1} γ²(1 + γⁿ + γ^{2n}) / [(1+γ+γ²)(1−ν)] = 1/(1−ν). (15)

In the case where ν = 0, the limit is 1, namely the algorithm is indifferent between choosing left and right. This matches the MDP as well, since going in either direction does not change the optimality of the action when ν = 0. However, if ν > 0, then the limit is always greater than one, meaning that the expected return for taking left exceeds that for taking right after some finite number of episodes.
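The limit in (15) can be verified numerically; the episode length n and the values of ν below are illustrative assumptions:

```python
# Numerical check that the ratio in (15) approaches 1/(1 - nu) as gamma -> 1,
# for a fixed episode length n, using closed forms (12) and (13).
def ratio(gamma, nu, n, r=1.0):
    left = gamma**2 * r * (1 - gamma**(3 * n)) / (1 - gamma**3)
    right = (1 - nu) * r * (1 - gamma**n) / (1 - gamma)
    return left / right

nu, n = 0.2, 100
for gamma in (0.9, 0.99, 0.999999):
    print(gamma, ratio(gamma, nu, n))
# as gamma -> 1 the ratio tends to 1/(1 - nu) = 1.25, i.e. left dominates
```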
