Continuous Motion Planning with Temporal Logic Specifications using Deep Neural Networks
Abstract
In this paper, we propose a model-free reinforcement learning method to synthesize control policies for motion planning problems with continuous states and actions. The robot is modelled as a labeled discrete-time Markov decision process (MDP) with continuous state and action spaces. Linear temporal logic (LTL) is used to specify high-level tasks. We then train deep neural networks to approximate the value function and policy using an actor-critic reinforcement learning method. The LTL specification is converted into an annotated limit-deterministic Büchi automaton (LDBA) for continuously shaping the reward so that dense rewards are available during training. A naïve way of solving a motion planning problem with LTL specifications using reinforcement learning is to sample a trajectory and then assign a high reward for training if the trajectory satisfies the entire LTL formula. However, the sample complexity needed to find such a trajectory is too high when we have a complex LTL formula for continuous state and action spaces. As a result, it is very unlikely that we get enough reward for training if all sample trajectories start from the initial state of the automaton. In this paper, we propose a method that samples not only an initial state from the state space, but also an arbitrary state of the automaton at the beginning of each training episode. We test our algorithm in simulation using a car-like robot and find that our method can successfully learn policies for different working configurations and LTL specifications.
I Introduction
Traditionally, motion planning problems consider generating a trajectory for reaching a specific target while avoiding obstacles [11]. However, real-world applications often require more complex tasks than simply reaching a target. As a result, recent motion planning problems consider a class of high-level complex specifications that can be used to describe a richer class of tasks. A branch of planning approaches has been proposed recently that describes high-level tasks, such as reaching a sequence of goals or ordering a set of events, using formal languages such as linear temporal logic (LTL) [16]. As a simple example, the task of reaching region A and then reaching region B can be easily expressed as an LTL formula. To deal with LTL specifications, an approach for a point-mass robot model has been proposed in [5]. A control synthesis technique with receding horizon control has been proposed in [31] to handle a linear robot model. The approach in [2] uses a sampling-based method to deal with nonlinear dynamic robot models with LTL specifications. However, this method suffers from the curse of dimensionality, which limits its use to low-dimensional system models.
Reinforcement learning has achieved great success in the past decades both in terms of theoretical results [29] and applications [24, 26]. It is a way of learning the best actions for a Markov decision process (MDP) by interacting with the environment [27]. It is efficient in solving problems of complex systems with or without knowing a model [28]. Early works were mainly based on Q-learning [30] and policy gradient methods [29]. The actor-critic algorithm [20] is also widely used with two components, namely an actor and a critic. The actor is used as the policy, which tells the system what action should be taken at each state, and the critic is used to approximate the state-action value function. Modern reinforcement learning methods take advantage of deep neural networks to solve problems with large state and action spaces. A deep Q-network (DQN) [17] uses a deep neural network to approximate state-action values and learns an implicit control policy by improving this Q-network. In [25], a deterministic policy gradient method is proposed with better time efficiency. Building on this idea of a deterministic policy, the deep deterministic policy gradient method (DDPG) [15] uses two deep neural networks, an actor network and a critic network, to solve problems with continuous state and action spaces.
Reinforcement learning algorithms have been applied to solve model-free robotic control problems with temporal logic specifications. In [21], a learning method is used to solve an MDP problem with LTL specifications. The temporal logic formula is transformed into a deterministic Rabin automaton (DRA) and a real-valued reward function is designed in order to satisfy complex requirements. In [6], a reduced variance deep Q-learning method is used to approximate the state-action values of the product MDP with the help of deep neural networks. Another branch of methods converts the LTL formula into a limit-deterministic Büchi automaton (LDBA), and a synchronous reward function is designed based on the acceptance condition of the LDBA, as in [10] and [7]. The authors in [8] use neural fitted Q-iteration to solve systems with continuous states. In [19], the LTL formula is converted into a limit-deterministic generalized Büchi automaton (LDGBA). Moreover, a continuous state space is considered in [9].
However, training deep networks for continuous control is more challenging due to significantly increased sample complexity, and most approaches achieve poor performance on hierarchical tasks, even with extensive hyperparameter tuning [4]. This is because the reward function is very sparse if we only have a terminal reward when the accepting conditions are satisfied. The authors in [13] use a time-varying linear Gaussian process to describe the policy, and the policy is updated by maximizing the robustness function in each step. In [12], the authors use model-free learning to synthesize controllers over a finite time horizon for systems with continuous state and action spaces. The authors in [32] also propose a training scheme for continuous state and action spaces, but they use one neural network for each individual automaton state. This requires a large number of networks when the LTL specification is complex.
The main contribution of this paper is therefore to show that deep networks can be trained effectively to solve continuous control problems with temporal logic goals by combining two ingredients: an annotated LDBA converted from the LTL specification, and the simple idea of randomly sampling the automaton state at the start of each episode instead of initializing it to the fixed initial state given by the translated automaton. We show in simulations that our method achieves good performance for a nonlinear robot model with complex LTL specifications.
II Preliminary and Problem Definition
II-A Linear Temporal Logic
Linear temporal logic (LTL) formulas are composed over a set of atomic propositions $AP$ by the following syntax:
(1) $\varphi ::= \text{true} \mid \alpha \mid \varphi_1 \wedge \varphi_2 \mid \neg\varphi \mid \bigcirc\varphi \mid \varphi_1\,\mathsf{U}\,\varphi_2$
where $\alpha \in AP$ is an atomic proposition, true, negation ($\neg$), conjunction ($\wedge$) are propositional logic operators and next ($\bigcirc$), until ($\mathsf{U}$) are temporal operators.
Other propositional logic operators such as false, disjunction ($\vee$), implication ($\rightarrow$), and temporal operators always ($\Box$), eventually ($\Diamond$) can be derived based on the ones in (1). A sequence of symbols in $2^{AP}$ is called a word. We write $w \models \varphi$ if the word $w$ satisfies the LTL formula $\varphi$. Details on syntax and semantics of LTL can be found in [1].
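For reference, the derived operators mentioned above can be written out using the standard LTL identities (see [1]). The symbols below are the usual textbook ones and are an assumption about notation, not a quotation from this paper; the last line gives one common encoding of the introductory example "reach region $a$ and then region $b$".

```latex
\begin{align*}
\text{false} &\equiv \neg\,\text{true}, \\
\varphi_1 \vee \varphi_2 &\equiv \neg(\neg\varphi_1 \wedge \neg\varphi_2), \\
\varphi_1 \rightarrow \varphi_2 &\equiv \neg\varphi_1 \vee \varphi_2, \\
\Diamond\varphi &\equiv \text{true}\;\mathsf{U}\;\varphi, \\
\Box\varphi &\equiv \neg\Diamond\neg\varphi, \\
\text{``reach $a$, then $b$''} &:\quad \Diamond\,(a \wedge \Diamond\, b).
\end{align*}
```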
For probabilistic systems such as Markov decision processes, it is sufficient to use a limit-deterministic Büchi automaton (LDBA) over the set of symbols $2^{AP}$, which is deterministic in the limit, to guide the verification or control synthesis with respect to an LTL formula. For any LTL formula $\varphi$, there exists an equivalent LDBA that accepts exactly the words described by $\varphi$ [23].
In this paper, we use transition-based LDBAs, since they are often smaller than their state-based counterparts. We begin by defining a transition-based generalized Büchi automaton and then give a formal definition of an LDBA.
Definition 1
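A transition-based generalized Büchi automaton (TGBA) is a tuple $\mathcal{A} = (Q, \Sigma, \delta, q_0, \mathcal{F})$, where $Q$ is a set of states, $\Sigma$ is a finite alphabet, $\delta: Q \times \Sigma \to 2^{Q}$ is the state transition function, $q_0 \in Q$ is the initial state, and $\mathcal{F} = \{F_1, \dots, F_k\}$ with $F_i \subseteq Q \times \Sigma \times Q$ ($i = 1, \dots, k$) is a set of accepting conditions.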
A run of a TGBA under an input word is an infinite sequence of transitions in , denoted by , that satisfies and for all . Let , and . Denote by the transition between under the input . A word is accepted by if there exists a run such that for all , where is the set of transitions that occur infinitely often during the run .
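The acceptance condition can be made concrete for runs that eventually repeat a finite cycle forever, since only the transitions on that cycle occur infinitely often. The following is a minimal sketch under that assumption; the function name `accepts_lasso`, the tuple encoding of transitions, and the toy two-state automaton are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

Transition = Tuple[str, str, str]  # (source state, input symbol, target state)


def accepts_lasso(cycle: Sequence[Transition],
                  accepting_sets: List[FrozenSet[Transition]]) -> bool:
    """Check generalized Buchi acceptance for a run of the form prefix.cycle^omega.

    Only the transitions on the repeated cycle occur infinitely often, so the
    run is accepting iff the cycle intersects every accepting set.
    """
    infinitely_often = set(cycle)
    return all(infinitely_often & f for f in accepting_sets)


# Hypothetical automaton for "see a and b infinitely often".
t_a = ("q0", "a", "q1")
t_b = ("q1", "b", "q0")
t_idle = ("q0", "none", "q0")
F = [frozenset({t_a}), frozenset({t_b})]

good_cycle = [t_a, t_b]   # alternates a and b forever
bad_cycle = [t_idle]      # idles forever and never takes an accepting transition
```

Here `accepts_lasso(good_cycle, F)` holds because the cycle touches both accepting sets, while `accepts_lasso(bad_cycle, F)` does not.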
Definition 2
A TGBA is a limitdeterministic Büchi automaton (LDBA) if , , and

and for all and ,

for all and ,

for all .

if for , and , then .
II-B Labeled Markov Decision Process
To capture the robot motion and working properties, we use a continuous labeled Markov decision process with discrete time to describe the dynamics of the robot and its interaction with the environment [22].
Definition 3
A continuous labeled Markov decision process (MDP) is a tuple $\mathcal{M} = (S, A, P, R, \gamma, AP, L)$, where $S$ is a continuous state space, $A$ is a continuous action space, $P: S \times A \to \mathcal{P}(S)$ is a transition probability kernel with $P(\cdot \mid s, a)$ defining the next-state distribution of taking action $a$ at state $s$, the function $R$ specifies the reward, $\gamma$ is a discount factor, $AP$ is the set of atomic propositions, and $L: S \to 2^{AP}$ is the labeling function that returns the propositions that are satisfied at a state $s \in S$. Here $\mathcal{P}(S)$ denotes the set of all probability measures over $S$.
The labeling function $L$ is used to assign labels from the set of atomic propositions $AP$ to each state in the state space $S$. Given a sequence of states $\mathbf{s} = s_0 s_1 s_2 \cdots$, a sequence of symbols $L(s_0) L(s_1) L(s_2) \cdots$, called the trace of $\mathbf{s}$, can be generated to verify whether it meets an LTL specification $\varphi$. If the trace of $\mathbf{s}$ satisfies $\varphi$, we also write $\mathbf{s} \models \varphi$.
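As an illustration of how a labeling function turns a state sequence into a trace, the sketch below labels planar positions by the axis-aligned regions that contain them. The region names, their coordinates, and the helper names `label` and `trace` are made up for this example and do not come from the paper.

```python
from typing import Dict, FrozenSet, List, Tuple

State = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max)

# Hypothetical regions; names and coordinates are illustrative only.
REGIONS: Dict[str, Box] = {
    "a": (0.0, 1.0, 0.0, 1.0),
    "b": (2.0, 3.0, 2.0, 3.0),
}


def label(state: State) -> FrozenSet[str]:
    """Return the set of atomic propositions that hold at `state`."""
    x, y = state
    return frozenset(
        name for name, (x0, x1, y0, y1) in REGIONS.items()
        if x0 <= x <= x1 and y0 <= y <= y1
    )


def trace(states: List[State]) -> List[FrozenSet[str]]:
    """Map a finite state sequence to its sequence of labels."""
    return [label(s) for s in states]


path = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5)]
word = trace(path)  # labels: {"a"}, then the empty set, then {"b"}
```

Each element of `word` is a symbol in the power set of the atomic propositions, which is the alphabet the automaton reads.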
Definition 4
A deterministic policy $\pi$ of a labeled MDP is a function $\pi: S \to A$ that maps a state $s \in S$ to an action $a \in A$.
Given a labeled MDP, we can define the accumulated reward starting from state as
II-C Product MDP
Considering a robot operating in a working space to accomplish a high-level complex task described by an LTL-equivalent LDBA $\mathcal{A}$, the mobility of the robot is captured by a labeled MDP $\mathcal{M}$ defined as above. We can combine the labeled MDP and the LDBA to obtain a product MDP.
Definition 5
A product MDP between a labeled MDP and an LDBA is a tuple
where

is the set of states,

is the set of actions,

is the transition probability kernel defined as
for all ,

is the reward function, and

(), where for all is a set of accepting conditions.
Likewise, a run of a product MDP is an infinite sequence of transitions of the form
where . We say that is an accepted run, denoted by , if for all , where is the set of transitions that occur infinitely often in .
II-D Problem Formulation
We consider the problem in which a robot and its environment are modelled as an MDP , and the robot task is specified as an LTL formula . Given an initial state , we define the probability of an MDP satisfying under a policy from as
where is the set of all infinite sequences of states of the MDP that are induced from the policy . We say a formula is satisfied by a policy at if . If such a policy exists, we say that is satisfiable at .
Then the problem we address in this paper is as follows.
Problem 1
Given a continuous labeled MDP $\mathcal{M}$ and an LTL specification $\varphi$, find a policy $\pi$ such that $\varphi$ is satisfied by $\pi$ at each initial state $s_0 \in S$ at which $\varphi$ is satisfiable.
As we have seen in Section II-A, an LTL formula $\varphi$ can be translated into an LDBA $\mathcal{A}$. Therefore, solving Problem 1 is equivalent to solving the following control problem for the corresponding product MDP of the given MDP $\mathcal{M}$ and $\mathcal{A}$ [3].
Consider the product MDP . We say that a policy for satisfies at , where and the initial state of , if , where
and is the set of all runs of the product MDP that are induced from the policy . If such a policy exists, we say that is satisfiable at .
Problem 2
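Given a continuous labeled MDP $\mathcal{M}$ and an LDBA $\mathcal{A}$ translated from an LTL specification $\varphi$, find a policy $\pi$ for the product MDP of $\mathcal{M}$ and $\mathcal{A}$ such that $\varphi$ is satisfied by $\pi$ at $(s_0, q_0)$ for each $s_0 \in S$ and the initial state $q_0$ of $\mathcal{A}$ such that $\varphi$ is satisfiable.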
III Reinforcement Learning Method
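For an MDP, the value of a state $s$ under a policy $\pi$, denoted as $V^{\pi}(s)$, is the expected return when starting from $s$ and following $\pi$ thereafter. We define $V^{\pi}$ formally as
$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} R(s_{t+k}, a_{t+k}) \,\Big|\, s_t = s\Big]$$
for all $s \in S$, where $R(s_{t+k}, a_{t+k})$ denotes the reward collected at step $t+k$. Similarly, the value $Q^{\pi}(s, a)$ for a policy $\pi$ is the value of taking action $a$ at state $s$ and following $\pi$ thereafter. It is defined as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} R(s_{t+k}, a_{t+k}) \,\Big|\, s_t = s,\ a_t = a\Big].$$
An optimal state-action value $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ is the maximum state-action value achieved by any policy for state $s$ and action $a$. Q-learning [30] is a method of finding the optimal strategy for an MDP. It learns the state-action value $Q(s, a)$ by using the update rule $Q(s, a) \leftarrow Q(s, a) + \alpha\,[R(s, a) + \gamma Q(s', a^{*}) - Q(s, a)]$, where $\alpha$ is a learning rate, $s'$ is the next state of taking action $a$ at state $s$, and $a^{*}$ is the best action at $s'$ according to the current $Q$ values.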
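A single step of the tabular Q-learning update can be sketched as follows. This is a minimal illustration of the standard rule from [30] on a finite toy problem, not part of the continuous-space method of this paper; the function name `q_update`, the state and action names, and the numerical values are assumptions made for the example.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, s: Hashable, a: Hashable, reward: float,
             s_next: Hashable, actions: Sequence[Hashable],
             alpha: float = 0.5, gamma: float = 0.9) -> float:
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Value of the best action at the next state under the current estimates.
    best_next = max(q[(s_next, a2)] for a2 in actions)
    td_target = reward + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


q_table: QTable = defaultdict(float)
actions = ["left", "right"]
q_table[("s1", "right")] = 2.0
# Target is 1.0 + 0.9 * 2.0 = 2.8, so Q(s0, right) moves halfway from 0 to 2.8.
new_value = q_update(q_table, "s0", "right", reward=1.0, s_next="s1", actions=actions)
```

With a learning rate of 0.5, the updated entry is halfway between the old estimate and the bootstrapped target.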
The deep deterministic policy gradient method (DDPG) [25] introduces a parameterized function $\mu(s \mid \theta^{\mu})$, called an actor, to represent the policy using a deep neural network. A critic $Q(s, a \mid \theta^{Q})$, which also uses a deep neural network with a parametric vector $\theta^{Q}$, is used to represent the action-value function. The critic is updated by minimizing the following loss function:
where such that and is the state distribution under policy . The objective function of the deterministic policy defined as
is used to evaluate the performance of a policy for the MDP. According to the Deterministic Policy Gradient Theorem [25],
and the deterministic policy can be updated by
where is a learning rate. By applying the chain rule,
It is stated in [15] that we can use the critic to approximate the objective function of the policy, which means . As a result, we can update the parameters of the actor using
(2) 
The DDPG method moves the parameter vector greedily in the direction of the gradient of and is more efficient in solving MDP problems with continuous state and action spaces. As a result, we propose a learning method to solve motion planning problems with LTL specifications based on DDPG.
IV Reinforcement Learning with LDBA-Guided Reward Shaping
In this section, we introduce our method of solving a continuous state and action MDP with LTL specifications using deep reinforcement learning. The LTL specification is transformed into an annotated LDBA, and a reward function is defined on the annotated LDBA for reward shaping in order to train the networks with dense rewards.
IV-A Reward Shaping
Our definition of the reward function for the product MDP depends on an annotated LDBA defined as follows.
Definition 6
An annotated LDBA is an LDBA augmented by (), where and () is a function assigning 0 or 1 to all the edges of according to the following rules:
For any , the map , which corresponds to , assigns 1 to all the accepting transitions in and 0 to all others. The set defined above, however, only marks the accepting transitions but not the other transitions that can be taken so that the accepting transitions can happen in some future steps. In order to also identify such transitions to guide the design of the reward function of the product MDP, we provide the following Algorithm 1 to preprocess the set of boolean maps.
The function in line 1 of Algorithm 1 is defined to gradually mark every state in that has outgoing transitions annotated by 1. For each set , the function is initialized (in line 3 and 4) to 1 for any state that has at least one accepting outgoing transition and 0 for any other states. By using , the loop from line 5 to 12 in Algorithm 1 marks backwardly the state with no outgoing transition marked 1 (i.e., ), through which the accepting transitions can be taken. The loop terminates in a finite number of steps since the set of states is finite and can only be marked to 1 not . After running Algorithm 1, for each , the map marks 1 to the transitions that either are accepting in or can lead to the occurrence of accepting transitions in . A state with after the end of th for loop (for all ) is called a trap, because accepting transitions do not occur in any run that passes through .
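Since Algorithm 1 is described here only in prose, the following is a minimal sketch of the backward marking idea for a single accepting set, not the authors' listing. The function name `annotate`, the tuple encoding of transitions, and the toy automaton are assumptions made for illustration; a trap in the sense above would be a state left unmarked for every accepting set.

```python
from typing import Dict, Iterable, Set, Tuple

Transition = Tuple[str, str, str]  # (source, symbol, target)


def annotate(states: Iterable[str],
             transitions: Iterable[Transition],
             accepting: Set[Transition]) -> Tuple[Dict[Transition, int], Set[str]]:
    """Mark transitions that are accepting or can still lead to an accepting one.

    Returns the 0/1 map over transitions and the set of states left unmarked.
    """
    transitions = list(transitions)
    edge_mark = {t: int(t in accepting) for t in transitions}
    # A state is marked once it has an outgoing transition annotated by 1.
    marked = {src for (src, _, _) in accepting}
    changed = True
    while changed:  # terminates: marks only flip from 0 to 1 on a finite graph
        changed = False
        for t in transitions:
            src, _, dst = t
            if dst in marked and edge_mark[t] == 0:
                edge_mark[t] = 1          # this edge leads toward acceptance
                changed = True
            if edge_mark[t] == 1 and src not in marked:
                marked.add(src)           # propagate the mark backward
                changed = True
    return edge_mark, set(states) - marked


# Hypothetical automaton: q1 has an accepting self-loop, q2 is a sink.
states = ["q0", "q1", "q2"]
transitions = [("q0", "a", "q1"), ("q1", "b", "q1"),
               ("q0", "c", "q2"), ("q2", "true", "q2")]
edge_mark, unmarked = annotate(states, transitions, accepting={("q1", "b", "q1")})
```

In this toy example the edge from `q0` to `q1` is annotated with 1 because it leads to the accepting self-loop, while `q2` stays unmarked.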
Since any accepting run of an LDBA should contain infinitely many transitions from each , the status that whether there is at least one transition in any is taken should be tracked. For this purpose, we let be a Boolean vector of size and be the th element in , where is the number of subsets in the accepting condition and . The vector is initialized to all ones and is updated according to the following rules:

If a transition in set is taken, then .

If all elements in are 0, reset to all ones.
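The two update rules for the tracking vector can be sketched as below, assuming (as the surrounding text suggests) that an entry equal to 1 means the corresponding accepting set has not yet been visited in the current round. The function name `update_tracker` and the list encoding are hypothetical.

```python
from typing import List


def update_tracker(tracker: List[int], hit_sets: List[int]) -> List[int]:
    """Update the Boolean tracking vector after one automaton transition.

    tracker[j] == 1 means accepting set j has not been visited in this round.
    hit_sets lists the indices of the accepting sets containing the transition.
    """
    updated = list(tracker)
    for j in hit_sets:
        updated[j] = 0                    # set j has now been visited
    if not any(updated):                  # every set visited: start a new round
        updated = [1] * len(updated)
    return updated


v = [1, 1]
v = update_tracker(v, [0])   # a transition in the first accepting set is taken
after_first = list(v)
v = update_tracker(v, [1])   # then one in the second accepting set
after_second = list(v)
```

After the first transition only the second set remains to be visited; after the second transition all entries would be zero, so the vector is reset to all ones.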
Now we define a function that is updated by vector as follows:
(3) 
For a transition , if and only if there exists an that has not been visited (i.e., ) and .
Based on the above definitions, the reward function of the product MDP is defined as:
(4) 
where , , , the numbers and satisfy , , the function is given in (3). The set is given by
(5) 
The term measures the distance from the MDP state to the set , where denotes the distance between the states and .
The large positive number is used to reward taking an accepting transition or a transition that can lead to an accepting one, the small negative number is used to guide the transitions in the state space of the MDP to encourage the occurrence of the desired transitions between LDBA states, and the negative reward will be collected if the corresponding run in the LDBA hits a trap.
IV-B The Proposed Algorithm
The authors of [6] propose a method that initializes each episode with the initial Rabin state for a discrete product MDP model. The approach in [21] also periodically resets the Rabin state to the initial state. The main drawback of doing this is that we can only obtain a good reward if a training episode produces a trajectory that successfully reaches an accepting state in the DRA. However, for a product MDP with continuous state and action spaces, the sample complexity of obtaining such a satisfactory trajectory is too high when we have a complex LTL formula and, consequently, we cannot collect enough reward to train the neural networks to a good performance. As a result, at the beginning of each episode, we sample an automaton state instead of using the initial state given by the translated automaton. The initial state of the product MDP is then constructed from this sampled automaton state.
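The episode initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_initial_product_state`, the uniform choice over automaton states, and the square workspace are all hypothetical, and which automaton states are eligible for sampling (for example, whether traps are excluded) is not specified here.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

S = TypeVar("S")


def sample_initial_product_state(sample_mdp_state: Callable[[random.Random], S],
                                 automaton_states: Sequence[str],
                                 rng: random.Random) -> Tuple[S, str]:
    """Draw an MDP state and an arbitrary automaton state for a new episode."""
    s0 = sample_mdp_state(rng)
    q = rng.choice(automaton_states)  # not pinned to the automaton's initial state
    return s0, q


def sample_xy(rng: random.Random) -> Tuple[float, float]:
    """Hypothetical workspace: a 10 x 10 square."""
    return rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)


rng = random.Random(0)
ldba_states = ["q0", "q1", "q2"]
starts = [sample_initial_product_state(sample_xy, ldba_states, rng) for _ in range(200)]
visited_q = {q for _, q in starts}
```

Over many episodes every automaton state appears as a starting point, so later stages of the task contribute reward even before the policy can complete the earlier stages.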
(6) 
(7) 
(8) 
(9) 
We use DDPG [15] to train the neural networks. As in most reinforcement learning algorithms, the training data should be independently and identically distributed, so a replay buffer is used here for storing only the most recent steps of transition data [17]. At each time step, the transition tuple is stored into the buffer and a batch of data is uniformly sampled from the buffer for training the networks. As shown in lines 16 and 17 of Algorithm 2, the critic is updated by minimizing the loss function of the neural network, and the actor is updated such that the average value is used to approximate the expectation as in Eq. (2). It was discussed in [18] that directly implementing deep Q-learning with neural networks is unstable because the Q value is also used for policy network training. As a result, a small change in the Q value may significantly change the policy and therefore change the data distribution. The authors proposed a way of solving this issue by cloning the network to obtain a target network after each fixed number of updates. This modification makes the algorithm more stable compared with standard deep Q-learning. We use two target networks, one for the actor and one for the critic, as in [15]. The target networks are copied from the actor and critic networks in the beginning, and the weights of both target networks are updated after every several steps by using $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$ and $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$ with $\tau \ll 1$.
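The two bookkeeping pieces mentioned in this paragraph, a bounded replay buffer with uniform sampling and a slowly moving target network, can be sketched in plain Python as below. This is an illustration under assumptions rather than the paper's implementation: the class `ReplayBuffer`, the function `soft_update`, the flat weight lists, and the soft-update form with mixing factor `tau` (the usual DDPG choice from [15]) are all introduced here for the example.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

Transition = Tuple[tuple, tuple, float, tuple]  # (state, action, reward, next_state)


class ReplayBuffer:
    """Fixed-size buffer that keeps only the most recent transitions."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._data: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._data.append(transition)  # the oldest entry is dropped when full

    def sample(self, batch_size: int) -> List[Transition]:
        """Uniformly sample a batch without replacement."""
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self) -> int:
        return len(self._data)


def soft_update(target: List[float], source: List[float], tau: float) -> List[float]:
    """Move the target weights a small step toward the source weights."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(source, target)]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(((step,), (0.0,), -1.0, (step + 1,)))
batch = buffer.sample(2)
new_target = soft_update(target=[0.0, 0.0], source=[1.0, 2.0], tau=0.5)
```

In practice `tau` would be much smaller than the 0.5 used in this toy call, so that the target networks change slowly relative to the actor and critic.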
The proposed method to solve a continuous MDP with LTL specifications is summarized in Algorithm 2.
IV-C Analysis of the Algorithm
While DDPG does not offer any convergence guarantees for approximating a general nonlinear value function, we prove in this section that, if the MDP is finite (e.g., obtained as a finite approximation of the underlying continuous-state MDP), the reward function defined by (4) does characterize Problem 2 correctly in the sense that the optimal policy satisfies the formula at each state at which the formula is satisfiable.
Theorem 1
Let be an LTL formula and be the product MDP formed from the MDP and an LDBA translation encoding . Then there exists some , , and given in (IVA) such that for all the optimal policy on satisfies for each initial state such that is satisfiable.
(Sketch of proof) Suppose that () are two policies such that has probability of satisfying (i.e. producing accepting runs on ) from an initial state . We show that if , then implies . Suppose that this is not the case, i.e., and . We have
where denotes a run of the product MDP under starting from .
By carefully estimating the accumulated reward, we can get an upper bound for and an lower bound for as follows:
where , is the maximum value that can be taken by in (IVA), and , are constants (depending on the product MDP). Since , there exists and a choice of a sufficiently large (depending on and other constants) such that for all . This contradicts .
Remark 1
Note that this result does not offer guarantees that a policy that maximizes for all also maximizes the satisfaction probability for all . Nonetheless, we guarantee that the optimal policy always satisfies the formula, provided that the formula is satisfiable. Our formulation is consistent with that in [19]. For future work, we can investigate how to integrate the reward formulation in [7] and those in this paper to maximize satisfaction probability.
V Simulation Results
In this section, we test the proposed method with different LTL specifications using a car-like robot as in [14]:
(10) 
where is the planar position of center of the vehicle, is its orientation, the control variables and are the velocity and steering angle, respectively, and . The state space is and the control space is .
V-A Example 1
In the first example, we test our algorithm with a simple LTL specification
(11) 
where and are two regions in the working space. This LTL formula specifies that the robot must reach first and then reach . We compare our algorithm, which samples a random automaton state, with the standard method, which resets the automaton to its initial state at the beginning of each episode. The neural networks are trained for 1 million steps with 200 steps in each episode. The simulation step is s. We use and for the reward function as in Eq. (4). The simulation result of Example 1 is presented in Fig. 1. The areas marked in blue are the regions and . We show the trajectories from an initial point at in Fig. 1(a). The black curve is the trajectory generated by fixing the automaton state at the beginning of each episode and the red one is the trajectory from our method. For this simple LTL specification, both approaches provide a successful trajectory. Fig. 1(b) shows the normalized reward during training for both approaches. The blue curve is the normalized reward for the standard method and the red curve is for our method. Our method collects a normalized reward of for 500k steps of training and for 1M steps of training, while the standard method obtains a normalized reward of and for 500k steps and 1M steps, respectively. The runtime of both methods is the same because it is determined by the number of training steps. The success rates of both methods are presented in TABLE I. We can see that our method achieves better performance in the same amount of time as the standard method.
TABLE I: Success rate
Method | Success rate
Standard method | 76.7%
Our method | 83.3%
V-B Example 2
In the second example, we test our algorithm using the following LTL specification:
(12) 
where , , and are four areas in the working space. In other words, we want the robot to visit , , , sequentially.
We train the neural networks for 1 million steps. The system is also simulated using a time step s. The reward function is the same as in the first example, where and . For the standard method, there are 600 steps in each episode. We increased the number of steps in each episode so that a trajectory is long enough to satisfy the whole LTL specification. In our method, we still have 200 steps in each episode. The trajectory generated by our method is shown in Fig. 2. The success rates of both methods are presented in TABLE II.
TABLE II: Success rate
Method | Success rate
Standard method | 13.3%
Our method | 63.3%
V-C Example 3
In the third example, we test our algorithm using the following LTL specification:
(13) 
In plain words, the specification encodes that the robot must reach either or first. If it reaches first, then it must next reach without any other restrictions. If it reaches first, then it has to reach without entering . We consider this specification with two different layouts of the regions , , and .
Case 1
In Case 1, , and are three goals marked in blue and is a restricted area in the working space marked in yellow, as in Fig. 3(a) and Fig. 3(b).
The reward function is of and . Since we have a constraint of not entering region if it reaches , we assign if this happens. We also train the networks for 1 million steps with 200 steps in each episode. The simulation time step is s. The trajectories generated from two initial states and are shown in Fig. 3(a) and Fig. 3(b). The initial point at is closer to region , so the trajectory reaches first. According to the LTL specification, it has to avoid before reaching if it reaches first. For the initial point at , the trajectory first reaches region and can then reach without avoiding . The simulation results show that our method can successfully generate a policy that satisfies the LTL specification for different initial points. The algorithm learns that the trajectory should choose the target that is closer to the initial point between and and then reach according to the specification.
Case 2
In Case 2, , , are the same regions as in Case 1. Region is the area of . As in Fig. 3(c), , , are marked in blue and is the yellow area plus the area of region . As shown in the figure, is enclosed by , which means that if a trajectory enters , it will also be in . This implies that the automaton will be trapped in the deadlock between and and will never reach the accepting condition, as can be seen from the LDBA of the specification. The learning algorithm is able to figure out that even if the initial point is closer to , it still needs to reach first. The result is shown in Fig. 3(c). The success rates for both Case 1 and Case 2 are presented in TABLE III.



TABLE III: Success rate
Method | Success rate (Case 1) | Success rate (Case 2)
Standard method | 10% | 6.7%
Our method | 63.3% | 60%
VI Conclusions
In this paper, we proposed a learning method for motion planning problems with LTL specifications and continuous state and action spaces. The LTL specification is converted into an annotated LDBA and the deep deterministic policy gradient method is used to train a policy on the resulting product MDP. The annotated LDBA is used to continuously shape the reward so that dense rewards are available for training. We sample a state from the annotated LDBA at the beginning of each training episode. We use a car-like robot to test our algorithm in simulation with three LTL specifications under different working configurations and initial positions. Simulation results show that our method achieves successful trajectories for each of the specifications. We also found in our simulations that the algorithm sometimes fails to deal with complex working configurations such as nonconvex obstacles. In future work, we will focus on improving the algorithm to deal with more complex scenarios and LTL specifications.
References
[1] (2008) Principles of model checking. MIT Press.
[2] (2010) Sampling-based motion planning with temporal goals. In Proc. of ICRA, pp. 2689–2696.
[3] (1995) The complexity of probabilistic verification. J. ACM 42 (4), pp. 857–907.
[4] (2016) Benchmarking deep reinforcement learning for continuous control. In Proc. of ICML, pp. 1329–1338.
[5] (2009) Temporal logic motion planning for dynamic robots. Automatica 45 (2), pp. 343–352.
[6] (2019) Reduced variance deep reinforcement learning with temporal logic specifications. In Proc. of ICCPS, pp. 237–248.
[7] (2019) Omega-regular objectives in model-free reinforcement learning. In Proc. of TACAS, pp. 395–412.
[8] (2018) Logically-constrained neural fitted Q-iteration. arXiv preprint arXiv:1809.07823.
[9] (2019) Certified reinforcement learning with logic guidance. arXiv preprint arXiv:1902.00778.
[10] (2019) Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. arXiv preprint arXiv:1909.05304.
[11] (2011) Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research 30 (7), pp. 846–894.
[12] (2020) Formal controller synthesis for continuous-space MDPs via model-free reinforcement learning. In 2020 ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS), pp. 98–107.
[13] (2018) A policy search method for temporal logic specified reinforcement learning tasks. In Proc. of ACC, pp. 240–245.
[14] (2018) Robustly complete synthesis of memoryless controllers for nonlinear systems with reach-and-stay specifications. arXiv preprint arXiv:1802.09082.
[15] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[16] (2004) Automatic synthesis of multi-agent motion tasks based on LTL specifications. In Proc. of CDC, Vol. 1, pp. 153–158.
[17] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[18] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
[19] (2020) Reinforcement learning of control policy for linear temporal logic specifications using limit-deterministic Büchi automata. arXiv preprint arXiv:2001.04669.
[20] (2005) Natural actor-critic. In European Conference on Machine Learning, pp. 280–291.
[21] (2014) A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. In Proc. of CDC, pp. 1091–1096.
[22] (2002) Probabilistic Robotics. The MIT Press.
[23] (2016) Limit-Deterministic Büchi Automata for Linear Temporal Logic. In Proc. of CAV, pp. 312–332.
[24] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
[25] (2014) Deterministic policy gradient algorithms. In ICML.
[26] (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
[27] (1998) Introduction to reinforcement learning. Vol. 135, MIT Press, Cambridge.
[28] (2018) Reinforcement learning: an introduction. MIT Press.
[29] (2000) Policy gradient methods for reinforcement learning with function approximation. In Proc. of NeurIPS, pp. 1057–1063.
[30] (1989) Learning from delayed rewards. Ph.D. Thesis, King’s College, Cambridge.
[31] (2009) Receding horizon temporal logic planning for dynamical systems. In Proc. of CDC, pp. 5997–6004.
[32] (2019) Modular deep reinforcement learning with temporal logic specifications. arXiv preprint arXiv:1909.11591.