Modular Deep Reinforcement Learning with Temporal Logic Specifications
Abstract
We propose an actor-critic, model-free, and online Reinforcement Learning (RL) framework for continuous-state continuous-action Markov Decision Processes (MDPs) when the reward is highly sparse but encompasses a high-level temporal structure. We represent this temporal structure by a finite-state machine and construct an on-the-fly synchronised product with the MDP and the finite machine. The temporal structure acts as a guide for the RL agent within the product, where a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy. We evaluate our framework in a Mars rover experiment and we present the success rate of the synthesised policy.
Introduction
Deep reinforcement learning is an emerging paradigm for autonomously solving decision-making tasks in complex and unknown environments. However, tasks featuring extremely delayed rewards are often difficult, if at all possible, to solve with monolithic learning in Reinforcement Learning (RL). A well-known example is the Atari game Montezuma’s Revenge, in which deep RL methods such as [deepql] failed to score even once.
Despite their generality, deep RL methods are not a natural representation of how humans perceive these problems, since humans already have prior knowledge and associations regarding elements and their corresponding function, e.g. “keys open doors” in Montezuma’s Revenge. These simple yet critical high-level temporal associations, present in Montezuma’s Revenge and in a large number of complex real-world problems, can enrich the initial knowledge of a deep RL agent about the problem, so that it efficiently finds the globally optimal policy while avoiding exhaustive and unnecessary exploration in the beginning.
These hierarchies, sometimes called options [sutton], can be encoded in general RL algorithms to solve such complex problems. Practical approaches in hierarchical RL depend on state representations and on whether they are simple or structured enough such that suitable reward signals can be effectively engineered by hand. This means that these methods often require detailed supervision in the form of explicitly specified high-level actions or intermediate supervisory signals [precup, options_1, options_h_1, options_2, options_h_2, modular].
In this paper we propose a fully-unsupervised one-shot online learning framework for deep RL, where the learner is presented with a composable high-level mission task in a continuous-state and continuous-action MDP. The mission task is specified in the form of a Linear Temporal Logic (LTL) property, namely a formal, ungrounded, and symbolic representation of the task and of its components. Without requiring any supervision, each component of the LTL property systematically structures any complex mission task into low-level, achievable task “modules”. The LTL property essentially acts as a high-level unsupervised guide for the agent, whereas the low-level planning is handled by a deep RL scheme.
LTL is a rich specification language that can formally express a wide range of time-dependent logical properties which are quite similar to patterns in natural language [natural2LTL, natural2LTL2, natural2LTL3]. Examples include safety, liveness and cyclic properties, where the agent is required to make progress (liveness) while executing components for critical sections (safety) or to perform a sequence of tasks periodically (cyclic).
In order to synchronise the high-level LTL guide with RL, we convert the LTL property to an automaton, namely a finite-state machine accepting sequences of symbols [bible]. Once the automaton is generated from the given LTL property, we construct on-the-fly a synchronous product between the MDP and the automaton and then automatically define a reward function based on the structure of the automaton. Here, on-the-fly means that the algorithm tracks (or executes) the state of an underlying structure (or a function) without explicitly constructing it. From this algorithmic reward-shaping procedure, an RL agent is able to accomplish highly complex tasks with no supervisory assistance.
The closest lines of work are the model-based [topku, dorsa] and model-free [logicalconstraint] approaches in RL that constrain the agent with a temporal logic property. However, these approaches are limited to finite-state finite-action MDPs, an assumption we will relax throughout this work. Another related work is [modular]’s policy-sketch-based method, which learns easy instruction-based tasks first and eventually composes them together to accomplish a more complex task. In this work, instead, the complex task can be expressed as an LTL property to guide the learning and to generate a policy with no need to start from easy tasks and later join them together.
In addition to the described setup, conventional RL is mostly focused on problems in which the set of states of the Markov Decision Process (MDP) and the set of possible actions are finite. However, many real-world problems require continuous real-valued actions to be taken in response to high-dimensional and real-valued state observations. To tackle problems with continuous state and action spaces, the most immediate method is to discretise the state and action spaces of the MDP [abate2010approximate, abate2015quantitative] and to rely on conventional deep RL methods. Although this discretisation method works well for many problems [faust, stochy], the produced discrete MDP is only an approximation and might not capture the full dynamics of the original MDP, which can be essential for optimally solving the original problem. Further, the number of discrete actions increases exponentially with the number of degrees of freedom [DDPG]; a similar consideration holds for the state space. Thus, discretisation of MDPs generally suffers from the trade-off between accuracy and the curse of dimensionality.
To tackle this issue, in this work we propose a modular Deep Deterministic Policy Gradient (DDPG) based on the results in [DPG, DDPG]. This modular DDPG is the first actor-critic algorithm using deep function approximators that can learn policies in continuous action and state spaces while jointly optimising over LTL task-specific sub-policies.
Problem Framework
Definition 1 (General MDP)
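The tuple $\mathfrak{M} = (\mathcal{S}, \mathcal{A}, s_0, P, \mathcal{AP}, L)$ is a general MDP over a set of continuous states $\mathcal{S}$, where $\mathcal{A}$ is a set of continuous actions, and $s_0 \in \mathcal{S}$ is the initial state. $P$ is a Borel-measurable conditional transition kernel which assigns to any pair of state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$ a probability measure $P(\cdot \mid s, a)$ on the Borel space $(\mathcal{S}, \mathcal{B}(\mathcal{S}))$. $\mathcal{AP}$ is a finite set of atomic propositions and a labelling function $L: \mathcal{S} \to 2^{\mathcal{AP}}$ assigns to each state $s \in \mathcal{S}$ a set of atomic propositions $L(s) \subseteq \mathcal{AP}$ [shreve].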
Definition 2 (Path)
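An infinite path $\rho$ starting at $s_0$ is a sequence of states $\rho = s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} \ldots$ such that every transition $s_i \xrightarrow{a_i} s_{i+1}$ is allowed in $\mathfrak{M}$, i.e. $s_{i+1}$ belongs to the smallest Borel set $B$ such that $P(B \mid s_i, a_i) = 1$.
At each state $s \in \mathcal{S}$, an agent behaviour is determined by a Markov policy $\pi$, which is a mapping from states to a probability distribution over the actions, i.e. $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$. If $\mathcal{P}(\mathcal{A})$ is a degenerate distribution then the policy $\pi$ is said to be deterministic.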
Definition 3 (Expected Discounted Reward)
For a policy $\pi$ on an MDP $\mathfrak{M}$, the expected discounted reward is defined as [sutton]:
$$U^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{n=0}^{\infty} \gamma^{n} R(s_n, a_n) \,\Big|\, s_0 = s\Big], \tag{1}$$
where $\mathbb{E}^{\pi}[\cdot]$ denotes the expected value given that the agent follows policy $\pi$, $\gamma$ is a discount factor, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward, and $s_0, a_0, s_1, a_1, \ldots, s_n, a_n$ is the sequence of state-action pairs generated by policy $\pi$ up to time step $n$.
The function $U^{\pi}(s)$ is often referred to as the value function (under the policy $\pi$). Another closely related notion in RL is the action-value function $Q^{\pi}(s,a)$, which describes the expected discounted reward after taking an action $a$ in state $s$ and thereafter following policy $\pi$:
$$Q^{\pi}(s,a) = \mathbb{E}^{\pi}\Big[\sum_{n=0}^{\infty} \gamma^{n} R(s_n, a_n) \,\Big|\, s_0 = s,\, a_0 = a\Big].$$
Accordingly, the recursive form of the action-value function can be obtained as:
$$Q^{\pi}(s,a) = R(s,a) + \gamma\, \mathbb{E}\big[Q^{\pi}(s_1, a_1)\big], \tag{2}$$
where $s_1 \sim P(\cdot \mid s, a)$ and $a_1 = \pi(s_1)$. Q-learning (QL) [watkins] is the most extensively used model-free RL algorithm built upon (2), for MDPs with finite-state and finite-action spaces. For all state-action pairs QL initializes a Q-function $Q^{\beta}(s,a)$ with an arbitrary finite value, where $\beta$ is an arbitrary stochastic policy. QL is an off-policy RL scheme, namely policy $\beta$ has no effect on the convergence of the Q-function, as long as every state-action pair is visited infinitely many times. Thus, for the sake of simplicity, we may drop the policy index $\beta$ from the action-value function. Under mild assumptions, QL converges to a unique limit, and a greedy policy can be obtained as follows:
$$\pi^{*}(s) = \arg\max_{a \in \mathcal{A}} Q(s,a),$$
and $\pi^{*}$ corresponds to the optimal policy that is generated by Dynamic Programming (DP) [NDP] to maximise (1), when the MDP is fully known.
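As an illustration of the off-policy property described above, the following is a minimal tabular Q-learning sketch. It is not part of the proposed method: the four-state chain environment, the learning rate, and the purely random behaviour policy are all assumptions made for the example.

```python
import random
from collections import defaultdict

def q_learning(step, actions, start, episodes=500, horizon=20,
               alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning; `step(s, a)` returns (next_state, reward, done)."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q(s, a), initialised to an arbitrary finite value (0)
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            # Uniformly random behaviour policy: QL is off-policy, so the
            # samples only need to cover every state-action pair.
            a = rng.choice(actions)
            s_next, r, done = step(s, a)
            best_next = max(q[(s_next, b)] for b in actions)
            target = r if done else r + gamma * best_next
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
            if done:
                break
    return q

def greedy(q, actions, s):
    """Greedy policy: argmax over a of Q(s, a)."""
    return max(actions, key=lambda b: q[(s, b)])

# Hypothetical 4-state chain: the goal is state 3, reached by moving right.
def chain_step(s, a):
    s_next = min(3, s + 1) if a == "right" else max(0, s - 1)
    return s_next, (1.0 if s_next == 3 else 0.0), s_next == 3

q_table = q_learning(chain_step, ["left", "right"], start=0)
```

Although the samples are generated by a random policy, the greedy policy extracted from `q_table` moves right in every non-terminal state.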
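The DPG algorithm [DPG] introduces a parameterised function $\mu(s \mid \theta^{\mu})$ called actor to represent the current policy by deterministically mapping states to actions, where $\theta^{\mu}$ is the function approximation parameters for the actor function. Further, an action-value function $Q(s, a \mid \theta^{Q})$ is called critic and is learned as described next.
Assume that at time step $t$ the agent is at state $s_t$, takes action $a_t$, and receives a scalar reward $R(s_t, a_t)$. In case when the agent policy is deterministic, the recursion (2) can be approximated by parameterising $Q$ using a parameter set $\theta^{Q}$, i.e. $Q(s_t, a_t \mid \theta^{Q})$, and by minimizing the following loss function:
$$L(\theta^{Q}) = \mathbb{E}_{s_t \sim \rho^{\beta}}\Big[\big(Q(s_t, a_t \mid \theta^{Q}) - y_t\big)^{2}\Big], \tag{3}$$
where $\rho^{\beta}$ is the probability distribution of state visits over $\mathcal{S}$, under any given arbitrary stochastic policy $\beta$, and $y_t = R(s_t, a_t) + \gamma\, Q(s_{t+1}, a_{t+1} \mid \theta^{Q})$ such that $a_{t+1} = \pi(s_{t+1})$.
The actor is updated by applying the chain rule to the expected return with respect to the actor parameters as follows:
$$\nabla_{\theta^{\mu}} U^{\mu}(s_t) \approx \mathbb{E}_{s_t \sim \rho^{\beta}}\Big[\nabla_{\theta^{\mu}} Q(s, a \mid \theta^{Q})\big|_{s = s_t,\, a = \mu(s_t \mid \theta^{\mu})}\Big] = \mathbb{E}_{s_t \sim \rho^{\beta}}\Big[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_t,\, a = \mu(s_t)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_t}\Big]. \tag{4}$$
[DPG] has shown that this is a policy gradient, and therefore we can apply a policy gradient algorithm on the deterministic policy. DDPG further extends DPG by employing a deep neural network as function approximator and updating the network parameters via a “soft update” method, which is explained later in the paper.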
Linear Temporal Logic (LTL)
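We employ LTL to encode the structure of the high-level mission task and to automatically shape the reward function. An LTL formula is able to express a range of properties that are hard (if at all possible) to express by conventional or handcrafted methods in classical reward shaping [sutton, precup, options_h_2]. LTL formulae over a given set of atomic propositions $\mathcal{AP}$ are syntactically defined as [pnueli]
$$\varphi ::= \mathit{true} \mid \alpha \in \mathcal{AP} \mid \varphi \wedge \varphi \mid \neg \varphi \mid \bigcirc \varphi \mid \varphi\ \mathrm{U}\ \varphi, \tag{5}$$
where the operators $\bigcirc$ and $\mathrm{U}$ are called “next” and “until”, respectively.
We will next define the semantics of LTL formulae interpreted over MDPs. For a given path $\rho$, we define the $i$-th state of $\rho$ to be $\rho[i]$ where $\rho[i] = s_i$, and the $i$-th suffix of $\rho$ to be $\rho[i..]$ where $\rho[i..] = s_i \xrightarrow{a_i} s_{i+1} \xrightarrow{a_{i+1}} s_{i+2} \ldots$
Definition 4 (LTL Semantics)
For an LTL formula $\varphi$ and for a path $\rho$, the satisfaction relation $\rho \models \varphi$ is defined as
$$\begin{aligned}
\rho \models \alpha \in \mathcal{AP} &\iff \alpha \in L(\rho[0]),\\
\rho \models \varphi_1 \wedge \varphi_2 &\iff \rho \models \varphi_1 \text{ and } \rho \models \varphi_2,\\
\rho \models \neg \varphi &\iff \rho \not\models \varphi,\\
\rho \models \bigcirc \varphi &\iff \rho[1..] \models \varphi,\\
\rho \models \varphi_1\ \mathrm{U}\ \varphi_2 &\iff \exists j \geq 0:\ \rho[j..] \models \varphi_2 \text{ and } \forall i,\ 0 \leq i < j,\ \rho[i..] \models \varphi_1.
\end{aligned}$$
The operator next $\bigcirc$ requires $\varphi$ to be satisfied starting from the next-state suffix of $\rho$. The operator until $\mathrm{U}$ is satisfied over $\rho$ if $\varphi_1$ continuously holds until $\varphi_2$ becomes true. Using the until operator we can define two temporal modalities: (1) eventually, $\lozenge \varphi = \mathit{true}\ \mathrm{U}\ \varphi$; and (2) always, $\square \varphi = \neg \lozenge \neg \varphi$. LTL extends propositional logic using the temporal modalities until $\mathrm{U}$, eventually $\lozenge$, and always $\square$. For instance, constraints such as “eventually reach this point”, “visit these points in a particular sequential order”, or “always stay safe” are easily expressible by these modalities. Further, these modalities can be combined with logical connectives and nesting to provide more complex task specifications. Any LTL task specification $\varphi$ over $\mathcal{AP}$ expresses the following set of words:
$$\mathrm{Words}(\varphi) = \{\sigma \in (2^{\mathcal{AP}})^{\omega}\ \text{s.t.}\ \sigma \models \varphi\}.$$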
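The semantics of next, until, eventually and always can be made concrete with a small evaluator. The sketch below is an illustration only and rests on an assumption not made in the paper: the infinite word is ultimately periodic (a finite prefix followed by a loop that repeats forever), so that it has finitely many distinct suffixes. The tuple encoding of formulae is hypothetical.

```python
def holds(formula, word, loop_start, pos=0):
    """Evaluate an LTL formula on the word word[:loop_start] (word[loop_start:])^omega.

    Formulae are nested tuples: ("true",), ("ap", name), ("not", f),
    ("and", f, g), ("next", f), ("until", f, g).
    """
    def succ(i):
        # Position following i; the last position jumps back to the loop start.
        return i + 1 if i + 1 < len(word) else loop_start

    op = formula[0]
    if op == "true":
        return True
    if op == "ap":
        return formula[1] in word[pos]
    if op == "not":
        return not holds(formula[1], word, loop_start, pos)
    if op == "and":
        return (holds(formula[1], word, loop_start, pos)
                and holds(formula[2], word, loop_start, pos))
    if op == "next":
        return holds(formula[1], word, loop_start, succ(pos))
    if op == "until":
        i = pos
        # Every distinct suffix is reached within len(word) steps.
        for _ in range(len(word)):
            if holds(formula[2], word, loop_start, i):
                return True
            if not holds(formula[1], word, loop_start, i):
                return False
            i = succ(i)
        return False
    raise ValueError(f"unknown operator: {op}")

def eventually(f):
    return ("until", ("true",), f)      # eventually f = true U f

def always(f):
    return ("not", eventually(("not", f)))  # always f = not eventually not f

# Labels along a path: {} then {"a"}, then {"b"} repeated forever.
word = [set(), {"a"}, {"b"}]
b = ("ap", "b")
```

On this word, `eventually(b)` and `always(eventually(b))` hold, whereas `always(b)` does not, because the first two positions are not labelled with `b`.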
Definition 5 (LTL Policy Satisfaction)
We say that a stationary deterministic policy $\pi$ satisfies an LTL formula $\varphi$ if $\mathbb{P}[L(s_0) L(s_1) L(s_2) \ldots \in \mathrm{Words}(\varphi)] > 0$, where every transition $s_i \to s_{i+1}$, $i = 0, 1, \ldots$ is executed by taking action $\pi(s_i)$ at state $s_i$.
The set of associated words $\mathrm{Words}(\varphi)$ is expressible using a finite-state machine [bible]. The Limit Deterministic Büchi Automaton (LDBA) [sickert] is the state of the art in formal methods and has been proved to be the most succinct finite-state machine for this purpose [sickert2]. We first define a Generalized Büchi Automaton (GBA), then we formally introduce the LDBA.
Definition 6 (Generalized Büchi Automaton)
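A GBA $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \Delta)$ is a state machine, where $\mathcal{Q}$ is a finite set of states, $q_0 \subseteq \mathcal{Q}$ is the set of initial states, $\Sigma = 2^{\mathcal{AP}}$ is a finite alphabet, $\mathcal{F} = \{F_1, \ldots, F_f\}$ is the set of accepting conditions where $F_j \subseteq \mathcal{Q}$, $1 \leq j \leq f$, and $\Delta: \mathcal{Q} \times \Sigma \to 2^{\mathcal{Q}}$ is a transition relation.
Let $\Sigma^{\omega}$ be the set of all infinite words over $\Sigma$. An infinite word $w \in \Sigma^{\omega}$ is accepted by a GBA $\mathfrak{A}$ if there exists an infinite run $\theta \in \mathcal{Q}^{\omega}$ starting from $q_0$ where $\theta[i+1] \in \Delta(\theta[i], w[i])$, $i \geq 0$ and, for each $F_j \in \mathcal{F}$,
$$\mathit{inf}(\theta) \cap F_j \neq \emptyset, \tag{6}$$
where $\mathit{inf}(\theta)$ is the set of states that are visited infinitely often in the sequence $\theta$.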
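The generalised Büchi acceptance condition can be checked directly for runs that eventually cycle through a fixed set of states, since for such runs the states visited infinitely often are exactly those on the cycle. The following sketch assumes such an ultimately periodic run; the state names are hypothetical.

```python
def run_is_accepting(loop_states, accepting_sets):
    """Check the acceptance condition for a run that cycles through `loop_states` forever.

    For an ultimately periodic run, the set of states visited infinitely
    often is exactly the set of states on the cycle.
    """
    inf_states = set(loop_states)
    # Every accepting set must share at least one state with inf_states.
    return all(inf_states & set(f_j) for f_j in accepting_sets)

accepting_sets = [{"q1"}, {"q2"}]
```

A run cycling through `q1` and `q2` is accepted, whereas a run that stays in `q1` forever is not, because it never revisits the second accepting set.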
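Definition 7 (LDBA)
A GBA $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \Delta)$ is limit deterministic if $\mathcal{Q}$ can be partitioned into two disjoint sets $\mathcal{Q} = \mathcal{Q}_N \cup \mathcal{Q}_D$, such that [sickert]:
(i) $\Delta(q, \alpha) \subseteq \mathcal{Q}_D$ and $|\Delta(q, \alpha)| = 1$ for every state $q \in \mathcal{Q}_D$ and for every corresponding $\alpha \in \Sigma$,
(ii) for every $F_j \in \mathcal{F}$, $F_j \subseteq \mathcal{Q}_D$.
In other words, an LDBA is a GBA with two partitions: (1) initial ($\mathcal{Q}_N$), and (2) accepting ($\mathcal{Q}_D$). The accepting partition includes all accepting states, and all of its transitions are deterministic.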
Definition 8 (Nonaccepting Sink Component)
A non-accepting sink component of the LDBA $\mathfrak{A}$ is a directed graph induced by a set of states $Q \subseteq \mathcal{Q}$ such that (1) the graph is strongly connected; (2) it does not include all accepting sets $F_k$, $k = 1, \ldots, f$; and (3) there exists no other strongly connected set $Q' \subseteq \mathcal{Q}$, $Q' \neq Q$ such that $Q \subseteq Q'$. We denote the union of all non-accepting sink components of $\mathfrak{A}$ as $\mathbb{N}$.
The set $\mathbb{N}$ includes those components in the automaton that are surely non-accepting and impossible to escape from. Thus, reaching them is equivalent to not being able to satisfy the given LTL property.
Modular Deep RL
We consider a modular deep RL problem in which we exploit the structural information provided by the LTL specification by constructing a sub-policy for each state of the associated LDBA. Our proposed approach learns a satisfying policy without requiring any information about the grounding of the LTL task to be explicitly specified. Namely, the labelling assignment in Definition 1 is unknown a priori, and the algorithm solely relies on experience samples gathered on-the-fly.
Given an LTL mission task and an unknown continuous-state continuous-action MDP, we aim to synthesise a policy that satisfies the LTL specification. For the sake of clarity and to explain the core ideas of the algorithm, for now we assume that the MDP graph and the transition kernel are known: later these assumptions are entirely removed, and we stress that the algorithm can be run model-free. We relate the MDP and the automaton by synchronising them, in order to create a new structure that is, first, compatible with deep RL and, second, encompasses the given logical property.
Definition 9 (Product MDP)
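Given an MDP $\mathfrak{M} = (\mathcal{S}, \mathcal{A}, s_0, P, \mathcal{AP}, L)$ and an LDBA $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \Delta)$ with $\Sigma = 2^{\mathcal{AP}}$, the product MDP is defined as $\mathfrak{M}_{\mathfrak{A}} = \mathfrak{M} \otimes \mathfrak{A} = (\mathcal{S}^{\otimes}, \mathcal{A}, s_0^{\otimes}, P^{\otimes}, \mathcal{AP}^{\otimes}, L^{\otimes}, \mathcal{F}^{\otimes})$, where $\mathcal{S}^{\otimes} = \mathcal{S} \times \mathcal{Q}$, $s_0^{\otimes} = (s_0, q_0)$, $\mathcal{AP}^{\otimes} = \mathcal{Q}$, $L^{\otimes}: \mathcal{S}^{\otimes} \to 2^{\mathcal{Q}}$ such that $L^{\otimes}(s, q) = q$ and $\mathcal{F}^{\otimes} \subseteq \mathcal{S}^{\otimes}$ is the set of accepting states $\mathcal{F}^{\otimes} = \{F_1^{\otimes}, \ldots, F_f^{\otimes}\}$, where $F_j^{\otimes} = \mathcal{S} \times F_j$. The transition kernel $P^{\otimes}$ is such that given the current state $(s_i, q_i)$ and action $a$, the new state is $(s_j, q_j)$, where $s_j \sim P(\cdot \mid s_i, a)$ and $q_j \in \Delta(q_i, L(s_j))$.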
By constructing the product MDP we synchronise the current state of the MDP with the state of the automaton. This allows us to evaluate the (partial) satisfaction of the corresponding LTL property (or parts thereof), and consequently to modularise the high-level task into sub-tasks. Hence, with a proper reward assignment derived from the LTL property and its associated LDBA, the agent is able to break down a complex task into a set of easy sub-tasks. We elaborate further on task modularisation in the next section.
Note that the automaton transitions can be executed just by reading the label of the visited states, which makes the agent aware of the automaton state without explicitly constructing the product MDP. Thus, the proposed approach can run “model-free”, and as such it does not require any initial knowledge about the MDP.
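The on-the-fly tracking of the automaton state can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the deterministic automaton, its proposition names, the one-dimensional environment and the labelling function are all hypothetical.

```python
# Hypothetical deterministic automaton for "visit a, then b, then c".
# Keys are (automaton_state, proposition); missing keys mean "stay put".
DELTA = {("q0", "a"): "q1", ("q1", "b"): "q2", ("q2", "c"): "q3"}

def automaton_step(q, label):
    """Advance the automaton by reading the label of the newly visited state."""
    for prop in sorted(label):
        if (q, prop) in DELTA:
            return DELTA[(q, prop)]
    return q

def product_step(env_step, labelling, s, q, action):
    """One transition of the product MDP, computed on the fly.

    `env_step(s, a)` samples the next MDP state; the product state
    (s', q') is only tracked, the product MDP is never enumerated.
    """
    s_next = env_step(s, action)
    q_next = automaton_step(q, labelling(s_next))
    return s_next, q_next

# Hypothetical 1-D environment: regions around 1, 2 and 3 carry labels a, b, c.
def env_step(s, action):
    return s + action

def labelling(s):
    return {1: {"a"}, 2: {"b"}, 3: {"c"}}.get(int(s), set())

state = (0.0, "q0")
trace = []
for _ in range(3):
    state = product_step(env_step, labelling, *state, 1.0)
    trace.append(state[1])
```

Only the pair of current MDP state and current automaton state is stored, which is what allows the approach to run without a model of the MDP.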
In the following we define an “on-the-fly” LTL-driven reward function, emphasising that the agent does not need to know the model structure or the transition probabilities (or their product). Before introducing a reward assignment for the RL agent, we need to present the ensuing function:
Definition 10 (Accepting Frontier Function)
For an LDBA $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \Delta)$, we define $Acc$ as the accepting frontier function, which executes the following operation over a given state $q \in \mathcal{Q}$ and a given family of accepting sets $\mathbb{F} \subseteq \mathcal{F}$:
$$Acc(q, \mathbb{F}) = \begin{cases} \mathbb{F} \setminus \{F_j\} & \text{if } q \in F_j \text{ and } \mathbb{F} \neq \{F_j\},\\ \{F_k\}_{k=1}^{f} \setminus \{F_j\} & \text{if } q \in F_j \text{ and } \mathbb{F} = \{F_j\},\\ \mathbb{F} & \text{otherwise.} \end{cases}$$
In words, once the state $q \in F_j$ and the set $\mathbb{F}$ are introduced to the function $Acc$, it outputs a set containing the elements of $\mathbb{F}$ minus $F_j$. However, if $\mathbb{F} = \{F_j\}$, then the output is the family set of all accepting sets of the LDBA minus the set $F_j$. Finally, if the state $q$ is not an accepting state then the output of $Acc$ is $\mathbb{F}$ itself. Owing to the automaton-driven structure of the $Acc$ function, we are able to shape a reward function (as detailed next) without any supervision and regardless of the dynamics of the MDP.
We propose a reward function that observes the current state $s^{\otimes}$, the current action $a$, and the subsequent state $s^{\otimes\prime} = (s', q')$, to provide the agent with a scalar value according to the current automaton state:
$$R(s^{\otimes}, a) = \begin{cases} r_p & \text{if } q' \in F_j \text{ for some } F_j \in \mathbb{A},\\ r_n & \text{if } q' \in \mathbb{N},\\ 0 & \text{otherwise.} \end{cases} \tag{7}$$
Here $r_p$ is a positive reward and $r_n$ is a negative reward. A positive reward $r_p$ is assigned to the agent when it takes an action that leads to a state, the label of which is in $\mathbb{A}$. The set $\mathbb{A}$ is called the accepting frontier set, is initialised as the family set $\mathbb{A} = \{F_k\}_{k=1}^{f}$, and is updated by the following rule every time after the reward function is evaluated:
$$\mathbb{A} \leftarrow Acc(q', \mathbb{A}).$$
The set $\mathbb{A}$ contains those accepting sets that remain to be visited at a given time. Thus, the agent is guided by the above reward assignment to visit these sets, and once all of the sets $F_k$, $k = 1, \ldots, f$ are visited, the accepting frontier $\mathbb{A}$ is reset. As such, the agent is guided to visit the accepting sets infinitely often, and consequently, to satisfy the given LTL property. Finally, the set $\mathbb{N}$ is the set of non-accepting sink components of the automaton, as per Definition 8.
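The interplay between the accepting frontier function and the reward can be sketched as below. The sketch assumes that accepting sets are represented as `frozenset` objects and that an automaton state belongs to at most one accepting set; the reward magnitudes and state names are placeholders, not values from the paper.

```python
def acc(q, frontier, accepting_sets):
    """Accepting frontier function for a family `frontier` of accepting sets."""
    visited = [f for f in frontier if q in f]
    if not visited:
        return frontier                     # q lies in no set of the frontier
    f_j = visited[0]
    if frontier == {f_j}:
        return set(accepting_sets) - {f_j}  # last set reached: reset the frontier
    return frontier - {f_j}

def reward(q_next, frontier, sink_states, r_p=1.0, r_n=-1.0):
    """Reward evaluated on the automaton component q' of the next product state.

    r_p and r_n are hypothetical magnitudes.
    """
    if any(q_next in f for f in frontier):
        return r_p
    if q_next in sink_states:
        return r_n
    return 0.0

# Hypothetical automaton with two accepting sets and one sink state.
F1, F2 = frozenset({"q1"}), frozenset({"q2"})
all_sets = [F1, F2]
frontier = {F1, F2}
first_reward = reward("q1", frontier, {"q_sink"})
frontier = acc("q1", frontier, all_sets)
```

After `q1` is visited, the frontier shrinks to the second accepting set, so revisiting `q1` yields no further reward until the frontier is reset.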
Task Modularisation
In this section we explain how a complex task can be broken down into simple composable sub-tasks or modules. Each state of the automaton in the product MDP is a “task divider” and each transition between these states is a “sub-task”. For example, consider a sequential task of visiting $a$ and then $b$ and finally $c$, i.e. $\lozenge(a \wedge \lozenge(b \wedge \lozenge c))$.
The corresponding automaton for this LTL task is given in Fig. 1. The entire task is modularised into three sub-tasks, i.e. reaching $a$, $b$, and then $c$, and each automaton state acts as a divider.
Given an LTL task and its LDBA, we propose a modular architecture with one separate set of DDPG actor, actor-target, critic and critic-target neural networks per automaton state, along with their own replay buffer. A replay buffer is a finite-sized cache in which transitions sampled from exploring the environment are stored. The replay buffer is then used to train the actor and critic networks. The set of neural nets acts as a global modular actor-critic deep RL architecture, which allows the agent to jump from one sub-task to another by just switching between the set of neural nets. Different embeddings, such as the one-hot encoding [onehot] and the integer encoding, have been applied in order to approximate the global Q-function with a single DDPG network. However, we have observed poor performance since these encodings allow the network to assume an ordinal relationship between automaton states. This means that by assigning integer numbers or one-hot codes, automaton states are categorised in an ordered format, and can be ranked. Clearly, this disrupts Q-function generalisation by assuming that some states in the product MDP are closer to each other. Consequently, we have turned to the use of separate neural nets, which work together in a modular fashion, meaning that the agent can switch between these neural nets as it jumps from one automaton state to another.
For each automaton state $q_i$ an actor function $\mu_{q_i}(s \mid \theta^{\mu_{q_i}})$ represents the current policy by deterministically mapping states to actions, where $\theta^{\mu_{q_i}}$ is the vector of parameters of the function approximation for the actor. The critic $Q_{q_i}(s, a \mid \theta^{Q_{q_i}})$ is learned based on (3), as in QL.
The modular deep RL algorithm is detailed in Algorithm 1. Each DDPG network set in this algorithm is associated with its own replay buffer $\mathcal{R}_{q_i}$, where $q_i \in \mathcal{Q}$ (lines 4 and 12). Experience samples are stored in $\mathcal{R}_{q_i}$ in the form of $(s_i^{\otimes}, a_i, R_i, s_{i+1}^{\otimes})$. When the replay buffer reaches its maximum capacity, the samples are discarded based on a first-in first-out policy. At each time step, actor and critic are updated by sampling a minibatch uniformly from $\mathcal{R}_{q_i}$. We only train the DDPG network corresponding to the current automaton state, as experience samples on the current automaton state have little influence on other DDPG neural networks (lines 12 to 17).
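The per-module bookkeeping described above can be sketched with one bounded queue per automaton state. This is a minimal illustration, not the authors' code; the class name, the capacity, and the tuple layout of a sample are assumptions.

```python
import random
from collections import deque

class ModularReplayBuffers:
    """One first-in first-out replay buffer per automaton state."""

    def __init__(self, automaton_states, capacity=10000, seed=0):
        # deque(maxlen=...) silently discards the oldest sample when full.
        self.buffers = {q: deque(maxlen=capacity) for q in automaton_states}
        self.rng = random.Random(seed)

    def store(self, s, q, a, r, s_next, q_next):
        # The sample is filed under the automaton state in which it was generated.
        self.buffers[q].append(((s, q), a, r, (s_next, q_next)))

    def sample(self, q, batch_size):
        """Uniformly sample a minibatch for the module of automaton state q."""
        buf = self.buffers[q]
        return self.rng.sample(list(buf), min(batch_size, len(buf)))

buffers = ModularReplayBuffers(["q0", "q1"], capacity=2)
for i in range(3):
    buffers.store(float(i), "q0", i, 0.0, float(i + 1), "q0")
```

With a capacity of two, the third stored sample evicts the first one, and the buffer of `q1` stays empty because no experience was gathered in that automaton state.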
Further, directly implementing the update of the critic parameters as in (3) is shown to be potentially unstable, and as a result the Q-update (line 14) is prone to divergence [minhd]. Hence, instead of directly copying the weights, the standard DDPG [DDPG] uses “soft” target updates to improve learning stability. Target networks, $Q'$ and $\mu'$, are time-delayed copies of the original actor and critic networks that slowly track the learned networks, $Q$ and $\mu$. These target actor and critic networks are used within the algorithm to gather evidence (line 13) and subsequently to update the actor and critic networks. In our algorithm, for each automaton state $q_i$ we make a copy of the actor and the critic network: $\mu'_{q_i}(s \mid \theta^{\mu'_{q_i}})$ and $Q'_{q_i}(s, a \mid \theta^{Q'_{q_i}})$ respectively. The weights of both target networks are then updated by $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$ with a rate of $\tau < 1$ (line 18). Although this “soft update” may slow down learning as target networks have propagation delays, in practice this is greatly outweighed by the introduced learning stability.
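The soft target update of standard DDPG moves each target weight a small fraction of the way towards the corresponding learned weight. The sketch below operates on flat lists of weights for simplicity; the function name and the default rate are assumptions for illustration.

```python
def soft_update(target, source, tau=0.001):
    """Return the target weights after one soft update: tau * source + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(source, target)]

# The target slowly tracks the learned weights over repeated updates.
learned = [1.0, 2.0]
target = [0.0, 0.0]
one_step = soft_update(target, learned, tau=0.1)
for _ in range(200):
    target = soft_update(target, learned, tau=0.1)
```

A single update with a rate of 0.1 moves the target one tenth of the way, and after many updates the target weights approach the learned weights.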
[Algorithm 1]
Experiments
In this section we discuss a mission planning problem for an autonomous Mars rover that uses the proposed algorithm to pursue exploration missions. The areas of interest on Mars are the Melas Chasma and the Victoria crater.
The Melas Chasma shows a number of signs of water, with ancient river valleys and networks of stream channels showing up as sinuous and meandering ridges and lakes (Fig. 2). The blue dots, provided by NASA, indicate locations of Recurring Slope Lineae (RSL) in the canyon network. RSL are seasonal dark streaks regarded as the strongest evidence for the possibility of liquid water on the surface of Mars. RSL extend downslope during a warm season and then disappear in the colder part of the Martian year [water_on_mars].
Victoria crater (Fig. 3. a) is an impact crater and is located near the equator of Mars. The crater is approximately 800 meters in diameter and it has a distinctive shape to its rim. Layered sedimentary rocks are exposed along the wall of the crater, providing invaluable information about the ancient surface condition of Mars. Since January 2004, the well-known Mars rover Opportunity had been operating around the crater and its mission path is given in Fig. 3. b. Opportunity worked nearly 15 years on Mars and found dramatic evidence that long ago Mars was wetter and it could have sustained microbial life, if any existed.
The scenario of interest is to train a deep neural network that can autonomously accomplish a safety-critical complex task on Mars by accessing surface images. We start with the images of the surface of Mars and given mission tasks in the form of LTL properties. We then convert the LTL properties into their corresponding LDBAs so that we can feed them into the modular deep RL algorithm.
We assume that the highest possible disturbance caused by different factors (such as sand storms) on the rover motion is known, presumably from orbiting satellite data. This assumption can be set to be very conservative, given that there might be some unforeseen factors that were not captured by the satellite.
MDP structure
For each image, let its entire area be the MDP state space $\mathcal{S}$, where the rover location is a single state $s \in \mathcal{S}$. At each state $s \in \mathcal{S}$, the rover has a continuous range of actions $\mathcal{A}$: when the rover takes an action it moves to another state (e.g., $s'$) in the direction of the action and over a distance that is drawn at random from a bounded range, unless the rover hits the boundary of the image, which forces the rover to remain on the boundary.
Note that in the first experiment (Fig. 2), when the rover is deployed to its real mission, the precise landing location is not known. Therefore, we should encompass some randomness in the initial state $s_0$. However, in the second experiment (Fig. 3) the rover has already landed and it starts its mission from a known and fixed point.
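The rover dynamics described in this section can be sketched as a single transition function. This is a simplified illustration under assumptions: the action is a heading in radians, the travelled distance is uniform on a bounded interval with a hypothetical upper limit `max_range`, and hitting the image boundary is modelled by clipping each coordinate.

```python
import math
import random

def rover_step(pos, angle, width, height, max_range=1.0, rng=random):
    """Move from `pos` in direction `angle` (radians) by a randomly drawn distance.

    `max_range` is a hypothetical bound on the distance travelled per action.
    Positions that would leave the image are clipped to its boundary.
    """
    distance = rng.uniform(0.0, max_range)
    x = pos[0] + distance * math.cos(angle)
    y = pos[1] + distance * math.sin(angle)
    return (min(max(x, 0.0), width), min(max(y, 0.0), height))

rng = random.Random(0)
inside = rover_step((0.5, 0.5), 0.0, 10.0, 10.0, rng=rng)
at_edge = rover_step((9.9, 5.0), 0.0, 10.0, 10.0, max_range=5.0, rng=rng)
```

A random initial state for the first experiment could be drawn in the same spirit, for instance by sampling a position uniformly over the safe part of the image.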
Specifications
The first control objective over Melas Chasma is expressed by the following LTL formula (Fig. 2):
(8) 
where $t_1$ stands for “target 1”, $t_2$ stands for “target 2” and $u$ stands for “unsafe” (the red region in the figure). Target 1 corresponds to the RSL (blue dots) on the right, with a lower risk of the rover going to an unsafe region, whereas the target 2 label goes on the left RSL, which are a bit riskier to explore. Conforming to (8) the rover has to visit any of the right dots at least once and then proceed to the left dots, while avoiding unsafe areas. Note that, according to (8), the agent can enter the unsafe area (by climbing up the slope) but it is not able to come back due to the risk of falling. From (8) we build the associated Büchi automaton as in Fig. 4.
The mission task for the Victoria crater is expressed by the following LTL formula:
(9) 
where $t_i$ represents the “$i$-th target”, and $u$ represents “unsafe”. The $i$-th target in Fig. 3. c is the $i$-th red circle from the bottom left along the crater rim. According to (9) the rover is required to visit the checkpoints from the bottom left to the top right sequentially, while not falling into the crater, mimicking the actual path in Fig. 3. b. From (9), we can build the associated Büchi automaton as shown in Fig. 5.
Experimental Outcomes
All simulations have been carried out on a machine with an Intel Xeon 3.5GHz processor and 16GB of RAM, running Ubuntu 18. In the first experiment we have employed 4 DDPG actor-critic neural networks and run simulations for 10,000 episodes. We have then tested the trained network for all safe starting positions across 200 runs. Our algorithm has achieved a success rate of 98.8% across 18,202 landing positions. Fig. 6 gives the path generated by our algorithm. Fig. 6. c is particularly interesting, as we have observed a sudden turn before reaching the first RSL, which shows that the proposed algorithm is able to learn policies that are more complex than smooth curves when needed.
In the second experiment we have used 13 DDPG actor-critic neural networks. We have run simulations for a total of 17,000 episodes, at which point the training had already converged. The training has taken approximately 5 hours to complete. We have then tested the trained network across 200 runs. Our algorithm has achieved a success rate of 100% across all runs starting from the fixed initial state. Figure 7 shows a generated path: we observe that the path is mostly curved away from the crater. This is due to the presence of a negative reward, as described before. We find that the negative reward is essential: without it, the algorithm is unable to travel between the checkpoints around the Bottomless Bay in Fig. 3. b. Without the negative reward the agent insists on reaching the next checkpoint via the shortest path during the exploration, and as a result it constantly falls into the crater.
We have tried to use standalone DDPG and the action-space discretisation of LCNFQ as baselines; however, both methods perform too poorly and are unable to navigate the crater regardless of the number of episodes. We are also unaware of any literature that can provide a one-shot learning baseline for such a complex sequential task. Implementation details are available in the appendix.
Conclusion
In this paper we have discussed the first deep RL scheme for hierarchical continuous-state continuous-action decision-making problems with temporal constraints. These problems are composed of interrelated sub-problems, which in turn might have their own sub-problems. Although the optimal decision making for each sub-problem can be done effortlessly, the original problem is quite hard to tackle holistically, even with state-of-the-art techniques. We have employed LTL to specify these interrelations and to assist the agent in finding an optimal policy in a one-shot learning scheme.
References
Appendix
Appendix A General Considerations
We start our training for each automaton state only once our buffer size becomes large enough (e.g. buffer size of 4096), as a small buffer size would introduce more correlation in the data. This seems to slow down training in the initial stage, but results in faster learning of the policies after the initial stage.
Also, we initially apply our algorithm as described, with the state dimension being the $x$ and $y$ coordinates along with the automaton states, and the action dimension being the angle normalized to a bounded range whose two endpoints represent $0$ and $360$ degrees. However, we find that the learning algorithm tends to perform very poorly and that it tends to predict actions at the two ends of this range. We believe that this occurs because the measure of angle is circular, where $0$ degrees is essentially the same as $360$ degrees, and the learning algorithm is unaware of this property and hence becomes trapped in a local minimum. To resolve this issue, instead of predicting the angle for the action dimension, we predict the $\sin$ and $\cos$ of the angle. This approach works better as the $\sin$ and $\cos$ values of $0$ and $360$ degrees are the same, and hence the circular property is preserved. We implement this approach by simply predicting two dimensions of range $[-1, 1]$, whose values are normalized such that the sum of their squares is equal to $1$, as $\sin^2 \theta + \cos^2 \theta = 1$ for any $\theta$.
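The circular encoding of the action can be sketched as follows. The function name and the handling of a zero-length output are assumptions; the sketch only illustrates the normalisation of a two-dimensional output onto the unit circle and the recovery of the heading.

```python
import math

def action_to_angle(raw):
    """Map a raw 2-D network output (cos-like, sin-like) to a heading in [0, 2*pi)."""
    norm = math.hypot(raw[0], raw[1])
    if norm == 0.0:
        return 0.0  # degenerate output; fall back to an arbitrary heading
    # After normalisation the two components satisfy cos^2 + sin^2 = 1.
    cos_t, sin_t = raw[0] / norm, raw[1] / norm
    return math.atan2(sin_t, cos_t) % (2.0 * math.pi)

east = action_to_angle((2.0, 0.0))
north = action_to_angle((0.0, 0.5))
south = action_to_angle((0.0, -1.0))
```

Because headings just below a full turn and just above zero map to nearby points on the unit circle, the network output no longer has an artificial discontinuity at the ends of the angular range.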
Appendix B Instability of DDPG algorithm
Training the DDPG algorithm is quite challenging, and in our case, we find that the DDPG algorithm is only able to traverse to the next automaton state successfully approximately 60 to 70% of the time. Therefore, the probability of successfully completing $k$ consecutive automaton transitions is at most $0.7^{k}$. While this is not an issue for the Melas Chasma experiment (with only four states in the automaton), this causes great instability for the Victoria crater task. A solution to this problem is to stop the training for each set of DDPG nets once they become stable. However, since the DDPG nets of the automaton states are not independent, once a DDPG net stops training, it is not able to get new updates from the DDPG net that it depends on, i.e. the DDPG net of the next automaton state.
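The effect of compounding per-transition failures can be made concrete with a short calculation. The sketch assumes that transitions succeed independently, and that an automaton with four states requires three transitions and one with thirteen states requires twelve; both counts are assumptions made for illustration.

```python
def chain_success_bound(p_step, transitions):
    """Upper bound on completing `transitions` consecutive automaton transitions,
    assuming each one succeeds independently with probability at most `p_step`."""
    return p_step ** transitions

short_task = chain_success_bound(0.7, 3)    # e.g. a four-state automaton
long_task = chain_success_bound(0.7, 12)    # e.g. a thirteen-state automaton
```

Under these assumptions the bound drops from roughly one in three for the short task to below two percent for the long one, which is consistent with the instability reported for the longer task.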
[swa] showed that applying Stochastic Weight Averaging (SWA) [sw] to DDPG can improve its stability. SWA is a technique that allows for solutions to be found with better generalisation in supervised and semi-supervised learning. SWA is based on averaging the weights collected during training with an SGD-like method. In supervised learning, the weights are collected at the end of each training epoch. [sw] uses a constant or cyclical learning rate schedule to prevent the optimization from converging to a single solution, so that it continues to explore the region of high-performing networks.
In order to apply SWA to DDPG algorithms, [swa] introduces a frequency for updating the SWA weights. In our work, to initialize the weights we use the weights of the model that was trained until it is able to reach the next automaton state 8 times in a row. Then we apply SWA to the weights of both actor and critic networks.
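The weight averaging at the core of SWA amounts to a running mean over the collected weight vectors. The sketch below shows this running mean on flat lists of weights; the function name is hypothetical, and the schedule that decides when a new weight vector is collected is left out.

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Fold a newly collected weight vector into the running SWA average.

    `n_averaged` is the number of weight vectors already averaged.
    """
    return [(w_swa * n_averaged + w) / (n_averaged + 1)
            for w_swa, w in zip(swa_weights, new_weights)]

average = [1.0, 2.0]                              # first collected weights
average = swa_update(average, [3.0, 4.0], 1)      # mean of two vectors
average = swa_update(average, [5.0, 0.0], 2)      # mean of three vectors
```

In this setting the same update would be applied separately to the actor and critic weights of each automaton state.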
Appendix C Catastrophic forgetting
Catastrophic forgetting is the act of overwriting previous knowledge about a task when a new task is learnt. While our agent manages to become stable after applying the SWA algorithm, it starts to lose accuracy after 20,000 episodes. We believe that this is due to the algorithm forgetting how to avoid the unsafe region, as the experience set is filled with only successful runs. To resolve this issue, we increase the initial samples for the replay buffer to 16384 (i.e. $2^{14}$) samples before we start the training. In addition to that, we separate the experience set into a successful and an unsuccessful experience set for each automaton state. We then sample from both replay buffers at each epoch at a fixed ratio, which is to be tuned. We found that by separating the replay buffer and increasing the initial replay buffer sample, the algorithm is able to maintain stability after 20,000 episodes.