Robust Satisfaction of Temporal Logic Specifications via Reinforcement Learning
Abstract
We consider the problem of steering a system with unknown, stochastic dynamics to satisfy a rich, temporally layered task given as a signal temporal logic formula. We represent the system as a Markov decision process in which the states are built from a partition of the state space and the transition probabilities are unknown. We present provably convergent reinforcement learning algorithms to maximize the probability of satisfying a given formula and to maximize the average expected robustness, i.e., a measure of how strongly the formula is satisfied. We demonstrate via a pair of robot navigation simulation case studies that reinforcement learning with robustness maximization performs better than probability maximization in terms of both probability of satisfaction and expected robustness.
I Introduction
We consider the problem of controlling a system with unknown, stochastic dynamics, i.e., a “black box”, to achieve a complex, time-sensitive task. An example is controlling a noisy aerial vehicle with partially known dynamics to visit a prespecified set of regions in some desired order while avoiding hazardous areas. We consider tasks given as temporal logic (TL) formulae [2], an extension of first-order Boolean logic that can be used to reason about how the state of a system evolves over time. When a stochastic dynamical model is known, there exist algorithms to find control policies for maximizing the probability of achieving a given TL specification [18, 17, 23, 13] by planning over stochastic abstractions [12, 1, 17]. However, only a handful of papers have considered the problem of enforcing TL specifications on a system with unknown dynamics. Passive [3] and active [21, 9] reinforcement learning have been used to find a policy that maximizes the probability of satisfying a given linear temporal logic formula.
In this paper, in contrast to the above works on reinforcement learning which use propositional temporal logic, we use signal temporal logic (STL), a rich predicate logic that can be used to describe tasks involving bounds on physical parameters and time intervals [7]. An example of such a property is “Within seconds, a region in which is less than is reached, and regions in which is larger than are avoided for seconds.” STL admits a continuous measure called robustness degree that quantifies how strongly a given sample path exhibits an STL property as a real number rather than just providing a yes or no answer [8, 7]. This measure enables the use of continuous optimization methods to solve inference (e.g., [10, 11, 14]) or formal synthesis problems (e.g., [20]) involving STL.
One of the difficulties in solving problems with TL formulae is the history-dependence of their satisfaction. For instance, if the specification requires visiting region A before region B, whether or not the system should steer towards region B depends on whether or not it has previously visited region A. For linear temporal logic (LTL) formulae with time-abstract semantics, this history-dependence can be broken by translating the formula to a deterministic Rabin automaton (DRA), a model that automatically takes care of the history-dependent “bookkeeping” [4, 21]. In the case of STL, such a construction is difficult due to the time-bounded semantics. We circumvent this problem by defining a fragment of STL such that the progress towards satisfaction is checked with some finite number of state measurements. We thus define an MDP, called the MDP, whose states correspond to the step history of the system. The inputs to the MDP are a finite collection of control actions.
We use a reinforcement learning strategy called Q-learning [24], in which a policy is constructed by taking actions, observing outcomes, and reinforcing actions that improve a given reward. Our algorithms either maximize the probability of satisfying a given STL formula, or maximize the expected robustness with respect to the given STL formula. These procedures provably converge to the optimal policy for each case. Furthermore, we argue that maximizing expected robustness is typically more effective than maximizing probability of satisfaction. We prove that in certain cases, the policy that maximizes expected robustness also maximizes the probability of satisfaction. However, if the given specification is not satisfiable, the probability maximization will return an arbitrary policy, while the robustness maximization will return a policy that gets as close to satisfying the formula as possible. Finally, we demonstrate through simulation case studies that the policy that maximizes expected robustness in some cases gives better performance in terms of both probability of satisfaction and expected robustness when fewer training episodes are available.
II Signal Temporal Logic (STL)
STL is defined with respect to continuously valued signals. Let denote the set of mappings from to and define a signal as a member of . For a signal , we denote as the value of at time and as the sequence of values . Moreover, we denote as the suffix from time , i.e., .
In this paper, the desired mission specification is described by an STL fragment with the following syntax:
(1) 
where is a finite time bound, and are STL formulae, and are nonnegative real-valued constants, and is a predicate where is a signal, is a function, and is a constant. The Boolean operators and are negation (“not”) and conjunction (“and”), respectively. The other Boolean operators are defined as usual. The temporal operators , , and stand for “Finally (eventually)”, “Globally (always)”, and “Until”, respectively. Note that in this paper, we use a discrete-time version of STL rather than the typical continuous-time formulation.
The semantics of STL is recursively defined as
In plain English, means “within and time units in the future, is true,” means “for all times between and time units in the future is true,” and means “There exists a time between and time units in the future such that is true until and is true at .” STL is equipped with a robustness degree [8, 7] (also called “degree of satisfaction”) that quantifies how well a given signal satisfies a given formula . The robustness is calculated recursively according to the quantitative semantics
We use to denote . If is large and positive, then would have to change by a large deviation in order to violate . Similarly, if is large in absolute value and negative, then strongly violates .
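The recursive computation of the robustness degree can be illustrated with a short sketch. It assumes the usual discrete-time quantitative semantics from [7, 8]: a predicate of the form f(s) < d scores d minus f(s), negation flips the sign, conjunction takes a minimum, and the “Finally” and “Globally” operators take a maximum and a minimum over a half-open window of time indices. The function names, the scalar signal, and the half-open interval convention are assumptions made for illustration, not notation taken from this paper.

```python
from typing import Callable, Sequence

Signal = Sequence[float]
# A robustness function maps (signal, time index) to a real number.
Rob = Callable[[Signal, int], float]

def predicate(f: Callable[[float], float], d: float) -> Rob:
    """Robustness of the predicate f(s[t]) < d, taken as d - f(s[t])."""
    return lambda s, t: d - f(s[t])

def neg(phi: Rob) -> Rob:
    """Negation flips the sign of the robustness."""
    return lambda s, t: -phi(s, t)

def conj(phi: Rob, psi: Rob) -> Rob:
    """Conjunction is only as robust as its weaker operand."""
    return lambda s, t: min(phi(s, t), psi(s, t))

def finally_(a: int, b: int, phi: Rob) -> Rob:
    """Best-case robustness over the window [t + a, t + b)."""
    return lambda s, t: max(phi(s, k) for k in range(t + a, t + b))

def globally(a: int, b: int, phi: Rob) -> Rob:
    """Worst-case robustness over the window [t + a, t + b)."""
    return lambda s, t: min(phi(s, k) for k in range(t + a, t + b))

# Scalar signal and the predicate "s < 2.0" (f is the identity).
signal = [3.0, 2.5, 1.0, 0.5, 1.5]
below_two = predicate(lambda v: v, 2.0)

print(finally_(0, 5, below_two)(signal, 0))  # 1.5: s[3] = 0.5 is 1.5 below the bound
print(globally(0, 5, below_two)(signal, 0))  # -1.0: s[0] = 3.0 exceeds the bound by 1.0
```

A positive value means the formula holds with that much margin, and a negative value means it is violated by that much, which matches the interpretation given above.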
Similar to [6], let denote the horizon length of an STL formula . The horizon length is the required number of samples to resolve any (future or past) requirements of . The horizon length can be computed recursively as
(2) 
where are STL formulae.
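As a rough illustration of the recursion, the sketch below computes a horizon length for formulas encoded as nested tuples. The encoding and the specific rules (zero for a predicate, the maximum over the operands of a conjunction, and the upper time bound plus the horizon of the operand for a temporal operator) follow a common convention in the spirit of [6]; they are assumptions, and an off-by-one difference is possible depending on whether intervals are treated as closed or half-open.

```python
def horizon(phi) -> int:
    """Horizon length of a formula given as a nested tuple.

    Assumed encodings (hypothetical, for illustration only):
      ("pred",)                    atomic predicate
      ("not", sub)                 negation
      ("and", left, right)         conjunction
      ("F", a, b, sub)             finally within [a, b)
      ("G", a, b, sub)             globally within [a, b)
      ("U", a, b, left, right)     until within [a, b)
    """
    kind = phi[0]
    if kind == "pred":
        return 0
    if kind == "not":
        return horizon(phi[1])
    if kind == "and":
        return max(horizon(phi[1]), horizon(phi[2]))
    if kind in ("F", "G"):
        return phi[2] + horizon(phi[3])
    if kind == "U":
        return phi[2] + max(horizon(phi[3]), horizon(phi[4]))
    raise ValueError(f"unknown operator: {kind!r}")

# "Within 4 steps see A, and within 4 steps see B."
inner = ("and", ("F", 0, 4, ("pred",)), ("F", 0, 4, ("pred",)))
print(horizon(inner))                  # 4
print(horizon(("G", 0, 100, inner)))   # 104
```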
Example 1
Consider the robot navigation problem illustrated in Figure 1(a). The specification is “Visit Regions or and visit Regions or every 4 time units along a mission horizon of 100 units.” Let , where and are the and components of the signal . This task can be formulated in STL as
(3) 
Figure 1(a) shows two trajectories of the system beginning at the initial location of and ending in region that each satisfies the inner specification given in (3). Note that barely satisfies , as it only slightly penetrates region , while appears to satisfy it strongly, as it passes through the center of region and the center of region . The robustness degrees confirm this: while .
III Models for Reinforcement Learning
For a system with unknown and stochastic dynamics, a critical problem is how to synthesize a controller that achieves a desired behavior. A typical approach is to discretize the state and action spaces of the system and then use a reinforcement learning strategy, i.e., learn how to take actions through trial-and-error interactions with an unknown environment [22]. In this section, we present models of systems that are amenable to reinforcement learning to enforce temporal logic specifications. We start with a discussion of the widely used LTL before introducing the particular model that we will use for reinforcement learning with STL.
III-A Reinforcement Learning with LTL
One approach to the problem of enforcing LTL satisfaction in a stochastic system is to partition the state space and design control primitives that can (nominally) drive the system from one region to another. These controllers, the stochastic dynamical model of the system, and the quotient obtained from the partition are used to construct a Markov decision process (MDP), called a bounded-parameter MDP or BMDP, whose transition probabilities are interval-valued [1]. These BMDPs can then be composed with a DRA constructed from a given LTL formula to form a product BMDP. Dynamic programming (DP) can then be applied over this product BMDP to generate a policy that maximizes the probability of satisfaction. Other approaches to this problem include aggregating the states of a given quotient until an MDP can be constructed such that the transition probability can be considered constant (with bounded error) [16]. The optimal policy can be computed over the resulting MDP using DP [15] or approximate DP, e.g., actor-critic methods [5].
Thus, even when the stochastic dynamics of a system are known and the logic that encodes constraints has time-abstract semantics, the problem of constructing an abstraction of the system that is amenable to control policy synthesis is difficult and computationally intensive. Reinforcement learning methods for enforcing LTL constraints make the assumption that the underlying model under control is an MDP [3, 21, 9]. Implicitly, these procedures compute a frequentist approximation of the transition probabilities that asymptotically approaches the true (unknown) value as the number of observed sample paths increases. Since these procedures do not explicitly rely on any a priori knowledge of the transition probabilities, they could be applied to an abstraction of a continuous-space system that is built from a proposition-preserving partition. In this case, the uncertainty on the motion described by intervals in the BMDP that is reduced via computation would instead be described by complete ignorance that is reduced via learning. The resulting policy would map regions of the state space to discrete actions that will optimally drive the real-valued state of the system to satisfy the given LTL specification. Different partitions will result in different policies. In the next section, we extend the above observation to derive a discrete model that is amenable to reinforcement learning for STL formulae.
III-B Reinforcement Learning with STL: MDP
In order to reduce the search space of the problem, we partition the state space of the system to form the quotient graph , where is a set of discrete states corresponding to the regions of the state space and corresponds to the set of edges. An edge between two states and exists in if and only if and are neighbors (share a boundary) in the partition. In our case, since STL has time-bounded semantics, we cannot use an automaton with a time-abstract acceptance condition (e.g., a DRA) to check its satisfaction. In general, whether or not a given trajectory satisfies an STL formula would be determined by directly using the qualitative semantics. The STL fragment (1) consists of a subformula with horizon length that is modified by either a or temporal operator. This means that in order to update at time whether or not the given formula has been satisfied or violated, we can use the previous state values . For this reason, we choose to learn policies over an MDP with finite memory, called a MDP, whose states correspond to sequences of length of regions in the defined partition.
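To make the quotient construction concrete, here is a minimal sketch for the special case of a rectangular grid partition, where two cells are joined by an edge exactly when they share a boundary segment. The function name `grid_quotient`, the (row, column) indexing, and the choice to ignore corner-only contact are assumptions for illustration.

```python
from itertools import product

def grid_quotient(rows: int, cols: int) -> dict:
    """Adjacency map of a rows x cols grid partition.

    Two cells are neighbors iff they share a boundary segment
    (4-connectivity); corner contact alone is not counted here.
    """
    edges = {cell: set() for cell in product(range(rows), range(cols))}
    for r, c in edges:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbor = (r + dr, c + dc)
            if neighbor in edges:
                edges[(r, c)].add(neighbor)
    return edges

quotient = grid_quotient(2, 3)
print(sorted(quotient[(0, 0)]))  # [(0, 1), (1, 0)]
```

The resulting adjacency map is symmetric, so it can be read as an undirected graph over the regions.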
Example 1 (cont’d)
Let the robot evolve according to the discrete-time Dubins dynamics
(4) 
where and are the and coordinates of the robot at time , is its forward speed, is a time interval, and the robot’s orientation is given by . The control primitives in this case are given by which correspond to the directions on the grid. Each (noisy) control primitive induces a distribution with support , where is the orientation at which the robot faces the desired cell. When a motion primitive is enacted, the robot rotates to an angle drawn from the distribution and moves along that direction for time units. The partition of the state space and the induced quotient are shown in Figures 1(b) and 1(c), respectively. A state in the quotient (Figure 1(c)) represents the region in the partition of the state space (Figure 1(b)) with the point in the lower left-hand corner.
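Figure 1: (a) Robot navigation example with two sample trajectories; (b) partition of the state space; (c) induced quotient.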
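The noisy motion primitive described in this example can be sketched as follows. The sketch assumes the usual discrete-time unicycle update (position advanced by speed times time step along the current heading) and, purely for illustration, a uniform heading distribution centered on the desired orientation. The function name `noisy_step`, its parameters, and the default values are hypothetical.

```python
import math
import random

def noisy_step(x, y, heading_des, v=1.0, dt=1.0, spread=0.3, rng=random):
    """Apply one noisy motion primitive.

    The actual heading is drawn uniformly from
    [heading_des - spread, heading_des + spread] (an assumed distribution),
    and the robot then moves straight at speed v for dt time units.
    """
    theta = rng.uniform(heading_des - spread, heading_des + spread)
    return x + v * dt * math.cos(theta), y + v * dt * math.sin(theta)

# Command "east" (heading 0) from the origin with a seeded generator.
rng = random.Random(0)
x1, y1 = noisy_step(0.0, 0.0, 0.0, v=2.0, dt=0.5, spread=0.3, rng=rng)
print(round(math.hypot(x1, y1), 6))  # 1.0: the step length is always v * dt
```

Only the direction of travel is random in this sketch; the distance covered per primitive is fixed, which is what keeps the robot from skipping regions when the cells are large enough.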
Definition 1
Given a quotient of a system and a finite set of actions , a Markov Decision Process (MDP) is a tuple , where

is the set of finite states, where is the empty string. Each state corresponds to a horizon (or shorter) path in . Shorter paths of length (representing the case in which the system has not yet evolved for time steps) have prepended times.

is a probabilistic transition relation. can be positive only if the first states of are equal to the last states of and there exists an edge in between the final state of and the final state of .
We denote the state of the MDP at time as .
Definition 2
Given a trajectory of the original system, we define its induced trace in the MDP as . That is, corresponds to the previous regions of the state space that the state has resided in from time to time .
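The induced trace can be pictured as a sliding window over the sequence of visited regions. The sketch below is one way to compute it, assuming regions are given as labels, the window length is passed in as `memory`, and the empty string of Definition 1 is represented by `None`; all three choices are illustrative rather than taken from the paper.

```python
from collections import deque

EMPTY = None  # stands in for the empty string used to pad short histories

def induced_trace(regions, memory):
    """Yield the finite-memory state after each step of a region sequence.

    Each state is a tuple of the last `memory` regions, left-padded with
    EMPTY while fewer than `memory` steps have elapsed.
    """
    window = deque([EMPTY] * memory, maxlen=memory)
    for q in regions:
        window.append(q)  # the oldest entry drops out automatically
        yield tuple(window)

for state in induced_trace(["q0", "q1", "q1", "q2"], memory=3):
    print(state)
# (None, None, 'q0')
# (None, 'q0', 'q1')
# ('q0', 'q1', 'q1')
# ('q1', 'q1', 'q2')
```

Consecutive states overlap in all but one entry, which mirrors the condition on the transition relation in Definition 1.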
The construction of a MDP from a given quotient and set of actions is straightforward. The details are omitted due to length constraints. We make the following key assumptions on the quotient and the resulting MDP:

The defined control actions will drive the system either to a point in the current region or to a point in a neighboring region of the partition, i.e., no regions are “skipped”.

The transition relation is Markovian.
For every state , there exists a continuous set of sample paths whose traces could be that state. The dynamics of the underlying system produces an unknown distribution . Since the robustness degree is a function of sample paths of length and an STL formula , we can define a distribution .
IV Problem Formulation
In this paper, we address the following two problems.
Problem 1 (Maximizing Probability of Satisfaction)
Let be a MDP as described in the previous section. Given an STL formula with syntax (1), find a policy such that
(5) 
Problem 2 (Maximizing Average Robustness)
Problems 1 and 2 are two alternative formulations for enforcing a given STL specification. The policy that solves Problem 1, i.e., , maximizes the chance that will be satisfied, while the policy that solves Problem 2, i.e., , drives the system to satisfy as strongly as possible on average. Problems similar to (5) have already been considered in the literature (e.g., [9, 21]). However, Problem 2 is a novel formulation that provides some advantages over Problem 1. As we show in Section V, for some special systems, achieves the same probability of satisfaction as . Furthermore, if is not satisfiable, any arbitrary policy could be a solution to Problem 1, as all policies will result in a satisfaction probability of 0. If is unsatisfiable, Problem 2 yields a solution that attempts to get as close as possible to satisfying the formula, as the optimal solution will have the least negative average robustness value.
The forms of the objective functions differ for the two different types of formula, and .
V Maximizing Expected Robustness vs. Maximizing Probability of Satisfaction
Here, we demonstrate that the solution to (6) subsumes the solution to (5) for a certain class of systems. Due to space limitations, we only consider formulae of the type . Let be a MDP. For simplicity, we make the following assumption on .
Assumption 1
For every state , either every trajectory whose trace is satisfies , denoted , or every trajectory that passes through the sequence of regions associated with does not satisfy , denoted .
Assumption 1 can be enforced in practice during partitioning. We define the set
(11) 
Definition 3
The signed graph distance of a state to a set is
(12) 
where is the length of the shortest path from to .
We also make the following two assumptions.
Assumption 2
For any signal such that , let be bounded from below by and from above by .
Assumption 3
Let . For any two states,
(13) 
Now we define the policies and over as
(14) 
(15) 
Proposition 1
Given any policy , its associated reachability probability can be defined as
(16) 
Let be the indicator function such that is 1 if is true and 0 if is false. By definition, the expected probability of satisfaction for a given policy is
(17) 
Also, the expected robustness of policy becomes
(18) 
Since is constant, maximizing (18) is equivalent to
(19) 
Let be the satisfaction probability such that . Then, we can rewrite the objective in (19) as
(20) 
Now,
(21) 
Thus, any policy increasing also leads to an increase in . Since increasing is equivalent to increasing , we can conclude that the policy that maximizes the robustness also achieves the maximum satisfaction probability.
VI Control Synthesis to Maximize Robustness
VI-A Policy Generation through Q-Learning
Since we do not know the dynamics of the system under control, we cannot a priori predict how a given control action will affect the evolution of the system and hence its progress towards satisfying/dissatisfying a given specification. Thus, we use the well-known paradigm of reinforcement learning to learn policies to solve Problems 1 and 2. In reinforcement learning, the system takes actions and records the rewards associated with the state-action pair. These rewards are then used to update a feedback policy that maximizes the expected gathered reward. In our case, the rewards that we collect over are related to whether or not is satisfied (Problem 1) or how robustly is satisfied/violated (Problem 2).
Our solutions to these problems rely on a Q-learning formulation [24]. Let be the reward collected when action was taken in state . Define the Q-function as
(22) 
For an optimization problem with a cumulative objective function of the form
(23) 
the optimal policy can be found by
(24) 
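For orientation, the standard one-step tabular Q-learning update and the greedy policy read off from a Q-table can be sketched as below. This is the textbook form for a discounted cumulative reward; the function names, the dictionary-based table, and the values of the step size `alpha` and discount `gamma` are assumptions for illustration and are not claimed to reproduce (22) through (24) exactly.

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward reward + gamma * max_b Q(s_next, b)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

def greedy(Q, s, actions):
    """Policy induced by the table: pick an action with the largest Q-value."""
    return max(actions, key=lambda b: Q[(s, b)])

Q = defaultdict(float)          # unseen pairs default to 0.0
actions = ["left", "right"]
q_update(Q, 0, "right", reward=1.0, s_next=1, actions=actions, alpha=0.5)
print(Q[(0, "right")])          # 0.5
print(greedy(Q, 0, actions))    # right
```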
VI-B Batch Q-learning
We cannot reformulate Problems 1 and 2 into the form (23) (see Section IV). Thus, we propose an alternative Q-learning formulation, called batch Q-learning, to solve these problems. Instead of updating the Q-function after each action is taken, we wait until an entire episode is completed before updating the Q-function. The batch Q-learning procedure is summarized in Algorithm 1.
The Q-function is initialized to random values and is computed from the initial values. Then, for episodes, the system is simulated using . Randomization is used to encourage exploration of the policy space. The observed trajectory is then used to update the Q-function according to Algorithm 2. The new value of the Q-function is used to update the policy . For compactness, Algorithm 2 as written only covers the case . The case in which can be addressed similarly.
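The episode-then-update structure can be sketched on a toy problem. The code below is not a transcription of Algorithms 1 and 2: the one-dimensional chain, the per-state robustness stand-in, the zero initialization, the exploration and step-size schedules, and in particular the max-based update target (the larger of the robustness just observed and the discounted best future value, a form suggested by an "eventually"-type objective) are all assumptions made for illustration.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)                 # move left / move right on a 1-D chain
N_CELLS, GOAL, HORIZON = 6, 5, 20  # toy sizes, chosen only for illustration

def robustness(cell):
    """Toy stand-in for the robustness of a state: positive only at GOAL."""
    return 0.5 - abs(GOAL - cell)

def step(cell, action, rng, slip=0.1):
    """Dynamics hidden from the learner: the move is reversed with prob. slip."""
    if rng.random() < slip:
        action = -action
    return min(max(cell + action, 0), N_CELLS - 1)

def choose(Q, cell, rng, eps):
    """Epsilon-greedy action choice with random tie-breaking."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[(cell, b)] for b in ACTIONS)
    return rng.choice([b for b in ACTIONS if Q[(cell, b)] == best])

def run_episode(Q, rng, eps):
    """Simulate one full episode and record its transitions."""
    cell, transitions = 0, []
    for _ in range(HORIZON):
        a = choose(Q, cell, rng, eps)
        nxt = step(cell, a, rng)
        transitions.append((cell, a, nxt))
        cell = nxt
    return transitions

def batch_update(Q, transitions, alpha, gamma=0.95):
    """Update Q only after the whole episode has been recorded."""
    for s, a, s_next in reversed(transitions):
        future = gamma * max(Q[(s_next, b)] for b in ACTIONS)
        target = max(robustness(s_next), future)  # assumed max-based target
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # zeros for simplicity; the text uses random values
    for k in range(1, episodes + 1):
        eps = max(0.05, k ** -0.5)  # exploration decays over episodes
        batch_update(Q, run_episode(Q, rng, eps), alpha=k ** -0.6)
    return Q

Q = train()
print({cell: round(max(Q[(cell, b)] for b in ACTIONS), 3) for cell in range(N_CELLS)})
```

Processing the recorded transitions in reverse order lets a single successful episode propagate value back along the whole path, which is one practical motivation for waiting until the episode ends.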
VI-C Convergence of Batch Q-learning
Given a formula of the form and an objective of maximizing the expected robustness (Problem 2), we will show that applying Algorithm 1 converges to the optimal solution. The other three cases discussed in Section IV can be proven similarly. The following analysis is based on [19]. The optimal Q-function derived from (8) is
(26) 
This gives the following convergence result.
Proposition 2
The Q-learning rule given by
(27) 
converges to the optimal Q-function (26) if the sequence is such that and .
(Sketch) The proof of Proposition 2 relies primarily on Proposition 3. Once this is established, the rest of the proof varies only slightly from the presentation in [19].
Note that in this case, ranges over the number of episodes and ranges over the time coordinate of the signal.
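For reference, Q-learning convergence arguments in the style of [19] typically require the step sizes to satisfy the Robbins-Monro conditions. Assuming Proposition 2 relies on the same conditions (the symbol $\alpha_k$ for the step size at episode $k$ is our notation here), they read:

```latex
\sum_{k=0}^{\infty} \alpha_k = \infty ,
\qquad
\sum_{k=0}^{\infty} \alpha_k^{2} < \infty .
```

A schedule such as $\alpha_k = 1/(k+1)$ meets both conditions, whereas a constant step size violates the second.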
Proposition 3
The optimal Q-function given by (26) is a fixed point of the contraction mapping where
(28) 
By (26), if is a contraction mapping, then is a fixed point of . Consider
(29) 
Define
(30) 
Without loss of generality, let . Define
(31) 
There exist three possibilities for the value of .
(32a)  
(32b)  
(32c) 
Thus, this means that . Hence,
(33) 
Therefore, is a contraction mapping.
VII Case Study
We implemented the batch Q-learning algorithm (Algorithm 1) and applied it to two case studies that adapt the robot navigation model from Example 1. For each case study, we solved Problems 1 and 2 and compared the performance of the resulting policies. All simulations were implemented in MATLAB and performed on a PC with a 2.6 GHz processor and 7.8 GB RAM.
VII-A Case Study 1: Reachability
First, we consider a simple reachability problem. The given STL specification is
(34) 
where is the STL subformula corresponding to being in a blue region. In plain English, (34) can be stated as “Within 20 time units, reach a blue region and then don’t revisit a blue region for 4 time units.” The results from applying Algorithm 1 are summarized in Figure 3. We used the parameters , = 300 and , where is the probability at iteration of selecting an action at random. (Footnote 1: Although the conditions and are technically required to prove convergence, in practice these conditions can be relaxed without having adverse effects on learning performance.) Constructing the MDP took 17.2 s. Algorithm 1 took 161 s to solve Problem 1 and 184 s to solve Problem 2.
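Figure 3: (a), (b) Histograms of robustness over 500 trials under each of the trained policies; (c), (d) trajectories simulated under each of the trained policies.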
The two approaches perform very similarly. In the first row, we show a histogram of the robustness of 500 trials generated from the system simulated using each of the trained policies after learning has completed, i.e., without the randomization that is used during the learning phase. Note that both trained policies satisfied the specification with probability 1. The mean robustness is 0.2287 with standard deviation 0.1020 for probability maximization, and 0.2617 with standard deviation 0.1004 for robustness maximization. In the second row, we see trajectories simulated by each of the trained policies.
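Summary statistics of this kind can be computed directly from the robustness values of the evaluation trials. The sketch below assumes a trial counts as satisfying the formula when its robustness is strictly positive and uses the sample standard deviation; the function name and the example values are hypothetical and are not the data behind the reported figures.

```python
from statistics import mean, stdev

def summarize(robustness_samples):
    """Empirical satisfaction probability, mean, and sample std of robustness."""
    n = len(robustness_samples)
    satisfied = sum(1 for r in robustness_samples if r > 0)
    return {
        "p_sat": satisfied / n,
        "mean": mean(robustness_samples),
        "std": stdev(robustness_samples),
    }

# Made-up robustness values from four evaluation trials.
stats = summarize([0.1, 0.3, -0.2, 0.2])
print(stats["p_sat"])  # 0.75
```

A trial with robustness exactly zero sits on the boundary between satisfaction and violation; how it is counted is a convention, and the strict inequality above is only one choice.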
The similarity of the solutions in this case study is not surprising. If the state of the system is deep within or , then the probability that it will remain inside that region in the next 3 time steps (satisfy ) is higher than if it is at the edge of the region. Trajectories that remain deeper in the interior of region or also have a high robustness value. Thus, for this particular problem, there is an inherent coupling between the policies that satisfy the formula with high probability and those that satisfy the formula as robustly as possible on average.
VII-B Case Study 2: Repeated Satisfaction
In this second case study, we look at a problem involving repeatedly satisfying a condition finitely many times. The specification of interest is
(35) 
In plain English, (35) is “Ensure that every 4 time units over a 12 unit interval, a green region and a blue region are entered.” Results from this case study are shown in Figure 4. We used the same parameters as listed in Section VII-A, except = 1200, , and . Constructing the MDP took 16.5 s. Applying Algorithm 1 took 257.7 s for Problem 1 and 258.3 s for Problem 2.
Figure 4: (a), (b) Robustness results over the evaluation trials under each of the trained policies; (c), (d) trajectories simulated under each of the trained policies.
In the first row, we see that the solution to Problem 1 satisfies the formula with probability 0 while the solution to Problem 2 satisfies the formula with probability 1. At first, this seems counterintuitive, as Proposition 2 indicates that a policy that maximizes probability would achieve a probability of satisfaction at least as high as the policy that maximizes the expected robustness. However, this is only guaranteed with an infinite number of learning trials. The performance in terms of robustness is clearly better for the robustness maximization (mean 0.1052, standard deviation 0.0742) than for the probability maximization (mean −0.6432, standard deviation 0.2081). In the second row, we see that the maximum robustness policy enforces convergence to a cycle between two regions, while the maximum probability policy deviates from this cycle.
The discrepancy between the two solutions can be explained by what happens when trajectories that almost satisfy (35) occur. If a trajectory that almost oscillates between a blue and green region every four time units is encountered when solving Problem 1, it collects 0 reward. On the other hand, when solving Problem 2, the policy that produces the almost oscillatory trajectory will be reinforced much more strongly, as the resulting robustness is less negative. Because the robustness degree gives “partial credit” for trajectories that are close to satisfying the formula, the reinforcement learning algorithm performs a directed search to find policies that satisfy the formula. Since probability maximization gives no partial credit, the reinforcement learning algorithm is essentially performing a random search until it encounters a trajectory that satisfies the given formula. Therefore, if the family of policies that satisfy the formula with positive probability is small, it will on average take the learning algorithm solving Problem 1 a longer time to converge to a solution that enforces formula satisfaction.
VIII Conclusions and Future Work
In this paper, we presented a new reinforcement learning paradigm to enforce temporal logic specifications when the dynamics of the system are a priori unknown. In contrast to existing works on this topic, we use a logic (signal temporal logic) whose formulation is directly related to a system’s state space. We present a novel, convergent Q-learning algorithm that uses the robustness degree, a continuous measure of how well a trajectory satisfies a formula, to enforce the given specification. In certain cases, robustness maximization subsumes the established paradigm of probability maximization and, in certain cases, robustness maximization performs better in terms of both probability and robustness under partial training. Future research includes formally connecting our approach to abstractions of linear stochastic systems.
References
 [1] A. Abate, A. D’Innocenzo, and M. Di Benedetto. Approximate abstractions of stochastic hybrid systems. Automatic Control, IEEE Transactions on, 56(11):2688–2694, Nov 2011.
 [2] C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, Cambridge, 2008.
 [3] T. Brazdil, K. Chatterjee, M. Chmelik, V. Forejt, J. Kretinsky, M. Kwiatkowska, D. Parker, and M. Ujma. Verification of Markov decision processes using learning algorithms. In F. Cassez and J.-F. Raskin, editors, Automated Technology for Verification and Analysis, volume 8837 of Lecture Notes in Computer Science, pages 98–114. Springer International Publishing, 2014.
 [4] X. C. Ding, S. L. Smith, C. Belta, and D. Rus. Optimal control of Markov decision processes with linear temporal logic constraints. IEEE Transactions on Automatic Control, 59(5):1244–1257, 2014.
 [5] X. C. Ding, J. Wang, M. Lahijanian, I. Paschalidis, and C. Belta. Temporal logic motion control using actor-critic methods. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 4687–4692, May 2012.
 [6] A. Dokhanchi, B. Hoxha, and G. Fainekos. Online monitoring for temporal logic robustness. In Runtime Verification, pages 231–246. Springer, 2014.
 [7] A. Donzé and O. Maler. Robust satisfaction of temporal logic over real-valued signals. Formal Modeling and Analysis of Timed Systems, pages 92–106, 2010.
 [8] G. E. Fainekos and G. J. Pappas. Robustness of temporal logic specifications for continuous-time signals. Theoretical Computer Science, 410(42):4262–4291, 2009.
 [9] J. Fu and U. Topcu. Probably approximately correct MDP learning and control with temporal logic constraints. CoRR, abs/1404.7073, 2014.
 [10] X. Jin, A. Donze, J. V. Deshmukh, and S. A. Seshia. Mining requirements from closed-loop control models. In Proceedings of the 16th international conference on Hybrid systems: computation and control, pages 43–52, 2013.
 [11] A. Jones, Z. Kong, and C. Belta. Anomaly detection in cyber-physical systems: A formal methods approach. In IEEE Conference on Decision and Control (CDC), pages 848–853, 2014.
 [12] A. Julius and G. Pappas. Approximations of stochastic hybrid systems. Automatic Control, IEEE Transactions on, 54(6):1193–1203, June 2009.
 [13] M. Kamgarpour, J. Ding, S. Summers, A. Abate, J. Lygeros, and C. Tomlin. Discrete time stochastic hybrid dynamic games: Verification and controller synthesis. In Proceedings of the 50th IEEE Conference on Decision and Control and European Control Conference, pages 6122–6127, 2011.
 [14] Z. Kong, A. Jones, A. Medina Ayala, E. Aydin Gol, and C. Belta. Temporal logic inference for classification and prediction from data. In Proceedings of the 17th international conference on Hybrid systems: computation and control, pages 273–282. ACM, 2014.
 [15] M. Lahijanian, S. Andersson, and C. Belta. Temporal logic motion planning and control with probabilistic satisfaction guarantees. Robotics, IEEE Transactions on, 28(2):396–409, April 2012.
 [16] M. Lahijanian, S. B. Andersson, and C. Belta. Approximate Markovian abstractions for linear stochastic systems. In Proc. of the IEEE Conference on Decision and Control, pages 5966–5971, Maui, HI, USA, Dec. 2012.
 [17] M. Lahijanian, S. B. Andersson, and C. Belta. Formal verification and synthesis for discrete-time stochastic systems. IEEE Transactions on Automatic Control, 60(8):2031–2045, 2015.
 [18] R. Luna, M. Lahijanian, M. Moll, and L. E. Kavraki. Asymptotically optimal stochastic motion planning with temporal goals. In Workshop on the Algorithmic Foundations of Robotics, Istanbul, Turkey, 2014.
 [19] F. S. Melo. Convergence of Q-learning: a simple proof. http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf.
 [20] V. Raman, A. Donze, M. Maasoumy, R. M. Murray, A. Sangiovanni-Vincentelli, and S. A. Seshia. Model predictive control with signal temporal logic specifications. In Proceedings of IEEE Conference on Decision and Control (CDC), pages 81–87, 2014.
 [21] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia. A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. CoRR, abs/1409.5486, 2014.
 [22] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 [23] M. Svorenova, J. Kretínský, M. Chmelik, K. Chatterjee, I. Cerná, and C. Belta. Temporal logic control for stochastic linear systems using abstraction refinement of probabilistic games. In Hybrid Systems Computation and Control (HSCC) 2015, volume abs/1410.5387, 2015. (To appear).
 [24] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.