# Automatic Testing and Falsification with Dynamically Constrained Reinforcement Learning

## Abstract

An autonomous system such as a self-driving car can be modeled as a multi-agent system, where the system under design is modeled as an ego agent whose behavior can be influenced by a number of (possibly adversarial) agents in the environment. Other cars, pedestrians, bicyclists, traffic lights, etc. can be treated as such adversarial agents. A self-driving car is a safety-critical system; thus, it is important to identify any possible safety violations of the ego agent even in the presence of adversarial agents. In this paper, we propose a novel adversarial testing methodology in which we train adversarial agents to demonstrate flaws in the behavior of the ego agent. One challenge is that it is easy to obtain maximally antagonistic adversaries that would cause all but the most overly conservative ego agents to violate their safety requirements. To address this, we control the degree of adversariality of the environment agents by constraining their behaviors to satisfy certain rules, e.g. requiring that they obey traffic rules. We demonstrate how such dynamic constraints can be expressed as hierarchically ordered rules in the formalism of Signal Temporal Logic. We illustrate the efficacy of our technique in both traditional and deep model-free reinforcement learning to train dynamically constrained adversarial agents in three case studies from the automotive domain.

## 1 Introduction

When developing cyberphysical systems such as autonomous vehicles, drones, or aircraft, it is important to have a robust testing strategy that identifies critical bugs before the system is put into production. Falsification techniques exist to find simulations in which the system under test fails to satisfy its goal specification. These falsification traces can be generated from a bounded set of inputs. Aside from these input bounds, it can be difficult to constrain the falsification traces to satisfy dynamic constraints such as traffic rules on vehicles or regulations on drones and aircraft.

In the case of autonomous driving, it is useful to develop testing scenarios with agents that automatically learn to induce the autonomous vehicle under test to make a mistake that results in a collision or other undesirable behavior. In this case, we are typically not interested in the maximally adversarial vehicle, e.g. a vehicle driving the wrong way on the freeway actively attempting to collide. Instead, we seek to constrain the behavior of the adversary depending on a desired level of difficulty of the testing regime. We may, for example, stipulate that the adversary may not drive backwards, may not stop on the freeway, and must obey speed limits.

To constrain the adversarial agents, we can organize the dynamic constraints in hierarchically organized sets, called rulebooks [4]. The hierarchy of these sets reflects the relative importance of the rules. These rulebooks can define specific legal requirements, e.g. do not drive the wrong way and do not run red lights. They may also encode cultural or normative customary behavior such as the “Pittsburgh left” or the “California stop”. Each such dynamic constraint is modeled as a specification in Signal Temporal Logic (STL) [10].

In addition to the dynamic constraints, the adversaries are given a goal specification. This specification is an STL property that the vehicle under test must attempt to maintain, and the adversaries attempt to falsify. This problem is formally known as falsification. We demonstrate that our algorithm is guaranteed to find faults of the system under test if they exist, and we demonstrate our approach on three case studies from autonomous driving, detailed in Fig. 1.

## 2 Related Work

There is extensive related work on falsification of cyberphysical systems. Falsification considers a system and a specification. The goal specification is given in terms of the output of the system, and the goal is to find a sequence of inputs that induces a violation of the goal specification. Existing work on falsification using reinforcement learning [1] employs a single, monolithic falsifier; to our knowledge, this is the first work in which multiple falsification engines are packaged as agents in the simulation, interacting with the system under test. In addition, combining STL formulas with reinforcement learning agents allows a system designer to specify constraints such as traffic rules.

Work in safe reinforcement learning [3] considers the problem of training agents with dynamic constraints. However, the conventional approaches typically use a model checker in the loop. This model checker is run at each time step to determine which actions preserve the dynamic constraints, and the RL agent is only allowed to choose from those actions. In contrast, our work allows the agent to explore and naturally learn to respect the dynamic constraints. Although the agent is not guaranteed to satisfy the dynamic constraints, it would be natural to use an SMT solver in a counterexample-guided refinement loop until a formal proof can be obtained.

Existing work on hierarchical rulebooks [4] describes a model for specifications on an autonomous vehicle as sets of rules that vary in importance, ranging from basic collision-avoidance through traffic rules and comfort requirements to local customs. Our work builds on this idea by providing an implementation as Signal Temporal Logic constraints, and our algorithm describes how to train an agent that behaves adversarially within the bounds of the rulebooks.

## 3 Background and Problem Statement

Our goal is to test an autonomous vehicle, referred to as the “ego” vehicle, in simulation. The autonomous vehicle is tested against a specification, and the simulation contains other agents that seek to cause the ego vehicle to falsify its specified requirements. Using reinforcement learning, we train agents that behave in a constrained adversarial fashion to cause the ego to violate its requirement. These adversaries are constrained by a set of provided specifications, such as obeying traffic rules. The constraints are expressed as logical scaffolds for the learning agent, and are prioritized hierarchically. In this way, we are able to control the level of adversarial behavior, which enables a spectrum of agents that exhibit different levels of difficulty.

Formally, an agent is a tuple $H = (X, S, O, A, T, \pi)$. $X$ is a set of observable states, i.e. states that are available as public information to other participants of the scenario. $S$ is a set of hidden states, visible only to the agent. $O$ is a set of observations (which depends on the observable states of other agents). $A$ is a set of actions that the agent may take.

The sets of states, actions, and observations may be continuous, discrete, or finite. The conditional transition distribution $T$ governs transitions between tuples of agent states. The stochastic policy $\pi$ is a conditional distribution over the agent actions for a given tuple of agent states and environment observations.

Suppose a set of multiple interacting agents is given, $\{H_1, \ldots, H_n\}$. Without loss of generality, we assume that anything that may change its state is an agent, and so no additional environment model needs to be accounted for. Let $\mathbf{x}$ be the concatenation of the values of all agent states. We describe the transition dynamics of a representative agent $H_i$ with observable state $x_i$. The agent computes its observation of the overall state as $o_i$, and it has a current hidden state value $s_i$. The agent chooses an action by sampling from the policy distribution,

$$a_i \sim \pi_i(\cdot \mid x_i, s_i, o_i).$$

Then, the agent chooses a new state tuple by sampling from the transition distribution $T_i$,

$$(x_i', s_i') \sim T_i(\cdot \mid x_i, s_i, a_i).$$

A behavior trace of a set of agents is a sequence of overall state tuples and actions, $(\mathbf{x}_0, \mathbf{a}_0), (\mathbf{x}_1, \mathbf{a}_1), \ldots$, where the subscripts denote the number in the sequence (i.e. the timestamp). The set of behaviors of a collection of agents is denoted by $\mathcal{B}$.

A specification is a set of behavior traces. We say that a trace satisfies the specification if it is one of the behaviors in the set. A specification can be represented in many different ways, such as a logical formula that characterizes a set of behaviors, or by an abstract model of the system [8, 5, 2]. We assume that for any specification $\varphi$, it is possible to give an indicator function $\mathbb{1}_{\varphi}$, which evaluates to one if a behavior satisfies the specification and zero otherwise.

Without loss of generality, we assume that all dynamic behaviors in a scenario are controlled by an agent. For example, pedestrians, traffic lights, and even weather patterns can be modeled as being controlled by agents. A scenario consists of a collection of agents together with rulebooks that specify constraints on their behaviors, $\{(H_i, R_i)\}_i$, where $R_i$ is the rulebook of agent $H_i$.

Specifications are organized according to their relative importance into hierarchical rulebooks [4]. A hierarchical rulebook is a collection of specifications $R$, together with a partial order $\le$ that expresses the relative priority between the rules. We say that a specification $\varphi_1$ has a higher priority than another specification $\varphi_2$ if we have that $\varphi_2 \le \varphi_1$ and it is not the case that $\varphi_1 \le \varphi_2$. The partial order allows that multiple rules may have the same priority. The rules may be expressed in any formalism, such as logic-based formalisms like LTL or STL, or as conformance metrics to an abstract or prototype model. Alternatively, specifications may be learned from data, for example via specification mining [7]. The work of [4] describes algorithms for managing and combining rulebooks from different sources. In this work, we describe how to systematically test a system driven by these rulebooks.

We are now in a position to state the constrained adversarial strategy synthesis problem, illustrated in Fig. 2. Given a scenario composed of interacting agents,

$$\{(H_{ego}, R_{ego}), (H_1, R_1), \ldots, (H_n, R_n)\},$$

the goal is to learn transition distributions and policies for the adversarial agents $H_1, \ldots, H_n$ such that each adversary $H_i$ satisfies its rulebook $R_i$, but the ego is not able to satisfy its rulebook $R_{ego}$. In this work, we assume that the rulebooks are modeled in Signal Temporal Logic (STL). We demonstrate that it is possible to synthesize constrained adversarial strategies using reinforcement learning.

### 3.1 Reinforcement learning

Reinforcement learning provides a class of algorithms to train goal-driven agents [13]. As part of the training process, we sample the initial state of each episode with a probability distribution that is nonzero at all states. Thus, if given enough time, all states will eventually be selected as the initial state. Additionally, we fix the policies of the adversarial agents to be $\epsilon$-soft. This means that for each state $s$ and every action $a$, $\pi(a \mid s) \ge \epsilon / |A|$, where $\epsilon > 0$ is a parameter and $|A|$ is the number of available actions. Taken together, random sampling of initial states and $\epsilon$-soft policies ensure that the agent performs sufficient exploration and avoids converging prematurely to local optima. In fact, it ensures that if the algorithm runs long enough, the global optimum will eventually be found, as all trajectories have nonzero probability.

Q-Learning: Q-learning is a reinforcement learning algorithm that does not require knowledge of the agent’s environment. The agent maintains a table $Q$ whose rows correspond to the states of the system and whose columns correspond to the actions. For a state-action pair $(s, a)$, the entry at row $s$ and column $a$ represents the quality $Q(s, a)$ that the agent has currently calculated. The table is initialized randomly. At each time step $t$, the agent considers the current state value $s_t$ and, for each action $a$, uses the table to judge the quality of that action from this state, $Q(s_t, a)$. Then, based on this judgment, it selects an action $a_t$ using the $\epsilon$-greedy policy, which is one instance of the $\epsilon$-soft policies described above. Next, at time step $t+1$, the agent observes the reward $r_{t+1}$ received as well as the new state $s_{t+1}$, and it uses this information to update its beliefs about its previous behavior via the update equation

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where $\alpha$ is a learning rate parameter and $\gamma$ is the discount factor.
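As an illustration only, the following minimal Python sketch shows one way the tabular update and $\epsilon$-greedy selection could be written. The function names, the dictionary-based table, the zero initialization, and the example states and actions are assumptions made for this sketch; they are not taken from the implementation used in the experiments.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_table, state, actions, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning update and return the new table entry."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return q_table[(state, action)]


# Hypothetical example: three discretized accelerations, two symbolic states.
rng = random.Random(0)
actions = [-1, 0, 1]
q_table = defaultdict(float)  # zeros for readability; the text initializes randomly
new_value = q_update(q_table, "s0", 1, reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
greedy_action = epsilon_greedy(q_table, "s0", actions, epsilon=0.0, rng=rng)
```

With an all-zero table, the single update moves the entry for state `"s0"` and action `1` halfway toward the observed reward, so the greedy choice in `"s0"` becomes action `1`.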

Deep Q-learning: In deep Q-learning [12], the table is approximated by a neural network, $Q(s, a; \theta)$, where $\theta$ are the network parameters. Deep Q-learning observes states and selects actions similarly to Q-learning, but it additionally uses experience replay, in which the agent stores previously observed tuples of states, actions, next states, and rewards. At each time step, the agent updates its Q-function with the currently observed experience as well as with a batch of experiences sampled randomly from the experience replay buffer. The agent then updates its approximation network by gradient descent on

$$L(\theta) = \mathbb{E}\left[ \left( y - Q(s, a; \theta) \right)^2 \right],$$

where

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$

and $\theta^{-}$ denotes the network parameters from the previous iteration, which are held fixed during the gradient step.
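To make the experience replay step concrete, here is a small, framework-free Python sketch of a replay buffer together with the temporal-difference targets and the squared loss. It is a sketch under stated assumptions: `ReplayBuffer`, `td_targets`, `mean_squared_td_loss`, and `toy_q` are hypothetical names, `toy_q` is a stand-in for the neural network, and the `done` flag for terminal transitions is an addition that the text does not mention.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


def td_targets(batch, q_values, gamma):
    """Targets y = r + gamma * max_a' Q(s', a'), with no bootstrap at episode end."""
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_values(next_state))
        targets.append(reward + bootstrap)
    return targets


def mean_squared_td_loss(batch, targets, q_values):
    """Average of (y - Q(s, a))^2 over the batch."""
    errors = [(y - q_values(s)[a]) ** 2 for (s, a, _, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


def toy_q(state):
    """Hypothetical stand-in for the network: Q-values for two actions."""
    return [0.0, float(state)]


replay = ReplayBuffer(capacity=100)
replay.push(1.0, 1, 0.5, 2.0, False)
replay.push(2.0, 0, 1.0, 3.0, True)
batch = list(replay.buffer)  # use the whole buffer so the numbers are reproducible
targets = td_targets(batch, toy_q, gamma=0.9)
loss = mean_squared_td_loss(batch, targets, toy_q)
```

In a real agent, `toy_q` would be replaced by the network's forward pass and the loss would be minimized by gradient descent on the network parameters; the sketch only shows how the targets and the loss are assembled from stored experiences.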

### 3.2 Signal Temporal Logic

Signal Temporal Logic (STL) [9] enables specifications over real-valued signals and can be applied to many continuous and hybrid systems, such as automotive applications. STL formulas are defined over predicates of the form $f(\mathbf{s}) < c$, where $\mathbf{s}$ is a timed trace (signal), $f: \mathbb{R}^n \to \mathbb{R}$ is a function and $c \in \mathbb{R}$. STL formulas allow the standard logical connectives of conjunction and disjunction, as well as three temporal operators, always ($\square$), eventually ($\Diamond$) and until ($\mathcal{U}$).

A behavior trace satisfies $\Diamond \varphi$ if $\varphi$ is true at least once during the sequence. A behavior satisfies $\square \varphi$ if $\varphi$ is true during the entire duration of the trace. The until operator $\varphi \, \mathcal{U} \, \psi$ states that the left formula $\varphi$ is true until the right formula $\psi$ becomes true.
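The three temporal operators can be illustrated on finite, discretely sampled traces. The sketch below is an assumption-laden simplification: it uses Boolean semantics over a Python list of samples, optional step bounds instead of real-time intervals, and hypothetical signal names `d` and `v`. It is not the monitoring machinery used in the experiments.

```python
def _window(trace, start, end):
    """Indices of the samples inside [start, end], clipped to the trace length."""
    last = len(trace) - 1 if end is None else min(end, len(trace) - 1)
    return range(start, last + 1)


def eventually(pred, trace, start=0, end=None):
    """True if pred holds at least once in the window."""
    return any(pred(trace[t]) for t in _window(trace, start, end))


def always(pred, trace, start=0, end=None):
    """True if pred holds at every sample in the window."""
    return all(pred(trace[t]) for t in _window(trace, start, end))


def until(left, right, trace, start=0, end=None):
    """True if right holds at some sample and left holds at every earlier sample."""
    for t in _window(trace, start, end):
        if right(trace[t]):
            return True
        if not left(trace[t]):
            return False
    return False


# Hypothetical trace: distance d between vehicles and speed v of the lead car.
trace = [{"d": 12.0, "v": 20.0}, {"d": 8.0, "v": 22.0}, {"d": 4.0, "v": 25.0}]
gap_violated = eventually(lambda s: s["d"] < 5.0, trace)
speed_ok = always(lambda s: s["v"] <= 30.0, trace)
gap_always_ok = always(lambda s: s["d"] > 5.0, trace)
slow_until_close = until(lambda s: s["v"] < 25.0, lambda s: s["d"] < 5.0, trace)
```

On this trace the gap drops below 5 at the last sample, so the eventually-formula and the until-formula hold, the speed bound holds throughout, and the always-formula on the gap fails.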

### 3.3 Adversarial strategy synthesis with Reinforcement Learning

In this section, we describe how the adversarial strategy synthesis problem can be solved with reinforcement learning by encoding the rulebooks into the reward function that an agent should maximize. We are given an environment consisting of the system under test, called the ego agent, as well as $n$ adversarial agents. Each agent $H_i$ is given a rulebook, $R_i$. For simplicity and without loss of generality, we assume that the ego rulebook consists of a single specification. We call the negation of this specification the goal specification, since each adversary is targeting this specification. We denote the goal specification by $\varphi_{goal}$.

We assume that the rules of each rulebook have been sorted into sets of rules with the same importance. For example, let $R_i^j$ and $R_i^k$ be two such sets in rulebook $R_i$. If $j > k$, we say that $R_i^j$ has higher priority than $R_i^k$. Then, for every $\varphi_1, \varphi_2 \in R_i^j$, we have that $\varphi_1 \le \varphi_2$ and $\varphi_2 \le \varphi_1$, meaning that within this set all rules have the same priority. However, across sets of different priority, we have that for each $\varphi_j \in R_i^j$ and every $\varphi_k \in R_i^k$, $\varphi_k < \varphi_j$, i.e. $\varphi_k$ has strictly lower priority.

For each set $R_i^j$, we associate a hyperparameter $\lambda_j > 0$. This is the amount of punishment that the agent will receive for violating a constraint in $R_i^j$. Furthermore, there is a hyperparameter $\lambda_{goal} > 0$ that represents the reward that agent $H_i$ receives when it attains its goal. We want to choose these hyperparameters carefully to prevent the agent from finding strategies that attain the goal by violating its constraints. If the maximum length of an episode is fixed to be $N$, it is possible to define relationships between these parameters. The maximum reward that the agent could attain by meeting its goal at all time steps is $N \lambda_{goal}$. Then, we require that for the lowest priority group of constraints $R_i^1$, $\lambda_1 > N \lambda_{goal}$. Similarly, the higher priority groups should have values that discourage the agent from violating those constraints more than the lower priority rules.

To compute the reward signal, we log the state trace from the initial time to the present time. We represent this state trace by $\mathbf{s}$. For an STL formula $\varphi$, let $\mathbb{1}_{\varphi}(\mathbf{s}, t)$ be the indicator function of $\varphi$ over the trace $\mathbf{s}$ at time $t$. This function is equal to one if $(\mathbf{s}, t) \models \varphi$ (i.e. the state trace satisfies $\varphi$ at time $t$) and zero otherwise. Then, the reward signal can be computed as

$$r_i(t) = \lambda_{goal} \, \mathbb{1}_{\varphi_{goal}}(\mathbf{s}, t) - \sum_{j} \lambda_j \sum_{\varphi \in R_i^j} \left( 1 - \mathbb{1}_{\varphi}(\mathbf{s}, t) \right) \tag{1}$$

This means that the agent is rewarded by an amount $\lambda_{goal}$ for satisfying the goal specification $\varphi_{goal}$, i.e. for causing the ego to violate its requirement, and it is punished by an amount $\lambda_j$ for violating a constraint $\varphi \in R_i^j$.
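The reward shaping described above can be sketched in a few lines of Python. The sketch assumes that the monitor has already decided, for the current time step, whether the goal holds and how many rules of each priority set are violated. The function names, the list layout (index 0 is the lowest priority set), and the numeric values are hypothetical.

```python
def step_reward(goal_satisfied, violations, goal_reward, penalties):
    """Goal bonus minus one penalty per violated rule.

    violations[j] is the number of rules violated in priority set j, and
    penalties[j] is the penalty for that set (index 0 = lowest priority).
    """
    reward = goal_reward if goal_satisfied else 0.0
    for count, penalty in zip(violations, penalties):
        reward -= penalty * count
    return reward


def penalties_are_consistent(goal_reward, penalties, max_steps):
    """Check the relationships between the hyperparameters described in the text."""
    # The lowest-priority penalty must exceed the best achievable goal reward.
    lowest_ok = penalties[0] > max_steps * goal_reward
    # Penalties must not decrease as priority increases.
    ordered = all(lo <= hi for lo, hi in zip(penalties, penalties[1:]))
    return lowest_ok and ordered


# Hypothetical values: episodes of at most 100 steps and two priority sets.
goal_reward = 1.0
penalties = [150.0, 300.0]
consistent = penalties_are_consistent(goal_reward, penalties, max_steps=100)
clean_success = step_reward(True, [0, 0], goal_reward, penalties)
success_with_violation = step_reward(True, [1, 0], goal_reward, penalties)
```

With these values, attaining the goal while breaking even one low-priority rule yields a negative reward, which is the behavior the hyperparameter relationships are meant to enforce.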

Theorem (Completeness). Let $N$ be the maximum episode length. If a satisfying trace of $\varphi_{goal}$ of length $N$ exists, agent $H_i$ will eventually find it.

Proof sketch. This follows as we have required that agent policies be $\epsilon$-soft. For simplicity, suppose that there is only one adversarial agent. Since every action at every state is taken with probability at least $\epsilon / |A|$, any $N$-step sequence of actions has probability at least $(\epsilon / |A|)^N > 0$. Therefore, if the algorithm runs long enough, the trace will eventually be found.
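The argument can be made quantitative under simplifying assumptions that are ours rather than the paper's: a single adversary, a fixed initial state, deterministic dynamics, and a lower bound $\delta > 0$ on the probability of every action in every state (the bound provided by the $\epsilon$-soft policy). The symbols $\delta$ and $K$ are introduced only for this sketch.

```latex
% Each episode follows a given N-step action sequence with probability at least
% \delta^N, regardless of what happened in earlier episodes.
\begin{align*}
\Pr[\text{trace found in one episode}] &\ge \delta^{N} \\
\Pr[\text{trace not found in } K \text{ episodes}] &\le \left(1 - \delta^{N}\right)^{K}
  \xrightarrow{\,K \to \infty\,} 0 \\
\mathbb{E}[\text{number of episodes until the trace is found}] &\le \delta^{-N}
\end{align*}
```

The bound on the expected number of episodes grows exponentially in $N$, so the guarantee is asymptotic and says little about practical training time.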

## 4 Case Studies

We consider three case studies. In the first, an ego vehicle is following an adversarial agent on a single-lane freeway. In the second scenario, the ego vehicle is driving on the freeway and an adversarial agent performs a cut-in maneuver. In the third scenario, the ego vehicle is coming to an intersection with a yellow light behind an adversary. The adversary must either cause the ego vehicle to run the traffic light after it turns red or cause the ego vehicle to rear-end it. The adversary may not run the red light and may not drive backwards. In all cases, the adversary learns to trigger erroneous behavior from the ego subject to traffic constraints automatically by exploring the state space. Experiments were performed with the CARLA simulator [6] on a Razer with an Intel Core i7 2.6 GHz processor and 16GB RAM.

### 4.1 Driving in lane

In this experiment, two vehicles are driving on a single-lane freeway. The lead vehicle is an adversarial agent, and the follower vehicle is using an adaptive cruise control policy that we seek to test. This vehicle under test is called the “ego” vehicle. The throttle of the ego vehicle is calculated by

$$u = \min\left(u_{max}, \max\left(u_{min}, u_{pd}\right)\right) \tag{2}$$

where $u_{min}$ and $u_{max}$ are saturation bounds, and $u_{pd}$ is a proportional-derivative control law given by

$$u_{pd} = K_p \left(d - d_{set}\right) + K_d \left(v_{adv} - v_{ego}\right) \tag{3}$$

Here, $d$ is the distance between the front bumpers of the two vehicles, $d_{set}$ is a setpoint distance that the vehicle tries to maintain, $v_{ego}$ is the velocity of the ego, $v_{adv}$ is the velocity of the adversary, and $K_p$ and $K_d$ are proportional and derivative gains, respectively.
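A saturated proportional-derivative throttle law of this kind might be sketched as follows. The sign convention (throttle increases when the gap exceeds the setpoint or when the lead vehicle is pulling away), the function name, and all numeric values are assumptions chosen for illustration; they are not the gains used in the experiments.

```python
def pd_throttle(distance, setpoint, v_ego, v_adv, kp, kd, u_min, u_max):
    """Saturated PD throttle on the gap error and its rate of change."""
    gap_error = distance - setpoint      # positive when the gap is too large
    gap_rate = v_adv - v_ego             # positive when the gap is growing
    u_pd = kp * gap_error + kd * gap_rate
    return min(u_max, max(u_min, u_pd))  # clamp to the saturation bounds


# Hypothetical gains and bounds.
gains = dict(kp=0.1, kd=0.5, u_min=0.0, u_max=1.0)
mild = pd_throttle(22.0, 20.0, v_ego=25.0, v_adv=25.0, **gains)
too_close = pd_throttle(10.0, 20.0, v_ego=30.0, v_adv=20.0, **gains)
far_behind = pd_throttle(100.0, 20.0, v_ego=25.0, v_adv=25.0, **gains)
```

The three calls show the unsaturated regime, saturation at the lower bound when the ego is too close and closing in, and saturation at the upper bound when the ego is far behind.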

The goal formula for the adversarial agent is described by the STL formula

$$\varphi_{goal} = \Diamond_{[0, T]} \left(d < d_{safe}\right) \tag{4}$$

where $T$ is the maximum duration of an episode and $d_{safe}$ is the minimum safe distance between the two vehicles. In other words, the objective of the adversary is to violate the safety distance in time less than $T$.

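The rulebook constraints that the adversarial agent must obey are

$$\square_{[0, T]} \left(v_{adv} \le v_{lim}\right) \tag{5}$$

$$\square_{[0, T]} \left(v_{adv} \ge v_{min}\right) \tag{6}$$

where $v_{adv}$ is the velocity of the adversary, $v_{lim}$ is the speed limit and $v_{min}$ is a minimum velocity to prevent the adversary from coming to a complete stop, since stopping on the freeway is a traffic rule violation. Both constraints are given the same priority under the rulebook.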

The goal reward and the constraint penalty are fixed hyperparameters. The reward is calculated as in Equation 1, depending on whether the adversary attained its objective and whether it violated its constraints. The safety distance is measured between the two front bumpers and represents one car length plus a small safety margin. At each time step, the adversarial agent selects an acceleration from a discretized set of possible actions. The state space consists of the distance between the vehicles and the velocities of the two vehicles.

#### Results using tabular Q-learning

We first consider tabular Q-learning. Each episode starts from a randomly sampled initial state, and as training proceeds the adversary learns to induce a collision.

Figure 3 shows episodes from the same initial state. In the first episode, the adversary fails to induce a collision, whereas in the later episode it succeeds. The adversary demonstrates different combinations of acceleration and deceleration behaviors that are able to make the ego vehicle fail.

Table 1: Statistics of the training process with tabular Q-learning.

Epoch | Successes | Episodes | Success rate (%) | Simulation time (s) |
---|---|---|---|---|
6 | 76 | 106 | 71.69 | 9115.97 |
8 | 149 | 206 | 72.33 | 16963.58 |
10 | 232 | 306 | 75.82 | 28060.29 |

To benchmark the training procedure, we explored the performance of the adversary under relaxed goals, i.e. goal specifications that are easier to satisfy. Table 1 shows statistics of the training process under such a relaxed goal.

Fig. 4 allows us to examine the interplay of the goal condition and constraints over time. In the top left quadrant, we plot the number of times that the adversary violated its constraints without achieving its goal condition as a function of number of training episodes. In the top right quadrant, we plot the number of times that the adversary achieved its goal condition while violating its constraints. In the bottom left, we plot the number of times that the adversary satisfies its constraints without achieving its goal condition. Finally, in the bottom right we plot the number of times that the adversary achieves its goal condition while satisfying its constraints.
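The four quadrants correspond to the four combinations of "goal achieved" and "constraints satisfied". A minimal Python sketch of this bookkeeping is shown below; the trace layout, the field names `d` and `v_adv`, and the thresholds are hypothetical and serve only to illustrate how episodes could be sorted into the four categories.

```python
from collections import Counter


def classify_episode(trace, d_safe, v_min, v_max):
    """Return (goal_reached, constraints_kept) for one logged episode."""
    goal_reached = any(step["d"] < d_safe for step in trace)
    constraints_kept = all(v_min <= step["v_adv"] <= v_max for step in trace)
    return (goal_reached, constraints_kept)


def outcome_counts(episodes, d_safe, v_min, v_max):
    """Count episodes in each of the four goal/constraint categories."""
    return Counter(classify_episode(ep, d_safe, v_min, v_max) for ep in episodes)


# Hypothetical logged episodes: gap d and adversary speed v_adv per step.
episodes = [
    [{"d": 20.0, "v_adv": 25.0}, {"d": 4.0, "v_adv": 10.0}],   # goal, rules kept
    [{"d": 20.0, "v_adv": 25.0}, {"d": 3.0, "v_adv": 0.0}],    # goal, rule broken
    [{"d": 20.0, "v_adv": 25.0}, {"d": 15.0, "v_adv": 20.0}],  # no goal, rules kept
]
counts = outcome_counts(episodes, d_safe=5.0, v_min=5.0, v_max=30.0)
```

Accumulating these counts over training episodes gives the four curves of the kind plotted in the figure.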

The total simulation time for 488 episodes was 40623.52 seconds. The average episode duration was 83.24 seconds.

#### Results using Neural Network

We next consider a neural network with a replay buffer [11]. In the neural network case, we do not need to discretize the state space. The interplay between the goal specification and the rulebook constraints is shown in Figure 6. The performance of the neural network improves over time.

The simulation time for 853 episodes was 24806.76 seconds, with an average time of 29.08 seconds per episode.

#### Comparison between Q-table and Neural Network

Table 2 and the results above show that the average run time per episode is lower for the neural network than for the Q-table: over 233 episodes each, about 29.9 seconds per episode versus about 74.8 seconds. Storing and reloading the Q-table across different simulations is more computationally expensive than doing so for a portable neural network.

Table 2: Comparison between the Q-table and the neural network (NN).

Method | Episodes | Successes | Success rate (%) | Time (s) |
---|---|---|---|---|
Q-table | 233 | 130 | 55.79 | 17420.58 |
NN | 233 | 128 | 54.93 | 6964.63 |

### 4.2 Lane Change Maneuvers

In this experiment, detailed in Fig. 7, two vehicles are driving on a two-lane freeway. The ego vehicle is controlled by a switching controller consisting of cruise and collision-avoidance controllers. The ego vehicle predicts future adversary positions based on the current state and switches between the controllers. The adversarial agent is in the left lane and has the goal of merging right to cause the ego vehicle to collide with it. To prevent cases where the adversary merges into the side of the ego vehicle, we add a rulebook constraint that the adversary should always be longitudinally in front of the ego car,

$$\square_{[0, T]} \left(x_{adv} > x_{ego}\right) \tag{7}$$

where $x_{adv}$ and $x_{ego}$ denote the longitudinal positions of the adversary and the ego vehicle.

The look ahead distance is calculated by:

(8) |

Here, the look-ahead distance depends on the lateral distance between the adversary and the ego car, the current lateral velocity of the adversary, and the look-ahead time.

Table 3: Training statistics for the lane change scenario.

Episodes | Success count | Success rate (%) | Simulation time (s) |
---|---|---|---|
162 | 97 | 59.88 | 6797.20 |
192 | 117 | 60.93 | 7645.74 |
222 | 142 | 63.96 | 8289.18 |
252 | 162 | 64.28 | 9337.53 |

Table 3 shows that the performance of the adversary improves over time. The behavior of the adversary can potentially help improve not only each of the ego’s individual controllers, but also the ego’s policy for switching between the cruise and collision-avoidance controllers.

### 4.3 Yellow Light Scenario

In this experiment, the ego vehicle is approaching a yellow traffic light, led by an adversarial vehicle. The ego vehicle is controlled by a policy that switches between adaptive cruise control with respect to the lead car and choosing an appropriate deceleration to stop before the traffic light turns red. The traffic light is not adversarial; it merely changes its state based on a pre-determined schedule. The goal of the adversarial vehicle is to make the ego vehicle run the red light.

The training starts when the adversarial vehicle is a fixed distance in front of the traffic light. At this point, the traffic light turns yellow and transitions to red after a fixed countdown. The rulebook constraints on the adversarial vehicle are that it may not drive backwards and it may not run the red light. The initial state sampling includes different initial speeds of the two vehicles and different initial distances between the two vehicles. The trigger distance between the adversarial vehicle and the traffic light and the duration of the yellow-light countdown remain the same throughout training.

By setting ,

(9) |

Here, Equation 9 involves the maximum available deceleration of the ego vehicle, the stopping distance to the yellow light, and the current distance of the ego vehicle to the traffic light. The stopping distance is calculated by:

(10) |

Equation 10 depends on the current speed of the ego vehicle, the remaining countdown time for the yellow light, and the maximum available deceleration of the ego vehicle. It gives the distance that the ego vehicle needs to come to a complete stop under maximum deceleration.

Figure 8 shows the ego vehicle maintained an appropriate distance to the lead car, but it started decelerating too late and was caught in the intersection during the red light. The adversarial vehicle successfully cleared the intersection while the light was still yellow, consistent with its rulebook constraints.

## 5 Conclusions and Future Work

We have described a technique to automatically test and falsify complex systems by training dynamically constrained RL agents. Our approach can find all of the counterexample traces that a monolithic falsifier can find, and it comes with the additional advantage of being able to re-use pretrained adversarial agents in other testing scenarios. In future work, we will explore the use of an SMT solver to check that adversarial agents indeed satisfy their dynamic constraints, and as part of a counterexample-guided retraining process in case they do not. Furthermore, we will explore the use of signal clustering techniques to distinguish different categories of counterexample traces. We believe it will be useful to test engineers to see a few categories that may correspond to the same bug rather than to a large number of counterexample traces.

### References

- [1] (2018) Falsification of Cyber-Physical Systems Using Deep Reinforcement Learning. In Federated Logic Conference (FLOC), Vol. 10951, pp. 456–465. arXiv:1805.00200.
- [2] (2007) Logics of specification languages. Springer Science & Business Media.
- [3] (2019) Safe Reinforcement Learning with Scene Decomposition for Navigating Complex Urban Environments. In IV. arXiv:1904.11483.
- [4] (2019) Liability, Ethics, and Culture-Aware Behavior Specification using Rulebooks. In ICRA. arXiv:1902.09355.
- [5] (2009) Quantitative model checking of continuous-time Markov chains against timed automata specifications. In 2009 24th Annual IEEE Symposium on Logic In Computer Science, pp. 309–318.
- [6] (2017) CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16.
- [7] (2015-11) Mining requirements from closed-loop control models. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34 (11), pp. 1704–1717. ISSN 1937-4151.
- [8] (1999) Designing specification languages for process control systems: lessons learned and steps to the future?. In Software Engineering - ESEC/FSE '99, pp. 127–146.
- [9] (2004) Monitoring temporal properties of continuous signals. In FORMATS/FTRTFT.
- [10] (2004) Monitoring temporal properties of continuous signals. Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems (3253), pp. 152–166.
- [11] (2013) Playing Atari with deep reinforcement learning. ArXiv abs/1312.5602.
- [12] (2013) Playing Atari with Deep Reinforcement Learning. In NIPS. arXiv:1312.5602.
- [13] (2018) Reinforcement learning: an introduction. Second edition, Adaptive computation and machine learning series, The MIT Press, Cambridge, MA. ISBN 978-0-262-03924-6.