Supervisor Synthesis of POMDP based on Automata Learning
Abstract
As a general and thus popular model for autonomous systems, the partially observable Markov decision process (POMDP) can capture uncertainties from different sources such as sensing noise, actuation errors, and uncertain environments. However, its generality makes planning and control in POMDPs difficult. Traditional POMDP planning problems aim to find the optimal policy that maximizes the expected accumulated reward. But for safety-critical applications, guarantees of system performance described by formal specifications are desired, which motivates us to consider formal methods to synthesize a supervisor for a POMDP. With system specifications given in Probabilistic Computation Tree Logic (PCTL), we propose a supervisory control framework with a type of deterministic finite automaton (DFA), the za-DFA, as the controller form. While existing work mainly relies on optimization techniques to learn fixed-size finite state controllers (FSCs), we develop an L* learning based algorithm to determine both the state space and the transitions of the za-DFA. Membership queries and different oracles for conjectures are defined. The learning algorithm is sound and complete. An example is given in detailed steps to illustrate the supervisor synthesis algorithm.
I Introduction
In real-world applications, autonomous systems inevitably involve uncertainties. The planning and control problem for such systems has become an active research area in recent years, with applications ranging from navigation [1, 2], communication protocol design [3], and autonomous driving [4, 5] to human-robot collaboration [6, 7, 8]. Different system models have been considered to capture uncertainties, and the partially observable Markov decision process (POMDP) has emerged as one of the most general and thus popular models. A POMDP models system software and hardware statuses with discrete states. Between different states, probabilistic transitions are triggered by different system actions to describe uncertainties in the system actuation behavior. Compared to the Markov decision process (MDP), a POMDP considers partial observability of its states, which can model sensing noise and observation errors and makes the MDP a special case of the POMDP. This property is very useful in modeling autonomous systems with hidden states, such as advanced driver assistance systems (ADAS) [5] and human-robot collaboration [7, 9], where human intention cannot be directly observed. Together with probabilistic transitions between states and nondeterminism in action selection, a POMDP can capture uncertainties from various sources, such as sensing, actuation, and the environment. Meanwhile, a reward function can be defined that assigns a real value to each state transition to represent additional information in the POMDP.
In this paper, we study formal design methods for POMDPs with control tasks given by Probabilistic Computation Tree Logic (PCTL). While most regular properties are undecidable for POMDPs [35], we consider PCTL specifications with finite horizons, which bound the search space with finite memory in POMDP model checking, following the philosophy of Bounded Model Checking [36]. Meanwhile, many robotics applications, such as motion planning [37], require task completion in finite time, which also makes PCTL specifications with finite horizons suitable for describing our control tasks. With a finite planning horizon, the model checking problem of POMDP is decidable, but a history-dependent controller instead of a memoryless one is necessary. To regulate a POMDP to satisfy a finite horizon PCTL specification, we propose a novel supervisory control framework with a special type of deterministic finite automaton (DFA), the za-DFA, as the supervisor to achieve history-dependent planning. After defining the probability space of POMDP, PCTL satisfaction over POMDP is established on the product model between the za-DFA and the POMDP. To check the satisfaction relation efficiently, we show the connection between model checking and optimal policy computation, and then modify a state-of-the-art POMDP solving algorithm, Partially Observable Monte-Carlo Planning (POMCP) [38], to reduce the computational complexity of POMDP model checking. After that, an L* learning based supervisor synthesis algorithm is proposed to synthesize a za-DFA that satisfies the given specification. To guarantee the soundness and completeness of the supervisor synthesis, we design novel algorithms to answer membership queries and conjectures from L* learning. The returned za-DFA is also permissive in that it can enable more than one action for the POMDP to select given a history.
I-A Related Work
Traditional planning and control problems in POMDP aim to find a policy that maximizes the expected accumulated reward. Since states are not directly observable, the information available to a control policy is the observation-action sequence up to the current time instance, and such a sequence is called a history. A history can be represented in a compact form called a belief state, which is a probability distribution over the state space of the POMDP. Since the belief state is a sufficient statistic for the history [10], a POMDP can be viewed as an MDP with a continuous state space formed by belief states. This inspires solving POMDP planning by finding the optimal control policy over the continuous belief state space. Exact planning of POMDP [11] can be intractable, since the computation explodes quickly with the size of the state space and the planning horizon. Therefore, approximation methods are proposed that approximate the value function or limit the policy search space to alleviate the computational complexity. As one of the most popular approaches, point-based value iteration (PBVI) optimizes the value function only over a selected finite set of belief states and provides the optimization result with a bounded error [12, 13, 14, 15].
Compared to the point-based approach that solves POMDP on the continuous state space of belief states, the controller-based approach [16] finds the optimal policy represented by a finite state controller (FSC) with finite memory. An FSC can be defined as a directed graph with each node labeled by an action and each edge by an observation in POMDP. Each node has one outgoing edge per observation, and a policy is executed by taking the action associated with the node at the current time instance and updating the current node by following the edge labeled by the observation made [17]. This representation is equivalent to a Moore machine [18] from automata theory [16]. There are two types of approaches to finding an FSC: policy iteration and gradient search. Policy iteration tends to find the optimal controller, but the size of the controller can grow exponentially fast and become intractable. Gradient search usually leads to a suboptimal solution that is often trapped in a local optimum [19, 20]. To combine the advantages of gradient ascent and policy iteration, bounded policy iteration (BPI) is proposed in [17] to limit the size of the controller and provide evidence to help escape local optima. Besides the directed graph as the controller form for FSC, DFA and Mealy machine have also been considered in [21] and [16], respectively.
Currently, most existing results on the control problem of POMDP focus on reward-based planning. However, for safety-critical applications like autonomous driving, guaranteed system performance is crucial. This motivates us to consider formal methods. In robotics, formal methods are used to generate controllers that guarantee that the system performance satisfies high-level mission requirements [22, 23, 24, 25]. For complicated missions, temporal logic [26] is an efficient tool to describe requirements for system tasks due to its expressiveness and similarity to natural languages. Compared to the extensive studies of reward-based planning, very few results on formal methods based planning have been established for POMDP, which makes it an open problem [27]. Only recently have there been some advances in the controller synthesis of POMDP under temporal logics. In [28], the controller synthesis of POMDP with Linear Temporal Logic (LTL) specifications over an infinite horizon is discussed and solved based on gradient search, where fixed-size FSCs are used to maximize the probability of satisfying the given LTL specification. However, this method suffers from local maxima, and there is no systematic guideline for the initial choice of the FSC’s structure [28]. In [29], the authors use an observation-stationary (memoryless) controller to regulate POMDP to satisfy almost-sure reachability properties. Since the selected action depends only on the current observation, the satisfiability modulo theories (SMT) method is applied, with an idea similar to that in [30], where a state-based controller for MDP is learned. Compared to history-dependent controllers, the memoryless controllers used in these works are not general enough for reasoning over finite horizons. In [31], a linear time-invariant system with a linear observation model for states is considered, which is equivalent to a discrete-time continuous-space POMDP. The system specification is given in Gaussian Distribution Temporal Logic (GDTL) as an extension of Boolean logic. Sampling-based algorithms are proposed to build a transition system that serves as a finite abstraction of the belief space. With the specification converted to a deterministic Rabin automaton, the synthesis is done on the product MDP following a dynamic programming approach. However, the size of the product MDP still suffers from the curse of history for POMDPs. Similar to POMDPs, deterministic systems with partial information have also been studied for synthesis problems in [32, 33, 34]. However, applying these methods to POMDP is hard due to the probabilistic nature of POMDP transitions.
Since POMDP is an extension of MDP, there is also some related work on supervisor synthesis for MDPs. In particular, for permissive controller design, state-based controllers without memory are proposed in [40] for infinite horizon planning, and Mixed Integer Linear Programming (MILP) [41] is applied to find a permissive controller. Similarly, in [30], SMT is combined with reinforcement learning to learn a state-based controller. While these methods assume a memoryless controller, a history-dependent controller is necessary for POMDP planning over a finite horizon. Also, due to the partial observability in POMDPs, applying these methods to POMDP is fundamentally difficult. Besides these works, using the L* algorithm to learn a system supervisor has also been considered in our previous work for MDPs [42]. To apply the L* algorithm to POMDP supervisor synthesis, in this paper we discuss the supervisor synthesis framework at length and design new membership query and conjecture checking rules to overcome the difficulties brought by partial observability.
I-B Our Contributions
This paper is an extended and revised version of our preliminary conference paper [39]. Compared to [39], this paper makes the following new contributions. First, we formally build the POMDP supervisory control framework by proving the sufficiency of using a za-DFA as the controller form, defining the probability space for POMDP, and then establishing PCTL satisfaction over POMDP. Second, the model checking of POMDP over observation-based adversaries is studied in depth, and a modified POMCP algorithm is given to reduce the computational complexity. Third, we develop new oracles for the L* learning algorithm to guarantee completeness and allow permissiveness of the supervisor. Based on that, a new example is given to illustrate the learning process in detailed steps.
The technical contributions are summarized in the order in which they appear in the paper as follows:

We propose a supervisory control framework for POMDP to satisfy PCTL specifications over finite horizons. As a special type of DFA, the za-DFA is used as the supervisor form. Based on that, we further define the probability space and PCTL satisfaction over POMDP. Then POMDP model checking is discussed in depth, and a modified POMCP method is given to speed up the model checking process.

We design an L* learning based supervisor synthesis algorithm to learn a suitable supervisor automatically. With properly defined membership queries and conjectures, our learning algorithm is sound and complete. The returned za-DFA can be permissive, and the nonblocking feature is guaranteed.
I-C Outline of the Paper
The rest of this paper is organized as follows. In Section II, MDP-related preliminaries are given with definitions and notations. The supervisory control framework for POMDP is proposed in Section III. Following that, Section IV presents the L* learning based supervisor synthesis algorithm. The analysis and discussions are given in Section V. Section VI gives an example to illustrate the learning process. Finally, Section VII concludes this paper with future work.
II Preliminaries
II-A MDP Modeling, Paths and Adversaries
MDPs are probabilistic models for systems with discrete state spaces. With nondeterminism from decision making and probabilistic behavior in system transitions, MDPs are widely used to model system uncertainties.
Definition 1.
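[43] An MDP is a tuple $M = (S, \bar{s}, A, T)$ where

$S$ is a finite set of states;

$\bar{s} \in S$ is the initial state;

$A$ is a finite set of actions;

$T: S \times A \times S \to [0, 1]$ is a transition function.
Here $T(s, a, s')$ describes the probability of making a transition from a state $s \in S$ to another state $s' \in S$ after taking an action $a \in A$.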
In MDPs, there are multiple actions defined for each state. If we limit the number of actions defined for each state to be 1, we have a discrete-time Markov chain (DTMC).
Definition 2.
[43] A DTMC is a tuple $D = (S, \bar{s}, T)$ where

$S$ is a finite set of states;

$\bar{s} \in S$ is the initial state;

$T: S \times S \to [0, 1]$ is a transition function.
To analyze the behavior of MDP and DTMC with additional information, we can define a labeling function $L: S \to 2^{AP}$ that assigns each state a subset of the atomic propositions $AP$. This helps to introduce system requirements in the form of temporal logics.
In an MDP $M$, a path $\omega$ is a nonempty sequence of states and actions of the form
$\omega = s_0 a_0 s_1 a_1 s_2 \ldots$
where $s_0 = \bar{s}$ and $T(s_i, a_i, s_{i+1}) > 0$ for all $i \geq 0$ [43]. Generally, we denote the $i$th state of a path $\omega$ as $\omega(i)$ and the length of $\omega$ (the number of transitions) as $|\omega|$. We use $Path_M$ to represent the set of all possible paths in $M$ and $Pref_M$ for its set of corresponding prefixes.
To resolve the nondeterminism in an MDP, we need an adversary to build a map between system paths and actions. Depending on whether a deterministic action is selected or a probability distribution over all possible actions is given, there are two types of adversaries: the pure adversary and the randomized adversary. A pure adversary is a function $\sigma: Pref_M \to A$ that maps every finite path of $M$ onto an action in $A$. A randomized adversary is a function $\sigma: Pref_M \to Dist(A)$ that maps every finite path of $M$ onto a distribution over $A$. With an adversary $\sigma$ that resolves the nondeterminism in the MDP, the set of possible MDP paths is denoted as $Path^{\sigma}_M$ and the regulated system behavior can be represented as a DTMC.
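To make the role of a pure adversary concrete, the following minimal Python sketch resolves the nondeterminism of a small MDP and computes the probability of one finite path. The dictionary encoding of the transition function, the toy MDP, and the names `path_probability`, `always_a` and `switch_to_b` are illustrative assumptions and do not come from the paper. For brevity, the history passed to the adversary is the visited state sequence only, since under a pure adversary the actions are determined by it.

```python
from typing import Callable, Dict, List, Tuple

State = str
Action = str
# Hypothetical encoding: T[(s, a)][s2] is the probability of moving from s to s2 under a.
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
# A pure adversary maps a finite history (here: the visited states) to one action.
PureAdversary = Callable[[Tuple[State, ...]], Action]


def path_probability(T: Transitions, adversary: PureAdversary, states: List[State]) -> float:
    """Probability of visiting `states` in order when `adversary` picks every action."""
    prob = 1.0
    for i in range(len(states) - 1):
        action = adversary(tuple(states[: i + 1]))
        # An undefined (state, action) pair or an impossible successor contributes 0.
        prob *= T.get((states[i], action), {}).get(states[i + 1], 0.0)
    return prob


# Toy MDP, assumed for illustration only.
T: Transitions = {
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s0", "b"): {"s1": 1.0},
    ("s1", "a"): {"s1": 1.0},
}


def always_a(history: Tuple[State, ...]) -> Action:
    # Memoryless choice: the same action regardless of the history.
    return "a"


def switch_to_b(history: Tuple[State, ...]) -> Action:
    # History-dependent choice: try "a" first, then "b" if the system is still in s0.
    if len(history) > 1 and history[-1] == "s0":
        return "b"
    return "a"


p_memoryless = path_probability(T, always_a, ["s0", "s0", "s1"])
p_history = path_probability(T, switch_to_b, ["s0", "s0", "s1"])
```

Once an adversary is fixed, every step has a single action, so the remaining behavior is purely probabilistic, which is the sense in which the regulated system is a DTMC.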
II-B PCTL and PCTL Model Checking over MDPs
For a labeled MDP, we can use PCTL [43] to represent the system design requirements. PCTL is the probabilistic extension of the Computation Tree Logic (CTL) [44].
Definition 3.
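[43] The syntax of PCTL is given by
state formulas $\phi ::= true \mid \alpha \mid \neg\phi \mid \phi \wedge \phi \mid P_{\bowtie p}[\psi]$,
path formulas $\psi ::= X\phi \mid \phi\, U^{\leq k} \phi \mid \phi\, U \phi$,
where $\alpha \in AP$, $\bowtie \in \{\leq, <, \geq, >\}$, $p \in [0, 1]$ and $k \in \mathbb{N}$.
Here $\neg$ stands for "negation", $\wedge$ for "conjunction", $X$ for "next", $U^{\leq k}$ for "bounded until" and $U$ for "until". In particular, $P_{\bowtie p}[\psi]$ takes a path formula $\psi$ as its parameter and describes the probabilistic constraint.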
Given the syntax of PCTL, we can define the PCTL satisfaction relation on MDP as follows.
Definition 4.
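[43] For a labeled MDP $M$, the satisfaction relation $\models$ for any state $s \in S$ is defined inductively:
$s \models true$, for all $s \in S$;
$s \models \alpha \Leftrightarrow \alpha \in L(s)$;
$s \models \neg\phi \Leftrightarrow s \not\models \phi$;
$s \models \phi_1 \wedge \phi_2 \Leftrightarrow s \models \phi_1$ and $s \models \phi_2$;
$s \models P_{\bowtie p}[\psi] \Leftrightarrow Pr^{\sigma}_{s}(\{\omega \in Path^{\sigma}_{M} \mid \omega \models \psi\}) \bowtie p$ for all $\sigma \in Adv_M$,
where $Adv_M$ is the set of all adversaries and for any path $\omega \in Path_M$:
$\omega \models X\phi \Leftrightarrow \omega(1) \models \phi$;
$\omega \models \phi_1 U^{\leq k} \phi_2 \Leftrightarrow \exists i \leq k$ s.t. $\omega(i) \models \phi_2$ and $\omega(j) \models \phi_1$ for all $j < i$;
$\omega \models \phi_1 U \phi_2 \Leftrightarrow \exists k \geq 0$ s.t. $\omega \models \phi_1 U^{\leq k} \phi_2$.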
The model checking of PCTL specifications has been extensively studied for MDPs [43]. For specifications with the probabilistic operator, depending on whether the bound in the specification is a lower or an upper bound, PCTL model checking of MDPs solves an optimization problem by computing either the minimum or the maximum probability over all adversaries [43]. Since the states are fully observable, model checking for MDPs can be solved with dynamic programming techniques in polynomial time [45]. Different software tools for MDP model checking are available, such as PRISM [46] and the more recently developed model checker Storm [47].
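The dynamic programming idea for a bounded until formula on a fully observable MDP can be sketched as a backward induction over the step bound. The code below is a minimal illustration under assumptions: the dictionary encoding, the function name `bounded_until`, and the three-state toy MDP are hypothetical, and the sets `phi1` and `phi2` stand for the states that satisfy the two subformulas.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical encoding: T[(s, a)][s2] is the transition probability.
Transitions = Dict[Tuple[str, str], Dict[str, float]]


def bounded_until(T: Transitions, states: Iterable[str], phi1: Set[str],
                  phi2: Set[str], k: int, maximize: bool = True) -> Dict[str, float]:
    """Max (or min) probability over adversaries of reaching phi2 within k steps
    while staying in phi1, computed by backward induction on the step bound."""
    states = list(states)
    pick = max if maximize else min
    value = {s: 1.0 if s in phi2 else 0.0 for s in states}  # zero steps left
    for _ in range(k):
        updated = {}
        for s in states:
            if s in phi2:
                updated[s] = 1.0
            elif s not in phi1:
                updated[s] = 0.0
            else:
                # One expected value per action that is defined in s.
                outcomes = [
                    sum(p * value[t] for t, p in succ.items())
                    for (src, _action), succ in T.items() if src == s
                ]
                updated[s] = pick(outcomes) if outcomes else 0.0
        value = updated
    return value


# Toy MDP, assumed for illustration only.
T: Transitions = {
    ("s0", "a"): {"s1": 0.5, "s2": 0.5},
    ("s0", "b"): {"s0": 0.9, "s1": 0.1},
    ("s1", "a"): {"s1": 1.0},
    ("s2", "a"): {"s2": 1.0},
}
states = ["s0", "s1", "s2"]
p_max = bounded_until(T, states, set(states), {"s1"}, k=2, maximize=True)
p_min = bounded_until(T, states, set(states), {"s1"}, k=2, maximize=False)
# An upper-bound specification holds in s0 if even the maximum stays below the bound.
holds_upper_bound = p_max["s0"] <= 0.6
```

For an upper bound the maximum over adversaries is compared with the threshold, and for a lower bound the minimum is used, which mirrors the case distinction in the paragraph above.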
III POMDP Modeling and Supervisory Control Framework
In this section, we propose a supervisory control framework to regulate the closed-loop behavior of POMDP to satisfy finite horizon PCTL specifications.
III-A POMDP Modeling, Paths and Adversaries
POMDPs are widely used to capture system uncertainties from different aspects. As an extension of the MDP model, POMDP considers states with partial observability to model uncertainties from system sensing.
Definition 5.
A POMDP is a tuple $P = (M, Z, O)$ where

$M$ is an MDP;

$Z$ is a finite set of observations;

$O: S \times Z \to [0, 1]$ is an observation function.
In POMDP, the observable information for each state $s$ is given by $O$ as a probability distribution over $Z$. Here $O(s, z)$ stands for the probability of observing $z$ at state $s$. Then MDP can also be viewed as a special case of POMDP where $Z = S$ and the observation function defined for each $s \in S$ is a Dirac delta function with $O(s, z) = 1$ if $z = s$ and $O(s, z) = 0$ otherwise.
Remark: Since states in POMDP are not directly observable, it may happen that we make an observation and decide to take an action that is not defined for the current real state. In this case, no state transition is triggered, as the system ignores this command and stays in its current state.
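The observation function and the convention in the remark can be illustrated with a minimal Python sketch. The dictionary encodings, the helper names `sample`, `step` and `observe`, and the two-state toy POMDP are assumptions made for this illustration only.

```python
import random
from typing import Dict, Tuple

# Hypothetical encodings: T[(s, a)][s2] and O[s][z] are probabilities.
Transitions = Dict[Tuple[str, str], Dict[str, float]]
Observations = Dict[str, Dict[str, float]]


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one outcome from a finite distribution given as {outcome: probability}."""
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(T: Transitions, s: str, a: str, rng: random.Random) -> str:
    """Apply action a in hidden state s; an undefined action leaves the state unchanged."""
    successors = T.get((s, a))
    if successors is None:
        return s  # the command is ignored, as described in the remark
    return sample(successors, rng)


def observe(O: Observations, s: str, rng: random.Random) -> str:
    """Sample an observation for the hidden state s."""
    return sample(O[s], rng)


# Toy POMDP, assumed for illustration only.
T: Transitions = {("s0", "go"): {"s1": 1.0}}
O: Observations = {"s0": {"z0": 1.0}, "s1": {"z0": 0.3, "z1": 0.7}}
rng = random.Random(0)

stays_put = step(T, "s1", "go", rng)   # "go" is not defined in s1
moves = step(T, "s0", "go", rng)
first_obs = observe(O, "s0", rng)
noisy_obs = observe(O, "s1", rng)
```

In this toy model both states can emit `z0`, so a controller that sees `z0` cannot tell whether `go` is defined in the current state, which is exactly the situation the remark covers.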
Due to the partial observability, paths in POMDP cannot be directly observed and used as the information for POMDP planning. Instead, the observation sequence of a path can be defined as a unique sequence where and for all (if , then and are considered as different paths). This observation sequence can be seen as the history in traditional POMDP planning problems. While a history is defined to start with an action, the initial observation in the observation sequence can be seen as a special observation for the initial state with since we assume is known. If the initial status of POMDP is given as a probability distribution over , we can add a dummy initial state and then define its transitions to other based on the initial probability distribution [39]. In the rest of this paper, we will use the observation sequence and history for POMDP interchangeably when the meaning is clear.
Given histories as control inputs, the planning problem of POMDP needs to find an adversary as a mapping function that maps every finite history onto an action in the action set $A$ or a probability distribution over $A$. As in MDP, the former type of adversary is called a pure adversary, and the latter a randomized adversary. As a special case of the randomized adversary, the pure adversary is generally less powerful. But for the finite horizon PCTL specifications considered in our work, pure adversaries and randomized adversaries have the same power, in the sense that restricting the set of adversaries to pure strategies does not change the satisfaction relation of the considered PCTL fragments [48]. While the detailed analysis follows from the fact that POMDP is a one-and-a-half player game [48], the intuitive justification for this claim is that if we are only interested in upper and lower bounds on the probability of some events, any probabilistic combination of these events stays within the bounds. Moreover, pure adversaries are sufficient to attain the bounds [48]. Therefore, we consider the controller design of pure adversaries in our supervisory control framework.
III-B Supervisory Control with za-DFA
We want to find a supervisor that provides pure adversaries for POMDP and regulates the closed-loop behavior to satisfy finite horizon PCTL specifications. To improve permissiveness, we aim to find a set of proper pure adversaries. Since the control objective is given by a finite horizon specification, a history-dependent controller outperforms a history-independent (memoryless or observation-stationary) one, and the justification can be directly inherited from the MDP case [40]. Based on these facts, we propose the za-DFA as the supervisor for POMDP, with the alphabet defined in a particular form.
Definition 6.
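[39] A supervisor for POMDP is a za-DFA $C = (Q, \bar{q}, \Sigma, \delta, F)$, where

$Q$ is a finite set of states;

$\bar{q} \in Q$ is the initial state;

$\Sigma = \{(z, a) \mid z \in Z, a \in A\}$ is the finite alphabet;

$\delta: Q \times \Sigma \to Q$ is a transition function;

$F \subseteq Q$ is a finite set of accepting states.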
Since DFA is an equivalent representation of regular language [49], a za-DFA represents a regular set of strings with the set of observation-action pairs in POMDP as its alphabet. A path in the za-DFA is a string concatenation of these pairs, which encodes a history together with an action. The accepted runs in the za-DFA then give the enabled actions for different histories and represent POMDP executions. Note that the prefixes of the accepted runs must also be accepted, since the prefixes have to be allowed to happen in the POMDP execution first. This implies that the accepted language of a za-DFA used as the supervisor for POMDP is prefix-closed, i.e., the language equals the set of all prefixes of its own strings [49].
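How a za-DFA maps a history to a set of enabled actions can be sketched in a few lines of Python. The encoding below is an assumption made for illustration: states are integers, the partial transition function is a dictionary keyed by `(state, (observation, action))`, and the names `run` and `enabled_actions` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Symbol = Tuple[str, str]                  # (observation, action)
Delta = Dict[Tuple[int, Symbol], int]     # partial transition function


def run(delta: Delta, q0: int, word: List[Symbol]) -> Optional[int]:
    """State reached after reading `word`, or None if some transition is undefined."""
    q: Optional[int] = q0
    for symbol in word:
        q = delta.get((q, symbol))
        if q is None:
            return None
    return q


def enabled_actions(delta: Delta, q0: int, accepting: Set[int],
                    history: List[Symbol], z: str) -> Set[str]:
    """Actions a such that history followed by (z, a) is still an accepted string."""
    q = run(delta, q0, history)
    if q is None:
        return set()
    return {a for (src, (obs, a)), dst in delta.items()
            if src == q and obs == z and dst in accepting}


# Toy supervisor, assumed for illustration only.
delta: Delta = {
    (0, ("z0", "a")): 1,
    (0, ("z0", "b")): 1,
    (1, ("z1", "a")): 1,
}
accepting = {0, 1}

first_choice = enabled_actions(delta, 0, accepting, [], "z0")
second_choice = enabled_actions(delta, 0, accepting, [("z0", "a")], "z1")
nothing_enabled = enabled_actions(delta, 0, accepting, [("z0", "a")], "z0")
```

In the toy supervisor two actions are enabled for the first observation, which is the kind of permissiveness the framework aims for, while the empty result in the last call shows what a blocked history looks like.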
Proposition 1.
A set of pure adversaries to regulate a finite horizon PCTL specification for POMDP can always be represented as a za-DFA.
Proof:
A pure adversary in POMDP maps a history to an action. Since we consider finite POMDP, the observation set and action set are finite, and together they form a finite alphabet for the za-DFA. Meanwhile, for a finite horizon PCTL specification, the pure adversaries give the action selection rules for histories of finite length. Thus all possible concatenations of a history and an action enabled by this set of pure adversaries form a finite set of strings, and each string has a finite length. Then we can define a nondeterministic finite automaton (NFA) whose accepted language is exactly this set. The NFA can be constructed by unifying the initial states of the DFAs that represent the individual strings. By applying the subset construction to the NFA, we get a DFA with the same accepted language [49]. With the set of observation-action pairs as the alphabet, we have shown that we can always find a za-DFA to represent a set of pure adversaries to regulate a finite horizon PCTL specification for POMDP. ∎
Given a za-DFA as the supervisor for POMDP, all histories that may be encountered during POMDP executions are mapped to a set of enabled actions. Then we can define a product MDP as the parallel composition between the POMDP and the za-DFA to describe the regulated behavior.
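For a finite set of finite strings, a deterministic automaton can also be built directly as a prefix tree. The sketch below uses this swapped-in construction instead of the NFA and subset construction from the proof, and it marks every prefix as accepting so that the language is prefix-closed. The function names and the integer state encoding are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Symbol = Tuple[str, str]                  # (observation, action)
Delta = Dict[Tuple[int, Symbol], int]


def build_prefix_tree_dfa(strings: Iterable[List[Symbol]]) -> Tuple[Delta, Set[int]]:
    """Build a tree-shaped DFA (initial state 0) accepting the given strings and all
    of their prefixes. This is a prefix-tree construction, not the subset construction."""
    delta: Delta = {}
    accepting = {0}          # the empty string is a prefix of every string
    next_state = 1
    for word in strings:
        q = 0
        for symbol in word:
            if (q, symbol) not in delta:
                delta[(q, symbol)] = next_state
                next_state += 1
            q = delta[(q, symbol)]
            accepting.add(q)  # every prefix is accepted
    return delta, accepting


def accepts(delta: Delta, accepting: Set[int], word: List[Symbol]) -> bool:
    """Check whether `word` is accepted, starting from the initial state 0."""
    q = 0
    for symbol in word:
        if (q, symbol) not in delta:
            return False
        q = delta[(q, symbol)]
    return q in accepting


# Strings enabled by a hypothetical set of pure adversaries over a horizon of two steps.
strings = [
    [("z0", "a"), ("z1", "a")],
    [("z0", "a"), ("z1", "b")],
    [("z0", "b")],
]
delta, accepting = build_prefix_tree_dfa(strings)
```

Because the set of strings is finite and every string has finite length, the tree has finitely many states, which is the finiteness argument the proof relies on.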
Definition 7.
Given a POMDP and a za-DFA as the supervisor, their parallel composition is an MDP ,

is a finite set of states;

is the initial state;

is a finite set of actions;

, if with , and ;

, if with and .
For the labeling function, and .
Remark: Compared to the global Markov chain defined in [28] describing the regulated behavior of POMDP under an FSC, the product MDP defined in Definition 7 is more general because the za-DFA is permissive and can enable more than one action under a history.
To make the za-DFA feasible for POMDP planning in practice, we require that the POMDP does not get "blocked" under its supervision, in the sense that at least one action is always enabled for every history the supervisor allows.
Definition 8.
A supervisor za-DFA to regulate POMDP for a finite horizon is nonblocking if there are outgoing transitions defined on all states that are reachable in steps from in .
Compared to the feasibility constraint defined in our previous work [39], here we allow multiple actions to be enabled for a given history, which gives permissiveness to the supervisory control framework using the za-DFA.
Given a nonblocking za-DFA, the simulation run of POMDP is shown in Algorithm 1. Starting from the initial state, the POMDP first generates an observation on its current state at each time instance. The za-DFA then searches for an outgoing transition from its current state whose label contains that observation, and the action in the label is selected for execution. After that, the state of the za-DFA is updated and a new POMDP state is simulated following the selected action.
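A nonblocking check can be sketched as a level-by-level reachability search over the supervisor. The exact step bound in the definition is not recoverable from the text here, so the sketch assumes that every state reachable in fewer than `horizon` steps must have at least one outgoing transition; the function name and the dictionary encoding are likewise assumptions.

```python
from typing import Dict, Set, Tuple

Symbol = Tuple[str, str]                  # (observation, action)
Delta = Dict[Tuple[int, Symbol], int]


def is_nonblocking(delta: Delta, q0: int, horizon: int) -> bool:
    """Assumed reading of the definition: every state reachable from q0 in fewer
    than `horizon` steps has at least one outgoing transition."""
    frontier: Set[int] = {q0}
    for _ in range(horizon):
        next_frontier: Set[int] = set()
        for q in frontier:
            successors = {dst for (src, _symbol), dst in delta.items() if src == q}
            if not successors:
                return False  # a reachable state with no enabled (z, a) pair
            next_frontier |= successors
        frontier = next_frontier
    return True


# Toy supervisor, assumed for illustration only: a chain 0 -> 1 -> 2.
delta: Delta = {
    (0, ("z0", "a")): 1,
    (1, ("z1", "a")): 2,
}
ok_for_two_steps = is_nonblocking(delta, 0, horizon=2)
ok_for_three_steps = is_nonblocking(delta, 0, horizon=3)
```

A stricter variant could require an outgoing transition for every observation that is possible at a reachable state; that refinement needs the POMDP model and is left out of this sketch.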
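The simulation loop described above can be sketched as follows. This is not Algorithm 1 itself but a minimal illustration under assumptions: the dictionary encodings, the helper names, the uniform choice among enabled actions, and the toy model (including the observation name `init` for the initial state) are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Symbol = Tuple[str, str]                                   # (observation, action)
Transitions = Dict[Tuple[str, str], Dict[str, float]]      # T[(s, a)][s2]
Observations = Dict[str, Dict[str, float]]                 # O[s][z]
Delta = Dict[Tuple[int, Symbol], int]


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one outcome from a finite distribution given as {outcome: probability}."""
    return rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]


def simulate(T: Transitions, O: Observations, delta: Delta, s0: str, q0: int,
             horizon: int, rng: random.Random) -> List[Symbol]:
    """Run the POMDP under the supervisor and return the observation-action trace."""
    s, q = s0, q0
    trace: List[Symbol] = []
    for _ in range(horizon):
        z = sample(O[s], rng)                        # observe the hidden state
        choices = [(a, dst) for (src, (obs, a)), dst in delta.items()
                   if src == q and obs == z]
        if not choices:
            break                                    # blocked; excluded for nonblocking supervisors
        a, q = rng.choice(choices)                   # pick one enabled action, update supervisor
        trace.append((z, a))
        successors = T.get((s, a))
        if successors is not None:                   # undefined actions leave the state unchanged
            s = sample(successors, rng)
    return trace


# Deterministic toy model, assumed for illustration only.
T: Transitions = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s1": 1.0}}
O: Observations = {"s0": {"init": 1.0}, "s1": {"z1": 1.0}}
delta: Delta = {(0, ("init", "a")): 1, (1, ("z1", "a")): 1}

trace = simulate(T, O, delta, "s0", 0, horizon=3, rng=random.Random(0))
```

The supervisor only ever sees the pairs stored in `trace`, never the hidden state `s`, which is why its decisions are history-dependent rather than state-dependent.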
III-C Probability Space and PCTL Satisfaction over POMDP
To formally address the PCTL satisfaction over POMDP, we first define the probability space in POMDP. With an observation-based adversary, the behavior of POMDP is purely probabilistic. Given a finite path and its corresponding observation sequence , with an observation-based adversary, we can define the basic cylinder set in POMDP as follows:
which is the set of all infinite paths with the prefix and observation prefix . Let contain all sets where ranges over all paths with all possible observation sequences. Then the σ-algebra can be defined on the paths generated by and the corresponding probability measure can be defined as
where , and is the selected action from the adversary given the observation sequence up to time instance . Since we assume the initial state is given, the initial observation will be the special observation with . With the domain , σ-algebra and the corresponding probability measure, we have defined the probability space for POMDP under an observation-based adversary. These results are adapted from [50], where the probability space for the Hidden Markov Model (HMM) is defined.
Since PCTL over MDP is well defined, the product MDP that describes the regulated behavior of POMDP under the supervision of the za-DFA can be used to connect PCTL satisfaction over POMDP with its definition for MDP. Given a path in , we can have its observation sequence of by extracting the observation symbol out of the tuple for the state in (the observation symbol for is the special observation ). With an observation-based adversary, there is a one-to-one correspondence between the paths in and . Then, based on the general definition of the probability space on MDP [43], it is not hard to see that the probability spaces on MDP and POMDP are equivalent. Therefore, given a POMDP and a za-DFA , the PCTL satisfaction with a finite horizon over the regulated system is equivalent to the PCTL satisfaction over the product MDP constrained to observation-based adversaries. We denote the model checking on constrained to the observation-based adversaries as where stands for the satisfaction relation constrained to the observation-based adversaries.
For the sake of simplicity, we consider bounded until PCTL specifications of the form $P_{\bowtie p}[\phi_1 U^{\leq k} \phi_2]$ in the rest of this paper. For finite horizon PCTL, little generality is lost, since many finite horizon PCTL specifications can be transformed into the bounded until form and the model checking mechanism is similar, as shown in [51].
III-D POMDP Model Checking
To verify the satisfaction relation over the regulated behavior, we need to solve the PCTL model checking problem on the product MDP, where most of the operators are handled in the same way as in MDP model checking. But for a state formula $P_{\bowtie p}[\psi]$, we need to check whether the probability bound is satisfied under the observation-based adversaries instead of all adversaries. We can solve this by computing either the minimum or the maximum probability, depending on whether a lower or an upper bound is defined by $\bowtie p$ [43]. This problem is EXPTIME-complete for finite horizon specifications, and it becomes much harder to solve as the size of the POMDP and the planning horizon increase. Another promising approach is to convert the model checking problem to an equivalent optimal policy computation problem on POMDP. Following this method, we can leverage recently developed POMDP solvers that can handle larger problem sizes with high efficiency [13, 14, 15]. For the finite horizon PCTL specification $P_{\bowtie p}[\phi_1 U^{\leq k} \phi_2]$, model checking can be converted to an optimal policy computation problem by modifying the transition structure of POMDP to make all $\neg\phi_1$ states and $\phi_2$ states absorbing, and designing a reward scheme that assigns 0 to intermediate transitions and 1 to the final transitions on $\phi_2$ states when the planning depth is reached [28, 52, 38].
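The conversion to a reward problem can be sketched on the underlying transition structure. The code below is a minimal illustration under assumptions: it makes the states that satisfy the second subformula, and the states that violate the first, absorbing, and it returns a terminal reward of 1 on the former; the function name, the encoding, and the toy model are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Transitions = Dict[Tuple[str, str], Dict[str, float]]


def to_terminal_reward_problem(T: Transitions, states: Iterable[str], actions: Iterable[str],
                               phi1: Set[str], phi2: Set[str]
                               ) -> Tuple[Transitions, Callable[[str], float]]:
    """Make phi2 states and non-phi1 states absorbing and attach a 0/1 terminal reward."""
    states, actions = list(states), list(actions)
    absorbing = {s for s in states if s in phi2 or s not in phi1}
    T_mod: Transitions = {}
    for s in states:
        for a in actions:
            if s in absorbing:
                T_mod[(s, a)] = {s: 1.0}           # self-loop under every action
            elif (s, a) in T:
                T_mod[(s, a)] = dict(T[(s, a)])    # unchanged elsewhere

    def terminal_reward(s: str) -> float:
        # Paid once, when the planning depth is reached; intermediate rewards are 0.
        return 1.0 if s in phi2 else 0.0

    return T_mod, terminal_reward


# Toy model, assumed for illustration only: s1 is the target, s2 violates phi1.
T: Transitions = {
    ("s0", "a"): {"s1": 0.5, "s2": 0.5},
    ("s0", "b"): {"s0": 0.9, "s1": 0.1},
    ("s1", "a"): {"s0": 1.0},
    ("s2", "a"): {"s0": 1.0},
}
T_mod, reward = to_terminal_reward_problem(
    T, ["s0", "s1", "s2"], ["a", "b"], phi1={"s0", "s1"}, phi2={"s1"})
```

With this reward scheme the expected terminal reward of a policy equals the probability that a target state has been reached within the planning depth without leaving the allowed states first, which is what ties the optimal value to the satisfaction probability.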
Among different POMDP solvers, we modify a state-of-the-art POMDP optimal policy computation algorithm, Partially Observable Monte-Carlo Planning (POMCP) [38], which fits well with our supervisory control framework. POMCP is proposed as an online POMDP planner that finds a control policy optimizing the discounted accumulated future reward. Instead of explicitly solving a POMDP, POMCP applies Monte-Carlo tree search [53], running Monte-Carlo simulations to maintain a search tree of histories. Each node in the search tree represents a history $h$ as $\langle N(h), V(h), B(h) \rangle$. Here $N(h)$ counts the number of times that history $h$ has been visited; $V(h)$ is the value of history $h$; $B(h)$ is a set of particles used to approximate the belief state for history $h$, which avoids an exact belief state update at each step. Given the current history $h$, each simulation starts in an initial state sampled from the belief state $B(h)$. There are two stages of simulation: in the first stage, when child nodes exist for all children, action selection follows the Upper Confidence Bounds 1 (UCB1) [54] algorithm to maximize $V(ha) + c\sqrt{\log N(h) / N(ha)}$, where $c$ is the exploration constant; in the second stage, actions are selected following an observation-based rollout policy, which is normally a uniform random action selection policy. One new node is added to the search tree after each simulation.
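The UCB1 rule used in the first simulation stage can be sketched in isolation. The function below is a minimal illustration, not the POMCP implementation: the name `ucb1_select`, the layout of the per-action statistics as `(visit count, value)` pairs, and the example numbers are assumptions.

```python
import math
from typing import Dict, Tuple


def ucb1_select(n_h: int, children: Dict[str, Tuple[int, float]], c: float) -> str:
    """Pick the action a that maximizes V(ha) + c * sqrt(log N(h) / N(ha)).
    `children` maps each candidate action to (N(ha), V(ha)) and must be non-empty."""
    best_action, best_score = None, -math.inf
    for action, (n_ha, v_ha) in children.items():
        if n_ha == 0:
            return action  # untried actions are explored first
        score = v_ha + c * math.sqrt(math.log(n_h) / n_ha)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


# Example statistics, assumed for illustration only.
children = {"a": (10, 0.5), "b": (1, 0.4)}
greedy_choice = ucb1_select(11, children, c=0.0)       # pure exploitation
exploring_choice = ucb1_select(11, children, c=1.0)    # the bonus favors the rarely tried action
```

In a supervised setting, the dictionary passed as `children` would contain only the actions that the supervisor enables for the current history.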
To modify POMCP for model checking the PCTL specification , we use a constant planning depth instead of a discount factor for the value function to guarantee the termination of each simulation. Meanwhile, without intermediate rewards, a termination reward is assigned when the planning depth is reached, and this reward is equal to where is the exact belief state of . For the action selection rules, we restrict the selection to the enabled action set given the supervisor and history . While the main algorithm is the same as POMCP, our modified version is shown in Algorithm 2. Then, by initializing the current history to empty, we can estimate the optimal value , which is equal to the maximum satisfaction probability [52]. To find the minimum satisfaction probability, we just need to change the sign of the termination reward, and the estimation is . From the search tree in POMCP, we can also get the selected action for each history node, which together give an observation-based adversary that achieves the estimated satisfaction probability. Since our modification of POMCP does not change its main mechanism, the convergence and performance analysis for POMCP still holds. With the convergence guarantee in probability, the bias of the value function is [38]. Given a fixed , the probability of in the range of is less than or equal to with for a sufficiently large number of simulations [55]. In practice, we may need to run many simulations (for example, ) to get a good estimate, but each simulation run can be very fast and the total time cost is still very small (for example, in to seconds) as reported in [38].
IV L* Learning Based Supervisor Synthesis
Within the supervisory control framework using DFA, our task of finding a supervisor for POMDP is converted to finding a DFA, which is an equivalent representation of a regular set [49]. This inspires us to use the L* algorithm to learn a supervisor.
IV-A L* Learning Algorithm
Table I
G  | ε
---+---
ε  | 1
1  | 0
---+---
0  | 1
10 | 0
11 | 0
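The L* learning algorithm [56] is proposed to learn an unknown regular set [49]. Starting from a fixed, known alphabet $\Sigma$, L* learning defines an observation table to organize the knowledge acquired by the learning algorithm. The row index of the table contains two parts: $S$ and $S \cdot \Sigma$, where $S$ is a nonempty finite prefix-closed set of strings. The column index is given by a nonempty finite suffix-closed set of strings $E$. The function $G$ maps a string $\omega \in \Sigma^*$ to $\{0, 1\}$, where $\Sigma^*$ is the set of all finite-length strings containing symbols from $\Sigma$. For a string $\omega$, $G(\omega) = 1$ if and only if $\omega$ belongs to the unknown regular set. For each row entry of a string $s \in S \cup S \cdot \Sigma$, its row $row(s)$ denotes the finite function from $E$ to $\{0, 1\}$ defined by $row(s)(e) = G(s \cdot e)$. Initializing the observation table with $S = E = \{\varepsilon\}$, the L* algorithm tries to make the table closed and consistent. For closedness, $\forall t \in S \cdot \Sigma$, it requires that $\exists s \in S$ s.t. $row(t) = row(s)$; for consistency, whenever $s_1, s_2 \in S$ with $row(s_1) = row(s_2)$, it requires that $\forall a \in \Sigma$, $row(s_1 \cdot a) = row(s_2 \cdot a)$ [56]. Given a closed and consistent observation table, a DFA as the acceptor can be generated with its accepting language representing the learned regular set as follows:

$Q = \{row(s) \mid s \in S\}$ with initial state $q_0 = row(\varepsilon)$,

$F = \{row(s) \mid s \in S, G(s) = 1\}$,

$\delta(row(s), a) = row(s \cdot a)$.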

The observation table shown in Table I is closed and consistent, and the corresponding DFA is shown in Fig. 1.
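The closedness and consistency checks and the acceptor construction can be traced on Table I with a few lines of code. The sketch below is illustrative only: it assumes the alphabet $\{0, 1\}$, reads the table as rows $\epsilon$, 1 (the prefix set) and 0, 10, 11 (their one-symbol extensions) with a single column $\epsilon$, and uses hypothetical names (`SIGMA`, `S`, `E`, `G`, `row`).

```python
from itertools import product

SIGMA = ["0", "1"]        # assumed alphabet
S = ["", "1"]             # prefix-closed row labels ("" is the empty string)
E = [""]                  # suffix-closed column labels
G = {"": 1, "1": 0, "0": 1, "10": 0, "11": 0}   # table entries as read from Table I


def row(s):
    """Row of string s: the tuple of table entries G(s.e) for e in E."""
    return tuple(G[s + e] for e in E)


def is_closed():
    """Every row of S.SIGMA must already appear as a row of S."""
    rows_of_s = {row(s) for s in S}
    return all(row(s + a) in rows_of_s for s in S for a in SIGMA)


def is_consistent():
    """Equal rows in S must stay equal after appending any symbol."""
    return all(row(s1 + a) == row(s2 + a)
               for s1, s2 in product(S, S) if row(s1) == row(s2)
               for a in SIGMA)


def build_dfa():
    """Acceptor from a closed and consistent table: states are distinct rows."""
    states = {row(s) for s in S}
    initial = row("")
    accepting = {row(s) for s in S if G[s] == 1}
    delta = {(row(s), a): row(s + a) for s in S for a in SIGMA}
    return states, initial, accepting, delta


def accepts(word):
    """Run the acceptor on a word over SIGMA."""
    _, state, accepting, delta = build_dfa()
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


if __name__ == "__main__":
    print(is_closed(), is_consistent(), len(build_dfa()[0]))
    print(accepts("000"), accepts("010"))
```

Under this reading, the acceptor has two states and accepts exactly the strings that contain no symbol 1.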
To generate a closed and consistent observation table, $L^*$ learning maintains a question-and-answer mechanism. Given the alphabet $\Sigma$, two types of questions, membership queries and conjectures, are asked by the Learner and answered by the Teacher. For a membership query, the Learner asks whether a string $s \in \Sigma^*$ is a member of $U$, and the Teacher answers $1$ or $0$, respectively. For a conjecture, the Learner asks whether a learned regular set is equal to $U$, and the Teacher answers yes, or no together with a string $t$ in the symmetric difference between the learned set and $U$. In the latter case, $t$ is called a counterexample. With membership queries, if the table is not closed, the algorithm finds $s \in S$ and $a \in \Sigma$ s.t. $row(s \cdot a) \neq row(s')$ for all $s' \in S$, then adds $s \cdot a$ to $S$ and extends the table; if the table is not consistent, the algorithm finds $s_1, s_2 \in S$, $a \in \Sigma$, and $e \in E$ s.t. $row(s_1) = row(s_2)$ but $G(s_1 \cdot a \cdot e) \neq G(s_2 \cdot a \cdot e)$, then adds $a \cdot e$ to $E$ and extends the table [56]. With a conjecture, if $t$ is given as a counterexample, $t$ and its prefixes are added to $S$ and the table is extended using membership queries. With a Teacher able to answer membership queries and conjectures, the $L^*$ algorithm is proved to converge to the minimum DFA accepting $U$ in polynomial time [56].
IV-B Learn zaDFA as the Supervisor
Given a POMDP and a finite horizon PCTL specification , we use $L^*$ learning to learn a zaDFA as the supervisor. To obtain a feasible zaDFA that can regulate the POMDP to satisfy the specification , we develop algorithms to answer membership queries and conjectures. To simplify the analysis, we will take with as the specification to illustrate the learning process. An overview of the learning process is shown in Fig. 2, and we describe it as follows.
IV-B1 Preprocessing
Before the initialization of $L^*$ learning, we first find the observation-based adversaries and that give the maximum and minimum satisfaction probabilities and for the path formula , respectively. With the probability bound in given by , we compare and with the threshold : if then any observation-based adversary can be applied, and a trivial zaDFA with one state and self-loop transitions under any will be returned as the supervisor; if then no observation-based adversary can be applied, and an empty zaDFA that accepts only the empty string will be returned.
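The preprocessing decision can be summarized as a three-way case split. The sketch below assumes that the specification is an upper bound on the probability of the path formula (as in the example later in the paper, where the probability of reaching failure must not exceed a threshold); the function name and the returned labels are hypothetical.

```python
def preprocess(p_max, p_min, threshold):
    """Decide whether learning is needed, assuming a spec of the form
    'probability of the path formula <= threshold' (an assumption of this sketch).
    """
    if p_max <= threshold:
        # Even the worst adversary satisfies the bound: allow everything.
        return "trivial_supervisor"
    if p_min > threshold:
        # Even the best adversary violates the bound: nothing can be allowed.
        return "empty_supervisor"
    # Some adversaries satisfy the bound and some do not: start learning.
    return "start_learning"


if __name__ == "__main__":
    print(preprocess(p_max=0.2, p_min=0.05, threshold=0.3))
    print(preprocess(p_max=0.9, p_min=0.5, threshold=0.3))
    print(preprocess(p_max=0.9, p_min=0.1, threshold=0.3))
```

For a lower-bound specification the two comparisons would be mirrored.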
IV-B2 Initialization
After the preprocessing stage to calculate , and their corresponding , , we can initialize the learning algorithm. Starting with the alphabet defined in Definition 6, the observation table is initialized with , and . Then membership queries are generated by the Learner to extend the table.
Besides the observation table, we initialize two string sets and to empty. Here and will contain the negative counterexamples returned from OracleB and OracleS, respectively; both oracles are introduced in the conjecture answering section.
IV-B3 Answering Membership Queries
For each string , the membership query checks whether the corresponding observation-action sequence can be used as the control policy for histories that are prefixes of . If there exists a prefix of in , the membership query returns . Otherwise, we unfold the POMDP under the control policy from . This unfolding process follows the product MDP generation rules given in Definition 7. Basically, can be converted to a zaDFA with a unique action being selected for each history that is a prefix of . Then its product MDP turns into a DTMC. On the DTMC , the model checking result for the specification answers the membership query with if and only if . If , we take its prefix and apply the membership query to , since the specification only constrains the regulated behavior up to the depth .
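Once the control policy has turned the product into a DTMC, answering the query amounts to a bounded reachability computation. The sketch below is a generic illustration of that last step, not the unfolding of Definition 7: the DTMC `P`, its state names, the failure label, and the upper-bound form of the threshold test are all assumptions.

```python
def bounded_reach_prob(P, init, targets, k):
    """Probability of reaching a state in `targets` within k steps of DTMC P.

    P maps each state to a dict {successor: probability}.
    """
    # x[s] = probability of reaching targets from s within the steps processed so far
    x = {s: (1.0 if s in targets else 0.0) for s in P}
    for _ in range(k):
        x = {s: 1.0 if s in targets
             else sum(p * x[t] for t, p in P[s].items())
             for s in P}
    return x[init]


def membership_answer(P, init, targets, k, threshold):
    """Return 1 if the bounded reachability probability respects the threshold."""
    return 1 if bounded_reach_prob(P, init, targets, k) <= threshold else 0


# Hypothetical DTMC induced by one observation-action sequence.
P = {
    "h0": {"h1": 0.7, "fail": 0.1, "sink": 0.2},
    "h1": {"fail": 0.3, "sink": 0.7},
    "fail": {"fail": 1.0},
    "sink": {"sink": 1.0},
}

if __name__ == "__main__":
    print(bounded_reach_prob(P, "h0", {"fail"}, 2))   # 0.1 + 0.7 * 0.3
    print(membership_answer(P, "h0", {"fail"}, 2, 0.35))
```

The backward iteration runs in time proportional to `k` times the number of transitions, so a single query on such a DTMC is cheap compared with solving the POMDP.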
Remark: In the original $L^*$ algorithm, implies that the unknown regular set accepts . In our case, however, if the membership query returns , it only means that the corresponding control policy will not by itself cause a violation of the specification. Here may still need to be removed to obtain a correct supervisor, because the satisfaction probability of the regulated behavior is determined by the accumulative probability contributed by the different strings accepted by the supervisor.
Following the $L^*$ algorithm, the Learner keeps generating membership queries until a closed and consistent table is learned. Then a zaDFA is generated as the acceptor of .
IV-B4 Answering Conjectures
Given a zaDFA as the acceptor, the Learner asks a conjecture to check whether or not is a nonblocking supervisor that can regulate POMDP to satisfy PCTL . If the answer is , the algorithm will terminate with the learned zaDFA as a nonblocking and permissive supervisor. Otherwise, counterexamples will be returned to guide the refinement and extension process of the observation table for the Learner. To answer conjectures, three oracles are defined to guarantee the soundness and completeness of our learning algorithm.
Since we know is a suitable adversary that will not violate the specification or cause blocking during the POMDP execution, we define OracleP to check whether there exists a string such that with but . If so, the conjecture is answered with , and is returned as a positive counterexample to make . With OracleP, we can guarantee that the learned supervisor accepts all history-action pairs given by .
Remark: For any string with , the membership query will return . That is because if this single control policy could bring a probability violating the requirement in the specification, brought by will also violate the requirement, which will terminate the algorithm in the preprocessing stage. Therefore, there are no conflicts between the membership query and OracleP.
If OracleP does not find a positive counterexample, we use OracleB to check whether is nonblocking. Here we check all states that are -step reachable from on : if all such states have outgoing transitions defined, OracleB returns ; otherwise, OracleB checks the causes of blocking. Assume there exists a -step reachable state that does not have any outgoing transitions. Then , and by applying a depth-first search on we can find the shortest observation-action sequence that leads from to . Denote this string as . Depending on whether or not, there are two possible causes for the blocking on . If , the supervisor blocks because the Learner can generate a conjecture without adding to and asking membership queries for . To find out whether there exists an action that can be enabled for to fix the blocking, OracleB returns with as the counterexample, which forces to appear as rows in the observation table. If has already been included in , for all will appear as rows in the observation table. Since has no outgoing transitions under the observation , all strings with have been answered by membership queries with . This means that once the POMDP execution reaches state and observes while the zaDFA reaches state , choosing any action will cause a violation of the specification under the current zaDFA. Therefore, should be avoided during the system transition. To remove the strings that may lead to such states, all states with no outgoing transitions are marked as dark states, and the transitions to dark states are removed in . This process is repeated until no new dark state appears. From the traces that start in the initial state and end in dark states, the observation-action sequences are extracted and form a string set . Then is updated to . OracleB returns together with the shortest string as the negative counterexample.
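The iterative marking of dark states is a small fixed-point computation on a transition graph. The sketch below is a simplified illustration with hypothetical names: it works on a plain labeled graph rather than the product MDP, treats a state as blocking when it has no outgoing transition at all, and only inspects states reachable in fewer than `k` transitions (an assumption about how the horizon is counted).

```python
from collections import deque


def reachable_within(delta, initial, k):
    """States reachable from `initial` in fewer than k transitions."""
    depth = {initial: 0}
    queue = deque([initial])
    while queue:
        q = queue.popleft()
        if depth[q] + 1 >= k:
            continue  # successors would lie at the horizon; no action needed there
        for nxt in delta.get(q, {}).values():
            if nxt not in depth:
                depth[nxt] = depth[q] + 1
                queue.append(nxt)
    return set(depth)


def prune_dark_states(delta, initial, k):
    """Repeatedly mark blocking states as dark and cut transitions into them."""
    delta = {q: dict(out) for q, out in delta.items()}  # work on a copy
    dark = set()
    while True:
        new_dark = {q for q in reachable_within(delta, initial, k)
                    if not delta.get(q) and q not in dark}
        if not new_dark:
            return delta, dark
        dark |= new_dark
        for q in delta:
            delta[q] = {label: nxt for label, nxt in delta[q].items()
                        if nxt not in dark}


# Hypothetical acceptor; labels are (observation, action) pairs.
DELTA = {
    "q0": {("o1", "a"): "q1", ("o1", "b"): "q2"},
    "q1": {("o2", "a"): "q3"},
    "q2": {("o2", "a"): "q4"},
    "q3": {},
    "q4": {("o1", "a"): "q5"},
    "q5": {},
}

if __name__ == "__main__":
    pruned, dark = prune_dark_states(DELTA, "q0", k=3)
    print(sorted(dark), pruned["q0"])
```

In this example `q3` is dark first, which in turn leaves `q1` without outgoing transitions, so the loop needs a second pass before it stabilizes.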
If OracleB does not find a counterexample, we use OracleS to check whether . If not, OracleS returns with a negative counterexample as evidence of the specification violation. To find such a string, we first solve and find the observation-based adversary that gives the maximum satisfaction probability . With , we generate a derived DTMC with histories as states: . Here is the state space of histories with and is a dummy state. For with , with the belief state function, and
where is the standard Dirac delta function, used to make absorbing. For with , . Basically, we group paths with the same observation sequences together in and generate . Therefore, a path in corresponds to a set of paths in . For a path in that starts from and ends in , its transition probability is equal to the accumulative transition probability of the corresponding set of paths in that end in a state with label within steps. Since witnesses the violation of the specification, . Then we apply the DTMC counterexample generation algorithm in [51] to get the strongest evidence, i.e., a finite path with the maximum probability of violation. Denote its corresponding observation-action sequence as . If while , is replaced by the observation-action sequence of the path with the second largest probability of violation. This process continues until does not conflict with . Then is returned as the negative counterexample and is updated with .
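The notion of strongest evidence can be made concrete with a step-bounded search for the most probable path into a target state. The sketch below is a Viterbi-style dynamic program written for illustration; it is not claimed to be the algorithm of [51], it does not implement the fallback to the second most probable path, and the DTMC `D` with its state names is hypothetical.

```python
def strongest_evidence(P, init, targets, k):
    """Most probable path from `init` that first hits `targets` within k steps.

    P maps each state to {successor: probability}.
    Returns (probability, path) or (0.0, None) if no such path exists.
    """
    if init in targets:
        return 1.0, [init]
    best = {init: (1.0, [init])}   # best path of the current length per end state
    answer = (0.0, None)
    for _ in range(k):
        nxt = {}
        for s, (p, path) in best.items():
            for t, pt in P[s].items():
                cand = (p * pt, path + [t])
                if t in targets:
                    if cand[0] > answer[0]:
                        answer = cand          # path ends at its first target hit
                elif t not in nxt or cand[0] > nxt[t][0]:
                    nxt[t] = cand              # keep only the best prefix per state
        best = nxt
    return answer


# Hypothetical derived DTMC over histories.
D = {
    "h0": {"h1": 0.6, "h2": 0.4},
    "h1": {"fail": 0.5, "ok": 0.5},
    "h2": {"fail": 0.9, "ok": 0.1},
    "ok": {"ok": 1.0},
    "fail": {"fail": 1.0},
}

if __name__ == "__main__":
    prob, path = strongest_evidence(D, "h0", {"fail"}, k=2)
    print(prob, path)
```

Keeping only the best prefix per state is sound here because path probabilities are products of nonnegative factors, so a more probable prefix never yields a less probable extension.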
If all three oracles return , our algorithm returns the resulting DFA as the supervisor and terminates. If any oracle returns a counterexample, the observation table is refined and extended.
IV-B5 Refining and Extending the Observation Table
In the next iteration, given a counterexample returned from conjectures and the updated and , we first refine the observation table by correcting to if . Then and all its prefixes are added to in . After that, the observation table is extended using membership queries to generate a new closed and consistent table.
V Analysis and Discussions
In this section, we analyze the learning based supervisor synthesis algorithm with regard to termination, soundness, and completeness, as well as computational complexity. Our analysis focuses on the cases where the algorithm does not terminate during the preprocessing stage, since the statements follow trivially otherwise.
V-A Termination
In $L^*$ learning, we use membership queries and conjectures to collect information about whether an observation-action sequence can be used as part of a proper supervisor. Because we consider a finite POMDP with a finite horizon specification, the number of possible observation-action sequences is finite. So only a finite number of strings need to be labeled in the observation table for the algorithm. Our algorithm requires a refinement process for the observation table if the returned negative counterexamples and their suffixes were answered with by membership queries in previous iterations. However, for a string , it will never happen that is changed from to . Consider a string with . Then either the accumulative probability from violates the threshold given by the specification, or . In any of these cases, membership queries will always return for . While OracleP will return certain strings as positive counterexamples, will never be returned by OracleP, i.e., is not accepted by . If the accumulative probability from violates the threshold, it will never belong to . If , by definition of OracleS, cannot be returned by OracleP. If , must be returned by OracleB, which only happens after OracleP returns . But when OracleP returns , the observation-action sequences from are all accepted by the acceptor zaDFA, and none of them will cause blocking of the supervisor, which is guaranteed by being an observation-based adversary. Therefore, if , can never belong to . As a result, will never be changed from to . Since the number of strings to be queried is finite and at each iteration the algorithm must return counterexamples if any oracle returns , we conclude that the termination of our supervisor synthesis algorithm is guaranteed. The upper bound on the number of iterations is twice the number of possible strings.
V-B Soundness and Completeness
Our learning based supervisor synthesis algorithm is sound and complete. If a zaDFA is returned as the supervisor, based on the definition of OracleP, OracleB, and OracleS, this zaDFA is nonblocking, and the model checking on the regulated behavior of POMDP proves the satisfaction of the specification. This shows the soundness of the algorithm.
For completeness, if there exists a proper supervisor, our algorithm will return a zaDFA representing in the worst case. This is guaranteed by OracleP. However, we cannot guarantee permissiveness in the worst case, when OracleS returns "good" observation-action sequences as negative counterexamples. While OracleS will never misidentify a single string carrying enough probability mass of violation, if a set of paths is needed to witness the violation, how to select a proper counterexample from that set is still a research question, and it is possible that some paths accepted by the desired supervisor are returned as negative counterexamples. Currently we return the path with the maximum probability mass, but newly developed counterexample selection algorithms for probabilistic systems could be applied to improve the performance of our learning framework.
V-C Complexity
Define the size of the POMDP as the product of the size of the underlying MDP and : and denote the planning horizon of the specification as . Following the termination analysis, the number of iterations is at most where is the alphabet. In each iteration, denote the size of the current acceptor DFA as . OracleP tries to find the difference between the current acceptor DFA and . This can be achieved with time complexity by taking the complement and intersection of the two DFAs and then applying depth-first search to check whether an accepting state can be reached from the initial state within steps. OracleB mainly applies depth-first search on the product MDP, so its time complexity is . OracleS relies on POMDP solving, which generally has a time complexity exponential in and linear in the length of the PCTL formula (the number of logical and temporal operators in the formula). With the modified POMCP method, however, the model checking result can be returned in seconds by running thousands of simulations, and the running time depends on the hardware. After that, the counterexample selection algorithm takes time polynomial in and the number of transitions in the derived DTMC [51]. In the learning process, the number of membership queries is at most . Combining this with the time analysis of $L^*$ in [56], we can see that our algorithm has a complexity exponential in , polynomial in , and . However, whenever we eliminate negative counterexamples, their suffixes are also removed. Therefore, the and the number of iterations rarely assume large values in practice, so this complexity analysis is rather conservative.
VI Example
Consider a POMDP , where

;

;

;

.
The transition probabilities under different actions are given in the order of , , in the square brackets shown in Fig. 3. The observation matrix is given in Table IV. Among , the state represents a failure state with label and is colored orange in Fig. 3. The specification is given by a finite horizon PCTL with , which requires that the probability of reaching failure within steps be less than or equal to .
Remark: This POMDP is specially designed so that the model checking problem can be solved straightforwardly. This allows us to focus on illustrating our supervisor synthesis algorithm.
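[Table residue: the observation matrix of Table IV followed by observation tables of the learning process; cell values in their original order: 0.3 0.7 0.5 0.5 0.2 0.8 1 0 0 1 1 2 0 1 1 3 0 4 1 5 0 6 0 2 0 G 3 13 1 0 1 2 0 0 0 1 1 1 0 13 1 1 1 11 1 0 0 3 0 0 0 4 1 1 0 5 0 0 0 6 0 0 0 2 0 0 0 12 1 1 1 14 1 0 0 15 1 1 1 16 1 1 1 13 1 1 1 11 0 0 0 11 1 1 1]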