Joint Inference of Reward Machines and Policies for Reinforcement Learning
Abstract
Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem where the high-level knowledge is in the form of reward machines, i.e., a type of Mealy machine that encodes the reward functions. We focus on a setting in which this knowledge is a priori not available to the learning agent. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It derives q-functions from the current hypothesis reward machine, and performs RL to update the q-functions. While performing RL, the algorithm updates the sample by adding RL episodes along which the obtained rewards are inconsistent with the rewards based on the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation that we define between states of reward machines, we transfer the q-functions between the hypothesis reward machines in consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit if a minimal reward machine can be inferred and the maximal length of each RL episode is sufficiently long. The experiments show that learning high-level knowledge in the form of reward machines can lead to fast convergence to optimal policies in RL, while standard RL methods such as q-learning and hierarchical RL methods fail to converge to optimal policies after a substantial number of training steps in many tasks.
1 Introduction
In many reinforcement learning (RL) tasks, agents only obtain sparse rewards for complex behaviors over a long period of time. In such a setting, learning is very challenging and incorporating high-level knowledge can help the agent explore the environment in a more efficient manner [Taylor_ICML2007]. This high-level knowledge may be expressed as different levels of temporal or behavioral abstractions, or a hierarchy of abstractions [Nachum_NIPS2018, Abel2018, Akrour2018].
The existing RL work exploiting the hierarchy of abstractions often falls into the category of hierarchical RL [sutton1999between, Dietterich2000MaxQ, parr1998HAM]. Generally speaking, hierarchical RL decomposes an RL problem into a hierarchy of subtasks, and uses a metacontroller to decide which subtask to perform and a controller to decide which action to take within a subtask [Barto2003review].
For many complex tasks with sparse rewards, there exist high-level structural relationships among the subtasks [aksaray2016q, andreas2017modular, li2017reinforcement, zhe_ijcai2019]. Recently, the authors of [DBLP:conf/icml/IcarteKVM18] proposed reward machines, i.e., a type of Mealy machine, to compactly encode high-level structural relationships. They developed a method called q-learning for reward machines (QRM) and showed that QRM converges almost surely to an optimal policy in the tabular case. Furthermore, QRM outperforms both q-learning and hierarchical RL for tasks where the high-level structural relationships can be encoded by a reward machine.
Despite the attractive performance of QRM, the assumption that the reward machine is explicitly known by the learning agent is unrealistic in many practical situations. The reward machines are not straightforward to encode, and more importantly, the high-level structural relationships among the subtasks are often implicit and unknown to the learning agent.
In this paper, we investigate the RL problem where the high-level knowledge in the form of reward machines is a priori not available to the learning agent. We develop an iterative algorithm that performs joint inference of reward machines and policies (JIRP) for RL (more specifically, q-learning [Watkins1992]). In each iteration, the JIRP algorithm maintains a hypothesis reward machine and a sample of RL episodes. It derives q-functions from the current hypothesis reward machine, and performs RL to update the q-functions. While performing RL, the algorithm updates the sample by adding counterexamples (i.e., RL episodes in which the obtained rewards are inconsistent with the rewards based on the current hypothesis reward machine). The updated sample is used to infer a new hypothesis reward machine, using automata learning techniques [DBLP:conf/nfm/NeiderJ13, oncina1992inferring]. The algorithm converges almost surely to an optimal policy in the limit if a minimal reward machine can be inferred and the maximal length of each RL episode is sufficiently long.
We use three optimization techniques in the proposed algorithm for its practical and efficient implementation. First, we periodically add batches of counterexamples to the sample for inferring a new hypothesis reward machine. In this way, we can adjust the frequency of inferring new hypothesis reward machines. Second, we utilize the experiences from previous iterations by transferring the q-functions between equivalent states of two hypothesis reward machines. Lastly, we adopt a polynomial-time learning algorithm for inferring the hypothesis reward machines.
We implement the proposed approach and two baseline methods (q-learning in augmented state space and deep hierarchical RL [Kulkarni2016hierarchical]) in three scenarios: an autonomous vehicle scenario, an office world scenario and a Minecraft world scenario. In the autonomous vehicle scenario, the proposed approach converges to optimal policies within 100,000 training steps, while the baseline methods are stuck with near-zero average cumulative reward for up to two million training steps. In both the office world scenario and the Minecraft world scenario, within the number of training steps that the proposed approach needs to converge to optimal policies, the baseline methods reach only 60% of the optimal average cumulative reward.
1.1 Motivating Example
As a motivating example, let us consider an autonomous vehicle navigating a residential area, as sketched in Figure 1. As is common in many countries, some of the roads are priority roads. While traveling on a priority road, a car has the right-of-way and does not need to stop at intersections. In the example of Figure 1, all the horizontal roads are priority roads (indicated by gray shading), whereas the vertical roads are ordinary roads.
Let us assume that the task of the autonomous vehicle is to drive from position “A” (a start position) on the map to position “B” while obeying the traffic rules. To simplify matters, we are here only interested in the traffic rules concerning the right-of-way and how the vehicle acts at intersections with respect to the traffic from the intersecting roads. Moreover, we make the following two further simplifications: (1) the vehicle correctly senses whether it is on a priority road and (2) the vehicle always stays in the road and goes straight forward while not at the intersections.
The vehicle is obeying the traffic rules if and only if

it is traveling on an ordinary road and stops for exactly one time unit at the intersections;

it is traveling on a priority road and does not stop at the intersections.
After a period of time (e.g., 100 time units), the vehicle receives a reward of 1 if it reaches B while obeying the traffic rules, otherwise it receives a reward of 0.
2 Preliminaries
In this section we introduce necessary background on reinforcement learning and reward machines.
2.1 Markov Decision Processes and Reward Machines
Definition 1
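A labeled Markov decision process is a tuple $\mathcal{M} = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$ consisting of a finite state space $S$, an agent’s initial state $s_I \in S$, a finite set of actions $A$, and a probabilistic transition function $p \colon S \times A \times S \to [0, 1]$. A reward function $R \colon (S \times A)^{+} \times S \to \mathbb{R}$ and a discount factor $\gamma$ together specify payoffs to the agent. Finally, a finite set $\mathcal{P}$ of propositional variables, and a labeling function $L \colon S \times A \times S \to 2^{\mathcal{P}}$ determine the set of relevant high-level events that the agent detects in the environment. We define the size of $\mathcal{M}$, denoted as $|\mathcal{M}|$, to be $|S|$ (i.e., the cardinality of the set $S$).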
A policy is a function mapping states in to a probability distribution over actions in . At state , an agent using policy picks an action with probability , and the new state is chosen with probability . A policy and the initial state together determine a stochastic process and we write for the random trajectory of states and actions.
A trajectory is a realization of this stochastic process: a sequence of states and actions , with . Its corresponding label sequence is where for each . Similarly, the corresponding reward sequence is , where , for each . We call the pair a trace.
A trajectory achieves a reward . In reinforcement learning, the objective of the agent is to maximize the expected cumulative reward, .
Note that the definition of the reward function assumes that the reward is a function of the whole trajectory. A special, often used, case of this is a so-called Markovian reward function, which depends only on the last transition (i.e., for all , where we use to denote concatenation).
Our definition of MDPs corresponds to the “usual” definition of MDPs (e.g., [Puterman]), except that we have introduced a set of high-level propositions and a labeling function assigning sets of propositions (labels) to each transition of an MDP. We use these labels to define (general) reward functions through reward machines. Reward machines [DBLP:conf/icml/IcarteKVM18, LTLAndBeyond] are a type of finite-state machine: when in one of its finitely many states, upon reading a symbol, such a machine outputs a reward and transitions into a next state. (Footnote 1: The reward machines we are using are the so-called simple reward machines in the parlance of [DBLP:conf/icml/IcarteKVM18], where every output symbol is a real number.)
Definition 2
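A reward machine $\mathcal{A} = (V, v_I, 2^{\mathcal{P}}, \mathbb{R}, \delta, \sigma)$ consists of a finite, nonempty set $V$ of states, an initial state $v_I \in V$, an input alphabet $2^{\mathcal{P}}$ (the subsets of the set $\mathcal{P}$ of propositional variables), an output alphabet $\mathbb{R}$, a (deterministic) transition function $\delta \colon V \times 2^{\mathcal{P}} \to V$, and an output function $\sigma \colon V \times 2^{\mathcal{P}} \to \mathbb{R}$. We define the size of $\mathcal{A}$, denoted as $|\mathcal{A}|$, to be $|V|$ (i.e., the cardinality of the set $V$).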
Technically, a reward machine is a special instance of a Mealy machine [DBLP:books/daglib/0025557]: the one that has real numbers as its output alphabet and subsets of propositional variables (originating from an underlying MDP) as its input alphabet. (To accentuate this connection, a defining tuple of a reward machine explicitly mentions both the input alphabet and the output alphabet .)
The run of a reward machine $\mathcal{A}$ (with initial state $v_I$, transition function $\delta$, and output function $\sigma$) on a sequence of labels $\ell_1 \ldots \ell_k$ is a sequence $v_0 (\ell_1, r_1) v_1 (\ell_2, r_2) \ldots (\ell_k, r_k) v_k$ of states and label-reward pairs such that $v_0 = v_I$ and for all $i \in \{0, \ldots, k-1\}$, we have $\delta(v_i, \ell_{i+1}) = v_{i+1}$ and $\sigma(v_i, \ell_{i+1}) = r_{i+1}$. We write $\mathcal{A}(\ell_1 \ldots \ell_k) = r_1 \ldots r_k$ to connect the input label sequence to the sequence of rewards produced by the machine $\mathcal{A}$. We say that a reward machine $\mathcal{A}$ encodes the reward function of an MDP if for every trajectory with corresponding label sequence $\lambda$, the reward sequence of the trajectory equals $\mathcal{A}(\lambda)$. (Footnote 2: In general, there can be multiple reward machines that encode the reward function of an MDP: all such machines agree on the label sequences that arise from trajectories of the underlying MDP, but they might differ on label sequences that the MDP does not permit. For clarity of exposition and without loss of generality, we assume throughout this paper that there is a unique reward machine encoding the reward function of the MDP under consideration. However, our algorithm also works in the general case.)
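The notion of a run can be illustrated with a minimal Python sketch. The class name `RewardMachine`, its dictionary-based representation, and the tiny two-label example are hypothetical and only serve as an illustration; the convention that unlisted transitions are self-loops with reward 0 follows the remark made for Figure 2.

```python
from typing import Dict, FrozenSet, List, Tuple

Label = FrozenSet[str]  # a label is a set of propositional variables


class RewardMachine:
    """Deterministic Mealy machine with real-valued outputs (illustrative only)."""

    def __init__(self, initial: int,
                 delta: Dict[Tuple[int, Label], int],
                 sigma: Dict[Tuple[int, Label], float]) -> None:
        self.initial = initial
        self.delta = delta  # transition function
        self.sigma = sigma  # output (reward) function

    def step(self, state: int, label: Label) -> Tuple[int, float]:
        # Assumption: transitions that are not listed are self-loops with reward 0.
        key = (state, label)
        return self.delta.get(key, state), self.sigma.get(key, 0.0)

    def run(self, labels: List[Label]) -> List[float]:
        """Return the reward sequence produced on the given label sequence."""
        state, rewards = self.initial, []
        for label in labels:
            state, reward = self.step(state, label)
            rewards.append(reward)
        return rewards


a, b = frozenset({"a"}), frozenset({"b"})
# Hypothetical machine: reward 1 is paid when "b" is seen right after "a".
rm = RewardMachine(0, {(0, a): 1, (1, b): 2}, {(1, b): 1.0})
print(rm.run([b, a, b, b]))  # [0.0, 0.0, 1.0, 0.0]
```

The same label `b` yields different rewards depending on the history, which is exactly what a Markovian reward function over the MDP states alone cannot express.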
An interesting (and practically relevant) subclass of reward machines is given by Mealy machines with a specially marked subset of final states, the output alphabet , and the output function mapping a transition to 1 if and only if the end state is a final state and the transition is not a self-loop. Additionally, final states must not be a part of any cycle, except for a self-loop. This special case can be used in reinforcement learning scenarios with sparse reward functions (e.g., see the reward machines used in the case studies in [DBLP:conf/icml/IcarteKVM18]).
For example, Figure 2 shows a reward machine for our motivating example. Intuitively, state corresponds to the vehicle traveling on a priority road, while and correspond to the vehicle traveling and stopped on an ordinary road, respectively. While in , the vehicle ends up in a sink state (representing violation of the traffic rules) if it stops at an intersection (). While in state , the vehicle gets to the sink state if it does not stop at an intersection (), and gets to state if it stops at an intersection (). While in state , the vehicle gets to the sink state if it stops again at the same intersection (), gets back to state if it turns left or turns right (thus ending in a priority road, i.e., ), and gets back to state if it goes straight (thus ending in an ordinary road, i.e., ). The reward machine switches among states , and if the vehicle is obeying the traffic rules. Finally, the reward 1 is obtained if from the goal position B is reached. (Transitions not shown in Figure 2 are self-loops with reward 0.)
2.2 Reinforcement Learning With Reward Machines
In reinforcement learning, an agent explores the environment modeled by an MDP, receiving occasional rewards according to the underlying reward function [sutton2018reinforcement]. One possible way to learn an optimal policy is tabular q-learning [Watkins1992]. There, the value of the function , which represents the expected future reward for the agent taking action in state , is iteratively updated. Provided that all state-action pairs are seen infinitely often, q-learning converges to an optimal policy in the limit, for MDPs with a Markovian reward function [Watkins1992].
The q-learning algorithm can be modified to learn an optimal policy when the general reward function is encoded by a reward machine [DBLP:conf/icml/IcarteKVM18]. Algorithm 1 shows one episode of the QRM algorithm. It maintains a set of q-functions, denoted as for each state of the reward machine.
The current state of the reward machine guides the exploration by determining which q-function is used to choose the next action (line 1). However, in each single exploration step, the q-functions corresponding to all reward machine states are updated (lines 1 and 1).
The modeling hypothesis of QRM is that the rewards are known, but the transition probabilities are unknown. Later, we shall relax the assumption that rewards are known and we shall instead observe the rewards (in line 1). During the execution of the episode, traces of the reward machine are collected (line 1) and returned in the end. While not necessary for q-learning, the traces will be useful in our algorithm to check the consistency of an inferred reward machine with rewards received from the environment (see Section 3).
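The idea that a single exploration step updates the q-functions of all reward machine states can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 1: the function name `qrm_update`, the learning rate `alpha`, the dictionary-based machine, and the default self-loop with reward 0 are all hypothetical choices.

```python
from collections import defaultdict


def qrm_update(q, rm_states, delta, sigma, s, a, s_next, label, actions,
               alpha=0.1, gamma=0.9):
    """Apply one q-learning update to the q-function of every machine state."""
    for v in rm_states:
        v_next = delta.get((v, label), v)  # assumed default: self-loop
        r = sigma.get((v, label), 0.0)     # assumed default: reward 0
        best_next = max(q[v_next][(s_next, b)] for b in actions)
        q[v][(s, a)] = (1 - alpha) * q[v][(s, a)] + alpha * (r + gamma * best_next)


# q[v][(s, a)] is the q-value of the pair (s, a) in machine state v.
q = defaultdict(lambda: defaultdict(float))
delta = {(0, "goal"): 1}    # hypothetical two-state machine
sigma = {(0, "goal"): 1.0}
qrm_update(q, [0, 1], delta, sigma, s="s0", a="right", s_next="s1",
           label="goal", actions=["left", "right"], alpha=0.5)
print(q[0][("s0", "right")], q[1][("s0", "right")])  # 0.5 0.0
```

Only the q-function of machine state 0 changes in this step, because the label `"goal"` yields a reward there, while machine state 1 sees the same transition as an unrewarded self-loop.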
3 Joint Inference of Reward Machines and Policies (JIRP)
Given a reward machine, the QRM algorithm learns an optimal policy. However, in many situations, assuming the knowledge of the reward function (and thus the reward machine) is unrealistic. Even if the reward function is known, encoding it in terms of a reward machine can be nontrivial. In this section, we describe an RL algorithm that iteratively infers (i.e., learns) the reward machine and the optimal policy for the reward machine.
Our algorithm combines an automaton learning algorithm to infer hypothesis reward machines and the QRM algorithm for RL on the current candidate. Inconsistencies between the hypothesis machine and the observed traces are used to trigger relearning of the reward machine. We show that the resulting iterative algorithm converges in the limit almost surely to the reward machine encoding the reward function and to an optimal policy for this reward machine.
3.1 JIRP Algorithm
Algorithm 2 describes our JIRP algorithm. It starts with an initial hypothesis reward machine and runs the QRM algorithm to learn an optimal policy. The episodes of QRM are used to collect traces and update q-functions. As long as the traces are consistent with the current hypothesis reward machine, QRM explores more of the environment using the reward machine to guide the search. However, if a trace is detected that is inconsistent with the hypothesis reward machine (i.e., , Line 2), our algorithm stores it in a set (Line 2); we call the trace a counterexample and the set a sample. Once the sample is updated, the algorithm relearns a new hypothesis reward machine (Line 2) and proceeds. Note that we require the new hypothesis reward machine to be minimal (we discuss this requirement shortly).
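The consistency check that decides whether a trace is a counterexample can be sketched in a few lines. The helper names, the dictionary-based machine, and the example traces are assumptions made for illustration; the QRM episode and the inference step of Algorithm 2 are deliberately left out.

```python
def machine_rewards(delta, sigma, initial, labels):
    """Reward sequence that the hypothesis machine predicts for a label sequence."""
    state, rewards = initial, []
    for label in labels:
        rewards.append(sigma.get((state, label), 0.0))  # assumed default: reward 0
        state = delta.get((state, label), state)        # assumed default: self-loop
    return rewards


def collect_counterexamples(delta, sigma, initial, traces):
    """Keep the traces whose observed rewards differ from the predicted ones."""
    return [(labels, rewards) for labels, rewards in traces
            if machine_rewards(delta, sigma, initial, labels) != list(rewards)]


# Initial hypothesis: a single state that always outputs reward 0.
traces = [(("a",), (0.0,)), (("a", "b"), (0.0, 1.0))]
sample = collect_counterexamples({}, {}, 0, traces)
print(sample)  # [(('a', 'b'), (0.0, 1.0))]
```

In this sketch only the second trace enters the sample, since the trivial hypothesis cannot explain the reward 1.0.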
3.2 Passive Inference of Minimal Reward Machines
Intuitively, a sample contains a finite number of counterexamples. Consequently, we would like to construct a new reward machine that is (a) consistent with in the sense that for each and (b) minimal. We call this task passive learning of reward machines. The phrase “passive” here refers to the fact that the learning algorithm is not allowed to query for additional information, as opposed to Angluin’s famous “active” learning framework [DBLP:journals/iandc/Angluin87].
Task 1
Given a finite set , passive learning of reward machines refers to the task of constructing a minimal reward machine that is consistent with (i.e., that satisfies for each ).
Note that this learning task asks to infer not an arbitrary reward machine but a minimal one (i.e., a consistent reward machine with the fewest number of states among all consistent reward machines). This additional requirement can be seen as an Occam’s razor strategy [DBLP:conf/tacas/LodingMN16] and is crucial in that it guarantees JIRP to converge to the optimal policy in the limit. Unfortunately, Task 1 is computationally hard in the sense that the corresponding decision problem
“given a sample and a natural number , does a consistent Mealy machine with at most states exist?”
is NP-complete. This is a direct consequence of Gold’s (in)famous result for regular languages [DBLP:journals/iandc/Gold78].
Since this problem is computationally hard, a promising approach is to learn minimal consistent reward machines with the help of highly optimized SAT solvers ([DBLP:conf/icgi/HeuleV10], [DBLP:conf/nfm/NeiderJ13], and [DBLP:phd/dnb/Neider14] describe similar learning algorithms for inferring minimal deterministic finite automata from examples). The underlying idea is to generate a sequence of formulas in propositional logic for increasing values of (starting with ) that satisfy the following two properties:

is satisfiable if and only if there exists a reward machine with states that is consistent with ; and

a satisfying assignment of the variables in contains sufficient information to derive such a reward machine.
By increasing by one and stopping once becomes satisfiable (or by using a binary search), an algorithm that learns a minimal reward machine that is consistent with the given sample is obtained.
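The outer loop of this scheme (try an increasing number of states and stop at the first success) can be sketched in Python. Note that this sketch replaces the SAT encoding by a plain backtracking search, which is only practical for tiny samples; the function names, the choice of state 0 as initial state, and the start at one state are assumptions of the illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Trace = Tuple[Sequence[str], Sequence[float]]  # (label sequence, reward sequence)
Machine = Tuple[Dict[Tuple[int, str], int], Dict[Tuple[int, str], float]]


def find_machine(sample: List[Trace], n: int) -> Optional[Machine]:
    """Search for a machine with states 0..n-1 (initial state 0) consistent with sample."""
    delta: Dict[Tuple[int, str], int] = {}
    sigma: Dict[Tuple[int, str], float] = {}

    def solve(t: int, i: int, state: int) -> bool:
        if t == len(sample):
            return True                    # every trace has been replayed
        labels, rewards = sample[t]
        if i == len(labels):
            return solve(t + 1, 0, 0)      # the next trace restarts in state 0
        key = (state, labels[i])
        if key in delta:                   # transition already fixed: check it
            return sigma[key] == rewards[i] and solve(t, i + 1, delta[key])
        sigma[key] = rewards[i]            # the output is forced by the sample
        for target in range(n):            # the successor state is a free choice
            delta[key] = target
            if solve(t, i + 1, target):
                return True
        del delta[key], sigma[key]         # undo the choice and backtrack
        return False

    return (delta, sigma) if solve(0, 0, 0) else None


def minimal_machine(sample: List[Trace]) -> Tuple[int, Machine]:
    """Return the smallest number of states that admits a consistent machine."""
    max_n = 1 + sum(len(labels) for labels, _ in sample)  # prefix-tree size bound
    for n in range(1, max_n + 1):
        machine = find_machine(sample, n)
        if machine is not None:
            return n, machine
    raise ValueError("sample assigns different rewards to the same label prefix")


sample = [(("a", "b"), (0, 1)), (("b",), (0,))]
n, (delta, sigma) = minimal_machine(sample)
print(n)  # 2
```

One state does not suffice here because the label `b` must produce reward 1 after `a` but reward 0 at the start, so the search succeeds only with two states.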
Despite the advances in the performance of SAT solvers, this approach still does not scale to large problems. Therefore, one often must resort to polynomial-time heuristics.
3.3 Convergence in the Limit
Tabular q-learning and QRM both eventually converge almost surely to a q-function defining an optimal policy. We show that the same desirable property holds for JIRP. More specifically, in the following sequence of lemmas we show that, given a long enough exploration, JIRP will converge to the reward machine that encodes the reward function of the underlying MDP. We then use this fact to show that the overall learning process converges to an optimal policy (see Theorem 1).
We begin by defining attainable trajectories—trajectories that can possibly appear in the exploration of an agent.
Definition 3
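Let $\mathcal{M}$ be a labeled MDP with transition function $p$ and $m$ a natural number. We call a trajectory $s_0 a_0 s_1 \ldots s_k a_k s_{k+1}$ $m$-attainable if (i) its length is at most $m$ and (ii) $p(s_i, a_i, s_{i+1}) > 0$ for each $i \in \{0, \ldots, k\}$. Moreover, we say that a trajectory is attainable if there exists an $m$ such that it is $m$-attainable.
An induction shows that JIRP almost surely explores every attainable trajectory in the limit (i.e., with probability 1 when the number of episodes goes to infinity).
Lemma 1
Let $m$ be a natural number. Then, JIRP with a maximal episode length of at least $m$ almost surely explores every $m$-attainable trajectory at least once in the limit.
Analogous to Definition 3, we call a label sequence $\ell_0 \ldots \ell_k$ ($m$-)attainable if there exists an ($m$-)attainable trajectory $s_0 a_0 s_1 \ldots s_k a_k s_{k+1}$ such that $\ell_i = L(s_i, a_i, s_{i+1})$ for each $i \in \{0, \ldots, k\}$, where $L$ is the labeling function. An immediate consequence of Lemma 1 is that JIRP almost surely explores every attainable label sequence in the limit.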
Corollary 1
JIRP with almost surely explores every attainable label sequence at least once in the limit.
If JIRP explores sufficiently many attainable label sequences for a large enough value of , it is guaranteed to infer a reward machine that is “good enough” in the sense that it is equivalent to the reward machine encoding the reward function on all attainable label sequences. This is formalized in the next lemma.
Lemma 2
Let be a labeled MDP and the reward machine encoding the reward function of . Then, JIRP with almost surely learns a reward machine in the limit that is equivalent to on all attainable label sequences.
Lemma 2 guarantees that JIRP will eventually learn the reward machine encoding the reward function of an underlying MDP. This is the key ingredient in proving that JIRP learns an optimal policy in the limit almost surely.
Theorem 1
Let be a labeled MDP and the reward machine encoding the reward function of . Then, JIRP with almost surely converges to an optimal policy in the limit.
4 Algorithmic Optimizations
Section 3 provides the base algorithm with theoretical guarantees for convergence to an optimal policy. In this section, we present an improved algorithm (Algorithm 3) that includes three algorithmic optimizations:
Optimization 1: batching of counterexamples (Section 4.1);
Optimization 2: transfer of q-functions (Section 4.2);
Optimization 3: polynomial-time learning algorithm for inferring the reward machines (Section 4.3).
The following theorem claims that Optimizations 1 and 2 retain the convergence guarantee of Theorem 1.
Theorem 2
Let be a labeled MDP and the reward machine encoding the rewards of . Then, JIRP with Optimizations 1 and 2 with converges to an optimal policy in the limit.
It should be noted that although such a guarantee fails for Optimization 3, in practice the policies usually still converge to the optimal policies (see the case studies in Section 5).
4.1 Batching of Counterexamples
Algorithm 2 infers a new hypothesis reward machine whenever a counterexample is encountered. This could incur a high computational cost. In order to adjust the frequency of inferring new reward machines, Algorithm 3 stores each counterexample in a set . After each period of episodes (where is a user-defined hyperparameter), if is nonempty, we add to the sample and infer a new hypothesis reward machine (lines 3 to 3). Then, Algorithm 3 proceeds with the QRM algorithm for . The same procedure repeats until the policy converges.
4.2 Transfer of Q-functions
In Algorithm 2, after a new hypothesis reward machine is inferred, the q-functions are reinitialized and the experiences from the previous iteration of RL are not utilized. To utilize experiences in previous iterations, we provide a method to transfer the q-functions from the previously inferred reward machine to the newly inferred reward machine (inspired by the curriculum learning implementation in [DBLP:conf/icml/IcarteKVM18]). The transfer of q-functions is based on equivalent states of two reward machines as defined below.
Definition 4
For a reward machine and a state , let be the machine with as the initial state. Then, for two reward machines and , two states and are equivalent, denoted by , if and only if for all label sequences .
With Definition 4, we provide the following theorem claiming equality of optimal q-functions for equivalent states of two reward machines. We use to denote the optimal q-function for state of the reward machine.
Theorem 3
Let and be two reward machines encoding the rewards of a labeled MDP . For states and , if , then for every and , .
Algorithm 4 shows the procedure to transfer the q-functions between the hypothesis reward machines in consecutive iterations. For any state of the hypothesis reward machine in the current iteration, we check if there exists an equivalent state of the hypothesis reward machine in the previous iteration. If so, the corresponding q-functions are transferred (line 4). As shown in Theorem 3, the optimal q-functions for two equivalent states are the same.
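A possible way to realize the equivalence check of Definition 4 and the transfer step is sketched below. It assumes that both machines are total over a finite label alphabet and are given as pairs of dictionaries `(delta, sigma)`; the function names and the example machines are hypothetical and this is not a transcription of Algorithm 4.

```python
from collections import deque


def equivalent(m1, v1, m2, v2, alphabet):
    """Check that state v1 of m1 and state v2 of m2 produce the same rewards
    on all label sequences, by exploring the reachable pairs of states."""
    seen, queue = {(v1, v2)}, deque([(v1, v2)])
    while queue:
        x, y = queue.popleft()
        for label in alphabet:
            if m1[1][(x, label)] != m2[1][(y, label)]:
                return False               # the outputs differ on some sequence
            pair = (m1[0][(x, label)], m2[0][(y, label)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True


def transfer(q_old, old, old_states, new, new_states, alphabet):
    """Copy the q-function of an equivalent old state; otherwise start fresh."""
    q_new = {}
    for v in new_states:
        match = next((u for u in old_states
                      if equivalent(new, v, old, u, alphabet)), None)
        q_new[v] = dict(q_old[match]) if match is not None else {}
    return q_new


old = ({(0, "a"): 1, (1, "a"): 1}, {(0, "a"): 1, (1, "a"): 0})
new = ({(0, "a"): 1, (1, "a"): 2, (2, "a"): 2},
       {(0, "a"): 1, (1, "a"): 0, (2, "a"): 0})
q_old = {0: {("s", "x"): 0.7}, 1: {("s", "x"): 0.2}}
q_new = transfer(q_old, old, [0, 1], new, [0, 1, 2], ["a"])
print(q_new)
```

In this example, the new states 1 and 2 are both equivalent to the old state 1 and therefore both inherit its q-values.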
4.3 A Polynomial Time Learning Algorithm for Reward Machines
In order to tackle scalability issues of the SAT-based machine learning algorithm, we propose to use a modification of the popular Regular Positive Negative Inference (RPNI) algorithm [oncina1992inferring] adapted for learning reward machines. This algorithm, which we name RPNI-RM, proceeds in two steps.
In the first step, RPNI-RM constructs a partial, tree-like reward machine from a sample where

each prefix of a trace corresponds to a unique state of ; and

for each trace and , a transition leads from state to state with input and output .
Note that fits the sample perfectly in that for each and the output of all other inputs is undefined (since the reward machine is partial). In particular, this means that is consistent with .
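The first step can be illustrated by the following sketch, which builds the partial, tree-like machine from a sample of traces. The function name `prefix_tree`, the integer state identifiers, and the example sample are assumptions made for illustration; the state-merging step is not shown.

```python
def prefix_tree(sample):
    """Build a partial tree-shaped reward machine from (labels, rewards) traces."""
    states = {(): 0}       # label prefix -> state id; the empty prefix is the root
    delta, sigma = {}, {}
    for labels, rewards in sample:
        for i, (label, reward) in enumerate(zip(labels, rewards)):
            src = states[tuple(labels[:i])]
            dst = states.setdefault(tuple(labels[:i + 1]), len(states))
            if sigma.get((src, label), reward) != reward:
                raise ValueError("sample assigns two rewards to the same prefix")
            delta[(src, label)] = dst
            sigma[(src, label)] = reward
    return states, delta, sigma


sample = [(("a", "b"), (0, 1)), (("a", "c"), (0, 0))]
states, delta, sigma = prefix_tree(sample)
print(len(states), delta)
```

Both traces share the prefix `a`, so the tree has four states (the root, `a`, `ab`, and `ac`), and every transition that does not occur in the sample stays undefined.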
In the second step, RPNI-RM successively tries to merge the states of . The overall goal is to construct a reward machine with fewer states but more input-output behaviors. For every candidate merge (which might trigger additional state merges to restore determinism), the algorithm checks whether the resulting machine is still consistent with . Should the current merge result in an inconsistent reward machine, it is reverted and RPNI-RM proceeds with the next candidate merge; otherwise, RPNI-RM keeps the current merge and proceeds with the merged reward machine. This procedure stops if no more states can be merged. Once this is the case, any missing transition is directed to a sink state, where the output is fixed but arbitrary.
Since RPNI-RM starts with a consistent reward machine and keeps intermediate results only if they remain consistent, its final output is clearly consistent as well. Moreover, merging of states increases the input-output behaviors, hence generalizing from the (finite) sample. Finally, let us note that the overall runtime of RPNI-RM is polynomial in the number of symbols in the given sample because the size of the initial reward machine corresponds to the number of symbols in the sample and each operation of RPNI-RM can be performed in polynomial time.
5 Case Studies
In this section, we apply the proposed approach to three different scenarios: 1) autonomous vehicle scenario; 2) office world scenario adapted from [DBLP:conf/icml/IcarteKVM18], and 3) Minecraft world scenario adapted from [andreas2017modular]. We use the libalf [libalfTool] implementation of RPNI [oncina1992inferring] as the algorithm to infer reward machines. The detailed description of the tasks in the three different scenarios can be found in the supplementary material.
We compare JIRP (with algorithmic optimizations) with the two following baseline methods:

QAS (q-learning in augmented state space): to incorporate the extra information of the labels (i.e., high-level events in the environment), we perform q-learning [Watkins1992] in an augmented state space with an extra binary vector representing whether each label has been encountered or not.

HRL (hierarchical reinforcement learning): we use a metacontroller for deciding the subtasks (represented by encountering each label) and use the low-level controllers expressed by neural networks [Kulkarni2016hierarchical] for deciding the actions at each state for each subtask.
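To make the QAS baseline more concrete, the augmented state can be thought of as a pair of the environment state and a binary vector over the labels. The sketch below is only an illustration of that idea; the function name `augment`, the label names, and the treatment of labels as atomic symbols are assumptions.

```python
from typing import Tuple


def augment(state: str, seen: Tuple[bool, ...], label: str,
            all_labels: Tuple[str, ...]) -> Tuple[str, Tuple[bool, ...]]:
    """Return the augmented state (state, seen-vector) after observing `label`."""
    updated = tuple(bit or (name == label) for bit, name in zip(seen, all_labels))
    return state, updated


all_labels = ("coffee", "mail", "office")  # hypothetical label names
s1 = augment("cell_3_4", (False, False, False), "mail", all_labels)
s2 = augment("cell_3_5", s1[1], "coffee", all_labels)
print(s1, s2)
```

Standard tabular q-learning can then be run over these pairs; the vector records which labels occurred but not their order, which is one plausible reason why such a baseline struggles on order-sensitive tasks.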
5.1 Autonomous Vehicle Scenario
We consider the autonomous vehicle scenario as introduced in the motivating example in Section 1.1. The set of actions is , corresponding to going straight, turning left, turning right and staying in place. For simplicity, we assume that the labeled MDP is deterministic (i.e., the slip rate is zero for each action). The vehicle will make a U-turn if it reaches the end of any road.
The set of labels is and the labeling function is defined by
where is a Boolean variable that is true () if and only if is on the priority roads, represents the set of locations where the vehicle is entering an intersection, and are the and coordinate values at state , and and are and coordinate values at B (see Figure 1).
We set and . Figure 4 shows the inferred hypothesis reward machine in the last iteration of JIRP in one typical run. The inferred hypothesis reward machine is different from the true reward machine in Figure 2, but it can be shown that these two reward machines are equivalent on all attainable label sequences.
Figure 3 (a) shows the cumulative rewards with the three different methods in the autonomous vehicle scenario. The JIRP approach converges to optimal policies within 100,000 training steps, while QAS and HRL are stuck with near-zero cumulative reward for up to two million training steps (with the first 200,000 training steps shown in Figure 3 (a)).
5.2 Office World Scenario
We consider the office world scenario in a 9 × 12 grid-world. The agent has four possible actions at each time step: move north, move south, move east and move west. After each action, the agent may slip to each of the two adjacent cells with a probability of 0.05. We use four tasks with different high-level structural relationships among subtasks such as getting the coffee, getting the mail and going to the office (see Appendix G for details).
We set and . Figure 3 (b) shows the cumulative rewards with the three different methods in the office world scenario. The JIRP approach converges to the optimal policy within 150,000 training steps, while QAS and HRL reach only 60% of the optimal average cumulative reward within 200,000 training steps.
5.3 Minecraft World Scenario
We consider the Minecraft example in a 21 × 21 grid-world. The four actions and the slip rates are the same as in the office world scenario. We use four tasks including making a plank, making a stick, making a bow and making a bridge (see Appendix H for details).
We set and . Figure 3 (c) shows the cumulative rewards with the three different methods in the Minecraft world scenario. The JIRP approach converges to the optimal policy within 400,000 training steps, while QAS and HRL reach only 50% of the optimal average cumulative reward within 600,000 training steps.
6 Conclusion
We proposed an iterative approach that alternates between reward machine inference and reinforcement learning (RL) for the inferred reward machine. We have shown that the proposed method improves RL performance.
This work opens the door for utilizing automata learning in RL. First, the same methodology can be applied to other forms of RL, such as model-based RL, or actor-critic methods. Second, we will explore methods that can infer the reward machines incrementally (based on inferred reward machines in the previous iteration). Finally, the method to transfer the q-functions between equivalent states of reward machines can also be used for transfer learning between different tasks where the reward functions are encoded by reward machines.
References
Appendix A Proof of Lemma 1
Proof 1
We first prove that JIRP with explores every attainable trajectory with a positive (nonzero) probability. We show this claim by induction over the length of trajectories.
 Base case:

The only trajectory of length , , is always explored because it is the initial state of every exploration.
 Induction step:

Let and be an attainable trajectory of length . Then, the induction hypothesis yields that JIRP explores each attainable trajectory (of length ). Moreover JIRP continues its exploration because . At this point, every action will be chosen with probability at least , where denotes the set of available actions in the state (this lower bound is due to the greedy strategy used in the exploration). Having chosen action , the state is reached with probability because is attainable. Thus, the trajectory is explored with a positive probability.
Since JIRP with explores every attainable trajectory with a positive probability, the probability of an attainable trajectory not being explored becomes 0 in the limit (i.e., when the number of episodes goes to infinity). Thus, JIRP almost surely (i.e., with probability 1) explores every attainable trajectory in the limit.
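The limit argument can be illustrated numerically. The following is a minimal sketch, not part of the original proof: it assumes an ε-greedy exploration in which every action is chosen with probability at least ε divided by the number of available actions, and it uses hypothetical values for ε, the number of actions, and the transition probabilities. The function names are illustrative.

```python
from math import prod


def trajectory_lower_bound(eps, num_actions, transition_probs):
    """Lower bound on the probability of exploring one fixed trajectory.

    Assumption: under epsilon-greedy exploration, each action is picked with
    probability at least eps / num_actions, and the i-th step of the
    trajectory succeeds with probability transition_probs[i] > 0.
    """
    per_action = eps / num_actions
    return prod(per_action * p for p in transition_probs)


def miss_upper_bound(p_explore, episodes):
    """Upper bound on the probability that the trajectory is never explored
    in the given number of episodes, if each episode explores it with
    probability at least p_explore."""
    return (1.0 - p_explore) ** episodes


if __name__ == "__main__":
    # Hypothetical numbers: eps = 0.1, four actions, two steps with
    # transition probability 0.9 each.
    p = trajectory_lower_bound(0.1, 4, [0.9, 0.9])
    for n in (10**3, 10**4, 10**5):
        print(n, miss_upper_bound(p, n))
```

The bound on missing the trajectory shrinks geometrically in the number of episodes, which is the behavior the proof relies on.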
Appendix B Proof of Lemma 2
In order to prove Lemma 2, we require a few (basic) definitions from automata and formal language theory.
An alphabet is a nonempty, finite set of symbols . A word is a finite sequence of symbols. The empty sequence is called the empty word and is denoted by . The length of a word , denoted by , is the number of its symbols. We denote the set of all words over the alphabet by .
Next, we recapitulate the definition of deterministic finite automata.
Definition 5
A deterministic finite automaton (DFA) is a five-tuple consisting of a nonempty, finite set of states, an initial state , an input alphabet , a transition function , and a set of final states. The size of a DFA, denoted by , is the number of its states.
A run of a DFA on an input word is a sequence of states such that and for each . A run of on a word is accepting if , and is accepted if there exists an accepting run. The language of a DFA is the set . As usual, we call two DFAs and equivalent if . Moreover, let us recapitulate the well-known fact that two nonequivalent DFAs have a “short” word that witnesses their nonequivalence.
Theorem 4 ([DBLP:books/daglib/0025557], Theorem 3.10.5)
Let and be two DFAs with . Then, there exists a word of length at most such that if and only if .
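The definitions above and the idea of a short witness can be sketched in code. This is an illustrative sketch under our own assumptions: the class `DFA`, the helper `mod_counter`, and the function `shortest_distinguishing_word` are hypothetical names, transition functions are assumed to be total, and the search is a plain breadth-first search over pairs of states rather than the argument used in the cited theorem.

```python
from collections import deque


class DFA:
    """A DFA with a total transition function given as a dictionary."""

    def __init__(self, states, initial, alphabet, delta, final):
        self.states = states
        self.initial = initial
        self.alphabet = alphabet
        self.delta = delta  # maps (state, symbol) -> state
        self.final = final

    def run(self, word):
        """Return the sequence of states visited while reading word."""
        q = self.initial
        visited = [q]
        for symbol in word:
            q = self.delta[(q, symbol)]
            visited.append(q)
        return visited

    def accepts(self, word):
        return self.run(word)[-1] in self.final


def shortest_distinguishing_word(a1, a2):
    """Breadth-first search over pairs of states of two DFAs that share an
    alphabet. Returns a shortest word accepted by exactly one of them, or
    None if the DFAs are equivalent."""
    start = (a1.initial, a2.initial)
    seen = {start}
    queue = deque([(start, ())])
    while queue:
        (p, q), word = queue.popleft()
        if (p in a1.final) != (q in a2.final):
            return word
        for symbol in sorted(a1.alphabet):
            nxt = (a1.delta[(p, symbol)], a2.delta[(q, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + (symbol,)))
    return None


def mod_counter(n, final):
    """Illustrative DFA over {a, b} counting occurrences of 'a' modulo n."""
    delta = {}
    for q in range(n):
        delta[(q, "a")] = (q + 1) % n
        delta[(q, "b")] = q
    return DFA(set(range(n)), 0, {"a", "b"}, delta, set(final))
```

For example, `mod_counter(2, {0})` and `mod_counter(4, {0, 2})` accept the same language, so the search returns `None`, whereas `mod_counter(2, {0})` and `mod_counter(4, {0})` are separated by the word `aa`.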
As the next step towards the proof of Lemma 2, we remark that every reward machine over the input alphabet and output alphabet can be translated into an “equivalent” DFA as defined below. This DFA operates over the combined alphabet and accepts a word if and only if outputs the reward sequence on reading the label sequence .
Lemma 3
Given a reward machine , one can construct a DFA with states such that
(1) 
Proof 2 (Proof of Lemma 3)
Let be a reward machine. Then, we define a DFA over the combined alphabet by

with ;

;

;


.
In this definition, is a new sink state to which moves if its input does not correspond to a valid input-output pair produced by . A straightforward induction over the length of inputs to shows that it indeed accepts the desired language. In total, has states.
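The construction in this proof can be sketched as follows. The sketch makes several assumptions that are ours and not the paper's: labels are plain strings rather than sets of propositions, the reward machine is given by dictionaries `delta` and `sigma` that are total on all (state, label) pairs, and the names `reward_machine_to_dfa`, `dfa_accepts`, `example_reward_machine`, and `SINK` are hypothetical.

```python
SINK = "sink"  # assumed not to clash with any reward machine state


def reward_machine_to_dfa(states, initial, labels, rewards, delta, sigma):
    """Build a DFA over (label, reward) pairs from a reward machine.

    The DFA follows the reward machine as long as each input pair (l, r)
    matches the machine's output, and moves to a rejecting sink otherwise.
    """
    dfa_delta = {}
    for v in states:
        for l in labels:
            for r in rewards:
                if sigma[(v, l)] == r:
                    dfa_delta[(v, (l, r))] = delta[(v, l)]
                else:
                    dfa_delta[(v, (l, r))] = SINK
    for l in labels:
        for r in rewards:
            dfa_delta[(SINK, (l, r))] = SINK  # the sink is absorbing
    return {
        "states": set(states) | {SINK},
        "initial": initial,
        "alphabet": {(l, r) for l in labels for r in rewards},
        "delta": dfa_delta,
        "final": set(states),  # every state except the sink accepts
    }


def dfa_accepts(dfa, word):
    q = dfa["initial"]
    for symbol in word:
        q = dfa["delta"][(q, symbol)]
    return q in dfa["final"]


def example_reward_machine():
    """Illustrative machine: reward 1 for a 'b' that directly follows an 'a'."""
    return {
        "states": ["u0", "u1"],
        "initial": "u0",
        "labels": ["a", "b"],
        "rewards": [0, 1],
        "delta": {("u0", "a"): "u1", ("u0", "b"): "u0",
                  ("u1", "a"): "u1", ("u1", "b"): "u0"},
        "sigma": {("u0", "a"): 0, ("u0", "b"): 0,
                  ("u1", "a"): 0, ("u1", "b"): 1},
    }
```

With `dfa = reward_machine_to_dfa(**example_reward_machine())`, the word `[("a", 0), ("b", 1)]` is accepted, while `[("a", 0), ("b", 0)]` leads to the sink. The resulting DFA has one more state than the reward machine.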
Similarly, one can construct a DFA that accepts exactly the attainable traces of an MDP . First, viewing labels as input symbols, marking every state as an accepting state, and keeping only those transitions for which , can be viewed as a nondeterministic finite automaton. Second, using the standard determinization algorithm [DBLP:books/daglib/0025557], one can create an equivalent DFA with an exponential blowup in the number of states.
Remark 1
Given a labeled MDP , one can construct a DFA with at most states that accepts exactly the admissible label sequences of .
Next, we show that if two reward machines disagree on an attainable label sequence, then we can provide a bound on the length of such a sequence.
Lemma 4
Let be a labeled MDP and two reward machines with input alphabet . If there exists an attainable label sequence such that , then there also exists an attainable label sequence with such that .
Proof 3 (Proof of Lemma 4)
Let be a labeled MDP and , two reward machines with input alphabet . As a first step, we construct the DFAs according to Lemma 3 and the DFAs for according to Remark 1.
Next, we construct the input-synchronized product of and by

;

;

; and

,
which synchronizes and the input component of . A straightforward induction over the lengths of inputs to shows that if and only if is an attainable label sequence such that . Moreover, note that has states.
If there exists an attainable label sequence such that , then by construction of the DFAs and . In this situation, Theorem 4 guarantees the existence of a word of size
such that if and only if .
Let now . By construction of the DFAs and , we know that holds. Moreover, is an attainable label sequence with the desired bound on .
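The idea behind Lemma 4 can be illustrated with a direct search. The following sketch does not reproduce the product DFAs of the proof; instead, it assumes a simplified representation and explores the MDP's label structure and both reward machines in lockstep, determinizing the MDP on the fly. All names are hypothetical: `mdp_edges` is a set of triples `(s, label, s2)` for transitions with positive probability (actions are abstracted away), and each reward machine is a triple `(initial, delta, sigma)` with total dictionaries.

```python
from collections import deque


def shortest_disagreement(mdp_initial, mdp_edges, rm1, rm2):
    """Return a shortest attainable label sequence on which the two reward
    machines output different reward sequences, or None if they agree on
    every attainable label sequence."""
    (v1_init, delta1, sigma1), (v2_init, delta2, sigma2) = rm1, rm2
    labels = sorted({label for (_, label, _) in mdp_edges})
    start = (frozenset([mdp_initial]), v1_init, v2_init)
    seen = {start}
    queue = deque([(start, ())])
    while queue:
        (mdp_states, v1, v2), seq = queue.popleft()
        for label in labels:
            successors = frozenset(
                t for (s, lab, t) in mdp_edges
                if s in mdp_states and lab == label
            )
            if not successors:
                continue  # this label cannot be observed next
            if sigma1[(v1, label)] != sigma2[(v2, label)]:
                return seq + (label,)
            nxt = (successors, delta1[(v1, label)], delta2[(v2, label)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + (label,)))
    return None


# Illustrative data: label 'b' is only attainable directly after an 'a'.
EDGES = {(0, "a", 1), (1, "a", 1), (1, "b", 0)}
RM_ONE_STATE = ("u", {("u", "a"): "u", ("u", "b"): "u"},
                {("u", "a"): 0, ("u", "b"): 1})
DELTA_TWO = {("w0", "a"): "w1", ("w0", "b"): "w0",
             ("w1", "a"): "w1", ("w1", "b"): "w0"}
RM_TWO_STATE = ("w0", DELTA_TWO,
                {("w0", "a"): 0, ("w0", "b"): 0,
                 ("w1", "a"): 0, ("w1", "b"): 1})
RM_DIFFERENT = ("w0", DELTA_TWO,
                {("w0", "a"): 0, ("w0", "b"): 0,
                 ("w1", "a"): 5, ("w1", "b"): 1})
```

In this example, `RM_ONE_STATE` and `RM_TWO_STATE` differ only on a `b` that does not follow an `a`, which the MDP never produces, so the search returns `None`. Because the search visits each combined state at most once, any witness it returns is short relative to the number of combined states.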
We are now ready to prove Lemma 2.
Proof 4 (Proof of Lemma 2)
Let be the sequence of samples that arise in the run of JIRP whenever new counterexamples are added to (in Line 2 of Algorithm 2). We now make two observations about this sequence, which help us prove Lemma 2.

1. The sequence grows strictly monotonically (i.e., ). The reasons for this are twofold. First, JIRP always adds counterexamples to and never removes them (which establishes ). Second, whenever a counterexample is added to to form , then . To see why this is the case, remember that JIRP always constructs hypothesis reward machines that are consistent with the current sample. Thus, the reward machine is consistent with . However, was added because . Hence, cannot have been an element of .

2. The true reward machine , the one that encodes the reward function , is by definition consistent with all samples that are generated during the run of JIRP.
Once a new counterexample is added, JIRP learns a new reward machine. Let be the sequence of these reward machines, where is computed based on the sample . As above, we make two observations about this sequence.

3. We have . Towards a contradiction, assume that . Since JIRP always computes consistent reward machines and (see Observation 1), we know that is not only consistent with but also with (by definition of consistency). Moreover, JIRP always computes consistent reward machines of minimal size. Thus, since is consistent with and , the reward machine is not minimal, which yields the desired contradiction.

4. We have for each ; in other words, the reward machines generated during the run of JIRP are semantically distinct. This is a consequence of the facts that was a counterexample to (i.e., ) and the learning algorithm for reward machines always constructs consistent reward machines (which implies ).
Observations 2 and 3 now provide as an upper bound on the size of the hypothesis reward machines constructed in the run of JIRP. Since there are only finitely many reward machines of size or less, Observation 4 then implies that there exists an after which no new reward machine is inferred. Thus, it is left to show that for all attainable label sequences .
Towards a contradiction, assume that there exists an attainable label sequence such that . Lemma 4 then guarantees the existence of an attainable label sequence with
such that . By Corollary 1, JIRP almost surely explores the label sequence in the limit because we assume . Thus, the trace , where , is almost surely returned as a new counterexample, resulting in a new sample . This triggers the construction of a new reward machine (which will then be different from all previous reward machines). However, this contradicts the assumption that no new reward machine is constructed after . Thus, holds for all attainable input sequences .
Appendix C Proof of Theorem 1
To prove Theorem 1, we use the fact that JIRP will eventually learn a reward machine equivalent to the reward machine on all attainable label sequences (see Lemma 2). Then, closely following the proof of Theorem 4.1 from [DBLP:conf/icml/IcarteKVM18], we construct an MDP , show that using the same policy for and yields the same rewards, and, due to the convergence of q-learning for , conclude that JIRP converges towards an optimal policy for . Lemma 5 describes the construction of the mentioned MDP .
Lemma 5
Given an MDP with a non-Markovian reward function defined by a reward machine , one can construct an MDP whose reward function is Markovian such that every attainable label sequence of gets the same reward as in . Furthermore, any policy for achieves the same expected reward in .
Proof 5
Let be a labeled MDP and a reward machine encoding its reward function. We define the product MDP by

;

;

;


; ;

; and

.
The described MDP has a Markovian reward function that matches , the (non-Markovian) reward function of defined by the reward machine (Definition 2). Since the reward functions and discount factors are the same, the claims follow.
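A product of this kind can be sketched in code. The sketch below is an assumption on our part, following the usual construction in which the reward machine state is tracked alongside the MDP state: a product transition from (s, v) to (s', v') keeps the MDP's transition probability when v' is the reward machine's successor of v on the label of the underlying transition, and the reward is the machine's output on that label. The dictionary-based representation and the name `product_mdp` are hypothetical.

```python
def product_mdp(p, label, rm_states, rm_delta, rm_sigma):
    """Build the transition and reward dictionaries of a product MDP.

    p:        maps (s, a, s2) -> transition probability of the original MDP
    label:    maps (s, a, s2) -> label observed on that transition
    rm_delta: maps (v, label) -> next reward machine state (total)
    rm_sigma: maps (v, label) -> reward emitted by the reward machine (total)
    """
    p_prod, r_prod = {}, {}
    for (s, a, s2), prob in p.items():
        if prob == 0:
            continue
        l = label[(s, a, s2)]
        for v in rm_states:
            v2 = rm_delta[(v, l)]
            key = ((s, v), a, (s2, v2))
            p_prod[key] = prob          # reward machine moves deterministically
            r_prod[key] = rm_sigma[(v, l)]  # reward depends only on the product state
    return p_prod, r_prod


# Illustrative data: a two-state MDP with one action and a two-state machine
# that pays reward 1 when label 'b' directly follows label 'a'.
P = {(0, "go", 1): 0.9, (0, "go", 0): 0.1, (1, "go", 0): 1.0}
LABEL = {(0, "go", 1): "a", (0, "go", 0): "n", (1, "go", 0): "b"}
RM_STATES = ["u0", "u1"]
RM_DELTA = {("u0", "a"): "u1", ("u0", "b"): "u0", ("u0", "n"): "u0",
            ("u1", "a"): "u1", ("u1", "b"): "u0", ("u1", "n"): "u1"}
RM_SIGMA = {("u0", "a"): 0, ("u0", "b"): 0, ("u0", "n"): 0,
            ("u1", "a"): 0, ("u1", "b"): 1, ("u1", "n"): 0}
```

Because the reward machine is deterministic, each original transition yields exactly one product transition per reward machine state, so outgoing probabilities in the product still sum to one.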
Lemma 2 shows that eventually , the reward machine learned by JIRP, will be equivalent to on all attainable label sequences. Thus, using Lemma 5, an optimal policy for MDP will also be optimal for .
When running episodes of QRM (Algorithm 1) under the reward machine , an update of a function connected to a state of a reward machine corresponds to updating the function for . Because , the fact that QRM uses an ε-greedy strategy and that updates are done in parallel for all states of the reward machine , we know that every state-action pair of the MDP will be seen infinitely often. Hence, convergence of q-learning for to an optimal policy is guaranteed [Watkins1992]. Finally, because of Lemma 5, JIRP converges to an optimal policy, too.
We have proved that if the number of episodes goes to infinity, and the length of an episode is at least , then JIRP converges towards an optimal policy.
Appendix D Proof of Theorem 2
In order to prove Theorem 2, we first need to prove the following lemma.
Lemma 6
Let be a labeled MDP and the reward machine encoding the rewards of . Then, JIRP with Optimizations 1 and 2 with