# Stochastic Fairness and Language-Theoretic Fairness in Planning on Nondeterministic Domains

## Abstract

We address two central notions of fairness in the literature of planning on nondeterministic fully observable domains. The first, which we call stochastic fairness, is classical, and assumes an environment which operates probabilistically using possibly unknown probabilities. The second, which is language-theoretic, assumes that if an action is taken from a given state infinitely often then all its possible outcomes should appear infinitely often (we call this state-action fairness). While the two notions coincide for standard reachability goals, they diverge for temporally extended goals. This important difference has been overlooked in the planning literature, and we argue has led to confusion in a number of published algorithms which use reductions that were stated for state-action fairness, for which they are incorrect, while being correct for stochastic fairness. We remedy this and provide an optimal sound and complete algorithm for solving state-action fair planning for LTL/LTLf goals, as well as a correct proof of the lower bound of the goal-complexity (our proof is general enough that it provides new proofs also for the no-fairness and stochastic-fairness cases). Overall, we show that stochastic fairness is better behaved than state-action fairness.

## 1 Introduction

Nondeterminism in planning captures uncertainty that the agent has at planning time about the effects of its actions. For instance, “remove block A from the table” may either succeed, resulting in “block A is not on the table”, or fail, resulting in “block A is on the table”. Plans in nondeterministic environments are not simply sequences of actions as in classical planning; rather, the next action may depend on the sequence of actions (and observations) so far. Such plans are captured by policies (also known as strategies and controllers).

Broadly speaking, nondeterminism manifests in one of two ways: in stochastic environments and in adversarial environments.

### Stochastic environments

Nondeterministic environments with probabilities are often modeled as Markov Decision Processes (MDPs) in planning. These are state-transition systems in which the probability of an effect depends only on the current state and action. However, sometimes the probabilities of action effects are not available, are non-stationary, or are hard to estimate, e.g., a robot may encounter an unexpected obstacle, or an exogenous event or failure may occur. A long line of work in this setting aims to understand what it means to plan in such an environment [14, 29, 11, 23, 13]. One common intuition is that the goal should be achievable by trial-and-error while expecting only a finite amount of bad luck [11], e.g., a policy that repeats the action “remove block A from the table” would eventually succeed under this assumption. This amounts to assuming that some unknown distribution assigns a non-zero probability to each of the alternative effects. Thus, although there are no explicit probabilities, the stochastic principle is still in place, and we call such assumptions stochastic fairness. Plans in such a setting are called strong-cyclic, and their importance is evidenced by the fact that there are several tools for finding strong-cyclic policies, e.g., NDP [1], FIP [20], myND [26], Gamer [24], PRP [27], GRENADE [34], and FOND-SAT [22]. Such policies also correspond to ensuring that the goal holds with probability one [23, 21].

Nondeterministic environments without probabilities are often modeled as fully observable nondeterministic planning domains (FOND). These are state-transition systems in which the effect of an action is a set of possible states, rather than a single state as in classical planning. Policies that guarantee success, i.e., the goal is achieved no matter how the nondeterminism is resolved, are called strong solutions. When handling adversarial nondeterminism it is often reasonable to require that a policy should guarantee success under some additional assumptions about the environment. For instance, a typical assumption is that repeating an action in a given state results in all possible effects, e.g., repeating the action “remove block A from the table” would eventually succeed (as well as eventually fail). Note that this can be expressed as a property of traces, and so for the purpose of this paper, we call such notions language-theoretic fairness. We focus on one central such notion, which we call state-action fairness, and which says, of a trace, that if an action $a$ is taken from a state $s$ infinitely often in the trace, and if $s'$ is a possible effect of $a$ from $s$, then infinitely often in the trace $s'$ is the resulting effect of action $a$ from state $s$. Although there are many notions of fairness, this particular notion has been identified as providing sufficient assumptions that guarantee the success of solutions that repeatedly retry; see [13] where the notion is called state strong fairness.

### What is the relationship between fairness in an adversarial setting and fairness in a stochastic setting?

On the one hand, the two notions of fairness are similar. Indeed, planning assuming either notion of fairness means that the policy can ignore some traces, which are guaranteed not to be produced by the environment. Also, it turns out that when planning for reachability goals (i.e., eventually reach a certain target set of states) the two notions of fairness are interchangeable. More precisely, a policy achieves the reachability goal assuming stochastic fairness (i.e., it is a strong-cyclic solution) if and only if it achieves the reachability goal assuming state-action fairness (i.e., the target set is reached on all state-action fair traces). On the other hand, it turns out that the two notions of fairness are not generally interchangeable when planning for temporally extended goals (such as those expressed in linear temporal logic ltl or its finite-trace variant ltl$_f$). It is the purpose of this paper to clarify this fact and study its consequences.

### Outline of the paper and contributions

In Section 3 we point out the distinction between stochastic fairness and state-action fairness in the context of planning. Once this distinction has been noted, one realizes that there are algorithms (published in IJCAI) for fair planning for temporally-extended goals that, although stated for state-action fairness, are actually correct for stochastic fairness (but do not address state-action fairness at all). The relevant parts of these algorithms are discussed in Section 4. To remedy this, the focus of the rest of the paper is on algorithms and the computational complexity of planning for temporally-extended goals assuming state-action fairness.

In Section 5 we provide a new algorithm for this problem that does not conflate the two notions. We go on to show that the complexity in the goal is in 2exptime, while the complexity in the domain is in nexptime.

In Section 6 we provide a proof of the matching 2exptime lower bound for the goal complexity. We also discuss the domain complexity: it is exptime-hard already for reachability goals, leaving a gap between deterministic and nondeterministic exponential time. The technique used to prove our lower bound is general enough to give new proofs of the 2exptime-hardness of the goal complexity also for the no-fairness and stochastic-fairness cases.

## 2 Fair Planning Problems

In this section we define planning domains, temporally extended goals, and isolate the two notions of fairness.

### Planning Domains

A nondeterministic planning domain is a tuple $D = \langle St, Act, s_0, Tr \rangle$ where $St$ is a finite set of states, $Act$ is a finite set of actions, $s_0 \in St$ is an initial state, and $Tr \subseteq St \times Act \times St$ is a transition relation. We will sometimes write $Tr$ in functional form, i.e., $Tr(s,a) = \{s' \in St : (s,a,s') \in Tr\}$. We say that the action $a$ is applicable in state $s$ if $Tr(s,a) \neq \emptyset$. We assume, by adding a dummy action and state if needed, that for every state there is an applicable action.
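The definition can be made concrete with a minimal sketch of an explicitly represented domain. The class name `Domain`, its field names, and the two-state blocks-world fragment are illustrative assumptions, not notation from the paper, and the sketch ignores compact (PDDL-style) representations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Explicit nondeterministic domain (hypothetical field names)."""
    states: frozenset
    actions: frozenset
    init: str
    trans: frozenset  # set of (s, a, s2) triples

    def successors(self, s, a):
        """Functional form of the transition relation."""
        return {t for (p, b, t) in self.trans if p == s and b == a}

    def applicable(self, s):
        """Actions with at least one possible effect in state s."""
        return {a for a in self.actions if self.successors(s, a)}


# Assumed toy instance: removing the block may succeed or fail.
toy = Domain(
    states=frozenset({"on_table", "off_table"}),
    actions=frozenset({"remove", "wait"}),
    init="on_table",
    trans=frozenset({
        ("on_table", "remove", "off_table"),
        ("on_table", "remove", "on_table"),
        ("off_table", "wait", "off_table"),
    }),
)
```

Here `toy.successors("on_table", "remove")` contains both outcomes, which is exactly what distinguishes a nondeterministic domain from a classical one.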

For a finite set $X$ let $\mathit{Dist}(X)$ denote the set of (probability) distributions over $X$, i.e., functions $d : X \to [0,1]$ such that $\sum_{x \in X} d(x) = 1$. An element $x \in X$ is in the support of $d$ if $d(x) > 0$. A stochastic planning domain is a tuple $\langle D, Pr \rangle$ where $D = \langle St, Act, s_0, Tr \rangle$ is called the induced nondeterministic planning domain, and $Pr$, called the probabilistic transition function, is a partial function $Pr : St \times Act \to \mathit{Dist}(St)$ defined only for pairs $(s,a)$ where $a$ is applicable in $s$, satisfying that the support of $Pr(s,a)$ is equal to $Tr(s,a)$. Note that stochastic domains are variants of Markov Decision Processes (MDPs). However, MDPs typically have Markovian rewards, while stochastic planning problems may have goals that depend on the history.

We will refer to both nondeterministic and stochastic planning domains simply as domains. Unless otherwise stated, domains are compactly represented, e.g., in variants of the Planning Domain Description Language (PDDL), and thus can usually be represented with a number of bits which is poly-logarithmic in the number of states and actions. In particular, the states are encoded as assignments to a set $F$ of Boolean variables called fluents, thus we have that $St = 2^{F}$. For symmetry, the actions are also encoded as assignments to a set $A$ of Boolean variables that is disjoint from $F$, thus we have that $Act = 2^{A}$. Although the literature also contains formalisms for compactly representing stochastic domains (such as Probabilistic PDDL), here we will not be concerned with a detailed formalization of probabilistic transition functions since it is known (as we will later discuss) that probabilities essentially play no role in the stochastic-fair planning problem (formally defined below).

### Traces and Policies

Let be a domain. A trace of is a finite or infinite sequence over the alphabet where is the initial state, and for all with . Moreover, the sequence of states is called the path induced by . A policy is a function such that for every the action is applicable in the last state of . Note that policies are history dependent in this paper. A trace is generated by , or simply called an -trace, if for every finite prefix of we have that .

A finite-state representation of a policy is a finite-state input/output automaton that, on reading as input, outputs the action . A finite-state policy is one having a finite-state representation.

A stochastic domain combined with a policy induces a (possibly infinite-state) Markov chain, denoted , in the usual way, which gives rise to a probability distribution over the set of infinite -traces in  [39].

The following domain, illustrated in Figure 1, will be used in counterexamples.

###### Example 1

Define the domain where with , with , , and consist of the triples and . Note that the only applicable action (from any state) is , only three states are reachable from the initial state using this action (i.e., and ), and there is only one policy available (it always does the action ). Define the trace as .4 Note that this trace takes each of the transitions and infinitely often.

### Linear Temporal Logic

Linear Temporal Logic (ltl) is a formalism that was introduced into the verification literature for describing computations of programs without the use of explicit time-stamps [32]. The logic has since been used in planning as a language for specifying temporally extended goals and for expressing search control, see, e.g., [18, 3].

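The syntax of ltl consists of atoms $AP$, and is closed under the Boolean operations $\neg$ and $\wedge$, and the temporal operators $\bigcirc$ (read “next”) and $\mathsf{U}$ (read “until”):

$$\psi ::= p \mid (\neg\psi) \mid (\psi_1 \wedge \psi_2) \mid (\bigcirc\psi) \mid (\psi_1 \mathbin{\mathsf{U}} \psi_2)$$

with $p$ varying over the elements of $AP$.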

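We use the usual short-hands, e.g., $\psi_1 \vee \psi_2$, $\psi_1 \supset \psi_2$, $\Diamond\psi := \mathit{true} \mathbin{\mathsf{U}} \psi$ (read “eventually $\psi$”), and $\Box\psi := \neg\Diamond\neg\psi$ (read “always $\psi$”).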

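Formulas of ltl are interpreted over infinite sequences $\alpha = \alpha_0 \alpha_1 \cdots$ over the alphabet $2^{AP}$. Define $(\alpha, n) \models \psi$ inductively on the structure of $\psi$, simultaneously for all time points $n \geq 0$, as follows:

• $(\alpha, n) \models p$ if $p \in \alpha_n$,

• $(\alpha, n) \models \neg\psi$ if $(\alpha, n) \not\models \psi$,

• $(\alpha, n) \models \psi_1 \wedge \psi_2$ if $(\alpha, n) \models \psi_i$ for $i = 1, 2$,

• $(\alpha, n) \models \bigcirc\psi$ if $(\alpha, n+1) \models \psi$,

• $(\alpha, n) \models \psi_1 \mathbin{\mathsf{U}} \psi_2$ if $(\alpha, m) \models \psi_2$ for some $m \geq n$, and $(\alpha, j) \models \psi_1$ for all $n \leq j < m$.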

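We also consider the variant ltl$_f$ of ltl interpreted over finite sequences. It has the same syntax and semantics as ltl except that $\alpha$ is a finite sequence and that one defines $(\alpha, n) \models \bigcirc\psi$ as follows, cf. [3, 5, 16]:

• $(\alpha, n) \models \bigcirc\psi$ if $n < \mathit{last}(\alpha)$ and $(\alpha, n+1) \models \psi$, where $\mathit{last}(\alpha)$ is the last position of $\alpha$, i.e., $\mathit{last}(\alpha) = |\alpha| - 1$ since sequences start with position $0$.

If $\psi$ is an ltl (resp. ltl$_f$) formula and $\alpha$ is an infinite (resp. finite) sequence over $2^{AP}$, we write $\alpha \models \psi$, and say that $\alpha$ satisfies $\psi$, iff $(\alpha, 0) \models \psi$.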
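The finite-trace semantics can be sketched as a short recursive evaluator. The tuple encoding of formulas (`("atom", p)`, `("next", f)`, and so on), the extra `("true",)` constant, and the function names are assumptions made for illustration only.

```python
def holds(formula, trace, n=0):
    """Evaluate a finite-trace formula at position n.

    trace is a non-empty list of sets of atoms; formulas are nested tuples.
    """
    op = formula[0]
    if op == "true":
        return True
    if op == "atom":
        return formula[1] in trace[n]
    if op == "not":
        return not holds(formula[1], trace, n)
    if op == "and":
        return holds(formula[1], trace, n) and holds(formula[2], trace, n)
    if op == "next":
        # Finite-trace variant: a next position must exist.
        return n < len(trace) - 1 and holds(formula[1], trace, n + 1)
    if op == "until":
        return any(
            holds(formula[2], trace, m)
            and all(holds(formula[1], trace, j) for j in range(n, m))
            for m in range(n, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")


def eventually(f):
    """Shorthand: eventually f is (true until f)."""
    return ("until", ("true",), f)


example = [{"p"}, set(), {"q"}]
```

On `example`, the formula `eventually(("atom", "q"))` holds at position 0, while `("next", ("true",))` fails at the last position because no successor position exists.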

We also make the following useful convention that allows us to interpret ltl$_f$ formulas over infinite traces: if $\alpha$ is infinite and $\psi$ is an ltl$_f$ formula, then $\alpha \models \psi$ is defined to mean that some finite prefix of $\alpha$ satisfies $\psi$.

In the context of a planning domain , we will take to be (this is for convenience; some papers take ). We write , and say that enforces , if every infinite -trace of satisfies .

### Planning Problems

A goal $G$ is a set of infinite traces of $D$. A planning problem $P = \langle D, G \rangle$ consists of a domain $D$ and a goal $G$. Solving the planning problem is to decide, given $D$ (compactly represented) and $G$ (suitably represented), if there is a policy $\sigma$ such that every infinite $\sigma$-trace satisfies $G$ (i.e., is in $G$). In this paper, goals will typically be represented by ltl/ltl$_f$ formulas.

### Fair Planning Problems

We now define the two types of fair planning problems mentioned in the introduction.

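A trace $\tau$ of a domain $D$ is state-action fair if for every transition $(s,a,s')$ of $D$, if $(s,a)$ occurs infinitely often in $\tau$ then $(s,a,s')$ occurs infinitely often in $\tau$. This can be expressed by the following ltl formula $\phi_{D,\mathit{fair}}$:

$$\phi_{D,\mathit{fair}} := \bigwedge_{(s,a,s') \in Tr} \big( \Box\Diamond(s \wedge a) \supset \Box\Diamond(s \wedge a \wedge \bigcirc s') \big).$$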

A policy solves the state-action-fair planning problem if every state-action-fair -trace satisfies , written .
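For ultimately periodic traces, state-action fairness depends only on the part that repeats forever, which makes it easy to check. The following is a minimal sketch under that assumption; the function name, the list-of-pairs encoding of the repeating cycle, and the toy transition relation are all hypothetical.

```python
def is_state_action_fair(trans, cycle):
    """Check fairness of a trace of the form prefix . cycle^omega.

    trans: set of (s, a, s2) triples of the domain.
    cycle: non-empty list of (s, a) pairs that repeats forever.
    The finite prefix is irrelevant, so it is not a parameter.
    """
    n = len(cycle)
    # Transitions taken infinitely often: each step of the cycle,
    # with the cycle's last step leading back to its first state.
    taken = {(s, a, cycle[(i + 1) % n][0]) for i, (s, a) in enumerate(cycle)}
    # State-action pairs occurring infinitely often.
    recurring = set(cycle)
    return all(
        (s, a, t) in taken for (s, a, t) in trans if (s, a) in recurring
    )


# Assumed toy domain: "remove" may fail (stay on) or succeed (go off).
TOY = {("on", "remove", "on"), ("on", "remove", "off"), ("off", "put", "on")}
```

A trace that alternates `("on", "remove")` and `("off", "put")` forever is not fair in this sketch, because the failing outcome of `remove` is possible infinitely often but never occurs.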

For a stochastic domain , we write to mean that the probability that an -trace satisfies is equal to , and we say that almost surely enforces . It is known that does not depend on the probabilistic transition function of , but only on its induced nondeterministic domain; indeed, it does not depend on the exact distributions but only on their supports, which are specified by the transition relation of the induced nondeterministic domain, cf. [37]. Hence, we can actually extend this probabilistic notion of enforcing also to nondeterministic domains, as follows. For a nondeterministic domain , we write to mean that where is any stochastic domain whose induced nondeterministic domain is . Thus, for a domain (nondeterministic or stochastic), we say that a policy solves the stochastic-fair planning problem if .

### Connection with Planning for Reachability Goals

The classic goal in planning is reachability of a target, typically represented as a Boolean combination $\mathit{target}$ of fluents, i.e., it can be expressed by the ltl/ltl$_f$ formula $\Diamond\,\mathit{target}$. A policy enforcing $\Diamond\,\mathit{target}$ is known as a strong solution [11] or an acyclic safe solution [23]. A policy enforcing $\Diamond\,\mathit{target}$ assuming state-action fairness is known as a strong cyclic solution [11] or a cyclic safe solution [23].

### Computational Complexity

Planning problems have two inputs: the domain (represented compactly) and the goal (typically represented as a formula). Combined complexity measures the complexity in terms of the size of both inputs, while goal complexity (resp. domain complexity) only measures the complexity in the size of the goal (resp. domain). Formally, we say that the goal complexity is in a complexity class $\mathcal{C}$ if for every domain $D$, the complexity of the problem that takes as input a goal $G$ and decides if there is a solution to the planning problem $\langle D, G \rangle$, is in $\mathcal{C}$; and we say that the goal complexity is hard for $\mathcal{C}$ if there is a domain $D$ such that the complexity of the problem that takes as input a goal $G$ and decides if there is a solution to the planning problem $\langle D, G \rangle$ is $\mathcal{C}$-hard. Similar definitions hold for domain complexity. Such measures were first introduced in database theory [38].

### Automata-theoretic approach to planning

A typical approach for solving planning problems with temporally-extended goals is to use an automata-theoretic approach. Here we recall just enough for our needs in Sections 4 and 5.

A deterministic automaton is a tuple $\mathcal{A} = (\Sigma, Q, q_0, \delta, \mathit{Acc})$ where $\Sigma$ is the input alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\delta : Q \times \Sigma \to Q$ is the transition function, and $\mathit{Acc}$ is the acceptance condition (described later). A (finite or infinite) input word $\alpha = \alpha_0 \alpha_1 \cdots$ determines a run, i.e., the sequence $q_0 q_1 \cdots$ of states starting with the initial state and respecting the transition function, i.e., $q_{i+1} = \delta(q_i, \alpha_i)$ for all $i$. A word is accepted by $\mathcal{A}$ if its run satisfies the acceptance condition $\mathit{Acc}$. There are a variety of different ways to define the acceptance condition. If $\mathcal{A}$ is to accept only finite words, then we typically have $\mathit{Acc} \subseteq Q$; and we say that a finite run satisfies $\mathit{Acc}$ if its last state is in $\mathit{Acc}$. Such an automaton is called a deterministic finite word automaton (DFW). If $\mathcal{A}$ is to accept only infinite words, then there are a number of choices for $\mathit{Acc}$. We will not be concerned with the specific choice until Section 5.

The synchronous product of a domain $D$ and a deterministic automaton $\mathcal{A}$ over the input alphabet $St \times Act$ is a domain, denoted $D \times \mathcal{A}$, whose states are pairs $(s, q)$ where $s$ is a state of $D$ and $q$ is a state of $\mathcal{A}$, and that can transition from state $(s, q)$ to state $(s', q')$ on action $a$ if $(s, a, s') \in Tr$ and the automaton can go from $q$ reading $(s, a)$ to $q'$. Intuitively, $D \times \mathcal{A}$ simulates both $D$ and $\mathcal{A}$ simultaneously. Such products are used in algorithms for planning with ltl/ltl$_f$ goals in Section 4 and Section 5. We remark that the product is sometimes also compactly represented, although the details depend on the context and will not concern us.
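A minimal sketch of the synchronous product on explicit representations follows. It assumes that the automaton reads the current state-action pair at each step and is given as a Python function; the names `sync_product`, `TRANS`, and the step-counting automata are hypothetical.

```python
def sync_product(trans, init, delta, q0):
    """Reachable part of the product of a domain and a deterministic automaton.

    trans: set of (s, a, s2) domain transitions; init: initial domain state.
    delta: function (q, (s, a)) -> q2; q0: initial automaton state.
    Returns the initial product state and the set of product transitions.
    """
    start = (init, q0)
    product_trans, seen, todo = set(), {start}, [start]
    while todo:
        s, q = todo.pop()
        for (p, a, t) in trans:
            if p != s:
                continue
            nxt = (t, delta(q, (s, a)))  # domain and automaton move together
            product_trans.add(((s, q), a, nxt))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, product_trans


# Assumed three-state domain with a single action "a".
TRANS = {("s1", "a", "s2"), ("s1", "a", "s3"), ("s2", "a", "s1"), ("s3", "a", "s1")}

# Automata that only count steps modulo 2 and modulo 4.
start, small = sync_product(TRANS, "s1", lambda q, letter: (q + 1) % 2, 0)
_, large = sync_product(TRANS, "s1", lambda q, letter: (q + 1) % 4, 0)
```

With the modulo-4 counter, the domain state `s1` is split into the product states `("s1", 0)` and `("s1", 2)`. Splitting of this kind matters for language-theoretic fairness, since fairness in the product is stated per product state rather than per domain state.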

## 3 Stochastic Fairness ≢ State-action Fairness

In this section we compare the two notions of fairness in the context of planning. It turns out that they are equivalent for reachability goals, but not for general ltl/ltl$_f$ goals. The equivalence for reachability goals is known, e.g., [35, 23], and is repeated here for completeness.

###### Proposition 1

Let be a (nondeterministic or stochastic) domain and let be a Boolean combination of fluents. The following are equivalent for every finite-state policy :

1. , i.e., the target is reached on state-action fair traces.

2. , i.e., the target is reached with probability one.

Proof.  Assume that . Observe that the state-action fair traces have probability , cf. [37], and thus, by definition, . For the other direction, assume by way of contradiction that holds but doesn’t, and pick an infinite state-action fair -trace that doesn’t satisfy . Let be the finite-state Markov chain induced by and , viewed as a directed graph, and let be the path in induced by . Since is state-action fair, reaches a bottom strongly connected component of , and visits every state in . By the assumption that , contains no state in which holds. Let be some (fixed) prefix of that ends in a state in , and consider the set of infinite -traces whose induced paths have as a prefix. Observe that the probability of is positive, and none of the traces in satisfy . This contradicts .

We now turn to goals expressed as ltl/ltl$_f$ formulas. Unfortunately, in this case the analogue of Proposition 1 does not hold. Indeed, only the forward direction holds.

###### Proposition 2

Let be a domain, and ltl/ltl formula, and a finite-state policy. If then .

Proof.  As in Proposition 1, simply use the fact that the set of infinite state-action fair -traces has probability .

The next proposition shows that the converse of Proposition 2 does not hold. Intuitively, the reason is that, assuming stochastic fairness, every finite trace that is enabled infinitely often appears with probability 1, while assuming state-action fairness, this is only true for traces of length one.

For the next proposition, recall Example 1.

###### Proposition 3

There is a domain , a finite-state policy , and an ltl goal such that , but for no policy does it hold that .

Proof.  Let be the domain from Example 1. Let be the ltl/ltl formula (i.e., eventually and two steps afterwards again). There is only one policy available: it always chooses the action . Observe that , but that as witnessed by the trace .
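The gap between the two notions can be illustrated on a small hypothetical domain in the spirit of this proof. The state names, the transition relation, the periodic trace, and the goal "eventually `s2`, and `s2` again two steps later" are assumptions chosen for illustration and are not claimed to be the exact objects of Example 1. The sketch checks that one periodic trace is state-action fair yet violates the goal, and computes a simple bound showing that, under any fixed positive outcome probability `p`, the goal is missed with probability tending to zero.

```python
# Assumed domain: one action "a"; s1 branches to s2 or s3, both return to s1.
TRANS = {("s1", "a", "s2"), ("s1", "a", "s3"), ("s2", "a", "s1"), ("s3", "a", "s1")}
CYCLE = ["s1", "s2", "s1", "s3"]  # the trace (s1 s2 s1 s3)^omega


def fair(cycle):
    """State-action fairness of cycle^omega (the only action is 'a')."""
    n = len(cycle)
    taken = {(s, "a", cycle[(i + 1) % n]) for i, s in enumerate(cycle)}
    return all(t in taken for t in TRANS if t[0] in cycle)


def goal_on_cycle(cycle, target="s2"):
    """Does cycle^omega satisfy: eventually (target and, two steps later, target)?"""
    n = len(cycle)
    return any(
        cycle[i] == target and cycle[(i + 2) % n] == target for i in range(n)
    )


def miss_bound(p, windows):
    """Upper bound on the probability of missing the goal.

    p is the assumed probability of outcome s2 from s1. Each disjoint
    four-step window starting in s1 shows s2, s1, s2 with probability p * p,
    independently of the other windows.
    """
    return (1 - p * p) ** windows
```

Here `fair(CYCLE)` holds while `goal_on_cycle(CYCLE)` does not, so the fair trace defeats the goal, whereas `miss_bound(0.5, 50)` is already below one in a million.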

## 4 Confusion in the literature

Certain algorithms in the literature for solving state-action fair planning problems with temporally extended goals rely on a reduction to another state-action fair planning problem that, as we prove, is complete but not sound. The papers, in order of publication, are [28][Theorem ], [15][Theorem ] and [10][Theorem ]. If one assumes stochastic fairness instead of state-action fairness, then the reduction is both sound and complete. This suggests that the cited algorithms are correct if one assumes stochastic fairness instead of state-action fairness.

### The reduction

We begin by describing the reduction without any mention of fairness. From a planning problem , first define a deterministic automaton that recognizes exactly the traces that satisfy . Second, define the domain as the synchronous product of and . Finally, define the planning problem where is a goal that captures the acceptance condition of , i.e., consists of those traces of whose first components are traces of that are accepted by .5

### Analysis of the reduction

If this reduction is to be used to give an exact algorithm for planning assuming state-action fairness, it should be sound and complete, i.e., is solvable assuming state-action fairness iff is solvable assuming state-action fairness. The reduction is indeed complete because every state-action fair trace in the product domain projects to a state-action fair trace in (this follows immediately from the definition of state-action fairness and of the synchronous product). On the other hand, the reduction is not sound because there may be fair traces in that do not induce any fair trace in (intuitively, this is due to synchronization in between the domain and the automaton ). We formalise this in the following theorem which actually shows that the reduction is not sound no matter which deterministic automaton for is used.

###### Theorem 1

There is a domain , and an ltl/ltl goal , s.t. a) there is no solution to the state-action fair planning problem , but b) for every deterministic automaton accepting exactly the traces that satisfy , there is a solution to the state-action fair planning problem , where is the product of and , and captures the acceptance condition of .

Proof.  Let (resp. ) be the domain (resp. trace) from Example 1, let be the formula , and observe that all traces of satisfy except for the trace .

There is a single policy available in , i.e., always perform the single applicable action. However, is a state-action-fair -trace that does not satisfy . Thus, there is no solution to the state-action fair problem .

We claim that the single policy available in is a solution to . For this, it is enough to show that every state-action fair trace in induces in a trace that satisfies , i.e., a trace other than . Let be a trace in that induces . To see that is not state-action fair, let be a state that appears in infinitely often after a state of the form . Note that never appears as a source of a transition to a state of the form . Indeed, since occurs on exactly every four steps, the source of such a transition is only reached three steps after reading an ; and while reading , is always in a different state than three steps after reading an (so not to confuse occurrences of four steps after an with ones two steps after it). Thus, some successor of is enabled infinitely often but never taken.

We note, however, that if one uses stochastic fairness instead of state-action fairness then the reduction above is sound and complete. This is because stochastic-fairness is preserved by taking a product with a deterministic automaton, a fact which is exploited in the automata-theoretic approach to verification of probabilistic systems [39, 12, 6, 7]:

###### Theorem 2

Let be a planning problem, and let be a planning problem constructed as in the reduction above. There is a policy solving assuming stochastic fairness iff there is a policy solving assuming stochastic fairness.

In summary, we conjecture that some errors in the proofs and algorithms for state-action fair planning in the literature arise from the mistaken intuition that state-action fairness always behaves like stochastic fairness, which it does not in the presence of even simple ltl/ltl$_f$ formulas (that are not reachability formulas).

## 5 Algorithm for State-action Fair Planning

In the previous section we showed that some algorithms in the literature for state-action fair planning for temporally extended goals use complete but unsound reductions. In this section, we provide a sound and complete reduction to the problem of solving Rabin games (defined below).

###### Theorem 3

The combined (and thus goal) complexity of solving planning with ltl/ltl$_f$ goals assuming state-action fairness is in 2exptime, and the domain complexity is in nexptime (in the size of a compactly represented domain).

The main approach to solving such a problem is to use, explicitly or implicitly, an automata-theoretic approach. However, as we now remark, naive applications of this approach yield a 3exptime domain complexity (which we then show how to lower to nexptime), a 3exptime combined complexity (which we then show how to lower to 2exptime), and a 2exptime goal complexity.

###### Remark 1

The problem of solving the state-action fair planning problem for a domain $D$ and a goal $\psi$, where $\psi$ is an ltl/ltl$_f$ formula, is equivalent to solving the planning problem for $D$ with the goal $\phi_{D,\mathit{fair}} \supset \psi$, where

$$\phi_{D,\mathit{fair}} := \bigwedge_{(s,a,s') \in Tr} \big( \Box\Diamond(s \wedge a) \supset \Box\Diamond(s \wedge a \wedge \bigcirc s') \big)$$

is an ltl formula expressing state-action fairness in the domain $D$ (for more on this equivalence see [2]). However, the size of $\phi_{D,\mathit{fair}}$ is exponential in the size of $D$ (compactly represented), since it has one conjunct per transition. Thus, we have reduced the problem to solving planning for an ltl goal of size exponential in the size of $D$ and linear in the size of $\psi$. In turn, there are algorithms that solve planning with ltl goals (no fairness assumptions) that run in exptime in the size of the domain and 2exptime in the size of the goal [2, 9]. Putting this together results in an algorithm for the state-action-fair planning problem that runs in 3exptime in the size of the domain and 2exptime in the size of the formula $\psi$.

The main insight that achieves the complexities in Theorem 3 is that one should use Rabin conditions.

A Rabin condition over a set $X$ is a set $R$ of pairs of the form $(G, B)$ with $G, B \subseteq X$. The pairs are called Rabin pairs. An infinite sequence $\pi$ over the alphabet $X$ is said to satisfy the Rabin condition $R$ if there is a pair $(G, B) \in R$ such that some $x \in G$ appears infinitely often in $\pi$ and no $x \in B$ appears infinitely often in $\pi$. Below we use Rabin conditions in two ways: as acceptance conditions (for automata) and as winning conditions (in games).
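On an ultimately periodic sequence the elements that appear infinitely often are exactly those on the repeating cycle, so the Rabin condition reduces to two set intersections per pair. A minimal sketch, with hypothetical names and with each pair written as (good set, bad set):

```python
def satisfies_rabin(pairs, cycle):
    """Check a Rabin condition on a sequence of the form prefix . cycle^omega.

    pairs: iterable of (good, bad) sets.
    cycle: non-empty list of elements repeated forever.
    """
    recurring = set(cycle)  # elements appearing infinitely often
    return any(
        bool(recurring & good) and not (recurring & bad)
        for good, bad in pairs
    )


ONE_PAIR = [({"g"}, {"b"})]
TWO_PAIRS = [({"g"}, {"b"}), ({"b"}, set())]
```

For instance, the cycle `["g", "b"]` violates `ONE_PAIR` because `b` recurs, but satisfies `TWO_PAIRS` through its second pair, which illustrates that a Rabin condition is a disjunction over its pairs.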

### Rabin Automata

A Deterministic Rabin Word (DRW) automaton is an automaton $\mathcal{A} = (\Sigma, Q, q_0, \delta, R)$ where the acceptance condition $R$ is a Rabin condition over $Q$. The size of a DRW is the number of its states and its index is the number of pairs in $R$.

The reader may be wondering why we chose the Rabin acceptance condition instead of some other acceptance condition. There are three reasons: Rabin conditions can capture very general properties, including ltl/ltl$_f$; they are naturally closed under union; and they can naturally express that a trace is not state-action fair.

###### Theorem 4

[cf. [41]] Given an ltl/ltl$_f$ formula $\psi$ one can build a DRW $\mathcal{A}_\psi$ that accepts exactly the infinite traces satisfying $\psi$. Moreover, $\mathcal{A}_\psi$ has size 2exp and index 1exp in $|\psi|$.

###### Lemma 1

Given a domain one can build a DRW that accepts exactly the infinite traces of that are not state-action fair. Moreover, has size and index exp in the size of (compactly represented).

To see this, let the states of the DRW store the last state-action and last state-action-state of , and the Rabin pairs are of the form for .

###### Lemma 2

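Given DRWs $\mathcal{A}_1, \mathcal{A}_2$ one can build a DRW $\mathcal{A}$, denoted $\mathcal{A}_1 \vee \mathcal{A}_2$, that accepts the words accepted by $\mathcal{A}_1$ or $\mathcal{A}_2$. The size of $\mathcal{A}$ is the product of the sizes of the $\mathcal{A}_i$s, and the index of $\mathcal{A}$ is the sum of the indices of the $\mathcal{A}_i$s.

To see this, if $\mathcal{A}_i = (\Sigma, Q_i, q_i, \delta_i, R_i)$ define $\mathcal{A} = (\Sigma, Q_1 \times Q_2, (q_1, q_2), \delta, R)$ where $\delta((p_1, p_2), x) = (\delta_1(p_1, x), \delta_2(p_2, x))$, and $R$ consists of all pairs of the form $(G \times Q_2, B \times Q_2)$ for $(G, B) \in R_1$ and all pairs of the form $(Q_1 \times G, Q_1 \times B)$ for $(G, B) \in R_2$.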
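The acceptance part of this union construction can be sketched directly: each Rabin pair is lifted to the product state space by leaving the other component unconstrained. The function name and the tiny example automata are hypothetical, and only the Rabin pairs (not the product transition function) are built here.

```python
def lift_rabin_pairs(states1, pairs1, states2, pairs2):
    """Rabin pairs of a union automaton over states1 x states2.

    pairs1 and pairs2 are lists of (good, bad) sets over states1 and
    states2 respectively. The result has len(pairs1) + len(pairs2) pairs.
    """
    lifted = []
    for good, bad in pairs1:
        lifted.append((
            {(p, q) for p in good for q in states2},
            {(p, q) for p in bad for q in states2},
        ))
    for good, bad in pairs2:
        lifted.append((
            {(p, q) for p in states1 for q in good},
            {(p, q) for p in states1 for q in bad},
        ))
    return lifted


pairs = lift_rabin_pairs({0, 1}, [({1}, set())], {"x", "y"}, [({"x"}, {"y"})])
```

The number of lifted pairs is the sum of the two indices, matching the index bound stated in Lemma 2.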

### Rabin Games

The other use for the Rabin condition is to give winning conditions in games. A Rabin game is an explicitly represented planning problem whose goal is expressed as a Rabin condition over the set of states.

###### Theorem 5

[8, 17] There is an algorithm that solves Rabin games in time where is the number of Rabin pairs, is the number of states, and is the number of transitions. In addition, solving Rabin games is np-complete (in the size of the explicit representation).

### Reduction and Algorithm

We can now describe the algorithm promised in Theorem 3. Given a state-action fair planning problem , reduce it to the problem of solving the Rabin game constructed as follows. The arena is defined as the synchronous product of the domain , explicitly represented, and the DRW . The Rabin winning condition is induced by the Rabin acceptance condition of , i.e., consists of all pairs of the form for .

This completes the description of the reduction. To see that it is sound and complete, simply note that a policy solves the state-action fair planning problem iff every fair -trace in is accepted by the DRW iff every trace (fair and not-fair) in generated by the strategy that maps to the action satisfies the Rabin condition . The first iff is due to Theorem 4, and the second iff follows from the definition of Rabin condition and of the synchronous product.

For the complexity analysis, simply note that the DRW has size exp in the size of (compactly represented) and exp in , and index exp in (compactly represented) and exp in . Now apply Theorem 5 to get the stated goal, combined, and domain complexities.

## 6 Lower bounds for state-action fair planning

We showed that state-action fair planning for temporally extended goals is in 2exptime for combined complexity and goal complexity, and in nexptime for domain complexity. In this section we study lower bounds for the problem and show that we can match the 2exptime goal complexity (with a technique that also supplies new proofs of the 2exptime-hardness of the goal complexity for the cases of no-fairness and stochastic fairness). For domain complexity, we observe that existing results show the problem is exptime-hard. This leaves open whether the domain complexity can be lowered from nexptime to exptime.

### Domain-complexity

It is not hard to establish an exptime lower bound for the domain complexity. Indeed, one can reduce from the problem of stochastic-fair planning with reachability goals, which is known to be exptime-complete [25, 35]. To do so, introduce a fresh fluent and fix the goal . Then, for a stochastic-fair planning problem with domain and reachability goal , build a new domain from by adding the fluent and a new action with precondition and postcondition . Then the stochastic-fair problem has a solution iff the stochastic-fair problem has a solution. Moreover, the latter holds iff it has a finite state solution. By Proposition 1, this is equivalent to the fact that the state-action fair problem has a solution.

### Goal complexity

The contribution of this section is a proof of the following theorem.

###### Theorem 6

The goal complexity (and therefore, also the combined complexity) of planning for ltl/ltl$_f$ goals assuming state-action fairness is 2exptime-hard.

Inspired by [12], we provide a polynomial-time construction that, given an alternating expspace Turing machine and an input word , produces a probabilistic domain (explicitly represented) and an ltl formula such that accepts iff . Note that to handle the goal complexity, the domain will be independent of and .

Notation. An alternating Turing machine is a tuple where is the set of states partitioned into and (called the existential and universal modes), is the tape-alphabet, is the transition relation, and is the initial state, are the accepting and rejecting states. A configuration is a string matching the expression ; it is initial (resp. accepting, rejecting) if the state is (resp. ). A computation of is a sequence of configurations, starting in an initial configuration, respecting the transition relation, and ending in an accepting or rejecting state. Wlog, we assume that the existential and universal modes of strictly alternate, with the existential going first.

Say $M$ runs in space $2^{p(|x|)}$ for some polynomial $p$. In particular, a configuration of $M$ running on $x$ has length at most $2^{p(|x|)}$. Let $n = p(|x|)$. Intuitively, the domain ensures that the agent and the environment generate strings of the form

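$$C_0 \cdot (\# \cdot T_1 \cdot \#' \cdot C_1 \cdot \#'' \cdot K_1) \cdot (\# \cdot T_2 \cdot \#' \cdot C_2 \cdot \#'' \cdot K_2) \cdots (\# \cdot T_j \cdot \#' \cdot C_j \cdot \#'' \cdot K_j) \cdot \# \cdot \bot \cdot \bot \cdot \bot \cdots$$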

where the $C_i$, the $T_i$ and the $K_i$ are arbitrary strings over fixed alphabets, and $\bot$ is a special symbol. Intuitively, the $C_i$ will encode configurations of $M$ and are generated by the agent, the $T_i$ will encode transitions of $M$ and are generated by the agent for odd $i$ and by the environment for even $i$, and the $K_i$ are generated by the environment and encode a position (index) on the tape that the environment wants to check. Finally, $\bot$ holds in a sink of the domain that the agent can go to when it is done. Note that this allows the agent to never go to the sink, but such traces will be rejected by the goal formula.

The ltl goal is an implication. Intuitively, it enforces that as long as the environment encodes its parts correctly (the antecedent holds), then so does the agent, and the accepting state is reached (the consequent holds). Its subformulas say, respectively: that each $C_i$ encodes a configuration, with $C_0$ encoding the initial configuration; that each $T_i$ with $i$ odd (resp. even) encodes a transition of $M$; that each $K_i$ encodes a number in $[0, 2^n)$; and that an accepting configuration is reached. A further subformula (think of it as a “challenge”) is used to check that the $K_i$th letter in the configuration encoded by $C_i$ is the result of applying transition $T_i$ to the configuration $C_{i-1}$.

Intuitively, since the environment is adversarial, all possible positions will be challenged and all universal transitions will be taken. Thus, the agent will be able to enforce the goal iff $M$ accepts $x$. We now provide details on how to write the subformulas of the goal.

Choose a sufficiently large integer $m$ to encode all members of $\Sigma$ and $\Sigma \times Q$ as binary strings of length exactly $m$. Let Sym denote the set of binary strings of length $m$ that encode either a tape-letter or a tape-letter/state pair. Let $\langle i \rangle$ denote the binary string of length $n$ whose numeric value is $i$. The possible configurations of $M$ are encoded by strings made of consecutive blocks, each consisting of a number $\langle i \rangle$ followed by a symbol from Sym, in which exactly one symbol encodes a tape-letter/state pair. The reason for the numbers $\langle i \rangle$ is that they allow the formula to check whether an encoding of one configuration can be reached in one step of $M$ from an encoding of another. We call the substring holding $\langle i \rangle$ and its symbol the $i$th block, where $\langle i \rangle$ is the block number and the symbol is the block symbol.

One can write an ltl formula conf, of size linear in $n$ and the size of $M$, that enforces this structure. Indeed, using a standard encoding of the binary counter on $n$-bit strings, the formula says that exactly one symbol encodes a tape-letter/state pair, and all the other symbols encode just tape-letters. It also says that the first block number is $\langle 0 \rangle$, that the last block number is $\langle 2^n - 1 \rangle$, and that for every $i \in [0, n)$, the $i$th bit in a block is flipped in the next block iff all bits strictly lower in this block are $1$s. Thus, the formula stating that every $C_i$ encodes a configuration can be defined from conf and a formula init that encodes the initial configuration (which can be hard-coded by a polynomial-sized formula by explicitly specifying the first $|x|$ blocks, and that the rest of the blocks in the configuration contain the encoding of the blank tape symbol). Writing linearly-sized ltl formulas for the remaining requirements, other than the challenge, poses no particular problem.
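The counter rule used by conf (a bit flips iff all strictly lower bits are 1) is ordinary binary increment. The Python sketch below checks this and builds a numbered-block encoding of a configuration. It works on Python tuples rather than on the bit-level strings used by the formula, and the function names, the blank symbol `"_"`, and the representation of a tape-letter/state pair as a 2-tuple are assumptions made for illustration.

```python
def block_number(i, n):
    """n-bit binary string (most significant bit first) whose value is i."""
    return format(i, "0{}b".format(n))


def next_block_number(bits):
    """Apply the flip rule: bit i flips iff all strictly lower bits are 1."""
    lsb_first = bits[::-1]
    flipped = []
    for i, bit in enumerate(lsb_first):
        lower_all_ones = all(c == "1" for c in lsb_first[:i])
        flipped.append(("0" if bit == "1" else "1") if lower_all_ones else bit)
    return "".join(flipped)[::-1]


def encode_configuration(cells, n, blank="_"):
    """Pair every tape cell with its block number, padding with blanks to 2**n blocks."""
    if len(cells) > 2 ** n:
        raise ValueError("configuration does not fit in 2**n cells")
    padded = list(cells) + [blank] * (2 ** n - len(cells))
    return [(block_number(i, n), cell) for i, cell in enumerate(padded)]


def is_well_formed(blocks, n):
    """Check the structure described above on a list of (number, symbol) blocks."""
    numbers = [number for number, _ in blocks]
    heads = sum(1 for _, cell in blocks if isinstance(cell, tuple))
    return (
        len(blocks) == 2 ** n
        and numbers[0] == "0" * n
        and numbers[-1] == "1" * n
        and all(next_block_number(a) == b for a, b in zip(numbers, numbers[1:]))
        and heads == 1  # exactly one tape-letter/state pair
    )
```

The point of the flip rule is that it is local: it relates a bit to the same bit one block later, which is what lets a formula of size linear in $n$ enforce a counter with $2^n$ values.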

It remains to show how to build the challenge formula. It will be the conjunction of two formulas: the first handles the first challenge, and the second handles all the rest. We now show how to build the second (the first is similar). It is defined as:

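$$\Box \bigwedge_{(x, y, z, t, y')} \big[ (\mathit{cha} \wedge \mathit{cur}_x \wedge \mathit{nx}_y \wedge \mathit{nxnx}_z \wedge \mathit{tr}_t) \supset \mathit{img}_{y'} \big]$$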

where the conjunction is over tuples $(x, y, z, t, y')$ such that applying the transition $t$ to the triple $(x, y, z)$ of tape-contents (including a possible state) results in the tape content $y'$ of the middle cell (e.g., if $t$ writes $b$ and moves the head to the right, then $y' = b$ when $y$ holds the head, and $y'$ is $y$ paired with the new state when $x$ holds the head, etc.). Intuitively, cha expresses that we are currently at the start of a block of a configuration, say $C_i$, whose number is one less than the challenge number encoded by $K_{i+1}$; the formula $\mathit{cur}_x$ expresses that the symbol in the current block is $x$; the formula $\mathit{nx}_y$ expresses that the symbol in the next block is $y$; the formula $\mathit{nxnx}_z$ expresses that the symbol in the block after that is $z$; the formula $\mathit{tr}_t$ says that $T_{i+1}$ encodes the transition $t$; and the formula $\mathit{img}_{y'}$ says that the block whose number is encoded by $K_{i+1}$ in the configuration $C_{i+1}$ has symbol $y'$.
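The tuples ranged over by the conjunction come from the usual local update rule of a Turing machine: the new content of a cell depends only on that cell, its two neighbours, and the transition taken. A Python sketch of that rule follows, under assumptions that are ours rather than the paper's: a transition is a 5-tuple `(q, a, q2, b, move)` meaning "in state `q` reading `a`, write `b`, move `L` or `R`, and enter `q2`"; a cell is either a tape letter or a `(letter, state)` pair; and `None` is returned when the transition does not match the head found in the window.

```python
def has_head(cell):
    """A cell holds the head iff it is a (letter, state) pair."""
    return isinstance(cell, tuple)


def middle_image(x, y, z, t):
    """New content of the middle cell after applying t to the window (x, y, z)."""
    q, a, q2, b, move = t
    if has_head(y):
        # The head is on the middle cell: it writes b and leaves.
        return b if y == (a, q) else None
    if has_head(x):
        # The head is on the left neighbour: it arrives iff it moves right.
        if x != (a, q):
            return None
        return (y, q2) if move == "R" else y
    if has_head(z):
        # The head is on the right neighbour: it arrives iff it moves left.
        if z != (a, q):
            return None
        return (y, q2) if move == "L" else y
    # The head is elsewhere: the middle cell is unchanged.
    return y
```

Enumerating all windows and transitions with this function yields a set of tuples whose size depends on the machine but not on $n$, which is consistent with the conjunction being polynomial in the size of the machine.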

We use the following shorthand, that can scan the string for patterns: define and . Intuitively, means holds one step after the th occurrence of .

Formally, define cha as:

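$$\% \wedge \Big[ (\%)\, \mathsf{J}^{2} \Big( \bigwedge_{i \in [0, n)} \bigwedge_{b \in \{0, 1\}} \big[ \bigcirc^{i} b \iff \#''\, \mathsf{J}^{2} \bigcirc^{i} b \big] \Big) \Big].$$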

Define as: where encodes the symbol , and define and similarly. Define as: where encodes the symbol . Define as:

 [#′]J1[(match⊃◯n∧i:0≤i

where encodes the symbol , and match is

$$\bigwedge_{i \in [0, n)} \bigwedge_{b \in \{0, 1\}} \big( \bigcirc^{i} b \iff (\#'')\, \mathsf{J}^{1} (\bigcirc^{i} b) \big).$$

Intuitively, it says that in the next configuration, if a block number equals the challenge number, then the block symbol should be $y'$. This completes the construction of the goal formula, and with it the proof for the case of no fairness.

For stochastic and state-action fairness, observe that (a) if the machine accepts its input then, already with no fairness assumptions, there is a solution, and (b) if the machine rejects its input, then for every policy, the environment can, within a finite number of steps, prevent any hope of satisfying the goal: either by exposing that the agent is cheating in the simulation, or by reaching a rejecting configuration. Since every finite trace of the policy can be extended to a fair infinite trace of the policy, the policy is not a solution to the state-action fair planning problem. Nor is it a solution to the stochastic-fair planning problem, since the set of infinite traces of the policy that extend this finite trace has positive probability.
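The fairness notion used in this argument can be checked mechanically on ultimately periodic traces, where only the repeated part matters for what happens infinitely often. The Python sketch below tests state-action fairness of such a trace. The function name and the representation (a loop given as a list of `(state, action)` pairs, and `outcomes` mapping each pair to its set of possible successor states) are assumptions for illustration.

```python
def is_state_action_fair(loop, outcomes):
    """Check state-action fairness of the infinite trace that repeats `loop` forever.

    loop     : non-empty list of (state, action) pairs; the successor of the
               last pair is the state of the first pair
    outcomes : dict mapping (state, action) to the set of possible next states
    """
    seen = {}
    for k, (state, action) in enumerate(loop):
        next_state = loop[(k + 1) % len(loop)][0]
        seen.setdefault((state, action), set()).add(next_state)
    # Every pair taken infinitely often must show all of its outcomes in the loop.
    return all(seen[pair] == set(outcomes[pair]) for pair in seen)
```

For instance, a loop that retries an action forever while only ever observing its failing outcome is rejected by this check, so a policy need not succeed on it.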

## 7 Related Work and Discussion

We have discussed how the distinction between stochastic fairness and state-action fairness has so far been missing from the planning/AI literature. On the other hand, as we now discuss, this distinction is present in the verification literature.

### Related work in verification

Early work in verification was motivated by the problem of providing formal methods (such as proof-systems or model-checking algorithms) to reason about probabilistic concurrent systems. As such, some effort was made to abstract probabilities and capture stochastic fairness by language-theoretic properties. In fact, sophisticated forms of language-theoretic fairness were introduced to do this [31, 4], since simple language-theoretic notions (similar to state-action fairness) were known not to capture stochastic fairness [33].

A comprehensive study of fairness in reactive systems is provided in [42] where fairness is characterized language-theoretically, game-theoretically, topologically, and probabilistically. Fairness is used in verification of concurrent systems in order to prove liveness properties, i.e., that something good will eventually happen. The limitations of fairness for proving liveness properties, as well as ways to overcome these limitations, are analysed in [36].

The verification literature on probabilistic concurrent programs typically considers policies as schedulers. In particular, the central decision problem there is different from the planning problem: it asks whether every (rather than some) policy almost-surely enforces the temporally-extended goal [39, 31, 6, 12]. Just as we used Rabin conditions to capture state-action unfair traces, one can use the dual Streett condition to capture stochastically-fair traces [39].
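The duality between Rabin and Streett conditions can be stated in a few lines. The Python sketch below uses one common convention, which is an assumption here since conventions vary between papers: a Rabin pair `(avoid, visit)` is satisfied when the set of states visited infinitely often misses `avoid` and meets `visit`; the Rabin condition holds when some pair is satisfied, and the Streett condition over the same pairs is its negation.

```python
def rabin_accepts(inf_states, pairs):
    """Some pair (avoid, visit) has avoid seen finitely often and visit infinitely often."""
    inf_states = set(inf_states)
    return any(
        not (inf_states & set(avoid)) and bool(inf_states & set(visit))
        for avoid, visit in pairs
    )


def streett_accepts(inf_states, pairs):
    """Every pair (avoid, visit): if visit is seen infinitely often, so is avoid."""
    inf_states = set(inf_states)
    return all(
        bool(inf_states & set(avoid)) or not (inf_states & set(visit))
        for avoid, visit in pairs
    )
```

On any set of infinitely visited states, exactly one of the two functions returns true for the same list of pairs.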

Generalizations of strong-cyclic solutions, other than those that use state-action fairness, have been studied using automata theory. For instance, [30] consider a policy to be a solution if every finite trace of the policy can be extended to an infinite trace of the policy satisfying the given ltl formula. However, such a notion is different from a solution assuming state-action fairness (the example in Proposition 3 shows this). Also, [40] studies a variation of ltl synthesis assuming a fair scheduler, where the transitions over a given set of states are assumed to be implicitly encoded in ltl.

### Discussion

While stochastic fairness admits well-behaved algorithms, it is not clear that language-theoretic fairness does. For the moment, even for the case of ltlf goals, our algorithm (Section 5) requires automata over infinite traces to deal with state-action fairness, which itself is a property of infinite traces. Unfortunately, algorithms for automata over infinite traces are not as easy to implement as those for finite traces [19]. Nonetheless, we hope that our new algorithm, which suggests the importance of planning for Rabin goals, spurs the planning community to devise translations and heuristics for solving these.

### Footnotes

1. In this paper we assume there is no uncertainty about the current state of the system, i.e., environments are fully observable.
2. Although reinforcement-learning also makes a similar assumption, it is out of the scope of this work which focuses on nondeterminism in model-based control and in planning in particular.
3. In the language-theoretic setting, the policy need not succeed on traces that do not satisfy the fairness property, while in the stochastic setting the policy need not succeed on any set of traces whose probability measure is zero.
4. For a finite string $u$, we write $u^\omega$ for the infinite string $uuu\cdots$.
5. The representation of is induced by the acceptance condition of , but here the specific representation of is not relevant.
7. Recall that we define that an infinite trace satisfies an ltlf formula if some prefix of it satisfies the formula.
8. The result is stated in [15] for ltlf but with an incorrect proof. The error there is concluding that every trace of a policy visits each state at most once. This is true for memoryless strategies, but need not be true for other strategies which might be required when planning for temporally extended goals.

### References

1. R. Alford, U. Kuter, D. S. Nau and R. P. Goldman (2014) Plan aggregation for strong cyclic planning in nondeterministic domains. Artif. Intell. 216, pp. 206–232. Cited by: §1.
2. B. Aminof, G. De Giacomo, A. Murano and S. Rubin (2019) Planning under LTL environment specifications. In ICAPS, Cited by: Remark 1.
3. F. Bacchus and F. Kabanza (2000) Using temporal logics to express search control knowledge for planning. Artif. Intell. 116 (1-2), pp. 123–191. Cited by: §2, §2.
4. C. Baier and M. Z. Kwiatkowska (1998) On the verification of qualitative properties of probabilistic processes under fairness constraints. Inf. Process. Lett. 66 (2), pp. 71–79. Cited by: §7.
5. J. A. Baier and S. A. McIlraith (2006) Planning with first-order temporally extended goals using heuristic search. In AAAI, Cited by: §2.
6. A. Bianco and L. de Alfaro (1995) Model checking of probabilistic and nondeterministic systems. In FSTTCS, Cited by: §4, §7.
7. B. Bollig and M. Leucker (2004) Verifying qualitative properties of probabilistic programs. In Validation of Stochastic Systems, pp. 124–146. Cited by: §4.
8. N. Buhrke, H. Lescow and J. Vöge (1996) Strategy construction in infinite games with Streett and Rabin chain winning conditions. In TACAS, Cited by: Theorem 5.
9. A. Camacho, M. Bienvenu and S. A. McIlraith (2019) Towards a unified view of AI planning and reactive synthesis. In ICAPS, Cited by: Remark 1.
10. A. Camacho and S. A. McIlraith (2019) Strong fully observable non-deterministic planning with LTL and LTLf goals. In IJCAI, Cited by: §4.
11. A. Cimatti, M. Pistore, M. Roveri and P. Traverso (2003) Weak, strong, and strong cyclic planning via symbolic model checking. Artif. Intell. 147 (1-2). Cited by: §1, §2.
12. C. Courcoubetis and M. Yannakakis (1995) The complexity of probabilistic verification. J. ACM 42 (4), pp. 857–907. Cited by: §4, §6, §7.
13. N. D’Ippolito, N. Rodríguez and S. Sardiña (2018) Fully observable non-deterministic planning as assumption-based reactive synthesis. J. Artif. Intell. Res. 61, pp. 593–621. Cited by: §1, §1.
14. M. Daniele, P. Traverso and M. Y. Vardi (1999) Strong cyclic planning revisited. In ECP, Cited by: §1.
15. G. De Giacomo and S. Rubin (2018) Automata-theoretic foundations of FOND planning for LTLf and LDLf goals. In IJCAI, Cited by: §4, footnote 8.
16. G. De Giacomo and M. Y. Vardi (2013) Linear temporal logic and linear dynamic logic on finite traces. In IJCAI, Cited by: §2.
17. E. A. Emerson and C. S. Jutla (1988) The complexity of tree automata and logics of programs (extended abstract). In FOCS, Cited by: Theorem 5.
18. G. E. Fainekos, H. Kress-Gazit and G. J. Pappas (2005) Temporal logic motion planning for mobile robots. In ICRA, pp. 2020–2025. Cited by: §2.
19. S. Fogarty, O. Kupferman, M. Y. Vardi and T. Wilke (2013) Profile trees for Büchi word automata, with application to determinization. In GandALF, Cited by: §7.
20. J. Fu, A. C. Jaramillo, V. Ng, F. B. Bastani and I. Yen (2016) Fast strong planning for fully observable nondeterministic planning problems. Ann. Math. Artif. Intell. 78 (2), pp. 131–155. Cited by: §1.
21. H. Geffner and B. Bonet (2013) A concise introduction to models and methods for automated planning. Morgan & Claypool. Cited by: §1.
22. T. Geffner and H. Geffner (2018) Compact policies for fully observable non-deterministic planning as SAT. In ICAPS, Cited by: §1.
23. M. Ghallab, D. S. Nau and P. Traverso (2016) Automated planning and acting. Cambridge. ISBN 978-1-10703-727-4. Cited by: §1, §2, §3.
24. P. Kissmann and S. Edelkamp (2011) Gamer, a general game playing agent. Künst. Intell. 25 (1), pp. 49–52. Cited by: §1.
25. M. L. Littman (1997) Probabilistic propositional planning: representations and complexity. In AAAI, Cited by: §6.
26. R. Mattmüller, M. Ortlieb, M. Helmert and P. Bercher (2010) Pattern database heuristics for fully observable nondeterministic planning. In ICAPS, Cited by: §1.
27. C. J. Muise, S. A. McIlraith and J. C. Beck (2012) Improved non-deterministic planning by exploiting state relevance. In ICAPS, Cited by: §1.
28. F. Patrizi, N. Lipovetzky and H. Geffner (2013) Fair LTL synthesis for non-deterministic systems using strong cyclic planners. In IJCAI, Cited by: §4.
29. M. Pistore and P. Traverso (2001) Planning as model checking for extended goals in non-deterministic domains. In IJCAI, Cited by: §1.
30. M. Pistore and M. Y. Vardi (2007) The planning spectrum - one, two, three, infinity. J. Artif. Intell. Res. 30, pp. 101–132. Cited by: §7.
31. A. Pnueli and L. D. Zuck (1993) Probabilistic verification. Inf. Comput. 103 (1), pp. 1–29. Cited by: §7, §7.
32. A. Pnueli (1977) The temporal logic of programs. In FOCS, Cited by: §2.
33. A. Pnueli (1983) On the extremely fair treatment of probabilistic algorithms. In STOC, Cited by: §7.
34. M. Ramírez and S. Sardiña (2014) Directed fixed-point regression-based planning for non-deterministic domains. In ICAPS, Cited by: §1.
35. J. Rintanen (2004) Complexity of planning with partial observability. In ICAPS, Cited by: §3, §6.
36. R. van Glabbeek and P. Höfner (2019) Progress, justness, and fairness. ACM Comput. Surv. 52 (4), pp. 69:1–69:38. Cited by: §7.
37. M. Y. Vardi and P. Wolper (1986) An automata-theoretic approach to automatic program verification. In LICS, Cited by: §2, §3.
38. M. Y. Vardi (1982) The complexity of relational query languages. In STOC, Cited by: §2.
39. M. Y. Vardi (1985) Automatic verification of probabilistic concurrent finite-state programs. In FOCS, Cited by: §2, §4, §7.
40. M. Y. Vardi (1995) An automata-theoretic approach to fair realizability and synthesis. In CAV, Cited by: §7.
41. M. Y. Vardi (1995) An automata-theoretic approach to linear temporal logic. In Logics for Concurrency – Structure versus Automata (Banff), LNCS, Vol. 1043. Cited by: Theorem 4.
42. H. Völzer and D. Varacca (2012) Defining fairness in reactive and concurrent systems. J. ACM 59 (3), pp. 13:1–13:37. Cited by: §7.