A short version of partial results has appeared in the Proceedings of the International Conference on Information Fusion, July 2020.

# Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems

## Abstract

This paper presents an inverse reinforcement learning (IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function; we then construct set-valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. To illustrate our IRL scheme, we consider two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities. We also discuss how to identify $\epsilon$-optimal Bayesian decision makers and perform IRL.

**Keywords:** Inverse Reinforcement Learning, Bayesian Revealed Preferences, Stopping Time Problems, Inverse Detection, Sequential Hypothesis Testing, Bayesian Search, Hypothesis Testing, Finite Sample Complexity

**Acknowledgments**

This research was funded in part by the U.S. Army Research Office under grant W911NF-19-1-0365, the National Science Foundation under grant 1714180, and the Air Force Office of Scientific Research under grant FA9550-18-1-0007.

## 1 Introduction

In a stopping time problem, an agent observes a random variable (state of nature) sequentially over time. Based on the observation history (sigma-algebra generated by the observations), the agent decides at each time whether to continue or stop. If the agent chooses the stop action at a specific time, then the problem terminates. In a Bayesian stopping time problem, the agent knows the prior distribution of the state of nature and the observation likelihood (conditional distribution of the observations) and uses this information to choose its continue and stop actions. Finally, in an optimal Bayesian stopping time problem, the agent chooses its continue and stop actions to minimize an expected cumulative cost function.

Inverse reinforcement learning (IRL) aims to estimate the costs/rewards of agents by observing their actions (Ng and Russell, 2000; Abbeel and Ng, 2004). This paper considers IRL for Bayesian stopping time problems. Suppose an inverse learner observes the decisions of agents performing Bayesian sequential stopping. The inverse learner does not know the observation sequence realizations or observation likelihoods of the agents. The two main questions we address are:

1. How can the inverse learner decide if the actions of Bayesian agents are consistent with optimal stopping?

2. If the observed actions of the Bayesian agents are consistent with optimal stopping, how can the inverse learner estimate the costs incurred by the agents?

The main result of this paper provides necessary and sufficient conditions for the inverse learner to identify if the actions of Bayesian agents are consistent with optimal stopping and, if so, a convex feasibility algorithm that constructs set-valued estimates of the costs incurred by the agents. We then illustrate this IRL result in two examples, namely, inverse sequential hypothesis testing and inverse Bayesian search.

### 1.1 Context. Bayesian Revealed Preferences for IRL

The key formalism used in this paper to achieve IRL is Bayesian revealed preferences (Martin, 2014; Caplin and Dean, 2015). Non-parametric estimation of cost functions given a finite length time series of decisions and budget constraints is the central theme in the area of revealed preferences in microeconomics, starting with Afriat (1967); Samuelson (1938) where necessary and sufficient conditions for cost minimization are given; see also Varian (1982, 2012); Woodford (2012) and more recently in machine learning (Lopes et al., 2009).

We now discuss how our Bayesian revealed preference based IRL approach differs from classical IRL.

1. The classical IRL frameworks (Ng and Russell, 2000; Abbeel and Ng, 2004) assume the observed agents are reward maximizers (or equivalently, cost minimizers) and then seek to estimate their cost functions. The approach in this paper is more fundamental. We first identify if the decisions of a set of agents are consistent with optimality and, if so, we then generate set-valued estimates of the agent costs that are consistent with the observed decisions.

2. Classical IRL assumes complete knowledge of the agents’ observation likelihoods. As mentioned above, we assume that the inverse learner only knows the state of nature and the action chosen when the agent stops. Specifically, the inverse learner does not know their observation likelihoods or the sequence of observation realizations.

3. Algorithmic Issues: In classical IRL, the inverse learner solves the Bayesian stopping time problem iteratively for various choices of the cost. This can be computationally prohibitive since it involves stochastic dynamic programming over a belief space, which is PSPACE-hard (Papadimitriou and Tsitsiklis, 1987). The IRL procedure in this paper is relatively computationally inexpensive since the inverse learner only needs to test for feasibility with respect to a set of convex inequalities.

### 1.2 Applications of IRL for Bayesian Stopping Problems

After constructing an IRL framework for general stopping time problems, this paper discusses two important examples, namely, inverse sequential hypothesis testing and inverse Bayesian search. Below we briefly motivate these examples.

Sequential hypothesis testing (SHT) (Poor, 1993; Ross, 2014) is widely studied in detection theory. The inverse SHT problem of estimating misclassification costs by observing agents has not been addressed. Estimating the misclassification costs of agents in SHT is useful in adversarial inference problems. For example, by observing the actions of an adversary, an inverse learner can estimate the adversary’s utility and predict its future decisions.

In Bayesian search, each agent sequentially searches locations until a stationary (non-moving) target is found. Bayesian search (Ross, 2014) is used in vehicular tracking (Wong et al., 2005), image processing (Pele and Werman, 2008) and cognitive radars (Goodman et al., 2007). IRL for Bayesian search requires the inverse learner to estimate the search costs by observing the search actions taken by a set of Bayesian agents.

Bayesian search is a special case of the Bayesian multi-armed bandit problem (Gittins, 1989; Audibert and Bubeck, 2010). Hence our IRL procedure can be used to solve inverse Bayesian bandit problems, namely, estimate the Gittins indices of the arms. This has several applications, including determining the fairness of multimedia recommender systems such as news websites. By observing the suggestions on a news website, an inverse learner can estimate the relative bias of suggested articles from one category compared to other categories.

Regarding the literature on inverse bandits, Chan et al. (2019) propose a real-time assistive procedure for a human performing a bandit task based on the history of actions taken by the human. Noothigattu et al. (2020) solve the inverse bandit problem by assuming the inverse learner knows the variance of the stochastic reward; in comparison, our setup assumes no knowledge of the rewards.

### 1.3 Main Results and Organization

1. Inverse RL for Bayesian sequential stopping: Our main result, Theorem 2.3 in Sec. 2 specifies a set of convex inequalities that are simultaneously necessary and sufficient for the decisions of a set of agents to be consistent with optimal stopping. If the agents are optimal stopping agents, then Theorem 2.3 also provides an algorithm that the inverse learner can use to generate set-valued estimates of the agents’ stopping costs.

We emphasize that our IRL approach does not assume knowledge of the agents’ observation likelihoods or their observation sample paths. Our IRL approach bypasses these; in this sense, our IRL approach is similar to reinforcement learning, where model parameters are bypassed in order to learn the optimal strategy.

2. Inverse RL for SHT and Search: Sec. 3 and Sec. 4 construct IRL algorithms for two specific examples of Bayesian stopping time problems, namely, Sequential Hypothesis Testing (SHT) and Search. The main results, Theorem 3.3 and Theorem 4.2 specify necessary and sufficient conditions for the agents’ decisions to be consistent with optimal SHT and optimal search, respectively. If the conditions hold, Theorems 3.3 and 4.2 provide algorithms to estimate the incurred misclassification costs (for SHT) and search costs (for Bayesian search).

3. Inverse RL for finite samples: Sec. 2, 3 and 4 assume that the inverse learner performing IRL observes the agents' decisions over infinitely many trials. In Sec. 5, we propose IRL detection tests for optimal stopping, optimal SHT and optimal search under finite sample constraints. Theorems 5.2, 5.3 and 5.4 in Sec. 5 comprise our sample complexity results that characterize the robustness of the detection tests by specifying Type-I/II and posterior Type-I/II error bounds.

4. Identifying $\epsilon$-optimal Bayesian stopping behavior: In general, a Bayesian agent can only solve a stopping time problem approximately, since computing the optimal stopping policy involves solving a stochastic dynamic programming problem over a continuum belief space. (SHT and optimal search are exceptions since they have a special structure.) Sec. 6 discusses the problem: how can an inverse learner identify $\epsilon$-optimal Bayesian agents and perform IRL? Theorem 6.1 shows that an inverse learner can identify $\epsilon$-optimal stopping behavior by testing for feasibility of the convex inequalities of Theorem 2.3. Also, if the agents choose their stopping strategies in noise, then Theorem 6.2 shows that the inverse learner can use the finite sample IRL detector to detect $\epsilon$-optimal Bayesian stopping.

The proofs of all theorems are provided in the Appendix.

### 1.4 Related works in IRL

We already mentioned related IRL works on multi-armed bandits like Chan et al. (2019); Noothigattu et al. (2020). We now briefly summarize other key IRL works in the literature. Classical IRL (Ng and Russell, 2000; Abbeel and Ng, 2004) aims to estimate an unknown deterministic reward function of an agent by observing the optimal actions of the agent in a Markov decision process (MDP) setting. Ziebart et al. (2008) use the principle of maximum entropy for achieving IRL of optimal agents.

Ramachandran and Amir (2007) achieve IRL in a Bayesian setting, where the agent optimally takes actions in an MDP according to a reward function sampled from a prior distribution. The inverse learner is assumed to know this prior pdf. Levine and Koltun (2012) generalize IRL to continuous space processes and circumvent the problem of solving the optimal policy for candidate reward functions. Recently Fu et al. (2017); Wulfmeier et al. (2015); Sharifzadeh et al. (2016); Finn et al. (2016) use deep-learning for IRL to estimate agent rewards where the rewards are parametrized by complicated non-linear functions.

Applications of IRL include robotics (Kretzschmar et al., 2016), user engagement multimedia social networks such as YouTube (Hoiles et al., 2020), autonomous navigation (Sharifzadeh et al., 2016; Abbeel and Ng, 2004; Ziebart et al., 2008) and inverse cognitive radar (Krishnamurthy, 2020; Krishnamurthy et al., 2020).

Finally, in microeconomics, Bayesian revealed preferences is studied by Caplin and Dean (2015); Caplin and Martin (2015). As mentioned earlier, revealed preferences is more general than classical IRL since it first identifies optimal behavior and then provides set-valued estimates of the cost function. Classical non-parametric revealed preferences stemming from Afriat’s remarkable theorem (Varian, 2012; Afriat, 1967; Diewert, 1973; Reny, 2015) focuses on identifying optimal agent behavior and then estimating agent utility functions in a non-Bayesian setting.

## 2 Identifying optimal Bayesian Stopping Behavior and Reconstructing Utility Function

The IRL framework we consider comprises Bayesian agents that make decisions in a stopping time problem, and an inverse learner that observes the decisions of these agents. This section defines the IRL problem that the inverse learner faces and then presents two results regarding the inverse learner:

1. Identifying Optimal Stopping. Theorem 2.3 below provides a necessary and sufficient condition for the inverse learner to identify if the Bayesian agents choose their decisions as the solution of an optimal stopping problem.

2. IRL for Reconstructing Costs. Theorem 2.3 also shows that the continue and stopping costs of the Bayesian agents can be reconstructed by solving a convex feasibility problem.

This section provides a complete IRL framework for Bayesian stopping time problems and sets the stage for subsequent sections where we formulate generalizations and examples.

### 2.1 Bayesian Stopping Agent

A Bayesian stopping time agent is parametrized by the tuple

$$\Xi = (\mathcal{X}, \pi_0, \mathcal{Y}, \mathcal{A}, B, \mu) \qquad (1)$$

where

• $\mathcal{X}$ is a finite set of states.

• At time $t=0$, the true state $x_0$ is sampled from the prior distribution $\pi_0$. The state $x_0$ is unknown to the agent.

• $\mathcal{Y}$ is the observation space. Given state $x$, the observations $y$ have conditional probability density $B_{x,y} = p(y|x)$.

• $\mathcal{A}$ is the finite set of stopping actions.

• Finally, $\mu$ denotes the agent’s stopping strategy. The stopping strategy operates sequentially on a sequence of observations as discussed below in Protocol 1.

###### Protocol 1

Sequential Decision-making protocol:

1. Generate true state $x_0 \sim \pi_0$ (unknown to the agent) at time $t=0$.

2. At time $t$, the agent records observation $y_t$.

3. Belief Update: Let $\mathcal{F}_t$ denote the sigma-algebra generated by observations $y_{1:t}$. The agent updates its belief (posterior) $\pi_t$ using Bayes formula as

$$\pi_t = \frac{B(y_t)\,\pi_{t-1}}{\mathbf{1}' B(y_t)\,\pi_{t-1}}, \qquad (2)$$

where $B(y) = \mathrm{diag}(B_{1,y}, \ldots, B_{X,y})$. The belief $\pi_t$ is an $X$-dimensional probability vector in the unit simplex

$$\Delta(\mathcal{X}) \stackrel{\text{def}}{=} \{\pi \in \mathbb{R}_+^{X} : \mathbf{1}'\pi = 1\}. \qquad (3)$$

4. Choose action $a_t$ from the set $\{\text{continue}\} \cup \mathcal{A}$. If $a_t \in \mathcal{A}$ then stop, else set $t \leftarrow t+1$ and go to Step 2.

The stopping strategy $\mu$ is a (possibly randomized) time-dependent mapping from the agent’s belief $\pi_t$ at time $t$ to the set $\{\text{continue}\} \cup \mathcal{A}$. We define the random variable $\tau$ as the time when the agent stops and takes a stop action from $\mathcal{A}$:

$$\tau = \inf\{t \geq 0 : \mu(\pi_t, t) \neq \text{continue}\}. \qquad (4)$$

Note that the random variable $\tau$ implicitly depends on the strategy $\mu$ and belief $\pi_t$. Clearly, the set $\{\tau \leq t\}$ is measurable wrt $\mathcal{F}_t$; hence, the random variable $\tau$ is adapted to the filtration $\{\mathcal{F}_t\}$. In the following sub-section, we will introduce costs for the agent’s stop and continue actions. We will use $\tau$ for expressing the expected cumulative cost of the agent.

Summary: The Bayesian stopping agent is parameterized by the tuple $\Xi$ in (1) and operates according to Protocol 1. Several decision problems such as SHT and sequential search fit this formulation.
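To make Protocol 1 concrete, the following minimal sketch simulates one trial in Python. The specific states, observation likelihood `B`, and the confidence-threshold stopping strategy `mu` are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def run_protocol(pi0, B, mu, rng):
    """Simulate one trial of Protocol 1. `mu` maps (belief, t) to either
    'continue' or a stop action index; its form here is an assumption."""
    x = rng.choice(len(pi0), p=pi0)          # Step 1: sample true state from prior
    pi, t = pi0.copy(), 0
    while True:
        a = mu(pi, t)
        if a != "continue":                  # Step 4: stop and take stop action a
            return x, a, t                   # (state, stop action, stopping time tau)
        t += 1
        y = rng.choice(B.shape[1], p=B[x])   # Step 2: record observation y_t ~ B(.|x)
        pi = B[:, y] * pi                    # Step 3: Bayes update, Eq. (2)
        pi /= pi.sum()                       # normalization 1'B(y_t)pi_{t-1}

# Hypothetical example: 2 states, 3 observations, stop once the belief is confident.
rng = np.random.default_rng(0)
pi0 = np.array([0.5, 0.5])
B = np.array([[0.6, 0.3, 0.1],               # B[x, y] = p(y | x)
              [0.1, 0.3, 0.6]])
mu = lambda pi, t: "continue" if pi.max() < 0.9 else int(pi.argmax())
x, a, tau = run_protocol(pi0, B, mu, rng)
```

Note that the stopping time `tau` is random: it depends on the observation sample path, matching the filtration-adapted definition of $\tau$ above.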

### 2.2 Optimal Bayesian Stopping Agents

So far we have defined a single Bayesian stopping agent. Our main IRL result is to identify if a set of Bayesian stopping agents behaves optimally. The purpose of this section is to define a set of optimal Bayesian stopping agents. For identifiability reasons (see Assumption 2 below) we require at least two agents ($M \geq 2$).

A collection of optimal Bayesian stopping agents is

$$\Xi_{\mathrm{opt}} = (\Xi, C, s, \mu). \qquad (5)$$

In (5),

• $\Xi$ is the set of $M$ Bayesian stopping agents, indexed by $m \in \mathcal{M} = \{1, \ldots, M\}$.

• The parameters $\mathcal{X}, \pi_0, \mathcal{Y}, \mathcal{A}, B$ in (1) and the continue cost (defined below) are the same for all agents in $\Xi$.

• $c(x, t) \geq 0$ is the continue cost incurred by any agent at time $t$ given state $x$.

• $s_m(x, a) \geq 0$ is agent $m$'s cost for taking stop action $a$ when the state is $x$.

• $\mu = \{\mu_m, m \in \mathcal{M}\}$ is the set of optimal stopping strategies of the agents in $\Xi$, where optimality is defined in Definition 2.2 below. Each agent employs its stopping strategy and operates according to Protocol 1.

**Definition 2.2 (Stopping strategy optimality).** For any agent $m \in \mathcal{M}$, the stopping strategy $\mu_m$ is optimal for stopping cost $s_m$ iff

$$\mu_m(\pi, \tau) = \operatorname*{argmin}_{a \in \mathcal{A}} \pi' \bar{s}_{m,a}, \qquad (6)$$
$$J(\mu_m, s_m) = \inf_{\mu} J(\mu, s_m), \qquad (7)$$

where the optimization in (7) is over the set of all stationary stopping strategies. Here $J(\mu, s)$ is the expected sum of cumulative continue and stopping costs for a stopping strategy $\mu$ and stopping cost $s$:

$$J(\mu, s) = G(\mu, s) + C(\mu), \qquad G(\mu, s) = \mathbb{E}_\mu\{\pi_\tau' \bar{s}_{\mu(\pi_\tau, \tau)}\}, \qquad C(\mu) = \mathbb{E}_\mu\Big\{\sum_{t=0}^{\tau-1} \pi_t' \bar{c}_t\Big\}. \qquad (8)$$

$\mathbb{E}_\mu$ denotes expectation parametrized by $\mu$ w.r.t. the probability measure induced by the observations. $\bar{s}_{m,a}$, $\bar{c}_t$ are the stopping and continue cost vectors, respectively, vectorized over all states in $\mathcal{X}$. Definition 2.2 is standard for the optimal strategy in a sequential stopping problem. The optimal strategy naturally decomposes into two steps: choosing whether to continue or stop according to (7); and if the decision is to stop, then choosing a specific stopping action from $\mathcal{A}$ according to (6). The optimal stopping strategies that satisfy conditions (6), (7) can be obtained by solving a stochastic dynamic programming problem (Krishnamurthy, 2016). It is a well-known result (Lovejoy, 1987) that the optimal stopping policy has the following structure: the set of beliefs for which it is optimal to stop is a convex set.
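The cost decomposition (8) can be estimated by Monte Carlo simulation of Protocol 1. The sketch below is purely illustrative: the two-state problem, the misclassification-style stopping costs, the constant continue cost, and the threshold strategy are all hypothetical choices, not quantities from the paper.

```python
import numpy as np

def estimate_J(pi0, B, mu, s_bar, c, K, seed=0):
    """Monte Carlo estimate of J(mu, s) = G(mu, s) + C(mu) in (8):
    expected stopping cost plus expected cumulative continue cost.
    s_bar[a] is the stopping-cost vector over states for stop action a;
    c is the continue-cost vector over states (assumed time-invariant here)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(K):
        x = rng.choice(len(pi0), p=pi0)          # sample true state
        pi, t = pi0.copy(), 0
        a = mu(pi, t)
        while a == "continue":
            total += pi @ c                      # accrue continue cost pi_t' c_t
            y = rng.choice(B.shape[1], p=B[x])   # observe y_t ~ B(.|x)
            pi = B[:, y] * pi
            pi /= pi.sum()                       # Bayes update (2)
            t += 1
            a = mu(pi, t)
        total += pi @ s_bar[a]                   # stopping cost pi_tau' s_bar_a
    return total / K

# Hypothetical two-state problem with misclassification stopping costs.
pi0 = np.array([0.5, 0.5])
B = np.array([[0.7, 0.3], [0.3, 0.7]])
s_bar = {0: np.array([0.0, 5.0]), 1: np.array([5.0, 0.0])}   # s(x, a)
c = np.array([0.1, 0.1])
mu = lambda pi, t: "continue" if pi.max() < 0.9 else int(pi.argmax())
J_hat = estimate_J(pi0, B, mu, s_bar, c, K=2000)
```

Both terms of (8) are accumulated along each sample path: the continue cost at every step before $\tau$, and the stopping cost at the stopping belief $\pi_\tau$.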

### 2.3 IRL for Inverse Optimal Stopping. Main Result

Thus far we have described a single Bayesian stopping agent and a set of optimal stopping agents. In this subsection, we provide an inverse learner centric view of the Bayesian stopping time problem and the main IRL result.

Suppose the inverse learner observes $M$ Bayesian stopping agents, where each agent performs several independent trials of Protocol 1. We make the following assumptions about the inverse learner performing IRL.

1. The inverse learner knows the dataset

$$\mathcal{D}_M = (\pi_0, \{p_m(a|x),\ x \in \mathcal{X},\ a \in \mathcal{A},\ m \in \mathcal{M}\}), \qquad (9)$$

where $\mathcal{M} = \{1, \ldots, M\}$ and $p_m(a|x)$ is agent $m$'s conditional probability of choosing stop action $a$ at the stopping time $\tau$ given state $x$. We call $p_m(a|x)$ the agent's action selection policy.
Note. The inverse learner does not know the stopping times; it only has access to the conditional density of which stop action was chosen given the true state $x$.

2. If the observed agents are optimal stopping agents, then there are at least two agents with distinct stopping costs.

Remark: Both assumptions are discussed after the main theorem, but let us make some preliminary remarks at this stage. Assumption 1 implies that the inverse learner observes the stopping actions chosen by a finite number ($M$) of stopping agents, where each agent performs an infinite number of independent trials of Protocol 1; see the discussion in Sec. 2.4 for the asymptotic interpretation. In Sec. 5 we will consider finite sample effects where the inverse learner observes the agents performing a finite number of independent trials of Protocol 1. Assumption 2 is necessary for the inverse optimal stopping problem to be well-posed.

The finiteness of the number of agents $M$ in Assumption 1 imposes an important restriction on our IRL task of identifying optimality of stopping agents, namely, “relative optimality”.

**Definition 2.3 (Relative Optimality).** $\Xi$ comprises relatively optimal Bayesian stopping agents (see Fig. 1) iff (6) and the following inequalities hold:

$$J(\mu_m, s_m) \leq \min_{\mu \in \{\mu_n,\, n \in \mathcal{M}\}} J(\mu, s_m), \qquad \forall m \in \mathcal{M}. \qquad (10)$$

Recall $J(\mu_m, s_m)$ is the agent’s expected cumulative cost (8).

The motivation for relative optimality will be apparent from Theorem 2.3 below which shows that the inverse learner can at best identify if the dataset is generated by stopping agents with relatively optimal stopping strategies.

We now present our first main result for our IRL framework. The result specifies a set of inequalities that are simultaneously necessary and sufficient for the agents observed by the inverse learner to be relatively optimal in the sense of (10).

**Theorem 2.3 (IRL for Bayesian stopping (Caplin and Dean, 2015)).** Consider the inverse learner with dataset $\mathcal{D}_M$ (9) obtained from the set of Bayesian agents $\Xi$. Assume Assumptions 1 and 2 hold. Then:
1. Identifiability: The inverse learner can at best identify if the dataset $\mathcal{D}_M$ is generated by stopping agents with relatively optimal stopping strategies (Definition 2.3).
2. Existence: $\Xi$ is a collection of relatively optimal Bayesian stopping agents if and only if there exists a feasible solution to the following convex (in stopping costs) inequalities:

$$\text{Find } s_m(x, a) \in \mathbb{R}_+\ \ \forall m \in \mathcal{M} \ \text{ s.t.}$$
$$\text{NIAS}(\mathcal{D}_M, \{s_m(x, a),\ x \in \mathcal{X},\ a \in \mathcal{A},\ m \in \mathcal{M}\}) \leq 0, \qquad (11)$$
$$\text{NIAC}(\mathcal{D}_M, \{s_m(x, a),\ x \in \mathcal{X},\ a \in \mathcal{A},\ m \in \mathcal{M}\}) \leq 0. \qquad (12)$$

The NIAS (No Improving Action Switches) and NIAC (No Improving Action Cycles) inequalities are defined in (22) and (18), and are convex in the stopping costs $\{s_m(x,a)\}$.
3. Reconstruction of costs:
(a). The set-valued IRL estimate of the agents’ stopping costs is the set of all feasible solutions to the NIAS and NIAC inequalities.
(b). For a feasible set of stopping costs $\{s_m(x,a)\}$, the set-valued IRL estimate of the expected cumulative continue costs $\{C_m\}$ is the set of all feasible solutions to the following linear (in $C_m$) inequalities:

$$\text{SUMCOST}(\{p_m(a|x),\ s_m(x, a),\ m \in \mathcal{M}\}, \pi_0) \leq 0, \qquad (13)$$

where SUMCOST is defined in (14).

Theorem 2.3 is proved in Appendix A. It says that identifying if a set of Bayesian stopping agents is relatively optimal, and then reconstructing their costs, is equivalent to solving a convex feasibility problem. Theorem 2.3 provides a constructive procedure for the inverse learner to generate set-valued estimates of the stopping cost and expected cumulative continue cost for all agents $m \in \mathcal{M}$. Algorithms for convex feasibility such as interior point methods (Boyd and Vandenberghe, 2004) can be used to check feasibility of (11), (12) and construct a feasible solution.
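As a concrete illustration of the feasibility test, the sketch below checks the NIAS inequalities with `scipy.optimize.linprog`. The explicit constraint form assumed here is the standard Bayesian revealed preference NIAS condition (the chosen stop action minimizes posterior expected stopping cost); the normalization ruling out the all-zero cost solution and the example policies are modeling assumptions for illustration. The NIAC inequalities (12) can be appended as additional rows, with their inner maximum linearized via epigraph variables.

```python
import numpy as np
from scipy.optimize import linprog

def nias_feasible(pi0, p):
    """Feasibility check for NIAS-type inequalities, linear in the stopping
    costs s_m(x, a) >= 0.  p[m][x, a] = p_m(a|x) from the dataset (9).
    Assumed NIAS form (chosen action minimizes posterior expected cost):
        sum_x pi0(x) p_m(a|x) [s_m(x,a) - s_m(x,a')] <= 0  for all m, a, a'.
    (Scaling the posterior by the positive constant P(a) preserves the
    inequality, so the joint pi0(x) p_m(a|x) is used directly.)"""
    M = len(p)
    X, A = p[0].shape
    n = M * X * A                                 # one variable per s_m(x, a)
    idx = lambda m, x, a: (m * X + x) * A + a
    rows = []
    for m in range(M):
        joint = pi0[:, None] * p[m]               # pi0(x) p_m(a|x)
        for a in range(A):
            for a2 in range(A):
                if a2 == a:
                    continue
                row = np.zeros(n)
                for x in range(X):
                    row[idx(m, x, a)] += joint[x, a]
                    row[idx(m, x, a2)] -= joint[x, a]
                rows.append(row)
    # Normalization sum(s) = 1 excludes the trivial all-zero solution
    # (a modeling choice for this sketch, not part of Theorem 2.3).
    res = linprog(np.zeros(n), A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.status == 0, res.x

# Hypothetical dataset: two agents whose policies favor the true state.
pi0 = np.array([0.5, 0.5])
p = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.8, 0.2], [0.1, 0.9]])]
ok, s = nias_feasible(pi0, p)
```

If `ok` is true, every returned `s` in the feasible set is a valid set-valued cost estimate; no single point is privileged, consistent with the set-valued nature of the reconstruction.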

Assertion 3(b) of Theorem 2.3 mentioned the SUMCOST inequalities (13); here we define them. For a feasible set of stopping costs $\{s_m(x,a)\}$, the inverse learner uses the relative optimality of stopping strategies (10) to estimate the expected cumulative continue costs of the agents. Specifically, the inverse learner solves the following linear feasibility problem:

$$\text{Find positive reals } C_1, C_2, \ldots, C_M \text{ s.t. } \forall m \neq n \in \mathcal{M}:$$
$$\text{SUMCOST}:\ \ \mathbb{E}_{\pi_0}\big\{\mathbb{E}_{p_m(a|x)}\{s_m(x, a)\,|\,x\}\big\} + C_m \;\leq\; \mathbb{E}_{p_n(a)}\Big\{\min_{a' \in \mathcal{A}} \mathbb{E}_x\{s_m(x, a')\,|\,a\}\Big\} + C_n. \qquad (14)$$

The RHS in (14) is the expected stopping cost of agent $n$'s strategy evaluated with agent $m$'s stopping costs, which, together with the expected continue cost $C_n$, performs worse than agent $m$'s own stopping strategy due to relative optimality.

### 2.4 Discussion of Assumptions 1, 2 and Relative Optimality

#### Assumption 1

To motivate Assumption 1, suppose for each agent $m$, the inverse learner records the true state $x^o_{k,m}$, stopping action $a_{k,m}$ and stopping time $\tau_{k,m}$ over independent trials $k = 1, \ldots, K$. Then the pmf $p_m(a|x)$ in (9) is the limit of the empirical pmf below as the number of trials $K \to \infty$:

$$\hat{p}_m(a|x) = \frac{\sum_{k=1}^{K} \mathbb{1}\{x^o_{k,m} = x,\ a_{k,m} = a\}}{\sum_{k=1}^{K} \mathbb{1}\{x^o_{k,m} = x\}}. \qquad (15)$$

Specifically, since for each agent $m$ the sequence $\{(x^o_{k,m}, a_{k,m})\}$ is i.i.d. over trials $k$, by Kolmogorov’s strong law of large numbers, as the number of trials $K \to \infty$, $\hat{p}_m(a|x)$ converges with probability 1 to the pmf $p_m(a|x)$. In the remainder of the paper (apart from Sec. 5), we will work with the asymptotic dataset $\mathcal{D}_M$ for IRL. In Sec. 5 we analyze the effect of finite sample size on the inverse learner using concentration inequalities.
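The estimator (15) is a simple conditional frequency count. The sketch below illustrates it with a hypothetical action selection policy; by the strong law of large numbers the empirical pmf approaches the true policy as the number of trials grows.

```python
import numpy as np

def empirical_policy(states, actions, X, A):
    """Empirical action selection policy, Eq. (15):
    hat p_m(a|x) = #{trials with (x, a)} / #{trials with x}."""
    counts = np.zeros((X, A))
    for x, a in zip(states, actions):
        counts[x, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical agent: sample K i.i.d. trials of (true state, stop action).
rng = np.random.default_rng(1)
p_true = np.array([[0.9, 0.1],        # p_m(a | x = 0)
                   [0.2, 0.8]])       # p_m(a | x = 1)
K = 50_000
states = rng.integers(0, 2, size=K)
actions = np.array([rng.choice(2, p=p_true[x]) for x in states])
p_hat = empirical_policy(states, actions, 2, 2)
```

With 50,000 trials the rows of `p_hat` are valid conditional pmfs that closely match `p_true`, mirroring the a.s. convergence argument above.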

#### Assumption 2

Assumption 2 is necessary for the identification of relatively optimal stopping behavior to be well-posed. Suppose Assumption 2 does not hold, so that two agents $m \neq n$ have identical stopping costs $s_m = s_n$. Then the corresponding inequalities in (14) hold trivially, and the continue costs $C_m, C_n$ cannot be distinguished by the feasibility problem (14). Hence, Assumption 2 ensures identifiability.

#### Relative Optimality

Relative optimality (Definition 2.3) is a weaker notion of optimality (see Fig. 1) compared to absolute optimality in Definition 2.2. Given the dataset $\mathcal{D}_M$ from a finite number of agents (Assumption 1), the inverse learner can at best identify relative optimality. If the agents that generate $\mathcal{D}_M$ are indeed optimal stopping agents (6), (7), their stopping strategies automatically satisfy relative optimality. But Assumption 1 implies that the inverse learner can identify them only as relatively optimal stopping agents and not absolutely optimal.

#### Unobservability of agent costs

The inverse learner does not know the agents’ observation sequences $y_{1:\tau}$, the observation likelihood $B$ or the continue cost. Otherwise the IRL task would be equivalent to directly using (6), (10) to reconstruct the costs. Given only partial information about the continue cost, how can the inverse learner formulate feasibility constraints to identify optimal stopping behavior? The set of NIAC inequalities (12), which takes the action selection policies as input, is a surrogate for the set of NIAC inequalities that take the unobserved stopping strategies as input for ensuring relative optimality of the agents. However, when the continue cost is known exactly to the inverse learner, as discussed in Sec. 3, the action selection policies together with the prior are sufficient statistics for the NIAS and NIAC inequalities, which the inverse learner can use to estimate the agent stopping costs for the sequential hypothesis testing (SHT) problem.

### 2.5 Discussion of Theorem 2.3

We now discuss the implications of Theorem 2.3 and provide intuition behind NIAS and NIAC inequalities (11), (12).

#### Necessity and Sufficiency

The NIAS and NIAC conditions (11), (12) are necessary and sufficient for identifying relatively optimal stopping behavior. This makes Theorem 2.3 a remarkable result. If no feasible solution exists, then the dataset $\mathcal{D}_M$ cannot be rationalized by agents performing relatively optimal Bayesian stopping. Conversely, if there exists a feasible solution, then the dataset $\mathcal{D}_M$ can be rationalized by relatively optimal stopping agents (Definition 2.3).

#### Set valued estimates vs point estimate

An important consequence of Theorem 2.3 is that the reconstructed costs are set-valued estimates rather than point-valued estimates, even though the dataset $\mathcal{D}_M$ is obtained from infinitely many trials. Put differently, all points in the feasible set explain the dataset equally well. Hence a point-valued estimate is not useful for our purpose.

#### Consistency of Set-Valued Estimate

The necessity proof of Theorem 2.3 implies that if the agents are optimal stopping agents, then the true stopping costs are feasible wrt the convex NIAS and NIAC inequalities. Hence, the IRL procedure is consistent in that the true costs belong to the set of feasible solutions to the NIAS and NIAC inequalities.

#### Intuition behind NIAS and NIAC

The inequalities (6), (10) for relatively optimal stopping agents can be written in abstract notation as (16), (17), respectively, as shown below

$$\text{NIAS}(\{\{p(y_{1:\tau(\mu_m)}|x),\ x \in \mathcal{X}\},\ s_m,\ m \in \mathcal{M}\},\ \pi_0) \leq 0, \qquad (16)$$
$$\text{NIAC}^*(\{\{p(y_{1:\tau(\mu_m)}|x),\ x \in \mathcal{X}\},\ s_m,\ C_m,\ m \in \mathcal{M}\},\ \pi_0) \leq 0. \qquad (17)$$

Eq. (11) in Theorem 2.3 is obtained by replacing the unknown likelihood $p(y_{1:\tau(\mu_m)}|x)$ in (16) with the known action selection policy $p_m(a|x)$. As discussed in the proof (Appendix A), $p_m(a|x)$ is a noisy measurement of $p(y_{1:\tau(\mu_m)}|x)$.

The NIAC inequality (12) is a surrogate for (17); it ensures that each agent’s stopping strategy performs better than that of any other agent when the stopping cost is kept constant. The set of NIAC inequalities is defined as follows.

**Definition 2.5 (NIAC Inequalities).** Given dataset $\mathcal{D}_M$ and stopping costs $\{s_m(x,a)\}$, the following holds for any set of agent indices $\hat{\mathcal{M}} \subseteq \mathcal{M}$:

$$\sum_{m \in \hat{\mathcal{M}}} \mathbb{E}_{p_m(a)}\Big\{\max_{a' \in \mathcal{A}} \mathbb{E}_x\{s_m(x, a) - s_{m+1}(x, a')\,|\,a\}\Big\} \leq 0, \qquad (18)$$

where the agent indices in (18) are cyclic: $m+1$ denotes the next element of $\hat{\mathcal{M}}$, and the successor of the last index is the first. The NIAC inequality considers the surrogate expected stopping cost obtained by using the action selection policy $p_m(a|x)$ instead of the fictitious likelihood in the expected stopping cost definition (8). It says that the cumulative surrogate stopping cost is smaller than that obtained by shuffling the action selection policies among the agents.

Note that in Definition 2.5, unlike (17), the NIAC inequality does not contain the (unknown) expected cumulative continue costs $\{C_m\}$ as an argument. The necessity of the NIAC inequality for (17) to hold can be shown trivially. However, for sufficiency, it will be shown in Appendix A that if the NIAC inequality holds, then there exist positive reals $\{C_m\}$ that satisfy inequality (17), thereby showing the existence of relatively optimal stopping agents that generate $\mathcal{D}_M$, which is precisely the objective of Theorem 2.3.
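The left-hand side of the NIAC inequality (18) can be evaluated directly from the dataset and candidate stopping costs. Below is a minimal numerical sketch; the two-agent policies and the scaled misclassification costs are hypothetical inputs chosen only for illustration.

```python
import numpy as np

def niac_lhs(pi0, p, s, cycle):
    """Left-hand side of the NIAC inequality (18) for a cycle of agent
    indices: sum over m of E_{p_m(a)} max_{a'} E_x[s_m(x,a) - s_next(x,a') | a].
    p[m][x, a] = p_m(a|x); s[m][x, a] = candidate stopping cost."""
    total = 0.0
    for i, m in enumerate(cycle):
        nxt = cycle[(i + 1) % len(cycle)]        # next agent, wrapping around
        joint = pi0[:, None] * p[m]              # pi0(x) p_m(a|x)
        p_a = joint.sum(axis=0)                  # marginal pmf of stop actions
        post = joint / p_a                       # posterior pi(x|a), one column per a
        for a in range(p_a.size):
            # max over a' of an affine map = E[s_m|a] - min_{a'} E[s_nxt|a]
            term = post[:, a] @ s[m][:, a] - (post[:, a] @ s[nxt]).min()
            total += p_a[a] * term
    return total

# Hypothetical two-agent dataset with scaled misclassification costs.
pi0 = np.array([0.5, 0.5])
p = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.8, 0.2], [0.1, 0.9]])]
s = [np.array([[0.0, 1.0], [1.0, 0.0]]),
     np.array([[0.0, 2.0], [2.0, 0.0]])]
lhs = niac_lhs(pi0, p, s, cycle=[0, 1])
```

A nonpositive value of `lhs` for every cycle of agents means no improving action cycle exists for these candidate costs; a feasibility solver searches over the costs for which this holds.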

#### Private and Public Beliefs

The stopping belief $\pi_\tau$ in (7) can be interpreted as the private belief evaluated by the agent after measuring $y_{1:\tau}$, in the sense of Bayesian social learning (Krishnamurthy, 2016). Since $\pi_\tau$ is unavailable to the inverse learner, it uses the public belief, namely the posterior given the agent’s stop action, to estimate the agent’s costs.

### 2.6 Outline of Proof of Theorem 2.3

The proof of Theorem 2.3 in Appendix A involves two main ideas. The first key idea is to specify a fictitious likelihood parametrized by the stopping strategy $\mu$ so that, given strategy $\mu$, observation likelihood $B$ and prior $\pi_0$, a single fictitious observation $\tilde{y}_\pi$ yields the same stopping belief as the observation trajectory of the stopping time problem, i.e.,

$$\mathbb{P}(\tilde{y}_\pi\,|\,x, \mu) = \mathbb{P}(\{y_{1:\tau}\} : \pi_\tau = \pi\,|\,x).$$

A more precise statement is given in (65). In other words, a one-step Bayesian update using the fictitious likelihood is equivalent to the multi-step Bayesian update (2) of the state till the stopping time. This idea is shown in Fig. 2. Recall that the cumulative expected cost of the agent comprises two components, the stopping cost and the cumulative continue cost. A useful property of this fictitious likelihood is that it is a sufficient statistic for the expected stopping cost $G(\mu, s)$ in (8).

The second main idea is to formulate the agent’s expected cumulative cost using the observed action selection policy of the agent instead of the unobserved fictitious likelihood that determines the expected stopping cost. The action selection policy $p_m(a|x)$ in (9) is a stochastically garbled (noisy) version of the fictitious likelihood. We use this concept to formulate the NIAS and NIAC inequalities whose feasibility given $\mathcal{D}_M$ is necessary and sufficient for relatively optimal stopping behavior.

Showing that feasibility of the NIAS and NIAC inequalities (11), (12) is a necessary condition for the existence of relatively optimal stopping agents (6), (10) is straightforward. The key idea in the sufficiency proof is to note that the garbling matrix that maps the fictitious observation likelihood to the action selection policy is unknown to the inverse learner. Hence, the inverse learner can arbitrarily assume $p_m(a|x)$ to be an accurate measurement of the fictitious likelihood. We then show that for a feasible set of stopping costs that satisfy the NIAS and NIAC inequalities, there exists a set of positive reals $\{C_m\}$ that satisfy (6), (10) with the agent’s expected cumulative continue cost set to $C_m$.

The NIAS and NIAC inequalities are convex in the stopping costs $\{s_m(x,a)\}$. The inverse learner can solve these convex feasibility constraints to obtain a feasible solution. Thus, we have a constructive IRL procedure for reconstructing the stopping and expected cumulative continue costs for the inverse optimal stopping time problem.

### 2.7 Summary

This section has laid the groundwork for IRL of Bayesian stopping time agents. Specifically, we discussed the dynamics of the Bayesian stopping time agent and optimal stopping time agents. We then described the IRL problem that the inverse learner aims to solve. Theorem 2.3 gave a necessary and sufficient condition for a collection of stopping time agents to be relatively optimal when their decisions are observed by the inverse learner. The optimal agents’ stopping costs can be estimated by solving a convex feasibility problem. Theorem 2.3 forms the basis of the IRL framework in this paper.

## 3 Example 1. Inverse Reinforcement Learning (IRL) for Sequential Hypothesis Testing (SHT)

We now discuss our first example of IRL for an optimal Bayesian stopping time problem, namely, inverse Sequential Hypothesis Testing (SHT). Our main result below (Theorem 3.3) specifies a necessary and sufficient condition for IRL in SHT. Unlike the general stopping problem in Sec. 2.3, the continue cost in the SHT problem can be chosen as 1 without loss of generality. This additional structure implies that the inverse learner can reconstruct the SHT misclassification costs by solving a linear feasibility problem.

### 3.1 Sequential Hypothesis Testing (SHT) Problem

Let $y_1, y_2, \ldots$ be a sequence of i.i.d. observations. Suppose an agent knows that the pdf of the observations is either $f_1$ (hypothesis $x = 1$) or $f_2$ (hypothesis $x = 2$). The aim of classical SHT is to decide sequentially whether $x = 1$ or $x = 2$ by minimizing a combination of the continue (measurement) cost and the misclassification cost.

The SHT problem is a special case of the optimal Bayesian stopping problem discussed in Sec. 2.2. In analogy to Sec. 2.2, we now define a set of SHT agents.

{definition}

[SHT agents] A set of SHT agents is a special case of a set of optimal stopping agents (Definition 2.2) with:

• State space $\mathcal{X} = \{1, 2\}$, the two hypotheses.

• Action space $\mathcal{A} = \{1, 2\}$, where stop action $a$ declares hypothesis $x = a$.

• Observation space $\mathcal{Y}$, where the observations are i.i.d. with a state-dependent pdf.

• $s_m(x,a)$ is the SHT agent’s stopping cost parametrized by misclassification costs $\bar{L}_{m,1}, \bar{L}_{m,2}$.

$$s_m(x,a) = \begin{cases} \bar{L}_{m,1}, & \text{if } x = 1,\ a = 2, \\ \bar{L}_{m,2}, & \text{if } x = 2,\ a = 1, \\ 0, & \text{if } x = a \in \{1,2\}. \end{cases}$$
• The continue cost is the constant 1.

• $\{\mu_m,\ m \in \mathcal{M}\}$ are the SHT stopping strategies of the agents, defined below.

The SHT stopping strategies in the above definition satisfy the optimality conditions in Definition 2.2 and can be computed using stochastic dynamic programming (Krishnamurthy, 2016). The solution for agent $m$ is well known (Lovejoy, 1987) to be a stationary policy with the following threshold rule parameterized by the thresholds $\beta_m, \alpha_m$.

$$\mu_m(\pi) = \begin{cases} \text{choose action } 2, & 0 \le \pi(x=2) \le \beta_m \\ \text{continue}, & \beta_m < \pi(x=2) \le \alpha_m \\ \text{choose action } 1, & \alpha_m < \pi(x=2) \le 1. \end{cases} \tag{19}$$

Remarks: (i) Since scaling all costs by a positive constant leaves the optimal policy unaffected, we can set the continue cost to 1 without loss of generality.
(ii) In complete analogy to Sec. 2.3 on relative optimality, IRL for inverse SHT can at best identify if the observed agents are relatively optimal agents.
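To make the threshold rule (19) concrete, the following minimal Python sketch implements it. The threshold values `beta_m` and `alpha_m` are hypothetical placeholders; in the paper they are obtained from stochastic dynamic programming, not chosen by hand.

```python
def sht_policy(pi_2, beta_m=0.3, alpha_m=0.7):
    """Threshold rule (19) applied to the posterior belief pi(x=2).

    beta_m and alpha_m are illustrative thresholds, not values from the
    paper; in practice they come from dynamic programming.
    """
    if pi_2 <= beta_m:
        return 2            # stop and choose action 2
    elif pi_2 <= alpha_m:
        return "continue"   # belief is inconclusive; take another observation
    else:
        return 1            # stop and choose action 1
```

For example, a posterior of 0.5 falls strictly between the two default thresholds, so the rule continues sampling.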

### 3.2 IRL for Inverse SHT. Main Assumptions

Suppose the inverse learner observes the actions of $M$ Bayesian stopping agents. In addition to Assumption 2, we assume the following about the inverse learner performing IRL for identifying SHT agents:

1. The inverse learner has the dataset

$$\mathcal{D}_M(\text{SHT}) = \big(\mathcal{D}_M,\ \{C_m,\ m \in \mathcal{M}\}\big), \tag{20}$$

where $\mathcal{D}_M$ is defined in (9) and $C_m$ is the expected continue cost for SHT agent $m$.

2. The stopping strategies are stationary strategies characterized by the threshold structure in (19).

3. There exist positive numbers $\delta_1$ and $\delta_2$ ($\delta_1 \le \delta_2$) such that the two conditions below are satisfied for the observed stopping agents $m \in \mathcal{M}$.

$$\text{(i) } \beta_m \le \delta_1 \le \delta_2 \le \alpha_m,\ \forall m \in \mathcal{M}, \qquad \text{(ii) } \frac{\delta_1}{1-\delta_1} \le \frac{\bar{L}_{m,1}}{\bar{L}_{m,2}} \le \frac{\delta_2}{1-\delta_2},$$

where $\beta_m, \alpha_m$ are the threshold values of the stationary strategy $\mu_m$ in (19).

Remarks: (i) Assumption 3 specifies additional information the inverse learner has for performing IRL for SHT, obtained by recording the agent decisions over independent trials. Since the continue cost is 1, the expected cumulative continue cost $C_m$ is simply the expected stopping time of agent $m$. The inverse learner obtains an a.s. consistent estimate of the expected stopping time by computing the sample average of the stopping times.
(ii) Assumptions 4 and 5 comprise partial information the inverse learner has about the stopping strategies of the agents and their observation likelihood. An important consequence is that the expected stopping cost of the SHT agent (7), which is a complicated function of the unobserved observation likelihood, can now be formulated as a linear function of the observed agent action selection policy $p_m(a|x)$:

$$G(\mu_m, s_n) = \sum_{x,a} \pi_0(x)\, p_m(a|x)\, s_n(x,a), \quad m, n \in \mathcal{M}. \tag{21}$$

The above reformulation together with (20) results in a linear feasibility problem that the inverse learner needs to solve to achieve IRL for SHT.
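The linear reformulation (21) is straightforward to evaluate from the dataset. Below is a small numpy sketch; the prior, action selection policy, and stopping costs are illustrative placeholders, not values from the paper.

```python
import numpy as np

def expected_stopping_cost(pi0, p_m, s_n):
    """G(mu_m, s_n) in (21): sum over x, a of pi0(x) p_m(a|x) s_n(x,a)."""
    return float(np.sum(pi0[:, None] * p_m * s_n))

pi0 = np.array([0.5, 0.5])                 # prior over states x in {1, 2}
p_m = np.array([[0.9, 0.1],                # p_m(a|x): row x, column a
                [0.2, 0.8]])
s_n = np.array([[0.0, 2.0],                # s_n(x, a): zero on the diagonal
                [3.0, 0.0]])
G = expected_stopping_cost(pi0, p_m, s_n)  # 0.5*0.1*2 + 0.5*0.2*3 = 0.4
```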

### 3.3 IRL for Inverse SHT. Main Result

Our main result below specifies a set of linear inequalities that are necessary and sufficient for the agents observed by the inverse learner to be relatively optimal SHT agents (Definition 2.3). Any feasible solution constitutes a viable SHT misclassification cost for the relatively optimal SHT agents.

{theorem}

[IRL for SHT] Consider the inverse learner with dataset (20) obtained from the set of Bayesian agents $\mathcal{M}$. Assume Assumptions 2-5 hold. Then:
1. Identifiability: The inverse learner can at best identify if the dataset is generated by relatively optimal SHT agents (Definition 2.3).
2. Existence: is a collection of relatively optimal SHT agents, if and only if there exists a feasible solution to the following linear (in stopping costs) inequalities:

$$\text{Find } s_m(x,a) > 0,\ s_m(x,x) = 0,\ \forall x, a \in \mathcal{X},\ m \in \mathcal{M},\ \text{s.t.}$$
$$\text{NIAS: } \sum_{x \in \mathcal{X}} p_m(x|a)\big(s_m(x,a) - s_m(x,b)\big) \le 0, \quad \forall a, b, m. \tag{22}$$
$$\text{NIAC}^{*}: \sum_{x,a} \pi_0(x)\big(p_n(a|x) - p_m(a|x)\big)\, s_m(x,a) + C_n - C_m \ge 0, \quad \forall m \ne n \in \mathcal{M}. \tag{23}$$

3. Reconstruction: The set-valued IRL estimates of the SHT misclassification costs are

$$\bar{L}_{m,1} = s_m(1,2), \quad \bar{L}_{m,2} = s_m(2,1), \quad \forall m \in \mathcal{M},$$

where $\{s_m(x,a)\}$ is any feasible solution to the NIAS and NIAC inequalities. Theorem 3.3 is a special instance of Theorem 2.3 for identifying relatively optimal stopping agents. Theorem 3.3 says the inverse learner needs to solve only a linear feasibility problem to identify relatively optimal SHT agents, compared to a convex feasibility problem in Theorem 2.3. Algorithms for linear feasibility (Boyd and Vandenberghe, 2004) can be used to check feasibility of (22), (23) in Theorem 3.3 and to construct a feasible solution.

Remark: Suppose the inverse learner had no information about the agents’ expected continue costs. Then the NIAS and NIAC feasibility inequalities in Theorem 2.3 are necessary and sufficient for the existence of relatively optimal SHT agents. But since this cost is known to the inverse learner due to the SHT problem structure, it can estimate the misclassification costs more precisely. Specifically, the inverse learner can directly check feasibility of inequality (10) using the known expected continue costs $C_m$. The inverse learner then checks feasibility of the NIAS inequality (22), which is necessary and sufficient for the existence of stopping strategies that satisfy inequality (6). This completes the IRL procedure for the inverse SHT problem in Theorem 3.3.
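As a concrete sketch, the NIAS (22) and NIAC* (23) inequalities for two agents can be posed as a linear program with a zero objective and checked with an off-the-shelf LP solver. All numbers below (action selection policies, continue costs, prior) are illustrative placeholders, and the strict positivity constraint on the costs is approximated by a small lower bound.

```python
import numpy as np
from scipy.optimize import linprog

pi0 = np.array([0.5, 0.5])
# p[m][x, a] = p_m(a | x); agent 0 is more accurate and, in this toy
# dataset, has a larger expected continue cost (stopping time).
p = [np.array([[0.95, 0.05], [0.05, 0.95]]),
     np.array([[0.80, 0.20], [0.20, 0.80]])]
C = [4.0, 2.0]

M = len(p)

def var(m, which):
    """Variable index: which=0 -> s_m(1,2), which=1 -> s_m(2,1)."""
    return 2 * m + which

A_ub, b_ub = [], []
for m in range(M):
    joint = pi0[:, None] * p[m]                      # pi0(x) p_m(a|x)
    post = joint / joint.sum(axis=0, keepdims=True)  # posterior p_m(x|a)
    # NIAS (22): stopping with report a must beat the other report b.
    row = np.zeros(2 * M)
    row[var(m, 0)], row[var(m, 1)] = -post[0, 0], post[1, 0]   # a=1, b=2
    A_ub.append(row); b_ub.append(0.0)
    row = np.zeros(2 * M)
    row[var(m, 0)], row[var(m, 1)] = post[0, 1], -post[1, 1]   # a=2, b=1
    A_ub.append(row); b_ub.append(0.0)
    # NIAC* (23) against every other agent n, rewritten in <= form.
    for n in range(M):
        if n == m:
            continue
        row = np.zeros(2 * M)
        row[var(m, 0)] = -pi0[0] * (p[n][0, 1] - p[m][0, 1])
        row[var(m, 1)] = -pi0[1] * (p[n][1, 0] - p[m][1, 0])
        A_ub.append(row); b_ub.append(C[n] - C[m])

res = linprog(c=np.zeros(2 * M), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(1e-3, None)] * (2 * M))       # s_m(x, a) > 0
feasible = (res.status == 0)
```

Any feasible `res.x` is a valid set-valued IRL estimate of the misclassification costs; here the more accurate agent must have large misclassification costs to justify its longer expected sampling.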

### 3.4 Numerical Example Illustrating Inverse SHT

The following numerical example illustrates Theorem 3.3 for inverse SHT.

SHT Agents. We consider $M = 3$ SHT agents with:

• Prior .

• Observation likelihood: ,
, where denotes the normal distribution with mean and variance .

• Misclassification costs:
Agent 1: , Agent 2: , Agent 3: .

Inverse Learner specification. Next we consider the inverse learner. We generate $K$ sample paths for the agents using the above parameters. Recall from Theorem 3.3 that the inverse learner uses the dataset $\mathcal{D}_M(\text{SHT})$ to perform IRL for SHT. Here

$$\mathcal{D}_M(\text{SHT}) = \Big(\pi_0,\ \big(\hat{p}_m(a|x),\ \textstyle\sum_{k=1}^K \tau_k(\mu_m)/K\big),\ m \in \{1,2,3\}\Big), \tag{24}$$

where the second and third terms are the empirically calculated action selection policy and expected stopping time for SHT agent $m$ from the $K$ generated samples.

IRL Result. The inverse learner performs IRL by using the dataset (24) to solve the linear feasibility problem in Theorem 3.3. The result of the feasibility test is shown in Fig. 3: the blue region is the set of feasible misclassification costs for each agent. For visualization purposes, Fig. 3 displays the feasible misclassification costs for a single agent, keeping the costs for the other two agents fixed at their true values.

The true misclassification costs for each SHT agent are highlighted by a yellow point. The key observation is that these true costs belong to the set of feasible costs (blue region) computed via Theorem 3.3. Thus, Theorem 3.3 successfully performs IRL for the SHT problem and the set of feasible misclassification costs can be reconstructed as the solution to a linear feasibility problem. Also, all points in the blue feasible region of misclassification costs explain the SHT dataset (24) equally well.

### 3.5 Summary

Theorem 3.3 specified necessary and sufficient conditions for identifying relatively optimal SHT agents. These conditions constitute a linear feasibility program that the inverse learner can solve to estimate SHT misclassification costs of the agents. Due to the structure of the SHT problem, the inverse learner has additional information about the agents. So the IRL task of solving the inverse SHT problem is more structured than the inverse optimal stopping problem in Sec. 2.

## 4 Example 2. Inverse Reinforcement Learning (IRL) for Optimal Search

In this section, we present a second example of IRL for an optimal Bayesian stopping time problem, namely, inverse Bayesian Search. In the search problem, a Bayesian agent sequentially searches over a set of target locations until a static (non-moving) target is found. The optimal search problem is a special case of a Bayesian multi-armed bandit problem.

The optimal search problem is a modification of the sequential stopping problem in Sec. 2 with the following changes.

• There is only one stop action but multiple continue actions, namely, which of the locations to search at each time. We call the continue actions search actions, or simply, actions.

• The observation likelihood depends on both the true state $x$ and the continue action $a$.

Suppose an inverse learner observes the decisions of a collection of Bayesian search agents. The aim of the inverse search problem is to identify if the search actions of the agents are optimal and, if so, estimate their search costs. Our main IRL result for Bayesian search (Theorem 4.2 below) shows that identifying optimal search agents is equivalent to the existence of a feasible solution to a set of linear inequalities.

### 4.1 Optimal Bayesian Search Agents

Suppose an agent searches for a target located at one of $X$ locations. When the agent chooses action $a$ to search location $a$, it obtains an observation $y$. Assume the agent knows the set of conditional pmfs of the observations, namely, $\{B(y,x,a)\}$ defined in (26) below. The aim of optimal search is to decide sequentially which location to search at each time so as to minimize the cumulative search cost until the target is found.

We define a set of optimal Bayesian search agents as

$$\Xi_{\text{opt}} = \big(\mathcal{X}, \pi_0, \mathcal{Y}, \mathcal{A}, \alpha, \{l_m, \mu_m,\ m \in \mathcal{M}\}\big), \tag{25}$$

where

• $\mathcal{X}$ is a finite set of states (target locations).

• At time $t = 0$, the true state $x$ is sampled from prior pmf $\pi_0$. This location is not known to the agents but is known to the inverse learner (performing IRL).

• $a_t \in \mathcal{A} = \mathcal{X}$ is the location searched by the agent at time $t$.

• Agent $m$ incurs instantaneous cost $l_m(a)$ for searching location $a$.

• $\alpha(a) \in (0, 1]$ is the reveal probability for location $a$, i.e., the probability that the target is found when any agent searches the target location ($a = x$). $\alpha$ characterizes the action-dependent observation likelihood $B$:

$$B(y,x,a) = p(y|x,a) = \begin{cases} \alpha(a), & y = 1,\ x = a \\ 1 - \alpha(a), & y = 0,\ x = a \\ 1, & y = 0,\ x \ne a. \end{cases} \tag{26}$$

We emphasize that the reveal probabilities are the same for all agents in $\mathcal{M}$.

• $\{\mu_m,\ m \in \mathcal{M}\}$ are the optimal search strategies of the agents, which operate sequentially on a sequence of observations as discussed below in Protocol 2.

###### Protocol 2

Sequential Decision-making protocol for Search:

1. Generate the target location $x \sim \pi_0$ at time $t = 0$.

2. At time $t$, the agent records observation $y_t \in \{0, 1\}$, where $y_t = 1$ indicates the target was found.

3. If $y_t = 1$, then stop; else if $y_t = 0$:
(i) Update belief $\pi_t$ (described below).
(ii) For search policy $\mu_m$, the agent takes action $a_t = \mu_m(\pi_t)$. (Note the first action is taken at time $t = 0$, while the first observation is at $t = 1$.)
(iii) Set $t \to t + 1$ and go to Step 2.

Belief Update: Let $\mathcal{F}_t$ denote the sigma-algebra generated by the action and observation sequence $\{a_0, y_1, \ldots, a_{t-1}, y_t\}$. The agent updates belief $\pi_t$ using Bayes formula as

$$\pi_t = \frac{B(y_t, a_{t-1})\, \pi_{t-1}}{\mathbf{1}' B(y_t, a_{t-1})\, \pi_{t-1}}, \tag{27}$$

where $B(y, a) = \operatorname{diag}\big(B(y, x, a),\ x \in \mathcal{X}\big)$. The belief $\pi_t$ is an $X$-dimensional probability vector belonging to the $(X-1)$-dimensional unit simplex (3).
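The belief update (27) with the search likelihood (26) can be sketched in a few lines of numpy. The reveal probabilities below are illustrative placeholders.

```python
import numpy as np

alpha = np.array([0.8, 0.6, 0.7])    # reveal probabilities alpha(a), illustrative

def B_diag(y, a):
    """Diagonal of B(y, a) from the likelihood (26)."""
    if y == 1:
        d = np.zeros(len(alpha))     # p(y=1 | x != a) = 0
        d[a] = alpha[a]
    else:
        d = np.ones(len(alpha))      # p(y=0 | x != a) = 1
        d[a] = 1.0 - alpha[a]
    return d

def belief_update(pi, y, a):
    """Bayes update (27): pi_t proportional to B(y_t, a_{t-1}) pi_{t-1}."""
    unnorm = B_diag(y, a) * pi
    return unnorm / unnorm.sum()

pi = np.full(3, 1 / 3)               # uniform prior over 3 locations
pi = belief_update(pi, y=0, a=0)     # searched location 0, found nothing
```

After a failed search of location 0, its posterior mass drops from 1/3 to 0.2/2.2, while the other two locations become more likely; after a successful search ($y=1$) the belief jumps to a vertex of the simplex.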

Remark: (1) The search agent’s stopping region comprises a set of $X$ points, where each point corresponds to a distinct vertex of the $(X-1)$-dimensional unit simplex.
(2) The agent first takes an action and then realizes the observation.

We define the random variable $\tau$ as the time at which the agent stops when using the search strategy (30):

$$\tau = \inf\,\{t > 0 \mid y_t = 1\}. \tag{28}$$

Clearly, the set $\{\tau = t\}$ is measurable with respect to $\mathcal{F}_t$; hence, the random variable $\tau$ is adapted to the filtration $\{\mathcal{F}_t\}$. Below, we define the optimal search strategies $\{\mu_m,\ m \in \mathcal{M}\}$.

{definition}

[Search Strategy Optimality] The optimal search strategies $\{\mu_m\}$ of Bayesian search agents operating according to Protocol 2 minimize the cumulative expected search cost of each agent until the target is found. These strategies can be computed by solving a stochastic dynamic programming problem (Krishnamurthy, 2016). The solution is well known to be a stationary policy, as defined below.

$$J(\mu_m, l_m) = \min_\mu J(\mu, l_m) = \mathbb{E}_\mu\Big\{\sum_{t=0}^{\tau-1} l_m(\mu(\pi_t))\Big\}, \tag{29}$$
$$\mu_m(\pi) = \operatorname*{argmax}_{a \in \mathcal{A}}\ \frac{\pi(a)\,\alpha(a)}{l_m(a)}. \tag{30}$$

Here, $\mathbb{E}_\mu$ denotes expectation with respect to the probability measure induced by strategy $\mu$, $J(\mu, l_m)$ denotes the cumulative expected search cost, and $\mu$ belongs to the class of stationary search strategies. Remarks: (1) Note that the minimization in (29) is over stationary search strategies. It is well known that the minimizing search strategy has a threshold structure (Krishnamurthy, 2016). Since the set of all threshold strategies forms a compact set, we can replace the inf in (7) by min in (29).
(2) Since the expected cumulative cost of an agent depends only on the search costs (for constant reveal probabilities), we can set $l_m(1) = 1$ WLOG.

Relation to Multi-armed Bandits: Optimal search for a stationary target is a special case of a Bayesian multi-armed bandit problem. The term $\pi(a)\,\alpha(a)/l_m(a)$ in (30) is the Gittins index (Gittins, 1979) for the search problem, and the optimal policy for the agents is to search the location with the highest Gittins index at each time $t$. Hence, our IRL result in Theorem 4.2 below for estimating search costs can be used to perform IRL for Bayesian multi-armed bandits.
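The Gittins index rule (30) reduces to a one-line computation given the current belief. In the sketch below, the belief, reveal probabilities, and search costs are illustrative placeholders.

```python
import numpy as np

def search_policy(pi, alpha, l_m):
    """mu_m(pi) in (30): search the location with the largest Gittins
    index pi(a) * alpha(a) / l_m(a)."""
    return int(np.argmax(pi * alpha / l_m))

pi = np.array([0.5, 0.3, 0.2])       # current belief over target locations
alpha = np.array([0.6, 0.9, 0.9])    # reveal probabilities
l_m = np.array([1.0, 1.0, 2.0])      # agent m's search costs
a = search_policy(pi, alpha, l_m)    # indices 0.30, 0.27, 0.09 -> location 0
```

Note that location 0 wins despite its lower reveal probability because its belief mass is high and its search cost is low.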

### 4.2 IRL for Inverse Search. Main Result

In this subsection, we provide an inverse learner centric view of the Bayesian stopping time problem and the main IRL result for inverse search. Suppose the inverse learner observes $M$ search agents, where each agent performs several independent trials of Protocol 2 for Bayesian sequential search. We make the following assumptions about the inverse learner performing IRL to detect if the observed agents comprise relatively optimal search agents.

1. The inverse learner knows the dataset

$$\mathcal{D}_M(\text{Search}) = \big(\pi_0,\ \{g_m(a,x),\ m \in \mathcal{M}\}\big). \tag{31}$$

Here, $g_m(a,x)$ is the average number of times agent $m$ searches location $a$ when the target is in location $x$:

$$g_m(a,x) = \mathbb{E}_{\mu_m}\Big\{\sum_{t=1}^{\tau} \mathbb{1}\{\mu_m(\pi_t) = a\}\ \Big|\ x\Big\}. \tag{32}$$

We call $g_m$ the agent’s action search policy.

2. If the agents are optimal search agents, then there are at least two agents with distinct search costs.

Remarks: Assumption 6 is discussed after the main result. In complete analogy to Assumption 2, Assumption 7 is needed for identifiability of the search costs. We emphasize that the inverse learner only has the average number of times each agent searches a particular location. The inverse learner does not know the stopping time or the order in which the agents search the locations.

In analogy to Sec. 2, the inverse learner can only identify if the observed agents are relatively optimal search agents, as discussed below. The inverse learner cannot identify absolute optimality in the sense of (30) in Definition 4.1 due to the finite number of observed strategies $\{\mu_m,\ m \in \mathcal{M}\}$.

{definition}

[Relative Optimality for Search] The Bayesian search agents parametrized by the tuple (25) comprise relatively optimal search agents (see Fig. 1) iff (33) holds.

$$\mu_m = \operatorname*{argmin}_{\mu \in \{\mu_n,\ n \in \mathcal{M}\}} J(\mu, l_m) \tag{33}$$

In complete analogy with (10) in Definition 2.3, $J(\mu, l_m)$ in the above equation is the expected cumulative search cost (29) of the agent.

We are now ready to present our main IRL result for the inverse search problem. The result specifies a set of linear inequalities that are simultaneously necessary and sufficient for the search agents to be relatively optimal search agents ((33) in Definition 4.2).

{theorem}

[IRL for Bayesian Search] Consider the inverse learner with dataset (31) obtained from the search agents $\mathcal{M}$. Assume Assumption 6 holds. Then:
1. Identifiability: The inverse learner can identify if the dataset is generated by relatively optimal search agents (Definition 4.2).
2. Existence: is a collection of relatively optimal search agents, if and only if there exists a feasible solution to the following linear (in search costs) inequalities:

$$\text{Find } l_m(a) \in \mathbb{R}_+,\ l_m(1) = 1,\ \text{s.t. } \text{NIAC}^{\dagger}\big(\mathcal{D}_M(\text{Search})\big) < 0, \text{ where}$$
$$\text{NIAC}^{\dagger}: \sum_{x \in \mathcal{X}} \sum_{a \in \mathcal{A}} \pi_0(x)\big(g_m(a,x) - g_n(a,x)\big)\, l_m(a) < 0, \quad \forall m \ne n \in \mathcal{M}. \tag{34}$$

3. Reconstruction: The set-valued IRL estimate of the agent’s search costs is the set of all feasible solutions to the NIAC inequalities.

The proof of Theorem 4.2 is in Appendix B. Theorem 4.2 provides a set of linear inequalities whose feasibility is equivalent to relative optimality of Bayesian search. Note that Theorem 4.2 uses the action search policies to construct the expected cumulative search costs of the agents and verify if the relative optimality equation (33) for Bayesian search holds. The key idea for the IRL result is to express the expected cost of the search agent in terms of its action search policy (31). Algorithms for linear feasibility such as the simplex method (Boyd and Vandenberghe, 2004) can be used to check feasibility of (34) in Theorem 4.2 and construct a feasible set of search costs for the relatively optimal search agents.
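For intuition, the NIAC-dagger inequalities (34) can be checked directly for any candidate set of search costs: each inequality compares the expected cumulative cost of an agent's own policy against another agent's policy under the first agent's costs. The action search policies and costs below are illustrative toy numbers, not values from the paper.

```python
import numpy as np

def niac_dagger_holds(pi0, g, l):
    """Check the NIAC-dagger inequalities (34):
    sum_{x,a} pi0(x) (g_m(a,x) - g_n(a,x)) l_m(a) < 0 for all m != n.

    g[m] is an (A, X) array of action search policies g_m(a, x);
    l[m] is a length-A vector of agent m's search costs, l[m][0] = 1.
    """
    M = len(g)
    for m in range(M):
        for n in range(M):
            if m == n:
                continue
            # Cost of agent m's policy minus agent n's policy, both
            # evaluated under agent m's own search costs l_m.
            diff = np.sum(pi0[None, :] * (g[m] - g[n]) * l[m][:, None])
            if diff >= 0:
                return False
    return True

pi0 = np.array([0.5, 0.5])
g = [np.array([[1.5, 1.0], [1.0, 1.5]]),   # agent 0 searches evenly
     np.array([[2.0, 2.5], [0.5, 1.0]])]   # agent 1 avoids costly location 1
l = [np.array([1.0, 1.0]), np.array([1.0, 3.0])]
consistent = niac_dagger_holds(pi0, g, l)
```

With these numbers the inequalities hold, so the two observed policies are consistent with the candidate costs; swapping the two cost vectors violates (34).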

### 4.3 Discussion of Assumption 6

To motivate Assumption 6, suppose for each agent $m$ the inverse learner records the state and actions over $K$ independent trials. Then, the variable $g_m(a,x)$ in (31) is the limit of the empirical estimate $\hat{g}_m(a,x)$ below as the number of trials $K \to \infty$.

$$\hat{g}_m(a,x) = \frac{\sum_{k=1}^K \sum_{t=1}^{\tau_{k,m}} \mathbb{1}\{x_{k,m} = x,\ a_{t,k,m} = a\}}{\sum_{k=1}^K \mathbb{1}\{x_{k,m} = x\}}. \tag{35}$$

In complete analogy to Sec. 2.4, almost sure convergence holds by Kolmogorov’s strong law of large numbers. $\hat{g}_m(a,x)$ is the average number of times agent $m$ searches location $a$ when the target is in location $x$. More formally, for a fixed state $x$, $g_m(a,x)$ is the expected number of times the posterior belief of the agent visits the region in the unit simplex of pmfs where it is optimal to choose action $a$. In Appendix B, we discuss how the action search policy can be used to express the agent’s cumulative expected search cost (29).
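The empirical estimate (35) can be computed directly from recorded trials. Below is a sketch; the trial data is a hypothetical toy example where each trial records the true target location and the sequence of searched locations.

```python
import numpy as np

def empirical_g(trials, X, A):
    """Empirical action search policy (35): average number of times
    location a is searched, given the target is at location x."""
    counts = np.zeros((A, X))      # numerator of (35)
    visits = np.zeros(X)           # denominator of (35)
    for x, actions in trials:
        visits[x] += 1
        for a in actions:
            counts[a, x] += 1
    return counts / visits[None, :]

# Three toy trials: (target location, searched locations until found).
trials = [(0, [1, 0]), (0, [0]), (1, [0, 1])]
g_hat = empirical_g(trials, X=2, A=2)
```

For instance, location 1 is searched once in one of the two trials with target at location 0, so `g_hat[1, 0]` is 0.5.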

Remark: Analogous to the action selection policy (15) for stopping problems with multiple stopping actions, the inverse learner uses the action search policy to identify Bayes optimality in stopping problems with multiple continue actions (and a single stop action) and to estimate the continue costs.

### 4.4 Numerical Example Illustrating Inverse Search

The following numerical example illustrates Theorem 4.2 for inverse search.

Search Agents. We consider $M = 3$ search agents with:

• Prior .

• Search locations: .

• Reveal probability: .

• Search costs:
Agent 1: ,
Agent 2: ,
Agent 3: .

(Recall that WLOG the search cost $l_m(1)$ can be set to $1$ for all $m \in \mathcal{M}$.)

Inverse Learner specification. Next we consider the inverse learner. We generate $K$ sample paths for the agents using the above parameters. Recall from Theorem 4.2 that the inverse learner uses the dataset $\mathcal{D}_M(\text{Search})$ to perform IRL for search. Here

$$\mathcal{D}_M(\text{Search}) = \big(\pi_0,\ \big(\hat{g}_m(a,x),\ m \in \{1,2,3\}\big)\big), \tag{36}$$

where the second term is the empirically calculated action search policy (35) for search agent $m$ from the generated samples.

IRL Result. The inverse learner performs IRL by using the dataset