Policy Certificates: Towards Accountable Reinforcement Learning

Christoph Dann (Carnegie Mellon University), Lihong Li (Google Brain), Wei Wei (Google Cloud AI Research), Emma Brunskill (Stanford University)

Abstract

The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about their current policy’s quality before executing it, and thus have limited use in high-stakes applications like healthcare. In this paper, we address such a lack of accountability by proposing that algorithms output policy certificates, which upper bound the sub-optimality in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further present a new learning framework (IPOC) for finite-sample analysis with policy certificates, and develop two IPOC algorithms that enjoy guarantees for the quality of both their policies and certificates.

1 Introduction

There is increasing excitement around applications of machine learning (ML), but also growing awareness and concern. Recent research on FAT (fairness, accountability and transparency) ML aims to address these concerns, but most work focuses on supervised learning settings and only a few works address reinforcement learning or sequential decision making in general (Jabbari et al., 2016; Joseph et al., 2016; Kannan et al., 2017; Raghavan et al., 2018).

One challenge when applying reinforcement learning (RL) in practice is that, unlike in supervised learning, the performance of an RL algorithm is typically not monotonically increasing with more data due to the trial-and-error nature of RL that necessitates exploration. Even sharp drops in policy performance during learning are possible, for example, when the agent starts to explore a new part of the state space. Such unpredictable performance fluctuation has limited the use of RL in high-stakes applications like healthcare, and calls for more accountable algorithms that can quantify and reveal their performance during learning.

In this work, we propose that an RL algorithm outputs policy certificates, a form of confidence interval, in episodic reinforcement learning. Policy certificates are upper bounds on how far from optimal the return (expected sum of rewards) of an algorithm in the next episode can be. They allow one to monitor the policy’s performance and intervene if necessary, thus improving accountability of the algorithm. Formally, we propose a theoretical framework called IPOC that guarantees not only that certificates are valid performance bounds but also that both the algorithm’s policy and its certificates improve with more data.

There are two relevant lines of research on RL with guaranteed performance for episodic reinforcement learning. The first concerns frameworks for guaranteeing the performance of an RL algorithm across many episodes, as it learns. Such frameworks, like regret (Jaksch et al., 2010), PAC (probably approximately correct, Kakade, 2003; Strehl et al., 2009) and Uniform-PAC (Dann et al., 2017), all provide a priori bounds on the cumulative performance of the algorithm, such as bounding the total number of times an algorithm may execute a policy that is not near optimal. However, these frameworks do not provide bounds for any individual episode. In contrast, the second line of work focuses on estimating and guaranteeing the performance of a particular RL policy, given some prior data (e.g., Thomas et al., 2015b; Jiang and Li, 2016; Thomas and Brunskill, 2016). Such work typically provides limited or no guarantees for algorithms that are learning and updating their policies across episodes. In this paper, we unite both lines of work by providing performance guarantees online for a reinforcement learning algorithm in individual episodes and across all episodes. In fact, we show that bounds in our new IPOC framework imply strong guarantees in existing regret and PAC frameworks.

We consider policy certificates in two settings, finite episodic Markov decision processes (MDPs) and, more generally, finite MDPs with episodic side information (context) (Abbasi-Yadkori and Neu, 2014; Hallak et al., 2015; Modi et al., 2018). The latter is of particular interest in practice. For example, in a drug treatment optimization task where each patient is one episode, context is the background information of the patient which influences the treatment outcome. While one expects the algorithm to learn a good policy quickly for frequent contexts, the performance for unusual patients may be significantly more variable due to the limited prior experience of the algorithm. Policy certificates allow humans to detect when the current policy is good for the current patient and to intervene if the certified performance is deemed inadequate. For example, in this healthcare application, a human expert could intervene by directly specifying the policy for that episode; in automated customer service, the service could instead be provided at reduced cost to the customer.

Existing algorithms based on the optimism-in-the-face-of-uncertainty (OFU) principle (e.g., Auer et al., 2009) are natural to extend to learning with policy certificates. We demonstrate this by extending the UBEV algorithm (Dann et al., 2017) for episodic MDPs with finite state and action spaces, and show that, with high probability, the number of episodes in which it outputs a certificate greater than a threshold $\epsilon$ is bounded for all $\epsilon$ simultaneously (Theorem 4). For problems with side information, we propose an algorithm that learns with policy certificates in episodic MDPs with adversarial linear side information (Abbasi-Yadkori and Neu, 2014; Modi et al., 2018), and bound, up to log terms, the rate at which the cumulative sum of certificates can grow (Theorem 5).

2 Setting and Notation

In this work, we consider episodic RL problems where the agent interacts with the environment in episodes of a certain length. While the framework for policy certificates applies to a wide range of problems, we focus on finite Markov decision processes (MDPs) with linear side information (Modi et al., 2018; Hallak et al., 2015; Abbasi-Yadkori and Neu, 2014) for concreteness. This setting includes tabular MDPs as a special case but is more general and can model variations in the environment across episodes, e.g., because different episodes correspond to treating different patients in a healthcare application. Unlike in the tabular special case, function approximation is necessary for efficient learning.

Finite MDPs with linear side information.

The agent interacts in episode by observing a state , taking action and observing the next state as well as a scalar reward . This interaction loop continues for time steps , before a new episode starts. We assume that state- and action-space are of finite sizes and , respectively, as in the widely considered tabular MDPs (Osband and Van Roy, 2014; Dann and Brunskill, 2015; Azar et al., 2017; Jin et al., 2018). But here, the agent essentially interacts with a family of infinitely many tabular MDPs that is parameterized by linear contexts. At the beginning of episode , two contexts, and , are observed and the agent interacts in this episode with a tabular MDP, whose dynamics and reward function depend on the contexts in a linear fashion. Specifically, it is assumed that the rewards are sampled from with means and transition probabilities are where and are unknown parameter vectors for each . As a regularity condition, we assume bounded parameters, i.e., and as well as bounded contexts and . We allow and to be different, and use to denote in the following. To further simplify notation, we assume w.l.o.g. that there is a fixed start state. Note that there is no assumption of the distribution of contexts; our framework and algorithms can handle adversarially chosen contexts.

Return and optimality gap.

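The quality of a policy $\pi$ in any episode $k$ is evaluated by the total expected reward or return: $\rho_k(\pi) := \mathbb{E}\left[\sum_{h=1}^{H} r_{k,h} \mid a_{k,1:H} \sim \pi\right]$, where $r_{k,h}$ is the reward at time step $h$ of episode $k$, $H$ is the episode length, and this notation means that all actions in the episode are taken as prescribed by $\pi$. We focus here on deterministic time-dependent policies and note that the optimal policy and return $\rho_k^\star = \max_\pi \rho_k(\pi)$ depend on the context of the episode. The difference between the optimal and the achieved return is called the optimality gap $\Delta_k := \rho_k^\star - \rho_k(\pi_k)$ for each episode $k$, where $\pi_k$ is the algorithm’s policy in that episode.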
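To make the return and the optimality gap concrete, the following sketch evaluates both by backward induction on a tiny tabular MDP with known dynamics. The two-state MDP, the dictionary layout and the function names are illustrative assumptions, not part of the paper; in the learning setting the true model is unknown and the gap cannot be computed this way.

```python
from typing import Dict, List, Tuple

# P[s][a] is a list of (next_state, probability); R[s][a] is the mean reward.
Transitions = Dict[int, Dict[int, List[Tuple[int, float]]]]
Rewards = Dict[int, Dict[int, float]]


def optimal_return(P: Transitions, R: Rewards, H: int, start: int) -> float:
    """Optimal H-step return from `start`, computed by backward induction."""
    V = {s: 0.0 for s in P}  # value after the last time step
    for _ in range(H):
        V = {
            s: max(R[s][a] + sum(p * V[s2] for s2, p in P[s][a]) for a in P[s])
            for s in P
        }
    return V[start]


def policy_return(P: Transitions, R: Rewards, H: int, start: int,
                  policy: List[Dict[int, int]]) -> float:
    """Return of a deterministic time-dependent policy; policy[h][s] is an action."""
    V = {s: 0.0 for s in P}
    for h in reversed(range(H)):
        V = {
            s: R[s][policy[h][s]]
            + sum(p * V[s2] for s2, p in P[s][policy[h][s]])
            for s in P
        }
    return V[start]


# Hypothetical example: state 1 pays reward 1 per step, state 0 pays little.
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]}, 1: {0: [(1, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.1, 1: 0.0}, 1: {0: 1.0, 1: 1.0}}
H = 3
stay_policy = [{0: 0, 1: 0} for _ in range(H)]  # never leaves state 0

gap = optimal_return(P, R, H, 0) - policy_return(P, R, H, 0, stay_policy)
```

Here the optimal policy moves to state 1 immediately, while `stay_policy` keeps collecting the small reward in state 0, so `gap` is strictly positive.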

Additional notation.

We denote by the largest optimality gap possible and and are the Q- and value function of in episode . Optimal versions are marked by superscript and subscripts are omitted when unambiguous. We often treat as linear operator, that is, for any .

3 Existing Learning Frameworks

During execution, the optimality gaps are hidden; the algorithm only observes the sum of rewards, which is a sample of the return of its current policy. This causes risk, as one does not know when the algorithm is playing a good policy and when it is playing a potentially bad one. One might hope that performance guarantees for algorithms mitigate this risk, but no existing theoretical framework gives guarantees for individual episodes during learning:

  • Mistake-style PAC bounds (Strehl et al., 2006, 2009; Szita and Szepesvári, 2010; Lattimore and Hutter, 2012; Dann and Brunskill, 2015) bound the number of $\epsilon$-mistakes, that is, the size of the superlevel set $\{k : \Delta_k > \epsilon\}$ of the optimality gaps $\Delta_k$, with high probability, but do not tell us when mistakes can happen. The same is true for the recently proposed stronger Uniform-PAC bounds (Dann et al., 2017), which hold for all $\epsilon$ jointly.

  • Supervised-learning style PAC bounds (Kearns and Singh, 2002; Jiang et al., 2017; Dann et al., 2018) guarantee that the algorithm outputs an $\epsilon$-optimal policy for a given $\epsilon$, that is, they ensure that $\Delta_k \le \epsilon$ for $k$ greater than the bound. They do, however, require knowing $\epsilon$ ahead of time and do not give any guarantee about $\Delta_k$ during learning (when $k$ is smaller than the bound).

  • Regret bounds (Osband et al., 2013, 2016; Azar et al., 2017; Jin et al., 2018) control the cumulative sum of optimality gaps $\sum_k \Delta_k$ (regret), which does not yield any nontrivial guarantee for an individual $\Delta_k$ because it does not tell which optimality gaps are small.
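The last point can be illustrated with two hypothetical gap sequences that have identical regret but very different per-episode behaviour. The numbers below are made up for illustration only.

```python
from typing import Sequence


def regret(gaps: Sequence[float]) -> float:
    """Cumulative sum of optimality gaps."""
    return sum(gaps)


def num_mistakes(gaps: Sequence[float], eps: float) -> int:
    """Number of episodes whose optimality gap exceeds eps."""
    return sum(1 for g in gaps if g > eps)


# Hypothetical sequences: uniformly mediocre versus one very bad episode.
gaps_spread = [0.25, 0.25, 0.25, 0.25]
gaps_spiky = [0.0, 0.0, 0.0, 1.0]

same_regret = regret(gaps_spread) == regret(gaps_spiky)
worst_spread, worst_spiky = max(gaps_spread), max(gaps_spiky)
```

Both sequences have the same regret, yet only the second contains an episode with the largest possible gap, and a regret bound alone cannot reveal which episode that is.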

Not knowing during learning makes it difficult to stop an algorithm at some point and extract a good policy. For example, the common way to extract a good policy for algorithms with regret bound is to pick one of the policies executed so far at random (Jin et al., 2018). This only yields a good policy that has with probability in general. As a result, one requires episodes for a good policy with probability at least which is much larger than the of algorithms with supervised-learning style PAC bounds. Note that the KWIK framework (Li et al., 2008) does guarantee the quality of individual predictions but is for supervised learning; its use in RL leads to mistake-style PAC bounds (see Section 7).

4 The IPOC Framework

We introduce a new learning framework that mitigates the limitations of prior guarantees highlighted above. This framework forces the algorithm to output its current policy $\pi_k$ as well as a certificate $\epsilon_k$ before each episode $k$. This certificate informs the user how sub-optimal the policy can be for the current context, i.e., $\epsilon_k \ge \Delta_k$ where $\Delta_k$ is the optimality gap in episode $k$, and allows one to intervene if needed. For example, in automated customer services, one might reduce the service price in episode $k$ if certificate $\epsilon_k$ is above a certain threshold, since the quality of the provided service cannot be guaranteed. When there is no context, a certificate upper bounds the suboptimality of the current policy in any episode, which makes algorithms anytime interruptible (Zilberstein and Russell, 1996): one is guaranteed to always know a policy with improving performance. Our learning framework is formalized as follows:
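A minimal sketch of how a user might act on certificates is given below. The function name, the string policy labels and the fallback policy are hypothetical; the only idea taken from the text is that the policy is replaced (or some other intervention is triggered) whenever the certificate exceeds a user-chosen threshold.

```python
from typing import Iterable, List, Tuple


def run_with_monitor(outputs: Iterable[Tuple[str, float]],
                     threshold: float,
                     fallback: str) -> List[str]:
    """For each (policy, certificate) pair, execute the policy only if its
    certificate is at most `threshold`; otherwise use the fallback."""
    executed = []
    for policy, certificate in outputs:
        executed.append(policy if certificate <= threshold else fallback)
    return executed


# Hypothetical outputs of a learning algorithm over three episodes.
episodes = [("pi_1", 0.9), ("pi_2", 0.4), ("pi_3", 0.05)]
executed = run_with_monitor(episodes, threshold=0.5, fallback="expert")
```

In this example only the first episode triggers the intervention, since its certificate is the only one above the threshold.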

Definition 1 (Individual Policy Certificates (IPOC) Bounds).

An algorithm satisfies an individual policy certificate (IPOC) bound $F$ if for a given $\delta \in (0, 1)$ it outputs a certificate $\epsilon_k$ and the current policy $\pi_k$ before each episode $k$ (after observing the contexts) so that with probability at least $1 - \delta$

  1. all certificates are upper bounds on the sub-optimality of the policy played in episode $k$, i.e., $\epsilon_k \ge \Delta_k$ for all $k$, where $\Delta_k$ is the optimality gap in episode $k$; and either

  2a. for all numbers of episodes $T$, the cumulative sum of certificates is bounded as $\sum_{k=1}^{T} \epsilon_k \le F(W, T, \delta)$ (Cumulative Version), or

  2b. for any threshold $\epsilon$, the number of times certificates can exceed the threshold is bounded as $\sum_{k=1}^{\infty} \mathbf{1}\{\epsilon_k > \epsilon\} \le F(W, \epsilon, \delta)$ (Mistake Version).

Here, $W$ can be (known or unknown) properties of the environment. If conditions 1 and 2a hold, we say the algorithm has a cumulative IPOC bound and if conditions 1 and 2b hold, we say the algorithm has a mistake IPOC bound.
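The quantities that the definition constrains can be computed on a recorded run, which may help when reading the conditions. The sketch below is an offline diagnostic under the assumption that the true optimality gaps are available (as in a simulation); the function names and the example trace are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def certificates_valid(certs: Sequence[float], gaps: Sequence[float]) -> bool:
    """Condition 1: every certificate upper bounds the gap of its episode."""
    return all(c >= g for c, g in zip(certs, gaps))


def cumulative_certificates(certs: Sequence[float]) -> List[float]:
    """Left-hand side of the cumulative version, for every prefix length."""
    return list(accumulate(certs))


def certificate_mistakes(certs: Sequence[float], eps: float) -> int:
    """Left-hand side of the mistake version for threshold eps."""
    return sum(1 for c in certs if c > eps)


# Hypothetical trace of four episodes.
certs = [1.0, 0.5, 0.75, 0.25]
gaps = [0.5, 0.25, 0.5, 0.0]

valid = certificates_valid(certs, gaps)
prefix_sums = cumulative_certificates(certs)
mistakes = certificate_mistakes(certs, eps=0.5)
```

An IPOC bound then asserts that, with high probability, `valid` holds and either every entry of `prefix_sums` or the value of `mistakes` (for every threshold) stays below the corresponding bound.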

Condition 1 alone would be trivial to satisfy by always outputting the largest possible optimality gap as the certificate, but condition 2 prohibits this by controlling the size of $\epsilon_k$. Condition 2a bounds the cumulative sum of certificates (similar to regret bounds), and condition 2b bounds the size of the superlevel sets of $\epsilon_k$ (similar to PAC bounds). We allow both alternatives as condition 2b is stronger but one sometimes can only prove condition 2a (see Sec. 5.2.1). An IPOC bound controls simultaneously the quality of certificates (how big $\epsilon_k - \Delta_k$ is) as well as the optimality gaps $\Delta_k$ themselves. Hence, an IPOC bound guarantees not only that the algorithm improves its policy but also that it becomes better at telling us how well the policy performs. As such it is stronger than existing frameworks. Besides this benefit, IPOC ensures that the algorithm is anytime interruptible, i.e., it can be used to find better and better policies that have small $\Delta_k$ with high probability $1 - \delta$. That means IPOC bounds imply supervised learning style PAC bounds for all $\epsilon$ jointly. These claims are formalized in the following statements:

Proposition 2.

Assume an algorithm has a cumulative IPOC bound .

  1. Then it has a regret bound of same order, i.e., with probability at least , for all the regret is bounded by .

  2. If has the form for appropriate functions , then with probability at least for any , it outputs a certificate within


    episodes. For settings without context, this implies that the algorithm outputs an -optimal policy within that number of episodes (supervised learning-style PAC bound).

Proposition 3.

If an algorithm has a mistake IPOC bound , then

  1. it has a uniform PAC bound , i.e., with probability at least , the number of episodes with is at most for all ;

  2. with probability at least for all , it outputs a certificate within episodes. For settings without context, that means the algorithm outputs an -optimal policy within that many episodes (supervised learning-style PAC).

  3. if has the form with it also has a cumulative IPOC bound of order

Note that the functional form in part 2 of Prop. 2 includes all common polynomial bounds like or with appropriate factors and similarly for part 3 of Prop. 3 which covers for example .
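The step from a cumulative bound to a single small certificate is an averaging argument. The sketch below writes it out, using $\epsilon_k$ for the certificate of episode $k$ and $F(T)$ as shorthand for the cumulative bound after $T$ episodes (both are notation assumed here). The square-root form in the second line is an illustrative assumption with a hypothetical constant $C$, not a bound taken from the text.

```latex
\begin{align*}
T \cdot \min_{k \le T} \epsilon_k \;\le\; \sum_{k=1}^{T} \epsilon_k \;\le\; F(T)
\quad &\Longrightarrow \quad
\min_{k \le T} \epsilon_k \;\le\; \frac{F(T)}{T}, \\
F(T) = C\sqrt{T} \ \text{(illustrative)}
\quad &\Longrightarrow \quad
\min_{k \le T} \epsilon_k \;\le\; \epsilon
\ \text{ whenever } \ T \ge \frac{C^2}{\epsilon^2}.
\end{align*}
```

Any cumulative bound that grows sublinearly in $T$ therefore forces some certificate below every fixed threshold after finitely many episodes.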

5 Algorithms with Policy Certificates

As shown above, IPOC is stricter than other learning frameworks. Existing algorithms based on the OFU principle (Auer et al., 2009) need extensions to satisfy IPOC bounds. OFU algorithms can be interpreted as maintaining a set of models defined by confidence sets of the individual components and picking the policy optimistically from that set of models. As a byproduct, this yields an upper confidence bound on the optimal value function and therefore on the optimal return. We augment this by recursively computing a lower confidence bound on the value function of the optimistic policy using the same confidence set of models. This yields a lower confidence bound on the return of that policy, which together with the upper bound is sufficient to compute a certificate. We demonstrate this approach by extending two similar OFU algorithms, one for tabular MDPs with no side information, and the other for the more general case with side information. While the algorithms have similar structure, we consider them separately because we can prove stronger IPOC guarantees for the first (see Section 5.2.1).

5.1 Policy Certificates in Tabular MDPs

[Algorithm 1: ORLC (Optimistic Reinforcement Learning with Certificates). Input: failure tolerance. Per episode: optimistic planning and policy evaluation (with value clipping); output of the policy with its certificate; execution of the policy for one episode; update of statistics.]

We present an extension of the UBEV algorithm by Dann et al. (2017), called ORLC (optimistic reinforcement learning with certificates), shown in Algorithm 1. Algorithm 1 essentially combines the policy selection approach of UBEV with high-confidence model-based policy evaluation of the current policy. Before each episode, Algorithm 1 computes an optimistic estimate of the optimal Q-function as well as a pessimistic estimate of the Q-function of the current policy, by dynamic programming on the empirical model with separate confidence widths for the two estimates. Note that the widths of the lower confidence bounds are larger by a multiplicative factor than those of the upper confidence bounds, as the estimation target of the pessimistic estimate is a random quantity due to its dependency on the current policy, in contrast to the optimal Q-function (see discussion below).
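The planning step described above can be sketched as a single backward pass that maintains an upper and a lower value estimate side by side. This is a simplified illustration and not Algorithm 1 itself: the Hoeffding-style width, the use of the same width for both bounds, and all names are assumptions made for this sketch, and rewards are assumed to lie in [0, 1].

```python
import math
from typing import List, Tuple


def plan_with_certificate(counts: List[List[int]],
                          r_hat: List[List[float]],
                          p_hat: List[List[List[float]]],
                          H: int, delta: float,
                          start: int = 0) -> Tuple[List[List[int]], float]:
    """Return (policy, certificate); policy[h][s] is the optimistic action."""
    S, A = len(r_hat), len(r_hat[0])
    V_up, V_lo = [0.0] * S, [0.0] * S  # values after the last time step
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        remaining = H - h  # largest return achievable from step h onwards
        new_up, new_lo = [0.0] * S, [0.0] * S
        for s in range(S):
            q_up, q_lo = [], []
            for a in range(A):
                n = max(counts[s][a], 1)
                # Hypothetical confidence width, not the paper's widths.
                width = remaining * math.sqrt(math.log(2.0 / delta) / (2 * n))
                exp_up = sum(p_hat[s][a][t] * V_up[t] for t in range(S))
                exp_lo = sum(p_hat[s][a][t] * V_lo[t] for t in range(S))
                # Clip values to the feasible range [0, remaining].
                q_up.append(min(remaining, r_hat[s][a] + exp_up + width))
                q_lo.append(max(0.0, r_hat[s][a] + exp_lo - width))
            best = max(range(A), key=q_up.__getitem__)  # optimistic action
            policy[h][s] = best
            new_up[s], new_lo[s] = q_up[best], q_lo[best]
        V_up, V_lo = new_up, new_lo
    return policy, V_up[start] - V_lo[start]


# Hypothetical empirical model with two states and two actions.
r_hat = [[0.5, 0.2], [1.0, 0.0]]
p_hat = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
few = [[1, 1], [1, 1]]
many = [[10**6, 10**6], [10**6, 10**6]]

_, cert_few = plan_with_certificate(few, r_hat, p_hat, H=2, delta=0.1)
pol_many, cert_many = plan_with_certificate(many, r_hat, p_hat, H=2, delta=0.1)
```

With few observations the certificate is vacuous (equal to the horizon), while with many observations the upper and lower estimates nearly coincide and the certificate becomes small.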

We show the following IPOC bound for this algorithm:

Theorem 4 (Mistake IPOC Bound of Algorithm 1).

For any given , Algorithm 1 satisfies in any tabular MDP with states, actions and horizon , the following Mistake IPOC bound: For all , the number of episodes where the algorithm outputs a certificate is


By Proposition 3, this implies a PAC bound of same order as well as a regret bound. The PAC lower bound (Dann and Brunskill, 2015) for this setting is , which implies an IPOC mistake lower bound of the same order by Proposition 3. We conjecture that Algorithm 1 satisfies a Uniform-PAC bound of order , which is by a factor lower than the Uniform-PAC bound of UBEV due to our assumption of time-independent dynamics. Using techniques by Azar et al. (2017), this bound can be reduced to match the lower PAC bound. However, as we sketch below, existing techniques cannot be directly applied to the lower confidence bounds in Algorithm 1. It is therefore an open question whether our IPOC bound in Theorem 4 is improvable, or whether the IPOC lower bound is strictly larger than the PAC lower bound. Interestingly, in the related active learning setting, such a discrepancy between achieved and certifiable performance is known to exist (Balcan et al., 2010).

5.1.1 Proof Sketch of the IPOC Bound

To show Theorem 4, we need to verify condition 1 and 2b of Definition 1. Condition 2b can be shown in similar way to existing Uniform-PAC (Dann et al., 2017) bounds but with optimality gaps being replaced by certificates . For condition it is sufficient to show and holds in all episodes . We use additional subscripts to indicate the value of variables before sampling in episode . Proving optimism, , is standard in analyses of OFU algorithms. Hence, we focus here on showing . When there is no value clipping one can use the following common decomposition (Azar et al., 2017; Jin et al., 2018; Dann et al., 2017). All terms are functions of which we omit in the following for readability.


Here, we expanded and used . We want to show that this value difference is non-negative. Using standard martingale concentration, one can show . It remains to show that . We decompose the left-hand side as

Using an inductive assumption that , the first term cannot be positive. Note that when showing optimism () the second term is which is a martingale that can be bounded directly by . Unfortunately, the second term here is not a martingale as and both depend on the samples. For that reason, we have to resort to Hölder’s inequality to decompose


and apply concentration bounds on the distance of empirical distributions to get the upper bound . This is why the lower confidence bound width are by a factor larger than the upper confidence bound widths . Eventually, this yields a IPOC bound compared to the conjectured Uniform-PAC bound with dependency in the term. Similarly, the difference in -dependency of our IPOC and conjectured Uniform-PAC bounds origins from leveraging Bernstein’s inequality for the upper confidence bound widths. That requires bounding how much larger the empirical variance estimate of -value of next state can be compared to using target values. While this is possible by exploiting that is monotonically decreasing with (Azar et al., 2017, Equation 5), this technique cannot be applied to the lower confidence widths as is not monotone in .

5.2 Policy Certificates in MDPs With Linear Side Information

[Algorithm 2: ORLC-SI (Optimistic Reinforcement Learning with Certificates and Side Information). Input: failure probability, regularizer. Per episode: observe current contexts; estimate model with least squares; optimistic planning (with value clipping); output of the policy with its certificate; execution of the policy for one episode; update of statistics.]

After considering the tabular MDP setting, we now present an algorithm for the more general setting with side information, which for example allows us to take background information about a customer into account and generalize across different customers.

Algorithm 2 gives an extension, called ORLC-SI, of the OFU algorithm by Abbasi-Yadkori and Neu (2014). Similar to tabular Algorithm 1, Algorithm 2 computes as an upper bound on the optimal Q-function and as a lower bound on the Q-function of the current policy using dynamic programming with the empirical model as well as confidence bound widths and . Unlike in the tabular case, the empirical model is now computed as least-squares estimates of the model parameters evaluated at the current contexts. Specifically, the empirical transition probability is where is the least squares estimate of model parameter . Since transition probabilities are normalized, this estimate is then clipped to . Note that this empirical model is estimated separately for each -triple, but does generalize across different contexts. The confidence widths and are derived using ellipsoid confidence intervals on model parameters (Abbasi-Yadkori and Neu, 2014). We show the following IPOC bound:
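The model-estimation step can be illustrated with a regularized least-squares fit for a single transition, followed by evaluation at the current context and clipping. The sketch below is an assumption-laden illustration: it regresses a 0/1 indicator of "the next state was the one of interest" on the context for one fixed state-action pair, and the names `ridge_estimate` and `predicted_probability` are hypothetical.

```python
from typing import List, Sequence


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_estimate(contexts: Sequence[Sequence[float]],
                   targets: Sequence[float], lam: float) -> List[float]:
    """Regularized least squares: (lam * I + sum x x^T)^{-1} sum x * target."""
    d = len(contexts[0])
    N = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    y = [0.0] * d
    for x, t in zip(contexts, targets):
        for i in range(d):
            y[i] += x[i] * t
            for j in range(d):
                N[i][j] += x[i] * x[j]
    return solve(N, y)


def predicted_probability(theta: Sequence[float],
                          context: Sequence[float]) -> float:
    """Evaluate the linear model at a context and clip to [0, 1]."""
    raw = sum(t * c for t, c in zip(theta, context))
    return min(1.0, max(0.0, raw))


# Hypothetical data: the transition occurs only under the first context.
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
indicators = [1.0, 0.0, 1.0, 0.0]
theta = ridge_estimate(contexts, indicators, lam=1.0)
```

Evaluating `predicted_probability(theta, [2.0, 0.0])` shows why clipping is needed: the raw linear prediction exceeds one for contexts outside the range seen so far.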

Theorem 5 (Cumulative IPOC Bound for Alg. 2 ).

For any and regularization parameter , Algorithm 2 satisfies the following cumulative IPOC bound in any MDP with states, actions, contexts with dimensions and as well as bounded parameters and . With probability at least all certificates are upper bounds on the optimality gaps and their total sum after episodes is bounded for all by


By Proposition 2, this IPOC bound implies a regret bound of the same order which improves on the regret bound of Abbasi-Yadkori and Neu (2014) with by a factor of . While they make a different modelling assumption (generalized linear instead of linear), we believe at least our better dependency is due to using improved least-squares estimators for the transition dynamics (see footnote 1) and can likely be transferred to their setting. The mistake-type PAC bound by Modi et al. (2018) is not directly comparable because our cumulative IPOC does not imply a mistake-type PAC bound (see footnote 2). Nonetheless, loosely translating our result to a PAC-like bound yields which is much smaller than their bound for sufficiently small .

Footnote 1: They estimate only from samples where the transition was observed instead of all occurrences of (no matter whether was the next state).

Footnote 2: Similar to regret and PAC bounds (Dann et al., 2017), an algorithm with a sublinear cumulative IPOC bound can still output a certificate larger than a given threshold infinitely often as long as it does so sufficiently less frequently (see Section 5.2.1).

The confidence bounds in Algorithm 2 are more general but looser compared to the confidence bounds specialized to the tabular case in Algorithm 1, in particular the upper confidence bounds. Instantiating the cumulative IPOC bound for Algorithm 2 from Theorem 5 for tabular MDPs (where for all ) yields which is worse than the cumulative IPOC bound for Algorithm 1 implied by Theorem 4.

5.2.1 Mistake IPOC Bound for Algorithm 2?

By Proposition 3, a mistake IPOC bound is stronger than the cumulative version we proved for Algorithm 2. One might wonder whether Algorithm 2 also satisfies this stronger bound, but this is not the case:

Proposition 6.

For any , there is an MDP with linear side information such that Algorithm 2 outputs certificates infinitely often with probability .

Proof Sketch.

Consider a two-armed bandit where the two-dimensional context is identical to the deterministic reward for both actions. The context alternates between and . That means in odd-numbered episodes, the agent receives reward for action and reward for action (bandit A) and conversely in even-numbered episodes (bandit B). Let and be the current number of times action was played in each bandit and the covariance matrix. One can show that the optimistic Q-value of action in bandit A is lower bounded as


Assume now the agent stops playing action 2 in bandit A and playing action 1 in bandit B at some point. Then the denominator in Eq (9) stays constant but the numerator grows unboundedly as . That implies that but the optimistic Q-value for the other action approaches the true reward. Eventually and the agent will play the -suboptimal action in bandit A again. Hence, Algorithm 2 has to output infinitely many . ∎

The construction in the proof illustrates that the non-decreasing nature of the ellipsoid confidence intervals cause this negative result (due to the term in in Line 2 of Alg 2). This does not rule out alternative algorithms with mistake IPOC bound for this setting, but they would likely require entirely different parameter estimators and confidence bounds.

6 Simulation Experiments

Figure 1: Results of Algorithm 2 for 4M episodes on an MDP with context distribution shift after 2M; Left: certificates and true (unobserved) optimality gap in temporal order (episodes sub-sampled for better visualization); Right: Scatter plot of all certificates and optimality gaps.

Certificates need to upper bound the optimality gap in each episode, even for the worst case up to a small failure probability, and Algorithms 1 and 2 are not optimized for empirical performance. As such, their certificates may be conservative, and potentially significantly overestimate the unobserved optimality gaps without further empirical tuning. Yet, one may wonder whether the certificates output by Algorithms 1 and 2 are simply a monotonically decreasing sequence, or whether they can indicate the actual performance variation during learning. In this section, we present the results of a small simulation study, which demonstrates that the certificates do inform us about when the algorithms execute a bad policy. For brevity, we focus on the more general Algorithm 2 in tasks with side information. Details are available in Appendix D.

We first apply Algorithm 2 to randomly generated contextual bandit problems (horizon 1) with dimensional context and actions (see footnote 3). Certificates and optimality gaps have a correlation of which confirms that certificates are informative about the policy’s return. If one for example needs to intervene when the policy is more than from optimal (e.g., by reducing the price for that customer), then in more than of the cases where the certificate is above , the policy is worse than suboptimal.

Footnote 3: We actually use a slightly more complicated version of Algorithm 2 with better empirical performance but the same theoretical guarantees. All details are in Appendix D. For clarity, we presented the simpler Algorithm 2 in the main paper.
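The two statistics used in this paragraph, the correlation between certificates and gaps and the fraction of flagged episodes that are truly bad, can be computed from a simulation trace as sketched below. The function names and the four-episode trace are hypothetical and unrelated to the reported experiments.

```python
import math
from typing import Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def flagged_precision(certs: Sequence[float], gaps: Sequence[float],
                      cert_threshold: float,
                      gap_threshold: float) -> Optional[float]:
    """Among episodes whose certificate exceeds cert_threshold, the fraction
    whose true gap exceeds gap_threshold (None if nothing is flagged)."""
    flagged = [g for c, g in zip(certs, gaps) if c > cert_threshold]
    if not flagged:
        return None
    return sum(1 for g in flagged if g > gap_threshold) / len(flagged)


# Hypothetical trace of certificates and true optimality gaps.
certs = [0.9, 0.8, 0.3, 0.2]
gaps = [0.5, 0.1, 0.1, 0.0]

corr = pearson(certs, gaps)
precision = flagged_precision(certs, gaps, cert_threshold=0.5, gap_threshold=0.2)
```

A high value of `precision` means that interventions triggered by large certificates are rarely wasted on episodes in which the policy was in fact good.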

In practice, the distribution of contexts can change rapidly. For example, in a call center dialogue system, there can be a sudden increase of customers calling due to a certain regional outage. Such abrupt shifts in contexts are prevalent, and can cause a drop in performance. We demonstrate that certificates can identify such performance drops. We consider a simulated MDP with states, actions and horizon where rewards depend on a -dimensional context and let the distribution of contexts change after M episodes. As seen in Figure 1 (left), this causes a spike in optimality gap as well as certificates. Our algorithm reliably detects this sudden decrease of performance. In fact, the scatter plot in Figure 1 (right) shows that certificates are highly correlated with optimality gaps.

We would like to emphasize that our focus in this paper is to provide a theoretical framework and not to optimize empirical performance. Nonetheless, these experiments indicate that even with no empirical tuning, certificates can be a useful indicator for the optimality gap of our algorithms. We expect that the empirical quality can be significantly improved in future work.

7 Related Work

The connection of IPOC to existing RL frameworks such as PAC and regret is shown formally in Section 4. In addition, IPOC is similar to the KWIK framework (Li et al., 2008), in that the algorithm is required to declare how well it will perform. However, KWIK is a framework for supervised learning algorithms, which are then used as building blocks to create PAC RL algorithms. In contrast, IPOC is a framework specifically for RL methods. Also, KWIK only requires to declare whether the output will perform better than a single threshold that is pre-specified in the input.

Our algorithms essentially compute confidence regions as in OFU algorithms, and then use that information in model-based policy evaluation to obtain policy certificates. There is a large body of work on off-policy policy evaluation (e.g., Jiang and Li, 2016; Thomas and Brunskill, 2016; Mahmood et al., 2017), including a few works that provide non-asymptotic confidence intervals (e.g., Thomas et al., 2015b, a; Sajed et al., 2018). However, these methods focus on the batch setting where a batch of episodes sampled from known policies is given. Many approaches rely on importance weights that require stochastic data-collecting policies, but most sample-efficient algorithms deploy deterministic policies. One could treat previous episodes as having been collected by one stochastic data-dependent policy, but that introduces bias in the importance-weighting estimators that is not accounted for in the analyses. In contrast to these batch offline approaches, our work focuses on providing guarantees for online RL in the form of policy certificates as well as proposing a theoretical framework for controlling the quality of these certificates and the learning speed of the algorithm.

There are also approaches to safe exploration (Kakade and Langford, 2002; Pirotta et al., 2013; Thomas et al., 2015a) that guarantee monotonically increasing policy performance by operating in a batch loop. Our work is orthogonal, as we do not aim to change exploration but rather expose its impact on performance to the users and give them the choice to intervene.

8 Conclusion and Future Work

We have introduced policy certificates to improve accountability in reinforcement learning by enabling users to intervene if the guaranteed performance is deemed inadequate. Bounds in our new theoretical framework IPOC ensure that certificates indeed upper bound the suboptimality in each episode and prescribe the rate at which certificates and policy improve. By extending two optimism-based algorithms, we have not only demonstrated RL with policy certificates but also our IPOC guarantees. This initial work on more accountable RL through online certificates opens up several exciting avenues for future work:

  • The high correlation of policy certificates and optimality gaps in our experiments motivates a more empirical study of RL with policy certificates, including the design of practical algorithms with well-calibrated policy certificates in more challenging settings (e.g., deep RL).

  • Policy certificates enable intervention and our theory already captures interventions that do not hinder execution of the algorithm (e.g., reducing the price for the current customer). As future work, it would be interesting to quantify how other interventions that provide alternative feedback (like expert demonstrations) affect learning speed.

  • Appropriate interventions may depend on why the algorithm chooses a potentially bad policy. The algorithm could for example explicitly explore as opposed to just not being able to exploit. To further improve accountability and interpretability, we could distinguish these cases by comparing certificates of the optimism-based policy and the policy that is optimal in the empirical model.
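One possible reading of the last item can be sketched as a simple decision rule. Everything here is an assumption made for illustration: the function name `diagnose`, the `tolerance` threshold, and the three labels are hypothetical and not part of the paper. The idea is that a small certificate for the empirically optimal policy, combined with a large certificate for the optimism-based policy, suggests deliberate exploration, while two large certificates suggest that the algorithm cannot yet exploit.

```python
def diagnose(cert_optimistic, cert_empirical, tolerance):
    """Classify why the next policy may be poor (illustrative rule only).

    cert_optimistic: certificate of the optimism-based policy to be executed
    cert_empirical:  certificate of the policy optimal in the empirical model
    tolerance:       largest suboptimality the user considers acceptable
    """
    if cert_optimistic <= tolerance:
        # The executed policy is already certified to be good enough.
        return "exploiting"
    if cert_empirical <= tolerance:
        # A certifiably good policy exists, yet the algorithm chose another one.
        return "exploring"
    # No available policy can currently be certified as good.
    return "cannot exploit yet"
```

A user could, for instance, replace the executed policy with the empirically optimal one only in the "exploring" case, at the cost of slower learning.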


References

  • Abbasi-Yadkori and Neu (2014) Yasin Abbasi-Yadkori and Gergely Neu. Online learning in MDPs with side information. arXiv preprint arXiv:1406.6812, 2014.
  • Auer et al. (2009) Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, 2009.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
  • Balcan et al. (2010) Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.
  • Dann and Brunskill (2015) Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
  • Dann et al. (2017) Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Dann et al. (2018) Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On oracle-efficient PAC reinforcement learning with rich observations. arXiv preprint arXiv:1803.00606, 2018.
  • Hallak et al. (2015) Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov decision processes. arXiv:1502.02259, 2015.
  • Howard et al. (2018) Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
  • Jabbari et al. (2016) Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fair learning in Markovian environments. arXiv preprint arXiv:1611.03071, 2016.
  • Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.
  • Jiang et al. (2017) Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713, 2017.
  • Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.
  • Joseph et al. (2016) Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.
  • Kakade (2003) Sham Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
  • Kakade and Langford (2002) Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
  • Kannan et al. (2017) Sampath Kannan, Michael Kearns, Jamie Morgenstern, Mallesh Pai, Aaron Roth, Rakesh Vohra, and Zhiwei Steven Wu. Fairness incentives for myopic agents. In Proceedings of the 2017 ACM Conference on Economics and Computation, pages 369–386. ACM, 2017.
  • Kearns and Singh (2002) Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
  • Lattimore and Czepesvari (2018) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2018.
  • Lattimore and Hutter (2012) Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334. Springer, 2012.
  • Li et al. (2008) Lihong Li, Michael L Littman, and Thomas J Walsh. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International Conference on Machine Learning, pages 568–575. ACM, 2008.
  • Mahmood et al. (2017) Ashique Rupam Mahmood, Huizhen Yu, and Richard S Sutton. Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006, 2017.
  • Modi et al. (2018) Aditya Modi, Nan Jiang, Satinder Singh, and Ambuj Tewari. Markov decision processes with continuous side information. In Algorithmic Learning Theory, pages 597–618, 2018.
  • Osband and Van Roy (2014) Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, 2014.
  • Osband et al. (2013) Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
  • Osband et al. (2016) Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016.
  • Pirotta et al. (2013) Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.
  • Raghavan et al. (2018) Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. arXiv preprint arXiv:1806.00543, 2018.
  • Sajed et al. (2018) Touqir Sajed, Wesley Chung, and Martha White. High-confidence error estimates for learned value functions. arXiv preprint arXiv:1808.09127, 2018.
  • Strehl et al. (2006) Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.
  • Strehl et al. (2009) Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 2009.
  • Szita and Szepesvári (2010) István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010.
  • Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
  • Thomas et al. (2015a) Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015a.
  • Thomas et al. (2015b) Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. 2015b.
  • Zilberstein and Russell (1996) Shlomo Zilberstein and Stuart Russell. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181–213, 1996.

Appendix A Proofs of relationship of IPOC bounds

A.1 Proof of Proposition 2

Proof of Proposition