Reinforcement Learning with Perturbed Rewards
Abstract
Recent studies have shown that reinforcement learning (RL) models are vulnerable in noisy settings. The sources of noise differ across scenarios. For instance, in practice, the observed reward channel is often subject to noise (e.g., when rewards are collected through sensors), so observed rewards may not be credible. Also, in applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors. In this paper, we consider noisy RL problems where the rewards observed by RL agents are generated with a reward confusion matrix. We call such observed rewards perturbed rewards. We develop a robust RL framework, aided by an unbiased reward estimator, that enables RL agents to learn in noisy environments while observing only perturbed rewards. Our framework draws upon approaches for supervised learning with noisy data. The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that policies based on our estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm obtains 67.5% and 46.7% improvements on average on five Atari games when the error rates are 10% and 30%, respectively.
Jingkang Wang 

Shanghai Jiao Tong University 
Shanghai, China 
wangjksjtu@gmail.com 
Yang Liu 

University of California, Santa Cruz 
California, USA 
yangliu@ucsc.edu 
Bo Li 

University of Illinois at Urbana-Champaign 
Illinois, USA 
lxbosky@gmail.com 
1 Introduction
Designing a suitable reward function plays a critical role in building reinforcement learning models for real-world applications. Ideally, one would want to customize reward functions to achieve application-specific goals (Hadfield-Menell et al., 2017). In practice, however, it is difficult to design a function that produces credible rewards in the presence of noise. This is because the output from any reward function is subject to multiple kinds of randomness:

Inherent Noise. For instance, sensors on a robot will be affected by physical conditions such as temperature and lighting, and therefore will report back noisy observed rewards.

Application-Specific Noise. In machine teaching tasks (Thomaz et al., 2006), when an RL agent receives feedback/instructions from people, different human instructors might provide drastically different feedback due to their personal styles and capabilities. As a result, the RL agent (machine) obtains biased rewards.

Adversarial Noise. Adversarial perturbation has been widely explored in different learning tasks and shows strong attack power against different machine learning models. For instance, Huang et al. (2017) showed that by adding adversarial perturbation to each frame of the game, they can mislead RL policies arbitrarily.
Assuming an arbitrary noise model makes solving this noisy RL problem extremely challenging. Instead, we focus on a specific noisy reward model which we call perturbed rewards, where the rewards observed by RL agents are generated according to a reward confusion matrix. This is not a very restrictive setting to start with, even considering that the noise could be adversarial: given that arbitrary pixel-value manipulation attacks in RL are not very practical, real-world adversaries have strong incentives to inject adversarial perturbation into the reward value by slightly modifying it. For instance, adversaries can manipulate sensors to reverse the reward value.
In this paper, we develop a reward robust reinforcement learning framework, aided by an unbiased reward estimator, that enables an RL agent to learn in a noisy environment while observing only perturbed rewards. Our solution framework builds on existing reinforcement learning algorithms, including the recently developed DRL ones. The main challenge is that the observed rewards are likely to be biased, and in RL or DRL the accumulated errors could amplify the reward estimation error over time. We do not require any assumption on knowing the true distribution of rewards or adversarial strategies, other than the fact that the generation of noise follows an unknown reward confusion matrix. Instead, we address the issue of estimating the reward confusion matrices by proposing an efficient and flexible estimation module. Everitt et al. (2017) provided preliminary studies of the noisy reward problem and gave some general negative results. The authors proved a No Free Lunch theorem: without any assumption about what the reward corruption is, all agents can be misled. Our results do not contradict the results therein, as we consider a specific noise generation model (that leads to a set of perturbed rewards). We analyze the convergence and sample complexity of the policy trained with our proposed method using surrogate rewards in RL, using Q-Learning as an example.
We conduct extensive experiments on OpenAI Gym (Brockman et al., 2016) (AirRaid, Alien, Carnival, MsPacman, Pong, Phoenix, Seaquest), and test various RL and DRL algorithms including Q-Learning, CEM, SARSA, DQN, Dueling DQN, DDPG, NAF, and PPO. We show that the proposed reward robust RL method achieves comparable performance with the policy trained using the true rewards. In some cases, our method even achieves higher cumulative reward. This was surprising to us at first, but we conjecture that the inserted noise, together with our noise-removing unbiased estimator, adds another layer of exploration, which proves to be beneficial in some settings. This merits a future study.
Our contributions are summarized as follows: (1) We adapt and generalize the idea of defining a simple but effective unbiased estimator for true rewards using observed and perturbed rewards to the reinforcement learning setting. The proposed estimator helps guarantee the convergence to the optimal policy even when the RL agents only have noisy observations of the rewards. (2) We analyze the convergence to the optimal policy and finite sample complexity of our reward robust RL methods, using Q-Learning as the running example. (3) Extensive experiments on OpenAI Gym show that our proposed algorithms perform robustly even at high noise rates.
1.1 Related Work
Robust Reinforcement Learning
It is known that RL algorithms are vulnerable to noisy environments (Irpan, 2018). Recent studies (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017) show that learned RL policies can be easily misled with small perturbations in observations. The presence of noise is very common in real-world environments, especially in robotics-relevant applications. Consequently, robust (adversarial) reinforcement learning (RRL/RARL) algorithms have been widely studied, aiming to train a robust policy which is capable of withstanding perturbed observations (Teh et al., 2017; Pinto et al., 2017; Gu et al., 2018) or transferring to unseen environments (Rajeswaran et al., 2016; Fu et al., 2017). However, these robust RL algorithms mainly focus on noisy vision observations, instead of the observed rewards.
Learning with Noisy Data
Learning appropriately with biased data has received quite a bit of attention in recent machine learning studies (Natarajan et al., 2013; Scott et al., 2013; Scott, 2015; Sukhbaatar & Fergus, 2014; van Rooyen & Williamson, 2015; Menon et al., 2015). The idea of this line of work is to define an unbiased surrogate loss function to recover the true loss using knowledge of the noise. We adapt these approaches to reinforcement learning. Though intuitively the idea should apply in our RL settings, our work is the first to formally establish this extension both theoretically and empirically. Our quantitative understanding will provide practical insights when implementing reinforcement learning algorithms in noisy environments.
2 Problem formulation and preliminaries
In this section, we define our problem of learning from perturbed rewards in reinforcement learning. Throughout this paper, we will use perturbed reward and noisy reward interchangeably, as each shot of our sequential decision making setting is similar to the “learning with noisy data” setting in supervised learning (Natarajan et al., 2013; Scott et al., 2013; Scott, 2015; Sukhbaatar & Fergus, 2014). In what follows, we formulate our Markov Decision Process (MDP) problem and the reinforcement learning (RL) problem with perturbed (noisy) rewards.
2.1 Reinforcement Learning: The Noise-Free Setting
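Our RL agent interacts with an unknown environment and attempts to maximize its total collected reward. The environment is formalized as a Markov Decision Process (MDP), denoted as $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$. At each time $t$, the agent in state $s_t \in \mathcal{S}$ takes an action $a_t \in \mathcal{A}$, which returns a reward $r(s_t, a_t) \in \mathcal{R}$ (which we will also shorthand as $r_t$), and leads to the next state $s_{t+1} \in \mathcal{S}$ according to a transition probability kernel $\mathcal{P}$, which encodes the probability $\mathbb{P}_a(s_t, s_{t+1})$. Commonly $\mathcal{P}$ is unknown to the agent. The agent’s goal is to learn the optimal policy, a conditional distribution $\pi(a \mid s)$ that maximizes the state’s value function. The value function calculates the cumulative reward the agent is expected to receive if it follows the current policy $\pi$ after observing the current state $s_t$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s\right],$$
where $0 \le \gamma \le 1$ is a discount factor. Intuitively, the agent evaluates how preferable each state is given the current policy. From the Bellman Equation, the optimal value function is given by
$$V^{*}(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathbb{P}_a(s, s') \left[ r(s, a) + \gamma V^{*}(s') \right].$$
It is standard practice for RL algorithms to learn a state-action value function, for example the Q-function. The Q-function calculates the expected cumulative reward if the agent chooses $a$ in the current state and follows $\pi$ thereafter:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, a_t = a\right].$$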
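To make the Bellman recursion concrete, the following is a minimal sketch of tabular Q-value iteration on a tiny MDP. The two-state MDP, the function name `q_value_iteration`, and the number of sweeps are illustrative assumptions rather than anything specified in the paper; the sketch also assumes the transition kernel is known, which the agent in this paper does not have.

```python
def q_value_iteration(P, R, gamma, sweeps=200):
    """Iterate the Bellman optimality backup on a tabular Q-function.

    P[s][a][s2] : probability of moving from state s to s2 under action a
    R[s][a]     : reward for taking action a in state s
    """
    n_states, n_actions = len(R), len(R[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        v = [max(row) for row in q]  # V(s) = max_a Q(s, a)
        q = [
            [
                R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q


# Hypothetical two-state MDP: in state 0, action 0 stays (reward 0) and
# action 1 moves to state 1 (reward 1); state 1 is absorbing with reward 2.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [[0.0, 1.0], [2.0, 2.0]]
q_star = q_value_iteration(P, R, gamma=0.5)
print(q_star)
```

With a discount factor of 0.5, the absorbing state is worth 2 / (1 - 0.5) = 4, so moving there from state 0 is worth 1 + 0.5 * 4 = 3, and staying is worth 0.5 * 3 = 1.5.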
2.2 Perturbed Reward in RL
In many practical settings, our RL agent does not observe the reward feedback perfectly. We consider the following MDP with perturbed reward, denoted as $\tilde{\mathcal{M}}$: instead of observing $r_t \in \mathcal{R}$ at each time $t$ directly (following its action), our RL agent only observes a perturbed version of $r_t$, denoted as $\tilde{r}_t \in \tilde{\mathcal{R}}$. For most of our presentation, we focus on the cases where $\mathcal{R}$ and $\tilde{\mathcal{R}}$ are finite sets; but our results generalize to the continuous reward settings.
The generation of $\tilde{r}$ follows a certain function $C$. To keep our presentation focused, we consider the following simple state-independent flipping error rates model: if the rewards are binary (consider $r_+$ and $r_-$ for simplicity), $\tilde{r}$ can be characterized by the following noise rate parameters $e_+, e_-$:
$$e_+ = \mathbb{P}(\tilde{r} = r_- \mid r = r_+), \qquad e_- = \mathbb{P}(\tilde{r} = r_+ \mid r = r_-).$$
When the signal levels are beyond binary, suppose there are $M$ outcomes in total, denoted as $[R_0, R_1, \dots, R_{M-1}]$. $\tilde{r}_t$ will be generated according to a confusion matrix $C_{M \times M}$, where each entry $c_{j,k}$ indicates the flipping probability for generating a perturbed outcome:
$$c_{j,k} = \mathbb{P}(\tilde{r}_t = R_k \mid r_t = R_j).$$
Again we’d like to note that we focus on settings with finite reward levels for most of our paper, but we provide discussions in Section 3.1 on how to handle continuous rewards with discretizations.
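The perturbation model can be simulated in a few lines. The sketch below assumes the convention that row j of the confusion matrix holds the probabilities of observing each reward level when the true level is the j-th one; the function name `perturb_reward` and the error rates 0.2 and 0.3 are hypothetical values chosen for illustration.

```python
import random


def perturb_reward(true_reward, levels, confusion, rng):
    """Sample an observed (perturbed) reward given the true reward.

    levels    : list of the finite reward values
    confusion : confusion[j][k] = P(observed = levels[k] | true = levels[j])
    """
    j = levels.index(true_reward)
    return rng.choices(levels, weights=confusion[j], k=1)[0]


# Binary example with hypothetical error rates: +1 flips to -1 with
# probability 0.2, and -1 flips to +1 with probability 0.3.
levels = [1.0, -1.0]
confusion = [
    [0.8, 0.2],
    [0.3, 0.7],
]

rng = random.Random(0)
samples = [perturb_reward(1.0, levels, confusion, rng) for _ in range(10_000)]
flip_rate = samples.count(-1.0) / len(samples)
print(f"empirical flip rate for true reward +1: {flip_rate:.3f}")
```

The empirical flip rate should be close to the first off-diagonal entry of the matrix, which is all the state-independent model asserts.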
3 Learning with Perturbed Rewards
In this section, we first introduce an unbiased estimator for binary rewards in our reinforcement learning setting when the error rates are known. This idea is inspired by Natarajan et al. (2013), but we will extend the method to the multi-outcome, as well as the continuous reward settings.
3.1 Unbiased Estimator for True Reward
With the knowledge of noise rates (reward confusion matrices), we are able to establish an unbiased approximation of the true reward in a similar way as done in Natarajan et al. (2013). We call such a constructed unbiased reward a surrogate reward. To give an intuition, we start by replicating the results for binary reward in our RL setting:
Lemma 1.
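Let $r$ be bounded, and let $e_+ = \mathbb{P}(\tilde{r} = r_- \mid r = r_+)$ and $e_- = \mathbb{P}(\tilde{r} = r_+ \mid r = r_-)$ denote the noise rates. Then, if we define,
$$\hat{r} := \begin{cases} \dfrac{(1 - e_-)\, r_+ - e_+\, r_-}{1 - e_+ - e_-}, & \tilde{r} = r_+ \\[2ex] \dfrac{(1 - e_+)\, r_- - e_-\, r_+}{1 - e_+ - e_-}, & \tilde{r} = r_- \end{cases} \qquad (1)$$
we have for any $r$, $\mathbb{E}_{\tilde{r} \mid r}[\hat{r}] = r$.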
In the standard supervised learning setting, the above property guarantees convergence: as more training data are collected, the empirical surrogate risk converges to its expectation, which is the same as the expectation of the true risk (due to the unbiasedness property). This is also the intuition for why we would like to replace the reward terms with surrogate rewards in our RL algorithms.
The above idea can be generalized to the multi-outcome setting in a fairly straightforward way. Define $\hat{\mathbf{R}} := [\hat{r}(\tilde{r} = R_0), \hat{r}(\tilde{r} = R_1), \dots, \hat{r}(\tilde{r} = R_{M-1})]$, where $\hat{r}(\tilde{r} = R_k)$ denotes the value of the surrogate reward when the observed reward is $R_k$. Let $\mathbf{R} = [R_0; R_1; \dots; R_{M-1}]$ be the bounded reward matrix with $M$ values. We have the following results:
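The unbiasedness property for binary rewards can be checked numerically. The sketch below is an illustration under stated assumptions, not a transcription of Eqn. (1): the closed form is the one obtained by solving the two unbiasedness equations for a binary reward in the style of Natarajan et al. (2013), and the function name `binary_surrogate` and the noise rates 0.2 and 0.3 are hypothetical.

```python
def binary_surrogate(r_plus, r_minus, e_plus, e_minus):
    """Return surrogate values (for observed r_plus, for observed r_minus).

    e_plus  : P(observed = r_minus | true = r_plus)
    e_minus : P(observed = r_plus  | true = r_minus)
    """
    denom = 1.0 - e_plus - e_minus
    if abs(denom) < 1e-12:
        raise ValueError("e_plus + e_minus must differ from 1")
    hat_plus = ((1.0 - e_minus) * r_plus - e_plus * r_minus) / denom
    hat_minus = ((1.0 - e_plus) * r_minus - e_minus * r_plus) / denom
    return hat_plus, hat_minus


r_plus, r_minus = 1.0, -1.0
e_plus, e_minus = 0.2, 0.3  # hypothetical noise rates
hat_plus, hat_minus = binary_surrogate(r_plus, r_minus, e_plus, e_minus)

# Expected surrogate reward conditioned on each true reward value.
mean_given_plus = (1.0 - e_plus) * hat_plus + e_plus * hat_minus
mean_given_minus = e_minus * hat_plus + (1.0 - e_minus) * hat_minus
print(hat_plus, hat_minus, mean_given_plus, mean_given_minus)
```

The surrogate values lie outside the original reward range, which is what lets their noise-weighted average land exactly on the true reward.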
Lemma 2.
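Suppose $C_{M \times M}$ is invertible. Defining
$$\hat{\mathbf{R}} = C^{-1} \cdot \mathbf{R}, \qquad (2)$$
we have for any $r$, $\mathbb{E}_{\tilde{r} \mid r}[\hat{r}] = r$.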
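For the multi-outcome case, unbiasedness amounts to a linear system: the confusion matrix applied to the vector of surrogate values must return the vector of true reward levels. The sketch below solves that system with a small Gauss-Jordan routine so that it needs only the standard library; the helper names and the symmetric three-level confusion matrix are assumptions made for illustration.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(x) for x in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is (numerically) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] for i in range(n)]


def surrogate_rewards(confusion, levels):
    """Surrogate value for each observed level, i.e. the solution of C @ r_hat = R."""
    return solve_linear(confusion, levels)


levels = [-1.0, 0.0, 1.0]
confusion = [  # confusion[j][k] = P(observed level k | true level j)
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
r_hat = surrogate_rewards(confusion, levels)

# Expected surrogate reward for each true level; should reproduce `levels`.
expected = [sum(c * rh for c, rh in zip(row, r_hat)) for row in confusion]
print(r_hat, expected)
```

For this symmetric matrix the surrogate values are simply the true levels divided by 0.7, so the correction stretches the reward range in proportion to how much the noise compresses it.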
Continuous reward
When the reward signal is continuous, we discretize it into $M$ intervals and view each interval as a reward level, with its value approximated by its middle point. With increasing $M$, this quantization error can be made arbitrarily small. Our method is then the same as the solution for the multi-outcome setting, except for replacing rewards with discretized ones. Note that the finer the quantization, the smaller the quantization error, but the larger the reward confusion matrix we need to learn. This trade-off can be addressed empirically.
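The discretization step can be sketched as follows. The function name `discretize`, the reward range of -2 to 0, and the choice of four bins are hypothetical; the sketch uses the interval midpoint as the representative value, as described above.

```python
def discretize(reward, low, high, n_bins):
    """Map a continuous reward to (bin index, bin midpoint).

    The range [low, high] is split into n_bins equal-width intervals;
    rewards outside the range are clamped to the nearest interval.
    """
    width = (high - low) / n_bins
    index = int((reward - low) / width)
    index = min(max(index, 0), n_bins - 1)
    midpoint = low + (index + 0.5) * width
    return index, midpoint


# Hypothetical reward range [-2, 0] split into 4 intervals of width 0.5.
index, midpoint = discretize(-0.3, low=-2.0, high=0.0, n_bins=4)
edge_index, edge_midpoint = discretize(0.0, low=-2.0, high=0.0, n_bins=4)
print(index, midpoint, edge_index, edge_midpoint)
```

With midpoints, the quantization error is at most half the interval width, so doubling the number of bins halves the worst-case error while quadrupling the number of confusion matrix entries.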
So far we have assumed knowing the confusion matrices, but we will address this additional estimation issue in Section 3.3, and present our complete algorithm therein.
3.2 Convergence and Sample Complexity: Q-Learning
We now analyze the convergence and sample complexity of our surrogate reward based RL algorithms (assuming $C$ is known), taking Q-Learning as an example.
Convergence guarantee
First, the convergence guarantee is stated in the following theorem:
Theorem 1.
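Given a finite MDP, denoted as $\hat{\mathcal{M}}$, the Q-learning algorithm with surrogate rewards, given by the update rule,
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left[ \hat{r}_t + \gamma \max_{b \in \mathcal{A}} Q_t(s_{t+1}, b) \right], \qquad (3)$$
converges w.p.1 to the optimal Q-function as long as $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.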
Note that the term on the right-hand side of Eqn. (3) includes the surrogate reward estimated using Eqn. (1) and Eqn. (2). Theorem 1 states that agents will converge to the optimal policy w.p.1 when the rewards are replaced with surrogate rewards, despite the noise in the observed rewards. This result is not surprising: though the surrogate rewards introduce larger variance, their unbiasedness grants us convergence.
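The effect of swapping in surrogate rewards can be seen on a toy problem. The sketch below is an illustrative assumption, not the paper's experimental setup: a single-state MDP with two actions (true rewards +1 and -1), symmetric noise rates of 0.3, uniform random exploration, and a step size of one over the visit count, which satisfies the step-size conditions of Theorem 1. With a discount factor of 0.5 the true optimal Q-values are 2 and 0.

```python
import random

GAMMA = 0.5
TRUE_REWARD = {0: 1.0, 1: -1.0}  # action -> true reward (single-state MDP)
E_PLUS, E_MINUS = 0.3, 0.3       # hypothetical flipping rates


def observe(true_r, rng):
    """Return the perturbed reward: the sign is flipped with the noise rate."""
    flip = E_PLUS if true_r > 0 else E_MINUS
    return -true_r if rng.random() < flip else true_r


def surrogate(observed_r):
    """Unbiased surrogate for binary rewards +1 / -1 with known noise rates."""
    denom = 1.0 - E_PLUS - E_MINUS
    if observed_r > 0:
        return ((1.0 - E_MINUS) * 1.0 - E_PLUS * (-1.0)) / denom
    return ((1.0 - E_PLUS) * (-1.0) - E_MINUS * 1.0) / denom


def q_learning(use_surrogate, steps=60_000, seed=0):
    rng = random.Random(seed)
    q = [0.0, 0.0]
    visits = [0, 0]
    for _ in range(steps):
        action = rng.randrange(2)            # uniform exploration
        observed = observe(TRUE_REWARD[action], rng)
        reward = surrogate(observed) if use_surrogate else observed
        visits[action] += 1
        alpha = 1.0 / visits[action]         # sum = inf, sum of squares < inf
        target = reward + GAMMA * max(q)     # next state is the same state
        q[action] = (1.0 - alpha) * q[action] + alpha * target
    return q


q_surrogate = q_learning(use_surrogate=True)
q_noisy = q_learning(use_surrogate=False)
print("surrogate rewards:", q_surrogate)
print("noisy rewards:    ", q_noisy)
```

Learning directly from the noisy rewards converges to the Q-values of a different MDP (the expected noisy reward of the good action is only 0.4), whereas the surrogate version approaches the true values at the cost of noisier individual updates.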
Sample complexity
To establish our sample complexity results, we first introduce a generative model following previous literature (Kearns & Singh, 1998; 2000; Kearns et al., 1999). This is a practical MDP setting to simplify the analysis.
Definition 1.
A generative model $G(\mathcal{M})$ for an MDP $\mathcal{M}$ is a sampling model which takes a state-action pair $(s_t, a_t)$ as input, and outputs the corresponding reward $r(s_t, a_t)$ and the next state $s_{t+1}$ randomly with the probability of $\mathbb{P}_a(s_t, s_{t+1})$, i.e., $s_{t+1} \sim \mathbb{P}(\cdot \mid s_t, a_t)$.
Exact value iteration (using the Q-function) is impractical if the agents follow the generative models above exactly (Gatsby, 2003). Consequently, we introduce a phased Q-Learning which is similar to the ones presented in Gatsby (2003); Kearns & Singh (1998) for the convenience of proving our sample complexity results. We briefly outline phased Q-Learning as follows; the complete description (Algorithm 2) can be found in Appendix A.
Definition 2.
The Phased Q-Learning algorithm takes $m$ samples per phase by calling the generative model $G(\mathcal{M})$. It uses the collected $m$ samples to estimate the transition probability $\mathcal{P}$ and update the estimated value function per phase. When the generative model is called on the MDP with surrogate rewards, surrogate rewards are returned and used to update the value function per phase.
The sample complexity of Phased Q-Learning is given as follows:
Theorem 2.
(Upper Bound) Let $r$ be a bounded reward and $C$ be an invertible reward confusion matrix with $\det(C)$ denoting its determinant. For an appropriate choice of $m$, the Phased Q-Learning algorithm calls the generative model times, and returns a policy such that for all state , w.p. .
Theorem 2 states that, to guarantee the convergence to optimal policy, the number of samples needed is no more than times of the one needed when the RL agent observes true rewards perfectly. This additional constant is the price we pay for the noise presented in our learning environment. When the noise level is high, we expect to see a much higher ; otherwise when we are in a low-noise regime , Q-Learning can be very efficient with surrogate reward (Kearns & Singh, 2000). Note that Theorem 2 gives the upper bound in the discounted MDP setting; for the undiscounted setting ($\gamma = 1$), the upper bound is at the order of . The lower bound result is omitted due to the lack of space. The idea of constructing the MDP is similar to Gatsby (2003).
While the surrogate reward guarantees unbiasedness, we sacrifice variance at each of our learning steps, and this in turn delays convergence (as also evidenced in the sample complexity bound). It can be verified that the variance of the surrogate reward is bounded when $C$ is invertible, and it is always higher than the variance of the true reward. This is summarized in the following theorem:
Theorem 3.
Let $r$ be a bounded reward and let the confusion matrix $C$ be invertible. Then, the variance of the surrogate reward $\hat{r}$ is bounded as follows:
To give an intuition of the bound, when we have binary reward, the variance for surrogate reward bounds as follows: As $e_+ + e_- \to 1$, the variance becomes unbounded and the proposed estimator is no longer effective, nor will it be well-defined. In practice, there is a trade-off question between bias and variance by tuning a linear combination of and , i.e., , and choosing an appropriate .
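The growth of the variance as the noise rates approach the degenerate case can be illustrated directly. The sketch below computes the exact variance of a binary surrogate reward conditioned on the true reward being the positive level; it is an illustration of the qualitative behaviour, not the bound stated in Theorem 3, and the function name and the noise rates are hypothetical.

```python
def surrogate_variance(r_plus, r_minus, e_plus, e_minus):
    """Variance of the binary surrogate reward given that the true reward is r_plus."""
    denom = 1.0 - e_plus - e_minus
    hat_plus = ((1.0 - e_minus) * r_plus - e_plus * r_minus) / denom
    hat_minus = ((1.0 - e_plus) * r_minus - e_minus * r_plus) / denom
    # The observed reward equals r_plus w.p. (1 - e_plus) and r_minus w.p. e_plus.
    mean = (1.0 - e_plus) * hat_plus + e_plus * hat_minus
    second_moment = (1.0 - e_plus) * hat_plus ** 2 + e_plus * hat_minus ** 2
    return second_moment - mean ** 2


# Symmetric noise rates moving towards e_plus + e_minus = 1.
noise_rates = [0.1, 0.3, 0.45]
variances = [surrogate_variance(1.0, -1.0, e, e) for e in noise_rates]
for e, var in zip(noise_rates, variances):
    print(f"e_plus = e_minus = {e}: variance = {var:.4f}")
```

In this setting the variance equals e_plus * (1 - e_plus) * (r_plus - r_minus)^2 / (1 - e_plus - e_minus)^2, so the squared denominator is what drives the blow-up as the two noise rates sum towards 1.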
3.3 Estimation of Confusion Matrices
In Section 3.1 we assumed knowledge of the reward confusion matrices in order to compute the surrogate reward. This knowledge is often not available in practice. Estimating these confusion matrices is challenging without knowing any ground truth reward information; but we’d like to note that efficient algorithms have been developed to estimate the confusion matrices in supervised learning settings (Liu & Liu, 2015; Bekker & Goldberger, 2016; Khetan et al., 2017; Hendrycks et al., 2018). The idea in these algorithms is to dynamically refine the error rates based on aggregated rewards. Note that this approach is similar to the inference methods for aggregating crowdsourcing labels in the literature (Dawid & Skene, 1979; Karger et al., 2011; Liu et al., 2012). We adapt this idea to our reinforcement learning setting, which is detailed as follows.
At each training step, the RL agent collects the noisy reward and the current state-action pair. Then, for each collected state-action pair, the agent predicts the true reward based on accumulated historical observations of reward for the corresponding state-action pair via, e.g., averaging (majority voting). Finally, with the predicted true reward and the accuracy (error rate) for each state-action pair, the estimated reward confusion matrices $\tilde{C}$ are given by
$$\tilde{c}_{i,j} = \frac{\sum_{(s,a)} \#\left[\tilde{r}(s,a) = R_j \mid \bar{r}(s,a) = R_i\right]}{\sum_{(s,a)} \#\left[\bar{r}(s,a) = R_i\right]}, \qquad (4)$$
where $\#[\cdot]$ above denotes the number of state-action pairs that satisfy the condition $[\cdot]$ in the set of observed rewards (see Algorithms 1 and 3); $\bar{r}(s,a)$ and $\tilde{r}(s,a)$ denote the predicted true reward (using majority voting) and the observed reward when the state-action pair is $(s,a)$. The above procedure continues as more observations arrive.
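The estimation procedure described above can be sketched as follows: group the observed rewards by state-action pair, take the majority vote within each group as the predicted true reward, and count how often each observed level co-occurs with each predicted level. The function name `estimate_confusion`, the toy observations, and the identity-row fallback for reward levels that are never predicted are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def estimate_confusion(observations, levels):
    """Estimate a reward confusion matrix by majority voting.

    observations : iterable of ((state, action), observed_reward)
    levels       : list of the finite reward values
    Returns est[i][j] ~ P(observed = levels[j] | predicted true = levels[i]).
    """
    per_pair = defaultdict(list)
    for pair, reward in observations:
        per_pair[pair].append(reward)

    index = {level: i for i, level in enumerate(levels)}
    counts = [[0] * len(levels) for _ in levels]
    for rewards in per_pair.values():
        predicted = Counter(rewards).most_common(1)[0][0]  # majority vote
        for reward in rewards:
            counts[index[predicted]][index[reward]] += 1

    estimate = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # Assumption: with no evidence for a level, fall back to "no noise".
            estimate.append([1.0 if j == i else 0.0 for j in range(len(levels))])
        else:
            estimate.append([c / total for c in row])
    return estimate


levels = [1.0, -1.0]
observations = (
    [(("s0", "a0"), 1.0)] * 3 + [(("s0", "a0"), -1.0)]
    + [(("s1", "a0"), -1.0)] * 4 + [(("s1", "a0"), 1.0)]
)
estimated = estimate_confusion(observations, levels)
print(estimated)
```

Majority voting only recovers the true reward when each level is observed correctly more often than not, which is why this kind of estimate degrades at high noise rates and in very large state spaces where few pairs are revisited.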
Our final definition of surrogate reward is nothing different from Eqn. (2) but with replacing a known reward confusion matrix with our estimated one in the definition of . We denote this reward as .
We present Reward Robust RL in Algorithm 1 (one complete Q-Learning implementation, Algorithm 3, is provided in Appendix C). Note that the algorithm is rather generic: we can plug any existing RL algorithm into our reward robust one, with the only change being that the rewards are replaced with our estimated surrogate rewards.
4 Experiments
In this section, reward robust RL is tested in different games, with different noise settings. Due to space limits, more experimental results can be found in Appendix D.
4.1 Experimental Setup
Environments and RL Algorithms
To fully test the performance under different environments, we evaluate the proposed robust reward RL method on two classic control games (CartPole, Pendulum) and seven Atari 2600 games (AirRaid, Alien, Carnival, MsPacman, Pong, Phoenix, Seaquest), which encompass a large variety of environments, as well as rewards. Specifically, the rewards could be unary (CartPole), binary (most of the Atari games), multivariate (Pong) and even continuous (Pendulum). A set of state-of-the-art reinforcement learning algorithms is evaluated while training under different amounts of noise (see Table 3; the detailed settings are given in Appendix B). For each game and algorithm, three policies are trained based on different random initializations to decrease the variance.
Reward PostProcessing
For each game and RL algorithm, we test the performance of learning with true rewards, learning with noisy rewards and learning with surrogate rewards. Both symmetric and asymmetric noise settings with different noise levels are tested. For symmetric noise, the confusion matrices are symmetric. As for asymmetric noise, two types of random noise are tested: 1) rand-one, where each reward level can only be perturbed into one other reward; 2) rand-all, where each reward could be perturbed to any other reward, via adding a random noise matrix. To measure the amount of noise w.r.t. confusion matrices, we define the weight of noise in Appendix B.2. The larger the weight is, the higher the noise rates are.
4.2 Robustness Evaluation
CartPole
The goal in CartPole is to prevent the pole from falling by controlling the cart’s direction and velocity. The reward is $+1$ for every step taken, including the termination step. When the cart or pole deviates too much or the episode length is longer than 200, the episode terminates. Due to the unary reward in CartPole, a corrupted reward is added as the unexpected error. As a result, the reward space is extended to two values. Five algorithms, Q-Learning (1992), CEM (2006), SARSA (1998), DQN (2016) and DDQN (2016), are evaluated.
In Figure 1, we show that our estimator successfully produces meaningful surrogate rewards that adapt the underlying RL algorithms to the noisy settings, without any assumption on the true distribution of rewards. With the noise rate increasing (from 0.1 to 0.9), the models with noisy rewards converge more slowly due to larger biases. However, we observe that the models always converge to the best score of 200 with the help of surrogate rewards.
In some circumstances (slight noise; see Figures 3(a), 3(b), 3(c), 3(d)), the surrogate rewards even lead to faster convergence. This points to an interesting observation: learning with surrogate rewards can even outperform learning with the true reward. We conjecture that adding noise and then removing the bias introduces implicit exploration. This implies that even in settings with true rewards, we might consider manually adding noise and then removing it in expectation.
Pendulum
The goal in Pendulum is to keep a frictionless pendulum standing up. Different from the CartPole setting, the rewards in Pendulum are continuous. The closer the reward is to zero, the better the performance the model achieves. Following our extension (see Section 3.1), the reward is first discretized into 17 intervals, with each interval's value approximated by its maximum point. After the quantization step, the surrogate rewards can be estimated using the multi-outcome extensions presented in Section 3.1.
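| Noise Rate | Reward | Q-Learn | CEM | SARSA | DQN | DDQN | DDPG | NAF |
|---|---|---|---|---|---|---|---|---|
|  |  | 170.0 | 98.1 | 165.2 | 187.2 | 187.8 | 1.03 | 4.48 |
|  |  | 165.8 | 108.9 | 173.6 | 200.0 | 181.4 | 0.87 | 0.89 |
|  |  | 181.9 | 99.3 | 171.5 | 200.0 | 185.6 | 0.90 | 1.13 |
|  |  | 134.9 | 28.8 | 144.4 | 173.4 | 168.6 | 1.23 | 4.52 |
|  |  | 149.3 | 85.9 | 152.4 | 175.3 | 198.7 | 1.03 | 1.15 |
|  |  | 161.1 | 81.8 | 159.6 | 186.7 | 200.0 | 1.05 | 1.36 |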
We experiment with two popular algorithms, DDPG (2015) and NAF (2016), in this game. In Figure 2, both algorithms perform well with surrogate rewards under different amounts of noise. In most cases, the biases are corrected in the long term, even when the amount of noise is extensive. The quantitative scores on CartPole and Pendulum are given in Table 1, where the scores are averaged over the last thirty episodes. The full results can be found in Appendix D.1, as can those for Table 2. Our reward robust RL method is able to achieve consistently good scores.
Atari
We validate our algorithm on seven Atari 2600 games using the state-of-the-art algorithm PPO (Schulman et al., 2017). The games are chosen to cover a variety of environments. The rewards in the Atari games are clipped into $\{-1, 0, 1\}$. We leave the detailed settings to Appendix B.
Results for PPO on Pong-v4 in the symmetric noise setting are presented in Figure 3. Due to limited space, more results on other Atari games and noise settings are given in Appendix D.3. Similar to previous results, our surrogate estimator performs consistently well and helps PPO converge to the optimal policy. Table 2 shows the average scores of PPO on five selected Atari games with different amounts of noise (symmetric and asymmetric). In particular, when the noise rates , agents with surrogate rewards obtain significant improvements in average scores. We do not present the results for the case with unknown $C$ because the state space (image input) is very large for Atari games, which is difficult to handle with the solution given in Section 3.3.
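| Noise Rate | Reward | Lift | Mean | Alien | Carnival | Phoenix | MsPacman | Seaquest |
|---|---|---|---|---|---|---|---|---|
|  |  |  | 2044.2 | 1814.8 | 1239.2 | 4608.9 | 1709.1 | 849.2 |
|  |  | 67.5% | 3423.1 | 1741.0 | 3630.3 | 7586.3 | 2547.3 | 1610.6 |
|  |  |  | 770.5 | 893.3 | 841.8 | 250.7 | 1151.1 | 715.7 |
|  |  | 20.3% | 926.6 | 973.7 | 955.2 | 643.9 | 1307.1 | 753.1 |
|  |  |  | 1180.1 | 543.1 | 919.8 | 2600.3 | 1109.6 | 727.8 |
|  |  | 46.7% | 1730.8 | 1637.7 | 966.1 | 4171.5 | 1470.2 | 408.6 |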
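The lift values in the table can be reproduced from the mean scores, assuming (as the row layout suggests) that each lift is the relative improvement of the mean score in the second row of a pair over the mean score in the first row. The function name `lift` is a hypothetical helper; the numbers are the mean scores reported above.

```python
def lift(baseline_mean, improved_mean):
    """Relative improvement of improved_mean over baseline_mean, in percent."""
    return (improved_mean - baseline_mean) / baseline_mean * 100.0


# (mean score in first row of pair, mean score in second row of pair)
mean_pairs = [(2044.2, 3423.1), (770.5, 926.6), (1180.1, 1730.8)]
lifts = [round(lift(base, improved), 1) for base, improved in mean_pairs]
print(lifts)
```

Under this reading, the three pairs give 67.5%, 20.3% and 46.7%, matching the reported lift column.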
5 Conclusion
Only a small number of reinforcement learning studies have focused on settings with perturbed and noisy rewards, despite the fact that such noise is common when exploring real-world scenarios that face sensor errors or adversarial examples. We adapt the ideas from supervised learning with noisy examples (Natarajan et al., 2013), and propose a simple but effective RL framework for dealing with noisy rewards. The convergence guarantee and finite sample complexity of Q-Learning with estimated surrogate rewards are given. To validate the effectiveness of our approach, extensive experiments are conducted on OpenAI Gym, showing that surrogate rewards successfully rescue models from misleading rewards even at high noise rates.
Acknowledgement
We thank Anay Pattanaik for the valuable discussion and feedback.
References
 Bekker & Goldberger (2016) Alan Joseph Bekker and Jacob Goldberger. Training deep neural-networks based on unreliable labels. In ICASSP, pp. 2682–2686. IEEE, 2016.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
 Dawid & Skene (1979) Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pp. 20–28, 1979.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
 Everitt et al. (2017) Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In IJCAI, pp. 4705–4713. ijcai.org, 2017.
 Fu et al. (2017) Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. CoRR, abs/1710.11248, 2017.
 Gatsby (2003) Machandranath Kakade Gatsby. On the sample complexity of reinforcement learning sham. 2003.
 Gu et al. (2016) Shixiang Gu, Timothy P. Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pp. 2829–2838. JMLR.org, 2016.
 Gu et al. (2018) Zhaoyuan Gu, Zhenzhong Jia, and Howie Choset. Adversary a3c for robust reinforcement learning, 2018. URL https://openreview.net/forum?id=SJvrXqvaZ.
 Hadfield-Menell et al. (2017) Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pp. 6765–6774, 2017.
 Hendrycks et al. (2018) Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. CoRR, abs/1802.05300, 2018.
 Huang et al. (2017) Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
 Irpan (2018) Alex Irpan. Deep reinforcement learning doesn’t work yet. https://www.alexirpan.com/2018/02/14/rl-hard.html, 2018.
 Jaakkola et al. (1993) Tommi S. Jaakkola, Michael I. Jordan, and Satinder P. Singh. Convergence of stochastic iterative dynamic programming algorithms. In NIPS, pp. 703–710. Morgan Kaufmann, 1993.
 Karger et al. (2011) David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in neural information processing systems, pp. 1953–1961, 2011.
 Kearns & Singh (1998) Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In NIPS, pp. 996–1002. The MIT Press, 1998.
 Kearns & Singh (2000) Michael J. Kearns and Satinder P. Singh. Bias-variance error bounds for temporal difference updates. In COLT, pp. 142–147. Morgan Kaufmann, 2000.
 Kearns et al. (1999) Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In IJCAI, pp. 1324–1231. Morgan Kaufmann, 1999.
 Khetan et al. (2017) Ashish Khetan, Zachary C. Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. CoRR, abs/1712.04577, 2017.
 Kos & Song (2017) Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. CoRR, abs/1705.06452, 2017.
 Lillicrap et al. (2015) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 Lin et al. (2017) Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. In IJCAI, pp. 3756–3762. ijcai.org, 2017.
 Liu et al. (2012) Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In Proc. of NIPS, 2012.
 Liu & Liu (2015) Yang Liu and Mingyan Liu. An online learning approach to improving the quality of crowdsourcing. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’15, pp. 217–230, New York, NY, USA, 2015. ACM. ISBN 9781450334860. doi: 10.1145/2745844.2745874. URL http://doi.acm.org/10.1145/2745844.2745874.
 Menon et al. (2015) Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In International Conference on Machine Learning, pp. 125–134, 2015.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Natarajan et al. (2013) Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pp. 1196–1204, 2013.
 Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 2817–2826. PMLR, 2017.
 Plappert (2016) Matthias Plappert. keras-rl. https://github.com/keras-rl/keras-rl, 2016.
 Rajeswaran et al. (2016) Aravind Rajeswaran, Sarvjeet Ghotra, Sergey Levine, and Balaraman Ravindran. Epopt: Learning robust neural network policies using model ensembles. CoRR, abs/1610.01283, 2016.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 Scott (2015) Clayton Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, 2015.
 Scott et al. (2013) Clayton Scott, Gilles Blanchard, Gregory Handy, Sara Pozzi, and Marek Flaska. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, pp. 489–511, 2013.
 Sukhbaatar & Fergus (2014) Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.
 Sutton & Barto (1998) Richard S. Sutton and Andrew G. Barto. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998.
 Szita & Lörincz (2006) Istvan Szita and András Lörincz. Learning tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006.
 Teh et al. (2017) Yee Whye Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In NIPS, pp. 4499–4509, 2017.
 Thomaz et al. (2006) Andrea Lockerd Thomaz, Cynthia Breazeal, et al. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. 2006.
 Tsitsiklis (1994) John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
 van Hasselt et al. (2016) Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, pp. 2094–2100. AAAI Press, 2016.
 van Rooyen & Williamson (2015) Brendan van Rooyen and Robert C Williamson. Learning in the presence of corruption. arXiv preprint arXiv:1504.00091, 2015.
 Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pp. 1995–2003. JMLR.org, 2016.
 Watkins & Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Q-learning. In Machine Learning, pp. 279–292, 1992.
 Watkins (1989) Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, May 1989.
Appendix A Proofs
Proof of Lemma 1.
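The unbiasedness of the proposed estimator is easy to validate directly. To see how the estimator is derived, write $e_+ = P(\tilde{r} = r_- \mid r = r_+)$ and $e_- = P(\tilde{r} = r_+ \mid r = r_-)$ for the two flipping rates and rewrite the unbiasedness condition behind Eqn. (1) as follows:
$$r_+ = (1 - e_+)\,\hat{r}_+ + e_+\,\hat{r}_-, \qquad r_- = e_-\,\hat{r}_+ + (1 - e_-)\,\hat{r}_-,$$
where $r_+$, $r_-$ are the binary levels of true rewards; $\hat{r}_+$ and $\hat{r}_-$ denote the value of the surrogate reward when the observed reward is $r_+$ and $r_-$. Solving the equations above (which requires $e_+ + e_- \neq 1$), we obtain the surrogate binary rewards:
$$\hat{r}_+ = \frac{(1 - e_-)\,r_+ - e_+\,r_-}{1 - e_+ - e_-}, \qquad \hat{r}_- = \frac{(1 - e_+)\,r_- - e_-\,r_+}{1 - e_+ - e_-}.$$
Or written in another way:
$$\hat{r}(s_t, a_t, s_{t+1}) = \begin{cases} \dfrac{(1 - e_-)\,r_+ - e_+\,r_-}{1 - e_+ - e_-}, & \tilde{r}(s_t, a_t, s_{t+1}) = r_+, \\[2ex] \dfrac{(1 - e_+)\,r_- - e_-\,r_+}{1 - e_+ - e_-}, & \tilde{r}(s_t, a_t, s_{t+1}) = r_-. \end{cases}$$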
∎
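As a quick numerical sanity check of Lemma 1 (an illustrative sketch, not part of the original proof), the snippet below evaluates the closed-form binary surrogate rewards and verifies that their conditional expectation equals the true reward. The reward levels and the flip rates `e_plus` = P(observed $r_-$ | true $r_+$) and `e_minus` = P(observed $r_+$ | true $r_-$) are hypothetical values chosen only for illustration; exact rational arithmetic avoids floating-point tolerances.

```python
from fractions import Fraction


def binary_surrogate(r_plus, r_minus, e_plus, e_minus):
    """Return (r_hat_plus, r_hat_minus) solving the two unbiasedness equations."""
    denom = 1 - e_plus - e_minus
    if denom == 0:
        raise ValueError("e_plus + e_minus must differ from 1")
    r_hat_plus = ((1 - e_minus) * r_plus - e_plus * r_minus) / denom
    r_hat_minus = ((1 - e_plus) * r_minus - e_minus * r_plus) / denom
    return r_hat_plus, r_hat_minus


# Hypothetical reward levels and flip rates (assumptions for illustration).
r_plus, r_minus = Fraction(1), Fraction(-1)
e_plus, e_minus = Fraction(1, 10), Fraction(3, 10)

r_hat_plus, r_hat_minus = binary_surrogate(r_plus, r_minus, e_plus, e_minus)

# Expected surrogate reward conditioned on each true reward level.
mean_given_plus = (1 - e_plus) * r_hat_plus + e_plus * r_hat_minus
mean_given_minus = e_minus * r_hat_plus + (1 - e_minus) * r_hat_minus

print(r_hat_plus, r_hat_minus)            # surrogate levels
print(mean_given_plus, mean_given_minus)  # should match r_plus, r_minus
```

Note that the surrogate levels lie outside the original reward range; this is the price paid for unbiasedness and is what Lemma 4 later bounds.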
Proof of Lemma 2.
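The idea of constructing an unbiased estimator is easily adapted to multi-outcome reward settings by writing out the conditions for the unbiasedness property (s.t. $\mathbb{E}_{\tilde{r}|r}[\hat{r}] = r$). For simplicity, we shorthand $\hat{r}(\tilde{r} = R_i)$ as $\hat{R}_i$ in the following proofs. Similar to Lemma 1, we need to solve the following set of equations to obtain $\hat{r}$:
$$R_j = \sum_{k=0}^{M-1} c_{j,k}\,\hat{R}_k, \qquad j = 0, 1, \dots, M-1,$$
where $c_{j,k} = P(\tilde{r} = R_k \mid r = R_j)$ is the $(j,k)$ entry of the confusion matrix and $\hat{R}_i$ denotes the value of the surrogate reward when the observed reward is $R_i$. Define $\mathbf{R} := [R_0; R_1; \dots; R_{M-1}]$ and $\hat{\mathbf{R}} := [\hat{R}_0; \hat{R}_1; \dots; \hat{R}_{M-1}]$, then the above equations are equivalent to $\mathbf{R} = \mathbf{C} \cdot \hat{\mathbf{R}}$. If the confusion matrix $\mathbf{C}$ is invertible, we obtain the surrogate reward:
$$\hat{\mathbf{R}} = \mathbf{C}^{-1} \cdot \mathbf{R}.$$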
∎
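The multi-outcome construction amounts to solving one linear system. The following sketch (an illustration under assumed values, not the authors' implementation) solves $\mathbf{C}\hat{\mathbf{R}} = \mathbf{R}$ by Gauss-Jordan elimination in exact rational arithmetic and checks the unbiasedness conditions row by row. The helper `solve`, the 3-level confusion matrix `C`, and the reward levels `R` are hypothetical.

```python
from fractions import Fraction as F


def solve(C, R):
    """Solve C x = R by Gauss-Jordan elimination with exact fractions."""
    n = len(C)
    aug = [[F(v) for v in row] + [F(R[i])] for i, row in enumerate(C)]
    for col in range(n):
        # Find a row with a non-zero pivot in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("confusion matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Hypothetical confusion matrix: C[j][k] = P(observed R_k | true R_j).
C = [[F(8, 10), F(1, 10), F(1, 10)],
     [F(1, 10), F(8, 10), F(1, 10)],
     [F(1, 10), F(1, 10), F(8, 10)]]
R = [F(0), F(1), F(2)]  # hypothetical true reward levels

R_hat = solve(C, R)  # surrogate reward for each observed level

# Unbiasedness: for each true level j, sum_k C[j][k] * R_hat[k] == R[j].
expected = [sum(c * x for c, x in zip(row, R_hat)) for row in C]
print(R_hat, expected)
```

In practice the agent would look up `R_hat[k]` whenever it observes the perturbed reward level `R[k]`.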
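Furthermore, the probabilities for observing surrogate rewards can be written as follows:
$$\hat{\mathbf{P}} = [\hat{p}_0, \hat{p}_1, \dots, \hat{p}_{M-1}] = \Big[\sum_j p_j c_{j,0},\ \sum_j p_j c_{j,1},\ \dots,\ \sum_j p_j c_{j,M-1}\Big],$$
where $\hat{p}_i = \sum_j p_j c_{j,i}$, and $\hat{p}_i$, $p_i$ represent the probabilities of occurrence for surrogate reward $\hat{R}_i$ and true reward $R_i$ respectively.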
Corollary 1.
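Let $\hat{p}_i$ and $p_i$ denote the probabilities of occurrence for surrogate reward $\hat{R}_i$ and true reward $R_i$. Then the surrogate reward satisfies,
$$\sum_{s_{t+1} \in \mathcal{S}} P_a(s_t, s_{t+1})\, r(s_t, a, s_{t+1}) = \sum_j p_j R_j = \sum_j \hat{p}_j \hat{R}_j. \qquad (5)$$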
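Corollary 1 can be checked numerically: pushing the true-reward distribution through the confusion matrix and pairing it with the surrogate levels should reproduce the true expected reward. The sketch below uses hypothetical values for the confusion matrix `C`, reward levels `R`, and occurrence probabilities `p`; the surrogate levels `R_hat` are hard-coded and verified in the code to solve $\mathbf{C}\hat{\mathbf{R}} = \mathbf{R}$.

```python
from fractions import Fraction as F

# Hypothetical confusion matrix: C[j][i] = P(observed R_i | true R_j).
C = [[F(8, 10), F(1, 10), F(1, 10)],
     [F(1, 10), F(8, 10), F(1, 10)],
     [F(1, 10), F(1, 10), F(8, 10)]]
R = [F(0), F(1), F(2)]                # hypothetical true reward levels
R_hat = [F(-3, 7), F(1), F(17, 7)]    # surrogate levels for this C and R
p = [F(1, 2), F(3, 10), F(1, 5)]      # hypothetical P(true reward = R_j)

M = len(R)

# Confirm that R_hat really is the surrogate reward, i.e. C @ R_hat == R.
is_surrogate = all(
    sum(C[j][k] * R_hat[k] for k in range(M)) == R[j] for j in range(M)
)

# Occurrence probability of each observed (hence surrogate) level.
p_hat = [sum(p[j] * C[j][i] for j in range(M)) for i in range(M)]

true_mean = sum(pj * Rj for pj, Rj in zip(p, R))
surrogate_mean = sum(ph * Rh for ph, Rh in zip(p_hat, R_hat))
print(p_hat, true_mean, surrogate_mean)
```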
To establish Theorem 1, we need an auxiliary result (Lemma 3) from stochastic approximation theory, which is widely used in convergence proofs for Q-Learning (Jaakkola et al., 1993; Tsitsiklis, 1994).
Lemma 3.
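The random process $\{\Delta_t\}$ taking values in $\mathbb{R}^n$ and defined as
$$\Delta_{t+1}(x) = (1 - \alpha_t(x))\,\Delta_t(x) + \alpha_t(x)\,F_t(x)$$
converges to zero w.p.1 under the following assumptions:

$0 \le \alpha_t \le 1$, $\sum_t \alpha_t(x) = \infty$ and $\sum_t \alpha_t(x)^2 < \infty$;

$\|\mathbb{E}[F_t(x) \mid \mathcal{F}_t]\|_W \le \gamma \|\Delta_t\|_W$, with $\gamma < 1$;

$\mathrm{var}[F_t(x) \mid \mathcal{F}_t] \le C(1 + \|\Delta_t\|_W^2)$, for $C > 0$.
Here $\mathcal{F}_t$ denotes the history of the process up to step $t$ and $\|\cdot\|_W$ is a weighted maximum norm.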
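To build intuition for Lemma 3, the sketch below simulates a scalar instance of the process with step size $\alpha_t = 1/(t+1)$ and $F_t = \gamma \Delta_t + \text{noise}$, so that the conditional mean contracts by $\gamma$ and the conditional variance is bounded. The function name, starting point, noise distribution, and horizon are assumptions chosen for illustration; a single simulated path is of course not a proof of almost-sure convergence.

```python
import random


def simulate(delta0, gamma, steps, seed=0):
    """Iterate Delta_{t+1} = (1 - a_t) * Delta_t + a_t * F_t for a scalar process."""
    rng = random.Random(seed)
    delta = delta0
    for t in range(steps):
        alpha = 1.0 / (t + 1)            # sum(alpha) diverges, sum(alpha^2) converges
        noise = rng.uniform(-1.0, 1.0)   # zero mean, variance 1/3
        f = gamma * delta + noise        # |E[F_t | past]| = gamma * |Delta_t|
        delta = (1.0 - alpha) * delta + alpha * f
    return delta


final = simulate(delta0=10.0, gamma=0.5, steps=100_000)
print(abs(final))  # small compared with the starting value 10.0
```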
Proof of Theorem 1.
For simplicity of notation, we abbreviate $s_t$, $s_{t+1}$, $Q_t$, $Q_{t+1}$, $r_t$, $\hat{r}_t$ and $\alpha_t$ as $s$, $s'$, $Q$, $Q'$, $r$, $\hat{r}$, and $\alpha$, respectively.
The procedure of Phased Q-Learning is described as Algorithm 2:

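Calling the generative model $G(\mathcal{M})$ $m$ times for each state-action pair and estimating the transition probability:
$$\hat{P}_a(s_t, s_{t+1}) = \frac{\#[(s_t, a_t) \to s_{t+1}]}{m}$$

Set
$$\hat{V}(s) = \max_{a \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} \hat{P}_a(s_t, s_{t+1}) \left[ r_t + \gamma \hat{V}(s_{t+1}) \right], \qquad \hat{\pi}(s, t) = \arg\max_{a \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} \hat{P}_a(s_t, s_{t+1}) \left[ r_t + \gamma \hat{V}(s_{t+1}) \right]$$
Note that $\hat{P}$ here is the estimated transition probability, which is different from $P$ in Eqn. (5).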
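The phased procedure above can be sketched in a few lines. The code below is an illustrative toy, not the authors' implementation: it runs phased Q-Learning backwards over a finite horizon, calling a generative model `m` times per state-action pair in each phase, and feeds it surrogate rewards in place of the perturbed observations. The two-state chain, the flip rates, and all function and variable names are hypothetical.

```python
import random


def phased_q_learning(states, actions, sample, gamma, T, m, rng):
    """Finite-horizon phased Q-Learning driven by a generative model.

    sample(s, a, rng) -> (next_state, reward) plays the role of the generative model.
    """
    V = {s: 0.0 for s in states}  # value after the last phase is zero
    policy = {}
    for t in range(T - 1, -1, -1):
        new_V = {}
        for s in states:
            q = {}
            for a in actions:
                # m calls per pair; averaging r + gamma * V(s') over the draws equals
                # summing over outcomes weighted by their empirical frequencies.
                draws = [sample(s, a, rng) for _ in range(m)]
                q[a] = sum(r + gamma * V[s2] for s2, r in draws) / m
            best = max(q, key=q.get)
            policy[(s, t)] = best
            new_V[s] = q[best]
        V = new_V
    return V, policy


# Hypothetical binary rewards with flip rates, and their surrogate levels (Lemma 1).
R_PLUS, R_MINUS = 1.0, 0.0
E_PLUS, E_MINUS = 0.2, 0.3
DENOM = 1.0 - E_PLUS - E_MINUS
R_HAT_PLUS = ((1 - E_MINUS) * R_PLUS - E_PLUS * R_MINUS) / DENOM
R_HAT_MINUS = ((1 - E_PLUS) * R_MINUS - E_MINUS * R_PLUS) / DENOM


def sample(s, a, rng):
    """Toy chain: 'stay' keeps the state, 'move' flips it; landing in state 1 pays R_PLUS."""
    s2 = s if a == "stay" else 1 - s
    true_is_plus = (s2 == 1)
    flipped = rng.random() < (E_PLUS if true_is_plus else E_MINUS)
    observed_plus = (true_is_plus != flipped)
    return s2, (R_HAT_PLUS if observed_plus else R_HAT_MINUS)


rng = random.Random(0)
V, policy = phased_q_learning([0, 1], ["stay", "move"], sample,
                              gamma=0.9, T=5, m=2000, rng=rng)
print(V, policy[(0, 0)], policy[(1, 0)])
```

With noise-free rewards the optimal five-step return from either state is $1 + 0.9 + 0.9^2 + 0.9^3 + 0.9^4 \approx 4.095$, which the surrogate-reward estimate should approach as `m` grows.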
To obtain the sample complexity results, the range of our surrogate reward needs to be known. Assuming the reward $r$ is bounded in $[0, R_{\max}]$, Lemma 4 below states that the surrogate reward is also bounded when the confusion matrix is invertible:
Lemma 4.
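Let $r \in [0, R_{\max}]$ be bounded, where $R_{\max}$ is a constant; suppose $\mathbf{C}_{M \times M}$, the confusion matrix, is invertible with its determinant denoted as $\det(\mathbf{C})$. Then the surrogate reward satisfies
$$0 \le |\hat{r}| \le \frac{M}{\det(\mathbf{C})} R_{\max}. \qquad (6)$$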
Proof of Lemma 4.
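From Eqn. (2), we have,
$$\hat{\mathbf{R}} = \mathbf{C}^{-1} \cdot \mathbf{R} = \frac{\mathrm{adj}(\mathbf{C})}{\det(\mathbf{C})} \cdot \mathbf{R},$$
where $\mathrm{adj}(\mathbf{C})$ is the adjugate matrix of $\mathbf{C}$; $\det(\mathbf{C})$ is the determinant of $\mathbf{C}$. It is known from linear algebra that,
$$\mathrm{adj}(\mathbf{C})_{ij} = (-1)^{i+j} \cdot M_{ji},$$
where $M_{ji}$ is the determinant of the $(M-1) \times (M-1)$ matrix that results from deleting row $j$ and column $i$ of $\mathbf{C}$. Therefore, $M_{ji}$ is also bounded:
$$|M_{ji}| \le \sum_{\sigma} |\mathrm{sgn}(\sigma)| \prod_{m=0}^{M-2} c'_{m,\sigma(m)} \le \prod_{m=0}^{M-2} \Big( \sum_{n=0}^{M-2} c'_{m,n} \Big) \le 1,$$
where the sum is computed over all permutations $\sigma$ of the set $\{0, 1, \dots, M-2\}$; $c'_{m,n}$ is the $(m,n)$ element of the submatrix whose determinant is $M_{ji}$; $\mathrm{sgn}(\sigma)$ returns a value that is $+1$ whenever the reordering given by $\sigma$ can be achieved by successively interchanging two entries an even number of times, and $-1$ whenever it cannot. The last inequality holds because each row of the submatrix consists of nonnegative entries summing to at most $1$.
Consequently,
$$|\hat{R}_i| = \frac{\big| \sum_j \mathrm{adj}(\mathbf{C})_{ij} \cdot R_j \big|}{\det(\mathbf{C})} \le \frac{M}{\det(\mathbf{C})} \cdot R_{\max}.$$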
∎
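The adjugate argument in the proof of Lemma 4 can be traced on a small example. The sketch below (illustrative only; the confusion matrix, reward levels, and helper names are hypothetical) computes the surrogate rewards as $\mathrm{adj}(\mathbf{C})\,\mathbf{R} / \det(\mathbf{C})$ in exact arithmetic, checks that every minor has magnitude at most 1, and compares the surrogate rewards against the bound $M R_{\max} / |\det(\mathbf{C})|$. The absolute value of the determinant is used in code so that the check also covers matrices with negative determinant.

```python
from fractions import Fraction as F


def minor(mat, i, j):
    """Submatrix of mat with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(mat) if k != i]


def det(mat):
    """Determinant by Laplace expansion along the first row (fine for small M)."""
    if not mat:
        return F(1)
    return sum((-1) ** j * mat[0][j] * det(minor(mat, 0, j))
               for j in range(len(mat)))


# Hypothetical asymmetric confusion matrix (rows sum to 1) and reward levels.
C = [[F(7, 10), F(2, 10), F(1, 10)],
     [F(1, 10), F(8, 10), F(1, 10)],
     [F(2, 10), F(2, 10), F(6, 10)]]
R = [F(0), F(1), F(2)]
R_max = max(R)
M = len(R)

det_C = det(C)
# adj(C)[i][j] = (-1)^(i+j) * M_ji, where M_ji deletes row j and column i.
adj = [[(-1) ** (i + j) * det(minor(C, j, i)) for j in range(M)]
       for i in range(M)]
R_hat = [sum(adj[i][j] * R[j] for j in range(M)) / det_C for i in range(M)]

bound = M * R_max / abs(det_C)
minors_bounded = all(abs(det(minor(C, j, i))) <= 1
                     for i in range(M) for j in range(M))
recovered = [sum(C[j][k] * R_hat[k] for k in range(M)) for j in range(M)]
print(det_C, bound, R_hat)
```

The bound is loose here, as expected from the crude estimate $|M_{ji}| \le 1$; it tightens only in the sense that it scales inversely with the determinant, so nearly singular confusion matrices permit very large surrogate rewards.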
Proof of Theorem 2.
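From Hoeffding’s inequality, we obtain:
$$P\left( \Big| \sum_{s_{t+1} \in \mathcal{S}} P_a(s_t, s_{t+1}) V^*_{t+1}(s_{t+1}) - \sum_{s_{t+1} \in \mathcal{S}} \hat{P}_a(s_t, s_{t+1}) V^*_{t+1}(s_{t+1}) \Big| \ge \epsilon_1 \right) \le 2 \exp\left( \frac{-2 m \epsilon_1^2 (1 - \gamma)^2}{R_{\max}^2} \right),$$
because $V_t(s_t)$ is bounded within $\frac{R_{\max}}{1 - \gamma}$. In the same way, $\hat{r}_t$ is bounded by $\frac{M}{\det(\mathbf{C})} \cdot R_{\max}$ from Lemma 4, so it ranges over an interval of width at most $\frac{2M}{\det(\mathbf{C})} \cdot R_{\max}$. We then have,
$$P\left( \Big| \sum_{s_{t+1} \in \mathcal{S};\, \hat{r}_t} P_a(s_t, s_{t+1}, \hat{r}_t)\, \hat{r}_t - \sum_{s_{t+1} \in \mathcal{S};\, \hat{r}_t} \hat{P}_a(s_t, s_{t+1}, \hat{r}_t)\, \hat{r}_t \Big| \ge \epsilon_2 \right) \le 2 \exp\left( \frac{- m \epsilon_2^2 \det(\mathbf{C})^2}{2 M^2 R_{\max}^2} \right).$$
Further, due to the unbiasedness of surrogate rewards, we have
$$\sum_{s_{t+1} \in \mathcal{S}} P_a(s_t, s_{t+1})\, r_t = \sum_{s_{t+1} \in \mathcal{S};\, \hat{r}_t} P_a(s_t, s_{t+1}, \hat{r}_t)\, \hat{r}_t.$$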
As a result,
In the same way,
Recursing the two equations in two directions (), we get