Improving Interactive Reinforcement Agent Planning with Human Demonstration

Guangliang Li, Randy Gomez, Keisuke Nakamura, Jinying Lin, Qilei Zhang and Bo He

Department of Electronic Engineering, Ocean University of China, China
Honda Research Institute Japan Co. Ltd., Japan

{guangliangli, bhe}@ouc.edu.cn, {r.gomez, k.nakamura}@jp.honda-ri.com
Abstract

TAMER has proven to be a powerful interactive reinforcement learning method for allowing ordinary people to teach and personalize autonomous agents' behavior by providing evaluative feedback. However, a TAMER agent planning with UCT, a Monte Carlo Tree Search strategy, can only update states along its path, which can incur a high learning cost, especially for a physical robot. In this paper, we propose to drive the agent's exploration along the optimal path and reduce the learning cost by initializing the agent's reward function via inverse reinforcement learning from demonstration. We test our proposed method in the RL benchmark domain Grid World with different discounts on human reward. Our results show that learning from demonstration allows a TAMER agent to learn a roughly optimal policy up to the deepest search and encourages the agent to explore along the optimal path. In addition, we find that learning from demonstration improves learning efficiency by reducing the total feedback and the number of incorrect actions, and by increasing the ratio of correct actions needed to obtain an optimal policy, allowing a TAMER agent to converge faster.


1 Introduction

Autonomous agents have the potential to operate in many applications in humans' living environments in the near future. In the real world, as interactions between people and agents increase, users may want to teach agents optimal behavior and even customize agents' behavior according to their preferences. Interactive reinforcement learning [????] has been developed and has proven to be a powerful method for enabling non-technical people to teach an agent to perform a task using evaluations of the quality of the agent's behavior.

In this work, we focus on interactive reinforcement agent planning and extend the TAMER framework [Knox and Stone, 2009], a popular interactive reinforcement learning method. To allow a TAMER agent to plan in complex environments, TAMER was combined with Upper Confidence Bounds for Trees (UCT) [Kocsis and Szepesvári, 2006], a typical Monte Carlo Tree Search strategy [?]. However, a TAMER agent planning with UCT can only update states along its path, which causes a local-bias problem and results in low learning efficiency. For example, when the agent does not visit the states along the optimal path, the human trainer cannot give feedback for those states, preventing the agent from learning what reward might be received in those critical states. Therefore, a TAMER agent planning with UCT ironically needs to traverse to the goal to learn the accurate reward predictions that might lead it to traverse to the goal [?]. In addition, when learning from human-generated reward, a TAMER agent still learns by trial and error. In some situations, this can make learning dangerous or costly, especially for physical robots, e.g., when learning to drive a car.

In this paper, we aim to solve the local-bias problem encountered when the TAMER agent plans with UCT and to reduce the agent's cost in the learning process. We propose to improve the TAMER agent's exploration along the optimal path by initializing the agent's reward function via inverse reinforcement learning from human demonstration, which is another main natural teaching method developed for enabling autonomous agents to learn from a non-technical teacher [Argall et al., 2009]. Learning from demonstration often leads to faster learning than reward signals and highlights a subspace for the agent to explore. We evaluate our proposed method in an RL benchmark domain, Grid World, with different discounts on human reward. This is the first time that a TAMER agent planning with UCT has been tested with a real human user and with different discounts on human reward. Our results indicate the usefulness of our proposed method in solving the problem of overly local updates and reducing the learning cost for a TAMER agent planning with UCT.

2 Related Work

When learning from human reward, an agent uses the evaluations of its behavior provided by a human trainer to improve its behavior [????]. Thomaz and Breazeal [2008] implemented an interface with a tabular Q-learning [Watkins and Dayan, 1992] agent in which a separate interaction channel allowed the human to give the agent feedback. The agent aims to maximize its total discounted reward, which is the sum of human reward and environmental reward. Suay and Chernova [2011] extended their work to a real-world robotic system using only human reward. Knox and Stone [2009] proposed the TAMER framework, which allows an agent to learn from only human reward signals instead of environmental rewards by directly modeling the human reward. Moreover, Warnell et al. [2017] proposed Deep TAMER, an extension of the TAMER framework that leverages the representational power of deep neural networks in order to learn complex tasks with high-dimensional state spaces. In addition, MacGlashan et al. [2017] argued that human evaluative feedback should be interpreted as policy feedback that depends on the agent's current policy, and proposed an actor-critic algorithm, Convergent Actor-Critic by Humans (COACH), to learn from human feedback. Loftin et al. [2014, 2015] interpreted human feedback as categorical feedback strategies that depend both on the behavior the trainer is trying to teach and on the trainer's teaching strategy.

In learning from demonstration, the agent learns from sequences of state-action pairs provided by a human trainer who demonstrates the desired behavior [Argall et al., 2009]. For example, apprenticeship learning [Abbeel and Ng, 2004] is a form of learning from demonstration in which the agent learns how to perform a task using inverse reinforcement learning [Ng et al., 2000] from observations of the behavior demonstrated by an expert teacher. Argall et al. [2007] proposed a method in which the agent learns from both demonstrations and the trainer's critiques of the agent's task performance, which is closely related to our work in this paper. However, our work differs in that it allows the human trainer to provide human rewards, i.e., evaluations of the quality of the agent's actions, to fine-tune the agent's behavior, while in their work only critiques of the whole task's performance were provided.

The work of Judah et al. [2014] is most closely related to ours. Specifically, they used a specified shaping reward function to improve the efficiency of learning from demonstration. Our work differs in that the shaping reward is provided by a human trainer rather than by a potential function pre-defined by the agent designer. In addition, Brys et al. [2015] proposed a method for speeding up reinforcement learning from environmental rewards through reward shaping via a potential function learned from demonstrations. In our work, by contrast, we use demonstrations to seed the agent's learning from human-generated rewards.

3 Preliminaries: Interactive Reinforcement Learning

Interactive reinforcement learning (Interactive RL) was developed to allow an ordinary human user to shape the agent learner by providing evaluative feedback [????].

As in traditional reinforcement learning (RL) [Sutton and Barto, 1998], an interactive RL agent learns to make sequential decisions in a task. A sequential decision task is modeled as a Markov decision process (MDP), denoted by a tuple $\{S, A, T, H, \gamma\}$. Time is divided into discrete time steps, $S$ is the set of states in the environment, and $A$ is the set of actions that the agent can perform. At each time step $t$, the agent observes the state of the environment $s_t \in S$ and takes an action $a_t \in A$. The experienced state-action pair takes the agent into a new state $s_{t+1}$, determined by the transition function $T$. The agent receives an evaluative feedback signal $h_t \in H$, provided by a human observer who evaluates the quality of the action selection based on her knowledge. That is to say, there is no predefined reward function in interactive RL. The discount factor $\gamma$ determines the present value of future rewards. The objective of the agent is to learn a policy $\pi: S \rightarrow A$, mapping from states to actions. One common way is to learn an action value function $Q(s,a)$, which estimates the long-term discounted reward for a given state-action pair. Given an optimal action value function, the optimal policy can be obtained by greedily selecting the action with the highest value in the current state.
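As a simple illustration of greedy action selection from a learned action value function, the following sketch picks the highest-valued action in a state. The dictionary-based representation of $Q$ and the four Grid World actions are assumptions made here for illustration only.

```python
import numpy as np

def greedy_action(Q, state, actions=("up", "down", "left", "right")):
    """Return the action with the highest estimated value Q(s, a).

    Q is assumed to be a dict mapping (state, action) pairs to value
    estimates; unseen pairs default to a value of 0.
    """
    values = [Q.get((state, a), 0.0) for a in actions]
    return actions[int(np.argmax(values))]
```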

3.1 The TAMER Framework

In this paper, we use the TAMER framework [Knox and Stone, 2009] as the agent's learning algorithm. TAMER is a typical interactive reinforcement learning method. Unlike the original TAMER framework, which learns and selects actions directly with the reward function [Knox and Stone, 2009], in this paper we rephrase TAMER as a general model-based method for agent learning from human reward, as shown in Figure 1.

An agent implemented according to TAMER learns from real-time evaluations of its behavior, provided by a human teacher who observes the agent. These evaluations are taken as human reward signals. The TAMER agent learns a model of the human reward and then uses it to learn a value function, with which it selects actions so as to accumulate the most human reward.

Figure 1: An agent learns from a human teacher with TAMER (modified from [?]).

There are four key modules for an agent learning with TAMER. The first is a predictive model of human reward. Specifically, the TAMER agent learns a function $\hat{R}_H(s, a)$ approximating the expectation of human rewards received in the interaction experience:

$$\hat{R}_H(s, a) = \vec{w}^{\top} \vec{\phi}(s, a), \qquad (1)$$

where $\vec{w}$ is the parameter vector and $\vec{\phi}$ is a vector of basis feature functions.
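As a concrete illustration of Equation (1), here is a minimal sketch of a linear human-reward model. The class name, feature handling, and learning rate are choices made for illustration, not taken from the paper.

```python
import numpy as np

class LinearRewardModel:
    """Linear model of human reward, R_hat(s, a) = w^T phi(s, a), as in Eq. (1)."""

    def __init__(self, num_features, learning_rate=0.1):
        self.w = np.zeros(num_features)   # parameter vector w
        self.alpha = learning_rate        # learning rate used in Eq. (4)

    def predict(self, phi_sa):
        # phi_sa: feature vector phi(s, a) of a state-action pair.
        return float(np.dot(self.w, phi_sa))

    def update(self, phi_sa, target, weight=1.0):
        # Credit-weighted incremental gradient step toward a human-reward
        # label, in the spirit of Eqs. (3)-(4); weight lets callers scale
        # the update by the credit assigned to this time step.
        delta = target - self.predict(phi_sa)
        self.w += self.alpha * weight * delta * phi_sa
```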

The second module is the credit assigner, which handles the delay in human reward caused by the time the trainer needs to evaluate the agent's behavior and deliver feedback. TAMER defines a probability density function $f_{delay}$ to estimate the probability of the teacher's feedback delay: $f_{delay}$ gives the probability that the feedback occurs within any specific time interval and is used to calculate the probability (i.e., the credit) that a single reward signal targets a single time step. For a reward received at the current time $t$, the credit $c_{t-k}$ for a previous time step $t-k$, which started at time $t^{b}_{t-k}$ and ended at time $t^{e}_{t-k}$, is computed as:

$$c_{t-k} = \int_{t - t^{e}_{t-k}}^{t - t^{b}_{t-k}} f_{delay}(x)\, dx. \qquad (2)$$

If the human teacher gives multiple rewards, the label $\hat{h}_{t-k}$ for each previous time step (state-action pair) is the sum of all credits calculated for each human reward using Equation 2. The TAMER agent uses $\hat{h}_{t-k}$ and the corresponding state-action pair as a supervised learning sample to learn $\hat{R}_H$ by updating its parameters, e.g., with incremental gradient descent:

$$\delta_t = \hat{h}_t - \hat{R}_H(s_t, a_t), \qquad (3)$$
$$\vec{w} \leftarrow \vec{w} + \alpha\, \delta_t\, \vec{\phi}(s_t, a_t), \qquad (4)$$

where $\alpha$ is the learning rate and $\delta_t$ is the temporal difference error.
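To make the credit assignment of Equation (2) concrete, the following sketch credits one human reward to recent time steps and applies the update of Equations (3)-(4) to the linear model above. The uniform delay distribution is an assumed placeholder for $f_{delay}$, not the distribution used in the paper.

```python
from scipy import stats

def assign_credit_and_update(model, history, feedback, feedback_time,
                             delay_dist=stats.uniform(loc=0.2, scale=0.6)):
    """Credit a single human reward to recent steps and update the model.

    history: list of (phi_sa, step_start_time, step_end_time) tuples.
    delay_dist: assumed delay distribution f_delay (uniform here, purely
    for illustration).
    """
    for phi_sa, t_start, t_end in history:
        # Equation (2): probability that this feedback targets this step,
        # i.e., the delay density integrated over the matching window.
        credit = (delay_dist.cdf(feedback_time - t_start)
                  - delay_dist.cdf(feedback_time - t_end))
        if credit > 0:
            # Equations (3)-(4), scaled by the credit for this step.
            model.update(phi_sa, feedback, weight=credit)
```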

The third module is the value function. The TAMER agent learns an action value function $Q(s, a)$ from the learned human reward function $\hat{R}_H$:

$$Q(s, a) = E\left[\sum_{k=0}^{\infty} \gamma^{k}\, \hat{R}_H(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s,\, a_t = a\right]. \qquad (5)$$

The fourth module is the action selector. Like a traditional RL agent, which seeks the largest discounted accumulated future reward, a TAMER agent greedily selects the action with the largest value:

$$a_t = \arg\max_{a \in A} Q(s_t, a). \qquad (6)$$

The TAMER agent learns by repeatedly taking an action, sensing reward, and updating the predictive model and corresponding value function.
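Putting the modules together, an interaction loop might look like the following sketch. The environment and planner interfaces (env.reset, env.step, env.features, env.now, planner.best_action) and the feedback source are hypothetical placeholders, not the paper's implementation.

```python
def tamer_episode(env, model, planner, get_human_feedback):
    """One illustrative TAMER episode: act greedily with respect to the
    planner's value estimates built from the learned reward model, then
    credit any human feedback that arrives to recent time steps."""
    state, done = env.reset(), False
    history = []  # (features of (s_t, a_t), step start time, step end time)
    while not done:
        action = planner.best_action(state)     # greedy selection, Eq. (6)
        t_start = env.now()
        phi_sa = env.features(state, action)    # features of (s_t, a_t)
        state, done = env.step(action)
        history.append((phi_sa, t_start, env.now()))
        # Credit any newly received human rewards to recent time steps.
        for feedback, feedback_time in get_human_feedback():
            assign_credit_and_update(model, history, feedback, feedback_time)
```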

3.2 Inverse Reinforcement Learning

Similar to TAMER, in inverse reinforcement learning (IRL) an agent also learns in an MDP\R [Ng et al., 2000]. An agent learning via IRL assumes that there is a vector of features $\phi$ defined over states and a "true" unknown reward function $R^{*}(s) = w^{*\top}\phi(s)$ that the demonstrator is trying to optimize [Abbeel and Ng, 2004], where $\phi$ is the vector of basis functions and $w^{*}$ is the weight vector. The value of a policy $\pi$ is calculated as

$$V(\pi) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, \pi\right] = E\left[\sum_{t=0}^{\infty} \gamma^{t}\, w^{\top}\phi(s_t) \,\middle|\, \pi\right], \qquad (7)$$

where $V(\pi)$ is the state value function when following policy $\pi$, $\gamma$ is the discount factor, $\phi(s)$ is a vector of basis functions describing features of state $s$, and $w$ is the weight vector over the basis functions for the reward function $R$.

The feature expectation $\mu(\pi)$ is defined as the expected discounted accumulated feature value vector, calculated as:

$$\mu(\pi) = E\left[\sum_{t=0}^{\infty} \gamma^{t} \phi(s_t) \,\middle|\, \pi\right]. \qquad (8)$$

With this notation, the value of a policy can be written as

$$V(\pi) = w^{\top} \mu(\pi). \qquad (9)$$

Therefore, if we can find the optimal (or close to optimal) weight vector $w$ for the reward function $R$, then the value of a policy $\pi$ can be derived with Equation 9, and a policy can be obtained that attains performance near that of the demonstrator on the unknown reward function $R^{*}$. In this paper, we use the projection algorithm [Abbeel and Ng, 2004] in our proposed method.
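The following is a minimal sketch of the projection algorithm of Abbeel and Ng [2004] built on Equations (7)-(9). The callables solve_policy (returning a policy approximately optimal for reward $w^{\top}\phi(s)$), rollout (returning state trajectories of a policy), and feature_fn are placeholders for the planner and feature representation used in the paper; feature vectors are assumed to be NumPy arrays.

```python
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma):
    """Monte Carlo estimate of the feature expectations in Eq. (8)."""
    mu = None
    for traj in trajectories:
        acc = sum((gamma ** t) * feature_fn(s) for t, s in enumerate(traj))
        mu = acc if mu is None else mu + acc
    return mu / len(trajectories)

def projection_irl(mu_expert, solve_policy, rollout, feature_fn, gamma,
                   epsilon=0.01, max_iters=50):
    """Projection IRL: find reward weights w such that a policy optimal for
    w^T phi has feature expectations close to the demonstrator's."""
    # Start from an arbitrary policy (here: one optimal for random weights).
    w = np.random.randn(mu_expert.shape[0])
    mu_bar = feature_expectations(rollout(solve_policy(w)), feature_fn, gamma)
    for _ in range(max_iters):
        w = mu_expert - mu_bar
        if np.linalg.norm(w) <= epsilon:
            break
        mu = feature_expectations(rollout(solve_policy(w)), feature_fn, gamma)
        # Project the expert feature expectations onto the line through
        # mu_bar and the new policy's feature expectations mu.
        d = mu - mu_bar
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return w
```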

4 Methodology

TAMER agent learning can be combined with planning methods such as value iteration or Monte Carlo Tree Search, in which case the agent uses simulated experience to speed up its learning. Value iteration updates the state values over the whole state space. However, in considerably more complex domains, which usually have large state spaces, agents cannot perform value iteration by sweeping over the entire state space, let alone in tasks with continuous states and actions [?].

TAMER can also update the value function through a Monte Carlo Tree Search (MCTS) strategy, Upper Confidence Bounds for Trees (UCT) [Kocsis and Szepesvári, 2006]. MCTS-based search has been successfully applied in many complex tasks with especially large state spaces [?], e.g., the game of Go [Silver et al., 2017]. However, a TAMER agent planning with UCT can only update states along its path, which causes a local-bias problem and results in low learning efficiency. Moreover, when the agent does not visit the states near the goal, the trainer cannot give feedback for those states, preventing the agent from learning what reward might be received in those critical states [?].
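For readers unfamiliar with UCT, here is a compact planner sketch in which rollouts are scored with a learned human-reward model rather than environment reward. The simulator interface (sim.actions, sim.step), the callable reward model, and the exploration constant are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import defaultdict

class UCT:
    """Minimal UCT sketch: build value estimates by simulated rollouts,
    using the learned reward model R(s, a) as the immediate reward."""

    def __init__(self, sim, reward_model, gamma, c=1.4, depth=20, n_sims=200):
        self.sim, self.R, self.gamma = sim, reward_model, gamma
        self.c, self.depth, self.n_sims = c, depth, n_sims
        self.N = defaultdict(int)      # visit counts for (state, action)
        self.Ns = defaultdict(int)     # visit counts for state
        self.Q = defaultdict(float)    # mean return estimates

    def best_action(self, state):
        for _ in range(self.n_sims):
            self._simulate(state, self.depth)
        return max(self.sim.actions(state), key=lambda a: self.Q[(state, a)])

    def _simulate(self, state, depth):
        if depth == 0:
            return 0.0
        # UCB1 action selection; untried actions are selected first.
        def ucb(a):
            if self.N[(state, a)] == 0:
                return float("inf")
            return self.Q[(state, a)] + self.c * math.sqrt(
                math.log(self.Ns[state]) / self.N[(state, a)])
        action = max(self.sim.actions(state), key=ucb)
        next_state = self.sim.step(state, action)
        # Immediate reward comes from the learned human-reward model.
        ret = self.R(state, action) + self.gamma * self._simulate(next_state, depth - 1)
        self.Ns[state] += 1
        self.N[(state, action)] += 1
        self.Q[(state, action)] += (ret - self.Q[(state, action)]) / self.N[(state, action)]
        return ret
```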

Optimistically initializing the reward function might solve the local-bias problem, since it can drive the agent to explore the whole state-action space [?]. However, this leads to extensive exploration and might frustrate, exhaust, and even confuse the human trainer, since with such thorough exploration the agent may not respond to the trainer's feedback and may be unable to learn for a considerable period of time. Moreover, it would sacrifice the fast learning that is the main advantage of interactive reinforcement learning. Nonetheless, we believe some mild optimistic initialization might solve the problem of overly local exploration.

Therefore, to drive the agent's exploration and improve its learning efficiency, in this paper we propose to use human demonstration to seed the TAMER agent's learning and planning. Specifically, our proposed method first lets the agent learn a reward function from human-provided demonstrations via inverse reinforcement learning (IRL). The demonstrations provided by the human trainer consist of sequences of state-action pairs $\{(s_0, a_0), (s_1, a_1), \ldots\}$, which are fed into the IRL algorithm. The reward function learned via IRL from the demonstration is used to seed the reward function in TAMER, which learns a value function from the learned human reward function and plans with UCT at the same time. The human trainer can then revise the agent's policy with human rewards.

In our approach, inverse reinforcement learning is implemented with the projection algorithm [Abbeel and Ng, 2004], though approaches such as maximum entropy, Bayesian, and game-theoretic IRL could also be used. In IRL, we implemented UCT to generate planning trajectories for optimizing the reward function. In TAMER, planning trajectories are also chosen by UCT, where the search tree is reset at the start of each time step to account for changes to the human reward function, which generally make past search results inaccurate.
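The seeding step can be illustrated with the sketch below. The run_projection_irl callable and the reward-model object with a weight attribute are stand-ins for the components sketched earlier; the only point being illustrated is that the IRL weight vector initializes the TAMER reward-model weights, which is our reading of "seeding" and an assumption about implementation detail.

```python
import numpy as np

def seed_tamer_from_demonstration(demo_trajectories, feature_fn, gamma,
                                  run_projection_irl, tamer_reward_model):
    """Illustrative seeding step: IRL weights initialize the TAMER model."""
    # Expert feature expectations (Eq. 8) estimated from the provided
    # demonstration trajectories (lists of visited states).
    mu_expert = np.mean(
        [sum((gamma ** t) * feature_fn(s) for t, s in enumerate(traj))
         for traj in demo_trajectories], axis=0)
    # Recover reward weights that (approximately) explain the demonstration.
    w_demo = run_projection_irl(mu_expert)
    # Seed TAMER: both reward functions share the same feature space, so
    # the IRL weights can directly initialize the human-reward model.
    tamer_reward_model.w = w_demo.copy()
    return tamer_reward_model
```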

In this paper, we would like to test how human demonstration can improve a TAMER agent's learning and drive its planning from human reward, not to solve the task with demonstrations or human reward alone. Therefore, as a starting point, we assume the human trainer prefers to provide one demonstration first and then use human reward to revise the agent's behavior, though more demonstrations could be provided, even to the point that the task is solved with demonstrations alone. We will investigate the effect of more demonstrations on the agent's learning and planning from human reward, and even the interchangeability of demonstrations and human rewards, in future work.

5 Experiments

To demonstrate the potential usefulness of our proposed approach, we perform experiments in the Grid World domain, a benchmark problem in reinforcement learning. The grid world task contains 30 states. In each state, at each time step the agent can choose from four actions: moving up, down, left, or right. An action attempted through a wall results in no movement for that step. Task performance is measured by the number of time steps (actions) taken to reach the goal. The agent always starts each learning episode in the same state, shown as the robot's location in Figure 2. The red cross indicates the direction of the agent's action. In the task, the agent tries to learn a policy that reaches the goal state (the blue square in Figure 2) in as few time steps as possible. The optimal policy from the start state requires 19 actions.

Figure 2: A screenshot of the Grid World domain. The robot’s current location is the starting state for each episode, and the goal state is the blue block next to it with a wall between them. Note that the dark black lines and the grey blocks are walls.

In our experiments, to see the effect of our proposed method, we compare agent learning via TAMER to that via our proposed method with different discount rates on human reward ($\gamma$). The TAMER framework and the TAMER module in our proposed method are the same, and both search with the UCT algorithm to update the action value function. Therefore, when we refer to $\gamma$, it applies to both. The only difference between TAMER and our method is whether learning from demonstration via IRL is incorporated. In addition, the reward functions and action value functions in both TAMER and IRL are represented by linear models of Gaussian radial basis functions.

One radial basis function is centered on each cell of the grid world, effectively creating a pseudo-tabular representation that generalizes slightly between nearby cells. Each radial basis function has a width defined relative to a unit distance of 1 between adjacent radial basis function centers, and the linear model has an additional bias feature of constant value 0.1 [?]. The discount factor for learning from demonstration via IRL is set to 0.99. The author of the paper trained agents with both our proposed method and TAMER under all discount factors. Each agent with each discount factor was trained for 10 trials. For each trial with either method, we trained the agent to the best of the trainer's ability until an optimal policy was obtained. With our proposed method, we first provide a single demonstration navigating from the start state to the goal state via the keyboard, and then train the agent with human rewards as in TAMER. Note that, for each trial, the demonstration was performed to the best of the demonstrator's ability but might not always have been optimal. The analysis in the next section is based on an average of the data collected from the 10 trials.
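The feature representation might be built as in the sketch below. The RBF width and the 6x5 grid layout are assumptions made for illustration (the paper only states that the task has 30 states and that the RBF centers sit one per cell), while the constant bias feature of 0.1 follows the text.

```python
import numpy as np

def rbf_features(x, y, width=0.5, grid_width=6, grid_height=5, bias=0.1):
    """Gaussian RBF features for a grid-world cell.

    One basis function is centered on each cell; distances are measured in
    cell units, so adjacent centers are a distance of 1 apart.
    """
    centers = [(cx, cy) for cx in range(grid_width) for cy in range(grid_height)]
    feats = [np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * width ** 2))
             for cx, cy in centers]
    feats.append(bias)  # constant bias feature
    return np.array(feats)
```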

6 Experimental Results

This section presents results of experiments performed in the Grid World domain with our proposed algorithm in comparison with TAMER. Our experiments were conducted with discount factor 0.99 for learning from demonstration via IRL, paired with discount factors $\gamma$ = 0, 0.7, 0.9, and 0.99 on human reward for the TAMER agent planning with UCT. Note that the TAMER module in our proposed algorithm and the TAMER baseline use the same $\gamma$ values.

6.1 Amount of Feedback

Figure 3: The number of time steps with feedback, trained until an optimal policy is obtained, for different discount rates on human reward, in terms of total, positive, and negative feedback. Seeding the TAMER agent's learning and planning with a reward function learned from demonstration significantly reduces the amount of feedback needed to learn an optimal policy. Note: black bars show the standard error of the mean.

We hypothesized that agents trained with our proposed method would require less feedback, especially negative feedback, than those trained with the TAMER framework. To measure the amount of feedback given, we counted the number of time steps with feedback, comparing our method to TAMER. Figure 3 shows the number of time steps with feedback for both our proposed method and TAMER agents with different discount rates on human reward, in terms of total, positive, and negative feedback.

Discount on human reward (γ)        0          0.7        0.9        0.99
Our method    Positive              0.30       0.34       0.39       0.28
              Negative              0.70       0.66       0.61       0.72
              (σ)                   ()         ()         ()         ()
TAMER         Positive              0.18       0.25       0.24       0.20
              Negative              0.82       0.75       0.76       0.80
              (σ)                   ()         ()         ()         ()
p (t-test)                          6.6e-05    1.4e-06    6.4e-10    6.2e-07
Table 1: Ratio of positive and negative feedback among the total feedback given for agents learning with our method and with TAMER. Note: p-values were computed with a t-test; σ is the standard deviation.

From Figure 3 we can see that agents learning with our method received significantly less total feedback than TAMER agents for all discounts on human reward (γ = 0, 0.7, 0.9, and 0.99). In terms of negative feedback, agents learning with our method also received significantly less feedback than TAMER agents. The largest differences between our method and TAMER in total and negative feedback occur when the discount rate on human reward is highest (γ = 0.99). Although the amount of positive feedback received with our proposed method is not always larger than with TAMER, Table 1 shows that the ratio of positive feedback to total feedback is significantly higher for our proposed method than for TAMER for all discount rates on human reward. These results suggest that learning from demonstration can improve the learning efficiency of a TAMER agent by reducing the total number of human rewards needed to train the agent to an optimal policy. Moreover, the provided demonstration can reduce the number of incorrect actions and increase the ratio of correct actions during the learning process.

6.2 Performance

Since the task performance metric is the number of time steps taken to reach the goal in the Grid World domain, we take the total number of time steps needed to train the agent to an optimal policy as the performance measure in our experiments. Figure 4 shows the total number of time steps (actions) needed to train an agent to an optimal policy with our proposed method and with TAMER, using different discount rates on human reward. From Figure 4 we can see that the total number of time steps needed to train an agent with our method is significantly smaller than for a TAMER agent for all discount rates on human reward. Figure 4 also shows that the total number of actions needed to obtain an optimal policy with our proposed method decreases as the discount on human reward increases from 0 to 0.99. The largest difference between our method and TAMER occurs when the discount factor is 0.99.

Figure 4: Seeding the TAMER agent's learning and planning with a reward function learned from demonstration significantly reduces the total number of time steps needed to obtain an optimal policy across different discount rates on human reward. Note: black bars show the standard error of the mean.

We also analyzed the number of time steps per episode, trained until an optimal policy is obtained, for our proposed method and TAMER during the training process, as shown in Figure 5. From Figure 5 we can see that, for all discount rates on human reward, the number of time steps in the first episode is similar for both our method and TAMER. After that, however, the number of time steps per episode with our proposed method decreases dramatically, especially for γ = 0.9 and 0.99, and is significantly smaller than with TAMER before an optimal policy is obtained. This could be because the TAMER agent plans and updates the value function with UCT and has not yet updated the value function for all states; in this case, the reward function learned from demonstration has little effect on the TAMER agent's planning early in training. Moreover, for all discounts on human reward, agents trained with our method converge faster than those trained with TAMER. For example, agents with our method need about one episode fewer than TAMER to learn an optimal policy when the discounts on human reward are 0 and 0.7. When γ = 0.9 and 0.99, agents with our method learn an optimal policy within three or four episodes, while it takes a TAMER agent five to eight episodes to achieve the same performance. Moreover, the number of episodes needed to train an agent with our method to an optimal policy decreases as the discount on human reward increases from 0 to 0.99.

Figure 5: The number of time steps needed to reach the goal per episode, until an optimal policy is obtained, for different discount rates on human reward, for our proposed method and TAMER. Seeding with a reward function learned from demonstration allows the TAMER agent to learn and plan faster. Note: black bars show the standard error of the mean.

7 Discussion

For a TAMER agent searching via UCT in complex environments with large state spaces, when the agent does not visit the states near the goal, the human trainer cannot give feedback for those states. This prevents the agent from learning what reward might be received in those critical states. Therefore, a TAMER agent planning with UCT cannot learn the true values of states along its optimal path. During the learning process, the learned policy might take the agent back to previously experienced states, since the agent has already received feedback for those states and updated them with high values. The heat map of the state value function in Figure 6(a) supports this explanation. From Figure 6(a) we can see that experienced states have been updated with high values, while states far from the experienced states often have not been updated even once and retain their original values. With our proposed method, in contrast, the TAMER agent can obtain a roughly optimal policy up to the deepest search, which encourages it to explore along the optimal path while navigating to the goal, as shown in Figure 6(b).

(a) TAMER (b) Our method
Figure 6: Heat maps of state values when the TAMER agent visits state block X1Y1 for the first time (a), and when the agent trained with our proposed method, after learning from demonstration via IRL, visits the starting state block X5Y1 for the first time (b). Note: the white square shows the block the robot is in; X denotes the row number and Y the column number on the board.

8 Conclusions

In this paper, to alleviate the local-bias problem encountered when the TAMER agent plans with the UCT algorithm and to reduce the agent's cost in the learning process, we propose to drive the TAMER agent's exploration along the optimal path by initializing the agent's reward function via learning from demonstration. We test our proposed method in the RL benchmark domain Grid World with different discount rates on human reward. Our results show that learning from demonstration can improve the agent's learning efficiency by reducing the feedback needed to obtain an optimal policy. In addition, demonstration can reduce the learning cost by decreasing the number of incorrect actions and increasing the ratio of correct actions during the learning process. More importantly, demonstration allows a TAMER agent to learn a roughly optimal policy up to the deepest search and encourages the agent to explore along the optimal path, allowing it to converge faster.

In future work, we would like to further test our method in more complex domains and conduct a user study, to see how our method and results generalize to other domains and to members of the general public. In addition, we plan to extend our method by combining it with deep learning methods to see how learning from demonstration affects an agent's learning and planning from human reward in complex tasks with high-dimensional state spaces.

References

  • [Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, pages 1–8. ACM, 2004.
  • [Argall et al., 2007] Brenna Argall, Brett Browning, and Manuela Veloso. Learning by demonstration with critique from a human teacher. In Proceedings of the ACM/IEEE international conference on Human-robot interaction, pages 57–64. ACM, 2007.
  • [Argall et al., 2009] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  • [Brys et al., 2015] Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2015.
  • [Isbell et al., 2001] Charles Isbell, Christian R Shelton, Michael Kearns, Satinder Singh, and Peter Stone. A social reinforcement learning agent. In Proceedings of the 5th International Conference on Autonomous Agents, pages 377–384. ACM, 2001.
  • [Judah et al., 2014] Kshitij Judah, Alan Fern, Prasad Tadepalli, and Robby Goetschalckx. Imitation learning with demonstrations and shaping rewards. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 1890–1896, 2014.
  • [Knox and Stone, 2009] W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: the TAMER framework. In Proceedings of the 5th International Conference on Knowledge Capture, pages 9–16. ACM, 2009.
  • [Knox and Stone, 2015] W Bradley Knox and Peter Stone. Framing reinforcement learning from human reward: reward positivity, temporal discounting, episodicity, and performance. Artificial Intelligence, 225:24–50, 2015.
  • [Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
  • [Loftin et al., 2014] Robert Loftin, James MacGlashan, M Littman, M Taylor, and D Roberts. A strategy-aware technique for learning behaviors from discrete human feedback. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI-2014), 2014.
  • [Loftin et al., 2015] Robert Loftin, Bei Peng, James MacGlashan, Michael L Littman, Matthew E Taylor, Jeff Huang, and David L Roberts. Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems, pages 1–30, 2015.
  • [MacGlashan et al., 2017] James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E Taylor, and Michael L Littman. Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2285–2294, 2017.
  • [Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of International Conference on Machine Learning (ICML), pages 663–670, 2000.
  • [Pilarski et al., 2011] Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi, Jason P Carey, and Richard S Sutton. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In Proceedings of 12th International Conference on Rehabilitation Robotics (ICORR), pages 1–7. IEEE, 2011.
  • [Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
  • [Suay and Chernova, 2011] Halit Bener Suay and Sonia Chernova. Effect of human guidance and state space size on interactive reinforcement learning. In Proceedings of IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 1–6. IEEE, 2011.
  • [Sutton and Barto, 1998] R. Sutton and A. Barto. Reinforcement learning: an introduction. MIT press, 1998.
  • [Thomaz and Breazeal, 2008] Andrea L Thomaz and Cynthia Breazeal. Teachable robots: understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6):716–737, 2008.
  • [Warnell et al., 2017] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. Deep tamer: Interactive agent shaping in high-dimensional state spaces. arXiv preprint arXiv:1709.10163, 2017.
  • [Watkins and Dayan, 1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.