Efficient Deep Reinforcement Learning through Policy Transfer

Abstract

Transfer Learning (TL) has shown great potential to accelerate Reinforcement Learning (RL) by leveraging prior knowledge from past learned policies of relevant tasks. Existing transfer approaches either explicitly compute the similarity between tasks or select appropriate source policies to provide guided explorations for the target task. However, a way to directly optimize the target policy by alternately utilizing knowledge from appropriate source policies, without explicitly measuring the similarity, is currently missing. In this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL by taking advantage of this idea. Our framework learns when and which source policy is the best to reuse for the target policy and when to terminate it by modeling multi-policy transfer as an option learning problem. PTF can be easily combined with existing deep RL approaches. Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods in terms of learning efficiency and final performance in both discrete and continuous action spaces.

Keywords: Reinforcement learning; Policy transfer; Policy reuse

Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), B. An, N. Yorke-Smith, A. El Fallah Seghrouchni, G. Sukthankar (eds.), May 2020, Auckland, New Zealand.

College of Intelligence and Computing, Tianjin University

Nanjing University

Fuxi AI Lab in Netease

Tianjin University

Washington State University

Northwestern Polytechnical University

1 Introduction

Recent advances in Deep Reinforcement Learning (DRL) have achieved impressive success in reaching human-level control in complex tasks Mnih et al. (2015); Lillicrap et al. (2016). However, DRL still suffers from sample inefficiency, especially when the state-action space becomes large, which makes it difficult to learn from scratch. TL has shown great potential to accelerate RL Sutton and Barto (1998) by leveraging prior knowledge from past learned policies of relevant tasks Taylor and Stone (2009); Laroche and Barlier (2017); Rajendran et al. (2017). One major direction of transfer in RL focuses on measuring the similarity between two tasks, either by mapping the state spaces between the two tasks Taylor et al. (2007); Brys et al. (2015) or by computing the similarity of two Markov Decision Processes (MDPs) Song et al. (2016), and then transferring value functions directly according to their similarities.

Another direction of policy transfer focuses on selecting a suitable source policy for explorations Fernández and Veloso (2006); Li and Zhang (2018). However, such single-policy transfer cannot be applied to cases where one source policy is only partially useful for learning the target task. Although some transfer approaches utilize multiple source policies during target task learning, they suffer from one of the following limitations. Laroche and Barlier (2017) assumed that all tasks share the same transition dynamics and differ only in the reward function. Li et al. (2019) proposed Context-Aware Policy reuSe (CAPS), which requires the source policies to be optimal since it only learns an intra-option policy over these source policies. Furthermore, CAPS requires manually adding primitive policies to the policy library, which limits its generality, and it cannot be applied to problems with continuous action spaces.

To address the above problems, we propose a novel Policy Transfer Framework (PTF) which combines the above two directions of policy reuse. Instead of using source policies as guided explorations in a target task, we adaptively select a suitable source policy during target task learning and use it as a complementary optimization objective of the target policy. The backbone of PTF can still use existing DRL algorithms to update its policy, and the source policy selection problem is modeled as an option learning problem. In this way, PTF does not require any source policy to be perfect on any subtask and can still learn toward an optimal policy in case none of the source policies is useful. Besides, the option framework allows us to use the termination probability as a performance indicator to determine whether the reuse of a source policy should be terminated to avoid negative transfer. In summary, the main contributions of our work are: 1) PTF learns when and which source policy is the best to reuse for the target policy and when to terminate it by modeling multi-policy transfer as an option learning problem; 2) we propose an adaptive and heuristic mechanism to ensure the efficient reuse of source policies and avoid negative transfer; and 3) both existing value-based and policy-based DRL approaches can be incorporated, and experimental results show PTF significantly boosts the performance of existing DRL approaches and outperforms state-of-the-art policy transfer methods in both discrete and continuous action spaces.

2 Background

This paper focuses on standard RL tasks. Formally, a task can be specified by a Markov Decision Process (MDP), described as a tuple $\langle S, A, T, R \rangle$, where $S$ is the set of states; $A$ is the set of actions; $T$ is the state transition function, $T: S \times A \times S \to [0, 1]$; and $R$ is the reward function, $R: S \times A \times S \to \mathbb{R}$. A policy $\pi$ is a probability distribution over actions conditioned on states, $\pi: S \times A \to [0, 1]$. The solution for an MDP is to find an optimal policy $\pi^*$ maximizing the total expected return with a discount factor $\gamma$: $U = \sum_{i=t}^{T} \gamma^{i-t} r_i$.
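As a small illustration of the discounted return that the optimal policy maximizes, the sketch below sums a finite reward sequence with discount factor `gamma`. The function name and the example rewards are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: U_k = r_k + gamma * U_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Rewards 1, 0, 2 with gamma = 0.5 give 1 + 0.5 * 0 + 0.25 * 2.
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

The backward accumulation is the same recursion that A3C-style methods use when they compute n-step returns from a rollout.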

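Q-Learning, Deep Q-Network (DQN). Q-learning Watkins and Dayan (1992) and DQN Mnih et al. (2015) are popular value-based RL methods. Q-learning holds an action-value function for policy $\pi$ as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[U \mid s_t = s, a_t = a]$, and learns the optimal Q-function, which yields an optimal policy Watkins and Dayan (1992). DQN learns the optimal Q-function by minimizing the loss:

$$L(\theta) = \mathbb{E}_{s, a, r, s'}\left[\left(r + \gamma \max_{a'} Q'(s', a' \mid \theta') - Q(s, a \mid \theta)\right)^2\right] \tag{1}$$

where $Q'$ is the target Q-network parameterized by $\theta'$ and periodically updated from $\theta$.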
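The DQN loss can be sketched on plain Python lists as follows. This is a minimal illustration, not the authors' implementation: Q-values are passed in as precomputed lists rather than produced by networks, and the `dones` flag that removes the bootstrap term at terminal states is a common convention assumed here rather than something stated in the text.

```python
def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma):
    """Mean squared TD error over a batch.

    q_online[i]      : list of Q(s_i, . | theta) from the online network
    q_target_next[i] : list of Q'(s'_i, . | theta') from the target network
    """
    squared_errors = []
    for q_s, q_next, action, reward, done in zip(
        q_online, q_target_next, actions, rewards, dones
    ):
        # Assumed convention: no bootstrapping from terminal next states.
        bootstrap = 0.0 if done else max(q_next)
        target = reward + gamma * bootstrap
        squared_errors.append((target - q_s[action]) ** 2)
    return sum(squared_errors) / len(squared_errors)
```

In a deep implementation only the online values would receive gradients; the target values are treated as constants.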

Policy Gradient (PG) Algorithms. Policy gradient methods are another choice for RL tasks; they directly optimize the policy $\pi$ parameterized by $\theta$. PG methods optimize the objective $J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}[U]$ by taking steps in the direction of $\nabla_{\theta} J(\theta)$. Using the Q-function, the gradient of the policy can be written as:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right] \tag{2}$$

where $\rho^{\pi}$ is the state distribution given $\pi$. Several practical PG algorithms differ in how they estimate $Q^{\pi}$. For example, REINFORCE Williams (1992) simply uses a sample return $U$. Alternatively, one could learn an approximation of the action-value function $Q^{\pi}(s, a)$; this approximation is called the critic and leads to a variety of actor-critic algorithms Sutton and Barto (1998); Mnih et al. (2016).

The Option Framework. Sutton et al. (1999) first formalized the idea of temporally extended actions as options. An option $o$ is defined as a triple $\langle I_o, \pi_o, \beta_o \rangle$ in which $I_o \subseteq S$ is an initiation state set, $\pi_o$ is an intra-option policy and $\beta_o: S \to [0, 1]$ is a termination function that specifies the probability that option $o$ terminates at state $s \in S$. An MDP endowed with a set of options becomes a Semi-Markov Decision Process (Semi-MDP), which has a corresponding optimal option-value function over options learned using intra-option learning. The option framework considers the call-and-return option execution model, in which an agent picks an option $o$ according to its option-value function $Q_o(s, o)$ and follows the intra-option policy $\pi_o$ until termination, then selects the next option and repeats the procedure.
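The call-and-return execution model can be sketched as a simple loop. This is an illustrative sketch under stated assumptions: every option is assumed available in every state (initiation sets are omitted), the option with the highest value is picked greedily, and all argument names (`step_fn`, `option_values`, `termination_prob`) are hypothetical callables supplied by the caller.

```python
import random


def call_and_return(start_state, step_fn, option_values, intra_policies,
                    termination_prob, max_steps, rng):
    """Run the call-and-return model and return (state, option, action) triples."""
    state = start_state
    option = None
    trace = []
    for _ in range(max_steps):
        # Pick an option at the start, or when the current one terminates here.
        if option is None or rng.random() < termination_prob(state, option):
            values = option_values(state)
            option = max(range(len(values)), key=values.__getitem__)
        # Follow the intra-option policy of the active option.
        action = intra_policies[option](state)
        trace.append((state, option, action))
        state = step_fn(state, action)
    return trace
```

For example, on a one-dimensional line with one option that moves right and one that moves left, passing `rng=random.Random(0)` and a termination function that returns 1.0 whenever the active option is no longer the greedy one makes the agent switch options exactly at the states where the preferred direction changes.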

3 Policy Transfer Framework

3.1 Motivation

One major direction of previous work focuses on transferring value functions directly according to the similarity between two tasks Brys et al. (2015); Song et al. (2016); Laroche and Barlier (2017). However, this approach often assumes a well-estimated model for measuring similarity, which is computationally expensive and infeasible in complex scenarios. Another direction of policy transfer methods focuses on selecting appropriate source policies, based on their performance on the target task, to provide guided explorations during each episode Fernández and Veloso (2006); Li and Zhang (2018); Li et al. (2019). However, most of these works face the challenge of how to select a suitable source policy, since each source policy may only be partially useful for the target task. Furthermore, some of them assume source policies to be optimal and deterministic, which restricts their generality. How to directly optimize the target policy by alternately utilizing knowledge from appropriate source policies, without explicitly measuring the similarity, is currently missing in previous work.

Based on the above analysis, we propose a novel Policy Transfer Framework (PTF) that accelerates RL by taking advantage of this idea and combining the above two directions of policy reuse. Instead of using source policies as guided explorations in a target task, PTF adaptively selects a suitable source policy during target task learning and uses it as a complementary optimization objective of the target policy. In this way, PTF does not require any source policy to be perfect on any subtask and can still learn toward an optimal policy in case none of the source policies is useful. Besides, we propose a novel way of adaptively determining the degree to which the knowledge of a source policy is transferred to the target one to avoid negative transfer, which can be used effectively in cases where only some of the source policies share the same state-action space as the target one.

3.2 Framework Overview

Figure 1: An illustration of the policy transfer framework. Panels: (a) PTF, (b) agent module, (c) option module.

Figure 1(a) illustrates the proposed Policy Transfer Framework (PTF), which contains two main components. The first (Figure 1(b)) is the agent module (shown here as an actor-critic model), which learns the target policy with guidance from the option module. The second (Figure 1(c)) is the option module, which learns when and which source policy is useful for the agent module. Given a set of source policies as the intra-option policies, the PTF agent first initializes a set of options together with the option-value network with random parameters. At each step, it selects an action following its policy, receives a reward and transitions to the next state. Meanwhile, it also selects an option $o$ according to the policy over options and the termination probabilities. For the update, the PTF agent introduces a complementary loss, which transfers knowledge from the intra-option policy $\pi_o$ through imitation, weighted by an adaptive adjustment factor $f(\beta_o, t)$. The PTF agent also updates the option-value network and the termination probability of $o$ using its own experience simultaneously. The reuse of the policy $\pi_o$ terminates according to the termination probability of $o$, and then another option is selected for reuse following the policy over options. In this way, PTF efficiently exploits the useful information from the source policies and avoids negative transfer through the call-and-return option execution model. PTF can be easily integrated with both value-based and policy-based DRL methods. The next section describes in detail how it can be combined with A3C Mnih et al. (2016) as an example.

3.3 Policy Transfer Framework (PTF)

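1:  Initialize: option-value network parameters $\theta_o$, termination network parameters $\theta_\beta$, replay buffer $D$, global parameters $\theta$ and $\theta_v$, thread-specific parameters $\theta'$ and $\theta'_v$, step $t \leftarrow 1$
2:  for each thread do
3:     Reset gradients: $d\theta \leftarrow 0$, $d\theta_v \leftarrow 0$
4:     Assign thread-specific parameters: $\theta' = \theta$, $\theta'_v = \theta_v$
5:     Start from state $s_t$, $t_{start} = t$
6:     Select an option $o$ $\epsilon$-greedily w.r.t. $Q_o(s_t, o \mid \theta_o)$
7:     repeat
8:        Perform an action $a_t \sim \pi(s_t \mid \theta')$
9:        Observe reward $r_t$ and new state $s_{t+1}$
10:        $t \leftarrow t + 1$
11:        Store the transition to replay buffer $D$
12:        Choose another option if $o$ terminates
13:     until $s_t$ is terminal or $t - t_{start} = t_{max}$
14:     $R = 0$ if $s_t$ is terminal, otherwise $R = V(s_t \mid \theta'_v)$
15:     for $i \in \{t - 1, \dots, t_{start}\}$ do
16:        $R \leftarrow r_i + \gamma R$
17:        Calculate gradients w.r.t. $\theta'_v$: accumulate $d\theta_v$ from the temporal difference loss $(R - V(s_i \mid \theta'_v))^2$
18:        Calculate gradients w.r.t. $\theta'$: accumulate $d\theta$ from the actor loss, the cross-entropy loss $L_H$ weighted by $f(\beta_o, t)$, and the entropy bonus
19:        Update $\theta_o$ (see Algorithm 2)
20:        Update $\theta_\beta$ w.r.t. (Equation 5)
21:     end for
22:     Asynchronously update $\theta$ using $d\theta$ and $\theta_v$ using $d\theta_v$
23:  end for
Algorithm 1 PTF-A3C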

In this section, we describe how PTF is applied to A3C Mnih et al. (2016), yielding PTF-A3C. The whole learning process of PTF-A3C is shown in Algorithm 1. First, PTF-A3C initializes network parameters for the option-value network, the termination network (which shares the input and hidden layers with the option-value network and has a different output layer), and the A3C networks (Line 1). For each episode, the PTF-A3C agent first selects an option $o$ according to the policy over options (Line 6); then it selects an action $a_t$ following the current policy $\pi(s_t \mid \theta')$, receives a reward $r_t$, transitions to the next state $s_{t+1}$ and stores the transition in the replay buffer $D$ (Lines 8-11). Another option is selected if the option $o$ terminates according to its termination probability (Line 12).

For the update, the agent computes the gradient of the temporal difference loss for the critic network (Line 17). It also calculates the gradients of the standard actor loss and of an extra loss that captures the difference between the source policy $\pi_o$ inside the option $o$ and the current policy $\pi(\cdot \mid \theta')$, measured by the cross-entropy loss $L_H = H(\pi_o \,\|\, \pi(\cdot \mid \theta'))$. Here $\pi_o$ is used as the supervision signal, weighted by an adaptive adjustment factor $f(\beta_o, t)$. To ensure sufficient exploration, an entropy bonus is also considered Mnih et al. (2016), weighted by a constant factor (Line 18). The agent then updates the option-value network following Algorithm 2 and the termination network accordingly (Lines 19, 20), as described in detail in the following sections.
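The structure of the actor objective described above can be sketched for a single state with discrete actions. This is a minimal sketch, assuming the three terms are combined additively into a loss to be minimized; the function names, the small `eps` added for numerical safety, and the sign convention for the entropy bonus are assumptions made for illustration.

```python
import math


def cross_entropy(source_probs, target_probs, eps=1e-8):
    """H(pi_o || pi) = -sum_a pi_o(a|s) * log pi(a|s) for one state."""
    return -sum(p * math.log(q + eps) for p, q in zip(source_probs, target_probs))


def entropy(probs, eps=1e-8):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


def actor_loss_with_transfer(policy_gradient_loss, source_probs, target_probs,
                             transfer_weight, entropy_coef):
    """Actor loss plus weighted imitation term minus entropy bonus (assumed form)."""
    imitation = cross_entropy(source_probs, target_probs)
    bonus = entropy(target_probs)
    return policy_gradient_loss + transfer_weight * imitation - entropy_coef * bonus
```

Setting `transfer_weight` to zero recovers an ordinary actor loss with entropy regularization, which matches the claim that the agent can still learn from its own experience when no source policy is useful.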

Learning Source Policy Selection

The remaining issue is how to update the option-value network, which is given in Algorithm 2. Since options are temporal abstractions Sutton et al. (1999); Bacon et al. (2017), $U$ is introduced as the option-value function upon arrival. The expected return of executing option $o$ upon entering the next state $s'$ is $U(s', o \mid \theta'_o)$, which depends on $\beta(s', o \mid \theta_\beta)$, i.e., the probability that option $o$ terminates in the next state $s'$:

$$U(s', o \mid \theta'_o) = \left(1 - \beta(s', o \mid \theta_\beta)\right) Q_o(s', o \mid \theta'_o) + \beta(s', o \mid \theta_\beta) \max_{o' \in O} Q_o(s', o' \mid \theta'_o) \tag{3}$$

Then, PTF-A3C samples a batch of transitions from the replay buffer and updates the option-value network by minimizing the loss (Line 6 in Algorithm 2). Each sample can be used to update the values of multiple options, as long as the option's intra-option policy would select the sampled action (for a continuous action space, this is checked by testing whether the action falls within a certain confidence interval of the source policy distribution). Thus the sample efficiency can be significantly improved in an off-policy manner.

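1:  Sample a batch of transitions $(s, a, r, s')$ from $D$
2:  for each option $o \in O$ do
3:     if $\pi_o$ selects action $a$ at state $s$ then
4:        Update $U(s', o \mid \theta'_o)$ (Equation 3)
5:        Set $y \leftarrow r + \gamma\, U(s', o \mid \theta'_o)$
6:        Update option by minimizing the loss: $L_o = \left(y - Q_o(s, o \mid \theta_o)\right)^2$
7:     end if
8:  end for
9:  Copy $\theta_o$ to the target network $\theta'_o$ every $\tau$ steps
Algorithm 2 Update $Q_o(s, o \mid \theta_o)$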
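The option-value update can be sketched in tabular form. This sketch makes several assumptions that are not in the text: states and actions are small integers, the option values and termination probabilities are stored in nested lists instead of networks, source policies are deterministic callables, and the value upon arrival takes the usual form (1 - beta) * Q(s', o) + beta * max over options of Q(s', .). The periodic copy to the target table is left to the caller.

```python
def update_option_values(batch, q, q_target, beta, source_policies, gamma, lr):
    """Apply one pass of the option-value update to a batch of transitions.

    batch           : iterable of (s, a, r, s_next) tuples
    q, q_target     : q[s][o] online and target option values
    beta            : beta[s][o] termination probabilities
    source_policies : list of callables, source_policies[o](s) -> action
    """
    for s, a, r, s_next in batch:
        for o, policy in enumerate(source_policies):
            # Only options whose source policy agrees with the taken action learn.
            if policy(s) != a:
                continue
            b = beta[s_next][o]
            # Value upon arrival: continue with o, or terminate and pick the best.
            u = (1.0 - b) * q_target[s_next][o] + b * max(q_target[s_next])
            y = r + gamma * u
            # Tabular gradient step on the squared error (y - Q)^2 / 2.
            q[s][o] += lr * (y - q[s][o])
    return q
```

Because the agreement check is applied per option, a single stored transition can update several options at once, which is the source of the off-policy sample efficiency described above.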

PTF-A3C learns option-values in the call-and-return option execution model, where an option $o$ is executed until it terminates at state $s$ based on its termination probability $\beta(s, o \mid \theta_\beta)$, and then the next option is selected by a policy over options that is $\epsilon$-greedy with respect to the option-value $Q_o$. Specifically, with probability $1 - \epsilon$, the option with the highest option-value is selected (with random selection in case of a tie), and with probability $\epsilon$ PTF-A3C makes a random choice to explore other options with potentially better performance.
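The policy over options described above can be sketched as a standard epsilon-greedy rule with random tie-breaking. The function name and the use of Python's `random.Random` as the randomness source are illustrative choices.

```python
import random


def select_option(option_values, epsilon, rng):
    """Epsilon-greedy choice over option values with random tie-breaking."""
    # Explore: uniform random option with probability epsilon.
    if rng.random() < epsilon:
        return rng.randrange(len(option_values))
    # Exploit: pick uniformly among the options that share the highest value.
    best = max(option_values)
    ties = [index for index, value in enumerate(option_values) if value == best]
    return rng.choice(ties)
```

A call such as `select_option([0.1, 0.9, 0.3], epsilon=0.1, rng=random.Random(0))` returns the index of the chosen option.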

Learning Termination Probabilities

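According to the call-and-return option execution model, the termination probability $\beta_o$ controls when to terminate the currently selected option and select another option. The objective of learning the termination probability is to maximize the expected return $U$, so we update the termination network parameters $\theta_\beta$ by computing the gradient of the discounted return objective with respect to the initial condition $(s_1, o_0)$ Bacon et al. (2017):

$$\frac{\partial U(s_1, o_0 \mid \theta_o)}{\partial \theta_\beta} = -\sum_{s', o} \mu(s', o \mid s_1, o_0)\, \frac{\partial \beta(s', o \mid \theta_\beta)}{\partial \theta_\beta}\, A(s', o \mid \theta_o) \tag{4}$$

where $A(s', o \mid \theta_o)$ is the advantage function, which can be approximated as $Q_o(s', o \mid \theta_o) - \max_{o' \in O} Q_o(s', o' \mid \theta_o)$, and $\mu(s', o \mid s_1, o_0)$ is a discounted weighting of state-option pairs from the initial condition $(s_1, o_0)$: $\mu(s', o \mid s_1, o_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s', o_t = o \mid s_1, o_0)$. Here $P(s_{t+1} = s', o_t = o \mid s_1, o_0)$ is the transition probability along the trajectory starting from the initial condition $(s_1, o_0)$ to $(s', o)$ in $t$ steps. Since $\mu(s', o \mid s_1, o_0)$ is estimated from samples along the on-policy stationary distribution, we neglect it for data efficiency Thomas (2014); Li et al. (2019). Then $\theta_\beta$ is updated as follows Bacon et al. (2017); Li et al. (2019):

$$\theta_\beta \leftarrow \theta_\beta - \alpha_\beta\, \frac{\partial \beta(s', o \mid \theta_\beta)}{\partial \theta_\beta} \left(A(s', o \mid \theta_o) + \xi\right) \tag{5}$$

where $\alpha_\beta$ is the learning rate and $\xi$ is a regularization term. The advantage term is $0$ if the option is the one with the maximum option value, and negative otherwise. In this way, the termination probability of every option whose value is not the maximum one increases. However, the estimate of the option-value function is not accurate initially. If the gradient is multiplied by the advantage alone, the termination probability of the option with the maximum true option value may also increase, which would lead to a sub-optimal policy over options. The purpose of $\xi$ is to ensure sufficient exploration so that the best option can be selected.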
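The effect of the termination update can be illustrated with a single scalar parameter. This sketch assumes a sigmoid parameterization of the termination probability with one logit per state-option pair, an advantage approximated as the option value minus the maximum option value, and a small positive regularizer `xi`; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_termination_logit(logit, q_next, option, lr, xi):
    """One step of logit <- logit - lr * d(beta)/d(logit) * (advantage + xi)."""
    beta = sigmoid(logit)
    # Advantage is 0 for the best option and negative for all others.
    advantage = q_next[option] - max(q_next)
    # Derivative of the sigmoid with respect to its input.
    grad_beta = beta * (1.0 - beta)
    return logit - lr * grad_beta * (advantage + xi)
```

With option values `[1.0, 3.0]`, the logit of option 0 rises (it becomes more likely to terminate), while the logit of the best option 1 falls slightly because only the regularizer contributes.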

Transferring from Selected Source Policy

Next, we describe how to transfer knowledge from the selected source policy. The transfer mechanism is motivated by policy distillation Rusu et al. (2016), which exploits multiple teacher policies to train a student policy. Namely, a teacher policy $\pi_T$ is used to generate trajectories $x$, each containing a sequence of states $(x_t)_{t \geq 0}$. The goal is to match the student's policy $\pi_S(\theta)$, parameterized by $\theta$, to $\pi_T$. The corresponding loss function term for each sequence at each time step $t$ is $H(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \theta))$, where $H(\cdot \,\|\, \cdot)$ is the cross-entropy loss. For value-based algorithms, e.g., DQN, we can measure the difference between two Q-value distributions using the Kullback-Leibler (KL) divergence with temperature $\tau$:

$$L_{KL} = \sum_{i} \mathrm{softmax}\!\left(\frac{q_i^T}{\tau}\right) \ln \frac{\mathrm{softmax}\!\left(q_i^T / \tau\right)}{\mathrm{softmax}\!\left(q_i^S\right)} \tag{6}$$
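A temperature-scaled KL divergence between teacher and student Q-values can be sketched as below, following the form used in policy distillation, where the teacher's Q-values are divided by the temperature before the softmax and the student's are not. Whether the paper uses exactly this form is an assumption; the function names are illustrative.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    shift = max(scaled)
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_q, student_q, temperature):
    """KL(softmax(teacher_q / temperature) || softmax(student_q)) for one state."""
    teacher = softmax(teacher_q, temperature)
    student = softmax(student_q)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student))
```

A temperature below 1 sharpens the teacher distribution, so the student is pushed more strongly toward the teacher's greedy action.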

Kickstarting Schmitt et al. (2018) trains a student policy that surpasses the teacher policy on the same task set by adding the cross-entropy loss between the teacher and student policies to the RL loss. However, it does not consider learning a new task that is different from the teacher's task set. Furthermore, its use of Population Based Training (PBT) Jaderberg et al. (2017) to adjust the weighting factor of the cross-entropy loss increases the computational cost and lacks adaptive adjustment.

To this end, we propose an adaptive and heuristic way to adjust the weighting factor of the cross-entropy loss. The option module contains a termination network that reflects the performance of options on the target task. If the performance of the current option is not the best among all options, the termination probability of this option grows, which indicates that we should assign a higher probability to terminating the current option. Therefore, the termination probability of a source policy can be used as a performance indicator for adjusting its degree of exploitation. Specifically, the probability of exploiting the current source policy should decrease as the performance of the option decreases, so the weighting factor $f(\beta_o, t)$, which reflects the probability of exploiting the current source policy, should be inversely related to the termination probability. We propose to adaptively adjust $f(\beta_o, t)$ as follows:

$$f(\beta_o, t) = f(t)\left(1 - \beta(s_t, o \mid \theta_\beta)\right) \tag{7}$$

where $f(t)$ is a discount function. When the value of the termination function of option $o$ increases, the performance of option $o$ is not the best among all options based on the current experience. Thus we decrease the weighting factor of the cross-entropy loss, and vice versa. $f(t)$ controls the slow decrease in exploiting the transferred knowledge from source policies: at the beginning of learning, we mostly exploit source knowledge. As learning continues, past knowledge becomes less useful and we focus more on the current self-learned policy. In this way, PTF efficiently exploits useful information and avoids negative transfer from source policies.
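The adaptive weighting factor can be sketched as a product of a time-dependent discount and one minus the termination probability. The text does not give the form of the discount function, so the exponential decay `decay ** step` and its default value are assumptions made purely for illustration.

```python
def transfer_weight(termination_prob, step, decay=0.999):
    """Weight of the imitation loss: discount(step) * (1 - termination_prob).

    The exponential discount is a hypothetical choice; any slowly decreasing
    function of the training step would fit the description in the text.
    """
    discount = decay ** step
    return discount * (1.0 - termination_prob)
```

The weight falls when the option module judges the current source policy to be a poor fit (high termination probability) and also falls over training time, which is how the two sources of adaptation combine.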

4 Experimental Results

Figure 2: Two grid worlds. (a) contains two target tasks and four source tasks; (b) contains one of the same target tasks.
Figure 3: Average discounted rewards of various methods when learning the target task that is similar to a source task, on the grid world of Figure 2(a). Panels: (a) A3C vs PTF-A3C, (b) PPO vs PTF-PPO, (c) Deep-CAPS vs PTF-A3C.
Figure 4: Average discounted rewards of various methods when learning the target task that is dissimilar to all source tasks, on the grid world of Figure 2(a). Panels: (a) A3C vs PTF-A3C, (b) PPO vs PTF-PPO, (c) Deep-CAPS vs PTF-A3C.

In this section, we evaluate PTF on three test domains: grid world Fernández and Veloso (2006), pinball Konidaris and Barto (2009) and reacher Tassa et al. (2018). We compare it with several DRL methods learning from scratch (A3C Mnih et al. (2016) and PPO Schulman et al. (2017)) and with the state-of-the-art policy transfer method CAPS Li et al. (2019), implemented as a deep version (Deep-CAPS). Results are averaged over random seeds.

4.1 Evaluation on Different Environments

Grid world

Figure 2(a) shows a grid world, with an agent starting from any of the grids and choosing one of four actions: up, down, left and right. Each action moves the agent one step in the corresponding direction. The figure marks the goals of the source tasks and the goals of the two target tasks. One target task is similar to one of the source tasks since their goals are within a close distance, while the other is different from each source task because its goal is far from theirs. The game ends when the agent reaches the goal grid of the target task or the time exceeds a fixed period. The agent receives a reward after reaching the goal grid. The source policies are trained using A3C learning from scratch. We also manually design primitive policies for deep-CAPS following its previous settings (i.e., each primitive policy selects the same action for all states); these are unnecessary for our PTF framework.

We first investigate the performance of PTF when the target task is similar to one of the source tasks (i.e., their goal grids are very close). Figure 3 presents the average discounted rewards of various methods when learning this task on the grid world. We can see from Figure 3(a) that PTF-A3C significantly accelerates the learning process and outperforms A3C. Similar results can be found in Figure 3(b). The reason is that PTF quickly identifies the optimal source policy and exploits useful information from source policies, which accelerates learning compared with learning from scratch. Figure 3(c) shows the performance gap between PTF-A3C and deep-CAPS. This gap arises because the policy reuse module and the target task learning module in PTF are loosely coupled: apart from reusing knowledge from source policies, PTF is also able to utilize its own experience from the environment. In deep-CAPS, however, these two parts are tightly coupled, which means its explorations and exploitations are fully dependent on the source policies inside the options. Thus, deep-CAPS places higher requirements on source policies than PTF and finally achieves lower performance than PTF-A3C.

Figure 5: Average discounted rewards of various methods when learning the target task on the grid world of Figure 2(b). Panels: (a) A3C vs PTF-A3C, (b) PPO vs PTF-PPO, (c) Deep-CAPS vs PTF-A3C.

Next, we investigate the performance of PTF when none of the source tasks is similar to the target task (i.e., their goal grids are far apart). Figure 4 presents the average discounted rewards of various methods when learning this task. We can see from Figure 4(a) and (b) that both PTF-A3C and PTF-PPO significantly accelerate the learning process and outperform A3C and PPO. The reason is that PTF identifies which source policy is best to exploit and when to terminate it, which accelerates learning compared with learning from scratch. Deep-CAPS performs worse than PTF-A3C (Figure 4(c)) for reasons similar to those described before: its explorations and exploitations are fully dependent on source policies, so it places higher requirements on source policies than PTF and finally achieves lower performance.

To verify that PTF also works in situations where the transition dynamics of source and target tasks differ, we conduct experiments on learning the target task on a second grid world (Figure 2(b)), whose map is very different from the map used for learning the source tasks. Figure 5 shows that PTF still outperforms the other methods even if only some parts of the source policies can be exploited. PTF identifies and exploits the useful parts automatically.

Figure 6: The performance of PTF-A3C and deep-CAPS on the grid world with imperfect source policies.

We further investigate whether PTF can efficiently avoid negative transfer. Figure 6 shows the average discounted rewards of PTF-A3C and deep-CAPS when the source policies are not optimal for their source tasks. As described before, deep-CAPS depends fully on source policies for exploration and exploitation on the target task. When the source policies are not optimal for their source tasks, they are not deterministic at all states. Deep-CAPS therefore cannot avoid the negative and stochastic impact of source policies, which confuses the learning of the option-value network, and it finally obtains lower performance than PTF-A3C.

Pinball

In the pinball domain (Figure 7(a)), a ball must be guided through a maze of arbitrarily shaped polygons to a designated target location. The state space is continuous over the position and velocity of the ball in the plane. The action space is continuous within a bounded range and controls the increment of the velocity in the vertical or horizontal direction. Collisions with obstacles are elastic and can be used to the advantage of the agent. A drag coefficient effectively stops ball movements after a finite number of steps when the null action is chosen repeatedly. Each thrust action incurs a penalty, and taking no action also carries a cost. The episode terminates with a reward when the agent reaches the target. We interrupt any episode that exceeds a fixed number of steps and use a fixed discount factor. These rewards are all normalized to ensure more stable training. The source policies are trained using A3C learning from scratch. We also design primitive policies for deep-CAPS: an increment of the velocity in the vertical or horizontal direction, a decrement of the velocity in the vertical or horizontal direction, and the null action. These are unnecessary for our PTF framework.

Figure 7: Two evaluation environments with continuous control. Panels: (a) Pinball, (b) Reacher.
Figure 8: Average discounted rewards of various methods when learning the target task that is similar to a source task on pinball. Panels: (a) A3C vs PTF-A3C, (b) PPO vs PTF-PPO, (c) Deep-CAPS vs PTF-A3C.
Figure 9: Average discounted rewards of various methods when learning the target task that is dissimilar to all source tasks on pinball. Panels: (a) A3C vs PTF-A3C, (b) PPO vs PTF-PPO, (c) Deep-CAPS vs PTF-A3C.

Figure 8 depicts the performance of PTF when learning a target task on Pinball that is similar to one of the source tasks (i.e., their goal states are very close). We can see that PTF significantly accelerates the learning process of A3C and PPO (Figure 8(a) and (b)) and outperforms deep-CAPS (Figure 8(c)). The advantage of PTF is similar to that in the grid world: PTF efficiently exploits the useful information from source policies to optimize the target policy and thus achieves higher performance than learning from scratch. Deep-CAPS, however, achieves lower average discounted rewards than PTF since it is fully dependent on source policies for explorations in the target task, and the continuous action space is hard to cover fully even with the manually added primitive policies.

We further verify that PTF works well in the same setting as in the grid world where none of the source tasks is similar to the target task (i.e., their goal states are far apart). From Figure 9 we can see that PTF outperforms the other methods even if only some parts of the source policies can be exploited. PTF identifies which source policy is best to exploit and when to terminate it, and thus efficiently accelerates the learning process.

Reacher

To further validate the performance of PTF, we provide an alternative scenario, Reacher (Figure 7(b)) Tassa et al. (2018), which is qualitatively different from the above two navigation tasks. Reacher is one of the robot control problems in MuJoCo Todorov et al. (2012), in which a two-link planar arm must reach a target location. The episode ends with a reward when the end effector penetrates the target sphere, or ends when it exceeds a fixed number of steps. We design several tasks in Reacher which differ in the location and size of the target sphere. Since deep-CAPS performs poorly in the above continuous domain (pinball) due to the limitations described above, we only compare PTF with vanilla A3C and PPO in the following sections.

Figure 10: The performance of PTF on Reacher. Panels: (a) A3C vs PTF-A3C, (b) PPO vs PTF-PPO.

Figure 10(a) shows the performance of PTF-A3C and A3C on Reacher. We can see that PTF-A3C achieves higher average discounted rewards than A3C. Similar results for PTF-PPO and PPO are shown in Figure 10(b). This is because PTF efficiently exploits the useful knowledge in source tasks and thus accelerates the learning process compared with the vanilla methods. The results across these environments further show the robustness of PTF.

4.2 The Influence of the Weighting Factor

Next, we provide an ablation study to investigate the influence of the weighting factor $f(\beta_o, t)$ (Equation 7), which is the key factor, on the performance of PTF. Figure 11 shows the influence of different parts of the weighting factor on the performance of PTF-A3C. We can see that when the extra loss is added without the weighting factor, it helps the agent at the beginning of learning compared with A3C learning from scratch, but it leads to a sub-optimal policy because the agent focuses too much on mimicking the source policies. In contrast, introducing the weighting factor allows us to stop exploiting source policies in time and thus achieves the best transfer performance.

Figure 11: The influence of the weighting factor $f(\beta_o, t)$.

4.3 The Performance of Option Learning

Finally, we validate whether PTF learns an effective policy over options. There may be some concerns that the learned termination function collapses easily Bacon et al. (2017); Harutyunyan et al. (2019); Harb et al. (2018), which makes policy optimization difficult. In this section, we report the dynamics of the option switch frequency to investigate option learning in PTF. From Figure 12(a) and (b) we can see that the option switch frequency decreases quickly and stabilizes as learning proceeds. This indicates that both PTF-A3C and PTF-PPO efficiently learn when and which option is useful and provide meaningful guidance for target task learning.
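One way such a switch frequency could be measured is sketched below. The text does not define the metric precisely, so this sketch assumes it is the fraction of consecutive time steps at which the active option changes; the function name is illustrative.

```python
def option_switch_frequency(options):
    """Fraction of consecutive steps whose active option differs (assumed metric)."""
    if len(options) < 2:
        return 0.0
    # Count positions where the option at step k differs from step k - 1.
    switches = sum(1 for prev, cur in zip(options, options[1:]) if prev != cur)
    return switches / (len(options) - 1)
```

Under this definition, a value near zero means the agent commits to one option for long stretches, and a value near one means it switches at almost every step.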

Figure 12: The switch frequency of options. Panels: (a) PTF-A3C, (b) PTF-PPO.

5 Related Work

Recently, transfer in RL has become an important direction, and a wide variety of methods have been studied in the context of RL transfer learning Taylor and Stone (2009). Brys et al. (2015) applied a reward shaping approach to policy transfer, benefiting from the theoretical guarantees of reward shaping. However, it may suffer from negative transfer. Song et al. (2016) transferred the action-value functions of the source tasks to the target task according to a task similarity metric that computes the task distance. However, they assumed a well-estimated model, which is not always available in practice. Later, Laroche and Barlier (2017) reused the experience instances of a source task to estimate the reward function of the target task. The limitation of this approach lies in the restrictive assumption that all the tasks share the same transition dynamics and differ only in the reward function.

Policy reuse is a technique to accelerate RL with guidance from previously learned policies. It assumes that a set of policies is available at the start and selects among them when faced with a new task, which is, in essence, a transfer learning approach Taylor and Stone (2009). Fernández and Veloso (2006) used policy reuse as a probabilistic bias when learning new, similar tasks. Rajendran et al. (2017) proposed the A2T (Attend, Adapt and Transfer) architecture to select and transfer from multiple source tasks by incorporating an attention network which learns the weights of several source policies for combination. Li and Zhang (2018) proposed optimal source policy selection through online explorations using multi-armed bandit methods. However, most of the previous works select the source policy according to the performance of source policies on the target task, i.e., the utility, which fails to address problems where multiple source policies are only partially useful for learning the target task and may even cause negative transfer.

The option framework was first proposed in Sutton et al. (1999) as a form of temporal abstraction, modeled as a Semi-MDP. A number of works have focused on option discovery Bacon et al. (2017); Klissarov et al. (2017); Harb et al. (2018); Harutyunyan et al. (2019). An important example is the option-critic Bacon et al. (2017), which learns multiple source policies in the form of options from scratch, end-to-end. However, the option-critic tends to collapse to single-action primitives in later training stages. The follow-up work on the option-critic with deliberation cost Harb et al. (2018) addresses this option collapse by modifying the termination objective to additionally penalize option termination, but it is highly sensitive to the associated cost parameter. More recently, Harutyunyan et al. (2019) further modified the termination objective to be completely independent of the task reward and provided theoretical guarantees of optimality. The objectives of these option discovery works and of PTF are orthogonal: PTF transfers from source policies to the target task, whereas the other works learn multiple source policies from scratch. There are also some imitation learning works Kipf et al. (2019); Hausman et al. (2017); Sahni et al. (2017) related to option discovery, which are not the focus of this work.
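As a rough illustration of temporal abstraction, an option can be viewed as an intra-option policy paired with a state-dependent termination probability, executed call-and-return style. The sketch below assumes a toy deterministic environment supplied through a hypothetical `step` function; it is not taken from any of the cited implementations.

```python
import random


def run_option(state, policy, termination_prob, step, rng, max_steps=100):
    """Execute one option until it terminates.

    Returns the final state and the number of primitive steps taken.
    """
    steps = 0
    while steps < max_steps:
        state = step(state, policy(state))  # one primitive action
        steps += 1
        # Sample termination from the state-dependent probability
        if rng.random() < termination_prob(state):
            break
    return state, steps
```

Because an option can last a variable number of primitive steps, decisions at the option level follow a Semi-MDP rather than an MDP.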

6 Conclusion and Future Work

In this paper, we propose a Policy Transfer Framework (PTF) which efficiently selects the optimal source policy and exploits the useful information to facilitate learning of the target task. PTF also avoids negative transfer by terminating the exploitation of the current source policy and adaptively selecting another one. PTF can be easily combined with existing deep policy-based and actor-critic methods. Experimental results show that PTF accelerates the learning process of existing state-of-the-art DRL methods and outperforms previous policy reuse approaches. As future work, it is worthwhile to investigate how to extend PTF to multiagent settings. Another interesting direction is how to learn abstract knowledge for fast adaptation in new environments.

Acknowledgments

The work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362, U1836214).

Appendix

Network structure

The network structure is the same for all methods. The actor network has two fully-connected hidden layers, each with 64 hidden units, and a fully-connected output layer that outputs the action probabilities for all actions. The critic network contains two fully-connected hidden layers, each with 64 hidden units, and a fully-connected output layer with a single output, the state value. The option-value network contains two fully-connected hidden layers, each with 32 units, and two output layers: one outputs the option-values for all options, and the other outputs the termination probability of the selected option.
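The layer sizes described above can be written down as a small specification. The sketch below only enumerates fully-connected layer shapes and counts parameters; the state dimension, the numbers of actions and options, and the single-unit termination head are assumptions made for illustration.

```python
def mlp_shapes(input_dim, hidden_sizes, output_dim):
    """Return the (fan_in, fan_out) shape of each fully-connected layer."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return list(zip(sizes[:-1], sizes[1:]))


def count_params(shapes):
    """Weights plus biases for a list of fully-connected layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


def build_specs(state_dim, num_actions, num_options):
    """Layer shapes for the actor, critic and option-value networks."""
    return {
        "actor": mlp_shapes(state_dim, [64, 64], num_actions),
        "critic": mlp_shapes(state_dim, [64, 64], 1),
        # Shared trunk of the option-value network: two hidden layers of 32 units
        "option_trunk": mlp_shapes(state_dim, [32], 32),
        "option_value_head": [(32, num_options)],
        # Assumption: a single termination probability for the selected option
        "termination_head": [(32, 1)],
    }
```

For example, `build_specs(10, 4, 4)` describes networks for a hypothetical task with a 10-dimensional state, 4 actions and 4 source policies.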

Grid world

The input consists of the coordinates of the agent and the environmental information (i.e., whether each of the eight surrounding grid cells is a wall), which is encoded as a one-hot vector.
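One plausible way to build such an input is sketched below. The exact encoding is not specified here, so the neighbor ordering, the use of one binary indicator per neighboring cell, and the treatment of out-of-bounds cells as walls are all assumptions.

```python
# Assumed ordering of the eight neighbors: row by row, top-left to bottom-right
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def encode_observation(position, walls, grid_size):
    """Return [row, col] followed by eight wall indicators (1.0 = wall)."""
    row, col = position
    features = [float(row), float(col)]
    for d_row, d_col in NEIGHBOR_OFFSETS:
        r, c = row + d_row, col + d_col
        outside = not (0 <= r < grid_size[0] and 0 <= c < grid_size[1])
        # Cells beyond the grid boundary are treated as walls (assumption)
        features.append(1.0 if outside or (r, c) in walls else 0.0)
    return features
```

The resulting vector has ten entries: two coordinates and eight wall indicators.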

Pinball

The input contains the position of the ball (x and y) and the velocity of the ball in the plane.

Reacher

The input contains the position of the finger (x and y), the relative distance to the target position, and the velocity in the plane.

Parameter Settings

Hyperparameter Value
Discount factor (γ) 0.99
Optimizer Adam
Learning rate
ε decrement
ε-start
ε-end
Batch size
Number of episodes
replacing the target network
Table 1: CAPS Hyperparameters.

Hyperparameter Value
Number of processes 8
Discount factor (γ) 0.99
Optimizer Adam
Learning rate
Entropy term coefficient
Table 2: A3C Hyperparameters.

Hyperparameter Value
Discount factor (γ) 0.99
Optimizer Adam
Learning rate
Clip value
Entropy term coefficient 0.005
Table 3: PPO Hyperparameters.

Hyperparameter Value
Discount factor (γ) 0.99
Optimizer Adam
Learning rate for the policy network
Learning rate for the option network
Regularization term for Equation 5
ε decrement
ε-start
ε-end
Batch size
Number of episodes
replacing the target network
Table 4: PTF Hyperparameters.
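
The decrement, start and end entries in the tables above read as the parameters of a decaying exploration rate. The sketch below assumes a linear epsilon-greedy schedule applied to option selection; the schedule shape and every numeric value used with it are hypothetical, since the actual settings are not recoverable here.

```python
import random


def epsilon_at(step, start, end, decrement):
    """Linearly decay epsilon from start by decrement per step, floored at end."""
    return max(end, start - decrement * step)


def select_option(option_values, epsilon, rng):
    """Epsilon-greedy choice over a list of option-values; returns an index."""
    if rng.random() < epsilon:
        return rng.randrange(len(option_values))  # explore a random option
    # Exploit: pick the option with the highest estimated value
    return max(range(len(option_values)), key=lambda i: option_values[i])
```

A call such as `epsilon_at(step, start=1.0, end=0.05, decrement=0.001)` would reach the floor after 950 steps under these illustrative values.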

Footnotes

  1. * Corresponding author
  2. The source code is available at https://github.com/PTF-transfer/Code_PTF

References

  1. The option-critic architecture. In Proceedings of AAAI Conference on Artificial Intelligence, pp. 1726–1734. Cited by: §3.3.1, §3.3.2, §4.3, §5.
  2. Policy transfer using reward shaping. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pp. 181–188. Cited by: §1, §3.1, §5.
  3. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pp. 720–727. Cited by: §1, §3.1, §4, §5.
  4. When waiting is not an option: learning options with a deliberation cost. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3165–3172. Cited by: §4.3, §5.
  5. The termination critic. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 2231–2240. Cited by: §4.3, §5.
  6. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 1235–1245. Cited by: §5.
  7. Population based training of neural networks. CoRR abs/1711.09846. Cited by: §3.3.3.
  8. CompILE: compositional imitation learning and execution. In Proceedings of the 36th International Conference on Machine Learning, pp. 3418–3428. Cited by: §5.
  9. Learnings options end-to-end for continuous action tasks. CoRR abs/1712.00004. Cited by: §5.
  10. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, pp. 1015–1023. Cited by: §4.
  11. Transfer reinforcement learning with shared dynamics. In Proceedings of AAAI Conference on Artificial Intelligence, pp. 2147–2153. Cited by: §1, §1, §3.1, §5.
  12. Context-aware policy reuse. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pp. 989–997. Cited by: §1, §3.1, §3.3.2, §4.
  13. An optimal online method of selecting source policies for reinforcement learning. In Proceedings of AAAI Conference on Artificial Intelligence, pp. 3562–3570. Cited by: §1, §3.1, §5.
  14. Continuous control with deep reinforcement learning. In Proceedings of International Conference on Learning Representations, Cited by: §1.
  15. Asynchronous methods for deep reinforcement learning. In Proceedings of International Conference on Machine Learning, pp. 1928–1937. Cited by: §2, §3.2, §3.3, §3.3, §4.
  16. Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1, §2.
  17. Attend, adapt and transfer: attentive deep architecture for adaptive transfer from multiple sources in the same domain. In Proceedings of International Conference on Learning Representations, Cited by: §1, §5.
  18. Policy distillation. In Proceedings of International Conference on Learning Representations, Cited by: §3.3.3.
  19. Learning to compose skills. CoRR abs/1711.11289. Cited by: §5.
  20. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835. Cited by: §3.3.3.
  21. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.
  22. Measuring the distance between finite Markov decision processes. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pp. 468–476. Cited by: §1, §3.1, §5.
  23. Reinforcement learning: an introduction. MIT press. Cited by: §1, §2.
  24. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1), pp. 181–211. External Links: ISSN 0004-3702 Cited by: §2, §3.3.1, §5.
  25. DeepMind control suite. CoRR abs/1801.00690. Cited by: §4.1.3, §4.
  26. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research 8 (Sep), pp. 2125–2167. Cited by: §1.
  27. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685. Cited by: §1, §5, §5.
  28. Bias in natural actor-critic algorithms. In Proceedings of International Conference on Machine Learning, pp. 441–448. Cited by: §3.3.2.
  29. MuJoCo: A physics engine for model-based control. In Proceedings of the International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §4.1.3.
  30. Q-learning. Machine Learning 8 (3-4), pp. 279–292. Cited by: §2.
  31. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §2.