Transfer Value Iteration Networks


Junyi Shen, Hankz Hankui Zhuo, Jin Xu, Bin Zhong,    Sinno Jialin Pan
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
Tencent Inc., China
Nanyang Technological University, Singapore
vichyshen@tencent.com, zhuohank@mail.sysu.edu.cn, {jinxxu,harryzhong}@tencent.com, sinnopan@ntu.edu.sg
Abstract

Value iteration networks (VINs) have been demonstrated to be effective in predicting outcomes, assuming there is sufficient training data in the target domain. In this paper, we propose a transfer learning approach that leverages knowledge from the source domain to the target domain by automatically learning similarities between actions in the two domains, so that the target VIN can be trained with only limited training data. The proposed architecture, called Transfer Value Iteration Network (TVIN), is empirically shown to outperform VIN between domains with similar state and action spaces. Furthermore, we show that this performance gap is consistent across different maze environments, maze sizes, dataset sizes, and hyperparameters such as iteration counts and kernel sizes.

Introduction

Convolutional neural networks (CNNs) have been applied to reinforcement learning for representing control strategies, i.e., a mapping from observations of system states to actions [12]. As mentioned in [21], CNN-based control policies do not generalize well to new tasks in unseen domains. Value iteration networks (VINs) [21] were proposed to demonstrate that planning is an important ingredient of control policies; VINs are expected to generalize to new tasks in unseen but related domains. VINs have already been used in various domains, including path planning, i.e., visual navigation [7] and the WebNav challenge [13], where they learn a planning-based policy for an input state observation and generate action predictions with good long-term behavior. Despite the success of VINs, we observe that their generalizability rests on a vital assumption: the feature space and action space in the unseen domain must be the same as those in the original domain. The goal of this paper is to design a transfer learning framework that generates VIN-based policies across different domains.

If we have tasks with different feature spaces or different action spaces, the classic approach to predicting actions is to train VIN models separately in their own domains. In this paper, we aim to transfer knowledge [15] learned from the source domain to a related target domain to improve training by reducing the learning expense of parameters. We assume that the feature and action spaces of the source and target domains share similarities, such that the observations of states can be represented in the same way.

With a pre-trained VIN learned in the source domain, we propose a novel approximate value-iteration approach for the target domain called TVIN, which stands for Transfer Value Iteration Network. In TVIN, we propose two transfer strategies based on the learned reward function and transition function:

  • We first automatically encode the state observation (with different features) of the target domain into the same representation as the source domain, such that the reward function transferred from the pre-trained VIN accurately produces a reward image for the target domain.

  • We then leverage the transition functions learned from the pre-trained model that correspond to similar actions shared by the source and target domains, and fine-tune them by adding an additional weight that automatically learns to what degree the actions resemble each other. After that, we back-propagate the gradient through the novel value iteration model to learn a TVIN policy for the target domain with different feature spaces and action spaces.

To demonstrate the effectiveness of TVIN, we transfer knowledge between different 2D domains, including 2D mazes and Differential Drive [9]. We evaluate the transfer performance of TVIN with varying environments, maze sizes, and hyperparameters. Our experiments empirically show that transferring the pre-trained VIN from the source domain to the target domain learns the target TVIN policy significantly faster and achieves better generalization than both the plain VIN policy and a heuristic transfer method. As we demonstrate, representing transfer value iteration networks in this form leads to an efficient architecture that accelerates the learning process and requires less training data in the target domain.

Problem Definition

Let $M$ denote the MDP of some domain for which we design our policy $\pi$. The states, actions, rewards, and transitions in $M$ are denoted by $s \in \mathcal{S}$, $a \in \mathcal{A}$, $R(s,a)$, and $P(s' \mid s,a)$, respectively. Let $\phi(s)$ denote an observation for state $s$. The reward $R$ and transition $P$ depend on the observation in $M$, namely $R = f_R(\phi(s))$ and $P = f_P(\phi(s))$. The functions $f_R$ and $f_P$ will later be learned in the policy learning process. Let $\theta$ denote all the joint parameters of the policy, including the parameters of $f_R$, $f_P$, the autoencoder, and the transfer weights. Therefore, the TVIN policy can be written in the form $\pi(a \mid \phi(s); \theta)$, prescribing an action distribution for each state observation.

Our problem can be defined as follows. Given the pre-trained MDP in the source domain, we transfer its pre-trained knowledge, namely the learned reward function $f_R$ and the transition functions associated with the set $\mathcal{A}_{sim}$ of similar actions shared by the source and target domains, to the target domain. The problem in the target domain is that, given the observation $\phi(s)$ for state $s$, we aim to output the corresponding optimal policy by transferring this knowledge from the source domain. Our TVIN policy is represented as a neural network, with $\theta$ denoting the network weights.

Figure 1: The Framework of Transfer Value Iteration Networks

Transfer Value Iteration Networks

In this section we introduce the Transfer Value Iteration Network (TVIN), a transfer policy representation that embeds a value iteration module. The full framework of the proposed TVIN is depicted in Figure 1, where we build a mapping between the feature spaces of the source and target domains (denoted by "part I"), and transfer Q-networks related to similar actions from the source to the target domain (denoted by "part II"). After that, we build policy networks for dissimilar actions, which are learned from scratch (denoted by "part III"). Thus, a planning-based TVIN policy for the target domain can be trained end-to-end by back-propagating the gradient through the whole network.

Pre-trained VINs

Before transferring to the target domain, we first obtain a pre-trained model by training a Value Iteration Network [21] in the source domain. The basic idea behind many reinforcement learning algorithms is to estimate the action-value function $Q(s,a)$ [24] by using the Bellman equation as an iterative update, $Q_{n+1}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V_n(s')$. The value-iteration algorithm is a popular algorithm for calculating the optimal value function $V^*$ and the corresponding optimal policy $\pi^*$. In each iteration $n$, $V_{n+1}(s) = \max_a Q_n(s,a)$, where $Q_n(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V_n(s')$. The value function $V_n$ converges to the optimal value function $V^*$ as $n \to \infty$, from which an optimal policy may be derived as $\pi^*(s) = \arg\max_a Q_\infty(s,a)$. The VI module in VINs is a neural network that approximately performs the value-iteration algorithm. VINs first produce a reward image $\bar{R} = f_R(\phi(s))$ of dimensions $l \times m \times n$ and input it to the VI module. The reward is then fed into a convolutional layer $\bar{Q}$ with $|\bar{\mathcal{A}}|$ channels and a linear activation function: $\bar{Q}_{\bar{a},i',j'} = \sum_{l,i,j} W^{\bar{a}}_{l,i,j} \bar{R}_{l,i'-i,j'-j}$. Each channel in this layer corresponds to $\bar{Q}(\bar{s},\bar{a})$ for a particular action $\bar{a}$. This layer is then max-pooled along the action channel to produce the next-iteration value function layer, $\bar{V}_{i,j} = \max_{\bar{a}} \bar{Q}(\bar{a},i,j)$. The next-iteration value function layer $\bar{V}$ is then stacked with the reward $\bar{R}$ and fed back into the convolutional layer and max-pooling layer $K$ times, to perform $K$ iterations of value iteration.
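To make this recurrent convolution and max-pooling structure concrete, the following is a minimal sketch of a VI module of this kind in PyTorch. It is an illustration under our own assumptions (a single-channel reward image stacked with the value map, and one convolution producing all action channels), not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VIModule(nn.Module):
    """Approximate value iteration as K rounds of convolution + channel-wise max."""

    def __init__(self, num_actions=8, kernel_size=3, iterations=20):
        super().__init__()
        self.k = iterations
        # One convolution maps the stacked [reward; value] planes to |A| Q-channels.
        self.q_conv = nn.Conv2d(2, num_actions, kernel_size,
                                padding=kernel_size // 2, bias=False)

    def forward(self, r_img):
        v = torch.zeros_like(r_img)                        # V_0 = 0
        q = None
        for _ in range(self.k):                            # K Bellman-like updates
            q = self.q_conv(torch.cat([r_img, v], dim=1))  # Q_k for every action channel
            v, _ = torch.max(q, dim=1, keepdim=True)       # max-pool over the action channel
        return q, v
```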

TVIN Algorithm

In this section we present our TVIN algorithm in detail. An overview of the transfer value iteration algorithm is shown in Algorithm 1. Given the pre-trained VIN in the source domain, we first transfer the reward function to produce a reward image for the observation in the target domain (i.e., Step 3). Then we transfer the transition functions that are relevant to the similar actions shared by the source and target domains to compute the Q-functions in each iteration in the target domain (i.e., Steps 6 and 7), and add an attention-modulated vector to the final policy (i.e., Step 11). Finally, we back-propagate the gradient through the whole model to learn a TVIN policy for the target domain (i.e., Steps 12 and 13). In order to implement such a network, we specify the two transfer strategies based on the pre-trained reward function $f_R$ and transition function $f_P$.

Reward function transferring

In VINs, $f_R$ maps features of the input state to a reward image and passes it to the VI module. For example, in the grid-world domain, $f_R$ can map an observation to a high reward at the goal and a negative reward near an obstacle. If we directly transfer the pre-trained $f_R$ from the source domain to the target domain, we may be constrained by task-specific features due to the diversity of pixel-level inputs. Therefore, in the target domain, where the feature spaces differ from the source domain but share some similarities, we propose the first transfer strategy shown in Step 3 of Algorithm 1. We automatically encode the state observation in the target domain into the same representation as the source domain using an autoencoder [26], such that the reward function transferred from the pre-trained VIN accurately produces a reward image for the target domain before it is passed to the new VI module. Experimentally, we retain the parameters of the reward function learned in the source domain and train an additional fully-connected layer as the autoencoder to output a shared representation for the input states in the target domain. This autoencoder is trained end-to-end through the whole TVIN. The new reward function with an autoencoder in TVIN is parameterized jointly by the pre-trained reward weights and the autoencoder parameters.

1:  Initialize value function $\bar{V}_0$ with zeros
2:  for epoch = 1, ..., $E$ do
3:     Set reward $\bar{R}$ by applying the transferred reward function $f_R$ to the encoded observation
4:     for $k$ = 1, ..., $K$ do
5:        Construct transition functions for each of the states:
6:        Compute $\bar{Q}_k$ for transferable actions by convolving $[\bar{R}; \bar{V}_{k-1}]$ with the transferred kernels scaled by the transfer weight $\eta$
7:        Compute $\bar{Q}_k$ for new actions by convolving $[\bar{R}; \bar{V}_{k-1}]$ with the newly learned kernels
8:        $\bar{V}_k(\bar{s}) = \max_{\bar{a}} \bar{Q}_k(\bar{s}, \bar{a})$
9:     end for
10:     Construct the approximate optimal $\bar{Q}^*$ with $\bar{V}_K$
11:     Add the attention vector to the final policy
12:     Compute the TVIN policy $\pi(a \mid \phi(s); \theta)$ with the attention-modulated features
13:     Update $\theta$ by back-propagating the gradient according to Equation (4)
14:  end for
Algorithm 1 Transfer Value Iteration Algorithm
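As a concrete illustration of the reward-transfer step (Step 3 of Algorithm 1), the sketch below first maps a target-domain observation into the shared source representation and then applies the frozen pre-trained reward network. The module and function names are ours, and the 1x1 convolution is one plausible realization of the fully-connected autoencoder layer described above.

```python
import torch
import torch.nn as nn


class TargetEncoder(nn.Module):
    """Maps a target-domain observation into the source-domain feature space."""

    def __init__(self, in_channels, src_channels):
        super().__init__()
        # A 1x1 convolution applies the same fully-connected mapping at every cell.
        self.fc = nn.Conv2d(in_channels, src_channels, kernel_size=1)

    def forward(self, obs):
        return torch.relu(self.fc(obs))


def make_reward_image(obs_target, encoder, f_r_source):
    """Encode the target observation, then apply the pre-trained reward network."""
    # The source reward weights stay fixed (in practice frozen once, before training).
    for p in f_r_source.parameters():
        p.requires_grad_(False)
    shared = encoder(obs_target)        # the encoder itself is trained end-to-end
    return f_r_source(shared)           # reward image fed to the new VI module
```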

Transition function transferring

Our second transfer strategy is to design the new VI module for the target domain (depicted in parts II and III of Figure 1), which performs value iteration by approximating the Bellman update through a convolutional neural network. A CNN is composed of stacked convolution and max-pooling layers. The input to each convolution layer is a three-dimensional signal $X$, typically an image with $l$ channels and $m \times n$ pixels, and its output is a multi-channel convolution of the image with different kernels, $h_{l',i',j'} = \sigma\big(\sum_{l,i,j} W^{l'}_{l,i,j} X_{l,i'-i,j'-j}\big)$, where $\sigma$ is some scalar activation function. A max-pooling layer then down-samples the image by selecting the maximum value over some dimension. CNNs are typically trained using stochastic gradient descent (SGD), with backpropagation for computing gradients. Thus, in our new VI module, each iteration may be regarded approximately as passing the reward as well as the previous value function through a convolution layer and a max-pooling layer. Since each channel in the convolution layer corresponds to the Q-function for a specific action, and the convolution kernel weights correspond to the discounted transition probabilities, we leverage the transition functions from the pre-trained model that correspond to the similar actions shared by the source and target domains. In this analogy, our new VI module divides the channels in the convolution layer into two parts: one corresponds to the Q-function for transferable actions and the other to the Q-function for new actions in the target domain.

At the human level, we can easily recognize similar actions between the different action spaces of the source and target domains. For an agent, however, an additional weight is added when leveraging the pre-trained transition functions, so that TVIN can automatically learn to what degree these actions resemble each other. The fine-tuned pre-trained transition kernels in the target domain are denoted by $\eta \cdot W_s^{\bar{a}}$, where $\eta$ is the transfer weight. When back-propagating the gradient through the TVIN in the target domain, we treat $W_s^{\bar{a}}$ as fixed and only update the transfer weight $\eta$.

To sum up, the convolution kernel weights corresponding to the discounted transition probabilities in TVIN fall into two cases:

$$W^{\bar{a}} = \begin{cases} \eta \cdot W_s^{\bar{a}}, & \bar{a} \in \mathcal{A}_{sim} \\ W_n^{\bar{a}}, & \bar{a} \notin \mathcal{A}_{sim} \end{cases} \qquad (1)$$

where $W_s^{\bar{a}}$ stands for the pre-trained convolution kernel weights corresponding to the discounted transition probabilities for transferred actions, $\eta$ stands for the transfer weight, and $W_n^{\bar{a}}$ stands for the new convolution kernel weights corresponding to the discounted transition probabilities for new actions in the target domain. The value function $\bar{V}$ is stacked with the reward $\bar{R}$ and fed back into the convolutional layer, where each channel corresponds to the Q-function for a specific action. The convolution operation in the new VI module is then denoted by

$$\bar{Q}_{\bar{a},i',j'} = \sum_{l,i,j} W^{\bar{a}}_{l,i,j}\, [\bar{R}; \bar{V}]_{l,i'-i,j'-j} \qquad (2)$$

where $(i', j')$ indexes the input state $\bar{s}$, $\bar{R}$ is the reward, $\bar{V}$ is the value function in each iteration, and $[\bar{R}; \bar{V}]$ denotes their channel-wise stacking. The convolutional channels in both parts are then max-pooled along all channels to produce the next-iteration value function layer, with $\bar{V}_{i,j} = \max_{\bar{a}} \bar{Q}(\bar{a},i,j)$. Performing value iteration $K$ times in this form, the new VI module outputs the approximate optimal value function $\bar{V}^*$. The value iteration module in TVIN has an effective depth of $K$, which is larger than the depth of the well-known Deep Q-Network [12]. To reduce the number of parameters in the training process, we share the weights across the $K$ recurrent layers of the TVIN.
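A minimal sketch of this split Q-convolution, assuming PyTorch and our own names (eta for the transfer weight, w_src for the pre-trained kernels of transferable actions): the transferred channels reuse frozen source kernels scaled by a learnable weight, while channels for new target-domain actions are learned from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransferQConv(nn.Module):
    """Q-convolution whose channels split into transferred and newly learned actions."""

    def __init__(self, w_src, num_new_actions, kernel_size=3):
        super().__init__()
        # Pre-trained kernels for transferable actions: registered as a fixed buffer.
        self.register_buffer("w_src", w_src)               # shape [A_sim, C, F, F]
        self.eta = nn.Parameter(torch.ones(1))              # learnable transfer weight
        self.q_new = nn.Conv2d(w_src.shape[1], num_new_actions, kernel_size,
                               padding=kernel_size // 2, bias=False)

    def forward(self, rv):                                   # rv = stacked [reward; value]
        pad = self.w_src.shape[-1] // 2
        q_sim = F.conv2d(rv, self.eta * self.w_src, padding=pad)  # transferred channels
        q_new = self.q_new(rv)                                     # new-action channels
        return torch.cat([q_sim, q_new], dim=1)                    # all action channels

# As in the VI module, each iteration then takes the channel-wise maximum:
# v, _ = torch.max(q, dim=1, keepdim=True)
```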

After learning the internal transfer VI module, which is independent of the observation, we generate the policy for the input state from the approximate optimal value function $\bar{V}^*$. Note that the optimal action at a state only depends on a subset of the optimal value function if the states have a topology with local transition dynamics, such as in the grid-world. Thus we can suppose that a local subset of $\bar{V}^*$ is sufficient for extracting information about the optimal TVIN plan.

In deep learning, the attention mechanism [25] is widely used to improve learning performance by reducing the effective number of network parameters during training. In TVIN, we implement the attention mechanism by selecting the value of the current grid-world state after K iterations of value iteration. Since we learn the internal MDP model over the full input state space, adding an attention module to TVIN helps accelerate training.

In the sense that for a given label prediction (action) only a subset of the input features (value function) is relevant, the attention module can be represented by a parametric function that outputs an attention-modulated vector for the input state $s$. This vector is added as additional features to the TVIN model to predict the final policy $\pi$. By back-propagating through the whole network end-to-end, we update the joint parameter $\theta$ and learn the planning-based TVIN policy for the target domain.
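The attention step can be as simple as indexing the Q (or value) map at the agent's current grid cell after the K iterations; the helper below is a sketch under that assumption, with q of shape [batch, actions, m, n] and states holding (row, col) pairs.

```python
import torch


def attend_to_state(q, states):
    """Gather the action channels at each sample's current (row, col) position."""
    batch_idx = torch.arange(q.size(0), device=q.device)
    rows, cols = states[:, 0], states[:, 1]
    return q[batch_idx, :, rows, cols]   # attention-modulated features, shape [batch, actions]
```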

Updating Parameters

In order to implement a Transfer Value Iteration Network (TVIN), we specify the representations of the reward and transition functions $f_R$ and $f_P$, and the attention function. In this case, we refer to a neural network function approximator with weights $\theta$ as a TVIN. The transfer VI module is also a fine-tuned CNN architecture that has the capability of performing an approximate transfer value iteration algorithm. We then define the policy objective over the TVIN network as the cross-entropy between the expert policy and the current TVIN policy. A TVIN can be trained by minimizing the loss function $L(\theta)$,

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{a} \pi^*(a \mid s_i) \log \pi(a \mid \phi(s_i); \theta) \qquad (3)$$

where $\pi(a \mid \phi(s); \theta)$ is the TVIN policy, parameterized by $\theta$, and $\pi^*$ is the expert policy. To acquire training data, we sample the expert to generate the trajectories used in the loss. In contrast to the deep reinforcement learning objective [12], which recursively relies on itself as a target value, we now use a stable training signal generated by an expert to guide the transfer network. Learning the TVIN policy then becomes an instance of supervised learning.

We consider updates that optimize the policy parameter $\theta$ of the state representation, the reward function, and the new VI module. We update $\theta$ towards the expert outcome. Differentiating the loss function with respect to the weights $\theta$, we arrive at the following gradient,

$$\nabla_\theta L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{a} \pi^*(a \mid s_i)\, \nabla_\theta \log \pi(a \mid \phi(s_i); \theta) \qquad (4)$$

We can use the gradient of the sample loss to update the parameters, for example by stochastic gradient descent (SGD) [3]. In summary, the joint parameter $\theta$ of the novel representation is updated to bring the planning-based TVIN policy closer to the expert policy $\pi^*$.
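Putting the pieces together, training reduces to ordinary supervised learning against expert action labels. The loop below is a sketch of Equations (3) and (4) with SGD; tvin, expert_loader, and the exact input signature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def train_epoch(tvin, expert_loader, optimizer):
    tvin.train()
    for obs, state, expert_action in expert_loader:
        logits = tvin(obs, state)                        # pi(a | phi(s); theta)
        loss = F.cross_entropy(logits, expert_action)    # Eq. (3) on sampled expert data
        optimizer.zero_grad()
        loss.backward()                                  # gradient of Eq. (4)
        optimizer.step()                                 # SGD step on theta


# Example: optimizer = torch.optim.SGD(tvin.parameters(), lr=0.01)
```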

Source NEWS-9 NEWS-15 NEWS-28
Target Moore-9 Moore-15 Moore-28 Moore-9 Moore-15 Moore-28 Moore-9 Moore-15 Moore-28
N Model %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc
1k VIN 84.2 87.7 77.3 81.7 56.2 65.8 84.2 87.7 77.3 81.7 56.2 65.8 84.2 87.7 77.3 81.7 56.2 65.8
1k TVIN 89.8 94.2 88.3 91.0 66.7  74.7 94.6 96.6 90.1 92.8 66.1 75.3 94.3 95.8 86.4 89.1 62.1 71.1
5k VIN 90.5 92.5 86.7 88.7 64.3 72.9 90.5 92.5 86.7 88.7 64.3 72.9 90.5 92.5 86.7 88.7 64.3 72.9
5k TVIN 97.0 98.0 93.8 94.9 80.4 86.3 97.1 97.2 95.2 96.0 73.4 84.3 97.8 98.2 91.1 92.6 76.2 84.3
10k VIN 86.2 88.0 91.1 92.3 60.8 68.0 86.2 88.0 91.1 92.3 60.8 68.0 86.2 88.0 91.1 92.3 60.8 68.0
10k TVIN 97.6 97.8 95.4 96.2 83.1 88.3 97.4 97.5 96.2 96.7 87.8 91.8 96.6 96.8 92.5 93.7 78.7 84.0
Table 1: Transfer from NEWS to Moore with varying dataset sizes N and maze sizes M.
Source Moore-9 Moore-15 Moore-28
Target NEWS-9 NEWS-15 NEWS-28 NEWS-9 NEWS-15 NEWS-28 NEWS-9 NEWS-15 NEWS-28
N Model %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc
1k VIN 77.8 81.0 69.3 71.1 45.6 51.9 77.8 81.0 69.3 71.1 45.6 51.9 77.8 81.0 69.3 71.1 45.6 51.9
1k TVIN 94.7 94.8 85.5 86.8 69.1 71.6 94.8 94.9 96.3 96.4 89.2 89.4 82.0 84.0 73.1 75.0 64.0 67.7
5k VIN 79.8 81.9 70.7 73.5 57.8 60.9 79.8 81.9 70.7 73.5 57.8 60.9 79.8 81.9 70.7 73.5 57.8 60.9
5k TVIN 95.0 95.0 88.6 89.4 75.1 77.8 97.1 97.1 96.5 96.6 93.0 93.1 85.1 86.7 77.3 80.3 65.1 68.1
10k VIN 87.1 88.4 88.1 88.4 58.4  61.5 87.1 88.4 88.1 88.4 58.4 61.5 87.1 88.4 88.1 88.4 58.4 61.5
10k TVIN 96.6 96.6 89.3 90.0 80.1 82.2 97.4 97.4 97.0 96.9 94.4 94.5 91.7 92.5 88.7 89.6 68.4 72.9
Table 2: Transfer from Moore to NEWS with varying dataset sizes N and maze sizes M.

Experiments

Datasets and Criteria

Dataset

Our experiment domains are synthetic 2D mazes with randomly placed obstacles, in which observations include the agent position, the goal position, and an image of the map with obstacles. Specifically, we use three different 2D maze environments similar to the GPPN experiments conducted by [9]: NEWS, Moore, and Differential Drive.

In NEWS, the agent can move {East, West, North, South}; in Differential Drive, the agent can move forward along its current orientation or turn left/right by 90 degrees, so its action space is {Move forward, Turn left, Turn right}; in Moore, the agent can move to any of the eight cells in its Moore neighborhood, so its action space is {East, West, North, South, Northeast, Northwest, Southeast, Southwest}. When considering transfer in the following experiments, we are given the pairs of similar actions between domains. Between NEWS and Differential Drive, the similar pairs are {(North, Move forward), (East, Turn left), (West, Turn right)}; between NEWS and Moore, the similar pairs are {(East, East), (West, West), (North, North), (South, South)}.
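For reference, these similar-action pairs can be written out explicitly; the listing below simply restates the pairs given above (the remaining Moore actions, the four diagonals, have no source counterpart).

```python
# Similar-action pairs used for transfer; which side is "source" or "target"
# depends on the transfer direction of each experiment.
SIMILAR_NEWS_DRIVE = [("North", "Move forward"), ("East", "Turn left"), ("West", "Turn right")]
SIMILAR_NEWS_MOORE = [("East", "East"), ("West", "West"), ("North", "North"), ("South", "South")]
```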

For our experiments on these three domains, the state vector given as input to the models consists of the maze map and the goal location. In NEWS and Moore, the target is an x-y coordinate. Similar to the experimental setup in [21], for each state in each trajectory we produce a two-channel $M \times M$ observation image, where $M$ is the maze size. The first channel encodes obstacle presence (1 for obstacle, 0 otherwise), while the second channel encodes the goal position (1 at the goal, 0 otherwise). The full observation vector consists of the observation image and the state $s$. In Differential Drive, the goal location contains an orientation along with the x-y coordinate; consequently, the goal map given as input to the models has an additional orientation dimension. In addition, for each input state we produce a label encoding the action that an optimal shortest-path policy would take in that state. In our experiments, the ground-truth labels are created with a maze generation process that uses depth-first search with the recursive backtracker algorithm [5].
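A small sketch of how a NEWS or Moore observation image could be assembled from a maze map and goal position, matching the two-channel description above; the function and argument names are ours, not from the released dataset code.

```python
import numpy as np


def make_observation(obstacle_map, goal_rc):
    """obstacle_map: (M, M) array in {0, 1}; goal_rc: (row, col) of the goal."""
    m = obstacle_map.shape[0]
    obs = np.zeros((2, m, m), dtype=np.float32)
    obs[0] = obstacle_map                     # channel 0: 1 where an obstacle is present
    obs[1, goal_rc[0], goal_rc[1]] = 1.0      # channel 1: 1 at the goal position
    return obs
```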

Criteria

In this section, we empirically compare TVIN and VIN using two metrics following [9]. %Optimal (%Opt) is the percentage of states whose predicted paths under the policy estimated by the model have optimal length:

$$\%\text{Opt} = \frac{1}{N_s} \sum_{i=1}^{N_s} \mathbb{1}\left[\hat{a}_i = a_i^*\right] \times 100\%$$

where $N_s$ is the total number of states in the test set, $a_i^*$ is the optimal action for state $s_i$, and $\hat{a}_i$ is the action predicted by the model for $s_i$. %Success (%Suc) is the percentage of states whose predicted paths under the policy estimated by the model reach the goal state. A trajectory is said to succeed if it reaches the goal without hitting obstacles. Let $N_t$ denote the total number of test trajectories, $s_g^j$ the goal state of trajectory $j$, and $\hat{s}_e^j$ the end state of the trajectory predicted by the model. %Suc is computed as:

$$\%\text{Suc} = \frac{1}{N_t} \sum_{j=1}^{N_t} \mathbb{1}\left[\hat{s}_e^j = s_g^j\right] \times 100\%$$
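Both metrics can be computed with a few lines of NumPy; the helpers below are a sketch that assumes predicted actions have already been compared state by state and predicted trajectories have already been rolled out.

```python
import numpy as np


def percent_optimal(pred_actions, optimal_actions):
    """%Opt: fraction of test states where the predicted action matches the optimal one."""
    return 100.0 * np.mean(np.asarray(pred_actions) == np.asarray(optimal_actions))


def percent_success(end_states, goal_states):
    """%Suc: fraction of predicted trajectories whose end state is the goal state."""
    reached = [tuple(e) == tuple(g) for e, g in zip(end_states, goal_states)]
    return 100.0 * np.mean(reached)
```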

Figure 2: Training process on Moore with 1k training data transferred from NEWS compared with VIN. Left: domains of 30 percent obstacles. Right: domains of 50 percent obstacles.
Figure 3: Prediction accuracy on Moore with varying training data transferred from NEWS compared with VIN. Left: domains of 30 percent obstacles. Right: domains of 50 percent obstacles.

Experimental Results

Our experiments attempt to transfer policies between 2D maze domains with different environments and maze sizes. We evaluate our TVIN approach in the following aspects:

  1. We first evaluate TVIN between different domains, including transfer from NEWS to Moore, from Moore to NEWS, and from Differential Drive to NEWS. Additionally, we vary the maze sizes in each domain and the dataset sizes in the target domains to see the effectiveness of TVIN when only limited training data are available.

  2. We then evaluate the TVIN approach on hyperparameter sensitivity, including the iteration count K and the kernel size F. Experiments show that TVIN performs better than a plain VIN and does not rely on the setting of these hyperparameters.

  3. We finally evaluate TVIN by varying the amount of knowledge transferred from the source domain, characterized by the number of transferable actions between the source and target domains, to see the impact of the amount of transferred knowledge.

In 2D maze domains, an optimal policy can be calculated by the exact value iteration algorithm, and pre-trained VINs represented by a neural network have been shown to learn planning-based policies. However, for tasks of similar complexity that share similar actions, TVIN can greatly accelerate the training process and improve performance by leveraging learned knowledge and reducing the learning expense of parameters.

Source Drive-9 Drive-15 Drive-28
Target NEWS-9 NEWS-15 NEWS-28 NEWS-9 NEWS-15 NEWS-28 NEWS-9 NEWS-15 NEWS-28
N Model %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc
1k VIN 77.8 81.0 69.3 71.1 45.6 51.9 77.8 81.0 69.3 71.1 45.6 51.9 77.8 81.0 69.3 71.1 45.6 51.9
1k TVIN 86.7 88.1 70.2 72.2 49.2 52.6 80.0 81.4 83.4 84.6 63.9 68.7 78.3 81.3 72.8 74.8 57.5 59.8
5k VIN 79.8 81.9 70.7 73.5 57.8 60.9 79.8 81.9 70.7 73.5 57.8 60.9 79.8 81.9 70.7 73.5 57.8 60.9
5k TVIN 88.0 88.8 83.7 86.0 84.1 84.9 86.0 86.8 93.4 93.6 91.9 92.1 81.9 84.4 85.2 85.1 78.8 80.5
10k VIN 87.1 88.4 88.1 88.4 58.4 61.5 87.1 88.4 88.1 88.4 58.4 61.5 87.1 88.4 88.1 88.4 58.4 61.5
10k TVIN 92.8 93.3 92.9 93.1 91.2 91.5 90.9 91.7 94.2 94.3 93.3 93.4 89.5 90.7 95.5 95.5 92.5 92.5
Table 3: Transfer from Drive to NEWS with varying dataset sizes N and maze sizes M.

Accuracies w.r.t. domains

Based on these guidelines, we evaluate several instances of knowledge transfer, i.e., from NEWS to Moore, from Moore to NEWS, and from Differential Drive to NEWS. For each transfer, we compare the TVIN policy to the VIN reactive policy. Additionally, we vary the maze sizes in each domain and the dataset sizes in the target domains. Note that the iteration count K is required to be chosen in proportion to the maze size; in the implementation, we follow [9] and set the default recurrence K relative to the maze size. Results are reported in Table 1, Table 2, and Table 3, showing that our transfer learning approach TVIN provides a definite increase in accuracy when we have limited data in the target domain. Even compared to the success rate of the standard reactive DQN on Moore-28 with the full dataset reported in [21], TVIN outperforms DQN with only 5k training examples in the same setting (Table 1). Additionally, the training process of TVIN and VIN on 1k training examples of Moore-15 is depicted in Figure 2. It also shows that knowledge transfer by TVIN speeds up the learning process and achieves better generalization.

Accuracies w.r.t. transfer methods

As shown in Table 4, we compare with a simple transfer baseline, the heuristic-transfer VIN, a heuristic transfer method [16] that directly uses the pre-trained weights of $f_R$ and the part of $f_P$ corresponding to similar actions as the initialization for training in the target domain. Taking the transfer from NEWS-15 to Moore as an example, the results show that the heuristic-transfer VIN provides useful pre-trained information compared to training from scratch. Moreover, the TVIN policy learned in the target domain performs much better than the heuristic-transfer VIN, which shows that our transfer strategies are effective and applicable.

Accuracies w.r.t. planning complexity

The complexity of planning in 2D maze domains generally depends on the number of obstacles and their distribution on the grid map. We thus synthesize domains based on different numbers of obstacles and different sizes of the grid map. In this experiment, we compare two complexity levels, 30 percent and 50 percent, meaning that 30 percent or 50 percent of the map is randomly covered with obstacles. Although we evaluate our approach on these 2D domains, we note that many real-world application domains, such as navigation and warehouse scheduling, can be mapped to 2D maze domains of different complexity, so such an evaluation should be convincing.

In this experiment, we view NEWS as the source domain and transfer pre-trained knowledge to Moore, and we investigate the transfer performance with respect to different complexity levels. The results are shown in Figure 2 and Figure 3, where the left panels show transfer between domains with 30 percent obstacles and the right panels show transfer between domains with 50 percent obstacles. In both cases, adjusting the weights of the transferred knowledge in TVIN indeed outperforms randomly initializing VIN. This illustrates that TVIN planning policies obtained by our transfer strategies are effective in both simple and complex environments. The performance gap between the transfer learning policy TVIN and the original VIN policy is more significant in the low-complexity domain, whereas in high-complexity domains the gap is comparatively slight. This difference shows that it is more challenging for TVIN to leverage the pre-trained knowledge when the complexity of planning is much higher.

Source NEWS-15
Target Moore-9 Moore-15 Moore-28
N Model %Opt %Suc %Opt %Suc %Opt %Suc
1k VIN 84.2 87.7 77.3 81.7 56.2 65.8
1k VIN (heuristic transfer) 92.8 94.9 88.6 91.2 65.2 74.6
1k TVIN 94.6 96.6 90.1 92.8 66.1 75.3
5k VIN 90.5 92.5 86.7 88.7 64.3 72.9
5k VIN (heuristic transfer) 96.2 96.1 94.2 95.4 71.9 80.9
5k TVIN 97.1 97.2 95.2 96.0 73.4 84.3
10k VIN 86.2 88.0 91.1 92.3 60.8 68.0
10k VIN (heuristic transfer) 96.1 96.3 95.0 95.5 84.6 90.4
10k TVIN 97.4 97.5 96.2 96.7 87.8 91.8
Table 4: Policy performance of the VIN trained from scratch, the heuristic-transfer VIN, and TVIN on Moore targets with NEWS-15 as the source.

Accuracies w.r.t. dataset sizes

To evaluate the objective of transfer learning, we compare the performance of the TVIN model with different dataset sizes. As illustrated in Table 1, Table 2, and Table 3, the size of the training data in the target domain influences the performance of TVIN. Prediction accuracy with varying amounts of training data in the target domain is also depicted in Figure 3. It shows that, in each case, TVIN indeed outperforms randomly initialized VIN. Although the performance gap decreases gradually as the dataset size increases, TVIN turns out to be significantly better than VIN when there is limited data in the target domain. This shows that if there is already sufficient data for a novel domain to learn optimal policies, information transferred from the source domain does not help improve performance much. Rather, our transfer strategies focus on generating planning-based TVIN policies for a target domain with a limited dataset.

Accuracies w.r.t. hyperparameters

Following the above results showing that TVIN performs better than or equal to VIN, we further evaluate the effect of varying both the iteration count and the kernel size on the TVIN models. Table 5 and Table 6 show the %Opt and %Suc results of TVIN and VIN on Moore-15 for different values of K and F, using NEWS-9 as the source domain. They show that TVIN outperforms VIN even when hyperparameters such as the iteration count K and the kernel size F are set differently in the target domain. Although in VINs larger mazes require larger kernel sizes and iteration counts, the performance gap between TVIN and a plain VIN does not rely on a specific choice of hyperparameters.

Accuracies w.r.t. transferred knowledge

Finally, we evaluate the influence of the number of transferable actions between the source and target domains in TVIN. The more actions are transferred, the more knowledge is leveraged in the target domain. Table 7 shows results for different numbers of transferable actions between the source domain (NEWS-9) and the target domain (Moore-15) with 1k training examples. The results show that the more similar actions are transferred, the better the target TVIN performs.

K = 10 K = 20 K = 30
Model %Opt %Suc %Opt %Suc %Opt %Suc
VIN 70.3 78.3 67.7 77.0 64.7 74.5
TVIN 78.0 85.8 81.6 90.8 80.1 91.8
Table 5: Test performance on Moore-15 transferred from NEWS-9 with varying iteration counts K.
F = 3 F = 5 F = 7
Model %Opt %Suc %Opt %Suc %Opt %Suc
VIN 64.7 74.5 77.3 81.7 77.8 83.1
TVIN 80.1 91.8 88.3 91.0 85.3 88.9
Table 6: Test performance on Moore-15 transferred from NEWS-9 with varying kernel sizes F.

Related Work

In reinforcement learning (RL), an agent acts in the world and learns a policy from trial and error; RL algorithms such as those in [20, 18, 10] use these observations to improve the value of the policy. Recent work investigates policy architectures that are specifically tailored for planning under uncertainty. VINs [21] take a step in this direction by exploring better generalizing policy representations. The Predictron [19] and the Value Prediction Network [14] also learn value functions end-to-end using an internal model, with recurrent neural networks (RNNs) [11] acting as the transition functions over abstract states. However, none of these abstract planning-based models has been considered for transfer. Our work investigates the generalization properties of the pre-trained policy and proposes the TVIN model for knowledge transfer.

Actions Num = 1 Num = 2 Num = 3 Num = 4
Model %Opt %Suc %Opt %Suc %Opt %Suc %Opt %Suc
VIN 77.3 81.7 77.3 81.7 77.3 81.7 77.3 81.7
TVIN 82.0 86.1 82.2 86.5 86.2 90.9 88.3 91.0
Table 7: Test performance on Moore-15 transferred from NEWS-9 with varying numbers of transferred actions.

A wide variety of methods have also been studied in the context of RL transfer learning [22]. Policy distillation [8, 4] aims to compress the capacity of a deep network via efficient knowledge transfer and has been successfully applied to deep reinforcement learning problems [17]. Recently, successor features and generalised policy improvement have been introduced as a principled way of transferring skills [2], and [1] considers value-function-based transfer in RL. However, the key to our approach is that the Q-functions for specific actions learned from the source domain can be transferred to the corresponding VI module in the target domain. We also build a mapping between the feature spaces of the source and target domains, transfer Q-networks related to similar actions from the source to the target domain, and build policy networks for dissimilar actions, which are learned from scratch.

Conclusions

We propose a novel transfer learning approach, TVIN, to learn a planning-based policy for a target domain with different feature spaces and action spaces by leveraging pre-trained knowledge from source domains. In addition, we show that such a transfer network leads to better performance when the training data in the target domain is limited. In this paper, we assume the pairs of similar actions are provided beforehand. In the future, it would be interesting to learn the action similarities automatically, based on Web search [27, 28] or language model learning [23, 6], before applying the transfer method.

References

  • [1] D. Abel, Y. Jinnai, S. Y. Guo, G. Konidaris, and M. L. Littman (2018) Policy and value transfer in lifelong reinforcement learning. In ICML, pp. 20–29.
  • [2] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. J. Mankowitz, A. Zídek, and R. Munos (2018) Transfer in deep reinforcement learning using successor features and generalised policy improvement. In ICML, pp. 510–519.
  • [3] S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press, New York, NY, USA.
  • [4] G. Chen, W. Choi, X. Yu, T. X. Han, and M. Chandraker (2017) Learning efficient object detection models with knowledge distillation. In NIPS, pp. 742–751.
  • [5] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2009) Introduction to Algorithms, third edition. The MIT Press.
  • [6] W. Feng, H. H. Zhuo, and S. Kambhampati (2018) Extracting action sequences from texts based on deep reinforcement learning. In IJCAI, pp. 4064–4070.
  • [7] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In CVPR.
  • [8] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. Computer Science 14 (7), pp. 38–39.
  • [9] L. Lee, E. Parisotto, D. S. Chaplot, E. Xing, and R. Salakhutdinov (2018) Gated path planning networks. In ICML.
  • [10] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17 (1), pp. 1334–1373.
  • [11] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur (2010) Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048.
  • [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
  • [13] R. Nogueira and K. Cho (2016) End-to-end goal-driven web navigation. In NIPS, pp. 1903–1911.
  • [14] J. Oh, S. Singh, and H. Lee (2017) Value prediction network. In NIPS, pp. 6120–6130.
  • [15] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
  • [16] E. Parisotto, J. L. Ba, and R. Salakhutdinov (2015) Actor-mimic: deep multitask and transfer reinforcement learning. Computer Science.
  • [17] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell (2016) Policy distillation. Computer Science.
  • [18] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. Computer Science, pp. 1889–1897.
  • [19] D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. P. Reichert, N. C. Rabinowitz, A. Barreto, and T. Degris (2017) The predictron: end-to-end learning and planning. In ICML, pp. 3191–3199.
  • [20] R. S. Sutton and A. G. Barto (2005) Reinforcement learning: an introduction. Machine Learning 16 (1), pp. 285–286.
  • [21] A. Tamar, S. Levine, P. Abbeel, Y. Wu, and G. Thomas (2016) Value iteration networks. In NIPS, pp. 2146–2154.
  • [22] M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (10), pp. 1633–1685.
  • [23] X. Tian, H. H. Zhuo, and S. Kambhampati (2016) Discovering underlying plans based on distributed representations of actions. In AAMAS, pp. 1135–1143.
  • [24] J. N. Tsitsiklis and B. Van Roy (2002) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42 (5), pp. 674–690.
  • [25] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, pp. 2048–2057.
  • [26] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He (2015) Supervised representation learning: transfer learning with deep autoencoders. In IJCAI, pp. 4119–4125.
  • [27] H. H. Zhuo, Q. Yang, R. Pan, and L. Li (2011) Cross-domain action-model acquisition for planning via web search. In ICAPS.
  • [28] H. H. Zhuo and Q. Yang (2014) Action-model acquisition for planning via transfer learning. Artificial Intelligence 212, pp. 80–103.