Pseudo-Rehearsal: Achieving Deep Reinforcement Learning without Catastrophic Forgetting


Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins
Department of Computer Science, University of Otago, Dunedin, New Zealand
Corresponding author: Craig Atkinson

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Neural networks can achieve extraordinary results on a wide variety of tasks. However, when they attempt to sequentially learn a number of tasks, they tend to learn the new task while destructively forgetting previous tasks. One solution to this problem is pseudo-rehearsal, which involves learning the new task while rehearsing generated items representative of previous tasks. Our model combines pseudo-rehearsal with a deep generative model and a dual memory system, resulting in a method that does not demand additional storage as the number of tasks increases. Our model iteratively learns three Atari 2600 games while retaining above-human-level performance on all three games and performing as well as a set of networks individually trained on the tasks. This result is achieved without revisiting or storing raw data from past tasks. Furthermore, previous state-of-the-art solutions demonstrate substantial forgetting compared to our model on these complex deep reinforcement learning tasks.

Deep Reinforcement Learning, Pseudo-Rehearsal, Catastrophic Forgetting, Generative Adversarial Network.

I Introduction

There has been enormous growth in research around reinforcement learning since the development of Deep Q-Networks (DQNs) [1]. DQNs apply Q-learning to deep networks so that complicated reinforcement tasks can be learnt. However, as with most distributed models, DQNs can suffer from Catastrophic Forgetting (CF) [2, 3]. This is where a model has the tendency to forget previous knowledge as it learns new knowledge. Pseudo-rehearsal is a method for overcoming CF by rehearsing randomly generated examples of previous tasks, while learning real data from a new task. Although pseudo-rehearsal methods have been widely used in image classification, they have been virtually unexplored in reinforcement learning. Solving CF is essential if we want to achieve artificial agents that can continuously learn.

Continual learning is important to neural networks because CF limits their potential in numerous ways. For example, if a network has been trained on a particular task, but since training, the function of the neural network needs to be extended or partially changed, the typical solution would be to train the neural network on all of the previously learnt data (that was still relevant) along with the data to learn the new function. This can be an expensive operation because previous datasets (which can be extremely large, as is often the case in deep learning) would need to be stored and retrained on. However, if a neural network could effectively perform continual learning, it would only be necessary for it to directly learn data representing the changes that should be made to the function of the network. Furthermore, continual learning is also desirable because it allows the solution to multiple tasks to be compressed into a single network where weights common to multiple tasks may be shared. This can also benefit the speed at which new tasks are learnt because useful features may already be present in the network.

Our Reinforcement-Pseudo-Rehearsal model (which we call RePR) achieves continual learning in the reinforcement domain. It does so by utilising a dual memory system where a freshly initialised DQN is trained on the new task and then knowledge from this short-term network is transferred to a separate DQN containing long-term knowledge of all previously learnt tasks. A generative model is used to produce short sequences of data representative of previous tasks which can be rehearsed while transferring knowledge of the new task. For each new task, the generative model is trained on data generated from the previous generative model and data from the new task. Therefore, the system can prevent CF without the need for a large memory store holding data from all previous tasks. The main contributions of this paper are:

  • The first successful application of pseudo-rehearsal methods to difficult deep reinforcement learning tasks.

  • Above state-of-the-art performance when iteratively learning 3 difficult reinforcement tasks, without storing any raw data from previously learnt tasks.

  • Empirical evidence clearly demonstrating that a dual memory system is particularly beneficial in reinforcement learning as it facilitates learning the new task, decreasing convergence times.

I-A Deep Q-Learning

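In Deep Q-learning [1], the neural network is taught to predict the discounted reward that would be received from taking each of the possible actions given the current state. More specifically, the loss function used in deep Q-learning is:

$$L_{DQN} = \mathbb{E}_{(s_t, a_t, r_t, d_t, s_{t+1}) \sim U(D)}\left[\left(y_t - Q(s_t, a_t; \theta_t)\right)^2\right] \tag{1}$$

$$y_t = \begin{cases} r_t, & \text{if terminal at } t+1 \\ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_t^{-}), & \text{otherwise} \end{cases}$$

where there exist two $Q$ functions, a deep predictor network and a deep target network, and $\gamma$ is the discount factor. The predictor’s parameters $\theta_t$ are updated continuously by stochastic gradient descent and the target’s parameters $\theta_t^{-}$ are infrequently updated with the values of $\theta_t$. $(s_t, a_t, r_t, d_t, s_{t+1})$ is the state, action, reward, terminal flag and next state for a given time step $t$, drawn uniformly from a large record of previous experiences $D$, known as an experience replay.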
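As a rough illustration of the loss described above, the following Python sketch computes the bootstrapped targets and the mean squared error for a toy mini-batch. The function names, the toy Q-values and the discount factor of 0.5 are illustrative assumptions rather than settings from our experiments, and plain lists stand in for the outputs of the predictor and target networks.

```python
def dqn_targets(rewards, terminals, next_q_target, gamma):
    """Target is the reward alone on terminal steps, otherwise the reward
    plus the discounted maximum Q-value of the target network."""
    return [
        r if done else r + gamma * max(q_next)
        for r, done, q_next in zip(rewards, terminals, next_q_target)
    ]


def dqn_loss(q_pred, actions, targets):
    """Mean squared error between each target and the predictor's
    Q-value for the action that was actually taken."""
    errors = [(y - q[a]) ** 2 for q, a, y in zip(q_pred, actions, targets)]
    return sum(errors) / len(errors)


# Toy mini-batch of two transitions (all values are made up).
rewards = [1.0, 0.0]
terminals = [True, False]
next_q_target = [[0.5, 2.0], [1.0, 3.0]]  # target network on the next states
q_pred = [[0.0, 1.0], [2.0, 0.5]]         # predictor network on the current states
actions = [1, 0]

targets = dqn_targets(rewards, terminals, next_q_target, gamma=0.5)
loss = dqn_loss(q_pred, actions, targets)
```

In a full agent the targets would be treated as constants, so that gradients flow only through the predictor network.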

I-B Pseudo-Rehearsal

The simplest way of solving the CF problem is to use a rehearsal strategy, where previously learnt items are practised alongside the learning of new items. However, rehearsal is not ideal as it requires a buffer containing previously learnt items across all tasks, not just a record of recently learnt items from the current task as stored by the experience replay. Researchers have proposed extensions to this method such as utilising previous examples’ gradients during learning [4], picking a subset of previous samples which best represents the population [5] and using a variational auto-encoder to compress stored items [6]. Such rehearsal methods are cognitively implausible and therefore do not shed light on how mammalian brains might efficiently solve the CF problem.

Pseudo-rehearsal was proposed as a solution to CF which does not require storage of a large dataset of previously learnt input items [7]. Originally, pseudo-rehearsal involved constructing a pseudo-dataset by generating random inputs, passing them through the original network and recording their output. This meant that when a new dataset was learnt, the pseudo-dataset could be rehearsed alongside it, resulting in the network learning the data with minimal changes to the previously modelled function.
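The original procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `old_network` is a hypothetical stand-in for the previously trained network, and the uniform input distribution, input dimension and number of pseudo-items are arbitrary choices made for the example.

```python
import random


def make_pseudo_dataset(network, n_items, input_dim, rng):
    """Generate random inputs and label each one with the output the
    current network already produces for it."""
    pseudo = []
    for _ in range(n_items):
        x = [rng.uniform(0.0, 1.0) for _ in range(input_dim)]
        pseudo.append((x, network(x)))
    return pseudo


def old_network(x):
    # Hypothetical stand-in for the network trained on earlier tasks.
    return [sum(x), max(x)]


rng = random.Random(0)
pseudo_dataset = make_pseudo_dataset(old_network, n_items=5, input_dim=3, rng=rng)

# The new task's real data is then learnt alongside the pseudo-dataset.
new_task_data = [([0.1, 0.2, 0.3], [1.0, 0.0])]
training_set = new_task_data + pseudo_dataset
```

Because the pseudo-targets are the old network's own outputs, fitting them penalises changes to the previously modelled function without requiring any stored real data.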

There is psychological research that suggests that mammalian brains use a method analogous to pseudo-rehearsal to prevent CF in memory consolidation. Memory consolidation is the process of transferring memory from the hippocampus, which is responsible for short-term knowledge, to the cortex for long-term storage. The hippocampus and sleep have both been linked as important components for retaining previously learnt information [8], even in tasks which do not require the hippocampus to learn [9]. During sleep, the hippocampus has been observed to replay patterns of activation that occurred during the day [10], similar to the way that pseudo-rehearsal generates previous experiences. Therefore, we believe that a similar concept will solve the CF problem in deep reinforcement learning.

Although pseudo-rehearsal works for neural networks with relatively small input spaces, it does not scale well to datasets with large input spaces such as image datasets [11]. This is because the probability of a randomly generated input example representing a plausible input item is essentially zero. This is where Deep Generative Replay [12] and Pseudo-Recursal [11] have leveraged the generative abilities of a Generative Adversarial Network (GAN) [13] to randomly generate pseudo-items representative of previously learnt items.

A GAN has two components: a generator and a discriminator. The discriminator is trained to distinguish between real and generated images, whereas the generator is trained to generate images which fool the discriminator. When a GAN is used alongside pseudo-rehearsal, the GAN is also trained on the task so that its generator learns to produce items representative of the task’s input items. Then, when a second task needs to be learnt, pseudo-items can be generated randomly from the GAN’s generator and used in pseudo-rehearsal. More specifically, the loss function for pseudo-rehearsal is:

$$L_{PR} = \frac{1}{N}\sum_{j=1}^{N} L\left(h(x_j; \theta_i), y_j\right) + L\left(h(\tilde{x}_j; \theta_i), \tilde{y}_j\right)$$

$$\tilde{y}_j = h(\tilde{x}_j; \theta_{i-1})$$

where $L$ is a loss function, such as cross-entropy, $h$ is a neural network with weights $\theta_i$ while learning task $i$ and $N$ is the mini-batch size. $(x_j, y_j)$ is the input-output pair for the current task, whereas $\tilde{x}_j$ is a pseudo-item generated to represent the previous task and its target output is calculated by $\tilde{y}_j = h(\tilde{x}_j; \theta_{i-1})$.
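The structure of this loss can be illustrated with a small Python sketch. The scalar "networks" and the squared-error loss below are simplifying assumptions chosen so the arithmetic can be checked by hand; they are not the classifiers or the cross-entropy loss used in the cited work.

```python
def pseudo_rehearsal_loss(loss_fn, new_net, old_net, real_batch, pseudo_inputs):
    """Average, over the mini-batch, of the loss on a real item from the
    current task plus the loss on a pseudo-item whose target is the
    previous network's output."""
    total = 0.0
    for (x, y), x_tilde in zip(real_batch, pseudo_inputs):
        y_tilde = old_net(x_tilde)  # pseudo-target from the previous network
        total += loss_fn(new_net(x), y) + loss_fn(new_net(x_tilde), y_tilde)
    return total / len(real_batch)


def squared_error(prediction, target):
    return (prediction - target) ** 2


def old_net(x):
    return 2.0 * x  # hypothetical network after the previous task


def new_net(x):
    return 3.0 * x  # hypothetical network while learning the current task


real_batch = [(1.0, 3.0), (2.0, 5.0)]  # (input, target) pairs from the current task
pseudo_inputs = [1.0, 0.5]             # items that would come from the generator

loss = pseudo_rehearsal_loss(squared_error, new_net, old_net, real_batch, pseudo_inputs)
```

The second term is zero only when the new network reproduces the old network's outputs on the pseudo-items, which is what discourages forgetting.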

This technique can be applied to multiple tasks using only a single GAN by doing pseudo-rehearsal on the GAN as well. Thus, the GAN learns to generate items representative of the new task while still remembering to generate items representative of the previous tasks (by rehearsing the pseudo-items it generates). This technique has been shown to be very effective for remembering a chain of multiple image classification tasks without ever using real data to rehearse a previously learnt task [12, 11].

II The RePR Model

Our Reinforcement-Pseudo-Rehearsal model (which we call RePR) utilises pseudo-rehearsal and generative methods to achieve iterative learning in reinforcement learning tasks. These methods are further extended by separating the learning agent into two parts, an idea which stems from early research into CF [14, 15, 16] and is still being utilised in recent algorithms such as Deep Generative Dual Memory Network [17] for incremental image classification and others (e.g. [18] and [19]). The first part of our model is the short-term memory (STM) system, which serves a similar function to the hippocampus and is used to learn the current task. The STM system contains two components, a DQN that learns the current task and an experience replay containing data only from the current task. The second part is the long-term memory (LTM) system, which serves a similar function to the cortex. The LTM system also has two components, a DQN containing knowledge of all tasks learnt and a GAN which can generate sequences representative of these tasks. During consolidation, the LTM retains previous knowledge through pseudo-rehearsal, while being taught by the STM how to respond on the current task. All of the networks’ architectures and training parameters used throughout our experiments can be found in the appendices.

Transferring knowledge between these two systems is achieved through knowledge distillation [20], where a student network is optimised so that it outputs similar values to a teacher network. In our case, the student network is our long-term network and the teacher network is our short-term network. The key differences between distillation and pseudo-rehearsal are that distillation uses real images for calculating the desired output and that distillation is used to teach new knowledge, not retain previous knowledge. Segmenting our model into two memory systems was found to be beneficial for learning the new task and allowed more freedom when weighting the importance of learning the new task compared to retaining previous knowledge. We hypothesise that this is because the training of the LTM system is more stable as the reinforcement values it is learning are not changing over time.

II-A Training Procedure

The training procedure can be broken down into three steps: short-term system DQN training, long-term system DQN training and long-term system GAN training. This process could be repeated for any number of tasks until the DQN or GAN does not have enough capacity to perform the role sufficiently.
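The order of these three steps can be summarised with the following Python skeleton. It is only a sketch of the control flow: the three callables are hypothetical placeholders for the actual training routines, and the toy stand-ins merely track which tasks each component has seen.

```python
def train_sequence(tasks, train_stm, consolidate_ltm, train_gan):
    """Run the three training steps for each task in turn."""
    ltm_dqn, gan = None, None
    for task in tasks:
        # Step 1: a freshly initialised short-term DQN learns the new task.
        stm_dqn = train_stm(task)
        # Step 2: the long-term DQN is taught by the short-term DQN while
        # rehearsing items from the previous GAN and previous long-term DQN.
        new_ltm_dqn = consolidate_ltm(stm_dqn, ltm_dqn, gan)
        # Step 3: a new GAN learns from the previous GAN and the new task.
        gan = train_gan(gan, task)
        ltm_dqn = new_ltm_dqn
    return ltm_dqn, gan


# Toy stand-ins: each "model" is just the collection of tasks it covers.
def toy_train_stm(task):
    return {task}


def toy_consolidate(stm_dqn, prev_ltm_dqn, prev_gan):
    # With no previous long-term network, the short-term one is copied over.
    return set(stm_dqn) if prev_ltm_dqn is None else prev_ltm_dqn | stm_dqn


def toy_train_gan(prev_gan, task):
    return (prev_gan or []) + [task]


games = ["Road Runner", "Boxing", "James Bond"]
ltm, gan = train_sequence(games, toy_train_stm, toy_consolidate, toy_train_gan)
```

Note that step 2 uses the GAN from before the current task, so the generator only ever needs to represent tasks that the long-term DQN has already consolidated.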

II-A1 Training the Short-Term System

When there is a new task to be learnt, the STM system is reinitialised and trained solely on the task using the standard DQN loss function (Equation 1).

II-A2 Training the Long-Term System’s DQN

Knowledge from the STM system is transferred to the LTM system’s DQN by teaching it to produce similar outputs to the STM system on examples from its experience replay, while also constraining it to produce similar output values to the previous LTM system’s DQN on sequences generated from the LTM system’s GAN. More specifically, the loss function used is:

$$L_{LTM} = \frac{1}{N}\sum_{j=1}^{N} \alpha L_{D_j} + (1 - \alpha) L_{PR_j}$$

$$L_{D_j} = \sum_{a}^{A}\left(Q(s_j, a; \theta_i) - Q(s_j, a; \theta_i^{+})\right)^2 \tag{5}$$

$$L_{PR_j} = \sum_{a}^{A}\left(Q(\tilde{s}_j, a; \theta_i) - Q(\tilde{s}_j, a; \theta_{i-1})\right)^2 \tag{6}$$

where $s_j$ is a state drawn from the current task’s experience replay, $N$ is the mini-batch size, $A$ is the set of possible actions, $\theta_i$ is the long-term DQN’s weights on the current task, $\theta_i^{+}$ is the short-term DQN’s weights after learning the current task and $\theta_{i-1}$ is the long-term DQN’s weights after learning the previous task. Pseudo-states $\tilde{s}_j$ are generated from a GAN and are representative of sequences in previously learnt games. $\alpha$ is a scaling factor between learning the current task and retaining the previous tasks via pseudo-rehearsal ($0 \leq \alpha \leq 1$). When $\alpha = 0.5$, learning the current task is given equal importance to retaining previous tasks. A larger value of $\alpha$ gives more importance to learning the current task, while a lower value gives more importance to retaining the previous tasks. A summary of the information flow while training the short-term system can be found in Fig. 3 in the appendices.
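A minimal Python sketch of this weighted loss is given below, assuming squared differences between Q-values summed over actions, in line with the mean squared error described for RePR. The toy Q-functions and the scaling factor of 0.5 are illustrative assumptions and are not the setting used in our experiments.

```python
def ltm_loss(q_ltm, q_stm, q_prev_ltm, states, pseudo_states, alpha):
    """Weighted sum of a distillation term (match the short-term DQN on real
    states) and a pseudo-rehearsal term (match the previous long-term DQN on
    generated states), averaged over the mini-batch."""
    total = 0.0
    for s, s_tilde in zip(states, pseudo_states):
        distill = sum((a - b) ** 2 for a, b in zip(q_ltm(s), q_stm(s)))
        rehearse = sum(
            (a - b) ** 2 for a, b in zip(q_ltm(s_tilde), q_prev_ltm(s_tilde))
        )
        total += alpha * distill + (1.0 - alpha) * rehearse
    return total / len(states)


# Hypothetical Q-functions over two actions; a state is a single number here.
def q_ltm(s):
    return [s, 2.0 * s]


def q_stm(s):
    return [s + 1.0, 2.0 * s]        # differs from q_ltm by 1 on the first action


def q_prev_ltm(s):
    return [s, 2.0 * s - 2.0]        # differs from q_ltm by 2 on the second action


states = [1.0, 2.0]         # would be drawn from the current experience replay
pseudo_states = [0.0, 3.0]  # would be drawn from the GAN

balanced = ltm_loss(q_ltm, q_stm, q_prev_ltm, states, pseudo_states, alpha=0.5)
```

Setting the scaling factor to 1 leaves only the distillation term, and setting it to 0 leaves only the pseudo-rehearsal term.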

II-A3 Training the Long-Term System’s GAN

The GAN is reinitialised and trained to produce images that are representative of the previous GAN’s images and images drawn from the current task’s experience replay. More specifically, the generator learns to represent sequences $x$ drawn by:

$$x = \begin{cases} D_j, & \text{if } r < \frac{1}{T} \\ \tilde{x}, & \text{otherwise} \end{cases}$$

where $r$ is a random number uniformly drawn from $[0, 1)$ and $j$ is a randomly selected index for an element in the current task’s experience replay $D$. $T$ is the number of tasks learnt and $\tilde{x}$ is a randomly generated item from the LTM system’s GAN before training on task $T$. This item could represent a sequence from task $1$ to $T-1$. The GAN in our experiments is trained with the WGAN-GP [21] loss function with a drift term [22] added to it. The specific loss function can be found in the appendices. A summary of the information flow while training the long-term system can be found in Fig. 4 in the appendices.
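The sampling rule can be sketched in Python as follows, assuming that a real sequence is chosen with probability equal to one over the number of tasks learnt and that a generated sequence is used otherwise. The replay contents and the `generate_pseudo` callable are hypothetical placeholders.

```python
import random


def sample_gan_training_item(replay, generate_pseudo, num_tasks, rng):
    """Return a real sequence from the current task's replay with probability
    1 / num_tasks, otherwise a sequence from the previous generator."""
    r = rng.random()  # uniform in [0, 1)
    if r < 1.0 / num_tasks:
        j = rng.randrange(len(replay))  # random index into the replay
        return replay[j]
    return generate_pseudo()


rng = random.Random(0)
replay = ["real"] * 10  # placeholder for stored sequences of the current task


def generate_pseudo():
    return "pseudo"     # placeholder for a sequence from the previous GAN


# With three tasks learnt, roughly one third of the items should be real.
samples = [
    sample_gan_training_item(replay, generate_pseudo, num_tasks=3, rng=rng)
    for _ in range(10000)
]
real_fraction = samples.count("real") / len(samples)
```

Under this rule every task learnt so far receives approximately the same share of the generator's training data, provided the previous generator represents its own tasks evenly.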

II-B Requirements of a Continual Learning Agent

A continual learning agent should be capable of learning multiple tasks: iteratively without revisiting them; without substantially forgetting previous tasks; with a consistent memory size that does not grow as the number of tasks increases; and without storing raw data from previous tasks. The results in Section V-A demonstrate that our RePR model can iteratively learn multiple tasks, without revisiting them or substantially forgetting previous tasks. Applying pseudo-rehearsal methods to the GAN is also important as it allows our model to use a single generative network which does not need to scale in size as the number of tasks increases.

We believe an agent should not have to store raw data from previous tasks for storage and/or privacy reasons. Instead our model uses a GAN to randomly generate samples representative of the raw data. However, we do not investigate whether the GAN is effective at reducing storage compared to directly storing a compressed subset of previously learnt raw data. We leave this research question for future work.

III Related Work

This section will focus on methods for preventing CF in reinforcement learning and will generally concentrate on how to learn a new task without forgetting a previously learnt task. There is a lot of related research outside of this domain, predominantly around continual learning in image classification. However, because these methods cannot be directly applied to complex reinforcement learning tasks, we have excluded them from this review.

There are two main strategies for avoiding CF: restricting how the network is optimised and amending the training data to be more representative of previous tasks. Restricting how the network is optimised generally involves either having units in the network trained only on a particular task or constraining the weights in the network to yield similar values as they had on previous tasks. Amending the training dataset generally involves adding samples that are representative of previous tasks to the training dataset, as in pseudo-rehearsal.

In real neuronal circuits it is a matter of debate whether memory is retained through synaptic stability, synaptic plasticity, or a mixture of mechanisms [23, 24, 25]. The synaptic stability hypothesis states that memory is retained through fixing the weights between units that encode it. The synaptic plasticity hypothesis states that the weights between the units can change as long as the output units still produce the correct output pattern. Methods that restrict how the network is optimised align with the synaptic stability hypothesis, whereas methods that amend the training dataset align with the synaptic plasticity hypothesis. The major advantage of these latter methods, which include pseudo-rehearsal, is that they allow the network to restructure its weights and compress previous representations to make room for new ones.

Previous research into preventing CF in reinforcement learning has focused on restricting how the network is optimised. Progressive neural networks [26] are made up of a number of smaller networks, each trained on a separate task. Although these networks share some weights, they still have a large number of task specific weights and thus the size of the model grows substantially as it learns new tasks.

Weight constraint methods amend the loss function so that weights do not change considerably when learning a new task. The most popular of these methods is Elastic Weight Consolidation (EWC) [3], which augments its loss function with a constraint that forces the network’s weights to yield similar values to previous networks. Weights that are more important to the previous task(s) are constrained more so that less important weights can be used to learn the new task. EWC has been paired with a DQN to learn numerous Atari 2600 games. One undesirable requirement of EWC is that the network’s weights after learning each task must be stored along with either the Fisher information matrix for each task or examples from past tasks so that the matrix can be calculated when needed. Other variations for constraining the weights have also been proposed [27, 28]; however, these variations have only been applied to relatively simple reinforcement tasks.

Progress and Compress [19] learnt multiple Atari games by firstly learning the game in STM and then using distillation to transfer it to LTM. The LTM system holds all previously learnt tasks, counteracting CF using a modified version of EWC called online-EWC. This modified version does not scale in memory requirements as the number of tasks increases. This is because the algorithm stores only the most recent previous network’s weights along with the discounted sum of previous Fisher information matrices to use when constraining weights. In Progress and Compress, there are also layer-wise connections between the two systems to encourage the short-term network to learn the game using features already learnt by the long-term network.
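The memory saving of online-EWC can be illustrated with a short Python sketch. It assumes the commonly used diagonal Fisher approximation and a quadratic penalty; the decay factor, penalty strength and toy values are illustrative assumptions, and the exact loss functions used in our comparison are given in the appendices.

```python
def update_running_fisher(running_fisher, new_fisher, decay):
    """Keep a single discounted sum of per-weight Fisher values rather than
    one Fisher matrix per task."""
    return [decay * old + new for old, new in zip(running_fisher, new_fisher)]


def weight_penalty(weights, anchor_weights, fisher, strength):
    """Quadratic penalty that pulls important weights (large Fisher value)
    back towards the most recent previous network's weights."""
    return 0.5 * strength * sum(
        f * (w - a) ** 2 for w, a, f in zip(weights, anchor_weights, fisher)
    )


# Toy example with two weights (all numbers are made up).
running = [1.0, 0.0]                 # discounted Fisher sum from earlier tasks
fisher_new_task = [0.5, 2.0]         # Fisher values estimated on the latest task
running = update_running_fisher(running, fisher_new_task, decay=0.5)

penalty = weight_penalty(
    weights=[1.0, 1.0], anchor_weights=[0.0, 0.0], fisher=running, strength=2.0
)
```

Only one set of anchor weights and one Fisher accumulator are stored, regardless of how many tasks have been learnt.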

The experimental conditions under which EWC and Progress and Compress were tested were considerably easier than the conditions we use. We do not allow tasks to be revisited, whereas both EWC and Progress and Compress visited tasks several times, giving the algorithm the opportunity to retrain on previously learnt tasks, lessening the effects of CF. In EWC, networks were allocated two task specific weights per neuron, whereas the networks we use in our experiments do not grow in capacity as a new task is learnt. Furthermore, we require the value function to be retained during learning, whereas Progress and Compress only retained the policy function. Retaining the policy function allows the network to remember how to act in an environment. However, retaining the value function is also useful to a continual learner because if training were to continue on a previously learnt task, which the agent had not yet mastered, the value function would have to be relearnt to continue improving the policy successfully.

The Deep Generative Dual Memory Network [17] and the Progress and Compress algorithm [19] are the closest iterative learning models to RePR which contain a dual memory system. However, our current research differs from both of these algorithms. The Deep Generative Dual Memory Network combines pseudo-rehearsal techniques with a Variational Auto-Encoder (a generative network) to retain knowledge of previous tasks. These tasks are image classification tasks and thus, they are learnt by the STM system and transferred to the LTM system by the cross-entropy loss function, whereas RePR uses a combination of the Deep Q-learning loss function and mean squared error, along with a GAN which generates sequences of consecutive frames. Progress and Compress uses distillation to transfer knowledge from the short-term network to the long-term network, retaining previous knowledge through online-EWC as opposed to pseudo-rehearsal in our model.

The typical rehearsal methods fall into the strategy of amending the training dataset with representative samples, where the network is retrained on previously learnt examples. For example, PLAID [29] uses distillation to merge a network that performs the new task with a network whose policy performs all previously learnt tasks. Distillation methods have also been applied to Atari 2600 games [30, 31]; however, these were in multi-task learning where CF is not an issue. The major disadvantage with all rehearsal methods is that they require either access to the previous environments or that a large number of previously learnt frames from each game are stored. For other recent examples of rehearsal in reinforcement learning see [32, 33, 34].

Without an experience replay, CF can occur while learning even a single task as the network forgets how to act in previously seen states. Pseudo-rehearsal has also been applied to this problem by rehearsing randomly generated input items from basic distributions (e.g. uniform distribution) [35, 36], with a similar idea accomplished in actor-critic networks [37]. However, all these methods were applied to simple reinforcement tasks and did not utilise deep generative structures for generating pseudo-items or convolutional network architectures.

To our knowledge, pseudo-rehearsal has only been applied by [38] to sequentially learn reinforcement tasks. This is achieved by extending the Deep Generative Replay algorithm used in image classification to reinforcement learning. Pseudo-rehearsal is combined with a Variational Auto-Encoder so that two very simple reinforcement tasks can be iteratively learnt by State Representation Learning without CF occurring. These tasks involve a 2D world where the agent’s input is a small grid representing the colour of objects it can see in front of it. The only thing that changes between tasks is the colour of the objects the agent must collect. There are several main differences between RePR and [38]. Firstly, our model incorporates a dual memory system, which facilitates learning the new task. Also, we use a DQN to learn much more complex reinforcement learning tasks which are relatively different from one another and have a large input space that requires deep convolutional networks to learn and generate plausible input items. Furthermore, the input items RePR generates are not static images but rather a sequence of consecutive frames. Since our work, pseudo-rehearsal has been used to overcome CF in models which have learnt to generate states from previously seen environments [39, 40]. In both these cases, pseudo-rehearsal was not applied to the learning agent to prevent its CF.

In summary, RePR differs from other work in the continual learning field by being the first variation of pseudo-rehearsal to be successfully applied to deep reinforcement learning. This has been achieved by using a generative model to produce pseudo-items [12] along with a dual memory system [17]. To achieve this, the loss functions of the networks have been changed from cross-entropy to a combination of the Deep Q-learning loss function and mean-squared error. Furthermore, the generative model has been manipulated so that it generates sequences from the task rather than a static image.

IV Method

Our current research applies pseudo-rehearsal to deep Q-learning so that a DQN can be used to learn multiple Atari 2600 games in sequence. The tasks chosen were Road Runner, Boxing and James Bond as they were three conceptually different games in which a DQN could outperform human performance by a wide margin [1]. Road Runner is a game where the agent must outrun another character by moving toward the left of the screen while collecting items and avoiding obstacles. To achieve high performance the agent must also learn to lead its opponent into certain obstacles to slow it down. Boxing is a game where the agent must learn to move its character around a 2D boxing ring and throw punches aimed at the face of the opponent to score points, while also avoiding taking punches to the face. James Bond has the agent learn to control a vehicle, while avoiding obstacles and shooting various objects. All agents select between 18 possible actions representing different combinations of joystick movements and pressing of the fire button. Our DQN is based upon [1] with a few minor changes which we found helped the network to learn the individual tasks quickly. The specifics of these changes can be found in the appendices.

The tasks were learnt in the following order: Road Runner, Boxing and then James Bond. Each game was learnt by the STM system for 20 million frames and then taught to the LTM system for 20 million frames, with the exception of the first long-term DQN which had the short-term DQN’s weights copied directly over to it. This means that our experimental conditions differ only once the second task (Boxing) is being learnt. The GAN had its discriminator and generator loss functions alternately optimised.

When pseudo-rehearsal was applied to the long-term system’s DQN agent or GAN, pseudo-items were drawn from a temporary array of sequences generated by the previous GAN. The final weights for the short-term system’s DQN are those that produce the largest average score over observed frames. The final weights for the long-term system’s DQN are those that produce the lowest error over observed frames. The scaling factor that weights learning the current task against retaining previous tasks was held at a single value in all of our experiments; however, we have also tested RePR with two other values, both of which produced very similar results, with the final agent performing at least equivalently to the original DQN’s results [1] for all tasks.

After every million observed frames the network is evaluated on the current task and all previously learnt tasks. Our evaluation procedure is similar to [1] in that our network plays each task for 30 episodes and an episode terminates when all lives are lost. Actions are selected from the network using an $\epsilon$-greedy policy. Final network results are also reported using this procedure and standard deviations are calculated over these 30 episodes. Each condition is trained three times using the same set of seeds between conditions. Unless stated otherwise, all reported results are averaged across these seeds.
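The evaluation loop can be sketched in Python as shown below. The function names and the fake episode scores are hypothetical, and the use of the sample standard deviation is an assumption, since either the sample or the population form could be intended.

```python
import random
import statistics


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise pick the
    action with the largest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def evaluate(play_episode, n_episodes=30):
    """Play a fixed number of episodes and report the mean score and the
    (sample) standard deviation across them."""
    scores = [play_episode() for _ in range(n_episodes)]
    return statistics.mean(scores), statistics.stdev(scores)


rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)

# Fake episodes that return the scores 1, 2, ..., 30 in order.
fake_scores = iter(range(1, 31))
mean_score, std_score = evaluate(lambda: next(fake_scores))
```

In a real evaluation, `play_episode` would run the agent in the emulator, calling `epsilon_greedy` at every step until all lives are lost.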

V Results

Fig. 1: Results of our RePR model compared to the no-rehearsal, rehearsal, pseudo-rehearsal without a dual memory system, EWC and online-EWC conditions. Scores are recorded during training the long-term system. Task switches occur at the dashed lines, in the order Road Runner, Boxing and then James Bond.

V-A Does RePR Prevent Catastrophic Forgetting?

The first experiment investigates how well RePR compares to the best and worst case scenario for learning the sequence of tasks. The worst case scenario is where no attempt is made to retain previously learnt tasks and thus, the long-term model is optimised solely with the distillation loss function (Equation 5). We refer to this as the no-rehearsal condition. The best case scenario for RePR is for the GAN to produce perfectly accurate sequences of previously learnt tasks to rehearse alongside learning the new task. Therefore, the rehearsal condition learns the new task while rehearsing previously learnt tasks from sequences drawn from previous tasks’ experience replays, instead of sequences generated from a GAN. This condition is not doing typical rehearsal although the difference is subtle. It relearns previous tasks using targets produced by the previous network (as in RePR), rather than targets on which the original long-term DQN was taught. The RePR condition learns the three tasks with the proposed RePR model. Finally, we include results from a pseudo-rehearsal condition which is identical to the RePR condition without a dual memory system and thus, the loss function is the standard DQN loss function (Equation 1) with the pseudo-rehearsal loss function (Equation 6) added onto it.

The results of RePR can be found in Fig. 1, alongside other conditions’ results. All of the mentioned conditions outperform the no-rehearsal condition, which severely forgets previous tasks. RePR was found to perform very closely to the rehearsal condition, besides a slight degradation of performance on Road Runner which was likely due to the GAN performing pseudo-rehearsal to retain sequences representative of Road Runner. These results suggest that RePR can prevent CF without any need for extra task specific parameters or directly storing examples from previously learnt tasks. Furthermore, the scores our final network model achieves are on par with the DQNs from [1] which were trained on the tasks individually and well above human expert performance levels. Finally, the pseudo-rehearsal condition without a dual memory system demonstrates poorer results compared to the RePR condition along with slower convergence times for learning the new task and thus, shows that combining pseudo-rehearsal with a dual memory model, as we have done in RePR, is beneficial for learning the new task.

We also investigated whether a similar set of weights are important to the RePR agent’s output on all of the learnt tasks or whether the network learns the tasks by dedicating certain weights to be important to each individual task. When observing the overlap in the network’s Fisher information matrices for each of the games (see appendices for implementation details and specific results), we found that the network did share weights between tasks, with similar tasks sharing a larger proportion of important weights. Overall, these positive results show that RePR is a useful strategy for overcoming CF.

Fig. 2: Images drawn from previous tasks’ experience replays (real) and images generated from a GAN iteratively taught to produce sequences from Road Runner, Boxing and then James Bond. Images shown are the first image of each four frame sequence. Each row contains images from one of the three tasks.

Fig. 2 shows GAN generated images after learning all three tasks, alongside real images from the games. This figure shows that although the GAN is successful at generating images similar to those from the previous games, they are still not perfect. However, our results demonstrate that this is not vital as RePR can retain almost all knowledge of previous tasks.

V-B How does RePR Compare to EWC?

We further investigate the effectiveness of RePR by comparing its performance to the leading EWC variants. More specifically, we train the EWC and online-EWC conditions on the same tasks as the previously tested RePR condition, with the only difference being that the long-term system’s DQN is trained with either the standard EWC loss function or the online-EWC loss function. The specific EWC and online-EWC loss functions, as well as details of the hyper-parameter search, can be found in the appendices. In both EWC conditions no GAN is used by the LTM system. Our implementation does not include connections from the LTM system to the STM system which try to encourage weight sharing when learning the new task. These were not included because the authors of the Progress and Compress method found online-EWC alone was competitive with the Progress and Compress method (which included these connections), and omitting them kept the architecture of the agents’ dual memory system consistent between conditions. As previously mentioned, this experiment is designed to be more difficult than the conditions that EWC and online-EWC have previously been tested on, in terms of reinforcement learning. This is because the model must retain the value function and the agent cannot revisit previously learnt tasks or use task specific parameters.

The results of both EWC conditions are also included in Fig. 1. These results clearly show that RePR outperforms both EWC and online-EWC under these conditions. We find that EWC retains past experiences better than online-EWC and, as a consequence of this weaker retention, online-EWC was more effective at learning the new task.

To confirm our (online-)EWC implementation was correct, we tested both our EWC and online-EWC implementations on an easier task where the LTM system only had to retain the agent’s policy (taught by minimising the cross-entropy) and where new tasks were only learnt for 5 million frames each, such that the time the network must retain past knowledge was reduced. Under these conditions, both EWC and online-EWC implementations could successfully retain previously learnt tasks while learning new tasks (see appendices for the learning curves). This confirms that it is the added difficulty of long retention times (without revisiting tasks) and of requiring the value function to be learnt that makes our iterative learning task more difficult and thus a more accurate test of the capabilities of the models.

VI Discussion

Our experiments have demonstrated RePR to be an effective solution to CF when iteratively learning multiple tasks. To our knowledge, this result has not yet been achieved on complex reinforcement learning tasks which require powerful generative models such as GANs. RePR has advantages over popular weight constraint methods, such as EWC, because it does not constrain the network to retain similar weights when learning a new task. This allows the internal layers of the neural network to change according to new knowledge, giving the model the freedom to restructure itself when incorporating new information. Experimentally, we have verified that RePR does outperform EWC methods on an iterative learning task. Furthermore, in terms of memory requirements RePR is very scalable because it does not store data from previous tasks or use task specific weights.

Pseudo-rehearsal without the addition of the dual memory system () showed much slower convergence times on new tasks and lower performance was attained while learning new tasks compared to when a dual memory system was used (). During Deep Q-learning, the Q-values the DQN is learning are consistently changing due to their targets being estimated by the target network, which is also being updated during learning. Therefore, without a dual memory system, the DQN has the challenging task of learning these changing Q-values while also constraining itself to retain knowledge of previous tasks. However, with a dual memory system, the short-term DQN can be given the sole task of learning these changing Q-values. This reduces the difficulty of the problem for the long-term network, as it is just taught the final Q-values by distillation, while rehearsing its previous knowledge through pseudo-rehearsal. This explains why the dual memory system in RePR shows more substantial benefits than a dual memory system does in image classification with the same number of tasks [17].
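The division of labour described above, in which the long-term network is distilled from the short-term network on real states while rehearsing its own previous outputs on generated states, can be sketched as a weighted sum of two squared-error terms. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the squared-error form and the weighting `alpha` are hypothetical, and the exact loss is the one specified in the main text.

```python
def squared_error(q_student, q_teacher):
    """Sum of squared differences between two Q-value vectors for one state."""
    return sum((s - t) ** 2 for s, t in zip(q_student, q_teacher))


def long_term_loss(real_pairs, pseudo_pairs, alpha=0.5):
    """Weighted sum of a distillation term and a pseudo-rehearsal term.

    real_pairs:   (long-term Q-values, short-term Q-values) on real states
                  from the current task.
    pseudo_pairs: (long-term Q-values, previous long-term Q-values) on
                  generated states standing in for previous tasks.
    alpha:        hypothetical weighting between the two terms (assumption).
    """
    distill = sum(squared_error(s, t) for s, t in real_pairs) / len(real_pairs)
    rehearse = sum(squared_error(s, t) for s, t in pseudo_pairs) / len(pseudo_pairs)
    return alpha * distill + (1.0 - alpha) * rehearse
```

Because both targets are fixed network outputs rather than moving bootstrap targets, the long-term network in this sketch never has to track the changing Q-value estimates itself.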

Deep reinforcement learning on Atari games is a more difficult problem than typical image classification tasks because training is less stable (due to the deadly triad [41]) and the desired output of the network is consistently changing. Furthermore, the image regions important for determining the Q-values are relatively small (due to the limited number of small sprites) compared to the area of an image which is useful in image classification (as can be seen in [42, 43]). Therefore, it is a promising result that pseudo-rehearsal can also be applied to deep reinforcement learning with such success.

Our experiments have not investigated RePR on a task sequence longer than three, whereas others have used more, such as Progress and Compress which used six Atari games. However, our experimental conditions were more difficult and still showed RePR outperforming the other state-of-the-art solutions to CF. In theory, RePR could be applied to learn any number of tasks as long as the agent’s network (e.g. DQN) and GAN have the capacity to successfully learn the collection of tasks and generate sequences that are representative of previously learnt tasks. We hypothesise that any capacity bottleneck would likely occur in the GAN, as GANs are currently still relatively unstable to train and suffer from vanishing gradients and mode collapse [44]. We leave these research questions for future work.

In our experiments we assume that the agent knows when a task switch occurs. However, in some use cases this might not be true, and thus a task detection method would need to be combined with RePR. Finally, although we chose to apply our RePR model to DQNs, it can easily be extended to other state-of-the-art deep reinforcement learning algorithms, such as actor-critic networks, by adding a similar constraint to the loss function. We chose DQNs for this research because the value function and policy function are combined, allowing our results to clearly display when both the policy and value function are retained by producing high scores during testing episodes. Future research could investigate the advantages of retaining the value function, such as allowing a task which has only been partially learnt to continue to be learnt without disruption.

VII Conclusion

In conclusion, deep reinforcement learning can use pseudo-rehearsal so that DQNs can achieve continual learning. We have shown that our RePR model can be used to effectively learn three sequential tasks, without scaling in complexity as the number of tasks increases and without revisiting or storing raw data from past tasks. Pseudo-rehearsal has major benefits over weight constraint methods as it is less restrictive on the network, and this is supported by experimental evidence. We also found compelling evidence that the addition of a dual memory system is necessary for continual reinforcement learning to be effective. Finally, increases in the power of generative models will have a direct impact on what can be achieved with RePR and on our goal of having an agent which can continuously learn in its environment, without being challenged by CF.

Fig. 3: Flow of information while training the short-term system. The model plays the game while simultaneously training the short-term system.

Fig. 4: Flow of information while training the long-term system. The DQN and GAN are both trained at independent times.

Appendix A Further Implementation Details

A-A DQN

Hyper-parameter | Value | Description
mini-batch size | 32 | Number of examples drawn for calculating the stochastic gradient descent update.
replay memory size | 200,000 | Number of frames held in the experience replay from which samples of the current game are drawn.
history length | 4 | Number of recent frames given to the agent as an input sequence.
target network update frequency | 5,000 | Number of frames which are observed from the environment before the target network is updated.
discount factor | 0.99 | Discount factor (γ) for each future reward.
action repeat | 4 | Number of times the agent’s selected action is repeated before another frame is observed.
update frequency | 4 | Number of observed frames between successive updates to the current network.
learning rate | 0.00025 | Learning rate used by TensorFlow’s RMSProp optimiser.
momentum | 0.0 | Momentum used by TensorFlow’s RMSProp optimiser.
decay | 0.99 | Decay used by TensorFlow’s RMSProp optimiser.
epsilon |  | Epsilon used by TensorFlow’s RMSProp optimiser.
initial exploration | 1.0 | Initial ε-greedy exploration rate.
final exploration | 0.1 | Final ε-greedy exploration rate.
final exploration frame | 1,000,000 | Number of frames seen by the agent before the linear decay of the exploration rate reaches its final value.
replay start size | 50,000 | The number of frames which the experience replay is initially filled with (using a uniform random policy).
no-op max | 30 | Maximum number of "do nothing" actions performed at the start of an episode ().
TABLE I: DQN hyper-parameters.
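The exploration settings in Table I describe a linear decay of the exploration rate from 1.0 to 0.1 over the first 1,000,000 frames. A minimal sketch of such a schedule follows; the function name is hypothetical, and counting frames from zero (rather than from the end of the replay start period) is an assumption.

```python
def exploration_rate(frame, initial=1.0, final=0.1, final_frame=1_000_000):
    """Linearly anneal the exploration rate from `initial` to `final`.

    The rate reaches `final` at `final_frame` and stays there afterwards.
    """
    if frame >= final_frame:
        return final
    fraction = frame / final_frame
    return initial + fraction * (final - initial)
```

For example, halfway through the decay period this schedule gives a rate midway between the initial and final values.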
layer # units/filters filter shape filter stride
TABLE II: DQN architecture, where CONV is a convolutional layer and FC is a fully connected layer.

The main difference between our DQN and [1] is that we used TensorFlow’s RMSProp optimiser (without centering) with global norm gradient clipping compared to the original paper’s RMSProp optimiser which clipped gradients between . Our network architecture remained the same, however our biases were set to and weights were initialised with , where all values that were more than two standard deviations from the mean were re-drawn. The remaining changes were to the hyper-parameters of the learning algorithm which can be seen in bold in Table I. The architecture of our network can be found in Table II, where all layers use the ReLU activation function except the last linear layer.
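The weight initialisation described above, in which any value more than two standard deviations from the mean is re-drawn, corresponds to sampling from a truncated normal distribution. A minimal standard-library sketch is given below; the function name and the default mean and standard deviation are placeholders, since the values used in the paper are not reproduced here.

```python
import random


def truncated_normal(n, mean=0.0, std=1.0, rng=None):
    """Draw n samples from a normal distribution, re-drawing any sample
    that lies more than two standard deviations from the mean."""
    if rng is None:
        rng = random.Random()
    values = []
    while len(values) < n:
        v = rng.gauss(mean, std)
        if abs(v - mean) <= 2.0 * std:
            values.append(v)  # keep only samples inside the +/- 2 std window
    return values
```

Roughly 95% of draws are accepted on the first attempt, so the re-drawing loop adds little overhead.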

A-B GAN

The GAN is trained with the WGAN-GP [21] loss function with a drift term [22] added to it. The drift term is applied to the discriminator’s output for real and fake inputs, stopping the output from drifting too far away from zero. More specifically, the loss functions used for updating the discriminator () and generator () are:


where and are the discriminator and generator networks with the parameters and . is an input item drawn from either the current task’s experience replay or the previous long-term system’s GAN (as specified in the main text). is an item produced by the current generative model () and . is a random number , is an array of latent variables , and . The discriminator and generator network’s weights are updated on alternating steps using their corresponding loss function.
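For orientation, the general shape of a WGAN-GP objective with a drift penalty is sketched below. This is an illustration under stated assumptions rather than a reproduction of the paper's equations: all symbol names are assumed, with D and G the discriminator and generator with parameters \phi and \theta, x a real item, \tilde{x} = G(z;\theta) a generated item, \hat{x} = \epsilon x + (1-\epsilon)\tilde{x} an interpolated item, \lambda the gradient penalty weight and \epsilon_{\mathrm{drift}} the drift weight. Applying the drift term to both real and generated outputs follows the description above, but its exact form is an assumption.

```latex
\begin{aligned}
L_{D} &= D(\tilde{x};\phi) - D(x;\phi)
        + \lambda\left(\left\lVert \nabla_{\hat{x}} D(\hat{x};\phi) \right\rVert_{2} - 1\right)^{2}
        + \epsilon_{\mathrm{drift}}\left(D(x;\phi)^{2} + D(\tilde{x};\phi)^{2}\right),\\
L_{G} &= -D(\tilde{x};\phi),
\qquad \tilde{x} = G(z;\theta),
\qquad \hat{x} = \epsilon x + (1-\epsilon)\,\tilde{x}.
\end{aligned}
```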

Generator Discriminator
Input: latent variables Input:
layer # units/filters filter shape filter stride layer # units/filters filter shape filter stride
TABLE III: GAN architecture, where FC is a fully connected layer, DECONV is a deconvolutional layer and CONV is a convolutional layer.

Fig. 5: Results of our (online-)EWC implementations tested under less challenging experimental conditions. Scores are recorded during training the long-term system. Task switches occur at the dashed lines, in the order Road Runner, Boxing and then James Bond. Results were produced using a single seed.

Condition | Road Runner & Boxing | Road Runner & James Bond | Boxing & James Bond
 | 0.576 | 0.194 | 0.282
 | 0.781 | 0.140 | 0.113
TABLE IV: Fisher overlap scores between task pairs. Results were produced using a single seed.

The GAN is trained with the Adam optimiser (, , and as per [22]) where the networks are trained for a total of steps with a mini-batch size of 100. The architecture of the network is illustrated in Table III. All layers of the discriminator use the ReLU activation function, except the last linear layer. All layers of the generator use batch normalisation ( and ) and the ReLU activation function, except the last layer which has no batch normalisation and uses the Tanh activation function. This is to make the generated images’ output space the same as the real images which are rescaled between and by applying to each raw pixel value. We also decreased the convergence time of our GAN by applying random noise to real and generated images before rescaling and giving them to the discriminator.

A-C EWC

The EWC constraint is implemented as per [3], where the loss function is amended so that:


where is the distillation loss for learning the current task (as specified in the main text) and is the batch-size. is a scaling factor determining how much importance the constraint should be given, is the current long-term network’s parameters, is the final long-term network’s parameters after learning the previous task and iterates over each of the parameters in the network. is an approximation of the diagonal elements in a Fisher information matrix, where each element represents the importance each parameter has on the output of the network.

The Fisher information matrix is calculated as in [3], by approximating the posterior as a Gaussian distribution with the mean given by the optimal parameters after learning a previous task and a standard deviation . More specifically, the calculation follows [45]:


where an expectation is calculated by uniformly drawing states from the experience replay (). is the Jacobian matrix for the output layer .

When the standard EWC implementation is extended to a third task, a separate penalty is added. This means the current parameters of the network are constrained to be similar to the parameters after learning the first task and the parameters after further learning the second task.
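The penalties described above can be sketched as a quadratic term per previously learnt task, each weighted by that task's diagonal Fisher approximation. This is a minimal illustration assuming the usual λ/2 scaling from [3]; the function and argument names are hypothetical, and parameters are treated as a flat list for simplicity.

```python
def ewc_penalty(theta, anchors, lam):
    """Quadratic EWC penalty summed over previously learnt tasks.

    theta:   current parameter values (flat list).
    anchors: one (fisher_diagonal, theta_star) pair per previous task,
             where theta_star holds the parameters stored after that task.
    lam:     scaling factor for the constraint.
    """
    total = 0.0
    for fisher, theta_star in anchors:
        total += sum(f * (t - s) ** 2
                     for f, t, s in zip(fisher, theta, theta_star))
    return 0.5 * lam * total
```

With standard EWC the `anchors` list grows by one entry per task, which is the storage cost that online-EWC avoids.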

Online-EWC further extends EWC so that only the previous network’s parameters and a single Fisher information matrix are stored. As per [19], this results in the constraint being replaced by:


where the single Fisher information matrix is updated by:


where is a discount parameter and represents the index of the current task. In online-EWC, Fisher information matrices are normalised using min-max normalisation so that the tasks’ different reward scales do not affect the relative importance of parameters between tasks.
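A minimal sketch of the running Fisher update is given below, assuming the discounted-sum form from [19] and assuming that each task's Fisher estimate is min-max normalised before it is accumulated. The function names are hypothetical and the order of normalisation and accumulation is an assumption.

```python
def min_max_normalise(values):
    """Rescale values to the range [0, 1]; a constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def update_running_fisher(running, new_fisher, gamma):
    """Discount the stored Fisher estimate and add the new task's estimate.

    running:    stored diagonal Fisher estimate, or None before any task.
    new_fisher: diagonal Fisher estimate for the task just learnt.
    gamma:      discount parameter applied to the stored estimate.
    """
    normalised = min_max_normalise(new_fisher)
    if running is None:
        return normalised
    return [gamma * r + f for r, f in zip(running, normalised)]
```

Only one Fisher vector and one set of parameters are kept, so storage does not grow with the number of tasks.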

For the condition, we applied a grid search over and for our condition we performed a grid search over and . The best parameters found during the grid searches are in bold. In all conditions the Fisher information matrix is calculated by sampling 100 batches from each task. The final network’s test scores for each of the tasks were min-max normalised and the network with the best average score was selected. The minimum and maximum are found across all testing episodes played during the learning of the task in STM.

To confirm our (online-)EWC implementation was correct, we tested whether our EWC and online-EWC implementations could retain previous task knowledge successfully under less challenging experimental conditions. More specifically, the LTM system only had to retain the agent’s policy (taught by minimising the cross-entropy) and new tasks were only learnt for 5 million frames each. The results of the EWC and online-EWC implementations tested under these conditions can be found as and in Fig. 5, where both conditions could successfully learn new tasks while retaining knowledge of previous tasks.

Appendix B How well does RePR Share Weights?

To investigate whether an agent’s DQN uses similar parameters for determining its output across multiple tasks, [3] suggest that the degree of overlap between two tasks’ Fisher information matrices can be analysed. This Fisher overlap score is bounded between 0 and 1, where a high score represents high overlap and indicates that many of the weights that are important for calculating the desired action in one task are also important in the other task. More specifically, the Fisher overlap is calculated by , where:


given and are the two tasks’ Fisher information matrices which have been normalised so that they each have a unit trace. Fisher information matrices are approximated by Equation 12 using 100 batches of samples drawn from each task’s experience replay.
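As an illustration of the overlap calculation, the sketch below assumes the Fréchet-distance form suggested in [3], where the overlap is one minus the squared distance d² = ½ tr(F₁ + F₂ − 2(F₁F₂)^{1/2}) between trace-normalised matrices. For diagonal Fisher approximations the matrix square root reduces to element-wise square roots, which is what the code uses. The function name and the diagonal representation are assumptions.

```python
import math


def fisher_overlap(fisher_a, fisher_b):
    """Fisher overlap between two diagonal Fisher approximations.

    Each input is a list of non-negative diagonal entries. Both are first
    normalised to unit trace; the result lies between 0 and 1.
    """
    trace_a, trace_b = sum(fisher_a), sum(fisher_b)
    a = [f / trace_a for f in fisher_a]
    b = [f / trace_b for f in fisher_b]
    # For diagonal matrices, 0.5 * tr(A + B - 2 (AB)^(1/2)) simplifies to
    # 0.5 * sum((sqrt(a_i) - sqrt(b_i))^2).
    distance_sq = 0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2
                            for x, y in zip(a, b))
    return 1.0 - distance_sq
```

Two tasks that rely on exactly the same weights in the same proportions give an overlap of 1, while tasks whose important weights are disjoint give 0.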

We compared RePR’s Fisher information matrices for each task using the Fisher overlap calculation. When RePR had learnt the tasks in the order Road Runner, Boxing and then James Bond (as in the condition from Section V-A) the Fisher overlap score was high between the first two tasks learnt but relatively low between other task pairs. This suggests that there are more similarities between Road Runner and Boxing than other task pairs. We confirm this by calculating the Fisher overlap for each of the task pairs when the RePR model had successfully learnt the tasks in the reverse order (i.e. James Bond, Boxing and then Road Runner). In this case, a higher overlap value remains between Road Runner and Boxing, regardless of the order they were learnt in. This demonstrates that the network attempts to share the computation across a similar set of important weights, where the more similar the tasks are the more effective they are at sharing weights. The precise Fisher overlap values for both of these conditions can be found in Table IV.


We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X GPU used for this research.


Craig Atkinson received his B.Sc. (Hons.) from the University of Otago, Dunedin, New Zealand, in 2017. Currently, he is studying for a doctorate in Computer Science. His research interests include deep reinforcement learning and continual learning.

Brendan McCane received the B.Sc. (Hons.) and Ph.D. degrees from the James Cook University of North Queensland, Townsville City, Australia, in 1991 and 1996, respectively. He joined the Computer Science Department, University of Otago, Otago, New Zealand, in 1997. He served as the Head of the Department from 2007 to 2012. His current research interests include computer vision, pattern recognition, machine learning, and medical and biological imaging. He also enjoys reading, swimming, fishing and long walks on the beach with his dogs.

Lech Szymanski received the B.A.Sc. (Hons.) degree in computer engineering and the M.A.Sc. degree in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 2001 and 2005, respectively, and the Ph.D. degree in computer science from the University of Otago, Otago, New Zealand, in 2012. He is currently a Lecturer at the Computer Science Department at the University of Otago. His research interests include machine learning, artificial neural networks, and deep architectures.

Anthony Robins completed his doctorate in cognitive science at the University of Sussex (UK) in 1989. He is currently a Professor of Computer Science at the University of Otago, New Zealand. His research interests include artificial neural networks, computational models of memory, and computer science education.
