Continual Learning with Tiny Episodic Memories
Learning with less supervision is a major challenge in artificial intelligence. One sensible approach to decrease the amount of supervision is to leverage prior experience and transfer knowledge from tasks seen in the past. However, a necessary condition for successful transfer is the ability to remember how to perform previous tasks. The Continual Learning (CL) setting, whereby an agent learns from a stream of tasks without seeing any example twice, is an ideal framework to investigate how to accrue such knowledge. In this work, we consider supervised learning tasks and methods that leverage a very small episodic memory for continual learning. Through an extensive empirical analysis across four benchmark datasets adapted to CL, we observe that a very simple baseline, which jointly trains on examples from the current task as well as examples stored in the memory, outperforms state-of-the-art CL approaches with and without episodic memory. Surprisingly, repeated learning over tiny episodic memories does not harm generalization on past tasks, as joint training on data from subsequent tasks acts as a data-dependent regularizer. We discuss and evaluate different approaches to write into the memory. Most notably, reservoir sampling works remarkably well across the board, except when the memory size is extremely small; in that case, writing strategies that guarantee an equal representation of all classes work better. Overall, these methods should be considered a strong baseline candidate when benchmarking new CL approaches. Code: https://goo.gl/AqM6ZC
Arslan Chaudhry (corresponding author: firstname.lastname@example.org), Marcus Rohrbach, Mohamed Elhoseiny,
Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, Marc'Aurelio Ranzato
University of Oxford,
Facebook AI Research,
King Abdullah University of Science and Technology
1 Introduction and Related Work
Arguably, the objective of continual learning (CL) is to rapidly learn new skills from a sequence of tasks leveraging the knowledge accumulated in the past. Catastrophic forgetting (mccloskey1989catastrophic), i.e. the inability of a model to recall how to perform tasks seen in the past, makes such rapid or efficient adaptation extremely difficult.
This decades old problem of CL (ring1997child; thrun1998lifelong) is now seeing a surge of interest in the research community. Recently, several works have attempted to reduce forgetting by adding a regularization term to the objective function. In some of these works (Kirkpatrick2016EWC; Zenke2017Continual; chaudhry2018riemannian; aljundi2017memory), the regularization term discourages change in parameters that were important to solve past tasks. In other works (li2016learning; Rebuffi16icarl), regularization is used to penalize feature drift on already learned tasks. Yet another approach is to use an episodic memory storing data from past tasks (Rebuffi16icarl; chaudhry2018riemannian); one effective approach to leverage such episodic memory is to use it to constrain the optimization such that the loss on past tasks can never increase (lopez2017gradient).
In this work, we conduct a quantitative study of CL methods on four benchmark datasets under the assumptions that i) each task is fully supervised, ii) each example from a task can only be seen once, using the learning protocol proposed by chaudhry2019agem (see §3), and iii) the learner has access to a very small episodic memory to store and replay examples from the past. Restricting the size of the episodic memory is important because it makes the learning problem more realistic and more distinct from multi-task learning.
While lopez2017gradient and chaudhry2019agem used the memory as a constraint, here we drastically simplify the optimization problem and directly train on the memory, resulting in better performance and more efficient learning. Earlier works (isele2018selective) explored a similar usage of episodic memory, dubbed Experience Replay (ER) (for consistency with prior work in the literature, we will refer to this approach, which trains on the episodic memory, as ER, although its usage for supervised learning tasks is far less established), but for RL tasks where the learner does multiple passes over the data using a very large episodic memory. Our work is instead most similar to riemer2018learning, who also considered the same single-pass-through-the-data learning setting and trained directly on the episodic memory. Our contribution is to extend their study by a) considering much smaller episodic memories, b) investigating different strategies to write into the memory, and c) analyzing why training on tiny memories does not lead to overfitting.
Our extensive empirical analysis shows that when the size of the episodic buffer is reasonably large, ER with reservoir sampling outperforms current state-of-the-art CL methods (Kirkpatrick2016EWC; chaudhry2019agem; riemer2018learning). However, when the episodic buffer is very small, reservoir sampling-based ER suffers a performance loss because there is no guarantee that each class has at least one representative example stored in the memory. In this regime, other sampling strategies that sacrifice perfect randomization to balance the population across classes work better. This observation motivated us to introduce a simple hybrid approach that combines the best of both worlds and does not require prior knowledge of the total number of tasks: it starts by using reservoir sampling but switches to a balanced strategy when it detects that at least one class has too few examples stored in the memory. Importantly and counter-intuitively, repetitive learning on the same tiny episodic memory still yields a significant boost in performance thanks to the regularization effect brought by training on subsequent tasks, a topic which we investigate at length in §LABEL:sec:analysis. In conclusion, ER on tiny episodic memories offers very strong performance at a very small additional computational cost. We believe that this approach will serve as a stronger baseline for the development of future CL approaches.
2 Experience Replay
Recent works (lopez2017gradient; chaudhry2019agem) have shown that methods relying on episodic memory achieve superior performance to regularization-based approaches (e.g., (Kirkpatrick2016EWC; Zenke2017Continual)) when using a “single-pass through the data” protocol; see §3 for details. While lopez2017gradient and chaudhry2019agem used the episodic memory as a means to project gradients, more recently riemer2018learning proposed an approach that achieves even better generalization by training directly on the examples stored in the memory. In this work, we build upon riemer2018learning and further investigate the use of memory as a source of training data when the memory size is very small. Moreover, we compare various heuristics for writing into the memory.
The overall training procedure is given in Alg. 1. Compared to the simplest baseline model, which merely tunes the parameters on the new task starting from the previous task's parameter vector, ER makes two modifications. First, it maintains an episodic memory which is updated at every time step (line 8). Second, it doubles the size of the minibatch used to compute the gradient-descent parameter update by stacking the actual minibatch of examples from the current task with a minibatch of examples drawn at random from the memory (line 7). As we shall see in our empirical validation, these two simple modifications yield much better generalization and substantially limit forgetting, while incurring a negligible additional computational cost on modern GPU devices.
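As a minimal illustration of the second modification, the construction of the joint minibatch can be sketched as follows; `current_batch`, `memory`, and the function name are our own illustrative choices, not the paper's code:

```python
import random

def er_joint_batch(current_batch, memory, mem_batch_size=10):
    """Stack the current task's mini-batch with a mini-batch drawn
    uniformly at random from the episodic memory; the gradient update
    is then computed on this joint batch."""
    mem_sample = random.sample(memory, min(mem_batch_size, len(memory)))
    return current_batch + mem_sample
```

With an empty memory this degenerates gracefully to plain fine-tuning on the current batch, which matches the behavior at the start of the first task.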
Next, we describe various strategies to write into the memory. All these methods assume access to a continuous stream of data and a small additional temporary memory, which rules out approaches relying on the temporary storage of all the examples seen so far. This restriction is consistent with our definition of CL: a learning experience over a stream of data under the constraint of a fixed, small memory and a limited compute budget.
Reservoir sampling: Similarly to riemer2018learning, we use reservoir sampling (vitter1985random), which takes as input a stream of data of unknown length and returns a random subset of items from that stream. If $n$ is the number of points observed so far and $M$ is the size of the reservoir (sampling buffer), this selection strategy keeps each data point with probability $M/n$. The routine to update the memory is given in Alg. 2.
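The reservoir update itself is short; a sketch in the spirit of Alg. 2, with variable names of our choosing:

```python
import random

def reservoir_update(memory, example, n_seen, mem_size):
    """Reservoir sampling: after n_seen stream examples, each one is
    retained in the buffer with probability mem_size / n_seen."""
    if len(memory) < mem_size:
        memory.append(example)            # buffer not yet full: always store
    else:
        j = random.randint(0, n_seen - 1)  # uniform over all examples seen
        if j < mem_size:
            memory[j] = example            # overwrite a random slot
```

The writer needs only a counter of examples seen so far, which is what makes it suitable for a single-pass stream of unknown length.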
Ring buffer: Similarly to lopez2017gradient, the ring buffer strategy allocates, for each task, as many equally sized FIFO buffers as there are classes. If $C$ is the total number of classes across all tasks and $M$ is the memory size, each class has a FIFO buffer of size $M/C$. As shown in Alg. 3, the memory stores the last few observations from each class. Unlike reservoir sampling, in this strategy the samples stored for older tasks do not change throughout training, leading to potentially stronger overfitting. Also, at early stages of training the memory is not fully utilized, since each buffer has constant size throughout training. However, this simple sampling strategy guarantees equal representation of all classes in the memory, which is particularly important when the memory is very small.
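A minimal sketch of such per-class FIFO buffers (class and method names are ours, not the paper's):

```python
from collections import defaultdict, deque

class RingBufferMemory:
    """Per-class FIFO buffers: each class keeps only its most recent
    mem_size // num_classes examples; the oldest is evicted first."""

    def __init__(self, mem_size, num_classes):
        self.per_class = mem_size // num_classes  # M / C slots per class
        self.buffers = defaultdict(lambda: deque(maxlen=self.per_class))

    def write(self, x, y):
        self.buffers[y].append(x)  # deque drops the oldest item when full

    def all_examples(self):
        return [(x, y) for y, buf in self.buffers.items() for x in buf]
```

Using `deque(maxlen=...)` gives the FIFO eviction for free, and also makes visible the drawback noted above: slots reserved for classes not yet observed sit empty.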
k-Means: For each class, we use online k-Means to estimate $k$ centroids in feature space, using the representation before the last classification layer. We then store in the memory the input examples whose feature representations are closest to these centroids; see Alg. 4. This memory writing strategy has benefits and drawbacks similar to those of the ring buffer, except that it has potentially better coverage of the feature space in the L2 sense.
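An illustrative sketch of such a writer (our own simplification, not the paper's Alg. 4; features are assumed to arrive as plain lists of floats alongside each input):

```python
class KMeansWriter:
    """Per class, maintain k online-updated centroids in feature space
    and keep, for each centroid, the stored example whose features lie
    closest to it."""

    def __init__(self, k):
        self.k = k
        self.state = {}  # class -> {"cents", "counts", "slots"}

    @staticmethod
    def _sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def write(self, x, feat, y):
        s = self.state.setdefault(y, {"cents": [], "counts": [], "slots": []})
        if len(s["cents"]) < self.k:          # first k examples seed centroids
            s["cents"].append(list(feat))
            s["counts"].append(1)
            s["slots"].append((x, list(feat)))
            return
        d = [self._sqdist(feat, c) for c in s["cents"]]
        i = d.index(min(d))                   # nearest centroid
        s["counts"][i] += 1                   # online mean update of centroid i
        s["cents"][i] = [c + (f - c) / s["counts"][i]
                         for c, f in zip(s["cents"][i], feat)]
        _, old_feat = s["slots"][i]
        # keep whichever example is closer to the updated centroid
        if self._sqdist(feat, s["cents"][i]) < self._sqdist(old_feat, s["cents"][i]):
            s["slots"][i] = (x, list(feat))
```

Storing the feature vector next to each slot lets the writer re-score the incumbent against the moving centroid, at the cost of a little extra memory.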
Mean of Features (MoF):
Similarly to Rebuffi16icarl, for each class we compute a running estimate of the average feature vector just before the classification layer, and store in the memory examples whose feature representations are closest to this average feature vector; see details in Alg. 5. This writing strategy has the same balancing guarantees as ring buffer and k-Means, but it populates the memory differently: instead of storing examples at random or via k-Means centroids, it stores examples that are closest to the mode in feature space.
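A simplified sketch of this idea (ours, not the paper's Alg. 5; candidates are scored against the running mean at insertion time, which is an approximation of a fully online procedure):

```python
class MoFMemory:
    """Per class, keep the slots_per_class examples whose features lie
    closest to a running estimate of that class's mean feature vector."""

    def __init__(self, slots_per_class):
        self.slots = slots_per_class
        self.means, self.counts, self.stored = {}, {}, {}

    def write(self, x, feat, y):
        # update the running mean of features for class y
        n = self.counts.get(y, 0) + 1
        mean = self.means.get(y, [0.0] * len(feat))
        mean = [m + (f - m) / n for m, f in zip(mean, feat)]
        self.means[y], self.counts[y] = mean, n
        # score the candidate and retain the self.slots closest examples
        d = sum((f - m) ** 2 for f, m in zip(feat, mean))
        bucket = self.stored.setdefault(y, [])
        bucket.append((d, x))
        bucket.sort(key=lambda t: t[0])
        del bucket[self.slots:]
```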
3 Learning Framework
We use the same learning framework proposed by chaudhry2019agem. There are two streams of tasks, $\mathcal{D}^{CV}$ and $\mathcal{D}^{EV}$. The former contains only a handful of tasks and is used only for cross-validation purposes. Tasks from $\mathcal{D}^{CV}$ can be replayed as many times as needed and have various degrees of similarity to tasks in the training and evaluation stream $\mathcal{D}^{EV}$. The latter stream instead can be played only once; the learner observes its examples in sequence and is tested throughout the learning experience. The final performance is reported on the held-out test sets drawn from $\mathcal{D}^{EV}$.
The $k$-th task in any of these streams consists of $\mathcal{D}_k = \{(x_i, t_i, y_i)\}_{i=1}^{n_k}$, where each triplet constitutes an example defined by an input ($x_i \in \mathcal{X}$), a task descriptor ($t_i \in \mathcal{T}$), which is an integer id in this work, and a target vector ($y_i \in \mathcal{Y}^k$), where $\mathcal{Y}^k$ is the set of labels specific to task $k$ and $\mathcal{Y}^k \subset \mathcal{Y}$.
We measure performance using two metrics, as standard practice in the literature (lopez2017gradient; chaudhry2018riemannian):
Average Accuracy ($A_k$)
Let $a_{k,j}$ be the performance of the model on the held-out test set of task $j$ after the model is trained on task $k$. The average accuracy at task $k$ is then defined as:
$$A_k = \frac{1}{k}\sum_{j=1}^{k} a_{k,j}$$
Forgetting Measure ($F_k$)
Let $f_j^k$ be the forgetting on task $j$ after the model is trained on task $k$, which is computed as:
$$f_j^k = \max_{l \in \{1,\dots,k-1\}} a_{l,j} - a_{k,j}$$
The average forgetting measure at task $k$ is then defined as:
$$F_k = \frac{1}{k-1}\sum_{j=1}^{k-1} f_j^k$$
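These two metrics are straightforward to compute from the matrix of per-task accuracies; a small helper (variable names are ours), where `acc[i][j]` is the test accuracy on task `j+1` after training on task `i+1`:

```python
def average_accuracy(acc, k):
    """A_k = (1/k) * sum_{j=1..k} a_{k,j}."""
    return sum(acc[k - 1][j] for j in range(k)) / k

def average_forgetting(acc, k):
    """F_k = (1/(k-1)) * sum_{j=1..k-1} f_j^k, with
    f_j^k = max_{l < k} a_{l,j} - a_{k,j}."""
    f = [max(acc[l][j] for l in range(k - 1)) - acc[k - 1][j]
         for j in range(k - 1)]
    return sum(f) / (k - 1)
```

Note that forgetting on a task compares its current accuracy against the best accuracy ever achieved on it, so $F_k$ can be negative if the model improves on past tasks (backward transfer).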
4 Experiments
In this section, we review the benchmark datasets used in our evaluation, as well as the architectures and the baselines we compared against. We then report the results obtained using episodic memory and ER, and we conclude with a brief analysis investigating generalization when using ER on tiny memories.
We consider four benchmarks commonly used in the CL literature. Permuted MNIST (Kirkpatrick2016EWC) is a variant of the MNIST (lecun1998mnist) dataset of handwritten digits, where each task applies a fixed random permutation of the input pixels to all the images of that task. Our Permuted MNIST benchmark consists of a total of 23 tasks.
Split CIFAR (Zenke2017Continual) consists of splitting the original CIFAR-100 dataset (krizhevsky2009learning) into 20 disjoint subsets, each of which is considered a separate task. Each task has 5 classes that are randomly sampled without replacement from the total of 100 classes.
Similarly to Split CIFAR, Split miniImageNet is constructed by splitting miniImageNet (vinyals2016matching), a subset of ImageNet with a total of 100 classes and 600 images per class, into 20 disjoint subsets.
Finally, Split CUB (chaudhry2019agem) is an incremental version of the fine-grained image classification dataset CUB (WahCUB_200_2011) of 200 bird categories, split into 20 disjoint subsets of 10 classes each.
In all cases, the cross-validation stream $\mathcal{D}^{CV}$ consisted of 3 tasks while the training and evaluation stream $\mathcal{D}^{EV}$ contained the remainder. As described in §3.1, we report metrics on $\mathcal{D}^{EV}$ after doing a single training pass over each task in the sequence. The hyper-parameters selected via cross-validation on $\mathcal{D}^{CV}$ are reported in Appendix Tab. LABEL:tab:hyper_params.
For MNIST, we use a fully-connected network with two hidden layers of 256 ReLU units each. For CIFAR and miniImageNet, a reduced ResNet18, similar to lopez2017gradient, is used and a standard ResNet18 with ImageNet pretraining is used for CUB. The input integer task id is used to select a task specific classifier head, and the network is trained via cross-entropy loss.
For a given dataset stream, all models use the same architecture, and all models are optimized via stochastic gradient descent with a mini-batch size equal to 10. The size of the mini-batch sampled from the episodic memory is also set to 10 irrespective of the size of the episodic buffer.
We compare against the following baselines:
finetune, a model trained continually without any regularization or episodic memory, with the parameters of a new task initialized from the parameters of the previous task.
ewc (Kirkpatrick2016EWC), a regularization-based approach that avoids catastrophic forgetting by limiting learning of parameters critical to the performance of past tasks, as measured by the Fisher information matrix (FIM). In particular, we compute the FIM as a moving average similar to ewc++ in chaudhry2018riemannian and online EWC in progresscompress.
a-gem (chaudhry2019agem), a model that uses episodic memory as an optimization constraint to avoid catastrophic forgetting. Since gem and a-gem have similar performance, as shown by chaudhry2019agem, we only consider the latter in our experiments due to its computational efficiency.
mer (riemer2018learning), a model that also leverages an episodic memory and uses a loss that approximates the dot products of the gradients of current and previous tasks to avoid forgetting. To make the experimental setting more comparable to the other methods (in terms of SGD updates), we set the number of inner gradient steps to 1 for each outer Reptile (metareptile) meta-update, with a mini-batch size of 10.
In the first experiment, we measured average accuracy at the end of the learning experience on the evaluation stream $\mathcal{D}^{EV}$ as a function of the size of the memory. From the results in Fig. 1 we can make several observations. First, and not surprisingly, average accuracy increases with the memory size, and all memory-based methods are much better than the finetune and ewc baselines, showing that CL methods with episodic memories are indeed very effective. For instance, on CIFAR the improvement brought by an episodic memory storing a single example per class is at least 10% (the difference between the performance of MER, the worst-performing method using episodic memory, and ewc, the best-performing baseline not relying on episodic memory), a gain that further increases to 20% when the memory stores 10 examples per class.
Second, methods using ER outperform not only the baseline approaches that do not have episodic memory (finetune and ewc) but also state-of-the-art approaches relying on an episodic memory of the same size (a-gem and mer). For instance, on CIFAR the gain over a-gem brought by ER is 1.7% when the memory only stores 1 example per class, and more than 5% when the memory stores 13 examples per class. This finding might seem quite surprising and will be investigated in more depth in §LABEL:sec:analysis.
Third, experience replay based on reservoir sampling works the best across the board except when the memory size is very small (less than 3 samples per class). Empirically we observed that as more and more tasks arrive and the size of the memory per class shrinks, reservoir sampling often ends up evicting some of the earlier classes from the memory, thereby inducing higher forgetting.
Fourth, when the memory is tiny, sampling methods that by construction guarantee a balanced number of samples per class work best (even better than reservoir sampling). All methods that have this property, i.e., ring buffer, k-Means and Mean of Features, have rather similar performance, which is substantially better than that of reservoir sampling. For instance, on CIFAR, with one example per class in the memory, ER with reservoir sampling is 3.5% worse than ER k-Means, but ER k-Means, ER Ring Buffer and ER MoF are all within 0.5% of each other (see Tab. LABEL:tab:main_cifar_comp in Appendix for numerical values). These findings are further confirmed by the evolution of the average accuracy (Fig. 2, left) as new tasks arrive when the memory can store at most one example per class.
The better performance of strategies like ring buffer for tiny episodic memories, and of reservoir sampling for bigger episodic memories, suggests a hybrid approach, whereby the writing strategy relies on reservoir sampling until some classes have too few samples stored in the memory. At that point, the writing strategy switches to the ring buffer scheme, which guarantees a minimum number of examples for each class. For instance, in the experiment of Fig. 3 the memory budget consists of only 85 memory slots (as there are 17 tasks and 5 classes per task), an average of one sample per class by the end of the learning experience. The learner switches from reservoir sampling to ring buffer once it observes that any of the classes seen in the past has only one sample left in the memory. When the switch happens (marked by a red vertical line in the figure), the learner only keeps $\min(n_c, \lfloor M/C \rfloor)$ randomly picked examples per class, where $n_c$ is the number of examples of class $c$ currently in the memory, $M$ is the memory size, and $C$ is the total number of classes observed so far. The overwriting happens opportunistically, removing examples from over-represented classes as new classes are observed. Fig. 3 shows that when the number of tasks is small, the hybrid version enjoys the high accuracy of reservoir sampling. As more tasks arrive and the memory per task shrinks, the hybrid scheme achieves performance superior to reservoir sampling (and at least similar to ring buffer).
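A toy sketch of this hybrid switch (illustrative names and logic of our own, not the paper's exact procedure): memory entries are `(example, class)` pairs, and the rebalancing keeps at most `mem_size // num_classes` random examples per observed class.

```python
import random

def maybe_switch_to_ring(memory, seen_classes, mem_size):
    """If any observed class has at most one example left under
    reservoir sampling, rebalance the buffer to at most
    mem_size // len(seen_classes) random examples per class and signal
    that future writes should use the ring-buffer scheme."""
    counts = {c: sum(1 for _, y in memory if y == c) for c in seen_classes}
    if min(counts.values()) > 1:
        return memory, False              # keep using reservoir sampling
    cap = max(1, mem_size // len(seen_classes))
    balanced = []
    for c in seen_classes:
        ex = [e for e in memory if e[1] == c]
        balanced.extend(random.sample(ex, min(cap, len(ex))))
    return balanced, True                 # switch to ring-buffer writes
```

Freed slots (from over-represented classes) are then refilled opportunistically as examples of new classes arrive, matching the behavior described above.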