Closed-Loop Memory GAN for Continual Learning
To appear in the Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019).
Sequential learning of tasks using gradient descent leads to an unremitting decline in accuracy on tasks for which training data is no longer available, termed catastrophic forgetting. Generative models have been explored as a means to approximate the distribution of old tasks and bypass storage of real data. Here we propose a cumulative closed-loop memory replay GAN (CloGAN) provided with external regularization by a small memory unit selected for maximum sample diversity. We evaluate incremental class learning using a notoriously hard paradigm, “single-headed learning,” in which each task is a disjoint subset of classes in the overall dataset, and performance is evaluated on all previous classes. First, we show that when constructing a dynamic memory unit to preserve sample heterogeneity, model performance asymptotically approaches that of training on the full dataset. We then show that using a stochastic generator to continuously output fresh new images during training further increases performance significantly while also generating quality images. We compare our approach to several baselines including fine-tuning by gradient descent (FGD), Elastic Weight Consolidation (EWC), Deep Generative Replay (DGR) and Memory Replay GAN (MeRGAN). Our method has a very low long-term memory cost (the memory unit) as well as negligible intermediate memory storage.
Amanda Rios and Laurent Itti
University of Southern California, Los Angeles, USA
Since early development and throughout life humans are constantly faced with unknowns in the environment which demand a persistent adaptation and expansion of past knowledge. In addition, as knowledge is expanded, learning is often facilitated since objects and tasks are often closely related and interconnected. For instance, during development, infants learn to categorize animals according to dimensions such as size, texture, shape, sound, among others. However, subsequent addition of new species rarely corrupts classification performance on the already learned categories. In fact, learning broad-species domains can aid in finer species discriminatory capability [?].
Nonetheless, recreating human-like lifelong continual learning remains a central challenge in Artificial Intelligence. State-of-the-art deep neural networks (DNN) trained to perform supervised continual learning are known to undergo a phenomenon termed “catastrophic forgetting”, which describes a sharp decline in the performance of the model on previously learned tasks as soon as a new task is introduced [?; ?; ?]. This behavior does not come as a surprise if one recalls that in DNNs, learning an input-output mapping implies parameterizing the network with an optimal weight set through loss minimization. Thus, if training data is unavailable for previous tasks, there is no longer a loss term for the old data, and the weight parameterization may deviate sharply from the previous optimal state, incurring severe memory erasure.
2 Prior Work
In the recent literature, several methods have been proposed aiming to ameliorate catastrophic forgetting. They can be roughly subdivided into 3 groups: regularization, network-growing and replay approaches. With regularization methods, one constrains the change of learnable parameters to prevent “overwriting” what was previously encoded. For instance, [?] perform distillation between multiple realizations of a network at distinct time-points, ensuring that the new weights do not shift significantly from the old. In a similar vein, [?] operate within a single network model and use a Fisher information matrix computed with saved samples drawn from past tasks, which then acts as a regularizer preserving highly correlated weights. Similarly, [?] use path integrals of loss-derivatives to constrain weights crucial to past tasks, yielding an intermediate parameterization with minimal combined loss.
Alternatively, in network-growing algorithms, the architecture itself is altered to accommodate new tasks, followed by retraining. For instance, [?] freeze the most important paths in the network, thereby forcefully preventing forgetting, and incrementally add new network chunks to incorporate new tasks. Lastly, in replay methods, the models no longer preserve key pathways or weights. In these algorithms, one estimates the distribution of the old data either by saving a small fraction of the original dataset into a memory buffer or by training a generator to mimic the lost data and labels. At each new task, these methods learn by presenting a network with both new images and a replay of estimated or buffered old images, reverting the continual framework into a multi-task setting and thus alleviating forgetting [?]. Other works have built on the idea of using a buffer of real data to approximate the past distribution [?; ?; ?].
Yet, despite a growing number of appealing solutions, catastrophic forgetting is not a solved issue. Regularization methods have been shown to perform poorly in single-headed incremental class learning, for instance [?; ?], and here we reproduce this limitation in our own results for elastic weight consolidation [?]. On the other hand, network-growing approaches, while usually providing a clean solution for constrained incremental problems, can quickly become memory expensive since they require both an architectural expansion and the storage of at least a portion of old data for retraining.
Likewise, replay methods also run into scalability issues. So far, generative replay models learn a data distribution by resorting to intermediate copy states of the generator. In Deep Generative Replay (DGR), an unconditional GAN is trained at each task to cumulatively generate and discriminate images. Since the proposed GAN is unconditional, they employ an additional classifier (Solver) which is trained in parallel to classify the generated images and assign corresponding labels [?]. During each task switch, DGR makes a copy of the generator and classifier networks and uses them to generate sample images and labels for the old tasks. In Memory Replay GAN (MeRGAN) with joint replay, [?] propose a modification of the DGR framework by substituting an AC-GAN for the unconditional GAN, thereby eliminating the need for the additional solver. Copy operations are expensive and often lead to image quality being degraded through consecutive tasks. Moreover, replicating network states successively is not a fully desirable solution since, from the biological perspective, a human brain cannot produce an “intermediate copy” of itself to transfer knowledge. Lastly, methods which rely instead on small subsets of past data (memory buffers) have been shown to yield good results, but they do not make explicit how much of the performance is due to the algorithm developed and how much is intrinsically due to the variability included in the buffer unit.
3 Closed-Loop Memory GAN
3.1 Model Overview
In this paper, we propose a hybrid approach between memory buffers and deep generative models, aiming specifically to reduce memory costs and to maximize both the classification performance and the generated image quality throughout training. In our model, there is only one generator and embedded classifier trained cumulatively, with no intermediate copy step. In this framework, as a new task is learned, the old data is approximated by continuously sampling from the generator at its present state, forming a closed-loop training paradigm. Of course, since a new task also modifies the parameterization of the generator, this procedure cannot be applied without some verification that the generated images are reasonable approximations of the old distribution that has been lost. Our method tackles this issue by, first, using an image filtering step in which either the classifier or the discriminator is used to assess the sample image quality and, as a result, block bad images from entering the training loop. Second, we employ external regularization by constructing a small dynamic memory buffer with real data samples chosen to maximize image heterogeneity and to enforce smoothness in the representation of old classes. The image buffer has a fixed memory allotment and is therefore not allowed to grow, which requires eliminating some old images to make room for new ones. The sampling for the old data is then always a combination of buffer samples and “on-the-fly” generated samples, which provide a stochastic up-sampling of the memory unit.
3.2 Model Architecture
A vanilla GAN consists of two networks, a Generator and a Discriminator, competing with each other in a zero-sum game framework. The core block of our model (CloGAN), see figure 1, is a modified GAN termed Auxiliary Classifier Generative Adversarial Network (AC-GAN) [?]. The AC-GAN is also composed of 2 networks, but it includes a classifier combined in the same architecture as the discriminator, via an expansion to K+1 output nodes: K class outputs plus the original vanilla Real/Fake discriminator output.
In an AC-GAN framework the generator is fed a uniform noise vector $z$ appended with a corresponding class label $c$. Thus, the conditional generator, described by $G(z, c)$, generates an image, and the AC-GAN learns a mapping in which the noise $z$ is independent of the class $c$, enabling multiple class outputs for a fixed noise input. While the generator is trained to generate images that resemble the input image distribution as closely as possible, the discriminator, $D$, is conversely trained to discriminate these generated images as fake (the real/fake loss). The embedded classifier, $C$, shares most weights with the discriminator and generates a label prediction which, if incorrect, contributes to the overall loss of both generator and discriminator (the classification loss). Overall, an AC-GAN is easier to train than a conventional vanilla GAN while also producing higher quality images. The loss functions are given in (1) and (4) for the generator and the discriminator/classifier, respectively.
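For reference, the standard AC-GAN objective from the AC-GAN literature can be written as below. This is the textbook form and may differ in detail from equations (1) and (4) of this paper; the symbols $L_S$, $L_C$, $X_{\mathrm{real}}$ and $X_{\mathrm{fake}}$ are notation assumed here for illustration, with $X_{\mathrm{fake}} = G(z, c)$.

```latex
\begin{align}
L_S &= \mathbb{E}\left[\log P(S = \mathrm{real} \mid X_{\mathrm{real}})\right]
     + \mathbb{E}\left[\log P(S = \mathrm{fake} \mid X_{\mathrm{fake}})\right] \\
L_C &= \mathbb{E}\left[\log P(C = c \mid X_{\mathrm{real}})\right]
     + \mathbb{E}\left[\log P(C = c \mid X_{\mathrm{fake}})\right]
\end{align}
```

In this formulation the discriminator/classifier is trained to maximize $L_S + L_C$, while the generator is trained to maximize $L_C - L_S$, so a misclassified generated image penalizes both networks.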
Note that a plausible alternative to using a GAN would be to use a variational autoencoder (VAE) instead [?]. However, in our testing, we have not been able to achieve results with a VAE as good as those presented here using a GAN. Hence, in the following, we restrict our analysis to GAN-based approaches. Details of the implementation can be found in the supplementary materials link.
3.3 Closed-Loop Training with Replay
In the continual learning setting, our method approximates the likelihood of old data by employing CloGAN to continuously output fresh new images at each mini-batch during training. A combination of image filtering and external regularization by an image memory buffer confers stability to the closed-loop procedure. At each task, our model is trained using an extended training set (8) which includes real images for the new task, GAN-replayed images for old tasks, and memory images; see figure 2. The memory component can be given a weighted importance. The network is then trained by minimizing (9, 10).
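As a rough illustration of the extended training set described above, the following sketch assembles one mini-batch from the three sources and attaches a per-sample weight to the memory component. The function names, the `mem_weight` parameter and the weighted cross-entropy are assumptions made for illustration; they are not the paper's equations (8) to (10).

```python
import numpy as np

def build_extended_batch(new_x, new_y, replay_x, replay_y, mem_x, mem_y, mem_weight=1.0):
    """Concatenate new-task, replayed and memory samples with per-sample weights.

    mem_weight is a hypothetical scalar standing in for the weighted
    importance of the memory component.
    """
    x = np.concatenate([new_x, replay_x, mem_x], axis=0)
    y = np.concatenate([new_y, replay_y, mem_y], axis=0)
    w = np.concatenate([
        np.ones(len(new_x)),              # real images of the new task
        np.ones(len(replay_x)),           # generated images of old tasks
        np.full(len(mem_x), mem_weight),  # real images from the memory buffer
    ])
    return x, y, w

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(label); probs has shape (batch, classes)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(np.sum(weights * -np.log(picked + 1e-12)) / np.sum(weights))
```

In a real training loop the replayed images would be regenerated at every mini-batch, so only the memory and new-task parts of the batch come from stored data.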
3.4 Image Filtering
At each mini-batch, the generator outputs fresh images approximating samples from old tasks, with the intent of producing a stochastic up-sampling of the reduced memory core. However, since these images are then used as training data in a closed loop, they have to be of the best quality possible to minimize error propagation. Thus, at each generation step, images are assessed for their quality and “filtered” out if they do not meet the standard.
Here, we use the embedded classifier in CloGAN to generate a prediction for the conditional image. If this prediction does not match the conditioning label, the image is filtered out. When old images are generated for closed-loop replay, they are sampled from a model which has already converged for generation and classification of old tasks. The rationale behind this evaluation is that images which are misclassified have a higher probability of being distorted by the ongoing training of the new task, and of deviating too grossly from the original distribution. We term this method Class-Conditioned Filtering (CFM).
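The filtering rule described above can be sketched in a few lines. This is a minimal illustration that assumes the classifier logits for a generated batch are already available as an array; the function name and argument names are hypothetical.

```python
import numpy as np

def class_conditioned_filter(images, cond_labels, class_logits):
    """Keep generated images whose predicted class equals the conditioning label.

    images:       array of shape (batch, ...) produced by the generator
    cond_labels:  integer labels the generator was conditioned on, shape (batch,)
    class_logits: classifier outputs for the generated images, shape (batch, classes)
    """
    predicted = np.argmax(class_logits, axis=1)
    keep = predicted == cond_labels  # boolean mask of accepted samples
    return images[keep], cond_labels[keep], keep
```

Rejected samples are simply dropped, so the number of replayed images per mini-batch can vary unless the generator is sampled again to refill the batch.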
In addition to CFM, we implemented a more complex procedure, “Discriminator Rejection Sampling” (DRS), proposed in [?]. The latter employs the discriminator of an AC-GAN to approximately correct errors in the GAN-generated distribution. Details of the implementation can be found in the supplementary materials link. We compare both to a baseline for DRS, Soft Rejection Filtering (SRF) [?], which rejects a sample if the output from the discriminator logit layer has a score below some threshold. Overall, we found that CFM, DRS and SRF perform equivalently well. A table with comparisons is included in the supplementary materials link. Hence, since CFM has a much faster running time, we opted for carrying out only class-conditioned filtering in our final model.
3.5 Dynamic Memory Buffer
We fill a small memory buffer with samples and labels of original past data to perform external regularization. The memory can be seen as a stable reference frame throughout training that enforces a “smoothness” in the representation for each class. At each task, a selection method is employed to choose the samples from the new task which will go into the buffer, with the aim of maximizing sample heterogeneity. Also, since the buffer has a fixed size, this selection method is further used to determine which of the old task samples will be removed to make space for the incoming new data, again employing the heuristic of sample heterogeneity. We initially experimented with several buffer selection strategies, but the best selection scheme was K-means clustering per class, both at image insertion and removal. In more detail, the construction scheme is as follows: at the end of each current task, a k-centers algorithm is run for each class in the current task’s training labels, super-labeling each image as one of K clusters. At the time of insertion into the memory buffer, we select equal numbers of image samples from each class-specific cluster. Additionally, if the buffer is full, we compute the space needed for new images and remove an equivalent number of old images. We do this by assessing their stored super-cluster labels and removing equal numbers of samples per cluster, thereby preserving heterogeneity. By storing the per-class cluster assignment super-labels we also avoid repeating the clustering operation.
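The insertion and removal scheme can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a plain Lloyd's k-means as a stand-in for the per-class clustering step, and the names `MemoryBuffer`, `add_task`, `per_class` and `round_robin` are hypothetical. It also assumes that the new samples alone fit within the buffer capacity and that every class has at least `k` samples.

```python
import numpy as np

def kmeans_labels(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def round_robin(groups, n):
    """Take up to n items, cycling over groups so each contributes about equally."""
    queues = [list(g) for g in groups if len(g) > 0]
    picked = []
    while len(picked) < n and queues:
        for q in queues:
            if len(picked) < n:
                picked.append(q.pop())
        queues = [q for q in queues if q]
    return picked

class MemoryBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # tuples of (sample, class_label, cluster_label)

    def _evict(self, n):
        # Group stored items by (class, cluster) and remove evenly across groups,
        # reusing the stored cluster labels instead of re-clustering.
        groups = {}
        for idx, (_, c, s) in enumerate(self.items):
            groups.setdefault((c, s), []).append(idx)
        drop = set(round_robin(groups.values(), n))
        self.items = [it for i, it in enumerate(self.items) if i not in drop]

    def add_task(self, x, y, per_class, k=2):
        new_items = []
        for c in np.unique(y):
            xc = x[y == c]
            clusters = kmeans_labels(xc, k)
            groups = [np.flatnonzero(clusters == j) for j in range(k)]
            for i in round_robin(groups, per_class):
                new_items.append((xc[i], int(c), int(clusters[i])))
        overflow = len(self.items) + len(new_items) - self.capacity
        if overflow > 0:
            self._evict(overflow)
        self.items.extend(new_items)
```

Storing the cluster label alongside each sample is what lets eviction stay balanced across clusters without running the clustering again.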
3.6 Continual Learning Baselines
We evaluate other continual learning algorithms as baseline comparisons. We implement Elastic Weight Consolidation (EWC; Kirkpatrick [?]), Deep Generative Replay (DGR; Shin [?]) and Memory Replay GAN (MeRGAN; Wu [?]). With DGR, to make our implementation a fair comparison, we use an unconditional GAN with the same architecture and complexity as our CloGAN, except that it has only one Real/Fake output node. For both EWC and DGR, we use a classifier with the same architecture as our embedded classifier/discriminator, but with one fewer output node since a pure classifier does not evaluate Real/Fake attribution. Finally, for MeRGAN we implement an AC-GAN with the same architecture as our CloGAN.
4.1 Buffer Selection
We experimented with several buffer selection schemes, but they under-performed class-specific K-centers. In the other selection methods, we extracted the logit or softmax layer of the discriminator/classifier network and computed measures such as kurtosis and peak difference to assess sample heterogeneity. The latter measure corresponds to the difference between the softmax scores of the most probable and second most probable class for a given image. We then ranked the images according to each measure and kept each image with a probability proportional to its score. In other words, we performed a roulette weighting procedure as in genetic selection [?]. Table 1 contains performance metrics for 3 buffer selection schemes and no selection (none) during CloGAN incremental class learning using the FASHION dataset with a memory buffer of size 0.16%.
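Table 1: Buffer selection schemes (FASHION, memory buffer of size 0.16%).
|Selection scheme||Accuracy (%)|
|Class-Kcenter||75.87 +/- 0.43|
|Kurtosis||64.52 +/- 0.73|
|Peak Difference||57.74 +/- 0.61|
|None||71.03 +/- 1.4|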
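The peak-difference measure and the roulette weighting mentioned above can be sketched as follows. This is an illustrative reading of the description, with hypothetical function names; it assumes non-negative scores with at least `n` strictly positive entries, and it does not settle how the scores were normalized in the actual experiments.

```python
import numpy as np

def peak_difference(softmax_scores):
    """Difference between the highest and second-highest softmax score per sample."""
    top2 = np.sort(softmax_scores, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def roulette_select(scores, n, seed=0):
    """Sample n distinct indices with probability proportional to score."""
    rng = np.random.default_rng(seed)
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=n, replace=False, p=p)
```

Samples with a zero score are never selected, while all others have a chance proportional to their score, which is what distinguishes roulette selection from simply keeping the top-ranked images.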
4.2 Incremental Learning
We evaluate continual learning as accumulating knowledge of a growing number of disjoint classes, termed incremental learning. Furthermore, we make use of a challenging variation of incremental learning, “single-headed learning”. Here, each task is a disjoint subset of classes from the overall dataset. Performance is evaluated for all previous classes, resulting in a 1/K chance level, where K is the number of classes accumulated to that point. We evaluate incremental class learning on 4 datasets: MNIST [?], FASHION [?], SVHN [?] and E-MNIST [?]. The first 3 were subdivided into disjoint subsets of 2 classes per task, with a total of 5 tasks to cover all the label types. E-MNIST, a larger dataset, was divided into tasks of 3 classes, covering 24 different classes in 8 consecutive tasks. To account for the growing number of classes, we create extra output nodes which are incrementally used, which allows us to keep a single head for all tasks.
We distinguish our procedure from multi-headed learning [?], in which prediction is constrained to the classes in each task. For instance, a multi-headed version of our MNIST test would use and re-use only two output nodes. After training on full disjoint MNIST with 5 tasks of 2 classes each, when evaluating the first task (digits 0 and 1), a multi-headed network would only have to decide between digit 0 and digit 1, as opposed to a one-in-ten decision for a single-headed network. This typically leads to much higher accuracies, partially because an output node never becomes completely disabled, as it is always used for the last task. Finally, note that a multi-headed network with only 2 output nodes provides an output that needs to be further disambiguated by knowing the task.
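The difference between the two evaluation protocols can be made concrete with a small sketch. As a simplifying assumption, the multi-headed case is emulated here by restricting a shared output layer to the classes of a known task, rather than by re-using two physical output nodes; the function names are hypothetical.

```python
import numpy as np

def single_head_predict(logits, n_seen_classes):
    """Single-headed: choose among all classes seen so far."""
    return np.argmax(logits[:, :n_seen_classes], axis=1)

def multi_head_predict(logits, task_classes):
    """Multi-headed (emulated): choose only among the classes of a known task."""
    task_classes = np.asarray(task_classes)
    return task_classes[np.argmax(logits[:, task_classes], axis=1)]

# A sample of digit 1 whose highest logit belongs to digit 2.
logits = np.array([[0.1, 0.3, 0.9, 0.0]])
single = single_head_predict(logits, n_seen_classes=4)  # competes with all 4 classes
multi = multi_head_predict(logits, task_classes=[0, 1])  # only 0 vs 1
```

In this toy case the single-headed prediction is wrong while the multi-headed prediction is right, because the task identity removes the competing class from consideration.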
Average Continual Performance
Figure 3 displays the average performance of CloGAN when varying memory buffer size. Our method avoids catastrophic forgetting even with very small buffer sizes such as 0.08% (50 images) and 0.16% (100 images), for both MNIST and FASHION. For the more challenging E-MNIST and SVHN, buffer requirement becomes more demanding. Nonetheless, we obtain superior performance over the competing methods with still very reduced memory sizes: only 0.5% (576 images) and 1% (492 images).
Table 2 compares maximum average accuracies after training on all tasks, for all methods tested. First, when no memory or GAN sampling is performed, catastrophic forgetting occurs, as exemplified by the FGD condition which contains only fine-tuning with gradient descent. Second, EWC accuracy rapidly declines, asymptotically reaching the catastrophic forgetting curve. EWC has already been shown to behave poorly in incremental single-headed paradigms [?; ?]. To further confirm that this degradation of performance was not particular to our implementation, we replicated the permuted-MNIST experiment proposed in the original EWC paper and verified that in this learning paradigm EWC performs very well. This discrepancy between the experiments is likely due to the difference in output mapping; see the supplementary materials.
|Method||MNIST||FASHION||SVHN||E-MNIST|
|CloGAN||98.03 (1.6%)||85.25 (1.6%)||79.30 (5%)||83.50 (5%)|
|CloGAN||92.26 (0.16%)||76.15 (0.16%)||73.08 (1%)||79.14 (1%)|
Lastly, we report the accuracies for the deep replay methods, DGR and MeRGAN. For MNIST, both DGR and MeRGAN perform very well, reaching 94.9% and 98.25% respectively, whereas CloGAN achieves accuracies of 92.26% with a memory of 0.16% and 98.03% with a memory of 1.6%. However, for all other datasets, which are significantly harder than MNIST, DGR and MeRGAN both underperform CloGAN by significant amounts.
For SVHN, the most challenging dataset, both DGR and MeRGAN display degraded performance after the first task. This behavior is likely caused by a persistent degradation of generated image quality throughout training. Both methods represent old data exclusively by replayed images from an intermediate generator copy. If the generator cannot produce images which represent the original distribution with high fidelity, the gap in representation capacity can be enlarged and propagated through successive GAN transfer (copy) operations. CloGAN alleviates GAN representation degeneration because it is trained from an extended set containing both replay images from the generator and real images in the buffer. The real images never degenerate and act as an anchor to keep smoothness and quality in the subsequent generated images. DGR has another disadvantage relative to CloGAN: it does not generate conditioned images, requiring a separate classifier to produce old image labels during training. If that classifier does not have perfect performance, it will inevitably mislabel some images, contributing to error propagation.
We confirm that CloGAN performs an upsampling of the memory buffer selection by comparing our method to two variations in which the AC-GAN is trained only from a memory buffer, both in continual (Frozen-CloGAN) and multi-task settings (MT). For the latter two conditions there is no closed-loop replay of GAN samples. Furthermore, in the MT setting we re-start training at each task switch. The results reported in figure 4 correspond to the maximum accuracies achieved for each task for all 3 variations. We verify that stochastic generation in CloGAN provides an upsampling of the buffer and achieves superior performance to Frozen-CloGAN and MT. We show results for the more challenging datasets, E-MNIST and SVHN.
Upsampling is indicated when a positive gap between CloGAN and Frozen-CloGAN increases as more tasks are added. For SVHN, the last task shows clear gaps between CloGAN and both Frozen-CloGAN and MT (maximum gap of 11.39% at task 5). Similarly, E-MNIST shows a clear gap in the last two tasks, 7 and 8 (maximum gap of 10.41% at task 8). Additionally, we show that MT under-performs starting in early tasks due to a lack of forward transfer, since the networks are re-started from scratch at each task switch. Similar upsampling behavior was observed in MNIST and FASHION, with maximum gaps of 6.84% and 9.35% respectively. Additional figures can be found in the supplementary link.
Memory Equivalence of Stochastic Replay
To further disentangle the contribution of a closed-loop generative replay to the model, we compensate the memory expense of the stochastic generator by allocating all of its memory budget to the episodic buffer. Thus, for each CloGAN we create a NoGenReplay-Equivalent model which equates the generator size (1.6 M float parameters) to images in the episodic memory (2572 if RGB and 6226 if gray).
|NoGenReplay-Equivalent||77.35 (6.5%)||78.68 (10.5%)||73.32 (5.2%)||73.79 (6.2%)|
For instance, for CloGAN-1% trained on E-MNIST (gray images) we create the NoGenReplay-Equivalent-6.5% (buffer of 1% + 6226 images = 6.5%) and obtain 79.14% correct for our method versus 77.35% for the no-replay condition. Similarly, for CloGAN-5% we achieve 83.5% whereas NoGenReplay-Equivalent-10.5% yields 78.68%. Hence, we show that including replay beats just using a larger episodic buffer of equivalent memory. Results are included in table 3.
Per Task Performance
In figure 5(A, B), we show per-task accuracies over time. Here, CloGAN is shown to produce stable performance throughout consecutive tasks. For both E-MNIST and SVHN, all past tasks maintain high accuracies consistently throughout the learning of new classes. For example, in E-MNIST, task 1 preserves its accuracy at 84.33% despite the learning of 7 other tasks in succession. Likewise, SVHN task 1 has an accuracy of 83.87%. The results are significantly higher when compared to the baseline of catastrophic forgetting and to EWC. Moreover, we also display performance for MeRGAN. In E-MNIST, MeRGAN accuracies for tasks 1 and 2 clearly underperform CloGAN at the end of training, likely due to image degradation from GAN-to-GAN transfer.
In figure 5(C), we show images generated by CloGAN and MeRGAN for both SVHN and E-MNIST, taken after training on all tasks. For SVHN we show all classes cumulatively learned. For E-MNIST, since there are 24 classes, we limit the display to the first two tasks as well as the last task (8th). For CloGAN, we find that images are sharp even when using small memory sizes (1% for SVHN and 0.5% for E-MNIST). This is true for early tasks as well as later tasks. In contrast, in MeRGAN earlier tasks are markedly more degraded than later ones. In E-MNIST this can be seen by an overall darkening of the letters.
We tested a variant of our model, copy-CloGAN, in which the generator is copied at each task switch. In copy-CloGAN the stochastic replay samples come from the frozen copied generator and the remaining replay is from the CloGAN episodic memory buffer. In order to properly evaluate the performance of this new model, we account for the extra memory usage by calculating the size in bytes of an extra generator: 1.6M float parameters (6.4 Mbytes). Accordingly, we create a new equiv-CloGAN with a larger episodic buffer to compensate for the duplicate generator of copy-CloGAN. We use the same calculation as in the construction of the NoGenReplay-Equivalent variant described previously, adding either 2572 RGB or 6226 gray images to equate to 6.4 Mbytes of extra memory load.
|equiv-CloGAN||82.60 (6.5%)||87.89 (10.5%)||81.01 (5.2%)||83.74 (6.2%)|
For a given episodic memory size (e.g., 1%), we compare CloGAN, copy-CloGAN, and equiv-CloGAN. Overall, the copy operation provided a small increase in performance, but only when the buffer sizes were held constant. For instance, when trained on SVHN, CloGAN-1% achieves an accuracy of 73.08% and copy-CloGAN-1% achieves 73.19%. However, when compensating for the extra memory usage via buffer augmentation, copy-CloGAN underperformed equiv-CloGAN-5.2%, with the latter yielding 81.01% accuracy, the highest among the 3 compared models. Thus, on balance, the copy operation did not surpass our approach.
In conclusion, we have shown how using very small buffers in conjunction with stochastic replay can give rise to superior performance compared to simple gradient descent, EWC or other replay methods. In our model, CloGAN, the memory buffer acts as an external regularization for the generator, counteracting image degradation through time. Our approach is relatively easy to implement and necessitates only low computation (no full retraining) and memory (small buffer), making it ideal to enable life-long learning on resource-constrained mobile (at the edge) devices.
This work was supported by the National Science Foundation (grant number CCF-1317433), C-BRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), and the Intel Corporation. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.
- [Azadi et al., 2018] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.
- [Cohen et al., 2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
- [Farquhar and Gal, 2018] Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733v1, 2018.
- [Fernando et al., 2017] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet : Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
- [Flesch et al., 2018] Timo Flesch, Jan Balaguer, Ronald Dekker, Hamed Nili, and Christopher Summerfield. Comparing continual task learning in minds and machines. Proceedings of the National Academy of Sciences, 115(44):E10313–E10322, 2018.
- [French, 1999] Robert M French. Catastrophic forgetting in connectionist networks. 6613(April):128–135, 1999.
- [Furlanello et al., 2016] Tommaso Furlanello, Jiaping Zhao, Andrew M Saxe, Laurent Itti, and Bosco S Tjan. Active long term memory networks. arXiv preprint arXiv:1606.02355, 2016.
- [Goldberg and Deb, 1991] David E Goldberg and Kalyanmoy Deb. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of genetic algorithms, volume 1, pages 69–93. Elsevier, 1991.
- [Kemker and Kanan, 2018] Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model for incremental learning. International Conference on Learning Representations, 2018.
- [Kingma et al., 2014] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, 2014.
- [Kirkpatrick et al., 2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 2017.
- [LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
- [Li and Hoiem, 2017] Zhizhong Li and Derek Hoiem. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- [Lopez-paz and Ranzato, 2017] David Lopez-paz and Marc Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 2017.
- [MacKay, 2003] David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003.
- [McCloskey and Cohen, 1989] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
- [Netzer et al., 2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
- [Nguyen et al., 2018] Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018.
- [Odena et al., 2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. Proceedings of the 34th International Conference on Machine Learning, Sydney, 2017.
- [Parisi et al., 2019] German Parisi, Ronald Kemker, Jose Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
- [Rebuffi et al., 2017] Sylvestre-alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl : Incremental classifier and representation learning. Conference on Computer Vision and Pattern Recognition, 2017.
- [Robins, 1995] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
- [Shin et al., 2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in Neural Information Processing Systems, 2017.
- [Wu et al., 2018] Chenshen Wu, Luis Herranz, Xialei Liu, Yaxing Wang, Joost van de Weijer, and Bogdan Raducanu. Memory replay gans: learning to generate images from new categories without forgetting. Advances in Neural Information Processing Systems, 2018.
- [Xiao et al., 2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- [Zenke et al., 2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual Learning Through Synaptic Intelligence. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017.