# A Scalable Approach to Multi-Context Continual Learning via Lifelong Skill Encoding

###### Abstract

Continual or lifelong learning (CL) is one of the most challenging problems in machine learning. In this paradigm, a system must learn new tasks, contexts, or data without forgetting previously learned information. We present a scalable approach to multi-context continual learning (MCCL) in which we decouple how a system learns to solve new tasks (i.e., acquires skills) from how it stores them. Our approach leverages two types of artificial networks: (1) a set of reusable, task-specific networks (TN) that can be trained as needed to learn new skills, and (2) a lifelong, autoencoder network (EN) that stores all learned skills in a compact, latent space. To learn a new skill, we first train a TN using conventional backpropagation, thus placing no restrictions on the system’s ability to encode the new task. We then incorporate the newly learned skill into the latent space by first recalling previously learned skills using our EN and then retraining it on both the new and recalled skills. Our approach can efficiently store an arbitrary number of skills without compromising previously learned information because each skill is stored as a separate latent vector. Whenever a particular skill is needed, we recall the necessary weights using our EN and then load them into the corresponding TN. Experiments on the MNIST and CIFAR datasets show that we can continually learn new skills without compromising the performance of existing skills. To the best of our knowledge, we are the first to demonstrate the feasibility of encoding entire networks in order to facilitate efficient continual learning.


Blake Camp† (†Both authors contributed equally.)
Department of Computer Science
Georgia State University
Atlanta, GA 30319
bcamp2@student.gsu.com
Jaya Krishna Mandivarapu
Department of Computer Science
Georgia State University
Atlanta, GA 30319
jmandivarapu1@student.gsu.edu
Rolando Estrada
Department of Computer Science
Georgia State University
Atlanta, GA 30319
restrada1@gsu.edu

Preprint. Work in progress.

## 1 Introduction

Lifelong or continual learning (CL) is one of the most challenging problems in machine learning. In this paradigm, a system must learn new skills (i.e., solve new tasks), contexts, or types of data without forgetting previously learned information. Crucially, in CL there is no point at which a system stops learning: it must continue to update its internal representations as new inputs become available.

In contrast, the dominant paradigm in machine learning relies on two well-defined phases: a training phase (in which all the training data is available) and a deployment phase (in which the trained system is kept fixed). Not surprisingly, machine learning techniques based on this paradigm, including deep learning, have not proven suitable for CL. In particular, artificial neural networks (ANNs), whether shallow or deep, suffer from two significant ailments w.r.t. CL: catastrophic forgetting and architectural inflexibility. The first problem is the inability to remember previously learned information after learning to solve a new task [4] and is inevitable if we train an ANN by updating all of its network parameters. In particular, we explicitly overwrite previously learned features, albeit slightly, every time we backpropagate during training. The second problem concerns the fact that the optimal architectures for two different, even related problems may differ, e.g., in the numbers or types of layers, output dimensions, etc. Thus, using a single, fixed architecture will likely prove suboptimal for CL, since we do not know a priori which problems the system will need to solve throughout its lifetime.

Several approaches have emerged to address the above limitations. Notable examples include Experience Replay [5], Elastic Weight Consolidation [1], Progressive Networks [10], and, more recently, Lifelong Generative Models [7]. However, as we detail in Section 2, existing approaches suffer from two key shortcomings. In short, they either: (1) freeze the network over time, thus limiting its ability to learn new data, or (2) expand the size of the network linearly w.r.t. the number of learned tasks, which is asymptotically equivalent to instantiating a new network for every new task.

In contrast, we propose to address multi-context continual learning (MCCL) by decoupling how we learn a set of network parameters from how we store them, thus allowing us to make both steps as efficient as possible. Our framework is loosely inspired by the role that the hippocampus is purported to play in memory consolidation [11]. As noted in [6], during learning the brain first forms an initial neural representation in cortical regions; the hippocampus then consolidates this representation into a form that is optimized for storage and retrieval. These complementary biological mechanisms enable continual learning by efficiently consolidating knowledge and compressing prior experiences.

In this spirit, we propose a system that uses two types of artificial networks: (1) a set of reusable, task-specific networks (TN) that can be trained as needed to learn different skills, and (2) a lifelong, autoencoder network (EN) that stores all learned skills in a compact, latent space. In other words, after learning a new skill, we store a compact representation of the newly learned parameters in our EN (and then discard them). If we need to recall a previously learned skill, we approximate the original weights using our EN and load them into the corresponding TN. To the best of our knowledge, we are the first to demonstrate the feasibility of encoding entire networks in order to facilitate efficient continual learning.

Our approach allows us to leverage the flexibility of ANNs while avoiding their aforementioned limitations. We overcome catastrophic forgetting by separately storing a compact representation of every skill; thus, new learning does not overwrite existing features. In addition, our approach is not limited to using a single architecture; as we validate in our experiments (Section 4), our EN can simultaneously store architectures of multiple types and sizes within the same latent space. More generally, our experiments show that we can efficiently generate accurate approximations of multiple skills without explicitly saving the original models. Our approach allows task-specific networks to be recalled on demand and discarded or overwritten when not in use. It also allows us to store large numbers of skills in logarithmic or even constant space without affecting our ability to learn new skills. In addition, our system can seamlessly resume training on previously learned tasks, from the same level of performance at which it left off. Finally, our experiments show that by using a contractive autoencoder (CAE) [9]—a type of regularized, denoising autoencoder—as our EN, we can quickly assimilate new skills into the latent space while retaining a substantial amount of previously learned information. Thus, we can efficiently update our lifelong network in sub-linear time relative to the total number of learned skills.

## 2 Prior work

As noted above, there are several existing approaches that enable some degree of continual learning in deep neural networks. Experience Replay [5] can ameliorate catastrophic forgetting by re-training on batches of task-relevant data whenever the network needs to learn new tasks or contexts. However, this technique is inefficient since it requires the network to relearn how to solve the same tasks over and over. Elastic Weight Consolidation [1] is a technique for discovering the most significant weights of a network and subsequently restricting the degree to which they can change when training on some new task. However, as noted above, restricting a network’s ability to change its parameter values compromises its ability to learn new tasks. Progressive Networks [10] address catastrophic forgetting by adding additional, parallel layers to solve new tasks; unfortunately, progressive networks must grow linearly as new tasks are incorporated into the model, making them difficult to scale. Our work is perhaps most similar to Lifelong Generative Modeling [7] and Generative Knowledge Distillation [8]; both approaches use a student-teacher model, in which the current network first generates synthetic samples that match previously seen data and then trains on both the new and synthetic data. In our proposed approach, though, we generate approximations of previously learned networks, not data. Storing learned networks requires far less space than storing training data, which allows us to quickly and efficiently incorporate new skills into the system’s long-term storage.

## 3 Methodology

Our framework addresses MCCL by compactly storing learned skills using a lifelong autoencoder. As Figure 1 illustrates, our proposed system has two components: a set of reusable task-specific networks (TN) and a lifelong, skill-encoding network (EN). In our experiments, we use a contractive autoencoder [9] to implement the EN. Each TN is just a standard, deep neural network, so given a suitable architecture and enough training data, it will learn a skill or policy for the given task. Traditionally, though, to learn $T$ different tasks, one would need to train and save $T$ networks, with $N$ parameters each. Thus, the overall storage requirements are $O(TN)$. In contrast, we propose storing these networks as a set of $T$ latent vectors using our EN (for generality, our analysis assumes a logarithmic dependence on network size; in practice, if we have an upper bound on $N$, we can view our EN as using constant space, $O(T)$). Thus, our storage requirements are only $O(T \log N)$. Despite this space compression, our experiments show that we can reproduce a high-quality approximation of any learned network that can still solve the original task. Furthermore, the time needed to incorporate new skills into our CAE-based EN is sublinear in the number of skills. Figure 1 offers a high-level outline of our proposed approach.
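As a rough illustration of the EN, the contractive penalty that regularizes a CAE can be sketched as follows. This is a minimal NumPy sketch under assumptions of our own (a single sigmoid encoder layer, a linear decoder, toy dimensions, and a penalty weight `lam`); it is not the exact architecture used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cae_loss(x, W, b, W_dec, b_dec, lam=0.1):
    """Reconstruction MSE plus the contractive penalty ||J_f(x)||_F^2."""
    h = sigmoid(x @ W + b)          # encoder: latent code
    x_hat = h @ W_dec + b_dec       # linear decoder: reconstruction
    recon = np.mean((x - x_hat) ** 2)
    # The Jacobian of h w.r.t. x has entries h_j(1 - h_j) * W[i, j],
    # so its squared Frobenius norm factorizes as below.
    contractive = (h * (1 - h)) ** 2 @ (W ** 2).sum(axis=0)
    return recon + lam * contractive

# Toy dimensions: a 16-dim "flattened network" encoded into 4 latents.
n, d = 16, 4
W = rng.normal(scale=0.1, size=(n, d))
b = np.zeros(d)
W_dec = rng.normal(scale=0.1, size=(d, n))
b_dec = np.zeros(n)
x = rng.normal(size=n)

loss = cae_loss(x, W, b, W_dec, b_dec)
print(loss >= 0.0)  # the loss is a sum of non-negative terms
```

The contractive term penalizes the sensitivity of the latent code to its input, which is what makes the CAE well suited to storing nearby parameter vectors as distinct, stable codes.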

Formally, let $\{t_1, \ldots, t_T\}$ be a set of known tasks and let $\{s_1, \ldots, s_T\}$ be a set of $N$-dimensional learned skills (i.e., network parameters), resp. That is, for any task $t_i$, there is a network $s_i$ which solves $t_i$ above some predetermined performance threshold $\gamma$ (e.g., classification accuracy above 95%). Then, let $D$ be the decoder of our EN. Our goal is to train our EN to learn a set of $\log(N)$-dimensional (or, in practice, constant-dimensional, as noted above) latent representations $\{z_1, \ldots, z_T\}$ such that $D(z_i)$ solves its corresponding task $t_i$ with performance above $\gamma$, for all $i$.

Given $T$ initial skills, our EN can learn them using conventional backpropagation. Now, let $t_{T+1}$ be a new task. We will show how to learn a new skill $s_{T+1}$ and how to store a compressed representation $z_{T+1}$ while still retaining all previously learned skills.

First, we learn $s_{T+1}$ by training a TN until we achieve a suitable performance threshold. In our experiments, this step consists of conventional ANN training, i.e., backpropagation w.r.t. a large training set. We then extract and flatten the learned parameters into an $N$-dimensional vector, where $N$ is the total number of parameters in the network.
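The flattening step above can be sketched as follows (a minimal sketch with made-up parameter shapes; the helper names `flatten_params`/`unflatten_params` are our own):

```python
import numpy as np

def flatten_params(params):
    """Concatenate a list of parameter arrays into one 1-D vector."""
    return np.concatenate([p.ravel() for p in params])

def unflatten_params(vec, shapes):
    """Inverse of flatten_params, given the original shapes."""
    out, i = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(vec[i:i + n].reshape(s))
        i += n
    return out

# Toy "network": a 3x4 weight matrix, a 4-dim bias, and a 4x2 matrix.
params = [np.ones((3, 4)), np.zeros(4), np.full((4, 2), 2.0)]
vec = flatten_params(params)
print(vec.shape)  # (24,): 12 + 4 + 8 parameters

restored = unflatten_params(vec, [p.shape for p in params])
```

Recalling a skill later is the inverse operation: reshape the decoded vector back into the TN's layer shapes.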

In order to integrate the new skill into the latent space, we first recollect all previously learned skills by feeding each latent vector $z_i$ as input to the decoder of the EN. We thus generate a set $\{\hat{s}_1, \ldots, \hat{s}_T\}$ of approximations to the original skills. We then append $s_{T+1}$ to this set and retrain the EN on all $T+1$ skills until it can reconstruct all of them with suitable accuracy.
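The integration step can be sketched as a short loop. In this hypothetical sketch, `decode` and `retrain` stand in for the EN's decoder and its retraining procedure (which returns updated latent codes); the toy stand-ins below exist only to make the control flow runnable.

```python
import numpy as np

def integrate_skill(new_skill, z_store, decode, retrain):
    # 1. Recall approximations of all previously learned skills.
    recalled = [decode(z) for z in z_store]
    # 2. Retrain the EN on the recalled skills plus the new one;
    #    retrain returns one updated latent code per skill.
    return retrain(recalled + [new_skill])

# Toy stand-ins: an identity "decoder" and a "retrain" that simply
# stores each skill verbatim as its own latent vector.
decode = lambda z: z
retrain = lambda skills: list(skills)

z_store = [np.zeros(4), np.ones(4)]           # two previously learned skills
z_store = integrate_skill(np.full(4, 2.0), z_store, decode, retrain)
print(len(z_store))  # 3
```

Note that the recalled skills are approximations, not the originals: the EN rehearses its own reconstructions, which is what keeps the memory footprint from growing with the number of skills.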

As we show in our experiments, our retrained EN encodes latent representations of all skills, including the new skill $s_{T+1}$, that achieve nearly identical performance to the original parameters. Since each reconstruction $\hat{s}_i$ is simply a vector of network parameters, it can easily be loaded back into a network with the correct architecture. This allows us to discard the original skills and learn new skills without linearly increasing our system’s memory footprint.

Our proposed approach achieves continual learning because we can learn multiple consecutive tasks in a sequential manner while retaining knowledge gained from previous tasks. In contrast to several competing techniques, we place no restrictions on the task network as it updates its parameters during training. This is permissible because the parameters themselves are never explicitly saved; rather, they are encoded and approximated by the EN. As a result, it does not matter if the original learned parameters for task $t_i$ are overwritten during training on task $t_j$. Furthermore, this allows the framework to leverage the benefits of fine-tuning and knowledge transfer without worrying about overwriting previous information. For example, if we learned skill $s_i$ for task $t_i$, we can more quickly learn a skill for a related task $t_j$ by initializing the TN with the $s_i$ weights. Crucially, we then store $z_i$ and $z_j$ as separate latent vectors, so learning the new skill will have minimal effect on our ability to perform the original skill. Once training on $t_j$ has completed, the resulting network constitutes a learned skill $s_j$, i.e., parameters for $t_j$. After retraining our EN on both skills, we can approximate both $s_i$ and $s_j$ even though the original models have been discarded.

In addition to its compression benefits, our work also offers a new approach to multi-context learning. In the above example, imagine that $t_i$ and $t_j$ are two tasks on the same data. For example, they might correspond to detecting faces vs. crowds in street images. In order to distinguish between these two contexts, a traditional network would need to learn a 3-tuple comprised of the input, output, and context (i.e., the current task). However, as the number of tasks grows, we need additional bits to encode all the possible contexts, so the network’s architecture would have to change over time. Our approach, however, has no such requirement because a context-specific skill can simply be reconstructed from its encoded representation whenever needed. As noted above, by decoupling learning from storage, we can incorporate additional skills without interfering with existing knowledge.

Lastly, our framework is capable of encoding many different types and sizes of networks in efficient, continual fashion. In particular, we can encode a network of arbitrary size $N$ using a constant-size EN (that takes inputs of size $M$) by splitting the input network into $k$ subvectors, such that $kM \geq N$ (we pad with zeros whenever $N$ is not a multiple of $M$). As we verify in Section 4, we can effectively reconstruct a large network from its subvectors and still achieve a suitable performance threshold.
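The subvector trick can be sketched directly (a minimal sketch; the helper names are our own, and the sizes match the CIFAR experiment in Section 4 only for illustration):

```python
import numpy as np

def split_for_en(vec, M):
    """Split an N-parameter vector into k chunks of the EN's input
    size M, zero-padding the final chunk so that k * M >= N."""
    N = vec.size
    k = -(-N // M)                  # ceil(N / M)
    padded = np.zeros(k * M)
    padded[:N] = vec
    return padded.reshape(k, M)

def merge_from_en(chunks, N):
    """Reassemble the original vector, dropping the zero padding."""
    return chunks.ravel()[:N]

vec = np.arange(60_000, dtype=float)   # e.g., a 60K-parameter CIFAR TN
chunks = split_for_en(vec, 21_432)     # EN input size used in Section 4
print(chunks.shape)  # (3, 21432)
```

Each chunk is then encoded as its own latent vector, so a single fixed-size EN can store networks of any size at the cost of a few extra latent codes per large network.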

In the following section, we offer empirical results on the MNIST and CIFAR datasets which demonstrate the efficacy and flexibility of our approach.

## 4 Experimental Results

We carried out experiments on the MNIST [3] and CIFAR [2] datasets to validate the effectiveness of our proposed approach. In particular, we first performed a robustness analysis to establish the degree to which an approximation of a network can deviate from the original and yet remain effective (Section 4.1). In Section 4.2, we then tested our continual learning framework on the MNIST dataset by first defining classifiers for different subsets of digits (e.g., 1 vs. 2,3). We then showed that our EN can effectively encode these different classifiers. Finally, in Section 4.3 we carried out a similar analysis on the CIFAR dataset, but only using classifiers for individual classes (e.g., frog vs. not a frog). As noted above, we used a contractive autoencoder (CAE) as our EN, since this type of autoencoder is designed to be extremely sensitive to small changes in the input data [9].

### 4.1 Robustness analysis

We trained a convolutional feed-forward neural network with 21,432 parameters on the MNIST dataset for 10-digit classification. The resulting accuracy of the trained network on the test set was approximately 98%. We then added incremental amounts of Gaussian noise to the parameters and measured the resulting performance of the network. As seen in Figure 3, the experiments show a strong inverse correlation between the amount of noise introduced to the parameters and the performance of the underlying network. Further, there is a similar correlation between the accuracy and the mean-squared error of the noisy networks compared to the original, also shown in Figure 3. These results confirm that small deviations in the network parameters do not lead to immediate and catastrophic collapse in performance; rather, performance degrades gradually as more noise is introduced. Nevertheless, as the figures show, in order to maintain a high level of performance, the approximations must remain relatively close to the original.
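The perturbation procedure can be sketched as follows (a minimal sketch on a random stand-in parameter vector; the noise scales and vector size are illustrative, not the values used in the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=1000)      # stand-in for the 21,432 trained weights

mses = []
for sigma in [0.0, 0.01, 0.05, 0.1]:
    # Perturb every parameter with zero-mean Gaussian noise of scale sigma.
    noisy = params + rng.normal(scale=sigma, size=params.size)
    # MSE between the noisy network and the original; in the experiment,
    # test accuracy would be evaluated alongside this quantity.
    mses.append(np.mean((noisy - params) ** 2))

print(all(a <= b for a, b in zip(mses, mses[1:])))  # MSE grows with sigma
```

In the actual experiment, each noisy parameter vector is loaded back into the TN and evaluated on the MNIST test set, giving the accuracy-vs-noise curves in Figure 3.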

### 4.2 MNIST experiments

We trained up to 333 separate networks on different MNIST tasks using the same TN architecture as in our robustness analysis (Section 4.1). We defined a task as binary digit classification; here, a task is specified by a tuple that lists the positive and negative digit class(es), e.g., (pos={1}, neg={2,3}). Importantly, we distinguish tuples as separate tasks if either the positive or negative class differs, e.g., ({1}, {2,3}) vs. ({1}, {3,4}). This scheme allows us to define an exponential number of tasks over a single dataset. Training and test sets consisted of approximately 40% positive examples and 60% negative examples because most of our tuples contained more negative than positive targets.

#### 4.2.1 Batch learning

We first verified that our EN could encode a set of learned networks using conventional, batch learning. Figures 4-7 show the performance of the approximations learned by the EN on the MNIST test set, as a function of how long we train the autoencoder. In these figures, all tasks were trained as a batch. We scaled the number of encoded networks from 2 to 333 tasks to verify that our reconstructions achieve suitable performance even as the number of encoded tasks grows. For clarity, Figure 7 shows only the mean and standard deviation of the accuracies achieved by the 333 reconstructed networks; the other three figures show the actual performances of all reconstructed skills. Although the performance of some individual reconstructed networks sometimes lagged w.r.t. the original weights, our results show that, even for hundreds of skills, the mean accuracy is quite high, nearly matching the mean accuracy of the original skills.

#### 4.2.2 Continual learning

We then tested our approach on continual learning. As detailed in Section 3, we iteratively added one new learned skill at a time and retrained the EN on both the new skill and its recollections of previously learned skills. In this experiment, we used single-digit classifiers (e.g., 1 vs. not 1) and added these trained networks in numerical order, i.e., first the classifier for 0’s, then the one for 1’s, etc. As Figures 9-11 show, our EN is able to iteratively incorporate new skills into its latent space without degrading the performance of existing skills, thus demonstrating substantial benefits in the realm of continual learning. Furthermore, as the latent space grows, the EN needs fewer iterations to relearn a set of skills, suggesting that it can leverage transfer learning to more effectively encode new skills into its overall representation.

To verify the above observation, we quantified the number of training steps needed for all the encoded skills to achieve their performance threshold, as a function of the total number of skills. As seen in Figure 13, there appears to be an inverse relationship between the total number of skills, trained in sequential fashion, and the number of training iterations required for the reconstructed networks to achieve performances comparable to their original counterparts. To further confirm this idea, we also measured the precise number of updates, or calls to the backpropagation function, required to reach these performance benchmarks (shown in Figure 13). Our analysis suggests that there is no direct relationship between the total number of skills and the number of updates required to learn them. Intuitively, this implies that as the number of skills grows, the time required to integrate those skills into the universal latent space should remain relatively constant, making this approach particularly well suited to continual learning.

Finally, and perhaps most importantly, these empirical results suggest that the EN is able to retain a large amount of previously learned information. This means that the previously learned latent representations need only change slightly in order to accommodate the integration of a new skill. Further, as the number of skills grows, the performance of most previously learned skills remains extremely high throughout the retraining process. Figures 9-11 clearly show that as the number of networks grows from 2, to 4, to 6, to 10, it becomes easier and easier for the EN to retain the ability to accurately reconstruct the previously learned networks. Consequently, if an earlier skill were immediately needed, it would not be necessary to wait until we fully integrate the new skill before generating a recollection of the needed skill. Instead, the EN can generate old skills during retraining because the performance of the reconstructed skills can be expected to remain high.

### 4.3 CIFAR experiments

We then verified that our proposed approach can reconstruct larger, more sophisticated networks. Similarly to the MNIST experiments above, we divided the CIFAR dataset into multiple training and test sets, and proceeded to train separate task-specific networks, one per class. Here, we used TNs with over 60,000 parameters that achieved accuracies ranging from 78% to 84%. We encoded these larger networks using the same EN as in the MNIST experiments, with an input size of 21,432, by splitting the 60K parameter vectors into three subvectors. As noted in Section 3, by splitting a larger input vector into smaller subvectors, we can encode networks of arbitrary sizes. As seen in Figure 15, the accuracies of the reconstructed CIFAR networks also approached the performances of their original counterparts.

Splitting larger networks into smaller sub-vectors allows us to use a smaller autoencoder, which can be trained in substantially less time than a larger one. Figure 15 compares the training rate of an EN with 20,000 input units, trained to reconstruct 3 sub-vectors of length 20,000, to that of a larger EN with 61,000 input units trained on the original 60K network. Clearly, feeding a large skill to a smaller autoencoder as multiple inputs enables us to encode it more quickly.

Finally, we demonstrated that the same EN can be used to encode trained networks of different sizes and architectures. Figure 16 shows that the same EN can simultaneously reconstruct 5 MNIST networks and 1 CIFAR network so that all approach their original baseline accuracies.

## 5 Conclusions and future work

In this paper, we introduced a scalable approach for multi-context continual learning in which we decouple how we learn a set of parameters from how we store them in memory. Our proposed framework makes use of state-of-the-art autoencoders to facilitate lifelong learning and overcomes many of its associated challenges, including catastrophic forgetting and architecture inflexibility. Our empirical results confirm that our method can efficiently learn new skills in continual fashion, without affecting the performance of previously learned skills. Overall, our framework has notable advantages over competing approaches with respect to training efficiency, storage scalability, and architecture flexibility. We believe it has the potential to substantially contribute towards enabling efficient, lifelong learning. In future work, we aim to further improve the efficiency with which the encoding network can sequentially encode vast amounts of skills. Furthermore, we will explore how to use the latent space to extrapolate new skills based on existing skills, i.e., with little or no training data. Promising approaches include clustering the latent representations into sets of closely related skills and using sparse latent representations.

## References

- [1] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. ArXiv e-prints, December 2016.
- [2] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- [3] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
- [4] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pages 109 – 165. Academic Press, 1989.
- [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. ArXiv e-prints, December 2013.
- [6] Alison R. Preston and Howard Eichenbaum. Interplay of hippocampus and prefrontal cortex in memory. Current Biology, 23(17):R764 – R773, 2013.
- [7] J. Ramapuram, M. Gregorova, and A. Kalousis. Lifelong Generative Modeling. ArXiv e-prints, May 2017.
- [8] M. Riemer, M. Franceschini, D. Bouneffouf, and T. Klinger. Generative Knowledge Distillation for General Purpose Function Compression. ArXiv e-prints, 2017.
- [9] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 833–840, USA, 2011. Omnipress.
- [10] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive Neural Networks. ArXiv e-prints, June 2016.
- [11] Timothy J. Teyler and Pascal DiScenna. The hippocampal memory indexing theory. Behavioral Neuroscience, 100(2):147–154, 1986.