Memory Aware Synapses: Learning what (not) to forget
Humans can learn in a continuous manner. Old rarely utilized knowledge can be overwritten by new incoming information while important, frequently used knowledge is prevented from being erased. In artificial learning systems, lifelong learning so far has focused mainly on accumulating knowledge over tasks and overcoming catastrophic forgetting. In this paper, we argue that, given the limited model capacity and the unlimited new information to be learned, knowledge has to be preserved or erased selectively. Inspired by neuroplasticity, we propose an online method to compute the importance of the parameters of a neural network, based on the data that the network is actively applied to, in an unsupervised manner. After learning a task, whenever a sample is fed to the network, we accumulate an importance measure for each parameter of the network, based on how sensitive the predicted output is to a change in this parameter. When learning a new task, changes to important parameters are penalized. We show that a local version of our method is a direct application of Hebb’s rule in identifying the important connections between neurons. We test our method on a sequence of object recognition tasks and on the challenging problem of learning an embedding in a continuous manner. We show state-of-the-art performance and the ability to adapt the importance of the parameters towards what the network needs (not) to forget, which may be different for different test conditions.
The (real and digital) world around us evolves continuously. Each day, millions of images with new tags appear on social media. Every minute, hundreds of hours of video are uploaded to YouTube. This new content contains new topics and trends that may be very different from what one has seen before: think, e.g., of emerging news topics, fashion trends, social media hypes or technical evolutions. Consequently, to keep up, our learning systems should be able to evolve as well.
Yet the dominating paradigm to date, using supervised learning, ignores this issue. Traditional supervised learning learns a given task using an existing set of training examples. Once the training is finished, the trained model is frozen and we switch to test mode. From then on, new incoming data is processed without any capability of adapting or customizing the model. Soon, the model becomes outdated. In that case, the training process has to be repeated, using both the previous and new data, and with an extended set of category labels. In a world like ours, such a practice becomes impractical or even impossible when moving to real scenarios, such as those mentioned earlier, where the data is streaming, might be disappearing after a given period of time or even can’t be stored at all due to storage constraints or privacy issues.
In this setting, lifelong learning (LLL) comes as a natural solution. LLL studies continual learning across tasks and data, without storing old data. The goal is to accumulate knowledge across tasks (typically via model sharing), resulting in a single model that performs well on all the learned tasks. The question then is how to overcome catastrophic forgetting of the old knowledge when starting a new learning process using the same model. Additionally, such techniques need to be scalable and easy to setup to cope with the constraints of real environments.
Unfortunately, the current setup in which LLL methods are developed and evaluated is rather artificial: it supposes a sequence of disjoint tasks that are learned one after the other. Once the training of a task is finished, one moves on to the next task. Old tasks are never revisited, as it is assumed that the data of old tasks is no longer available. While this makes sense for the labeled data used during the initial training, the same assumption is also applied to unlabeled data, which may in fact be easily collected anytime. Instead, we would like LLL methods that are more natural and adaptive.
Given the limited capacity of a model, LLL is about learning what (not) to forget. It is unrealistic to expect a model to keep track of every single piece of information that it has learned before. Instead, one may have to give up some of the old knowledge to free capacity for new tasks. However, the crucial bits of information that are always in use should be protected from being erased by new ones.
We need a LLL method that makes the best possible compromise over all the different tasks. Actually, what (not) to forget may be different for different agents, depending on the context in which they are deployed. In other words, we would like a model that adapts to the specific conditions under which the system is active. Ideally, this adaptation uses only unlabelled data, so the model can adapt to the actual test environment and on a continuous basis.
Such adaptation and memory organization is also what we observe in biological neurosystems. Our ability to preserve what we have learned before largely depends on how frequently we make use of it. Skills that we practice often appear to be unforgettable, unlike those that we have not used for a long time. Remarkably, this flexibility and adaptation occur in the absence of any form of supervision. According to Hebbian theory, the process at the basis of this phenomenon is the strengthening of synapses connecting neurons that fire synchronously, compared to those connecting neurons with unrelated firing behavior.
In this work, we propose a new method for LLL, coined Memory Aware Synapses, or MAS for short, inspired by Hebbian learning in biological systems. Unlike previous works, our LLL method can learn what is important using unlabelled data. This allows for “personalisation” and continuous updating of importance weights – see Figure 1.
Our contributions can be summarized as follows. First, we propose a new LLL method, based on approximating the learned function rather than the loss, which avoids the need for labels when estimating importance weights. This allows adaptation to unlabeled data in the actual test environment. Second, we show how our new LLL method is linked to the Hebbian learning scheme, which can be seen as a local variant of our method. Third, we achieve better performance than the state of the art, both in the context of object recognition and in the context of fact learning (e.g. Subject-Predicate-Object triplets), where an embedding is used instead of a softmax output. (As argued in our companion paper (see supplemental material), fact learning and the use of an embedding are a more natural setup for lifelong learning, as the full model is shared between the different tasks. This is more challenging than the object recognition setup, which shares only the representation, with a specialized final classification layer for each task and (typically) an oracle to activate the right classification layer depending on the input image.)
2 Related Work
While lifelong learning has been studied for a long time in different domains (robotics and machine learning) and touches upon the broader fields of meta-learning and learning to learn, in this section we focus on more recent work, in the context of computer vision only.
The main challenge in LLL is to make the learned model adapt to new data, be it from a similar or a different environment. With tasks dealt with in a sequential manner [25, 26], the absence of the data from the previous tasks introduces the risk of catastrophic forgetting [14, 20, 13, 6, 7] of the previously acquired knowledge. To avoid this issue, two main families of approaches have been studied: data-based and model-based approaches.
Data-based approaches [12, 24, 21, 19] use data from the new task to approximate the performance of the previous tasks. This works best if the data distribution mismatch between tasks is limited. Overall, the need for a preprocessing step before each new task, to record the targets of the previous tasks, limits the applicability of data-based approaches in practical continual learning scenarios.
Model-based approaches [4, 11, 9, 31], like our method, focus on the parameters of the network instead of depending on the task data. Most similar to our work are [9, 31]. Elastic Weight Consolidation (EWC) uses an approximation of the diagonal of the Fisher information matrix to identify the parameters that are important for a task. While training a new task, a regularizer prevents those important weights from being overwritten by the new task. The Fisher information matrix needs to be computed in a separate phase after each task, and stored for each task for later use when learning a new task. It thus stores a large number of parameters that grows with the number of tasks seen. To avoid this, the Synaptic Intelligence approach adopts an online way of computing the importance of the network parameters, and was shown to work equally well as or better than EWC. During training, at each parameter update, an approximation of how much the loss changes due to the change of this parameter's value is accumulated. This method is described in more detail in Section 3. It is the first to suggest an online way of computing and accumulating the importance of the network parameters across tasks, without the need for a preprocessing step or for storing the full importance matrix of each task in the sequence. However, there are also some drawbacks: 1) Relying on the weight changes in a batch gradient descent might overestimate the importance of the weights, as noted by the authors. It is also unclear how the method responds to different learning rates per layer or to dropout in the fully connected layers. 2) The method assumes starting from a randomly initialized network. When starting from a pretrained network, as in most practical computer vision applications, some weights might be used without changing much; as a result, their importance will be underestimated. 3) The computation of the importance is done during training and fixed afterwards. In contrast, we believe the importance of the weights should adapt to the test data the system is actually applied to.
In this work, we propose a model-based method that computes the importance of the network parameters not only in an online manner, but also adaptively and in an unsupervised way, based on the data the system is tested on. The goal is to build a continual system that is able to adapt the importance of the weights to what the system actually needs to remember. Imagine an agent equipped with an image recognition module, trained on a large set of images and classes (e.g. ImageNet). For a user in a real environment, only a subset of these skills will actually be useful. When introducing a new task and using our method, the agent will protect the skills in use from being erased and be less conservative with the others.
3 Background
Before introducing our method, we briefly remind the reader of the standard LLL setup and summarize the work on Synaptic Intelligence, on which we build.
LLL setup The standard LLL setup focuses on image classification. It consists of a sequence of disjoint tasks, that are learned one after the other. Tasks may correspond to different datasets, or different splits of a dataset, without overlap in category labels between different splits. Crucial to this setup is that, when training a task, only the data related to that task is accessible. To guarantee scalability, data from older tasks cannot be stored, and models should not grow linearly with the number of tasks. Ideally, newer tasks can benefit from the representations learned by older tasks (forward transfer). Yet in practice, the biggest challenge is to avoid catastrophic forgetting of the old tasks’ knowledge. This is a far more challenging setup than joint learning, where all tasks are trained simultaneously.
Synaptic Intelligence. As indicated before, most LLL work involves a preprocessing step before each new task in a learning sequence, which limits the applicability of these approaches in real scenarios. The Intelligent Synapses approach differentiates itself by introducing an online way of computing the importance of the parameters: while training the network, the importance of a network parameter is estimated by evaluating to what extent changing its value affects the loss being minimized.
At training step $t$, a change of the parameters by an infinitesimal amount $\delta\theta(t) = \{\delta\theta_i(t)\}$ results in a change in the loss that can be approximated by:

$$\delta L \approx \sum_i g_i(t)\,\delta\theta_i(t),$$

with $g_i(t) = \frac{\partial L}{\partial \theta_i}$ the gradient of the loss with respect to parameter $\theta_i$ at step $t$. In other words, the change in the loss over one step of the training procedure can be decomposed into contributions from each of the parameters $\theta_i$. The contribution of one particular parameter $\theta_i$ to the total change in the loss, from the point when learning started up to convergence for a given task, can then be obtained by summing its contributions along the training trajectory:

$$\omega_i = \sum_t g_i(t)\,\delta\theta_i(t).$$
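To make the accumulation concrete, here is a minimal sketch of this online, per-parameter accumulation on a toy quadratic loss. This is our own illustration, not the authors' code; all names and values (the curvatures `a`, the learning rate, the accumulator `omega`) are illustrative.

```python
import numpy as np

# Synaptic-Intelligence-style online importance: while minimizing a toy
# quadratic loss L(theta) = 0.5 * sum(a * theta**2) with SGD, accumulate
# each parameter's contribution to the loss decrease along the trajectory.
a = np.array([10.0, 1.0, 0.1])       # per-parameter curvature of the loss
theta = np.array([1.0, 1.0, 1.0])
omega = np.zeros_like(theta)          # per-parameter importance accumulator
lr = 0.05
for _ in range(200):
    g = a * theta                     # gradient of the loss
    step = -lr * g                    # SGD parameter update
    omega += -g * step                # first-order contribution to loss decrease
    theta += step
# Parameters whose updates reduced the loss most accumulate the largest omega.
```

Note that, as discussed above, this first-order accumulation can overestimate importance when individual steps are large.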
This procedure allows computing the importance of the parameters in an online fashion. However, it is based on the dynamics of the learning process and suffers from the drawbacks mentioned in the related work (Section 2). What we actually need is an online method for computing and adapting the importance of the parameters while the network is actively tested on new input data.
4 Memory Aware Synapses (MAS)
In the following, we introduce our Memory Aware Synapses (MAS). Similar to the Synaptic Intelligence work described above, we compute the importance of the parameters in an online fashion. Yet instead of looking at the change in the loss during training, we look at the function learned by the network after training (see Figure 2).
We consider a deep model composed of multiple (convolutional or fully connected) layers. For the sake of clarity, we use a slightly different notation than before. The parameters of our model are the weights $\theta_{ij}$ of the connections between pairs of neurons $n_i$ and $n_j$ in two consecutive layers. Our goal is to design a method that computes an importance value $\Omega_{ij}$ for each parameter $\theta_{ij}$, indicating its importance with respect to the previous tasks.
Estimating parameter importance. In a learning sequence, we first receive a task to be learned along with its training data $(X, Y)$, with $X$ the input data and $Y$ the corresponding output data. We train the model to minimize the task loss on this data. When the training procedure converges to a local minimum, the model has learned an approximation $F$ of the true function $\bar{F}$, mapping the input $X$ to the output $Y$. This mapping $F$ is now the target that we want to preserve. Instead of measuring the sensitivity of the loss function to the network parameters, as in [9, 31], we measure how sensitive the function output is to changes in the network parameters.
For a given data point $x_k$, the output of the network is $F(x_k; \theta)$. A small change $\delta = \{\delta_{ij}\}$ in the parameters $\theta = \{\theta_{ij}\}$ results in a change in the function output that can be approximated by:

$$F(x_k; \theta + \delta) - F(x_k; \theta) \approx \sum_{i,j} g_{ij}(x_k)\,\delta_{ij},$$

where $g_{ij}(x_k) = \frac{\partial F(x_k; \theta)}{\partial \theta_{ij}}$ is the gradient of the learned function with respect to the parameter $\theta_{ij}$, evaluated at the data point $x_k$. Our goal is to preserve the prediction of the network at each data point (the learned function) and prevent changes to parameters that are crucial for this prediction.
Based on this first-order approximation, we can measure the importance of a parameter by the magnitude of the gradient $g_{ij}(x_k)$, i.e. by how much a small change to that parameter's value changes the output of the learned function. We then accumulate the gradients over the given data points to obtain the importance weight $\Omega_{ij}$ for parameter $\theta_{ij}$:

$$\Omega_{ij} = \frac{1}{N} \sum_{k=1}^{N} \left\| g_{ij}(x_k) \right\|,$$

where $N$ is the total number of data points in a given phase (when the network is active after learning a set of tasks). This equation can be updated online whenever a new data point is fed to the network.
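As an illustration of this accumulation, the following sketch (our own, not the paper's implementation) estimates importance weights for a toy linear model whose output gradient is available in closed form; the feature scales and all names are assumptions for the example.

```python
import numpy as np

# MAS-style importance for a linear model F(x; w) = w @ x (scalar output):
# the gradient of the output w.r.t. each weight w_i is simply x_i, so
# Omega_i is the average magnitude of feature i over unlabeled data points.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.1])  # feature scales

omega = np.zeros_like(w)
n_seen = 0
for x in X:                       # online accumulation, one sample at a time
    grad_output = x               # dF/dw for F = w @ x
    omega += np.abs(grad_output)
    n_seen += 1
omega /= n_seen
# Weights tied to strongly varying inputs receive the largest importance.
```

No labels are used anywhere in this loop, which is the point of the method.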
In the case of a scalar output function, the application of this equation is straightforward. When moving to a multi-dimensional output function, as is the case for a neural network, we would need to compute the gradient for each output, which would require as many backward passes as the size of the output. As an alternative, we propose to use the gradients of the squared $\ell_2$ norm of the learned function output, $g_{ij}(x_k) = \frac{\partial\,\ell_2^2(F(x_k; \theta))}{\partial \theta_{ij}}$. The importance of the parameters is then measured by the sensitivity of the squared $\ell_2$ norm of the function output to their changes.
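This shortcut can be checked on a toy linear map, where the gradient of the squared norm has a closed form; the sketch below is our own illustration, not the paper's code, and the finite-difference check is only a sanity test.

```python
import numpy as np

# Squared-l2-norm shortcut for a multi-dimensional output: for
# F(x; W) = W @ x, the gradient of ||F||_2^2 w.r.t. W is the single outer
# product 2 * F x^T, i.e. one "backward pass" regardless of the output
# dimension, instead of one pass per output unit.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))       # 4 outputs, 3 inputs
x = rng.normal(size=3)
F = W @ x

grad_norm_sq = 2.0 * np.outer(F, x)       # d||F||^2 / dW in closed form

# Finite-difference check of one entry.
eps = 1e-6
W_pert = W.copy()
W_pert[2, 1] += eps
fd = (np.sum((W_pert @ x) ** 2) - np.sum(F ** 2)) / eps
assert abs(fd - grad_norm_sq[2, 1]) < 1e-4
```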
This way, for the regions in the input space that are sampled densely, the function will be preserved and catastrophic forgetting is avoided. However, parameters not affecting those regions will be given low importance weights, and can be used to optimize the function approximation for other tasks, affecting the function over other regions of the input space.
Learning a new task. When a new task needs to be learned, we have, in addition to the new task loss $L_n(\theta)$, a regularizer that penalizes changes to parameters that are considered important for previous tasks. Similarly to other weight regularization methods [9, 31], we set

$$L(\theta) = L_n(\theta) + \lambda \sum_{i,j} \Omega_{ij}\,(\theta_{ij} - \theta_{ij}^{*})^2,$$

with $\lambda$ a hyperparameter for the regularizer and $\theta_{ij}^{*}$ the "old" network parameters, as determined after training the previous task. As such, we allow the new task to change parameters that are not important for the previous tasks (low $\Omega_{ij}$). The important parameters (high $\Omega_{ij}$) can be reused, but with a penalty when changing them.
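A minimal sketch of this regularized objective (illustrative names only, not the authors' implementation; `lam`, `omega` and `theta_star` stand for the regularizer strength, the accumulated importance weights and the old parameters):

```python
import numpy as np

# MAS-regularized objective when learning a new task:
# total loss = new-task loss + lam * sum_i Omega_i * (theta_i - theta*_i)^2
def mas_loss(theta, theta_star, omega, new_task_loss, lam=1.0):
    penalty = lam * np.sum(omega * (theta - theta_star) ** 2)
    return new_task_loss(theta) + penalty

theta_star = np.array([1.0, -2.0])        # parameters after the old task
omega = np.array([10.0, 0.0])             # only the first parameter matters
new_task = lambda th: (th[1] - 3.0) ** 2  # the new task only uses theta[1]

# Moving the unimportant parameter is free; moving the important one is not.
free = mas_loss(np.array([1.0, 3.0]), theta_star, omega, new_task)
costly = mas_loss(np.array([2.0, 3.0]), theta_star, omega, new_task)
```

Here `free` evaluates to 0.0 (only the unimportant parameter moved), while `costly` incurs the full penalty of 10.0.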
Finally, the importance weights $\Omega_{ij}$ are updated after training each task. Since we do not make use of the loss function, they can be computed on any available data. In the experimental section (Section 6), we show how this allows our method to adapt and specialize to any set of data, be it from the training or the test distribution.
5 Connection to Hebbian learning
In neuroscience, Hebbian learning theory provides an explanation for the phenomenon of synaptic plasticity. It postulates that "cells that fire together, wire together": the synapses (connections) between neurons that fire synchronously for a given input are strengthened over time, to maintain and possibly improve the corresponding outputs.
Parameter importance based on Hebb's rule. Here we reconsider this theory from the perspective of an artificial neural network after it has been trained successfully with backpropagation. (For simplicity, we focus the discussion on a classification network with a softmax output layer, but the results are more generally applicable.) When a sample is fed to the network, the predicted class corresponds to the last-layer neuron with the highest activation. The firing of this neuron is caused by neurons in previous layers that were highly activated for the given input sample. Following Hebb's rule, the parameters connecting neurons that often fire together (high activations for both, i.e. highly correlated outputs) are the important ones to preserve. In the learning sequence as illustrated above, after training a task, this can be achieved with importance weights computed as follows:

$$\Omega_{ij} = \frac{1}{N} \sum_{k=1}^{N} y_i(x_k)\, y_j(x_k),$$

with $y_i(x_k)$ the output of the activation function of neuron $n_i$ for data point $x_k$. Below we show that this application of Hebb's rule for finding the importance of the network parameters can be seen as a local version of our proposed approach.
A local version of our method. Instead of considering the function learned by the network as a whole, we can decompose it into a sequence of functions, each corresponding to one layer of the network: $F = F_L \circ F_{L-1} \circ \dots \circ F_1$, with $L$ the total number of layers. By locally preserving the output of each layer, we can preserve the global function $F$. Similar to the procedure followed previously, we consider the squared $\ell_2$ norm of each layer's output after the activation function (the local function to preserve). Following our previous derivation, an infinitesimal change $\delta = \{\delta_{ij}\}$ in the parameters $\theta_{ij}$ connecting two consecutive layers results in a change of the squared $\ell_2$ norm of the local function $F_l$ for a given input $x_k$ of:

$$\ell_2^2(F_l(x_k; \theta + \delta)) - \ell_2^2(F_l(x_k; \theta)) \approx \sum_{i,j} g_{ij}(x_k)\,\delta_{ij},$$

where $g_{ij}(x_k) = \frac{\partial\,\ell_2^2(F_l(x_k; \theta))}{\partial \theta_{ij}}$, which, in the case of a ReLU activation function, can be shown to be equal to:

$$g_{ij}(x_k) = 2\, y_i(x_k)\, y_j(x_k).$$

As above, accumulating the gradients evaluated at different data points gives a measure of the importance of the parameter $\theta_{ij}$ with respect to the local function $F_l$:

$$\Omega_{ij} = \frac{2}{N} \sum_{k=1}^{N} y_i(x_k)\, y_j(x_k),$$
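This equivalence for ReLU activations is easy to verify numerically; the sketch below (our own illustration, with hand-picked toy weights) checks the closed-form gradient against a finite-difference estimate.

```python
import numpy as np

# For a ReLU layer y = relu(W @ x), the gradient of ||y||_2^2 w.r.t. a
# weight W[j, i] equals 2 * y_j * x_i whenever y_j > 0: the Hebbian
# product of pre- and post-synaptic activations, up to the factor 2.
W = np.array([[1.0, 0.5, -0.2],
              [-1.0, 0.3, 0.8],
              [0.2, -0.4, 0.6],
              [0.5, 0.5, 0.5]])
x = np.array([1.0, 2.0, -1.0])
y = np.maximum(W @ x, 0.0)                # ReLU activations

grad = 2.0 * np.outer(y, x)               # analytic d||y||^2 / dW
hebb = np.outer(y, x)                     # Hebbian co-activation term
assert np.allclose(grad, 2.0 * hebb)

# Finite-difference check on an active unit (y_j > 0).
j = int(np.argmax(y))
eps = 1e-6
Wp = W.copy()
Wp[j, 0] += eps
fd = (np.sum(np.maximum(Wp @ x, 0.0) ** 2) - np.sum(y ** 2)) / eps
assert abs(fd - grad[j, 0]) < 1e-4
```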
which is remarkably similar to the Hebbian importance weights defined above.
We can conclude that applying Hebb's rule to measure the importance of the parameters in a neural network can be seen as a local variant of our method, one that considers only one layer at a time instead of the global function learned by the network.
Since only the relative importance weights really matter, the scale factor can be ignored.
Discussion. Both the global and the local method have the advantage of computing the importance of the parameters on any given data point, without needing access to the labels and without having to be computed while training the model. The global version needs to compute the gradients of the output function, while the local (Hebbian-based) variant can be computed locally, by multiplying the activations of the two connected neurons.
Our proposed method (both the local and the global version) equips each parameter of the network with an implicit memory; we therefore refer to it as Memory Aware Synapses, or MAS for short. The importance values keep being updated based on the activations of the network when applied to new data points, so the method can adapt and specialize to a given subset of data points rather than preserving every functionality of the network. Further, the method does not need to be in place while the network is trained: it can be applied on top of any pretrained network and compute the importance on any set of data, without needing the labels. This is an important criterion that differentiates our work from methods relying on the loss function to compute the importance of the parameters.
6 Experiments
6.1 Compared Methods
In our experiments, we compare our two methods:
- Global Memory Aware Synapses (g-MAS).
- Local Memory Aware Synapses (l-MAS).
with a baseline and another LLL method:
- Finetuning (FineTuning). After learning the first task, when receiving a new task to learn, this method uses the previous task's network as initialization and finetunes the parameters on the new task data. This baseline is expected to suffer from forgetting the old tasks while being advantageous for the new task.
- Synaptic Intelligence (Int. Synapses) (see Section 3). This method shows state-of-the-art performance and comes closest to our approach. To the best of our knowledge, it is the only LLL method from the literature that can handle the two setups considered in this paper (softmax output and embedding space) without additional tweaking.
6.2 Object Recognition
We follow the standard setup commonly used in computer vision to evaluate LLL [12, 1, 29, 11]. It consists of a sequence of supervised classification tasks, each from a particular dataset. Note that this is, arguably, a somewhat relaxed setup, as it supposes a separate classification layer for each task that remains unshared and is not changed afterwards. Moreover, an oracle is used at test time to decide on the task (i.e. which classification layer to use). Different from the literature, we also evaluate our methods, g-MAS and l-MAS, when using unlabeled test data to learn what (not) to forget.
Experimental setup. We use the AlexNet architecture pretrained on ImageNet. (We use the pretrained model available in PyTorch; note that it differs slightly from other implementations.) We consider sequences of two tasks based on three datasets: MIT Scenes for indoor scene classification (5,360 samples), Caltech-UCSD Birds for fine-grained bird classification (5,994 samples), and Oxford Flowers for fine-grained flower classification (2,040 samples). We consider: Scenes → Birds, Birds → Scenes, Flower → Scenes and Flower → Birds, as used previously in LLL [12, 1, 29]. We did not consider ImageNet as a task in the sequence, as this would require retraining the network from scratch to get the importance weights for Int. Synapses. Performance is measured in terms of classification accuracy. As a warmup phase when learning a second task, we first freeze the parameters of the network and train only the last layer until convergence; then we unfreeze all parameters and continue the learning process. Such a procedure was used before in [12, 29]. All tasks were trained with SGD for 100 epochs and a batch size of 200, using the same learning rate for the compared methods.
| Method | Computed on | Birds → Scenes | Scenes → Birds | Flower → Birds | Flower → Scenes | Mean |
|---|---|---|---|---|---|---|
| Int. Synapses | Train | 49.84 / 54.63 | 54.40 / 47.51 | 70.01 / 51.50 | 75.18 / 56.04 | 62.35 / 52.42 |

(Each cell reports accuracy (%) on the first / second task of the pair.)
Results. As shown in Table 1, FineTuning performs comparably to the other methods on the new task, but clearly falls behind on the previous task due to catastrophic forgetting. Int. Synapses reduces the forgetting of the previous task while allowing the unused parameters to adjust towards the new task. A similar performance is obtained by the Hebbian-based approach l-MAS. Our global method g-MAS best preserves the previous task performance among the competitors, while performing equally well or better on the new task.
| Method | Computed on | Birds → Scenes | Scenes → Birds | Flower → Birds | Flower → Scenes | Mean |
|---|---|---|---|---|---|---|
| l-MAS | Train and Test | 48.37 / 55.14 | 53.43 / 47.20 | 69.15 / 49.98 | 73.02 / 56.94 | 60.99 / 52.31 |
| g-MAS | Train and Test | 52.48 / 54.55 | 57.91 / 47.79 | 77.52 / 48.49 | 77.44 / 56.12 | 66.33 / 51.73 |

(Each cell reports accuracy (%) on the first / second task of the pair.)
Table 2 compares using the training vs. the test set for adapting the importance of the parameters. For both l-MAS and g-MAS, independently of the set used for computing the importance of the weights, the preservation of the previous task and the performance on the current task are quite similar. This illustrates the ability of our method to correctly estimate the parameter importance of a given task from any set of data points, without needing labels.
6.3 Fact Learning
Next, we move to a more realistic and challenging setup where all the layers of the network are shared, including the last layer. Instead of learning a classifier, we learn an embedding space. The goal is to learn facts from natural images. We suppose that the data is streaming, with a set of new facts to learn at each step (task). In between learning tasks, the agent goes through active phases where the model is applied to data from previous tasks, and this unlabeled data is used to estimate the importance weights. This is modeled by either processing the training data again, or by processing the (unlabeled) test data.
As argued in our companion paper (see suppl. material), we believe this is a natural fit to continual learning. Facts are structured into 3 units: Subject (S), Object (O) and Predicate (P). For example a fact could be: Person eating pizza. We design different experimental settings to show the ability of our method to learn what (not) to forget.
The base model. We build on a recently introduced model: a CNN based on the VGG-16 architecture, pretrained on ImageNet. The architecture is composed of 7 convolutional layers, followed by two branches, treating the Subject separately from its modifiers (Predicate and Object). Each branch consists of 6 additional convolutional layers, followed by 3 fully connected layers. Lastly, the modifiers branch forks in two, enabling the model to produce three separate and structured outputs for Subject, Predicate and Object. The loss minimizes the pairwise distance between the visual embedding and the language embedding. For the language embedding, the Word2vec representation of the fact units is used.
The 6DS dataset. We use the mid-scale dataset presented in prior work. It consists of images divided into training and test samples, each belonging to one of a set of unique facts (which can be Subjects, SP pairs, or SPO triplets). It was constructed by merging six object recognition datasets and adapting the annotations, hence the name 6DS. To study fact learning from a lifelong perspective, we divided the whole dataset into batches belonging to different groups of facts.
Experimental setup. Here, each task consists of learning a different batch of facts from the same dataset. This is slightly different from the previous setup, where tasks correspond to different datasets; we thus refer to the tasks in the following as batches. All tasks were trained for 300 epochs with the SGD optimizer, using the same learning rate for the compared methods and the same per-layer learning rates as suggested for the base model. For evaluation, we report the fact-to-image retrieval scenario: we follow the standard evaluation protocol and report the mean average precision. For each batch, we consider retrieving the images belonging to facts from this batch only. We also report the mean average precision on the whole dataset, which differs from the average of the performances achieved on the individual batches. Please refer to the supplementary material for more details.
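For reference, mean average precision over ranked retrieval lists can be sketched as follows. This is a generic illustration with toy relevance lists; the paper's exact evaluation protocol may differ in details.

```python
# Mean average precision for fact-to-image retrieval: for each query fact,
# images are ranked (e.g. by distance in the embedding space) and average
# precision is computed over the ranked 0/1 relevance flags.
def average_precision(relevance):
    """relevance: ranked list of 0/1 flags (1 = image matches the fact)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(ranked_lists):
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Example: two query facts with toy ranked relevance lists.
queries = [[1, 0, 1, 0], [0, 1, 1, 0]]
map_score = mean_average_precision(queries)  # (5/6 + 7/12) / 2 = 17/24
```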
Two-task experiments. We start by randomly splitting the facts into two groups, resulting in two batches of data that we consider as our two tasks, learned one after the other. Table 3 shows the performance on each task at the end of the sequence. For the two variants of our method, we show the performance achieved when using the training data only, or both training and test data, for estimating the importance weights. Note that we always use unlabeled data to estimate the importance weights.
It is clear that this is a much harder setup, in which finetuning suffers badly from forgetting. The LLL methods manage to control the forgetting, but their performance on the second task is lower than with finetuning. Finetuning cares only about the current task, and it is easier to achieve a better performance on a subset of facts when ignoring the rest. l-MAS achieves a performance on the first task that is only slightly lower than that obtained by Int. Synapses. g-MAS scores best on the first batch when using the training set only (a setup similar to that of Int. Synapses). For both g-MAS and l-MAS, using both sets (train & test) results in a better importance estimate, translating into better performance preservation on the first batch, with similar results on the second. Note that Int. Synapses cannot exploit unlabeled (test) data, as it relies on backpropagating the gradients of the supervised loss.
To further demonstrate that our method does not just capture general importance weights, but can really adapt, in an unsupervised fashion, to particular test conditions, we split the test set of the first batch further into two random subsets of facts. After learning the first batch, the importance of the parameters is computed using one subset only (we show results for both cases). Then the second batch is learned. Table 4 compares the performance in each case.
We can see that the forgetting on the subset that was used for estimating the importance of the parameters is less severe than on the subset that was not considered. For example, when g-MAS learns the importance of the parameters on the first subset, it preserves a higher performance on that subset than when the importance is computed on the other subset. This stands as empirical evidence of our method's ability to learn the importance of the parameters based on what the network is actively tested on.
Longer sequence of tasks.
Next, we test our method on a sequence of 4 tasks/batches drawn from the same dataset. These tasks correspond to a grouping of the different facts into disjoint concepts, as explained in our companion paper.
Table 5 presents the performance on each set of the disjoint tasks at the end of the learned sequence. Although Int. Synapses preserves some of the previous knowledge compared to finetuning, the performance of both methods decreases quite severely, down to 0.026 mean average precision for finetuning and 0.029 for Int. Synapses at the end of the sequence. In fact, this is the most challenging situation a continual learning agent can face. Our method, however, still manages to retain a reasonable amount of knowledge of the previous groups of facts, achieving an average performance of 0.171 at the end of the sequence. When looking at the scores for the different batches, we notice consistently good performance among all the subsets, showing that the method does quite a good job at learning what is important to preserve.
Adaptation Test. Finally, we test the ability of our method to learn not to forget a specific subset of a task. As explained earlier, an agent sometimes specializes and uses only specific capabilities while the others remain unused. When learning a new task, we then care more about the performance on that specific set than on the rest. For this reason, we selected a specialized subset, namely 7 facts of persons playing sports. We run our method along the 4-task sequence with the parameter importance computed only on the examples from this set. Figure 4 shows the performance achieved on this sport subset by each method at each step of the learning sequence. Joint Training (black dashed) is shown as a reference; it violates the LLL setting as it trains on all data jointly. Note that Int. Synapses can only learn importance weights during training and therefore cannot adapt to a particular subset. Our g-MAS (pink) succeeds in learning that this set is important to preserve and achieves a performance of 0.50 at the end of the sequence, while the performance of finetuning and Int. Synapses on this set was close to 0.20.
In this paper we argued that, given a limited model capacity and unlimited evolving data, it is not possible to preserve all previous knowledge. Instead, agents should learn what (not) to forget. Forgetting should relate to the rate at which a specific piece of knowledge is used. This is similar to how biological systems learn: in the absence of error signals, synapses connecting biological neurons strengthen or weaken based on the co-activation of the connected neurons. In this work, inspired by synaptic plasticity, we proposed a method that learns the importance of network parameters from the input data that the system is active on, in an unsupervised manner. We showed that a local variant of our method can be seen as an application of Hebb's rule to learning the importance of parameters. We first tested our method on a sequence of object recognition problems in a traditional LLL setting. We then moved to a more challenging test case where we learn facts from images in a continuous manner. We showed i) the ability of our method to better learn the importance of the parameters using training data, test data or both; and ii) its ability to adapt the importance of the parameters towards a frequently encountered set of data. We believe this is a step forward in developing systems that can always learn and adapt in a flexible manner.
8 Supplementary Materials
In the following, we start by explaining in more detail some of the experimental settings followed in the main paper (section 8.2). We then move in section 8.3 to analyzing some statistics of the parameter importance obtained by our proposed method MAS. In section 8.4 we compare the importance values computed by our method on different sets. Section 8.5 looks at the projections obtained for the sport subset along the 4-task learning sequence from the adaptation test in the main paper. Finally, in the last section, we explain the differences with our companion paper.
8.2 Data split visualization
8.3 Histogram of parameters importance
We have shown empirically in the main paper that our proposed method (MAS) is able to identify the important parameters and to penalize changing them when learning a new task. To further analyze how the importance values are spread among the different parameters, we plotted the histogram of the parameter importance. Ideally, a good importance measure gives very low values to unused parameters and high values to those that are crucial for the task at hand. Part (a) of figure 7 shows the histogram of the importance values of the last shared convolutional layer, computed on the training data from the first task of the two-task experiment under the fact learning setting. The histogram has a peak at a value close to zero and then goes flat. Part (b) of figure 7 shows the same histogram magnified in the area covering the 1000 most important parameters. We can see the long-tailed distribution and how the values get sparser towards higher importance. This indicates that MAS will allow changes to most of the parameters that were unused by the first task while penalizing changes to the few crucial parameters that carry meaningful information for the learned task.
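The shape described above, a peak near zero with a sparse long tail, can be reproduced with synthetic importance values. The distribution below is a hypothetical stand-in for the actual importance values, chosen only to illustrate the histogram statistics:

```python
import numpy as np

# Hypothetical long-tailed importance values: most parameters near zero,
# a few crucial ones with high importance (synthetic stand-in for figure 7).
rng = np.random.default_rng(0)
omega = rng.exponential(scale=0.1, size=10_000)

# Full histogram: the peak sits in the bin closest to zero.
counts, edges = np.histogram(omega, bins=50)
peak_bin = int(np.argmax(counts))

# Zoom into the 1000 most important parameters (the long tail).
tail = np.sort(omega)[-1000:]
```

Under such a distribution, the regularizer leaves the bulk of near-zero-importance parameters free while protecting the small high-importance tail.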
8.4 Correlation between the parameters importance computed on different sets
In the main paper, we conducted several experiments examining our method's ability to preserve the previous task's performance when computing the importance of the parameters on different sets, e.g. train, test or a subset thereof. We have shown that our method can adequately compute the importance of the parameters using either the training data or the test data in an unsupervised manner. We have also shown that the method can adapt to a subset, preserving the performance on that subset more than on the rest of the task. Here we want to shed some light on the correlation between the importance values assigned to the parameters when computed on different sets.
First, we compare the parameter importance estimated using the training data with that estimated using the test data. For this, we used a model from the object recognition experiment, namely Birds → Scenes, the results of which are shown in table 2 in the main paper. Figure 9 shows a scatter plot for the most important parameters according to the importance computed on the training data (blue). The X-axis represents the importance values computed on the training data, while the Y-axis represents those computed on the test data.
Figure 9 shows a similar scatter plot for the most important parameters according to the importance computed on the test data (red). Here, the X-axis represents the importance values computed on the test data, while the Y-axis represents those computed on the training data.
A plot where the points lie close to a straight line indicates that the two estimates assign similar importance values to the parameters. Points spread further from such a line and scattered across the plotted area indicate a lower correlation between the estimates.
The importance values computed on the test data are very similar to those computed on the training data: the points form a tight grouping around the straight line where the two values would be identical. This demonstrates our method's ability to correctly identify the important parameters in an unsupervised manner, regardless of which set is used for that purpose, as long as it covers the different classes or concepts of the task at hand.
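This train/test agreement can be quantified directly. Using the same toy linear model as before (a sketch, not the paper's network), importance maps estimated on two independent sample sets drawn from the same distribution are highly correlated:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16))  # toy stand-in for one network layer

def importance(W, X):
    # Mean |d ||W x||^2 / dW| over the samples (toy MAS-style estimate).
    return np.mean([np.abs(2.0 * np.outer(W @ x, x)) for x in X], axis=0)

# Two disjoint sample sets from the same distribution play the role of
# training and test data covering the same classes.
X_train = rng.standard_normal((1000, 16))
X_test = rng.standard_normal((1000, 16))

omega_tr = importance(W, X_train).ravel()
omega_te = importance(W, X_test).ravel()

# Pearson correlation between the two importance estimates.
r = float(np.corrcoef(omega_tr, omega_te)[0, 1])
```

A correlation close to 1 corresponds to the tight grouping around the identity line seen in the scatter plots; disjoint subsets covering different concepts would lower it.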
What about using subsets that cover only a partial set of classes or concepts from a task? In the main paper we conducted an experiment under the fact learning setting where we split the data of the first task into two disjoint groups of facts, and showed that computing the importance on one subset results in better preservation of the performance on that subset than on the other subset that was not used for computing the importance (table 4 in the main paper and figure 6 above). This suggests that the importance of the parameters differs depending on the subset used. To investigate this claim further, we plotted the importance values of the most important parameters estimated on one subset of the training data of the first task (in blue), along with the importance of the same parameters computed using the other subset. Figure 11 shows this for the last convolutional layer of the branch that learns the subjects, which are highly shared between the two subsets.
Figure 13 shows the same plot for the last convolutional layer of the other branch, which forks at the end and projects the sample into the remaining feature spaces. It is clear that the importance values of the subject branch are strongly correlated between the two subsets, while those of the forked branch differ more. This suggests that the method identifies the important parameters needed for each subset: when those parameters are shared, their importance is correlated between the two subsets; when they are not, different parameters receive different importance values depending on the subset used.
8.5 Visualizing the learned embedding on the adaptation experiment
Finally, in the main paper (section 6.3, Adaptation test paragraph), we showed that our method preserves the performance on a specific subset when it encounters this subset frequently at test time along a learning sequence. This was done by picking a subset from the first task in the 4-task fact learning sequence, composed mainly of sports facts. We showed that our method reduces the forgetting on this subset the most among the competitors, which do not have this specialization capability (figure 4 in the main paper). We were curious to see what happens in the learned embedding space, i.e. how the projections of the samples belonging to this subset change along the sequence compared to right after training the first task. For that purpose, we extracted a projection of the learned embedding after each task in the sequence. This was done for our method (MAS) when adapting to the sport subset (Adaptive) and for our method when preserving the performance on all facts of the first task (Non Adaptive). We also show the projections of the points in the embedding learned by the finetuning baseline (finetune, where no regularizer is used). As a point of reference, we show the projections of the originally learned representation after the first task (org). Figure 16 shows the projections of the different variants after learning the second task, compared to the original projections. Both the Adaptive and Non Adaptive variants of our method try to preserve the projections of this subset; on close inspection, the Adaptive projections are closer to the original ones, while the finetuning projections start drifting away from where they were. After the third task, as shown in figure 16, the Adaptive projections are closer to the original ones than those of the Non Adaptive variant, which treats this subset as part of the whole task to be preserved and tries to prevent forgetting the rest as well.
Finetuning starts destroying the learned topology of this subset, and its projections lie further apart. The fourth task, as we saw in the main paper, is quite challenging (in table 5 its performance is low compared to the other tasks): forgetting is more severe than before, and preserving the projections becomes even harder. Nevertheless, both Adaptive and Non Adaptive MAS still preserve the topology of the learned projections, with the Adaptive projections lying closer and looking more similar to the originals. Finetuning forgets this subset completely: all the samples are projected onto a single point, making it very hard to recognize their corresponding facts.
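A simple way to quantify the drift visible in these projection plots is the mean distance between corresponding sample embeddings before and after further learning. The embeddings below are hypothetical stand-ins for org, Adaptive and finetune, with drift magnitudes chosen purely for illustration:

```python
import numpy as np

def mean_drift(emb_before, emb_after):
    """Average Euclidean distance between corresponding sample embeddings."""
    return float(np.mean(np.linalg.norm(emb_after - emb_before, axis=1)))

# Hypothetical embeddings of the sport subset: original (after task 1),
# after later tasks under the Adaptive regularizer, and under finetuning.
rng = np.random.default_rng(2)
org = rng.standard_normal((50, 64))
adaptive = org + 0.05 * rng.standard_normal((50, 64))  # mild drift
finetune = org + 1.50 * rng.standard_normal((50, 64))  # severe drift
```

Under the importance penalty, the drift of the preserved subset stays small, whereas unregularized finetuning lets the embeddings move freely, matching the qualitative picture in the figures.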
-  R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  M. Andrychowicz, M. Denil, S. Gómez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3981–3989. Curran Associates, Inc., 2016.
-  M. Elhoseiny, S. Cohen, W. Chang, B. L. Price, and A. M. Elgammal. Sherlock: Scalable fact learning in images. In AAAI, pages 4016–4024, 2017.
-  C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
-  C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
-  R. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
-  I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
-  D. Hebb. The organization of behavior. 1949. New York: Wiley, 2002.
-  J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796, 2016.
-  A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
-  S.-W. Lee, J.-H. Kim, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. arXiv preprint arXiv:1703.08475, 2017.
-  Z. Li and D. Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.
-  J. L. McClelland, B. L. McNaughton, and R. C. O’reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.
-  M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation, 24:109–165, 1989.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
-  M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
-  A. Pentina and C. H. Lampert. Lifelong learning with non-iid tasks. In Advances in Neural Information Processing Systems, pages 1540–1548, 2015.
-  A. Quattoni and A. Torralba. Recognizing indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 413–420. IEEE, 2009.
-  A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars. Encoder based lifelong learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1320–1328, 2017.
-  R. Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological review, 97(2):285–308, 1990.
-  S.-A. Rebuffi, A. Kolesnikov, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. arXiv preprint arXiv:1611.07725, 2016.
-  M. B. Ring. Child: A first step towards continual learning. Machine Learning, 28(1):77–104, 1997.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  K. Shmelkov, C. Schmid, and K. Alahari. Incremental learning of object detectors without catastrophic forgetting. In The IEEE International Conference on Computer Vision (ICCV), 2017.
-  D. L. Silver and R. E. Mercer. The task rehearsal method of life-long learning: Overcoming impoverished data. In Conference of the Canadian Society for Computational Studies of Intelligence, pages 90–101. Springer, 2002.
-  D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, pages 49–55. Citeseer, 2013.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  S. Thrun and T. M. Mitchell. Lifelong robot learning. Robotics and autonomous systems, 15(1-2):25–46, 1995.
-  A. R. Triki, R. Aljundi, M. B. Blaschko, and T. Tuytelaars. Encoder based lifelong learning. arXiv preprint arXiv:1704.01920, 2017.
-  P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
-  F. Zenke, B. Poole, and S. Ganguli. Improved multitask learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning (ICML), 2017.