Overcoming Long-term Catastrophic Forgetting through Adversarial Neural Pruning and Synaptic Consolidation
Abstract
Enabling a neural network to sequentially learn multiple tasks is of great significance for expanding the applicability of neural networks in realistic human application scenarios. However, as the task sequence grows, the model quickly forgets previously learned skills; we refer to this loss of memory over long sequences as long-term catastrophic forgetting. There are two main reasons for long-term forgetting: first, as the tasks accumulate, the intersection of the low-error parameter subspaces that satisfy these tasks becomes smaller and smaller or even nonexistent; second, error accumulates in the process of protecting the knowledge of previous tasks. In this paper, we propose an adversarial mechanism in which neural pruning and synaptic consolidation are used to overcome long-term catastrophic forgetting. This mechanism distills task-related knowledge into a small number of parameters and retains the old knowledge by consolidating those parameters, while sparing most parameters to learn follow-up tasks; this not only avoids forgetting but also makes it possible to learn a large number of tasks. Specifically, neural pruning iteratively relaxes the parameter conditions of the current task to expand the common parameter subspace of tasks. The modified synaptic consolidation strategy comprises two components: a novel importance measurement that accounts for network structural information is proposed to calculate parameter importance, and an element-wise parameter updating strategy is designed to prevent significant parameters from being overridden in subsequent learning. We verified the method on image classification, and the results show that our proposed ANPSC approach outperforms state-of-the-art methods. A hyperparameter sensitivity test further demonstrates the robustness of the proposed approach.
I Introduction
Humans can learn consecutive tasks and memorize the acquired skills, such as running, biking and reading, throughout their lifetimes. This ability, namely, continual learning, is crucial to the development of artificial general intelligence [pratama2013panfis]. Existing models lack this ability mainly due to catastrophic forgetting, which means that networks forget knowledge learned from previous tasks when learning new ones [McCloskey1989Catastrophic]. To mitigate catastrophic forgetting, a straightforward approach is to retrain the model on a mixture of previous and new data; however, this is inefficient for networks with limited storage and a high model-update frequency [Li2017Learning]. Rusu et al. [Rusu2016Progressive], Fernando et al. [Fernando2017Pathnet:] and Coop et al. [coop2013ensemble] attempted to reserve task-specific structures, such as layers or modules, for single tasks. In works based on the rehearsal strategy [LopezPaz2017Gradient][Rebuffi2013icarl:][robins1993catastrophic][diaz2014incremental], previous memories are reinforced by replaying experiences. All of these methods require additional incremental network capacity for retaining previous tasks.
An ideal learning system could sequentially learn tasks without increasing the memory space or the computational cost [bargi2017adon]. Regularization-based methods satisfy these requirements. For instance, elastic parameter updating [Kirkpatrick2016Overcoming][ritter2018online] finds the joint distribution of tasks by protecting parameters with higher importance. However, this approach suffers from insufficient memory when learning long sequences of tasks, because it has difficulty finding a common parameter subspace that satisfies the requirements of all tasks, which leaves it torn between utilizing more capacity to memorize previous tasks and utilizing it to learn the current task.
One of the major challenges in long-term learning is that the size of the shared parameter subspace of previous tasks decreases as the number of tasks increases. Existing weight consolidation approaches that search for a common solution face two problems. First, the L2 distance is adopted as the overall measurement index; hence, the update of each parameter cannot be precisely controlled, which leads to a failure to protect important parameters. Second, the topological properties of the network are an important factor in knowledge representation [courbariaux2016binarized]. Several works [Zenke2017Continual][Aljundi2017Memory] have attempted to calculate the importance of parameters, e.g., via the sensitivity of parameters to tiny perturbations; however, none of them consider the topological relationship between the network structure and the parameters.
In this paper, we propose a novel method, namely, Adversarial Neural Pruning and Synaptic Consolidation (ANPSC), to overcome long-term catastrophic forgetting. We believe that the causes of this problem are the decrease in the size of the shared parameter subspace and the accumulation of error as new tasks arrive. We address the former via an online neural pruning strategy, which distills the current task into a few parameters, indirectly expands the common solution space, and frees up capacity for subsequent tasks. To tackle the latter, we design a momentum-based weight consolidation policy that protects critical parameters element by element. In addition, we claim that information on the network topology is significant, and we propose using the connectivity of the network to measure the importance of parameters. The main contributions of this paper are as follows:


We analyze the causes of long-term catastrophic forgetting in neural networks and propose a mechanism of adversarial neural pruning and synaptic consolidation to tackle it.

To precisely protect significant parameters from being destroyed, we design a weight update policy through revising the gradient step with momentum.

To account for the structural information of networks, we propose a novel measurement of parameter importance. This measure utilizes parameter connectivity to abstract the topological characteristics of a network in parameter space in a label-free manner. The experimental results demonstrate that this measure is accurate, concentrated and polarized.

We investigated a series of regularization methods for overcoming forgetting. The experimental results show that our method is superior to other mainstream methods and has strong robustness and generalization ability.
II Related Works
In this paper, we focus on methods that do not add network structure. Such methods include model pruning, knowledge distillation and regularization strategies.
II-A Model Pruning and Knowledge Distillation
Parameter pruning methods [LeCun2015Optimal][Hassibi2014Second][smith2018neural] are based on the hypothesis that some parameters have little effect on the model loss after being erased. Thus, the key strategy is to search for the parameters whose removal has minimal influence on the loss. An effective approach for narrowing the representational overlap between tasks is to reduce parameter sharing among tasks under limited network capacity. Another approach is knowledge distillation, which packs the knowledge of a complex network into a lightweight network using the teacher-student model; it has also been used to tackle catastrophic forgetting [Li2017Learning].
PackNet [Mallya2017Packnet:] sequentially compresses multiple tasks into a single model by pruning redundant parameters. The dual-memory network [kamra2017deep] partially drew on this idea to overcome catastrophic forgetting via an external network. Inspired by model compression, our method utilizes parameter connectivity to establish a soft mask rather than hard pruning based on a binary mask [courbariaux2016binarized]. It does not completely truncate the unimportant parameters but adaptively adjusts them according to subsequent tasks; it shares parameters among multiple tasks, conserves model capacity compared with hard pruning, and incurs lower performance penalties.
II-B Regularization Strategies
Various methods reduce representational overlap among tasks to overcome catastrophic forgetting via regularization, such as weight freezing and weight consolidation. Weight freezing, which was inspired by the distributed encoding of human brain neurons, tries to avoid overlaps between crucial functional modules of tasks. For instance, PathNet [Fernando2017Pathnet:] establishes a large neural network and fixes a module of the network to avoid interference from later tasks. The progressive neural network (PNN) [Rusu2016Progressive] and [sun2018concept] allocate separate networks for each task and perform multiple tasks through a progressive expansion strategy. Methods of this type fix the important parameters of a task to prevent the network from forgetting. However, they lack flexibility when facing a long sequence of tasks, and their memory footprint and computational complexity increase linearly with the number of tasks.
Weight consolidation tries to identify parameters that are important for previous tasks and penalize their updates when training new tasks. The classic method of this type [Aljundi2017Memory, Zenke2017Continual] is elastic weight consolidation (EWC) [Kirkpatrick2016Overcoming], which is inspired by the mechanism of synaptic plasticity. EWC updates parameters elastically according to parameter importance, which it measures by approximating the Fisher information matrix. Methods of this type encode more tasks with lower network capacity and lower computational complexity than PathNet and PNN. The measurement of parameter importance is crucial: most methods calculate it based on parameter sensitivity, but none of them consider the topological properties of the network.
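As background, the EWC surrogate objective described above can be sketched as follows. This is a minimal illustration of the published EWC idea, not of our method; the function name and the scalar interface are illustrative:

```python
import numpy as np

def ewc_loss(new_task_loss, w, w_star, fisher, lam=1.0):
    """EWC surrogate: the new task's loss plus a quadratic penalty that
    anchors each weight to its previous-task value w_star, weighted by
    the diagonal Fisher information (the per-weight importance)."""
    penalty = 0.5 * lam * np.sum(fisher * np.square(w - w_star))
    return new_task_loss + penalty

# A weight with high Fisher information is penalized strongly for moving.
loss = ewc_loss(1.0, np.array([1.0]), np.array([0.0]), np.array([2.0]))
```

Because the penalty is a single scalar distance, individual weights cannot be controlled precisely, which motivates the element-wise policy introduced in Section III.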
III Methods
III-A Problem Definition
Given a sequence of tasks T_1, ..., T_n defined by datasets D_1, ..., D_n, and a neural network defined by parameters W, the objective of continual learning is to sequentially learn all tasks. To overcome catastrophic forgetting, a classic approach is to find a distribution that fits the data of all tasks within the parameter space of the previous tasks (Figure 1.a), namely,
(1) 
This goal is realized by consolidating the important parameters of previous tasks. The cause of long-term catastrophic forgetting is that it is intractable to search for a solution that satisfies all tasks within the intersection of the tasks' parameter subspaces. The fundamental problems are that the shared parameter subspace is either small or nonexistent, and that the cumulative error of the weight consolidation strategy causes the solution to deviate from the low-error parameter subspace.
III-B Adversarial Solution
To alleviate long-term catastrophic forgetting, two key strategies are employed: one is to expand the overlap of the parameter subspaces of tasks (Figure 1.a, in which the region is denoted with triangles and pentagons), and the other is to protect the parameters more precisely (Figure 1.b). We believe that by approximating the solution of the current task through a subset of parameters, while keeping the approximation error low, the parameter constraints can be effectively relaxed and the parameter subspace of the current task can be expanded (Figure 1.d) relative to the original model (Figure 1.c):
(2) 
where the retained parameters form a subset of W and yield the approximate solution.
To decrease the accumulated error of parameter consolidation, we modify it in two ways. First, a novel weight-wise consolidation approach, namely, momentum updating, is designed; it revises the direction of optimization according to the importance of parameters while learning new tasks (Figure 2). Second, as previous importance measurements did not consider the structural information hidden in the parameters, a novel parameter measure is proposed that gauges the importance of a parameter according to the state of the connection between two neurons.
Neural pruning abandons as many parameters as possible that are not important to the new task, so as to enlarge the shared parameter subspace of tasks, while synaptic consolidation requires that the parameters of old tasks be protected as much as possible from being destroyed. This adversarial mechanism enables the model to compress task knowledge into a small number of highly representative parameters, while protecting a small number of parameters to balance the performance on old and new tasks.
III-C Neural Pruning
Most techniques for model pruning are conducted offline [cheng2017quantized]; hence, the pruning operation must be implemented after training, which is inflexible and time-consuming. In addition, this approach requires the reuse of previous data. In this paper, we selectively prune parameters with low salience to the output of the model during training, and we implement this in an iterative training-pruning manner. This approach implicitly distills the previous training phase into fewer parameters during the pruning phase; thus, it can also be considered an online pruning approach. The objective of pruning is defined as:
(3) 
When learning a task, we train the parameters on the given training dataset and calculate the salience of the parameters. The salience measures the influence of a parameter on the performance of the model: a higher value corresponds to a larger decrease in performance if the parameter is pruned. Then, according to the threshold β, we generate a mask over the parameters corresponding to their salience to prevent insignificant parameters from being updated. These parameters are not actually pruned but are reserved for later tasks.
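As an illustration, a minimal sketch of the masking step, assuming the mask simply freezes the lowest-salience fraction β of parameters; the exact masking rule may differ, and `salience_mask` with quantile thresholding is our own construction:

```python
import numpy as np

def salience_mask(salience, beta=0.05):
    """Boolean mask over parameters: True marks the beta fraction with
    the LOWEST salience. These are frozen for the current task and
    reserved for later tasks rather than being hard-pruned."""
    threshold = np.quantile(salience, beta)
    return salience <= threshold

# Example: with beta = 5%, roughly 5 of 100 parameters are reserved.
mask = salience_mask(np.linspace(0.0, 1.0, 100))
```

Because the mask only blocks updates (a soft mask), the reserved parameters can still be recruited by subsequent tasks.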
We utilize optimal brain surgery [LeCun2015Optimal] to measure the salience of a parameter. This approach prunes the parameters that contribute little to the loss. Given a well-trained model, we train parameters W on input X to reduce the error E, and the learned model can be expressed as a function of W and X. If we set a parameter to zero, the resulting change in the error can be expressed as δE; the larger the value of δE is, the more important the parameter is. The Taylor expansion is:
(4) \delta E = \left(\frac{\partial E}{\partial W}\right)^{\top} \delta W + \frac{1}{2}\,\delta W^{\top} H\,\delta W + O(\|\delta W\|^{3})
Here, H is the Hessian matrix of the parameters, and ∂E/∂W represents the gradient with respect to W. The gradient will be close to zero when the model converges, and the first term on the right-hand side will then be too small to yield a precise value for the error change in response to a parameter perturbation, so the second-order approximation is used instead. Therefore, an accurate value can be obtained regardless of whether the model has converged, which ensures that online pruning is effective throughout the training stage.
The calculation of the Hessian matrix is complex and computation-intensive [xu2015optimization]. In this paper, we introduce the diagonal Fisher information matrix [Pascanu2013Revisiting] to approximate the Hessian. Its main advantages are that its computational complexity is linear in the number of dimensions and that it can be computed quickly from gradients. However, the diagonalization may cause a loss of precision; better results could likely be obtained with a more accurate Hessian approximation, at the cost of a higher computational burden.
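A hedged sketch of this approximation, assuming per-sample gradients are available and pairing the Fisher diagonal with the second-order salience term from Eq. (4); the exact salience formula used in our implementation is not reproduced here:

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """per_sample_grads: (N, P) array of per-sample gradients for P
    parameters. The diagonal Fisher is the mean squared gradient,
    costing O(N*P) instead of the O(P^2) full Hessian."""
    return np.mean(np.square(per_sample_grads), axis=0)

def salience(weights, fisher_diag):
    # OBD-style second-order salience with H approximated by the Fisher diagonal.
    return 0.5 * fisher_diag * np.square(weights)

g = np.array([[1.0, 2.0], [3.0, 4.0]])  # two samples, two parameters
f = diagonal_fisher(g)                   # -> [5.0, 10.0]
```

In practice the per-sample gradients come from the same backward passes as training, so the approximation adds little overhead.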
III-D Modified Synaptic Consolidation
Momentum-based parameter updating.
To ensure that the end point of optimization does not stray far from the previous task when learning a new one, we designed a momentum-based updating policy that revises the gradient direction calculated via stochastic gradient descent. This policy is implemented as follows:
(5) 
As illustrated in Figure 2, when the optimization point moves toward a new task, which is analogous to a ball rolling up a hill, three forces are related to its movement. The gradient step of the ball is driven by the force of the target function, calculated via classical stochastic gradient descent (SGD). The memory step is driven by the force that keeps the ball from leaving the previous memory checkpoint, which ensures the stability of the learning system. The gradient decay is the resistance, whose direction is opposite to that of the actual step; it is the momentum of one parameter. Thus, we define the memory momentum as follows:
(6) 
where λ is a hyperparameter, of which a large value corresponds to strong momentum, and Ω_i is the importance of parameter w_i, which can be viewed by analogy as the frictional coefficient of the ball. A parameter with great importance should be prevented from changing further.
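A sketch of the element-wise consolidated step under this notation (λ, Ω). Since Eqs. (5)–(6) are not reproduced above, the EWC-like pull toward the previous checkpoint below is only one plausible realization of the memory step, with the gradient-decay term omitted:

```python
import numpy as np

def consolidated_step(w, grad, w_star, omega, lr=0.1, lam=1.0):
    """w: current weights; grad: SGD gradient of the new task's loss;
    w_star: checkpoint after the previous task; omega: per-weight
    importance. The memory step pulls each weight back toward its
    checkpoint in proportion to its importance, so highly important
    weights stay near their checkpoint."""
    memory_step = lam * omega * (w - w_star)
    return w - lr * (grad + memory_step)

w_star = np.array([1.0, 1.0])                 # previous-task solution
w = np.array([1.5, 1.5])                      # partway into the new task
grad = np.array([1.0, 1.0])
omega = np.array([10.0, 0.0])                 # first weight important, second free
w_new = consolidated_step(w, grad, w_star, omega)   # -> [0.9, 1.4]
```

The important weight ends up much closer to its checkpoint than the free weight, which is the intended element-wise behavior.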
Measuring the parameter importance through the connectivity of neurons.
We design a novel method for calculating parameter importance by measuring the magnitude of the change in the target function when the connectivity state of two neurons changes. This method accounts for the structural knowledge of a model related to a task. Similar to the salience of a parameter, we utilize this method to measure the influence of the connectivity of two neurons on the model. Most such measurements require labels, which limits their scope of application. To eliminate the need for labels, we use the information entropy to approximate the error, because the true distribution p and the predicted distribution q are close for a well-trained model. Thus, the parameter connectivity is defined as:
(7) 
where H(·) denotes the information entropy of the model's output distribution. The strategy is to measure the steady state of a learning system using information entropy. We explain this strategy as follows: the output distribution of the model gradually evolves from a random state into a stable state, with decreasing entropy. When the model converges, the system performs stably on the training data, with low entropy and a known output distribution. Therefore, the entropy change is an effective substitute for the loss change in measuring the steady state of a learning system.
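A label-free sketch of this idea for a single linear layer: the importance of a connection (i, j) is taken as the change in output entropy when the connection is severed. The finite-difference form is our simplification; Eq. (7) may use a closed-form expression instead:

```python
import numpy as np

def softmax_entropy(logits):
    """Mean information entropy of the model's softmax output."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def connectivity_importance(W, X, i, j):
    """Entropy change when the connection from neuron i to neuron j
    of the linear layer X @ W is cut. No labels are needed."""
    base = softmax_entropy(X @ W)
    W_cut = W.copy()
    W_cut[i, j] = 0.0
    return abs(softmax_entropy(X @ W_cut) - base)
```

Cutting a connection that carries most of the signal pushes the output distribution back toward uniform, so its measured importance is large, while cutting an inert connection changes nothing.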
Given a set of t+1 tasks, we calculate the importance Ω_{ij} of each parameter after learning task t, where i and j index the two neurons joined by the corresponding connection in the neural network.
(8) 
According to Eq. (5), the direction of the gradient decay is always opposite to that of the actual step; thus, we set negative values to zero. After learning task t+1, we sum the importance values of the previous tasks to obtain the accumulated values:
(9) 
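Following the clipping rule above, the accumulation across tasks can be sketched as follows (a direct reading of the description; `accumulate_importance` is our name):

```python
import numpy as np

def accumulate_importance(omega_acc, omega_task):
    """Clip negative per-task importance values to zero, then add them
    to the running total maintained over all previously learned tasks."""
    return omega_acc + np.maximum(omega_task, 0.0)

# Negative contributions are discarded; positive ones accumulate.
out = accumulate_importance(np.array([1.0, 1.0]), np.array([-2.0, 3.0]))
```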
We present our algorithm in Algorithm 1.
IV Experiments and Analysis
IV-A Experimental Setting
We tested the proposed method in four settings: image classification with a convolutional neural network (CNN) and a multilayer perceptron (MLP), a long sequence of incremental classification tasks, a generative task with a variational autoencoder (VAE), and a generative adversarial network (GAN).
Data: For the image classification task, permuted MNIST [Srivastava2014Compete] or split MNIST [Lee2015Overcoming] is applied to the MLP, and the Cifar10 [Krizhevsky2009Learning], NOTMNIST [Bulatov2011Notmnist], SVHN [Netzer1989Reading] and STL10 [Coates2015An] datasets, all of which consist of 32×32 RGB images, are chosen. For long-term incremental learning tasks, Cifar100 [Krizhevsky2009Learning] is used for the medium-scale network model, and Caltech101 [FeiFei2006Oneshot] is used for large-scale network models (shown in the supplement). In the generative task, CelebA [Liu2018Largescale] and anime faces crawled from the web are selected as test data; both datasets share the same resolution. For the generative adversarial network, we choose three categories of SVHN [Netzer1989Reading] as a sequence of tasks.
Baseline: We compared our method with state-of-the-art methods, including LWF [Li2017Learning], EWC [Kirkpatrick2016Overcoming], SI [Zenke2017Continual] and MAS [Aljundi2017Memory], and with classic methods, including standard SGD with a single output layer (single-headed SGD), SGD with multiple output layers, SGD with frozen intermediate layers (SGD-F), and SGD with fine-tuned intermediate layers (fine-tuning). We defined multi-task joint training with SGD (Joint) [yuan2012visual] as the baseline for evaluating the difficulty of a sequential task.
Evaluation: We utilize the average accuracy (ACC), forward transfer (FWT) and backward transfer (BWT) [LopezPaz2017Gradient] to estimate model performance: (1) ACC evaluates the average performance across tasks; (2) FWT describes the suppression of later tasks by former tasks; and (3) BWT describes the forgetting of previous tasks. Evaluating the difficulty of an individual task against the model from multi-task joint training [yuan2012visual] is more objective than testing a single-task model; therefore, we propose a modified version. Given a sequence of tasks, we evaluate the previous t tasks after training on the t-th task. Denoting the accuracy of task i tested on the model trained through task t as R_{t,i}, and the accuracy on task i under joint learning as J_i, we use three indicators:
(10) 
(11) 
(12) 
A higher value of ACC corresponds to superior overall performance, and higher values of BWT and FWT correspond to a better trade-off between memorizing previous tasks and learning new ones.
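Since Eqs. (10)–(12) are not reproduced above, the sketch below uses the standard GEM-style definitions as a reference point; our modified versions normalize against the joint-training accuracy, which would replace the reference vector b here:

```python
import numpy as np

def continual_metrics(R, b):
    """R[t, i]: test accuracy on task i after training tasks 0..t
    (a T x T matrix). b[i]: reference accuracy for task i (random
    initialization in GEM; joint-training accuracy in our variant)."""
    T = R.shape[0]
    acc = R[T - 1].mean()                                         # overall accuracy
    bwt = np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)])  # backward transfer
    fwt = np.mean([R[i - 1, i] - b[i] for i in range(1, T)])      # forward transfer
    return acc, bwt, fwt
```

Negative BWT indicates forgetting of earlier tasks, and positive FWT indicates that earlier tasks helped later ones.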
Training: All models share the same network structure with a dropout layer [Goodfellow2013An]. We initialized all MLP parameters with random Gaussian distributions that have the same mean and variance, and we applied Xavier initialization to the CNNs. We optimized the models by SGD with an initial learning rate of 0.1, 0.01 or 0.001, a decay ratio of 0.96, and a uniform batch size. We trained the models with a fixed number of epochs and global hyperparameters for all tasks, and we identified the optimal hyperparameters by greedy search. The threshold β is uniformly set to 5%.
Method  FWT(%)  BWT(%)  ACC(%) 

SGD  0.31  34.01  61.53 
SGDF  18.6  12.9  84.82 
Finetuning  0.29  13.9  82.04 
EWC [Kirkpatrick2016Overcoming]  4.99  6.43  88.75 
SI [Zenke2017Continual]  6.19  3.51  90.67 
MAS [Aljundi2017Memory]  4.38  2.08  94.09 
LWF [Li2017Learning]  4.42  2.04  94.08 
Joint [yuan2012visual]  /  /  99.87 
Ours  0.44  0.75  98.31 
IV-B Experimental Results and Analysis
Sequential learning on split MNIST and permuted MNIST by MLP
We divided the data into 5 sub-datasets and trained an MLP with 784-512-256-10 units. In Table I, we present the experimental results on split MNIST. Not all continual learning strategies perform well on all indices. Fine-tuning and SGD perform best on FWT because no free capacity is reserved for subsequent tasks, and some features may be reused to improve the learning of new tasks if the tasks are similar. LWF, MAS and SI perform well on BWT and ACC, and our method achieves the best performance on both indices among all methods except joint learning. We conclude that the jointly trained model learns general features from multiple datasets and hence implicitly benefits from data augmentation. Our results in terms of ACC and FWT rival the best single-index results. In addition, our model suffers the least forgetting in terms of BWT, with only a 1.5% reduction in ACC after learning 10 tasks. Overall, our method outperforms the other eight approaches.
Method  FWT(%)  BWT(%)  ACC(%) 

SGD  1.11  18.05  70.45 
SGDF  14.90  0.10  81.99 
Finetuning  0.75  6.21  80.69 
EWC [Kirkpatrick2016Overcoming]  0.98  2.57  91.97 
SI [Zenke2017Continual]  0.56  4.40  90.21 
MAS [Aljundi2017Memory]  1.23  1.61  92.6 
LWF [Li2017Learning]  0.67  24.02  74.15 
Joint [yuan2012visual]  /  /  95.05 
Ours  2.33  3.22  94.51 
We evaluate our method on 10 permuted MNIST tasks. In Table II, we present the results of our approach and those of others. As expected, our method performs best on FWT; it outperforms SGD, which we attribute to some lower-layer features being shared by new tasks and to sufficient capacity relieving the pressure of the capacity demand of new tasks. SGD-F obtains the highest score on BWT because its fixed parameters protect the parameters of previous tasks from being overwritten, but at the cost of reduced flexibility in learning new tasks. LWF performs worse on permuted MNIST than on split MNIST despite a satisfactory score, which may be attributed to the change in the dataset, as discussed above for FWT. Our method performs comparably in terms of ACC.
CNN & image recognition
We test our method on natural image datasets using a VGG-based network [Simonyan2014Very] with 9 layers and batch normalization to prevent gradient explosion. We train and test on MNIST, notMNIST, SVHN, STL10 and Cifar10 sequentially; the datasets have been processed into the same numbers of training images and categories (50,000 and 10, respectively). Overall, our method achieved the best performance in terms of FWT, BWT and ACC. According to Figure 3, our method realizes an FWT that is almost one-third of those of LWF and MAS. Thus, the proposed method performs well in alleviating the dilemma of memory, and the test accuracy is close to the baseline. Our method also obtains the top result on BWT; hence, it ensures that the network retains the ability to handle previous tasks. On ACC, our method realized performance comparable to multi-task joint training; hence, the network effectively trades off capacity among tasks. The result of fine-tuning is better than that of SGD; thus, using an independent classifier for each task can reduce forgetting. We speculate that this is because the features of tasks in the higher layers are highly entangled, and individual classifiers can slightly alleviate this situation.
Robustness analysis
To test the stability of our method with respect to the hyperparameter λ, we evaluate the method under various values of λ based on the above experiment. The results show that our method is robust to hyperparameter variation over a range of values. According to Figure 4, when λ is 0.01, the network is almost impervious to the resistance of previous tasks, which means that no capacity is assigned to previous memory; in this case, the values of all three indicators are extremely poor, and the proposed method behaves almost the same as SGD. When λ reaches 0.1, the proposed method achieves relatively satisfactory performance, with substantial improvements on all three indicators. When λ is in the range of 0.5 to 4, the performance is relatively stable, and the proposed method achieves its best results. As λ continues to rise, the network memorizes too much, which leaves insufficient capacity to learn new tasks; hence, the performance on new tasks becomes lower than would be realized by training from scratch.
Continual learning in VAE
To evaluate the generalization performance of our method, we apply it to a variational autoencoder (VAE). We carry out tasks on human faces and anime faces and resize the samples of the two datasets to the same size of 96×96. We set up a VAE with a conv-conv-fc encoder and an fc-deconv-deconv decoder. Then, we use a separate latent variable to train each task, which is essential because of the significant difference between the distributions of the two datasets.
We trained models via three approaches: (1) training on the CelebA dataset from scratch; (2) training on CelebA and subsequently on the anime face dataset with SGD; and (3) training on CelebA and subsequently on the anime face dataset with ANPSC. In Figure 5, we present samples of human faces produced by the three models. The results demonstrate that our approach preserves the skill of human face generation well while learning anime faces: the model with ANPSC performs as well as the model trained only on CelebA, whereas the model with SGD loses this ability. This finding shows that ANPSC generalizes well across the MLP, CNN and VAE.
Continual learning in GAN
We further apply ANPSC to a generative adversarial network [goodfellow2014generative]. We assume that the model sequentially learns several datasets and should then be able to generate images belonging to any specified dataset. To achieve this goal, we train a CGAN [mirza2014conditional] on SVHN [Netzer1989Reading], because the CGAN is equipped with a classifier and labels, which can be used to control generation according to the order of the tasks. To evaluate the performance of ANPSC in terms of long-term memory, we sequentially train a model on digits 0 through 9 and test the model on the previous 5 tasks separately. Joint training serves as the ceiling of the results, and SGD is used for comparison. Figure 6 presents the results of the CGAN with ANPSC. The model still memorizes most of the knowledge of previous tasks and generates the 5 digits well, similarly to joint training. We conclude that ANPSC performs well on the generative adversarial network and is an effective approach for alleviating catastrophic forgetting.
V Discussion
V-A Analysis of parameter connectivity
In Figure 7, we plot the distributions of parameter importance obtained by the three methods. The results show that a concentrated and polarized distribution of importance contributes to overcoming catastrophic forgetting. The left figure shows that our distribution is sharp at both low and high importance for the MLP model, and the figure on the right shows similar results for the CNN, where our method again exhibits polarization compared with the other methods. The distribution is concentrated and polarized across models and datasets. Hence, our method distills previous knowledge into fewer parameters and frees more parameters for learning new tasks.
V-B Parameter-space similarity and change analysis
We conducted six tasks with ANPSC on permuted MNIST and analyzed the experimental results in comparison with single-headed SGD and multi-headed fine-tuning:


The evolution of the overall average accuracy is shown in Figure 8(a), which indicates that our method is more stable and achieves more accurate results as the number of tasks increases;

To determine whether the model can efficiently preserve its memory of previous tasks, we utilize the Fréchet distance [frechet1906quelques] to measure the similarity of the parameter distributions between the first and the last task; see Figure 8(b). The F value of our method is far greater than those of the other two methods; hence, our method can effectively control parameter updates according to importance. The F values are greater in the deeper layers of the networks; thus, forgetting occurs mainly in deeper layers, and strengthening the protection of parameters in deep layers may tremendously help in tackling catastrophic forgetting;

In Figure 8(c), we utilized the weighted sum of squared differences between the first and the last task to measure the parameter change. The finding that parameters in deeper layers change less shows that the consolidation of the shallow layers is more flexible. In addition, the fluctuation of parameters under our method is much larger than under the other methods. Thus, our method preserves former memories while leaving the network more capacity for learning new tasks.
V-C Visualization analysis
We visualize the negative absolute values of the parameter changes (left) and compare them with the connectivity of the parameters (right). The results demonstrate that our method can prevent significant parameters from being updated and can fully utilize non-significant parameters to learn new tasks. In Figure 9, within the black dotted rectangle in the first row, parameters with warm colors change little. In contrast, parameters in the second column of the right picture change substantially because they are unimportant to previous memorization. Thus, our method precisely captures the significant parameters and prevents them from being updated, thereby preventing forgetting.
VI Conclusions and future works
Longterm catastrophic forgetting limits the application of neural networks in practice. In this paper, we analyze the causes of longterm catastrophic forgetting in neural networks: the shrinking of the shared parameter subspace of tasks and the accumulated error of weight consolidation as tasks arrive. We propose the adversarial neural pruning and synaptic consolidation approach to overcome longterm catastrophic forgetting. This approach balances the shortterm and longterm profits of a learning model through online weight pruning and revised weight consolidation. The calculation of parameter saliency is similar to that of the optimal brain surgeon [Hassibi2014Second]; however, our method frees parameters from updating and spares them for later tasks instead of discarding them. In addition, we assume that the structural knowledge of the model is significant and measure it with neuron connectivity, which provides a new perspective from which to represent network knowledge. The experimental results demonstrate several advantages of our method:


Efficiency: our approach performs competitively across a variety of datasets and tasks;

Robustness: our approach has low sensitivity to hyperparameters;

Universality: our approach can be extended to generative models.
The evidence suggests that finding an approximate solution over a sequence of tasks is effective in alleviating the dilemma of memory. Online neural pruning is not the only approach for achieving this solution; other methods such as knowledge distillation are also feasible. We conclude that the concentration and polarization properties of the parameter distribution are significant for overcoming longterm catastrophic forgetting. Protecting some parameters through a measurement based on a single strategy is not entirely effective. We suggest that wellstructured constraints for controlling parameter behavior or welldesigned patterns of parameter distributions may be crucial to the satisfactory performance of a model in overcoming forgetting. In addition, research on human brain memory is providing a potential approach for solving this problem [Hassabis2017Neuro]. The problem of overcoming catastrophic forgetting remains open.
Appendix A Incremental learning
A-A Large-scale dataset from Caltech101
To evaluate the performance of our method on a larger dataset, we randomly split the Caltech101 dataset into 4 subsets with 30, 25, 25, and 22 classes and divided each subset into training and validation sets at a ratio of 7:3. In preprocessing, we resized the images to [224,224,3], normalized the pixels into [0,1] and randomly flipped the images horizontally to augment the data. We employed ResNet18 as the basic network. Because the categories of the four subsets are not consistent, we added a separate classifier and a fully connected layer before the classifier for each task. Each new fc layer has 2048 neural units, and the dropout rate is set to 0.5. Each task is trained for 100 epochs with a batch size of 128. The initial learning rate is set to 0.001 and decays to 90% of its value every 100 epochs. To prevent overfitting, we randomly select the hyperparameter in the range from 0.5 to 30. Due to the inconsistent numbers of categories in the four subsets, we do not compare our method with SI.
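The step-decay schedule described above (initial learning rate 0.001, dropping to 90% of its value every 100 epochs) can be sketched as follows; the function name and parameters are ours, chosen only to mirror the stated setup:

```python
def decayed_lr(initial_lr, epoch, decay_every=100, factor=0.9):
    # Step decay matching the setup above: every `decay_every` epochs
    # the learning rate drops to 90% of its previous value.
    return initial_lr * factor ** (epoch // decay_every)

schedule = [decayed_lr(0.001, e) for e in (0, 100, 200)]
```

With 100 epochs per task, each task in this setup effectively completes one decay period.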
A wellfunctioning model is expected to be stable under abrupt changes of tasks. To evaluate the stability of the model on unseen tasks, we designed an indicator, namely, SMT, as follows:
(13) 
where the term denotes the variance of a single task's performance over sequential learning, which reflects the performance fluctuations of that task.
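Since Equation (13) is not reproduced in full here, the following is only a hedged illustration of the variance term it builds on: the variance of one task's accuracy as later tasks arrive, where low variance indicates stability (the function name and sample accuracies are ours):

```python
import statistics

def task_accuracy_variance(acc_history):
    # Variance of one task's accuracy across the sequential learning
    # stages; low variance means stable performance under task changes.
    return statistics.pvariance(acc_history)

stable = task_accuracy_variance([0.95, 0.94, 0.95, 0.94])
unstable = task_accuracy_variance([0.95, 0.70, 0.85, 0.60])
assert stable < unstable
```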
A-B Long sequence for CIFAR-100
As shown in Figure 11, none of the current methods performs well on largescale datasets as the number of tasks increases. On the fourth task, the ACC of our method is less than that of SGD-F; however, it outperforms EWC, MAS and LWF. In terms of SMT, when the model learns the second and third tasks, our method is outperformed by SGD and MAS. On the fourth task, our method performs better than all the remaining methods in terms of BWT and SMT; hence, our method can preserve the memory of tasks over longer sequences and has higher stability. In terms of FWT, our method outperforms all the methods except MAS. Overall, our method outperforms the stateoftheart methods that are based on regularization.
The results in Figure 12 demonstrate that it remains difficult to construct models that are capable of longterm memory, especially on complex tasks. Our method yielded overall performance similar to the other regularization methods; however, SGD-F and finetuning outperformed it when the number of learning tasks was large, and LWF almost lost its learning ability. On BWT, our method and MAS achieved better results. SGD-F performed best at preventing forgetting because its weights were completely fixed. LWF shows a higher BWT; however, this figure is meaningless due to the loss of learning ability. On FWT, our method realized the best results; thus, our method has little impact on the learning of new tasks while preserving previous knowledge.
Appendix B Sequentially generating new categories
We apply the ANPSC to generate new categories sequentially instead of learning them together with old categories. To achieve this goal, we train a CGAN [mirza2014conditional] on SVHN [Netzer1989Reading]. We sequentially train the model on digit 0, digit 1 and digit 2 and separately test the model on the 3 tasks. Figure 10 presents the results of the CGAN with ANPSC. The results show that the model generates all 3 digits well, similar to training on a single task.
Appendix C Model compression
We compress the LeNet [lecun1998gradient] that was trained on MNIST. The maximum number of epochs is set to 50, the batch size is set to 100 and the learning rate is set to 0.01. We calculate the importance of each parameter after training and prune the insignificant parameters according to an importance threshold. We conduct this procedure 5 times in sequence with various thresholds; the best values were found to be [0.8, 0.7, 0.5, 0.4, 0.1]. The experimental results in Table III show that the model compressed with ANPSC balances a high compression ratio with low accuracy loss.
Prune iters | Original model | iter 1 | iter 2 | iter 3 | iter 4 | iter 5
param W1 | 800 | 392 | 271 | 184 | 134 | 129
param b1 | 32 | 7 | 3 | 2 | 2 | 2
param W2 | 51200 | 27923 | 18582 | 14182 | 11593 | 10220
param b2 | 64 | 13 | 4 | 2 | 2 | 2
param W3 | 1605632 | 310500 | 87135 | 38756 | 21978 | 19565
param b3 | 512 | 100 | 30 | 15 | 9 | 9
param W4 | 131072 | 24075 | 6895 | 2674 | 1565 | 1400
param b4 | 256 | 51 | 16 | 8 | 5 | 5
total params | 1789568 | 363061 | 112936 | 55823 | 35288 | 31332
compression rate | / | 4.93x | 15.85x | 32.06x | 50.71x | 57.12x
prune ratio | / | 79.71% | 93.69% | 96.88% | 98.03% | 98.25%
test acc | 98.94% | 98.87% | 98.87% | 98.83% | 98.67% | 98.55%
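A single threshold-based pruning iteration of the kind described above can be sketched as follows. This is an illustrative simplification: the magnitude-based importance here is a placeholder, whereas the paper's importance measure incorporates neuron connectivity, and the function name and quantile formulation are ours:

```python
import numpy as np

def prune_step(weights, importance, threshold_quantile):
    # One pruning iteration: parameters whose importance falls below the
    # given quantile cutoff are zeroed out; the surviving mask marks the
    # parameters retained for the task.
    cutoff = np.quantile(importance, threshold_quantile)
    mask = importance >= cutoff
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
importance = np.abs(w)  # placeholder importance (paper uses connectivity)
w_pruned, mask = prune_step(w, importance, 0.8)
assert 150 <= mask.sum() <= 250  # roughly the top 20% survive
```

Repeating such a step with a schedule of thresholds, recomputing importance after each retraining round, mirrors the 5-iteration procedure reported in the table.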