# Continual Learning Using Bayesian Neural Networks

###### Abstract

Continual learning models learn and adapt to new changes and tasks over time. However, in continual and sequential learning scenarios in which the models are trained on different data with various distributions, neural networks tend to forget previously learned knowledge. This phenomenon is often referred to as catastrophic forgetting, and it is an inevitable problem for continual learning models in dynamic environments. To address this issue, we propose a method called Continual Bayesian Learning Networks (CBLN), which enables the networks to allocate additional resources to adapt to new tasks without forgetting the previously learned tasks. Using a Bayesian Neural Network, CBLN maintains a mixture of Gaussian posterior distributions that are associated with different tasks. The proposed method tries to optimise the number of resources that are needed to learn each task and avoids an exponential increase in the number of resources that are involved in learning multiple tasks. The proposed method does not need to access the past training data and, at test time, automatically chooses suitable weights to classify the data points based on an uncertainty criterion. We have evaluated our method on the MNIST and UCR time-series datasets. The evaluation results show that our method can address the catastrophic forgetting problem at a promising rate compared to the state-of-the-art models.


## 1 Introduction

Deep learning models provide an effective end-to-end learning approach in a variety of fields. One common way for deep neural networks to solve a complex task such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) deng2009imagenet is to increase the depth of the network he2016deep; lin2013network. However, as the depth increases, it becomes harder for the model to converge during training. On the other hand, a shallower network is not able to solve a complex classification task at once, but it may be able to find a solution for a smaller set of classes and converge much faster. If a model can continually learn several tasks, then it can solve a complex task by dividing it into several simple tasks. In continual learning, the model repeatedly receives new data and the training data is not complete at any given time. Re-training the entire model whenever new instances arrive would be very inefficient and would require storing the previously used training samples. The key challenge in such continual learning scenarios in changing environments is how to incrementally and continually learn new tasks without forgetting the previous ones or creating highly complex models that may require access to the entire training data.

Most of the common deep learning models are not capable of adapting to different tasks without forgetting what they have learned in the past. These models are often trained via back-propagation, where the weights are updated based on a global error function. Re-training an already trained model on a different task leads to the loss of the previously learned knowledge, as the network is not able to maintain the important weights for various distributions. The attempt to sequentially or continuously learn and adapt to various distributions will eventually result in a model collapse. This phenomenon is referred to as catastrophic forgetting or interference mccloskey1989catastrophic; goodfellow2013empirical. The catastrophic forgetting problem makes the model inflexible. Furthermore, the need for a complete set of training samples during the learning process is very different from biological systems, which can incrementally learn and acquire new knowledge without forgetting what was learned in the past.

Terminology: In this paper, the term task refers to the overall function of a model; e.g. classification, clustering and outlier detection. A task has an input distribution and an output distribution. A dataset is used to train and evaluate a model for a task. A dataset follows a certain distribution. The distribution of a dataset that is used to train a specific task can change over time. We can train a model with different tasks. Each task can be trained on its own individual dataset. In other words, each task can be trained based on different input and output distribution.

To address the catastrophic forgetting problem, there are mainly three approaches parisi2019continual:

Regularisation Approaches: Regularisation-based approaches re-train the model while trading off the learned knowledge against new knowledge. Kirkpatrick et al. kirkpatrick2017overcoming propose Elastic Weight Consolidation (EWC), which uses sequential Bayesian inference to approximate the posterior distribution by taking the learned parameters as prior knowledge. EWC identifies the parameters that are important to the learned tasks according to the Fisher Information and mitigates their changes by adding quadratic terms to the loss function. Similarly, Zenke et al. zenke2017continual penalise the parameters unequally in the objective function, defining a set of influential parameters by using the information obtained from the gradients of the model. The idea of using a quadratic form to approximate the posterior function is also used in Incremental Moment Matching (IMM) lee2017overcoming. IMM uses three transfer techniques, weight-transfer, L2-transfer and drop-transfer, to smooth the loss surface between the different tasks. Recently, variational inference has drawn attention as a way of solving the continual learning problem nguyen2017variational. The core idea of this method is to approximate the intractable true posterior distribution by variational learning.

Memory Replay: The core idea of memory replay is to interleave the new training data with the previously learned samples. Recent developments in this direction reduce the memory needed for the old knowledge by leveraging a pseudo-rehearsal technique robins1995catastrophic. Instead of explicitly storing all the training samples, the pseudo-rehearsal technique draws the training samples of the old knowledge from a probabilistic distribution model. Shin et al. shin2017continual propose an architecture consisting of a deep generative model and a task solver. Similarly, Kamra et al. kamra2017deep use a variational autoencoder to regenerate the previously trained samples.

Dynamic Networks: Dynamic Networks allocate new neuron resources to learn new tasks. For example, ensemble methods build a network for each task. As a result, the number of models grows linearly with respect to the number of tasks wozniak2014survey; polikar2001learn++; Dai. This is not always a desirable solution because of its high resource demand and complexity kemker2017measuring. One of the key issues in the dynamic methods is that whenever there is a new task, new neuron resources will be created without considering the possibility of generating redundant resources. In yoon2017lifelong, the exponential parameter and resource increases are avoided by selecting part of the existing neurons for training new tasks. However, during the test process, the model has to be aware of which test task is targeted to choose the appropriate parameters to perform the desired task rusu2016progressive.

In this paper, we propose a Continual Bayesian Learning Network (CBLN) to address the forgetting problem and to allow the model to adapt to new distributions and learn new tasks. CBLN trains an entirely new model for each task and merges them into a master model. The master model finds the similarities and distinctions among these sub-models. Similar parameters are merged to produce a general representation, while distinctive parameters are retained without merging. CBLN is based on Bayesian Neural Networks (BNNs) blundell2015weight, see Figure 1. Following BNNs, we assume that the weights in our model have a Gaussian distribution with a diagonal covariance matrix, and that the distributions of the weights in different tasks are independent of each other. Based on these assumptions, the combined posterior distribution of all the training tasks is a mixture of Gaussian distributions. We then use an Expectation-Maximisation (EM) moon1996expectation algorithm to approximate the posterior mixture distributions and remove the components that are redundant or less significant. The final distribution of the weights can be a Gaussian mixture distribution with an arbitrary number of components. At the test stage, we produce an epistemic uncertainty kendall2017uncertainties measure for each set of components. The set which has minimal uncertainty will be used to give the final prediction.

## 2 Continual Bayesian Learning Networks (CBLN)

### 2.1 Training Process

The training process in CBLN is similar to BNNs. At the beginning of the training for each task, we initialise all the training parameters and train the model. However, at the end of the training for each task, we store the solution for the current task. We used the loss function shown in Equation (1):

(1) |

Where refers to the training parameters, is the Monte Carlo sample hastings1970monte drawn from the variational posterior , is the training data, is the influence of prior knowledge. We attempt to obtain weight parameters that have a similar Gaussian distribution, which is close to the prior knowledge. After training tasks, we can obtain sets of parameters that construct the posterior mixture Gaussian distribution in which each component is associated with a different task.
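As an illustration of this kind of objective, the following is a minimal NumPy sketch of a Monte Carlo estimate of a Bayes-by-Backprop style loss in the spirit of blundell2015weight. It is not the authors' implementation: the names `mc_variational_loss`, `neg_log_lik`, `prior_std` and `prior_weight` are hypothetical, the softplus parameterisation of the standard deviation is an assumption, and the way the prior term is scaled is only one plausible reading of "the influence of prior knowledge".

```python
import numpy as np

def log_normal(x, mean, std):
    """Elementwise log density of a Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def mc_variational_loss(mu, rho, neg_log_lik, prior_std=1.0, prior_weight=1.0,
                        n_samples=5, rng=None):
    """Monte Carlo estimate of a Bayes-by-Backprop style loss (illustrative).

    mu, rho      : variational parameters; std = log(1 + exp(rho)) keeps std > 0.
    neg_log_lik  : hypothetical callable mapping sampled weights to -log P(D | w).
    prior_weight : assumed scalar that scales the prior (complexity) term.
    """
    rng = np.random.default_rng() if rng is None else rng
    std = np.log1p(np.exp(rho))
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        w = mu + std * eps                            # sample from q(w | theta)
        log_q = log_normal(w, mu, std).sum()          # log q(w | theta)
        log_p = log_normal(w, 0.0, prior_std).sum()   # zero-mean Gaussian prior
        total += prior_weight * (log_q - log_p) + neg_log_lik(w)
    return total / n_samples
```

When the variational posterior coincides with the prior, the complexity term vanishes for every sample and only the data term remains, which is a convenient sanity check for a sketch like this.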

### 2.2 Merging Process

The merging process in this method is used to reduce the components in the posterior mixture distribution. We approximate the posterior mixture distribution with an arbitrary number of Gaussian distributions, see Equation (2), where is the number of tasks, is the number of components in the final posterior mixture distribution, is the posterior mixture distribution with the component associated with task, and are the weight parameters where . In the extreme case, when , this process can be interpreted as a special case of IMM which merges several models into a single one. When , this process can be interpreted as a special case of ensemble methods since there are set of parameters without being merged.

(2) |

To obtain the final posterior distribution and restrict the sudden increase in the number of parameters, we approximate the by using a Gaussian Mixture Model (GMM) reynolds2015gaussian with EM algorithm to get . We then remove the redundant distributions in .

The EM algorithm contains an Estimation step (E-step) and a Maximisation step (M-step). For each weight, we first sample data points from the posterior mixture distribution and initialise a GMM model with components. Then, the E-step estimates the probability of each data point generated from each random Gaussian distribution, see Equation (3). For the data point , we assume that it is generated from the Gaussian distribution and calculate the probability . We can obtain a matrix of membership weights after applying Equation (3) to each data point and determine the mixture of Gaussian distributions. The M-step modifies the parameters of these random Gaussian distributions by maximising the likelihood according to the weights generated from the first step; see Equation (4).

(3) |

(4) |
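For reference, the textbook EM updates for a one-dimensional Gaussian mixture can be written as follows. The notation is an assumption made for this illustration and may differ from the paper's: $x_i$ is the $i$-th of $N$ sampled points, $\pi_k$, $\mu_k$ and $\sigma_k^2$ are the mixing coefficient, mean and variance of component $k$, and $\gamma_{ik}$ is the membership weight of point $i$ under component $k$.

$$
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}
$$

$$
N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad
\pi_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i, \qquad
\sigma_k^2 = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)^2
$$

The first expression is the E-step and the second line is the M-step; the two are alternated until the log-likelihood stops improving.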

After the algorithm is converged, we can obtain an approximated posterior mixture distribution , where . We then remove , if is smaller than a threshold which is set to . These distributions can be regarded as redundant components which overfit the model. Since the EM algorithm clusters similar data points into one cluster, we can merge the distributions if they are similar to each other and get the final posterior mixture distribution . We use the trained GMM to cluster the mean value of each component in . If the mean values of two distributions are in the same cluster, these two distributions are merged into a single Gaussian distribution.
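The sample, fit, prune and merge steps described above can be sketched for a single network weight as follows. This is a minimal NumPy illustration under stated assumptions rather than the authors' code: `fit_gmm_1d` and `merge_weight` are hypothetical names, the number of mixture components is assumed to equal the number of tasks, and the `threshold` and `n_draws` defaults are placeholders, not values from the paper.

```python
import numpy as np

def fit_gmm_1d(x, n_components, n_iter=200, tol=1e-8):
    """Plain EM for a one-dimensional Gaussian mixture (illustrative, not optimised)."""
    # Deterministic initialisation: means at evenly spaced quantiles of the data.
    means = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    variances = np.full(n_components, x.var() + 1e-6)
    weights = np.full(n_components, 1.0 / n_components)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: membership weight of every sample under every component.
        log_pdf = (-0.5 * np.log(2 * np.pi * variances)
                   - 0.5 * (x[:, None] - means) ** 2 / variances)
        log_joint = np.log(weights) + log_pdf
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
        # M-step: re-estimate mixing coefficients, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
        ll = log_norm.sum()
        if abs(ll - prev) < tol:
            break
        prev = ll
    return weights, means, variances

def merge_weight(task_means, task_stds, threshold=0.05, n_draws=500, seed=0):
    """Return one group label per task for a single network weight.

    Tasks that share a label would be merged into one Gaussian.
    `threshold` and `n_draws` are assumed values for illustration only.
    """
    rng = np.random.default_rng(seed)
    task_means = np.asarray(task_means, dtype=float)
    task_stds = np.asarray(task_stds, dtype=float)
    # Sample from the equally weighted mixture of per-task posteriors.
    x = np.concatenate([rng.normal(m, s, n_draws)
                        for m, s in zip(task_means, task_stds)])
    weights, means, variances = fit_gmm_1d(x, n_components=len(task_means))
    keep = weights >= threshold          # drop near-empty (redundant) components
    weights, means, variances = weights[keep], means[keep], variances[keep]
    # Assign each task mean to its most probable surviving component.
    log_post = (np.log(weights) - 0.5 * np.log(variances)
                - 0.5 * (task_means[:, None] - means) ** 2 / variances)
    return log_post.argmax(axis=1)
```

In this sketch, two tasks whose posterior means land in the same cluster receive the same label and would share one Gaussian, while a task with a clearly different posterior keeps its own component.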

### 2.3 Testing Process

After the training process, we obtain sets of parameters to construct the mixture posterior distribution with an arbitrary number of components. To identify which set of parameters can give a correct prediction for the test task, for each set of parameters we draw several Monte Carlo samples of the weights from the variational posterior to classify the test data. We then calculate the variance of the predictive scores. The set of parameters which has minimal uncertainty is chosen to give the final prediction. We use the epistemic uncertainty for this purpose. There are also other uncertainty measures, such as the entropy of the predictive scores renyi1961measures or Model Uncertainty as measured by Mutual Information (MUMMI) rawat2017adversarial, see Equation (5). The trade-off between these uncertainty measures is discussed in Section 4.

(5) |

Where is the predicted distribution, is the variational posterior distribution, and is the test input.
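The selection rule of the testing process can be sketched as below, assuming the predictive probabilities for each parameter set have already been collected into an array of shape `(S, N, C)` (Monte Carlo samples, test points, classes). The function names and the way the scores are aggregated (summing over classes, averaging over test points) are assumptions made for this illustration; the mutual information is computed in its common form, the entropy of the mean prediction minus the mean entropy of the individual predictions.

```python
import numpy as np

def uncertainty_scores(probs, eps=1e-12):
    """probs has shape (S, N, C): S Monte Carlo samples, N test points, C classes."""
    mean_probs = probs.mean(axis=0)                          # (N, C)
    # Variance of the predictive scores across Monte Carlo samples.
    variance = probs.var(axis=0).sum(axis=1).mean()
    # Entropy of the averaged prediction.
    entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)
    # Average entropy of the individual Monte Carlo predictions.
    expected_entropy = -(probs * np.log(probs + eps)).sum(axis=2).mean(axis=0)
    mutual_info = (entropy - expected_entropy).mean()
    return {"variance": float(variance),
            "entropy": float(entropy.mean()),
            "mutual_information": float(mutual_info)}

def select_parameter_set(probs_per_set, measure="variance"):
    """Return the index of the parameter set with the lowest uncertainty."""
    scores = [uncertainty_scores(p)[measure] for p in probs_per_set]
    return int(np.argmin(scores))
```

A parameter set whose Monte Carlo samples agree with each other yields a low variance and low mutual information, so it is preferred over a set whose samples disagree.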

## 3 Experiments

We evaluated our method on the MNIST lecun2010mnist image dataset and the UCR Two Patterns time-series dataset UCRArchive. MNIST and Two Patterns contain 60000 and 1000 training samples, 10000 and 4000 test samples, and 10 and 4 classes, respectively.

In our experiments, we do not re-access the samples after the first training but let the model know that it needs to train for a new task. However, in contrast to the existing works, we do not tell the model which task is being tested. Furthermore, the output nodes correspond to the number of classes that the task is trained for. The overlap between the output classes, which are trained at different times, is also taken into consideration. This means that at the time of the training for each task, we do not know which other tasks the new samples could also be associated with. The settings in our experiments are similar to lee2017overcoming, which are stricter than the settings in other existing works. For example, in contrast to our experiments, other existing experiments are allowed to re-access the training samples shin2017continual, tell the model which task the test data comes from rusu2016progressive, or use different classifiers for different tasks nguyen2017variational. In CBLN, we randomly choose 200 test samples from the test task, draw 200 Monte Carlo samples from the posterior distribution, and measure the uncertainty to decide which parameters should be used in the model for each particular task.

We compare our model with state-of-the-art methods including Neural Networks (NN) haykin1994neural, Incremental Moment Matching (IMM) lee2017overcoming and Synaptic Intelligence (SI) zenke2017continual. For the IMM model, we run all the IMM algorithms combined with all the transfer techniques mentioned in lee2017overcoming. We also search for the best hyper-parameters and choose the best accuracy according to lee2017overcoming. For the SI model, we search for the best hyper-parameters as well. The original SI paper uses a Multiple-Head (MH) approach, which divides the output layer into several sections. A different section is activated for each task, which also avoids the overlap between different classes in different tasks. The MH approach requires the model to be told about the test tasks. We evaluate SI both with and without the MH approach. In CBLN, we search for the best model that can distinguish the test data.

### 3.1 Split MNIST

The first experiment is based on the split MNIST. This experiment evaluates the ability of the model to continually learn new tasks. In this experiment, we split the MNIST dataset into several sub-sets; e.g. when the number of tasks is one, the networks are trained on the original MNIST at once; when the number of tasks is two, the network is trained on the digits 0 to 4 and then 5 to 9 sequentially, etc. To implement the other methods, we follow the optimal architecture described in the original papers. In IMM, we use two hidden layers with 800 neurons each. In SI and NN, we use two hidden layers with 250 neurons each. In CBLN, we use two hidden layers with only 10 neurons each. To evaluate the performance, we compute the average test accuracy over all the tasks.
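A split of this kind can be sketched as follows. The helper names are hypothetical, and the use of contiguous, near-equal class groups is an assumption: the text only specifies the two-task case (digits 0 to 4 and 5 to 9), so the grouping for other task counts may differ from the one used in the experiments.

```python
import numpy as np

def split_classes(n_classes, n_tasks):
    """Partition class labels 0..n_classes-1 into contiguous, near-equal groups."""
    return [chunk.tolist()
            for chunk in np.array_split(np.arange(n_classes), n_tasks)]

def task_subset(images, labels, classes):
    """Select the samples whose label belongs to the given class group."""
    mask = np.isin(labels, classes)
    return images[mask], labels[mask]
```

With ten classes and two tasks this reproduces the 0 to 4 and 5 to 9 split mentioned above; each task's training set is then obtained by calling `task_subset` with one class group.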

As shown in Figure 1(a), the average test accuracy over all the tasks in CBLN keeps increasing, while the performance of the other methods decreases over time. As long as we divide the MNIST model training into several simpler tasks, the performance of CBLN keeps increasing, since CBLN can learn the new tasks without forgetting the previously learned ones. The accuracy after training five tasks sequentially reaches the performance of SI with the MH approach. As shown in Figure 1(a), the method using the MH approach avoids the interference between the tasks with different classes at the output layer (by interference we mean the situation in which learning a new task changes the parameters in a way that makes the model forget the previously learned ones). However, the MH approach requires telling the model which task the data refers to in both the training and test processes. The grey line in Figure 1(a) represents the accuracy of training a BNN with the original MNIST. This BNN contains two hidden layers with 25 neurons in each layer; hence, the total number of parameters is 41070 (the number of parameters in BNNs is doubled, since each weight has a mean and a variance). CBLN, after continually learning five different tasks, outperforms this BNN.

CBLN also uses fewer parameters than the BNN. Figure 1(b) illustrates the number of parameters used in CBLN. The orange line shows the number of parameters before the merging process, the blue line shows the number of parameters after the merging process, and the green line illustrates the number of merged parameters. The number of parameters used while training five tasks is 35094, which is far fewer than the number used by the other state-of-the-art methods. CBLN only about doubles the number of parameters during the experiment (starting from 16140). The more tasks are trained, the more parameters are merged, because CBLN finds the similarities among the solutions for all the tasks and merges them.
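The two parameter counts quoted for the 25-neuron BNN and the initial 10-neuron CBLN can be checked with a short calculation. The sketch assumes fully connected layers with biases, 784 input pixels, 10 output nodes, and two variational parameters (a mean and a variance) per weight and bias; `bnn_parameter_count` is a hypothetical helper name.

```python
def bnn_parameter_count(layer_sizes):
    """Parameters of a fully connected BNN with a mean and a variance per weight and bias."""
    deterministic = sum((n_in + 1) * n_out
                        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    return 2 * deterministic

print(bnn_parameter_count([784, 25, 25, 10]))  # 41070
print(bnn_parameter_count([784, 10, 10, 10]))  # 16140
```

For the 25-neuron network this is 2 × (785·25 + 26·25 + 26·10) = 2 × 20535 = 41070, and for the 10-neuron network it is 2 × (785·10 + 11·10 + 11·10) = 2 × 8070 = 16140, matching the figures in the text.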

Figure 1(c) illustrates the uncertainty measure in the test process when the number of tasks is five. In each block, the x-axis shows the prediction score for that particular task; the y-axis shows the variance. If the density of highlighted points is close to the lower right corner, the model has low uncertainty and high prediction score. The blocks shown in the diagonal line are the results with the lowest level of uncertainty for each particular task.

### 3.2 Permuted MNIST

The second experiment is based on the permuted MNIST. This experiment contains two parts:

The first part evaluates the ability of the model to learn new tasks incrementally. This experiment is different from the split MNIST experiment since the number of classes in each task is always 10. We follow the same setting as the previous work by Kirkpatrick et al. and Lee et al. kirkpatrick2017overcoming; lee2017overcoming. The first task is based on the original MNIST. In the rest of the tasks, we shuffle all the pixels in the images with different random seeds. Therefore, each task requires a different solution. However, the difficulty level of all the tasks is similar. In this experiment, CBLN contains two hidden layers with 50 neurons each.
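The construction of the permuted tasks can be sketched as follows, assuming the images are flattened into an `(N, 784)` array. The function name and the seeding scheme (`base_seed + t`) are assumptions for illustration; the text only states that a different random seed is used for each task.

```python
import numpy as np

def make_permuted_tasks(images, n_tasks, base_seed=0):
    """Task 0 keeps the original pixel order; each later task applies one fixed
    pixel permutation, drawn from its own seed, to every image."""
    tasks = [images]
    for t in range(1, n_tasks):
        perm = np.random.default_rng(base_seed + t).permutation(images.shape[1])
        tasks.append(images[:, perm])
    return tasks
```

Because the same permutation is applied to every image within a task, each task contains exactly the same pixel values as the original images, only in a different order, which is why the tasks have a similar difficulty level.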

In the second part, we evaluate the ability of the model to learn new tasks incrementally and continually. Here we use two datasets. The first dataset is the original MNIST. The second dataset is permuted MNIST. We split these two datasets into subsets. The number of tasks to be trained is . In this experiment, CBLN uses the architecture as mentioned in the split MNIST experiment.

Figure 3 illustrates the results. While learning the permuted MNIST incrementally, CBLN achieves an accuracy similar to that of the state-of-the-art models. However, the performance of CBLN is more stable than that of the other methods. In the second part of the experiment, CBLN again shows its robustness in learning new tasks continually and incrementally. The number of additional parameters required to train a task decreases as the number of learned tasks increases.

### 3.3 Time-Series data

In the last experiment, we use the Two-Patterns dataset from the UCR time-series archive. In this experiment, CBLN uses two hidden layers each containing 200 neurons. The other methods use two hidden layers each containing 800 neurons, with Dropout layers srivastava2014dropout. When training the CBLN model with the entire Two-Patterns dataset, the best accuracy is around 0.8. If we split the dataset into two parts, the accuracy is above 0.9. CBLN outperforms the other methods by continually learning the Two-Patterns dataset divided into smaller tasks rather than learning it as a single task.

## 4 Discussion

Merged weights: We start the discussion by analysing how the weights are merged. We visualise the weights in the Split MNIST experiment that was carried out with two tasks. As shown in Figure 5, the orange points represent the merged weights. In Figure 4(a),4(b), the x-axis shows the mean of the weights and the y-axis shows the variance of the weights. Figure 4(c) shows the density of the merged parameters. If the mean of the weight distribution is closer to 0, the weight has a larger chance of being merged, because our prior knowledge is a Gaussian distribution with a mean of 0. Weights with larger mean values have a smaller chance of being merged, because they can be regarded as contributing more to the solution for the training tasks. The solution could be different for each task; hence, these weights are the distinctions among different tasks.

Ablation study: Inspired by kemker2017measuring, we have evaluated our model with and without the merging process. To evaluate the drop in performance after merging the models, we calculate the absolute difference in test accuracy before and after the merging process. As shown in Figure 5(a), the absolute difference is almost 0. This suggests that the similar parameters have been merged with negligible loss and that the distinct parameters are maintained. To evaluate the uncertainty changes before and after the merging process, we track the uncertainty changes in the time-series experiment. As shown in Figure 3(c), the uncertainties decrease after the merging process, but they can still help the model to choose the correct parameters to predict the test data. Furthermore, the merging process significantly decreases the number of parameters needed to learn a model for multiple tasks, as shown in Figure 1(b),2(b),2(d), 3(b). The merging process therefore largely prevents the exponential increase in the number of parameters required to learn the model, without degrading the performance.

Complexity: We ran the experiments on a MacBook Pro (2015) with a 2.7 GHz Intel Core i5. The results are shown in Figure 5(b), where 10 denotes a CBLN with two hidden layers of 10 neurons each and 25 denotes a CBLN with two hidden layers of 25 neurons each. CBLN is time-consuming during the test stage, especially when the number of trained tasks grows. To produce the uncertainty measure, the computational complexity of CBLN is while the BNNs are for each test data. We assume the model does not know in advance which task the test data is associated with. This means that the model needs to identify and choose the correct solution for each test task. This is a key advantage of CBLN compared to other existing methods that assume the model knows in advance which test task is being performed; e.g. rusu2016progressive; nguyen2017variational; zenke2017continual. The test stage could be the same as that of a conventional neural network if we informed the model which task is being tested. However, in real-world applications this information is often not available to the model in advance. CBLN uses the uncertainty measure to choose the appropriate learned solution for each particular task. The number of tasks does not have much effect on the merging process. The main effect on the merging process is the number of parameters of the model at initialisation. According to our experiments, we can initialise CBLN with a much smaller number of parameters to solve a complex task, as long as it can solve it as a set of simpler tasks. Furthermore, CBLN does not need to evaluate the importance of parameters by measures such as the Fisher Information (second derivative of the loss function) lee2017overcoming; kirkpatrick2017overcoming, which are computationally expensive and intractable in large models.

Uncertainty measure: In this section, we discuss the epistemic uncertainty measure that is computed by the model given test data. CBLN uses the epistemic uncertainty measure to identify the current task from the distribution of the training data. We evaluate the variance, entropy and MUMMI in different experiments. To see which measure of uncertainty is suitable for choosing the learned solution in CBLN, we run each experiment ten times and calculate the average selection rate. As shown in Table 1, in the permuted MNIST experiments the model can choose the correct solutions even as the number of tasks increases. In the split MNIST experiment, the selection rate decreases as the number of tasks increases. In other words, the model cannot distinguish the tasks that the test data is associated with when the number of classes in each task decreases. We refer to this as a Rare Class Scenario in Epistemic Uncertainty (RCSEU). RCSEU means that when the number of classes in each task is very small, the model will overfit the training data quickly and will become over-confident when classifying the test data. To illustrate the RCSEU, we visualise the uncertainty information in the split MNIST experiment when the number of tasks is ten. In Figure 5(c), the blue blocks are the correct solution (on the diagonal line), the green blocks indicate that the model identifies the test data correctly, the red blocks indicate that the model identifies the test data incorrectly, and the black blocks represent very small uncertainty.

Uncertainty measure | Split MNIST (2 tasks) | Split MNIST (3 tasks) | Split MNIST (4 tasks) | Split MNIST (5 tasks) | Split MNIST (10 tasks) | Permuted MNIST (10 tasks) | UCR (2 tasks)
---|---|---|---|---|---|---|---
Variance | 1.0 | 1.0 | 0.95 | 0.866 | 0.3 | 1.0 | 1.0
Entropy | 1.0 | 0.8 | 0.7 | 0.736 | 0.29 | 1.0 | 0.8
MUMMI | 1.0 | 0.87 | 0.925 | 0.894 | 0.32 | 1.0 | 1.0

## 5 Conclusions

In this paper, we proposed the Continual Bayesian Learning Networks (CBLN) to solve the catastrophic forgetting problem in continual learning scenarios. CBLN is based on Bayesian neural networks (BNNs). Unlike in BNNs, the weights in CBLN follow Gaussian mixture distributions with an arbitrary number of components. CBLN can solve a complex task by dividing it into several simpler tasks and learning each of them sequentially. Since CBLN uses Gaussian mixture distribution models in its network, the number of additionally required parameters decreases as the number of tasks increases. CBLN is also able to identify which solution should be used for which test data by using an uncertainty criterion. More importantly, our proposed model can overcome the catastrophic forgetting problem without re-accessing the previous training samples. We have evaluated our method on the MNIST image and UCR time-series datasets and have compared the results to the state-of-the-art models. In the split MNIST experiment, our method outperforms the Incremental Moment Matching (IMM) model by 25% and the Synaptic Intelligence (SI) model by 80%. In the permuted MNIST experiment, our method outperforms IMM by 16% and achieves the same accuracy as the SI model. In the time-series experiment, our method outperforms IMM by 40% and the SI model by 47%. Future work will focus on developing new solutions that let the model determine when it needs to train for a new task, given a series of samples, by analysing the changes in the distribution of the training data. The work will also focus on developing methods to group the neurons during the merging process to construct regional functional areas in the network that are specific to a set of similar types of task. This will allow us to reduce the complexity of the network and create more scalable models.