Hierarchical Indian Buffet Neural Networks for Bayesian Continual Learning
Abstract
We place an Indian Buffet Process (IBP) prior over the structure of a Bayesian Neural Network (BNN), thus allowing the complexity of the BNN to increase and decrease automatically. We further extend this model such that the prior on the structure of each hidden layer is shared globally across all layers, using a Hierarchical IBP (HIBP). We apply this model to the problem of resource allocation in Continual Learning (CL), where new tasks occur and the network requires extra resources. Our model uses online variational inference with reparameterisation of the Bernoulli and Beta distributions which constitute the IBP and HIBP priors. As we automatically learn the number of weights in each layer of the BNN, overfitting and underfitting problems are largely overcome. We show empirically that our approach offers a competitive edge over existing methods in CL.
1 Introduction
Humans have the ability to continually learn, consolidate their knowledge and leverage previous experiences when learning a new set of skills. In Continual Learning (CL) an agent must also learn continually, which presents several challenges including learning online, avoiding forgetting and efficiently allocating resources for learning new tasks. In CL, a neural network model is required to learn a series of tasks, one by one, and remember how to perform each. After learning each task, the model loses access to the data. More formally, the model is given a set of tasks sequentially, $\mathcal{T}_t$ for $t = 1, \ldots, T$, where each task is comprised of a dataset of input and output samples: $\mathcal{D}_1$ for $\mathcal{T}_1$, $\mathcal{D}_2$ for $\mathcal{T}_2$ and so on until $\mathcal{D}_T$ for $\mathcal{T}_T$. Although the model will lose access to the training dataset for task $\mathcal{T}_t$, it will be continually evaluated on the test sets for all previous tasks $\mathcal{T}_i$ for $i \leq t$. For a comprehensive review of the CL scenarios see van de Ven & Tolias (2018); Hsu et al. (2018).
The principal challenges to CL are threefold. Firstly, models need to overcome catastrophic forgetting of old tasks; a neural network will exhibit forgetting of previous tasks after having learnt a few tasks Goodfellow et al. (2015). Secondly, models need to leverage knowledge transfer from previously learnt tasks when learning a new task. Finally, the model needs to have enough neural resources available to learn a new task and adapt to the complexity of the task at hand.
One of the main approaches to CL involves the use of the natural sequential learning approach embedded within Bayesian inference. The prior for task $\mathcal{T}_t$ is the posterior obtained from the previous task $\mathcal{T}_{t-1}$. This enables knowledge transfer and offers an approach to overcome catastrophic forgetting. Previous Bayesian CL approaches have leveraged Laplace approximations Kirkpatrick et al. (2017); Ritter et al. (2018) and variational inference Nguyen et al. (2018); Swaroop et al. (2018); Zeno et al. (2018) to aid computational tractability. Whilst Bayesian methods address the first and second objectives above, the third objective of ensuring that the BNN has enough neural resources to adapt its complexity to the task at hand is not necessarily achieved. For instance, additional neural resources can alter performance on MNIST classification (see Table 1 in Blundell et al. (2015)). This is a problem because the amount of neural resources required for one task now may not be enough (or may be redundant) for a future task. Propagating a poor approximate posterior from one task will alter performance for all subsequent tasks.
Non-Bayesian neural networks use additional neurons to learn new tasks and prevent overwriting previous knowledge, thus overcoming forgetting. The neural networks which have been trained on previous tasks are frozen and a new neural network is appended to the existing network for learning a new task Rusu et al. (2016). The problem with this approach is that of scalability: the number of neural resources increases linearly with the number of tasks. The scalability issue has been tackled with selective retraining and expansion with a group regulariser Yoon et al. (2018). However, this solution is unable to shrink and so is vulnerable to overfitting if misspecified when starting CL. Moreover, knowledge transfer and prevention of catastrophic forgetting are not solved in a principled manner, unlike approaches couched in a Bayesian framework.
As the resources required are typically unknown in advance, we propose a BNN which adds or withdraws neural resources automatically in response to the data. This is achieved by drawing on Bayesian nonparametrics to learn the structure of each hidden layer of a BNN. Thus, the model size adapts to the amount of data seen and the difficulty of the task. Concretely, we use a binary latent matrix $Z$, distributed according to an Indian Buffet Process (IBP) prior Griffiths & Ghahramani (2011). The IBP prior on an infinite binary matrix, $Z$, allows inference on which and how many neurons are required for each data point in a task. The weights of the BNN are treated as draws from non-interacting Gaussians Blundell et al. (2015). Catastrophic forgetting is overcome by repeated application of the Bayesian update rule, embedded within variational inference Nguyen et al. (2018). We summarise the contributions as follows. We present a novel BNN using an IBP prior and its hierarchical extension to automatically learn the complexity of each hidden layer according to the task difficulty. The model's effective use of resources is shown to be useful in CL. We derive a variational inference algorithm for learning the posterior distribution of the proposed models. In addition, our model bridges two separate CL approaches: expansion methods and Bayesian methods (more commonly referred to as regularisation based methods in the CL literature).
2 Indian Buffet Neural Networks
We introduce variational Bayesian approaches to CL in Section 2.1. We present the IBP prior in Section 2.2 and the IBP prior on the latent binary matrix is then applied to a BNN such that the complexity of each hidden layer can be learnt from the data in Section 2.3. In Section 3, the Hierarchical IBP prior (HIBP) is introduced and applied to the BNN to encourage a more regular structure. Thus, the use of an IBP and HIBP prior over the hidden states of the BNN can be readily used together with the Bayesian CL framework presented, and so automatically adapt its complexity according to the task.
2.1 Bayesian Continual Learning
The CL process can be decomposed into Bayesian updates where the approximate posterior for task $\mathcal{T}_{t-1}$ can be used as a prior for task $\mathcal{T}_t$. Variational CL (VCL) Nguyen et al. (2018) uses a BNN to perform the prediction tasks where the network weights $\mathbf{w}$ are independent Gaussians. The variational posterior from previous tasks is used as a prior for new tasks. Consider learning the first task with dataset $\mathcal{D}_1$, where $\phi_1$ are the variational parameters; then the variational posterior is $q_1(\mathbf{w})$. For the subsequent task, access to $\mathcal{D}_1$ is lost and the prior will be $q_1(\mathbf{w})$; optimisation of the ELBO will yield the variational posterior $q_2(\mathbf{w})$. Generalising, the negative ELBO for the $t$th task is:
(1) $$\mathcal{L}(\phi_t, \mathcal{D}_t) = \mathrm{KL}\big(q_t(\mathbf{w}) \,\|\, q_{t-1}(\mathbf{w})\big) - \mathbb{E}_{q_t(\mathbf{w})}\big[\log p(\mathcal{D}_t \mid \mathbf{w})\big]$$
The first term acts to regularise the posterior such that it is close to the previous task's posterior and the second term is the expected log-likelihood of the data for the current task Nguyen et al. (2018).
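As an illustration of the objective above, the following minimal sketch evaluates the closed-form KL divergence between two factorised Gaussians and combines it with an externally supplied estimate of the expected log-likelihood. The function names and the list-based representation of means and standard deviations are assumptions made for this example and are not taken from the VCL implementation.

```python
import math

def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for factorised Gaussians given as lists of means and standard deviations."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

def negative_elbo(mu_q, sigma_q, mu_prev, sigma_prev, expected_log_lik):
    """KL to the previous task's posterior minus an estimate of the expected log-likelihood."""
    return kl_diag_gauss(mu_q, sigma_q, mu_prev, sigma_prev) - expected_log_lik
```

In practice the expected log-likelihood would be a Monte Carlo estimate over weight samples; here it is passed in as a number to keep the sketch short.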
2.2 Indian Buffet Process prior
Matrix decomposition aims to represent the data as a combination of latent features: $X = ZA + \epsilon$ where $X \in \mathbb{R}^{N \times D}$, $Z \in \{0, 1\}^{N \times K}$, $A \in \mathbb{R}^{K \times D}$ and $\epsilon$ is an observation noise. Each element in $Z$ corresponds to the presence or absence of a latent feature from $A$. Specifically, $z_{nk} = 1$ corresponds to the presence of latent feature $k$ in observation $n$, and all columns in $Z$ with index greater than $K$ are assumed to be zero. In a scenario where the number of latent features is to be inferred, the IBP prior on $Z$ is suitable Doshi-Velez et al. (2009).
One representation of the IBP prior is the stick-breaking formulation Teh et al. (2007). The probability $\pi_k$ is assigned to the column $Z_k$ for $k = 1, \ldots, \infty$; whether a feature has been selected is determined by $z_{nk} \sim \mathrm{Bern}(\pi_k)$. This parameter is generated according to the following stick-breaking process:
(2) $$v_k \sim \mathrm{Beta}(\alpha, 1), \qquad \pi_k = \prod_{i=1}^{k} v_i$$
thus $\pi_k$ decreases exponentially with $k$. The Beta concentration parameter $\alpha$ controls how many features one expects to see in the data: the larger $\alpha$ is, the more latent features are present.
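The stick-breaking construction can be sketched in a few lines. The example below is a hedged illustration rather than the implementation used in the paper: the function name, the truncation argument and the use of Python's `random` module are assumptions chosen to keep the example self-contained.

```python
import random

def sample_ibp_stick_breaking(alpha, num_rows, truncation, rng):
    """Draw stick probabilities pi_k and a binary matrix Z from a truncated IBP."""
    pis, stick = [], 1.0
    for _ in range(truncation):
        stick *= rng.betavariate(alpha, 1.0)   # v_k ~ Beta(alpha, 1)
        pis.append(stick)                      # pi_k = v_1 * ... * v_k
    # Each row selects feature k independently with probability pi_k.
    Z = [[1 if rng.random() < p else 0 for p in pis] for _ in range(num_rows)]
    return pis, Z

rng = random.Random(0)
pis, Z = sample_ibp_stick_breaking(alpha=3.0, num_rows=5, truncation=10, rng=rng)
```

Because every stick length lies in [0, 1], the sequence of probabilities is non-increasing, so later columns of the binary matrix are selected less often.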
2.3 Adaptation with the IBP prior
Consider a BNN with $K$ neurons in each of its $L$ layers. Then, for an arbitrary activation $f$, the binary matrix is applied elementwise, $h_l = f(h_{l-1} W_l) \circ Z_l$, where $h_{l-1} \in \mathbb{R}^{B \times K}$, $W_l \in \mathbb{R}^{K \times K}$, $Z_l \in \{0, 1\}^{B \times K}$, and where $\circ$ is the elementwise product and $B$ is the number of data points per batch. We have ignored biases for simplicity. $Z_l$ is distributed according to an IBP prior. The IBP prior has some suitable properties for this application: the number of neurons sampled grows with the number of data points, and it promotes a "rich get richer" scheme for neuron selection Griffiths & Ghahramani (2011). For convenience, we term the IBP BNN as IBNN for the remainder of the report.
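To make the masking step concrete, the sketch below computes one hidden layer whose activations are multiplied elementwise by a binary mask. The ReLU activation, the list-of-lists layout and all names are assumptions for illustration only; biases are ignored as in the text.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def masked_layer(h_prev, W, Z):
    """Compute f(h_prev W) * Z elementwise for one batch.

    h_prev: B x D_in activations, W: D_in x K weights, Z: B x K binary mask.
    """
    out = []
    for h_row, z_row in zip(h_prev, Z):
        pre = [sum(h * W[d][k] for d, h in enumerate(h_row)) for k in range(len(z_row))]
        out.append([relu(a) * z for a, z in zip(pre, z_row)])
    return out

h_prev = [[1.0, 2.0], [0.5, -1.0]]
W = [[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]]
Z = [[1, 0, 1], [1, 1, 0]]
h = masked_layer(h_prev, W, Z)
```

A zero in the mask switches the corresponding neuron off for that data point, regardless of its pre-activation.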
The number of neurons selected grows or contracts according to the variational objective, which depends on the complexity of the data. This allows for efficient use of neural resources, which is crucial to a successful CL model. The variational objectives for the IBP prior and BNN are introduced in Section 2.4 and Section 3.2. Additionally, the "rich get richer" scheme is useful since common neurons are selected across tasks, enabling knowledge transfer and preventing forgetting.
As is standard practice in variational inference with a Bayesian nonparametric prior, we use a truncation level, $K$, for the maximum number of features in the variational IBP posterior. Doshi et al. (2009) present bounds on the marginal distribution of $X$ in a matrix factorisation setting and show that the bound decreases exponentially as $K$ increases. A similar behaviour is expected for our application.
2.4 Structured VI
Structured stochastic VI (SSVI) has been shown to perform better inference of the IBP posterior than mean-field VI in deep latent variable models Singh et al. (2017). Hence, this inference method has been chosen for learning and is presented next.
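A separate binary matrix can be applied to each layer of a BNN. The layer subscript is dropped for clarity. The structured variational approximation is $q(Z, v, \mathbf{w}) = q(\mathbf{w}) \prod_{k=1}^{K} q(v_k) \prod_{n=1}^{N} q(z_{nk} \mid \pi_k)$ with $\pi_k = \prod_{i=1}^{k} v_i$, where the variational distributions, for all $k$, are defined below, and the variational posterior is truncated to $K$. The constituent distributions of the variational distribution are $q(v_k) = \mathrm{Beta}(v_k; a_k, b_k)$, $q(z_{nk} \mid \pi_k) = \mathrm{Bern}(z_{nk}; \pi_k)$, and the BNN weights are independent draws from $\mathcal{N}(\mu, \sigma^2)$. Having defined the structured variational approximation, the negative ELBO is:
(3) $$\mathcal{L} = \mathrm{KL}\big(q(v) \,\|\, p(v)\big) + \mathbb{E}_{q(v)}\big[\mathrm{KL}\big(q(Z \mid v) \,\|\, p(Z \mid v)\big)\big] + \mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big) - \mathbb{E}_{q(Z, v, \mathbf{w})}\big[\log p(\mathcal{D} \mid Z, \mathbf{w})\big]$$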
We note that the Bernoulli random variables have an explicit dependence on the stick-breaking probabilities in the structured variational approximation. However, this explicit dependence between parameters is removed in a mean-field approximation.
3 Hierarchical IBNN
In the previous section, we presented the IBNN model to allow a BNN to automatically select the number of neurons for each layer according to the data. For a multilayer BNN, one can apply the IBP prior independently for each layer. However, this approach is limited in that information is not shared across layers. To overcome this, we propose a hierarchical IBP prior Thibaux & Jordan (2007) for neuron selection across multiple layers, jointly. The number of neurons in every layer is generated from the same global prior, which discourages irregular structure in the BNN (a BNN with adjacent wide and narrow layers might be inferred when using independent priors on each hidden layer). Of course, this property might not be desirable for all use cases; however, the majority of BNNs used in the literature have a regular structure. We term our model the HIBNN and present the graphical model which describes the hierarchical IBP prior in Figure 1.
3.1 Adaptation with the Hierarchical IBP
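The global probability of selecting the neuron positioned at the $k$th index across all layers is defined according to a stick-breaking process:
(4) $$\pi^0_k = \prod_{i=1}^{k} v_i, \qquad v_i \sim \mathrm{Beta}(\alpha, 1)$$
Child IBPs are defined over the structure of each individual hidden layer of a BNN, which depend on $\pi^0_k$ to define the respective Bernoulli probabilities of selecting neuron $k$ in layer $l$:
(5) $$\pi^l_k \sim \mathrm{Beta}\big(\alpha_l \pi^0_k, \; \alpha_l (1 - \pi^0_k)\big)$$
for $l \in \{1, \ldots, L\}$ and each neuron index $k$, where $L$ is the number of layers in a BNN and $\alpha_l$ are hyperparameters Thibaux & Jordan (2007); Gupta et al. (2012). The selection of the $k$th neuron in the $l$th layer by a particular data point $i$ in the dataset of size $N$ is thus:
(6) $$Z^l_{ik} \sim \mathrm{Bern}(\pi^l_k), \qquad i = 1, \ldots, N$$
Notice that if $k$ is small, $\pi^0_k$ is close to $1$, so the first shape parameter of the child Beta distribution, $\alpha_l \pi^0_k$, will be large relative to the second, $\alpha_l (1 - \pi^0_k)$, which will be small. Since the mean of the child Beta distribution is $\pi^0_k$, the Bernoulli probability in Equation (6) will be close to $1$; as $k$ increases, $\pi^0_k$ and $\pi^l_k$ decrease. To infer the posterior $p(Z, \pi, v, \mathbf{w} \mid \mathcal{D})$, we perform SSVI.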
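The hierarchical construction can be illustrated by sampling global stick-breaking probabilities and then drawing one child probability per layer from a Beta distribution centred on the global value. This is a sketch under stated assumptions: the function name, the clamping constant `eps` (added only so that both Beta parameters remain strictly positive) and the choice of hyperparameter values are hypothetical.

```python
import random

def sample_hibp(alpha, alpha_layers, truncation, rng, eps=1e-6):
    """Sample global stick probabilities and per-layer child probabilities."""
    global_pis, stick = [], 1.0
    for _ in range(truncation):
        stick *= rng.betavariate(alpha, 1.0)
        # Clamp so that both child Beta parameters stay strictly positive.
        global_pis.append(min(max(stick, eps), 1.0 - eps))
    child_pis = [
        [rng.betavariate(a_l * p, a_l * (1.0 - p)) for p in global_pis]
        for a_l in alpha_layers
    ]
    return global_pis, child_pis

rng = random.Random(0)
global_pis, child_pis = sample_hibp(alpha=4.0, alpha_layers=[5.0, 5.0], truncation=8, rng=rng)
```

Larger layer hyperparameters concentrate each child probability more tightly around its global counterpart, which is what ties the widths of different layers together.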
3.2 Structured VI
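A structured variational posterior distribution which retains properties of the true posterior is desired, such that the global stick-breaking probabilities influence the child stick-breaking probabilities of each layer of the BNN. Let us define the variational distributions for our hidden variables as follows: $q(v_k) = \mathrm{Beta}(v_k; a_k, b_k)$ and $\pi^0_k = \prod_{i=1}^{k} v_i$, $q(\pi^l_k \mid \pi^0_k) = \mathrm{Beta}\big(\pi^l_k; \alpha_l \pi^0_k, \alpha_l (1 - \pi^0_k)\big)$, $q(Z^l_{ik} \mid \pi^l_k) = \mathrm{Bern}(Z^l_{ik}; \pi^l_k)$, and the weights of the BNN are drawn from $\mathcal{N}(\mu, \sigma^2)$.
The structured variational distribution is defined as follows
(7) $$q_\phi(Z, \pi, v, \mathbf{w}) = q(\mathbf{w}) \prod_{k=1}^{K} q(v_k) \prod_{l=1}^{L} q(\pi^l_k \mid \pi^0_k) \prod_{i=1}^{N} q(Z^l_{ik} \mid \pi^l_k)$$
where $\pi^0_k = \prod_{i=1}^{k} v_i$ for all $k$ and $l$, up to the variational truncation, $K$. Having defined the structured variational distribution, the negative ELBO is:
(8) $$\mathcal{L}(\phi, \mathcal{D}) = \mathrm{KL}\big(q(v) \,\|\, p(v)\big) + \sum_{l=1}^{L} \Big( \mathbb{E}_{q}\big[\mathrm{KL}\big(q(\pi^l \mid v) \,\|\, p(\pi^l \mid v)\big)\big] + \mathbb{E}_{q}\big[\mathrm{KL}\big(q(Z^l \mid \pi^l) \,\|\, p(Z^l \mid \pi^l)\big)\big] \Big) + \mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big) - \mathbb{E}_{q}\big[\log p(\mathcal{D} \mid Z, \mathbf{w})\big]$$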
The child stickbreaking variational parameters for each layer are conditioned on the global stickbreaking parameters and the binary masks for each neuron in each layer are conditioned on the child stickbreaking variational parameters. Thus, the variational structured posterior is able to capture dependencies of the prior. The learnable parameters are , , and for all neurons and for all layers .
Inference
The variational posterior is obtained by optimising Equation (8) using structured stochastic VI, in which the goal is to learn the variational distributions parameterised by $\phi$. For inference to be tractable, we utilise three reparameterisation tricks. The first is for learning the Gaussian weights Kingma & Welling (2014); Blundell et al. (2015). The second is an implicit reparameterisation of the Beta distribution Figurnov et al. (2018). The third reparameterisation uses a Concrete relaxation of the Bernoulli distribution Maddison et al. (2017); Jang et al. (2017). We present details of these in the Supplementary material, Sections A.1, A.2 and A.3, respectively.
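The Concrete relaxation of a Bernoulli variable can be sketched as follows. The sample is a smooth function of the probability and of uniform noise, which is what makes gradient-based learning possible. The function names and the temperature value are assumptions for this illustration, and the sketch omits automatic differentiation.

```python
import math
import random

def stable_sigmoid(x):
    """Logistic function evaluated without overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sample_binary_concrete(pi, temperature, rng):
    """Relaxed Bernoulli(pi) sample in [0, 1] via logistic noise."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # uniform noise kept away from 0 and 1
    logit = math.log(pi) - math.log(1.0 - pi) + math.log(u) - math.log(1.0 - u)
    return stable_sigmoid(logit / temperature)

rng = random.Random(1)
samples = [sample_binary_concrete(0.3, 0.1, rng) for _ in range(20000)]
```

As the temperature approaches zero the samples concentrate near 0 and 1; at higher temperatures the mask becomes "softer", which trades bias for lower-variance gradients.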
4 Related Work
4.1 Continual learning
Bayesian continual learning. Repeated application of Bayes' rule can be used to update a model given the arrival of a new task. Previous work has used Laplace approximations Kirkpatrick et al. (2017); Ritter et al. (2018) and variational inference Nguyen et al. (2018); Zeno et al. (2018); Ahn et al. (2019). Bayesian methods can also be intuitively thought of as a weight space regularisation. Explicit regularisation in weight space has also proved successful in CL Zenke et al. (2017); Chaudhry et al. (2018); Schwarz et al. (2018). Our method builds upon Nguyen et al. (2018) as the framework for learning continually. None of these methods deal with the issue of resource allocation to alleviate potential overfitting or underfitting problems in CL.
Bayesian CL can be interpreted as a regularisation in weight space. However, intuitively it is more sensible to regularise the task functions between tasks Benjamin et al. (2019) by storing samples from previous tasks, as repeated regularisation in weight space could lead to parameter values becoming obsolete. Thus, Titsias et al. (2020) propose a functional regularisation by constructing approximate posteriors over task-specific functions using sparse GPs. However, this approach requires optimisation over inducing points and the choice of an appropriate GP kernel.
Adaptive models in continual learning. Non-Bayesian approaches to CL use additional neural resources to remember previous tasks. One approach boils down to training an individual neural network for each task Rusu et al. (2016), and is thus unrealistic. Dynamically Expandable Networks involve selective retraining of neurons and expansion with a group sparsity regulariser to ensure that the network doesn't expand unnecessarily Yoon et al. (2018); however, this approach is unable to shrink and continues to expand if a large network is chosen initially for CL. Another approach, based on reinforcement learning, adds neural resources by penalising the complexity of the task network in its reward function Xu & Zhu (2018).
Recently, Rao et al. (2019) proposed an unsupervised CL model (CURL) for scenarios that lack a task identifier. CURL is adaptive, insofar as a new component is added to the mixture of Gaussians in the model if a new task is detected. However, the neural capacity of the encoder and decoder, which capture lower level task details, doesn't adapt to the task difficulty. Also, CURL uses deep generative replay techniques Shin et al. (2017) to overcome forgetting. Another approach proposes a mixture of expert models which can perform task-free CL Lee et al. (2020). The experts are distributed according to a Dirichlet prior and a new expert can be added automatically when a new task arrives, without the need for a task identifier during training. Catastrophic forgetting is overcome by using lateral connections between experts like Rusu et al. (2016), and the complexity of each expert is set a priori; hence it does not adapt to the difficulty of the task at hand and is learnt with MAP estimation. In contrast, the model presented in this report has the added advantage of producing uncertainty estimates.
4.2 IBP priors and model selection in deep learning
The IBP prior has been used in VAEs to automatically learn the number of latent features. Stick-breaking probabilities have been placed directly as the VAE latent state Nalisnick & Smyth (2017). The IBP prior has been used to learn the number of features in a VAE hidden state using mean-field VI Chatzis (2018) with black-box VI Ranganath et al. (2014) and structured VI Hoffman & Blei (2015); Singh et al. (2017). As an alternative to truncation, Xu et al. (2019) use a Russian roulette sampling scheme to sample from the infinite sum in the stick-breaking construction for the IBP.
Model selection for BNNs has been performed with the Horseshoe prior over weights Ghosh et al. (2019). To the best of our knowledge, we are the first to employ an IBP prior for model selection.
5 Experimental Results
To demonstrate the effectiveness of the IBP and HIBP priors in determining the structure of the BNN, we perform weight pruning in Section 5.1 to see whether the pruned weights coincide with the weights dropped by the IBP and HIBP priors. We then use the IBNN and HIBNN in a CL setting in Section 5.2. Further continual learning results and all experimental details are outlined in the supplementary material. Unless explicitly stated, all curves are an average of 5 independent runs ± one standard error. By test error, we refer to .
5.1 IBP induces sparsity
We study the sparsity induced by the IBP variational posterior to see whether it is sensibly compressing the BNN with the binary matrix $Z$. Weight pruning is performed on the MNIST multiclass classification problem and the HIBNN is compared to a variational BNN. The baseline BNN has two layers with hidden state sizes of , the HIBNN uses a variational truncation of for fair comparison. The HIBNN achieves an accuracy of before pruning while the BNN with no prior over the network structure achieves an accuracy of before pruning. This gap in performance is due to the approximate inference of the HIBP posterior and in particular the Concrete reparameterisation, which applies a 'soft' mask on the hidden layers of the HIBNN. Weights are pruned according to and the signal-to-noise ratio . The pruning accuracies in Figure 2 demonstrate that the HIBNN is indeed much sparser than a BNN. Two separate IBP priors on each layer of the BNN achieve a similar accuracy of before pruning, but are slightly less robust to pruning than the HIBNN. This is due to the more regular structure of the HIBNN, see Figure 3. We compare our method to Sparse Variational Dropout Molchanov et al. (2017), which uses a BNN with a specific sparsity inducing prior. The results indicate a similar pruning performance as when using a HIBP prior, see Section B.1 and Figure 10 in the supplementary material.
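A pruning step of this kind can be sketched as below. The criterion used here, the ratio of the absolute posterior mean to the posterior standard deviation as in Blundell et al. (2015), is an assumption for illustration, as are the function name and the toy values; it is not claimed to be the exact criterion used in the experiments.

```python
def prune_by_snr(mus, sigmas, fraction):
    """Zero the given fraction of weights with the lowest |mu| / sigma."""
    snr = [abs(m) / s for m, s in zip(mus, sigmas)]
    order = sorted(range(len(mus)), key=lambda i: snr[i])   # lowest signal-to-noise first
    pruned = set(order[: int(fraction * len(mus))])
    return [0.0 if i in pruned else m for i, m in enumerate(mus)]

mus = [0.9, -0.05, 0.4, -1.2]
sigmas = [0.1, 0.5, 0.4, 0.3]
kept = prune_by_snr(mus, sigmas, fraction=0.5)
```

Sweeping the pruned fraction from 0 to 1 and recording test accuracy at each step yields a pruning curve of the kind described above.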
5.2 Continual learning experiments
Adaptive complexity. Approximate inference of the IBP and HIBP posteriors is challenging in a stationary setting, which attenuates overall performance. Despite this, the approximate IBP and HIBP posteriors are useful in a non-stationary CL setting, where the amount of resources required is not known beforehand. In Figure 4, one can see that the average accuracies across all CL tasks for permuted MNIST vary considerably with the hidden state size for VCL Nguyen et al. (2018); hence the benefit of our model, which automatically infers the hidden state size for each task, see Figure 6. The details of the experiment are introduced below. Despite inferring an appropriate number of resources, accuracies are still attenuated compared to a well chosen VCL network size.
Continual learning scenarios. Three different CL scenarios, in increasing order of difficulty, are used to evaluate the proposed models; see van de Ven & Tolias (2018); Hsu et al. (2018) for further details. The first is task incremental learning, abbreviated to CL1, where the task identifier is given to the continual learner during evaluation. The second is domain incremental learning, abbreviated to CL2, where the task identifier is not given to the continual learner; this is a harder and more realistic scenario Farquhar & Gal (2018). The third scenario is incremental class learning, denoted CL3, in which the number of classes the model is required to classify increases with each new task. Bayesian CL methods fail at CL3 by themselves; the reason for this is an open research question. Therefore, we focus on evaluating the first two scenarios in this work. Extensions to our work to perform incremental class learning are discussed in the final discussion, Section 6.
Baseline models. We compare our models to VCL Nguyen et al. (2018) and EWC Kirkpatrick et al. (2017), since the IBNN and HIBNN models build upon the VCL framework. EWC is motivated by repeated Bayesian updates. The (diagonal) empirical Fisher Information is scaled by a hyperparameter Huszár (2017) (this boils down to an $L_2$ regularisation). If the previous task's parameters yield a poor performance then a small regularisation will mildly affect new task parameters. As a result, it is expected that EWC is more robust to poor posterior propagation for a well chosen hyperparameter. However, EWC cannot deliver uncertainties (unless it is redefined as a strict Laplace approximation of the BNN posterior). For these reasons VCL is a fairer comparison for our model. The exact EWC algorithm used is Online EWC Schwarz et al. (2018). The Online EWC implementation used is from Hsu et al. (2018). The objective is to demonstrate that limited or excessive neural resources can cause problems in CL in comparison to the adaptive model presented; hence a small model and a larger, more reasonably sized model are used as baselines. This choice might seem contrived or unreasonable, but in applications of CL it is impossible to know a priori how large or small a model should be for unknown future tasks.
MNIST experimental details. All models use single-layer BNNs with varying hidden state sizes. The use of a single layer is enough as MNIST is a simple task; the results for Online EWC baselines in Table 1 for Split MNIST outperform those presented in Hsu et al. (2018); van de Ven & Tolias (2018), which use larger models. CL1 uses a multi-head network and CL2 uses a single-head network.
Permuted MNIST benchmarks. The permuted MNIST benchmark involves performing multiclass classification on MNIST where for each new task the data has been transformed by a fixed permutation; the class labels remain unchanged. The results for permuted MNIST in Figure 5 show that poorly initialised BNNs with a neuron hidden layer can underperform in CL scenarios. The IBNN expands continuously such that a median of neurons are active for the fifth task like the optimal VCL baseline, see Figure 6. We define a neuron as active by aggregating all neurons where for data point and neuron for plotting in all boxplots. There is a small gap in performance between our model and VCL baselines due to the approximations used for inference of the variational posterior. The IBNN is able to overcome forgetting and transfer knowledge in a similar vein to VCL. Variational solutions outperform EWC for Permuted MNIST. Results for CL2 are shown in the supplementary material in Section B.2.
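The construction of permuted tasks can be sketched as follows, with images represented as flat lists of pixel values. The function name, the seed handling and the convention that the first task keeps the original pixel order are assumptions for this example rather than details confirmed by the experimental setup.

```python
import random

def make_permuted_tasks(images, num_tasks, seed=0):
    """Return one permuted copy of `images` per task; each task uses one fixed pixel permutation."""
    rng = random.Random(seed)
    num_pixels = len(images[0])
    tasks = []
    for t in range(num_tasks):
        perm = list(range(num_pixels))
        if t > 0:                      # assumption: the first task keeps the original pixel order
            rng.shuffle(perm)
        # The same permutation is applied to every image of the task; labels are untouched.
        tasks.append([[img[j] for j in perm] for img in images])
    return tasks

images = [[0, 1, 2, 3], [4, 5, 6, 7]]
tasks = make_permuted_tasks(images, num_tasks=3)
```

Because each task only reorders pixels, all tasks have identical difficulty for a fully connected network, which makes the benchmark a clean test of forgetting.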
Split MNIST benchmarks. The split MNIST benchmark for CL involves a sequence of binary classification tasks in MNIST and its variants with background noise and background images.
For the CL2 scenario there is a gap in performance relative to CL1; this affects Online EWC, VCL and our method equally, in line with Hsu et al. (2018). There doesn't seem to be an accuracy benefit with model size, see Table 2 in the supplementary material.
Table columns: EWC h5, EWC h50, EWC h100, VCL h5, VCL h50, VCL h100, IBNN (ours).
Table rows: PMNIST, MNIST, MNIST + noise, MNIST + images.
Uncertainty quantification. The advantage of using VCL as a basis for our model is uncertainty quantification over seen and unseen tasks. Predictive entropy is a suitable measure of uncertainties for a BNN as shown in Osawa et al. (2019). At each step of CL (CL1 is considered here), predictive entropies are calculated over all seen and unseen test sets. We see that our model is clearly able to distinguish between seen and unseen tasks for Permuted MNIST with sharp differences in predictive entropies, see Figure 7. For Split MNIST benchmarks the differences are still clear but less marked than for Permuted MNIST. We further discuss the use of uncertainties in CL, in Section 6.
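Predictive entropy for a BNN is commonly estimated by averaging class probabilities over Monte Carlo weight samples and taking the entropy of the mean. The sketch below assumes this estimator; the function name and the nested-list input format are hypothetical.

```python
import math

def predictive_entropy(mc_probs):
    """Entropy of the mean predictive distribution over Monte Carlo weight samples.

    mc_probs: list of per-sample class-probability lists for a single input.
    """
    num_samples = len(mc_probs)
    num_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / num_samples for c in range(num_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)

confident = predictive_entropy([[1.0, 0.0], [1.0, 0.0]])
uncertain = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])
```

Inputs from unseen tasks are expected to produce disagreeing samples and therefore higher entropy than inputs from tasks the model has already learnt.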
Split CIFAR10 benchmark. The CIFAR10 dataset is used as a challenging benchmark for CL. Each task is a binary classification of the constituent classes. Only the CL1 scenario is considered because the dataset is already difficult. Our models, a two-layer HIBNN and a two-layer IBNN with , are compared to VCL baselines with two layers and hidden state sizes of (a similar setup of two layers is considered in Hsu et al. (2018)). The HIBNN model outperforms all the baselines, see Figure 8. The larger models struggle to learn anything. The HIBNN model outperforms the IBNN model with separate priors over each layer. This is because of the more regular structure induced over the BNN by the HIBP variational posterior. The HIBP model expands from neurons in each layer for the first task to around neurons for the last task, while the IBP model is less regular between layers, see Figure 9. We provide further details of the experimental setup and task accuracy breakdowns in Section B.4 of the supplementary material.
6 Conclusion and Future Work
Model size is an important contributing factor for CL performance, see Figures 4 and 8 for instance. Most CL methods assume a perfectly selected model size and study the effects of method specific hyperparameters. We introduce a novel CL framework which adapts the complexity of a BNN to the task difficulty. Our model is based on the IBP prior for selecting the number of neurons for each task and uses variational inference for learning. The models presented reconcile two different approaches to CL: Bayesian or regularisation based approaches and dynamic architecture approaches, through the use of an IBP and HIBP prior. The IBP and HIBP priors applied to the BNN also have the desirable property of inducing sparsity in the model.
Solutions which adapt resources, to prevent overfitting and underfitting, are essential for successful Bayesian CL approaches to avoid poor approximate posteriors being used as priors for a new task. This requirement provides our core motivation for a Bayesian model which adds and removes resources automatically. To the best of our knowledge, the HIBNN and IBNN are the first such models to be used in a CL setting. Future extensions can be made for more difficult CL scenarios.
Task free continual learning. The model presented can be augmented in a straightforward manner to perform task free CL. Variational BNNs like the ones presented can quantify uncertainties such as predictive entropy Osawa et al. (2019), which, used together with a statistical test Titsias et al. (2020), can detect task boundaries.
Memories for replay. The model presented can be augmented with small memories, for instance coresets Nguyen et al. (2018). The use of small memories of previous tasks to augment the current task dataset, although not purely Bayesian, can be a remarkably effective tool for class incremental learning (CL3) Hsu et al. (2018).
Appendices
Appendix A Model and Inference
In this section, we present the variational BNN with the structure of the hidden layer determined by the HIBP variational posterior (a.k.a. HIBNN). It is straightforward to apply the following methodology to a simpler model with independent IBP variational posteriors determining the structure of each hidden layer (referred to as the IBNN in the main paper).
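We derive a structured variational posterior Singh et al. (2017) where dependencies are established between global parameters and local parameters Hoffman & Blei (2015). Once the variational posterior is obtained, we can follow the VCL framework Nguyen et al. (2018) for CL. The following set of equations govern the stick-breaking HIBP model for an arbitrary layer $l$ with $K$ neurons in each layer of a BNN:
(9) $$v_k \sim \mathrm{Beta}(\alpha, 1)$$
(10) $$\pi^0_k = \prod_{i=1}^{k} v_i$$
(11) $$\pi^l_k \sim \mathrm{Beta}\big(\alpha_l \pi^0_k, \; \alpha_l (1 - \pi^0_k)\big)$$
(12) $$Z^l_{ik} \sim \mathrm{Bern}(\pi^l_k), \qquad i = 1, \ldots, N$$
(13) $$\mathbf{w}^l_k \sim \mathcal{N}\big(\boldsymbol{\mu}^l_k, (\boldsymbol{\sigma}^l_k)^2 \mathbb{1}\big)$$
(14) $$h_l = f(h_{l-1} W_l) \circ Z_l$$
The index $k$ denotes a particular neuron, $l$ denotes a particular layer, and $\mathbf{w}^l_k$ denotes a column (of $W_l$) from the weight matrix mapping hidden states from layer $l-1$ to layer $l$, such that $W_l = [\mathbf{w}^l_1, \ldots, \mathbf{w}^l_K]$. We denote $\circ$ as the elementwise multiplication operation. The binary matrix $Z_l$ controls the inclusion of particular neurons in layer $l$. The dimensionalities of our variables are as follows: $\mathbf{w}^l_k \in \mathbb{R}^{K}$, $Z_l \in \{0, 1\}^{N \times K}$ and $h_l \in \mathbb{R}^{N \times K}$.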
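The closed-form solution to the true posterior of the HIBP parameters and BNN weights involves integrating over the joint distribution of the data and our hidden variables, $\{Z, \pi, v, \mathbf{w}\}$. Since it is not possible to obtain a closed-form solution to this integral, variational inference together with reparameterisations of the variational distributions are used Kingma & Welling (2014); Figurnov et al. (2018) to employ gradient based methods. The variational approximation used is
(15) $$q_\phi(Z, \pi, v, \mathbf{w}) = q(\mathbf{w}) \prod_{k=1}^{K} q(v_k) \prod_{l=1}^{L} q(\pi^l_k \mid \pi^0_k) \prod_{i=1}^{N} q(Z^l_{ik} \mid \pi^l_k)$$
where the variational posterior is truncated up to $K$; the prior is still infinite Nalisnick & Smyth (2017). The set of variational parameters which we optimise over is denoted $\phi$. Each term in Equation (15) is specified as follows
(16) $$q(v_k) = \mathrm{Beta}(v_k; a_k, b_k)$$
(17) $$\pi^0_k = \prod_{i=1}^{k} v_i$$
(18) $$q(\pi^l_k \mid \pi^0_k) = \mathrm{Beta}\big(\pi^l_k; \alpha_l \pi^0_k, \alpha_l (1 - \pi^0_k)\big)$$
(19) $$q(Z^l_{ik} \mid \pi^l_k) = \mathrm{Bern}(Z^l_{ik}; \pi^l_k)$$
(20) $$q(\mathbf{w}^l_k) = \mathcal{N}\big(\mathbf{w}^l_k; \boldsymbol{\mu}^l_k, (\boldsymbol{\sigma}^l_k)^2 \mathbb{1}\big)$$
where $\alpha_l$ are hyperparameters Gupta et al. (2012). Now that we have defined our structured variational approximation in Equation (15) we can write down the objective as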
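(21) $$\mathcal{L}(\phi, \mathcal{D}) = \mathrm{KL}\big(q_\phi(Z, \pi, v, \mathbf{w}) \,\|\, p(Z, \pi, v, \mathbf{w})\big) - \mathbb{E}_{q_\phi}\big[\log p(\mathcal{D} \mid Z, \mathbf{w})\big]$$
In the above formula, $q_\phi$ is the approximate posterior and $p$ is the prior. By substituting Equation (15), we obtain the negative ELBO objective:
(22) $$\mathcal{L}(\phi, \mathcal{D}) = \mathrm{KL}\big(q(v) \,\|\, p(v)\big) + \sum_{l=1}^{L} \Big( \mathbb{E}_{q}\big[\mathrm{KL}\big(q(\pi^l \mid v) \,\|\, p(\pi^l \mid v)\big)\big] + \mathbb{E}_{q}\big[\mathrm{KL}\big(q(Z^l \mid \pi^l) \,\|\, p(Z^l \mid \pi^l)\big)\big] \Big) + \mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big) - \mathbb{E}_{q}\big[\log p(\mathcal{D} \mid Z, \mathbf{w})\big]$$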
Estimating the gradient of the Bernoulli and Beta variational parameters requires a suitable reparameterisation. Samples from the Bernoulli distribution in Equation (19) arise after taking an argmax over the Bernoulli parameters. The argmax is discontinuous and a gradient is undefined. The Bernoulli is reparameterised as a Concrete distribution Maddison et al. (2017); Jang et al. (2017). Additionally the Beta is reparameterised implicitly Figurnov et al. (2018) to separate sampling nodes and parameter nodes in the computation graph and allow the use of stochastic gradient methods to learn the variational parameters of the approximate HIBP posterior. The meanfield approximation for the Gaussian weights of the BNN is used, in Equation (20) Blundell et al. (2015); Nguyen et al. (2018). In the next sections we detail the reparameterisations used to optimise Equation (22).
A.1 The variational Gaussian weight distribution reparameterisation
The variational posterior over the weights of the BNN is a diagonal Gaussian . By using a reparameterisation, one can represent the BNN weights using a deterministic function , where is an auxiliary variable and a deterministic function parameterised by , such that:
(23) 
where is an objective function, for instance Equation (22). The BNN weights can be sampled directly through the reparameterisation: . By using this simple reparameterisation, the weight samples are now deterministic functions of the variational parameters and , and the noise comes from the independent auxiliary variable Kingma & Welling (2014). When taking the gradient of our ELBO objective in Equation (22), the expectation of the log-likelihood may be rewritten by integrating over so that the gradient with respect to and can move into the expectation according to Equation (23) (the backward pass). In the forward pass, is sampled to compute .
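As an illustration of the Gaussian reparameterisation, the following minimal Python sketch draws weights as the mean plus the standard deviation times standard normal noise and passes a gradient through that sample by the chain rule. The function names, the toy quadratic objective and the direct parameterisation in terms of standard deviations are assumptions made for this example; it is not the implementation used in the experiments.

```python
import random


def sample_weights(mu, sigma, eps):
    """Reparameterised sample: w = mu + sigma * eps, elementwise."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def reparam_gradients(mu, sigma, eps, grad_f):
    """Single-sample gradients of f(w) with respect to mu and sigma."""
    w = sample_weights(mu, sigma, eps)
    g = grad_f(w)                                   # df/dw at the sampled weights
    grad_mu = list(g)                               # dw/dmu = 1
    grad_sigma = [gi * e for gi, e in zip(g, eps)]  # dw/dsigma = eps
    return grad_mu, grad_sigma


def estimate_gradients(mu, sigma, grad_f, num_samples, rng):
    """Monte Carlo average of the single-sample gradients."""
    sum_mu = [0.0] * len(mu)
    sum_sigma = [0.0] * len(mu)
    for _ in range(num_samples):
        # Forward pass: sample the auxiliary noise, independent of mu and sigma.
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
        g_mu, g_sigma = reparam_gradients(mu, sigma, eps, grad_f)
        sum_mu = [a + b for a, b in zip(sum_mu, g_mu)]
        sum_sigma = [a + b for a, b in zip(sum_sigma, g_sigma)]
    return ([v / num_samples for v in sum_mu],
            [v / num_samples for v in sum_sigma])


def grad_square(w):
    """Gradient of the toy objective f(w) = sum_i w_i^2 (hypothetical example)."""
    return [2.0 * wi for wi in w]
```

For this toy objective the expectation is the sum of the squared means and variances, so the exact gradients are twice the mean and twice the standard deviation, which the Monte Carlo estimate should approach as the number of samples grows.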
A.2 The implicit Beta distribution reparameterisation
Implicit reparameterisation gradients Figurnov et al. (2018); Mohamed et al. (2019) are used to learn the variational parameters of the Beta distribution. Nalisnick & Smyth (2017) propose to use a Kumaraswamy reparameterisation. However, in CL a Beta distribution rather than an approximation is desirable for repeated Bayesian updates.
Unlike for the Gaussian distribution presented earlier, there is no simple inverse of the standardisation function for the Beta distribution. Hence, the idea of implicit reparameterisation gradients is to differentiate the standardisation function rather than to invert it. The standardisation function is given by the Beta distribution’s CDF.
Following Mohamed et al. (2019), the derivative required for general stochastic VI is:
(24) 
where is an objective function, e.g. Equation (22), and are the Beta parameters. The implicit reparameterisation gradient solves for , above, using implicit differentiation Figurnov et al. (2018):
(25) 
where , is the inverse CDF of the Beta distribution and and so is the standardisation path which removes the dependence of the distribution parameters on the sample. The key idea in implicit reparameterisation is that this expression for the gradient in Equation (25) only requires differentiating the standardisation function and not inverting it. Given then using Equation (25) the implicit gradient is:
(26) 
where is the Beta PDF. The CDF of the Beta distribution, is given by
(27) 
where and are the incomplete Beta function and Beta function, respectively. The derivatives of do not admit simple analytic expressions. Thus, numerical approximations have to be made, for instance, by using Taylor series for Jankowiak & Obermeyer (2018); Figurnov et al. (2018).
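The implicit gradient can be checked numerically: the derivative of a Beta sample with respect to a parameter equals minus the derivative of the CDF with respect to that parameter, divided by the PDF at the sample. The sketch below is a minimal illustration that replaces the Taylor-series approximations cited above with a swapped-in technique, namely Simpson's rule for the CDF and central finite differences for its parameter derivatives. All function names are hypothetical, and the quadrature assumes both Beta parameters exceed one.

```python
import math


def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_pdf(z, a, b):
    """Beta density at z in (0, 1)."""
    return math.exp((a - 1.0) * math.log(z)
                    + (b - 1.0) * math.log1p(-z)
                    - log_beta_fn(a, b))


def beta_cdf(z, a, b, n=2000):
    """Beta CDF via Simpson's rule with n (even) intervals; assumes a, b > 1."""
    h = z / n
    total = 0.0
    for i in range(n + 1):
        t = i * h
        weight = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        total += weight * t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)
    return total * h / 3.0 / math.exp(log_beta_fn(a, b))


def implicit_grad(z, a, b, eps=1e-4):
    """Return (dz/da, dz/db) = -(dF/da, dF/db) / pdf(z) for a Beta sample z."""
    pdf = beta_pdf(z, a, b)
    dF_da = (beta_cdf(z, a + eps, b) - beta_cdf(z, a - eps, b)) / (2.0 * eps)
    dF_db = (beta_cdf(z, a, b + eps) - beta_cdf(z, a, b - eps)) / (2.0 * eps)
    return -dF_da / pdf, -dF_db / pdf
```

Only the CDF is differentiated here; at no point is it inverted, which is the property that makes the implicit approach applicable to the Beta distribution.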
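In the forward pass, the sample is drawn from a Beta distribution or, alternatively, constructed from two Gamma samples. That is, if $x \sim \mathrm{Gamma}(\alpha, 1)$ and $y \sim \mathrm{Gamma}(\beta, 1)$, then $x / (x + y) \sim \mathrm{Beta}(\alpha, \beta)$.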
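The construction of a Beta sample from two Gamma samples can be sketched with the Python standard library as follows. The function name `sample_beta_via_gammas` is hypothetical, and unit scale is assumed for both Gamma distributions.

```python
import random


def sample_beta_via_gammas(alpha, beta, rng):
    """Draw a Beta(alpha, beta) sample as x / (x + y) with Gamma-distributed x, y."""
    x = rng.gammavariate(alpha, 1.0)  # shape alpha, unit scale
    y = rng.gammavariate(beta, 1.0)   # shape beta, unit scale
    return x / (x + y)
```

The empirical mean of many such samples should be close to the Beta mean, the first parameter divided by the sum of both parameters, which gives a quick sanity check of the construction.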
A.3 The variational Bernoulli distribution reparameterisation
The Bernoulli distribution can be reparameterised using a continuous relaxation to the discrete distribution and so Equation (23) can be used.
Consider a discrete distribution where and , then . Sampling from this distribution requires performing an argmax operation; the crux of the problem is that the argmax operation does not have a well-defined derivative.
To address the derivative issue above, the Concrete distribution Maddison et al. (2017) or Gumbel-Softmax distribution Jang et al. (2017) is used as an approximation to the Bernoulli distribution. The idea is that instead of returning a state on a vertex of the probability simplex as argmax does, these relaxations return states inside the probability simplex (see Figure 2 in Maddison et al. (2017)). The Concrete formulation and notation from Maddison et al. (2017) are used. We sample from the probability simplex using
(28) 
with temperature hyperparameter , parameters and i.i.d. Gumbel noise . This equation resembles a softmax with a Gumbel perturbation. As the temperature approaches zero, the softmax computation approaches the argmax computation. This relaxation of the variational Bernoulli distribution can be used to reparameterise Bernoulli random variables, allowing gradient-based learning of the variational Beta parameters downstream in our model.
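The Gumbel-perturbed softmax can be sketched in a few lines of Python. The sketch follows the general form in Maddison et al. (2017): log-parameters plus Gumbel noise, divided by the temperature and normalised with a softmax. The function names are hypothetical, and the noise is passed in explicitly so that the behaviour at different temperatures can be inspected deterministically.

```python
import math
import random


def sample_gumbel(rng):
    """Draw standard Gumbel noise as -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def concrete_sample(log_alpha, gumbel_noise, temperature):
    """Relaxed one-hot sample on the probability simplex."""
    logits = [(la + g) / temperature for la, g in zip(log_alpha, gumbel_noise)]
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With zero noise and unit temperature the output reduces to the normalised parameters, while a small temperature pushes the output towards a vertex of the simplex, mimicking the argmax.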
When performing variational inference using the Concrete reparameterisation for the posterior, a Concrete reparameterisation of the Bernoulli prior is required to properly lower bound the ELBO in Equation (22). Let be the variational Bernoulli posterior over sparse binary masks for weights and be the Bernoulli prior. To guarantee a lower bound on the ELBO, both Bernoulli distributions must be replaced with Concrete densities, i.e.,
(29) 
where is a Concrete density for the variational posterior with parameter , temperature parameter given global parameters . The Concrete prior is . Equation (29) is evaluated numerically by sampling from the variational posterior (we will take a single Monte Carlo sample Kingma & Welling (2014)). At test time one can sample from a Bernoulli using the learnt variational parameters of the Concrete distribution Maddison et al. (2017).
In practice, the log transformation is used to alleviate underflow problems when working with Concrete probabilities. One can instead work with , as the KL divergence is invariant under this invertible transformation and Equation (29) is valid for optimising our Concrete parameters Maddison et al. (2017). For binary Concrete variables one can sample from where and the log-density (before applying the sigmoid activation) is Maddison et al. (2017). The reparameterisation , where is the sigmoid function, enables us to differentiate through the Concrete and use a formula similar to Equation (23).
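A single-sample Monte Carlo estimate of the KL term between two binary Concrete distributions can be sketched as follows, working in the logit space before the sigmoid is applied. The log-density used here is obtained by a change of variables from the logistic distribution and matches the binary Concrete form given in Maddison et al. (2017). The function names, the argument order and the clipping constant are assumptions made for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(y):
    """Numerically stable logistic sigmoid."""
    if y >= 0.0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def sample_logit(log_alpha, temperature, u):
    """Binary Concrete sample before the sigmoid; u is Uniform(0, 1) noise."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)            # guard against log(0)
    logistic_noise = math.log(u) - math.log1p(-u)
    return (log_alpha + logistic_noise) / temperature


def log_density(y, log_alpha, temperature):
    """Log-density of the logit-space binary Concrete variable at y."""
    return (math.log(temperature) - temperature * y + log_alpha
            - 2.0 * softplus(-temperature * y + log_alpha))


def kl_single_sample(q_log_alpha, q_temp, p_log_alpha, p_temp, u):
    """Single-sample estimate of KL(q || p) using a sample drawn from q."""
    y = sample_logit(q_log_alpha, q_temp, u)
    return log_density(y, q_log_alpha, q_temp) - log_density(y, p_log_alpha, p_temp)
```

Applying `sigmoid` to the output of `sample_logit` gives the relaxed mask in the unit interval. The estimate is exactly zero when posterior and prior coincide and is positive on average otherwise.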
EWC h5  EWC h50  EWC h100  VCL h5  VCL h50  VCL h100  IBNN
Fully Bayesian  ✗  ✗  ✗  ✓  ✓  ✓  ✓
PMNIST CL1
PMNIST CL2
MNIST CL1
MNIST CL2
MNIST + noise CL1
MNIST + noise CL2
MNIST + images CL1
MNIST + images CL2
Appendix B Further results
Further experimental results are presented, including a pruning-based comparison of the sparsity of the HIBNN with Sparse Variational Dropout Molchanov et al. (2017) (Section B.1). We present the additional CL2 results for Permuted MNIST (Section B.2), and for Split MNIST and its variants (Section B.3). Finally, we detail all results using the CIFAR10 dataset, expanding on the results presented in the main paper and also showing the results for CL2 (Section B.4). All results are an average and standard error of 5 runs unless otherwise mentioned. For all boxplots, ‘active’ means that for a data point and for neuron . The test error is computed as .
B.1 Comparison of HIBP and Sparse Variational Dropout
Using the IBP and HIBP priors for model selection induces sparsity as a side effect. Other priors such as Sparse Variational Dropout (SVD) Molchanov et al. (2017) or a horseshoe prior on weights Louizos et al. (2017) are specifically employed for sparsity. By comparing the effect of weight pruning between a HIBNN and a BNN employing SVD, we can see that the HIBNN has a similar sparsity and pruning performance to SVD. The accuracies obtained from SVD prior to pruning are compared to for the HIBNN due to variational approximations in the HIBP posterior. In terms of pruning, both methods have similar performance with SVD being slightly more robust; performance starts to degrade after of weights are pruned in comparison to the HIBNN’s , see Figure 10.
The SVD BNN uses a two-layer BNN with 200 neurons and the HIBNN uses the HIBP prior with a variational truncation of for a fair comparison. The HIBP prior parameters and the initialisation of the variational posterior are and for all , the hyperparameter for all layers . The Concrete temperatures used are for the variational posterior and for the Concrete prior. The prior on the Gaussian weights was set to ; these parameters were found to work well on split MNIST and were assumed to work well for multiclass MNIST too.
Both networks are optimised using Adam Kingma & Lei Ba (2015) with a decaying learning rate schedule starting at and decaying at a rate of every steps, for epochs and using a batch size of . Weight means are initialised with their ML parameters found after training for epochs and . Local reparameterisation Kingma et al. (2015) is employed. SVD is trained for epochs while our method is trained for epochs, as it requires more epochs to converge.
Dataset  Train set size  Test set size 

Permuted MNIST  
Split MNIST  
Split MNIST + noise  
Split MNIST + images  
CIFAR 10 
B.2 Permuted MNIST
Experimental details. We summarise all dataset sizes in Table 3. No preprocessing of the data is performed, as in Nguyen et al. (2018). For the permuted MNIST experiments, the BNNs used for the baselines and the IBNN consist of a single layer with ReLU activations. For the IBNN, the variational truncation parameter is set to . For the first task the parameters of the Beta prior and the variational Beta distribution are initialised to and for all . The temperature parameters of the Concrete distributions for the variational posterior and priors are set to and respectively. For each batch, samples are averaged from the IBP priors as this made training stable. The Gaussian weights of the BNNs have their means initialised with the maximum likelihood estimators found after training for 100 epochs with Adam, and their variances initialised to for the first task. For CL, Adam is also used, for 200 epochs with an initial learning rate of 0.001 which decays exponentially with a rate of 0.87 every 1000 iterations for each task. The CL experiments follow the implementation of VCL Nguyen et al. (2018).
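The learning rate schedule described above can be written as a short function. This is a minimal sketch using the stated values (initial rate 0.001, decay rate 0.87, every 1000 iterations). Whether the decay is applied continuously or in discrete steps is not specified in the text, so the `staircase` flag and the function name are assumptions made for this example.

```python
def decayed_learning_rate(step, initial_lr=0.001, decay_rate=0.87,
                          decay_steps=1000, staircase=False):
    """Exponentially decayed learning rate at a given training iteration."""
    if staircase:
        exponent = step // decay_steps   # decay applied in discrete jumps
    else:
        exponent = step / decay_steps    # decay applied smoothly
    return initial_lr * decay_rate ** exponent
```

Under either variant the rate after 1000 iterations is 0.87 times the initial rate; the two variants differ only between multiples of the decay interval.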
Results CL2. The results for task-incremental learning, i.e. CL1, are presented in the main paper. The CL2 performances of the VCL baselines and the IBNN are similar to the CL1 scenario. The results in Figure 11 show that limited model capacity in VCL leads to poor performance. In contrast, the IBNN, which adapts according to the data, eventually expands to the same size as the VCL baselines with 50 neurons, see Figure 12. However, our model slightly underperforms VCL for a well-chosen network size (h=) due to the approximations in the variational posterior.
B.3 Split MNIST and Variants
Experimental details. The BNNs used for the baselines and the IBNN consist of a single layer with ReLU activations. The variational truncation parameter is set to . For CL1 and for the first task, the parameters of the Beta prior and the variational Beta distribution are initialised to and for all . The temperature parameters of the Concrete distributions for the variational posterior and priors are set to and , respectively. The prior for the Gaussian weights is . The Gaussian weights of the BNNs have their means initialised with the maximum likelihood estimators, found after training a NN for 100 epochs with Adam. The variances are initialised to . The Adam optimiser is used for epochs, the default in the original VCL paper Nguyen et al. (2018) for Split MNIST. To ensure convergence of the IBNN, the first task needs to be trained for more epochs. An initial learning rate of which decays exponentially with a rate of every iterations is employed.
Results CL1. For CL on split MNIST, the IBNN is able to outperform the 5 neuron VCL baseline, as the latter underfits. On the other hand, the 50 and 100 neuron VCL networks slightly outperform the IBP prior BNN for later tasks, see Figure 13. The IBP prior enables the BNN to expand from a median of 11 neurons for the first task to 14 neurons for later tasks, see Figure 14. Regarding CL on MNIST with random background noise, the IBNN outperforms all the VCL benchmarks for all tasks. The baseline models overfit on the second task and propagate a poor approximate posterior which affects subsequent task performance, see Figure 13. The IBNN expands over the course of the 5 tasks from a median of 11 neurons to 14 neurons in Figure 14.
Results CL2. In general, the performance of the IBNN shows little difference versus VCL for CL2, see Figure 15. There is also a dip in performance compared to CL1 for all models. Model capacity does not obviously affect performance for CL2 (this is not the case for permuted MNIST, where model capacity does have an effect), see Table 2. The IBNN expands continuously over the five tasks for split MNIST and the variant with background images. There is no expansion for split MNIST with background noise, however: the IBP selects a larger model due to the variational Beta parameter initialised to , see Figure 16.
B.4 Split CIFAR10
We provide details of the experimental setup and further results on the split CIFAR10 CL experiments.
Experimental details. Two-layer VCL networks are used as baselines with different hidden state sizes and ReLU activations; the HIBNN and IBNN models also use two layers and . All hyperparameters are set to the same as for the MNIST experiments except that for the first task, to encourage more neurons to become active, as this is a harder vision task. The IBNN model uses the same prior parameters as the HIBNN. The parameters used are the same for CL1 and CL2 experiments. Images are flattened to a vector as inputs.
Results CL1. The HIBNN used for CL outperforms all the VCL baselines with different hidden state sizes. The larger baseline models fail to perform well on these tasks despite using more parameters, see Figure 18. The HIBNN expands from a median of neurons per layer to a median of around neurons from tasks 1 to 5, see Figure 9 in the main paper. As is characteristic of the HIBP variational posterior, the numbers of active neurons in each layer are very similar. The IBNN model outperforms the baselines too, but does not perform as well as the HIBNN model, probably due to (1) the more irregular structure inferred by the separate IBP variational posteriors and (2) the lack of information sharing in the IBNN, see Figure 9 in the main paper.
Results CL2. The HIBNN outperforms all VCL baselines. As for CL1, the VCL baselines struggle to learn. There is a performance gap between CL1 and CL2 (as for the split MNIST CL tasks) and the HIBNN outperforms the IBNN model, see Table 4 and Figure 18. The HIBNN allocates an optimal amount of resources for the first task and retains its size for the duration of CL. We specify the model parameters for HIBNN as and for all layers .
As a Bayesian nonparametric model, we can influence the variational posterior through our prior belief about the data. We show that the hyperparameters and prior , when specified larger than usual, can produce larger models, as shown in Figure 17.
VCL h20  VCL h100  VCL h400  IBP  HIBP
CIFAR10 CL1
CIFAR10 CL2
Footnotes
 is usually shown in left-ordered form; however, since the inference procedure is based on the stick-breaking construction, the order is meaningful and sorted according to Xu et al. (2019).
 The data is obtained from http://bit.ly/2SYZX99
References
 Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Ahn, H., Lee, D., Cha, S., and Moon, T. Uncertainty-based Continual Learning with Adaptive Regularization. In Neural Information Processing Systems, 2019.
 Benjamin, A. S., Rolnick, D., and Kording, K. P. Measuring and Regularizing Networks in Function Space. In International Conference on Learning Representations, 2019.
 Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight Uncertainty in Neural Networks. In International Conference on Machine Learning, 2015.
 Chatzis, S. P. Indian Buffet Process deep generative models for semisupervised classification. Technical report, 2018.
 Chaudhry, A., Dokania, P. K., Ajanthan, T., and Torr, P. H. S. Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence. In European Conference on Computer Vision (ECCV), 2018.
 Doshi, F., Miller, K., Van Gael, J., and Teh, Y. W. Variational inference for the Indian Buffet Process. In Artificial Intelligence and Statistics, pp. 137–144, 2009.
 Doshi-Velez, F., Miller, K. T., Van Gael, J., and Teh, Y. W. Variational Inference for the Indian Buffet Process. Technical report, 2009.
 Farquhar, S. and Gal, Y. Towards Robust Evaluations of Continual Learning. In Lifelong Learning: A Reinforcement Learning Approach workshop, ICML, 2018.
 Figurnov, M., Mohamed, S., and Mnih, A. Implicit Reparameterization Gradients. In Neural Information Processing Systems, 2018.
 Ghosh, S., Yao, J., and Doshi-Velez, F. Model Selection in Bayesian Neural Networks via Horseshoe Priors. Journal of Machine Learning Research, 20(182):1–46, 2019.
 Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. 2015.
 Griffiths, T. L. and Ghahramani, Z. The Indian Buffet Process: An Introduction and Review. Journal of Machine Learning Research, 12:1185–1224, 2011.
 Gupta, S. K., Phung, D., and Venkatesh, S. A Slice Sampler for Restricted Hierarchical Beta Process with Applications to Shared Subspace Learning. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 316–325, 2012.
 Hoffman, M. D. and Blei, D. M. Structured Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics, 2015.
 Hsu, Y.-C., Liu, Y.-C., Ramasamy, A., and Kira, Z. Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines. In Continual Learning Workshop, 32nd Conference on Neural Information Processing Systems, 2018.
 Huszár, F. On Quadratic Penalties in Elastic Weight Consolidation. Technical report, 2017.
 Jang, E., Gu, S., and Poole, B. Categorical Reparametrization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.
 Jankowiak, M. and Obermeyer, F. Pathwise derivatives beyond the reparameterization trick. In International Conference on Machine Learning, pp. 2235–2244, 2018.
 Kingma, D. P. and Lei Ba, J. ADAM: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
 Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
 Kingma, D. P., Salimans, T., and Welling, M. Variational Dropout and the Local Reparameterization Trick. In Neural Information Processing Systems, 2015.
 Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. ISSN 0027-8424. doi: 10.1073/pnas.1611835114.
 Lee, S., Ha, J., Zhang, D., and Kim, G. A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning. In International Conference on Learning Representations, 2020.
 Louizos, C., Ullrich, K., and Welling, M. Bayesian Compression for Deep Learning. In Neural Information Processing Systems, 2017.
 Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In International Conference on Learning Representations, 2017.
 Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte Carlo gradient estimation in machine learning. arXiv:1906.10652, 2019.
 Molchanov, D., Ashukha, A., and Vetrov, D. Variational Dropout Sparsifies Deep Neural Networks. In International Conference on Machine Learning, 2017.
 Nalisnick, E. and Smyth, P. Stick-Breaking Variational Autoencoders. In International Conference on Learning Representations, 2017.
 Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational Continual Learning. In International Conference on Learning Representations, 2018.
 Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Emtiyaz Khan, M. Practical Deep Learning with Bayesian Principles. In Neural Information Processing Systems, 2019.
 Ranganath, R., Gerrish, S., and Blei, D. M. Black Box Variational Inference. In Artificial Intelligence and Statistics, 2014.
 Rao, D., Visin, F., Rusu, A. A., Teh, Y. W., Pascanu, R., and Hadsell, R. Continual Unsupervised Representation Learning. In Neural Information Processing Systems, 2019.
 Ritter, H., Botev, A., and Barber, D. Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting. In Neural Information Processing Systems, 2018.
 Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive Neural Networks. arXiv:1606.04671, 2016.
 Schwarz, J., Luketina, J., Czarnecki, W. M., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. Progress and Compress: A scalable framework for continual learning. In International Conference on Machine Learning, 2018.
 Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual Learning with Deep Generative Replay. In Neural Information Processing Systems, 2017.
 Singh, R., Ling, J., and Doshi-Velez, F. Structured Variational Autoencoders for the Beta-Bernoulli Process. In Workshop on Advances in Approximate Bayesian Inference, Neural Information Processing Systems, 2017.
 Swaroop, S., Nguyen, C. V., Bui, T. D., and Turner, R. E. Improving and Understanding Variational Continual Learning. In Continual Learning workshop, Neural Information Processing Systems, 2018.
 Teh, Y. W., Görür, D., and Ghahramani, Z. Stick-breaking Construction for the Indian Buffet Process. In International Conference on Artificial Intelligence and Statistics, 2007.
 Thibaux, R. and Jordan, M. I. Hierarchical Beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pp. 564–571, 2007.
 Titsias, M. K., Schwarz, J., Matthews, A. G. d. G., Pascanu, R., and Teh, Y. W. Functional regularisation for continual learning. International Conference on Learning Representations, 2020.
 van de Ven, G. M. and Tolias, A. S. Three scenarios for continual learning. In NeurIPS Continual Learning workshop, 2018.
 Xu, J. and Zhu, Z. Reinforced Continual Learning. In Neural Information Processing Systems, 2018.
 Xu, K., Srivastava, A., and Sutton, C. Variational Russian Roulette for Deep Bayesian Nonparametrics. In Proceedings of the 36th International Conference on Machine Learning, 2019.
 Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong Learning with Dynamically Expandable Networks. In International Conference on Learning Representations, 2018.
 Zenke, F., Poole, B., and Ganguli, S. Continual Learning Through Synaptic Intelligence. In International Conference on Machine Learning, 2017.
 Zeno, C., Golan, I., Hoffer, E., and Soudry, D. Task Agnostic Continual Learning Using Online Variational Bayes. In Bayesian Deep Learning workshop, Neural Information Processing Systems, 2018.