Lifelong Generative Modeling

Jason Ramapuram
Jason.Ramapuram@etu.unige.ch
Magda Gregorova 1, 2
magda.gregorova@hesge.ch
Alexandros Kalousis 1, 2
Alexandros.Kalousis@hesge.ch
1 University of Geneva, Switzerland
2 Haute école de gestion de Genève, HES-SO, Switzerland
Abstract

Lifelong learning is the problem of learning multiple consecutive tasks in a sequential manner, where knowledge gained from previous tasks is retained and used to aid future learning over the lifetime of the learner. It is essential to the development of intelligent machines that can adapt to their surroundings. In this work we focus on a lifelong learning approach to unsupervised generative modeling, where we continuously incorporate newly observed distributions into a learned model. We do so through a student-teacher Variational Autoencoder architecture which allows us to learn and preserve all the distributions seen so far, without the need to retain either the past data or the past models. Through the introduction of a novel cross-model regularizer, inspired by a Bayesian update rule, the student model leverages the information learned by the teacher, which acts as a probabilistic knowledge store. The regularizer reduces the effect of catastrophic interference that appears when we learn over sequences of distributions. We validate our model’s performance on sequential variants of MNIST, FashionMNIST, PermutedMNIST, SVHN and Celeb-A and demonstrate that our model mitigates the effects of catastrophic interference faced by neural networks in sequential learning scenarios.

1 Introduction

Machine learning is the process of approximating unknown functions through the observation of typically noisy data samples. Supervised learning approximates these functions by learning a mapping from inputs to a predefined set of outputs such as categorical class labels (classification) or continuous targets (regression). Unsupervised learning, on the other hand, seeks to uncover structure and patterns from the input data without any supervision. Examples of this learning paradigm include density estimation and clustering methods. Both learning paradigms make assumptions that restrict the set of plausible solutions. These assumptions are referred to as hypothesis spaces, biases or priors and aid the model in favoring one solution over another [mitchell1980need, vapnik2006estimation]. For example, the use of convolutions [lecun1995convolutional] to process images favors local structure; recurrent models [Jordan:1990:ADP:104134.104148, hochreiter1997long] exploit sequential dependencies; and graph neural networks [scarselli2008graph, kipf2016semi] assume that the underlying data can be modeled accurately as a graph.

Current state-of-the-art machine learning models typically focus on learning a single model for a single task, such as image classification [liu2018progressive, szegedy2016inception, he2016deep, DBLP:journals/corr/SimonyanZ14a, krizhevsky2012imagenet], image generation [DBLP:journals/corr/abs-1906-00446, DBLP:conf/iclr/BrockDS19, kingma2014, goodfellow2014generative], natural language question answering [devlin2019bert, radford2019language] or single game playing [vinyals2019alphastar, silver2016mastering]. In contrast, humans experience a sequence of learning tasks over their lifetimes, and are able to leverage previous learning experiences to rapidly learn new tasks. Consider learning how to ride a motorbike after learning to ride a bicycle: the task is drastically simplified through the use of prior learning experience. Studies in psychology [ahn1993psychological, ahn1987schema, lake2015human] have shown that humans are able to generalize to new concepts in a rapid manner, given only a handful of samples. [lake2015human] demonstrates that humans can classify and generate new concepts of two-wheeled vehicles given just a single related sample. This contrasts with the state-of-the-art machine learning models described above, which use hundreds of thousands of samples and fail to generalize to slight variations of the original task [cobbe2019quantifying].

Lifelong learning [thrun1995lifelong, thrun1995lifelong2] argues for the need to consider learning over task sequences, where learned task representations and models are stored over the entire lifetime of the learner and can be used to aid current and future learning. This form of learning allows for the transfer of previously learned models and representations and can reduce the sample complexity of the current learning problem [thrun1995lifelong]. In this work we restrict ourselves to a subset of the broad lifelong learning paradigm; rather than focus on the supervised lifelong learning scenario, as most state-of-the-art methods do, our work is one of the first to tackle the more challenging problem of deep lifelong unsupervised learning. We also identify and relax crucial limitations of prior work in lifelong learning that requires the storage of previous models and training data, allowing us to operate in a more realistic learning scenario.

2 Related Work

The idea of learning in a continual manner has been explored extensively in machine learning, seeded by the seminal works of lifelong learning [thrun1995lifelong, thrun1995lifelong2, silver2013lifelong], online learning [fiat1998online, blum1998line, bottou1998online, bottou2004large] and sequential linear Gaussian models [Roweis1999-ud, ghahramani2000online] such as the Kalman Filter [kalman1960new] and its non-linear counterpart, the Particle Filter [del1996non]. Lifelong learning bears some similarities to online learning in that both learning paradigms observe data in a sequential manner. Online learning differs from lifelong learning in that the central objective of a typical online learner [bottou1998online, bottou2004large] is to best solve/fit the current learning problem, without preserving previous learning. In contrast, lifelong learners seek to retain, and reuse, the learned behavior acquired over past tasks, and aim to maximize performance across all tasks. Consider the example of forecasting click-through rate: the objective of the online learner is to evolve over time, such that it best represents current user preferences. This contrasts with lifelong learners, which enforce a constraint between tasks to ensure that previous learning is not lost. We provide a more in-depth comparison against state-of-the-art online methods such as Streaming Variational Bayes (SVB) [DBLP:conf/nips/BroderickBWWJ13] and incremental Bayesian clustering methods [katakis2008incremental, gomes2008incremental] in Appendix Section 10.4.

Lifelong learning [thrun1995lifelong] was initially proposed in a supervised learning framework for concept learning, where each task seeks to learn a particular concept/class using binary classification. The original framework used a task specific model, such as K Nearest Neighbors (KNN; these models were known as memory based learning in [thrun1995lifelong]), coupled with a representation learning network that used training data from all past learning tasks (support sets) to learn a common, global representation. This supervised approach was later improved through the use of dynamic learning rates [silver1996parallel], core-sets [silver2015consolidation] and multi-head classifiers [fei2016learning]. In parallel, lifelong learning was extended to independent multi-task learning [ruvolo2013ella, fei2016learning], reinforcement learning [thrun1995lifelong2, tanaka1997approach, ring1997child], topic modeling [chen2014topic, wang2016mining] and semi-supervised language learning [mitchell2015never, mitchell2018never]. For a more detailed review see [chen2016lifelong].

More recently, lifelong learning has seen a resurgence within the framework of deep learning. As mentioned earlier, one of the central tenets of lifelong learning is that the learner should perform well over all observed tasks. Neural networks, and more generally, models that learn using stochastic gradient descent [robbins1951stochastic], typically cannot persist past task learning without directly preserving past models or data. This problem of catastrophic forgetting [mccloskey1989catastrophic] is well known in the neural network community and is the central obstacle that needs to be resolved to build an effective neural lifelong learner. Catastrophic forgetting is the phenomenon where the parameters of a neural network trained in a sequential manner become biased towards the distribution of the latest observations, forgetting previously learned representations over data that is no longer accessible for training. In order to mitigate catastrophic forgetting, current research in lifelong learning employs four major strategies: transfer learning, replay mechanisms, parameter regularization and functional regularization. In Table 1 we classify the different lifelong learning methods that we will discuss in the following paragraphs into these strategies.

Methods (columns): EWC [kirkpatrick2017overcoming], VCL [nguyen2018variational], LwF [li2016learning], ALTM [furlanello2016active], PNN [rusu2016progressive], DGR [shin2017continual, kamra2017deep], DBMNN [terekhov2015knowledge], SI [zenke2017continual], VASE [achille2018life], LGM (us).
Mitigation strategies (rows): Transfer learning, Replay mechanisms, Parameter regularization, Functional regularization.

Table 1: Catastrophic interference mitigation strategies of state-of-the-art models. Rows highlighted in gray represent desirable mitigation strategies.

Transfer learning: These approaches mitigate catastrophic forgetting by freezing previous task models and relaying a latent representation of the previous task to the current model. Research in transfer learning for the mitigation of catastrophic forgetting includes Progressive Neural Networks (PNN) [rusu2016progressive] and Deep Block-Modular Neural Networks (DBMNN) [terekhov2015knowledge], to name a few. These approaches allow the current model to adapt its parameters to the (new) joint representation in an efficient manner and prevent forgetting through the direct preservation of all previous task models. Due to this, transfer learning mitigation approaches are unsuitable candidates for long-lifetime learners (e.g. [mitchell2015never, mitchell2018never]) or resource-constrained learners such as those on embedded devices. Deploying such a transfer learning mechanism in a lifelong learning setting would necessitate training a new model with every new task, considerably increasing the memory footprint of the lifelong learner. In addition, since transfer learning approaches freeze previous models, they negate the possibility of improving previous task performance using knowledge gathered from new tasks.

Replay mechanisms: The original formulation of lifelong learning [thrun1995lifelong] required the preservation of all previous task data. This requirement was later relaxed in the form of core-sets [silver2002task, silver2015consolidation, nguyen2018variational], which represent small weighted subsets of inputs that approximate the full dataset. Recently, within the classification setting, there have been replay approaches that try to lift the requirement of storing past training data by relying on generative modeling [shin2017continual, kamra2017deep]; we will call such methods Deep Generative Replay (DGR) methods. DGR methods use a student-teacher network architecture, where the teacher (generative) model augments the student (classifier) model with synthetic samples from previous tasks. These synthetic task samples are used in conjunction with real samples from the current task to learn a new joint model across all tasks. While strongly motivated by biological rehearsal processes [skaggs1996replay, johnson2007neural, karlsson2009awake, schuck2019sequential], these generative replay strategies fail to efficiently use previous learning and simply re-learn each new joint task from scratch.

Parameter regularization: Most work that mitigates catastrophic forgetting falls under the umbrella of parameter regularization. There are two approaches within this mitigation strategy: constraining the parameters of the new task to be close to those of the previous task through a predefined metric, and enforcing task-specific parameter sparsity. The two approaches are related, as task-specific parameter sparsity can be perceived as a refinement of the parameter constraining approach. Parameter constraining approaches typically share the same model/parameters, but discourage new tasks from altering important learned parameters from previous tasks. Task-specific parameter sparsity relaxes this by enforcing that each task use a different subset of parameters from a global model, through the use of an attention mechanism.

Models such as Laplace Propagation [eskin2004laplace], Elastic Weight Consolidation (EWC) [kirkpatrick2017overcoming], Synaptic Intelligence (SI) [zenke2017continual] and Variational Continual Learning (VCL) [nguyen2018variational] fall under the parameter constraining approach. EWC, for example, uses the Fisher Information matrix (FIM) to control the change of model parameters between two learning tasks. Intuitively, important parameters should not have their values changed, while non-important parameters are left unconstrained. The FIM is used as a weighting in a quadratic parameter difference regularizer under a Gaussianity assumption of the parameter posterior. However, this Gaussian parameter posterior assumption has been demonstrated [Neal1995-dx, blundell2015weight] to be sub-optimal for learned neural network parameters. VCL improves upon EWC by generalizing the local assumption of the FIM to a KL divergence between the (variational) parameter posterior and prior. This generalization derives from the fact that the FIM can be cast as a KL divergence between the posterior and an epsilon perturbation of the same random variable [jeffreys1946invariant]. VCL actually spans a number of different mitigation strategies, as it uses parameter regularization (described above), transfer learning (it keeps a separate head network per task) and replay (it persists a core-set of true data per task).
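
As a rough illustration of the parameter constraining idea, the sketch below computes a quadratic parameter difference penalty weighted by a diagonal approximation of the FIM. The function name `ewc_penalty`, the strength `lam`, the diagonal approximation and the numeric values are assumptions made for this example; they are not taken from the cited works.

```python
def ewc_penalty(params, old_params, fisher_diag, lam=1.0):
    """Quadratic penalty: 0.5 * lam * sum_i F_i * (theta_i - theta_old_i)^2.

    fisher_diag holds one (hypothetical) diagonal Fisher value per parameter.
    """
    if not (len(params) == len(old_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher_diag)
    )


# Hypothetical values: the first parameter is "important" (large Fisher value),
# the second one is not.
old = [1.0, -2.0]
fisher = [10.0, 0.1]
move_important = ewc_penalty([1.5, -2.0], old, fisher)
move_unimportant = ewc_penalty([1.0, -1.5], old, fisher)
```

Moving the important parameter by 0.5 is penalized a hundred times more strongly than moving the unimportant one by the same amount, which is the behavior the paragraph above describes.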

Models such as Hard Attention to the Task (HAT) [serra2018overcoming] and the Variational Autoencoder with Shared Embeddings (VASE) [achille2018life] fall under the task-specific parameter sparsity strategy. This mitigation strategy enforces that different tasks use different components of a single model, typically through the use of attention vectors [bahdanau2014neural] that are learned given supervised task labels. Multiplying the attention vectors with the model outputs prevents gradient descent updates for different subsets of the model’s parameters, allowing them to be used for future task learning. Task specific parameter sparsity allows a model to hold-out a subset of its parameters for future learning and typically works well in practice [serra2018overcoming], with its strongest disadvantage being the requirement of supervised information.

Functional regularization: Parameter regularization methods attempt to preserve the learned behavior of the past models by controlling how the model parameters change between tasks. However, the model parameters are only a proxy for the way a model actually behaves. Models with very different parameters can have exactly the same behavior with respect to input-output relations (non-uniqueness [williamson1995existence]). Functional regularization concerns itself with preserving the actual object of interest: the input-output relations. This strategy allows the model to flexibly adapt its internal parameter representation between tasks, while still preserving past learning.

Methods such as distillation [hinton2015distilling], ALTM [furlanello2016active] and Learning Without Forgetting (LwF) [li2016learning] impose similarity constraints on the classification outputs of models learned over different tasks. This can be interpreted as functional regularization by generalizing the constraining metric (or semi-metric) to be a divergence on the output conditional distribution. In contrast to parameter regularization, no assumptions are made on the parametric form of the parameter posterior distribution. This allows models to flexibly adapt their internal representation as needed, making functional regularization a desirable mitigation strategy. One of the pitfalls of current functional regularization approaches is that they necessitate the preservation of all previous data.
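
To make the idea of a divergence on the output distribution concrete, the following minimal sketch compares the class probabilities of an old and a new model on a single input. The helper names (`softmax`, `kl_categorical`, `functional_penalty`), the choice of the KL divergence and the logit values are illustrative assumptions, not the formulation of any of the cited methods.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def functional_penalty(old_logits, new_logits):
    """Divergence between the output distributions of the old and new model."""
    return kl_categorical(softmax(old_logits), softmax(new_logits))


old_logits = [2.0, 0.5, -1.0]
# Different internal values (all logits shifted by 1), identical outputs.
shifted = functional_penalty(old_logits, [3.0, 1.5, 0.0])
# Genuinely different input-output behavior.
different = functional_penalty(old_logits, [-1.0, 0.5, 2.0])
```

The shifted logits mimic the non-uniqueness mentioned above: the internal values differ but the output distribution is unchanged, so the functional penalty is (numerically) zero, whereas a parameter-space penalty would not be.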

2.1 Limitations of existing approaches.

A simple solution to the problem of lifelong learning is to store all data and re-learn a new joint multi-task representation [caruana1997multitask] at each newly observed task. Alternatively, it is possible to retain all previous model parameters and select the model that presents the best performance on new test task data. Existing solutions typically relax one of these requirements. The methods in [furlanello2016active, li2016learning, nguyen2018variational] relax the need for model persistence, but require the preservation of all data [furlanello2016active, li2016learning], or a growing core-set of data [silver2002task, silver2015consolidation, nguyen2018variational]. Conversely, the methods in [rusu2016progressive, terekhov2015knowledge, nguyen2018variational, zenke2017continual] relax the need to store data, but persist all previous models [rusu2016progressive, terekhov2015knowledge] or a subset of model parameters [nguyen2018variational, zenke2017continual].

Unlike these approaches, we draw inspiration from how humans learn over time and remove the requirement of storing past training data and models. Consider the human visual system; research has shown [curcio1990human, blackwell1946contrast] that the human eye is capable of capturing 576 megapixels of content per image frame. If stored naively on a traditional computer, this corresponds to approximately 6.9 gigabytes of information per sample. Given that we perceive trillions of frames over our lifetimes, it is infeasible to store this information in its base, uncompressed representation. Research in neuroscience has validated [wittrock1992generative, anderson2014human] that the associative human mind compresses, merges and reconstructs information content in a dynamic way. Motivated by this, we believe that a lifelong learner should not store past training data or models. Instead, it should retain a latent representation that is common over all tasks and evolve it as more tasks are observed.
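
The storage figure quoted for the human visual system can be reproduced with a back-of-the-envelope calculation. The text does not state how the 6.9 gigabytes were derived; the sketch below assumes three color channels stored as 4-byte values, which happens to match the quoted number.

```python
MEGAPIXELS = 576       # from the text
CHANNELS = 3           # assumption: three color channels
BYTES_PER_VALUE = 4    # assumption: 32-bit values

bytes_per_frame = MEGAPIXELS * 10**6 * CHANNELS * BYTES_PER_VALUE
gigabytes_per_frame = bytes_per_frame / 10**9
print(f"{gigabytes_per_frame:.1f} GB per frame")  # prints "6.9 GB per frame"
```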

2.2 Our solution at a high level.

While most research in lifelong learning focuses on supervised learning [rusu2016progressive, furlanello2016active, li2016learning, terekhov2015knowledge, zenke2017continual, kirkpatrick2017overcoming], we focus on the more challenging task of deep unsupervised latent variable generative modeling. These models have wide ranging applications such as clustering [makhzani2015adversarial, jiang2017variational, nalisnick2017stick] and pre-training [larsen2016autoencoding, radford2015unsupervised].

Central to our lifelong learning method is a pair of generative models, aptly named the teacher and student, which we train by exploiting the replay and functional regularization strategies described above. Once a single generative model has been trained over the first task, it is used as the teacher for a newly instantiated student model. The student model receives data from the current task, as well as replayed data from the teacher, which acts as a probabilistic storage container of past tasks. When the student's training is completed, it in turn becomes the teacher, containing everything learned so far. In order to preserve previous learning, we make use of functional regularization, which aids in preserving input-output relations over past tasks.
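
The data flow between teacher and student can be sketched as follows. This is a toy stand-in under stated assumptions: the teacher is any zero-argument sampler rather than a trained VAE, the helper `make_student_batch` and its fixed `replay_fraction` are hypothetical, and how real and replayed samples are actually balanced is not specified in this section.

```python
import random


def make_student_batch(current_data, teacher_sample, batch_size, replay_fraction, rng):
    """Mix real samples of the current task with samples replayed by the teacher.

    teacher_sample: zero-argument callable returning one synthetic sample.
    replay_fraction: share of the batch drawn from the teacher (an assumption).
    """
    n_replay = int(round(batch_size * replay_fraction))
    n_real = batch_size - n_replay
    real = [(rng.choice(current_data), "real") for _ in range(n_real)]
    replayed = [(teacher_sample(), "replay") for _ in range(n_replay)]
    batch = real + replayed
    rng.shuffle(batch)
    return batch


rng = random.Random(0)


def teacher():
    """Stand-in teacher: the past task is a 1-D Gaussian centered at 0."""
    return rng.gauss(0.0, 1.0)


# Stand-in data for the current task: a 1-D Gaussian centered at 5.
current = [rng.gauss(5.0, 1.0) for _ in range(100)]
batch = make_student_batch(current, teacher, batch_size=8, replay_fraction=0.5, rng=rng)
```

No past data is stored in this sketch: everything the student sees about earlier tasks comes from calls to the teacher's sampler.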

Unlike EWC or VCL, we make no assumptions on the form of the parameter posterior and allow the generative models to use available parameters as appropriate, to best accommodate current and past learning. The use of generative replay, coupled with functional regularization, renders the preservation of the past models and past data unnecessary. It also significantly improves the sample complexity on future task learning, which we empirically demonstrate in our experiments. Finally we should note that it is straightforward to adapt our approach to the supervised learning setting, as we have done in [DBLP:journals/corr/abs-1810-10612].

3 Background

In this section we describe the main concepts that we use throughout this work. We begin by describing the base-level generative modeling approach in Section 3.1, followed by how it extends to the lifelong setting in Section 3.2. Finally, in Section 3.3, we describe the Variational Autoencoder over which we instantiate our lifelong generative model.

3.1 Latent Variable Generative Modeling

We consider a scenario where we observe a dataset, $\mathbf{X}$, consisting of $N$ variates, $\mathbf{X} = \{\mathbf{x}_j\}_{j=1}^{N}$, of a continuous or discrete variable $\mathbf{x}$. We assume that the data is generated by a random process involving a non-observed random variable, $\mathbf{z}$. The data generation process involves first sampling $\mathbf{z} \sim P(\mathbf{z})$ and then producing a variate from the conditional, $\mathbf{x} \sim P_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. We visualize this form of latent generative model in the graphical model in Figure 1.

Figure 1: Typical latent variable graphical model. Gray nodes represent observed variables while white nodes represent unobserved variables.

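Typically, latent variable models are solved through maximum likelihood estimation, which can be formalized as:

$$\max_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}(\mathbf{x}) = \max_{\boldsymbol{\theta}} \log \int P_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) P(\mathbf{z}) \, d\mathbf{z} = \max_{\boldsymbol{\theta}} \log \mathbb{E}_{P(\mathbf{z})}\left[ P_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \right] \tag{1}$$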

In many cases, the expectation from Equation 1 does not have a closed form solution (e.g. non-conjugate distributions) and quadrature is not computationally tractable in high-dimensional spaces such as images [kingma2014, rezende2014stochastic]. To overcome these intractabilities we use a Variational Autoencoder (VAE), which we summarize in Section 3.3. The VAE allows us to infer our latent variables and jointly estimate the parameters of our model. However, before describing the VAE, it is important to understand how this generative setting can be perceived in a lifelong learning scenario.

3.2 Lifelong Generative Modeling

Lifelong generative modeling extends the single-distribution estimation task from Section 3.1 to a set of $L$ sequentially observed learning tasks. The variates of each learning task are realized from a task specific conditional distribution, in which a categorical indicator variable identifies the current task. We visualize a simplified form of this in Figure 2 below.

Figure 2: Simplified lifetime of a lifelong learner. Given a true (unknown) distribution, we observe partial information in the form of $L$ sequential tasks. Observing more tasks reduces the uncertainty of the model until convergence.

Crucially, when observing a given task, the model has no access to any of the previous task datasets. As the lifelong learner observes more tasks, it should improve its estimate of the true distribution, which is unknown at the start of training.

3.3 The Variational Autoencoder

As alluded to in Section 3.1, we would like to infer the latent variables from the data. This can be realized as an alternative form of Equation 1 in the form of Bayes' rule: $P_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}) = P_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) P(\mathbf{z}) / P_{\boldsymbol{\theta}}(\mathbf{x})$, where $P_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$ is referred to as the latent variable posterior and $P_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ as the likelihood. One method of approximating the posterior, $P_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$, is through MCMC sampling methods such as Gibbs sampling [gelfand1990sampling] or Hamiltonian MCMC [neal2011mcmc]. MCMC methods have the advantage that they provide asymptotic guarantees [DBLP:conf/uai/NeiswangerWX14] of convergence to the true posterior, $P_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$. However, in practice it is not possible to know when convergence has been achieved. In addition, due to their Markovian nature, they possess an inner loop, which makes them challenging to scale to large datasets.

In contrast, Variational Inference (VI) [jordan1999introduction] side-steps the intractability of the posterior by approximating it with a tractable distribution family, . VI rephrases the objective of determining the posterior as an optimization problem by minimizing the KL divergence between the known distributional family, , and the unknown true posterior, . Applying VI to the intractable integral from Equation 1 results in the evidence lower bound (ELBO) or variational free energy, which can easily be derived from first principles:

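$$\log P_{\theta}(x) = \log \int P_{\theta}(x \mid z)\, P(z)\, dz \qquad (2)$$
$$= \log \mathbb{E}_{Q_{\phi}(z \mid x)}\left[\frac{P_{\theta}(x \mid z)\, P(z)}{Q_{\phi}(z \mid x)}\right] \qquad (3)$$
$$\geq \mathbb{E}_{Q_{\phi}(z \mid x)}\big[\log P_{\theta}(x \mid z)\big] - KL\big[Q_{\phi}(z \mid x)\,\|\,P(z)\big] \qquad (4)$$

where $Q_{\phi}(z \mid x)$ denotes the approximate posterior, $P_{\theta}(x \mid z)$ the likelihood and $P(z)$ the prior, and where we used Jensen’s inequality to transition from Equation 3 to Equation 4. The objective introduced in Equation 4 induces the graphical model shown below in Figure 3.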

Figure 3: Standard VAE graphical model. Gray nodes represent observed variables while white nodes represent unobserved variables; dashed lines represent inferred variables.

VAEs typically use deep neural networks to model the approximate inference network and the conditional likelihood, which are also known as the encoder and decoder networks, respectively. To optimize for the parameters of these networks, VAEs maximize the ELBO (Equation 4) using Stochastic Gradient Descent [robbins1951stochastic]. By sharing the variational parameters of the encoder across the data points (amortized inference [gershman2014amortized]), variational autoencoders avoid the per-data inner loops typically needed by MCMC approaches.

Optimizing the ELBO in Equation 4 requires computing the gradient of an expectation over the approximate posterior. This typically takes place through the use of the path-wise estimator [rezende2014stochastic, kingma2014] (originally called “push-out” [rubinstein1992sensitivity]). The path-wise reparameterizer uses the Law of the Unconscious Statistician (LOTUS) [grimmett2001probability], which enables us to compute the expectation of a function of a random variable (without knowing its distribution) if we know its corresponding sampling path and base distribution [DBLP:journals/corr/abs-1906-10652]. For the typical isotropic gaussian approximate posterior used in standard VAEs this can be aptly summarized by:

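$$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (5)$$
$$\nabla_{\phi}\, \mathbb{E}_{Q_{\phi}(z \mid x)}\big[\log P_{\theta}(x \mid z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_{\phi} \log P_{\theta}\big(x \mid \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon\big)\big] \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_{\phi} \log P_{\theta}\big(x \mid \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon_{n}\big) \qquad (6)$$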

where Equation 5 defines the sampling procedure of our latent variable through the location-scale transformation and Equation 6 defines the path-wise Monte Carlo gradient estimator applied on the decoder (first term in Equation 4). This Monte Carlo estimator enables differentiating through the sampling process of the approximate posterior. Note that computing the gradient of the second term in Equation 4, the KL divergence, is possible through a closed-form analytical solution for the case of isotropic gaussian distributions.
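The two ingredients described above can be illustrated on a scalar toy problem. The sketch below is only an illustration under stated assumptions: the function `f(z) = z^2` is a hypothetical stand-in for the decoder term, all function names are ours, and the gradient is taken with respect to the mean only. It shows the location-scale sampling path, the resulting path-wise gradient estimate, and the closed-form KL to a standard normal prior.

```python
import math
import random


def sample_latent(mu, sigma, rng):
    """Location-scale sampling: z = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


def pathwise_grad_mu(df_dz, mu, sigma, num_samples=20000, seed=0):
    """Monte Carlo estimate of d/dmu E_{z ~ N(mu, sigma^2)}[f(z)].

    Because z = mu + sigma * eps, dz/dmu = 1, so the gradient moves inside
    the expectation and becomes the average of f'(z) over sampled eps.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += df_dz(sample_latent(mu, sigma, rng))
    return total / num_samples


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, sigma^2) || N(0, 1)] for a scalar latent."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - math.log(sigma)


# Toy stand-in for the decoder term (an assumption): f(z) = z^2, so that
# E[f] = mu^2 + sigma^2 and the exact gradient with respect to mu is 2 * mu.
grad_estimate = pathwise_grad_mu(lambda z: 2.0 * z, mu=1.5, sigma=0.5)
```

In this toy setting the estimate should approach the exact value `2 * mu` as the number of samples grows, while the KL term needs no sampling at all.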

While it is possible to extend any latent variable generative model to the lifelong setting, we choose to build our lifelong generative models using variational autoencoders (VAEs) [kingma2014] as they provide a mechanism for stable training; this contrasts with other state of the art unsupervised models such as Generative Adversarial Networks (GANs) [goodfellow2014generative, kim2018disentangling]. Furthermore, latent-variable posterior approximations are a requirement in many learning scenarios such as clustering [quintana2003bayesian], compression [perlmutter1996bayes] and unsupervised representation learning [fe2003bayesian]. Finally, GANs can suffer from low sample diversity [dupont2018learning] which can lead to compounding errors in a lifelong generative setting.

4 Lifelong Learning Model

Algorithm 1 Data Flow
  Teacher:
    1. Sample from the prior.
    2. Decode.
  Student:
    1. Sample an input.
    2. Encode.
    3. Decode.
Figure 4: Student training procedure. Left: graphical model for student-teacher model. Data generated from the teacher model (top row) is used to augment the current training data observed by the student model (bottom row). A posterior regularizer is also applied between and to enable functional regularization (not shown, but discussed in detail in Section 4.1.1). Right: data flow algorithm.

fMRI studies of the rodent [skaggs1996replay, johnson2007neural, karlsson2009awake] and human [schuck2019sequential] brains have shown that previously experienced sequences of events are replayed in the hippocampus during rest. These replays are necessary for better planning [johnson2007neural] and memory consolidation [carr2011hippocampal]. We take inspiration from the memory consolidation of biological learners and introduce our model of Lifelong Generative Modeling (LGM). We visualize the LGM student-teacher architecture in Figure 4.

The student and the teacher are both instantiations of the same base-level generative model, but have different roles throughout the learning process. The teacher’s role is to act as a probabilistic knowledge store of previously learned distributions, which it transfers to the student in the form of replay and functional regularization. The student’s role is to learn the distribution over the new task, while accommodating the learned representation of the teacher over old tasks. In the following sections we provide detailed descriptions of the student-teacher architecture, as well as the base-level generative model that each of them uses. The base-level model is a variant of the VAE, which we tailor for lifelong learning and which is learned by maximizing a variant of the standard VAE ELBO from Equation 4; we describe this objective at the end of this section.

4.1 Student-teacher Architecture

The top row of Figure 4 represents the teacher model. At any given time, the teacher contains a summary of all previous distributions within the learned parameters of its encoder and decoder. We use the teacher to generate synthetic variates from these past distributions by decoding variates sampled from the prior. We pass the generated (synthetic) variates to the student model as a form of knowledge transfer about the past distributions. Information transfer in this manner is known as generative replay and our work is the first to explore it in a VAE setting.

The bottom row of Figure 4 represents the student. The student is responsible for updating the parameters of its encoder and decoder. Importantly, the student receives data from both the currently observed task, as well as synthetic data generated by the teacher. This mixture is formalized in Equation 7:

(7)

The mean of the Bernoulli distribution controls the sampling proportion of the previously learned distributions to the current one and is set based on the number of assimilated distributions, such that the samples observed by the student are representative of both the current and past distributions. Note that this does not correspond to varying sample sizes in datasets, but merely reflects our assumption to model each distribution with equivalent weighting.
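The mixing step can be sketched as follows. This is a minimal illustration rather than the exact procedure: we assume that, with a given number of previously assimilated distributions, a synthetic sample is drawn with probability `num_previous / (num_previous + 1)`, which gives every distribution equal weight. The function names and the string placeholders for samples are hypothetical.

```python
import random


def replay_probability(num_previous):
    """Bernoulli mean giving each of the (num_previous + 1) distributions equal weight.

    Assumption: synthetic (teacher) samples are drawn with probability
    num_previous / (num_previous + 1); the text only states that the mean
    depends on the number of assimilated distributions.
    """
    return num_previous / (num_previous + 1)


def student_batch(real_samples, teacher_generate, num_previous, batch_size, rng):
    """Build one student batch by mixing teacher replay with current-task data."""
    pi = replay_probability(num_previous)
    batch = []
    for _ in range(batch_size):
        if rng.random() < pi:
            batch.append(teacher_generate())        # synthetic sample from a past task
        else:
            batch.append(rng.choice(real_samples))  # real sample from the current task
    return batch


rng = random.Random(0)
batch = student_batch(
    real_samples=["real"],
    teacher_generate=lambda: "synthetic",
    num_previous=3,
    batch_size=10000,
    rng=rng,
)
synthetic_fraction = batch.count("synthetic") / len(batch)
```

With three previous distributions, roughly three quarters of the batch is synthetic; with none, the student sees only real data.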

Once a new task is observed, the old teacher is dropped, and the student model is frozen and becomes the new teacher. A new student is then instantiated with the latest encoder and decoder weights from the previous student (the new teacher). Due to the cyclic nature of this process, no new models are added. This contrasts with many existing state of the art deep lifelong learning methods, which add an entire new model or head-network per task (e.g. [nguyen2018variational, rusu2016progressive, terekhov2015knowledge]).
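The model hand-over at a task boundary can be sketched as below. The `ToyVAE` class and `start_new_task` function are hypothetical placeholders that only track weights and a frozen flag; they are meant to show that exactly two models exist at any time, not to reproduce a particular implementation.

```python
import copy


class ToyVAE:
    """Placeholder for a VAE; `params` stands in for encoder and decoder weights."""

    def __init__(self, params):
        self.params = params
        self.frozen = False


def start_new_task(student):
    """Promote the trained student to teacher and return (teacher, new_student).

    The previous teacher is simply discarded by the caller, so exactly two
    models exist at any time.
    """
    teacher = student
    teacher.frozen = True  # the teacher is no longer updated
    new_student = ToyVAE(copy.deepcopy(teacher.params))  # warm start from teacher weights
    return teacher, new_student


student = ToyVAE({"encoder": [0.1, 0.2], "decoder": [0.3]})
teacher, student = start_new_task(student)
student.params["encoder"][0] = 9.9  # updating the student leaves the teacher intact
```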

A crucial aspect in the lifelong learning process is to ensure that previous learning is successfully exploited to bias current learning [thrun1995lifelong]. While the replay mechanism that we put in place ensures that the student will observe data from all tasks, it does not ensure that previous knowledge from the teacher is efficiently exploited to improve current student learning. The student model will re-learn (from scratch) a completely new representation, which might differ from that of the teacher. In order to successfully transfer knowledge between both VAE models, we rely on functional regularization, which we enforce through a Bayesian update regularizer of the posteriors of both models. Intuitively, we would like the student model’s latent outputs to be similar to the latent outputs of the teacher model over synthetic variates generated by the teacher. In the following section, we describe the exact functional form of this regularizer and demonstrate how it can be perceived as a natural extension of the VAE learning objective to a sequential setting.

4.1.1 Knowledge Transfer Via Bayesian Update.


While both the student and teacher are instantiations of VAE variants, tailored for the particularities of the lifelong setting, for the purpose of this exposition we use the standard VAE formulation. Our objective is to learn the set of parameters of the student, such that it can generate variates from the complete distribution described in Section 3.2. Subsuming the definition of the augmented input data from Equation 7, we can define the student ELBO as:

(8)

Rather than naively shrinking the full posterior to the prior via the KL divergence in Equation 8, we rely on one of the core tenets of the Bayesian paradigm which states that we can always update our posterior when given new information (“yesterday’s posterior is today’s prior”) [mcinerney2015population]. Given this tenet, we introduce our posterior regularizer (footnote 3: while it is also possible to apply a similar regularizer to the reconstruction term, we observed that doing so hurts performance; see Appendix 10.2):

(9)

which distills the teacher’s learnt representation into the student over the generated data only. Combining Equations 8 and 9 yields the objective that we use to train the student, described below in Equation 10:

(10)

Note that this is not the final objective, due to the fact that we have yet to present the VAE variant tailored to the particularities of the lifelong setting. We will now show how the posterior regularizer can be perceived as a natural extension of the VAE learning objective, through the lens of a Bayesian update of the student posterior.


Lemma 1

For random variables and with conditionals and , both distributed as a categorical or gaussian and parameterized by and respectively, the KL divergence between the distributions is:

(11)

where depends on the parametric form of Q, and C is only a function of the parameters, .

We prove Lemma 1 for the relevant distributions (under some mild assumptions) in Appendix 10.1. Using Lemma 1 allows us to rewrite Equation 10 as shown below in Equation 12:

(12)

This rewrite makes it easy to see that our posterior regularizer from Equation 10 is a standard VAE ELBO (Equation 4) under a reparameterization of the student parameters, . Note that is constant with respect to the student parameters, , and thus not used during optimization. While the change seems minor, it omits the introduction of which allows for a transfer of information between models. In practice, we simply analytically evaluate , the KL divergence between the teacher and the student posteriors, instead of deriving the functional form of for each different distribution pair. We present Equation 12 simply as a means to provide a more intuitive understanding of our functional regularizer.
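For intuition, the univariate gaussian case can be checked numerically. The sketch below is an illustration under our own assumptions: the reparameterization used here (standardizing the student posterior by the teacher's mean and scale) is one choice for which the KL to the teacher equals the KL of the reparameterized student to a standard normal prior, with the constant equal to zero. The general statement and its assumptions are those of Appendix 10.1 and may take a different form.

```python
import math


def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] for scalars."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )


def reparameterize_student(mu_s, sigma_s, mu_t, sigma_t):
    """Express the student posterior in the teacher's standardized coordinates.

    Assumption: this particular map is chosen for illustration only; it is
    not claimed to be the functional form derived in the appendix.
    """
    return (mu_s - mu_t) / sigma_t, sigma_s / sigma_t


mu_s, sigma_s = 0.7, 0.4   # student posterior parameters for one input
mu_t, sigma_t = -0.2, 1.3  # teacher posterior parameters for the same input

kl_student_teacher = kl_gauss(mu_s, sigma_s, mu_t, sigma_t)
mu_hat, sigma_hat = reparameterize_student(mu_s, sigma_s, mu_t, sigma_t)
kl_reparam_prior = kl_gauss(mu_hat, sigma_hat, 0.0, 1.0)
```

The two KL values agree up to floating point error, which mirrors the reading of the regularizer as a standard ELBO KL term under reparameterized student parameters.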

4.2 Base-level generative model.

While it is theoretically possible to use the vanilla VAE from Section 3.3 for the teacher and student models, doing so brings to light a number of limitations that render it problematic for use in the context of lifelong learning (visualized in Figure 5-Right). Specifically, using a standard VAE decoder to generate synthetic replay data for the student is problematic for two reasons:

  1. Mixed Distributions: Sampling the continuous standard normal prior, , can select a point in latent space that is in between two separate distributions, causing generation of unrealistic synthetic data and eventually leading to loss of previously learnt distributions.

  2. Undersampling: Data points mapped to the isotropic-gaussian posterior that are further away from the prior mean will be sampled less frequently, resulting in an undersampling of some of the constituent distributions.

Figure 5: Left: Graphical model for VAE with independent discrete and continuous posterior, . Right: Two dimensional test variates, , , of a vanilla VAE trained on MNIST. We depict the two generative shortcomings visually: 1) mixing of distributions which causes aliasing in a lifelong setting and 2) undersampling of distributions in a standard isotropic-gaussian VAE posterior.

To address these sampling limitations we decompose the latent variable, , into an independent continuous, , and a discrete component, , as shown in Equation 13 and visually in Figure 5-Left:

(13)

The objective of the discrete component is to summarize the discriminative information of the individual generative distributions. The continuous component on the other hand, caters for the remaining sample variability (a nuisance variable [louizos2015variational]). Given that the discrete component can accurately summarize the discriminative information, we can then explicitly sample from any of the past distributions, allowing us to balance the student model’s synthetic inputs with samples from all of the previous learned distributions. We describe this beneficial generative sampling property in more detail in Section 4.2.1.

Naively introducing the discrete component does not guarantee that the decoder will use it to represent the most discriminative aspects of the modeled distribution. In preliminary experiments, we observed that the decoder typically learns to ignore the discrete component and simply relies on the continuous variable. This is similar to the posterior collapse phenomenon which has received a lot of recent interest within the VAE community [razavi2019preventing, goyal2017z]. Posterior collapse occurs when training a VAE with a powerful decoder model such as a PixelCNN++ [tomczak2018vae] or RNN [chung2015recurrent, goyal2017z]. The output of the decoder can become almost independent of the posterior sample, while the decoder is still able to reconstruct the original sample by relying on its auto-regressive property [goyal2017z]. In Section 4.2.2, we introduce a mutual information regulariser which ensures that the discrete component of the latent variable is not ignored.

4.2.1 Controlled Generations.

Table (beside Figure 6): desired task conditionals for the discrete latent settings [0, 0, 1], [0, 1, 0] and [1, 0, 0].

Figure 6: FashionMNIST with tasks: t-shirts, sandals and bag. To generate samples from a given task conditional, we set the discrete latent variable to the corresponding one-hot vector, randomly sample the continuous latent variable from its prior, and run both through the decoder. Resampling the continuous variable, while keeping the discrete variable fixed, enables generation of varied samples from the task conditional. Left: Desired task conditionals. Right: Desired decoder behavior.

Given the importance of generative replay for knowledge transfer in LGM, synthetic sample generation by the teacher model needs to be representative of all the previously observed distributions in order to prevent catastrophic forgetting. Under the assumption that the discrete component accurately captures the underlying discriminativeness of the individual distributions, and through the definition of the LGM generative process, shown in Equation 14:

(14)

we can control generations by setting the discrete component to a fixed value and randomly sampling the continuous prior. This is possible because the trained decoder approximates the task conditional from Section 3.2:

(15)

where sampling the true task conditional can be approximated by sampling the continuous prior, keeping the discrete component fixed, and decoding the variates as shown in Equation 16 below:

(16)

We provide a simple example of our desired behavior for three generative tasks using Fashion MNIST in Figure 6 and Table 2 above. The assumption made up till now is that the discrete component accurately captures the discriminative aspects of each distribution. However, there is no theoretical reason for the model to impose this constraint on the latent variables. In practice, we often observe that the decoder ignores the discrete component due to the much richer representation of the continuous variable. In the following section we introduce a mutual information constraint that encourages the model to fully utilize the discrete component.
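The controlled sampling procedure of this section can be sketched as follows. This is a minimal illustration: the decoder itself is omitted, and the concatenation order of the continuous and discrete parts, the continuous dimensionality, and the mapping from one-hot index to task are all assumptions made for the example.

```python
import random


def one_hot(index, num_tasks):
    """Discrete latent selecting task `index` out of `num_tasks`."""
    return [1.0 if i == index else 0.0 for i in range(num_tasks)]


def task_conditional_latents(task_index, num_tasks, continuous_dim, num_samples, seed=0):
    """Latents for controlled generation: fixed discrete part, resampled continuous part."""
    rng = random.Random(seed)
    z_d = one_hot(task_index, num_tasks)
    latents = []
    for _ in range(num_samples):
        z_c = [rng.gauss(0.0, 1.0) for _ in range(continuous_dim)]  # prior sample
        latents.append(z_c + z_d)  # concatenated input for a (hypothetical) decoder
    return latents


# Three tasks, as in the FashionMNIST example; index 2 gives the one-hot [0, 0, 1].
latents = task_conditional_latents(task_index=2, num_tasks=3, continuous_dim=4, num_samples=5)
```

Each latent shares the same discrete part and differs only in its continuous part, so decoding them would yield varied samples from a single task, provided the discrete component carries the task information.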


4.2.2 Information restricting regularizer

As alluded to in the previous section, the synthetic samples observed by the student model need to be representative of all previous distributions. In order to control sampling via the process described in Section 4.2.1, we need to enforce that the discrete variable carries the discriminative information about each distribution. Given our graphical model from Figure 4-Left, we observe that there are two ways to accomplish this: maximize the information content between the discrete random variable and the decoded output, or minimize the information content between the continuous variable and the decoded output. Since our graphical model and underlying network do not contain skip connections, information from the input has to flow through the latent variables to reach the decoder. While both formulations can theoretically achieve the same objective, we observed that in practice, minimizing the information content of the continuous variable provided better results. We believe the reason for this is that the minimization provides the model with more subtle gradient information, in contrast to maximizing the information content of the discrete variable, which receives no gradient information when the value of the relevant element of the categorical sample is 1. We now formalize our mutual information regularizer, which we derive from first principles in Equation 17:

(17)

where we use the independence assumption of our posterior from Equation 13 and the fact that the expectation of a constant is the constant. This regularizer has parallels to the regularizer in InfoGAN [NIPS2016_6399]. In contrast to InfoGAN, VAEs already estimate the posterior and thus do not need the introduction of any extra parameters for the approximation. In addition [huszar_2016] demonstrated that InfoGAN uses the variational bound (twice) on the mutual information, making its interpretation unclear from a theoretical point of view. In contrast, our regularizer has a clear interpretation: it restricts information through a specific latent variable within the computational graph. We observe that this constraint is essential for empirical performance of our model and empirically validate this in our ablation study in Experiment 7.2.

4.3 Learning Objective

The final learning objective for each of the student models is the maximization of the sequential VAE ELBO (Equation 10), coupled with generative replay (Equation 7) and the mutual information regularizer (Equation 17):

(18)

The hyper-parameter controls the importance of the information gain regularizer. Too large a value for causes a lack of sample diversity, while too small a value causes the model to not use the discrete latent distribution. We did a random hyperparameter search and determined to be a reasonable choice for all of our experiments. This is in line with the used in InfoGAN [NIPS2016_6399] for continuous latent variables. We empirically validate the necessity of both terms proposed in Equation 18 in our ablation study in Experiment 7.2. We also validate the benefit of the latent variable factorization in Experiment 7.1. Before delving into the experiments, we provide a theoretical analysis of computational complexity induced by our model and objective (Equation 18) in Section 4.4 below.

4.4 Computational Complexity

We define the computational complexity of a typical VAE through the costs of its encoder and decoder; internally these are dominated by matrix-vector products, whose cost grows with the number of layers, L. We also define the cost of applying the loss function as the sum of the cost of evaluating the KL divergence from the ELBO (Equation 4) and the cost of evaluating the reconstruction term. Given these definitions, we can summarize LGM’s computational complexity as follows in Equation 19:

(19)

where we introduce increased computational complexity due to teacher generations, the cost of the posterior regularizer, and the mutual information terms; the latter of which necessitates an extra encode operation, . The computational complexity is still dominated by the matrix-vector product from evaluating forward functionals of the neural network. These operations can easily be amortized through parallelization on modern GPUs and typical experiments do not directly scale as per Equation 19. In our most demanding experiment (Experiment 6.6), we observe an average empirical increase of 13.53 seconds per training epoch and 6.3 seconds per test epoch.

5 Revisiting state of the art methods.

In this section we revisit some of the state of the art methods from Section 2. We begin by providing a mathematical description of the differences between EWC [kirkpatrick2017overcoming], VCL [nguyen2018variational] and LGM and follow it up with a discussion of VASE [achille2018life] and their extensions of our work.

EWC and VCL: Our posterior regularizer affects the same parameters as parameter regularizer methods such as EWC and VCL. However, rather than assuming a functional form for the parameter posterior, our method regularizes the output latent distribution. EWC and VCL both make the assumption that the parameter posterior is distributed as an isotropic gaussian (footnote 4: VCL assumes an isotropic gaussian variational form, whereas EWC directly assumes the parametric form). This allows the use of the Fisher Information Matrix (FIM) in a quadratic parameter regularizer in EWC, and an analytical KL divergence of the posterior in VCL. This is a very stringent requirement for the parameters of a neural network and there is active research in Bayesian neural networks that attempts to relax this constraint [louizos2016structured, louizos2017multiplicative, mishkin2018slang].

EWC LGM ( Isotropic Gaussian Posterior )

In the above table we examine the distance metric used to minimize the effects of catastrophic interference in both EWC and LGM. While our method can operate over any distribution that has a tractable KL-divergence, for the purposes of demonstration we examine the simple case of an isotropic gaussian latent-variable posterior. EWC directly enforces a quadratic constraint on the model parameters, while our method indirectly affects the same parameters through a regularization of the posterior distribution. For any given input variate, LGM allows the model to freely change its internal parameters; it does so in a non-linear way (footnote 5: this is because the parameters of the distribution are modeled by a deep neural network) such that the analytical KL shown above is minimized.

VASE: The recent work Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies (VASE) [achille2018life] extends our work [achille2018life, p. 7], but takes a more empirical route by incorporating a classification-based heuristic for its posterior distribution. In contrast, we show (Section 4.1.1) that our objective naturally emerges in a sequential learning setting for VAEs, allowing us to infer the discrete posterior in an unsupervised manner. Due to the incorporation of direct supervised class information, [achille2018life] also observe that regularizing the decoding distribution aids in the learning process, something that we observe to fail in a purely unsupervised generative setting (Appendix Section 10.2). Finally, in contrast to [achille2018life], we include an information restricting regularizer (Section 4.2.2) which allows us to directly control the interpretation and flow of information of the learnt latent variables.

6 Experiments

We evaluate our model and the baselines over standard datasets used in other state of the art lifelong / continual learning literature [nguyen2018variational, zenke2017continual, shin2017continual, kamra2017deep, kirkpatrick2017overcoming, rusu2016progressive]. While these datasets are simple in a traditional classification setting, transitioning to a lifelong-generative setting scales the problem complexity substantially. We evaluate LGM on a set of progressively more complicated tasks (Section 6.2) and provide comparisons against baselines [nguyen2018variational, zenke2017continual, kirkpatrick2017overcoming, eskin2004laplace, kingma2014] using a set of standard metrics (Section 6.1). All network architectures and other optimization details for our LGM model are provided in Appendix Section 10.3, as well as in our open-source git repository [jramapuram_2018].

6.1 Performance Metrics

To validate the benefit of LGM in a lifelong setting we explore three main performance dimensions: the ability for the model to reconstruct and generate samples from all previous tasks and the ability to learn a common representation over time, thus reducing learning sample complexity. We use three main quantitative performance metrics for our experiments: the log-likelihood importance sample estimate [burda2015importance, nguyen2018variational], the negative test ELBO, and the Frechet distance metric [heusel2017gans]. In addition, we also provide two auxiliary metrics to validate the benefits of LGM in a lifelong setting: training sample complexity and wall clock time per training and test epoch.

To fairly compare models with varying latent variable configurations, one solution is to marginalize out the latents during model evaluation / test time. This is realized in practice by using a Monte Carlo approximation (typically K=5000) and is commonly known as the importance sample (IS) log-likelihood estimate [burda2015importance, nguyen2018variational]. As latent variable and model complexity grows, this estimate tends to become noisier and intractable to compute. For our experiments we use this metric only for the FashionMNIST and MNIST datasets, as computing one estimate over 10,000 test samples for a complex model takes approximately 35 hours on a K80 GPU.
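The importance-sampled estimate can be illustrated on a toy model whose marginal likelihood is known in closed form. The sketch below is only an illustration: the scalar model (a standard normal prior with a unit-variance gaussian likelihood) and the hand-picked gaussian proposal are assumptions standing in for a trained decoder and encoder, and all function names are ours.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a scalar Gaussian N(mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def is_log_likelihood(x, q_mean, q_var, num_samples=5000, seed=0):
    """Importance-sampled estimate of log p(x) for the toy model
    z ~ N(0, 1), x | z ~ N(z, 1), using the proposal q(z | x) = N(q_mean, q_var)."""
    rng = random.Random(seed)
    log_weights = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal_pdf(x, z, 1.0) + log_normal_pdf(z, 0.0, 1.0)
        log_weights.append(log_joint - log_normal_pdf(z, q_mean, q_var))
    return log_mean_exp(log_weights)


x = 1.2
exact = log_normal_pdf(x, 0.0, 2.0)                     # marginal of the toy model is N(0, 2)
estimate = is_log_likelihood(x, q_mean=0.3, q_var=1.0)  # deliberately imperfect proposal
```

When the proposal equals the true posterior of the toy model, every importance weight is identical and the estimate is exact even with a handful of samples; the further the proposal is from the posterior, the noisier the estimate becomes.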

In contrast to the IS log-likelihood estimate, the negative test ELBO (Equation 4) is only applicable when comparing models with the same latent variable configurations; it is however much faster to compute. The negative test ELBO provides a lower bound to the test log-likelihood of the true data distribution under the assumed latent variable configuration. One crucial aspect missing from both these metrics is an evaluation of generation quality. We resolve this by using the Frechet distance metric [heusel2017gans] and qualitative image samples.

The Frechet distance metric allows us to quantify the quality and diversity of generated samples by using a pre-trained classifier model to compare the feature statistics (generally under a Gaussianity assumption) between synthetic generated samples and samples drawn from the test set. If the Frechet distance between these two distributions is small, then the generative model is said to be generating realistic images. The Frechet distance between two gaussians (produced by evaluating latent embeddings of a classifier model) with means $m_1, m_2$ and corresponding covariances $C_1, C_2$ is:

$$\lVert m_1 - m_2 \rVert_2^2 + \mathrm{Tr}\big(C_1 + C_2 - 2\,(C_1 C_2)^{1/2}\big) \qquad (20)$$
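As a small worked example, the sketch below evaluates the Frechet distance for the special case of diagonal covariances, where the matrix square root reduces to an element-wise square root. The restriction to diagonal covariances and the function name are simplifications made here for illustration; in practice full covariance matrices of classifier embeddings are used.

```python
import math


def frechet_distance_diagonal(mean_1, var_1, mean_2, var_2):
    """Frechet distance between two Gaussians with diagonal covariances.

    For diagonal covariances the trace term reduces to the sum of squared
    differences of the per-dimension standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_1, mean_2))
    trace_term = sum(
        v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var_1, var_2)
    )
    return mean_term + trace_term


# Identical statistics give zero distance; shifting the mean by (3, 4) gives 25.
same = frechet_distance_diagonal([0.0, 1.0], [1.0, 4.0], [0.0, 1.0], [1.0, 4.0])
shifted = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```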

While the Frechet distance, negative ELBO and IS log-likelihood estimate provide a glimpse into model performance, there exists no conclusive metric that captures the quality of unsupervised generative models [theis2016note, sajjadi2018assessing] and active research suggests a direct trade-off between perceptual quality and model representation [blau2018perception]. Thus, in addition to the metrics described above, we also provide qualitative metrics in the form of test image reconstructions and image generations. We summarize all used performance metrics in Table 2 below:

Metric | Definition | Purpose | Lower is better?
Negative ELBO | Equation 4. | Quantitative metric on likelihood / reconstructions. | yes
Negative Log-Likelihood | 5000 (latent) sample Monte Carlo estimate of Equation 4. | Quantitative metric on density estimate. | yes
Frechet Distance | Equation 20. | Quantitative metric on generations. | yes
Test Reconstructions | | Qualitative view of reconstructions. | N/A
Generations | | Qualitative view of generations. | N/A
#Training Samples | # real training samples used for a task. | Sample complexity. | yes
Table 2: Summary of different performance metrics.

6.2 Data Flow

Figure 7: Visual examples of training and test task sequences (top to bottom) for the datasets used to validate LGM. The training set only consists of samples from the current task while the test set is a cumulative union of the current task, coupled with all previous tasks. The permuted MNIST task uses different fixed permutation matrices to create 4 auxiliary datasets.

In Figure 7 we list train and test variates depicting the data flow for each of the problems that we model. Because we relax the need to preserve data in a lifelong setting, the train task sequence observes a single dataset at a time, without access to any previous dataset. The corresponding test dataset consists of the union of the current test dataset with all previously observed test datasets.

MNIST / Fashion MNIST: For the MNIST and Fashion MNIST problems, we observe a single MNIST digit or fashion object (such as shirts) at a time. Each training set consists of 6000 training samples and 1000 test samples. These samples are originally extracted from the full training and test datasets which consist of 60,000 training and 10,000 test samples.

Permuted MNIST: this problem differs from the MNIST problem described above in that we use the entire MNIST dataset at each task. After observing the first task, which is the standard MNIST dataset, each subsequent task differs through the application of a fixed permutation matrix on the entire MNIST dataset. The test task sequence differs from the training task sequence in that we simply use the corresponding full train and test MNIST datasets (with the appropriate application of ).

Celeb-A: We split the CelebA dataset into four individual distributions using the features: bald, male, young and eye-glasses. As with the previous problems, we treat each subset of data as an individual distribution, and present our model with samples from a single distribution at a time. This presents a real world scenario as the number of samples per distribution varies drastically, from only 3,713 samples for the bald distribution to 126,788 samples for young. In addition, specific samples can span one or more of these distributions.

SVHN to MNIST: in this problem, we transition from fully observing the centered SVHN [netzer2011reading] dataset to observing the MNIST dataset. We treat all samples from SVHN as being generated by one distribution and all the MNIST samples (footnote 6: MNIST was resized to 32x32 and converted to RGB to make it consistent with the dimensions of SVHN) as generated by another distribution (irrespective of the specific digit). At inference, the model is required to reconstruct and generate from both datasets.

6.3 Situating against state of the art lifelong learning models.

To situate LGM against other state-of-the-art methods in lifelong learning, we use the sequential FashionMNIST and MNIST datasets described earlier in Section 6.2 and the data flow diagram in Figure 7. In Figures 8 and 9 below, we contrast our LGM model against VCL [nguyen2018variational], VCL without a task-specific head network, SI [zenke2017continual], EWC [kirkpatrick2017overcoming], Laplace propagation [eskin2004laplace], a full batch VAE trained jointly on all data, and a standard naive sequential VAE without any catastrophic forgetting prevention strategy. The full batch VAE presents the upper-bound performance, and all lifelong learning models typically under-perform this model by the final learning task. For the baselines, we use the generously open-sourced code [nvcuong_2019] by the VCL authors, with the optimal hyper-parameters specified for each model. We begin by evaluating the 5,000-sample Monte Carlo estimate of the log-likelihood of all compared models in Figure 8 below:

Figure 8: IS log-likelihood (mean ± std over 5 trials). Left: Fashion MNIST. Right: MNIST.
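An importance-sampled (IS) log-likelihood estimate of this kind can be sketched as below. The sketch makes several assumptions that are not stated in the text: the proposal is taken to be the approximate posterior q(z | x), the function names are hypothetical, and the toy 1-D linear Gaussian model is used only because its exact marginal is known, which makes the estimate easy to check.

```python
import math
import random


def log_mean_exp(log_weights):
    """Numerically stable log of the mean of exp(log_weights)."""
    peak = max(log_weights)
    total = sum(math.exp(lw - peak) for lw in log_weights)
    return peak + math.log(total) - math.log(len(log_weights))


def is_log_likelihood(log_joint, log_proposal):
    """Importance-sampled estimate of log p(x).

    log_joint[k]    = log p(x, z_k) for z_k drawn from the proposal q(z | x)
    log_proposal[k] = log q(z_k | x)
    """
    log_weights = [lj - lq for lj, lq in zip(log_joint, log_proposal)]
    return log_mean_exp(log_weights)


def log_normal(value, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)


# Toy model (assumption): p(z) = N(0, 1), p(x | z) = N(z, 1), so p(x) = N(0, 2)
# and the exact posterior is N(x / 2, 1 / 2), which we use as the proposal.
rng = random.Random(0)
x = 1.3
samples = [rng.gauss(x / 2, math.sqrt(0.5)) for _ in range(5000)]
log_joint = [log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) for z in samples]
log_proposal = [log_normal(z, x / 2, 0.5) for z in samples]
estimate = is_log_likelihood(log_joint, log_proposal)
```

With the exact posterior as proposal, every importance weight equals p(x), so the estimate matches the analytic log-marginal. With a learned approximate posterior the weights vary and the estimate is a lower bound in expectation.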

Even though each trial was repeated five times, we observe large increases in the estimates at a few critical points. After further inspection, we determined that these large increases were due to the model observing a drastically different distribution at that point. We overlay the graphs with an example variate for each of the magnitude spikes. In the case of FashionMNIST, for example, one spike coincides with the model observing its first shoe distribution; this contrasts with the previously observed items, which were mainly clothing-related objects. Interestingly, we observe that LGM has much smoother performance across tasks. We posit that this is because LGM does not constrain its parameters, and instead enforces the same input-output mapping through functional regularization.

Figure 9: Final model generations and reconstructions for MNIST and FashionMNIST. The LGM model presents competitive performance for both generations and reconstructions, while not preserving any past data nor past models.

Since one of the core tenets of lifelong learning is to reduce sample complexity over time, we use this experiment to validate whether LGM does in fact achieve this objective. Since all LGM models are trained with an early-stopping criterion, we can directly calculate the number of samples used for each learning task from the stopping epoch and the mean of the Bernoulli sampling distribution of the student model. In Figure 10 we plot the number of true samples and the number of synthetic samples used by a model until it satisfied its early-stopping criterion. We observe a steady decrease in the number of real samples used over time, validating LGM's advantage in a lifelong setting.

Figure 10: FashionMNIST sample complexity. Left: Synthetic training samples used till early-stopping. Right: Real samples used till early-stopping.
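The sample-count calculation can be sketched as follows. This is a hedged illustration: the function name, the interpretation of the Bernoulli mean as the probability of drawing a synthetic (teacher-generated) sample, and the numbers in the usage example are assumptions for this example and are not taken from the experiments.

```python
def expected_sample_counts(stopping_epoch, samples_per_epoch, synthetic_prob):
    """Expected numbers of real and synthetic samples seen before early stopping.

    synthetic_prob is the assumed Bernoulli mean: the probability that a
    training sample is generated by the teacher rather than drawn from
    the real data of the current task.
    """
    if not 0.0 <= synthetic_prob <= 1.0:
        raise ValueError("synthetic_prob must lie in [0, 1]")
    total = stopping_epoch * samples_per_epoch
    synthetic = total * synthetic_prob
    real = total - synthetic
    return real, synthetic


# Illustrative values only (not results from the paper).
real, synthetic = expected_sample_counts(
    stopping_epoch=20, samples_per_epoch=6000, synthetic_prob=0.75
)
```

Under these assumed values, 30,000 real and 90,000 synthetic samples would be consumed before early stopping.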

6.4 Diving deeper into the sequence.

Rather than only visualizing the final model’s qualitative results as in Figure 9, we provide qualitative results for model performance over time for the PermutedMNIST experiment in Figure 11. This allows us to visually observe lifelong model performance over time. In this experiment, we focus our efforts on EWC and LGM and visualize model (test) reconstructions starting from the second learning task until the final one. The EWC-VAE variant that we use as a baseline has the same latent variable configuration as our model, enabling the usage of the test ELBO as a quantitative metric for comparison. We use an unpermuted version of the MNIST dataset as our first distribution, as it allows us to visually assess the degradation of reconstructions. This is a common setup utilized in continual learning [kirkpatrick2017overcoming, zenke2017continual] and we extend it here to the density estimation setting.


Figure 11: Top row: test-samples; bottom row: reconstructions. We visualize an increasing number of accumulated distributions from left to right. (a) Lifelong VAE model (b) EWC VAE model.

Both models exhibit a different form of degradation: EWC experiences a more destructive form of degradation, as exemplified by the salt-and-pepper noise observed in the final dataset reconstruction. LGM, on the other hand, experiences a form of Gaussian noise, as visible in the corresponding final dataset reconstruction. In order to numerically quantify this performance we analyze the log-Frechet distance and negative ELBO below in Figure 12, where we contrast LGM to EWC, a batch VAE (full-vae in graph), an upto-VAE that observes all training data up to the current distribution and a vanilla sequential VAE (vanilla). We examine a variety of different convolutional and dense architectures and present the top performing models below. We observe that LGM drastically outperforms EWC and the baseline naive sequential VAE in both metrics.

Figure 12: PermutedMNIST (a) negative Test ELBO and (b) log-Frechet distance.

6.5 Learning Across Complex Distributions

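Figure 13: (a) Reconstructions of test samples from SVHN [left] and MNIST [right]; (b) Decoded samples based on linear interpolation of the continuous latent variable with the discrete latent variable fixed to one of its values; (c) Same as (b) but with the discrete latent variable fixed to its other value.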

The typical assumption in lifelong learning is that the sequence of observed distributions is related [thrun1995lifelong2] in some manner. In this experiment we relax this constraint by learning a common model between the colored SVHN dataset and the binary MNIST dataset. While semantically similar to humans, these datasets are vastly different, as one is based on RGB images of real world house numbers and the other on synthetically hand-drawn digits. We visualize examples of the true test inputs and their respective reconstructions from the final lifelong model in Figure 13(a). Even though the only true data the final model received for training was the MNIST dataset, it is still able to reconstruct the SVHN data observed previously. This demonstrates the ability of our architecture to transition between complex distributions while still preserving the knowledge learned from the previously observed distributions.

Finally, in Figures 13(b) and 13(c) we illustrate the data generated from an interpolation of a 2-dimensional continuous latent space. To generate variates, we set the discrete categorical variable to one of its possible values and linearly interpolate the continuous variable over a fixed range. We then decode these to obtain the samples. The model learns a common continuous structure for the two distributions, which can be followed by observing the development in the generated samples from top left to bottom right in both Figure 13(b) and Figure 13(c).
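The construction of the interpolated latent codes can be sketched as follows. This is a minimal sketch under stated assumptions: the interpolation range of [-3, 3], the grid resolution, the ordering of the discrete and continuous parts within the latent vector, and all function names are hypothetical, and the trained decoder that would turn each latent into an image is not shown.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def one_hot(index, size):
    """One-hot vector selecting a single discrete category."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def interpolation_latents(discrete_index, num_discrete, low, high, steps):
    """Latent vectors [discrete, continuous] on a steps x steps grid.

    The discrete part is held fixed while the 2-D continuous part sweeps
    the grid row by row, from (low, low) to (high, high).
    """
    discrete = one_hot(discrete_index, num_discrete)
    axis = linspace(low, high, steps)
    return [discrete + [c1, c2] for c1 in axis for c2 in axis]


# Assumed range and resolution; each latent would be passed to the decoder.
latents = interpolation_latents(
    discrete_index=0, num_discrete=2, low=-3.0, high=3.0, steps=5
)
```

Repeating the call with `discrete_index=1` gives the grid for the second panel, with the same continuous coordinates but the other discrete value.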

6.6 Validating Empirical Sample Complexity Using Celeb-A

We iterate over the Celeb-A dataset as described in the data flow diagram (Figure 7) and use this learning task to explore qualitative and quantitative generations, as well as empirical real world time complexity (as described in Section 4.4) on modern GPU hardware. We train a lifelong model and a typical VAE baseline without catastrophic forgetting mitigation strategies and evaluate the final model’s generations in Figure 14. As visually demonstrated in Figure 14-Left, the lifelong model is able to generate instances from all of the previous distributions; however, the baseline model catastrophically forgets (Figure 14-Right) and only generates samples from the eye-glasses distribution. This is also reinforced by the log-Frechet distance shown in Figure 15.

Figure 14: Left: Sequential generations for Celeb-A from the final lifelong model for bald, male, young and eye-glasses (left to right). Right: (random) generations by the final baseline VAE model.

We also evaluate the wall-clock time in seconds (Table 3) for the lifelong model and the baseline-vae for the 44,218 samples of the male distribution. We observe that the lifelong model does not add a significant overhead, especially since the baseline-vae undergoes catastrophic forgetting (Figure 14 Right) and completely fails to generate samples from previous distributions. Note that we present the number of parameters and other detailed model information in our code and Appendix 10.3.

44,218 male samples | baseline-VAE | Lifelong
training-epoch (s) | 43.1 +/- 0.6 | 56.63 +/- 0.28
testing-epoch (s) | 9.79 +/- 0.12 | 16.09 +/- 0.01

Table 3: Mean & standard deviation of wall-clock time for one epoch of the male distribution of Celeb-A.
Figure 15: Celeb-A log-Frechet distance of lifelong vs. naive baseline VAE model without catastrophic forgetting mitigation strategies over the four distributions. Listed on the right is the time per epoch (in seconds) for an epoch of the corresponding models.

7 Ablation Studies

In this section we independently validate the benefit of each of the newly introduced components to the learning objective proposed in Section 4.3. In Experiment 7.1 we demonstrate the benefit of the discrete-continuous posterior factorization introduced in Section 4.2.1. Then in Experiment 7.2, we validate the necessity of the information restricting regularizer (Section 4.2.2) and posterior consistency regularizer (Section 4.1.1).

7.1 Linear Separability of Discrete and Continuous Posterior

Figure 16: Left: Graphical model depicting classification using a pretrained VAE coupled with a linear classifier. Right: Linear classifier accuracy on the Fashion MNIST test set for a varying range of latent dimensions and distributions.

In order to validate that the (independent) discrete and continuous latent variable posterior aids in learning a better representation, we classify the encoded posterior sample using a simple linear classifier whose output corresponds to the categorical class prediction. Higher (linear) classification accuracies demonstrate that the VAE is able to learn a more linearly separable representation. Since the latent representations of VAEs are typically used in auxiliary tasks, learning such a representation is useful in downstream tasks. This is a standard method to measure posterior separability and is used in methods such as Associative Compression Networks [graves2018associative].

We use the standard training set of FashionMNIST [xiao2017/online] (60,000 samples) to train a standard VAE with a discrete only (disc) posterior, an isotropic-gaussian only (gauss) posterior, a bernoulli only (bern) posterior and finally the proposed independent discrete and continuous (disc+gauss) posterior presented in Section 4.2.1. For each different posterior reparameterization, we train a set of VAEs with varying latent dimensions. In the case of the disc+gauss model we fix the discrete dimension and vary the isotropic-gaussian dimension to match the total required dimension. After training each VAE, we proceed to use the same training data to train a linear classifier on the encoded posterior sample.
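The linear evaluation step can be sketched as follows. This is an illustrative sketch only: softmax regression trained by stochastic gradient descent is one possible choice of linear classifier (the training procedure is not specified in the text), the function names are hypothetical, and two synthetic Gaussian clusters stand in for the encoded posterior samples of a trained VAE.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=50):
    """Fit softmax regression (a linear classifier) on fixed latent features."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for z, y in zip(features, labels):
            logits = [sum(w * v for w, v in zip(weights[c], z)) + biases[c]
                      for c in range(num_classes)]
            probs = softmax(logits)
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # d(cross-entropy)/d(logit)
                biases[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * z[j]
    return weights, biases


def predict(weights, biases, z):
    logits = [sum(w * v for w, v in zip(row, z)) + b
              for row, b in zip(weights, biases)]
    return max(range(len(logits)), key=lambda c: logits[c])


def accuracy(weights, biases, features, labels):
    hits = sum(predict(weights, biases, z) == y for z, y in zip(features, labels))
    return hits / len(labels)


def make_clusters(rng, centers, points_per_class, spread=0.5):
    """Synthetic stand-in for encoded posterior samples (assumption)."""
    features, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(points_per_class):
            features.append([rng.gauss(cx, spread), rng.gauss(cy, spread)])
            labels.append(label)
    return features, labels


rng = random.Random(0)
centers = [(-2.0, -2.0), (2.0, 2.0)]
train_z, train_y = make_clusters(rng, centers, points_per_class=50)
test_z, test_y = make_clusters(rng, centers, points_per_class=50)
weights, biases = train_linear_probe(train_z, train_y, num_classes=2)
test_accuracy = accuracy(weights, biases, test_z, test_y)
```

Because the encoder is frozen and only a linear map is trained, the test accuracy measures how linearly separable the latent representation already is, rather than the capacity of the classifier.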

In Figure 16 we present the mean and standard deviation of the linear test classification accuracies for each set of experiments. As expected, the discrete only (disc) posterior performs poorly due to the strong restriction of mapping an entire input sample to a single one-hot vector. The isotropic-gaussian (gauss) and bernoulli (bern) only models provide a strong baseline, but the combination of isotropic-gaussian and discrete posteriors (disc+gauss) performs much better, reaching an upper-bound (linear) test-classification accuracy of 87.1%. This validates that the decoupling of the latent representation presented in Section 4.2.1 aids in learning a more meaningful, separable posterior.

7.2 Validating the Mutual Information and Posterior Consistency Regularizers.

Figure 17: MNIST Ablation: (a) negative test ELBO. (b) Sequentially generated samples obtained by setting the discrete variable and sampling the continuous variable (Section 4.2.1), with consistency + mutual information (MI). (c) Sequentially generated samples with no consistency + no mutual information (MI).

In order to independently evaluate the benefit of our proposed Bayesian update regularizer (Section 4.1.1) and the mutual information regularizer (Section 4.2.1), we perform an ablation study using the MNIST data flow sequence from Figure 7. We evaluate three scenarios: 1) with posterior consistency and mutual information regularizers, 2) only posterior consistency and 3) without either regularizer. We observe that both components are necessary in order to generate high quality samples, as evidenced by the negative test ELBO in Figure 17-(a) and the corresponding generations in Figure 17-(b-c). The generations produced without the information gain regularizer and consistency in Figure 17-(c) are blurry. We attribute this to two factors: 1) uniformly sampling the discrete component is not guaranteed to generate representative samples and 2) the decoder relays more information through the continuous component, causing catastrophic forgetting and posterior collapse [alemi2018fixing].

8 Limitations

While LGM presents strong performance, it fails to completely solve the problem of lifelong generative modeling and we see a slow degradation in model performance over time. We attribute this mainly to the problem of poor VAE generations that compound upon each other (also discussed below). In addition, there are a few pertinent issues that need to be resolved in order to achieve an optimal (in terms of non-degrading Frechet distance / -ELBO) unsupervised generative lifelong learner:

Distribution Boundary Evaluation: The standard assumption in current lifelong / continual learning approaches [nguyen2018variational, zenke2017continual, shin2017continual, kamra2017deep, kirkpatrick2017overcoming, rusu2016progressive] is to use known, fixed distributions instead of learning the distribution transition boundaries. For the purposes of this work, we focus on the accumulation of distributions (in an unsupervised way), rather than introduce an additional level of indirection through the incorporation of anomaly detection methods that aid in detecting distributional boundaries.

Blurry VAE Generations: VAEs are known to generate images that are blurry in contrast to GAN based methods. This has been attributed to the fact that VAEs don’t learn the true posterior and make a simplistic assumption regarding the reconstruction distribution [alemi2018fixing, rainforth2018tighter]. While there exist methods such as ALI [dumoulin2016adversarially] and BiGAN [donahue2016adversarial], that learn a posterior distribution within the GAN framework, recent work has shown that adversarial methods fail to accurately match posterior-prior distribution ratios in large dimensions [rosca2018distribution].

Memory: In order to scale to a truly lifelong setting, we posit that a learning algorithm needs a global pool of memory that can be decoupled from the learning algorithm itself. This decoupling would also allow for a principled mechanism for parameter transfer between sequentially learnt models, as well as a centralized location for compressing non-essential historical data. Recent work such as the Kanerva Machine [wu2018kanerva] and its extensions [wu2018learning] provide a principled way to do this in the VAE setting.

9 Conclusion

In this work we propose a novel method for learning generative models over a lifelong setting. The principal assumption for the data is that they are generated by multiple distributions and presented to the learner in a sequential manner. A key limitation for the learning process is that the method has no access to any of the old data and that it must distill all the necessary information into a single final model. The proposed method is based on a dual student-teacher architecture where the teacher’s role is to preserve the past knowledge and aid the student in future learning. We argue for and augment the standard VAE’s ELBO objective with terms that help the teacher-student knowledge transfer. We demonstrate the benefits this augmented objective brings to the lifelong learning setting using a series of experiments. The architecture, combined with the proposed regularizers, aids in mitigating the effects of catastrophic interference by supporting the retention of previously learned knowledge.

References

10 Appendix

10.1 Understanding the Consistency Regularizer

The analytical derivations of the consistency regularizer show that the regularizer can be interpreted as a transformation of the standard VAE regularizer. In the case of an isotropic gaussian posterior, the proposed regularizer scales the mean and variance of the student posterior by the variance of the teacher (Proof 1) and adds an extra ’volume’ term. This interpretation of the consistency regularizer shows that the proposed regularizer preserves the same learning objective as that of the standard VAE. Below we present the analytical form of the consistency regularizer with categorical and isotropic gaussian posteriors:

Proof 1

We assume the learnt posterior of the teacher is parameterized by a centered, isotropic gaussian and the posterior of our student by a non-centered isotropic gaussian; then

(21)
Via a reparameterization of the student’s parameters:
(22)
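As a hedged sketch of the closed form that this Gaussian case leads to, assume the regularizer is the KL divergence from the student posterior to the teacher posterior. The symbols $\mu_S$, $\sigma_S^2$ (student mean and per-dimension variance), $\sigma_T^2$ (teacher per-dimension variance) and the index $i$ over latent dimensions are notation introduced here for illustration and may differ from the original equations. The first line is the standard KL divergence between diagonal Gaussians; the second follows from the reparameterization given in the comment.

```latex
% Assumed notation (not from the source): student N(\mu_S, diag(\sigma_S^2)),
% teacher N(0, diag(\sigma_T^2)); requires amsmath.
\begin{align}
KL\left[\mathcal{N}\big(\mu_S, \mathrm{diag}(\sigma_S^2)\big) \,\|\, \mathcal{N}\big(0, \mathrm{diag}(\sigma_T^2)\big)\right]
  &= \frac{1}{2}\sum_i \left[ \frac{\sigma_{S,i}^2}{\sigma_{T,i}^2} + \frac{\mu_{S,i}^2}{\sigma_{T,i}^2} - 1 - \log \sigma_{S,i}^2 + \log \sigma_{T,i}^2 \right] \\
% Reparameterize: \tilde{\mu}_i = \mu_{S,i} / \sigma_{T,i}, \quad \tilde{\sigma}_i^2 = \sigma_{S,i}^2 / \sigma_{T,i}^2
  &= \frac{1}{2}\sum_i \left[ \tilde{\sigma}_i^2 + \tilde{\mu}_i^2 - 1 - \log \tilde{\sigma}_i^2 \right]
\end{align}
```

Under these assumptions the second line has the same form as the standard VAE KL term against a unit Gaussian, and the two coincide exactly when the teacher variance is one in every dimension.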

It is also interesting to note that our posterior regularizer becomes the prior if:

Proof 2

We parameterize the learnt posterior of the teacher by and the posterior of the student by . We also redefine the normalizing constants as and for the teacher and student models respectively. The KL divergence from the ELBO can now be re-written as:

(23)

where is the entropy operator and is the cross-entropy operator.

10.2 Reconstruction Regularizer

Figure 18: Fashion Negative Test ELBO
Figure 19: Fashion Log-Frechet Distance

While it is possible to constrain the reconstruction/decoder term of the VAE in a similar manner to the consistency posterior-regularizer, doing so diminishes model performance. We hypothesize that this is due to the fact that this regularizer contradicts the objective of the reconstruction term in the ELBO, which already aims to minimize some metric between the input samples and the reconstructed samples; e.g., if the decoder likelihood is Gaussian, then the loss is proportional to the squared error between the input and its reconstruction, the standard L2 loss. Without the addition of this reconstruction cross-model regularizer, the model is also provided with more flexibility in how it reconstructs the output samples.
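As a worked instance of the L2 connection above, assume a Gaussian decoder likelihood with a fixed isotropic variance. The symbols $\hat{x}$ (decoder mean, i.e. the reconstruction), $\sigma^2$ (fixed variance) and $d$ (input dimensionality) are notation introduced here for illustration.

```latex
% Assumed notation (not from the source): \hat{x} is the reconstruction,
% \sigma^2 a fixed variance, d the input dimensionality; requires amsmath.
\begin{align}
-\log \mathcal{N}\big(x;\, \hat{x},\, \sigma^2 I\big)
  &= \frac{1}{2\sigma^2}\,\lVert x - \hat{x} \rVert_2^2 + \frac{d}{2}\log\left(2\pi\sigma^2\right)
\end{align}
```

The second term does not depend on the reconstruction, so minimizing the negative log-likelihood is equivalent to minimizing the squared error up to the constant factor $1/(2\sigma^2)$.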

In order to quantify this, we duplicate the FashionMNIST experiment listed in the data flow definition in Figure 7. We use a simpler model than in the main experiments to validate this hypothesis. We train two dense models (-D): one with just the posterior consistency regularizer (without-LL-D) and one with the consistency and likelihood regularizer (with-LL-D). We observe that model performance drops (with respect to the Frechet distance as well as the test ELBO) in the case of with-LL-D, as demonstrated in Figures 18 and 19.

10.3 Model Architecture

We used two different architectures for our experiments. When we used a dense network (-D), we used two layers of 512 units to map to the latent representation and two layers of 512 units to map back to the reconstruction for the decoder. We used batch norm [ioffe2015batch] and ELU activations for all the layers barring the layer projecting into the latent representation and the output layer. Note that while we used the same architecture for EWC, we observed a drastic negative effect when using batch norm and thus dropped its use. The convolutional architectures (-C) used the architecture described below for the encoder and the decoder (where the decoder used conv-transpose layers for upsampling). The notation is [OutputChannels, (filterX, filterY), stride]:

(24)
Method | Initial discrete dimension | Final discrete dimension | Continuous dimension | # initial parameters | # final parameters
EWC-D | 10 | 10 | 14 | 4,353,184 | 4,353,184
naive-D | 10 | 10 | 14 | 1,089,830 | 1,089,830
batch-D | 10 | 10 | 14 | 1,089,830 | 1,089,830
batch-D | 10 | 10 | 14 | 2,179,661 | 2,179,661
lifelong-D | 1 | 10 | 14 | 2,165,311 | 2,179,661
EWC-C | 10 | 10 | 14 | 30,767,428 | 30,767,428
naive-C | 10 | 10 | 14 | 7,691,280 | 7,691,280
batch-C | 10 | 10 | 14 | 7,691,280 | 7,691,280
batch-C | 10 | 10 | 14 | 15,382,560 | 15,382,560
lifelong-C | 1 | 10 | 14 | 15,235,072 | 15,382,560

The table above lists the number of parameters for each model and architecture used in our experiments. The lifelong models initially start with a discrete latent variable of dimension 1, and at each step we grow the representation by one dimension to accommodate the new distribution (more info in Section 10.7). In contrast, the baseline EWC models are provided with the full representation throughout the learning process. EWC has double the number of parameters because of the computed diagonal Fisher information matrix, which has the same dimensionality as the number of parameters. EWC also needs to preserve the teacher model for use in its quadratic regularizer. Both the naive and batch models have the fewest parameters, as they do not use a student-teacher framework and only use one model; however, the vanilla model has no protection against catastrophic interference and the full model is just used as an upper bound for performance.

We used Adam [kingma2015adam] to optimize all of our problems with a learning rate of 1e-4 or 1e-3. When we used weight transfer we re-initialized the accumulated momentum vector of Adam as well as the aggregated mean and variance of the batch norm layers. The full architecture can be examined in our github repository [jramapuram_2018] and is provided under an MIT license.

10.4 Contrast to streaming / online methods

Our method has similarities to streaming methods such as Streaming Variational Bayes (SVB) [DBLP:conf/nips/BroderickBWWJ13] and Incremental Bayesian Clustering methods [katakis2008incremental, gomes2008incremental] in that we estimate and refine posteriors through time. In general this can be done through the following Bayesian update rule, which states that the latest posterior is proportional to the current likelihood times the previous posterior:

(25)
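The update rule can be illustrated with a conjugate model, where the posterior is available in closed form. This is a toy sketch chosen for illustration, not the method of the paper or of SVB: a Beta prior over a Bernoulli parameter is updated one batch at a time, and the function name and data are assumptions.

```python
def update_beta(prior, batch):
    """Posterior Beta(a, b) after observing a batch of 0/1 outcomes.

    The posterior is proportional to the Bernoulli likelihood of the
    batch times the Beta prior, which stays in the Beta family.
    """
    a, b = prior
    successes = sum(batch)
    return a + successes, b + len(batch) - successes


batches = [[1, 0, 1], [1, 1], [0, 0, 1, 1]]

posterior = (1, 1)  # uniform Beta(1, 1) prior
for batch in batches:
    # The previous posterior acts as the prior for the current batch.
    posterior = update_beta(posterior, batch)

# Processing all the data at once gives the same posterior.
all_at_once = update_beta((1, 1), [x for batch in batches for x in batch])
```

Here the streamed posterior and the full-batch posterior agree, which is the property that exact Bayesian updating guarantees; approximate posteriors, as used by streaming methods, only match it approximately.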

SVB computes the intractable posterior utilizing an approximation that accepts as input the current dataset along with the previous posterior:

(26)

The first posterior input () to the approximating function is the prior . The objective of SVB and other streaming methods is to model the posterior of the