# The Hierarchical Adaptive Forgetting Variational Filter

Vincent Moens
###### Abstract

A common problem in machine learning and statistics is to detect whether the current sample in a stream of data belongs to the same distribution as previous ones, is an isolated outlier, or inaugurates a new distribution. We present a hierarchical Bayesian algorithm that learns a time-specific approximate posterior distribution of the parameters describing the distribution of the observed data. We derive the update equations of the variational parameters of the approximate posterior at each time step for models from the exponential family, and show that these updates have interesting correspondents in Reinforcement Learning (RL). In this perspective, our model can be seen as a hierarchical RL algorithm that learns a posterior distribution according to a certain stability confidence that is, in turn, learned according to its own stability confidence. Finally, we show some applications of our generic model, first in an RL context, next with an adaptive Bayesian Autoregressive model, and finally in the context of Stochastic Gradient Descent optimization.

## 1 Introduction

Learning in a changing environment is a difficult albeit ubiquitous task. One key issue for learning in such a context is to discriminate between isolated, unexpected events and a prolonged contingency change. This discrimination is challenging with conventional techniques because they rely on prior assumptions about environment stability. When a fluctuating context is assumed, past experience is forgotten immediately when an unexpected event occurs; but if that event was just noise, this erroneous forgetting can be very costly. When the context is assumed to be less variable, model parameters tend to change more gradually, sometimes missing fluctuations that happen faster than expected. Most models cover only one of the two possibilities, and either gradually adapt their predictions to the new contingency or do so abruptly, but not both.

One classical solution to the problem of change detection is to compare the likelihood of the current observation given the previous posterior distribution with a default probability distribution (Kulhavy & Karny, 1984), representing an initial, naive state of the learner. Usually, the mixing coefficient (or forgetting factor) that is used to weight these two hypotheses is adapted to the current data in order to detect and account for the possible contingency change. This mixing coefficient can be implemented in a linear or exponential manner (Kulhavý & Kraus, 1996). We will focus here on the exponential case.
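To make the exponential case concrete, the sketch below shows how an exponential forgetting factor mixes the previous posterior with a naive default prior by taking a convex combination of their natural parameters (Python for illustration; the function and variable names are ours, and the Beta-Bernoulli choice is an assumption made for the example, not code from the paper):

```python
def exp_forget_prior(theta_prev, theta_0, w):
    """Exponential forgetting: raising the previous posterior to the power w
    and the naive prior to the power (1 - w) amounts, for conjugate
    exponential-family models, to a convex combination of natural parameters."""
    return tuple(w * p + (1.0 - w) * p0 for p, p0 in zip(theta_prev, theta_0))

# Beta-Bernoulli example: parameters are (alpha, beta) pseudo-counts.
theta_0 = (1.0, 1.0)       # naive, flat prior
theta_prev = (21.0, 11.0)  # confident posterior after many observations
prior_t = exp_forget_prior(theta_prev, theta_0, w=0.9)
print(prior_t)             # slightly deflated pseudo-counts, close to (19, 10)
```

With the forgetting factor near 1 the learner keeps most of its accumulated evidence; near 0 it snaps back to the naive prior, which is exactly the trade-off the adaptive schemes discussed here try to tune online.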

In the past decade, several Bayesian solutions to this problem based on the aforementioned strategy have been proposed (Smidl, 2004; Smidl & Gustafsson, 2012; Azizi & Quinn, 2015). However, they usually suffer from several drawbacks: many of them put a restrictive prior on the mixing coefficient (e.g. (Smidl, 2004; Masegosa et al., 2017)) and cannot account for the fact that an unexpected event is unlikely to be caused by a contingency change if the environment has been stable for a long time.

We propose the Hierarchical Adaptive Forgetting Variational Filter (HAFVF). The core idea of the model is that the mixing coefficient can be learned as a latent variable with its own mixing coefficient. It is inspired by the observation that animals tend to decrease their flexibility (i.e. their capacity to adapt to a new contingency) when they are trained in a stable environment, and that this flexibility is inversely correlated with the training length (Dickinson, 1985). We suggest that this strategy may be beneficial in many environments, where the stability of the system identified by a learner is a variable that can be learned as an independent variable with a certain confidence: contingency changes are inherently more frequent in some environments than in others. Although this assumption may not hold in every case, we show that it helps the algorithm to stabilize and to discriminate contingency changes from accidents.

Accordingly, we frame our algorithm in an RL framework. We explore how the forward learning algorithm can be extended to the forward-backward case. We show three applications of our model: first in the case of a simple RL task, next to fit an autoregressive model and finally for gradient learning in a Stochastic Gradient Descent (SGD) algorithm.

## 2 Hierarchical Model

Let $x_{1:t}$ be a stream of data distributed according to a set of distributions $\mathbf{p}$, where the change trials are unknown and unpredictable. We make the following assumptions:

###### Assumption 1.

The probability that the current sample belongs to a new distribution decreases with the length of the stationary sequence observed so far.

###### Corollary 1.

If $B_t$ is a measure of the relative probability that $x_t$ belongs to the current distribution with respect to the naive alternative, and if the environment has been stationary, then $B_t$ grows with the length of the observed sequence.

Assumption 1 and Corollary 1 state that the probability of seeing a contingency change decreases with time in a steady environment. This might seem counter-intuitive or even maladaptive in many situations, but it is a key assumption we use to discriminate artifacts from contingency change: after a long sequence, the amount of evidence needed to switch from the current belief to the naive belief is greater than after a short sequence. This assumption will lead us to build a model where, if the learner is very confident in his belief, it will take him more time to forget past observations, because he will need more evidence for a contingency change. Therefore, in this context, the learner aims not only to learn the distribution of the data at hand, but also a measure of the confidence in the steadiness of the environment.

###### Assumption 2.

In the set of probability distributions $\mathbf{p}$, all elements have the same parametric form, which belongs to the exponential family and has a conjugate prior that is also from the exponential family:

$$p_n \in \mathbf{p} \implies p_n(x_t \mid z_n) = h(x_t)\exp\{z_n^{\top} T(x_t) - A(z_n)\}$$

and

$$p(z_n \mid \theta) = f(z_n)\exp\{\xi^{\top} z_n - \eta\, A(z_n) - B(\theta)\}, \qquad \theta := [\xi, \eta].$$

We now focus on the problem of approximating the current posterior distribution of $z$ given the current and past observations. For clarity, we will make the time subscripts implicit in what follows. Let us first consider the problem of estimating the posterior distribution of $z$ in the stationary case. After $t$ steps, and given some prior distribution $p_0(z; \theta_0)$, the posterior distribution can be formulated recursively as:

$$p(z \mid x_{1:t}) = \frac{p(x_t \mid z)\, p(z \mid x_{1:t-1})}{p(x_t \mid x_{1:t-1})} \tag{1}$$

Given the restriction imposed by Assumption 2, this posterior probability distribution has a closed-form expression and can be estimated efficiently.
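This closed-form recursion reduces, in the conjugate case, to adding the sufficient statistics of each observation to the prior parameters. A minimal sketch, assuming a Beta-Bernoulli model (our choice for illustration, not the paper's code):

```python
def update(theta, x):
    """One recursive conjugate update: add the sufficient statistic T(x) = x
    and one effective observation to the Beta pseudo-counts (alpha, beta)."""
    alpha, beta = theta
    return (alpha + x, beta + (1 - x))

theta = (1.0, 1.0)  # naive prior
data = [1, 1, 0, 1]
for x in data:      # sequential updates...
    theta = update(theta, x)
print(theta)        # (4.0, 2.0): ...match the batch posterior (1 + 3 ones, 1 + 1 zero)
```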

We enrich this basic model by first formulating the prior distribution of $z$ at time $t$ as a mixture of the previous posterior distribution and an arbitrary prior:

$$p_t(z \mid x_{1:t-1}) \propto p_{t-1}(z \mid x_{1:t-1})^{w}\, p_0(z; \theta_0)^{1-w}$$

Following Assumption 2, the conjugate distribution is also from the exponential family and reads

$$p_{t-1}(z \mid x_{1:t-1}) = f(z)\exp\{\theta_{t-1}^{\top} T(z) - B(\theta_{t-1})\}$$

where we have expanded $\theta = [\xi, \eta]$, with $T(z) = [z, -A(z)]$. $\eta$ is the part of $\theta$ that indicates the effective (prior) number of observations. If $p_0$ has the same form as $p_{t-1}$, then the log-partition function of the mixture can be computed efficiently (Mandt et al., 2014):

$$Z(w, \theta_{t-1}, \theta_0) = \exp\{-wB(\theta_{t-1}) - (1-w)B(\theta_0) + B(w\theta_{t-1} + (1-w)\theta_0)\}.$$
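This closed form can be checked numerically. In the sketch below (our construction, assuming a Beta family, where $B(\cdot)$ becomes the log-Beta function), the normalizer of the geometric mixture $p_{t-1}^{w}\, p_0^{1-w}$ computed from the formula above matches a direct numerical integration:

```python
import math

def log_beta(a, b):
    # log-partition of a Beta distribution in pseudo-count parametrization
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_pdf(x, a, b):
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta(a, b))

a1, b1 = 5.0, 3.0   # previous posterior
a0, b0 = 1.0, 1.0   # naive prior
w = 0.7

# Closed-form normalizer: natural parameters mix linearly
aw, bw = w * a1 + (1 - w) * a0, w * b1 + (1 - w) * b0
log_Z = -w * log_beta(a1, b1) - (1 - w) * log_beta(a0, b0) + log_beta(aw, bw)

# Numerical check: midpoint-rule integral of p1(x)^w * p0(x)^(1-w) on (0, 1)
n = 50000
integral = sum(
    beta_pdf((i + 0.5) / n, a1, b1) ** w * beta_pdf((i + 0.5) / n, a0, b0) ** (1 - w)
    for i in range(n)
) / n
assert abs(math.log(integral) - log_Z) < 1e-3
print("log Z =", round(log_Z, 6))
```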

Note that this result simplifies when combined with the numerator of Equation 1:

$$p(z \mid x_{1:t}, w) \propto p(x_t \mid z)\, f(z)\exp\{\theta_w^{\top} T(z) - B(\theta_w)\} \tag{2}$$

where $\theta_w := w\theta_{t-1} + (1-w)\theta_0$. The latent variable $w$ weights the initial prior with the posterior at the previous trial. We incorporate this variable in the set of latent variables, and we put a similar mixture prior on $w$ with a weight $b$: following this approach, the previous posterior probability of $w$ conditions the current one (similarly to $z$), together with a prior that is blind to the stream of data up to now. Assuming that $z$, $w$ and $b$ can each be generated by changing distributions, the joint probability now reads:

$$p(x_t, z, w, b \mid x_{1:t-1}) = p(x_t \mid z)\, p_t(z \mid w; x_{1:t-1}, \theta_0)\, p_t(w \mid b; x_{1:t-1}, \phi_0)\, p_t(b; \beta_0) \tag{3}$$

where we have assumed that the posterior probability factorizes (mean-field assumption), and where $\theta_0$, $\phi_0$ and $\beta_0$ are the parameters of the naive, initial prior distributions over $z$, $w$ and $b$, respectively. The model presented in Equation 3 is no longer conjugate, and the posterior probability does not generally have an analytical solution. We therefore introduce a variational posterior $q(z, w, b)$ to approximate the true posterior. In short, Variational Inference (Jaakkola & Jordan, 2000) works by replacing the posterior by a proxy of an arbitrary form and finding the configuration of this approximate posterior that minimizes the Kullback-Leibler divergence between it and the true posterior. This is virtually identical to maximizing the Evidence Lower Bound on the log-model evidence (ELBO).

For simplicity, we use a factorized variational posterior in which each factor has the same form as the prior distribution of its latent variable. Under this assumption, Equation 3 conveniently simplifies to:

$$p(x_t, z, w, b \mid x_{1:t-1}) \approx p(x_t \mid z)\, p(z \mid \hat{\vartheta})\, p(w \mid \hat{\varphi})\, p(b \mid \beta_0) \tag{4}$$

This model is shown in Figure 1. In what follows, we will restrict our analysis to the case where $w$ and $b$ are Beta distributed, meaning that the approximate posteriors we will optimize for these two variables will also be Beta distributions.

### 2.1 Update equations

#### Notation

We first define the following notation: $d\theta := \theta_{t-1} - \theta_0$ is the difference between the previous approximate posterior and the initial prior parameters. We use $\hat{\vartheta} := \hat{w}\theta_{t-1} + (1-\hat{w})\theta_0$ as the weighted prior parameters, with $\hat{w}$ the expectation of $w$ under $q(w)$. Similarly, $\hat{\varphi}$ and $\hat{b}$ are the weighted prior over $w$ and its expectation under $q(b)$, respectively. Also, we will often abbreviate the summary statistics $T(x_t)$ of $x_t$ as $T$.

We now focus on the problem of finding the approximate posterior configuration that maximizes the ELBO. Various techniques have been developed to solve this problem: whereas Stochastic Gradient Variational Bayes (Kingma & Welling, 2013) and Stochastic Variational Inference (Hoffman et al., 2012) work well for large datasets, more traditional conjugate (Winn et al., 2005) or non-conjugate (Knowles & Minka, 2011) Variational Message Passing (VMP) algorithms are better suited for our problem. This technique indeed allows us to derive closed-form update equations that can be sequentially applied to each of the nodes of the factorized posterior distribution until a certain convergence criterion is met. We interpret these results in a Hierarchical Reinforcement Learning framework, where each level adapts its learning rate (LR) as a function of the expected log-likelihood of the current observation given the past.

Fortunately, under the form of the approximate posterior we chose and using Conjugate VMP, the variational parameters of the posterior over $z$ have a simple form given the current value of $\hat{w}$. For $J$ observations observed at time $t$, we have:

$$\theta^{\xi}_t = \hat{\vartheta}^{\xi} + \sum_{j=1}^{J} T(x_{t,j}) \tag{5}$$

$$\theta^{\eta}_t = \hat{\vartheta}^{\eta} + J \tag{6}$$

Equation 5 has an interesting correspondent in the RL literature. Consider the limit case where $J = 1$ and $w$ is fixed to its expectation $\hat{w}$ (which is still analytically tractable following Equation 2). As the expectation of a distribution of the exponential family has the general form $\mathbb{E}[z] = \xi / \eta$, one can derive a similar posterior expectation of $z$ (Diaconis & Ylvisaker, 1979):

$$\mathbb{E}_{q(z,w)}[z] = \frac{\hat{\vartheta}^{\xi} + T(x_t)}{\hat{\vartheta}^{\eta} + 1}$$

Now, replacing $\hat{\vartheta}^{\xi}$ by $Q\,\hat{\vartheta}^{\eta}$ and defining $\alpha := (\hat{\vartheta}^{\eta} + 1)^{-1}$, the above expression becomes (Mathys, 2016)

$$\mathbb{E}_{q(z,w)}[z] = Q + \alpha\,(T(x_t) - Q) \tag{7}$$

where $Q$ is the average estimate at the time of the previous observation and $\alpha$ is the LR, whose value is inversely proportional to the effective memory $\hat{\vartheta}^{\eta}$ and, through it, to the current expected value of the forgetting factor $\hat{w}$.¹ Equation 7 is a classical incremental update rule in RL (Sutton & Barto, 1998), and our algorithm can be viewed as a special case of such algorithms in which the LR is adapted online to the data at hand.

¹ One can easily see that $\hat{w}$ dictates the memory of the learner: if $q(w)$ is stationary with mean $\hat{w}$, the effective number of observations converges to $\theta^{\eta}_0 + (1-\hat{w})^{-1}$.
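The correspondence with a delta rule can be sketched as follows (Python for illustration; the update follows Equations 5-7 under our simplifying assumptions of a scalar sufficient statistic, a single observation per step, and a fixed $\hat{w}$):

```python
def hafvf_like_update(Q, theta_eta, x, w_hat, theta_eta0=1.0):
    """Delta-rule view of the conjugate update (Eq. 7): the learning rate
    shrinks as the effective memory theta_eta grows, and the forgetting
    factor w_hat bounds that memory."""
    theta_eta = w_hat * theta_eta + (1 - w_hat) * theta_eta0  # forgotten prior count
    alpha = 1.0 / (theta_eta + 1.0)                           # adaptive learning rate
    Q = Q + alpha * (x - Q)                                   # incremental RL update
    return Q, theta_eta + 1.0                                 # one new observation

Q, theta_eta = 0.0, 1.0
for _ in range(200):
    Q, theta_eta = hafvf_like_update(Q, theta_eta, x=1.0, w_hat=0.8)
print(round(theta_eta, 4), round(Q, 4))  # → 6.0 1.0
```

The memory settles at the fixed point of $m = \hat{w}m + (1-\hat{w})\theta^{\eta}_0 + 1$, here $1 + (1 - 0.8)^{-1} = 6$: the learning rate never decays below $1/7$, so the learner can always forget.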

The update equation of $w$ is, however, not as simple to derive as that of $z$, because $p(z \mid \hat{\vartheta})$ is not conjugate to the Beta prior of $w$. To solve this problem, we use a Non-Conjugate VMP (NCVMP) approach (Knowles & Minka, 2011). Briefly, NCVMP minimizes an approximate KL divergence in order to find the values of the approximate posterior parameters that maximize the ELBO. In order to use NCVMP, the first step is to derive the expected log-joint probability of the model, which we will need to differentiate with respect to $\hat{w}$ (or, in the case of the approximate posterior update rule for $b$, $\hat{b}$). It quickly appears that part of this expression does not always have an analytical form for common exponential distributions: indeed, the expected value of $B(\vartheta)$ is, in general, intractable and needs to be approximated. Expanding the Taylor series of this expression around $\hat{w}$ up to the second order and taking the expectation, we have:

$$\mathbb{E}_{q(w)}[B(\vartheta)] \approx B(\hat{\vartheta}) + \tfrac{1}{2}\,\mathbb{E}_{q(w)}[(w - \hat{w})^2]\,\nabla^2_{\hat{w}} B(\hat{\vartheta}) \tag{8}$$

Notice that the second term of the sum in Equation 8 can be expressed as $\tfrac{1}{2}\mathrm{Var}_{q(w)}[w]\; d\theta^{\top} C(T(z) \mid \hat{\vartheta})\, d\theta$, where $C(T(z) \mid \hat{\vartheta})$ is the prior covariance of $T(z)$. Hence, this penalty term becomes important when the product of the following factors increases: the distance $d\theta$ between the previous posterior and the initial prior, the posterior variance of $w$ and the prior covariance of $T(z)$. This has the effect of favoring values of $\phi^{\alpha}$ and $\phi^{\beta}$ that imply a low variance of $w$, especially when the two proposed distributions, $p_{t-1}$ and $p_0$, are very distant from each other.

We now derive the update equation for the approximate posterior of $w$. Let us first define

$$\begin{aligned} \delta_L &:= \frac{d}{d\hat{w}}\,\mathbb{E}_{q(z)}[\log p(z \mid \hat{\vartheta})] \\ \delta_C &:= -\tfrac{1}{2}\mathrm{Var}_{q(w)}[w] \times \frac{d}{d\hat{w}}\, d\theta^{\top} C(T(z) \mid \vartheta)\, d\theta \\ \delta_V &:= -\tfrac{1}{2}\, d\theta^{\top} C(T(z) \mid \hat{\vartheta})\, d\theta \times C(\log w \mid \phi)^{-1} \nabla_{\phi} \mathrm{Var}_{q(w)}[w]. \end{aligned} \tag{9}$$

We obtain the following result:

###### Proposition 1.

Using Algorithm 1 of Knowles & Minka (2011), the update equation for $q(w)$ has the form:

$$\begin{aligned} \phi^{\alpha}_t &= \hat{\varphi}^{\alpha} + K(\phi^{\alpha}_t, \phi^{\beta}_t)\,\delta_L + K(\phi^{\alpha}_t, \phi^{\beta}_t)\,\delta_C + \delta^{\alpha}_V \\ \phi^{\beta}_t &= \hat{\varphi}^{\beta} - K(\phi^{\beta}_t, \phi^{\alpha}_t)\,\delta_L - K(\phi^{\beta}_t, \phi^{\alpha}_t)\,\delta_C + \delta^{\beta}_V \end{aligned} \tag{10}$$

where

$$\begin{aligned} K(x, y) &:= \frac{Mx + L(y)\,y}{(L(x)L(y) - M^2)(x+y)^2} > 0 \\ L(x) &:= \psi_1(x) + M \\ M &:= -\psi_1(\phi^{\alpha}_t + \phi^{\beta}_t) \end{aligned}$$

and $\psi_n$ is the $n$-th order polygamma function.

###### Proof.

Follows directly from Algorithm 1 in Knowles & Minka (2011); the full development can be found in the supplementary materials. ∎

The update equations in Equation 10 are easily transposed to the approximate posterior of $b$.

In Proposition 1, we show that the update of $\phi$ can be decomposed into four terms: the first is the (weighted) prior $\hat{\varphi}$, which acts as a reference for the update.

The second term, $K\,\delta_L$, depends upon $\delta_L$, the derivative with respect to $\hat{w}$ of the expectation of the log probability of $z$, times a positive constant $K$. $\delta_L$ has a simple form:

###### Lemma 1.

The derivative of the first-order Taylor expansion of the expected log probability around $\hat{w}$ has the form

$$\delta_L = d\theta^{\top}\left(\mathbb{E}_{q(z)}[T(z)] - \mathbb{E}[T(z) \mid \hat{\vartheta}]\right)$$

The proof is given in the supplementary materials. The expression of $\delta_L$ is easily understood as a measure of the similarity between the current update of the variational posterior and the previous-posterior-dependent prior $p(z \mid \hat{\vartheta})$. Note that a rather straightforward result of Lemma 1 is that $\delta_L$ vanishes as the posterior strengthens: the relative change that one can expect tends to zero, and the impact of $\delta_L$ on the update of $\phi$ can be expected to decrease. This is the behaviour we aimed at: a very strong posterior probability becomes more and more difficult to change as training time increases.

Note also the opposite sign of the related increment in Equation 10 for $\phi^{\alpha}$ and $\phi^{\beta}$. This implies that if the current observation is in line with the previous posterior, then $\delta_L > 0$ and the update of $\hat{w}$ will tend to increase. The opposite is true for surprising observations, showing that the posterior of $w$ effectively tracks the similarity between the current observation and the previous ones.

The third and fourth terms of Equation 10, $K\,\delta_C$ and $\delta_V$, are conditioned by the posterior variance of $w$ and the prior covariance of $T(z)$. In brief, they push the update of $\phi$ in a direction that lowers the variance of both $w$ and $z$. We will show in the next section a simple example of the relative contribution of each of these terms to the update.

An important consideration is that the values of $\phi^{\alpha}_t$ and $\phi^{\beta}_t$ must be positive, a restriction that may be violated in practice, especially for low values of $\phi$. In such cases, we reset the value of $\phi$ to some arbitrary value where positivity holds, and resume the update loop until convergence or until a certain number of iterations is reached. Note that NCVMP is not guaranteed to converge but, as suggested by Knowles & Minka (2011), the use of a form of exponential damping can improve the convergence of the algorithm.
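The safeguards just described (positivity reset and exponential damping) can be sketched as a generic damped fixed-point loop. The update map `g` below is a toy stand-in with a known fixed point, not the actual NCVMP update, which is model-specific:

```python
def damped_fixed_point(g, phi0, damping=0.5, reset=1.0, max_iter=100, tol=1e-8):
    """Generic damped fixed-point loop with a positivity reset, mirroring the
    safeguards described in the text (the true NCVMP update g is model-specific)."""
    phi = phi0
    for _ in range(max_iter):
        candidate = g(phi)
        if candidate <= 0:                                   # constraint violated:
            phi = reset                                      # restart from a safe value
            continue
        new_phi = (1 - damping) * phi + damping * candidate  # exponential damping
        if abs(new_phi - phi) < tol:
            return new_phi
        phi = new_phi
    return phi

# Toy update map with fixed point at phi = 2
phi_star = damped_fixed_point(lambda p: (p + 2.0) ** 0.5, phi0=5.0)
print(round(phi_star, 6))  # ≈ 2.0
```

Damping trades convergence speed for stability; the reset simply restarts the iteration from a region where the positivity constraint holds.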

### 2.2 Example: Binary distribution learning

In order to better understand the relative contributions of the terms of Equation 10 to the variational update scheme, we generated a sequence of 200 binary data points distributed according to a Bernoulli distribution whose probability switched between two values every 40 trials. This distribution can be modelled as a hierarchy of Beta distributions, where the first level is a Bernoulli distribution with a conjugate Beta approximate posterior, and the one or two levels above are both Beta distributions measuring the stability of the level below. We simulated the learning process in three cases:

• A two-layer HAFVF model, where only the posterior over $z$ could be forgotten (incremental memory).

• A two-layer HAFVF model, where the posterior of $w$ was forgotten at a fixed rate (i.e. $\hat{b}$ fixed).

• A three-layer HAFVF model, where the forgetting rate of the posterior of $w$ was itself learned.

In each of these examples, we used the same implementation: the Beta prior of the first level was fixed, and the prior over $w$ was chosen as a good compromise between informativeness and freedom to fit the data. If applicable, the top-level prior over $b$ was fixed accordingly.

In the first case, the fitting rapidly degenerated, as the memory grew at each trial. Figure 2, left column, gives a hint about the reason for this behaviour: each observation decreases the prior covariance $C(T(z) \mid \hat{\vartheta})$, which results in a positive increment for both $\phi^{\alpha}$ and $\phi^{\beta}$ through $\delta_V$. This can be viewed as a form of confirmation bias: because the posteriors over $z$ and $w$ are confident about the distribution of the data, they tend to reinforce each other and lose flexibility. Consequently, the impact of contingency changes decreases as learning goes on. This might seem undesirable (and, in this pathological case, it is), but with datasets containing outliers it can be very beneficial: a longer training in a stable environment will require a longer and/or stronger sequence of outliers to reset the parameters.

Adding a forgetting factor to the posterior of $w$ can moderate the effect of overtraining. In the case of a fixed forgetting rate for the posterior probability of $w$ (Figure 2, right column), the fitting is much more stable: the model is able to learn and forget the current distribution efficiently, with a memory bounded at approximately 5 trials. This shows that adding forgetting over the posterior of $w$ effectively provides the flexibility we aim at: the contingency changes are efficiently detected, and the drop of $\hat{w}$ (through $\delta_L$) triggers a resetting of the parameters in the following trials.

In the last case (Figure 3), the first level of the model acquires a larger memory than in the second example, due to the ability of the model to adapt the forgetting factor of $w$, which relaxes its bound. It is, however, more flexible than the first example.

### 2.3 Forward-Backward algorithm

Let us consider the conjugate posterior of a distribution from the exponential family when the whole dataset has been observed. For a given trial $t$, one can derive the posterior probability of $z$ given all observations as:

$$p(z \mid x_t, x_{\neg t}; \theta_0) = \frac{p(x_t \mid z)\, p(z \mid x_{\neg t}; \theta_0)}{p(x_t \mid x_{\neg t})} \tag{11}$$

Given Equation 2 and Equation 5, if $p(z)$ is the conjugate prior of $p(x \mid z)$ and is from the exponential family, we can substitute the prior by a combination of ${}^{f}\theta_t$ and ${}^{b}\theta_t$, the effective samples retrieved from the forward and the backward application of the AFVF on the dataset, respectively. Formally, we have:

$$\begin{aligned} \theta^{\xi}_t &= {}^{f}\theta^{\xi}_t + {}^{b}\theta^{\xi}_t - \sum_{j=1}^{J} T(x_{t,j}) - \theta^{\xi}_0 \\ \theta^{\eta}_t &= {}^{f}\theta^{\eta}_t + {}^{b}\theta^{\eta}_t - J - \theta^{\eta}_0 \end{aligned} \tag{12}$$

where the $f$ and $b$ superscripts index the forward and backward passes, respectively. In offline learning, this technique can increase the effective memory of the approximate posterior distribution just before and after the change trials.
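In the stationary limit ($w = 1$), Equation 12 recovers exactly the full-data conjugate posterior, which is easy to verify with a Beta-Bernoulli sketch (our illustration; variable names are ours):

```python
def forward_counts(data, theta0):
    """Stationary forward pass: cumulative Beta pseudo-counts up to and including t."""
    a, b = theta0
    out = []
    for x in data:
        a, b = a + x, b + (1 - x)
        out.append((a, b))
    return out

data = [1, 0, 1, 1, 0]
theta0 = (1.0, 1.0)
f = forward_counts(data, theta0)
bwd = forward_counts(data[::-1], theta0)[::-1]  # backward pass, re-aligned to t

# Eq. 12 (stationary case): combine passes, removing the doubly counted x_t and prior
t = 2
a = f[t][0] + bwd[t][0] - data[t] - theta0[0]
b = f[t][1] + bwd[t][1] - (1 - data[t]) - theta0[1]
full = forward_counts(data, theta0)[-1]
print((a, b), full)  # both equal the full-data posterior counts (4.0, 3.0)
```

With forgetting active ($w < 1$), the combined parameters instead blend evidence from both temporal directions, each discounted by its own distance to $t$.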

## 3 Related work

Change detection is a broad field in machine learning, in which no optimal and general solution exists (Kulhavý & Zarrop, 1993). Consequently, assumptions about the structure of the system can lead to very different algorithms and results.

The Kalman Filter (Azizi & Quinn, 2015) is a special case of Bayesian Filtering (BF) (Doucet et al., 2001) that has had a large success in the signal processing literature due to its sparsity and efficiency. It is, however, highly restrictive, and its assumptions need to be relaxed in many instances. One can discriminate two main approaches to this problem: the first is to use a global approximation of BF such as Particle Filtering (PF) (Smidl & Quinn, 2008; Smidl & Gustafsson, 2012; Özkan et al., 2013), which enjoys a bounded error but suffers from a lower accuracy than other local approximations. The second class of algorithms comprises the Stabilized Forgetting (SF) family of algorithms (Kulhavý & Zarrop, 1993; Azizi & Quinn, 2015; Laar et al., 2017), of which our model is a special case. SF suffers from an unbounded error, but it usually has a greater accuracy for a given amount of resources (Smidl & Quinn, 2008). Note that SF has been shown to be essential to reduce the divergence between the true posterior and its approximation in recursive Bayesian estimation (Kárný, 2014). As we apply the SF operator to estimate the posterior of $z$ and the mixture weight $w$ (through the weighted mixture prior), we ensure that the divergence is reduced for both of these latent variables.

Even though our model is described as a Stabilized Exponential Forgetting (Kulhavý & Kraus, 1996) algorithm and is well suited for signal processing, it can be generalized to models where there is no prediction of future states (e.g. smoothing of a signal, reinforcement learning, etc.). It also improves on other methods in the following ways:

First, it uses a Beta prior on the mixing coefficient. This is unusual (though not unique (Dedecius & Hofman, 2012)), as most previous approaches used a truncated exponential prior (Smidl & Quinn, 2005; Masegosa et al., 2017) or a fixed, linear mixture prior that accounts for the stability of the process (Smidl & Gustafsson, 2012). In Stabilized Linear Forgetting, a Bernoulli prior with a Beta hyperprior has been proposed for the mixture weight (Laar et al., 2017). Our approach is designed to learn the posterior probability of the forgetting factor in a flexible manner. We show that this posterior probability depends upon its own (and possibly a mixture of) prior distribution and upon the prior covariance of the model parameters. This makes the change detection more subtle than the all-or-none process one might observe with a Bernoulli distribution. It also enables us to accumulate evidence for a change of distribution across trials, which can help to discriminate outliers from real, prolonged contingency changes. This is, to our knowledge, an entirely novel feature in the adaptive forgetting literature.

The second important novelty of our model lies in its hierarchical learning of the environment stability. This is somehow similar to the Hierarchical Gaussian Filter (HGF) (Mathys, 2011; Mathys et al., 2014). The present model is, however, much more general, as the generic form we provide can be applied to several members of the exponential family. Also, although the KL divergence (error term) of our model is not bounded in the long run, it can be efficiently applied to a large subset of datasets and models, whereas the HGF often fails to fit processes that are highly stationary, with many datapoints and/or with abrupt contingency changes.

## 4 Experiments

The HAFVF was coded in the Julia language (Bezanson et al., 2017) using a Forward automatic differentiation algorithm (Revels et al., 2016) for the NCVMP for the RL and AR parts of this section, and using an analytical gradient for the SGD part.

### 4.1 Reinforcement Learning

We first look at the behaviour of the model in the simple case of estimating the current distribution of a random variable sampled from a moving distribution. We simulated two sequences of 2×200 datapoints, where each pair of points was generated according to the same multivariate normal distribution. We then added an independent random walk to the means.

We applied the Forward-Backward (FB) version of the HAFVF to these datasets, using the same Normal-Inverse-Wishart prior for both fits. The prior over $w$ was manipulated to include either a high confidence or a low confidence about the average value of $w$; both of these priors had the same expected value. To avoid overfitting of early trials (which may happen with weak priors) while keeping the distribution flexible, we used a flat prior over $b$. The top-level forgetting was ignored. Results are shown in Figure 4.

As the first setting had a weak prior over $w$, it had more freedom to adapt the posterior distribution to the current data. The effective memory trace (measured with the parameter $\theta^{\eta}$) was greater when the environment was stable, and changed faster after the contingency change than with the more confident prior, for which the adaptation was slow and the effective memory did not increase much above the prior-defined threshold of 10 (or 20 for the FB algorithm).

The behaviour of both models after the contingency change is informative about the effect that the prior had on the inference process: the weak-prior forgetting factor dropped immediately after an unexpected observation was made, which can be advantageous when sudden changes are expected, but maladaptive in the presence of outliers. The strong-prior model behaved in the opposite way, and handled the change more slowly than its weak-prior counterpart.

It is interesting to note that the posterior probability distribution of $b$ (not shown in the figure) was also more flexible in the first model fit than in the second, because the observations in the level below were also more variable, due to the less confident prior over $w$: this had the effect of increasing the gain in precision over $w$, which increased the strength of the posterior over $b$ (through $\delta_L$ and $\delta_C$ in Equation 10).

### 4.2 Autoregressive model

We fitted the HAFVF to a simulated non-stationary sinusoidal signal of 400 datapoints issued from two separate systems with a low and a high frequency. These signals were randomly generated as the sum of five sinusoidal waves, with the aim of observing whether the algorithm was able to adapt to the abrupt contingency change.

Because we also aimed at a more informative view of the performance of the algorithm in the presence of artifacts, we altered this signal by adding two impulses of 2 a.u. at two arbitrary time points.

We studied a single implementation of the model, with a relatively strong prior over $w$ and a flat prior over $b$. The Forward-Backward version of the algorithm was applied, with an arbitrarily chosen forward and backward order of 10 samples. Figure 5 shows the results of this experiment.

### 4.3 Stochastic Gradient Descent

SGD is a popular technique to find the minimum of (often computationally expensive) loss functions over large datasets (Tran et al., 2015) or involving intractable integrals (Kingma & Welling, 2013) that can be sampled from. However, SGD can be unstable, especially with recurrent neural networks (Fabius & van Amersfoort, 2014), where an isolated, highly noisy sample in the sequence can lead to a degenerate sample of the gradient over the whole sequence. This effect is further magnified when the sample size is low.

We implemented a slightly modified version of the HAFVF in an SGD framework, intended to be similar to the Adam optimizer (Kingma & Ba, 2015).² In short, we used two specific decays for the posteriors of the means and variances of the gradients, respectively. We modelled these posteriors as a set of Normal-Inverse-Gamma distributions. Each set of weights and biases of the multilayered perceptrons was provided with its own hierarchical decay, to take advantage of the fact that some groups of partial derivatives might be noisier than others. We used this algorithm with strong priors over $w$ and $b$, to limit the effect of degenerate gradients on the approximate posteriors.

² More details can be found in the supplementary materials.

This algorithm was tested with a variational recurrent auto-encoding regression model inspired by Moens & Zenon (2018), where the output probability density was a first-passage density of a Wiener process (Ratcliff, 1978). The simulated dataset was composed of 64 subjects performing a 500-trial-long two-alternative forced-choice task (Britten et al., 1992), where choices and reaction times were the variables the model aimed to predict. At each step of the SGD process, 5 subjects were sampled, for a total of 2500 trials.

Figure 6 compares the results of the AdaFVF SGD optimizer with the Adam optimizer, executed with its default parameters. The AdaFVF proved to be less affected by degenerate samples than Adam, as can be seen from the ELBO traces and from the heat plots of the expected memories at iteration 10000.
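To situate the AdaFVF against its baseline, the sketch below implements a plain Adam step (the standard published update, not the paper's code); in the AdaFVF variant described above, the fixed decays `beta1`/`beta2` would be replaced by learned, per-group forgetting factors:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; in the AdaFVF variant described in the text, the fixed
    decays beta1/beta2 would be replaced by learned forgetting factors."""
    m = beta1 * m + (1 - beta1) * grad       # decayed mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # decayed mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):                     # minimize (theta - 3)^2
    grad = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(round(theta, 2))                       # close to 3.0
```

A fixed decay treats every gradient sample as equally trustworthy; a learned, hierarchical decay can instead shorten the memory when the gradient distribution shifts and lengthen it when sampling noise produces isolated degenerate gradients.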

## 5 Limitations, perspective and conclusion

Our algorithm has the following limitations. The first lies in the exponential form we have given to the mixture distributions; a linear form, similar to (Laar et al., 2017), could however also be implemented at specific levels of the hierarchy of the whole model. It may also be difficult to choose an adequate prior for the various levels of the hierarchy. The naive prior of the lower level is usually crucial but hard to specify, but this is a generic issue in adaptive forgetting. For the two top levels, we propose as a rule of thumb to use weak priors in situations where abrupt contingency changes are expected; they also provide a larger memory to the model, but are more affected by outliers than stronger priors. Stronger priors are therefore advisable in situations where the sequence is expected to contain outliers, and when large amounts of data are modelled. There is, however, no generic solution, and one might need to try different model specifications before selecting the optimal (i.e. best suited) one.

The HAFVF and variants could lead to many promising developments in RL related fields, where they might help to prevent unnecessary forgetting of past events during exploration, in signal processing and more distant fields such as deep learning, where they could be used to prevent the occurrence of catastrophic forgetting.

In conclusion, we present a new generic model aimed at coping with abrupt or slow signal changes and presence of artifacts. This model flexibly adapts its memory to the volatility of the environment, and reduces the risk of abruptly forgetting its learned belief when isolated, unexpected events occur. The HAFVF constitutes a promising tool for decay adaptation in RL, system identification and SGD.

## Acknowledgements

We thank the reviewers for their careful reading and precious comments on the manuscript. We also thank Oleg Solopchuk and Alexandre Zénon, who greatly contributed to the development and writing of the present manuscript. The present work was supported by grants from the ARC (Actions de Recherche Concertées, Communauté Francaise de Belgique).

## References

• Azizi & Quinn (2015) Azizi, S. and Quinn, A. A data-driven forgetting factor for stabilized forgetting in approximate Bayesian filtering. In 2015 26th Irish Signals and Systems Conference (ISSC), volume 11855, pp. 1–6. IEEE, jun 2015. ISBN 978-1-4673-6974-9.
• Bezanson et al. (2017) Bezanson, Jeff, Edelman, Alan, Karpinski, Stefan, and Shah, Viral B. Julia: A Fresh Approach to Numerical Computing. SIAM Review, 59(1):65–98, jan 2017. ISSN 0036-1445. doi: 10.1137/141000671.
• Britten et al. (1992) Britten, K H, Shadlen, M N, Newsome, W T, and Movshon, J a. The analysis of visual motion: a comparison of neuronal and psychophysical performance. The Journal of neuroscience : the official journal of the Society for Neuroscience, 12(12):4745–65, dec 1992. ISSN 0270-6474. doi: 10.1.1.123.9899.
• Dedecius & Hofman (2012) Dedecius, Kamil and Hofman, Radek. Autoregressive model with partial forgetting within Rao-Blackwellized particle filter. Communications in Statistics: Simulation and Computation, 41(5):582–589, 2012. ISSN 03610918.
• Diaconis & Ylvisaker (1979) Diaconis, Persi and Ylvisaker, Donald. Conjugate Priors for Exponential Families. The Annals of Statistics, 7(2):269–281, mar 1979. ISSN 0090-5364.
• Dickinson (1985) Dickinson, A. Actions and Habits: The Development of Behavioural Autonomy. Philosophical Transactions of the Royal Society B: Biological Sciences, 308(1135):67–78, feb 1985. ISSN 0962-8436.
• Doucet et al. (2001) Doucet, A., De Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer New York, pp. 178–195, 2001. ISSN 1530-888X.
• Fabius & van Amersfoort (2014) Fabius, Otto and van Amersfoort, Joost R. Variational Recurrent Auto-Encoders. arXiv preprint, dec 2014. URL http://arxiv.org/abs/1412.6581.
• Hoffman et al. (2012) Hoffman, Matt, Blei, David M., Wang, Chong, and Paisley, John. Stochastic Variational Inference. arXiv preprint, 2012. ISSN 1532-4435.
• Jaakkola & Jordan (2000) Jaakkola, T S and Jordan, M I. Bayesian parameter estimation via variational methods. Statistics And Computing, 10(1):25–37, 2000. ISSN 0960-3174.
• Kárný (2014) Kárný, Miroslav. Approximate Bayesian recursive estimation. Information Sciences, 285(1):100–111, 2014. ISSN 00200255.
• Kingma & Ba (2015) Kingma, Diederik P. and Ba, Jimmy Lei. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 2015, pp. 1–15, 2015.
• Kingma & Welling (2013) Kingma, Diederik P. and Welling, Max. Auto-Encoding Variational Bayes. arXiv preprint, dec 2013. URL http://arxiv.org/abs/1312.6114.
• Knowles & Minka (2011) Knowles, David and Minka, Thomas P. Non-conjugate variational message passing for multinomial and binary regression. Nips, pp. 1–9, 2011.
• Kulhavý & Kárný (1984) Kulhavý, R. and Kárný, M. Tracking of slowly varying parameters by directional forgetting. Preprints 9th IFAC Congress, 10:178–183, 1984. ISSN 07411146.
• Kulhavý & Kraus (1996) Kulhavý, R. and Kraus, F.J. On Duality of Exponential and Linear Forgetting. IFAC Proceedings Volumes, 29(1):5340–5345, jun 1996. ISSN 14746670.
• Kulhavý & Zarrop (1993) Kulhavý, R. and Zarrop, M. B. On a general concept of forgetting. International Journal of Control, 58(4):905–924, oct 1993. ISSN 0020-7179.
• Laar et al. (2017) Laar, Thijs Van De, Cox, Marco, Diepen, Anouk Van, and Vries, Bert De. Variational Stabilized Linear Forgetting in State-Space Models. pp. 848–852, 2017.
• Mandt et al. (2014) Mandt, Stephan, McInerney, James, Abrol, Farhan, Ranganath, Rajesh, and Blei, David. Variational Tempering. arXiv preprint, 2014. URL http://arxiv.org/abs/1411.1810.
• Masegosa et al. (2017) Masegosa, Andres, Nielsen, Thomas D., Langseth, Helge, Ramos-Lopez, Dario, Salmeron, Antonio, and Madsen, Anders L. Bayesian Models of Data Streams with Hierarchical Power Priors. International Conference on Machine Learning (ICML), 70:2334–2343, 2017.
• Mathys (2011) Mathys, Christoph. A Bayesian foundation for individual learning under uncertainty. Frontiers in Human Neuroscience, 5(May):1–20, 2011. ISSN 16625161.
• Mathys (2016) Mathys, Christoph. How could we get nosology from computation? Computational Psychiatry: New Perspectives on Mental Illness, 20:121–138, 2016.
• Mathys et al. (2014) Mathys, Christoph D, Lomakina, Ekaterina I, Daunizeau, Jean, Iglesias, Sandra, Brodersen, Kay H, Friston, Karl J, and Stephan, Klaas E. Uncertainty in perception and the Hierarchical Gaussian Filter. Frontiers in Human Neuroscience, 8(November):825, nov 2014. ISSN 1662-5161.
• Moens & Zenon (2018) Moens, Vincent and Zenon, Alexandre. Recurrent Auto-Encoding Drift Diffusion Model. bioRxiv, 2018. doi: 10.1101/220517.
• Özkan et al. (2013) Özkan, Emre, Šmídl, Václav, Saha, Saikat, Lundquist, Christian, and Gustafsson, Fredrik. Marginalized adaptive particle filtering for nonlinear models with unknown time-varying noise parameters. Automatica, 49(6):1566–1575, jun 2013. ISSN 00051098.
• Ratcliff (1978) Ratcliff, Roger. A theory of memory retrieval. Psychological Review, 85(2):59–108, 1978.
• Revels et al. (2016) Revels, Jarrett, Lubin, Miles, and Papamarkou, Theodore. Forward-Mode Automatic Differentiation in Julia. arXiv preprint, 2016. URL http://arxiv.org/abs/1607.07892.
• Smidl (2004) Smidl, V. The Variational Bayes Approach in Signal Processing. PhD thesis, 2004.
• Smidl & Quinn (2005) Smidl, V. and Quinn, A. Mixture-based extension of the AR model and its recursive Bayesian identification. IEEE Transactions on Signal Processing, 53(9):3530–3542, sep 2005. ISSN 1053-587X.
• Smidl & Quinn (2008) Smidl, V. and Quinn, Anthony. Variational Bayesian Filtering. IEEE Transactions on Signal Processing, 56(10):5020–5030, oct 2008. ISSN 1053-587X.
• Smidl & Gustafsson (2012) Smidl, Vaclav and Gustafsson, Fredrik. Bayesian estimation of forgetting factor in adaptive filtering and change detection. In 2012 IEEE Statistical Signal Processing Workshop (SSP), number 1, pp. 197–200. IEEE, aug 2012. ISBN 978-1-4673-0182-4.
• Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Introduction to Reinforcement Learning. MIT Press, 1998.
• Tran et al. (2015) Tran, Dustin, Toulis, Panos, and Airoldi, Edoardo M. Stochastic gradient descent methods for estimation with large data sets. arXiv preprint, sep 2015. URL http://arxiv.org/abs/1509.06459.
• Winn et al. (2005) Winn, John, Bishop, C. M., and Jaakkola, T. Variational Message Passing. Journal of Machine Learning Research, 6:661–694, 2005. ISSN 1532-4435.