Dynamic sparsity on time-varying parameter models
Dynamic sparsity on dynamic
Paloma Vaissman Uribe
Data Scientist, iFood, São Paulo, Brazil
Hedibert Freitas Lopes
Insper, São Paulo, Brazil
In the present work, we consider variable selection and shrinkage for the Gaussian dynamic linear regression within a Bayesian framework. In particular, we propose a novel method that allows for time-varying sparsity, based on an extension of spike-and-slab priors for dynamic models. This is done by assigning appropriate Markov switching priors for the time-varying coefficients’ variances, extending the previous work of Ishwaran and Rao . Furthermore, we investigate different priors, including the common Inverted gamma prior for the process variances, and other mixture prior distributions such as Gamma priors for both the spike and the slab, which leads to a mixture of Normal-Gammas priors (Griffin and Brown ) for the coefficients. In this sense, our prior can be view as a dynamic variable selection prior which induces either smoothness (through the slab) or shrinkage towards zero (through the spike) at each time point. The MCMC method used for posterior computation uses Markov latent variables that can assume binary regimes at each time point to generate the coefficients’ variances. In that way, our model is a dynamic mixture model, thus, we could use the algorithm of Gerlach et al.  to generate the latent processes without conditioning on the states. Finally, our approach is exemplified through simulated examples and a real data application.
Keywords: Cholesky decomposition, dynamic models, Normal-Gamma prior, spike-and-slab priors, high-dimensional data, scale mixture of Normals.
Over the past few decades, advances in computational processing have encouraged the proliferation of massive datasets, bringing new challenges to statistical research due to high-dimensionality issue. In this sense, regularization and variable selection techniques have become even more relevant to prevent overfitting and to solve ill-posed problems by inducing sparsity and/or shrinkage.
The advantage of regularization was highlighted a few years ago when Hastie et al.  coined the informal Bet on Sparsity principle. The principle encourages the use of procedures that do well in sparse problems for high-dimensional problems, since no procedure does well in dense problems, accordingly to them. Indeed, they have shown that for a dense problem, where all the numerous coefficient where different from zero, and/or there is a high Noise-to-Signal Ratio (NSR), both the former ridge regression procedure of Hoerl and Kennard  and the least absolute shrinkage and selection operator (lasso) from Tibshirani  do poorly in terms of prediction.
Consider the Gaussian linear model defined by
where is a -dimensional vector of continuous responses, is the intercept, is a -dimensional vector of regression coefficients associated with covariates, is a design matrix with each column representing a covariate.
It is well known that the ordinary least squares (OLS) or maximum likelihood estimator (MLE) often does poorly on both interpretation and prediction accuracy. While interpretation stands for the preference for parsimony, in the sense that simpler models put more light on the relationship between the response and covariates, prediction accuracy is related to the bias-variance trade-off. Although the OLS/ MLE estimator has the smallest variance among all linear unbiased estimators accordingly to the Gauss Markov Theorem, an estimator with slight bias but smaller variance could be preferable, leading to a substantial decrease in prediction error.
Modern statistics addresses this trade-off between bias and variance through regularization and variable selection methods, which encourages simpler models because the space of values of estimators considered is smaller. In general terms, the notion of regularization summarizes approaches that allow to solve ill-posed problems, such as those which arises when , or to prevent overfitting. The problem of variable selection arises when there is some unknown subset of the predictors with regression coefficients so small that it would be preferable to ignore them.
Recently, there has been great interest in regularizing the coefficients within problems in which the parameters vary over time and a few methods were proposed such as those from Belmonte et al. , Kalli and Griffin  and Bitto and Frühwirth-Schnatter . In linear regression models with a large number of predictors, it is common to assume that only a subset of them is important for prediction. In the context of dynamic regression, it is reasonable to assume that these relevant subsets change over time.
Actually, we can define two sources of sparsity in dynamic regression problems: the vertical sparsity or dynamic sparsity, which stands for time-varying subsets of relevant predictors, and the horizontal sparsity, which allows for intermittent zeros for when each individual predictor is not relevant at all times .
Consider the Gaussian dynamic linear regression model defined by
for , where is the vector of regressors, is the vector of coefficients and and are two independent sequences of independent Gaussian errors with mean zero and variances and , respectively. Note that setting for all is equivalent to , i.e., the static regression. Usually, we may define .
The problem of shrinking in dynamic regression problems has been addressed by Belmonte et al.  and Bitto and Frühwirth-Schnatter  in a similar strategy. See also Huber et al.  for recent comparative study. Their basic approach was to rewrite the states from (2) in terms of the scaled states as . Then, the shrinkage of the time-varying coefficients was done by assigning priors to their standard deviations . While Belmonte et al.  used the Laplace prior for shrinking the standard deviations, Bitto and Frühwirth-Schnatter  used the Normal-Gamma prior, which is more general since the Laplace prior is a special case of the Normal-Gamma prior where the shrinking parameter equals one.
In both approaches, the standard deviation plays the role of relevance of the th predictor: small values of leads to greater shrinkage of the coefficient for all times . That is, because the standard deviation is taken as fixed for all times , if it is pulled toward zero, then the time-varying effect of the covariate is non significant. In this sense, both tackles horizontal sparsity, as the shrinkage effect of the prior for is equal over all times .
On the other hand, Kalli and Griffin , developed an extension of the Normal-Gamma prior discussed by Griffin and Brown  for dynamic regression where both the values of the regression coefficients and the importance of the variables are allowed to change over time. Time-varying sparsity is allowed by giving independent Normal-Gamma autoregressive (NGAR) process priors to the time series of regression coefficients. For details on the model specification, please refer to Kalli and Griffin . See also Lopes et al. , who propose a customized four-component mixture prior for the elements of . The prior induces sparsity by either flatlining the coefficient to a constant, possibly zero, or letting move around freely according the dynamic structure when the state vector is of very high dimensions, say five thousand equations. All these methods deal with horizontal sparsities.
Vertical sparsity in dynamic models has become an increasingly interesting research topic. Related literature include Nakajima and West , Kalli and Griffin , Rocková and McAlinn  and Kowal and Ruppert  and, more recently, Koop and Korobilis  who uses variational Bayes ideas.
In the present work, we consider variable selection and shrinkage for the Gaussian dynamic linear regression model within a Bayesian framework. In particular, we propose a method that allows for time-varying sparsity, based on an extension of spike-and-slab priors for dynamic models using Markov switching auxiliary variables. The paper is organized as follows: Section 2 reviews the Bayesian approach for regularization and variable selection, emphasizing the main shrinking priors and giving a unifying approach for the spike-and-slab priors. Section 3 introduces the dynamic spike-and-slab prior and describes the full Bayesian model for dynamic regression with time-varying sparsity, the posterior inference and the Markov chain Monte Carlo (MCMC) method for sampling the parameters. Section 4 applies the proposed model to simulated data and considers empirical studies in inflation modeling. Section 5 summarizes our findings and conclusions.
2 Shrinking and variable selection priors: a brief review
Classical regularization is done by maximizing the likelihood subject to a penalty function. While the the ridge regression estimate is a penalized least squares method imposing a penalty on the regression coefficients, the lasso uses the norm instead, leading not only to shrunken coefficients but also to sparse solutions. In that way, the lasso is also considered a variable selection method.
In contrast, Bayesian approaches for both variable selection and regularization in the Gaussian linear model stated by (1) can be formalized through the conditional distribution , where is a parameter vector, comprising, for example, the error variance . The regularization is achieved by specifying appropriate informative priors , where the hyperparameter vector includes parameters controlling shrinkage properties. The model is completed by assuming hyperpriors and and the inference is based on the posterior .
It can be shown that, if and are fixed, then the posterior mode or the maximum a posteriori (MAP)
is equivalent to penalizing the log-likelihood with penalty equal to the (minus) log prior .
2.1 Shrinking priors
In general, practically all shrinking priors are defined hierarchically as a Scale-Mixture of Normals (SMN) (see, e.g., West ). Considering the Gaussian linear model as in (1), the SMN has the following general structure:
where and are independent for any and depends on the vector of hyperparameters . Note that the marginal distribution is non Gaussian, and can assume many forms depending on the mixing distribution . A famous form arises when the mixing distribution from (3) is Exponential, that is,
for , where denotes the Exponential distribution with mean . Thus, marginally, follows a Laplace distribution with parameter , so that . The Laplace prior is also known as the Bayesian Lasso from Park and Casella .
The frequentist ridge regression also has a Bayesian analogue and can be represented by (3) if we consider an Inverted-Gamma mixing distribution . That is, assuming
for , then, follows a scaled distribution with degrees of freedom and scale parameter , marginally. Hence, the ridge prior leads to weaker penalization of large coefficients as long as the distribution has heavier tails than the Gaussian distribution.
2.2 The Normal-Gamma prior
Even more shrinkage than the Bayesian lasso and the ridge prior can be achieved by using a Gamma mixing distribution in (3) as
for , where denotes the Gamma distribution with shape parameter and mean . This structure leads to the Normal-Gamma density, which was applied to regression problems in Griffin and Brown  and can be expressed in closed form as
where is the modified Bessel function of the third kind. Taking this parametrization, the variance of is and the excess kurtosis is . As the shape parameter of the Gamma distribution decreases these include distributions that place a lot of mass close to zero but at the same time have heavy tails. Thus, the effect of the parameter is related to shrinking, where lower values of is associated with more shrinkage as more mass is placed to zero. Thusly, the Normal-Gamma prior is more general than the Bayesian lasso, which corresponds to .
One could fix the shrinking parameter such as in Frühwirth-Schnatter and Wagner  and in Kastner 
2.3 Variable selection: spike-and-slab priors
Bayesian variable selection is commonly based on spike-and-slab priors for regression coefficients. The basic idea is that each component from the coefficients’ vector is modeled as having come either from a distribution with most (or all) of its mass concentrated around zero (the spike), or from a comparably diffuse distribution with mass spread out over a large range of values (the slab), that is
where is a vector of parameters, is the spike distribution, is the slab distribution and is a binary random variable with .
Famous seminal stochastic search variable selection (SSVS) method of George and McCulloch  used Gaussian distributions for both the spike and the slab as follows
for , where is a large scalar and is a small scalar. Note that is the ratio of variances between the slab and the spike distributions.
Indeed, one can achieve (5) by formulating appropriate spike-and-slab priors to the component variances and , what was first proposed by Ishwaran and Rao . Their idea is that, considering the hierarchical SMN representation from (3), if we assume absolutely continuous priors for the component variances, that is, a mixture prior for , then we reach the spike-and-slab structure for the coefficient as in (5).
Actually, even the SSVS approach given by (6), where the coefficients are given a mixture of Normals prior, can be represented in that way. Letting and from (6) and assuming a two point mixture prior to the variance in (3)
where is a discrete measure concentrated at value , we get the original formulation of the SSVS prior of George and McCulloch .
Nevertheless, it can be difficult to set the hyperparameters , and parameters used in the two point mixture prior for the variance , as noted by Ishwaran and Rao . Hence, their approach is to place absolutely continuous priors for . In particular, they chose Inverted-Gamma densities for both the spike and the slab variances as
where is the vector of hyperparameters that define the conditional (on ) variances’ densities. Thus, each has the marginal distribution
where denotes the Student’s t distribution with zero location, scale and degrees of freedom. Because of the SMN representation, (7) is also know as Normal mixture of Inverse-Gamma (NMIG) prior. Although it allows discrimination or variable selection, it does not encourage shrinkage in the sense that the resulting marginal distribution of each coefficient is a two component mixture of scaled Student’s t distributions. Hence it makes sense to choose other component specific distributions, besides the Inverse Gamma, that could actually induce shrinkage. A prior that solves this is assuming Exponential densities for both the spike and the slab as
where denotes the Exponential distribution with mean , which leads to
where denotes the Laplace distribution with mean 0 and scale parameter and the weight is the prior probability of the slab, i.e., . That is, (8) is a mixture of Laplace densities for . Finally, if we assume that is a mixture of Gammas, that is,
then the marginal distribution of is a mixture of Normal-Gamma densities as showed in (4), that is,
with being a constant which depends on the distribution assumption. Table 1 gives a summary for what as discussed through this section assuming the general form from (5) and viewing each prior as a scaled mixture of Normals (SMN).
|Mixture of Laplaces||2|
|Mixture of Normal-Gammas|
In this Section we have discussed several shrinkage and sparsity inducing priors independently assigned to static coefficients from the Gaussian linear model. Now we turn attention to the case where the coefficients from regression are time-varying, the so-called time-varying parameter (TVP) regression models. Next section describes the proposed model for shrinking and selecting subsets of relevant variables dynamically.
3 Our approach for sparsity in TVP models
The aim of the method is allowing for vertical sparsity, so that in each snapshot of time we can have different subsets of relevant predictors as well enabling shrinkage of the time-varying coefficients. This is accomplished by extending the approach of spike-and-slab priors over the variances discussed in Section 2.3 for dynamic models. In our approach, at each time point, an auxiliary Markov switching variable can assume two regimes - the spike or the slab - driving the evolution of the variance , assuming that .
3.1 Model specification
The observation and the state equations considered are given below. We work with univariate time series responses, although extending the framework to multivariate time series is straightforward. We assume the Gaussian dynamic linear regression model
for , where is a matrix of regressors, is a vector of coefficients with the following evolution equation for the scaled states .
for , with
where the initial condition for the scaled states is .
The definition below specifies the generic dynamic spike-and-slab prior that can be assigned to the coefficients’ variances in order to induce shrinkage and/or variable selection.
Because now we are talking about dynamic models and dynamic sparsity, we have a time-varying scale parameter , which is taken to be . That is, the time-varying pattern for the scale parameter is driven by the latent variable , which evolves as a Markov switching process of order 1 and can assume two values or accordingly to a transition matrix. Thus, we assume a finite mixture prior for as
where is the transition probability of the first order Markov process to regime (the slab) given that . Thus, by adopting a regime switching model, the process can switch between the spike and the slab variances’ distributions according to the following transition probabilities
where denotes the probability of changing to regime from regime , . Note that and .
For the other component is placed a prior distribution (Inverse-Gamma, Gamma or Exponential) that together with the variable results in a spike-and-slab prior for that shrinks the coefficients whenever it gets a small value through the spike component of the mixture prior.
For instance, if , then we have a mixture of scaled-t for each as
If , then we have a mixture of Normal-Gamma densities for each as
Finally, if , then we have a mixture of Laplaces densities for each as
Furthermore, we assume that , which means that is distributed as
, where depends on the value of on the distribution constant from Table 1 specified for and on the value of . By defining the hyperparameters and appropriately we can learn about and therefore about . For instance, if we assume that , , ,, we have the following mixture densities for in Figure 1, considering two values of and the NMIG structure.
The assumption that is given a prior as defined in (16) makes also indirectly depend on the previous value of the Markov latent variable . The purpose of this is to keep the variance of the coefficients at each time point constant across the priors defined by equations (13), (14) and (15) and thus, comparable.
For a simpler specification, from now on we assume that the observation variance is constant over time such that . Extending the model to accommodate stochastic volatility is straightforward. In order to complete the specification, we shall assign prior distributions to parameters , and to the transition probabilities in a fully Bayesian strategy. For the observation variance , we assume the conjugate traditional prior
For the AR parameters , we assume that each are independent from each other and distributed as
for , where we are not considering the case . Finally, for the transition probabilities we assign independent Beta distributions as
for , with denoting the spike () or the slab () and , , .
The directed acyclic graph (DAG) that summarizes the dependencies of the proposed model is shown in Figure 2. We can see that the observations are conditionally independent given the scaled coefficients which in turn depend on the AR parameter and on the scale . The latter is driven by the Markov switching variable , which evolves accordingly to the transition probabilities and and with , whose distribution governs the prior choice for (e.g., mixture of Laplaces, Normal-Gammas or scaled-t).
3.2 Posterior inference
The posterior distribution of the parameters can be drawn using an hybrid Gibbs sampler with an additional Metropolis-Hastings update. The scaled states can be updated using the FFBS algorithm (due to Carter and Kohn  and Frühwirth-Schnatter ) within the Gibbs sampler, while the process is updated using the algorithm proposed by Gerlach et al. . At each snapshot of time , the probability of depends only on the previous value observed . Thus, the sequence is a sequence of random variables that are Markov as for we have .
This feature was discussed by Gerlach et al.  in their work about dynamic mixture models. These models adds to dynamic linear model equations 2 the assumption that the system matrices , and and the variance are determined, up to a set of unknown parameters, by the value of . The algorithm and the Lemmas associated are detailed in the Appendix.
In summary, the MCMC scheme is given below.
Draw jointly using Forward Filtering Backward Sampling (FFBS).
Draw jointly using the algorithm of Gerlach et al. .
Draw by its full conditional
where denotes all the parameters to be sampled except from .
Draw each by its full conditional. Assuming the NMIG prior structure from (13),
Draw each using Metropolis Hastings algorithm since the full conditional
has no close form. We use a Beta proposal density as
where is a tuning parameter and the acceptance distribution is
Update the transition probabilities from the latent Markov process by their full conditionals
with and .
Draw each by its full conditional, which depends on the mixing distribution assumed for the spike-and-slab process as follows:
NMIG prior: , with , and
Mixture of Normal-Gammas:
Mixture of Laplaces: , with .
4 Synthetic and real data analyses
In this section we present two simulated examples where some coefficients are relevant in some periods of time and negligible in others. The first example is a singular equation model with five coefficients with four possible patterns and the second example is an application of the modified Cholesky decomposition where we simulate time-varying coefficients that compose the Cholesky factor and then apply the spike-and-slab priors on each recursive regression. In our empirical application, we use the inflation data obtained from Griffin’s research page
4.1 First simulation example
We generated the data using Equation (10) with predictors, and constant observational variance , where and are independent. We simulate the five regression coefficients as follows.
The first coefficient follows a stationary AR(1) process with AR parameter 0.97 and a Normal stationary distribution with mean 2 and variance 0.25. The initial value was drawn from its stationarity distribution .
The second coefficient also follows an AR(1) process with autocorrelation parameter 0.97 and a Normal marginal distribution with mean 0 and variance 0.25, but only until the half of the sample, that is:
with the initial value drawn as .
The third coefficient is always zero, except from two short periods when it equals -2:
The fourth coefficient is .
The fifth coefficient .
We generate 5 replications of the data and then sample from the posterior distribution using the three mentioned priors for with the the following hyperparameters settings: (for the NG prior), (improper prior) and (tuning parameter for Metropolis). The MCMC algorithm was run for 10,000 iterations with half discarded as a burn-in. The prior for autoregressive parameter is for , so that it has mean 0.97. The same choice was made for the transition probabilities and . The informative prior choice is to assure that and evolves smoothly and does not switch regimes so rapidly. Figure 3 shows the posterior medians of the coefficients , comparing them to the real values, while Table 2 shows the root mean square error of the three priors considering the 5 replications. Figure reffig:fit2 shows the posterior densities of the sampled coefficients with the Laplace prior in time points and . We can note the change between these two time points: the third coefficient is equal -2 in , so that its posterior densities shows more mass near this value.
We note that the dynamic Laplace prior was slightly superior in terms of RMSE than the other two priors, but the difference is tiny. The dynamic NG with have some issues: they are much more volatile than the NMIG prior and the Laplace prior.
4.2 Second simulation example
In the Cholesky decomposition each variable is regressed on its predecessors in a dynamic regression problem, that is,
for , with . The Cholesky factor is then
where is the lower triangular matrix of coefficients for each time with zeros in the diagonal, that is, the matrix with entries . Thus, we have parameters to be estimated.
In this second example we simulate time-varying coefficients that compose the Cholesky factor and then apply the spike-and-slab priors on each recursive regression. The simulation is done as follows. We define that the number of time points and the number of ordered variables that compose the vector is . We sample from four possible processes for the time-varying coefficients with the same probability of occurrence. They are:
A stationary AR(1) process with autoregressive coefficient and with fixed variance , without an intercept term, that is