Gaussian variational approximations for high-dimensional state space models

# Gaussian variational approximations for high-dimensional state space models

## Abstract

Our article considers a Gaussian variational approximation of the posterior density in a high-dimensional state space model. The variational parameters to be optimized are the mean vector and the covariance matrix of the approximation. The number of parameters in the covariance matrix grows as the square of the number of model parameters, so it is necessary to find simple yet effective parameterizations of the covariance structure when the number of model parameters is large. We approximate the joint posterior distribution over the high-dimensional state vectors by a dynamic factor model, having Markovian time dependence and a factor covariance structure for the states. This gives a reduced description of the dependence structure for the states, as well as a temporal conditional independence structure similar to that in the true posterior. The usefulness of the approach is illustrated for prediction in two high-dimensional applications that are challenging for Markov chain Monte Carlo sampling. The first is a spatio-temporal model for the spread of the Eurasian Collared-Dove across North America; the second is a Wishart-based multivariate stochastic volatility model for financial returns.
Keywords. Dynamic factor, Stochastic gradient, Spatio-temporal modeling.

## 1 Introduction

Variational Approximation (VA) (Ormerod and Wand, 2010; Blei et al., 2017) estimates the posterior distribution of a model by assuming the form for the posterior density and optimizing a measure of closeness to the true posterior; e.g., a frequent choice for the approximation is a multivariate Gaussian distribution, where the variational optimization is over an unknown mean and covariance matrix. VA is becoming an increasingly popular way to estimate the posterior because of its ability to handle large datasets and highly parameterized models. The accuracy of the VA depends on a number of factors, such as the flexibility of the approximating family, the model considered, and the sample size. There are now some theoretical results which show that the variational posterior converges to the true parameter value under suitable regularity conditions, and rates of convergence have been established for parametric models (Wang and Blei, 2019) and, more generally, for non-parametric and high-dimensional models (Zhang and Gao, 2018). However, for a finite number of observations, when the variational approximation does not collapse to a point mass, it is often observed that there is a practically meaningful discrepancy between the uncertainty quantification provided by the approximation and that of the true posterior distribution. This is especially the case when the variational family used is insufficiently flexible. Nevertheless, even in these cases, predictive inference – predictions and prediction intervals – obtained from VA seem empirically to be usefully close to those obtained from the exact posterior. As such, variational approximation methods provide a useful and fast alternative to Markov chain Monte Carlo (MCMC), especially when predictive inference is the focus of the analysis.

Our article considers a Gaussian variational approximation (GVA) for a state space model when the state vector is high-dimensional. Such models are common in spatio-temporal applications (Cressie and Wikle, 2011), financial econometrics (Philipov and Glickman, 2006b), and in other important applications. It is challenging to obtain the GVA when dealing with a high-dimensional model, because the number of variational parameters in the covariance matrix of the VA grows quadratically with the number of model parameters. This makes it necessary to parameterize the variational covariance matrix parsimoniously, but still be able to capture the structure of the posterior. This goal is best achieved by taking into account the structure of the posterior itself. We do so by parameterizing the variational posterior covariance matrix using a dynamic factor model, which reduces the dimension of the state vector. The Markovian time dependence for the low-dimensional factors provides sparsity in the precision matrix for the factors.

We develop efficient computational methods for forming the approximations and illustrate the advantages of the approach in two high-dimensional example datasets. For both models, Bayesian inference by MCMC simulation is challenging. The first is a spatio-temporal model for the spread of the Eurasian collared dove across North America (Wikle and Hooten, 2006); the second is a multivariate stochastic volatility model for a collection of portfolios of assets (Philipov and Glickman, 2006b). We derive GVAs for both models and show that they give useful predictive inference.

VA estimates the posterior by optimization. Our article uses stochastic gradient ascent methods to performing the optimization (Ji et al., 2010; Nott et al., 2012; Paisley et al., 2012; Salimans and Knowles, 2013). In particular, the so-called reparameterization trick is used to unbiasedly estimate the gradients of the variational objective (Kingma and Welling, 2014; Rezende et al., 2014). Section 2 briefly reviews these methods. Applying these methods for GVA, Tan and Nott (2017) match the sparsity of the variational precision matrix to the conditional independence structure of the true posterior based on a sparse Cholesky factor for the precision matrix. Their motivation is that zeros in the precision matrix of a Gaussian distribution correspond to conditional independence between variables, and sparse matrix operations allow computations in the variational optimization to be done efficiently. They apply their approach to both random effects models and state space models, but their method is impractical for a state space model having a high-dimensional state vector. This approach and related approximations using a sparse precision matrix are reviewed further in Section 3.

Our approach is also related to recent high-dimensional Gaussian posterior approximations having a factor structure. Factor models (Bartholomew et al., 2011) are well known to be useful for modeling dependence in high-dimensional settings. Ong et al. (2018) consider a Gaussian variational approximation for factor covariance structures using stochastic gradient methods for the variational optimization. Computation in the variational optimization can be done efficiently in high dimensions using the Woodbury formula (Woodbury, 1950). Barber and Bishop (1998) and Seeger (2000) used factor structures in GVA, but only when the variational objective can be computed analytically or with one-dimensional quadrature. Rezende et al. (2014) consider a factor model for the precision matrix with one factor in some applications to some deep generative models arising in machine learning applications. Miller et al. (2017) considered factor parameterizations of covariance matrices for normal mixture components in a flexible variational boosting approximate inference method, including a method for exploiting the reparameterization trick for unbiased gradient estimation of the variational objective in that setting.

Various other parameterizations of the covariance matrix in GVA were considered by Opper and Archambeau (2009), Challis and Barber (2013) and Salimans and Knowles (2013). Salimans and Knowles (2013) also considered efficient stochastic gradient methods for fitting such approximations, using both gradient and Hessian information and exploiting other structure in the target posterior distribution, as well as extensions to more complex hierarchical formulations including mixtures of normals.

Our article makes use of both the conditional independence structure and the factor structure in forming the GVA for high dimensional state space models. Bayesian computations for state space models are well known to be challenging for complex nonlinear models. It is usually feasible to carry out MCMC on a complex state space model by sampling the states one at a time conditional on the neighbouring states (e.g., Carlin et al., 1992); in general, sampling one state at a time requires careful convergence diagnosis and can fail if the dependence between states is strong. Carter and Kohn (1994) document this phenomenon in linear Gaussian state space models, and we also document this problem of poor mixing for the spatio-temporal case (Wikle and Hooten, 2006) discussed later.

State-of-the-art general approaches using particle MCMC methods (Andrieu et al., 2010) can in principle be much more efficient than an MCMC algorithm that generates the states one at a time. However, particle MCMC is usually much slower than MCMC because of the need to generate multiple particles at each time point. Particle methods also have a number of other drawbacks, which depend on the model that is estimated. Thus, if there is strong dependence between the states and parameters, then it is necessary to use pseudo-marginal methods (Beaumont, 2003; Andrieu and Roberts, 2009) which estimate the likelihood and it is necessary to ensure that the variability of the log of the estimated likelihood is sufficiently small (Pitt et al., 2012; Doucet et al., 2015). This is particularly difficult to do if the state dimension is high.

The rest of the article is organized as follows. Section 2 provides a brief description of variational approximation. Section 3 reviews some previous parameterizations of the covariance matrix, in particular the methods of Tan and Nott (2017) and Ong et al. (2018) for GVA using conditional independence and a factor structure, respectively. Section 4 describes our methodology, which combines a factor structure for the states and conditional independence in time for the factors to obtain flexible and convenient approximations of the posterior distribution in high dimensional state space models. Section 5 describes an extended example for a spatio-temporal dataset in ecology concerned with the spread of the Eurasian collared-dove across North America. Section 6 considers variational inference in a Wishart based multivariate stochastic volatility model. Appendix A contains the necessary gradient expressions to implement our method. Technical derivations and other details are placed in a self-contained supplement after the main article. We refer to equations, sections, etc in the main paper as (1), Section 1, etc, and in the supplement as (S1), Section S1, etc.

## 2 Stochastic gradient variational methods

### 2.1 Variational objective function

Variational approximation methods (Attias, 1999; Jordan et al., 1999; Winn and Bishop, 2005) reformulate the problem of approximating an intractable posterior distribution as an optimization problem. Let be the vector of model parameters, the observations and consider Bayesian inference for with a prior density . Denoting the likelihood by , the posterior density is , and in variational approximation we consider a family of densities , indexed by the variational parameter , to approximate . Our article takes the approximating family to be Gaussian so that consists of the mean vector and the distinct elements of the covariance matrix in the approximating normal density.

To express the approximation of as an optimization problem, we take the Kullback-Leibler (KL) divergence, \linenomath

 KL(qλ(θ)||p(θ|y)) =∫logqλ(θ)p(θ|y)qλ(θ)dθ,
\endlinenomath

as the discrepancy measure between to . The KL divergence is non-negative and zero if and only if . It is straightforward to show that , where , can be expressed as \linenomath

 logp(y) =L(λ)+KL(qλ(θ)||p(θ|y)), (1)
\endlinenomath

where \linenomath

 L(λ) =∫logp(θ)p(y|θ)qλ(θ)qλ(θ)dθ (2)
\endlinenomath

is referred to as the variational lower bound or evidence lower bound (ELBO). We have that , with equality if and only if because the KL divergence is non-negative. Eq. (1) shows that minimizing the KL divergence is equivalent to maximizing the ELBO in (2) because does not depend on ; this is a more convenient optimization target as it does not involve the intractable .

For introductory overviews of variational methods for statisticians see Ormerod and Wand (2010); Blei et al. (2017).

Maximizing to obtain an optimal approximation of is often difficult in models with a non-conjugate prior structure, since is defined as an integral which is generally intractable. However, stochastic gradient methods (Robbins and Monro, 1951) are useful for performing the optimization and there is now a large literature surrounding the application of this idea (Ji et al., 2010; Paisley et al., 2012; Nott et al., 2012; Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014; Hoffman et al., 2013; Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2015; Kucukelbir et al., 2017, among others). In a simple stochastic gradient ascent method for optimizing , an initial guess for the optimal value is updated according to the iterative scheme \linenomath

 λ(t+1) =λ(t)+atˆ∇λL(λ(t)), (3)
\endlinenomath

where , is a sequence of learning rates; is the gradient vector of with respect to ; and denotes an unbiased estimate of . The learning rate sequence is typically chosen to satisfy and , which ensures that the iterates converge to a local optimum as under suitable regularity conditions (Bottou, 2010). Various adaptive choices for the learning rates are also possible and we use the ADADELTA (Zeiler, 2012) approach in our applications in Sections 5 and 6.

### 2.3 Variance reduction

Application of stochastic gradient methods to variational inference depends on being able to obtain the required unbiased estimates of the gradient of the lower bound in (3). Reducing the variance of these gradient estimates as much as possible is important for both the stability of the algorithm and fast convergence. Our article uses gradient estimates based on the so-called reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014). The lower bound is an expectation with respect to , \linenomath

 L(λ) =Eq(logh(θ)−logqλ(θ)), (4)
\endlinenomath

where denotes expectation with respect to and . If we differentiate with respect to under the integral sign in (4), the resulting expression for the gradient can also be written as an expectation with respect to , which is easily estimated unbiasedly by Monte Carlo integration provided that sampling from this distribution is feasible. However, this approach, called the score function method (Williams, 1992), typically has a very large variance. The reparameterization trick is often much more efficient (Xu et al., 2019) and we now describe it. Suppose that we can write as , where is a random vector with a density which does not depend on the variational parameters , e.g. for a multivariate normal density , with , where is the (lower triangular) Cholesky factor of we can write , where and is the identity matrix. Substituting into (4), we obtain \linenomath

 L(λ) =Eω(logh(u(λ,ω))−logqλ(u(λ,ω))), (5)
\endlinenomath

where is the expectation with respect to . Differentiating under the integral sign, we obtain \linenomath

 ∇λL(λ) =Eω(∇λlogh(u(λ,ω))−∇λlogqλ(u(λ,ω))), (6)
\endlinenomath

which is easily estimated unbiasedly if it is possible to sample from .

We now discuss variance reduction beyond the reparameterization trick. Roeder et al. (2017), generalizing arguments in Salimans and Knowles (2013), Han et al. (2016) and Tan and Nott (2017), show that (6) can be rewritten as \linenomath

 ∇λL(λ) =Eω(du(λ,ω)dλ{∇θlogh(u(λ,ω))−∇θlogqλ(u(λ,ω))}), (7)
\endlinenomath

where is defined as the matrix with element the partial derivative of the th element of with respect to the th element of . Note that if the approximation is exact, i.e. , then a Monte Carlo approximation to the expectation on the right hand side of (7) is exactly zero even if such an approximation is formed using only a single sample from . This is one reason to prefer (7) as the basis for obtaining unbiased estimates of the gradient of the lower bound if the approximating variational family is flexible enough to provide an accurate approximation. However, Roeder et al. (2017) show that the extra terms that arise when (6) is used directly for estimating the gradient of the lower bound can be thought of as acting as a control variate, i.e. it reduces the variance, with a scaling that can be estimated empirically, although the computational cost of this estimation may not be worthwhile. In our state space model applications, we consider using both (6) and (7), because our approximations may be very rough when the dynamic factor parameterization of the variational covariance structure contains only a small number of factors. Here, it may not be so relevant to consider what happens in the case where the approximation is exact as a guide for reducing the variability of gradient estimates.

## 3 Parameterizing the covariance matrix

### 3.1 Cholesky factor parameterization of Σ

Titsias and Lázaro-Gredilla (2014) considered normal variational posterior approximation using a Cholesky factor parameterization and used stochastic gradient methods for optimizing the KL divergence. Challis and Barber (2013) also considered Cholesky factor parameterizations in Gaussian variational approximation, but without using stochastic gradient optimization methods.

For gradient estimation, Titsias and Lázaro-Gredilla (2014) consider the reparameterization trick with , where , is the variational posterior mean and is the variational posterior covariance with lower triangular Cholesky factor and with the diagonal elements of being positive. Hence, contains and the non-zero elements of and (5) becomes, apart from terms not depending on , \linenomath

 L(λ) =Eω(logh(μ+Cω))+log|C|, (8)
\endlinenomath

and note that since is lower triangular. Titsias and Lázaro-Gredilla (2014) derive the gradient of (8), and it is straightforward to estimate the expectation unbiasedly by simulating one or more samples and computing their average, i.e. plain Monte Carlo integration. The method can also be considered in conjunction with data subsampling. Kucukelbir et al. (2017) considered a similar approach.

### 3.2 Sparse Cholesky factor parameterization of Ω=Σ−1

Tan and Nott (2017) consider an approach which parameterizes the precision matrix in terms of its Cholesky factor , and impose a sparse structure on which comes from the conditional independence structure in the model. To minimize notation, we continue to write for a Cholesky factor used to parameterize the variational posterior even though here it is the Cholesky factor of the precision matrix rather than of the covariance matrix as in the previous subsection. Similarly to Tan and Nott (2017), Archer et al. (2016) also consider parameterizing a Gaussian variational approximation using the precision matrix, but they optimize directly with respect to the elements , while also exploiting sparse matrix computations in obtaining the Cholesky factor of . Archer et al. (2016) are also concerned with state space models and impose a block tridiagonal structure on the variational posterior precision matrix for the states, using functions of local data parameterized by deep neural networks to describe blocks of the mean vector and precision matrix corresponding to different states. Recently Spantini et al. (2018) have considered variational algorithms for filtering and smoothing based on transport maps; they also consider online approaches to estimation of the fixed parameters.

Here, we follow Tan and Nott (2017) and parameterize the variational approximation in terms of the Cholesky factor of . Section 4 shows how to use the conditional independence structure in the model to impose a sparse structure on . We note that sparsity is very important for reducing the number of variational parameters that need to be optimized, so that a sparse allows the Gaussian variational approximation method to be extended to high-dimensions.

Using the reparameterization trick, with , implies that , with . Here, and where is the half-vectorization of stacking the elements of below the diagonal in a vector going from left to right.

Similarly to Section 3.1, \linenomath

 L(λ) =Eω(logh(μ+C−⊤ω)−logqλ(μ+C−⊤ω)),
\endlinenomath

which, apart from terms not depending on , is \linenomath

 L(λ) =Eω(logh(μ+C−⊤ω))−log|C|; (9)
\endlinenomath

with since is lower triangular. Tan and Nott (2017) derive the gradient of (9) and, moreover, consider some improved gradient estimates for which Roeder et al. (2017) provide a more general understanding. Section 4 applies Roeder et al.’s approach to our methodology.

### 3.3 Latent factor parameterization of Σ

While the method of Tan and Nott (2017) is an attractive way to reduce the number of variational parameters in problems with an exploitable conditional independence structure, there are models where no such structure is available. An alternative parsimonious parameterization is to use a factor structure (Geweke and Zhou, 1996; Bartholomew et al., 2011). Ong et al. (2018) parameterize the variational posterior covariance matrix , where is a matrix with , for , and is a diagonal matrix with diagonal elements . The variational posterior becomes with ; this corresponds to the generative model with , where denotes elementwise multiplication. Ong et al. (2018) applied the reparameterization trick based on this transformation and derive gradient expressions of the resulting evidence lower bound. Ong et al. (2018) also outline how to efficiently implement the computations. Section 4.4 discusses this further.

## 4 Methodology

### 4.1 Model, prior and posterior

Let be an observed time series, generated by the state space model

 yt|Xt =xt∼mt(y|xt,ζ),t=1,…,T, (10a) Xt|Xt−1 =xt−1∼st(x|xt−1,ζ),t=1,…,T; (10b)

where the prior density for is , are the unknown fixed (non-time-varying) parameters in the model, and the elements of in the measurement and the state equation are typically different, but the same symbol is used for brevity. The observations are conditionally independent given the states , and the prior distribution of given is

 p(X|ζ)=p(X0|ζ)T∏t=1st(Xt|Xt−1,ζ).

Let denote the full set of unknowns in the model.

The posterior density of is , with , where is the prior density for and . Let be the dimension of and suppose is large. Approximating the joint posterior distribution in this setting is difficult and Section 4.3 describes a method based on GVA.

### 4.2 Example: Multivariate Stochastic Volatility

We illustrate some of the above ideas with the multivariate stochastic volatility model introduced by Philipov and Glickman (2006b), who used it to model the time-varying dependence of a portfolio of assets over time periods; Section 6 discusses the model in more detail.

Philipov and Glickman assume that the return at time period , , is the vector ,

 yt ∼N(0,Σt),Σt∈Rp×p (11a) Σ−1t ∼Wishart(ν,St−1),St=1νH(Σ−1t)dH⊤,St∈Rp×p,ν>k,0

is an unknown positive definite matrix and and are unknown scalars; is a known positive definite matrix. Section 6 describes the priors for all the fixed parameters and latents.

The state vector in this model is , and has dimension . It is very high dimensional when is large, e.g. it is 55 dimensional for . It may then be necessary to use particle methods, which are typically very slow, to estimate such a high dimensional model. Section 6 gives a more complete discussion.

### 4.3 Structure of the variational approximation

The variational posterior density for , is based on a generative model which has the dynamic factor structure, \linenomath

 Xt =Bzt+ϵtϵt∼N(0,D2t), (12)
\endlinenomath

where is a matrix, , and is a diagonal matrix with diagonal elements . Let and , where is the Cholesky factor of the precision matrix of . We will write for the dimension of each , with , and assume that

 C=[C10 0C2],

is block diagonal; is the Cholesky factor of the precision matrix for ; and is the Cholesky factor for the precision matrix of . Let denote the covariance matrix of . We further assume that is lower triangular with a single band, implying that is band tridiagonal. See Section S2 of the supplement for details. For a Gaussian distribution, zero elements in the precision matrix represent conditional independence relationships. In particular, the sparse structure imposed on means that in the generative distribution for , the latent variable , given and , is conditionally independent of the remaining elements of ; in other words, if we think of the variables , as a time series, they have a Markovian dependence structure.

We now construct the variational distribution for through \linenomath

 θ =[Xζ]=[IT+1⊗B00IP]ρ+[ϵ0],
\endlinenomath

where denotes the Kronecker product, is the dimension of , and . We can apply the reparameterization trick by writing , where . Then, \linenomath

 θ =Wρ+Ze=Wμ+WC−⊤ω+Ze, (13)
\endlinenomath

where

is a diagonal matrix with diagonal entries , and . We also write , where the blocks of this partition follow those of .

The factor model above describes the covariance structure for the states, as well as for dimension reduction in the variational posterior mean of the states, since , where . An alternative is to set and use \linenomath

 Xt =μt+Bzt+ϵt, (14)
\endlinenomath

where is now a -dimensional vector specifying the variational posterior mean for directly.

We call parameterization (12) the low-dimensional state mean (LD-SM) parameterization, and parameterization (14) the high-dimensional state mean (HD-SM) parameterization. In both parameterizations, forms a basis for , which is reweighted over time according to the latent weights (factors) . The LD-SM parameterization provides information on how these basis functions are reweighted over time to form the approximate posterior mean, since and we infer both and in the variational optimization. Section 5 illustrates this basis representation. Appendix A outlines the gradients and their derivation for the LD-SM parameterization. Derivations for the HD-SM parameterization follow those for the LD-SM case with straightforward minor adjustments.

Algorithm 1 outlines the stochastic gradient ascent algorithm that maximizes (5). Lemmas A1 and A2 in Appendix A obtain the gradients. Their expectations are estimated by one or more samples from . We note that although automatic differentiation implementations have improved enormously in recent years, and there are some implementations of linear algebra operators supporting sparse precision matrices (Durrande et al., 2019), it is not straightforward to use automatic differentiation for the structured matrix manipulations necessary for efficient computation here which make use of a combination of sparse and low rank matrix computations.

The gradients are computed by either (A2)– (A5) in Lemma A1, or by equations (A6), (A7), (A9) and (A10) in Lemma A2. If the variational approximation to the posterior is accurate, then (A6), (A7), (A9) and (A10) corresponding to the gradient estimates recommended in Roeder et al. (2017) may be prefered; Section 2 explains the reasons. However, since we consider massive dimension reduction with only a small numbers of factors the approximation may be crude and we therefore investigate both approaches in later examples.

Finally, it is well known that factor models have identifiability issues (Shapiro, 1985). The choice of identifying constraints in factor models can matter, particularly for interpretation. However, here the choice of any identifying constraints is not crucial as we do not interpret either the factors or the loadings, but only use them for modeling the covariance matrix and, in the LD-SM parameterization, also the variational mean. Factor structures are widely used as a method for achieving parsimony in the model formulation in the state space framework for spatio-temporal data (Wikle and Cressie, 1999; Lopes et al., 2008), multivariate stochastic volatility (Ku et al., 2014; Philipov and Glickman, 2006a), and in other applications (Aguilar and West, 2000; Carvalho et al., 2008). This is distinct from the main idea in the present paper of using a dynamic factor structure for dimension reduction in a variational approximation for getting parsimonious but flexible descriptions of dependence in the posterior for approximate inference.

### 4.4 Efficient computation

The gradient estimates for the lower bound (see Appendix A for expressions) are efficiently computed using a combination of sparse matrix operations (for evaluating terms such as and the high-dimensional matrix multiplications in the expressions) and, as in Ong et al. (2018), the Woodbury identity for dense matrices such as and . The Woodbury identity is \linenomath

 (ΛΓΛ⊤+Ψ)−1= Ψ−1−Ψ−1Λ(Λ⊤Ψ−1Λ+Γ−1)−1Λ⊤Ψ−1
\endlinenomath

for conformable matrices and diagonal ; it reduces the required computations into a much lower dimensional space since and is diagonal.

## 5 Application 1: Spatio-temporal model

### 5.1 Eurasian collared-dove data

The first example considers the spatio-temporal model of Wikle and Hooten (2006) for a dataset on the spread of the Eurasian collared-dove across North America. The dataset consists of the number of doves observed at location (latitude, longitude) in year corresponding to an observation period of 1986-2003. The spatial locations correspond to grid points with the dove counts aggregated within each area. See Wikle and Hooten (2006) for details. The count observed at location at time depends on the number of times that the location was sampled. However, this variable is unavailable and therefore we set the offset in the model to zero, i.e. .

### 5.2 Model

Let denote the count data at time . Wikle and Hooten (2006) model as conditionally independent Poisson variables, where the log means are given by a latent high-dimensional Markovian process plus measurement error. The dynamic process evolves according to a discretized diffusion equation; specifically, the model in Wikle and Hooten (2006) is \linenomath

 yt|vt ∼Poisson(diag(Nt)exp(vt))yt,Nt,vt∈Rp vt|ut,σ2ϵ ∼N(ut,σ2ϵIp),ut∈Rp,Ip∈Rp×p,σ2ϵ∈R+ ut|ut−1,ψ,σ2η
\endlinenomath

with priors and \linenomath

 u0 ∼N(0,10Ip) ψ|α,σ2ψ ∼N(Φα,σ2ψIp),Φ∈Rp×l,α∈Rl,σ2ψ∈R+ α ∼N(0,σ2αRα),α0∈Rl,Rα∈Rl×l,σ2α∈R+.
\endlinenomath

is the Poisson distribution for a (conditionally) independent response vector parameterized in terms of its expectation and is the inverse-gamma distribution with shape and scale as arguments. The spatial dependence is modeled via the prior mean of the diffusion coefficients , where consists of the orthonormal eigenvectors with the largest eigenvalues of the spatial correlation matrix , where is the Euclidean distance between pairwise grid locations in . Finally, is a diagonal matrix with the largest eigenvalues of . We follow Wikle and Hooten (2006) and set and .

Let , and denote the parameter vector

 θ=(u,v,ψ,α,logσ2ϵ,logσ2η,logσ2ψ,logσ2α),

which we infer through the posterior

 p(θ|y) ∝ σ2ϵσ2ησ2ψσ2αp(σ2ϵ)p(σ2η)p(σ2ψ)p(σ2α)p(α|σ2α)p(ψ|α,σ2ψ) (15) p(u0)T∏t=1p(ut|ut−1,ψ,σ2η)p(vt|ut,σ2ϵ)p(yt|vt),

with . Section S3.2 of the supplement derives the gradient of the log-posterior required by the variational Bayes (VB) approach.

### 5.3 Variational approximations of the posterior distribution

Section 4 considers two different parameterization of the low rank approximation, in which either the state vector has mean , (low-dimensional state mean, LD-SM) or has a separate mean and (high-dimensional state mean, HD-SM). In this particular application there is a third choice of parameterization which we now consider.

The model in Section 5.2 connects the data with the high-dimensional state vector via a high-dimensional auxiliary variable . In the notation of Section 4, we can include in , in which case the parameterization of the variational posterior is the one described there. We refer to this parameterization as a low-rank state (LR-S). However, it is clear from (15) that there is posterior dependence between and , but the variational approximation in Section 4 omits the dependence between and . Moreover, is also high-dimensional, but the LR-S parameterization does not reduce its dimension. An alternative parameterization that deals with both considerations includes in the -block, which we refer to as the low-rank state and auxiliary variable (LR-SA) parameterization. This comes at the expense of omitting dependence between and , but also becomes more computationally costly because, while the total number of variational parameters is smaller (see Table S1 in Section S6 of the supplement), the dimension of the -block increases ( and ) and the main computational effort lies here and not in the -block. Table 1 shows the CPU times relative to the LR-S parameterization. The LR-SA parameterization requires a small modification of the derivations in Section 4, which we outline in detail in Section S4 of the supplement as they can be useful for other models with a high-dimensional auxiliary variable.

It is straightforward to deduce conditional independence relationships in (15) to build the Cholesky factor of the precision matrix of in Section 4, with

 ζ={(v,ψ,α,logσ2ϵ,logσ2η,logσ2ψ,logσ2α)(LR-S)(ψ,α,logσ2ϵ,logσ2η,logσ2ψ,logσ2α)(LR-SA).

Section 4 outlines the construction of the Cholesky factor of the precision matrix of , whereas the minor modification needed for LR-SA is in Section S4 of the supplement. We note that, regardless of the parameterization, we obtain massive parsimony (between variational parameters) compared to the saturated Gaussian variational approximation which in this application has parameters; see Section S6 of the supplement for further details.

We consider four different variational parameterizations, combining each of LR-SA or LR-S with the different parameterization of the means of , i.e. LD-SM or HD-SM. In all cases, we let and perform iterations of a stochastic optimization algorithm with learning rates chosen adaptively according to the ADADELTA approach (Zeiler, 2012). We use the gradient estimators in Roeder et al., i.e. (A6), (A7), (A9) and (A10), although we found no noticeable difference compared to (A2) – (A5); it is likely that this is due to the small number of factors as described in Sections 2 and 4. Our choice was motivated by computational efficiency as some terms cancel out using the approach in Roeder et al.. We initialize and as unit diagonals and, for parity, and are chosen to match the starting values of the Gibbs sampler in Wikle and Hooten.

Figure 1 monitors the convergence via the estimated value of using a single Monte Carlo sample. Table 1 presents estimates of at the final iteration using Monte Carlo samples. The results suggest that the best VB parameterization in terms of ELBO is the low-rank state algorithm (LR-SA) with, importantly, a high-dimensional state-mean (HD-SM) (otherwise the poorest VB approximation is achieved, see Table 1). However, Table S1 shows that this parameterization is about three times as CPU intensive. The fastest VB parameterizations are both Low-Rank State (LR-S) algorithms, and modeling the state mean separately for these does not seem to improve (Table 1) and is also slightly more computationally expensive (Table S1). Taking these considerations into account, the final choice of VB parameterization we use for this model is the low-rank state with low-dimensional state mean (LR-S + LD-SM). Section 5.5 shows that this parameterization gives accurate approximations for our analysis. For the rest of this example, we benchmark the VB posterior from LR-S + LD-SM against the MCMC approach in Wikle and Hooten. Figure 1: L(λ) for the variational approximations for the spatio-temporal example. The figure shows the estimated value of L(λ) vs iteration number for the four different parameterizations, see Section 5.3 or Table 1 for abbreviations.

### 5.4 MCMC settings

Before comparing VB to MCMC, it is necessary to determine a reasonable burn-in period and number of iterations for inference for the Gibbs sampler in Wikle and Hooten. It is clear that it is infeasible to monitor convergence for every single parameter in such a large scale model as (15), and therefore we focus on , and , which are among the variables considered in the analysis in Section 5.5.

Wikle and Hooten use iterations of which are discarded as burn-in. We generate sampling chains with these settings and inspect convergence using the package (Plummer et al., 2006) in . We compute the Scale Reduction Factors (SRF) (Gelman and Rubin, 1992) for and as a function of the number of Gibbs iterations. The adequate number of iterations in MCMC depends on what functionals of the parameters are of interest; here we monitor convergence for these quantities since we report marginal posterior distributions for these quantities later. The scale reduction factor of a parameter measures if there is a significant difference between the variance within the four chains and the variance between the four chains of that parameter. We use the rule of thumb that concludes convergence when , which gives a burn-in period of approximately here, for these functionals. After discarding these samples and applying a thinning of we are left with posterior samples for inference. However, as the draws are auto-correlated, this does not correspond to independent draws used in the analysis in Section 5.5 (note that we obtain independent samples from our variational posterior). To decide how many Gibbs samples are equivalent to independent samples for and , we compute the Effective Sample Size (ESS) which takes into account the auto-correlation of the samples. We find that the smallest is and hence we require iterations after a thinning of , which makes a total of Gibbs iterations, excluding the burn-in of . Thinning is advisable here due to memory issues — it is impractical to store iterations for each parameter (which may be used, for example, to estimate kernel densities) in high-dimensional models.

### 5.5 Analysis and results

We first consider inference on the diffusion coefficient for location . Figure 2 shows the “true” posterior (represented by MCMC) together with the variational approximation for six locations described in the caption of the figure. The figure shows that the posterior distribution is highly skewed for locations with zero dove counts and approaches normality as the dove counts increase. Consequently, the accuracy of the variational posterior (which is Gaussian) improves with increasing dove counts. The figure also shows the phenomena we described in the beginning of Section 1: there is a discrepancy between the posterior densities. However, as we will see, it affects neither the location estimates of the intensity of the process nor its prediction. Figure 2: Distribution of the diffusion coefficients. The figure shows the posterior distribution of ψi obtained by MCMC and VB. The locations are divided into three categories (total doves over time within brackets): zero count locations (Idaho, i=1 , Arizona i=5, left panels), low count locations (Texas, i=35,46, middle panels) and high count locations (Florida, i=96[1,566],105[1,453], right panels). Figure 3: Samples from the posterior sum of dove intensity over the spatial grid for each year. The figure shows 100 samples from the posterior distribution of φt=∑iexp(vit) obtained by MCMC (left panel) and VB (right panel).

Figure shows 100 VB and MCMC posterior samples of the dove intensity for each year summed over the spatial locations, i.e. . The two posteriors are similar and show an exponential increase of doves until the year followed by a steep decline for .

Figure 4 summarises some spatial properties of the model. It shows a heat map of the MCMC and VB posterior means of the dove intensity for the last five years of the data, overlaid on a map of the U.S. The figure confirms that VB gives accurate location estimates of the spatial process in this example. Figure 4: Posterior dove intensity for the years 1999-2003. The figure shows the posterior mean of φit=exp(vit) computed by MCMC (left panels) and VB (right panels) for i=1,…,p=111, and the last 5 years of the data (t=14,15,16,17,18). The results are illustrated with a spatial grid plotted together with a map of the United States, where the colors vary between low intensity (yellow) and high intensity (red). The light blue color is for aesthetic reasons and does not correspond to observed locations.

We draw the following conclusions from the analysis using the MCMC and VB posteriors, which are nearly identical. First, Figure 4 shows that the the dove intensity is most pronounced in the South East states, in particular Florida. Second, Figure 4 also shows that it is likely that the decline of doves for year in Figure 3 can be attributed to a drop in the intensity at two areas of Florida: Central Florida () and South East Florida (). Figure 5 illustrates the whole posterior distribution of the log-intensity for these locations at the year , as well as an out-of-sample posterior predictive distribution for year . Both estimates are kernel density estimates using approximately effective samples. The posterior distributions for the VB and MCMC are similar, and it is evident that using this large scale model for forecasting future values is associated with a large uncertainty.

Figure 6 illustrates the spatial basis functions and their reweighting over time to produce mean of the variational approximation, as discussed in Section 4.3. Figure 5: Forecasting the log-intensity of the spatial process. The figure shows an in-sample forecast density of the log-intensity vit for year 2003 (t=18, upper panels) and out-of-sample forecast density for year 2004 (t=19, lower panels) for Central Florida (i=96, left panels) and South East Florida (i=105, right panels). Figure 6: Spatial basis representation of the state vector. The figure shows the Spatial basis functions (left panel), i.e. the jth column of B, j=1,…,q=4 and the corresponding weights μt (right panel) through t=0,…,18, that fors E(Xt)=Bμt.

## 6 Application 2: Stochastic volatility modeling

### 6.1 Model

The second example considers the Wishart based multivariate stochastic volatility model proposed in Philipov and Glickman (2006b) who used it to model the time-varying dependence of a portfolio of assets over time periods. Section 4.2 briefly discussed this model.

Philipov and Glickman (2006b) assume that the return at time period , , is the vector , with

 yt ∼ N(0,Σt);Σt∈Rp×p; Σ−1t ∼ Wishart(ν,St−1);St=1νH(Σ−1t)dH⊤;St∈Rp×p;ν>k;0

is a lower triangular Cholesky factor of a positive definite matrix , with ; and is assumed known. The prior for is inverse Wishart, i.e. , with and ; a uniform prior on for , i.e. ; and a shifted gamma prior for , i.e. . The joint posterior density for is \linenomath

 p(Σ,A,ν−k,d|y) ∝p(A,d,ν−k)T∏t=1p(Σt|ν,St−1)p(yt|Σt); (16)
\endlinenomath

is the joint prior density for ; is the conditional inverse Wishart prior for given , ; and , and is the normal density for given .

We write for the Cholesky factor of and we reparameterize the posterior in terms of the unconstrained parameter vector

 θ=(vech(H′)⊤,d′,ν′,vech(C′1)⊤,…,vech(C′T)⊤)⊤;

where

 C′t∈Rk×k; C′t,ij=Ct,ij;i≠j, and C′t,ii=logCt,ii; H′∈Rk×k; H′ij=Hij;i≠j, and Hii=logHii;

with and . Section S3.3 shows that \linenomath

 p(θ|y)∝ (17) ×d(1−d)×{∏iHii}{T∏t=1k∏i=1Ct,ii}×p(A,d,ν−k){T∏t=1p(Σt|ν,St−1,d)p(yt|Σt)};
\endlinenomath

Section S1 of the supplement defines the elimination matrix and the commutation matrix ; Section S3.4 of the supplement shows how to evaluate the gradient of the log posterior.

### 6.2 Evaluating the predictive performance of the variational approximation

Philipov and Glickman (2006b) develop an MCMC algorithm to estimate their Wishart based multivariate stochastic volatility model. Rinnergschwentner et al. (2012) point out that the Gibbs sampler developed by Philipov and Glickman contains a mistake which affects all the full conditionals. We find that implementing the ‘corrected’ version of their algorithm results in a very inefficient sampler even for the portfolios used by Philipov and Glickman in their empirical example. This means that the ‘corrected’ Philipov and Glickman algorithm cannot be used as a ‘gold standard’ to compare against the variational approximation results. Although it may be possible to estimate the posterior of Philipov and Glickman’s model using particle methods, we do not pursue this here. Section S7 of the supplement illustrates the inefficiency of the corrected Philipov and Glickman (2006b) sampler and explains its problems.

We now show empirically (by simulation) that the variational posterior provides useful predictive inference. Since MCMC is unavailable, the GVA is benchmarked against an oracle predictive approach, which assumes the the static model parameters are known. We use a bootstrap particle filter (Gordon et al., 1993) to obtain the posterior density of the state-vector at ; it is then possible to obtain the one-step ahead oracle predictive density , where denotes the true static model parameters. The variational predictive density is then benchmarked against the oracle predictive density; we note that the variational predictive density integrates over the variational posterior of all the parameters, including the static model parameters.

Section S5.1 of the supplement shows how to simulate from the oracle predictive density. Section S5.2 of the supplement shows how to simulate from the variational predictive density. The one-step ahead prediction is repeated for horizons. At horizons , both filtering densities are based on and the optimization for finding the variational posterior for is fast since the variational parameters are initialized (except the ones added at ) at their variational mode from the previous optimization.

### 6.3 Variational approximations of the posterior distribution

Since this example does not include a high-dimensional auxiliary variable, we use the low-rank state (LR-S) parameterization combined with both a low-dimensional state mean (LD-SM) and a high-dimensional state mean (HD-SM). As in the previous example, it is straightforward to deduce conditional independence relationships in (17) to build the Cholesky factor of the precision matrix of in Section 4; this section also outlines how to construct the Cholesky factor of the precision matrix of . Massive parsimony is achieved in this application. In particular, for assets, the saturated Gaussian variational approximation has parameters, while our parameterization has . For , the saturated case has parameters and our parameterizations has . See Section S6 of the supplement for more details.

For all variational approximations we let and perform iterations of a stochastic optimization algorithm with learning rates chosen adaptively according to the ADADELTA approach (Zeiler, 2012). We initialize and as unit diagonals and choose and randomly. Figure 7 monitors the estimated ELBO for both parameterizations, using both the gradient estimators in Roeder et al. and the alternative standard ones which do not cancel terms that have zero expectation. For , the figure shows that the different gradient estimators perform equally well. Moreover, slightly more variable estimates are observed in the beginning for the low-dimensional state mean parameterization compared to that of the high-dimensional mean. Table 1 presents estimates of at the final iteration using Monte Carlo samples and also presents the relative CPU times of the algorithms. In this example, the separate state mean present in the high-dimensional state mean seems to improve the ELBO considerably. Figure 7: L(λ) for the variational approximations in the Wishart process example. The figure shows the estimated value of L(λ) vs iteration number using a low-dimensional state mean / high-dimensional state mean (LD-SM / HD-SM) with the gradient estimator in Roeder et al. (2017) or the standard estimator. The left and middle panels are for k=5; the right panel is for the real data with k=12. Figure 8: Multivariate stochastic volatility model with simulated data and T=100. The top row and the two panels from the left of the second row show the marginal one-step-ahead kernel density estimates of the predictive density for each of the k=5 variables for both the variational approximation and the oracle; the test observation is the red line. The right panel of the second row and the rest of the panels show the contour plots of the kernel density estimates of the one-step-ahead bivariate predictive densities for the variational approximation and the oracle; the red dot is the test observation.

### 6.4 Results for simulated data

We now assess the variational approximation by comparing its out-of-sample predictive properties against the oracle predictive density. See Section 6.2. The comparison is based on data generated by the multivariate stochastic volatility model with , and generated from . While the reported results are for a particular simulated dataset due to space restrictions, we have verified that the same performance is obtained when the random number seed is changed and and are varied. Figure 8 shows the kernel density estimates for the marginals of all five parameters and bivariate kernel density estimates for all pairs of variables for the predictive (variational and oracle) for . The figure also shows the test observation (withheld when estimating the variational predictive and the oracle predictive). Figure 9 shows boxplots of draws from all marginals of the predictive densities (variational and oracle) for the horizons . This figure also shows the withheld test observation which is within the prediction intervals of both methods. Figure 10 shows, for each of the horizons, future predictions (variational and oracle) of an equally weighted portfolio conditional on the posteriors using the data . Section S5.3 of the supplement gives more plots that further confirm the accuracy of the variational predictive densities. Figure 9: Simulated data. Boxplots of samples from the variational one-step ahead marginal predictive densities compared against the oracle predictive densities with T=100,101,102,103. The figure also shows the test observation (red) dot for each T and variable. Figure 10: Simulated data. Kernel density estimates of the one-step ahead predictive densities of a equally weighted portfolio of assets. The results are for T=100,101,102,103. The figure also shows the test observation (red line) for each T.

### 6.5 Real data results

The data consists of monthly observations on all (Philipov and Glickman (2006b) only consider assets and report an acceptance probability close to zero when for their sampler) value-weighted portfolios from the 201709 CRSP database, for the period 2009-06 to 2017-09. The portfolios are: consumer non-durables, consumer durables, manufacturing, energy, chemicals, business equipment, telecom, utilities, retail/wholesale, health care, finance, other. With the dimension of the state vector is . We follow Philipov and Glickman (2006b) and prefilter each series using an AR(1) process.

The right panel in Figure 7 shows the estimated ELBO on a variational optimization using the real dataset. While the estimated ELBO plot is more variable than for the case, it settles down eventually. Figure 11 shows the in-sample prediction of given , together with the observed data point, for some of the assets. The figure also shows an in-sample prediction of a portfolio consisting of equally weighted assets. The variational posterior for the real data example uses the low-dimensional state mean parameterization. Figure 11: Multivariate stochastic volatility model: Real data. Kernel density estimates of the in-sample predictive density for some the assets and a equally weighted portfolio of assets. The figure also shows the in-sample observation (red line).

## 7 Discussion

The article proposes a Gaussian variational approximation method for high-dimensional state space models. Dimension reduction in the variational approximation is achieved through a dynamic factor structure for the variational covariance matrix. The factor structure reduces the dimension in the description of the states, whereas the Markov dynamic structure for the factors achieves parsimony in describing the temporal dependence. We show that the method works well in two challenging models. The first is an extended example for a spatio-temporal data set describing the spread of the Eurasian collared-dove throughout North America. The second is a multivariate stochastic volatility model in which the state vector is high dimensional.

Perhaps the most obvious limitation of our current work is the restriction to a Gaussian approximation, which does not allow capturing skewness or heavy tails in the posterior distribution. However, Gaussian variational approximations can be used as building blocks for more complex approximations based on normal mixtures, copulas or conditionally Gaussian families for example (Han et al., 2016; Miller et al., 2017; Smith et al., 2019; Tan et al., 2019) and these more complex variational families can overcome some of the limitations of the simple Gaussian approximation. We intend to consider this in future work.

## Acknowledgements

We thank Mevin Hooten for his help with the Eurasian collared-dove data. We thank Linda Tan for her comments on an early version of this manuscript. Matias Quiroz and Robert Kohn were partially supported by Australian Research Council Center of Excellence grant CE140100049. David Nott was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (MOE2016-T2-2-135).

## Appendix A Gradient expressions of the evidence lower bound

### a.1 Notation and definitions

We consider any vector to be arranged as a column vector with elements, i.e. . Likewise, if is function whose output is vector valued, i.e.