Gaussian variational approximation for high-dimensional state space models

# Gaussian variational approximation for high-dimensional state space models

Matias Quiroz Australian School of Business, School of Economics, University of New South Wales, Sydney NSW 2052, Australia. David J. Nott Robert Kohn Australian School of Business, School of Economics, University of New South Wales, Sydney NSW 2052, Australia.
###### Abstract

Our article considers variational approximations of the posterior distribution in a high-dimensional state space model. The variational approximation is a multivariate Gaussian density, in which the variational parameters to be optimized are a mean vector and a covariance matrix. The number of parameters in the covariance matrix grows as the square of the number of model parameters, so it is necessary to find simple yet effective parametrizations of the covariance structure when the number of model parameters is large. The joint posterior distribution over the high-dimensional state vectors is approximated using a dynamic factor model, with Markovian dependence in time and a factor covariance structure for the states. This gives a reduced dimension description of the dependence structure for the states, as well as a temporal conditional independence structure similar to that in the true posterior. We illustrate our approach in two high-dimensional applications which are challenging for Markov chain Monte Carlo sampling. The first is a spatio-temporal model for the spread of the Eurasian Collared-Dove across North America. The second is a multivariate stochastic volatility model for financial returns via a Wishart process.

Keywords. Dynamic factor model, Gaussian variational approximation, stochastic gradient.

\externaldocument

OnlineSupplement_GVA_highdim_statespace

## 1 Introduction

Variational approximation methods (Ormerod and Wand, 2010; Blei et al., 2017) are an increasingly popular way to implement approximate Bayesian computations because of their ability to scale well to large datasets and highly parametrized models. A Gaussian variational approximation uses a multivariate normal approximation to the posterior distribution, and these approximations can be both useful in themselves as well as building blocks for more complicated variational inference procedures such a those based on Gaussian mixtures or copulas (Miller et al., 2016; Han et al., 2016). Here we consider Gaussian variational approximation for state space models where the state vector is high-dimensional. Models of this kind are common in spatio-temporal modelling (Cressie and Wikle, 2011), econometrics, and in other important applications.

Constructing a Gaussian variational approximation is challenging when considering a model having a large number of parameters because the number of variational parameters in the covariance matrix grows quadratically with the number of model parameters. This makes it necessary to parametrize the variational covariance matrix parsimoniously, but so that we can still capture the structure of the posterior. This goal is best achieved by making intelligent use of the structure of the model itself. This is considered in the present manuscript for high-dimensional state space models. We parametrize the variational posterior covariance matrix using a dynamic factor model which provides dimension reduction for the states, and Markovian time dependence for the low-dimensional factors provides sparsity in the precision matrix for the factors. We develop efficient computational methods for forming the approximations and illustrate the advantages of the approach in two high-dimensional examples. The first is a spatio-temporal dataset on the spread of the Eurasian collared dove across North America (Wikle and Hooten, 2006). The second example is a multivariate stochastic volatility model for a collection of portfolios of assets (Philipov and Glickman, 2006b). Markov chain Monte Carlo (MCMC) is challenging in both examples, but particularly for the latter as Philipov and Glickman (2006b) use an independence Metropolis-Hastings proposal (the performance of which is well known to deteriorate in high dimensions) within a Gibbs sampler for sampling the state vector. Indeed, Philipov and Glickman (2006b) reported acceptance probabilities close to zero for the state vector in an application to assets, which gives a state vector of dimension in their model. We show that our variational approximation allows for efficient inference even when the state dimension is large.

Variational approximation methods formulate the problem of approximating the posterior as an optimization problem. In this paper we use stochastic gradient ascent methods for performing the optimization (Ji et al., 2010; Nott et al., 2012; Paisley et al., 2012; Salimans and Knowles, 2013) and in particular we use the so-called reparametrization trick for unbiased estimation of the gradients of the variational objective (Kingma and Welling, 2014; Rezende et al., 2014). Section 2 describes these methods further. Applying these methods for Gaussian variational approximation, Tan and Nott (2017) considered matching the sparsity of the variational precision matrix to the conditional independence structure of the true posterior based on a sparse Cholesky factor for the precision matrix. This is motivated because zeros in the precision matrix of a Gaussian distribution correspond to conditional independence between variables. Sparse matrix operations allow computations in the variational optimization to be done efficiently. They apply their approach to both random effects models and state space models, although their method is impractical in a state space model where the state is high-dimensional. Archer et al. (2016) also considered a similar idea for the problem of filtering in state space models using an amortization approach, where blocks of the variational mean and sparse Cholesky factor are parametrized in terms of functions of local data. More recently, Krishnan et al. (2017) considered a similar approach, but noted the importance of including future as well as past local data in the amortization procedure. Similar Gaussian variational approximations to those considered in Tan and Nott (2017), Archer et al. (2016) and Krishnan et al. (2017) were earlier considered in the literature where the parametrization is in terms of the Cholesky factor of the covariance matrix rather than the precision matrix (Titsias and Lázaro-Gredilla, 2014; Kucukelbir et al., 2017), although these approaches do not consider sparsity in the parametrization except through diagonal approximations which lose the ability to capture posterior dependence in the approximation. Various other parametrizations of the covariance matrix in Gaussian variational approximation were considered by Opper and Archambeau (2009), Challis and Barber (2013) and Salimans and Knowles (2013). The latter authors also consider efficient stochastic gradient methods for fitting such approximations, using both gradient and Hessian information and exploiting other structure in the target posterior distribution, as well as extensions to more complex hierarchical formulations including mixtures of normals.

Another way to parametrize dependence in a high-dimensional Gaussian posterior approximation is to use a factor structure. Factor models (Bartholomew et al., 2011) are well known to be useful for modelling dependence in high-dimensional settings. Ong et al. (2017) recently considered a Gaussian variational approximation for factor covariance structures using stochastic gradient methods for the variational optimization. Computations in the variational optimization can be done efficiently in high-dimensions using the Woodbury formula (Woodbury, 1950). Factor structures in Gaussian variational approximation were used previously by Barber and Bishop (1998) and Seeger (2000), but these authors are concerned with situations where the variational objective can be computed analytically or with one-dimensional quadrature. Rezende et al. (2014) considered a factor model for the precision matrix with one factor in some applications to some deep generative models arising in machine learning applications. Miller et al. (2016) considered factor parametrizations of covariance matrices for normal mixture components in a flexible variational boosting approximate inference method, including a method for exploiting the reparametrization trick for unbiased gradient estimation of the variational objective in that setting.

Our article specifically considers approximating posterior distributions for Bayesian inference in high-dimensional state space models, and we make use of both the conditional independence structure and the factor structure in forming our approximations. Bayesian computations for state space models are well known to be challenging for complex nonlinear models. It is usually feasible to carry out MCMC on a complex state space model by sampling the states one at a time conditional on the neighbouring states (e.g., Carlin et al., 1992). In general, such methods require careful convergence diagnosis for individual applications and can fail if the dependence between states is strong. Carter and Kohn (1994) document this phenomenon in linear Gaussian state space models, and we also document this problem of poor mixing for the spatio-temporal (Wikle and Hooten, 2006) and multivariate stochastic volatility via a Wishart process (Philipov and Glickman, 2006b) examples discussed later. State of the art general approaches using particle MCMC methods (Andrieu et al., 2010) can in principle be much more efficient than an MCMC that generates the states one at a time. However, particle MCMC is usually much slower than MCMC because of the need to generate multiple particles at each time point. Particle methods also have a number of other drawbacks, which depend on the model that is estimated. Thus, if there is strong dependence between the states and parameters, then it is necessary to use pseudo marginal methods (Beaumont, 2003; Andrieu and Roberts, 2009) which estimate the likelihood and it is necessary to ensure that the variability of the log of the estimated likelihood is sufficiently small (Pitt et al., 2012; Doucet et al., 2015). This is particularly difficult to do if the state dimension is high.

Finally, we note that factor structures are widely used as a method for achieving parsimony in the model formulation in the state space framework for spatio-temporal data (Wikle and Cressie, 1999; Lopes et al., 2008), multivariate stochastic volatility (Ku et al., 2014; Philipov and Glickman, 2006a), and in other applications (Aguilar and West, 2000; Carvalho et al., 2008). This is distinct from the main idea in the present paper of using a dynamic factor structure for dimension reduction in a variational approximation for getting parsimonious but flexible descriptions of dependence in the posterior for approximate inference.

Our article is outlined as follows. Section 2 gives a background on variational approximation. Section 3 reviews some previous parametrizations of the covariance matrix, in particular the methods of Tan and Nott (2017) and Ong et al. (2017) for Gaussian variational approximation using conditional independence and a factor structure, respectively. Section 4 describes our methodology, which combines a factor structure for the states and conditional independence in time for the factors to obtain flexible and convenient approximations of the posterior distribution in high-dimensional state space models. Section 5 describes an extended example for a spatio-temporal dataset in ecology concerned with the spread of the Eurasian collared-dove across North America. Section 6 considers inference in a multivariate stochastic volatility model via Wishart processes. Technical derivations and other details are placed in the supplement. We refer to equations, sections, etc in the main paper as (1), Section 1, etc, and in the supplement as (S1), Section S1, etc.

## 2 Stochastic gradient variational methods

### 2.1 Variational objective function

Variational approximation methods (Attias, 1999; Jordan et al., 1999; Winn and Bishop, 2005) reformulate the problem of approximating an intractable posterior distribution as an optimization problem. Let be the vector of model parameters, the observations and consider Bayesian inference for with a prior density . Denoting the likelihood by , the posterior density is , and in variational approximation we consider a family of densities , indexed by the variational parameter , to approximate . Our article takes the approximating family to be Gaussian so that consists of the mean vector and the distinct elements of the covariance matrix in the approximating normal density.

To formulate the approximation of as an optimization problem, we take the Kullback-Leibler (KL) divergence, \linenomath

 KL(qλ(θ)||p(θ|y)) =∫logqλ(θ)p(θ|y)qλ(θ)dθ,
\endlinenomath

as the distance between to . The KL divergence is non-negative and zero if and only if . It is straightforward to show that , where , can be expressed as \linenomath

 logp(y) =L(λ)+KL(qλ(θ)||p(θ|y)), (1)
\endlinenomath

where \linenomath

 L(λ) =∫logp(θ)p(y|θ)qλ(θ)qλ(θ)dθ (2)
\endlinenomath

is referred to as the variational lower bound or evidence lower bound (ELBO). The non-negativity of the KL divergence implies that , with equality if and only if . Because does not depend on , we see from (1) that minimizing the KL divergence is equivalent to maximizing the ELBO in (2). For introductory overviews of variational methods for statisticians see Ormerod and Wand (2010) and Blei et al. (2017).

Maximizing to obtain an optimal approximation of is often difficult in models with a non-conjugate prior structure, since is defined as an integral which is generally intractable. However, stochastic gradient methods (Robbins and Monro, 1951; Bottou, 2010) are useful for performing the optimization and there is now a large literature surrounding the application of this idea (Ji et al., 2010; Paisley et al., 2012; Nott et al., 2012; Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014; Hoffman et al., 2013; Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2015; Kucukelbir et al., 2017, among others). In a simple stochastic gradient ascent method for optimizing , an initial guess for the optimal value is updated according to the iterative scheme \linenomath

 λ(t+1) =λ(t)+atˆ∇λL(λ(t)), (3)
\endlinenomath

where , is a sequence of learning rates, is the gradient vector of with respect to , and denotes an unbiased estimate of . The learning rate sequence is typically chosen to satisfy and which ensures that the iterates converge to a local optimum as under suitable regularity conditions. Various adaptive choices for the learning rates are also possible and we consider the ADADELTA (Zeiler, 2012) approach in our applications in Sections 5 and 6.

### 2.3 Variance reduction

Application of stochastic gradient methods to variational inference depends on being able to obtain the required unbiased estimates of the gradient of the lower bound in (3). Reducing the variance of these gradient estimates as much as possible is important for both the stability of the algorithm and fast convergence. Our article uses gradient estimates based on the so-called reparametrization trick (Kingma and Welling, 2014; Rezende et al., 2014). The lower bound is an expectation with respect to , \linenomath

 L(λ) =Eq(logh(θ)−logqλ(θ)), (4)
\endlinenomath

where denotes expectation with respect to and . If we differentiate under the integral sign in (4), the resulting expression for the gradient can also be written as an expectation with respect to , which is easily estimated unbiasedly by Monte Carlo provided that sampling from this distribution can be easily done. However, differentiating under the integral sign does not use gradient information from the model because does not appear in the term involving in (4). The reparametrization trick is a method that allows this information to be used. We start by supposing that can be written as , where is a random vector with density which does not depend on the variational parameters . For instance, in the case of a multivariate normal density where with and denotes the (lower triangular) Cholesky factor of , we can write where where is the identity matrix. Substituting into (4), we obtain \linenomath

 L(λ) =Ef(logh(u(λ,ω))−logqλ(u(λ,ω))), (5)
\endlinenomath

and then differentiating under the integral sign \linenomath

 ∇λL(λ) =Ef(∇λlogh(u(λ,ω))−∇λlogqλ(u(λ,ω))), (6)
\endlinenomath

which is an expectation with respect to that is easily estimated unbiasedly if we can sample from . Note that gradient estimates obtained for the lower bound this way use gradient information from the model, and it has been shown empirically that gradient estimates by the reparametrization trick have greatly reduced variance compared to alternative approaches.

We now discuss variance reduction beyond the reparametrization trick. Roeder et al. (2017), generalizing arguments in Salimans and Knowles (2013), Han et al. (2016) and Tan and Nott (2017), show that (6) can equivalently be written as \linenomath

 ∇λL(λ) =Ef(du(λ,ω)dλ{∇θlogh(u(λ,ω))−∇θlogqλ(u(λ,ω))}), (7)
\endlinenomath

where is the matrix with element the partial derivative of the th element of with respect to the th element of . Note that if the approximation is exact, i.e. , then a Monte Carlo approximation to the expectation on the right hand side of (7) is exactly zero even if such an approximation is formed using only a single sample from . This is one reason to prefer (7) as the basis for obtaining unbiased estimates of the gradient of the lower bound if the approximating variational family is flexible enough to provide an accurate approximation. However, Roeder et al. (2017) show that the extra terms that arise when (6) is used directly for estimating the gradient of the lower bound can be thought of as acting as a control variate, with a scaling that can be estimated empirically, although the computational cost of this estimation may not be worthwhile. In our state space model applications, we consider using both (6) and (7) because our approximations may be very rough when the dynamic factor parametrization of the variational covariance structure contains only a small number of factors. Here, it may not be so relevant to consider what happens in the case where the approximation is exact as a guide for reducing the variability of gradient estimates.

## 3 Parametrizing the covariance matrix

### 3.1 Cholesky factor parametrization of Σ

Titsias and Lázaro-Gredilla (2014) considered normal variational posterior approximation using a Cholesky factor parametrization and used stochastic gradient methods for optimizing the KL divergence. Challis and Barber (2013) also considered Cholesky factor parametrizations in Gaussian variational approximation but without using stochastic gradient optimization methods.

For gradient estimation, Titsias and Lázaro-Gredilla (2014) consider the reparametrization trick with , where , is the variational posterior mean and is the variational posterior covariance with lower triangular Cholesky factor and with the diagonal elements of being positive. Hence, and (5) becomes, apart from terms not depending on , \linenomath

 L(λ) =Ef(logh(μ+Cω))+log|C|, (8)
\endlinenomath

and note that since is lower triangular. Titsias and Lázaro-Gredilla (2014) derived the gradient of (8), and it is straightforward to estimate the expectation unbiasedly by simulating one or more samples from and computing their average, i.e. plain Monte Carlo integration. The method can also be considered in conjunction with data subsampling. Kucukelbir et al. (2017) considered a similar approach.

### 3.2 Sparse Cholesky factor parametrization of Ω=Σ−1

Tan and Nott (2017) consider an approach which parametrizes the precision matrix in terms of its Cholesky factor, say, and impose a sparse structure in which comes from the conditional independence structure in the model. To minimize notation, we continue to write for a Cholesky factor used to parametrize the variational posterior even though here it is the Cholesky factor of the precision matrix rather than of the covariance matrix as in the previous subsection. Similarly to Tan and Nott (2017), Archer et al. (2016) also considered parametrizing a Gaussian variational approximation using the precision matrix, but they optimize directly with respect to the elements , while also exploiting sparse matrix computations in obtaining the Cholesky factor of . Archer et al. (2016) were also concerned with state space models and imposed a block tridiagonal structure on the variational posterior precision matrix for the states, using functions of local data parametrized by deep neural networks to describe blocks of the mean vector and precision matrix corresponding to different states. Here we follow Tan and Nott (2017) and consider parametrization of the variational optimization in terms of the Cholesky factor of . In this section we consider the case where no restrictions are placed on the elements of and discuss in Section 4 how the conditional independence structure in the model can be used to impose a sparse structure on . We note at the outset that sparsity is very important for reducing the number of variational parameters that need to be optimized and considering a sparse allows the Gaussian variational approximation method to be extended to high-dimensional settings.

Consider the reparametrization trick once again with , where means , and . For , we can write , with . Similarly to Section 3.1, \linenomath

 L(λ) =Ef(logh(μ+C−⊤ω)−logqλ(μ+C−⊤ω)),
\endlinenomath

which, apart from terms not depending on , is \linenomath

 L(λ) =Ef(logh(μ+C−⊤ω))−log|C|, (9)
\endlinenomath

and note that since is lower triangular. Tan and Nott (2017) derived the gradient of (9) and, moreover, considered some improved gradient estimates for which Roeder et al. (2017) have provided a more general understanding. We apply the approach of Roeder et al. (2017) to our methodology in Section 4.

### 3.3 Latent factor parametrization of Σ

While the method of Tan and Nott (2017) is an attractive way to reduce the number of variational parameters in problems with an exploitable conditional independence structure, there are situations where no such structure is available. An alternative parsimonious way to parametrize dependence is to use a factor structure (Geweke and Zhou, 1996; Bartholomew et al., 2011). Ong et al. (2017) parametrized the variational posterior covariance matrix as , where is a matrix with and for identifiability for and is a diagonal matrix with diagonal elements . The variational posterior becomes with .

If , this corresponds to the generative model with , where denotes point-wise multiplication. Ong et al. (2017) applied the reparametrization trick based on this transformation and derive gradient expressions of the resulting evidence lower bound. Ong et al. (2017) outlined how to efficiently implement the computations and we discuss this further in Section 4.

## 4 Methodology

### 4.1 Notation

Let denote a function of a vector valued argument . If is scalar valued, then is the gradient vector, considered as a column vector of length . If is a vector valued function, , then we write for the matrix with th element . For an matrix , is the vector obtained by stacking the columns of from left to right. for a vector is the inverse operation, where the intended dimensions of the resulting matrix is clear from the context. If is a scalar valued function of a matrix valued argument , then , so that is a matrix of the same dimensions as . We write for the matrix of zeros, for the Kronecker product and for the Hadmard (elementwise) product which can be applied to two matrices of the same dimensions. We also write for the commutation matrix, of dimensions , which for an matrix satisfies

 Kr,svec(Z)=vec(Z⊤).

### 4.2 Structure of the posterior distribution

Our Gaussian variational distribution is suitable for models with the following structure. Let be an observed time series, and consider a state space model in which \linenomath

 yt|Xt=xt ∼mt(y|xt,ζ),t=1,…,T Xt|Xt−1=xt−1 ∼st(x|xt−1,ζ),t=1,…,T
\endlinenomath

with a prior density for and where are the unknown fixed (non-time-varying) parameters in the model. The observations are conditionally independent given the states , and the prior distribution of given is

 p(X|ζ)=p(X0|ζ)T∏t=1st(Xt|Xt−1,ζ).

Let denote the full set of unknowns in the model. The joint posterior density of is , with , where is the prior density for and . Let be the dimension of and consider the situation where is large. Approximating the joint posterior distribution in this setting is difficult and we now describe a method based on Gaussian variational approximation.

### 4.3 Structure of the variational approximation

Our variational posterior density for , is based on a generative model which has a dynamic factor structure. We assume that \linenomath

 Xt =Bzt+ϵtϵt∼N(0,D2t), (10)
\endlinenomath

where is a matrix, , and is a diagonal matrix with diagonal elements . Let and , where is the Cholesky factor of the precision matrix of . We will write for the dimension of each , with , and assume that has the structure

 C=[C10 0C2],

where is the Cholesky factor of the precision matrix for and is the Cholesky factor for the precision matrix of . We further assume that takes the form \linenomath

 C1 =⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣C0000…00C10C110…000C21C22…00⋮⋮⋮⋱⋮⋮000…CT−1,T−1000……CT,T−1CTT⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦, (11)
\endlinenomath

where the blocks in this block partitioned matrix follow the blocks of . With this definition of , the corresponding precision matrix takes the form \linenomath

 Ω1 =⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣Ω00Ω⊤100…00Ω10Ω11Ω⊤21…000Ω21Ω22…00⋮⋮⋮⋱…⋮000…ΩT−1,T−1Ω⊤T,T−1000…ΩT,T−1ΩTT⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦. (12)
\endlinenomath

For a Gaussian distribution, zero elements in the precision matrix represent conditional independence relationships. In particular, the sparse structure we have imposed on means that in the generative distribution for the latent variable , given , is conditionally independent of the remaining elements of . In other words, if we think of the variables , as a time series, they have a Markovian dependence structure.

We now construct the variational distribution for through \linenomath

 θ =[Xζ]=[IT+1⊗B00IP]ρ+[ϵ0],
\endlinenomath

where is the dimension of and . Note that we can apply the reparametrization trick by writing , where . Then, \linenomath

 θ =Wρ+Ze=Wμ+WC−⊤ω+Ze, (13)
\endlinenomath

where

and is a diagonal matrix with diagonal entries . Here . We also write , where the blocks of this partition follow those of as .

In the development above we see that the factor structure is being used both for describing the covariance structure for the states, and also for dimension reduction in the variational posterior mean of the states, since , where . An alternative is to set and use \linenomath

 Xt =μt+Bzt+ϵt, (14)
\endlinenomath

where is now a -dimensional vector specifying the variational posterior mean for directly. We call parametrization (10) the low-dimensional state mean (LD-SM) parametrization, and parametrization (14) the high-dimensional state mean (HD-SM) parametrization. In both parametrizations, forms a basis for , which is reweighted over time according to the latent weights (factors) . The LD-SM parametrization provides information on how these basis functions are reweighted over time in our approximate posterior, since and we infer both and in the variational optimization. Section 5 illustrates this basis representation. We only outline the gradients and their derivation for the LD-SM parametrization. Derivations for the HD-SM parametrization follow those for the LD-SM case with straightforward minor adjustments.

### 4.4 Gradient expressions of the evidence lower bound

We can apply the reparameterization trick using (13) to the LD-SM parametrization to get expressions for the gradient with respect to the variational parameters. These can be used for unbiased gradient estimation of the lower bound. Section S1.1 derives the expressions below. We write for the covariance matrix of .

For , \linenomath

 ∇μL(λ) =W⊤Ef(∇θlogh(Wμ+WC−⊤ω+Ze)). (15)
\endlinenomath

An equivalent expression is \linenomath

 ∇μL(λ) =W⊤Ef(∇θlogh(Wμ+WC−⊤ω+Ze)+(WΣW⊤+Z2)−1(WC−⊤ω+Ze)), (16)
\endlinenomath

upon noting that the second term on the right-hand side of (16) is zero, and in fact (16) corresponds to the gradient estimate of Roeder et al. (2017) discussed in Section 2, which has the property that a Monte Carlo estimate of the gradient based on (16) based on even a single sample is zero when the variational posterior is exact. For , \linenomath

 ∇vec(B)L(λ) =T1B+T2B+T3B, (17)
\endlinenomath

where \linenomath

 T1B ={dvec(W)dvec(B)}⊤Ef(((μ+C−⊤ω)⊗Ip(T+1)+P)∇θlogh(Wμ+WC−⊤ω+Ze)),
\endlinenomath\linenomath
 T2B ={dvec(W)dvec(B)}⊤vec((WΣW⊤+Z2)−1WΣ),
\endlinenomath\linenomath
 T3B ={dvec(W)dvec(B)}⊤Ef(vec((WΣW⊤+Z2)−1(WC−⊤ω+Ze)ω⊤C−1 −(WΣW⊤+Z2)−1(WC−⊤ω+Ze)(WC−⊤ω+Ze)⊤(WΣW⊤+Z2)−1WΣ))
\endlinenomath

and in these expressions \linenomath

 dvec(W)dvec(B)= (Q⊤1⊗P1)[{(IT+1⊗Kq(T+1))(vec(IT+1)⊗Iq)}⊗Ip],
\endlinenomath

with

 P1=[Ip(T+1)0P×p(T+1)],Q1=[Iq(T+1)0q(T+1)×P].

We note that the term also cancels with the second term in after taking expectations, so an alternative expression for the gradient is \linenomath

 ∇vec(B)L(λ) =T1B+T′3B, (18)
\endlinenomath

where \linenomath

 T′3B ={dvec(W)dvec(B)}⊤Ef(vec((WΣW⊤+Z2)−1(WC−⊤ω+Ze)(ω⊤C−1+μ⊤))). (19)
\endlinenomath

Equation (18) corresponds to the gradient estimator of Roeder et al. (2017). For , writing , \linenomath

 ∇δL(λ) =Ef(diag(∇Xlogh(Wμ+WC−⊤ω+Ze)ϵ⊤+(W1Σ1W⊤1+D2)−1D +(W1Σ1W⊤1+D2)−1(W1C−⊤1ω1+Dϵ)ϵ⊤ −(W1Σ1W⊤1+D2)−1(W1C−⊤1ω1+Dϵ)(W1C−⊤1ω1+Dϵ)⊤(W1Σ1W⊤1+D2)−1D)). (20)
\endlinenomath

Again, cancellation of the second and fourth terms above occurs after taking expectations, giving the gradient estimator of Roeder et al. (2017), \linenomath

 ∇δL(λ) =Ef(diag(∇Xlogh(Wμ+WC−⊤ω+Ze)ϵ⊤ +(W1Σ1W⊤1+D2)−1(W1C−⊤1ω1+Dϵ)ϵ⊤)). (21)
\endlinenomath

Finally, \linenomath

 ∇CL(λ)= Ef(−C−⊤ω∇θlogh(Wμ+WC−⊤ω+Ze)⊤WC−⊤ − ΣW⊤(WΣW⊤+Z2)−1WC−⊤ − C−⊤ω(WC−⊤ω+Ze)⊤(WΣW⊤+Z2)−1WC−⊤ + ΣW⊤(WΣW⊤+Z2)−1(WC−⊤ω+Ze)(WC−⊤ω+Ze)⊤(WΣW⊤+Z2)−1WC−⊤) (22)
\endlinenomath

and, again after cancellation of some terms after taking expectations, the gradient estimator of Roeder et al. (2017) is \linenomath

 ∇CL(λ)= Ef(−C−⊤ω{∇θlogh(Wμ+WC−⊤ω+Ze)⊤ + (WC−⊤ω+Ze)⊤(WΣW⊤+Z2)−1}WC−⊤). (23)
\endlinenomath

### 4.5 Efficient computation

We can estimate the expectations in the previous subsection by one or more samples from . One can compute gradients based on either equations (15), (17), (20) and (22), or on equations (16), (18), (21) and (23). If the variational approximation to the posterior is accurate, as we explained in Section 2, there are reasons to prefer equations (16), (18), (21) and (23) corresponding to the gradient estimates recommended in Roeder et al. (2017). However, since we investigate massive dimension reduction with only very small numbers of factors where the approximation may be crude, we investigate both approaches in later examples.

The gradient estimates are efficiently computed using a combination of sparse matrix operations (for evaluation of terms such as and the high-dimensional matrix multiplications in the expressions) and, as in Ong et al. (2017), the Woodbury identity for dense matrices such as and . The Woodbury identity is \linenomath

 (ΛΓΛ⊤+Ψ)−1= Ψ−1−Ψ−1Λ(Λ⊤Ψ−1Λ+Γ−1)−1Λ⊤Ψ−1
\endlinenomath

for conformable matrices and diagonal . The Woodbury formula reduces the required computations into a much lower dimensional space since and, moreover, inversion of the high-dimensional matrix is trivial because it is diagonal.

## 5 Application 1: Spatio-temporal modeling

### 5.1 Eurasian collared-dove data

Our first example considers the spatio-temporal model of Wikle and Hooten (2006) for a dataset on the spread of the Eurasian collared-dove across North America. The dataset consists of the number of doves observed at location (latitude, longitude) in year corresponding to an observation period of 1986-2003. The spatial locations correspond to grid points with the dove counts aggregated within each area; see Wikle and Hooten (2006) for details. The count observed at location at time depends on the number of times that the location was sampled. However, this variable is unavailable and therefore we set the offset in the model to zero, i.e. .

### 5.2 Model

Let denote the count data at time . Wikle and Hooten (2006) model as conditionally independent Poisson variables, where the log means are given by a latent high-dimensional Markovian process plus measurement error. The dynamic process evolves according to a discretized diffusion equation. Specifically, the model in Wikle and Hooten (2006) is \linenomath

 yt|vt ∼Poisson(diag(Nt)exp(vt))yt,Nt,vt∈Rp vt|ut,σ2ϵ ∼N(ut,σ2ϵIp),ut∈Rp,Ip∈Rp×p,σ2ϵ∈R+ ut|ut−1,ψ,σ2η
\endlinenomath

with priors and \linenomath

 u0 ∼N(0,10Ip) ψ|α,σ2ψ ∼N(Φα,σ2ψIp),Φ∈Rp×l,α∈Rl,σ2ψ∈R+ α
\endlinenomath

is the Poisson distribution for a (conditionally) independent response vector parameterized in terms of its expectation and is the inverse-gamma distribution with shape and scale as arguments. The spatial dependence is modeled via the prior mean of the diffusion coefficients , where consist of the orthonormal eigenvectors with the largest eigenvalues of the spatial correlation matrix , where is the Euclidean distance between pair-wise grid locations in . Finally, is a diagonal matrix with the largest eigenvalues of . We follow Wikle and Hooten (2006) and set and .

Let , and denote the parameter vector

 θ=(u,v,ψ,α,logσ2ϵ,logσ2η,logσ2ψ,logσ2α),

which we infer through the posterior

 p(θ|y) ∝ σ2ϵσ2ησ2ψσ2αp(σ2ϵ)p(σ2η)p(σ2ψ)p(σ2α)p(α|σ2α)p(ψ|α,σ2ψ) (24) p(u0)T∏t=1p(ut|ut−1,ψ,σ2η)p(vt|ut,σ2ϵ)p(yt|vt),

with . Section S1.2 derives the gradient of the log-posterior required by our VB approach.

### 5.3 Variational approximations of the posterior distribution

Section 4 considered two different parametrization of the low rank approximation, in which either the state vector has mean , (low-dimensional state mean, LD-SM) or has a separate mean and (high-dimensional state mean, HD-SM). In this particular application there is a third choice of parametrization which we now consider.

The model in Section 5.2 connects the data with the high-dimensional state vector via a high-dimensional auxiliary variable . In the notation of Section 4, we can include in , in which case the parametrization of the variational posterior is the one described there. We refer to this parametrization as a low-rank state (LR-S). However, it is clear from (24) that there is posterior dependence between and , but the variational approximation in Section 4 omits dependence between and . Moreover, is also high-dimensional but the LR-S parametrization does not reduce its dimension. An alternative parametrization that deals with both considerations includes in the -block, which we refer to as the low-rank state and auxiliary variable (LR-SA) parametrization. This comes at the expense of omitting dependence between and , but also becomes more computationally costly because, while the total number of variational parameters is smaller (see Table S1 in Section S3), the dimension of the -block increases ( and ) and the main computational effort lies here and not in the -block. Table 1 shows the CPU times relative to the LR-S parametrization. The LR-SA parametrization requires a small modification of the derivations in Section 4, which we outline in detail in Section S2 as they can be useful for other models with a high-dimensional auxiliary variable.

It is straightforward to deduce conditional independence relationships in (24) to build the Cholesky factor of the precision matrix of in Section 4, with

 ζ={(v,ψ,α,logσ2ϵ,logσ2η,logσ2ψ,logσ2α)(LR-S)(ψ,α,logσ2ϵ,logσ2η,logσ2ψ,logσ2α)(LR-SA).

Section 4 outlines the construction of the Cholesky factor of the precision matrix of , whereas the minor modification needed for LR-SA is in Section S2. We note that, regardless of the parametrization, we obtain massive parsimony (between variational parameters) compared to the saturated Gaussian variational approximation which in this application has parameters; see Section S3 for further details.

We consider four different variational parametrizations, combining each of LR-SA or LR-S with the different parametrization of the means of , i.e. LD-SM or HD-SM. In all cases, we let and run iterations of a stochastic optimization algorithm with learning rates chosen adaptively according to the ADADELTA approach (Zeiler, 2012). We use the gradient estimators in Roeder et al. (2017), i.e. (16), (18), (21) and (23), although we found no noticeable difference compared to (15), (17), (20) and (22), which is likely due to the small number of factors as described in Sections 2 and 4. Our choice was motivated by computational efficiency as some terms cancel out using the approach in Roeder et al. (2017). We initialize and as unit diagonals and, for parity, and are chosen to match the starting values of the Gibbs sampler in Wikle and Hooten (2006).

Figure 1 monitors the convergence via the estimated value of using a single Monte Carlo sample. Table 1 presents estimates of at the final iteration using Monte Carlo samples. The results suggest that the best VB parametrization in terms of ELBO is the low-rank state algorithm (LR-SA) with, importantly, a high-dimensional state-mean (HD-SM) (otherwise the poorest VB approximation is achieved, see Table 1). However, Table S1 shows that this parametrization is about three times as CPU intensive. The fastest VB parametrizations are both Low-Rank State (LR-S) algorithms, and modeling the state mean separately for these does not seem to improve (Table 1) and is also slightly more computationally expensive (Table S1). Taking these considerations into account, the final choice of VB parametrization we use for this model is the low-rank state with low-dimensional state mean (LR-S + LD-SM). We show in Section 5.5 that this parametrization gives accurate approximations for our analysis. For the rest of this example we benchmark the VB posterior from LR-S + LD-SM against the MCMC approach in Wikle and Hooten (2006).

### 5.4 Settings for MCMC

Before evaluating VB against MCMC, we need to determine a reasonable burn-in and number of iterations for the Gibbs sampler in Wikle and Hooten (2006). It is clear that it is not feasible to monitor convergence for every single parameter in such a large scale model as (24), and therefore we focus on , and , which are among the variables considered in the analysis in Section 5.5.

Wikle and Hooten (2006) use iterations of which are discarded as burn-in. We generate sampling chains with these settings and inspect convergence using the package (Plummer et al., 2006) in . We compute the Scale Reduction Factors (SRF) (Gelman and Rubin, 1992) for and as a function of the number of Gibbs iterations. The adequate number of iterations in MCMC depends on what functionals of the parameters are of interest; here we monitor convergence for these quantities since we report marginal posterior distributions for these quantities later. The scale reduction factor of a parameter measures if there is a significant difference between the variance within the four chains and the variance between the four chains of that parameter. We use the rule of thumb that concludes convergence when , which gives a burn-in period of approximately here, for these functionals. After discarding these samples and applying a thinning of we are left with posterior samples for inference. However, as the draws are auto-correlated, this does not correspond to independent draws used in the analysis in Section 5.5 (note that we obtain independent samples from our variational posterior). To decide how many Gibbs samples are equivalent to independent samples for and , we compute the Effective Sample Size (ESS) which takes into account the auto-correlation of the samples. We find that the smallest is and hence we require iterations after a thinning of , which makes a total of Gibbs iterations, excluding the burn-in of . Thinning is advisable here due to memory issues.

### 5.5 Analysis and results

We first consider inference on the diffusion coefficient for location . Figure 2 shows the “true” posterior (represented by MCMC) together with the variational approximation for six locations described in the caption of the figure. The figure shows that the posterior distribution is highly skewed for locations with zero dove counts and approaches normality as the dove counts increase. Consequently, the accuracy of the variational posterior (which is Gaussian) improves with increasing dove counts.

Figure shows 100 VB and MCMC posterior samples of the dove intensity for each year summed over the spatial locations, i.e. . Both posteriors are very similar and, in particular, show an exponential increase of doves until year followed by a steep decline for year . In the interest of analyzing the spatial dimension of the model, Figure 4 shows a heat map of the MCMC and VB posterior mean of the dove intensity for the last five years of the data, overlaid on a map of the United States of America. We draw the following conclusions from the analysis using the MCMC and VB posteriors (which are nearly identical). First, the dove intensity is most pronounced in the South East states, in particular Florida (Figure 4). Second, the decline of doves for year in Figure 3 is likely attributed to a drop in the intensity (Figure 4) at two areas of Florida: Central Florida () and South East Florida (). Figure 5 illustrates the whole posterior distribution of the log-intensity for these locations at year and, moreover, an out-of-sample posterior predictive distribution for year . The estimates are obtained by kernel density estimates using approximately effective samples. The posterior distributions for the VB and MCMC are similar, and it is evident that using this large scale model for forecasting future values is associated with a large uncertainty.

We conclude this example by investigating the spatial functions and their reweighting over time to produce the variational approximation. Figure 6 illustrates this and shows that the correlation among the states is mostly driven by spatial locations outside the southern east part of North America. This is reasonable as the data contains mostly zero or near zero counts outside the southern east region.

The analysis in this section shows that similar inferential conclusions are drawn with the VB posterior and the “true” posterior (approximated by the Gibbs sampler). One advantage with the VB posterior is that it is more efficient for performing posterior predictive analysis because independent samples are easily obtained, in contrast to MCMC samples that may have a prohibitively large auto-correlation, resulting in imprecise estimators. Perhaps the main advantage of the VB posterior is that it is faster to obtain: in this particular application VB was times faster than MCMC. The speed up in computing time relative to MCMC is model specific and depends on how expensive the different sampling steps in the Gibbs sampler are. This spatial temporal model is computationally cheap on a per iteration basis, however, many iterations are needed for accurate inference as demonstrated. In the next section, we consider a model which is both computationally expensive and mixes poorly. For that model the computational gains are much more pronounced.

## 6 Application 2: Stochastic volatility modeling

### 6.1 Industry portfolios data

Our second example considers the Multivariate stochastic volatility model via Wishart processes in Philipov and Glickman (2006b) used for modeling the time-varying dependence of a portfolio of assets over time periods. We follow Philipov and Glickman (2006b) and use (manufacturing, utilities, retail/wholesale, finance, other) which results in a state-vector (the lower-triangular part of the covariance matrix) of dimension . This is far from a high-dimensional setting, but allows us to implement an MCMC method to compare against the variational approximation. In fact, for , which gives , Philipov and Glickman (2006b) use a Metropolis-Hastings within Gibbs sampler and reported that some of the blocks updated by the Metropolis-Hastings sampler have acceptance probabilities close to zero. We discuss this further in Section 6.4 and note at the outset that our variational approach does not have this limitation as we demonstrate in Section 6.5.

Philipov and Glickman (2006b) made a mistake in the derivation of the Gibbs sampler which affects all full conditionals (Rinnergschwentner et al., 2012). Implementing the corrected version results in a highly inefficient sampler and we take instead of so that the MCMC can finish in a reasonable amount of time. Hence we have montly observations on value-weighted portfolios from the 201709 CRSP database, covering a period from 2009-06 to 2017-09. We follow Philipov and Glickman (2006b) and prefilter each series using an AR(1) process.

### 6.2 Model

We assume that the return at time period , , is the vector ,

 yt ∼ N(0,Σt),Σt∈Rp×p Σ−1t ∼ Wishart(ν,St−1),St=1νH(Σ−1t)dH⊤,St∈Rp×p,ν>k,0

is a lower triangular Cholesky factor of a positive definite matrix , and is assumed known. Philipov and Glickman (2006b) use an inverse Wishart prior for , , , , a uniform prior for , , and a shifted gamma prior for , . The joint posterior density for is \linenomath

 p(Σ,A,ν−k,d|y) ∝p(A,d,ν−k)T∏t=1p