Multivariate Bayesian Structural Time Series Model
This paper deals with inference and prediction for multiple correlated time series, where one has also the choice of using a candidate pool of contemporaneous predictors for each target series. Starting with a structural model for the time-series, Bayesian tools are used for model fitting, prediction, and feature selection, thus extending some recent work along these lines for the univariate case. The Bayesian paradigm in this multivariate setting helps the model avoid overfitting as well as capture correlations among the multiple time series with the various state components. The model provides needed flexibility to choose a different set of components and available predictors for each target series. The cyclical component in the model can handle large variations in the short term, which may be caused by external shocks. We run extensive simulations to investigate properties such as estimation accuracy and performance in forecasting. We then run an empirical study with one-step-ahead prediction on the max log return of a portfolio of stocks that involve four leading financial institutions. Both the simulation studies and the extensive empirical study confirm that this multivariate model outperforms three other benchmark models, viz. a model that treats each target series as independent, the autoregressive integrated moving average model with regression (ARIMAX), and the multivariate ARIMAX (MARIMAX) model.
Keywords: Multivariate Time Series, Feature Selection, Bayesian Model Averaging, Cyclical Component, Estimation and Prediction
The analysis of “Big data”, through the application of a new breed of analytical tools for manipulating and analyzing vast caches of data that is being produced by computers and other technology, is one of the hottest new areas. As a byproduct of the extensive use of the internet in collecting data on economic transactions, such data is growing exponentially every day. According to Varian (2014) and references therein, Google has trillion URLs and crawls over billion of those each day. Conventional statistical and econometric techniques become increasingly inadequate to deal with such big data problems. For a good introduction to this new trend in data science, see Blei and Smyth (2017). Machine Learning as a field of computer science has strong ties to mathematical optimization and delivers methods, theory and applications, giving computers the ability to learn without being explicitly programmed (see, a classical book, Mohri et al. (2012)). Machine Learning indeed helps in developing high-performance computer tools which often provide useful predictions in the presence of challenging computational needs. However the result is one that we might call “pure prediction” and is not necessarily based on substantive knowledge. Also, typical assumptions such as the data being independent and identically (or at least independently) distributed, are not satisfactory when dealing with time stamped data, which is driven by multiple “predictors” or “features”. We need to employ time-series analysis for such series of data that are time-dependent, such as macroeconomic indicators of the national economy, enterprise operational management, market forecasting, weather and hydrology prediction, etc.
Our focus here is on new techniques that work well for feature selection problems in time series applications. Scott and Varian (2014, 2015) introduced and further explored the Bayesian Structural Time Series (BSTS) model, which is a technique that can be used for feature selection, time series forecasting, nowcasting, inferring causal relationship (causality, see Brodersen et al. (2015) and Peters et al. (2017)), etc. One main ingredient of the BSTS model is that the time-series aspect is handled through the Kalman filter (see Harvey (1990); Durbin and Koopman (2002); Petris et al. (2009)) while taking into account the trend, seasonality, regression, and other common time series factors. The second aspect is the “spike and slab” variable selection, which was developed by George and McCulloch (1997) and Madigan and Raftery (1994), by which the most important regression predictors are selected at each step. The third aspect is the Bayesian model averaging (see Hoeting et al. (1999)) which combines the results and prediction calculation. All these three parts have natural Bayesian interpretations and tend to play well together so that the resulting BSTS model discovers not only correlations, but also causations in the underlying data. Some excellent related literature includes but not limited to the following: Dy and Brodley (2004); Cortes and Vapnik (1995); Guyon and Elisseeff (2003); Koo et al. (2007); Bach et al. (2013); Keerthi and Lin (2003); Nowozin and Lampert (2011); Krishnapuram et al. (2005); Caron et al. (2006); Csató and Opper (2002).
In this paper, we extend the BSTS model to the multivariate dependent target time-series with various components, and label it, the Multivariate Bayesian Structural Time Series (MBSTS) model. For instance, the MBSTS model can be used to explicitly model the correlations between different stock returns in a portfolio through the covariance structure specified by (see Equation (1)). In this model, we allow a cyclical component with a shock damping parameter to specifically model the influence of a shock to the time series, besides the more standard local linear trend component, a seasonal component, and a regression component. One motivation for this is provided by the 2007–2008 financial crisis to the stock market. In examples with simulated data, the properties of our model such as estimation and prediction accuracy is investigated. As an illustration, through an empirical case study, we predict the max log returns over consecutive business days, of a stock portfolio with stocks, which included (i) Bank of America (BOA), (ii) Capital One Financial Corporation (COF), (iii) J.P. Morgan (JPM), and (iv) Wells Fargo (WFC), using domestic Google trends, and stock technical indicators as predictors.
Extensive analysis on both simulated data and real stock market data verifies that the MBSTS model gives much better prediction accuracy compared to the component-wise univariate BSTS model, autoregressive integrated moving average model with regression (ARIMAX), and multivariate ARIMAX (MARIMAX). Some of the reasons for this, are: the MBSTS model is strong in forecasting since it allows incorporating information about different components in the response time series, rather than merely historical values of the same component; Bayesian paradigm and the MCMC algorithm can perform variable selection at the same time as model training and thus prevent overfitting even if some spurious predictors are added into the candidate pool; the MBSTS model benefits from taking correlation among multiple response series into account, which helps boost the forecasting power and is a significant improvement over the component-wise BSTS model with independent components, discussed in Scott (2017).
The rest of the paper is organized as follows. In Section 2, we build the basic model and framework. Extensive simulations are carried out in Section 3 to check how the model performs under various conditions. In Section 4, an empirical study on the stock portfolio is done to show how well our model performs with a real-world data set. Section 5 concludes with some final remarks.
2 The MBSTS Model
In this section, we introduce the MBSTS model including model structure, state components, prior elicitation and posterior inference. Then we describe the algorithm for training the model and performing forecasts.
2.1 Structural Time Series
Structural time series models belong to state space models for time series data given by the following set of equations:
Equation (1) is called the observation equation, as it links the vector of observations at time t with a vector denoting the unobserved latent states, where is the total number of latent states for all entries in . Equation (2) is called the transition equation because it defines how the latent states evolve over time. The model matrices , , and typically contain unknown parameters and known values which are often set as 0 and 1. is a output matrix, is a transition matrix, is a control matrix. The vector denotes observation errors with noise variance-covariance matrix , and is a q-dimensional system error with a state diffusion matrix , where . Note that any linear dependencies in the state vector can be moved from to , hence can be set as a full rank variance matrix.
The structural time series model constructed in terms of components have a direct interpretation. For example, one may consider the classical decomposition in which a series can be seen as the sum of trend, season, cycle and regression components. A model could be formulated as a regression with explanatory variables consisting of a time trend, a set of seasonal dummies, business cycle and other covariates, and defined in terms of a pair of equations. In general, the model in state space form can be written as:
where and are m-dimension vectors, representing target time series, linear trend component, seasonal component, cyclical component, regression component and observation error terms respectively. Based on state space form, is the collection of these components, namely . The is matrix, positive definite and is assumed for simplicity, to be constant over time.
Structural time series models allow us to examine the time series at hand and flexibly select suitable components for trend, seasonality, and either static or dynamic regression. In the current model, all state-components are assembled independently, with each component yielding an additive contribution to . The flexibility of the model allows us to include different model components for each different target series.
2.2 Components of State
The first component is a local linear trend. The specification of a time series model for the trend component varies according to the features displayed by the series under investigation and any prior knowledge. The most elementary structural model deals with a series whose underlying level changes over time. Moreover, it also sometimes displays a steady upward or downward movement, suggesting that we have to incorporate a slope, or a drift, into the model for the trend. The resulting model, a generalization of the local linear trend model where the slope exhibits stationarity instead of obeying a random walk, is expressed in the form as:
where and are -dimension vectors. is the expected increase in between times t and t+1, so it can be thought as the slope at time t and is the long-term slope. The parameter is diagonal matrix, whose diagonal entries represent the learning rate at which the local trend is updated for . Thus, the model balances short-term information with long-term information. When , the corresponding slope becomes a random walk.
The second component is one that captures seasonality. One frequently used model in the time domain is:
where represents the number of seasons for and m dimension vector denotes their joint contribution to the observed response . When we add seasonal component into model, seasonal effects are set in the state-space form for . However, only one seasonal effect has error term based on equation (7) and other effects were represented by itself in deterministic equation. More specifically, the part of the transition matrix representing the seasonal effects is an matrix with ’s along the top row, ’s along the subdiagonal and ’s elsewhere. In addition, the expectation of summation of seasonal effects for is zero with variance equal to diagonal element of .
For each target series , the model allows for various seasonal components with different periods as shown in equation (7). For instance, we might include seasonal component with to capture day of week effect for target series , as well as an indicating month cycle for another target series when modeling daily data. The state-space setting of seasonal transition matrix is or matrix with nonzero error variance for or respectively.
The third component is the one accounting for cyclical effect in the series. In Economics, the term “business cycle” broadly refers to the recurrent, not exactly periodic, deviations around the long term path of the series. A model with the cyclical component is capable of reproducing commonly acknowledged essential features, such as the presence of strong autocorrelation, determining the recurrence and alternation of phases, and the dampening of the fluctuations, or zero long run persistence. A stochastic trend model of a seasonally adjusted economic time series doesn’t capture the short-term movement of the series by itself. Including a serially correlated stationary component, the short-term movement may be captured, and this is the model including cyclical effect (Harvey et al. (2007)). The cycle component is postulated as:
where are diagonal matrices with diagonal entries equal to (a damping factor for target series such that ), where is the frequency with being a period such that , and respectively. When is or , it degenerates to the AR(1) process. The damping factor should be strictly less than unity for stationary process. When the damping factor bigger than one, there will be no restrictions for the cyclical movement, resulting in extending the amplitude of the cycle.
These three time series components are illustrated in Figure 1. The big difference between cyclical component and seasonal component is the damping factor. The amplitude of cyclical component will decay as time goes by, which can be applied to target time series affected by external shock. The are variance-covariance matrices for error terms of time series components, and for simplicity we assume they are diagonal.
The fourth component is the regression component with static coefficients written as follows:
The is the collection of all elements in regression component. For target series , the is the pool of all available predictors at time t, and represent corresponding static regression coefficients. All predictors are supposed to be contemporaneous with a known lag, but can be easily included by shifting the corresponding predictor in time.
2.3 Spike and Slab Regression
In feature selection, a high degree of sparsity is expected, in the sense that the coefficients for the vast majority of predictors will be zero. A natural way to represent sparsity in the Bayesian paradigm is through the spike and slab coefficients. One advantage of working in a fully Bayesian setting is that we do not need to commit to a fixed set of predictors.
2.3.1 Matrix Representation
In order to assign approriate prior distributions to parameters, we first combine into a matrix as follows: , , , and . Then the model cas be written in the long matrix form as follows:
where , and , are written as:
where being matrix, representing all observations of candidate predictors for , which is all observations of th target series. The regression matrix is of dimension with . Moreover, and can be same or only contain a part of common predictors. The regression coefficients for denoted as is a dimensional vector. Arranging the model in this way allows for selecting a different set of available predictors at each iteration for .
2.3.2 Prior distribution and elicitation
We define if , and if . Then , where . Denote as the subset of elements of where , and represent the subset of columns of X where . The spike prior may be written as:
where is prior inclusion probability of predictor for response series. Equation (12) is often further simplified by setting all the to be the same values for if prior information about effect of specific predictors on each target series is not available. With sufficient prior information available, assigning different subjectively determined values to might provide more robust results without a great amount of computational burden. An easy way to elicit is to ask the researcher for an “expected model size”, so that if one expects nonzero predictors for , then , where is the total number of candidate predictors for target series. Under some circumstances, will be set to be or , for some specific predictors of , forcing certain variables to be excluded or included. The spike prior can be specified by the researcher in different distributional forms.
The natural conjugate prior for the multivariate model with the same set of predictors has the conjugate prior on depending on . The multivariate extension with different set of predictors in each equation will destroy the conjugacy (Rossi et al. (2012)). Conjugate priors such as the normal and inverse Wishart can still be used in nonconjugate contexts, since a lot of models are conjugate conditional on some subset of parameters. In order to obtain this conditionally conjugate, we stack up the regression equations into one large regression shown in equation (11). A simple slab prior specification is to make and prior independent (Griffiths (2003)):
where is the vector of prior means with the same dimension as , and is the full-model prior information matrix, where is the number of observations worth of weight on the prior mean . If is not positive definite due to perfect collinearity among predictors, we will replace it by to guarantee propriety. Given analyst’s specification, can be set in other forms. Moreover, IW is inverse Wishart distribution, where is the number of degrees of freedom and is a scale matrix. These priors are not conjugate but they are conditionally conjugate.
Equation (13) is the “slab” because one can choose the prior parameters to make it only very weakly informative (close to flat), conditional on . The vector encodes our prior expectation about the value of each element of . In practice, one usually sets . The values of and can be set by asking the analyst for an expected from the regression, and a number of observations worth of weight , which must be greater than dimension of plus one, to be given to their guess. Then , where is the variance-covariance matrix for multiple response series .
Prior distributions of the other variance-covariance matrices can be seen as:
By our assumption that all components are independent with each other, then we can reduced prior distribution in multivariate version to univariate counterpart since the matrices are diagonal. In other words, each diagonal entry of these matrices follows inverse gamma distributions as introduced in BSTS.
2.3.3 Posterior Inference
By the law of total probability, the full likelihood function is given by
where is the multiple response series with time series components (trend, seasonality and cycle) subtracted out. Conditional on , we can introduce a normal prior, standardize the observations to remove correlation, and produce a posterior. However, we cannot find a convenient prior to integrate out from this conditional posterior. We tackle this issue by transforming equation into a system with uncorrelated errors using the root of the cross-equation covariance matrix, and i.e. if we pre-multiply both sides of this equation, the transformed system has uncorrelated errors:
Then full conditional function for can be expressed as:
Let’s combine the two terms in exponential:
where . Then, a normal prior for is conjugate with the conditional likelihood for the transformed system.
As gets smaller, the prior becomes flatter and we recognize that the mean of this distribution as the generalized least squares estimator.
The posterior of is in the inverted Wishart form. To see this, first recognize that, given , we observe or can compute the errors . This means that, the problem is the standard one of inference regarding a variance-covariance matrix using a multivariate normal sample. From equations (15), (16), (17) and (18), we know that
where . The terms in the exponential part can be transformed to a trace form:
where , , is the matrix, and is the matrix expressed as follows:
Then the full conditional distribution of is inverted Wishart as follows:
Again, if we let the prior precision go to zero, the posterior on centers over the sum of squared residuals matrix.
Since there is no conjugacy for the prior setting, we can not get analytic solution of marginal distribution for . However, by the properties of conditional conjugacy for the model, we can derive conditional distribution . The joint distribution can be obtained as following:
where . Then the conditional distribution can be expressed as:
where is a normalizing constant that only depends on . Note that, here matrices needed to be computed is of low dimension in the sense that the equation (29) places positive probability on coefficients being zero, leading to the sparsity of these matrices. In general, as a feature of the full posterior distribution, sparsity in this model thus make equation (29) to be evaluated in an inexpensive way.
Next we need to derive conditional posterior of . Given the draw of the state, the parameters drawn are straightforward for all state components except the static regression coefficients. All time series components that exclusively depend on their variance parameters will translate their draws back to error terms and accumulate sums of squares. Inverse Wishart distribution is the conjugate prior of a multivariate normal distribution with known mean and covariance matrix, thus the posterior distribution will remain inverse Wishart distributed
where is a matrix, representing a collection of residues of each time series component.
2.4 Markov Chain Monte Carlo
Markov chain Monte Carlo (MCMC) methods are a class of algorithms to sample from a probability distributions ((22), (27) and (29)) based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample from the desired distribution. The quality of the sample improves as an increasing function of the number of steps.
2.4.1 Model Training
Let denote the set of state component parameters. The posterior distribution of the model described before can be simulated by a Markov chain Monte Carlo algorithm given in Algorithm 1. Looping through the five steps yields a sequence of draws from a Markov chain with stationary distribution , the posterior distribution of given .
2.4.2 Target Series Forecasting
Typically in Bayesian data analysis, forecasts from our model are based on the posterior predictive distribution. Given draws of model parameters and latent states from their posterior distribution, we can draw samples from the posterior predictive distribution. Let represents the set of values to be forecast. The posterior predictive distribution of can be expressed as follows:
where is the set of all model parameters and latent states randomly drawn from , then we can draw samples of from , which can be achieved by simply iterating equations (5), (6), (7), (8) and (9) forward from states , with parameters , and . In one-step-ahead forecast, we simply draw samples from multivariate normal distribution with mean equal to and variance equal to . Therefore, the samples drawn in this way have the same distribution as those simulated directly from posterior predictive distribution.
The predictive probability density is not conditioned on parameter estimates and the inclusion or exclusion of predictors with static regression coefficients, all of which have been integrated out. Thus, through Bayesian model averaging, we commit neither to any particular set of covariates, which helps avoid an arbitrary selection, nor to point estimates of their coefficients, which prevents overfitting. By using multivariate BSTS model, we also take into account the correlations among multiple target series, when sampling for predicted values of several target series. The posterior predictive density in (31) is defined as a joint distribution over all predicted target series, rather than as a collection of pointwise univariate distributions. This ensures that we properly forecast multiple target series as a whole instead of predicting them individually. This is crucial, in particular, when forming summary statistics, such as their mean or variance-covariance from joint empirical distribution of forecast values.
3 Application to Simulated Data
In order to investigate the properties of our model, in this section, we analyze computer-generated data through a series of independent simulations. We generated multiple datasets with different time span, local trend, number of regressors, and correlations among two target series to analyze three aspects of generated data, which are estimation accuracy for parameters, ability for selecting the correct variables, and forecast performance of the model.
3.1 Generated Data
To verify whether the estimation error and estimation standard deviation will decrease as sample size increases, we built four different models in equation (37) and each generates two target time series data with different number of observations (). These datasets are simulated using latent states and a static regression component with four explanatory variables, one of which has no effect on each target series with zero coefficient. Specifically, each target series was generated including different set of state components and explanatory variables, which also implies that the explanatory variables having no effect on each target series are not same.
The latent states were generated using a local linear trend with and without a global slope, seasonality with period equal to four or cyclical component with for both target series. All initial values are drawn from normal distribution with mean zero. The detailed model description is presented as follows: