Bayesian Nonparametric Covariance Regression
Abstract
Although there is a rich literature on methods for allowing the variance in a univariate regression model to vary with predictors, time and other factors, relatively little has been done in the multivariate case. Our focus is on developing a class of nonparametric covariance regression models, which allow an unknown covariance matrix to change flexibly with predictors. The proposed modeling framework induces a prior on a collection of covariance matrices indexed by predictors through priors for predictordependent loadings matrices in a factor model. In particular, the predictordependent loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g, Gaussian process random functions). The induced covariance is then a regularized quadratic function of these dictionary elements. Our proposed framework leads to a highlyflexible, but computationally tractable formulation with simple conjugate posterior updates that can readily handle missing data. Theoretical properties are discussed and the methods are illustrated through simulations studies and an application to the Google Flu Trends data.
Keywords: covariance estimation; Gaussian process; heteroscedastic regression; nonparametric Bayes; stochastic process.
1Introduction
Spurred by the increasing prevalence of highdimensional datasets and the computational capacity to analyze them, capturing heteroscedasticity in multivariate processes has become a growing focus in many applied domains. For example, within the field of financial time series modeling, capturing the timevarying volatility and covolatility of a collection of risky assets is key in devising a portfolio management scheme. Likewise, the spatial statistics community is often faced with multivariate measurements (e.g., temperature, precipitation, etc.) recorded at a large collection of locations, necessitating methodology to model the strong spatial (and spatiotemporal) variations in correlations. More generally, imagine that one has some arbitrary, potentially multivariate predictor space and a collection of multivariate response vectors . The problem of mean regression (i.e., ) has been well studied in both the univariate and multivariate settings. Although there is a rich literature on methods for allowing the variance in a univariate regression model to vary with predictors, there is a dearth of methodologies for the general case of multivariate covariance regression (i.e., ). The covariance matrix captures key correlations between the elements of the response vector, and the typical assumption of a homoscedastic model can have significant impact on inferences.
Historically, the problem of multivariate covariance regression has typically been addressed by standard regression operations on the unconstrained elements of the log or Cholesky decomposition of the covariance (or precision) matrix. For example, [26] proposes to model elements of as a linear function of the predictors. The weights associated with the th row have a nice interpretation in terms of the conditional distribution of given ; however, the model is not invariant to permutations of the elements of which is problematic in applications where there does not exist a natural ordering. Alternatively, [9] consider modeling each element of as a linear function of the predictor. An issue with this formulation is in the interpretability of the model: a submatrix of does not necessarily coincide with a submatrix of the matrix logarithm. Additionally, both the models of [26] and [9] involve a large number of parameters (specifically, assuming .) More recently, [19] propose a covariance regression model in which with positive definite and real. This model has interpretable parameters and may be equated with a latent factor model, leading to computational advantages. However, the model still has key limitations in (i) scaling to large domains, and (ii) flexibility based on the parametric approach. Specifically, the model restricts the difference between and the baseline matrix to be rank 1. Higher rank models can be considered via extensions such as , but this dramatically increases the parameterization and requires definition of the maximal rank difference.
For volatility modeling where the covariate space is typically taken to be discrete time, heteroscedasticity has classically been captured via either variants of ARCH [12] or stochastic volatility models [17]. The former directly specifies the volatility matrix as a linear combination of lagged volatilities and squared returns, which suffers from curse of dimensionality, and is typically limited to datasets with 5 or fewer dimensions. Alternatively, multivariate stochastic volatility models assume , with real, , and independent autoregressive processes. See [8] for a survey of such approaches. More recently, a number of papers have examined inducing covariance processes through variants of a Wishart process. [24] take with
Within the spatial statistics community, the term Wishart process is typically used to specify a different formulation than those described herein for volatility modeling. Specifically, letting denote the covariance of a dimensional observation at geographic location , [13] assume that with diagonal and following a matricvariate Wishart process. This Wishart process is such that with independently for each and typically taken to be diagonal. The induced distribution on is then marginally inverse Wishart
In this paper, we present a Bayesian nonparametric approach to multivariate covariance regression that allows the covariance matrix to change flexibly with predictors and readily scales to highdimensional datasets. The proposed modeling framework induces a prior on a collection of covariance matrices through specification of a prior on a predictordependent latent factor model. In particular, the predictordependent loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g, Gaussian process random functions). The induced covariance is then a regularized quadratic function of these dictionary elements. The proposed methodology has numerous advantages over previous approaches. By employing collections of continuous random functions, we allow for an irregular grid of observations. Similarly, we can easily cope with missing data within our framework without relying on imputing the missing values. Another fundamental property of the proposed methodology is the fact that our combined use of a shrinkage prior with a latent factor model enables us (in theory) to handle highdimensional datasets (e.g., on the order of hundreds of dimensions) in the presence of a limited number of observations. Essential in being able to cope with such large datasets in practice is the fact that our computations are tractable, based on simple conjugate posterior updates. Finally, we are able to state theoretical properties of our proposed prior, such as large support.
The paper is organized as follows. In Section 2, we describe our proposed Bayesian nonparametric covariance regression model in addition to analyzing the theoretical properties of the model. Section 3 details the Gibbs sampling steps involved in our posterior computations. Finally, a number of simulation studies are examined in Section 4, with an application to the Google Trends flu dataset presented in Section 5.
2Covariance Regression Priors
2.1Notation and Motivation
Let denote the covariance matrix at “location” . In general, is an arbitrary, possibly multivariate predictor value. In dynamical modeling, may simply represent a discrete time index (i.e., ) or, in spatial modeling, a geographical location (i.e., ). Another simple, tractable case is when represents an ordered categorical predictor (i.e., ). We seek a prior for , the collection of covariance matrices over the space of predictor values.
Letting our goal is to choose a prior for the collection of covariance matrices that has large support and leads to good performance in large settings. By “good” we mean accurate estimation in small samples, taking advantage of shrinkage priors and efficient computation that scales well as increases. We initially focus on the relatively simple setting in which
independently for each . Such a formulation could be extended to settings in which data are collected at repeated times for different subjects, as in multivariate longitudinal data analysis, by embedding the proposed model within a hierarchical framework. See Section 6 for a discussion.
2.2Proposed Latent Factor Model
In large settings, modeling a covariance matrix over an arbitrary predictor space represents an enormous dimensional regression problem; we aim to reduce dimensionality for tractability in building a flexible nonparametric model for the predictordependent covariances. A popular approach for coping with such high dimensional (nonpredictordependent) covariance matrices in the presence of limited data is to assume that the covariance has a decomposition as where is a factor loadings matrix with and is a diagonal matrix with nonnegative entries. To build in predictor dependence, we assume a decomposition
where is a factor loadings matrix that is indexed by predictors and where . Assuming initially for simplicity that , such a decomposition is induced by marginalizing out a set of latent factors from the following latent factor model:
Here, is the predictor associated with the th observation .
Despite the dimensionality reduction introduced by the latent factor model of Eq. , modeling a dimensional predictordependent factor loadings matrix still represents a significant challenge for large domains. To further reduce dimensionality, and following the strategy of building a flexible highdimensional model from simple lowdimensional pieces, we assume that each element of is a linear combination of a much smaller number of unknown dictionary functions . That is, we propose to let
where is the matrix of coefficients relating the predictordependent factor loadings matrix to the set of dictionary functions comprising the dimensional matrix . Typically, and . Since we can write
we see that the weighted sum of the th column of dictionary functions , with weights specified by the th row of , characterizes the impact of the th latent factor on , the th component of the response at predictor location . By characterizing the elements of as a linear combination of these flexible dictionary functions, we obtain a highlyflexible but computationally tractable formulation.
In marginalizing out the latent factors, we now obtain the following induced covariance structure
Note that the above decomposition of is not unique and there are actually infinitely many such equivalent decompositions. For example, take and . Alternatively, consider for any orthogonal matrix or and for any matrix of dictionary functions . One can also increase the dimension of the latent factors and take . In standard (nonpredictordependent) latent factor modeling, a common approach to ensure identifiability is to constrain the factor loadings matrix to be block lower triangular with strictly positive diagonal elements [14], though such a constraint induces order dependence among the responses [2]. However, for tasks such as inference on the covariance matrix and prediction, identifiability of a unique decomposition is not necessary. Thus, we do not restrict ourselves to a unique decomposition of , allowing us to define priors with better computational properties.
Although we are not interested in identifying a unique decomposition of , we are interested in characterizing the class of covariance regressions that can be decomposed as in Eq. . Lemma ? states that for and sufficiently large, any covariance regression has such a decomposition. For , let denote the space of all dimensional matrices of arbitrarily complex dictionary functions mapping from , be the space of all diagonal matrices with nonnegative entries, and be the space of all dimensional matrices such that has finite elements.
Now that we have established that there exist decompositions of into the form specified by Equation , the question is whether we can specify a prior on the elements , , and that provides large support on such decompositions. This is explored in Section 2.3.
In order to generalize the model to also allow the mean to vary flexibly with predictors, we can follow a nonparametric latent factor regression approach and let
where , and is an unknown function relating the predictors to the mean of the th factor, for . These functions can be modeled in a related manner to the functions described above. The induced mean of conditionally on and marginalizing out the latent factors is then . For simplicity, however, we focus our discussions on the case where .
2.3Prior Specification
Working within a Bayesian framework, we place independent priors on , , and in Eq. to induce a prior on . Let , , and denote each of these independent priors, respectively. Recall that denotes the induced prior on .
Aiming to capture covariances that vary continuously over combined with the goal of maintaining simple computations for inference, we specify the dictionary functions as
independently for all , with a squared exponential correlation function having .
To cope with the fact that the number of latent dictionary functions is a model choice we are required to make, we seek a prior that favors many values of being close to zero so that we may choose larger than the expected number of dictionary functions (also controlled by the latent factor dimension ). As proposed in [5], we use the following shrinkage prior:
Choosing implies that is greater than 1 in expectation so that tends stochastically towards infinity as goes to infinity, thus shrinking the elements toward zero increasingly as grows. The precision parameters allow for flexibility in how the elements of are shrunk towards zero by incorporating local shrinkage specific to each element of , while provides a global columnwise shrinkage factor.
Finally, we specify via the usual inverse gamma priors on the diagonal elements of . That is,
independently for each .
2.4Theoretical Properties
In this section, we explore the theoretical properties of the proposed Bayesian nonparametric covariance regression model. In particular, we focus on the support of the induced prior based on the priors , , and defined in Section 2.3. Large support implies that the prior can generate covariance regressions that are arbitrarily close to any function in a large class. Such a support property is the defining feature of a Bayesian nonparametric approach and cannot simply be assumed. Often, seemingly flexible models can have quite restricted support due to hidden constraints in the model and not to real prior knowledge that certain values are implausible. Although we have chosen a specific form for a shrinkage prior , we aim to make our statement of prior support as general as possible and thus simply assume that satisfies a set of two conditions given by Assumption ? and Assumption ?. In Lemma ?, we show that the specified in Eq. satisfies these assumptions. The proofs associated with the theoretical statements made in this section can be found in the Appendix.
The following theorem shows that, for and as , the induced prior places positive probability on the space of all covariance functions that are continuous on .
Intuitively, the support on continuous covariance functions arises from the continuity of the Gaussian process dictionary functions. However, since we are mixing over infinitely many such dictionary functions, we need the mixing weights specified by to tend towards zero, and to do so “fast enough”—this is where Assumption ? becomes important. See Theorem ?. The proof of Theorem ? relies on the large support of at any point . Since each is independently Gaussian distributed (based on properties of the Gaussian process prior), is Wishart distributed. Conditioned on , is also Wishart distributed. More generally, for fixed , follows the matricvariate Wishart process of [13]. Combining the large support of the Wishart distribution with that of the gamma distribution on the inverse elements of provides the desired large support of the induced prior at each predictor location .
Lemma ? specifies the conditions under which the prior specified in Eq. satisfies Assumption ?, which provides a sufficient condition used in the proof of prior support.
It is also of interest to analyze the moments associated with the proposed prior. As detailed in the Appendix, the first moment can be derived based on the implied inverse gamma prior on the combined with the fact that is marginally Wishart distributed at every location , with the prior on specified in Equation .
Since our goal is to develop a covariance regression model, it is natural to consider the correlation induced between an element of the covariance matrix at different predictor locations and .
We can thus conclude that in the limit as the distance between the predictors tends towards infinity, the correlation decays at a rate defined by the Gaussian process kernel with a limit:
It is perhaps counterintuitive that the correlation between and does not go to zero as the distance between the predictors and tends to infinity. However, although the correlation between and goes to zero, the diagonal matrix does not depend on or and thus retains the correlation between the diagonal elements of and .
Equation implies that the autocorrelation is simply specified by . When we choose a Gaussian process kernel , we have
Thus, we see that the lengthscale parameter directly determines the shape of the autocorrelation function.
Finally, one can analyze the stationarity properties of the proposed covariance regression prior.
3Posterior Computation
3.1Gibbs Sampling with a Fixed Truncation Level
Based on a fixed truncation level and a latent factor dimension , we propose a Gibbs sampler for posterior computation. The derivation of Step 1 is provided in the Appendix.
Step 1 Update each dictionary function from the conditional posterior given , , , . We can rewrite the observation model for the th component of the th response as
Conditioning on , our Gaussian process prior on the dictionary functions implies the following conditional posterior
where and, taking to be the Gaussian process covariance matrix with ,
Step 2 Next we sample each latent factor given , , , . Recalling Eq. and the fact that ,
Step 3 Let . Recalling the prior on each precision parameter associated with the diagonal noise covariance matrix , standard conjugate posterior analysis yields the posterior
Step 4 Conditioned on the hyperparameters and , the Gaussian prior on the elements of specified in Eq. combined with the likelihood defined by Eq. imply the following posterior for each row of :
where and
Step 5 Examining Eq. and using standard conjugate analysis results in the following posterior for each local shrinkage hyperparameter given and :
Step 6 As in [5], the global shrinkage hyperparameters are updated as
where for .
3.2Incorporating nonparametric mean
If one wishes to incorporate a latent factor regression model such as in Eq. to induce a predictordependent mean , the MCMC sampling is modified as follows. Steps 1, 3, 4, 5, and 6 are exactly as before. Now, however, the sampling of of Step 2 is replaced by a block sampling of and . Specifically, let . We can rewrite the observation model as . Marginalizing out , with . Assuming nonparametric mean vector components , the posterior of follows analogously to that of resulting in
where . Once again taking to be the Gaussian process covariance matrix,
Conditioned on , we consider . Then, using the fact that ,
3.3Hyperparameter Sampling
One can also consider sampling the Gaussian process lengthscale hyperparameter . Due to the linearGaussianity of the proposed covariance regression model, we can analytically marginalize the latent Gaussian process random functions in considering the posterior of . Once again taking for simplicity, our posterior is based on marginalizing the Gaussian process random vectors . Noting that
and letting denote the Gaussian process covariance matrix based on a lengthscale ,
We can then Gibbs sample based on a fixed grid and prior on this grid. Note, however, that computation of the likelihood specified in Eq. requires evaluation of an dimensional Gaussian for each value specified in the grid. For large scenarios, or when there are many observations , this may be computationally infeasible. In such cases, a naive alternative is to iterate between sampling given and given . However, this can lead to extremely slow mixing. Alternatively, one can consider employing the recent Gaussian process hyperparameter slice sampler of [1].
In general, because of the quadratic mixing over Gaussian process dictionary elements, our model is relatively robust to the choice of the lengthscale parameter and the computational burden imposed by sampling is typically unwarranted. Instead, one can preselect a value for using a datadriven heuristic, which leads to a quasiempirical Bayes approach. Recalling Equation , we have
Thus, if one can devise a procedure for estimating the autocorrelation function from the data, one can set accordingly. We propose the following.
For a set of evenly spaced knots , compute the sample covariance from a local bin of data with .
Compute the Cholesky decomposition .
Fit a spline through the elements of the computed . Denote the spline fit of the Cholesky by for each
For , compute a pointbypoint estimate of from the splines: .
Compute the autocorrelation function of each element of this kernelestimated .
According to Equation , choose to best fit the most correlated (since less correlated components can be captured via weightings of dictionary elements with stronger correlation.)
3.4Computational Considerations
In choosing a truncation level and latent factor dimension , there are a number of computational considerations. The Gibbs sampler outlined in Section Section 3.1 involves a large number of simulations from Gaussian distributions, each of which requires the inversion of an dimensional covariance matrix, with the dimension of the Gaussian. For large , this represents a large computational burden as the operation is, in general, . The computations required at each stage of the Gibbs sampler are summarized in Table ?. From this table we see that depending on the number of observations and the dimension of these observations , various combinations of and lead to more or less efficient computations.
In [5], a method for adaptively choosing the number of factors in a nonpredictor dependent latent factor model was proposed. One could directly apply such a methodology for adaptively selecting . To handle the choice of , one could consider an augmented formulation in which
where is a diagonal matrix of parameters that shrink the columns of towards zero. One can take these shrinkage parameters to be distributed as
For , such a model shrinks the values towards zero for large indices just as in the shrinkage prior on . The close to zero provide insight into redundant latent factor dimensions. Computations in this augmented model are a straightforward extension of the Gibbs sampler presented in Section 3.1. Based on the inferred values of the latent parameters, one can design an adaptive strategy similar to that for .
Note that in Step 1, the dimensional inverse covariance matrix which needs to be inverted in order to sample is a composition of a diagonal matrix and an inverse covariance matrix that has entries that tend towards zero as becomes large (i.e., for distant pairs of predictors.) That is, (and thus ) is nearly bandlimited, with a bandwidth dependent upon the Gaussian process parameter . Inverting a given bandlimited matrix with bandwidth can be efficiently computed in [21] (versus the naive ). Issues related to tapering the elements of to zero while maintaining positivesemidefiniteness are discussed in [30].
4Simulation Example
In the following simulation examples, we aim to analyze the performance of the proposed Bayesian nonparametric covariance regression model relative to competing alternatives in terms of both covariance estimation and predictive performance. We initially consider the case in which is generated from the assumed nonparametric Bayes model in Section 4.1 and Section 4.2, while in Section 4.3 we simulate from a parametric model and compare to a Wishart matrix discounting method [27] over a set of replicates.
4.1Estimation Performance
We simulated a dataset from the model as follows. The set of predictors is a discrete set , with a 10dimensional observation generated for each . The generating mechanism was based on weightings of a latent dimensional matrix of Gaussian process dictionary functions (i.e, , ), with lengthscale and an additional nugget effect adding to . Here, we first scale the predictor space to . The additional latent mean dictionary elements were similarly distributed. The weights were simulated as specified in Eq. choosing . The precision parameters were each drawn independently from a distribution with mean . Figure ? displays the resulting values of the elements of and .
(a)  (b) 
(a)  (b)  (c) 
For inference, we set the hyperparameters as follows. We use truncation levels , which we found to be sufficiently large from the fact that the last few columns of the posterior samples of were consistently shrunk close to . We set and placed a prior on the precision parameters . The lengthscale parameter was set from the data according to the heuristic described in Section 3.3 using 20 knots evenly spaced in , and was determined to be (after rounding).
Although experimentally we found that our sampler was insensitive to initialization in lowerdimensional examples such as the one analyzed here, we use the following more intricate initialization for consistency with later experiments on larger datasets in which mixing becomes more problematic. The predictorindependent parameters and are sampled from their respective priors (first sampling the shrinkage parameters and from their priors). The variables and are set via a datadriven initialization scheme in which an estimate of for is formed using Steps 14 of Section 3.3. Then, is taken to be a dimensional lowrank approximation to the Cholesky of the estimates of . The latent factors are sampled from the posterior given in Equation using this datadriven estimate of . Similarly, the are initially taken to be spline fits of the pseudoinverse of the lowrank Cholesky at the knot locations and the sampled . We then iterate a couple of times between sampling: (i) given , , , and the datadriven estimates of , ; (ii) given , , , and the sampled ; (iii) given , , , and ; and (iv) determining a new datadriven approximation to based on the newly sampled . Results indistinguishable from those presented here were achieved (after a short burnin period) by simply initializing each of , , , , and the shrinkage parameters and from their respective priors.
We ran 10,000 Gibbs iterations and discarded the first 5,000 iterations. We then thinned the chain every 10 samples. The residuals between the true and posterior mean over all components are displayed in Figure ?(a) and (b). Figure ?(c) compares the posterior samples of the elements of the noise covariance to the true values. Finally, in Figure ? we display a select set of plots of the true and posterior mean of components of and , along with the 95% highest posterior density intervals computed at each predictor value .
From the plots of Figures ? and ?, we see that we are clearly able to capture heteroscedasticity in combination with a nonparametric mean regression. The true values of the mean and covariance components are (even in the worst case) contained within the 95% highest posterior density intervals, with these intervals typically small such that the overall interval bands are representative of the shape of the given component being modeled.
4.2Predictive Performance
Capturing heteroscedasticity can significantly improve estimates of the predictive distribution of new observations or missing data. To explore these gains within our proposed Bayesian nonparametric covariance regression framework, we compare against two possible homoscedastic formulations that each assume . The first is a standard Gaussian process mean regression model with each element of an independent draw from . The second builds on our proposed regularized latent factor regression model and takes , with as defined in the heteroscedastic case. However, instead of having a predictordependent covariance , the homoscedastic model assumes that is an arbitrary covariance matrix constant over predictors. By comparing to this latter homoscedastic model, we can directly analyze the benefits of our heteroscedastic model since both share exactly the same mean regression formulation. For each of the homoscedastic models, we place an inverse Wishart prior on the covariance .
We analyze the same simulation dataset as described in Section 4.1, but randomly remove approximately 5% of the observations. Specifically, independently for each element (i.e., the th response component at predictor ) we decide whether to remove the observation based on a draw. We chose to be a function of the matrix norm of the true covariance at to slightly bias towards removing values from predictor regions with a tighter distribution. This procedure resulted in removing 48 of the 1000 response elements.
Table ? compares the average KullbackLeibler divergence , , for the following definitions of and . The distribution is the predictive distribution of all missing elements given the observed elements of under the true parameters and . Likewise, is taken to be the predictive distribution based on the th posterior sample of and . In this scenario, the missing observations are imputed as an additional step in the MCMC computations
4.3Model Mismatch
We now examine our performance over a set of replicates from a 30dimensional parametric heteroscedastic model. To generate the covariance over , we chose a set of 5 evenly spaced knots and generated
with and . This construction implies that and are correlated. We then fit a spline independently through each element and evaluate this spline fit at . The covariance is constructed as
where is a diagonal matrix with a truncatednormal prior, , on its diagonal elements. The constant is chosen to scale the maximum value of to 1. The resulting covariance is shown in Figure ?(a).
(a)  (b)  (c) 
Each replicate of this parametric heteroscedastic model is generated as
Our hyperparameters and initialization scheme are exactly as in Section 4. The only difference is that we use truncation levels based on an initial analysis with . For each replicate, we once again run 10,000 Gibbs iterations and thin the chain by examining every th sample. A mean estimate of is displayed in Figure ?(b). In Figure ?, we plot the mean and 95% highest posterior density intervals of the Frobenius norm aggregated over iterations and replicates . The average norm error over is around 3, which is equivalent to each element of the inferred deviating from the true by 0.1. Since the covariance elements are approximately in the range of and the variances in , these norm error values indicate very good estimation performance.
We compare our performance to that of the Wishart matrix discounting model (see Section 10.4.2 of [27]), which is commonly used in stochastic volatility modeling of financial time series. Let . The Wishart matrix discounting model is a discretetime covariance evolution model that accounts for the slowly changing covariance by discounting the cumulated information. Specifically, assume , with and . The discounting model then specifies
such that , but with certainty discounted by a factor determined by . The update with observation is conjugate, maintaining a Wishart posterior on . A limitation of this construction is that it constrains (or integral) implying that . We set and such that for all and ran the forward filtering backward sampling (FFBS) algorithm outlined in [27], generating 100 independent samples. A mean estimate of is displayed in Figure ?(c) and the Frobenius norm error results are depicted in Figure ?. Within the region , we see that the error of the Wishart matrix discounting method is approximately twice that of our proposed methodology. Furthermore, towards the end of the time series (interpreting as representing a batch of time), the estimation error is especially poor due to errors accumulated in forward filtering. Increasing mitigates this problem, but shrinks the model towards homoscedasticity. In general, the formulation is sensitive to the choice of , and in highdimensional problems this degree of freedom is forced to take large (or integral) values.
(a)  (b) 
5Applications
We applied our Bayesian nonparametric covariance regression model to the problem of capturing spatiotemporal structure in influenza rates in the United States (US). Surveillance of influenza has been of growing interest following a series of pandemic scares (e.g., SARS and avian flu) and the 2009 H1N1 pandemic, previously known as “swine flu”. Although influenza pandemics have a long history, such as the 19181919 “Spanish flu”, 19571958 “Asian flu”, and 19681969 “Hong Kong flu”, a convergence of factors are increasing the current public interest in influenza surveillance. These include both practical reasons such as the rapid rate by which geographically distant cases of influenza can spread worldwide, along with other driving factors such as an increased media coverage.
5.1CDC Influenza Monitoring
The surveillance of influenza within the US is coordinated by the Centers for Disease Control and Prevention (CDC), which collects data from a large network of diagnostic laboratories, hospitals, clinics, individual healthcare providers, and state health departments (see http://www.cdc.gov/flu/weekly/). The approximately 3,000 participating outpatient sites, collectively referred to as the US Outpatient InfluenzaLike Illness Surveillance Network (ILINet), provide the CDC with key information about rates of influenzalike illness (ILI)
A plot of the number of isolates tested positive by the WHO and NREVSS from 20032010 is shown in Figure ?(a). From these data and the CDC weekly flu reports, we defined a set of six events (Events AF) corresponding to the 20032004, 20042005, 20052006, 20062007, 20072008, and 20092010 flu seasons, respectively. The 20032004 flu season began earlier than normal, and coincided with a flu vaccination shortage in many states. For the vaccination that was available, the CDC found that it was “not effective or had very low effectiveness” (http://www.cdc.gov/media/pressrel/fs040115.htm). The 20042005 and 20072008 flu seasons were more severe than the 20052006 and 20062007 seasons. However, the 20052006 season coincided with an avian flu (H5N1) scare in which Dr. David Narbarro, Senior United Nations System Coordinator for Avian and Human Influenza, was famously quoted as predicting that an avian flu pandemic would lead to 5 million to 150 million deaths. Finally, the 20092010 flu season coincides with the emergence of the 2009 H1N1 (“swine flu”) subtype
(a)  (b)  (c) 
5.2Google Flu Trends Dataset
To aid in a more rapid response to influenza activity, a team of researchers at Google devised a model based on Google user search queries that is predictive of CDC ILI rates [15]—that is, the probability that a random physician visit is related to an influenzalike illness. The Google Flu Trends methodology was devised as follows. From the hundreds of billions of individual searches from 20032008, time series of statebased weekly query rates were created for the 50 million most common search terms. The predictive performance of a regression on the logittransformed query rates was examined for each of the 50 million candidates and a ranked list was produced that rewarded terms predictive of rates exhibiting similar regional variations to that of the CDC data. A massive variable selection procedure was then performed to find the optimal combination of query words (based on best fit against outofsample ILI data), resulting in a final set of 45 ILIrelated queries. Using the 45 ILIrelated queries as the explanatory variable, a regionindependent univariate linear model was fit to the weekly CDC ILI rates from 20032007. This model is used for making estimates in any region based on the ILIrelated query rates from that region. The results were validated against the CDC data both on training and test data, with the Google reported US and regional rates closely tracking the actual reported rates.
A key advantage of the Google data (available at http://www.google.org/flutrends/) is that the ILI rate predictions are available 1 to 2 weeks before the CDC weekly reports are published. Additionally, a user’s IP address is typically connected with a specific geographic area and can thus provide information at a finer scale than the 10regional and US aggregate reporting provided by the CDC. Finally, the Google reports are not subject to revisions. One important note is that the Google Flu Trends methodology aims to hone in on searches and rates of such searches that are indicative of influenza activity. A methodology based directly on raw search queries might instead track general interest in influenza, waxing and waning quickly with various media events.
We analyze the Google Flu Trends data from the week of September 28, 2003 through the week of October 24, 2010, providing 370 observation vectors . Each observation vector is 183dimensional with elements consisting of Google estimated ILI rates at the US national level, the 50 states, 10 U.S. Department of Health & Human Services surveillance regions, and 122 cities. It is important to note, however, that there is substantial missing data with entire blocks of observations unavailable (as opposed to certain weeks sporadically being omitted). At the beginning of the examined time frame only 114 of the 183 regions were reporting. By the end of Year 1, there were 130 regions. These numbers increased to 173, 178, 180, and 183 by the end of Years 2, 3, 4, and 5, respectively. The highdimensionality and missing data structure make the Google Flu Trends dataset challenging to analyze in its entirety with existing heteroscedastic models. As part of an exploratory data analysis, in Figure ? we plot sample estimates of the geographic correlation structure between the states during an event period for four representative states. Specifically, we first subtract a moving average estimate of the mean (window size 10) and then aggregate the data over Events BF, omitting Event A due to the quantity of missing data. Because of the dimensionality of the data (183 dimensions) and the fact that there are only 157 event observations, we simply consider the statelevel observations (plus District of Columbia), reducing the dimensionality to 51. The limited data also impedes our ability to perform timespecific sample estimates of geographic correlations.
5.3Heteroscedastic Modeling of Google Flu Trends
Our proposed heteroscedastic model allows one to capture both spatial and temporal changes in correlation structure, providing an important additional tool in predicting influenza rates. We specifically consider with the nonparametric function defining the mean of the ILI rates in each of the 183 regions. For a given week , the spatial correlation structure is captured by the covariance . Temporal changes are implicitly modeled through the proposed covariance regression framework that allows for continuous variations in . [11] also examine portions of the Google Flu Trends data, but with the goal of online tracking of influenza rates on either a national, state, or regional level. Specifically, they employ a statespace model with particle learning. Our goal differs considerably. We aim to jointly analyze the full 183dimensional data, as opposed to univariate modeling. Through such joint modeling, we can uncover important spatial dependencies lost when analyzing components of the data individually. Such spatial information can be key in predicting influenza rates based on partial observations from select regions or in retrospectively imputing missing data.
There are a few crucial points to note. The first is that no geographic information is provided to our model. Instead, the spatial structure is uncovered simply from analyzing the raw 183dimensional time series and patterns therein. Second, because of the substantial quantity of missing data, imputing the missing values as in Section 4.2 is less ideal than simply updating our posterior based solely on the data that is available. The latter is how we chose to analyze the Google Flu Trends data—our ability to do so without introducing any approximations is a key advantage of our proposed methodology.

The results presented in Figures ? and ? clearly demonstrate that we are able to capture both spatial and temporal changes in correlations in the Google Flu Trends data, even in the presence of substantial missing information. We preprocessed the data by scaling the entire dataset by one over the largest variance of any of the 183 time series. The hyperparameters were set as in the simulation study of Section 4, except with larger truncation levels and and with the Gaussian process lengthscale hyperparameter set to to account for the time scale (weeks) and the rate at which ILI incidences change. Once again, by examining posterior samples of , we found that the chosen truncation level was sufficiently large. We ran 10 chains each for 10,000 MCMC iterations, discarded the first 5,000 for burnin, and then thinned the chains by examining every 10 samples. Each chain was initialized with parameters sampled from the prior. To assess convergence, we performed the modified GelmanRubin diagnostic of [6] on the MCMC samples of the variance terms for corresponding to the state indices of New York, California, Georgia, and South Dakota, and corresponding to the midpoint of each of the 12 event and nonevent time windows
In Figure ?(b), we plot the posterior mean of the 183 components of , showing trends that follow the Google estimated US national ILI rate. For New York, in Figure ?(c) we plot the 25th, 50th, and 75th quantiles of correlation with the 182 other states and regions based on the posterior mean of the covariance function. From this plot, we immediately notice that regions become more correlated during flu seasons, as we would expect. The specific geographic structure of these correlations is displayed in Figure ?, where we see key changes with the specific flu event. In the more mild 20052006 season, we see much more local correlation structure than the more severe 20072008 season (which still maintains stronger regional than distant correlations.) The November 2009 H1N1 event displays overall regional correlation structure and values similar to the 20072008 season, but with key geographic areas that are less correlated. Note that some geographically distant states, such as New York and California, are often highly correlated as we might expect based on their demographic similarities and high rates of travel between them. Interestingly, the strong local spatial correlation structure for South Dakota in February 2006 has been inferred before any data are available for that state. Actually, no data are available for South Dakota from September 2003 to November 2006. Despite this missing data, the inferred correlation structures over these years are fairly consistent with those of neighboring states and change in manners similar to the flutononflu changes inferred after data for South Dakota are available.
Comparing the maps of Figure ? to those of the samplebased estimates in Figure ?, we see much of the same correlation structure, which at a high level validates our findings. Since the samplebased estimates aggregate data over Events BF (containing those displayed in Figure ?), they tend to represent a timeaverage of the eventspecific correlation structure we uncovered. Note that due to the dimensionality of the dataset, the samplebased estimates are based solely on statelevel measurements and thus are unable to harness the richness (and crucial redundancy) provided by the other regional reporting agencies. Furthermore, since there are a limited number of perevent observations, the naive approach of forming sample covariances based on local bins of data is infeasible. The highdimensionality and missing data structure of this dataset also limit our ability to compare to alternative methods such as those cited in Section 1—none yield results directly comparable to the full analysis we have provided here. Instead, they are either limited to examination of the small subset of data for which all observations are present and/or a lowerdimensional selection (or projection) of observations. On the other hand, our proposed algorithm can readily utilize all information available to model the heteroscedasticity present here. (Compare to the common GARCH models, which cannot handle missing data and are limited to typically no more than 5 dimensions.) In terms of computational complexity, each of our chains of 10,000 Gibbs iterations based on a naive implementation in MATLAB (R2010b) took approximately 12 hours on a machine with four Intel Xeon X5550 QuadCore 2.67GHz processors and 48 GB of RAM.
6Discussion
In this paper, we have presented a Bayesian nonparametric approach to covariance regression which allows an unknown dimensional covariance matrix to vary flexibly over , where is some arbitrary (potentially multivariate) predictor space. Foundational to our formulation is quadratic mixing over a collection of dictionary elements, assumed herein to be Gaussian process random functions, defined over . By inducing a prior on through a shrinkage prior on a predictordependent latent factor model, we are able to scale to the large