Efficient Covariance Estimation from Temporal Data
Abstract.
Estimating the covariance structure of multivariate time series is a fundamental problem with a widerange of realworld applications—from financial modeling to fMRI analysis. Despite significant recent advances, current stateoftheart methods are still severely limited in terms of scalability, and do not work well in highdimensional undersampled regimes. In this work we propose a novel method called Temporal Correlation Explanation, or TCorEx, that (a) has linear time and memory complexity with respect to the number of variables, and can scale to very large temporal datasets that are not tractable with existing methods; (b) gives stateoftheart results in highly undersampled regimes on both synthetic and realworld datasets; and (c) makes minimal assumptions about the character of the dynamics of the system. TCorEx optimizes an informationtheoretic objective function to learn a latent factor graphical model for each time period and applies two regularization techniques to induce temporal consistency of estimates. We perform extensive evaluation of TCorex using both synthetic and realworld data and demonstrate that it can be used for detecting sudden changes in the underlying covariance matrix, capturing transient correlations and analyzing extremely highdimensional complex multivariate time series such as highresolution fMRI data.
1. Introduction
Many complex systems in finance, biology, and social sciences can be modeled by multivariate time series. One of the first steps in analyzing such timevarying complex systems is temporal covariance estimation—that is, estimation of the covariance matrix of the variables in different time periods. Such an analysis can be helpful for understanding relationships between components of the system, spotting trends, making predictions, detecting shifts in the underlying structure and for other tasks (Engle et al., 2017; De Nard et al., 2018; Ahmed and Xing, 2009; Monti et al., 2014).
There is an increasing need for efficient temporal covariance estimation methods that can work in highdimensional undersampled regimes. In this regime, even static covariance estimation is a formidable problem (Fan et al., 2016). Extending covariance estimation to the temporal case adds unique challenges. First, the samples from different time steps generally are not independent and identically distributed. Second, the dynamics of the system can be quite complex (e.g., financial time series, biological systems) and hard to model without strong assumptions. Furthermore, the number of time steps whose observations are relevant for estimating the covariance matrix for a particular time period is limited when the underlying covariance matrix changes quickly.
Current stateoftheart temporal covariance estimation methods learn a sparse graphical model for each time period using variants of graphical lasso, while adding a regularization term for inducing temporal consistency of estimates (Hallac et al., 2017; Tomasi et al., 2018). Unfortunately, these methods have at least cubic time complexity and quadratic memory complexity with respect to the number of variables, and do not scale to truly highdimensional problems (e.g., the approaches described in Refs. (Hallac et al., 2017; Tomasi et al., 2018) do not scale beyond thousands of time series).
In this work we propose a novel temporal covariance estimation method, TCorEx, that addresses the aforementioned challenges. TCorEx is based on linear CorEx (Steeg and Galstyan, 2017), which learns static latent variable graphical models by minimizing an informationtheoretic objective function. Our method trains a linear CorEx for each time period and employs two regularization techniques to enforce temporal consistency of learned models. TCorEx has linear time and memory complexity with respect to the number of variables and can be applied to temporal data with several orders of magnitude more variables than what existing methods can handle (e.g., it takes less than an hour on a moderate PC to estimate the covariance structure for time series with variables). The only assumption TCorEx makes about the dynamics of the system is that on average, the underlying covariance matrix changes slowly with time.
We compare TCorEx against other methods on both synthetic and realworld datasets. Our experiments show that TCorEx yields stateoftheart performance in highly undersampled regimes. More specifically, TCorEx outperforms other methods in terms of describing existing correlations in the data (quantified by loglikelihood). We also apply TCorEx to stock market and highresolution functional magnetic resonance image (fMRI) data. For the former, we demonstrate that TCorEx can detect transient correlations and sudden changepoints of the underlying covariance matrix, not detectable with static methods. For the latter, we show that TCorEx successfully scales for time series with 150K variables, and finds meaningful functional connectivity patterns.
We summarize our main contributions in this paper as follows:

We introduce a new temporal covariance estimation method that has linear time and memory complexity with respect to the number of variables—which allows us to analyze systems that are several orders of magnitude larger than what was possible with previous methods.

We conduct extensive experiments on both synthetic and realworld datasets and demonstrate that TCorEx outperforms existing stateoftheart methods in highly undersampled regimes.

To illustrate the practical value of TCorEx, we apply it to two realworld datasets, financial returns and highresolution fMRI, and show that it is capable of extracting meaningful and qualitatively novel patterns.
2. Problem Definition
To derive our method, we formulate the temporal covariance estimation problem for a sequence of multivariate Gaussian observations. We are given a sequence of observations , where is a collection of i.i.d. samples generated from a dimensional normal distribution with zero mean and covariance matrix . The goal is to estimate the unknown covariance matrices for each . Note that can be 0 and can be different for different time steps. In particular, when , then is a regular multivariate time series.
Without further assumptions, the problem is equivalent to having independent static covariance estimation tasks. Usually covariance matrices of different time steps are related, and efficient temporal covariance estimation methods should exploit that relation. However, the relation can be quite complicated, making it hard to model without strong assumptions. To avoid making such assumptions, we only assume that on average, covariance matrices of neighboring time steps are close to each other with respect to some metric. Fig. 1 shows an example of a regularly sampled multivariate time series where the underlying covariance matrix exhibits different dynamics in different periods, but our assumption is met.
3. Methods
Any static covariance estimation method can be used for temporal covariance estimation by applying it in sliding window fashion. Namely, we divide the period into periods of size and apply on each period.^{1}^{1}1There are some subtleties such as deciding whether to overlap periods, or what to do with remaining time steps when does not divide . There is no universal choice for , and different methods work best with different values of . Methods that work well with small values of are usually better choices for temporal covariance estimation. However, these static estimators do not exploit the temporal aspect of the problem, and in general, produce worse estimates (see Section 4).
One way to build better temporal covariance estimation methods is to apply a static covariance estimation method in sliding window fashion while also adding a regularization term that enforces temporal consistency of estimates (Hallac et al., 2017; Tomasi et al., 2018). We use a similar approach to extend linear CorEx (Steeg and Galstyan, 2017) for temporal covariance estimation. There are several reasons why we choose to base our method on linear CorEx. First, linear CorEx has been shown to have low sample complexity. This makes it possible to select small window sizes, which allows the method to detect shortterm variations. Second, the complexity of linear CorEx scales linearly with allowing us to apply it to extremely highdimensional data (where is greater than ). Third, the objective of linear CorEx is differentiable, which allows us to use any variant of gradient descent to train it.
3.1. Linear CorEx
To proceed further we need some details of linear CorEx and one definition.
Definition 3.1 ().
A jointly Gaussian distribution with observed variables and hidden variables is called a nonoverlapping Gaussian latent factor (NGLF) model if the joint distribution factorizes in the following way:
where .
The factorization of an NGLF model defines a directed probabilistic graphical model shown in Fig. 2. In other words, the NGLF model corresponds to the product of onefactor Gaussian models.
Linear CorEx is a latent factor model designed to uncover the structure of a multivariate Gaussian variable. It is a variant of the more general CorEx method (Ver Steeg and Galstyan, 2014). For given dimensional Gaussian random variable , the algorithm finds an dimensional Gaussian random variable , such that the joint system is close to a NGLF model. The authors show that a jointly Gaussian distribution is an NGLF model if and only if and . This makes it possible to learn a NGLF model by minimizing under the constraint . Linear CorEx attempts to solve the following optimization problem:
(1) 
Note that if , then the joint distribution is Gaussian. Assuming , after some modification of the original optimization problem (1), the algorithm solves the following problem (the details can be found in (Steeg and Galstyan, 2017)):
(2) 
where is the conditional mean of given under the constraint that , and is calculated in the following way:
The objective of problem (2) upper bounds the objective of problem (1), ignoring constants. Furthermore, that upper bound is tight when . After inferring the parameters , the covariance matrix of is estimated with the following formula:
(3) 
Even though linear CorEx is designed to recover the structure of NGLF models, experiments show that it can be successfully applied to nonGaussian cases. One reason for this is that many realworld datasets can be approximated by a Gaussian distribution. Another reason is that linear CorEx doesn’t find a model of data that corresponds exactly to the probabilistic graphical model shown in Fig. 2 (which is a fairly restricted model), but rather learns a model close to it.
3.2. TimeVarying Linear CorEx
Hereafter, we assume that the data is already split into periods. We can represent the data as a sequence of collections of observations, , where . As mentioned earlier, we train one linear CorEx for each time period and use regularization techniques to enforce temporal consistency of estimates (i.e., enforcing adjacent time periods to have similar covariance matrix estimates). As a first attempt towards building timevarying linear CorEx, we try to solve the following optimization problem:
(4) 
where is the linear CorEx objective with parameters applied to data , and is the penalty function, which in this work is either or vector norm. The former is suitable for systems with sudden changes, while the latter is better suited for smoothly varying systems. We name the method of Eq. (4) TCorExsimple.
While TCorExsimple follows the general framework of building a temporal covariance estimation method from a static covariance estimation method, it is not powerful enough to get significantly better performance compared to linear CorEx applied in sliding window fashion without a regularization term (see Table 1). This happens because of some properties of linear CorEx. Namely, in linear CorEx, is a function of and (see Eq. 3). Therefore, even when the regularization coefficient is infinity, making , still in general. Therefore, the model can never produce very close estimates of covariance matrices. If we choose to explicitly penalize the difference, instead of the difference , in general, there will be no that will make the new regularization term equal to zero.
To make the regularization effective, we note that depends on through , which in turn depends on and , where the expectations are taken over samples of using parameters . When contains a small number of samples, these quantities can be quite noisy, resulting in noisy estimates . To reduce the noise without increasing the window size, we propose a simple remedy. To estimate and for time period , we use not only the samples of , but also samples of other time periods . Of course, samples belonging to time periods far from are less important than the samples belonging to time periods close to . Since both and are expectations over the sample distribution, we change sample weights/probabilities to capture the intuition above. We introduce a new hyperparameter that defines the rate of decay of “importance” as we go far from the current time period . Samples of time period have weight when they are used to estimate and for time period . Below shows how we estimate for time period :
We estimate for each time period similarly. For computational efficiency, at time period we do not consider time periods for which . Summing up, we get the following optimization problem:
(5) 
where is the linear CorEx objective with parameters applied on data , where samples of have weight . We name the method of Eq. (5) TCorEx.
Note that along the way we extended linear CorEx to work in cases where samples have weights assigned to them. This extended method can be helpful when different samples have different importance, reliability, or relevance.
3.3. Implementation
The authors of linear CorEx derive a quasiNewton optimization algorithm to train their model. Unfortunately, we could not use a similar derivation for TCorEx. Instead, we use gradient descent to optimize the objective of TCorEx. More specifically, in all our experiments we use the Adam optimizer (Kingma and Ba, 2014) with and . We noticed that optimizing the objective function of linear CorEx with gradient descent requires 24 times more iterations than with quasiNewton, but it eventually converges to similar solutions in terms of objective value. We initialize the weights of TCorEx with the weights of a linear CorEx trained on all samples of all time periods. Since training of TCorEx involves many matrix multiplications, GPUs can be used for speeding up the training. In fact, when , a single GPU can reduce the training time up to 10 times. TCorEx is implemented in PyTorch (Paszke et al., 2017). The code can be found on GitHub.^{2}^{2}2https://github.com/harhro94/TCorEx
3.4. Scalability
The time complexity of the linear CorEx at time period is , where is the number of observed variables, is the number of hidden variables, is the number of samples used for estimating and for time period . Since we don’t consider time periods , for which , we get that . Therefore, the time complexity of TCorEx is , where is the total number of samples. Explicitly computing covariance matrix using Eq. 3 has complexity. However, one can avoid computing it explicitly, as the covariance estimates of TCorEx are diagonal plus lowrank matrices. Multiplying such matrices with other matrices can be made faster using their lowrank plus diagonal decomposition. One can compute the determinant of such matrices in time using the generalization of the matrix determinant lemma. Furthermore, the inverse of such matrices has diagonal plus lowrank form and can be computed in time using the Woodbury matrix identity. The memory complexity of TCorEx is . We want to emphasize that the proposed method has linear time and memory complexity w.r.t. the number of variables , assuming , and don’t depend on . To our best knowledge, TCorEx is the only temporal covariance estimation method that scales linearly with . Timevarying graphical lasso (TVGL) (Hallac et al., 2017) and latent variable timevarying graphical lasso (LTGL) (Tomasi et al., 2018), which are the direct competitors of our method, have time complexity. Fig. 3 shows the scalability comparison of different temporal covariance estimation methods.
4. Experiments
We compare TCorEx with other static and temporal covariance estimation methods on both synthetic and realword datasets. The simplest baseline is the diagonal matrix with the sample variances of corresponding variables. Other static baselines are the LedoitWolf (LW) method (Ledoit and Wolf, 2004), oracle approximating shrinkage (OAS) method (Chen et al., 2010), factor analysis (FA), sparse PCA (Zou et al., 2006; Mairal et al., 2009), linear CorEx (Steeg and Galstyan, 2017), graphical lasso (GLASSO) (Friedman et al., 2008), and latent variable graphical lasso (LVGLASSO) (Chandrasekaran et al., 2012). The temporal baselines are timevarying graphical lasso (TVGL) (Hallac et al., 2017) and latent variable timevarying graphical lasso (LTGL) (Tomasi et al., 2018). In the experiments with synthetic data, we add two additional baselines: TCorExnoreg (TCorEx with ) and TCorExsimple (TCorEx with ) to show the importance of the regularization term and the importance of introducing sample weights respectively.
In all quantitative experiments we split the data into train, validation and test sets. We use the validation set to select hyperparameters of the baselines and we report the final scores on the test set. The metric we use in our experiments is the negative loglikelihood of estimates averaged over time periods. The grid of hyperparameters for each baseline can be found in the supplementary material.
4.1. Synthetic Data
Method  Sudden Change  Smooth Change  

Ground Truth  196.0  196.0  196.0  196.0  196.0  230.2  230.2  230.2  230.2  230.2 
Diagonal  279.6  260.6  256.8  255.6  254.9  277.4  268.3  263.6  261.5  261.4 
LW  286.2  266.7  252.8  239.9  227.5  287.3  278.9  270.1  262.7  254.9 
OAS  281.5  266.3  252.7  239.9  227.5  282.9  277.9  269.8  262.7  254.8 
FA    524.4  236.9  208.5  201.3    630.5  267.5  242.3  235.6 
Sparse PCA  270.5  232.9  212.8  205.3  200.7  274.7  260.1  247.4  238.8  235.4 
Linear CorEx  312.3  221.5  204.6  199.7  197.7  333.0  267.8  243.0  235.5  233.0 
GLASSO  266.5  242.0  221.3  212.3  205.5  280.1  262.5  249.2  241.9  238.3 
LVGLASSO  271.6  245.5  235.5  217.6  210.2  276.6  267.1  254.7  248.5  240.7 
TVGL  237.5  224.5  213.4  207.6  203.7  259.0  251.9  244.0  239.4  236.7 
LTGL  248.6  230.0  218.7  209.6  204.7  265.0  256.2  247.1  241.8  238.7 
TCorEx  228.0  213.8  205.3  199.6  197.7  250.6  243.3  237.6  234.5  232.7 
TCorExsimple  275.8  217.9  204.7  199.7  197.7  294.5  261.3  242.5  235.8  233.1 
TCorExnoreg  245.3  228.7  207.5  199.6  197.8  259.2  252.5  241.6  235.4  232.9 
We design experiments with synthetic data to compare our method with other baselines in the case when the modeling assumptions of TCorEx are met (i.e., the data of each period is generated from an NGLF model). We generate synthetic data for two scenarios. In the first scenario, the underlying covariance matrix is constant for the first half of the time periods. Then a sudden change happens after which the underlying covariance matrix remains constant for the remaining half of time periods. We call this scenario sudden change. In the second scenario, the underlying covariance matrix is slowly changing from to . We call this scenario smooth change. The full details of the data generation process for these two scenarios are presented in the supplementary material.
Sudden Change
To create data with a sudden change, we generate two different NGLF models— and . The data of the first five periods is generated from , while the data of the next five periods is generated from . We generate training, 16 validation and 1000 testing samples for each period. The left multicolumn of Table 1 shows the comparison of baselines on this type of data for and varying values of .
Smooth Change
To create a synthetic dataset with smooth change, we generate two NGLF models, and . We start from and smoothly change the model into , so that for each time period the joint distribution remains NGLF. Let denote the NGLF model at time period . We generate training, 16 validation and 1000 testing samples from . The right multicolumn of Table 1 shows the comparison of baselines on this type of data for and varying values of .
The results of sudden change and smooth change experiments show that the proposed method gives stateoftheart results when NGLF modeling assumptions are met. The advantage of TCorEx is significant when the number of samples is small. Comparing TCorExsimple with linear CorEx, we see that indeed, the regularization term of TCorExsimple is not enough to get the best performance. Furthermore, the results show that introducing sample weights improves the results significantly. Comparing TCorEx with TCorExnoreg, we conclude that even after introducing sample weights, having a regularization term that induces temporal consistency improves the results.
4.2. Stock Market Data
Next, we evaluate our method on a stock prices dataset, where the modeling assumptions are not met (i.e. the data of each time step does not correspond to an NGLF model). We take the daily sampled historical data of stock prices of S&P 500 index from January, 2000 to January, 2016.^{3}^{3}3The data is downloaded using the API of tradingeconomics.com. For simplicity, we keep the stocks that are present for at least 95% of the period. The resulting dataset contains daily prices for 391 stocks. We compute the daily logreturns, standardize each variable, and add isotropic Gaussian noise with variance.
Method  

Diagonal  585.6 85.3  467.0 106.7  407.9 50.0  367.9 14.0 
LW  424.9 40.1  384.7 100.3  372.4 54.6  465.6 21.9 
OAS  405.5 34.2  376.8 85.5  403.1 51.9  556.8 30.3 
FA      900.3 189.0  355.7 15.2 
Sparse PCA  507.7 71.5  382.9 66.7  301.1 31.2  266.5 10.8 
Linear CorEx  458.1 47.6  330.1 31.9  279.7 14.6  262.2 8.5 
GLASSO  473.3 65.5  400.3 54.8  304.8 33.6  272.4 18.0 
LVGLASSO  430.6 23.6  372.7 59.0  289.3 22.1  282.8 9.7 
TVGL  335.6 24.8  298.3 46.0  260.5 32.6  243.3 6.9 
LTGL  329.3 18.1  308.5 49.1  252.0 27.9  231.8 9.8 
TCorEx  313.5 16.4  269.5 16.1  246.1 12.9  244.0 4.8 
As all baselines used in this work need the time series to be divided into periods, we break the time series into nonoverlapping periods, each containing samples. We randomly partition samples of each period into train, validation and test sets containing samples respectively. After breaking time series into periods, we consider only the last 10 periods. Table 2 shows the results for averaged over 10 random train/validation/test splits. Again, the metric is timeaveraged negative loglikelihood of the estimates under multivariate Gaussian distribution. We justify our choice of the metric with the fact that all baselines besides the LedoitWolf method assume the data has Gaussian distribution.
For better interpretation of results, we suggest looking at the distribution of test scores for those 10 different runs (Fig. 4). First of all, many baselines have high standard deviations, but the figure shows that it is usually caused by some runs that produce very bad estimates. The results suggest that TCorEx is doing significantly better when the number of samples is small (). For , we see that TVGL, LTGL and TCorEx perform similarly, even though TCorEx has slightly better mean because it has fewer outliers with very poor performance. At , LTGL starts to produce better estimates. Notably, TCorEx always has the smallest standard deviation. We hypothesize that the assumption of TCorEx of data distribution being approximately NGLF is more restrictive than the sparsity assumption of TVGL and LTGL. Therefore, the variance of TCorEx is relatively smaller. On the other hand, the assumptions of TCorEx do not introduce large biases. In fact, we see that TCorEx has the smallest bias when the number of samples is small, and has comparable bias otherwise. Interestingly, as we increase , LW and OAS become worse. This happens because these methods start to shrink the sample covariance matrix less as the number of samples increases.
Qualitative Analysis
With quantitative analysis we showed that the estimates of TCorEx are much more accurate than those of linear CorEx applied in sliding window fashion. Next, we want to see what are the qualitative differences of those better estimates for stock market data. For this purpose we plot the entries of inverse correlation matrix that have absolute value greater than some threshold.^{4}^{4}4In our experiments the threshold is set to 0.025. This can be interpreted as plotting the adjacency matrix of a Markov random field. First, we train a linear CorEx on the whole period, ignoring the temporal aspect (the left part of Fig. 5). This shows how the system looks like “on average.” We see that most of the edges are within sectors. Then we train the proposed model with window size equal to two weeks and plot its estimate for a random time period (the middle part of Fig. 5). First, We see that the sectors are not as densely connected. Second, TCorEx captures some dependencies that are not present for the whole period. The opposite is also true—two variables can be directly connected on average, but be disconnected for some period of time (e.g., some connections between sectors “Materials” and “Energy”). The advantage of TCorEx lies in the ability of detecting correlations that occur only for a short period of time. Methods requiring large number of samples (large window size) cannot detect such correlations. To finalize our analysis, we fit linear CorEx on samples around the same random time period . When is too small, the estimates are too noisy. When is too large the estimates are less related to the true statistics of time period . The right part of Fig. 5 shows the estimate of linear CorEx for the best value of . The estimate of linear CorEx is qualitatively different from that of TCorEx and gives a significantly worse score, indicating the inherent problems of applying a static covariance estimation method in sliding window fashion.
Change Point Detection
Estimated covariance/precision matrices can be used to detect the shifts in the underlying system (Hallac et al., 2017; Tomasi et al., 2018). One simple way to detect changes in the underlying structure is to look at the Frobenius norm of the difference of inverse correlation matrices of neighboring time periods. Fig. 5 shows that from 2000 to 2016, TCorEx detects all but one of the major US stock market events. Interestingly, for some events major changes are visible up to two months earlier. However, the algorithm cannot be used for early detection because both the regularization term and the temporal smoothing exploit data of future time steps. This is also true for TVGL and LTGL methods. Fig. 5 also shows the same analysis for sliding window linear CorEx with optimal window size. The peaks of linear CorEx align much worse with the ground truth events. Furthermore, the estimates of linear CorEx are temporally less consistent. As expected, linear CorEx does not detect changes that occur in very short periods, such as the changes related to the September 11 attacks and the 2010 flash crash.
4.3. HighResolution fMRI Data
To demonstrate the scalability and usefulness of the proposed method, we apply it to highdimensional functional magnetic resonance images (fMRI). The most common measurement in fMRI is blood oxygen leveldependent (BOLD) contrast, which measures blood flow changes in biological tissues (“activation”). In a typical fMRI session, which lasts 310 minutes, hundreds of highresolution brain images are captured, each having 100K600K volumetric pixels (voxels). Correlation analysis is widely used to study functional connections between brain regions (Preti et al., 2017). While in general these analyses are conducted assuming static covariance, recently timevarying covariance (“dynamic functional connectivity”) has received more attention (Chang and Glover, 2010; Lurie et al., 2018). This latter case is exactly the use case of TCorEx.
We demonstrate the feasibility of TCorEx in fMRI analysis by inducing an artificial change point in an otherwise stable time series. While the induced shift is clearly synthetic, this experiment shows possible value for the fMRI community in detecting natural change points and/or using TCorEx for more nuanced domainspecific analyses, demonstrating that TCorEx can scale to the 100K+ variable regime. First, we fit TCorEx on a restingstate fMRI session^{6}^{6}6We use the processed restingstate fMRI of session 014 (dataset version: 1.0.4). We then do spatial smoothing using a Gaussian filter with fwhm=8mm. from the MyConnectome project (Poldrack et al., 2015). The session has 518 timepoint volumes, each having 148262 voxels. We divide the whole period into 20 nonoverlapping periods, ignoring the first 18 timepoints.
The blue curve in Fig. 7 shows the Frobenius norm of differences of inverse correlation matrices of neighboring time periods.^{7}^{7}7Note, although the correlation matrices are extremely large, we are able to compute the inverses and norms of differences, since estimates of TCorEx are diagonal plus lowrank matrices. We see relatively large variability in the beginning and in the end of the session. Next, we consider the middle 12 periods (i.e., removing 4 periods from both sides). We swap the first 6 and the last 6 periods of those periods, creating an artificial change in the middle, and retrain TCorEx on the resulting data. The orange plot of Fig. 7 shows the Frobenius norm of differences of inverse correlation matrices of neighboring time periods for this case. TCorEx detects the shift we created, while other methods do not scale to this regime.
TCorEx can also provide secondary analyses by grouping correlated regions. We assign each voxel to the latent factor that has the largest mutual information with it, forming groups by each factor. This produces clusters of voxels for each time period. Fig. 8 shows three clusters from time period 12, identified using NeuroSynth (Yarkoni et al., 2011). Clusters displayed in (a) and (b) correspond to standard anatomic regions (Brodmann, 1909), namely both left and right Brodmann area 40, as well as left Brodmann areas 44/45 (Broca’s Area), which is known to be a highly lateralized (asymmetric) region. Cluster (c) displays distributed activation in the left hemisphere. We found that cluster (a) is present in all time periods; (b) is present starting at time period 3. These two clusters exhibit small variations over time. Cluster (c) is present starting at time period 11, but varies more compared to the other two clusters.
5. Related Work
Many works have addressed the problem of highdimensional covariance estimation (Fan et al., 2016). One major direction of estimating highdimensional covariance matrices is introducing sparsity constraints. Sparsity constraints are often used to reduce the effective number of parameters of the model, to regularize the model or to incorporate our beliefs. Sparse covariance matrix estimation for multivariate Gaussian random variables is investigated in (Bien and Tibshirani, 2011). Most often sparsity constraints are imposed on the inverse covariance matrix (precision matrix). The precision matrix encodes the conditional independences between pairs of variables. Learning a sparse precision matrix corresponds to learning a sparse graphical model (Lauritzen, 1996; Banerjee et al., 2008). The graphical lasso method (Friedman et al., 2008) does sparse inverse covariance matrix estimation for multivariate Gaussian random variables. The problem of network inference using graphical lasso is well studied (Friedman et al., 2008; Yuan and Lin, 2007; Ravikumar et al., 2011; Meinshausen and BÃ¼hlmann, 2006).
In many realworld applications, there are latent factors influencing the system. Modeling those factors usually leads to better estimates of the covariance matrix of the observed variables. Factor analysis and probabilistic PCA (Tipping and Bishop, 1999) are latent factor models that can be used for covariance estimation. However, they fail in undersampled regimes. Sparse PCA (Zou et al., 2006; Mairal et al., 2009) remedies this problem. Covariance estimation can also be done by learning graphical models with latent factors (Choi et al., 2011, 2010; Chandrasekaran et al., 2012). The latent variable graphical lasso method (Chandrasekaran et al., 2012) learns a sparse graphical model with latent factors. Linear CorEx (Steeg and Galstyan, 2017) is another latent factor model and is central to this work. Some details of linear CorEx are provided in Subsection 3.1.
Many works extended a particular covariance estimation method for temporal covariance estimation. Sparse PCA has been adapted for highdimensional multivariate vector autoregressive time series (Wang et al., 2013). The timevarying graphical lasso method (Hallac et al., 2017) extends graphical lasso. It breaks down the time series into periods and applies graphical lasso on each period, while also adding a regularization term for inducing temporal consistency of estimates. In a similar fashion, latent variable timevarying graphical lasso (Tomasi et al., 2018) extends latent variable graphical lasso. TCorEx takes the same approach, but has a couple of differences. Instead of forcing sparsity, TCorEx attempts to learn models closer to NGLF. Additionally, unlike TCorEx, none of the mentioned methods use weighted samples for estimating their parameters.
The time and memory complexity of a method are crucial in highdimensional settings. All nonCorEx temporal covariance estimation methods listed here have at least quadratic time complexity with respect to the number of observed variables, . In fact, most of them have time complexity as they require computing the inverse and/or SVD of a matrix. Most methods store matrices explicitly. All these methods become inapplicable for systems having more than variables. There are successful attempts to build faster static covariance/precision matrix estimation methods (Hsieh et al., 2014; Hsieh et al., 2013; Zhang et al., 2018; Honorio and Jaakkola, 2013). To our knowledge, our proposed method is the only temporal covariance estimation method that has linear time and memory complexity with respect to .
6. Conclusion and Future Work
We developed a novel temporal covariance estimation method called TCorEx. The proposed method has linear time and memory complexity with respect to the number of observed variables and makes minimal assumptions about the dynamics of the system. We evaluated our method on both synthetic and realworld datasets, showing stateoftheart results in highly undersampled regimes. We also studied the range of possible applications of TCorEx.
In future research we aim to simplify the hyperparameter selection of TCorEx. An interesting direction of research is inferring the sample weights automatically. A successful implementation of this will increase the modeling capacity of TCorEx. Another interesting extension of the model is adapting it to work in the online regime. In fact, exploiting samples of future time steps does not allow us to use the method for early detection and forecasting.
Acknowledgements.
The authors thank Federico Tomasi for his valuable help on training latent variable timevarying graphical lasso, and Neal Lawton for insightful discussions. This research is based upon work supported in part by DARPA, via W911NF1610575, and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 201616041100004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. Additionally, H. Harutyunyan is supported by an Annenberg fellowship.References
 (1)
 Ahmed and Xing (2009) Amr Ahmed and Eric P Xing. 2009. Recovering timevarying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences 106, 29 (2009), 11878–11883.
 Banerjee et al. (2008) Onureena Banerjee, Laurent El Ghaoui, and Alexandre d’Aspremont. 2008. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. J. Mach. Learn. Res. 9 (June 2008), 485–516.
 Bien and Tibshirani (2011) Jacob Bien and Robert J. Tibshirani. 2011. Sparse estimation of a covariance matrix. Biometrika 98, 4 (2011), 807–820.
 Brodmann (1909) Korbinian Brodmann. 1909. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. Barth.
 Chandrasekaran et al. (2012) Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky. 2012. Latent variable graphical model selection via convex optimization. Ann. Statist. 40, 4 (08 2012), 1935–1967.
 Chang and Glover (2010) Catie Chang and Gary H. Glover. 2010. Time–frequency dynamics of restingstate brain connectivity measured with fMRI. NeuroImage 50, 1 (2010), 81 – 98.
 Chen et al. (2010) Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero. 2010. Shrinkage Algorithms for MMSE Covariance Estimation. IEEE Transactions on Signal Processing 58, 10 (Oct 2010), 5016–5029.
 Choi et al. (2010) Myung Jin Choi, Venkat Chandrasekaran, and Alan S Willsky. 2010. Gaussian multiresolution models: Exploiting sparse Markov and covariance structure. IEEE Transactions on Signal Processing 58, 3 (2010), 1012–1024.
 Choi et al. (2011) Myung Jin Choi, Vincent YF Tan, Animashree Anandkumar, and Alan S Willsky. 2011. Learning latent tree graphical models. Journal of Machine Learning Research 12, May (2011), 1771–1812.
 De Nard et al. (2018) Gianluca De Nard, Olivier Ledoit, Michael Wolf, et al. 2018. Factor models for portfolio selection in large dimensions: the good, the better and the ugly. Technical Report. Department of EconomicsUniversity of Zurich.
 Engle et al. (2017) Robert F. Engle, Olivier Ledoit, and Michael Wolf. 2017. Large Dynamic Covariance Matrices. Journal of Business & Economic Statistics 0, 0 (2017), 1–13.
 Fan et al. (2016) Jianqing Fan, Yuan Liao, and Han Liu. 2016. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19, 1 (2016), C1–C32.
 Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 3 (2008), 432–441.
 Hallac et al. (2017) David Hallac, Youngsuk Park, Stephen Boyd, and Jure Leskovec. 2017. Network Inference via the TimeVarying Graphical Lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 205–213.
 Honorio and Jaakkola (2013) Jean Honorio and Tommi Jaakkola. 2013. Inverse Covariance Estimation for Highdimensional Data in Linear Time and Space: Spectral Methods for Riccati and Sparse Models. In Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence (UAI’13). AUAI Press, Arlington, Virginia, United States, 291–300.
 Hsieh et al. (2014) ChoJui Hsieh, Mátyás A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar. 2014. QUIC: Quadratic Approximation for Sparse Inverse Covariance Estimation. Journal of Machine Learning Research 15 (2014), 2911–2947.
 Hsieh et al. (2013) ChoJui Hsieh, Mátyás A Sustik, Inderjit S Dhillon, Pradeep K Ravikumar, and Russell Poldrack. 2013. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems. 3165–3173.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Lauritzen (1996) Steffen L Lauritzen. 1996. Graphical models. Vol. 17. Clarendon Press.
 Ledoit and Wolf (2004) Olivier Ledoit and Michael Wolf. 2004. A wellconditioned estimator for largedimensional covariance matrices. Journal of Multivariate Analysis 88, 2 (2004), 365 – 411.
 Lurie et al. (2018) Daniel J Lurie, Daniel Kessler, Danielle S Bassett, Richard F Betzel, Michael Breakspear, Shella Keilholz, Aaron Kucyi, RaphaÃ«l LiÃ©geois, Martin A Lindquist, Anthony R McIntosh, and et al. 2018. On the nature of resting fMRI and timevarying functional connectivity. https://doi.org/10.31234/osf.io/xtzre
 Mairal et al. (2009) Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. 2009. Online Dictionary Learning for Sparse Coding. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09). ACM, New York, NY, USA, 689–696.
 Meinshausen and BÃ¼hlmann (2006) Nicolai Meinshausen and Peter BÃ¼hlmann. 2006. HighDimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics 34, 3 (2006), 1436–1462.
 Monti et al. (2014) Ricardo Pio Monti, Peter Hellyer, David Sharp, Robert Leech, Christoforos Anagnostopoulos, and Giovanni Montana. 2014. Estimating timevarying brain connectivity networks from functional MRI time series. NeuroImage 103 (2014), 427 – 443.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPSW.
 Poldrack et al. (2015) Russell A Poldrack, Timothy O Laumann, Oluwasanmi Koyejo, Brenda Gregory, Ashleigh Hover, MeiYen Chen, Krzysztof J Gorgolewski, Jeffrey Luci, Sung Jun Joo, Ryan L Boyd, et al. 2015. Longterm neural and physiological phenotyping of a single human. Nature communications 6 (2015), 8885.
 Preti et al. (2017) Maria Giulia Preti, Thomas AW Bolton, and Dimitri Van De Ville. 2017. The dynamic functional connectome: Stateoftheart and perspectives. NeuroImage 160 (2017), 41 – 54. Functional Architecture of the Brain.
 Ravikumar et al. (2011) Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu. 2011. Highdimensional covariance estimation by minimizing penalized logdeterminant divergence. Electron. J. Statist. 5 (2011), 935–980.
 Steeg and Galstyan (2017) Greg Ver Steeg and Aram Galstyan. 2017. Low complexity gaussian latent factor models and a blessing of dimensionality. arXiv preprint arXiv:1706.03353 (2017).
 Tipping and Bishop (1999) Michael E Tipping and Christopher M Bishop. 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 3 (1999), 611–622.
 Tomasi et al. (2018) Federico Tomasi, Veronica Tozzo, Saverio Salzo, and Alessandro Verri. 2018. Latent Variable Timevarying Network Inference. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, New York, NY, USA, 2338–2346.
 Ver Steeg and Galstyan (2014) Greg Ver Steeg and Aram Galstyan. 2014. Discovering Structure in HighDimensional Data Through Correlation Explanation. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 577–585.
 Wang et al. (2013) Zhaoran Wang, Fang Han, and Han Liu. 2013. Sparse Principal Component Analysis for High Dimensional Multivariate Time Series. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Carlos M. Carvalho and Pradeep Ravikumar (Eds.), Vol. 31. PMLR, Scottsdale, Arizona, USA, 48–56.
 Yarkoni et al. (2011) Tal Yarkoni, Russell A Poldrack, Thomas E Nichols, David C Van Essen, and Tor D Wager. 2011. Largescale automated synthesis of human functional neuroimaging data. Nature methods 8, 8 (2011), 665.
 Yuan and Lin (2007) Ming Yuan and Yi Lin. 2007. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika 94, 1 (2007), 19–35.
 Zhang et al. (2018) Richard Zhang, Salar Fattahi, and Somayeh Sojoudi. 2018. LargeScale Sparse Inverse Covariance Estimation via Thresholding and MaxDet Matrix Completion. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, StockholmsmÃ¤ssan, Stockholm Sweden, 5766–5775.
 Zou et al. (2006) Hui Zou, Trevor Hastie, and Robert Tibshirani. 2006. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics 15, 2 (2006), 265–286.
Supplementary Material: Efficient Covariance Estimation from Temporal Data
Appendix A Details of Generating NGLF Data
To describe a NGLF model, it is enough to specify 6 quantities: , , , – the parent of , and – the correlation of and its parent. Note the moments of don’t affect the marginal distribution of , so we set and . We also set . Summing up, we need only and . In our experiments, we have , . To define the correlations , for each we first sample the signal to noise ratio (snr) of uniformly from [0, 5] and then set the correlation between and , , with . This way we control the average signal to noise ratio similar to the experiments done in (Steeg and Galstyan, 2017). An example of covariance matrix corresponding to an NGLF model is shown in Fig. 9.
As stated in the main text, for creating a synthetic dataset with smooth change, we generate 2 NGLF models, and . Let the former be characterized by , and the latter be characterized by . We start from and smoothly change the model into , so that for each time period the joint distribution remains NGLF. We define the parameters of in the following way:
with . To define , we randomly select when each variable will change its parent from to . Formally, we sample and set if , otherwise we set .
Appendix B Implementation Details
In this section we discuss the implementations of the baselines. We use scikitlearn implementations of LW, OAS, FA, and Sparse PCA methods. We use the original implementations of linear CorEx, TVGL, and LTGL. For GLASSO, we tried the scikitlearn implementation, the QUIC method (http://www.cs.utexas.edu/~sustik/QUIC/), and TVGL with . In our experiment, the TVGL implementation was always better. Therefore, we selected the latter implementation. For LVGLASSO, we used the implementation availabile in the REGAIN repository (https://github.com/fdtomasi/regain), which also contains the original implementation of LTGL.
Appendix C Hyperparameter Grids
Method  Hyperparameters  

Sparse PCA 


Linear CorEx 


GLASSO 


LVGLASSO 


TVGL 


LTGL 


TCorEx 

In this section we describe the hyperparameter grids we used in our quantitative experiments. Tables 3, 4, and 5 describe the grids of hyperparameters we tried in the sudden change, smooth change, and stock market experiments respectively. Note that the excluded baselines (beside TCorExsimple and TCorExnoreg) have no hyperparameters. In all our experiments we made sure that increasing the “max_iter” parameter does not improve the results of any baseline. Also, we made sure that the best values of hyperparameters are not the corner values. In the sudden and smooth change experiments we know the ground truth value of (the number of latent factors). Therefore, we do not search the best value of for FA, sparse PCA, linear CorEx, and TCorEx baselines. This way we compare the modeling performances of our baselines, rather than how they depend on the choices of their hyperparameters. In TCorEx, “l1” and “l2” hyperparameters are mutually exclusive, meaning that if one is set to a nonzero value, the other one will be zero. In LVGLASSO and LTGL, we set , where is the number of training samples in each period. We noticed that these methods are not sensitive to this hyperparameter. We found this choice in their code repository.
Method  Hyperparameters  

Sparse PCA 


Linear CorEx 


GLASSO 


LVGLASSO 


TVGL 


LTGL 


TCorEx 

Method  Hyperparameters  

FA 


Sparse PCA 


Linear CorEx 


GLASSO 


LVGLASSO 


TVGL 


LTGL 


TCorEx 
