On the Sample Complexity of Graphical Model Selection for Non-Stationary Processes
We formulate and analyze a graphical model selection (GMS) method for inferring the conditional independence graph (CIG) of a high-dimensional non-stationary Gaussian random process (time series) from a finite-length observation. The observed process samples are modeled as uncorrelated over time but having different covariance matrices. We characterize the sample complexity of GMS for such processes by analyzing a variant of sparse neighborhood regression. Our results indicate that, similar to the case of i.i.d. samples, accurate GMS is possible even in the high-dimensional regime if the underlying conditional independence graph is sufficiently sparse.
Index Terms: sparsity, graphical model selection, neighborhood regression, high-dimensional statistics.
Consider a complex system which is represented by a large number of random variables $x_1,\dots,x_p$. For ease of exposition we model these random variables as zero-mean jointly Gaussian. We are interested in inferring the conditional independence graph (CIG) of these random variables based on observing samples $\mathbf{x}[n]$, for $n = 1,\dots,N$, which are uncorrelated but not identically distributed. The learning method shall cope with the high-dimensional regime, where the system size $p$ is (much) larger than the sample size $N$, i.e., $p \gg N$ [1, 2, 3, 4, 5, 6, 7]. This problem is relevant, e.g., in the analysis of medical diagnostic data (EEG), climatology, and genetics.
Most existing approaches to graphical model selection (GMS) model the observed data either as i.i.d. or as samples of a stationary random process [6, 3, 10, 11]. By contrast, we model the observed data as an uncorrelated non-stationary process whose covariance matrix $\mathbf{C}[n]$ varies with the sample index $n$. Our main conceptual contribution is the formulation of a sparse neighborhood regression GMS method for high-dimensional non-stationary processes. By analyzing this method, we derive upper bounds on the sample size required for accurate GMS. In particular, our analysis reveals that the crucial parameter determining the required sample size is the minimum average partial correlation between the process components. If this quantity is not too small, accurate GMS is feasible even in the high-dimensional regime where $p \gg N$.
The remainder of this paper is organized as follows. In Section 2 we formalize the considered process model and the notion of a CIG. Section 3 presents a GMS method based on neighborhood regression along with an upper bound on the sample size such that GMS is accurate. The detailed derivation of this bound is in Section 4.
The maximum (minimum) of two numbers $a$ and $b$ is denoted $\max\{a,b\}$ ($\min\{a,b\}$). The set of non-negative real (integer) numbers is denoted $\mathbb{R}_{+}$ ($\mathbb{Z}_{+}$). Given a $p$-dimensional process $\mathbf{x}[n]$ of length $N$, we denote its $i$th scalar component process as $x_i[n]$. Given a vector $\mathbf{v} = (v_1,\dots,v_d)^T$, we denote its Euclidean and $\ell_{\infty}$-norm by $\|\mathbf{v}\|_2$ and $\|\mathbf{v}\|_{\infty}$, respectively. The minimum and maximum eigenvalues of a positive semidefinite (psd) matrix $\mathbf{C}$ are denoted $\lambda_{\min}(\mathbf{C})$ and $\lambda_{\max}(\mathbf{C})$, respectively. Given a matrix $\mathbf{A}$, we denote its transpose, spectral norm and Frobenius norm by $\mathbf{A}^T$, $\|\mathbf{A}\|_2$ and $\|\mathbf{A}\|_F$, respectively. It will be handy to define, for a given finite sequence of matrices $\mathbf{A}_1,\dots,\mathbf{A}_B$, the block diagonal matrix $\mathrm{blkdiag}\{\mathbf{A}_b\}_{b=1}^{B}$ with $b$th diagonal block given by $\mathbf{A}_b$. The identity matrix of size $n \times n$ is $\mathbf{I}_n$.
2 Problem Formulation
We model the observed data samples $\mathbf{x}[n]$, for $n = 1,\dots,N$, as zero-mean Gaussian random vectors, which are uncorrelated, i.e., $\mathbb{E}\{\mathbf{x}[n]\mathbf{x}^T[n']\} = \mathbf{0}$ for $n \neq n'$. Thus, the probability distribution of the observed samples is fully specified by the covariance matrices $\mathbf{C}[n] := \mathbb{E}\{\mathbf{x}[n]\mathbf{x}^T[n]\}$. We assume the process is suitably scaled such that $\lambda_{\max}(\mathbf{C}[n]) \leq 1$. By contrast to the widely used i.i.d. assumption (where $\mathbf{C}[n] = \mathbf{C}$ for all $n$), we allow the covariance matrix to vary with the sample index $n$. However, we impose a smoothness constraint on the dynamics of the covariance. In particular, we assume the covariance matrix to be constant over evenly spaced length-$L$ blocks of consecutive vector samples. Thus, our sample process model can be summarized as

$\mathbf{x}[n] \sim \mathcal{N}(\mathbf{0}, \mathbf{C}[n])$, with $\mathbf{C}[n] = \mathbf{C}_b$ for all $n \in \{(b-1)L+1,\dots,bL\}$. (1)
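As a quick numerical illustration of this model, the following sketch (toy values only; the dimension, block sizes and the particular covariance construction are our own choices, not values from the paper) draws uncorrelated zero-mean Gaussian samples whose covariance matrix is constant over length-$L$ blocks and scaled to have maximum eigenvalue at most one:

```python
import numpy as np

rng = np.random.default_rng(0)
p, L, B = 5, 20, 4        # dimension, block length, number of blocks
N = B * L                 # sample size, an integer multiple of L

# One covariance matrix per block, scaled so that lambda_max(C_b) <= 1.
covs = []
for b in range(B):
    A = rng.normal(size=(p, p))
    C = A @ A.T + p * np.eye(p)        # well-conditioned psd matrix
    C /= np.linalg.eigvalsh(C)[-1]     # enforce lambda_max(C_b) = 1
    covs.append(C)

# Draw uncorrelated zero-mean Gaussian samples x[1], ..., x[N]; sample n
# uses the block covariance C_b with b = n // L.
x = np.vstack([rng.multivariate_normal(np.zeros(p), covs[n // L])
               for n in range(N)])
print(x.shape)   # -> (80, 5)
```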
For ease of exposition and without essential loss of generality, we henceforth assume the sample size $N$ to be an integer multiple of the block length $L$, i.e.,

$N = BL$,

with $B = N/L$ denoting the number of data blocks.
The model (1) accommodates the case where the observed samples form a stationary process (cf. [12, 11, 13, 10]) in the following way: Consider a Gaussian zero-mean stationary process $\{\mathbf{y}[n]\}_{n=1}^{N}$ with auto-covariance function $\mathbf{R}[m]$ and spectral density matrix (SDM) $\mathbf{S}(\theta)$. Let $\widetilde{\mathbf{y}}[k]$ denote the discrete Fourier transform (DFT) of the stationary process $\{\mathbf{y}[n]\}$. Then, by well-known properties of the DFT, the vectors $\widetilde{\mathbf{y}}[k]$ for $k = 0,\dots,N-1$ are approximately uncorrelated Gaussian random vectors with zero mean and covariance matrix $\mathbf{S}(k/N)$. Moreover, if the effective correlation width of the process is small, i.e., $\mathbf{R}[m] \approx \mathbf{0}$ for $|m| \geq N/L$, the SDM is nearly constant over a frequency interval of length $L/N$. Thus, the DFT vectors approximately conform to process model (1) with block length $L$.
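The decorrelation property of the DFT can be probed numerically. The sketch below (a toy illustration with a scalar AR(1) process; the bin indices, trial count and AR coefficient are arbitrary choices) estimates the correlation between two distinct DFT bins over many realizations and checks that it is small compared to the bin variance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, a = 64, 4000, 0.5      # length, realizations, AR(1) coefficient

X = np.empty((trials, N), dtype=complex)
for t in range(trials):
    z = rng.normal(size=N)
    x = np.empty(N)
    x[0] = z[0] / np.sqrt(1 - a**2)        # stationary initialization
    for n in range(1, N):
        x[n] = a * x[n - 1] + z[n]
    X[t] = np.fft.fft(x) / np.sqrt(N)      # unitary-scaled DFT

# Distinct DFT bins are nearly uncorrelated; their variances track the
# spectral density of the AR(1) process at the bin frequencies.
cross = abs((X[:, 3].conj() * X[:, 10]).mean())
var3 = (abs(X[:, 3]) ** 2).mean()
print(cross < 0.2 * var3)   # -> True
```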
The process model (1) is also useful for the important class of non-stationary processes which are underspread [16, 17, 18, 19]. A continuous-time random process $y(t)$ is underspread if its expected ambiguity function (EAF) is well concentrated around the origin in the $(\tau,\nu)$ plane. In particular, if the EAF of $y(t)$ is supported on the rectangle $[-\tau_0,\tau_0] \times [-\nu_0,\nu_0]$, then the process is underspread if $\tau_0 \nu_0 \ll 1$. One of the most striking properties of an underspread process is that its Wigner-Ville spectrum (which can be loosely interpreted as a time-varying power spectral density) is approximately constant over time-frequency rectangles of area on the order of $1/(\tau_0 \nu_0)$. Moreover, it can be shown that for a suitably chosen prototype function $g(t)$ (e.g., a Gaussian pulse) and grid constants $T$, $F$, the Weyl-Heisenberg set $\{g_{k,l}(t) := g(t - kT)e^{j2\pi lFt}\}$ yields zero-mean expansion coefficients which are approximately uncorrelated. The covariance matrix of the vector of expansion coefficients is approximately given by samples of the Wigner-Ville spectrum on the grid $\{(kT, lF)\}$. Thus, the expansion coefficients approximately conform to process model (1), with a block length $L$ determined by the time-frequency region over which the Wigner-Ville spectrum is effectively constant.
We now define the CIG of a $p$-dimensional Gaussian process $\mathbf{x}[n]$ conforming to the model (1) as an undirected simple graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with nodes $\mathcal{V} = \{1,\dots,p\}$. Node $i \in \mathcal{V}$ represents the process component $x_i[n]$. An edge is absent between nodes $i$ and $j$, i.e., $\{i,j\} \notin \mathcal{E}$, if the corresponding process components $x_i[n]$ and $x_j[n]$ are conditionally independent, given the remaining components $\{x_r[n]\}_{r \in \mathcal{V} \setminus \{i,j\}}$.
Since we model the process as Gaussian (cf. (1)), the conditional independence among the individual process components can be read off conveniently from the inverse covariance (precision) matrices $\mathbf{K}[n] := \mathbf{C}^{-1}[n]$. In particular, $x_i[n]$ and $x_j[n]$ are conditionally independent, given $\{x_r[n]\}_{r \in \mathcal{V} \setminus \{i,j\}}$, if and only if $K_{i,j}[n] = 0$ for all $n$ [15, Prop. 1.6.6.]. Thus, we have the following characterization of the CIG associated with the sample process in (1):

$\{i,j\} \notin \mathcal{E}$ if and only if $K_{i,j}[n] = 0$ for all $n \in \{1,\dots,N\}$. (3)
We highlight the coupling in the CIG characterization (3): An edge is absent, i.e., $\{i,j\} \notin \mathcal{E}$, only if the precision matrix entry $K_{i,j}[n]$ is zero for all sample indices $n$ simultaneously.
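The characterization (3) translates directly into a small routine. The following sketch (a hypothetical helper of our own, not from the paper) recovers the edge set from a sequence of block precision matrices by checking whether an entry is non-zero in at least one block:

```python
import numpy as np

def cig_edges(precisions, tol=1e-8):
    """Edge {i,j} is present iff K_b[i,j] != 0 for SOME block b, cf. (3)."""
    p = precisions[0].shape[0]
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            if any(abs(K[i, j]) > tol for K in precisions):
                edges.add((i, j))
    return edges

# Toy example: the edge {0,1} is active only in the second block,
# yet it belongs to the CIG due to the coupling over blocks.
K1 = np.eye(3)
K2 = np.eye(3)
K2[0, 1] = K2[1, 0] = 0.4
print(cig_edges([K1, K2]))   # -> {(0, 1)}
```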
We will also need a measure for the strength of a connection between process components $x_i[n]$ and $x_j[n]$ for $\{i,j\} \in \mathcal{E}$. To this end, we define the average partial correlation between $x_i[n]$ and $x_j[n]$ as

$\rho_{i,j} := \frac{1}{B} \sum_{b=1}^{B} \frac{|K_{b;i,j}|}{\sqrt{K_{b;i,i} K_{b;j,j}}}$,

where $\mathbf{K}_b := \mathbf{C}_b^{-1}$ denotes the precision matrix of the $b$th data block.
Accurate estimation of the CIG for finite sample size $N$ (incurring unavoidable sampling noise) is only possible if the partial correlations $\rho_{i,j}$ are sufficiently large for all edges $\{i,j\} \in \mathcal{E}$.
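A minimal sketch of this quantity, assuming the average partial correlation is the block average of $|K_{b;i,j}|/\sqrt{K_{b;i,i} K_{b;j,j}}$ (the standard magnitude of the Gaussian partial correlation per block; the function name and toy matrices are our own):

```python
import numpy as np

def avg_partial_corr(precisions, i, j):
    # Block-averaged magnitude of the partial correlation between
    # components i and j, computed from the block precision matrices.
    vals = [abs(K[i, j]) / np.sqrt(K[i, i] * K[j, j]) for K in precisions]
    return float(np.mean(vals))

# Two blocks: partial correlation 0 in the first, 0.5 in the second.
K1 = np.array([[2.0, 0.0], [0.0, 2.0]])
K2 = np.array([[2.0, 1.0], [1.0, 2.0]])
print(avg_partial_corr([K1, K2], 0, 1))   # -> 0.25
```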
There is a constant $\rho_{\min} > 0$ such that $\rho_{i,j} \geq \rho_{\min}$ for every edge $\{i,j\} \in \mathcal{E}$.
Moreover, we also assume the CIG underlying $\mathbf{x}[n]$ to be sparse in the sense of having small maximum degree.
For some $s_{\max} \in \mathbb{Z}_{+}$, the maximum node degree of the CIG satisfies $\max_{i \in \mathcal{V}} |\mathcal{N}(i)| \leq s_{\max}$, with $\mathcal{N}(i) := \{j \in \mathcal{V} : \{i,j\} \in \mathcal{E}\}$ denoting the neighborhood of node $i \in \mathcal{V}$.
3 Sparse Neighborhood Regression
The CIG of the process in (1) is fully specified by the neighborhoods $\mathcal{N}(i) := \{j \in \mathcal{V} : \{i,j\} \in \mathcal{E}\}$, i.e., once we have found all neighborhoods, we can reconstruct the full CIG. In what follows, we focus on the sub-problem of learning the neighborhood of an arbitrary but fixed node $i \in \mathcal{V}$. We denote the size of its neighborhood by $s_i := |\mathcal{N}(i)|$.
In view of (1), let us denote, for each block $b \in \{1,\dots,B\}$, the $i$th process component as

$\mathbf{x}_{b,i} := (x_i[(b-1)L+1], \dots, x_i[bL])^T \in \mathbb{R}^{L}$.
According to Lemma 4.3,

$\mathbf{x}_{b,i} = \sum_{j \in \mathcal{N}(i)} a_{b,j} \mathbf{x}_{b,j} + \boldsymbol{\varepsilon}_b$, with coefficients $a_{b,j} = -K_{b;i,j}/K_{b;i,i}$,
with the “error term”

$\boldsymbol{\varepsilon}_b \sim \mathcal{N}(\mathbf{0}, \sigma_b^2 \mathbf{I}_L)$, where $\sigma_b^2 = 1/K_{b;i,i}$, (7)

being uncorrelated with the remaining components $\{\mathbf{x}_{b,j}\}_{j \in \mathcal{V} \setminus \{i\}}$.
Moreover, for an index set $\mathcal{T} \subseteq \mathcal{V} \setminus \{i\}$,

$\mathbf{x}_{b,i} = \sum_{j \in \mathcal{T}} \tilde{a}_{b,j} \mathbf{x}_{b,j} + \tilde{\boldsymbol{\varepsilon}}_b$,
with the error component $\tilde{\boldsymbol{\varepsilon}}_b$ being Gaussian and uncorrelated with $\mathbf{x}_{b,j}$ for all $j \in \mathcal{T}$. The variance satisfies $\tilde{\sigma}_b^2 \geq \sigma_b^2$, with equality if $\mathcal{N}(i) \subseteq \mathcal{T}$ (cf. Lemma 4.3).
For a subset $\mathcal{T} \subseteq \mathcal{V} \setminus \{i\}$, the matrix $\mathbf{P}_{\mathcal{T}}$ represents the orthogonal projection on the subspace spanned by the components $\{\mathbf{x}_{b,j}\}_{j \in \mathcal{T}}$. The complementary orthogonal projection is $\mathbf{P}_{\mathcal{T}}^{\perp} := \mathbf{I}_L - \mathbf{P}_{\mathcal{T}}$. Our neighborhood estimate is obtained via

$\widehat{\mathcal{N}}(i) \in \operatorname{arg\,min}_{\mathcal{T} \subseteq \mathcal{V} \setminus \{i\},\, |\mathcal{T}| \leq s_{\max}} \sum_{b=1}^{B} \|\mathbf{P}_{\mathcal{T}}^{\perp} \mathbf{x}_{b,i}\|_2^2$. (10)
We can interpret (10) as performing sparse neighborhood regression, since we aim at approximating the $i$th component in a sparse manner (by allowing at most $s_{\max}$ active components) using the remaining process components.
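For small $p$, the estimator (10) can be implemented by exhaustive search over candidate subsets. The following sketch (an illustrative implementation; the variable names, the least-squares realization of the projection residual, and the synthetic data are our own choices) recovers a known neighborhood from simulated blocks:

```python
import itertools
import numpy as np

def neighborhood_estimate(blocks, i, s_max):
    """Minimize the residual energy sum_b ||P_T^perp x_{b,i}||^2 over all
    index sets T with 1 <= |T| <= s_max, cf. (10).

    blocks: list of (L, p) arrays, one array per data block.
    """
    p = blocks[0].shape[1]
    others = [j for j in range(p) if j != i]
    best, best_val = set(), np.inf
    for s in range(1, s_max + 1):
        for T in itertools.combinations(others, s):
            val = 0.0
            for Xb in blocks:
                A, y = Xb[:, list(T)], Xb[:, i]
                # residual of the least-squares fit equals P_T^perp y
                res = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
                val += res @ res
            if val < best_val:
                best, best_val = set(T), val
    return best

# Synthetic data: x_0 = 0.8 x_1 + 0.8 x_2 + small noise, in every block.
rng = np.random.default_rng(5)
blocks = []
for b in range(2):
    Xb = rng.normal(size=(50, 5))
    Xb[:, 0] = 0.8 * Xb[:, 1] + 0.8 * Xb[:, 2] + 0.1 * rng.normal(size=50)
    blocks.append(Xb)
print(neighborhood_estimate(blocks, 0, s_max=2))   # -> {1, 2}
```

The exhaustive search has combinatorial complexity in $p$ and $s_{\max}$; it serves only to make the definition (10) concrete.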
For the analysis of the estimator (10) we require a bound on the eigenvalues of the covariance matrices $\mathbf{C}_b$.
For a known constant $\underline{\lambda} > 0$, $\lambda_{\min}(\mathbf{C}_b) \geq \underline{\lambda}$ for all $b \in \{1,\dots,B\}$.
Our main analytical result is an upper bound on the probability of sparse neighborhood regression (10) failing to deliver the true neighborhood. We denote this event by

$\mathcal{E}_{\mathrm{err}} := \{\widehat{\mathcal{N}}(i) \neq \mathcal{N}(i)\}$. (13)
4 Proof of the Main Result
We now provide a detailed proof of Theorem 3.1 by analyzing the probability of the event $\mathcal{E}_{\mathrm{err}}$ (cf. (13)) that (10) fails to deliver the true neighborhood $\mathcal{N}(i)$. Our argument draws heavily on existing techniques for the analysis of sparse regression methods.
It will be convenient to introduce the test statistic

$Z(\mathcal{T}) := \sum_{b=1}^{B} \|\mathbf{P}_{\mathcal{T}}^{\perp} \mathbf{x}_{b,i}\|_2^2$,

so that the estimator (10) amounts to minimizing $Z(\mathcal{T})$ over all subsets $\mathcal{T}$ of size at most $s_{\max}$.
For an arbitrary but fixed set $\mathcal{T} \subseteq \mathcal{V} \setminus \{i\}$ with $|\mathcal{T}| \leq s_{\max}$ and $\mathcal{T} \neq \mathcal{N}(i)$, we define

$P_{\mathcal{T}} := \Pr\{Z(\mathcal{T}) \leq Z(\mathcal{N}(i))\}$.
We will derive an upper bound on $P_{\mathcal{T}}$ which depends on $\mathcal{T}$ only via the sizes $|\mathcal{T}|$ and $|\mathcal{N}(i) \setminus \mathcal{T}|$. The total number of such subsets is at most $\sum_{s'=0}^{s_{\max}} \binom{p-1}{s'} \leq (s_{\max}+1)\, p^{s_{\max}}$.
The error event $\mathcal{E}_{\mathrm{err}}$ occurs only if $Z(\mathcal{T}) \leq Z(\mathcal{N}(i))$ for at least one admissible subset $\mathcal{T} \neq \mathcal{N}(i)$. By a union bound, since there are at most $(s_{\max}+1)\, p^{s_{\max}}$ such subsets, we have

$\Pr\{\mathcal{E}_{\mathrm{err}}\} \leq \sum_{\mathcal{T} \neq \mathcal{N}(i),\, |\mathcal{T}| \leq s_{\max}} P_{\mathcal{T}}$.
Let us now detail the derivation of the above mentioned upper bound on the probability $P_{\mathcal{T}}$. To this end, in order to make (7) more handy, we stack the block-wise vectors $\mathbf{x}_{b,i}$ into $\mathbf{x}_i := (\mathbf{x}_{1,i}^T,\dots,\mathbf{x}_{B,i}^T)^T \in \mathbb{R}^{N}$, with

$\mathbf{x}_i = \mathbf{s} + \boldsymbol{\varepsilon}$, (20)

where $\mathbf{s}$ stacks the block-wise signal parts $\sum_{j \in \mathcal{N}(i)} a_{b,j}\mathbf{x}_{b,j}$ and $\boldsymbol{\varepsilon} := (\boldsymbol{\varepsilon}_1^T,\dots,\boldsymbol{\varepsilon}_B^T)^T$ stacks the error terms in (7).
Let us now define, for some number $\eta > 0$ whose precise value will be chosen later, the two events

$\mathcal{E}_1 := \{Z(\mathcal{T}) \leq \eta\}$ and $\mathcal{E}_2 := \{Z(\mathcal{N}(i)) \geq \eta\}$.
In view of (20), the event $\{Z(\mathcal{T}) \leq Z(\mathcal{N}(i))\}$ can only occur if at least one of the events $\mathcal{E}_1$ or $\mathcal{E}_2$ occurs. Therefore, by a union bound,

$P_{\mathcal{T}} \leq \Pr\{\mathcal{E}_1\} + \Pr\{\mathcal{E}_2\}$.
In order to control the event $\mathcal{E}_1$, observe

$Z(\mathcal{T}) = \|\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \mathbf{x}_i\|_2^2$,

with $\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} := \mathrm{blkdiag}\{\mathbf{P}_{\mathcal{T}}^{\perp}\}_{b=1}^{B}$ and Gaussian random vector $\mathbf{x}_i = (\mathbf{x}_{1,i}^T,\dots,\mathbf{x}_{B,i}^T)^T$. By similar arguments as used in the sparse regression literature, one can verify

$\sqrt{Z(\mathcal{T})} \geq \|\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \mathbf{u}\|_2 - \|\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \boldsymbol{\varepsilon}\|_2$,

where $\mathbf{u} := (\mathbf{u}_1^T,\dots,\mathbf{u}_B^T)^T$, with $\mathbf{u}_b := \sum_{j \in \mathcal{N}(i) \setminus \mathcal{T}} a_{b,j}\mathbf{x}_{b,j}$, denotes the part of $\mathbf{x}_i$ not explained by the components in $\mathcal{T}$,
implying, in turn,

$\mathcal{E}_1 \subseteq \mathcal{E}_a \cup \mathcal{E}_b$, (30)

with the events

$\mathcal{E}_a := \{\|\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \boldsymbol{\varepsilon}\|_2^2 \geq \eta/4\}$ and $\mathcal{E}_b := \{\|\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \mathbf{u}\|_2^2 \leq 4\eta\}$. (32)
By a union bound, (30) implies

$\Pr\{\mathcal{E}_1\} \leq \Pr\{\mathcal{E}_a\} + \Pr\{\mathcal{E}_b\}$,
such that $\Pr\{\mathcal{E}_1\}$ can be upper bounded by separately bounding $\Pr\{\mathcal{E}_a\}$ and $\Pr\{\mathcal{E}_b\}$. Let us define

$V := \|\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \boldsymbol{\varepsilon}\|_2^2$,

with the random vector $\boldsymbol{\varepsilon} := (\boldsymbol{\varepsilon}_1^T,\dots,\boldsymbol{\varepsilon}_B^T)^T$. According to Corollary 4.4, $V$ has the same distribution as a weighted sum of squares of i.i.d. standard normal variables, so that its tail probability, and thereby $\Pr\{\mathcal{E}_a\}$, can be controlled via the bound (55).
Let us now choose the threshold $\eta$ such that the resulting bounds on $\Pr\{\mathcal{E}_a\}$ and $\Pr\{\mathcal{E}_b\}$ decay at the same exponential rate.
In order to control $\Pr\{\mathcal{E}_b\}$ (cf. (32)), note

$\|\widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \mathbf{u}\|_2^2 = \mathbf{z}^T \mathbf{M} \mathbf{z}$,

with a standard normal random vector $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_N)$ and the symmetric matrix

$\mathbf{M} := \mathbf{C}_{\mathbf{u}}^{1/2} \widetilde{\mathbf{P}}_{\mathcal{T}}^{\perp} \mathbf{C}_{\mathbf{u}}^{1/2}$,

where $\mathbf{C}_{\mathbf{u}}^{1/2}$ denotes the symmetric square root of the covariance matrix of $\mathbf{u}$.
Appendix
In order to make this paper somewhat self-contained, we list here some well-known facts about Gaussian random vectors, which are instrumental for the derivation in Section 4.
For i.i.d. standard normal variables $z_1,\dots,z_n \sim \mathcal{N}(0,1)$, consider the weighted sum of squares

$Q := \sum_{r=1}^{n} w_r z_r^2$,

with weight vector $\mathbf{w} = (w_1,\dots,w_n)^T \in \mathbb{R}_{+}^{n}$. We have

$\mathbb{E}\{e^{\theta Q}\} = \prod_{r=1}^{n} (1 - 2\theta w_r)^{-1/2}$, for any $\theta < 1/(2\|\mathbf{w}\|_{\infty})$.
Consider some $\theta \in [0, 1/(4\|\mathbf{w}\|_{\infty})]$, such that

$2\theta w_r \leq 1/2$ for all $r \in \{1,\dots,n\}$.

Applying the elementary inequality

$-\log(1-u) \leq u + u^2$, valid for $u \in [0, 1/2]$,

to the RHS of the moment generating function above yields

$-\tfrac{1}{2}\log(1 - 2\theta w_r) \leq \theta w_r + 2\theta^2 w_r^2$. (53)

Summing (53) over $r = 1,\dots,n$ yields

$\log \mathbb{E}\{e^{\theta Q}\} \leq \theta \|\mathbf{w}\|_1 + 2\theta^2 \|\mathbf{w}\|_2^2$.
Now, consider the tail bound obtained via the Chernoff method:

$\Pr\{Q \geq \|\mathbf{w}\|_1 + t\} \leq \exp(-\theta t + 2\theta^2 \|\mathbf{w}\|_2^2)$, for any $\theta \in [0, 1/(4\|\mathbf{w}\|_{\infty})]$. (55)

Minimizing the RHS of (55) over $\theta$,

$\Pr\{Q \geq \|\mathbf{w}\|_1 + t\} \leq \exp\big(-\tfrac{1}{8}\min\{t^2/\|\mathbf{w}\|_2^2,\; t/\|\mathbf{w}\|_{\infty}\}\big)$,

where the case distinction is due to the constraint $\theta \leq 1/(4\|\mathbf{w}\|_{\infty})$, which becomes active for $t \geq \|\mathbf{w}\|_2^2/\|\mathbf{w}\|_{\infty}$. Similar to (55), one can also verify

$\Pr\{Q \leq \|\mathbf{w}\|_1 - t\} \leq \exp\big(-t^2/(8\|\mathbf{w}\|_2^2)\big)$.
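The upper tail bound can be probed numerically. The sketch below (a Monte Carlo check with an arbitrary weight vector and threshold of our own choosing) confirms that the empirical tail frequency lies below the analytical bound:

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.array([1.0, 0.5, 0.25, 0.25])          # arbitrary weight vector
trials, t = 200_000, 3.0

Z = rng.normal(size=(trials, len(w)))
Q = (Z ** 2) @ w                               # weighted sums of squares

# Analytical bound: exp(-(1/8) min{t^2/||w||_2^2, t/||w||_inf})
bound = np.exp(-0.125 * min(t**2 / (w @ w), t / w.max()))
emp = (Q >= w.sum() + t).mean()                # empirical tail frequency
print(emp <= bound)   # -> True
```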
Consider a Gaussian random vector $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$ and a symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$, i.e., $\mathbf{M} = \mathbf{M}^T$. Then, the quadratic form $\mathbf{z}^T \mathbf{M} \mathbf{z}$ has the same distribution as the weighted sum of squares $\sum_{r=1}^{n} \lambda_r(\mathbf{M})\, z_r^2$, where $\lambda_1(\mathbf{M}),\dots,\lambda_n(\mathbf{M})$ denote the eigenvalues of $\mathbf{M}$.
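This fact rests on the eigendecomposition $\mathbf{M} = \mathbf{U}\,\mathrm{diag}(\lambda)\,\mathbf{U}^T$ with orthogonal $\mathbf{U}$: since $\mathbf{U}^T\mathbf{z}$ is again standard normal, $\mathbf{z}^T\mathbf{M}\mathbf{z} = \sum_r \lambda_r (\mathbf{U}^T\mathbf{z})_r^2$. A short numerical check of the underlying algebraic identity (toy dimension and seed of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.normal(size=(n, n))
M = (M + M.T) / 2                      # symmetric matrix
lam, U = np.linalg.eigh(M)             # M = U diag(lam) U^T

z = rng.normal(size=n)
u = U.T @ z                            # orthogonal transform of z
# The quadratic form equals the eigenvalue-weighted sum of squares.
print(np.isclose(z @ M @ z, np.sum(lam * u ** 2)))   # -> True
```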
Consider a Gaussian random vector $\mathbf{x} = (x_1,\dots,x_p)^T \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$ with precision matrix $\mathbf{K} = \mathbf{C}^{-1}$. For an arbitrary index set $\mathcal{T} \subseteq \{1,\dots,p\} \setminus \{i\}$, we have the decomposition $x_i = \sum_{j \in \mathcal{T}} a_j x_j + \varepsilon$, where the error $\varepsilon$ is zero-mean Gaussian and uncorrelated with $\{x_j\}_{j \in \mathcal{T}}$. For $\mathcal{T} = \{1,\dots,p\} \setminus \{i\}$, the coefficients and error variance are given explicitly by $a_j = -K_{i,j}/K_{i,i}$ and $\mathbb{E}\{\varepsilon^2\} = 1/K_{i,i}$.
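For the full conditioning set, the decomposition can be verified numerically: since $\varepsilon = (\mathbf{e}_i - \mathbf{a})^T \mathbf{x}$, the cross-covariance $\mathrm{Cov}(\varepsilon, \mathbf{x}) = \mathbf{C}(\mathbf{e}_i - \mathbf{a})$ must vanish on all entries $j \neq i$. A sketch (random covariance and seed of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
p, i = 4, 0
A = rng.normal(size=(p, p))
C = A @ A.T + p * np.eye(p)            # covariance matrix
K = np.linalg.inv(C)                   # precision matrix

# Regression of x_i on all remaining components:
#   a_j = -K[i, j] / K[i, i], error variance 1 / K[i, i]
a = -K[i] / K[i, i]
a[i] = 0.0

# eps = x_i - a^T x = (e_i - a)^T x, so Cov(eps, x) = C (e_i - a).
e = np.zeros(p)
e[i] = 1.0
cov_eps_x = C @ (e - a)
print(np.allclose(np.delete(cov_eps_x, i), 0.0))   # -> True
print(np.isclose(cov_eps_x[i], 1.0 / K[i, i]))     # -> True
```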