On the Sample Complexity of Graphical Model Selection for Non-Stationary Processes

# On the Sample Complexity of Graphical Model Selection for Non-Stationary Processes

## Abstract

We formulate and analyze a graphical model selection method for inferring the conditional independence graph of a high-dimensional non-stationary Gaussian random process (time series) from a finite-length observation. The observed process samples are modeled as uncorrelated over time but having different covariance matrices. We characterize the sample complexity of graphical model selection for such processes by analyzing a variant of sparse neighborhood regression. Our results indicate that, similar to the case of i.i.d. samples, accurate GMS is possible even in the high-dimensional regime if the underlying conditional independence graph is sufficiently sparse.

{IEEEkeywords}

sparsity, graphical model selection, neighborhood regression, high-dimensional statistics.

## 1 Introduction

Consider a complex system which is represented by a large number of random variables . For the ease of exposition we model those random variables as zero mean jointly Gaussian. We are interested in inferring the conditional independence graph (CIG) of these random variables based on observing samples , for , which are uncorrelated but not identically distributed. The learning method shall cope with the high-dimensional regime, where the system size is (much) larger than the sample size , i.e., [1, 2, 3, 4, 5, 6, 7]. This problem is relevant, e.g., in the analysis of medical diagnostic data (EEG) [4], climatology [8] and genetics [9].

### Contribution

Most existing approaches to graphical model selection (GMS) model the observed data either as i.i.d. or as samples of a stationary random process [6, 3, 10, 11]. By contrast, we model the observed data as an uncorrelated non-stationary process having covariance which varies with sample index . Our main conceptual contribution is the formulation of a sparse neighborhood regression GMS method for high-dimensional non-stationary processes. By analyzing this method, we derive upper bounds on the required sample size such that accurate GMS is possible. In particular, our analysis reveals that the crucial parameter determining the required sample size is the minimum average partial correlation between the process components. If this quantity is not too small, accurate GMS is feasible even in the high-dimensional regime where .

### Outline

The remainder of this paper is organized as follows. In Section 2 we formalize the considered process model and the notion of a CIG. Section 3 presents a GMS method based on neighborhood regression along with an upper bound on the sample size such that GMS is accurate. The detailed derivation of this bound is in Section 4.

### Notation

The maximum (minimum) of two numbers and is denoted (). The set of non-negative real (integer) numbers is denoted (). Given a -dimensional process of length , we denote its th scalar component process as . Given a vector , we denote its euclidean and -norm by and . The minimum and maximum eigenvalues of a positive semidefinite (psd) matrix are denoted and , respectively. Given a matrix , we denote its transpose, spectral norm and Frobenius norm by , and , respectively. It will be handy to define, for a given finite sequence of matrices , the block diagonal matrix with th diagonal block given by . The identity matrix of size is .

## 2 Problem Formulation

We model the observed data samples , for , as zero-mean Gaussian random vectors, which are uncorrelated, i.e., for . Thus, the probability distribution of the observed samples is fully specified by the covariance matrices . We assume the process is suitably scaled such that . By contrast to the widely used i.i.d. assumption (where ), we allow the covariance matrix to vary with sample index . However, we impose a smoothness constraint on the dynamics of the covariance. In particular, we assume the covariance matrix being constant over evenly spaced length- blocks of consecutive vector samples. Thus, our sample process model can be summarized as

 x[0],…,x[L−1]i.i.d.∼N(0,C{b=0}),x[L],…,x[2L−1]i.i.d.∼N(0,C{b=1}),… (1)

For ease of exposition and without essential loss of generality, we henceforth assume the sample size to be an integer multiple of the block length , i.e.,

 N=BL\vspace∗−2mm (2)

with denoting the number of data blocks.

The model (1) accommodates the case where the observed samples form a stationary process (cf. [12, 11, 13, 10]) in the following way: Consider a Gaussian zero-mean stationary process with auto-covariance function and spectral density matrix [14]. Let denote the discrete Fourier transform (DFT) of the stationary process . Then, by well-known properties of the DFT [15], the vectors for are approximately uncorrelated Gaussian random vectors with zero mean and covariance matrix . Moreover, if the effective correlation width of the process is small, i.e., , the SDM is nearly constant over a frequency interval of length . Thus, the DFT vectors approximately conform to process model (1) with block length .

The process model (1) is also useful for the important class of non-stationary processes which are underspread [16, 17, 18, 19]. A continuous-time random process is underspread if its expected ambiguity function (EAF) is well concentrated around the origin in the plane. In particular, if the EAF of is supported on the rectangle , then the process is underspread if . One of the most striking properties of an underspread process is that its Wigner-Ville spectrum (which can be loosely interpreted as a time-varying power spectral density) is approximately constant over a rectangle of area . Moreover, it can be shown that for a suitably chosen prototype function (e.g., a Gaussian pulse) and grid constants ,, the Weyl-Heisenberg set [20], yields zero-mean expansion coefficients which are approximately uncorrelated. The covariance matrix of the vector is approximately given by . Thus, the vectors approximately conform to process model (1), with block length .

We now define the CIG of a -dimensional Gaussian process conforming to the model (1) as an undirected simple graph with nodes . Node represents the process component . An edge is absent between nodes and , i.e., , if the corresponding process components and are conditionally independent, given the remaining components .

Since we model the process as Gaussian (cf. (1)), the conditional independence among the individual process components can be read off conveniently from the inverse covariance (precision) matrices . In particular, are are conditionally independent, given , if and only if for all [15, Prop. 1.6.6.]. Thus, we have the following characterization of the CIG associated with the sample process in (1):

 {i,j}∉E if and only if Ki,j[n]=0 for all n. (3)

We highlight the coupling in the CIG characterization (3): An edge is absent, i.e., , only if the precision matrix entry is zero for all .

We will also need a measure for the strength of a connection between process components and for . To this end, we define the average partial correlation between and as

 ρi,j :=(1/N)N−1∑n=0K2i,j[n]/(Ki,i[n]Kj,j[n]) \lx@stackrel(???)=(1/B)B−1∑b=0K2i,j{b}/(Ki,i{b}Kj,j{b}). (4)

By (3) and (4), if and only if .

Accurate estimation of the CIG for finite sample size (incurring unavoidable sampling noise) is only possible for sufficiently large partial correlations for .

###### Assumption 1.

There is a constant such that

 ρi,j≥ρmin for any {i,j}∈E.\vspace∗−2mm (5)

Moreover, we also assume the CIG underlying to be sparse in the sense of having small maximum degree.

###### Assumption 2.

For some ,

 |N(i)|≤s , for any node i∈V.\vspace∗−4mm (6)

## 3 Sparse Neighborhood Regression

The CIG of the process in (1) is fully specified by the neighborhoods , i.e., once we have found all neighborhoods, we can reconstruct the full CIG. In what follows, we focus on the sub-problem of learning the neighborhood of an arbitrary but fixed node . We denote the size of its neighborhood by .

In view of (1), let us denote for each block the th process component as

 xi{b}:=(xi[bL],…,xi[(b+1)L−1])T∈RL.\vspace∗−2mm

According to Lemma 4.3,

 xi{b}=∑j∈N(i)ajxj{b}+εi{b},\vspace∗−2mm (7)

with the “error term”

 εi{b}∼N(0,(1/Ki,i{b})IL). (8)

Moreover, for an index set ,

 xi{b}=∑j∈Tajxj{b}+~xi{b}+εi{b},\vspace∗−2mm (9)

with component being uncorrelated with and . The variance with (cf. Lemma 4.3).

In view of the decompositions (7) and (9), it is reasonable to estimate via a least-squares search:

 ˆN(i) :=argmin|T|≤s(1/N)B−1∑b=0∥P⊥T{b}xi{b}∥22+λ|T|, (10)

with

 λ:=ρmin/2. (11)

For a subset , the matrix represents the orthogonal projection on . The complementary orthogonal projection is

 P⊥T{b}:=IL−PT{b}. (12)

We can interpret (10) as performing sparse neighborhood regression, since we aim at approximating the th component in a sparse manner (by allowing at most active components) using the remaining process components.

For the analysis of the estimator (10) we require a bound on the eigenvalues of the covariance matrices .

###### Assumption 3.

For known , for all .

Our main analytical result is an upper bound on the probability of sparse neighborhood regression (10) failing to deliver the true neighborhood. We denote this event by

 Ei:={N(i)≠ˆN(i)}. (13)
###### Theorem 3.1.

Consider observed samples conforming to process model (1) such that Asspt. 1, 2 and 3 are valid. Then, if the minimum average partial correlation satisfies

 ρmin≥8β/(L−2s) (14)

and moreover

 Missing or unrecognized delimiter for \big (15)

the probability of (10) to fail is bounded as

 P{Ei}≤η. (16)

## 4 Proof of the Main Result

We now provide a detailed proof of Theorem 3.1 by analyzing the probability of the event (cf. (13)) when (10) fails to deliver the true neighborhood . Our argument draws heavily on the techniques used in [21].

It will be convenient to introduce the test statistic

 Z(S):=(1/N)B−1∑b=0∥P⊥S{b}xi{b}∥22. (17)

For an arbitrary but fixed set with , we define

 ET={Z(N(i))+λsi>Z(T)+λ|T|}. (18)

We will derive an upper bound on which depends on only via and . The total number of such subsets is

 N(ℓ1,ℓ2):=(siℓ1)(p−siℓ2).\vspace∗−1mm

Let . By a union bound, Since , we have

 log{P{Ei}}≤max(ℓ1,ℓ2)∈[si,s]logs2N(ℓ1,ℓ2)+logM(ℓ1,ℓ2). (19)

Let us now detail the derivation of the above mentioned upper bound on the probability . To this end, in order to make (7) more handy, we stack the vectors into , with

 Cεi=blkdiag{(1/Ki,i{0})IL,…,(1/Ki,i{B−1})IL}.

We also need the projection matrix (cf. (12)). Rather trivially (cf. (18)),

 ET={Z(N(i)) −(1/N)∥P⊥Tεi∥22 (20) >Z(T)−(1/N)∥P⊥Tεi∥22+λ(|T|−si)}.

Let us now define, for some number whose precise value to be chosen later, the two events

 E1(δ) :={∣∣Z(N(i))−(1/N)∥P⊥Tεi∥22∣∣≥δ}, (21a) E2(δ) :={Z(T)−(1/N)∥P⊥Tεi∥22+λ(|T|−si)≤2δ}. (21b)

In view of (20), the event can only occur if at least one of the events or occurs. Therefore, by a union bound,

 P{ET}≤P{E1(δ)}+P{E2(δ)}. (22)

In order to control the event , observe

 Z(N( i))\lx@stackrel(???)=(1/N)B−1∑b=0∥P⊥N(i){b}xi{b}∥22 \lx@stackrel(???)=(1/N)B−1∑b=0∥∥∥P⊥N(i){b}(∑i∈N(i)ajxj{b}+εi{b})∥∥∥22 =(1/N)∥P⊥N(i)εi∥22. (23)

Hence,

 P{E1(δ)} \lx@stackrel(???)=E{P{∣∣Z(N(i))−(1/N)∥P⊥Tεi∥22∣∣≥δ∣xT}} Missing or unrecognized delimiter for \big \lx@stackrel(???)=E{P{(1/N)∣∣wTiPwi∣∣≥δ∣xT}} (24)

with and Gaussian random vector . By similar arguments as used in [21], one can verify

 ∥P∥2≤β and rank{P}≤B(ℓ1+ℓ2) (25)

implying, in turn,

 ∥P∥2F≤rank{P}∥P∥22\lx@stackrel(???)≤B(ℓ1+ℓ2)β2. (26)

Inserting (25), (26) into (59) of Lemma 4.2 yields

 P{E1(δ)}≤2exp(−N2δ28((ℓ1+ℓ2)Bβ2+Nδβ)). (27)

Thus, whenever

 N≥B(ℓ1+ℓ2)β/δ,\vspace∗−1mm (28)

we have

 P{E1(δ)}≤2exp(−Nδ16β).\vspace∗−1mm (29)

Let us now upper bound the probability of (cf. (21b)). To this end, in view of the decomposition (9), note

 E2(δ)⊆E3(δ)∪E4(δ)\vspace∗−2mm (30)

with the events

 E3(δ):={(1/N)∥P⊥T~xi∥22+λ(|T|−si)≤3δ} (31)

and

 E4(δ):={(2/N)|~xTiP⊥Tεi|≥δ}. (32)

By union bound, (30) implies

 P{E2(δ)}≤P{E3(δ)}+P{E4(δ)} (33)

such that can be upper bounded by separately bounding and . Let us define

 m3 :=E{(1/N)∥P⊥T~xi∥22}+λ(|T|−si), (34)

with the random vector . According to Corollary 4.4,

 ~xi{b}∼N(0,~σ2bIL) with ~σ2b≥ℓ1ρmin/β. (35)

Therefore,

 m3 \lx@stackrel(???)=(1/N)Tr{C1/2~xiP⊥TC1/2~xi}+λ(|T|−si) \lx@stackrel(???)≥(B(L−ℓ1)/N)ℓ1ρmin+λ(|T|−si) \lx@stackrel(???)≥(B(L−ℓ21/¯ℓ)/N)¯ℓρmin, (36)

with

 ¯ℓ:=ℓ1+(|T|−si)/2=(ℓ1+ℓ2)/2≥(ℓ1∨ℓ2)/2. (37)

Let us now choose

 δ:=m3/4. (38)

Observe

 P{E3(δ)} \lx@stackrel(???)=E{P{(1/N)∥P⊥T~xi∥22≤3δ∣xT}} \lx@stackrel(???)≤E{P{|(1/N)∥P⊥T~xi∥22−m3|≥δ∣xT}}. (39)

Applying Lemma 4.2 to (4) gets us to

 P{E3(δ)}≤2exp(−(N/β)δ28(m3+δ))\lx@stackrel(???)=2exp(−Nm3β160). (40)

In order to control (cf. (32)), note

 Extra open brace or missing close brace (41)

with and the symmetric matrix

 Q=2N⎛⎝C1/2~xi00C1/2εi⎞⎠(0P⊥T(P⊥T)T0)⎛⎝C1/2~xi00C1/2εi⎞⎠.

Applying Lemma 4.2 to (41) gets us to

 P{E4(δ)} ≤2exp(−N(δ/2)28(βm3+β(δ/2))) Missing or unrecognized delimiter for \bigg (42)

For the choice (cf. (38)), the condition (14) implies validity of (28). Therefore, (29) is in force and can be reformulated using (38) as

 P{E1(δ)}≤2exp(−Nm364β). (43)

Combining (43), (4), (40) with (33), (22) and (36) yields

 P{ET} ≤6exp(−ρmin¯ℓB(L−ℓ21/¯ℓ)/(576β2)) \lx@stackrel(???)≤6exp(−(1/2)ρmin(ℓ1∨ℓ2)B(L−2ℓ1)/(576β2)). (44)

We finalize the proof of Theorem 3.1, by using the RHS of (4) as in (19). Thus, holds if

 max(ℓ1,ℓ2)∈[si,s]log6s2N(ℓ1,ℓ2)η−ρmin(ℓ1∨ℓ2)B(L−2ℓ1)2⋅576β2≤0. (45)

The validity of (45), in turn, is guaranteed if

 B(L−2ℓ1)≥2⋅576β2ρmin(ℓ1∨ℓ2)(logs2N(ℓ1,ℓ2)+log(6/η)), (46)

for all . Since (cf. Asspt. 2),

 s2N(ℓ1,ℓ2)≤(p−si(ℓ1∨ℓ2))4\lx@stackrel(a)≤[peℓ1∨ℓ2]4(ℓ1∨ℓ2) (47)

where is due to [21]. Combining (46) and (47), we finally obtain (15) of Theorem 3.1.

## Appendix

In order to make this paper somewhat self-contained, we list here some well-known facts about Gaussian random vectors, which are instrumental for the derivation in Section 4.

###### Lemma 4.1.

For i.i.d. standard normal variables , consider the weighted sum of squares

 y=N∑i=1aiz2i\vspace∗−2mm (48)

with weight vector . We have

 P{|y−E{y}|≥δ}≤2exp(−δ2/8∥a∥22+∥a∥∞δ). (49)
###### Proof.

Consider some , such that

 E{exp(λaiz2i)}=1/√1−2λai, (50)

and hence,

 logE{exp(λai(z2i−1))} =(−1/2)log(1−2λai)−λai ≤(−1/2)log(1−2λ|ai|)−λ|ai|. (51)

Applying the elementary inequality

 −log(1−u)≤u+u22(1−u),0≤u≤1\vspace∗−2mm (52)

to the RHS of (Appendix) yields

 logE{exp(λai(z2i−1))}≤λ2a2i1−2λ|ai|≤2λ2a2i.\vspace∗−2mm (53)

Summing (53) over yields

 logE{exp(λ(y−E{y}))}≤2λ2∥a∥22. (54)

Now, consider the tail bound

 P{y−E{y}≥δ} ≤exp(−λδ)E{exp(λ(y−E{y}))} (55) \lx@stackrel(???)≤exp(−λδ+2λ2∥a∥22).

Minimizing the RHS of (55) over ,

 P{y−E{y}≥δ} ≤exp(−(δ2/8)((1/∥a∥22)∧(1/(δ∥a∥∞)))) \lx@stackrel(a)≤exp(−δ2/8∥a∥22+∥a∥∞δ), (56)

where is due to for . Similar to (55), one can also verify

 P{y−E{y}≤−δ}≤exp(−δ2/8∥a∥22+∥a∥∞δ). (57)

Adding (Appendix) and (57) yields (49) by union bound. ∎

###### Lemma 4.2.

Consider a Gaussian random vector and a symmetric matrix , i.e., . Then, the quadratic form

 y=zTQz\vspace∗−2mm (58)

satisfies

 P{|y−E{y}|≥δ}≤2exp(−δ28(∥Q∥2F+∥Q∥2δ))\vspace∗−2mm (59)

with .

###### Proof.

The spectral decomposition of yields [22]

 Q=N∑i=1λiuiuTi\vspace∗−2mm (60)

with eigenvalues and orthonormal eigenvectors . Inserting (60) into (58) yields

 y=N∑i=1λi(uTiz)2.\vspace∗−2mm (61)

Note that and . Since the random variables are i.i.d. standard normal, we can apply Lemma 4.1 to (61) yielding (59). ∎

###### Lemma 4.3.

Consider a Gaussian random vector with precision matrix . For an arbitrary index set , we have

 xi=