Graph Sampling for Covariance Estimation
Abstract
In this paper the focus is on subsampling as well as reconstructing the second-order statistics of signals residing on nodes of arbitrary undirected graphs. Second-order stationary graph signals may be obtained by graph filtering zero-mean white noise and they admit a well-defined power spectrum whose shape is determined by the frequency response of the graph filter. Estimating the graph power spectrum forms an important component of stationary graph signal processing and related inference tasks such as Wiener prediction or inpainting on graphs. The central result of this paper is that by sampling a significantly smaller subset of vertices and using simple least squares, we can reconstruct the second-order statistics of the graph signal from the subsampled observations, and more importantly, without any spectral priors. To this end, both a nonparametric approach as well as parametric approaches including moving average and autoregressive models for the graph power spectrum are considered. The results specialize for undirected circulant graphs in that the graph nodes leading to the best compression rates are given by the so-called minimal sparse rulers. A near-optimal greedy algorithm is developed to design the subsampling scheme for the nonparametric and the moving average models, whereas a particular subsampling scheme that allows linear estimation for the autoregressive model is proposed. Numerical experiments on synthetic as well as real datasets related to climatology and processing handwritten digits are provided to demonstrate the developed theory.
Graph signal processing, stationary graph signals, sparse sampling, graph power spectrum estimation, compressive covariance sensing.
I Introduction
Graphs are mathematical objects that can be used for describing and explaining relationships in complex datasets, which appear commonly in modern data analysis. The nodes of the graph denote the entities themselves and the edges encode the pairwise relationships between these entities. Some examples of such complex-structured data beyond traditional time series include gene regulatory networks [barabasi2004network], brain networks [bullmore2009complex], transportation networks [guimera2005worldwide], social and economic networks [jackson2010social], and so on. Processing signals residing on the nodes of a graph, while taking into account the relationships explained by the edges of the graph, has recently received a significant amount of interest. In particular, generalizing as well as drawing parallels of classical time-frequency analysis tools to graph data analysis, while incorporating the irregular structure on which the graph signals are defined, is an emerging area of research [shuman2013Emerging, sandryhaila2014big].
Graph signals can be stochastic in nature and can be modeled as the output of a graph filter [sandryhaila2013discrete] whose input is also a random signal (e.g., white noise). We are interested in sampling and processing stationary graph signals, which are stochastic signals defined on graphs whose second-order statistics are invariant, similar to wide-sense stationary time series, but now in the graph setting. Second-order stationary graph signals are characterized by a well-defined graph power spectrum. They can be generated by graph filtering white noise (or any other stationary graph signal), and the graph power spectrum of the filtered signal will be characterized by the squared magnitude of the frequency response of the filter; see [benjamin15eusipco, perraudin2016stationary, marques2016stationary, girault2015signal].
The second-order statistics of graph signals, or equivalently the graph power spectrum, are essential to solve inference problems on graphs in the Bayesian setting, such as smoothing, prediction, inpainting, and deconvolution; see [girault2014semi] and [perraudin2016stationary] for some Bayesian inference problems. These inference problems are solved by designing optimum (in the minimum mean squared error sense) Wiener-like filters, and the graph power spectrum forms a crucial component of such filter designs. In order to compute the graph power spectrum, traditional methods require the processing of signals on all graph nodes. The sheer quantity of data and the scale of the graph often inhibit this approach. Therefore, the main question that we address in this paper is: can we reconstruct the graph power spectrum by observing only a small subset of graph nodes?
I-A Related works and main results
The notion of stationarity of signals on graphs and related definitions can be found in [benjamin15eusipco, perraudin2016stationary, marques2016stationary, girault2015signal], and it will be briefly explained in the next section as well. Several techniques for graph power spectrum estimation have been discussed in [perraudin2016stationary] and [marques2016stationary], and they are based on observations from all the nodes. In this paper, we consider the problem of reconstructing the second-order statistics of signals on graphs, but from subsampled observations. The fact that we are reconstructing the graph power spectrum, instead of the graph signal, enables us to subsample the graph signal (or sparsely sample the graph nodes), even without any spectral priors (e.g., sparsity, bandlimitedness with known support). This is a new and different perspective as compared to subsampling for graph signal reconstruction [anis2014towards, tsitsvero2015uncertainty, varma2015spectrum, chen2015discrete], which imposes some spectral prior that enables graph signal reconstruction. The proposed concept basically generalizes the field of compressive covariance sensing [ariananda2012compressive, romero2013wideband, Romero16CCSspm] to the graph setting.
The aim of this paper is to reconstruct second-order statistics of stationary graph signals from observations available at a few nodes using simple reconstruction methods such as least squares. The contributions are summarized as the following main results:

Nonparametric approach: Without any spectral priors, second-order statistics of length-N stationary graph signals can be recovered using least squares from a reduced subset of \mathcal{O}(\sqrt{N}) observations, i.e., by observing \mathcal{O}(\sqrt{N}) graph nodes. In this case, the processing is done in the graph spectral domain.

Circulant graphs: As a special case, when the graphs are circulant, the identifiability results are elegant. That is, the subset of nodes resulting in the best compression rates is given by the so-called minimal sparse rulers. This is reminiscent of compressive covariance sensing [Romero16CCSspm] for data that reside on a regular support such as time series, which is a specific instance of a circulant graph.

Parametric approach: It is also possible to model the graph power spectrum using a small number of parameters, e.g., the graph signals may be modeled by moving average or autoregressive graph filters. The reconstruction of the second-order statistics of the graph signal then boils down to the estimation of moving average or autoregressive coefficients. Such a parameterization allows for a higher compression. When the graph power spectrum is modeled using a moving average graph filter, the second-order statistics can be recovered using least squares from \mathcal{O}(\sqrt{Q}) observations, where Q=\min\{2L-1,N\} with L being the number of moving average filter coefficients. When the graph power spectrum is modeled using an autoregressive graph filter, P autoregressive filter coefficients can be recovered using linear least squares by observing \mathcal{O}(P) nodes.

Subsampler design: The proposed samplers are deterministic and they perform node subsampling. Subsampler design, therefore, becomes a discrete combinatorial optimization problem. For the spectral domain and moving average case, the subsampler can be designed using a near-optimal greedy algorithm. However, for the autoregressive approach, the sampler design also depends on (unobserved) data, and thus a mean squared error optimal design is not possible. This is due to the fact that we restrict ourselves to a low-complexity linear estimator for the autoregressive filter coefficients. Nevertheless, we present a suboptimal technique to design a subsampler for the autoregressive case as well.
I-B Outline and notation
The remainder of the paper is organized as follows. The preliminary concepts of graph signal processing are discussed in Section II. The proposed least squares based reconstruction of the second-order statistics from the subsampled observations is discussed in Section III. Connections of compressive covariance sensing for time series with sensing data residing on circulant graphs are discussed in Section IV. In Section V, the graph power spectrum is represented with a small number of parameters under moving average and autoregressive models, and these parameters are then reconstructed using least squares from subsampled observations. In Section LABEL:sec:finitedata, we discuss the validity of the results provided in this paper for finite data records. Under the assumption that the data follows a Gaussian distribution, the maximum likelihood estimator and the related Cramér-Rao bound are also derived. In Section LABEL:sec:subsampdesgn, the design of sparse sampling matrices based on low-complexity greedy algorithms is discussed. A few examples to illustrate the proposed framework are provided in Section LABEL:sec:numericalresults. Finally, the paper concludes with Section LABEL:sec:conc.
The notation used in this paper is described as follows. Upper (lower) boldface letters are used for matrices (column vectors). Overbar \bar{(\cdot)} denotes complex conjugation, (\cdot)^{T} denotes the transpose, and (\cdot)^{H} denotes the complex conjugate (Hermitian) transpose. (\cdot)^{-T} is a shorthand notation for \left((\cdot)^{-1}\right)^{T}. \mathrm{diag}[\cdot] refers to a diagonal matrix with its argument on the main diagonal. {\rm diag_{r}}[\cdot] represents a diagonal matrix with the argument on its diagonal, but with the all-zero rows removed. {\boldsymbol{1}} ({\boldsymbol{0}}) denotes the vector of all ones (zeros). {\boldsymbol{I}} is an identity matrix. \mathbb{E}\{\cdot\} denotes the expectation operation. The \ell_{0}-(quasi) norm of {\boldsymbol{w}}=[w_{1},w_{2},\ldots,w_{N}]^{T} refers to the number of nonzero entries in {\boldsymbol{w}}, i.e., \|{\boldsymbol{w}}\|_{0}:=|\{n\,:\,w_{n}\neq 0\}|. The \ell_{1}-norm of {\boldsymbol{w}} is denoted by \|{\boldsymbol{w}}\|_{1}=\sum_{n=1}^{N}|w_{n}|. The notation \thicksim is read as “is distributed according to”. Unless otherwise noted, logarithms are natural. {\rm tr}\{\cdot\} is the matrix trace operator. {\rm det}\{\cdot\} is the matrix determinant. {\rm rank}(\cdot) denotes the rank of a matrix. \lambda_{\rm min}\{{\boldsymbol{A}}\} (\lambda_{\rm max}\{{\boldsymbol{A}}\}) denotes the minimum (maximum) eigenvalue of a symmetric matrix {\boldsymbol{A}}. {\boldsymbol{A}}\succeq{\boldsymbol{B}} means that {\boldsymbol{A}}-{\boldsymbol{B}} is a positive semidefinite matrix. \mathbb{S}^{N} (\mathbb{S}^{N}_{+}) denotes the set of symmetric (symmetric positive semidefinite) matrices of size N\times N. |\mathcal{U}| denotes the cardinality of the set \mathcal{U}. \otimes denotes the Kronecker product, \circ denotes the Khatri-Rao or column-wise Kronecker product, and {\rm vec}(\cdot) refers to the matrix vectorization operator.
For a full column rank tall matrix {\boldsymbol{A}}, the left inverse is given by {\boldsymbol{A}}^{\dagger}=({\boldsymbol{A}}^{H}{\boldsymbol{A}})^{-1}{\boldsymbol{A}}^{H}. The column span of {\boldsymbol{A}} and row null space of {\boldsymbol{A}} are denoted by {\rm ran}({\boldsymbol{A}}) and {\rm null}({\boldsymbol{A}}), respectively. Properties that are frequently used in this paper:

{\rm vec}({\boldsymbol{A}}{\boldsymbol{B}}{\boldsymbol{C}})=({\boldsymbol{C}}^{T}\otimes{\boldsymbol{A}}){\rm vec}({\boldsymbol{B}});

{\rm vec}({\boldsymbol{A}}{\rm diag}[{\boldsymbol{b}}]{\boldsymbol{C}})=({\boldsymbol{C}}^{T}\circ{\boldsymbol{A}}){\boldsymbol{b}}.
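Both identities are easy to verify numerically. The short sketch below (an illustration, not code from the paper) confirms them for randomly drawn matrices, using column-major vectorization.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))
vec = lambda M: M.flatten(order="F")   # vec() stacks the columns of a matrix

# Property 1: vec(ABC) = (C^T kron A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# Property 2: vec(A diag[b] C) = (C^T khatri-rao A) b,
# where the Khatri-Rao product takes column-wise Kronecker products
A2 = rng.standard_normal((3, 5))
b = rng.standard_normal(5)
CkrA = np.stack([np.kron(C.T[:, i], A2[:, i]) for i in range(5)], axis=1)
assert np.allclose(vec(A2 @ np.diag(b) @ C), CkrA @ b)
```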
II Preliminaries
In this section, we introduce some preliminary concepts related to deterministic and stochastic signals defined on graphs.
II-A Graph signals and filtering
Consider a dataset with N elements denoted as {\boldsymbol{x}}\in\mathbb{C}^{N}, which lives on an irregular structure represented by an undirected graph \mathcal{G}=(\mathcal{V},\mathcal{E}), where the vertex set \mathcal{V}=\{v_{1},\cdots,v_{N}\} denotes the set of nodes and the edge set \mathcal{E} encodes the connections between the nodes, i.e., (i,j)\in\mathcal{E} means that node i is connected to node j. The nth entry of {\boldsymbol{x}}, i.e., x_{n}, is indexed by node v_{n} of the graph \mathcal{G}. Therefore, we refer to the dataset {\boldsymbol{x}} as a length-N graph signal.
Let us introduce an operator {\boldsymbol{S}}\in\mathbb{C}^{N\times N}, where the (i,j)th entry of {\boldsymbol{S}}, denoted by s_{i,j}, can be nonzero only if i=j or (i,j)\in\mathcal{E}, and is zero otherwise. The pattern of {\boldsymbol{S}} thus captures the local structure of the graph. More specifically, for a graph signal {\boldsymbol{x}}, the signal {\boldsymbol{S}}{\boldsymbol{x}} denotes the unit-shifted version of {\boldsymbol{x}}. Hence, {{\boldsymbol{S}}} is referred to as the graph-shift operator [sandryhaila2013discrete]. Different choices for {\boldsymbol{S}} include the graph Laplacian {\boldsymbol{L}} [shuman2013Emerging], the adjacency matrix {\boldsymbol{A}} [sandryhaila2013discrete], or their respective variants. For undirected graphs, {\boldsymbol{S}} is symmetric (more generally, Hermitian), and thus it admits the following eigenvalue decomposition
{\boldsymbol{S}}={\boldsymbol{U}}{\boldsymbol{\Lambda}}{\boldsymbol{U}}^{H}=[{\boldsymbol{u}}_{1},\cdots,{\boldsymbol{u}}_{N}]\,{\rm diag}[\lambda_{1},\cdots,\lambda_{N}]\,[{\boldsymbol{u}}_{1},\cdots,{\boldsymbol{u}}_{N}]^{H},  (1)
where the eigenvectors \{{\boldsymbol{u}}_{n}\}_{n=1}^{N} and the eigenvalues \{\lambda_{n}\}_{n=1}^{N} of {\boldsymbol{S}} provide the notion of frequency in the graph setting [shuman2013Emerging, sandryhaila2014big]. Specifically, \{{\boldsymbol{u}}_{n}\}_{n=1}^{N} forms an orthonormal Fourierlike basis for graph signals with the graph frequencies denoted by \{\lambda_{n}\}_{n=1}^{N}. Hence, the graph Fourier transform of a graph signal, {\boldsymbol{x}}_{f}=[x_{f,1},x_{f,2},\ldots,x_{f,N}]^{T}\in\mathbb{C}^{N}, is given by
{\boldsymbol{x}}_{f}:={\boldsymbol{U}}^{H}{\boldsymbol{x}}\,\Leftrightarrow\,{\boldsymbol{x}}={\boldsymbol{U}}{\boldsymbol{x}}_{f}.  (2)
The frequency content of graph signals can be modified using linear shift-invariant graph filters [sandryhaila2013discrete, shuman2013Emerging]. Let us refer to the system {\boldsymbol{H}}\in\mathbb{C}^{N\times N} as a graph filter. If the eigenvalues of {\boldsymbol{S}} are distinct, a shift-invariant graph filter, which satisfies {\boldsymbol{H}}({\boldsymbol{S}}{\boldsymbol{x}})={\boldsymbol{S}}({\boldsymbol{H}}{\boldsymbol{x}}), can be expressed as a polynomial in {\boldsymbol{S}} as [sandryhaila2013discrete]
{\boldsymbol{H}}=h_{0}{\boldsymbol{I}}+h_{1}{\boldsymbol{S}}+\cdots+h_{L-1}{\boldsymbol{S}}^{L-1}={\boldsymbol{U}}\left[h_{0}{\boldsymbol{I}}+h_{1}{\boldsymbol{\Lambda}}+\cdots+h_{L-1}{\boldsymbol{\Lambda}}^{L-1}\right]{\boldsymbol{U}}^{H},  (3)
where the filter {\boldsymbol{H}} is of degree L-1 with filter coefficients {\boldsymbol{h}}=[h_{0},h_{1},\ldots,h_{L-1}]^{T}\in\mathbb{C}^{L}, and L\leq N since, for distinct eigenvalues, N is the degree of the minimal polynomial (equal to the characteristic polynomial) of {\boldsymbol{S}}. The diagonal matrix
{\boldsymbol{H}}_{f}=\sum_{l=0}^{L-1}h_{l}{\boldsymbol{\Lambda}}^{l}={\rm diag}[{\boldsymbol{V}}_{L}{\boldsymbol{h}}]={\rm diag}[h_{f,1},\cdots,h_{f,N}]  (4)
can be viewed as the frequency response of the graph filter. Here, {\boldsymbol{V}}_{L} is an N\times L Vandermonde matrix with the (i,j)th entry \lambda_{i}^{j-1}.
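To make the filtering operation concrete, the sketch below (an illustration with an assumed random symmetric shift operator and arbitrary filter taps, not code from the paper) builds a polynomial graph filter as in (3) and checks that it matches the frequency-domain form with the Vandermonde matrix in (4), as well as the shift-invariance property.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 6, 3
W = rng.standard_normal((N, N))
S = (W + W.T) / 2                        # assumed symmetric graph-shift operator
lam, U = np.linalg.eigh(S)               # S = U diag(lam) U^T

h = rng.standard_normal(L)               # assumed filter coefficients h_0, ..., h_{L-1}
H = sum(h[l] * np.linalg.matrix_power(S, l) for l in range(L))   # polynomial filter (3)

V_L = np.vander(lam, L, increasing=True)  # N x L Vandermonde, (i,j) entry lam_i^{j-1}
H_f = V_L @ h                             # frequency response h_{f,n}, cf. (4)
assert np.allclose(H, U @ np.diag(H_f) @ U.T)

x = rng.standard_normal(N)
assert np.allclose(H @ (S @ x), S @ (H @ x))   # shift invariance: H(Sx) = S(Hx)
```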
II-B Stationary graph signals
Let {\boldsymbol{x}}=[x_{1},x_{2},\cdots,x_{N}]^{T}\in\mathbb{C}^{N} be a stochastic signal defined on the vertices of the graph \mathcal{G} with expected value {\boldsymbol{m}}_{\boldsymbol{x}}=\mathbb{E}\{{\boldsymbol{x}}\} and covariance matrix {\boldsymbol{R}}_{\boldsymbol{x}}=\mathbb{E}\{({\boldsymbol{x}}-{\boldsymbol{m}}_{\boldsymbol{x}})({\boldsymbol{x}}-{\boldsymbol{m}}_{\boldsymbol{x}})^{H}\}. Efforts to generalize some of the concepts of statistical time invariance or stationarity of signals defined over regular structures to random graph signals have been made in [benjamin15eusipco, perraudin2016stationary, marques2016stationary, girault2015signal]. For the sake of completeness, we summarize the definitions from these works as follows.
Definition 1 (Second-order stationarity).
A random graph signal {\boldsymbol{x}} is second-order stationary if and only if the following properties hold:

The mean of the graph signal is collinear to an eigenvector of {\boldsymbol{S}} corresponding to the smallest eigenvalue, i.e., {\boldsymbol{m}}_{\boldsymbol{x}}=m_{\boldsymbol{x}}{\boldsymbol{u}}_{1}.

Matrices {\boldsymbol{S}} and {\boldsymbol{R}}_{\boldsymbol{x}} can be simultaneously diagonalized.
Since we assume that the eigenvalues of {\boldsymbol{S}} are distinct and {\boldsymbol{U}} forms an orthonormal basis, property 2 in the above definition essentially means statistical orthogonality of the spectral components, i.e., \mathbb{E}\{x_{f,i}{\bar{x}_{f,j}}\}=0 for i\neq j [girault2015signal].
For simplicity, from now on we will focus on graph signals with zero mean, where we assume that m_{\boldsymbol{x}} is either known or can be set to zero by preprocessing the data as discussed in Section LABEL:sec:numericalresults. We can generate zero-mean second-order stationary graph signals by graph filtering zero-mean white noise. Let {\boldsymbol{n}}=[n_{1},n_{2},\ldots,n_{N}]^{T}\in\mathbb{C}^{N} be zero-mean unit-variance noise with covariance matrix {\boldsymbol{R}}_{\boldsymbol{n}}={\boldsymbol{I}}. Then, a zero-mean second-order stationary graph signal {\boldsymbol{x}} can be modeled as {\boldsymbol{x}}={\boldsymbol{H}}{\boldsymbol{n}}, where {\boldsymbol{H}} can be any valid graph filter. The filtered signal will have zero mean and covariance matrix {\boldsymbol{R}}_{\boldsymbol{x}}=\mathbb{E}\{({\boldsymbol{H}}{\boldsymbol{n}})({\boldsymbol{H}}{\boldsymbol{n}})^{H}\} given by
{\boldsymbol{R}}_{\boldsymbol{x}}={\boldsymbol{H}}{\boldsymbol{R}}_{\boldsymbol{n}}{\boldsymbol{H}}^{H}={\boldsymbol{U}}\,{\rm diag}[|h_{f,1}|^{2},\cdots,|h_{f,N}|^{2}]\,{\boldsymbol{U}}^{H}={\boldsymbol{U}}\,{\rm diag}[{\boldsymbol{p}}]\,{\boldsymbol{U}}^{H},  (5)
where h_{f,n}=h_{0}+h_{1}\lambda_{n}+\cdots+h_{L-1}\lambda_{n}^{L-1} is defined in (4). This conforms to the second property listed in Definition 1. More generally, graph filtering any second-order stationary graph signal also results in a second-order stationary graph signal (this is easy to verify using property 2 in Definition 1). The nonnegative vector {\boldsymbol{p}} in (5) is referred to as the graph power spectral density or graph power spectrum. We now formally introduce the graph power spectrum through the following definition.
Definition 2 (Graph power spectrum).
The graph power spectral density of a second-order stationary graph signal is a real-valued nonnegative length-N vector {\boldsymbol{p}}=[p_{1},p_{2},\ldots,p_{N}]^{T}\in\mathbb{R}_{+}^{N} with entries given by
p_{n}={\boldsymbol{u}}_{n}^{H}{\boldsymbol{R}}_{\boldsymbol{x}}{\boldsymbol{u}}_{n},\,n=1,2,\ldots,N.  (6)
Alternatively, p_{n}=|h_{f,n}|^{2}\geq 0, for n=1,2,\ldots,N, where h_{f,n}=h_{0}+h_{1}\lambda_{n}+\cdots+h_{L-1}\lambda_{n}^{L-1} is defined in (4).
Second-order stationarity is preserved by linear graph filtering. This means that stationary graph signals with a prescribed graph power spectrum can be generated by filtering white noise, where the graph power spectrum of the filtered signal is reshaped according to the frequency response of the graph filter [benjamin15eusipco, perraudin2016stationary, marques2016stationary]. As a result, the graph power spectrum reveals critical information about the second-order stationary graph signal, and thus estimating the graph power spectrum or recovering the second-order statistics of a graph signal is useful in many applications.
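As a sanity check (an illustrative sketch with an assumed random shift operator and assumed filter taps, not code from the paper), filtering white noise through a graph filter gives a covariance with exactly the structure in (5), and the power spectrum follows from (6):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 8, 3
W = rng.standard_normal((N, N))
S = (W + W.T) / 2                                  # assumed symmetric shift operator
lam, U = np.linalg.eigh(S)
h = np.array([1.0, 0.5, 0.2])                      # assumed taps h_0, h_1, h_2
h_f = np.vander(lam, L, increasing=True) @ h       # frequency response, cf. (4)
H = U @ np.diag(h_f) @ U.T

# x = H n with white unit-variance n gives R_x = H R_n H^T = U diag[p] U^T
R_x = H @ H.T
p = np.array([U[:, n] @ R_x @ U[:, n] for n in range(N)])   # entries (6)
assert np.allclose(p, h_f**2)                      # p_n = |h_{f,n}|^2
assert np.allclose(R_x, U @ np.diag(p) @ U.T)      # structure (5)
```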
We end this section by summarizing the list of assumptions made in this paper.

The shift operator {\boldsymbol{S}} is known.

The orthonormal basis {\boldsymbol{U}} and the distinct eigenvalues \{\lambda_{n}\}_{n=1}^{N} of {\boldsymbol{S}} are known a priori.
III Nonparametric Spectral Domain Approach
The size of the datasets often inhibits a direct computation of the second-order statistics, e.g., by observing all the N nodes and using (6) to compute the graph power spectrum, which would computationally cost \mathcal{O}(N^{3}). As such, compression or data reduction is preferred, especially for large-scale data in the graph setting [sandryhaila2014big]. In the context of graph signal processing, most works consider subsampling the graph signal {\boldsymbol{x}} assuming some spectral prior to reconstruct it [anis2014towards, tsitsvero2015uncertainty, varma2015spectrum, chen2015discrete]. This approach is, in principle, also possible for recovering the second-order statistics of {\boldsymbol{x}}. However, when the goal is to reconstruct the second-order statistics of {\boldsymbol{x}} (and not {\boldsymbol{x}} itself), avoiding the intermediate step of reconstructing and storing {\boldsymbol{x}} is computationally advantageous and allows for a stronger compression. In this paper, we will therefore focus on recovering graph second-order statistics directly from subsampled graph signals. We refer to this problem as graph covariance subsampling.
The extension of compressive covariance sensing [ariananda2012compressive, romero2013wideband, Romero16CCSspm] to graph covariance subsampling is nontrivial. This is because for second-order (or wide-sense) stationary signals with a regular support, the covariance matrix has a clear structure (e.g., Toeplitz, circulant) that enables an elegant subsampler design, whereas for second-order stationary graph signals residing on arbitrary graphs, the covariance matrix does not, in general, admit any clear structure that can be easily exploited.
Consider the problem of estimating the graph power spectrum of the second-order stationary graph signal {\boldsymbol{x}}\in\mathbb{C}^{N} from a set of K\ll N linear observations stacked in the vector {\boldsymbol{y}}\in\mathbb{C}^{K}, given by
{\boldsymbol{y}}={\boldsymbol{\Phi}}{\boldsymbol{x}},  (7) 
where {\boldsymbol{\Phi}} is a known K\times N selection matrix with Boolean entries, i.e., {\boldsymbol{\Phi}}\in\{0,1\}^{K\times N} (we will discuss the subsampler design in Section LABEL:sec:subsampdesgn), and where several realizations of {\boldsymbol{y}} may be available. The matrix {\boldsymbol{\Phi}} is referred to as the subsampling or sparse sampling matrix, where the compression is achieved by setting K\ll N. For applications where graph nodes correspond to sensing devices (e.g., weather stations in climatology, electroencephalography (EEG) probes in brain networks), such a sparse sampling scheme results in a significant reduction in the hardware, storage, and communication costs, in addition to the reduction in the processing costs.
The covariance matrices {\boldsymbol{R}}_{\boldsymbol{x}}=\mathbb{E}\{{\boldsymbol{x}}{\boldsymbol{x}}^{H}\}\in\mathbb{C}^{N\times N} and {\boldsymbol{R}}_{\boldsymbol{y}}=\mathbb{E}\{{\boldsymbol{y}}{\boldsymbol{y}}^{H}\}\in\mathbb{C}^{K\times K} contain the second-order statistics of {\boldsymbol{x}} and {\boldsymbol{y}}, respectively. In practice, typically, multiple snapshots, say N_{s} snapshots, are observed to form a sample covariance matrix. Forming the sample covariance matrix from N_{s} snapshots of {\boldsymbol{x}} costs \mathcal{O}(N^{2}N_{s}), while forming the sample covariance matrix from N_{s} snapshots of {\boldsymbol{y}} only costs \mathcal{O}(K^{2}N_{s}). We now state the problem of interest as follows.
Problem.
(Recovering second-order statistics) For a known undirected graph \mathcal{G}, given a number of realizations, say N_{s}, of the subsampled length-K graph signal {\boldsymbol{y}} or the subsampled covariance matrix {\boldsymbol{R}}_{\boldsymbol{y}}, recover the graph power spectrum {\boldsymbol{p}} and thus the covariance matrix {\boldsymbol{R}}_{\boldsymbol{x}}.
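The sample covariances above are formed from snapshots. The toy sketch below (illustrative, with an assumed 3-node selection) shows that the subsampled sample covariance is exactly the compressed version of the full one, while requiring only the K observed entries of each snapshot:

```python
import numpy as np

rng = np.random.default_rng(5)
N, Ns = 6, 4
nodes = [0, 2, 5]                  # assumed node selection, K = 3
Phi = np.eye(N)[nodes, :]          # K x N Boolean selection matrix

X = rng.standard_normal((N, Ns))   # N_s snapshots of x as columns
Y = Phi @ X                        # subsampled snapshots y = Phi x
R_x_hat = X @ X.T / Ns             # costs O(N^2 N_s)
R_y_hat = Y @ Y.T / Ns             # costs only O(K^2 N_s)
assert np.allclose(R_y_hat, Phi @ R_x_hat @ Phi.T)
```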
Let us decompose the graph signal {\boldsymbol{x}} in terms of its graph Fourier transform coefficients as [cf. (2)]
{\boldsymbol{x}}=\sum_{i=1}^{N}x_{f,i}{\boldsymbol{u}}_{i}. 
This allows us to represent the covariance matrix {\boldsymbol{R}}_{\boldsymbol{x}}=\mathbb{E}\{{\boldsymbol{x}}{\boldsymbol{x}}^{H}\} in the graph Fourier domain using the graph power spectrum {\boldsymbol{p}} as
{\boldsymbol{R}}_{\boldsymbol{x}}=\sum_{i=1}^{N}\mathbb{E}\{|x_{f,i}|^{2}\}{\boldsymbol{u}}_{i}{\boldsymbol{u}}_{i}^{H}=\sum_{i=1}^{N}p_{i}{\boldsymbol{u}}_{i}{\boldsymbol{u}}_{i}^{H}=\sum_{i=1}^{N}p_{i}{\boldsymbol{Q}}_{i},  (8)
where we use the fact that for i\neq j we have \mathbb{E}\{x_{f,i}\bar{x}_{f,j}\}=0 and {\boldsymbol{Q}}_{i}={\boldsymbol{u}}_{i}{\boldsymbol{u}}_{i}^{H} is a size-N rank-one matrix. Here, we expand {\boldsymbol{R}}_{\boldsymbol{x}} using a set of N Hermitian matrices \{{\boldsymbol{Q}}_{1},{\boldsymbol{Q}}_{2},\ldots,{\boldsymbol{Q}}_{N}\} as a basis. Vectorizing {\boldsymbol{R}}_{\boldsymbol{x}} in (8) results in
{\boldsymbol{r}}_{\boldsymbol{x}}={\rm vec}({\boldsymbol{R}}_{\boldsymbol{x}})=\sum_{i=1}^{N}p_{i}{\rm vec}({\boldsymbol{Q}}_{i})={\boldsymbol{\Psi}}_{\rm s}{\boldsymbol{p}},
where we have stacked {\rm vec}({\boldsymbol{Q}}_{i})=\bar{{\boldsymbol{u}}}_{i}\otimes{\boldsymbol{u}}_{i} to form the N^{2}\times N matrix {\boldsymbol{\Psi}}_{\rm s} as
{\boldsymbol{\Psi}}_{\rm s}=[\bar{{\boldsymbol{u}}}_{1}\otimes{\boldsymbol{u}}_{1},\cdots,\bar{{\boldsymbol{u}}}_{N}\otimes{\boldsymbol{u}}_{N}]=\bar{{\boldsymbol{U}}}\circ{\boldsymbol{U}}.
The subscript “{\rm s}” in the matrix {\boldsymbol{\Psi}}_{\rm s}, which is constructed using the graph Fourier basis vectors, stands for spectral domain.
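The construction of {\boldsymbol{\Psi}}_{\rm s} is straightforward. The sketch below (illustrative, for an assumed random symmetric shift operator and an assumed power spectrum) builds it column by column and verifies both the expansion r_x = Psi_s p and its full column rank.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5
W = rng.standard_normal((N, N))
S = (W + W.T) / 2                          # assumed symmetric shift operator
lam, U = np.linalg.eigh(S)                 # real orthonormal eigenbasis U
p = rng.uniform(0.5, 2.0, N)               # assumed power spectrum
R_x = U @ np.diag(p) @ U.T

# Psi_s = conj(U) khatri-rao U: the i-th column is vec(Q_i) = conj(u_i) kron u_i
Psi_s = np.stack([np.kron(U[:, i].conj(), U[:, i]) for i in range(N)], axis=1)
assert np.allclose(R_x.flatten(order="F"), Psi_s @ p)   # r_x = Psi_s p
assert np.linalg.matrix_rank(Psi_s) == N                # full column rank
```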
Using the compression scheme described in (7), the covariance matrix {\boldsymbol{R}}_{\boldsymbol{y}}\in\mathbb{C}^{K\times K} of the subsampled graph signal {\boldsymbol{y}} can be related to {\boldsymbol{R}}_{\boldsymbol{x}} as
{\boldsymbol{R}}_{\boldsymbol{y}}={\boldsymbol{\Phi}}{\boldsymbol{R}}_{\boldsymbol{x}}{\boldsymbol{\Phi}}^{T}=\sum_{i=1}^{N}p_{i}{\boldsymbol{\Phi}}{\boldsymbol{Q}}_{i}{\boldsymbol{\Phi}}^{T}.  (9)
This means that the expansion coefficients of {\boldsymbol{R}}_{\boldsymbol{y}} with respect to the set \{{\boldsymbol{\Phi}}{\boldsymbol{Q}}_{1}{\boldsymbol{\Phi}}^{T},{\boldsymbol{\Phi}}{\boldsymbol{Q}}_{2}{\boldsymbol{\Phi}}^{T},\cdots,{\boldsymbol{\Phi}}{\boldsymbol{Q}}_{N}{\boldsymbol{\Phi}}^{T}\} are the same as those of {\boldsymbol{R}}_{\boldsymbol{x}} with respect to the set \{{\boldsymbol{Q}}_{1},{\boldsymbol{Q}}_{2},\cdots,{\boldsymbol{Q}}_{N}\}, i.e., they are preserved under linear compression. It is not yet clear, though, whether these expansion coefficients, which basically represent the power spectrum, can be uniquely recovered from {\boldsymbol{R}}_{\boldsymbol{y}}.
Vectorizing {\boldsymbol{R}}_{\boldsymbol{y}} as
{\boldsymbol{r}}_{\boldsymbol{y}}={\rm vec}({\boldsymbol{R}}_{\boldsymbol{y}})=({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\rm vec}({\boldsymbol{R}}_{\boldsymbol{x}})\in\mathbb{C}^{K^{2}}
we obtain
{\boldsymbol{r}}_{\boldsymbol{y}}=\sum_{i=1}^{N}p_{i}({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}})(\bar{{\boldsymbol{u}}}_{i}\otimes{\boldsymbol{u}}_{i})=\sum_{i=1}^{N}p_{i}({\boldsymbol{\Phi}}\bar{{\boldsymbol{u}}}_{i}\otimes{\boldsymbol{\Phi}}{\boldsymbol{u}}_{i})=({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s}{\boldsymbol{p}}.  (10)
This linear system with N unknowns has a unique solution if ({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s} has full column rank, which requires K^{2}\geq N. Assuming that this is the case, the graph power spectrum (and thus the second-order statistics of {\boldsymbol{x}}) can be estimated in closed form via least squares:
\widehat{\boldsymbol{p}}=[({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s}]^{\dagger}{\boldsymbol{r}}_{\boldsymbol{y}}.  (11)
Computing this least squares solution costs \mathcal{O}(K^{2}N^{2}) [golub1996matrix]. Although for the nonparametric approach the cost of computing (11) is of the same order as in the uncompressed case, the cost reduction will be prominent for the problems discussed later on in Section V. Further, to compute (11), we have assumed that the true covariance matrix {\boldsymbol{R}}_{\boldsymbol{y}} is available; a practical scenario with finite data records is discussed in Section LABEL:sec:finitedata.
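The whole pipeline can be sketched end to end (an illustration with an assumed random graph and an arbitrary 4-node selection, which is generically valid; the actual design of {\boldsymbol{\Phi}} is treated separately): compress the true covariance as in (9) and recover the power spectrum by least squares as in (11).

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 9, 4                                # K^2 = 16 >= N
W = rng.standard_normal((N, N))
S = (W + W.T) / 2                          # assumed symmetric shift operator
lam, U = np.linalg.eigh(S)
p_true = rng.uniform(0.5, 2.0, N)          # assumed power spectrum
R_x = U @ np.diag(p_true) @ U.T

Phi = np.eye(N)[[0, 2, 5, 8], :]           # assumed node selection
R_y = Phi @ R_x @ Phi.T                    # compressed covariance, cf. (9)

Psi_s = np.stack([np.kron(U[:, i], U[:, i]) for i in range(N)], axis=1)  # real U
G = np.kron(Phi, Phi) @ Psi_s              # (Phi kron Phi) Psi_s
assert np.linalg.matrix_rank(G) == N       # valid graph covariance subsampler
p_hat = np.linalg.lstsq(G, R_y.flatten(order="F"), rcond=None)[0]   # (11)
assert np.allclose(p_hat, p_true)
```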
Definition 3.
A wide matrix {\boldsymbol{\Phi}} is a valid graph covariance subsampler if it yields a full column rank matrix ({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s}.
We now derive the conditions under which {\boldsymbol{\Phi}} is a valid graph covariance subsampler. To do this, we first introduce two important lemmas.
Lemma 1.
If the matrix {\boldsymbol{U}}\in\mathbb{C}^{N\times N} has full rank, then the matrix {\boldsymbol{\Psi}}_{\rm s}=\bar{{\boldsymbol{U}}}\circ{\boldsymbol{U}} of size {N^{2}\times N} has full column rank.
Proof.
See Appendix LABEL:app:proofselfkr. ∎
Lemma 2.
If the matrix {\boldsymbol{\Phi}}\in\mathbb{R}^{K\times N} has full row rank, then the matrix {\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}} of size K^{2}\times N^{2} also has full row rank.
Proof.
Follows from the singular value decomposition of {\boldsymbol{\Phi}} and the property ({\boldsymbol{A}}\otimes{\boldsymbol{B}})({\boldsymbol{C}}\otimes{\boldsymbol{D}})=({\boldsymbol{A}}{\boldsymbol{C}}\otimes{\boldsymbol{B}}{\boldsymbol{D}}). ∎
Using the above two lemmas, we can provide the necessary and sufficient conditions under which the solution in (11) is unique.
Theorem 1.
A full row rank matrix {\boldsymbol{\Phi}}\in\mathbb{R}^{K\times N} is a valid graph covariance subsampler if and only if the matrix ({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s} is tall, i.e., K^{2}\geq N, and {\rm null}({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}})\cap{\rm ran}({\boldsymbol{\Psi}}_{\rm s})=\{{\boldsymbol{0}}\}.
Proof.
See Appendix LABEL:app:theo1. ∎
Although the linear system of equations (10) can be solved using (unconstrained) least squares, nonnegativity constraints or any spectral prior can be easily accounted for while solving (10) as summarized in the following remark.
Remark 1 (Spectral priors).
Any available prior information about the graph spectrum might allow for a higher compression with K^{2}<N, or an improvement of the solution (11). Suppose we have some prior knowledge about the graph spectrum, i.e., {\boldsymbol{p}}\in\mathcal{P} with \mathcal{P} being the constraint set. For instance, suppose we know a priori that (a) the spectrum is bandlimited (e.g., lowpass) with known support such that \mathcal{P}=\{{\boldsymbol{p}}\,:\,p_{n}=0,\,n\notin[N_{l},N_{u}]\}, where [N_{l},N_{u}] denotes the support set, (b) the spectrum is sparse, but with unknown support, such that \mathcal{P}:=\{{\boldsymbol{p}}\,:\,\sum_{n=1}^{N}p_{n}=S\}, where S denotes the sparsity order (here, we use the convex relaxation of the cardinality constraint, which for a nonnegative {\boldsymbol{p}} equals \|{\boldsymbol{p}}\|_{1}), or (c) the power spectrum is nonnegative (by definition), for which \mathcal{P}:=\{{\boldsymbol{p}}\,:\,p_{n}\geq 0,\,\forall n\}. With such spectral priors, the following constrained least squares problem may be solved
\underset{{\boldsymbol{p}}\in\mathcal{P}}{\text{\rm minimize}}\quad\|{\boldsymbol{r}}_{\boldsymbol{y}}-({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s}{\boldsymbol{p}}\|_{2}^{2}. 
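For instance, prior (c) can be handled with a nonnegative least squares solver. The following sketch is an illustration, not the paper's implementation: it assumes the Khatri-Rao structure of {\boldsymbol{\Psi}}_{\rm s} from Section III, a synthetic symmetric graph-shift operator, and exact (noise-free) subsampled covariances; scipy.optimize.nnls handles the nonnegativity constraint.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
N, K = 6, 3

# Synthetic symmetric graph-shift operator and its graph Fourier basis.
S = rng.standard_normal((N, N))
S = (S + S.T) / 2
_, U = np.linalg.eigh(S)
Psi_s = np.stack([np.outer(U[:, n], U[:, n]).ravel() for n in range(N)], axis=1)

# True (nonnegative) graph power spectrum and the node-domain covariance.
p = rng.uniform(0.5, 2.0, N)
R_x = U @ np.diag(p) @ U.T

Phi = np.eye(N)[[0, 2, 5]]              # keep K = 3 nodes
r_y = (Phi @ R_x @ Phi.T).ravel()       # exact subsampled covariance, vectorized

# Prior (c): p >= 0, solved with nonnegative least squares.
A = np.kron(Phi, Phi) @ Psi_s
p_hat, _ = nnls(A, r_y)
print(np.allclose(p_hat, p, atol=1e-6))
```

Since the observations are exact and the system has full column rank, the constrained solution coincides with the true spectrum.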
In what follows, we will discuss and illustrate the connections with compressive covariance sensing [ariananda2012compressive, Romero16CCSspm] for datasets that reside on regular structures (e.g., time series) using a circulant graph (e.g., a cycle graph). We will also see that designing a compression matrix is much more elegant for such circulant graphs.
IV Circulant Graphs
Discrete-time finite or periodic data can be represented using directed cycle graphs, where the direction of an edge represents the evolution of time from past to future. The edge directions may be ignored in some cases, e.g., when we are only interested in exploiting the regular Fourier transform, when we are dealing with the spatial domain, or when the underlying data is a time-reversible stochastic process that is invariant under the reversal of the time scale [weiss1975time]. In such cases, the data can be represented using an undirected cycle graph; see Fig. 1.
Consider the adjacency matrix of this undirected cycle graph as its graph-shift operator, which will be an N\times N symmetric circulant matrix. We know that a circulant matrix can be diagonalized with a discrete Fourier transform matrix. In other words, the graph Fourier transform matrix {\boldsymbol{U}} related to this graph will consist of the orthonormal vectors
{\boldsymbol{u}}_{n}=\frac{1}{\sqrt{N}}[{\omega}_{n}^{0},{\omega}_{n},{\omega}_{n}^{2},\cdots,{\omega}_{n}^{N-1}]^{T} 
with {\omega}_{n}=\exp(\imath 2\pi n/N), and it will be a Vandermonde matrix (here, \imath^{2}=-1). In general, for circulant graphs with circulant graph-shift operators, an eigenvalue decomposition is not required to compute the graph Fourier transform matrix {\boldsymbol{U}} or the model matrix {\boldsymbol{\Psi}}_{\rm s}, which was introduced in Section III.
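This can be checked numerically: the DFT-based basis above diagonalizes any symmetric circulant shift without computing an eigenvalue decomposition. A minimal sketch (the cycle size N=8 is an arbitrary choice):

```python
import numpy as np

N = 8
# Adjacency matrix of the undirected cycle graph: a symmetric circulant matrix.
c = np.zeros(N)
c[1] = c[-1] = 1.0
S = np.stack([np.roll(c, k) for k in range(N)])   # row k is c cyclically shifted by k

# Graph Fourier basis u_n = (1/sqrt(N)) [1, w_n, ..., w_n^(N-1)]^T, w_n = exp(i 2 pi n / N).
n = np.arange(N)
U = np.exp(2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

# U^H S U should be (numerically) diagonal, with no EVD required.
D = U.conj().T @ S @ U
off = D - np.diag(np.diag(D))
print(np.allclose(off, 0, atol=1e-9))
```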
Let the set \mathcal{K}\subset\mathcal{N} denote the indices of the selected graph nodes. Now, if we can smartly select the entries of {{\boldsymbol{u}}}_{n} such that the related entries of \bar{{\boldsymbol{u}}}_{n}\otimes{\boldsymbol{u}}_{n} contain all the distinct values \{{\omega}_{n}^{m}\} for m=0,\cdots,N-1, the matrix ({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s} will be a full-column rank Vandermonde matrix. In particular, this means that, for every m=0,\ldots,N-1, there must exist at least one pair of elements n_{i},n_{j}\in\mathcal{K} that satisfies n_{i}-n_{j}=m, where the difference n_{i}-n_{j} is due to the Kronecker product \bar{{\boldsymbol{u}}}_{n}\otimes{\boldsymbol{u}}_{n}. Sets \mathcal{K} having this property are called sparse rulers [Romero16CCSspm]. Furthermore, if the set contains a minimum number of elements, it is called a minimal sparse ruler, which results in the best possible compression.
Let us illustrate this with an example for N=10. In this case, the set \mathcal{K}=\{0,1,4,7,9\} with K=|\mathcal{K}|=5 elements is a minimal sparse ruler. In other words, by choosing the subsampling matrix {\boldsymbol{\Phi}}=\operatorname*{\mathrm{diag}}_{\rm r}[{\boldsymbol{w}}] with {\boldsymbol{w}}=[1,1,0,0,1,0,0,1,0,1]^{T} we can ensure that the matrix ({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm s} is full column rank, and hence the second-order statistics of {\boldsymbol{x}} can be estimated using (11) by subsampling only K=5 nodes. Here, we achieve a compression rate of K/N=0.5. Similarly, for N=80, the minimal sparse ruler has K=15 elements, which results in a compression rate of K/N=0.1875 (we will see an example related to N=80 and K=15 in Section LABEL:sec:numericalresults). Sparse rulers for other values of N are tabulated in [linebarger1993difference].
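The sparse-ruler property is straightforward to verify computationally; the helper below (hypothetical, for illustration) checks that every lag m=0,\ldots,N-1 is realized as a difference of two elements of \mathcal{K}:

```python
from itertools import product

def is_sparse_ruler(K_set, N):
    """Check that every lag m = 0, ..., N-1 equals n_i - n_j for some n_i, n_j in K_set."""
    diffs = {i - j for i, j in product(K_set, repeat=2)}
    return all(m in diffs for m in range(N))

print(is_sparse_ruler({0, 1, 4, 7, 9}, 10))   # the minimal sparse ruler for N = 10
print(is_sparse_ruler({0, 1, 4, 7}, 10))      # one node fewer: some lags are missing
```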
Computing minimal sparse rulers is a combinatorial problem for which no closed-form expressions are known. Nevertheless, subsamplers such as coprime samplers [vaidyanathan2011sparse] and nested sparse samplers [pal2010nested], which can be computed in closed form for any N, are also valid covariance subsamplers. However, they are not minimal sparse rulers and thus do not achieve the best compression rate.
Subsampler design for reconstructing the secondorder statistics of signals residing on a circulant graph is as elegant as that for reconstructing the secondorder statistics of stationary timeseries. The design of subsamplers for general graphs, however, is more challenging. This is the subject of Section LABEL:sec:subsampdesgn.
V Parametric Models
In this section, we will focus on a parametric representation of the graph power spectrum; in particular, on moving average and autoregressive models. Typically, the model order (i.e., the number of parameters) is much smaller than the length of the graph signal, and since we now have to recover only these parameters, a much stronger compression can be achieved. This also means that only a few parameters need to be stored or transmitted, which can then be used to generate realizations of second-order stationary graph signals (we will illustrate this with an example in Section LABEL:sec:numericalresults).
Parametric methods can be viewed as an alternative approach, where going to the graph spectral domain may be avoided, and instead, all the processing is done directly in the graph vertex domain.
VA Graph moving average models
As before, we assume that the stationary graph signal {\boldsymbol{x}} is generated by graph filtering zero-mean unit-variance white noise. Recall that in Section III, we did not impose any structure on the graph filter, but now we will assume that the graph filter has a finite impulse response with an all-zero form as in (3); see [perraudin2016stationary, marques2016stationary].
Let us begin by writing the graph signal {\boldsymbol{x}} as
{\boldsymbol{x}}={\boldsymbol{H}}({\boldsymbol{h}}){\boldsymbol{n}}=\sum_{l=0}^{L-1}h_{l}{\boldsymbol{S}}^{l}{\boldsymbol{n}}={\boldsymbol{U}}\left(\sum_{l=0}^{L-1}h_{l}{\boldsymbol{\Lambda}}^{l}\right){\boldsymbol{U}}^{H}{\boldsymbol{n}} 
with covariance matrix
\displaystyle{\boldsymbol{R}}_{\boldsymbol{x}}  \displaystyle={{\boldsymbol{H}}}({\boldsymbol{h}}){{\boldsymbol{H}}}^{H}({\boldsymbol{h}})  (12)  
\displaystyle={\boldsymbol{U}}\left(\sum_{l=0}^{L-1}h_{l}{\boldsymbol{\Lambda}}^{l}\right)\left(\sum_{l=0}^{L-1}\bar{h}_{l}{\boldsymbol{\Lambda}}^{l}\right){\boldsymbol{U}}^{H}, 
where {\boldsymbol{x}} is a graph moving average (GMA) signal of order L-1 with GMA coefficients \{h_{l}\}_{l=0}^{L-1}, and the length-L vector {\boldsymbol{h}} collects the GMA coefficients as {\boldsymbol{h}}=[h_{0},h_{1},\ldots,h_{L-1}]^{T}. Moving average models are particularly useful to represent a smooth graph power spectrum [perraudin2016stationary, marques2016stationary].
The expression (12) basically means that we can express the covariance matrix {\boldsymbol{R}}_{\boldsymbol{x}} as a polynomial of the graph shift operator:
{\boldsymbol{R}}_{\boldsymbol{x}}=\sum_{k=0}^{Q-1}b_{k}{\boldsymbol{S}}^{k},  (13) 
where the Q=\min\{2L-1,N\} unknown expansion coefficients \{b_{k}\}_{k=0}^{Q-1}, collected in the vector {\boldsymbol{b}}=[b_{0},b_{1},\cdots,b_{Q-1}]^{T}\in\mathbb{R}^{Q}, completely characterize the covariance matrix {\boldsymbol{R}}_{\boldsymbol{x}}. In other words, we assume a linear parametrization of the covariance matrix {\boldsymbol{R}}_{\boldsymbol{x}} using the set of Q Hermitian matrices \{{\boldsymbol{S}}^{0},{\boldsymbol{S}},\cdots,{\boldsymbol{S}}^{Q-1}\} as a basis.
The expansion coefficients {\boldsymbol{b}} depend on the GMA coefficients {\boldsymbol{h}}. To see this, let us consider an example GMA model with L=3 having coefficients {\boldsymbol{h}}=[h_{0},h_{1},h_{2}]^{T}, for which (13) simplifies to
\displaystyle{\boldsymbol{R}}_{\boldsymbol{x}}=h_{0}^{2}{\boldsymbol{I}}+2h_{0}h_{1}{\boldsymbol{S}}+(h_{1}^{2}+2h_{0}h_{2}){\boldsymbol{S}}^{2}+2h_{1}h_{2}{\boldsymbol{S}}^{3}+h_{2}^{2}{\boldsymbol{S}}^{4}.  (14) 
This means that {\boldsymbol{b}}({\boldsymbol{h}}) will be of length 2L-1 with entries {\boldsymbol{b}}({\boldsymbol{h}})=[h_{0}^{2},2h_{0}h_{1},h_{1}^{2}+2h_{0}h_{2},2h_{1}h_{2},h_{2}^{2}]^{T} that are related to the GMA parameters {\boldsymbol{h}}. To arrive at a simple (unconstrained) least squares estimator, we will ignore this structure in {\boldsymbol{b}} (we will discuss how to account for this structure at the end of this subsection). Therefore, with a slight abuse of notation, we will henceforth refer to {\boldsymbol{b}}({\boldsymbol{h}}) as the GMA coefficients.
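For a real-valued {\boldsymbol{h}}, the entries of {\boldsymbol{b}}({\boldsymbol{h}}) are simply the coefficients of the squared polynomial, i.e., the self-convolution of {\boldsymbol{h}}. A quick check of the L=3 example (the numerical values of h are arbitrary):

```python
import numpy as np

h = np.array([0.7, -0.3, 0.5])          # example real GMA coefficients, L = 3

# Coefficients of (h0 + h1 t + h2 t^2)^2 are the self-convolution of h.
b = np.convolve(h, h)

# Closed-form entries of b(h) from (14).
h0, h1, h2 = h
b_closed_form = np.array([h0**2, 2*h0*h1, h1**2 + 2*h0*h2, 2*h1*h2, h2**2])
print(np.allclose(b, b_closed_form))
```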
Depending on the shape of the power spectrum, Q can be much smaller than the number of graph nodes (i.e., the length of the vector {\boldsymbol{p}}), thus allowing a higher compression. In any case, the value of Q will be at most N, recalling that N is the degree of the minimal (and characteristic) polynomial of {\boldsymbol{S}}. That is to say, for Q>N, the set of matrices \{{\boldsymbol{S}}^{0},{\boldsymbol{S}},\cdots,{\boldsymbol{S}}^{Q-1}\} is linearly dependent.
Vectorizing {\boldsymbol{R}}_{\boldsymbol{x}} in (13) yields
{\boldsymbol{r}}_{\boldsymbol{x}}={\rm vec}({\boldsymbol{R}}_{\boldsymbol{x}})=\sum_{k=0}^{Q-1}b_{k}{\rm vec}({\boldsymbol{S}}^{k})={\boldsymbol{\Psi}}_{\rm MA}{\boldsymbol{b}},  (15) 
where we have stacked {\rm vec}({\boldsymbol{S}}^{k}) to form the columns of the matrix {\boldsymbol{\Psi}}_{\rm MA}\in\mathbb{R}^{N^{2}\times Q} as
{\boldsymbol{\Psi}}_{\rm MA}=\left[{\rm vec}({\boldsymbol{S}}^{0}),{\rm vec}({\boldsymbol{S}}^{1}),\cdots,{\rm vec}({\boldsymbol{S}}^{Q-1})\right], 
and the subscript “{\rm MA}” in {\boldsymbol{\Psi}}_{\rm MA} stands for moving average.
The covariance matrix of the subsampled graph signal {\boldsymbol{y}} in (7) will then be
{\boldsymbol{R}}_{\boldsymbol{y}}={\boldsymbol{\Phi}}{\boldsymbol{R}}_{\boldsymbol{x}}{\boldsymbol{\Phi}}^{T}=\sum_{k=0}^{Q-1}b_{k}{\boldsymbol{\Phi}}{\boldsymbol{S}}^{k}{\boldsymbol{\Phi}}^{T}.  (16) 
As in the graph spectral domain approach discussed in Section III, the GMA coefficients \{b_{k}\}_{k=0}^{Q-1} of {\boldsymbol{R}}_{\boldsymbol{y}} with respect to the set \{{\boldsymbol{\Phi}}{\boldsymbol{S}}^{0}{\boldsymbol{\Phi}}^{T},{\boldsymbol{\Phi}}{\boldsymbol{S}}{\boldsymbol{\Phi}}^{T},\cdots,{\boldsymbol{\Phi}}{\boldsymbol{S}}^{Q-1}{\boldsymbol{\Phi}}^{T}\} are the same as those of {\boldsymbol{R}}_{\boldsymbol{x}} with respect to the set \{{\boldsymbol{S}}^{0},{\boldsymbol{S}},\cdots,{\boldsymbol{S}}^{Q-1}\}.
Vectorizing {\boldsymbol{R}}_{\boldsymbol{y}}, we get a set of K^{2} equations in Q unknowns, given by
\displaystyle{\boldsymbol{r}}_{\boldsymbol{y}}={\rm vec}({\boldsymbol{R}}_{\boldsymbol{y}})  \displaystyle=({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\rm vec}({\boldsymbol{R}}_{\boldsymbol{x}})  (17)  
\displaystyle=({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm MA}{\boldsymbol{b}}. 
If the matrix ({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm MA} has full column rank, which requires K^{2}\geq Q, then the overdetermined system (17) can be uniquely solved using least squares as
\widehat{\boldsymbol{b}}=[({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){% \boldsymbol{\Psi}}_{\rm MA}]^{\dagger}{\boldsymbol{r}}_{\boldsymbol{y}}.  (18) 
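A minimal end-to-end sketch of (13)-(18), assuming an undirected cycle graph and noise-free covariance observations (so the least squares solution is exact); the node selection \{0,1,3\} and the coefficient values are arbitrary choices:

```python
import numpy as np

N, Q = 8, 3
# Adjacency matrix of the undirected cycle graph as the graph-shift operator.
S = np.zeros((N, N))
for i in range(N):
    S[i, (i + 1) % N] = S[(i + 1) % N, i] = 1.0

# True GMA expansion coefficients b and the resulting covariance (13).
b = np.array([1.0, 0.4, 0.2])
R_x = sum(bk * np.linalg.matrix_power(S, k) for k, bk in enumerate(b))

# Psi_MA stacks vec(S^k), k = 0, ..., Q-1, as columns (15).
Psi_MA = np.stack([np.linalg.matrix_power(S, k).ravel() for k in range(Q)], axis=1)

Phi = np.eye(N)[[0, 1, 3]]              # subsample K = 3 of the N = 8 nodes
r_y = (Phi @ R_x @ Phi.T).ravel()       # vectorized subsampled covariance (17)

# Least squares estimate (18); exact since the system has full column rank.
A = np.kron(Phi, Phi) @ Psi_MA
b_hat, *_ = np.linalg.lstsq(A, r_y, rcond=None)
print(np.allclose(b_hat, b))
```

Note that only K^{2}=9\geq Q=3 compressed covariance entries are used to recover the Q expansion coefficients, illustrating the stronger compression the parametric model affords.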
Corollary 1.
A full row rank matrix {\boldsymbol{\Phi}}\in\mathbb{R}^{K\times N} is a valid graph covariance subsampler if and only if the matrix ({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}}){\boldsymbol{\Psi}}_{\rm MA} is tall, i.e., K^{2}\geq Q, and {\rm null}({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}})\cap{\rm ran}({\boldsymbol{\Psi}}_{\rm MA})=\{{\boldsymbol{0}}\}.
Proof.
Follows from Theorem 1. ∎
Although knowing the moving average filter coefficients {\boldsymbol{b}} is equivalent to knowing {\boldsymbol{R}}_{\boldsymbol{x}}, it might be interesting to study the relation between {\boldsymbol{b}} and the power spectrum {\boldsymbol{p}}. We can relate the vector {\boldsymbol{p}} and the vector {\boldsymbol{b}} by using (6) and (13). That is, we can write p_{n}=\sum_{k=0}^{Q-1}b_{k}{\lambda}_{n}^{k}, or in matrix-vector form we have
{\boldsymbol{p}}={\boldsymbol{V}}_{Q}{\boldsymbol{b}}, 
where {\boldsymbol{V}}_{Q} is an N\times Q Vandermonde matrix with (i,j)th entry equal to \lambda_{i}^{j-1}. To recover {\boldsymbol{p}} from {\boldsymbol{b}}, however, we need all the N eigenvalues of {\boldsymbol{S}} to construct {\boldsymbol{V}}_{Q}.
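The relation {\boldsymbol{p}}={\boldsymbol{V}}_{Q}{\boldsymbol{b}} can be verified numerically: any eigenbasis of {\boldsymbol{S}} also diagonalizes {\boldsymbol{R}}_{\boldsymbol{x}}, with {\boldsymbol{V}}_{Q}{\boldsymbol{b}} on the diagonal. A sketch (the cycle graph and coefficient values are arbitrary choices):

```python
import numpy as np

N, Q = 8, 3
# Undirected cycle graph adjacency as the graph-shift operator.
S = np.zeros((N, N))
for i in range(N):
    S[i, (i + 1) % N] = S[(i + 1) % N, i] = 1.0

b = np.array([1.0, 0.4, 0.2])
R_x = sum(bk * np.linalg.matrix_power(S, k) for k, bk in enumerate(b))

lam, U = np.linalg.eigh(S)                # eigenvalues and an eigenbasis of S
V_Q = np.vander(lam, Q, increasing=True)  # N x Q Vandermonde, (i,j)th entry lam_i^(j-1)

# p = V_Q b is the power spectrum: U also diagonalizes R_x with p on the diagonal.
p = V_Q @ b
print(np.allclose(np.diag(U.T @ R_x @ U), p))
```

Since {\boldsymbol{R}}_{\boldsymbol{x}} is a polynomial in {\boldsymbol{S}}, the check holds even when {\boldsymbol{S}} has repeated eigenvalues, as the cycle graph does.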
This relation between {\boldsymbol{p}} and {\boldsymbol{b}} can be used to show the equivalence between the linear models (10) and (17) as follows. The fact that {\boldsymbol{S}}^{q}={\boldsymbol{U}}{\boldsymbol{\Lambda}}^{q}{\boldsymbol{U}}^{H} from (1) allows us to express {\boldsymbol{\Psi}}_{\rm MA} in (17) as {\boldsymbol{\Psi}}_{\rm MA}=({\bar{\boldsymbol{U}}}\circ{\boldsymbol{U}}){\boldsymbol{V}}_{Q}. Using this in (17), we obtain {\boldsymbol{r}}_{\boldsymbol{y}}=({\boldsymbol{\Phi}}\otimes{\boldsymbol{\Phi}})({\bar{\boldsymbol{U}}}\circ{\boldsymbol{U}}){\boldsymbol{V}}_{Q}{\boldsymbol{b}}=({\boldsymbol{\Phi}}{\bar{\boldsymbol{U}}}\circ{\boldsymbol{\Phi}}{\boldsymbol{U}}){\boldsymbol{p}}.
In the following, we exploit the structure in {\boldsymbol{b}}, which we ignored while solving (17), to develop a constrained least squares estimator.
Remark 2 (Constrained least squares).
To reveal the structure in {\boldsymbol{b}}({\boldsymbol{h}}), let us recall the example (14) with L=3. The coefficients in {\boldsymbol{b}}({\boldsymbol{h}}) are related to the squared polynomial p(t)=(h_{0}+h_{1}t+h_{2}t^{2})^{2}, which can also be written as
p(t)={\boldsymbol{h}}^{T}\left[\begin{array}{ccc}1&t&t^{2}\\ t&t^{2}&t^{3}\\ t^{2}&t^{3}&t^{4}\end{array}\right]{\boldsymbol{h}}. 