# Predicting the evolution of stationary graph signals

###### Abstract

An emerging way of tackling the dimensionality issues arising in the modeling of a multivariate process is to assume that the inherent data structure can be captured by a graph. Nevertheless, though state-of-the-art graph-based methods have been successful for many learning tasks, they do not consider time-evolving signals and thus are not suitable for prediction. Based on the recently introduced joint stationarity framework for time-vertex processes, this letter considers multivariate models that exploit the graph topology so as to facilitate the prediction. The resulting method yields similar accuracy to the joint (time-graph) mean-squared error estimator but at lower complexity, and outperforms purely time-based methods.

## I Introduction

In the problem of modeling and predicting statistical processes, (wide-sense) stationarity is a helpful assumption, that allows us to learn the spectral characteristics of a process using very few samples. Especially for time-series prediction, learning from few samples is crucial, as one needs to estimate future values after only partially observing a single realization of the statistical process. This is the main reason why classical models for estimation and prediction of univariate processes, such as Wiener filters and auto-regressive moving average models (ARMA), rely on stationarity to produce predictions.

For multivariate statistical processes, following the same methodology is often problematic, as the number of parameters to be estimated increases quadratically with the number variables, often rendering the problem intractable. A common way to deal with this dimensionality issue is to assume that there is an inherent structure to the process that can be captured by a graph. The graph assumption appears frequently in the machine learning and signal processing literature, and has been shown invaluable for tasks such as clustering [1, 18], low-rank extraction [14], spectral estimation [12, 10] and semi-supervised learning [2, 17]. Nevertheless, despite their promise, so far state-of-the-art graph-based methods predominantly ignore the time-dimension of data.

The objective of this paper is to identify multivariate models that exploit the graph structure inherent to the data so as to facilitate the task of prediction. From a statistical perspective, our model amounts to assuming stationarity not only with respect to the time-dimension, but also with respect of the graph topology [12, 4]. The concept of joint (time-vertex) stationarity, which we introduced in [11], was shown to facilitate regression as it enforces graph structure into the covariance. Unlike our previous work however, we here focus on prediction, where causality is important as one needs to forecast the future in a timely manner.

Concretely, we bring forth a decoupling theorem that allows us to decouple a joint multivariate process into independent univariate processes. This allows us to (i) estimate the model parameters using traditional univariate techniques, and (ii) to reduce further the computational complexity by combining the training stage with (an optimal) low-rank approximation of our data. The learned models are causal and can be used to provide optimal predictions (in the mean-squared error sense) at a cost that is equivalent to a constant number of matrix-vector multiplications. Finally, we show on a synthetic and a weather dataset, that our method outperforms single variable models and yields similar accuracy to the more expensive mean-squared error estimator suitable for joint stationary processes [11].

## Ii Preliminaries

Our objective is to model and predict the evolution of graph signals, i.e., signals supported on the vertices of a weighted undirected graph , with the set of edges and the weighted adjacency matrix.

### Graph signal analysis.

The Graph Fourier Transform (GFT) of a graph signal is defined as , where is the eigenvector matrix of the (combinatorial^{1}^{1}1Though we use the combinatorial Laplacian in our presentation, our results are applicable to any positive semi-definite matrix representation of a graph or to the recently introduced shift operator [13].) Laplacian matrix .
The GFT allows us to extend filtering to graphs [6, 15, 16]. Filtering a signal with a graph filter corresponds to element-wise multiplication in the spectral domain

where the scalar function , referred to as the graph frequency response, has been applied to each diagonal entry of .

### Time-varying graph signals.

Let be a graph signal is sampled at successive regular intervals of unit length. We refer to matrix as a time-varying graph signal, and (without subscript) is the vectorized form of . The joint (time-vertex) Fourier transform (JFT) of is

(1) |

where, once more, is the graph Laplacian eigenvector matrix, whereas is the complex conjugate of the DFT matrix. In fact, is the eigenvector matrix of the lag operator . Denote by the diagonal matrix of angular frequencies (i.e., ), we have , where . Expressed in vector form, the joint Fourier transform becomes , where is unitary, and operator denotes the kroneker product. A joint filter is a function defined in the joint spectral domain that is evaluated at the graph eigenvalues and the angular frequencies . The output of a joint filter is

(2) |

and where is a diagonal matrix with and . For an in-depth discussion of JFT and its properties, we refer the reader to the work by Loukas and Foucard [8].

## Iii Modeling jointly stationary signals

The first step in predicting the evolution of a signal is to choose a good model for it. Motivated by the importance of stationarity for modeling statistical processes, our recent work generalized stationarity to time-varying graph signals [11]. The following is a variant of the definition presented in [12].

###### Definition 1 (JWSS process).

A process is called Jointly (or time-vertex) Wide-Sense Stationary (JWSS), if and only if (i) , and (ii) its covariance matrix is diagonalizable by the joint Fourier basis . The JFT of the autocorrelation function of process is referred to Joint Power Spectral Density (JPSD).

This definition is a generalization of the classical notion of stationarity, where now one assumes simultaneously wide-sense stationarity w.r.t. both the time and vertex domains. Indeed, for the combinatorial Laplacian, the first condition is equivalent to the first moment being constant. The second condition is a generalization of the invariance w.r.t. translation using the localization operator. Joint stationarity assumes the covariance to be driven by a joint filter , referred to as JPSD, that encodes the time and vertex relations in the signal. For an in depth discussion of the above definition, we encourage the reader to read the original publications [12, 11].

The question we will address is: “how to construct models that concisely capture the characteristics of a jointly stationarity process, in order to facilitate short-term prediction?”.

### Joint non-causal models.

Since the covariance of a jointly stationary signal is diagonalizable by , without loss of generality, we can constrain ourselves to models of which take the form

(3) |

where the innovation vector is a random vector of some arbitrary distribution, with mean in the null space of and identity covariance matrix , implying that the above model is equivalent to the one considered in our previous work [11]. Moreover, matrices and are arbitrary joint filters (and not necessarily polynomials). According to proposition 1, model (3) is natural, as the space of JWSS signals is exactly equal to that of signals generated by it.

###### Proposition 1.

Process is JWSS iff it is generated by (3).

###### Proof.

To establish “if” direction, we have to show that the two conditions of joint stationarity are always met for any output of the model. Let such that , then

It is easy to confirm that , since, if lies in the null-space of , then so does . The covariance of is

which is diagonalizable by . To prove the “only if” direction, we observe that every JWSS signal with mean and JPSD can be produced by the model by setting , and (this is always possible as covariance matrices are positive semi-definite). ∎

### Joint causal models.

Yet, despite its generality, model (3) is not always causal. This is problematic for the task of prediction, where one needs to forecast the future in a timely manner. It is more practical to assume that the output at time can be expressed as a function of the input-output variables at previous timesteps, yielding the joint causal model

(4) |

for some model orders and , and with being the -th column of matrix . Moreover, for the canonical form, we set . Obviously, every causal model can be written in this form for and . For this reason, we refer to processes generated by the above model as joint causal processes.

## Iv Predicting evolving graph signals

Our objective is to forecast to the evolution of an observed jointly stationary process with zero mean and JPSD . To start with, we will suppose that is a joint causal process, with known parameters and and we will derive an optimal predictor for the future values of the process. In the second part, we show how to estimate a joint causal model that approximates an arbitrary process (not necessarily causal). Finally, we demonstrate how to reduce the computational complexity of model estimation.

### Prediction.

Suppose that is the output of a joint causal model, where the input has zero mean and identity covariance. This implies that abides to

(5) |

Let the subscript denote that the random vector is conditioned on the (already observed) vectors . We predict vector , based on , thus obtaining the one-step predictor , using the conditional expectation

(6) |

where above we have exploited the fact that . Notice that, in the third step we compute the unobserved input vectors as the difference between the observations and the predictions in previous time-steps. In the following, we will show that this corresponds to the best possible choice, as it yields the minimum possible mean-squared error. We obtain a -step predictor by repeating the above computation times.

One maybe tempted to apply a similar procedure in order to predict unknown values along the graph dimension. The model above however does not render itself suitable for such a task, even when functions and are polynomials (and thus locally computable in the graph). In contrast to prediction along the time domain, in purely graph prediction (or inpainting) the value of a node is dependent on all of its neighbors. Therefore, node values cannot be predicted in isolation, but have to be solved jointly.

### Error analysis.

The one-step prediction error is . Similar to the classical case [7], depends only on the unknown innovations and the mean-squared error achieved is the smallest possible. To see this, we first need to show that , or equivalently that . Expanding (6) in the definition of , we have that

(7) |

Therefore, for every , which, under the assumption that the noise model is invertible^{2}^{2}2System has exactly one solution when matrix is invertible, or equivalently when is invertible for each ., implies .
Directly, we find that and the one-step mean-squared error is equal to

(8) |

Since is unknown at time , the above corresponds to the smallest achievable mean-squared error.

### Model estimation.

Suppose that we want to identify the parameters of a joint causal model (i.e., functions and for every and ) which best match an observed process . The canonical way to achieve this would be to minimize the prediction error residuals by solving the following (non-linear) system of equations involving unknowns^{3}^{3}3A matrix function of an matrix has degrees of freedom.

(9) |

where by we refer to the causal model based on and . In the following, we will use the decoupling theorem in order to simplify this problem by splitting it to a number of independent and well-studied problems with smaller complexity.

###### Theorem 1 (Decoupling theorem).

Let and abide to model (4)

(10) |

where is a symmetric diagonalizable matrix written as . Set and . Up to an isometric rotation by , the input-output relation of and for every is given by an ARMA(P,Q) model , with and scalars.

###### Proof.

Therefore, under the appropriate isometric rotation, the joint causal model decomposes into independent ARMA models, with two important consequences: (i) in the graph frequency domain, the joint causal model estimation problem can be split to independent problems, each involving equations and unknowns. (ii) Despite being non-linear, each of the problems we now have to solve corresponds to fitting an ARMA to a time-series. We can therefore use a number of well understood methods to solve it, such as the subspace Gauss-Newton approach of Wills and Ninness [19]. The exact coefficients of the joint causal model are then found by an inverse graph Fourier transform. We note that, especially for small to medium sized graphs, the cost of eigenvalue decomposition (inherent in joint models), is overshadowed by that of model estimation.

### Low-rank models.

The computational overhead of model estimation can be reduced by purposely ignoring a subset of the data. For instance, one may consider only a subset of the time-series and ignore the rest, saving considerably in terms computation complexity. Nevertheless, such a simplistic scheme will incur significant error when the variance of the data is distributed evenly among all time-series. To capture more concisely the variance inherent the data, we can instead perform the selection process after first rotating the data. Concretely, denote by an index set of size and let be a unitary (rotation) matrix. The low-rank approximation of is

where the diagonal indicator matrix has if , and zero otherwise. For prediction, which is an online task, the low-rank approximation needs to be performed solely w.r.t the graph dimension. For offline regression tasks, a low-rank approximation scheme could also consider the time-dimension.

According to Theorem 2, if we are interested in the expected behavior, the best low-rank approximation of a JWSS process uses the eigenvectors of the graph to rotate the data (corresponding to a GFT). Nevertheless, by the decoupling theorem, rotation by also decouples the time-series, and is the first step of model estimation. Therefore, the low-rank approximation can be effortlessly combined with model estimation, by only modeling the time-series in the decoupled space specified by set . Using this scheme, we attain a complexity reduction by a factor of for model estimation.

###### Theorem 2.

Let be a zero-mean JWSS process. The best rank- approximation of is given by

where contains the indices of the top- diagonal elements of .

###### Proof.

The proof is inspired by the Eckart–Young–Mirsky theorem [3, 9]. Let us define . Since is unitary the expected approximation error is equivalent to

By Theorem 2 of [11], for each time instant , the graph signal is stationary with covariance , which implies that

for . As a result, we observe that is an matrix jointly diagonalizable with the graph Laplacian . Let us order such that . We can write

(13) |

Setting , the above expression becomes

(14) |

where in the third step, we use the fact that is bounded by and can be at most times. The last expression shows that a global minimum is achieved for , where contains the largest components of function . ∎

## V Evaluation

We compare the joint causal models proposed in this paper, with (i) disjoint causal models, which model and predict each time-series independently (using ARMA models), and (ii) with joint non-causal models, which consider all relationships between variables. A disjoint model (which corresponds to setting ) effectively ignores the graph topology and models each time-series as a causal WSS (wide-sense stationary in time) process. This is a valid approach as any JWSS process also WSS (see Theorem 2 in [11]). On the other hand, prediction using a non-causal joint model corresponds to the minimum mean-squared error estimator in the general case, and to a MAP estimator in case the process is Gaussian distributed [11]. Thus, if the JPSD is correctly estimated, non-causal joint predictors are optimal—though at the expense of higher computation per prediction and space (for storing the JPSD), as compared to causal methods.

In our experiments, we split the data (along the time-dimension) in two halfs and used the first for model estimation. Then, for each time we computed the relative -step prediction error . We report the median prediction error and use errorbar to indicate one standard deviation. We also insert a small horizontal offset to improve visibility. Though the causal models (joint and disjoint) estimated in our experiments were not sensitive to the selected order, in the following we illustrate prediction only for the best orders (always below 3), identified using exhaustive search.

### Wave equation.

We simulated steps of a wave equation [5] under white Gaussian input on a random geometric graph of nodes and average degree 5. The scenario was motivated by the fact that the wave partial different equation has a closed form solution that corresponds to a non-separable ^{4}^{4}4A JPSD is separable if it can be written as , implying an independence of the time and graph dimensions and causal JPSD. Our experiments were repeated 10 times. As shown in Fig. 1, the tight dependence between temporal and graph frequency contents present in a wave render prediction using a disjoint causal model inaccurate (by ignoring the graph dimension, a purely time-based method cannot capture the strong correlations between variables). On the other hand, both joint methods closely approximate the optimal solution (corresponding to a MAP estimator that uses the ground truth covariance).

### Weather dataset.

The second experiment used a weather dataset depicting the temperature of weather stations in the region of Brest (France), over a span of hours^{5}^{5}5Access to the raw data is possible directly from https://donneespubliques.meteofrance.fr/donnees_libres/Hackathon/RADOMEH.tar.gz. The graph used was a 3-nearest neighbor graph built from the coordinates of the weather stations. Fig. 2 depicts the prediction error for 1 to 5 steps (hours) in the future. Since in this case the JPSD is (almost) separable, the disjoint causal predictor gives relevant predictions. Nevertheless, it is always outperformed by joint models. Note that the optimal predictor is not illustrated, since the ground truth JPSD of the data is unknown. We remark that the joint non-causal method has been shown to outperform non-stationarity based methods for prediction (such as TV and Tikhonov extrapolation) [12].

Lastly, Fig. 3 examines the effect of low-rank prediction to the 2-steps prediction error and model estimation time for the two causal models. In this experiment, we tested low-rank predictors which ignored a given percentage of the data variance (as estimated by the training data). Moreover, we considered the full dataset of hours. As supported by our theoretical results, performing the approximation in the graph spectral domain (joint causal) far outperforms a naive approximation in the native graph domain (disjoint causal), yielding a significant computational benefit at the expense of only a small decrease prediction accuracy.

## Vi Conclusion

While, this contribution focuses on processes living on graphs, the proposed method can be applied to traditional multivariate random processes, by connecting them with a nearest neighbors graphs. In this case, compared to classical multivariate ARMA models, the proposed method requires less training data and a smaller amount of computation because the number of parameters to be estimated is low (less than ). A more detailed comparison is left for future work. Note also that the current method is limited by the graph Fourier transform that require the diagonalization of the graph Laplacian . How to scale further than a few thousands nodes is still an open question that requires to a clever replacement of the decoupling theorem. Solving this issue would allow us to model user behaviors on large graphs such as Wikipedia.

## References

- [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pages 585–591, 2001.
- [2] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on riemannian manifolds. Machine learning, 56(1-3):209–239, 2004.
- [3] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
- [4] Benjamin Girault. Stationary graph signals using an isometric graph translation. In Signal Processing Conference (EUSIPCO), 2015 23rd European, pages 1516–1520. IEEE, 2015.
- [5] Francesco Grassi, Nathanael Perraudin, and Benjamin Ricaud. Tracking time-vertex propagation using dynamic graph wavelets. arXiv preprint arXiv:1606.06653, 2016.
- [6] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
- [7] Lennart Ljung. System identification. In Signal Analysis and Prediction, pages 163–173. Springer, 1998.
- [8] Andreas Loukas and Damien Foucard. Frequency analysis of temporal graph signals. arXiv preprint arXiv:1602.04434, 2016.
- [9] Ivan Markovsky. Structured low-rank approximation and its applications. Automatica, 44(4):891–909, 2008.
- [10] Antonio G Marques, Santiago Segarra, Geert Leus, and Alejandro Ribeiro. Stationary graph processes and spectral estimation. arXiv preprint arXiv:1603.04667, 2016.
- [11] Nathanael Perraudin, Andreas Loukas, Francesco Grassi, and Pierre Vandergheynst. Towards stationary time-vertex signal processing. arXiv preprint arXiv:1606.06962, 2016.
- [12] Nathanaël Perraudin and Pierre Vandergheynst. Stationary signal processing on graphs. arXiv preprint arXiv:1601.02522, 2016.
- [13] Aliaksei Sandryhaila and José MF Moura. Discrete signal processing on graphs. IEEE transactions on signal processing, 61:1644–1656, 2013.
- [14] Nauman Shahid, Nathanael Perraudin, Vassilis Kalofolias, Gilles Puy, and Pierre Vandergheynst. Fast robust pca on graphs. IEEE Journal of Selected Topics in Signal Processing, 10(4):740–756, 2016.
- [15] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. Signal Processing Magazine, IEEE, 30(3):83–98, 2013.
- [16] David I Shuman, Benjamin Ricaud, and Pierre Vandergheynst. Vertex-frequency analysis on graphs. Applied and Computational Harmonic Analysis, 40(2):260–291, 2016.
- [17] Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning theory and kernel machines, pages 144–158. Springer, 2003.
- [18] Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555–586, 2008.
- [19] Adrian Wills and Brett Ninness. On gradient-based search for multivariable system estimates. IEEE Transactions on Automatic Control, 53(1):298–306, 2008.