Sparse plus Low rank Network Identification:
A Nonparametric Approach\thanksreffootnoteinfo
Modeling and identification of high-dimensional stochastic processes is ubiquitous in many fields. In particular, there is a growing interest in modeling stochastic processes with simple and interpretable structures. In many applications, such as econometrics and biomedical sciences, it seems natural to describe each component of that stochastic process in terms of few factor variables, which are not accessible for observation, and possibly of few other components of the stochastic process. These relations can be encoded in graphical way via a structured dynamic network, referred to as “sparse plus low-rank (S+L) network” hereafter. The problem of finding the S+L network as well as the dynamic model can be posed as a system identification problem. In this paper, we introduce two new nonparametric methods to identify dynamic models for stochastic processes described by a S+L network. These methods take inspiration from regularized estimators based on recently introduced kernels (e.g. “stable spline”, “tuned-correlated” etc.). Numerical examples show the benefit to introduce the S+L structure in the identification procedure.
Padova]Mattia Zorzi, Padova]Alessandro Chiuso
Dipartimento di Ingegneria dell’Informazione, Università degli studi di Padova, via Gradenigo 6/B, 35131 Padova, Italy
Key words: Linear system identification, Sparsity and low-rank inducing priors, Kernel-based methods, Gaussian processes.
In many applications, high-dimensional data are measured to describe the underlying phenomena. These data are typically measured over time and can thus be modeled as a stochastic process whose components are referred to as manifest variables, i.e. accessible for observation. Very often, in high dimensional time series modeling it is necessary to control the model complexity to be able to obtain sensible results from a finite set of measured data. In addition, models should be interpretable in the sense of providing an insight into the data generation mechanism.
One possible way to limit the complexity is to postulate, often very reasonably, that these observations share some common behaviour (comovements) which, in turn, can be described by a small set of (unmeasurable) variables, called factor variables.
In this paper we shall extend and merge these ideas, building so called sparse plus low-rank (S+L) models, Zorzi & Chiuso (2015); our aim is to model processes which Granger’s causality structure is not necessarily sparse, but it may become so, in an appropriate manner to be defined later on, after some latent variables (called factors in analogy with factor models) are added.
In this way, the relations among manifest variables and factor variables will be described through a two-layer graph (or network) where the (few) nodes in the top (and hidden) layer denote the factor variables, whereas the ones in the bottom (and visible) layer denote the manifest variables. The direct relation, in a sense to be precisely defined later on, between two nodes is encoded by the presence of a connecting directed edge. If the relations among manifest variables are mostly encoded through the factor variables and the number of the latter is small (as compared to the number of manifest variables), this graph will be referred to as sparse plus low-rank (S+L) network; in fact, as we shall see, its structure translates into a S+L structure for the dynamic model of the manifest process. In particular, the rank of the low-rank component will coincide with the number of factor variables whereas the sparse component depends on the number of edges among the nodes representing the manifest variables. This modeling framework seems natural in many applications, such as biomedical sciences (Liegegois et al., 2015), econometrics (Zorzi & Sepulchre, 2015a; Chandrasekaran et al., 2010; Choi et al., 2010), and so on. Finally, this S+L structure have been considered also to perform robust principal component (PCA) analysis (Candès et al., 2011).
In the first part of this paper we propose a new S+L network defined as a directed graph wherein edges encode conditional Granger causality dependences (Granger, 1969; Ding et al., 2006) among variables. In the special case where there is no factor variable our model coincides with the sparse model considered in Chiuso & Pillonetto (2012). Moreover, the S+L model we propose is strictly connected to so called quasi-static factor models (Heij et al., 1997; Deistler & Zinner, 2007; Deistler et al., 2015). Therefore, our S+L model can be understood as blend of those two models.
Within this framework, we shall formulate an identification problem to select, using a finite set of measured data, the most appropriate S+L model (and thus the corresponding network).
A consolidated paradigm in system identification is the so-called prediction error method (PEM), see Ljung (1999); Söderström & Stoica (1989). In the traditional setting, candidate models are described in fixed parametric model structures, e.g. ARMAX. However, there are two main difficulties in this setting. First a model selection problem (i.e. order selection), usually performed by AIC and BIC criteria (Akaike, 1974; Schwarz, 1978), needs to be solved. Second, the parameterization of the predictor is non-linear, so that minimizing the squared prediction error leads to a non-convex optimization problem. Regularization has been recently introduced in the PEM framework, see Pillonetto et al. (2011); Pillonetto & De Nicolao (2010); Chen et al. (2012); Chiuso & Pillonetto (2012); Pillonetto et al. (2014); Chiuso (2016), as an alternative approach to control complexity of the estimated models. With this latter setting we search the candidate model, described via the predictor impulse responses, in an infinite dimensional nonparametric model class; the inverse (ill-posed) problem of determining a specific model using a finite set of measured data can be made into a well posed one using a penalty term, whose duty is to favor models with specific features. In the Bayesian view, this is equivalent to the introduction of an a priori probability (prior) on the unknown model. In the nonparametric Gaussian regression approach proposed in Pillonetto et al. (2011), this prior distribution is completely characterized by the covariance function, known also as kernel in the machine learning literature. The kernel encodes the a priori knowledge about the predictor impulse responses. In our case, the a priori knowledge is that the predictor impulse responses must be Bounded Input Bounded Output (BIBO) stable and respect the S+L structure. Starting from these a priori assumptions, we derive the corresponding kernel by using the maximum entropy principle. In particular, we consider two possible alternative formulations to endow this a priori properties in the kernel, and thus two different identification algorithms.
These kernels are characterized by the decay rate of the predictor impulse responses, by the number of conditional Granger causality relations among the manifest variables and by the number of factor variables. This ensemble of features is not known and is characterized by the so called hyperparameters vector. The latter is usually estimated by minimizing the negative log-marginal likelihood of the measured data (Rasmussen & Williams, 2006). In our case, the challenge is to perform the joint estimation of the hyperparameters tuning the sparse part and the low-rank one. Indeed, it should be observed that the S+L decomposition of a given model might not be unique. On the other hand, once the hyperparameters vector is fixed the uniqueness of the estimated model will be guaranteed through regularization. To estimate the hyperparameters minimizing the negative log-marginal likelihood, we propose an algorithm imposing an “hyper regularizer” on the low-rank hyperparameter to partially handle the non-uniqueness of the S+L decomposition.
Numerical experiments involving both S+L and generic models show the effectiveness of our identification procedure both in terms of complexity, predictive capability and impulse response fit.
The outline of the paper follows. In Section 2, we introduce the S+L models. In Section 3, we introduce the S+L identification problem. In Section 4, we derive the maximum entropy kernels inducing BIBO stability, sparsity and low-rank. Section 5 deals with the estimation of the hyperparameters vector. In Section 6, we provide some numerical examples to show the effectiveness of our method. Finally, conclusions are drawn in Section 7. In order to streamline the presentation all proofs are deferred to the Appendix.
Throughout the paper, we will use the following notation. is the set of natural numbers. Given a finite set , denotes its cardinality. denotes the expectation, while denotes the conditional mean. Given three (possibly infinite dimensional) random vectors , and we say that is conditionally independent of given if
Given , denotes the entry of in position . denotes the vectors space of symmetric matrices of dimension . is the cone of symmetric positive definite matrices of dimension , and denotes its closure. denotes the space of -valued infinite length sequences, which we think as infinite dimensional column vectors , , , such that . is the space of matrices of sequences in
where , and . denotes the space of -valued infinite length sequences such that . is defined in similar way. denotes the space of symmetric infinite dimensional matrices such that . is the space of symmetric infinite dimensional matrices
where , . Given , and , the products and are understood as matrices whose entries are limits of infinite sequences (Jorgesen et al., 2011). Given and , . The definition is similar in the case that and have finite dimension. With some abuse of notation the symbol will denote both the complex variable as well as the shift operator . Given a transfer matrix , , of dimension , with some abuse of terminology, we say that has rank , with , if it admits the decomposition where is a matrix and is a transfer matrix. Given a stochastic process , its -th component is denoted by . With some abuse of notation, will both denote a random vector and its sample value. From now on the time will denote present and we shall talk about past and future with respect to time . With this convention in mind,
denotes the (infinite length) past data vector of at time . In similar way, denotes the past data vector of at time .
2 Sparse plus Low rank Models
Consider a zero-mean stationary and Gaussian stochastic process taking values in . Let be manifest, i.e. it can be measured, and described by the innovation model
where and are BIBO stable transfer matrices of dimension and is a zero mean white Gaussian noise (WGN) with covariance matrix . The minimum variance one-step ahead predictor of based on the past data , denoted by is given by
so that is the one-step-ahead prediction error
Often is approximated by a “simple structure” in order to simplify the analysis of the underlying system. One possible simple structure is sparse, i.e. many of its entries are null transfer functions (Chiuso & Pillonetto, 2012). Sparsity of the predictor transfer function encodes the Granger’s causality (Granger, 1969) of and can be graphically represented with a Bayesian network (see below). However, in many interesting applications (Zorzi & Sepulchre, 2015a; Kolaczyk, 2009; Werhli et al., 2006; Forni et al., 2004) the components in are strongly related through common factors, not accessible for observation. In such situations approximating with a sparse matrix will likely yield to poor results.
To overcome this limitation we consider a more general, but still “simple”, structure by introducing an dimensional process (), called factor. We assume that is zero-mean, jointly stationary and Gaussian with . The underlying idea is that the (few) components of the factor are able to capture the common movement of so that, conditionally on , the predictor of can now be well approximated by a sparse model. This can be formalized assuming that can be modeled as follows:
where is the so-called factor loading matrix and, in analogy with dynamic factor models, is called the “common component” or “latent variable” of (Deistler & Zinner, 2007); is a BIBO stable and sparse transfer matrix; is a white noise process uncorrelated with the past histories of and . As a result, the one-step ahead predictor of is given by
where is the minimum variance estimator of based on , that is
where is a BIBO stable transfer matrix of dimension . We conclude that takes the form in (5) with where . Hence, has a sparse plus low-rank (S+L) structure.
It is possible to describe the structure of model (6) using a Bayesian network (Lauritzen, 1996) having two layers. The nodes in the top layer correspond to the scalar processes , i.e. the components of , whereas the ones in the bottom layer correspond to the scalar processes , i.e. the components of , see Figure 1.
Then, the connections among the nodes obey the following rules:
there is a directed link from node to node if
i.e. if is needed to predict given with and . In this case, we shall say conditionally Granger causes
there is a directed link from node to node if
i.e. if is needed to predict given with . In this case, we shall say conditionally Granger causes
there is a direct link from node to node if
i.e. if is needed to predict given and with . In this case, we shall say conditionally Granger causes .
Therefore, the S+L model in (6) represents a network wherein manifest variables Granger cause each other mostly through few factor variables. In Figure 1 we provide an example of a network describing a S+L model with manifest variables and factor variable. In particular, is conditionally Granger caused by and , with is conditionally Granger caused only by , and is conditionally Granger caused by . Therefore, the corresponding S+L model in (6) has with only one nonnull entry in position (5,1) and has rank equal to one.
The decomposition of a transfer matrix into sparse plus low-rank, i.e. may not be unique. As noticed in (Chandrasekaran et al., 2010), this degeneracy may occur when is sparse, i.e. the factor variables are not sufficiently “diffuse” across the manifest variables, or the degree of sparsity of is low, i.e. there are manifest variables conditionally Granger caused by too many manifest variables. On the other hand, it is possible to derive conditions under which such a decomposition is locally identifiable. We do not tackle this nonidentifiability issue in this paper, although it is important. Indeed, our aim is to find one S+L decomposition (see the next Section) which is not necessarily unique.
2.1 Connections: feedback models and quasi-static factor models
It is also interesting to consider a “feedback” representation of the joint process and see how this connects with S+L model (6) and other models already considered in the literature, such as quasi-static factor models (Deistler & Zinner, 2007).
where the driving noises and are uncorrelated and jointly stationary processes (possibly non-white) and the feedback interconnection (9) is internally stable, namely
is analytic inside the closed unit disc and . In particular, given an analytic and minimum phase spectral factor
of the joint spectrum of , lower block triangular and normalized at infinity (i.e. , , ), the transfer function is given by (see e.g. eq. (3.4) in Gevers & Anderson (1982)) .
The predictor for given the past can be written, using (9), as
Hence, model (6) is recaptured when
(i.e. is constant)
Note that condition (a) above is equivalent to , while complete freedom is left to the other entries of .
To summarize, our model (6) is equivalent to assuming that there is a process , jointly stationary with , so that the pair admits the internally stable feedback representation:
with uncorrelated (and possibly non-white) noises and .
It is worth noting that when , model (10) can be understood as a quasi-static factor model (Deistler & Zinner, 2007) where is the factor loading matrix, is the -dimensional factor and is colored noise uncorrelated with , such that its minimum variance one-step ahead predictor based on is sparse: .
3 Problem Formulation
Assume measured data are available from the manifest process generated by (6). The factor process cannot be measured nor its dimension is known. In this section, we address the problem of estimating and from .
The transfer matrix is parameterized in terms of its impulse response coefficients . In particular, defining to be the impulse response from input to output of the transfer matrix , we have:
The coefficient vector is defined as follows:
Similarly, the impulse response coefficients parameterizing the transfer matrix are stacked in as done above for . We first introduce as
The measured data are stacked in the vector as follows
The vector is defined analogously. Let us also introduce the Toeplitz matrix :
with and .
Then, we define the regression matrix as:
so that, from (6) the vector containing the measured data satisfy the linear model\@footnotemark\@footnotetext The matrix contains in principle data in the remote past which are not available. In practice model (18) needs to be approximated truncating (and thus and ). This corresponds to assuming zero initial conditions, and it is a reasonable approximation given the decay, as a function of , of the coefficients and (which is a necessary condition for BIBO stability). We shall not enter into these details in the paper. See also Remark 3.
where is the one-step ahead predictor of .
Therefore, our S+L identification problem can be formulated in terms of PEM as follows.
Find corresponding to a S+L model minimizing the prediction error norm .
Following the nonparametric Gaussian regression approach in Pillonetto et al. (2011), we model and as two zero-mean processes with kernels and , respectively. These kernels may depend upon some tuning parameters, usually called hyperparameters and denoted with hereafter. As illustrated in Section 4, according to the maximum entropy principle and will be modeled as Gaussian and independent. In the following and denote the reproducing Hilbert spaces (Aronszajn, 1950) of deterministic functions on , associated with and , with norm denoted by and , respectively. We assume that the past data neither affects the a priori probability on and nor carries information on and (Poggio & Girosi, 1990), that is
be, respectively, the minimum variance estimators of and given , and . In what follows, and denote the transfer matrices corresponding to and , respectively. The next Proposition shows that and are, almost surely, solution to a Tikhonov-type variational problem and belong to the spaces and , respectively.
Under the assumption that is a second order stationary process and under approximation (3), almost surely we have
Moreover, almost surely:
The semi-infinite regression matrix depends on both and . Since is never completely known, a solution to handle the initial conditions consists of setting its unknown components to zero. In this way, the introduced error goes to zero as increases (Ljung, 1999, Section 3.2). Alternatively it would be possible to incorporate initial conditions in the estimation problem, e.g. modeling also the free response of the system. This is however outside the scope of the paper and is only practically relevant when very slow dynamics are present.
The main task now is to design the kernels and in such a way that and are almost surely BIBO stable while favouring to be sparse and to be of low-rank.
4 Maximum entropy priors
One way to derive a probability law for the joint process under desired constraints rests on the maximum entropy principle. The most common justification of maximum entropy solutions relies on “information” arguments essentially stating that the maximum entropy distribution is the one which entails the maximum “uncertainty” under the given constraints. There is another and very important motivation for adopting the maximum entropy solution: Shore & Johnson (1980) have shown that maximum entropy is the unique correct method satisfying some minimal consistency axioms; basically, these axioms state that the solution should be consistent when “there are different ways of taking the same information into account”.
We shall make the rather mild assumption that the process is zero-mean and absolutely continuous with respect to the Lebesgue measure. We will see that, under suitable constraints, the optimal solution (i.e. maximizing the differential entropy) is a Gaussian process where and are independent. Then, we will also characterize the corresponding kernels. In what follows, we propose two different ways to enforce BIBO stability and low-rank on . This leads to two different types of kernel for .
4.1 First type of kernel
We start with the constraints on inducing BIBO stability and sparsity on . To do so, we exploit the following proposition:
Let be a strictly positive definite kernel (in the sense of Moore) such that , , with and . Let also be a zero-mean process which satisfies the moment constraint
where . Then, for any there exists such that the covariance function (kernel) of satisfies:
Thus, we consider the constraint on
where , , and . If , then is the null sequence in mean square and so is its posterior mean.
It is not difficult to see that the assumption on in Proposition 4 is satisfied by the kernels usually employed in the identification of dynamical models (e.g. stable spline, tuned/correlated and so on, see Pillonetto et al. (2014)). Thus, by (26) the covariance of the -th element of decays exponentially. We conclude that the posterior mean of the transfer function in position of , under constraint (26), is BIBO stable. Moreover, it is null if and only if .
Clearly, if in (26) is chosen as the covariance matrix of DC (TC) prior, the resulting Maximum Entropy prior coincides with the DC (TC) prior. Some recent literature has discussed the maximum entropy properties of some kernels, such as the TC or DC kernels, see Nicolao et al. (1998), Carli et al. (2014) and Chen et al. (2015), as well as extensions to more articulated kernels Prando et al. (2015). In these papers it was shown for TC and DC kernels that weaker constraints with respect to (26) lead to the same solution. We refer the reader to these papers form more details; here we stick to the simpler conditions (26) in order to streamline the derivation as well as to allow alternative choices of the matrix in (26).
In view of Proposition 4, to guarantee BIBO stability on we also impose the constraint
for some such that .
We now switch our attention to the low rank property of the matrix . Let be the random semi-infinite matrix defined as
and consider the constraint on
with . If has singular values equal to zero, so does the covariance ; therefore the posterior mean of has rank less than or equal to thus admitting the decomposition
In order to build the desired prior distribution we make use of Kolmogorov extension Theorem, see Øksendal (1998), and work with finite vectors extracted from processes and .
To do so, let us consider a finite index set in . Let and be the random vectors whose components are extracted, respectively, from the process and according to the index sets and . We denote by the joint probability density of and . By Kolmogorov extension Theorem the joint process can be characterized by specifying the joint probability density for all finite sets . Thus, the maximum entropy process can be constructed building all marginals using the maximum entropy principle, which can thus be extended by Kolmogorov extension theorem. Such principle states that among all the probability densities satisfying the desired constraints, the optimal one should maximize the differential entropy (Cover & Thomas, 1991)
where and are the vectors extracted from and according to the index set and , respectively. and are the kernel matrices whose entries are extracted from according to and , respectively. is the matrix whose blocks are extracted from according to . Therefore, we obtain the following maximum entropy problem
where is the class of probability densities in which are bounded and taking positive values.
Under the assumption that , , and , the unique optimal solution to the maximum entropy problem (4.1) is such that and are independent, Gaussian with zero mean and kernel matrix, respectively,
where , , and .
In what follows we assume that constraints (33) and (34) are totally binding in problem (4.1). Since and are the Lagrange multipliers associated to those constraints, it is not difficult to see that in this situation and . Moreover, we define . As noticed before, we are interested in the limiting cases where might be equal to zero and might be of low-rank in order to have sparse and low-rank estimators. To include these scenarios, we consider the limits as and tends to a low rank matrix and extend the maximum entropy solution by continuity.
Let and . Then, the maximum entropy solution extended by continuity is the probability density such that and are independent, Gaussian, zero-mean, and with kernel matrices
where if and only if and if and only if .
Finally, in view of Kolmogorov extension Theorem, from the probability density of we can characterize the probability law of maximizing the differential entropy.
Consider the joint process where and are Gaussian independent processes with kernels, respectively,
where such that , , and zero otherwise. For all finite sets , its joint probability density is the extended solution to the maximum entropy problem (4.1).
It is worth noting that the maximum entropy kernels are characterized by the hyperparameters , , which control sparsity on , tuning the rank (and column space) of , while (see Proposition 4) controls the decay rate (as a function of the time index ) of the estimators and and thus BIBO stability of and , see Pillonetto & De Nicolao (2010). Finally, represents a trade-off between BIBO stability and low-rank on . The structure of is very general. We suggest that can be reparameterized by introducing few hyperparameters to reduce its degrees of freedom. In Section 5 we will propose one possible reparameterization.
It is possible to derive the same structure for by adopting the regularization point of view. In Wipf (2012), it has been shown that the penalty term , with , induces low-rank on . Moreover, the term admits the variational upper bound
where and equality holds if and only if . Consider the random vector extracted from according to . Thus, we can induce low-rank on by considering the penalty
4.2 Second type of kernel
The BIBO stability and low-rank constraints on can be imposed by using only one constraint. Consider the random semi-infinite matrix defined in (28) and the constraint
with . Similarly to the previous case, if the null space of has dimension , then the posterior mean of has rank less than or equal to ; therefore also has rank less than or equal to . This statement is formalized in the following proposition:
Assume that is strictly positive definite and such that , , with and . Then, under constraint (43), for any there exists such that , ,