Maximum Entropy Vector Kernels for MIMO system identification
Recent contributions have framed linear system identification as a nonparametric regularized inverse problem. Relying on -type regularization which accounts for the stability and smoothness of the impulse response to be estimated, these approaches have been shown to be competitive w.r.t classical parametric methods. In this paper, adopting Maximum Entropy arguments, we derive a new penalty deriving from a vector-valued kernel; to do so we exploit the structure of the Hankel matrix, thus controlling at the same time complexity, measured by the McMillan degree, stability and smoothness of the identified models. As a special case we recover the nuclear norm penalty on the squared block Hankel matrix. In contrast with previous literature on reweighted nuclear norm penalties, our kernel is described by a small number of hyper-parameters, which are iteratively updated through marginal likelihood maximization; constraining the structure of the kernel acts as a (hyper)regularizer which helps controlling the effective degrees of freedom of our estimator. To optimize the marginal likelihood we adapt a Scaled Gradient Projection (SGP) algorithm which is proved to be significantly computationally cheaper than other first and second order off-the-shelf optimization methods. The paper also contains an extensive comparison with many state-of-the-art methods on several Monte-Carlo studies, which confirms the effectiveness of our procedure.
]Giulia Prando, ]Gianluigi Pillonetto, ]Alessandro Chiuso
Department of Information Engineering
University of Padova, Padova (Italy)
Although linear system identification is sometimes considered a mature field, with a wide and solid literature summarized in the well known textbooks [35, 51], the recent developments on regularization based methods have brought new insights and opened new avenues. The most common “classical” approaches are parametric Prediction Error Methods (PEM) [35, 51], where model classes (OE, ARMAX, Box-Jenkins, state-space, etc.) are described by a finite dimensional parameter vector which is estimated minimizing the squared prediction errors, and subspace methods, which translate ideas from stochastic realization theory [17, 32] into algorithms which work on measured data .
These techniques require that a model complexity (the order hereon) is fixed, and thus estimated, first.
As an alternative to the standard parametric approach, recent literature has proposed a Bayesian perspective, leading to a class of regularized methods [43, 42, 10, 44, 8, 58, 59]. The use of Bayesian inference is not new in the field of identification and time-series estimation: early works on this topic appeared in the late’70, early ’80 [1, 21, 30, 25]; see  for an overview.
The Bayesian paradigm considers the impulse response as a stochastic process whose prior distribution penalizes undesired systems (e.g. unstable ones). This allows to face the so-called bias/variance trade-off by jointly performing estimation and model selection.
In [43, 42, 10] prior distributions are designed to encode smoothness and stability of the impulse response to be estimated, leading to -type penalties so that closed-form solution are available. These priors can also be shown to be solutions of Maximum Entropy problems, see [46, 39, 5, 9].
In this paper, we focus on the identification of multi input-multi output (MIMO) systems, where matrix impulse responses have to be identified. Similar problems are encountered in multi-task learning where one would like to simultaneously estimate multiple functions while also exploiting their mutual information. To this aim [6, 2, 36, 22, 45] have considered vector-valued kernels which account for the smoothness of the functions to be estimated. In the identification of finite dimensional linear MIMO systems, the coupling between different input-output channels is captured by Hankel matrix, which has finite rank equal to the McMillan degree of the system.
The Hankel matrix and its properties have already been thoroughly exploited in subspace methods, where also Vector AutoRegressive Models (VARX) estimated under the PEM framework play a fundamental role; in fact it has been shown in  (see also [11, 13]) that certain subspace methods can be seen as estimation of a long (i.e. “nonparametric” in the context of this paper) VARX model followed by a suitable (data based) model reduction step. This paper goes one step further, by merging these two steps in one. While subspace methods reduce the order of the estimated VARX model via a model reduction step, in this paper regularization takes care of both stability and “complexity” (in terms of McMillan degree) at once, while estimating the VARX model itself.
Within this framework, our recent works [48, 49, 47] have attempted to merge the benefits of accounting for both stability/smoothness as well as complexity when building prior models. The main contributions of this work, w.r.t. the above referenced papers are:
(i) development, by means of MaxEnt arguments, of a new kernel encoding both complexity as well as smoothness and stability (the new kernel is parametrised differently w.r.t. previous conference publications and also the resulting algorithm is different); (ii) a new tailored Scaled Gradient Projection algorithm for marginal likelihood optimization (this had been used but not derived elsewhere) and (iii) an extensive simulation study comparing several state-of-the-art algorithms.
We shall now provide a more detailed description of these contributions as well as a brief discussion of the relevant literature.
The first main goal of this paper is to develop, by means of Maximum Entropy arguments, a vector-valued kernel which accounts both for the stability of the system to be estimated and for its complexity, as measured by its McMillan degree.
The prior distribution introduced here leads, as a special case, to an Hankel nuclear norm penalty, an heuristic related to that proposed in  as a convex surrogate to the rank function.
In the system identification literature the nuclear norm heuristic has also been applied in the context of subspace identification [27, 55, 34], even in presence of incomplete datasets , to control the order of the estimated model. PEM methods equipped with nuclear norm penalties on the Hankel matrix built with the Markov parameters have also been considered [29, 26]. Refer to  for a brief survey on the topic.
However, direct use of nuclear norm (or atomic) penalties may lead to undesired behavior, as suggested and studied in , due to the fact that nuclear norm is not able alone to guarantee stability and smoothness of the estimated impulse responses. To address this limitation,  already suggested the combination of the stability/smoothness penalty with the nuclear norm one; differently from the prior presented in this paper, the formulation given in  did not allow to adopt marginal likelihood maximization to estimate the regularization parameters.
Exploting the structure of the prior distribution used in this paper we design an iterative procedure which alternatively updates the impulse response estimate and the hyper-parameters defining the prior. Our algorithm is related to iteratively reweighted methods used in compressed sensing and signal processing [4, 7, 20, 38, 24] and so-called Sparse Bayesian Learning (SBL) [57, 53].
Our algorithm differs from the previous literature in that the regularization matrix takes on a very special structure, described by few hyper-parameters. With this special structure the weights update does not admit a closed-form solution and thus direct optimisation of the marginal likelihood needs to be performed.
To this purpose, as a second main contribution, this paper develops a Scaled Gradient Projection method (SGP), inspired by the one introduced in , which is more efficient than off-the-shelf optimization procedures implemented in MATLAB.
As a final contribution, the paper provides an extensive simulation study, where the proposed identification algorithm is compared with classical and state-of-the art identification methods, including PEM , N4SID , Stable Spline , reweighted nuclear norm-based algorithms  and regularized “subspace” methods .
While a clear-cut conclusion in terms of relative performance cannot be drawn at the moment, it is fair to say that: (a) the new method developed in this paper outperforms the classical “Stable-Spline” , especially when dealing with MIMO systems; (b) the new method outperforms a reweighted Nuclear Norm algorithm in certain scenarios (e.g. a “mildly-resonant” fourth order system) while performing comparably in others (e.g. randomly generated “large” MIMO systems).
The paper is organized as follows. Section 2 introduces the problem and Section 3 briefly frames system identification in the context of Bayesian estimation. In Section 4 Maximum Entropy arguments are used to derive a family of prior distributions. Section 5 illustrates our algorithm while Section 6 describes the adaptation of a Scaled Gradient Projection method, which is used to solve the marginal likelihood optimization problem. An extensive experimental study will be conducted in Section 7, while some concluding remarks will be drawn in Section 8.
In the following, and denote respectively the set of real, positive real, integers and natural numbers. and will denote respectively the set of -dimensional real vectors, and real matrices. The transpose of will be denoted . , and will denote respectively the zero vector in , the zero matrix in and the identity matrix. The symbol will denote the Kronecker product, the Gaussian distribution with mean and variance . Given , will be a diagonal matrix of size with the diagonal given by . Given matrices , , will denote the block-diagonal matrix of size with the ’s as diagonal blocks. and will respectively denote expectation and trace.
2 Problem Formulation
We consider the following linear, causal and time-invariant (LTI) Output-Error (OE) system:
where is the -dimensional output signal, is the -dimensional input signal, is additive noise and
is the system transfer function with being the backward shift operator: . For simplicity, we will assume the presence of a delay in , i.e. . In addition, we assume , , .
The objective is to estimate, from a finite set of input-output data , the impulse response coefficients .
In the remaining of the paper, we shall consider and as jointly stationary zero-mean stochastic processes; furthermore, the input signal is assumed to be independent of the noise . The results of this paper can be easily extended to VARMAX/BJ type model structure, formulating the identification problem as estimation of the predictor model as done in .
3 Bayesian/regularization approach
In line with the recent developments in linear system identification, we tackle the problem outlined in Section 2 by adopting a Bayesian approach. Namely, we consider as the realization of a stochastic process, embedding in an infinite-dimensional space. For simplicity, consider the Single-Input-Single-Output (SISO) case. A typical choice is to model as a zero-mean Gaussian process with covariance function ,
where is parametrized via the hyper-parameter vector .The covariance function , also called “kernel” in the machine learning literature, is appropriately designed in order to account for the desired properties of the impulse response to be estimated (e.g. stability, smoothness, etc.; see [42, 10, 44]).
In this Bayesian framework the minimum variance estimate of conditional on the observations , on the hyper-parameters and on the noise covariance is the conditional mean:
where is the vector of output observations:
Assuming also that the noise is Gaussian and independent of , and will be jointly normal, so that for fixed and , conditioned on is Gaussian. The estimator (3) is then available in closed form; in particular, when and are replaced with estimators and , (3) is referred to as the Empirical Bayes estimate of . Estimates of and can be found e.g. by cross-validation or marginal likelihood maximization, i.e.
where denotes the likelihood of the observations once the unknown has been integrated out, commonly called the marginal likelihood. Under the Gaussian assumptions on and on the noise also the marginal likelihood is a Gaussian distribution.
According to the Bayesian inference procedure outlined above, the impulse response to be estimated lies in an infinite-dimensional space. However, thanks to the (exponentially) decaying profile of a stable impulse response, it is possible to estimate only a truncated version of , i.e. to approximate with the transfer function of a long Finite Impulse Response (FIR) model ; in this way one avoids dealing with infinite-dimensional objects. It should be stressed that the choice of the length does not correspond to a complexity selection step, since is simply taken large enough to capture the relevant dynamics of the unknown system. Henceforth, we will denote with the vector containing all the impulse response coefficients of , appropriately stacked:
represents the -th impulse response coefficient from input to output . Under the Bayesian framework, is a Gaussian random vector , . Exploiting the notation introduced in (3) and using the FIR approximation, the convolution equation (2) can be reformulated as a linear model:
where the vector collects the noise samples, while the with defined as:
Since and are modelled as Gaussian and independent, and are jointly Gaussian and conditionally on is Gaussian, so that (3) takes the form:
with . By recalling a known equivalence between Bayesian inference and regularization, the previous estimate can also be interpreted as the solution of the following Tikhonov-type regularization problem :
The previous expression shows that the choice of the kernel plays a crucial role for the success of the Bayesian inference procedure. Indeed, it shapes a regularization term which penalizes impulse response coefficients corresponding to “unlikely” or “undesired” systems. In Section 4 we will develop a new class of kernels which induces a penalty of the type:
The first term in will account for the smoothness and stability of the impulse response to be estimated, while the second one will penalize high-complexity models. Estimation of the hyper-parameters and computation of the impulse response estimate through an iterative algorithm will be discussed in Section 5.
The Bayesian inference scheme here illustrated has a well-known connection with the theory of Reproducing Kernel Hilbert Spaces (RKHS). Indeed, once a Gaussian prior for the impulse response is postulated with the covariance function defined in (3), the optimal estimate can also be derived as the solution of a Tikhonov regularization problem and will be an element of the RKHS associated to the kernel . If the true impulse response belongs to , then the so-called “model bias”, accounting for the error between the true and its closest approximation in the hypothesis space, disappears (, Sec. 7.3). In particular, the RKHS associated to the so called stable spline kernel (adopted in the sequel) is very rich. For instance, the impulse response of any BIBO stable finite dimensional linear system belongs to for a suitable choice of . In practice, is estimated by (3): this permits to tune model complexity, trading bias and variance\@footnotemark\@footnotetextFor the reason discussed above only “estimation-bias” will be present., in a continuous manner.
4 Derivation of stable Hankel-type priors
In recent contributions the standard smoothing spline kernels  have been adapted in order to represent covariances of exponentially decaying functions (, ). For instance, considering SISO systems, the 1st order stable spline kernel (see  and  where it has been named Tuned-Correlated (TC) kernel) is defined as
where , , play the role of hyper-parameters. For a suitable choice of , the impulse response of any BIBO linear system belongs a.s. to the RKHS associated to the kernel in (13), see ; thus, by adopting this kernel the “model bias” is zero. Recently,  has shown that the kernel function from which (13) derives admits a Maximum Entropy interpretation. More specifically it is the covariance function of a zero-mean Gaussian process defined over , which is the solution to the Maximum Entropy problem with constraints (
Exploiting a well-known result on Maximum Entropy distributions, see e.g. [19, p. 409], the zero mean Gaussian prior with covariance (13) can also be derived by imposing the constraint\@footnotemark\@footnotetextNote that constraint (15) contains (14).
When dealing with MIMO systems one needs to consider a block-kernel, with the -th block (the cross-covariance of the impulse response from the -th input and the -th output) defined e.g. as in (13). In the recent literature, see e.g. , the cross terms (i.e. , ) have been set to zero. As we shall argue in a moment, this assumption is often unreasonable.
In fact, while smoothness is considered as a synonymous of “simplicity” in the machine learning literature, a system theoretic way to measure complexity is via the McMillan degree of , i.e. the order of a minimal state space realization
The impulse response coefficients of (16) are given by , a relation which couples the impulse responses as and vary. This calls for prior distributions on which encode this coupling. To this end, we first introduce the block Hankel matrix given by:
A classical result from realization theory (see  for details) states that the rank of the block Hankel matrix equals the McMillan degree of the system, if and are large enough. In this work and are chosen so that and the matrix is as close as possible to a square matrix.
From now on, to the purpose of normalization, we shall consider a weighted version of :
where and are chosen, see , so that the singular values of are conditional canonical correlation coefficients between future outputs and near past inputs, given the future inputs and remote past inputs.
For Gaussian processes, there is a one-to-one correspondence between the Canonical Correlation Analysis (CCA) and mutual information. Indeed, the mutual information between past () and future () of a Gaussian process is given by:
where is the canonical correlation coefficient and is the McMillan degree of a minimal spectral factor of .
This provides a clear interpretation of canonical correlations as well as of the impact of shrinking them in terms of mutual information. A similar interpretation holds for systems with inputs, relating conditional mutual information and conditional canonical correlations, i.e. singular values of (18) with the proper choice of and .
4.1 Maximum Entropy Hankel priors
We shall now introduce a probability distribution for , such that samples drawn from have low rank (or close to low rank) Hankel matrices. To this purpose, we would like to favour some of the singular values of to be (close to) zero: this can be achieved imposing constraints on the eigenvalues of the weighted matrix . Let be the -th singular vector of . To achieve our goal we shall constrain the (expected value) of the corresponding singular value , i.e.
for . Here the expectation is taken w.r.t. , while the ’s play the role of hyper-parameters that will have to be estimated from the data\@footnotemark\@footnotetextIn fact, one shall not estimate directly the ’s, but rather the corresponding dual variables appearing in the MaxEnt distribution, i.e. the ’s in (26)..
In order to design , we first assume that an estimate of is available. We shall see in Section 5 how this “preliminary” estimate of arises as an intermediate step in an alternating minimization algorithm.
Thus, we consider the (weighted) estimated Hankel matrix and its singular value decomposition
We can now reformulate the constraints (20) as
where denotes the -th column of . In this way we have fixed the vectors , so that only is random in (22). Fixing the ’s, which in general are not the (exact) singular vectors of the “true” Hankel matrix, introduces a perturbation on the constraint (and thus on the resulting prior distribution). One way to make the constrains (22) robust to such perturbations is to group estimated singular vectors into the so-called “signal” and “noise” subspaces\@footnotemark\@footnotetextIn fact, in this way perturbations “within” the signal and noise subspaces respectively have no effect.. To this purpose let us group the first singular vectors and partition and as follows:
where . Note that, while the ’s corresponding to small singular values are likely to be very noisy, both the “signal” space spanned by the columns of , as well as that spanned by , (i.e. the column space of ) are much less prone to noise; this is easily derived from a perturbation analysis of the singular value decomposition which shows that the error in depends on the gap between the smallest singular value of and the largest one of . In view of these considerations, we can relax the constraints (22) by aggregating the “signal” components (i.e. the first singular vectors):
where well known properties of the trace operator have been used. Similarly, we group the constraints on the “noise” component (i.e. the last singular vectors)
where , , and
We would like to stress that the quality of the relaxation introduced in constraints (24) and (25) depends on the relative magnitude of the Hankel singular values. Using the “normalized” Hankel matrix (18) plays an important role here since its singular values, being canonical correlations, are in the interval . On the other hand, the aggregation of the singular values along the “noise” subspace resembles the role played by the regularization factor in Iterative Reweighted methods [7, 57]. We refer to Appendix B for a thorough discussion on the connection between these methods and our approach.
Notice that in (27) is the sum of two orthogonal projections , respectively on what we called the “signal subspace” (that would coincide with the column space of if was the true system order) and on the “noise subspace”. This observation provides new insights on the design of the prior in (26): namely, by properly tuning the hyper-parameters , the prior is intended to be stronger along certain directions of the column space of (referred to as the “noisy” ones) and milder along what we call the “signal” directions.
Since is linear in , is quadratic in and letting , it can be rewritten as:
i.e. for given , is a zero-mean Gaussian vector:
Using (31) as a prior distribution for , we can recast the problem of estimating under the framework outlined in Section 3. In particular, complexity (in terms of McMillan degree) is controlled by properly choosing the hyper-parameters , which can be done by marginal likelihood maximization as further discussed in Section 5.
where are the singular values of . Thus, the nuclear norm penalty on the (squared) Hankel matrix can be derived from kernel (32) as a special case, i.e. for a special choice of the hyper-parameters. The use of nuclear norm regularization is not new in system identification: a comparison with the literature can be found in Appendix A.
4.2 Maximum Entropy stable-Hankel priors
As thoroughly discussed in , the kernel arising from the “Hankel” constraint alone would not necessarily lead to stable models. In fact given an unstable system and its finite Hankel matrix , it is always possible to design a stable system whose finite Hankel matrix (of the same size as ) has the same singular values of . In addition, the Hankel prior does not include information on the correlation among the impulse response coefficients (see ). Thus, as a final step, we shall consider the Maximum Entropy distribution [19, p. 409], under both stability (15) and low complexity ((24) and (25)) constraints, thus obtaining
and as defined in (33).
5 Identification Algorithm
This section describes the iterative algorithm to estimate the impulse response when the prior is chosen as in (4.2). The algorithm alternates between the estimation of (see (10)) for fixed hyper-parameters and marginal likelihood optimization (see (3)).
The procedure is summarized in Algorithm 1. For ease of notation we have defined . Hence, the hyper-parameters vector in (37) can be rewritten as
Furthermore, , , and denote estimators at the -th iteration of the algorithm.
In Algorithm 1 the noise variance is fixed e.g. to the sample variance of an estimated ARX or FIR model. Of course could also be treated as a hyper-parameter, and estimated with the same procedure based on the marginal likelihood.
5.1 Algorithm Initialization
In the derivation of kernel in Section 4 it has been assumed that a preliminary estimate was available. Therefore the iterative algorithm we outline in this section has to be provided with an initial estimate . Exploiting the structure of the kernel in (36), two straightforward choices are possible:
Initialize using the stable-Hankel kernel with , so that no preliminary estimate is needed to initialize (which is empty) and thus :
where and are estimated through marginal likelihood maximization (3).
The procedure we actually follow (illustrated in Algorithm 1) combines the two strategies above. Namely, the first approach is adopted to fix the hyper-parameters defining the stable-spline kernel (line 6). These are then kept fixed for the whole procedure. We then follow the second strategy to estimate (line 7). Note that in line 7 the hyper-parameters are fixed to and not estimated as in (5.1). Analogously, is estimated by marginal-likelihood maximization and not set a-priori to as in (5.1). Therefore, the estimate computed at line 10 is derived by adopting the kernel with .
This sort of “hybrid” strategy has been chosen for two main reasons. First, it allows to fix the hyper-parameters by solving a simplified optimization problem (w.r.t. solving a problem involving all the hyper-parameters ). Notice that this also provides the user with a certain freedom on the choice of the kernel : using other kernel structures (see e.g. ) additional properties (e.g. resonances, high-frequency components, etc.) of the impulse response can be accounted for. Second, it also allows to properly initialize the iterative procedure used to update the hyper-parameters and in (37), until a stopping condition is met (see next section for a discussion about convergence of Algorithm 1).
5.2 Convergence Analysis
Algorithm 1 is guaranteed to stop in a finite number of steps, returning a final estimate . Indeed, at any iteration four possible scenarios may arise:
Condition at line 15 is met and is increased by one and the algorithm iterates.
, then the algorithm terminates returning .
Conditions (1) and (3) may only be satisfied a finite number of times, thus the algorithm terminates in a finite number of steps.
We also stress that Algorithm 1 is only an ascent algorithm w.r.t. the marginal likelihood without any guarantee of convergence to a local extrema. If was treated as a hyper-parameter and the marginal likelihood optimised over the Grassmann manifold, then convergence to a local maxima could be proven.\@footnotemark\@footnotetextWe have tested this variant, which is considerably more computationally expensive than Algorithm 1. Since no significant improvements have been observed, we only present the simpler version in this paper. Notice indeed that we adopt a tailored Scaled Gradient Projection algorithm to solve the marginal likelihood optimization problem at line 14 (see Section 6): every accumulation point of the iterates generated by this algorithm is guaranteed to be a stationary point (, Theorem 1); furthermore, for the specific problem we are solving, the sequence of the iterates admits at least one limit point.
Once the algorithm has converged, is the optimal dimension of the “signal” and “noise” subspaces of , respectively spanned by the columns of and . Furthermore, the corresponding multipliers and in are expected to tend, respectively, to (meaning that no penalty is given on the signal component) and to (that is, a very large penalty is assigned to the noise subspace); if , would actually be the McMillan degree of the estimated system.
In practice the estimated hyper-parameter is finite and, similarly, is strictly positive. As a result the McMillan degree of the estimated system is generically larger than, but possibly close to, . Therefore, estimation of the integer parameter should not be interpreted as a hard decision on the complexity as instead happens for parametric model classes whose structure is estimated with AIC/BIC/Cross Validation. Therefore, we may say that Algorithm 1 performs a “soft” complexity selection, confirming that this “Bayesian” framework allows to describe model structures in a continuous manner; in fact, for any choice of , systems of different McMillan degrees are assigned non zero probability by the prior.
6 SGP for marginal likelihood optimization
A crucial step in Algorithm 1 is the marginal likelihood maximization (step 14) which is computationally expensive, especially when the number of inputs and outputs is large. To deal with this issue we have adapted the Scaled Gradient Projection method (SGP), proposed in , to solve
with . The SGP is a first order method, in which the negative gradient direction is doubly scaled through a variable step size and a positive definite scaling matrix , which are iteratively updated. A careful choice of these scalings, illustrated later on, allows to speed up the, theoretically linear, convergence; the reader is referred to  for details. The main steps of the algorithm are as follows (see Algorithm 2 for details):
Set the descent direction (scaled negative gradient)
Project the candidate update on the constraint set
and define the final descent direction:
Since is the positive cone the projection is merely a truncation to non-negative values of (and is independent of the scaling ).
Update along the direction as follows:
with the steplength computed through an Armijo backtracking loop.