# Maximum Entropy Vector Kernels for MIMO system identification

## Abstract

Recent contributions have framed linear system identification as a nonparametric regularized inverse problem. Relying on -type regularization which accounts for the stability and smoothness of the impulse response to be estimated, these approaches have been shown to be competitive w.r.t classical parametric methods. In this paper, adopting Maximum Entropy arguments, we derive a new penalty deriving from a vector-valued kernel; to do so we exploit the structure of the Hankel matrix, thus controlling at the same time complexity, measured by the McMillan degree, stability and smoothness of the identified models. As a special case we recover the nuclear norm penalty on the squared block Hankel matrix. In contrast with previous literature on reweighted nuclear norm penalties, our kernel is described by a small number of hyper-parameters, which are iteratively updated through marginal likelihood maximization; constraining the structure of the kernel acts as a (hyper)regularizer which helps controlling the effective degrees of freedom of our estimator. To optimize the marginal likelihood we adapt a Scaled Gradient Projection (SGP) algorithm which is proved to be significantly computationally cheaper than other first and second order off-the-shelf optimization methods. The paper also contains an extensive comparison with many state-of-the-art methods on several Monte-Carlo studies, which confirms the effectiveness of our procedure.

## 1Introduction

Although linear system identification is sometimes considered a mature field, with a wide and solid literature summarized in the well known textbooks [35], the recent developments on regularization based methods have brought new insights and opened new avenues. The most common “classical” approaches are parametric Prediction Error Methods (PEM) [35], where model classes (OE, ARMAX, Box-Jenkins, state-space, etc.) are described by a finite dimensional parameter vector which is estimated minimizing the squared prediction errors, and subspace methods, which translate ideas from stochastic realization theory [17] into algorithms which work on measured data [54].

These techniques require that a model complexity (the *order* hereon) is fixed, and thus estimated, first. As an alternative to the standard parametric approach, recent literature has proposed a Bayesian perspective, leading to a class of regularized methods [43]. The use of Bayesian inference is not new in the field of identification and time-series estimation: early works on this topic appeared in the late’70, early ’80 [1]; see [14] for an overview.

The Bayesian paradigm considers the impulse response as a stochastic process whose prior distribution penalizes undesired systems (e.g. unstable ones). This allows to face the so-called bias/variance trade-off by jointly performing estimation and model selection.

In [43] prior distributions are designed to encode smoothness and stability of the impulse response to be estimated, leading to -type penalties so that closed-form solution are available. These priors can also be shown to be solutions of Maximum Entropy problems, see [46].

In this paper, we focus on the identification of multi input-multi output (MIMO) systems, where matrix impulse responses have to be identified. Similar problems are encountered in multi-task learning where one would like to simultaneously estimate multiple functions while also exploiting their mutual information. To this aim [6] have considered vector-valued kernels which account for the smoothness of the functions to be estimated. In the identification of finite dimensional linear MIMO systems, the coupling between different input-output channels is captured by Hankel matrix, which has finite rank equal to the McMillan degree of the system.

The Hankel matrix and its properties have already been thoroughly exploited in subspace methods, where also Vector AutoRegressive Models (VARX) estimated under the PEM framework play a fundamental role; in fact it has been shown in [12] (see also [11]) that certain subspace methods can be seen as estimation of a long (i.e. “nonparametric” in the context of this paper) VARX model followed by a suitable (data based) model reduction step. This paper goes one step further, by merging these two steps in one. While subspace methods reduce the order of the estimated VARX model via a model reduction step, in this paper regularization takes care of both stability and “complexity” (in terms of McMillan degree) at once, while estimating the VARX model itself.

Within this framework, our recent works [48] have attempted to merge the benefits of accounting for both stability/smoothness as well as complexity when building prior models. The main contributions of this work, w.r.t. the above referenced papers are: (i) development, by means of MaxEnt arguments, of a new kernel encoding both complexity as well as smoothness and stability (the new kernel is parametrised differently w.r.t. previous conference publications and also the resulting algorithm is different); (ii) a new tailored Scaled Gradient Projection algorithm for marginal likelihood optimization (this had been used but not derived elsewhere) and (iii) an extensive simulation study comparing several state-of-the-art algorithms.

We shall now provide a more detailed description of these contributions as well as a brief discussion of the relevant literature.

The first main goal of this paper is to develop, by means of Maximum Entropy arguments, a vector-valued kernel which accounts both for the stability of the system to be estimated and for its complexity, as measured by its McMillan degree.

The prior distribution introduced here leads, as a special case, to an Hankel nuclear norm penalty, an heuristic related to that proposed in [23] as a convex surrogate to the rank function. In the system identification literature the nuclear norm heuristic has also been applied in the context of subspace identification [27], even in presence of incomplete datasets [33], to control the order of the estimated model. PEM methods equipped with nuclear norm penalties on the Hankel matrix built with the Markov parameters have also been considered [29]. Refer to [49] for a brief survey on the topic.

However, direct use of nuclear norm (or atomic) penalties may lead to undesired behavior, as suggested and studied in [40], due to the fact that nuclear norm is not able alone to guarantee stability and smoothness of the estimated impulse responses. To address this limitation, [15] already suggested the combination of the stability/smoothness penalty with the nuclear norm one; differently from the prior presented in this paper, the formulation given in [15] did not allow to adopt marginal likelihood maximization to estimate the regularization parameters.

Exploting the structure of the prior distribution used in this paper we design an iterative procedure which alternatively updates the impulse response estimate and the hyper-parameters defining the prior. Our algorithm is related to iteratively reweighted methods used in compressed sensing and signal processing [4] and so-called *Sparse Bayesian Learning* (SBL) [57].

Our algorithm differs from the previous literature in that the regularization matrix takes on a very special structure, described by few hyper-parameters. With this special structure the weights update does not admit a closed-form solution and thus direct optimisation of the marginal likelihood needs to be performed.

To this purpose, as a second main contribution, this paper develops a Scaled Gradient Projection method (SGP), inspired by the one introduced in [3], which is more efficient than off-the-shelf optimization procedures implemented in MATLAB.

As a final contribution, the paper provides an extensive simulation study, where the proposed identification algorithm is compared with classical and state-of-the art identification methods, including PEM [35], N4SID [54], Stable Spline [42], reweighted nuclear norm-based algorithms [37] and regularized “subspace” methods [55].

While a clear-cut conclusion in terms of relative performance cannot be drawn at the moment, it is fair to say that: (a) the new method developed in this paper outperforms the classical “Stable-Spline” [42], especially when dealing with MIMO systems; (b) the new method outperforms a reweighted Nuclear Norm algorithm in certain scenarios (e.g. a “mildly-resonant” fourth order system) while performing comparably in others (e.g. randomly generated “large” MIMO systems).

The paper is organized as follows. Section 2 introduces the problem and Section 3 briefly frames system identification in the context of Bayesian estimation. In Section 4 Maximum Entropy arguments are used to derive a family of prior distributions. Section 5 illustrates our algorithm while Section 6 describes the adaptation of a Scaled Gradient Projection method, which is used to solve the marginal likelihood optimization problem. An extensive experimental study will be conducted in Section 7, while some concluding remarks will be drawn in Section 8.

### Notation

In the following, and denote respectively the set of real, positive real, integers and natural numbers. and will denote respectively the set of -dimensional real vectors, and real matrices. The transpose of will be denoted . , and will denote respectively the zero vector in , the zero matrix in and the identity matrix. The symbol will denote the Kronecker product, the Gaussian distribution with mean and variance . Given , will be a diagonal matrix of size with the diagonal given by . Given matrices , , will denote the block-diagonal matrix of size with the ’s as diagonal blocks. and will respectively denote expectation and trace.

## 2Problem Formulation

We consider the following linear, causal and time-invariant (LTI) Output-Error (OE) system:

where is the -dimensional output signal, is the -dimensional input signal, is additive noise and

is the system transfer function with being the backward shift operator: . For simplicity, we will assume the presence of a delay in , i.e. . In addition, we assume , , .

The objective is to estimate, from a finite set of input-output data , the impulse response coefficients .

In the remaining of the paper, we shall consider and as jointly stationary zero-mean stochastic processes; furthermore, the input signal is assumed to be independent of the noise . The results of this paper can be easily extended to VARMAX/BJ type model structure, formulating the identification problem as estimation of the predictor model as done in [42].

## 3Bayesian/regularization approach

In line with the recent developments in linear system identification, we tackle the problem outlined in Section 2 by adopting a Bayesian approach. Namely, we consider as the realization of a stochastic process, embedding in an infinite-dimensional space. For simplicity, consider the Single-Input-Single-Output (SISO) case. A typical choice is to model as a zero-mean Gaussian process with covariance function ,

where is parametrized via the hyper-parameter vector . The covariance function , also called “kernel” in the machine learning literature, is appropriately designed in order to account for the desired properties of the impulse response to be estimated (e.g. stability, smoothness, etc.; see [42]).

In this Bayesian framework the minimum variance estimate of conditional on the observations , on the hyper-parameters and on the noise covariance is the conditional mean:

where is the vector of output observations:

Assuming also that the noise is Gaussian and independent of , and will be jointly normal, so that for fixed and , conditioned on is Gaussian. The estimator is then available in closed form; in particular, when and are replaced with estimators and , is referred to as the Empirical Bayes estimate of [50]. Estimates of and can be found e.g. by cross-validation or marginal likelihood maximization, i.e.

where denotes the likelihood of the observations once the unknown has been integrated out, commonly called the *marginal likelihood*. Under the Gaussian assumptions on and on the noise also the marginal likelihood is a Gaussian distribution.

According to the Bayesian inference procedure outlined above, the impulse response to be estimated lies in an infinite-dimensional space. However, thanks to the (exponentially) decaying profile of a stable impulse response, it is possible to estimate only a truncated version of , i.e. to approximate with the transfer function of a long Finite Impulse Response (FIR) model ; in this way one avoids dealing with infinite-dimensional objects. It should be stressed that the choice of the length does not correspond to a complexity selection step, since is simply taken large enough to capture the relevant dynamics of the unknown system. Henceforth, we will denote with the vector containing all the impulse response coefficients of , appropriately stacked:

represents the -th impulse response coefficient from input to output . Under the Bayesian framework, is a Gaussian random vector , . Exploiting the notation introduced in and using the FIR approximation, the convolution equation can be reformulated as a linear model:

where the vector collects the noise samples, while the with defined as:

Since and are modelled as Gaussian and independent, and are jointly Gaussian and conditionally on is Gaussian, so that takes the form:

with . By recalling a known equivalence between Bayesian inference and regularization, the previous estimate can also be interpreted as the solution of the following Tikhonov-type regularization problem [56]:

with .

The previous expression shows that the choice of the kernel plays a crucial role for the success of the Bayesian inference procedure. Indeed, it shapes a regularization term which penalizes impulse response coefficients corresponding to “unlikely” or “undesired” systems. In Section 4 we will develop a new class of kernels which induces a penalty of the type: The first term in will account for the smoothness and stability of the impulse response to be estimated, while the second one will penalize high-complexity models. Estimation of the hyper-parameters and computation of the impulse response estimate through an iterative algorithm will be discussed in Section .

## 4Derivation of stable Hankel-type priors

In recent contributions the standard smoothing spline kernels [56] have been adapted in order to represent covariances of exponentially decaying functions ([43], [42]). For instance, considering SISO systems, the 1st order stable spline kernel (see [43] and [10] where it has been named Tuned-Correlated (TC) kernel) is defined as

where , , play the role of hyper-parameters. For a suitable choice of , the impulse response of any BIBO linear system belongs a.s. to the RKHS associated to the kernel in , see [43]; thus, by adopting this kernel the “model bias” is zero. Recently, [9] has shown that the kernel function from which derives admits a Maximum Entropy interpretation. More specifically it is the covariance function of a zero-mean Gaussian process defined over , which is the solution to the Maximum Entropy problem with constraints (

Exploiting a well-known result on Maximum Entropy distributions, see e.g. [19], the zero mean Gaussian prior with covariance can also be derived by imposing the constraint^{1}

When dealing with MIMO systems one needs to consider a block-kernel, with the -th block (the cross-covariance of the impulse response from the -th input and the -th output) defined e.g. as in . In the recent literature, see e.g. [18], the cross terms (i.e. , ) have been set to zero. As we shall argue in a moment, this assumption is often unreasonable.

In fact, while smoothness is considered as a synonymous of “simplicity” in the machine learning literature, a system theoretic way to measure complexity is via the McMillan degree of , i.e. the order of a minimal state space realization

The impulse response coefficients of are given by , a relation which couples the impulse responses as and vary. This calls for prior distributions on which encode this coupling. To this end, we first introduce the block Hankel matrix given by:

A classical result from realization theory (see [52] for details) states that the rank of the block Hankel matrix equals the McMillan degree of the system, if and are large enough. In this work and are chosen so that and the matrix is as close as possible to a square matrix.

From now on, to the purpose of normalization, we shall consider a weighted version of :

where and are chosen, see [15], so that the singular values of are conditional canonical correlation coefficients between future outputs and near past inputs, given the future inputs and remote past inputs.

### 4.1Maximum Entropy Hankel priors

We shall now introduce a probability distribution for , such that samples drawn from have low rank (or close to low rank) Hankel matrices. To this purpose, we would like to favour some of the singular values of to be (close to) zero: this can be achieved imposing constraints on the eigenvalues of the weighted matrix . Let be the -th singular vector of . To achieve our goal we shall constrain the (expected value) of the corresponding singular value , i.e.

for . Here the expectation is taken w.r.t. , while the ’s play the role of hyper-parameters that will have to be estimated from the data^{2}

In order to design , we first assume that an estimate of is available. We shall see in Section 5 how this “preliminary” estimate of arises as an intermediate step in an alternating minimization algorithm.

Thus, we consider the (weighted) estimated Hankel matrix and its singular value decomposition

We can now reformulate the constraints as

where denotes the -th column of . In this way we have fixed the vectors , so that only is random in . Fixing the ’s, which in general are not the (exact) singular vectors of the “true” Hankel matrix, introduces a perturbation on the constraint (and thus on the resulting prior distribution). One way to make the constrains robust to such perturbations is to group estimated singular vectors into the so-called “signal” and “noise” subspaces^{3}

where . Note that, while the ’s corresponding to small singular values are likely to be very noisy, both the “signal” space spanned by the columns of , as well as that spanned by , (i.e. the column space of ) are much less prone to noise; this is easily derived from a perturbation analysis of the singular value decomposition which shows that the error in depends on the gap between the smallest singular value of and the largest one of . In view of these considerations, we can relax the constraints by aggregating the “signal” components (i.e. the first singular vectors):

where well known properties of the trace operator have been used. Similarly, we group the constraints on the “noise” component (i.e. the last singular vectors)

Exploiting a well known result [19], we can build the Maximum Entropy distribution subject to the constraints and :

where , , and

Since is linear in , is quadratic in and letting , it can be rewritten as:

where is such that . Inserting in we obtain

i.e. for given , is a zero-mean Gaussian vector:

with

Using as a prior distribution for , we can recast the problem of estimating under the framework outlined in Section . In particular, complexity (in terms of McMillan degree) is controlled by properly choosing the hyper-parameters , which can be done by marginal likelihood maximization as further discussed in Section .

We shall notice that, when , quantity (from which kernel arises) reduces to where are the singular values of . Thus, the nuclear norm penalty on the (squared) Hankel matrix can be derived from kernel as a special case, i.e. for a special choice of the hyper-parameters. The use of nuclear norm regularization is not new in system identification: a comparison with the literature can be found in Appendix .

### 4.2Maximum Entropy stable-Hankel priors

As thoroughly discussed in [40], the kernel arising from the “Hankel” constraint alone would not necessarily lead to stable models. In fact given an unstable system and its finite Hankel matrix , it is always possible to design a stable system whose finite Hankel matrix (of the same size as ) has the same singular values of . In addition, the Hankel prior does not include information on the correlation among the impulse response coefficients (see [40]). Thus, as a final step, we shall consider the Maximum Entropy distribution [19], under both stability and low complexity ( and ) constraints, thus obtaining

where , , and is the kernel in . The use of a further hyper-parameter, , will become clear later on. From the distribution we can derive the kernel

with hyper-parameters

and as defined in .

## 5Identification Algorithm

This section describes the iterative algorithm to estimate the impulse response when the prior is chosen as in . The algorithm alternates between the estimation of (see ) for fixed hyper-parameters and marginal likelihood optimization (see ).

The procedure is summarized in Algorithm ?. For ease of notation we have defined . Hence, the hyper-parameters vector in can be rewritten as

Furthermore, , , and denote estimators at the -th iteration of the algorithm.

Notice that the marginal likelihood maximization performed in steps ? and ? of Algorithm ? boils down to the following optimization problem:

where

Section 6 will illustrate a Scaled Gradient Projection (SGP) method appropriately designed to solve . We shall now discuss issues related to initialisation and convergence of Algorithm ?.

### 5.1Algorithm Initialization

In the derivation of kernel in Section 4 it has been assumed that a preliminary estimate was available. Therefore the iterative algorithm we outline in this section has to be provided with an initial estimate . Exploiting the structure of the kernel in , two straightforward choices are possible:

Initialize using only the stable-spline kernel (as the one in ), i.e.:

where only the hyper-parameters are estimated through marginal-likelihood maximization .

Initialize using the stable-Hankel kernel with , so that no preliminary estimate is needed to initialize (which is empty) and thus :

where and are estimated through marginal likelihood maximization .

The procedure we actually follow (illustrated in Algorithm ?) combines the two strategies above. Namely, the first approach is adopted to fix the hyper-parameters defining the stable-spline kernel (line ?). These are then kept fixed for the whole procedure. We then follow the second strategy to estimate (line ?). Note that in line ? the hyper-parameters are fixed to and not estimated as in . Analogously, is estimated by marginal-likelihood maximization and not set a-priori to as in . Therefore, the estimate computed at line ? is derived by adopting the kernel with .

This sort of “hybrid” strategy has been chosen for two main reasons. *First*, it allows to fix the hyper-parameters by solving a simplified optimization problem (w.r.t. solving a problem involving all the hyper-parameters ). Notice that this also provides the user with a certain freedom on the choice of the kernel : using other kernel structures (see e.g. [16]) additional properties (e.g. resonances, high-frequency components, etc.) of the impulse response can be accounted for. *Second*, it also allows to properly initialize the iterative procedure used to update the hyper-parameters and in , until a stopping condition is met (see next section for a discussion about convergence of Algorithm ?).

### 5.2Convergence Analysis

Algorithm ? is guaranteed to stop in a finite number of steps, returning a final estimate . Indeed, at any iteration four possible scenarios may arise:

Condition at line ? is met and is increased by one and the algorithm iterates.

Condition at line ? is not met

, so that is increased by one, and condition ? is not met, then the algorithm terminates returning .^{4}Condition at line ? is not met, so that is increased by one, while condition ? is met, then is increased by one and the algorithm iterates.

, then the algorithm terminates returning .

Conditions (1) and (3) may only be satisfied a finite number of times, thus the algorithm terminates in a finite number of steps.

We also stress that Algorithm ? is only an ascent algorithm w.r.t. the marginal likelihood without any guarantee of convergence to a local extrema. If was treated as a hyper-parameter and the marginal likelihood optimised over the Grassmann manifold, then convergence to a local maxima could be proven.^{5}

Once the algorithm has converged, is the optimal dimension of the “signal” and “noise” subspaces of , respectively spanned by the columns of and . Furthermore, the corresponding multipliers and in are expected to tend, respectively, to (meaning that no penalty is given on the signal component) and to (that is, a very large penalty is assigned to the noise subspace); if , would actually be the McMillan degree of the estimated system.

In practice the estimated hyper-parameter is finite and, similarly, is strictly positive. As a result the McMillan degree of the estimated system is generically larger than, but possibly close to, . Therefore, estimation of the integer parameter should not be interpreted as a hard decision on the complexity as instead happens for parametric model classes whose structure is estimated with AIC/BIC/Cross Validation. Therefore, we may say that Algorithm ? performs a “soft” complexity selection, confirming that this “Bayesian” framework allows to describe model structures in a continuous manner; in fact, for any choice of , systems of different McMillan degrees are assigned non zero probability by the prior.

## 6SGP for marginal likelihood optimization

A crucial step in Algorithm ? is the marginal likelihood maximization (step ?) which is computationally expensive, especially when the number of inputs and outputs is large. To deal with this issue we have adapted the Scaled Gradient Projection method (SGP), proposed in [3], to solve

with . The SGP is a first order method, in which the negative gradient direction is doubly scaled through a variable step size and a positive definite scaling matrix , which are iteratively updated. A careful choice of these scalings, illustrated later on, allows to speed up the, theoretically linear, convergence; the reader is referred to [3] for details. The main steps of the algorithm are as follows (see Algorithm ? for details):

Set the descent direction (scaled negative gradient)

Project the candidate update on the constraint set

and define the final descent direction:

Since is the positive cone the projection is merely a truncation to non-negative values of (and is independent of the scaling ).

Update along the direction as follows:

with the steplength computed through an Armijo backtracking loop.

In step ? of Algorithm ?, the stepsize is chosen by means of an alternation strategy based on the Barzilai-Borwein rules (as in [3]), which aim at finding so that approximates the inverse Hessian matrix.

The choice of the scaling matrix strictly depends on both the objective function and the constraints of the optimization problem. In our implementation we followed the choices made in [3]: is set to be diagonal and its update is based on the split gradient idea. Let us first define

where we have used the simplified notation . Moreover, we define where and are fixed and

Now, indicating with the gradient of w.r.t. to , we have:

From the positive definiteness of and the positive semidefiniteness of , it follows that . Furthermore, from Lemma II.1 in [31], it follows that . This shows how the gradient of the objective function admits the following decomposition:

with and (here the inequalities have to be understood component wise). Using the gradient splitting , the Karush-Kuhn-Tucker optimality conditions for problem

can be written as the solution of a fixed point iteration (see eq. (4.8) in [3]) which leads to the scaling matrix :

This choice of the scaling matrix has proven to be particularly effective on ill-posed or ill-conditioned inverse problems, when it is combined with an appropriate choice of the stepsize .

Further details on the setting of the parameters involved in Algorithm ? and on the adopted stopping criterion will be given in Section 7.6.

## 7Simulation Results

### 7.1Monte-Carlo Simulations

The identification procedure outlined in Algorithm ? is now compared with off-the-shelf identification routines, as well as with recently proposed methods. The comparison is performed through some Monte-Carlo studies on three appropriately designed scenarios.

The innovation process is always a zero-mean Gaussian white noise with standard deviation randomly chosen in order to guarantee that the SNR on each output channel is a uniform random variable in the interval . For each scenario we test the identification procedures on three different data lengths, which can be roughly classified as “few/average/many” data. Each Monte-Carlo study includes runs. A brief illustration of the three scenarios follows.

- S1)
We consider a fixed fourth order system with transfer function where

The input is generated, for each Monte Carlo run, as a low pass filtered white Gaussian noise with normalized band where is a uniform random variable in the interval . The identification of system using data generated by a band-limited input appears particularly challenging because the system is characterized by two high-frequency resonances.

The three different data lengths that have been considered are: .- S2)
For each Monte Carlo run is randomly generated using the MATLAB function

`drmodel`

with outputs and inputs while guaranteeing that all the poles of are inside the disc of radius of the complex plane. The system orders are randomly chosen from 1 to 10. The input is zero-mean unit variance white Gaussian noise. The three different numbers of input-output data pairs that have been tested are: .- S3)
The systems have been randomly generated similarly to scenario S2, but with 10 inputs and 5 outputs. Moreover, the input is a low-pass filtered Gaussian white noise with normalized band defined as in S1. The considered data lengths are: .

### 7.2Compared identification algorithms

The following algorithms have been tested:^{6}

N4SID+Or: The subspace method, as implemented by the MATLAB routine

`n4sid`

. Different model complexities are tested; an Oracle chooses the order which maximises the impulse response fit .N4SID(OE)+Or: As N4SID+Or but forcing the routine to return an Output-Error model.

N4SID: The MATLAB routine

`n4sid`

, equipped with default model order selection.N4SID(OE): Same as N4SID by forcing an OE structure.

PEM+Or: PEM as implemented by the MATLAB routine

`pem`

. Different model complexities are tested: an Oracle chooses the order which maximises the impulse response fit .PEM(OE)+Or: Same as PEM+Or but using the routine

`oe`

. For each of the tested complexities, the routine`oe`

has been initialized with the model returned by`pem`

.PEM: The MATLAB routine

`pem`

, equipped with the default model order selection.PEM(OE): The MATLAB routine

`oe`

, initialized with the model returned by`pem`

(order as selected by the default choice in`pem`

).N2SID: The identification routine proposed in [55] and implemented through the code available from

`http://users.isy.liu.se/en/rt/hansson/`

. This routine returns a state-space model in innovation form. The estimation of Output-Error models through N2SID has not been tested, since the routine does not straightforwardly allow to force an OE model structure.SS: The estimator where is chosen to be the kernel TC introduced in [10]. The estimator is computed through the MATLAB routine

`arxRegul`

(imposing a FIR model structure).NN+CV: A FIR model of order estimated solving

The optimization problem is solved through a tailored ADMM algorithm (as in [33]), while is determined through Cross-Validation. This procedure has also been tested by replacing in with (see ).

RNN+CV: A FIR model of order estimated by iteratively solving

The weight matrices and are updated at each iteration according to the procedure suggested in [37]. is selected through Cross-Validation. The case in which in is replaced with has also been tested.

SH: The estimator returned by Algorithm ? with specified through the TC kernel.

^{7}

Some implementation details follow. For SS, SH, NN+CV and RNN+CV, the length of the estimated impulse response has been set to 80 for scenario S1, to 50 for S2 and S3.

The regularization parameter in N2SID [55] has been chosen within a set of 20 elements logarithmically spaced between and for S1 and 40 elements logarithmically spaced between and for S2 and S3. The endpoints of these grids have been selected so that the estimated value of is inside the interval.

When observed that using cross-validation, the results were unreliable for the “few” data scenarios . To optimize the performance, in scenarios S2 and S3 we have used two-thirds of the available data as training set and the remaining one third for the validation step. Instead, in scenario S1, the available data have been equally split into the training and the validation set. The regularization parameter has been selected from the vector , where is the length of the training dataset, while is a vector of 25 elements logarithmically spaced between and for S1, between and for S2 and S3.

### 7.3Impulse Response Estimate

To evaluate the estimators described above, we first introduce the so-called coefficient of determination (COD) between time series and : where . The impulse response fit is measured using the average COD: where and denote the true and estimated impulse responses from input to output . We set , letting .

Figures ?, ? and ? report the boxplots of in the three scenarios by some of the identification techniques listed above. In particular, among the methods equipped with the oracle for model complexity selection, only the results of PEM+Or are shown, since it gives the best performance. As far as the subspace techniques are concerned, we only report N4SID(OE), because it generally performs slightly better than N4SID; analogously, only the results achieved by the routine PEM are illustrated, since the performance of PEM(OE) is worse.

SH and RNN+CV achieve, among the procedures which can be practically implemented, the best performance in scenarios S2 and S3; instead, in scenario S1, RNN+CV has severe difficulties. It is also interesting to observe that the reweighted procedure in (RNN+CV) improves the performance achieved by simple nuclear norm regularization (NN+CV) in all the scenarios except for S1. The results achieved imposing the nuclear norm penalty on the weighted Hankel matrix are not reported since they are in general slightly worse than those achieved by NN+CV and RNN+CV.

### 7.4 Predictive performance

We compare the predictive performance of the methods listed in Section over a specifically designed scenario. Namely, system has been simulated with a unit variance white Gaussian noise input, while its output was corrupted by additive white Gaussian noise with a variance chosen in order to have . 200 estimation datasets consisting of data have been generated in this way. A set of validation data was used to evaluate the COD for each system output, i.e.