Linear system identification
using stable spline kernels and PLQ penalties
Abstract
The classical approach to linear system identification is given by parametric Prediction Error Methods (PEM). In this context, model complexity is often unknown so that a model order selection step is needed to suitably tradeoff bias and variance. Recently, a different approach to linear system identification has been introduced, where model order determination is avoided by using a regularized least squares framework. In particular, the penalty term on the impulse response is defined by so called stable spline kernels. They embed information on regularity and BIBO stability, and depend on a small number of parameters which can be estimated from data. In this paper, we provide new nonsmooth formulations of the stable spline estimator. In particular, we consider linear system identification problems in a very broad context, where regularization functionals and data misfits can come from a rich set of piecewise linear quadratic functions. Moreover, our analysis includes polyhedral inequality constraints on the unknown impulse response. For any formulation in this class, we show that interior point methods can be used to solve the system identification problem, with complexity in each iteration, where and are the number of impulse response coefficients and measurements, respectively. The usefulness of the framework is illustrated via a numerical experiment where output measurements are contaminated by outliers.
linear system identification; biasvariance trade off; kernelbased regularization; robust statistics; interior point methods; piecewise linear quadratic densities
I Introduction
The classical approach to linear system identification
is given by Parametric Prediction Error Methods (PEM)
[1, 2]. First, models
of different and unknown order, e.g. ARX or ARMAX, are postulated
and identified from data. Then, they are compared using either complexity
measures such as AIC or cross validation (CV) [3, 4].
Some limitations of this approach have been recently described in
[5] (see also [6]
for an analysis of CV).
This has led to the introduction of an alternative technique,
where identification is seen as a function learning problem formulated
in a possibly infinitedimensional space
[5, 7]. In particular, the problem
is cast in the framework of Gaussian regression
[8]: the unknown impulse response
is seen as a Gaussian process, whose autocovariance
encodes available prior knowledge. This approach was subsequently
given an interpretation in a Regularized Least Squares framework in [9].
The new estimators proposed in [5, 10]
rely on a class of autocovariances, called stable spline kernels,
which include information on the exponential stability of the unknown system.
The impulse response is modeled as the fold
integration of white Gaussian noise subject to an exponential time transformation.
The firstorder stable spline kernel has been recently derived
using deterministic arguments [9], and named the TC kernel.
An even more sophisticated
covariance for system identification, the so called DC kernel, is also described in
[9].
All of these kernels are defined by a small number of unknown hyperparameters,
which can be learned from data,
e.g. by optimizing the marginal likelihood [11, 12, 13].
This procedure resembles model order selection in the classical parametric paradigm,
and theoretical arguments supporting it are illustrated in [14].
Once the hyperparameters are found, the estimate of the system impulse response becomes
available in closed form. Extensive simulation studies have shown that
these new estimators can lead to significant advantages with respect
to the classical ones, in particular in terms of robustness and
in model complexity selection.
All of the new kernelbased approaches discussed in
[5, 7, 9] rely on quadratic loss and
and penalty functions. As a result,
in some circumstances they may
perform poorly. In fact, quadratic penalties
are not robust when outliers are present in the data
[15, 16, 17, 18].
In addition, they neither promote sparse solutions,
nor select small subsets of measurements or
impulse response coefficients with the
greatest impact on the predictive capability for
future data. These are key issues for
feature selection and compressed sensing
[19, 20, 21].
The limitations of quadratic penalties motivate
adopting alternative penalties for both loss and regularization functionals.
For example,
popular regularizers are the
the norm, as in the LASSO
[22], or a weighted combination of and
, as in
the elastic net procedure [23].
Popular fitting measures robust to outliers are the norm, the Huber loss [15],
the Vapnik insensitive loss [24, 25]
and the hinge loss [26, 25, 27].
Recently, all of these approaches have been cast in a unified
statistical modeling framework [14, 28],
where solutions to all models
can be computed using interior point (IP) methods.
The aim of this paper is to extend this framework
to the linear system identification problem. In particular,
we propose new impulse response estimators
that combine the stable spline kernels and
arbitrary piecewise linear quadratic (PLQ) penalties.
Generalizing the work in
[14, 28], we also allow the
inclusion of inequality constraints on the unknown parameters.
This generalization can be used to efficiently include additional information — for example, about nonnegativity and unimodality of
the impulse response — into the final estimate.
We show that all of these
models can be solved with IP techniques, with complexity
that scales well with the number of output measurements.
These new identification procedures are tested via a Monte Carlo study
where output error models are randomly generated and output data
(corrupted by outliers) is obtained. We compare the performance
of the classical stable spline estimator that uses a quadratic loss
with the performance of the new estimator
that uses loss.
The structure of the paper is as follows. In Section II,
we formulate the problem and briefly review the stable spline estimator
described in [5, 9].
In Section III we introduce the new class of non smooth stable spline estimators,
review the class of PLQ penalties, and generalize the framework in [29]
by including affine inequality constraints.
We also demonstrate how IP methods can be used to efficiently
compute the impulse response estimates.
In Section IV, the new approach is tested
via a Monte Carlo study, where system output measurements
are corrupted by outliers. We end the paper with Conclusions, and include
additional proofs in the Appendix.
Ii Problem statement and the stable spline estimator
Iia Statement of the problem
Consider the following linear timeinvariant discretetime system
(II.1) 
where is the output, is the shift operator (), is the linear operator associated with the true system, assumed stable, the input and the i.i.d. noise. Assuming the input known, our problem is to estimate the system impulse response from noisy measurements of .
IiB The stable spline estimator
We now briefly review the regularized approach to system identification proposed in [5, 9]. For this purpose, denote by the (column) vector containing the impulse response coefficients. Here, in contrast to classical approaches to system identification, the size is chosen sufficiently large to capture system dynamics rather than to establish any kind of tradeoff between bias and variance. It is useful to rewrite the measurement model (II.1) using the following matrixvector notation
(II.2) 
where the vector contains the output measurements, is a suitable matrix defined by input values, and denotes the noise of unknown variance . Then, the stable spline estimator is defined by the following regularized least squares problem:
(II.3) 
where the positive scalar is a regularization parameter, and can be taken from the class of stable spline kernels [10]. In particular, adopting the discretetime version of the stable spline kernel of order 1, the entry of is
(II.4) 
Above, is a kernel hyperparameter which corresponds
to the dominant pole of the system, and is typically unknown.
This kernel was also studied in
[9], where it was called the tuned/correlated (TC) kernel.
Motivations underlying the particular shape (II.4) have been
discussed under both a statistical and a deterministic framework,
see [30] and [31].
Note that the estimator (II.3), equipped
with the kernel (II.4), contains the
unknown hyperparameters and .
These can be obtained as follows.
First, the estimate of can be computed by fitting
a lowbias model for the impulse response using least squares
(as e.g. described in [32]).
Then, one can exploit the Bayesian interpretation
underlying problem (II.3): if the noise is Gaussian, it provides the minimum
variance estimate of when the impulse response is modeled as
a Gaussian vector independent of with autocovariance .
Here, is an unknown scale factor equal to .
The estimates of and are obtained
by maximizing the marginal likelihood (obtained by integrating out
of the joint density of and ). This gives
(II.5) 
where the matrix is
and the identity matrix (see [5] for details).
Iii New non smooth formulations of the stable spline estimator
To simplify the problem formulation, it is useful to introduce an auxiliary variable , and to to rewrite the classical stable spline estimator (II.3) using the following relationships:
(III.1) 
where is invertible. Using (III.1), (II.3)) becomes
(III.2) 
It is apparent that this estimator uses quadratic functions to define both the loss and the regularizer . In the rest of the paper we study a generalization of (III.2) given by
(III.3) 
where is a polyhedral set (which can be used e.g. to provide nonnegativity information on the impulse response ), and , are defined by the piecewise linear quadratic functions introduced in the next subsection.
Iiia PLQ penalties
Definition III.1 (PLQ functions and penalties)
A piecewise linear quadratic (PLQ) function is any function having representation
(III.4) 
where is a nonempty polyhedral set, the set of real symmetric positive semidefinite matrices, and is an injective affine transformation in , with , so, in particular, and .
When , the associated function is a penalty, since it is necessarily nonnegative.
Remark III.2
When and , we recover the basic piecewise linearquadratic penalties characterized in [33, Example 11.18].
Remark III.3 (scalar examples)
, , elastic net, Huber, hinge, and Vapnik penalties are all representable using the notation of Definition III.1.

: Take , , , and . We obtain
The function inside the is maximized at , hence .

: Take , , , and . We obtain
The function inside the is maximized by taking , hence .

Elastic net: . Take

Huber: Take , , , and . We obtain
with three explicit cases:

If , take to obtain .

If , take to obtain .

If , take to obtain a contribution of .
This is the Huber penalty.


Vapnik loss is given by . We obtain its PLQ representation by taking
to yield
IiiB Optimization with PLQ penalties
Consider a constrained minimization problem for a general PLQ penalty:
(III.5) 
where is a polyhedral set, described by
(III.6) 
After studying this problem, we will come back to consider the estimator (III.3).
It turns out that a wide class of problems (III.5) are solvable by interior point (IP) methods [35, 36, 37]. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing the optimality of these problems. [29, Theorem 13] presents a full convergence analysis for IP methods for formulations (III.5) without inequality constraints, so in (III.5). While a generalization of the full analysis to cover inequality constraints is out of the scope of this paper, we present an important computational result showing that constraints can be included in a straightforward manner, and provide the computational complexity of each interior point iteration. Moreover, the proof of the result (given in Appendix) shows that constraints help the numerical stability of the interior point iterations.
Theorem III.4 (Interior Point for PLQ with Constraints)
Consider any optimization problem of the form (III.5), with , , , , , , , and . Suppose that the PLQ satisfies
(III.7) 
Suppose also that contains on the order of entries, while contains on the order of entries. Then every interior point iterations can be computed with complexity .
The assumptions on the structure of and are satisfied for many common PLQ penalties. For example, for we have and , while for , and contains two copies of the identity matrix.
Turning out attention back to system identification, will be the dimension of the impulse response, while and may depend on ; in fact always, while depends on the structure of the PLQ penalty. To be more specific, we have the following corollary.
Corollary III.5
Note that the computational complexity of the IP method scales favorably with the number of measurements which, in the system identification scenario, is typically much larger than the number of unknown impulse response coefficients .
Iv Monte Carlo study
We consider a Monte Carlo study of 300 runs.
At each run, the MATLAB command
m=rss(30) is first used to obtain a SISO continuoustime
system of 30th order. The continuoustime system m
is then sampled at 3 times of its bandwidth, obtaining the
discretetime system md through
the commands: bw=bandwidth(m); f = bw*3*2*pi;
md=c2d(m,1/f,’zoh’). If all poles of md are within the
circle with center at the origin and radius 0.95 on the complex
plane, then the feedforward matrix of md is set to 0,
i.e. md.d=0, and the system is used and saved.
The system input at each run is white Gaussian noise of unit
variance. The input delay is always equal to 1
and this information is given to every estimator used in the
Monte Carlo study described below.
Data consists of 400 inputoutput pairs, which
are collected after getting rid of initial conditions,
and corrupted by a noise generated as a mixture of
two normals with a fraction of outlier contamination equal to 0.2; i.e.,
Here, is randomly
generated in each run as
the variance of the noiseless
output divided by the realization of a random variable
uniformly distributed on . With probability 0.2,
each measurement
may be contaminated by a random error
whose standard deviation is .
The quality of an estimator is measured
by computing the fit measure at every run.
To be more specific,
given a generic dynamic system represented by ,
let denote
the norm of its impulse response, numerically computed
using only the first 100 impulse response coefficients,
whose mean is denoted by . Then, the fit measure for the th run with estimated model is
(IV.1) 
During the Monte Carlo simulations, the following 5 estimators are used:

Oe+oracle. Classical PEM approach, with candidate models given by rational transfer functions defined by two polynomials of the same order. This estimator is implemented using the oe.m function of the MATLAB System Identification Toolbox equipped with the robustification option (’LimitError’,r)^{1}^{1}1As per MATLAB documentation, the value of r specifies when to adjust the weight of large errors from quadratic to linear. Errors larger than r times the estimated standard deviation have a linear weight in the criteria. The standard deviation is estimated robustly as the median of the absolute deviations from the median and divided by 0.7. The value r=0 disables the robustification and leads to a purely quadratic criterion. and an oracle, which provides a bound on the best achievable performance of PEM by selecting (at every run) the model order (between 1 and 20) and the value of r ( or ) that maximize (IV.1).

Oe+CV. Same as above, except that r=0 (the fit criterion is purely quadratic) and model order is estimated via cross validation. In particular, data are split into a training and validation data set of equal size. Then, for every model order ranging from 1 to 20, the MATLAB function oe.m (fed with the training set) is called. The estimate of the order minimizes the sum of squared prediction errors on the validation set. This is obtained by the MATLAB function predict.m (imposing null initial conditions) fed with the validation data set. The final model is computed by oe.m, using the estimated value of the order and all the available measurements (the union of the training and validation sets).

Oe+CVrob. Same as above, except that level of robustification r is also chosen via cross validation on the grid .

SS+. This is the classical stable spline estimator (II.3), which uses a quadratic loss and the stable spline regularizer. Hyperparameters are determined via marginal likelihood optimization, as described in subsection IIB. The number of estimated impulse response coefficients, i.e. the dimension of in (II.2), is . Only the first 100 inputoutput pairs are used to define the entries of the matrix in (II.2), so that the size of the measurement vector is .

SS+. This is the new nonsmooth version of the stable spline estimator. It coincides with (II.3) except that the quadratic loss is replaced by the loss. The hyperparameter defining the stable spline kernel in (II.4) and the regularization parameter are estimated via cross validation as follows. The matrix in (II.2) is defined as described above. Then, the remaining 300 inputoutput pairs are split into a training and validation data set of equal size. The estimates of the hyperparameters are chosen so that the corresponding impulse response estimate (obtained using only the training set) provides the best prediction on the validation data (according to a quadratic fit). The candidate hyperparameters are selected from a twodimensional grid. In particular, may assume values in while varies on a set given by 20 values logarithmically spaced between and , where is the estimate used by SS+. The final estimate of the impulse response is computed using the hyperparameter estimates and the union of training and validation data sets.
The plots in Fig. 2 are the Matlab
boxplots of the errors (IV.1) obtained by the 5 estimators.
The rectangle shows the quantiles of all
the numbers with the horizontal line showing the median.
The “whiskers” outside the rectangle display the
quantiles, with the remaining errors (which may be deemed outliers)
plotted using “+”.
The average fits obtained by Oe+oracle,
Oe+CV, Oe+CVrob,
SS+ and SS+
are and , respectively.
The best results are obtained by Oe+oracle. However, keep in mind
that this estimator relies on an ideal tuning
of the model order and of the level of robustification which is not implementable in practice.
In comparison with the other estimators,
the performance of
SS+ and Oe+CV is negatively influenced by the presence
of data contamination. The reason is that both of these estimators
use quadratic loss functions. Notice however that
textitSS+ largely outperforms Oe+CV.
Focusing now on numerical schemes equipped with robust losses, we see that SS+
outperforms Oe+CVrob. It provides the best results
among all the estimators implementable in practice: the stable spline kernel
introduces a suitable regularization with the loss
to guard against outliers.
V Conclusions
We have extended the stable spline estimator to a non smooth setting. Quadratic losses and regularizers can now be replaced by general PLQ functions, which allow new applications, such as robust estimators in the presence of outliers in the data. In addition, we presented an extended formulation that can include affine inequality constraints on the unknown impulse response, which can be used for example to incorporate nonnegativity into the estimate. We have shown that the corresponding generalized estimates can be computed in an efficient way by IP methods. Finally, our simulation results showed a significant performance improvement of the stable spline kernel with loss over previous art.
Vi Appendix
Via Proof of Theorem iii.4
From [33][Example 11.47], the Lagrangian for problem (III.5) for feasible is given by
Since is by assumption a polyhedral set, it can be expressed by a linear system of inequalities:
(VI.1) 
Using the explicit characterizations of and , the optimality conditions for (III.5) are
(VI.2)  
(see [33] and [38] for more details). The inequality constraint in the definition of in (VI.1) can be reformulated using slack variables :
Combining all of these equations yields the KKT system for (III.5):
(VI.3) 
The last two sets of equations in (VI.3) are known as the complementarity conditions. Solving the problem (III.5) is then equivalent to satisfying (VI.3), and there is a vast optimization literature on working directly with the KKT system. In the Kalman filtering/smoothing application, interior point methods have been used to solve the KKT system (VI.3) in a numerically stable and efficient manner, see e.g. [39].
An interior point approach applies damped Newton iterations to a relaxed version of VI.3:
(VI.4) 
The relaxation parameter is driven aggressively to as the method proceeds. Every Newton iteration solves
where
(VI.5) 
Using the row operations
we arrive at the system
where . Note that this matrix is invertible if and only if the hypothesis (III.7) holds. If is invertible, the row operations
reduce the system to upper triangular form
Note that , , , and are componentwise positive (which holds for every nonzero ), while is injective (see Definition III.1), hence is a square matrix of full rank. The term is also positive semidefinite, and only serves to stabilize the inversion of the final term. Therefore, we can carry out Newton iterations on the relaxed system, as claimed.
To show the computational complexity, we give the full interior point iteration, which is derived by applying the row operations used to obtain the upper triangular system to the right hand side , then solving for , and back substituting.
(VI.6)  
Note that the matrix can be constructed in operations if contains on the order of terms. The matrix can be constructed in operations, and inverted in operations. These operations dominate the complexity, giving the bound .
ViB Proof of Corollary iii.5
To translate (III.3) to (III.5), we have to specify the structures , which capture the impulse response constraints, the injective linear model, and the structure of , respectively.
Suppose that and are given by
(VI.7)  
First define
Adding and together, we obtain the general system identification objective with the following specification:
The matrix and vector encodes the constraints, as given by (III.6).
This completes the specification. The complexity result follows immediately from the assumptions on and Theorem III.4.
It is also worthwhile to consider the structure of (VI.6). First, note that
so in fact is block diagonal. This fact gives a more explicit formula for :
References
 [1] L. Ljung, System Identification, Theory for the User. Prentice Hall, 1999.
 [2] T. Söderström and P. Stoica, System Identification. PrenticeHall, 1989.
 [3] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, pp. 716–723, 1974.
 [4] T. J. Hastie, R. J. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Data Mining, Inference and Prediction. Canada: Springer, 2001.
 [5] G. Pillonetto and G. De Nicolao, “A new kernelbased approach for linear system identification,” Automatica, vol. 46, no. 1, pp. 81–93, 2010.
 [6] ——, “Pitfalls of the parametric approaches exploiting crossvalidation or model order selection,” in Proceedings of the 16th IFAC Symposium on System Identification (SysId 2012), 2012.
 [7] G. Pillonetto, A. Chiuso, and G. D. Nicolao, “Prediction error identification of linear systems: a nonparametric Gaussian regression approach,” Automatica, vol. 47, no. 2, pp. 291–305, 2011.
 [8] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
 [9] T. Chen, H. Ohlsson, and L. Ljung, “On the estimation of transfer functions, regularizations and Gaussian processes  revisited,” Automatica, vol. 48, no. 8, pp. 1525–1535, 2012.
 [10] G. Pillonetto, A. Chiuso, and G. De Nicolao, “Regularized estimation of sums of exponentials in spaces generated by stable spline kernels,” in Proceedings of the IEEE American Cont. Conf., Baltimora, USA, 2010.
 [11] J. S. Maritz and T. Lwin, Empirical Bayes Method. Chapman and Hall, 1989.
 [12] D. MacKay, “Bayesian interpolation,” Neural Computation, vol. 4, pp. 415–447, 1992.
 [13] J. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd ed., ser. Springer Series in Statistics. Springer, 1985.
 [14] A. Aravkin, J. Burke, and G. Pillonetto, “A statistical and computational theory for robust and sparse kalman smoothing,” in Proceedings of the 16th IFAC Symposium on System Identification (SysId 2012), 2012.
 [15] P. Huber, Robust Statistics. Wiley, 1981.
 [16] J. Gao, “Robust l1 principal component analysis and its Bayesian variational inference,” Neural Computation, vol. 20, no. 2, pp. 555–572, February 2008.
 [17] A. Aravkin, B. Bell, J. Burke, and G. Pillonetto, “An laplace robust kalman smoother,” Automatic Control, IEEE Transactions on, vol. 56, no. 12, pp. 2898–2911, dec. 2011.
 [18] S. Farahmand, G. Giannakis, and D. Angelosante, “Doubly robust smoothing of dynamical processes via outlier sparsity constraints,” IEEE Transactions on Signal Processing, vol. 59, pp. 4529–4543, 2011.
 [19] T. J. Hastie and R. J. Tibshirani, “Generalized additive models,” in Monographs on Statistics and Applied Probability. London, UK: Chapman and Hall, 1990, vol. 43.
 [20] B. Efron, T. Hastie, L. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics, vol. 32, pp. 407–499, 2004.
 [21] D. Donoho, “Compressed sensing,” IEEE Trans. on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 [22] R. Tibshirani, “Regression shrinkage and selection via the LASSO,” Journal of the Royal Statistical Society, Series B., vol. 58, pp. 267–288, 1996.
 [23] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society, Series B, vol. 67, pp. 301–320, 2005.
 [24] V. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley, 1998.
 [25] M. Pontil and A. Verri, “Properties of support vector machines,” Neural Computation, vol. 10, pp. 955–974, 1998.
 [26] T. Evgeniou, M. Pontil, and T. Poggio, “Regularization networks and support vector machines,” Advances in Computational Mathematics, vol. 13, pp. 1–150, 2000.
 [27] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett, “New support vector algorithms,” Neural Computation, vol. 12, pp. 1207–1245, 2000.
 [28] A. Aravkin, J. Burke, and G. Pillonetto, “Nonsmooth regression and state estimation using piecewise quadratic logconcave densities,” in Proceedings of the 51st IEEE Conference on Decision and Control (CDC 2012), 2012.
 [29] A. Y. Aravkin, J. V. Burke, and G. Pillonetto, “Sparse/robust estimation and kalman smoothing with nonsmooth logconcave densities: Modeling,computation, and theory, 2013.”
 [30] G. Pillonetto and G. De Nicolao, “Kernel selection in linear system identification – part I: A Gaussian process perspective,” in Proceedings of CDCECC, 2011.
 [31] T. Chen, H. Ohlsson, G. Goodwin, and L. Ljung, “Kernel selection in linear system identification – part II: A classical perspective,” in Proceedings of CDCECC, 2011.
 [32] G. Goodwin, M. Gevers, and B. Ninness, “Quantifying the error in estimated transfer functions with application to model order selection,” IEEE Transactions on Automatic Control, vol. 37, no. 7, pp. 913–928, 1992.
 [33] R. Rockafellar and R. Wets, Variational Analysis. Springer, 1998, vol. 317.
 [34] W. Chu, S. S. Keerthi, and C. J. Ong, “A unified loss function in bayesian framework for support vector regression,” in In Proceeding of the 18th International Conference on Machine Learning, 2001, pp. 51–58.
 [35] M. Kojima, N. Megiddo, T. Noma, and A. Yoshise, A Unified Approach to Interior Point Algorithms for Linear Complementarity Problems, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer Verlag, 1991, vol. 538.
 [36] A. Nemirovskii and Y. Nesterov, InteriorPoint Polynomial Algorithms in Convex Programming, ser. Studies in Applied Mathematics. Philadelphia, PA, USA: SIAM, 1994, vol. 13.
 [37] S. Wright, Primaldual interiorpoint methods. Englewood Cliffs, N.J., USA: Siam, 1997.
 [38] R. Rockafellar, Convex Analysis, ser. Priceton Landmarks in Mathematics. Princeton University Press, 1970.
 [39] A. Aravkin, B. Bell, J. Burke, and G. Pillonetto, “Learning using state space kernel machines,” in Proc. IFAC World Congress 2011, Milan, Italy, 2011.