On the estimation of initial conditions
in kernel-based system identification
Abstract
Recent developments in system identification have brought attention to regularized kernel-based methods, where, adopting the recently introduced stable spline kernel, prior information on the unknown process is enforced. This reduces the variance of the estimates and thus makes kernel-based methods particularly attractive when few input–output data samples are available. In such cases, however, the influence of the system initial conditions may have a significant impact on the output dynamics. In this paper, we specifically address this point. We propose three methods that deal with the estimation of initial conditions using different types of information. The methods consist of various mixed maximum likelihood–a posteriori estimators which estimate the initial conditions and tune the hyperparameters characterizing the stable spline kernel. To solve the related optimization problems, we resort to the expectation-maximization method, showing that the solutions can be attained by iterating among simple update steps. Numerical experiments show the advantages, in terms of accuracy in reconstructing the system impulse response, of the proposed strategies, compared to other kernel-based schemes that do not account for the effect of initial conditions.
1 Introduction
Regularized regression has a long history [tikhonov1977solutions]. It has become a standard tool in applied statistics [hastie2005elements], mainly due to its capability of reducing the mean square error (MSE) of the regressor estimate [james1961estimation], when compared to standard least squares [ljung1999system]. Recently, a novel method based on regularization has been proposed for system identification [pillonetto2014kernel]. In this approach, the goal is to estimate the impulse response of the system using so-called kernel-based methods [scholkopf2002learning]. To this end, the class of stable spline kernels has been proposed in [pillonetto2010new], [pillonetto2011prediction]. The main feature of these kernels is that they encode prior information on the exponential stability of the system and on the smoothness of the impulse response. These features have made stable spline kernels suitable for other estimation problems, such as the reconstruction of exponential decays [pillonetto2010regularized] and correlation functions [bottegal2013regularized]. Other kernels for system identification have been introduced in subsequent studies; see for instance [chen2012estimation], [chen2014constructive].
Stable spline kernels are parameterized by two hyperparameters that determine the magnitude and the shape of the kernel and that need to be estimated from data. An effective approach for hyperparameter estimation relies upon empirical Bayes arguments [maritz1989empirical]. Specifically, exploiting the Bayesian interpretation of regularization [wahba1990spline], the impulse response is modeled as the realization of a Gaussian process whose covariance matrix corresponds to the kernel. The hyperparameters are then estimated by maximizing the marginal likelihood of the output data, obtained by integrating out the dependence on the impulse response. Given a choice of hyperparameters, the unknown impulse response is found by computing its minimum-MSE Bayesian estimate [pillonetto2014kernel].
One situation where kernel-based methods are preferable is when data records are short (e.g., five times the rise time of the system). This is mainly for two reasons:

Kernel-based methods do not require the selection of a model order. Standard parametric techniques (such as the prediction error method [ljung1999system], [soederstroem1988system]) need to rely on model selection criteria, such as AIC or BIC, if the structure of the system is unknown [beghelli1990frisch]. These criteria can be unreliable when faced with small data sets.

The bias introduced by regularization reduces the variance. With small data records, the variance can be very high. If the bias is of the right kind, it will compensate for the variance effect in the MSE [hastie2005elements, Ch. 2.9].
When data records are very short (e.g., two times the rise time of the system), we cannot ignore the effect of the initial conditions. In fact, if the system is not at rest before the experiment is performed, then there are transient effects that cannot be explained using only the collected data. Standard workarounds, such as discarding those output samples that depend on the initial conditions or approximating the initial conditions with zero [ljung1999system, Ch. 10.1], may give unsatisfactory results. Thus, it seems preferable to deal with the initial conditions by estimating them. In this paper, we discuss how to incorporate the estimation of the initial conditions in the context of kernel-based system identification. We discuss three possible approaches to the problem. First, we propose a method that incorporates the unknown initial conditions as parameters, to be estimated along with the kernel hyperparameters. Then, assuming that the input is an autoregressive moving-average (ARMA) stationary process, we propose to estimate the initial conditions using the available samples of the input, thus designing a minimum variance estimate of the initial conditions from the input samples. Finally, we design a mixed maximum a posteriori–marginal likelihood (MAP–ML) estimator (see [yeredor2000joint]) that effectively exploits information from both input and output data. We solve the optimization problems using novel iterative schemes based on the expectation-maximization (EM) method [dempster1977maximum], similar to the technique used in our previous works [risuleo2015kernel] and [bottegal2015blind], where methods for Hammerstein and blind system identification are proposed. We show that each iteration consists of a set of simple update rules, which either are available in closed form or involve scalar optimization problems that can be solved using a computationally efficient grid search.
The paper is organized as follows. In Section 2, we formulate the problem of system identification with uncertainty on the initial conditions. In Section 3, we provide a short review of kernel-based system identification. In Section 4, we propose the initial-conditions estimation strategies and the related system identification algorithms. In Section 5, we show the results of numerical experiments, where the discussed methods are compared with standard techniques used to deal with unknown initial conditions. In Section 6, we summarize the work and conclude the paper.
2 Problem formulation
We consider the output error model of the form

(1) $y(t) = \sum_{k=0}^{+\infty} g(k)\,u(t-k) + e(t),$

where $g$ is the impulse response of a linear time-invariant system. For notational convenience, we assume there are no delays in the system (the sum in (1) starts at $k = 0$). We approximate $g$ by considering its first $n$ samples $g(0), \ldots, g(n-1)$, where $n$ is chosen large enough to capture the system dynamics. The system is driven by the input $u(t)$, and the measurements of the output are corrupted by the process $e(t)$, which is zero-mean white Gaussian noise with variance $\sigma^2$.
Given a set of $N$ measurements of the input and output, denoted by $\{u(t)\}_{t=0}^{N-1}$, $\{y(t)\}_{t=0}^{N-1}$, we are interested in estimating the first $n$ samples of the impulse response $g$. To this end, we formulate this system identification problem as the linear regression problem

(2) $y = Ug + e,$

where we have introduced the following vector/matrix notation

(3) $y = \begin{bmatrix} y(0) & \cdots & y(N-1) \end{bmatrix}^T$, $g = \begin{bmatrix} g(0) & \cdots & g(n-1) \end{bmatrix}^T$, $e = \begin{bmatrix} e(0) & \cdots & e(N-1) \end{bmatrix}^T$,

and $U$ is the $N \times n$ Toeplitz matrix of the input, whose row $t$ contains $u(t), u(t-1), \ldots, u(t-n+1)$. The matrix $U$ contains the samples $u(-1), \ldots, u(-n+1)$, that we call initial conditions, which are unavailable. Common ways to overcome this problem are, for instance:

Discard the first $n-1$ collected samples of $y$. However, if $N$ is not much larger than $n$, there is a considerable loss of information.

Assume that the system is at rest before the experiment is performed (i.e., $u(t) = 0$ for $t < 0$). This assumption might be too restrictive or unrealistic.
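To make the role of the initial conditions concrete, the following Python sketch builds the regression matrix of (2) from the observed input and a vector of pre-experiment samples (function and variable names are ours, not the paper's):

```python
import numpy as np

def toeplitz_regressor(u, u0):
    """Regression matrix of (2): row t holds u(t), u(t-1), ..., u(t-n+1).

    u  : observed input samples u(0), ..., u(N-1)
    u0 : the n-1 pre-experiment samples u(-n+1), ..., u(-1)
         (the "initial conditions"), oldest first
    """
    n = len(u0) + 1                           # FIR length
    ext = np.concatenate([u0, u])             # full input history
    rows = np.arange(len(u))[:, None]
    lags = np.arange(n)[None, :]
    return ext[rows - lags + n - 1]           # entry (t, k) = u(t - k)
```

Setting `u0` to zeros reproduces the at-rest assumption above, while dropping the first `n-1` rows of the result reproduces the truncation workaround.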
In this paper, our aim is to study how to exploit the available information to estimate the initial conditions, in order to improve the identification performance. Specifically, we will present three estimators that make different use of the available information.
3 Kernel-based system identification
In this section, we briefly review the kernel-based approach introduced in [pillonetto2010new], [pillonetto2011prediction]. Exploiting the Bayesian interpretation of kernel-based methods [wahba1990spline], we model the unknown impulse response as a Gaussian random process, namely

(4) $g \sim \mathcal{N}(0, K_\eta).$
We parameterize the covariance matrix $K_\eta$ (the kernel) with the hyperparameter vector $\eta$. The structure of the kernel determines the properties of the realizations of (4); its choice is therefore of paramount importance. An effective kernel for system-identification purposes is the stable spline kernel [pillonetto2010new], [pillonetto2011kernel]. In particular, in this paper we use the first-order stable spline kernel (or TC kernel in [chen2012estimation]), which is defined as
(5) $\{K_\eta\}_{i,j} = \lambda\,\beta^{\max(i,j)},$

where $\beta$ is a scalar in the interval $(0, 1)$. The role of this hyperparameter is to regulate the velocity of the exponential decay of the impulse responses drawn from (4). The hyperparameter $\lambda \geq 0$ is a scaling factor that regulates the amplitude of the realizations of (4).
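As an illustration, the first-order stable spline kernel can be built in a few lines of Python (a sketch; indexing from 1, as is common for this kernel, is our assumption):

```python
import numpy as np

def stable_spline_kernel(n, lam, beta):
    """First-order stable spline (TC) kernel: K[i, j] = lam * beta**max(i, j),
    with indices i, j = 1, ..., n."""
    i, j = np.indices((n, n))
    return lam * beta ** np.maximum(i + 1, j + 1)
```

Realizations drawn from a zero-mean Gaussian with this covariance decay exponentially, at a rate governed by `beta`, and are scaled by `lam`.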
We collect the hyperparameters into the vector

(6) $\eta = \begin{bmatrix} \lambda & \beta \end{bmatrix}^T,$
and introduce the following notation:

$\bar{u} = \begin{bmatrix} u(-1) & \cdots & u(-n+1) \end{bmatrix}^T,$

where $\bar{u}$ contains the unknown initial conditions. Since we have assumed a Gaussian distribution for the noise, the joint description of $y$ and $g$ is Gaussian, parameterized by $\eta$ and $\bar{u}$. Therefore, we can write
(7) $\begin{bmatrix} g \\ y \end{bmatrix} \sim \mathcal{N}\left(0,\; \begin{bmatrix} K_\eta & K_\eta U^T \\ U K_\eta & \Sigma_y \end{bmatrix}\right),$

where $\Sigma_y = U K_\eta U^T + \sigma^2 I_N$ and $I_N$ denotes the $N \times N$ identity matrix. It follows that the posterior distribution of $g$ given $y$ is Gaussian, namely
(8) $p(g \mid y;\, \eta, \bar{u}) = \mathcal{N}(\hat{g}, P),$

where

(9) $\hat{g} = K_\eta U^T \Sigma_y^{-1} y, \qquad P = K_\eta - K_\eta U^T \Sigma_y^{-1} U K_\eta.$
Equation (8) implies that the minimum variance estimator of $g$ (in the Bayesian sense, see [anderson2012optimal]) is

(10) $\hat{g} = \mathbb{E}[g \mid y] = K_\eta U^T \Sigma_y^{-1} y.$
The estimate (10) depends on the hyperparameter vector $\eta$ and on the initial conditions $\bar{u}$. These quantities need to be estimated from data. In the next section, we focus our attention on the estimation of the kernel hyperparameters and the initial conditions, describing different strategies to obtain these quantities.
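In code, the Bayes estimate (10), together with the posterior covariance in (9), reads as follows (a sketch in our notation, assuming a fully known regressor matrix `U`, kernel matrix `K`, and noise variance `sigma2`):

```python
import numpy as np

def bayes_estimate(U, y, K, sigma2):
    """Posterior mean and covariance of g given y, cf. (9)-(10)."""
    Sy = U @ K @ U.T + sigma2 * np.eye(len(y))   # output covariance
    W = np.linalg.solve(Sy, U @ K)               # Sy^{-1} U K
    g_hat = W.T @ y                              # K U^T Sy^{-1} y
    P = K - K @ U.T @ W                          # posterior covariance
    return g_hat, P
```

Solving one linear system with `Sy` and reusing the factor for both moments avoids forming an explicit inverse.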
Remark 1.
The estimator (10) depends also on the noise variance $\sigma^2$. In this work, we assume that this parameter is known. It can, for instance, be estimated by fitting a least-squares estimate of the system and then computing the sample variance of the residuals.∎
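Remark 1's suggestion can be sketched as follows: fit a long FIR model by unregularized least squares and take the sample variance of the residuals. The details below (in particular the degrees-of-freedom correction) are our choices, not prescribed by the paper.

```python
import numpy as np

def noise_variance_estimate(U, y):
    # least-squares FIR fit, then sample variance of the residuals
    g_ls, *_ = np.linalg.lstsq(U, y, rcond=None)
    r = y - U @ g_ls
    dof = len(y) - U.shape[1]          # degrees of freedom of the residuals
    return (r @ r) / max(dof, 1)
```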
4 Estimation of initial conditions and hyperparameters
In most works on kernel-based system identification (see, e.g., [pillonetto2014kernel] for a survey), the authors adopt an empirical Bayes approach to estimate the hyperparameters that define the kernel. This amounts to maximizing the marginal likelihood (ML) of the output, found by integrating $g$ out of (7).
In the standard case, that is, when $\bar{u}$ is assumed to be known, the ML estimator of the hyperparameters corresponds to

(11) $\hat{\eta} = \arg\max_{\eta}\; p(y;\, \eta, \bar{u}).$
We start from (11) to design new estimators for the initial conditions and the kernel hyperparameters.
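For concreteness, a minimal Python sketch of (11): minimize the negative log marginal likelihood of $y$ over $(\lambda, \beta)$, assuming the TC kernel (5), a fully known regressor matrix, and known noise variance (the parameterization and the optimizer are our choices):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marglik(theta, U, y, sigma2):
    lam = np.exp(theta[0])                      # lam > 0
    beta = 1.0 / (1.0 + np.exp(-theta[1]))      # beta in (0, 1)
    n = U.shape[1]
    i, j = np.indices((n, n))
    K = lam * beta ** np.maximum(i + 1, j + 1)  # TC kernel, cf. (5)
    Sy = U @ K @ U.T + sigma2 * np.eye(len(y))  # marginal covariance of y
    _, logdet = np.linalg.slogdet(Sy)
    return 0.5 * (logdet + y @ np.linalg.solve(Sy, y))

def estimate_hyperparameters(U, y, sigma2):
    res = minimize(neg_log_marglik, np.zeros(2), args=(U, y, sigma2),
                   method="Nelder-Mead")
    return np.exp(res.x[0]), 1.0 / (1.0 + np.exp(-res.x[1]))
```

The log/logit reparameterization turns the constrained problem over $\lambda > 0$, $\beta \in (0,1)$ into an unconstrained one; the paper instead advocates a grid search over the scalar hyperparameters.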
4.1 Modelless estimate
The most straightforward generalization of (11) is to include the initial conditions among the ML parameters. The initial conditions become unknown quantities that parameterize the impulse response estimator. The ML criterion then becomes

(12) $(\hat{\eta}, \hat{\bar{u}}) = \arg\max_{\eta, \bar{u}}\; p(y;\, \eta, \bar{u}),$

where the maximization is carried out over the unknown initial conditions as well. This problem is nonconvex and possibly high dimensional, as the number of initial conditions to be estimated is equal to the number of impulse response samples. To overcome this difficulty, we devise a strategy based on the expectation-maximization method that yields a solution to (12) by iterating simple updates. To this end, we suppose we have, at iteration $k$, estimates $\hat{\eta}^{(k)}$ and $\hat{\bar{u}}^{(k)}$ of the unknown quantities, and we calculate the current estimate $\hat{g}^{(k)}$ of the impulse response, as well as its variance $P^{(k)}$, from (9). Define the matrix
(13) $\hat{S}^{(k)} = P^{(k)} + \hat{g}^{(k)} \big(\hat{g}^{(k)}\big)^T,$

which is the second moment of the current estimated impulse response. We introduce the discrete-derivator matrix
(14) 
and we calculate the second moment of the derivative of the estimated impulse response at iteration $k$, given by
(15) 
The Toeplitz matrix of the input samples can be split into two parts, namely one part fully determined by the available samples and one part composed of the unknown initial conditions. Define the matrix that links the part containing the initial conditions to the vector $\bar{u}$, and call the Toeplitz matrix of the estimated impulse response at the $k$th iteration. Furthermore, define
(16)  
With all the definitions in place, we can state the theorem that provides us with the iterative update of the estimates that solves (12).
Theorem 1.
Consider the hyperparameter estimator (12). Starting from any initial guess of the initial conditions and the hyperparameters, compute
(17)  
(18)  
(19) 
with and defined in (16), and
(20)  
(21) 
where is the th diagonal element of (15), and is the th element of
(22) 
when . Let $k \to \infty$; then the sequences of estimates $\hat{\eta}^{(k)}$ and $\hat{\bar{u}}^{(k)}$ converge to a maximum of (12).
Proof.
See Appendix. ∎
Remark 2.
The EM method does not guarantee convergence of the sequences to a global maximum (see [mclachlan2007algorithm] and [tseng2004analysis]). However, experiments (see Section 5) have shown that, for this particular problem, the EM method converges to the global maximum independently of how it is initialized.
4.2 Conditional mean estimate
The modelless estimator presented in Section 4.1 estimates the initial conditions using only information present in the system output $y$. It does not rely on any model of the input signal $u$. To show how an available model can be used to estimate the missing initial conditions, we introduce the following assumption.
Assumption 1.
The input $u(t)$ is a realization of a stationary Gaussian process with zero mean and known rational spectrum. Equivalently, $u(t)$ is a realization of an ARMA process with known coefficients.∎
Assumption 1 implies that $u(t)$ can be expressed as the output of a difference equation driven by white Gaussian noise with unit variance [papoulis2002probability], namely

(23) $u(t) + a_1 u(t-1) + \cdots + a_p u(t-p) = b_0 w(t) + b_1 w(t-1) + \cdots + b_q w(t-q),$

where $w(t) \sim \mathcal{N}(0, 1)$. Since, under Assumption 1, we can construct the probability density of the input process, a possible approach to solve (11) is to estimate the missing initial conditions from the input process. To this end, consider (23). If we define the matrix $A$ as the Toeplitz matrix of the coefficients $1, a_1, \ldots, a_p$, and $B$ as the Toeplitz matrix of the coefficients $b_0, b_1, \ldots, b_q$, then we can write
(24) $A \begin{bmatrix} \bar{u} \\ u \end{bmatrix} = B w,$

so that $\begin{bmatrix} \bar{u} \\ u \end{bmatrix} \sim \mathcal{N}(0, \Sigma)$, with

(25) $\Sigma = A^{-1} B B^T A^{-T}.$
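Equation (25) translates directly into code: form the banded lower-triangular Toeplitz matrices of the ARMA coefficients and solve. The sketch below (our naming; as in (24), noise samples before the considered window are neglected) returns the joint covariance of a block of consecutive input samples:

```python
import numpy as np
from scipy.linalg import toeplitz, solve_triangular

def arma_covariance(a, b, size):
    """Covariance (25) of `size` consecutive samples of the ARMA process
    u(t) + a1 u(t-1) + ... = b0 w(t) + b1 w(t-1) + ..., w white, unit variance.

    a : AR coefficients [1, a1, ..., ap];  b : MA coefficients [b0, ..., bq]
    """
    A = toeplitz(np.r_[a, np.zeros(size - len(a))], np.zeros(size))
    B = toeplitz(np.r_[b, np.zeros(size - len(b))], np.zeros(size))
    AinvB = solve_triangular(A, B, lower=True)   # A^{-1} B
    return AinvB @ AinvB.T                       # Sigma = A^{-1} B B^T A^{-T}
```

Since `A` is lower triangular with unit diagonal, the solve is cheap and always well defined.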
We thus have a joint probabilistic description of the initial conditions $\bar{u}$ and the available input samples $u$. We can partition $\Sigma$ into four blocks according to the sizes of $\bar{u}$ and $u$:

$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$

It follows (see [anderson2012optimal]) that the posterior distribution of the unavailable data is $p(\bar{u} \mid u) = \mathcal{N}(\mu, \Xi)$, where

(26) $\mu = \Sigma_{12} \Sigma_{22}^{-1} u, \qquad \Xi = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.$
So we can find the minimum variance estimate of $\bar{u}$ as the conditional mean, namely $\hat{\bar{u}} = \mathbb{E}[\bar{u} \mid u] = \mu$.
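The conditioning step in (26) is the standard formula for partitioned Gaussians; a self-contained sketch (block names are ours):

```python
import numpy as np

def condition_on_input(Sigma, u):
    """Posterior mean and covariance of the unavailable samples given the
    observed input u; Sigma is the joint covariance of (u0, u), zero mean."""
    m = Sigma.shape[0] - len(u)                      # number of missing samples
    S11, S12, S22 = Sigma[:m, :m], Sigma[:m, m:], Sigma[m:, m:]
    mean = S12 @ np.linalg.solve(S22, u)             # conditional mean
    cov = S11 - S12 @ np.linalg.solve(S22, S12.T)    # conditional covariance
    return mean, cov
```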
Having an estimate of the initial conditions, we need to find the hyperparameters that define the kernel. In this case, empirical Bayes amounts to solving the ML problem
(27) $\hat{\eta} = \arg\max_{\eta}\; p(y;\, \eta, \hat{\bar{u}}),$
where the unknown initial conditions have been replaced by their conditional mean. The following theorem states how to solve the maximization using the EM method.
Theorem 2.
Proof.
See Appendix. ∎
4.3 Joint input–output estimate
The conditional mean estimator presented in Section 4.2 exploits the structure of the input to estimate the missing samples. The modelless estimator, instead, uses information contained in the output samples. In this section, we show how to merge these two information sources, defining a joint input–output estimator of the initial conditions.
We use Assumption 1 to account for the statistical properties of $u$. We propose the following mixed MAP–ML estimator
(30) $(\hat{\eta}, \hat{\bar{u}}) = \arg\max_{\eta, \bar{u}}\; p(y;\, \eta, \bar{u})\; p(\bar{u} \mid u),$
where we have highlighted the dependence on the known input sequence $u$. A key role is played by the term $p(\bar{u} \mid u)$: it acts as a prior distribution for the unknown values of $\bar{u}$ and puts weight on the values that better agree with the observed data $u$.
Even in this case, the solution can be found with an iterative procedure based on the EM method.
Theorem 3.
Proof.
See Appendix. ∎
Remark 4.
This estimator incorporates the prior information about the initial conditions using the mean $\mu$ and the covariance matrix $\Xi$. If we suppose that we can manipulate $\Xi$, we can see this estimator as a more general estimator that contains the modelless and conditional-mean estimators as limit cases. In fact, letting the prior become noninformative, we get the modelless estimator. Conversely, letting the prior collapse onto its mean, we obtain the conditional-mean estimator, as (31) would yield a degenerate iteration where all the updates equal $\mu$. We point out, however, that the modelless estimator does not rely on any assumption on the input model, whereas the joint input–output estimator requires that the input is a Gaussian process with known pdf.
5 Numerical experiments
5.1 Experiment setup
To compare the proposed methods, we perform five numerical experiments, each consisting of 200 Monte Carlo simulations, with sample sizes 150, 200, 250, 300, and 400. At each Monte Carlo run, we generate a dynamic system of order 40. The system is such that its zeros are constrained within the circle of radius 0.99 in the complex plane, while the magnitude of the poles is no larger than 0.95. The impulse response length is 100 samples. The input is obtained by filtering white noise with unit variance through an 8th-order ARMA filter of the form (23). The coefficients of the filter are randomly chosen at each Monte Carlo run, such that the poles of the filter are constrained within the annulus of radii 0.8 and 0.95 in the complex plane.
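The input generation described above can be sketched as follows (the pole and zero placement details beyond the stated magnitude constraints are our choices):

```python
import numpy as np
from scipy.signal import lfilter

def random_arma_input(N, order=8, rmin=0.8, rmax=0.95, seed=0):
    """Unit-variance white noise filtered through a random stable ARMA filter
    whose poles lie in the annulus rmin <= |p| <= rmax (cf. Section 5.1)."""
    rng = np.random.default_rng(seed)
    radii = rng.uniform(rmin, rmax, order // 2)
    angles = rng.uniform(0.0, np.pi, order // 2)
    poles = radii * np.exp(1j * angles)              # conjugate pairs -> real filter
    a = np.poly(np.r_[poles, poles.conj()]).real     # AR polynomial, a[0] = 1
    b = np.poly(0.9 * rng.uniform(-1, 1, order))     # real zeros inside the unit disk
    w = rng.standard_normal(N)                       # driving white noise
    return lfilter(b, a, w)
```

Drawing the poles in conjugate pairs guarantees real filter coefficients, and keeping them inside the unit circle guarantees a stationary input.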
Random trajectories of input and noise are generated at each run. In particular, the noise variance is such that the ratio between the variance of the noiseless output of the system and the noise variance is equal to 20.
The following estimators are compared during the experiments.

KBICZeros: This estimator avoids the estimation of the initial conditions by assuming that the system is at rest before the experiment; the hyperparameters are obtained solving (11) with the initial conditions set to zero.

KBTrunc: This method also avoids the estimation of the initial conditions, by discarding the first output samples, which depend on the unknown vector $\bar{u}$. The hyperparameters are obtained solving (11), using the truncated data.

KBICModLess: This is the modelless kernel-based estimator presented in Section 4.1.

KBICMean: This is the conditional mean kernel-based estimator presented in Section 4.2.

KBICJoint: This is the joint input–output kernel-based estimator presented in Section 4.3.

KBICOracle: This estimator has access to the true vector $\bar{u}$ of initial conditions, and estimates the kernel hyperparameters using (11).
The performances of the estimators are evaluated by means of the fitting score, computed as
(34) $\mathrm{fit}_i = 100 \left(1 - \frac{\| g_i - \hat{g}_i \|_2}{\| g_i - \bar{g}_i \|_2}\right),$

where $g_i$ is the impulse response generated at the $i$th run, $\bar{g}_i$ its mean, and $\hat{g}_i$ the estimate computed by the tested methods.
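In code, the fitting score (34) is simply:

```python
import numpy as np

def fit_score(g_true, g_est):
    # percentage fit, cf. (34): 100 * (1 - ||g - ghat|| / ||g - mean(g)||)
    err = np.linalg.norm(g_true - g_est)
    baseline = np.linalg.norm(g_true - np.mean(g_true))
    return 100.0 * (1.0 - err / baseline)
```

A score of 100 means perfect reconstruction, while a score of 0 means the estimate is no better than the constant mean of the true response.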
5.2 Results
Table 1: Average impulse-response fit (in percent) over the Monte Carlo runs, for the different data lengths $N$.

               N=150   N=200   N=250   N=300   N=400
KBICZeros      51.698  54.856  61.151  61.380  63.186
KBTrunc        42.010  51.038  59.400  61.085  62.963
KBICModLess    54.017  55.793  61.687  63.074  63.466
KBICMean       54.146  56.003  62.061  63.715  64.187
KBICJoint      55.695  57.133  62.776  64.310  64.457
KBICOracle     57.317  57.902  63.781  64.893  64.959
Table 1 shows the average fit (in percent) of the impulse response over the Monte Carlo experiments. We can see that, for short data sequences, the amount of information discarded by the estimator KBTrunc makes its performance degrade with respect to the other estimators. The estimator KBICZeros performs better, but suffers from the effects of the wrong assumption that the system was at rest before the experiment was performed. From these results, we see that the estimation of the initial conditions has a positive effect on the accuracy of the estimated impulse response. For larger data records, the performances of the estimators KBICMean and KBICModLess improve, as more samples become available.

When the available data record becomes larger, all the methods perform well, with fits in the neighborhood of the fit of the oracle.

The positive performance of KBICMean indicates that the predictability of the input can be exploited to improve the estimates, and that model-based approaches to initial-condition estimation outperform modelless estimation methods (if the input model is known). The further improvement of KBICJoint indicates that the output measurements can be used to obtain additional information about the unobserved initial conditions, information that is not contained in the input process itself.
6 Discussion
We have proposed three new methods for estimating the initial conditions of a system identified using kernel-based methods. Assuming that the input is a stationary ARMA process with known spectrum, we have designed mixed MAP–ML criteria which aim at jointly estimating the hyperparameters of the kernel and the initial conditions of the system. To solve the related optimization problems, we have proposed novel EM-based iterative schemes. Each scheme consists of a sequence of simple update rules, given by unconstrained quadratic problems or scalar optimization problems. Numerical experiments have shown that the proposed methods outperform other standard approaches, such as truncation or assuming zero initial conditions.
The methods presented here estimate $n-1$ initial conditions (where $n$ is the length of the FIR model approximating the true system), since no information on the order of the system is given. Assuming that the system order is known and equal to, say, $p$, the number of initial conditions to be estimated would boil down to $p$. However, there would also be unknown transient responses which need to be identified. These transients would be characterized by impulse responses with the same poles as the overall system impulse response, but with different zeros. How to design a kernel correlating these transient responses with the system impulse response is still an open problem.
Appendix: Proofs
6.1 Theorem 1
Consider the ML criterion (12). To apply the EM method, we consider the complete log-likelihood

$\log p(y, g;\, \eta, \bar{u}),$

where we have introduced the impulse response $g$ as a latent variable. Suppose that we have computed the estimates $\hat{\eta}^{(k)}$ of the hyperparameters and $\hat{\bar{u}}^{(k)}$ of the initial conditions. We define the function
(35) 
where the expectation is taken with respect to the conditional density $p(g \mid y;\, \hat{\eta}^{(k)}, \hat{\bar{u}}^{(k)})$, defined in (8). We obtain (neglecting terms independent of the optimization variables)
(36) 
where
(37)  
(38) 
Define the new parameter estimates from
(39) 
By iterating between (35) and (39), we obtain a sequence of estimates that converges to a maximum of (12) (see, e.g., [mclachlan2007algorithm] for details). Using (36), (39) splits into the two maximization problems
(40)  
(41) 
Consider the maximization of (37) with respect to . Using the matrix we have
(42)  
(43) 
where $\otimes$ denotes the Kronecker product. We now collect the terms in $\bar{u}$, and obtain
(44) 
with and defined in (16). Hence, (40) is an unconstrained quadratic optimization, whose solution is given by (17).
Consider now the maximization of (38) with respect to $\eta$. We can calculate the derivative of (38) with respect to $\lambda$, obtaining
(45) 
which is equal to zero for
(46) 
We thus have an expression of the optimal value of $\lambda$ as a function of $\beta$. If we insert this value in (38), we obtain
(47) 
where is constant. We now rewrite the first order stable spline kernel using the factorization (see [carli2014maximum])
(48) 
where is defined in (14) and
(49) 
From (15), we find
(50) 
where and are the $i$th diagonal elements of and of , respectively, and is a constant. If we define the function in (21), we can rewrite (47) as (20). Using a similar reasoning, we can rewrite (46) as (19). ∎
6.2 Theorem 2
6.3 Theorem 3
Consider the ML criterion (30). Consider the complete data loglikelihood
(51) 
Given any estimates $\hat{\eta}^{(k)}$ and $\hat{\bar{u}}^{(k)}$, take the expectation of (51) with respect to the posterior density (8). We obtain
(52) 
with and defined in (37) and (38). Collecting the terms in $\bar{u}$ and using (44), we obtain an unconstrained optimization problem in $\bar{u}$, which gives (31). The optimization with respect to $\eta$ follows the same procedure as in Theorem 1 and gives (32) and (33).