A new kernel-based approach to system identification
with quantized output data
In this paper we introduce a novel method for linear system identification with quantized output data. We model the impulse response as a zero-mean Gaussian process whose covariance (kernel) is given by the recently proposed stable spline kernel, which encodes information on regularity and exponential stability. This serves as a starting point to cast our system identification problem into a Bayesian framework. We employ Markov Chain Monte Carlo methods to provide an estimate of the system. In particular, we design two methods based on the so-called Gibbs sampler that allow also to estimate the kernel hyperparameters by marginal likelihood maximization via the expectation-maximization method. Numerical simulations show the effectiveness of the proposed scheme, as compared to the state-of-the-art kernel-based methods when these are employed in system identification with quantized data.
First]Giulio Bottegal, Second]Håkan Hjalmarsson, Third]Gianluigi Pillonetto
Department of Electrical Engineering, Eindhoven University of
Technology, Eindhoven, The Netherlands
Automatic Control Lab and ACCESS Linnaeus Centre, School of Electrical Engineering, KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: email@example.com)
Department of Information Engineering, University of Padova, Padova, Italy (e-mail: firstname.lastname@example.org)
Key words: System identification; kernel-based methods; quantized data; expectation-maximization; Gibbs sampler
Many applications in communications, control systems, bioinformatics, require modeling and prediction of dynamic systems with quantized output data (see e.g. [bae2004gene], [carli2010quantized] and [wang2010system]). In particular, in this paper we assume that an unknown input-output map is defined by the composition of a linear time-invariant dynamic system and a quantizer, as depicted in Figure 1.
Estimation of this type of structure is a challenging task. Standard system identification techniques, such as least-squares or the Prediction Error Method (PEM) [Ljung], [Soderstrom], may give poor performances, because the presence of the quantizer can determine a significant loss of information on the system dynamics. Recently, there has been an increasing interest on developing new methods to solve this problem. Particular attention has been devoted to the case of binary measurements [wang2003system], [wang2006joint], also studying on-line recursive identification [jafari2012convergence], [guo2013recursive]. In other works, e.g. [colinet2010weighted], the knowledge of a dithering signal is exploited to improve the identification performance. The problem of experiment design is analyzed in [casini2011input], [casini2012input] and [godoy2014novel]. More recently, system identification with non-binary quantizers has been also addressed in [marelli2013identification], [moschitta2015parametric], [guo2015asymptotically], [wang2015identification]. In [godoy2011identification], [godoy2014identification], [chen2012impulse], the system identification problem is posed as a maximum likelihood/a-posteriori problem. In particular, the framework proposed in [chen2012impulse] uses the nonparametric kernel-based identification approach proposed in [SS2010], see also [pillonetto2014kernel] for a survey.
Similarly to [chen2012impulse], in this paper we cast the problem of identifying a linear dynamic systems with quantized data in a Bayesian framework. To this end, we model the impulse response of the unknown system as a realization of a zero-mean Gaussian random process. The covariance matrix (in this context also called a kernel) corresponds to the recently introduced stable spline kernel (see [SS2010], [SS2011], [bottegal2013regularized]). The structure of this type of kernel depends on two hyperparameters, that need to be estimated from data. Following an empirical Bayes approach [Maritz:1989], the hyperparameters, together with the noise variance, are estimated via marginal likelihood maximization. These quantities are then used to compute the minimum mean-square estimate (MMSE) estimate of the impulse response.
The whole procedure, which has been proven effective and relatively simple in the standard (non-quantized) setting, becomes more involved in the scenario under study. The MMSE system estimate requires the computation of an analytically intractable integral. To accomplish this task, in this paper we propose sampling methods based on Markov Chain Monte Carlo (MCMC) techniques. In particular, we design two different sampling techniques using the so called Gibbs sampler [geman1984stochastic], which often enjoys faster convergence properties compared to standard Metropolis-Hastings sampling techniques [Gilks].
Another contribution of the paper is to show that the task of estimating the kernel hyperparameters (and the noise variance) can be accomplished using the proposed sampling techniques. Using the Expectation-Maximization (EM) method [dempster1977maximum], we design an iterative scheme for marginal likelihood maximization, where the E-step characterizing the EM method makes use of the Gibbs sampler (see also [casella2001empirical]), and the M-step results in a sequence of straightforward optimization problems. Interestingly, the resulting estimation scheme can be seen as a generalization of the method proposed in [godoy2014identification] to the Bayesian nonparametric framework and, differently from [bottegal2015bayesian] and [chen2012impulse], it allows tuning all the kernel hyperparameters.
The paper is organized as follows. In the next section, we introduce the problem of identifying dynamic systems from quantized data. In Section 3, we formulate the proposed Bayesian model and discuss its inference. In Section 4, we describe the proposed identification method. Section 5 shows the results of several numerical experiments to assess the performance of the proposed method. Some conclusions end the paper.
2 Problem formulation
We consider a linear time-invariant discrete-time dynamic system of the form
where is a bounded-input-bounded-output (BIBO) stable impulse response representing the dynamics of the system. We approximate the impulse response by considering the first samples only, namely , where is assumed large enough to capture the system dynamics\@footnotemark\@footnotetextAll the results obtained in this paper can be also extended to the case but we prefer to consider a finite value for to simplify exposition. In addition, the assumption , with the data set size, is typical of the system identification scenario.. The input is a measurable signal; for the sake of simplicity, we assume that the system is at rest until , meaning . We shall not make any specific requirement on the input sequence (i.e., we do not assume any condition on persistent excitation in the input [Ljung]), requiring only for some . Notice, however, that even though the algorithm do not require any specifics of the input, the resulting estimate is of course highly dependent on the properties of the input [Ljung]. The output is corrupted by the additive zero-mean Gaussian white noise with variance , assumed unknown. Introducing the vectors
an approximation of (2) can be written in linear regression form, namely
The output is not directly measurable, only a quantized version being available, namely
where is a known map of the type
with and (and typically and ).
A particular and well-studied case is the binary quantizer, defined as
It is well-known that a condition on the threshold to guarantee identifiability of the system is . In fact, when , the system can be determined up to a scaling factor [godoy2011identification].
We assume that input-output data samples , are collected during an experiment. The problem under study is to estimate using the collected measurements.
It is also useful to write the dynamics in the following vector notation
where and are -dimensional vectors collecting the samples of and , respectively, and is a matrix whose -th row corresponds to . Similarly, we denote by the vector collecting the available quantized measurements.
3 Bayesian modeling and inference
3.1 Impulse response prior
In this paper we cast the system identification problem into a Bayesian framework, setting an appropriate prior on . Following a Gaussian regression approach [Rasmussen], we model the impulse response as a zero-mean Gaussian random vector, i.e. we assume the following prior distribution
Here, is a covariance matrix whose structure depends on the shaping hyperparameter , and is a scaling factor. In this context, is usually called a kernel and determines properties of the realizations of . In particular, we choose from the family of stable spline kernels [SS2010], [SS2011]. Such kernels are specifically designed for system identification purposes and give clear advantages compared to other standard kernels [bottegal2013regularized], [SS2010] (e.g. the Gaussian kernel or the Laplacian kernel, see [Scholkopf01b]). Motivated by its maximum entropy properties [chen2016maximum], in this paper we make use of the first-order stable spline kernel (or TC kernel in [ChenOL12]). It is defined as
with regulating the impulse response decay rate.
3.2 Bayesian inference of systems with quantized output data
In this section we describe the complete Bayesian model that stems from the prior adopted for the impulse response. Our main aim is to devise an MCMC-based sampling mechanism for inferring the model. We introduce the vector of parameters , seen as a deterministic quantity.
Figure 2 depicts the Bayesian network describing the system identification problem with quantized data. The kernel hyperparameters and determine the stochastic model of , which, together with , in turn determines the distribution of the . The output measurements are determined directly by the non-quantized outputs. Any kind of inference on this network crucially depends on the capability of performing Bayesian inference on the hidden nodes , by computing functionals of the posterior density
of the form
where is a general function of and . Special cases of (10) that will be used later comprise the MMSE of given and the computation of the marginal likelihood of .
In the remainder of the section, we assume that is fixed.
3.3 Method 1: sampling from the joint posterior
Quite unfortunately, analytical computation of (10) is intractable, due to the involved structure of the posterior density. Instead, we can resort to Markov Chain Monte Carlo methods, i.e. use the approximation
where , are random samples drawn from (9) and is a large integer. The problem is how to design an effective technique to sample from the distribution (9). If all the conditional probability densities of such a distribution are available in closed-form, the problem of sampling can be solved efficiently using a special case of the Metropolis-Hastings method, namely the Gibbs sampler (see e.g. [Gilks]). The basic idea is that each conditional random variable is the state of a Markov chain; then, by drawing samples from the conditional probability densities iteratively, we converge to the stationary state of this Markov chain and generate samples of the desired distribution. The conditionals of (9) are as follows.
. First note that from (6), given , is independent of and . Therefore we have
Now, using again (6) we observe that, if were not given, then
Knowing the quantized output permits to narrow the range of possible values of . In particular, if , for ant , we have
where denotes a Gaussian distribution truncated below and above , whose original mean and variance are and respectively.
. Given , information carried by becomes redundant and can be discarded. Due to the assumption on the distribution of the noise , the vectors and are jointly Gaussian, so that the posterior density of given is also Gaussian. Combining the linear model (6) with the prior (7) it is straightforward to obtain (see e.g. [Anderson:1979])
with obvious definition of .
Algorithm 1 summarizes the computation of (10) using the sampling mechanism described in this subsection.
Algorithm 1: Method 1 for inference
Initialization: Compute initial value
For to :
Sequentially draw the samples , , from
Draw the sample from
In Algorithm 1, the parameter is introduced. It represents the number of initial samples to be discarded and is also known as burn-in [meyn2009markov]. In fact, the conditional densities from which those samples are drawn are to be considered as non-stationary, because the associated Markov Chain takes a certain number of iterations to converge to a stationary distribution.
3.4 Method 2: sampling from the marginalized posterior of
Method 1 for sampling allows for computing expected values (with respect to the posterior ) of any kind of function . For the problem under study however, we are particularly interested in computing the expected value (conditional on ) of , and of quadratic functions in and , i.e.
for given , , . These type of functions will play a central role in estimating the vector of parameters (see Section 4).
The sampling method proposed in this subsection relies on the following result.
We now focus on the internal integral in the above expression. Developing and recalling the first and second moments of , given in (15), we obtain the following terms:
Combining the three terms yields (19).
A analogous result holds for the computation of .
It holds that
where is defined in (17).
Lemmas (2) and (3) state that the posterior expectations of and of quadratic functions in and , are equivalent to the expectation of specific functions with respect to the marginalized posterior of given . This distribution corresponds to a multivariate truncated Gaussian; so, analytical computation of the relative integrals is quite challenging. Computing the integrals by sampling from is instead a viable approach.
An effective technique for sampling from is again based on the Gibbs sampler. The idea is to iteratively draw samples from the conditional densities . Generating samples from these distributions is relatively simple, because they are scalar truncated normal distributions. To see this, first consider the marginal distribution , which is a multivariate normal random vector with zero-mean and covariance matrix . Then
Algorithm 2 summarizes the sampling procedure described in this subsection, for quadratic functions. The same algorithm can be used for computing the posterior expectation of .
Algorithm 2: Method 2 for inference
Initialization: Compute initial value
For to :
Sequentially draw the samples , , from
where . Then and (see e.g. [noble1988applied]). This procedure turns out computationally convenient, because it is required to invert only one time, instead of computing the inversion of matrices of size (which would be necessary for computing for any ).
4 System identification from quantized output data
In the previous section we have defined an adequate Bayesian framework to describe the problem of identification of systems from quantized output data. We have also presented two methods for inference of functions of and . In this section, we describe the proposed system identification procedure. It relies on the so called empirical Bayes approach [Maritz:1989] and consists of the following two steps:
Estimate the parameter vector via marginal likelihood maximization;
Compute the MMSE estimate of the impulse response, using the estimated parameter vector .
In the following we analyze in detail the two steps composing the proposed system identification scheme.
4.1 Marginal likelihood estimation of the parameter vector
In the empirical Bayes approach, the estimation of is tackled via marginal likelihood maximization, i.e., by computing
In the standard (non-quantized) scenario, solving (25) is straightforward because the marginal likelihood admits a closed-form expression [pillonetto2014kernel]. This does not hold in the problem under study. Therefore, to solve (25) we propose an iterative solution scheme based on the EM method, where we treat and as latent variables that are iteratively estimated together with the parameter vector . Let be the parameter estimate obtained at the -th iteration of the EM method. Then, -th update is obtained with the following steps:
where the expectation is taken with respect to the posterior density , with fixed at the value ;
We first focus on performing the E-step. It requires the computation of (26), which corresponds to the integral
The complete likelihood admits the decomposition
where is a quadratic function of the type (18), with , and is also a quadratic function in and , with .
Using Bayes’ rule we can decompose the complete likelihood as follows
The term is a vectorized version of (13), so that
The last term on the right hand side gives . Similarly,
where the last term corresponds to .
The proposition reveals the nature of the complete likelihood, which is the summation of terms that are either constant or quadratic functions of and , plus the term . As for this term, we note that factorizes, each factor being of the type
When computing the integral of this term using the sampling mechanisms introduced in Section 3, it is ensured that all the generated samples belong to the interval corresponding to the observed quantized value . Hence, when we compute the expectation of techniques of Section 3, it is ensured that each factor (33) is always equal to 1 and thus . Therefore, computing (28) reduces to computing the expectation of two quadratic functions, and this can be done using both the sampling mechanisms introduced in Section 3. We denote by and , respectively, the expected value of and at the -th iteration of the EM method (i.e., when the of the parameter vector is available). Neglecting constant terms, we have
Then the EM update is obtained as follows:
Differentiating with respect to , one obtains that the minimum, as a function of , is achieved at . Inserting this value into yields , so that (36) and (37) follow. Finally, it is straightforward to see that (38) is the minimizer of .
The M-step results in a series of computationally attractive operations. The kernel scaling hyperparameter and the noise variance admit a solution in closed form. The shaping hyperparameter is updated solving a simple scalar optimization problem. This problem can be further simplified by using a factorization of the first-order stable spline kernel, see [bottegal2016robust] for details.
4.2 MMSE estimate of the impulse response
Having an estimate available, the MMSE estimate of corresponds to
Again, this quantity can be computed using both the sampling methods described in the previous section.
We summarize the overall procedure for system identification from quantized data in the following algorithm.
Algorithm 3: System identification with quantized output measurements
Initialization: Set an initial value of
Repeat until convergence:
E-step: Compute using Algorithm 1 or Algorithm 2
M-step: update according to Proposition 6
Compute using Algorithm 1 or Algorithm 2 The initial value can be set either randomly or by using values of the kernel hyperparameters and noise variance returned by the standard nonparametric method for non-quantized data (see [SS2010]).
4.3 Which sampling method?
Algorithm 3 for quantized system identification works with both the sampling methods presented in Section 3, because it requires the computation of the posterior expectation of quadratic functions of the type (18) and of . Choosing the sampling method depends on the user requirements. Method 1 allows for any type of inference and therefore it can be used to compute useful statistics such as confidence intervals on the estimate of and quantile estimation. On the other hand, Method 2 requires sampling only from ; if implemented properly (see Remark 4), this method is expected to have a faster convergence to the required expectations. The experiments presented in the next section show that the two sampling methods substantially give the same performance in terms of accuracy in reconstructing the true impulse response.
5 Numerical experiments
We test the proposed technique by means of 2 Monte Carlo experiments of 100 runs each. For each Monte Carlo run, we generate an impulse response by picking 10 pairs of complex conjugate zeros with magnitude randomly chosen in and random phase and, similarly, 10 pairs of complex conjugate poles with magnitude randomly chosen in and random phase. The obtained impulse response is rescaled in order to have a random -gain in the interval . The goal is to estimate samples of this impulse response from input-output data. The input sequences are realizations of white noise with unit variance. We compare the following estimators.
KB-GS-1: This is the method described in this paper, making use of the sampling technique of Section 3.3 (Algorithm 1). The parameter , denoting the number of samples generated by the sampler, is set to for each iteration of the EM method, and to for the final computation of (last step of Algorithm 3). The burn-in phase consists of samples. Convergence of the EM method is established by a threshold on the relative difference of the current and previous parameter estimates, i.e. the method stops if the condition
is satisfied. In our simulations, we have verified that our choice of was adequate to guarantee a good degree of approximation (the reader interested in convergence diagnostics is referred to, e.g., [gelman2014bayesian, Ch. 11.4]).
KB-GS-2: This is the method described in this paper, making use of the sampling technique of Section 3.4 (Algorithm 2). The algorithm settings are the same as the previous method.
KB-St.: This is the standard nonparametric kernel-based method proposed in [SS2010] and revisited in [ChenOL12]. It is not designed for handling quantized data. The kernel adopted for identification is (8). The kernel hyperparameters are estimated by marginal likelihood maximization, while is estimated via least-squares residuals.
KB-Or.: Same as KB, with the difference that in this case the vector is available to the estimator. Hence, this estimator is regarded as an Oracle.
ML-GS: This is the method proposed in [godoy2011identification], based on maximum likelihood identification of the impulse response, namely
To solve the likelihood problem, a scheme based on the EM method is proposed. In our implementation, the E-step of the EM iterations is computed using the Gibbs sampler (in contrast, [godoy2011identification] proposes a scenario-based approach). The length of is also set to 50, while the convergence is established by using the same conditions as the estimator KB-GS-1.
MAP-GS: This is the method proposed in [chen2012impulse], based on maximum-a-posteriori (MAP) identification of the impulse response, namely
where the prior corresponds to (7). Again, an EM solution scheme based on the Gibbs sampler is employed to solve this problem, setting , while the convergence is established by using the same conditions as the estimator KB-GS-1. To facilitate hyperparameter and noise variance tuning, we plug those that are estimated by the method KB-Or., which has access to the non-quantized output .
The performance of the estimators is evaluated by means of the fitting score, computed as
where is the impulse response generated at the -th Monte Carlo run and the estimate computed by the tested methods.
5.1 Binary quantizer
The first experiment considers the following binary quantizer
For each Monte Carlo run, the noise variance is such that , i.e. the ratio between the variance of the noiseless (non-quantized) output and the noise is equal to 10. If , the system is discarded. We generate data samples.
Figure 3 shows the results of the Monte Carlo runs. The advantage of using the proposed identification scheme, compared to the method that does not account for the quantizer, is evident. Despite the large loss of information caused by the quantizer, the proposed method gives a fit which is quite comparable to the oracle method (KB-Or.). The proposed sampling mechanisms yield nearly equivalent performance. Furthermore, there seems to be a substantial advantage in using a Bayesian approach compared to the non-regularized estimator ML-GS, which is tailored for short FIR systems rather than IIR systems. The high number of coefficients to be estimated inevitably leads to high variance in the estimates. If this effect is not suitably alleviated by regularization (i.e., introducing a “good” bias), the performance of the estimator is doomed to be poor. Finally, the proposed estimators, based on computing the MMSE estimate of the impulse response, outperform the estimator MAP-GS, which is based on computing the MAP estimate of the impulse response.
5.2 Ceil-type quantizer
In the second experiment we test the performance of the proposed method on systems followed by a ceil-type quantizer, which is defined as
Again, for each Monte Carlo run, the noise variance is such that . We generate data samples.
As shown in Figure 4, in this case, if one compares the oracle-type method (i.e. KB-Or.) with the same method using quantized data (KB-St.), the loss of accuracy is relatively low. This is because this type of quantizer has a mild effect on the measurements. It can be seen, however, that the proposed methods are able to give a fit that is comparable to the oracle KB-Or., regardless of the employed sampling method. We notice also that the Bayesian approach has a major impact on the performance, as seen by the gap in the performance between ML-GS and the other estimators.
In this paper, we have introduced a novel method for system identification when the output is subject to quantization. We have proposed a MCMC scheme that exploits the Bayesian description of the unknown system. In particular, we have shown how to design two integration schemes based on the Gibbs sampler by exploiting the knowledge of the conditional probability density functions of the variables entering the system. The two sampling techniques can be used in combination with the EM method to perform empirical Bayes estimation of the kernel hyperparameters and the noise variance. We have highlighted, through some numerical experiments, the advantages of employing our method when quantizers affect the accuracy of measurements.
As a final remark, we note that the cascaded composition of a linear LTI dynamic system followed by a static nonlinear function is known in system identification as a Wiener system [giri2010block], [hagenblad2008maximum]. However, in Wiener systems the nonlinear function is generally assumed unknown, so a direct extension of the method proposed in this paper to general Wiener systems does not appear immediate, and would require the use of more involved and MCMC techniques (see, e.g., [lindsten2013bayesian]).