Switching Nonparametric Regression Models and the Motorcycle Data revisited

# Switching Nonparametric Regression Models and the Motorcycle Data revisited

## Abstract

We propose a methodology to analyze data arising from a curve that, over its domain, switches among states. We consider a sequence of response variables, where each response depends on a covariate according to an unobserved state . The states form a stochastic process and their possible values are . If equals the expected response of is one of unknown smooth functions evaluated at . We call this model a switching nonparametric regression model. We develop an EM algorithm to estimate the parameters of the latent state process and the functions corresponding to the states. We also obtain standard errors for the parameter estimates of the state process. We conduct simulation studies to analyze the frequentist properties of our estimates. We also apply the proposed methodology to the well-known motorcycle data set treating the data as coming from more than one simulated accident run with unobserved run labels.

\kwd
\startlocaldefs\endlocaldefs\runtitle

Switching Nonparametric Regression Models

and \thankstextt1Research supported by the National Science and Engineering Research Council of Canada.

nonparametric regression, machine learning, mixture of Gaussian processes, latent variables, EM algorithm, motorcycle data

## 1 Introduction

In this paper we propose a methodology to analyze data arising from a curve that, over its domain, switches among states. The state at any particular point is determined by a latent process. The state also determines a function. We are interested in the functions corresponding to each of the states and the parameters of the latent state process.

Suppose we have a sequence of response variables, , where depends on a covariate (usually time) according to an unobserved state , also called a hidden or latent state. The possible values of the states are . If the expected response of is . We call this model a switching nonparametric regression model. In a Bayesian switching nonparametric regression model the uncertainty about the ’s is formulated by modeling the ’s as realizations of stochastic processes. In a frequentist switching nonparametric regression model the ’s are merely assumed to be smooth.

The objective of the proposed methodology is to estimate the ’s as well as to estimate the regression error variance and the parameters governing the distribution of the state process. We also obtain standard errors for the proposed parameter estimators of the state process. We consider two types of hidden states, those that are independent and identically distributed and those that follow a Markov structure. The Markov structure would usually require to be time.

As an application we consider the well-known motorcycle data set. The data consist of 133 measurements of head acceleration taken through time in a simulated motorcycle accident. See Figure LABEL:motorcycle. Analyses appearing in the literature (e.g., Silverman, 1985, Härdle and Marron, 1995, Rasmussen and Ghahramani, 2002, Gijbels and Goderniaux, 2004 and Wood, 2011) treat the data as coming from one simulated accident. However, close examination of the data suggests the measurements are from accidents. In the discussion of Silverman (1985) Prof. A. C. Atkinson wrote “inspection of [the Figure], particularly in the region 30-40 ms, suggests that the data may not be a single time series but are rather the superposition of, perhaps, three series”. Professor Silverman had no specific information on this point but replied “Professor Atkinson is right about the motorcycle data in that they are the superposition of measurements made by several instruments but the data I have presented are precisely in the form they were made available to me”. The data structure will most likely remain unclear as the original report (Schmidt, Mattern and Schüler, 1981) seems to be no longer available.

We apply the proposed methodology to the motorcycle data treating the data as coming from functions, one for each simulated accident, with hidden (unobserved) function labels. We choose using an ad hoc Akaike information criterion (AIC). To our knowledge this is the first time that the motorcycle data is analyzed taking into account that they describe more than one simulated accident.

The paper is organized as follows. In Section 2 we present an overview of the proposed methodology. A literature review on the topic is presented in Section 3. In Section 4 we describe the solution for the estimation problem. The standard errors for the parameter estimators of the state process are calculated in Section 5. In Section 6 we present the results of simulation studies. An application of the proposed methodology to the motorcycle data is shown in Section 7. Some discussion is presented in Section 8.

## 2 Overview of the proposed methodology

Suppose we have observed data , and hidden states , . The states form a stochastic process and their possible values are . If , then ’s distribution depends on a function . Specifically, we assume that, given , , for independent with mean of zero and variance of one. Therefore, given the ’s and the ’s, the ’s are independent with the mean of equal to and the variance equal to .

Let , and . Define to be the set of parameters defining the distribution of given and the ’s. So, for instance, if the ’s are normally distributed, then . Let be the vector of parameters that determine the joint distribution of . If the ’s are independent and identically distributed, then the parameter vector is of length with th component equal to . If the ’s follow a Markov structure, the parameter vector consists of initial and transition probabilities.

Our goal is to estimate , along with standard errors or some measure of accuracy for the parameters in .

To obtain the parameter estimates two approaches are considered:

• a frequentist approach called penalized log-likelihood estimation;

• a Bayesian approach where the posterior density is maximized.

These two approaches are computationally similar and consist of using the Expectation-Maximization (EM) algorithm (McLachlan and Krishnan, 2008) to maximize the following criterion

 log-likelihood of the data+P(f1,…,fJ,λ1,…,λJ), (1)

where the exact form of depends on the considered approach.

In the frequentist approach we assume that exists almost everywhere and for and maximize (1) with

 P(f1,…,fJ,λ1,…,λJ)=−J∑j=1λj∫[f′′j(x)]2dx.

The ’s are the so called smoothing parameters as they measure the relative importance of fit to the data, as measured by the log-likelihood of the data, and smoothness of the ’s, as quantified by the penalties .

The Bayesian approach requires a prior distribution for the parameters in . For the ’s, we consider a Gaussian process regression approach (Rasmussen and Williams, 2006) with independent and a Gaussian process with mean function and covariance function, , that depends on a vector of parameters . We place a non-informative prior on the other parameters. Therefore, we can write

 P(f1,…,fJ,λ1,…,λJ) = − J2log(2π)−12J∑j=1log|A(λj)| − 12J∑j=1fj(x)T A(λj)−1 fj(x),

where is an matrix with entries given by

 [A(λj)]lm=K(xl,xm;λj),

with known. In our applications and simulations we assume the following covariance function

 K(x,t;Uj,s2j)=cov(fj(x),fj(t))=Ujexp[−(x−t)22s2j], (2)

with parameter vector . These parameters control the amount of smoothness of each curve and, as in the penalized log-likelihood approach, are called smoothing parameters.

We can extend our approach to consider not equal to the zero function. For instance, we might take as a linear combination of known basis functions. In this case, in addition to , we would have another parameter vector for , namely the vector containing the basis functions’ coefficients.

In both approaches we choose the values of the ’s automatically by cross-validation.

## 3 Background

Similar models have appeared in the machine learning literature, where they are called mixture of Gaussian processes models. See Tresp (2001), Rasmussen and Ghahramani (2002) and Ou and Martin (2008). The focus of these papers is on analyzing data from an on-line process, with the goal being prediction of a single process. The process is modeled as a mixture of realizations of Gaussian processes and, just as in our model, the mixture depends on hidden states. Rasmussen and Ghahramani and Ou and Martin take a hierarchical Bayesian approach, not only placing a Gaussian process prior on the ’s, but also placing a prior distribution on all parameters, including those that govern the Gaussian process. Tresp does not place a prior distribution on the parameters of the Gaussian process, but he does use Gaussian processes to model not only the functions themselves but also the latent process and the regression error variance. To analyze data, Tresp uses the Expectation-Maximization (EM) algorithm to maximize the posterior density of the Gaussian processes given the data. Rasmussen and Ghahramani and Ou and Martin use a Monte Carlo method to estimate the posterior distribution of the unknown functions and parameters.

A main difference between our model and the model in these three papers is in the distribution of the ’s. All three papers begin by assuming that the ’s are independent, conditional on the parameters governing the latent process. In Rasmussen and Ghahramani and in Ou and Martin, the process parameters are , , which are modeled using a Dirichlet distribution. Rasmussen and Ghahramani use a limit of a Dirichlet distribution in order to model an infinite number of possible hidden states, to avoid choosing the number of states. Ou and Martin use a finite number of states to avoid computational complexity. Both papers use an ad hoc modification of the distribution of the ’s to allow to depend on in a smooth way; see Rasmussen and Ghahramani’s equation (5) and Ou and Martin’s equation (13). However, as remarked by Rasmussen and Ghahramani in the discussion, the properties of the resulting joint distribution of the ’s are not completely known. Tresp’s distributional assumptions are more straightforward: he assumes that the distribution of depends on according to a logit model governed by a Gaussian process. None of these papers consider Markov ’s.

While the three papers contain methodology that can, in principle, lead to estimation of the ’s and the latent variable process parameters, the papers focus on estimation of just one function - the mixture. Thus the resulting methodology is a form of variable bandwidth smoothing (see, for instance, Fan and Gijbels, 1992). In contrast, our goal is estimation of the individual processes that make up the mixture and estimation of the parameters governing the hidden state process, along with standard errors. We see the distinction between the goals by considering the analysis of the motorcycle data: Rasmussen and Ghahramani present just one function to summarize the data. We present functions, one for each of the simulated accidents, and we estimate the expected proportion of data points from each function, and provide standard errors. We also conduct extensive simulation studies of the frequentist properties of the estimates.

A closely related model is the Gaussian mixture model, used in density estimation. In the Gaussian mixture model, we assume that the th data point comes from one of a finite set of Gaussian distributions, determined by the value of a latent variable . In his work in this area, Bilmes (1998) considered two models for the latent variables: in the first model the latent variables are independent and identically distributed and in the second they follow a Markov structure. The later corresponds to a classic hidden Markov model, which is the same as our approach for Markov ’s if the ’s are constant. Bilmes provides a very readable description of how he applies the EM algorithm to estimate the parameters for the Gaussian distributions as well as for the distribution of the ’s.

These models are not to be confused with the work of Shi, Murray-Smith and Titterington (2005) on Gaussian process mixtures for regression and Chiou (2012) on functional mixture prediction. These authors analyze data from independent curves where the entire th curve is a realization of one of Gaussian processes, determined by the value of the latent variable . In contrast, like Tresp, Rasmussen and Ghahramani and Ou and Martin, we consider observed curve, which switches among Gaussian processes.

We see that the literature contains many similar but distinct models with names containing the words Gaussian and mixture. For this reason, we prefer to call our models and those of Tresp (2001), Rasmussen and Ghahramani (2002) and Ou and Martin (2008) switching nonparametric regression models as we feel this is more descriptive. To our knowledge, no one has considered a Markov structure for the latent variables to estimate multiple functions, nor has anyone used the non-Bayesian penalized likelihood approach. Our frequentist approach and our calculation of standard errors appear to be new.

## 4 Parameter estimation via the EM algorithm

In this section we describe how the EM algorithm can be used to obtain the parameter estimates for the penalized log-likelihood and Bayesian approaches. The E-step of the algorithm is exactly the same for both approaches. The M-step differs only in the part involving the calculation of . In the M-step we restrict our calculations to normally distributed errors. Furthermore, in the M-step, for the penalized log-likelihood case we can show that the maximizing ’s are cubic smoothing splines: our E-step leads to the maximization criterion for given by (15) plus (15), which is similar to (5.1) in Silverman (1985). See also Heckman (2012). We use this to justify modeling each as a linear combination of known B-spline basis functions , that is,

 fj(x)=K∑k=1ϕjkbk(x),

with the set of unknown parameters determining .

Recall that , where is the vector containing the parameters of the model assumed for , and are the parameters governing the distribution of given the ’s and the ’s. Let be the log-likelihood based on the observed data. Our goal is to find that maximizes

 l(θ)≡logp(y|θ)+P(f1,…,fJ,λ1,…,λJ). (3)

The form of is very complicated, since it involves the distribution of the latent ’s. Therefore, the maximization of (3) with respect to is a difficult task. In order to tackle this problem it is common to apply numerical methods such as the EM algorithm to obtain the parameter estimates (McLachlan and Krishnan, 2008). The EM algorithm is usually used to maximize the likelihood function by generating a sequence of estimates, , , having the property that . The EM algorithm can also be used to maximize (3). We can show (see Supplementary Material) that our EM algorithm also generates a sequence of estimates, , , satisfying

 l(θ(c+1))≥l(θ(c)). (4)

To define our EM algorithm, let be the joint distribution of the observed and latent data given , also called the complete data distribution. In what follows view as an argument of a function, not as a random variable. Note that we write to denote the conditional expected value of assuming that the data and are generated with parameter vector .

The proposed EM algorithm consists of the following two steps based on writing , where

 L1(γ)=logp(y|z,θ)=n∑i=1logp(yi|zi,fzi(xi),σ2zi)

and

 L2(α)=logp(z|θ)=logp(z1,…,zn|α). (5)
1. Expectation step (E-step): calculate

 Q(θ,θ(c))≡Eθ(c)[logp(y,z|θ)|y]=Eθ(c)(L1(γ)|y)+Eθ(c)(L2(α)|y)

that is, calculate the expected value of the logarithm of with respect to the distribution of the latent given the observed using as the true value of .

2. Maximization step (M-step): let

 S(θ,θ(c))=Q(θ,θ(c))+P(f1,…,fJ,λ1,…,λJ)
 =Eθ(c)(L1(γ)|y)+P(f1,…,fJ,λ1,…,λJ)+Eθ(c)(L2(α)|y). (6)

Use the Expectation-Conditional Maximization (ECM) algorithm (see Supplementary Material) to find that maximizes with respect to or at least does not decrease from the current value at . We can show (see Supplementary Material) that if then (4) holds.

### 4.1 E-step: general zi’s

Since

 L1(γ)=n∑i=1J∑j=1I(zi=j) logp(yi|zi=j,fj(xi),σ2j),
 Eθ(c)(L1(γ)|y)=n∑i=1J∑j=1p(c)ijlogp(yi|zi=j,fj(xi),σ2j)

where

 p(c)ij=p(zi=j|y,θ(c)),

whose exact form depends on the model for the ’s.

In a regression model with normal errors

 Eθ(c)(L1(γ)|y) = − n2log(2π)−12n∑i=1J∑j=1p(c)ijlogσ2j − 12n∑i=1J∑j=1p(c)ij[yi−fj(xi)]2σ2j.

In the following sections we calculate and considering different models for the latent variables. Note that depends on the model for the ’s only through .

#### E-step: independent and identically distributed (iid) zi’s

In this section we calculate and assuming that the latent variables are iid with parameter vector , where and .

Since the ’s are iid, we obtain

 p(c)ij=p(yi|zi=j,fj(xi)(c),σ2(c)j)×p(c)j∑Jl=1p(yi|zi=l,fl(xi)(c),σ2(c)l)×p(c)l. (8)

Note that we can easily calculate (8) when the regression errors are normally distributed.

We now calculate . For iid ’s we can write

 L2(α)=n∑i=1J∑j=1I(zi=j)logpj (9)

and, therefore,

 Eθ(c)(L2(α)|y)=n∑i=1J∑j=1p(c)ijlogpj. (10)

#### E-step: Markov zi’s

Here we calculate and assuming a Markov structure for the latent variables . In this case the distribution of the ’s depends on a vector composed of transition probabilities and initial probabilities , .

Let us assume that

1. the th latent variable depends on past latent variables only via the st latent variable, i.e., ;

2. the transition probabilities do not depend on , that is,

 p(zi=j|zi−1=l,α)=p(zi+s=j|zi+s−1=l,α)≡alj.

To compute when the ’s are Markov, we use the results of Baum et al. (1970). These authors let

 δ(c)ij=p(y1,…,yi,zi=j|θ(c))

and

 φ(c)ij=p(yi+1,…,yn|zi=j,θ(c)),

and show how to calculate these recursively using what they call the forward and backward procedures, respectively.

Note that because of the Markovian conditional independence

 δ(c)ijφ(c)ij = p(y1,…,yi,zi=j|θ(c))×p(yi+1,…,yn|zi=j,θ(c)) = p(y1,…,yi,zi=j|θ(c))×p(yi+1,…,yn|zi=j,y1,…,yi,θ(c)) = p(y,zi=j|θ(c)).

Thus, we can calculate via

 p(c)ij=δ(c)ijφ(c)ij∑Jl=1δ(c)ilφ(c)il.

Now let us consider . Since for Markov ’s

 L2(α)=n∑i=2J∑l=1J∑j=1I(zi−1=l,zi=j)logalj+J∑j=1I(z1=j)logπj (11)
 Eθ(c)(L2(α)|y)=n∑i=2J∑l=1J∑j=1p(c)iljlogalj+J∑j=1p(c)1jlogπj, (12)

where . This can be expanded as

 p(c)ilj=p(c)(i−1)l×a(c)lj×p(yi|zi=j,θ(c))×φ(c)ijφ(c)(i−1)l,

which we easily calculate using the normality assumption for the regression errors.

More details on how to obtain the expressions for and can be found in Bilmes (1998), Rabiner (1989) and Cappé, Moulines and Rydén (2005).

### 4.2 M-step

For the M-step we combine our discussion of iid ’s and Markov ’s. We want to find that maximizes in (6) with respect to or at least produces a value of no smaller than . For normally distributed errors is given by (4.1) and, therefore, we can write

 S(θ,θ(c)) = (15) C−12n∑i=1J∑j=1p(c)ijlogσ2j−12n∑i=1J∑j=1p(c)ij[yi−fj(xi)]2σ2j + P(f1,…,fJ,λ1,…,λJ) + Eθ(c)(L2(α)|y).

We maximize as a function of . In the maximization is fixed, not depending on . This implies that the ’s are also fixed, since their calculation depends on current parameter estimates in . We also consider the smoothing parameters, , to be fixed. This maximization cannot be done analytically, so we apply a natural extension of the EM approach, the ECM algorithm, to guarantee that .

Because the expression for does not depend on the ’s or ’s, the ’s and ’s that maximize are the ’s and ’s that maximize (15) + (15). Therefore, the form of the maximizing ’s and ’s will not depend on the model for the ’s. Their only dependence on the model for the ’s is via the ’s.

Thus, to obtain the vector of parameter estimates, , we apply the ECM algorithm as follows.

1. Hold the ’s and the parameters in fixed and maximize (15) plus (15) with respect to . Let

 Wj=diag(p(c)1j/σ2j,…,p(c)nj/σ2j). (16)

For the Bayesian approach this is equivalent to maximizing

 −12J∑j=1(y−fj(x))TWj(y−fj(x))−12J∑j=1fj(x)T A(λj)−1 fj(x)

obtaining

 ^fj(x)=A(λj)(A(λj)+W−1j)−1y.

Let be with in replaced by .

For the penalized log-likelihood approach recall that is a linear combination of known basis functions, so , where is the vector of coefficients corresponding to and is an matrix with entries . Thus, we hold the ’s and fixed and maximize

 −12J∑j=1(y−Bϕj)TWj(y−Bϕj)−J∑j=1λjϕTjRϕj,

with respect to yielding

 ^ϕj=(BTWjB+2λjR)−1BTWjy,

where is a matrix with entries

 Rkk′=∫b′′k(x)b′′k′(x) dx.

Let be with in replaced by . So we let .

2. Now holding the ’s and the parameters in fixed and maximizing (15) with respect to we get

 ^σ2j=n∑i=1p(c)ij[yi−fj(xi)]2n∑i=1p(c)ij. (17)

Let be with replaced by . If we assume for all , then we find that

 ^σ2(c+1)=1nJ∑j=1n∑i=1p(c)ij[yi−fj(xi)(c+1)]2. (18)

We can obtain better variance estimates by adjusting the degrees of freedom to account for the estimation of the ’s. We do that by replacing the denominator of (17) by and the denominator of (18) by , where and is the so called hat matrix satisfying . For the Bayesian approach and for the penalized approach . This modification is similar to a weighted version of what Wahba (1983) has proposed for the regular smoothing spline case.

3. Now we hold the ’s and ’s fixed and maximize with respect to the parameters in . Note that (15) and (15) do not depend on , so to find , we maximize in line (15) as a function of . Because the form of does depend on the model for the ’s, we must obtain the estimates of for each model separately.

For iid ’s, where , using (10) and Lagrange multipliers with the restriction that , we obtain:

 p(c+1)j=1nn∑i=1p(c)ij.

For Markov ’s the vector is composed of transition probabilities and initial probabilities . We first maximize given in (12) with respect to . Holding fixed and using a Lagrange multiplier with the constraint , we get:

 a(c+1)lj=n∑i=2p(c)iljn∑i=2p(c)(i−1)l.

Now let us maximize (12) with respect to . Holding fixed and using a Lagrange multiplier with the restriction that , we obtain:

 π(c+1)j=p(c)1j.

## 5 Standard errors for the parameter estimators of the state process

In this section we use the results of Louis (1982) to obtain standard errors for the estimates of the parameters of the state process. We consider for iid ’s possible state values. For Markov ’s we restrict the possible number of states to to reduce calculational complexity.

Louis (1982) derived a procedure to obtain the observed information matrix when the maximum likelihood estimates are obtained using the EM algorithm. The procedure requires the computation of the gradient and of the second derivative matrix of the log-likelihood based on the complete data and can be implemented quite easily within the EM steps.

### 5.1 The general case

Suppose that is known and let be the maximum likelihood estimator of given , that is, the maximizer of , the incomplete data log-likelihood. We can obtain using the EM algorithm, and the complete data log-likelihood (see Section 4.1) . In this case we can derive the observed information matrix, , via a direct application of Louis’s procedure. Under some regularity conditions, Louis (1982) shows by straightforward differentiation that , , and that the observed information matrix is given by

 Iγ(α) = −L′′I(α) (19) = Eα(−L′′2(α)|y)−Eα(L′2(α)L′2(α)T|y)+L′I(α)L′I(α)T,

where is as in (5), and are the gradient vectors of and , respectively, and and are the associated second derivative matrices. The estimate of is . Note that (19) needs to be evaluated only at convergence of the EM algorithm, where is zero. Then, contains only the first two terms of (19). The inverse of is the estimated variance-covariance matrix of for the known value of .

To relate these calculations to those of our EM algorithm, where both and are estimated, first note that our and are the maximizers of the criterion . If the maximizer is unique then . Therefore, we will estimate the variance-covariance matrix of by the variance-covariance matrix of plugging in . That is, we propose to use a plug-in estimate, , where and are obtained from our EM procedure. Note that this method ignores the variability in estimating .

In the next sections we show how to calculate for the different models for the ’s.

### 5.2 Standard errors: iidzi’s

To remove the restriction , we use the parameters and rewrite (9) as

 L2(α)=n∑i=1{J−1∑j=1I(zi=j)logpj+I(zi=J)log(1−J−1∑j=1pj)}. (20)

Let be the gradient vector of (20) with the th component given by

 ∂L2∂pj=n∑i=1[I(zi=j)pj−I(zi=J)(1−∑J−1k=1pk)]

and be the matrix with the associated second derivatives

 ∂2L2∂p2j=−n∑i=1[I(zi=j)p2j+I(zi=J)(1−∑J−1k=1pk)2],
 ∂2L2∂pjpl=−n∑i=1[I(zi=J)(1−∑J−1k=1pk)2] for j≠l.

Consider the matrix in (19) evaluated at . One can show its th entry, for , is equal to and its th entry is

 n∑i=1⎛⎝^pij^p2j+(1−∑J−1