# Fitting Jump Models

A. Bemporad (IMT School for Advanced Studies Lucca, Piazza San Francesco 19, 55100 Lucca, Italy. Email: alberto.bemporad@imtlucca.it), V. Breschi (Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milano, Italy. Email: valentina.breschi@polimi.it), D. Piga (Dalle Molle Institute for Artificial Intelligence Research - USI/SUPSI, Galleria 2, Via Cantonale 2c, CH-6928 Manno, Switzerland. Email: dario.piga@supsi.ch), S. Boyd (Department of Electrical Engineering, Stanford University, Stanford CA 94305, USA. Email: boyd@stanford.edu)
###### Abstract

We describe a new framework for fitting jump models to a sequence of data. The key idea is to alternate between minimizing a loss function to fit multiple model parameters, and minimizing a discrete loss function to determine which set of model parameters is active at each data point. The framework is quite general and encompasses popular classes of models, such as hidden Markov models and piecewise affine models. The shape of the chosen loss functions determines the shape of the resulting jump model.

Authors: A. Bemporad (corresponding author), V. Breschi, D. Piga, S. Boyd.

Keywords: model regression, mode estimation, jump models, hidden Markov models, piecewise affine models.

## 1 Introduction

In many regression and classification problems the training dataset is formed by input and output observations with time stamps. However, when fitting the function that maps input data to output data, most algorithms used in supervised learning do not take the temporal order of the data into account. For example, in linear regression problems solved by least squares, each row of the regressor matrix and of the corresponding output vector is associated with a data point, but clearly the solution is the same no matter how those rows are jointly ordered. In system identification, temporal information is often used only to construct the input samples (or regressors) and outputs, but then it is neglected. For example, in estimating autoregressive models with exogenous inputs (ARX), the regressor is a finite collection of current and past signal observations, but the order of the regressor/output pairs is irrelevant when least squares are used. Similarly, in logistic regression and support vector machines the order of the data points does not affect the result. In training feedforward neural networks using stochastic gradient descent, the samples may be picked randomly (and more than once) by the solution algorithm, and again their original temporal ordering is neglected.

On the other hand, there are many applications in which relevant information is contained not only in the data values but also in their temporal order. In particular, if the time each data point was collected is taken into account, one can detect changes in the regime under which the data were produced. Examples range from video segmentation [24, 10] to speech recognition [29, 30], asset-price models in finance [31, 17], human action classification [27, 26], and many others. All these examples are characterized by the need to fit multiple models and to understand when switches from one model to another occur.

Piecewise affine (PWA) models attempt to fit multiple affine models to a dataset, where each model is active based on the location of the input sample in a polyhedral partition of the input space [14, 9]. However, as for ARX models, the order of the data is not relevant in computing the model parameters and the polyhedral partition. In some cases, mode transitions are captured by finite state machines, for example in hybrid dynamical models with logical states, where the current mode and the next logical state are generated deterministically by Boolean functions [4, 8]. In spite of the difficulty of assessing whether a switched linear dynamical system is identifiable from input/output data [32], a rich variety of identification methods have been proposed in the literature [14, 5, 20, 3, 22, 9, 28].

Hidden Markov models (HMMs) treat instead the mode as a stochastic discrete variable, whose temporal dynamics is described by a Markov chain [29]. Natural extensions of hidden Markov models consider the cases in which each mode is associated with a linear function of the input [15, 11, 25]. Hidden Markov models are usually trained using the Baum-Welch algorithm [1], a forward-backward version of the more general Expectation Maximization (EM) algorithm [12].

In this paper we consider rather general jump models to fit a temporal sequence of data, taking the ordering of the data into account. The proposed fitting algorithm alternates between two steps until convergence: estimating the parameters of the multiple models, and estimating the temporal sequence of model activation. The model-fitting step can be carried out exactly when it reduces to a convex optimization problem, which is often the case. The mode-sequence step is always carried out optimally using dynamic programming.

Our jump modeling framework is quite general. The structure of the model depends on the shape of the function that is minimized to obtain the model parameters; the way the model jumps depends on the function that is minimized to get the sequence of model activation. When we impose no constraints or penalties on the mode sequence, our method reduces to automatically splitting the dataset into clusters and fitting one model per cluster, which is a generalization of K-means [19, Algorithm 14.1]. Hidden Markov models (HMMs) are a special case of jump models, as we will show in the paper. Indeed, jump models have broader descriptive capabilities than HMMs; for example, the sequence of discrete states need not be generated by a Markov chain and could be a deterministic function. Moreover, as stated above, jump models can have rather arbitrary model shapes.

After introducing jump models in Section 2 and giving a statistical interpretation of the loss function in Section 3, in Section 4 we provide algorithms for fitting jump models to data and for estimating output values and hidden modes from available input samples, emphasizing differences and analogies with HMMs. Finally, in Section 5 we show four examples of application of our approach for regression and classification, using both synthetic and experimental data sets.

The code implementing the algorithms described in the paper is available at http://cse.lab.imtlucca.it/~bemporad/jump_models/.

### 1.1 Setting and goal

We are given a training sequence of data pairs (xt, yt), t = 1, …, T, with xt ∈ X, yt ∈ Y. We refer to t as the time or period, xt as the regressor or input, and yt as the outcome or output at time t. The training sequence is used to build a regression model that provides a prediction ^yt of the output yt given the available inputs x1, …, xt, and possibly the past outputs y1, …, yt−1. We are specifically interested in models where ^yt is not simply a static function of xt; rather, we want to exploit the additional information embedded in the temporal ordering of the data. As we will detail later, our regression model is implicitly defined by the minimization of a fitting loss J that depends on the data and other variables and parameters. The chosen shape for J determines the structure of the corresponding regression model.

Given a production data sequence (~x1, ~y1), …, (~x~T, ~y~T), thought to be generated by a process similar to the one that produced the training data, the quality of the regression model over the time period ~T will be judged by the average true loss

 Ltrue = (1/~T) ~T∑t=1 ℓtrue(^yt, ~yt) (1)

where ℓtrue penalizes the mismatch between the prediction ^yt and the observed output ~yt, with ℓtrue(y, y) = 0 for all y.
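As a small illustration, the average true loss (1) can be evaluated as below; the squared-error choice of ℓtrue and the helper name true_loss are our own assumptions for this sketch, not fixed by the text.

```python
# Sketch of the average true loss (1) over a production sequence.
# ell_true is assumed squared error here; any mismatch penalty with
# ell_true(y, y) = 0 would fit the definition in the text.

def true_loss(y_hat, y_tilde):
    """Average of ell_true(y_hat_t, y_tilde_t) over t = 1, ..., T_tilde."""
    ell_true = lambda yh, y: (yh - y) ** 2  # assumed penalty
    T = len(y_tilde)
    return sum(ell_true(yh, y) for yh, y in zip(y_hat, y_tilde)) / T
```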

## 2 Regression models

### 2.1 Single model

A simple way of deriving a regression model is to introduce a model parameter θ, a loss function ℓ(xt, yt, θ), and a regularizer r(θ), defining the fitting objective

 J(X,Y,θ) = T∑t=1 ℓ(xt,yt,θ) + r(θ) (2a)

where X = (x1,…,xT), Y = (y1,…,yT). For a given training data set (X,Y), let

 θ⋆ = arg minθ J(X,Y,θ) (2b)

be the optimal model parameter. By fixing θ = θ⋆ and exploiting the separability of the loss J in (2a), we get the following regression model

 ^yt = arg miny J(X,Y,θ⋆) = arg miny ℓ(xt,y,θ⋆) =: φ(xt) (2c)

where φ denotes the regression model, with ties in the arg min broken arbitrarily. For example, when ℓ(xt, yt, θ) = ∥yt − θ′xt∥22 we get the standard linear regression model ^yt = (θ⋆)′xt.
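A minimal sketch of (2a)–(2c) for the quadratic case just mentioned: the ridge-style regularizer weight alpha and the helper names fit_theta/predict are our own choices, not the paper's.

```python
import numpy as np

# Sketch of the single-model case with ell(x, y, theta) = ||y - theta'x||^2
# and regularizer r(theta) = alpha * ||theta||^2 (alpha is an assumption here).
# Minimizing (2a) over theta yields the (regularized) least-squares solution,
# and (2c) reduces to the linear prediction y_hat = theta*' x.

def fit_theta(X, Y, alpha=0.0):
    # theta* = argmin_theta sum_t ||y_t - theta'x_t||^2 + alpha * ||theta||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def predict(x, theta):
    # y_hat = argmin_y ell(x, y, theta*) = theta*' x for the quadratic loss
    return x @ theta
```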

Model (2) can be enriched by adding output information sets Yt ⊆ Y that augment the information that is available about yt,

 ^yt = arg miny∈Yt ℓ(xt,y,θ⋆) (3)

where Yt = Y if no extra information on yt is given. For example, if we know a priori that yt is nonnegative, we can set Yt equal to the nonnegative orthant.

### 2.2 K-models

Let us add more flexibility and introduce multiple model parameters θ1, …, θK and a latent mode variable st ∈ {1, …, K} that determines the model parameter that is active at step t. Fitting a K-model on the training data set (X, Y) entails choosing the models by minimizing

 J(X,Y,Θ,S)=T∑t=1ℓ(xt,yt,θst)+K∑i=1r(θi) (4)

with respect to Θ = (θ1, …, θK) and S = (s1, …, sT). The optimal parameters Θ⋆ define the K-model

 (^yt,^st)=arg miny,sℓ(xt,y,θ⋆s). (5)

Note that the objective function in (4) is used to estimate the model parameters based on the entire training dataset, while (5) defines the model used to infer the output ^yt and discrete state ^st given the input xt, as exemplified in the next section.

#### 2.2.1 K-means and piecewise affine models

The standard K-means model [19] is obtained by setting yt = xt, r(θ) ≡ 0, and

 ℓ(xt,yt,θst) = (1/2)∥yt−θst∥22 + (1/2)∥xt−θst∥22 = ∥xt−θst∥22 (6)

In this case, minimizing (4) assigns each data point xt to the cluster indexed by st, and defines θ1, …, θK as the centroids of the resulting clusters. Moreover, the regression model defined by (6) returns

 ^st = arg mins ∥~xt−θ⋆s∥22,  ^yt = θ⋆^st (7)

that is, ^st is the index of the centroid which is closest to the given input ~xt, and the model sets ^yt = θ⋆^st as the best estimate of the output.
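A minimal sketch of the K-means inference rule (7); the helper name kmeans_infer is ours, and the centroids stand in for θ⋆1, …, θ⋆K.

```python
import numpy as np

# Sketch of the K-means regression model (7): pick the index of the closest
# centroid and return that centroid as the output estimate.

def kmeans_infer(x, centroids):
    # s_hat = argmin_s ||x - theta_s||_2^2 ; y_hat = theta_{s_hat}
    dists = [float(np.sum((x - th) ** 2)) for th in centroids]
    s_hat = dists.index(min(dists))
    return centroids[s_hat], s_hat
```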

More generally, by setting

 ℓ(xt,yt,θst) = ∥yt−θ′y,st[xt 1]∥22 + ρ∥xt−θx,st∥22 (8)

with θi = (θy,i, θx,i) and ρ > 0, we obtain a piecewise affine (PWA) model over the piecewise linear partition generated by the Voronoi diagram of θx,1, …, θx,K, i.e., the regression model (5) becomes

 ^st = arg mins ∥~xt−θ⋆x,s∥22,  ^yt = (θ⋆y,^st)′[xt 1] (9)

The hyper-parameter ρ in (8) trades off between fitting the output and clustering the inputs based on their mutual Euclidean distance.

A more general PWA model can be defined by setting

 ℓ(xt,yt,θst) = ∥yt−θ′y,st[xt 1]∥22 + ρ ∑j≠st max{0, (θx,j−θx,st)′[xt 1] + 1}2 (10)

where the collection of vectors θx,1, …, θx,K defines a piecewise linear separation function that induces a polyhedral partition of the input space [6, 9]. In this case it is immediate to verify that the regression model induced by (5) is

 ^st=arg maxs{(θ⋆x,s)′[xt1]},^yt=(θ⋆y,^st)′[xt1]. (11)

### 2.3 Jump model

The models introduced above do not take into account the temporal order in which the samples are generated. To this end, we add a mode sequence loss L(S) to the fitting objective (4)

 J(X,Y,Θ,S)=T∑t=1ℓ(xt,yt,θst)+K∑k=1r(θk)+L(S), (12)

where S = (s0, s1, …, sT) is the mode sequence. We define L in (12) as

 L(S) = Linit(s0) + T∑t=1 Lmode(st) + T∑t=1 Ltrans(st,st−1) (13a)

where K = {1,…,K}, Linit : K → R ∪ {+∞} is the initial mode cost, Lmode : K → R ∪ {+∞} is the mode cost, and Ltrans : K2 → R ∪ {+∞} is the mode transition cost. We discuss possible choices for L in Sections 2.3.1 and 3. With a little abuse of notation, we write

 J(X,Y,Θ,S) = ℓ(X,Y,Θ,S) + r(Θ) + L(S) (13b)

where

 ℓ(X,Y,Θ,S) = T∑t=1 ℓ(xt,yt,θst),  r(Θ) = K∑k=1 r(θk). (13c)

As with any model, the choice of the fitting objective (13) should trade off between fitting the given data and prior assumptions we have about the models and the mode sequence. In particular, the mode sequence loss in (13a) takes into account the temporal structure of the mode sequence, for example that the mode might change (i.e., st ≠ st−1) rarely.

A jump model can be used for several tasks beyond inferring the values ^yt, ^st. In anomaly identification, we are interested in determining times t for which the jump model does not fit the data point well. In model change detection, we are interested in identifying times t for which st ≠ st−1. In control systems, jump models can be used to approximate nonlinear/discontinuous dynamics and to design model-based control policies, state estimators, and fault-detection algorithms.

#### 2.3.1 Mode loss functions

We discuss a few options for choosing the mode loss functions Linit, Lmode, Ltrans defining the mode sequence loss L(S) in (13a). As we assume that the number K of possible modes must be fixed, K must be chosen as a trade-off between fitting the model to the data (K large) and limiting the complexity of the model to avoid overfitting (K small). The best value of K is usually determined by cross-validation.

As mentioned above, the case L(S) ≡ 0 leads to a K-model. By choosing Ltrans(st, st−1) = λ if st ≠ st−1 and 0 otherwise, one penalizes all mode transitions equally by λ ≥ 0, where λ → ∞ leads to the regression of a single model on the data (that is, a constant mode sequence), while λ = 0 leads again to a K-model. Note that choosing the same constant λ for all transitions makes the fitting problem exhibit multiple solutions, as the mode indexes 1, …, K can be arbitrarily permuted. The mode cost Lmode can be used to break such symmetries. For example, modes with smaller index will be preferred by making Lmode increasing in the mode index. The shape of this increasing finite sequence can be used to reduce the number of modes actually used: larger increasing values will discourage the use of an increasing number of modes.
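The constant-transition-penalty choice discussed above can be sketched as follows; mode_sequence_loss is a hypothetical helper, and lam plays the role of the constant penalty λ (with Lmode and Linit taken as zero for the sketch).

```python
# Sketch of the mode sequence loss (13a) under the constant transition
# penalty: L_trans(s_t, s_{t-1}) = lam if the mode changes, 0 otherwise,
# with L_mode = 0 and zero initial cost (assumptions made for this example).

def mode_sequence_loss(S, lam):
    """L(S) for S = (s_0, ..., s_T): lam times the number of mode switches."""
    return lam * sum(1 for a, b in zip(S[:-1], S[1:]) if a != b)
```

With lam large, sequences with few switches are favored; with lam = 0 the loss vanishes and the K-model is recovered.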

The initial mode cost Linit summarizes prior knowledge about the initial mode s0. For example, Linit(s0) ≡ 0 if no prior information on s0 is available. On the contrary, if the initial mode is known and, say, equal to ¯s, then Linit(¯s) = 0 and Linit(s0) = +∞ otherwise.

The next Section 3 suggests criteria for choosing L in case statistical assumptions about the underlying process that generates S are available. Alternative criteria for choosing L directly from the training data are discussed in Section 4.4.

## 3 Statistical interpretations

Let X = (x1,…,xT), Y = (y1,…,yT), S = (s0,…,sT), Θ = (θ1,…,θK). We provide a statistical interpretation of the loss functions for the special case in which the following modeling assumptions are satisfied:

• (A1) The mode sequence S, the model parameters Θ and the input data X are statistically independent, i.e.,

 p(S|X,Θ)=p(S),   p(Θ|S,X)=p(Θ)
• (A2) The conditional likelihood of Y is given by

 p(Y|X,S,Θ)=T∏t=1p(yt|X,S,Θ)=T∏t=1p(yt|xt,θst)

where p(yt|xt,θst) is the likelihood of the outcome yt given xt and θst;

• (A3) The priors on the model parameters θ1, …, θK are all equal to p(θ), i.e.,

 p(θ1)=⋯=p(θK)=p(θ)

and the model parameters are statistically independent, i.e.,

 p(Θ)=K∏k=1p(θk)
• (A4) The probability of being in mode st given st−1 is πst,st−1 (Markov property);

• (A5) The initial mode s0 has probability πs0.

###### Proposition 1.

Let Assumptions A1-A5 be satisfied and define

 ℓ(xt,yt,θst) = −log p(yt|xt,θst) (14a)

 r(θk) = −log p(θk) (14b)

 Ltrans(st,st−1) = −log πst,st−1 (14c)

 Linit(s0) = −log πs0 (14d)

 Lmode(st) = 0. (14e)

Then minimizing J(X,Y,Θ,S) as defined in (12)–(14) with respect to Θ and S is equivalent to maximizing the joint likelihood p(Y,S,Θ|X).

Proof. Because of the Markov property (Assumption A4), the likelihood of the mode sequence S is

 p(S)=p(s0)T∏t=1p(st|st−1). (15)

From (15) and Assumptions A1-A3, we have:

 p(Y,S,Θ|X) = p(Θ|X) p(Y,S|X,Θ)
            = p(Θ|X) p(S|X,Θ) p(Y|S,X,Θ)
            = p(Θ) p(S) p(Y|S,X,Θ)
            = K∏k=1 p(θk) p(s0) T∏t=1 p(st|st−1) p(yt|xt,θst)

whose logarithm is

 logp(Y,S,Θ|X)=K∑k=1logp(θk)+logp(s0)+T∑t=1logp(st|st−1)+T∑t=1logp(yt|xt,θst). (16)

By defining the loss functions ℓ, r, Ltrans, Linit, and Lmode as in (14), the minimization of the fitting objective J as in (12)–(13) with respect to Θ and S is equivalent to maximizing the logarithm of the joint likelihood (16), and therefore p(Y,S,Θ|X) itself.

The following proposition provides an inverse result, namely a statistical interpretation of minimizing a generic fitting objective J defined as in (13).

###### Proposition 2.

Define the probability density functions

 p(yt|xt,θst) = e−ℓ(xt,yt,θst) / ν(θst,xt) (17a)

 p(S,Θ|X) = ν(S,Θ,X) e−L(S)−r(Θ) / ∑¯S∈KT+1 ∫Rd×K ν(¯S,Θ,X) e−L(¯S)−r(Θ) dΘ (17b)

where

 ν(θst,xt) = ∫Y e−ℓ(xt,y,θst) dy (18a)

 ν(S,Θ,X) = T∏t=1 ν(θst,xt) (18b)

and assume that the outputs yt are conditionally independent given X, S, Θ, i.e., p(Y|S,X,Θ) = T∏t=1 p(yt|xt,θst). Then the following identity holds

 arg minS,ΘJ(X,Y,Θ,S)=arg maxS,Θlogp(Y,S,Θ|X) (19)

Proof. Since

 p(Y,S,Θ|X)=p(Y|S,X,Θ)p(S,Θ|X) (20)

by substituting (17) into (20) we get

 p(Y,S,Θ|X) = [∏Tt=1 e−ℓ(xt,yt,θst) / ∏Tt=1 ν(θst,xt)] ⋅ [ν(S,Θ,X) e−L(S)−r(Θ) / ∑¯S∈KT+1 ∫Rd×K ν(¯S,Θ,X) e−L(¯S)−r(Θ) dΘ]

            = e−∑Tt=1 ℓ(xt,yt,θst) − L(S) − r(Θ) / ∑¯S∈KT+1 ∫Rd×K ν(¯S,Θ,X) e−L(¯S)−r(Θ) dΘ (21)

As the denominator in (21) does not depend on S and Θ, maximizing p(Y,S,Θ|X) is equivalent to maximizing

 e−∑Tt=1ℓ(xt,yt,θst)−L(S)−r(Θ),

or, equivalently, to minimize

 T∑t=1ℓ(xt,yt,θst)+L(S)+r(Θ)

The identity (19) thus follows from the definition of J in (13).

The following corollary provides a set of probabilistic interpretations of the loss functions, some of which are well known in Bayesian estimation.

###### Corollary 1.

Let ν(θst,xt) in (18a) be a constant. Then the following statements hold:

1. The quadratic regularization r(θk) = c∥θk∥22 corresponds to assuming a zero-mean Gaussian prior on θk, namely θk ∼ N(0, σ2I) with c = 1/(2σ2).

2. The quadratic penalty on the prediction error

 ℓ(xt,yt,θst)=c∥yt−θ′stxt∥22 (22)

corresponds to assuming the probabilistic model yt = θ′st xt + et for the output, with Gaussian noise et ∼ N(0, σ2I) and c = 1/(2σ2).

3. Setting Ltrans ≡ 0 is equivalent to assuming that the modes st are independent and identically distributed, with

 st ∼ p(st) = e−Lmode(st) / K∑k=1 e−Lmode(k)

Furthermore, setting Lmode ≡ 0 corresponds to assuming p(st) = 1/K for all st ∈ K, while setting Lmode(k) = +∞ for some mode k corresponds to assuming p(st = k) = 0.

4. Under the assumption Lmode ≡ 0, the case Linit ≡ 0 and Ltrans(st,st−1) = λ for st ≠ st−1, Ltrans(st,st−1) = 0 for st = st−1, corresponds to assuming that

 p(s0) = 1/K,  p(st|st−1) = { e−λ/(1+(K−1)e−λ) if st ≠ st−1;  1/(1+(K−1)e−λ) if st = st−1 }

Proof. As the constant ν does not depend on S and Θ, p(S,Θ|X) in (17b) can be written as p(S,Θ|X) = p(S)p(Θ), where

 p(S) = e−L(S) / ∑¯S∈KT+1 e−L(¯S),  p(Θ) = e−r(Θ) / ∫Rd×K e−r(Θ) dΘ (23)

The results follow straightforwardly from the above expressions of p(S) and p(Θ) and the definition of L in (13a).

## 4 Algorithms

We now provide algorithms for fitting a jump model to a given data set and for inferring the predictions ^yt, ^st from it.

### 4.1 Model fitting

Given a training sequence X of inputs and Y of outputs, to fit a jump K-model we need to attempt minimizing the cost J(X,Y,Θ,S) with respect to Θ and S. A simple algorithm to solve this problem is Algorithm 1, a coordinate descent algorithm that alternates minimization with respect to Θ and S. If ℓ and r are convex functions, Step 11 can be solved globally (up to the desired precision) by standard convex programming [7]. Step 12 can be solved to global optimality by standard discrete dynamic programming (DP) [2] with complexity O(K2T). This is achieved by computing the following matrices M of costs and U of indexes

 M(s,T) = Lmode(s) + ℓ(xT,yT,θs) (24a)

 Us,t = arg minj {M(j,t+1) + Ltrans(j,s)},  t = 1,…,T−1 (24b)

 M(s,t) = Lmode(s) + ℓ(xt,yt,θs) + M(Us,t,t+1) + Ltrans(Us,t,s) (24c)

 M(s,0) = Linit(s) + minj {M(j,1) + Ltrans(j,s)} (24d)

backwards in time, and then reconstructing the minimum cost sequence S forward in time by setting

 s0 = arg minj M(j,0) (24e)

 st = Ust−1,t,  t = 1,…,T. (24f)

Note that if the time order of operations in (24) is reversed, the DP iterations (24) become the Viterbi algorithm [29, p. 264]:

 M(s,0) = Linit(s) (25a)

 Us,t = arg minj {M(j,t−1) + Ltrans(j,s)},  t = 1,…,T (25b)

 M(s,t) = Lmode(s) + ℓ(xt,yt,θs) + M(Us,t,t−1) + Ltrans(Us,t,s) (25c)

followed by the backwards iterations

 sT = arg minj M(j,T) (25d)

 st = Ust+1,t,  t = 0,…,T−1. (25e)
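As a concrete sketch of the mode-sequence step, the forward DP recursion (25) with backtracking might be implemented as follows. The per-sample losses ℓ(xt,yt,θs) are assumed precomputed in a table ell, the helper name optimal_modes is ours, and the L_trans[j][s] indexing convention (previous mode j, current mode s) is a choice made for this sketch.

```python
# Sketch of the DP/Viterbi-style mode-sequence step: find the sequence
# (s_0, ..., s_T) minimizing
#   L_init(s_0) + sum_t (L_mode(s_t) + ell[t][s_t] + L_trans[s_{t-1}][s_t]).
# ell[t][s] stands in for ell(x_{t+1}, y_{t+1}, theta_s) (0-based time index).

def optimal_modes(ell, L_init, L_mode, L_trans):
    T, K = len(ell), len(L_init)
    M = list(L_init)                     # M(s, 0) = L_init(s)
    U = []                               # U[t][s]: best predecessor of mode s
    for t in range(T):
        M_new, U_t = [], []
        for s in range(K):
            # best previous mode j for current mode s
            j = min(range(K), key=lambda j: M[j] + L_trans[j][s])
            U_t.append(j)
            M_new.append(L_mode[s] + ell[t][s] + M[j] + L_trans[j][s])
        M = M_new
        U.append(U_t)
    # backward pass: recover s_T, then its predecessors down to s_0
    s = min(range(K), key=lambda j: M[j])
    seq = [s]
    for t in range(T - 1, -1, -1):
        s = U[t][s]
        seq.append(s)
    return seq[::-1]                     # (s_0, ..., s_T)
```

Each time step scans all K×K transitions, giving the O(K2T) complexity mentioned in the text.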

Since at each iteration the cost J(X,Y,Θk,Sk) is non-increasing and the number of mode sequences is finite, Algorithm 1 always terminates in a finite number of steps, assuming that in case of multiple optima one selects the optimizers in Steps 11 and 12 according to some predefined criterion. However, there is no guarantee that the solution found is the global one, as it depends on the initial guess S0. To improve the quality of the solution, we may run Algorithm 1 from several different random initial sequences and select the best result. Our experience is that a small number of restarts is usually enough.

During the execution of Algorithm 1 it may happen that a mode i does not appear in the sequence Sk. In this case, the fitting loss does not depend on θi, and the latter is determined in Step 11 based only on the regularizer r(θi).

In case L(S) ≡ 0, the ordering of the training data becomes irrelevant and Algorithm 1 reduces to fitting K models to the data set. If in addition ℓ is specified as in (6) and r ≡ 0, Algorithm 1 is the standard K-means algorithm, where the starting sequence S0 is the initial clustering of the data points, Step 11 computes the collection Θk of cluster centroids at iteration k, and Step 12 reassigns data points to clusters by updating their labels st.
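The K-means special case just described can be sketched as a coordinate-descent loop, i.e., a plain re-implementation of Lloyd's algorithm under the stated assumptions (L ≡ 0, loss (6), r ≡ 0); the helper name fit_kmeans_model and its empty-cluster fallback are our own choices.

```python
import numpy as np

# Sketch of Algorithm 1 in the special case L(S) = 0 with loss (6) and r = 0:
# Step 11 (Theta update): each theta_i is the mean of its assigned points;
# Step 12 (S update): reassign each point to the nearest centroid
# (the DP is trivial since there is no mode sequence loss).

def fit_kmeans_model(X, K, S0, iters=50):
    S = list(S0)                          # initial mode (cluster) sequence
    for _ in range(iters):
        # Step 11: centroids as cluster means (fallback for empty modes)
        Theta = [X[[t for t in range(len(X)) if S[t] == i]].mean(axis=0)
                 if any(s == i for s in S) else X[0]
                 for i in range(K)]
        # Step 12: nearest-centroid reassignment
        S_new = [int(np.argmin([np.sum((x - th) ** 2) for th in Theta]))
                 for x in X]
        if S_new == S:
            break                         # fixed point reached
        S = S_new
    return Theta, S
```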

When again L(S) ≡ 0 and the loss in (10) is used to get a PWA model, the cost function minimized in Step 11 of Algorithm 1 is separable with respect to the parameter blocks θy,i, θx,i, i = 1, …, K. Then the minimization with respect to the parameters θx,i produces the piecewise linear separation function that defines the polyhedral partition of the input space [9], while Step 12 looks for the optimal latent variables st that best trade off between assigning each data point xt to a polyhedron and matching the predicted output.

Finally, we remark that Algorithm 1 is also applicable to the more general case in which the mode loss L also depends on Θ, by simply replacing Steps 11 and 12 with

 Θk ← arg minΘ ℓ(X,Y,Θ,Sk−1) + r(Θ) + L(Sk−1,Θ) (26a)

 Sk ← arg minS ℓ(X,Y,Θk,S) + L(S,Θk). (26b)

This would cover the case in which L contains parameters to be estimated.

### 4.2 Inference

Assume that the model parameters Θ⋆ have been estimated and that new production inputs ~x1, …, ~xt and past outputs ~y1, …, ~yt−1 are given. Because of the structure of the mode loss function L defined in (13a), the estimates ^yt and ^st do not depend on future inputs and modes at times j > t.

The same fitting objective (12) can be used to estimate ^yt and the mode subsequence ^St = (^s0, …, ^st),

 (^yt,^St) = arg miny,St Jt(~Xt,~Yt−1,y,Θ⋆,St)  s.t. y ∈ Yt (27)

where Yt is a possible additional output information set and

 Jt(~Xt,~Yt−1,y,Θ⋆,St)=ℓ(~xt,y,θ⋆st)+t−1∑j=1ℓ(~xj,~yj,θ⋆sj)+Linit(s0)+t∑j=1Lmode(sj)+t∑j=1Ltrans(sj,sj−1).

Algorithm 2 attempts to solve problem (27) at every time t of interest. Step 1 is solved again by the DP iterations (24) over the time span 0, …, t, with the only difference that the terminal penalty in (24a) cannot include the unmeasured last output, whose estimate ^yt is instead determined later at Step 2.

Note that open-loop prediction, that is, the task of predicting ^yt and ^st without acquiring the intermediate output measurements, can be simply obtained by replacing the measured outputs with their predictions. Arbitrary combinations of one-step-ahead and open-loop predictions are possible to handle the more general case of intermittent output data availability.

#### 4.2.2 Recursive inference

When Ltrans ≡ 0, problem (27) becomes completely separable in time and simplifies to

 (^yt,^st) = arg miny,s ℓ(~xt,y,θ⋆s) + Lmode(s)  s.t. y ∈ Yt. (28)

For example, in the case of K-means (6) (with Lmode ≡ 0), the estimate obtained by (28) is given by (7).
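The separable inference (28) can be sketched as below for the K-means loss (6), now with an explicit mode cost table; infer_step is a hypothetical helper introduced for this example.

```python
# Sketch of the separable inference (28), valid when L_trans = 0: each time
# step is handled independently, picking the mode with the best combined
# fitting cost and mode cost. With the K-means loss (6), the per-mode fitting
# cost is the squared distance of the input to the corresponding centroid.

def infer_step(x, Theta, L_mode):
    costs = [sum((xi - ti) ** 2 for xi, ti in zip(x, th)) + L_mode[s]
             for s, th in enumerate(Theta)]
    s_hat = min(range(len(Theta)), key=lambda s: costs[s])
    return Theta[s_hat], s_hat
```

With L_mode identically zero this reduces exactly to the nearest-centroid rule (7); a nonzero L_mode biases the choice toward cheaper modes.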

When the mode transition loss function Ltrans is not identically zero, the simplification in (28) no longer holds. Nonetheless, an incremental version of (27) can still be derived, as described in Algorithm 3, where the arrival cost is recursively computed by the algorithm from its initial condition at time 0, for all modes in K.