# Identification of Probability weighted ARX models with arbitrary domains

## Abstract

Hybrid system identification is a key tool to achieve reliable models of Cyber-Physical Systems from data. PieceWise Affine models guarantees universal approximation, local linearity and equivalence to other classes of hybrid system. Still, PWA identification is a challenging problem, requiring the concurrent solution of regression and classification tasks.
In this work, we focus on the identification of PieceWise Auto Regressive with eXogenous input models with arbitrary regions (NPWARX), thus not restricted to polyhedral domains, and characterized by discontinuous maps. To this end, we propose a method based on a probabilistic mixture model, where the discrete state is represented through a multinomial distribution conditioned by the input regressors. The architecture is conceived following the Mixture of Expert concept, developed within the machine learning field. To achieve nonlinear partitioning, we parametrize the discriminant function using a neural network. Then, the parameters of both the ARX submodels and the classifier are concurrently estimated by maximizing the likelihood of the overall model using Expectation Maximization. The proposed method is demonstrated on a nonlinear piece-wise problem with discontinuous maps.

## I Introduction

The identification of hybrid dynamics is fundamental to achieve reliable models of Cyber-Physical Systems (CPS) from data [24]. Hybrid systems provide a unified framework to represent the heterogeneous interactions between control logic/rules and the continuous dynamics of CPS, including sharp changes in operating points, plants regimes, constraints on values/variations of system inputs/outputs, etc. Several models and identification approaches have been proposed in the control community during the last decades (see e.g, [8],[13] for detailed reviews). A major distinction is in the characterization of the discrete state evolution, which is often defined as autonomous or due to partitioning of the continuous state-input domain, leading to switching and piecewise models.

In this work, we focus on the latter, representing an important class - subject of a continuous interest - due to its universal approximation properties, local linearity and equivalence to other hybrid system modeling approaches, thus fostering analysis and deployment in advanced control applications [14].

Considering input-output models, PWARX (i.e., PieceWise Auto Regressive systems with eXogenous input) are structured by a set of affine maps and through a partition of the continuous feature space, determining the discrete mode [8]. The identification of PWARX models is a challenging problem, requiring the concurrent estimation of the sub-model dynamics and the partitioning function through intertwined regression and classification tasks [13]. Besides, as logic conditions in hybrid systems can cause sudden changes in the dynamics, non-smooth function approximations have to be addresses (e.g., using discontinuous PWARX mappings), which usually results challenging using classical nonlinear regression techniques assuming continuous patterns [20].

Mixed-integer programming approaches have been proposed to achieve global optimal solutions to PWARX identification (see e.g., [19] and references therein). As the problem is NP-hard, these approaches result affordable only for very small data sets [19]. Hence, a lot of research studies have been dedicated to approximate methods and heuristics aimed to find good sub-optional solution in reasonable time. In this context, the most popular methods include the bounded error approach [2], the Bayesian approach [10], sparse optimization [18] and the broad family of clustering-based approaches (see e.g., [5],[17],[4]). A detailed review is reported in [8]. Despite the different techniques employed, the common workflow starts from local model estimation and clustering by relaxing the piecewise constraints, followed by partitioning through linear discrimination techniques [13]. Hence, as the obtained clusters are not guaranteed to be linearly separable, a post-processing phase is usually required to address misclassifications [19].

A probabilistic generalization of PWARX (i.e., PrARX) is proposed in [22], resulting in a Mixture of Expert (MoE) architecture [25], enabling concurrent estimation of the ARXs and parameters of the boundaries in a single optimization problem. Then, the corresponding PWARX model is obtained using a straighforward procedure. A variational inference based technique is introduced [1], supporting model structure selections in cases of small data-set size.

Nevertheless, the vast majority of existing studies focus on piecewise models characterized by polyhedral domains in the regression space. The introduction of arbitrary regions leads to the nonlinearly piecewise models (i.e, NPWARX), a direct extension of PWARX still not deeply studied and solved in the literature [12]. As a matter of fact, when the target system is characterized by these arbitrary partitions, forcing linear separability through PWARX models leads to multiple identical submodels in different regions governed by the same dynamics, while NPWARX models enable a unique representation. Besides, NPWARX supports better estimations of submodels, that would be assigned by PWARX to different modes, particularly in sub-regions with fewer samples [12]. However, NPWARX introduce a more challenging nonlinear classification problem within the identification procedure. To address this issue, authors in [12] propose a technique based on kernel regression and Support Vector Machines.

In this work, we investigate a different approach, leveraging on tools from the probabilistic mixture field. Specifically, we extend the PrARX model by introducing a nonlinear partitioning function parametrized by a neural network. Estimation is performed through an integrated optimization, by maximizing the overall model likelihood. Then, we tackle this challenging problem through Expectation Maximization (EM), disentangling classification and regression sub-tasks easier to solve within each iteration.

The paper is organized as follows: Section 2 states the identification problem and details the PrARX model; Section 3 reports the developed model and EM-based estimation procedure; Section 4 describes the adopted case study and summarizes the results achieved.

## Ii Background

### Ii-a PWARX model identification

Consider discrete-time dynamical systems of the form:

(1) |

where are, respectively, the observed system output and error term at discrete time while is a deterministic mapping from the regression vector:

(2) |

represent exogenous inputs and fixed input and output lags (i.e., model orders).

PieceWise affine AutoRegressive eXogenous (PWARX) models are defined by a piecewise affine mapping as follows:

(3) |

where is the extended regression vector, is the finite number of discrete state (or modes), with affine submodels parameters for , with . The switching mechanism between the discrete states is determined by a complete partition of the regressor domain in the collection of convex polyhedra (i.e, regions) described by:

(4) |

where the matrices represent a set of hyperplanes defining the regions by linear inequalities and ’’ denotes a -dimensional vector whose elements can be the symbols and . Hence, the active discrete state is given by the region to which the regressor belongs at :

(5) |

The general PWARX identification problem include the estimation of the ARX parameters and the regions , as well as model orders and the number of modes , given a collection of input/output data .

In this work, we adopt the common assumption of fixed modes/orders, often estimated in practice by preliminary data analysis, cross validation and model order selection techniques [13]. The investigation of automatic feature selection techniques (e.g., including dedicated regularization terms) is left to future extensions.
Besides, we assume that in all modes are sufficiently excited and inputs and outputs are bounded [15].

### Ii-B Nonlinearly PieceWise ARX model

As introduced above, NPWARX models extend PWARX by considering nonlinear boundaries between arbitrary regions, thus not restricted to set of hyperplanes defining polyhedral domains [12]. Formally, NPWARX are defined as in (3) while relaxing the polyhedrical assumption in (4).

### Ii-C Probability weighted ARX model

In the PrARX model proposed in [22], the deterministic partition of PWARX is replaced by probabilistic boundaries, obtained by a softmax function over a linear transformation of the extended regressor (i.e, linear gate):

(6) | ||||

(7) |

where and denotes the probability that belongs to the discrete state .

Parameters are directly related to the hyperplanes determining the regions.
Indeed, the PrARX model can be straightforwardly transformed to a PWARX model by:

(8) |

Besides being a probabilistic generalization of PWARX, the PrARX represents a particular form of the MoE architecture, including linear experts and linear gates [25],[23]. Due to this simplified parametrization, the steepest descent method is adopted in [22] to maximize the likelihood during estimation.

## Iii Methods

### Iii-a Nonlinearly PrARX model

We augment the PrARX model, following the Mixture of Expert approach, by replacing the linear gate with a neural network aimed to learn arbitrary region partitioning the regressor space into discrete states. Hence, we obtain a probabilistic generalization of NPWARX models, as PrARX are for PWARX [1]. We briefly denote it NPrARX hereinafter.

Specifically, the discrete state , at discrete time , is represented by a latent categorical random variable taking values in the set , generated according to a -conditioned multinomial distribution:

(9) |

The output variable is generated according to a set of emission distributions, conditioned by , selected by the active discrete mode:

(10) |

Hence, data generation follows a hierarchical process, starting from the mode sampling followed by output emission, both conditioned on the regressor.

Considering an ARX model form in each discrete state, and assuming a Gaussian distribution with zero mean and standard deviation for , we obtain a set of noisy linear models:

(11) |

It is worth noting that, despite the Gaussian noise assumption adopted in this work, alternative noise patterns can be considered, including specific distributions for each mode [7].

The overall model results in a non-Markov switching probabilistic mixture [1], with semi-parametric density defined as:

(12) |

where summarizes the parameters to be estimated and is the -th mode density:

(13) |

We structure the function , that provides the probability of each discrete state given the regressor, as follows:

(14) |

where the logits are defined using a neural network processing the regressor vector . In the probabilistic mixture literature, this is often referred to as gate.
Besides, the parameters of one of the modes is forced to the null vector to support identifiability, as suggested in [25].

While several network architectures can be adopted, depending on the specific characteristics of the application at hand, in this work we employ a feed-forward form, defined as:

(15) |

where we reported a single hidden layer of units to lighten notation. It it worth noting that alternative nonlinear activation functions can be adopted and that the number of hidden units and layers constitute hyperparameters to be configured. To this end, in this work, we exploit cross-validation as detailed in section IV.

As introduced above, conditioning the gating mechanism through flexible models such as neural networks provides trainable nonlinear decision boundaries for discrete state classification in the regressor space, as opposed to the polyhedral partitioning of PrARX. On the other hand, such flexibility comes at the cost of a more complex optimization problem for parameters estimation [7]. To address this issues, we employ Expectation Maximization (EM), as detailed in the next section.

### Iii-B Parameters estimation by Expectation Maximization

A major feature of the Probabilistic ARX approach is the concurrent learning of classification (i.e., gating function) and regression (i.e., ARX submodels) tasks, while pursuing parameters estimation.

To this end, two major families of approaches can be followed, namely the frequentist and the Bayesian. In this work we adopt the former, leaving the investigation of the second one to future extension, e.g., through the exploitation of Bayesian Neural Networks. Hence, we target a Maximum Likelihood Estimation (MLE) of model’s parameters.

Specifically, given a data set of i.i.d. observations , the likelihood function factorizes as follows:

(16) |

leading to the following objective:

(17) |

The maximization of this function is a complex task, due to the inclusion of the hidden discrete state. This is a common issue of latent variable models, from probabilistic mixtures to Hidden Markov Models. In this fields, EM is typically adopted to derive the MLE. In fact, it provides a more effective alternative to the direct likelihood maximization (e.g, though steepest descent, Newton-Raphson, etc.), specially when dealing with complex models and conditioning functions [7].

Representing a general purpose algorithm for estimation with missing data, EM has several attractive features. First of all, as opposed to the other methods, it has been proved that the observed log-likelihood increase at each iteration [7]. Moreover, it provides a natural way to tackle the challenging sum inside the logarithm form of (17), and enables the exploitation of effective closed form solution in the maximization sub-problems during the iterations.

Considering a set of random indicator variables assigning a generating mode to each sample in the observation set, , we get the joint likelihood over the complete data (i.e, including the missing indicator variables), factorizing as follows:

(18) |

thus leading to a much more tractable form:

(19) |

It is worth noting that the indicator variables filter out all but the related discrete state in the overall model.

Then, the missing indicator variables are averaged out by iteratively computing the expectation of the complete data likelihood over them

(20) |

where represents the number of iteration of the EM algorithm. Notably, the expectation over the binary indicator variables assume continuous values, thus simplifying the optimization problem. represents the expected value of , defined as follows:

(21) |

constituting the posterior obtained by the normalized product of the mode-wise likelihood times the prior .

The training algorithm proceed by alternating between posterior estimation (i.e, E-step) and maximization of the -function (i.e, M-step), devoted to update the estimation of . The former employs the parameters computed by the previous M-step, starting from an initial guess.

Notably, this approach disentangles the estimation of regression (ARX) and classification (gate) parameters within the iteration, resulting in sub-problems easier to tackle. Specifically, the former results in a set of weighted least squared problems , easily solvable by:

(22) |

(23) |

The latter, i.e., , constitutes a 1-of-K classification task over soft labels , where the gate learns to approximate the posterior while the ARX sub-models compete to get responsibility over samples, gaining specialization within different regions of the regression space. During prediction, the sub-model with highest probability is selected. It is worth noting that, as opposed to separated clustering, such an integrated discriminant strategy foster the exploitation of network flexibility on the decision surface between the classes rather than on the overall distribution [3], often resulting in improved classification performances. Despite a closed form solution is not available for this problem, standard network training algorithms can be exploited in order to increase , thus leading to a Generalized-EM approach which still guarantee convergence [7]. Specifically, we deploy the classification task from logits, increasing numerical stability with reference to the conventional cross-entropy over softmax [9]. Then, the network weights are updated using the Adam algorithm, conceived to tackle noisy and sparse gradients [11].

To avoid potential variance collapses [16], we adopt the Maximum A Posteriori estimate proposed in [6], given by:

(24) |

where , , , including the weakest prior with data dimension . Since not subjected to potential collapse, an overall standard deviation parameter can be estimated by the usual MLE form: .

The training algorithm proceed by alternating E/M-steps until convergence, tested by stopping conditions on the likelihood or maximum number of iterations. As common for EM, several initialization are required, by sampling different values of the parameters to tackle eventual convergence to poor local minima. As shown in the result section, we found useful experimentally to sample the initial parameters of the network from relatively large values and to initialize the sub-models biases using k-means [16], to avoid modes collapse.

## Iv Numerical results

To test the proposed method, we considered a slight modification of the widely adopted case study proposed in [2], assumed here to introduce nonlinear partitioning requirements. Besides, it has been conceived as an extension to the simpler nonlinearly piecewise affine map estimation problem considered in [12]. Specifically, the system is constituted by the following components:

with and input sampled from a uniform distribution . The aim is to infer the same dynamics occurring in two parts of the regressor space, by reconstructing two modes. We might remark that, even if this system is defined by two affine maps, identification methods based on linear partition of the regression space would require three modes as the regions are not linearly separable.

We generated a sequence 6000 samples. The first 5000 has been devoted to training/validation and the last 1000 represents the test set. To configure the hyperparameters, we used a cross-validation procedure, leaving 1000 samples to validation.

Both the overall model architecture and the EM-based estimation algorithm have been developed in Numpy-1.18. The weighted least squares subproblems are implemented using the Linear Regression model of Scikit-learn-0.23 while the neural network using Tensorflow-2.2 API.

By cross-validation, we found experimentally that a neural network with 10 hidden units is sufficient to properly fit the problem at hand. The maximum number of iterations has been set to 500, with a stopping tolerance on the log-likelihood variation of 1e-4. The Adam algorithm starts with a learning rate of 0.01, running for 3 epochs mini-batches of 100 samples in each M-iteration. The network starts from random normal weights with zero-mean. Specifically, we observed that the common weight initialization to small (i.e., approximately zero) random values frequently results in convergence to poor local solutions, where the modes collapse to a common (i.e., average) mode. To avoid this issue, we set the standard deviation of the network initializer to a high value (i.e., 10), to stimulate the system to start from initial quite different local configurations, thus limiting mode collapse. Using such configuration, we found that a maximum number of 5 random executions of the estimation algorithm are sufficient to reach good local solutions.

To evaluate the accuracy of the ARX models reconstruction and of the estimated mode sequence, we employed [21]:

(25) |

namely the parameters and mode fit indexes, where , indicates the estimated ARX parameters and state respectively.

As the order of the discrete states in the model do not necessarily match that of the target system, post-processing is required before evaluation [7]. Following [21], without loss of generality, we reordered the sub-models estimate by the Euclidean norm to the target parameters.

The obtained estimates for the ARX submodels are reported in Table I. Statistics are computed over 100 runs of the estimation algorithm using the same hyperparameters configuration.

[1.5, -0.4, 1] | [1.503, -0.400, 0.999] | [2.6e-4, 7.9e-5, 8.1e-5] | |

[-0.5, 0.5, -1] | [-0.499, 0.499, -0.995] | [4.3e-4, 3.7e-4, 4.2e-5] | |

0.2 | 0.201 | 3.1e-4 |

Figure 1 reports the evolution of the estimated ARX parameters and covariance during the iterations of 5 consecutive runs of the training algorithm. Under the adopted configuration, convergence starting from random initial conditions have been observed in approximately 60 iterations on average. The average computation time of one iteration for this case study on an PC with CPU-i7-2.5-GHz-RAM-8Gb has been 0.42 seconds.

The obtained performance indexes are: , , while Figure 2 reports the residuals over the test set.

Notably, we obtained high performances over the test set. The few visible spikes in the residual plot are due to small errors in the estimated switching surface for the target discontinuous PWARX, which are in general inevitable as identification is performed over finite datasets [2]. The estimated partitions of the regressor space are shown in Figure 3, where the different colors represents the states (i.e, the maximum gate network activation for each sample) and the black lines the true partitions of the target system.

Finally, we might remark that the same estimation algorithm can be applied on a simplified discriminant function, including a linear parametrization. In this case, an EM-based estimation of the PrARX model is obtained, enabling transformation to the equivalent PWARX. However, three discrete modes would be required to enable linear separation.

## V Conclusion and Next Steps

In this work we focused on the identification of Nonlinarly Piecewise Affine systems, characterized by arbitrary regions not restricted to polyhedral partitions. To this end, we augmented the Probability weighted ARX model, following a Mixture of Expert approach, by parametrizing the discriminant function through a neural network conditioned by the regressors. As the direct maximization of the likelihood function is particularly difficult when complex models and conditioning functions are employed, we leveraged on Expectation Maximization. Hence, ARX submodels fitting and classification tasks are disentangled within the iterations, enabling a closed form solution of the regression problems while increasing the log-likelihood at each iteration. By application to a test case we showed the capability of the proposed approach to achieve accurate parameters estimation and discrete state prediction in out of sample conditions.

Next developments will include the integration of automatic techniques for both model orders and number of modes selection, the investigation of a Bayesian inference approach and the application to a real application case.

### References

- (2011) Variational learning of autoregressive mixtures of experts for fully bayesian hybrid system identification. In Proceedings of the 2011 American Control Conf., Vol. , pp. 139–144. Cited by: §I, §III-A, §III-A.
- (2005) A bounded-error approach to piecewise affine system identification. IEEE Transactions on Automatic Control 50 (10), pp. 1567–1580. Cited by: §I, §IV, §IV.
- (1995-12) An input output hmm architecture. Advances in Neural Information Processing Systems 7, pp. . Cited by: §III-B.
- (2016-11) Piecewise affine regression via recursive multiple least squares and multicategory discrimination. Automatica 73, pp. 155–162. External Links: Document Cited by: §I.
- (2003) A clustering technique for the identification of piecewise affine systems. Automatica 39 (2), pp. 205 – 217. External Links: ISSN 0005-1098, Document, Link Cited by: §I.
- (2007-02) Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification 24, pp. 155–181. External Links: Document Cited by: §III-B.
- (2019) Handbook of mixture analysis. Chapman and Hall/CRC. External Links: ISBN 9781498763813 Cited by: §III-A, §III-A, §III-B, §III-B, §III-B, §IV.
- (2012) A survey on switched and piecewise affine system identification. 16th IFAC Symposium on System Identification 45 (16). External Links: ISSN 1474-6670, Document, Link Cited by: §I, §I, §I.
- (2017) Hands-on machine learning with scikit-learn and tensorflow. O’Reilly Media. External Links: ISBN 978-1491962299 Cited by: §III-B.
- (2005) A bayesian approach to identification of hybrid systems. IEEE Transactions on Automatic Control 50 (10), pp. 1520–1533. Cited by: §I.
- (2014-12) Adam: a method for stochastic optimization. International Conference on Learning Representations, pp. . Cited by: §III-B.
- (2008) Switched and piecewise nonlinear hybrid system identification. In Hybrid Systems: Computation and Control, External Links: ISBN 978-3-540-78929-1 Cited by: §I, §II-B, §IV.
- (2018) Hybrid system identification: theory and algorithms for learning switching models. Vol. 478, Springer. Cited by: §I, §I, §I, §II-A.
- (2009) Handbook of hybrid systems control: theory, tools, applications. Cambridge University Press. External Links: Document Cited by: §I.
- (2020) Recursive bias-correction method for identification of piecewise affine output-error models. IEEE Control Systems Letters 4 (4), pp. 970–975. Cited by: §II-A.
- (2012) Machine learning: a probabilistic perspective. MIT. Cited by: §III-B, §III-B.
- (2005) Identification of piecewise affine systems based on statistical clustering technique. Automatica 41 (5), pp. 905 – 913. External Links: ISSN 0005-1098, Document, Link Cited by: §I.
- (2013) Identification of switched linear regression models using sum-of-norms regularization. Automatica 49 (4), pp. 1045 – 1050. External Links: ISSN 0005-1098, Document, Link Cited by: §I.
- (2019) A bilevel programming framework for piecewise affine system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), Vol. , pp. 7376–7381. Cited by: §I.
- (2007) Identification of hybrid systems a tutorial. Eur.Jour.of Control, pp. 242 – 260. External Links: ISSN 0947-3580 Cited by: §I.
- (2016) A new kernel-based approach to hybrid system identification. Automatica 70, pp. 21 – 31. External Links: ISSN 0005-1098, Document, Link Cited by: §IV, §IV.
- (2009) Identification of probability weighted multiple arx models and its application to behavior analysis. In Proceedings of the 48h IEEE Conference on Decision and Control (CDC), Vol. , pp. 3952–3957. Cited by: §I, §II-C.
- (1995) Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. International journal of neural systems 6 4, pp. 373–99. Cited by: §II-C.
- (2019) Data driven discovery of cyber physical systems. Nature Communications 10 (1), pp. 4894. External Links: ISSN 2041-1723, Document, Link Cited by: §I.
- (2012) Twenty years of mixture of experts. IEEE Tr. on Neural Networks and Learning Systems. External Links: Document, ISSN 2162-2388 Cited by: §I, §II-C, §III-A.