Sequential online prediction in the presence of outliers and change points: an instant temporal structure learning approach
Abstract
In this paper, we consider sequential online prediction (SOP) for streaming data in the presence of outliers and change points. We propose an INstant TEmporal structure Learning (INTEL) algorithm to address this problem. Our INTEL algorithm is developed based on a full consideration of the duality between online prediction and anomaly detection. We first employ a mixture of weighted Gaussian process models (WGPs) to cover the expected possible temporal structures of the data. Then, based on the rich modeling capacity of this WGP mixture, we develop an efficient technique to instantly learn (capture) the temporal structure of the data that follows a regime shift. This instant learning is achieved only by adjusting one hyperparameter value of the mixture model. A weighted generalization of the product of experts (POE) model is used for fusing predictions yielded from multiple GP models. An outlier is declared once a real observation seriously deviates from the fused prediction. If a certain number of outliers are consecutively declared, then a change point is declared. Extensive experiments are performed using a diverse of real datasets. Results show that the proposed algorithm is significantly better than benchmark methods for SOP in the presence of outliers and change points.
keywords:
online prediction, change point detection, outlier detection, streaming data, regime shift, instant learning1 Introduction
Driven by the fast development of communication, sensor and storage technologies, streaming data abounds in many application areas such as transportation biem2010ibm (), manufacturing harvey1990forecasting (), network security condon2008analysis (), agriculture monitoring eerens2014image () and medical diagnosis rasmussen2011inference (). In this paper, we are concerned with the issue of sequential online prediction (SOP) for streaming data. An online realtime prediction algorithm is critical for many use cases, e.g., preventative maintenance, fraud prevention, and realtime monitoring. Compared with batch processing methods that need to access each data point repeatedly, an online prediction algorithm access each data point only once, and thus is beneficial for saving costs in computation and storage.
In many timeseries data, outliers and change points exist. An outlier is typically a single observation whose statistical property is independent of and different from that of the rest data. Different from outliers, a change point stands for a data point in the timeseries, at which an endogenous regime shift of the system happens. That says, after a change point, the statistics of the system changes. It is an issue also known as concept drift.
If not appropriately dealt with, the presence of outliers and change points can lead to detrimental effects on the prediction. To clarify the necessity and importance for outlier and change point detection, let consider a sensor network based realtime monitoring scenario. Sensor failures shall produce outliers liu2015toward (); liu2017state (), while, when an abnormal event like a forest fire, a leakage of poisonous gas or debris flow happens, it will lead to a regime shift, corresponding to a change point in sensor measurements wang2017online (). In this scenario, undiscovered outliers shall produce unreliable predictions, and, more crucially, if a regime shift is not detected and addressed well in time, it can lead to a disaster.
Approaches to outlier and change point detection can be roughly categorized into two classes, namely retrospective methods and online approaches. For the former class, the full dataset is available for analysis, and the locations of anomalies are identified in a batch mode xuan2007modeling (); barry1993bayesian (). Many previous approaches to anomaly detection are retrospective, while the batch processing feature makes them not suitable for use in an online prediction system.
Compared with retrospective methods, online approaches have received less attention. This is possibly due to that the online setting is quite different from traditional situations that make traditional approaches inapplicable. Bayesian modeling and inference approaches have been explored for developing online change point detection methods in e.g., adams2007bayesian (); turner2009adaptive (); fearnhead2007line (); ray2002bayesian (); chandola2011gaussian (). In the Bayesian online change point detection (BOCPD) algorithm of adams2007bayesian (); turner2009adaptive (), authors design an underlying predictive model (UPM) and a hazard function to describe uncertain factors regarding the run length, namely the time since the last change point. For BOCPD, it assumes that all data points within a regime are identically and independently distributed (iid) according to a specific distribution, e.g., Gaussian. This assumption makes BOCPD a pure change point detection method without the capability to do online prediction. In saatcci2010gaussian (), the Gaussian process (GP) is introduced into the BOCPD framework to exploit the temporal structure of the data, while its central aim is still to improve the change point detection performance.
Outliers and change points are usually separately considered in the literature (with few exceptions in e.g., yamanishi2002unifying ()), under the name of outlier detection, fault detection, and change point detection. In this paper, we consider them together and propose an SOP algorithm that is robust to both outliers and change points. Our algorithm is based on the Gaussian process timeseries (GPTS) model. GPs are Bayesian nonparametric models that have widely used for approximating complex nonlinear functions. It has been proved that the prediction performance of GP is comparable to that of artificial neural networks (ANN) neal2012bayesian (). Further, GP has two merits compared with ANN. First, the probabilistic nature of GP can give us a byproduct, i.e., an uncertainty measure for the prediction it makes. This is a desirable property for human operators to make decisions. Second, as a generative model, GP is more interpretable.
GPTS models have recently been studied for developing robust SOP methods roberts2013gaussian (); osborne2012real (); vanhatalo2009gaussian (); osborne2011machine (); garnett2010sequential (). For example, in vanhatalo2009gaussian (), the authors take into account the presence of faulty observations in the GPTS model by using a heavytailed student’s likelihood function. A similar idea has been used in osborne2011machine (), which takes account of faulty data by a Gaussian distribution with a very wide variance. In garnett2010sequential (), the authors design a nonstationary kernel function to take account of the appearance of change points. Although these methods are powerful, when using them, one has to pay some price, that is the significantly increased complexity in the inference. This is due to the lack of an analytically tractable inference algorithm for those models. Specifically, for the student’s model of vanhatalo2009gaussian (), approximate methods such as Gibbs sampling geweke1993bayesian () and variational methods tipping2005variational () are needed for inference. For the model of garnett2010sequential (), a complex Bayesian quadrature procedure is required for calculating the predictive distribution.
In this paper, we propose an INstant TEmporal structure Learning (INTEL) algorithm for SOP, which has desirable features as follows:

Different from the aforementioned GPTS based methods, our INTEL algorithm allows closedform inference and prediction, and thus is more computationally efficient and easier to code;

The INTEL algorithm is robust to both outliers and change points;

As a GP based algorithm, INTEL inherits all merits of GPs. For example, it can produce an uncertainty measure for each prediction it makes. It also has the desirable interpretability;

As an SOP algorithm, it can also provide realtime anomaly detections as a byproduct.
To the best of our knowledge, INTEL is the only algorithm in the literature that owns all the above features. The other contributions of this paper are summarized as follows:

We present a mixture modeling approach to precover temporal structures of unobserved timeseries data based on a template model trained with a relatively small number of observed data points. Using this method, we obtain a mixture model that has a much richer modeling capacity than the template model.

Based on the rich modeling capacity of the aforementioned mixture model, we propose an efficient approach to quickly capture the temporal structure of the new regime upon a change point is detected. We term this mechanism of fast temporal structure capturing as instant learning. Striking different from traditional machine learning (ML) methods, instant learning emphasizes the use of prior knowledge and does not require any training dataset.

We present a weighted generalization of the POE model of hinton2002training () for fusing predictions yielded from multiple GPTS models.

We use a bunch of real datasets to evaluate the performance of our method. Results demonstrate the superiority of our method.
The remainder of this paper is organized as follows. In Section 2, we present the GPTS model we use for developing the proposed INTEL algorithm. In Section 3, we introduce INTEL in detail and provide a formal analysis of its computational complexity. In Section 4, we discuss its connections to relevant works in the literature. In Section 5, we evaluate the performance of INTEL using a bunch of real datasets. Finally, we conclude the paper in Section 6.
2 Sequential online prediction with GP
In this section, we briefly introduce the GPTS model used here. The aim is to fix notations and introduce the necessary background information for presenting the INTEL algorithm later in Section 3. For more details on GP and its applications in timeseries prediction, readers are referred to roberts2013gaussian (); williams2006gaussian ().
2.1 Gp
Let start by introducing the GP. GP is a probability distribution defined on a function. Consider a function drawn from a GP as follows
(1) 
where is the input of the function, is the output, denotes a GP with mean function and covariance kernel function . Here denotes the hyperparameter of the kernel function. Given any two (arbitrary) input locations, say and , the kernel function defines the covariance element between them. For a set of input locations , the covariance elements can then be described by a covariance matrix
(2) 
One can see that the function evaluations at the input points in are a sample from a multivariate Gaussian distribution,
(3) 
Here are dependent function values evaluated at input locations , and is a vector of mean function values evaluated at .
In most situations, especially in the context of timeseries analysis, our observations are data corrupted by a noise process. We can take account of this by defining
(4) 
in which denotes the noise item. In common practice, is assumed to be Gaussian distributed , where denotes the variance of the noise. For noisy observations, the form of the covariance matrix becomes
(5) 
where is the identity matrix.
The form of the kernel function together with its hyperparameter has a great impact on the efficacy of GP approximation. The most commonly used kernel function is arguably the squared exponential (SE) function, given by
(6) 
in which is the hyperparameter. In this paper, we adopt the Matern kernel function, defined as
(7) 
where is the Euclidean distance between and .
The SE kernel function is infinitely differentiable and thus may yield unrealistic results for physical processes stein2012interpolation (). In contrast with the SE kernel function, the Matern class kernel function is better for capturing temporal structures in physical processes due to its finite differentiability stein2012interpolation (). Specifically, the Matern kernel function is twice differentiable and has been widely used in GP williams2006gaussian (). Now we have . The hyperparameter of the kernel function describes the general properties of our function williams2006gaussian (). As shown in Eqn. (7), governs the output scale of our function, determines the input scale, and thus the smoothness of our function. For other types of kernel functions used for GP regression, see williams2006gaussian ().
Denote as the hyperparameter of the GP model. Then, given an observed dataset , the value of can be determined by maximizing the log marginal likelihood williams2006gaussian ():
A conjugate gradient descent optimization algorithm included in the GPML toolbox rasmussen2010gaussian () is often used to address the above maximization problem.
Now let consider how to predict the observation at a test input location based on an observed dataset (i.e., the training dataset in the ML jargon). According to the definition of GP, and are jointly distributed as follows
(9) 
where and is the transpose of . Using some linear algebra operations, one can derive the posterior distribution of , which is Gaussian with mean
(10) 
and variance
(11) 
To take account of the observation noise, we can simply substitute the term from Eqns. (9)(11) with the in Eqn. (5).
2.2 The GPTS model
Let consider a timeseries , in which denotes the time index, the data observed at . At each time step , we are interested in calculating the predictive distribution of given all the observations up to time , namely . Here, . In real practice, a time window can be used to save computation cost or account for nonstationarity of the system. Then the target predictive distribution becomes , where is the length of the time window.
Now we adapt the GP model into the context of timeseries. We describe timeseries observations using the model of the form
(12) 
where the time index is taken as the input (namely the term in Eqns.(1)(11)), while the observation is the output. Then, given an observed dataset , in which , , the predictive distribution of is given by the mean
(13) 
and the variance
(14) 
This GPTS model generalizes classical timeseries models, e.g., autoregressive (AR), autoregressive moving average (ARMA), and Kalman filter murray2001gaussian (); saatcci2010gaussian (); turner2012gaussian ().
3 The proposed INTEL algorithm
In this section, we present details about our INTEL algorithm, which is developed for SOP in the presence of outliers and change points.
3.1 GPTS Mixture for capturing complex temporal structure
The main idea underlying our algorithm is as follows. We treat a data stream in which outliers and change points are present as a function with timevarying temporal structures. We use GPTS models to capture temporal structures of . In a GPTS model, each hyperparameter describes one aspect of the temporal structure underlying the data. For example, for a model with a Matern 5/2 kernel function, the hyperparameter describes the amplitude of the function, determines its smoothness and represents the variance of the observation noise. Given specific values of its hyperparameters, a GPTS model can capture a specific temporal structure of the data. Suppose that a historical dataset is preavailable, then we can use a template model to capture the temporal structure underlying these data. In most situations, it can be reasonably assumed that a relatively small number of historical data points are preavailable. Since is obtained based on a very limited number of historical data, its modeling capacity shall be very limited too. That means only using is impossible to capture temporal structures underlying future data points since nonstationarity is the basic feature of timeseries data. We come up with an idea to enlarge the modeling capacity of . Given with hyperparameters , we construct a set of candidate models based on . That says these candidate models are variants of . We let all variants share the same mean function with , but take different hyperparameter values. Note that the term temporal structure used here is defined with hyperparameters , and is not related to the mean function. We use a weighted mixture of these models to cover temporal structures underlying unseen data in future. Although only a limited number of GPTS models can be used, the number of their combinations, defined by their weights, is infinite. It means that the modeling capacity of this mixture model can be much larger than that of , as conceptually illustrated in Figure 1. Hence, we may get much better SOP result based on this mixture model, while, to make the above idea work, we need to answer two questions at first, namely, how to build up the variant models and how to combine all models in an appropriate way to capture the true temporal structure underlying the data. We propose specific techniques to answer the above questions.
Let take an example to show how to construct variants of . Suppose that, given at the beginning, we believe that the input scale will decrease later, namely, our function will become rougher later. Then we can translate the above belief by introducing a variant model , for which we assign a smaller input scale value, say . The coefficient is related to the limit of the input scale we expect. If we are uncertain whether the input scale will decrease or increase, then except , we can introduce another variant model , for which we use a bigger input scale value, say .
Notice that for a timeseries, abrupt changes in the temporal structure only appear at locations of outliers and change points. Therefore the temporal structure information learned from the historical data can provide important clues for guessing temporal structures in the future data. Our method makes use of such clue information by constructing candidate models on the basis of and thus is much better than brute force methods that construct candidate models arbitrarily from scratch. The proposed mechanism to handle outliers and change points is deferred to subsection 3.4.
For the issue of model combination, we treat each model as a hypothesis. Each model is associated with a weight, which represents the probability that this hypothesis is true. The weights of the models are tuned over time to let the weighted mixture of these models capture the nonstationary temporal structure of the data. We resort to a dynamic version of the Bayesian model averaging (DMA) technique to tune the model weights. For more details on the DMA method, see dai2016robust (); liu2011instantaneous (); liu2017robust (). Suppose that, at time step , the model has a weight , , . Then the predictive weights of the models at time step are calculated as follows
(15) 
where is termed the forgetting parameter. Upon the arrival of the observation , the model weights are updated according to Bayesian formalism as follows
(16) 
where denotes the likelihood of under the hypothesis .
3.2 Fusion of GPTS predictions
Now we consider how to combine predictions provided by , to yield a fused prediction. Recall that prediction with a single GPTS model is presented in Section 2.2. Following the setting in Section 2.2, we focus on the calculation of the predictive distribution of , namely (or for short). Denote the predictive distribution of corresponding to as (or for short). The mean and the variance of , denoted as and , are calculated using Eqns.(13)(14). To calculate based on , , the POE model of hinton2002training () can be used. Given multiple probability densities, , , the POE model describes the target distribution as follows,
(17) 
in which is a normalizing constant that makes a probability distribution that integrates to 1. Since , , are all Gaussian, calculated with Eqn.(17) is still Gaussian, with its mean and variance given by hinton2002training ()
(18)  
(19) 
where . One can see that the information of the model weights is not involved in the above calculation. In fact, in the original POE model, all models are treated to be equally weighted. Here we generalize the POE model to incorporate the information of the model weights by letting
(20) 
Note that, here we use other than . This is because the calculation of requires access to , see Eqn. (16). However, the calculation of the predictive density of is performed at time step . At that time, the real observation is not accessible. So is used instead of in Eqn.(20). Since , , are Gaussian, calculated with Eqn. (20) is still Gaussian, with its mean and variance given by cao2014generalized ()
(21)  
(22) 
The mean is taken as the prediction of made at time step . A confidence interval associated with this prediction is also available. For example, a 99.75% confidence interval is shown to be .
3.3 Online outlier detection
Considering that SOP and online outlier detection are a pair of dual problems, we do outlier detection based on the prediction given by the GPTS mixture mentioned above. Assume that a GPTS mixture model with a rich enough modeling capacity is built up. That says it can produce a reliable prediction for an observation, say , based on the observed data , provided that and the elements included in are within the same regime. Then outlier detection is performed simply as follows. If the real observation departs from the confidence interval , then declare it to be an outlier.
3.4 Change point detection and instant temporal structure capturing
Recall that, in the GPTS model, the predictive distribution of is calculated based on the training dataset , in which and , see Eqns.(13)(14). This calculation does not take into account the possible presence of outliers or change points in the training dataset. The inclusive of an outlier or a change point will bring detrimental effects to the prediction performance escalante2005comparison (). To this end, we develop a technique, called adaptive training set formation, to eliminate the negative effects of outliers and change points. We use a potential change point bucket (PCB), denoted as to save outliers that have been declared consecutively till now. Specifically, if an outlier is declared at , then add and into and , respectively. Otherwise, we empty and and add and into and , respectively. After that, we check if the number of elements in (or ) achieves a certain number, say . If so, we declare a change point detection.
Upon a change point is declared, we set , , and then empty and . In this way, a new training dataset is formed, based on which we can do predictions for future observations in the new regime. To achieve a fast detection of the regime shift, should take a small value, while, to learn a qualified model to capture the temporal structure of the new regime, the bigger is , the better. We break this dilemma by proposing an instant learning technique with the help of the rich modeling capacity of our GPTS mixture. Recall that in the setting of this paper, a GPTS model is defined by its hyperparameters which describe the temporal structure and the mean function that describes the major trend. So constructing a qualified GPTS model is equivalent to finding appropriate values for such hyperparameters and . If we only have a relatively small number of labeled data points, it is impossible to find appropriate values for all these parameters. Our idea is to borrow the power of the adaptive weighted mixture model mentioned above to automatically capture the temporal structure of the new regime, while only update the value of based on these data points, namely
(23) 
See Figure 3 for an example performance show of the above mechanism. We can see from the middle panel of Figure 3, when a regime shift appears at , the weight of the previously dominated model falls rapidly, while, at the same time, the weight of , whose hyperparameter setting is more fit to the new regime, rises abruptly. The above result clearly shows that the adaptive model weighting mechanism of our method takes effect, rendering our mixture model capture the temporal structure of the new regime instantly.
3.5 Implementation of the INTEL algorithm
3.6 Algorithm initialization
Given the mean function , is initialized by maximizing the log marginal likelihood based on an observed dataset . Then is specified in a way as presented in the second paragraph of subsection 3.1. Note that the efficiency of our algorithm does not depend on a fixed model set, because different model sets can have the same function for one specific dataset. In subsection 5.1, we present an example case that shows the experimental evidence of the above argument. In that example, the inclusive of lowquality models have little impact on the prediction performance, because the model weighting procedure (see Eqns.(15)(16)) automatically assigns tiny weights to those lowquality models, and thus eliminates their negative effects.
Now we discuss initialization issues about the other parameters. We specify to be a constant value function, , where is initialized to be the average of those data points included in . All model weights are initialized to be . As for , the smaller is its value, the earlier a change point can be detected, while, it can not be arbitrarily small, otherwise, a detected change point may be an outlier. We set its value at 3 to give a balance between the timeliness of change point detection and the discrimination between a change point and outliers. For , we follow our previous work in liu2017robust (), setting its value at 0.9. The parameter represents the length of the time window, while in our algorithm, it just determines the maximum number of training data points allowed for use in calculating the predictive distribution, see Eqns.(10)(11). The actual number of training data points and which data points within the time window will be selected as training data points are both determined by the adaptive training set formation procedure described in subsection 3.4. That says the value of has much less impact on the prediction performance of our algorithm than for traditional time window based methods. In our experiments, we set . Lastly, the parameter controls the period for finetuning the mean function. We select to update the mean function periodically, because, even within one regime, the timeseries data may still be nonstationary. Finetuning the mean function is beneficial for capturing the changes in the trend of our function. In our experiments, we set .
3.7 A formal analysis of the computational complexity of the INTEL algorithm
We mark the computational complexity for each major operation in Algorithm 1. It shows that the most computationally complex operation is the inversion of a matrix in line 5, which scale as . The matrix inversion operation is performed times, leading to a complexity in total. The calculation in line 7 consists of simple numerical operations that scales as . The computation in line 8 involves Gaussian likelihood calculations scaling as . The operation in line 9 involves multiplication and addition calculations that scale as . The remaining operations include the comparison operation in line 10, the average operations in lines 14 and 21, whose computational complexity is negligible compared with others. To summarize, all operations in the INTEL algorithm are of the linear algebra type and scale as . This algorithm shall be highly computationally efficient if and take small values.
4 Connections to relevant works in the literature
As a GPTS modelbased method, our INTEL algorithm is relevant with all existent works that involve the GPTS model and take into account outliers or change points. See e.g., chandola2011gaussian (); saatcci2010gaussian (); roberts2013gaussian (); osborne2012real (); garnett2010sequential (); osborne2011machine (), to name just a few. A common feature of these existent works is that they all try to design one accurate complex model to cover all cases that may happen in the future, including the appearance of outliers or change points. For example, the algorithm in vanhatalo2009gaussian () employs a heavytailed student’s observation noise model to take into account the presence of outliers. The fault bucket algorithm in osborne2011machine () takes account of faulty data with a Gaussian distribution with a very wide variance. The approach in garnett2010sequential () uses a nonstationary kernel function to take into account the appearance of change points. A price to pay for applying such complex models is a significantly increased complexity in the inference. For example, to use the student’s model of vanhatalo2009gaussian (), one has to use Gibbs sampling geweke1993bayesian () or variational methods tipping2005variational () for inference. To use the model of garnett2010sequential (), one has to address a complex Bayesian quadrature to calculate the predictive distribution.
Different from the aforementioned methods, neither Monte Carlo sampling nor quadrature operation is required to implement our INTEL algorithm, since all operations are of the linear algebra type, see details in subsection 3.7. The INTEL algorithm does not try to design one accurate complex model, but to construct a model set to cover possible temporal structures in future data. Each member in the model set is an inaccurate model, while it captures one type of temporal structure and allows closedform inference. A dynamic datadriven weighting mechanism is used to combine members in the model set, rendering the resulting mixture of GPTS models owns a rich modeling capacity to cover complex temporal structures that may appear in future data.
Lastly, our method has connections to regimeswitching Markov or statespace model based approaches for nonstationary time series vasas2007two (); boys2000detecting (); takeuchi2006unifying (), online metric learning methods in e.g., zhong2017slmoml (), and deep neural networks (DNN) based sequence data representation learning methods sak2015learning (); jurtz2017introduction (); jia2019towards (); ma2016end (); gao2017video (). The link between statespace models and GP is can be found in solin2014explicit (); hartikainen2010kalman (); sarkka2013spatiotemporal (); reece2010introduction (). The connection between GP and neural networks traces back to Neal’s work in neal1996priors (), which shows that certain types of neural networks with one hidden layer of infinite size are identical to a GP model with a specific type of covariance function. A recent study on relationships between DNN and GP can be found in lee2017deep (). Despite of the intrinsic theoretical link between GP and DNN, from the application point of view, there are several basic differences between GP and DNN. Specifically, the former belongs to the class of nonparametric modeling approaches, while the latter is usually parametric. The former can produce a point estimate as well as its uncertainty measure, while the latter usually can only yield a point estimate. The former is more attractive for adding side information, a property we adopt here to develop the INTEL algorithm, due to its Bayesian nature; while the latter is more attractive for dealing with larger datasets that own nonlocal smoothing features in e.g., natural languages and speeches jia2019towards (); sak2015learning ().
5 Experiments
We conducted extensive experiments to evaluate our INTEL algorithm. In subsection 5.1, we present an experiment conducted to validate the efficacy of the model initialization procedure presented in subsection 3.1. In subsection 5.2, we show results about its performance for online outlier and change point detection. A quantitative evaluation of its prediction performance is presented in subsection 5.3. Finally, in subsection 5.4, we tested its robustness when working under undesirable cases.
5.1 An experiment for testing the model initialization procedure
We check the efficacy of our method for initializing ’s, , which is presented in subsection 3.1. We use a CPU usage dataset collected from a server in Amazon’s east coast data center ahmad2017unsupervised (), as shown in Figure 2. In this dataset, a change point appears at around the 3,000th time step (which is exactly the 2,971st time step). After that, both the mean and the output scale of the dataset change significantly. We take the first 200 data points as the historical dataset used for initializing hyperparameters of . Now let compare two initialization settings for .
In the first setting, only one variant of is used. The hyperparameter values of are the same as that of , except that . As shown in Figure 2, the real situation is that the output scale of the data decreases markedly after the 2,971st time step. Therefore, the hyperparameter setting of is more suitable for the temporal structure of the data after the 2,971th time step. We wonder if our INTEL algorithm can automatically capture this regime shift by increasing the weight of at that time. The answer is yes, as shown in Figure 3. It is shown that the weight of rises rapidly, while that of decreases abruptly, at the 2,971st time step. This result confirms the INTEL algorithm’s capability for instant temporal structure learning.
In the second initialization setting, we maintain the same and for use as in the first initialization setting, while adding two new lowequality models, and , for which we set and , respectively. The values of the other hyperparameters of and are the same as that of and . Here the goal is to check if the inclusive of such lowquality models can lead to the failure of INTEL. The result is shown in Figure 4. Comparing Figure 4 with Figure 3, one can see that after adding the lowquality models, the prediction performance of INTEL is almost unchanged. That says, for this case, the INTEL algorithm is robust to a model set that contains lowquality models. As shown in the middle panel of Figure 4, the reason for this robustness is that the INTEL algorithm only assigns tiny weights to and almost all the time, except at and , where there is an outlier or change point declared.
5.2 Experiments for online outliers / change points detection
Although the central aim of the INTEL algorithm is to do online prediction, we are also interested in whether it can declare outliers and change points in the right way. We used a bunch of real datasets to do the test. We also included the fault bucket algorithm of osborne2011machine (), which is GPTS modelbased, and the BOCPD algorithm of turner2009adaptive (), which is not GPTS modelbased, but Bayesianbased, as benchmark methods for comparison. Except for providing anomaly detections, the fault bucket algorithm and INTEL also do onestepahead prediction. Every dataset is preprocessed by a data normalization operation. The normalized dataset has mean zero and standard error 1. For each algorithm, we selected the same portion of the dataset as the historical data used for hyperparameter initialization. For the INTEL algorithm, we used 8 candidate models, each corresponding to a combination of values for the hyperparameters and . There are two candidate values for each hyperparameter, e.g., for , one candidate value is , and the other is , where the value of is selected based on prior knowledge. For example, for the CPU usage data case mentioned above in subsection 5.1, we set for to describe our prior knowledge that the observation amplitude will decrease during some period. In subsection 5.4, we tested the robustness of INTEL when adopting inaccurate prior knowledge by setting an inappropriate value.
Welllog dataset
The welllog dataset is widely used in the context of change point detection turner2009adaptive (); turner2012gaussian (). It is a timeseries consisting of 4,050 measurements of nuclear magnetic response, which are made during the drilling of a well garnett2010sequential (). The change here has a clear physical meaning, namely a transition between different strata of rock. The result is plotted in Figure 5. It is shown that INTEL and BOCPD successfully report all major regime transitions, while the fault bucket algorithm fails to adapt to regime shifts and reports too many false anomaly detections. We checked the reason for the failure of the fault bucket algorithm and found that its performance is highly dependent on the selected data points used for hyperparameter initialization. If we use the first 150 data points, which contain rougher temporal structures during the first 50 time steps, for hyperparameter initialization, then the fault bucket algorithm performs much better, as shown in Figure 6. This is due to that the presence of the first 50 data points in the training dataset renders the model capable of capturing rougher temporal structures. Right now the fault bucket algorithm reports much less false anomaly detections, while it has missdetections. Further, by comparing the bottom panel of Figure 5 with Figure 6, one can see that INTEL is still better than the fault bucket algorithm in terms of prediction performance since the former yields lowervalued variances and thus more certain predictions.
An ECG dataset
This dataset is obtained by injecting an artificial outlier into a piece of real electrocardiogram (ECG) timeseries. It comprises 235 observations. The outlier appears at the 62nd time step followed by a saccade that happens from about the 130th time step to about the 145th time step. The onestepahead prediction result is shown in Figure 7. One can see that all algorithms considered have successfully detected the true outlier. Our INTEL algorithm is shown to be robust to both the outlier and the saccade, while the fault bucket algorithm fails to yield accurate predictions during the saccade period.
A Numenta benchmark data
This dataset is included in the Numenta Anomaly Benchmark lavin2015evaluating (). It is characterized by a pattern of repeated amplitude changes, see Figure 8, hence is suitable for testing change points detection algorithms. We run fault bucket, BOCPD, and our INTEL algorithms, respectively, to process this dataset. The result is visually plotted in Figure 8. As is shown, all algorithms considered here have successfully detected the true change points, while both fault bucket and BOCPD give some false anomaly detections. In contrast, INTEL gives no false detection for this dataset. Besides, compared with the fault bucket algorithm, INTEL gives tighter SD bounds in its predictions.
Fish killer data
This dataset is a smooth timeseries with some rapid changes near the fish kills. We selected the first 10,000 data points for use in comparing the fault bucket algorithm, BOCPD, and INTEL. The result is depicted in Figure 9. It is shown that, for this dataset, fault bucket and INTEL give comparable onestepahead predictions. BOCPD performs unsatisfactorily since it reports many false anomaly detections.
An industry portfolio data
We also considered the “30 industry portfolios” dataset xuan2007modeling (). We selected a portion of the first timeseries included in that dataset, which records daily returns of an industryspecific portfolio beginning at the year of 1963. The experimental result is plotted in Figure 10, from which one can see that our INTEL algorithm detects all change points accurately without any false detection. The fault bucket algorithm fails to yield accurate change point detections and observation predictions after the first regime shift. BOCPD successfully detects all change points, while it also declares many false detections.
5.3 Experiments for prediction performance evaluation
We tested the onestepahead prediction performance of our INTEL algorithm. Except for the fault bucket algorithm osborne2011machine (), we also included a simplified version of the INTEL algorithm, termed SINTEL here. In SINTEL, only the template model is used, while the operations for anomaly detection, training set formation and hyperparameter value adaptation maintain the same as that in INTEL. The BOCPD algorithm used in subsection 5.2 is not involved here since it is only capable of detecting change points but incapable of making realtime predictions. Except those used in subsection 5.2, additional timeseries datasets are considered here, including:

the Nile dataset, which has been widely used in the timeseries literature;

the Intel lab data, which was collected from 54 sensors deployed in the Intel Berkeley Research lab between Feb. 28th and April 5th, 2004. We only used a small while representative fragment of this dataset.

the NYC taxi data, which records the number of NYC taxi passengers. Each observation in this dataset denotes the total number of taxi passengers during 30 minutes. Five regime shifts happen during the NYC marathon, Thanksgiving, Christmas, New Years day, and a snow storm, respectively.

the temperature (temp.) sensor data of an internal component of a large, industrial machine. This dataset has at least two outlier observations. One originates from a planned shutdown of the machine, and the other one is a catastrophic failure of the machine.

A real time traffic data from the twin cities metro area in Minnesota of the U.S.. Included metrics include occupancy, speed, and travel time from specific sensors, while we only present the result associated with the metric speed, due to the limitation in space.
The performance metrics in use are the negative log likelihood (NLL), the mean absolute error (MAE), and the mean square error (MSE). For every metric, the smaller is its value, the better the prediction performance it stands for. We list the onestepahead prediction result measured with these metrics in Tables 13.
As is shown in Tables 13, for the first 8 of these 11 datasets, INTEL outperforms the fault bucket algorithm osborne2011machine () in terms of all metrics considered. For the last dataset, INTEL performs slightly better than the fault bucket algorithm in terms of MAE and MSE, while the fault bucket algorithm beats INTEL slightly in terms of NLL. It is only for the NYC taxi dataset and the temp. sensor dataset that the fault bucket algorithm gives significantly better prediction than INTEL. We plot these two datasets in Figure 11. As is shown, there is no clear regime shift in them. It indicates that the advantage of INTEL over the fault bucket algorithm mainly comes from its capability to handle change points.
By comparing SINTEL and INTEL according to results as shown in Tables 13, one can see that INTEL outperforms SINTEL markedly in most cases. SINTEL only provides slightly better prediction than INTEL in terms of MAE for the first two datasets, and in terms of MSE for the first dataset. The above result demonstrates the advantage of using multiple models compared with using only one model.
Fault bucket  SINTEL  INTEL  

CPU usage  3.5181  0.7785  0.0972 
welllog  46.3338  0.0950  0.0947 
ECG  24.6629  42.2999  0.1459 
Numenta  1.0409  0.4663  1.4887 
fish killer  1.0336  6.2875  1.6867 
portfolio  60,200  8.8275  3.7451 
Nile data  129.4349  22.8654  2.2453 
Intel lab  0.8593  0.6276  1.2252 
NYC taxi  0.4434  5.6331  0.0129 
temp. sensor  0.7499  41.3108  0.2588 
traffic  1.3471  1.3525  1.3749 
Fault bucket  SINTEL  INTEL  

CPU usage  1.9540  0.1859  0.1867 
welllog  2.4985  0.2078  0.2129 
ECG  0.4650  0.3786  0.1387 
Numenta  0.0375  0.0527  0.0525 
fish killer  0.0496  0.0478  0.0220 
portfolio  2.3378  0.0034  0.0012 
Nile  1.8076  0.6611  0.6111 
Intel lab  0.0833  0.0628  0.0485 
NYC taxi  0.1102  0.3761  0.1943 
temp. sensor  0.0894  0.4028  0.1481 
traffic  0.6716  0.6725  0.6500 
Fault bucket  SINTEL  INTEL  
CPU usage  3.8726  0.0562  0.0564 
welllog  6.3093  0.0707  0.0706 
ECG  1.5398  1.0580  0.0404 
Numenta  0.0044  0.0057  0.0057 
fish killer  0.0129  0.0208  0.0182 
portfolio  5.4651  
Nile  3.5819  0.6604  0.5535 
Intel lab  0.0073  0.0059  0.0038 
NYC taxi  0.0184  0.2045  0.0635 
temp. sensor  0.0101  0.3342  0.0313 
traffic  0.7103  0.7263  0.6903 
5.4 Robustness test
We tested the robustness of our INTEL algorithm in three cases as below:

The prior knowledge used for initializing is inaccurate;

The historical dataset used for initializing is not clean, namely, there is at least one outlier or change point included in it.

False detections of anomalies exist during the sequential prediction process.
For case 1 listed above, we modified the initialization setting used in subsection 5.1 for processing the CPU usage dataset. Specifically, is removed from the model set, while , and remain. Recall that in and , we have and , respectively. We now adopt an inaccurate prior knowledge that the observation amplitude will increase during some period but never decrease, while the fact is that it will decrease significantly after . The performance of INTEL under this setting is plotted in Figure 12. We see that the INTEL algorithm fails to capture one important aspect of the temporal structure, namely a significantly lowered amplitude, in the data after the regime shift at . However, it still gives accurate mean predictions for observations after .
NLL  MAE  MSE 
0.5545  0.2126  0.0708 
NLL  MAE  MSE 
0.5008  0.2150  0.0756 
For case 2, we restudied the welllog dataset. In the result plotted in Figure 5, data points between and are used for hyperparameter initialization for , since there is no anomaly observation within them. We now use the first 200 data points for hyperparameter initialization for . All the other experimental settings are kept the same as that used for plotting Figure 5. Now anomalies exist in the training dataset (there are at least three anomalies in the first 50 data points as reported by turner2012gaussian ()). Now the performance of the proposed INTEL algorithm is plotted in Figure 13. Comparing Figure 13 with the bottom panel of Figure 5, we see that INTEL fails to detect some change points now, and gives a broader SD bounds, while it still provides accurate mean predictions. The result presented in Table 4 reconfirms the above observation. Comparing the result listed in Table 4 with that shown in Tables 13, we see that the performance of INTEL is almost unchanged in terms of MAE and MSE. Its performance is degraded based only on the metric NLL. This is because the metrics MAE and MSE only describe the accuracy of the mean prediction, while NLL covers information on the uncertainty measure.
Finally, for case 3, we tested the performance of INTEL when false anomaly detections are present for the welllog data. As is shown in Figure 14, even in case of false anomaly detections being present, the INTEL algorithm can still give accurate predictions for most of the observations. A quantitative evaluation of the prediction performance associated with Figure 14 is shown in Table 5. Comparing Table 5 with that shown in Tables 13, again, we see that the performance of INTEL is degraded based only on the metric NLL.
6 Concluding remarks and future works
In this paper, we addressed the problem of SOP by unleashing the flexibility and interpretability of the GPTS model together with harnessing prior knowledge. Specifically, we proposed a novel algorithm design termed INTEL and demonstrated its performance using extensive real dataset experiments. Experimental results show that the INTEL algorithm is a highly efficient solution to the problem of SOP in the presence of outliers and change points. As an online prediction algorithm, INTEL is also demonstrated to be a qualified online anomaly detection method. The biggest feature of INTEL is that it can instantly capture the pattern of the new regime, without the need to do model training, upon a change point is declared. Further, the INTEL algorithm allows closedform inference and prediction. All operations to implement this algorithm are deterministic and analytically tractable.
We did robustness tests to the INTEL algorithm, investigating its performance under three undesirable cases, namely, when the prior knowledge it adopts is inaccurate, when the historical data used for template model hyperparameter initialization is not clean and when false anomaly detections exist during the sequential prediction process. An interesting finding is that, under these cases, although the INTEL’s prediction performance is degraded in terms of the metric NLL, its prediction performance in terms of MAE and MSE maintains. That says our INTEL algorithm can still provide accurate point predictions in our test cases.
Currently, the INTEL algorithm can only do onestepahead prediction, while, in principle, it can be extended naturally to do multiplestepahead prediction, which deserves future investigation. It is also important to extend the INTEL algorithm to handle multivariate timeseries data. In the current version of the INTEL algorithm, each candidate GPTS model uses a Matern 5/2 kernel function. It is possible to let these candidate models employ different types of kernel functions and then check its performance.
References
 A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran, “IBM infosphere streams for scalable, realtime, intelligent transportation services,” in Proc. of the 2010 ACM SIGMOD. ACM, 2010, pp. 1093–1104.
 A. C. Harvey, Forecasting, structural time series models and the Kalman filter, Cambridge university press, 1990.
 E. Condon, A. He, and M. Cukier, “Analysis of computer security incident data using time series models,” in 2008 19th Int’l Symp. on Software Reliability Engineering (ISSRE). IEEE, 2008, pp. 77–86.
 H. Eerens, D. Haesen, F. Rembold, F. Urbano, C. Tote, and L. Bydekerke, “Image time series processing for agriculture monitoring,” Environmental Modelling & Software, vol. 53, pp. 154–162, 2014.
 D. A. Rasmussen, O. Ratmann, and K. Koelle, “Inference for nonlinear epidemiological models using genealogies and time series,” PLoS computational biology, vol. 7, no. 8, pp. 1–11, 2011.
 B. Liu, Z. Xu, J. Chen, and G. Yang, “Toward reliable data analysis for internet of things by bayesian dynamic modeling and computation,” in IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP). IEEE, 2015, pp. 1027–1031.
 B. Liu and S. Cheng, “State space modelbased trust evaluation over wireless sensor networks: an iterative particle filter approach,” The Journal of Engineering, vol. 2017, no. 4, pp. 101–109, 2017.
 J. Wang and B. Liu, “Online faulttolerant dynamic event region detection in sensor networks via trust model,” in IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2017, pp. 1–6.
 X. Xuan and K. Murphy, “Modeling changing dependency structure in multivariate time series,” in Proc. of the 24th Int’l Conf. on Machine Learning (ICML). ACM, 2007, pp. 1055–1062.
 D. Barry and J. A. Hartigan, “A Bayesian analysis for change point problems,” Journal of the American Statistical Association, vol. 88, no. 421, pp. 309–319, 1993.
 R. P. Adams and D. J. MacKay, “Bayesian online changepoint detection,” arXiv preprint arXiv:0710.3742, 2007.
 R. Turner, Y. Saatci, and C. E. Rasmussen, “Adaptive sequential Bayesian change point detection,” in Temporal Segmentation Workshop at NIPS, 2009.
 P. Fearnhead and Z. Liu, “Online inference for multiple changepoint problems,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 69, no. 4, pp. 589–605, 2007.
 B. K. Ray and R. S. Tsay, “Bayesian methods for changepoint detection in longrange dependent processes,” Journal of Time Series Analysis, vol. 23, no. 6, pp. 687–705, 2002.
 V. Chandola and R. R. Vatsavai, “A Gaussian process based online change detection algorithm for monitoring periodic time series,” in Proc. of the 2011 SIAM Int’l Conf. on Data Mining (ICDM). SIAM, 2011, pp. 95–106.
 Y. Saatçi, R. D. Turner, and C. E. Rasmussen, “Gaussian process change point models,” in ICML, 2010, pp. 927–934.
 K. Yamanishi and J. Takeuchi, “A unifying framework for detecting outliers and change points from nonstationary time series data,” in Proc. of the 8th ACM SIGKDD. ACM, 2002, pp. 676–681.
 R. M. Neal, Bayesian learning for neural networks, vol. 118, Springer Science & Business Media, 2012.
 S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain, “Gaussian processes for timeseries modelling,” Philosophical Trans. of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 371, no. 1984, pp. 1–25, 2013.
 M. A. Osborne, Stephen J. Roberts, A. Rogers, and N. R. Jennings, “Realtime information processing of environmental sensor network data using Bayesian gaussian processes,” ACM Trans. on Sensor Networks (TOSN), vol. 9, no. 1, pp. 1, 2012.
 J. Vanhatalo, P. Jylänki, and A. Vehtari, “Gaussian process regression with studentt likelihood,” in Advances in neural information processing systems, 2009, pp. 1910–1918.
 M. Osborne, R. Garnett, K. Swersky, and N. De Freitas, “A machine learning approach to pattern detection and prediction for environmental monitoring and water sustainability,” in ICML Workshop on Machine Learning for Global Challenges, 2011.
 R. Garnett, M. A. Osborne, S. Reece, A. Rogers, and S. J. Roberts, “Sequential Bayesian prediction in the presence of changepoints and faults,” The Computer Journal, vol. 53, no. 9, pp. 1430–1446, 2010.
 J. Geweke, “Bayesian treatment of the independent studentt linear model,” Journal of applied econometrics, vol. 8, no. S1, pp. S19–S40, 1993.
 M. E. Tipping and N. D. Lawrence, “Variational inference for studentt models: Robust bayesian interpolation and generalised component analysis,” Neurocomputing, vol. 69, no. 13, pp. 123–141, 2005.
 G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002.
 C. Williams and C. E. Rasmussen, Gaussian processes for machine learning, MIT Press Cambridge, MA, 2006.
 M. L. Stein, Interpolation of spatial data: some theory for Kriging, Springer Science & Business Media, 2012.
 C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (GPML) toolbox,” Journal of machine learning research, vol. 11, pp. 3011–3015, 2010.
 R. MurraySmith and A. Girard, “Gaussian process priors with arma noise models,” in Irish Signals and Systems Conference, Maynooth. Citeseer, 2001, pp. 147–152.
 R. D. Turner, Gaussian processes for state space models and change point detection, Ph.D. thesis, University of Cambridge, 2012.
 Y. Dai and B. Liu, “Robust video object tracking via Bayesian model averagingbased feature fusion,” Optical Engineering, vol. 55, no. 8, pp. 1–11, 2016.
 B. Liu, “Instantaneous frequency tracking under model uncertainty via dynamic model averaging and particle filtering,” IEEE Trans. on Wireless Communications, vol. 10, no. 6, pp. 1810–1819, 2011.
 B. Liu, “Robust particle filter by dynamic averaging of multiple noise models,” in Proc. of the 42nd IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2017, pp. 4034–4038.
 Y. Cao and D. J. Fleet, “Generalized product of experts for automatic and principled fusion of gaussian process predictions,” in Automating the Learning Pipeline workshop at NIPS, 2014.
 H. J. Escalante, “A comparison of outlier detection algorithms for machine learning,” in Proc. of the International Conference on Communications in Computing, 2005, pp. 228–237.
 K. Vasas, P. Elek, and M., “A twostate regime switching autoregressive model with an application to river flow analysis,” Journal of Statistical Planning and Inference, vol. 137, no. 10, pp. 3113–3126, 2007.
 R. J. Boys, D. A. Henderson, and D. J. Wilkinson, “Detecting homogeneous segments in DNA sequences by using hidden Markov models,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 49, no. 2, pp. 269–285, 2000.
 J. Takeuchi and K. Yamanishi, “A unifying framework for detecting outliers and change points from time series,” IEEE Trans. on Knowledge and Data Engineering, vol. 18, no. 4, pp. 482–492, 2006.
 G. Zhong, Y. Zheng, S. Li, and Y. Fu, “SLMOML: online metric learning with global convergence,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2460–2472, 2017.
 H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, “Learning acoustic frame labeling for speech recognition with recurrent neural networks,” in IEEE Int’l Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4280–4284.
 V. I. Jurtz, A. R. Johansen, M. Nielsen, Jose J. Almagro A., H. Nielsen, C. K. Sønderby, O. Winther, and S. K. Sønderby, “An introduction to deep learning on biological sequence data: examples and solutions,” Bioinformatics, vol. 33, no. 22, pp. 3685–3690, 2017.
 X. Jia, S. Li, H. Zhao, S. Kim, and V. Kumar, “Towards robust and discriminative sequential data learning: When and how to perform adversarial training?,” in Proc. of the 25th ACM Int’l Conf. on Knowledge Discovery & Data Mining (SIGKDD), 2019, pp. 1665–1673.
 X. Ma and E. Hovy, “Endtoend sequence labeling via bidirectional LSTMCNNsCRF,” in Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016, pp. 1064–1074.
 L. Gao, Z. Guo, H. Zhang, X. Xu, and H. Shen, “Video captioning with attentionbased LSTM and semantic consistency,” IEEE Trans. on Multimedia, vol. 19, no. 9, pp. 2045–2055, 2017.
 A. Solin and S. Särkkä, “Explicit link between periodic covariance functions and state space models,” in Artificial Intelligence and Statistics, 2014, pp. 904–912.
 J. Hartikainen and S. Särkkä, “Kalman filtering and smoothing solutions to temporal Gaussian process regression models,” in IEEE Int’l Workshop on Machine Learning for Signal Processing. IEEE, 2010, pp. 379–384.
 S. Sarkka, A. Solin, and J. Hartikainen, “Spatiotemporal learning via infinitedimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering,” IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 51–61, 2013.
 Steven Reece and Stephen Roberts, “An introduction to Gaussian processes for the Kalman filter expert,” in 13th Int’l Conf. on Information Fusion. IEEE, 2010, pp. 1–9.
 Radford M Neal, “Priors for infinite networks,” in Bayesian Learning for Neural Networks, pp. 29–53. Springer, 1996.
 J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. SohlDickstein, “Deep neural networks as Gaussian processes,” in Proc. of Int’l Conf. on Learning Representations (ICLR), 2018, pp. 1–17.
 S. Ahmad, A. Lavin, S. Purdy, and Z. Agha, “Unsupervised realtime anomaly detection for streaming data,” Neurocomputing, vol. 262, pp. 134–147, 2017.
 A. Lavin and S. Ahmad, “Evaluating realtime anomaly detection algorithms–the numenta anomaly benchmark,” in 2015 IEEE 14th Int’l Conf. on Machine Learning and Applications (ICMLA). IEEE, 2015, pp. 38–44.