Low Latency Anomaly Detection and Bayesian Network Prediction of Anomaly Likelihood
Abstract
We develop a supervised machine learning model that detects anomalies in systems in real time. Our model processes unbounded streams of data into time series which then form the basis of a lowlatency anomaly detection model. Moreover, we extend our preliminary goal of just anomaly detection to simultaneous anomaly prediction. We approach this very challenging problem by developing a Bayesian Network framework that captures the information about the parameters of the lagged regressors calibrated in the first part of our approach and use this structure to learn local conditional probability distributions.
Low Latency Anomaly Detection and Bayesian Network Prediction of Anomaly Likelihood
Derek Farren dfarren@stanford.edu Thai T. Pham^{†}^{†}thanks: indicates equal contribution. thaipham@stanford.edu Marco AlbanHidalgo marcoal@stanford.edu
1 Introduction
1.1 Problem Definition
Given a very large dataset of time series (presumably with some underlying, nonstationary time dependence between each other) up to time , predict whether a time series will have an anomaly (defined formally in section 4) at time , moreover when the data stream arrives at time detect anomalies conclusively using a low latency model.
1.2 Overview
The underlying motivation of this piece is the usefulness of anomaly prediction in mission critical components. A fast anomaly detection platform can by itself be extremely useful for ensuring the reliability of a system, as a result, this problem has been studied extensively by academics and industry experts. We defer to the authors in [2] for a literature survey of anomaly detection techniques.
In this work, we extend the scope of our approach and goal to tackle anomaly prediction. Most work in the literature has narrowed their focus on anomaly detection because it is in and of itself a very challenging problem. However, the benefits of an even slight temporal advantage in prediction can have huge impacts on the performance of a system. The contribution of this work can be categorized in two components:

An novel approach to anomaly prediction by modeling a Bayesian Network structured based on the coefficients of the lag regressors in the anomaly detection system, coupled with Bayesian Parameter learning to model the conditional dependency structure between the time series.
2 Dataset
Our (nonpublic) data set consists of time series , sampled every minute for the past year with some underlying nonstationary dependence between each other and human labeled anomaly events by domain experts (i.e., the owner of a time series flagged the time series at time point as an anomaly. Although the dataset has some time series source information, our anomaly detection model makes no assumptions about the interindependence of the time series to allow us to solve the more general problem of having no apriori knowledge.
To illustrate the difficulty of the problem we are trying to solve, in Figure 1 we show a subsample of 300 points of one of the time series with labels anomalies denoted by red dots. Clearly the problem involves latent relationships between time series; observing one time series in isolation is not enough, even for a human, to determine whether a time point is anomalous.
Another challenge that we face is that the time series have vastly different probability distributions. To show this, we normalize the time series and approximate the probability distribution of each time series using kernel density estimation (with a Gaussian kernel and hyperparameter search on the bandwidth). Figure 2 shows the estimated probability densities of the 19 most important time series.
2.1 Validating Dependence Assumptions
A foundational tenet of our model is that our time series have a latent dependence structure between each other. To validate this assumption, for each pair of time series and (from a selected set of 19), we use kernel density estimation to approximate the marginal distributions of the times series, and , as well as the joint probability distribution . Then we compute the mutual information between the approximate distributions:
Note that mutual information is a measure of the inherent dependence expressed in the joint distribution of and relative to the joint distribution of and under the assumption of independence, in a concrete sense if and only if X and Y are independent random variables. Figure 3 shows a grid plot of the mutual information between 19 selected time series, the coordinate of the figure denotes . As can be seen from the figure below, the time series are highly dependent between each other, moreover, the dependence structure is not uniform across pairs. Although we could use a model like this to try to determine latent dependence structure, this method is not scalable as approximating joint probability distributions can be computationally intensive. In section 3 we propose a method that will capture latent dependence structures in a computationally feasible way.
3 Approach  Detection Model
Combining previous techniques for anomaly detection [3, 9, 2] and time series modeling [12, 11, 8, 7], we propose an auto regressive distributed lag model ^{1}^{1}1A distributedlag model is a dynamic model in which the effect of a regressor on occurs over time rather than all at once. with regularization, that is horizontally scalable.
We first define the following:

: number of time series

: value of time series at timestamp

: the window size used in the auto regression

: the length of the time series
Our model is
where is the prediction of model at time , is the least squares coefficient vector, and is the model’s error on time series at time . In other words, we use all past information from all time series where , for the prediction of the current value in each time series .
Now consider concatenating all the predictions for time series (there are predictions). For each time series , we can write the regression predictions as varies in the matrix form:
where are the vector of predictions and error term, and . X is the matrix of concatenated lag regressors.
The model is trained using ordinary least squares method with regularization. In other words, for each time series , we find the parameter vector of length to minimize
(1) 
Here, is the error function and is the hyperparameter which controls the severity of the penalty on complex models. We do this for each of time series, so we will end up applying ordinary least squares times.
This model offers a sparse regression [13] that selects only the time series that have predictive value and identifies the temporal correlation between these time series. The bias and size of the resulting model can be set by modifying the regularization parameter . This is crucial since the smaller the model is, the less coefficients it has, and thus, the less data it needs to make predictions. Since retrieving data over a network has latency, and low latency when detecting anomalies is a priority in this algorithm, the size of the data to be retrieved should be as small as possible without greatly affecting the model’s accuracy.
Finally, we store the estimated standard error corresponding to each of the time series to be used later for anomaly detection:
Here, is the number of nonzero coefficients. This estimator is motivated by the unbiased estimator for the error variance in a simple linear regression without regularization. The performance of its can be seen in Reid, Tibshirani, and Friedman [10].
4 Detection Model for RealTime Applications
The model described in Section 3 makes real time predictions of all time series in the system as it streams new data. In order to detect anomalies, we compared the prediction the model does at current time , , with its real value coming from the data stream, by running a ttest with the tstatistic defined in equation (2):
(2) 
This test gives us a pvalue we use to compare against a given threshold pvaluethreshold, which is a parameter of the model. If the pvalue we get with tstatistic (2) is smaller than pvaluethreshold, then we are in presence of an anomaly and an alert should be raised.
The parameter pvaluethreshold is probably the most important parameter in the model because it directly specifies how sensitive the model is in detecting anomalies. Usually its value is very low. A value of around 1e5 would roughly detect an anomaly every 100,000 minutes.
It’s important to note that this Detection procedure has very low latency because the model is sparse and it only needs to make few calculations in the linear combination. This is a key feature that puts this model ahead of lower latency, but accurate, models like DPCA [3].
There is another aspect of real life systems that has not been covered so far. Since in real life systems, random and short spikes can be normal and not anomalies, we use a smoothed anomaly definition as opposed to the naive comparison. Concretely, since these spikes may result in false positive classification of anomalies instead of testing for an anomaly with just the current datapoint , we test for an anomaly with a sample of datapoints. This sample of size for time series is defined as
Our model checks the probability that comes from a normal distribution with mean by running a ttest with the tstatistic defined by
Here, is the sample standard error.
5 Detection Model  Preliminary Results
5.1 Baseline
As a baseline to highlight the difficulty of the problem we are trying to solve, we implement and test a model that characterizes the time series as multivariate Gaussian. More concretely, we build a generative model where time series are generated from an dimensional multivariate Gaussian distribution and then set a threshold of the probability of a time series deviating from its mean. We can visualize this method in Figure 4.
5.2 Oracle
Again, highlighting the difficulty of the problem that we are trying to solve, we present an oracle where we use the test set as the training set and calculate the “test error,” since we know that the test error should be bounded by the training error, this oracle will allow us to understand the difficulty of our problem definition. The results shown in section III, compare the error of the oracle formulation above.
5.3 Performance Comparison
We compare our model’s performance with our baseline model (multivariate Gaussian [5]) and with the state of the art Dynamic Principal Components model (DPCA) [3]. We run the models on a year worth of server logs data, aggregated as time series. As noted in section 2, the data are labeled by humans (with domain knowledge) as anomaly or not anomaly. We use an auto regression window size of one week. The output of each model by minute (anomaly or not anomaly class) is compared against the real label of the data. Finally, we calculate the score to evaluate the performance of each model.
Algorithm  Score 

Gaussian Model (Baseline)  
DPCA  
Proposed Algorithm 
Table 1 shows that even though our algorithm’s accuracy is below DPCA’s accuracy, our preliminary model is able to achieve relatively high performance.
Algorithm  latency (millsecs) 

Gaussian Model (Baseline)  
DPCA  
Proposed Algorithm 
Table 2 shows that our model is faster than the very accurate DPCA. It is also faster than the Gaussian model, which is explained by the fact that our model is sparse and does not need to load all the data in the autoregression window, just the data points that are highly correlated.
Since low latency is a crucial component in real time applications, we believe our model’s combination between good accuracy and extremely low latency is very powerful.
6 Bayesian Network Approach For Characterizing Anomaly Likelihood
6.1 Nodes
The second part of this piece deals with building a Bayesian Network to characterize the probability of an anomaly. Recall that for every time series we run a distributed lag regression (section 3) where the regressors are lag variables of and every other time series . The model for time series is parametrized by a vector of coefficients on the regressors (the regularization term ensures that the majority of these weights are set to zero).
We denote by the vector of nonzero weights for time series , and by the nonzero weight component for time series that corresponds to lag regressors of time series with a lag of (as in Section 3)
Similarly, let correspond to the lag variables with nonzero weights for time series regressing on lag regressors of time series where the lag for that regressor is . Let the length of this vector of variables be .
The random variables in our network include: for ; for ; and for , , and .
Note that although there may be overlap between and for , our Bayesian network considers the set of all these variables.
6.2 Domain
The lag variables represent the value of the lagged time series, their domain is continuous. The variables take values representing the presence of an anomaly at time and lastly the variables denote the how much our prediction deviates from the true value of the time series (See section 6.4). We can visualize the Bayes Net in Figure 5.
6.3 Factors
The structure of the Bayesian Network is defined by adding, for each , an edge from node to node for , , and an edge from .
6.4 Theoretical Reasoning Underlying Bayesian Network Model
Fan and Li [4] show that under the approximation:
The unconstrained formulation of the convex optimization problem as Equation (1) can be solved by iterating:
Where is the estimate of . Moreover, Fan and Li [4] draw from the theory of kstep estimators in [1] which proves that, with a good starting parameter, a onstep iteration can be as efficient as the penalized maximum likelihood estimator, when the NewtonRapshon algorithm is used.
We have
Thus, there is a direct relation between and . As determines the possibility of anomaly, the derived relation gives rise to the appearance of the node ’s in the Bayesian network.
6.5 Challenges
The generative model presented above proved to have many challenges. For concreteness, we highlight some of them.
Our initial attempt was discretizing the continuous random variables and using Maximum Likelihood method, an approach which is sometimes used in literature ([6] p. 186). Note however that discretization provides a tradeoff between accuracy of the approximation and cost of the computation ([6] p. 607). Work done using this approach proved to be nonviable. For concreteness, note that the average number of parents (’s) for each time series are . Even for a coarse discretization scheme of 10 bins, each local CPD table would have entries, even with the decomposibility of factors in Bayesian Networks, the amount of data needed to accurately estimate the maximum likelihood (not to mention the computational complexity of collecting the sufficient statistics) would be intractable. Even if we could estimate the local maximum likelihood, the complexity of many inference algorithms is exponential in the size of the domain of the Conditional Probability Distributions (e.g. Variable Elimination). One possible approach that we are exploring is tuning the hyperparameter of the cost of regularization term to restrict the number of variables in each time series to be used in the Bayesian network.
Another approach we are exploring is using a Hybrid Model, which contains both continuous and discrete random variables. This structure in and of itself presents many challenges (see, e.g., [6], chapter 14). In parcticular, we can show that inference on this class of networks is NPHard, even when the network structure is a polytree ([6] p. 615).
6.6 Practical Approach
Solving the full Bayesian network model is super complicated as discussed above. Thus, we solve a simplified version of this problem. Specifically, we ignore the prior distribution and use the conditional distribution as the posterior distribution. Also, the relation between and is linear. The detailed procedure for characterizing anomalies is provided below. We focus on detecting anomalies for one time series only; other time series in the Bayesian network should work in the similar way.

Step 1: We normalize to be in the interval.
where is the minimum over all observed values of and is the maximum over those values. The constants and make the normalized fall strictly in .

Step 2: We specify the relation between and ’s as a linear Gaussian model. More precisely, we determine the relation
This relation can be rewritten as
which can then be solved by least square method.

Step 3: We specify the probability that given that
then is characterized as , which is anomaly, if and only if this probability is greater than and otherwise. The constant can be chosen using crossvalidation so as to maximize the score.
6.7 Result
In this section we use four time series as our data set in order to characterize anomalies for one of these four time series. The reason we drastically reduce the size of the data is because matrix (as defined in equation (1)) and limiting the size of our data set was the only way to be able to compute this in a single machine. Our data set has the following size:

(30 days of minute data)

(5 days of minute data)

(four time series)
With this data set we generate a matrix of float elements and dimensions rows by columns. This matrix has a size of roughly 17GB (depending on the platform used).
Finally, we obtain and the corresponding score is .
6.8 Error Analysis
This value is certainly not as good as that of DPCA or our original discriminative anomaly detection approach, but the comparison is nuanced since above we only processed one time series with a simplified Bayesian Network (because of run time complexity). Notwithstanding, this result is expected; discriminative models will typically outperform generative models when the relationships expressed by the generative model only approximates the true underlying generative process (particularly for our case where we follow some approximations to simplify the model complexity) in terms of classification error rate.
However, the advantage of our novel approach is that our model is richer. As a generative model, we can make explicit claims about the process that underlies the time series. Future work may included running inference queries on the generative model (after tuning it for better performance) to better understand the underlying process where the data is coming from.
7 Conclusion and Future Work
Equation (1) proposes a loss function composed by an error and an regularization. We think it is worth to test if an error and an regularization performs better. The reason behind this is that an error is more biased against outliers. If our regression is less affected by the anomalies in the training data set, then it will predict anomalies more accurately in the testing data.
Also, the model proposed in section 3 may show false positives when the predictors have an anomaly. This is because after an anomaly exists, it still stays in the data that will be later use as an input in out regression. We did not see this in our experiment, but it is something we would like to test and evaluate.
References
 [1] Bickel, P. J., OneStep Huber Estimates in Linear Models, Journal of the American Statistical Association, 70, 1975.
 [2] Chandola, V., A. Banerjee, and V. Kumar., Anomaly Detection: A Survey, ACM Computing Surveys, 2009.
 [3] Chiang, L. H., R. D. Braatz, and E.L. Russell, Fault Detection and Diagnosis in Industrial Systems, Springer Science and Business Media, 2001.
 [4] Fan, J., and R. Li, Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, Princeton, 2001.
 [5] Farren, D., Predicting retail website anomalies using Twitter data, Stanford, 2011.
 [6] Koller D. and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.
 [7] Lutkepohl, H., Vector Autoregressions, unpublished manuscript, Institut fÃ¼r Statistik und Okonometrie, HumboldtUniversitÃ¤t zu Berlin, 1999.
 [8] Mills, T.C., The Econometric Modeling of Financial Time Series, Second Edition, Cambridge University Press, Cambridge, 1999.
 [9] Qiu, H. and Y. Liu, Granger Causality for TimeSeries Anomaly Detection, IEEE International Conference on Data Mining, 2012.
 [10] Reid, S., R. Tibshirani, and J. Friedman, A Study of Error Variance Estimation in Lasso Regression, 2014. http://arxiv.org/pdf/1311.5274v2.pdf
 [11] Runkle, D. E., Vector Autoregressions and Reality, Journal of Business and Economic Statistics, 5 (4), 437442, 1987
 [12] Sims, C.A., Macroeconomics and Reality, Econometrica, 48, 148, 1980.
 [13] Tibshirani, R., Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodogical), Volume 58, Issue 1, 1996, 267288.