Simple Root Cause Analysis by Separable Likelihoods

Maciej Skorski DELL

Root Cause Analysis (RCA) for anomalies is challenging because of the trade-off between accuracy and the ease of explanation required in industrial applications. In this paper we propose a framework for simple and explainable RCA within the Bayesian regime, under a restriction imposed on the predictive posterior: its Hessian at the mode is diagonal, which we refer to as separability. We show that this assumption is satisfied for important base models, including the Multinomial, Dirichlet-Multinomial and Naive Bayes models. To demonstrate the usefulness of the framework, we embed it into a Bayesian network and validate it on a real-world data set of web server error logs.

Keywords: Bayesian Modeling, Anomaly Detection, Root Cause Analysis

1 Introduction

1.1 Anomaly Detection and Root Cause Analysis

In likelihood-based approaches to anomaly detection, a generative probabilistic model of the data is learned and used to evaluate new data records. Anomalies are defined as records with unusually low likelihood. An example is the Z-score for 1-dimensional data, which fits a Gaussian distribution to the data (estimating the mean and variance) and ranks observations by their likelihood; because of its simplicity it is widely used in exploratory data analysis, quality control and other industrial applications. The challenge with real data sets, however, is that they usually contain both continuous and categorical features, as well as interdependencies (in particular, anomaly scores cannot be computed independently for each feature). Interactions and dependencies can be modeled effectively within the modern framework of probabilistic graphical models [KF09]. Further, simplicity can be traded for accuracy by using more sophisticated models as building blocks (for example more exotic base distributions or mixtures); for multivariate counts alone, several models have been proposed [ZZZS17].

This paper concerns the constrained scenario of Root Cause Analysis (RCA), where in addition to identifying anomalies, a readable explanation (in terms of other features) is required. Because the purpose of RCA is to support business decision making, model complexity and accuracy of fit are often traded for explanatory power. This makes some powerful models (such as neural networks) inadequate for this task [SMRE17]. In this paper we show how to build, out of simple building blocks, an anomaly detection system for error logs. While our model is a fairly simple variant of a Bayesian network, the main added value is the proposed paradigm of anomaly contributions, which estimate how different features contribute to the low likelihood of an anomalous data record. These scores can be used directly to perform efficient RCA, which we illustrate with a case study on real data.

1.2 Contribution

Root-Cause Analysis for Separable Posteriors

For the task of anomaly detection the main quantity of interest is the likelihood of a new data record $x$ given the training data $\mathcal{D}$, called the predictive posterior. Assuming a generative process for the data with some parameter $\theta$, the predictive posterior is given by

$$p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta .$$

For the task of RCA it would be helpful to see how the individual components $x_i$ impact the likelihood. This is not possible in general, because posteriors are often not analytically tractable and can only be approximated by sampling. However, in certain cases the predictive posterior, after subtracting its mode, can be factorized into terms depending on the individual components $x_i$. More precisely, suppose that the predictive posterior log-likelihood can be written as

$$\log p(x \mid \mathcal{D}) = \log p(x^{*} \mid \mathcal{D}) + \sum_i \mathrm{Impact}_i(x_i) \tag{1}$$

where $x^{*} = \arg\max_x p(x \mid \mathcal{D})$ is the mode. When the posterior obeys Equation 1 we say it is separable. The term $\mathrm{Impact}_i(x_i)$ can then be thought of as the influence of the $i$-th coordinate of the data point $x$. Moreover, similarly to the notion of the averaged log-likelihood, these influences can be aggregated over several independent observations (e.g. at the daily level).

This formula has the following intuitive meaning: we decompose the deficiency with respect to the mode into individual dimensions. The deficiency is understood as the difference between the log-likelihood of the point and that of the mode, and can be seen as a natural anomaly measure (note that the deficiency is non-positive by the definition of the mode). We stress that it is important to subtract the mode in Equation 1; otherwise we would explain the likelihood of the whole point rather than its abnormal part.
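
As a minimal illustration of this decomposition (not one of the models analyzed in this paper), consider a Gaussian with independent coordinates. Its mode is the mean vector, and the log-likelihood deficiency with respect to the mode splits exactly into per-coordinate terms $-(x_i - \mu_i)^2 / (2\sigma_i^2)$. The function names and the numbers below are hypothetical and chosen only for illustration.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a Gaussian with independent coordinates."""
    return sum(
        -0.5 * math.log(2 * math.pi * s ** 2) - (xi - m) ** 2 / (2 * s ** 2)
        for xi, m, s in zip(x, mu, sigma)
    )

def gaussian_impacts(x, mu, sigma):
    """Per-coordinate share of the log-likelihood deficiency w.r.t. the mode."""
    return [-(xi - m) ** 2 / (2 * s ** 2) for xi, m, s in zip(x, mu, sigma)]

# Hypothetical feature means, standard deviations and one new record.
mu = [10.0, 200.0, 5.0]
sigma = [2.0, 20.0, 1.0]
x = [11.0, 290.0, 5.5]

impacts = gaussian_impacts(x, mu, sigma)
# The mode of the Gaussian is its mean, so the deficiency is measured against mu.
deficiency = gaussian_logpdf(x, mu, sigma) - gaussian_logpdf(mu, mu, sigma)
# Index of the coordinate with the most negative impact.
worst = min(range(len(x)), key=lambda i: impacts[i])
```

In this toy record the second coordinate accounts for almost all of the deficiency, so it would be reported as the main contributor to the anomaly.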

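It is worth mentioning that Equation 1 can be characterized alternatively: the Hessian matrix of the predictive posterior log-likelihood $\log p(x \mid \mathcal{D})$ satisfies

$$\frac{\partial^2 \log p(x \mid \mathcal{D})}{\partial x_i\, \partial x_j} = 0 \quad \text{for } i \neq j,$$

hence it is diagonal at the mode.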

We will show theoretical results on separability for two popular building blocks: the posterior of the Dirichlet-Multinomial distribution, and the posterior of a categorical variable given category-dependent multivariate Bernoulli or Multinomial observations (for example, Naive Bayes text classification on the bag-of-words representation). They will be presented in Section 2; here we sketch a simpler example for illustration. Consider the multinomial model with a total count of $n$ and probabilities $p = (p_1, \ldots, p_k)$. The probability of the counts $x = (x_1, \ldots, x_k)$ equals

$$\Pr[x] = \frac{n!}{\prod_i x_i!} \prod_i p_i^{x_i} .$$

Denote by $\hat{p}_i = x_i / n$ the observed frequencies. The log-likelihood normalized by the number of observations can be approximated by the Stirling formula [Shl14], establishing the connection to the Kullback-Leibler divergence between the observed and real frequencies, respectively $\hat{p}$ and $p$:

$$\frac{1}{n} \log \Pr[x] \approx -\mathrm{KL}(\hat{p} \,\|\, p) = -\sum_i \hat{p}_i \log \frac{\hat{p}_i}{p_i} .$$

It is not hard to see that the normalized logarithm of the probability at the mode (attained at $\hat{p} \approx p$) is approximately $0$. Thus we obtain Equation 1, up to the normalization by $n$, with $\mathrm{Impact}_i = -\hat{p}_i \log (\hat{p}_i / p_i)$.
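
The multinomial example can be checked numerically. The sketch below uses made-up probabilities and counts: it computes the per-category terms $-\hat{p}_i \log(\hat{p}_i / p_i)$, where $\hat{p}_i$ is the observed frequency and $p_i$ the model probability of category $i$, and compares their sum with the exact normalized log-likelihood. The helper names are ours, and all model probabilities are assumed to be strictly positive.

```python
import math

def multinomial_logpmf(counts, probs):
    """Exact multinomial log-probability, using log-gamma for the factorials."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coef + sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def kl_impacts(counts, probs):
    """Per-category terms -p_hat_i * log(p_hat_i / p_i); they sum to -KL(p_hat || p)."""
    n = sum(counts)
    out = []
    for c, p in zip(counts, probs):
        p_hat = c / n
        # A category with zero observed count contributes nothing (0 * log 0 = 0).
        out.append(-p_hat * math.log(p_hat / p) if c > 0 else 0.0)
    return out

# Hypothetical model probabilities and one day of observed counts.
probs = [0.5, 0.3, 0.2]
counts = [300, 200, 500]

impacts = kl_impacts(counts, probs)
normalized_loglik = multinomial_logpmf(counts, probs) / sum(counts)
# Category with the most negative term.
worst = min(range(len(counts)), key=lambda i: impacts[i])
```

Individual terms may be positive (categories observed less often than expected); only their sum is guaranteed to be non-positive. The most negative term points to the category whose frequency increased the most relative to the model, here the third one.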

Case Study on Real Data

We apply our framework to the real data set of error logs from company servers. Each record contains the number of errors for a given zone, project, procedure and the error message. The data was collected for more than 120 consecutive days. A sample of the data set is shown in Table 1.

| row_id | date | region | project_name | procedure_name | error_detail | err_cnt |
|---|---|---|---|---|---|---|
| 15362 | 2018-04-01 | EMEA | GLOBAL_ONLINE_SERVICE | EXPLODE_BUNDLE | Object reference not set to an instance of an … | 3 |
| 29308 | 2018-04-01 | EMEA | YOJEG_API | YOJEG.Controllers.Configurator.Global.Glo… | VerifyError:Invalid option selected | 1 |
| 29222 | 2018-04-01 | EMEA | GDAS Services: CustomerService | NaN | Operation: GetSalesPerson | 26 |
| 3157 | 2018-04-01 | EMEA | GDAS Services: CustomerService | NaN | Operation: GetCustomer Exception: GDAS.Ex… | 77 |
| 7801 | 2018-04-01 | EMEA | YOJEG_API | YOJEG.Controllers.Configurator.Global.Glo… | BuildError:InvalidOrderCodeOrCustomerSet | 5 |

Table 1: Dataset for log errors.

The results will be discussed in Section 3.

1.3 Organization

In Section 2 we derive theoretical results for some separable posteriors. In Section 3 we demonstrate our framework on the real-world data. The paper is concluded in Section 4.

2 Separable Posteriors

2.1 Dirichlet-Multinomial Model

The Dirichlet-Multinomial (DM) model is popular for modeling multivariate counts. As opposed to the plain multinomial model, it models uncertainty in the probability parameter, which helps to account for over-dispersion.


This model is analytically tractable; we utilize the formulas derived in [Tu15].


where and for the sake of concise notation. By using the Stirling approximation we obtain


In order to see separability we apply the well-known Laplace approximation, which is merely a multivariate Gaussian approximation to the predictive posterior (see for example [Deh17] for theoretical justification). Technically, we expand the log-likelihood in a Taylor series around its mode, so that the linear terms disappear (by the first-derivative test, as the mode maximizes the likelihood) and the quadratic terms correspond to the Gaussian terms. In our case, the matrix of second-order terms turns out to be diagonal, hence we obtain separability.

In order to find the mode we need to use the Lagrangian because of the implicit constraint that the counts sum to a fixed total. We extend the likelihood to non-integer frequencies, as the gamma function is well-defined there and the Stirling approximation still works. For some constant (the Lagrange multiplier), the mode satisfies


which implies


By the Taylor expansion around the mode we obtain (note that the linear part disappears and the coefficients of the quadratic part are determined from the first order conditions Equation 6)


in the alternative notation (observed frequency) and (mode frequency) we have

Lemma 1 (Predictive Posterior vs Mode for DM)

Since usually ( collects all occurrences over the training data) we have and we conclude

Corollary 1 (DM Posterior Predictive Impacts)

For the DM model the impact of the $i$-th component in Equation 1 equals



Remark 1 (Intuition)

The major reason for impacts being large and negative is a significant relative increase in frequencies (observed vs posterior) under large volume. Indeed, let then the $i$-th impact equals .
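
The kind of quadratic impact produced by the Laplace argument can be sketched as follows. This is an illustration under our own assumptions rather than a transcription of Corollary 1: `alpha` denotes the posterior Dirichlet pseudo-counts (prior plus training counts), the mode $x^*$ is approximated by the proportional allocation $x_k^* = n\,\alpha_k / \sum_j \alpha_j$, and the curvature $\alpha_k / (x_k^* (x_k^* + \alpha_k))$ comes from the leading Stirling term of the per-category factor $\Gamma(x_k + \alpha_k) / x_k!$. When the pseudo-counts dominate the daily volume, the impact reduces to the chi-square-like quantity $-(x_k - x_k^*)^2 / (2 x_k^*)$.

```python
def dm_quadratic_impacts(x, alpha):
    """Second-order (Laplace-style) impacts for Dirichlet-Multinomial counts.

    Assumes all alpha_k > 0 and a positive total count.
    """
    n = sum(x)
    a = sum(alpha)
    impacts = []
    for xk, ak in zip(x, alpha):
        # Approximate mode: the total count split proportionally to alpha.
        mode_k = n * ak / a
        # Magnitude of the second derivative of the per-category log term.
        curvature = ak / (mode_k * (mode_k + ak))
        impacts.append(-0.5 * curvature * (xk - mode_k) ** 2)
    return impacts

# Hypothetical posterior pseudo-counts and one day of observed counts.
alpha = [6000.0, 3000.0, 1000.0]
x = [8, 4, 8]

impacts = dm_quadratic_impacts(x, alpha)
# Category with the most negative impact.
worst = min(range(len(x)), key=lambda k: impacts[k])
```

In this toy example the third category, whose count rose from an expected 2 to 8, receives by far the most negative impact, in line with the intuition that large relative increases dominate.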

2.2 BNB Model

We prove separability only for Bernoulli Naive Bayes (BNB), as this is the model used in our case study. However, separability is not limited to the Bernoulli variant and can also be proved for Multinomial Naive Bayes.

The BNB model is popular for classification of short text messages. Texts are represented as $|V|$-dimensional boolean vectors, where $V$ is the vocabulary. Each entry $w_i$ is a boolean indicating the occurrence of the $i$-th word in a given text $w$; we will use the notation . The model with a Beta prior (which smooths zero frequencies by assuming extra "pseudocounts" of one for each class-word pair) can be written as

where is the set of classes (categories). Let and be posterior probabilities for word given class and class (estimated from the data). Then we have

Proposition 1 (Predictive Posterior for BNB)

Probability of the class given the vector of words is given by


where the proportionality constant is independent of the class (but depends on $w$).

By taking the logarithm of Equation 11 evaluated at and and subtracting (the unknown constant cancels) we obtain

Lemma 2 (Predictive Posterior vs Mode for BNB)

For the Bernoulli Naive Bayes model, let be the most likely class given the sequence of words . We have


From we immediately obtain the word impact.

Corollary 2 (BNB Posterior Predictive Impact)

For the BNB model the impact of the $i$-th word in Equation 1 equals


where is the actual class.

Remark 2 (Intuition)

The major reason for impacts to be large negative is the presence of class-untypical words (so that ). The effect is stronger with large volume when evaluating averaged likelihoods.
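
A minimal sketch of word impacts for Bernoulli Naive Bayes, assuming add-one smoothing of the class-conditional word probabilities and a tokenized toy corpus. The impact of a word is taken here as the difference between its log-likelihood contribution under the actual class and under the most likely class; the difference of class priors is a separate term. All names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_bnb(docs, labels, vocab):
    """Estimate class priors and smoothed word probabilities (pseudocount 1)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, c in zip(docs, labels):
        # Each word counts at most once per document (Bernoulli model).
        word_counts[c].update(set(words) & set(vocab))
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    probs = {
        c: {w: (word_counts[c][w] + 1) / (class_counts[c] + 2) for w in vocab}
        for c in class_counts
    }
    return priors, probs

def log_word_lik(word, present, c, probs):
    """Log-probability of a word being present or absent under class c."""
    p = probs[c][word]
    return math.log(p if present else 1.0 - p)

def log_joint(words, c, priors, probs, vocab):
    """Unnormalized log-posterior of class c for the given set of words."""
    present = set(words)
    return math.log(priors[c]) + sum(
        log_word_lik(w, w in present, c, probs) for w in vocab
    )

def word_impacts(words, actual, priors, probs, vocab):
    """Per-word log-likelihood gap between the actual and the most likely class."""
    best = max(priors, key=lambda c: log_joint(words, c, priors, probs, vocab))
    present = set(words)
    impacts = {
        w: log_word_lik(w, w in present, actual, probs)
        - log_word_lik(w, w in present, best, probs)
        for w in vocab
    }
    return best, impacts

# Hypothetical training corpus: tokenized messages with their class labels.
vocab = ["timeout", "null", "reference", "request"]
docs = [["request", "timeout"], ["timeout"], ["request", "timeout"],
        ["null", "reference"], ["null", "reference"], ["reference"]]
labels = ["A", "A", "A", "B", "B", "B"]
priors, probs = fit_bnb(docs, labels, vocab)

# A message observed for class "A" that looks like a typical class "B" message.
message = ["null", "reference"]
best, impacts = word_impacts(message, "A", priors, probs, vocab)
```

Both present words that are unusual for the actual class and absent words that are usual for it receive negative impacts in this example.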

3 Root Cause Analysis of Anomalies

3.1 Generative Model

Before we apply the results of the previous section, we need to construct a joint model for all features in our data set. We model the data by a Bayesian network, illustrated in Figure 1. Every feature depends on the zone (justification: different zones use servers in different locations) and on at most one other feature (in the natural hierarchical way). Thus, the model is actually a Tree-Augmented Network (TAN). These models generally allow for a feature-root relation and one more level of interaction. While TANs can capture non-trivial dependencies, they are computationally attractive, since every node has at most two parents, which reduces the size of the internal conditional probability tables [Pad14].





Figure 1: TAN model for occurrences of a single error.

More precisely, we assume


with empirical Dirichlet priors (estimated from data) for and non-informative Beta prior for ). Bernoulli distributions are over the (binarized) bag-of-word text representation of .

Given the graph, the likelihood factorizes into likelihoods of individual features given parents; these models can be fit separately [Pad14]. In our case

We also use this fact to structure our anomaly detection: we will analyze separately anomalies in and separately in tuples . Since we are interested in discovering and explaining anomalies on a daily basis, we perform the inference day by day, training the algorithm on the past data. The model was implemented with the Python package PyMC3 [SWF16].
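
The day-by-day procedure can be sketched without a probabilistic programming package by using the closed-form Dirichlet-Multinomial predictive probability. This is a simplification of the setup described above, not the PyMC3 implementation: the symmetric prior, the `warmup` parameter and the toy counts are assumptions made for illustration.

```python
import math

def dm_logpmf(x, alpha):
    """Log-probability of counts x under a Dirichlet-Multinomial with parameters alpha."""
    n, a = sum(x), sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for xk, ak in zip(x, alpha):
        out += math.lgamma(xk + ak) - math.lgamma(ak) - math.lgamma(xk + 1)
    return out

def daily_scores(daily_counts, prior=1.0, warmup=2):
    """Average predictive log-likelihood of each day given all previous days."""
    k = len(daily_counts[0])
    alpha = [prior] * k
    scores = {}
    for day, counts in enumerate(daily_counts):
        # Score a day only after `warmup` days of history have been collected.
        if day >= warmup and sum(counts) > 0:
            scores[day] = dm_logpmf(counts, alpha) / sum(counts)
        # Posterior update: add today's counts to the Dirichlet parameters.
        alpha = [a + c for a, c in zip(alpha, counts)]
    return scores

# Hypothetical daily counts for three projects; the fourth day is shifted.
daily_counts = [
    [50, 30, 20],
    [55, 28, 17],
    [48, 33, 19],
    [20, 25, 55],
    [52, 29, 19],
]
scores = daily_scores(daily_counts)
most_anomalous_day = min(scores, key=scores.get)
```

Dividing by the daily total makes days with different volumes comparable. Note that in this simple version an anomalous day is also added to the training history of the following days.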

3.2 RCA for Projects

The posterior for given observed project counts is Dirichlet-Multinomial. The daily-averaged likelihood is illustrated in Figure 2.

(a) Likelihood for ,
(b) Likelihood for ,
Figure 2: Project likelihoods by zone.

Anomalies 2018/05/17 and 2018/06/11, EMEA

By applying Corollary 1 we obtain the most impactful projects. We see that the anomalies correspond to peaks in project hits, as illustrated in Figure 3.

Figure 3: Daily hits by project ().

Anomalies 2018/05/07 and 2018/07/28, APJ

By applying Corollary 1 we obtain the most impactful projects (we pick two). The anomalies again correspond to peaks in project hits, as illustrated in Figure 4.

Figure 4: Daily hits by project ().

3.3 RCA for Procedures and Error Messages

According to our model, the distribution of procedures given error descriptions follows the classification Bernoulli Naive Bayes (BNB) model (where the procedure is the class and the error description is the text; class priors are determined by fitting ). To detect anomalies in errors, we evaluate how error messages impact procedures rather than investigating individual errors.

To detect anomalies at the daily level, we compute the daily-averaged likelihood and illustrate it in Figure 5.

(a) Likelihood of in
(b) Likelihood of in
Figure 5: Likelihood of split by .

Anomaly 2018/05/17 in EMEA

By Corollary 2 we identify the set

of 3 keywords with the biggest negative influence on the likelihood. By inspecting hits on these keywords (by a hit we mean every message matching at least one of these keywords) across the classes, we notice a huge difference between the anomaly day and the reference data set (see Figure 6).
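
Counting hits per class is straightforward once the keyword set is known. The sketch below assumes records of the form (procedure, message, error count) and whitespace tokenization; the procedure names, the keyword set and the counts are hypothetical.

```python
from collections import Counter

def keyword_hits_by_class(records, keywords):
    """Sum error counts per class over messages containing at least one keyword."""
    keywords = {k.lower() for k in keywords}
    hits = Counter()
    for procedure, message, count in records:
        # Case-insensitive match on whitespace-separated tokens.
        tokens = set(message.lower().split())
        if tokens & keywords:
            hits[procedure] += count
    return hits

# Hypothetical records: (procedure, error message, number of errors).
records = [
    ("PROC_A", "Object reference not set to an instance of an object", 3),
    ("PROC_A", "Object reference not set to an instance of an object", 2),
    ("PROC_B", "The operation has timed out", 7),
    ("PROC_B", "Invalid object state", 2),
]
hits = keyword_hits_by_class(records, {"object", "reference", "instance"})
```

Running the same count on the anomaly day and on the reference period, and comparing the two results class by class, gives the kind of contrast described above.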

Figure 6: Average daily hits of the keywords split by class (), for EMEA zone.

By inspecting the message texts we also recognize the specific messages related to these keywords. The result is summarized in Table 2.

procedure error message
Object reference not set to an instance of an object
Object reference not set to an instance of an object
Table 2: RCA for anomaly 2018/05/17 EMEA.

Anomaly 2018/06/11 in EMEA

By Corollary 2 we identify the set

of 5 keywords with the biggest negative influence on the likelihood. By inspecting hits on these keywords across the classes, we notice a significant shift between the anomaly day and the reference data set (see Figure 7).

Figure 7: Average daily hits of the keywords ’channel’, ’timed’, ’remote’, ’returned’, ’request’ split by class () for EMEA zone.

Having localized the keywords, we easily find the procedures with the biggest shifts as well as the corresponding messages. The explanation is summarized in Table 3.

procedure error message
The operation has timed out
The request channel timed out
The request failed with HTTP status 404
Table 3: RCA for anomaly 2018/06/11 EMEA.

Anomaly 2018/05/18 APJ

By Corollary 2 we identify the set

of 3 keywords with the biggest negative influence on the likelihood. By inspecting hits on these keywords across the classes, we notice a significant shift between the anomaly day and the reference data set (see Figure 8).

Figure 8: Average daily hits of the keywords ’null’, ’reference’, ’set’ split by class (), for APJ zone.

The explanation by procedures and error messages is shown in Table 4 below.

procedure error message
argument is null
Table 4: RCA for anomaly 2018/05/18 APJ.

Anomaly 2018/06/11 APJ

By Corollary 2 we identify the set

of 5 keywords with the biggest negative influence on the likelihood. By inspecting hits on these keywords across the classes, we notice a significant shift between the anomaly day and the reference data set (see Figure 9).

Figure 9: Average daily hits of the keywords ’contract’, ’gdas’, ’contracts’,’target’,’invocation’ split by class (), for APJ zone.

The explanation by procedures and error messages is shown in Table 5 below.

procedure error message
Operation: GDAS.Exceptions.CustomerNotFoundException
Exception has been thrown by the target of an invocation
Table 5: RCA for anomaly 2018/06/11 APJ.

4 Conclusion

We proposed a framework for anomaly detection and root cause analysis based on a separable approximation of the posterior. This approximation has been proved for the Multinomial, Dirichlet-Multinomial and Naive Bayes models. The validation on the real data set shows that the framework detects anomalies and offers reasonable and simple explanations.


  • [Deh17] G. P. Dehaene, Computing the quality of the Laplace approximation, November 2017.
  • [KF09] Daphne Koller and Nir Friedman, Probabilistic graphical models: Principles and techniques, Adaptive Computation and Machine Learning, The MIT Press, 2009.
  • [Pad14] Harini Padmanaban, Comparative analysis of naive Bayes and tree augmented naive Bayes models, 2014.
  • [Shl14] Jonathon Shlens, Notes on Kullback-Leibler divergence and likelihood, 2014.
  • [SMRE17] Marc Solé, Victor Muntés-Mulero, Annie Ibrahim Rana, and Giovani Estrada, Survey on models and techniques for root-cause analysis, arXiv e-prints (2017).
  • [SWF16] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck, Probabilistic programming in Python using PyMC3, PeerJ Computer Science 2 (2016), e55.
  • [Tu15] Stephen Tu, The Dirichlet-multinomial and Dirichlet-categorical models for Bayesian inference, 2015.
  • [ZZZS17] Yiwen Zhang, Hua Zhou, Jin Zhou, and Wei Sun, Regression models for multivariate count data, Journal of Computational and Graphical Statistics 26 (2017), no. 1, 1–13.