# Simple Root Cause Analysis by Separable Likelihoods

###### Abstract

Root Cause Analysis (RCA) for anomalies is challenging because of the trade-off between accuracy and explanatory friendliness, both of which are required for industrial applications. In this paper we propose a framework for simple and friendly RCA within the Bayesian regime, under a certain restriction imposed on the predictive posterior (that the Hessian at the mode is diagonal, here referred to as separability). We show that this assumption is satisfied for important base models, including Multinomial, Dirichlet-Multinomial and Naive Bayes. To demonstrate the usefulness of the framework, we embed it into a Bayesian network and validate it on web server error logs (a real-world data set).

###### Keywords:

Bayesian Modeling, Anomaly Detection, Root Cause Analysis

## 1 Introduction

### 1.1 Anomaly Detection and Root Cause Analysis

In likelihood-based approaches to anomaly detection, a generative probabilistic model for the data is learned and used to evaluate new data records. Anomalies are defined as the records with unusually low likelihood. An example is the Z-score measure for 1-dimensional data, which fits a Gaussian distribution to the data (estimating the mean and variance) and ranks observations by their likelihood, lowest first; because of its simplicity it is widely used in exploratory data analysis, quality control and other industrial applications. The challenge with real data sets, however, is that they usually contain both continuous and categorical features, as well as interdependencies (in particular, anomaly scores cannot be applied to each feature independently). Interactions and dependencies can be effectively modeled by the modern framework of probabilistic graphical models [KF09]. Further, simplicity can be traded for accuracy by using more sophisticated models as building blocks (for example more exotic base distributions or mixtures); for multivariate counts alone, several models have been proposed [ZZZS17].
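The Z-score example mentioned above can be sketched in a few lines. This is a minimal illustration rather than part of the proposed framework; the function name `zscore_anomalies`, the threshold of 3 and the toy numbers are assumptions made for the example.

```python
import math
import statistics

def zscore_anomalies(train, new, threshold=3.0):
    """Fit a Gaussian to `train` and score each point of `new`.

    Returns tuples (value, z, log_likelihood, is_anomaly). The threshold
    of 3 standard deviations is an illustrative assumption.
    """
    mu = statistics.fmean(train)
    sigma = statistics.stdev(train)  # sample standard deviation
    results = []
    for x in new:
        z = (x - mu) / sigma
        # Gaussian log-density: it decreases monotonically in |z|,
        # so ranking by |z| is the same as ranking by low likelihood.
        loglik = -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))
        results.append((x, z, loglik, abs(z) > threshold))
    return results

# Hypothetical 1-dimensional training data and two new observations.
train = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.0]
scored = zscore_anomalies(train, [10.1, 13.0])
```

The second observation lies far from the fitted mean, so it receives the lower log-likelihood and is flagged.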

This paper concerns the constrained scenario of Root Cause Analysis (RCA) where, in addition to identifying anomalies, a readable explanation (in terms of other features) is required. Because the purpose of RCA is to support business decision making, complexity and fit accuracy are often traded for explanatory abilities. This makes some powerful models (such as neural networks) inadequate for this task [SMRE17]. In this paper we show how to build, out of simple building blocks, an anomaly detection system for error logs. While our model is a fairly simple variant of a Bayes network, the main added value is the proposed paradigm of determining anomaly contributions, which is used to estimate how different features contribute to the likelihood of an anomalous data record. These scores can be used directly to perform efficient RCA, which is illustrated by a case study on real data.

### 1.2 Contribution

#### Root-Cause Analysis for Separable Posteriors

For the task of anomaly detection the main quantity of interest is the likelihood of a new data record $x$ given the training data $\mathcal{D}$, called the predictive posterior. Assuming a generative process for the data with some parameter $\theta$, the predictive posterior is given by

$$p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta .$$

For the task of RCA it would be helpful to see how the individual components of a record $x = (x_1, \ldots, x_d)$ impact the likelihood. This is not possible in general, because posteriors are often not analytically tractable and are only approximated by sampling. However, in certain cases the predictive posterior, after subtracting its mode, can be factorized into terms depending on the individual components $x_i$. More precisely, suppose that the predictive posterior log-likelihood can be written as

$$\log p(x \mid \mathcal{D}) = \log p(x^{*} \mid \mathcal{D}) + \sum_{i} f_i(x_i) \tag{1}$$

where $x^{*} = \arg\max_{x} p(x \mid \mathcal{D})$ is the mode. When the posterior obeys Equation 1 we say it is separable. The term $f_i(x_i)$ can then be thought of as the influence of the $i$-th coordinate of the data point $x$. Moreover, similarly to the notion of the averaged log-likelihood, these influences can be aggregated over several independent observations (e.g. at the daily level).

This formula has the following intuitive meaning: we decompose the deficiency with respect to the mode into individual dimensions. The deficiency is understood as the difference in log-likelihood with respect to the mode and can be seen as a natural anomaly measure (note that the per-coordinate terms $f_i(x_i)$ of Equation 1 satisfy $\sum_i f_i(x_i) \leq 0$ by the definition of the mode $x^{*}$). We stress that it is important to subtract the mode in Equation 1; otherwise we explain the likelihood of the whole point, rather than its abnormal part.

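It is worth mentioning that Equation 1 can be characterized alternatively, by noticing that the Hessian matrix of the predictive posterior log-likelihood $\log p(x \mid \mathcal{D})$ of a record $x = (x_1, \ldots, x_d)$ satisfies

$$\frac{\partial^2}{\partial x_i\, \partial x_j} \log p(x \mid \mathcal{D}) = 0 \quad \text{for } i \neq j, \tag{2}$$

hence it is diagonal at the mode.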

We will show theoretical results on separability for two popular building blocks: the posterior of the Dirichlet-Multinomial distribution and the posterior of a categorical variable given category-dependent multivariate Bernoulli or Multinomial observations (for example, naive Bayes text classification on the bag-of-words representation). They will be presented in Section 2; now we sketch a simpler example for illustration. Consider the multinomial model with total count $n$ and probability vector $p = (p_1, \ldots, p_d)$. The probability of counts $x = (x_1, \ldots, x_d)$ with $\sum_i x_i = n$ equals

$$\Pr[x] = \frac{n!}{\prod_i x_i!} \prod_i p_i^{x_i} .$$

Denote by $\hat{p}_i = x_i / n$ the observed frequencies. The log-likelihood normalized by the number of observations can be approximated by Stirling's formula [Shl14], establishing the connection to the Kullback-Leibler divergence of the observed and real frequencies, respectively $\hat{p}$ and $p$:

$$\frac{1}{n} \log \Pr[x] \approx -\mathrm{KL}(\hat{p} \,\|\, p) = -\sum_i \hat{p}_i \log \frac{\hat{p}_i}{p_i} .$$

It is not hard to see that, within this approximation, the normalized logarithm of the mode of the multinomial distribution equals $0$ (it is attained at $\hat{p} = p$). Thus we obtain Equation 1, for the normalized log-likelihood, with $f_i = -\hat{p}_i \log (\hat{p}_i / p_i)$.
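The multinomial illustration can be checked numerically. The sketch below assumes that the per-coordinate influence is the Kullback-Leibler term $-\hat{p}_i \log(\hat{p}_i / p_i)$, with $\hat{p}_i$ the observed frequency and $p_i$ the model probability, and compares the sum of these terms with the exact normalized log-likelihood gap to the mode. The function names and the numbers are hypothetical.

```python
import math

def multinomial_loglik(counts, probs):
    """Exact multinomial log-probability of `counts`."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for x, p in zip(counts, probs):
        out += x * math.log(p) - math.lgamma(x + 1)
    return out

def kl_impacts(counts, probs):
    """Per-category terms -q_i * log(q_i / p_i), with q_i = x_i / n."""
    n = sum(counts)
    impacts = []
    for x, p in zip(counts, probs):
        q = x / n
        # The limit of q * log(q / p) as q -> 0 is 0.
        impacts.append(0.0 if x == 0 else -q * math.log(q / p))
    return impacts

probs = [0.5, 0.3, 0.2]            # hypothetical model probabilities
counts = [300, 300, 400]           # third category is over-represented
mode_counts = [500, 300, 200]      # counts proportional to probs

impacts = kl_impacts(counts, probs)
n = sum(counts)
exact_gap = (multinomial_loglik(counts, probs)
             - multinomial_loglik(mode_counts, probs)) / n
```

In this toy example the over-represented category receives the most negative term, while an under-represented category contributes a positive term; only the total is guaranteed to be non-positive.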

#### Case Study on Real Data

We apply our framework to the real data set of error logs from company servers. Each record contains the number of errors for a given zone, project, procedure and the error message. The data was collected for more than 120 consecutive days. A sample of the data set is shown in Table 1.

row_id | date | region | project_name | procedure_name | error_detail | err_cnt |
---|---|---|---|---|---|---|
15362 | 2018-04-01 | EMEA | GLOBAL_ONLINE_SERVICE | EXPLODE_BUNDLE | Object reference not set to an instance of an … | 3 |
29308 | 2018-04-01 | EMEA | YOJEG_API | YOJEG.Controllers.Configurator.Global.Glo… | VerifyError:Invalid option selected | 1 |
29222 | 2018-04-01 | EMEA | GDAS Services: CustomerService | NaN | Operation: GetSalesPerson | 26 |
3157 | 2018-04-01 | EMEA | GDAS Services: CustomerService | NaN | Operation: GetCustomer Exception: GDAS.Ex… | 77 |
7801 | 2018-04-01 | EMEA | YOJEG_API | YOJEG.Controllers.Configurator.Global.Glo… | BuildError:InvalidOrderCodeOrCustomerSet | 5 |

The results will be discussed in Section 3.

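### 1.3 Organization

Section 2 presents the separability results for the Dirichlet-Multinomial and Bernoulli Naive Bayes models, Section 3 applies them to the case study on error logs, and Section 4 concludes.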

## 2 Separable Posteriors

### 2.1 Dirichlet-Multinomial Model

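The Dirichlet-Multinomial (DM) model is popular for modeling multivariate counts. As opposed to the plain multinomial model, it models uncertainty in the probability parameter, which helps to account for over-dispersion.

$$p \sim \mathrm{Dirichlet}(\alpha), \qquad x \mid p \sim \mathrm{Multinomial}(n, p) \tag{3}$$

This model is analytically tractable; we utilize the formulas derived in [Tu15]. Writing $\alpha = (\alpha_1, \ldots, \alpha_d)$ for the Dirichlet parameters after updating on the training data $\mathcal{D}$, the predictive posterior of the counts $x = (x_1, \ldots, x_d)$ is

$$\Pr[x \mid \mathcal{D}] = \frac{\Gamma(n+1)\,\Gamma(\alpha_0)}{\Gamma(n+\alpha_0)} \prod_{i} \frac{\Gamma(x_i+\alpha_i)}{\Gamma(x_i+1)\,\Gamma(\alpha_i)} \tag{4}$$

where $\alpha_0 = \sum_i \alpha_i$ and $n = \sum_i x_i$ for the sake of concise notation. By using the Stirling approximation ($\log \Gamma(z+1) \approx z \log z - z$, keeping only the leading-order terms) we obtain

$$\log \Pr[x \mid \mathcal{D}] \approx C + \sum_i \big[ (x_i+\alpha_i) \log (x_i+\alpha_i) - x_i \log x_i \big] \tag{5}$$

where $C$ depends only on $n$ and $\alpha$.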

In order to see separability we will apply the well-known Laplace approximation, which is merely a multivariate Gaussian approximation to the predictive posterior (see for example [Deh17] for theoretical justifications). Technically, we expand the log-likelihood in a Taylor series around its mode, so that the linear terms disappear (by the first-derivative test, as the mode maximizes the likelihood) and the quadratic terms correspond to the Gaussian terms. In our case, the matrix of second-order terms turns out to be diagonal, hence we obtain separability.

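In order to find the mode we need to use the Lagrangian because of the implicit constraint $\sum_i x_i = n$, where $x_i$ are the counts and $n$ is their total. We extend the likelihood to non-integer values, as the gamma function is well-defined there and the Stirling approximation still works. For some constant $\lambda$ (the Lagrange multiplier), the mode $x^{*}$ satisfies

$$\log \frac{x^{*}_i + \alpha_i}{x^{*}_i} = \lambda \tag{6}$$

which implies

$$x^{*}_i = \frac{\alpha_i}{e^{\lambda} - 1} = n\, \frac{\alpha_i}{\alpha_0} \tag{7}$$

where $\alpha_i$ are the posterior Dirichlet parameters and $\alpha_0 = \sum_i \alpha_i$; the second equality follows by summing over $i$ and using the constraint.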

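Let $x_i$ denote the counts with total $n$, $x^{*}$ the mode, $\alpha_i$ the posterior Dirichlet parameters and $\alpha_0 = \sum_i \alpha_i$. By the Taylor expansion around the mode we obtain (note that the linear part disappears because $\sum_i (x_i - x^{*}_i) = 0$, and the coefficients of the quadratic part are determined from the first-order conditions in Equation 6)

$$\log \Pr[x \mid \mathcal{D}] - \log \Pr[x^{*} \mid \mathcal{D}] \approx -\frac{1}{2} \sum_i \frac{\alpha_i}{x^{*}_i\, (x^{*}_i + \alpha_i)}\, (x_i - x^{*}_i)^2 \tag{8}$$

In the alternative notation $\hat{p}_i = x_i / n$ (observed frequency) and $p^{*}_i = x^{*}_i / n = \alpha_i / \alpha_0$ (mode frequency) we have

###### Lemma 1 (Predictive Posterior vs Mode for DM)

$$\log \Pr[x \mid \mathcal{D}] - \log \Pr[x^{*} \mid \mathcal{D}] \approx -\frac{n\, \alpha_0}{2\, (n + \alpha_0)} \sum_i \frac{(\hat{p}_i - p^{*}_i)^2}{p^{*}_i} \tag{9}$$

Since usually $\alpha_0 \gg n$ ($\alpha_0$ collects all occurrences over the training data) we have $\frac{\alpha_0}{n + \alpha_0} \approx 1$ and we conclude

###### Corollary 1 (DM Posterior Predictive Impacts)

The impact of the $i$-th coordinate is approximately $f_i \approx -\frac{n}{2}\, \frac{(\hat{p}_i - p^{*}_i)^2}{p^{*}_i}$.

###### Remark 1 (Intuition)

The major reason for impacts being large negative is a significant relative increase in frequencies (observed vs posterior), under large volume. Indeed, let $\hat{p}_i = (1 + \gamma_i)\, p^{*}_i$; then the $i$-th impact equals $-\frac{n}{2}\, \gamma_i^2\, p^{*}_i$.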
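The quadratic approximation can be compared against the exact Dirichlet-Multinomial log-probability. The sketch below assumes the standard DM probability mass function and per-category impacts of the form $-\frac{n \alpha_0}{2(n+\alpha_0)} (\hat{p}_i - p^{*}_i)^2 / p^{*}_i$, where $\hat{p}_i$ are observed frequencies and $p^{*}_i = \alpha_i / \alpha_0$; the parameter values and function names are hypothetical.

```python
import math

def dm_log_predictive(counts, alpha):
    """Exact Dirichlet-Multinomial log-probability of `counts`
    under (posterior) Dirichlet parameters `alpha`."""
    n, a0 = sum(counts), sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a0) - math.lgamma(n + a0)
    for x, a in zip(counts, alpha):
        out += math.lgamma(x + a) - math.lgamma(x + 1) - math.lgamma(a)
    return out

def dm_impacts(counts, alpha):
    """Quadratic (Laplace-style) per-category impacts relative to the mode."""
    n, a0 = sum(counts), sum(alpha)
    scale = n * a0 / (2.0 * (n + a0))
    impacts = []
    for x, a in zip(counts, alpha):
        p_obs, p_mode = x / n, a / a0
        impacts.append(-scale * (p_obs - p_mode) ** 2 / p_mode)
    return impacts

# Hypothetical posterior parameters (alpha_0 much larger than n) and one day of counts.
alpha = [50000.0, 30000.0, 20000.0]
counts = [480, 290, 230]
mode = [500, 300, 200]             # n * alpha_i / alpha_0

impacts = dm_impacts(counts, alpha)
exact_gap = dm_log_predictive(counts, alpha) - dm_log_predictive(mode, alpha)
```

Sorting the categories by `impacts` (most negative first) gives the ranking used for root cause analysis. In this toy example the sum of the impacts stays close to `exact_gap`, and the difference comes from the higher-order terms dropped by the approximation.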

### 2.2 BNB Model

We prove separability only for Bernoulli Naive Bayes (BNB), as we will be using this model in our case study. However, separability is not limited to the Bernoulli variant and can also be proved for Multinomial Naive Bayes.

The BNB model is popular for classification of short text messages. Texts are represented as $|V|$-dimensional boolean vectors, where $V$ is the vocabulary. Each entry is a boolean indicating the occurrence of the word $i \in V$ in a given text $w$; we will use the notation $w_i \in \{0, 1\}$. The model with a Beta prior (which smooths zero frequencies by assuming extra "pseudocounts" of one for each class-word pair) can be written as

$$c \sim \mathrm{Categorical}(\pi), \qquad w_i \mid c \sim \mathrm{Bernoulli}(\theta_{i,c}), \qquad \theta_{i,c} \sim \mathrm{Beta}(1, 1), \qquad c \in C,\ i \in V \tag{10}$$

where $C$ is the set of classes (categories). Let $p_{i \mid c}$ and $p_c$ be the posterior probabilities for word $i$ given class $c$ and for class $c$, respectively (estimated from the data). Then we have

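###### Proposition 1 (Predictive Posterior for BNB)

The probability of the class $c$ given the vector of words $w$ is given by

$$\Pr[c \mid w] \propto p_c \prod_{i \in V} p_{i \mid c}^{\,w_i}\, (1 - p_{i \mid c})^{1 - w_i} \tag{11}$$

where $p_c$ is the posterior probability of class $c$, $p_{i \mid c}$ is the posterior probability of word $i$ given class $c$, and the proportionality constant is independent of $c$ (but depends on $w$).

By taking the logarithm of Equation 11 evaluated at $c$ and at the most likely class $c^{*}$, and subtracting (the unknown constant cancels), we obtain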

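###### Lemma 2 (Predictive Posterior vs Mode for BNB)

For the Bernoulli Naive Bayes model, let $c^{*}$ be the most likely class given the sequence of words $w$. With $p_c$ the posterior probability of class $c$ and $p_{i \mid c}$ the posterior probability of word $i$ given class $c$, we have

$$\log \Pr[c \mid w] - \log \Pr[c^{*} \mid w] = \log \frac{p_c}{p_{c^{*}}} + \sum_{i \in V} \left[ w_i \log \frac{p_{i \mid c}}{p_{i \mid c^{*}}} + (1 - w_i) \log \frac{1 - p_{i \mid c}}{1 - p_{i \mid c^{*}}} \right] \tag{12}$$

From Equation 12 we immediately obtain the word impact.

###### Corollary 2 (BNB Posterior Predictive Impact)

The impact of the word $i$ equals

$$w_i \log \frac{p_{i \mid c}}{p_{i \mid c^{*}}} + (1 - w_i) \log \frac{1 - p_{i \mid c}}{1 - p_{i \mid c^{*}}} . \tag{13}$$

###### Remark 2 (Intuition)

The major reason for impacts to be large negative is the presence of class-untypical words (so that $p_{i \mid c} \ll p_{i \mid c^{*}}$). The effect is stronger with large volume when evaluating averaged likelihoods.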
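The word impacts can be computed directly from fitted word probabilities. The sketch below assumes that the impact of a word is the corresponding log-ratio of Bernoulli likelihoods between the observed class and the most likely class (the class-prior term is left out); the class names, vocabulary and probabilities are hypothetical.

```python
import math

def bnb_word_impacts(words, c, c_star, word_prob):
    """Per-word terms of log P(c | w) - log P(c_star | w).

    `words` maps each vocabulary word to True/False (present/absent);
    `word_prob[cls][word]` is the smoothed probability of the word
    given the class. The class-prior term is not included.
    """
    impacts = {}
    for word, present in words.items():
        p, p_star = word_prob[c][word], word_prob[c_star][word]
        if present:
            impacts[word] = math.log(p / p_star)
        else:
            impacts[word] = math.log((1 - p) / (1 - p_star))
    return impacts

# Hypothetical smoothed word probabilities for two procedures (classes).
word_prob = {
    "proc_a": {"timeout": 0.60, "null": 0.05, "reference": 0.10},
    "proc_b": {"timeout": 0.05, "null": 0.70, "reference": 0.60},
}
# A message observed under proc_b that looks much more like proc_a.
message = {"timeout": True, "null": False, "reference": False}

impacts = bnb_word_impacts(message, c="proc_b", c_star="proc_a",
                           word_prob=word_prob)
worst_word = min(impacts, key=impacts.get)
```

Here the word that is typical for `proc_a` but rare for `proc_b` gets the most negative impact, matching the intuition of Remark 2.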

## 3 Root Cause Analysis of Anomalies

### 3.1 Generative Model

Before we apply the results of the previous section, we need to construct the joint model for all features in our data set. We model the data by a Bayes net illustrated in Figure 1. Every feature depends on the zone (justification: different zones use servers in different locations) and on at most one other feature (in the natural hierarchical way). Thus, the model is actually a Tree-Augmented Network (TAN). These models generally allow for a feature-root relation and one more level of interaction. While TANs can capture non-trivial dependencies, they are computationally attractive since every node has at most two parents, which reduces the size of the internal conditional probability tables [Pad14].

More precisely, we assume

(14) |

with empirical Dirichlet priors (estimated from data) for the categorical features and a non-informative Beta prior for the word probabilities. The Bernoulli distributions are over the (binarized) bag-of-words representation of the error message.

Given the graph, the likelihood factorizes into likelihoods of individual features given parents; these models can be fit separately [Pad14]. In our case

We also use this fact to structure our anomaly detection: we analyze anomalies in projects separately from anomalies in (procedure, error message) tuples. Since we are interested in discovering and explaining anomalies on a daily basis, we perform the inference day by day, training the algorithm on the past data. The model was implemented with the Python package PyMC3 [SWF16].
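The day-by-day scheme (train on the past, score the current day by its averaged log-likelihood) can be sketched as follows. This is not the PyMC3 model used in the case study: as a stand-in it uses a single categorical feature with add-one smoothing, and the function name, the pseudocount and the toy records are assumptions.

```python
import math
from collections import Counter

def daily_avg_loglik(records, categories, pseudocount=1.0):
    """records: iterable of (day, category) with ISO-formatted days.

    For each day, fit smoothed category frequencies on all earlier days
    and return {day: averaged log-likelihood of that day's records}.
    The first day is skipped because there is no past data to train on.
    """
    by_day = {}
    for day, category in records:
        by_day.setdefault(day, []).append(category)

    seen = Counter()
    scores = {}
    for day in sorted(by_day):          # ISO dates sort chronologically
        total = sum(seen.values())
        if total > 0:
            denom = total + pseudocount * len(categories)
            logs = [math.log((seen[c] + pseudocount) / denom)
                    for c in by_day[day]]
            scores[day] = sum(logs) / len(logs)
        seen.update(by_day[day])        # the scored day joins the training data
    return scores

# Hypothetical records: two ordinary days followed by a shifted one.
records = (
    [("2018-04-01", "A")] * 8 + [("2018-04-01", "B")] * 2
    + [("2018-04-02", "A")] * 8 + [("2018-04-02", "B")] * 2
    + [("2018-04-03", "A")] * 1 + [("2018-04-03", "B")] * 9
)
scores = daily_avg_loglik(records, categories=["A", "B"])
most_anomalous_day = min(scores, key=scores.get)
```

Days with an unusually low averaged log-likelihood are the candidates for root cause analysis.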

### 3.2 RCA for Projects

The predictive posterior for the project counts, given the observed project counts, is Dirichlet-Multinomial. The daily-averaged likelihood is illustrated in Figure 2.

#### Anomalies 2018/05/17 and 2018/06/11, EMEA

By applying Corollary 1 we obtain the most impacting projects. We see that the anomalies correspond to peaks in project hits, as illustrated in Figure 3.

#### Anomalies 2018/05/07 and 2018/07/28, APJ

By applying Corollary 1 we obtain the most impacting projects (we pick two). The anomalies again correspond to peaks in project hits, as illustrated in Figure 4.

### 3.3 RCA for Procedures and Error Messages

According to our model, the distribution of procedures given error descriptions follows the Bernoulli Naive Bayes (BNB) classification model (where the procedure is the class and the error message is the text; class priors are determined by fitting the distribution of procedures given their parents in the Bayes net). To detect anomalies in errors, we evaluate how error messages impact procedures, rather than investigating individual errors.

To detect anomalies at the daily level, we compute the daily-averaged likelihood, illustrated in Figure 5.

#### Anomaly 2018/05/17 in EMEA

By Corollary 2 we identify the set

of 3 keywords with the biggest negative influence on the likelihood. By inspecting hits on these keywords (by a hit we understand every message matching at least one word in this set) across the classes, we notice a huge difference between the anomaly day and the reference data set (see Figure 6).
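Counting hits per class can be sketched as below. A hit is taken to be a message containing at least one keyword; the whitespace tokenization, the procedure names and the keyword set are assumptions made for illustration and are not the sets identified in the case study.

```python
from collections import Counter

def keyword_hits(messages, keywords):
    """messages: iterable of (procedure, text).

    Returns a Counter with, per procedure, the number of messages that
    contain at least one of `keywords` (case-insensitive, whitespace tokens).
    """
    keywords = {k.lower() for k in keywords}
    hits = Counter()
    for procedure, text in messages:
        tokens = set(text.lower().split())
        if tokens & keywords:
            hits[procedure] += 1
    return hits

# Hypothetical keyword set and messages for a reference period and an anomaly day.
keywords = {"reference", "instance"}
reference_msgs = [
    ("PROC_A", "The operation has timed out"),
    ("PROC_B", "Object reference not set to an instance of an object"),
]
anomaly_msgs = [
    ("PROC_A", "Object reference not set to an instance of an object"),
    ("PROC_A", "Object reference not set to an instance of an object"),
    ("PROC_B", "The operation has timed out"),
]
reference_hits = keyword_hits(reference_msgs, keywords)
anomaly_hits = keyword_hits(anomaly_msgs, keywords)
```

Comparing `anomaly_hits` with `reference_hits` per procedure shows which classes account for the shift.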

By inspecting message texts we also recognize the specific messages related to these keywords. The result is summarized in Table 2.

| procedure | error message |
|---|---|
| | Object reference not set to an instance of an object |
| | Object reference not set to an instance of an object |

#### Anomaly 2018/06/11 in EMEA

By Corollary 2 we identify the set

of 5 keywords with biggest negative influence on the likelihood. By inspecting hits on these keywords across the classes we notice a significant shift between the anomaly day and the reference data set (see Figure 7).

Having localized the keywords, we easily find procedures with biggest shifts and also the messages. The explanation is summarized in Table 3.

| procedure | error message |
|---|---|
| | The operation has timed out |
| | The request channel timed out |
| | The request failed with HTTP status 404 |

#### Anomaly 2018/05/18 APJ

By Corollary 2 we identify the set

of 3 keywords with biggest negative influence on the likelihood. By inspecting hits on these keywords across the classes we notice a significant shift between the anomaly day and the reference data set (see Figure 8).

The explanation by procedures and error messages is shown in Table 4 below.

| procedure | error message |
|---|---|
| | argument is null |

#### Anomaly 2018/06/11 APJ

By Corollary 2 we identify the set

of 5 keywords with biggest negative influence on the likelihood. By inspecting hits on these keywords across the classes we notice a significant shift between the anomaly day and the reference data set (see Figure 9).

The explanation by procedures and error messages is shown in Table 5 below.

| procedure | error message |
|---|---|
| | Operation: GDAS.Exceptions.CustomerNotFoundException |
| | Exception has been thrown by the target of an invocation |

## 4 Conclusion

We proposed a framework for anomaly detection and root cause analysis based on a separable approximation of the posterior. This approximation has been proved for the Multinomial, Dirichlet-Multinomial and Naive Bayes models. The validation on a real data set shows that the framework detects anomalies and offers reasonable and simple explanations.

## References

- [Deh17] G. P. Dehaene, Computing the quality of the Laplace approximation, http://adsabs.harvard.edu/abs/2017arXiv171108911D, November 2017.
- [KF09] Daphne Koller and Nir Friedman, Probabilistic graphical models: Principles and techniques, Adaptive Computation and Machine Learning, The MIT Press, 2009.
- [Pad14] Harini Padmanaban, Comparative analysis of naive Bayes and tree augmented naive Bayes models, http://scholarworks.sjsu.edu/etd_projects/356, 2014.
- [Shl14] Jonathon Shlens, Notes on Kullback-Leibler divergence and likelihood, http://arxiv.org/abs/1404.2000, 2014.
- [SMRE17] Marc Solé, Victor Muntés-Mulero, Annie Ibrahim Rana, and Giovani Estrada, Survey on models and techniques for root-cause analysis, arXiv e-prints (2017).
- [SWF16] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck, Probabilistic programming in Python using PyMC3, PeerJ Computer Science 2 (2016), e55.
- [Tu15] Stephen Tu, The Dirichlet-multinomial and Dirichlet-categorical models for Bayesian inference, 2015.
- [ZZZS17] Yiwen Zhang, Hua Zhou, Jin Zhou, and Wei Sun, Regression models for multivariate count data, Journal of Computational and Graphical Statistics 26 (2017), no. 1, 1–13.