# Learning Latent Events from Network Message Logs: A Decomposition Based Approach

Siddhartha Satpathi, Supratim Deb (AT&T Labs), R. Srikant, and He Yan (AT&T Labs)
###### Abstract.

In this communication, we describe a novel technique for event mining using a decomposition based approach that combines non-parametric change-point detection with LDA. We prove theoretical guarantees about sample-complexity and consistency of the approach. In a companion paper, we will perform a thorough evaluation of our approach with detailed experiments.


## 1. Introduction: Problem Setting

We are given a data set consisting of messages generated by error events. The messages in the data set come from a discrete and finite set, and each message has a timestamp associated with it. Suppose that an event started occurring at some time and finished at some later time. In the interval between these two times, the event generates a mixture of messages from a subset of the message set, which we call the message set of the event. If an event occurs multiple times in the data set, then each occurrence of the event has its own start and finish times.

An event is characterized by its message set and by the probability distribution with which messages are chosen from that set, i.e., the probability that the event generates each given message. For compactness of notation, one can simply think of this distribution as being defined over the entire set of messages, with probability zero assigned to every message outside the event's message set. Thus, the distribution fully characterizes the event and can be viewed as the signature of the event. We assume that the support sets of messages for two different events are not identical.

The event information is latent in the data set. The goal is to solve the following inference problem: from the given data set, identify the set of events that generated the messages in the data set, and for each instance of an event, identify when it started and finished. In other words, the output of the inference algorithm should contain the following information:

• The number of events which generated the data set.

• The signatures of these events.

• For each event, the number of times it occurred in the data set and, for each occurrence, its start and finish times.

Notation: We use the notation $X_i$ for the $i$-th message. Also, let $t_i$ be the timestamp associated with the $i$-th message. Thus the data set can be characterized by $n$ tuples $(X_i, t_i)$ of data points.

## 2. Change-point Detection

The first phase of our method involves using change-point detection to identify episodes where an episode is characterized as follows:

• An episode consists of a mixture of events, and each event consists of a mixture of messages.

• Since neighboring episodes consist of different mixtures of events, neighboring episodes also contain different mixtures of messages (due to our assumption that different events do not generate the same set of messages).

• Thus, successive episodes contain different message distributions and therefore, the time instances where these distributions change are the episode boundaries, which we will call change points.

• In our data set, the messages contain time stamps. In general, the inter-arrival time distributions of messages are different in successive episodes, due to the fact that the episodes represent different mixtures of events. This fact can be further exploited to improve the identification of change points.

Suppose we have $n$ data points and a known number of change points $k$. The data points between two consecutive change points are drawn i.i.d. from the same distribution. In the inference problem, each data point could be a possible change point. A naive exhaustive search to find the best locations would have a computational complexity of $O(n^k)$. Nonparametric approaches to change-point detection aim to solve this problem with much lower complexity, even when the number of change points is unknown and there are few assumptions on the family of distributions (Kawahara and Sugiyama, 2012), (S. Matteson and A. James, 2013), (Lung-Yut-Fong et al., 2011).

The change-point detection algorithm we use is hierarchical in nature. This is inspired by the work in (S. Matteson and A. James, 2013). Nevertheless, our algorithm has certain key differences, as discussed in Section 2.1.1. It is easier to understand the algorithm in the setting of only one change point, i.e., two episodes. Suppose that $\tau$ is a candidate change point among the $n$ points. The idea is to measure the change in distribution between the points to the left and right of $\tau$. We use the TV distance (i.e., the $\ell_1$ distance between two distributions) between the empirical distributions estimated from the points to the left and right of the candidate change point $\tau$. This is maximized over all values of $\tau$ to estimate the location of the change point. If the distributions are sufficiently different in the two episodes, the TV distance between the empirical distributions is expected to be highest for the correct location of the change point in comparison to any other candidate point (we prove this rigorously in the proofs of Theorems 2.1 and 2.3).

Further, we also have different inter-arrival times for messages in different episodes. Hence we use a combination of TV distance and mean inter-arrival time as the metric to differentiate the two distributions. We denote this metric by $\hat{D}$.

$$\hat{D}(\tau) = \|\hat{p}_L(\tau) - \hat{p}_R(\tau)\|_1 + |\hat{\mathbb{E}}S_L(\tau) - \hat{\mathbb{E}}S_R(\tau)|, \tag{1}$$

where $\hat{p}_L(\tau)$ and $\hat{p}_R(\tau)$ are empirical estimates of the message distributions to the left and right of $\tau$, and $\hat{\mathbb{E}}S_L(\tau)$ and $\hat{\mathbb{E}}S_R(\tau)$ are empirical estimates of the mean inter-arrival time to the left and right of $\tau$, respectively. Algorithm 1 describes the algorithm in the one change point case. To make the algorithm more robust, we declare a change point only when the episode length is at least $\alpha n$ and the maximum value of the metric (1) is at least $\delta$.
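
The single change-point search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Algorithm 1: the function names, the default values of `alpha` and `delta`, and the use of plain empirical means on each side of the split are all choices made here. The metric is recomputed from scratch for every candidate, so this sketch is quadratic in the number of messages.

```python
from collections import Counter


def empirical_distribution(messages):
    """Empirical message distribution of a non-empty list of messages."""
    total = len(messages)
    return {m: count / total for m, count in Counter(messages).items()}


def metric(messages, gaps, tau):
    """Sketch of D_hat(tau): items [0, tau) are the left side, [tau, n) the right side."""
    p_left = empirical_distribution(messages[:tau])
    p_right = empirical_distribution(messages[tau:])
    support = set(p_left) | set(p_right)
    l1 = sum(abs(p_left.get(m, 0.0) - p_right.get(m, 0.0)) for m in support)
    mean_left = sum(gaps[:tau]) / tau
    mean_right = sum(gaps[tau:]) / (len(gaps) - tau)
    return l1 + abs(mean_left - mean_right)


def detect_single_change_point(messages, gaps, alpha=0.1, delta=0.5):
    """Return the best split index, or None if no split passes both robustness checks."""
    n = len(messages)
    min_len = max(1, int(alpha * n))  # each side must hold at least alpha * n points
    candidates = range(min_len, n - min_len + 1)
    if len(candidates) == 0:
        return None
    best = max(candidates, key=lambda tau: metric(messages, gaps, tau))
    return best if metric(messages, gaps, best) >= delta else None


# Same message throughout; only the inter-arrival time changes after 10 messages.
messages = ["link_down"] * 20
gaps = [1.0] * 10 + [3.0] * 10
change_point = detect_single_change_point(messages, gaps)
```

In this toy input the message distribution never changes, so the split is driven entirely by the difference in mean inter-arrival time, and `change_point` lands on index 10.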

Let us consider a simple example to illustrate the idea of change-point detection with one change-point. Suppose we have a sequence of messages with unequal inter-arrival times as shown in Fig. 1. All the messages are the same, but the first half of the messages arrive at a rate higher than the second half of the messages. In this scenario, our metric reduces to the difference in the mean inter-arrival times between the two episodes.

Next, we consider the case of multiple change points. When we have multiple change points, we apply Algorithm 1 hierarchically until we cannot find a change point. Algorithm 2 CD is presented below.

The above algorithm tries to detect a single change point first, and if such a change point is found, it divides the data set into two parts, one consisting of messages to the left of the change point and the other consisting of messages to the right of the change point. The single change-point detection algorithm is now applied to each of the two smaller datasets. This is repeated recursively till no more change points are detected.
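
The recursive splitting described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 2: `find_change_points` and `detect_one` are hypothetical names, and the single change-point detector uses an absolute minimum segment length `min_len` (an assumption made here) in place of a fraction of the full data set.

```python
from collections import Counter


def detect_one(messages, gaps, min_len=3, delta=0.5):
    """Toy single change-point detector: best split index within the segment, or None."""
    n = len(messages)

    def dist(segment):
        return {m: c / len(segment) for m, c in Counter(segment).items()}

    def score(t):
        p, q = dist(messages[:t]), dist(messages[t:])
        l1 = sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in set(p) | set(q))
        return l1 + abs(sum(gaps[:t]) / t - sum(gaps[t:]) / (n - t))

    candidates = range(min_len, n - min_len + 1)
    if not candidates:
        return None
    best = max(candidates, key=score)
    return best if score(best) >= delta else None


def find_change_points(messages, gaps, lo=0, hi=None):
    """Recursively split [lo, hi) at detected change points; returns sorted global indices."""
    if hi is None:
        hi = len(messages)
    tau = detect_one(messages[lo:hi], gaps[lo:hi])
    if tau is None:
        return []
    split = lo + tau  # convert the segment-local index to a global index
    left = find_change_points(messages, gaps, lo, split)
    right = find_change_points(messages, gaps, split, hi)
    return left + [split] + right


# Three episodes with disjoint message sets and identical inter-arrival times.
messages = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
gaps = [1.0] * 30
change_points = find_change_points(messages, gaps)
```

The recursion stops on a segment as soon as the detector returns `None`, which mirrors the stopping rule of repeating until no more change points are detected.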

### 2.1. Analysis of CD

This section focuses on analyzing the proposed change detection algorithm. Section 2.1.1 shows that the computational complexity of the CD algorithm is linear in the number of data points. Section 2.1.2 contains the asymptotic analysis of the CD algorithm, while Section 2.1.3 has the finite sample results.

#### 2.1.1. Computational complexity of CD

In this section we discuss the computational complexities of Algorithm 1 and Algorithm 2. We first discuss the computational complexity of detecting a change point in the case of one change point. Algorithm 1 requires us to compute $\hat{D}(l)$ for every candidate index $l$. From the definition of $\hat{D}$ in (1), we only need to compute the empirical probability estimates $\hat{p}_L(l)$, $\hat{p}_R(l)$, and the empirical means of the inter-arrival time $\hat{\mathbb{E}}S_L(l)$, $\hat{\mathbb{E}}S_R(l)$, for every candidate index $l$.

We focus on the computation of $\hat{p}_L(l)$, $\hat{p}_R(l)$. Consider any message $m$ in the distribution. For each $m$ we can compute $\hat{p}_{L,m}(l)$, $\hat{p}_{R,m}(l)$ in constant time for every value of $l$ by using the neighbouring values $\hat{p}_{L,m}(l-1)$, $\hat{p}_{R,m}(l-1)$:

$$\hat{p}_{L,m}(l) = \frac{(l-1)\,\hat{p}_{L,m}(l-1) + \mathbb{1}\{X_{l-1}=m\}}{l}, \qquad \hat{p}_{R,m}(l) = \frac{(n-l+1)\,\hat{p}_{R,m}(l-1) - \mathbb{1}\{X_{l-1}=m\}}{n-l}. \tag{2}$$

The computation of $\hat{\mathbb{E}}S_L(l)$, $\hat{\mathbb{E}}S_R(l)$ for every value of $l$ is similar.

Performing the above computations for all $M$ messages results in a computational complexity of $O(nM)$. In the case of $k$ change points, it is straightforward to see that we require $O(nMk)$ computations. In much of our discussion, we assume $M$ and $k$ are constants and therefore we present the computational complexity results in terms of $n$ only.
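
The incremental idea above can be sketched with running counts: moving the candidate index by one position moves a single message from the right side to the left side, so each empirical quantity is updated in constant time per message type. The function name `all_split_scores` is hypothetical, and the sketch normalises by the exact number of points on each side, which is an assumption made here for simplicity.

```python
from collections import Counter


def all_split_scores(messages, gaps):
    """Return {tau: D_hat(tau)} for tau = 1..n-1 in O(n * M) time using running counts."""
    n = len(messages)
    alphabet = sorted(set(messages))
    left_counts = Counter()
    right_counts = Counter(messages)
    left_sum, right_sum = 0.0, float(sum(gaps))
    scores = {}
    for tau in range(1, n):
        moved = messages[tau - 1]  # the item that moves from the right side to the left
        left_counts[moved] += 1
        right_counts[moved] -= 1
        left_sum += gaps[tau - 1]
        right_sum -= gaps[tau - 1]
        l1 = sum(
            abs(left_counts[m] / tau - right_counts[m] / (n - tau)) for m in alphabet
        )
        scores[tau] = l1 + abs(left_sum / tau - right_sum / (n - tau))
    return scores


scores = all_split_scores(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0])
best_split = max(scores, key=scores.get)
```

Each loop iteration touches every message type once, which gives the $O(nM)$ total cost for a single pass over the data.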

Related work: Algorithm 2 executes the process of determining change points hierarchically. This idea was inspired by the work in (S. Matteson and A. James, 2013). However, the metric we use to detect change points is different from that of (S. Matteson and A. James, 2013). The metric used in (S. Matteson and A. James, 2013) leads to an $O(n^2)$ computational complexity. The change in metric necessitates a new analysis of the consistency of the CD algorithm, which we present in the next subsection. Further, for our metric, we are also able to derive sample complexity results, which are presented in a later subsection.

#### 2.1.2. The consistency of change-point detection

In this section we discuss the consistency of the change-point detection algorithm, i.e., when the number of data points goes to infinity one can accurately detect the location of the change points. In both this subsection and the next, we assume that the inter-arrival times of messages within each episode are i.i.d., and are independent (with possibly different distributions) across episodes.

###### Theorem 2.1 ().

For $\tilde{\gamma} \in (0,1)$, the limit $D(\tilde{\gamma}) := \lim_{n \to \infty} \hat{D}(\tilde{\gamma} n)$ is well-defined and attains its maximum at one of the change points if there is at least one change point.

The proof of the above theorem is easy when there is only one change point. To study the case of multiple change points, (S. Matteson and A. James, 2013) exploits the fact that their metric for change-point detection is convex between change points. However, the TV distance we use is not convex between two change points. We work around this problem in the proof of Theorem 2.1 by showing that $D(\tilde{\gamma})$ is increasing to the left of the first change point, unimodal/increasing/decreasing between any two change points, and decreasing to the right of the last change point. Hence, any global maximum of $D(\tilde{\gamma})$ for $\tilde{\gamma} \in (0,1)$ is located at a change point.

#### 2.1.3. The sample complexity of change-point detection

In the previous subsection, we studied the CD algorithm in the limit as $n \to \infty$. In this section, we analyze the algorithm when there are only a finite number of samples. For this purpose, we assume that the inter-arrival time distribution of messages has sub-Gaussian tails.

We say that Algorithm CD is correct if the following conditions are satisfied. Let $\epsilon$ be the desired accuracy in the estimation of the change point.

###### Definition 2.2 ().

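Given $\epsilon > 0$, Algorithm CD is correct if

• there are $k$ change points $\gamma_1, \dots, \gamma_k$ and the algorithm gives $\hat{\gamma}_1, \dots, \hat{\gamma}_k$ such that $\max_i |\hat{\gamma}_i - \gamma_i| < \epsilon$.

• there is no change point and $\hat{D}(\hat{\gamma} n) < \delta$.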

Now we can state the correctness theorem for Algorithm 2. The sample complexity is shown to scale logarithmically with the number of change points.

###### Theorem 2.3 ().

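Algorithm 2 is correct in the sense of Definition 2.2 with probability $1-\beta$ if

$$n = \Omega\left(\max\left(\frac{\log\left(\frac{2k+1}{\beta}\right)}{\epsilon^2},\ \frac{M^{1+c}}{\epsilon^{2(1+c)}}\right)\right),$$

for sufficiently small $\epsilon$ and for any $c > 0$.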

The proof of this theorem uses the method of types and Pinsker’s inequality.

### 2.2. Additional Notations for Proofs

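We recall the notation used for the proofs in this section. The computation of a change point centers around computing the metric $\hat{D}$. We defined $\hat{D}$ in (1) as the sum of the $\ell_1$ distance between the empirical distributions to the left and right of index $l$ and the absolute difference in mean inter-arrival time to the left and right of $l$. So

$$\hat{D}(l) = \|\hat{p}_L(l) - \hat{p}_R(l)\|_1 + |\hat{\mathbb{E}}S_L(l) - \hat{\mathbb{E}}S_R(l)|. \tag{3}$$

The empirical distributions $\hat{p}_L(l)$, $\hat{p}_R(l)$ have $M$ components. For each message $m$, we can write

$$\hat{p}_{L,m}(l) = \frac{\sum_{i=1}^{l-1} \mathbb{1}\{X_i = m\}}{l}, \tag{4}$$

$$\hat{p}_{R,m}(l) = \frac{\sum_{i=l}^{n} \mathbb{1}\{X_i = m\}}{n-l}. \tag{5}$$

The mean inter-arrival times $\hat{\mathbb{E}}S_L(l)$ and $\hat{\mathbb{E}}S_R(l)$ are defined as

$$\hat{\mathbb{E}}S_L(l) = \frac{\sum_{i=1}^{l-1} \Delta t_i}{l}, \tag{6}$$

$$\hat{\mathbb{E}}S_R(l) = \frac{\sum_{i=l}^{n} \Delta t_i}{n-l}. \tag{7}$$

We sometimes write $\hat{D}(l)$ as $\hat{D}(\tilde{\gamma} n)$, where the argument $l = \tilde{\gamma} n$. The symbol $\tilde{\gamma}$ denotes the index as a fraction of $n$, and it takes discrete values between $0$ and $1$.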

### 2.3. Proof of Theorem 2.1

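Proof for single change point case: We first discuss the single change point case. Let the change point be at index $\tau = \gamma n$. The location of the change point is estimated by the point $l$ at which $\hat{D}(l)$ is maximized. We will show that when $n$ is large, the argument $\tilde{\gamma}$ at which $\hat{D}(\tilde{\gamma} n)$ is maximized converges to the change point $\gamma$. The proof for the single change-point case is rather easy; the novelty lies in the case of multiple change points. We present the single change-point case here for completeness.

Suppose all the points to the left of the change point $\tau$ are chosen i.i.d. from a distribution $F$ and all the points to the right of $\tau$ are chosen i.i.d. from a distribution $G$, where $F \neq G$. Also, say the inter-arrival times $\Delta t_i$ are chosen i.i.d. from distributions $F_t$ and $G_t$ to the left and right of the change point $\tau$, respectively. Let $l = \tilde{\gamma} n$ be the index of any data point and $\tau = \gamma n$ the index of the change point.

Case 1 ($\tilde{\gamma} \le \gamma$): Suppose we consider the value of $\hat{D}(l)$ to the left of the actual change point, i.e., $l \le \tau$ or $\tilde{\gamma} \le \gamma$. The distribution to the left of $l$, $\hat{p}_L(l)$, has all the data points chosen from the distribution $F$. So $\hat{p}_L(l)$ is the empirical estimate for $F$. On the other hand, the data points to the right of $l$ come from a mixture of the distributions $F$ and $G$: $\hat{p}_R(l)$ has a $\frac{\gamma - \tilde{\gamma}}{1 - \tilde{\gamma}}$ fraction of samples from $F$ and a $\frac{1 - \gamma}{1 - \tilde{\gamma}}$ fraction of samples from $G$. Figure 2 below explains it pictorially.

So $\hat{p}_L(l)$ and $\hat{p}_R(l)$ defined in (4) and (5) converge to

$$\hat{p}_L(l) \to F, \qquad \hat{p}_R(l) \to \frac{\gamma - \tilde{\gamma}}{1 - \tilde{\gamma}} F + \frac{1 - \gamma}{1 - \tilde{\gamma}} G. \tag{8}$$

Similarly, we can say that the empirical mean estimates $\hat{\mathbb{E}}S_L(l)$ and $\hat{\mathbb{E}}S_R(l)$ converge to

$$\hat{\mathbb{E}}S_L(l) \to \mathbb{E}F_t, \qquad \hat{\mathbb{E}}S_R(l) \to \frac{\gamma - \tilde{\gamma}}{1 - \tilde{\gamma}} \mathbb{E}F_t + \frac{1 - \gamma}{1 - \tilde{\gamma}} \mathbb{E}G_t. \tag{9}$$

We can combine (8) and (9) to say that $\hat{D}(\tilde{\gamma} n) \to D(\tilde{\gamma})$, where

$$\hat{D}(\tilde{\gamma} n) = \|\hat{p}_L(\tilde{\gamma} n) - \hat{p}_R(\tilde{\gamma} n)\|_1 + |\hat{\mathbb{E}}S_L(\tilde{\gamma} n) - \hat{\mathbb{E}}S_R(\tilde{\gamma} n)| \to D(\tilde{\gamma}) := \frac{1 - \gamma}{1 - \tilde{\gamma}} \left( \|F - G\|_1 + |\mathbb{E}F_t - \mathbb{E}G_t| \right). \tag{10}$$

Note that from the definition of $D$, $D(\gamma) = \|F - G\|_1 + |\mathbb{E}F_t - \mathbb{E}G_t|$.

Case 2 ($\tilde{\gamma} > \gamma$): Proceeding in a similar way to Case 1, we can show

$$\hat{D}(\tilde{\gamma} n) \to D(\tilde{\gamma}) := \frac{\gamma}{\tilde{\gamma}} \left( \|F - G\|_1 + |\mathbb{E}F_t - \mathbb{E}G_t| \right). \tag{11}$$

From Case 1 and Case 2, we have

$$\begin{aligned} \tilde{\gamma} \le \gamma: \quad & \hat{D}(\tilde{\gamma} n) \to D(\tilde{\gamma}) = \frac{1 - \gamma}{1 - \tilde{\gamma}} D(\gamma), \\ \tilde{\gamma} > \gamma: \quad & \hat{D}(\tilde{\gamma} n) \to D(\tilde{\gamma}) = \frac{\gamma}{\tilde{\gamma}} D(\gamma). \end{aligned} \tag{12}$$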

Equation (12) shows that the maximum of $D(\tilde{\gamma})$ is attained at $\tilde{\gamma} = \gamma$.
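
As a quick numerical sanity check of the piecewise limit in (12), the sketch below evaluates it on a grid and locates the maximizer. The values chosen for the change point (`gamma = 0.3`) and for the value of the limit at the change point (`d_at_change = 1.5`) are arbitrary illustrative assumptions, and `limit_metric` is a hypothetical helper name.

```python
def limit_metric(gamma_tilde, gamma, d_at_change):
    """Piecewise limit from (12): rises up to the change point, then decays."""
    if gamma_tilde <= gamma:
        return (1.0 - gamma) / (1.0 - gamma_tilde) * d_at_change
    return gamma / gamma_tilde * d_at_change


gamma, d_at_change = 0.3, 1.5
grid = [i / 100 for i in range(1, 100)]  # candidate fractions 0.01, ..., 0.99
best = max(grid, key=lambda g: limit_metric(g, gamma, d_at_change))
```

On this grid the maximizer `best` coincides with `gamma`, and the value of the limit there equals `d_at_change`, in line with the two branches of (12).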
Proof for multiple change point case: Suppose we have more than one change point. We plan to show that $D(\tilde{\gamma})$ is well-defined and is increasing to the left of the first change point, unimodal/increasing/decreasing between two consecutive change points, and decreasing to the right of the last change point. If this holds, then we can conclude that one of the global maxima of $D(\tilde{\gamma})$ occurs at a change point. Using techniques similar to those of the single change point case, it is easy to show that $D(\tilde{\gamma})$ is increasing to the left of the first change point and decreasing to the right of the last change point (the proof is left to the reader as an exercise). Hence, it remains to show that $D(\tilde{\gamma})$ is unimodal/increasing/decreasing between two consecutive change points. Lemma 2.4 proves this result.

###### Lemma 2.4 ().

$D(\tilde{\gamma})$ is unimodal/increasing/decreasing between two consecutive change points when there is more than one change point.

###### Proof.

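Consider any two consecutive change points at indices $\tau_1 = \gamma_1 n$ and $\tau_2 = \gamma_2 n$. Suppose the data points are drawn i.i.d. from a distribution $G$ between the change points $\tau_1$ and $\tau_2$. The data points to the left of $\tau_1$ are possibly drawn independently from more than one distribution. But, for the asymptotic analysis, we can assume that the data points to the left of $\tau_1$ are drawn i.i.d. from a mixture of these distributions. Let us call this mixture distribution $F$. Similarly, the data points to the right of $\tau_2$ can be assumed to be drawn i.i.d. from a mixture distribution $H$. Let the inter-arrival times be drawn from a distribution $F_t$ to the left of $\tau_1$, $G_t$ between $\tau_1$ and $\tau_2$, and $H_t$ to the right of $\tau_2$.

Suppose we consider the region between the change points, i.e., $\gamma_1 \le \tilde{\gamma} \le \gamma_2$. So $\hat{p}_L(\tilde{\gamma} n)$ is a mixture of a $\frac{\gamma_1}{\tilde{\gamma}}$ fraction of samples from $F$ and a $\frac{\tilde{\gamma} - \gamma_1}{\tilde{\gamma}}$ fraction from $G$, while $\hat{p}_R(\tilde{\gamma} n)$ is a mixture of a $\frac{\gamma_2 - \tilde{\gamma}}{1 - \tilde{\gamma}}$ fraction from $G$ and a $\frac{1 - \gamma_2}{1 - \tilde{\gamma}}$ fraction from $H$. So

$$\hat{p}_L(\tilde{\gamma} n) \to \frac{\gamma_1}{\tilde{\gamma}} F + \frac{\tilde{\gamma} - \gamma_1}{\tilde{\gamma}} G, \qquad \hat{p}_R(\tilde{\gamma} n) \to \frac{\gamma_2 - \tilde{\gamma}}{1 - \tilde{\gamma}} G + \frac{1 - \gamma_2}{1 - \tilde{\gamma}} H. \tag{13}$$

Similarly, the mean inter-arrival time of samples to the left of $\tilde{\gamma} n$ converges to $\frac{\gamma_1}{\tilde{\gamma}} \mathbb{E}F_t + \frac{\tilde{\gamma} - \gamma_1}{\tilde{\gamma}} \mathbb{E}G_t$, and the mean inter-arrival time to the right of $\tilde{\gamma} n$ converges to $\frac{\gamma_2 - \tilde{\gamma}}{1 - \tilde{\gamma}} \mathbb{E}G_t + \frac{1 - \gamma_2}{1 - \tilde{\gamma}} \mathbb{E}H_t$. Combining this with (13), we can say that

$$\hat{D}(\tilde{\gamma} n) \to D(\tilde{\gamma}) = \left\| \frac{\gamma_1}{\tilde{\gamma}} (F - G) + \frac{1 - \gamma_2}{1 - \tilde{\gamma}} (G - H) \right\|_1 + \left| \frac{\gamma_1}{\tilde{\gamma}} \mathbb{E}(F_t - G_t) + \frac{1 - \gamma_2}{1 - \tilde{\gamma}} \mathbb{E}(G_t - H_t) \right|. \tag{14}$$

If we expand the $\ell_1$ norm as a sum over the probabilities of individual messages, we can write $D$ from (14) as a function of $\tilde{\gamma}$ as

$$D(\tilde{\gamma}) = \sum_{i=1}^{M+1} \left| \frac{a_i}{\tilde{\gamma}} + \frac{b_i}{1 - \tilde{\gamma}} \right| \tag{15}$$

for some constants $a_i, b_i$. The function $D$ from (15) is only well defined over $[\gamma_1, \gamma_2]$. For the purpose of this proof, with some abuse of notation, we assume the function to have the same definition outside $[\gamma_1, \gamma_2]$. We then show that $D(\tilde{\gamma})$ defined in (15) is unimodal/increasing/decreasing as a function of $\tilde{\gamma}$ on $(0,1)$. This would naturally imply that $D(\tilde{\gamma})$ is unimodal/increasing/decreasing in $[\gamma_1, \gamma_2]$. The rest of the proof deals with this analysis.

Without loss of generality we can assume $a_i > 0$. We can expand (15) as

$$D(\tilde{\gamma}) = \sum_{a_i > 0,\, b_i > 0} \left| \frac{a_i}{\tilde{\gamma}} + \frac{b_i}{1 - \tilde{\gamma}} \right| + \sum_{a_i > 0,\, b_i < 0} \left| \frac{a_i}{\tilde{\gamma}} - \frac{-b_i}{1 - \tilde{\gamma}} \right| \tag{16}$$

$$= \frac{a}{\tilde{\gamma}} + \frac{b}{1 - \tilde{\gamma}} + \sum_{a_i,\, d_i > 0} \left| \frac{a_i}{\tilde{\gamma}} - \frac{d_i}{1 - \tilde{\gamma}} \right|, \tag{17}$$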

where when , and . for . We can assume w.l.o.g. that are in increasing order. Suppose .

 D(˜γ)=a−∑i

Let and . So for

 (18) D(˜γ) =a(s)˜γ+b(s)1−˜γ,asas+bs<˜γ

for and it is a decreasing function of . is an increasing function of . Based on where changes sign w.r.t we have the following cases. Note that and cannot both be negative for any value of . denotes the derivative of wherever it is defined.

• for all values of . is a convex function of and hence is unimodal.

• for all values of and changes sign at , i.e., . So for , and for , . Hence, is a unimodal function of between and .

• changes sign at , i.e., and for all . So for , is convex, and for , is positive. Hence, is either increasing or unimodal between and .

• and . Also . So for is decreasing, for is convex and for is increasing. Hence is unimodal.
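
The consequence of the case analysis that matters for Theorem 2.1 is that a function of the form used in (16) attains its maximum over an interval at one of the interval's endpoints. The sketch below checks this numerically for one arbitrary choice of coefficients; the coefficient values, the interval endpoints, and the helper name `limit_metric` are illustrative assumptions, not quantities from the paper.

```python
def limit_metric(gamma, a, b):
    """D(gamma) = sum_i |a_i / gamma + b_i / (1 - gamma)| for gamma in (0, 1)."""
    return sum(abs(ai / gamma + bi / (1.0 - gamma)) for ai, bi in zip(a, b))


# Arbitrary mixed-sign coefficients and an arbitrary pair of consecutive change points.
a = [0.2, -0.1, 0.3]
b = [-0.4, 0.25, 0.1]
gamma_1, gamma_2 = 0.2, 0.7

grid = [gamma_1 + (gamma_2 - gamma_1) * i / 100 for i in range(101)]
values = [limit_metric(g, a, b) for g in grid]

interior_max = max(values[1:-1])           # largest value strictly inside the interval
endpoint_max = max(values[0], values[-1])  # largest value at the two change points
```

For these coefficients `endpoint_max` exceeds `interior_max`, so the largest value on the grid sits at a change point rather than strictly between the two.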

### 2.4. Proof of Theorem 2.3

Proof for single change point case: We first characterize the single change point case in the finite sample setting, as described in Algorithm 1. In Lemmas 2.5-2.7 we develop the characteristics of $\hat{D}$. We analyze the concentration of $\hat{D}(\tilde{\gamma} n)$ in the lemma below.

###### Lemma 2.5 ().

w.p. for all values of .

Lemma 2.5 shows that the empirical estimate $\hat{D}(\tilde{\gamma} n)$ is very close to the actual value $D(\tilde{\gamma})$ with high probability. Now suppose Algorithm 1 finds a change point at $\hat{\gamma}$. We next show in Lemma 2.6 that the value of the metric at the estimated change point, $\hat{D}(\hat{\gamma} n)$, is very close to the value of $D$ at the change point $\gamma$.

###### Lemma 2.6 ().

w.p.

Finally, in Lemma 2.7 we show that the estimated change point $\hat{\gamma}$ is close to $\gamma$ with high probability.

###### Lemma 2.7 ().

w.p. .

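Now we can prove the correctness of Algorithm 1 as per Definition 2.2 with high probability. We show that Algorithm 1 is correct for a given accuracy $\epsilon$, in the sense of Definition 2.2, with probability $1-\beta$ when $n$ is sufficiently large. To do so, we upper bound the probability that Algorithm 1 is not correct. From Definition 2.2, this happens when

• Given $0 < \gamma < 1$, $E_1^c$ occurs, where $E_1 = \{\hat{D}(\hat{\gamma} n) > \delta,\ |\hat{\gamma} - \gamma| < \epsilon\}$.

• Given that a change point does not exist, $\hat{D}(\hat{\gamma} n) > \delta$. When a change point does not exist, we write $\gamma = 0$ or $\gamma = 1$.

So

$$P(\text{Algorithm 1 is NOT correct}) \le P(E_1^c \mid 0 < \gamma < 1) + P(\hat{D}(\hat{\gamma}) > \delta \mid \gamma = 0 \text{ or } 1). \tag{19}$$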

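We analyse each event separately in (19). Suppose no change point exists and say all the data points are drawn from the same distribution $F$. We use Sanov's theorem and Pinsker's inequality to show that

$$\begin{aligned} P(\hat{D}(\hat{\gamma}) > \delta \mid \gamma = 0 \text{ or } 1) &= P(\|\hat{p}_L(\hat{\gamma} n) - \hat{p}_R(\hat{\gamma} n)\|_1 > \delta \mid \gamma = 0 \text{ or } 1) \\ &\le P(\|\hat{p}_L(\hat{\gamma} n) - F\|_1 > \delta/2 \mid \gamma = 0 \text{ or } 1) + P(\|\hat{p}_R(\hat{\gamma} n) - F\|_1 > \delta/2 \mid \gamma = 0 \text{ or } 1) \\ &\le (n\hat{\gamma} + 1)^M \exp(-n\delta^2/8) + ((1 - \hat{\gamma})n + 1)^M \exp(-n\delta^2/8) \\ &\le (n + 2)^M \exp(-n\delta^2/8). \end{aligned} \tag{20}$$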
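
For reference, the way the method of types and Pinsker's inequality combine to give an exponential term of the kind appearing in (20) can be sketched as follows. This is the standard argument, written under the assumption that an estimate $\hat{p}$ is built from $N$ i.i.d. samples of $F$ over an alphabet of $M$ messages; the symbols $N$ and $\mathcal{E}$ are introduced here only for illustration.

```latex
\begin{align*}
\mathcal{E} &= \{Q : \|Q - F\|_1 > \delta/2\}, \\
P(\hat{p} \in \mathcal{E}) &\le (N+1)^{M} \exp\Big(-N \inf_{Q \in \mathcal{E}} \mathrm{KL}(Q \,\|\, F)\Big)
  && \text{(method of types / Sanov)} \\
\mathrm{KL}(Q \,\|\, F) &\ge \tfrac{1}{2}\,\|Q - F\|_1^2 > \tfrac{\delta^2}{8}
  \quad \text{for all } Q \in \mathcal{E}
  && \text{(Pinsker)} \\
\Longrightarrow \quad P\big(\|\hat{p} - F\|_1 > \delta/2\big) &\le (N+1)^{M} \exp\big(-N\delta^2/8\big).
\end{align*}
```

The polynomial prefactor counts the possible empirical distributions (types) on $N$ samples, and Pinsker's inequality converts the $\ell_1$ threshold $\delta/2$ into the exponent $\delta^2/8$.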

Now, we look at the case when a change point exists at . We assume that is chosen such that . Hence

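$$\begin{aligned} P(E_1^c \mid 0 < \gamma < 1) \le{} & P(\hat{D}(\hat{\gamma} n) < \delta \mid 0 < \gamma < 1) + P(|\hat{\gamma} - \gamma| > \epsilon \mid \hat{D}(\gamma) > \delta,\ 0 < \gamma < 1) \\ & + P(\alpha < \hat{\gamma} < 1 - \alpha \mid \hat{D}(\gamma) > \delta,\ |\hat{\gamma} - \gamma| < \epsilon,\ 0 < \gamma < 1). \end{aligned} \tag{21}$$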

Given the assumption on , . Lemma 2.7 gives a bound on . Also, using lemma 2.6 and assuming that is chosen such that ,

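$$\begin{aligned} P(\hat{D}(\hat{\gamma} n) < \delta \mid 0 < \gamma < 1) &\le P(\hat{D}(\hat{\gamma} n) < \delta \mid 0 < \gamma < 1,\ |\hat{D}(\hat{\gamma} n) - D(\gamma)| < \epsilon) + P(|\hat{D}(\hat{\gamma} n) - D(\gamma)| > \epsilon) \\ &\le 0 + 3n \exp\left(-\frac{\epsilon^2 \alpha^2}{128 \sigma^2} n + M \log(n)\right). \end{aligned} \tag{22}$$

Combining (21) and (22) we have

$$P\big((\hat{D}(\hat{\gamma} n) > \delta,\ |\gamma - \hat{\gamma}| < \epsilon)^c \mid 0 < \gamma < 1\big) \le 3n \exp\left(-\frac{\epsilon^2 \alpha^2}{128 \sigma^2} n + M \log(n)\right) + 3n \exp\left(-\frac{\epsilon^2 D^2(\gamma)}{512 \sigma^2} n + M \log(n)\right). \tag{23}$$

Putting together (20) and (23) into (19), we have

$$P(\text{Algorithm 1 is NOT correct}) \le 7 \exp\left(-\frac{(D(\gamma) - \epsilon)^2 \epsilon^2 \alpha^2}{512 \max(\sigma^2, 1)} n + M \log(n)\right). \tag{24}$$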

Finally, under the assumptions

• ,

• ,

• .

we can derive the sample complexity result for one change point case from (24).

Proof for multiple change point case: Similar to the single change point case, we first characterize the estimated change points for finite $n$ in Lemmas 2.8-2.10 below.

###### Lemma 2.8 ().

w.p. at least
for all values of .

Lemma 2.9 below can be proved in a similar way to Lemma 2.6 in the single change point case.

###### Lemma 2.9 ().

w.p. at least
for any change point .

###### Lemma 2.10 ().

w.p. at least
for some constant and any change point .

Now, we can state the correctness result for Algorithm 2. Algorithm 2 is correct for a given accuracy $\epsilon$, in the sense of Definition 2.2, with probability $1-\beta$.

We upper bound the probability that Algorithm 2 is not correct. From Definition 2.2, Algorithm 2 is correct when

• Algorithm 1 is correct every time it is called by Algorithm 2.

The maximum number of times Algorithm 1 would be applied is $2k+1$ if it is correct every time it is applied. Out of these $2k+1$ times, it should return a change point $k$ times and return no change point $k+1$ times. So

$$\begin{aligned} P(\text{Algorithm 2 is NOT correct}) \le{} & k\, P(\text{Algorithm 1 does NOT detect a change point when one exists for data set } X_L, \dots, X_H) \\ & + (k+1)\, P(\text{Algorithm 1 returns a change point when one does not exist}). \end{aligned} \tag{25}$$

We assume that $X_L, \dots, X_H$ is at least of size $\alpha n$, i.e., an episode is at least $\alpha n$ samples long. Let $D^*$ denote the minimum value of the metric $D$ at a global maximum for the reduced problem on $X_L, \dots, X_H$, over all possible values of $L, H$ for which Algorithm 1 is applied. From the correctness result for one change point, we have that

$$P(\text{Algorithm 1 does NOT detect a change point when one exists for data set } X_L, \dots, X_H) \le 3n \exp\left(-\frac{\epsilon^2 \alpha^2}{128 \sigma^2} n + M \log(n)\right) + 3n \exp\left(-\frac{\epsilon^2 (D^*)^2}{512 \sigma^2} n + M \log(n)\right) \tag{26}$$

and

$$P(\text{Algorithm 1 returns a change point when one does not exist}) \le (n + 2)^M \exp(-n\delta^2/8). \tag{27}$$

Combining the above two cases, we get

$$P(\text{Algorithm 2 is NOT correct}) \le 7(2k+1) \exp\left(-\frac{(D^* - \epsilon)^2 \epsilon^2 \alpha^4}{512 \max(\sigma^2, k+1)} n + M \log(n)\right). \tag{28}$$

Finally, under the assumptions

• ,

• ,

• .

we can derive the sample complexity result for $k$ change points from (28).

### 2.5. Proof of Lemma 2.5

Since ,

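$$\begin{aligned} P(|\hat{D}(\tilde{\gamma} n) - D(\tilde{\gamma} n)| > \epsilon) \le{} & P\left(\big|\, \|\hat{p}(\tilde{\gamma} n) - \hat{q}(\tilde{\gamma} n)\| - \|p(\tilde{\gamma} n) - q(\tilde{\gamma} n)\| \,\big| > \frac{\epsilon}{2}\right) \\ & + P\left(\big|\, |\hat{m}_1(\tilde{\gamma} n) - \hat{m}_2(\tilde{\gamma} n)| - |m_1(\tilde{\gamma} n) - m_2(\tilde{\gamma} n)| \,\big| > \frac{\epsilon}{2}\right). \end{aligned}$$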

First we focus on finding an upper bound to the probability .

 (29) P(|∥ˆp(˜γn)−ˆq(