###### Abstract

We study the following detection problem: given $M$ sequences of samples, exactly one outlier sequence must be detected. Each typical sequence contains $n$ independent and identically distributed (i.i.d.) continuous observations from a known distribution $\pi$, and the outlier sequence contains $n$ i.i.d. observations from an outlier distribution $\mu$, which is distinct from $\pi$ but otherwise unknown. A universal test based on KL divergence is built to approximate the maximum likelihood test, with known $\pi$ and unknown $\mu$. The test employs a KL divergence estimator based on data-dependent partitions. This estimator is shown to converge to its true value exponentially fast when the density ratio satisfies $0 < K_1 \le \frac{d\mu}{d\pi} \le K_2$, where $K_1$ and $K_2$ are positive constants, which in turn implies that the test is exponentially consistent. The performance of the test is compared with that of a recently introduced test for this problem based on the machine learning approach of maximum mean discrepancy (MMD). We identify regimes in which the KL divergence based test is better than the MMD based test.

Universal Outlying Sequence Detection For Continuous Observations

Yuheng Bu   Shaofeng Zou   Yingbin Liang   Venugopal V. Veeravalli

University of Illinois at Urbana-Champaign

Syracuse University

Email: bu3@illinois.edu, szou02@syr.edu, yliang06@syr.edu, vvv@illinois.edu

## 1 Introduction

In this paper, we study an outlying sequence detection problem, in which there are $M$ sequences of samples out of which one outlier sequence needs to be detected. Each typical sequence consists of $n$ independent and identically distributed (i.i.d.) continuous observations drawn from a known distribution $\pi$, whereas the outlier sequence consists of $n$ i.i.d. samples drawn from a distribution $\mu$, which is distinct from $\pi$, but otherwise unknown. The goal is to design a test to detect the outlier sequence.

Such a model is useful in many applications [1]. For example, in cognitive wireless networks, signals follow different distributions depending on whether the channel is busy or vacant. The goal in such a network is to identify vacant channels among busy channels based on their corresponding signals, in order to utilize the vacant channels for improving spectral efficiency. Such a problem was studied in [2] and [3] under the assumption that both $\pi$ and $\mu$ are known. Other applications include anomaly detection in large data sets [4, 5], event detection and environment monitoring in sensor networks [6], understanding of visual search in humans and animals [7], and optimal search and target tracking [8].

The outlying sequence detection problem with discrete $\pi$ and $\mu$ was studied in [9]. A universal test based on the generalized likelihood ratio test was proposed and shown to be exponentially consistent. The error exponent was further shown to be optimal as the number of sequences goes to infinity. The test utilizes empirical distributions to estimate $\pi$ and $\mu$, and is therefore applicable only for the case where $\pi$ and $\mu$ are discrete.

In this paper, we study the case where the distributions $\pi$ and $\mu$ are continuous and $\mu$ is unknown. We construct a Kullback-Leibler (KL) divergence based test, and further show that this test is exponentially consistent.

Our exploration of the problem starts with the case in which both $\pi$ and $\mu$ are known, and the maximum likelihood test is optimal. An interesting observation is that the test statistic of the optimal test converges to $D(\mu\|\pi)$ as the sample size goes to infinity if the sequence is the outlier. This motivates the use of a KL divergence estimator to approximate the test statistic for the case when $\mu$ is unknown. We apply a divergence estimator based on the idea of data-dependent partitions [10], which was shown to be consistent. Our first contribution here is to show that such an estimator converges exponentially fast to its true value when the density ratio satisfies the boundedness condition $0 < K_1 \le \frac{d\mu}{d\pi} \le K_2$, where $K_1$ and $K_2$ are positive constants. We further design a KL divergence based test using such an estimator and show that the test is exponentially consistent.

The rest of the paper is organized as follows. In Section 2, we describe the problem formulation. In Section 3, we present the KL divergence based test and establish its exponential consistency. In Section 4, we review the maximum mean discrepancy (MMD) based test. In Section 5, we provide a numerical comparison of our KL divergence based test and the MMD based test. All detailed proofs are given in the appendix.

## 2 Problem Model

Throughout the paper, random variables are denoted by capital letters, and their realizations are denoted by the corresponding lower-case letters. All logarithms are with respect to the natural base.

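We study an outlier detection problem, in which there are in total $M$ data sequences denoted by $Y^{(i)}$ for $1\le i\le M$. Each data sequence $Y^{(i)}$ consists of $n$ i.i.d. samples $Y^{(i)}_1,\ldots,Y^{(i)}_n$ drawn from either a typical distribution $\pi$ or an outlier distribution $\mu$, where $\pi$ and $\mu$ are continuous, i.e., defined on $(\mathbb{R},\mathcal{B}_{\mathbb{R}})$, and $\mu\neq\pi$. We use the notation $y^{(i)}\triangleq(y^{(i)}_1,\ldots,y^{(i)}_n)$, where $y^{(i)}_k\in\mathbb{R}$ denotes the $k$-th observation of the $i$-th sequence. We assume that there is exactly one outlier among the $M$ sequences. If the $i$-th sequence is the outlier, the joint distribution of all the observations is given by

$$p_i(y^{Mn})=p_i(y^{(1)},\ldots,y^{(M)})=\prod_{k=1}^{n}\Big\{\mu(y^{(i)}_k)\prod_{j\neq i}\pi(y^{(j)}_k)\Big\}.$$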

We are interested in the scenario in which the outlier distribution $\mu$ is unknown a priori, but the typical distribution $\pi$ is known exactly. This is reasonable because in practical scenarios, systems typically start without outliers, and it is not difficult to collect sufficient information about $\pi$.

Our goal is to build a distribution-free test to detect the outlier sequence generated by $\mu$. The test can be captured by a universal rule $\delta:\mathbb{R}^{Mn}\to\{1,\ldots,M\}$, which must not depend on $\mu$.

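The maximum error probability, which is a function of the detector $\delta$ and $(\pi,\mu)$, is defined as

$$e(\delta,\pi,\mu)\triangleq\max_{i=1,\ldots,M}\int_{y^{Mn}:\,\delta(\pi,y^{Mn})\neq i}p_i(y^{Mn})\,dy^{Mn},$$

and the corresponding error exponent is defined as

$$\alpha(\delta,\pi,\mu)\triangleq\lim_{n\to\infty}-\frac{1}{n}\log e(\delta,\pi,\mu).$$

A test is said to be universally consistent if

$$\lim_{n\to\infty}e(\delta,\pi,\mu)=0,$$

for any $\mu\neq\pi$. It is said to be universally exponentially consistent if

$$\lim_{n\to\infty}\alpha(\delta,\pi,\mu)>0,$$

for any $\mu\neq\pi$.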
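To make the definitions of $e(\delta,\pi,\mu)$ and the error exponent concrete, the following sketch estimates the error probability of a detector by Monte Carlo simulation. Everything specific here is an assumption made for illustration: the Gaussian pair $\pi=\mathcal{N}(0,1)$ and $\mu=\mathcal{N}(0,4)$, the value $M=3$, the names `error_rate` and `largest_second_moment`, and in particular the stand-in rule itself, which is a simple heuristic and not one of the tests studied in this paper.

```python
import math
import random

def largest_second_moment(sequences):
    """Stand-in detector (hypothetical): pick the sequence with the largest mean of y^2."""
    scores = [sum(y * y for y in seq) / len(seq) for seq in sequences]
    return max(range(len(scores)), key=scores.__getitem__)

def error_rate(detector, n, M=3, trials=2000, seed=0):
    """Monte Carlo estimate of P_i{delta != i} with the outlier fixed at index i = 0.

    For a rule that treats all sequences alike, this probability does not
    depend on i, so it also estimates the maximum over i.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        # Assumed model: typical sequences ~ N(0, 1), outlier sequence ~ N(0, 2^2).
        sequences = [
            [rng.gauss(0.0, 2.0 if i == 0 else 1.0) for _ in range(n)]
            for i in range(M)
        ]
        if detector(sequences) != 0:
            errors += 1
    return errors / trials

for n in (5, 10, 20):
    e = error_rate(largest_second_moment, n)
    # -(1/n) log e is a finite-n proxy for the error exponent (infinite if no errors were seen).
    proxy = -math.log(e) / n if e > 0 else float("inf")
    print(n, e, proxy)
```

The proxy in the last column is only a rough finite-sample indication; the exponent itself is defined as a limit in $n$.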

## 3 KL divergence based test

We first introduce the optimal test when both $\pi$ and $\mu$ are known, which is the maximum likelihood test. We then construct a KL divergence estimator and prove its exponential consistency. Next, we employ the KL divergence estimator to approximate the test statistic of the optimal test for the outlying sequence detection problem, and construct the KL divergence based test.

### 3.1 Optimal test with π and μ known

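If both $\pi$ and $\mu$ are known, the optimal test for the outlying sequence detection problem is the maximum likelihood test:

$$\delta_{\mathrm{ML}}(y^{Mn},\pi,\mu)=\arg\max_{1\le i\le M}p_i(y^{Mn}). \tag{1}$$

By normalizing with $\pi(y^{Mn})\triangleq\prod_{j=1}^{M}\prod_{k=1}^{n}\pi(y^{(j)}_k)$, (1) is equivalent to

$$\delta_{\mathrm{ML}}(y^{Mn},\pi,\mu)=\arg\max_{1\le i\le M}\frac{p_i(y^{Mn})}{\pi(y^{Mn})}=\arg\max_{1\le i\le M}\left\{\frac{1}{n}\sum_{k=1}^{n}\log\frac{\mu(y^{(i)}_k)}{\pi(y^{(i)}_k)}\right\}=\arg\max_{1\le i\le M}L_i,$$

where

$$L_i\triangleq\frac{1}{n}\sum_{k=1}^{n}\log\frac{\mu(y^{(i)}_k)}{\pi(y^{(i)}_k)}. \tag{2}$$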
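The maximum likelihood rule can be sketched in a few lines: compute the average log-likelihood ratio $L_i$ of each sequence and return the index with the largest value. The Gaussian pair, the values of $M$ and $n$, and the function name `ml_outlier_test` are assumptions chosen for illustration only.

```python
import math
import random
from statistics import NormalDist

def ml_outlier_test(sequences, pi, mu):
    """Return (detected index, scores), where scores[i] is L_i from (2)."""
    scores = []
    for seq in sequences:
        # L_i = (1/n) * sum_k log( mu(y_k) / pi(y_k) )
        llr = [math.log(mu.pdf(y)) - math.log(pi.pdf(y)) for y in seq]
        scores.append(sum(llr) / len(seq))
    detected = max(range(len(scores)), key=scores.__getitem__)
    return detected, scores

# Illustrative setup (assumed): M = 4, n = 500, pi = N(0, 1), mu = N(0, 2^2),
# with the outlier placed at index 2.
rng = random.Random(0)
pi, mu = NormalDist(0.0, 1.0), NormalDist(0.0, 2.0)
M, n, outlier = 4, 500, 2
sequences = [
    [rng.gauss(0.0, 2.0 if i == outlier else 1.0) for _ in range(n)]
    for i in range(M)
]
detected, scores = ml_outlier_test(sequences, pi, mu)
print(detected, [round(s, 3) for s in scores])
```

In this setup the score of the outlier sequence is positive while the scores of the typical sequences are negative, which is what makes the argmax rule work.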

The following theorem characterizes the error exponent of test (1).

###### Theorem 1.

[9, Theorem 1] Consider the outlying sequence detection problem with both $\pi$ and $\mu$ known. The error exponent for the maximum likelihood test (1) is given by

$$\alpha(\delta_{\mathrm{ML}},\pi,\mu)=2B(\pi,\mu),$$

where $B(\pi,\mu)$ is the Bhattacharyya distance between $\pi$ and $\mu$, which is defined as

$$B(\pi,\mu)\triangleq-\log\left(\int\mu(y)^{\frac{1}{2}}\pi(y)^{\frac{1}{2}}\,dy\right).$$
###### Proof.

See Appendix A. ∎
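As a worked illustration of Theorem 1, consider a zero-mean Gaussian pair; the pair and the symbols $\sigma_1$, $\sigma_2$ are assumptions introduced here, not a setting taken from the text. With $\pi=\mathcal{N}(0,\sigma_1^2)$ and $\mu=\mathcal{N}(0,\sigma_2^2)$, the Bhattacharyya integral has a closed form:

```latex
\begin{align*}
\int \mu(y)^{\frac12}\pi(y)^{\frac12}\,dy
  &= \frac{1}{\sqrt{2\pi\sigma_1\sigma_2}}
     \int \exp\!\left(-\frac{y^2}{4}\Big(\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}\Big)\right)dy
   = \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}}, \\
B(\pi,\mu) &= \frac12\log\frac{\sigma_1^2+\sigma_2^2}{2\sigma_1\sigma_2},
\qquad
\alpha(\delta_{\mathrm{ML}},\pi,\mu) = 2B(\pi,\mu) = \log\frac{\sigma_1^2+\sigma_2^2}{2\sigma_1\sigma_2}.
\end{align*}
```

For example, $\sigma_1=1$ and $\sigma_2=2$ give $\alpha=\log(5/4)\approx 0.223$, and the exponent vanishes as $\sigma_2\to\sigma_1$.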

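Consider $L_i$ defined in (2). If $Y^{(i)}$ is generated from $\mu$, then $L_i\to D(\mu\|\pi)$ almost surely as $n\to\infty$, by the Law of Large Numbers. Here,

$$D(\mu\|\pi)\triangleq\int d\mu\log\frac{d\mu}{d\pi}$$

is the KL divergence between $\mu$ and $\pi$. Similarly, if $Y^{(i)}$ is generated from $\pi$, then $L_i\to -D(\pi\|\mu)$ almost surely as $n\to\infty$. Thus, if $Y^{(i)}$ is generated from $\mu$, $L_i$ is an empirical estimate of the KL divergence between $\mu$ and $\pi$. This motivates us to construct a test based on an estimator of the KL divergence between $\mu$ and $\pi$ when $\mu$ is unknown.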

### 3.2 KL divergence estimator

We introduce a KL divergence estimator of continuous distributions based on data-dependent partitions [10].

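Assume that the distribution $p$ is unknown and the distribution $q$ is known, and both $p$ and $q$ are continuous. A sequence of i.i.d. samples $Y=\{Y_1,\ldots,Y_n\}$ is generated from $p$. We wish to estimate the KL divergence between $p$ and $q$. We denote the order statistics of $Y$ by $\{Y_{(1)},Y_{(2)},\ldots,Y_{(n)}\}$, where $Y_{(1)}\le Y_{(2)}\le\cdots\le Y_{(n)}$. We further partition the real line into empirically equiprobable segments as follows:

$$\{I_t^n\}_{t=1,\ldots,T_n}=\left\{(-\infty,Y_{(\ell_n)}],\ (Y_{(\ell_n)},Y_{(2\ell_n)}],\ \ldots,\ (Y_{(\ell_n(T_n-1))},\infty)\right\},$$

where $\ell_n\le n$ is the number of points in each interval except possibly the last one, and $T_n=\lfloor n/\ell_n\rfloor$ is the number of intervals. A divergence estimator between the sequence $Y$ and the distribution $q$ was proposed in [10], which is given by

$$\hat{D}_n(Y\|q)=\sum_{t=1}^{T_n-1}\frac{\ell_n}{n}\log\frac{\ell_n/n}{q(I_t^n)}+\frac{\epsilon_n}{n}\log\frac{\epsilon_n/n}{q(I_{T_n}^n)}, \tag{3}$$

where $\epsilon_n=n-\ell_n(T_n-1)$ is the number of points in the last segment.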
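A minimal sketch of the data-dependent partition estimator follows. It assumes $T_n=\lfloor n/\ell_n\rfloor$ cells and takes the known distribution $q$ through its CDF; the function name `kl_partition_estimate`, the choice $\ell_n=\lfloor\sqrt{n}\rfloor$, the Gaussian example, and the small floor that guards against $\log 0$ are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def kl_partition_estimate(samples, q_cdf, ell):
    """Data-dependent-partition estimate of D(p||q) from samples of p and the CDF of q."""
    y = sorted(samples)                      # order statistics Y_(1) <= ... <= Y_(n)
    n = len(y)
    T = n // ell                             # assumed: T_n = floor(n / ell_n)
    if T < 1:
        raise ValueError("need at least ell samples")
    est, prev_cdf = 0.0, 0.0
    for t in range(1, T + 1):
        last = (t == T)
        # Right endpoint of cell t is Y_(ell * t), or +infinity for the last cell.
        cdf = 1.0 if last else q_cdf(y[ell * t - 1])
        q_mass = max(cdf - prev_cdf, 1e-300)  # guard against log(0) from rounding
        count = n - ell * (T - 1) if last else ell
        p_hat = count / n                     # empirical mass of the cell
        est += p_hat * math.log(p_hat / q_mass)
        prev_cdf = cdf
    return est

# Illustrative check (assumed setup): q = N(0, 1), samples from N(0, 2^2) and from q itself.
rng = random.Random(1)
q = NormalDist(0.0, 1.0)
n = 4000
ell = math.isqrt(n)
d_out = kl_partition_estimate([rng.gauss(0.0, 2.0) for _ in range(n)], q.cdf, ell)
d_typ = kl_partition_estimate([rng.gauss(0.0, 1.0) for _ in range(n)], q.cdf, ell)
# Closed-form D(N(0, 4) || N(0, 1)) = log(1/2) + 4/2 - 1/2
true_kl = math.log(0.5) + 2.0 - 0.5
print(d_out, d_typ, true_kl)
```

Only the CDF of $q$ is needed, because the estimator uses $q$ solely through the masses $q(I_t^n)$ of the cells.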

The consistency of such an estimator was shown in [10]. Here, we characterize the convergence rate by introducing the following boundedness condition on the density ratio between $p$ and $q$, i.e.,

$$0 < K_1 \le \frac{dp}{dq} \le K_2, \tag{4}$$

where $K_1$ and $K_2$ are positive constants. In practice, such a boundedness condition is often satisfied, for example, for truncated Gaussian distributions.

The following theorem characterizes a lower bound on the convergence rate of estimator (3).

###### Theorem 2.

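If the density ratio between $p$ and $q$ satisfies (4), and estimator (3) is applied with $T_n,\ell_n\to\infty$ as $n\to\infty$, then for any $\epsilon>0$,

$$\lim_{n\to\infty}-\frac{1}{n}\log\left(P\left\{\left|\hat{D}_n(Y\|q)-D(p\|q)\right|>\epsilon\right\}\right)\ge\frac{1}{32}\frac{K_1^2}{K_2^2}\epsilon^2.$$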
###### Proof.

See Appendix B. ∎

###### Remark 1.

The convergence rate of estimator (3) in Theorem 2 is equivalent to

$$\left|\hat{D}_n(Y\|q)-D(p\|q)\right|=O_p(n^{-1/2}),$$

where $O_p$ denotes “bounded in probability” [11]: $X_n=O_p(a_n)$ if for every $\epsilon>0$ there exists $M>0$ such that $P(|X_n/a_n|>M)<\epsilon$ for all $n$.

### 3.3 Test and performance

In this subsection, we utilize the estimator based on data-dependent partitions to construct our test.

It is clear that if $Y^{(i)}$ is the outlier, then $\hat{D}_n(Y^{(i)}\|\pi)$ is a good estimator of $D(\mu\|\pi)$, which is a positive constant. On the other hand, if $Y^{(i)}$ is a typical sequence, $\hat{D}_n(Y^{(i)}\|\pi)$ should be close to $D(\pi\|\pi)=0$. Based on this understanding and the convergence guarantee in Theorem 2, we use $\hat{D}_n(Y^{(i)}\|\pi)$ in place of $L_i$ in (2), and construct the following test for the outlying sequence detection problem:

$$\delta_{\mathrm{KL}}(y^{Mn})=\arg\max_{1\le j\le M}\hat{D}_n(Y^{(j)}\|\pi). \tag{5}$$

The following theorem provides a lower bound on the error exponent of $\delta_{\mathrm{KL}}$, which further implies that $\delta_{\mathrm{KL}}$ is universally exponentially consistent.

###### Theorem 3.

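If the density ratio between $\mu$ and $\pi$ satisfies (4), then $\delta_{\mathrm{KL}}$ defined in (5) is exponentially consistent, and the error exponent is lower bounded as follows:

$$\alpha(\delta_{\mathrm{KL}},\pi,\mu)\ge\frac{1}{32}\left(\frac{K_1}{K_1+K_2}\right)^2D^2(\mu\|\pi). \tag{6}$$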
###### Proof.

See Appendix C. ∎
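The KL divergence based test can be sketched end to end as follows. Note that the rule only uses the known typical distribution $\pi$ (through its CDF) and never the outlier distribution. The helper `kl_hat` is a compact version of estimator (3) assuming $T_n=\lfloor n/\ell_n\rfloor$; the Gaussian setup, $M=5$, $n=1000$, $\ell_n=\lfloor\sqrt{n}\rfloor$ and all function names are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def kl_hat(samples, q_cdf, ell):
    """Compact form of estimator (3) with T = floor(n / ell) cells (an assumption)."""
    y, n = sorted(samples), len(samples)
    T = n // ell
    # CDF of q at the cell boundaries: -inf, Y_(ell), ..., Y_(ell (T-1)), +inf
    cdfs = [0.0] + [q_cdf(y[ell * t - 1]) for t in range(1, T)] + [1.0]
    counts = [ell] * (T - 1) + [n - ell * (T - 1)]
    return sum(
        (c / n) * math.log((c / n) / max(hi - lo, 1e-300))
        for c, lo, hi in zip(counts, cdfs, cdfs[1:])
    )

def kl_outlier_test(sequences, pi_cdf, ell):
    """Test (5): return the index with the largest estimated divergence from pi."""
    scores = [kl_hat(seq, pi_cdf, ell) for seq in sequences]
    return max(range(len(scores)), key=scores.__getitem__), scores

# Illustrative run (assumed setup): pi = N(0, 1); the outlier is drawn from N(0, 2^2),
# which the test itself never sees.
rng = random.Random(7)
pi = NormalDist(0.0, 1.0)
M, n, outlier = 5, 1000, 3
sequences = [
    [rng.gauss(0.0, 2.0 if i == outlier else 1.0) for _ in range(n)]
    for i in range(M)
]
detected, scores = kl_outlier_test(sequences, pi.cdf, math.isqrt(n))
print(detected, [round(s, 3) for s in scores])
```

The scores of the typical sequences stay close to zero, while the score of the outlier sequence is close to $D(\mu\|\pi)$, so the argmax separates them.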

## 4 MMD-Based Test

In this section, we introduce the MMD based test, which we previously studied in [12]. We will compare $\delta_{\mathrm{KL}}$ to the MMD based test.

### 4.1 Introduction to MMD

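In this subsection, we briefly introduce the idea of mean embedding of distributions into a reproducing kernel Hilbert space (RKHS) [13] and the metric of MMD. Suppose $\mathcal{P}$ is a set of probability distributions, and suppose $\mathcal{H}$ is the RKHS with an associated kernel $k(\cdot,\cdot)$. We define a mapping from $\mathcal{P}$ to $\mathcal{H}$ such that each distribution $p\in\mathcal{P}$ is mapped to an element in $\mathcal{H}$ as follows:

$$\mu_p(\cdot)=E_p[k(\cdot,x)]=\int k(\cdot,x)\,dp(x).$$

Here, $\mu_p(\cdot)$ is referred to as the mean embedding of the distribution $p$ into the Hilbert space $\mathcal{H}$. Due to the reproducing property of $\mathcal{H}$, it is clear that $E_p[f]=\langle\mu_p,f\rangle_{\mathcal{H}}$ for all $f\in\mathcal{H}$.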

In order to distinguish between two distributions $p$ and $q$, Gretton et al. [14] introduced the following quantity of maximum mean discrepancy (MMD) based on the mean embeddings $\mu_p$ and $\mu_q$ of $p$ and $q$ in the RKHS:

$$\mathrm{MMD}[p,q]:=\|\mu_p-\mu_q\|_{\mathcal{H}}.$$

It can be shown that

$$\mathrm{MMD}[p,q]=\sup_{\|f\|_{\mathcal{H}}\le 1}E_p[f]-E_q[f].$$

Due to the reproducing property of the kernel, the following is true:

$$\mathrm{MMD}^2[p,q]=E[k(X,X')]-2E[k(X,Y)]+E[k(Y,Y')],$$

where $X$ and $X'$ are independent but have the same distribution $p$, and $Y$ and $Y'$ are independent but have the same distribution $q$. An unbiased estimator of $\mathrm{MMD}^2[p,q]$ based on $q$ and $n$ samples $X=\{x_1,\ldots,x_n\}$ generated from $p$ is given as follows:

$$\mathrm{MMD}_u^2[X,q]=\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}k(x_i,x_j)+E[k(Y,Y')]-\frac{2}{n}\sum_{i=1}^{n}E[k(x_i,Y)],$$

where $Y$ and $Y'$ are independent but have the same distribution $q$.
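The estimator above can be sketched for one concrete case in which the two expectations over the known distribution are available in closed form: a Gaussian kernel $k(x,y)=\exp(-(x-y)^2/(2\sigma^2))$ and a Gaussian $q$. The kernel choice, the bandwidth `sigma`, the Gaussian example, and the function name `mmd2_u_vs_gaussian` are assumptions for illustration; the text does not specify a kernel here.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma**2))

def mmd2_u_vs_gaussian(xs, q_mean, q_std, sigma=1.0):
    """MMD_u^2[X, q] for q = N(q_mean, q_std^2) and a Gaussian kernel of bandwidth sigma."""
    n = len(xs)
    # (1 / (n (n - 1))) * sum_{i != j} k(x_i, x_j), using the symmetry of k
    upper = sum(
        gaussian_kernel(xs[i], xs[j], sigma)
        for i in range(n) for j in range(i + 1, n)
    )
    term_xx = 2.0 * upper / (n * (n - 1))
    # E[k(Y, Y')] with Y - Y' ~ N(0, 2 q_std^2): closed form for the Gaussian kernel
    term_yy = math.sqrt(sigma**2 / (sigma**2 + 2.0 * q_std**2))
    # E[k(x, Y)] with Y ~ N(q_mean, q_std^2): closed form for the Gaussian kernel
    v = sigma**2 + q_std**2
    term_xy = sum(
        math.sqrt(sigma**2 / v) * math.exp(-((x - q_mean) ** 2) / (2.0 * v))
        for x in xs
    ) / n
    return term_xx + term_yy - 2.0 * term_xy

# Illustrative check (assumed setup): q = N(0, 1), samples from N(0, 2^2) and from q.
rng = random.Random(0)
v_out = mmd2_u_vs_gaussian([rng.gauss(0.0, 2.0) for _ in range(600)], 0.0, 1.0)
v_typ = mmd2_u_vs_gaussian([rng.gauss(0.0, 1.0) for _ in range(600)], 0.0, 1.0)
# Population MMD^2 between N(0, 4) and N(0, 1) for sigma = 1, from the same closed forms
population = 1.0 / 3.0 + 1.0 / math.sqrt(3.0) - 2.0 / math.sqrt(6.0)
print(v_out, v_typ, population)
```

Because the estimator is unbiased, the value computed from samples of $q$ itself can be slightly negative; it fluctuates around zero.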

### 4.2 Test and performance

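For each sequence $Y^{(i)}$, we compute $\mathrm{MMD}_u^2[Y^{(i)},\pi]$ for $1\le i\le M$. It is clear that if $Y^{(i)}$ is the outlier, $\mathrm{MMD}_u^2[Y^{(i)},\pi]$ is a good estimator of $\mathrm{MMD}^2[\mu,\pi]$, which is a positive constant. On the other hand, if $Y^{(i)}$ is a typical sequence, $\mathrm{MMD}_u^2[Y^{(i)},\pi]$ should be a good estimator of $\mathrm{MMD}^2[\pi,\pi]$, which is zero. Based on the above understanding, we construct the following test:

$$\delta_{\mathrm{MMD}}=\arg\max_{1\le i\le M}\mathrm{MMD}_u^2[Y^{(i)},\pi]. \tag{7}$$

The following theorem provides a lower bound on the error exponent of $\delta_{\mathrm{MMD}}$, and further demonstrates that the test is universally exponentially consistent.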

###### Theorem 4.

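Consider the universal outlying sequence detection problem. Suppose $\delta_{\mathrm{MMD}}$ defined in (7) applies a bounded kernel with $0\le k(x,y)\le K$ for any $(x,y)$. Then, the error exponent is lower bounded as follows:

$$\alpha(\delta_{\mathrm{MMD}},\mu,\pi)\ge\frac{\mathrm{MMD}^4[\mu,\pi]}{9K^2}. \tag{8}$$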
###### Proof.

See Appendix D. ∎

## 5 Numerical results and Discussion

In this section, we compare the performance of $\delta_{\mathrm{KL}}$ and $\delta_{\mathrm{MMD}}$.

We set the number of sequences . We choose the typical distribution , and choose the outlier distribution , respectively. In Fig. 1, Fig. 2, Fig. 3 and Fig. 4, we plot the logarithm of the probability of error as a function of the sample size .

It can be seen that for both tests, the probability of error converges to zero as the sample size $n$ increases. Furthermore, $\log e$ decreases linearly with $n$, which demonstrates the exponential consistency of both $\delta_{\mathrm{KL}}$ and $\delta_{\mathrm{MMD}}$.

By comparing the four figures, it can be seen that as the variance of $\mu$ deviates from the variance of $\pi$, $\delta_{\mathrm{KL}}$ outperforms $\delta_{\mathrm{MMD}}$. The numerical results and the theoretical lower bounds on the error exponents give some intuition for identifying regimes in which one test outperforms the other. As shown above, when the distributions $\mu$ and $\pi$ become more different from each other, $\delta_{\mathrm{KL}}$ will outperform $\delta_{\mathrm{MMD}}$. The reason is that for any pair of distributions, MMD is bounded between , while the KL divergence is not bounded. As the distributions become more different from each other, the KL divergence will increase, and the KL divergence based test will have a larger error exponent than the MMD based test.

Appendix

## Appendix A Proof of Theorem 1

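Recall that the maximum likelihood test is defined as

$$\delta_{\mathrm{ML}}(y^{Mn})=\arg\max_{1\le i\le M}\log\frac{p_i(y^{Mn})}{\prod_{j=1}^{M}\prod_{k=1}^{n}\pi(y^{(j)}_k)}=\arg\max_{1\le i\le M}\left\{\frac{1}{n}\sum_{k=1}^{n}\log\frac{\mu(y^{(i)}_k)}{\pi(y^{(i)}_k)}\right\}=\arg\max_{1\le i\le M}L_i.$$

We now characterize the exponent for the maximum likelihood test. By the symmetry of the problem, $P_i\{\delta_{\mathrm{ML}}\neq i\}$ is the same for every $i=1,\ldots,M$, hence

$$\max_{i=1,\ldots,M}P_i\{\delta_{\mathrm{ML}}\neq i\}=P_1\{\delta_{\mathrm{ML}}\neq 1\}.$$

It now follows that

$$P_1\{L_1\le L_2\}\le P_1\{\delta_{\mathrm{ML}}\neq 1\}=P_1\Big\{L_1\le\max_{2\le j\le M}L_j\Big\}\le(M-1)P_1\{L_1\le L_2\}.$$

Since $M$ is fixed, $\frac{1}{n}\log(M-1)\to 0$ as $n\to\infty$, so the left-hand side and the right-hand side share the same error exponent, and we only need to compute the exponent of $P_1\{L_1\le L_2\}$.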

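Let us use the notation

$$Z_k=\log\left(\frac{\mu(y^{(1)}_k)}{\pi(y^{(1)}_k)}\,\frac{\pi(y^{(2)}_k)}{\mu(y^{(2)}_k)}\right).$$

Then, we can rewrite the probability as

$$P_1\{L_1\le L_2\}=P_1\left\{\sum_{k=1}^{n}\log\frac{\mu(y^{(1)}_k)}{\pi(y^{(1)}_k)}-\sum_{k=1}^{n}\log\frac{\mu(y^{(2)}_k)}{\pi(y^{(2)}_k)}\le 0\right\}=P_1\left\{\sum_{k=1}^{n}Z_k\le 0\right\}.$$

Thus we can apply Cramér's theorem directly:

$$\lim_{n\to\infty}-\frac{1}{n}\log P_1\left\{\sum_{k=1}^{n}Z_k\le na\right\}=\Lambda_Z(a),$$

for $a<E[Z]$, where $\Lambda_Z(a)\triangleq\sup_{\lambda}[\lambda a-\kappa_Z(\lambda)]$ is the large-deviation rate function and $\kappa_Z$ is the log-MGF of $Z$.

In our case, $E[Z]=D(\mu\|\pi)+D(\pi\|\mu)>0$ for $\mu\neq\pi$, so we may take $a=0$:

$$\lim_{n\to\infty}-\frac{1}{n}\log P_1\left\{\sum_{k=1}^{n}Z_k\le 0\right\}=\Lambda_Z(0)=\sup_{\lambda}[-\kappa_Z(\lambda)].$$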

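We just need to compute the log-MGF of the random variable $Z$,

$$\kappa_Z(\lambda)=\log E\left(e^{\lambda Z}\right)=\log E\left[\frac{\mu^{\lambda}(Y^{(1)})}{\pi^{\lambda}(Y^{(1)})}\,\frac{\pi^{\lambda}(Y^{(2)})}{\mu^{\lambda}(Y^{(2)})}\right].$$

Given that $Y^{(1)}$ is generated from $\mu$ and $Y^{(2)}$ is generated from $\pi$, we have

$$\begin{aligned}
\kappa_Z(\lambda)&=\log\left(\int\!\!\int\frac{\mu^{\lambda+1}(y^{(1)})}{\pi^{\lambda}(y^{(1)})}\,\frac{\pi^{\lambda+1}(y^{(2)})}{\mu^{\lambda}(y^{(2)})}\,dy^{(1)}dy^{(2)}\right)\\
&=\log\left(\int\frac{\mu^{\lambda+1}(y^{(1)})}{\pi^{\lambda}(y^{(1)})}\,dy^{(1)}\right)+\log\left(\int\frac{\pi^{\lambda+1}(y^{(2)})}{\mu^{\lambda}(y^{(2)})}\,dy^{(2)}\right)\\
&=-C_{-\lambda}(\pi,\mu)-C_{-\lambda}(\mu,\pi),
\end{aligned}$$

where

$$C_{\lambda}(p,q)\triangleq-\log\left(\int p^{\lambda}(y)\,q^{1-\lambda}(y)\,dy\right).$$

Replacing $\lambda$ by $-\lambda$ in the supremum, the error exponent is

$$\sup_{\lambda}[-\kappa_Z(\lambda)]=\max_{\lambda}\left[C_{\lambda}(\pi,\mu)+C_{\lambda}(\mu,\pi)\right]. \tag{9}$$

Since $C_{\lambda}(\pi,\mu)$ is concave in $\lambda$, and $C_{\lambda}(\pi,\mu)=C_{1-\lambda}(\mu,\pi)$, (9) is maximized when $\lambda=\frac{1}{2}$, so

$$\alpha(\delta_{\mathrm{ML}},\pi,\mu)=2C_{\frac{1}{2}}(\pi,\mu)=2B(\pi,\mu),$$

where $B(\pi,\mu)$ is the Bhattacharyya distance between $\pi$ and $\mu$, which is defined as

$$B(\pi,\mu)\triangleq-\log\left(\int\mu(y)^{\frac{1}{2}}\pi(y)^{\frac{1}{2}}\,dy\right).$$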
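As a numerical sanity check of the last step, the sketch below evaluates $C_{\lambda}(\pi,\mu)+C_{\lambda}(\mu,\pi)$ on a grid of $\lambda$ and confirms that the maximum sits at $\lambda=\frac12$ with value $2B(\pi,\mu)$. The Gaussian pair, the integration range and step, the grid, and the function name `chernoff_C` are assumptions made for illustration; the integral is approximated by the midpoint rule.

```python
import math
from statistics import NormalDist

def chernoff_C(p, q, lam, lo=-30.0, hi=30.0, steps=4000):
    """C_lam(p, q) = -log of the integral of p(y)^lam * q(y)^(1 - lam), midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        total += p.pdf(y) ** lam * q.pdf(y) ** (1.0 - lam)
    return -math.log(total * h)

pi, mu = NormalDist(0.0, 1.0), NormalDist(0.0, 2.0)   # assumed example pair
grid = [k / 20 for k in range(1, 20)]                  # lambda = 0.05, ..., 0.95
values = [chernoff_C(pi, mu, lam) + chernoff_C(mu, pi, lam) for lam in grid]
best_lam = grid[values.index(max(values))]
# Closed-form 2 B(pi, mu) for zero-mean Gaussians with standard deviations 1 and 2
two_B = math.log((1.0**2 + 2.0**2) / (2.0 * 1.0 * 2.0))
print(best_lam, max(values), two_B)
```

The sum is symmetric around $\lambda=\frac12$ because $C_{\lambda}(\pi,\mu)=C_{1-\lambda}(\mu,\pi)$, which is why a concave function of this form peaks at the midpoint.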

## Appendix B Proof of Theorem 2

To show the exponential consistency of our estimator, we invoke a result by Lugosi and Nobel [15], that specifies sufficient conditions on the partition of the space under which the empirical measure converges to the true measure.

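Let $\mathcal{A}$ be a family of partitions of $\mathbb{R}$. The maximal cell count of $\mathcal{A}$ is given by

$$c(\mathcal{A})\triangleq\sup_{\pi\in\mathcal{A}}|\pi|,$$

where $|\pi|$ denotes the number of cells in partition $\pi$ (in this appendix, $\pi$ under a supremum or sum denotes a partition, not the typical distribution).

The complexity of $\mathcal{A}$ is measured by the growth function as described below. Fix $n$ points in $\mathbb{R}$,

$$x_1^n=\{x_1,\ldots,x_n\}.$$

Let $\Delta(\mathcal{A},x_1^n)$ be the number of distinct partitions

$$\{I_1\cap x_1^n,\ldots,I_r\cap x_1^n\}$$

of the finite set $x_1^n$ that can be induced by partitions $\pi=\{I_1,\ldots,I_r\}\in\mathcal{A}$. Define the growth function of $\mathcal{A}$ as

$$\Delta_n^*(\mathcal{A})\triangleq\max_{x_1^n\in\mathbb{R}^n}\Delta(\mathcal{A},x_1^n),$$

which is the largest number of distinct partitions of any $n$-point subset of $\mathbb{R}$ that can be induced by the partitions in $\mathcal{A}$.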

###### Lemma 1.

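(Lugosi and Nobel [15]) Let $X_1,\ldots,X_n$ be i.i.d. random variables in $\mathbb{R}$ with $X_i\sim\mu$, and let $\mu_n$ denote the empirical probability measure based on the $n$ samples. Let $\mathcal{A}$ be any collection of partitions of $\mathbb{R}$. Then, for each $n\ge 1$ and every $\epsilon>0$,

$$P\left\{\sup_{\pi\in\mathcal{A}}\sum_{I\in\pi}|\mu_n(I)-\mu(I)|>\epsilon\right\}\le 4\Delta_{2n}^*(\mathcal{A})\,2^{c(\mathcal{A})}\exp(-n\epsilon^2/32). \tag{10}$$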

To prove Theorem 2, we consider the case in which the distribution $q$ is known, and a given sequence $Y$ is independently generated from an unknown distribution $p$. We further assume that $p$ and $q$ are both absolutely continuous probability measures defined on $(\mathbb{R},\mathcal{B}_{\mathbb{R}})$, and satisfy

$$0 < K_1 \le \frac{dp}{dq} \le K_2.$$

Denote the empirical probability measure based on the sequence $Y$ by $p_n$ (since $Y$ is generated from $p$), and define the empirically equiprobable partitions as follows. Let the order statistics of $Y$ be $\{Y_{(1)},Y_{(2)},\ldots,Y_{(n)}\}$, where $Y_{(1)}\le Y_{(2)}\le\cdots\le Y_{(n)}$. The real line is partitioned into empirically equiprobable segments according to

$$\{I_t^n\}_{t=1,\ldots,T_n}=\left\{(-\infty,Y_{(\ell_n)}],\ (Y_{(\ell_n)},Y_{(2\ell_n)}],\ \ldots,\ (Y_{(\ell_n(T_n-1))},\infty)\right\},$$

where $\ell_n$ is the number of points in each interval except possibly the last one, and $T_n$ is the number of intervals. Assume that as $n\to\infty$, both $T_n,\ell_n\to\infty$. Then our estimator can be written as

$$\hat{D}_n(Y\|q)=\sum_{t=1}^{T_n}p_n(I_t^n)\log\frac{p_n(I_t^n)}{q(I_t^n)}.$$

If we denote the true equiprobable partitions based on the true distribution $p$ by $\{I_t\}_{t=1,\ldots,T_n}$, then

$$p(I_t)=\frac{1}{T_n}=p_n(I_t^n).$$

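The estimation error can be decomposed as

$$\left|\hat{D}_n(Y\|q)-D(p\|q)\right|\le\underbrace{\left|\sum_{t=1}^{T_n}p_n(I_t^n)\log\frac{p_n(I_t^n)}{q(I_t^n)}-\sum_{t=1}^{T_n}p(I_t)\log\frac{p(I_t)}{q(I_t)}\right|}_{e_1}+\underbrace{\left|\sum_{t=1}^{T_n}p(I_t)\log\frac{p(I_t)}{q(I_t)}-D(p\|q)\right|}_{e_2}.$$

Intuitively, $e_2$ is the approximation error caused by numerical integration, which diminishes as $T_n$ increases; $e_1$ is the estimation error caused by the difference between the empirically equiprobable partitions and the true equiprobable partitions, and by the difference between the empirical probability measure of an interval and its true probability measure.

In addition, $e_2$ depends only on $T_n$ and the distributions $p$ and $q$; that is, $e_2$ is a deterministic term, while $e_1$ also depends on the data $Y$, which is random. Next, we focus on bounding the $e_1$ term.

Since $p(I_t)=\frac{1}{T_n}=p_n(I_t^n)$, the estimation error $e_1$ can be written as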

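$$\begin{aligned}
e_1&=\left|\sum_{t=1}^{T_n}p_n(I_t^n)\log\frac{p_n(I_t^n)}{q(I_t^n)}-\sum_{t=1}^{T_n}p(I_t)\log\frac{p(I_t)}{q(I_t)}\right|
=\left|\sum_{t=1}^{T_n}\frac{1}{T_n}\left(\log q(I_t)-\log q(I_t^n)\right)\right|\\
&\le\sum_{t=1}^{T_n}\frac{1}{T_n}\left|\log q(I_t)-\log q(I_t^n)\right|
\le\sum_{t=1}^{T_n}\frac{1}{T_n}f'(\xi_t)\left|q(I_t)-q(I_t^n)\right|,
\end{aligned}$$

where $f(x)=\log x$, $f'(\xi_t)=1/\xi_t$, and $\xi_t$ is a real number between $q(I_t)$ and $q(I_t^n)$. We use the mean value theorem to obtain the last inequality.

Since $f'(\xi_t)=\frac{1}{\xi_t}\le\max\left\{\frac{1}{q(I_t)},\frac{1}{q(I_t^n)}\right\}$, we get

$$\begin{aligned}
e_1&\le\frac{1}{T_n}\sum_{t=1}^{T_n}\max\left\{\frac{1}{q(I_t)},\frac{1}{q(I_t^n)}\right\}\left|q(I_t)-q(I_t^n)\right|\\
&\le\frac{\max_{1\le t\le T_n}\left\{\frac{1}{q(I_t)},\frac{1}{q(I_t^n)}\right\}}{T_n}\sum_{t=1}^{T_n}\left|q(I_t)-q(I_t^n)\right|
=\frac{1}{\alpha}\sum_{t=1}^{T_n}\left|q(I_t)-q(I_t^n)\right|,
\end{aligned} \tag{11}$$

where

$$\alpha=\frac{T_n}{\max_{1\le t\le T_n}\left\{\frac{1}{q(I_t)},\frac{1}{q(I_t^n)}\right\}}.$$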

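To get an exponential bound for $e_1$, we apply Lemma 1 to our problem. In our case, $\{I_t^n\}$ are the equiprobable segments based on the empirical measure $p_n$. Suppose $\mathcal{A}_n$ is the collection of all the partitions of $\mathbb{R}$ into $T_n$ empirically equiprobable intervals based on $n$ sample points. Then, from (10),

$$P\left\{\sum_{t=1}^{T_n}|p_n(I_t^n)-p(I_t^n)|>\epsilon\right\}\le P\left\{\sup_{\pi\in\mathcal{A}_n}\sum_{I\in\pi}|p_n(I)-p(I)|>\epsilon\right\}\le 4\Delta_{2n}^*(\mathcal{A}_n)\,2^{c(\mathcal{A}_n)}\exp(-n\epsilon^2/32). \tag{12}$$

To obtain a meaningful exponential bound, we still need to verify two conditions in our case: as $n\to\infty$,

$$\frac{c(\mathcal{A}_n)}{n}\to 0,\qquad\frac{1}{n}\log\Delta_{2n}^*(\mathcal{A}_n)\to 0.$$

Here,

$$c(\mathcal{A}_n)=\sup_{\pi\in\mathcal{A}_n}|\pi|=T_n.$$

Since $\ell_n\to\infty$ as $n\to\infty$, we have that

$$\frac{c(\mathcal{A}_n)}{n}=\frac{T_n}{n}\le\frac{1}{\ell_n}\to 0.$$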

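Next, consider the growth function $\Delta_{2n}^*(\mathcal{A}_n)$, which is defined as the largest number of distinct partitions of any $2n$-point subset of $\mathbb{R}$ that can be induced by the partitions in $\mathcal{A}_n$. Namely,

$$\Delta_{2n}^*(\mathcal{A}_n)=\max_{x_1^{2n}\in\mathbb{R}^{2n}}\Delta(\mathcal{A}_n,x_1^{2n}).$$

In our algorithm, the partitioning number $\Delta_{2n}^*(\mathcal{A}_n)$ is the number of ways that $2n$ fixed points can be partitioned by $T_n$ intervals. Then

$$\Delta_{2n}^*(\mathcal{A}_n)=\binom{2n+T_n}{T_n}.$$

Let $h$ be the binary entropy function, defined as

$$h(x)=-x\log(x)-(1-x)\log(1-x),\quad\text{for }x\in(0,1).$$

By the inequality $\log\binom{s}{t}\le s\,h(t/s)$, we obtain

$$\log\Delta_{2n}^*(\mathcal{A}_n)\le(2n+T_n)\,h\!\left(\frac{T_n}{2n+T_n}\right)\le 3n\,h\!\left(\frac{1}{2\ell_n}\right).$$

As $\ell_n\to\infty$, the last inequality implies that

$$\frac{1}{n}\log\Delta_{2n}^*(\mathcal{A}_n)\to 0.$$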
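The chain of bounds on the growth function can be checked numerically. The sketch below compares the log of the binomial coefficient, the entropy bound, and the looser bound in terms of $\ell_n$, each normalized by $n$, for a few sample sizes. The choices $\ell_n=\lfloor\sqrt{n}\rfloor$ and $T_n=\lfloor n/\ell_n\rfloor$, as well as the function names, are assumptions made for illustration.

```python
import math

def log_binom(s, t):
    """Natural log of the binomial coefficient C(s, t), via log-gamma."""
    return math.lgamma(s + 1) - math.lgamma(t + 1) - math.lgamma(s - t + 1)

def binary_entropy(x):
    """h(x) = -x log x - (1 - x) log(1 - x), in nats, for x in (0, 1)."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

rows = []
for n in (10**2, 10**4, 10**6):
    ell = math.isqrt(n)                 # assumed choice: ell_n = floor(sqrt(n))
    T = n // ell                        # assumed: T_n = floor(n / ell_n)
    s = 2 * n + T
    log_growth = log_binom(s, T)        # log of C(2n + T_n, T_n)
    entropy_bound = s * binary_entropy(T / s)
    loose_bound = 3 * n * binary_entropy(1 / (2 * ell))
    rows.append((n, log_growth / n, entropy_bound / n, loose_bound / n))
    print(rows[-1])
```

All three normalized quantities shrink as $n$ grows, which is the behavior needed for the growth function not to affect the exponent.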

We can now conclude that inequality (12) is indeed an exponential bound: the coefficients $\Delta_{2n}^*(\mathcal{A}_n)$ and $2^{c(\mathcal{A}_n)}$ do not influence the exponent.

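Since $\frac{dp}{dq}\ge K_1$ and $p(I_t)=p_n(I_t^n)$, the following holds:

$$\begin{aligned}
P\left\{\sum_{t=1}^{T_n}|q(I_t^n)-q(I_t)|>\epsilon\right\}
&\le P\left\{\sum_{t=1}^{T_n}|p(I_t^n)-p(I_t)|>K_1\epsilon\right\}
=P\left\{\sum_{t=1}^{T_n}|p_n(I_t^n)-p(I_t^n)|>K_1\epsilon\right\}\\
&\le 4\Delta_{2n}^*(\mathcal{A}_n)\,2^{c(\mathcal{A}_n)}\exp(-nK_1^2\epsilon^2/32).
\end{aligned} \tag{13}$$

Combining this with (11), we can control the estimation error with the following bound:

$$P\{e_1+e_2>\epsilon\}\le P\left\{\frac{1}{\alpha}\sum_{t=1}^{T_n}\left|q(I_t)-q(I_t^n)\right|>\epsilon-e_2\right\}\le 4\Delta_{2n}^*(\mathcal{A}_n)\,2^{c(\mathcal{A}_n)}\exp\left(-n\alpha^2K_1^2(\epsilon-e_2)^2/32\right).$$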

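Recall that

$$\alpha=\frac{T_n}{\max_{1\le t\le T_n}\left\{\frac{1}{q(I_t)},\frac{1}{q(I_t^n)}\right\}}.$$

Since (13) shows that $q(I_t^n)$ converges to $q(I_t)$ exponentially fast, we have

$$\lim_{n\to\infty}\alpha=\lim_{n\to\infty}\frac{T_n}{\max_{1\le t\le T_n}\left\{\frac{1}{q(I_t)},\frac{1}{q(I_t^n)}\right\}}=\lim_{n\to\infty}\frac{1}{p(I_t)\max_{1\le t\le T_n}\left\{\frac{1}{q(I_t)}\right\}}=\lim_{n\to\infty}\frac{\min_{1\le t\le T_n}\{q(I_t)\}}{p(I_t)}\ge\frac{1}{K_2}.$$

Finally, we can compute the error exponent:

$$\lim_{n\to\infty}-\frac{1}{n}\log P\{e_1+e_2>\epsilon\}\ge\lim_{n\to\infty}\frac{\alpha^2K_1^2(\epsilon-e_2)^2}{32}.$$

Since $e_2$ is the approximation error caused by numerical integration, $e_2\to 0$ as $n\to\infty$. We have thus proved that

$$\lim_{n\to\infty}-\frac{1}{n}\log\left(P\left\{\left|\hat{D}_n(Y\|q)-D(p\|q)\right|>\epsilon\right\}\right)\ge\frac{1}{32}\left(\frac{K_1}{K_2}\right)^2\epsilon^2.$$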

## Appendix C Proof of Theorem 3

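Recall that our test is defined as

$$\delta_{\mathrm{KL}}(y^{Mn})=\arg\max_{1\le j\le M}\hat{D}_n(Y^{(j)}\|\pi).$$

We now show that the proposed test is exponentially consistent. By the symmetry of the problem, $P_i\{\delta_{\mathrm{KL}}\neq i\}$ is the same for every $i=1,\ldots,M$, hence

$$\max_{i=1,\ldots,M}P_i\{\delta_{\mathrm{KL}}\neq i\}=P_1\{\delta_{\mathrm{KL}}\neq 1\}.$$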

It now follows

 P1{δKL≠1} =P1{^Dn(Y(1)||π)≤max2≤j≤M^Dn(Y(j)||π)} ≤(M−1)P1{^Dn(Y