A Local Density-Based Approach for Local Outlier Detection
This paper presents a simple but effective density-based outlier detection approach with the local kernel density estimation (KDE). A Relative Density-based Outlier Score (RDOS) is introduced to measure the local outlierness of objects, in which the density distribution at the location of an object is estimated with a local KDE method based on extended nearest neighbors of the object. Instead of using only nearest neighbors, we further consider reverse nearest neighbors and shared nearest neighbors of an object for density distribution estimation. Some theoretical properties of the proposed RDOS including its expected value and false alarm probability are derived. A comprehensive experimental study on both synthetic and real-life data sets demonstrates that our approach is more effective than state-of-the-art outlier detection methods.
Advances in data acquisition have created massive collections of data, capturing information valuable to science, government, business, and society. However, despite the availability of large amounts of data, some events are rare or exceptional; these are usually called “outliers” or “anomalies”. Compared with many other knowledge discovery problems, outlier detection is sometimes more valuable in applications such as network intrusion detection, fraudulent transaction detection, and medical diagnostics. For example, in network intrusion detection, the number of intrusions or attacks (“bad” connections) is much smaller than the number of “good”, normal connections. Similarly, abnormal behaviors are usually rare in many other cases. Although these outliers are only a small portion of the whole data set, misinterpreting them is much more costly than misinterpreting other events.
In recent decades, many outlier detection approaches have been proposed. An outlier detection method can usually be assigned to one of the following four categories jin2001mining ()1334558 (): distribution-based, distance-based, clustering-based, and density-based. In distribution-based methods, an object is considered an outlier if it deviates too much from a standard distribution (e.g., normal, Poisson, etc.) barnett1994outliers (). The problem with distribution-based methods is that, in many practical applications, the underlying distribution is unknown and does not follow a standard distribution.
The distance-based methods detect outliers by computing distances among all objects. An object is considered an outlier when at least a given percentage of the objects in the data set lie farther than a given distance from it knox1998algorithms (). In aggarwal2001outlier (), the distance among objects is calculated in feature subspaces through projections for high-dimensional data sets. The problem with these methods is that local outliers are often not detected correctly in data sets with multiple components or clusters. To detect local outliers, a top-$n$ $k$-th nearest neighbor distance is proposed in ramaswamy2000efficient (), in which the distance from an object to its $k$-th nearest neighbor indicates the outlierness of the object. The clustering-based methods detect outliers in the process of finding clusters: an object that does not belong to any cluster is considered an outlier ester1996density ()zhang1997birch ()brito1997connectivity ().
In density-based methods, an object is detected as an outlier when its local density differs from that of its neighborhood. Different density estimation methods can be applied to measure the density. In Local Outlier Factor (LOF) breunig2000lof (), an outlierness score is computed that indicates how an object differs from its locally reachable neighborhood. Previous studies tang2002enhancing ()zhang2009new () have shown that it is more reliable to consider the objects with the highest LOF scores as outliers than to compare the LOF score with a threshold. Several variations of the LOF have also been proposed zhang2009new ()jin2006ranking (). In zhang2009new (), a Local Distance-based Outlier Factor (LDOF) using the relative distance from an object to its neighbors is proposed for outlier detection in scattered data sets. In jin2006ranking (), an INFLuenced Outlierness (INFLO) score is computed by considering both the neighbors and the reverse neighbors of an object when estimating its relative density distribution. To address the issue that the LOF method and its variants do not consider the underlying pattern of the data, Tang et al. proposed a connectivity-based outlier factor (COF) scheme in tang2001robust (). While the LOF-based and COF-based outlier detection methods use the relative distance to estimate the density, several other density-based methods are based on kernel density estimation latecki2007outlier ()gao2011rkof ()schubert2014generalized (). For example, Local Density Factor (LDF) latecki2007outlier () extends the LOF by using kernel density estimation. In schubert2014generalized (), similar to LOCI, a relative density score termed KDEOS is calculated using kernel density estimation, and the $z$-score transformation is applied for score normalization.
In this paper, we propose an outlier detection method based on local kernel density estimation for robust local outlier detection. Instead of using the whole data set, the density of an object is estimated with the objects in its neighborhood. Three kinds of neighbors, namely nearest neighbors, reverse nearest neighbors, and shared nearest neighbors, are considered in our local kernel density estimation. A simple but efficient relative density calculation, termed Relative Density-based Outlier Score (RDOS), is introduced to measure the outlierness. Theoretical properties of the RDOS, including the expected value and the false alarm probability, are derived, which suggest parameter settings for practical applications. We further employ the top-$n$ scheme to rank the objects by their outlierness, i.e., the objects with the highest RDOS values are considered as the outliers. Simulation results on both synthetic and real-life data sets illustrate the superior performance of our proposed method.
The paper is organized as follows: In Section 2, we introduce the definition of the RDOS and present the detailed descriptions of our proposed outlier detection approach. In Section 3, we derive theoretical properties of the RDOS and discuss the parameter settings. In Section 4, we present experimental results and analysis, which show superior performance compared with previous approaches. Finally, conclusions are given in Section 5.
2 Proposed Outlier Detection
2.1 Local Kernel Density Estimation
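We use the KDE method to estimate the density at the location of an object based on the given data set. Given a set of objects $X = \{X_1, X_2, \ldots, X_N\}$, where $X_i \in \mathbb{R}^d$ for $i = 1, 2, \ldots, N$, the KDE method estimates the distribution as follows:
$$p(X) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h^d} K\left(\frac{X - X_i}{h}\right) \tag{1}$$
where $K(\cdot)$ is the defined kernel function with the kernel width of $h$, which satisfies the following conditions:
$$\int K(u)\,du = 1, \quad \int u\,K(u)\,du = 0, \quad \int \|u\|^2 K(u)\,du > 0$$
A commonly used multivariate Gaussian kernel function is given by
$$K\left(\frac{X - X_i}{h}\right) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{\|X - X_i\|^2}{2h^2}\right)$$
where $\|X - X_i\|$ denotes the Euclidean distance between $X$ and $X_i$. The distribution estimate in Eq. (1) offers many nice properties, such as its non-parametric nature, continuity and differentiability epanechnikov1969non (). It is also an asymptotically unbiased estimator of the density.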
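As an illustration of Eq. (1) with a Gaussian kernel, the following minimal Python sketch evaluates the density estimate at a query point. The function names `gaussian_kernel` and `kde`, the sample points, and the kernel width are assumptions made for this example only; they are not taken from the paper.

```python
import math

def gaussian_kernel(x, xi, h):
    """Gaussian kernel value (2*pi)^(-d/2) * exp(-||x - xi||^2 / (2 h^2))."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return (2.0 * math.pi) ** (-d / 2.0) * math.exp(-sq_dist / (2.0 * h * h))

def kde(x, data, h):
    """Density estimate at x: average of h^(-d) * K((x - X_i)/h) over all objects."""
    d = len(x)
    return sum(gaussian_kernel(x, xi, h) for xi in data) / (len(data) * h ** d)

# Hypothetical 1-D data: three points close together and one far away.
data = [(0.0,), (0.1,), (-0.1,), (5.0,)]
print(kde((0.0,), data, h=0.5))  # estimate inside the small cluster
print(kde((2.5,), data, h=0.5))  # estimate in the empty gap
```

The estimate is noticeably larger inside the cluster than in the gap, which is the behavior a density-based outlier score relies on.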
To estimate the density at the location of the object $X_i$, we only consider the neighbors of $X_i$ as kernels, instead of using all objects in the data set. The reason for this is twofold. First, many complex real-life data sets have multiple clusters or components, which are the intrinsic patterns of the data; density estimation using the full data set may lose the local differences in density and fail to detect local outliers. Second, outlier detection calculates a score for each object, and using the full data set would lead to a high computational cost with complexity $O(N^2)$, where $N$ is the total number of objects in the data set.
To better estimate the density distribution in the neighborhood of an object, we propose to use nearest neighbors, reverse nearest neighbors and shared nearest neighbors as kernels in KDE. Let $NN_k(X_i)$ be the $k$-th nearest neighbor of the object $X_i$; we denote the set of $k$ nearest neighbors of $X_i$ as $S_{KNN}(X_i)$:
$$S_{KNN}(X_i) = \{NN_1(X_i), NN_2(X_i), \ldots, NN_k(X_i)\}$$
The reverse nearest neighbors of the object $X_i$ are those objects that consider $X_i$ as one of their $k$ nearest neighbors tangENN (), i.e., $X$ is a reverse nearest neighbor of $X_i$ if $NN_r(X) = X_i$ for some $r \le k$. The shared nearest neighbors of the object $X_i$ are those objects that share one or more nearest neighbors with $X_i$; in other words, $X$ is a shared nearest neighbor of $X_i$ if $NN_r(X) = NN_s(X_i)$ for some $r, s \le k$. We show these three types of nearest neighbors in Fig. 1.
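We denote by $S_{RNN}(X_i)$ and $S_{SNN}(X_i)$ the sets of reverse nearest neighbors and shared nearest neighbors of $X_i$, respectively. For an object, there are always $k$ nearest neighbors in $S_{KNN}(X_i)$, while the sets $S_{RNN}(X_i)$ and $S_{SNN}(X_i)$ can be empty or have one or more elements. Given the three sets $S_{KNN}(X_i)$, $S_{RNN}(X_i)$ and $S_{SNN}(X_i)$ for the object $X_i$, we form an extended local neighborhood by combining them, denoted by $S(X_i) = S_{KNN}(X_i) \cup S_{RNN}(X_i) \cup S_{SNN}(X_i)$. Thus, the estimated density at the location of $X_i$ is written as
$$p(X_i) = \frac{1}{|S(X_i)| + 1} \sum_{X \in S(X_i) \cup \{X_i\}} \frac{1}{h^d} K\left(\frac{X - X_i}{h}\right)$$
where $|S(X_i)|$ denotes the number of elements in the set $S(X_i)$.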
2.2 Relative Density-based Outlier Factor
After estimating the density at the locations of all objects, we propose a novel relative density-based outlier score (RDOS) to measure the degree to which the density of the object $X_i$ deviates from that of its neighborhood, which is defined as follows:
$$RDOS_k(X_i) = \frac{\sum_{X \in S(X_i)} p(X)}{|S(X_i)|\, p(X_i)}$$
The RDOS is the ratio of the average neighborhood density to the density of the object of interest $X_i$. If $RDOS_k(X_i)$ is much larger than 1, then the object $X_i$ lies outside of a dense cluster, indicating that $X_i$ would be an outlier. If $RDOS_k(X_i)$ is equal to or smaller than 1, then the object $X_i$ is surrounded by equally dense neighbors or by a sparse cloud, indicating that $X_i$ would not be an outlier. In practice, we rank the RDOS values and detect the top-$n$ outliers.
We summarize our algorithm in Algorithm 1, which takes the KNN graph as input. The KNN graph is a directed graph in which each object is a vertex and is connected to its $k$ nearest neighbors by outbound edges. In the KNN graph, an object $X_i$ has $k$ outbound edges to the elements in $S_{KNN}(X_i)$, and has zero, one or more inbound edges. KNN graph construction using the brute-force method has a computational complexity of $O(N^2)$ for $N$ objects, which can be reduced to $O(N \log N)$ using $k$-d trees bentley1975multidimensional (). Using the KNN graph KNN-G, it is easy to obtain the $k$ nearest neighbors $S_{KNN}(X_i)$, reverse nearest neighbors $S_{RNN}(X_i)$ and shared nearest neighbors $S_{SNN}(X_i)$ with an approximate computational cost of $O(N)$. For each object, we form the set of local nearest neighbors $S(X_i)$ as the union of $S_{KNN}(X_i)$, $S_{RNN}(X_i)$ and $S_{SNN}(X_i)$, and calculate the density at the location of the object based on $S(X_i)$. Then, we calculate the RDOS value of each object based on the densities of its local neighbors in $S(X_i)$. The top-$n$ outliers are obtained by sorting the RDOS values in descending order. To determine whether a single object $X_i$ is an outlier, we can compare the value of $RDOS_k(X_i)$ with a threshold $\gamma$, i.e., we declare $X_i$ an outlier if it satisfies
$$RDOS_k(X_i) > \gamma$$
where the threshold $\gamma$ is usually a constant value that is pre-determined by users.
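The scoring and ranking steps described above can be sketched as follows. The sketch assumes that the local densities and the extended neighborhoods have already been computed, and that the RDOS is the mean density over the extended neighborhood divided by the density of the object itself, following the verbal definition in this section. The function names, the toy densities, and the threshold value are hypothetical.

```python
def rdos(densities, neighborhoods):
    """Score each object by (mean density of its neighbors) / (its own density).

    densities[i] is the estimated density at object i; neighborhoods[i] is the
    non-empty set of indices forming the extended neighborhood of object i.
    """
    scores = []
    for i, members in enumerate(neighborhoods):
        mean_neighbor_density = sum(densities[j] for j in members) / len(members)
        scores.append(mean_neighbor_density / densities[i])
    return scores

def top_n_outliers(scores, n):
    """Indices of the n objects with the largest scores, in descending score order."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

def threshold_outliers(scores, gamma):
    """Indices of all objects whose score exceeds the user-chosen threshold gamma."""
    return [i for i, s in enumerate(scores) if s > gamma]

# Hypothetical densities: objects 0-2 sit in a dense region, object 3 is isolated.
densities = [0.9, 1.0, 1.1, 0.05]
neighborhoods = [{1, 2}, {0, 2}, {0, 1, 3}, {2}]
scores = rdos(densities, neighborhoods)
print(scores)
print(top_n_outliers(scores, n=1))
print(threshold_outliers(scores, gamma=2.0))
```

Ranking avoids having to choose a threshold in advance, while thresholding gives a per-object decision; both operate on the same scores.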
3 Theoretical Properties
In this section, we analyze several nice properties of the proposed outlierness metric. In Theorem 1, we give the expected value of RDOS when the object and its neighbors are sampled from the same distribution, which indicates the lower bound of RDOS for outlier detection.
Theorem 1. Let the object $X_i$ be sampled from a continuous density distribution. For $N \to \infty$, the RDOS equals 1 with probability 1, i.e., $RDOS_k(X_i) = 1$, when the kernel function $K$ is nonnegative and integrable.
Proof. For a fixed $k$, $N \to \infty$ indicates that the objects in $S(X_i)$ lie in a local neighborhood of $X_i$ with radius $r \to 0$. Considering data sampled from a continuous density distribution $f(x)$, the expectation of the density estimate at $X_i$ exists and is consistent with the true one steckley2006estimating ():
and its asymptotic variance is given by steckley2006estimating ()
Meanwhile, the average density at the neighborhood of with the radius of can be given by
Taking the ratio, we get
This theorem shows that when $RDOS_k(X_i) \approx 1$, we can say that the object is not an outlier. Since the RDOS is always positive, when $0 < RDOS_k(X_i) < 1$, the object can be ignored in outlier detection. Only objects whose RDOS values are larger than 1 can possibly be outliers.
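The intuition behind Theorem 1 can be written in one line. This is an informal illustration, not the proof: it assumes that, in the limit considered by the theorem, the density estimate $p(X)$ at every member of the extended neighborhood $S(X_i)$ approaches the same true density value, written here as $f(X_i)$, and that the RDOS is the neighborhood-average density divided by the density at $X_i$, as described in Section 2.2. The symbols $p$, $S(X_i)$ and $f$ are notation assumed for this sketch.

```latex
\[
RDOS_k(X_i)
\;=\; \frac{\frac{1}{|S(X_i)|}\sum_{X \in S(X_i)} p(X)}{p(X_i)}
\;\approx\; \frac{\frac{1}{|S(X_i)|}\sum_{X \in S(X_i)} f(X_i)}{f(X_i)}
\;=\; 1 .
\]
```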
Following the work in zhang2009new (), we next derive an upper bound on the false detection probability to guide threshold selection in practice.
Let be the set of local neighbors of in RDOS, which are assumed to be uniformly distributed in ball centered at with the radius of . Using the Gaussian kernel, the probability of false detecting as an outlier is given by
where is the kernel width and is the volume of ball .
For simplicity of notation, we use for and consider . Then, the density estimation at given the local neighbors is written as
and the average density estimation in the neighborhood of is written as
For , , uniformly distributed in ball , we can compute the expectation of both and from Theorem 1, which is given by:
where is the volume of -sphere and . The rest of proof follows the McDiarmid’s Inequality which gives the upper bound of the probability that a function of i.i.d. variables deviates from its expectation. Let , ,
Then, for all ,
For , we have
For , we have
We define a new function , which is bounded by
Then, the probability of false alarm is written as
where . From Theorem 1, we are only interested in the case of , i.e., , and . Using the McDiarmid’s Inequality, we have
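For reference, the bounded-differences inequality invoked in this proof has the following standard form. The notation below ($Z_1, \ldots, Z_m$ for the independent variables, $g$ for the function, $c_j$ for the bounded differences, $\epsilon$ for the deviation) is chosen for this statement and is not necessarily the notation used in the derivation above.

```latex
% McDiarmid's inequality (standard statement).
% Let Z_1, ..., Z_m be independent random variables and let g satisfy,
% for every index j and all admissible values z_1, ..., z_m, z_j',
\[
\bigl| g(z_1, \ldots, z_j, \ldots, z_m) - g(z_1, \ldots, z_j', \ldots, z_m) \bigr| \;\le\; c_j .
\]
% Then, for all \epsilon > 0,
\[
\Pr\bigl( g(Z_1, \ldots, Z_m) - \mathbb{E}\,[\,g(Z_1, \ldots, Z_m)\,] \;\ge\; \epsilon \bigr)
\;\le\; \exp\!\left( - \frac{2\epsilon^2}{\sum_{j=1}^{m} c_j^2} \right).
\]
```

The bound tightens as the per-variable constants $c_j$ shrink, which is why the proof first bounds how much each density term can change when a single neighbor is replaced.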
4 Experimental Results and Analysis
4.1 Synthetic Data Sets
We first test the proposed RDOS in two synthetic data sets for outlier detection. Our first synthetic data set includes two Gaussian clusters centered at and , respectively, each of which has 100 data samples. There are three outliers around these two clusters, as indicated in Fig. 2. To calculate the RDOS, we use nearest neighbors and in kernel functions. In Fig. 3, we show the RDOS value of all data samples, where the color and the radius of circles denote the value of RDOS. It can be shown that the RDOS of these three outliers is significantly larger than that of non-outliers.
The second synthetic data set used in our simulation consists of data samples uniformly distributed around a cosine curve, which can be written as
where . In our simulation, we use , and generate four outliers in this data set, as shown in Fig. 4. The RDOS value of all data samples is shown in Fig. 5, where both the color and the radius of circles indicate the value of RDOS. It is still shown that the RDOS-based method can effectively detect the outliers.
4.2 Real-Life Data Sets
We also conduct outlier detection experiments on four real-life data sets to demonstrate the effectiveness of our proposed RDOS approach. All four data sets, Breast Cancer, Pen-Local, Pen-Global, and Satellite, are originally from the UCI repository asuncion2007uci (), but are modified for local and global outlier detection dataset (). We summarize the characteristics of these four data sets in Table 1. Prior to calculating the RDOS, we first normalize the data to the range from 0 to 1. In Fig. 6, we show the first two principal components of these four data sets, where the outliers are denoted by solid circles.
Table 1: Characteristics of the four data sets.
| Dataset | # of features | # of outliers | # of data |
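The normalization step mentioned above can be sketched as a per-feature min-max rescaling. The text only states that the data are normalized to the range from 0 to 1, so rescaling each feature separately, mapping constant features to 0, and the function name `min_max_normalize` are all assumptions of this example.

```python
def min_max_normalize(data):
    """Rescale every feature (column) of data to [0, 1]; constant columns map to 0."""
    columns = list(zip(*data))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple((value - low) / (high - low) if high > low else 0.0
              for value, low, high in zip(row, lows, highs))
        for row in data
    ]

# Hypothetical rows with two features; the second feature is constant.
print(min_max_normalize([(1, 10), (3, 10), (2, 10)]))
```

Rescaling matters for a distance-based neighborhood search, since a feature with a large numeric range would otherwise dominate the Euclidean distance.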
For each data sample, we calculate its RDOS and compare it with a threshold to determine whether it is an outlier. Since all these data sets are highly imbalanced, overall accuracy is not an appropriate metric. In our experiments, we use the AUC (area under the ROC curve) for performance comparison. The ROC curve examines the performance of a binary classifier under different thresholds, each leading to a different pair of false alarm rate and true positive rate. We compare our RDOS approach with four other widely used outlier detection approaches: Outlier Detection using Indegree Number (ODIN) Hautamaki:2004 (), LOF breunig2000lof (), INFLO jin2006ranking (), and Mutual Nearest Neighbors (MNN) brito1997connectivity (). Since all of the examined methods are nearest neighbor-based, we evaluate the outlier detection performance for different values of $k$. Fig. 7 shows the AUC for the Breast Cancer data set. It can be seen that our proposed RDOS approach, in general, performs better than the other four approaches, and performs similarly to LOF and INFLO when $k$ is larger than 7. Fig. 8 illustrates the case in which the performance improvement of the proposed RDOS approach is largest.
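The AUC used for comparison can be computed directly from outlier scores without tracing the ROC curve, using the fact that it equals the probability that a randomly chosen outlier receives a higher score than a randomly chosen normal sample. The sketch below is a simple quadratic-time illustration; the function name `roc_auc`, the label convention (1 for outlier, 0 for normal), and the toy scores are assumptions of this example.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (outlier, normal) pairs ranked correctly; ties count as 1/2.

    Assumes labels contain at least one 1 (outlier) and at least one 0 (normal).
    """
    outlier_scores = [s for s, y in zip(scores, labels) if y == 1]
    normal_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in outlier_scores:
        for b in normal_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(outlier_scores) * len(normal_scores))

# Hypothetical scores: one outlier is ranked below one normal sample.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 3 of 4 pairs are ordered correctly
```

Because the AUC depends only on the ordering of the scores, it is insensitive to the class imbalance that makes overall accuracy uninformative here.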
In Fig. 9, we show the AUC for the Pen-Local data set. Our RDOS approach again generally outperforms the other four approaches when $k$ is less than 7. Specifically, in Fig. 10, we show the ROC curves for a selected value of $k$. Compared to the LOF and INFLO approaches, the performance difference is close to zero for large values of $k$.
In Fig. 11, we show the AUC for the Pen-Global data set. It shows a large performance improvement of our RDOS approach as the number of nearest neighbors increases. In Fig. 12, the ROC curves of all five approaches are compared for a selected value of $k$. From Figs. 7, 9 and 11, it can be seen that RDOS $\succ$ LOF $\succ$ INFLO $\succ$ ODIN $\succ$ MNN, where the symbol “$\succ$” means “performs better than”, for the Breast Cancer, Pen-Local, and Pen-Global data sets.
Fig. 13 shows the AUC for the Satellite data set. When the number of nearest neighbors is less than 11, the three approaches RDOS, LOF and INFLO have a similar AUC. When the number of nearest neighbors is larger than 11 and less than 23, INFLO performs best and our RDOS approach is second. When the number of nearest neighbors is larger than 25, our RDOS approach has the best performance. Specifically, we show the ROC curves of all five approaches in Fig. 14. In general, we observe the following phenomena in our experiments. Firstly, the performance of all five approaches is usually poor for a small $k$, and the improvement of our RDOS approach is not significant; when only a small number of nearest neighbors is considered, the relative density in a neighborhood might not be well represented. Secondly, the proposed RDOS approach performs best for specific values of $k$. Thirdly, the MNN approach has the worst performance among the five approaches on these four data sets.
5 Conclusions and Future Work
This paper presented a novel local outlier detection method based on local kernel density estimation. Instead of only considering the nearest neighbors of a data sample, we considered three kinds of neighbors: nearest neighbors, reverse nearest neighbors, and shared nearest neighbors, for local kernel density estimation. A simple but efficient relative density calculation, termed Relative Density-based Outlier Score (RDOS), was introduced to measure the outlierness. We further derived theoretical properties of the proposed RDOS measure, including the expected value and the false alarm probability. The theoretical results suggest parameter settings for practical applications. Simulation results on both synthetic data sets and real-life data sets illustrate superior performance of our proposed method. One drawback of kernel-based density estimation is its kernel width selection. Along this research direction, new density estimation methods such as exponentially embedded families tangEEF (); tangtoward (); kayprob () and PDF projection theorem baggenstoss2003pdf (); tangEEF2016 () will be investigated in our future work.
- (1) W. Jin, A. K. Tung, J. Han, Mining top-n local outliers in large databases, in: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001, pp. 293–298.
- (2) V. Hautamaki, I. Karkkainen, P. Franti, Outlier detection using k-nearest neighbour graph, in: Proceedings of the 17th International Conference on Pattern Recognition, Vol. 3, 2004, pp. 430–433.
- (3) V. Barnett, T. Lewis, Outliers in statistical data, Vol. 3, Wiley New York, 1994.
- (4) E. M. Knox, R. T. Ng, Algorithms for mining distance-based outliers in large datasets, in: Proceedings of the International Conference on Very Large Data Bases, 1998, pp. 392–403.
- (5) C. C. Aggarwal, P. S. Yu, Outlier detection for high dimensional data, in: ACM SIGMOD Record, Vol. 30, 2001, pp. 37–46.
- (6) S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, in: ACM SIGMOD Record, Vol. 29, 2000, pp. 427–438.
- (7) M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, Vol. 96, 1996, pp. 226–231.
- (8) T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: A new data clustering algorithm and its applications, Data Mining and Knowledge Discovery 1 (2) (1997) 141–182.
- (9) M. Brito, E. Chavez, A. Quiroz, J. Yukich, Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection, Statistics & Probability Letters 35 (1) (1997) 33–42.
- (10) M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, LOF: identifying density-based local outliers, in: ACM SIGMOD Record, Vol. 29, 2000, pp. 93–104.
- (11) J. Tang, Z. Chen, A. W.-C. Fu, D. W. Cheung, Enhancing effectiveness of outlier detections for low density patterns, in: Advances in Knowledge Discovery and Data Mining, 2002, pp. 535–548.
- (12) K. Zhang, M. Hutter, H. Jin, A new local distance-based outlier detection approach for scattered real-world data, in: Advances in Knowledge Discovery and Data Mining, 2009, pp. 813–822.
- (13) W. Jin, A. K. Tung, J. Han, W. Wang, Ranking outliers using symmetric neighborhood relationship, in: Advances in Knowledge Discovery and Data Mining, 2006, pp. 577–593.
- (14) J. Tang, Z. Chen, A. W.-C. Fu, D. Cheung, A robust outlier detection scheme for large data sets, in: 6th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, 2001.
- (15) L. J. Latecki, A. Lazarevic, D. Pokrajac, Outlier detection with kernel density functions, in: Machine Learning and Data Mining in Pattern Recognition, 2007, pp. 61–75.
- (16) J. Gao, W. Hu, Z. M. Zhang, X. Zhang, O. Wu, RKOF: robust kernel-based local outlier detection, in: Advances in Knowledge Discovery and Data Mining, 2011, pp. 270–283.
- (17) E. Schubert, A. Zimek, H.-P. Kriegel, Generalized outlier detection with flexible kernel density estimates, in: Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, 2014, pp. 542–550.
- (18) V. A. Epanechnikov, Non-parametric estimation of a multivariate probability density, Theory of Probability & Its Applications 14 (1) (1969) 153–158.
- (19) B. Tang, H. He, ENN: Extended nearest neighbor method for pattern recognition [research frontier], IEEE Computational Intelligence Magazine 10 (3) (2015) 52–60.
- (20) J. L. Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM 18 (9) (1975) 509–517.
- (21) S. G. Steckley, Estimating the density of a conditional expectation, Ph.D. thesis, Cornell University (2006).
- (22) M. Lichman, UCI Machine Learning Repository, Web: http://archive.ics.uci.edu/ml/, School of Information and Computer Science, Irvine, CA: University of California, 2013.
- (23) M. Goldstein, Unsupervised anomaly detection benchmark, http://dx.doi.org/10.7910/DVN/OPQMVF/, online: Harvard Dataverse (2015).
- (24) V. Hautamaki, I. Karkkainen, P. Franti, Outlier detection using k-nearest neighbour graph, in: 17th International Conference on Pattern Recognition, 2004, pp. 430–433.
- (25) B. Tang, H. He, Q. Ding, S. Kay, A parametric classification rule based on the exponentially embedded family, IEEE Transactions on Neural Networks and Learning Systems 26 (2) (2015) 367–377.
- (26) B. Tang, H. He, P. M. Baggenstoss, S. Kay, A Bayesian classification approach using class-specific features for text categorization, IEEE Transactions on Knowledge and Data Engineering 28 (6) (2016) 1602–1606.
- (27) S. Kay, Q. Ding, B. Tang, H. He, Probability density function estimation using the EEF with application to subset/feature selection, IEEE Transactions on Signal Processing 64 (3) (2016) 641–651.
- (28) P. M. Baggenstoss, The PDF projection theorem and the class-specific method, IEEE Transactions on Signal Processing 51 (3) (2003) 672–685.
- (29) B. Tang, S. Kay, H. He, P. M. Baggenstoss, EEF: Exponentially embedded families with class-specific features for classification, IEEE Signal Processing Letters 23 (7) (2016) 969–973.