Multi-criteria Similarity-based Anomaly Detection using Pareto Depth Analysis

Abstract

We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. Similarity-based anomaly detection algorithms detect abnormally large amounts of similarity or dissimilarity, e.g. as measured by nearest neighbor Euclidean distances between a test sample and the training samples. In many application domains there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such cases, multiple dissimilarity measures can be defined, including non-metric measures, and one can test for anomalies by scalarizing using a non-negative linear combination of them. If the relative importance of the different dissimilarity measures is not known in advance, as in many anomaly detection applications, the anomaly detection algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we propose a method for similarity-based anomaly detection using a novel multi-criteria dissimilarity measure, the Pareto depth. The proposed Pareto depth analysis (PDA) anomaly detection algorithm uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach is provably better than using linear combinations of the criteria and shows superior performance on experiments with synthetic and real data sets.

Keywords: multi-criteria dissimilarity measure, similarity-based learning, combining dissimilarities, Pareto front, scalarization gap, partial correlation

1 Introduction

Identifying patterns of anomalous behavior in a data set, often referred to as anomaly detection, is an important problem with diverse applications including intrusion detection in computer networks, detection of credit card fraud, and medical informatics \citep{Hodge2004,Chandola2009}. Similarity-based approaches to anomaly detection have generated much interest [4, 5, 6, 7, 8, 9, 10] due to their relative simplicity and robustness as compared to model-based, cluster-based, and density-based approaches \citep{Hodge2004,Chandola2009}. These approaches typically involve the calculation of similarities or dissimilarities between data samples using a single dissimilarity criterion, such as Euclidean distance. Examples include approaches based on $k$-nearest neighbor ($k$-NN) distances [4, 5, 6], local neighborhood densities [7], local p-value estimates [8], and geometric entropy minimization [9, 10].

In many application domains, such as those involving categorical data, it may not be possible or practical to represent data samples in a geometric space in order to compute Euclidean distances. Furthermore, multiple dissimilarity measures corresponding to different criteria may be required to detect certain types of anomalies. For example, consider the problem of detecting anomalous object trajectories in video sequences of different lengths. Multiple criteria, such as dissimilarities in object speeds or trajectory shapes, can be used to detect a greater range of anomalies than any single criterion.

In order to perform anomaly detection using these multiple criteria, one could first combine the dissimilarities for each criterion using a non-negative linear combination and then apply a (single-criterion) anomaly detection algorithm. However, in many applications, the relative importance of the different criteria is not known in advance. It is thus difficult to determine how much weight to assign to each criterion, so one may have to run the anomaly detection algorithm multiple times using different weights selected by a grid search or similar method.

We propose a novel multi-criteria approach for similarity-based anomaly detection using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality, which is the typical method for defining optimality when there may be multiple conflicting criteria for comparing items. An item is said to be Pareto-optimal if there does not exist another item that is better or equal in all of the criteria. An item that is Pareto-optimal is optimal in the usual sense under some (not necessarily linear) combination of the criteria. Hence PDA is able to detect anomalies under multiple combinations of the criteria without explicitly forming these combinations.

Figure 1: (a) Illustrative example with training samples (blue x's) and test samples (red circle and triangle) in $\mathbb{R}^2$. (b) Dyads for the training samples (black dots) along with the first Pareto fronts (green lines) under two criteria: $|\Delta x|$ and $|\Delta y|$. The Pareto fronts induce a partial ordering on the set of dyads. Dyads associated with the test sample marked by the red circle concentrate around shallow fronts (near the lower left of the figure). (c) Dyads associated with the test sample marked by the red triangle concentrate around deep fronts.

The PDA approach involves creating dyads corresponding to dissimilarities between pairs of data samples under all of the criteria. Sets of Pareto-optimal dyads, called Pareto fronts, are then computed. The first Pareto front (depth one) is the set of non-dominated dyads. The second Pareto front (depth two) is obtained by removing these non-dominated dyads, i.e. peeling off the first front, and recomputing the first Pareto front of those remaining. This process continues until no dyads remain. In this way, each dyad is assigned to a Pareto front at some depth (see Fig. 1 for illustration).
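To make the peeling procedure concrete, here is a minimal brute-force sketch (our illustration, not the paper's implementation; faster non-dominated sorting algorithms are discussed in Section 4.3):

```python
import numpy as np

def strictly_dominates(a, b):
    """True if dyad a strictly dominates dyad b: a is no greater than b
    in every criterion and strictly smaller in at least one."""
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_peel(dyads):
    """Assign every dyad (row of an n-by-K array of dissimilarities) to a
    Pareto front by repeatedly removing the current set of non-dominated
    dyads.  Returns one integer depth per dyad (1 = first front)."""
    dyads = np.asarray(dyads, dtype=float)
    depth = np.zeros(len(dyads), dtype=int)
    remaining = set(range(len(dyads)))
    front = 0
    while remaining:
        front += 1
        nondominated = {i for i in remaining
                        if not any(strictly_dominates(dyads[j], dyads[i])
                                   for j in remaining if j != i)}
        for i in nondominated:
            depth[i] = front
        remaining -= nondominated
    return depth
```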

The Pareto depth of a dyad is a novel measure of dissimilarity between a pair of data samples under multiple criteria. In an unsupervised anomaly detection setting, the majority of the training samples are assumed to be nominal. Thus a nominal test sample would likely be similar to many training samples under some criteria, so most dyads for the nominal test sample would appear in shallow Pareto fronts. On the other hand, an anomalous test sample would likely be dissimilar to many training samples under many criteria, so most dyads for the anomalous test sample would be located in deep Pareto fronts. Thus computing the Pareto depths of the dyads corresponding to a test sample can discriminate between nominal and anomalous samples.

Under the assumption that the multi-criteria dyads can be modeled as realizations from a $d$-dimensional density, where $d$ is the number of criteria, we provide a mathematical analysis of properties of the first Pareto front relevant to anomaly detection. In particular, in the Scalarization Gap Theorem we prove upper and lower bounds on the degree to which the Pareto fronts are non-convex. For any algorithm using non-negative linear combinations of criteria, non-convexities in the Pareto fronts contribute to an artificially inflated anomaly score, resulting in an increased false positive rate. Thus our analysis shows in a precise sense that PDA can outperform any algorithm that uses a non-negative linear combination of the criteria. Furthermore, this theoretical prediction is experimentally validated by comparing PDA to several single-criterion similarity-based anomaly detection algorithms in two experiments involving both synthetic and real data sets.

The rest of this paper is organized as follows. We discuss related work in Section 2. In Section 3 we provide an introduction to Pareto fronts and present a theoretical analysis of the properties of the first Pareto front. Section 4 relates Pareto fronts to the multi-criteria anomaly detection problem, which leads to the PDA anomaly detection algorithm. Finally we present three experiments in Section 5 to provide experimental support for our theoretical results and evaluate the performance of PDA for anomaly detection.

2 Related work

2.1 Multi-criteria methods for machine learning

Several machine learning methods utilizing Pareto optimality have previously been proposed; an overview can be found in \citep{Jin2008}. These methods typically formulate supervised machine learning problems as multi-objective optimization problems over a potentially infinite set of candidate items where finding even the first Pareto front is quite difficult, often requiring multi-objective evolutionary algorithms. These methods differ from our use of Pareto optimality because we consider Pareto fronts created from a finite set of items, so we do not need to employ sophisticated algorithms in order to find these fronts. Rather, we utilize Pareto fronts to form a statistical criterion for anomaly detection.

Finding the Pareto front of a finite set of items has also been referred to in the literature as the skyline query \citep{borzsony2001skyline,tan2001efficient} or the maximal vector problem \citep{kung1975finding}. Research on skyline queries has focused on how to efficiently compute or approximate items on the first Pareto front and efficiently store the results in memory. Algorithms for skyline queries can be used in the proposed PDA approach for computing Pareto fronts. Our work differs from skyline queries because the focus of PDA is the utilization of multiple Pareto fronts for the purpose of multi-criteria anomaly detection, not the efficient computation or approximation of the first Pareto front.

Hero and Fleury \citet{hero2004pareto} introduced a method for gene ranking using multiple Pareto fronts that is related to our approach. The method ranks genes, in order of interest to a biologist, by creating Pareto fronts on the data samples, i.e. the genes. In this paper, we consider Pareto fronts of dyads, which correspond to dissimilarities between pairs of data samples under multiple criteria rather than the samples themselves, and use the distribution of dyads in Pareto fronts to perform multi-criteria anomaly detection rather than gene ranking.

Another related area is multi-view learning \citep{Blum1998,Sindhwani2005}, which involves learning from data represented by multiple sets of features, commonly referred to as “views”. In such a case, training in one view is assumed to help to improve learning in another view. The problem of view disagreement, where samples take on different classes in different views, has recently been investigated \citep{Christoudias2008}. The views are similar to criteria in our problem setting. However, in our setting, different criteria may be orthogonal and could even give contradictory information; hence there may be severe view disagreement. Thus training in one view could actually worsen performance in another view, so the problem we consider differs from multi-view learning. A similar area is that of multiple kernel learning \citep{Gonen2011}, which is typically applied to supervised learning problems, unlike the unsupervised anomaly detection setting we consider.

2.2 Anomaly detection

Many methods for anomaly detection have previously been proposed. Hodge and Austin \citet{Hodge2004} and Chandola et al. \citet{Chandola2009} both provide extensive surveys of different anomaly detection methods and applications.

This paper focuses on the similarity-based approach to anomaly detection, also known as instance-based learning. This approach typically involves transforming similarities between a test sample and training samples into an anomaly score. Byers and Raftery \citet{Byers1998} proposed to use the distance between a sample and its $k$th nearest neighbor as the anomaly score for the sample; similarly, Angiulli and Pizzuti \citet{Angiulli2002} and Eskin et al. \citet{Eskin2002} proposed to use the sum of the distances between a sample and its $k$ nearest neighbors. Breunig et al. \citet{Breunig2000} used an anomaly score based on the local density of the $k$ nearest neighbors of a sample. Hero \citet{Hero2006} and Sricharan and Hero \citet{Sricharan2011} introduced non-parametric adaptive anomaly detection methods using geometric entropy minimization, based on random $k$-point minimal spanning trees and bipartite $k$-nearest neighbor ($k$-NN) graphs, respectively. Zhao and Saligrama \citet{Zhao2009} proposed an anomaly detection algorithm k-LPE using local p-value estimation (LPE) based on a $k$-NN graph. The aforementioned anomaly detection methods only depend on the data through the pairs of data points (dyads) that define the edges in the $k$-NN graphs. These methods are designed for a single criterion, unlike the PDA anomaly detection algorithm that we propose in this paper, which accommodates dissimilarities corresponding to multiple criteria.

Other related approaches for anomaly detection include 1-class support vector machines (SVMs) \citep{Scholkopf2001}, where an SVM classifier is trained given only samples from a single class, and tree-based methods, where the anomaly score of a data sample is determined by its depth in a tree or ensemble of trees. Isolation forest [21] and SCiForest [22] are two tree-based approaches, targeted at detecting isolated and clustered anomalies, respectively, using depths of samples in an ensemble of trees. Such tree-based approaches utilize depths to form anomaly scores, similar to PDA; however, they operate on feature representations of the data rather than on dissimilarity representations. Developing multi-criteria extensions of such non-similarity-based methods is beyond the scope of this paper and would be worthwhile future work.

3 Pareto fronts and their properties

Multi-criteria optimization and Pareto fronts have been studied in many application areas in computer science, economics and the social sciences. An overview can be found in \citep{ehrgott2005}. The proposed PDA method in this paper utilizes the notion of Pareto optimality, which we now introduce.

3.1 Motivation for Pareto optimality

Consider the following problem: given $n$ items, denoted by the set $S$, and $K$ criteria for evaluating each item, denoted by functions $f_1, \ldots, f_K$, select $x \in S$ that minimizes the criteria $f_1(x), \ldots, f_K(x)$. In most settings, it is not possible to find a single item which simultaneously minimizes $f_k$ for all $k$. Many approaches to the multi-criteria optimization problem reduce to combining all of the criteria into a single criterion, a process often referred to as scalarization \citep{ehrgott2005}. A common approach is to use a non-negative linear combination of the $f_k$'s and find the item that minimizes the linear combination. Different choices of weights in the linear combination yield different minimizers. In this case, one would need to identify a set of optimal solutions corresponding to different weights using, for example, a grid search over the weights.

A more robust and powerful approach involves identifying the set of Pareto-optimal items. An item $x_i$ is said to strictly dominate another item $x_j$ if $x_i$ is no greater than $x_j$ in each criterion and $x_i$ is less than $x_j$ in at least one criterion. This relation can be written as $x_i \succ x_j$ if $f_k(x_i) \le f_k(x_j)$ for each $k$ and $f_k(x_i) < f_k(x_j)$ for some $k$. The set of Pareto-optimal items, called the Pareto front, is the set of items in $S$ that are not strictly dominated by another item in $S$. It contains all of the minimizers that are found using non-negative linear combinations, but also includes other items that cannot be found by linear combinations. Denote the Pareto front by $\mathcal{F}_1$, which we call the first Pareto front. The second Pareto front can be constructed by finding items that are not strictly dominated by any of the remaining items, which are members of the set $S \setminus \mathcal{F}_1$. More generally, define the $i$th Pareto front by
$$\mathcal{F}_i = \text{Pareto front of the set } S \setminus \bigl(\mathcal{F}_1 \cup \mathcal{F}_2 \cup \cdots \cup \mathcal{F}_{i-1}\bigr).$$
For convenience, we say that a Pareto front $\mathcal{F}_i$ is deeper than $\mathcal{F}_j$ if $i > j$.
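To see why the Pareto front can contain items that no non-negative linear combination recovers, consider a small hypothetical two-criteria example (ours, for illustration). Let three items $a$, $b$, $c$ have criterion values
$$\bigl(f_1(a), f_2(a)\bigr) = (0, 1), \qquad \bigl(f_1(b), f_2(b)\bigr) = (0.6, 0.6), \qquad \bigl(f_1(c), f_2(c)\bigr) = (1, 0).$$
No item strictly dominates another, so all three lie on the first Pareto front. Yet for any weights $w_1, w_2 \ge 0$, not both zero,
$$w_1 f_1(b) + w_2 f_2(b) = 0.6\,(w_1 + w_2) > \tfrac{1}{2}(w_1 + w_2) \ge \min(w_1, w_2) = \min_{x \in \{a,\, c\}} \bigl( w_1 f_1(x) + w_2 f_2(x) \bigr),$$
so $b$ never minimizes the linear combination: it is Pareto-optimal but unattainable by linear scalarization.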

3.2 Mathematical properties of Pareto fronts

The distribution of the number of points on the first Pareto front was first studied by Barndorff-Nielsen and Sobel \citet{nielsen1966}. The problem has garnered much attention since. Bai et al. \citet{bai2005} and Hwang and Tsai \citet{hwang2010} provide good surveys of recent results. We will be concerned here with properties of the first Pareto front that are relevant to the PDA anomaly detection algorithm and have not yet been considered in the literature.

Let $Y_1, \ldots, Y_n$ be independent and identically distributed (i.i.d.) on $\mathbb{R}^d$ with density function $f$, and let $\mathcal{F}_n$ denote the first Pareto front of $Y_1, \ldots, Y_n$. In the general multi-criteria optimization framework, the points $Y_i$ are the images in $\mathbb{R}^d$ of feasible solutions to some optimization problem under a vector of objective functions of length $d$. In the context of multi-criteria anomaly detection, each point $Y_i$ is a dyad corresponding to dissimilarities between two data samples under multiple criteria, and $d$ is the number of criteria.

A common approach in multi-objective optimization is linear scalarization \citep{ehrgott2005}, which constructs a new single criterion as a non-negative linear combination of the criteria. It is well-known, and easy to see, that linear scalarization will only identify Pareto-optimal points on the boundary of the convex hull of
$$G_n = \bigcup_{i=1}^{n} \bigl( Y_i + \mathbb{R}_+^d \bigr),$$
where $\mathbb{R}_+^d = \{ y \in \mathbb{R}^d : y_i \ge 0, \ i = 1, \ldots, d \}$. Although this is a common motivation for Pareto optimization methods, there are, to the best of our knowledge, no results in the literature regarding how many points on the Pareto front are missed by scalarization. We present such a result in this section, namely the Scalarization Gap Theorem.

We define
$$L_n = \bigl\{ Y_i \in \mathcal{F}_n : \exists\, w \in \mathbb{R}_+^d \setminus \{0\} \text{ such that } w \cdot Y_i \le w \cdot Y_j \ \text{for all } j = 1, \ldots, n \bigr\}.$$

The subset $L_n \subseteq \mathcal{F}_n$ contains all Pareto-optimal points that can be obtained by some selection of non-negative weights for linear scalarization. Let $|L_n|$ denote the cardinality of $L_n$, and let $|\mathcal{F}_n|$ denote the cardinality of $\mathcal{F}_n$. When $Y_1, \ldots, Y_n$ are uniformly distributed on the unit hypercube, Barndorff-Nielsen and Sobel \citet{nielsen1966} showed that
$$\mathbb{E}|\mathcal{F}_n| = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} k^{-(d-1)},$$

from which one can easily obtain the asymptotics
$$\mathbb{E}|\mathcal{F}_n| = \frac{\log^{d-1} n}{(d-1)!} + O\bigl(\log^{d-2} n\bigr).$$
Many more recent works have studied the variance of $|\mathcal{F}_n|$ and have proven central limit theorems for $|\mathcal{F}_n|$. All of these works assume that $Y_1, \ldots, Y_n$ are uniformly distributed on $[0,1]^d$. For a summary, see \citet{bai2005} and \citet{hwang2010}. Other works have studied $\mathbb{E}|\mathcal{F}_n|$ for more general distributions on domains that have smooth “non-horizontal” boundaries near the Pareto front \citep{baryshnikov2005} and for multivariate normal distributions \citep{ivanin1975}. The “non-horizontal” condition excludes hypercubes.

To the best of our knowledge there are no results on the asymptotics of $\mathbb{E}|\mathcal{F}_n|$ for non-uniformly distributed points on the unit hypercube. This is of great importance as it is impractical in multi-criteria optimization (or anomaly detection) to assume that the coordinates of the points are independent. Typically the coordinates of $Y_i$ are the images of the same feasible solution under several different criteria, which will not in general be independent.

Here we develop results on the size of the gap between the number of items discoverable by scalarization compared to the number of items discovered on the Pareto front. The larger the gap, the more suboptimal scalarization is relative to Pareto optimization. Since $Y_i \in L_n$ if and only if $Y_i$ is on the boundary of the convex hull of $G_n$, the size of $\mathcal{F}_n \setminus L_n$ is related to the convexity (or lack thereof) of the Pareto front. There are several ways in which the Pareto front can be non-convex.

First, suppose that are distributed on some domain with a continuous density function that is strictly positive on . Let be a portion of the boundary of such that

and

where is the unit inward normal to . The conditions on guarantee that a portion of the first Pareto front will concentrate near this portion of the boundary as $n \to \infty$. If we suppose that this portion of the boundary is contained in the interior of the convex hull of the domain, then points on the portion of the Pareto front near it cannot be obtained by linear scalarization, as they are on a non-convex portion of the front. Such non-convexities are a direct result of the geometry of the domain and are depicted in Fig. 2a. In a preliminary version of this work, we studied the expectation of the number of points on the Pareto front within a neighborhood of this boundary portion (Theorem 1 in \citep{hsiao2012}). As a result, we showed that

as , where is a positive constant given by

It has recently come to our attention that a stronger result was proven previously by Baryshnikov and Yukich \citet{baryshnikov2005} in an unpublished manuscript.

Figure 2: (a) Non-convexities in the Pareto front induced by the geometry of the domain. (b) Non-convexities due to randomness in the positions of the points. In each case, the larger points are Pareto-optimal, and the large black points cannot be obtained by scalarization.

In practice, it is unlikely that one would have enough information about the density or the domain to compute the constant. In this paper, we instead study a second type of non-convexity in the Pareto front. These non-convexities are strictly due to randomness in the positions of the points and occur even when the domain is convex (see Fig. 2b for a depiction of such non-convexities). In the following, we assume that $Y_1, \ldots, Y_n$ are i.i.d. on the unit hypercube $[0,1]^d$ with a bounded density function $f$ which is continuous at the origin and strictly positive on $[0,1]^d$. Under these assumptions on $f$, it turns out that the asymptotics of $\mathbb{E}|\mathcal{F}_n|$ and $\mathbb{E}|\mathcal{F}_n \setminus L_n|$ are independent of $f$. Hence our results are applicable to a wide range of problems without the need to know detailed information about the density $f$.

Our first result provides asymptotics on $\mathbb{E}|\mathcal{F}_n|$, the expected size of the first Pareto front.

Theorem 1.

Assume $f$ is continuous at the origin, and $f(0) > 0$. Then

The proof of Theorem 1 is provided in the Appendix. Our second result concerns $\mathbb{E}|L_n|$. We are not able to get the exact asymptotics of $\mathbb{E}|L_n|$, so we provide upper and lower asymptotic bounds.

Theorem 2.

Assume $f$ is continuous at the origin, and $f(0) > 0$. Then

as $n \to \infty$.

Theorem 2 provides a significant generalization of a previous result (Theorem 2 in \citep{hsiao2012}) that holds only for uniform distributions on $[0,1]^d$. The proof of Theorem 2 is also provided in the Appendix.

Combining Theorems 1 and 2, we arrive at our main result:

Theorem (Scalarization Gap Theorem).

Assume $f$ is continuous at the origin, and $f(0) > 0$. Then

as $n \to \infty$.

The Scalarization Gap Theorem shows that the fraction of Pareto-optimal points that cannot be obtained by linear scalarization is bounded below by a positive constant depending only on the dimension $d$. We provide experimental evidence supporting these bounds in Section 5.1.

4 Multi-criteria anomaly detection

We now formally define the multi-criteria anomaly detection problem. A list of notation is provided in Table 1 for reference. Assume that a training set $\mathcal{T} = \{X_1, \ldots, X_N\}$ of $N$ unlabeled data samples is available. Given a test sample $X$, the objective of anomaly detection is to declare $X$ to be an anomaly if $X$ is significantly different from the samples in $\mathcal{T}$. Suppose that $K$ different evaluation criteria are given. Each criterion is associated with a measure for computing dissimilarities. Denote the dissimilarity between samples $X_i$ and $X_j$ computed using the dissimilarity measure corresponding to the $k$th criterion by $d_k(i,j)$. Note that $d_k$ need not be a metric; in particular it is not necessary that $d_k$ be a distance function over the sample space or that $d_k$ satisfy the triangle inequality.

Symbol Definition
Number of criteria (dissimilarity measures)
Number of training samples
th training sample
Single test sample
Dissimilarity between training samples and using th criterion
Dyad between training samples and
Set of all dyads between training samples
Pareto front of dyads between training samples
Total number of Pareto fronts on dyads between training samples
Dyad between training sample and test sample
Pareto depth of dyad between training sample and test sample
Number of nearest neighbors in criterion
Total number of nearest neighbors (over all criteria)
Anomaly score of test sample
Table 1: List of notation

We define a dyad between a pair of samples $X_i$ and $X_j$ by the vector $D_{ij} = [d_1(i,j), \ldots, d_K(i,j)] \in \mathbb{R}_+^K$. There are in total $\binom{N}{2}$ different dyads for the training set. For convenience, denote the set of all dyads by $\mathcal{D}$. By the definition of strict dominance in Section 3, a dyad $D_{ij}$ strictly dominates another dyad $D_{lm}$ if $d_k(i,j) \le d_k(l,m)$ for all $k$ and $d_k(i,j) < d_k(l,m)$ for some $k$. The first Pareto front $\mathcal{F}_1$ corresponds to the set of dyads from $\mathcal{D}$ that are not strictly dominated by any other dyads from $\mathcal{D}$. The second Pareto front $\mathcal{F}_2$ corresponds to the set of dyads from $\mathcal{D} \setminus \mathcal{F}_1$ that are not strictly dominated by any other dyads from $\mathcal{D} \setminus \mathcal{F}_1$, and so on, as defined in Section 3. Recall that we refer to $\mathcal{F}_i$ as a deeper front than $\mathcal{F}_j$ if $i > j$.
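A minimal sketch of the dyad construction (illustrative only; it assumes the $K$ pairwise dissimilarity matrices, one per criterion, have already been computed):

```python
import numpy as np
from itertools import combinations

def build_dyads(dissimilarity_matrices):
    """Stack K pairwise dissimilarity matrices (each N x N, one per
    criterion) into one dyad per unordered pair of training samples.
    Returns (dyads, pairs): dyads has shape (N*(N-1)/2, K) and pairs[r]
    gives the (i, j) sample indices behind row r."""
    K = len(dissimilarity_matrices)
    N = dissimilarity_matrices[0].shape[0]
    pairs = list(combinations(range(N), 2))
    dyads = np.empty((len(pairs), K))
    for r, (i, j) in enumerate(pairs):
        dyads[r] = [D[i, j] for D in dissimilarity_matrices]
    return dyads, pairs
```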

4.1 Pareto fronts on dyads

For each training sample $X_i$, there are $N-1$ dyads corresponding to its connections with the other $N-1$ training samples. If most of these dyads are located at shallow Pareto fronts, then the dissimilarities between $X_i$ and the other training samples are small under some combination of the criteria. Thus, $X_i$ is likely to be a nominal sample. This is the basic idea of the proposed multi-criteria anomaly detection method using PDA.

We construct Pareto fronts $\mathcal{F}_1, \ldots, \mathcal{F}_M$ of the dyads from the training set, where the total number of fronts $M$ is the required number of fronts such that each dyad is a member of a front. When a test sample $X$ is obtained, we create new dyads corresponding to connections between $X$ and training samples, as illustrated in Fig. 1. Like with many other similarity-based anomaly detection methods, we connect each test sample to its nearest neighbors. The number of nearest neighbors could be different for each criterion, so we denote by $s_k$ the number of nearest neighbors used for criterion $k$. We create new dyads $D_i^{\mathrm{new}}$ corresponding to the connections between $X$ and the union of the nearest neighbors in each criterion $k$. In other words, we create a dyad between the test sample $X$ and a training sample $X_i$ if $X_i$ is among the $s_k$ nearest neighbors of $X$ in any criterion $k$. We say that a dyad $D_i^{\mathrm{new}}$ is below a front $\mathcal{F}_l$ if it strictly dominates at least a single dyad in $\mathcal{F}_l$. Define the Pareto depth of $D_i^{\mathrm{new}}$ by
$$e_i = \min \bigl\{ l : D_i^{\mathrm{new}} \text{ is below } \mathcal{F}_l \bigr\}.$$
Therefore if $e_i$ is large, then $D_i^{\mathrm{new}}$ will be near deep fronts, and the distance between $X$ and $X_i$ will be large under all combinations of the criteria. If $e_i$ is small, then $D_i^{\mathrm{new}}$ will be near shallow fronts, so the distance between $X$ and $X_i$ will be small under some combination of the criteria.
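A sketch of the depth computation, together with the mean-depth anomaly score introduced in Section 4.2 below (the linear scan over fronts is our simplification; Section 4.3 notes that binary searches can be used instead):

```python
import numpy as np

def is_below(test_dyad, front_dyads):
    """A test dyad is below a front if it strictly dominates at least one
    dyad on that front."""
    return any(np.all(test_dyad <= d) and np.any(test_dyad < d) for d in front_dyads)

def pareto_depth(test_dyad, fronts):
    """Pareto depth: index of the shallowest training front that the test
    dyad is below.  `fronts` is a list of arrays of training dyads,
    shallowest first; a dyad below no front gets one more than the
    deepest front (an edge-case convention of this sketch)."""
    for depth, front_dyads in enumerate(fronts, start=1):
        if is_below(test_dyad, front_dyads):
            return depth
    return len(fronts) + 1

def anomaly_score(new_dyads, fronts):
    """Mean Pareto depth over the new dyads connecting a test sample to
    its nearest neighbors; larger scores are more anomalous."""
    return float(np.mean([pareto_depth(d, fronts) for d in new_dyads]))
```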

4.2 Anomaly detection using Pareto depths

In $k$-NN based anomaly detection algorithms such as those mentioned in Section 2.2, the anomaly score is a function of the $k$ nearest neighbors to a test sample. With multiple criteria, one could define an anomaly score by scalarization. From the probabilistic properties of Pareto fronts discussed in Section 3.2, we know that Pareto optimization methods identify more Pareto-optimal points than linear scalarization methods and significantly more Pareto-optimal points than a single weight for scalarization.

This motivates us to develop a multi-criteria anomaly score using Pareto fronts. We start with the observation from Fig. 1 that dyads corresponding to a nominal test sample are typically located near shallower fronts than dyads corresponding to an anomalous test sample. Each test sample $X$ is associated with a set of new dyads, where the $i$th dyad $D_i^{\mathrm{new}}$ has depth $e_i$. The Pareto depth $e_i$ is a multi-criteria dissimilarity measure that indicates the dissimilarity between the test sample $X$ and the training sample $X_i$ under multiple combinations of the criteria. For each test sample $X$, we define the anomaly score $v(X)$ to be the mean of the $e_i$'s, which corresponds to the average depth of the dyads associated with $X$, or equivalently, the average of the multi-criteria dissimilarities between the test sample and its nearest neighbors. Thus the anomaly score can be easily computed and compared to a decision threshold $\sigma$ using the test
$$v(X) = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} e_i \ \underset{\text{nominal}}{\overset{\text{anomalous}}{\gtrless}} \ \sigma,$$
where $\mathcal{N}$ denotes the set of training samples connected to $X$ by a new dyad.

Recall that the Scalarization Gap Theorem provides bounds on the fraction of dyads on the first Pareto front that cannot be obtained by linear scalarization. Specifically, a guaranteed fraction of the dyads on the first front will be missed by linear scalarization on average. These dyads will be associated with deeper fronts by linear scalarization, which will artificially inflate the anomaly score for the test sample, resulting in an increased false positive rate for any algorithm that utilizes non-negative linear combinations of criteria. This effect then cascades to dyads in deeper Pareto fronts, which also get assigned inflated anomaly scores. We provide some evidence of this effect on a real data experiment in Section 5.3. Finally, the lower bound increases monotonically in the number of criteria, which implies that the PDA approach gains additional advantages over linear combinations as the number of criteria increases.

4.3 PDA anomaly detection algorithm

Training phase:

1:  for $k = 1 \to K$ do
2:     Calculate pairwise dissimilarities $d_k(i,j)$ between all training samples $X_i$ and $X_j$
3:  Create dyads $D_{ij} = [d_1(i,j), \ldots, d_K(i,j)]$ for all pairs of training samples
4:  Construct Pareto fronts on the set of all dyads until each dyad is in a front

Testing phase:

1:  $\mathcal{N} \leftarrow [\,]$ {empty list}
2:  for $k = 1 \to K$ do
3:     Calculate dissimilarities between the test sample $X$ and all training samples in criterion $k$
4:     $\mathcal{N}_k \leftarrow$ the $s_k$ nearest neighbors of $X$ in criterion $k$
5:     $\mathcal{N} \leftarrow [\mathcal{N}, \mathcal{N}_k]$ {append neighbors to list}
6:  Create new dyads $D_i^{\mathrm{new}}$ between $X$ and the training samples in $\mathcal{N}$
7:  for each dyad $D_i^{\mathrm{new}}$ do
8:     Calculate the depth $e_i$ of $D_i^{\mathrm{new}}$
9:  Declare $X$ an anomaly if $v(X) = \mathrm{mean}_i\, e_i > \sigma$
Figure 3: Pseudocode for PDA anomaly detection algorithm.

Pseudocode for the PDA anomaly detector is shown in Fig. 3. The training phase involves creating dyads corresponding to all pairs of training samples. Computing all pairwise dissimilarities in each criterion requires $O(N^2 p)$ floating-point operations (flops), where $p$ denotes the number of dimensions involved in computing a dissimilarity. The time complexity of the training phase is dominated by the construction of the Pareto fronts by non-dominated sorting. Non-dominated sorting is used heavily by the evolutionary computing community; to the best of our knowledge, the fastest algorithm for non-dominated sorting was proposed by Jensen [29] and later generalized by Fortin et al. [30]. The complexity analyses in [29, 30] are asymptotic in the number of training samples $N$ and assume that the number of criteria $K$ is fixed. We are unaware of any analyses of its asymptotics in $K$. Another non-dominated sorting algorithm proposed by Deb et al. [31] constructs all of the Pareto fronts using a number of comparisons that is linear in the number of criteria but scales poorly with the number of training samples $N$. We evaluate how both approaches scale with $N$ and $K$ experimentally in Section 5.2.2.

The testing phase involves creating dyads between the test sample and its $s_k$ nearest training samples in each criterion $k$. For each new dyad $D_i^{\mathrm{new}}$, we need to calculate the depth $e_i$. This involves comparing the test dyad with training dyads on multiple fronts until we find a training dyad that is dominated by the test dyad; $e_i$ is the index of the front that this training dyad is a part of. Using a binary search to select the front and another binary search to select the training dyads within the front to compare to, only a modest number of comparisons is needed (in the worst case) to compute $e_i$. The anomaly score is computed by taking the mean of the $e_i$'s corresponding to the test sample; the score is then compared against a threshold to determine whether the sample is anomalous.

4.4 Selection of parameters

The parameters to be selected in PDA are $s_1, \ldots, s_K$, which denote the number of nearest neighbors in each criterion.

The selection of such parameters in unsupervised learning problems is very difficult in general. For each criterion $k$, we construct an $s_k$-NN graph using the corresponding dissimilarity measure. We construct symmetric $s_k$-NN graphs, i.e. we connect samples $X_i$ and $X_j$ if $X_i$ is one of the $s_k$ nearest neighbors of $X_j$ or $X_j$ is one of the $s_k$ nearest neighbors of $X_i$. We choose an initial value of $s_k$ as a starting point and, if necessary, increase $s_k$ until the $s_k$-NN graph is connected. This method of choosing $s_k$ is motivated by asymptotic results for connectivity in $k$-NN graphs and has been used as a heuristic in other unsupervised learning problems, such as spectral clustering \citep{vonLuxburg2007}. We find that this heuristic works well in practice, including on a real data set of pedestrian trajectories, which we present in Section 5.3.
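A rough sketch of this heuristic (the helper name and the starting value are ours; the exact starting point is not reproduced here):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def smallest_connecting_k(dissimilarity, k_start=1):
    """Smallest k for which the symmetric k-NN graph built from one
    criterion's N x N dissimilarity matrix is connected."""
    N = dissimilarity.shape[0]
    order = np.argsort(dissimilarity, axis=1)      # neighbors, nearest first
    for k in range(k_start, N):
        adj = np.zeros((N, N), dtype=bool)
        for i in range(N):
            neighbors = [j for j in order[i] if j != i][:k]
            adj[i, neighbors] = True
        adj = adj | adj.T                          # symmetrize the k-NN graph
        n_components, _ = connected_components(adj, directed=False)
        if n_components == 1:
            return k
    return N - 1
```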

5 Experiments

We first present an experiment involving the scalarization gap for dyads (rather than i.i.d. samples). Then we compare the PDA method with five single-criterion anomaly detection algorithms on a simulated data set and a real data set. The five algorithms we use for comparison are as follows:

  • kNN: distance to the $k$th nearest neighbor \citep{Byers1998}.

  • kNN sum: sum of the distances to the $k$ nearest neighbors \citep{Angiulli2002,Eskin2002}.

  • k-LPE: localized p-value estimate using the $k$ nearest neighbors \citep{Zhao2009}.

  • LOF: local density of the $k$ nearest neighbors \citep{Breunig2000}.

  • 1-SVM: 1-class support vector machine \citep{Scholkopf2001}.

For these methods, we use linear combinations of the criteria with different weights (linear scalarization) to compare performance with the proposed multi-criteria PDA method. We find that the accuracies of the nearest neighbor-based methods do not vary much in our experiments over a range of choices of $k$. The results we report use a fixed number of neighbors. For the 1-class SVM, it is difficult to choose a bandwidth for the Gaussian kernel without having labeled anomalous samples. Linear kernels have been found to perform similarly to Gaussian kernels on dissimilarity representations for SVMs in classification tasks [33]; hence we use a linear kernel on the scalarized dissimilarities for the 1-class SVM.

5.1 Scalarization gap for dyads

Figure 4: Sample means of $|\mathcal{F}_n| - |L_n|$ versus the number of dyads $n$ for dimension $d = 2$. Note the expected logarithmic growth. The dotted line indicates the logarithmic curve of best fit.
Figure 5: Sample means of the fraction of Pareto-optimal points that cannot be obtained by linear scalarization versus the dimension $d$. The upper and lower bounds established in the Scalarization Gap Theorem are given by the dotted lines in the figure. We see in the figure that the fraction of Pareto-optimal points that are not obtainable by linear scalarization increases with dimension.

Independence of the samples $Y_1, \ldots, Y_n$ is built into the assumptions of Theorems 1 and 2, and thus the Scalarization Gap Theorem, but it is clear that dyads (as constructed in Section 4) are not independent. Each dyad represents a connection between two independent training samples $X_i$ and $X_j$. For a given dyad $D_{ij}$, there are $2(N-2)$ corresponding dyads involving $X_i$ or $X_j$, and these are clearly not independent from $D_{ij}$. However, all other dyads are independent from $D_{ij}$. So while there are $O(N^2)$ dyads, each dyad is independent from all other dyads except for a set of size $O(N)$. Since the Scalarization Gap Theorem is an asymptotic result, the above observation suggests it should hold for the dyads even though they are not i.i.d. In this subsection we present some experimental results which suggest that the Scalarization Gap Theorem does indeed hold for dyads.

We generate synthetic dyads here by drawing i.i.d. uniform samples in $[0,1]^2$ and then constructing dyads corresponding to the two criteria $|\Delta x|$ and $|\Delta y|$, which denote the absolute differences between the $x$ and $y$ coordinates, respectively. The domain of the resulting dyads is again the box $[0,1]^2$. In this case, the Scalarization Gap Theorem suggests that $|\mathcal{F}_n| - |L_n|$ should grow logarithmically in the number of dyads $n$. Fig. 4 shows the sample means of $|\mathcal{F}_n| - |L_n|$ versus the number of dyads and a best fit logarithmic curve of the form $c_1 + c_2 \log n$, where $n$ denotes the number of dyads. We vary the number of dyads over a wide range, computing the size of $\mathcal{F}_n \setminus L_n$ after each increment, and we compute the sample means over many realizations. A linear regression of $|\mathcal{F}_n| - |L_n|$ on $\log n$ gives a slope estimate that falls in the range specified by the Scalarization Gap Theorem.
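The following rough Monte Carlo sketch reproduces the flavor of one realization of this experiment (our code; identifying the scalarization-attainable subset with the lower convex hull of the front in two dimensions is our shortcut, not a construction from the text):

```python
import numpy as np

def first_front(dyads):
    """Indices of non-dominated dyads (first Pareto front, minimization)."""
    keep = []
    for a, d in enumerate(dyads):
        dominated = np.any(np.all(dyads <= d, axis=1) & np.any(dyads < d, axis=1))
        if not dominated:
            keep.append(a)
    return np.array(keep)

def turn(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull_size(points):
    """Number of 2-D front points attainable by non-negative linear
    scalarization: the vertices of the lower convex hull of the front."""
    pts = points[np.lexsort((points[:, 1], points[:, 0]))]
    hull = []
    for p in pts:
        while len(hull) >= 2 and turn(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return len(hull)

rng = np.random.default_rng(0)
samples = rng.random((100, 2))              # i.i.d. uniform samples in [0, 1]^2
i, j = np.triu_indices(len(samples), k=1)
dyads = np.abs(samples[i] - samples[j])     # criteria |dx| and |dy|, one dyad per pair
front = dyads[first_front(dyads)]
print(len(front), len(front) - lower_hull_size(front))   # |F_n| and the scalarization gap
```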

We next explore the dependence of the scalarization gap on the dimension $d$. Here, we generate dyads (corresponding to points in $[0,1]^d$) in the same way as before, for a range of dimensions $d$. The criteria in this case correspond to the absolute differences in each dimension. In Fig. 5 we plot the fraction of Pareto-optimal points that cannot be obtained by scalarization versus the dimension $d$. Recall from Theorem 1 that

Based on the figure, one might conjecture that the fraction of unattainable Pareto-optimal points converges to $1$ as $d \to \infty$. If this is true, it would essentially imply that linear scalarization is useless for identifying dyads on the first Pareto front when there are a large number of criteria. As before, we compute the sample means over many realizations of the experiment.

5.2 Simulated experiment with categorical attributes

In this experiment, we perform multi-criteria anomaly detection on simulated data with multiple groups of categorical attributes. These groups could represent different types of attributes. Each data sample consists of $K$ groups of categorical attributes. Let $A_{kl}$ denote the $l$th attribute in group $k$, and let $n_{kl}$ denote the number of possible values for this attribute. We randomly select the number of possible values $n_{kl}$ for each attribute with equal probability from a fixed range, independent of all other attributes. Each attribute is a random variable described by a categorical distribution, where the parameters of the categorical distribution are sampled from a Dirichlet distribution. For a nominal data sample, we use the same Dirichlet concentration parameters for each attribute in each group $k$.

To simulate an anomalous data sample, we randomly select a group $k$ with probability $p_k$ for which the parameters of the Dirichlet distribution are changed for each attribute in group $k$. Note that different anomalous samples may differ in the group that is selected. The $p_k$'s are chosen to be non-uniform, and their sum determines the probability that a test sample is anomalous. The non-uniform distribution on the $p_k$'s results in some criteria being more useful than others for identifying anomalies. The criteria for anomaly detection are taken to be the dissimilarities between data samples for each of the $K$ groups of attributes. For each group, we calculate the dissimilarity over the attributes using a dissimilarity measure for anomaly detection on categorical data proposed by \citet{Eskin2002}.
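Since the exact group sizes, value ranges, and Dirichlet concentrations are elided above, the following sketch uses placeholder values purely to illustrate the generative structure, with one perturbed group per anomalous sample:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_attrs = 4, 10                                   # placeholder sizes
n_values = rng.integers(4, 12, size=(K, n_attrs))     # possible values per attribute (placeholder)
NOMINAL_CONC, ANOMALOUS_CONC = 10.0, 0.5              # placeholder Dirichlet concentrations

def sample_record(anomalous_group=None):
    """Draw one sample: for every attribute, draw categorical parameters
    from a Dirichlet distribution and then draw the attribute value; the
    anomalous group (if any) uses a perturbed concentration."""
    record = []
    for g in range(K):
        conc = ANOMALOUS_CONC if g == anomalous_group else NOMINAL_CONC
        group = []
        for a in range(n_attrs):
            probs = rng.dirichlet(np.full(n_values[g, a], conc))
            group.append(int(rng.choice(n_values[g, a], p=probs)))
        record.append(group)
    return record

nominal_sample = sample_record()
anomalous_sample = sample_record(anomalous_group=int(rng.integers(K)))
```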

We draw training samples from the nominal distribution and test samples from a mixture of the nominal and anomalous distributions. For the single-criterion algorithms, which we use as baselines for comparison, we use linear scalarization with multiple choices of weights. Since a grid search scales exponentially with the number of criteria and is computationally intractable even for moderate values of $K$, we instead uniformly sample weights from the $(K-1)$-dimensional simplex. In other words, we sample weights from a uniform distribution over all convex combinations of the criteria.
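Weights uniformly distributed on the simplex can be drawn with a symmetric Dirichlet distribution, a standard equivalence; the snippet below is a minimal sketch with our own names:

```python
import numpy as np

def sample_simplex_weights(n_weights, K, seed=0):
    """Weight vectors drawn uniformly from the (K-1)-dimensional simplex;
    Dirichlet(1, ..., 1) is exactly the uniform distribution there."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(K), size=n_weights)

weights = sample_simplex_weights(100, 6)
# Scalarized dissimilarity for one weight vector w and K matrices Ds:
# D_w = sum(w[k] * Ds[k] for k in range(K))
```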

Figure 6: AUC of PDA compared to AUCs of single-criterion methods for simulated experiment. The single-criterion methods use randomly sampled weights for linear scalarization, with weights ordered from worst choice of weights (left) to best choice (right) in terms of maximizing AUC. The proposed PDA algorithm is a multi-criteria algorithm that does not require selecting weights. PDA outperforms all of the single-criterion methods, even for the best choice of weights, which is not known in advance.
Method      AUC by weight
            Median           Best
PDA         0.885 ± 0.002    —
k-NN        0.749 ± 0.002    0.872 ± 0.002
k-NN sum    0.747 ± 0.002    0.870 ± 0.002
k-LPE       0.744 ± 0.002    0.867 ± 0.002
LOF         0.749 ± 0.002    0.859 ± 0.002
1-SVM       0.757 ± 0.002    0.873 ± 0.002
Table 2: Comparison of AUCs for simulated experiment. Best performer up to one standard error is shown in bold. PDA does not use weights so it has a single AUC. Median and best AUCs over all choices of weights are shown for the other methods.

Detection accuracy

The different methods are evaluated using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). We first fix the number of criteria $K$. The mean AUCs over the simulation runs are shown in Fig. 6. Multiple choices of weights are used for linear scalarization for the single-criterion algorithms; the results are ordered from worst to best weight in terms of maximizing AUC. kNN, kNN sum, and k-LPE perform roughly equally so only kNN is shown in the figure. Table 2 presents a comparison of the AUC for PDA with the median and best AUCs over all choices of weights for scalarization. Both the mean and standard error of the AUCs over the simulation runs are shown. Notice that PDA outperforms even the best weighted combination for each of the five single-criterion algorithms and significantly outperforms the combination resulting in the median AUC, which is more representative of the performance one expects to obtain by arbitrarily choosing weights.

Figure 7: The ratio of the AUC for PDA compared to the best and median AUCs of scalarization using LOF as the number of criteria $K$ is varied in the simulated experiment. Choices of weights uniformly sampled from the $(K-1)$-dimensional simplex are used for scalarization. PDA performs significantly better than the median over all weights for all $K$. For larger $K$, PDA outperforms the best weights for scalarization, and the margin increases as $K$ increases.

Next we investigate the performance gap between PDA and scalarization as the number of criteria $K$ varies. The performance of the five single-criterion algorithms is very close, so we show scalarization results only for LOF. The ratios of the AUC for PDA to the AUCs of the best and median weights for scalarization are shown in Fig. 7. PDA offers a significant improvement compared to the median over the weights for scalarization. For small values of $K$, PDA performs roughly equally with scalarization under the best choice of weights. As $K$ increases, however, PDA clearly outperforms scalarization, and the gap grows with $K$. We believe this is partially due to the inadequacy of scalarization for identifying Pareto fronts as described in the Scalarization Gap Theorem and partially due to the difficulty in selecting optimal weights for the criteria. A grid search may be able to reveal better weights for scalarization, but it is also computationally intractable for large $K$. Thus we conclude that PDA is clearly the superior approach for large $K$.

Computation time

We evaluate how the computation time of PDA scales with varying $N$ and $K$ using both the non-dominated sorting procedures of Fortin et al. [30] (denoted by PDA-Fortin) and Deb et al. [31] (denoted by PDA-Deb) discussed in Section 4.3. The time complexity of the testing phase is negligible compared to the training phase, so we measure the computation time required to train the PDA anomaly detector.

We first fix $K$ and measure the computation time for $N$ varying on a log scale. Since the actual computation time depends heavily on the implementation of the non-dominated sorts, we normalize computation times by the time required to train the anomaly detector for a reference value of $N$ so we can observe the scaling in $N$.

Figure 8: Normalized computation time as a function of the number of training samples $N$ in the simulated experiment. Best fit curves are of the form $cN^a$. The best fit curves for 1-SVM and PDA-Deb are extrapolated beyond the largest sample sizes for which those methods were run.

The normalized times for PDA as well as k-LPE and 1-SVM are shown in Fig. 8. Best fit curves of the form $cN^a$ are also plotted, with $c$ and $a$ estimated by linear regression. PDA-Deb has the largest estimated exponent; of the four algorithms, it has the worst scaling in $N$. PDA-Fortin has a much smaller estimated exponent, confirming that it scales much better than PDA-Deb and is applicable to large data sets. k-LPE is representative of the k-NN algorithms. It is difficult to determine the time complexity of 1-SVM due to its iterative nature; its estimated exponent suggests that it scales worse than PDA-Fortin.

(a) PDA-Deb
(b) PDA-Fortin
Figure 9: Normalized computation time as a function of the number of criteria $K$ in the simulated experiment. PDA-Deb (a) appears to be linear in $K$ as predicted. PDA-Fortin (b) initially appears to be exponential in $K$, but the computation time does not continue to increase exponentially for larger $K$.

Next we fix $N$ and measure the computation time as $K$ varies. We normalize by the time required to train the anomaly detector for a reference value of $K$ to observe the scaling in $K$. The normalized time for PDA-Deb is shown in Fig. 9a along with a best fit line. The normalized time does indeed appear to be linear in $K$ and grows slowly. The normalized time for PDA-Fortin is shown in Fig. 9b along with a best fit exponential curve fit to the smaller values of $K$. Notice that the computation time initially increases exponentially but increases at a much slower, possibly even linear, rate beyond moderate $K$. The analyses of time complexity from [29, 30] are asymptotic in $N$ and assume $K$ fixed; we are not aware of any analyses of time complexity asymptotic in $K$. Our experiments suggest that PDA-Fortin is computationally tractable for non-dominated sorting in PDA even for large $K$. Finally we note that the scaling in $K$ for scalarization methods is trivial, depending simply on the number of choices for scalarization weights, which is exponential for a grid search.

5.3 Pedestrian trajectories

We now present an experiment on a real data set that contains thousands of pedestrians’ trajectories in an open area monitored by a video camera \citep{Majecka2009}. We represent a trajectory with $T$ time samples by
$$\bigl[ (x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T) \bigr],$$
where $(x_t, y_t)$ denotes a pedestrian’s position at time step $t$. The pedestrian trajectories are of different lengths, so we cannot simply treat the trajectories as vectors in $\mathbb{R}^{2T}$ and calculate Euclidean distances between them. Instead, we propose to calculate dissimilarities between trajectories using two separate criteria for which trajectories may be dissimilar.

Figure 10: (a) Some anomalous pedestrian trajectories detected by PDA. (b) Trajectories with relatively low anomaly scores. The two criteria used are walking speed and trajectory shape. Anomalous trajectories could have anomalous speeds or shapes (or both), so some anomalous trajectories may not look anomalous by shape alone.

The first criterion is to compute the dissimilarity in walking speed. We compute the instantaneous speed at all time steps along each trajectory by finite differencing, i.e. the speed at time step $t$ is given by $\sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}$. A histogram of speeds for each trajectory is obtained in this manner. We take the dissimilarity between two trajectories to be the Kullback-Leibler (K-L) divergence between the normalized speed histograms for those trajectories. K-L divergence is a commonly used measure of the difference between two probability distributions. The K-L divergence is asymmetric; to convert it to a dissimilarity we use the symmetrized K-L divergence as originally defined by Kullback and Leibler \citet{kullback1951information}. We note that, while the symmetrized K-L divergence is a dissimilarity, it does not, in general, satisfy the triangle inequality and is not a metric.
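A minimal sketch of the speed criterion (ours; the histogram binning and the small constant added to avoid empty bins are our choices, not specified in the text):

```python
import numpy as np

def speed_histogram(traj, bin_edges, eps=1e-8):
    """Normalized histogram of instantaneous speeds along a trajectory
    (traj is a T x 2 array of positions), computed by finite differences."""
    speeds = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    hist, _ = np.histogram(speeds, bins=bin_edges)
    hist = hist.astype(float) + eps          # smooth to avoid log(0)
    return hist / hist.sum()

def speed_dissimilarity(traj_a, traj_b, bin_edges):
    """Criterion 1: symmetrized K-L divergence between speed histograms."""
    p = speed_histogram(traj_a, bin_edges)
    q = speed_histogram(traj_b, bin_edges)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```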

The second criterion is to compute the dissimilarity in shape. To calculate the shape dissimilarity between two trajectories, we apply a technique known as dynamic time warping (DTW) \citep{sankoff1983time}, which first non-linearly warps the trajectories in time to match them in an optimal manner. We then take the dissimilarity to be the summed Euclidean distance between the warped trajectories. This dissimilarity also does not satisfy the triangle inequality in general and is thus not a metric.
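A corresponding sketch of the shape criterion using the textbook DTW recursion (again our illustration):

```python
import numpy as np

def dtw_dissimilarity(traj_a, traj_b):
    """Criterion 2: summed Euclidean distance between two trajectories
    (T_a x 2 and T_b x 2 position arrays) after optimal time warping."""
    na, nb = len(traj_a), len(traj_b)
    cost = np.full((na + 1, nb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[na, nb])
```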

The training set for this experiment consists of randomly sampled trajectories from the data set, a small fraction of which may be anomalous. The test set consists of a separate set of trajectories containing both nominal and anomalous examples. The trajectories in the test set are labeled as nominal or anomalous by a human viewer. These labels are used as ground truth to evaluate anomaly detection performance. Fig. 10 shows some anomalous trajectories and nominal trajectories detected using PDA.

Figure 11: AUC of PDA compared to AUCs of single-criterion methods for the pedestrian trajectories experiment. The single-criterion methods use linear scalarization with 100 uniformly spaced weights; weights are ordered from worst (left) to best (right) in terms of maximizing AUC. PDA outperforms the single-criterion methods for almost all choices of weights.
Method      AUC by weight
            Median           Best
PDA         0.944 ± 0.002    —
k-NN        0.902 ± 0.002    0.918 ± 0.002
k-NN sum    0.901 ± 0.003    0.924 ± 0.002
k-LPE       0.892 ± 0.003    0.917 ± 0.002
LOF         0.754 ± 0.011    0.952 ± 0.003
1-SVM       0.679 ± 0.011    0.910 ± 0.003

Table 3: Comparison of AUCs for pedestrian trajectories experiment. Best performer up to one standard error is shown in bold.
Figure 12: ROC curves for PDA and attainable region for LOF over choices of weights for one run of the pedestrian trajectories experiment. The attainable region denotes the possible ROC curves for LOF corresponding to different choices of weights for linear scalarization. The ROCs for linear scalarization vary greatly as a function of the weights.

We run the experiment multiple times; for each run, we use a different random sample of training trajectories. Fig. 11 shows the performance of PDA as compared to the other algorithms using uniformly spaced weights for convex combinations. Notice that PDA has higher AUC than the other methods for almost all choices of weights for the two criteria. The AUC for PDA is shown in Table 3 along with AUCs for the median and best choices of weights for the single-criterion methods. The mean and standard error over the runs are shown. For the best choice of weights, LOF is the single-criterion method with the highest AUC, but it also has the lowest AUC for the worst choice of weights. For a more detailed comparison, the ROC curve for PDA and the attainable region for LOF (the region between the ROC curves corresponding to weights resulting in the best and worst AUCs) are shown in Fig. 12. Note that the ROC curve for LOF can vary significantly based on the choice of weights. The ROC for 1-SVM also depends heavily on the weights. In the unsupervised setting, it is unlikely that one would be able to achieve the ROC curve corresponding to the weight with the highest AUC, so the expected performance should be closer to the median AUCs in Table 3.

Figure 13: Comparison of a Pareto front (solid red line) on dyads (gray dots) with convex fronts (blue dashed lines) obtained by linear scalarization. The dyads towards the middle of the Pareto front are found in deeper convex fronts than those towards the edges. The result would be inflated anomaly scores for the samples associated with the dyads in the middle of the Pareto fronts when using linear scalarization.

Many of the Pareto fronts on the dyads are non-convex, partially explaining the superior performance of the proposed PDA algorithm. The non-convexities in the Pareto fronts lead to inflated anomaly scores for linear scalarization. A comparison of a Pareto front with two convex fronts (obtained by scalarization) is shown in Fig. 13. The two convex fronts denote the shallowest and deepest convex fronts containing dyads on the illustrated Pareto front. The test samples associated with dyads near the middle of the Pareto fronts would suffer the aforementioned score inflation, as they would be found in deeper convex fronts than those at the tails.

Figure 14: AUCs for different choices of the numbers of nearest neighbors in the pedestrian trajectories experiment. The AUC for the parameters chosen using the proposed heuristic is very close to the maximum AUC over all choices of the parameters.

Finally we note that the proposed PDA algorithm does not appear to be very sensitive to the choices of the numbers of neighbors, as shown in Fig. 14. In fact, the heuristic proposed for choosing the $s_k$'s in Section 4.4 performs quite well in this experiment. Specifically, the AUC obtained when using the parameters chosen by the proposed heuristic is very close to the maximum AUC over all choices of the numbers of neighbors.

6 Conclusion

In this paper we proposed a method for similarity-based anomaly detection using a novel multi-criteria dissimilarity measure, the Pareto depth. The proposed method utilizes the notion of Pareto optimality to detect anomalies under multiple criteria by examining the Pareto depths of dyads corresponding to a test sample. Dyads corresponding to an anomalous sample tended to be located at deeper fronts compared to dyads corresponding to a nominal sample. Instead of choosing a specific weighting or performing a grid search on the weights for dissimilarity measures corresponding to different criteria, the proposed method can efficiently detect anomalies in a manner that is tractable for a large number of criteria. Furthermore, the proposed Pareto depth analysis (PDA) approach is provably better than using linear combinations of criteria. Numerical studies validated our theoretical predictions of PDA’s performance advantages compared to using linear combinations on simulated and real data.

An interesting avenue for future work is to extend the PDA approach to extremely large data sets using approximate, rather than exact, Pareto fronts. In addition to the skyline algorithms from the information retrieval community that focus on approximating the first Pareto front, there has been recent work on approximating Pareto fronts using partial differential equations [38] that may be applicable to the multi-criteria anomaly detection problem.

Appendix

The proofs for Theorems 1 and 2 are presented after some preliminary results. We begin with a general result on the expectation of $|\mathcal{F}_n|$. Let $F$ denote the cumulative distribution function of $f$, defined by
$$F(x) = \int_{[0, x_1] \times \cdots \times [0, x_d]} f(y)\, dy.$$
Proposition 1.

For any $n \ge 1$ we have
$$\mathbb{E}|\mathcal{F}_n| = n \int_{\mathbb{R}^d} \bigl(1 - F(x)\bigr)^{n-1} f(x)\, dx.$$

Proof.

Let $A_i$ be the event that $Y_i \in \mathcal{F}_n$ and let $\mathbf{1}_{A_i}$ be indicator random variables for $A_i$. Then
$$\mathbb{E}|\mathcal{F}_n| = \mathbb{E} \sum_{i=1}^{n} \mathbf{1}_{A_i} = \sum_{i=1}^{n} \mathbb{P}(A_i) = n\, \mathbb{P}(A_1).$$

Conditioning on $Y_1 = x$ we obtain
$$\mathbb{P}(A_1) = \int_{\mathbb{R}^d} \mathbb{P}\bigl(A_1 \mid Y_1 = x\bigr) f(x)\, dx = \int_{\mathbb{R}^d} \mathbb{P}\bigl(\text{no } Y_j,\ j \ne 1, \text{ strictly dominates } x\bigr) f(x)\, dx.$$

Noting that $\mathbb{P}\bigl(\text{no } Y_j,\ j \ne 1, \text{ strictly dominates } x\bigr) = \bigl(1 - F(x)\bigr)^{n-1}$ completes the proof. ∎
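As a quick sanity check of Proposition 1 in the uniform case (our own numerical experiment, not part of the original text), the Monte Carlo average of $|\mathcal{F}_n|$ can be compared with the closed-form sum obtained by expanding the integral when $F(x) = x_1 \cdots x_d$:

```python
import numpy as np
from scipy.special import comb

rng = np.random.default_rng(0)
n, d, trials = 50, 2, 2000

def front_size(pts):
    """Size of the first Pareto front (number of non-dominated points)."""
    return sum(not np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)) for p in pts)

# Monte Carlo estimate of E|F_n| for n uniform points in [0, 1]^d
mc_mean = np.mean([front_size(rng.random((n, d))) for _ in range(trials)])

# Proposition 1 with F(x) = x_1 * ... * x_d (uniform case), expanded into a finite sum
exact = sum((-1) ** (k + 1) * comb(n, k, exact=True) / k ** (d - 1) for k in range(1, n + 1))
print(mc_mean, exact)   # the two agree up to Monte Carlo error
```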

The following simple proposition is essential in the proofs of Theorems 1 and 2.

Proposition 2.

Let and . For we have

(1)

and for we have

(2)
Proof.

We will give a sketch of the proof as similar results are well-known \citepbai2005. Assume and let denote the quantity on the left hand side of (1). Making the change of variables for and , we see that

By computing the inner integrals we find that

from which the asymptotics (1) can be easily obtained by another change of variables , provided . For , we make the change of variables to find that

We can now apply the above result provided . The asymptotics in (1) show that

when , which gives the second result (2). ∎

We now give the proof of Theorem 1.

Proof.

Let and choose such that

and . Since , we have that for all . Since is a probability density on , we must have . Since , we can apply Proposition 2 to find that

(3)

For , we have

Combining this with Proposition 2, and the fact that we have

(4)

Combining (3) and (4) with Proposition 1 we have

It follows that

By a similar argument we can obtain

Since was arbitrary, we see that

The proof of Theorem 2 is split into the following two lemmas. It is well-known, and easy to see, that if and only if , and is on the boundary of the convex hull of \citepehrgott2005. This fact will be used in the proof of Lemma 1.

Lemma 1.

Assume is continuous at the origin and there exists such that . Then

Proof.

Let and choose so that

(5)

and . As in the proof of Proposition 1 we have , so conditioning on we have

As in the proof of Theorem 1, we have

and hence

(6)

Fix and define and

for , and note that for all .

Figure 15: Depiction of the sets and from the proof of Lemma 1 in the case that .

See Fig. 15 for an illustration of these sets for .

We claim that if at least two of contain samples from , and , then . To see this, assume without loss of generality that and are nonempty and let and . Set

By the definitions of and we see that and for all , hence . Let such that

A short calculation shows that which implies that is in the interior of the convex hull of , proving the claim.

Let denote the event that at most one of contains a sample from , and let denote the event that contains no samples from . Then by the observation above we have