LSCP: Locally Selective Combination in Parallel Outlier Ensembles

Yue Zhao, Department of Computer Science, University of Toronto. Email: {yuezhao, znasrullah}
Maciej K. Hryniewicki, Data Analytics, PricewaterhouseCoopers Canada.
Zain Nasrullah, Department of Computer Science, University of Toronto; Data Analytics, PricewaterhouseCoopers Canada.
Zheng Li, Toronto Campus, Northeastern University.

In unsupervised outlier ensembles, the absence of ground truth makes the combination of base detectors a challenging task. Specifically, existing parallel outlier ensembles lack a reliable way of selecting competent base detectors during model combination, which affects accuracy and stability. In this paper, we propose a framework, called Locally Selective Combination in Parallel Outlier Ensembles (LSCP), which addresses this issue by defining a local region around a test instance using the consensus of its nearest neighbors in randomly generated feature spaces. The top-performing base detectors in this local region are selected and combined as the model's final output. Four variants of the LSCP framework are compared with six widely used combination algorithms for parallel ensembles. Experimental results demonstrate that one of these LSCP variants consistently outperforms baseline algorithms on the majority of eighteen real-world datasets.

1 Introduction

Outlier detection methods aim to identify anomalous data objects from the general data distribution and are useful for problems such as credit card fraud prevention and network intrusion detection [8]. Since the ground truth is often absent in outlier mining [1], unsupervised detection methods are commonly used for this task [5, 17, 8]. However, unsupervised approaches are susceptible to generating high false positive and false negative rates [10]. To improve model accuracy and stability in these scenarios, recent research explores ensemble approaches to outlier detection [1, 3, 24, 30]. Ensemble learning combines multiple base estimators to achieve superior detection performance and reliability when compared to an individual estimator [12, 15, 29]. It is important to ensure that the combination process is robust because constituent estimators, if synthesized inappropriately, may be detrimental to the predictive capability of an ensemble [22, 23]. Similar to prior works in classification [24], an outlier ensemble may be characterized as parallel if detectors are generated independently, or sequential if the detector generation, selection or combination is iterative.

Model combination is important for parallel ensembles to ensure diversity exists among base detectors; however, existing works have not jointly addressed two key limitations in this process. First, most parallel ensembles generically combine all detectors without considering selection. This limits the benefits of model combination since individual base detectors may not be proficient at identifying all outlier instances [9]. For example, prior work has demonstrated that the value of good detectors can be neutralized by the inclusion of poor detectors in a generic averaging framework [3]. Second, data locality is rarely emphasized in the context of detector selection and combination, leading to potentially sub-optimal outcomes. While it is acknowledged that certain types of outliers are better identified by local data relationships [26], detectors are often evaluated at a global scale, where all training points are considered, instead of in the local region related to a test instance.

To address the aforementioned limitations, we propose a fully unsupervised framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP) to selectively combine base detectors by emphasizing data locality. The idea is motivated by an established supervised ensemble framework known as Dynamic Classifier Selection (DCS) [15]. DCS selects the best classifier for each test instance by evaluating base classifier competency at a local scale [9]. The rationale behind this is that base classifiers will generally not excel at categorizing all unknown test instances and that an individual classifier is more likely to specialize in a specific local region [9, 16]. Similarly, LSCP first defines the local region of a test instance by the consensus of the nearest training points in randomly generated feature spaces, and then identifies the most competent base detector in this local region by measuring similarity relative to a pseudo ground truth (see [7, 23] for examples). To further improve algorithm stability and capacity, ensemble variations of LSCP are proposed where promising base detectors are kept for a second-phase combination instead of using the single most competent detector. Our technical contributions in this paper are:

  1. We propose a novel combination framework which, to the best of our knowledge, is the first published effort to adapt DCS from supervised classification tasks to unsupervised parallel outlier ensembles.

  2. As a general framework, LSCP is formulated to be compatible with different types of base detectors; we demonstrate its effectiveness with a homogeneous pool of Local Outlier Factor [5] detectors.

  3. We employ various analysis methods to improve model interpretability. First, theoretical explanations and complexity are provided. Second, visualization techniques are used to intuitively explain why LSCP works and when to use it. Third, statistical tests are used to compare experimental results.

  4. Effort has also been made to streamline the accessibility of LSCP. Hyperparameter selection and associated impacts in the context of the framework are discussed in detail. All source code, experiment results and figures are shared for reproduction.

It should be noted that the purpose of LSCP is not to outperform the best unsupervised outlier detector, but rather to explore the use of detector selection and combination at the local level in unsupervised parallel outlier ensembles. However, extensive experiments on 18 real-world datasets show that LSCP consistently yields better performance than existing parallel combination methods. In summary, LSCP is intuitive, stable and effective for combining independent outlier detectors without supervision.

2 Related Works

2.1 Dynamic Classifier Selection and Dynamic Ensemble Selection.

Dynamic Classifier Selection (DCS) is an established combination framework for classification tasks. The technique was first proposed by Ho et al. in 1994 [15] and then extended, under the name DCS Local Accuracy, by Woods et al. in 1997 [27] to select the most accurate base classifier in a local region. The motivation behind this approach is that base classifiers often make distinctive errors and offer a degree of complementarity [6]. Consequently, selectively combining base classifiers can result in a performance improvement over generic ensembles that use the majority vote of all base classifiers. Subsequent theoretical work by Giacinto and Roli validated that, under certain assumptions, the optimal Bayes classifier could be obtained by selecting non-optimal classifiers [14]. DCS was later expanded by Ko et al. to Dynamic Ensemble Selection (DES), which selects multiple base classifiers for a second-phase combination given each test instance [16]. By minimizing reliance on a single classifier and delegating the classification task to a group of competent classifiers, the algorithm has been shown to be more robust than DCS [16]. Motivated by these approaches, LSCP adapts dynamic selection to unsupervised outlier detection tasks.

2.2 Data Locality in Outlier Detection.

The relationship among data objects is critical in outlier detection and existing algorithms can roughly be categorized as either global or local [17, 24, 25]. The former considers all objects during inference while the latter only considers a local selection of objects [25]. In both cases, their applicability is dependent on the structure of the data. Global outlier detection algorithms, for example, offer superior performance when outliers are highly distinctive from the data distribution [23] but often fail to identify outliers in the local neighborhoods of high-dimensional data [5, 26]. Accordingly, global models also struggle with data represented by a mixture of distributions, where global characteristics do not necessarily represent the distribution of objects in local regions [26]. To address these limitations, numerous works have explored local algorithms such as Local Outlier Factor (LOF) [5], Local Outlier Probabilities (LoOP) [17] and GLOSS [26]. However, data locality is rarely considered in the context of detector combination; instead, most combination methods utilize all training data points, e.g., the weight calculation in weighted averaging [30]. LSCP explores both global and local data relationships by training base detectors on the entire dataset and emphasizing data locality during detector combination.

2.3 Outlier Detector Combination.

Recently, studying outlier ensembles has become a popular research area [1, 2, 3, 30], resulting in numerous popular works including: (i) parallel ensembles such as Feature Bagging [18] and Isolation Forest [19]; (ii) sequential methods including CARE [24], SELECT [23] and BoostSelect [7]; and (iii) hybrid approaches like BORE [21] and XGBOD [28]. When the ground truth is unavailable, combining outlier models is challenging. Feature Bagging [18], an early work, generates a diversified set of base detectors by training on randomly selected feature subsets and statically combining their outlier scores. Existing unsupervised combination algorithms in parallel ensembles are often both generic and global (GG); representative GG methods are described below (see [1, 2, 3, 30, 24] for details):

  1. Averaging (GG_A): average scores of all detectors.

  2. Maximization (GG_M): take the maximum score across all detectors.

  3. Weighted Averaging (GG_WA): weight each base detector when averaging.

  4. Threshold Sum (GG_TH): discard all scores below a threshold and sum over the remaining scores.

  5. Average-of-Maximum (GG_AOM): divide base detectors into subgroups and take the maximum score for each subgroup. The final score is the average of all subgroup scores.

  6. Maximum-of-Average (GG_MOA): divide base detectors into subgroups and take the average score for each subgroup. The final score is the maximum of all subgroup scores.
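The six GG rules listed above can be sketched for a single data point as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the subgroups are formed by contiguous slicing (an assumption; any partition without replacement would do), and the number of detectors is assumed to be divisible by the number of subgroups.

```python
from statistics import mean


def gg_average(scores):
    """GG_A: mean score across all detectors."""
    return mean(scores)


def gg_maximum(scores):
    """GG_M: maximum score across all detectors."""
    return max(scores)


def gg_weighted_average(scores, weights):
    """GG_WA: weighted mean; how the weights are obtained is left open here."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def gg_threshold_sum(scores, threshold):
    """GG_TH: drop scores below the threshold, sum the rest."""
    return sum(s for s in scores if s >= threshold)


def _subgroups(scores, n_groups):
    # Assumption: contiguous, equally sized groups.
    size = len(scores) // n_groups
    return [scores[i * size:(i + 1) * size] for i in range(n_groups)]


def gg_aom(scores, n_groups):
    """GG_AOM: average of the per-subgroup maxima."""
    return mean(max(g) for g in _subgroups(scores, n_groups))


def gg_moa(scores, n_groups):
    """GG_MOA: maximum of the per-subgroup averages."""
    return max(mean(g) for g in _subgroups(scores, n_groups))
```

Each function takes the normalized scores that all base detectors assign to one point and returns the combined score for that point.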

As discussed in §2.2, GG methods ignore the importance of data locality while evaluating and combining detectors, which may be inappropriate given the characteristics of outliers [5, 26]. Moreover, without a selection process, poor detectors may hurt the overall detection performance of an ensemble [23, 24]. All aforementioned GG algorithms are thus included as baselines.

There have been attempts to build selective outlier ensembles sequentially in a boosting style. Rayana and Akoglu introduced SELECT [23] and CARE [24] to pick promising detectors and exclude the underperforming ones iteratively, which yielded great results on both temporal graphs and multi-dimensional outlier data. Campos et al. further extend this idea by proposing an unsupervised boosting strategy BoostSelect for outlier ensemble selection [7]. As an alternative to sequential selection models, this paper chooses to focus on parallel detector selection which stresses the importance of data locality. Compared with sequential detector selection methods, our approach can select detectors without iteration which may reduce computational cost.

3 Algorithm Design

LSCP starts with a group of diversified detectors to be combined. For each test instance, LSCP first defines its local region and then picks the most competent detector(s) locally. The selected detector(s) are used to generate the outlier score for the test instance. The workflow of all four proposed LSCP methods is shown in Fig. 1 and Algorithm 1.

Figure 1: LSCP flow chart. Steps requiring re-computation are highlighted in yellow; cached steps are in gray.

3.1 Base Detector Generation.

An effective ensemble should be constructed with diversified base estimators [24, 30] to promote learning distinct characteristics in the data. With a group of homogeneous base detectors, diversity can be induced by subsampling the training set and feature space, or by varying model hyperparameters [6, 30]. In this study, we demonstrate the effectiveness of LSCP by using distinct hyperparameters to construct a pool of models with the same base algorithm. However, in practice, LSCP can also be used as a general framework with heterogeneous base detectors.

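Let $X_{train} \in \mathbb{R}^{n \times d}$ denote training data with $n$ points and $d$ features, and $X_{test} \in \mathbb{R}^{m \times d}$ denote a test set with $m$ points. The algorithm first generates a pool of $R$ base detectors $C = \{C_1, \ldots, C_R\}$ initialized with a range of hyperparameters, e.g., a group of LOF detectors with distinct numbers of neighbors [5]. All base detectors are first trained on $X_{train}$ and then inference is performed on the same dataset. The results are combined into an outlier score matrix $O(X_{train})$, formalized in Eq.(1), where $C_r(\cdot)$ denotes the score vector from the $r$-th base detector. Each detector score is normalized using Z-normalization as per prior work [2, 30].

$$O(X_{train}) = [C_1(X_{train}), \ldots, C_R(X_{train})] \in \mathbb{R}^{n \times R} \tag{1}$$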
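The construction of the normalized training score matrix can be sketched as below. The sketch assumes the raw scores of each detector are already available as one list per detector; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text only states that Z-normalization is applied.

```python
from statistics import mean, pstdev


def z_normalize(column):
    """Shift one detector's scores to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:
        # A constant detector carries no ranking information.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def build_score_matrix(detector_scores):
    """Turn R per-detector score lists (each of length n) into an n x R matrix."""
    normalized = [z_normalize(col) for col in detector_scores]
    # Transpose so that each row holds the scores of one training point.
    return [list(row) for row in zip(*normalized)]
```

Row `i` of the returned matrix holds the scores of all detectors for training point `i`, which is the layout the later selection steps rely on.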


3.2 Pseudo Ground Truth Generation.

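Since LSCP evaluates detector competency without ground truth labels, two methods are used for generating a pseudo ground truth (denoted $target$) from the training score matrix $O(X_{train})$: (i) LSCP_A, which averages the base detector scores, and (ii) LSCP_M, which takes the maximum score across detectors. This is generalized in Eq.(2), where $\phi$ represents the aggregation (average or maximum) taken across all base detectors and $target$ holds one value for each of the $n$ training points.

$$target = \phi(O(X_{train})) \in \mathbb{R}^{n \times 1} \tag{2}$$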


It should be noted that the pseudo ground truth in LSCP is generated using training data and used solely for detector selection.
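A minimal sketch of the two aggregation choices follows. It assumes the training scores are stored as a list of rows, one row per training point and one column per detector; the function name and the `method` strings are hypothetical.

```python
def pseudo_ground_truth(score_matrix, method):
    """Aggregate each training point's detector scores into one pseudo target."""
    if method == "average":      # used by the averaging-based variants
        return [sum(row) / len(row) for row in score_matrix]
    if method == "maximum":      # used by the maximization-based variants
        return [max(row) for row in score_matrix]
    raise ValueError("method must be 'average' or 'maximum'")
```

The result has one entry per training point and is computed once, before any test instance is processed.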

3.3 Local Region Definition.

The local region $\psi_j$ of a test instance $X_{test}^{(j)}$ is defined as the set of its $k$ nearest training objects. Formally, this is denoted as:

$$\psi_j = \{x_i \mid x_i \in X_{train},\ x_i \in kNN_{ens}^{(j)}\} \tag{3}$$

where $kNN_{ens}^{(j)}$ describes the set of the test instance's nearest neighbours subject to an ensemble criterion. This variation of kNN, which is similar to Feature Bagging [18], is proposed to alleviate concerns involving the curse of dimensionality on kNN [4] while leveraging its better precision compared to clustering algorithms in DCS [9]. The process is as follows: (i) $t$ groups of features are randomly selected to construct $t$ new feature spaces; (ii) the $k$ nearest training objects to $X_{test}^{(j)}$ in each feature space are identified using Euclidean distance; and (iii) training objects that appear in more than a set number of these $t$ neighbor lists are added to $kNN_{ens}^{(j)}$, thus defining the local region. The size of the region is not fixed because it depends on the number of training objects that meet the selection criterion.
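The three-step neighbor consensus can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the function and parameter names are hypothetical, each random subspace is assumed to keep between half and all of the features, and the vote threshold `min_count` is left as an explicit argument. A brute-force distance scan is used for clarity.

```python
import math
import random
from collections import Counter


def local_region(x_train, x_query, k, n_subspaces, min_count, rng):
    """Indices of training points that are among the k nearest neighbours
    of x_query in more than min_count random feature subspaces."""
    d = len(x_query)
    votes = Counter()
    for _ in range(n_subspaces):
        # Assumption: a subspace keeps between d // 2 and d features.
        n_feat = rng.randint(max(1, d // 2), d)
        feats = rng.sample(range(d), n_feat)
        # Euclidean distance restricted to the sampled features.
        dist = [
            (math.sqrt(sum((row[f] - x_query[f]) ** 2 for f in feats)), i)
            for i, row in enumerate(x_train)
        ]
        dist.sort()
        votes.update(i for _, i in dist[:k])
    return sorted(i for i, c in votes.items() if c > min_count)
```

Because only points that pass the vote threshold are kept, the returned list can be shorter than `k`, which matches the remark that the region size is not fixed.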

The local region factor $k$ decides the number of nearest neighbors to consider during this process; care is given to avoid selecting extreme values. Smaller values of $k$ give more attention to local relationships, which can result in instability, while large values of $k$ may place too much emphasis on global relationships and have higher computational costs. While it is possible to experimentally determine an optimal $k$ with cross-validation [16] when ground truth is available, a similarly simple approach does not exist in an unsupervised setting. For these reasons, we recommend setting $k$ to 10% of the training samples, bounded within a fixed range, which yielded good results in practice.
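The bounded 10% rule can be expressed in a few lines. The helper name is hypothetical, and the bounds are deliberately left as required arguments; any concrete bound values passed in are the caller's choice and not figures taken from this text.

```python
def local_region_size(n_train, lower, upper, fraction=0.1):
    """k = fraction of the training set, clipped to the interval [lower, upper]."""
    k = round(fraction * n_train)
    return max(lower, min(upper, k))
```

For a small training set the lower bound dominates, and for a large one the upper bound caps the cost of the neighbor search.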

3.4 Model Selection and Combination.

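For each test instance, the local pseudo ground truth $target_{\psi_j}$ can be obtained by retrieving the values associated with the local region $\psi_j$ from $target$:

$$target_{\psi_j} = \{target_{x_i} \mid x_i \in \psi_j\} \in \mathbb{R}^{|\psi_j| \times 1} \tag{4}$$

where $|\psi_j|$ denotes the cardinality of $\psi_j$. Similarly, the local training outlier scores $O(\psi_j)$ can be retrieved from the pre-calculated training score matrix $O(X_{train})$ as:

$$O(\psi_j) = [C_1(\psi_j), \ldots, C_R(\psi_j)] \in \mathbb{R}^{|\psi_j| \times R} \tag{5}$$
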
Consequently, although the local region needs to be re-computed for each test instance, the local outlier scores and targets can be efficiently retrieved from pre-calculated values (see Fig. 1).

For evaluating base estimator competency in a local region, DCS measures the accuracy of base classifiers as the percentage of correctly classified points [16], while LSCP measures the similarity between base detector scores and the pseudo target instead. This distinction is motivated by the lack of direct and reliable ways to access binary labels in unsupervised outlier mining. Although converting pseudo outlier scores to binary labels is feasible, defining an accurate threshold for the conversion is challenging. Additionally, since imbalanced datasets are common in outlier detection tasks, it is more stable to use similarity measures over absolute accuracy for competency evaluation. Therefore, LSCP measures the local competency of each base detector by the Pearson correlation between the local pseudo ground truth $target_{\psi_j}$ and the local detector score $C_r(\psi_j)$. The detector with the highest similarity is regarded as the most competent local detector $C_r^*$ for $X_{test}^{(j)}$, and its outlier score $C_r^*(X_{test}^{(j)})$ can be considered the final score for the corresponding test sample.
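The competency evaluation described above can be sketched as below. The function names are hypothetical, the local scores are assumed to be stored as one row per training point in the local region and one column per detector, and returning a correlation of 0 for a constant input is an assumption made to avoid division by zero.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def most_competent_detector(local_scores, local_target):
    """Return (index of best detector, list of per-detector correlations)."""
    columns = list(zip(*local_scores))  # one tuple of local scores per detector
    corrs = [pearson(col, local_target) for col in columns]
    best = max(range(len(corrs)), key=corrs.__getitem__)
    return best, corrs
```

The returned index is then used to look up that detector's score on the test instance itself.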

3.5 Dynamic Outlier Ensemble Selection.

Selecting only one detector, even if it is most similar to the pseudo ground truth, can be risky in unsupervised learning. However, this risk can be mitigated by selecting a group of detectors for a second-phase combination. This idea can be viewed as an adaptation of supervised DES [16] to outlier detection tasks; correspondingly, we introduce ensemble variations of LSCP which employ Maximum of Average (LSCP_MOA) and Average of Maximum (LSCP_AOM) ensembling methods. Specifically, when the pseudo ground truth is generated by averaging, LSCP_MOA selects a subset of $s$ detectors with the highest outlier scores and then takes the maximum as the outlier score of the test instance. Inversely, LSCP_AOM computes the average of the selected subset of detectors as the outlier score when the pseudo target is generated with maximization. Setting the group size $s$ of selected detectors equal to 1 is a special case of the ensembles yielding the original LSCP algorithms (LSCP_A and LSCP_M). Larger group sizes may be considered more global in their detector selection, while a group size equal to the full pool of $R$ detectors results in a fully global algorithm. In response to this, we recommend using a group size selection process which includes some variance. Specifically, a histogram of detector Pearson correlation scores (to the pseudo ground truth) is built with $b$ equal intervals. The detectors belonging to the most frequent interval are kept for the second-phase combination. A large $b$ thus results in selecting fewer detectors, which offers a flexible way to control the group size of LSCP ensembles.
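The histogram-based group selection and the second-phase combination can be sketched as follows. The names are hypothetical and `n_bins` stands for the number of equal intervals. Breaking ties between equally frequent intervals toward the lower one is an arbitrary assumption, since the text does not specify a tie rule.

```python
def select_by_histogram(correlations, n_bins):
    """Indices of detectors whose correlation falls in the most populated bin."""
    lo, hi = min(correlations), max(correlations)
    if hi == lo:
        return list(range(len(correlations)))  # all detectors are equivalent
    width = (hi - lo) / n_bins
    # The largest value is clamped into the last bin.
    bins = [min(int((c - lo) / width), n_bins - 1) for c in correlations]
    counts = [bins.count(b) for b in range(n_bins)]
    top = counts.index(max(counts))  # assumption: ties go to the lower bin
    return [i for i, b in enumerate(bins) if b == top]


def second_phase(test_scores, selected, mode):
    """Combine the selected detectors' scores for one test instance."""
    chosen = [test_scores[i] for i in selected]
    if mode == "AOM":   # average of the selected detectors
        return sum(chosen) / len(chosen)
    if mode == "MOA":   # maximum of the selected detectors
        return max(chosen)
    raise ValueError("mode must be 'AOM' or 'MOA'")
```

With more bins each interval is narrower, so the most populated one tends to hold fewer detectors, which is the effect the text attributes to a large number of intervals.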

With the appropriate implementation of LSCP_A and LSCP_M, e.g., using a k-d tree, the time complexity for each test instance is $O(nd + n \log n)$: $O(nd)$ for the distance calculation and $O(n \log n)$ for summation and sorting [16]. To combine the base detectors in LSCP_MOA and LSCP_AOM, an additional $O(R)$ is needed, resulting in a total time complexity of $O(nd + n \log n + R)$.

Input: the pool of detectors $C$, training data $X_{train}$, test data $X_{test}$, the local region factor $k$
Output: outlier scores of $X_{test}$
Train all base detectors in $C$ on $X_{train}$
Get training outlier scores $O(X_{train})$ with Eq.(1)
Get pseudo ground truth $target$ with Eq.(2)
for each test instance $X_{test}^{(j)}$ in $X_{test}$ do
     Define local region $\psi_j$ by kNN ensemble (§3.3)
     Get local pseudo ground truth $target_{\psi_j}$ by selecting objects in $\psi_j$ from $target$
     for each base detector $C_r$ in $C$ do
         Get the outlier scores $C_r(\psi_j)$ associated with training data in the local region
         Evaluate the local competency of $C_r$ by the Pearson correlation between $target_{\psi_j}$ and $C_r(\psi_j)$
     end for
     if LSCP_A or LSCP_M then
         return $C_r^*(X_{test}^{(j)})$ where $C_r^*$ has the highest Pearson correlation to $target_{\psi_j}$
     else // LSCP ensembles
         Select a group of $s$ most similar detectors and add them to the empty set $C_s^*$ (§3.5)
         if LSCP_AOM then
              return $avg(C_s^*(X_{test}^{(j)}))$
         else
              return $max(C_s^*(X_{test}^{(j)}))$
         end if
     end if
end for
Algorithm 1 Locally Selective Combination

3.6 Theoretical Considerations.

Recently, Aggarwal and Sathe laid the theoretical foundation for outlier ensembles [2] using the bias-variance tradeoff, a widely used framework for analyzing generalization error in classification problems. The reducible generalization error in outlier ensembles may be minimized by either reducing squared bias or variance where a tradeoff between these two channels usually exists. A high variance detector is sensitive to data variation with high instability; a high bias detector is less sensitive to data variation but may fit complex data poorly. The goal of outlier ensembles is to control both bias and variance to reduce the overall generalization error. Various newly proposed algorithms have been analyzed using this new framework to enhance interpretability [23, 24, 28].
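For reference, the standard textbook form of this decomposition can be written as below. The notation is an assumption made for this sketch and is not taken from [2]: $f(x)$ is the unknown true outlier score, $y = f(x) + \epsilon$ is its noisy observation with noise variance $\sigma^2$, $\hat{f}(x)$ is the detector's estimate, and the expectation is taken over training samples.

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{squared bias}} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \sigma^2$$

Only the first two terms depend on the model, which is why the reducible error is discussed in terms of squared bias and variance.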

It has been shown that combining diversified base detectors, by averaging them for example, results in variance reduction [23, 24, 2]. However, a combination of all base detectors may also include inaccurate ones, leading to higher bias. This explains why generic global averaging does not work well. Within Aggarwal's bias-variance framework, LSCP possesses a combination of both variance and bias reduction. It induces diversity by initializing base detectors with various hyperparameters and indirectly promotes variance reduction in the way that the pseudo ground truth is generated, e.g., averaging in LSCP_A. Furthermore, LSCP focuses on detector selection by local competency, which helps identify base detectors with conditionally low model bias. LSCP_M is also expected to be more stable than global maximization (GG_M) since the variance is reduced by using the most competent detector's output rather than the global maximum values of all base detectors. LSCP_MOA and LSCP_AOM further decrease generalization error through bias reduction and variance reduction, respectively, through their second-phase combination. Although LSCP may reduce the generalization error through both variance and bias channels, it is a heuristic framework with unpredictable results on pathological datasets.

4 Numerical Experiments

4.1 Datasets and Evaluation Metrics.

Table 1 summarizes the 18 public outlier detection benchmark datasets used in this study (available from the ODDS Library). In each experiment, 60% of the data is used for training and the remaining 40% is set aside for validation. Performance is evaluated by taking the average score of 20 independent trials using area under the receiver operating characteristic curve (ROC-AUC) and mean average precision (mAP). Both metrics are widely used in outlier research [3, 4, 21, 23, 28, 13] and statistical measures are used to analyze the results [11]. Specifically, we use a non-parametric Friedman test followed by a post-hoc Nemenyi test. For these tests, a p-value below the chosen significance level is considered statistically significant.
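The two metrics can be computed for a single trial as sketched below. This is a plain-Python illustration with hypothetical function names. It assumes labels of 1 for outliers and 0 for normal points, at least one point of each class, and that average precision is the mean of the precision values at the rank of each true outlier; averaging over the 20 trials is left to the caller.

```python
def roc_auc(labels, scores):
    """Probability that a random outlier outscores a random normal point
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of every true outlier."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

The pairwise ROC-AUC loop is quadratic in the number of points, which is acceptable for illustration but slower than a rank-based implementation on large datasets.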

Dataset Pts Dim Outliers %Outlier
Annthyroid 7200 6 534 7.41
Arrhythmia 452 274 66 14.60
Breastw 683 9 239 34.99
Cardio 1831 21 176 9.61
Glass 214 9 9 4.21
Letter 1600 32 100 6.25
Lympho 148 18 6 4.05
Mnist 7603 100 700 9.21
Musk 3062 166 97 3.17
Pendigits 6870 16 156 2.27
Pima 768 8 268 34.90
Satellite 6435 36 2036 31.64
Satimage-2 5803 36 71 1.22
Shuttle 49097 9 3511 7.15
Thyroid 3772 6 93 2.47
Vertebral 240 6 30 12.50
Vowels 1456 12 50 3.43
Wbc 378 30 21 5.56
Table 1: Real-world datasets used for evaluation

4.2 Experimental Design.

This study compares the six GG algorithms introduced in §2.3 with the four proposed LSCP variations described in Algorithm 1. All models use a pool of 50 LOF base detectors, ensuring consistency during performance evaluation. To induce diversity among base detectors, distinct initialization hyperparameters (specifically, the number of neighbors used in each LOF detector) are randomly selected from a fixed range. For GG_AOM and GG_MOA, the base detectors are divided into 5 subgroups and each group contains 10 base detectors selected without replacement. For all LSCP algorithms, the default hyperparameters mentioned in §3 are used.
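The pool construction and the subgroup split can be sketched as below. The function names are hypothetical, and the sampling range is passed in by the caller, so any concrete bounds used are illustrative values rather than the settings of this study.

```python
import random


def make_neighbor_pool(n_detectors, low, high, rng):
    """Draw distinct neighbor counts, one per LOF detector, from [low, high]."""
    return rng.sample(range(low, high + 1), n_detectors)


def make_subgroups(pool, n_groups, rng):
    """Shuffle the pool and split it into equally sized, disjoint subgroups."""
    shuffled = pool[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // n_groups
    return [shuffled[i * size:(i + 1) * size] for i in range(n_groups)]
```

Calling `make_subgroups(pool, 5, rng)` on a pool of 50 values yields the 5 groups of 10 described above, with no detector appearing in two groups.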

4.3 Algorithm Performances.

Tables 2 and 3 summarize the ROC-AUC and mAP scores on the 18 datasets. Our experiments demonstrate that LSCP can bring consistent performance improvement over its GG counterparts, which is especially noticeable in the mAP scores. The Friedman test shows there is a statistically significant difference between the ten algorithms in both ROC-AUC and mAP; however, the Nemenyi test fails to identify which pairs of algorithms are significantly different. The latter result is expected in an unsupervised setting due to the difficulty of this task relative to the limited number of datasets [11]. In general, LSCP algorithms show great potential: they achieve the highest ROC-AUC scores on 14 datasets and the highest mAP scores on 16 datasets. While GG_M performs better on Vowels and Satellite, and GG_AOM achieves the highest mAP on Annthyroid, in all other cases, generic global algorithms are outperformed by a variant of LSCP. Specifically, LSCP_AOM is the best performing method and ranks highest on 11 datasets. It should be noted, though, that GG methods with a second-phase combination (GG_MOA and GG_AOM) demonstrate better performance than GG_A and better stability than GG_M. These observations agree with the conclusions in Aggarwal's work [2, 3].
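The Friedman statistic used for such comparisons can be sketched as below. This is the classical chi-square form computed from average ranks, with hypothetical function names. It assumes one row of scores per dataset and one column per algorithm, ranks higher scores first, averages tied ranks, and omits the tie-correction factor; converting the statistic to a p-value is not shown.

```python
def average_ranks(rows):
    """Mean rank of each algorithm over all datasets (rank 1 = highest score)."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            # Extend the run over algorithms tied with the one at `pos`.
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(rows) for t in totals]


def friedman_statistic(rows):
    """Friedman chi-square statistic for n datasets and k algorithms."""
    n, k = len(rows), len(rows[0])
    ranks = average_ranks(rows)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
```

A larger statistic indicates that the algorithms' ranks differ more consistently across datasets than chance would suggest.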

Figure 2: t-SNE visualizations on Cardio (left), Thyroid (middle) and Letter (right), where normal (N) and outlying (O) points are denoted as grey dots and orange squares, respectively. Points that can only be correctly classified by a particular framework are shown in green and blue for GG and LSCP respectively.

LSCP_A and LSCP_M do not demonstrate strong performance relative to their GG counterparts. Given that both pseudo ground truth generation methods are heuristic, they may result in poor local competency evaluation. For example, as discussed in §3.6, LSCP_A theoretically benefits from both variance and bias reduction by averaging and focusing on locality. In practice, though, by only selecting the single most competent detector, it is possible that this approach yields weaker variance reduction compared to GG_A, which uses all detector scores. As a consequence, the variance reduction may not be able to sufficiently offset the bias inherent to the pseudo ground truth generation process, leading to diminished performance. Comparatively, when the ground truth is generated by taking the maximum among multiple detectors, it is observed that both LSCP_M and GG_M exhibit unstable behaviour. As discussed in [2, 3], selecting maximum scores across detectors yields high model variance, which explains these results; a second-phase combination mitigates this risk.

A Friedman test confirms there is a significant difference among the four LSCP algorithms in both ROC-AUC and mAP. Correspondingly, the LSCP ensemble variations (LSCP_MOA and LSCP_AOM) show promise. Building on GG_AOM's success as one of the most effective combination methods [3], LSCP_AOM averages the selected group of detectors, which could be viewed as an additional reduction of model variance over LSCP_M. Moreover, LSCP_AOM's concentration on the local competency evaluation may have improved model bias, and the second-phase averaging should have decreased the model variance, leading to better stability. The results show that LSCP_AOM outperforms all models on 11 datasets in terms of ROC-AUC and 12 datasets in terms of mAP. The latter improvement over GG methods is especially considerable on Cardio, Satimage-2 and Thyroid.

The second-phase combination is less beneficial for LSCP_MOA, which did not outperform LSCP_A or GG_MOA. As discussed in [2, 3], it is less effective to do a second-phase combination after averaging as information has already been lost due to blunting. The experiment results confirm that, in an LSCP scheme, the benefit from a second-phase maximization cannot offset the information loss due to the initial averaging. Overall, only LSCP_AOM is recommended for detector combination among the four LSCP algorithms due to its combined bias and variance reduction capabilities.

annthyroid 0.7679 0.7874 0.7653 0.7656 0.7827 0.7711 0.7509 0.7620 0.7924 0.7434
arrhythmia 0.7789 0.7572 0.7790 0.7317 0.7655 0.7772 0.7779 0.7743 0.7516 0.7791
breastw 0.8662 0.7444 0.8702 0.8503 0.8338 0.8529 0.6920 0.8454 0.7158 0.8722
cardio 0.9053 0.8876 0.9065 0.9088 0.9088 0.9125 0.8986 0.9149 0.8292 0.9250
glass 0.7518 0.7582 0.7508 0.7540 0.7590 0.7556 0.7430 0.7505 0.7735 0.7510
letter 0.7890 0.8546 0.7843 0.8077 0.8381 0.8031 0.7690 0.7892 0.8504 0.7685
lympho 0.9785 0.9731 0.9776 0.9785 0.9766 0.9785 0.9782 0.9770 0.9728 0.9785
mnist 0.8548 0.8329 0.8556 0.8250 0.8549 0.8587 0.8558 0.8612 0.7771 0.8630
musk 0.9980 0.9951 0.9987 0.9987 0.9973 0.9991 0.9986 0.9963 0.9977 0.9994
pendigits 0.8252 0.8414 0.8302 0.8446 0.8572 0.8417 0.8097 0.8560 0.7315 0.8615
pima 0.6942 0.6468 0.6956 0.6273 0.6665 0.6904 0.6952 0.6828 0.6276 0.6972
satellite 0.5954 0.6333 0.5950 0.6168 0.6324 0.6079 0.5912 0.6300 0.6028 0.6048
satimage-2 0.9875 0.9906 0.9883 0.9884 0.9925 0.9913 0.9854 0.9931 0.9860 0.9938
shuttle 0.5409 0.5571 0.5389 0.5506 0.5558 0.5475 0.5365 0.5544 0.5276 0.5498
thyroid 0.9675 0.9346 0.9687 0.9656 0.9492 0.9652 0.9558 0.9624 0.9410 0.9693
vertebral 0.3591 0.3713 0.3584 0.3839 0.3883 0.3659 0.3253 0.3798 0.4723 0.3471
vowels 0.9117 0.9338 0.9101 0.9229 0.9284 0.9164 0.9224 0.9155 0.9261 0.8998
wbc 0.9390 0.9313 0.9390 0.9333 0.9351 0.9391 0.9359 0.9331 0.9279 0.9400
Table 2: ROC-AUC scores (average of 20 independent trials, highest score highlighted in bold)
annthyroid 0.2452 0.2424 0.2460 0.2452 0.2617 0.2539 0.2379 0.2555 0.2423 0.2527
arrhythmia 0.3650 0.3516 0.3651 0.3326 0.3576 0.3650 0.3653 0.3614 0.3637 0.3680
breastw 0.6513 0.4797 0.6577 0.6335 0.5926 0.6321 0.4772 0.6110 0.4796 0.6739
cardio 0.4260 0.4083 0.4295 0.4355 0.4496 0.4485 0.4108 0.4669 0.3399 0.4946
glass 0.1397 0.1328 0.1430 0.1410 0.1340 0.1358 0.1341 0.1314 0.1479 0.1366
letter 0.2323 0.3495 0.2275 0.2388 0.3018 0.2429 0.2121 0.2377 0.3682 0.2283
lympho 0.8227 0.8001 0.8155 0.8227 0.8133 0.8227 0.8218 0.8116 0.7977 0.8300
mnist 0.3905 0.3654 0.3913 0.3819 0.3868 0.3934 0.3914 0.3949 0.3326 0.3982
musk 0.9331 0.8122 0.9536 0.9472 0.8908 0.9659 0.9365 0.8487 0.9097 0.9736
pendigits 0.0690 0.0793 0.0693 0.0745 0.0820 0.0751 0.0633 0.0809 0.0573 0.0853
pima 0.4901 0.4519 0.4913 0.4461 0.4662 0.4875 0.4879 0.4793 0.4366 0.4955
satellite 0.4064 0.4447 0.4066 0.4071 0.4421 0.4167 0.4146 0.4404 0.4256 0.4208
satimage-2 0.4092 0.5297 0.4291 0.4268 0.5998 0.5236 0.3584 0.6320 0.3801 0.6408
shuttle 0.1300 0.1207 0.1295 0.1316 0.1265 0.1311 0.1203 0.1299 0.1125 0.1335
thyroid 0.4257 0.2467 0.4397 0.4274 0.3338 0.4217 0.3459 0.3864 0.2449 0.4692
vertebral 0.1054 0.1111 0.1053 0.1144 0.1113 0.1063 0.1003 0.1104 0.1445 0.1054
vowels 0.3810 0.4135 0.3793 0.3835 0.4072 0.3887 0.4079 0.3938 0.3724 0.3547
wbc 0.5536 0.5264 0.5540 0.5496 0.5412 0.5552 0.5497 0.5505 0.5315 0.5567
Table 3: mAP scores (average of 20 independent trials, highest score highlighted in bold)

4.4 Visualization Analysis.

Figure 2 visually compares the performance of the best performing GG and LSCP methods on Cardio, Thyroid and Letter using t-distributed stochastic neighbor embedding (t-SNE) [20]. The green and blue markers highlight objects that can only be correctly classified by either the GG or LSCP methods, respectively, to emphasize the mutual exclusivity of the two approaches. The visualizations of Cardio (left) and Thyroid (middle) illustrate that LSCP methods have an edge over GG methods in detecting local outliers when they cluster together (highlighted by red dotted circles in Fig. 2). Additionally, LSCP methods can contribute to classifying both outlying and normal points when locality is present in the data. However, the outlying data distribution in Letter (right) is more dispersed—outliers do not form local clusters but rather mix with normal points. This causes LSCP to perform worse than GG_M in terms of ROC-AUC despite showing an improvement in terms of mAP. Based on these visualizations, one could assume that LSCP is useful when outlying and normal objects are well separated, but less effective when they are interleaved and cannot easily form local clusters. Additionally, the size of the local region should be informed by the estimated proportion of outliers in the dataset. For instance, outliers account for only 3.43% and 6.25% of Vowels and Letter respectively, which may not be sufficient to form outlier clusters when the local region size is set to 10% of the training data. A smaller local region size is more appropriate when a small number of outliers is assumed.

4.5 Limitations and Future Directions.

Firstly, the local region definition depends on finding nearest neighbors by Euclidean distance. This is not ideal because of (i) high time complexity [9] and (ii) degraded performance when many irrelevant features are present in a high-dimensional space [4]. This may be improved by using prototype selection [9] or by defining the local region using advanced clustering methods [9]. Secondly, only simple pseudo ground truth generation methods (averaging or maximization) are explored in this study; more accurate methods, such as actively pruning base detectors [23], should be considered. Thirdly, only parallel combination methods are compared in this study; the sequential methods discussed in §2.3 should be included in future work. Lastly, DCS has been shown to work with heterogeneous base classifiers in classification problems [16, 9], but this remains to be verified for LSCP. Significant improvement is expected from heterogeneous detectors, since the base detectors used in this study are homogeneous with limited diversity.
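To make the components named in these limitations concrete, the following is a minimal sketch of local detector selection, not the authors' implementation. It simplifies in two ways that should be read as assumptions: the local region is a single Euclidean nearest-neighbor query in the full feature space (rather than a consensus over randomly generated feature spaces), and detector competence is measured by Pearson correlation with an averaged pseudo ground truth. The function name `select_local_detector` and its parameters are hypothetical.

```python
import numpy as np


def select_local_detector(X_train, train_scores, x_test, region_size):
    """Pick the detector that best agrees with the pseudo target near x_test.

    X_train:      (n_train, n_features) training data
    train_scores: (n_train, n_detectors) outlier scores on the training data
    x_test:       (n_features,) test instance
    Returns (index of the selected detector, indices of the local region).
    """
    # Local region: the region_size training points closest to x_test
    # by Euclidean distance (simplifying assumption, see lead-in).
    dists = np.linalg.norm(X_train - x_test, axis=1)
    region = np.argsort(dists)[:region_size]

    local_scores = train_scores[region]
    # Pseudo ground truth by averaging; .max(axis=1) would give maximization.
    pseudo_target = local_scores.mean(axis=1)

    # Pearson correlation of each detector with the pseudo target
    # (assumed competence measure); constant columns get correlation 0.
    centered = local_scores - local_scores.mean(axis=0)
    target_centered = pseudo_target - pseudo_target.mean()
    denom = np.linalg.norm(centered, axis=0) * np.linalg.norm(target_centered)
    corr = np.divide(centered.T @ target_centered, denom,
                     out=np.zeros_like(denom), where=denom > 0)
    return int(np.argmax(corr)), region


# Toy usage: 10 one-dimensional points, 3 detectors, region of 4 neighbors.
X = np.arange(10, dtype=float).reshape(-1, 1)
scores = np.zeros((10, 3))
scores[:4] = [[1, 4, 1], [2, 3, 2], [3, 2, 3], [4, 1, 5]]
best, region = select_local_detector(X, scores, np.array([0.0]), 4)
print(best, sorted(region.tolist()))
```

The distance computation over all training points for every test instance is the time-complexity concern raised in point (i); prototype selection or clustering would replace the `argsort` over the full training set.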

5 Conclusions

In this work, we propose four variants of a novel unsupervised outlier detection framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP). Unlike traditional combination approaches, LSCP identifies the top-performing base detectors for each test instance relative to its local region. To validate the effectiveness of this approach, the proposed framework is assessed on 18 real-world datasets and observed to be superior to baseline algorithms. The ensemble approach LSCP_AOM demonstrated the best performance, achieving the highest detection score on 11/18 datasets with respect to ROC-AUC and 12/18 datasets with respect to mAP. Theoretical considerations under the bias-variance framework are also provided for LSCP, alongside visualizations, to provide a holistic view of the framework. Since LSCP demonstrates the promise of data locality, we hope that future work extends this exploration by investigating the use of heterogeneous base detectors and more reliable pseudo ground truth generation methods. All source code, experimental results and figures used in this study are made publicly available in the project repository.


  • [1] C. C. Aggarwal, Outlier ensembles: position paper, ACM SIGKDD Explorations, 14 (2013), pp. 49–58.
  • [2] C. C. Aggarwal and S. Sathe, Theoretical Foundations and Algorithms for Outlier Ensembles, ACM SIGKDD Explorations, 17 (2015), pp. 24–47.
  • [3] C. C. Aggarwal and S. Sathe, Outlier ensembles: An introduction, Springer, 1st ed., 2017.
  • [4] L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos, Fast and Reliable Anomaly Detection in Categorical Data, in CIKM, 2012.
  • [5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, LOF: Identifying Density-Based Local Outliers, ACM SIGMOD, (2000), pp. 1–12.
  • [6] A. S. Britto, R. Sabourin, and L. E. Oliveira, Dynamic selection of classifiers - A comprehensive review, Pattern Recognition, 47 (2014), pp. 3665–3680.
  • [7] G. O. Campos, A. Zimek, and W. Meira, An Unsupervised Boosting Strategy for Outlier Detection Ensembles, PAKDD, (2018), pp. 564–576.
  • [8] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, CSUR, 41 (2009), p. 15.
  • [9] R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, Dynamic classifier selection: Recent advances and perspectives, Information Fusion, 41 (2018), pp. 195–216.
  • [10] S. Das, W.-K. Wong, T. Dietterich, A. Fern, and A. Emmott, Incorporating Expert Feedback into Active Anomaly Discovery, ICDM, (2016), pp. 853–858.
  • [11] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, JMLR, 7 (2006), pp. 1–30.
  • [12] T. G. Dietterich, Ensemble Methods in Machine Learning, MCS, 1857 (2000), pp. 1–15.
  • [13] A. Emmott, S. Das, T. Dietterich, A. Fern, and W.-k. Wong, A Meta-Analysis of the Anomaly Detection Problem, arXiv preprint, (2015).
  • [14] G. Giacinto and F. Roli, A theoretical framework for dynamic classifier selection, ICPR, 2 (2000), pp. 0–3.
  • [15] T. K. Ho, J. J. Hull, and S. N. Srihari, Decision Combination in Multiple Classifier Systems, TPAMI, 16 (1994), pp. 66–75.
  • [16] A. H. Ko, R. Sabourin, and A. S. Britto, From dynamic classifier selection to dynamic ensemble selection, Pattern Recognition, 41 (2008), pp. 1735–1748.
  • [17] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, LoOP: local outlier probabilities, CIKM, (2009), pp. 1649–1652.
  • [18] A. Lazarevic and V. Kumar, Feature bagging for outlier detection, ACM SIGKDD, (2005), p. 157.
  • [19] F. T. Liu, K. M. Ting, and Z. H. Zhou, Isolation forest, ICDM, (2008), pp. 413–422.
  • [20] L. van der Maaten and G. Hinton, Visualizing data using t-SNE, JMLR, 9 (2008), pp. 2579–2605.
  • [21] B. Micenková, B. McWilliams, and I. Assent, Learning Representations for Outlier Detection on a Budget, arXiv preprint, (2015).
  • [22] S. Rayana and L. Akoglu, An Ensemble Approach for Event Detection and Characterization in Dynamic Graphs, in ACM SIGKDD ODD Workshop, 2014.
  • [23] S. Rayana and L. Akoglu, Less is More: Building Selective Anomaly Ensembles, TKDD, 10 (2016), pp. 1–33.
  • [24] S. Rayana, W. Zhong, and L. Akoglu, Sequential ensemble learning for outlier detection: A bias-variance perspective, ICDM, (2017), pp. 1167–1172.
  • [25] E. Schubert, A. Zimek, and H. P. Kriegel, Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection, DMKD, 28 (2014), pp. 190–237.
  • [26] B. van Stein, M. van Leeuwen, and T. Bäck, Local subspace-based outlier detection using global neighbourhoods, IEEE International Conference on Big Data, (2016), pp. 1136–1142.
  • [27] K. Woods, W. Kegelmeyer, and K. Bowyer, Combination of multiple classifiers using local accuracy estimates, TPAMI, 19 (1997), pp. 405–410.
  • [28] Y. Zhao and M. K. Hryniewicki, XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning, IJCNN, (2018).
  • [29] Y. Zhao, M. K. Hryniewicki, F. Cheng, B. Fu, and X. Zhu, Employee Turnover Prediction with Machine Learning: A Reliable Approach, IEEE Intelligent System Conference (Intellisys), (2018).
  • [30] A. Zimek, R. J. G. B. Campello, and J. Sander, Ensembles for unsupervised outlier detection: Challenges and research questions, ACM SIGKDD Explorations, 15 (2014), pp. 11–22.