LSCP: Locally Selective Combination in Parallel Outlier Ensembles
Abstract
In unsupervised outlier ensembles, the absence of ground truth makes the combination of base detectors a challenging task. Specifically, existing parallel outlier ensembles lack a reliable way of selecting competent base detectors during model combination, which affects accuracy and stability. In this paper, we propose a framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP), which addresses this issue by defining a local region around a test instance using the consensus of its nearest neighbors in randomly generated feature spaces. The top-performing base detectors in this local region are selected and combined as the model's final output. Four variants of the LSCP framework are compared with six widely used combination algorithms for parallel ensembles. Experimental results demonstrate that one of these LSCP variants consistently outperforms baseline algorithms on the majority of eighteen real-world datasets.
Yue Zhao (Department of Computer Science, University of Toronto)
Maciej K. Hryniewicki (Data Analytics, PricewaterhouseCoopers Canada)
Zain Nasrullah (Department of Computer Science, University of Toronto; Data Analytics, PricewaterhouseCoopers Canada)
Zheng Li (Toronto Campus, Northeastern University)
1 Introduction
Outlier detection methods aim to identify anomalous data objects from the general data distribution and are useful for problems such as credit card fraud prevention and network intrusion detection [8]. Since the ground truth is often absent in outlier mining [1], unsupervised detection methods are commonly used for this task [5, 17, 8]. However, unsupervised approaches are susceptible to generating high false positive and false negative rates [10]. To improve model accuracy and stability in these scenarios, recent research explores ensemble approaches to outlier detection [1, 3, 24, 30]. Ensemble learning combines multiple base estimators to achieve superior detection performance and reliability when compared to an individual estimator [12, 15, 29]. It is important to ensure that the combination process is robust because constituent estimators, if synthesized inappropriately, may be detrimental to the predictive capability of an ensemble [22, 23]. Similar to prior works in classification [24], an outlier ensemble may be characterized as parallel if detectors are generated independently, or sequential if the detector generation, selection or combination is iterative.
Model combination is important for parallel ensembles to ensure diversity exists among base detectors; however, existing works have not jointly addressed two key limitations in this process. First, most parallel ensembles generically combine all detectors without considering selection. This limits the benefits of model combination since individual base detectors may not be proficient at identifying all outlier instances [9]. For example, prior work has demonstrated that the value of good detectors can be neutralized by the inclusion of poor detectors in a generic averaging framework [3]. Secondly, data locality is rarely emphasized in the context of detector selection and combination leading to potentially suboptimal outcomes. While it is acknowledged that certain types of outliers are better identified by local data relationships [26], detectors are often evaluated at a global scale, where all training points are considered, instead of the local region related to a test instance.
To address the aforementioned limitations, we propose a fully unsupervised framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP) to selectively combine base detectors by emphasizing data locality. The idea is motivated by an established supervised ensemble framework known as Dynamic Classifier Selection (DCS) [15]. DCS selects the best classifier for each test instance by evaluating base classifier competency at a local scale [9]. The rationale behind this is that base classifiers will generally not excel at categorizing all unknown test instances and that an individual classifier is more likely to specialize in a specific local region [9, 16]. Similarly, LSCP first defines the local region of a test instance by the consensus of the nearest training points in randomly generated feature spaces, and then identifies the most competent base detector in this local region by measuring similarity relative to a pseudo ground truth (see [7, 23] for examples). To further improve algorithm stability and capacity, ensemble variations of LSCP are proposed where promising base detectors are kept for a second-phase combination instead of using the single most competent detector. Our technical contributions in this paper are:

We propose a novel combination framework which, to the best of our knowledge, is the first published effort to adapt DCS from supervised classification tasks to unsupervised parallel outlier ensembles.

As a general framework, LSCP is formulated to be compatible with different types of base detectors; we demonstrate its effectiveness with a homogeneous pool of Local Outlier Factor [5] detectors.

We employ various analysis methods to improve model interpretability. First, theoretical explanations and complexity are provided. Second, visualization techniques are used to intuitively explain why LSCP works and when to use it. Third, statistical tests are used to compare experimental results.

Effort has also been made to streamline the accessibility of LSCP. Hyperparameter selection and associated impacts in the context of the framework are discussed in detail. All source code, experiment results and figures are shared for reproduction (repository: https://github.com/yzhao062/LSCP).
It should be noted that the purpose of LSCP is not to outperform the best unsupervised outlier detector, but rather to explore the use of detector selection and combination at the local level in unsupervised parallel outlier ensembles. However, extensive experiments on 18 real-world datasets show that LSCP consistently yields better performance than existing parallel combination methods. In summary, LSCP is intuitive, stable and effective for combining independent outlier detectors without supervision.
2 Related Works
2.1 Dynamic Classifier Selection and Dynamic Ensemble Selection.
Dynamic Classifier Selection (DCS) is an established combination framework for classification tasks. The technique was first proposed by Ho et al. in 1994 [15] and then extended, under the name DCS Local Accuracy, by Woods et al. in 1997 [27] to select the most accurate base classifier in a local region. The motivation behind this approach is that base classifiers often make distinctive errors and offer a degree of complementarity [6]. Consequently, selectively combining base classifiers can result in a performance improvement over generic ensembles that use the majority vote of all base classifiers. Subsequent theoretical work by Giacinto and Roli validated that, under certain assumptions, the optimal Bayes classifier could be obtained by selecting non-optimal classifiers [14]. DCS was later expanded by Ko et al. to Dynamic Ensemble Selection (DES), which selects multiple base classifiers for a second-phase combination given each test instance [16]. By minimizing reliance on a single classifier and delegating the classification task to a group of competent classifiers, DES has been shown to be more robust than DCS [16]. Motivated by these approaches, LSCP adapts dynamic selection to unsupervised outlier detection tasks.
2.2 Data Locality in Outlier Detection.
The relationship among data objects is critical in outlier detection and existing algorithms can roughly be categorized as either global or local [17, 24, 25]. The former considers all objects during inference while the latter only considers a local selection of objects [25]. In both cases, their applicability is dependent on the structure of the data. Global outlier detection algorithms, for example, offer superior performance when outliers are highly distinctive from the data distribution [23] but often fail to identify outliers in the local neighborhoods of high-dimensional data [5, 26]. Accordingly, global models also struggle with data represented by a mixture of distributions, where global characteristics do not necessarily represent the distribution of objects in local regions [26]. To address these limitations, numerous works have explored local algorithms such as Local Outlier Factor (LOF) [5], Local Outlier Probabilities (LoOP) [17] and GLOSS [26]. However, data locality is rarely considered in the context of detector combination; instead, most combination methods utilize all training data points, e.g., the weight calculation in weighted averaging [30]. LSCP explores both global and local data relationships by training base detectors on the entire dataset and emphasizing data locality during detector combination.
2.3 Outlier Detector Combination.
Recently, studying outlier ensembles has become a popular research area [1, 2, 3, 30], resulting in numerous popular works including: (i) parallel ensembles such as Feature Bagging [18] and Isolation Forest [19]; (ii) sequential methods including CARE [24], SELECT [23] and BoostSelect [7]; and (iii) hybrid approaches like BORE [21] and XGBOD [28]. When the ground truth is unavailable, combining outlier models is challenging. Feature Bagging [18], an early work, generates a diversified set of base detectors by training on randomly selected feature subsets and statically combining their outlier scores. Existing unsupervised combination algorithms in parallel ensembles are often both generic and global (GG); a list of representative GG methods is given below (see [1, 2, 3, 30, 24] for details):

Averaging (GG_A): average scores of all detectors.

Maximization (GG_M): take the maximum score across all detectors.

Weighted Averaging (GG_WA): weight each base detector when averaging.

Threshold Sum (GG_TH): discard all scores below a threshold and sum over the remaining scores.

Average of Maximum (GG_AOM): divide base detectors into subgroups and take the maximum score for each subgroup. The final score is the average of all subgroup scores.

Maximum of Average (GG_MOA): divide base detectors into subgroups and take the average score for each subgroup. The final score is the maximum of all subgroup scores.
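The six GG rules listed above can be sketched for a single instance as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the weights for weighted averaging are supplied by the caller because the list does not fix a weighting scheme, and subgroups are formed as contiguous equal-sized slices rather than by random assignment.

```python
from statistics import mean


def gg_average(scores):
    """GG_A: mean of all detector scores."""
    return mean(scores)


def gg_maximum(scores):
    """GG_M: largest detector score."""
    return max(scores)


def gg_weighted_average(scores, weights):
    """GG_WA: weighted mean; the weights are caller-supplied (assumption)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def gg_threshold_sum(scores, threshold=0.0):
    """GG_TH: discard scores below the threshold and sum the rest."""
    return sum(s for s in scores if s >= threshold)


def _subgroups(scores, n_groups):
    """Contiguous equal-sized subgroups (assumes len(scores) % n_groups == 0)."""
    size = len(scores) // n_groups
    return [scores[i * size:(i + 1) * size] for i in range(n_groups)]


def gg_aom(scores, n_groups):
    """GG_AOM: average of the per-subgroup maxima."""
    return mean(max(g) for g in _subgroups(scores, n_groups))


def gg_moa(scores, n_groups):
    """GG_MOA: maximum of the per-subgroup averages."""
    return max(mean(g) for g in _subgroups(scores, n_groups))


# Normalized scores of four detectors for one instance.
scores = [1.0, -1.0, 3.0, 0.0]
print(gg_average(scores), gg_maximum(scores), gg_threshold_sum(scores))
print(gg_aom(scores, n_groups=2), gg_moa(scores, n_groups=2))
```

All six rules use every detector and ignore where the instance lies in the feature space, which is what "generic and global" refers to.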
As discussed in §2.2, GG methods ignore the importance of data locality while evaluating and combining detectors, which may be inappropriate given the characteristics of outliers [5, 26]. Moreover, without a selection process, poor detectors may hurt the overall detection performance of an ensemble [23, 24]. All aforementioned GG algorithms are thus included as baselines.
There have been attempts to build selective outlier ensembles sequentially in a boosting style. Rayana and Akoglu introduced SELECT [23] and CARE [24] to pick promising detectors and exclude the underperforming ones iteratively, which yielded great results on both temporal graphs and multidimensional outlier data. Campos et al. further extend this idea by proposing an unsupervised boosting strategy BoostSelect for outlier ensemble selection [7]. As an alternative to sequential selection models, this paper chooses to focus on parallel detector selection which stresses the importance of data locality. Compared with sequential detector selection methods, our approach can select detectors without iteration which may reduce computational cost.
3 Algorithm Design
LSCP starts with a group of diversified detectors to be combined. For each test instance, LSCP first defines its local region and then picks the most competent detector(s) locally. The selected detector(s) are used to generate the outlier score for the test instance. The workflow of all four proposed LSCP methods is shown in Fig. 1 and Algorithm 1.
3.1 Base Detector Generation.
An effective ensemble should be constructed with diversified base estimators [24, 30] to promote learning distinct characteristics in the data. With a group of homogeneous base detectors, diversity can be induced by subsampling the training set and feature space, or by varying model hyperparameters [6, 30]. In this study, we demonstrate the effectiveness of LSCP by using distinct hyperparameters to construct a pool of models with the same base algorithm. However, in practice, LSCP can also be used as a general framework with heterogeneous base detectors.
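Let $X_{train} \in \mathbb{R}^{n \times d}$ denote training data with $n$ points and $d$ features, and $X_{test} \in \mathbb{R}^{m \times d}$ denote a test set with $m$ points. The algorithm first generates a pool of $R$ base detectors $C = \{C_1, \ldots, C_R\}$ initialized with a range of hyperparameters, e.g., a group of LOF detectors with distinct MinPts [5]. All base detectors are first trained on $X_{train}$ and then inference is performed on the same dataset. The results are combined into an outlier score matrix $O(X_{train})$, formalized in Eq. (1), where $C_r(\cdot)$ denotes the score vector from the $r$-th base detector. Each detector score is normalized using Z-normalization as per prior work [2, 30].
$$O(X_{train}) = [C_1(X_{train}), \ldots, C_R(X_{train})] \in \mathbb{R}^{n \times R} \qquad (1)$$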
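A minimal sketch of building the normalized training score matrix is given below. To stay dependency-free it uses a toy detector (distance to the k-th nearest training point) as a stand-in for LOF; the detector, the helper names and the example data are all assumptions made for illustration.

```python
from math import dist
from statistics import mean, pstdev


def knn_distance_detector(k):
    """Toy stand-in for LOF: score = distance to the k-th nearest other point."""
    def score(X):
        out = []
        for i, x in enumerate(X):
            distances = sorted(dist(x, y) for j, y in enumerate(X) if j != i)
            out.append(distances[k - 1])
        return out
    return score


def z_normalize(column):
    """Z-normalize one detector's scores (zero mean, unit variance)."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in column]


def score_matrix(X_train, detectors):
    """Rows are training points, columns are detectors."""
    columns = [z_normalize(det(X_train)) for det in detectors]
    return [list(row) for row in zip(*columns)]


# Four clustered points and one obvious outlier.
X_train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
# Diversity comes from varying the hyperparameter k across the pool.
pool = [knn_distance_detector(k) for k in (1, 2, 3)]
O_train = score_matrix(X_train, pool)
for row in O_train:
    print([round(v, 3) for v in row])
```

Normalizing each column puts detectors with different score scales on a common footing before any scores are compared or aggregated.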
3.2 Pseudo Ground Truth Generation.
Since LSCP evaluates detector competency without ground truth labels, two methods are used for generating a pseudo ground truth (denoted $\mathrm{target}$) from the training score matrix $O(X_{train})$: (i) LSCP_A, which averages the base detector scores, and (ii) LSCP_M, which takes the maximum score across detectors. This is generalized in Eq. (2), where $\phi$ represents the aggregation (average or maximum) taken across all base detectors.
$$\mathrm{target} = \phi(O(X_{train})) \in \mathbb{R}^{n \times 1} \qquad (2)$$
It should be noted that the pseudo ground truth in LSCP is generated using training data and used solely for detector selection.
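The two pseudo ground truth rules amount to a row-wise aggregation of the training score matrix. A short sketch, with a hypothetical function name and made-up scores:

```python
from statistics import mean


def pseudo_ground_truth(score_matrix, method="average"):
    """Aggregate each training point's detector scores into one pseudo target."""
    aggregate = {"average": mean, "max": max}[method]
    return [aggregate(row) for row in score_matrix]


# Three training points scored by three detectors (already normalized).
O_train = [
    [0.2, -0.1, 0.5],
    [1.5, 2.0, 1.0],
    [-0.4, -0.6, -0.2],
]
target_avg = pseudo_ground_truth(O_train, "average")  # used by LSCP_A
target_max = pseudo_ground_truth(O_train, "max")      # used by LSCP_M
print(target_avg, target_max)
```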
3.3 Local Region Definition.
The local region $\Psi_j$ of a test instance $X_{test}^{(j)}$ is defined as the set of its $k$ nearest training objects. Formally, this is denoted as:
$$\Psi_j = \{x_i \mid x_i \in X_{train},\ x_i \in kNN_{ens}^{(j)}\} \qquad (3)$$
where $kNN_{ens}^{(j)}$ describes the set of the test instance's nearest neighbours subject to an ensemble criterion. This variation of kNN, which is similar to Feature Bagging [18], is proposed to alleviate concerns involving the curse of dimensionality on kNN [4] while leveraging its better precision compared to clustering algorithms in DCS [9]. The process is as follows: (i) $t$ groups of features are randomly selected to construct new feature spaces; (ii) the $k$ nearest training objects to $X_{test}^{(j)}$ in each group are identified using Euclidean distance; and (iii) training objects that appear more than $t/2$ times are added to $kNN_{ens}^{(j)}$, thus defining the local region. The size of the region is not fixed because it depends on the number of training objects that meet the selection criterion.
The local region factor $k$ decides the number of nearest neighbors to consider during this process; care should be taken to avoid extreme values. Smaller values of $k$ give more attention to local relationships, which can result in instability, while larger values of $k$ may place too much emphasis on global relationships and incur higher computational costs. While it is possible to experimentally determine an optimal $k$ with cross-validation [16] when ground truth is available, no comparably simple approach exists in an unsupervised setting. For these reasons, we recommend setting $k$ to 10% of the training samples, bounded in the range of , which yielded good results in practice.
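The local region procedure can be sketched as below. The number of feature groups, the size of each random feature subset (between half and all of the features) and the majority threshold are assumptions chosen for illustration, as are the function name and the example data.

```python
import random
from collections import Counter
from math import dist


def local_region(x_test, X_train, k, n_groups=10, seed=0):
    """Indices of training points that are among the k nearest neighbours of
    x_test in more than half of n_groups random feature subspaces."""
    rng = random.Random(seed)
    d = len(x_test)
    votes = Counter()
    for _ in range(n_groups):
        # Assumption: each subspace keeps between d // 2 and d features.
        size = rng.randint(max(1, d // 2), d)
        feats = rng.sample(range(d), size)
        query = [x_test[f] for f in feats]
        # Euclidean distance to x_test, restricted to the sampled features.
        distances = [dist([x[f] for f in feats], query) for x in X_train]
        nearest = sorted(range(len(X_train)), key=distances.__getitem__)[:k]
        votes.update(nearest)
    # Consensus step (assumed majority rule): keep frequently selected points.
    return sorted(i for i, count in votes.items() if count > n_groups / 2)


# Eight 4-dimensional training points lying on a diagonal.
X_train = [(float(c),) * 4 for c in (0, 1, 2, 3, 4, 10, 11, 12)]
region = local_region((0.2, 0.2, 0.2, 0.2), X_train, k=3)
print(region)
```

Because membership depends on how often a point is selected, the returned region can contain fewer than k points, which matches the remark that the region size is not fixed.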
3.4 Model Selection and Combination.
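For each test instance, the local pseudo ground truth $\mathrm{target}_{\Psi_j}$ can be obtained by retrieving the values associated with the local region $\Psi_j$ from $\mathrm{target}$:
$$\mathrm{target}_{\Psi_j} = \{\mathrm{target}_{x_i} \mid x_i \in \Psi_j\} \in \mathbb{R}^{|\Psi_j| \times 1} \qquad (4)$$
where $|\Psi_j|$ denotes the cardinality of $\Psi_j$. Similarly, the local training outlier scores $O(\Psi_j)$ can be retrieved from the precalculated training score matrix $O(X_{train})$ as:
$$O(\Psi_j) = [C_1(\Psi_j), \ldots, C_R(\Psi_j)] \in \mathbb{R}^{|\Psi_j| \times R} \qquad (5)$$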
Consequently, although the local region needs to be recomputed for each test instance, the local outlier scores and targets can be efficiently retrieved from precalculated values (see Fig. 1).
For evaluating base estimator competency in a local region, DCS measures the accuracy of base classifiers as the percentage of correctly classified points [16], while LSCP measures the similarity between base detector scores and the pseudo target instead. This distinction is motivated by the lack of direct and reliable ways to access binary labels in unsupervised outlier mining. Although converting pseudo outlier scores to binary labels is feasible, defining an accurate threshold for the conversion is challenging. Additionally, since imbalanced datasets are common in outlier detection tasks, it is more stable to use similarity measures over absolute accuracy for competency evaluation. Therefore, LSCP measures the local competency of each base detector by the Pearson correlation between the local pseudo ground truth $\mathrm{target}_{\Psi_j}$ and the local detector score $C_r(\Psi_j)$. The detector with the highest similarity is regarded as the most competent local detector for the test instance $X_{test}^{(j)}$, and its outlier score for that instance is taken as the final score.
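The competency evaluation can be illustrated with a small sketch. The helper names and the numbers are hypothetical, and treating a constant score vector as having zero competency is an assumption, since Pearson correlation is undefined in that case.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant vector carries no competency signal
    return cov / sqrt(var_a * var_b)


def most_competent_detector(O_train, target, region):
    """Correlate each detector's local scores with the local pseudo target."""
    local_target = [target[i] for i in region]
    n_detectors = len(O_train[0])
    competency = [
        pearson([O_train[i][r] for i in region], local_target)
        for r in range(n_detectors)
    ]
    best = max(range(n_detectors), key=competency.__getitem__)
    return best, competency


# Four training points, two detectors, and a precomputed pseudo target.
O_train = [[1.0, 3.0], [2.0, 1.0], [3.0, 2.0], [9.0, 9.0]]
target = [0.9, 2.1, 3.2, 9.0]
region = [0, 1, 2]  # local region of the test instance

best, competency = most_competent_detector(O_train, target, region)
test_scores = [0.4, 2.5]  # the two detectors' scores for the test instance
final_score = test_scores[best]
print(best, [round(c, 3) for c in competency], final_score)
```

Only precomputed training scores and targets are sliced here; nothing is refit per test instance, which is why the local lookup is cheap once the region is known.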
3.5 Dynamic Outlier Ensemble Selection.
Selecting only one detector, even if it is most similar to the pseudo ground truth, can be risky in unsupervised learning. However, this risk can be mitigated by selecting a group of detectors for a second-phase combination. This idea can be viewed as an adaptation of supervised DES [16] to outlier detection tasks; correspondingly, we introduce ensemble variations of LSCP which employ Maximum of Average (LSCP_MOA) and Average of Maximum (LSCP_AOM) ensembling methods. Specifically, when the pseudo ground truth is generated by averaging, LSCP_MOA selects a subset of detectors with the highest outlier scores and then takes the maximum as the outlier score of the test instance. Inversely, LSCP_AOM computes the average of the selected subset of detectors as the outlier score when the pseudo target is generated with maximization. Setting the group size of selected detectors equal to 1 is a special case of the ensembles, yielding the original LSCP algorithms (LSCP_A and LSCP_M). Larger group sizes may be considered more global in their detector selection, while a group size equal to the number of base detectors results in a fully global algorithm. In response to this, we recommend using a group size selection process which includes some variance. Specifically, a histogram of detector Pearson correlation scores (to the pseudo ground truth) is built with $b$ equal intervals. The detectors belonging to the most frequent interval are kept for the second-phase combination. A large $b$ thus results in selecting fewer detectors, which controls the group size of LSCP ensembles in a flexible way.
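The histogram-based group selection and the second-phase combination can be sketched as follows. The function names, the tie-breaking rule (prefer the interval with higher correlation) and the example values are assumptions for illustration only.

```python
from statistics import mean


def select_by_histogram(competency, n_bins):
    """Keep the detectors whose correlation falls in the most frequent interval."""
    lo, hi = min(competency), max(competency)
    if hi == lo:
        return list(range(len(competency)))
    width = (hi - lo) / n_bins
    # Interval index of each detector; the maximum value goes in the last bin.
    bins = [min(int((c - lo) / width), n_bins - 1) for c in competency]
    counts = [bins.count(b) for b in range(n_bins)]
    # Assumption: ties between intervals go to the higher-correlation one.
    top = max(range(n_bins), key=lambda b: (counts[b], b))
    return [r for r, b in enumerate(bins) if b == top]


def second_phase(test_scores, competency, variant, n_bins=10):
    """Combine the selected detectors' scores for one test instance."""
    selected = select_by_histogram(competency, n_bins)
    chosen = [test_scores[r] for r in selected]
    # LSCP_MOA takes the maximum; LSCP_AOM takes the average.
    return max(chosen) if variant == "MOA" else mean(chosen)


# Local Pearson correlations of five detectors and their test-instance scores.
competency = [0.91, 0.93, 0.95, 0.10, 0.50]
test_scores = [1.0, 2.0, 3.0, 9.0, -5.0]

selected = select_by_histogram(competency, n_bins=5)
score_moa = second_phase(test_scores, competency, "MOA", n_bins=5)
score_aom = second_phase(test_scores, competency, "AOM", n_bins=5)
print(selected, score_moa, score_aom)
```

In this example the poorly correlated fourth detector, which reports the extreme score 9.0, is excluded before combination, so it cannot dominate the maximum.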
With the appropriate implementation of LSCP_A and LSCP_M, e.g., using a kd tree, the time complexity for each test instance is : for the distance calculation and for summation and sorting [16]. To combine the base detectors in LSCP_MOA and LSCP_AOM, an additional is needed resulting in a total time complexity of .
3.6 Theoretical Considerations.
Recently, Aggarwal and Sathe laid the theoretical foundation for outlier ensembles [2] using the bias-variance tradeoff, a widely used framework for analyzing generalization error in classification problems. The reducible generalization error in outlier ensembles may be minimized by either reducing squared bias or variance, where a tradeoff between these two channels usually exists. A high variance detector is sensitive to data variation with high instability; a high bias detector is less sensitive to data variation but may fit complex data poorly. The goal of outlier ensembles is to control both bias and variance to reduce the overall generalization error. Various newly proposed algorithms have been analyzed using this new framework to enhance interpretability [23, 24, 28].
It has been shown that combining diversified base detectors, by averaging them for example, results in variance reduction [23, 24, 2]. However, a combination of all base detectors may also include inaccurate ones, leading to higher bias. This explains why generic global averaging does not work well. Within Aggarwal's bias-variance framework, LSCP possesses a combination of both variance and bias reduction. It induces diversity by initializing base detectors with various hyperparameters and indirectly promotes variance reduction in the way that the pseudo ground truth is generated, e.g., averaging in LSCP_A. Furthermore, LSCP focuses on detector selection by local competency, which helps identify base detectors with conditionally low model bias. LSCP_M is also expected to be more stable than global maximization (GG_M) since the variance is reduced by using the most competent detector's output rather than the global maximum values of all base detectors. LSCP_MOA and LSCP_AOM further decrease generalization error through bias reduction and variance reduction, respectively, through their second-phase combination. Although LSCP may reduce the generalization error through both variance and bias channels, it is a heuristic framework with unpredictable results on pathological datasets.
4 Numerical Experiments
4.1 Datasets and Evaluation Metrics.
Table 1 summarizes the 18 public outlier detection benchmark datasets used in this study (ODDS Library: http://odds.cs.stonybrook.edu). In each experiment, 60% of the data is used for training and the remaining 40% is set aside for validation. Performance is evaluated by taking the average score of 20 independent trials using the area under the receiver operating characteristic curve (ROC-AUC) and mean average precision (mAP). Both metrics are widely used in outlier research [3, 4, 21, 23, 28, 13] and statistical measures are used to analyze the results [11]. Specifically, we use a nonparametric Friedman test followed by a post-hoc Nemenyi test. For these tests, is considered to be statistically significant.
Dataset  Pts  Dim  Outliers  %Outlier 
Annthyroid  7200  6  534  7.41 
Arrhythmia  452  274  66  14.60 
Breastw  683  9  239  34.99 
Cardio  1831  21  176  9.61 
Glass  214  9  9  4.21 
Letter  1600  32  100  6.25 
Lympho  148  18  6  4.05 
Mnist  7603  100  700  9.21 
Musk  3062  166  97  3.17 
Pendigits  6870  16  156  2.27 
Pima  768  8  268  34.90 
Satellite  6435  36  2036  31.64 
Satimage2  5803  36  71  1.22 
Shuttle  49097  9  3511  7.15 
Thyroid  3772  6  93  2.47 
Vertebral  240  6  30  12.50 
Vowels  1456  12  50  3.43 
Wbc  378  30  21  5.56 
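The Friedman test mentioned in §4.1 ranks the compared algorithms on each dataset and checks whether their mean ranks differ. The sketch below computes the mean ranks and the chi-square form of the statistic without a tie correction or p-value; the function names and the toy scores are illustrative assumptions, not results from this study.

```python
def average_ranks(row):
    """Rank one dataset's scores (1 = highest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def friedman_statistic(results):
    """results[i][j] = score of algorithm j on dataset i."""
    n, k = len(results), len(results[0])
    rank_rows = [average_ranks(row) for row in results]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, mean_ranks


# Toy scores: three datasets (rows) by three algorithms (columns).
results = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.60],
    [0.80, 0.70, 0.75],
]
chi2, mean_ranks = friedman_statistic(results)
print(round(chi2, 4), [round(r, 3) for r in mean_ranks])
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; a post-hoc test such as Nemenyi then compares pairs of mean ranks.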
4.2 Experimental Design.
This study compares the six GG algorithms introduced in §2.3 with the four proposed LSCP variations described in Algorithm 1. All models use a pool of 50 LOF base detectors ensuring consistency during performance evaluation. To induce diversity among base detectors, distinct initialization hyperparameters—specifically the number of neighbors () used in each LOF detector—are randomly selected in the range of . For GG_AOM and GG_MOA, the base detectors are divided into 5 subgroups and each group contains 10 base detectors selected without replacement. For all LSCP algorithms, the default hyperparameters mentioned in §3 are used.
4.3 Algorithm Performances.
Tables 2 and 3 summarize the ROC-AUC and mAP scores on the 18 datasets. Our experiments demonstrate that LSCP can bring consistent performance improvement over its GG counterparts, which is especially noticeable in the mAP scores. The Friedman test shows there is a statistically significant difference between the ten algorithms in both ROC-AUC and mAP; however, the Nemenyi test fails to identify which pairs of algorithms are significantly different. The latter result is expected in an unsupervised setting due to the difficulty of this task relative to the limited number of datasets [11]. In general, LSCP algorithms show great potential: they achieve the highest ROC-AUC scores on 14 datasets and the highest mAP scores on 16 datasets. While GG_M performs better on Vowels and Satellite, and GG_AOM achieves the highest mAP on Annthyroid, in all other cases, generic global algorithms are outperformed by a variant of LSCP. Specifically, LSCP_AOM is the best performing method and ranks highest on 11 datasets. It should be noted, though, that GG methods with a second-phase combination (GG_MOA and GG_AOM) demonstrate better performance than GG_A and better stability than GG_M. These observations agree with the conclusions in Aggarwal's work [2, 3].
LSCP_A and LSCP_M do not demonstrate strong performance relative to their GG counterparts. Given that both pseudo ground truth generation methods are heuristic, they may result in poor local competency evaluation. For example, as discussed in §3.6, LSCP_A theoretically benefits from both variance and bias reduction by averaging and focusing on locality. In practice though, by only selecting the single most competent detector, it is possible that this approach yields weaker variance reduction compared to GG_A, which uses all detector scores. As a consequence, the variance reduction may not be able to sufficiently offset the bias inherent to the pseudo ground truth generation process, leading to diminished performance. Comparatively, when the ground truth is generated by taking the maximum among multiple detectors, it is observed that both LSCP_M and GG_M exhibit unstable behaviour. As discussed in [2, 3], selecting maximum scores across detectors yields high model variance, which explains these results; a second-phase combination mitigates this risk.
A Friedman test confirms there is a significant difference among the four LSCP algorithms in both ROC-AUC and mAP. Correspondingly, the LSCP ensemble variations (LSCP_MOA and LSCP_AOM) show promise. Building on GG_AOM's success as one of the most effective combination methods [3], LSCP_AOM averages the selected group of detectors, which could be viewed as an additional reduction of model variance over LSCP_M. Moreover, LSCP_AOM's concentration on the local competency evaluation may have improved model bias and the second-phase averaging should have decreased the model variance, leading to better stability. The results show that LSCP_AOM outperforms all models on 11 datasets in terms of ROC-AUC and 12 datasets in terms of mAP. The latter improvement over GG methods is especially considerable on Cardio, Satimage2 and Thyroid.
The second-phase combination is less beneficial for LSCP_MOA, which did not outperform LSCP_A or GG_MOA. As discussed in [2, 3], it is less effective to do a second-phase combination after averaging, as information has already been lost due to blunting. The experiment results confirm that, in an LSCP scheme, the benefit from a second-phase maximization cannot offset the information loss due to the initial averaging. Overall, only LSCP_AOM is recommended for detector combination among the four LSCP algorithms due to its combined bias and variance reduction capabilities.
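Table 2: ROC-AUC scores (average of 20 independent trials). Each row lists the dataset followed by the scores of the ten compared algorithms.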
annthyroid  0.7679  0.7874  0.7653  0.7656  0.7827  0.7711  0.7509  0.7620  0.7924  0.7434  
arrhythmia  0.7789  0.7572  0.7790  0.7317  0.7655  0.7772  0.7779  0.7743  0.7516  0.7791  
breastw  0.8662  0.7444  0.8702  0.8503  0.8338  0.8529  0.6920  0.8454  0.7158  0.8722  
cardio  0.9053  0.8876  0.9065  0.9088  0.9088  0.9125  0.8986  0.9149  0.8292  0.9250  
glass  0.7518  0.7582  0.7508  0.7540  0.7590  0.7556  0.7430  0.7505  0.7735  0.7510  
letter  0.7890  0.8546  0.7843  0.8077  0.8381  0.8031  0.7690  0.7892  0.8504  0.7685  
lympho  0.9785  0.9731  0.9776  0.9785  0.9766  0.9785  0.9782  0.9770  0.9728  0.9785  
mnist  0.8548  0.8329  0.8556  0.8250  0.8549  0.8587  0.8558  0.8612  0.7771  0.8630  
musk  0.9980  0.9951  0.9987  0.9987  0.9973  0.9991  0.9986  0.9963  0.9977  0.9994  
pendigits  0.8252  0.8414  0.8302  0.8446  0.8572  0.8417  0.8097  0.8560  0.7315  0.8615  
pima  0.6942  0.6468  0.6956  0.6273  0.6665  0.6904  0.6952  0.6828  0.6276  0.6972  
satellite  0.5954  0.6333  0.5950  0.6168  0.6324  0.6079  0.5912  0.6300  0.6028  0.6048  
satimage2  0.9875  0.9906  0.9883  0.9884  0.9925  0.9913  0.9854  0.9931  0.9860  0.9938  
shuttle  0.5409  0.5571  0.5389  0.5506  0.5558  0.5475  0.5365  0.5544  0.5276  0.5498  
thyroid  0.9675  0.9346  0.9687  0.9656  0.9492  0.9652  0.9558  0.9624  0.9410  0.9693  
vertebral  0.3591  0.3713  0.3584  0.3839  0.3883  0.3659  0.3253  0.3798  0.4723  0.3471  
vowels  0.9117  0.9338  0.9101  0.9229  0.9284  0.9164  0.9224  0.9155  0.9261  0.8998  
wbc  0.9390  0.9313  0.9390  0.9333  0.9351  0.9391  0.9359  0.9331  0.9279  0.9400 
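Table 3: mAP scores (average of 20 independent trials). Each row lists the dataset followed by the scores of the ten compared algorithms.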
annthyroid  0.2452  0.2424  0.2460  0.2452  0.2617  0.2539  0.2379  0.2555  0.2423  0.2527  
arrhythmia  0.3650  0.3516  0.3651  0.3326  0.3576  0.3650  0.3653  0.3614  0.3637  0.3680  
breastw  0.6513  0.4797  0.6577  0.6335  0.5926  0.6321  0.4772  0.6110  0.4796  0.6739  
cardio  0.4260  0.4083  0.4295  0.4355  0.4496  0.4485  0.4108  0.4669  0.3399  0.4946  
glass  0.1397  0.1328  0.1430  0.1410  0.1340  0.1358  0.1341  0.1314  0.1479  0.1366  
letter  0.2323  0.3495  0.2275  0.2388  0.3018  0.2429  0.2121  0.2377  0.3682  0.2283  
lympho  0.8227  0.8001  0.8155  0.8227  0.8133  0.8227  0.8218  0.8116  0.7977  0.8300  
mnist  0.3905  0.3654  0.3913  0.3819  0.3868  0.3934  0.3914  0.3949  0.3326  0.3982  
musk  0.9331  0.8122  0.9536  0.9472  0.8908  0.9659  0.9365  0.8487  0.9097  0.9736  
pendigits  0.0690  0.0793  0.0693  0.0745  0.0820  0.0751  0.0633  0.0809  0.0573  0.0853  
pima  0.4901  0.4519  0.4913  0.4461  0.4662  0.4875  0.4879  0.4793  0.4366  0.4955  
satellite  0.4064  0.4447  0.4066  0.4071  0.4421  0.4167  0.4146  0.4404  0.4256  0.4208  
satimage2  0.4092  0.5297  0.4291  0.4268  0.5998  0.5236  0.3584  0.6320  0.3801  0.6408  
shuttle  0.1300  0.1207  0.1295  0.1316  0.1265  0.1311  0.1203  0.1299  0.1125  0.1335  
thyroid  0.4257  0.2467  0.4397  0.4274  0.3338  0.4217  0.3459  0.3864  0.2449  0.4692  
vertebral  0.1054  0.1111  0.1053  0.1144  0.1113  0.1063  0.1003  0.1104  0.1445  0.1054  
vowels  0.3810  0.4135  0.3793  0.3835  0.4072  0.3887  0.4079  0.3938  0.3724  0.3547  
wbc  0.5536  0.5264  0.5540  0.5496  0.5412  0.5552  0.5497  0.5505  0.5315  0.5567 
4.4 Visualization Analysis.
Figure 2 visually compares the performance of the best performing GG and LSCP methods on Cardio, Thyroid and Letter using t-distributed stochastic neighbor embedding (t-SNE) [20]. The green and blue markers highlight objects that can only be correctly classified by either the GG or LSCP methods, respectively, to emphasize the mutual exclusivity of the two approaches. The visualizations of Cardio (left) and Thyroid (middle) illustrate that LSCP methods have an edge over GG methods in detecting local outliers when they cluster together (highlighted by red dotted circles in Fig. 2). Additionally, LSCP methods can contribute to classifying both outlying and normal points when locality is present in the data. However, the outlying data distribution in Letter (right) is more dispersed: outliers do not form local clusters but rather mix with normal points. This causes LSCP to perform worse than GG_M in terms of ROC-AUC despite showing an improvement in terms of mAP. Based on these visualizations, one could assume that LSCP is useful when outlying and normal objects are well separated, but less effective when they are interleaved and cannot easily form local clusters. Additionally, the size of the local region should be informed by the estimated proportion of outliers in the dataset. For instance, outliers account for only 3.43% and 6.25% of Vowels and Letter respectively, which may not be sufficient to form outlier clusters when the local region size is set to 10% of the training data. A smaller local region size is more appropriate when a small number of outliers is assumed.
4.5 Limitations and Future Directions.
Firstly, the local region definition depends on finding nearest neighbors by Euclidean distance. However, this is not ideal due to: (i) high time complexity [9] and (ii) degraded performance when many irrelevant features are present in high-dimensional space [4]. This may be improved by using prototype selection [9] or by defining the local region using advanced clustering methods [9]. Secondly, only simple pseudo ground truth generation methods are explored (averaging or maximization) in this study; more accurate methods should be considered, such as actively pruning base detectors [23]. Thirdly, only parallel combination methods are compared in this study; sequential methods discussed in §2.3 should be included in future works. Lastly, DCS has proven to work with heterogeneous base classifiers in classification problems [16, 9], which remains to be verified for LSCP. Significant improvement is expected since the base detectors used in this study are homogeneous with limited diversity.
5 Conclusions
In this work, we propose four variants of a novel unsupervised outlier detection framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP). Unlike traditional combination approaches, LSCP identifies the top-performing base detectors for each test instance relative to its local region. To validate the effectiveness of this approach, the proposed framework is assessed on 18 real-world datasets and observed to outperform baseline algorithms on the majority of them. The ensemble approach LSCP_AOM demonstrated the best performance, achieving the highest detection score on 11/18 datasets with respect to ROC-AUC and 12/18 datasets with respect to mAP. Theoretical considerations under the bias-variance framework are also provided for LSCP, alongside visualizations, to provide a holistic view of the framework. Since LSCP demonstrates the promise of data locality, we hope that future work extends this exploration by investigating the use of heterogeneous base detectors and more reliable pseudo ground truth generation methods. All source code, experimental results and figures used in this study are made publicly available at https://github.com/yzhao062/LSCP.
References
 [1] C. C. Aggarwal, Outlier ensembles: position paper, ACM SIGKDD Explorations, 14 (2013), pp. 49–58.
 [2] C. C. Aggarwal and S. Sathe, Theoretical Foundations and Algorithms for Outlier Ensembles, ACM SIGKDD Explorations, 17 (2015), pp. 24–47.
 [3] C. C. Aggarwal and S. Sathe, Outlier ensembles: An introduction, Springer, 1st ed., 2017.
 [4] L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos, Fast and Reliable Anomaly Detection in Categorical Data, in CIKM, 2012.
 [5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, LOF: Identifying Density-Based Local Outliers, ACM SIGMOD, (2000), pp. 1–12.
 [6] A. S. Britto, R. Sabourin, and L. E. Oliveira, Dynamic selection of classifiers - A comprehensive review, Pattern Recognition, 47 (2014), pp. 3665–3680.
 [7] G. O. Campos, A. Zimek, and W. Meira, An Unsupervised Boosting Strategy for Outlier Detection Ensembles, PAKDD, (2018), pp. 564–576.
 [8] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, CSUR, 41 (2009), p. 15.
 [9] R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, Dynamic classifier selection: Recent advances and perspectives, Information Fusion, 41 (2018), pp. 195–216.
 [10] S. Das, W.-K. Wong, T. Dietterich, A. Fern, and A. Emmott, Incorporating Expert Feedback into Active Anomaly Discovery, ICDM, (2016), pp. 853–858.
 [11] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, JMLR, 7 (2006), pp. 1–30.
 [12] T. G. Dietterich, Ensemble Methods in Machine Learning, MCS, 1857 (2000), pp. 1–15.
 [13] A. Emmott, S. Das, T. Dietterich, A. Fern, and W.-K. Wong, A Meta-Analysis of the Anomaly Detection Problem, arXiv preprint, (2015).
 [14] G. Giacinto and F. Roli, A theoretical framework for dynamic classifier selection, ICPR, 2 (2000), pp. 0–3.
 [15] T. K. Ho, J. J. Hull, and S. N. Srihari, Decision Combination in Multiple Classifier Systems, TPAMI, 16 (1994), pp. 66–75.
 [16] A. H. Ko, R. Sabourin, and A. S. Britto, From dynamic classifier selection to dynamic ensemble selection, Pattern Recognition, 41 (2008), pp. 1735–1748.
 [17] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, LoOP: local outlier probabilities, CIKM, (2009), pp. 1649–1652.
 [18] A. Lazarevic and V. Kumar, Feature bagging for outlier detection, ACM SIGKDD, (2005), p. 157.
 [19] F. T. Liu, K. M. Ting, and Z. H. Zhou, Isolation forest, ICDM, (2008), pp. 413–422.
 [20] L. v. d. Maaten and G. Hinton, Visualizing data using t-SNE, JMLR, 9 (2008), pp. 2579–2605.
 [21] B. Micenková, B. McWilliams, and I. Assent, Learning Representations for Outlier Detection on a Budget, arXiv preprint, (2015).
 [22] S. Rayana and L. Akoglu, An Ensemble Approach for Event Detection and Characterization in Dynamic Graphs, in ACM SIGKDD ODD Workshop, 2014.
 [23] S. Rayana and L. Akoglu, Less is More: Building Selective Anomaly Ensembles, TKDD, 10 (2016), pp. 1–33.
 [24] S. Rayana, W. Zhong, and L. Akoglu, Sequential ensemble learning for outlier detection: A bias-variance perspective, ICDM, (2017), pp. 1167–1172.
 [25] E. Schubert, A. Zimek, and H.-P. Kriegel, Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection, DMKD, 28 (2014), pp. 190–237.
 [26] B. van Stein, M. van Leeuwen, and T. Bäck, Local subspace-based outlier detection using global neighbourhoods, IEEE International Conference on Big Data, (2016), pp. 1136–1142.
 [27] K. Woods, W. Kegelmeyer, and K. Bowyer, Combination of multiple classifiers using local accuracy estimates, TPAMI, 19 (1997), pp. 405–410.
 [28] Y. Zhao and M. K. Hryniewicki, XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning, IJCNN, (2018).
 [29] Y. Zhao, M. K. Hryniewicki, F. Cheng, B. Fu, and X. Zhu, Employee Turnover Prediction with Machine Learning: A Reliable Approach, IEEE Intelligent System Conference (Intellisys), (2018).
 [30] A. Zimek, R. J. G. B. Campello, and J. Sander, Ensembles for unsupervised outlier detection: Challenges and research questions, ACM SIGKDD Explorations, 15 (2014), pp. 11–22.