One-Class Kernel Spectral Regression for Outlier Detection
Abstract
The paper introduces a new, efficient nonlinear one-class classifier formulated as the optimisation of a Rayleigh quotient criterion. The method, operating in a reproducing kernel Hilbert space, minimises the scatter of the target distribution along an optimal projection direction while at the same time keeping the projections of target observations as distant as possible from the origin, which serves as an artificial outlier with respect to the data. We provide a graph embedding view of the problem, which can then be solved efficiently using the spectral regression approach. In this sense, unlike previous similar methods, which often require costly eigen-computations of dense matrices, the proposed approach casts the problem under consideration into a regression framework which avoids eigen-decomposition computations. In particular, it is shown that the dominant complexity of the proposed method is that of computing the kernel matrix. Additional appealing characteristics of the proposed one-class classifier are: 1) the ability to be trained in an incremental fashion (allowing for application in streaming data scenarios while also reducing computational complexity in the non-streaming operation mode); 2) being unsupervised, while also providing the ability for the user to specify the expected fraction of outliers in the training set in advance; and, last but not least, 3) the deployment of the kernel trick, allowing for a large class of functions by nonlinearly mapping the data into a high-dimensional feature space. Extensive experiments conducted on several datasets verify the merits of the proposed approach in comparison with other alternatives.
I Introduction
One-class classification (OCC) deals with the problem of identifying objects, events or observations which conform to a specific behaviour or condition, identified as the target/positive class, and distinguishing them from all other objects, typically known as outliers or anomalies. More specifically, consider a set of points where each point is a realisation of a multivariate random variable drawn from a target probability distribution with a certain probability density function. In a one-class classification problem, one would like to characterise the support domain of this distribution via a one-class classifier as
(1)  h(x) = [ f(x) ≥ τ ]
where the function f models the similarity of an observation to the target data and [·] denotes the Iverson bracket. The parameter τ is optimised so that an expected fraction of observations lies within the support domain of the target distribution. One-class learning serves as the core of a wide variety of applications such as intrusion detection [1], novelty detection [2], fault detection in safety-critical systems [3], fraud detection [4], insurance [5], health care [6], surveillance [7], etc. Historically, the first single-class classification problem seems to date back to the work in [8] in the context of learning a Bayes classifier. Later, with a large time gap, the term one-class classification was used in [9]. As a result of the different applications of one-class classification, other terminology, including anomaly/outlier detection [10], novelty detection [11], concept learning [12], etc., has also been used in the literature.
OCC techniques are commonly employed when the non-target/negative class is either not well defined, poorly sampled or totally missing, which may be due to the openness of the problem or to the high cost associated with obtaining negative samples. In these situations, conventional two-class classifiers are believed not to operate effectively, as they are based on the predominant assumption that data from all classes are more or less equally balanced. OCC techniques are developed to address this shortcoming of the conventional approaches by primarily training on the data coming from a single class. Nevertheless, the lack of sufficient negative samples may pose serious challenges in learning one-class classifiers, as only one side of the decision boundary can be estimated using positive observations. As a result, the one-class problem is typically believed to be more difficult than its two-class counterpart. As observed in [13], the challenges related to the standard two/multi-class problems, e.g. estimation of the error, atypical training data, the complexity of a solution, the generalisation capability, etc., are also present in OCC and may sometimes become even more severe.
While there exist different categorisations of one-class techniques [14, 13, 15], a general overarching categorisation considers them to be either generative or non-generative [16]. The generative approaches incorporate a model for generating all observations, whereas non-generative methods lack a transparent link to the data. In this context, the non-generative methods are best represented by discriminative approaches which partition the feature space in order to classify an object. As notable representatives of the generative approaches one may consider the parametric and non-parametric density estimation methods [17, 18, 19] (using, for example, a Gaussian, a mixture of Gaussians or a Poisson distribution), neural-network-based methods [12, 20], one-class sparse representation classification [21, 22], etc. Well-known examples of the non-generative methods include those based on support vector machines (SVDD/one-class SVM) [23, 24], linear programming [25], convex-hull methods [26, 27], cluster approaches [28], deep-learning-based methods [29, 30] and subspace approaches [31, 32, 33, 34, 35]. By virtue of their emphasis on classification rather than on modelling the generative process, the non-generative approaches tend to yield better classification performance.
In practical applications where the data to be characterised are highly nonlinear and complex, linear approaches often fail to provide satisfactory performance. In such cases, an effective mechanism is to implicitly map the data into a very high-dimensional space, with the hope that in this new space the data become more easily separable; the prominent examples of this are offered by kernel machines [36, 37, 38, 39]. Nevertheless, the high computational cost associated with these methods can be a bottleneck in their usage. For instance, the one-class variants of kernel discriminant analysis [33, 40, 34, 41] often require computationally intensive eigen-decompositions of dense matrices.
In this work, a new nonlinear one-class classifier formulated as the optimisation of a Rayleigh quotient is presented which, unlike previous discriminative methods [31, 32, 33, 34, 35, 41], avoids costly eigen-analysis computations via the spectral regression (SR) technique, which has been shown to speed up kernel discriminant analysis by several orders of magnitude [42]. By virtue of bypassing the eigen-decomposition of large matrices via a regularised regression formulation, the proposed One-Class Kernel Spectral Regression (OCKSR) approach is computationally very attractive: it will be shown that the dominant complexity of the algorithm is the computation of the kernel matrix. An additional appealing characteristic of the method is its operability in an incremental fashion, which allows the injection of additional training data into the system in a streaming data scenario, sidestepping the need to reinitialise the training procedure while also reducing computational complexity in a non-streaming operation mode. Additionally, the method can be operated in an unsupervised mode as well as by specifying the expected fraction of outliers in the training set in advance.
I-A Overview of the Proposed Approach
In the proposed one-class method, the strategy is to map the data into the feature space corresponding to a kernel and to infer a direction in the feature space such that: 1) the scatter of the data along that direction is minimised; 2) the projected samples and the origin are maximally distant along the projection direction. The problem is then posed as one of graph embedding, which is optimised efficiently using the spectral regression technique [42], thus avoiding costly eigen-analysis computations. In addition, an incremental version of the proposed method is presented which reduces the computational complexity of the training phase even further. As a by-product of the regression-based formulation, a consistency measure for training samples with respect to the inferred model is obtained, which provides the capability to specify the expected fraction of outliers in advance. During the test phase, the decision criterion involves projecting a test sample onto an optimal line in the feature space, followed by computing the distance between the projection of the test sample and that of the mean of the training samples.
The main contributions of the present work may be summarised as follows:
- A one-class nonlinear classifier (OCKSR) posed as a graph embedding problem;
- Efficient optimisation of the proposed formulation based on spectral regression;
- An incremental variant of the OCKSR approach;
- An observation ranking scheme making the method relatively more resilient to contaminations in the training set;
- Evaluation and comparison of the proposed method with state-of-the-art one-class techniques on several datasets.
I-B Outline of the Paper
The rest of the paper is organised as follows. In Section II, we briefly review the one-class methods most closely related to the proposed approach, focusing on nonlinear methods that pose the one-class classification problem as the optimisation of a (generalised) Rayleigh quotient. In Section III, the proposed one-class method (OCKSR) is presented. An experimental evaluation of the proposed approach, along with a comparison to other methods on several datasets, is provided in Section IV. Finally, conclusions are drawn in Section V.
II Related Work
The work in [19] employs kernel PCA for novelty detection, where a principal component in a feature space captures the distribution of the data, while the reconstruction residual of a test sample with respect to the inferred subspace is employed as a novelty measure. Other work in [43] describes a strategy to improve the convergence of the iterative kernel PCA algorithm. A different study [44] proposed a robustified PCA to deal with outliers in the training set.
In [31, 45], a one-class kernel Fisher discriminant classifier is proposed which is related to Gaussian density estimation in the induced feature space. The method is based on the idea of separating the data from their negatively replicated counterparts and involves an eigenvalue decomposition of the kernel matrix. In this approach, the data are first mapped into some feature space, where a Gaussian model is fitted. Mahalanobis distances to the mean of this Gaussian are used as test statistics to decide on normality. As also pointed out in [45], for kernel maps which transform the input data into a higher-dimensional space, a modelling problem induced by a deviation from the Gaussianity assumption in the feature space might occur. If the deviation is large, the method in [31, 45] may lead to unreliable results.
Other work in [33] proposed a Fisher-based null-space method where a zero within-class scatter and a positive between-class scatter are used to map all training samples of one class onto a single point. The method is able to treat multiple known classes jointly and to detect novelties for a set of classes with a single model, by using a projection onto a joint subspace in which the training samples of all known classes are presumed to have zero intra-class variance. Deciding on novelty involves computing a distance in the estimated subspace, while the method requires an eigen-decomposition of the kernel matrix. In a follow-up work [46], it is proposed to incorporate locality into the null-space approach of [33] by considering only the patterns most similar to the query sample, leading to improvements in performance. In [41], an incremental version of the method in [33] is proposed to increase computational efficiency.
In [34, 47], a generalised Rayleigh quotient specifically designed for outlier detection has been proposed. The method tries to find an optimal hyperplane which is closest to the target data and farthest from the outliers, which requires building two scatter matrices: an outlier scatter matrix corresponding to the outliers and a target scatter matrix for the target data. While in [34] a computationally intensive generalised eigenvalue problem is solved to compute the decision boundary, which limits the utilisation of the method to medium-sized data sets, in [47] the generalised eigenvalue problem is replaced by an approximate conjugate-gradient solution to decrease the computational cost. The method presented in [34, 47] has certain shortcomings: the computation of the outlier scatter matrix requires the presence of atypical instances, which are sometimes difficult to collect in real applications. Another drawback is that the method is based on the assumption that the target population differs from the outlier population in terms of their respective densities, which might not hold for real-world problems in general. A later study [40] tries to address these shortcomings via a null-space version of the method in [34, 47]. In order to overcome the limited availability of outlier samples, it is proposed to separate the target class from the origin of the kernel feature space, which serves as an artificial outlier sample. The density constraint is then relaxed by deriving a joint subspace in which the target training data have zero covariance. The method involves eigen-computations of dense matrices.
While the majority of previous work on one-class classification using a Rayleigh quotient formulation requires computationally intensive eigen-decompositions of large matrices, in this work a one-class approach is proposed which replaces costly eigen-analysis computations with the spectral regression technique [42]. In this sense, the present work can be considered as a one-class variant of the multi-class approach in [42] and of the two-class, class-specific method of [48], with additional contributions discussed in the subsequent sections.
III One-Class Kernel Spectral Regression
Notation: Description
C: The target observation class
n: Total number of training samples
β: The number of contaminations (outliers) in the training set
x_i: The i-th observation in the training set
d: The dimensionality of observations in the input space
ℱ: The feature (reproducing kernel Hilbert) space
φ(·): The nonlinear mapping into the feature space
s: Scatter of the data along the projection direction
m: The mean of the projected samples
g(·): The projection function onto a feature subspace
ℝ: The set of real numbers
ℝ^d: The set of real vectors in the d-dimensional space
W: Graph adjacency matrix
I: The identity matrix
1_{n×n}: A matrix of 1's
L: Graph Laplacian matrix
D: Graph degree matrix
θ: Sum of squared distances of target observations to the origin
α: The transformation vector
ȳ: The vector of responses (projections)
s_b: Between-class scatter
s_w: Within-class scatter
K: The kernel matrix
R: The Cholesky factor of K
k(·,·): The kernel function
τ: The threshold for deciding normality
δ: The regularisation parameter
c: The consistency vector of target observations
‖·‖: The norm operator
Let us assume that there exists a set of training samples x_1, …, x_n and let ℱ be a feature space (also known as an RKHS: reproducing kernel Hilbert space) induced by a nonlinear mapping φ. For a properly chosen mapping, an inner product on ℱ may be represented as ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j), where k(·,·) is a positive semi-definite kernel function. Our strategy for outlier detection is to map the data into the feature space induced by the nonlinear mapping and then look for an optimal projection direction (subspace) in the RKHS based on two criteria: 1) minimising the scatter of the mapped target data in the RKHS along the projection direction; and 2) maximising their distances from a hypothesised non-target instance in this subspace. In doing so, the problem is formulated as one of graph embedding, which is then posed as optimising a Rayleigh quotient, efficiently solved using a spectral regression framework. The two criteria used in this work are discussed next.
III-A Scatter in the feature subspace
Let us assume a projection function g(·) which maps each target data point onto a feature subspace. For reasons to be clarified later, g is assumed to be a one-dimensional mapping. The scatter s of the target data in the feature space along the direction specified by g is defined as
(2)  s = Σ_{i=1}^{n} ( g(x_i) − m )²
where m denotes the mean of all the projections g(x_i), i.e.
(3)  m = (1/n) Σ_{i=1}^{n} g(x_i)
Note that, as we are working in the feature space, g captures both a mapping from the original space onto the feature space and a projection onto a line in the RKHS. In order to detect outliers, it is desirable to find a projection function which minimises the dispersion of the positive samples and forms a compact cluster, i.e. minimises s. The scatter s can be written in terms of real coefficients α_i and a positive semi-definite kernel function k(·,·) defining an n×n kernel matrix K (where K_{ij} = k(x_i, x_j)) according to the following proposition:
Proposition 1.
(4)  s = α^T K ( I − (1/n) 1_{n×n} ) K α
cf. [49] for a proof.
Assuming that the kernel function is chosen and fixed, the problem of minimising s with respect to g boils down to finding

(5)  α* = argmin_α  α^T K ( I − (1/n) 1_{n×n} ) K α
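Proposition 1 can be checked numerically. The following is a small self-contained sketch, assuming the standard expansion g(x) = Σ_j α_j k(x, x_j); the RBF kernel width, the sample size and the random coefficients are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.normal(size=(n, 3))                  # n target samples in 3 dimensions

# RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / 2), width chosen for illustration
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)

alpha = rng.normal(size=n)                   # illustrative expansion coefficients
y = K @ alpha                                # projections g(x_i)

s_direct = ((y - y.mean()) ** 2).sum()       # scatter computed from Eq. 2
H = np.eye(n) - np.ones((n, n)) / n          # centring matrix I - (1/n) 1_{nxn}
s_kernel = alpha @ K @ H @ K @ alpha         # scatter via Proposition 1

assert np.isclose(s_direct, s_kernel)
```

The agreement of the two quantities reflects the identity y^T H y = Σ_i (y_i − mean)² applied to y = Kα.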
III-A1 Graph Embedding View
Let us now augment the data set (the x_i's) with an additional point x_{n+1} whose projection satisfies g(x_{n+1}) = m. Let us also define the (n+1)×(n+1) matrix W as
(6)  W = [ 0_{n×n}  1_n ; 1_n^T  0 ]
The scatter in Eq. 2 can now be written as
(7)  s = (1/2) Σ_{i,j=1}^{n+1} W_{ij} ( g(x_i) − g(x_j) )²
where W_{ij} denotes the element of W in the i-th row and j-th column. The latter formulation corresponds to a graph embedding view of the problem, where the data points are represented as vertices of a graph and W is the graph adjacency matrix, encoding the structure of the graph. That is, two vertices i and j of the graph are connected by an edge if W_{ij} = 1. With this perspective and with W given by Eq. 6, each data point is connected by an edge to the auxiliary point x_{n+1}, resulting in a star graph structure, Fig. 2. The purpose of graph embedding is to map each node of the graph onto a subspace in such a way that the similarity between each pair of nodes is preserved. In view of Eq. 7, the objective function incurs a higher penalty if two connected vertices are mapped to distant locations via g. Consequently, by minimising s, if two nodes are neighbours in the graph (i.e. connected by an edge), their projections in the new subspace are encouraged to lie in nearby positions.
Defining the diagonal matrix D such that D_{ii} = Σ_j W_{ij} would yield
(8)  D = [ I_n  0 ; 0  n ]
Assuming y = ( g(x_1), …, g(x_{n+1}) )^T, Eq. 7 can now be written in matrix form as
(9)  s = y^T D y − y^T W y
Defining the matrix L as L = D − W, Eq. 9 becomes
(10)  s = y^T L y
In the graph embedding literature, D is called the degree matrix, the diagonal elements of which count the number of times an edge terminates at each vertex, while L is the graph Laplacian [50, 51]. Since our data points are connected to the auxiliary point in the star graph representation, minimising the scatter given by Eq. 10 with respect to the projections of the target observations (i.e. with respect to y_i for i ≤ n) forces the mapped data to be located close to y_{n+1}. As y_{n+1} is the mean of the data in the subspace, by minimising s all target data are encouraged to be as close as possible to their mean on the line defined by g in the feature space. The optimum of the objective function is reached when all target data are mapped exactly onto a single point, i.e. s = 0.
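The star-graph construction and the identity linking Eqs. 7 and 10 can be verified directly; below is a minimal sketch, assuming unit edge weights and the conventional factor 1/2 of the graph-embedding identity:

```python
import numpy as np

n = 5
# Star graph: each of the n target nodes shares an edge with the auxiliary node n.
W = np.zeros((n + 1, n + 1))
W[:n, n] = W[n, :n] = 1.0
D = np.diag(W.sum(axis=1))          # degree matrix: 1 for targets, n for the hub
L = D - W                           # graph Laplacian

rng = np.random.default_rng(1)
y = rng.normal(size=n + 1)          # projections; the last entry plays the role of the mean

# (1/2) * sum_ij W_ij (y_i - y_j)^2 equals y^T L y
pairwise = 0.5 * (W * (y[:, None] - y[None, :]) ** 2).sum()
assert np.isclose(pairwise, y @ L @ y)
```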
III-B Distance to the origin
The idea of using the origin as an exemplar outlier has previously been used in designing one-class classifiers such as OCSVM [24] and others [40, 33, 41]. In essence, such a strategy corresponds to the assumption that novel samples lie around the origin while target objects lie farther away. In [24], it is shown that, using a Gaussian kernel function, the data are always separable from the origin. In this work, a similar assumption is made, and the target data points are mapped onto locations in a feature subspace such that they are as far as possible from the origin.
In order to encourage the mapped data points to lie as far as possible from the origin in the subspace, we make use of the sum of squared (Euclidean) distances between the projected data points and the origin. As the projection of the origin in the feature space onto any one-dimensional subspace (including the one specified by g) is zero, the sum of squared distances of the projected data points to the projection of the origin on the subspace defined by g can be written as
(11)  θ = Σ_{i=1}^{n} g(x_i)²
and using a vector notation, one obtains
(12)  θ = ȳ^T ȳ
where ȳ is obtained by dropping the last element of y, which corresponds to the augmented point. By the definition of θ, its maximisation corresponds to maximising the average margin between the projected target data points and the exemplar outlier.
III-C Optimisation
We now combine the two criteria, i.e. minimising the scatter s while maximising the average margin θ, and optimise with respect to the projections of all target data, i.e. the vector ȳ, as
(13)  min_ȳ  ( y^T L y ) / ( ȳ^T ȳ )
Note that the numerator of the quotient is defined in terms of y, whereas the optimisation is performed with respect to ȳ. Thus, the numerator needs to be expressed in terms of ȳ. Regarding y, we have
(14)  y = ( ȳ^T , (1/n) 1_{1×n} ȳ )^T
where 1_{1×n} denotes a 1×n matrix of 1's.
Due to the special structure of D given in Eq. 8, for y^T L y one obtains
(15)  y^T L y = ȳ^T ( I − (1/n) 1_{n×n} ) ȳ
As a result, Eq. 13 can be written purely in terms of ȳ; since the quotients ȳ^T ( I − (1/n) 1_{n×n} ) ȳ / (ȳ^T ȳ) and ȳ^T ( (1/n) 1_{n×n} ) ȳ / (ȳ^T ȳ) sum to one, minimising the former is equivalent to maximising the latter, i.e.
(16)  max_ȳ  ( ȳ^T (1/n) 1_{n×n} ȳ ) / ( ȳ^T ȳ )
The relation above is known as a Rayleigh quotient. It is well known that the optimum of a Rayleigh quotient is attained at the eigenvector corresponding to the largest eigenvalue of the matrix in the numerator, which in this case is the matrix (1/n) 1_{n×n}. It can easily be shown that this matrix has a single nonzero eigenvalue, equal to 1, whose eigenvector is (1/√n) 1_n. Note that the Rayleigh quotient is invariant under scaling of ȳ: if ȳ maximises the objective function in Eq. 16, then any nonzero scalar multiple of ȳ also maximises it. As a result, one may simply choose ȳ = 1_n, which leads to s = 0.
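Both properties of the solution, i.e. the rank-one spectrum of the numerator matrix and the optimality of the constant vector, are easy to confirm numerically; a small sketch (the sample size is arbitrary):

```python
import numpy as np

n = 6
M = np.ones((n, n)) / n             # rank-one numerator matrix (1/n) * 1_{nxn}
vals, vecs = np.linalg.eigh(M)
assert np.isclose(vals[-1], 1.0)    # single nonzero eigenvalue, equal to 1
v = vecs[:, -1]
assert np.allclose(v, v[0])         # its eigenvector is constant, proportional to 1_n

y = np.ones(n)                      # scale-invariance lets us pick y = 1_n
scatter = ((y - y.mean()) ** 2).sum()   # zero: all projections coincide
margin = (y ** 2).sum()                 # n: distance from the origin is maximal
assert scatter == 0.0 and margin == n
```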
III-D Relation to the Fisher-based null-space methods
We now establish the relationship of our formulation in Eq. 16 to null-space Fisher discriminant analysis. For this purpose, it is first shown that the criterion function in Eq. 16 is in fact the Fisher ratio, and then its relation to the null-space approaches is analysed.
The Fisher analysis maximises the ratio of the between-class scatter s_b to the within-class scatter s_w. As the negative class is represented by only a single sample in our approach, it has zero scatter, and thus the within-class scatter in this case equals the scatter of the target class, and hence
(17)  s_w = s = ȳ^T ( I − (1/n) 1_{n×n} ) ȳ
The between-class scatter s_b, when the origin is taken as the mean of the negative class, along the direction specified by g is
(18)  s_b = m² = ( (1/n) 1_{1×n} ȳ )² = (1/n²) ȳ^T 1_{n×n} ȳ
The Fisher analysis maximises the ratio s_b/s_w or, equivalently, minimises the ratio s_w/s_b. Noting that ȳ^T ȳ = s_w + n·s_b, one thus has
(19)  max_ȳ  s_b/s_w  ≡  max_ȳ  ( ȳ^T (1/n) 1_{n×n} ȳ ) / ( ȳ^T ȳ )
which clearly shows that, when the negative class is represented only by the origin, our criterion function in Eq. 16 is in fact the Fisher criterion.
Next, it is shown that the proposed approach is in fact a null-space Fisher analysis. The null projection function [41, 33] is defined as a function leading to zero within-class scatter while providing positive between-class scatter. Thus, one needs to show that ȳ = 1_n leads to s_w = 0 and s_b > 0. As all the elements of ȳ are equal, it is clear that the proposed formulation corresponds to zero scatter for the target class. The conjecture can also be verified by substituting ȳ = 1_n into the relation for the within-class scatter:
(20)  s_w = 1_n^T ( I − (1/n) 1_{n×n} ) 1_n = n − n = 0
III-E Spectral Regression
Once the response vector ȳ is determined, the relation K α = ȳ (with K the kernel matrix and α the transformation vector) may be used to determine α. This approach is called spectral regression in [42]. Denoting the matrix in the numerator of Eq. 16 generically as M, spectral regression involves two steps to solve for α:

1. Solve the eigenproblem M ȳ = λ ȳ for ȳ;
2. Solve K α = ȳ for α.
The method is dubbed spectral regression as it involves a spectral analysis of M followed by solving K α = ȳ, which is equivalent to a regularised regression problem [42]. However, in our formulation, due to the special structure of M, finding the leading eigenvector was trivial.
Solving K α = ȳ can be performed using a Cholesky factorisation and forward-back substitution. In this case, if K is positive definite, then there exists a unique solution for α. If K is singular, it is approximated by the positive definite matrix K + δI, where I is the identity matrix and δ > 0 is a regularisation parameter. As a widely used kernel function, the radial basis kernel function, i.e. k(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²)), leads to a positive definite kernel matrix [42, 38], for which δ = 0 and spectral regression finds the exact solution. Considering a Cholesky factorisation of K as K = R^T R, α may be found by first solving R^T b = ȳ for b and then solving R α = b for α. Since in the proposed approach there is only one eigenvector associated with the eigenproblem, only a single vector α is computed.
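The regression step can be sketched as follows, assuming an RBF kernel with unit width and a tiny illustrative ridge δ; NumPy's Cholesky and triangular solves stand in for the forward-back substitution:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, 4))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)               # RBF kernel: positive definite for distinct points

y = np.ones(n)                      # responses: the all-ones leading eigenvector
delta = 1e-10                       # tiny ridge, purely for numerical safety

R = np.linalg.cholesky(K + delta * np.eye(n))   # K = R R^T, R lower-triangular
b = np.linalg.solve(R, y)                        # forward substitution:  R b = y
alpha = np.linalg.solve(R.T, b)                  # back substitution:     R^T alpha = b

assert np.allclose(K @ alpha, y, atol=1e-4)
```

Note that NumPy's `cholesky` returns a lower-triangular factor, so the factorisation reads K = R R^T rather than K = R^T R; the two conventions are transposes of one another.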
III-F Target Observation Ranking
Up to this point, it has been assumed that the target data set is not contaminated by any outliers, and a model is built using all the available observations. However, in practical settings this assumption might not hold, which would lead to model drift. In this section, a mechanism is presented which can handle such situations given some feedback from the user regarding the expected fraction of outliers in the data set. Based on the relation K α = ȳ, for the i-th observation one obtains
(22)  ŷ_i = K_{i·} α
where K_{i·} is the i-th row of the kernel matrix. While ŷ_i is the value estimated for the i-th training observation by the regression, y_i stands for its true expected value. Clearly, as δ → 0 the two quantities become equal. The squared error for the i-th observation is defined as
(23)  e_i = ( y_i − K_{i·} α )²
As y_i = 1 is fixed for all training samples, e_i can be used as a measure of goodness-of-fit for observation x_i. That is, the higher the value of e_i, the less consistent x_i is with the model. As a result, the consistency measure of the i-th training observation with the model in the proposed OCKSR method is defined as
(24)  c_i = −e_i
If the contaminations in the training data are in a strict minority, it is expected that the model will fit the true positive samples better than the contaminations. In other words, the contaminations in the training set will exhibit a larger deviation from the model. In this respect, if β observations in the training set are specified by the user as contaminations, then, once α is determined, the samples are sorted according to their consistency measures. One can then discard the β least consistent observations by removing the corresponding rows and columns of the kernel matrix to obtain a reduced kernel matrix. Once the new kernel matrix is obtained, the rest of the training procedure is as before.
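A possible implementation of the ranking step is sketched below; the helper name `prune_contaminations` and the use of the raw squared residual as the (negated) consistency score are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def prune_contaminations(K, alpha, y, beta):
    """Drop the beta least-consistent training samples.

    The squared regression residual e_i = (y_i - K[i] @ alpha)^2 gauges how
    poorly sample i fits the model; larger e_i means less consistent.
    Returns the reduced kernel matrix and the indices that were kept.
    """
    e = (y - K @ alpha) ** 2
    keep = np.sort(np.argsort(e)[: len(y) - beta])   # most consistent samples
    return K[np.ix_(keep, keep)], keep

# Toy illustration: sample 2 fits the unit responses worst and is discarded.
K = np.eye(4)
alpha = np.array([1.0, 1.0, 0.2, 1.0])
y = np.ones(4)
K_red, kept = prune_contaminations(K, alpha, y, beta=1)
assert list(kept) == [0, 1, 3] and K_red.shape == (3, 3)
```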
III-G Outlier Detection
Once α is determined, the projection of a probe z onto the optimal feature subspace can be obtained as g(z) = k_z^T α, where k_z is the vector collecting the elements k(z, x_i), i = 1, …, n. The decision rule is now defined via the squared distance between g(z) and the mean of the projections of the target observations in the subspace. As the target projections are all equal to 1, the decision rule becomes
( 1 − k_z^T α )² ≤ τ  ⟹  z is a target object
( 1 − k_z^T α )² > τ  ⟹  z is an outlier
where τ is the threshold for deciding normality. The pseudo-codes for the training and testing stages of the proposed OCKSR approach are summarised in Algorithms 1 and 2.
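Putting the pieces together, the training stage and the decision rule can be sketched as follows; the kernel width, the ridge and the synthetic data are illustrative choices, and the fixed threshold is replaced by a direct comparison of the two scores:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Pairwise RBF kernel values between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))                 # target training samples

# Training: solve (K + delta I) alpha = 1_n (responses are all ones).
n = len(X)
K = rbf(X, X)
alpha = np.linalg.solve(K + 1e-10 * np.eye(n), np.ones(n))

def score(Z):
    """Squared distance between the projection of each probe and the (unit)
    mean of the training projections; a small score is target-like."""
    return (1.0 - rbf(Z, X) @ alpha) ** 2

s_in = score(rng.normal(size=(1, 2)))[0]     # probe from the target distribution
s_out = score(np.full((1, 2), 8.0))[0]       # probe far away from the training data
assert s_in < s_out                          # the distant probe is the more outlying
assert s_out > 0.5                           # it projects near the origin
```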
III-H Incremental OCKSR
In the proposed OCKSR method, a high computational cost is associated with the Cholesky decomposition of the kernel matrix K, the batch computation of which requires O(n³) arithmetic operations. However, as advocated in [52], a Cholesky decomposition may be obtained more efficiently using an incremental approach. In the incremental scheme, the goal is to find the Cholesky decomposition of an (n+1)×(n+1) matrix given the Cholesky decomposition of its n×n submatrix. Hence, given the Cholesky decomposition of the kernel matrix K_n of n samples, we want to compute the Cholesky factorisation of the kernel matrix K_{n+1} of the augmented training set where a single sample x_{n+1} is injected into the system. The incremental Cholesky decomposition technique may be applied via Sherman's march algorithm [52] as
(25)  K_{n+1} = [ K_n  k ; k^T  κ ] = [ R_n^T  0 ; r^T  ρ ] [ R_n  r ; 0  ρ ]

where k is the vector of kernel values between the new sample and the existing ones, κ = k(x_{n+1}, x_{n+1}), and r is an n-vector. Eq. 25 reads

(26)  R_n^T r = k,   ρ² = κ − r^T r
Thus, one first solves R_n^T r = k for r and then computes ρ = (κ − r^T r)^{1/2}. The employed incremental technique reduces the computational cost of the Cholesky decomposition from cubic in the number of training samples in batch mode to quadratic in incremental mode [42].
By varying the number of samples from 1 to n (the total number of target observations), the incremental Cholesky decomposition is obtained as in Algorithm 3. The incremental approach not only reduces the computational complexity but also allows for operation in streaming data scenarios where training samples are received one at a time. In this case, as new data become available, only the new part of the kernel matrix needs to be computed. Moreover, since the Cholesky factorisation is performed in an incremental fashion, the previous computations are fully reused.
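The one-sample update of Eqs. 25-26 can be sketched as follows, here using the lower-triangular convention (the factor L equals R^T above); the helper name `chol_append` is an assumption for illustration:

```python
import numpy as np

def chol_append(L, k_new, kappa):
    """Grow the lower-triangular Cholesky factor of a kernel matrix by one
    sample: L is the factor of the current n x n kernel matrix, k_new holds
    the kernel values between the new sample and the n old ones, and kappa
    is the new sample's kernel value with itself.  Costs O(n^2)."""
    n = L.shape[0]
    r = np.linalg.solve(L, k_new)    # forward substitution: L r = k_new
    rho = np.sqrt(kappa - r @ r)     # new diagonal entry
    L_new = np.zeros((n + 1, n + 1))
    L_new[:n, :n] = L
    L_new[n, :n] = r
    L_new[n, n] = rho
    return L_new

# Check against a batch factorisation of a random SPD (kernel-like) matrix.
rng = np.random.default_rng(4)
A = rng.normal(size=(6, 6))
K = A @ A.T + 6 * np.eye(6)          # symmetric positive definite
L = np.linalg.cholesky(K[:5, :5])
L_inc = chol_append(L, K[:5, 5], K[5, 5])
assert np.allclose(L_inc, np.linalg.cholesky(K))
```

Since positive-diagonal Cholesky factors are unique for SPD matrices, the incremental factor matches the batch one up to floating-point error.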
III-I Relation to other methods
There exist unsupervised methods using kernel PCA (KPCA) for outlier detection, such as those in [19, 43]. If, in the KPCA approach, one uses the eigenvector corresponding to the smallest eigenvalue for projection, a small variance along the projection direction is expected. However, in KPCA the smallest eigenvalue, and as a result the variance along the projection direction, need not be zero in general. Note that in KPCA one may obtain at most n (n being the number of training samples) distinct eigenvalues from the kernel matrix. As the smallest eigenvalue of a general kernel matrix need not be zero, the variance along the corresponding eigenvector is not necessarily zero. In particular, as a widely used kernel function, an RBF kernel results in a positive definite kernel matrix, which translates into strictly positive eigenvalues. In contrast, in the proposed method, the variance along the projection direction is always zero when using a Gaussian kernel function.
As discussed previously, the proposed method is similar to the null-space methods for anomaly detection presented in [41, 33] in the sense that all of these methods employ the Fisher criterion for the estimation of a null feature space. However, the proposed approach, as will be discussed in §IV-C, is computationally attractive by virtue of avoiding costly eigen-decompositions. Other work in [40] tries to optimise the ratio between the target scatter and the outlier scatter, which is different from the Fisher ratio utilised in this work. As illustrated, the proposed approach can be implemented in an incremental fashion, which further reduces the computational complexity of the method while allowing for application in streaming data scenarios. Moreover, the proposed OCKSR method offers a natural mechanism for specifying the fraction of outliers in the training set in advance.
IV Experimental Evaluation
In this section, an experimental evaluation of the proposed approach is provided, comparing the performance of the OCKSR method to that of several state-of-the-art approaches in terms of the area under the ROC curve (AUC). Ten different data sets, ranging from relatively low- to medium- and high-dimensional features, are used. A summary of the statistics of the data sets is provided in Table II. Brief descriptions of the data sets used in the experiments follow:

Arcene: The task in this data set is to distinguish cancer versus normal patterns from massspectrometric data. The data set was obtained by merging three massspectrometry datasets with continuous input variables to obtain training and test data. The dataset is part of the 2003 NIPS variable selection benchmark. The original features indicate the abundance of proteins in human sera having a given mass value. Based on these features one must separate cancer patients from healthy patients. The data set is part of the UCI machine learning data sets [53].

AD includes EEG signals from 11 patients with a diagnosis of probable AD and 11 controls subjects. The task in this data set is to discriminate between healthy and Alzheimerâs (AD) patients. AD patients were recruited from the Alzheimerâs Patientsâ Relatives Association of Valladolid (AFAVA), Spain for whom more than 5 minutes of EEG data were recorded using Oxford Instruments Profile Study Room 2.3.411 (Oxford, UK). As suggested in [54], in this work the signal associated with the electrode is used.

Face consists of face images of different individuals, where the task is to recognise a subject among others. For each subject a one-class classifier is built using the data of that subject, while all other subjects are considered outliers with respect to the built model. The experiment is repeated in turn for all subjects in the data set. The features used for learning are obtained via the GoogLeNet deep CNN [55]. We created this data set out of the real-access data of the Replay-Mobile data set [56] and included ten subjects in the experiments.

Caltech256 is a challenging set of 256 object categories containing 30607 images in total. Each class has a minimum of 80 images representing a diverse set of backgrounds, poses, lighting conditions and image sizes. In this experiment, the 'american-flag' class is considered as the target class and the samples associated with the 'boom-box', 'bulldozer' and 'cannon' classes as outliers. Bag-of-visual-words histograms of densely sampled SIFT features are used to represent images (http://homes.esat.kuleuven.be/~tuytelaa/unsup_features.html).

MNIST is a collection of 28×28-pixel images of handwritten digits 0–9. Considering digit '1' as the target digit, 220 images are used as target data and 293 images corresponding to other digits are used as negative samples.

Delft pump includes 5 vibration measurements taken under different normal and abnormal conditions from a submersible pump. The 5 measurements are combined into one object, giving a 160-dimensional feature space. The data set is obtained from the one-class data set archive of Delft University [57].

Sonar is composed of 208 instances of 60 attributes, each representing the energy within a particular frequency band integrated over a certain period of time. There are two classes: rock and mine. The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The Sonar data set is one of the undocumented databases from UCI.

Vehicle is from Statlog, where the class 'van' is used as the target class. The task is to recognise a vehicle from its silhouette. The data set is obtained from the one-class data set archive of Delft University [57].

Vowel is an undocumented data set from UCI. The purpose is speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios. Vowel 0 is used as the target class in this work.

Balance-scale was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight and the right distance. The data set is part of the UCI machine learning repository [53].
The methods included in the comparison in this work are:

OCKSR is the proposed (naïve) one-class kernel spectral regression method without observation ranking, i.e. the variant used when the fraction of contaminations in the training set is not specified in advance.

OCSVM is based on the support vector machine formalism to solve the one-class classification problem [24]. As a widely used method, it provides a baseline for comparison. The method provides a parameter to specify the expected fraction of outliers in the training set.
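As a minimal illustration of this baseline, the sketch below fits a one-class SVM on synthetic target data using scikit-learn; the data, the kernel width and the value of the ν parameter (`nu=0.1`, the expected outlier fraction) are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                      # target-class training data
X_test = np.vstack([rng.normal(size=(50, 5)),            # unseen targets
                    rng.normal(loc=4.0, size=(50, 5))])  # synthetic outliers

# nu upper-bounds the expected fraction of outliers in the training set
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)
scores = clf.decision_function(X_test)  # larger => more target-like
preds = clf.predict(X_test)             # +1 for targets, -1 for outliers
```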

OCKNFST is the one-class kernel null Foley–Sammon transform presented in [33], which operates on the Fisher criterion. This method is chosen due to its similarity to the proposed approach.

KPCA is based on the kernel PCA method, where the reconstruction residual of a sample in the feature space is used as the novelty measure [19].
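The feature-space reconstruction residual of [19] can be sketched directly from the kernel matrix, as below; the kernel choice, its width and the number of retained components are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # target training set
Z = np.vstack([rng.normal(size=(10, 3)),               # unseen targets
               rng.normal(loc=5.0, size=(10, 3))])     # synthetic outliers

n, q = len(X), 10                                      # q retained components
K = rbf(X, X)
H = np.eye(n) - np.full((n, n), 1.0 / n)               # centering matrix
w, V = np.linalg.eigh(H @ K @ H)                       # centred-kernel eigenpairs
A = (V[:, w > 1e-10] / np.sqrt(w[w > 1e-10]))[:, -q:]  # top-q expansion coefficients

Kz = rbf(Z, X)
Kz_c = Kz - Kz.mean(1, keepdims=True) - K.mean(0) + K.mean()  # centred test rows
proj = Kz_c @ A                                        # projections onto components
phi2 = 1.0 - 2.0 * Kz.mean(1) + K.mean()               # ||centred feature map||^2
residual = phi2 - (proj ** 2).sum(1)                   # larger => more novel
```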

GPmean is derived from Gaussian process regression and approximate Gaussian process classification [58]; in this work, the predictive mean is used as the one-class score.
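A hedged sketch of this score, following the regression view of [58] in which a constant label y = 1 is regressed on the target data and the predictive mean serves as the membership score; the kernel, its length scale and the noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 4))                       # target training data
X_test = np.vstack([rng.normal(size=(20, 4)),
                    rng.normal(loc=4.0, size=(20, 4))])  # synthetic outliers

# Regress y = 1 on the targets; away from the data the predictive mean
# decays towards the zero-mean prior, so a low mean signals novelty.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=1e-2, optimizer=None)
gp.fit(X_train, np.ones(len(X_train)))
score = gp.predict(X_test)  # larger => more target-like
```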

LOF The local outlier factor (LOF) [59] is a local measure indicating the degree of novelty of each object in the data set. The LOF of an object depends on a single parameter k, the number of nearest neighbours used to define the local neighbourhood of the object.
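For illustration, LOF with a fixed k can be scored on held-out queries via scikit-learn's novelty mode; the value k = 10 and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_test = np.vstack([rng.normal(size=(20, 3)),
                    rng.normal(loc=6.0, size=(20, 3))])  # synthetic outliers

# n_neighbors is LOF's single parameter k: the local neighbourhood size
lof = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(X_train)
score = lof.decision_function(X_test)  # larger => more target-like
```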

Kmeans is the k-means clustering based approach, where k centres are assumed for the target observations. The novelty score of a sample is defined as the minimum distance of the query to the data centres.
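A minimal sketch of this score, assuming k = 5 centres (an illustrative choice) and Euclidean distances:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 2))
X_test = np.vstack([rng.normal(size=(10, 2)),
                    rng.normal(loc=5.0, size=(10, 2))])  # synthetic outliers

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
# novelty score: distance from a query to its nearest cluster centre
d = np.linalg.norm(X_test[:, None, :] - km.cluster_centers_[None], axis=2)
novelty = d.min(axis=1)  # larger => more novel
```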

KNNDD The k-nearest-neighbours data description method (KNNDD) is proposed within the one-class classification framework [13]. The principle of KNNDD is to associate with each sample a distance measure relative to its neighbourhood of k neighbours.
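A simple variant of this idea scores a query by its distance to the k-th nearest training sample; k = 5 and the data are illustrative assumptions (the method of [13] additionally normalises such distances by neighbourhood distances within the training set).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 3))
X_test = np.vstack([rng.normal(size=(15, 3)),
                    rng.normal(loc=5.0, size=(15, 3))])  # synthetic outliers

k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X_train)
dist, _ = nn.kneighbors(X_test)
novelty = dist[:, -1]  # distance to the k-th neighbour; larger => more novel
```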
In all the experiments that follow, the positive samples of each data set are randomly divided into training and test sets of equal size. Each experiment is repeated 100 times and the average area under the ROC curve (AUC) and the standard deviation of the AUCs are reported. No preprocessing of features is performed other than normalising all features to have a unit L2-norm. For the methods requiring a neighbourhood parameter (i.e. LOF, Kmeans and KNNDD), the neighbourhood parameter is tuned to obtain the best performance. Regarding the methods operating in an RKHS (i.e. OCSVM, OCKNFST, GPmean, KPCA and OCKSR), a common Gaussian kernel is computed and shared among all methods.
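The evaluation protocol above can be sketched as the following loop; OCSVM stands in for the scorer, and the synthetic data, parameter values and reduced number of repetitions are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import normalize
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
pos = normalize(rng.normal(size=(100, 8)))            # unit L2-norm features
neg = normalize(rng.normal(loc=2.0, size=(60, 8)))

aucs = []
for rep in range(10):                                 # the paper uses 100 repetitions
    idx = rng.permutation(len(pos))
    tr, te = idx[:len(pos) // 2], idx[len(pos) // 2:] # equal random split of positives
    clf = OneClassSVM(gamma="scale", nu=0.1).fit(pos[tr])
    X_eval = np.vstack([pos[te], neg])
    y_eval = np.r_[np.ones(len(te)), np.zeros(len(neg))]
    aucs.append(roc_auc_score(y_eval, clf.decision_function(X_eval)))

mean_auc, std_auc = np.mean(aucs), np.std(aucs)       # reported statistics
```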
Table II: Summary of the data sets used in the experiments
Dataset  Positive Samples  Negative Samples  Dimensionality (d)

Arcene  88  112  10000 
AD  263  400  1280 
Face  1030  10290  1024 
Caltech256  97  304  1000 
MNIST  220  293  784 
Pump  189  531  160 
Sonar  111  97  60 
Vehicle  199  647  18 
Vowel  48  480  10 
Balancescale  49  576  4 
IV-A Comparison to other methods
A comparison of the proposed OCKSR approach to the other methods is provided in Tables III and IV for the data sets with medium- to high-dimensional features and the data sets with relatively lower-dimensional features, respectively. From Tables III and IV, one may observe that on 6 out of 10 data sets the proposed OCKSR method achieves the leading performance in terms of average AUC. The results over the different data sets are summarised in Tables V, VI and VII in terms of average AUC over the medium- to high-dimensional data sets, the relatively lower-dimensional data sets and all data sets, respectively. As can be observed from Table V, the best performing methods on the medium- to high-dimensional data sets in terms of average AUC are the proposed OCKSR and the OCKNFST method, with an average AUC of 88.98%. It is worth noting that the performances of the two methods match exactly. As previously discussed, this is expected since both approaches are theoretically equivalent, optimising the Fisher criterion for classification. The second best performing method in this case is KPCA, followed by Kmeans.
Regarding the lower-dimensional data sets, the best performing methods in terms of average AUC are again the proposed OCKSR approach and the OCKNFST method, with an average AUC of 92.46% (Table VI). The second best performing method in this case is GPmean with an average AUC of 90.66%, followed by KPCA with an average AUC of 90.24%.
Table VII reports the average AUCs of all the evaluated methods over all data sets regardless of the dimensionality of the feature vectors. The best performing methods, OCKSR and OCKNFST, achieve an average AUC of 90.72%, while the second best performing methods are KPCA and GPmean, each with an average AUC of 89.41%.
Table V: Average AUC over the medium- to high-dimensional data sets
Method  Mean AUC (%)

OCKSR  88.98 
OCSVM  87.65 
OCKNFST  88.98 
KPCA  88.57 
GPmean  88.17 
LOF  49.07 
KMEANS  88.25 
KNNDD  87.12 
Table VI: Average AUC over the relatively lower-dimensional data sets
Method  Mean AUC (%)

OCKSR  92.46 
OCSVM  77.50 
OCKNFST  92.46 
KPCA  90.24 
GPmean  90.66 
LOF  60.51 
KMEANS  86.56 
KNNDD  82.02 
Table VII: Average AUC over all data sets
Method  Mean AUC (%)

OCKSR  90.72 
OCSVM  82.57 
OCKNFST  90.72 
KPCA  89.41 
GPmean  89.41 
LOF  54.79 
KMEANS  87.41 
KNNDD  84.57 
IV-B Contaminations in the training set
In this experiment, contaminations are gradually added to the training set and the performance of each method over the different data sets is observed. The contaminations are drawn from the negative samples of each data set, and their proportion relative to the positive set is increased in fixed increments. As before, each experiment is repeated 100 times. As the LOF method performs considerably worse than the others, it is excluded from this experiment. The results of this evaluation are presented in Fig. 3 and Fig. 4 for the medium- to high-dimensional and the relatively lower-dimensional data sets, respectively. In the figures, the two variants of OCKSR, without and with the proposed observation ranking scheme, are reported separately. Note that only the ranking variant of OCKSR and the OCSVM method have an explicit internal mechanism for handling contaminations in the training set. From the figures, the following conclusions can be drawn. As expected, adding contaminations to the training set deteriorates the performance of all methods. However, the ranking variant of OCKSR is more resilient to moderate levels of contamination and performs better than plain OCKSR, which indicates the effectiveness of the proposed observation ranking scheme. Beyond that level, the regression model is more strongly influenced by contaminations: any mis-detected contamination not only keeps the corresponding sample in the training set but also discards a truly positive sample from it, and hence causes the model to deviate. It can also be observed that the higher the dimensionality of the feature vectors, the better OCKSR performs. For instance, OCKSR is the best performing method on the Arcene data set up to moderate contamination levels. As the dimensionality of the feature vectors decreases, other methods tend to outperform the proposed approach at higher contamination percentages.
Interestingly, the KPCA approach performs reasonably well despite having no explicit built-in mechanism against contaminations in the training set.
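The contamination experiment can be sketched as a sweep over the fraction of negatives mixed into the training set; OCSVM stands in for the scorer, and the fractions, ν values and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
pos = rng.normal(size=(200, 6))
neg = rng.normal(loc=3.0, size=(200, 6))

results = {}
for frac in (0.0, 0.1, 0.2, 0.3):                 # contamination ratios swept
    n_bad = int(frac * 100)
    train = np.vstack([pos[:100], neg[:n_bad]])   # contaminated training set
    clf = OneClassSVM(gamma="scale", nu=max(frac, 0.05)).fit(train)
    X_eval = np.vstack([pos[100:], neg[100:]])    # held-out targets and outliers
    y_eval = np.r_[np.ones(100), np.zeros(100)]
    results[frac] = roc_auc_score(y_eval, clf.decision_function(X_eval))
```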
IV-C Computational complexity
In this section, the computational complexity of the proposed OCKSR method in the training and test phases is discussed and compared to similar methods.
IV-C1 Computational complexity in the training stage
An analysis of the computational complexity of the proposed method in the training stage is as follows. As with all kernel methods, computing the kernel matrix has a cost quadratic in the number of training samples. In the incremental scheme, computing the additional part of the kernel matrix requires a number of compound arithmetic operations, each consisting of one addition and one multiplication (flam [52]), proportional to the number of additional training samples. The Cholesky decomposition of the kernel matrix can likewise be updated incrementally, and, given this decomposition, the linear equations can be solved by forward and back substitution. As a result, the computational cost of training the incremental OCKSR approach in the updating phase is dominated by these operations;
assuming the number of additional samples is small relative to the existing training set, the cost can be approximated as
(27) 
In the initial training stage, the computation of the kernel matrix, the Cholesky decomposition of the kernel matrix and the solution of the linear equations are required. Noting that even in the initial stage the Cholesky decomposition can be performed in an incremental fashion, the total cost of the initialisation stage can be approximated as
(28) 
As a result, when the feature dimensionality is not small compared with the number of training samples (which is often the case), the computation of the kernel matrix has the dominant complexity in the training phase of the proposed approach.
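The training computation described above can be sketched as follows: after forming the kernel matrix (the dominant cost), the regression system is solved with one Cholesky factorisation and triangular substitutions, with no eigendecomposition. The constant regression target and the ridge term are illustrative stand-ins for the paper's spectral-regression targets and regulariser.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))      # target training samples
K = rbf(X, X)                      # kernel matrix: the dominant training cost
y = np.ones(len(X))                # stand-in regression targets (hypothetical)

reg = 1e-3                         # ridge term keeps the system well-posed
c, low = cho_factor(K + reg * np.eye(len(X)))  # one Cholesky factorisation
alpha = cho_solve((c, low), y)     # forward/back substitution; no eigensolve
```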
IV-C2 Computational complexity in the test stage
In the test phase, the OCKSR method requires computing the kernel values between the query sample and the training samples, followed by an inner product with the learned coefficient vector. Hence the dominant computational complexity in the test phase is linear in the number of training samples.
Very recently, an incremental variant of the OCKNFST approach was proposed in [41] which reduces the computational complexity of the original KNFST algorithm. As the classification performance of this approach matches that of the original KNFST method, OCKSR and the incremental KNFST are identical in terms of classification performance. Regarding the computational complexity of the incremental OCKNFST method, in the initial training stage the algorithm requires the computation of the kernel matrix together with eigendecomposition and matrix multiplication computations. Since computing the kernel matrix is common to both OCKSR and the incremental OCKNFST, the relative computational advantage of OCKSR over the incremental OCKNFST in the initial training stage, when the kernel matrix is precomputed, grows almost linearly with the number of training samples. This is because the method in [41] relies on eigendecomposition, whereas such computations are completely avoided in OCKSR via a regression approach.
The computational complexity of the method in [41] in the updating phase of the training stage depends on the number of retained eigenbases, which is upper-bounded by the number of training samples. As a result, in common scenarios, once the number of eigenbases for the incremental OCKNFST method exceeds a modest fraction of the number of training samples, the proposed OCKSR method is more efficient. In summary, in the initial training phase, the proposed OCKSR approach is computationally more efficient than the incremental OCKNFST method of [41]; in the updating phase, under mild conditions, it is more efficient as well.
V Conclusion
A new efficient nonlinear one-class classifier built upon the Fisher criterion was presented, together with a graph embedding view of the problem. The proposed OCKSR approach works by mapping the data onto a one-dimensional feature subspace in which the scatter of the target distribution is zero while outliers are mapped onto distant locations. It was shown that all target observations are projected onto a single point which can be determined, up to a multiplicative constant, independently of the data. Unlike previous similar approaches, the proposed method casts the problem under consideration into a regression framework, optimising the criterion function via the efficient spectral regression method and thus avoiding costly eigendecomposition computations. It was illustrated that the dominant complexity of the proposed method in the training phase is that of computing the kernel matrix. The proposed OCKSR approach offers a number of appealing characteristics, such as the ability to be trained in an incremental fashion and being unsupervised. Moreover, as a by-product of the regression-based formulation, an observation ranking scheme was proposed which can be used to specify the expected fraction of outliers in the training set in advance. Extensive experiments conducted on several data sets with varied feature dimensionalities verified the merits of the proposed approach in comparison with other alternatives.
Acknowledgment
References
 [1] P. Nader, P. Honeine, and P. Beauseroy, “lp-norms in one-class classification for intrusion detection in SCADA systems,” IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 2308–2317, Nov 2014.
 [2] A. Beghi, L. Cecchinato, C. Corazzol, M. Rampazzo, F. Simmini, and G. Susto, “A oneclass svm based tool for machine learning novelty detection in hvac chiller systems,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 1953 – 1958, 2014, 19th IFAC World Congress. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1474667016418999
 [3] S. Budalakoti, A. N. Srivastava, and M. E. Otey, “Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 39, no. 1, pp. 101–113, Jan 2009.
 [4] S. Kamaruddin and V. Ravi, “Credit card fraud detection using big data analytics: Use of psoaann based oneclass classification,” in Proceedings of the International Conference on Informatics and Analytics, ser. ICIA16. New York, NY, USA: ACM, 2016, pp. 33:1–33:8. [Online]. Available: http://doi.acm.org/10.1145/2980258.2980319
 [5] G. G. Sundarkumar, V. Ravi, and V. Siddeshwar, “Oneclass support vector machine based undersampling: Application to churn prediction and insurance fraud detection,” in 2015 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), Dec 2015, pp. 1–7.
 [6] M. Yu, Y. Yu, A. Rhuma, S. M. R. Naqvi, L. Wang, and J. A. Chambers, “An online one class support vector machinebased personspecific fall detection system for monitoring an elderly individual in a room environment,” IEEE Journal of Biomedical and Health Informatics, vol. 17, no. 6, pp. 1002–1014, Nov 2013.
 [7] A. Rabaoui, M. Davy, S. Rossignol, and N. Ellouze, “Using oneclass svms and wavelets for audio surveillance,” IEEE Transactions on Information Forensics and Security, vol. 3, no. 4, pp. 763–775, Dec 2008.
 [8] T. Minter, “Singleclass classification,” in Symposium on Machine Processing of Remotely Sensed Data. Indiana, USA: IEEE, 1975, pp. 2A12–2A15.
 [9] M. M. Moya, M. W. Koch, and L. D. Hostetler, “Oneclass classifier networks for target recognition applications,” NASA STI/Recon Technical Report N, vol. 93, 1993.
 [10] G. Ritter and M. T. Gallegos, “Outliers in statistical pattern recognition and an application to automatic chromosome classification,” Pattern Recognition Letters, vol. 18, no. 6, pp. 525 – 539, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865597000494
 [11] C. M. Bishop, “Novelty detection and neural network validation,” IEE Proceedings  Vision, Image and Signal Processing, vol. 141, no. 4, pp. 217–222, Aug 1994.
 [12] N. Japkowicz, “Concept learning in the absence of counter-examples: An autoassociation-based approach to classification,” Ph.D. dissertation, Rutgers University, New Brunswick, NJ, USA, 1999, AAI9947599.
 [13] D. Tax, “One-class classification: Concept-learning in the absence of counter-examples,” Ph.D. dissertation, Delft University of Technology, 2001, ASCI Dissertation Series 65.
 [14] S. S. Khan and M. G. Madden, “One-class classification: taxonomy of study and review of techniques,” The Knowledge Engineering Review, vol. 29, no. 3, pp. 345–374, 2014.
 [15] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, vol. 99, pp. 215 – 249, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S016516841300515X
 [16] J. Kittler, W. Christmas, T. de Campos, D. Windridge, F. Yan, J. Illingworth, and M. Osman, “Domain anomaly detection in machine perception: A system architecture and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 845–859, May 2014.
 [17] D. M. J. Tax and R. P. W. Duin, “Combining oneclass classifiers,” in Proceedings of the Second International Workshop on Multiple Classifier Systems, ser. MCS ’01. London, UK, UK: SpringerVerlag, 2001, pp. 299–308. [Online]. Available: http://dl.acm.org/citation.cfm?id=648055.744087
 [18] L. Friedland, A. Gentzel, and D. Jensen, ClassifierAdjusted Density Estimation for Anomaly Detection and OneClass Classification, pp. 578–586. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611973440.67
 [19] H. Hoffmann, “Kernel pca for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863 – 874, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320306003414
 [20] M. Sabokrou, M. Fathy, and M. Hoseini, “Video anomaly detection and localisation based on the sparsity and reconstruction error of autoencoder,” Electronics Letters, vol. 52, no. 13, pp. 1122–1124, 2016.
 [21] S. R. Arashloo, J. Kittler, and W. Christmas, “An anomaly detection approach to face spoofing detection: A new formulation and evaluation protocol,” IEEE Access, vol. 5, pp. 13 868–13 882, 2017.
 [22] B. Song, P. Li, J. Li, and A. Plaza, “Oneclass classification of remote sensing images using kernel sparse representation,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 4, pp. 1613–1623, April 2016.
 [23] D. M. Tax and R. P. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, Jan 2004. [Online]. Available: https://doi.org/10.1023/B:MACH.0000008084.60811.49
 [24] B. Schölkopf, J. C. Platt, J. C. ShaweTaylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a highdimensional distribution,” Neural Comput., vol. 13, no. 7, pp. 1443–1471, Jul. 2001. [Online]. Available: https://doi.org/10.1162/089976601750264965
 [25] E. Pekalska, D. M. J. Tax, and R. P. W. Duin, “One-class LP classifiers for dissimilarity representations,” in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. MIT Press, 2002, pp. 761–768.
 [26] P. Casale, O. Pujol, and P. Radeva, “Approximate convex hulls family for oneclass classification,” in Multiple Classifier Systems, C. Sansone, J. Kittler, and F. Roli, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 106–115.
 [27] D. Fernández-Francos, O. Fontenla-Romero, and A. Alonso-Betanzos, “One-class convex hull-based algorithm for classification in distributed environments,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–11, 2017.
 [28] A. Ypma and R. P. W. Duin, “Support objects for domain approximation,” in ICANN 98, L. Niklasson, M. Bodén, and T. Ziemke, Eds. London: Springer London, 1998, pp. 719–724.
 [29] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, “Adversarially learned oneclass classifier for novelty detection,” CoRR, vol. abs/1802.09088, 2018. [Online]. Available: http://arxiv.org/abs/1802.09088
 [30] P. Perera and V. M. Patel, “Learning deep features for oneclass classification,” CoRR, vol. abs/1801.05365, 2018. [Online]. Available: http://arxiv.org/abs/1801.05365
 [31] V. Roth, “Outlier detection with oneclass kernel fisher discriminants,” in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. MIT Press, 2005, pp. 1169–1176. [Online]. Available: http://papers.nips.cc/paper/2656outlierdetectionwithoneclasskernelfisherdiscriminants.pdf
 [32] ——, “Kernel fisher discriminants for outlier detection,” Neural Comput., vol. 18, no. 4, pp. 942–960, Apr. 2006. [Online]. Available: http://dx.doi.org/10.1162/089976606775774679
 [33] P. Bodesheim, A. Freytag, E. Rodner, M. Kemmler, and J. Denzler, “Kernel null space methods for novelty detection,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp. 3374–3381.
 [34] F. Dufrenois, “A oneclass kernel fisher criterion for outlier detection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 982–994, May 2015.
 [35] F. Dufrenois and J. C. Noyer, “Formulating robust linear regression estimation as a oneclass lda criterion: Discriminative hat matrix,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 2, pp. 262–273, Feb 2013.
 [36] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers, “Fisher discriminant analysis with kernels,” Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48, Aug. 1999.
 [37] G. Baudat and F. Anouar, “Generalized discriminant analysis using a kernel approach.” Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
 [38] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
 [39] C. Cortes and V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.
 [40] F. Dufrenois and J. C. Noyer, “A null space based one class kernel fisher discriminant,” in 2016 International Joint Conference on Neural Networks (IJCNN), July 2016, pp. 3203–3210.
 [41] J. Liu, Z. Lian, Y. Wang, and J. Xiao, “Incremental kernel null space discriminant analysis for novelty detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 4123–4131.
 [42] D. Cai, X. He, and J. Han, “Speed up kernel discriminant analysis,” The VLDB Journal, vol. 20, no. 1, pp. 21–33, Feb. 2011.
 [43] S. Günter, N. N. Schraudolph, and S. V. N. Vishwanathan, “Fast iterative kernel principal component analysis,” vol. 8, pp. 1893–1918, 2007.
 [44] N. Kwak, “Principal component analysis based on l1norm maximization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1672–1680, Sept 2008.
 [45] V. Roth, “Kernel fisher discriminants for outlier detection,” Neural Computation, vol. 18, no. 4, pp. 942–960, April 2006.
 [46] P. Bodesheim, A. Freytag, E. Rodner, and J. Denzler, “Local novelty detection in multiclass recognition problems,” in 2015 IEEE Winter Conference on Applications of Computer Vision, Jan 2015, pp. 813–820.
 [47] F. Dufrenois and J. Noyer, “One class proximal support vector machines,” Pattern Recognition, vol. 52, pp. 96 – 112, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320315003672
 [48] S. R. Arashloo and J. Kittler, “Classspecific kernel fusion of multiple descriptors for face verification using multiscale binarised statistical image features,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2100–2109, Dec 2014.
 [49] D. Cai, “Spectral regression: A regression framework for efficient regularized subspace learning,” Ph.D. dissertation, Department of Computer Science, University of Illinois at UrbanaChampaign, May 2009.
 [50] H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey of graph embedding: Problems, techniques and applications,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2018.
 [51] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, Dec 2017.
 [52] G.W. Stewart, Matrix algorithms – Volume I: Basic decompositions. SIAM, 2001.
 [53] UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets.html
 [54] S. Tirunagari, S. Kouchaki, D. Abasolo, and N. Poh, “One dimensional local binary patterns of electroencephalogram signals for detecting alzheimer’s disease,” in 2017 22nd International Conference on Digital Signal Processing (DSP), Aug 2017, pp. 1–5.
 [55] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1–9.
 [56] A. CostaPazo, S. Bhattacharjee, E. VazquezFernandez, and S. Marcel, “The replaymobile face presentationattack database,” in Proceedings of the International Conference on Biometrics Special Interests Group (BioSIG), Sep. 2016.
 [57] Delft University Archive of OneClass Data Sets. [Online]. Available: http://homepage.tudelft.nl/n9d04/occ/index.html
 [58] M. Kemmler, E. Rodner, E.S. Wacker, and J. Denzler, “Oneclass classification with gaussian processes,” Pattern Recognition, vol. 46, no. 12, pp. 3507 – 3518, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320313002574
 [59] M. Breunig, H.P. Kriegel, R. T. Ng, and J. Sander, “Lof: Identifying densitybased local outliers,” in PROCEEDINGS OF THE 2000 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA. ACM, 2000, pp. 93–104.