One-Class Kernel Spectral Regression for Outlier Detection
The paper introduces a new efficient nonlinear one-class classifier formulated as a Rayleigh quotient criterion. The method, operating in a reproducing kernel Hilbert subspace, minimises the scatter of the target distribution along an optimal projection direction while keeping the projections of target observations as distant as possible from the origin, which serves as an artificial outlier with respect to the data. We provide a graph embedding view of the problem which can then be solved efficiently using the spectral regression approach. Unlike previous similar methods, which often require costly eigen-computations of dense matrices, the proposed approach casts the problem into a regression framework which avoids eigen-decomposition computations. In particular, it is shown that the dominant complexity of the proposed method is that of computing the kernel matrix. Additional appealing characteristics of the proposed one-class classifier are: 1-the ability to be trained in an incremental fashion (allowing for application in streaming data scenarios while also reducing computational complexity in the non-streaming operation mode); 2-being unsupervised while also allowing the user to specify the expected fraction of outliers in the training set in advance; and 3-the deployment of the kernel trick, which allows for a large class of functions by nonlinearly mapping the data into a high-dimensional feature space. Extensive experiments conducted on several datasets verify the merits of the proposed approach in comparison with other alternatives.
One-class classification (OCC) deals with the problem of identifying objects, events or observations which conform to a specific behaviour or condition, identified as the target/positive class (), and distinguishing them from all other objects, typically known as outliers or anomalies. More specifically, consider a set of points where is a realisation of a multivariate random variable drawn from a target probability distribution with probability density function . In a one-class classification problem, one would like to characterise the support domain of via a one-class classifier as
where the function models the similarity of an observation to the target data and denotes the Iverson brackets. Parameter is optimised so that an expected fraction of observations lie within the support domain of the target distribution. One-class learning serves as the core of a wide variety of applications such as intrusion detection, novelty detection, fault detection in safety-critical systems, fraud detection, insurance, health care, surveillance, etc. Historically, the first single-class classification problem seems to date back to the work in  in the context of learning a Bayes classifier. Much later, the term one-class classification was used in . As a result of the different applications of one-class classification, other terminology, including anomaly/outlier detection, novelty detection, concept learning, etc., has also been used in the literature.
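As a minimal sketch of the thresholded decision rule described at the start of this passage, the snippet below assumes that similarity scores for the training observations are already available and picks the threshold as an empirical quantile, so that a requested fraction of training scores falls inside the support. The names `fit_threshold` and `classify`, and the convention that larger scores mean "more target-like", are illustrative assumptions and not part of the method's specification.

```python
def fit_threshold(train_scores, target_fraction):
    """Pick tau so that roughly `target_fraction` of the training scores satisfy score >= tau."""
    ranked = sorted(train_scores, reverse=True)
    # Number of training observations that should fall inside the support (at least one).
    keep = max(1, min(len(ranked), round(target_fraction * len(ranked))))
    return ranked[keep - 1]


def classify(score, tau):
    """Iverson-bracket style decision: 1 for a target observation, 0 for an outlier."""
    return int(score >= tau)


# Hypothetical similarity scores of five training observations.
train_scores = [0.9, 0.8, 0.75, 0.6, 0.2]
tau = fit_threshold(train_scores, target_fraction=0.8)
labels = [classify(s, tau) for s in train_scores]
print(tau, labels)
```

With these made-up scores, four of the five training observations are kept inside the support and the lowest-scoring one is flagged.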
OCC techniques are commonly employed when the non-target/negative class is either not well defined, poorly sampled or totally missing, which may be due to the openness of the problem or due to the high cost associated with obtaining negative samples. In these situations, the conventional two-class classifiers are believed not to operate effectively as they are based on the predominant assumption that data from all classes are more or less equally balanced. OCC techniques are developed to address this shortcoming of the conventional approaches by primarily training on the data coming from a single class. Nevertheless, lack of sufficient negative samples may pose serious challenges in learning one-class classifiers as only one side of the decision boundary can be estimated using positive observations. As a result, the one-class problem is typically believed to be more difficult than the two-class counterpart. As observed in , the challenges related to the standard two/multi-class problems, e.g. estimation of the error, atypical training data, the complexity of a solution, the generalisation capability, etc. are also present in OCC and may sometimes become even more severe.
While there exist different categorisations of one-class techniques [14, 13, 15], a general overarching categorisation considers them to be either generative or non-generative. The generative approaches incorporate a model for generating all observations whereas non-generative methods lack a transparent link to the data. In this context, the non-generative methods are best represented by discriminative approaches which partition the feature space in order to classify an object. As notable representatives of the generative approaches one may consider the parametric and nonparametric density estimation methods [17, 18, 19] (using for example a Gaussian, a mixture of Gaussians or a Poisson distribution), neural-network based methods [12, 20], one-class sparse representation classification [21, 22], etc. Well-known examples of the non-generative methods include those based on support vector machines (SVDD/one-class SVM) [23, 24], linear programming, convex hull methods [26, 27], cluster approaches, deep-learning based methods [29, 30] and subspace approaches [31, 32, 33, 34, 35]. By virtue of the emphasis on classification, rather than modelling the generative process, the non-generative approaches tend to yield better performance in classification.
In practical applications where the data to be characterised is highly nonlinear and complex, linear approaches often fail to provide satisfactory performance. In such cases, an effective mechanism is to implicitly map the data into a very high dimensional space with the hope that in this new space the data become more easily separable, the prominent examples of which are offered by kernel machines [36, 37, 38, 39]. Nevertheless, the high computational cost associated with these methods can be considered as a bottleneck in their usage. For instance, the one-class variants of kernel discriminant analysis [33, 40, 34, 41] often require computationally intensive eigen-decompositions of dense matrices.
In this work, a new nonlinear one-class classifier formulated as optimisation of a Rayleigh quotient is presented which unlike previous discriminative methods [31, 32, 33, 34, 35, 41] avoids costly eigen-analysis computations via the spectral regression (SR) technique which has been shown to speed up the kernel discriminant analysis by several orders of magnitude . By virtue of bypassing eigen-decomposition of large matrices via a regularised regression formulation, the proposed One-Class Kernel Spectral-Regression (OC-KSR) approach is computationally very attractive, where it will be shown that the dominant complexity of the algorithm is the computation of the kernel matrix. An additional appealing characteristic of the method is the operability in an incremental fashion which allows injection of additional training data into the system in a streaming data scenario, side-stepping the need to reinitialise the training procedure while also reducing computational complexity in a non-streaming operation mode. Additionally, the method can be operated in an unsupervised mode as well as by specifying the expected fraction of outliers in the training set in advance.
I-A Overview of the Proposed Approach
In the proposed one-class method, the strategy is to map the data into the feature space corresponding to a kernel and infer a direction in the feature space such that: 1-The scatter of the data along that direction is minimised; 2-The projected samples and the origin along the projection direction are maximally distant. The problem is then posed as one of graph embedding which is optimised efficiently using the spectral regression technique , thus avoiding costly eigen-analysis computations. In addition, an incremental version of the proposed method is also presented which reduces the computational complexity of the training phase even further. As a by-product of the regression-based formulation, a consistency measure for training samples with respect to the inferred model is obtained which provides the capability to determine the expected fraction of outliers in advance. During the test phase, the decision criterion for the proposed approach involves projecting a test sample onto an optimal line in the feature space followed by computing the distance between the projection of the test sample and that of the mean of training samples.
The main contributions of the present work may be summarised as follows:
A one-class nonlinear classifier (OC-KSR) posed as a graph embedding problem;
Efficient optimisation of the proposed formulation based on spectral regression;
An incremental variant of the OC-KSR approach;
An observation ranking scheme making the method relatively more resilient to contaminations in the training set;
And, evaluation and comparison of the proposed method to the state-of-the-art one-class techniques on several datasets.
I-B Outline of the Paper
The rest of the paper is organised as follows: In Section II, we briefly review the one-class methods which are closely related to the proposed method. In doing so, the focus is on nonlinear methods posing the one-class classification problem as optimisation of a (generalised) Rayleigh quotient. In Section III, the proposed one-class method (OC-KSR) is presented. An experimental evaluation of the proposed approach along with a comparison to other methods on several datasets is provided in Section IV. Finally, the paper is brought to a conclusion in Section V.
II Related Work
The work in  employs kernel PCA for novelty detection where a principal component in a feature space captures the distribution of the data while the reconstruction residual of a test sample with respect to the inferred subspace is employed as a novelty measure. Other work in  describes a strategy to improve the convergence of the kernel algorithm for the iterative kernel PCA. A different study  proposed a robustified PCA to deal with outliers in the training set.
In [31, 45], a one-class kernel Fisher discriminant classifier is proposed which is related to Gaussian density estimation in the induced feature space. The proposed method is based on the idea of separating the data from their negatively replicated counterparts and involved an eigenvalue decomposition of the kernel matrix. In this approach, the data are first mapped into some feature space, where a Gaussian model is fitted. Mahalanobis distances to the mean of this Gaussian are used as test statistics to decide for normality. As also pointed out in , for kernel maps which transform the input data into a higher-dimensional space, a modelling problem induced by a deviation from the Gaussianity assumption in the feature space might occur. If the deviation is large, the method in [31, 45] may lead to unreliable results.
Other work in  proposed a Fisher-based null space method where a zero within-class scatter and a positive between-class scatter is used to map all training samples of one class into a single point. The proposed method was able to treat multiple known classes jointly and to detect novelties for a set of classes with a single model by using a projection in a joint subspace where training samples of all known classes are presumed to have zero intra-class variance. Deciding for novelty involved computing a distance in the estimated subspace while the method involved eigen-decomposition of the kernel matrix. In a follow-up work , it is proposed to incorporate locality in the null space approach of  by considering only the most similar patterns to the query sample, leading to improvements in performance. In , an incremental version of the method in  is proposed to increase computational efficiency.
In [34, 47], a generalised Rayleigh quotient specifically designed for outlier detection has been proposed. The method tries to find an optimal hyperplane which is closest to the target data and farthest from the outliers, which requires building two scatter matrices: an outlier scatter matrix corresponding to the outliers and a target scatter matrix for the target data. While in , for the computation of the decision boundary a computationally intensive generalised eigenvalue problem is solved, which limited the utilisation of the method to medium-sized data sets, in  the generalised eigenvalue problem is replaced by an approximate conjugate gradient solution to decrease the computational cost. The method presented in [34, 47] has certain shortcomings, as the computation of the outlier scatter matrix requires the presence of atypical instances, which are sometimes difficult to collect in real applications. Another drawback is that the method is based on the assumption that the target population differs from the outlier population regarding their respective density, which might not hold for real-world problems in general. A later study  tries to address these shortcomings via a null-space version of the method in [34, 47]. In order to overcome the limitation of the availability of outlier samples, it is proposed to separate the target class from the origin of the kernel feature space, which serves as an artificial outlier sample. The density constraint is then relaxed by deriving a joint subspace where the training target data population has zero covariance. The method involves eigen-computations of dense matrices.
While the majority of previous work on one-class classification using a Rayleigh quotient formulation requires computationally intensive eigen-decomposition of large matrices, in this work a one-class approach is proposed which replaces costly eigen-analysis computations by the spectral-regression technique . In this sense, the present work can be considered as a one-class variant of the multi-class approach in  and the two-class, class-specific method of  with additional contributions discussed in the subsequent sections.
III One-Class Kernel Spectral Regression
|The target observation class|
|Total number of training samples|
|The number of contaminations (outliers) in the training set|
|The observation in the training set|
|The dimensionality of observations in the input space|
|The feature (reproducing kernel Hilbert) space|
|The nonlinear mapping into the feature space|
|Scatter of data along projection direction|
|The mean of projected samples|
|The projection function onto a feature subspace|
|The set of real numbers|
|The set of real vectors in the -dimensional space|
|Graph adjacency matrix|
|The identity matrix|
|A matrix of 1’s|
|Graph Laplacian matrix|
|Graph degree matrix|
|Sum of squared distances of target observations to the origin|
|The transformation vector|
|The vector of responses (projections)|
|The kernel matrix|
|The Cholesky decomposition of|
|The kernel function|
|The threshold for deciding normality|
|The regularisation parameter|
|The consistency vector of target observations|
|The norm operator|
Let us assume that there exist samples and is a feature space (also known as an RKHS: reproducing kernel Hilbert space) induced by a nonlinear mapping . For a properly chosen mapping, an inner product on may be represented as , where is a positive semi-definite kernel function. Our strategy for outlier detection is to map the data into a feature space induced by the nonlinear mapping and then look for an optimal projection direction (subspace) in the RKHS based on two criteria: 1-minimising the scatter of mapped target data in the RKHS along the projection direction; and 2-maximising their distances from a hypothesised non-target instance in this subspace. In doing so, the problem is formulated as one of graph embedding which is then posed as optimising a Rayleigh quotient, efficiently solved using a spectral regression framework. The two criteria used in this work are discussed next.
III-A Scatter in the feature subspace
Let us assume a projection function which maps each target data point onto a feature subspace. For the reasons to be clarified later, is assumed to be a one-dimensional mapping. The scatter of target data in the feature space along the direction specified by is defined as
where denotes the mean of all projections ’s, i.e.
Note that as we are working in the feature space, captures both a mapping from the original space onto the feature space as well as a projection onto a line in the RKHS. In order to detect outliers, it is desirable to find a projection function which minimises the dispersion of positive samples and forms a compact cluster, i.e. minimises . can be written in terms of real numbers ’s and a positive semi-definite kernel function defining an kernel matrix (where ) according to the following proposition:
cf.  for a proof.
Assuming that the kernel function is chosen and fixed, the problem of minimising with respect to boils down to finding :
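To make the kernel expansion of the projections concrete, the following sketch assumes an RBF kernel with a hypothetical width parameter `gamma` and computes the projections as the product of the kernel matrix with an arbitrary coefficient vector, together with the scatter of those projections. The helper names and the 1/n normalisation of the scatter are illustrative assumptions.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is a hypothetical width parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(samples, gamma=0.5):
    """n x n matrix with entries kappa(x_i, x_j)."""
    return [[rbf_kernel(xi, xj, gamma) for xj in samples] for xi in samples]


def project(K, alpha):
    """Projections y_i = sum_j alpha_j * kappa(x_i, x_j), i.e. y = K alpha."""
    return [sum(k * a for k, a in zip(row, alpha)) for row in K]


def scatter(y):
    """Mean squared deviation of the projections from their mean (1/n factor assumed)."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)


samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
K = kernel_matrix(samples)
alpha = [0.4, 0.3, 0.3, 0.2]  # arbitrary coefficients, for illustration only
y = project(K, alpha)
print(scatter(y))
```

Once the kernel is fixed, the scatter depends on the data only through the kernel matrix and the coefficient vector, which is why the minimisation reduces to a search over the coefficients.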
III-A1 Graph Embedding View
Let us now augment the data set (’s) with an additional point satisfying . Let us also define the matrix as
The scatter in Eq. 2 can now be written as
where denotes the element of in the row and column. The latter formulation corresponds to a graph embedding view of the problem where the data points are represented as vertices of a graph and is the graph adjacency matrix, encoding the structure of the graph. That is, the two vertices and of the graph are connected by an edge if . With this perspective and given by Eq. 6, each data point is connected by an edge to , resulting in a star graph structure, Fig. 2. The purpose of graph embedding is to map each node of the graph onto a subspace in a way that the similarity between each pair of nodes is preserved. In view of Eq. 7, the objective function encodes a higher penalty if two connected vertices are mapped to distant locations via . Consequently, by minimising , if two nodes are neighbours in the graph (i.e. connected by an edge), then their projections in the new subspace are encouraged to be located in nearby positions.
Defining the diagonal matrix such that would yield
Assuming , Eq. 7 can now be written in matrix form as
Defining matrix as , Eq. 9 becomes
In the graph embedding literature, is called the degree matrix, the diagonal elements of which count the number of edges terminating at each vertex, while is the graph Laplacian [50, 51]. Since our data points are connected to an auxiliary point in the star graph representation, minimising the scatter given by Eq. 10 with respect to projections of target observations (i.e. with respect to for ) forces the mapped data to be located in nearby positions to . As is the mean of data in the subspace, by minimising all target data are encouraged to be as close as possible to their mean on a line defined by in the feature space. The optimum of the objective function would be reached if all target data are exactly mapped onto a single point, i.e. .
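The star-graph construction can be checked numerically on a toy example. The sketch below assumes the augmented point is the mean of the projections and is stored as the last vertex; it builds the adjacency, degree and Laplacian matrices and compares the Laplacian quadratic form with the directly computed sum of squared deviations from the mean. Function names are illustrative, and constant factors such as 1/n are left out.

```python
def star_graph_matrices(n):
    """Adjacency W, degree D and Laplacian L = D - W of a star graph whose hub is vertex n."""
    size = n + 1
    W = [[0.0] * size for _ in range(size)]
    for i in range(n):
        W[i][n] = 1.0  # every data vertex is linked to the hub ...
        W[n][i] = 1.0  # ... and the graph is undirected
    D = [[0.0] * size for _ in range(size)]
    for i in range(size):
        D[i][i] = sum(W[i])  # number of edges meeting vertex i
    L = [[D[i][j] - W[i][j] for j in range(size)] for i in range(size)]
    return W, D, L


def quadratic_form(M, v):
    """Compute v^T M v for a dense matrix given as nested lists."""
    return sum(v[i] * M[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))


y = [0.9, 1.2, 1.0, 0.7]   # hypothetical projections of four target samples
mu = sum(y) / len(y)
y_aug = y + [mu]           # augmented vector: projections followed by their mean
W, D, L = star_graph_matrices(len(y))

laplacian_value = quadratic_form(L, y_aug)
direct_value = sum((v - mu) ** 2 for v in y)
print(laplacian_value, direct_value)
```

Both quantities agree up to rounding, which is the sense in which minimising the Laplacian form pulls every projection towards the hub.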
III-B Distance to the origin
The idea of using the origin as an exemplar outlier has been previously used in designing one-class classifiers such as OC-SVM  and others [40, 33, 41]. In essence, such a strategy corresponds to the assumption that novel samples lie around the origin while target objects are farther away. In , it is shown that using a Gaussian kernel function, the data are always separable from the origin. In this work, a similar assumption is made and target data points are mapped onto locations in a feature subspace such that they are as far as possible from the origin.
In order to encourage the mapped data points to lie as far as possible from the origin in the subspace, we make use of the sum of squared (Euclidean) distances between the projected data points and the origin. As the projection of the origin in the feature space onto any single subspace (including the one specified by ) would be zero, the sum of squared distances of projected data points to the projection of the origin on a subspace defined by can be written as
and using a vector notation, one obtains
where is obtained by dropping the last element of which corresponds to our augmented point. As per definition of , its maximisation corresponds to maximising the average margin between the projected target data points and the exemplar outlier.
We now combine the two criteria corresponding to minimising the scatter while maximising the average margin and optimise it with respect to the projections of all target data, i.e. , as
Note that the numerator of the quotient is defined in terms of whereas the optimisation is performed with respect to . Thus, the numerator needs to be expressed in terms of . Regarding we have
where denotes an matrix of 1’s.
Due to the special structure of given in Eq. 8, for , one obtains
As a result, Eq. 13 can be purely written in terms of as
The relation above is known as the Rayleigh quotient. It is well known that the optimum of the Rayleigh quotient is attained at the eigenvector corresponding to the largest eigenvalue of the matrix in the numerator. That is, , where in this case is the eigenvector associated with the largest eigenvalue of . It can easily be shown that matrix has a single eigenvector corresponding to the non-zero eigenvalue of , where . Note that the Rayleigh quotient is invariant under scaling . In other words, if maximises the objective function in Eq. 16, then any non-zero scalar multiple also maximises Eq. 16. As a result, one may simply choose as which would lead to .
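The reasoning behind the trivial leading eigenvector can be sketched as follows. The symbols are assumptions based on the notation table: $\mathbf{y}$ is the vector of the $n$ projections, $\mu$ their mean, $\mathbf{1}_{n\times n}$ the matrix of ones, $\Sigma$ the scatter and $\Theta$ the sum of squared distances to the origin. Constant factors such as $1/n$ are dropped because they do not change the optimiser.

```latex
\begin{align*}
\sum_{i=1}^{n}(y_i-\mu)^2
  &= \mathbf{y}^\top\mathbf{y}
     - \tfrac{1}{n}\,\mathbf{y}^\top\mathbf{1}_{n\times n}\,\mathbf{y},
  \qquad \mu=\tfrac{1}{n}\sum_{i=1}^{n}y_i, \\
\arg\min_{\mathbf{y}}\frac{\Sigma}{\Theta}
  &= \arg\min_{\mathbf{y}}
     \left(1-\frac{1}{n}\,
     \frac{\mathbf{y}^\top\mathbf{1}_{n\times n}\,\mathbf{y}}{\mathbf{y}^\top\mathbf{y}}\right)
   = \arg\max_{\mathbf{y}}
     \frac{\mathbf{y}^\top\mathbf{1}_{n\times n}\,\mathbf{y}}{\mathbf{y}^\top\mathbf{y}}, \\
\mathbf{y}^\top\mathbf{1}_{n\times n}\,\mathbf{y}
  &= \Big(\sum_{i=1}^{n}y_i\Big)^{2}
   \le n\sum_{i=1}^{n}y_i^{2}
   = n\,\mathbf{y}^\top\mathbf{y}
  \quad\text{(Cauchy--Schwarz)}.
\end{align*}
```

Equality in the last line holds exactly when all $y_i$ are equal, so any non-zero multiple of $[1,\dots,1]^\top$ attains the maximal value $n$ of the quotient. Choosing $\mathbf{y}=[1,\dots,1]^\top$ gives $\mu=1$ and a scatter of zero.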
III-D Relation to the Fisher-based null-space methods
We now establish the relationship of our formulation in Eq. 16 to the null-space Fisher discriminant analysis. For this purpose, first it is shown that the criterion function in Eq. 16 is in fact the Fisher ratio and then its relation to the null-space approaches is analysed.
The Fisher analysis maximises the ratio of between-class scatter to the within-class scatter . As the negative class is represented by only a single sample in our approach, it would have a zero scatter and thus the within-class scatter in this case would be , and hence
The between-class scatter when the origin is considered as mean of the negative class along the direction specified by is
The Fisher analysis maximises the ratio or equivalently minimises the ratio and thus
which clearly shows that when the negative class is represented only by the origin, our criterion function in Eq. 16 is in fact the Fisher criterion.
Next, it is shown that the proposed approach is in fact a null-space Fisher analysis. The null projection function [41, 33] is defined as a function leading to zero within-class scatter while providing positive between-class scatter. Thus, one needs to show that leads to and . As all the elements of are equal, it is clear that the proposed formulation corresponds to a zero scatter for the target class. The claim can also be verified by substituting in the relation for the within-class scatter as
III-E Spectral Regression
Once is determined, the relation may be used to determine . This approach is called spectral regression in . Denoting the matrix in the numerator of Eq. 16 in general as , the spectral regression involves two steps to solve for :
Solve for ;
Solve for .
The method is dubbed spectral regression as it involves spectral analysis of followed by solving which is equivalent to a regularised regression problem . However, in our formulation, due to the special structure of , finding the leading eigenvector was trivial.
Solving for can be performed using the Cholesky factorisation and forward-back substitution. In this case, if is positive-definite, then there exists a unique solution for . If is singular, it is approximated by the positive definite matrix where is the identity matrix and is a regularisation parameter. As a widely used kernel function, the radial basis kernel function, i.e. , leads to a positive definite kernel matrix [42, 38] for which and the spectral regression finds the exact solution. Considering a Cholesky factorisation of as , may be found by first solving for and then solving for . Since in the proposed approach there is only one eigenvector associated with the equation , only a single vector, i.e. , is computed.
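The regression step can be sketched as below, assuming an RBF kernel with a hypothetical width `gamma`, a response vector of ones, and a lower-triangular Cholesky factor (the transpose of an upper-triangular one). The coefficients are obtained by one forward and one back substitution. The function names and the toy data are illustrative, not the authors' implementation.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel; gamma is a hypothetical width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def cholesky(A):
    """Lower-triangular L with A = L L^T (A must be symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - partial)
            else:
                L[i][j] = (A[i][j] - partial) / L[j][j]
    return L


def forward_substitution(L, b):
    """Solve L t = b for lower-triangular L."""
    t = [0.0] * len(b)
    for i in range(len(b)):
        t[i] = (b[i] - sum(L[i][k] * t[k] for k in range(i))) / L[i][i]
    return t


def back_substitution(L, t):
    """Solve L^T a = t, reading the upper-triangular factor from L."""
    n = len(t)
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (t[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a


def solve_alpha(K, y, delta=0.0):
    """Solve (K + delta I) alpha = y via a Cholesky factorisation."""
    n = len(K)
    regularised = [[K[i][j] + (delta if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
    L = cholesky(regularised)
    return back_substitution(L, forward_substitution(L, y))


samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
K = [[rbf_kernel(a, b) for b in samples] for a in samples]
y = [1.0] * len(samples)   # all projections set to one
alpha = solve_alpha(K, y)  # delta = 0: exact solution for a positive-definite K
residual = max(abs(sum(k * a for k, a in zip(row, alpha)) - 1.0) for row in K)
print(alpha, residual)
```

For distinct samples the RBF kernel matrix is positive definite, so no regularisation is needed in this toy case and the residual is at rounding level.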
III-F Target Observation Ranking
Up to this point, it has been assumed that the target data set is not contaminated by any outliers, so that a model is built utilising all the available observations. However, in practical settings, this assumption might not hold, which would lead to a model drift. In this section, a mechanism is presented which can handle such situations given some feedback from the user regarding the expected fraction of outliers in the data set. Based on the relation , for the observation , one obtains
where is the row of the kernel matrix. While is the estimated value for the training observation using regression, stands for the true expected value for the . Clearly, as the two quantities become equal. The squared error for the observation is defined as
As is fixed for all training samples, can be used as a measure to gauge goodness-of-fit for observation . That is, the higher the value , the less is consistent with the model. As a result, the consistency measure of the training observation with the model in the proposed OC-KSR method is defined as
If contaminations in the training data are in a strict minority, it is expected that the model would fit better to the true positive samples than the contaminations. In other words, the contaminations in the training set would have a larger deviation from the model. In this respect, if observations in the training set are specified by the user as contaminations, once is determined, the samples are sorted according to their consistency measures. One can then discard the least consistent observations by removing the corresponding row and column of the kernel matrix to obtain a reduced kernel matrix. Once a new kernel matrix is obtained, the rest of the training procedure is as before.
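A minimal sketch of this ranking step is given below. It assumes an RBF kernel, a strictly positive regularisation parameter `delta` (with no regularisation and a positive-definite kernel matrix all errors vanish), and it ranks samples directly by their squared regression error rather than by any particular normalised consistency score. The function names, data and parameter values are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel; gamma is a hypothetical width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def cholesky_solve(A, b):
    """Solve A x = b for symmetric positive-definite A via A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (math.sqrt(A[i][i] - partial) if i == j
                       else (A[i][j] - partial) / L[j][j])
    t = [0.0] * n
    for i in range(n):                # forward substitution: L t = b
        t[i] = (b[i] - sum(L[i][k] * t[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):      # back substitution: L^T x = t
        x[i] = (t[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def squared_errors(K, delta):
    """Per-sample squared error between the regressed value K alpha and the target value 1."""
    n = len(K)
    ridge = [[K[i][j] + (delta if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = cholesky_solve(ridge, [1.0] * n)
    fitted = [sum(k * a for k, a in zip(row, alpha)) for row in K]
    return [(f - 1.0) ** 2 for f in fitted]


def drop_least_consistent(K, errors, n_outliers):
    """Remove the rows and columns of the n_outliers samples with the largest error."""
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    removed = sorted(order[:n_outliers])
    kept = [i for i in range(len(errors)) if i not in removed]
    reduced = [[K[i][j] for j in kept] for i in kept]
    return reduced, removed


# Five clustered samples and one far-away sample playing the role of a contamination.
samples = [[0.0], [0.1], [-0.1], [0.05], [-0.05], [5.0]]
K = [[rbf_kernel(a, b) for b in samples] for a in samples]
errors = squared_errors(K, delta=0.1)
K_reduced, removed = drop_least_consistent(K, errors, n_outliers=1)
print(removed)
```

In this toy setting the isolated sample receives the largest error and is the one discarded; training would then continue on the reduced kernel matrix.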
III-G Outlier Detection
Once is determined, the projection of a probe onto the optimal feature subspace can be obtained as , where is a vector collection of the elements . The decision rule is now defined as the squared distance between the mean of projections of target observations in the subspace, i.e. and . As , the decision rule becomes
|z is a target object|
|z is an outlier|
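The test-phase rule can be sketched end to end as follows, assuming an RBF kernel, a response vector of ones (so that the mean projection equals one) and a hypothetical threshold `tau`. A probe is scored by the squared distance between its projection and that mean, and is accepted as a target when the score does not exceed the threshold. All names and values are illustrative.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel; gamma is a hypothetical width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def cholesky_solve(A, b):
    """Solve A x = b for symmetric positive-definite A via A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (math.sqrt(A[i][i] - partial) if i == j
                       else (A[i][j] - partial) / L[j][j])
    t = [0.0] * n
    for i in range(n):                # forward substitution: L t = b
        t[i] = (b[i] - sum(L[i][k] * t[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):      # back substitution: L^T x = t
        x[i] = (t[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def train(samples, gamma=0.5, delta=0.0):
    """Return the coefficients solving (K + delta I) alpha = [1, ..., 1]."""
    n = len(samples)
    K = [[rbf_kernel(samples[i], samples[j], gamma) + (delta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return cholesky_solve(K, [1.0] * n)


def novelty_score(z, samples, alpha, gamma=0.5):
    """Squared distance between the projection of z and the mean projection (equal to 1)."""
    projection = sum(a * rbf_kernel(z, x, gamma) for a, x in zip(alpha, samples))
    return (projection - 1.0) ** 2


def is_target(z, samples, alpha, tau, gamma=0.5):
    """Accept z as a target observation when its score does not exceed tau."""
    return novelty_score(z, samples, alpha, gamma) <= tau


samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = train(samples)
tau = 0.5  # hypothetical threshold
near_score = novelty_score([1.0, 0.0], samples, alpha)
far_score = novelty_score([10.0, 10.0], samples, alpha)
print(near_score, far_score)
```

A training sample projects onto the mean and scores close to zero, whereas a probe far from all training data projects close to the origin and scores close to one.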
III-H Incremental OC-KSR
In the proposed OC-KSR method, a high computational cost is associated with the Cholesky decomposition of the kernel matrix , the batch computation of which requires arithmetic operations. However, as advocated in , a Cholesky decomposition may be obtained more efficiently using an incremental approach. In the incremental scheme, the goal is to find the Cholesky decomposition of an matrix given the Cholesky decomposition of its submatrix. Hence, given the Cholesky decomposition of the kernel matrix of samples we want to compute the Cholesky factorisation of the kernel matrix for the augmented training set where a single sample () is injected into the system. The incremental Cholesky decomposition technique may be applied via the Sherman’s March algorithm  for as
where is an vector given by
Eq. 25 reads
Thus, one first solves for and then computes . The employed incremental technique reduces the computational cost of the Cholesky decomposition from cubic in the number of training samples in the batch mode to quadratic in the incremental mode .
By varying from 1 to (total number of target observations), the incremental Cholesky decomposition is obtained as Algorithm 3. The incremental approach not only reduces the computational complexity but also allows for operation in streaming data scenarios where training samples are received one at a time. In this case, as new data becomes available, only the new part of the kernel matrix needs to be computed. Moreover, since the Cholesky factorisation can be performed in an incremental fashion, the previous computations are fully utilised.
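The row-by-row update can be sketched as follows. The snippet assumes a lower-triangular factor (the transpose of an upper-triangular one) and an RBF kernel with a hypothetical width `gamma`. Each new sample contributes one forward substitution and one square root, and earlier rows of the factor are reused unchanged. The name `cholesky_append` is illustrative and this is not the paper's Algorithm 3 verbatim.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel; gamma is a hypothetical width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def forward_substitution(L, b):
    """Solve L t = b for lower-triangular L (returns [] when L is empty)."""
    t = [0.0] * len(b)
    for i in range(len(b)):
        t[i] = (b[i] - sum(L[i][k] * t[k] for k in range(i))) / L[i][i]
    return t


def cholesky_append(L, new_column, new_diagonal):
    """Grow the factor L of a kernel matrix K = L L^T by one sample.

    new_column holds kappa(x_new, x_j) for the samples already in the factor,
    new_diagonal is kappa(x_new, x_new).
    """
    row = forward_substitution(L, new_column)   # solve L r = k
    pivot_squared = new_diagonal - sum(v * v for v in row)
    if pivot_squared <= 0.0:
        raise ValueError("augmented kernel matrix is not positive definite")
    grown = [existing + [0.0] for existing in L]  # earlier rows are reused as they are
    grown.append(row + [math.sqrt(pivot_squared)])
    return grown


samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
L = []
for m, x in enumerate(samples):                  # samples arrive one at a time
    column = [rbf_kernel(x, samples[j]) for j in range(m)]
    L = cholesky_append(L, column, rbf_kernel(x, x))

# Compare L L^T with the kernel matrix computed in one batch.
n = len(samples)
K = [[rbf_kernel(a, b) for b in samples] for a in samples]
max_error = max(abs(sum(L[i][k] * L[j][k] for k in range(n)) - K[i][j])
                for i in range(n) for j in range(n))
print(max_error)
```

Only the kernel values between the new sample and the existing ones are computed at each step, which matches the streaming use described above.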
III-I Relation to other methods
There exist some unsupervised methods using kernel PCA (KPCA) for outlier detection such as those in [19, 43]. If, in the KPCA approach, one uses the eigenvector corresponding to the smallest eigenvalue for projection, a small variance along the projection direction is expected. However, in KPCA the smallest eigenvalue, and as a result the variance along the projection direction, need not be zero in general. Note that in KPCA one may obtain at most ( being the number of training samples) distinct eigenvalues using the kernel matrix. As the smallest eigenvalue of a general kernel matrix need not be zero, the variance along the corresponding eigenvector would not necessarily be zero. As a widely used kernel function, an RBF kernel results in a positive-definite kernel matrix which translates into strictly positive eigenvalues. In contrast, in the proposed method, the variance along the projection direction is always zero using a Gaussian kernel function.
As discussed previously, the proposed method is similar to the null-space methods for anomaly detection presented in [41, 33] in the sense that all methods employ the Fisher criterion for estimation of a null feature space. However, the proposed approach, as will be discussed in §IV-C, is computationally attractive by virtue of avoiding costly eigen-decompositions. Other work in  tries to optimise the ratio between the target scatter and outlier scatter, which is different from the Fisher ratio utilised in this work. As illustrated, the proposed approach can be implemented in an incremental fashion, which further reduces the computational complexity of the method while allowing for application in streaming data scenarios. Moreover, the proposed OC-KSR method offers a natural mechanism for specifying the fraction of outliers in the training set in advance.
IV Experimental Evaluation
In this section, an experimental evaluation of the proposed approach is provided to compare the performance of the OC-KSR method to those of several state-of-the-art approaches in terms of the area under the ROC curve (AUC). Ten different data sets which include relatively low to medium and high dimensional features are used. A summary of the statistics of the data sets used is provided in Table II. A brief description of the data sets used in the experiments is as follows:
Arcene: The task in this data set is to distinguish cancer versus normal patterns from mass-spectrometric data. The data set was obtained by merging three mass-spectrometry datasets with continuous input variables to obtain training and test data. The dataset is part of the 2003 NIPS variable selection benchmark. The original features indicate the abundance of proteins in human sera having a given mass value. Based on these features one must separate cancer patients from healthy patients. The data set is part of the UCI machine learning data sets .
AD includes EEG signals from 11 patients with a diagnosis of probable AD and 11 control subjects. The task in this data set is to discriminate between healthy and Alzheimer’s (AD) patients. AD patients were recruited from the Alzheimer’s Patients’ Relatives Association of Valladolid (AFAVA), Spain, for whom more than 5 minutes of EEG data were recorded using Oxford Instruments Profile Study Room 2.3.411 (Oxford, UK). As suggested in , in this work the signal associated with the electrode is used.
Face consists of face images of different individuals where the task is to recognise a subject among others. For each subject a one-class classifier is built using the data for the same subject while all other subjects are considered as outliers with respect to the built model. The experiment is repeated in turn for all of the subjects in the data set. The features used for learning are obtained via the GoogLeNet deep CNN . We have created this data set out of real-access data of the Replay-Mobile data set  and included ten subjects in the experiments.
Caltech256 is a challenging set of 256 object categories containing 30607 images in total. Each class of images has a minimum of 80 images representing a diverse set of backgrounds, poses, lighting conditions and image sizes. In this experiment, the ‘American-flag’ class is considered as the target class and the samples associated with the ‘boom-box’, ‘bulldozer’ and ‘cannon’ classes as outliers. Bag-of-visual-words histograms from densely sampled SIFT features are used to represent images (http://homes.esat.kuleuven.be/~tuytelaa/unsup_features.html).
MNIST is a collection of pixel images of handwritten digits 0-9. Considering digit ’1’ as the target digit, 220 images are used as target data and 293 images corresponding to other digits are used as negative samples.
Delft pump includes 5 vibration measurements taken under different normal and abnormal conditions from a submersible pump. The 5 measurements are combined into one object, giving a 160-dimensional feature space. The data set is obtained from the one-class data set archive of Delft university .
Sonar is composed of 208 instances of 60 attributes representing the energy within a particular frequency band, integrated over a certain period of time. There are two classes: an object is a rock or is a mine. The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The Sonar dataset is from the undocumented databases from UCI.
Vehicle dataset is from Statlog, where class van is used as target class. The task is to recognise a vehicle from its silhouette. The data set is obtained from the one-class data set archive of Delft university .
Vowel is an undocumented dataset from UCI. The purpose is speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios. Vowel 0 is used as the target class in this work.
Balance-scale was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The data set is part of the UCI machine learning repository .
The methods included in the comparison in this work are:
OC-KSR is the proposed naïve one-class spectral regression method without observation ranking; i.e. when the fraction of contaminations in the training set is not specified in advance.
OC-SVM is based on the Support Vector Machines formalism to solve the one class classification problem . As a widely used method, it provides a baseline for comparison. The method provides a parameter to specify the expected fraction of outliers in the training set.
OC-KNFST is the one-class kernel null Foley-Sammon transform presented in  which operates on the Fisher criterion. This method is chosen due to its similarity to the proposed approach.
KPCA is based on the kernel PCA method where the reconstruction residual of a sample in the feature space is used as the novelty measure .
GP-mean is derived based on the Gaussian process regression and approximate Gaussian process classification  where in this work the predictive mean is used as one class score.
LOF Local outlier factor (LOF)  is a local measure indicating the degree of novelty for each object of the data set. The LOF of an object is based on a single parameter k, which is the number of nearest neighbours used in defining the local neighbourhood of the object.
K-means is the k-means clustering based approach where k centres are assumed for the target observation. The novelty score of a sample is defined as the minimum distance of a query to data centres.
KNNDD The k-nearest neighbours data description method (KNNDD) is proposed in terms of the one-class classification framework . The principle of KNNDD is to associate with each sample a distance measure relative to its neighbourhood (its k nearest neighbours).
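As an illustration of the distance-based scores described in the list above, the K-means novelty score (the minimum distance of a query to the data centres) can be sketched as follows. This is a minimal sketch, not the implementation used in the experiments: the function name `min_centre_distance` and the toy centres are hypothetical, and the centres are assumed to have been fitted to the target observations beforehand.

```python
import math

def min_centre_distance(query, centres):
    """Novelty score: Euclidean distance from `query` to its nearest centre.

    A larger value means the query lies further from the target data.
    """
    return min(math.dist(query, c) for c in centres)

# Hypothetical centres, assumed to come from k-means on the target class.
centres = [(0.0, 0.0), (4.0, 0.0)]

near = min_centre_distance((0.0, 1.0), centres)   # close to the first centre
far = min_centre_distance((2.0, 5.0), centres)    # far from both centres
print(near, far)
```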
In all the experiments that follow, the positive samples of each data set are randomly divided into training and test sets of equal size. Each experiment is repeated 100 times, and the average area under the ROC curve (AUC) and the standard deviation of the AUCs are reported. No pre-processing of features is performed other than normalising all features to have a unit L2-norm. For the methods requiring a neighbourhood parameter (i.e. LOF, K-means and KNNDD), the neighbourhood parameter is set in the range to obtain the best performance. Regarding the methods operating in the RKHS (i.e. OC-SVM, OC-KNFST, GP-mean, KPCA and OC-KSR), a common Gaussian kernel is computed and shared among all methods.
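The evaluation protocol described above can be sketched roughly as follows. This is a sketch under stated assumptions rather than the authors' code: the names `l2_normalise`, `auc`, `evaluate`, `fit_mean` and `neg_distance` are hypothetical; normalisation is assumed to be applied to each feature vector; a higher score is assumed to mean "more target-like"; the sample standard deviation is used; and a toy mean-based scorer stands in for a real one-class classifier.

```python
import math
import random
import statistics

def l2_normalise(x):
    """Scale a feature vector to unit L2-norm (assumed per-vector)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as one half. Higher scores are assumed to indicate targets.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def evaluate(positives, negatives, fit, score, repeats=100, seed=0):
    """Repeat a random 50/50 split of the positives; return mean and std of AUC."""
    rng = random.Random(seed)
    positives = [l2_normalise(x) for x in positives]
    negatives = [l2_normalise(x) for x in negatives]
    aucs = []
    for _ in range(repeats):
        shuffled = positives[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        model = fit(train)  # trained on positive samples only
        aucs.append(auc([score(model, x) for x in test],
                        [score(model, x) for x in negatives]))
    return statistics.mean(aucs), statistics.stdev(aucs)

# Toy stand-in for a one-class model: the mean of the training vectors.
def fit_mean(train):
    dim = len(train[0])
    return [sum(x[i] for x in train) / len(train) for i in range(dim)]

def neg_distance(model, x):
    return -math.dist(model, x)  # closer to the mean gives a higher score

positives = [[1.0, 0.1 * i] for i in range(1, 9)]   # hypothetical target data
negatives = [[0.1 * i, 1.0] for i in range(1, 9)]   # hypothetical outliers
mean_auc, std_auc = evaluate(positives, negatives, fit_mean, neg_distance)
print(f"AUC: {mean_auc:.3f} +/- {std_auc:.3f}")
```

Passing `fit` and `score` as callables keeps the split, normalisation and AUC computation identical across methods, which mirrors the shared protocol described in the text.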
Table II: Summary of the statistics of the data sets used (columns: Dataset, Positive Samples, Negative Samples, d).
IV-A Comparison to other methods
A comparison of the proposed OC-KSR approach to other methods is provided in Tables III and IV for the data sets with medium to high dimensional features and data sets with relatively lower dimensional features, respectively. From Tables III and IV, one may observe that in 6 out of 10 data sets, the proposed OC-KSR method achieves leading performance in terms of average AUC. The results over different data sets are summarised in Tables V, VI and VII in terms of average AUC over medium to high dimension, relatively lower dimension and all data sets. As can be observed from Table V, the best performing methods on the medium to high dimensional data sets in terms of average AUC are the proposed OC-KSR and the OC-KNFST method with an average AUC of . It is worth noting that the performances of both methods match exactly. As previously discussed, this is expected since both approaches are theoretically equivalent, optimising the Fisher criterion for classification. The second best performing method in this case is KPCA followed by K-means.
Regarding the low dimensional data sets, the best performing methods in terms of average AUC are the proposed OC-KSR approach and the OC-KNFST method with an average of AUC (Table VI). The second best performing method in this case is GP-mean with an average AUC of followed by KPCA with an average AUC of .
Table VII reports the average AUCs for all the evaluated methods over all data sets regardless of the dimensionality of feature vectors. The best performing methods, OC-KSR and OC-KNFST, achieve an average AUC of while the second best performing methods are KPCA and GP-mean with an average AUC of .
IV-B Contaminations in the training set
In this experiment, contaminations are gradually added to the training set and the performances of different methods over different data sets are observed. The contaminations are obtained from the negative samples of each data set, the proportion of which relative to the positive set is increased from to in increments of . As previously, each experiment is repeated 100 times. As the LOF method is found to perform much worse than the others, it is excluded from this experiment. The results of this evaluation are presented in Fig. 3 and Fig. 4 for the medium to high and relatively lower dimensional data sets, respectively. The figures show two variants of the proposed method: OC-KSR without observation ranking and OC-KSR with the observation ranking scheme. Consequently, only the OC-KSR with observation ranking and the OC-SVM methods have an explicit internal mechanism to handle contaminations in the training set. From the figures, the following conclusions can be drawn. As expected, adding contaminations to the training set deteriorates the performances of all methods. However, the OC-KSR with observation ranking is more resilient to contaminations of up to and performs better than the OC-KSR without observation ranking, which indicates the effectiveness of the proposed observation ranking scheme. Beyond contaminations of , the regression model is more influenced by contaminations, as any misdetected contamination would not only keep the corresponding sample in the training set but would also discard a truly positive sample from this set, and hence would result in a deviation in the model. It can also be observed that the higher the dimensionality of the feature vectors is, the better the performance of the OC-KSR would be. For instance, the OC-KSR is the best performing method over the Arcene data set for contaminations of up to . As the dimensionality of the feature vectors decreases, other methods tend to perform better than the proposed approach when the percentage of contaminations goes beyond .
Interestingly, the KPCA approach seems to perform reasonably well, despite having no explicit built-in mechanism against contaminations in the data set.
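The construction of a contaminated training set, as used in this experiment, can be sketched as below. This is only an illustrative sketch: the function name `contaminate` is hypothetical, the proportions in the loop are placeholder values (the actual range and increment are those stated in the text), and withholding the injected negatives from the test set is an assumption rather than something the text specifies.

```python
import random

def contaminate(train_pos, negatives, proportion, rng):
    """Add negatives to the training set at `proportion` of its positive size.

    Returns the contaminated training set and the negatives that were not
    injected (assumed here to be the ones kept for testing).
    """
    n_add = min(round(proportion * len(train_pos)), len(negatives))
    picked = set(rng.sample(range(len(negatives)), n_add))
    contaminated = list(train_pos) + [negatives[i] for i in sorted(picked)]
    remaining = [x for i, x in enumerate(negatives) if i not in picked]
    return contaminated, remaining

rng = random.Random(0)
train_pos = [[float(i)] for i in range(20)]        # hypothetical positives
negatives = [[100.0 + i] for i in range(10)]       # hypothetical negatives

# Placeholder proportions, chosen only for illustration.
for proportion in (0.1, 0.2, 0.3):
    train_set, test_neg = contaminate(train_pos, negatives, proportion, rng)
    print(proportion, len(train_set), len(test_neg))
```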
IV-C Computational complexity
In this section, the computational complexity of the proposed OC-KSR method in the training and test phases is discussed and compared to similar methods.
IV-C1 Computational complexity in the training stage
An analysis regarding the computational complexity of the proposed method in the training stage is as follows. As with all the kernel methods, the computation of the kernel matrix has a time complexity of . Computing the additional part of the kernel matrix in the incremental scheme requires compound arithmetic operations each consisting of one addition and one multiplication (flam ), where is the number of additional training samples. The incremental Cholesky decomposition requires . Given the Cholesky decomposition of , the linear equations can be solved within flams. As a result, the computational cost of training the incremental OC-KSR approach in the updating phase is
assuming , the cost can be approximated as
In the initial training stage, the computation of the kernel matrix, the Cholesky decomposition of the kernel matrix and solving linear equations are required. Noting that even in the initial stage the Cholesky decomposition can be performed in an incremental fashion (we assume during the initial training phase), the total cost in the initialisation stage can be approximated as
As a result, if (which is often the case), the proposed algorithm would have a time complexity of in the training stage. That is, the computation of the kernel matrix has the dominant complexity in the training phase of the proposed approach.
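The training steps whose costs are analysed above (kernel matrix, Cholesky decomposition, solution of the linear equations) can be sketched in plain Python as follows. This is a didactic sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the responses are assumed to be a constant vector of ones (the target projection being fixed up to a multiplicative constant), and the small ridge term `delta` on the diagonal is an assumed regulariser. In practice a numerical library would be used instead of explicit loops.

```python
import math

def gaussian_kernel_matrix(X, sigma):
    """Dense Gaussian kernel matrix; needs n(n+1)/2 kernel evaluations."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-sq / (2.0 * sigma ** 2))
    return K

def cholesky(A):
    """Row-wise Cholesky factor L with A = L L^T.

    Row i only uses rows 0..i, so appending samples extends L with new
    rows while the existing rows stay unchanged.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_cholesky(L, y):
    """Solve L L^T a = y by forward then backward substitution."""
    n = len(L)
    z = [0.0] * n
    for i in range(n):                      # forward: L z = y
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    a = [0.0] * n
    for i in reversed(range(n)):            # backward: L^T a = z
        a[i] = (z[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a

def train(X, sigma=1.0, delta=1e-6):
    """Solve (K + delta I) alpha = y with y = 1 (assumed responses)."""
    K = gaussian_kernel_matrix(X, sigma)
    for i in range(len(X)):
        K[i][i] += delta                    # assumed ridge regulariser
    return solve_cholesky(cholesky(K), [1.0] * len(X))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical training samples
alpha = train(X)
print(alpha)
```

No eigen-decomposition appears anywhere in this pipeline, which is the point the complexity analysis makes: once the kernel matrix is available, only a triangular factorisation and two triangular solves remain.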
IV-C2 Computational complexity in the test stage
In the test phase, the OC-KSR method requires computation of which has a time complexity of followed by the computation of requiring flams. Hence the dominant computational complexity in the test phase is .
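The two test-phase steps described above (evaluating the kernel between the query and the training samples, then combining the result with the learned coefficients) can be sketched as follows. This is a minimal sketch, assuming a Gaussian kernel and a coefficient vector `alpha` obtained during training; the function names and the toy values are hypothetical.

```python
import math

def gaussian_kernel(x, z, sigma):
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def score(query, train_X, alpha, sigma=1.0):
    """Projection of `query`: kernel vector followed by an inner product."""
    k = [gaussian_kernel(query, x, sigma) for x in train_X]  # n kernel evaluations
    return sum(ki * ai for ki, ai in zip(k, alpha))          # n multiply-adds

# Hypothetical training samples and coefficients, for illustration only.
train_X = [[0.0, 0.0], [1.0, 0.0]]
alpha = [0.6, 0.6]

print(score([0.5, 0.0], train_X, alpha))    # query between the training samples
print(score([10.0, 10.0], train_X, alpha))  # distant query, projection near zero
```

The kernel evaluations dominate the cost because each one touches every feature dimension, whereas the final inner product is a single pass over the training set.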
Very recently, an incremental variant of the OC-KNFST approach was proposed in  which reduces the computational complexity of the original KNFST algorithm. As the classification performance of this approach matches the original KNFST method, both the OC-KSR and the incremental KNFST are identical in terms of classification performance. A note regarding the computational complexity of the incremental OC-KNFST method is as follows. In the initial training stage, the incremental OC-KNFST algorithm requires for the computation of the kernel matrix and eigen-decomposition and matrix multiplication computations. As the computation of the kernel matrix is common for both the OC-KSR and the incremental OC-KNFST, the relative computational advantage of the OC-KSR to the incremental OC-KNFST in the initial training stage when the kernel matrix is pre-computed is . That is, the computational advantage of the OC-KSR approach with respect to the incremental OC-KNFST increases almost linearly with (number of training samples) in the initial training stage. This is due to the fact that the method in  uses eigen-decomposition, whereas such computations are completely avoided in OC-KSR via a regression approach.
The computational complexity of the method in  in the updating phase of the training stage is where is the number of eigen-bases, upper bounded by . As a result, in common scenarios where e.g. , if the number of eigen-bases for the incremental OC-KNFST method exceeds of , the proposed OC-KSR method would be more efficient. In summary, in the initial training phase, the proposed OC-KSR approach is computationally more efficient than the incremental OC-KNFST method of . In the updating phase, under mild conditions, it would be more efficient too.
A new efficient nonlinear one-class classifier built upon the Fisher criterion was presented, together with a graph embedding view of the problem. The proposed OC-KSR approach works by mapping the data onto a one-dimensional feature subspace where the scatter of the target distribution is zero while outliers are mapped onto distant locations in the same feature subspace. It was shown that all target observations are projected onto a single point which can be determined up to a multiplicative constant, independently of the data. The proposed method, unlike previous similar approaches, casts the problem under consideration into a regression framework, optimising the criterion function via the efficient spectral regression method and thus avoiding costly eigen-decomposition computations. It was illustrated that the dominant complexity of the proposed method in the training phase is the complexity of computing the kernel matrix. The proposed OC-KSR approach offers a number of appealing characteristics such as the ability to be trained in an incremental fashion and being unsupervised. Moreover, as a by-product of the regression-based formulation, an observation ranking scheme was proposed which can be utilised to specify the expected fraction of outliers in the training set in advance. Extensive experiments conducted on several datasets with varied feature dimensionalities verified the merits of the proposed approach in comparison with some other alternatives.
-  P. Nader, P. Honeine, and P. Beauseroy, “ℓp-norms in one-class classification for intrusion detection in scada systems,” IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 2308–2317, Nov 2014.
-  A. Beghi, L. Cecchinato, C. Corazzol, M. Rampazzo, F. Simmini, and G. Susto, “A one-class svm based tool for machine learning novelty detection in hvac chiller systems,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 1953 – 1958, 2014, 19th IFAC World Congress. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1474667016418999
-  S. Budalakoti, A. N. Srivastava, and M. E. Otey, “Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 39, no. 1, pp. 101–113, Jan 2009.
-  S. Kamaruddin and V. Ravi, “Credit card fraud detection using big data analytics: Use of psoaann based one-class classification,” in Proceedings of the International Conference on Informatics and Analytics, ser. ICIA-16. New York, NY, USA: ACM, 2016, pp. 33:1–33:8. [Online]. Available: http://doi.acm.org/10.1145/2980258.2980319
-  G. G. Sundarkumar, V. Ravi, and V. Siddeshwar, “One-class support vector machine based undersampling: Application to churn prediction and insurance fraud detection,” in 2015 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), Dec 2015, pp. 1–7.
-  M. Yu, Y. Yu, A. Rhuma, S. M. R. Naqvi, L. Wang, and J. A. Chambers, “An online one class support vector machine-based person-specific fall detection system for monitoring an elderly individual in a room environment,” IEEE Journal of Biomedical and Health Informatics, vol. 17, no. 6, pp. 1002–1014, Nov 2013.
-  A. Rabaoui, M. Davy, S. Rossignol, and N. Ellouze, “Using one-class svms and wavelets for audio surveillance,” IEEE Transactions on Information Forensics and Security, vol. 3, no. 4, pp. 763–775, Dec 2008.
-  T. Minter, “Single-class classification,” in Symposium on Machine Processing of Remotely Sensed Data. Indiana, USA: IEEE, 1975, pp. 2A12–2A15.
-  M. M. Moya, M. W. Koch, and L. D. Hostetler, “One-class classifier networks for target recognition applications,” NASA STI/Recon Technical Report N, vol. 93, 1993.
-  G. Ritter and M. T. Gallegos, “Outliers in statistical pattern recognition and an application to automatic chromosome classification,” Pattern Recognition Letters, vol. 18, no. 6, pp. 525 – 539, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865597000494
-  C. M. Bishop, “Novelty detection and neural network validation,” IEE Proceedings - Vision, Image and Signal Processing, vol. 141, no. 4, pp. 217–222, Aug 1994.
-  N. Japkowicz, “Concept learning in the absence of counterexamples: An autoassociation-based approach to classification,” Ph.D. dissertation, New Brunswick, NJ, USA, 1999, AAI9947599.
-  D. Tax, “One-class classification; concept-learning in the absence of counter-examples,” Ph.D. dissertation, Delft University of Technology, 2001, ASCI Dissertation Series 65.
-  S. S. Khan and M. G. Madden, “One-class classification: taxonomy of study and review of techniques,” The Knowledge Engineering Review, vol. 29, no. 3, pp. 345–374, 2014.
-  M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, vol. 99, pp. 215 – 249, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S016516841300515X
-  J. Kittler, W. Christmas, T. de Campos, D. Windridge, F. Yan, J. Illingworth, and M. Osman, “Domain anomaly detection in machine perception: A system architecture and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 845–859, May 2014.
-  D. M. J. Tax and R. P. W. Duin, “Combining one-class classifiers,” in Proceedings of the Second International Workshop on Multiple Classifier Systems, ser. MCS ’01. London, UK: Springer-Verlag, 2001, pp. 299–308. [Online]. Available: http://dl.acm.org/citation.cfm?id=648055.744087
-  L. Friedland, A. Gentzel, and D. Jensen, Classifier-Adjusted Density Estimation for Anomaly Detection and One-Class Classification, pp. 578–586. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611973440.67
-  H. Hoffmann, “Kernel pca for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863 – 874, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320306003414
-  M. Sabokrou, M. Fathy, and M. Hoseini, “Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder,” Electronics Letters, vol. 52, no. 13, pp. 1122–1124, 2016.
-  S. R. Arashloo, J. Kittler, and W. Christmas, “An anomaly detection approach to face spoofing detection: A new formulation and evaluation protocol,” IEEE Access, vol. 5, pp. 13 868–13 882, 2017.
-  B. Song, P. Li, J. Li, and A. Plaza, “One-class classification of remote sensing images using kernel sparse representation,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 4, pp. 1613–1623, April 2016.
-  D. M. Tax and R. P. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, Jan 2004. [Online]. Available: https://doi.org/10.1023/B:MACH.0000008084.60811.49
-  B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Comput., vol. 13, no. 7, pp. 1443–1471, Jul. 2001. [Online]. Available: https://doi.org/10.1162/089976601750264965
-  E. Pekalska, D. Tax, R. Duin, S. Becker, S. Thrun, and K. Obermayer, One-Class LP Classifiers for Dissimilarity Representations. United States: MIT Press, 2002, pp. 761–768.
-  P. Casale, O. Pujol, and P. Radeva, “Approximate convex hulls family for one-class classification,” in Multiple Classifier Systems, C. Sansone, J. Kittler, and F. Roli, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 106–115.
-  D. Fernández-Francos, Ó. Fontenla-Romero, and A. Alonso-Betanzos, “One-class convex hull-based algorithm for classification in distributed environments,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–11, 2017.
-  A. Ypma and R. P. W. Duin, “Support objects for domain approximation,” in ICANN 98, L. Niklasson, M. Bodén, and T. Ziemke, Eds. London: Springer London, 1998, pp. 719–724.
-  M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, “Adversarially learned one-class classifier for novelty detection,” CoRR, vol. abs/1802.09088, 2018. [Online]. Available: http://arxiv.org/abs/1802.09088
-  P. Perera and V. M. Patel, “Learning deep features for one-class classification,” CoRR, vol. abs/1801.05365, 2018. [Online]. Available: http://arxiv.org/abs/1801.05365
-  V. Roth, “Outlier detection with one-class kernel fisher discriminants,” in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. MIT Press, 2005, pp. 1169–1176. [Online]. Available: http://papers.nips.cc/paper/2656-outlier-detection-with-one-class-kernel-fisher-discriminants.pdf
-  ——, “Kernel fisher discriminants for outlier detection,” Neural Comput., vol. 18, no. 4, pp. 942–960, Apr. 2006. [Online]. Available: http://dx.doi.org/10.1162/089976606775774679
-  P. Bodesheim, A. Freytag, E. Rodner, M. Kemmler, and J. Denzler, “Kernel null space methods for novelty detection,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp. 3374–3381.
-  F. Dufrenois, “A one-class kernel fisher criterion for outlier detection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 982–994, May 2015.
-  F. Dufrenois and J. C. Noyer, “Formulating robust linear regression estimation as a one-class lda criterion: Discriminative hat matrix,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 2, pp. 262–273, Feb 2013.
-  S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers, “Fisher discriminant analysis with kernels,” Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48, Aug. 1999.
-  G. Baudat and F. Anouar, “Generalized discriminant analysis using a kernel approach.” Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
-  B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
-  C. Cortes and V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.
-  F. Dufrenois and J. C. Noyer, “A null space based one class kernel fisher discriminant,” in 2016 International Joint Conference on Neural Networks (IJCNN), July 2016, pp. 3203–3210.
-  J. Liu, Z. Lian, Y. Wang, and J. Xiao, “Incremental kernel null space discriminant analysis for novelty detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 4123–4131.
-  D. Cai, X. He, and J. Han, “Speed up kernel discriminant analysis,” The VLDB Journal, vol. 20, no. 1, pp. 21–33, Feb. 2011.
-  S. Günter, N. N. Schraudolph, and S. V. N. Vishwanathan, “Fast iterative kernel principal component analysis,” vol. 8, pp. 1893–1918, 2007.
-  N. Kwak, “Principal component analysis based on l1-norm maximization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1672–1680, Sept 2008.
-  V. Roth, “Kernel fisher discriminants for outlier detection,” Neural Computation, vol. 18, no. 4, pp. 942–960, April 2006.
-  P. Bodesheim, A. Freytag, E. Rodner, and J. Denzler, “Local novelty detection in multi-class recognition problems,” in 2015 IEEE Winter Conference on Applications of Computer Vision, Jan 2015, pp. 813–820.
-  F. Dufrenois and J. Noyer, “One class proximal support vector machines,” Pattern Recognition, vol. 52, pp. 96 – 112, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320315003672
-  S. R. Arashloo and J. Kittler, “Class-specific kernel fusion of multiple descriptors for face verification using multiscale binarised statistical image features,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2100–2109, Dec 2014.
-  D. Cai, “Spectral regression: A regression framework for efficient regularized subspace learning,” Ph.D. dissertation, Department of Computer Science, University of Illinois at Urbana-Champaign, May 2009.
-  H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey of graph embedding: Problems, techniques and applications,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2018.
-  Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, Dec 2017.
-  G.W. Stewart, Matrix algorithms – Volume I: Basic decompositions. SIAM, 2001.
-  UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets.html
-  S. Tirunagari, S. Kouchaki, D. Abasolo, and N. Poh, “One dimensional local binary patterns of electroencephalogram signals for detecting alzheimer’s disease,” in 2017 22nd International Conference on Digital Signal Processing (DSP), Aug 2017, pp. 1–5.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1–9.
-  A. Costa-Pazo, S. Bhattacharjee, E. Vazquez-Fernandez, and S. Marcel, “The replay-mobile face presentation-attack database,” in Proceedings of the International Conference on Biometrics Special Interests Group (BioSIG), Sep. 2016.
-  Delft University Archive of One-Class Data Sets. [Online]. Available: http://homepage.tudelft.nl/n9d04/occ/index.html
-  M. Kemmler, E. Rodner, E.-S. Wacker, and J. Denzler, “One-class classification with gaussian processes,” Pattern Recognition, vol. 46, no. 12, pp. 3507 – 3518, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320313002574
-  M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: Identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, 2000, pp. 93–104.