Bayesian Anomaly Detection and Classification
Abstract
Statistical uncertainties are rarely incorporated in machine learning algorithms, especially for anomaly detection. Here we present the Bayesian Anomaly Detection And Classification (BADAC) formalism, which provides a unified statistical approach to classification and anomaly detection within a hierarchical Bayesian framework. BADAC deals with uncertainties by marginalising over the unknown, true, value of the data. Using simulated data with Gaussian noise, BADAC is shown to be superior to standard algorithms in both classification and anomaly detection performance in the presence of uncertainties, though with significantly increased computational cost. Additionally, BADAC provides wellcalibrated classification probabilities, valuable for use in scientific pipelines. We show that BADAC can work in online mode and is fairly robust to model errors, which can be diagnosed through modelselection methods. In addition it can perform unsupervised new class detection and can naturally be extended to search for anomalous subsets of data. BADAC is therefore ideal where computational cost is not a limiting factor and statistical rigour is important. We discuss approximations to speed up BADAC, such as the use of Gaussian processes, and finally introduce a new metric, the RankWeighted Score (RWS), that is particularly suited to evaluating the ability of algorithms to detect anomalies.
Keywords:
machine learning anomalies classification novelty Bayesian unsupervised class detection∎
1 Introduction
In any fully rigorous or scientific analysis, uncertainties must be quantified and propagated through the full analysis pipeline. This is difficult to do with traditional machine learning algorithms that do not explicitly take into account uncertainties on the data or features. As machine learning is increasingly given authority for making more important and highrisk decisions, (e.g. in selfdriving cars), and with the potential for adversarial attacks (Akhtar and Mian 2018), there is an increasing need for interpretable models and rigorous statistical uncertainties on machine learning predictions.
In classification problems class labels are typically inferred through the use of a separation boundary that is learned from training data (Fawzi et al. 2017) and is based on a score combined with a threshold. This threshold is often arbitrary or learned as a hyperparameter to minimise some chosen loss function (NiculescuMizil and Caruana 2005b). Any resulting class “probabilities” are systematically distorted in ways unique to the classification algorithm used and are not true probabilities, though in some cases these can be calibrated in a frequentist sense with more training data using e.g. isotonic regression or Platt scaling (NiculescuMizil and Caruana 2005a).
However, particularly in the physical sciences, we desire an algorithm that automatically outputs unbiased, accurate probabilities, since knowing the probabilities of an object belonging to various classes is typically more useful than the class label alone. The classification process is often just one step in a multistage pipeline, and it is important to propagate class uncertainties through the additional steps in the analysis pipeline. This need is especially true in cases where the true class labels of the training data are noisy or subjective, or the training data are not representative of the test set. An example in astronomy is provided by the photometric classification of type Ia supernovae which are subsequently used for studies of dark energy. Hard label classification leads to contamination from nonIa supernovae that leads to biases in dark energy properties while fully propagating class probabilities instead allows for unbiased results at the end of the pipeline (Kunz et al. 2007; Hlozek et al. 2012).
In this context Bayesian methods are ideal (Denison et al. 2002), as they have been proven optimal for classification for certain loss metrics, e.g. Domingos and Pazzani (1997), and allow the option of both supervised or unsupervised classification (Cheeseman and Stutz 1996). In the context of astronomy, Bayesian techniques have been applied to classification of transient objects such as supernovae (Connolly and Connolly 2009). A common limitation in the classification of noisy data however, is that the classes in the training data are typically represented by a single template with zero variability (e.g. Sako et al. (2011)). This allows straightforward Bayesian methods to be applied but does not apply if there is significant intraclass variability. Ignoring this intraclass variability also makes principled anomaly detection challenging: how unlikely is an example if one doesn’t know the underlying distribution within a class?
Here we address these limitations, constructing what we will argue is a natural, statistically robust supervised Bayesian method that can simultaneously be used for both anomaly detection and classification in the presence of measurement uncertainties on all data. Our method works directly with raw data, requiring no feature extraction, and requires minimal assumptions about the nature of the anomalies or classes.
We begin by describing the formalism in section 2. We introduce a new metric optimised for anomaly detection, namely the RankWeighted Score (RWS) in section 3 (the other metrics we also use to assess algorithm performance are discussed in appendix C). Finally, we compare algorithm performance against various benchmark algorithms on simulated data in section 4. The full derivation of the formalism is given in appendix A.
2 The BADAC Formalism
Bayesian Anomaly Detection and Classification (BADAC) is a Bayesian hierarchical formalism that provides the probabilities that a new measurement belongs to each of a set of known classes, and also ranks objects by the probability that they are anomalous. We make use of language common to machine learning by referring to features for the data for a specific instance of a class, training data (i.e. data for which we know the class label ) and referring to test data for data that we wish to classify as either anomalous or belonging to a known class .
We start by assuming we have a set of multiple classes, , with each class having a set of (noisy) training feature instances , along with associated uncertainties on the features, typically due to some form of measurement error. Here indexes the instances/examples in a class, while indexes the specific features within an instance.
blackFor example, if we have a set of onedimensional timeseries data instances, then would index time. If our data were MNIST images^{1}^{1}1http://yann.lecun.com/exdb/mnist/, then would index pixels, would index the integers , and would index the examples of each digit. As implied by the above examples, the BADAC formalism we develop will work whether indexes a continuous underlying variable (e.g. time or space), or is a nominal index with no preferred ordering. The subscript denotes observed data, to distinguish it from the true (and unknown) underlying value which may differ because of noise or uncertainties. Allowing for the true and observed values separately will be useful, as we show later. For notational convenience we now suppress the class subscript on the but it should always be assumed. The test data to classify and rank for anomalousness, which we call , should also be assumed to have measurement uncertainties which need to be included in any analysis.
Since the final formulae are notationally complex, we gain intuition of the general case by considering the simplest possible example. Assume each instance consists of just one feature, and that each class has only one instance. This would correspond to taking one point from one instance from each class in the timeseries example. This simplification allows us to suppress both the and indices (as well as the label) and write for the moment. We assume has a known measurement error distribution. Then our final goal is to compute the posterior probability , of a single test data point , with known error distribution, of belonging to a class . This is discussed in more detail, along with the general derivation, in appendix A and shown schematically in figure 13. Bayes’ theorem gives this posterior probability as:
(1) 
where is the prior probability of belonging to class .
If we have measurement errors associated with both and , we cannot compute with equation 1 as is. Instead we introduce a latent variable which represents the underlying true value, i.e. what would be measured if there were no measurement error. This can then be marginalised over to rigorously handle our uncertainty of the true value . By application of the product rule for probability densities the posterior becomes:
(2)  
(3)  
(4) 
If we assume and are statistically independent of one another^{2}^{2}2We consider correlations in appendix B., then . We then arrive at:
(5) 
For Gaussian error distributions this can be solved analytically, as we show below.
Next let us generalise equation 5 to allow each class to have multiple training instances, , each still consisting of a single scalar feature value. In this case we need to introduce latent variables, , one for each instance in the class. Then the posterior probability that a new test instance belongs to class  assuming the instances in the training data are uncorrelated  is given by (see Appendix A for derivation):
(6) 
Here is the likelihood of observing the data , conditioned on both the class type and an estimate of the true values of the training data. is a prior on the true value given the class . Two things are worth noting: first, because of the uncertainties in the training data, the classification of just a single scalar data point requires an dimensional integral over the instances in each class ^{3}^{3}3Strictly speaking we should write since the number of samples in each class will be different but we suppress this to keep the notation relatively simple.. Second, equation 6 is not just a product of terms like equation 4, even if the instances are independent.
We next generalise to consider the case of a multidimensional feature vector for each instance, i.e. for the training data and for the test data instance, where as usual index the instance and data dimension within an instance respectively (e.g. for timeseries data would index the time). In the special case of uncorrelated Gaussian distributed test and training data, and for (improper) flat priors, we can analytically compute the posterior probability, (see Appendix A), yielding our main result:
(7) 
where and are the precision of the datapoints, is the total number of training instances and is the number of datapoints per instance. We use equation 7 in all our experiments in section 4 to evaluate BADAC, though it is of course not the most general form.
To normalise equation 7 we need to compute the Bayesian evidence for all the observed data, . For different classes in total, the evidence is then:
(8) 
This is a summation of the posterior probability terms returned for each class type which requires the prior probability, , for each class, to be supplied by the user as usual. Classification then occurs by choosing the class with the highest normalised posterior probability.
We illustrate BADAC for Bayesian classification in figure 1. Here we show a toy example with training data synthesised into two smooth Gaussian templates (whose 1 regions are shown by the filled blue and orange areas) corresponding to two classes, . We have a single test example which we wish to classify, consisting of two data points (the black triangles with error bars). The middle and right panels show the unnormalised posteriors for the true values of the points, i.e. , marginalised over the true value of the other point. For the first data point the posterior conditioned on the blue class is pulled upwards and that conditioned on the orange class downwards. However, because the blue error envelope at the first data point is broader than the orange one, the first point is less likely to come from the blue class, a fact reflected in the smaller area under the blue Gaussian in the first of the right narrow panels. The same happens for the 2nd point. The Bayesian evidence therefore predicts that the test example is more likely to come from the (orange) class, with probability .
2.1 Anomaly Detection
Since we want a simultaneous anomaly detection and classification algorithm we assign the th class to the unknown anomalies with the known classes assigned to the remaining labels. A key issue is the choice of the likelihood for the anomaly class: . If we have seen examples of the anomaly before, this can be estimated from the data. If the anomaly has never been seen before however, this reduces essentially to a prior about our expectations regarding the anomaly. Broad, normalised Gaussian or tophat functions are simple choices allowing for a wide range of behaviours for the anomalies. The shadow of the No Free Lunch theorem appears here: if our prior about the anomaly’s behaviour is completely wrong, we may miss the anomaly entirely.
Instead of specifying a generic prior for the anomaly likelihood, an alternative approach is to assume only the known classes, but then to rank instances in the test data from smallest to largest values of
(9) 
i.e. to rank the instances from least to most likely to belong to any of the known classes. This is useful if one has some insight into the expected fraction of anomalous events, since one can then use it to determine a cutoff in this ranked list.
2.2 Comparison with kNN and KDE
It is interesting to compare BADAC to the knearest neighbours (kNN) algorithm (Altman 1992), with the best value of k selected on a validation set. Since each object can be considered as a single point in a high dimensional feature space, in a sense BADAC generalises kNN to allow for error bars on the test and training points, and uses all the training data from each class. BADAC is also similar to the Kernel Density Estimation (KDE) approach where classification is performed by putting a kernel on each training data point and summing the resulting smooth functions. In our case the bandwidth of the kernel is given by the error bars on the data while it also handles errors on the test data. BADAC can then be thought of as a natural extension of both kNN and KDE to a more rigorous statistical setting.
2.3 NonGaussian data
Often, the standard deviation is used as a proxy for the error distribution on an observation, even when the distribution is nonGaussian. In section 4 we test how the algorithm developed in section 2 performs while assuming a Gaussian error distribution, even when the error distribution is nonGaussian. However, if the error distribution is known, the forms of and in equation 6, can be replaced with the known nonGaussian distribution. These could be the binomial distribution in the case of count data, or the Poisson distribution in the case of certain time series. Any appropriate distribution that can be modelled can be used in this formalism. In the case of the binomial distribution, one would do a summation rather than an integration over the latent variables. For any distribution that doesn’t yield an analytically integrable form for , one can do the marginalisation numerically, though at increased computational cost.
2.4 Online Learning of new classes
Once we have confirmed that an anomaly represents a new class (i.e., if the Bayesian evidence for the anomaly class is higher than that of any of the existing classes) it can automatically be added to the training data as a new class (with a single example) to be compared with. This provides an online learning version of the BADAC algorithm. Any future data belonging to the new anomaly class will be automatically assigned the new anomaly class label.
This process in no way limits us to a single new anomaly class. The BADAC formalism allows for the automatic addition of new classes as demanded by the data. If a new kind of anomaly is different from any previously identified anomalies it will automatically be assigned to a new class.
2.5 BADAC with Parametric Models
It is often the case that the training data can be efficiently modeled by some general parametrisation (with classdependent parameters ), either because we have a physical model of the classes or have performed a general expansion on each class, e.g. through a suitable basis set such as \textcolorblackwavelets (Mallat 2009) or principal component analysis (Hotelling 1933). As a result we assume here that we can parametrise our data as . Then we can express the classification probability as:
(10)  
(11) 
Here is the prior on the parametrised variables , for a given class . is the likelihood for given parameters , given data . For models derived from physical phenomena, we can usually evaluate this.
If one is using some general parametrisation, such as a wavelet transform, then the parameters can be thought of as similar to (and hence grouped with) . In this case, if there were measurement uncertainties associated with each , we could then proceed in the same way as in equation 6, where instead of marginalising over a large range of possible values, we now marginalise over the more constrained . This could be extended to the hierarchical context where the parameters are drawn from some parent distribution.
This shift from the raw training data, to a parametrised model of the data, can have significant computational advantages when there is a lot of training data, as we discuss further in section 6.3.
3 RankWeighted Score
In this section we introduce a new anomalysensitive metric that we call the RankWeighted Score (RWS). In addition to being insensitive to class imbalance, this metric is sensitive to the relative ranking of anomalous objects. In many cases there is a clear hierarchy of how interesting anomalous objects are (a new class is generally more interesting than a new subclass for example), and hence we prefer algorithms that correctly rank more anomalous objects more highly^{4}^{4}4Compare for example the Spearman’s rank correlation (Spearman 1904) which does not give higher weight to the topranking objects.. This is natural because following up and investigating potential anomalies typically consumes resources (whether human or instrumental) and false positives, at any anomaly threshold, must be minimised.
The RWS is defined by ranking the objects according to their degree of anomalousness (from high to low) as identified by an algorithm. Here is a usersupplied integer (the expected number of anomalies in the dataset). The RWS score is then computed as the weighted sum:
(12) 
where:
(13) 
Note that this gives (linearly) more weight to correctly identifying anomalies at the top of the ranks (with low values of ) compared to lower down the list. In equation 12, is an indicator variable: if the th object is an outlier, and otherwise. is a normalisation factor: . This means the RWS score has a possible range of , where implies that no true outliers were found in the most anomalous objects ranked by the algorithm, while an RWS score of would mean that \textcolorblackall most anomalous objects identified by the algorithm were in fact outliers. The value of must be chosen on a per problem basis, and be kept consistent across the various algorithms being considered to allow fair comparison. In section 4 we use this metric along with several other commonly used metrics to gauge algorithm performance. We discuss these metrics in Appendix C.
4 Evaluation of BADAC: Gaussian Case
To illustrate and test the performance of BADAC, we simulate a number of onedimensional datasets and compare results with multiple metrics including the RankWeighted Score (RWS) introduced in section 3, that is optimised for anomaly detection.
4.1 Simulations
We simulate data from arbitrary mathematical functions. We use two mathematical functions to build two “normal” classes and use three other functions as anomalies. Each function has parameters which, when generating the data, are randomly drawn from a Gaussian distribution. The class functions and their corresponding parameter distributions are given in table 1.
Class label  Type  Functional Form  Parameter distributions  

0  Inlier 


1  Inlier 


2  Outlier  if , else 


3  Outlier 


4  Outlier 


For each experiment, we generate curves of roughly equal number of objects from class 0 and 1 as training data. In the test data, we add 1% outliers from classes 2, 3 and 4. Figure 2 illustrates some randomly drawn objects from the training and test sets (equal numbers from each class).
4.1.1 Experiment 1: Gaussian errors
We use the framework of section 4.1 to create a variety of experiments to test our anomaly detection and classification algorithm. Here we simulate the data as described in section 4.1 with uncorrelated Gaussian errors on all data points. The standard deviation of the underlying noise distribution depends on the class, and is given by: and . This experiment is the ideal case in which the noise distribution used for generating the simulated data is the same as that in the mathematical formulation of equation 7.
4.1.2 Experiment 2: Compact Anomalies
In this experiment, we test BADAC’s ability to detect curves with a compact anomaly embedded somewhere in them. We use one of the base inlier classes described in experiment 1, the sine curve (class 0), and place on top of it a narrow Gaussian. We randomly draw the parameters of the sine curve from the distribution described in table 1 and draw the parameters of the compact anomalies as described in table 2. An example of a compact anomaly is shown in figure 3.
Class label  Type  Functional Form  Parameter distributions  

0  Inlier 


1  Inlier 


2  Outlier 


3  Outlier 

4.2 Comparison of algorithm performance
We assess the performance of our algorithm on the simulated data discussed in section 4.1. We then compare our algorithm to a series of benchmark algorithms, namely IsolationForest (Liu et al. 2008) and Local Outlier Factor (LOF) (Breunig et al. 2000) for anomaly detection, and random forests (Breiman and Schapire 2001) for classification.
We use sklearn (Pedregosa et al. 2011) implementations for all of the benchmark algorithms we compare against BADAC. For anomaly detection, all algorithms receive only the input training data, and the percentage of outliers of . For classification with random forests, we set the input parameter n_estimators=1000. \textcolorblackThere are unsupervised implementations of IsolationForest and LOF, but here we consider the supervised methods only.
BADAC  IsolationForest  LOF  

MCC  AUC  RWS  MCC  AUC  RWS  MCC  AUC  RWS  
Gaussian  0.95  0.99  0.99  0.00  0.89  0.02  0.83  0.97  0.96 
Comp. Anom.  0.41  0.91  0.59  0.11  0.80  0.14  0.44  0.90  0.63 
BADAC  Random Forests  

Gaussian Noise  99.02  98.66 
Compact Anomalies  95.51  95.18 
4.2.1 Gaussian noise
Here we illustrate the performance of BADAC, as well as the benchmark algorithms, on the data discussed in section 4.1.1 with Gaussian measurement error. We use the formalism shown in section 2 to provide two probabilities, and , which are the unnormalised probabilities of belonging to class 0 and class 1 respectively. These probabilities are plotted in figure 4.
blackPlotting the unnormalised probabilities is useful for visualising the decision boundary that separates both the known classes and anomalies. It also does not require us to make any assumptions about the nature of the anomalies we expect to see. \textcolorblackHowever, to make use of these probabilities in an analysis pipeline, they must be normalised. In order to normalise the probabilities, we compute the Bayesian evidence. The evidence in the case where one is interested in classification only would be . In the case where anomaly detection is of interest as well, the evidence is , where \textcolorblackin this case, we choose to evaluate using a tophat likelihood equal to over the range of , and equal to otherwise. Here we choose and to cover twice the observed range of the input data.
If we bin the normalised probabilities for a single class, we can measure whether or not they are calibrated. It is a well known problem that many machine learning algorithms give uncalibrated probabilities that do not correspond to the true probability of an object \textcolorblackbelonging to a certain class. The reliability of probabilities can be investigated by plotting a probability calibration curve: the output probabilities from the algorithm \textcolorblackfor a selected class only are binned and compared with the actual fraction of objects in that bin belonging to the class. We show this result for classification only for type 1 objects in figure 5, and compare the results of BADAC with those of random forests.
We show the ROC curves (see Appendix C for a description of ROC curves) for BADAC as well as LOF and IsolationForest in figure 6 in order to compare algorithm performance in anomaly detection. A summary of algorithm performance from all algorithms on all the datasets we consider in both anomaly detection and classification is shown in tables 3 and 4.
4.2.2 Compact anomaly Performance
In this section we illustrate the performance of BADAC as well as the benchmark algorithms we consider on the compact anomaly data discussed in section 4.1.2. It should be noted that the compact anomaly data is generated with Gaussian noise, which is the type of noise we assume in this implementation of our formalism\textcolorblack, and is also the same kind of noise as the data described in section 4.1.1. This means we would expect the algorithms to have similar performance in classification only in this section as in section 4.2.1. For this reason, we don’t discuss classification performance of any of the algorithms on the compact anomaly dataset. We proceed in the exact same manner as we did in section 4.2.1, except here we test how robust the algorithms are to different types of anomalies (compact ones).
The importance of an algorithm being able to detect compact anomalies is twofold. Firstly, compact anomalies are often interesting in science when one wishes to measure or detect aberrant behaviour of known sources. Secondly, an algorithm’s ability to detect compact anomalies demonstrates its overall sensitivity in measuring small variations within data.
As we can see from figure 7, the outlier data has significant overlap with type 0 data. This is because we create compact anomalies on top of type 0 data only. The varying scale/amplitude to the anomaly is responsible for where the outlier data is positioned on the log()axis (further left indicates a more anomalous object). Outlier points with high log() values are likely associated with compact anomalies with very low amplitudes.
We show the ROC curves for BADAC as well as LOF and IsolationForest in figure 8 in order to compare algorithm performance in anomaly detection. Under the AUC metric, BADAC performs the best in this case. LOF is almost as good, and actually performs better under the MCC and RWS metrics. A summary of algorithm performance from all algorithms on all the datasets we consider in both anomaly detection and classification is shown in tables 3 and 4.
4.3 Computational performance
It is difficult to give a “fair” comparison of computational performance between BADAC, random forests, LOF and IsolationForest, \textcolorblacksince unlike the benchmark algorithms we compare it with, our algorithm has no distinct training and testing phases. This means that these algorithms \textcolorblackscale very differently (depending on amount of training and test data available). For example, for a dataset with training and test examples, the computational time required for \textcolorblackrandom forests, IsolationForest and LOF would increase as . For our algorithm, the computational time required increases as . In fact the computational time increases linearly as a function of . For an even comparison of computational performance, we have compared the same number of training and testing samples as were used in section 4.2 (15000 training samples and 15000 test samples). We quote the total time (training time + testing time) in table 5. \textcolorblackWe note, however, that there is ample room for optimisation and parallelisation in our BADAC code and the timings could be considerably improved.
Algorithm  Training time (s)  Testing time (s)  Total time (s) 

Random Forests  96.30  2.94  99.24 
IsolationForest  1.62  1.21  2.83 
Local Outlier Factor  13.21  27.25  40.46 
BADAC      1281.82 
5 Breaking BADAC
So far, i.e. in the case of Gaussian errors, BADAC has \textcolorblackoutperformed random forests, LOF and IsolationForest under most metrics we consider. This is perhaps not surprising since BADAC was designed to use the extra information available, namely that there are uncorrelated errors on the data that are Gaussian distributed. Here we try a series of more challenging tests where we use the uncorrelated Gaussian BADAC formalism, but test it on data that do not obey this model.
5.1 Experiment 3: NonGaussian errors
Here we simulate the data exactly as in Experiment 1, however we use nonGaussian errors instead of the Gaussian errors of Experiment 1. For 80% of the y values (randomly selected) of any given simulated object, the noise is drawn from a Gaussian distribution with standard deviation as described in section 4.1, meaning the scatter matches the error bar. However, for the remaining 20% of the values, the noise is drawn from a Gaussian distribution of five times the width, resulting in scatter dramatically underestimated by the reported error bar.
5.2 Experiment 4: Correlated Gaussian Noise
To test the sensitivity of our algorithm to the uncorrelated noise assumption, we generate correlated Gaussian data. We choose to only correlate Class 0, according to a “wedding cake” covariance matrix (based on Kim and Linder (2011) and Knights et al. (2013)):
(14) 
where
(15) 
where and are indices of the data (in order of value) and is the bin to which the object belongs. To produce the steplike structure, (where “” indicates the floor function, rounding down to the nearest integer). We use for each in this work. The result is that the data are correlated in such a way that the points at higher values are more correlated than the lower ones.
5.3 Results for nonGaussian noise and correlated Gaussian noise
Here we present the results for both classification and anomaly detection for the data discussed in sections 5.1 and 5.2 with both nonGaussian and correlated Gaussian noise. It should be noted that we still use equation 7 to determine classification/anomaly detection probabilities, despite the fact that the data does not have Gaussian uncorrelated noise as equation 7 assumes. Thus we must expect BADAC performance to decrease; the question is how much?
In order to normalise the probabilities shown in figure 10, we compute the evidence, once again. As before, we choose to evaluate using a tophat likelihood equal to over the range of , and equal to otherwise. In this case, naively choosing the width of the tophat does not work, as the model used to determine and is incorrect, and hence returns low probabilities. As a result, is a much higher probability than and , even for inlier data, when the incorrect model for the noise is used. To get around this we determine the height of the tophat likelihood, , by equating it to computed for the object corresponding the 99th percentile. What this means is, we enforce that the algorithm labels the most anomalous 1% of objects (as determined by the algorithm) as outliers. This is still a fair comparison with the benchmark algorithms, as both IsolationForest and LOF receive the percentage contamination of 1% as an input parameter. We discuss how to extend this method to be suitable for modelling data with different types of noise in section 2.3.
As we can see from the scatter of classification probabilities shown in figure 10, there is more overlap of different object types present than in the uncorrelated Gaussian noise case we considered. Additionally, in the correlated case, there is a significant bias introduced due to the noise from only one of the classes being correlated. Since the model does not favour fitting this class, the classification probabilities are not reliable. This is illustrated both by figure 10, where the diagonal dashed line shows where type 0 and type 1 clusters should be separated, and figure 11, where we can see that the classification probabilities are far from calibrated. This is due to the fact that we use a uncorrelated Gaussian model for the noise, despite the fact this model is wrong.
We show the ROC curves for BADAC as well as LOF and IsolationForest in figure 12 in order to gauge performance in anomaly detection. In these two cases, it is surprising BADAC performs best, since we don’t correctly model the noise. Random forests however achieves a higher accuracy in classification in these two cases. A summary of algorithm performance from all algorithms on all the datasets we consider in both anomaly detection and classification is shown in tables 3 and 4.
BADAC  IsolationForest  LOF  

MCC  AUC  RWS  MCC  AUC  RWS  MCC  AUC  RWS  
NonGauss.  0.84  0.99  0.96  0.06  0.84  0.10  0.16  0.84  0.18 
Corr. noise  0.68  0.97  0.84  0.01  0.70  0.03  0.61  0.96  0.76 
BADAC  Random forests  

NonGaussian noise  97.71  98.14 
Correlated Gaussian noise  68.88  96.72 
6 Extensions
6.1 Learning Subclasses
The BADAC algorithm we have presented can classify and identify anomalies. It can also add new classes as needed when run in online learning mode, as discussed in section 2.4. The following \textcolorblackexample illustrates how BADAC can potentially identify subclasses of existing classes. But first, what do we mean by a subclass? Here a subclass corresponds to a large intrinsic variation in a class inconsistent with measurement errors.
In equation 7 we explicitly allowed each instance of a class to have its own true value, , for each feature . However a very homogeneous class will not require this flexibility, and will only require one latent variable for each feature . Hence, we can define the number of subclasses to be the number of different latent variables per feature , required by the data in class .
Let us define a hierarchy of models, where is the number of subclasses of class and also the number of true latent variables per feature. Using all the data from class we infer the latent variables . We can then select the preferred number of subclasses by maximising the Bayesian evidence after marginalising over . The subtlety is that in the models with more than one subclass, we do not know, a priori, which subclass an instance may belong to. We solve this by introducing new latent parameters for the subclass which allows each instance to belong to any of the subclasses and then writing the likelihood as a mixture model as was done in (Kunz et al. 2007; Hlozek et al. 2012; Newling, J., Bassett, B., Hlozek, R. et al. 2012; Knights et al. 2013). Finally we marginalise over these subclass labels to compute the model evidence and find the preferred number of subclasses.
6.2 Dealing with Missing Data
In our discussion so far we have assumed the idealised case that we have data at the same points for all training and test data. This is clearly unrealistic and an important limitation. How do we deal with missing data?
There are two approaches. The first, more conservative, approach is to sample from the prior distribution with the error given by the prior distribution for that class. If the data is missing from test data, the missing data can be sampled as above, but in each case we use the prior for the class that it is being compared against.
The second approach is to use some form of interpolation. A natural approach is to use Gaussian processes, since these give both an expected value and Gaussian error at the missing data. Gaussian processes need a covariance function which encodes how rapidly the underlying class varies. As a result each class will have their own Gaussian process and covariance function which should be learned from the training data. Test data should then be compared to training classes using the appropriate Gaussian process for each of the training classes.
6.3 Template construction and speeding up BADAC
As shown in table 5 the full BADAC calculation is much slower than other classification or anomaly detection algorithms. This stems from the pairwise comparison of all data in the test dataset with all data in the training set, something which becomes computationally infeasible for very large amounts of training data.
Fortunately in the limit of large training data we can expect to sample the class distribution well and therefore we can instead create a single template for each class (or as an intermediate step, each subclass), which will dramatically speed up BADAC, though at the cost of having a nonGaussian spread in general.
How should we compute class or subclass templates? An elegant solution is to fit a single Gaussian process to the data of each class (Williams and Barber 1998). This has the advantage of automatically dealing with any missing data, but will not deal with nonGaussian or multimodel intraclass variability. To get around this limitation one could use a Kernel Density Estimate summed over the training data in each class at each value of the independent variable (or in bins). However, since this is a still a sum over all the training data examples it will be slow. To speed it up we need some approximation to the KDE sum.
Probably the simplest approximation  which also preserves the Gaussian distribution  is to use the inversevariance estimator with standard deviation :
(16)  
(17) 
If the intraclass variability is highly nonGaussian then it would be better to fit a more appropriate lowdimensional distribution to describe this to create the template. We then effectively reduce to the formalism described in section 2.5.
6.4 Intraclass Variability
In the formulation presented earlier we assumed that the variability in the observed data for a given class was given by the measurement errors on the observed data, i.e. that the intraclass variability was small. If this is not the case one can build more complex models for the intraclass variability. The simplest is to fit for a global standard deviation, , at training for each class (for example by using a validation subset of the training data). The intraclass variability model can be made arbitrarily complex and the Bayesian evidence could be used to select the best model.
6.5 Calibration and Zeropoint Issues
In applying BADAC to real examples there may be systematic differences in the data between test and training. This could, for example, be because the data comes from different instruments or is taken under different conditions. As an example, consider applying BADAC to images where there may be largescale calibration differences across the images. How can one deal with such effects which will invalidate the use of the simple versions of the BADAC formalism presented earlier, along with most anomaly and classification algorithms?
In the spirit of the Bayesian approach, one way to deal with such largescale artefacts is to model their effects and introduce nuisance parameters , with their own prior distributions , which are then marginalised over before classification or anomaly ranking. Intuitively this means that the algorithm will exploit the freedom implicit in the calibration model to try to fit each test curve to the training data and will only highlight as outliers those data which are poor fits no matter the calibration freedom.
A related problem is the issue of zeropoints, which occurs if the examples in training and test data are not all aligned on the axis. This is common when working with time series data. In principle this can be dealt with in a similar way, by allowing each data example to have an extra translation parameter which allows one to shift all points in the example left or right. One must then marginalise over this nuisance parameter when doing the fits.
Depending on the exact nature of the data these translation parameters may be wellconstrained. For example, one may be able to align all examples approximately, in which case one can put priors on the translation parameters. However, the zeropoint issue does raise significant complications. For each pair in the training and test sets one should in principle allow a translation parameter. This leads to new nuisance parameters where are the number of training and test set examples respectively. Unless the marginalisation can be performed analytically this will typically be prohibitively expensive.
A cheaper alternative is to prealign all the training data by class. Now there is only one translation nuisance parameter per class and per instance in the test set. However, the alignment of the training data will not be perfect in general. This can be handled by adding an error bar to each data point in the training data, corresponding to small errors in the alignment of the data. These errors are perfectly correlated however (since the translation affects all data in the same way) and the BADAC formalism would need to be extended to account for such correlations, as done in e.g. Heavens et al. (2014); Roberts et al. (2017).
7 Conclusions
We have presented a novel statistically robust joint anomaly detection and classification method, Bayesian Anomaly Detection And Classification (BADAC), that is designed to take advantage of any knowledge of the underlying noise distribution in the training and test data. Although we perform tests for the case of Gaussian distributed data, our formalism is general.
blackUsing simulated onedimensional data, we test the classification and anomaly detection capabilities of BADAC. We make use of several metrics, including our novel RankWeightedScore that rewards algorithms for ranking more anomalous objects above those that have been commonly seen. We find that in the case where the correct noise model is known BADAC outperforms random forests at classification and both IsolationForest and local outlier factor (LOF) at anomaly detection, due to its ability to correctly exploit uncertainty information. In the case of compact anomalies, which could emulate noisy spikes in data, we find that BADAC’s performance is comparable to LOF and superior to IsolationForest. We demonstrate how BADAC produces calibrated classification probabilities, \textcolorblackwhich is crucial if a machine learning algorithm is to be incorporated into a precise, scientific analysis pipeline.
blackWe performed tests to investigate the degradation of performance if the assumptions of BADAC are violated. Interestingly, we find BADAC still outperforms the other anomaly detection algorithms in the presence of nonGaussian and correlated noise. However we find its classification performance degrades, especially in the correlated case. We also note that with an incorrect noise model, the probabilities of BADAC are no longer guaranteed to be calibrated. However, if the structure of the noise is known, the correct noise model can be incorporated into the BADAC likelihood.
While BADAC provides excellent performance by exploiting the extra information about the underlying noise distributions, the computational limitations discussed in section 4.3 mean that it does not scale well to large training datasets. In this case one must either use prototype templates to represent the classes (e.g. through Gaussian processes) or parameterise the data, to speed up classification and anomaly detection with BADAC.
blackWe find ourselves in an era of exponentially increasing data volume, driving the need for machine learning algorithms. However in the physical sciences there is equal need for accurate propagation of uncertainties from all parts of an analysis pipeline, including any machine learning algorithms. With its statistically principled approach to both classification and anomaly detection, BADAC is able to provide believable and interpretable probabilities in the presence of measurement uncertainties, as required by high precision scientific analysis.
Acknowledgements.
We thank Alireza Vafaei Sadr, Martin Kunz and Boris Leistedt for discussions and comments. We acknowledge the financial assistance of the National Research Foundation (NRF). Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF. This work is partially supported by the European Research Council under the European Community’s Seventh Framework Programme (FP7/20072013)/ERC grant agreement no 306478CosmicDawn.References
 Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §1.
 An introduction to kernel and nearestneighbor nonparametric regression. The American Statistician 46 (3), pp. 175–185. Cited by: §2.2.
 Random forests. In Machine Learning, pp. 5–32. External Links: Document Cited by: §4.2.
 LOF: identifying densitybased local outliers. SIGMOD Rec. 29 (2), pp. 93–104. External Links: ISSN 01635808, Link, Document Cited by: §4.2.
 Advances in knowledge discovery and data mining. U. M. Fayyad, G. PiatetskyShapiro, P. Smyth and R. Uthurusamy (Eds.), pp. 153–180. External Links: ISBN 0262560976, Link Cited by: §1.
 A Bayesian Approach to Classifying Supernovae With Color. ArXiv eprints. External Links: 0909.3652 Cited by: §1.
 The relationship between precisionrecall and roc curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, New York, NY, USA, pp. 233–240. External Links: ISBN 1595933832, Link, Document Cited by: §C.1.
 Bayesian methods for nonlinear classification and regression. Vol. 386, John Wiley & Sons. Cited by: §1.
 On the optimality of the simple bayesian classifier under zeroone loss. Machine learning 29 (23), pp. 103–130. Cited by: §1.
 Classification regions of deep neural networks. ArXiv eprints. External Links: 1705.09552 Cited by: §1.
 Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper). Bayesian Anal. 1 (3), pp. 515–534. External Links: Document, Link Cited by: §A.2.
 Bayesian Data Analysis: Straightline fitting. In Maximum Entropy and Bayesian Methods: Cambridge, England, 1988, J. Skilling (Ed.), pp. 511–518. External Links: ISBN 9789401578608, Document Cited by: §A.2.
 Generalized Fisher matrices. Mon. Not. Roy. Astron. Soc. 445 (2), pp. 1687–1693. External Links: Document, 1404.2854 Cited by: §6.5.
 Photometric supernova cosmology with beams and sdssii. The Astrophysical Journal 752 (2), pp. 79. Cited by: §1, §6.1.
 Analysis of a complex of statistical variables into principal components.. Journal of educational psychology 24 (6), pp. 417. Cited by: §2.5.
 Correlated Supernova Systematics and Ground Based Surveys. JCAP 06 (020). External Links: 1102.1992v1 Cited by: §5.2.
 Extending BEAMS to incorporate correlated systematic uncertainties. Journal of Cosmology and Astroparticle Physics 1, pp. 039. External Links: 1205.3493, Document Cited by: §5.2, §6.1.
 Bayesian Estimation Applied to Multiple Species: Towards cosmology with a million supernovae. Phys. Rev. D75, pp. 103508. External Links: Document, astroph/0611004 Cited by: §1, §6.1.
 Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, Washington, DC, USA, pp. 413–422. External Links: ISBN 9780769535029, Link, Document Cited by: §4.2.
 A wavelet tour of signal processing. 3rd edition, Academic Press, Burlington. External Links: LCCN 99065087 Cited by: §2.5.
 Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)  Protein Structure 405 (2), pp. 442–451. External Links: Link, Document Cited by: §C.2.
 Parameter estimation with Bayesian estimation applied to multiple species in the presence of biases and correlations. Monthly Notices of the Royal Astronomical Society 421 (2), pp. 913–925. External Links: ISSN 13652966, Link, Document, 1110.6178 Cited by: §6.1.
 Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pp. 625–632. Cited by: §1.
 Predicting good probabilities with supervised learning. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, New York, NY, USA, pp. 625–632. External Links: ISBN 1595931805, Link, Document Cited by: §1.
 Scikitlearn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.2.
 zBEAMS: A unified solution for supernova cosmology with redshift uncertainties. JCAP 1710 (10), pp. 036. External Links: Document, 1704.07830 Cited by: §6.5.
 Photometric Type Ia Supernova Candidates from the Threeyear SDSSII SN Survey Data. The Astrophysical Journal 738, pp. 162. External Links: 1107.5106, Document Cited by: §1.
 The proof and measurement of association between two things. The American Journal of Psychology 15 (1), pp. 72–101. External Links: ISSN 00029556, Link Cited by: footnote 4.
 Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (12), pp. 1342–1351. Cited by: §6.3.
Appendix A Detailed derivation of BADAC
In this section we go through the details of the formalism we discussed in section 2. We begin with the simple case of one training and test datapoint, and then add training points. We finish by generalising the formalism to objects with datapoints each.
a.1 Modelling one training point
Our goal is to determine the probability of an object belonging to class . This is calculated assuming that we have an observation of another object , which we know is from class . The probability of belonging to class , given can hence be written as . By implementing Bayes’ theorem, we can arrive at equation 18:
(18) 
If we now assume that we have a measurement error associated with both and , we cannot expand equation 18 to calculate the probability as is. To solve this we introduce a latent parameter , which we wish to marginalise over. Here is the “true” value of , as though we were able to know the correct value of exactly. By application of the product rule, we obtain the result shown in equation 21.
(19)  
(20)  
(21) 
We illustrate this hierarchical model graphically in figure 13. If we assume and are statistically independent of one another, then we can drop the dependence on from terms that depend on as well i.e.: . We can do this since we assume that represents exactly. We then end up with the result shown in equation 22.
(22) 
We assume the measurement errors associated with and are Gaussian, which makes the probability terms and Gaussian likelihoods. We further assume a flat prior on the distribution of true values of , . These assumptions allow us to perform the integration analytically. We can now generalise this method to work for training/known objects.
a.2 Modelling training instances
Here we begin with Bayes’ theorem in the same manner as before. The class probability is now dependent on the unknown object , as well as known objects where :
(23) 
Now we introduce latent variables , with , such that are the true values corresponding to each object . By marginalising over these latent variables, we arrive at equation 24.
(24) 
We can apply the product rule, where we group all variables together. We can do this since .
(25) 
Now we apply product rule again. We also skip a step we showed in section A.1, where we drop the dependence of on . This does not apply if observations of different objects are correlated, but for clarity, we ignore this possible source of correlation.
(26) 
In addition to assuming the objects are statistically independent of one another, we assume they are independent of the true values , where as well. This means that . We can therefore express equation 26 as follows:
(27) 
Now we need to find a way to calculate the value of . The probability of observing object given one is a simple likelihood. We use a kernel density estimate for the probability of observing given a number of variables . This means that . Now we have:
(28) 
is simply a prior on the true measurements. Since we don’t aim to place any constraints on the intraclass variability of class , we choose a flat prior here (). It is important to note that the choice of an improper prior on hierarchical parameters can introduce biases on parameter estimates (Gull (1989) and Gelman (2006)). One could easily select a more informative prior if there was additional information about the data available. For ease, the prior should be analytically integrable with the form of the probabilities . It would however be possible to resort to numerical integration if this were not the case. Here we assume uncorrelated Gaussian measurement error in the determination of the values of and , so equation 28 can be expressed as the following:
(29) 
where is the uncertainty on and is that on the ’th training point . Note that we can switch the sum and integral, and then evaluate the integral of the product of Gaussian likelihoods. However, when we evaluate this integral, it simplifies in the case where the set of are uncorrelated. The simplification also means we can integrate over instead of .
(30) 
Now it is easy to complete the square and do the integration analytically. In equation 31 we simply state the result:
(31) 
where and .
Equation 31 only handles objects with a single datapoint. For objects with multiple datapoints, we introduce , where it is a product of the probabilities for the single datapoints assuming they are uncorrelated. This means we have:
(32) 
where and , and and are all sets over the index .
Appendix B Correlated data
It is possible to take correlated data into account in the formalism we’ve developed, if it is known how the data are correlated. Taking into account correlated data allows for better classification accuracy. Here we will talk about two types of correlation. Firstly, correlations may exist between different features in the same instance. Here we’ll refer to this as intrainstance correlation. Then we have correlations between different instances, which we refer to as interinstance correlation. The following subsections show where accounting for these correlations would enter the formalism we developed in sections A.1 and A.2.
b.1 Intrainstance correlation
Intrainstance correlations often arise in science. These are also the type of correlations we assess in section 5.3, albeit without correctly accounting for the correlated nature of the data in order to test the robustness of our formalism. Here we show how one would correctly account for correlated Gaussian noise using BADAC.
Continuing from equation 28, we get the following result if we assume that , and are normalised Gaussian distributions, and statistically independent of one another (we explore what happens when this is not the case in section B.2):
(33)  
(34) 
Evaluating when there is correlated Gaussian noise is conceptually simple through the introduction of a covariance matrix .In practice, estimating can be difficult. One approach is to parametrise the covariance matrix and marginalise over these nuisance parameters. Assuming we know the covariance matrix, we can evaluate for the th instance as follows:
(35) 
where , and is the covariance matrix of the observed instance , where has features. In the same way, we can evaluate for the th instance as follows:
(36) 
and are the intrainstance covariance matrices of the observed training instances .
b.2 Interinstance correlation
A possible source of correlations is that the training curves could be statistically dependent on one another. This would mean that for a measurement of an object , another object