Guiding New Physics Searches with Unsupervised Learning
Abstract
We propose a new scientific application of unsupervised learning techniques to boost our ability to search for new phenomena in data, by detecting discrepancies between two datasets. These could be, for example, a simulated standardmodel background, and an observed dataset containing a potential hidden signal of New Physics. We build a statistical test upon a test statistic which measures deviations between two samples, using a Nearest Neighbors approach to estimate the local ratio of the density of points. The test is modelindependent and nonparametric, requiring no knowledge of the shape of the underlying distributions, and it does not bin the data, thus retaining full information from the multidimensional feature space. As a proofofconcept, we apply our method to synthetic Gaussian data, and to a simulated dark matter signal at the Large Hadron Collider. Even in the case where the background can not be simulated accurately enough to claim discovery, the technique is a powerful tool to identify regions of interest for further study.
Keywords:
Dark matter, Machine Learning, Statistical Methods, HadronHadron scattering (experiments)SISSA 27/2018/FISI ^{†}^{†}institutetext: SISSA, via Bonomea 265, 34136 Trieste, Italy ^{†}^{†}institutetext: INFN Sezione di Trieste, via Bonomea 265, 34136 Trieste, Italy
1 Introduction
The problem of comparing two independent data samples and looking for deviations is ubiquitous in statistical analyses. It is of particular interest in physics, when addressing the problem of searching for new phenomena in data, to compare observations with expectations to find discrepancies. In general, one would like to assess (in a statistically sound way) whether the observed experimental data are compatible with the expectations, or there are signals of the presence of new phenomena.
In highenergy physics, although the Standard Model (SM) of particle physics has proved to be extremely successful in predicting a huge variety of elementary particle processes with spectacular accuracy, it is widely accepted that it needs to be extended to account for unexplained phenomena, such as the dark matter of the Universe, the neutrino masses, and more. The search for New Physics (NP) beyond the SM is the primary goal of the Large Hadron Collider (LHC). The majority of NP searches at the LHC are performed to discover or constrain specific models, i.e. specific particle physics extensions of the SM. Relatively less effort has been devoted to design and carry out strategies for modelindependent searches for NP Aaltonen:2007dg (); Aaltonen:2008vt (); CMSPASEXO08005 (); CMSPASEXO10021 (); Choudalakis:2011qn (); ATLASCONF2017001 (); Asadi:2017qon (); DAgnolo:2018cun (). At the current stage of no evidence for NP in the LHC data, it is of paramount importance to increase the chances of observing the presence of NP in the data. It may even be already there, but it may have been missed by modelspecific searches.
Recently, there has been growing interest in applying Machine Learning (ML) techniques to highenergy physics problems, especially using supervised learning (see e.g. Refs. Kuusela:2011aa (); Cranmer:2015bka (); Baldi:2016fzo (); Hernandez2016 (); Caron:2016hib (); Bertone:2016mdy (); Dery:2017fap (); Louppe:2017ipp (); Cohen:2017exh (); Metodiev:2017vrx (); Chang:2017kvc (); Paganini:2017dwg (); Komiske:2018oaa (); Fraser:2018ieu (); Andreassen:2018apy (); Brehmer:2018eca (); Brehmer:2018kdj (); Collins:2018epr (); Brehmer:2018hga () and in particular the recent work of Ref. DAgnolo:2018cun () with which we share some ideas, although with a very different implementation). On the other hand, applications of unsupervised learning have been relatively unexplored Kuusela:2011aa (); Andreassen:2018apy (). In unsupervised learning the data are not labeled, so the presence and the characteristics of new phenomena in the data are not known a priori. One disadvantage of unsupervised learning is that one cannot easily associate a performance metric to the algorithm. Nevertheless, unsupervised methods such as anomaly (or outlier) detection techniques, or clustering algorithms, provide powerful tools to inspect the global and local structures of highdimensional datasets and discover ‘neverseenbefore’ processes.
In this paper, we propose a new scientific application of unsupervised learning techniques to boost our ability to search for new phenomena in data, by measuring the degree of compatibility between two data samples (e.g. observations and predictions). In particular, we build a statistical test upon a test statistic which measures deviations between two datasets, relying on a NearestNeighbors technique to estimate the ratio of the local densities of points in the samples.
Generally speaking, there are three main difficulties one may face when trying to carry out a search for the presence of new processes in data: (1) a model for the physics describing the new process needs to be assumed, which limits the generality of the method; (2) it is impossible or computationally very expensive to evaluate directly the likelihood function, e.g. due to the complexity of the experimental apparatus; (3) a subset of relevant features needed to be extracted from the data, otherwise the histogram methods may fail due to the sparsity of points in highdimensional bins.
A typical search for NP at LHC suffers from all such limitations: a model of NP (which will produce a signal, in the highenergy physics language) is assumed, the likelihood evaluation is highly impractical, and a few physically motivated variables (observables or functions of observables) are selected to maximize the presence of the signal with respect to the scenario without NP (the socalled background).
Our approach overcomes all of these problems at once, by having the following properties:

it is modelindependent: it aims at assessing whether or not the observed data contain traces of new phenomena (e.g. due to NP), regardless of the specific physical model which may have generated them;

it is nonparametric: it does not make any assumptions about the probability distributions from which the data are drawn, so it is likelihoodfree;

it is unbinned: it partitions the feature space of data without using fixed rectangular bins; so it allows one to retain and exploit the information from the full highdimensional feature space, when single or few variables cannot.
The method we propose in this paper is particularly useful when dealing with situations where the distribution of data in feature space is almost indistinguishable from the distribution of the reference (background) model.
Although our main focus will be on highenergy particle physics searches at the LHC, our method can be successfully applied in many other situations where one needs to detect incompatibilities between data samples.
The remainder of the paper is organized as follows. In Section 2 we describe the details of the construction of our method and its properties. In Section 3 we apply it to case studies with simulated data, both for synthetic Gaussian samples and for a more physicsmotivated example related to LHC searches. We outline some directions for further improvements and extensions of our approach, in Section 4. Finally, we conclude in Section 5.
2 Statistical test of dataset compatibility
In general terms, we approach the problem of measuring the compatibility between datasets sampled from unknown probability densities, by first estimating the probability densities and then applying a notion of functional distance between them. The first task is worked out by performing density ratio estimation using Nearest Neighbors, while the distance between probability densities is chosen to be the KullbackLeibler divergence KullbackLeibler (). We now describe our statistical test in more detail.
2.1 Definition of the problem
Let us start by defining the problem more formally. Let and be two independent and identically distributed dimensional samples drawn independently from the probability density functions (PDFs) and , respectively:
(1)  
(2) 
We will refer to as a ‘benchmark’ (or ‘control’ or ‘reference’) sample and to as a ‘trial’ (or ‘test’) sample. The space where the sample points live will be referred to as ‘feature’ space.
The primary goal is to check whether the two samples are drawn from the same PDF, i.e. whether . In other words, we aim at assessing whether (and to what significance level) the two samples are compatible with each other. More formally, we want to perform a statistical test of the null hypothesis versus the alternative hypothesis .
This problem is wellknown in the statistics literature as a twosample (or homogeneity) test, and many ways to handle it have been proposed. We want to construct a statistical hypothesis test of dataset compatibility satisfying the properties 13 outlined in the introduction.
First, the samples are going to be analyzed without any particular assumptions about the underlying model that generated them (property 1); our hypothesis test does not try to infer or estimate the parameters of the parent distributions, but it simply outputs to what degree the two samples can be considered compatible.
Second, if one is only interested in a location test, such as determining whether the two samples have the same mean or variance, then a test is often adopted. However, we assume no knowledge about the original PDFs, and we want to check the equality or difference of the two PDFs as a whole; therefore, we will follow a nonparametric (distributionfree) approach (property 2).
Third, we want to retain the full multidimensional information of the data samples, but highdimensional histograms may result in sparse bins of poor statistical use. The popular KolmogorovSmirnov method only works for onedimensional data, and its multidimensional extensions are based on binning the data. Alternative nonparametric tests like the Cramérvon MisesAnderson test or the MannWithney test require the possibility of ranking the data points in an ordinal way, which may be illdefined or ambiguous in highdimensions. Thus, we will employ a different partition of feature space not based on fixed rectangular bins (property 3), which allows us to perform a nonparametric twosample test in high dimensions.
So, in order to construct our hypothesis test satisfying properties 13, we need to build a new test statistic and construct its distribution, as described in the next sections.
2.2 Test statistic
Since we are interested in measuring the deviation between the two samples, it is convenient to define the ratio of probability densities to observe the points in the two samples, in the case (numerator) relative to the case (denominator)
(3) 
The above quantity may also be thought of as a likelihood ratio. However, as we are carrying out a nonparametric test, we prefer not to use this term to avoid confusion.
Now, since the true PDFs are not known, we follow the approach of finding estimators for the PDFs and evaluate the ratio on them
(4) 
We then define our test statistic TS over the trial sample as
(5) 
where is the size of the trial sample. This test statistic will take values close to zero when is true, and far from zero (positively or negatively) when is false.
The test statistic defined in Eq. (5) is also equal to the estimated KL divergence between the estimated PDFs of trial and benchmark samples, with the expectation value replaced by the empirical average (see Eq. (20)). Of course, other choices for the test statistic are possible, based on an estimated divergence between distributions other than the KL divergence, e.g. the Pearson squarederror divergence. The exploration of other possibilities is beyond the scope of this paper and is left for future work.
Ultimately, we want to conclude whether or not the null hypothesis can be rejected, with a specified significance level (e.g. ), therefore we need to associate a value to the null hypothesis, to be compared with . To this end, we first need to estimate the PDFs from the samples, then compute the test statistics observed on the two given samples. Next, in order to evaluate the probability associated with the observed value of the test statistic, we need to reconstruct its probability distribution under the null hypothesis , and finally compute a twosided value of the null hypothesis.
The distribution of the test statistic is expected to be symmetric around its mean (or median), which in general may not be exactly zero as a finitesample effect. Therefore, the twosided value is simply double the onesided value.
A schematic summary of the method proposed in this paper is shown in Figure 1.
In the remainder of this section we will describe this procedure in detail.
2.3 Probability density ratio estimator
We now turn to describing our approach to estimating the ratio of probability densities needed for the test statistic. There exist many possible ways to obtain density ratio estimators, e.g. using kernels SUGIYAMA2011735 () (see Ref. sugiyama_suzuki_kanamori_2012 () for a comprehensive review). We choose to adopt a NearestNeighbors (NN) approach Schilling1986 (); henze1988 (); Wang2005 (); Wang2006 (); Dasu06 (); PerezCruz2008 (); kremer ().
Let us fix an integer . For each point , one computes the Euclidean distance^{1}^{1}1Other distance metrics may be used, e.g. a norm. We do not explore other possibilities here. to the th nearest neighbor of in , and the Euclidean distance to the th nearest neighbor of in . Since the probability density is proportional to the density of points, the probability estimates are simply given by the number of points (, by construction) within a sphere of radius or , divided by the volume of the sphere and the total number of available points. Therefore, the local nearestneighbor estimates of the PDFs read
(6)  
(7) 
(for any ) where is the volume of the unit sphere in . So, the test statistic defined in Eq. (5) is simply given by
(8) 
The NN density ratio estimator described above has been proved to be consistent and asymptotically unbiased Wang2005 (); Wang2006 (); PerezCruz2008 (), i.e. the test statistic TS (8) built from the estimated probability densities converges almost surely to the KL divergence between the true probabilities in the large sample limit .
Two advantages of the NN density ratio estimator are that it easily handles highdimensional data, and its calculation is relatively fast, especially if  trees are employed to find the nearest neighbors. As a disadvantage, for finite sample sizes, the estimator (8) retains a small bias, although several methods has been proposed to reduce it (see e.g. Refs. Wang2005 (); Noh2014BiasRA ()).
The use of NN is also convenient as it allows the partition of the feature space not into rectangular bins, but into hyperspheres of varying radii, making sure they are all populated by data points.
The test statistic TS in Eq. (8), being an estimator of the KL divergence between the two underlying (unknown) PDFs, provides a measure of dataset compatibility. In the construction of TS we have chosen a particular as the number of nearest neighbors. Of course, there is not an a priori optimal value of to choose. In the following analyses we will use a particular choice of , and we will comment on the possibility of extending the algorithm with adaptive in Section 4.2.
Now that we have a test statistic which correctly encodes the degree of compatibility between two data samples, and its asymptotic properties are ensured by theorems, we need to associate a probability with the value of the TS calculated on the given samples, as described in the next section.
2.4 Distribution of the test statistic and value
In order to perform a hypothesis test, we need to know the distribution of the test statistic under the null hypothesis , to be used to compute the value. Classical statistical tests have wellknown distributions of the test statistics, e.g. normal, or Student. In our case, the distribution of TS is not theoretically known, for finite sample sizes. Therefore, it needs to be estimated from the data samples themselves. We employ the resampling method known as the permutation test Edgington (); vanderVaart () to construct the distribution of the TS under the null hypothesis. It is a nonparametric (distributionfree) method based on the idea of sampling different relabellings of the data, under the assumption they are coming from the same parent PDF (null hypothesis).
In more detail, the permutation test is performed by first constructing a pool sample by merging the two samples: , then randomly shuffle (sampling without replacement) the elements of and assign the first elements to , and the remaining elements to . Next, one computes the value of the test statistic on . If one repeats this procedure for every possible permutation (relabelling) of the sample points, one collects a large set of test statistic values under the null hypothesis which provides an accurate estimation of its distribution (exact permutation test). However, it is often impractical to work out all possible permutations, so one typically resorts to perform a smaller number of permutations, which is known as an approximate (or MonteCarlo) permutation test. The TS distribution is then reconstructed from the values of the test statistic obtained by the procedure outlined above.
The distribution of the test statistic under a permutation test is asymptotically normal with zero mean in the large sample limit vanderVaart (), as a consequence of the Central Limit Theorem. Furthermore, when the number is large, the distribution of the value estimator approximately follows a normal distribution with mean and variance EfronTibshirani (); Edgington (). For example, if we want to know the value in the neighborhood of the significance level to better than , we need , which is of the order of 1000 for .
Once the distribution of the test statistic is reconstructed, it is possible to define the critical region for rejecting the null hypothesis at a given significance , defined by large enough values of such that the corresponding value is smaller than .
As anticipated in Section 2.2, for finite samples the test statistic distribution is still approximately symmetric around the mean, but the latter may deviate from zero. In order to account for this general case, and give some intuitive meaning to the size of the test statistic, it is convenient to standardize (or ‘studentize’) the TS to have zero mean and unit variance. Let be the mean and the variance of test statistic under the distribution . We then transform the test statistic as
(9) 
which is distributed according to
(10) 
with zero mean and unit variance. With this redefinition, the twosided value can be easily computed as
(11) 
2.5 Summary of the algorithm
The pseudocode of the algorithm presented in this paper is summarized in Table 1. We implemented it in Python and an opensource package is available on GitHub ^{2}^{2}2 https://github.com/desimone/NN2ST.
3 Applications to simulated data
3.1 Case study: Gaussian samples
As a first case study of our method let us suppose we know the original distributions from which the benchmark and trial samples are randomly drawn. For instance, let us consider the twodimensional case () with twovariate Gaussian distributions defined by mean vectors and covariance matrices :
(12) 
In general, in the case of multivariate Gaussian distributions of dimension , the KL divergence can be computed analytically (see Eq. (22)).
In the large sample limit, we recover that the test statistic converges to the true KL divergence between the PDFs (see Figure 2). Of course, the comparison is possible because we knew the parent PDFs .
For our numerical experiments we fix the benchmark sample by the parameters , , and we construct 4 different trial samples drawn by Gaussian distributions whose parameters are defined in Table 2. Each sample consists of 20 000 points randomly drawn from the Gaussian distributions defined above. Notice that the first trial sample is drawn from the same distribution as the benchmark sample. To help the reader to visualize the situation, the benchmark and trial samples are shown in Figure 3.
Trial Dataset  value  

((1.0,0.0),(0.0,1.0))  0.2  
((1.0,0.0),(0.0,1.0))  2.2  
((0.95,0.1),(0.1,0.8))  3.5  
((1.0,0.0),(0.0,1.0))  4.9 
As is customary, we associate an equivalent Gaussian significance to a given (twosided) value as: , where is the cumulative distribution of a standard (zeromean, unitvariance) onedimensional Gaussian distribution. The results in Table 2 are interpreted as follows. The first two trial samples are not distinguished from the benchmark at more than 99%CL (), while are distinguished (, or equivalently ). Therefore, one would reject the null hypothesis at more than 99% CL and conclude that the PDFs from which are drawn are different from the benchmark PDF . It is remarkable that our statistical test is able to reject the null hypothesis with a large significance of for two random samples drawn from distributions which only differ by a shift of 15% of the mean along each dimension.
The running time to compute the value on a standard laptop for two 2dimensional samples of 20 000 points each, and for 1000 permutations, was about 2 minutes. The running time scales linearly with the number of permutations.
The number of sample points () plays an important role. As an example, we sampled the same datasets with points, i.e. ten times less points than for the cases shown in Table 2. The results for the equivalent significance for are , , , , respectively. Clearly, the test is not able to reject the null hypothesis at more than 99%CL (value is never below 0.01, which would correspond to ) in none of the cases. As another illustration of this point, we run the statistical test for vs for different sample sizes and show the resulting significance in Figure 4. We find that for , the test is not able to reject the null hypothesis at more than 99%CL.
Therefore, the power of our statistical test increases for larger sample sizes, as expected since bigger samples lead to more accurate approximations of the original PDFs.
3.2 Case study: Monojet searches at LHC
A modelindependent search at the LHC for physics Beyond the Standard Model (BSM), such as Dark Matter (DM), has been elusive CMSPASEXO08005 (); CMSPASEXO10021 (); Choudalakis:2011qn (); ATLASCONF2017001 (). Typically it is necessary to simulate the theoretical signal in a specific model, and compare with data to test whether the model is excluded. The signalspace for DM and BSM physics in general is enormous, and despite thorough efforts, the possibility exists that a signal has been overlooked. The compatibility test described in Section 2 is a promising technique to overcome this challenge, as it can search for deviations between the expected simulated Standard Model signal and the true data, without any knowledge of the nature of the new physics.
In a real application of our technique by experimental collaborations, the benchmark dataset will be a simulation of the SM background, while the trial dataset will consist of real measured data, potentially containing an unknown mix of SM and BSM events. As a proofofprinciple, we test whether our method would be sensitive to a DM signature in the monojet channel. For our study, both and will consist of simulated SM events (‘background’), however will additionally contain injected DM events (‘signal’). The goal is to determine whether the algorithm is sensitive to differences in and caused by this signal.
Model and simulations
The signal comes from a standard simplified DM model (see e.g. Ref. DeSimone:2016fbz () for a review) with Fermion DM and an channel vector mediator Boveia:2016mrp (); Albert:2017onk (). Our benchmark parameters are , , , in order to match the simplified model constraints from the ATLAS summary plots vector_summary (). We use a DM mass of 100 GeV, and mediator masses of (1200, 2000, 3000) GeV, in order to choose points that are not yet excluded but could potentially be in the future vector_summary ().
Signal and background events are first simulated using MG5_aMC@NLO v Alwall:2014hca () at centerofmass energy TeV, with a minimal cut of GeV, to emulate trigger rather than analysis cuts. We use Pythia 8.230 Sjostrand:2014zea () for hadronization and Delphes 3.4.1 delphes () for detector simulation. The socalled ‘monojet’ signal consists of events with missing energy from DM and at least one high jet. The resulting signal crosssection is pb for GeV respectively. For the background samples, we simulate 40 000 events of the leading background, where is 1 or 2, resulting in a cross section of pb.
The Delphes ROOT file is converted to LHCO and a feature vector is extracted with Python for each event, consisting of and for the two leading jets; the number of jets; missing energy ; Hadronic energy ; and between the leading jet and the missing energy. Together this gives an 8dimensional feature vector , which is scaled to zeromean unitvariance based on the mean and variance of the background simulations. This feature vector is chosen to capture sufficient information about each event while keep running time of the algorithm reasonable. Other choices of the feature vector could be chosen to capture different aspects of the physical processes, including higher or lowerlevel features, such as raw particle 4vectors. Application of highperformance computing resources would allow the feature vector to be enlarged, potentially strengthening results. A full study of the choice of feature vector is left to future work. Our simulation technique is simple and designed only as a proof of principle; we do not include subleading SM backgrounds, nor full detector effects, adopting a generic Delphes profile.
Test Statistic distribution under null hypothesis
Following the technique described in Section 2, for each of the 3 considered points in signal model parameter space, we first construct an empirical distribution of the test statistic under the null hypothesis, , and we then measure and compute the value to determine the compatibility of the datasets. We choose and is constructed over .
The pool sample consists of the 40 000 background events, along with a number of signal events proportional to the signal crosssection. We define and as having an equal number of background events, so that , . The resulting distribution of TS under the null hypothesis is shown in Fig. 5. The simulations are relatively fast, taking approximately an hour per 1000 permutations on a standard laptop, although computation time grows as a powerlaw with the number of events, such that further optimization and highperformance computing resources will be a necessity for application to real LHC data with many thousands of events. The statistics of converge quickly, as shown in Fig. 6, consistent with the discussion of in Section 2.4, and showing that is more than sufficient.
Note that since are chosen from permutations of , it is not necessary to specify how the 40 000 background events are divided between and ; It is only necessary to specify and at this point.
Observed Test Statistic
To test whether the null hypothesis would be excluded in the event of an (otherwise unobserved) DM signal hiding in the data, we calculate using containing only background, and containing background plus a number of signal events proportional to the relative cross section. In a practical application of this technique by the experimental collaborations, would instead correspond to background simulations, while would be the realworld observation; therefore only one measurement of would be performed.
However, in our case the distribution of TS under the null hypothesis is insensitive to the way the 40 000 background events are divided between and . Therefore we can simulate multiple realworld measurements of by dividing the 40 000 background events between and in different permutations (always keeping 20 000 background events in each sample). This allows us to be more robust: since is itself a random variable, multiple measurements of allows us to avoid the claim of a small value, when in reality the algorithm may not be sensitive to a small signal.
The calculation of is performed for 100 random divisions. The value and significance of each are calculated with respect to the empirical distribution where possible. In many cases, is so extreme that it falls outside the measured range of , in which case and are determined from a Gaussian distribution with mean and variance . This is equivalent to assuming that is wellapproximated by a Gaussian, which is true to a good approximation, as seen in Fig. 5. To be conservative, the technique is only considered sensitive to the signal if all simulated observations of TS exclude the null hypothesis, i.e. we show the minimum significance (and maximum value). These results are shown in Table 3, where we see that the backgroundonly hypothesis is strongly excluded for and , even though these points are not yet excluded by traditional LHC searches. Bear in mind that this is a proofofconcept, and realworld results are unlikely to be as clean, as discussed in Section 3.3.
Sample  [pb]  [pb]  max(value)  min()  
1.2 TeV  223.0  20.4  
2 TeV  206.4  3.8  
3 TeV  203.2  0.6  0.90 
Discussion
To study the threshold to which this technique is sensitive, we can construct by adding an arbitrary number of signal events to the background, without reference to the relative signal crosssection. The result is shown in Figure 7 (left panel), using the signal dataset with GeV. For each value of , the distribution is constructed over 1000 permutations, and the significance is determined through taking the minimum value of over 100 measurements of for different background permutations. There is a clear threshold, below which the significance is negligible and constant, and above which the significance grows as a powerlaw. The number of signal events in crosses this threshold while does not, explaining the rapid drop in the significance.
The strength of the technique is also sensitive to the number of samples. Figure 7 (right panel) demonstrates this, again using the signal dataset with GeV, , and taking the minimum over 100 measurements of . It shows an approximately powerlaw growth in the significance, consistent with the same growth in the significance with number of signal events. Clearly, the more data the better.
3.3 Future application to real data
In a practical application of this technique by experimental collaborations, would correspond to simulations of the SM background, while would be the realworld observation, consisting of an unknown mix of signal and background events. Both and could be constructed under the same set of minimal cuts, imposed based on trigger requirements rather than as a guide to finding new physics. While the technique itself is modelindependent, there is freedom to apply physical knowledge in the choice of minimal cuts to keep the background simulation and data load manageable, and in the choice of feature vector, which can either be lowlevel (raw 4vectors of reconstructed objects, or even pixel hits) or highlevel (missing energy, hadronic energy etc.).
Even though we have only applied our method to a generic monojet signal, the strength of the algorithm is that it is sensitive to unspecified signals, and is limited only by the accuracy of the background simulation. We emphasize that our case study in Section 3.2 is a proof of concept with a generic signal and a naïve estimation of the background.
Accurately estimating SM backgrounds at the LHC is a significant challenge in the field and must be considered carefully in any future application of this technique. Currently used techniques of matching simulations to data in control regions still allow the use of our method, although this introduces some modeldependent assumptions. Alternatively, one may apply our statistical test in the context of datadriven background calculation, as a validation tool to measure the compatibility of MonteCarlo simulations with data in control regions.
Our statistical test alone is not sufficient to claim discovery in cases where background simulations are not sufficiently accurate, but this does not weaken the value of the method. It remains valuable as a tool to identify regions of excess in a modelindependent way, allowing followup handcrafted analyses of potential signal regions.
4 Directions for extensions
In this section we summarize three main directions to extend and improve the method proposed in this paper. We limit ourselves to just outlining some ideas, leaving a more complete analysis of each of these issues to future work.
4.1 Inclusion of sample uncertainties
So far we have considered that both and samples are precisely known. However, in several situations of physical interest this may not be the case, as the features may be known only with some uncertainty. There can be several factors affecting the precision with which each sample point is known, for instance systematic uncertainties (e.g. the smearing effects of the detector response) and the limited accuracy of the background (MonteCarlo simulation), which may be particularly poor in some regions of the feature space.
Of course, once such uncertainties are properly taken into account, we expect a degradation of the results of the statistical test described in this paper, leading to weaker conclusions about the rejection of the null hypothesis.
In light of these effects, which we have neglected so far, we reiterate that the results of Section 3.2 should be taken with a grain of salt: they show the strengths of the statistical test we are proposing and prove it is worthwhile to investigate it further, but they will very likely be weakened in a realworld situation.
4.2 Adaptive choice of the number of nearest neighbors
The procedure for the density ratio estimator described in Section 2.3 relies on choosing the number of NN. As mentioned earlier, it is also possible to make the algorithm completely unsupervised by letting it choose the optimal value of .
One approach is to proceed by model selection as in Refs. SugiyamaMuller (); SUGIYAMA2011735 (); kremer (). We define the loss function as a meansquared error between the true (unknown) density ratio and the estimated density ratio over the benchmark PDF ,
(13)  
(14) 
where the last term is constant and can be dropped, thus making the loss function independent of the unknown ratio . The estimated loss function is obtained by replacing the expectations over the unknown PDF with the empirical averages
(15) 
So, one can perform model selection by minimizing the estimated loss function (15) with respect to the parameter and choosing this value of as the optimal one. However, this procedure may be computationally intensive as it requires running the full algorithm several times (one for each different value of ).
Another approach is to implement the PointAdaptive kNN density estimator (PAk) Rodriguez1492 (); Laio2018 (); 2018arXiv180210549D (), which is an algorithm to automatically find a compromise between large variance of the kNN estimator (for small ), and large bias (for large ) due to variations of the density of points.
4.3 Identifying the discrepant regions
Suppose that after running the statistical test described in this paper one finds a value leading to a rejection of the null hypothesis, or at least for evidence of incompatibility between the original PDFs. This means that the absolute value of the test statistic on the actual samples is large enough to deviate from zero significantly (to simplify the discussion, we assume in this subsection that and the distribution of TS has zero mean and unit variance: ). Then, our algorithm has a straightforward byproduct: it allows to characterize the regions in feature space which contribute the most to a large .
From the expression of the test statistic in Eq. (8) we see that we may associate a density field to each point as
(16) 
such that the test statistic is simply given by the expectation value (arithmetic average) of over the whole trial sample
(17) 
It is then convenient to define a score field over the trial sample, by standard normalization of as
(18) 
One can then use this score field to identify those points in which are significantly larger than , and they can be interpreted as the regions (or clusters) where the two samples manifest larger discrepancies.
This way, the score field provides a guidance for characterizing the regions in feature space where the discrepancy is more relevant, similar in spirit to regions of large signaltobackground ratio. For instance, the points with larger than a given threshold, e.g. , are the points where one expects most of the “anomaly” to occur. An example of this is shown in Figure 8, where a circular sample is compared with a crosslike sample. As expected, the field has higher density in correspondence of the corners of the cross.
Such regions of highest incompatibility between trial and benchmark samples may even be clustered using standard clustering algorithms, thus extending the method studied in this paper with another unsupervised learning technique.
Once they have been characterized and isolated, these highdiscrepancy regions in feature space can provide a guidance for further investigation, in order to identify what causes the deviations. For example, they can be used to place data selection cuts.
5 Conclusions
Many searches for new phenomena in physics (such as searches for New Physics at the LHC) rely on testing specific models and parameters. Given the unknown nature of the physical phenomenon we are searching for, it is becoming increasingly important to find modelindependent methods that are sensitive to an unknown signal hiding in the data.
The presence of a new phenomenon in data manifests itself as deviations from the expected distribution of data points in absence of the phenomenon. So, we propose a general statistical test for assessing the degree of compatibility between two datasets. Our method is modelindependent and nonparametric, requiring no information about the parameters or signal spectrum of the new physics being tested; it is also unbinned, taking advantage of the full multidimensional feature space.
The test statistic we employ to measure the ‘distance’ between two datasets is built upon a nearestneighbors estimation of their relative local densities. This is compared with the distribution of the test statistic under the null hypothesis. Observations of the test statistic at extreme tails of its distribution indicate that the two datasets come from different underlying probability densities.
Alongside an indication of the presence of anomalous events, our method can be applied to characterize the regions of discrepancy, providing a guidance for further analyses even in the case where one of the two samples (e.g. the background) is not known with enough accuracy to claim discovery.
The statistical test proposed in this paper has a wide range of scientific and engineering applications, e.g. to decide whether two datasets can be analyzed jointly, to find outliers in data, to detect changes of the underlying distributions over time, to detect anomalous events in timeseries data, etc.
In particular, its relevance for particle physics searches at LHC is clear. In this case the observed data can be compared with simulations of the Standard Model in order to detect the presence of New Physics events in the data. Our method is highly sensitive even to a small number of these events, showing the strong potential of this technique.
We would like to thank A. Davoli and A. Morandini for collaboration at the early stages of this work, and D. Barducci, R. Brasselet, R.T. D’ Agnolo, A. Farbin, F. Gieseke, E. Merelli, E. Merényi, A. Laio, L. Lista and A. Wulzer for insightful discussions.
Appendix A KullbackLeibler divergence
The KullbackLeibler (KL) divergence (or distance) is one of the most fundamental measures in information theory. The KL divergence of two continuous probability density functions (PDF) is defined as
(19) 
and it is a special case of divergences.
If the distributions are not known, but we are only given two samples of i.i.d. points drawn from and of i.i.d. points drawn from , it is possible to estimate the KL divergence using empirical methods. The estimated KL divergence between the estimated PDFs of is obtained by replacing the PDFs with their estimates and replacing the expectation value in Eq. (19) with the empirical (sample) average
(20) 
For the special case of Gaussian PDFs, the calculation of the KL divergence is particularly simple. Given two multivariate (dimensional) Gaussian PDFs defined by mean vectors and covariance matrices :
(21) 
the KL divergence in Eq. (19) is given by
(22) 
References
 (1) CDF collaboration, T. Aaltonen et al., ModelIndependent and QuasiModelIndependent Search for New Physics at CDF, Phys. Rev. D78 (2008) 012002, [0712.1311].
 (2) CDF collaboration, T. Aaltonen et al., Global Search for New Physics with 2.0 fb at CDF, Phys. Rev. D79 (2009) 011101, [0809.3781].
 (3) CMS collaboration, MUSIC – An Automated Scan for Deviations between Data and Monte Carlo Simulation, Tech. Rep. CMSPASEXO08005, CERN, Geneva, Oct, 2008.
 (4) CMS collaboration, Model Unspecific Search for New Physics in pp Collisions at sqrt(s) = 7 TeV, Tech. Rep. CMSPASEXO10021, CERN, Geneva, 2011.
 (5) G. Choudalakis, On hypothesis testing, trials factor, hypertests and the BumpHunter, in Proceedings, PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN,Geneva, Switzerland 1720 January 2011, 2011. 1101.0390.
 (6) ATLAS collaboration, A model independent general search for new phenomena with the ATLAS detector at , Tech. Rep. ATLASCONF2017001, CERN, Geneva, Jan, 2017.
 (7) P. Asadi, M. R. Buckley, A. DiFranzo, A. Monteux and D. Shih, Digging Deeper for New Physics in the LHC Data, JHEP 11 (2017) 194, [1707.05783].
 (8) R. T. D’Agnolo and A. Wulzer, Learning New Physics from a Machine, 1806.02350.
 (9) M. Kuusela, T. Vatanen, E. Malmi, T. Raiko, T. Aaltonen and Y. Nagai, SemiSupervised Anomaly Detection  Towards ModelIndependent Searches of New Physics, J. Phys. Conf. Ser. 368 (2012) 012032, [1112.3329].
 (10) K. Cranmer, J. Pavez and G. Louppe, Approximating Likelihood Ratios with Calibrated Discriminative Classifiers, 1506.02169.
 (11) P. Baldi, K. Cranmer, T. Faucett, P. Sadowski and D. Whiteson, Parameterized neural networks for highenergy physics, Eur. Phys. J. C76 (2016) 235, [1601.07913].
 (12) J. HernándezGonzález, I. Inza and J. Lozano, Weak supervision and other nonstandard classification problems: a taxonomy, Pattern Recognition Letters 69 (2016) 49–55.
 (13) S. Caron, J. S. Kim, K. Rolbiecki, R. Ruiz de Austri and B. Stienen, The BSMAI project: SUSYAIâgeneralizing LHC limits on supersymmetry with machine learning, Eur. Phys. J. C77 (2017) 257, [1605.02797].
 (14) G. Bertone, M. P. Deisenroth, J. S. Kim, S. Liem, R. Ruiz de Austri and M. Welling, Accelerating the BSM interpretation of LHC data with machine learning, 1611.02704.
 (15) L. M. Dery, B. Nachman, F. Rubbo and A. Schwartzman, Weakly Supervised Classification in High Energy Physics, JHEP 05 (2017) 145, [1702.00414].
 (16) G. Louppe, K. Cho, C. Becot and K. Cranmer, QCDAware Recursive Neural Networks for Jet Physics, 1702.00748.
 (17) T. Cohen, M. Freytsis and B. Ostdiek, (Machine) Learning to Do More with Less, JHEP 02 (2018) 034, [1706.09451].
 (18) E. M. Metodiev, B. Nachman and J. Thaler, Classification without labels: Learning from mixed samples in high energy physics, JHEP 10 (2017) 174, [1708.02949].
 (19) S. Chang, T. Cohen and B. Ostdiek, What is the Machine Learning?, Phys. Rev. D97 (2018) 056009, [1709.10106].
 (20) M. Paganini, L. de Oliveira and B. Nachman, CaloGAN : Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks, Phys. Rev. D97 (2018) 014021, [1712.10321].
 (21) P. T. Komiske, E. M. Metodiev, B. Nachman and M. D. Schwartz, Learning to Classify from Impure Samples, 1801.10158.
 (22) K. Fraser and M. D. Schwartz, Jet Charge and Machine Learning, 1803.08066.
 (23) A. Andreassen, I. Feige, C. Frye and M. D. Schwartz, JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics, 1804.09720.
 (24) J. Brehmer, K. Cranmer, G. Louppe and J. Pavez, A Guide to Constraining Effective Field Theories with Machine Learning, 1805.00020.
 (25) J. Brehmer, K. Cranmer, G. Louppe and J. Pavez, Constraining Effective Field Theories with Machine Learning, 1805.00013.
 (26) J. H. Collins, K. Howe and B. Nachman, CWoLa Hunting: Extending the Bump Hunt with Machine Learning, 1805.02664.
 (27) J. Brehmer, G. Louppe, J. Pavez and K. Cranmer, Mining gold from implicit models to improve likelihoodfree inference, 1805.12244.
 (28) S. Kullback and R. Leilber, On Information and sufficiency, The Annals of Mathematical Statistics 22 (1951) 79–86.
 (29) M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori and M. Kimura, Leastsquares twosample test, Neural Networks 24 (2011) 735 – 751.
 (30) M. Sugiyama, T. Suzuki and T. Kanamori, Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012, 10.1017/CBO9781139035613.
 (31) M. Schilling, Multivariate TwoSample Tests Based on Nearest Neighbors , Journal of the American Statistical Association 81 (1986) 799–806.
 (32) N. Henze, A multivariate twosample test based on the number of nearest neighbor type coincidences, Ann. Statist. 16 (06, 1988) 772–783.
 (33) Q. Wang, S. Kulkarni and S. Verdù, Divergence estimation of continuous distributions based on datadependent partitions, IEEE Transactions on Information Theory 51 (2005) 3064–3074.
 (34) Q. Wang, S. Kulkarni and S. Verdù, A NearestNeighbor Approach to Estimating Divergence between Continuous Random Vectors, Proceedings  2006 IEEE International Symposium on Information Theory, ISIT 2006 (2006) 242–246.
 (35) T. Dasu, S. Krishnan, S. Venkatasubramanian and K. Yi, An informationtheoretic approach to detecting changes in multidimensional data streams, in In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications, 2006.
 (36) F. PérezCruz, KullbackLeibler divergence estimation of continuous distributions, Proceedings of the IEEE International Symposium on Information Theory, 2008 (2008) 1666–1670.
 (37) J. Kremer, F. Gieseke, K. Steenstrup Pedersen and C. Igel, Nearest neighbor density ratio estimation for largescale applications in astronomy, Astronomy and Computing 12 (Sept., 2015) 67–72.
 (38) Y.K. Noh, M. Sugiyama, S. Liu, M. C. du Plessis, F. C. Park and D. D. Lee, Bias reduction and metric learning for nearestneighbor estimation of kullbackleibler divergence, Neural Computation 30 (2014) 1930–1960.
 (39) E. Edgington, Randomization Tests. Dekker, New York, 1995.
 (40) A. van der Vaart, Asymptotic Statistics. Cambridge University Press, 1998.
 (41) B. Efron and R. Tibshirani, An Introduction to Boostrap. Chapman & Hall, 1993.
 (42) A. De Simone and T. Jacques, Simplified models vs. effective field theory approaches in dark matter searches, Eur. Phys. J. C76 (2016) 367, [1603.08002].
 (43) G. Busoni et al., Recommendations on presenting LHC searches for missing transverse energy signals using simplified channel models of dark matter, 1603.04156.
 (44) A. Albert et al., Recommendations of the LHC Dark Matter Working Group: Comparing LHC searches for heavy mediators of dark matter production in visible and invisible decay channels, 1703.05703.
 (45) “Atlas summary plot.” https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/CombinedSummaryPlots/EXOTICS/ATLAS_DarkMatter_Summary_Vector_ModifiedCoupling/history.html.
 (46) J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer et al., The automated computation of treelevel and nexttoleading order differential cross sections, and their matching to parton shower simulations, JHEP 07 (2014) 079, [1405.0301].
 (47) T. SjÃ¶strand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten et al., An Introduction to PYTHIA 8.2, Comput. Phys. Commun. 191 (2015) 159–177, [1410.3012].
 (48) “Delphes.” https://cp3.irmp.ucl.ac.be/projects/delphes/.
 (49) M. Sugiyama and K. . MÃ¼ller, Model selection under covariate shift, vol. 3697 LNCS of Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2005.
 (50) A. Rodriguez and A. Laio, Clustering by fast search and find of density peaks, Science 344 (2014) 1492–1496, [http://science.sciencemag.org/content/344/6191/1492.full.pdf].
 (51) A. Rodriguez, M. dÃ¢ÂÂErrico, E. Facco and A. Laio, Computing the free energy without collective variables, Journal of Chemical Theory and Computation 14 (2018) 1206–1215, [https://doi.org/10.1021/acs.jctc.7b00916].
 (52) M. d’Errico, E. Facco, A. Laio and A. Rodriguez, Automatic topography of highdimensional data sets by nonparametric Density Peak clustering, ArXiv eprints (Feb., 2018) , [1802.10549].