A sampling algorithm to estimate the effect of fluctuations in particle physics data
Abstract. Background properties in experimental particle physics are typically estimated using large data sets. However, different events can exhibit different features because of the quantum mechanical nature of the underlying physics processes. While signal and background fractions in a given data set can be evaluated using a maximum likelihood estimator, the shapes of the corresponding distributions are traditionally obtained using high-statistics control samples, which normally neglects the effect of fluctuations. On the other hand, if it were possible to subtract background using templates that take fluctuations into account, this would be expected to improve the resolution of the observables of interest and, depending on the analysis, to reduce systematic uncertainties. This study is an initial step in this direction. We propose a novel algorithm, inspired by the Gibbs sampler, that makes it possible to estimate the shapes of signal and background probability density functions from a given collection of particles, using control sample templates as initial conditions and refining them to take into account the effect of fluctuations. Results on Monte Carlo data are presented, and the prospects for future development are discussed.
Keywords: 29.85.Fj; High Energy Physics; Particle Physics; Large Hadron Collider; LHC; background discrimination; mixture models; latent variable models; sampling; Gibbs sampler; Markov Chain Monte Carlo; Expectation Maximisation; Multiple Imputation; Data Augmentation.
1 Introduction
While signal and background fractions in a given data set can be obtained using a maximum likelihood estimator, a traditional approach to data analysis in particle physics consists of estimating the shapes of signal and background probability density functions (PDFs) from high-statistics control samples. Generally speaking, such templates can be used to subtract background from a set of candidate events containing signatures of a process of interest.
However, even when the physics processes involved are exactly the same, the quantum nature of the underlying physics can produce different features in different events. For this reason, the shapes of signal and background PDFs in a collection of events of interest can deviate from the templates obtained using high-statistics control samples.
When physics analysis leads to the identification of a large enough number of candidate events, such discrepancies are often of no practical relevance. However, in some cases, most notably in searches for new physics, the analysis can result in only a few interesting events, and fluctuations can then affect the shapes of signal and background PDFs in ways that cannot be described using high-statistics control samples.
Two major aspects underline the innovative nature of this work. First of all, this article describes the first attempt at estimating the effect of fluctuations on the shapes of probability distributions in particle physics data by using a novel algorithm inspired by the Gibbs sampler [1]. This builds on the results of previous studies presented in [2]. The algorithm starts from control sample templates, and uses the stationary distribution of an appropriately-defined Markov Chain to estimate the effect of fluctuations on the shapes of signal and background PDFs.
Secondly, this paper proposes a new approach to background discrimination. Instead of concentrating on the likelihood for entire events to contain a process of interest, individual particles inside events are assigned a probability of originating from signal as opposed to background. This reformulation of background discrimination in terms of a classification problem at the particle level is used to illustrate how the algorithm estimates the effect of fluctuations on the shapes of particle-level PDFs in the data.
However, the two aspects of (i) estimating the effect of fluctuations on PDF shapes and (ii) concentrating on individual particle properties to discriminate signal from background are in principle independent, and can therefore lead to distinct lines of development in the future.
For this reason, this research has two main prospective goals. First of all, this work is an initial step toward the development of techniques to study the effect of fluctuations on particle-level probability distributions inside individual events at the Large Hadron Collider (LHC). The emphasis will be on using event-specific templates to associate individual particles with signal or background on a probabilistic basis, taking into account the effect of fluctuations on the shapes of particle-level PDFs event by event. We argue that using different templates for different events to subtract background may improve the resolution of observables of interest and may help reduce systematic uncertainties, depending on the analysis.
Secondly, from a broader perspective, this research aims to guide the future development of new background subtraction techniques based on PDFs that take into account the effect of fluctuations in a collection of events of interest. This aspect is more general and not necessarily restricted to particle-level distributions. Both lines of development refer to the LHC at this point, but we anticipate that they can be of broader relevance.
Finally, it should be noted that it is not the purpose of this article to present the algorithm for use in a proper physics analysis framework at the LHC. Rather, this study is a proof of concept of a novel approach to data analysis, illustrating the use of a data-driven technique that estimates the effect of fluctuations on the shapes of probability distributions, with an emphasis on individual particle kinematics.
2 The algorithm
One of the distinctive features of the proposed approach is a population-based view of high-energy collision events at the particle level. This aspect focuses on decomposing a collection of particles into subpopulations that are associated with different processes and are described in terms of different PDFs.
The input data set consists of a mixture of particles, some of which originated from a hard scattering of interest, others from background. The algorithm uses properties of individual particles to decompose the mixture, iteratively sampling from posterior distributions that encode information as to which particles are more likely to originate from either process.
As compared to classical mixture models, which typically enforce given subpopulation PDF shapes, our statistical model is based on a non-parametric mixture of the form $f(x) = \sum_j \pi_j f_j(x)$, where the subpopulation fractions ("mixture weights") $\pi_j$ satisfy $\sum_j \pi_j = 1$. In the present application, $x$ corresponds to particle pseudorapidity $\eta$, a kinematic variable related to the particle polar angle $\theta$ in the laboratory frame in terms of $\eta = -\ln\tan(\theta/2)$, or to $p_T$, i.e. particle transverse momentum with respect to the beam direction.
The PDFs $f_j(x)$ are here defined in terms of regularised histograms of $\eta$ and $p_T$, namely based on spline interpolation of histogram bin contents.
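As an illustration, such a spline-regularised histogram PDF can be sketched as follows in Python. This is a hypothetical implementation, not the ALGLIB-based one used in this study; the function and parameter names are our own.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def splined_pdf(samples, bins=25, range_=(-2.5, 2.5)):
    """Regularised PDF estimate: spline-interpolate histogram bin
    contents, clip negative overshoots, and renormalise to unit area."""
    counts, edges = np.histogram(samples, bins=bins, range=range_, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    spline = CubicSpline(centres, counts)
    # renormalise numerically over the interpolation range
    xs = np.linspace(centres[0], centres[-1], 2000)
    ys = np.clip(spline(xs), 0.0, None)
    norm = np.sum(ys) * (xs[1] - xs[0])
    return lambda x: np.clip(spline(x), 0.0, None) / norm
```

The clipping step guards against the spline overshooting below zero between sparsely populated bins, which would otherwise produce negative probability densities.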
In general, given the above mixture of probability distributions and a set of observations $\{x_i\}_{i=1}^{n}$, Markov Chain Monte Carlo (MCMC) techniques can be used to cluster the latter into groups by associating each observation with a distribution of origin. The Gibbs sampler belongs to this family of numerical methods and, as anticipated, directly inspired this work.
The pseudocode of the proposed algorithm is shown below, with subscripts $s$ and $b$ relating to signal and background, respectively. The value of a quantity $Q$ at iteration $t$ is denoted by $Q^{(t)}$ throughout.

Initialization: Set $\pi^{(0)} = (\pi_s^{(0)}, \pi_b^{(0)})$, where $\pi_s^{(0)} \in (0, 1)$ and $\pi_b^{(0)} = 1 - \pi_s^{(0)}$. Initial estimates of the subpopulation PDFs $f_j^{(0)}(\eta)$, $f_j^{(0)}(p_T)$, $j \in \{s, b\}$, are given by splined one-dimensional histograms of $\eta$ and $p_T$ obtained from a high-statistics control sample.

Iteration $t = 1, 2, \dots$:

Generate $z_{ij}^{(t)}$ ("allocation variables") for all particles $i = 1, \dots, n$ and distributions $j \in \{s, b\}$ according to $P(z_{ij}^{(t)} = 1) \propto \pi_j^{(t-1)} f_j^{(0)}(x_i)$, with $\sum_j z_{ij}^{(t)} = 1$. Both the non-parametric treatment of the PDFs and the use of $f_j^{(0)}$ instead of $f_j^{(t-1)}$ for mapping $x_i \mapsto z_{ij}^{(t)}$ distinguish this implementation from the classical Gibbs sampler for mixture models.

Set $\pi_j^{(t)} = \frac{1}{n} \sum_{i=1}^{n} z_{ij}^{(t)}$, $j \in \{s, b\}$.
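For concreteness, the iteration above can be sketched as follows in Python for a single one-dimensional observable and fixed template densities. This is a simplified illustration rather than the implementation used in this study, and all names are our own.

```python
import numpy as np

def sample_mixture(x, pdf_s, pdf_b, pi_s0=0.5, n_iter=1000, n_avg=100, seed=0):
    """Sketch of the sampler: allocation variables are drawn with
    P(z_i = s) proportional to pi_s * f_s(x_i), using the *fixed*
    control sample templates pdf_s and pdf_b; the mixture weight is
    then reset to the fraction of particles allocated to signal."""
    rng = np.random.default_rng(seed)
    fs, fb = pdf_s(x), pdf_b(x)   # template densities, fixed throughout
    pi_s, n = pi_s0, len(x)
    p_sum = np.zeros(n)
    for t in range(n_iter):
        w = pi_s * fs / (pi_s * fs + (1.0 - pi_s) * fb)  # posterior signal prob.
        z = rng.random(n) < w                            # sampled allocations
        pi_s = z.mean()                                  # updated mixture weight
        if t >= n_iter - n_avg:                          # average final iterations
            p_sum += w
    return pi_s, p_sum / n_avg
```

Averaging the per-particle posterior probabilities over the final iterations reflects the treatment of the Markov Chain output described in the text.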

The sampler estimates the shapes of the signal and background subpopulations in an input collection of particles, based on the data as well as on initial conditions on the subpopulation PDFs provided by control samples. As opposed to the traditional Gibbs sampler for mixture models, where the stationary distribution of the corresponding Markov Chain relates to the joint posterior of the allocation variables and of the subpopulation PDF parameters, this algorithm samples from the posterior probability of the allocation variables, and the PDFs are kept fixed at control sample estimates.
This article reports results obtained on a Monte Carlo data set generated using Pythia 8.140 [5, 6]. The data set consists of charged particles in a restricted kinematic window from 50 signal events from pp interactions, superimposed on 350 soft QCD interactions, so-called Minimum Bias events, to simulate background. The data set comprises 1,635 particles, out of which 1,269 originate from signal and 366 from background, corresponding to a fraction of background particles of approximately 22%. In addition to reducing the correlation between particle $\eta$ and $p_T$ to a negligible level, the restriction to the kinematic window is necessary for the variables to contribute to discrimination between signal and background. If all values were allowed, the difference between the signal and background distributions would in fact be too small to provide any discriminating power.
The higher-statistics Monte Carlo data set ("control sample") used to obtain initial estimates of the subpopulation PDFs was produced by generating 1,000 signal events and superimposing 7 Minimum Bias events per signal interaction. Charged particles were again considered in a restricted kinematic range. Both samples are used exclusively for illustrative purposes in this article.
The algorithm correctly estimated the fraction of background particles contained in the input data set, with a shorter burn-in phase than typically reported in the MCMC literature. In fact, since the sampler essentially refines control sample PDF templates, it already starts from reasonable knowledge of the target PDFs. For this reason, the equilibrium distribution of the Markov Chain, although corresponding to an improved description of the PDF shapes that takes fluctuations into account, is normally close to the initial conditions.
The estimated mixture weights can be biased because control sample PDFs are used for mapping instead of the unknown true distributions. However, as noted previously, obtaining the signal and background fractions is not the purpose of this algorithm, and the estimated PDFs were observed to be remarkably unaffected by any bias on the mixture weights.
The algorithm was run for 1,000 iterations and probabilities were averaged over the last 100. No significant gain was found in this study in choosing a higher number of iterations.
The statistical model is well defined, i.e. the stationary distribution of the Markov Chain exists and is unique, which is a necessary condition for the estimated signal and background PDF shapes to reflect meaningful properties of the data set analysed. This was verified explicitly in two ways. First of all, different initial conditions were used for the mixture weights over a range of values, and it was confirmed that the estimated signal and background PDFs did not vary appreciably, as expected. Secondly, the algorithm was run repeatedly using different pseudorandom seeds, and the estimated PDFs were again observed to be essentially unaffected.
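A check of this kind can be sketched as follows. The example below is a toy with Gaussian templates of our own choosing, not the data used in this study: the initial mixture weight and the pseudo-random seed are varied, and the spread of the resulting estimates is inspected.

```python
import numpy as np
from scipy.stats import norm

def estimate_signal_fraction(x, pi_s0, seed, n_iter=500):
    """Toy sampler with fixed Gaussian templates f_s = N(0,1), f_b = N(3,1)."""
    rng = np.random.default_rng(seed)
    fs, fb = norm.pdf(x, 0.0, 1.0), norm.pdf(x, 3.0, 1.0)
    pi_s = pi_s0
    for _ in range(n_iter):
        w = pi_s * fs / (pi_s * fs + (1.0 - pi_s) * fb)  # posterior signal prob.
        pi_s = (rng.random(len(x)) < w).mean()           # sampled weight update
    return pi_s

# robustness check: scan initial weights and seeds, inspect the spread
rng0 = np.random.default_rng(0)
x = np.concatenate([rng0.normal(0.0, 1.0, 700), rng0.normal(3.0, 1.0, 300)])
estimates = [estimate_signal_fraction(x, pi_s0, seed)
             for pi_s0 in (0.2, 0.5, 0.8) for seed in (1, 2, 3)]
```

If the chain is well behaved, the estimates should agree to within statistical fluctuations regardless of the starting point and seed.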
Figure 1 illustrates the effect of fluctuations in the data by comparing the shape of the true signal distribution to the corresponding control sample template (a) and to the PDF estimated by the algorithm (b). The points correspond to the true distribution normalised to unit area, while the curves result from spline interpolation of the control sample template (a) and of the estimated PDF (b).
The actual distribution in the data differs notably from the high-statistics control sample template. Not only is its peak shifted toward lower values, but its shape is not symmetric, as shown by the structure visible in figure 1 (b). As anticipated, the algorithm can estimate these features thanks to the use of a non-parametric model. This underlines the importance of abandoning classical mixture models in favour of a more flexible formulation that can in principle accommodate any deviation in the PDF shapes due to fluctuations in the data. Figures 1 (c) and (d) display the corresponding ratios of control sample to true PDF, and of estimated to true PDF, respectively. As can be seen, the PDF obtained using the algorithm provides a much more faithful description of the true distribution than the control sample template. A detailed autocorrelation analysis of the Markov Chain is outside the scope of this work and is left for future studies. Results were also cross-checked on a second Monte Carlo data set corresponding to a different fraction of background particles, as well as on toy Monte Carlo data [4].
3 Conclusions and outlook
We have described a novel sampling algorithm that uses a Markov Chain to estimate the shapes of signal and background PDFs from an input data set taking fluctuations into account, starting from initial conditions provided by highstatistics control samples. This article concentrates on the use of discriminating variables associated with particlelevel kinematics.
It is worth noting that the algorithm is presented here neither as a classifier nor as a tool for physics analysis at the LHC. This contribution is a proof of concept of a new approach to data analysis in particle physics that aims to estimate the effect of fluctuations on the shapes of probability distributions, while also proposing a reformulation of background discrimination as a classification problem at the particle level.
In particular, this study is a first step toward the development of future mixture decomposition techniques to quantify the effect of fluctuations on the shapes of particle-level probability distributions inside individual events at the LHC. We argue that using event-by-event templates to subtract background may improve the resolution of observables of interest, may increase the sensitivity in searches for new physics, and may reduce systematic uncertainties, depending on the analysis. Since the number of particles in this study is in line with typical LHC charged particle multiplicities, these results are an encouraging starting point. It should also be noted that, although the present version of the algorithm concentrates on mixtures of one signal and one background subpopulation, this method can in principle be extended to more complex scenarios.
It is worth noting that efforts to eliminate noise in event-by-event analysis of multiparticle production documented in the heavy ion literature [7] are fundamentally different from this approach. As opposed to traditional techniques whereby functional forms are enforced on the PDFs and parameters are subsequently adjusted within the templates, this approach is based on a non-parametric statistical model that can in principle describe any deviation from the shapes of control sample templates due to fluctuations. Furthermore, this method builds on a novel population-based view of particle physics events, and reformulates background discrimination as a classification problem at the level of individual particles.
For the sake of completeness, it should be mentioned that a conceptual issue can in principle arise when colour connection is involved [4], but in our opinion this technique can nevertheless cover many situations of practical relevance. It should also be noted that, due to its iterative nature, the algorithm can be outperformed by established methods in terms of execution time. However, the running time in this study remained reasonable for offline use on a 2 GHz Intel processor with 1 GB of RAM. Improvements may be possible in this respect [4].
4 Acknowledgments
The author wishes to thank the High Energy Physics Group at Brunel University for a stimulating environment and for many fruitful discussions. Particular gratitude also goes to the High Energy Physics Group at University College London for their hospitality at an earlier stage, and to Prof. Jonathan M. Butterworth for his valuable comments. The author also wishes to thank Prof. Trevor Sweeting and Dr. Alexandros Beskos at the UCL Department of Statistical Science for fruitful discussions. Particular gratitude also goes to Prof. Carsten Peterson and to Prof. Leif Lönnblad at the Department of Astronomy and Theoretical Physics at Lund University.
Footnotes
 This implementation makes use of the ALGLIB software library [3].
References
 Geman S and Geman D 1984 IEEE T. Pattern Anal. 6 721
 Colecchia F 2012 J. Phys.: Conf. Ser. 368 012031
 Bochkanov S and Bystritsky V, ALGLIB (www.alglib.net)
 Colecchia F 2012 (Preprint arXiv:1205.5886v1 [physics.data-an])
 Sjöstrand T, Mrenna S and Skands P 2006 J. High Energy Phys. JHEP05(2006)026
 Sjöstrand T, Mrenna S and Skands P 2008 Comput. Phys. Comm. 178 852
 Voloshin S A, Koch V and Ritter H G 1999 Phys. Rev. C 60 024901