Variational Autoencoders for New Physics Mining at the Large Hadron Collider
Using variational autoencoders trained on known physics processes, we develop a one-sided p-value test to isolate previously unseen processes as outlier events. Since the autoencoder training does not depend on any specific new physics signature, the proposed procedure has a weak dependence on underlying assumptions about the nature of new physics. An event selection based on this algorithm would be complementary to classic LHC searches, which are typically based on model-dependent hypothesis testing. Such an algorithm would deliver a list of anomalous events that the experimental collaborations could further scrutinize and even release as a catalog, similarly to what is typically done in other scientific domains. Repeated patterns in this dataset could motivate new scenarios for beyond-the-standard-model physics and inspire new searches, to be performed on future data with traditional supervised approaches. Running in the trigger system of the LHC experiments, such an application could identify anomalous events that would otherwise be lost, extending the scientific reach of the LHC.
1 Introduction
One of the main motivations behind the construction of the CERN Large Hadron Collider (LHC) is the exploration of the high-energy frontier in search of new physics phenomena. This new physics could answer some of the standing fundamental questions in particle physics, e.g., the nature of dark matter or the origin of electroweak symmetry breaking. In LHC experiments, searches for physics beyond the Standard Model (BSM) are typically carried out as fully-supervised data analyses: assuming a new physics scenario of some kind, a search is structured as a hypothesis test based on profiled likelihood ratios [1]. These searches are said to be model dependent, since they depend on considering a specific new physics model.
Assuming that one is testing the right model, this approach is very effective in discovering a signal, as demonstrated by the LHC searches for the Standard Model (SM) Higgs boson [2, 3]. On the other hand, given the (so far) negative outcome of many BSM searches at the LHC and at other particle-physics experiments, it is possible that a future BSM model, if any, is not among those typically tested. The problem is more profound if analyzed in the context of the LHC big-data problem: at the LHC, 40 million proton-beam collisions are produced every second, but only 1000 collision events/sec can be stored by the ATLAS and CMS experiments, due to limited bandwidth, processing, and storage resources. It is possible to imagine BSM scenarios that would escape detection, simply because the corresponding new physics events would be rejected by a typical set of online selection algorithms.
Establishing alternative search methodologies with reduced model dependence is an important aspect of future LHC runs. Traditionally, this issue was addressed with so-called model-independent searches, performed at the Tevatron [4, 5], at HERA [6], and at the LHC [7, 8], as discussed in Section 2.
In this paper, we propose to address this need by deploying an unsupervised algorithm in the online selection system of the LHC experiments. This algorithm would be trained on known SM processes and could be able to identify BSM events as anomalies. The selected events could be stored in a special stream, scrutinized by experts (e.g., to exclude detector malfunctioning that could explain the anomalies), and even released outside the experimental collaborations, in the form of an open-access catalog. The final goal of this application is to identify anomalous event topologies and inspire future supervised searches on data collected afterwards.
As a proof of principle, we consider the case of a typical single-lepton data stream, selected by the hardware-based L1 trigger system. On this stream of data, a variational autoencoder (VAE) is trained to compress the input event representation into a low-dimensional latent space and then decompress it, returning the shape parameters that describe the probability density function (pdf) of each input quantity, given a point in the compressed space. The event distribution of a proper test statistic, namely part of the VAE loss function, is used to perform a one-sided p-value test, which associates to each incoming event the probability of originating from known SM processes. A p-value threshold is applied to decide which events should be included in a low-rate anomalous-event data stream. In this work, we set the threshold such that events could be collected every day under current LHC operation conditions. In particular, we took as a reference 8 months of data taking per year, with an integrated luminosity of fb$^{-1}$, as in 2016. Assuming an LHC duty cycle of 2/3, this corresponds to an average instantaneous luminosity of cm$^{-2}$ s$^{-1}$.
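The conversion between integrated and average instantaneous luminosity used above is simple arithmetic, sketched below in Python. The 40 fb$^{-1}$ input and the 30-day month are illustrative placeholders chosen for this example, not values taken from the text; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
INV_FB_TO_INV_CM2 = 1e39  # 1 fb = 1e-39 cm^2, hence 1 fb^-1 = 1e39 cm^-2


def average_instantaneous_luminosity(int_lumi_inv_fb, months, duty_cycle,
                                     days_per_month=30):
    """Average instantaneous luminosity in cm^-2 s^-1.

    int_lumi_inv_fb: integrated luminosity in fb^-1
    months:          calendar months of data taking
    duty_cycle:      fraction of that time with colliding beams
    """
    live_seconds = months * days_per_month * SECONDS_PER_DAY * duty_cycle
    return int_lumi_inv_fb * INV_FB_TO_INV_CM2 / live_seconds


# Placeholder input (assumption): 40 fb^-1 in 8 months with a 2/3 duty cycle.
example = average_instantaneous_luminosity(40.0, months=8, duty_cycle=2 / 3)
print(f"{example:.2e} cm^-2 s^-1")
```

The same function can be inverted to estimate the integrated luminosity collected in a day or a month, which is what sets the event rates quoted in the rest of the paper.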
We then measure the BSM production cross section that would correspond to a signal excess of 100 events/month, as well as the one that would give a signal yield equal to 1/3 of the SM yield. For this, we consider a set of low-mass BSM resonances, decaying to one or more leptons and light enough to be challenging for the currently employed LHC trigger algorithms.
This paper is structured as follows: we discuss related works in Section 2. Section 3 gives a brief description of the dataset used. Section 4 describes the VAE model used in the study, as well as a set of fully-supervised classifiers used for performance comparison. Results are discussed in Section 5. In Section 6 we discuss how such an application could be used in a typical LHC experimental environment. Conclusions are given in Section 7.
2 Related Work
Model-independent searches for new physics have been performed at the Tevatron [4, 5], at HERA [6], and at the LHC [7, 8]. These searches are based on the comparison of a large set of binned distributions to the prediction from Monte Carlo simulation, in search of bins exhibiting a deviation larger than some predefined threshold. While the effectiveness of this strategy in establishing a discovery has been a matter of discussion, a recent study by the ATLAS collaboration [8] has rephrased this model-independent search strategy into a tool to identify interesting excesses, on which traditional analysis techniques could be performed on independent datasets (e.g., the data collected after running the model-independent analysis). This change of scope has the advantage of reducing the trial factor (i.e., the so-called look-elsewhere effect [9, 10]), which washes out the significance of an observed excess.
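To give a feeling for how the trial factor dilutes an excess, the sketch below uses the simplest textbook approximation: $N$ statistically independent bins, each tested with the same local p-value. This is an assumption made for illustration only; the treatment in Refs. [9, 10] is more general, and the function name is hypothetical.

```python
def global_p_value(local_p, n_trials):
    """Probability that at least one of n_trials independent tests
    fluctuates to a local p-value of local_p or smaller."""
    return 1.0 - (1.0 - local_p) ** n_trials


# A 1-in-1000 local fluctuation is unremarkable when 1000 bins are scanned.
print(global_p_value(1e-3, 1))
print(global_p_value(1e-3, 1000))
```

Restricting the follow-up analysis to a few pre-identified excesses on independent data keeps `n_trials` small, which is the advantage described above.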
Our strategy is similar to what is proposed in Ref. [8], with two substantial differences: (i) we aim to monitor also those events that could be discarded by the online selection, by running the algorithm in the trigger system; (ii) we do so by exploiting deep-learning-based anomaly detection techniques.
Recent works [11, 12, 13] have investigated the use of machine-learning techniques to set up new strategies for BSM searches with minimal or no assumption on the specific new-physics scenario under investigation. In this work, we use variational autoencoders based on high-level features as a baseline. Previously, autoencoders have been used in collider physics for detector monitoring [14, 15] and event generation [16]. Autoencoders have also been explored to define a jet tagger that would identify new physics events with anomalous jets [17, 18], with a strategy similar to what we apply to the full event in this work.
3 Data samples
The dataset used for this study is a refined version of the high-level-feature (HLF) dataset used in Ref. [19]. Proton-proton collisions are generated using the PYTHIA8 event-generation library [20], fixing the center-of-mass energy to the LHC Run-II value (13 TeV) and the average number of overlapping collisions per beam crossing (pileup) to . These beam conditions loosely correspond to the LHC operating conditions in 2016.
Events generated by PYTHIA8 are processed with the DELPHES library [21], to emulate detector efficiency and resolution effects. We take as benchmark detector description the upgraded design of the CMS detector, foreseen for the High-Luminosity LHC phase [22]. In particular, we use the CMS HL-LHC detector card distributed with DELPHES. We run the DELPHES particle-flow (PF) algorithm, which combines information from different detector components to derive a list of reconstructed particles, the so-called PF candidates. For each particle, the algorithm returns the measured energy and flight direction. Each particle is associated to one of three classes: charged particles, photons, and neutral hadrons. In addition, lists of reconstructed electrons and muons are given.
Events are filtered at generation requiring an electron, muon, or tau lepton with GeV. Once detector effects are taken into account through the DELPHES simulation, events are further selected requiring the presence of one reconstructed electron or muon with transverse momentum GeV and a loose isolation requirement , where the isolation is computed as:
$$\mathrm{Iso} = \frac{\sum_{p \neq \ell} p_T^{p}}{p_T^{\ell}},$$
and the sum extends over all the photons, charged and neutral hadrons within a cone of size from the lepton. [Footnote 1: As is common in collider physics, we use a Cartesian coordinate system with the $z$ axis oriented along the beam axis, the $x$ axis on the horizontal plane, and the $y$ axis oriented upward. The $x$ and $y$ axes define the transverse plane, while the $z$ axis identifies the longitudinal direction. The azimuth angle $\phi$ is computed from the $x$ axis. The polar angle $\theta$ is used to compute the pseudorapidity $\eta = -\ln\tan(\theta/2)$. We fix units such that $c = \hbar = 1$.]
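A minimal Python sketch of a cone-based relative isolation of this kind is given below. Each particle is represented as a `(pt, eta, phi)` tuple; the function names and the `cone_size` argument are hypothetical, and the cone size used in the example call is an arbitrary placeholder, not the value used in the analysis.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the azimuth difference wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)


def isolation(lepton, candidates, cone_size):
    """Scalar pT sum of the candidates inside the cone, divided by the lepton pT.

    lepton:     (pt, eta, phi) of the lepton
    candidates: iterable of (pt, eta, phi) for photons and charged/neutral
                hadrons; the lepton itself must not be included
    """
    pt_l, eta_l, phi_l = lepton
    cone_sum = sum(pt for pt, eta, phi in candidates
                   if delta_r(eta, phi, eta_l, phi_l) < cone_size)
    return cone_sum / pt_l


lepton = (30.0, 0.0, 0.0)
others = [(3.0, 0.1, 0.1), (6.0, 1.0, 1.0), (1.5, 0.0, 3.1)]
print(isolation(lepton, others, cone_size=0.3))  # only the first candidate is in the cone
```

Splitting `candidates` by particle class before the sum gives per-class isolation quantities such as ChPFIso, NeuPFIso and GammaPFIso.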
The 21 considered HLF quantities are:
The absolute value of the isolated-lepton transverse momentum $p_T^{\ell}$.
The three isolation quantities (ChPFIso, NeuPFIso, GammaPFIso) for the isolated lepton, computed with respect to charged particles, neutral hadrons and photons, respectively.
The lepton charge.
A Boolean flag (isEle) set to 1 when the trigger lepton is an electron, 0 otherwise.
The number of jets entering the sum ($N_J$).
The invariant mass of the set of jets entering the sum ($M_J$).
The number of these jets being identified as originating from a $b$ quark ($N_b$).
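The missing transverse momentum, decomposed into its parallel ($p_{T,\parallel}^{\mathrm{miss}}$) and orthogonal ($p_{T,\perp}^{\mathrm{miss}}$) components with respect to the isolated lepton direction. The missing transverse momentum is defined as the negative sum of the PF-candidate $\vec{p}_T$ vectors:
$$\vec{p}_T^{\,\mathrm{miss}} = -\sum_{q} \vec{p}_T^{\,q}.$$
The transverse mass, $M_T$, of the isolated lepton and the $\vec{p}_T^{\,\mathrm{miss}}$ system, defined as:
$$M_T = \sqrt{2\, p_T^{\ell}\, E_T^{\mathrm{miss}}\, (1 - \cos\Delta\phi)},$$
with $p_T^{\ell}$ the lepton transverse momentum, $\Delta\phi$ the azimuth separation between the lepton $\vec{p}_T$ and $\vec{p}_T^{\,\mathrm{miss}}$ vectors, and $E_T^{\mathrm{miss}}$ the absolute value of $\vec{p}_T^{\,\mathrm{miss}}$.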
The number of selected muons ($N_\mu$).
The invariant mass of this set of muons ($M_\mu$).
The absolute value of the total transverse momentum of these muons ($p_T^{\mu}$).
The number of selected electrons ($N_e$).
The invariant mass of this set of electrons ($M_e$).
The absolute value of the total transverse momentum of these electrons ($p_T^{e}$).
The number of reconstructed charged hadrons.
The number of reconstructed neutral hadrons.
This list of HLF quantities is not defined having in mind a specific BSM scenario. Instead, it is conceived to include relevant information to discriminate the various SM processes populating the single-lepton data stream. On the other hand, it is generic enough to allow (at least in principle) the identification of a large set of new physics scenarios.
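Two of the quantities in the list, the missing-transverse-momentum components and the transverse mass, can be sketched in a few lines of Python. Candidates are given as `(px, py)` pairs; the function names and the sign convention chosen for the orthogonal component (lepton direction rotated by +90 degrees) are assumptions made for this example.

```python
import math


def missing_pt(candidates):
    """Negative vector sum of the candidates' transverse momenta (px, py)."""
    return (-sum(px for px, _ in candidates), -sum(py for _, py in candidates))


def met_components(met, lepton_phi):
    """Components of the missing pT parallel and orthogonal to the lepton."""
    ux, uy = math.cos(lepton_phi), math.sin(lepton_phi)
    parallel = met[0] * ux + met[1] * uy
    orthogonal = -met[0] * uy + met[1] * ux
    return parallel, orthogonal


def transverse_mass(lepton_pt, lepton_phi, met):
    """sqrt(2 * pT(lepton) * |missing pT| * (1 - cos(delta phi)))."""
    met_abs = math.hypot(met[0], met[1])
    dphi = math.atan2(met[1], met[0]) - lepton_phi
    return math.sqrt(max(0.0, 2.0 * lepton_pt * met_abs * (1.0 - math.cos(dphi))))


# Toy event: a 40 GeV lepton along x and one recoiling candidate.
event = [(40.0, 0.0), (-40.0, -40.0)]
met = missing_pt(event)
print(met_components(met, lepton_phi=0.0))
print(transverse_mass(40.0, 0.0, met))
```

In the toy event the missing momentum is perpendicular to the lepton, so the parallel component vanishes and the transverse mass equals $\sqrt{2}$ times 40 GeV.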
Many SM processes would contribute to the considered single-lepton dataset. For simplicity, we restrict the list of relevant SM processes to the four with highest production cross section, namely:
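Inclusive $W$ production, with $W \to \ell\nu$ ($\ell = e, \mu, \tau$).
Inclusive $Z$ production, with $Z \to \ell\ell$ ($\ell = e, \mu, \tau$).
$t\bar{t}$ production.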
QCD multijet production. [Footnote 2: To speed up the generation process for QCD events, we require GeV, the fraction of QCD events with GeV and producing a lepton within acceptance being negligible but computationally expensive.]
These samples are mixed to provide a SM cocktail dataset, which is then used to train autoencoder models and to tune the threshold requirement that defines what we consider an anomaly. The cocktail is built by scaling down the high-statistics samples ($W$, $Z$, and $t\bar{t}$) to the lowest-statistics one (QCD, whose generation is the most computationally expensive), according to their production cross-section values (estimated at leading order with PYTHIA) and selection efficiency (shown in Tab. 1).
Table 1: Standard Model processes and BSM benchmark processes, with their leading-order production cross section and selection efficiency. The monthly event yield is computed assuming the conditions discussed in Section 1.
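The cocktail construction described above amounts to finding the integrated luminosity that the smallest sample corresponds to and then taking, from every other sample, the number of events expected at that luminosity. A possible Python sketch follows; the function name and the numbers in the example are hypothetical placeholders, not the values of Tab. 1.

```python
def cocktail_counts(samples):
    """Number of events to take from each sample for a consistent mixture.

    samples: dict mapping process name to a tuple
             (cross_section_pb, selection_efficiency, n_available_events)
    """
    # Integrated luminosity (pb^-1) represented by each selected sample.
    lumi = {name: n / (xsec * eff) for name, (xsec, eff, n) in samples.items()}
    # The lowest-statistics sample fixes the luminosity of the cocktail.
    common_lumi = min(lumi.values())
    return {name: round(xsec * eff * common_lumi)
            for name, (xsec, eff, _) in samples.items()}


# Placeholder inputs for two fictitious processes.
toy = {"abundant": (100.0, 0.5, 1000), "scarce": (10.0, 0.1, 5)}
print(cocktail_counts(toy))
```

With these toy inputs the scarce sample is used in full and the abundant one is scaled down so that both correspond to the same integrated luminosity.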
In addition, we consider the following BSM models to benchmark anomaly-detection capabilities:
A leptoquark $LQ$ with mass 80 GeV, decaying to a $b$ quark and a $\tau$ lepton: $LQ \to b\tau$.
A neutral scalar boson $A$ with mass 50 GeV, decaying to two off-shell $Z$ bosons, each forced to decay to two leptons: $A \to 4\ell$.
A scalar boson $h^0$ with mass 60 GeV, decaying to two tau leptons: $h^0 \to \tau\tau$.
A charged scalar boson $h^\pm$ with mass 60 GeV, decaying to a tau lepton and a neutrino: $h^\pm \to \tau\nu$.
For each BSM scenario, we consider any direct production mechanism implemented in PYTHIA8, including associated jet production. We list in Tab. 1 the leading-order production cross section and selection efficiency for each model.
4 Model description
We train autoencoders (AEs) on the SM cocktail sample described in Section 3, taking as input the 21 HLF quantities listed there. The use of HLF quantities to represent events limits the model independence of the anomaly detection procedure. While the list of features is chosen to represent the main physics aspects of the considered SM processes and is in no way tailored to specific BSM models, it is true that such a list might be more suitable for certain models than for others. In this respect, one cannot guarantee that the anomaly-detection performance observed on a given BSM model would generalize to any BSM scenario. We will address in a future work a possible solution to reduce the model dependence carried by the input event representation.
In this section, we present both the best-performing autoencoder model and a set of supervised classifiers, trained to distinguish each of the four BSM benchmark models from SM events. We use the classification performance of these supervised algorithms as an estimate of the best performance that the VAE could achieve.
Autoencoders are algorithms that compress a given set of input variables into a latent space (encoding) and then, starting from the latent space, reconstruct the HLF input values (decoding). Autoencoders are used in the context of anomaly detection by associating a p-value to a given event through a quantification of the encoding-decoding distance.
In this work we focus on VAEs [25]. Unlike traditional AEs, VAEs return the event pdf in the latent and original space, instead of decoded values of the input quantities and the encoded point in the latent space. The functional form of the pdfs is specified through the loss function a priori and the pdfs' shape parameters are the output of a trainable function of the inputs. Such a function is the VAE itself and is determined during training.
We consider the VAE architecture shown in Fig. 3, characterized by a four-dimensional latent space. Each latent dimension is associated to a Gaussian pdf and its two degrees of freedom (mean $\mu_z$ and variance $\sigma_z^2$). The input layer consists of 21 nodes, corresponding to the 21 HLF quantities described in Section 3. This layer is connected to the latent space through two hidden dense layers, each consisting of 50 neurons with ReLU activation functions. Two four-neuron layers are connected to the second hidden layer. Linear activation functions are used for the first of these four-neuron layers. Its nodes are interpreted as the mean values of the latent-space Gaussian pdfs. The nodes of the second layer are activated by the functions:
This activation, inspired by Ref. [26], has been chosen to increase training stability: it is strictly positive and non-linear, but it does not involve an exponential, which might create instabilities in early epochs. These four nodes are interpreted as the variance parameters of the latent-space four-dimensional Gaussian. After several trials, the dimension of the latent space has been set to 4 in order to keep good training stability without impacting the VAE performance. The decoding step originates from a point in the latent space, sampled according to the predicted pdf (green oval in Fig. 3). The coordinates of this point in the latent space are fed into a sequence of two hidden dense layers, each consisting of 50 neurons with ReLU activation functions. The last of these layers is connected to three dense layers of 21, 17, and 10 neurons, activated by linear, p-ISRLu and clipped-tanh functions, respectively. The clipped-tanh function is written as:
The 48 output nodes represent the parameters of the pdfs describing the input HLF quantities, which enter the loss function to be minimized.
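The bookkeeping of the layer sizes and the sampling step at the start of the decoder can be summarized in plain Python. This is only a schematic of the shapes quoted above, not the KERAS model used in the study; the names `OUTPUT_HEADS` and `sample_latent` are hypothetical.

```python
import random

N_INPUTS = 21          # HLF quantities
HIDDEN_UNITS = 50      # neurons in each hidden dense layer
LATENT_DIM = 4         # dimension of the latent space
# Output layers of the decoder and their activation, as listed in the text.
OUTPUT_HEADS = {"linear": 21, "p-ISRLu": 17, "clipped-tanh": 10}


def sample_latent(means, variances, rng):
    """Draw one latent point: z_j = mu_j + sqrt(var_j) * eps_j, eps_j ~ N(0, 1)."""
    return [m + (v ** 0.5) * rng.gauss(0.0, 1.0)
            for m, v in zip(means, variances)]


rng = random.Random(0)
z = sample_latent([0.0] * LATENT_DIM, [1.0] * LATENT_DIM, rng)
print(len(z), sum(OUTPUT_HEADS.values()))  # latent size and number of output nodes
```

Writing the sampled point as a deterministic function of the predicted mean and variance plus an independent unit Gaussian is the usual way to keep the sampling step differentiable during training.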
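The VAE loss function is a weighted sum of two pieces: a term related to the probability of the inputs given the predicted output pdf parameters ($\mathrm{Loss}_{\mathrm{reco}}$) and the Kullback-Leibler divergence ($D_{\mathrm{KL}}$) between the latent-space pdf and the prior:
$$\mathrm{Loss}_{\mathrm{Tot}} = \mathrm{Loss}_{\mathrm{reco}} + \beta D_{\mathrm{KL}},$$
where $\beta$ is a free parameter set to 0.3. The prior chosen for the latent space is a four-dimensional Gaussian with a diagonal covariance matrix. The means ($\mu_P$) and the diagonal terms of the covariance matrix ($\sigma_P^2$) are free parameters of the algorithm and are optimized during back-propagation. The Kullback-Leibler divergence between two Gaussian distributions has an analytic form. Hence, for each batch, $D_{\mathrm{KL}}$ can be expressed as:
$$D_{\mathrm{KL}} = \frac{1}{k}\sum_{i} D_{\mathrm{KL}}\big(N(\mu_z^{i},\sigma_z^{i}) \,\|\, N(\mu_P,\sigma_P)\big) = \frac{1}{2k}\sum_{i,j}\left[\frac{(\sigma_z^{i,j})^2}{(\sigma_P^{j})^2} + \frac{(\mu_z^{i,j}-\mu_P^{j})^2}{(\sigma_P^{j})^2} + \ln\frac{(\sigma_P^{j})^2}{(\sigma_z^{i,j})^2} - 1\right],$$
where $k$ is the batch size, $i$ runs over the samples and $j$ over the latent-space dimensions. Similarly, $\mathrm{Loss}_{\mathrm{reco}}$ is the average negative log-likelihood of the inputs given the predicted values:
$$\mathrm{Loss}_{\mathrm{reco}} = -\frac{1}{k}\sum_{i}\ln P(x^{i}\mid\alpha_1^{i},\alpha_2^{i},\alpha_3^{i}) = -\frac{1}{k}\sum_{i,j}\ln f_j(x_{i,j}\mid\alpha_1^{i,j},\alpha_2^{i,j},\alpha_3^{i,j}),$$
where $j$ runs over the input-space dimensions, $f_j$ is the functional form chosen to describe the pdf of the $j$-th input-space variable and $\alpha_m^{i,j}$ are the parameters of the function. Different functional forms have been chosen for $f_j$, to properly describe different classes of HLF distributions: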
Clipped log-normal + $\delta$ function: used to describe , , , , , , , ChPFIso, NeuPFIso and GammaPFIso:
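Gaussian: used for the parallel and orthogonal components of the missing transverse momentum, with mean $\alpha_1$ and width $\alpha_2$:
$$P(x \mid \alpha_1, \alpha_2) = \frac{1}{\alpha_2\sqrt{2\pi}}\, e^{-\frac{(x-\alpha_1)^2}{2\alpha_2^2}}.$$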
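Truncated Gaussian: a Gaussian function truncated for negative values and normalized to unit area for $x \geq 0$. Used to model :
$$P(x \mid \alpha_1, \alpha_2) = \Theta(x)\, \frac{2}{1 + \mathrm{erf}\!\left(\frac{\alpha_1}{\alpha_2\sqrt{2}}\right)}\, \frac{1}{\alpha_2\sqrt{2\pi}}\, e^{-\frac{(x-\alpha_1)^2}{2\alpha_2^2}}.$$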
Discrete truncated Gaussian: like the truncated Gaussian, but normalized to be evaluated on integers (i.e. ). This function is used to describe , , and . It is written as:
where the normalization factor is set to:
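Binomial: used for IsEle and lepton charge:
$$P(x \mid p) = \delta_{x,m}\, p + \delta_{x,n}\, (1 - p),$$
where $m$ and $n$ are the two possible values of the variable (0 or 1 for IsEle and -1 or 1 for lepton charge) and $\delta$ is the Kronecker delta.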
Poisson: used for charged-particle and neutral-hadron multiplicities:
$$P(n \mid \mu) = \frac{\mu^{n} e^{-\mu}}{n!}.$$
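The structure of the loss can be illustrated with a short, framework-free Python sketch covering the Kullback-Leibler term and three of the simpler per-feature likelihoods (Gaussian, Poisson and two-valued). The function names are hypothetical, the batch average is omitted for brevity, and the clipped log-normal and truncated forms are not covered here.

```python
import math

BETA = 0.3  # weight of the Kullback-Leibler term quoted in the text


def kl_gaussian(mu_z, var_z, mu_p, var_p):
    """KL divergence between one-dimensional Gaussians N(mu_z, var_z) and N(mu_p, var_p)."""
    return 0.5 * (var_z / var_p + (mu_z - mu_p) ** 2 / var_p
                  + math.log(var_p / var_z) - 1.0)


def nll_gaussian(x, mean, sigma):
    """Negative log of a Gaussian density."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (x - mean) ** 2 / (2.0 * sigma ** 2)


def nll_poisson(n, mu):
    """Negative log of a Poisson probability; lgamma(n + 1) equals ln(n!)."""
    return mu - n * math.log(mu) + math.lgamma(n + 1)


def nll_two_valued(x, first_value, p_first):
    """Negative log-probability for a variable that takes only two values."""
    return -math.log(p_first if x == first_value else 1.0 - p_first)


def total_loss(reco_terms, kl_terms, beta=BETA):
    """Per-event loss: reconstruction terms plus beta times the KL terms."""
    return sum(reco_terms) + beta * sum(kl_terms)


reco = [nll_gaussian(1.2, 1.0, 0.5), nll_poisson(3, 2.5), nll_two_valued(1, 1, 0.8)]
kl = [kl_gaussian(0.1, 0.9, 0.0, 1.0) for _ in range(4)]
print(total_loss(reco, kl))
```

Because each feature contributes its own negative log-likelihood, an event that is poorly described by the predicted pdfs receives a large reconstruction term, which is what the anomaly selection relies on.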
The model is implemented in KERAS+TENSORFLOW [27, 28] and trained with the Adam optimizer [29] on a SM dataset of 3.45M samples, equivalent to an integrated luminosity of pb$^{-1}$. The SM validation dataset is made of 3.45M statistically independent samples. Such a sample would be collected in about ten hours of continuous running, under the assumptions made in this study (see Section 1). In training, we fix the batch size to 1000. We use early stopping with patience set to 20 and , and we progressively reduce the learning rate on plateau, with patience set to 8 and .
While optimizing anomaly-detection performance, alternative architectures were tested. For instance, we increased or decreased the dimensionality of the latent space, we changed the value of $\beta$ in Eq. (6), we changed the number of neurons in the hidden layers, tried the RMSprop optimizer, and used plain Gaussian priors for the 21 input features. In addition, we tested the use of a VampPrior [30]. While some of these alternative models improved the encoding-decoding capability of the VAE, no sizable improvement in anomaly-detection performance was observed. For simplicity, we limited our study to the architecture in Fig. 3 and dropped these alternative models.
4.2 Supervised classifiers
For each of the four BSM benchmark models, we train a fully-supervised classifier, based on a Boosted Decision Tree (BDT). Each BDT receives as input the same 21 features used by the VAE and is trained on a labelled dataset consisting of the SM cocktail (the background) and one of the four BSM benchmark models (the signal). The implementation is done through the Gradient Boosted Regressor of the scikit-learn library [31], with up to 150 estimators, minimum samples per leaf and maximum depth equal to 3, a learning rate of 0.1, and a tolerance of on the validation loss function (chosen to be the default deviance). Each BDT, tailored to a specific BSM model, is trained on 3.45M SM events and about 0.5M BSM events, consistently up-weighted in order to have the same impact on the loss function (i.e., the weights are 1 for SM events and for BSM ones, depending on the actual size of the BSM sample used). In addition, we experimented with fully-connected deep neural networks (DNNs) with two hidden layers. Despite trying different architectures, we did not find a configuration in which DNNs outperformed BDTs. We then decided to use the BDTs as a reference for fully-supervised discrimination capabilities.
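One natural way to give the two classes the same total impact on the loss, assumed here for illustration, is to weight every BSM event by the ratio of the two sample sizes. The helper below is hypothetical and uses the approximate sample sizes quoted above.

```python
def class_weights(n_sm, n_bsm):
    """Per-event weights (SM, BSM) such that both classes have equal total weight."""
    return 1.0, n_sm / n_bsm


w_sm, w_bsm = class_weights(3_450_000, 500_000)
print(w_sm, w_bsm)
# Total weight of each class after re-weighting.
print(3_450_000 * w_sm, 500_000 * w_bsm)
```

Weights of this kind would typically be passed as per-sample weights to the training routine of the classifier.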
Figure 6 shows the ROC curves obtained for the four BDTs. We summarize in Tab. 2 the classification performance of the four supervised BDTs, which set a qualitative upper limit for the VAE results. Overall, the four models can be discriminated with good accuracy, with some loss of performance for those models sharing similarities with specific SM processes (e.g., exhibiting single- and double-lepton topologies with missing transverse energy, typical of $t\bar{t}$ events). In the table, we also quote the true-positive rate (TPR) corresponding to a SM false positive rate . This value of the efficiency is the one needed for an average of 1000 SM events per month.
5 Results with VAE
An event is classified as anomalous whenever the associated loss, computed from the VAE output, is above a given threshold. Since no BSM signal was observed so far, it is reasonable to expect that a new-physics signal, if any, would be characterized by a low production cross section and/or features very similar to those of a SM process. In view of this, we decided to use a tight threshold value, in order to reduce as much as possible any SM contribution.
Figure 7 shows the distribution of the $\mathrm{Loss}_{\mathrm{reco}}$ and $D_{\mathrm{KL}}$ loss components for the validation dataset. In both plots, the vertical line represents a lower threshold such that a of the SM events would be retained. This threshold value would result in SM events being selected every month, i.e., a daily rate of events, as illustrated in Table 3. The acceptance rate is calculated assuming the LHC running conditions listed in Section 1. Table 3 also reports the by-process VAE selection efficiency and the relative background composition of the selected sample.
Figure 7 also shows the $\mathrm{Loss}_{\mathrm{reco}}$ and $D_{\mathrm{KL}}$ distributions for the four benchmark BSM models. We observe that the discrimination power, loosely quantified by the integral of these distributions above threshold, is better for $\mathrm{Loss}_{\mathrm{reco}}$ than for $D_{\mathrm{KL}}$ and that the impact of the $D_{\mathrm{KL}}$ term on discrimination is negligible. Anomalies are then defined as events lying on the right tail of the expected $\mathrm{Loss}_{\mathrm{reco}}$ distribution.
The left plot in Fig. 8 shows the ROC curves obtained from the $\mathrm{Loss}_{\mathrm{reco}}$ distribution of the four BSM benchmark models and the SM cocktail, compared to the corresponding BDT curves of Section 4.2. The right plot in Fig. 8 shows the p-value computed from the cocktail SM distribution, both for the SM events themselves (flat by construction) and for the four BSM processes. As the plot shows, BSM processes tend to concentrate at small p-values, which allows their identification as anomalies.
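The selection logic described in this section, a threshold on the loss tuned on SM events and a one-sided p-value for each incoming event, can be sketched with the Python standard library as follows. The function names are hypothetical, and the empirical p-value (fraction of reference SM events with a loss at least as large) is one simple choice made for this example.

```python
import bisect


def loss_threshold(sm_losses, target_fraction):
    """Loss value above which at most target_fraction of the SM events fall."""
    ordered = sorted(sm_losses)
    n_pass = int(target_fraction * len(ordered))
    if not 0 <= n_pass < len(ordered):
        raise ValueError("target_fraction must be in [0, 1)")
    return ordered[len(ordered) - n_pass - 1]


def p_value(sm_losses_sorted, loss):
    """One-sided empirical p-value: fraction of SM events with loss >= `loss`.

    sm_losses_sorted must be sorted in ascending order.
    """
    first = bisect.bisect_left(sm_losses_sorted, loss)
    return (len(sm_losses_sorted) - first) / len(sm_losses_sorted)


def is_anomalous(loss, threshold):
    """An event is selected when its loss is strictly above the threshold."""
    return loss > threshold


# Toy reference sample: 100 SM events with losses 1, 2, ..., 100.
reference = [float(v) for v in range(1, 101)]
threshold = loss_threshold(reference, 0.25)
print(threshold, p_value(reference, 96.0), is_anomalous(99.0, threshold))
```

In a real deployment the reference distribution would contain millions of events and the target fraction would be many orders of magnitude smaller than in this toy example.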
Table 3: Standard Model processes.
| Process | VAE selection | Sample composition | Events/month |
Table 4 summarizes the VAE performance on the four BSM benchmark models. Together with the selection efficiency corresponding to , the table reports the effective cross section (cross section after applying the trigger requirements) that would correspond to 100 selected events in a month (assuming an integrated luminosity of 5 fb$^{-1}$). Similarly, we quote the cross section that would result in a signal-to-background ratio of 1/3 on the sample of events selected by the VAE. The VAE can probe the four models down to relatively low cross section values, comparable to those that are typically probed in dedicated fully-supervised searches. As a comparison, Ref. [32] excludes a leptoquark with a mass of 150 GeV and production cross section larger than pb, using 4.8 fb$^{-1}$ at a center-of-mass energy of 7 TeV, while the most recent searches [33] only cover larger mass values.
Table 4: BSM benchmark processes.
| VAE selection efficiency | Cross section for 100 events/month [pb] | Cross section for S/B = 1/3 [pb] |
6 How to deploy a VAE for BSM detection
The work presented in this paper suggests the possibility of deploying a VAE as a trigger algorithm associated to dedicated data streams. Such a trigger would isolate anomalous events, similarly to what was done by the CMS experiment at the beginning of the first LHC run. At that time, with an early new physics signal being a possibility, the CMS experiment deployed online a set of algorithms (collectively called hot line) to select potentially interesting new-physics candidates. Anomalies were then characterized as events with high-$p_T$ particles or high particle multiplicities, in line with the kind of early-discovery new physics scenarios considered at the time. The events populating the hot-line stream were immediately processed at the CERN computing center (as opposed to traditional physics streams, which are processed after 48 hours). The hot-line algorithms were tuned to collect O(10) events per day, which were then visually inspected by experts.
While the focus of the work presented in this paper is not an early discovery, the spirit of the application we propose would be similar: a set of VAEs deployed online would select a limited number of events every day. These events would be collected in a dedicated dataset and further analyzed. The analysis technique could go from visual inspection of the collisions to detailed studies of reconstructed objects, up to some kind of model-independent analysis of the collected dataset, e.g., a deep-learning implementation of model-independent hypothesis testing [11] applied directly to the loss distribution (provided that a reliable sample of background-only data is available).
While a pure SM sample to train VAEs could only be obtained from Monte Carlo simulation, the presence of outlier contamination in the training sample typically has a tiny impact on performance. One could then imagine training the VAE models on the data collected so far and using them on a much larger dataset. In our study, we considered a training dataset of pb$^{-1}$ and applied the VAE to a larger dataset. One could even envision more frequent re-trainings (e.g., every factor increase in integrated luminosity or in the presence of substantial changes in detector and/or accelerator conditions). Such a training could happen offline on a dedicated dataset, e.g., by deploying triggers that randomly select events entering the last stage of the trigger system. The training could even happen online, assuming the availability of sufficient computing resources.
To demonstrate the feasibility of a train-on-data strategy, we enrich the dataset used in Section 4 with a signal contamination of events. As a starting point, the amount of injected signal is tuned to a luminosity of 100 pb$^{-1}$ and a cross section of 7.1 pb, corresponding to the value at which the VAE in Section 4 would select 100 events in 5 fb$^{-1}$. This results in about 700 events added to the training sample. The VAE is trained following the procedure outlined in Section 4 and its performance is compared to that obtained on a signal-free dataset of the same size. The comparison of the ROC curves for the two models is shown in Fig. 9. In the same figure, we show similar results, derived injecting a and signal contamination. A degradation of the VAE performance is observed once the signal cross section is set to 710 pb (i.e., 100 times the sensitivity value found in Section 4). At that point, the contamination is so large that the signal becomes as abundant as events and would have easily detectable consequences. For comparison, at a production cross section of 27 pb, a third of the events selected by the VAE in Section 4 would come from production (see Table 4), and this would have negligible consequences on the training quality. This test shows that a robust anomaly-detecting VAE could be trained directly on data, even in the presence of previously undetected (e.g., at the Tevatron and at the 7-TeV and 8-TeV LHC) BSM signals.
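The numbers in this test are tied together by the relation between event yield, cross section and integrated luminosity. The sketch below checks them, taking the starting cross section as 7.1 pb: this is the value for which 100 pb$^{-1}$ gives about 700 injected events and for which 710 pb is exactly 100 times larger. The function names are hypothetical.

```python
def expected_events(cross_section_pb, lumi_inv_pb, efficiency=1.0):
    """Expected yield: cross section times integrated luminosity times efficiency."""
    return cross_section_pb * lumi_inv_pb * efficiency


def implied_efficiency(n_selected, cross_section_pb, lumi_inv_pb):
    """Selection efficiency implied by a given number of selected events."""
    return n_selected / (cross_section_pb * lumi_inv_pb)


# Signal events injected in a 100 pb^-1 training sample at 7.1 pb.
print(expected_events(7.1, 100.0))
# Efficiency implied by 100 selected events in 5 fb^-1 (5000 pb^-1) at 7.1 pb.
print(implied_efficiency(100, 7.1, 5000.0))
# The contamination scenario at 100 times the starting cross section.
print(expected_events(710.0, 100.0))
```

The implied efficiency of a few per mille is an overall figure that folds together any acceptance and selection effects not separated out in this back-of-the-envelope check.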
7 Conclusions
We present a strategy to isolate potential BSM events produced by the LHC, using variational autoencoders trained on a reference SM sample. Such an algorithm could be used in the trigger system of general-purpose LHC experiments to isolate recurrent anomalies, which might otherwise escape observation (e.g., being filtered out by a typical trigger selection). Taking as an example a single-lepton data stream, we show how such an algorithm could select datasets enriched with events originating from challenging BSM scenarios. We also discuss how the model training could happen directly on data, with no sizable performance loss.
The final outcome of the analysis would be a list of anomalous events that the experimental collaborations could further scrutinize and even release as a catalog, similarly to what is typically done in other scientific domains. Repeated patterns in these events could motivate new scenarios for beyond-the-standard-model physics and inspire new searches, to be performed on future data with traditional supervised approaches.
We believe that such an application could help extend the physics reach of the current and next stages of the CERN LHC.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no. 772369) and the United States Department of Energy, Office of High Energy Physics Research under Caltech Contract No. DE-SC0011925. This work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".
[1] ATLAS, CMS, LHC Higgs Combination Group Collaboration, Procedure for the LHC Higgs boson search combination in summer 2011.
[2] ATLAS Collaboration, G. Aad et al., Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC, Phys. Lett. B716 (2012) 1–29, [arXiv:1207.7214].
[3] CMS Collaboration, S. Chatrchyan et al., Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC, Phys. Lett. B716 (2012) 30–61, [arXiv:1207.7235].
[4] CDF Collaboration, T. Aaltonen et al., Global Search for New Physics with 2.0 fb$^{-1}$ at CDF, Phys. Rev. D79 (2009) 011101, [arXiv:0809.3781].
[5] D0 Collaboration, V. M. Abazov et al., Model independent search for new phenomena in $p\bar{p}$ collisions at $\sqrt{s} = 1.96$ TeV, Phys. Rev. D85 (2012) 092015, [arXiv:1108.5362].
[6] H1 Collaboration, F. D. Aaron et al., A General Search for New Phenomena at HERA, Phys. Lett. B674 (2009) 257–268, [arXiv:0901.0507].
[7] CMS Collaboration, MUSiC, a Model Unspecific Search for New Physics, in pp Collisions at $\sqrt{s} = 8$ TeV, Tech. Rep. CMS-PAS-EXO-14-016, CERN, Geneva, 2017.
[8] ATLAS Collaboration, M. Aaboud et al., A strategy for a general search for new phenomena using data-derived signal regions and its application within the ATLAS experiment, Submitted to: Eur. Phys. J. (2018) [arXiv:1807.07447].
[9] L. Lyons, Open statistical issues in particle physics, ArXiv e-prints (Nov., 2008) [arXiv:0811.1663].
[10] E. Gross and O. Vitells, Trial factors for the look elsewhere effect in high energy physics, Eur. Phys. J. C70 (2010) 525–530, [arXiv:1005.1891].
[11] R. T. D'Agnolo and A. Wulzer, Learning New Physics from a Machine, arXiv:1806.02350.
[12] J. H. Collins, K. Howe, and B. Nachman, CWoLa Hunting: Extending the Bump Hunt with Machine Learning, arXiv:1805.02664.
[13] A. De Simone and T. Jacques, Guiding New Physics Searches with Unsupervised Learning, arXiv:1807.06038.
[14] A. A. Pol, G. Cerminara, C. Germain, M. Pierini, and A. Seth, Detector monitoring with artificial neural networks at the CMS experiment at the CERN Large Hadron Collider, arXiv:1808.00911.
[15] CMS Collaboration, Anomaly detection using Deep Autoencoders for the assessment of the quality of the data acquired by the CMS experiment, Tech. Rep., CERN, Geneva, Jul, 2018.
[16] ATLAS Collaboration, Deep generative models for fast shower simulation in ATLAS, Tech. Rep. ATL-SOFT-PUB-2018-001, CERN, Geneva, Jul, 2018.
[17] T. Heimel, G. Kasieczka, T. Plehn, and J. M. Thompson, QCD or What?, arXiv:1808.08979.
[18] M. Farina, Y. Nakai, and D. Shih, Searching for New Physics with Deep Autoencoders, arXiv:1808.08992.
[19] T. Q. Nguyen et al., Topology classification with deep learning to improve real-time event selection at the LHC, arXiv:1807.00083.
[20] T. Sjöstrand et al., An Introduction to PYTHIA 8.2, Comput. Phys. Commun. 191 (2015) 159–177, [arXiv:1410.3012].
[21] DELPHES 3 Collaboration, J. de Favereau et al., DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP 02 (2014) 057, [arXiv:1307.6346].
[22] CMS Collaboration, V. Khachatryan et al., Technical Proposal for the Phase-II Upgrade of the CMS Detector, Tech. Rep. CERN-LHCC-2015-010. LHCC-P-008. CMS-TDR-15-02, Geneva, Jun, 2015.
[23] M. Cacciari, G. P. Salam, and G. Soyez, FastJet User Manual, Eur. Phys. J. C72 (2012) 1896, [arXiv:1111.6097].
[24] M. Cacciari, G. P. Salam, and G. Soyez, The anti-$k_t$ jet clustering algorithm, JHEP 04 (2008) 063, [arXiv:0802.1189].
[25] D. P. Kingma and M. Welling, Auto-Encoding Variational Bayes, ArXiv e-prints (Dec., 2013) [arXiv:1312.6114].
[26] Wikipedia contributors, Activation function — Wikipedia, the free encyclopedia, 2018. [Online; accessed 25-November-2018].
[27] F. Chollet et al., "Keras." https://github.com/fchollet/keras, 2015.
[28] M. Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[29] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, ArXiv e-prints (Dec., 2014) [arXiv:1412.6980].
[30] J. M. Tomczak and M. Welling, VAE with a VampPrior, CoRR abs/1705.07120 (2017) [arXiv:1705.07120].
[31] F. Pedregosa et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[32] CMS Collaboration, S. Chatrchyan et al., Search for pair production of third-generation leptoquarks and top squarks in pp collisions at $\sqrt{s} = 7$ TeV, Phys. Rev. Lett. 110 (2013), no. 8 081801, [arXiv:1210.5629].
[33] CMS Collaboration, A. M. Sirunyan et al., Search for third-generation scalar leptoquarks and heavy right-handed neutrinos in final states with two tau leptons and two jets in proton-proton collisions at $\sqrt{s} = 13$ TeV, JHEP 07 (2017) 121, [arXiv:1703.03995].