Topology classification with deep learning to improve real-time event selection at the LHC


Thong Q. Nguyen (California Institute of Technology, USA; thong@caltech.edu), Daniel Weitekamp III (University of California at Berkeley, USA), Dustin Anderson (California Institute of Technology, USA), Roberto Castello (Experimental Physics Department, CERN, CH), Olmo Cerri (California Institute of Technology, USA), Maurizio Pierini (Experimental Physics Department, CERN, CH), Maria Spiropulu (California Institute of Technology, USA), Jean-Roch Vlimant (California Institute of Technology, USA)
Abstract

We show how an event topology classification based on deep learning could be used to improve the purity of data samples selected in real time at the Large Hadron Collider. We consider different data representations, on which different kinds of multi-class classifiers are trained. Both raw data and high-level features are utilized. In the considered examples, a filter based on the classifier's score can be trained to retain approximately 99% of the interesting events and reduce the false-positive rate by more than one order of magnitude. By operating such a filter as part of the online event selection infrastructure of the LHC experiments, one could benefit from a more flexible and inclusive selection strategy while reducing the amount of downstream resources wasted in processing false positives. The saved resources could translate into a reduction of the detector operation cost or into an effective increase of storage and processing capabilities, which could be reinvested to extend the physics reach of the LHC experiments.

1 Introduction

The CERN Large Hadron Collider (LHC) collides protons every 25 ns. Each collision can result in any of hundreds of physics processes, and the total data volume exceeds by far what the experiments can record. This is why the incoming data flow is typically filtered through a set of rule-based algorithms, designed to retain only events with particular signatures (e.g., the presence of a high-energy particle of some kind). Such a system, commonly referred to as the trigger, consists of hundreds of algorithms, each designed to accept events with a specific topology. The ATLAS Aaboud:2016leb and CMS Adam:2005zf trigger systems are based on this idea. In their current implementation, given the throughput capability and the typical event size, these two experiments can write to disk on the order of 1000 events per second. A few processes, e.g., QCD multijet production, constitute the vast majority of the produced events, and one is typically interested in selecting only a small fraction of them for further studies. On the other hand, the main interest of the LHC experiments lies in selecting and studying the many rare processes that occur at the LHC. In a typical data flow, these events are overwhelmed by the large amount of QCD multijet events. The trigger system is put in place to make sure that the majority of these rare events are among the stored events.

Trigger algorithms are typically designed to maximize the efficiency (i.e., the true-positive rate), resulting in a non-negligible false-positive rate and, consequently, in a substantial waste of resources at trigger level (i.e., data throughput that could have been used for other purposes) and downstream (i.e., storage disk, processing power, etc.).

The most commonly used selection rules are inclusive, i.e., more than one topology is selected by the same requirement. The so-called isolated-lepton triggers are a typical example of this kind of algorithm. These triggers select events with a high-momentum electron or muon and no surrounding energetic particle, a typical signature of an interesting rare process, e.g., the production of a W boson decaying to a neutrino and an electron or muon. With such a requirement, one can simultaneously collect W bosons produced directly in the hard interaction (W events) and W bosons from the cascade decay of other particles, e.g., top quarks (mainly in tt̄ events, where a top quark-antiquark pair is produced). A sample selected this way is dominated by W events, but it retains a substantial contamination from QCD multijet events; the tt̄ contribution is smaller still. Events from tt̄ production are sometimes triggered by a set of dedicated lepton+jets algorithms, capable of using looser requirements on the lepton at the cost of introducing requirements on jets (a jet is a spray of hadrons, typically originating from the hadronization of gluons and quarks produced in the proton collisions). Due to this additional complexity, the use of these triggers in a data analysis comes with additional complications. For instance, the applied jet requirements produce distortions in the offline distributions of jet-related quantities. To avoid this effect, a typical data analysis applies a tighter offline selection, which means that many of the selected events close to the online-selection threshold are discarded. This is not necessarily the most cost-effective way to retain an unbiased dataset for offline analysis.

Figure 1: Relative composition of the isolated-lepton sample after the acceptance requirement (left) and the trigger selection (right), as described in the text.

In this paper, we investigate the possibility of using machine learning to classify events based on their topologies, serving as an additional clean-up algorithm at the trigger level. Doing so, one could customize the trigger-selection strategy for individual processes (depending on the physics goals) while keeping the selection loose and simple. As a benchmark case, we consider a stream of data selected by requiring the presence of one electron or muon with transverse momentum pT > 23 GeV (throughout this paper, we use natural units such that ħ = c = 1) and a loose requirement on the isolation. Details on the applied selection can be found in Sec. 2.

The considered benchmark sample is dominated by direct W production, with a sizable contamination from QCD multijet events and a small contribution of tt̄ events. Other interesting processes are usually selected with more exclusive and dedicated trigger algorithms (e.g., di-muon or di-electron triggers), or share the kinematic properties of the two main processes of interest (W and tt̄). For the sake of simplicity, we ignore these sub-leading processes in our study, without compromising the validity of our conclusions. Fig. 1 shows the composition of a sample with one electron or muon within the defined acceptance (a minimum transverse momentum and a pseudorapidity requirement, where the pseudorapidity is defined as η = −ln tan(θ/2), with θ the polar angle), before and after applying the trigger requirements (pT > 23 GeV and loose isolation).

Such a loose set of requirements would translate into an event acceptance rate of about 690 Hz for an instantaneous luminosity of 2 × 10³⁴ cm⁻² s⁻¹, well beyond the budget currently allocated to these triggers. We suggest that, using the score of our topology classifier, one could tune the amount of each process to be stored for further analysis, within the boundaries of the allocated resources. For instance, one might be interested in retaining all the tt̄ events and some fraction of the W events, while rejecting the QCD multijet events. We envision two main applications: for a given total rate, one could loosen the baseline trigger requirements, increasing the acceptance efficiency at no cost; or, for a given acceptance efficiency (true-positive rate), one could save resources by reducing the overall rate, rejecting the contribution of unwanted topologies (see Appendix A).

We consider several topology classifiers based on deep learning model architectures: fully-connected deep neural networks (DNNs), convolutional neural networks (CNNs) CNN, and recurrent neural networks such as Long-Short-Term-Memory networks (LSTMs) LSTM and gated recurrent units (GRUs) GRU. We consider four different representations of the collision events: (i) a set of physics-motivated high-level features, (ii) the raw image of the detector hits, (iii) a sequence of particles, characterized by a limited set of basic features (energy, direction, etc.), and (iv) an abstract representation of this list of particles as an image.

The paper is structured as follows. In Sec. 2 we describe the four data representations. In Sec. 3 we describe the corresponding classification models. Results are discussed in Sec. 4. In Sec. 5 we investigate the generalization properties of the four classifiers to scenarios of other topologies. We study the robustness of our classifiers against Monte-Carlo simulation inaccuracy with pseudo-data in Sec. 6. In Sec. 7 we briefly discuss applications of machine learning algorithms to similar problems. Conclusions are given in Sec. 8. Appendix A describes a different scenario, in which the classifier is used to save resources by reducing the trigger acceptance rate, as opposed to using it to sustain a loose trigger selection that could otherwise require too many resources.

2 Dataset

Synthetic data corresponding to W, tt̄, and QCD multijet production topologies are generated with the same number of events per process, using the PYTHIA8 event-generation library pythia. The setup of the proton-beam simulation is loosely inspired by the LHC running configuration in 2015-2016: two proton beams, each with an energy of 6.5 TeV, generate on average 20 proton-proton collisions per crossing, following a Poisson distribution.

Generated samples are processed with the DELPHES library delphes, which applies a parametric model of the detector response. The detector performance is tuned to the CMS upgrade design foreseen for the High-Luminosity LHC CMS_TP, as implemented in the corresponding default card provided with DELPHES. We run the DELPHES particle-flow (PF) algorithm, which combines the information from all the CMS detector components to derive a list of reconstructed particles, the so-called PF candidates. For each particle, the algorithm returns the measured energy and flight direction. Each particle is associated with one of three classes: charged particles, photons, and neutral hadrons. Jets are clustered from the reconstructed PF candidates, using the FASTJET fastjet implementation of the anti-kT jet algorithm antikt, with jet-size parameter R = 0.4. The b-tagging efficiency is parametrized as a function of the jet pT and η in the default DELPHES CMS upgrade card, and this parametrization has been shown to provide reasonable agreement with CMS results delphes.

The basic event representation consists of a list of reconstructed PF candidates. For each candidate q, the following information is given: (i) the particle four-momentum in Cartesian coordinates (E, px, py, pz); (ii) the particle three-momentum, computed from (i), in cylindrical coordinates: the transverse momentum pT, the pseudorapidity η, and the azimuthal angle φ; (iii) the Cartesian coordinates (x, y, z) of the particle point of origin, with (0, 0, 0) used for all neutral particles in the absence of pointing information; (iv) the electric charge; (v) the particle isolation with respect to charged particles (ChPFIso), photons (GammaPFIso), or neutral hadrons (NeuPFIso). For each particle class, the isolation is quantified as

$$\mathrm{Iso}(q) \;=\; \frac{\sum_{p \in \mathrm{class},\ \Delta R(p,q) < R}\; p_T^{\,p}}{p_T^{\,q}} \qquad (1)$$

where the sum extends over all the particles p of the appropriate class within a fixed angular distance R of the particle q, with ΔR(p,q) = √(Δη² + Δφ²).

The particle identity is categorized via a one-hot-encoded representation across the three particle classes (charged particle, neutral hadron, photon). In addition, two boolean flags are stored to identify whether a given particle is an electron or a muon. In total, each particle is then described by 19 features.

The trigger selection is emulated by requiring all the events to include one isolated electron or muon with transverse momentum pT > 23 GeV and particle-based isolation ISO < 0.45. This baseline selection, which follows the typical requirements of an inclusive single-lepton trigger algorithm, accepts a sizable number of QCD multijet and tt̄ events alongside the targeted W events. Despite its large W and tt̄ efficiency, this trigger selection comes with a large cost in terms of QCD multijet events written to disk and processed offline. The cost is even larger if the main physics target is tt̄ events and the W contribution is seen as an additional source of background (e.g., in a high-statistics scenario, with all measurements of W properties limited in precision by systematic uncertainties).

All particles are ranked in decreasing order of pT. For each event, the isolated lepton is the first entry of the list of particles; to avoid double counting, the isolated lepton is excluded from the list of charged particles. In addition to the isolated lepton, we consider the first 450 charged particles, the first 150 photons, and the first 200 neutral hadrons. This corresponds to a total of 801 particles per event, each characterized by the 19 features described above. The choice of the numbers of particles is made such that, on average, only 5% of the charged particles, 5% of the neutral hadrons, and 1% of the photons are ignored. Thanks to the pT ordering within each particle category, the discarded particles carry little information; in early stages of this work we experimented with tighter cuts on the particle multiplicity without observing substantial differences, and we verified that the ignored particles have typical pT below 1 GeV. If fewer particles are found in the event, zero padding is used to guarantee a fixed length of the particle list across different events. The events are then stored as NumPy arrays in a set of compressed HDF5 files. The dataset is planned to be released on the CERN OpenData portal, accessible at opendata.cern.ch.
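A minimal NumPy sketch of this fixed-length list construction is given below; the array layout, the column index of pT, and the helper names are illustrative assumptions, not the exact code used to produce the dataset.

```python
import numpy as np

PT_INDEX = 4          # hypothetical position of pT among the 19 features
MAX_PER_CLASS = {"charged": 450, "photon": 150, "neutral": 200}

def pad_or_truncate(particles, max_len):
    """Sort by decreasing pT, keep the first max_len rows, zero-pad the rest."""
    order = np.argsort(-particles[:, PT_INDEX])
    particles = particles[order][:max_len]
    padded = np.zeros((max_len, particles.shape[1]), dtype=np.float32)
    padded[: len(particles)] = particles
    return padded

def build_event(lepton, charged, photons, neutrals):
    """Stack the isolated lepton plus 450+150+200 candidates into an (801, 19) array."""
    return np.concatenate([
        lepton.reshape(1, -1),
        pad_or_truncate(charged, MAX_PER_CLASS["charged"]),
        pad_or_truncate(photons, MAX_PER_CLASS["photon"]),
        pad_or_truncate(neutrals, MAX_PER_CLASS["neutral"]),
    ], axis=0)
```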

In addition to this raw-event representation, we provide a list of physics-motivated high-level features, computed from the full event (the HLF dataset); a short code sketch after the list illustrates how some of these quantities can be derived:

  • The scalar sum, HT, of the pT of all the jets, leptons, and photons in the event above a minimum pT threshold and within the pseudorapidity acceptance.

  • The missing transverse energy, E_T^miss, defined as the magnitude of the missing transverse momentum, computed by summing over the full list of reconstructed PF candidates:

    $$E_T^{\mathrm{miss}} = \Big|\, -\sum_{q \in \mathrm{PF}} \vec{p}_T^{\;q} \,\Big| \qquad (2)$$
  • The squared transverse mass, M_T², of the isolated lepton and the E_T^miss system, defined as:

    $$M_T^2 = 2\, p_T^{\,\ell}\, E_T^{\mathrm{miss}} \left(1 - \cos\Delta\phi\right) \qquad (3)$$

    with p_T^ℓ the transverse momentum of the lepton and Δφ the azimuthal separation between the lepton and the E_T^miss vector.

  • The azimuthal angle of the E_T^miss vector, φ(E_T^miss).

  • The number of jets entering the HT sum.

  • The number of these jets identified as originating from a b quark.

  • The isolated-lepton momentum, expressed as (pT, η, φ).

  • The three isolation quantities (ChPFIso, NeuPFIso, GammaPFIso) for the isolated lepton.

  • The lepton charge.

  • A flag distinguishing whether the isolated lepton is an electron or a muon.
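The following sketch illustrates how three of these quantities (Eqs. (2) and (3), and HT) could be computed from the candidate and jet lists; the function signatures and the jet thresholds (30 GeV, |η| < 2.4) are assumptions for illustration only.

```python
import numpy as np

def missing_et(px, py):
    """Eq. (2): magnitude and azimuth of minus the vector pT sum of the PF candidates."""
    mex, mey = -np.sum(px), -np.sum(py)
    return np.hypot(mex, mey), np.arctan2(mey, mex)

def transverse_mass_sq(lep_pt, lep_phi, met, met_phi):
    """Eq. (3): squared transverse mass of the lepton + missing-momentum system."""
    return 2.0 * lep_pt * met * (1.0 - np.cos(lep_phi - met_phi))

def ht(obj_pt, obj_eta, min_pt=30.0, max_eta=2.4):
    """Scalar pT sum of the objects above an assumed pT threshold and within acceptance."""
    mask = (obj_pt > min_pt) & (np.abs(obj_eta) < max_eta)
    return float(np.sum(obj_pt[mask]))
```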

The list of 801 particles is used to generate two visual representations of the events: a raw representation and an abstract representation. In the raw representation, the (η, φ) plane corresponding to the detector acceptance is divided into a central barrel region, two end-cap regions, and two forward regions. The barrel and end-cap regions of the electromagnetic calorimeter, as well as the end-cap of the hadronic calorimeter (HCAL), are binned with a fine cell size, while the barrel region of the HCAL is binned with a coarser cell size, following the corresponding calorimeter granularities. The forward regions are binned with cells of size 0.175 in φ, while the cell size in η varies from 0.175 to 0.35. Each cell is filled with the scalar sum of the pT of the particles pointing to that cell. The three classes of particles (charged particles, photons, and neutral hadrons) are considered separately, resulting in three channels. An example is shown in Fig. 2. This representation corresponds to the raw image recorded by the detector.
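A toy sketch of this image construction is shown below: one 2D (η, φ) histogram per particle class, each cell filled with the scalar pT sum of the particles falling into it. The uniform binning is a placeholder; the paper uses the finer, region-dependent granularity described above.

```python
import numpy as np

ETA_EDGES = np.linspace(-5.0, 5.0, 201)   # assumed coarse, uniform binning
PHI_EDGES = np.linspace(-np.pi, np.pi, 73)

def raw_image(eta, phi, pt, particle_class, n_classes=3):
    """Return an array of shape (n_classes, n_eta_bins, n_phi_bins)."""
    image = np.zeros((n_classes, len(ETA_EDGES) - 1, len(PHI_EDGES) - 1))
    for c in range(n_classes):
        sel = particle_class == c
        # weight each (eta, phi) cell by the summed pT of the selected particles
        image[c], _, _ = np.histogram2d(
            eta[sel], phi[sel], bins=(ETA_EDGES, PHI_EDGES), weights=pt[sel]
        )
    return image
```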

Figure 2: An example of an event given as input to the raw-image classifier. The two axes correspond to the η and φ coordinates of the sub-detectors.

Recently, it was proposed to represent LHC collision events as abstract images in which reconstructed physics objects (jets, in that case) are drawn as geometric shapes whose size reflects the energy of the object Madrazo. We generalize this abstract-representation approach by applying it to the full list of particles. Each particle is represented by a geometric shape, centered at the particle's (η, φ) coordinates and with size proportional to its pT. The geometric shapes are chosen as follows: (i) pentagons for the selected isolated electron or muon; (ii) triangles for photons; (iii) squares for charged particles; (iv) hexagons for neutral hadrons. The images are digitized as five-channel pixel arrays, where each of the first four channels contains a separate particle class, and the last channel contains the E_T^miss, represented as a circle. As an example, the abstract representation of the event in Fig. 2 is shown in Fig. 3.

This abstract representation mitigates the sparsity problem of the raw images. On the other hand, there is no guarantee that the physics information is fully retained in this translation; as a result, there could be a reduction of discrimination power. This is one of the points we aim to investigate in this study.

Figure 3: Example of an event, represented as five-channel abstract images of photons (top left), charged hadrons (top center), neutral hadrons (top right), the isolated lepton (bottom left), and the E_T^miss of the event (bottom right).

3 Model description

In this section, we describe five types of multi-class classifiers, trained on the four data representations described in the previous section. We start by considering a state-of-the-art HEP application, based on the high-level features listed in Sec. 2. We then consider a convolutional neural network taking as input the raw images. This model offers the baseline point of comparison for the classifier using the abstract images. In order to have a fair comparison between the two approaches, the same kind of network architecture is used for the two sets of images. Next, we consider recurrent neural networks based on LSTMs and GRUs, trained directly on the lists of 801 particles. Finally, we consider a classifier taking both the high-level features and the list of 801 particles as inputs, using a combination of recurrent neural networks and fully connected neural networks.

The CNNs are implemented in PyTorch pytorch. The recurrent neural networks and feed-forward neural networks are implemented in Keras chollet2015keras and trained using Theano theano as a back-end. The Adam optimizer Adam is used to adapt the learning rate. The training is capped at 50 epochs, and can be stopped early if there is no improvement in terms of validation loss after 8 epochs. Categorical cross entropy is used as the loss function. All trainings are performed on a cluster of GeForce GTX 1080 GPUs. In an early stage of this work, experiments on the recurrent models were performed on the CSCS Piz Daint super computer, using the mpi-learn library mpi-learn for multiple-GPU training.
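The following sketch summarizes this common training configuration in Keras (Adam optimizer, categorical cross entropy, at most 50 epochs, early stopping after 8 epochs without improvement of the validation loss); the function signature and variable names are illustrative, not the exact training script used for the paper.

```python
from keras.callbacks import EarlyStopping

def train(model, x_train, y_train, x_val, y_val):
    # compile with the optimizer and loss described in the text
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # cap the training at 50 epochs, stop early after 8 epochs without improvement
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=50,
                     callbacks=[EarlyStopping(monitor="val_loss", patience=8)])
```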

3.1 High-level-feature classifier

A fully connected feed-forward DNN based on a set of high-level features (HLF classifier) is the closest approach to the currently used rule-based trigger algorithms. We train a model of this kind taking as input the 14 features contained in the HLF dataset (see Sec. 2). The 14 features are normalized to take values between 0 and 1.

The final network configuration is the result of an optimization process performed with the scikit-learn grid-search utility scikit-learn, which carries out an exhaustive cross-validated scan over a set of hyperparameters related to the network architecture and the training setup. The number of layers, the number of nodes in each layer, and the choice of optimizer were considered in the scan. For a given number of layers, the discrimination performance was found to be constant over the considered range of nodes per layer. We believe that this is a direct consequence of the relative simplicity of the problem at hand: even relatively small networks achieve good classification performance. We therefore took the smallest network as the best compromise between performance and architecture minimality.

The chosen architecture consists of three hidden layers with 50, 20, and 10 nodes, activated by rectified linear units (ReLU) RELU. The output layer consists of 3 nodes, activated by a softmax activation function.
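A minimal Keras sketch of this architecture could read as follows (only the layer structure is taken from the text; everything else is illustrative).

```python
from keras.models import Sequential
from keras.layers import Dense

# HLF classifier: 14 inputs, hidden layers of 50, 20, and 10 ReLU nodes,
# and a 3-node softmax output, as described above.
hlf_model = Sequential([
    Dense(50, activation="relu", input_shape=(14,)),
    Dense(20, activation="relu"),
    Dense(10, activation="relu"),
    Dense(3, activation="softmax"),
])
```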

3.2 Raw-image classifier

To classify events represented as raw calorimeter images (raw-image classifier), we use DenseNet-121, a model based on the Densely Connected Convolutional Network huang2017densely. The DenseNet-121 architecture includes four dense blocks, containing 6, 12, 24, and 16 dense layers, respectively. Each dense layer contains two 2D convolutional layers preceded by batch-normalization layers. A dropout rate of 0.5 is applied after each dense layer. Between two subsequent dense blocks is a transition layer consisting of a batch-normalization layer, a 2D convolutional layer, and an average-pooling layer.
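A sketch of how such a model could be instantiated is shown below; using the torchvision implementation of DenseNet-121 is an assumption, chosen because it follows the same block structure (dense blocks of 6, 12, 24, and 16 layers), and the image size is a placeholder.

```python
import torch
from torchvision import models

# DenseNet-121 with a 50% dropout rate inside the dense layers and a
# 3-node output, one node per topology (W, ttbar, QCD).
raw_image_model = models.densenet121(num_classes=3, drop_rate=0.5)

# The raw images have 3 channels (charged particles, photons, neutral hadrons),
# matching the default 3-channel input of DenseNet-121.
dummy_batch = torch.randn(2, 3, 224, 224)   # placeholder batch and image size
logits = raw_image_model(dummy_batch)       # shape (2, 3)
```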

3.3 Abstract-image classifier

We use the same DenseNet-121 architecture above to classify the abstract image representation. We refer to this model as abstract-image classifier.

3.4 Particle-sequence classifier

A particle-sequence classifier is trained using a recurrent network, taking as input the 801 PF candidates. To feed these particles into a recurrent network, the particles are ordered according to their increasing or decreasing distance from the isolated lepton. Different physics-inspired metrics are considered to quantify this distance (e.g., the angular distance ΔR, as well as kT-like and anti-kT-like antikt distance measures). The best results are obtained with the decreasing-distance ordering.
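A small sketch of this ordering step, for the simplest case of the ΔR distance, is shown below; the feature column indices are illustrative placeholders.

```python
import numpy as np

ETA_INDEX, PHI_INDEX = 5, 6   # assumed positions of eta and phi in the 19 features

def order_by_distance(particles, lepton):
    """Sort the particle rows by decreasing Delta R from the isolated lepton."""
    d_eta = particles[:, ETA_INDEX] - lepton[ETA_INDEX]
    # wrap the azimuthal difference into (-pi, pi]
    d_phi = np.angle(np.exp(1j * (particles[:, PHI_INDEX] - lepton[PHI_INDEX])))
    delta_r = np.hypot(d_eta, d_phi)
    return particles[np.argsort(-delta_r)]   # largest distance first
```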

We use gated recurrent units (GRUs) to aggregate the input sequence of PF-candidate features into a fixed-size encoding. The fixed-size encoding is fed into a fully connected layer with 3 softmax-activated nodes. The input data are standardized so that each feature has zero mean and unit standard deviation. The zero-padded entries in the particle sequence are skipped by means of a Masking layer. The best internal width of the recurrent layer was found to be 50, determined by k-fold cross validation on a training set of 210,000 events. We also considered using long short-term memory (LSTM) networks in place of the GRU, but we found that the GRU architecture outperformed the LSTM architecture for the same number of internal cells.

Figure 4: Network architecture of the inclusive classifier: the sequence of PF candidates is processed by a Masking layer and a GRU layer with 50 units, followed by dropout; the 14 high-level features, after dropout, are concatenated with the GRU output (64 values in total) and passed through a dense layer with 25 nodes to a 3-node output layer.

3.5 Inclusive classifier

In order to inject some domain knowledge into the GRU classifier, we consider a modification of its architecture in which the 14 features of the HLF dataset are concatenated to the output of the GRU layer after some dropout (see Fig. 4). As for the other classifiers, the final output layer consists of 3 nodes, activated by a softmax activation function. We refer to this model as the inclusive classifier.
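A minimal Keras sketch of this architecture is given below; the dropout rates are illustrative placeholders, while the layer sizes follow Fig. 4.

```python
from keras.models import Model
from keras.layers import Input, Masking, GRU, Dropout, Dense, concatenate

particles_in = Input(shape=(801, 19), name="particles")
hlf_in = Input(shape=(14,), name="hlf")

x = Masking(mask_value=0.0)(particles_in)   # skip zero-padded entries
x = GRU(50)(x)                              # fixed-size encoding of the sequence
x = Dropout(0.3)(x)                         # placeholder dropout rate

h = Dropout(0.3)(hlf_in)                    # placeholder dropout rate

merged = concatenate([x, h])                # 50 + 14 = 64 features
merged = Dense(25, activation="relu")(merged)
outputs = Dense(3, activation="softmax")(merged)

inclusive_model = Model(inputs=[particles_in, hlf_in], outputs=outputs)
```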

4 Results

Each of the models presented in the previous section returns the probability of a given event to be associated with each topology: W, tt̄, and QCD. By applying a threshold requirement on the tt̄ or W probability, one can define a tt̄ or a W selector, respectively. By changing the threshold value, one can build the corresponding receiver operating characteristic (ROC) curve. Fig. 5 shows the comparison of the ROC curves for the five classifiers: the DenseNets based on raw images and abstract images, the GRU using the list of particles, the DNN using the HLFs, and the inclusive classifier using both the HLFs and the list of particles. Results are shown for both the tt̄ and W selectors.
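As an illustration, a one-vs-rest ROC curve for a given topology can be computed from the softmax outputs as in the sketch below; the class-index convention and variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_for_class(labels, scores, class_index):
    """ROC curve for one topology: labels are integer event labels,
    scores the (n_events, 3) softmax probabilities."""
    positives = (labels == class_index).astype(int)
    fpr, tpr, thresholds = roc_curve(positives, scores[:, class_index])
    return fpr, tpr, auc(fpr, tpr)
```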

Figure 5: ROC curves for the tt̄ and W selectors described in the paper.

Acceptable results are obtained already with the raw-image classifier. The use of abstract images, however, allows us to reach better performance. A further improvement is observed for the models not using an image-based representation of the event. The fact that the HLF selectors perform so well does not come as a surprise, given the considerable amount of physics knowledge implicitly provided by the choice of the relevant features. On the other hand, the fact that the particle-sequence classifier reaches better performance than the HLF selector is remarkable, as is the further improvement observed by merging the two approaches in the inclusive classifier. In some sense, the GRU layer learns a good part, but not all, of the physics intuition that motivated the choice of the HLF quantities. Fig. 6 shows the Pearson correlation coefficients between the GRU scores (tt̄ and W) and the HLF quantities. As one would expect, the tt̄ score exhibits a stronger correlation with the features that quantify the jet activity, as well as with the b-jet multiplicity. On the contrary, the W score shows an anti-correlation with the jet quantities, since the production of associated jets is much more suppressed in W events than in tt̄ events. As expected, both scores are anti-correlated with the isolation quantities, which take larger values for non-isolated leptons.

Figure 6: Pearson correlation coefficients between the tt̄ and W scores of the particle-sequence classifier and the 14 quantities of the HLF dataset.

The performance of each of the five classifiers is summarized in Tab. 1 in terms of false-positive rate (FPR) and trigger rate (TR) as a function of the true-positive rate (TPR). The best QCD rejection is obtained by the inclusive classifier, which can retain 99% of the tt̄ or W events while reducing the false-positive rate by more than one order of magnitude with respect to the baseline selection.

selector Raw-image Abstract-image HLF Particle-sequence Inclusive
(DenseNet) (DenseNet) (DNN) (GRU) (DNN+GRU)
FPR @99% TPR % % % % %
FPR @95% TPR % % % % %
FPR @90% TPR % % % % %
TR @99% TPR Hz Hz Hz Hz Hz
TR @95% TPR Hz Hz Hz Hz Hz
TR @90% TPR Hz Hz Hz Hz Hz
selector Raw-image Abstract-image HLF Particle-sequence Inclusive
(DenseNet) (DenseNet) (DNN) (GRU) (DNN+GRU)
FPR @99% TPR % % % % %
FPR @95% TPR % % % % %
FPR @90% TPR % % % % %
TR @99% TPR Hz Hz Hz Hz Hz
TR @95% TPR Hz Hz Hz Hz Hz
TR @90% TPR Hz Hz Hz Hz Hz
Table 1: False-positive rate (FPR) and trigger rate (TR) at different values of the true-positive rate (TPR), for the tt̄ and W selectors. Rate values are estimated by scaling the TPR and the process-dependent FPR values by the acceptance and efficiency, assuming leading-order (LO) production cross sections and an instantaneous luminosity of 2 × 10³⁴ cm⁻² s⁻¹. TR values should be taken only as indicative of the actual rates, since the accuracy is limited by the use of LO cross sections and a parametric detector simulation.

The trigger baseline selection we use in this study, looser than what is used nowadays in CMS, gives an overall trigger rate (i.e., summing electron and muon events) of about 690 Hz, more than a factor of two larger than what is currently allocated. Using the 99% working points of the two classifiers, one would reduce the overall rate (accounting for the overlap between the two triggers) to a value comparable to what is currently allocated for these triggers, but with a looser selection, i.e., with a less severe bias on the offline analysis. In addition, the trigger efficiency (the TPR) is so high that the bias imposed on offline quantities is minimal. This is illustrated in Fig. 7, where the dependence of the TPR on the most relevant HLF quantities is shown. In our experience, any rule-based algorithm with the same target trigger rate would result in larger inefficiencies at small values of at least some of these quantities, e.g., the lepton pT. One should also consider that the principle of a topology classifier could be generalized to other physics cases, as well as to other uses (e.g., labels for fast reprocessing or access to specific subsets of the triggered samples).

Figure 7: Selection efficiency at the 99% TPR working point, as a function of the most relevant kinematic quantities (including the lepton pT), for the tt̄ selector on tt̄ events and the W selector on W events.

Figure 8 shows the TPR and FPR of the inclusive selector when applying the 99% TPR working-point threshold, as a function of the number of vertices in the event, which quantifies the amount of pileup (PU). The TPR is fairly insensitive to PU up to the average PU recorded by the LHC in 2018, where it drops only slightly. At the same time, the FPR increases mildly, resulting in a modest rate increase with respect to the value at the average PU of the training sample. In other words, an algorithm trained on 2016 conditions would have remained sustainable until 2018 with a limited rate increase, or it would have required a threshold adjustment along the way, a standard operation when designing a trigger menu at the beginning of a data-taking year. We believe that, in view of these facts, the proposed algorithm would be as robust as many state-of-the-art algorithms operated at the LHC experiments.

Figure 8: Dependence of the TPR and FPR on the amount of pileup in the event (estimated through the number of vertices) for the inclusive selector when applying the 99% TPR working-point threshold. The gray histogram shows the distribution of the number of vertices in the training dataset, which follows a Poisson distribution with a mean value of 20 and covers a wide range of values.

5 Impact on other topologies

While reducing the resource consumption of standard physics analyses is the main motivation behind this study, it is important to evaluate the impact of the proposed classifiers on other kinds of topologies. For this purpose, we consider a handful of beyond-the-standard-model (BSM) scenarios and compute the TPR as a function of the most relevant kinematic quantities, similarly to what was done in Fig. 7 for the standard topologies.

Figure 9: Selection efficiencies for the BSM models listed in the text, at the 99% TPR working point, as a function of the most relevant kinematic quantities (including the lepton pT).

We consider the following BSM processes:

  • A heavy neutral Higgs boson with mass 425 GeV decaying to a charged Higgs boson of mass 325 GeV and a W boson. The charged Higgs boson then decays to a W boson and the 125 GeV Higgs boson, which we force to decay to a bottom quark-antiquark pair. This model, introduced in Ref. baldi, generates a final state similar to that given by tt̄ events.

  • A high-mass variation of the previous model, in which the neutral and charged Higgs boson masses are set to 1025 GeV and 625 GeV, respectively.

  • A light neutral scalar particle with mass 20 GeV, decaying to two neutral scalars of 5 GeV each, both decaying to muon pairs, for a total of four muons in the final state.

  • A neutral resonance with mass 300 GeV, decaying inclusively with Z-like couplings.

  • A neutral resonance with mass 600 GeV, decaying to a pair of electrons or muons.

These events are filtered with the baseline selection described in Sec. 2.

For each of these models, we consider the inclusive classifier and apply the 99%-TPR thresholds on the tt̄ and W scores. We then consider the fraction of events passing at least one of the two selectors. Results are shown in Fig. 9 for the most relevant kinematic quantities. While the individual selectors might show local inefficiencies, the combination of the two trigger paths is capable of retaining essentially any event with features different from those of a QCD multijet event. In this respect, the logical OR of our two exclusive topology classifiers is robust enough to also select a large spectrum of BSM topologies. On the other hand, one cannot guarantee that QCD-like topologies (e.g., a dark photon produced in jet showers and decaying to lepton pairs) would not be rejected, a limitation which also affects traditional inclusive trigger strategies.

6 Robustness study

Figure 10: Distributions of the validation sample and the pseudo-data. The pseudo-data are created by adding Gaussian noise with zero mean and a standard deviation equal to 10% of the particle momenta of the validation sample. The high-level features are then recomputed from the new list of particles.

As the classifiers are trained on Monte Carlo simulation samples, one needs to consider the discrepancy between Monte Carlo and real data when deploying a classifier in the trigger. We investigate the robustness of our topology classifiers against this discrepancy by creating a pseudo-data sample, which attempts to emulate real data by adding Gaussian noise to the particle momenta of the simulation samples. The Gaussian noise has a mean of zero and a standard deviation equal to 10% of the value of the variable it is applied to. Fig. 10 shows comparisons between the Monte Carlo samples and the pseudo-data with this Gaussian noise added.
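A minimal sketch of this smearing procedure is given below; the random seed and the function signature are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def smear(momenta, rel_sigma=0.10):
    """Smear each momentum value with Gaussian noise of zero mean and a
    standard deviation equal to 10% of its absolute value."""
    return momenta + rng.normal(0.0, rel_sigma * np.abs(momenta))
```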

FPR TPR on validation sample TPR on pseudo-data
5.2% % %
0.7% % %
0.2% % %
Table 2: Signal efficiency (TPR) at different values of the false-positive rate (FPR) for the inclusive classifier, evaluated on the validation sample and on the pseudo-data.

We evaluate the performance of our fully-trained inclusive classifier on the new pseudo-data. Tab. 2 shows a slight reduction of signal efficiency: at the same background contamination rate of 5.2%, the signal efficiency reduces by only 1.4%. This demonstrates that our classifiers can be robust against some augmentation that mimics the discrepancy between data and Monte-Carlo simulation. A comprehensive study on full simulation and data in proper control regions would be needed when deploying this classifier into production.

7 Related works

Machine learning is traditionally used in high-energy physics as part of data analysis, and was an important ingredient in the discovery of the Higgs boson, as discussed in ml-review. Several classification algorithms have been studied in the context of LHC physics applications, notably for jet tagging deOliveira:2015xxd; Guest:2016iqz; Macaluso:2018tck; Datta:2017lxt; Butter:2017cot; Kasieczka:2017nvn; Komiske:2016rsd; Schwartzman:2016jqu and event-topology identification baldi; Bhimji:2017qvb; Madrazo, using feed-forward neural networks, convolutional neural networks, or physics-inspired architectures. Jet and event classifiers have also been defined starting from a list of reconstructed particle momenta RecursiveJets; Egan:2017ojy; Cheng:2017rdo. These studies typically consider data analysis as the main use case, focusing on small-FPR selections. This is the main difference with respect to this study, which focuses on the optimization of the real-time data-taking procedure.

In parallel, machine learning techniques have also been used in online event selection. For example, the LHCb experiment used a decision-tree-based approach for the high-level trigger in the first LHC run bonsaiBDT and re-optimized it with the MatrixNet algorithm for Run II optimizedLHCb; ATLAS uses a BDT in its multi-step tau trigger for Run II atlas-trigger; and a BDT was also deployed on FPGA cards of the hardware-level trigger of the CMS experiment Acosta:2290188. These triggers are mainly based on high-level features related to specific parts of a collision event. We propose instead to define an algorithm that is based on a raw-event representation and considers the full collision event at once. To our knowledge, this is the first demonstration of how a recurrent neural network can perform successful inference on a full event and improve on topology identification based on object-specific features.

In addition, traditional triggers based on machine learning run in tagging mode, i.e., they are used to identify certain types of particles. Instead, we propose to use our topology classifier in veto mode: the event selection would rely on a classic trigger with a loose selection, which would normally be unsustainable due to its high throughput; the topology classifier would subsequently remove the majority of background events, keeping the trigger rate sustainable and saving downstream computing resources.

Note. After submitting this paper for review, the study presented in Ref. Lin2018 showed how a topology classification based on full event information can boost tagging efficiency or purity of a single-object trigger, or both, in the context of an offline analysis.

8 Conclusions

We show how deep neural networks can be used to train topology classifiers for LHC collision events, which could be used as a cleanup filter to select or reject specific event topologies in a trigger system. We consider several network architectures, applied to different representations of the same collision datasets.

The best results are obtained by combining a set of physics-motivated high-level features with the output of a GRU layer applied to a list of particle-level features. For the most difficult case, i.e., selecting the rare tt̄ events, we show how a trigger based on this concept would retain 99% of the events while reducing the FPR by more than one order of magnitude.

The information given as input to the GRU, the abstract-image CNN and the raw-image CNN is the same, but coded differently. The difference in performance is then a combination of two effects: the encoding of this information in the input event representation and the way the network architecture exploits it. The DNN case is different. The DNN uses in principle less information. On the other hand, the list of HLFs given as input to the DNN is based on domain knowledge that the other networks have to learn by themselves. This is why the DNN model is very competitive despite using less information and why the inclusive classifier (GRU+DNN) improves on the GRU-based particle sequence classifier. Nevertheless, it is remarkable that the score of the particle sequence classifier learns interesting correlation patterns with the HLF features, showing that (to some extent) the GRU is learning some of this domain knowledge.

We show that such a trigger would have a minimal impact on the main kinematic features of the event topologies under consideration. Operating this topology classifier as a final filter of a given single-lepton trigger would result in a small decrease of the trigger efficiency, by a few percent (depending on the TPR of the chosen working point). On the other hand, such a filter would allow for a looser selection, efficiently including non-isolated leptons with low pT, without downstream consequences in terms of computational power and storage. In addition, the logical OR of the tt̄ and W selections would also catch a broad class of new-physics topologies on which the classifiers were not trained.

The advantages of running these types of algorithms come at the cost of the computational resources needed to train the models. In our case, a single training of the inclusive classifier took 4 hours on a cluster consisting of 6 GeForce GTX 1080 GPUs. Building a cluster of a few tens of GPUs of this kind, to be used as a training facility, is well within the budget of big-experiment computing projects. For this reason, dedicated studies are ongoing to integrate train-on-demand services in the computing infrastructures of the LHC experiments mpi-learn; tfaas. In view of the challenging trigger environment foreseen for the High-Luminosity LHC, it would be important to test this trigger strategy as a way to preserve a good experimental reach with a substantial reduction of computational resources. In this respect, we look forward to the LHC Run III as an opportunity to experiment with this technique using full simulation and to study its impact on real-time event selection.

9 Acknowledgments

This work is supported by grants from the Swiss National Supercomputing Center (CSCS) under project ID d59, the United States Department of Energy, Office of High Energy Physics Research under Caltech Contract No. DE-SC0011925, and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement n 772369). T.N. would like to thank Duc Le for valuable discussions during the earlier stage of this project. We thank CERN OpenLab for supporting D.W. during his internship at CERN. We are grateful to Caltech and the Kavli Foundation for their support of undergraduate student research in cross-cutting areas of machine learning and domain sciences. Part of this work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".

References

Appendix A    An alternative use case

In this paper, we showed how one could use a topology classifier to keep the overall trigger rate under control while operating triggers with otherwise unsustainably loose selections. In this appendix, we discuss how topology classifiers could be used to save resources for a pre-defined baseline trigger selection by rejecting events associated with unwanted topologies. In this case, the main goal is not to reduce the impact of the online selection; instead, we focus on reducing the downstream resource consumption for a given trigger selection.

To this purpose, we consider a copy of the dataset described in Sec. 2, obtained by tightening the pT threshold from 23 to 25 GeV and the isolation requirement from ISO < 0.45 to ISO < 0.20. Doing so, the sample composition changes as follows: 7.5% QCD, 92% W, and 0.5% tt̄. With these selections, the trigger acceptance rate would decrease from 690 Hz to 390 Hz, closer to what is currently allocated for these triggers in the CMS experiment.

Following the procedure described in Secs. 3 and 4, we train the same topology classifiers on this dataset. The corresponding ROC curves are presented in Fig. 11 for the tt̄ and W selectors.

Figure 11: ROC curves for the tt̄ and W selectors described in the paper, trained on the dataset defined by the tighter baseline selection.

We then define a set of trigger filters applying a lower threshold to the normalized score of the classifier, choosing the threshold value that corresponds to a certain TPR value. The result is presented in Table 3, in terms of the FPR and the trigger rate.
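As an illustration, the threshold corresponding to a target TPR can be obtained as the appropriate quantile of the signal-event scores, as in the sketch below (the variable names are placeholders).

```python
import numpy as np

def threshold_at_tpr(signal_scores, target_tpr=0.99):
    """Score cut that keeps the target fraction of true signal events,
    given the classifier scores evaluated on simulated signal events."""
    return np.quantile(signal_scores, 1.0 - target_tpr)
```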

selector Raw-image Abstract-image HLF Particle-sequence Inclusive
(DenseNet) (DenseNet) (DNN) (GRU) (DNN+GRU)
FPR @99% TPR % % % % %
FPR @95% TPR % % % % %
FPR @90% TPR % % % % %
TR @99% TPR Hz Hz Hz Hz Hz
TR @95% TPR Hz Hz Hz Hz Hz
TR @90% TPR Hz Hz Hz Hz Hz
selector Raw-image Abstract-image HLF Particle-sequence Inclusive
(DenseNet) (DenseNet) (DNN) (GRU) (DNN+GRU)
FPR @99% TPR % % % % %
FPR @95% TPR % % % % %
FPR @90% TPR % % % % %
TR @99% TPR Hz Hz Hz Hz Hz
TR @95% TPR Hz Hz Hz Hz Hz
TR @90% TPR Hz Hz Hz Hz Hz
Table 3: False-positive rate (FPR) and trigger rate (TR) corresponding to different values of the true-positive rate (TPR), for the tt̄ and W selectors. Rate values are estimated by scaling the TPR and the process-dependent FPR values by the acceptance and efficiency, assuming leading-order (LO) production cross sections and an instantaneous luminosity of 2 × 10³⁴ cm⁻² s⁻¹. TR values should be taken only as a loose indication of the actual rates, since the accuracy is limited by the use of LO cross sections and a parametric detector simulation.

The trigger baseline selection we use in this appendix, close to what is currently used in CMS for muons, gives an overall trigger rate (i.e., summing electron and muon events) of 390 Hz (i.e., about 190 Hz per lepton flavor). If one were willing to take (as an example) half of the W events and all of the tt̄ events, this rate could be reduced substantially using the inclusive selectors presented in this study (taking into account the partial overlap between the two triggers). A more classic approach would consist in prescaling the isolated-lepton triggers, i.e., randomly accepting half of the events. The effect on the W events would be the same, but one would lose half of the tt̄ events while still writing 15 times more QCD multijet than tt̄ events. In this respect, the approach we propose would allow for a more flexible and cost-effective trigger strategy.
