Automating the Construction of Jet Observables with Machine Learning


Kaustuv Datta Department of Physics, ETH Zürich, 8093 Zürich, Switzerland kdatta@ethz.ch    Andrew Larkoski Physics Department, Reed College, Portland, OR 97202, USA larkoski@reed.edu    Benjamin Nachman Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA bpnachman@lbl.gov
Abstract

Machine-learning assisted jet substructure tagging techniques have the potential to significantly improve searches for new particles and Standard Model measurements in hadronic final states. Techniques with simple analytic forms are particularly useful for establishing robustness and gaining physical insight. We introduce a procedure to automate the construction of a large class of observables that are chosen to completely specify M-body phase space. The procedure is validated on the task of distinguishing H→bb̄ from g→bb̄, where M = 3 and previous brute-force approaches to construct an optimal product observable for the M-body phase space have established the baseline performance. We then use the new method to design tailored observables for the boosted Z′ search, where M = 4 and brute-force methods are intractable. The new classifiers outperform standard two-prong tagging observables, illustrating the power of the new optimization method for improving searches and measurements at the LHC and beyond.


I Introduction

Effective identification of hadronic decays of boosted heavy particles like the top quark or the W, Z, and Higgs (H) bosons is essential for analyses at the Large Hadron Collider (LHC). Jet substructure observables that identify specific discriminating information in the radiation pattern of jets originating from different particles are now necessary, both in the search for new physics and in precision Standard Model (SM) measurements. As a result, there is an extensive literature developing observables and techniques for identifying boosted topologies to increase the efficacy of LHC analyses probing extreme regions of phase space Larkoski et al. (2017); Asquith et al. (2018).

Modern machine learning (ML) methods have emerged as useful tools for automating the creation of optimal observables for classification. These methods are particularly powerful for high-dimensional, low-level inputs such as fixed-length sets of four-vectors Pearkes et al. (2017), variable-length sets of four-vectors Komiske et al. (2018a), physics-inspired bases Komiske et al. (2018b); Erdmann et al. (2018); Datta and Larkoski (2017, 2018); Butter et al. (2018), images Cogan et al. (2015); Almeida et al. (2015); de Oliveira et al. (2016); Komiske et al. (2017); Barnard et al. (2017); Kasieczka et al. (2017); Dreyer et al. (2018); Lin et al. (2018); Fraser and Schwartz (2018); Chien and Kunnawalkam Elayavalli (2018); Macaluso and Shih (2018), sequences Guest et al. (2016); Egan et al. (2017); Andreassen et al. (2019); Fraser and Schwartz (2018), trees Cheng (2018); Louppe et al. (2019), and graphs Henrion et al. (2017). Some deep learning-based tagging schemes have already been demonstrated using collider data as well as with full detector simulations for top quark tagging Aaboud et al. (2018); CMS Collaboration (2017a), W boson tagging Aaboud et al. (2018); CMS Collaboration (2018a), quark/gluon tagging ATLAS Collaboration (2017a); CMS Collaboration (2017b), and b-jet tagging ATLAS Collaboration (2017b, c); CMS Collaboration (2018b); Sirunyan et al. (2018a). In addition to improving classification performance, ML techniques have also been proposed to make jet tagging less dependent on simulation and robust to differences between simulation and data, as well as between sideband and signal regions Collins et al. (2018, 2019); Heimel et al. (2018); Farina et al. (2018); Metodiev and Thaler (2018); Komiske et al. (2018c); Metodiev et al. (2017); Dery et al. (2017); Cohen et al. (2018); Louppe et al. (2016); Shimmin et al. (2017); ATL (2018).

One of the key challenges with ML taggers is to identify what information the machine is using for classification. Understanding the origin of the discrimination can lead to robustness when taggers are applied outside of the region in which they were trained, can result in new theoretical insight for other applications, and may produce new simple observables that capture most of the information. While there are many proposals for ML metacognition Cohen et al. (2018); de Oliveira et al. (2016); Lin et al. (2018); Komiske et al. (2018b, a); Datta and Larkoski (2018, 2017), one particularly powerful approach is to identify simple product observables that capture most of the information from an ML algorithm trained on the full phase space Datta and Larkoski (2018). This approach results in analytically tractable observables that capture nearly all of the power of a more complicated algorithm, but are also very robust and insightful. One of the most challenging aspects of the approach presented in Ref. Datta and Larkoski (2018) is the fitting process for picking the optimal simple product observable.

In this paper, we describe a new procedure based on ML for automating the feature extraction originally presented in Ref. Datta and Larkoski (2018). This method is applied to derive an optimal product observable for discriminating H→bb̄ vs. g→bb̄, and the outcome is compared to the result of Ref. Datta and Larkoski (2018), which used a brute-force approach. Having validated the method, a new classifier is developed to distinguish a boosted Z′ boson from generic quark and gluon jets. The phase space scan required in this latter tagging task is too big for the brute-force approach, and therefore the automated method is required to find the optimal tagger. The resulting classifier has a simple form and is competitive with a tagger using high-dimensional, low-level inputs. In addition to Ref. Shimmin et al. (2017), this is the only other study of the dependence of such taggers on the mass of the new boson, which is timely given new searches for light boosted bosons Sirunyan et al. (2017); Sirunyan et al. (2018b); Aaboud et al. (2019).

This paper is organized as follows. The method for constructing product observables is described in Sec. II and the machine learning approaches are detailed in Sec. III. Results for both the Higgs and Z′ classification tasks are presented in Sec. IV. The paper ends with conclusions and future outlook in Sec. V.

II N-subjettiness Product Observables

The information about the kinematic phase space of M subjets in a jet is resolved with a set of N-subjettiness Stewart et al. (2010); Thaler and Van Tilburg (2011, 2012) observables. By increasing M, one can identify the number of subjets required to saturate the classification performance based on the spanning set of 3M−4 N-subjettiness observables Datta and Larkoski (2017):

$$\left\{ \tau_1^{(0.5)}, \tau_1^{(1)}, \tau_1^{(2)}, \tau_2^{(0.5)}, \tau_2^{(1)}, \tau_2^{(2)}, \ldots, \tau_{M-2}^{(0.5)}, \tau_{M-2}^{(1)}, \tau_{M-2}^{(2)}, \tau_{M-1}^{(1)}, \tau_{M-1}^{(2)} \right\},$$

where

$$\tau_N^{(\beta)} = \frac{1}{p_{TJ}\, R_0^{\beta}} \sum_{i \in \mathrm{Jet}} p_{Ti} \, \min\left\{ R_{1i}^{\beta}, R_{2i}^{\beta}, \ldots, R_{Ni}^{\beta} \right\} \qquad (1)$$

for some choice of N axes within the jet; R_{Ki} is the angular distance between constituent i and axis K, R_0 is the jet radius parameter, and β > 0. Given the minimal M, one can posit an ansatz (the product form may not be flexible enough to capture the full discrimination power; we find that it can capture a significant portion of the classification performance, but Appendix E indicates that further information can be useful) for a simple product observable that captures most of the information contained in a neural network trained on the entire spanning set:

$$\mathcal{O}_M = \prod_{k=1}^{3M-4} \left( \tau_{N_k}^{(\beta_k)} \right)^{c_k}, \qquad (2)$$

where the product runs over the elements of the spanning set above and the exponents c_k are free parameters.

For distinguishing H→bb̄ vs. g→bb̄ jets, Ref. Datta and Larkoski (2018) showed that the useful information for classification is saturated by M = 3, and the corresponding product observable has nearly the same tagging performance as the full 3-body phase space. The exponents that specify the product observable were identified by randomly scanning the five-dimensional parameter space and exploiting minimal correlations between some of the parameters. This becomes intractable when the optimal M is bigger than 3.
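To make the quantities in Eqs. (1) and (2) concrete, the following sketch evaluates τ_N^(β) from jet constituents and a set of candidate axes and then combines several such values into the product observable. The function names, array layout, normalization convention, and default radius are illustrative assumptions rather than the implementation used for the results below.

```python
import numpy as np

def tau_N(beta, constituents, axes, R0=0.8):
    """Sketch of Eq. (1): N-subjettiness tau_N^(beta).

    constituents: array of shape (n, 3) with columns (pT, eta, phi)
    axes:         array of shape (N, 2) with columns (eta, phi) of the subjet axes
    R0:           jet radius parameter (0.8 is an assumed value)
    """
    pt, eta, phi = constituents[:, 0], constituents[:, 1], constituents[:, 2]
    deta = eta[:, None] - axes[None, :, 0]
    dphi = np.mod(phi[:, None] - axes[None, :, 1] + np.pi, 2 * np.pi) - np.pi
    dist = np.sqrt(deta**2 + dphi**2)     # angular distance of each constituent to each axis
    min_dist = dist.min(axis=1)           # distance to the closest axis
    # normalized by the scalar pT sum and R0^beta (an assumed convention)
    return np.sum(pt * min_dist**beta) / (np.sum(pt) * R0**beta)

def product_observable(taus, exponents):
    """Sketch of Eq. (2): product of the basis observables raised to the exponents c_k."""
    taus = np.asarray(taus, dtype=float)
    exponents = np.asarray(exponents, dtype=float)
    return np.prod(taus**exponents)
```

In the setup used below, the axes entering τ_N are chosen to minimize τ_N (see Sec. III.1); here they are simply supplied as an input.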

In this paper, we explore methods to overcome the difficulties of extending this procedure to higher dimensions. In one approach, we replace the random sampling segment of the procedure with a combination of neural networks carrying out regression from the parameter space to the distributions of the product observable for individual jets. Off-the-shelf minimization routines can then be used to optimize any metric of the classifier performance. A complementary and simpler approach is to directly use the form in Eq. (2) in the machine learning optimization, where the learnable parameters are the exponents c_k. Further details are described in the next sections.

III Machine Learning Implementation

III.1 Dataset

Proton-proton collision events with H→bb̄ jets, Z′→qq̄ jets, and generic quark and gluon jets (QCD) are generated using Pythia 8.226 Sjostrand et al. (2006, 2015). For the Higgs case, the background is enriched in g→bb̄ by generating the gluon splitting matrix element in MadGraph 5 v2.5.4 Alwall et al. (2014). All detector-stable particles excluding neutrinos and muons are clustered into jets using the anti-kT algorithm Cacciari et al. (2008) with radius parameter R₀, as implemented in FastJet Cacciari et al. (2012). Jets are groomed by reclustering the constituents using the Cambridge-Aachen algorithm Dokshitzer et al. (1997); Wobisch and Wengler (1998) and applying the soft drop algorithm Larkoski et al. (2014a) with β = 0 and a fixed z_cut (equivalent to modified mass drop tagging, or mMDT Dasgupta et al. (2013)). The N-subjettiness observables are computed using the axes that minimize τ_N, found with the exclusive k_T algorithm Catani et al. (1993); Ellis and Soper (1993) with standard E-scheme recombination Blazey et al. (2000). For comparison with other state-of-the-art two-prong tagging techniques, the D_2 Larkoski et al. (2014b) and N_2 Moult et al. (2016) observables, and τ_21 with winner-take-all (WTA) recombination Bertolini et al. (2014); Larkoski et al. (2014c); Larkoski and Thaler (2014), are also computed from the jet constituents.
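As an illustration of the clustering step described above, a minimal sketch using the FastJet Python bindings is shown below; the radius value, pT threshold, and particle container format are assumptions for illustration, and the grooming and N-subjettiness calculations used for the actual results are performed with fastjet-contrib tools rather than shown here.

```python
import fastjet as fj  # FastJet Python bindings, assumed to be available

def cluster_jets(particles, R0=0.8, pt_min=300.0):
    """Cluster detector-stable particles (excluding neutrinos and muons) into anti-kT jets.

    particles: iterable of (px, py, pz, E) tuples
    R0, pt_min: assumed radius parameter and jet pT threshold
    """
    pseudojets = [fj.PseudoJet(px, py, pz, e) for (px, py, pz, e) in particles]
    jet_def = fj.JetDefinition(fj.antikt_algorithm, R0)
    cluster_seq = fj.ClusterSequence(pseudojets, jet_def)
    jets = fj.sorted_by_pt(cluster_seq.inclusive_jets(pt_min))
    # Soft drop grooming (Cambridge-Aachen reclustering with beta = 0) and the
    # N-subjettiness values would be computed for each jet with fastjet-contrib.
    return [(j.pt(), j.eta(), j.phi(), j.m()) for j in jets]
```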

III.2 Construction of optimized product observables

Using the approach followed in Ref. Datta and Larkoski (2018), the point of saturation of discrimination power is first identified using a deep neural network (DNN) classifier. For Z′ vs. QCD and H→bb̄ vs. g→bb̄ discrimination, we note that discrimination power saturates at 4-body (8-dimensional) and 3-body (5-dimensional) phase space, respectively. It is then simple to form the product observable from the elements of the M-body basis corresponding to saturation.

We examine two approaches for finding the optimal product observable. The first approach follows a similar method as the brute-force algorithm. Neural networks approximate the signal and background probability distributions conditioned on the parameters, and then any automated optimization procedure can be used to identify the best exponents. For each task, the product observable is calculated for 25,000 signal and background jets for different values of the exponents (five parameters for the Higgs task or eight for the Z′ task), sampled within a fixed range. These distributions are then stored to generate training sets for the neural networks used to carry out regression from the parameter space to the calculation of the observable with those exponents.

While there are multiple possibilities for learning the probability distribution of the product observable given the exponents, such as generative adversarial networks Goodfellow et al. (2014) and variational autoencoders Kingma and Welling (2013); Rezende et al. (2014), the method that we found works well for the product observables is illustrated in Fig. 1. The network takes as input the 5 (Higgs) or 8 (Z′) exponents and outputs 25,000 numbers, which represent a dataset that is the same size as the training data, but with the specified parameter values. From these 25,000 values, the probability distribution of the observable is formed for signal and background, and the one-dimensional likelihood ratio is constructed for optimizing the classifier performance. Variations on this setup are possible, such as (significantly) reducing the number of points needed to specify the probability distributions, but this approach was found to be robust to perturbations in initialization and network architecture. For this paper, it was found that the network did not work well with fewer than 25k example jets per parameter point. For each network, 250k (450k) parameter points were used for training in the Z′ and ungroomed Higgs (groomed Higgs) cases. In the groomed Higgs case only, a single network was trained for signal and background with a 1/0 switch added to the input; separate networks were trained for signal and background in the Z′ and ungroomed Higgs cases. To reduce the effects of numerical instability on the training of these networks, we train on samples after taking the natural logarithm of the 25k measured values of the product observables.

Aside from the use (or not) of the switch input, both the Higgs and Z′ tasks use simple fully-connected neural networks with two hidden layers. The input layer is followed by a dense layer with either 250 or 500 nodes, then another dense layer with 100 or 250 nodes, followed by an output layer with 25,000 nodes using a linear activation. The number of nodes in the hidden layers was bigger for the case with grooming than for the ungroomed cases.

Figure 1: A schematic diagram of the network architecture used to produce the probability distribution of the product observable for a given set of input parameters (exponents).

We use leaky rectified linear units (Leaky ReLU) as the activations for the hidden layers. The networks were compiled with a mean squared error loss function (applied to the 25,000-node penultimate layer shown in Fig. 1, not to the resulting probability density directly), using Adam optimization Kingma and Ba (2014). The regression networks were each trained for a fixed number of epochs. All deep learning tasks were carried out with the Keras Chollet (2015) deep learning libraries, using the TensorFlow Abadi et al. (2015) backend.
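A minimal Keras sketch of the regression network of Fig. 1 is given below, assuming the ingredients quoted above (exponent inputs, two hidden layers with Leaky ReLU, a 25,000-node linear output, and a mean squared error loss on the logarithm of the measured observable values). The layer widths correspond to one of the configurations mentioned in the text; the epoch count and batch size are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_regression_net(n_params=8, n_outputs=25000):
    """Map a set of exponents to 25,000 values of ln(O), approximating its distribution."""
    model = keras.Sequential([
        layers.Dense(500, input_shape=(n_params,)),
        layers.LeakyReLU(),
        layers.Dense(250),
        layers.LeakyReLU(),
        layers.Dense(n_outputs, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Each row of X holds one sampled set of exponents; the matching row of Y holds the
# natural log of the 25,000 product-observable values measured for that set of exponents.
# model = build_regression_net()
# model.fit(X, Y, epochs=50, batch_size=64)
```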

Given the set of values of the observable for a given set of parameters, it is straightforward to use these networks in an optimality scan. For this purpose, we use SciPy's Jones et al. (2001–) basin-hopping Wales and Doye (1997) global minimum finder with the non-linear, derivative-free COBYLA (Constrained Optimization BY Linear Approximation) Powell (1994) minimizer to scan over local minima. In the optimization, the networks are used to predict background and signal distributions for a given set of parameters. The one-dimensional binned likelihood distributions of the observable (in principle, one can estimate the AUC without binning, but no significant sensitivity to the choice of binning was found), constructed from the network outputs, were then used to calculate the area under the ROC curve (AUC) to estimate the discrimination power, where (1−AUC) was explicitly chosen as the metric for the basin-hopping minimization. Appendix A illustrates that the regression networks can be used to accurately model the dependence of the AUC on the parameters. The observable selected using this procedure is referred to as the ensemble-method observable in the following sections.
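The optimization loop can be sketched as follows, assuming hypothetical trained signal and background regression networks (`sig_net`, `bkg_net`); for brevity, the binned likelihood-ratio construction of the text is replaced here by the equivalent unbinned AUC, which is also possible as noted above.

```python
import numpy as np
from scipy.optimize import basinhopping
from sklearn.metrics import roc_auc_score

def one_minus_auc(params, sig_net, bkg_net):
    """Objective for basin-hopping: 1 - AUC of the product observable at these exponents."""
    x = np.asarray(params, dtype=float)[None, :]
    sig_vals = sig_net.predict(x, verbose=0).ravel()   # predicted ln(O) values for signal
    bkg_vals = bkg_net.predict(x, verbose=0).ravel()   # predicted ln(O) values for background
    labels = np.concatenate([np.ones_like(sig_vals), np.zeros_like(bkg_vals)])
    scores = np.concatenate([sig_vals, bkg_vals])
    auc = roc_auc_score(labels, scores)
    return 1.0 - max(auc, 1.0 - auc)   # the orientation of the observable is irrelevant

# x0 = np.zeros(8)   # starting exponents (assumed)
# result = basinhopping(lambda p: one_minus_auc(p, sig_net, bkg_net), x0,
#                       minimizer_kwargs={"method": "COBYLA"}, niter=100)
```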

We also note that the space of possible inputs is degenerate, since a monotonic function of an observable has the same discrimination power as the original observable. However, due to the finite binning required to calculate the AUCs from the likelihood distributions, and statistical fluctuations in a given data sample, the observables do not have precisely the same power as monotonic functions of themselves. The issue of degeneracies is not explicitly dealt with in the minimization procedure, but if the networks are adequately trained over the input space, it is sufficient to locate any one 'global' minimum among local minima of similar depth, using basin-hopping or any other global minimizer.

A second approach to optimizing the product observable uses Eq. (2) directly. The product form can be used as a tunable function for predicting signal versus background, with the exponents as the tunable parameters. This is a more direct way of identifying the optimal solution without explicitly modeling the probability distributions. Optimizing a generic function is possible with methods like stochastic gradient descent, but the product observable is amenable to a significant simplification (we thank Eric Metodiev for this insightful observation). In particular, two classifiers that are monotonic transformations of each other result in the same classification performance. By taking the logarithm of Eq. (2), one can transform the problem into linear regression (linear regression was shown to be sufficient for all IRC-safe observables in Ref. Komiske et al. (2018b); however, our results need not be IRC safe), where the inputs are the logarithms of the N-subjettiness observables and the coefficients are the exponents. This approach uses the mean squared error loss to identify the exponents. The observable selected using this procedure is referred to as the regression-method observable in the following sections.
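Because the logarithm of Eq. (2) is linear in the log-τ inputs, this second approach reduces to fitting a single linear layer (equivalently, ordinary linear regression) with a mean squared error loss against 0/1 labels, with the learned weights playing the role of the exponents. The sketch below illustrates this; the variable names and training settings are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def fit_exponents(log_taus, labels, epochs=20):
    """Linear regression on ln(tau) inputs; the learned weights are the exponents c_k.

    log_taus: array of shape (n_jets, n_basis), the ln of each N-subjettiness basis value
    labels:   1 for signal jets, 0 for background jets
    """
    n_basis = log_taus.shape[1]
    model = keras.Sequential([
        # ln(O) = sum_k c_k * ln(tau_k) + const
        layers.Dense(1, activation="linear", input_shape=(n_basis,)),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(log_taus, labels, epochs=epochs, batch_size=128, verbose=0)
    kernel, bias = model.layers[0].get_weights()
    return kernel.ravel()   # the exponents; the bias only shifts ln(O) and does not affect the classifier
```

An overall rescaling of the exponents is a monotonic transformation of ln(O) and therefore leaves the classification performance unchanged.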

In the limit of infinite data and an arbitrarily flexible neural network, both the ensemble learning and linear regression approaches should achieve the same performance. The latter is significantly easier to train, but the more complex ensemble approach may provide additional benefits: by providing access to the probability distributions, one can optimize any performance metric directly. This includes batch-level losses like the AUC, the false positive rate at a fixed true positive rate, etc. The mean squared error loss should be sufficient to optimize all of these metrics, but may be prevented from reaching the desired optimum due to limited training statistics. In practice, we do not find this to be the case with the setup presented here, but the structure may be useful for related tasks in the future.

IV Results

In this section, we present the new observables obtained for the different classification tasks using the ungroomed samples (the groomed Z′ case is in Appendix C). For closure, we first demonstrate that this new procedure produces an observable for ungroomed H→bb̄ vs. g→bb̄ discrimination with the same performance as the observable proposed in Ref. Datta and Larkoski (2018) (the groomed case is in Appendix B). Then we extend the procedure to higher M-body phase space by applying it to Z′ vs. QCD discrimination for three values of the Z′ mass, and propose new observables for those classification tasks.

Figure 2: (a): Comparison of the probability density function of the new ensemble-method observable for ungroomed H→bb̄ vs. g→bb̄ discrimination, using signal and background samples, and the distribution of the regression DNN prediction. The distributions are rescaled by a constant for the sake of visual comparison. (b): Probability densities of the observable obtained via linear regression. (c): Comparison of the discrimination power of the ensemble- and regression-method observables to standard observables. (d): Comparison of the new observables to the one proposed in Ref. Datta and Larkoski (2018); we note that the three observables provide essentially the same discrimination power.

IV.1 Ungroomed H→bb̄ vs. g→bb̄ discrimination

Utilizing the result that discrimination power for ungroomed H→bb̄ vs. g→bb̄ discrimination saturates at 3-body phase space, we use the procedures proposed in the previous section to find the optimal product observable. The final values of the parameters obtained through the optimization are presented in Table 1, along with those obtained in the previous study. Interestingly, the exponents from the ensemble method are nearly the same as the brute-force values for most of the basis elements, with only a small difference in one of them. For the regression method, the exponents are nearly the same as the ensemble-method exponents up to a constant factor for three of the basis elements, but not for the other two. These results indicate the presence of multiple observables with comparable performance.

Observable                                     Exponents (one per basis element)        AUC

Brute force (Ref. Datta and Larkoski (2018))    2.0    0.0    0.0    0.5   -1.0         0.823

Ensemble method (this work)                     1.87  -0.02  -0.14   0.66  -0.98        0.823

Regression method (this work)                  -0.11  -0.58   0.09  -0.25   0.51        0.824


Table 1: Summary of parameters for the product observables for ungroomed H→bb̄ vs. g→bb̄ discrimination, as proposed in Ref. Datta and Larkoski (2018) and as constructed via the procedures presented in this work (Figs. 2(a) and 2(b)).

In Fig. 2(a), we plot the distributions of the new ensemble-method observable computed for signal and background, along with the prediction from the ensemble neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets. Further, in Fig. 2(b) we plot the distributions of the observable obtained via the ML regression method. We then compare the ROC curves for the new observables to the D_2 Larkoski et al. (2014b) and N_2 Moult et al. (2016) observables, and τ_21, in Fig. 2(c).

In addition, we also compare the new observables to the observable of Ref. Datta and Larkoski (2018) in Fig. 2(d) to demonstrate that the three observables have essentially the same discrimination power, as expected. This allows us to proceed to applying the procedure to higher-dimensional problems.

IV.2 Ungroomed Z′ vs. QCD discrimination

Figure 3: M-body discrimination results for ungroomed Z′ vs. QCD jets. Discrimination power is effectively saturated at 4-body phase space for each mass point.

We first train neural network classifiers on the M-body N-subjettiness bases to identify the point of saturation of discrimination power for each value of the Z′ mass. A single neural network architecture, consisting of seven fully connected layers (five hidden), was utilized for all of the classification tasks. The first four dense layers consisted of 1000, 1000, 750, and 500 nodes, respectively, and were assigned a Dropout Srivastava et al. (2014) regularization of 0.2 to prevent over-fitting on the training data. The next two dense layers consisted of 250 nodes with a Dropout regularization of 0.1, and 100 nodes without Dropout. The input layer and all hidden layers utilized the ReLU activation function Nair and Hinton (2010), while the output layer, consisting of a single node, used a sigmoid activation. The network was compiled with the binary cross-entropy loss function, using Adam optimization Kingma and Ba (2014). Models were trained with Keras' EarlyStopping callback, with appropriate patience thresholds, to further mitigate possible over-fitting. The results are presented in Fig. 3, showing that saturation occurs at 4-body phase space in each case.
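A Keras sketch of this classifier architecture, as described above, is given below; the input dimension, patience, and other fit settings are assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(n_inputs):
    """DNN trained on an M-body N-subjettiness basis (n_inputs = 3M - 4 observables)."""
    model = keras.Sequential()
    model.add(layers.Dense(1000, activation="relu", input_shape=(n_inputs,)))
    model.add(layers.Dropout(0.2))
    for width in (1000, 750, 500):
        model.add(layers.Dense(width, activation="relu"))
        model.add(layers.Dropout(0.2))
    model.add(layers.Dense(250, activation="relu"))
    model.add(layers.Dropout(0.1))
    model.add(layers.Dense(100, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    return model

# early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)  # assumed patience
# model = build_classifier(n_inputs=8)  # e.g. the 8-dimensional 4-body basis
# model.fit(X_train, y_train, validation_split=0.2, epochs=200, callbacks=[early_stop])
```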

We then construct product observables from the elements of the 8-dimensional 4-body basis, running the procedures described in Sec. III to obtain new observables optimized for Z′ vs. QCD discrimination at three different values of the Z′ mass.

We present the distributions of the new observables for Z′ vs. QCD discrimination in Fig. 4 and then compare their discrimination power to standard observables and to DNNs trained on the spanning N-subjettiness bases in Fig. 5. The corresponding exponent values and the AUCs are in Tables 2-3 and 4, respectively. The comparison of the true and predicted distributions in Fig. 4 illustrates the excellent quality of the regression network. The ROC curves in Fig. 5 show that the learned observables outperform the state-of-the-art single physics-motivated observables (top row), though the product observables do not fully saturate the performance of the DNN trained on the full 4-body phase space (bottom row). This suggests that a more flexible form (other than a simple product) is required to build a simple observable that captures more of the classification information. The exponent values obtained from the ensemble and regression methods are not simple rescalings of each other, though the fact that both have similar performance suggests that one is approximately a monotonic transformation of the other.

The optimized product observables are not identical for the different values of the Z′ mass (Tables 2 and 3), but it would be interesting to study to what extent the trends are physical or are due to the existence of multiple observables with similar performance. We leave this study to future work. However, a first indication that the observables contain similar physical information is studied in Appendix D, where the optimized product for one mass is applied to another mass. The ROC curves are similar for all three product observables when applied to the same sample.

m_Z′ [GeV]    Exponents (one per element of the 4-body basis)
50 2.72 -3.78 0.63 -2.77 1.54 0.20 2.36 -0.28
90 0.90 -2.87 0.18 -1.78 -0.72 1.79 2.48 -0.44
130 1.69 -2.98 0.75 -0.89 -0.38 0.77 1.37 0.30
Table 2: Summary of the parameters of the ensemble-method product observable for ungroomed Z′ vs. QCD discrimination at 3 mass points.
m_Z′ [GeV]    Exponents (one per element of the 4-body basis)
50 1.06 -1.11 0.25 -0.56 0.43 -0.07 0.22 -0.01
90 1.02 -1.06 0.22 -0.27 0.15 0.00 0.18 0.02
130 -1.09 -0.43 0.25 -0.97 0.37 0.12 0.60 0.19
Table 3: Summary of the parameters of the regression-method product observable for ungroomed Z′ vs. QCD discrimination at 3 mass points.
m_Z′ [GeV]    AUC (one per observable shown in Fig. 5)
50 0.864 0.858 0.843 0.778 0.817
90 0.873 0.866 0.848 0.837 0.827
130 0.842 0.838 0.809 0.812 0.797
Table 4: Area under the ROC curve (AUC), from Fig. 5, of the standard observables and the product observables, optimized for the corresponding signal, for ungroomed Z′ vs. QCD discrimination at 3 mass points. The ROC curves are calculated using the full datasets, with 500,000 events passing the mass cut for each value of the Z′ mass.
Figure 4: Top panel [a-c]: Comparison of the probability density function of the new ensemble-method observables for ungroomed Z′ vs. QCD discrimination, calculated for signal and background samples, and the distribution of the regression DNN predictions of 25,000 observable values. The distributions are rescaled for the sake of visual comparison. Bottom panel [d-f]: Distributions of the observables for ungroomed Z′ vs. QCD discrimination that were obtained via linear regression.
Figure 5: Top panel [a-c]: Comparison of the discrimination power of the new observables to standard observables; the latter are calculated with an angular exponent of 1, for which they were observed to perform best. Bottom panel [d-f]: Comparison of the new observables to the discrimination power of neural networks trained on the M-body observable bases; the product observables seem to capture increasing amounts of the discrimination power of the 3- and 4-body neural networks with increasing Z′ mass.

V Conclusions

This paper has extended the growing literature on machine-learning assisted jet substructure-based tagging in two ways. First, we have developed a procedure to automatically identify the optimal product observable, using the N-subjettiness features as an example. This is an important innovation because observables with relatively simple analytic forms are robust complements to complex neural network classifiers, and prior to this work, there was no efficient way to identify the best exponents in the product. Second, we have used this automated framework to identify the optimal product observables for searching for boosted resonances like a Z′ boson with masses beyond those of the Standard Model bosons. Jet substructure has proven to be a powerful toolset for such searches, but until now, there have been few studies of the mass dependence of the optimal observables.

Future extensions of the methods introduced in this paper may be able to simplify the regression procedure, as well as study the connections between different classifiers with similar performance (including the ones connected by monotonic functions). The power of the method may also be extended by considering other parametric forms besides products. Classification problems demanding a higher M-body phase space are a natural extension of the examples presented here.

As machine learning techniques are used more widely to guide the optimal selection of classifiers, there will be a growing need to simplify and interpret the guidance from the machines. We have presented an automated approach to construct optimal observables with simple, analytic forms, which can be used for further theoretical and experimental studies. This technique will form the basis of multiple extensions in the future to improve classification performance and increase the robustness of searches and measurements at the LHC and beyond.

Acknowledgements

This work was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. We would like to thank Gregor Kasieczka, Patrick Komiske, Eric Metodiev, and Jesse Thaler for detailed comments on the analysis and manuscript.

References

  • Larkoski et al. (2017) A. J. Larkoski, I. Moult, and B. Nachman (2017), eprint 1709.04464.
  • Asquith et al. (2018) L. Asquith et al. (2018), eprint 1803.06991.
  • Pearkes et al. (2017) J. Pearkes, W. Fedorko, A. Lister, and C. Gay (2017), eprint 1704.02124.
  • Komiske et al. (2018a) P. T. Komiske, E. M. Metodiev, and J. Thaler (2018a), eprint 1810.05165.
  • Komiske et al. (2018b) P. T. Komiske, E. M. Metodiev, and J. Thaler, JHEP 04, 013 (2018b), eprint 1712.07124.
  • Erdmann et al. (2018) M. Erdmann, E. Geiser, Y. Rath, and M. Rieger (2018), eprint 1812.09722.
  • Datta and Larkoski (2017) K. Datta and A. Larkoski, JHEP 06, 073 (2017), eprint 1704.08249.
  • Datta and Larkoski (2018) K. Datta and A. J. Larkoski, JHEP 03, 086 (2018), eprint 1710.01305.
  • Butter et al. (2018) A. Butter, G. Kasieczka, T. Plehn, and M. Russell, SciPost Phys. 5, 028 (2018), eprint 1707.08966.
  • Cogan et al. (2015) J. Cogan, M. Kagan, E. Strauss, and A. Schwarztman, JHEP 02, 118 (2015), eprint 1407.5675.
  • Almeida et al. (2015) L. G. Almeida, M. Backovic, M. Cliche, S. J. Lee, and M. Perelstein, JHEP 07, 086 (2015), eprint 1501.05968.
  • de Oliveira et al. (2016) L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, JHEP 07, 069 (2016), eprint 1511.05190.
  • Komiske et al. (2017) P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, JHEP 01, 110 (2017), eprint 1612.01551.
  • Barnard et al. (2017) J. Barnard, E. N. Dawe, M. J. Dolan, and N. Rajcic, Phys. Rev. D95, 014018 (2017), eprint 1609.00607.
  • Kasieczka et al. (2017) G. Kasieczka, T. Plehn, M. Russell, and T. Schell, JHEP 05, 006 (2017), eprint 1701.08784.
  • Dreyer et al. (2018) F. A. Dreyer, G. P. Salam, and G. Soyez, JHEP 12, 064 (2018), eprint 1807.04758.
  • Lin et al. (2018) J. Lin, M. Freytsis, I. Moult, and B. Nachman, JHEP 10, 101 (2018), eprint 1807.10768.
  • Fraser and Schwartz (2018) K. Fraser and M. D. Schwartz, JHEP 10, 093 (2018), eprint 1803.08066.
  • Chien and Kunnawalkam Elayavalli (2018) Y.-T. Chien and R. Kunnawalkam Elayavalli (2018), eprint 1803.03589.
  • Macaluso and Shih (2018) S. Macaluso and D. Shih, JHEP 10, 121 (2018), eprint 1803.00107.
  • Guest et al. (2016) D. Guest, J. Collado, P. Baldi, S.-C. Hsu, G. Urban, and D. Whiteson, Phys. Rev. D94, 112002 (2016), eprint 1607.08633.
  • Egan et al. (2017) S. Egan, W. Fedorko, A. Lister, J. Pearkes, and C. Gay (2017), eprint 1711.09059.
  • Andreassen et al. (2019) A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, Eur. Phys. J. C79, 102 (2019), eprint 1804.09720.
  • Cheng (2018) T. Cheng, Comput. Softw. Big Sci. 2, 3 (2018), eprint 1711.02633.
  • Louppe et al. (2019) G. Louppe, K. Cho, C. Becot, and K. Cranmer, JHEP 01, 057 (2019), eprint 1702.00748.
  • Henrion et al. (2017) I. Henrion et al., DLPS at NIPS (2017), URL https://dl4physicalsciences.github.io/files/nips_dlps_2017_29.pdf.
  • Aaboud et al. (2018) M. Aaboud et al. (ATLAS) (2018), eprint 1808.07858.
  • CMS Collaboration (2017a) CMS Collaboration, CMS-DP-2017-049 (2017a), URL https://cds.cern.ch/record/2295725.
  • CMS Collaboration (2018a) CMS Collaboration, CMS-DP-2018-046 (2018a), URL https://cds.cern.ch/record/2630438.
  • ATLAS Collaboration (2017a) ATLAS Collaboration, ATL-PHYS-PUB-2017-017 (2017a), URL http://cds.cern.ch/record/2275641.
  • CMS Collaboration (2017b) CMS Collaboration, CMS-DP-2017-027 (2017b), URL https://cds.cern.ch/record/2275226.
  • ATLAS Collaboration (2017b) ATLAS Collaboration, ATL-PHYS-PUB-2017-003 (2017b), URL https://cds.cern.ch/record/2255226.
  • ATLAS Collaboration (2017c) ATLAS Collaboration, ATL-PHYS-PUB-2017-013 (2017c), URL https://cds.cern.ch/record/2273281.
  • CMS Collaboration (2018b) CMS Collaboration, CMS-DP-2018-058 (2018b), URL https://cds.cern.ch/record/2646773.
  • Sirunyan et al. (2018a) A. M. Sirunyan et al. (CMS), JINST 13, P05011 (2018a), eprint 1712.07158.
  • Collins et al. (2018) J. H. Collins, K. Howe, and B. Nachman, Phys. Rev. Lett. 121, 241803 (2018), eprint 1805.02664.
  • Collins et al. (2019) J. H. Collins, K. Howe, and B. Nachman, Phys. Rev. D99, 014038 (2019), eprint 1902.02634.
  • Heimel et al. (2018) T. Heimel, G. Kasieczka, T. Plehn, and J. M. Thompson (2018), eprint 1808.08979.
  • Farina et al. (2018) M. Farina, Y. Nakai, and D. Shih (2018), eprint 1808.08992.
  • Metodiev and Thaler (2018) E. M. Metodiev and J. Thaler, Phys. Rev. Lett. 120, 241602 (2018), eprint 1802.00008.
  • Komiske et al. (2018c) P. T. Komiske, E. M. Metodiev, B. Nachman, and M. D. Schwartz, Phys. Rev. D98, 011502 (2018c), eprint 1801.10158.
  • Metodiev et al. (2017) E. M. Metodiev, B. Nachman, and J. Thaler, JHEP 10, 174 (2017), eprint 1708.02949.
  • Dery et al. (2017) L. M. Dery, B. Nachman, F. Rubbo, and A. Schwartzman, JHEP 05, 145 (2017), eprint 1702.00414.
  • Cohen et al. (2018) T. Cohen, M. Freytsis, and B. Ostdiek, JHEP 02, 034 (2018), eprint 1706.09451.
  • Louppe et al. (2016) G. Louppe, M. Kagan, and K. Cranmer (2016), eprint 1611.01046.
  • Shimmin et al. (2017) C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul, and A. Sogaard, Phys. Rev. D96, 074034 (2017), eprint 1703.03507.
  • ATL (2018) Tech. Rep. ATL-PHYS-PUB-2018-014, CERN, Geneva (2018), URL http://cds.cern.ch/record/2630973.
  • Sirunyan et al. (2017) A. M. Sirunyan et al. (CMS), Phys. Rev. Lett. 119, 111802 (2017), eprint 1705.10532.
  • Sirunyan et al. (2018b) A. M. Sirunyan et al. (CMS), JHEP 01, 097 (2018b), eprint 1710.00159.
  • Aaboud et al. (2019) M. Aaboud et al. (ATLAS), Phys. Lett. B788, 316 (2019), eprint 1801.08769.
  • Stewart et al. (2010) I. W. Stewart, F. J. Tackmann, and W. J. Waalewijn, Phys. Rev. Lett. 105, 092002 (2010), eprint 1004.2489.
  • Thaler and Van Tilburg (2011) J. Thaler and K. Van Tilburg, JHEP 03, 015 (2011), eprint 1011.2268.
  • Thaler and Van Tilburg (2012) J. Thaler and K. Van Tilburg, JHEP 02, 093 (2012), eprint 1108.2701.
  • Sjostrand et al. (2006) T. Sjostrand, S. Mrenna, and P. Z. Skands, JHEP 05, 026 (2006), eprint hep-ph/0603175.
  • Sjostrand et al. (2015) T. Sjostrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, Comput. Phys. Commun. 191, 159 (2015), eprint 1410.3012.
  • Alwall et al. (2014) J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, JHEP 07, 079 (2014), eprint 1405.0301.
  • Cacciari et al. (2008) M. Cacciari, G. P. Salam, and G. Soyez, JHEP 04, 063 (2008), eprint 0802.1189.
  • Cacciari et al. (2012) M. Cacciari, G. P. Salam, and G. Soyez, Eur. Phys. J. C72, 1896 (2012), eprint 1111.6097.
  • Dokshitzer et al. (1997) Y. L. Dokshitzer, G. D. Leder, S. Moretti, and B. R. Webber, JHEP 08, 001 (1997), eprint hep-ph/9707323.
  • Wobisch and Wengler (1998) M. Wobisch and T. Wengler, in Monte Carlo generators for HERA physics. Proceedings, Workshop, Hamburg, Germany, 1998-1999 (1998), pp. 270–279, eprint hep-ph/9907280.
  • Larkoski et al. (2014a) A. J. Larkoski, S. Marzani, G. Soyez, and J. Thaler, JHEP 05, 146 (2014a), eprint 1402.2657.
  • Dasgupta et al. (2013) M. Dasgupta, A. Fregoso, S. Marzani, and G. P. Salam, JHEP 09, 029 (2013), eprint 1307.0007.
  • Catani et al. (1993) S. Catani, Y. L. Dokshitzer, M. H. Seymour, and B. R. Webber, Nucl. Phys. B406, 187 (1993).
  • Ellis and Soper (1993) S. D. Ellis and D. E. Soper, Phys. Rev. D48, 3160 (1993), eprint hep-ph/9305266.
  • Blazey et al. (2000) G. C. Blazey et al., in QCD and weak boson physics in Run II. Proceedings, Batavia, USA, March 4-6, June 3-4, November 4-6, 1999 (2000), pp. 47–77, eprint hep-ex/0005012, URL http://lss.fnal.gov/cgi-bin/find_paper.pl?conf-00-092.
  • Larkoski et al. (2014b) A. J. Larkoski, I. Moult, and D. Neill, JHEP 12, 009 (2014b), eprint 1409.6298.
  • Moult et al. (2016) I. Moult, L. Necib, and J. Thaler, JHEP 12, 153 (2016), eprint 1609.07483.
  • Bertolini et al. (2014) D. Bertolini, T. Chan, and J. Thaler, JHEP 04, 013 (2014), eprint 1310.7584.
  • Larkoski et al. (2014c) A. J. Larkoski, D. Neill, and J. Thaler, JHEP 04, 017 (2014c), eprint 1401.2158.
  • Larkoski and Thaler (2014) A. J. Larkoski and J. Thaler, Phys. Rev. D90, 034010 (2014), eprint 1406.7011.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014), pp. 2672–2680, URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
  • Kingma and Welling (2013) D. P. Kingma and M. Welling, CoRR abs/1312.6114 (2013).
  • Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra, in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (JMLR.org, 2014), ICML’14, pp. II–1278–II–1286, URL http://dl.acm.org/citation.cfm?id=3044805.3045035.
  • Kingma and Ba (2014) D. P. Kingma and J. Ba, CoRR abs/1412.6980 (2014), URL http://arxiv.org/abs/1412.6980.
  • Chollet (2015) F. Chollet, Keras, https://github.com/fchollet/keras (2015).
  • Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous systems (2015), software available from tensorflow.org, URL http://tensorflow.org/.
  • Jones et al. (2001–) E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python (2001–), URL http://www.scipy.org/.
  • Wales and Doye (1997) D. J. Wales and J. P. K. Doye, The Journal of Physical Chemistry A 101, 5111 (1997), eprint https://doi.org/10.1021/jp970984n, URL https://doi.org/10.1021/jp970984n.
  • Powell (1994) M. J. D. Powell, A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation (Springer Netherlands, Dordrecht, 1994), pp. 51–67, ISBN 978-94-015-8330-5, URL https://doi.org/10.1007/978-94-015-8330-5_4.
  • Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, J. Mach. Learn. Res. 15, 1929 (2014), ISSN 1532-4435, URL http://dl.acm.org/citation.cfm?id=2627435.2670313.
  • Nair and Hinton (2010) V. Nair and G. E. Hinton, in ICML (Omnipress, 2010), pp. 807–814, URL http://dblp.uni-trier.de/db/conf/icml/icml2010.html#NairH10.

Appendix A Cross-check of the performance of the regression networks

Here we briefly demonstrate that the regression DNNs do actually learn to approximate the mapping from the input parameters of the product observables to their densities, i.e., a mapping from the exponents to the distribution of the observable. We specifically choose the ungroomed 90 GeV Z′ case, fixing the exponents to the values of the optimal observable, as listed in Table 3.

We then select one of the parameters and vary it between −7 and 7 with a step size of 0.1, while keeping the other parameters fixed. This allows us to study how the networks can be used to interpolate AUCs over a range of values around the optimum we locate; in addition, by going beyond the training range of the exponents, we also demonstrate that the networks can be used to extrapolate the aforementioned mapping and still calculate the AUC with a good level of accuracy. The results of this study are shown in Fig. 6 and indicate that the regression networks allow us to accurately track the trajectories of the AUC in these one-dimensional slices of the parameter space.
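A sketch of this one-dimensional scan, reusing the hypothetical regression networks and the `one_minus_auc` objective from the Sec. III sketch, is shown below; the range and step size follow the text, and the comparison with the true AUC would use observable values recomputed on the jets themselves.

```python
import numpy as np

def scan_parameter(opt_params, index, sig_net, bkg_net, lo=-7.0, hi=7.0, step=0.1):
    """Vary one exponent of the optimal observable and track the AUC predicted by the networks."""
    values = np.arange(lo, hi + step, step)
    aucs = []
    for v in values:
        params = np.array(opt_params, dtype=float)
        params[index] = v
        # one_minus_auc is defined in the optimization sketch of Sec. III
        aucs.append(1.0 - one_minus_auc(params, sig_net, bkg_net))
    return values, np.asarray(aucs)
```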

Figure 6: Here we plot results for the calculation of the area under the ROC curve (AUC) from the distributions generated by the signal and background regression DNNs for the ungroomed Z′ case (blue, dashed), along with the true AUC (red, solid) computed on the same statistics. These are seen to be in good agreement with each other. The results demonstrate the usefulness of the networks for accurately reproducing the change in the signal and background PDFs, represented via the accurate reproduction of the AUCs calculated from them, as a function of the variation of each individual input parameter.

Appendix B Groomed H→bb̄ vs. g→bb̄ discrimination

Figure 7: (a): Comparison of the PDFs of the ensemble-method observable for mMDT groomed H→bb̄ vs. g→bb̄ discrimination, using signal and background samples, and the distribution of the regression DNN prediction. The distributions are rescaled by a constant for the sake of visual comparison. (b): Probability density distributions of the observable obtained via linear regression. (c): Comparison of the discrimination power of the new observables to standard observables, where the 4-body product observable is seen to perform best for groomed discrimination. (d): Comparison of the new observables to the one proposed in Ref. Datta and Larkoski (2018); we note that the ensemble-method observable and the observable of Ref. Datta and Larkoski (2018) provide essentially the same discrimination power, while the 3- and 4-body observables obtained with linear regression outperform them.

Utilizing the result that discrimination power for mMDT groomed H→bb̄ vs. g→bb̄ discrimination saturates at 3-body phase space Datta and Larkoski (2018), we use the procedure proposed in Sec. III to find the optimal product observable. The final values of the parameters obtained through the optimization are presented in Table 5, along with those obtained in the previous study. Interestingly, the ensemble-method exponents are nearly the same as the brute-force values for three of the basis elements, but are quite different for the other two. Some of the exponents are also similar between the different methods up to a multiplicative factor.

Observable                                     Exponents (one per basis element)        AUC

Brute force (Ref. Datta and Larkoski (2018))   -2.0    0.0    0.0   -2.0    2.0         0.745

Ensemble method (this work)                     0.67  -1.65   0.01  -1.90   2.07        0.744

Regression method (this work)                  -1.54   1.01  -0.17  -0.15   0.16        0.758

Table 5: Summary of parameters for the product observables for groomed H→bb̄ vs. g→bb̄ discrimination, as proposed in Ref. Datta and Larkoski (2018) and as constructed via the procedure presented in this work (Fig. 7(a)).

In Fig. 7(a), we plot the distributions of the new ensemble-method observable computed for signal and background, along with the prediction from the neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets. We then compare the ROC curves for the new observable to the D_2 Larkoski et al. (2014b) and N_2 Moult et al. (2016) observables, and τ_21, in Fig. 7(c).

In addition, we compare the new observable to the one of Ref. Datta and Larkoski (2018) in Fig. 7(d) to demonstrate that both observables have essentially the same discrimination power, as expected. This again allows us to apply the procedure to higher-dimensional problems. Further, we plot the ROC curve for the 4-body product observable from the linear regression method, noting that it provides the best performance of the observables that have been explored for this problem; it leads to an AUC of 0.778 in Fig. 7(c).

Appendix C Groomed Z′ vs. QCD discrimination

In this section we carry out the same set of studies for mMDT groomed Z′ vs. QCD discrimination as for the ungroomed case in Sec. IV.2. As in the ungroomed case, Fig. 8 indicates that the saturation of discrimination power occurs at 4-body phase space.

m_Z′ [GeV]    Exponents (one per element of the 4-body basis)
50 2.6 -0.41 -2.94 -2.79 0.20 0.93 -0.66 2.43
90 2.3 -1.35 -2.05 -1.64 -0.81 0.89 2.03 -0.44
130 0.80 -1.74 -0.28 -1.01 -0.38 0.56 0.82 0.69
Table 6: Summary of the parameters of the ensemble-method product observable for mMDT groomed Z′ vs. QCD discrimination at 3 mass points.
m_Z′ [GeV]    Exponents (one per element of the 4-body basis)
50 -0.35 0.35 0.56 1.05 -0.17 -0.24 -0.34 0.51
90 0.26 -0.41 -0.39 -0.68 -0.15 0.11 0.25 0.42
130 1.28 0.54 0.35 1.09 0.09 -0.38 -1.06 -0.48
Table 7: Summary of the parameters of the regression-method product observable for mMDT groomed Z′ vs. QCD discrimination at 3 mass points.
m_Z′ [GeV]    AUC (one per observable shown in Fig. 10)
50 0.830 0.826 0.796 0.803 0.780
90 0.822 0.821 0.780 0.796 0.763
130 0.814 0.811 0.769 0.791 0.751
Table 8: Area under the ROC curve (AUC), from Fig. 10, of standard observables and the product observables, optimized for the corresponding signal, for mMDT groomed Z′ vs. QCD discrimination at 3 mass points. The ROC curves are calculated using the full datasets, with 300,000 events passing the mass cut for each value of the Z′ mass.

The results for the final observables at the three mass points are presented in Tables 6 and 7, and the observable distributions are plotted in Fig. 9. The performance of the new observables is compared to standard ones and to the M-body DNNs in Fig. 10, and the corresponding AUCs are shown in Table 8 for the different mass points. The conclusions from this section are qualitatively the same as those from Sec. IV.2, with slightly lower AUCs from both the product observables and the physics-motivated observables. Importantly, the product observables for the groomed case appear to saturate the bounds from the M-body phase space better than in the ungroomed case.

Figure 8: M-body discrimination results for mMDT groomed Z′ vs. QCD jets. Here, discrimination power is again seen to effectively saturate at 4-body phase space for all considered values of the Z′ mass.
Figure 9: Top panel [a-c]: Comparison of the probability density function of the new ensemble-method observables for mMDT groomed Z′ vs. QCD discrimination, calculated for signal and background samples, and the distribution of the regression DNN predictions of 25,000 observable values. The distributions are rescaled for the sake of visual comparison. Bottom panel [d-f]: Distributions of the observables for mMDT groomed Z′ vs. QCD discrimination that were obtained via linear regression.
Figure 10: Top panel [a-c]: Comparison of the discrimination power of the new observables to standard observables; the latter are computed with an angular exponent of 2, for which they were observed to perform best for mMDT groomed samples. Bottom panel [d-f]: Comparison of the new observables to the discrimination power of neural networks trained on the M-body observable bases; the product observables capture almost all of the discrimination power of the 4-body neural networks.

Appendix D Mass dependence of the optimized observables

Figure 11: Here we plot results for the new observables on Z′ samples with a different mass point from the one on which they were optimized, within the mass windows appropriate for the corresponding signal. We note that in all cases, the new observables demonstrate very similar discrimination power and outperform standard observables.

Here, we briefly study the performance of the new observables presented in Sec. IV.2. They are tested on a different combination of signal and background samples from the ones on which they were optimized; for example, we calculate the new observable optimized for one mass point on signal samples for m_Z′ = 90 GeV, and background, that pass the mass window on which the 90 GeV observable was optimized. The results of this study are presented in Fig. 11 and indicate that, while these observables are optimized on samples from a specific mass point, they can be applied to other classification tasks and still provide better discrimination performance than standard observables. This also suggests that the different parameter sets in Table 3 may represent observables with very similar physical information, even though the N-subjettiness variables are not invariant under transverse boosts.

Appendix E Saturating the discrimination power of the product observables

Figure 12: Here we compare the discrimination power of the product observables for ungroomed Z′ vs. QCD discrimination for several values of M, noting that while discrimination power does increase with higher M (until the inclusion of 7- or 8-body variables), the product form can only capture a limited amount of useful discriminating information from bases beyond the point of saturation of a DNN classifier (dark grey).

In this section we briefly study the flexibility of the product-form ansatz using the observables obtained via the linear regression procedure. For concreteness, we look at a single ungroomed Z′ mass point and plot ROC curves for the observables constructed up to 8-body phase space in Fig. 12.

We observe that the discrimination power of the product observables gradually increases up to the inclusion of 7- or 8-body phase-space variables. Compared to the ROC curve at the point of saturation, from the 4-body DNN classifier, these results suggest that while a DNN can adjust thresholds on the 4-body inputs such that there is effectively only redundant discriminating information in higher M-body bases, as is also expected from the physics study in Ref. Datta and Larkoski (2017), the product observables do still benefit from including N-subjettiness variables from beyond the point of saturation.

Depending on the classification task, the product observables may even come very close to matching the performance of a saturated ML classifier (Fig. 10). However, they ultimately cannot capture all of the available information, due to the limited flexibility of the product-form ansatz. These observations will of course vary based on the objects being studied. We leave further physics studies of the product form, or other equivalent ansätze, to future work.
