Automating the Construction of Jet Observables with Machine Learning
Abstract
Machinelearning assisted jet substructure tagging techniques have the potential to significantly improve searches for new particles and Standard Model measurements in hadronic final states. Techniques with simple analytic forms are particularly useful for establishing robustness and gaining physical insight. We introduce a procedure to automate the construction of a large class of observables that are chosen to completely specify body phase space. The procedure is validated on the task of distinguishing from , where and previous bruteforce approaches to construct an optimal product observable for the body phase space have established the baseline performance. We then use the new method to design tailored observables for the boosted search, where and bruteforce methods are intractable. The new classifiers outperform standard prong tagging observables, illustrating the power of the new optimization method for improving searches and measurement at the LHC and beyond.
pacs:
I Introduction
Effective identification of hadronic decays of boosted heavy particles like the top quark or , and Higgs () bosons is essential for analyses at the Large Hadron Collider (LHC). Jet substructure observables that identify specific discriminating information in the radiation pattern of jets originating from different particles are now necessary, both in the search for new physics and precision Standard Model (SM) measurements. As a result, there is an extensive literature developing observables and techniques for identifying boosted topologies to increase the efficacy of LHC analyses probing extreme regions of phase space Larkoski et al. (2017); Asquith et al. (2018).
Modern machine learning (ML) methods have emerged as useful tools for automating the creation of optimal observables for classification. These methods are particularly powerful for highdimensional, lowlevel inputs such as fixedlength sets of fourvectors Pearkes et al. (2017), variablelength sets of fourvectors Komiske et al. (2018a), physicsinspired bases Komiske et al. (2018b); Erdmann et al. (2018); Datta and Larkoski (2017, 2018); Butter et al. (2018), images Cogan et al. (2015); Almeida et al. (2015); de Oliveira et al. (2016); Komiske et al. (2017); Barnard et al. (2017); Kasieczka et al. (2017); Dreyer et al. (2018); Lin et al. (2018); Fraser and Schwartz (2018); Chien and Kunnawalkam Elayavalli (2018); Macaluso and Shih (2018), sequences Guest et al. (2016); Egan et al. (2017); Andreassen et al. (2019); Fraser and Schwartz (2018), trees Cheng (2018); Louppe et al. (2019), and graphs Henrion et al. (2017). Some deep learningbased tagging schemes have already been demonstrated using collider data as well as with full detector simulations for top quark tagging Aaboud et al. (2018); CMS Collaboration (2017a), boson tagging Aaboud et al. (2018); CMS Collaboration (2018a), quark/gluon tagging ATLAS Collaboration (2017a); CMS Collaboration (2017b), and jet tagging ATLAS Collaboration (2017b, c); CMS Collaboration (2018b); Sirunyan et al. (2018a). In addition to improving classification performance, ML techniques have also been proposed to make jet tagging more independent from simulation and robust to differences between simulation and data as well as between sideband and signal regions Collins et al. (2018, 2019); Heimel et al. (2018); Farina et al. (2018); Metodiev and Thaler (2018); Komiske et al. (2018c); Metodiev et al. (2017); Dery et al. (2017); Cohen et al. (2018); Louppe et al. (2016); Shimmin et al. (2017); ATL (2018).
One of the key challenges with ML taggers is to identify what information the machine is using for classification. Understanding the origin of discrimination can lead to robustness when taggers are applied outside of the region they were trained, can result in new theoretical insight for other applications, and may produce new simple observables that capture most of the information. While there are many proposals for ML metacognition Cohen et al. (2018); de Oliveira et al. (2016); Lin et al. (2018); Komiske et al. (2018b, a); Datta and Larkoski (2018, 2017), one particularly powerful approach is to identify simple product observables that capture most of the information from an ML algorithm trained on the full phase space Datta and Larkoski (2018). This approach results in analytically tractable observables that can capture nearly all of the power of a more complicated algorithm, but are also very robust and insightful. One of the most challenging aspects of the approach presented in Ref. Datta and Larkoski (2018) is the fitting process for picking the optimal simple product observable.
In this paper, we describe a new procedure based on ML for automating the feature extraction originally presented in Ref. Datta and Larkoski (2018). This method is applied to derive an optimal product observable for discriminating vs. and the outcome is compared to the result of Ref. Datta and Larkoski (2018) which used a brute force approach. Having validated the method, a new classifier is developed to distinguish a from generic quark and gluon jets. The phase space scan required in this later tagging task is too big for the brute force approach and therefore the automated method is required to find the optimal tagger. The resulting classifier has a simple form and is competitive with a tagger using highdimensional, lowlevel inputs. In addition to Ref. Shimmin et al. (2017), this is the only other study of the dependence on the mass of the new boson, which is timely given new searches for light boosted bosons Sirunyan et al. (2017); Sirunyan et al. (2018b); Aaboud et al. (2019).
This paper is organized as follows. The method for constructing product observables is described in Sec. II and the machine learning approaches are detailed in Sec. III. Results for both the Higgs and classification tasks are presented in Sec. IV. The paper ends with conclusions and future outlook in Sec. V.
Ii subjettiness Product Observables
The information about the kinematic phase space of subjets in a jet is resolved with a set of subjettiness Stewart et al. (2010); Thaler and Van Tilburg (2011, 2012) observables. By increasing , one can identify the number of subjets required to saturate the classification performance based on the spanning set of subjettiness observables Datta and Larkoski (2017):
where
(1) 
for some choice of axes within the jet; is the jet radius parameter, and . Given the minimal , one can posit an ansatz^{1}^{1}1The product form may not be flexible enough to capture the full discrimination power. We find that it can capture a significant portion of the classification performance, but Appendix E indicates that further information can be useful. for a simple product observable that captures most of the information contained in a neural network trained on the entire spanning set:
(2) 
For distinguishing vs. jets, Ref. Datta and Larkoski (2018) showed that the useful information for classification is saturated by and has nearly the same tagging performance as the full body phase space. The parameters that specify were identified by randomly scanning the fivedimensional phase space and exploiting minimal correlations between some of the parameters. This becomes intractable when the optimal is bigger than .
In this paper, we explore methods to overcome the difficulties of extending this procedure to higher dimensions. In one approach, we replace the random sampling segment of the procedure with a combination of neural networks carrying out regression from the parameter space to the distributions of the product observable for individual jets. Offtheshelf minimization routines can then be used to optimize any metric of the classifier performance. A complementary and simpler approach is to directly use the form in Eq. 2 in the machine learning optimization, where the learnable parameters are the exponents . Further details are described in the next sections.
Iii Machine Learning
Implementation
iii.1 Dataset
Protonproton collisions with , , and generic quark and gluon jets (QCD) at TeV are generated using Pythia 8.226 Sjostrand et al. (2006, 2015). For the case, the background is enriched in as in Ref. Alwall et al. (2014) by generating the gluon splitting matrix element in MadGraph 5 v2.5.4 Alwall et al. (2014). All detectorstable particles excluding neutrinos and muons are clustered into jets using the anti algorithm Cacciari et al. (2008) with as implemented in Fastjet Cacciari et al. (2012). Jets are groomed by reclustering the constituents using the CambridgeAachen algorithm Dokshitzer et al. (1997); Wobisch and Wengler (1998) and applying the soft drop algorithm Larkoski et al. (2014a) with and (equivalent to modified mass drop tagging or mMDT Dasgupta et al. (2013)). The subjettiness observables are computed using the axes that minimize , using the exclusive algorithm Catani et al. (1993); Ellis and Soper (1993) with standard scheme recombination Blazey et al. (2000). For comparison with other stateoftheart twoprong tagging techniques, the Larkoski et al. (2014b), Moult et al. (2016) observables, and with winnertakeall (WTA) recombination Bertolini et al. (2014); Larkoski et al. (2014c); Larkoski and Thaler (2014), are also computed from the jet constituents.
iii.2 Construction of optimized product observables
Using the approach followed in Ref. Datta and Larkoski (2018), the point of saturation of discrimination power is first identified using a deep neural network (DNN) classifier. For vs. QCD and vs. discrimination, we note that discrimination power saturates at 4body (8dimensional) and 3body phase space (5dimensional), respectively. Then it is simple to form the product observable from the elements of the body basis corresponding to saturation.
We examine two approaches for finding the optimal product observable. The first approach follows a similar method as the bruteforce algorithm. Neural networks approximate signal and background probability distributions conditioned on the parameters and then any automated optimization procedure can be used to identify the best exponents. For each task, the product observable is calculated for 25,000 signal and background jets for different values of the parameters [] () or [] (), in the range . These distributions are then stored to generate training sets for the neural networks used to carry out regression from the parameter space to the calculation of with those exponents.
While there are multiple possibilities for learning the probability distribution of given , such as generative adversarial networks Goodfellow et al. (2014) and variational autoencoders Kingma and Welling (2013); Rezende et al. (2014), the method that we found works well for the product observables is illustrated in Fig. 1. The network takes as input 5 (Higgs) or 8 () inputs and outputs 25,000 numbers, which represent a dataset that is the same size as the training data, but with the specified parameter values . From these 25,000 values, the probability distribution of is formed for signal and background and the onedimensional likelihood ratio is constructed for optimizing the classifier performance. Variations on this setup are possible, such as (significantly) reducing the number of points needed to specify the probability distributions, but this approach was found to be robust to perturbations in initialization and network architecture. For this paper, it was found that the network did not work well with fewer than 25k example jets per parameter point. For each network, 250k (450k) parameter points were used for training in the and ungroomed Higgs (groomed Higgs) case. In only the groomed Higgs case, a single network was trained for signal and background with a 1/0 switch added to the input. Separate networks were trained for signal and background in the and ungroomed Higgs cases. To reduce the effects of numerical instability on the training of these networks, we train on samples after taking the natural logarithm of the 25k measured values of the product observables.
Aside from the use (or not) of the switch input, both the and tasks use simple fullyconnected neural networks with two hidden layers. The input layer is followed by a dense layer with either 250 or 500 nodes, then another dense layer with 100 or 250 nodes, followed by an output layer with 25,000 nodes using a linear activation. The number of nodes in the hidden layers were bigger for the case with grooming compared with the Higgs case or the ungroomed case.
We use leaky rectified linear units (Leaky ReLU) as the activations for the hidden layers. The networks were compiled with a mean squared error loss function (on the penultimate layer shown in Fig. 1, not on directly), using Adam optimization Kingma and Ba (2014). The regression networks were each trained for epochs. All deep learning tasks were carried out with the Keras Chollet (2015) deep learning libraries, using the TensorFlow Abadi et al. (2015) backend.
Given the set of values of the observable for a given set of parameters, it is straightforward to use these networks in an optimality scan. For this purpose, we use SciPy’s Jones et al. (2001–) basinhopping Wales and Doye (1997) global minima finder using the nonlinear, derivative free COBYLA (Constrained Optimization BY Linear Approximation) Powell (1994) minimizer to scan over local minima. In the optimization, the networks are used to predict background and signal distributions for a given set of parameters. The 1dimensional binned likelihood distributions^{2}^{2}2In principle, one can estimate the AUC without binning, but it was found that there was not a significant sensitivity to the choice of binning. of the observable, constructed from the network outputs, was then used to calculate the area under the ROC curve (AUC) to estimate the discrimination power, where (1AUC) was explicitly chosen as the metric for the basinhopping minimization. Appendix A illustrates that the regression networks can be used to accurately model the dependence of the AUC as a function of the parameters. The observable selected using this procedure will be denoted in the next sections.
We also note that the space of possible inputs is degenerate since a monotonic function of an observable has the same discrimination power as the original observable. However, due to the finite binning required to calculate the AUC’s from the likelihood distributions, and statistical fluctuations in a given data sample, the observables do not have precisely the same power as monotonic functions of themselves. The issue of degeneracies is not explicitly dealt with in the minimization procedure, but if the networks are adequately trained over the input space, it is sufficient to locate any one ‘global’ minimum among local minima of similar depth, using basinhopping or any other global minimizer.
A second approach to optimizing directly uses Eq. 2. The product form can be used directly as a tunable function for predicting signal/background with tunable parameters . This is a more direct way of identifying the optimal solution without explicitly modeling the probability distributions. Optimizing a generic function is possible with methods like stochastic gradient decent, but the product observable is amenable to a significant simplification^{3}^{3}3We thank Eric Metodiev for this insightful observation.. In particular, two classifiers that are monotonic transformations of each other result in the same classification performance. By taking the logarithm of Eq. 2, one can transform the problem into linear regression^{4}^{4}4Linear regression was proven to be sufficient for all IRC safe observables Ref. Komiske et al. (2018b), however our results need not be IRC safe. where the inputs are and the coefficients are the exponents. This approach uses the mean squared error loss to identify . The observable selected using this procedure will be denoted in the next sections.
In the limit of infinite data and an arbitrarily flexible neural network, both the ensemble learning and linear regression approaches should achieve the same performance. The latter is significantly easier to train, but the complex approach may provide additional benefits because by providing access to the probability distributions, one can optimize any performance metric directly. This includes batchlevel losses like the AUC, false positive rate at a fixed truepositive rate, etc. The mean squared error loss should be sufficient to optimize all of these metrics, but maybe prevented from reaching the desired optimum due to limited training statistics. In practice, we do not find this to be the case with the setup presented here, but the structure may be useful for related tasks in the future.
Iv Results
In this section, we present the new observables obtained for the different classification tasks for the ungroomed samples (the groomed case is in Appendix C). For closure, we first demonstrate that this new procedure produces an observable for ungroomed discrimination with the same performance as the observable proposed in Ref. Datta and Larkoski (2018) (the groomed case in Appendix B). Then we extend the procedure to higher body phase space by applying it to discrimination for three values of , and propose new observables for those classification tasks.
iv.1 Ungroomed vs. discrimination
Utilizing the result that discrimination power for ungroomed vs. discrimination saturates at 3body phase space, we use the procedures proposed in the previous section to find the optimal product observable. The final values for the parameters obtained through the optimization are presented in Table 5, along with those obtained in the previous study. Interestingly, the exponents with the ensemble method are nearly the same for , , , and , but slightly different for . For the regression method, the exponents are nearly the same as the ensemble method up to a constant factor (approximately ) for , , and , but not for and . These results indicate the presence of multiple observables with comparable performance.
Observable  AUC  


2.0  0.0  0.0  0.5  1.0  0.823 

1.87  0.02  0.14  0.66  0.98  0.823 

0.11  0.58  0.09  0.25  0.51  0.824 

In Fig. (a)a, we plot the distributions of the new observable computed for signal and background, along with the prediction from the ensemble neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets. Further, in Fig. (b)b we plot the distributions of the observable obtained via the ML regression method. We then compare the ROC curves for the new observables to Larkoski et al. (2014b), Moult et al. (2016) observables, and in Fig. (c)c.
In addition, we also compare the new observables to in Fig. (d)d to demonstrate that the three observables have essentially the same discrimination power as expected. Then, this allows us to proceed to applying the procedure on higher dimensional problems.
iv.2 Ungroomed vs.
We first train neural network classifiers on the body subjettiness bases, to identify the point of saturation of discrimination power for each value of .^{5}^{5}5A single neural network architecture, consisting of seven fully connected (five hidden) layers, was utilized for all of the classification tasks. The first four Dense layers consisted of 1000, 1000, 750 and 500 nodes respectively, and were assigned a Dropout Srivastava et al. (2014) regularization of 0.2, to prevent overfitting on training data. The next two Dense layers consisted of 250 nodes with Dropout regularization 0.1, and 100 nodes without Dropout. The input layer and all hidden layers utilized the ReLU activation function Nair and Hinton (2010), while the output layer, consisting of a single node, used a sigmoid activation. The network was compiled with the binary crossentropy loss minimization function, using the Adam optimization Kingma and Ba (2014). Models were trained with Keras’ default EarlyStopping callback, with appropriate patience thresholds, to further negate possible overfitting. The results are presented in Fig. 3, showing that saturation occurs with the 4body phase space for each case.
We then proceed to construct the and product observables with the elements of the 8dimensional 4body basis, and run the procedure described in Sec. III and construct the new observables optimized for discrimination at three different values of .
We present the distribution of the new observables for discrimination in Fig. 5 and then compare their discrimination power to standard observables and DNN’s trained on the spanning subjettiness bases in Fig. 5. The corresponding values of and the AUCs are in tables 3 and 4, respectively. The comparison of the true and predicted distributions in Fig. 5 illustrates the excellent quality of the regression network. The ROC curves in Fig. 5 show that the learned and outperform the stateoftheart single physicsmotivated observables (top row), though the product observables do not fully saturate the performance of the DNN trained on the full body phase space (bottom row). This suggests that a more flexible form (other than a simple product) is required to build a simple observable to capture more of the classification information. The product values obtained from the ensemble and regression methods are not simple scaling of each other, though the fact that both have a similar performance suggests that one is a monotonic transformation of the other.
The optimized and observables are not identical for the different values of (Table 3), but it would be interesting to study to what extent the trends are physical or are due to the existence of multiple observables with similar performance. We leave this study to future work. However, a first indication that the observables contain similar physical information is studied in Appendix D, where the optimized product for one mass is applied to another mass. The ROC curves are similar for all three product observables when applied to the same .
[GeV]  

50  2.72  3.78  0.63  2.77  1.54  0.20  2.36  0.28 
90  0.90  2.87  0.18  1.78  0.72  1.79  2.48  0.44 
130  1.69  2.98  0.75  0.89  0.38  0.77  1.37  0.30 
[GeV]  

50  1.06  1.11  0.25  0.56  0.43  0.07  0.22  0.01 
90  1.02  1.06  0.22  0.27  0.15  0.00  0.18  0.02 
130  1.09  0.43  0.25  0.97  0.37  0.12  0.60  0.19 
[GeV]  

50  0.864  0.858  0.843  0.778  0.817 
90  0.873  0.866  0.848  0.837  0.827 
130  0.842  0.838  0.809  0.812  0.797 
V Conclusions
This paper has extended the growing literature of machinelearning assisted jet substructurebased tagging in two ways. First, we have developed a procedure to automatically identify the optimal product observable, using the subjettiness features as an example. This is an important innovation because observables with relatively simple analytic forms are robust complements to complex neural network classifiers and prior to this work, there was no efficient way to identify the best coefficients in the product. Second, we have used this automated framework to identify the optimal product observables for searching for boosted resonances like the boson, but with beyond the standard model masses. Jet substructure has proven to be a powerful toolset for such searches, but until now, there has been few studies of the mass dependence of the optimal observables.
Future extensions of the methods introduced in this paper may be able to simplify the regression procedure, as well as study the connections between different classifiers with similar performance (including the ones connected by monotonic functions). The power of the method may also be extended by considering other parametric forms besides products. Classification problems demanding a higher body phase space are a natural extension of the examples presented here.
As machine learning techniques are used more widely to guide the optimal selection of classifiers, there will be a growing need to simplify and interpret the guidance from the machines. We have prepared an automated approach to construct optimal observables with simple, analytic forms, which can be used for further theoretical and experimental studies. This technique will form the basis of multiple extensions in the future to improve classification performance and increase the robustness of searches and measurement at the LHC and beyond.
Acknolwedgements
This work was supported by the U.S. Department of Energy, Office of Science under contract DEAC0205CH11231. We would like to thank Gregor Kasieczka, Patrick Komiske, Eric Metodiev, and Jesse Thaler for detailed comments on the analysis and manuscript.
References
 Larkoski et al. (2017) A. J. Larkoski, I. Moult, and B. Nachman (2017), eprint 1709.04464.
 Asquith et al. (2018) L. Asquith et al. (2018), eprint 1803.06991.
 Pearkes et al. (2017) J. Pearkes, W. Fedorko, A. Lister, and C. Gay (2017), eprint 1704.02124.
 Komiske et al. (2018a) P. T. Komiske, E. M. Metodiev, and J. Thaler (2018a), eprint 1810.05165.
 Komiske et al. (2018b) P. T. Komiske, E. M. Metodiev, and J. Thaler, JHEP 04, 013 (2018b), eprint 1712.07124.
 Erdmann et al. (2018) M. Erdmann, E. Geiser, Y. Rath, and M. Rieger (2018), eprint 1812.09722.
 Datta and Larkoski (2017) K. Datta and A. Larkoski, JHEP 06, 073 (2017), eprint 1704.08249.
 Datta and Larkoski (2018) K. Datta and A. J. Larkoski, JHEP 03, 086 (2018), eprint 1710.01305.
 Butter et al. (2018) A. Butter, G. Kasieczka, T. Plehn, and M. Russell, SciPost Phys. 5, 028 (2018), eprint 1707.08966.
 Cogan et al. (2015) J. Cogan, M. Kagan, E. Strauss, and A. Schwarztman, JHEP 02, 118 (2015), eprint 1407.5675.
 Almeida et al. (2015) L. G. Almeida, M. Backovic, M. Cliche, S. J. Lee, and M. Perelstein, JHEP 07, 086 (2015), eprint 1501.05968.
 de Oliveira et al. (2016) L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, JHEP 07, 069 (2016), eprint 1511.05190.
 Komiske et al. (2017) P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, JHEP 01, 110 (2017), eprint 1612.01551.
 Barnard et al. (2017) J. Barnard, E. N. Dawe, M. J. Dolan, and N. Rajcic, Phys. Rev. D95, 014018 (2017), eprint 1609.00607.
 Kasieczka et al. (2017) G. Kasieczka, T. Plehn, M. Russell, and T. Schell, JHEP 05, 006 (2017), eprint 1701.08784.
 Dreyer et al. (2018) F. A. Dreyer, G. P. Salam, and G. Soyez, JHEP 12, 064 (2018), eprint 1807.04758.
 Lin et al. (2018) J. Lin, M. Freytsis, I. Moult, and B. Nachman, JHEP 10, 101 (2018), eprint 1807.10768.
 Fraser and Schwartz (2018) K. Fraser and M. D. Schwartz, JHEP 10, 093 (2018), eprint 1803.08066.
 Chien and Kunnawalkam Elayavalli (2018) Y.T. Chien and R. Kunnawalkam Elayavalli (2018), eprint 1803.03589.
 Macaluso and Shih (2018) S. Macaluso and D. Shih, JHEP 10, 121 (2018), eprint 1803.00107.
 Guest et al. (2016) D. Guest, J. Collado, P. Baldi, S.C. Hsu, G. Urban, and D. Whiteson, Phys. Rev. D94, 112002 (2016), eprint 1607.08633.
 Egan et al. (2017) S. Egan, W. Fedorko, A. Lister, J. Pearkes, and C. Gay (2017), eprint 1711.09059.
 Andreassen et al. (2019) A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, Eur. Phys. J. C79, 102 (2019), eprint 1804.09720.
 Cheng (2018) T. Cheng, Comput. Softw. Big Sci. 2, 3 (2018), eprint 1711.02633.
 Louppe et al. (2019) G. Louppe, K. Cho, C. Becot, and K. Cranmer, JHEP 01, 057 (2019), eprint 1702.00748.
 Henrion et al. (2017) I. Henrion et al., DLPS at NIPS (2017), URL https://dl4physicalsciences.github.io/files/nips_dlps_2017_29.pdf.
 Aaboud et al. (2018) M. Aaboud et al. (ATLAS) (2018), eprint 1808.07858.
 CMS Collaboration (2017a) CMS Collaboration, CMSDP2017049 (2017a), URL https://cds.cern.ch/record/2295725.
 CMS Collaboration (2018a) CMS Collaboration, CMSDP2018046 (2018a), URL https://cds.cern.ch/record/2630438.
 ATLAS Collaboration (2017a) ATLAS Collaboration, ATLPHYSPUB2017017 (2017a), URL http://cds.cern.ch/record/2275641.
 CMS Collaboration (2017b) CMS Collaboration, CMSDP2017027 (2017b), URL https://cds.cern.ch/record/2275226.
 ATLAS Collaboration (2017b) ATLAS Collaboration, ATLPHYSPUB2017003 (2017b), URL https://cds.cern.ch/record/2255226.
 ATLAS Collaboration (2017c) ATLAS Collaboration, ATLPHYSPUB2017013 (2017c), URL https://cds.cern.ch/record/2273281.
 CMS Collaboration (2018b) CMS Collaboration, CMSDP2018058 (2018b), URL https://cds.cern.ch/record/2646773.
 Sirunyan et al. (2018a) A. M. Sirunyan et al. (CMS), JINST 13, P05011 (2018a), eprint 1712.07158.
 Collins et al. (2018) J. H. Collins, K. Howe, and B. Nachman, Phys. Rev. Lett. 121, 241803 (2018), eprint 1805.02664.
 Collins et al. (2019) J. H. Collins, K. Howe, and B. Nachman, Phys. Rev. D99, 014038 (2019), eprint 1902.02634.
 Heimel et al. (2018) T. Heimel, G. Kasieczka, T. Plehn, and J. M. Thompson (2018), eprint 1808.08979.
 Farina et al. (2018) M. Farina, Y. Nakai, and D. Shih (2018), eprint 1808.08992.
 Metodiev and Thaler (2018) E. M. Metodiev and J. Thaler, Phys. Rev. Lett. 120, 241602 (2018), eprint 1802.00008.
 Komiske et al. (2018c) P. T. Komiske, E. M. Metodiev, B. Nachman, and M. D. Schwartz, Phys. Rev. D98, 011502 (2018c), eprint 1801.10158.
 Metodiev et al. (2017) E. M. Metodiev, B. Nachman, and J. Thaler, JHEP 10, 174 (2017), eprint 1708.02949.
 Dery et al. (2017) L. M. Dery, B. Nachman, F. Rubbo, and A. Schwartzman, JHEP 05, 145 (2017), eprint 1702.00414.
 Cohen et al. (2018) T. Cohen, M. Freytsis, and B. Ostdiek, JHEP 02, 034 (2018), eprint 1706.09451.
 Louppe et al. (2016) G. Louppe, M. Kagan, and K. Cranmer (2016), eprint 1611.01046.
 Shimmin et al. (2017) C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul, and A. Sogaard, Phys. Rev. D96, 074034 (2017), eprint 1703.03507.
 ATL (2018) Tech. Rep. ATLPHYSPUB2018014, CERN, Geneva (2018), URL http://cds.cern.ch/record/2630973.
 Sirunyan et al. (2017) A. M. Sirunyan et al. (CMS), Phys. Rev. Lett. 119, 111802 (2017), eprint 1705.10532.
 Sirunyan et al. (2018b) A. M. Sirunyan et al. (CMS), JHEP 01, 097 (2018b), eprint 1710.00159.
 Aaboud et al. (2019) M. Aaboud et al. (ATLAS), Phys. Lett. B788, 316 (2019), eprint 1801.08769.
 Stewart et al. (2010) I. W. Stewart, F. J. Tackmann, and W. J. Waalewijn, Phys. Rev. Lett. 105, 092002 (2010), eprint 1004.2489.
 Thaler and Van Tilburg (2011) J. Thaler and K. Van Tilburg, JHEP 03, 015 (2011), eprint 1011.2268.
 Thaler and Van Tilburg (2012) J. Thaler and K. Van Tilburg, JHEP 02, 093 (2012), eprint 1108.2701.
 Sjostrand et al. (2006) T. Sjostrand, S. Mrenna, and P. Z. Skands, JHEP 05, 026 (2006), eprint hepph/0603175.
 Sjostrand et al. (2015) T. Sjostrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, Comput. Phys. Commun. 191, 159 (2015), eprint 1410.3012.
 Alwall et al. (2014) J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, JHEP 07, 079 (2014), eprint 1405.0301.
 Cacciari et al. (2008) M. Cacciari, G. P. Salam, and G. Soyez, JHEP 04, 063 (2008), eprint 0802.1189.
 Cacciari et al. (2012) M. Cacciari, G. P. Salam, and G. Soyez, Eur. Phys. J. C72, 1896 (2012), eprint 1111.6097.
 Dokshitzer et al. (1997) Y. L. Dokshitzer, G. D. Leder, S. Moretti, and B. R. Webber, JHEP 08, 001 (1997), eprint hepph/9707323.
 Wobisch and Wengler (1998) M. Wobisch and T. Wengler, in Monte Carlo generators for HERA physics. Proceedings, Workshop, Hamburg, Germany, 19981999 (1998), pp. 270–279, eprint hepph/9907280.
 Larkoski et al. (2014a) A. J. Larkoski, S. Marzani, G. Soyez, and J. Thaler, JHEP 05, 146 (2014a), eprint 1402.2657.
 Dasgupta et al. (2013) M. Dasgupta, A. Fregoso, S. Marzani, and G. P. Salam, JHEP 09, 029 (2013), eprint 1307.0007.
 Catani et al. (1993) S. Catani, Y. L. Dokshitzer, M. H. Seymour, and B. R. Webber, Nucl. Phys. B406, 187 (1993).
 Ellis and Soper (1993) S. D. Ellis and D. E. Soper, Phys. Rev. D48, 3160 (1993), eprint hepph/9305266.
 Blazey et al. (2000) G. C. Blazey et al., in QCD and weak boson physics in Run II. Proceedings, Batavia, USA, March 46, June 34, November 46, 1999 (2000), pp. 47–77, eprint hepex/0005012, URL http://lss.fnal.gov/cgibin/find_paper.pl?conf00092.
 Larkoski et al. (2014b) A. J. Larkoski, I. Moult, and D. Neill, JHEP 12, 009 (2014b), eprint 1409.6298.
 Moult et al. (2016) I. Moult, L. Necib, and J. Thaler, JHEP 12, 153 (2016), eprint 1609.07483.
 Bertolini et al. (2014) D. Bertolini, T. Chan, and J. Thaler, JHEP 04, 013 (2014), eprint 1310.7584.
 Larkoski et al. (2014c) A. J. Larkoski, D. Neill, and J. Thaler, JHEP 04, 017 (2014c), eprint 1401.2158.
 Larkoski and Thaler (2014) A. J. Larkoski and J. Thaler, Phys. Rev. D90, 034010 (2014), eprint 1406.7011.
 Goodfellow et al. (2014) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014), pp. 2672–2680, URL http://papers.nips.cc/paper/5423generativeadversarialnets.pdf.
 Kingma and Welling (2013) D. P. Kingma and M. Welling, CoRR abs/1312.6114 (2013).
 Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra, in Proceedings of the 31st International Conference on International Conference on Machine Learning  Volume 32 (JMLR.org, 2014), ICML’14, pp. II–1278–II–1286, URL http://dl.acm.org/citation.cfm?id=3044805.3045035.
 Kingma and Ba (2014) D. P. Kingma and J. Ba, CoRR abs/1412.6980 (2014), URL http://arxiv.org/abs/1412.6980.
 Chollet (2015) F. Chollet, Keras, https://github.com/fchollet/keras (2015).
 Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Largescale machine learning on heterogeneous systems (2015), software available from tensorflow.org, URL http://tensorflow.org/.
 Jones et al. (2001–) E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python (2001–), URL http://www.scipy.org/.
 Wales and Doye (1997) D. J. Wales and J. P. K. Doye, The Journal of Physical Chemistry A 101, 5111 (1997), eprint https://doi.org/10.1021/jp970984n, URL https://doi.org/10.1021/jp970984n.
 Powell (1994) M. J. D. Powell, A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation (Springer Netherlands, Dordrecht, 1994), pp. 51–67, ISBN 9789401583305, URL https://doi.org/10.1007/9789401583305_4.
 Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, J. Mach. Learn. Res. 15, 1929 (2014), ISSN 15324435, URL http://dl.acm.org/citation.cfm?id=2627435.2670313.
 Nair and Hinton (2010) V. Nair and G. E. Hinton, in ICML (Omnipress, 2010), pp. 807–814, URL http://dblp.unitrier.de/db/conf/icml/icml2010.html#NairH10.
Appendix A Crosscheck for performance of the Regression networks
Here we briefly demonstrate that the regression DNNs do actually learn to approximate the mapping from the input parameters of the product observables to their densities, i.e., a mapping from . We specifically choose the ungroomed 90 GeV case, and choosing values of for the optimal observable, as listed in Table 3.
We then select one of the parameters and vary it between and 7 with a step size of 0.1 while keeping the other parameters fixed. This allows us to study how the networks can be used to interpolate AUC’s over a range of values around the optimum we locate and, in addition, by going beyond the training range of we also demonstrate that the networks can be used to extrapolate the aforementioned mapping to then still calculate the AUC with a good level of accuracy. The results for this study are shown in Fig. 6 and indicate that the regression networks allow to accurately track the trajectories of the AUC in these onedimensional slices of the parameter space.
Appendix B Groomed vs. discrimination
Utilizing the result that discrimination power for mMDT groomed vs. discrimination saturates at 3body phase space Datta and Larkoski (2018), we use the procedure proposed in the Sec. III to find the optimal product observable. The final values for the parameters obtained through the optimization are presented in Table 5, along with those obtained in the previous study. Interestingly, the exponents for are nearly the same for , , and , but are quite different for and . The factors and are also similar for up to a multiplicative factor.
Observable  AUC  


2.0  0.0  0.0  2.0  2.0  0.745 

0.67  1.65  0.01  1.90  2.07  0.744 

1.54  1.01  0.17  0.15  0.16  0.758 

In Fig. (a)a, we plot the distributions of the new observable computed for signal and background, along with the prediction from the neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets. We then compare the ROC curves for the new observable to Larkoski et al. (2014b), Moult et al. (2016) observables, and in Fig. (c)c.
In addition, we compare the new observable to in Fig. (d)d to demonstrate that both observables have essentially the same discrimination power as expected. Then, this allows us to proceed to applying the procedure on higher dimensional problems. Further, we plot the ROC curve for the 4body product observable from the linear regression method, noting that it provides the best performance of the observables that have been explored for this problem. ^{6}^{6}6Explicitly, the optimal parameter values for are as follows: , and it leads to an AUC of 0.778 in Fig. (c)c.
Appendix C Groomed vs.
In this section we carry out the same set of studies for mMDT groomed discrimination as for the ungroomed cases from Sec. IV.2. As in the ungroomed case, Fig. 8 indicates that the saturation of discrimination power occurs at 4body phase space.
[GeV]  

50  2.6  0.41  2.94  2.79  0.20  0.93  0.66  2.43 
90  2.3  1.35  2.05  1.64  0.81  0.89  2.03  0.44 
130  0.80  1.74  0.28  1.01  0.38  0.56  0.82  0.69 
[GeV]  

50  0.35  0.35  0.56  1.05  0.17  0.24  0.34  0.51 
90  0.26  0.41  0.39  0.68  0.15  0.11  0.25  0.42 
130  1.28  0.54  0.35  1.09  0.09  0.38  1.06  0.48 
[GeV]  

50  0.830  0.826  0.796  0.803  0.780 
90  0.822  0.821  0.780  0.796  0.763 
130  0.814  0.811  0.769  0.791  0.751 
The results for the final observables for the three points are presented in Table 7 and the observable distributions are plotted in Fig. 10. The performance of the new observables are compared to standard ones and body DNN’s in Fig. 10 and the corresponding AUCs are shown in Table 8 for different mass points. The conclusions from this section are qualitatively the same as from Sec. IV.2, with a slightly lower AUC from both the product observable and the physicsmotivated observables. Importantly, the product observables for the groomed case appear to saturate the bounds from the body phase space better than in the ungroomed case.
Appendix D Mass dependence of
Here, we briefly study the performance of the new observables presented in Sec. IV.2. They are tested on a different combination of signal and background samples from the ones they were optimized on; for example, we calculate the new observable for GeV on signal samples for GeV, and background, that pass the mass window on which the 90 GeV observable was optimized. The results for this study are presented in Fig. 11, and indicate that while these observables are optimized on samples from a specific mass point, they can be applied to other classification tasks and still provide better discrimination performance than standard observables. This also suggests that the different parameter sets in Table 3 may represent observables with very similar physical information even though the subjettiness variables are not invariant under transverse boosts.
Appendix E Saturating the discrimination power of
In this section we briefly study the flexibility of the product form ansatz using the observables obtained via the linear regression procedure. For concreteness, we look at the GeV case, and plot ROC curves for the observables upto in Fig. 12.
We observe that discrimination power for the gradually increases upto the inclusion of 7 or 8body phase space variables. Compared to the ROC curve at the point of saturation, from the 4body DNN classifier, these results suggest that while a DNN can adjust thresholds on the body inputs such that there is effectively only redundant discriminating information in higher body bases, as is also expected from the physics study in Ref. Datta and Larkoski (2017), the product observables do still benefit from including subjettiness variables from beyond the point of saturation.
Depending on the classification task, the product observables may even come very close to matching the performance of a saturated ML classifier (Fig. 10). However, ultimately it cannot not capture all available information, due to lack of further flexibility of the product form ansatz. These observations will of course vary based on the objects being studied. We leave further physics studies of the product form or other equivalent ansatz to future work.