Portraying Double Higgs at the Large Hadron Collider
Abstract
We examine the discovery potential for double Higgs production at the high luminosity LHC in the final state with two tagged jets, two leptons and missing transverse momentum. Although this dilepton final state has been considered a difficult channel due to the large backgrounds, we argue that it is possible to obtain sizable signal significance, by adopting a deep learning framework making full use of the relevant kinematics along with the jet images from the Higgs decay. For the relevant number of signal events we obtain a substantial increase in signal sensitivity over existing analyses. We discuss relative improvements at each stage and the correlations among the different input variables for the neutral network. The proposed method can be easily generalized to the semileptonic channel of double Higgs production, as well as to other processes with similar final states.
1 Introduction
The discovery of the Higgs boson Aad:2012tfa (); Chatrchyan:2012xdj () jumpstarted the comprehensive program of precision measurements of all Higgs couplings. While the Higgs boson couplings to fermions and gauge bosons are in good agreement with the Standard Model (SM) predictions Khachatryan:2016vau (), the Higgs selfcouplings are difficult to measure experimentally ATLPHYSPUB2017001 (); ATLPHYSPUB2016024 (); Kim:2018uty (); Sirunyan:2018two (); CMS:2015nat (); CMS:2017cwx (); Baglio:2012np (); Sirunyan:2017guj (); Cepeda:2019klc (). Yet, the knowledge of those couplings is crucial for understanding the exact mechanism of electroweak symmetry breaking and the origin of mass in our universe. It is also a guaranteed physics target which can be probed at the upgraded Large Hadron Collider (LHC) or at future colliders. The resulting experimental constraints on the Higgs selfcouplings will have an immediate and longlasting impact on modelbuilding efforts beyond the SM.
We parameterize the Higgs selfinteraction as follows:
(1) 
where is the mass of the SM Higgs boson (), GeV is the Higgs vacuum expectation value,
are the SM values for the Higgs selfcouplings, while and parametrize the corresponding deviations from them. In order to access (), one has to measure the process of double (triple) Higgs boson production at the LHC, possibly with high luminosity (HL), or at future colliders.
Double Higgs () production has been studied in many channels, including ATLAS:2018combi (); Aaboud:2018knk (); CMS:2018smw (); deLima:2014dta (); Wardrope:2014kya (); Behr:2015oqq (), Sirunyan:2018iwt (); Aaboud:2018ftw (); CMSPASFTR15002 (); ATLPHYSPUB2014019 (); Kim:2018uty (); Kling:2016lay (); Baur:2003gp (); Baglio:2012np (); Huang:2015tdv (); Azatov:2015oxa (); Cao:2015oaa (); Cao:2016zob (); Alves:2017ued (); Barger:2013jfa (); Chang:2018uwu (), CMSPASFTR15002 (); Aaboud:2018sfw (); Sirunyan:2017djm (); Kim:2018uty (); Baur:2003gpa (); Goertz:2014qta (); Dolan:2012rv (), Aaboud:2018zhh (); CMS:2017ums (); CMS:2015nat (); CMS:2017cwx (); Kim:2018cxf (); Papaefstathiou:2012qe (); Huang:2017jws (), Aaboud:2018ksn (), etc. Among the different possible final states, here we focus on production at the HLLHC in the final state with two tagged jets, two leptons and missing transverse momentum. The signal process is and it suffers from large SM backgrounds, primarily due to top quark pair production (). The few existing studies in this channel therefore employ sophisticated algorithms (neutral network (NN) CMS:2015nat (), deep neutral network (DNN) Sirunyan:2017guj (), boosted decision tree (BDT) Adhikary:2017jtu (); CMS:2017cwx (), etc.) to increase the signal sensitivity, but show somewhat pessimistic results, with a significance no better than at the HLLHC with 3 ab luminosity.
The recent study in Ref. Kim:2018cxf () introduced some new ideas for reducing the SM backgrounds in this channel. For example, the new variables Topness and Higgsness were designed to test whether the event kinematics is consistent with or , respectively. The use of Topness and Higgsness already effectively reduced the background to a manageable level, and additional variables were then employed to handle the remaining SM background processes — e.g., the subsystem variable is effective in eliminating background arising from decays. In this paper, we supplement the novel kinematic method from Ref. Kim:2018cxf () with the analysis of the jet image in the decay, where the basic idea is to treat the detector as a camera and the streams of jets as an image Gallicchio:2010sw (); Gallicchio:2010dq (); Hook:2011cq (); Cogan:2014oua (); deOliveira:2015xxd (); Lin:2018cin (). In our case, the collimated nature of the Higgs decay will hopefully differ from the patterns obtained in SM production processes. In addition, we adopt a deep learning framework in our main analysis, since it is known that modern deep learning algorithms trained on jet images provide improved signaltobackground discrimination Gallicchio:2010sw (); Gallicchio:2010dq (); Hook:2011cq (); Baldi:2014kfa (); Cogan:2014oua (); deOliveira:2015xxd (); Komiske:2016rsd (); Kasieczka:2017nvn (); Lin:2018cin ().
The analysis presented in this paper contains a number of improvements in comparison to previous studies:

Unlike the customized detector simulation performed in Ref. Kim:2018cxf (), here we employ Delphes deFavereau:2013fsa () to simulate detector effects such as detector resolution, reconstruction efficiency, etc., and Fastjet Cacciari:2011ma () for jetreconstruction.

We use deep learning framework to optimize the cuts, which further increases the significance compared to the conventional cutandcount as performed in Ref. Kim:2018cxf ().

We exploit an enlarged set of relevant variables which consists of the 10 variables originally considered in Ref. Adhikary:2017jtu (): , , , , , , , , , and , supplemented with the six recent variables from Ref. Kim:2018cxf (): Topness, Higgsness, , , and .

We include a SM background process, production, which was missing from all previous discussions of this channel, yet it turns out to be the next dominant background once the background is under control.

The fact that the Higgs boson is a colorsinglet allows us to use the jet image of the decay for further background suppression Cogan:2014oua (); Gallicchio:2010sw (); Gallicchio:2010dq (); Hook:2011cq (); Lin:2018cin ().

We examine the effect of pileup, which was missing from previous studies. In particular, we chose to apply the Soft Drop algorithm Larkoski:2014wba (), which is a powerful pileup mitigation technique.
Our results show that the dominant background can be significantly reduced until it is comparable to the other subdominant backgrounds, i.e., after all cuts, we find that all SM backgrounds contribute at similar levels. This reduction can be accomplished without sacrificing too much of the signal rate, which leads to an improved signal significance. Our study indicates that the dilepton channel from could contribute to the combined significance for discovery on par with the other final states, making double Higgs production sooner accessible at the HLLHC.
This paper is structured as follows. We begin our discussion of the SM backgrounds and present the details of our simulation in section 2. In the following two sections 3 and 4, we provide some basic information on the kinematic variables used later in the analysis and on jet images, respectively. Then in section 5 we discuss how we set up our analysis in a deep learning framework. Section 6 presents our results, while section 7 is reserved for the discussion and conclusions. We include a brief review on deep neural networks in Appendix A.
2 Event generation and detector simulation
Partonlevel signal and background events were generated using MadGraph5_aMC@NLO v2.6 Alwall:2014hca () with the default NNPDF2.3QED parton distribution functions Ball:2013hta () at leading order QCD accuracy at the TeV LHC. The default dynamical renormalization and factorization scales were used. We assume 3000 ab of luminosity throughout this paper. Partonlevel events were generated with the following cuts: GeV, GeV, GeV, GeV, < 5, < 5, < 2.5, < 2.5, 1.8, 1.3, 70 GeV 160 GeV and 75 GeV. For , and backgrounds, we impose 5 GeV GeV additionally. Here the angular distance is defined by
(2) 
where and are respectively the differences of the azimuthal angles and rapidities between particles and .
The double Higgs production crosssection is normalized to fb, the nexttonexttoleading order (NNLO) accuracy in QCD Grigo:2014jma (). Considering all relevant branching fractions, we obtain signal cross section fb, where denotes an electron or a muon, including leptons from tau decays. The major background is production, whose cross section is normalized to the NNLO QCD crosssection 953.6 pb Czakon:2013goa (). Another important background is , which is normalized to the nexttoleading order (NLO) QCD crosssection of 611.3 fb Dittmaier:2011ti (). For the () background, we apply an NLO kfactor of 1.54, resulting in a crosssection of 1.71 pb deFlorian:2016spz (). We apply an NLO kfactor of 1.0 for the DrellYan type backgrounds and , where denotes partons in the fiveflavor scheme. Note that a recent study indicates that deFlorian:2018wcj (). The irreducible background from the mixed QCD+EW process is included with = 2. Finally, we generate events with up to one additional matched jet (in the fiveflavor scheme), whose crosssection turns out to be 0.51 pb (after the cuts) including all relevant branching fractions.
Events are further processed for partonshower/hadronization using Pythia8235 Sjostrand:2014zea (). We use Delphes 3.4.1 deFavereau:2013fsa () for simulating the detector effects and Fastjet 3.3.1 Cacciari:2011ma () for jetreconstruction, with modified ATLAS settings as follows.

Jets are clustered with the anti algorithm Cacciari:2008gp () with conesize , where is the distance (2) in the (, ) space. Jets are also required to have GeV and .

For lepton isolation, we require , where the sum is taken over the transverse momenta of all final states particles , , with , GeV and within of the lepton candidate .

For photon isolation, we analogously require for particles within of the photon candidate .

The missing transverse momentum is defined as the negative vector sum of the transverse momenta of the reconstructed jets, leptons and photons.

We use the a flat tagging efficiency, , and flat mistagging rates for non jets of and ATLPHYSPUB2019005 ().
After particle reconstruction, we employ the following baseline selection cuts^{1}^{1}1For the motivation behind these cuts, see Fig. 1 (in which the cut values are indicated with vertical dotted lines) and the related discussion in Sec. 3 below. from Ref. Kim:2018cxf ():

the two leading jets must be tagged,

exactly two isolated leptons of opposite sign, each with GeV,

GeV for the reconstructed missing transverse momentum,

proximity cut of for the two leptons,

proximity cut of for the two tagged jets,

GeV for the two leptons,

GeV GeV for the two tagged jets.
For those events which passed the baseline cuts, we form 16 kinematic variables, as well as jet images. As we will see later, the jet images can capture additional features which are not already contained in the 16 standard kinematic variables. Therefore one can obtain better performance by combining kinematics and jet images, which is one of the main ideas of this paper.
3 Kinematics in signal and backgrounds
In this section we introduce the 16 kinematic variables used in this analysis. Their kinematic distributions (for signal and all relevant backgrounds) are shown in Fig. 1 and will be discussed shortly.
We begin with ten standard kinematic variables, which were previously considered in Refs. Adhikary:2017jtu (); CMS:2017cwx () (their distributions are shown in the first ten panels of Fig. 1):

, the invariant mass of the two tagged jets (1st plot in the 1st row). This is expected to be a good variable, since for signal events, the two jets originate from the decay of a narrow resonance (the Higgs boson) and would therefore reconstruct to the Higgs mass, up to resolution effects: . This justifies the baseline cut of GeV GeV, as indicated with the vertical dotted lines. In contrast, no such correlations exists for backgrounds events: the two jets either originate from different decay chains and are uncorrelated (as in the case of , for example), or they reconstruct to the mass of a boson or an offshell gluon, with a mass lower than . The plot in Fig. 1, while confirming those expectations, also shows that the total background happens to peak at a value of which, unfortunately, is not too far away from , providing the motivation to explore other variables.

, the invariant mass of the two leptons (2nd plot in the 1st row). For the case of the signal, the two leptons ultimately originate from the Higgs boson decay, and therefore their invariant mass is bounded from above, hence the baseline cut of GeV. Note that the distribution (which is observable) should be the same as the distribution of (which is unobservable).

, the angular separation (2) between the two tagged jets (3rd plot in the 1st row). Given the relatively low Higgs mass, the two Higgs particles in production have sizable transverse momentum and their respective decay products (e.g., the two quarks) tend to go in the same direction. This and the next four variables try to exploit this kinematic property of the signal. For example, the Higgs boost implies that is relatively small for signal events, and this motivates the proximity cut of .

, the azimuthal angle in the transverse plane between the two jet system and the two lepton system (1st plot in the 2nd row). This is yet another way to capture the backtoback boost of the two Higgs bosons in double Higgs production. Fig. 1 shows that the signal peaks at more sharply than the background, which could be exploited later in the neural network analysis. However, no baseline cut was applied in this case, since is expected to be largely correlated with and .

, the transverse momentum of the two jet system (2nd plot in the 2nd row). Like the previous three variables, this variable is motivated by the significant boost of the Higgs bosons in the signal, but no baseline cut was applied.

, the transverse momentum of the two lepton system (3rd plot in the 2nd row). This variable behaves similarly to , but to a lesser extent, since the two leptons come from separate s, while the two quarks are direct decay products of the Higgs boson.

, the magnitude of the missing transverse momentum (4th plot in the 2nd row). A cut is routinely applied in order to fight the QCD backgrounds (not shown in Fig. 1). Following Ref. CMS:2017cwx (), here we use a baseline cut of GeV.

, the transverse momentum of the hardest lepton (1st plot in the 3rd row).

, transverse momentum of the nexthardest lepton (2nd plot in the 3rd row). As shown in Fig. 1, the individual transverse momenta of the two leptons are similar for both signal and backgrounds. Therefore, the lepton ’s may be good for triggering purposes, but not for background rejection.
We note that for the signal, many of these 10 variables are strongly correlated to each other^{2}^{2}2The strong correlation arises due to the very nature of double Higgs production — the two Higgs particles are produced with a sizable transverse momentum, which restricts the kinematics of their decay products.. This implies that cutting on one variable significantly reduces the power of other variables. At the same time, while these 10 variables are among the most commonly used in high energy physics, it is not guaranteed that they fully capture all kinematic differences between signal and background. This is why we introduce six additional variables Kim:2018cxf (): Topness, Higgsness, , , and , shown in the last six panels of Fig. 1, which are meant to take full advantage of the kinematic differences between the signal and background event topologies.
The Topness variable measures the degree of consistency of a given event with the kinematics of dilepton production, where there are 6 unknowns (the threemomenta of the two neutrinos, and ) and four onshell constraints, , , and . Here is the mass of top or antitop quark, and is the mass of the boson. Then the neutrino momenta can be fixed by minimizing the following quantity
(3)  
subject to the missing transverse momentum constraint, . The parameters and are indicative of the corresponding experimental resolutions and intrinsic particle widths. In principle, they can be treated as free parameters and one can tune them using NN, BDT, etc. In our numerical study, we shall use GeV and GeV. Since there is a twofold ambiguity in the paring of a quark and a lepton, Topness is defined as the smaller of the two s Kim:2018cxf (),
(4) 
The Topness distributions for both signal and backgrounds before baseline cuts are shown in Fig. 1 (3rd plot in the 3rd row). We observe that, as expected, tends to have smaller values for the main background () than for signal.
In our signal of production, the two quarks arise from a Higgs decay (), and therefore their invariant mass can be used as a first cut to enhance the signal sensitivity. For the decay of the other Higgs boson, , Higgsness is defined as follows Kim:2018cxf ()
(5)  
It tests whether the neutrino kinematics can be compatible with having the Higgs boson and one of the bosons onshell, while at the same time being consistent with the invariant mass distributions expected for the offshell boson, , and the neutrino pair, . The invariant mass is bounded by and the peak of its distribution is at
(6) 
The left panel of Fig. 2 shows the unitnormalized invariant mass distribution of the proper leptonneutrino system (). The distribution has a bimodal shape — the narrow peak on the right near 80 GeV corresponds to the onshell boson resonance, while the broader hump to the left is due to the offshell , with a clear endpoint at GeV and a maximum near GeV in accordance with (6).
The definition of Higgsness (5) also includes a term which tests for consistency with the expected invariant mass distribution for the neutrino pair^{3}^{3}3In the limit of massless leptons, the distribution is the same as the dilepton mass distribution , which is directly observable and therefore more commonly discussed in the literature Han:2009ss (); Han:2012nr (); Han:2012nm (); Cho:2012er ()., which is shown in the right panel of Fig. 2. The red solid curve gives the pure phase space prediction
(7) 
where is the twobody phase space function and is the invariant mass distribution of the antler topology with :
(8) 
where the endpoint and the parameter are defined in terms of the particle masses as
(9)  
(10) 
Note that by allowing one of the bosons to be onshell, eqs. (710) generalize the results previously derived in Refs. Han:2009ss (); Han:2012nr (); Han:2012nm (); Cho:2012er () for the purely onshell case. The blue histogram in the right panel of Fig. 2 shows the actual distribution, whose shape is slightly different from the pure phase space result (7), due a helicity suppression in the  vertex. In particular, we observe that the actual peak is at GeV, which is the value that we shall use in the definition of Higgsness (5).^{4}^{4}4We note that other variants of Higgsness are also possible — for example, instead of penalizing the function by the distances to the peaks in the corresponding distributions, one can introduce penalty terms which take advantage of the knowledge of the exact probability distributions (the blue histograms in Fig. 2).
The definition of Higgsness (5) contains some additional resolution parameters: for the reconstructed mass of the Higgs boson, for the reconstructed mass of the offshell boson, and for the resolution. In what follows, we shall take GeV, GeV, and GeV.^{5}^{5}5We have checked that our results are not very sensitive to these choices.
The Higgsness distributions for both signal and backgrounds before baseline cuts are shown in Fig. 1 (4th plot in the 3rd line). The two dimensional map of (Higgsness, Topness) on a loglog scale is depicted in Fig. 3. The Higgsness and Topness distributions in Fig. 1 are projections of this two dimensional scatter plot onto the axis and axis, respectively. Although the signal and the backgrounds do not exhibit a very clean separation in the individual onedimensional projections in Fig. 1, their two dimensional correlation plots show some visible differences. We note that even after employing the baseline cuts, one can still see a difference in the two dimensional correlation of Higgsness and Topness (bottom row plots).
Along with Higgsness and Topness, we also consider two versions of the variable Konar:2008ei (); Konar:2010ma (), which is defined as
(11) 
where represents a set of visible particles under consideration, while and are their invariant mass and transverse momentum, respectively. The variable (11) characterizes the system comprising of the visible particles and the invisible particles (here assumed to be massless) which are responsible for the measured missing transverse momentum . It provides the minimum value of the Mandelstam invariant mass for the system which is consistent with the observed visible 4momentum vector. We shall apply (11) to the whole event, where , or to the subsystem resulting from the decay , where . The distributions of the resulting variables and are shown in the left two panels on the fourth row of Fig. 1. The variable represents the minimum energy required to produce the two original parent particles (the two Higgs bosons in the case of the signal and the two top quarks in the case of the major background). This is why one would expect the distribution to peak around the parent mass threshold, for the signal and for the background Konar:2008ei (). However, the first panel in the fourth row of Fig. 1 shows that while the background distribution peaks near , which is expected, the signal distribution peaks around 400 GeV, which is substantially higher than . This implies that the two top quarks are produced more or less at rest, while the two Higgs bosons have a sizable boost. Similarly, the variable is the minimum energy required to produce the two bosons. For the background, where both bosons are onshell, the peak is expected to occur around . On the other hand, the signal distribution should be softer, since one of the bosons is offshell, and furthermore, the peak should be located slightly below the Higgs boson mass. These kinematic differences are illustrated in the second plot on the fourth row of Fig. 1, and motivate the use of as an analysis variable.
The last two panels in the fourth row of Fig. 1 show distributions of the subsystem variable Burns:2008va () — first when it is applied to the visible system resulting from the decays (), and then when it is applied to the visible system resulting from the decays (). In principle, is defined as Lester:1999tx ()
(12) 
where the minimization over the transverse masses of the parent particles () is performed over the transverse neutrino momenta and , subject to the constraint^{6}^{6}6See Refs. Barr:2011xt (); Kim:2017awi (); Cho:2014naa (); Konar:2009wn (); Konar:2009qr (); Baringer:2011nh (); Kim:2015uea (); Goncalves:2018agy (); Debnath:2017ktz () for more information and other variants of .. The parameter in (12) is the test mass for the daughter particle: in the case of one should use , while in the case of , the daughter particles are the bosons, and GeV, which leads to the lower bound visible in the plot. By construction, the variables are bounded by the mass of the corresponding parent particle. Indeed, the distribution for production shows a sharp drop around , while the signal distribution extends well above . Similarly, the distribution for drops around , as expected. In addition, it exhibits a peak structure in the first bin, which is due to leptonic tau decays. This suggests that can be effective in eliminating backgrounds with s.
This concludes our discussion of the 16 kinematic variables depicted in Fig. 1. The newly introduced 6 variables (Topness, Higgsness, , , and ) typically require a few extra steps to compute them, thus we shall refer to them as highlevel kinematic variables, while the remaining 10 traditional variables will be called lowlevel kinematic variables. We will perform two independent analyses — one with and one without the highlevel kinematic variables, in order to estimate the performance benefit from adding the additional 6 variables.
4 Color flow in signal and backgrounds
We note that the two quarks in the signal result from the decay of a single noncolored object, the Higgs boson. In contrast, the two quarks in production (which is the dominant background) arise from the decays of top quarks, which in turn are produced via the strong interactions from a gluongluon initial state. This distinction is pictorially illustrated in Fig. 4. The different colorflow Maltoni:2002mq () will lead to different hadronization patterns, which can be used to discriminate a color singlet particle from a color octet (or triplet) at hadron colliders such as the LHC. Since the quarks which originate from a color singlet particle are colorconnected to each other, their hadronization will not involve the initial state partons. On the contrary, the quarks which originate from a color octet particle are colorconnected to the annihilating partons in the initial state, and consequently their hadronization is correlated with these initial state partons, see Fig. 4.
The difference in color flow will be reflected in the resulting hadron distributions. Hadrons coming from a colorsinglet object will tend to be closer to the direction of the original mother particle, and as a result, the soft radiation will tend to populate the region between the two quarks. On the other hand, hadrons from the decay of a coloroctet particle will not be so narrowly focused, due to the influence of the initial state partons. These features are illustrated in Fig. 5, where we show the cumulative distributions in the plane after showering the same partonic event 10,000 times. In the left panel we used a signal event, while in the right panel we used an event from production. We see that the bjet clusters in the right panel tend to be better defined and more isolated, since they are not colorcorrelated among themselves. On the other hand, in the left panel we observe quite a bit of soft radiation in the region between the two jets, due to the existing color connection between them.
Of course, the results in Fig. 5 are only valid in the statistical sense, since we took the same partonlevel event and hadronized it multiple times. In reality, only one instance of this hadronization will be realized, as illustrated in Fig. 6. The top row of plots shows the hadronization patterns for charged particles (left panel) and neutral particles (right panel) in the case of one signal event, while the bottom row shows the same, but for one event. The partonlevel event information is quoted (in GeV) to the right of each row of panels, and then each event is translated in the () plane until the origin is aligned with the direction of the quark pair. The color scheme indicates the total in each pixel, while the dotted circles represent the cones for reconstruction of the corresponding jets.
As an alternative to Fig. 5, in Figs. 7 and 8, we illustrate the effects of colorconnection by showing the average of the jet images for the signal and the different background processes before and after the baseline cuts, respectively (some basic generationlevel cuts were imposed on the events in Fig. 7). The origin of the () plane plane is taken to be the center of the quark pair and the color scheme indicates the total in each pixel. The black dotted line delineates the region and used in the analysis. One can observe a striking difference in density between signal and background events in Fig. 7 — the two quarks tend to be more collimated in the signal and more spread out in the background.
Unfortunately, after imposing the baseline cuts introduced in Section 2, this distinction tends to be washed out and the backgrounds start mimicking the signal: one can see a similar structure emerging in all panels in Fig. 8, albeit with some subtle differences. Although one may find it difficult to discriminate signal from backgrounds simply by looking at a particular event, the patterns in the average jet images are different, and have been used actively for signal versus background separation Gallicchio:2010sw (); Hook:2011cq (); Aaboud:2018ibj (). In this paper, instead of quantifying the difference (e.g., with a pull vector Gallicchio:2010sw ()) we will use the images themselves on deep neural networks (DNNs), along with the 16 kinematic variables introduced in the previous section.
5 Analysis using deep learning
DNN is known to be very efficient and powerful in image recognition NIPS2012_4824 (); 0483bd9444a348c8b59d54a190839ec9 () and the particle physics community has used it for various applications^{7}^{7}7The use of neural networks for data analysis in high energy physics can be traced back to the pioneering work by R. Field and his students in the midnineties Field:1996rw (); Field:1996nq (); RFtalk ().. For instance, one can map the information about the direction and the energy (or transverse momentum) of a particle onto a pixel in an image. DNN then provides excellent classification between signal and background in the jet image Cogan:2014oua (); Komiske:2016rsd (); Kasieczka:2017nvn (); Lin:2018cin (). It also shows performance gains in multrivariate analyses over traditional cutandcount analyses or BDTs Baldi:2014kfa (); Baldi:2016fql (). In this section, we describe how we organize our analysis in a DNN framework. In the following three subsections, we address the issues of data preprocessing, DNN architecture and training of the NN.
5.1 Data preprocessing
In order to achieve the improved DNN learning performance and to minimize the error, it is important to properly process signal and backgrounds events before feeding them into a DNN framework. For each event passing the baseline cuts, the jet images are processed as follows.

Input data: we use the particle flow for our input data CMSPASPFT09001 ().

Particle classification: we divide the particle flow into two groups: neutral particles and charged particles. Neutral particles include photons and neutral hadrons, while charged particles include charged hadrons.

Lepton removal: if there is a lepton, we remove it.

Shift: we shift all particle coordinates in the () plane with respect to the center of the reconstructed quark pair, i.e., we set as the new origin, (0,0).

Pixelization: we discretize the rectangular region in the () plane defined by and into a grid of pixels for each particle classification (charged particle set and neutral particle set). In each pixel, we record the total transverse momentum as the pixel’s intensity (in case of more than one particle, we add the transverse momenta and record the total sum). We refer to this discrete image as the jet image Cogan:2014oua ().

Normalization: we rescale each jet image intensity as , where , and represents the intensity value in the pixel. is defined to be the largest value of pixel intensity found in the two pixel images.

Cropping: we crop the jet image to pixels, by further restricting to the () rectangular range of and .
The final jet image has dimension and is comprised of one charged particle channel with dimension and a neutral particle channel with dimension . This preprocessed jetimage is the input to the DNN. We note that Figs. 7 and 8 showed the combined jetimage obtained by adding the neutral and charged particle layers. The black dotted rectangular area in those figures showed the restricted pixel area.
5.2 DNN architecture
Our DNN architecture consists of three subarchitectures, which will merge later, as illustrated in Fig. 9. Combined deep learning (DL) is not yet very common^{8}^{8}8Combined DL is similar to ensemble learning Dietterich:2000:EMM:648054.743935 ()., but recently there have been several studies in particle physics Lin:2018cin (), as well as in other areas 7821017 (); 7477589 (), which showed improved results over simple DL. In this subsection, we provide some details of our DNN layer architecture as follows:

Initialization. Since DNN has a lot of parameters, it is important to give nonbiased initial values for the (weight, bias) before running DNN with all input data. We use the He uniform initialization method as in Ref. He:2015dtg (), among several other algorithms for parameter initialization pmlrv9glorot10a (); DBLP:journals/corr/SaxeMG13 (); He:2015dtg ().

Jet images. They are represented by the top panel in Fig. 9.

Input data: we use preprocessed jet images as inputs.

Convolutional neural networks layers: we use three layers of convolutional neural networks (CNN). Each layer has a filter with no stride and no padding. We proceed with the batch normalization process after filtering DBLP:journals/corr/IoffeS15 (), using the ReLU function as our activation function pmlrv15glorot11a (). After activation, we introduce the max pooling layer which has a shape with strides and padding.

Dense layers: we feed the output of the CNN into two fully connected dense layers, using ReLU as the activation function.


The 6 high level variables. Those are illustrated by the middle panel in Fig. 9.

Input data: , , , , Higgsness and Topness.

Dense layers: we introduce four fully connected dense layers with the ReLU activation function. All four layers have the batch normalization process before activation.


The 10 low level variables: Those are illustrated by the bottom panel in Fig. 9.

Input data: , , , , , , , , , .

Dense layers: we follow the same procedure as in the case with the 6 high level variables above.


Combination

Merge: we apply three single () dense layers to the jet image, the 6 high level variables and the 10 low level variables. These layers are denoted as , , and , respectively, as shown in Fig. 9. To merge the three subarchitectures, we introduce the final dense layer of dimension without an activation function.

Final output: to distinguish signal from backgrounds, we apply a layer of dimension without an activation function.

5.3 DNN training
We now proceed with deep learning on the DNN architecture described in Sec. 5.2, using the preprocessed input data. We use Microsoft CNTK cntk () as the main DNN library on GPU with an Nvidia CUDA platform. We use the Adam optimizer DBLP:journals/corr/KingmaB14 () with cross entropy with SoftMax loss function and classification error function. The sizes of the training data set and the testing data set are about 40k and 17k, respectively. The size of the minibatch is 128 and that of the epoch is 30.
For each event, we prepare the jet images and the 16 variables. The dimension of the final output is , (, ). If the deep learning score is equal to 1, i.e., (), the corresponding event is taken to be a signal event. If (), the event is considered to be background.
6 Results
In this section we present our results. First we validate our framework by repeating the analysis performed in Ref. Kim:2018cxf () under similar assumptions.^{9}^{9}9Our current analysis has several notable improvements over the one carried out in Ref. Kim:2018cxf (). First, the detector simulation is different — in the current study, we use Delphes, which assumes (on average) % (%) reconstruction efficiency for leptons (jets), while Ref. Kim:2018cxf () assumed 100% reconstruction efficiency for both. In addition, the Delphes detector resolution itself is slightly different from one used in Ref. Kim:2018cxf (). In particular we find that the resolution of the missing transverse momentum is worse in Delphes and hence our current results are more conservative (if not more realistic). Finally, as mentioned earlier, we are now including production, which turned out to be the next dominant background, yet was missing from all previous studies. These effects should be kept in mind when comparing our results here to previous results in the literature. We obtained consistent results for the conventional cutandcount method with Delphes detector simulation. When we added deep learning, the signal significance improved slightly by 510%.
Now considering all relevant backgrounds and using all 16 variables and jet images, we show the deep learning score for the signal and the individual background processes in Fig. 10. The signal should peak near by construction, and indeed this is what is observed in the figure. Note that the and processes are well separated from the signal and both peak near . This is direct consequence of the improvements made in our analysis — introducing the proper kinematic variables and jet images, which were meant to target the dominant background ( production), as evidenced in Figs. 1, 3 and 8. Although the subdominant backgrounds are also reduced in this process, they remain rather flat in Fig. 10.
The deep learning score shown in in Fig. 10 can now be used as a signaltobackground discriminator. By placing a lower cut and counting the number of surviving signal and background events, one obtains the efficiency curve (also known as a receiver operating characteristic (ROC)) shown in the left panel of Fig. 11. The curve contains several independent runs of deep learning and shows the signal efficiency () versus the fraction of rejected background events, i.e., , where is the background efficiency. The efficiency corresponding to the results in Fig. 10 is shown with the red solid curve labeled “with jetimage DNN”. The other two solid lines show the efficiencies which would be obtained if we were to remove the jet images from the analysis: the purple solid curve (labelled “10var only DNN”) is obtained with the help of the 10 lowlevel kinematic variables, while the blue solid curve (labelled “16var only DNN”) shows the improvement when we add the 6 highlevel variables and use the full set of 16 variables from Section 3, but still without jet images. The black dotted curve (labeled “jet image only DNN”) shows the result when we use jet images alone, with no help from any of the 16 kinematic variables. Finally, the blue dashed line (labelled “10var with jetimage DNN”) shows the result from an analysis combining jet images with the 10 lowlevel kinematic variables only. The corresponding signal significances are shown as a function of the number of events in the right panel of Fig. 11. Note that the right panel contains an additional curve (the purple dashed line labeled “10var only BDT”) where we use the 10 lowlevel variables and adopt a BDT algorithm using the TMVA tool kit Hocker:2007ht (). The comparison of the latter line against the “10var only DNN” result (purple solid line) reveals the relative performance of DNN versus BDT.
In order to examine the effects of pileup, we use several methods as follows. In the first method, we use the Soft Drop algorithm Larkoski:2014wba () to remove soft jet activity which is exacerbated by pileup. We set and with anti clustered fatjets. Then we select the closest fatjet to the momentum in the  plane and replace the particle flow data with the charged and neutral jet constituents of the selected fatjet. Soft Drop does not affect the jet images and retains the same shapes as in Fig. 8. In second method, we remove the neutral jet image layer in the analysis. Unlike charged particles, which can be cleaned up from pileup relatively easily by checking the longitudinal vertex information Bertolini:2014bba (), neutral particles cannot be treated the same way and suffer from nonremovable pileup effects. The corresponding results with these two pileup mitigation methods are also shown in Fig. 11 with the red dotted line labelled “16var with jetimage DNN, SoftDrop” and the red, dashed line labelled “16var with jetimage DNN, no neutral layer”, respectively.
We also examine the performance of the DNN with four momentum information as input. The corresponding results are shown in Fig. 11, where the greendashed (greensolid) curve represents the significance with four momentum information only (four momentum information plus jet images). The inputs are 18 real numbers, i.e., the four momenta of the two leptons and the two tagged jets and the missing transverse momentum. For this exercise, we use a 4 128 dense layer instead of a 4 64 dense layer. We notice that the DNN performance with kinematic variables is better. This is because, in general, the use of four momenta requires a large training sample in order to be effective, while the kinematic variables already perform efficiently with a smaller data set. If the architecture is deep enough with a large amount of data, the DNN performance with four momentum information would be comparable (or better) to that with kinematic variables only. This exercise illustrates the importance of the appropriate use of kinematic variables.
In summary, Fig. 11 demonstrates that jet images (which capture the effects of color flow) can improve performance over the baseline selection cuts. At the same time, DL with jet image substructure alone does not show the best performance, and becomes fully effective (and still stable under pileup) only when it is combined with the full set of 16 variables, including the highlevel ones.
Signal  
Baseline cuts: ,  
, ,  
, ,  
,  
jetimage DL  0.00667  0.1817  0.0133  0.00793  0.0245  0.0129  0.0671  0.00854  0.65  0.021 
10 lowlevel variables DL  0.00668  0.0806  0.00897  0.00435  0.0163  0.00876  0.0462  0.00578  0.88  0.039 
16 variables DL  0.00667  0.0662  0.00948  0.00358  0.0170  0.00747  0.0387  0.00402  0.95  0.046 
10 variables + jetimage DL  0.00667  0.0693  0.00897  0.00435  0.0178  0.00722  0.0359  0.00352  0.95  0.045 
16 variables + jetimage DL  0.00668  0.0607  0.00769  0.00281  0.0173  0.00799  0.0317  0.00402  1.0  0.051 
Table 1 summarizes the signal and background cross sections in fb at different stages of the analysis for the case of signal events. The last two columns show the signal significance and the signaltobackground ratio . The significance is calculated using the loglikelihood ratio for a luminosity of 3 at the 14 TeV LHC.
In order to understand the correlation between jet images and the 16 kinematic variables, we performed two independent runs with “jet images only; no kinematic variables” and “16 kinematic variables; no jet images”. The corresponding results are shown in Fig. 12. Since the two DLs are trained separately, both the axis and the axis are normalized to unity. As expected, Fig. 12 reveals a degree of correlation between the jet images and the 16 kinematic variables, which is somewhat stronger for the signal and less so for the background.
In our main analysis, we performed simultaneous runs as shown in the deep learning architecture in Fig. 9. Before calculating our final deep learning score, we obtain three intermediate values, , , and , which represent the DL scores for the respective substructure corresponding to the jet images, the 6 high level variables and the 10 low level variables. The first 6 panels in Fig. 13 show the pairwise correlations between these three intermediate scores for the signal (top row) and the background (middle row). The bottom three panels in the figure show the onedimensional distributions of the intermediate scores for signal (blue histograms) and background (red histograms). We observe that the score from jet images () is relatively uncorrelated to the kinematic variables scores and , which motivates the simultaneous training on jet images and kinematic variables together.
Finally, in Fig. 14 we scan over different values of the triple Higgs coupling and show the discovery significance (left panel) and precision (middle panel) as a function of . Both the significance and the precision are calculated fixing DL cuts that would give a certain number of signal events () for the SM at (marked with the dotted vertical line). For the significance, we used the loglikelihoodratio
(13) 
where and are the expected number of signal and background events, respectively. We define as
(14) 
The shape of the significance roughly follows the cross section ratios between the case of to the case of . This is illustrated in the rightmost panel of Fig. 14, which shows the cross section scaled as , i.e., normalized with respect to the minimum cross section for each curve. The blue curve represents the double Higgs production cross section before cuts, and in this case we find the minimum of the cross section somewhere between and . After baseline cuts (the red solid line), the minimum shifts to around , and after DL cuts (the green solid line), the minimum shifts even further out to around . In the latter case, we observe that the signal cross sections for and are numerically very close, as indicated by the two vertical dotted lines in the right panel. This provides an explanation for the double dip structure seen in the middle panel of Fig. 14.
As demonstrated in the rightmost panel of Fig. 14, the analysis cuts modify the signal cross section so that the location of its minimum shifts to higher values of . This can be understood as follows. At leading order, the Higgs pair production cross section is given by
(15) 
before convoluting with the parton distribution functions Glover:1987nx (); Borowka:2016ypz (). Here represents a parityeven triangle and box diagram contribution, while is a parityodd box diagram contribution. Now can be rewritten as , where is the triangle diagram contribution and is the box diagram contribution. Therefore the cross section can be parameterized as a quadratic function of , where the coefficients are related to contributions from and diagrams.
The observation that the baseline cuts and the DL cut shift the minimum cross section to a larger value implies that the effects of the cuts are stronger on than . In other words, our cuts are more likely to affect the triangle diagram which contains the triple Higgs coupling. Unlike the box diagram, the triangle diagram includes an offshell Higgs in the channel. Since it is harder to produce a Higgs pair from an channel offshell Higgs, the Higgs pair generated from the triangle diagram is not as energetic as the one coming from the box diagram, and will therefore tend to have lower transverse momentum. As discussed in Section 4, several of the cuts on our kinematic variables, namely, , , , and , rely on the fact that the Higgs bosons are produced with a significant boost. Consequently, the effect of the cuts will be to suppress the term and enhance the box diagram contribution, which in turn shifts the location of the minimum to a larger value of .
Note that the results for the significance and the precision in Fig. 14 do not change dramatically when we require a different number of signal events at the SM point. This means that the dependence on the DL cut is relatively mild, since the kinematics remains similar when we vary , so that the dependence on the cross section is more important. This is illustrated in Fig. 15, which shows the cross section (in pb) after cutting on the DL score, , (left panel) and the ratio between the cross section after the DL cut and the cross section after baseline cuts (right panel).
7 Discussion
In this paper, we investigated double Higgs production in the final state. It is known to be one of the difficult channels due to the large backgrounds, . We performed a detailed analysis by adopting a deep learning framework and successfully combining new kinematic variables and jet image information. As a result, we obtained a sizable increase in signal sensitivity and an improved signaltobackground ratio compared to the existing analyses.
Our results showed that the dominant background can be brought down to the level of the other remaining backgrounds, without sacrificing too much in the signal rate. This is mostly due to the use of Higgsness, Topness and the subsystem variable . Other backgrounds like can be reduced further by the use of . Finally, additional improvements are possible with the use of jet images. After all cuts, we find that all backgrounds contribute at similar levels.
We find from recent CMS and ATLAS analyses with 36 fb of LHC data at 13 TeV that the 95% confidence level observed (expected) upper limit on the production cross section is 22.2 (12.8) times the standard model value Sirunyan:2018two () for CMS and 6.7 (10.4) times the predicted Standard Model crosssection ATLAS:2018otd () for ATLAS. The leading channel in CMS is followed by , while the leading channels in ATLAS are and , followed by . The main difference arises due to the superior tagging efficiency for the ATLAS detector Aaboud:2018knk (). In both studies, the channel was largely overlooked due to the expected poor significance. However, our study suggests that double Higgs production may be probed in the dilepton channel as well, and would contribute to the combined analysis on par with the other final states, increasing the overall significance. For example, in Ref. ATLAS:2018combi (), the ATLAS collaboration showed that the combined significance of , , and is 3.5 (3.0) without (with) systematic uncertainties at the 14 TeV LHC with 3000 fb. Their individual significance is 1.4, 2.5 and 2.1 (0.61, 2.1 and 2.0), respectively without (with) systematics. They did not combine with the channel but a naive estimate shows that when including our channel, the combined significance would be about 3.7.
We urge the experimental collaborations to consider the ideas presented in this paper and test them in the LHC data. We would also like to mention that the proposed method can be easily generalized to the semileptonic channel from production, as well as to other processes with similar final states.
Acknowledgments
This work is supported in part by United States Department of Energy (DESC0010296, DESC0017988 and DESC0017965) and Korea NRF2018R1C1B6006572. We thank KIAS for providing computing resources. We thank Georgios Anagnostou for valuable comments and Anja Butter and Tilman Plehn for general discussion on machine learning. KK is grateful to the Mainz Institute for Theoretical Physics, which is part of the DFG Cluster of Excellence PRIMA (Project ID 39083149), for its hospitality and its partial support during the completion of this work.
Appendix A Deep Neural Network
The artificial neural network (ANN) is one of the most popular approaches to pattern recognition in machine learning algorithms. The structure of an ANN is defined by a succession of nonlinear and linear transformations between nodes or artificial neurons, which are located on input, output or hidden layers. A hidden layer which uses an ordinary onedimensional layer is called a dense layer or a fully connected layer.
The linear operation consists of weights and bias:
(16) 
where is the value of the th neuron (input) in the prior layer, the value of the th neuron (output) in the subsequent layer, are the weights, and the bias. The index () takes the values () and () is the dimension of the output (input). The input initially can be given in more than one dimension. For example, if the input results from a convolution and has dimension , it may be rearranged as follows:
(17) 
where the corresponding dimension of the input would be .
The nonlinear transformation is often called activation function, which imitates the action potential of biological neurons. Similar to how each neuron adjusts how much signal it needs to deliver to the next neuron using an electric action potential, the activation function determines the output of a particular neuron for a set of given inputs from neurons on the previous layer, and the output is then used as input for the next artificial neuron. The commonly used activation functions are
(18)  
(19)  
(20) 
where represents the value of the th neuron.
If the neural network has sufficiently many hidden layers, the network is called deep neural network (DNN). DNN can learn from the input data to obtain the desirable output by adjusting the parameters in the hidden layers. We note that the proper normalization of the input data helps improve convergence during training. The goal of the training is to determine the parameters (weights and biases) by minimizing the loss, which represents the difference between the target output and the actual DNN output. There are various algorithms for optimization of the parameters DBLP:journals/corr/KingmaB14 (); Duchi:2011:ASM:1953048.2021068 (); rmsprop (). Some well known loss functions are
Mean Square Error  (21)  
Cross Entropy  (22)  
Cross Entropy with SoftMax  (23) 
where is the true answer (either 1 or 0 in our current study), is the DNN final output, and is the number of neurons in the output layer.
Instead of feeding the entire data into the DNN all at once, one splits the input data into several subsets with random selection and takes one subset, called minibatch, for a given iteration, which helps avoid the overfitting problem pmlrv40Ge15 (); DBLP:journals/corr/abs180407612 (). When the full training set is used, the cycle is called epoch, and one uses several epochs to obtain a welltrained DNN. When training DNN with a minibatch, the corresponding loss is defined by the sum of all losses over the minibatch or by their average.
Once the training is over, for testing one uses a different data set from the one used in the DNN training, in order to avoid the overfitting problem. In order to test the trained DNN model one can use either the loss function or the classification error function. If the number of test events is , the classification error function is defined by
(24) 
where ArgMax() gives the position where the value of is maximized. represents the th test event, the is Kronecker delta function.
Often one takes additional steps such as dropout for reducing overfitting in neural networks Srivastava:2014:DSW:2627435.2670313 () and batch normalization for improving the performance and stability of artificial neural networks DBLP:journals/corr/IoffeS15 (). Dropout makes a random drop of units (both hidden and visible) in a neural network and is considered an efficient way of performing model averaging. The batch normalization procedure normalizes the input layer by adjusting and scaling the activations:
(25)  
(26)  
(27)  