The SysCalc code: A tool to derive theoretical systematic uncertainties
Undisputedly, derivation of theoretical systematic uncertainties is an
inseparable ingredient of any robust analysis dealing with
experimental data. However, it is not uncommon, even for those analyses
that use state of the art methods and tools to suffer from insufficient
statistics when it comes to the simulated datasets used to estimate systematic
uncertainties. This practically limits the power, and sometimes the
robustness of the analysis.
In this paper, we present SysCalc, a code which is able to derive
weights for various important theoretical systematic uncertainties,
including those related to the choice of the Parton Distribution Function sets and the various
scale choices. SysCalc utilizes the central sample generated events to estimate the related systematic uncertainties,
thus, omitting the need for generating dedicated systematics datasets, and with only a minimal added cost in terms of computing resources.
In this paper we discuss the working principles of the
code accompanied by various validation plots. We also discuss the structure
of the code followed by a practical guide for how to use the tool.
keywords:SysCalc, reweighting, systematics, Monte Carlo programs
In the high-energy physics (HEP) community, it is a common practise to choose a scale for the normalization (), the factorization scale (), the emission scale () as well as a choice for the Parton Distribution Function (PDF), and then generate large simulated datasets serving as the central samples. This is the case for example for both the ATLAS atlas () and the CMS cms () Collaborations, which typically generate billions of Monte Carlo (MC) simulated events so as to make sure that a sufficient estimation of the physical processes under question with the usage of the scales and PDFs that presumably best describe data, can be attained. The above come with a non negligible cost in terms of resources (both human and computing) but also on time spent to not only design, test and validate, but also to complete the generation of such simulated samples which have to be big enough in order to achieve small statistical uncertainties at the same time.
Moreover, one could argue that more than one set of parameters (which is not always trivial to identify) can exist which should be modelled coherently as well. It is clear that the HEP community (both experimentalists and theoreticians) needs to have access to simulated events generated with different settings, scales, or both. Nowadays, it is common practice that analyses have to estimate systematic uncertainties as well, where some of the most important theoretical ones include the variation of the aforementioned , , or the usage of different PDF grid sets. But up to now, it was often very unpractical, or even impossible in many case, to generate datasets big enough in order to have comparable statistics to the central one(s). This results in high statistical uncertainties on the systematic uncertainties which constrain the power and accuracy of several important analyses, like for example the ones dealing with cross section or precision mass measurements for instance. Such examples can be found in cms_an2 (); cms_an1 () where the dominant uncertainty is the systematic one which can be directly related to limited statistics of the associated simulated systematic datasets.
In this paper, we present a new tool, SysCalc, which is capable of providing to the end user various weights based on the nominal generated sample of a given physics process, in order to avoid the re-generation of events with different choices on the scales, the PDF grid sets etc. The main functionality of SysCalc is the computation of weights for various systematic variations (scales, PDFs) in a fast and robust way. Its format is based on the standard LHEFv3 format Alwall:2006yp (). The existence of this tool saves both computational time and disk space, but its most profound feature is that it supplies analysers with a weighted sample to account for the various theoretical systematics with the same exact size and the same statistical power (S.P.) as the one of the nominal/central sample.
To support the last statement, and although it may be trivial to the more advanced user, for the sake of completeness we consider the simple case where the central sample consists of total events events and each events has a weight of (). Then, we define the S.P. of the central sample as:
Assume now, that for the weighted sample, each event has a new weight , and for simplicity lets assume that it is the same weight for all events. Similarly to Eq. 1 we can write:
It is obvious from the above example, that in the case when the dedicated systematic samples are times smaller than the nominal sample, the statistical power of the dedicated samples is decreased by a factor of .
With the use of SysCalc, as the number of events for a given theoretical uncertainty variation is equal to the central one, the statistical uncertainty on the related systematics are kept at the same level as the ones of the central sample.
The structure of this paper is as follows: In Sect. 2 we briefly present the package, while in Sect. 3 (4) we describe the derivation of the formulas used in order to calculate the various systematics uncertainties for the case of un-matched/un-merged (matched/merged) samples followed by some validation plots. Further, in Sect. 5 we give a practical guide of how to use the code and the paper closes with the conclusions in Sect. 6. In the generation of all plots in this paper, the Delphes delphes () and ExRootAnalysis packages have been used.
2 The SysCalc package
SysCalc is a package that can calculate dedicated event weights for certain theoretical systematical uncertainties. SysCalc supports all Leading-Order (LO) computations generated in MadGraph5_aMC@NLO Alwall:2014hca (); Alwall:2007fs (). Its output is an XML-based file which contains all relative weights needed to account for the selected systematics. Supported systematics include the variations of the , and scales, as well as PDF sets and grids. SysCalc makes use of additional information stored by MadGraph5_aMC@NLO inside the record for each event, providing access to all information required to recompute the event weight based on convolution of the PDF set with the Matrix Element (ME) for the various supported scale variations.
3 The un-matched case
3.1 Reweighing of the cales
Without matching/merging, SysCalc is able to compute the variation of and scales (parameter scalefact) and the change of the PDF set. The variation of the scales can be done in a correlated and/or a uncorrelated way which is controlled by the value of the scale-correlation parameter which can take the following values:
-1: to account for all combinations ().
-2: to account only for the correlated variations.
A set of positive values corresponding to the following entries (assuming 0.5, 1, 2 for the scalefact entry):
The weight associated with the renormalisation scale is the following:
where is the scale variation considered, and are the original and new weights associated to the
event respectively. is the power in the strong coupling for the associated event (interferences are not taken account in a event by event basis).
The weight associated to the scaling of the factorisation scale is:
where are the PDF sets associated to the particle (which holds a fraction of energy / for the first/second beam respectively) for the original PDF set.
As a validation, two typical processes, namely and , with have been generated with MadGraph5_aMC@NLO. Around 10Mi events were produced for the central sample with a choice of (where is the scalar sum of transverse momenta () of all jets of the event). The sample is then interfaced to SysCalc to derive the weights for the different scales into question. For each of the aforementioned processes, a set of dedicated samples is generated as well with equal statistics, one for each of the following () configurations:
For all of the generated samples, a unique random seed is initialised every 100k events. In Fig. 1, the invariant mass, the and the rapidity () distributions of the top quark pair and of the leptons from the Z-bosons from the dedicated samples with different choices of the various scales are compared against the SysCalc weighted events. The agreement between them is found to be within the statistical fluctuations.
3.2 Reweighing of the PDF set
The equation describing both the variation of the PDF set and the central scale in the case of fixed-order samples is given by the corresponding weights associated to the chosen new PDF sets and new central scales:
where are the PDF sets associated to the particle under consideration.
As validation, the and processes (with ) have been considered. The central sample has been generated with the NNPDF_LO() PDF, while the events have been weighted for the following PDF sets Lai:2010vv (); Harland-Lang:2014zoa (); Gao:2013xoa (); Pumplin:2002vw (); Ball:2014uwa ():
Further, dedicated samples have been generated with MadGraph5_aMC@NLO for each of the above PDF sets for a choice of , while the central sample is generated with the NNPDF_LO() PDF set. In Fig. 2, the invariant mass, the , and the of the top quark pair and of the leptons from the Z-bosons from the dedicated samples from the different PDFs are compared against the SysCalc weighted events. The agreement is found to be within the statistical fluctuations.
4 The case of matching
In the presence of matching, MadGraph5_aMC@NLO rescales each event, such that the scale of the strong interaction in emissions as well as the PDF is set according to the parton shower history
(which is selected via a clustering). SysCalc can perform an associated re-weighting (parameter alpsfact) by dividing and by multiplying with the associated factor.
For each vertex of the clustering (associated to a scale ) this corresponds to the following factor for Final State Radiation (FSR):
and similarly for Initial State Radiation (ISR) associated to a scale and fraction of energy :
where is the scale of the next vertex in the initial state clustering history.
To test SysCalc in the case of matching, we have performed a similar validation as described in Sec. 3 for different choices of the scales and different PDF sets as well.
The utilized samples have been generated with 0 and 1 parton at ME with MadGraph5_aMC@NLO and with a different random seed every 100k events for about 20Mi events in total. The parton level events were interfaced with Pythia8pythia8 () for the hadronization and the matching step.
The validation was performed for different variations of the scales:
In Fig. 4 the invariant mass of the top quark pair and the Z-bosons, as well as the and the distributions of the ISR jet for the different scale variations from the dedicated samples are compared against the weighted events derived with SysCalc from the central sample. The agreement is found to be within the statistical uncertainties, with one exception discussed below. Similarly the plots for the scales variations are presented in Fig. 3.
One can notice that there is a small bias of the order of few % level on the and the variations which is more prominent in the ISR kinematic related properties.
This can be attributed to the fact that Pythia8 uses as the starting scale for the parton shower the scale that is already written in the LHE event record (parameter SCALUP in the LHEv3 format)
which determines the maximum hardness of the first shower emission.
For different choices of the and the scales, the SCALUP will change accordingly,
but since Pythia8 runs only once on the nominal sample which holds the weights from SysCalc, thus it is not possible to have information
on events with various SCALUP variations. Since this affects the hardness of the parton shower radiation, it is then expected to cause a difference in the final matched sample.
In order to check this effect, the following steps were carried out :
The nominal sample (i.e. which has ) was processed after modifying the relevant parameters that control the SCALUP parameter111this is achieved by defining appropriately the TimeShower:pTmaxMatch, TimeShower:pTmaxFudge, SpaceShower:pTmaxMatch, SpaceShower:pTmaxFudge parameters in the Pythia8 card., so that two new samples were derived; both of them having the nominal scales set in, but with different SCALUP corresponding to the or variations.
The dedicated samples generated with the variation on and 2, were processed appropriately so that their SCALUP corresponded to that of the nominal () samples.
As the above is equivalent to having separate samples corresponding to without modifying their SCALUP, dedicated samples with these settings were produced as well, with the matching efficiency found to be for the variations samples correspondingly.
All of the above samples were compared against the prediction obtained from SysCalc.
Further, for this exercise in order to assess the effects related to ISR and FSR radiation only, we veto all radiation related to the system and we consider jets at generator level only.
The of these jets are shown in Fig. 5; there is good agreement between the
dedicated and the systematic variation samples.
In summary, the choice of the SCALUP has a visible effect on the kinematics of the particles which is amplified for the ISR related quantities,
which can be well covered at analysis level by an extra systematic of the order of (for those quantities).
4.1 Future developments
SysCalc can include the weight associated to different merging scales in the MLM matching/merging mechanism (from output provided by the Pythia 6 package of pythia-pgs package). In that case, the parton shower keeps track of the scale of the first emission and applies then a veto to account for the minimal allowed value for the matching scale according to the cut performed at parton-level. SysCalc will then test for each of the values specified in the parameter matchscale if the event passes the MLM criteria or not. If it does not, a zero weight is associated to the events, while if it does, a weight of 1 is kept. As a reminder, these weights are the equivalent of having a (approximate) Sudakov form-factor and removing at the same time the double counting between the events belonging to different multiplicities. However, as at the time of writing this paper, Pythia6 is no longer supported and the described functionality is not yet supported in Pythia8, this functionality cannot be fully tested currently at the moment.
5 Practical guide to use SysCalc
The requirements of the SysCalc package as inputs are:
A systematics file (which can be generated by MadGraph 5 v.1.6.0 or later) where the generated events will be further processed so to calculate the weights.
The presence of LHAPDF LHAPDF () installed in the working environment222Instructions on how to install LHAPDF can be found in the following link: http://lhapdf.hepforge.org/lhapdf6/install.html. Once LHAPDF is installed, it is very important to properly set the environment variables (assuming that LHAPDF was installed in the /local directory
A configuration file (i.e. simple text file) specifying the parameters to be varied as described later.
In any system with bazaar available:
bzr branch lp: mgtools/mg5amcnlo/SysCalc
5.2 Configuration file
Before running SysCalc a configuration card with the parameters to be varied is needed. Below follows such an example on how to vary the central scale, the and the PDF set:
# Central scale factors
0.5 1 2
# Scale correlation
# Special value -1: all combination (N**2)
# Special value -2: only correlated variation
# Otherwise list of index N*fac_index + ren_index #index starts at 0
#Emission scale factors
alpsfact: 0.5 1 2
#PDF sets and number of members
If your are interested to simply run on one member of the PDF grid, simply replace the last line with this:
Numbering of the PDF members starts from 1, so the above line will result to consider only the first member (ie the central one) of a given PDF grid.
Also, a LHE v3 file is needed which has to be generated with MadGraph5_aMC@NLO with the option:
T = sys_calc.
The one-line syntax to execute SysCalc is:
./syscalc input.lhe syscalc_parameters.dat out.lhe
The output code follows the LHEF v3 format. The following block appears in the header of the output file. Assuming the above configuration card, the output will look like:
<weightgroup type="Central scale variation" combine="envelope">
<weight id="1" MUR="0.5" MUF="0.5" PDF="10042"> mur=0.5 muf=0.5 </weight>
<weight id="2" MUR="0.5" MUF="1" PDF="10042"> mur=0.5 muf=1 </weight>
<weight id="3" MUR="0.5" MUF="2" PDF="10042"> mur=0.5 muf=2 </weight>
<weight id="4" MUR="1" MUF="0.5" PDF="10042"> mur=1 muf=0.5 </weight>
<weight id="5" MUR="1" MUF="1" PDF="10042"> mur=1 muf=1 </weight>
<weight id="6" MUR="1" MUF="2" PDF="10042"> mur=1 muf=2 </weight>
<weight id="7" MUR="2" MUF="0.5" PDF="10042"> mur=2 muf=0.5 </weight>
<weight id="8" MUR="2" MUF="1" PDF="10042"> mur=2 muf=1 </weight>
<weight id="9" MUR="2" MUF="2" PDF="10042"> mur=2 muf=2 </weight>
</weightgroup> <weightgroup name="Emission scale variation" combine="envelope"> <weight id="10" ALPSFACT="0.5" MUR="1" MUF="1" PDF="10042"> alpsfact=0.5</weight> <weight id="11" ALPSFACT="1" MUR="1" MUF="1" PDF="10042"> alpsfact=1</weight> <weight id="12" ALPSFACT="2" MUR="1" MUF="1" PDF="10042"> alpsfact=2</weight> </weightgroup> <weightgroup name="CT10nlo" combine="hessian"> <weight id="13" MUR="1" MUF="1" PDF="11000"> Member 0 of sets CT10nlo</weight>
<weight id="14" MUR="1" MUF="1" PDF="11001"> Member 1 of sets CT10nlo</weight>
<weight id="15" MUR="1" MUF="1" PDF="11002"> Member 2 of sets CT10nlo</weight>
<weight id="16" MUR="1" MUF="1" PDF="11003"> Member 3 of sets CT10nlo</weight>
<weight id="17" MUR="1" MUF="1" PDF="11004"> Member 4 of sets CT10nlo</weight>
<weight id="18" MUR="1" MUF="1" PDF="11005"> Member 5 of sets CT10nlo</weight>
<weight id="19" MUR="1" MUF="1" PDF="11006"> Member 6 of sets CT10nlo</weight>
<weight id="64" MUR="1" MUF="1" PDF="11051"> Member 51 of sets CT10nlo</weight>
<weight id="65" MUR="1" MUF="1" PDF="11052"> Member 52 of sets CT10nlo</weight>
While for each event, the following information will be written:
</mgrwt> <rwgt> <wgt id="1">0.128476</wgt>
In this paper, we presented the SysCalc package, a C++ written tool capable of calculating weights for certain theoretical systematic uncertainties.
The very existence of this tool tackles the problem that many physics analyses face when they have to deal with calculation of different yet important
theoretical systematic uncertainties but struggle with the generation of large simulated dedicated samples.
With the use of SysCalc, one is able to have a systematics sample with similar statistical power as for the nominal one as the main idea is based on derivation of dedicated weights, one for
each source of the considered theoretical systematics.
Further, the code is very fast and thus ideal for a large scale MC production, while the final weights are written in dedicated tags in ASCII using the LHEFv3 format.
We presented a first level of validation of SysCalc’s main characteristics along with a practical guide on how to use it.
Near future developments include a full integration with Pythia8 in order to provide the matching scales variations on top of the currently available ones.
The authors would like to warmly thank Olivier Mattelaer and Steven Mrenna for their continuous support, cooperation and fruitful discussions.
-  G. Aad et al. [ATLAS Collaboration], JINST 3 (2008) S08003. doi:10.1088/1748-0221/3/08/S08003
-  S. Chatrchyan et al. [CMS Collaboration], JINST 3 (2008) S08004. doi:10.1088/1748-0221/3/08/S08004
-  V. Khachatryan et al. [CMS Collaboration], arXiv:1509.04044 [hep-ex].
-  V. Khachatryan et al. [CMS Collaboration], JHEP 1604 (2016) 073 doi:10.1007/JHEP04(2016)073 [arXiv:1511.02138 [hep-ex]].
-  J. Alwall et al., Comput. Phys. Commun. 176 (2007) 300 doi:10.1016/j.cpc.2006.11.010 [hep-ph/0609017].
-  J. de Favereau et al. [DELPHES 3 Collaboration], JHEP 1402 (2014) 057 doi:10.1007/JHEP02(2014)057 [arXiv:1307.6346 [hep-ex]].
-  J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H.-S. Shao and T. Stelzer et al., JHEP 1407 (2014) 079 [arXiv:1405.0301 [hep-ph]].
-  J. Alwall et al., Eur. Phys. J. C 53 (2008) 473 doi:10.1140/epjc/s10052-007-0490-5 [arXiv:0706.2569 [hep-ph]].
-  H. L. Lai, M. Guzzi, J. Huston, Z. Li, P. M. Nadolsky, J. Pumplin and C.-P. Yuan, Phys. Rev. D 82 (2010) 074024 doi:10.1103/PhysRevD.82.074024 [arXiv:1007.2241 [hep-ph]].
-  L. A. Harland-Lang, A. D. Martin, P. Motylinski and R. S. Thorne, Eur. Phys. J. C 75 (2015) no.5, 204 doi:10.1140/epjc/s10052-015-3397-6 [arXiv:1412.3989 [hep-ph]].
-  J. Gao et al., Phys. Rev. D 89 (2014) no.3, 033009 doi:10.1103/PhysRevD.89.033009 [arXiv:1302.6246 [hep-ph]].
-  J. Pumplin, D. R. Stump, J. Huston, H. L. Lai, P. M. Nadolsky and W. K. Tung, JHEP 0207 (2002) 012 doi:10.1088/1126-6708/2002/07/012 [hep-ph/0201195].
-  R. D. Ball et al. [NNPDF Collaboration], JHEP 1504 (2015) 040 doi:10.1007/JHEP04(2015)040 [arXiv:1410.8849 [hep-ph]].
-  M. L. Mangano, M. Moretti, F. Piccinini and M. Treccani, JHEP 0701 (2007) 013 [hep-ph/0611129].
-  T. Sjöstrand et al., Comput. Phys. Commun. 191 (2015) 159 doi:10.1016/j.cpc.2015.01.024 [arXiv:1410.3012 [hep-ph]].
-  M. R. Whalley, D. Bourilkov and R. C. Group, hep-ph/0508110.
-  T. Sjostrand, S. Mrenna and P. Z. Skands, JHEP 0605 (2006) 026 [hep-ph/0603175].