Abstract
In most (weakly interacting) extensions of the Standard Model the relation mapping the parameter values onto experimentally measurable quantities can be computed (with some uncertainties), but the inverse relation is usually not known. In this paper we demonstrate the ability of artificial neural networks to find this unknown relation, by determining the unknown parameters of the constrained minimal supersymmetric extension of the Standard Model (CMSSM) from quantities that can be measured at the LHC. We expect that the method also works for many other new physics models. We compare its performance with the results of a straightforward $\chi^2$ minimization. We simulate LHC signals at a center of mass energy of 14 TeV at the hadron level. In this proof–of–concept study we do not explicitly simulate Standard Model backgrounds, but apply cuts that have been shown to enhance the signal–to–background ratio. We analyze four different benchmark points that lie just beyond current lower limits on superparticle masses, each of which leads to around 1000 events after cuts for our assumed integrated luminosity. We use up to 84 observables, most of which are counting observables; we do not attempt to directly reconstruct (differences of) masses from kinematic edges or kinks of distributions. We nevertheless find that $m_0$ and $m_{1/2}$ can be determined reliably, with errors of only a few percent in some cases. With a larger data set $A_0$ as well as $\tan\beta$ can also be determined quite accurately. For comparable computational effort the $\chi^2$ minimization yielded much worse results.
July 2013
Determination of the CMSSM Parameters using Neural Networks
Nicki Bornhauser and Manuel Drees
Physikalisches Institut and Bethe Center for Theoretical Physics, Universität Bonn,
Nußallee 12, D–53115 Bonn, Germany
1 Introduction
The Large Hadron Collider (LHC) is running successfully. After the next long shutdown the center of mass energy will be raised from 8 to 13 or 14 TeV. This higher center of mass energy will increase the reach for finding new physics. Here we are concerned with supersymmetric extensions of the Standard Model (SM) of particle physics. Within the simplest potentially realistic supersymmetric model, the minimal supersymmetric extension of the SM (MSSM) [1], the 14 TeV LHC will increase the mass reach for first generation squarks and gluinos from the current lower bounds, which reach nearly 1.5 TeV for equal squark and gluino masses [2],* to values between 2 and 3 TeV [5]. This leaves plenty of room for new discoveries. In particular, the “natural” range of parameters of supersymmetric theories will then be probed decisively.
*Mass limits are usually cited for the constrained MSSM (CMSSM) or for certain simplified scenarios; even in these scenarios the gluino mass bound reduces to about 900 GeV if squarks are much heavier than gluinos. In the more general MSSM the bounds may be somewhat weaker [3]; however, they can be reduced significantly only if strongly interacting sparticles are quite close in mass to the lightest superparticle [4], which is not particularly natural from the model building point of view.
Discovering a signal for physics beyond the SM, important as it would be, would certainly not be the end of the LHC physics program. One would then not only have to ascertain what kind of new physics has been discovered, but also determine the values of the free parameters as accurately as possible. In the context of the MSSM, this should help to unravel the mechanism responsible for the breaking of supersymmetry.
There is a large literature on ways to determine the parameters of supersymmetric theories. Most methods start from kinematic features, in particular endpoints or “edges” of invariant mass distributions [6] or kinks in slightly more complicated kinematic distributions [7]. These kinematic features allow one to determine (differences of) superparticle masses directly; at least at the tree–level, in most cases there is a direct relation between the mass of a superpartner and a weak–scale parameter of the underlying theory. In many cases the experimental resolution that can be achieved is such that at least one–loop corrections should be included; e.g. the difference between pole (on–shell) masses, which determine the kinematics, and $\overline{\rm DR}$ masses, which are the “fundamental” free parameters in the supersymmetric Les Houches accord [8], can easily reach several percent for strongly interacting superparticles [9]. The derived $\overline{\rm DR}$ masses will then depend on many (pole) masses. Moreover, in the chargino and neutralino sector, as well as for third generation sfermions, the relation between pole masses and fundamental parameters is complicated by mixing [1]. Nevertheless the basic kinematic quantities that are used for parameter determination can be determined from a single (simulated) experiment. While the step from there to the determination of the basic parameters and their errors may entail many calls of spectrum calculators, there is no need to simulate event generation for different sets of parameters, which is usually far more time consuming than the calculation of the spectrum of superparticles.†
†This statement may no longer be strictly true in the presence of additional hard radiation, which changes the shapes of kinematic distributions near the features (edges or kinks) used to determine masses [10]. Beyond the collinear approximation, the amount of radiation emitted may depend on (combinations of) superparticle masses in a complicated manner.
However, in practice the shapes of the relevant kinematic distributions are fitted to (simulated) data rather than directly taken from theory. While we are not aware of an analysis of kinematical fitting based on a fully NLO corrected event simulation, we expect that by fitting the shapes of the relevant distributions one can still determine the locations of the relevant kinematic features without prior knowledge of superparticle masses.
On the other hand, even in constrained scenarios purely kinematical determinations of the underlying parameters work well only if sufficiently many events contain two (or more) charged leptons (meaning electrons or muons). Kinematic reconstructions based on jets suffer not only from the much poorer energy resolution of jets, but also from larger combinatorial backgrounds (since the production and decay of strongly interacting superparticles typically leads to events with many jets).
In any case, it is clear that the number of signal events in certain categories contains a lot of information about the underlying physics. Even if kinematic reconstruction works well, it would be wasteful to ignore this information. To mention a well–known example, the cross section for the pair production of a new color triplet complex scalar boson (like the stop) is much smaller than that for spin–1/2 quarks of the same mass [1]. Moreover, in constrained supersymmetric scenarios strongly interacting superparticles tend to be heavier than those without strong interactions. The production of strongly interacting superparticles therefore frequently leads to long “cascade” decays [11], which can populate many “topologically different” final states, i.e. final states characterized by different numbers (and charges) of leptons as well as different numbers (and flavors) of jets. It has been recognized quite early that the relative abundance of these final states contains a great deal of information about the sparticle spectrum [12].
However, these early studies mostly focused on distinguishing qualitatively different spectra of superparticles. Information on the total signal rate has only quite recently been included in fits attempting to determine the underlying parameters from (simulated) events [13]. We are not aware of any study that attempts to determine the values of the underlying parameters using (mostly) counting observables, although a recent analysis showed that these observables can be very useful for discriminating between discrete sets of model parameters [14]. One major difficulty with this approach is that it requires generating event samples for many different assumed sets of input parameters. For example, even in the CMSSM, which has only four free parameters, a simple grid scan over all parameters with a step size comparable to or smaller than the anticipated statistical accuracy of the method is prohibitively CPU expensive in most circumstances.
The purpose of this paper is to demonstrate the usefulness of artificial neural networks for the parameter determination of such a new physics theory. Largely due to the limitation of our computational resources, we do this in the framework of the CMSSM; the method should also be useful for other theories, supersymmetric or otherwise. Again for computational simplicity we ignore detector effects, but we work at full hadron level, including initial and final state radiation, hadronization, and the underlying event. Similarly, we ignore Standard Model backgrounds, but we include cuts that should keep them at a manageable level. We consider four benchmark scenarios, all of which lie (slightly) beyond current lower bounds on superparticle masses, but have qualitatively different spectra. We find that our smaller assumed data set at 14 TeV is sufficient to determine the common scalar mass parameter $m_0$ and the common gaugino mass parameter $m_{1/2}$ to few percent accuracy without any direct kinematic mass reconstruction. With the larger data set the neural networks can also determine the trilinear soft breaking parameter $A_0$ and the ratio $\tan\beta$ of vacuum expectation values quite accurately for these benchmark scenarios. In contrast, in many cases a simple $\chi^2$ minimization failed to converge, i.e. it could not reliably determine the parameters and their errors. The likely reason is that the minimization of $\chi^2$ is very sensitive to fluctuations in the predictions due to finite Monte Carlo statistics.
In this paper we are only interested in the production and decay of superparticles at the LHC, as an example for an extension of the SM containing many new parameters that can hopefully be determined from future LHC data. In our numerical analysis we will therefore respect the experimental bounds on the masses of superparticles, but we will not try to reproduce the recently discovered (increasingly) Higgs–like boson [15] in our CMSSM spectra, nor will we try to describe Dark Matter through thermally produced superparticles. Instead we are using the CMSSM as a toy model whose parameter space is manageable even without requiring the correct Higgs mass and Dark Matter relic density. Obviously imposing these constraints, or other constraints not directly related to LHC data, would simplify the task of fixing the free parameters. Here we wish to show that data on the production and decay of superparticles at the LHC by themselves can be used for this task, even if mostly counting observables are used.
The remainder of this article is organized as follows. In Sec. 2 we first introduce the general setting of the simulation. One important issue is the choice of observables. An automated reconstruction of the underlying parameters can only succeed if one has sufficiently many observables to be sensitive to all parameters everywhere in parameter space. On the other hand, including too many observables can dilute the statistical power. We present a set of observables which we showed to be useful for discriminating between different parameter sets for a more general supersymmetric model with 15 parameters [14]. In the second part of Sec. 2 we introduce four different benchmark points in the CMSSM framework. In Sec. 3 we discuss both our attempts at parameter reconstruction, first using artificial neural networks and second a $\chi^2$ minimization. We explain the general set–up as well as each step of the creation of the neural networks for this specific application. We also estimate the errors on the CMSSM parameters, including their correlations, using different methods that yield consistent results. In the second part of Sec. 3 the attempted $\chi^2$ minimization is discussed; as already mentioned, it does not perform very well. The results obtained by the artificial neural networks for all four benchmark points are discussed in Sec. 4. We also compare them to the results from the $\chi^2$ minimization. Finally, the last Section contains a summary and some conclusions.
2 Simulation
We simulate future LHC data at a center of mass energy of 14 TeV. As mentioned in the Introduction, we work in the framework of the CMSSM, where the entire spectrum of superparticles and Higgs bosons is defined by four continuous parameters and a sign. The continuous parameters are the common scalar mass parameter $m_0$, the common gaugino mass $m_{1/2}$, the common trilinear soft breaking parameter $A_0$ and the ratio $\tan\beta$ of vacuum expectation values of the two Higgs doublets. As usual, $m_0$, $m_{1/2}$ and $A_0$ are specified at the scale of Grand Unification, $M_X \simeq 2 \cdot 10^{16}$ GeV, whereas $\tan\beta$ is given at the electroweak scale. We fix the sign of the supersymmetric higgsino mass parameter $\mu$ to be positive.
We use SOFTSUSY [16] to compute the CMSSM superparticle and Higgs boson spectra from the values of the four input parameters. The weak–scale spectrum is then passed on to SUSY–HIT [17], which calculates the branching ratios of all kinematically allowed decays. Knowledge of the superparticle masses and branching ratios is needed for the simulation of the production and decay of pairs of superparticles at the LHC, which is handled by the event generator Herwig++ [18]. In a first step a sample of events is simulated in order to determine the total cross section for the production of all superparticles for the given set of input parameters. Next, the number of events corresponding to the assumed integrated luminosity is simulated; we will show results for two different integrated luminosities.
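The conversion from a total cross section and an integrated luminosity to an expected number of events can be sketched as follows; the numerical values are arbitrary illustrations, not our benchmark values:

```python
def n_expected_events(sigma_pb: float, lumi_fb: float) -> float:
    """Expected number of events for a total cross section sigma_pb (in pb)
    and an integrated luminosity lumi_fb (in fb^-1).
    Since 1 pb = 1000 fb, N = sigma[fb] * L[fb^-1]."""
    return sigma_pb * 1000.0 * lumi_fb

# e.g. a 0.1 pb total SUSY cross section and 10 fb^-1 of data
n = n_expected_events(0.1, 10.0)   # 1000 expected events before cuts
```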
Each simulated event is assigned to one of twelve mutually exclusive event classes, based on the number, charges and flavors of charged leptons. In addition, for each event a small number of mostly counting quantities is recorded, from which we construct our observables. This is described in more detail in the following Subsection. We do this, first of all, for four benchmark scenarios, which lie in qualitatively different regions of CMSSM parameter space, as described in the second Subsection. Of course, in the attempt to determine the values of the CMSSM parameters from the four simulated measurements, the procedure from spectrum calculation to event generation has to be performed for many additional parameter sets, as described in Sec. 3.
2.1 Observables and their Covariances
In this Subsection we summarize our observables, which we introduced in detail in Sec. 3 of [14]. In particular, the precise definitions of the objects (isolated charged leptons, hadronically decaying $\tau$ leptons, hadronic jets with or without $b$ tag) we use to characterize the events, and the applied cuts, can be found in the Appendices of [14].
As already noted, we group all accepted events into twelve mutually exclusive classes, which differ by the number, charges and flavors of charged leptons. Here only isolated electrons or muons satisfying the transverse momentum and pseudorapidity cuts of ref. [14] are counted. These classes are:

Class 1: Events with no charged leptons

Class 2: Events with exactly one charged lepton, with negative charge (in units of the proton charge)

Class 3: Events with exactly one charged lepton, with positive charge

Class 4: Events with exactly two charged leptons, with total charge $-2$

Class 5: Events with exactly two charged leptons, with total charge $+2$

Class 6: Events with exactly two charged leptons, with opposite charge but the same flavor; i.e. $e^+e^-$ or $\mu^+\mu^-$

Class 7: Events with exactly two charged leptons, with opposite charge and different flavor; i.e. $e^+\mu^-$ or $e^-\mu^+$

Class 8: Events with exactly three charged leptons with total charge $-1$. There is an oppositely charged lepton pair with the same flavor. For example $e^+e^-e^-$ or $e^-\mu^+\mu^-$

Class 9: Events with exactly three charged leptons with total charge $+1$. There is an oppositely charged lepton pair with the same flavor. For example $e^+e^-e^+$ or $e^+\mu^+\mu^-$

Class 10: Events with exactly three charged leptons with total negative charge, i.e. there are at least two negatively charged leptons. There is no oppositely charged lepton pair with the same flavor. For example $e^-e^-\mu^+$ or $e^-\mu^-\mu^-$

Class 11: Events with exactly three charged leptons with total positive charge, i.e. there are at least two positively charged leptons. There is no oppositely charged lepton pair with the same flavor. For example $e^+e^+\mu^-$ or $e^+\mu^+\mu^+$

Class 12: Events with four or more charged leptons
We distinguish between different charges of leptons since the initial state at the LHC carries positive charge. In general the number of events with positively charged leptons can therefore differ from those with negatively charged leptons. Moreover, we distinguish between lepton pairs with opposite charge but the same flavor, which can originate from leptonic neutralino decays, $\tilde\chi_2^0 \to \ell^+ \ell^- \tilde\chi_1^0$, and all other lepton pairs, which have to come from the decays of two different particles. This explains why we have two different classes of events with exactly one charged lepton, and four different classes each for events with exactly two and exactly three charged leptons, respectively. In principle we could also define several different classes of four lepton events. However, the number of such events is in any case rather small; further separating these few events into several classes is therefore not very useful.‡
‡In fact, separating the events into too many distinct classes is harmful. The reason is that we wish to use Gaussian statistics; our observables become approximately Gaussian only in the limit of large event numbers. We therefore only use event classes containing some minimal number of events, as described in Sec. 3.
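The assignment of an event to one of the twelve classes depends only on the multiset of isolated lepton charges and flavors. A minimal sketch, with leptons represented as (flavor, charge) pairs (a representation of our own choosing):

```python
def event_class(leptons):
    """Assign an event to one of the twelve classes defined above.
    `leptons` is a list of (flavor, charge) tuples for the isolated
    electrons/muons passing the cuts, e.g. [('e', +1), ('mu', -1)]."""
    n = len(leptons)
    total = sum(q for _, q in leptons)
    # is there an oppositely charged same-flavor (OSSF) lepton pair?
    ossf = any(leptons.count((f, +1)) > 0 and leptons.count((f, -1)) > 0
               for f in ('e', 'mu'))
    if n == 0:
        return 1
    if n == 1:
        return 2 if total < 0 else 3
    if n == 2:
        if total == -2: return 4
        if total == +2: return 5
        return 6 if ossf else 7
    if n == 3:
        if ossf:
            return 8 if total < 0 else 9
        return 10 if total < 0 else 11
    return 12
```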
Our first observable is the total number of events after cuts, $N_{\rm tot}$. Note that the cuts differ for the different event classes, as described in ref. [14]. In addition, for each of these twelve classes $k$, the values of seven observables are computed:

$N_k/N_{\rm tot}$: The number of events contained in the given class $k$ divided by the total number of events $N_{\rm tot}$, i.e. the fraction of all events contained in a given class

$\bar n_{\tau^-}$: Average number of tagged hadronically decaying $\tau^-$ leptons of all events within a given class

$\bar n_{\tau^+}$: Average number of tagged hadronically decaying $\tau^+$ leptons of all events within a given class

$\bar n_b$: Average number of $b$ tagged jets of all events within a given class

$\bar n_j$: Average number of non–$b$ jets of all events within a given class

$\overline{n_j^2}$: Average of the square of the number of non–$b$ jets§ of all events within a given class

$\bar H_T$: Average value of $H_T$ of all events within a given class, where $H_T$ is the scalar sum of the transverse momenta of all hard objects, including the missing transverse momentum
§If event $i$ in the given class contains $n_i$ non–$b$ jets, this observable averages $n_i^2$.
Both $\tau$ and $b$ jets have to satisfy the transverse momentum and pseudorapidity cuts given in ref. [14]. In addition, a $\tau$ jet needs to be isolated, and a $b$ jet has to contain a $b$ flavored hadron. Jets satisfying these criteria are tagged with the assumed constant tagging efficiencies of [14]. Finally, $H_T$ is the scalar sum of the transverse momenta of all hard objects (jets and charged leptons) and the absolute value of the missing transverse momentum. We again refer to ref. [14] for further details.
Three of these observables differ from the ones used in ref. [14]. The number of events in a given class that contain at least one tagged hadronically decaying $\tau^-$, divided by the total number of events in this class, has been replaced by the average number of tagged hadronically decaying $\tau^-$ leptons of all events within a given class, $\bar n_{\tau^-}$, and similarly for the positively charged $\tau$ jets. In the parameter sets considered in [14] the number of events containing a tagged $\tau$ was rather small, and the number of events containing two of them even smaller. Therefore it was sufficient to just count the number of events containing at least one tagged $\tau$. In the case of the CMSSM there can be more events with a higher number of $\tau$ leptons; here we therefore switched the observable to preserve more information about the measurement. The same applies to the observable $\overline{n_j^2}$, which replaces a corresponding counting observable of [14].
Out of the observables listed above, one should be discarded. Obviously the fractions of events which belong to a certain class add up to one, because $\sum_{k=1}^{12} N_k = N_{\rm tot}$. We therefore do not include the fraction of events without charged leptons, $N_1/N_{\rm tot}$, among our observables; note that this does not lead to any loss of information. We thus end up with $1 + 12 \cdot 7 - 1 = 84$ observables.
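Schematically, the full observable vector is assembled from per-class event records. A sketch in which each event carries its counts of $\tau^-$, $\tau^+$, $b$ jets and non–$b$ jets and its $H_T$ (the field names are our own, purely illustrative):

```python
def observables(classes):
    """Build the 84-dimensional observable vector described above.
    `classes` maps class index 1..12 to a list of events; each event is
    a dict with keys 'ntau-', 'ntau+', 'nb', 'nj', 'ht'."""
    n_tot = sum(len(ev) for ev in classes.values())
    obs = [float(n_tot)]                      # total number of events after cuts
    for k in range(1, 13):
        events = classes.get(k, [])
        n_k = len(events)
        if k != 1:                            # class-1 fraction is redundant
            obs.append(n_k / n_tot)

        def avg(f):                           # within-class average
            return sum(f(e) for e in events) / n_k if n_k else 0.0

        obs += [avg(lambda e: e['ntau-']),
                avg(lambda e: e['ntau+']),
                avg(lambda e: e['nb']),
                avg(lambda e: e['nj']),
                avg(lambda e: e['nj'] ** 2),  # square averaged event by event
                avg(lambda e: e['ht'])]
    return obs                                # 1 + 11 fractions + 12*6 averages = 84
```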
For the calculation of $\chi^2$, and also in order to improve the performance of our artificial neural networks, we need the covariance matrix of all observables. The variance of the total number of events after cuts, $N_{\rm tot}$, follows from Poisson statistics:

$\sigma^2(N_{\rm tot}) = N_{\rm tot}$ .   (2.1)
The next twelve observables are the fractions $N_k/N_{\rm tot}$ of events that belong to each class $k$. As mentioned before they are not independent. The covariance between the fractions of events in two different classes $k$ and $l$ follows from multinomial statistics:

${\rm Cov}\!\left( \frac{N_k}{N_{\rm tot}}, \frac{N_l}{N_{\rm tot}} \right) = \frac{1}{N_{\rm tot}} \left( \delta_{kl}\, \frac{N_k}{N_{\rm tot}} - \frac{N_k}{N_{\rm tot}}\, \frac{N_l}{N_{\rm tot}} \right)$ .   (2.2)
The covariance for identical classes ($k = l$) equals the variance. Note that this matrix would be singular if we included all twelve fractions among our observables. In contrast, the fractions and the total number of events are not correlated, i.e.

${\rm Cov}\!\left( \frac{N_k}{N_{\rm tot}},\, N_{\rm tot} \right) = 0$ .   (2.3)
The remaining observables can be written as averages $\bar o = \frac{1}{N_k} \sum_{i=1}^{N_k} o_i$ over all events in a given class, with $o = n_{\tau^-},\, n_{\tau^+},\, n_b,\, n_j,\, n_j^2$ or $H_T$. Their variances can be calculated directly from the simulated data using the formula

$\sigma^2(\bar o) = \frac{1}{N_k} \left( \overline{o^2} - \bar o^{\,2} \right)$ .   (2.4)
Of these observables, only $\bar n_j$ and $\overline{n_j^2}$ are correlated within a given class:

${\rm Cov}\!\left( \bar n_j,\, \overline{n_j^2} \right) = \frac{1}{N_k} \left( \overline{n_j^3} - \bar n_j\, \overline{n_j^2} \right)$ .   (2.5)
Here $\overline{n_j^3}$ is also determined directly from the (simulated) events. Observables from different classes are not statistically correlated. We also ignore possible correlations between the remaining pairs of observables. The validity of this approximation was checked for closely related observables in [14] and should also be fine here.
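The nonzero covariance entries can be estimated directly from the simulated events. A sketch for one event class, implementing our reconstructed forms of Eqs. (2.1)–(2.5):

```python
def class_covariances(n_tot, n_k, nj_list):
    """Covariance entries for one event class: Poisson variance of the
    total event count (2.1), multinomial variance of the class fraction
    (2.2 with k = l), variances of the within-class averages (2.4), and
    the covariance of mean(n_j) and mean(n_j^2) (2.5). `nj_list` holds
    the non-b jet multiplicity of each of the n_k events in the class."""
    mean = lambda xs: sum(xs) / len(xs)
    m1 = mean(nj_list)                           # \bar{n}_j
    m2 = mean([n ** 2 for n in nj_list])         # \overline{n_j^2}
    m3 = mean([n ** 3 for n in nj_list])         # \overline{n_j^3}
    m4 = mean([n ** 4 for n in nj_list])
    frac = n_k / n_tot
    return {
        'var_Ntot':     float(n_tot),            # Eq. (2.1)
        'var_frac':     frac * (1.0 - frac) / n_tot,   # Eq. (2.2), k = l
        'var_nj_mean':  (m2 - m1 ** 2) / n_k,    # Eq. (2.4), o = n_j
        'var_nj2_mean': (m4 - m2 ** 2) / n_k,    # Eq. (2.4), o = n_j^2
        'cov_nj_nj2':   (m3 - m1 * m2) / n_k,    # Eq. (2.5)
    }
```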
2.2 Benchmark Points
We look at four different reference points in the CMSSM parameter space, each yielding around 1000 events after cuts for our assumed integrated luminosity. All points have a positive Higgs(ino) mass parameter $\mu$; the values of $(m_0, m_{1/2}, A_0, \tan\beta)$ defining the four points are characterized in the following paragraphs. The resulting spectra of superparticles and Higgs bosons are given in Table 1.*
*Recall that we are not requiring the CMSSM to contain a Higgs boson with mass near 125 GeV in our analysis.
[Table 1: Spectra of superparticles and Higgs bosons for benchmark Points 1–4.]
The first parameter set has a low $m_0$, leading to small slepton masses. The sleptons can therefore be produced on–shell in decays of the wino–like states $\tilde\chi_2^0$ and $\tilde\chi_1^\pm$, which in turn are produced frequently in decays of $SU(2)$ doublet squarks $\tilde q_L$. Since these squarks can be produced either directly or in decays of gluinos, we expect this benchmark point to lead to relatively strong leptonic signatures.
In contrast, the second point has a considerably larger $m_0$, so that squarks and sleptons have similar masses. The wino–like $\tilde\chi_2^0$ and $\tilde\chi_1^\pm$ are still produced rather copiously in three–body decays via virtual doublet squark exchange; note that $\tilde\chi_2^0$ does not have any tree–level two–body decay in this scenario.† However, $\tilde\chi_2^0$ mostly decays into the light CP–even scalar Higgs boson plus the lightest superparticle (LSP) here, while leptonic decays have only a small branching ratio. Gluino decays frequently produce top (anti)quarks in this scenario, whose semileptonic decays can yield additional hard leptons; these final states are favored since renormalization group (RG) running and mixing reduce the masses of $\tilde t_1$ and $\tilde b_1$ relative to those of the other squarks [1]. Nevertheless we expect the strengths of the signals with three or more leptons to be much weaker in this scenario. Moreover, the signal in class 6 (two leptons with opposite charge and equal flavor) will not be enhanced relative to that in class 7 (two leptons with opposite charge and different flavor).
†Decays of the kind $\tilde\chi_2^0 \to \tilde\chi_1^0 \gamma$, which proceed via one–loop diagrams, are also included in SUSY–HIT [17], but their branching ratios are small.
The third benchmark point also has a relatively high $m_0$, so that sleptons are again not produced in the decays of strongly interacting superparticles. However, it also has a rather large $|A_0|$, which enhances both the RG running and the mixing in the stop sector. This leads to a relatively large mass splitting between $\tilde t_1$ and the remaining squarks. The gluino mass is chosen such that all gluinos decay into a top and an anti–stop, or vice versa. Since a large fraction of all $\tilde t_1$ decay into a top quark plus one of the neutralinos, a gluino decay therefore produces on average more than one top (anti)quark. This yields high–multiplicity final states, including several $b$ jets and frequently also hard leptons. However, the $\tilde t_1$ mass is still so high that direct stop pair production does not contribute appreciably to the overall signal for supersymmetry in this scenario.
Finally, the fourth benchmark point again has a relatively high $m_0$, but rather large $\tan\beta$. This increases the value of the $\tau$ Yukawa coupling, which in turn reduces the mass of the lighter stau eigenstate $\tilde\tau_1$ through RG effects as well as mixing. As a result, $\tilde\tau_1$ is significantly lighter than the other sleptons in this scenario. Nevertheless, essentially no $\tilde\chi_2^0$ and just a small fraction of all $\tilde\chi_1^\pm$ decay into a first or second generation charged slepton. The reason is that $\tilde\chi_2^0$ and $\tilde\chi_1^\pm$ are both wino–like, whereas $\tilde e_R$ and $\tilde\mu_R$ are $SU(2)$ singlets. On the other hand, $\tilde\tau_1$ has a significant doublet component. As a result, a sizable fraction of all $\tilde\chi_2^0$ and $\tilde\chi_1^\pm$ decay into a $\tilde\tau_1$, which in turn always decays into a $\tau$ lepton and an LSP $\tilde\chi_1^0$; the remaining $\tilde\chi_2^0$ decay mostly into the light Higgs boson plus the LSP, whereas the remaining $\tilde\chi_1^\pm$ decay into $W^\pm \tilde\chi_1^0$. The decay products of strongly interacting superparticles therefore frequently contain one or more $\tau$ lepton(s). On the other hand, the branching ratios for gluino decays into third generation squarks are again enhanced, so that gluino decays again frequently produce top (anti)quarks. In this scenario supersymmetric events therefore can contain relatively soft leptons from leptonic $\tau$ decays and/or hard leptons from semileptonic top decays, in addition to $\tau$ and/or $b$ jets.
[Table 2: Fractions of events (in percent) assigned to the twelve event classes (rows 1–12) for benchmark Points 1–4.]
Our qualitative expectations are confirmed by Table 2, which lists the fractions of events (in percent) that are assigned to the twelve event classes. We also note that all benchmark points predict that more positively than negatively charged leptons are produced. This is because the proton contains more $u$ quarks than $d$ quarks, so that the production of $\tilde u$ squarks is enhanced relative to that of $\tilde d$ squarks. The asymmetry between positively and negatively charged leptons becomes large when squark production contributes prominently to the total event sample (after cuts) and (some) first generation squarks have sizable semi–leptonic decay branching ratios. Both conditions are satisfied for Point 1 and, to a lesser extent, for Point 4, whereas Point 2 has a very small asymmetry owing to the large squark masses and the small leptonic branching ratios of the charginos and neutralinos.
Since we assume exact conservation of R parity, the LSP is stable. In the four benchmark scenarios, it is the lightest neutralino $\tilde\chi_1^0$. Each supersymmetric event will therefore contain at least two $\tilde\chi_1^0$, which are not detected, leading to the celebrated missing transverse momentum signature. Indeed, our cuts always include the requirement of sizable missing transverse momentum, the exact value of the cut depending on the number of charged leptons in the event; we again refer to the Appendix of ref. [14] for further details.‡
‡Note that we apply a “$Z$ veto” cut on events of class 6, which is not applied to events in class 7. Without this cut, in a theory respecting $e$–$\mu$ universality, as the CMSSM does, class 6 would have to contain at least as many events as class 7, since uncorrelated lepton pairs would contribute equally to both classes, while correlated same flavor lepton pairs only contribute to class 6.
We find the following numbers of events for our assumed integrated luminosity:

Point 1: 1940 events before and 1080 after cuts

Point 2: 4080 events before and 1047 after cuts

Point 3: 1970 events before and 1135 after cuts

Point 4: 1618 events before and 991 after cuts
These numbers have been obtained from a simulation corresponding to a much larger integrated luminosity.
Evidently the number of events after cuts differs much less between the four benchmark points than the raw event number does. In particular, for the second benchmark point the large number of events before cuts is due to the small masses of the electroweak charginos and neutralinos. In principle these final states – in particular, $\tilde\chi_1^\pm \tilde\chi_2^0$ production – can lead to final states with three charged leptons and little hadronic activity [19], which could contribute to our event classes 8 or 9. Recall, however, that $\tilde\chi_2^0$ has a very small leptonic branching ratio in this scenario. The electroweak pair production events therefore have a very low cut efficiency, i.e. most of these events do not pass our cuts.
It is amusing to note that the total number of events after cuts alone would evidently not be sufficient to distinguish between benchmark points 1 and 3, nor between points 2 and 4. This illustrates the need to include (many) more observables when trying to discriminate scenarios, let alone for the quantitative determination of the values of the free parameters. Indeed, Table 2 shows that the event fractions in the twelve classes are quite sufficient to distinguish between the four benchmark scenarios. However, we do not merely wish to distinguish between benchmark points that are well separated in parameter space; we wish to quantitatively determine the values of the underlying parameters. This is discussed in the subsequent Section.
3 Strategies for Determining the Parameters
In this Section we discuss the strategies we employed to extract the values of the four CMSSM parameters from our 84 observables for the four benchmark points described in the previous Subsection. We first describe the construction of an artificial neural network (ANN), and then an attempt based on the minimization of an overall $\chi^2$ variable.
3.1 Neural Network
Artificial neural networks have been used in high energy physics for more than 25 years [20]. Originally they were designed for comparatively simple tasks, like reconstructing single tracks from hit patterns in wire chambers. By now ANNs are used for a wide variety of purposes, from optimizing experimental searches for superparticles [21] to parameterizing parton distribution functions [22]. However, we are not aware of a previous application of ANNs to the LHC inverse problem.
Here we describe how to construct artificial neural networks that can find relations mapping the measured observables onto the CMSSM parameters whose values we wish to determine. Mathematically speaking, an ANN is a function mapping input parameters, to be provided by the user, onto output parameters, whose values the user wishes to determine. It can learn this function from training sets via a well–defined algorithm. A training set consists of input values (the observables of simulated experiments) and the corresponding output values (the CMSSM parameters). Note that the parameters that are the input into an event generation program (in our example, the CMSSM parameters) are the output of the neural network, whereas the output of the event simulation (our 84 observables) are input of the neural network. It is hoped that by training the neural network on sufficiently many sets of input and output parameters, it will learn to derive the correct values of the output parameters also for sets of input parameters that are not contained in the training set. We will show that, at least for our four benchmark scenarios, this indeed works quite well.
As noted above, a neural network is a mathematical function; “learning” means that the numerical coefficients defining this function are adjusted. By choosing different values for these coefficients, in principle nearly every possible function can be reproduced with some accuracy. During the learning process these coefficients are set such that the deviations between the calculated network outputs for the specific training inputs and the known desired (correct) outputs are reduced as far as possible. If the training set is a fair representation of the investigated CMSSM parameter space, at the end the resulting neural network function should be able to compute the correct CMSSM parameters for an arbitrary set of input parameters, assuming that the latter indeed can be reproduced in the CMSSM.
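The basic idea, learning the inverse of a forward simulation from training pairs, can be illustrated with a deliberately tiny toy model: a "forward simulation" maps one parameter onto one observable, and a one-hidden-layer tanh network is trained by plain stochastic gradient descent to map the (normalized) observable back onto the parameter. This is only a schematic, noise-free illustration, not our actual network or training algorithm:

```python
import math
import random

def forward_sim(theta):
    """Toy 'event simulation': maps the model parameter onto one observable."""
    return math.exp(theta)

# Training set: pairs of (simulated observable, underlying parameter).
thetas = [i / 20.0 for i in range(21)]
raw = [forward_sim(t) for t in thetas]
mu = sum(raw) / len(raw)
sd = (sum((x - mu) ** 2 for x in raw) / len(raw)) ** 0.5
xs = [(x - mu) / sd for x in raw]          # normalized network inputs

H = 5                                       # number of hidden neurons
random.seed(1)
w = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(H)]
v = [random.uniform(-0.5, 0.5) for _ in range(H)]
c = 0.0                                     # output bias

def net(x):
    # hidden layer: tanh of weighted sums, with a constant input neuron x0 = 1
    h = [math.tanh(wj[0] + wj[1] * x) for wj in w]
    return sum(vj * hj for vj, hj in zip(v, h)) + c, h

def mse():
    return sum((net(x)[0] - t) ** 2 for x, t in zip(xs, thetas)) / len(xs)

start = mse()
lr = 0.02
for _ in range(4000):                       # stochastic gradient descent
    for x, t in zip(xs, thetas):
        out, h = net(x)
        err = out - t
        for j in range(H):
            grad_h = err * v[j] * (1.0 - h[j] ** 2)   # backprop through tanh
            v[j] -= lr * err * h[j]
            w[j][0] -= lr * grad_h                    # weight from constant neuron
            w[j][1] -= lr * grad_h * x
        c -= lr * err
end = mse()                                 # much smaller than `start` after training
```

After training, the network reproduces the inverse map on inputs it has never seen, which is exactly the property exploited in our application.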
3.1.1 Set–up
General information about neural networks can e.g. be found in [23]. Generally speaking, an ANN consists of a set of “neurons”, collected in “layers”, and “weights” assigned to the connections between these neurons. A neuron can be described by a simple function computing one output value from one input value. At least two layers of neurons are needed: one for the input, and one for the output. In the input layer, there is one neuron for each observable, plus a “constant” neuron which always outputs 1; this is needed in order to be able to describe a constant function, i.e. a component of the network’s output that is independent of the input. Similarly, there is one output neuron for each quantity that we wish to determine. The input of the neurons in the input layer are normalized versions of our observables, and the output of the neurons in the output layer gives the network’s prediction for a normalized version of the quantity whose value we wish to determine; this normalization is described below. In between the input and output layers, there can be an arbitrary number of layers of hidden neurons; in physics applications this number is usually one or two. All neurons in adjacent layers are connected to each other, and a numerical weight is assigned to each of these connections. These weights are the coefficients mentioned above, whose values are to be determined during the training of the network.*
*The term “artificial neural network” reflects the fact that this construction yields a very simple model of actual biological neural networks; however, the (dis)similarity between ANNs and actual brains is immaterial for our discussion.
The ANN we construct here contains a single layer of hidden neurons, as shown in Figure 1. For all neurons in the input and output layers, the output is simply equal to the input. In our case the input layer consists of 85 neurons: one for each of the 84 (normalized) observables, with input values $x_i$, $i = 1, \ldots, 84$, so that the output of the $i$-th neuron in this layer is also given by $x_i$; plus, as noted above, an additional input neuron with a fixed output, $x_0 = 1$.
The second layer consists of a yet to be determined number $N$ of “hidden” neurons with input values $h_j$. These input values are weighted sums of the output of all neurons in the input layer:
h_j = \sum_{i=0}^{84} w^{(1)}_{ji} x_i \,, \qquad j = 1, \ldots, N \,.   (3.1)
Here and in the following we will use $i$ to label the input neurons, $j$ to label the hidden neurons, and $k$ for the output neurons. Each hidden neuron processes its input by applying the hyperbolic tangent, i.e. its output is $o_j = \tanh(h_j)$. This function is commonly chosen for the hidden neurons, because it exhibits approximately linear behavior for small $|h_j|$ but saturates at $\pm 1$ for large values of $|h_j|$. This ensures that the ANN can reproduce a wide variety of functions.^{†} Due to the saturation of the $\tanh$ function at large absolute values of its argument, an ANN constructed in this way may have difficulty reproducing a function which grows or shrinks very quickly when its input variables are varied. Recall, however, that the input of the ANN (during training) is the output of event simulation. In extensions of the Standard Model, the number of events of a given type expected at the LHC typically changes quickly when the model parameters are varied. This implies that the true values of the model parameters change relatively slowly when the observables are varied. Our ANN should be well suited to describe this kind of behavior.
The last layer of the ANN consists of the output neurons, four in number for the CMSSM, with values $y_k$, $k = 1, \ldots, 4$.^{‡} For reasons that will be explained below, at the end we actually do not use one common network for all four CMSSM parameters but instead four separate ANNs with one output parameter each. Their input values are calculated by weighted sums over the output of the hidden neurons:
y_k = \sum_{j=0}^{N} w^{(2)}_{kj} o_j \,, \qquad k = 1, \ldots, 4 \,;   (3.2)
here $o_j = \tanh(h_j)$ is the output of the $j$-th hidden neuron, and the output of the zeroth hidden neuron has been fixed to $o_0 = 1$. These four output values should reproduce the (properly normalized) values of the CMSSM parameters; the normalization is explained below.
Two “weight layers” connect the three layers of neurons. The first weight layer connects each input neuron with each hidden neuron (except for the zeroth hidden neuron, whose output value is fixed to $o_0 = 1$). The weights $w^{(1)}_{ji}$ with $j = 1, \ldots, N$ and $i = 0, \ldots, 84$ are assigned to these connections. Similarly, the connections between the hidden and output neurons have weights $w^{(2)}_{kj}$ with $k = 1, \ldots, 4$ and $j = 0, \ldots, N$ and form the second weight layer. Evidently the number $N$ of hidden neurons determines the number of free parameters describing the ANN. The goal of the training process is to find appropriate (ideally, optimal) values for these weights. Before we describe the algorithm used to set the weights, we discuss the normalization of the input and output values, and the initialization of the weights.
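To make this structure concrete, the forward pass of such a three-layer network can be sketched in a few lines of Python (this is our own illustration, not code from the paper; the toy dimensions stand in for the 85 input neurons and the $N$ hidden neurons):

```python
import math

def ann_forward(x, w1, w2):
    """Forward pass of a three-layer ANN with a single output.
    x  : inputs, with x[0] = 1.0 playing the role of the constant neuron
    w1 : first weight layer, w1[j][i] connects input i to hidden neuron j
    w2 : second weight layer, w2[j] connects hidden neuron j to the output;
         w2[0] multiplies the fixed zeroth hidden output o_0 = 1
    """
    # o_j = tanh(h_j), with h_j the weighted sum over all input neurons
    hidden = [math.tanh(sum(w * xi for w, xi in zip(wj, x))) for wj in w1]
    # the output is a weighted sum over the hidden outputs
    return w2[0] + sum(w * o for w, o in zip(w2[1:], hidden))

# toy example: 3 input neurons (including the constant) and 2 hidden neurons
y = ann_forward([1.0, 0.3, -0.7],
                [[0.1, -0.4, 0.2], [0.0, 0.3, 0.5]],
                [0.05, 0.6, -0.2])
```

With all first-layer weights set to zero the output reduces to the constant w2[0], which illustrates why the constant neurons are needed to represent input-independent components of the output.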
3.1.2 Normalization
The 84 inputs into our ANN are obtained by normalizing the 84 measured observables. This simplifies the initialization of the weights (see below), which in turn results in a shorter learning process. To that end, both the input and output values of the ANN should be dimensionless quantities of order unity. Since the hyperbolic tangent function then naturally leads to hidden layer output values that are also of order unity, the weights in both layers can be of a similar order. In most cases the $\tanh$ functions will then not be in the saturation regime, where they become almost independent of the input value. In general this should speed up the learning process.
We normalize the 84 input quantities independently. Let $O_i^{(t)}$ be the value of observable $i$ for the $t$-th training set, with $t = 1, \ldots, N_T$. The mean value and the variance over all training sets are then:
\bar{O}_i = \frac{1}{N_T} \sum_{t=1}^{N_T} O_i^{(t)} \,;   (3.3)
\sigma_i^2 = \frac{1}{N_T} \sum_{t=1}^{N_T} \left( O_i^{(t)} - \bar{O}_i \right)^2 \,.   (3.4)
From these, we compute normalized inputs $x_i$ for the ANN:
x_i = \frac{O_i - \bar{O}_i}{\sigma_i} \,.   (3.5)
The mean value of the $x_i$ over the training sets is thus zero, with a standard deviation equal to one; hence the absolute values of the inputs should be of order unity, as desired.^{§} Equivalently, one could use the original observables as input, and assign the function (3.5) to the $i$-th input neuron instead of the identity function.
Since we want the ANN outputs to also be of order unity in absolute value, they are related to the CMSSM parameters $P_k$ which we wish to determine via
y_k = \frac{P_k - \bar{P}_k}{\sigma_{P_k}} \,.   (3.6)
The averages $\bar{P}_k$ and the variances $\sigma^2_{P_k}$ are calculated analogously to eqs.(3.3) and (3.4), respectively.
The normalization of both the input and the output is thus calculated from the training sets. The same normalization is then used for the control sets, and for every other ANN input including the (actual or simulated) measurement. The implicit assumption is that the training sets are a good representation of the investigated CMSSM parameter space. If this condition is not satisfied, the ANN results for the CMSSM parameters are in any case likely to be quite inaccurate.
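As an illustration of eqs.(3.3)–(3.5), the following Python sketch (all names are ours) computes the training-set means and standard deviations and returns a function that applies the same normalization to any later input, be it a control set or a measurement:

```python
def make_normalizer(training_obs):
    """training_obs[t][i] = value of observable i in training set t.
    Returns (means, sigmas, normalize), following eqs.(3.3)-(3.5)."""
    n_sets = len(training_obs)
    n_obs = len(training_obs[0])
    means = [sum(row[i] for row in training_obs) / n_sets
             for i in range(n_obs)]
    sigmas = [(sum((row[i] - means[i]) ** 2 for row in training_obs)
               / n_sets) ** 0.5
              for i in range(n_obs)]

    def normalize(obs):
        # the training-set means and sigmas are reused for all later inputs
        return [(o - m) / s for o, m, s in zip(obs, means, sigmas)]

    return means, sigmas, normalize

# toy example: two observables, three training sets
means, sigmas, normalize = make_normalizer([[1.0, 10.0],
                                            [2.0, 20.0],
                                            [3.0, 30.0]])
```

By construction the normalized observables have zero mean and unit variance over the training sets.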
3.1.3 Initialization
The normalization of the overall input and output of the ANN ensures that these quantities are of order unity in absolute size, without preference for either positive or negative values. The weights in the two weight layers should be initialized such that these conditions are also satisfied for the inputs of the neurons in the hidden and output layers. One simple, and by experience quite effective, way to ensure this is to initialize all weights with Gaussian random numbers with mean zero and variance equal to the inverse of the number of neurons in the layer beneath the given weight layer. Hence we set the Gaussian variance for the weights in the first weight layer to $1/85$, and for the second layer to $1/(N+1)$, where $N$ is as before the number of hidden neurons.
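A short sketch of this initialization (our own code, assuming Gaussian weights whose variance is the inverse of the number of neurons feeding into the given weight layer, as just described):

```python
import random

def init_weights(n_in, n_hidden, seed=0):
    """Gaussian weight initialization with mean zero.
    First layer : variance 1/n_in        (n_in input neurons, incl. constant)
    Second layer: variance 1/(n_hidden+1) (incl. the fixed zeroth neuron)
    """
    rng = random.Random(seed)
    s1 = (1.0 / n_in) ** 0.5
    s2 = (1.0 / (n_hidden + 1)) ** 0.5
    w1 = [[rng.gauss(0.0, s1) for _ in range(n_in)]
          for _ in range(n_hidden)]
    w2 = [rng.gauss(0.0, s2) for _ in range(n_hidden + 1)]
    return w1, w2

# dimensions as in the text: 85 input neurons, e.g. 30 hidden neurons
w1, w2 = init_weights(n_in=85, n_hidden=30)
```

This keeps the weighted sums entering the hidden and output neurons of order unity, so that the tanh functions start out away from their saturation regime.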
3.1.4 Learning Procedure
The initial ANN will in general provide a very poor approximation of the desired function. In the case at hand, the initial ANN will most likely give values for the CMSSM parameters that are very far from the true values.
The ANN therefore needs to be “trained”, so that it can “learn” to closely reproduce the desired function. To that end it has to be confronted with sufficiently many “training sets”, where both the input and the desired output are known. In the case at hand, this means that we had to simulate many sets of CMSSM parameters, and compute the corresponding values of our 84 observables, along the lines described in Sec. 2. More details on the choice of the training sets are given below.
An ANN which is trained with specific training sets reproduces the desired output for these training sets more and more exactly with every learning step. If the network includes sufficiently many hidden neurons it will eventually simply “memorize” the training sets, i.e. reproduce their outputs exactly, if the training runs long enough. This may seem desirable at first sight, but actually it is not. The task of the ANN is to interpolate between the training sets, i.e. it should provide (approximately) the correct output also for inputs that are not part of the training sets. Experience shows that at some point further improvement in the reproduction of the training sets degrades the performance of the ANN when applied to different inputs.
In order to determine when the training of the ANN should be terminated one therefore also needs “control sets”. These are generated exactly like the training sets, but they are not used in the training of the ANN. Instead, they are used to define a “control error”. The training of the ANN is stopped when this control error reaches its minimum. This strategy ensures good performance of the neural network as long as both the training and the control sets are good representations of the investigated CMSSM parameter space.
We use the following normalized control error:
E = \frac{\sum_{c=1}^{N_C} \left| \vec{y}^{(c)} - \vec{\hat{y}}^{(c)} \right|^2}{\sum_{c=1}^{N_C} \left| \vec{\hat{y}}^{(c)} - \overline{\vec{\hat{y}}} \right|^2} \,,   (3.7)
where $N_C$ is the number of control sets, $\vec{y}^{(c)}$ are the four output values combined in one vector, calculated by the ANN from the input values of the $c$-th control set, $\vec{\hat{y}}^{(c)}$ are the correct output values and $\overline{\vec{\hat{y}}}$ is the mean value over the output values of all control sets:
\overline{\vec{\hat{y}}} = \frac{1}{N_C} \sum_{c=1}^{N_C} \vec{\hat{y}}^{(c)} \,.   (3.8)
The only quantities in eq.(3.7) that change in the course of the training are the vectors $\vec{y}^{(c)}$, which depend on the current values of the weights defining the ANN. These weights are modified iteratively, such that an error analogous to that defined in eq.(3.7), but computed from the training sets rather than from the control sets, is minimized. Mathematically this amounts to minimizing a (complicated) function of the (many) variables $w^{(1)}_{ji}$ and $w^{(2)}_{kj}$. We do this using the “conjugate gradient” algorithm, as described in Appendix A.
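For the single-output networks we eventually use, the control error of eq.(3.7) reduces to a ratio of scalar sums; a sketch (names ours):

```python
def control_error(y_pred, y_true):
    """Normalized control error, eq.(3.7), for a single-output ANN:
    summed squared deviations of the predictions from the true outputs,
    divided by the spread of the true outputs around their mean."""
    mean = sum(y_true) / len(y_true)
    num = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    den = sum((t - mean) ** 2 for t in y_true)
    return num / den
```

The normalization makes the error easy to interpret: a network that always returns the mean of the true outputs has error 1, while a perfect network has error 0.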
Recall that $\vec{y}^{(c)}$ and $\vec{\hat{y}}^{(c)}$ are normalized output values. If the ANN has several outputs, the contribution of each output variable to the total error therefore depends on the spread of this variable within the training sets. The algorithm for adjusting the weights ensures that the total training error decreases during the training; however, this does not guarantee that the performance for each individual output, i.e. for each component of $\vec{y}$, improves monotonically. We therefore found it convenient to define four separate ANNs for the four CMSSM parameters, as already noted. For each of our ANNs the vectors $\vec{y}^{(c)}$ and $\vec{\hat{y}}^{(c)}$ collapse to simple real numbers. The splitting into four separate networks also offers the possibility of a further specialization of each network. Furthermore the four networks are smaller, so that their training takes less time than the training of one big network with a higher number of hidden neurons. Note, however, that all four ANNs have identical inputs, i.e. each of them receives the same 84 (normalized) observables as input.
The training of the ANNs thus proceeds as follows. The initial values of the weights are used to define initial errors for the training and control sets. The weights are then adjusted iteratively, such that the error computed for the training sets is minimized. After each learning step the control error is calculated. The learning process is stopped when this control error reaches its minimum.^{* }^{* }* Had we used a single ANN, with a combined control error as defined in eq.(3.7), the training would have to be stopped when the total error is minimal. This does not guarantee that each of the four individual errors, on the four CMSSM parameters, is minimal. This is another argument in favor of using four separate ANNs. Since the control error is not used for the determination of the new weights, it need not decrease monotonically during the iteration. The learning procedure should therefore not be stopped until one can be reasonably sure that the absolute minimum of the control error has been passed. Finally, the weights have to be re–set to the values that gave this absolute minimum.
This is illustrated in Fig. 2. The triangles (blue) show the normalized training errors and the dots (red) the control errors. We see that both errors initially decrease very quickly. The control error reaches a first minimum after about learning steps, increases again, and finally reaches its absolute minimum after steps. In contrast, the training error keeps decreasing until the iteration is stopped after steps; the control error at the end of the iteration is nearly larger than its absolute minimum. At the end the ANN is therefore re–set to its status after learning steps.
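The stopping rule just described — train well past the first minima of the control error, then re-set the weights to the overall best point — is ordinary early stopping. A generic sketch of ours, in which `step_fn` stands for one conjugate-gradient learning step and `control_err_fn` for the evaluation of eq.(3.7):

```python
def train_with_early_stopping(step_fn, control_err_fn, max_steps, patience):
    """Run learning steps while remembering the weights with the smallest
    control error; stop once `patience` consecutive steps bring no
    improvement, and return the best weights and their control error."""
    best_err, best_weights, since_best = float("inf"), None, 0
    for _ in range(max_steps):
        weights = step_fn()             # one learning step, returns weights
        err = control_err_fn(weights)   # the control error is only monitored
        if err < best_err:
            best_err, best_weights, since_best = err, weights, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_weights, best_err

# toy "training": the weight is a single number that increases each step,
# and the control error has its minimum at w = 3
state = {"w": 0.0}
def step():
    state["w"] += 1.0
    return state["w"]

best_w, best_err = train_with_early_stopping(
    step, lambda w: (w - 3.0) ** 2 + 1.0, max_steps=100, patience=5)
```

In practice `patience` must be chosen large enough that one can be reasonably sure the absolute minimum of the control error, not merely a local one, has been passed.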
3.1.5 Training and Control Sets
The performance of the trained ANNs obviously depends on the quality of the training and control sets. Note first of all that these sets are supposed to show the true relation between the 84 observables and the CMSSM parameters. The errors on the observables in the training and control sets, in particular the error due to finite Monte Carlo statistics, should therefore be (much) smaller than the statistical accuracy of the (simulated) measurement that the ANNs are finally meant to analyze. We therefore used of simulated data for the training and control sets; of course, the number of events is then rescaled to the assumed luminosity of the “measurement”.
Moreover, we already emphasized that the training and control sets should be good representations of the investigated CMSSM parameter space. Therefore for each benchmark point we set different ranges for the input parameters of the simulation; the actual CMSSM parameters for each training and control set are chosen randomly using a flat distribution.^{†} Furthermore each simulation in Herwig++ is done with a different randomly chosen seed, to prevent any possible correlations.
In general it should be easier to determine the parameters $m_0$ and $m_{1/2}$, since they influence the superparticle spectrum more strongly than the other two continuous parameters, $A_0$ and $\tan\beta$, do. One combination of $m_0$ and $m_{1/2}$ can e.g. be determined with some accuracy simply from the total rate of signal events after cuts. Next, one could use some inclusive kinematical observable, and/or the jet multiplicity in signal events, to determine the region of interest in the $(m_0, m_{1/2})$ plane. In the following we therefore use smaller parameter ranges for these two parameters than for $A_0$ and $\tan\beta$. We do this just to save CPU hours. This slight “cheat” allows us to generate, with our limited computer resources, sufficiently many training and control sets in the vicinity of our benchmark points to allow good interpolation. The above qualitative discussion indicates that our ANNs could most likely also be trained on the “entire” CMSSM parameter space. However, we would then need to generate (many) more training and control sets. Moreover, the ANNs would presumably need more hidden neurons, making their training even more time consuming.
We use the following parameter ranges for our four benchmark points:

, , and

, , and

, , and

, , and
The range for $A_0$ depends on the value chosen for $m_0$, in order to make sure that there is no problem with the generation of the superparticle spectrum in SOFTSUSY.^{‡} A large ratio $|A_0|/m_0$ could e.g. result in tachyonic sfermions. We simulate slightly more than training sets and around control sets for each benchmark point. We checked that adding additional control sets does not improve the results; recall that these sets are only needed to determine when the training of the ANNs should be terminated. The total number of simulated parameter sets was determined by our available computing resources. Note, however, that the average distance between neighboring values of any CMSSM parameter in the training sets is much smaller than our final estimate of the error with which this parameter can be determined by our ANNs (see Sec. 4), at least for the smaller integrated luminosity of ; i.e. the final (multi–dimensional) error ellipsoid should already contain many training sets. Further increasing the number of training sets is therefore not likely to significantly improve the final performance of our ANNs.
3.1.6 Improving the Performance
The “measured” observables resulting from each simulated parameter set have statistical uncertainties, which are saved in the covariance matrix. Recall, however, that only the values of the observables themselves, but not the corresponding covariance matrix, are used as input into our ANNs. It would be desirable if the ANN knew with what precision an input observable is determined, in order to be able to decide how important its value is for the determination of the CMSSM parameters. Recall in particular that different observables can have quite different (relative) errors; see Table 2. All else being equal, observables with larger uncertainties should contribute with smaller weights. It is therefore quite evident that knowledge of the covariance matrix should improve the performance of our ANNs significantly.
Recall that we use of simulated data for the training and control sets. This means that the Monte Carlo (theory) error is essentially negligible when comparing with simulated measurements based on an integrated luminosity of . We will also show results for simulated measurements which assume the same integrated luminosity of . Here the Monte Carlo theory error is still significant; our computer resources do not allow us to reduce it further.
Directly feeding the (non–vanishing) entries of the covariance matrix as additional inputs into our ANNs would greatly increase their complexity. Instead, we take two measures in order to include the effect of (statistical) uncertainties.
First, we simply omit very noisy observables, i.e. quantities that have very large errors. To that end, as in the calculation in [14], we require a minimal number of events of a given class for taking the observables into account. Obviously observables computed from a small number of events will have a large error. For the training and control sets the observables are only included if the total number of events in a given class after cuts satisfies ; the total number of events is only included if it exceeds 50 (in practice this is always the case). For the “measurements” the corresponding thresholds are changed to and , respectively. The luminosity dependence of these thresholds means that for a specific set of CMSSM parameters the same observables are included (up to statistical fluctuations).
Recall that an ANN is defined with a fixed number of input neurons; therefore we still have to assign some value to each observable. The normalized input value of an observable which does not fulfill the minimal event requirement is therefore set to zero. This means that this input neuron does not contribute to the weighted sums which are calculated within the neural network, independent of the weights of the connections exiting this neuron.
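This zero-setting amounts to a simple mask over the normalized inputs (sketch, names ours):

```python
def masked_inputs(norm_obs, n_events, threshold):
    """Zero out the normalized input of every observable whose event class
    has fewer than `threshold` events after cuts; a zero input drops out
    of all first-layer weighted sums, whatever the attached weights."""
    return [x if n >= threshold else 0.0
            for x, n in zip(norm_obs, n_events)]

# the second observable comes from a class with only 8 events and is dropped
masked = masked_inputs([0.4, -1.2, 0.7], [120, 8, 60], threshold=20)
```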
This procedure ensures that all non–zero inputs into our ANNs should have some statistical power for determining the CMSSM parameters we are after. However, this still does not tell the ANNs the relative accuracies of these observables. To that end, we use our knowledge of the covariance matrix of the observables for a particular training set to determine an up to 84–dimensional (correlated) Gaussian distribution. Here both the mean values and the covariance matrix are taken from the training sets, independent of the luminosity of the (simulated) measurement to which the ANN will eventually be applied. Of course, the overall width of these multi–dimensional Gaussian distributions (one for each original set of CMSSM parameters chosen for a given training set) should scale like the inverse square root of the integrated luminosity. However, the purpose of this trick is to teach the ANN the relative size of the errors on the various observables that are fed into the ANN, so that it can assign appropriate weights. This relative weighting should be independent of the luminosity.
For each combination of CMSSM parameters chosen for the training sets, we then randomly generate sets of observables from the corresponding up to 84–dimensional Gaussian distribution.^{*} Note that these “satellite sets” are produced directly from the multi–dimensional Gaussian; no additional event generation is needed here. From the point of view of the ANNs, these sets have (slightly) different inputs, but exactly the same outputs (CMSSM parameters). Altogether each ANN is thus trained on slightly more than sets of inputs yielding slightly more than different outputs.
It is interesting to note that this enlarged set of training sets can no longer be described as a function in the mathematical sense: due to statistical fluctuations, there might be training sets with the same (sets of) observable(s) but different output values.^{† }^{† }† Of course, the ANNs still are functions, assigning a unique output value to each set of input values. In fact, for each observable we have a certain range of values, depending on the corresponding variance, that can lead to the same CMSSM parameters. The bigger this range is the less important the input value should be for the parameter determination. If the variance is large, during the learning process the ANNs are confronted with strongly varying input values for the same output value. This should allow them to recognize that this observable cannot contribute much to the determination of the CMSSM parameters, by reducing the appropriate weight. So the creation of Gaussian distributed variants of the original training set should lead to a higher weighting of input values with small errors.
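The generation of such correlated Gaussian “satellite sets” can be sketched as follows (our own code; we factorize the covariance matrix by hand where a library routine for multivariate normal sampling would normally be used, and the two-dimensional example is purely illustrative):

```python
import random

def cholesky(cov):
    """Lower-triangular factor L of a covariance matrix, cov = L L^T."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = ((cov[i][i] - s) ** 0.5 if i == j
                       else (cov[i][j] - s) / L[j][j])
    return L

def satellite_sets(mean, cov, n_sets, seed=0):
    """Draw correlated Gaussian variations of one training set's
    observables; every draw keeps the same CMSSM output parameters."""
    rng = random.Random(seed)
    L = cholesky(cov)
    sets = []
    for _ in range(n_sets):
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        sets.append([m + sum(L[i][k] * z[k] for k in range(len(z)))
                     for i, m in enumerate(mean)])
    return sets

# two correlated observables with means 100 and 50
sets = satellite_sets([100.0, 50.0], [[25.0, 5.0], [5.0, 9.0]], n_sets=2000)
```

Each satellite set scatters around the original observables according to the training-set covariance matrix, while its target output stays fixed.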
3.1.7 Output Error
We finally end up with four trained ANNs per benchmark point, one for each CMSSM parameter. The 84 observables obtained from any (simulated) experiment can now be used as input for the ANNs, which will then produce their estimates of the corresponding CMSSM parameters. In order to be able to judge the accuracy of these estimates, it is crucial to also obtain estimates for the uncertainties; ideally we would like to be able to quote the parameters with well–defined (Gaussian) statistical errors. Our ANNs by themselves do not provide such estimated errors.
We investigated two different ways to determine the errors on the outputs. The first possibility is to expand the set of input values of the measurement to multiple Gaussian distributed input sets. As in the construction of the “satellite sets” from the original training sets, the known covariance matrix of the measurement can be used to create a multitude of Gaussian distributed versions of the (simulated) measurement. Each version is then used as an input for the ANNs. If Gaussian statistics is applicable, the ANNs’ outputs should also form Gaussian distributions. The square root of the variance of an output distribution then gives the error of the corresponding CMSSM parameter.
This is illustrated in Fig. 3, which shows the distribution of reconstructed values for benchmark point 3. This distribution has been obtained by feeding the appropriate ANN with Gaussian distributed versions of the simulated measurement. We see that the true value is recovered without significant off–set. Moreover, the distribution can be described very well through a Gaussian fit. The result of this ANN for benchmark point 3 is thus .
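The first method can be sketched as follows (names ours); instead of fitting a Gaussian we simply take the sample mean and standard deviation of the network outputs, which coincides with the fit result when the distribution is Gaussian:

```python
def output_estimate(ann, smeared_inputs):
    """Push many Gaussian-smeared versions of the measurement through a
    trained network; return the mean of the outputs as the central value
    and their sample standard deviation as the error."""
    outs = [ann(x) for x in smeared_inputs]
    n = len(outs)
    mean = sum(outs) / n
    var = sum((o - mean) ** 2 for o in outs) / (n - 1)
    return mean, var ** 0.5

# toy "network": a linear function of two smeared inputs
center, error = output_estimate(lambda x: 2.0 * x[0] + x[1],
                                [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])
```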
In order to determine the covariance matrix between the four CMSSM parameters we fit two–dimensional Gaussians to the distribution of each pair of outputs. To that end we feed the same sets of input values into both relevant ANNs; each set of input values then gives one pair of output values. We repeat that for all Gaussian distributed input sets. The set of output pairs then forms a two–dimensional (correlated) distribution and the corresponding variances and the covariance can be determined through a fit. Within rounding errors, e.g. due to different binning, the variances should be exactly the same as the ones determined from one–dimensional Gaussians.
This is illustrated in Fig. 4, which shows the two–dimensional distribution of reconstructed values for benchmark point 3. The left frame shows a three–dimensional view and the right frame a two–dimensional contour plot. The fit yields and ; the result for agrees with that of Fig. 3, as expected. There is a very mild negative correlation between these two parameters, with correlation coefficient . This coefficient is related to the entries of the covariance matrix through the formula
\rho = \frac{{\rm cov}(P_k, P_l)}{\sigma_{P_k} \sigma_{P_l}} \,.   (3.9)
Equivalently, the angle $\alpha$ between the major axis of the error ellipse and the horizontal axis in the right frame of Fig. 4 is given by:
\tan(2\alpha) = \frac{2 \, {\rm cov}(P_k, P_l)}{\sigma^2_{P_k} - \sigma^2_{P_l}} \,.   (3.10)
The ellipse is thus tilted slightly towards the bottom right.
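Eqs.(3.9) and (3.10) translate directly into code (sketch, names ours); with the `atan2` convention a negative covariance gives a negative tilt angle, i.e. an ellipse leaning towards the bottom right:

```python
import math

def correlation_and_tilt(var1, var2, cov):
    """Correlation coefficient, eq.(3.9), and tilt angle of the error
    ellipse, eq.(3.10): tan(2*alpha) = 2*cov / (var1 - var2)."""
    rho = cov / (var1 ** 0.5 * var2 ** 0.5)
    alpha = 0.5 * math.atan2(2.0 * cov, var1 - var2)
    return rho, alpha

# mildly anticorrelated toy parameters
rho, alpha = correlation_and_tilt(4.0, 1.0, -0.4)
```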
The second method of determining errors on the parameter estimates of the ANNs uses Gaussian error propagation. Since we know the covariance matrix of the input values as well as the neural network functions, we can directly calculate the variances and covariances of the output values. The variance of the output value $P_k$ is then:
\left( \Delta P_k \right)^2 = \sum_{m=1}^{n} \left( \frac{\partial F_k}{\partial O_m} \right)^2 \sigma^2_{O_m} + \sum_{m \neq m'} \frac{\partial F_k}{\partial O_m} \frac{\partial F_k}{\partial O_{m'}} \, {\rm cov}(O_m, O_{m'}) \,.   (3.11)
Here $\sigma_{O_m}$ is the standard deviation of the input observable $O_m$ and ${\rm cov}(O_m, O_{m'})$ is the covariance between the observables $O_m$ and $O_{m'}$; explicit expressions for these quantities are listed in Sec. 2.1. The sums run only over those $n$ observables that have not been set to zero when we removed observables with large statistical uncertainty as described in Sec. 3.1.6, i.e. $n$ is in practice significantly smaller than 84. Finally, $F_k$ is the function mapping the (unnormalized) observables onto the estimated CMSSM values. Its derivative has the form:^{‡} Since each ANN has only one output, we removed the first index on the weights in the second layer, cf. eq.(3.2).
\frac{\partial F_k}{\partial O_m} = \frac{\sigma_{P_k}}{\sigma_m} \sum_{j=1}^{N} w^{(2)}_j \left[ 1 - \tanh^2 \left( \sum_{i=0}^{84} w^{(1)}_{ji} x_i \right) \right] w^{(1)}_{jm} \,.   (3.12)
Recall that $N$ is the number of hidden neurons. Moreover, as described in Sec. 3.1.2, the ANNs are trained to produce inversely normalized output values from normalized inputs; the corresponding derivatives $\partial x_m / \partial O_m = 1/\sigma_m$ and $\partial P_k / \partial y_k = \sigma_{P_k}$ can be read off eqs.(3.5) and (3.6), respectively. The training set standard deviation $\sigma_m$ for the $m$-th input value is calculated using equation (3.4), and similarly $\sigma_{P_k}$ for the output value. The $x_i$ are the normalized input values of the (simulated) measurement; again only the inputs that have not been set to zero need to be taken into account.
Similarly, the covariance between two CMSSM parameters $P_k$ and $P_l$ is
{\rm cov}(P_k, P_l) = \sum_{m, m'} \frac{\partial F_k}{\partial O_m} \frac{\partial F_l}{\partial O_{m'}} \, {\rm cov}(O_m, O_{m'}) \,.   (3.13)
It is reassuring to note that the covariances derived in this way are close to the values obtained by fitting Gaussians to the distributions of reconstructed CMSSM parameters. For example, for benchmark point 3 we find and ; recall that the corresponding values derived from the Gaussian fits are and , respectively. We will see later that the two methods agree even more closely if a higher integrated luminosity is used for the “measurement”.
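Both eq.(3.11) and eq.(3.13) are instances of the same quadratic form; given the matrix of derivatives from eq.(3.12), the propagation can be sketched as (names ours):

```python
def propagated_covariance(dF, cov):
    """Gaussian error propagation: entry (k, l) of the returned matrix is
    sum_{m, m'} dF[k][m] * dF[l][m'] * cov[m][m'], i.e. eq.(3.13); the
    diagonal entries reproduce the variances of eq.(3.11)."""
    n_par, n_obs = len(dF), len(cov)
    return [[sum(dF[k][m] * dF[l][mp] * cov[m][mp]
                 for m in range(n_obs) for mp in range(n_obs))
             for l in range(n_par)]
            for k in range(n_par)]

# toy example: two parameters depending on two correlated observables
C = propagated_covariance([[1.0, 0.0], [0.0, 2.0]],
                          [[1.0, 0.5], [0.5, 2.0]])
```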
3.1.8 Number of Hidden Neurons
In order to determine the appropriate number $N$ of hidden neurons for each ANN we start with a low number (like 10) and train the corresponding network as described in Sec. 3.1.4. Note that each additional hidden neuron adds 86 free parameters to the description of an ANN with one output neuron: 85 weights connecting it to the input layer and one weight connecting it to the output. As discussed in the Appendix, the training of the ANN involves the repeated computation of a matrix whose dimension grows linearly with $N$. For a fixed number of learning steps the time needed to train an ANN therefore grows rapidly with $N$. Hence the number of hidden neurons should not be increased needlessly.
After the training, the distribution of estimated outputs for the given benchmark point is calculated as described in the first part of Sec. 3.1.7. If this distribution is (approximately) Gaussian, the number of hidden neurons should be sufficient. On the other hand, if the distribution does not look very Gaussian, the number of hidden neurons is increased and the process is iterated, until a satisfactory result is achieved.^{*} The ANNs were often not trained until the global minimum of the control error was reached before looking at the distribution of output values. Instead this distribution was checked already when the first local minimum of the control error was reached; if this distribution looked very non–Gaussian, the ANN was discarded and $N$ was increased.
This completes our discussion of the construction of the ANNs. Before reporting our results for the four benchmark points, we describe an alternative strategy to determine the CMSSM parameters, based on $\chi^2$ minimization.
3.2 $\chi^2$ Minimization
The preceding discussion shows that quite a lot of (computational) effort is required to construct ANNs that are able to produce reliable estimates of CMSSM parameters from simulated measurements. Moreover, a trained ANN is a “black box”; it is very difficult to get a feeling for how the input affects the output.
We therefore also attempted to derive estimates for the CMSSM parameters by minimizing a $\chi^2$ function. At least conceptually this is far simpler than training ANNs. However, it turns out that fluctuations due to the finite Monte Carlo statistics used in deriving the predictions of the model make it very difficult to derive estimates of the parameters with statistically meaningful errors in this way. In this subsection we describe the method we used for the $\chi^2$ minimization and show the results for benchmark point 4.
We wish to minimize the $\chi^2$ between the simulated measurement and the predictions, which is given by
\chi^2 = \sum_{m, m'} \left( O_m^{\rm meas} - O_m^{\rm pred} \right) \left( V^{-1} \right)_{m m'} \left( O_{m'}^{\rm meas} - O_{m'}^{\rm pred} \right) \,.   (3.14)
$O_m^{\rm meas}$ are the observables of the simulated measurement and $O_m^{\rm pred}$ are the predictions for these observables derived for a particular set of CMSSM parameters. The double sum runs over all observables that have been derived from a minimum number of events (see below), and $V^{-1}$ is the inverse covariance matrix of the relevant observables, with entries
V_{m m'} = {\rm cov}\left( O_m^{\rm meas}, O_{m'}^{\rm meas} \right) + {\rm cov}\left( O_m^{\rm pred}, O_{m'}^{\rm pred} \right) \,.   (3.15)
The second term in eq.(3.15) takes into account the statistical error on the prediction due to the finite size of the Monte Carlo sample that has been used for its calculation. In [14] it has been shown that this function has the statistical properties of a true $\chi^2$ distribution, provided we only include observables that have been computed from sufficiently many events. In particular, for of simulated integrated luminosity the total number of events is taken into account if the measurement or the prediction has at least one event after cuts; for our benchmark points this is always the case. The event ratio for a class is included in the definition of $\chi^2$ if the class contains at least events for the measurement or the prediction. All other observables are only included if for both the measurement and the prediction.^{*} Some of these minimal event numbers differ from the ones used in [14]. The benchmark points investigated here yield on the order of events after cuts for of data, as shown in Sec. 2.2. In contrast the parameter sets analyzed in [14] yielded on average around events after cuts. Using as the minimal event number for all classes gave slightly better results than the old choice of for and , for and , and for and . For a higher integrated luminosity the thresholds are scaled up accordingly.
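Given the inverse of the combined covariance matrix of eq.(3.15), the $\chi^2$ of eq.(3.14) is a quadratic form in the measurement–prediction differences (sketch, names ours):

```python
def chi_squared(meas, pred, v_inv):
    """chi^2 of eq.(3.14): d^T V^{-1} d with d_m = meas_m - pred_m.
    v_inv is the inverse of V, eq.(3.15), i.e. of the sum of the
    measurement covariance and the Monte Carlo covariance of the
    prediction."""
    d = [m - p for m, p in zip(meas, pred)]
    return sum(d[m] * v_inv[m][mp] * d[mp]
               for m in range(len(d)) for mp in range(len(d)))

# uncorrelated example: V is diagonal, so chi^2 is a sum of squared pulls
chi2_value = chi_squared([10.0, 5.0], [8.0, 5.0],
                         [[0.25, 0.0], [0.0, 1.0]])
```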
Depending on the number of events that are generated, i.e. the assumed integrated luminosity, computing the prediction for a single set of CMSSM parameters can already be quite costly in terms of CPU time. We therefore perform the minimization in two steps. The first step is supposed to roughly identify the correct region of parameter space in which the minimum lies; the second step should then pin down the location of the minimum. In both steps we compute the predictions using of simulated data.
In the first step we use an algorithm based on simulated annealing. We begin by computing the prediction for a randomly chosen set of parameters, and compute the corresponding $\chi^2$. We then randomly vary one of the CMSSM parameters, leaving the other three parameters unchanged. If the resulting $\chi^2$ is smaller than the previous one, the corresponding parameter set is taken as the new best guess for the location of the minimum. If the new $\chi^2$ value is bigger than the previous one, there is still a finite probability that the new parameters are picked as the new best guess; this probability decreases exponentially with the difference between the two $\chi^2$ values. This approach should prevent the algorithm from getting stuck in a local minimum. The same parameter is then changed again if a new minimum was not picked; this loop is terminated if a given number of tries does not improve the minimization. After that the second parameter is changed, and so on. When all four parameters have been scanned in this manner, the scanning starts at the first parameter again. The procedure is continued until a given number of steps (here 250) is reached.
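A minimal sketch of this parameter-by-parameter scan (our own illustration with a toy $\chi^2$ function; the proposal width, temperature and numbers of tries are arbitrary and not the settings used in the paper):

```python
import math
import random

def anneal_parameter(chi2_fn, params, idx, width, temp, n_tries, rng):
    """One annealing pass over a single parameter: propose random changes
    of params[idx], always accept improvements, and accept a worse chi^2
    with probability exp(-delta_chi2 / temp)."""
    current = chi2_fn(params)
    for _ in range(n_tries):
        trial = list(params)
        trial[idx] += rng.gauss(0.0, width)
        c = chi2_fn(trial)
        if c < current or rng.random() < math.exp(-(c - current) / temp):
            params, current = trial, c
    return params, current

# toy chi^2 with its minimum at (1, 2)
def toy_chi2(p):
    return (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2

rng = random.Random(7)
p, c = [0.0, 0.0], toy_chi2([0.0, 0.0])
for _ in range(50):                  # repeated cycles over the parameters
    for idx in (0, 1):
        p, c = anneal_parameter(toy_chi2, p, idx,
                                width=0.3, temp=0.05, n_tries=20, rng=rng)
```

The finite acceptance probability for uphill moves is what allows the scan to escape local minima; lowering `temp` gradually would recover a pure downhill search.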
Recall that our goal in the first step is just to get a rough estimate of the CMSSM parameters. Therefore, we did not put much effort into setting the available parameters of the simulated annealing algorithm. Nevertheless this step is useful, in particular for narrowing down the range for . The size of approximately determines the maximal allowed range of . This is important for the second step of the minimization, in which the parameters should be determined more exactly.
In this second step we use the minimization algorithms “Simplex” and “Migrad” of TMinuit [24] in the program ROOT. For these algorithms we set allowed parameter ranges in order to avoid parameter choices where simulation problems occur (e.g. very large can lead to problems with the generation of the superparticle and Higgs spectrum). These algorithms also work better if the starting point is close to the true minimum and its distance to the minimum is roughly known. Therefore we use the output of the simulated annealing algorithm as starting point; the $\chi^2$ value corresponding to this starting point gives us a rough idea how close we are to the true minimum.
These algorithms eventually converge onto a new best guess for the location of the minimum. Since Minuit has been designed specifically for $\chi^2$ minimizations, it even gives estimates for the statistical errors of the determined parameters. However, it turns out that these error estimates are almost always much too small. This can be traced back to the statistical fluctuations of the predictions. The final minimum found is nearly always produced by a rather extreme Monte Carlo fluctuation in the prediction. Such a fluctuation is likely to occur only in a region of parameter space around the true CMSSM parameters, i.e. the location of this minimum is likely not too far from the location of the true minimum computed from predictions with negligible error. However, in our simulation the fluctuation frequently reduced $\chi^2$ by several units, leading to a very steep, but spurious, minimum. The width of this spurious minimum, which is what Minuit attempts to estimate, is then a very poor estimate of the actual error on the CMSSM parameters.
This problem can be ameliorated by increasing the integrated luminosity used for the computation of the predictions, which reduces the fluctuations. Alas, these fluctuations only decrease inversely to the square root of the number of generated events, whereas the CPU time needed to generate them obviously scales linearly with this number. Note that Minuit needs several hundred search steps to converge on a minimum, i.e. several hundred sets of CMSSM parameters have to be simulated for each