Edinburgh 2011/15
IFUM-981-FT
FR-PHENO-2011-013
RWTH TTK-11-32
Reweighting and Unweighting of Parton Distributions
and the LHC W lepton asymmetry data
The NNPDF Collaboration:
Richard D. Ball, Valerio Bertone, Francesco Cerutti, Luigi Del Debbio,
Stefano Forte, Alberto Guffanti, Nathan P. Hartland, José I. Latorre, Juan Rojo and Maria Ubiali.
Tait Institute, University of Edinburgh,
JCMB, KB, Mayfield Rd, Edinburgh EH9 3JZ, Scotland
Physikalisches Institut, Albert-Ludwigs-Universität Freiburg,
Hermann-Herder-Straße 3, D-79104 Freiburg i. B., Germany
Departament d’Estructura i Constituents de la Matèria, Universitat de Barcelona,
Diagonal 647, E-08028 Barcelona, Spain
Dipartimento di Fisica, Università di Milano and INFN, Sezione di Milano,
Via Celoria 16, I-20133 Milano, Italy
The Niels Bohr International Academy and Discovery Center,
The Niels Bohr Institute, Blegdamsvej 17, DK-2100 Copenhagen, Denmark
Institut für Theoretische Teilchenphysik und Kosmologie, RWTH Aachen University,
D-52056 Aachen, Germany
Abstract:
We develop in more detail our reweighting method for incorporating new datasets in parton fits based on a Monte Carlo representation of PDFs. After revisiting the derivation of the reweighting formula, we show how to construct an unweighted PDF replica set which is statistically equivalent to a given reweighted set. We then use reweighting followed by unweighting to test the consistency of the method, specifically by verifying that results do not depend on the order in which new data are included in the fit via reweighting. We apply the reweighting method to study the impact of LHC W lepton asymmetry data on the NNPDF2.1 set. We show how these data reduce the PDF uncertainties of light quarks in the medium and small $x$ region, providing the first solid constraints on PDFs from LHC data.
1 Introduction
In a series of previous papers [1, 2, 3, 4, 5, 6, 7, 8, 9], we constructed increasingly accurate sets of parton distributions (PDFs), using a Monte Carlo approach coupled to the use of neural networks as underlying interpolating functions. By definition, a PDF set provides a representation of a probability density in the space of parton distributions, i.e. a probability density in a space of functions [10, 11, 12]. We have performed various tests that confirm that NNPDF parton sets do indeed behave in a way which is consistent with the desired statistical properties of functional probability densities.
An advantage of providing a Monte Carlo representation of this PDF probability density is that new information (such as might be provided by new experimental data) can be included, using Bayes’ theorem, by reweighting an existing PDF set, without having to perform a new PDF fit [13, 14]: it is possible to determine a reweighting factor for each Monte Carlo replica in such a way that the information contained in the new data is included by simply computing weighted averages. This approach was first successfully developed and implemented in Ref. [14], where it was explicitly shown, in studies involving CDF and D0 inclusive jet data, that results obtained by reweighting are equivalent to those found by including the new data in the fit.
Reweighting takes a set of equally likely PDF replicas generated by importance sampling, and assigns to them weights reflecting their relative probabilities in the light of new data not included in the original fit. In this paper we develop a second technique, which we call ‘unweighting’, which takes the reweighted set and replaces it with a new set of replicas which are again all equally probable. This new set of replicas can then be used in precisely the same way as a fitted set. Even though no new information is gained by unweighting, presenting reweighted PDFs in the same form as a corresponding refitted set has various obvious practical advantages.
Furthermore, unweighting allows us to perform a highly nontrivial test of the reweighting procedure: namely, we take two new independent datasets, and use them to sequentially improve an existing set of replicas. This may be done in either order, or indeed by treating them as one (combined) dataset. All three methods should yield equivalent results. Checking that this is the case provides a strong test of the method. However, this can only be done if we unweight after each reweighting, because our simple closed-form expression for the weights can only be used for the reweighting of an equally probable (i.e. unweighted) set of PDFs.
We perform this check by first taking the NNPDF2.0 NLO DIS+DY fit [7], based on deep-inelastic and Drell-Yan data only, and taking as new datasets the CDF [15] and D0 [16] Run II inclusive jet data. This completes and refines the studies of Ref. [14], where it was verified that the inclusion of the combined CDF+D0 jet data by reweighting or refitting gives equivalent results. We then perform a second check using as the prior the NNPDF2.1 DIS fit [8], based on deep-inelastic data only, and taking as new datasets the E605 [17] Drell-Yan and Tevatron inclusive jet data. This provides a somewhat different test, because while the D0 and CDF data used in the previous test measure the same observable in the same kinematic region, the Drell-Yan and jet data affect different PDFs in different kinematic regions.
Besides its practical usefulness, the combined reweighting plus unweighting procedure is important because it allows one, at least in principle, to perform a global PDF fit by sequentially including new data by reweighting a generic prior distribution of PDFs [13]. If the information contained in the new data is sufficiently precise, and the prior distribution sufficiently broad, the results will then be largely independent of the prior one starts from: this would then give completely unbiased PDFs. In practice, this procedure is unlikely to be viable because, in order to get accurate results, the prior set of PDF replicas would have to be huge. However, the equivalence of PDFs obtained from reweighting with those determined using a fitting procedure (such as the NNPDF sets) confirms that the latter are also unbiased.
Following the success of these consistency tests, we use reweighting to evaluate the impact on the NNPDF2.1 NLO fit of Ref. [8] of recent LHC data on the lepton asymmetry from the ATLAS and CMS collaborations. Using unweighting, we are able to produce a new PDF set, NNPDF2.2, which incorporates the effect of these data and of the older lepton asymmetry data from D0.
The outline of this paper is as follows. In Sec. 2 we revisit the derivation of the reweighting method, in particular the determination of the weights in terms of the $\chi^2$ of the fit of the new data to each replica, and we discuss some subtle issues that were not tackled in Ref. [14], related to the definition of the measure in data space and to the inclusion by reweighting of multiple data sets. Then, in Sec. 3 we present our method of unweighting reweighted PDF sets, to give a set of replicas which are all equally probable, and show that indeed the unweighted set is equivalent to the original reweighted set. We follow this in Sec. 4 with a study of the consistency of the combined reweighting and unweighting procedure, when applied to more than one dataset in turn. After this theoretical study, we turn to phenomenology by using the method to investigate the impact of LHC measurements of the lepton asymmetry on PDFs. First, we show in Sec. 5 how these data reduce the PDF uncertainties of light quarks in the medium and small-$x$ region, providing the first solid constraints on PDFs from LHC data, and then in Sec. 6 we construct a new set of NLO PDFs, NNPDF2.2, which includes, on top of all the data used to determine NNPDF2.1 PDFs, also the D0 asymmetry data already discussed in Ref. [14] and the LHC data discussed in Sec. 5.
2 Reweighting
In this section we revisit the derivation of the weight formula for reweighting ensembles of PDFs. In particular we discuss some of the more subtle issues in the formal proof presented in Ref. [14]. The derivation of the formula for the computation of the weights is nontrivial because we are dealing with probability densities in multidimensional spaces. In particular we need to avoid the ambiguities that can appear when dealing with conditional probabilities with respect to an event of probability zero, the so-called Borel-Kolmogorov paradox [18]. The conditional probabilities need to be defined carefully as integrations of conditional probability densities over finite volumes, in the limit when these volumes are taken to zero.
2.1 Integration over the data space
Bayes’ theorem can be stated in terms of probability densities:
$$\mathcal{P}(f|y)\,Df\;\mathcal{P}(y)\,d^ny = \mathcal{P}(y|f)\,d^ny\;\mathcal{P}(f)\,Df, \tag{1}$$
where $Df$ is the integration measure in the space of PDFs, and $d^ny$ is the integration measure in the space of data. The latter is an $n$-dimensional real space, where $n$ is the number of data points used for reweighting. $\mathcal{P}(f)$ is the prior density in the space of the PDFs: it is represented by the set of PDF replicas $\{f_k\}$. These are all equally probable, e.g., the expected PDF is simply determined as the average over the set, $\langle f\rangle = \frac{1}{N}\sum_{k=1}^{N} f_k$, and are determined by importance sampling by starting from experimental data [11]. $\mathcal{P}(f|y)$ is instead the new probability density, given the data points $y$. Note that here, unlike in Ref. [14], we do not make explicit the dependence of conditional probabilities on generic prior information $K$ (which includes the data used to determine the prior PDF, external parameters such as $\alpha_s$, and theoretical assumptions such as the use of perturbative QCD at a given order). $\mathcal{P}(y)$ is the prior density in the space of data, and we do not need to specify its explicit form, since it can be fixed by requiring $\mathcal{P}(f|y)$ to be correctly normalised. The only relevant property of $\mathcal{P}(y)$ is that it does not depend on the PDFs $f$.
In order to define the probability density at a given point $y_0$, we can integrate Eq. (1) in a small sphere $S_\epsilon$ of radius $\epsilon$ centered at $y_0$. Integrating the left-hand side of Eq. (1) over $S_\epsilon$ we obtain
$$\int_{S_\epsilon}\mathcal{P}(f|y)\,\mathcal{P}(y)\,d^ny\;Df = \frac{\Omega_n}{n}\,\epsilon^n\,\mathcal{P}(f|y_0)\,\mathcal{P}(y_0)\,Df + O(\epsilon^{n+1}), \tag{2}$$
where $\Omega_n$ is the solid angle in $n$ dimensions. Integrating the right-hand side similarly, we can cancel the volume factors on each side and thus take the limit $\epsilon\to 0$, to give
$$\mathcal{P}(f|y_0)\,\mathcal{P}(y_0)\,Df = \mathcal{P}(y_0|f)\,\mathcal{P}(f)\,Df. \tag{3}$$
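The volume factor that cancels between the two sides is the volume of an $n$-dimensional ball, $\Omega_n\epsilon^n/n$. A minimal sketch of that factor, assuming the usual closed form $\Omega_n = 2\pi^{n/2}/\Gamma(n/2)$ for the solid angle (the function names are illustrative, not taken from the paper):

```python
import math

def solid_angle(n):
    """Total solid angle Omega_n in n dimensions (assumed closed form)."""
    return 2.0 * math.pi ** (n / 2) / math.gamma(n / 2)

def ball_volume(n, eps):
    """Volume of a ball of radius eps in n dimensions: Omega_n * eps^n / n."""
    return solid_angle(n) / n * eps ** n

# The familiar low-dimensional cases: area of a disc, volume of a sphere.
disc = ball_volume(2, 1.0)
sphere = ball_volume(3, 1.0)
```

Because the same factor multiplies both sides of Eq. (1) after integration, its precise value never enters the final result; only its independence of $f$ and $y_0$ matters.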
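Now $\mathcal{P}(y|f)$ is the likelihood density for the data $y$: assuming these data to be normally distributed about central values $y[f]$ (which of course depend on the PDF $f$),
$$\mathcal{P}(y|f)\,d^ny = (2\pi)^{-n/2}(\det\sigma)^{-1/2}\exp\left(-\tfrac12\chi^2(y,f)\right)d^ny, \tag{4}$$
where $\sigma_{ij}$ is the experimental covariance matrix. The only dependence on $f$ is through the value of
$$\chi^2(y,f)=\sum_{i,j=1}^{n}(y_i-y_i[f])\,\sigma^{-1}_{ij}\,(y_j-y_j[f]). \tag{5}$$
It now follows from Eqs. (3-5) that
$$\mathcal{P}(f|y_0)\propto\exp\left(-\tfrac12\chi^2(y_0,f)\right)\mathcal{P}(f), \tag{6}$$
with a constant of proportionality that depends on $y_0$, but not on $f$, and can thus be fixed if necessary through the normalization condition $\int\mathcal{P}(f|y_0)\,Df=1$.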
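The quadratic form of Eq. (5), $\chi^2=\sum_{ij}(y_i-y_i[f])\,\sigma^{-1}_{ij}\,(y_j-y_j[f])$, can be evaluated without inverting the covariance matrix explicitly. The following is a minimal sketch using a Cholesky factorisation in pure Python; the function names and the small input values used to exercise it are illustrative assumptions, not part of the paper.

```python
import math

def cholesky(cov):
    """Lower-triangular L with L L^T = cov (cov symmetric positive definite)."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low

def chi2(data, theory, cov):
    """chi^2 = r^T cov^{-1} r with r = data - theory, computed as |L^{-1} r|^2."""
    low = cholesky(cov)
    resid = [d - t for d, t in zip(data, theory)]
    z = []
    for i, r in enumerate(resid):
        # forward substitution: solve L z = r row by row
        z.append((r - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    return sum(v * v for v in z)

# Hypothetical two-point example with correlated uncertainties.
example = chi2([1.0, 1.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
```

For an uncorrelated covariance matrix this reduces to the familiar sum of squared pulls.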
2.2 Weights for a given $\chi^2$
This is all fine so far as it goes, but is not sufficient to give us a reweighting of our ensemble of PDFs equivalent to a refitting. The reason for this is that when we fit PDFs, we do not demand that the predictions $y[f]$ coincide with the data points $y$, but rather that the figure of merit $\chi^2(y,f)$ is optimized. Thus rather than integrating both sides of Eq. (1) over the small spheres $S_\epsilon$, we should integrate over all $y$ subject only to the single constraint that $\chi^2(y,f)=\chi^2$, for some fixed value $\chi^2$. It is convenient to choose as a parameter $\chi$, rather than $\chi^2$, because we can interpret $\chi$ as the radial coordinate in a system of spherical polar coordinates in function space, centered at $y[f]$.
The left-hand side of Eq. (1) thus becomes
$$\int d^ny\,\delta(\chi-\chi(y,f))\,\mathcal{P}(f|y)\,\mathcal{P}(y)\,Df\propto\mathcal{P}(f|\chi)\,Df, \tag{7}$$
thus defining $\mathcal{P}(f|\chi)$ up to an overall constant (independent of $f$). We can evaluate it by performing the same integration over the right-hand side of Eq. (1), since the dependence on $f$ factorises:
$$\int d^ny\,\delta(\chi-\chi(y,f))\,\mathcal{P}(y|f)\,\mathcal{P}(f)\,Df\propto\chi^{n-1}e^{-\frac12\chi^2}\,\mathcal{P}(f)\,Df, \tag{8}$$
where we have used Eq. (4) for the likelihood, and performed the integral over $y$ in spherical coordinates. Comparing Eq. (7) and Eq. (8) we thus find
$$\mathcal{P}(f|\chi)\propto\chi^{n-1}e^{-\frac12\chi^2}\,\mathcal{P}(f). \tag{9}$$
In order to define the weight to be associated to each replica, we need to define the probability for each replica by integrating the probability density over a finite volume, and then send that volume to zero. For a given replica $f_k$ we thus integrate over the region $\chi\in[\chi_k,\chi_k+d\chi]$, where $\chi_k\equiv\chi(y,f_k)$:
$$\int_{\chi_k}^{\chi_k+d\chi}\mathcal{P}(f_k|\chi)\,d\chi\simeq\mathcal{P}(f_k|\chi_k)\,d\chi. \tag{10}$$
Note that this corresponds to integrating Eq. (7) over a spherical shell, centered on $y[f_k]$, of radius $\chi_k$ and thickness $d\chi$. The thickness of the shell is independent of the choice of replica: if it were not, we would bias the result.
It is easy to see using Eq. (9) that Eq. (10) gives the formula derived in Ref. [14] for the weights: since the replicas in the prior distribution all have equal probability, $\mathcal{P}(f_k)$ is independent of the choice of replica $f_k$, and the weights are
$$w_k\propto\chi_k^{n-1}e^{-\frac12\chi_k^2}. \tag{11}$$
The constant of proportionality may be fixed by normalizing the sum of the weights to the number of replicas.
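In practice the weights $w_k\propto\chi_k^{n-1}e^{-\chi_k^2/2}$ span many orders of magnitude, so it is safer to evaluate them through their logarithms. The sketch below is one way to do this, assuming the per-replica $\chi^2$ values are already available; the function name and the numbers used are hypothetical.

```python
import math

def replica_weights(chi2_values, n_data):
    """Weights proportional to chi^(n-1) * exp(-chi^2/2), normalised to sum to N.

    chi2_values: chi^2 of each replica against the new data (must be > 0).
    n_data: number of new data points n.
    """
    log_w = [0.5 * (n_data - 1) * math.log(c) - 0.5 * c for c in chi2_values]
    shift = max(log_w)                      # subtract the maximum to avoid overflow
    raw = [math.exp(v - shift) for v in log_w]
    scale = len(raw) / sum(raw)             # normalise so that the weights sum to N
    return [r * scale for r in raw]

# Hypothetical chi^2 values for three replicas and n = 10 data points.
weights = replica_weights([1.0, 9.0, 30.0], 10)
```

With these inputs the replica with $\chi^2=9=n-1$ receives the largest weight, while both the very small and the very large $\chi^2$ are suppressed.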
The factor of $\chi_k^{n-1}$ takes account of the fact that when there are many data points, larger values of $\chi$ have a larger phase space available to them, while very small values are phase space suppressed: however good the model, it is always very unlikely that the theoretical prediction will give exactly the right result for a large number of measurements. This is not a trivial result: it depends critically on choosing the correct volume upon which to integrate in the space of the new data $y$. Starting from the same probability density, but using a different integration volume, would produce a different result. Hence we need to justify our particular choice of volume.
In this respect, we note that our choice includes all points in the space of $y$ with a particular $\chi$, and that the thickness $d\chi$ of the shell is independent of its radius $\chi_k$ or centre $y[f_k]$, in the same way that in Eq. (2) the radius $\epsilon$ of the little sphere was also independent of $y_0$. The ultimate justification in both cases is that the probability measure on the space $y$ is uniform, i.e. that equal volumes have equal probability: this assumption is of course implicit from the start, since without it the likelihood Eq. (4) would not be Gaussian.
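As a quick check of the phase-space suppression described above, one can locate the maximum of the weight as a function of $\chi$. Assuming the form $w(\chi)\propto\chi^{n-1}e^{-\chi^2/2}$ for $n$ data points (the weight formula of Ref. [14]):

```latex
\frac{d}{d\chi}\ln w(\chi) = \frac{n-1}{\chi} - \chi = 0
\quad\Longrightarrow\quad
\chi^2 = n-1 .
```

The weight is therefore largest for replicas with $\chi^2\simeq n-1$, i.e. a $\chi^2$ per data point close to one, and it falls off both for much larger and for much smaller values of $\chi^2$.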
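Note that although the above argument is most naturally expressed using $\chi$ as a coordinate in function space, we would get the same weights if we were to instead use $\chi^2$, or indeed a conditional dependence on any other monotonic function of $\chi$, so long as we use the same volume in the space of data to define the weights. To see this, note that for example
$$\int d^ny\,\delta(\chi^2-\chi^2(y,f))\,\mathcal{P}(f|y)\,\mathcal{P}(y)\,Df=\frac{1}{2\chi}\int d^ny\,\delta(\chi-\chi(y,f))\,\mathcal{P}(f|y)\,\mathcal{P}(y)\,Df, \tag{12}$$
so that, comparing with Eq. (7),
$$\mathcal{P}(f|\chi^2)=\frac{1}{2\chi}\,\mathcal{P}(f|\chi)\propto(\chi^2)^{\frac12(n-2)}e^{-\frac12\chi^2}\,\mathcal{P}(f). \tag{13}$$
As expected, we thus have $\mathcal{P}(f|\chi^2)\,d\chi^2=\mathcal{P}(f|\chi)\,d\chi$. If we work with $\chi^2$, in order to be sure to use the same volume in the space of data (i.e. a spherical shell of thickness $d\chi$) we must now integrate over the interval $\chi^2\in[\chi_k^2,\chi_k^2+2\chi_k\,d\chi]$:
$$\int_{\chi_k^2}^{\chi_k^2+2\chi_k d\chi}\mathcal{P}(f_k|\chi^2)\,d\chi^2\simeq\mathcal{P}(f_k|\chi_k^2)\,2\chi_k\,d\chi, \tag{14}$$
which then yields exactly the same weight Eq. (11) as obtained using Eq. (10).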
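The equivalence of the two parametrisations can be checked numerically. The sketch below assumes the densities $\chi^{n-1}e^{-\chi^2/2}$ in $\chi$ and $\frac12(\chi^2)^{(n-2)/2}e^{-\chi^2/2}$ in $\chi^2$, and compares the probability assigned to the same thin shell; the function names and test values are illustrative.

```python
import math

def shell_weight_chi(chi, n, dchi):
    """Shell probability with chi as variable: density in chi times width dchi."""
    return chi ** (n - 1) * math.exp(-0.5 * chi * chi) * dchi

def shell_weight_chi2(chi, n, dchi):
    """Same shell with chi^2 as variable: density in chi^2 times width 2*chi*dchi."""
    density = 0.5 * (chi * chi) ** (0.5 * (n - 2)) * math.exp(-0.5 * chi * chi)
    return density * 2.0 * chi * dchi

# Hypothetical values: n = 10 data points, shell thickness dchi = 1e-3.
pairs = [(shell_weight_chi(c, 10, 1e-3), shell_weight_chi2(c, 10, 1e-3))
         for c in (0.5, 2.0, 3.0, 5.0)]
```

Using an interval of fixed width in $\chi^2$ instead of $2\chi_k\,d\chi$ would change the power of $\chi_k$ in the weight, which is why the choice of volume in data space has to be kept fixed.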
2.3 Multiple experiments
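Let us now discuss the implications of the above prescription for reweighting with more than one set of data. Suppose we are given a set of new data $y$, which is made of two independent subsets $y_1$ and $y_2$, containing respectively $n_1$ and $n_2$ data points, such as for example a dataset which includes results from two independent experimental measurements (of the same, or of different observables).
When the two sets of data are used for reweighting simultaneously, the only quantity that matters is the total $\chi^2$ of the two experiments. Since we assumed the experiments to be independent, $\chi^2=\chi_1^2+\chi_2^2$, where $\chi_i^2\equiv\chi^2(y_i,f)$, and the probability density is therefore given by Eq. (9) above:
$$\mathcal{P}(f|\chi)\propto(\chi_1^2+\chi_2^2)^{\frac12(n_1+n_2-1)}e^{-\frac12(\chi_1^2+\chi_2^2)}\,\mathcal{P}(f). \tag{15}$$
Clearly the individual values of $\chi^2$ of the two sets need not each be fixed to $\chi_1^2$ and $\chi_2^2$. Hence even though the likelihood factorizes,
$$\mathcal{P}(y|f)=\mathcal{P}(y_1|f)\,\mathcal{P}(y_2|f), \tag{16}$$
the weights do not:
$$(\chi_1^2+\chi_2^2)^{\frac12(n_1+n_2-1)}e^{-\frac12(\chi_1^2+\chi_2^2)}\neq\chi_1^{n_1-1}e^{-\frac12\chi_1^2}\;\chi_2^{n_2-1}e^{-\frac12\chi_2^2}. \tag{17}$$
Instead they are determined through the more complicated relation (see Eqs. (7) and (8))
$$\mathcal{P}(f|\chi)\propto\int d^{n_1}y_1\,d^{n_2}y_2\;\delta\!\left(\chi-\sqrt{\chi^2(y_1,f)+\chi^2(y_2,f)}\right)\mathcal{P}(y_1|f)\,\mathcal{P}(y_2|f)\,\mathcal{P}(f). \tag{18}$$
With Gaussian likelihoods Eq. (4), the integrals can be evaluated to give Eq. (15).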
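This means that if we wish to proceed sequentially, then after weighting with the first data set, with the usual weights $w_k^{(1)}\propto\chi_{1,k}^{n_1-1}e^{-\frac12\chi_{1,k}^2}$ (where $\chi_{i,k}^2\equiv\chi^2(y_i,f_k)$), the weights for the second data set are not given by
$$w_k^{(2)}\propto\chi_{2,k}^{n_2-1}e^{-\frac12\chi_{2,k}^2}, \tag{19}$$
but rather by
$$w_k^{(2)}\propto\frac{(\chi_{1,k}^2+\chi_{2,k}^2)^{\frac12(n_1+n_2-1)}}{(\chi_{1,k}^2)^{\frac12(n_1-1)}}\,e^{-\frac12\chi_{2,k}^2}. \tag{20}$$
This perhaps appears odd at first sight, but is as it should be: the first dataset has altered the probability distribution of the PDFs, and thus the probabilities of the replicas before the second dataset can be considered must necessarily change. This is accounted for by dividing out the phase space factor of the first dataset, and multiplying by that of the combined dataset.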
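The statement that the second weight divides out the phase-space factor of the first dataset and multiplies by that of the combined one can be checked directly: the product of the first weight and the corrected second weight must reproduce the combined weight $(\chi_1^2+\chi_2^2)^{(n_1+n_2-1)/2}e^{-(\chi_1^2+\chi_2^2)/2}$. A minimal sketch in log space, where the per-replica $\chi^2$ values are hypothetical and $n_1$, $n_2$ are chosen only for illustration:

```python
import math

def log_weight(chi2, n):
    """log of chi^(n-1) * exp(-chi^2/2) for a dataset with n points."""
    return 0.5 * (n - 1) * math.log(chi2) - 0.5 * chi2

def log_second_weight(chi2_1, n1, chi2_2, n2):
    """log of the corrected weight for the second dataset in a sequential reweighting."""
    return (0.5 * (n1 + n2 - 1) * math.log(chi2_1 + chi2_2)
            - 0.5 * (n1 - 1) * math.log(chi2_1)
            - 0.5 * chi2_2)

n1, n2 = 76, 110                       # illustrative numbers of data points
chi2_1 = [70.0, 85.0, 120.0]           # hypothetical chi^2 against dataset 1
chi2_2 = [100.0, 130.0, 95.0]          # hypothetical chi^2 against dataset 2

sequential = [log_weight(a, n1) + log_second_weight(a, n1, b, n2)
              for a, b in zip(chi2_1, chi2_2)]
combined = [log_weight(a + b, n1 + n2) for a, b in zip(chi2_1, chi2_2)]
```

The two lists agree replica by replica, which is why sequential reweighting of the same weighted set is consistent by construction and does not constitute an independent test.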
Nevertheless, it is possible to factorize the reweightings due to more than one dataset, if rather than attempting successive reweightings of the same set of replicas, one first turns the original weighted set into an unweighted set, and then computes the second set of weights using this set. This procedure will be discussed in detail in Sec. 4: however before we can do this we must first develop a procedure for unweighting.
3 Unweighting
In this section we present a method to unweight reweighted PDF sets so that they can be used without the need for including weights for individual replicas. The starting point is a set of $N$ reweighted replicas. Each replica, identified by the index $k$, carries a weight $w_k$ defined in Eq. (11), determined by comparing each of the replicas of the original unweighted distribution to the new experimental information. Our goal is to unweight this PDF set in order to obtain a new set of $N'_{\rm rep}$ replicas with all weights equal to unity, but with the same probability distribution as the original weighted set, i.e. such that any moment of the probability distribution computed from the weighted and unweighted set would be the same in the limit in which $N'_{\rm rep}\to\infty$.
3.1 The unweighting method
The basic idea for constructing the unweighted set consists of selecting replicas from the weighted set of $N$ replicas in such a way that replicas carrying a relatively high weight are chosen repeatedly, while those with vanishingly small weight disappear from the final unweighted set. The method is depicted graphically in Fig. 1. We start by subdividing a line of unit length into $N$ segments, in such a way that for each replica the length of the corresponding segment is proportional to the weight of the replica, and thus to its probability. The ordering of the segments is random. In order to extract a set of replicas that faithfully represents this distribution, we draw another unit interval directly below the first, and subdivide it into $N'_{\rm rep}$ segments all of equal length $1/N'_{\rm rep}$. We then select replicas from the original weighted set by taking a number of copies of each replica equal to the number of lower segments whose right edge is contained in the upper segment corresponding to that specific replica. A little thought shows that the (all equally probable) replicas in the lower set are then chosen according to the probabilities of the replicas in the upper set.
To see this, note that if the number of replicas $N'_{\rm rep}$ is large enough (top plot in Fig. 1), then at least one lower segment (width $1/N'_{\rm rep}$) will be contained in each upper segment, and the original probability distribution is reproduced. This case is however unrealistic, as it would require $N'_{\rm rep}$ to be as large as the ratio between the highest and lowest weight, which can be very large indeed. It is also unnecessary, because the amount of information carried by the weighted set is measured by its Shannon entropy, which can be used to determine the effective number $N_{\rm eff}$ of unweighted replicas which carry the same information [14]. Hence, it is pointless to include a number of replicas significantly larger than $N_{\rm eff}$, as no information is then gained. Because by construction $N_{\rm eff}<N$, the more realistic situation is depicted in the bottom plot of Fig. 1: for the larger weights several unweighted segments are contained in a weighted one, but for the smaller weights there are often none at all, since we only select a replica if the edge of a lower segment is contained in the upper segment corresponding to that replica. Which replica is chosen among many all with equally small weight is of course entirely random, since the ordering of the replicas is random.
We can now formulate the unweighting algorithm quantitatively. We start with a set of $N$ replicas, each carrying a weight $w_k$, Eq. (11); as in Ref. [14], we normalize the weights according to
$$N=\sum_{k=1}^{N}w_k. \tag{21}$$
The probability of each replica is determined given its weight as
$$p_k=\frac{w_k}{N}. \tag{22}$$
We then define probability cumulants
$$P_k\equiv P_{k-1}+p_k=\sum_{j=1}^{k}p_j, \tag{23}$$
where in the last step we take $P_0=0$. By construction, $0\le P_k\le 1$ and $P_N=1$. Indeed, the cumulants $P_k$ provide the coordinate of the right edge of the $k$th upper segment in the plot of Fig. 1, with origin at the left edge of the unit interval.
The unweighted set is then constructed as follows. We start with $N$ weights $w_k$, and we determine $N$ new weights
$$w'_k=\sum_{j=1}^{N'_{\rm rep}}\theta\!\left(\frac{j}{N'_{\rm rep}}-P_{k-1}\right)\theta\!\left(P_k-\frac{j}{N'_{\rm rep}}\right). \tag{24}$$
The weights $w'_k$ are either zero or positive integers, and they satisfy the normalization condition
$$N'_{\rm rep}=\sum_{k=1}^{N}w'_k; \tag{25}$$
in fact, they correspond to the graphical counting procedure described previously. The unweighted set is then simply constructed by taking $w'_k$ copies of the $k$th replica, for all $k=1,\dots,N$. The probability of replica $k$ in the new unweighted set is then given by
$$p'_k=\frac{w'_k}{N'_{\rm rep}}. \tag{26}$$
As a consequence we have
$$\lim_{N'_{\rm rep}\to\infty}p'_k=p_k, \tag{27}$$
i.e. the unweighted set reproduces the probabilities of the weighted set in the limit of large sample size, as it ought to.
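The counting procedure can be sketched in a few lines: walk along the equally spaced points $j/N'_{\rm rep}$ and assign each to the replica whose segment of the cumulative probability contains it. The function below is an illustrative implementation under the convention that a point lying exactly on a segment boundary is assigned to the segment on its left; the function name and input weights are hypothetical.

```python
def unweight(weights, n_unweighted):
    """Integer multiplicities w'_k: number of grid points j/N' (j = 1..N')
    that fall in the k-th segment (P_{k-1}, P_k] of the cumulative probability."""
    total = sum(weights)
    cumulants, running = [], 0.0
    for w in weights:
        running += w / total
        cumulants.append(running)
    cumulants[-1] = 1.0               # guard against floating-point rounding
    counts = [0] * len(weights)
    k = 0
    for j in range(1, n_unweighted + 1):
        edge = j / n_unweighted
        while cumulants[k] < edge:    # advance to the segment containing this point
            k += 1
        counts[k] += 1
    return counts

# Hypothetical weights: one negligible replica and two equally likely ones.
multiplicities = unweight([1e-9, 2.0, 2.0 - 1e-9], 4)
```

The unweighted set is then built by repeating replica $k$ as many times as its multiplicity; replicas with multiplicity zero drop out.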
As already mentioned, even though exact identity of the reweighted and unweighted probability distribution holds in the limit Eq. (27), the amount of information contained in the weighted set corresponds to $N_{\rm eff}$ unweighted replicas, with $N_{\rm eff}$ determined as in Eq. (10) of Ref. [14] from the Shannon entropy. Therefore for practical applications it is advisable to take $N'_{\rm rep}\lesssim N_{\rm eff}$: though there is nothing in principle wrong with taking $N'_{\rm rep}>N_{\rm eff}$, this would just lead to a highly redundant replica set. We will study the dependence of unweighted results on $N'_{\rm rep}$ in an explicit example below.
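For orientation, the effective number of replicas can be computed from the weights alone. The sketch below assumes the Shannon-entropy form $N_{\rm eff}=\exp\{\frac{1}{N}\sum_k w_k\ln(N/w_k)\}$, with weights normalised to sum to $N$; this expression is quoted here as an assumption about Eq. (10) of Ref. [14], and the function name is illustrative.

```python
import math

def effective_replicas(weights):
    """Effective number of replicas from the Shannon entropy of the weights
    (assumed form: exp{(1/N) * sum_k w_k * ln(N / w_k)}, with sum_k w_k = N)."""
    n = len(weights)
    total = sum(weights)
    entropy = 0.0
    for w in weights:
        wk = w * n / total            # rescale so that the weights sum to N
        if wk > 0.0:                  # a zero weight contributes nothing
            entropy += wk * math.log(n / wk)
    return math.exp(entropy / n)

# Two limiting cases: equal weights, and a single surviving replica.
all_equal = effective_replicas([1.0] * 50)
one_survivor = effective_replicas([5.0, 0.0, 0.0, 0.0, 0.0])
```

Equal weights give back the full number of replicas, while a set dominated by a single replica has an effective size of one.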
3.2 Testing unweighting
As a proof of concept of the unweighting technique, we will apply it to the two cases discussed in Ref. [14]: the reweighting of NNPDF2.0 DIS+DY with Tevatron inclusive jet data and the reweighting of NNPDF2.0 with the D0 muon and inclusive electron lepton asymmetry data.
First, we consider the reweighting of NNPDF2.0 DIS+DY [7] with the Tevatron inclusive jet data [19, 15]. As discussed in Ref. [14], starting with $N=1000$ NNPDF2.0 DIS+DY replicas, after reweighting with jet data the effective number of replicas is $N_{\rm eff}\approx 334$. A reasonable choice for the size of the unweighted set would be any number less than this: here we chose $N'_{\rm rep}=100$. We perform the unweighting following the procedure discussed above. The comparison between the reweighted PDFs and the unweighted set can be made quantitative by determining the distances between PDFs and uncertainties. Distances were defined in Appendix A of Ref. [7], and in Ref. [14] in the weighted case; recall that distances $d\sim 1$ correspond to statistically identical distributions, while $d\sim 7$ (with 100 replicas) corresponds to distributions which are statistically inequivalent, but agree to one sigma. The distances between the reweighted PDF set and the same PDF set after unweighting are shown in Fig. 2. The corresponding distances between reweighted and refitted PDFs were shown in Fig. 2 of Ref. [14]. It is clear that the distances between reweighted and unweighted sets are generally smaller than those between the reweighted and the refitted sets, and they all fluctuate about $d\sim 1$, showing statistical equivalence (with the possible exception of the light sea asymmetry at small $x$, which is subject to very large uncertainties). We conclude that there is no significant loss of accuracy in the reweighting due to the unweighting.
We can now study the information contained in the unweighted set as the number of unweighted replicas $N'_{\rm rep}$ is varied. To this purpose, we compute the relative Shannon entropy between the unweighted set and the original weighted set, defined as
$$H\equiv\sum_{k=1}^{N}p'_k\ln\frac{p'_k}{p_k}, \tag{28}$$
where $p'_k$ are the probabilities Eq. (26), defined for each value of $N'_{\rm rep}$. If the starting number of replicas $N$ is large enough that it is already in the asymptotic region where Eq. (24) holds, then clearly for large $N'_{\rm rep}$ the relative entropy should fall to zero. For lower values of $N'_{\rm rep}$, $H$ measures the information loss between the original weighted set and the unweighted one.
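A relative entropy of this kind is straightforward to evaluate once both sets of probabilities are known. The sketch below assumes the Kullback-Leibler form $H=\sum_k p'_k\ln(p'_k/p_k)$, with the convention that replicas absent from the unweighted set ($p'_k=0$) contribute nothing; the function name and the example probabilities are hypothetical.

```python
import math

def relative_entropy(p_unweighted, p_weighted):
    """Sum over replicas of p'_k * ln(p'_k / p_k), skipping terms with p'_k = 0."""
    h = 0.0
    for q, p in zip(p_unweighted, p_weighted):
        if q > 0.0:
            h += q * math.log(q / p)
    return h

# Hypothetical example: a coarse unweighted set that keeps only one of two replicas.
coarse = relative_entropy([1.0, 0.0], [0.5, 0.5])
# An unweighted set that reproduces the weighted probabilities exactly.
exact = relative_entropy([0.25, 0.75], [0.25, 0.75])
```

The quantity vanishes when the unweighted probabilities coincide with the weighted ones and is positive otherwise.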
In Fig. 3 we display $H$. It is clear that $H$ falls linearly as a function of $N'_{\rm rep}$ up to $N'_{\rm rep}\sim N_{\rm eff}$, as more and more of the information in the weighted set is included. Around $N'_{\rm rep}\sim N_{\rm eff}$ the slope of the fall changes abruptly, and $H$ then falls slowly to zero as $N'_{\rm rep}$ increases, being already close to zero when $N'_{\rm rep}\sim N$. This can also be seen by computing directly the effective number of replicas $N'_{\rm eff}$ of the unweighted set as a function of $N'_{\rm rep}$, which can be determined using Eq. (10) of Ref. [14], with the weights $w'_k$ of Eq. (24) and $N\to N'_{\rm rep}$. Note that the result is nontrivial because some of the $w'_k$ are zero, others are integers larger than one, and the dependence on $N'_{\rm rep}$ comes about only through the definition of the weights Eq. (24). The result is also shown in Fig. 3: at first $N'_{\rm eff}$ grows linearly as a function of $N'_{\rm rep}$, and is in fact very nearly equal to it. However when it reaches $N_{\rm eff}$, the linear growth breaks off abruptly, and $N'_{\rm eff}$ saturates at the value $N_{\rm eff}$, which is reached asymptotically. Hence our expectation is borne out by these plots: the amount of information in the unweighted set increases with the number of unweighted replicas $N'_{\rm rep}$, but only up to the point $N'_{\rm rep}\sim N_{\rm eff}$, after which nothing is gained by further increasing $N'_{\rm rep}$.
We now repeat the same analysis for the unweighting of the NNPDF2.0 set, reweighted with the inclusive electron and muon D0 Run–II lepton asymmetry data [20, 21]. The reweighting procedure for these data was presented in detail in Ref. [14]. The effective number of replicas, after reweighting a starting set of replicas, is in this case . Again, we can choose the size of the unweighted set to be , as in the case above, and we perform the unweighting following the same procedure as before.
In Fig. 4 we show the distance between the reweighted and unweighted sets, and in Fig. 5 we plot the relative entropy between these two sets and the effective number of replicas in the unweighted set as a function of the number of unweighted replicas. The conclusions are the same as before: the unweighted set is indistinguishable from the reweighted one, provided that the number of unweighted replicas $N'_{\rm rep}$ is of the same order as the effective number of reweighted replicas $N_{\rm eff}$. In the sequel we will thus feel free to use unweighted replica sets instead of their weighted counterparts, to which they are essentially equivalent.
4 Consistency
4.1 Multiple Reweighting
As we discussed in Sec. 2.3, when adding two new datasets to a set of prior PDFs, one way to proceed is to treat them as a single combined dataset, as in Eq. (15), i.e., with weights $w_k\propto\chi_k^{n-1}e^{-\frac12\chi_k^2}$ with $\chi_k^2=\chi_{1,k}^2+\chi_{2,k}^2$ and $n=n_1+n_2$. However, it should also be possible to treat them separately, weighting with first one dataset, then the other. If we do this using Eq. (20) then by construction we get the same answer that we would get by including the two sets at once, but this is trivial, because in the weights Eq. (20) the effect of the first weighting is divided out.
However, we can test nontrivially that two subsequent weightings by two independent datasets commute by incorporating the unweighting procedure. Formally we define the operation $\mathcal{R}$ as reweighting with the weights given by Eq. (11), and an unweighting operation $\mathcal{U}$, as described in Sec. 3.1. Note that because the unweighting operator is a projection operator, it has no inverse. Weighting an existing PDF set by incorporating information from a new dataset then consists of the combined ‘weighting’ operation $\mathcal{W}=\mathcal{U}\mathcal{R}$. The weighting operation takes a set of replicas $\{f_k\}$, all equally probable, and replaces it with a subset $\{f'_k\}$ which are again all equally probable, but the selection of which reflects information contained in the new dataset that was used in the reweighting $\mathcal{R}$. Clearly $\mathcal{W}$ has no inverse, since it projects onto a lower dimensional space.
Now consider two datasets: the set of replicas produced by the action of weighting with the first dataset, $\mathcal{W}_1$, can be subject to a further weighting with the second dataset, $\mathcal{W}_2$. Now of course the formula used to evaluate the weights used for the second reweighting must again be given by Eq. (11): the subset of replicas produced by $\mathcal{W}_1$ are again all equally probable, so the second reweighting must work in precisely the same way as the first. The only difference is that $\mathcal{W}_2$ acts only on those replicas produced by the action of $\mathcal{W}_1$.
Now for consistency it cannot matter in what order we perform these two weightings, and indeed their combined effect must be the same as for a single weighting $\mathcal{W}_{12}$, which treats the two datasets as a single dataset: $\mathcal{W}_2\mathcal{W}_1=\mathcal{W}_1\mathcal{W}_2=\mathcal{W}_{12}$, or more explicitly
$$\mathcal{U}\mathcal{R}_2\,\mathcal{U}\mathcal{R}_1=\mathcal{U}\mathcal{R}_1\,\mathcal{U}\mathcal{R}_2=\mathcal{U}\mathcal{R}_{12}. \tag{29}$$
So, for weighting to be consistent it must satisfy two nontrivial conditions: the combination property, and the commutation property. Clearly the first always implies the second (if $\mathcal{W}_2\mathcal{W}_1=\mathcal{W}_{12}$, clearly $\mathcal{W}_1\mathcal{W}_2=\mathcal{W}_{12}$, because $\mathcal{R}_{12}$ is performed using weights determined through the total $\chi^2$), but not the reverse (we might have $\mathcal{W}_1\mathcal{W}_2=\mathcal{W}_2\mathcal{W}_1\neq\mathcal{W}_{12}$ if the formula Eq. (11) was incorrect).
In the remaining part of this Section we present two tests of the combination and commutation properties when two datasets are included. First, we consider sets of data for the same observable (the one-jet inclusive cross-section) in the same kinematic region by two different experiments. Then, we consider data for two different observables (a jet cross-section and a Drell-Yan cross-section) which affect different PDFs in different kinematic regions.
4.2 Tevatron Inclusive Jets
The first exercise we present is an extension of the reweighting proof-of-concept in Section 4 of Ref. [14]. There, Run II Tevatron inclusive jet production data were included by reweighting a PDF set extracted from an NLO fit to DIS and Drell-Yan data (NNPDF2.0 DIS+DY), and the results were compared to those obtained from a fit which included the same DIS, Drell-Yan and inclusive jet datasets all treated in the same way (NNPDF2.0).
Table 1: Number of data points and effective number of replicas after reweighting.

                   CDF      D0       CDF+D0
Data points        76       110      186
$N_{\rm eff}$      290.8    565.8    334.5
In this Section we look again at the inclusion via reweighting of the same datasets, namely the CDF Run II-$k_T$ and D0 Run II-cone inclusive jet data in the NNPDF2.0 DIS+DY fit, but we now focus on comparing the results obtained in the following two cases:
(a) the two new datasets are included by reweighting the prior fit in a single step with both datasets;
(b) one of the datasets is included by reweighting, an unweighted set of PDFs is constructed using the procedure detailed in Section 3, and finally the latter set is reweighted again with the second dataset.
We will carry out the successive reweighting procedure (b) twice, exchanging the order in which the CDF and D0 datasets are included, in order to test the commutativity of the procedure. A final unweighting is performed for all the reweighted sets and the PDF comparisons and computations of distances are performed using these unweighted sets.
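The bookkeeping of procedures (a) and (b) can be illustrated on a toy problem. In the sketch below each "replica" is a single number, every data point is predicted to equal that number, and the weights are taken as $w_k\propto\chi_k^{n-1}e^{-\chi_k^2/2}$. All names, the toy model, the random inputs and the intermediate set size of 1000 are assumptions made for illustration; the sketch only wires the steps together and makes no claim about how closely the resulting sets agree.

```python
import math
import random

def chi2(replica, data, sigma):
    """Toy chi^2: every data point is predicted to equal the replica value."""
    return sum(((y - replica) / sigma) ** 2 for y in data)

def reweight(replicas, data, sigma):
    """Weights proportional to chi^(n-1) exp(-chi^2/2), normalised to sum to N."""
    n = len(data)
    logs = []
    for f in replicas:
        c = chi2(f, data, sigma)
        logs.append(0.5 * (n - 1) * math.log(c) - 0.5 * c)
    top = max(logs)
    raw = [math.exp(v - top) for v in logs]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]

def unweight(replicas, weights, n_out):
    """Pick replicas at the n_out equally spaced points of the cumulative weight."""
    total = sum(weights)
    chosen, k, cumulant = [], 0, weights[0] / total
    for j in range(1, n_out + 1):
        while cumulant < j / n_out and k < len(weights) - 1:
            k += 1
            cumulant += weights[k] / total
        chosen.append(replicas[k])
    return chosen

rng = random.Random(1)
prior = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # hypothetical prior replicas
data_1 = [rng.gauss(0.3, 0.5) for _ in range(8)]     # hypothetical dataset 1
data_2 = [rng.gauss(0.3, 0.5) for _ in range(12)]    # hypothetical dataset 2
sigma = 0.5

# (a) single step with both datasets, then unweight to 100 replicas
combined = unweight(prior, reweight(prior, data_1 + data_2, sigma), 100)

# (b) dataset 1, unweight to a redundant intermediate set, then dataset 2
middle = unweight(prior, reweight(prior, data_1, sigma), 1000)
sequential = unweight(middle, reweight(middle, data_2, sigma), 100)
```

Exchanging `data_1` and `data_2` in step (b) gives the second ordering used to test commutativity.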
The number of data points and the effective number of replicas after reweighting with these data of a set of replicas are summarized in Table 1. In each case, we construct a final set of unweighted replicas. When the reweighting is performed in two steps, we first construct a (redundant) set of unweighted replicas, which is then reweighted and unweighted again to obtain the final set of 100 unweighted replicas.
As discussed in Refs. [7, 14], Tevatron jet data mostly affect the gluon at large $x$, leaving all other PDFs essentially unchanged. The impact of the inclusion of these data in the fit is shown in Fig. 6, where we compare the gluon for the prior set, the refitted one, and sets obtained by reweighting the prior in the three different ways described above. As in the previous Section, a more quantitative assessment can be made by computing distances between various pairs of PDF sets. In Fig. 7 we show the distance between PDFs obtained by reweighting with the two sets at once and those found including CDF data first and D0 data next, while in Fig. 8 we show distances between sets obtained by including the CDF and D0 data in either order. It is clear that the three reweighting procedures lead to completely equivalent results.
Table 2: Number of data points and effective number of replicas after reweighting.

                   (CDF+D0)   E605     (CDF+D0)+E605
Data points        186        119      305
$N_{\rm eff}$      627.1      59.5     63.7
4.3 Jet and DrellYan data
In this second exercise we start from a NLO fit to DIS data, NNPDF2.1 NLO DIS [8], and include the Tevatron inclusive jet data discussed in the previous section (D0 and CDF as a single dataset) and data from one of the DrellYan experiments which are included in the NNPDF2.1 global analysis (the E605 fixed target experiment [17]).
The number of data points and the effective number of replicas in this case are summarized in Table 2. Also in this case, we construct a set of unweighted replicas, with unweighted replicas in the intermediate step if any. Note that this is a much less symmetric example than the previous one: the DrellYan data have a much greater impact than the jet data (in fact for the DrellYan data ).
As already mentioned, the jet data affect mostly the large-$x$ gluon, while the Drell-Yan data mostly have an impact on the quark flavour and antiflavour separation. The impact of these data on the gluon and the total quark valence distribution is shown in Fig. 9, where we show the results obtained by reweighting with the two sets included together, or one after another in either order. Note that in this case we do not have a refitted set. Distances between PDFs obtained by reweighting with the combined set, or first with jets and then with Drell-Yan, are shown in Fig. 10. Distances between PDFs obtained by reweighting in either order are shown in Fig. 11. The test is clearly as successful here as it was in the previous case, despite being perhaps more challenging.
5 The W asymmetry at the LHC
In this section we will use the reweighting technique presented here and in Ref. [14] to study the effect of including in the NNPDF2.1 NLO global fit the lepton asymmetry measurements produced by the experimental collaborations at the LHC, and based on data collected in the 2010 run.
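The leptonic charge asymmetry is defined in terms of the differential cross-sections $d\sigma_{l^\pm}/d\eta_l$, with $\eta_l$ being the pseudorapidity of the lepton coming from the decay of the $W$ boson, as
$$A_W^l = \frac{d\sigma_{l^+}/d\eta_l - d\sigma_{l^-}/d\eta_l}{d\sigma_{l^+}/d\eta_l + d\sigma_{l^-}/d\eta_l}, \qquad (30)$$
where the cross-sections are computed inside the acceptance cuts used to select the events.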
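As a minimal illustration of this definition, the asymmetry can be evaluated bin by bin as the difference of the positively and negatively charged lepton cross-sections divided by their sum. The function name and the numerical inputs below are hypothetical and are not taken from any measurement.

```python
def charge_asymmetry(sigma_plus, sigma_minus):
    """Bin-by-bin lepton charge asymmetry.

    sigma_plus, sigma_minus: differential cross-sections for l+ and l-
    in the same pseudorapidity bins and inside the same acceptance cuts.
    """
    if len(sigma_plus) != len(sigma_minus):
        raise ValueError("both charges must use the same binning")
    return [(sp - sm) / (sp + sm) for sp, sm in zip(sigma_plus, sigma_minus)]


# Hypothetical cross-sections in arbitrary units (not real data).
sigma_plus = [3.0, 3.2, 3.5]
sigma_minus = [2.0, 2.0, 2.1]
asym = charge_asymmetry(sigma_plus, sigma_minus)
```

Because the asymmetry is a ratio, an overall normalization common to both charges cancels, which is one reason why it is a clean probe of the flavour separation.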
The ATLAS Collaboration published a first measurement of the muon charge asymmetry from $W$ boson production in the pseudorapidity range , based on 31 pb$^{-1}$ of accumulated luminosity [22], while CMS published a measurement of the muon and the electron charge asymmetries in the pseudorapidity range , based on 36 pb$^{-1}$ of data [23]. The data provide a constraint for the above combination of PDFs in the region , where they are only partially constrained by the data already included in the NNPDF global analysis. In particular, while is very well determined by fixed-target DIS data, and the light sea are currently much less constrained.
The LHCb collaboration presented preliminary results for a measurement of the muon charge asymmetry in the pseudorapidity range , covered by the LHCb detector. This measurement probes PDFs in the small- and large-$x$ regions, where the data included so far in the global analyses provide much looser constraints. For this reason the LHCb data might eventually have a substantially larger impact on global fits than the ATLAS or CMS data. However, at the time of writing these experimental results have only been presented in preliminary form [24], and are therefore not included in this study.
5.1 Inclusion of individual experiments
We begin by checking the compatibility of the individual ATLAS and CMS datasets for the lepton charge asymmetry with the data included in the NNPDF2.1 global fit, and by studying their impact when they are included separately in the fit using the reweighting technique presented in this paper.
The ATLAS muon charge asymmetry data [22] and CMS electron and muon data [23] are compared to the predictions obtained using three different NLO global fits, CT10 [25], MSTW2008 [26] and NNPDF2.1 in Fig. 12. The theoretical predictions including NLO QCD corrections are obtained using the fully differential Monte Carlo code DYNNLO [27] which allows for the implementation of arbitrary experimental cuts.
Dataset                              $N_{\rm dat}$   NNPDF2.1   CT10   MSTW08
ATLAS (31 pb$^{-1}$)                 11              0.76       0.77   3.32
CMS (36 pb$^{-1}$) electron  GeV     6               1.83       1.19   1.70
CMS (36 pb$^{-1}$) muon  GeV         6               1.24       0.73   0.77
To give a more quantitative estimate of the level of agreement of the different predictions with the experimental data, in Table 3 we collect the $\chi^2$ per number of data points for each individual dataset. Since no covariance matrix is provided by the LHC experiments at this point, we add statistical and systematic uncertainties in quadrature in the computation of the $\chi^2$ values.
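When no covariance matrix is available, the figure of merit reduces to a sum over data points with a diagonal error. A minimal sketch of this computation follows; the function name and all numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def chi2_per_point(data, stat, syst, theory):
    """chi2 divided by the number of points, with statistical and
    systematic uncertainties added in quadrature (no correlations)."""
    total = 0.0
    for d, s_stat, s_syst, t in zip(data, stat, syst, theory):
        sigma2 = s_stat ** 2 + s_syst ** 2  # quadrature sum of the two errors
        total += (d - t) ** 2 / sigma2
    return total / len(data)


# Hypothetical two-bin example (not real data).
data = [0.20, 0.25]
stat = [0.03, 0.03]
syst = [0.04, 0.04]
theory = [0.25, 0.25]
value = chi2_per_point(data, stat, syst, theory)
```

Treating fully or partly correlated systematics as uncorrelated in this way generally enlarges the effective uncertainty on each point, so values somewhat below one are not surprising.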
The ATLAS muon charge asymmetry data are already very well described by the NNPDF2.1 prediction before being included in the analysis. This is shown by the excellent $\chi^2$ reported in Table 3 and by the distribution of $\chi^2$ for the individual replicas before reweighting, shown in the left plot of Fig. 13, which has a sharp peak around one. The compatibility of a new dataset with the data already included in a global analysis can be assessed by looking at the probability density for the parameter $\alpha$, defined in Eq. (12) of [14]. If this probability distribution peaks close to one, the new data are consistent with the ones already included in the global fit. For the ATLAS data, the $P(\alpha)$ distribution, shown in the right plot of Fig. 13, is peaked slightly below one, thereby showing the good compatibility of these data with those included in the global analysis. Note that optimal values of $\alpha$ below one are to be expected because statistical and systematic errors have been added in quadrature, thereby leading to an overestimation of uncertainties.
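The compatibility test can be sketched numerically. The code below assumes that the density takes the form $P(\alpha) \propto \alpha^{-1} \sum_k w_k(\alpha)$, with $w_k(\alpha) \propto (\chi^2_k/\alpha^2)^{(n-1)/2} \exp(-\chi^2_k/2\alpha^2)$, where $\chi^2_k$ is the total $\chi^2$ of replica $k$ to the $n$ new data points; this form is an assumption of the sketch and should be checked against Eq. (12) of [14]. All names and inputs are hypothetical.

```python
import math


def log_weight(chi2, n_dat, alpha):
    """Log of the unnormalised weight of one replica when all data
    uncertainties are rescaled by a common factor alpha (assumed form)."""
    scaled = chi2 / alpha ** 2
    return 0.5 * (n_dat - 1) * math.log(scaled) - 0.5 * scaled


def p_alpha(chi2_values, n_dat, alphas):
    """P(alpha) on the grid `alphas`, normalised to unit sum over the grid."""
    log_p = []
    for a in alphas:
        terms = [log_weight(c, n_dat, a) for c in chi2_values]
        top = max(terms)  # log-sum-exp to avoid underflow
        log_sum = top + math.log(sum(math.exp(t - top) for t in terms))
        log_p.append(log_sum - math.log(a))
    top = max(log_p)
    dens = [math.exp(lp - top) for lp in log_p]
    norm = sum(dens)
    return [d / norm for d in dens]


# Hypothetical input: five replicas whose chi2 equals the number of points.
n_dat = 11
chi2_values = [11.0] * 5
alphas = [0.5 + 0.01 * i for i in range(251)]
density = p_alpha(chi2_values, n_dat, alphas)
peak = alphas[density.index(max(density))]
```

In this toy case every replica has a $\chi^2$ per data point of exactly one, so the density peaks at $\alpha = 1$; replicas with systematically smaller $\chi^2$ would move the peak below one.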
After reweighting NNPDF2.1 with the ATLAS data, the quality of their description remains substantially unchanged, with the value . The number of effective replicas of the reweighted set, computed according to Eq. (42) in the Appendix of [14], is , out of the initial number of replicas in the prior. The distribution of the $\chi^2$ for the weighted replicas, shown in the center plot of Fig. 13, peaks just below one, again confirming the very good description of these data also after reweighting.
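For orientation, the weights and the effective number of replicas can be computed as in the following sketch. It assumes the weight $w_k \propto (\chi^2_k)^{(n-1)/2} e^{-\chi^2_k/2}$, normalised so that the weights sum to the number of replicas $N$, and the Shannon-entropy definition $N_{\rm eff} = \exp\{\frac{1}{N}\sum_k w_k \ln(N/w_k)\}$; both forms are assumptions of the sketch to be compared with the reweighting formula and with Eq. (42) of [14]. The function names and inputs are hypothetical.

```python
import math


def reweight(chi2_values, n_dat):
    """Weights proportional to (chi2)^((n-1)/2) * exp(-chi2/2),
    normalised so that they sum to the number of replicas."""
    log_w = [0.5 * (n_dat - 1) * math.log(c) - 0.5 * c for c in chi2_values]
    top = max(log_w)  # subtract the maximum before exponentiating
    raw = [math.exp(lw - top) for lw in log_w]
    norm = sum(raw)
    n_rep = len(chi2_values)
    return [n_rep * r / norm for r in raw]


def effective_replicas(weights):
    """Shannon-entropy estimate of the number of effective replicas."""
    n_rep = len(weights)
    entropy = sum(w * math.log(n_rep / w) for w in weights if w > 0.0)
    return math.exp(entropy / n_rep)


# Hypothetical examples with n_dat = 11 total chi2 values per replica.
flat = reweight([11.0] * 4, 11)               # all replicas equally good
peaked = reweight([10.0, 1000.0, 1000.0], 11)  # one replica dominates
n_eff_flat = effective_replicas(flat)
n_eff_peaked = effective_replicas(peaked)
```

When all replicas describe the new data equally well no information is lost and $N_{\rm eff} = N$; when a single replica dominates, $N_{\rm eff}$ approaches one, signalling that the prior set is too small for reliable reweighting.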
Given the outcome of the previous statistical analysis, namely a very good description of the data by the prior set to start with, resulting in a large number of surviving replicas (), it is easy to predict that the ATLAS data alone will impose only mild constraints on the underlying PDFs. This is in fact what is seen in Fig. 14, where we compare the NNPDF2.1 light (anti)flavour densities at the scale to the ones obtained after reweighting with the ATLAS data. The most noticeable effect is a reduction of the uncertainties on these PDFs in the medium-small $x$ region, around , by up to .
We now turn to the CMS measurements described in [23]. CMS presented data for both the electron and muon charge asymmetries from $W$ decays with two different cuts on the transverse momentum of the detected lepton: GeV and GeV. From the values for $\chi^2$ obtained using the NNPDF2.1 global set reported in Table 3, and the plots of the distribution of $\chi^2$ for individual replicas and of the $P(\alpha)$ distribution shown in Fig. 15, we see that both sets are equally well described by the NNPDF2.1 set and thus compatible with the data included in the global analysis. Since the two datasets are not independent, we have to choose which one to use in our reweighting analysis; we consider only the dataset with the looser cut GeV, which proves to be more constraining of the PDFs. We perform our reweighting analysis including the muon and electron data as a single dataset.
The NNPDF2.1 prediction provides a good, though not optimal, description of the CMS data, as shown by the $\chi^2$ obtained by combining the values for the electron and muon data collected in Table 3. After reweighting, the description of these data improves significantly with . The number of effective replicas computed as above is roughly half the initial number of replicas, out of , suggesting that these data will have a significant impact on the PDFs. The distribution of the $\chi^2$ of individual replicas after reweighting is centered around one, as shown in the upper middle plot of Fig. 15.
The impact of the CMS data on light (anti)flavour PDFs is shown in Fig. 16, where we observe a reduction of uncertainties in the medium-$x$ region smaller than that due to the ATLAS data, but also a change in the shape of the and distributions at relatively large $x$, pushing up the central value a little and reducing the uncertainties by around for the down distributions and as much as for the up.
We conclude this Section by comparing the predictions for the charge asymmetry computed with NNPDF2.1 and NNPDF2.1 after reweighting with the ATLAS and CMS data respectively in Fig. 17. The effect on the prediction for the CMS data is more substantial, because the data undershoot the NNPDF2.1 NLO prediction in most of the higher rapidity bins.
5.2 Combination of ATLAS and CMS data
We now consider adding the ATLAS and CMS lepton charge asymmetry data as a single dataset to the NNPDF2.1 NLO global fit using reweighting.
The whole dataset is already well described by the NNPDF2.1 NLO set, with and the distribution of $\chi^2$ for individual replicas having a sharp peak around one, as shown by the left plot in Fig. 18. The compatibility of the ATLAS+CMS data with the data included in the global analysis, and between the two experiments, is also good, as can be deduced by looking at the $P(\alpha)$ distribution shown in the right plot in Fig. 18, which is nicely peaked around one.
After reweighting, the description of the data improves, with , and the distribution of $\chi^2$ for individual replicas, shown in the middle plot of Fig. 18, has a sharp peak around one. These results, combined with the number of effective replicas surviving after reweighting, namely out of the initial , show that the use of the ATLAS and CMS data together in the fit is not only possible but imposes a moderate constraint on PDFs. However, the constraint is not quite as strong as with the CMS data alone, suggesting a mild incompatibility, particularly in the high rapidity bins.
The impact of the data on the light flavour and antiflavour distributions is shown in Fig. 19, where we compare the and quark and antiquark distributions at the scale from the NNPDF2.1 global fit and the ones obtained after adding the ATLAS and CMS lepton charge asymmetry data using reweighting. There is around reduction in uncertainties around , mainly due to the ATLAS data, complemented by a reduction of between and at larger , mainly due to the CMS data.
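A reduction in uncertainty of this kind can be quantified by comparing the standard deviation over the prior replicas with the weighted standard deviation after reweighting. The sketch below assumes the simple estimators $\langle O \rangle = \frac{1}{N}\sum_k w_k O_k$ and $\sigma^2 = \frac{1}{N}\sum_k w_k (O_k - \langle O \rangle)^2$, with weights normalised to sum to $N$; the function name, the replica values and the weights are hypothetical.

```python
import math


def weighted_mean_std(values, weights):
    """Weighted mean and standard deviation over replicas.
    Weights are assumed to sum to the number of replicas."""
    n_rep = len(values)
    mean = sum(w * v for w, v in zip(weights, values)) / n_rep
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / n_rep
    return mean, math.sqrt(var)


# Hypothetical values of a PDF at fixed x and scale, one per replica.
values = [1.0, 2.0, 3.0, 4.0]
prior_mean, prior_std = weighted_mean_std(values, [1.0, 1.0, 1.0, 1.0])
new_mean, new_std = weighted_mean_std(values, [0.0, 2.0, 2.0, 0.0])
reduction = 1.0 - new_std / prior_std  # fractional reduction in uncertainty
```

In this toy case the new data disfavour the two outlying replicas, so the central value is unchanged while the uncertainty shrinks by a little over half.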
6 Global PDFs including LHC data
In this section we will check the consistency of the D0 and ATLAS+CMS datasets among themselves, and use both datasets to reweight the NNPDF2.1 NLO PDFs. The unweighting method presented in Sect. 3 is then used to produce a set of 100 unweighted replicas. The final product of this analysis is a new set of NNPDF parton distribution functions, NNPDF2.2 NLO, which includes, together with all the datasets already included in the NNPDF2.1 NLO global set, the D0, ATLAS and CMS lepton charge asymmetry data described above.
6.1 Tevatron asymmetry data
In Ref. [14] we used the reweighting technique to study the compatibility of the D0 lepton charge asymmetry data with the data included in the NNPDF2.0 NLO global fit and to assess their impact on the fitted parton densities. The conclusion of this study was that the data that are inclusive in the of the identified lepton, namely the muon charge asymmetry data presented in [21] and electron charge asymmetry data with GeV released in [20], are consistent with each other and with all the other datasets included in NNPDF2.0, in particular with the CDF asymmetry data [28] and the fixed-target DIS deuteron data. When included in the fit they have a moderate impact on PDFs, providing a reduction of the uncertainty of the valence quark distributions in the medium-high $x$ region ().
Less inclusive electron charge asymmetry data were also presented in [20]. They are binned in , divided into two sets with and respectively. We observed [14] that these data, which could have potentially more impact on the PDFs, are inconsistent with some of the DIS data included in the global analysis and have problems of internal consistency. Similar conclusions have been reported by the MSTW [29] and CTEQ [25] collaborations, as they tried to include these datasets in the context of a PDF global analysis. We will thus not use these datasets here.
These results, though obtained using the NNPDF2.0 global fit, remain substantially unchanged if we use instead the NNPDF2.1 NLO global set as a prior fit to start the reweighting analysis. The muon charge asymmetry [21] and inclusive electron charge asymmetry data (with GeV) [20] can thus provide additional information to that from the ATLAS and CMS data considered in the previous section. We thus proceed directly to a combined fit of these data together with the LHC data.
Experiment    $N_{\rm dat}$   NNPDF2.1   NNPDF2.1 LHC   NNPDF2.2
NMCpd         132             0.97       0.95           0.97
NMC           221             1.73       1.72           1.72
SLAC          74              1.33       1.26           1.28
BCDMS         581             1.24       1.23           1.23
HERAIAV       592             1.07       1.07           1.07
CHORUS        862             1.15       1.15           1.15
FLH108        8               1.37       1.37           1.37
NTVDMN        79              0.79       0.74           0.70
ZEUSH2        127             1.29       1.28           1.28
ZEUSF2C       50              0.78       0.79           0.78
H1F2C         38              1.51       1.52           1.51
DYE605        119             0.84       0.84           0.86
DYE886        199             1.25       1.23           1.27
CDFWASY       13              1.85       1.81           1.81
CDFZRAP       29              1.66       1.61           1.70
D0ZRAP        28              0.60       0.60           0.58
CDFR2KT       76              0.98       0.98           0.96
D0R2CON       110             0.84       0.84           0.83
ATLASmuASY    11              [0.77]     0.97           1.07
CMSeASY       6               [1.83]     1.23           1.08
CMSmuASY      6               [1.24]     0.63           0.56
D0eASY        12              [4.39]     [3.46]         1.38
D0muASY       10              [1.48]     [1.17]         0.35
Total                         1.165      1.158          1.157
6.2 Combining LHC and Tevatron asymmetry data
The description of the combined ATLAS, CMS and D0 charge asymmetry datasets obtained using the NNPDF2.1 NLO global fit, in which they were not included, is reasonably good but not optimal, with : a detailed comparison is shown in Table 4. The distribution of the combined $\chi^2$ for individual replicas before and after reweighting, and the $P(\alpha)$ distribution, shown in Fig. 20, indicate however that these data are reasonably compatible with the data already included in the NNPDF2.1 analysis and would provide a significant constraint on the PDFs.
These conclusions are indeed confirmed when the effect of the ATLAS, CMS and D0 data is included using the reweighting technique. After reweighting, their overall description improves significantly, with a combined . This is due to a significant improvement in the fit to the CMS and the D0 data; the fit to the ATLAS data deteriorates a little, again showing that there is some tension. The number of effective replicas is now out of the initial , showing that the lepton asymmetry data indeed introduce very significant constraints on the PDFs. The distribution of the $\chi^2$ for the individual replicas after reweighting, shown in the middle plot of Fig. 20, is peaked around one, confirming the compatibility of these data with the other datasets included in the global analysis.
After reweighting, the unweighting procedure of Sec. 3 may be used to give a replica set of PDFs equivalent to a global fit which includes all the data already included in NNPDF2.1, plus the ATLAS, CMS and D0 asymmetry data. We call this new NLO PDF set NNPDF2.2. The quality of the fit to all the datasets used in this new fit is shown in Tab. 4. There is no significant deterioration in the $\chi^2$ of any of the other datasets included in the global fit, and the fit to the NuTeV dimuon data improves significantly. The overall $\chi^2$ thus also improves a little.
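The unweighting step can be illustrated with a short sketch. It assumes a deterministic construction in which the normalised weights are turned into cumulative probabilities $P_k$, the unit interval is divided into $N'$ equal steps, and replica $k$ is copied once for every step $j/N'$ that falls in $(P_{k-1}, P_k]$; the precise definition is the one given in Sec. 3, and the function name and inputs here are hypothetical.

```python
from bisect import bisect_left


def unweight(weights, n_new):
    """Number of copies of each replica in an unweighted set of size n_new.

    weights: non-negative replica weights (any overall normalisation).
    """
    total = sum(weights)
    cumulative = []
    running = 0.0
    for w in weights:
        running += w / total
        cumulative.append(running)
    # Guard against rounding: the last step must land on a replica
    # that actually carries weight.
    last = max(i for i, w in enumerate(weights) if w > 0.0)
    counts = [0] * len(weights)
    for j in range(1, n_new + 1):
        k = bisect_left(cumulative, j / n_new)  # first k with P_k >= j/n_new
        counts[min(k, last)] += 1
    return counts


# Hypothetical weights for four replicas, unweighted into eight replicas.
copies = unweight([3.0, 0.0, 1.0, 0.0], 8)
```

Replicas with vanishing weight are dropped and those with large weight appear several times, so the resulting set can be used with the same unweighted averages as any standard Monte Carlo replica set.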
The impact on light flavour and antiflavour PDFs is shown in Fig. 21, where we compare the and quark and antiquark distributions at the scale from the NNPDF2.1 NLO set to the ones obtained for the NNPDF2.2 NLO set. The most noticeable effects of the inclusion of the new data are concentrated in two separate regions of $x$, namely, the region, which is mostly affected by the ATLAS data, and the region, which is mostly affected by the CMS and D0 data. In each of these regions, the asymmetry data lead to a reduction of uncertainties on the light flavour and antiflavour distributions, of around in the low-$x$ region, and up to at higher $x$ when CMS and D0 are combined (see Fig. 22). At higher $x$, changes in the central values for these PDFs by up to one sigma are also observed: these are mainly due to the D0 data (compare Fig. 21 with Fig. 19).
As recently shown in the extensive studies carried out in the context of the PDF4LHC Working Group [30], there is rather good agreement among NLO parton distributions determined from the widest global datasets, specifically by the NNPDF, MSTW and CTEQ groups. However, there still are some significant differences, notably in the flavour separation at medium-large $x$. Since this is the region which is directly probed by the Tevatron and LHC lepton charge asymmetry data studied here, these data might help in resolving some of these outstanding incompatibilities.
To this end, in Figs. 23 and 24 we compare the and combinations at the scale obtained in the NNPDF2.1 and MSTW08 NLO global analyses, which do not include any of the asymmetry data, the CT10 analysis, which includes only the D0 data, and the new NNPDF2.2 fit, which also includes the ATLAS and CMS data. The new data lie in a region of $x$ where the compatibility between the results obtained by different collaborations is at best marginal: in particular the ratio given by MSTW08 is too low at large $x$ and too high at medium $x$. The reduction of uncertainty when going from NNPDF2.1 to NNPDF2.2 is quite visible: the NNPDF2.2 prediction should thus be taken as the most reliable at present. Future LHC data will constrain the light quark PDFs in this region even more.
7 Conclusions and outlook
The reweighting method which we have reviewed, rederived and refined in this paper is a powerful technique which enables one both to perform interesting studies of the statistical properties of parton distributions viewed as probability distributions in a space of functions, and to rapidly and effectively include new experimental information in parton sets. Coupled to the unweighting method that we have presented and tested here, it allows one to quickly upgrade existing Monte Carlo replica PDF sets to new sets which, while retaining the same format, include new experimental information.
The method has been used here to construct the NNPDF2.2 NLO PDF set, the first PDF set to include LHC data. This will doubtless be the first of many such sets: the quantity, quality and diversity of LHC measurements potentially relevant for PDF determination are now growing at an impressive rate.
The NNPDF2.2 NLO PDF set that has been presented in Section 6 is available from the NNPDF web site,
and will also be available through the LHAPDF interface [31]:

NNPDF2.2 NLO, set of 100 replicas:
NNPDF22_nlo_100.LHgrid
Acknowledgments
We are especially grateful to John Collins and Jon Pumplin for detailed questions and a critique of the reweighting method which largely stimulated this investigation. We thank Georgios Daskalakis, Gautier Hamel de Monchenault, Michele Pioppi, Michael Schmitt and Ping Tan for help with the LHC asymmetry data, and Giancarlo Ferrera for help with the DYNNLO code. LDD acknowledges the warm hospitality of the theory group at KMI, Nagoya, during the final stages of this work. RDB would likewise like to thank the Discovery Center at the NBI, Copenhagen. MU is supported by the Bundesministerium für Bildung und Forschung (BmBF) of the Federal Republic of Germany (project code 05H09PAE). We would like to acknowledge the use of the computing resources provided by the Black Forest Grid Initiative in Freiburg and by the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk).
References
 [1] S. Forte et al., JHEP 05 (2002) 062, hep-ph/0204232.
 [2] The NNPDF Collaboration, L. Del Debbio et al., JHEP 03 (2005) 080, hep-ph/0501067.
 [3] The NNPDF Collaboration, L. Del Debbio et al., JHEP 03 (2007) 039, hep-ph/0701127.
 [4] The NNPDF Collaboration, R.D. Ball et al., Nucl. Phys. B809 (2009) 1, arXiv:0808.1231.
 [5] The NNPDF Collaboration, J. Rojo et al., (2008), arXiv:0811.2288.
 [6] The NNPDF Collaboration, R.D. Ball et al., Nucl. Phys. B823 (2009) 195, arXiv:0906.1958.
 [7] The NNPDF Collaboration, R.D. Ball et al., Nucl. Phys. B838 (2010) 136, arXiv:1002.4407.
 [8] The NNPDF Collaboration, R.D. Ball et al., Nucl. Phys. B849 (2011) 296, arXiv:1101.1300.
 [9] The NNPDF Collaboration, R.D. Ball et al., (2011), arXiv:1107.2652.
 [10] W.T. Giele, S.A. Keller and D.A. Kosower, (2001), hep-ph/0104052.
 [11] The NNPDF Collaboration, R.D. Ball et al., “Parton distributions: determining probabilities in a space of functions”, to be published in the proceedings of PHYSTAT 2011; CERN Yellow Report, 2011, arXiv:1110.1863.
 [12] S. Forte, Acta Phys. Polon. B41 (2010) 2859, arXiv:1011.5247.
 [13] W.T. Giele and S. Keller, Phys. Rev. D58 (1998) 094023, hep-ph/9803393.
 [14] The NNPDF Collaboration, R.D. Ball et al., Nucl. Phys. B849 (2011) 112, arXiv:1012.0836.
 [15] CDF Run II, A. Abulencia et al., Phys. Rev. D75 (2007) 092006, hep-ex/0701051.
 [16] D0, V.M. Abazov et al., Phys. Rev. Lett. 101 (2008) 062001, arXiv:0802.2400.
 [17] G. Moreno et al., Phys. Rev. D43 (1991) 2815.
 [18] E. Jaynes, Probability Theory: The Logic of Science (Cambridge University Press, 2003).
 [19] CDF, T. Aaltonen et al., Phys. Rev. D78 (2008) 052006, arXiv:0807.2204.
 [20] D0, V.M. Abazov et al., Phys. Rev. Lett. 101 (2008) 211801, arXiv:0807.3367.
 [21] D0, V.M. Abazov et al., Phys. Rev. D77 (2008) 011106, arXiv:0709.4254.
 [22] ATLAS, G. Aad et al., (2011), arXiv:1103.2929.
 [23] CMS, S. Chatrchyan et al., JHEP 04 (2011) 050, arXiv:1103.3470.
 [24] LHCb, T. Shears, PoS EPS-HEP2009 (2009) 306.
 [25] H.L. Lai et al., Phys. Rev. D82 (2010) 074024, arXiv:1007.2241.
 [26] A.D. Martin et al., Eur. Phys. J. C63 (2009) 189, arXiv:0901.0002.
 [27] S. Catani et al., Phys. Rev. Lett. 103 (2009) 082001, arXiv:0903.2120.
 [28] CDF, T. Aaltonen et al., Phys. Rev. Lett. 102 (2009) 181801, arXiv:0901.2169.
 [29] R.S. Thorne et al., PoS DIS2010 (2010) 052, arXiv:1006.2753.
 [30] S. Alekhin et al., (2011), arXiv:1101.0536.
 [31] D. Bourilkov, R.C. Group and M.R. Whalley, (2006), hep-ph/0605240.