Visualizing the sensitivity of hadronic experiments to nucleon structure
Abstract
Determinations of the proton’s collinear parton distribution functions (PDFs) are emerging with growing precision due to increased experimental activity at facilities like the Large Hadron Collider. While this copious information is valuable, the speed at which it is released makes it difficult to quickly assess its impact on the PDFs, short of performing computationally expensive global fits. As an alternative, we explore new methods for quantifying the potential impact of experimental data on the extraction of proton PDFs. Our approach relies crucially on the Hessian correlation between theory-data residuals and the PDFs themselves, as well as on a newly defined quantity, the sensitivity, which represents an extension of the correlation and reflects both PDF-driven and experimental uncertainties. This approach is realized in a new, publicly available analysis package PDFSense, which operates with these statistical measures to identify particularly sensitive experiments, weigh their relative or potential impact on PDFs, and visualize their detailed distributions in a space of the parton momentum fraction $x$ and factorization scale $\mu$. This tool offers a new means of understanding the influence of individual measurements in existing fits, as well as a predictive device for directing future fits toward the highest-impact data and assumptions. Along the way, many new physics insights can be gained or reinforced. As one of many examples, PDFSense is employed to rank the projected impact of new LHC measurements in jet, vector boson, and $t\bar{t}$ production and leads us to the conclusion that inclusive jet production will play an indispensable role in future PDF fits, comparable to HERA deep-inelastic scattering, by providing key leverage on the PDF uncertainty of the Higgs production cross section, gluon, strangeness and other sea-quark flavors.
I Introduction
The determination of collinear parton distribution functions (PDFs) of the nucleon is becoming an increasingly precise discipline with the advent of high-luminosity experiments at both colliders and fixed-target facilities. Several research groups are involved in the rich research domain of the modern PDF analysis Dulat et al. (2016); Harland-Lang et al. (2015); Ball et al. (2017); Alekhin et al. (2017); Accardi et al. (2016); Abramowicz et al. (2015); Alekhin et al. (2015). By quantifying the distribution of a parent hadron’s longitudinal momentum among its constituent quarks and gluons, PDFs offer both a description of the hadronic structure and an essential ingredient of perturbative QCD computations. PDFs enjoy a symbiotic relationship with high-energy experimental data, in the sense that they are crucial for understanding hadronic collisions in the Standard Model (SM) and beyond, while reciprocally benefiting from a wealth of high-energy data that constrain the PDFs. In fact, since the start of the Large Hadron Collider Run II (LHC Run II), the volume of experimental data pertinent to the PDFs has been growing with such speed that keeping pace with the rapidly expanding datasets and isolating measurements of greatest impact presents a significant challenge for PDF fitters. This paper intends to meet this challenge by presenting a method for identifying high-value experiments which constrain the PDFs and the resulting SM predictions that depend on them.
That such expansive datasets can constrain the PDFs is a consequence of the latter’s universality, a feature which relies upon QCD factorization theorems to separate the inherently nonperturbative PDFs (at long distances) from process-dependent, short-distance matrix elements. For instance, the cross section for inclusive single-particle hadroproduction (of, e.g., a weak gauge boson $W/Z$) in proton-proton collisions at the LHC is directly sensitive to the nucleon PDFs via an expression of the form
$$\sigma(AB \to W/Z + X) = \sum_n \alpha_s^n(\mu_R^2) \sum_{a,b} \int dx_a\, dx_b\; f_{a/A}(x_a, \mu^2)\; \hat\sigma^{(n)}_{ab \to W/Z+X}(\hat s, \mu^2, \mu_R^2)\; f_{b/B}(x_b, \mu^2) \tag{1}$$
in which $f_{a/A}(x_a, \mu^2)$ represents the PDF for a parton of flavor $a$ carrying a fraction $x_a$ of the 4-momentum of proton $A$ at a factorization scale $\mu$; the $n$-th order hard matrix element is denoted by $\hat\sigma^{(n)}_{ab \to W/Z+X}$ and is dependent upon the squared partonic center-of-mass energy $\hat s = x_a x_b s$, in which $\sqrt{s}$ is the center-of-mass energy of the initial hadronic system; and $\mu_R$ is the renormalization scale in the QCD coupling strength $\alpha_s(\mu_R^2)$. In Eq. (1), subleading corrections have been omitted, and we emphasize that factorization theorems like Eq. (1) have been proved to arbitrary order in $\alpha_s$ for essential observables in the global PDF analysis, such as the inclusive cross sections in DIS and Drell-Yan processes. For compactness and generality, we shall refer henceforth to a PDF for the parton of flavor $a$ simply as $f(x, \mu)$.
Given this formalism, one is confronted with the problem of finding those experiments that provide reliable new information about the PDF behavior. With the proliferation of potentially informative new data, incorporating them all into a global QCD fit inevitably incurs significant cost both in terms of computational resources and required fitting time. Indeed, tremendous progress in the precision of PDFs and robustness of SM predictions is driven by the technology for performing global analysis, which has vastly grown in complexity and sophistication. Nowadays, state-of-the-art perturbative QCD (pQCD) treatments are done at NNLO (and increasingly even N$^3$LO), and advanced statistical techniques are commonly employed in PDF error estimation. The subject is vast, and we refer the interested reader to Refs. Gao et al. (2017); Butterworth et al. (2016) for comprehensive reviews. The trade-off of this progress is that the impact of an experiment on the ultimate PDF uncertainty is often hard to foresee without doing a complicated fit in which the outcome reflects numerous theoretical, experimental, and methodological components relevant at the modern precision level.
This potential cost is steepened by the large size of the global datasets usually involved. This point can be seen in Fig. 1, which plots the default dataset considered in the present analysis in a space of partonic momentum fraction $x$ and factorization scale $\mu$. We label these data as the “CTEQ-TEA set,” given that it is an extension of the 3258 raw data points (given by the sum of the per-experiment point counts in Tables 1 and 2) treated in the NNLO CT14HERA2 analysis of Ref. Hou et al. (2017), now augmented by the inclusion of 734 raw data points (given by the analogous sum in Table 3) from more recent LHC data. These raw measurements can ultimately be mapped to 5227 $\{x, \mu\}$ pairs (given by the sum of the per-experiment counts in, for instance, Table 5) such that each symbol in Fig. 1 corresponds to a data point from an experiment shown in the legend, at the approximate $x$ and $\mu$ values characterizing the data point as described in Appendix A. The experiments are labeled by a shorthand experimental ID number, following the translation key also given in Tables 1–3 of App. B. The experiments included in the CT14HERA2 analysis are listed in the left part of the legend, and those considered for the upcoming CTEQ-TEA analysis, in the legend’s right part.
The growing complexity of PDF fitting has for years placed a high premium on less computationally involved approaches to estimate the impact of new experimental data on full global fits, such as Hessian profiling techniques Camarda et al. (2015) and Bayesian reweighting Ball et al. (2011, 2012) of PDFs. Although these approaches are useful for simulating the expansion of a particular global fit to include theretofore absent dataset(s), they are also limited in that the interpretation of their outcomes is married to the specific PDF parametrization and definition of PDF errors. For example, conclusions obtained by PDF reweighting regarding the importance of a given data set strongly depend on the assumed statistical tolerance or the choice of reweighting factors Sato et al. (2014); Paukkunen and Zurita (2014).
Parallel to these efforts, the notion of using correlations between the PDF uncertainties of two physical observables was proposed in Refs. Pumplin et al. (2001); Nadolsky and Sullivan (2001) as a means of quantifying the degree to which these quantities were related based upon their underlying PDFs. The PDF-mediated correlation in this case, which we define in Sec. III.1, embodies the Pearson correlation coefficient computed by a generalization of the “master formula” Pumplin et al. (2002) for the Hessian PDF uncertainty. The Hessian correlation was deployed extensively in Ref. Nadolsky et al. (2008) to explore implications of the CTEQ6.6 PDFs for envisioned LHC observables. It proved to be instrumental for identifying the specific PDF flavors and $x$ ranges most correlated with the PDF uncertainties for $W$, $Z$, $H$, and $t\bar{t}$ production cross sections as well as other processes. However, the PDF-mediated correlation with a theoretical cross section is only partly indicative of the sensitivity of the experiment. The constraining power of the experiment also depends on the size of experimental errors that were not normally considered in correlation studies, as well as on correlated systematic effects that are increasingly important.
As a remedy to these limitations, we introduce a new format for the output of CTEQ-TEA fits and a natural extension of the correlation technique to quantify the sensitivity of any given experimental data point to a PDF-dependent observable of the user’s choice. In this approach, we work with statistical residuals quantifying the goodness-of-fit to individual data points. We demonstrate that the complete set of residuals computed for Hessian PDF sets characterizes the CTEQ-TEA fit well enough to permit a means of gauging the influence of empirical information on PDFs in a fashion that does not require complete refits.
As a generalization of the PDF-mediated correlations, we introduce the sensitivity $S_f$ (to be characterized in detail in Sec. III.1), which better identifies those experimental data points that tightly constrain PDFs both by merit of their inherent precision and their ability to discriminate among PDF error fluctuations. Such an approach can aid in identifying regions of $\{x, \mu\}$ for which PDFs are particularly constrained by physical observables.
In fact, in the numerical approach presented in the forthcoming sections, the user can quantify the sensitivity of data not only to individual PDF flavors, but even to specific physical observables, including the modifications due to correlated systematic uncertainties in every experiment of the CT14HERA2 analysis. For example, for Higgs boson production via gluon fusion ($gg \to H$) at the 14 TeV LHC, the short-distance cross sections are known up to N$^3$LO with a scale uncertainty of about 3% Anastasiou et al. (2016). It has been suggested that $t\bar{t}$ production and high-$p_T$ $Z$ boson production already provide comparable constraints on the gluon PDF in the $x$ region sensitive to LHC Higgs production, and that these are comparable to the constraints from LHC and Tevatron data Czakon et al. (2017); Boughezal et al. (2017). Verifying the degree to which this hypothesis is true has been difficult without actually including all these data in a fit.
As an alternative to doing a full global fit, we can critically assess this supposition in the context of the entire global dataset of Fig. 1 using the Hessian correlations and sensitivities, $|C_f|$ and $|S_f|$, associated with the Higgs production cross section $f = \sigma_H$. The detailed procedure is explained in Sec. III.1. Fig. 2 shows the $|C_f|$ and $|S_f|$ distributions that we obtain in $\{x, \mu\}$ space. The data points that have large values of $|C_f|$ and $|S_f|$, and hence constrain the PDF dependence of $\sigma_H$, are highlighted with color according to the conventions described in Sec. III.
Fig. 2 illustrates that the sensitivity measure generally identifies more data points providing constraints on $\sigma_H$ than the correlation, as can be seen by comparing the number of highlighted data points in the left and right panels of Fig. 2. In the left panel, only a subgroup of less than half of the inclusive jet production points, together with some HERA Neutral Current (NC) DIS data and $t\bar{t}$ production points, show the most significant correlations, taken to have $|C_f| > 0.7$ in this comparison. With our improved definition for the sensitivity, however, the corresponding plot in the right panel demonstrates that many more points have large sensitivity to $\sigma_H$, with $|S_f| > 0.25$. We note that these choices for the significance thresholds for $|C_f|$ and $|S_f|$ are grounded in previous work and we discuss their motivation in greater detail in Sec. III. As noted in the caption of Fig. 2, these data include most of the analyzed jet production data, $t\bar{t}$ and high-$p_T$ $Z$ production, as well as various DIS experiments. Many of these points have comparable $|S_f|$ values, with no experiments aside from HERA (160) and LHC jet production clearly standing out. From this comparison, one would conclude that efforts to constrain PDF-based SM predictions for Higgs production relying only on a few points of $t\bar{t}$ data, to the neglect of high-energy jet production points, would be significantly handicapped by the absence of the latter. We will return to this example in Sec. IV.
The discriminating power of a sensitivity-based analysis, as just illustrated for constraints on Higgs production, therefore forms the primary motivation for this work, and we present the attendant details below. To assess information about the PDFs encapsulated in the residuals for large collections of hadronic data implemented in the CTEQ-TEA global analysis, we make available a new statistical package PDFSense to visualize the regions of partonic momentum fractions $x$ and QCD factorization scales $\mu$ where the experiments impose strong constraints on the PDFs.
The remainder of the article proceeds as follows. Pertinent aspects of the PDFs and their standard determination via QCD global analyses are summarized in Sec. II. We describe how to extract and visualize information about the global QCD fit provided by statistical residuals returned by the PDF analysis. New statistical quantities to characterize the global analysis are presented in Sec. III, followed by an illustration in Sec. IV of their implementation to examine the impact of various CTEQ-TEA datasets on extractions of the gluon PDF $g(x, \mu)$. Finally, in the conclusion contained in Sec. V we emphasize a number of physics insights that may be gained with our sensitivity analysis techniques. Additional aspects of the technique and supporting tables are reserved for Apps. A and B, respectively.
II PDF preliminaries
II.1 Data residuals in a global QCD analysis
While various theoretical methods exist to compute nucleon PDFs in terms of models Farrar and Jackson (1975); Hobbs et al. (2015, 2014), their unambiguous evaluation entirely in terms of QCD theory is not yet possible due to the fact that the PDFs can in general receive substantial nonperturbative contributions at infrared momenta. For this reason, precise PDF determination has proceeded mainly through the technique of the QCD global analysis — a method enabled by QCD factorization and PDF universality.
In this approach, a highly flexible parametric form is ascribed for the various flavors in a given analysis at a relatively low scale $\mu = Q_0$. For example, one might take the input PDF for a given quark flavor $q$ to be a parametric form
$$f_{q/p}(x, \mu = Q_0) = a_{q,0}\, x^{a_{q,1}}\, (1-x)^{a_{q,2}}\, P_q(x; a_{q,3}, \dots) \tag{2}$$
in which $P_q$ can be a suitable polynomial function, e.g., a Chebyshev or Bernstein polynomial, or replaced with a feed-forward neural network as in the NNPDF approach. While the full statistical theory for PDF determination and error quantification is beyond the intended range of this analysis, roughly speaking, a best fit is found for a vector $\vec{a}$ of PDF parameters by minimizing a goodness-of-fit function $\chi^2$ describing agreement of the QCD data and physical observables computed in terms of the PDFs. Based on the behavior of $\chi^2$ in the neighborhood of the global minimum, it is then possible to construct an ensemble of error PDFs to quantify uncertainties of PDFs at a predetermined probability level.
There are in principle various ways to evaluate uncertainties on PDFs, e.g., the Hessian method Pumplin et al. (2001, 2002), the Monte Carlo method Giele and Keller (1998); Giele et al. (2001), and the Lagrange Multiplier approach Stump et al. (2001). In this analysis our default PDF input set is CT14HERA2, which uses the Hessian method to estimate uncertainties and is therefore based on the quadratic assumption for $\chi^2(\vec{a})$ in the vicinity of the global minimum. In the Hessian method, an orthonormal basis of PDF parameters is derived from the input PDF parameters $\vec{a}$ by the diagonalization of a Hessian matrix $H$, which encodes the second-order derivatives of $\chi^2$ with respect to $a_i$. The eigenvector PDF combinations $\vec{a}_l^{\pm}$ are found for two extreme variations from the best-fit vector $\vec{a}_0$ along the direction of the $l$-th eigenvector of $H$ allowed at a given probability level. The uncertainty on a QCD observable $X$ can then be estimated with one of the available “master formulas” Pumplin et al. (2002); Nadolsky and Sullivan (2001), the “symmetric” variety of which is
$$\Delta X = \frac{1}{2} \sqrt{\sum_{l=1}^{N} \left[ X(\vec{a}_l^{+}) - X(\vec{a}_l^{-}) \right]^2} \tag{3}$$
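As an illustration of the symmetric master formula, the following minimal Python sketch evaluates the uncertainty from pairs of observable values computed on the "+" and "-" eigenvector sets. The function name and the toy numbers are hypothetical and are not taken from PDFSense or from any CTEQ-TEA fit.

```python
import math

def symmetric_hessian_uncertainty(x_plus, x_minus):
    """Return 0.5 * sqrt(sum_l (X_l^+ - X_l^-)^2), the symmetric master formula."""
    if len(x_plus) != len(x_minus):
        raise ValueError("need one '+' and one '-' value per eigenvector direction")
    return 0.5 * math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))

# Toy values of an observable X on N = 3 eigenvector pairs (made-up numbers).
x_plus = [10.3, 10.0, 9.6]
x_minus = [9.7, 10.0, 10.4]
delta_x = symmetric_hessian_uncertainty(x_plus, x_minus)
print(f"Delta X = {delta_x:.3f}")
```

In a realistic setting the two lists would hold one entry per eigenvector direction (28 pairs for a 56-member error set).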
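In the CTEQ-TEA global analysis, the $\chi^2$ function accounts for multiple sources of experimental uncertainties, as well as for some prior theoretical constraints on the PDF parameters $\vec{a}$. Consequently, the global $\chi^2$ function takes the form
$$\chi^2(\vec{a}) = \sum_{E} \chi^2_E(\vec{a}) + \chi^2_{th}(\vec{a}) \tag{4}$$
where the sum runs over all experimental datasets $E$ and $\chi^2_{th}$ imposes theoretical constraints. The complete formulas for $\chi^2_E$ and $\chi^2_{th}$ can be found in Ref. Gao et al. (2014). For the purposes of this paper, we express $\chi^2_E$ for each experiment $E$ in a compact form as a sum of squared shifted residuals $r_i^2(\vec{a})$, which are summed over the $N_{pt}$ individual data points $i$ in this experiment, as well as the contributions of the $N_\lambda$ best-fit nuisance parameters $\bar\lambda_\alpha$ associated with correlated systematic errors:
$$\chi^2_E(\vec{a}) = \sum_{i=1}^{N_{pt}} r_i^2(\vec{a}) + \sum_{\alpha=1}^{N_\lambda} \bar\lambda_\alpha^2(\vec{a}) \tag{5}$$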
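In turn, $r_i(\vec{a})$ for the $i$-th data point is constructed from the theoretical prediction $T_i(\vec{a})$ evaluated in terms of PDFs, total uncorrelated uncertainty $s_i$, and the shifted central data value $D_i^{sh}(\vec{a})$:
$$r_i(\vec{a}) = \frac{T_i(\vec{a}) - D_i^{sh}(\vec{a})}{s_i} \tag{6}$$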
This representation arises in the Hessian formalism due to the presence of correlated systematic errors in many experimental datasets, which require $\chi^2_E$ to depend on nuisance parameters $\lambda_\alpha$. This is in addition to the dependence of $\chi^2_E$ on the PDF parameters $\vec{a}$ and theoretical parameters such as $\alpha_s$ and particle masses. The $\lambda_\alpha$ parameters are optimized for each $\vec{a}$ according to the analytic solution derived in Appendix B of Ref. Pumplin et al. (2002). Optimization effectively shifts the central value $D_i$ of the data point by an amount determined by the optimal nuisance parameters $\bar\lambda_\alpha(\vec{a})$ and the correlated systematic errors $\beta_{i\alpha}$:
$$D_i \to D_i^{sh}(\vec{a}) = D_i - \sum_{\alpha=1}^{N_\lambda} \beta_{i\alpha}\, \bar\lambda_\alpha(\vec{a}) \tag{7}$$
It should be noted that $\chi^2_E$ in Eq. (5) is in general dominated by the first term involving the shifted residuals, which tends to be much larger than the contribution of the squared best-fit nuisance parameters, especially for more sizable datasets.
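We point out also that some alternative representations for $\chi^2_E$ include the correlated systematic errors via a covariance matrix $(\mathrm{cov})_{ij}$, rather than the above-mentioned CTEQ-preferred form that explicitly operates with $\bar\lambda_\alpha$. Various definitions in use are reviewed in Ball et al. (2013), as well as in Alekhin et al. (2015). Crucially, however, the representations based upon operating with $\bar\lambda_\alpha$ and $(\mathrm{cov})_{ij}$ are derivable from each other Gao et al. (2014). From an extension of the derivation in Ref. Pumplin et al. (2002), we may express the shifted residuals $r_i$ and the optimal nuisance parameters $\bar\lambda_\alpha$ at a point $\vec{a}$ in terms of the covariance matrix as
$$r_i(\vec{a}) = s_i \sum_{j=1}^{N_{pt}} (\mathrm{cov}^{-1})_{ij} \left( T_j(\vec{a}) - D_j \right) \tag{8}$$
$$\bar\lambda_\alpha(\vec{a}) = \sum_{i,j=1}^{N_{pt}} (\mathrm{cov}^{-1})_{ij}\, \beta_{i\alpha} \left( D_j - T_j(\vec{a}) \right) \tag{9}$$
where
$$(\mathrm{cov}^{-1})_{ij} = \frac{\delta_{ij}}{s_i^2} - \sum_{\alpha,\beta=1}^{N_\lambda} \frac{\beta_{i\alpha}}{s_i^2}\, (A^{-1})_{\alpha\beta}\, \frac{\beta_{j\beta}}{s_j^2} \tag{10}$$
and
$$A_{\alpha\beta} = \delta_{\alpha\beta} + \sum_{k=1}^{N_{pt}} \frac{\beta_{k\alpha}\, \beta_{k\beta}}{s_k^2} \tag{11}$$
Thus, even for those PDF analyses which operate with the covariance matrix, one is still able to determine the shifted residuals $r_i$ from $(\mathrm{cov}^{-1})_{ij}$ using Eq. (8). In this article, we conveniently follow the CTEQ methodology and obtain $r_i$ directly from the CTEQ-TEA fitting program, together with the optimal nuisance parameters $\bar\lambda_\alpha$ and shifted central data values $D_i^{sh}$.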
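The equivalence of the two representations can be checked numerically. The sketch below is a minimal illustration restricted to a single correlated systematic source, so that the matrix $A$ reduces to a scalar. It assumes the convention in which the data are shifted by subtracting $\beta_i \bar\lambda$; the function names and the toy numbers are hypothetical and do not come from the CTEQ-TEA code.

```python
def shifted_residuals(theory, data, s, beta):
    """Optimal nuisance parameter and shifted residuals for ONE systematic source.

    lambda_bar = [sum_i beta_i (D_i - T_i) / s_i^2] / A,  A = 1 + sum_i beta_i^2 / s_i^2
    r_i        = (T_i - (D_i - beta_i * lambda_bar)) / s_i
    """
    a = 1.0 + sum(b * b / (u * u) for b, u in zip(beta, s))
    lam = sum(b * (d - t) / (u * u)
              for t, d, u, b in zip(theory, data, s, beta)) / a
    r = [(t - (d - b * lam)) / u for t, d, u, b in zip(theory, data, s, beta)]
    return lam, r

def residuals_from_covariance(theory, data, s, beta):
    """r_i = s_i * sum_j (cov^-1)_ij (T_j - D_j), using the explicit inverse covariance."""
    a = 1.0 + sum(b * b / (u * u) for b, u in zip(beta, s))
    n = len(data)
    residuals = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            diagonal = 1.0 / s[i] ** 2 if i == j else 0.0
            inv_ij = diagonal - beta[i] * beta[j] / (s[i] ** 2 * s[j] ** 2 * a)
            total += inv_ij * (theory[j] - data[j])
        residuals.append(s[i] * total)
    return residuals

# Toy experiment with three points (made-up numbers).
theory = [1.00, 2.00, 3.00]   # T_i
data = [1.10, 2.15, 3.05]     # D_i
s = [0.10, 0.10, 0.20]        # uncorrelated uncertainties s_i
beta = [0.05, 0.10, 0.15]     # correlated systematic errors beta_i

lam, r_shift = shifted_residuals(theory, data, s, beta)
r_cov = residuals_from_covariance(theory, data, s, beta)
chi2_E = sum(x * x for x in r_shift) + lam ** 2
```

Both routes return the same residuals, and `chi2_E` coincides with the quadratic form built from the inverse covariance matrix. With several systematic sources the scalar `a` becomes a matrix that has to be inverted.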
II.2 Visualization of the global fit with the help of residuals
The shifted residuals $r_i$ draw our interest because, in consequence of the definitions in Eqs. (5)-(6), they contain substantial low-level information about the agreement of PDFs with every data point in the global QCD fit in the presence of systematic shifts. The response of $r_i$ to the variations in PDFs depends on the experiment type and kinematic range associated with the data point, and the totality of these responses can be examined with modern data-analytical methods. The sum of squared residuals over all points of the global dataset renders the bulk of the log-likelihood, or experimental, component of the global $\chi^2$. In turn, the root-mean-squared residual $\langle r_0 \rangle_E$ for experiment $E$ and the central PDF set $\vec{a}_0$ is tied to the standard measure of agreement with experiment at the best fit:
$$\langle r_0 \rangle_E \equiv \sqrt{\frac{1}{N_{pt}} \sum_{i=1}^{N_{pt}} r_i^2(\vec{a}_0)} \approx \sqrt{\frac{\chi^2_E(\vec{a}_0)}{N_{pt}}} \tag{12}$$
We will now invoke the Hessian formalism to first organize the analysis of the PDF dependence of individual residuals, and then introduce a framework to evaluate the sensitivity of individual data points to PDF-dependent physical observables. To test the effectiveness of the proposed method, we study constraints using CT14HERA2 parton distributions Hou et al. (2017) fitted to datasets from DIS processes, $W$ and $Z$ boson production, $t\bar{t}$ production, and jet production. We include both the experiments that were used to construct the CT14HERA2 dataset, as well as a number of LHC experiments that may be fitted in the future.
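To parametrize the response of a residual $r_i$, we evaluate it for every eigenvector PDF $\vec{a}_l^{\pm}$ of the CT14HERA2 PDF set with $N = 28$ PDF parameters. Then, given the normalized differences
$$\delta_{i,l}^{\pm} \equiv \frac{r_i(\vec{a}_l^{\pm}) - r_i(\vec{a}_0)}{\langle r_0 \rangle_E} \tag{13}$$
between the residuals for the PDF eigenvectors $\vec{a}_l^{\pm}$ and for the CT14HERA2 central PDF $\vec{a}_0$, we construct a $2N$-dimensional vector
$$\vec{\delta}_i = \left\{ \delta_{i,1}^{+}, \delta_{i,1}^{-}, \dots, \delta_{i,N}^{+}, \delta_{i,N}^{-} \right\} \tag{14}$$
for each data point of the global dataset.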
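A minimal sketch of this construction is given below. It assumes that each normalized difference is the residual on an eigenvector set minus the central residual, divided by the r.m.s. central residual of the experiment, and that the "+" and "-" entries are interleaved direction by direction. The function names and toy numbers are hypothetical.

```python
import math

def rms_residual(r_central):
    """Root-mean-square of the central-fit residuals of one experiment."""
    return math.sqrt(sum(r * r for r in r_central) / len(r_central))

def delta_vector(r0_i, r_plus_i, r_minus_i, rms):
    """2N-component vector (delta_1^+, delta_1^-, ..., delta_N^+, delta_N^-) for one point."""
    vec = []
    for rp, rm in zip(r_plus_i, r_minus_i):
        vec.append((rp - r0_i) / rms)
        vec.append((rm - r0_i) / rms)
    return vec

# Toy experiment: three data points, N = 2 eigenvector directions (made-up numbers).
r_central = [1.0, -1.0, 2.0]                        # r_i on the central set
r_plus = [[1.2, 0.9], [-1.1, -1.0], [2.4, 2.0]]     # r_i on the "+" sets
r_minus = [[0.8, 1.1], [-0.9, -1.0], [1.6, 2.0]]    # r_i on the "-" sets

rms = rms_residual(r_central)
deltas = [delta_vector(r0, rp, rm, rms)
          for r0, rp, rm in zip(r_central, r_plus, r_minus)]
```

For a 28-parameter Hessian set the same loop yields the 56-component vectors discussed in the text.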
The components of $\vec{\delta}_i$ parametrize responses of $r_i$ to PDF variations along each independent direction $\vec{a}_l^{\pm}$. The differences are normalized to the central root-mean-square (r.m.s.) residual $\langle r_0 \rangle_E$ of experiment $E$ [see Eq. (12)] so that (a) these differences are less sensitive to random fluctuations in $r_i$ due to statistical uncertainties in $D_i$, and (b) the normalized differences do not significantly depend on the quality of fit to experiment $E$. Recall that a substantial spread over the fitted experiments is generally obtained for $\chi^2_E/N_{pt}$. Moreover, it is reasonable to expect significantly larger values of $\chi^2_E/N_{pt}$ for the experiments that have not yet been fitted, but are included in the analysis of the residuals, e.g., the new LHC experiments shown in Fig. 1. With the definitions in Eqs. (13) and (14), however, $\vec{\delta}_i$ is only weakly sensitive to $\chi^2_E(\vec{a}_0)/N_{pt}$.
Thus, we represent the PDF-driven variations of the residuals of a global dataset by a bundle of vectors $\vec{\delta}_i$ in a $2N$-dimensional space.
II.3 Manifold learning and dimensionality reduction
PCA and t-SNE visualizations
We illustrate a possible analysis technique carried out with the help of the TensorFlow Embedding Projector software for the visualization of high-dimensional data Emb (). A table of 3992 vectors $\vec{\delta}_i$ for the CTEQ-TEA dataset (corresponding to our total number of raw data points) is generated by our package PDFSense and uploaded to the Embedding Projector website. As variations along many eigenvector directions result only in small changes to the PDFs, the 56-dimensional vectors can in fact be projected onto an effective manifold spanned by fewer dimensions. Specifically, the Embedding Projector approximates the 56-dimensional manifold by a 10-dimensional manifold using principal component analysis (PCA). In practice, this 10-dimensional manifold is constructed out of the 10 components of greatest variance in the effective space, such that the most variable combinations of $\delta_{i,l}^{\pm}$ are retained, while the remaining 46 components needed to fully reconstruct the original 56-dimensional $\vec{\delta}_i$ are discarded. However, because the 10 PCA-selected components describe the bulk of the variance of $\vec{\delta}_i$, discarding these 46 components results in only a minimal loss of information, and in fact provides a more efficient basis to study variations of $\vec{\delta}_i$.
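The dimensional reduction can be sketched in a few lines. The example below is a generic, dependency-free PCA based on power iteration with deflation; it is not the Embedding Projector's implementation, whose internals are not described here. A toy 3-dimensional dataset reduced to 2 components stands in for the 56-to-10 reduction, and all names and numbers are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(dim)]
    return [[row[k] - means[k] for k in range(dim)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def leading_components(cov, n_components, n_iter=200, seed=0):
    """Leading eigenvectors and eigenvalues via power iteration with deflation."""
    rng = random.Random(seed)
    dim = len(cov)
    work = [row[:] for row in cov]
    components, variances = [], []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:          # nothing left to extract
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(work, v)))
        components.append(v)
        variances.append(eigenvalue)
        for a in range(dim):         # deflate: remove the component just found
            for b in range(dim):
                work[a][b] -= eigenvalue * v[a] * v[b]
    return components, variances

def project(centered, components):
    """Coordinates of each row along the retained components."""
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]

# Toy stand-in for the residual vectors: 5 points in 3 dimensions lying on one line.
rows = [[t, 2.0 * t, 2.0 * t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
centered = center(rows)
cov = covariance(centered)
components, variances = leading_components(cov, n_components=2)
reduced = project(centered, components)
explained = variances[0] / sum(cov[k][k] for k in range(len(cov)))
```

Here `explained` is the fraction of the total variance carried by the first component. In the toy case all points lie on a line, so one component suffices; for real residual vectors the analogous ratio indicates how much information the retained components preserve.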
In the 10-dimensional PCA representation, some directions result in efficient separation of residuals of different types. For example, the left panel of Fig. 3 shows a 3-dimensional projection of the $\vec{\delta}_i$ vectors that separates clusters of DIS, vector boson production, and jet/$t\bar{t}$ production residuals. In this example, the jet/$t\bar{t}$ cluster, shown in red, is roughly orthogonal to the blue DIS cluster and intersects it. This separation is remarkable, as it is based only on numerical properties of the $\vec{\delta}_i$ vectors, and not on the metadata about the types of experiments that is entered after the PCA is completed. The underlying reasons for this separation, namely, dependence on independent PDF combinations, will be quantified by sensitivities in the next section.
As an alternative, the Embedding Projector can organize the $\vec{\delta}_i$ vectors into clusters according to their similarity using t-distributed stochastic neighbor embedding (t-SNE) van der Maaten and Hinton (2008). A representative 3-dimensional distribution of the $\vec{\delta}_i$ vectors obtained by t-SNE is displayed in the right panel of Fig. 3. In this case, we find that such algorithms can again sort data into clusters according to the experimental process, values of $x$ and $\mu$, and even the experiment itself.
The breakdown of the $\vec{\delta}_i$ vectors over experiments in the PCA representation is illustrated by Fig. 4. Here, we see that the bulk of the DIS cluster from the left panel of Fig. 3 originates with the combined HERA1+2 DIS data (ID=160, Abramowicz et al. (2015)). The jet cluster in Fig. 3 is dominated by ATLAS and CMS inclusive jet datasets (ID=542, 544, and 545 Aad et al. (2015); Chatrchyan et al. (2014a); Khachatryan et al. (2017)), which add dramatically more points across a wider kinematical range on top of the CDF Run-2 and D0 Run-2 jet production datasets (ID=504 and 514 Aaltonen et al. (2008); Abazov et al. (2008a)).
In contrast, although the $t\bar{t}$ production experiments (ID=565-568 Aad et al. (2016a)) are generally characterized by large $\vec{\delta}_i$ vectors, they contribute only a few data points lying within the jet cluster and, by themselves, will not make much difference in a global fit. The same conclusion applies to data from high-$p_T$ $Z$ production, which has too few points to stand out in a fit with significant inclusive jet data samples. We return to this point in the discussion of reciprocated distances below.
It is also interesting to note that semi-inclusive charm production at HERA (ID=147 Abramowicz et al. (2013)) lies within both the DIS and jet clusters. Finally, the CCFR/NuTeV dimuon SIDIS (ID=124-127 Mason (2006); Goncharov et al. (2001)) extends in an orthogonal direction, not well separated from the other datasets in the selected three-dimensional projection.
Reciprocated distances
As a complement to the visualization methods based on PCA and t-SNE just presented, it is also possible to evaluate another similarity measure based on the distances between the vector residuals. For example, rather than applying PCA to an ensemble of $\vec{\delta}_i$ vectors in the effective 10-dimensional space following dimensional reduction, we might instead compute over the dimensionally reduced space a pairwise reciprocated distance measure, which we define as
$$D_i \equiv \left( \sum_{j \neq i}^{N_{all}} \frac{1}{\left| \vec{\delta}_j - \vec{\delta}_i \right|} \right)^{-1} \tag{15}$$
and evaluate for the points $i$ in each experimental dataset. We allow the sum over $j$ in Eq. (15) to run over all the data points in the CTEQ-TEA set regardless of experimental ID (denoted by $N_{all}$). The distances $|\vec{\delta}_j - \vec{\delta}_i|$ can be computed either in the 56-dimensional space or in the reduced dimensionality space.
The advantage of the definition in Eq. (15) is that it enables a quantitative measure of the degree to which separate experiments broadly differ in terms of their residual fluctuations, and therefore provides information analogous to that found in Figs. 3–4. For example, by inspection of Eq. (15) it can be seen that those experimental measurements which are widely separated from the rest of the CTEQ-TEA dataset in the space of $\vec{\delta}_i$ vectors will correspond to comparatively large values of $D_i$, and experiments that systematically differ from the rest of the total dataset are thus expected to have especially tall distributions in the panels of Fig. 5. On this basis, it can be seen that the information yielded by $W$ asymmetry measurements (ID 234, 266, 281) is particularly distinct, as is that from the combined HERA DIS data (ID 160) and fixed-target Drell-Yan measurements, such as E605 (ID 201) and E866 data (IDs 203 and 204). Similarly, direct comparison of the distributions in the panels of Fig. 5 allows one to compare constraints with and without the jet data. We note that the 7 and 8 TeV ATLAS high-$p_T$ $Z$ production (IDs 247 and 253) and $t\bar{t}$ production (ID 565) provide a number of “remote” points and hence are potentially useful in the fits sensitive to the gluon. On the other hand, new jet production experiments (IDs 542, 544, 545) all include large numbers of points characterized by significant reciprocated distances.
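To make the behavior of this measure concrete, the sketch below evaluates it for a handful of toy vectors. It assumes that the reciprocated distance of a point is the inverse of the sum of inverse Euclidean distances to all other points; the function name and the sample vectors are hypothetical.

```python
import math

def reciprocated_distances(vectors):
    """D_i = 1 / sum_{j != i} 1 / |v_j - v_i| for every vector in the list."""
    result = []
    for i, vi in enumerate(vectors):
        inverse_sum = 0.0
        coincident = False
        for j, vj in enumerate(vectors):
            if i == j:
                continue
            distance = math.dist(vi, vj)
            if distance == 0.0:      # an identical neighbor drives D_i to zero
                coincident = True
                break
            inverse_sum += 1.0 / distance
        if coincident:
            result.append(0.0)
        elif inverse_sum == 0.0:     # a single isolated vector has no neighbors
            result.append(math.inf)
        else:
            result.append(1.0 / inverse_sum)
    return result

# Three nearby points and one remote point (made-up 2-dimensional vectors).
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
distances = reciprocated_distances(vectors)
most_remote = max(range(len(vectors)), key=lambda k: distances[k])
```

The remote point receives the largest value, which mirrors the statement that measurements widely separated from the rest of the dataset correspond to comparatively large reciprocated distances.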
III Quantifying distributions of residuals
We have demonstrated that the multi-dimensional distribution of the shifted residuals $r_i$ evaluated with Hessian eigenvector PDFs reflects the PDF dependence of individual data points. In this section, we will focus on numerical metrics to assess the emerging geometrical picture and to visualize the regions of partonic momentum fractions $x$ and QCD factorization scales $\mu$ where the experiments impose strong constraints on a given PDF-dependent observable $X$.
III.1 Correlations and sensitivities
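Gradients of $r_i$ in a space of Hessian eigenvector PDF parameters are naturally related to the PDF uncertainty. Recall that in the Hessian method the PDF uncertainty on $X$ is found as
$$\Delta X = \left| \vec{\nabla} X \big|_{\vec{a}_0} \cdot \vec{z}_m \right| = T \left| \vec{\nabla} X \right| \tag{16}$$
where $\vec{a}_0$ is the best-fit combination of PDF parameters, and $\vec{z}_m$ is the maximal displacement along the gradient that is allowed within the tolerance hypersphere of radius $T$ centered on the best fit Pumplin et al. (2001, 2002). The standard master formula
$$\Delta X = \frac{1}{2} \sqrt{\sum_{l=1}^{N} \left( X_l^{+} - X_l^{-} \right)^2} \tag{17}$$
is obtained by representing the components of $\vec{\nabla} X$ by a finite-difference formula
$$\frac{\partial X}{\partial z_l} = \frac{X_l^{+} - X_l^{-}}{2T} \tag{18}$$
in terms of the values $X_l^{\pm} \equiv X(\vec{a}_l^{\pm})$ for extreme displacements of $\vec{a}$ within the tolerance hypersphere along the $l$-th direction.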
In this setup, a dot product between the gradients provides a convenient measure of the degree of similarity between the PDF dependence of two quantities Nadolsky et al. (2008). A dot product between the gradients of a shifted residual $r_i$ and another QCD variable $f$, such as the PDF at some $\{x, \mu\}$ or a cross section, can be cast in a number of useful forms.
Correlation cosine
The correlation $C_f$ for the $i$-th point, which we define following Refs. Pumplin et al. (2001); Nadolsky and Sullivan (2001); Nadolsky et al. (2008); Gao et al. (2017) according to
$$C_f \equiv \mathrm{Corr}[f, r_i] = \frac{\vec{\nabla} f \cdot \vec{\nabla} r_i}{|\vec{\nabla} f|\, |\vec{\nabla} r_i|} \tag{19}$$
can determine whether there may exist a predictive relationship between $f$ and goodness of fit to the $i$-th point. The correlation function for the quantities in Eq. (19) represents the realization in the Hessian formalism of Pearson’s correlation coefficient, which we express as
$$\mathrm{Corr}[X, Y] = \frac{1}{4\, \Delta X\, \Delta Y} \sum_{l=1}^{N} \left( X_l^{+} - X_l^{-} \right) \left( Y_l^{+} - Y_l^{-} \right) \tag{20}$$
with the sum in these expressions being over the $N$ parameters of the full PDF model space. Geometrically, $\mathrm{Corr}[X, Y]$ represents the cosine of the angle that determines the eccentricity of an ellipse satisfying $\Delta\chi^2 \leq T^2$ in the $\{X, Y\}$ plane. This latter point follows from the fact that the mapping of the tolerance hypersphere onto the $\{X, Y\}$ plane is an ellipse with an eccentricity that depends on the correlation of $X$ and $Y$, which is in turn given by Eq. (20) above.
$C_f$ does not indicate how constraining the residual is, but it may indicate a predictive relation between $r_i$ and $f$. On the basis of previous work Nadolsky et al. (2008), we say that the (anti)correlation between $r_i$ and $f$ is significant roughly if $|C_f| \gtrsim 0.7$, while smaller (anti)correlation values are less robust or predictive. Following this rule of thumb, correlations have been used successfully to identify PDF combinations that dominate PDF uncertainties of complicated observables, for instance to show that the gluon uncertainty dominates the total uncertainty on LHC Higgs and $t\bar{t}$ production, or that the uncertainty on the ratio of $W$ and $Z$ boson cross sections at the LHC is dominated by the strangeness PDF, rather than $u$ and $d$ (anti)quark PDFs Nadolsky et al. (2008).
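The Hessian correlation can be evaluated directly from values on the eigenvector sets. The following minimal sketch assumes the standard finite-difference form, in which the correlation is the sum over eigenvector directions of the products of "+" minus "-" differences, divided by four times the product of the two symmetric uncertainties. Function names and toy numbers are hypothetical.

```python
import math

def hessian_uncertainty(plus, minus):
    """Symmetric Hessian uncertainty, 0.5 * sqrt(sum_l (X_l^+ - X_l^-)^2)."""
    return 0.5 * math.sqrt(sum((p - m) ** 2 for p, m in zip(plus, minus)))

def hessian_correlation(x_plus, x_minus, y_plus, y_minus):
    """Cosine of the angle between the Hessian gradients of X and Y."""
    dx = hessian_uncertainty(x_plus, x_minus)
    dy = hessian_uncertainty(y_plus, y_minus)
    if dx == 0.0 or dy == 0.0:
        raise ValueError("correlation is undefined for a vanishing uncertainty")
    dot = sum((xp - xm) * (yp - ym)
              for xp, xm, yp, ym in zip(x_plus, x_minus, y_plus, y_minus))
    return dot / (4.0 * dx * dy)

# Toy PDF value f and toy residual r on N = 2 eigenvector pairs (made-up numbers).
f_plus, f_minus = [1.1, 1.0], [0.9, 1.0]
r_plus, r_minus = [0.2, 0.9], [0.8, 0.1]

c_f = hessian_correlation(f_plus, f_minus, r_plus, r_minus)
is_significant = abs(c_f) >= 0.7   # rule-of-thumb threshold quoted in the text
```

In this toy case the residual is moderately anticorrelated with $f$ and falls below the rule-of-thumb threshold.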
Sensitivity
The correlation alone does not fully encode the potential impact of separate or new measurements on improving PDF determinations in terms of the uncertainty reduction. Rather, we employ $C_f$ again to define the sensitivity $S_f$ to $f$ of the $i$-th point in experiment $E$:
$$S_f \equiv \frac{\Delta r_i}{\langle r_0 \rangle_E}\, C_f \tag{21}$$
where $\Delta r_i$ and $\langle r_0 \rangle_E$ are computed according to Eqs. (3) and (12), respectively. In other words, $\Delta r_i$ again represents the variation of the residuals across the set of Hessian error PDFs, and we normalize it to the r.m.s. residual for the whole dataset $E$ to reduce the impact of random fluctuations in the data values $D_i$. This definition has the benefit of encoding not only the correlated relationship of $f$ with $r_i$, but also the comparative size of the experimental uncertainty with respect to the PDF uncertainty. In consequence, for example, if new experimental data have reported uncertainties that are much tighter than the present PDF errors, these data would then register as high-sensitivity points by the definition in Eq. (21).
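As a rough illustration, the sketch below computes both the correlation and the sensitivity for a single data point. It assumes that the sensitivity is the Hessian correlation multiplied by the ratio of the residual's Hessian uncertainty to the r.m.s. residual of the experiment. All names and numbers are hypothetical.

```python
import math

def half_differences(plus, minus):
    """Finite-difference gradient components, (X_l^+ - X_l^-) / 2."""
    return [0.5 * (p - m) for p, m in zip(plus, minus)]

def correlation_and_sensitivity(f_plus, f_minus, r_plus, r_minus, rms_residual):
    """Return (C_f, S_f) for one data point, with S_f = (Delta r / rms) * C_f."""
    grad_f = half_differences(f_plus, f_minus)
    grad_r = half_differences(r_plus, r_minus)
    delta_f = math.sqrt(sum(g * g for g in grad_f))
    delta_r = math.sqrt(sum(g * g for g in grad_r))
    if delta_f == 0.0 or delta_r == 0.0 or rms_residual <= 0.0:
        raise ValueError("need non-zero uncertainties and a positive r.m.s. residual")
    c_f = sum(a * b for a, b in zip(grad_f, grad_r)) / (delta_f * delta_r)
    s_f = (delta_r / rms_residual) * c_f
    return c_f, s_f

# Toy inputs on N = 2 eigenvector pairs (made-up numbers).
f_plus, f_minus = [1.1, 1.0], [0.9, 1.0]
r_plus, r_minus = [0.5, 0.7], [-0.1, -0.1]
rms = 1.0    # r.m.s. central residual of the experiment

c_f, s_f = correlation_and_sensitivity(f_plus, f_minus, r_plus, r_minus, rms)
```

The toy point has a correlation below 0.7 but a sensitivity above 0.25, which shows how a data point with a modest correlation can still register as sensitive when its residual varies strongly across the error PDFs.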
Geometrically, $S_f$ represents a projection of the displaced residual vector, defined in Sec. III using the symmetrized formula from footnote 1, onto the direction of the gradient of $f$. This interpretation suggests that the total strength of constraints along the direction of this gradient can be quantified by summing the projections onto this direction of all contributing individual residual vectors.
As with correlations, only a sufficiently large absolute magnitude of $S_f$ is indicative of a predictive constraint of the $i$-th point on $f$. Recall that $r_i^2$ is the contribution of the $i$-th point to $\chi^2$ and that only residuals with a large enough $\Delta r_i$ as compared to the r.m.s. residual $\langle r_0 \rangle_E$ are sensitive to PDF variations. The magnitude $|S_f|$ is of order $\Delta r_i / \langle r_0 \rangle_E$, which suggests an estimate of a minimal value of $|S_f|$ that would be deemed sensitive according to the respective $\chi^2$ contribution. For the numerical comparisons in this study, we assume that $|S_f|$ must be no less than 0.25 to indicate a predictive constraint, as the PDF uncertainty of the residual then contributes no less than $0.25^2 = 0.0625$ to the variation in the global $\chi^2$. The reader can choose a different minimal value in the figures depending on the desired accuracy. The cumulative sensitivities that we obtain in later sections are independent of this choice.
Yet another possible definition, which we list for completeness, is to further normalize the sensitivity as
(22) 
For instance, if $f$ is the PDF or parton luminosity evaluated at the $(x_i, \mu_i)$ points extracted according to the data, the definition in Eq. (22) deemphasizes those points where the PDF uncertainty is small compared to the best-fit PDF value, analogously to how $S_f$ deemphasizes (relative to the correlation $C_f$) those data points whose normalized residuals have already been more tightly constrained.
IV Case study: CTEQ-TEA global data
IV.1 Maps of correlations and sensitivities
We will now discuss a number of practical examples of using $C_f$ or $S_f$ to quickly evaluate the impact of various hadronic data sets upon the knowledge of the PDFs in a fashion that does not require a full QCD analysis of the type described in Sec. II. For this demonstration, we will continue to study the dataset shown in Fig. 1 of the CT14HERA2 analysis Hou et al. (2017) augmented by the candidate LHC data.
We have already noted the extent of this dataset in the $(x, \mu)$ plane in Fig. 1, where it is decomposed into constituent experiments labeled according to the conventions in Tables 1–3. It is instructive to create similar maps in the $(x, \mu)$ plane showing $|C_f|$ or $|S_f|$ values for each data point at the typical momentum fraction(s) and factorization scale(s) associated with this point. Such maps are readily produced by the PDFSense program for a variety of PDF flavors and for user-defined observables, such as the Higgs cross section. For demonstration, we have collected a large number of these maps at the companion website PDF ().
Thus, we obtain scatter plots of $|C_f|$ or $|S_f|$ for a given QCD observable, such as the LHC Higgs production cross section shown in Fig. 2, or for a PDF $f$ evaluated at the same $(x_i, \mu_i)$ determined by the data points, with examples shown for the gluon PDF in Figs. 6 and 7. The typical $(x_i, \mu_i)$ values characterizing the data points are found according to Born-level approximations appropriate for each scattering process included in the CTEQ-TEA dataset, with the formulas to compute these kinematic matchings summarized in App. A. Here and in general, we consider the absolute values of the correlation and sensitivity on the grounds that their signs may randomly fluctuate depending on whether data points overshoot or undershoot their theory predictions.
Together with the map in the $(x, \mu)$ plane, PDFSense also returns a histogram of the values for each quantity it plots. An example is shown for $|C_g|$ in the left panel of Fig. 6. One would judge that stronger constraints are in general provided to those PDFs for which the histogram has many entries comparatively close to $|C_f| = 1$. In the left panel of Fig. 6, we can see that, while the distribution has its greatest excess at low correlations, it is also bimodal, having a pronounced tail and subsidiary excess in the high-correlation region. This feature shows that, of the 5227 points probed by the augmented CT14HERA2 set in Fig. 1, several hundred (specifically, 443) have especially strong ($|C_g| > 0.7$) correlations (or anticorrelations) with the gluon PDF.
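The bookkeeping behind such a histogram is simple. The Python sketch below bins the magnitudes of a quantity and counts the points above a chosen cut; the function names, the number of bins, and the treatment of overflow values are illustrative assumptions rather than PDFSense internals.

```python
def histogram_counts(values, n_bins=20, upper=1.0):
    """Count |values| in n_bins equal-width bins on [0, upper].

    Magnitudes at or above `upper` are placed in the last bin.
    """
    counts = [0] * n_bins
    for value in values:
        index = min(int(abs(value) / upper * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def count_above(values, threshold):
    """Number of points whose magnitude exceeds the chosen cut."""
    return sum(1 for value in values if abs(value) > threshold)
```

Because signs are discarded before binning, correlated and anticorrelated points are counted together, in line with the use of absolute values in the maps.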
To identify these points, we plot complementary information in the right panel of the same figure: specifically, a map in $(x, \mu)$ space of each of the data points shown in Fig. 1, in which they are colorized according to the magnitude of $|C_g|$ following the color palette in the “rainbow strip” on the right. “Cooler” colors (green/yellow) correspond to weaker correlation strengths, while “hotter” colors (orange/red) represent comparatively stronger correlations, as indicated.
A striking feature of the correlation plot of Fig. 6 is that large magnitudes of $|C_g|$ are found for inclusive jet production measurements, especially those recently obtained by CMS at 8 TeV Khachatryan et al. (2017) (Expt. ID 545, inverted triangles) with $|C_g|$ as high as 0.85, including at the highest values of $x$ and $\mu$. Beyond these, a sizable cluster of HERA (160) data points at the lowest values of $x$ is also seen to have large correlations with the gluon PDF, consistent with the common wisdom that HERA DIS constrains the gluon PDF at small $x$ via DGLAP scaling violations. Below the jet production cluster, high-$p_T$ $Z$ production (247, 253) and $t\bar{t}$ production (565–568) at the LHC show a high correlation. At the same time, many other measurements, including fixed-target data at large $x$ and $W$ asymmetry data at scales of order the $W$ boson mass, fall beyond the highlighted range and would therefore be less emphasized by an analysis based solely upon the PDF-residual correlations.
We can also consider the analogous plots for the sensitivity as defined in Eq. (21), which we plot in the two panels of Fig. 7. In the left panel, we again consider the histogram, here for the magnitudes of the gluon sensitivity $|S_g|$, in which the correlations are now weighted by the relative size of the PDF uncertainty in the residual. As discussed in Sec. III.1, this additional weighting emphasizes those data points for which the PDF-driven fluctuations in the residuals are comparatively large relative to experimental uncertainties. This results in a redistribution of the data points shown in the histogram of Fig. 6 into a considerably longer-tailed histogram for $|S_g|$, such that in this instance there are 990 points with larger sensitivities, $|S_g| > 0.25$. Unlike the correlation, $|S_f|$ can be arbitrarily large, depending on the $\Delta r_i / \langle r_0 \rangle_E$ value. In the right panel, $|S_g|$ places additional emphasis on the combined HERA dataset (ID=160) constraining the gluon at the lowest $x$. Similarly, we observe heightened sensitivities at the highest $x$ for the LHC (542, 544, 545) and Tevatron (514) jet production data, which have both large correlations with the gluon and small experimental uncertainties.
In contrast to the $|C_g|$ plot, we observe increased sensitivity in the precise fixed-target DIS data from BCDMS (101, 102) and CCFR (110, 111), which are also sensitive to the gluon via scaling violations despite only moderate correlation values. We also observe enhanced sensitivity for individual points in a large number of experiments, including CDHSW DIS (108); HERA (169); the Drell-Yan process (201, 204); CMS 7 TeV charge asymmetry (266); HERA charm SIDIS (147); ATLAS high-$p_T$ $Z$ production (247, 253); and especially strongly sensitive points in $t\bar{t}$ production (565–568). However, since the latter category includes fewer points per experiment, it has less constraining impact on the gluon than the jet production and high-statistics DIS data.
These findings comport with the idea that the gluon PDF remains dominated by substantial uncertainties both at small $x$ and in the elastic limit $x \to 1$, a fact which has driven an intense focus upon production of hadronic jets, $t\bar{t}$ pairs, and high-$p_T$ $Z$ bosons, which themselves are measured at large center-of-mass energies and are expected to be sensitive to the gluon PDF across a wide interval of $x$, including the values typical for Higgs boson production via gluon fusion at the LHC. Turning back to the distributions of $|C|$ and $|S|$ for the Higgs cross section in Fig. 2, we notice their resemblance to the distributions of $|C_g|$ and $|S_g|$ for the gluon PDF in Figs. 6 and 7. We also see some differences: although the average $x$ and $\mu$ are fixed in the Higgs cross section, it is nonetheless sensitive to some constraints at much lower $x$ values as a result of the momentum sum rule. To reduce visual clutter, in Fig. 2 we chose to highlight with color only data points that have sufficiently large values of $|C|$ and $|S|$. For instance, Fig. 2 highlighted only points with $|C| > 0.7$ and $|S| > 0.25$, which are likely to indicate nontrivial predictive constraints according to the arguments presented in Sec. III.1; in Figs. 6 and 7, however, we relax these highlighting conditions, and instead score the full $(x, \mu)$ space according to the color scales indicated in the right panels of both figures.
IV.2 Experiment rankings according to cumulative sensitivities
From the examination of multiple maps of $|S_f|$ for various PDF flavors collected on the website PDF (), we find that the most precise experiments constrain several flavors at the same time. This has already been known for the combined HERA data, but, for example, the CMS jet production data (542, 545) also constrain the gluon as well as several quark flavors.
For the purpose of identifying such experiments, we can compute an overall sensitivity statistic for each experiment $E$ to the parton distributions of flavor $f$. Furthermore, to obtain one overall ranking, we can add up sensitivity measures as an unweighted sum over the “basis PDF” flavors, such as the six light flavors ($\bar{d}$, $\bar{u}$, $g$, $u$, $d$, $s$). To obtain these measures, we say that an experiment $E$ sensitive to $N$ points in $(x, \mu)$ space for a PDF of given flavor can be characterized by its mean sensitivity per point, $\langle |S_f^E| \rangle \equiv N^{-1} \sum_i |S_f(x_i, \mu_i)|$, from which we derive several additional statistical measures of experimental sensitivity. For each experiment and flavor we determine an overall sensitivity measure, numerically adjusted to the size of each experimental dataset (having $N_{pt}$ raw data points), according to $|S_f^E| \equiv N_{pt} \langle |S_f^E| \rangle$. In addition, we also track cumulative sensitivity measures $\sum_f |S_f^E|$ and $\sum_f \langle |S_f^E| \rangle$, with $f$ running over the six light flavors.
We list the corresponding values of these four types of sensitivities for each experiment of the CTEQ-TEA dataset in Tables 9 and 10, as well as for categories of experiments from the CTEQ-TEA dataset in Tables 11 and 12. Table 4 lists the experiments in each category.
Alongside the full tables, we provide their simplified versions in Tables 5–8, in which we rank and score information according to a reward system described in the caption of Table 5. In each table, experiments are listed in descending order according to the cumulative sensitivity measure to the six light-parton flavors. For each PDF flavor, the experiments with especially high overall flavor-specific sensitivities receive an “A” rating, per the convention in the caption of Table 5. Successively weaker overall sensitivities receive marks of “B” and “C,” while those falling below a lower limit are left unscored.
We similarly evaluate each experimental dataset based on its point-averaged sensitivity, in this case scoring according to a complementary scheme in which the highest score is “1.” We indicate especially high-scoring experiments with bold entries.
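A scoring of this kind could be sketched as follows in Python. The cutoff values are placeholders supplied by the caller, since the actual conventions are those given in the caption of Table 5, and the function names are hypothetical.

```python
def grade(value, cutoffs, labels):
    """Return the label of the first cutoff that `value` meets.

    `cutoffs` must be in descending order and paired with `labels`.
    An empty string marks an unscored entry.
    """
    for cutoff, label in zip(cutoffs, labels):
        if value >= cutoff:
            return label
    return ""


def rank_experiments(totals, cutoffs, labels=("A", "B", "C")):
    """Sort experiments by descending sensitivity and attach a grade.

    totals: dict mapping an experiment ID to its sensitivity measure.
    """
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, total, grade(total, cutoffs, labels)) for name, total in ordered]
```

The same helper covers the point-averaged scheme by passing numeric labels such as `("1", "2", "3")` together with cutoffs appropriate to mean sensitivities.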
According to this ranking system, we find that the expanded HERA dataset (Exp. ID 160) tallies the highest overall sensitivity to the PDFs, with enhanced sensitivity to quark distributions as well as that of the gluon. Other experiments also play decisive roles, including recent inclusive jet measurements from the LHC, particularly jet production observations from CMS at 8 TeV (Exp. ID 545) and 7 TeV (Exp. ID 542). It is notable that these datasets consistently have strong overall sensitivities across most flavors, as suggested by the wide share of ranks (A, B, C) they net. On similar footing but with slightly weaker overall sensitivities are a number of other fixed-target measurements, including structure function measurements from BCDMS for $F_2^p$ (101) and CCFR extractions of $xF_3^p$ (111), as well as several other DIS datasets.
Going beyond the rankings based upon overall sensitivities, which are more closely tied to the impact of an entire experimental dataset in aggregate, it is useful to consider the point-averaged sensitivity as well, given that this quantity relates more immediately to those datasets for which each separate measurement in isolation provides a substantial PDF constraint. In this respect, CMS $W$ asymmetry measurements at 7 TeV (Exp. ID 266) can be thought of as particularly valuable (despite the small number of individual measurements); this is especially true again for the gluon and certain quark PDFs, for which this set of measurements is particularly highly rated in Table 5.
Aside from the quark- and gluon-specific rankings of specific measurements, we can also assess experiments based upon the constraints they impose on various interesting flavor combinations and observables as presented in Table 6. As was the case with Table 5, a considerable amount of information resides in Table 6, of which we only highlight several notable features here. Among these features are the very sharp overall and point-averaged sensitivities to the Higgs cross section (on par with the sensitivities found for the Run I+II HERA data) of the CMS jet production measurements corresponding to Exp. IDs 545 and 542, a point which should emphasize the special utility of this process in improving the picture of Higgs physics going forward. While their overall sensitivity is small, the corresponding ATLAS data also possess significant average sensitivity. On the other hand, measurements of $p_T$-dependent $Z$ production (247, 253) appear to have somewhat less pronounced sensitivity to the gluon and other PDF flavor combinations. Their total and mean sensitivities are on par with those of the HERA charm SIDIS data (ID=147) and provide comparable constraints to charm DIS production, albeit in a different kinematic region.
For light-quark PDF combinations, the various DIS datasets, led by Run II of HERA and CCFR measurements of the proton structure function, demonstrate the greatest sensitivity. At the same time, however, Run-2 Tevatron data from D0 on the asymmetry (281) and Run-1 CDF measurements for the corresponding asymmetry (225) also exhibit substantial pointwise sensitivity. We collect a number of other observations in the conclusion below, Sec. V.
V Conclusions
In the foregoing analysis, we have confronted the modern challenge of a rapidly growing set of global QCD data with new statistical methodologies for quantifying and exploring the impact of this information. These novel methodologies are realized in a new analysis tool PDFSense PDF (), which allows the rapid exploration of the impact of both existing and potential data on PDF determinations, thus providing a means of weighing the impact of measurements of QCD processes in a way that allows meaningful conclusions to be drawn without the cost of a full global analysis. We expect this approach to guide future PDF fitting efforts by allowing fitters to examine the world's data a priori, so as to concentrate analyses on the highest-impact datasets. In particular, this work builds upon the existing CT framework with its reliance on the Hessian formalism and assumed quasi-Gaussianity, but these features do not impact the validity of our analysis and conclusions. Our approach provides a means to carry out a detailed study of data residuals, for which we explored novel visualizations in several ways, including the PCA, t-SNE, and reciprocated distance approaches discussed in Sec. II.3. These techniques show promise moving forward by providing useful insights into the numerical relationships among datasets and experimental processes.
Crucial to this analysis is the leveraging of both the existing and proposed statistical measures laid out in Sec. III.1. Of these, the flavor-specific sensitivity $S_f$ of Eq. (21) for a data point to the PDF $f$ serves as a particularly powerful discriminator, and we deployed it and the correlation $C_f$ of Eq. (19) to visualize PDF constraints provided by data over a wide range in $(x, \mu)$. This was facilitated by the fact that the sensitivity and correlation are readily computable over the extent of the global dataset. The companion website collects a large number of figures illustrating the sensitivities to various flavors as a function of $x$ and $\mu$.
In addition to visualization of sensitivities, in Sec. IV.2 we also demonstrated a means to systematically rank and assess subsidiary datasets within the world's data, an approach that allows us to perceive which specific measurements and physical processes are potentially most influential in constraining PDFs. We note that one is allowed some freedom in choosing a specific ranking prescription, but we find our conclusions to be stable against variations among these possible choices. In this context, we reaffirmed the unique advantage of DIS and jet production to inform the PDFs. We illustrated our constraint methodology by concentrating on the data correlations and sensitivities to the gluon PDF, showing the critical role of HERA data at low $x$ and of jet production measurements at higher values of $x$ and $\mu$.
Indeed, many intriguing physics results can be established using our sensitivity methods, and the specific results in the previous sections are only illustrative examples. We stress that these results take the complementary form of sensitivity tables (for example, Table 5) and plots (such as Fig. 2), which respectively offer broad overviews of the experimental landscape as well as detailed mappings of the placements of PDF constraints in $(x, \mu)$ space. In totality, the full range of possible results is beyond the scope of the present article, but the interested user can explore them using our PDFSense package at PDF (). We mention only a representative sample of these to motivate the reader:

A wide range of experimental processes possess sensitivity to the nucleon's quark sea distributions; for example, the DY measurements of E866 (203) exhibit strong sensitivity, but so do DY data from E605 (201) as well as (at larger $\mu$) information on the $W$ production asymmetry from CMS at 7 TeV (266); at high $x$ and $\mu$, CMS inclusive jet data (545) are also seen to be important. Still, the recent HERA data (160) register the greatest overall sensitivity.

In conjunction, CMS jet production at 7 and 8 TeV (542 and 545) provide the strongest total sensitivity to the strange distribution, beating the HERA DIS (160), NuTeV (124), and CCFR (126, 127) dimuon SIDIS experiments, which have very strong average sensitivity to the strange distribution. Still, at lower scales a mix of these fixed-target measurements plays a substantial role, including data from NuTeV (124) and data from SIDIS at CCFR (126 and 127). At higher $x$ and $\mu$, a number of jet measurements like those of CMS (545) show modest impact. Similarly, the data from D0 (234) and CMS (266) have slightly weaker sensitivity, as do the ATLAS points (268), but provide some constraint at lower $x$.

Knowledge of the charm distribution is influenced by a number of datasets, with HERA (160) at low $x$ especially important. Fixed-target measurements, particularly those of CDHSW on the proton's structure function (108), have strong sensitivity at slightly higher $x$, while a wide range of jet measurements, including 7 TeV data from ATLAS (535) and CMS (542), and 8 TeV CMS (545) points, also show larger sensitivities; notably, this pattern of sensitive measurements broadly follows the corresponding plot for the gluon due to the dominance of boson fusion graphs in heavy quark production. The datasets of importance we identify are broadly consistent with the conclusions of the recent CT14 analysis Hou et al. (2018) of the nucleon's intrinsic charm Hobbs et al. (2014).

One can also study the correlations and sensitivities for various derived PDF combinations. For instance, for the ratio $\bar{d}/\bar{u}$ representing deviations from flavor symmetry in the nucleon sea, the E866 experiment (203) shows exceptional point-averaged sensitivity, such that its “C” ranking for its overall sensitivity to $\bar{d}/\bar{u}$ places it in the company of only a few other DIS and DY experiments, despite their much larger number of measurements. At somewhat lower $x$, NMC data on the structure function ratio (104) also show sensitivity. At still lower $x$, the CMS points (249, 266) and data from LHCb (250) show strong pull, corresponding to point-averaged rankings of “2,” “1,” and “2,” respectively.

We also consider the PDF ratio $d/u$, which often serves as a discriminant among various nucleon structure models, especially at high $x$. An amalgam of fixed-target experiments, including the NMC data (104) in particular, but also measurements from BCDMS (101) and CCFR (110) as well as further data from CCFR, drives the current status. At higher $\mu$, however, the LHCb data (250) and measurements from Run-2 of D0 (281) also constrain the high-$x$ behavior of $d/u$, together with points from CMS at 7 TeV (266).

More generally, we note that, among new LHC experiments to be considered for future global fits, the datasets for inclusive jet production are expected to have the greatest impact, comparable in strength to the effect of adding a fixed-target DIS dataset such as BCDMS (101). Meanwhile, the magnitude of the constraint on the gluon PDF from high-$p_T$ $Z$ production (253) is comparable to those from the combined HERA SIDIS charm dataset (147) or inclusive jet production from CDF Run-2 (504); that is, the high-$p_T$ $Z$ data are significant in the event that other jet datasets are not included, in overall consistency with the findings in Ref. Boughezal et al. (2017). The smaller ATLAS $t\bar{t}$ production data sets (565–568) have strong point-by-point sensitivity to the gluon, but will have a more diminished role when combined with other, larger data sets. HERA DIS (160) and CMS inclusive jets at 8 and 7 TeV (545 and 542) render the strongest overall constraints on the Higgs production cross section at the LHC according to the rankings in Table 6.
Quantifying correlations and sensitivities thus provides a means of evaluating the ability of a global dataset to constrain our knowledge of nucleon structure in a comprehensive way. It must be emphasized, however, that this analysis is not a substitute for actually performing a QCD global analysis, which remains the single most robust means of determining the nucleon PDFs themselves. Rather, the method presented in the paper is a guiding tool to both supplement and direct fits by providing visual and numerical information on the impact of measurements within a given global analysis as well as the potential for improving PDFs with the incorporation of new datasets (both existing data, as in the case of 8 TeV LHC jet information, and prospective experiments).
The essential ingredients of this study are the PDF-residual correlation $C_f$ and sensitivity $S_f$, with the latter representing an extension of the correlation used elsewhere in the modern PDF literature. These definitions are robust enough that we can exhaustively score the data points in an arbitrary global dataset to construct and map the resulting distributions, as shown in Figs. 6 and 7. Accordingly, we found it possible to impose cuts on these distributions to identify points of especially strong correlation ($|C_f| > 0.7$) or sensitivity ($|S_f| > 0.25$); we stress that these cuts are chosen as approximate indicators, and any user can adjust them freely. On the other hand, the distributions themselves, as shown in Figs. 6 and 7, are not subject to such cut choices.
While we have demonstrated these techniques in the context of the CT14 family of global fits, they are of sufficient generality that one could readily repeat our analysis using alternative PDF sets. The results of this study can be expected to vary somewhat depending on the specifics of the PDF sets used to compute $C_f$ and $S_f$, but we see this as an advantage of PDFSense. In fact, one could imagine exploiting the detailed correlation and sensitivity maps to undertake a systematic analysis of the impact of various theoretical assumptions implemented in competing global fits (e.g., the choice of input PDF parametrization or the status of the perturbative QCD treatment implemented in various processes). This utility is further enhanced by the fact that one can recover the shifted residuals that are crucial to our analysis from covariance matrices as argued in connection with Eq. (8). In the same spirit but on the side of the data, PDFSense empowers the user to evaluate the combined impact of multiple experimental datasets, for example, to evaluate the extent to which the impact of a proposed experiment might be diminished by the constraints already imposed by existing measurements. These various functions collectively suggest a number of possible avenues to advance PDF knowledge in the coming years.
Acknowledgements
We thank our CTEQ-TEA colleagues for insightful discussions. This work was supported in part by the U.S. Department of Energy under Grant No. DE-SC0010129 and by the National Natural Science Foundation of China under Grant No. 11465018. The work of J.G. is sponsored by the Shanghai Pujiang Program.
Appendix A Approximate kinematical variables
In this section, we describe in detail our method for identifying the values of $(x, \mu)$ that correspond to experimental data.
For each experimental data point $i$, we can establish an approximate relation between the kinematical quantities for that data point and the unobserved quantities specifying the PDFs: the partonic momentum fraction $x$ and the QCD factorization scale $\mu$. For example, in DIS, $x$ and $\mu$ are approximately equal to Bjorken $x_B$ and momentum transfer $Q$ according to the Born-level kinematic relation. Although this relation is violated by higher-order radiative contributions, it will approximately hold in most scattering events. The same overall logic can be followed to relate the kinematical quantities in every process of the CTEQ-TEA global set to the approximate unobserved quantities $x$ and $\mu$ in the PDFs. These relations vary by process and are used to assign approximate pairs $(x_i, \mu_i)$ for each data point.
Specifically, for DIS, which primarily measures differential cross sections in $x_B$ and $Q$, we simply take
$$\mu_i \approx Q|_i, \qquad x_i \approx x_B|_i, \qquad (23)$$
as mentioned above, where the kinematical variables inside “$|_i$” are evaluated at their experimentally measured values for the $i$-th data point. The above approximate relations hold even when (N)NLO radiative contributions are included.
For one-particle-inclusive production in hadron-hadron scattering of the form $A B \to C X$, we plot two $x$ values if the rapidity $y_C$ of the final-state particle $C$ is known:
$$\mu_i \approx Q|_i, \qquad x_i^{\pm} \approx \frac{Q}{\sqrt{s}} \exp(\pm y_C)\Big|_i. \qquad (24)$$
We set $y_C = 0$ if the rapidity is integrated away. We point out that for processes of this type, Eq. (24) implies that a measurement in a single rapidity bin can in fact probe two distinct values of $x$; for this and other potential reasons, the number of raw data points in such an experiment should not be expected to match the number of extracted $(x, \mu)$ points it probes.
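In vector boson production, $A B \to (\gamma^*, Z \to \ell \bar{\ell}) X$ or $A B \to (W \to \ell \nu) X$, we set $Q = m_{\ell\ell}$ (invariant mass of the lepton pair), and $y_C = y_\ell$ if a single-lepton rapidity is provided or $y_C = y_{\ell\ell}$ if the lepton-pair rapidity is provided. If the rapidity $y_\ell$ of the lepton is known, yet $y_{\ell\ell}$ of the pair is unknown, we use the fact that $y_\ell \approx y_{\ell\ell}$ for most events because of the shape of the decay leptonic tensor. Thus, the momentum fractions can still be estimated as $x^{\pm} \approx (Q/\sqrt{s}) \exp(\pm y_\ell)$, where $y_\ell \approx y_{\ell\ell}$ (up to an error of less than 1 unit).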
In singleinclusive jet production, , we set
In singleinclusive pair production, we set if known, or 0 otherwise.
In singleinclusive top (anti)quark production, we take , for (as in Expt. ID 565). On the other hand, for or , in which the invariant mass is integrated out (Expt. IDs=566 and 568), we take an average mass scale GeV that is slightly above the observed peak of at .
Lastly, for the measurements from in the sets with Expt. ID 247 and 253, we take , . [Here denotes the boson’s transverse mass, not the invariant mass.]
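The matching described in this appendix can be sketched in Python, assuming the Born-level relations $x \approx x_B$, $\mu \approx Q$ for DIS and $x^{\pm} \approx (Q/\sqrt{s}) \exp(\pm y)$, $\mu \approx Q$ for hadroproduction, with $y$ set to zero when the rapidity is integrated away. The function names are illustrative, and the process-specific choices of $Q$ and $y$ discussed above are left to the caller.

```python
import math


def dis_point(x_bjorken, q):
    """DIS matching: (x, mu) taken as (Bjorken x, momentum transfer Q)."""
    return (x_bjorken, q)


def hadroproduction_points(q, sqrt_s, rapidity=None):
    """Hadron-hadron matching for a final state with hard scale q.

    q and sqrt_s must be in the same energy units. If the rapidity is
    not provided (integrated away), it is set to zero, so both momentum
    fractions coincide. Returns (x_plus, x_minus, mu).
    """
    y = 0.0 if rapidity is None else rapidity
    x_plus = q / sqrt_s * math.exp(y)
    x_minus = q / sqrt_s * math.exp(-y)
    return (x_plus, x_minus, q)
```

For example, a measurement with a hard scale of 100 GeV at a collider energy of 10 TeV and vanishing rapidity maps onto a single momentum fraction of 0.01, while a nonzero rapidity splits it into two values whose product is unchanged.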
Appendix B Tabulated results
In Tables 1–3 we provide a detailed key for the individual experiments mapped in Fig. 1, including the physical process, number of points, and luminosities, where available. We group these tables broadly according to subprocess (Table 1 corresponds to DIS experiments, while Tables 2 and 3 collect various measurements for the hadroproduction of, e.g., gauge bosons, jets, and $t\bar{t}$ pairs) and thus provide a translation key for the experimental ID numbers given in Fig. 1.
In Tables 5 and 6, we collect the flavor-specific and overall sensitivities for the experimental datasets contained in this analysis. In Table 5 we list the total and point-averaged sensitivities for each main flavor, while Table 6 gives the corresponding information for a number of quantities derived from these, as explained in the associated captions.