A new measure of tension between experiments
Abstract
Tensions between cosmological measurements by different surveys or probes have always been important, and are presently much discussed, as they may lead to evidence of new physics. Several tests have been devised to probe the consistency of datasets given a cosmological model, but they often have undesired features such as dependence on the prior volume, or burdensome requirements such as that of near-Gaussian posterior distributions. We propose a new quantity, defined in a similar way as the Bayesian evidence ratio, in which these undesired properties are absent. We test the quantity on simple models with Gaussian and non-Gaussian likelihoods. We then apply it to data from the Planck satellite: we investigate the consistency of ΛCDM model parameters obtained from TT and EE angular power spectrum measurements, as well as the mutual consistency of cosmological parameters obtained from large scale (multipoles, ) and small scale () portions of each measurement and find no significant discrepancy in the six-dimensional ΛCDM parameter space.
Introduction. The use of Bayesian statistics in cosmology is now commonplace: most of the results on cosmological parameters from cosmic microwave background (CMB) experiments (Ade et al., 2016) and large-scale structure (LSS) surveys (Abbott et al., 2017) are reported as posterior distributions. In addition, various Bayesian methods are used for model comparison (Trotta, 2008).
Along with the increase in the number of cosmological surveys and the improvement in their precision, a number of tensions between parameters derived from different experiments have been observed. For example, the Hubble constant measured using the distance ladder in the local universe disagrees with that derived from Planck CMB observations; in the standard six-parameter ΛCDM model, the disagreement is about (Riess et al., 2016; Bernal et al., 2016). There is also some tension between the measurements of the amplitude of fluctuations and the matter density from weak lensing and those from Planck CMB data (Hildebrandt et al., 2017; Troxel et al., 2017, 2018). As a result, a number of statistics have been developed to compare datasets in cosmology. The primary goal of these statistics is to determine if two datasets are consistent realizations of the same model, that is, with a single set of cosmological parameters (see Seehars et al. (2016); Charnock et al. (2017); Lin and Ishak (2017a) for discussions and comparisons of some of the popular methods). For an alternative approach using hyperparameters, see Hobson et al. (2002); Bernal and Peacock (2018).
The Bayesian evidence-based metric of Marshall et al. (2006) has been widely used (March et al., 2011; Amendola et al., 2013; Martin et al., 2014; Joudaki et al., 2017; Raveri, 2016), but is known to depend strongly on the priors given to parameters. This has led to the use of other measures (Seehars et al., 2014; Grandis et al., 2016; Feeney et al., 2018) that do not have the prior-volume dependence, but at the expense of losing the simplicity of an evidence ratio. In this work, we define an evidence-based quantity which fixes the problem of prior-volume dependence and which can be evaluated on an easy-to-interpret scale.
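Consider two datasets $d_1$ and $d_2$, and let $\theta$ be the parameters of a model. Let us assume that both the datasets and their combination can be modeled by a particular ΛCDM realization, and that the priors are wide enough to include the parameter posteriors preferred by both datasets individually. Most commonly, a Bayesian analysis is used to determine posterior probability distributions for the model parameters $\theta$. Suppose the two datasets separately give two (normalized) posterior distributions
$$P(\theta|d_i) = \frac{\mathcal{L}_i(d_i|\theta)\,\pi(\theta)}{E(d_i)}, \qquad i = 1, 2, \tag{1}$$
where $\mathcal{L}(d|\theta)$ denotes the likelihood of the data $d$ given the model defined by a set of parameters $\theta$ (a subscript labels the experiment), $\pi(\theta)$ is the prior probability of the model parameters, and $E(d)$ is called the marginal likelihood or the evidence,
$$E(d) = \int \mathrm{d}\theta\, \mathcal{L}(d|\theta)\,\pi(\theta). \tag{2}$$
We will always use normalized probability density functions for the likelihood ($\int \mathrm{d}d\, \mathcal{L}(d|\theta) = 1$) and the prior ($\int \mathrm{d}\theta\, \pi(\theta) = 1$). The posterior for the combination of datasets is:
$$P(\theta|d_1, d_2) = \frac{\mathcal{L}(d_1, d_2|\theta)\,\pi(\theta)}{E(d_1, d_2)} = \frac{\mathcal{L}_1(d_1|\theta)\,\mathcal{L}_2(d_2|\theta)\,\pi(\theta)}{E(d_1, d_2)}, \tag{3}$$
where the second equality assumes that the combined likelihood is approximated well by the product of the two likelihoods.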
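As a small numerical illustration of the evidence integral in Eq. (2), the following sketch evaluates it on a grid for a toy one-dimensional problem. The model (prediction equal to the parameter, Gaussian noise), the function names, and all numbers are assumptions made only for this illustration. It shows the prior-volume dependence of the ordinary evidence: doubling the width of a wide uniform prior halves $E(d)$.

```python
import numpy as np
from scipy.stats import norm


def trapezoid(y, x):
    """Trapezoidal rule on a one-dimensional grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def evidence_1d(d, sigma, prior_lo, prior_hi, n_grid=200001):
    """E(d) = integral of L(d|theta) * pi(theta) over a uniform prior range.

    Toy assumption: the model prediction is f(theta) = theta and the noise
    is Gaussian with standard deviation sigma.
    """
    theta = np.linspace(prior_lo, prior_hi, n_grid)
    likelihood = norm.pdf(d, loc=theta, scale=sigma)  # normalized in d
    prior = np.full_like(theta, 1.0 / (prior_hi - prior_lo))
    return trapezoid(likelihood * prior, theta)


# Hypothetical numbers, chosen only for illustration.
E_narrow = evidence_1d(d=0.3, sigma=1.0, prior_lo=-20.0, prior_hi=20.0)
E_wide = evidence_1d(d=0.3, sigma=1.0, prior_lo=-40.0, prior_hi=40.0)
print(E_narrow, E_wide, E_narrow / E_wide)
```

Because the likelihood is fully contained in both prior ranges, the evidence is simply the prior density, which is why a ratio of two such evidences computed with the same prior is free of this dependence.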
The ratio of the evidences obtained using two different models, called the Bayes Factor, is a widely used measure for model comparison. In this work, we will define a similar ratio to compare two sets of parameter constraints of a model obtained using different datasets or experiments.
Evidence for model parameters. We first define the marginal likelihood (or the evidence) for the maximum likelihood model parameters, $\theta^{\rm ML}$, instead of the usual definition of the evidence for the data, $d$. We do so as our primary goal is to quantify the level of consistency between model parameters obtained from different datasets or experiments. Analogous to Eq. (2), we define the evidence for the maximum likelihood model parameters
$$E(\theta^{\rm ML}) = \int \mathrm{d}\theta\, \mathcal{L}(d^{\rm ML}|\theta)\,\pi(\theta), \tag{4}$$
where, instead of the measured data, we have used the maximum likelihood values of the data realization given the model, $d^{\rm ML} = f(\theta^{\rm ML})$; see Figure 1 for an illustration. Here $f(\theta)$ is the function that computes the model prediction for the data given the parameters $\theta$; for example, in the case of the CMB temperature fluctuation data, the model prediction is represented by the theory angular power spectra $C_\ell$. If the likelihood in the above equation is a combination of two experiments, then we can define an evidence for the maximum likelihood parameters obtained through the combination of the two datasets (denoted by $\theta^{\rm ML}_{12}$), $E^{\rm com}$. Alternatively, we can define an evidence so that each part of the data vector in the evidence integral uses its own maximum likelihood parameter values, obtained by analyzing each experiment separately, $E^{\rm sep}$. As we will show, the ratio of the two evidences can quantify the tension between the parameter constraints obtained from two different datasets.
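Evidence-based dataset comparison. For simplicity, consider two datasets that have independent likelihoods, $\mathcal{L}(d_1, d_2|\theta) = \mathcal{L}_1(d_1|\theta)\,\mathcal{L}_2(d_2|\theta)$, and let the measured data vector for each experiment be denoted by $d_i$. Let us further assume that the maximum likelihood parameters of the model, ΛCDM for example, are known for three different cases: the two datasets analyzed separately, $\theta^{\rm ML}_1$ and $\theta^{\rm ML}_2$, and their combined analysis, $\theta^{\rm ML}_{12}$.
Our null hypothesis is that both datasets are realizations of a single set of parameters, $\theta^{\rm ML}_{12}$, from the combined fit. The alternative, more complicated, hypothesis is that each of the datasets is a realization of its own set of parameters, $\theta^{\rm ML}_1$ and $\theta^{\rm ML}_2$. Then, using Bayes' theorem similarly to the derivation of the Bayes factor, we get
$$\frac{E^{\rm sep}}{E^{\rm com}} = \frac{\int \mathrm{d}\theta\, \mathcal{L}_1\big(f(\theta^{\rm ML}_1)|\theta\big)\,\mathcal{L}_2\big(f(\theta^{\rm ML}_2)|\theta\big)\,\pi(\theta)}{\int \mathrm{d}\theta\, \mathcal{L}_1\big(f(\theta^{\rm ML}_{12})|\theta\big)\,\mathcal{L}_2\big(f(\theta^{\rm ML}_{12})|\theta\big)\,\pi(\theta)}, \tag{5}$$
where the superscripts sep and com in the formula above stand for separate and combined maximum likelihood parameters, respectively. We will first consider a case where the above expression can be evaluated analytically.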
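Consider two likelihoods given by two-dimensional multivariate Gaussian distributions with arbitrary covariance matrices $C_1$ and $C_2$,
$$\mathcal{L}_i(d_i|\theta) = \frac{1}{\sqrt{\det(2\pi C_i)}} \exp\left[-\frac{1}{2}(d_i - \theta)^T C_i^{-1} (d_i - \theta)\right], \qquad i = 1, 2.$$
In this simple example, we have taken $f(\theta) = \theta$ so that the expressions are easy to evaluate analytically. If we further assume that the prior on each of the parameters is uniform and wide (compared to the constraint on the parameter), with total prior volume $V$, we get (Petersen and Pedersen, 2012),
$$E^{\rm sep} = \frac{1}{V\sqrt{\det[2\pi(C_1 + C_2)]}} \exp\left[-\frac{1}{2}\delta^T (C_1 + C_2)^{-1} \delta\right], \qquad E^{\rm com} = \frac{1}{V\sqrt{\det[2\pi(C_1 + C_2)]}}, \tag{6}$$
with $\delta \equiv \theta^{\rm ML}_1 - \theta^{\rm ML}_2$, so that
$$\frac{E^{\rm sep}}{E^{\rm com}} = \exp\left[-\frac{1}{2}\delta^T (C_1 + C_2)^{-1} \delta\right], \tag{7}$$
the negative logarithm of which ($\frac{1}{2}\delta^T (C_1 + C_2)^{-1} \delta$) is the two-experiment index of inconsistency (IOI) defined in Lin and Ishak (2017a). Under these conditions, assuming that the null hypothesis is true, $-2\ln(E^{\rm sep}/E^{\rm com})$ is $\chi^2$ distributed with degrees of freedom (dof) equal to the number of model parameters (Raveri and Hu, 2018; see their definition and discussion of the analogous statistic). More generally, the ratio of probabilities of two hypotheses (evidence ratio) is similar to a likelihood-ratio test (Kass and Raftery, 1995), and the distribution of $-2\ln(E^{\rm sep}/E^{\rm com})$ asymptotically approaches a $\chi^2$ distribution by Wilks' theorem (Wilks, 1938). Here, the number of degrees of freedom is the difference in the number of free parameters between the two hypotheses, which equals the number of model parameters when comparing two datasets. We will, therefore, evaluate the probability-to-exceed (PTE) value of observed values by taking $-2\ln(E^{\rm sep}/E^{\rm com})$ to be $\chi^2$ distributed.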
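For the Gaussian case above, the ratio and its probability-to-exceed can be computed in a few lines. The sketch below is a minimal illustration, not the authors' code: the function name and the input values are hypothetical, and the conversion of the PTE to an equivalent number of Gaussian sigma uses a two-tailed convention, which is an assumption here.

```python
import numpy as np
from scipy.stats import chi2, norm


def gaussian_tension(theta1, cov1, theta2, cov2):
    """Log evidence ratio, PTE and equivalent Gaussian sigma for two
    Gaussian likelihoods with f(theta) = theta and wide uniform priors."""
    delta = np.asarray(theta1, dtype=float) - np.asarray(theta2, dtype=float)
    cov_sum = np.asarray(cov1, dtype=float) + np.asarray(cov2, dtype=float)
    # stat = delta^T (C1 + C2)^{-1} delta = -2 ln(E_sep / E_com)
    stat = float(delta @ np.linalg.solve(cov_sum, delta))
    ln_ratio = -0.5 * stat
    # Chi-squared survival function with dof = number of parameters.
    pte = float(chi2.sf(stat, df=delta.size))
    # Two-tailed conversion (assumed convention): PTE = 2 * (1 - Phi(n_sigma)).
    n_sigma = float(norm.isf(0.5 * pte))
    return ln_ratio, pte, n_sigma


# Hypothetical two-parameter example.
ln_ratio, pte, n_sigma = gaussian_tension(
    theta1=[0.0, 0.0], cov1=[[1.0, 0.3], [0.3, 2.0]],
    theta2=[1.0, -1.0], cov2=[[0.5, 0.0], [0.0, 0.5]],
)
print(ln_ratio, pte, n_sigma)
```

With one parameter the two-tailed conversion returns the familiar value, the square root of the statistic, which is a convenient sanity check on the convention.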
For two one-dimensional Gaussian likelihoods with means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$, $\sigma_2$, we get $-2\ln(E^{\rm sep}/E^{\rm com}) = (\mu_1 - \mu_2)^2/(\sigma_1^2 + \sigma_2^2)$. The application of our new measure to the marginalized Hubble constant likelihoods from Planck (Ade et al., 2016) and the distance ladder (Riess et al., 2016), therefore, trivially gives us the values expected from Gaussian statistics, i.e. (Lin and Ishak, 2017b) with a p-value or .
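The one-dimensional result can be checked by brute force. The sketch below evaluates the two evidences on a parameter grid for a toy model in which the prediction equals the parameter, and repeats the calculation for two different uniform prior widths. All names and numbers are hypothetical and are not the Planck or distance-ladder values.

```python
import numpy as np
from scipy.stats import norm


def evidence_ratio_1d(mu1, sig1, mu2, sig2, prior_lo, prior_hi, n_grid=400001):
    """E_sep / E_com for two 1D Gaussian likelihoods, by grid integration."""
    theta = np.linspace(prior_lo, prior_hi, n_grid)
    dtheta = theta[1] - theta[0]
    prior = 1.0 / (prior_hi - prior_lo)  # uniform prior density

    # Combined maximum likelihood value: inverse-variance weighted mean.
    w1, w2 = 1.0 / sig1**2, 1.0 / sig2**2
    mu_com = (w1 * mu1 + w2 * mu2) / (w1 + w2)

    def evidence(d1_ml, d2_ml):
        # Data vector replaced by the model prediction at the ML parameters.
        integrand = (norm.pdf(d1_ml, loc=theta, scale=sig1)
                     * norm.pdf(d2_ml, loc=theta, scale=sig2) * prior)
        return float(np.sum(integrand) * dtheta)  # simple Riemann sum

    return evidence(mu1, mu2) / evidence(mu_com, mu_com)


statistics = []
for lo, hi in [(-30.0, 30.0), (-60.0, 60.0)]:
    ratio = evidence_ratio_1d(0.0, 1.0, 3.0, 2.0, lo, hi)
    statistics.append(-2.0 * np.log(ratio))
    print(hi - lo, statistics[-1])
```

Both prior widths should return the same statistic, equal to the analytic value $(\mu_1 - \mu_2)^2/(\sigma_1^2 + \sigma_2^2)$ for these inputs, since the prior density multiplies both evidences and cancels in the ratio.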
Also, we note that our new measure is related to the tension measure defined in Verde et al. (2013), because in some situations the evidence for the combined fit can be approximated by shifting one of the posterior probability density functions while preserving its shape. However, there can be ambiguity in the process of shifting one or both of the posterior distributions (for non-Gaussian and multimodal distributions), as discussed in Section X.B. of Lin and Ishak (2017a). That ambiguity is removed in our definition, as we reference the likelihood functions directly. We provide an example in Figure 2, in which the Gaussian likelihood is simply, . The non-Gaussian likelihood is a (normalized) sum of two Gaussians, defined as . The distributions plotted in Figure 2 are (dashed) and (solid). Because the combined fit is insensitive to the narrow peak near , we get without any ambiguity in how to shift the distributions, which shows that the two sets of maximum likelihood parameters from the two likelihoods are consistent, as expected. Without the additional peak at , the level of consistency is slightly better: , a simple verification that the new measure gets a contribution from non-Gaussian features.
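A grid-based sketch of this kind of test is given below. It is only an illustration under stated assumptions: the model prediction is taken to be the parameter itself, each likelihood depends only on the difference between data and prediction, and the mixture weights, offsets and widths are hypothetical rather than the values used for Figure 2.

```python
import numpy as np
from scipy.stats import norm


def gaussian_noise(x):
    """Unit-variance Gaussian noise density."""
    return norm.pdf(x, loc=0.0, scale=1.0)


def bimodal_noise(x, weight=0.1, offset=4.0, width=0.2):
    """Normalized sum of a broad Gaussian and a narrow secondary peak."""
    return ((1.0 - weight) * norm.pdf(x, loc=0.0, scale=1.0)
            + weight * norm.pdf(x, loc=offset, scale=width))


def evidence_ratio(noise1, d1, noise2, d2, theta):
    """E_sep / E_com on a grid, assuming f(theta) = theta, so that
    L_i(d | theta) = noise_i(d - theta), and a uniform prior over the grid."""
    dtheta = theta[1] - theta[0]
    prior = 1.0 / (theta[-1] - theta[0])

    like1 = noise1(d1 - theta)
    like2 = noise2(d2 - theta)
    # Maximum likelihood parameters: separate fits and combined fit.
    theta_ml_1 = theta[np.argmax(like1)]
    theta_ml_2 = theta[np.argmax(like2)]
    theta_ml_com = theta[np.argmax(like1 * like2)]

    def evidence(t1, t2):
        # Each data vector is replaced by the prediction f(t_i) = t_i.
        integrand = noise1(t1 - theta) * noise2(t2 - theta)
        return float(np.sum(integrand) * prior * dtheta)

    return evidence(theta_ml_1, theta_ml_2) / evidence(theta_ml_com, theta_ml_com)


theta_grid = np.linspace(-30.0, 30.0, 120001)
ratio_gauss = evidence_ratio(gaussian_noise, 0.0, gaussian_noise, 1.0, theta_grid)
ratio_bimodal = evidence_ratio(gaussian_noise, 0.0, bimodal_noise, 1.0, theta_grid)
print(ratio_gauss, ratio_bimodal)
```

No shifting of distributions is needed at any point: the only inputs are the likelihood functions and the three sets of maximum likelihood parameters. For the purely Gaussian pair the result should reproduce the closed-form ratio, while the secondary peak changes the value slightly.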
Next, we calculate the evidence ratio using different pairs of datasets (e.g. TT vs EE) from the Planck satellite, in which case the model prediction is no longer a simple linear function of the parameters but has to be evaluated numerically.
Application to Planck data. We use the binned and foreground-marginalized plik_lite likelihood from the Planck collaboration (Aghanim et al., 2016) which includes multipoles for TT power spectrum, and multipoles for EE power spectrum. We fix the Planck calibration factor to 1; see Sec. C.6.2 of Aghanim et al. (2016), from which the CMB-only Gaussian plik_lite likelihood is (up to a parameter-independent normalization):
$$-2\ln\mathcal{L}(\hat{C}_b|\theta) = x^T \Sigma^{-1} x, \tag{8}$$
where $x = \hat{C}_b - C_b^{\rm th}(\theta)$. The binned and marginalized mean $\hat{C}_b$ and covariance matrix $\Sigma$ are provided by the Planck team. To evaluate the likelihood in Eq. (8), we compute lensed $C_\ell$ for a given set of parameters using camb (Lewis et al., 2000; Howlett et al., 2012) and bin the $C_\ell$ using the appropriate weights to get $C_b^{\rm th}$.
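The structure of a binned Gaussian likelihood of this kind can be sketched as follows. The arrays here are small toy stand-ins, not the Planck band powers, covariance or binning weights, and the function names are hypothetical; uniform averaging within each bin is an assumption made for simplicity.

```python
import numpy as np


def bin_spectrum(cl, bin_weights):
    """Band powers C_b = sum over ell of W[b, ell] * C_ell."""
    return bin_weights @ cl


def binned_gaussian_loglike(cb_theory, cb_data, cov):
    """ln L up to a constant: -0.5 * x^T Sigma^{-1} x, x = cb_data - cb_theory."""
    x = cb_data - cb_theory
    return -0.5 * float(x @ np.linalg.solve(cov, x))


# Toy inputs (hypothetical values, not real data).
n_ell, n_bin = 12, 3
per_bin = n_ell // n_bin
cl_theory = 1.0 / np.arange(2, 2 + n_ell) ** 2
# Each row averages `per_bin` consecutive multipoles (assumed uniform weights).
bin_weights = np.kron(np.eye(n_bin), np.full((1, per_bin), 1.0 / per_bin))
cb_theory = bin_spectrum(cl_theory, bin_weights)
errors = np.array([0.01, 0.02, 0.005])
cb_data = cb_theory + errors * np.array([1.0, -1.0, 1.0])  # one-sigma offsets
cov = np.diag(errors ** 2)

loglike = binned_gaussian_loglike(cb_theory, cb_data, cov)
print(loglike)
```

In the evidence integrals defined earlier, `cb_data` would be replaced by the binned theory spectrum evaluated at the relevant maximum likelihood parameters, while `cov` stays fixed.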
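Without low-multipole polarization data, the optical depth to reionization $\tau$ is only weakly constrained and is strongly degenerate with the amplitude of scalar fluctuations $A_s$. To break this degeneracy, we use a low-$\ell$ polarization prior on $\tau$. The evidences we compute are:
$$E^{\rm sep} = \int \mathrm{d}\theta\, \mathcal{L}_{\rm TT}\big(f(\theta^{\rm ML}_{\rm TT})|\theta\big)\,\mathcal{L}_{\rm EE}\big(f(\theta^{\rm ML}_{\rm EE})|\theta\big)\,\pi(\theta), \qquad E^{\rm com} = \int \mathrm{d}\theta\, \mathcal{L}_{\rm TT}\big(f(\theta^{\rm ML}_{\rm TT+EE})|\theta\big)\,\mathcal{L}_{\rm EE}\big(f(\theta^{\rm ML}_{\rm TT+EE})|\theta\big)\,\pi(\theta), \tag{9}$$
where $\theta^{\rm ML}_{\rm TT}$ and $\theta^{\rm ML}_{\rm EE}$ are obtained individually by using the respective TT and EE data, while $\theta^{\rm ML}_{\rm TT+EE}$ denotes the maximum likelihood model parameters from the combined fit. We obtain the maximum likelihood values by using the global optimization algorithm differential_evolution (Storn and Price, 1997) implemented in scipy (Jones et al., 2001–). We calculate the evidences using the MultiNest package (Feroz and Hobson, 2008; Feroz et al., 2009), and quote results and statistical error bars produced by the importance nested sampling method (Feroz et al., 2013). For evidence calculations, we take uniform priors on the six cosmological parameters listed in Table 1.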
Parameter  Range  Parameter  Range 
[2.7, 3.4]  [0.8, 1.2]  
[0.1, 0.45]  [0.044, 0.055]  
[50, 95]  [0.005, 0.2] 
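The maximum likelihood search described above can be sketched with scipy's `differential_evolution`. The example below uses a toy two-parameter power-law model in place of the camb-based prediction; the model, the mock data, the noise level and the bounds are all hypothetical and serve only to show the call pattern.

```python
import numpy as np
from scipy.optimize import differential_evolution

x = np.linspace(1.0, 10.0, 20)
sigma = 0.05  # assumed noise level of the mock data


def model(theta):
    """Toy model prediction f(theta): a power law with amplitude and tilt."""
    amplitude, tilt = theta
    return amplitude * x ** (tilt - 1.0)


theta_true = np.array([1.3, 0.96])
data = model(theta_true)  # noiseless mock data, so the optimum is theta_true


def neg_loglike(theta):
    """Negative Gaussian log-likelihood, up to an additive constant."""
    residual = data - model(theta)
    return 0.5 * float(np.sum((residual / sigma) ** 2))


# The uniform prior ranges double as the search bounds of the optimizer.
bounds = [(0.5, 2.0), (0.8, 1.2)]
result = differential_evolution(neg_loglike, bounds, atol=1e-10)
theta_ml = result.x
print(theta_ml, result.fun)
```

A global optimizer is a natural choice here because the search is restricted to the prior box and does not rely on a good starting point; the final local polishing step that `differential_evolution` performs by default refines the best population member.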
The results are shown in Table 2 where, in addition to the evidence ratio, we also quote the corresponding probability-to-exceed (p-value) and equivalent Gaussian $\sigma$ values. For the discrepancy between model parameters obtained from TT and EE spectra, we obtain (approximately ). Previous studies also find no indication of strong discrepancy between these datasets (Shafieloo and Hazra, 2017), albeit by using more complicated methods, or by directly using the posteriors (Lin and Ishak, 2017b).
datasets  TT,EE 




p-value
We perform another test using the Planck power spectrum data, by splitting the temperature data into and samples and calculating the evidence ratio for these two datasets. We again find that the level of inconsistency is small with or approximately , which agrees with the significance obtained using simulated datasets in Aghanim et al. (2017). Note that, to obtain the values in Table 2, we are using the plik_lite likelihood in which low () multipoles are not included; inclusion of these large-scale multipoles would likely increase the discrepancy as their amplitude is known to be anomalously low.
To estimate the effect of the low-$\ell$ part of the TT likelihood, we implement an approximation to the low-$\ell$ likelihood following Aghanim et al. (2017) (see their Section 3.2 for details), which they have tested and found to give cosmological parameters similar to those from the computationally more demanding pixel-space likelihood. To summarize: is drawn from a probability distribution function, where are mask-dependent fitting factors determined for the commander mask. Here is the mask-deconvolved power spectrum, which we take to be the Planck commander quadratic maximum likelihood (QML) $C_\ell$s. Any correlation between different multipoles for and with the plik_lite multipole bins is ignored. For , including the approximate low-$\ell$ likelihood, we now get or approximately , which again agrees with the significance quoted in Aghanim et al. (2017) obtained using simulations.
We finally carry out a similar analysis with the polarization data: we split the Planck EE data in multipole, using the plik_lite likelihood for each multipole range. The large- and small-scale multipole split for the EE spectrum results in consistent ΛCDM parameters: , or approximately , which is expected given the lesser constraining power of the EE spectrum for Planck noise levels.
Summary and Conclusion. We have introduced a new statistic to quantify tension between experiments. The statistic is based upon Bayesian evidence, and has the advantages of not depending on the prior volumes of the parameters, and of being straightforward to apply to multi-parameter, non-Gaussian likelihood distributions. We have shown that our new measure reduces to the expected discrepancy measure for Gaussian distributed posteriors, and gives sensible results in the non-Gaussian tests that we performed.
Applying the new statistic to the Planck power spectrum data, we find that the cosmological parameters obtained from TT and EE spectra are consistent, and that the level of discrepancy of the parameters obtained from the TT spectrum split into smaller and larger scales at is slightly larger at about .
We have limited our application to just the Planck data in this work. It is worthwhile to apply the new measure to comparing the Planck constraints with weak-lensing constraints (Abbott et al., 2017) and smaller-scale CMB constraints (Aylor et al., 2017). It will also be useful to consider using the statistic in the context of ΛCDM extensions. Further, we have only carefully investigated the ratio for comparing two datasets. A straightforward application of the ratio for more than two datasets might be possible by evaluating twice the negative logarithm of the ratio as $\chi^2$ distributed with degrees of freedom, but detailed investigation of this possibility and application to other cosmological datasets is left for future study.
Acknowledgments. The authors are supported by NASA under contract 14-ATP14-0005. DH is also supported by DOE under Contract No. DE-FG02-95ER40899. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. We thank Marius Millea for providing the necessary coefficients and example code to implement the approximate low-$\ell$ likelihood. We are grateful to Wayne Hu, Marco Raveri, Vivian Miranda, Weikang Lin, Mustapha Ishak-Boushaki, Pavel Motloch and Michael Hobson for insightful comments.
References
Ade et al. (2016) P. A. R. Ade et al. (Planck), Astron. Astrophys. 594, A13 (2016), arXiv:1502.01589 [astro-ph.CO].
Abbott et al. (2017) T. M. C. Abbott et al. (DES), (2017), arXiv:1708.01530 [astro-ph.CO].
Trotta (2008) R. Trotta, Contemp. Phys. 49, 71 (2008), arXiv:0803.4089 [astro-ph].
Riess et al. (2016) A. G. Riess et al., Astrophys. J. 826, 56 (2016), arXiv:1604.01424 [astro-ph.CO].
Bernal et al. (2016) J. L. Bernal, L. Verde, and A. G. Riess, JCAP 1610, 019 (2016), arXiv:1607.05617 [astro-ph.CO].
Hildebrandt et al. (2017) H. Hildebrandt et al., Mon. Not. Roy. Astron. Soc. 465, 1454 (2017), arXiv:1606.05338 [astro-ph.CO].
Troxel et al. (2017) M. A. Troxel et al. (DES), (2017), arXiv:1708.01538 [astro-ph.CO].
Troxel et al. (2018) M. A. Troxel et al. (DES), (2018), arXiv:1804.10663 [astro-ph.CO].
Seehars et al. (2016) S. Seehars, S. Grandis, A. Amara, and A. Refregier, Phys. Rev. D93, 103507 (2016), arXiv:1510.08483 [astro-ph.CO].
Charnock et al. (2017) T. Charnock, R. A. Battye, and A. Moss, Phys. Rev. D95, 123535 (2017), arXiv:1703.05959 [astro-ph.CO].
Lin and Ishak (2017a) W. Lin and M. Ishak, Phys. Rev. D96, 023532 (2017a), arXiv:1705.05303 [astro-ph.CO].
Hobson et al. (2002) M. P. Hobson, S. L. Bridle, and O. Lahav, Mon. Not. Roy. Astron. Soc. 335, 377 (2002), arXiv:astro-ph/0203259 [astro-ph].
Bernal and Peacock (2018) J. L. Bernal and J. A. Peacock, (2018), arXiv:1803.04470 [astro-ph.CO].
Marshall et al. (2006) P. Marshall, N. Rajguru, and A. Slosar, Phys. Rev. D73, 067302 (2006), arXiv:astro-ph/0412535 [astro-ph].
March et al. (2011) M. C. March, R. Trotta, L. Amendola, and D. Huterer, Mon. Not. Roy. Astron. Soc. 415, 143 (2011), arXiv:1101.1521 [astro-ph.CO].
Amendola et al. (2013) L. Amendola, V. Marra, and M. Quartin, Mon. Not. Roy. Astron. Soc. 430, 1867 (2013), arXiv:1209.1897 [astro-ph.CO].
Martin et al. (2014) J. Martin, C. Ringeval, R. Trotta, and V. Vennin, Phys. Rev. D90, 063501 (2014), arXiv:1405.7272 [astro-ph.CO].
Joudaki et al. (2017) S. Joudaki et al., Mon. Not. Roy. Astron. Soc. 465, 2033 (2017), arXiv:1601.05786 [astro-ph.CO].
Raveri (2016) M. Raveri, Phys. Rev. D93, 043522 (2016), arXiv:1510.00688 [astro-ph.CO].
Seehars et al. (2014) S. Seehars, A. Amara, A. Refregier, A. Paranjape, and J. Akeret, Phys. Rev. D90, 023533 (2014), arXiv:1402.3593 [astro-ph.CO].
Grandis et al. (2016) S. Grandis, D. Rapetti, A. Saro, J. J. Mohr, and J. P. Dietrich, Mon. Not. Roy. Astron. Soc. 463, 1416 (2016), arXiv:1604.06463 [astro-ph.CO].
Feeney et al. (2018) S. M. Feeney, H. V. Peiris, A. R. Williamson, S. M. Nissanke, D. J. Mortlock, J. Alsing, and D. Scolnic, (2018), arXiv:1802.03404 [astro-ph.CO].
Petersen and Pedersen (2012) K. B. Petersen and M. S. Pedersen, “The matrix cookbook,” (2012), version 20121115.
Raveri and Hu (2018) M. Raveri and W. Hu, (2018), arXiv:1806.04649 [astro-ph.CO].
Kass and Raftery (1995) R. E. Kass and A. E. Raftery, Journal of the American Statistical Association 90, 773 (1995).
Wilks (1938) S. S. Wilks, The Annals of Mathematical Statistics 9, 60 (1938).
Lin and Ishak (2017b) W. Lin and M. Ishak, Phys. Rev. D96, 083532 (2017b), arXiv:1708.09813 [astro-ph.CO].
Verde et al. (2013) L. Verde, P. Protopapas, and R. Jimenez, Phys. Dark Univ. 2, 166 (2013), arXiv:1306.6766 [astro-ph.CO].
Aghanim et al. (2016) N. Aghanim et al. (Planck), Astron. Astrophys. 594, A11 (2016), arXiv:1507.02704 [astro-ph.CO].
Lewis et al. (2000) A. Lewis, A. Challinor, and A. Lasenby, Astrophys. J. 538, 473 (2000), arXiv:astro-ph/9911177 [astro-ph].
Howlett et al. (2012) C. Howlett, A. Lewis, A. Hall, and A. Challinor, JCAP 1204, 027 (2012), arXiv:1201.3654 [astro-ph.CO].
Storn and Price (1997) R. Storn and K. Price, Journal of Global Optimization 11, 341 (1997).
Jones et al. (2001–) E. Jones, T. Oliphant, P. Peterson, et al., “SciPy: Open source scientific tools for Python,” (2001–), [Online; accessed May 1, 2018].
Feroz and Hobson (2008) F. Feroz and M. P. Hobson, Mon. Not. Roy. Astron. Soc. 384, 449 (2008), arXiv:0704.3704 [astro-ph].
Feroz et al. (2009) F. Feroz, M. P. Hobson, and M. Bridges, Mon. Not. Roy. Astron. Soc. 398, 1601 (2009), arXiv:0809.3437 [astro-ph].
Feroz et al. (2013) F. Feroz, M. P. Hobson, E. Cameron, and A. N. Pettitt, (2013), arXiv:1306.2144 [astro-ph.IM].
Shafieloo and Hazra (2017) A. Shafieloo and D. K. Hazra, JCAP 1704, 012 (2017), arXiv:1610.07402 [astro-ph.CO].
Aghanim et al. (2017) N. Aghanim et al. (Planck), Astron. Astrophys. 607, A95 (2017), arXiv:1608.02487 [astro-ph.CO].
Aylor et al. (2017) K. Aylor et al. (SPT), Astrophys. J. 850, 101 (2017), arXiv:1706.10286 [astro-ph.CO].