An improved model-independent assessment of the late-time cosmic expansion
Abstract
In the current work, we implement an extension of the standard Gaussian Process formalism, namely the Multi-Task Gaussian Process, which is able to perform a joint learning of several cosmological datasets simultaneously. We utilise the "low-redshift" expansion rate data from Supernovae Type Ia (SN), Baryon Acoustic Oscillations (BAO) and Cosmic Chronometers (CC) in a joint analysis. We test several possible models of covariance functions and find very consistent estimates for the cosmologically relevant parameters. The current formalism also provides heuristic arguments that allow us to select the best-suited kernel for the reconstruction of the expansion rate data. We further utilise our method to account for systematics in the CC data, obtaining an estimate of H_0 and a corresponding r_d as our primary result. Subsequently, we find constraints on the present deceleration parameter q_0 and the transition redshift z_t. All the estimated cosmological parameters are found to be in good agreement with the standard ΛCDM scenario. Including the local model-independent H_0 estimate in the analysis, we find updated constraints on H_0 and the corresponding r_d. The constraints on r_d remain consistent throughout our analysis, and also with the model-dependent CMB estimate. Using the Om(z) diagnostic, we find that the concordance model is very consistent at lower redshifts and mildly discrepant at the higher redshifts probed.
Balakrishna S. Haridasu,^{a,b,c} Vladimir V. Luković,^{b,c} Michele Moresco^{d,e} and Nicola Vittorio^{b,c}

Prepared for submission to JCAP

^a Gran Sasso Science Institute, Viale Francesco Crispi 7, I-67100 L’Aquila, Italy

^b Dipartimento di Fisica, Università di Roma "Tor Vergata", Via della Ricerca Scientifica 1, I-00133, Roma, Italy

^c Sezione INFN, Università di Roma "Tor Vergata", Via della Ricerca Scientifica 1, I-00133, Roma, Italy

^d Dipartimento di Fisica e Astronomia, Università di Bologna, Via Gobetti 93/2, I-40129, Bologna, Italy

^e INAF - Osservatorio Astronomico di Bologna, Via Gobetti 93/3, I-40129, Bologna, Italy
1 Introduction
The current era of precision cosmology has allowed us to constrain cosmological models to a very high degree of precision [1, 2, 3, 4]. However, these models lack explanations for several physical aspects [5], which creates a need for alternative approaches to analysing the data and making cosmological inferences. This need, together with the ease of conducting phenomenological analyses, has led to the use of several model-independent methods in the last decade.
Amongst the several model-independent approaches are Principal Component Analysis [6, 7, 8, 9, 10, 11], smoothing methods [12, 13, 14], cosmographic and polynomial regression methods [15, 16, 17, 18, 19], and the Gaussian process (GP) formalism [20, 21, 22]; see [23] for an overview of some of these methods. GP has been a popular "model-independent" method utilised in the past years to reconstruct the low-redshift cosmic expansion history, and hence the dynamics of the late-time evolution [21, 24]. GP has been implemented to draw inferences on acceleration [25, 26], estimates of H_0 [27, 28, 29, 18], the curvature of the universe [30], to perform model falsification by estimating several diagnostics [29, 31, 32, 33], and to study the dynamics of the Dark Energy (DE) / scalar field equation of state (EoS) [34, 35, 21, 33]. More recently, it has also been implemented to study model selection in [26, 36, 37] and the estimation of linear anisotropic stress in [38].
In this work we implement an extended GP formalism, which allows one to perform a joint analysis of "low-redshift" observations such as the Supernovae Type Ia (SN) [4], Cosmic Chronometers (CC) [39, 40] and anisotropic Baryon Acoustic Oscillations (BAO) [41] datasets. Conducting a joint analysis of cosmological data in independent/single GPs often requires assumptions on several cosmological parameters, such as the present expansion rate (H_0) and the sound horizon at the drag epoch (r_d). Such assumptions, in turn, make the overall analysis, and thereby any inferences, mildly (if not significantly) model-dependent. In this respect, we implement a model-independent joint learning method called the Multi-Task Gaussian Process (MTGP) [42, 43, 44], which provides a provision to model the covariance between different tasks (essentially, datasets), thereby allowing the various tasks to complement each other and hence improve the overall reconstruction/prediction for each of the datasets, in comparison to independent GPs of the individual tasks. In particular, for a cosmological dataset one can make better predictions in a redshift range with less significant observations by complementing the primary dataset with secondary dataset(s) that provide better measurements in that range of redshifts. Apart from providing improved predictions, MTGP also allows one to infer parameters (e.g., H_0) from the reconstructions, in contrast to assuming them a priori.
The tension between the local direct estimate of H_0 (hereafter R18) in [45] and the indirect estimate (hereafter P16) from the "high-redshift" Planck collaboration [1] (see Table 8 therein) has been a well-established problem of precision cosmology. The GP formalism has been implemented many a time to estimate H_0 as the intercept of the extrapolated predictive reconstruction of the CC data [27, 28, 29]; [29] have reported an estimate preferring a lower value in comparison to R18. More recently, [18] have combined the expansion rate data (E(z)) from supernovae [4] with the CC data. These estimates are clearly very consistent with the "high-redshift" P16 and retain the well-known tension with R18. As is also mentioned in [18], the GP formalism suffers from the caveat of having to choose a specific covariance function, essentially a prior (of sorts) on the reconstruction. In turn, this choice could lead to a bias in the estimate of H_0. Correcting for this effect would require better ways to model the GP, such as learning the covariance function instead of assuming a fixed model, or implementing a joint analysis of different datasets. While the former option is a statistically rigorous procedure (see e.g., [46]), it is not suitable for the current compilation of limited cosmological data. On the contrary, there lie prospects in the latter option, which we implement here (partly owing to the newer E(z) data). It is also important that the joint analysis is complemented with a proper kernel selection criterion, to overcome any possible biases/specificities of the assumed covariance functions. However, we anticipate that given the smooth nature of the cosmic expansion rate data, one might not require a kernel selection, but on the contrary find very consistent results with each of the kernels implemented.
In this work, we focus on the estimation of the constant parameters H_0 and r_d from the MTGP formalism for several suitable combinations of data. As the GP formalism is also able to predict derivatives of the reconstructed functions, the evaluation of the deceleration parameter q(z) [25, 20] and the estimation of the deceleration/acceleration transition redshift z_t [47, 48] have been of key interest. The SN distance data were able to provide a reliable reconstruction of q(z) over an extended redshift range (see e.g., [20]); however, the CC data are unable to do the same unless an additional prior is assumed [25]. In [29], it has been reported that no good estimate of z_t is available from the CC data, which we try to revise in this work. We also reconstruct relevant cosmological diagnostics to comment on their estimation and hence the inferences. In particular, we reconstruct the Om diagnostic [49, 32, 50] and its derivative. These diagnostics and their variations have been implemented on several occasions to assess the validity of ΛCDM [49]. The CC datasets usually implemented in the literature are derived based on an assumption of a particular model for the stellar evolution [51, 40, 52]. We discuss the differences in the estimates, and hence the systematics, obtained using the alternate data (this issue was also addressed in [18]). We also illustrate several subtleties of the newer method utilised in this work by implementing it with several combinations of the data.
The paper is organised as follows: In Section 2 we describe the newer MTGP method, in Section 3 we then briefly present the datasets, corresponding modelling and also introduce the theoretical framework used to derive inferences. We have also included a brief discussion on the issue of kernel selection in this section. In Section 4 we present results for cosmological inferences from our analysis, along with a discussion on systematic effects and finally summarise our conclusions in Section 5.
2 Method
In this section, we describe the basic formalism of the Multi-Task Gaussian Process (hereafter MTGP) that we have implemented in this work. As the name suggests, MTGP considers different tasks, which are essentially different sets of independent training data, and models their corresponding cross-covariances to perform a simultaneous learning, which in turn is utilised to make predictions. We begin by introducing the standard Single-Task Gaussian Process and then proceed to the newer MTGP formalism implemented here.
2.1 Single-Task Gaussian Process
We introduce the individual/Single-Task Gaussian Process (GP) with a brief description. However, we keep the discussion in this subsection to a minimum, as it has been introduced numerous times in the cosmological context (see e.g., [20, 21, 22]; for an in-depth discussion we strongly encourage the reader to refer to [53], and see [54] for a discussion of the Gaussian process with noisy inputs, which is more relevant for cosmological purposes). GP is essentially a generalisation of the single Gaussian distribution to a probability distribution of a function over a range of independent variables. In principle, this can be any stochastic process, but it is far simpler in the Gaussian scenario, which is also more often the case. While GP can be implemented for both classification and regression problems, for the cosmological concerns of this paper we focus on regression. Given a task to be learnt for a set of independent variables x and the corresponding dependent outputs y, GP provides predictions at the target inputs by placing a Gaussian prior on the latent function f(x) that maps x onto y. It is in this context that GP is also deemed a collection of random variables, and it can be represented as,
$f(x) \sim \mathcal{GP}\left(\mu(x),\, \sigma^{2}(x)\right)$  (2.1)
where μ(x) and σ²(x) are the mean and the variance of the random variable at x, respectively. The prior mean μ(x) is usually assumed to be 0, and does not play any role in obtaining the reconstructions, as a constant (deterministic) mean only remains an additive factor for the final predictions. However, this is not the case if one wants to assume an explicit functional form for μ(x) with additional parameters [55, 56]; the analysis would then no longer rest on minimal assumptions, as the nature of this functional form is imposed on the reconstructions. Also, note that far away from the regions in which data are available, the reconstructed posterior mean tends to the prior mean. We comment in more detail on the assumption of the mean towards the end of this section. Given that one needs to reconstruct the function f(x), we model the covariance between values of this function at different positions as,
$\mathrm{cov}\left(f(x), f(\tilde{x})\right) = k(x, \tilde{x})$  (2.2)
where k(x, x̃) is an a priori assumed covariance model (kernel, in GP terminology) for the particular reconstruction, which appropriately replaces the variance in Eq. 2.1. We use x to represent a set of input values and x̃ for an individual input. Correspondingly, k evaluated at two individual inputs is a single value, while K(x, x) represents a matrix of dimension n × n, where n is the length of the input vector. In real-world scenarios, the dependent variables are oftentimes only obtained with uncertainties (noise, in GP terminology), or even possible covariances amongst different measurements. In such cases, the learning is performed with an additional noise covariance term C along with the modelled K. The standard covariance functions are usually stationary models, where the estimate of the covariance depends only on the difference between the positions and not on the actual positions; such models are often referred to as stationary kernels. In this work we implement the standard Gaussian/Squared-Exponential (SE) and the Matérn class of kernels. The SE kernel is defined as,
$k_{\rm SE}(x, \tilde{x}) = \sigma_f^{2}\, \exp\left(-\dfrac{(x - \tilde{x})^{2}}{2\, l^{2}}\right)$  (2.3)
where σ_f² is the function/signal variance (amplitude) and l is the length scale that dictates the ability to model features in the predicted region. These two parameters are often called hyperparameters, as they are not the parameters of the function but of the covariance function. In the descriptions to follow we write the kernels in terms of the separation r = |x − x̃|, and write the arguments explicitly only when necessary; this is consistent with all the kernels implemented here, which are stationary in nature. The SE kernel, however, is a very smooth covariance function, and so is oftentimes not regarded as the best assumption for reconstructing features. To this aid, the Matérn class of kernels is very useful, and its general functional form can be written as,
$k_{M_{\nu}}(r) = \sigma_f^{2}\, \dfrac{2^{1-\nu}}{\Gamma(\nu)} \left(\dfrac{\sqrt{2\nu}\, r}{l}\right)^{\nu} K_{\nu}\!\left(\dfrac{\sqrt{2\nu}\, r}{l}\right)$  (2.4)
where K_ν is the modified Bessel function of the second kind, ν is strictly a positive parameter, and Γ is the standard Gamma function; σ_f and l carry their usual definitions. The modified Bessel function admits an explicit analytic functional form for half-integer values of ν, and as ν → ∞ the covariance function converges to the SE kernel. Amongst the several possibilities, a few intermediate half-integer orders are of particular interest, as they are neither too smooth nor do they predict rapid variations (see [21] for a good discussion of the relevance of these issues for cosmological data; it has also been argued that reconstructions using Matérn kernels with large ν cannot be distinguished for finite and noisy training samples, see Sec. 4.2 of [53] and references therein). We would also like to assert that for the current cosmological data available, the flexibility of the Matérn class kernels is sufficient and there is no immediate need for more complicated kernels, which will become relevant with forthcoming stringent data and for extrapolated predictions. For instance, one of the Matérn covariance functions is given as,
(2.5) 
Also, it is worth mentioning that the Matérn kernels are mean-square differentiable only to an order set by ν, and so the kernels with low ν lose their interest for predicting the derivatives, and hence the relevant cosmological features, while still remaining relevant for predictions at a single target position. Finally, the hyperparameters are learned by optimising the log marginal likelihood (see [53] for details), which is defined as,
$\log \mathcal{L} = -\dfrac{1}{2}\, \mathbf{y}^{T}\left[K(\mathbf{x}, \mathbf{x}) + C\right]^{-1}\mathbf{y} - \dfrac{1}{2} \log\left|K(\mathbf{x}, \mathbf{x}) + C\right| - \dfrac{n}{2} \log 2\pi$  (2.6)
where the covariance includes the observational noise C, for a set of n observations y and assuming a zero mean. The reconstruction can be performed by either learning the hyperparameters from a simple (global) optimisation of Eq. 2.6, or by marginalising over them in a Bayesian way. In fact, the latter procedure often puts GP in the light of a non-parametric method (in a similar line of reasoning, as we are modelling the covariance but not the latent function itself, which is marginalised upon, GP is also deemed model-independent; a discussion of similar issues was presented in the more recent work [18]). We emphasise that in the context of cosmology, model-independent implies independent of the cosmological model; the assumed kernel models must nevertheless be subjected to a proper kernel selection to estimate which of them provides a better description of the data (see Section 3.3). However, it has been argued in [53] (also in [21]) that optimisation provides equally good predictions while being computationally lighter. The difference between optimisation and marginalisation becomes significant when the data are too sparse or too noisy. The latter is, in fact, true of the CC data, as was found in [18]. However, if the data are well constrained, the probability distributions of the hyperparameters will be similarly peaked (although the log marginal likelihood peaks at the global maximum, there may also exist several local maxima; appropriate care must be taken to avoid these local maxima, see Sec. 5.4 of [57]). We provide more discussion on this in Section 3.3. In this work we perform a global optimisation within a very large prior region for both hyperparameters (see Section 3.2). Implementing the optimisation method, once the hyperparameters are learned one can predict the mean and variance of the latent function at the test/target points through,
$\bar{f}_{*} = K(\mathbf{x}_{*}, \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + C\right]^{-1}\mathbf{y}, \qquad \mathrm{cov}(f_{*}) = K(\mathbf{x}_{*}, \mathbf{x}_{*}) - K(\mathbf{x}_{*}, \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + C\right]^{-1}K(\mathbf{x}, \mathbf{x}_{*})$  (2.7)
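As a concrete illustration of Eqs. 2.3–2.7, the following minimal sketch (Python/NumPy, not the code used in this work) implements the SE and general Matérn kernels, the log marginal likelihood of Eq. 2.6 with observational noise on the diagonal, and the predictive mean and variance of Eq. 2.7; the Cholesky factorisation is a choice made here for numerical stability.

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def se_kernel(x1, x2, sigma_f, l):
    """Squared-Exponential kernel, Eq. (2.3)."""
    r = np.subtract.outer(x1, x2)
    return sigma_f**2 * np.exp(-r**2 / (2.0 * l**2))

def matern_kernel(x1, x2, sigma_f, l, nu=3.5):
    """General Matern kernel, Eq. (2.4), for an arbitrary positive order nu."""
    r = np.abs(np.subtract.outer(x1, x2))
    arg = np.sqrt(2.0 * nu) * r / l
    arg = np.where(arg == 0.0, 1e-12, arg)   # k -> sigma_f^2 in the r -> 0 limit
    return sigma_f**2 * 2.0**(1.0 - nu) / gamma(nu) * arg**nu * kv(nu, arg)

def log_marginal_likelihood(x, y, y_err, kernel, **hyp):
    """Eq. (2.6), with noise variance on the diagonal and a zero prior mean."""
    K = kernel(x, x, **hyp) + np.diag(y_err**2)
    L = np.linalg.cholesky(K)                # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))

def gp_predict(x, y, y_err, x_star, kernel, **hyp):
    """Predictive mean and variance of Eq. (2.7)."""
    K = kernel(x, x, **hyp) + np.diag(y_err**2)
    Ks = kernel(x_star, x, **hyp)
    Kss = kernel(x_star, x_star, **hyp)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks.T)
    return Ks @ alpha, np.diag(Kss - v.T @ v)
```

For half-integer ν the Bessel form above agrees with the usual closed expressions, e.g. for ν = 3/2 it equals σ_f²(1 + √3 r/l) exp(−√3 r/l).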
One can extend the GP predictions also to the derivatives of the latent function, although limited by the differentiability of the assumed kernel function. As is shown in [58], the derivative of a GP is also a Gaussian process; in fact, even an integration or combinations of different orders of derivatives can be approximated as Gaussians within the GP formalism. As described in [31] (see also [54]), this formalism can be useful to predict the relevant dynamics in cosmology. The covariances involving the derivatives are written as,
$\mathrm{cov}\left(f'(x_i), f(x_j)\right) = \dfrac{\partial k(x_i, x_j)}{\partial x_i}, \qquad \mathrm{cov}\left(f'(x_i), f'(x_j)\right) = \dfrac{\partial^{2} k(x_i, x_j)}{\partial x_i\, \partial x_j}$  (2.8)
where the primes represent derivatives with respect to the corresponding independent variables. Once the covariances involving the derivative(s) of the latent function are defined, we can obtain the mean and variance for the predictions of the function derivatives. The mean of the i-th derivative and the covariance between the i-th and j-th derivatives can be written as,
$\bar{f}^{(i)}_{*} = K^{(i)}(\mathbf{x}_{*}, \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + C\right]^{-1}\mathbf{y}, \qquad \mathrm{cov}\!\left(f^{(i)}_{*}, f^{(j)}_{*}\right) = K^{(i,j)}(\mathbf{x}_{*}, \mathbf{x}_{*}) - K^{(i)}(\mathbf{x}_{*}, \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + C\right]^{-1}K^{(j)}(\mathbf{x}, \mathbf{x}_{*})$  (2.9)
respectively. For i = j, the latter of Eq. 2.9 provides the variance of the i-th derivative. If data for the derivative functions are available, one can in fact perform a joint analysis, which was previously the only available approach [59], however requiring one to assume physical constants in the cosmological context. Subsequently, one can extend this formalism to obtain predictions for combinations of the function and its derivatives. The prediction at a given target position for the combination is sampled from a multivariate normal distribution of the necessary components, appropriately considering the covariance amongst them. This is especially crucial for derivatives of higher order with larger uncertainties, where simple propagation of errors will not fare well.
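For the SE kernel the derivative covariances of Eq. 2.8 are analytic; the sketch below (illustrative, not the implementation used in this work) predicts the mean and variance of f′ at a target position following the i = j = 1 case of Eq. 2.9.

```python
import numpy as np

def se(x1, x2, s, l):
    r = np.subtract.outer(x1, x2)
    return s**2 * np.exp(-r**2 / (2.0 * l**2))

def se_d10(x1, x2, s, l):
    """cov(f'(x1), f(x2)) = dk/dx1 for the SE kernel (Eq. 2.8)."""
    r = np.subtract.outer(x1, x2)
    return -(r / l**2) * se(x1, x2, s, l)

def se_d11(x1, x2, s, l):
    """cov(f'(x1), f'(x2)) = d^2 k / (dx1 dx2) for the SE kernel."""
    r = np.subtract.outer(x1, x2)
    return (1.0 / l**2 - r**2 / l**4) * se(x1, x2, s, l)

def predict_derivative(x, y, y_err, x_star, s, l):
    """Mean and variance of f'(x*), following Eq. (2.9) with a zero prior mean."""
    K = se(x, x, s, l) + np.diag(y_err**2)
    Kds = se_d10(x_star, x, s, l)            # cov between f'(x*) and f(x)
    Kdd = se_d11(x_star, x_star, s, l)
    mean = Kds @ np.linalg.solve(K, y)
    cov = Kdd - Kds @ np.linalg.solve(K, Kds.T)
    return mean, np.diag(cov)
```

Applied to noisy samples of sin(x), for instance, the predicted derivative tracks cos(x) within the data range.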
2.2 Multi-Task Gaussian Process
While the Single-Task GP models the covariance between the values of the same task at different positions, the MTGP allows one to perform a simultaneous learning of multiple tasks by modelling the covariance between different tasks at their respective positions [44, 43, 60, 61] (for a historical overview of the Multi-Task Gaussian Process, please refer to [42, 53]). For this purpose, similar to Eq. 2.2 one can write,
$\mathrm{cov}\left(f_{i}(x), f_{j}(\tilde{x})\right) = k_{ij}(x, \tilde{x})$  (2.10)
where f_i and f_j represent the latent functions of the i-th and the j-th task at their corresponding input variables, respectively, and the covariance between tasks i and j is modelled by k_{ij}. In this work we implement a more general Multi-Kernel Gaussian process, where the formalism is developed so that one can utilise different kernels for different tasks, unlike the standard Multi-Task scenario, which only allows for a single kernel across different tasks. Therefore, one can choose two different kernels (say, SE and Matérn) to model two tasks and their covariance to perform a joint learning. Consider two sets of data with n_1 and n_2 data points, respectively. The current formalism is based on the fact that the kernel function can be written as a convolution of an underlying basis function (the basis function(s) of a kernel model how the strength of the covariance/correlation between two inputs decays with increasing separation between them; see [53] for more details), evaluated at the two positions between which the covariance needs to be modelled. Following [60], one can write the kernel function as,
$k_{ii}(x, \tilde{x}) = \int_{-\infty}^{\infty} \varphi_{i}(x - u)\, \varphi_{i}(\tilde{x} - u)\, \mathrm{d}u$  (2.11)
where φ_i is the basis function of the kernel k_{ii}, defining the auto-covariance between two different data points of the same task. Given this, one needs to model the cross-covariance, which can be written as the convolution of the basis functions of the two different kernels of the two individual tasks,
$k_{ij}(x_i, x_j) = \int_{-\infty}^{\infty} \varphi_{i}(x_i - u)\, \varphi_{j}(x_j - u)\, \mathrm{d}u$  (2.12)
Here k_{ij} denotes the cross-covariance between the two different tasks i and j, φ_i and φ_j represent the basis functions of the two individual kernels, and x_i, x_j represent the independent variables of their respective tasks. Given the above formalism, the method to obtain the basis functions has been derived in [60] (see also the corresponding technical report, with detailed derivations and explanations, available at http://www.acfr.usyd.edu.au/techreports/). The basis function of a given kernel can be written as,
$\varphi(x) = \mathcal{F}^{-1}\!\left[\sqrt{\mathcal{F}\left[k\right]}\,\right](x)$  (2.13)
where 𝓕⁻¹ and 𝓕 represent the inverse Fourier and the Fourier transform, respectively. Once the basis functions are evaluated, one can write the cross-covariance following Eq. 2.12. Finally, we model the auto-covariance for each of the datasets and the corresponding cross-covariance between different datasets as,
(2.14) 
where the indices i, j run over the number of datasets. This formalism ensures that we obtain a positive semi-definite matrix for the modelled covariance, and allows one to perform a joint learning on the different datasets/tasks, which can now be incorporated in the joint GP. Given the covariance between the individual tasks, we write the joint covariance matrix as,
$K_{\rm joint} = \begin{pmatrix} K_{11} + C_{1} & K_{12} & \cdots & K_{1N} \\ K_{21} & K_{22} + C_{2} & \cdots & K_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ K_{N1} & K_{N2} & \cdots & K_{NN} + C_{N} \end{pmatrix}$  (2.15)
where the cross-covariance between different datasets is modelled by the off-diagonal blocks K_{ij}, and C_i represents the corresponding noise observed on the data, with n_i being the number of data points in task i. If the noise of different tasks is correlated as well, the corresponding off-diagonal noise terms in Eq. 2.15 will be non-zero. This formalism can be easily extended to many datasets, and the learning can be performed by optimising Eq. 2.6, appropriately replacing the covariance and the data for the joint analysis of the tasks. Note that the reconstruction of the predictive regions is also performed taking into account the total covariance in Eq. 2.15. Finally, as an example, we show the cross-covariance terms evaluated for the SE and Matérn kernels. Their basis functions can be evaluated analytically and are written as,
(2.16) 
where the hyperparameters have their usual meanings and K denotes the modified Bessel function. Note that the basis function of the Matérn kernel is given in terms of a modified Bessel function, whose convolution cannot be explicitly written in an analytical form, and hence all further operations must be carried out numerically. With the basis functions defined, the cross-covariance can be evaluated by performing the convolution as in Eq. 2.12. The cross-covariance between two SE kernels with different hyperparameters can be obtained analytically as,
$k_{12}(r) = \sigma_{1}\, \sigma_{2}\, \sqrt{\dfrac{2\, l_{1}\, l_{2}}{l_{1}^{2} + l_{2}^{2}}}\; \exp\left(-\dfrac{r^{2}}{l_{1}^{2} + l_{2}^{2}}\right)$  (2.17)
where the indices 1, 2 correspond to the first and the second SE kernel. Note that the cross-covariance reduces to the standard SE kernel (Eq. 2.3) if the hyperparameters of the two kernels are the same. Although we do not have an analytical functional form for the cross-covariances involving Matérn class kernels, they can be evaluated by numerically performing the convolution. In Figure 1 we provide an illustrative comparison of the auto-covariance and cross-covariance functions for several combinations of hyperparameters. In the left panel we show the standard kernel auto-covariances as functions of the separation between the inputs. It can be clearly noticed that the Matérn class kernels provide less correlation than the SE kernel at smaller separations, and vice versa; this in turn allows the Matérn kernels to provide more features in the predictive region as the order ν is reduced. In the right panel we show the cross-covariance between two kernels of the same type with a different length scale hyperparameter for each; notice that in this case the covariance at small separations is greater for smaller length scales. This plays a very important role in the determination of the posterior at a particular position for the different tasks in the MTGP formalism. Also, the cross-covariance between the derivative of a first kernel and a second kernel is still stationary, but anti-symmetric with respect to the separation.
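A minimal sketch of the two-task machinery for SE kernels, following Eqs. 2.12, 2.15 and 2.17, might look as follows (illustrative only, not the code used in this work; the analytic cross-SE form is the standard convolution result, which reduces to the SE kernel for equal hyperparameters):

```python
import numpy as np

def se(x1, x2, s, l):
    """SE auto-covariance, Eq. (2.3)."""
    r = np.subtract.outer(x1, x2)
    return s**2 * np.exp(-r**2 / (2.0 * l**2))

def se_cross(x1, x2, s1, l1, s2, l2):
    """Cross-covariance of two SE kernels (convolution of their basis functions,
    Eqs. 2.12 and 2.17); equals Eq. (2.3) when the hyperparameters coincide."""
    r = np.subtract.outer(x1, x2)
    norm = np.sqrt(2.0 * l1 * l2 / (l1**2 + l2**2))
    return s1 * s2 * norm * np.exp(-r**2 / (l1**2 + l2**2))

def joint_covariance(x1, x2, err1, err2, hyp1, hyp2):
    """Block covariance of Eq. (2.15) for two tasks with uncorrelated noise."""
    K11 = se(x1, x1, *hyp1) + np.diag(err1**2)
    K22 = se(x2, x2, *hyp2) + np.diag(err2**2)
    K12 = se_cross(x1, x2, *hyp1, *hyp2)
    return np.block([[K11, K12], [K12.T, K22]])
```

The resulting joint matrix is symmetric and positive definite by construction, so it can directly replace the single-task covariance in Eqs. 2.6 and 2.7.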
While this improved formalism seems very attractive for cosmological analyses, there are also a few cautions one must be aware of before making cosmological inferences. We therefore summarise the advantages of this formalism, and a few cautions to mind, when using MTGP in a joint analysis.
Advantages:

Complementary datasets can help improve the reconstructions of data with sparse/fewer or noisy data points. This approach provides better predictions when one knows a priori that the two tasks are related, however with an unknown functional form, e.g., combining expansion rate datasets of different origins.

Two closely related datasets that are different due to an unknown multiplicative or additive factor can be modelled together, and in fact the unknown factor(s) can be inferred from MTGP instead of having to assume them a priori. This can also be extended to analyses where the unknown factors are not simply constants but functions, e.g. powers of (1+z), by appropriately adapting the kernels.

In the current cosmological context, even if a statistical evaluation of the best-suited kernel for a particular prediction can be elusive, given the physical nature of the data one can infer the better-suited kernel through heuristic arguments.

While MTGP provides the freedom to model different tasks with different underlying latent functions, one can also enforce their equivalence, and variously utilise the shared hyperparameters for an extended analysis.
Disadvantages/cautions:

One should be careful while combining two unrelated datasets, especially if they have very different constraining strengths, as the dataset with stronger significance might force its own features onto the reconstruction of the weaker dataset. Such a phenomenon is called negative transfer (please refer to [62, 63, 64] for extensive discussions of this issue). We demonstrate this effect in Section 4.3.

One must be careful in the choice of the covariance functions. As the number of tasks increases, the possible combinations of kernels rise drastically. It would be important to assess the nature of the data and guess a reasonable covariance function. However, this is only of dire consequence if the data are not functionally related.

Even when the datasets are related, it is still possible to obtain overfitting if one of the datasets is much stronger than the other. This can, however, be easily seen when the stringent posteriors of one dataset overpower the reconstructions of the other datasets in the regions where the latter are void of data.
Here overfitting and negative transfer are very similar phenomena, which can however be conceptually differentiated. If the datasets are clearly unrelated and the features of one dataset overpower the other, providing incorrect features (the nature of the curve) in the reconstruction of the second dataset, it is negative transfer. If the datasets are a priori known to be related, and the stronger dataset overpowers the second to provide overly stringent reconstructions in the regions where no measurements are available for the second dataset, this is negative transfer of the overfitting kind.
On the assumption of the mean: oftentimes, when no prior knowledge of the latent function is available, the prior mean is assumed to be 0 by symmetry when obtaining the posteriors. Although the reasoning for this may not be apparent in the function-space viewpoint elaborated here, it becomes clearer in an equivalent basis-function or weight-space viewpoint [53] (providing a complete description of these two viewpoints of Gaussian Processes is beyond the scope and necessity of this paper; we urge the reader to refer to Section 2.2 of [53] and Section 6.4 of [65] for more details). In the weight-space viewpoint the latent function can be written as,
$f(x) = \mathbf{w}^{T} \boldsymbol{\varphi}(x)$  (2.18)
where φ(x) is a vector/set of basis functions and w is the corresponding vector of weights (for example, φ(x) = (1, x, x², …) is the vector of basis functions for polynomial regression; in fact, this makes GP a generalisation of polynomial regression methods). This coincides with the formalism presented here, as the covariance kernel can now be written as k(x, x̃) = φ(x)ᵀ Σ_w φ(x̃), which is equivalent to the convolution formulae, Eq. 2.12 and Eq. 2.11. Here Σ_w is the covariance matrix of the joint distribution on the weights, equivalently defined as a function of the hyperparameters, and the kernel itself is a kind of prior on the expected nature of the latent functions, which however includes infinite possibilities. A prior assumption of a zero mean is in fact equivalent to assuming the mean of the prior probability on the weights to be 0 in the weight-space view [65]; the probability distribution on w therefore characterises the distribution of f(x). A non-zero prior on the mean of the weights implies a non-minimal analysis, as a preferred functional form is a priori implied from amongst all the infinite possibilities (a non-zero constant prior mean of c on a 3-dimensional basis-function vector of this kind imposes a prior belief of the functional form c(1 + x + x²), which is clearly a non-minimal supposition). And so, a zero-mean prior is seemingly conservative over any other asymmetric prior when making predictions, unless accurate prior knowledge of the latent function is available or the motive is ulterior, such as a residual analysis. Also note that the mean of the process itself is not necessarily the mean of the data. However, if the dataset is stringent enough, different assumptions of the prior mean should not affect the reconstructions within the range of the data. Therefore, especially when we do not expect that the data can alleviate the effect of different prior mean assumptions, the most unbiased prior one can assume is a zero mean.
Needless to say, such a constant mean prior implies that the posteriors far outside the data range will invariably tend to 0, and varied assumptions of the mean could alter the results in the extrapolated regions beyond agreement. An asymmetric assumption of the prior mean might also tend to provide stricter, but biased, reconstructions of the derivatives, while the symmetric assumption provides fairly less stringent, and hence unbiased, conservative reconstructions.
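The reversion of the posterior to the prior mean outside the data range is easy to demonstrate; the sketch below (illustrative only) makes a constant prior mean explicit in the predictive equation:

```python
import numpy as np

def se(x1, x2, s, l):
    r = np.subtract.outer(x1, x2)
    return s**2 * np.exp(-r**2 / (2.0 * l**2))

def gp_mean(x, y, y_err, x_star, s, l, prior_mean=0.0):
    """Posterior mean with an explicit constant prior mean. Far from the data
    k(x*, x) -> 0 for |x* - x| >> l, so the prediction reverts to prior_mean."""
    K = se(x, x, s, l) + np.diag(y_err**2)
    Ks = se(x_star, x, s, l)
    return prior_mean + Ks @ np.linalg.solve(K, y - prior_mean)
```

Within the data range, different constant prior means give nearly identical reconstructions; at a target far outside the data, the prediction is entirely set by the assumed prior mean, illustrating the extrapolation caveat discussed above.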
3 Theory and Data
In this section we provide a brief introduction to the theoretical framework of the standard cosmological model, and then describe the data and the different combinations of them implemented with the MTGP method described in the previous section. In our current work we use the most recent "low-redshift" BAO, CC and SN datasets. We utilise the expansion rate measurements available from these data, which are however rescaled according to their corresponding observational methods. Finally, we also present a brief discussion of the criteria for kernel selection, to infer the best description of the data.
3.1 Theoretical framework
The first Friedmann equation with all the necessary degrees of freedom at "low redshifts" is given by,
$H^{2}(z) = H_{0}^{2}\left[\Omega_{m}(1+z)^{3} + \Omega_{k}(1+z)^{2} + \Omega_{\rm DE}\, f_{\rm DE}(z)\right]$  (3.1)
where H_0 is the present expansion rate, while Ω_m, Ω_DE and Ω_k are the dimensionless density parameters of matter, dark energy (DE) and curvature, respectively. The dynamics of the DE component are determined by f_DE(z), which is written as,
$f_{\rm DE}(z) = \exp\left[3 \int_{0}^{z} \dfrac{1 + w(z')}{1 + z'}\, \mathrm{d}z'\right]$  (3.2)
where w(z) is the equation of state (EoS) parameter of dark energy. The dimensionless density parameters are assumed to obey the cosmic sum rule Ω_m + Ω_DE + Ω_k = 1. Predominantly, in this work we assume the concordance ΛCDM model, with Ω_k = 0 and a constant DE equation of state parameter w = −1, for the diagnostics implemented here. Through the second Friedmann equation, one can assess the present accelerated state of the universe by estimating the deceleration parameter, which is written as,
$q(z) \equiv -\dfrac{\ddot{a}\, a}{\dot{a}^{2}} = \dfrac{(1+z)}{H(z)} \dfrac{\mathrm{d}H(z)}{\mathrm{d}z} - 1$  (3.3)
A negative value of the deceleration parameter implies an accelerated expansion, and vice versa. In addition, one can derive q(z) analytically in the standard ΛCDM scenario. The transition from decelerated to accelerated expansion is observed as a change in the sign of q(z), which is marked by the transition redshift z_t. Having significant constraints on q_0 and z_t remains crucial for any cosmological inferences in the current GP framework.
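For reference, in flat ΛCDM the deceleration parameter and the transition redshift have simple closed forms; the sketch below encodes them (flatness and the notation q_0 = q(z = 0) are assumptions of this illustration):

```python
def q_lcdm(z, om):
    """Deceleration parameter of Eq. (3.3) in flat LCDM (Omega_L = 1 - Omega_m):
    q(z) = [Omega_m (1+z)^3 / 2 - Omega_L] / E^2(z)."""
    e2 = om * (1 + z)**3 + (1 - om)
    return (0.5 * om * (1 + z)**3 - (1 - om)) / e2

def z_transition(om):
    """Redshift where q changes sign: Omega_m (1+z)^3 = 2 (1 - Omega_m)."""
    return (2 * (1 - om) / om)**(1.0 / 3.0) - 1.0
```

For example, Ω_m = 0.3 gives q_0 = −0.55 and z_t ≈ 0.67, illustrating the typical scales of these quantities.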
The sound horizon at the drag epoch is a relevant cosmological scale that is imprinted in the galaxy-clustering BAO data and is necessary to model these data. It is given as,

r_d = \int_{z_d}^{\infty} \frac{c_s(z)}{H(z)}\, dz ,    (3.4)

where c_s(z) is the sound speed and z_d is the redshift at the drag epoch. This physical scale is used for rescaling the expansion rate data coming from BAO observations.
In the context of the standard framework, we utilise the Om(z) diagnostic [32, 50],

Om(z) = \frac{E^2(z) - 1}{(1+z)^3 - 1} .    (3.5)

If the underlying expansion history is given by ΛCDM, Om(z) is essentially a constant and is equal to the matter density Ω_m. Therefore, any deviations from this can be used to infer a dynamical nature of dark energy or an intrinsic dynamical nature of matter itself. The derivative of Om(z) w.r.t. redshift provides information regarding possible variations in the DE equation of state. Note that this derivative utilises the information from both the E(z) and E'(z) reconstructions, which is the additional information inferred from the GP analysis. Therefore, it is more often susceptible to the prior assumption of the GP framework (i.e., the kernel) itself. Apart from these well-known diagnostics, we also implement a simple modification of Om(z) by considering the derivative of the first Friedmann equation (Eq. 3.1) in the ΛCDM scenario, which provides,
\widetilde{Om}(z) = \frac{1}{3(1+z)^2}\,\frac{dE^2(z)}{dz} .    (3.6)

It is straightforward that the reconstructed quantity in Eq. 3.6 should once again be equal to a constant (Ω_m) if the underlying cosmology is ΛCDM.
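As a sanity check of Eqs. 3.5 and 3.6, a minimal numerical sketch (assuming a fiducial flat ΛCDM E(z) with an illustrative Ω_m = 0.3 standing in for the GP reconstruction) shows both diagnostics returning a constant:

```python
import numpy as np

Om0 = 0.3   # illustrative matter density for the fiducial model

def E2(z):
    # flat LambdaCDM stand-in for the GP-reconstructed E^2(z)
    return Om0 * (1.0 + z)**3 + (1.0 - Om0)

def Om_diag(z):
    # Eq. 3.5: Om(z) = (E^2(z) - 1) / ((1+z)^3 - 1)
    return (E2(z) - 1.0) / ((1.0 + z)**3 - 1.0)

def Om_deriv_diag(z, dz=1e-5):
    # Eq. 3.6: d(E^2)/dz / (3 (1+z)^2), with a central finite difference
    dE2 = (E2(z + dz) - E2(z - dz)) / (2.0 * dz)
    return dE2 / (3.0 * (1.0 + z)**2)

z = np.linspace(0.1, 2.0, 5)
print(Om_diag(z))         # constant = Om0 for LambdaCDM
print(Om_deriv_diag(z))   # constant = Om0 for LambdaCDM
```

In an actual analysis E(z) and its derivative come from the GP posterior rather than a closed-form model, and any redshift dependence of either diagnostic signals a departure from ΛCDM.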
3.2 Data and Implementation
In this subsection we summarise the datasets implemented in the current analysis. We have collected datasets which provide information regarding the cosmic expansion history as a function of redshift. The straightforward relatedness of these datasets allows one to perform the MTGP learning and conduct a joint analysis. The most recent expansion rate data from SN observations play a crucial role in our analysis, as they provide a significant improvement at lower redshifts.
Baryon Acoustic Oscillations: In this work we utilise the observables (hereafter abbreviated as BAO) provided in [2, 66, 67, 68, 69, 70]. In [2], using the Sloan Digital Sky Survey (SDSS) III DR12, these observables were reported for the galaxy-clustering data set at three effective binned redshifts (hereafter 3z). From the same data set a tomographic analysis was also performed in [66, 67], providing observables at 9 binned redshifts (9z). Implementing both the 3z and 9z datasets, we find very consistent results within the current methodology, and we retain the 3z data set to quote the BAO contribution. More recently, [70] have reported 4 measurements of the same observables. In fact, these 4 measurements provide data in a much-needed redshift range that could play a very important role also in the context of model-independent analysis and reconstructions. We also utilise the “high-redshift” Lyman-α measurements of [69] and [68]. For all the datasets mentioned here we appropriately use the covariance matrices that have been provided (all values of the mean, dispersion and covariances of observables for the galaxy-clustering BAO data are taken from https://data.sdss.org/sas/dr12/boss/papers/clustering/). While the observations are at times presented relative to a fiducial sound horizon, we homogenise the dataset considering the appropriate fiducial value used in the respective works. The sound horizon r_d clearly being a physical constant set at a much higher redshift, the data are interpreted as the expansion rate multiplied by a physical constant.
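The homogenisation step can be illustrated with a toy example. All numbers below (the reported value, its error, and the survey fiducial r_d) are invented for illustration, and the convention assumed is that a survey quotes H(z) rescaled by r_d / r_d,fid:

```python
# Illustrative homogenisation of a BAO measurement: a survey quotes
# H(z) * (rd / rd_fid) in km/s/Mpc; multiplying by its fiducial rd_fid
# recovers the fiducial-independent product H(z) * rd. Invented numbers.
rd_fid = 147.78                  # Mpc, fiducial sound horizon of the survey
H_scaled, sigma = 98.0, 2.5      # km/s/Mpc, reported value and its error

H_rd = H_scaled * rd_fid         # km/s: expansion rate times a physical constant
sigma_H_rd = sigma * rd_fid
print(H_rd, sigma_H_rd)
```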
Table 1. Binned expansion rate data E(z) obtained by inverting the SN measurements of [4].

z      E(z)
0.07   0.997 ± 0.023
0.20   1.111 ± 0.021
0.35   1.127 ± 0.037
0.55   1.366 ± 0.062
0.9    1.524 ± 0.121
1.5    2.924 ± 0.675
Supernova data: We have implemented the most recent binned expansion rate data provided in [4], where 6 measurements (hereafter SN) with their covariances have been provided for flat cosmologies. We invert these data to obtain the E(z) points and the corresponding covariance. However, the highest-redshift point has been reported with an asymmetric uncertainty, which the inversion is unable to account for. We simply proceed to utilise the symmetric estimate obtained from the inversion, as this data point with its large uncertainty is not of key importance and carries little significance in the current joint-analysis MTGP formalism (we also verify that the reconstructions show very little to no change if this data point is omitted; [18] have commented along similar lines regarding the same point). The data implemented in this work are summarised in Table 1.
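The inversion with covariance propagation amounts to a first-order delta method; a sketch with illustrative numbers (not the actual values of [4]):

```python
import numpy as np

# Delta-method inversion of E^{-1}(z) measurements into E(z); the values
# below are invented placeholders, not the published data.
x = np.array([1.003, 0.901, 0.887])       # illustrative inverse-expansion points
C = np.diag([0.023, 0.021, 0.037])**2     # illustrative (diagonal) covariance

E = 1.0 / x
J = np.diag(-1.0 / x**2)                  # Jacobian of the map x -> 1/x
C_E = J @ C @ J.T                         # propagated covariance of E(z)
print(E)
print(np.sqrt(np.diag(C_E)))              # equals sigma_x / x^2 to first order
```

The same Jacobian transformation applies to a full (non-diagonal) covariance matrix; what it cannot capture is an asymmetric error bar, which is why the symmetrised estimate is used in the text.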
Table 2. Compilation of cosmic chronometer H(z) measurements, in km s⁻¹ Mpc⁻¹, obtained with the BC03 and MaStro stellar evolution models.

z        BC03             MaStro           Ref.
0.0708   69.0 ± 19.68     –                [73]
0.09     69.0 ± 12.0      –                [71]
0.12     68.6 ± 26.2      –                [73]
0.17     83.0 ± 8.0       –                [71]
0.1791   75.0 ± 4.0       81.0 ± 5.0       [51]
0.1993   75.0 ± 5.0       81.0 ± 6.0       [51]
0.2      72.9 ± 29.6      –                [73]
0.27     77.0 ± 14.0      –                [71]
0.28     88.8 ± 36.6      –                [73]
0.3519   83.0 ± 14.0      88.0 ± 16.0      [51]
0.3802   83.0 ± 13.5      89.3 ± 14.1      [40]
0.4      95.0 ± 17.0      –                [71]
0.4004   77.0 ± 10.2      82.8 ± 10.6      [40]
0.4247   87.1 ± 11.2      93.7 ± 11.7      [40]
0.4497   92.8 ± 12.9      99.7 ± 13.4      [40]
0.47     89.0 ± 34.0      –                [74]
0.4783   80.9 ± 9.0       86.6 ± 8.7       [40]
0.48     97.0 ± 62.0      –                [72]
0.5929   104.0 ± 13.0     110.0 ± 15.0     [51]
0.6797   92.0 ± 8.0       98.0 ± 10.0      [51]
0.7812   105.0 ± 12.0     88.0 ± 11.0      [51]
0.8754   125.0 ± 17.0     124.0 ± 17.0     [51]
0.88     90.0 ± 40.0      –                [72]
0.9      117.0 ± 23.0     –                [72]
1.037    154.0 ± 20.0     113.0 ± 15.0     [51]
1.3      168.0 ± 17.0     –                [72]
1.363    160.0 ± 33.6     160.0 ± 33.6     [52]
1.43     177.0 ± 18.0     –                [72]
1.53     140.0 ± 14.0     –                [72]
1.75     202.0 ± 40.0     –                [72]
1.965    186.5 ± 50.4     186.5 ± 50.4     [52]
Cosmic Chronometers: The measurements of the cosmic expansion rate have been estimated using the differential age method suggested in [39], which considers pairs of massive and passively evolving red galaxies at similar redshifts, deemed Standardisable Clocks. In this work, we utilise two different compilations of CC data, differentiated based on the stellar evolution models assumed for obtaining them. A dataset of 31 uncorrelated data points (CC) is compiled from [71, 72, 51, 40, 52, 73, 74] and is the most utilised compilation in the literature. Data points in this compilation are obtained utilising the BC03 models for stellar evolution (this BC03 compilation was implemented in several works such as [76, 29, 18], at times also including BAO measurements; in [51, 40, 52] the estimates from BC03 models were also accompanied by estimates obtained using MaStro models) and comprise a larger number of points, as it was the only method implemented in earlier works such as [71, 72, 73]. Alongside this standard dataset we compile a different set of data obtained using the MaStro stellar evolution models. This compilation consists of 15 points taken from [51, 40, 52]. Given the fewer number of data points, this compilation does not add significant weight in the standard model-dependent residual analysis. However, in a model-independent analysis, the estimate of H_0 as the intercept of the reconstruction can provide significant constraints on the mean with both datasets. Also, complementing with the other datasets in our analysis, we find suitable arguments to comment on the systematics within the CC dataset. The 15-point MaStro compilation is compared against the corresponding 15-point compilation obtained using the BC03 method. All the CC data available in the literature to date are summarised in Table 2.
Notice that the data at redshifts 1.363 and 1.965 are quoted as the same for both methods, as the contribution of the systematic error for the different choices of stellar evolution model is included in quadrature within the reported error. Repeating our analysis after decoupling this systematic error, we nevertheless find results compatible with the ones discussed here. Finally, we refer to the recent analysis of [75], which explored the dependence of these data on possible contamination due to a young underlying component, assessing that for the current measurements this effect has been confined by the selection criteria to well below the estimated errors.
All three datasets mentioned above are implemented as different tasks in the MTGP formalism. For this purpose, we start by rescaling/rewriting the data with arbitrary fiducial values (note the difference from the fiducial values assumed in the process of acquiring the measurements in various works; ours is simply a rescaling factor used only in our analysis, for convenience). In principle there is no need for this step and it does not affect the results in any way, but we do so to have all the datasets within similar orders of magnitude, which we also find convenient in the numerical computations. The SN data are not rescaled, but we do include a theoretical data point of E(z = 0) = 1. This assumption is justified as the results are essentially restricted to a comparison of flat cosmologies within the standard framework, and it also helps improve the features in the reconstructions. To summarise, we have three tasks: SN, BAO and CC. The CC dataset, however, is replaced by the different compilations mentioned above. We perform the MTGP on these three datasets considering several combinations of the kernels described in Section 2. Given the known relatedness of the data, i.e., all the datasets essentially provide the expansion history while being rescaled accordingly with physical constants (H_0, r_d), we do not immediately perform any inter-kernel analysis. We assume the same covariance function for all three tasks, however with their own hyperparameters and the appropriate cross-covariance. We will comment in detail on this assumption of the kernels and their implications in Section 4. Therefore, we finally have to optimise the log marginal likelihood for six hyperparameters in total, two for each task.
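The joint learning described above can be sketched with a toy numpy implementation. This is a hedged illustration, not the paper's code: the cross-covariance uses one standard construction for squared-exponential kernels (the convolution-process form), the redshift grids and the six (sigma, length-scale) hyperparameters are invented stand-ins, and in practice the hyperparameters would be optimised rather than fixed:

```python
import numpy as np

def se_cross(z1, z2, s1, l1, s2, l2):
    # cross-covariance between two SE tasks (convolution-process form);
    # for s1 == s2 and l1 == l2 it reduces to the usual SE kernel
    r2 = (z1[:, None] - z2[None, :])**2
    pref = s1 * s2 * np.sqrt(2.0 * l1 * l2 / (l1**2 + l2**2))
    return pref * np.exp(-r2 / (l1**2 + l2**2))

def joint_cov(z_tasks, hyp):
    # block covariance over all tasks; hyp is a list of (sigma, length) pairs
    return np.block([[se_cross(zi, zj, *hyp[i], *hyp[j])
                      for j, zj in enumerate(z_tasks)]
                     for i, zi in enumerate(z_tasks)])

def log_marginal_likelihood(y, K, noise_cov):
    L = np.linalg.cholesky(K + noise_cov)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2.0 * np.pi))

# three toy tasks standing in for the SN, BAO and CC redshift grids
z_tasks = [np.linspace(0.1, 1.5, 6), np.linspace(0.3, 2.4, 5),
           np.linspace(0.1, 2.0, 8)]
hyp = [(1.0, 0.8), (1.2, 0.9), (1.1, 1.0)]   # two hyperparameters per task
K = joint_cov(z_tasks, hyp)
y = np.random.default_rng(0).normal(size=K.shape[0])
lml = log_marginal_likelihood(y, K, 0.05 * np.eye(len(y)))
print(lml)
```

An optimiser would maximise `lml` over the six hyperparameters; the off-diagonal blocks of `K` are what let one dataset inform the reconstruction of another.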
3.3 Kernel selection
The Single-Task GP formalism so far implemented in the cosmological context has been deprived of a proper kernel selection criterion to evaluate the best possible model of the covariance. Given that the data are not very stringent, there was however no strong motivation to discuss this. As is also mentioned in [18], this dependence on the prior choice of the covariance function enforces a subjectivity, which needs to be properly assessed in order to select the kernel model that provides the best description of the data, and hence to make reliable predictions. As we anticipate that the MTGP formalism is capable of providing better reconstructions of the data, the need to discuss kernel selection becomes relevant. Note that kernel selection here should not be confused with model selection as discussed in the Gaussian Process literature. The standard idea of model selection in Gaussian Process regression (GPR) is to select the best possible explanation for the data given the assumed model. In a non-Bayesian approach, as implemented in this paper, the best explanation for the data is obtained by optimising the log marginal likelihood (equivalent to Eq. 2.6), where y are the observations and M_i is the assumed probabilistic model with parameters θ. In the Bayesian formalism the posterior distribution of the models is given according to Bayes’ rule as,
p(M_i \,|\, y) = \frac{p(y \,|\, M_i)\, p(M_i)}{p(y)} ,    (3.7)

where p(M_i) is the prior on the models. Here p(y | M_i) is the marginalised likelihood, which is also referred to as the evidence, and is given as,

p(y \,|\, M_i) = \int p(y \,|\, \theta, M_i)\, p(\theta \,|\, M_i)\, d\theta .    (3.8)
Therefore a comparison of the evidence for two given models M_i and M_j, with equal/uniform priors p(M_i) = p(M_j), can be performed using the Bayes factor,

B_{ij} = \frac{p(y \,|\, M_i)}{p(y \,|\, M_j)} = \frac{p(M_i \,|\, y)}{p(M_j \,|\, y)} ,    (3.9)

where p(M_i | y) is the posterior for the respective model M_i given observations y. The comparison of evidence can also be extended to the non-Bayesian formalism to select the model that performs better for the given data. However, this comparison is not necessarily suitable for performing a selection amongst different kernels with smooth data such as in the current cosmological context. While the comparison of the evidence for different kernels can be performed, with smooth datasets it is usually the case that one finds the simplest (in this case the SE) covariance function to be probabilistically best suited. To test the performance of an assumed kernel, one can appropriately split the available data into two separate categories, i.e., the training data and the test data. Learning is performed on the training dataset, and then the predictions based on the optimised model are compared against the test data by estimating either the standardised mean squared error or the negative log probability [53] (see also [60] for the examples demonstrated therein).
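The two test scores just mentioned can be written down compactly. This is a generic sketch with invented toy numbers, following the standard definitions of the standardised mean squared error (SMSE) and the mean negative log predictive density:

```python
import numpy as np

def smse(y_test, mu_pred, y_train):
    # squared test error normalised by the variance of the training targets
    return np.mean((y_test - mu_pred)**2) / np.var(y_train)

def nlpd(y_test, mu_pred, var_pred):
    # average negative log of the Gaussian predictive density at test points
    return np.mean(0.5 * np.log(2.0 * np.pi * var_pred)
                   + 0.5 * (y_test - mu_pred)**2 / var_pred)

y_train = np.array([70.0, 80.0, 95.0, 120.0])   # illustrative training targets
y_test = np.array([90.0, 105.0])                # illustrative held-out targets
mu = np.array([88.0, 108.0])                    # hypothetical predictive means
var = np.array([16.0, 25.0])                    # hypothetical predictive variances
s_val, n_val = smse(y_test, mu, y_train), nlpd(y_test, mu, var)
print(s_val, n_val)
```

Unlike the SMSE, the NLPD also penalises over- or under-confident predictive variances, which is why both scores are usually reported together.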
In this work we attempted to evaluate a better-suited kernel combination for the available cosmological data, but to no avail. We performed 5 different tests by selecting a fraction of the total data points as test data over different regions, to perform extrapolation (test data chosen towards one end of the dataset) and interpolation (test data chosen in the central regions of the dataset) tests. We find a more flexible kernel to be more suitable in 2 cases, while smoother kernels are best suited in the 3 other cases. We conclude that the current compilation of cosmological data is unable to provide a strict preference for a particular kernel combination. In the current work we implement all the kernels except the most flexible one, which we a priori deem not very suitable for the reconstructions of the derivatives, given its complexity (higher flexibility) to predict more features, its differentiability only to first order, and the fact that it could suffer from possible overfitting due to a higher cross-covariance compared to the other kernels. However, we do utilise it to compare estimates of cosmological parameters at particular target points. While the discussion presented here is minimal, a proper kernel selection criterion is necessary if one intends to draw more robust inferences, in a cosmology-independent way, from future data. Kernel selection is a much more intricate problem and needs dedicated attention [46], which we intend to leave for an extended future investigation.
However, given the nature of the cosmological data, we utilise a heuristic argument to select amongst the available kernel combinations. As we expect all the expansion rate data to share the same underlying latent function, a kernel combination that is capable of predicting this is clearly a better choice. Therefore, we infer that the kernel combination with the minimum amount of complexity required to predict equivalent latent functions provides the best description of the data. Increasing the flexibility of the kernels beyond this limit might very well provide the same latent functions across datasets, while being overfitted or predicting additional/unrealistic features. Finally, with these criteria we choose from amongst the smoother kernels to report suitable constraints and comment on the cosmological inferences, while all the kernel choices remain equally valid in the statistical context of kernel selection.
4 Results and Discussion
In this section we present the results obtained from the formalism described in Section 2, implemented with the data presented in Section 3. We present the results for the estimation of cosmological parameters and the reconstructions, and then proceed to report the inferred diagnostics and the corresponding improvement in comparison to previous works. Finally, we also comment on the issue of systematics within the CC data.
4.1 Reconstruction of data and cosmological parameters
We start by reproducing the existing results through individual GPs and then proceed to discuss our results from the MTGP method. In Figure 2 we show, as an example, the individual GP reconstructions of all three datasets considered in this work. The well-known reconstruction of the CC data is very similar to the one obtained with the SE kernel (see also [29]), which we reassert. Also, we find very little to no difference from one kernel to another, even for the SN and BAO datasets. The reconstruction of the SN data is clearly at a disadvantage at higher redshifts due to the very large uncertainty of the highest-redshift data point. The inclusion or exclusion of this data point makes very little difference, as was also mentioned in [18] (while they opt to omit this data point, we let it remain, with no dire consequences). Similarly, the BAO data are unable to show any features in the reconstructions due to their sparse distribution and the unavailability of data over part of the redshift range. Note that Matérn kernels, which are modelled to allow more dynamics/features, do not immediately fare well in the Single-Task GP. We do not find much difference, except for an increase in the predictive region for the more flexible kernels. In fact, their utility only becomes significant in our MTGP formalism or when the derivatives are inferred. The estimates for H_0 using several kernels from the reconstruction of the CC data alone are summarised in Table 3.
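For reference, the single-task reconstruction underlying Figure 2 can be sketched in a few lines of numpy. All data values and hyperparameters below are illustrative stand-ins (not the actual CC measurements), and a real analysis optimises the hyperparameters instead of fixing them:

```python
import numpy as np

def se_kernel(z1, z2, sigma_f=100.0, length=1.0):
    # squared-exponential covariance with fixed, illustrative hyperparameters
    return sigma_f**2 * np.exp(-(z1[:, None] - z2[None, :])**2
                               / (2.0 * length**2))

# illustrative H(z)-like points (NOT the actual CC data)
z = np.array([0.1, 0.4, 0.9, 1.4, 1.9])
y = np.array([72.0, 88.0, 118.0, 160.0, 190.0])
sy = np.array([8.0, 10.0, 15.0, 20.0, 30.0])

zs = np.linspace(0.0, 2.0, 21)              # prediction grid, includes z = 0
K = se_kernel(z, z) + np.diag(sy**2)        # data covariance plus noise
Ks = se_kernel(zs, z)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mu = Ks @ alpha                             # posterior mean
v = np.linalg.solve(L, Ks.T)
var = np.diag(se_kernel(zs, zs)) - np.sum(v**2, axis=0)   # posterior variance
print(mu[0], np.sqrt(var[0]))               # the intercept plays the role of H0
```

The intercept `mu[0]` with its uncertainty is the single-task analogue of the H_0 estimates quoted in Table 3; the MTGP replaces `K` with the joint multi-task covariance.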
Another important feature worth noticing is the lowering of the posterior mean in the SN and CC reconstructions in the left column of Figure 2. As mentioned earlier in Section 2, this feature could essentially be due to the lack of data in this region and the tendency of the reconstructed Gaussian posterior to converge, far away from the data, to the 0 (prior) mean with confidence regions given by the optimised hyperparameter. Clearly, this implies that the extrapolated reconstructions of individual GPs are less reliable for making inferences (although it can be noticed in Figure 2 that the CC reconstruction does not immediately start tending to the prior at low redshift). In the MTGP formalism, however, these three datasets are modelled to complement each other, thereby providing better reconstructions for all three of them (see Figure 3). This essentially implies that the overall range of available data is now the union of all three dataset ranges and, as expected, the effects of extrapolation in all three data planes are now mitigated within this joint redshift range.
As mentioned earlier (Section 3.2), we implement the same kernel across all the data planes considered, as one would expect the dynamical nature of the reconstructions for these three datasets to be essentially the same. However, this does not imply that we a priori assume the underlying latent functions of the three tasks to be unique, and hence one could find differences in the reconstructions from one plane to another (see Figure 4), as the log marginal likelihood (Eq. 2.6) is optimised with different length scale hyperparameters for each task. While it might appear inconsistent if the three data planes do not predict unique expansion histories, this freedom provides us with the criterion to evaluate the best kernel combination, namely the one that predicts equivalent latent functions, i.e., similar length scale hyperparameters. In fact, the equivalence of the posteriors validates that the observed features in the MTGP analysis are no longer due to extrapolation effects but due to the optimised data correlations. After obtaining equivalent predictions in all three planes, the results must be read as the final reconstruction of the CC dataset complemented by the BAO data at high redshift and the SN data at low redshift. Therefore we later proceed to study the dynamics based on the CC reconstruction, obtained from the MTGP of all three datasets, although the dynamics inferred from all three reconstructions will be the same. We test for a strict equivalence between the CC and BAO reconstructions, and expect the agreement amongst CC/BAO and SN to be weaker at higher redshifts (as the SN data completely lack information in this region). We infer H_0 as the intercept of the CC reconstruction, and use the CC and BAO reconstructions to obtain r_d as a rescaling constant at the intercept of the BAO reconstruction.
Dataset(s)  Kernel  H_0 [km s⁻¹ Mpc⁻¹]  r_d [Mpc]
CC 




CC+SN+BAO 




CC+SN+BAO+R18 



In this respect, we would also like to mention that a unique expansion history can be enforced by restricting to the same kernel across all the planes and then imposing that the length scale hyperparameters of the three planes be the same. This assumption in the MTGP formalism implies that the datasets are in fact rescaled by absolute constants with respect to the SN reconstruction, and is clearly not the right assumption to make a priori for the estimation of physical constants. This provision within the formalism becomes extremely useful for estimating the differences in the data once the requirement of a unique expansion history is enforced, as will be elaborated later (see Section 4.2). Reconstructions of the three datasets in the MTGP formalism are shown in Figure 3. One can immediately notice a significant improvement in the reconstruction of the individual datasets in comparison to the single-task GP shown in Figure 2. The Matérn class kernels with lower ν force the SN reconstruction to be more in agreement with the CC and BAO data reconstructions, which is a consequence of the higher cross-covariance between kernels at lower ν, as shown in the right panel of Figure 1.
Comments on H_0 estimates: The H_0 estimate has been discussed time and again in the context of Gaussian Processes, usually prompting values compatible with the “high-redshift” CMB estimate (P16). In Table 3 we show the improvement in the H_0 estimate and the corresponding uncertainty from the standard GP to the MTGP implemented here, with the addition of complementary datasets. In the MTGP formalism, with the inclusion of the BAO data alone alongside the CC dataset, there is only a modest improvement in the H_0 estimate, while the more important contribution is seen at high redshifts, where no CC data are available and hence the BAO data aid the CC reconstruction in this range. In fact, the uncertainty on H_0 from the CC + BAO datasets is of the order of that from the single-task GP conducted in [29], where BAO (5 points) data were rescaled with an assumed r_d and a Single-Task GP was performed on a total of 36 (31 CC + 5 BAO) data points. We also find a very good agreement between the predicted mean values of H_0 from this work and [29] (please see Table 2 therein). The SN data help to improve the CC reconstruction at lower redshifts and hence the H_0 estimate to a much higher significance. As shown in Figure 2, the SN data are at a severe disadvantage in the simple GP, while their reconstruction in the MTGP is improved considerably (see Figure 3). More recently, [18] have utilised the same SN and CC datasets to obtain H_0 in a modified GP formalism. We assert the consistency of our findings with the results quoted therein. Our formalism provides slightly more stringent constraints, while also allowing for a provision to include the BAO data.
As anticipated, we encounter a subtlety in choosing the best possible reconstruction (i.e., the choice of kernel) to quote the H_0 estimate. While it is appreciable that the constraint on H_0 gets tighter with the more flexible Matérn class kernels, this is compensated by the overall predictive quality of the reconstruction: the predictive region is overfitted at lower redshifts due to the higher cross-correlation with the SN data, while at higher redshifts the predictive quality deteriorates in comparison to the smoother kernels. Implementing the heuristic argument for equivalent latent functions introduced in Section 3.3 (also elaborated alongside the r_d estimates), we infer that the chosen kernel provides a better description of the data than the other kernel choices for the final constraints on all the cosmologically relevant quantities. Our best estimate of H_0 from the CC + SN + BAO datasets is in agreement with the “high-redshift” P16 value at 1.41σ and in lesser agreement with the local measurement of R18 at 2.60σ (our estimate is also consistent with several other previous estimates in [77, 78, 79, 29, 80, 18]). This remains a very consistent scenario, as we have also reported in [80] (see also references therein and [81] for a discussion of H_0 estimates from “low-redshift” data), even in a model-dependent analysis, especially if measurements obtained with other stellar population synthesis models are not considered, as we discuss in Section 4.2. Our estimate also agrees with several other low-redshift estimates obtained in [77]. In any case, we find that all the reconstructions assuming the SE or Matérn class kernels provide very consistent estimates of H_0.
Comments on r_d estimates: In Table 3 we also show the constraints on r_d obtained as the rescaling factor between the CC and BAO reconstructions. As can be seen in Figure 3, the formalism is able to recognise that the CC and BAO data are in fact better described by similar underlying latent functions and hence provides similar length scale hyperparameters. This implies that the two reconstructions are well correlated and hence give a better description of the data. The reconstructed regions in Figure 3 are obtained after including the covariance between all the datasets. Therefore, the r_d estimates are simply obtained as the ratios of the intercepts in the BAO and CC reconstruction planes, without the need to reconsider the covariance. A cross-check for the same would be to obtain a vanishing uncertainty when the covariance is counted twice in the error estimate. In fact, for certain kernels the MTGP is unable to provide a good equivalence between the latent functions of the BAO and CC reconstructions, which leads to a non-zero uncertainty when the covariance is considered twice.
We would also like to remind the reader that the estimates of r_d quoted in Table 3 are obtained as the intercept of the BAO data at z = 0. However, this constraint is expected to improve at the best-reconstructed redshift of the BAO/CC data, and using the CC + BAO + SN data we indeed find tighter r_d estimates there. Incidentally, all the smoother kernels predict the best reconstruction at similar redshifts, which is not surprising as the bulk of the BAO data is available in an intermediate redshift range. This trend is no longer the case when more flexible kernels are used, as they predict the best BAO reconstruction at lower redshift under the influence of the strong cross-covariance with the SN data. This, in fact, is a clear indication of overfitting, and these kernels are therefore not inferred to provide the best description of the data. Needless to say, the estimate of the mean would also vary slightly when obtained at different redshifts if the optimised underlying latent functions are not exactly the same, accounting for an additional systematic uncertainty added in quadrature. In the left panel of Figure 4 one can notice that for our preferred kernel the latent functions of CC and BAO are completely equivalent, which makes it a better choice than the smoother alternatives; at the same time, it remains better than the more flexible kernels, as they provide overfitted predictions (see right panel of Figure 4). All the above-presented arguments favour this kernel as the better choice to make predictions with the current data. Also in the left panel of Figure 4, it can be clearly seen how the uncertainty of the predictions follows the availability of data, providing better reconstructions around the best-sampled redshifts compared to their neighbouring regions.
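The ratio-of-intercepts estimate and its uncertainty can be sketched via first-order propagation with the cross-covariance counted once; all numbers below are invented stand-ins:

```python
import numpy as np

def ratio_sigma(b, sb, c, sc, cov_bc):
    # first-order error propagation for R = b / c with correlation cov_bc
    R = b / c
    var = R**2 * (sb**2 / b**2 + sc**2 / c**2 - 2.0 * cov_bc / (b * c))
    return R, np.sqrt(var)

b, sb = 10150.0, 120.0        # e.g. a BAO-plane intercept (H0 * rd, km/s)
c, sc = 69.0, 0.8             # e.g. a CC-plane intercept (H0, km/s/Mpc)
cov_bc = 0.95 * sb * sc       # strong cross-covariance from the joint GP
R, sR = ratio_sigma(b, sb, c, sc, cov_bc)
print(R, sR)                  # rd-like ratio in Mpc and its error
```

Subtracting the covariance term a second time would drive the variance towards zero for fully equivalent latent functions, which is the cross-check described in the text.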
Finally, we also include the R18 estimate of [45] in the joint analysis and, as expected, find an improvement in the constraints on H_0. We find extremely consistent final estimates of H_0 with all the covariance functions utilised in this work. In fact, this high consistency discourages us from making any inference about the best-suited kernel. However, we do utilise a single kernel to quote the results throughout the discussion, without loss of reasoning. The estimated H_0 and r_d values are summarised in Table 3. As expected, with the addition of R18 to the CC + SN + BAO data the errors on the parameters decrease. Given the increase in the H_0 values, we find a consistent decrease in the r_d estimates, while retaining very similar reconstructions with all the kernels. Clearly, the value of the product H_0 r_d is conserved, while the estimate of r_d gets rescaled to accommodate the increase in the value of H_0. With and without the inclusion of R18, our estimates of this product remain consistent with the “high-redshift” CMB estimate [82]. Given the agreement in H_0 r_d, the discrepancy remains confined to the breaking of the degeneracy between these two parameters. However, we extend this discussion to a comparison of r_d estimates in Section 4.2.
CC  SN  BAO  H_0 [km s⁻¹ Mpc⁻¹]
Illustration of the Multi-Kernel Gaussian Process: So far we have performed the analysis assuming the same kernel for all the datasets. We now extend the analysis, using the provision within the current formalism to utilise different kernels for the different datasets, and comment on its implications. As anticipated, we find that the initial assumption of one kernel for all the datasets remains the best as we implement several combinations; for instance, assigning different kernels to the SN, BAO and CC tasks fares poorly compared to assuming a single kernel for all datasets. In Table 4 we show a few combinations of kernels assumed in the joint analysis and the corresponding H_0 estimates obtained. One can clearly notice that the assumption of different kernels does not improve the results. This can be easily explained based on the strength of the cross-covariance obtained for a specific combination of kernels and the nature of the data. In the Multi-Kernel formalism, different kernel combinations yield different strengths of cross-covariance for nearby versus distant data points. As described earlier, the SN data complement the CC data essentially at lower redshifts, hence giving rise to a larger dispersion at the intercept of the CC reconstruction for kernel combinations with lower cross-covariance. Although this provision is an added advantage of the current formalism, given the nature of the data it does not present any immediate application in the current work. However, it could be of interest in several other applications, such as [83, 84, 85], presenting which is beyond the scope of this paper.
4.2 Reconstruction of Diagnostics
In this section we present our results for the reconstructions of the diagnostics inferred from the MTGP analysis described in Section 3. We find very significant improvements in the reconstructed regions, moving from the single GP performed in [31, 25] to the MTGP formalism implemented here. In Table 5 we present the estimates of q_0 and the transition redshift z_t, corresponding to the joint analyses presented in Table 3. We find very consistent results for the estimates of q_0 and z_t with all the kernels implemented here. However, one can notice that the error on the estimated q_0 increases with the flexibility of the assumed kernel, while z_t is constrained mildly better by the smoother kernels. As expected, the most flexible kernel yields deteriorated constraints, reassuring us that it is not suitable for reconstructions of the derivatives. We would like to note that such a kernel might retain its importance for reconstructing other cosmological features, such as a dynamical dark energy equation of state (EoS) w(z). A very similar trend is observed when R18 is included as data in the reconstruction. Consistently, the inclusion of R18 shifts the estimates of q_0 and z_t in accordance with the higher H_0. It is also the case that the inclusion of R18 does not improve the constraints on q_0 and z_t, as can be seen in Table 5.
Dataset(s)  Kernel  q_0  z_t
CC 




CC+SN+BAO 


