Application of Bayesian model inadequacy criterion for multiple data sets to radial velocity models of exoplanet systems

# Application of Bayesian model inadequacy criterion for multiple data sets to radial velocity models of exoplanet systems

Mikko Tuomi University of Hertfordshire, Centre for Astrophysics Research, Science and Technology Research Institute, College Lane, AL10 9AB, Hatfield, UK University of Turku, Tuorla Observatory, Department of Physics and Astronomy, Väisäläntie 20, FI-21500, Piikkiö, FinlandThe corresponding author,    David Pinfield University of Hertfordshire, Centre for Astrophysics Research, Science and Technology Research Institute, College Lane, AL10 9AB, Hatfield, UK    Hugh R. A. Jones University of Hertfordshire, Centre for Astrophysics Research, Science and Technology Research Institute, College Lane, AL10 9AB, Hatfield, UK
###### Key Words.:
Methods: Statistical – Techniques: Radial velocities – Stars: Individual: GJ 581, HD 217107, Andromedae

We present a simple mathematical criterion for determining whether a given statistical model does not describe several independent sets of measurements, or data modes, adequately. We derive this criterion for two data sets and generalise it to several sets by using the Bayesian updating of the posterior probability density. To demonstrate the usage of the criterion, we apply it to observations of exoplanet host stars by re-analysing the radial velocities of HD 217107, Gliese 581, and Andromedae and show that the currently used models are not necessarily adequate in describing the properties of these measurements. We show that while the two data sets of Gliese 581 can be modelled reasonably well, the noise model of HD 217107 needs to be revised. We also reveal some biases in the radial velocities of Andromedae and report updated orbital parameters for the recently proposed 4-planet model. Because of the generality of our criterion, no assumptions are needed on the nature of the measurements, models, or model parameters. The method we propose can be applied to any astronomical problems, as well as outside the field of astronomy, because it is a simple consequence of the Bayes’ rule of conditional probabilities.

## 1 Introduction

Since the discovery of the first clear-cut example of an extrasolar planet orbiting a normal star (Mayor & Queloz, 1995), Doppler-spectroscopy, or radial velocity (RV), has been the most efficient method in detecting extrasolar planets orbiting nearby stars111See The Extrasolar Planets Encyclopaedia for an up-to-date list of known planetary candidates: http://exoplanet.eu/..

Because the same nearby stars can be targets of several RV surveys, there is the possibility to combine the information of two or more RV data sets using the means of Bayesian inference (e.g. Gregory, 2011; Tuomi, 2011) and posterior updating. However, little is known about the possible biases individual data sets, or RV timeseries, may contain with respect to one another. Therefore, we use Bayesian tools in determining whether the common statistical models can be used to analyse RV timeseries without bias, and if not, how these models can be improved to receive trustworthy results. For these purposes we introduce a method for determining model inadequacy in describing multiple sets of measurements – the Bayesian model inadequacy criterion.

The Bayes’ rule leads naturally to the commonly used Bayesian model comparison methods (e.g. Jeffreys, 1961). These methods can be used efficiently to compare the relative performance of different statistical models of some a priori selected model set. The Bayes’ rule can be used to calculate the relative posterior probabilities of the models in the set given some measurements that describe some aspect of the modelled system. However, because only the relative performances of the models can be compared, it cannot be said whether the model with the greatest posterior probability is adequately accurate in describing the measured quantities.

The Bayes’ factors (Kass & Raftery, 1995; Chib & Jeliazkov, 2001; Ford & Gregory, 2007) and other related measures of model goodness, such as the various information criteria (e.g. Akaike, 1973; Schwarz, 1978; Spiegelhalter et al., 2002) derived using different approximations, can only be used to tell which one of the models in some model set describes the measurements the best – i.e. the relative “goodness“ of the models can be determined reliably. However, they cannot be used to assess whether this best model is as accurate description as possible given the information in the measurements. Our method of determining model inadequacy in this sense can be used to assess whether the model set can be estimated to contain a sufficiently accurate model that can be used to describe the measurements reliably.

Whether a given statistical model can be used to describe several sets of data in an adequate manner or not, has not been studied very extensively in the statistics literature. In Kaasalainen (2011), the author presents a method for determining the optimal combination of two or more sources of data, or data modes. However, we are not aware of a single study discussing this problem in the Bayesian context, though Spiegelhalter et al. (2002) appear to discuss the ”model adequacy” in their article introducing the deviance information criterion, but they use the term interchangeably with the term ”fit”. Yet, determining whether a single model can describe two or more data sets without bias is of increasing importance in astronomy, particularly for indirect detections of the most interesting exoplanets whose signals lie close to the current limits of instrument sensitivity.

Since the RV variations of typical targets of Doppler-spectroscopy surveys are commonly modelled using a superposition of Keplerian signals, reference velocities, and possible linear trends, corrupted by some Gaussian noise, we use these models as a starting point of our analyses. However, we emphasise the fact that the RV variations caused by the stellar surface, usually referred to as the stellar jitter, are in general, despite some efforts in modelling their magnitude (Wright, 2005), foreseen as arising from dark or bright spots primarily driven by stellar rotation (Barnes et al., 2011; Boisse et al., 2011) and their effect on the RV’s is not understood very well at the moment. Therefore, we model the excess noise in the RV’s with care and show explicitly the statistical models we use in the analyses.

In section 2 we describe what we mean by the model inadequacy in describing two or more independent measurements or sets of measurements and provide a simple way of determining it in practice. We describe the details of our model inadequacy criterion in the Appendix. Finally, in section 3 we apply this criterion in practice by analysing astronomical RV exoplanet detections made using at least two different telescope-instrument combinations.

## 2 Bayesian analyses and model inadequacy

The Bayesian methods do not differentiate between determining the most probable parameter values or most probable models containing these parameters. They can all be arranged into a linear order, which yields information on the observed system if only the selected models describe the observed system realistically enough. It is possible to calculate the relative posterior probabilities of any number of models and determine their relative magnitudes in a similar way as it is possible to determine the posterior odds of having the measurements drawn from a probability density characterised by a certain parameter value of any one of the models. We do not describe the process of determining the posterior probability densities of the model parameters here, because several well-known posterior sampling methods exist and they have been well covered by the existing literature (e.g. Metropolis et al., 1953; Hastings, 1970; Geman & Geman, 1984; Haario et al., 2001). The performance of these methods has also been demonstrated by several re-analyses of existing RV data, revealing the existence of planets (e.g. Gregory, 2005, 2007a, 2007b; Tuomi & Kotiranta, 2009) or disputing it (e.g. Tuomi, 2011). In these works, the model probabilities have played an important role in assessing the number of planetary companions orbiting nearby stars.

Commonly, the Bayesian tools are used to assess the probabilities of different statistical models given the measurements that are being analysed using the models. These tools provide the relative probabilities of the selected models in the a priori determined model set as

 P(Mi|m)=P(m|Mi)P(Mi)∑kj=1P(m|Mj)P(Mj), (1)

where probabilities are the prior probabilities of the different models and the marginal likelihoods are defined as

 P(m|Mi)=∫θi∈Θil(m|θi)π(θi)dθi, (2)

where is the prior probability density of the parameter or parameter vector of the model and represents the likelihood function corresponding to the model.

The interpretation of the posterior probabilities in Eq. (1) is a rather subjective matter because they are relative and it is only possible to assess how much confidence one has in one of the models compared to the rest of them. According to the views of Jeffreys (1961); Kass & Raftery (1995), a model would have to be at least 150 times more probable than the next best model to have strong evidence in favour of it. We adopt the same threshold because claiming that there are planets orbiting a star instead of needs to be on a solid ground with respect to the model probabilities. Especially, if the model with planets was e.g. 50 times more probable than that with planets, there would still be a roughly 2% possibility that the planet model explains the data. Therefore, we choose a rather high threshold when interpreting the posterior probabilities of models with different numbers of Keplerian signals.

With the marginal likelihoods available according to the Eq. 2, we define the model to be an inadequate description of independent measurements , if it holds that for some small positive number

 B(m1,...,mN|M):=P(m1,...,mN|M)∏Ni=1P(mi|M)

This definition is based on the independence of the measurements and that they are being modelled with a single statistical model. It is a simple result of a relation of the marginal likelihoods of each of the measurement and the joint marginal likelihood of all of them shown in Eq. (13). We derive this criterion using the Bayes’ rule of conditional probabilities and the concept of independence, and also interpret the results in terms of information theory in the Appendix.

The number has an interpretation as a threshold value. For instance, the model being inadequate with probabilities 90%, 95%, and 99% corresponds to threshold values of 0.111, 0.053, and 0.010, respectively (see Appendix). Therefore, if the best model according to Eq. (1) satisfies Eq. (3) for some reasonably small , it can be concluded that the model does not describe the measurements without bias and the corresponding analysis results may be biased as well. In such a case, the model set has to be re-considered and expanded by adding better descriptions of the data to it. In practice, we use the 95% threshold value, but choosing its value is a subjective issue and only represents how confidently one wants to determine the model inadequacy.

We note that the model inadequacy can also be interpreted in terms of the measurements being inconsistent with one another with respect to the model used. This interpretation arises from the fact that the model may not take into account some features in one or more data sets that result from biases in the process of making the measurements or from some other unmodelled features in the data. We use the inadequacy of the model given the data sets and the inconsistency of the data sets with respect to this model interchangeably throughout this article.

We describe the parameter probability densities using three numbers. These numbers are the maximum a posteriori (MAP) estimate of the posterior density and the limits of the 99% Bayesian credibility set as defined in e.g. Tuomi & Kotiranta (2009). We calculate these estimates from the posterior densities of the model parameters received using the adaptive Metropolis posterior sampling algorithm (Haario et al., 2001), which is a modification of the famous Metropolis-Hastings (M-H) algorithm (Metropolis et al., 1953; Hastings, 1970) that adapts the proposal density to the shape of the posterior density of the model parameters. Because of this property, it is not very sensitive to the choise of initial parameter vector nor proposal density – desired features that make the method significantly more robust than the common M-H algorithm by enabling a more rapid convergence to the posterior.

While the adaptive Metropolis algorithm assumes a Gaussian proposal density, it adapts to the posterior reasonably rapidly and a samples of roughly are sufficient for the chain burn-in period in all the analyses, i.e. until the chain converges to the posterior. Because of the Gaussian posterior, the acceptance rate of the chain can sometimes decrease to as low values as 1% when the posterior density is highly nonlinear, as is comonly the case with RV data. However, in such cases, we simply increased the chain length by a factor of 10-20 and saved computer memory by only saving every 10th or 20th member of the chain to the output file. We verified that the chain had indeed converged by running up to five samplings with different initial values and required that they all produced marginal integrals that were equal up to the second digit. With a converged chain, we then calculated the marginal likelihoods using the method of Chib & Jeliazkov (2001).

For the sake of trustworthiness, throughout this article we also take into account the uncertainties in the stellar masses when calculating the semi-major axes and RV masses of the planets orbiting them. These uncertainties are taken into account by using a direct Monte Carlo simulation – i.e. by drawing random values from both the density of the model parameters and the estimated density of the stellar mass when calculating the densities of the semi-major axes and planetary RV masses. We assume that the estimated distribution of the stellar mass, usually reported using mean and stardard error, is independent of the densities of the orbital parameters from the posterior samplings.

## 3 Model inadequacy criterion and exoplanet detections

In principle, analysing RV data is reasonably simple because the planet induced stellar wobble can be modelled using the well-known Newtonian laws of motion – especially if the gravitational planet-planet interactions are not significant in the timescale of the observations and post-Newtonian effects are negligible. In practice, though, there are several aspects of the RV measurements that are not understood well enough to be able to consider that the models describe the measurements in an adequate manner. These aspects include e.g. disturbances caused by undetected planets or planets whose orbital periods cannot be constrained (e.g. Ford & Gregory, 2007); noise caused by the inhomogeneities in the stellar surface, usually referred to as the stellar ”jitter” (e.g. Wright, 2005); and excess noise and possible biases that are particular to the various instruments and telescopes used to make the observations. All these aspects make the analyses of RV’s challenging and if not accounted for properly by the statistical models used, can lead to biased results and misleading interpretations.

In this section we re-analyse three RV data sets made using at least two telescope-instrument combinations. Assuming these sets are independent – which is a common assumption, though not explicitly stated most of the time when analysing several sets of measurements – we apply the model inadequacy criterion to find out if the common models should be modified and if the corresponding results are different from the ones found in the literature.

### 3.1 Hd 217107

The RV’s of HD 217107 are known to contain the signatures of two extrasolar planets (Fischer et al., 1999, 2001; Vogt et al., 2000; Naef et al., 2001; Vogt et al., 2005; Wittenmyer et al., 2007; Wright et al., 2009). The system consists of a massive short-period planet with an orbital period of roughly 7 days, and an outer long-period planet with an orbital period of 11 years. The RV’s of this target have been observed using 4 instruments mounted on 5 telescopes, namely, Euler (Naef et al., 2001), Harlam J. Smith (HJS) (Wittenmyer et al., 2007), Keck I (Wright et al., 2009), and Shane and Coude Auxiliary Telescope (CAT) at the Lick Observatory (Wright et al., 2009). Together, there are 293 RV measurements of this system.

The most up-to-date solution is that of Wright et al. (2009), where the combined Keck and Lick data with 207 measurements was analysed. However, the authors do not discuss the exact statistical model used in their analyses and therefore we feel that this combined data set should be re-analysed to see how the four data sets should be modelled to receive the most trustworthy results.

Following the common Bayesian approach (e.g. Gregory, 2005, 2007a, 2007b; Tuomi & Kotiranta, 2009; Tuomi, 2011), we choose our model set to consist of four models, namely, models , where denotes the number of planetary signals in the data. Therefore, there are parameters in our models corresponding to 5 parameters for each planet – RV amplitude , orbital eccentricity , orbital period , longitude of pericentre , and mean anomaly , i.e. the date of periastron passage as expressed in radians between 0 and 2 – four parameters decribing the reference velocities of each data set , and the parameter describing the magnitude of stellar jitter . Our set of statistical models describing the measurement made at time is

 rk(ti)+γl+ϵi+ϵJ,k=0,...,3, (4)

where represents the Keplerian signals and and are Gaussian random variables with zero mean and known variance and an unknown variance , respectively. The variance corresponds to the instrument uncertainty of each individual measurement, which is usually assumed known and is reported together with the data.

We analyse the combined data set using the models and receive the model probabilities in Table 1. These probabilities imply that there are two companions orbiting the star with high confidence. However, the Bayes factor determining the inadequacy of the best model in the model set has to be calculated to assess the reliability of this model. Denoting the four data sets as , we receive , which means that the model is an inadequate description of the data with a probability of 0.95. This implies, that the model set does not contain a sufficiently good model, i.e. the data sets are not consistent with one another given this model, and needs to be expanded.

Because of the inadequacy of model , we no longer assume that the instrument noise is known according to the variances but suspect that there could be unknown random variations or biases that differ between the data sets. Therefore, we expand our model set by models

 rk(ti)+γl+ϵi+ϵI,l,k=1,..,3, (5)

where the Gaussian random variable is different for every data set and is assumed to consist of additional random variation caused by the instrument noise and stellar jitter. Therefore, in this model, the resulting values can only be interpreted as giving the upper limit for the stellar jitter. We denote these models as .

Using the expanded model set, we receive the model probabilities in Table 2. These probabilities imply that there are indeed differences in the noise levels of the different data sets and that these differences have to be taken into account when assessing the orbital parameters of the planets. We calculate the model inadequacy Bayes factor for the best model . This time , which corresponds to an inadequacy probability of , a value that clearly states the best model cannot be considered inadequate.

We have listed the solution of the model with the greatest posterior probability, , in Table 3. While consistent with the results of Wright et al. (2009), our solution with the best model has much more uncertain parameter values, especially for the period, RV mass, and RV amplitude of the outer companion, which is also found heavily correlated with the reference velocity parameters. We show the 99%, 95%, and 50% equiprobability contours of RV mass and period of the outer companion in Fig. 1 (the gap in the 50% contours arises from the numerical inaccuracy of the plot). This Fig. is similar to the Fig. 8 in Wright et al. (2009), but they used the density for the plot instead of posterior density. Also, we note that the jitter of HD 217107 has a level of at most 6.0 ms based on the noise in the Euler data, which turned out to contain the least noise out of the four data sets. It is also interesting to see that the Lick data had therefore at least 5 ms, but possibly even more than 10.0 ms, additional uncertainty that can only be caused by the telescopes and the instrument. Therefore, it cannot be said that the Lick instrument uncertainty is known according to the standard uncertainties of the data reduction pipeline, as reported when publishing Lick RV’s. This could in fact be one of the reasons the parameter values in our solution (Table 3) appear to be more uncertain than those reported by Wright et al. (2009), though they do not indicate the confidence-level of the reported uncertainties.

### 3.2 Gliese 581

The Gliese 581 planetary system has been claimed to be a host to as many as six relatively low-mass planets (Bonfils et al., 2005; Udry et al., 2007; Mayor et al., 2009; Vogt et al., 2010). Though the most likely number of planetary companions in the system is four (Tuomi, 2011) or five (Gregory, 2011), the RV’s of Gliese 581 provide a challenging analysis problem because the signals are only barely distinguishable from the relatively noisy measurements.

We start by analysing the combined data set of HARPS and HIRES RV measurements (see e.g. Vogt et al., 2010; Gregory, 2011; Tuomi, 2011) using the models and with . We choose this model set because we already suspect, based on the analysis of the RV’s of HD 217107, that this combined data set may have different noise levels corresponding to the different telescope-instrument combinations.

The posterior probabilities of the models in our model set are shown in Table 4. These probabilities, while having the greatest value for model , do not support the conclusion that there are five Keplerian signals in the data strongly enough because the probability of model is highly significant. Therefore, we check the inadequacy of the latter model to see if our statistical model is good enough.

The Bayes factor in Eq. (3) has a value of for the four-planet model , which means that the probability of the HIRES and HARPS data sets being inadequately described by the model is , a value low enough to conclude that there is no need to revise the model. We note that this model, an order of magnitude more probable than the previously used model (Tuomi, 2011), does not result in a revision of the orbital parameters (Table 5). However, the noise parameters of the two data sets do differ from one another slightly. Denoting the HIRES data set with and the HARPS data set with , the parameters , have MAP estimates of 2.39 and 1.50 ms, respectively. The corresponding 99% credibility sets are [1.77, 3.09] and [1.00, 2.01] ms, respectively. Therefore, the noise in the HARPS measurements gives an upper limit for the jitter of Gliese 581 of 2.01 ms, whereas there is likely a small amount of additional instrument noise in the HIRES data.

### 3.3 υ Andromedae

The RV’s of And have shown three strong Keplerian signals resulting from three massive planets orbiting the star (Butler et al., 1997, 1999; Fischer et al., 2003; Naef et al., 2004; Wittenmyer et al., 2007; Wright et al., 2009). The star has been a target of five RV surveys for several years, namely, Lick (Butler et al., 1999; Fischer et al., 2003; Wright et al., 2009), the Advanced Fiber-Optic Echelle spectrometer (AFOE) at the Whipple Observatory (Butler et al., 1999), HJS (Wittenmyer et al., 2007), ELODIE at the Haute-Provence Observatory (Naef et al., 2004), and the Hobby-Eberly Telescope (HET) (McArthur et al., 2010). Recently, the combined data of Lick (Fischer et al., 2003; Wright et al., 2009) and ELODIE (Naef et al., 2004) has been reported to contain a fourth planetary signal (Curiel et al., 2011).

We re-analyse the combined RV data of And by using the model inadequacy criterion. However, before we start, we check the consistency of the 248 Lick RV’s published in Fischer et al. (2003) and the 284 Lick RV’s published in Wright et al. (2009) (we denote these data sets as Lick1 and Lick2, respectively), because Curiel et al. (2011) used Lick2 data and the additional 30 RV points from Lick1 that were not included in Lick2. The fact that these 30 measurements were not included in Lick2 likely because of suspected biases or calibration errors suggests that there could be some biases within the combined Lick data analysed in Curiel et al. (2011) as well.

The Lick1 and Lick2 data sets appear to have one striking difference. While they both imply that there are indeed four Keplerian signals in the And RV’s, as concluded by Curiel et al. (2011), they do not agree on the orbital period of the proposed fourth signal. The probability of the three companion model is significantly lower than that of the four planet model – and times lower for Lick1 and Lick2, respectively. This implies that there is either a fourth Keplerian signal in the data or biases that mimic Keplerian periodicity. The MAP estimate and the corresponding set of the period of this fourth signal is 3120 [2560, 3940] days for Lick1 data and 3860 [3180, 5160] for Lick2 data. The latter of these estimates appears to be very close to the estimate of Curiel et al. (2011) of 3848.860.74 days. However, because of the difference of more than 700 days between the MAP estimates of the periods from Lick1 and Lick2, we cannot conclude, based on the Lick data alone, that there are indeed four Keplerian signals in the data. This inconsistency is seen the most clearly when looking at the equiprobability contours of the parameter posterior densities given each data set. The contours containing 50%, 95%, and 99% of the density are shown in Fig. 2 for the period and amplitude parameters of And d (top) and the proposed And e (bottom). The Lick1 contours are shown in red and Lick2 contours in blue. As seen in this Fig., the estimated period and amplitude of the And d differ also between the two Lick data sets.

Because of the inconsistency of the Lick data sets published in Fischer et al. (2003) and Wright et al. (2009), we use the model inadequacy criterion to find out if either of these two data sets is also inconsistent with the combined ELODIE, AFOE, HET, and HJS data. We denote this combined data as and use and to denote the Lick1 and Lick2 data, respectively, and calculate the Bayes factors and for the model . The logarithms of these factors are 4.01 and -10.20, respectively (Table 6). This implies that the Lick2 data set is inconsistent with the rest of the data and the 4-companion model is an inadequate description with a probability of more than 0.999, whereas the Lick1 data cannot be shown inconsistent with the rest of the data with a probability exceeding 5%. Therefore, it appears that Lick1 data (Fischer et al., 2003) is consistent with the other four data sets but the Lick2 data (Wright et al., 2009) is not.

We also investigated whether some of the ELODIE, AFOE, HET, HJS, and Lick data sets were inconsistent with the rest of the data by calculating the Bayes factors , where , refers to each of these sets, respectively, and contains all the data except the set . We performed these calculations using both Lick1 data and Lick2 data. The probabilities of the model being inadequate in describing each of these sets with respect the the rest of the data are shown in Table 6.

The results in Table 6 show that while Lick2 data is inconsistent with the rest of the measurements with respect to the model , the AFOE data is also inconsistent with the rest of the measurements regardless of using the Lick1 or Lick2 data among the others in the analyses. We also note that the same inconsistency remains for the AFOE data when using the three-companion model in the analyses. Therefore, as also noted by Curiel et al. (2011), we conclude that the AFOE data has additional biases and should not be used together with the rest of the data because the results would be prone to biases as well. To further demonstrate the inconsistency of the AFOE data and the other data sets, we show the RV residuals of the AFOE data when the three-companion model has been used to analyse the combined data of AFOE, Lick1, ELODIE, HET, and HJS (Fig. 3). These residuals appear to show a low-amplitude periodicity that roughly corresponds to the period of companion d, despite the fact that the signal of this companion (and those of b and c) has been subtracted.

We continue the analyses of And RV’s by neglecting the AFOE data and by using the older Lick1 data set (Fischer et al., 2003), because of the inconsistencies of the AFOE and Lick2 data with the rest of the data sets. The combined data set consists of Lick1, HET, ELODIE, and HJS data that contain 248, 79, 71, and 41 measurements, respectively. This combined data set with 439 measurements was analysed using two models, namely, and , because there are clearly three strong Keplerian signals in the data as demonstrated already by Butler et al. (1999), and because the noise levels of the different data sets likely differ from one another based on the previous analyses.

Since we removed the AFOE data from the analyses, we need to assess whether the resulting restricted data set can be shown inadequate or not. For this purpose, we re-calculate the values in Table 6 and show them in Table 7. According to these results, none of the four data sets can be said to conflict with the others. Also, using the Bayesian model inadequacy for multiple data sets by calculating , where , correspond to Lick1, ELODIE, HJS, and HET data sets, respectively, we receive a value of 1.4, which means that these sets are inconsistent with a probability of less than given the four-companion model. Therefore, these four sets can be combined reliably and we calculate our final solution of And RV’s using these four sets.

The posterior probability of the model is less than of the probability of model . This implies that there are indeed four periodic signals in the combined data set. The revised orbital parameters with respect to the four-companions model are shown in Table 8. The RV variations corresponding to the longest periodicity in the data are shown in Fig. 4 together with the fitted Keplerian signal. The signals of the three inner companions have been subtracted from the residuals in Fig. 4.

When comparing the orbital parameters of our solution in Table 8 with the solution of Curiel et al. (2011), it can be seen that the period of the And e is significantly lower in our solution. We received a MAP estimate for the orbital period of 2860 days (), whereas Curiel et al. (2011) reported a period of 3848.860.74 days. This difference can arise from the fact that they used the more recent Lick2 data set of Wright et al. (2009), which is not consistent with the other RV’s according to our analyses. We also found another solution for the period of And e. This period is 5750 days (), roughly twice the MAP periodicity, but its posterior probability is more than a thousand times lower than that of the solution in Table 8.

We note that while Curiel et al. (2011) adopted a jitter of 10 ms when analysing the RV’s of And, the estimate of Butler et al. (2006) is only 4.2 ms. Our results are consistent with the latter estimate because the upper limit of excess noise, including the stellar jitter, is 4.58 ms based on the lowest noise level in the data sets of the HET data (Table 8). According to our results, the jitter has likely an even lower value of roughly 2.0 ms. This also implies that the Lick1, ELODIE, and HJS data contain an additional source of RV variations – likely the telescope-instrument combination used to measure these data.

## 4 Discussion

We have proposed a simple method for assessing whether a statistical model is an inadequate description of multiple independent data sets. This method is simply an application of the well known Bayesian model selection theory and the law of conditional probability but it also differs from the common model selection approach because it provides the means of determining whether a single model, i.e. the best model in the a priori selected model set, is not an adequate description of the data sets and needs to be improved.

Using this Bayesian model inadequacy criterion and common model comparisons, we re-analysed three combined RV data sets made using at least two telescope-instrument combinations. According to our results, the Gliese 581 RV’s observed using the HIRES and HARPS spectrographs can be described reliably using the model , where their uncertainties caused by stellar jitter and additional instrument uncertainty have been modelled to have different magnitudes – at least, the four-companion model cannot be shown to be an inadequate description of these two data sets. This suggests that the results in Tuomi (2011) are indeed reliable in this respect.

The RV’s of HD 207107 showed that there can be significant telescope-instrument -induced uncertainties in the data. Therefore, we were forced to describe these uncertainties with different parameters for each telescope-instrument -combination. According to our results, the telescope-instrument uncertainties can differ considerably between different data sets, which makes it more difficult to put reliable constraints to the stellar jitter. While the jitter of HD 217107 is not likely to exceed 6.0 ms based on the noise in the Euler data, the Lick1 data turned out to have excess noise of 5-10 ms with respect to this jitter estimate (Table 3). Therefore, we conclude that the instrument uncertainties cannot be assumed as known, and additional noise should always be assumed to exist in the data. When neglecting this additional uncertainty, the estimates of orbital parameters can be biased and their uncertainty estimates will certainly be unrealistically low with respect to the information in the measurements.

The RV’s of Andromedae proved a challenging analysis problem on their own. These data consisted of five independent RV data sets. According to our results, the Lick2 data of Wright et al. (2009) was not consistent with the other data sets with respect to our model inadequacy criterion but the earlier Lick1 data set of Fischer et al. (2003) should be used instead. Unfortunately, it is not possible to tell where this inadequacy arises from. Also, the AFOE data (Butler et al., 1999) turned out to contradict with the rest of the data, likely because of biases in the process of making the measurements, as also noted by Curiel et al. (2011). This leaved only four consistent data sets, Lick1 (Fischer et al., 2003), ELODIE (Naef et al., 2004), HET (McArthur et al., 2010), and HJS (Wittenmyer et al., 2007), to be used in the analyses. With differing noise levels for each of these sets, we calculated the revised orbital parameters for the And planetary system with four planetary companions (Table 8).

Because our four-planet solution of the RV’s of And differs significantly from the proposed solution of Curiel et al. (2011) with respect to the orbital period of the outer planet, numerical integrations of the orbits are needed to assess the stability of our solution. The lower estimate for the orbital period of And e does not support the conclusion that the d and e planets could be in a 3:1 mean motion resonance (MMR). However, our solution coincides roughly with a 2:1 MMR, which could enable the stability of the system over long time-scales. Investigating the stability of our solution is necessary to be able to determine whether it corresponds to a physically viable system and is not simply an artefact caused by noise, data sampling, and possible biases in the measurements.

For successful detections, it is crucial that the noise – i.e. all the other variations except the Keplerian signals – in the valuable measurements is modelled as realistically as possible and not simply minimised as is commonly the case when using simple minimisations and related methods. Our method can be used readily to detect whether the statistical model indeed describes the data adequately with respect to the selected noise model as well.

The application of our criterion to measurements of any complex systems is obvious. Systems whose behaviour, time-evolution, and dependence on different physical and other factors cannot be derived from fundamental physical principles, are difficult to model because the models are necessarily empirical descriptions, whose validity can only be assessed using measurements. In such systems, there can be numerous small-scale effects and/or biases, whose existence is not known and whose magnitude cannot be measured. These effects cannot therefore be taken into account in the model constructed to describe some desired features of the system. As briefly noted in Kaasalainen (2011), the ability to show that a model is an insufficient description of the measurements is therefore needed to be able to determine whether the model needs to be improved further to extract all the valuable information from the noisy data. According to the demonstrations in this article, our method can be said to satisfy these needs to significant extent. Also, as we did not make any assumptions regarding the exact nature of the model, the criterion can be applied to any problem for which it is possible to calculate the likelihoods of the measurements using the model.

Finally, we note that if the model has been constructed prior to the measurements, the model inadequacy means that the earlier data sets used to construct the model, i.e. to select the model formulae and calculate the posterior densities of the model parameters, conflict with the new ones with respect to the model. It could also be that the model is being developed using a single data set in hand. Then, despite being the best model in the sense of having the greatest posterior probability, the model could still be inadequate in describing some part of the data set with respect to another part (Kaasalainen, 2011). Either way, the measurements cannot be described adequately using the selected model and we say that the model is inadequate. Our criterion can be used in these cases as well.

###### Acknowledgements.
M. Tuomi is supported by RoPACS (Rocky Planets Around Cools Stars), a Marie Curie Initial Training Network funded by the European Commission’s Seventh Framework Programme.

## References

• Akaike (1973) Akaike, H. 1973. pp. 267 in Petrov, B. N. & Csaki, F. (eds.). Second International Symposium on Information Theory. Akadémiai Kiadó.
• Barnes et al. (2011) Barnes, J. R., Jeffers, S. V., & Jones, H. R. A. 2011. MNRAS, 412, 1599.
• Boisse et al. (2011) Boisse, I., Bouchy, F., Hébrard, G., et al. 2011. A&A, 528, A4.
• Bonfils et al. (2005) Bonfils, X., Forveille, T., Delfosse, X., et al. 2005. A&A, 443, L15.
• Butler et al. (1997) Butler, R. P., Marcy, G. W., Williams, E., et al. 1997. ApJ, 474, L115.
• Butler et al. (1999) Butler, R. P., Marcy, G. W., Fischer, D. A., et al. 1999. ApJ, 526, 916.
• Butler et al. (2006) Butler, R. P., Wright, J. T., Marcy, G. W., et al. 2006. ApJ, 646, 505.
• Chib & Jeliazkov (2001) Chib S. & Jeliazkov I. 2001. J. Am. Stat. Ass., 96, 270.
• Curiel et al. (2011) Curiel, S., Cantó, J., Georgiev, L., et al. 2011. A&A, 525, A78.
• Fischer et al. (1999) Fischer, D. A., Marcy, G. W., Butler, R. P., et al. 1999. PASP, 111, 50.
• Fischer et al. (2001) Fischer, D. A., Marcy, G. W., Butler, R. P., et al. 2001 ApJ, 551, 1107.
• Fischer et al. (2003) Fischer, D. A., Marcy, G. W., Butler, R. P., et al. 2003. ApJ, 586, 1394.
• Ford & Gregory (2007) Ford, E. B. & Gregory, P. C. 2007. Statistical Challenges in Modern Astronomy IV, Babu, G. J. & Feigelson, E. D. (eds.), ASP Conf. Ser., 371, 189.
• Geman & Geman (1984) Geman, S. & Geman D. 1984. IEEE Trans. Patt. Anal. Mach. Int., 6, 721.
• Gregory (2005) Gregory, P. C. 2005. ApJ, 631, 1198.
• Gregory (2007a) Gregory, P. C. 2007a. MNRAS, 381, 1607.
• Gregory (2007b) Gregory P. C. 2007b. MNRAS, 374, 1321.
• Gregory (2011) Gregory, P. C. 2011. MNRAS, submitted. (arXiv:1101.0800v1 [astro-ph.SR])
• Haario et al. (2001) Haario, H., Saksman, E., & Tamminen, J. 2001. Bernoulli, 7, 223.
• Hastings (1970) Hastings, W. 1970. Biometrika 57, 97.
• Jeffreys (1961) Jeffreys, H. 1961. The Theory of Probability (The Oxford Univ. Press).
• Kaasalainen (2011) Kaasalainen, M. 2011. Inverse Problems & Imaging, 5, 37.
• Kass & Raftery (1995) Kass, R. E. & Raftery, A. E. 1995. J. Am. Stat. Ass., 430, 773.
• Kullback & Leibler (1951) Kullback, S. & Leibler, R. A. 1951. Ann. Math. Stat., 22, 76.
• Mayor et al. (2009) Mayor, M., Bonfils, X., Forveille, T., et al. 2009. A&A, 507, 487.
• Mayor & Queloz (1995) Mayor, M. & Queloz, D. 1995. Nature, 378, 355.
• McArthur et al. (2010) McArthur, B. E., Benedict, G. F., Barnes, R., et al. 2010. ApJ, 715, 1203.
• Metropolis et al. (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., et al. 1953. J. Chem. Phys., 21, 1087.
• Naef et al. (2001) Naef, D., Mayor, M., Pepe, F., et al. 2001. A&A, 375, 205.
• Naef et al. (2004) Naef, D., Mayor, M., Beuzit, J. L., et al. 2004. A&A, 414, 351.
• Schwarz (1978) Schwarz, G. E. 1978. Ann. Stat., 6, 461.
• Spiegelhalter et al. (2002) Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. 2002. JRSS B, 64, 583.
• Tuomi & Kotiranta (2009) Tuomi, M. & Kotiranta, S. 2009. A&A, 496, L13.
• Tuomi (2011) Tuomi, M. 2011. A&A, 528, L5.
• Udry et al. (2007) Udry, S., Bonfils, X., Delfosse, X., et al. 2007. A&A, 469, L43.
• Vogt et al. (2000) Vogt, S. S., Marcy, G. W., Butler, R. P., & Apps, K. 2000. ApJ, 536, 902.
• Vogt et al. (2005) Vogt, S. S., Butler, R. P., Marcy, G. W., et al. 2005. ApJ, 632, 638.
• Vogt et al. (2010) Vogt, S., Butler, P., Rivera, E., et al. 2010. ApJ, 723, 954.
• Wittenmyer et al. (2007) Wittenmyer, R. A., Endl, M., & Cochran, W. D. 2007. ApJ, 654, 524.
• Wright (2005) Wright, J. T. 2005. PASP, 117, 657.
• Wright et al. (2009) Wright, J. T., Upadhyay, S., Marcy, G. W., et al. 2009. ApJ, 693, 1084.

## Appendix A Model inadequacy criterion

### a.1 Two data sets

We start by defining what we mean by model inadequacy in describing two sets of data and derive its equations from the common Bayesian model comparison theory.

We assume that there are independent measurements, or series of measurements, , that have been made to study the same system of interest. Because these measurements describe the same system, or at least contain information on the same aspects of the system of interest, they can be modelled with statistical models that have at least one parameter in common, namely, . Throughout this article the parameter space is a bounded subset of . The parameter is used to quantify some features in the measurements . In addition, there are other parameters, namely , that each quantify some additional features in the th measurement.

The measurements can now be used to compare different statistical models using the Bayesian model selction theory. Let be the posterior probability of model given the measurements and . The model can be any model for which a likelihood function exists. With this model, both measurements are modelled using the same parameter and different parameters and , respectively. Probability is the corresponding probability when measurements and are modelled using the same model structure as model has, but this time with parameters and , where , respectively.

Therefore, because of the independence of the measurements and the independence of and , the marginal integral in Eq. (2) of the measurements with respect to the model can be written as222The reader should refer to any basic text on conditional probabilities and independence.

 P(mi,mj|B) =∫ϕi,ϕj∈Θl(mi,mj|ϕi,ϕj,B)π(ϕi,ϕj|B)d(ϕi,ϕj) =∏k=i,j∫ϕk∈Θl(mk|ϕk,B)π(ϕk|B)dϕk=P(mi|A)P(mj|A), (6)

where the model has been changed to , because given only one measurement, and are in fact the same model. In the above, we have used and to denote the likelihood function and the prior density, respectively.

Now, let be a small threshold probability. We compare the probabilities of the models and given the measurements and . If , we say, that the model is an inadequate description of the data with a probability of and that the model cannot be used to model them both. In other words, the probability of model is so small, that the measurements should instead be modelled using different parameters and , i.e. using model . This condition is simply the common Bayesian model selection criterion (e.g. Jeffreys, 1961). From this condition and Eq. (A.1), and when selecting the prior probabilities of the two models equal, the comparison of models and according to Eq. (1) leads to

 P(mi,mj|A)

We denote and leave the model out of the notation by denoting when it is clear which model has been used. Now, we define the model inadequacy as follows.

The model used to describe the measurements and is not adequate with level if

 B(mi,mj):=P(mi,mj)P(mi)P(mj)

where the factor is actually the Bayes factor in favour of model and against model and is some (small) positive number corresponding to the selected threshold probability .

Because we have made no assumption on the exact nature of the measurements, the model, or the modelled system, the above condition applies to anything that can be measured and described with a statistical model. In fact, to be able to use the Eq. (8), a sufficient condition is that the measurements and are modelled using statistical models that have at least one parameter, namely , in common. The model of the th data set may have other parameters and these have to be treated as free parameters as well, but they have no role in the Eq. (8) because they are independent of the other data set.

The Eq. (8) in fact states that the measurements are not distributed according to the model used. However, the converse is not true. If the condition in Eq. (8) does not hold for some measurements and , it cannot be said that they are drawn from the same modelled density, even though it might be a reasonable assumption in practice.

The Bayes factor in Eq. (8) has an interesting property when interpreted in terms of the information gain defined using the Kullback-Leibler (K-L) divergence (Kullback & Leibler, 1951) between prior and the posterior. The K-L divergence is defined for two continuous random variables with probability densities and as

 DKL{u(x)||v(x)}=∫u(x)logu(x)v(x)dx. (9)

With this notation, we can write the K-L divergence of moving from the prior to the posterior (given both data sets). Hence, it follows that

 DKL{π(θ|mi,mj)||π(θ)}=∫π(θ|mi,mj)logπ(θ|mi,mj)π(θ)dθ =−logP(mi,mj)+∫π(θ|mi,mj)logl(mi,mj|θ)dθ =−logP(mi,mj)+∑k=i,j∫π(θ|mi,mj)logl(mi,mj|θ)l(mk|θ)dθ ⇔logB(mi,mj)=DKL{π(θ|mi,mj)||π(θ)} −DKL{π(θ|mi,mj)||π(θ|mi)}−DKL{π(θ|mi,mj)||π(θ|mj)}, (10)

where we have used the Bayes rule and the facts that integral over a probability density equals unity and and are independent.

This means that the logarithm of the Bayes factor used to determine the model inadequacy in describing measurements and can in fact be interpreted as the total information gain of the two measurements minus the information gains of moving from the posterior with respect to each measurement alone to the full posterior.

Alternatively, the Bayes factor can be written using the information losses, or K-L divergences, of moving from the posteriors back to the prior (as opposed to the information gain of moving from prior to the posterior). With this terminology, and using a similar derivation as for the information gain in Eq. (A.1), the expression in Eq. (A.1) can be replaced by

 logB(mi,mj)=DKL{π(θ)||π(θ|mi,mj)} (11)

which means that the logarithm of the Bayes factor can be interpreted as the total information loss of the two measurements minus the information losses of the two measurements separately.

### a.2 Multiple data sets

When there are more than two data sets available, the model inadequacy criterion can be derived easily following the considerations in the previous subsection. For measurements , it can be seen that

 P(m1,...,mN|B)=N∏i=1P(mi|A). (12)

It then follows that the model inadequacy criterion corresponding to that in Eq. (7) can be written as

 P(m1,...,mN|A)

We again use to denote the Bayes factor and write this criterion in the following way.

The model used to describe measurements does not describe the measurements adequately accurately with level if

 B(m1,...,mN):=P(m1,...,mN)∏iP(mi)

From the Eq. (14), it can be seen that for data sets, the marginal integral needs to be determined times to receive the Bayes factor that is used to assess the model inadequacy. This requirement cannot be considered very limiting, because in practice, the data sets are commonly analysed separately anyway.

In terms of K-L information loss of moving from the posterior to the prior, the Bayes factor can again be interpreted in a simple manner using similar derivation as in Eq. (A.1). As a consequence, it follows that

 logB(m1,...,mN)=DKL{π(θ)||π(θ|m1,...,mN)} (15)

However, the information gains cannot be used in a similar manner as in Eq. (A.1). Instead, using the information gain of the measurements the generalisation of Eq. (A.1) to several measurements is

 log∏iB(mi,(m1,...,mk,...,mN)|k≠i)B(m1,...,mN) =DKL{π(θ|m1,...,mN)||π(θ)} −N∑i=1DKL{π(θ|m1,...,mN)||π(θ|m1,...,mk,...,mN)|k≠i}, (16)

where is the Bayes factor describing the model inadequacy with respect to two data sets, namely, and the combined data set , which denotes all the data except the measurement .

Therefore, the Bayes factor determining the model inadequacy in Eq. (14) can be interpreted as a measure of information loss that results from disregarding the measurements to gain information on the posterior minus the corresponding information losses of disregarding each measurement one at the time. Naturally, the gain and loss Eqs. (A.2) and (A.2) are equivalent if , as was seen in the previous subsection.

Assuming that , which means that model has a greater probability than , has an interesting consequence. From this assumption, it follows that

 DKL{π(θ)||π(θ|m1,...,mN)}≥N∑i=1DKL{π(θ)||π(θ|mi)}. (17)

When again interpreted in terms of information loss, this means that given a model that cannot be shown inadequate with , the amount of information in the combined data set is greater than the information in the individual data sets.

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters