Resolving the mass–anisotropy degeneracy of the spherically symmetric Jeans equation II: optimum smoothing and model validation
Abstract
The spherical Jeans equation is widely used to estimate the mass content of a stellar systems with apparent spherical symmetry. However, this method suffers from a degeneracy between the assumed mass density and the kinematic anisotropy profile, . In a previous work, we laid the theoretical foundations for an algorithm that combines smoothing Bsplines with equations from dynamics to remove this degeneracy. Specifically, our method reconstructs a unique kinematic profile of and for an assumed free functional form of the potential and mass density and given a set of observed lineofsight velocity dispersion measurements, . In Paper I (submitted to MNRAS: MN140101MJ) we demonstrated the efficiency of our algorithm with a very simple example and we commented on the need for optimum smoothing of the Bspline representation; this is in order to avoid unphysical variational behaviour when we have large uncertainty in our data. In the current contribution we present a process of finding the optimum smoothing for a given data set by using information of the behaviour from known ideal theoretical models. Markov Chain Monte Carlo methods are used to explore the degeneracy in the dynamical modelling process. We validate our model through applications to synthetic data for systems with constant or variable masstolight ratio . In all cases we recover excellent fits of theoretical functions to observables and unique solutions. Our algorithm is a robust method for the removal of the massanisotropy degeneracy of the spherically symmetric Jeans equation for an assumed functional form of the mass density.
keywords:
globular clusters: individual: NGC 6809 (M55)  dark matter.1 Introduction
An open problem in modern theoretical galactic dynamics is the mass anisotropy degeneracy of the spherically symmetric Jeans equation (hereafter SSJE). Specifically, there exists a degeneracy between an assumed mass density, , and the anisotropy profile, , of a self gravitating system. That is, there are many pairs that describe equally well the observables (Merritt, 1987). Binney & Mamon (1982) were the first to present a solution for the case of constant masstolight ratio. From a different perspective, in a previous work (Diakogiannis et. al. 2014 submitted to MNRAS: MN140101MJ; hereafter Paper I), we set the theoretical foundation for the removal of this anisotropy so that: for an assumed free functional form^{1}^{1}1By the phrase “assumed free functional form” we mean a specific mass profile, e.g. King, or Michie, with free parameters to be estimated from our algorithm. of the mass density and potential of a spherically symmetric system we recover a unique kinematic profile for the second order radial, , and tangential, , velocity moments. This is valid for both constant and variable masstolight ratio. In Paper I we argued that, if we know the complete functional form of and the mass density , then it is possible by using some optimization method (e.g. Genetic Algorithms), to reconstruct a unique kinematic profile within any desired numerical accuracy; i.e. a unique decomposition of to and . For the case where we have a discrete data set, we can only speak of a unique family of profiles within some statistical uncertainty. This uniqueness follows from exploring parameter space with MCMC methods and recovering unimodal marginalized distributions.
In the core of our method lies the representation of the radial velocity dispersion, , in a Bspline basis. Bspline functions (De Boor, 1978) are used extensively in Computer Aided Geometric Design (Farin, 2002) and in statistical smoothing spline modelling techniques (Hastie et al., 2001). In Paper I we demonstrated our algorithm with a very simple example with fixed masstolight ratio . In the present paper we validate our method by giving a variety of examples with constant or variable . For the latter we demonstrate the usage of our algorithm in a two component system that consists of both stellar and dark matter populations. In Paper I we also commented on the need for optimum smoothing, since there may be cases where our data have large errors and the resulting kinematic profile demonstrates unphysical variations. In the current contribution we present an algorithm for optimum smoothing, based on information of the smoothness behaviour from ideal theoretical models.
The structure of our paper is as follows: in Section 2 we present for completeness a very short description of the most basic equations presented in Paper I. In Section 3 we describe the statistical inference methods we use as well as the algorithm for optimised smoothing. In Section 4 we present three detailed examples. In these we reconstruct fully the mass content of the system and the kinematic profile, using synthetic data of brightness and lineofsight velocity dispersion . In Section 5 we discuss various aspects of our results. Finally in Section 6 we conclude our work.
2 Mathematical formulation
In Paper I we gave a detailed description of the mathematical formulation of our algorithm for the SSJE. For completeness we repeat some very basic mathematical formulas that are related to our needs for the present analysis. The interested reader should consult Paper I and references therein for a complete understanding of our method.
2.1 Bspline Functions
A Bspline function of order is the linear combination of some constant coefficients with the Bspline basis functions :
(1) 
The basis functions are known polynomial pieces of degree , joined together in a special way as to ensure certain differentiability and smoothness criteria. These Bspline basis functions are defined over a nondecreasing interval that we call the knot sequence. Each of the is called a knot. The distribution of knots and the constant coefficients regulate the geometric shape of . For a detailed description of Bspline functions the reader should consult Paper I and references therein.
2.2 The Spherically Symmetric Jeans Equation
In Paper I we demonstrated that if we represent the radial second order velocity moment, , in a Bspline basis defined over , i.e.:
(2) 
then the lineofsight velocity dispersion can be written in the form:
(3) 
where we defined:
(4)  
(5) 
where is the tidal radius of the system. Quantities and represent the first derivative of the mass density and the Bspline basis function. Coefficients define the shape of and affect the geometric shape of .
For a system that consists of both stellar and dark matter populations, the above formalism is generalized with the assumption that the mass density is now the tracer stellar density, i.e. and . The interaction of the stellar with the dark matter component is performed through the total potential
(6) 
where symbol corresponds to stellar population, and to dark matter.
2.3 Dynamical Models
In the following sections we will reconstruct, from synthetic data, the kinematic profile. i.e. and (once is known, can be found from the SSJE), of a stellar system in equilibrium. We will assume that the stellar mass content of this system is described from a King (1966) mass density . This is the only stellar density we are going to consider. For cases where we will consider also a cold dark matter component, we will use a Navarro, Frenk and White profile (Navarro et al., 1996, hereafter NFW). The quantity is going to be reconstructed from a Bspline function approximation, as discussed in previous sections; in Paper I we gave a detailed description of the necessary functions that we use for our modelling. In short, a King mass model is defined through the parameters , where is the value of the transformed potential at the center , is the stellar core density of the cluster, and the King core radius; the interested reader should consult Paper I and references therein for more details.
The NFW profile reference is defined through the mass density:
where is a characteristic dark matter (hereafter DM) density and a characteristic length.
3 Statistical Analysis
In this section we will be using standard frequentist and Bayesian approaches to model fitting. We will interchange the use of these, according to what best suits our needs each time. Eventually our model fits are going to be performed in a fully Bayesian context. The reader is directed to standard texts such as Hastie et al. (2001); Sivia & Skilling (2006) and Gregory (2010) for further details.
3.1 Likelihood function
In this section we augment the formalism we developed in Paper I in order to account also for a smoothing penalty. The full data set of brightness, , and kinematics, is . The full posterior probability of our complete data set, including the penalty, is:
(7) 
where represents the vector of parameters needed to fully describe a given assumed physical model; the terms and have the same mathematical form as in Paper I. The term represents a smoothness^{2}^{2}2See following section for definition of smoothness. penalty that we must apply to the shape of the function such as to avoid variational behaviour. This is encoded in the quantity which we shall define later. The penalty is governed by the set of values of the smoothing parameter. represents the distribution from which our smoothing parameter draws values and that we are going to determine later. We use the same Markov Chain Monte Carlo (MCMC) algorithm as in Paper I.
3.2 Bayesian Model Selection
In Paper I we described the necessary mathematical framework for Bayesian model inference. Model comparison is performed through the evaluation of Bayesian evidence . For this we use MultiNest (Feroz & Hobson, 2008; Feroz et al., 2009). In cases where the model complexity becomes too great, we are going to use the Bayesian Information Criterion (BIC) for the choice of optimum model. BIC is defined as:
(8) 
where is the total number of free parameters, is the total number of data points, and the maximum likelihood value as results from an MCMC run. are the values of parameters, , that maximize the likelihood function, . We can legitimately use BIC, since the posterior distribution belongs to the exponential family. When we perform model comparison using BIC, the most probable model is the one with smallest BIC value.
3.3 Smoothing Penalty Distributions
In this section we are going to describe our choice of probability distributions and . By the term “smoothing” we mean how “stiff” or “nonflexible” our resulting Bspline function must be. The probability distributions and govern this property of the resulting function as well as of the Bspline representation. We need smoothing penalty for cases where we have large errors and the Bspline function tends to follow the data in an unphysical way. We also need this penalty for cases of incomplete data: they do not span all distance ).
For the penalty function we choose the following scheme:
(9) 
We penalise both the first (slope) and second (curvature) derivatives of the lineofsight velocity dispersion. We choose to penalise and not since the former is compared directly to the data set. The parameter lies in the range and describes at what percentage each of the derivatives of participates in the penalty. Our choice, after a lot of trial and error, will be fixed^{3}^{3}3We finalised this value on experiments with synthetic data and known models, by calculating their Bayesian evidence. at . That is, we assign the majority of the penalty to the second derivative, thus penalising mainly the curvature of the curve, and not the slope.
In a full Bayesian context we have:
(10) 
which is normalised in the range of all positive values :
(11) 
is going to be another “prior” distribution of values over . In statistician’s terminology, this is called a hyperprior. We will assume that has the form of the inverse Gamma distribution^{4}^{4}4The motivation of our choice will be described in the next section., i.e.:
(12) 
Based on the above definitions we see that a large value of tends to make a straight line. On the contrary a very small value allows the model to overfit.
We have to note that instead of the combination and fixed we could use two different free parameters, and , each penalising a different derivative of . However this would increase the complexity of our model and our calculations, with a small extra gain. For a full discussion on penalties on smoothing functions the interested reader should consult Hastie et al. (2001). Again we note that in the smoothing splines fitting, from a statistician’s perspective, all penalties are applied to the smoothing spline. Here we follow a significantly different approach, penalising a function which is the end product of some very complex calculations. That is, we apply the smoothing penalty to the function that is directly fitted to observables and not to the smoothing Bspline function.
3.4 Training the model for Optimum Smoothing
In this section we are going to describe how we estimate the parameters and of the distribution for optimum smoothing for cases where we consider penalty in our models. Once we have defined, we can use as a free parameter in the MCMC walk. Our reasoning is that different models must behave in a similar way in terms of their “stiffness” behaviour (smoothing). That is, we expect, in terms of smoothness only, that a real physical system will approximately have the same behaviour as an ideal theoretical model with a known kinematic distribution profile. Then we are going to use theoretical known models, in order to have a measure of smoothness that is needed when we fit our model to real data.
We must be careful in the choice of the parameters of . A “stiff” curve may result in underfitting, while alternatively a very “soft” curve can result in overfitting. This can also be affected by the number of coefficients . The fewer their number, the less flexible the resulting Bspline function. We must draw attention to the fact that there does not exist a single range of values for for all data sets. This range depends on the number of available data, the choices we make for the knot distribution (uniform, exponential etc) and the number of coefficients. Especially a large number of data points, has a severe impact on the resulting value of since the likelihood product, (see Paper I), contains more terms, each smaller than unity. Then the quantities and must take a different range of values in order not to have a larger (or smaller) effect on the resulting likelihood value. This is regulated by parameters of . For our method to be reliable in the case where we use a penalty, we need to know the number of available data and keep fixed in our fittings the choice of knot distribution^{5}^{5}5We emphasise that it is the choice of distribution (e.g. uniform, Gaussian etc) that must be fixed, not the knot sequence., Bspline order and number of coefficients. In practice, since the value of is eventually determined through the MCMC scheme, we can have a small deviation from the above mentioned parameters. This should be avoided for applications to real data sets. For example calculate for a given number of coefficients and then apply this range to a Bspline function with a few more coefficients or a few more data points. We will demonstrate with an example, how altering the value of data points and keeping fixed the range of can result in a small bias in the estimates.
Consider the case where we create a small number of synthetic data from a model that depends on a single variable, say . These data are created from a reference value, say , in which we add a random Gaussian noise of given variance . We wish to recover this value, within some statistical uncertainty, through a Markov Chain Monte Carlo scheme. That is, we wish to estimate a marginalised distribution of values.
Assume that we define a likelihood that depends on and also on some parameter that is not estimated from the MCMC. The value of this parameter affects the estimation of by inserting some unknown bias in the true mean and by increasing the total variance of the true values by some unknown amount . Then this bias and variance will be in general a function of , i.e. and . The relation between the true reference values of mean and variance and the estimated ones is:
(13)  
(14) 
We can see these relations schematically in Fig. 1. The histogram represents the estimates of some MCMC walk based on synthetic data points . These were created from the true value by adding Gaussian random noise. Overplotted is a Gaussian (red in color version) from which we generated a synthetic data sample of values , centred on the (biased) mean value . We show how the mean deviates from the true due to . We also plot the increase in variance from the true due to .
We wish to find which is the optimum value that gives the optimum tradeoff between bias and variance in our estimates. We define as measure of this tradeoff the average deviation of each of the MCMC walk points from the reference value :
(15) 
Substituting Equation 13 in 15 yields:
that is
(16) 
where , and we used:
In Equation 16, can be further decomposed into the true variance of the target distribution around its mean, that is inherent to the process by which we created the random sample, and a variance term that depends on the smoothing parameter^{6}^{6}6In general we expect that a “stiff” curve will have smaller variation in its estimates when fitting to a set of data. Quantity expresses exactly this fact. :
(17) 
That is, the deviation of the values of the MCMC chains from the reference value, is some irreducible variance that is inherent to the random process in which the data were created, and a part that results from the additional bias and variance inserted by a wrong assumption on the value . There is an interplay between and . As we increase parameter from zero, we expect the term to be reduced. However, beyond some value, while will continue decreasing, will start rising and result in an overall increase^{7}^{7}7 (For a full discussion on the Bias  Variance trade off see Hastie et al., 2001). in . Therefore, the optimum value of parameter is going to be the one that minimises in the estimates from the MCMC.
We are going to use this “variance” as a measure of optimum smoothing for the case of a Bspline representation of . For this case, we define:
(18) 
The quantity inside the brackets corresponds to the average deviation of all the coefficients from their corresponding reference value, for each proposed value from the MCMC walk. We estimate each reference value that corresponds to the reference radial velocity dispersion of a known analytic model, through:
(19) 
by convolving Equation 19 with , i.e.:
(20) 
where . This results in a system of equations for the unknowns which can be solved with standard linear algebra methods.
3.4.1 Determination of parameters of the distribution.
For model fits where we will use a penalty function, we choose to use seven^{8}^{8}8We remind the reader that the last index since at the tidal radius of the system . This follows from the fact that the Bspline represented function passes through the last point, see section . So the last coefficient is going to be fixed. unknown coefficients, i.e. and a uniform distribution of knot values. The order of the Bspline representation is set to .
The first thing we need is to determine a range of values that give optimum penalty. That is, create a set of estimated values that minimise the “measure of fitness” . Then, using these observed values as “data” we need to estimate the values of that go into the hyperprior of the full posterior distribution (Equation 7).
For the determination of a range of values based on the measure of optimum smoothing (Equation 18) we use as likelihood the functional form:
(21)  
(22) 
i.e. we removed the unknown hyperprior distribution and the prior probability . The former is not needed, since we are going to use a fixed for each MCMC evaluation, while the latter does not affect our calculations (it is constant and is simplified out in the MCMC process).
The steps of our algorithm in detail are:

Create a large number of training data sets, from a known theoretical model. .

For each of these data sets:

For the range of values , keeping the mass model fixed, estimate the marginalized distributions of the coefficients.

Find the approximate value of that minimizes the “measure of fitness” of the produced MCMC chains from the ideal model and store it as .

For the determination of values, we assume that our estimate , from each random sample, has a Gaussian random error deviation from the true value . Then we convolve this Gaussian error with the hyperprior , and integrate over all possible values of . The resulting function is going to be the distribution function that values satisfy. We are going to use this distribution function in an MCMC scheme for the definition of the likelihood that eventually will give us the estimates of . From each random sample of lineofsight values we have a value that minimises the function. That is, we have a data set . We use this set as “observable” in a MCMC algorithm, in order to estimate parameters of the hyperprior . The likelihood we use in this MCMC scheme is the distribution function of values:
where
(23) 
The output of this MCMC is marginalised distributions of and . We use the mode values of the distributions for the hyperprior .
In Fig. 2 we plot the optimum smoothing measure for 30 randomly created independent samples. These random samples were created from an Isotropic King profile (top right panel) and a King mass model with OsipkovMerritt anisotropy profile (top left panel). The value of that approximately minimises , is somewhere in the interval . Observe that the lower boundary is steeper than the upper boundary. Based on this observation, and the fact that the convolution of the Inverse Gamma distribution with a Gaussian is finite, we choose to draw values from the Inverse Gamma distribution.
Recall Equation 12 for the assumption on the hyperprior from which the distribution satisfies. In bottom panels, we plot the hyperprior for the optimum values of .
4 Examples
In this section we are going to reconstruct models of stellar clusters from synthetic data for a variety of anisotropy functions. For the notation of the total number of unknown coefficients , as mentioned earlier, we use the following scheme: Based on the restriction that all the quantities that describe the cluster must be zero at the tidal radius, the final coefficient will be . This extra coefficient does not go into the likelihood analysis, hence we break the total number of coefficients to the sum of unknowns plus one which represents this last coefficient. We use this notation in captions of Figures and in Tables where we list Bayesian evidence or BIC values.
Our motivation is as follows: in general the King brightness profile gives a very good fit to observational data. Thus, it can be a good approximation to the mass model. However the kinematic profile is not always adequate and can affect the mass measurements. So we allow information from the lineofsight velocity dispersion data to accurately reconstruct the shape of the Bspline representation of and consequently . Eventually the kinematic profile and the mass content of the cluster is fully reconstructed independently of the anisotropy of the system. This independence results from the flexibility of the smoothing spline representation of . We will see that we recover accurately in all cases the correct reference parameters of the mass model and the masstolight ratio and to a good accuracy the second moments of the radial and lineofsight velocities. We find no degeneracy in our results, i.e. a unique kinematic profile is constructed for each example. Finally, the quality of our fits depends on the available data set.
Penalty  NDim  Number of coefficients  

No  9  
No  10  
Yes  12  
Yes  13 
The order of the Bspline representation of was held fixed at . From left to right: first column describes if we used penalty or not. Second column is the total dimensionality of the model. Third column is the number of coefficients and the fourth column is the value of Bayesian evidence. The highest value of corresponds to the most probable model.
4.1 OsipkovMerritt anisotropy and King mass density
In this example we consider a King mass model and a kinematic profile constructed from an OsipkovMerritt anisotropy function of the form:
(24) 
For the creation of the synthetic data points we use the same process as in Paper I. In this example the masstolight ratio is a free parameter. The reference values for the creation of synthetic data we use are . We use 14 synthetic data points and an increased error of of the true value. For the Bspline representation of we use uniform knot distribution in the interval . Observe that this knot distribution is adaptive: for each proposed value of parameters , the solution of Poisson Equation yields a tidal radius . For this we construct a uniform knot distribution. This is repeated for each likelihood call in the MCMC.
First we fit our model to the synthetic data with no penalty. In this case we use coefficients, . Then we fit including a penalty distribution as described in section 3.4 and using coefficients, . The values of in the hyperprior are the ones that correspond to the estimates from the King Isotropic model (section 3.4.1). We use a higher number of coefficients in the case where we also consider a penalty in order to avoid a very “stiff” function and give greater fitting flexibility to our model. The parameters of the hyperprior were calculated for coefficients ; we list the values of Bayesian evidence in Table 1. The models that are favored from Bayesian evidence are the for the case with no penalty, and for the case with penalty.
In Fig. 3 we plot the histograms of the mass model parameters for the two fits; the top panel corresponds to the fit with no penalty, while the bottom panel is the fit with penalty. In both cases the reference values (vertical dashed curves) from which the synthetic data were created are within the uncertainty range of the marginalised distributions; the mass content of the model is fully reconstructed.
In Fig. 4 we plot the fits for the second moments of the lineofsight and radial velocities; while left panels correspond to the fits with no penalty considered. In all panels the yellow shaded regions correspond to the uncertainty interval of the unknown coefficients only, i.e. we do not account for the uncertainty of the mass defining parameters . Right panels correspond to fits with penalty; note that the fit takes into account the brightness synthetic profile as well, although we do not plot it here. Top left panel is the lineofsight velocity dispersion, , and the 14 synthetic data points we employed. We plot the highest likelihood profile as well as the true function, demonstrating that the fit is excellent. There is a small deviation close to the origin, since tends to follow the first data point. However, the uncertainty in the coefficient includes the true value of . The bottom left panel is the corresponding as estimated from the lineofsight fit, although note that we have no data to compare directly to . This is estimated from the corresponding fit. Close to the origin there is a deviation of the true from the estimated one. This again is owing to the tendency of to follow the first datum, however, the uncertainty interval of encapsulates the true value. What is more important is that the mass content is fully reconstructed and is not affected by this small deviation. The top right panel is the lineofsight fit for the case where we add a smoothness penalty term to the likelihood. In this case the fit is much better, since the first data point does not significantly influence the behaviour of the curve. In the bottom right panel is the corresponding fit, and both fits are excellent and closer to the true value than in the case with no penalty.
Penalty  NDim  Number of coefficients  

No  9  
No  10  
Yes  12  
Yes  13 
In this table we list Bayesian evidence values estimated from MultiNest. For all our fits we used uniform knot distribution and a fixed order of the Bspline basis. From left to right: First column describes if we used penalty or not in the corresponding fit. Second column is the total dimensionality of the model. Third refers to the number of coefficients of the representation. Third column is the value of Bayesian evidence as estimated from MultiNest. The highest value of corresponds to the most probable model.
4.2 Sinusoidal anisotropy and King mass density
In this example we consider again a King mass profile, and an anisotropy function of the form:
(25) 
We do not test the system for stability, or self consistency, i.e. if it is possible to have a realisation of a stellar distribution with such an anisotropy profile. We are interested to see if the mass estimate of the system is recovered completely, despite the “difficult” anisotropy profile. Namely, the marginalised distributions of the parameters and the masstolight ratio . We construct our synthetic data in the same way as in the previous examples using error, again allowing the masstolight ratio to be a free parameter. The reference values we used for creation of synthetic data are .
In Table 2 we list the Bayesian evidence for various fits of our model to synthetic data, with and without penalty. As in the previous example, the values of in the hyperprior are the ones that correspond to the estimates from the King Isotropic model (section 3.4.1). Bayesian inference favours the models with for the case with no penalty, and for the case with penalty.
In Fig. 5 we plot the lineofsight velocity dispersion and the second order radial velocity moment of these models; left panels correspond to fits without penalty, while right panels to fits with penalty. Yellow shaded region corresponds to the uncertainty intervals of the coefficients only, i.e. keeping the mass parameters fixed at the highest likelihood values. Taking into account the variance of the mass model, the true uncertainty would be greater. For the case without penalty, we see that follows the shape of the true curve, however around deviates from the true value. Note that if we included the variance of the mass model as well, the true curve will be overlapped by the uncertainty region of the model. The corresponding highest likelihood fit fails to follow the slope of the true curve close to the origin. However the true curve lies within the uncertainty, except very close to the origin. This is due to the small number of data points and larger errors in the synthetic data. With more data points or smaller errors, as we shall see, we get an excellent fit. This discrepancy is also due to the small number of coefficients, , we used and the uniform knot distribution, which is far from optimum. We need more control points for the Bspline representation in regions of increased curvature. This is something that the uniform knot distribution fails to encapsulate with only coefficients. The case where we consider penalty gives a much better reconstruction. The small deviation in the origin is within the uncertainty of the coefficients. The corresponding plot again deviates close to . This is due to the bad quality of the data and the bad choice of knot distribution. However the majority of the true curve is encompassed in the uncertainty. We discuss a method for overcoming the problem of the poor fit close to the origin in Section 5.
In Fig. 6 we plot the histograms of the corresponding mass model parameters, as well as the masstolight ratio ; vertical dashed lines correspond to the reference values from which synthetic data were created, while the top panel corresponds to fit without penalty. The bottom panel to fit with penalty. Observe that in all cases the mass content of the system is reconstructed within the uncertainty of the parameters. In the top panel we plot the marginalized distributions of two models, namely (blue in online version), and (green in online version). The model with coefficients contains the reference values in the outer limits of the distributions for the case of and . This is due to the reduced flexibility of the model that introduces bias in the estimates. The small number of coefficients with a uniform knot distribution in combination with the bad quality of our data fail to describe accurately the region close to . The model with that is more flexible recovers the reference values with a smaller uncertainty interval. Note that Bayesian inference penalizes any complexity resulting from higher number of dimensions. However the difference in between competing models with no penalty is not decisive according to Jeffreys Table (see Paper I for further details). That is, the model with is acceptable, although less favored. The model with penalty again gives a better fit and encapsulates the reference values in smaller uncertainty.
In order to verify that indeed the deviation from the true curve close to the origin is due to the data set, we recalculated our parameters with penalty using coefficients and 29 data points, again with error on the reference values. In Fig. 7 we plot the (top panel) and corresponding (bottom panel) highest likelihood fits. Indeed in this case the kinematic profile reconstruction is excellent, as we expected. We note that in order to have the most accurate mass model parameter estimates, we should have recalculated . Then the penalty distribution would be fine tuned for the larger data set. We should also perform Bayesian model selection again. However this goes beyond our needs for this simple example: we only want to demonstrate that with greater number of data, we have a better fit to the kinematic profile and our algorithm can account for the increased curvature of close to the origin.
4.3 OsipkovMerritt anisotropy and Cold Dark Matter
In this example we consider the case of a combined model that has both a stellar population and a DM component. The stellar component is described by a King profile, and the DM with an NFW profile. We create a sample of 14 data points with error, in the same way as in the previous examples. The reference values for the creation of synthetic data are and the functional form of the anisotropy profile is:
(26) 
where . For the brightness profile we make the approximation that, for each solar mass of the tracer mass density corresponds a solar luminosity , i.e. . As usual, we consider an adaptive uniform knot distribution, and the order of the Bspline approximation is again . The coefficients, , of the hyperprior, , were estimated for optimum smoothing from synthetic data based on the OsipkovMerritt anisotropy model as well as the isotropic King model (section 3.4.1). That is, we combine information on smoothness from two distinct models with different anisotropy profiles. We do so in order to allow for more information to be encoded to our model.
For this example we try several models with various numbers of Bspline coefficients . We perform model selection using BIC since MultiNest converges too slowly due to the higher dimensionality of parameter space. The results of the Bayesian inference can be seen in Table 3. When we do not consider a smoothing penalty the most probable model is the one with coefficients, . The best fitting model with penalty has coefficients, .
We plot the highest likelihood fits in Fig. 8; the left panels correspond to coefficients and no penalty. The top left panel is the lineofsight velocity dispersion data, the fit and the true profile; bottom left panel is the corresponding fit, while right panels correspond to profiles with coefficients and penalty. The top right panel is the lineofsight velocity dispersion data, true profile and highest likelihood fit, while bottom right panel is the corresponding fit. In all figures, the yellow shaded region corresponds to the uncertainty interval of the coefficients only, i.e. keeping the mass models parameters fixed to the highest likelihood value. The case with no penalty has more flexibility and tends to follow the data in the region . It also fails to capture the increased curvature in the region , thus deviating from the reference. The model with penalty, again gives a better fit. It has a small deviation close to the origin . This deviation is due to the small number of data points.
Penalty  NDim  Number of coefficients  BIC 

No  10  
No  11  
Yes  13  
Yes  14  
Yes  15  
Yes  16 
The order of the Bspline representation of was held fixed at . From left to right: first column describes if we used penalty or not. Second column is the total dimensionality of the model. Third column is the number of coefficients and the fourth column is the value of BIC. The smallest value of BIC corresponds to the most probable model.
In Fig. 9 we plot the histograms of the defining parameters of the mass models. Vertical dashed lines are the reference values from which we created the synthetic data. The top panel corresponds to the case with no penalty, and the bottom panel to the case with penalty. In both cases the mass content of the cluster is fully recovered. This holds even for the case with no penalty where we did not have a good reconstruction of the profile.
5 Discussion
Having presented our method to its full extent there are several issues to be addressed. A first comment to be made is on the mass profile. Why did we choose a King mass profile, and did not allow also for a Bspline representation of the mass density ? We choose to model with a King profile^{9}^{9}9Alternatively we could have used a Michie model (Michie, 1963) or any model that can give a nice fit to a specific data set., since this is in general a good approximation to real stellar systems and above all a simple mathematical construction. The case where the mass density is also allowed to have a free Bspline functional form requires a different mathematical formulation and presents different technical difficulties. This is a method under development that we will present in future work.
As regards to the knot distribution, our choice of using uniform knots is far from optimum. This is especially evident in our second example where we use a difficult sinusoidal anisotropy profile that has the greatest curvature close to the origin. Our algorithm can be substantially improved by the appropriate choice of knots. A very promising approach is to use Genetic Algorithms (Yoshimoto et al., 2003) to identify the best knot distribution for a given data set. Then use this distribution in an MCMC scheme according to what we described in order to recover uncertainties of the model defining parameters.
For models where we also include a smoothing penalty, we need to emphasise the importance of the correct procedure to obtain parameters of the hyperprior distribution (Equation 12). Failure to do so can result in bias in the marginalised distributions of model parameters. This is similar to the bias inserted from the assumption of a specific anisotropy profile. However, quantitatively this is much smaller due to the flexibility of the Bspline representation of even when we consider a smoothing penalty.
As an example of bias due to wrong estimates, we will repeat the fit in the last example, where we considered also a DM component. Using the parameters we obtained from 14 data points for the cases of isotropic and OsipkovMerritt anisotropy , we fit another model to a set of 29 data points. As explained in section 3.4.1 we should train our model for optimum smoothness with the same number of data points. Failure to do so affects the relative contribution of and to the resulting value of . This happens because the likelihood product, (see Paper I), for a larger data set contains more terms, each smaller than unity. Then the quantities and must take a different range of values in order not to have a larger effect on the resulting likelihood. The maximum value and spread of the normalised is regulated by the values of . Then these define the relative contribution for a given number of data points. Eventually it is the MCMC process that choses the distribution of values, but a wrong assumption on this hyperprior can introduce errors in the analysis.
In Table 4 we give the BIC values for a set of models with a different number of coefficients . The most probable model results for the case where . In Fig. 10 we plot the fit of lineofsight velocity dispersion (top left panel), the corresponding radial velocity dispersion (bottom left panel) and the marginalised distributions of two of the defining parameters of the mass model, namely the DM core density and of the NFW profile. The vertical red lines correspond to the reference values from which we created the synthetic data. Observe that despite the fact that we have an excellent fit for the and values, there exists a small bias in the estimates. In this example the stellar model parameters were accurately recovered, however we do not plot them here since we only wish to demonstrate the effect of incorrect penalty hyperprior with parameters that have biased marginalised distributions.
As a methodology for obtaining accurate results we propose to always fit with and without penalty to real stellar systems, and observe if there are discrepancies between resulting marginalised distributions of mass model parameters. The penaltyfree fit will always reconstruct the mass content of a stellar system however it may give unphysical variational or profiles. Then the corresponding penalised fit should result in marginalised distributions over approximately the same range and with approximately the same variance.
Finally we note that Bayesian inference for model selection heavily penalises the complexity of the models that have a greater number of coefficients and a penalty distribution . Specifically in all examples, Bayesian inference predicts decisively that models with penalty should not be considered. However, in all cases we demonstrated that fits with smoothing penalty give better results. This is a drawback that results from our inadequacy to include in the likelihood analysis (in the case where there is no penalty) the information of smoothness that the various functions of a physical system must satisfy. That is, Bayesian inference does not have any information as to why we impose a smoothing penalty distribution and penalises it. This is also directly related to our prior belief of the penalty parameter (Equations 10 and 12).
6 Conclusions
Penalty  NDim  Number of coefficients  BIC 

Yes  14  
Yes  15  
Yes  16  
Yes  17  
Yes  18 
The order of the Bspline representation of was held fixed at . From left to right: first column describes if we used penalty or not. Second column is the total dimensionality of the model. Third column is the number of coefficients and the fourth column is the value of BIC. The smallest value of BIC corresponds to the most probable model.
In this work we validate and expand the method developed in Paper I. Namely we address the issue of the massanisotropy degeneracy of the spherically symmetric Jeans equation. We present an algorithm that combines smoothing Bsplines with dynamical equations of physical systems and reconstructs accurately the kinematic profile and the mass content of a stellar system. This is for a constant or variable masstolight ratio .
Based on the assumption that a realistic physical system must have similar behaviour to ideal theoretical models in terms of smoothness of the function, we present a method for estimation of the optimum smoothing penalty for a statistical model fitting of the Bspline representation of . Furthermore we demonstrate with an example that incorrect smoothing penalty estimation can lead to biases in the defining parameters of the mass models.
We present three examples of kinematic profile reconstruction from brightness and lineofsight velocity dispersion observables. These are based on synthetic data that consist of 14 data points, each with a Gaussian random error on the reference value. The first two examples have a constant masstolight ratio , while the third example consists of a composite structure with a stellar and a DM component. In all cases we reconstruct completely the mass content of the system and the kinematic and profile.
Thus we removed the massanisotropy degeneracy to the level that for an assumed free functional form of the potential and mass density pair , and a given data set of brightness, , and lineofsight velocity dispersion observables, , we reconstruct a unique kinematic profile , within the statistical uncertainties. In Paper I we argued that if one knows the complete functional form of the lineofsight velocity dispersion and the mass density^{10}^{10}10This is valid also for cases where we have a composite system that consists of a stellar, , and a dark matter component, , i.e. . , then it is in principle possible to find a unique decomposition of to and . For the case of discrete data we can speak only for a unique family of solutions within some statistical uncertainty. This uniqueness is identified through the unimodal marginalised distributions of the model parameters from the MCMC scheme.
In general the quality of our results depends on the quality of the data. That is, on the total number of binned brightness and observables and their corresponding errors. The method was demonstrated to give excellent results for as few as 14 binned data points and up to error in the reference value from which we created the synthetic data.
Acknowledgments
F. I. Diakogiannis acknowledges the University of Sydney International Scholarship for the support of his candidature. G. F. Lewis acknowledges support from ARC Discovery Project (DP110100678) and Future Fellowship (FT100100268). The authors would like to thank Nick Bate as well as the anonymous referee for useful comments and suggestions on the manuscript.
References
 Binney & Mamon (1982) Binney J., Mamon G. A., 1982, MNRAS, 200, 361
 De Boor (1978) De Boor C., 1978, A Practical Guide to Splines. No. v. 27 in Applied Mathematical Sciences, SpringerVerlag
 Farin (2002) Farin G., 2002, Curves and surfaces for CAGD: a practical guide, 5th edn. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA
 Feroz & Hobson (2008) Feroz F., Hobson M. P., 2008, MNRAS, 384, 449
 Feroz et al. (2009) Feroz F., Hobson M. P., Bridges M., 2009, MNRAS, 398, 1601
 Gregory (2010) Gregory P., 2010, Bayesian Logical Data Analysis for the Physical Sciences
 Hastie et al. (2001) Hastie T., Tibshirani R., Friedman J., 2001, The Elements of Statistical Learning. Springer Series in Statistics, Springer New York Inc., New York, NY, USA
 King (1966) King I. R., 1966, AJ, 71, 64
 Merritt (1987) Merritt D., 1987, ApJ, 313, 121
 Michie (1963) Michie R. W., 1963, MNRAS, 125, 127
 Navarro et al. (1996) Navarro J. F., Frenk C. S., White S. D. M., 1996, ApJ, 462, 563
 Sivia & Skilling (2006) Sivia D., Skilling J., 2006, Data analysis: a Bayesian tutorial. Oxford science publications, Oxford University Press
 Yoshimoto et al. (2003) Yoshimoto F., Harada T., Yoshimoto Y., 2003, ComputerAided Design, 35, 751