Extensive Studies of the Neutron Star Equation of State from the Deep Learning Inference with the Observational Data Augmentation
Abstract
We discuss deep learning inference for the neutron star equation of state (EoS) using the real observational data of the mass and the radius. We make a quantitative comparison between the conventional polynomial regression and the neural network approach for the EoS parametrization. For our deep learning method to incorporate uncertainties in observation, we augment the training data with noise fluctuations corresponding to observational uncertainties. Deduced EoSs can accommodate a weak first-order phase transition, and we make a histogram for likely first-order regions. We also find that our observational data augmentation has a byproduct to tame the overfitting behavior. To check the performance improved by the data augmentation, we set up a toy model as the simplest inference problem to recover a double-peaked function and monitor the validation loss. We conclude that the data augmentation could be a useful technique to evade the overfitting without tuning the neural network architecture such as inserting the dropout.
1 Introduction
In the central cores of neutron stars baryonic matter is highly compressed by the gravitational force. The inward force by gravity is balanced by the outward pressure which strongly depends on intrinsic properties of cold and dense matter subject to the strong interaction. The study of inner structures of the neutron star requires a relation between the pressure and the energy density; namely, the equation of state (EoS), $p = p(\varepsilon)$, at vanishing temperature (see, e.g., Refs. Lattimer:2012nd (); Ozel:2016oaf (); Baym:2017whm (); Baiotti:2019sew (); Kojo:2020krb () for recent reviews). At the center the baryon number density may reach 5–10 times as large as the saturation density of nuclear matter, i.e., $n_0 \simeq 0.16\,\mathrm{fm}^{-3}$.
Towards the model-independent EoS determination, we should solve quantum chromodynamics (QCD), which is the first-principles theory of the strong interaction. At the moment, however, the model-independent results are only available at limited ranges of densities. In the low-density range around the nuclear saturation density we can apply ab initio methods combined with the nuclear force derived from Chiral Effective Theory (EFT) with controlled uncertainty estimates Hebeler:2009iv (); Gandolfi:2011xu (); Tews:2012fj (); Holt:2013fwa (); Hagen:2013yba (); Roggero:2014lga (); Wlazlowski:2014jna (); Tews:2018kmu (); Drischler:2020hwi () (see Ref. Drischler:2021kxf () for a recent review). At asymptotically high densities perturbative QCD calculations reliably converge Freedman:1976xs (); Freedman:1976ub (); Baluni:1977ms (); Kurkela:2009gj (); Fraga:2013qra (); Gorda:2018gpy () (see Ref. Ghiglieri:2020dpq () for a recent review). The intermediate density region, which is the one relevant for the neutron star structure, still lacks trustable QCD predictions (see also Ref. Fujimoto:2020tjc () for a recent attempt to improve the convergence in the resummed perturbation theory). To tackle the intermediate density region from QCD, we need to develop nonperturbative methods such as the Monte-Carlo simulation of QCD on the lattice (lattice QCD), but the lattice-QCD application to finite density systems is terribly hindered by the notorious sign problem (see, e.g., Ref. Aarts:2015tyj () for a review). This is why neutron-star-physics studies still rely on phenomenological EoS constructions: rather ab initio approaches Akmal:1998cf (); Togashi:2017mjp () near the saturation density, estimates employing the Skyrme interactions Douchin:2001sv (), the relativistic mean field theories Serot:1997xg (), and the functional renormalization group Drews:2016wpi (), etc.
Meanwhile, we have witnessed significant advances in neutron star observations over the last decades. Now these advances have opened an alternative pathway for the model-independent extraction of the EoS by statistical methods. Observations include the Shapiro delay measurements of massive two-solar-mass pulsars Demorest:2010bx (); Fonseca:2016tux (); Antoniadis:2013pzd (); Cromartie:2019kug (), the radius determination of quiescent low-mass X-ray binaries and thermonuclear bursters Ozel:2010fw (); Steiner:2010fz (); Steiner:2012xt (); Ozel:2015fia (); Bogdanov:2016nle () (see reviews Miller:2013tca (); Ozel:2016oaf (); Miller:2016pom () for discussions on the methods and associated uncertainties), detections of gravitational waves from binary mergers involving neutron stars by the LIGO-Virgo collaboration TheLIGOScientific:2017qsa (); Abbott:2020uma () as well as the X-ray timing measurements of pulsars by the NICER mission Riley:2019yda (); Miller:2019cac (). Typical observable quantities of neutron stars involve mass $M$, radius $R$, compactness $M/R$, tidal deformability $\Lambda$ (and their variants, e.g., Love number $k_2$), quadrupole moment $Q$, moment of inertia $I$, etc. The gravitational waves from binary neutron star mergers provide us with the information about the tidal deformability. Also, the NICER mission particularly targets the compactness of stars by measuring the gravitational lensing of the thermal emission from the stellar surface. Some of these neutron star quantities are connected through the universal relations that are insensitive to the EoS details. Among such relations, the most well-known example is the I-Love-Q relation Yagi:2013bca (); Yagi:2013awa (), which relates the moment of inertia $I$, the Love number $k_2$, and the quadrupole moment $Q$ (see Refs. Yagi:2016bkt (); Baiotti:2019sew () and references therein for other universal relations and usages).
These observations have provided some insights on the EoS and invoked discussions about nonperturbative aspects of QCD: for example, the emergence of quark matter in the neutron star Annala:2019puf () and a certain realization of duality between the confined and the deconfined matter Masuda:2012kf (); Fujimoto:2019sxg (); McLerran:2007qj (); Fukushima:2015bda (); McLerran:2018hbz (); Jeong:2019lhv (); Duarte:2020xsp (); Zhao:2020dvu (); Fukushima:2020cmk (). Moreover, these observations, particularly the gravitational wave observation, may provide a unique opportunity for studying the hadron-to-quark phase transition occurring in the binary star merger, and even better sensitivity will be achieved with future detectors Most:2018eaw (); Bauswein:2018bma (); Most:2019onn ().
For the purpose of extracting the most likely EoS from the observational data, there are diverse statistical technologies. The commonly used method is the Bayesian inference Ozel:2010fw (); Steiner:2010fz (); Steiner:2012xt (); AlvarezCastillo:2016oln (); Raithel:2016bux (); Raithel:2017ity (); Raaijmakers:2019dks (); Raaijmakers:2019qny (); Blaschke:2020qqj (). There are still other methods such as the one based on the Gaussian processes, which is a variation of the Bayesian inferences with nonparametric representation of the EoS Landry:2018prl (); Essick:2019ldf (); Essick:2020flb (), etc. Despite numerous efforts to quantitatively constrain the EoS, it is still inconclusive what the genuine EoS should look like because of uncertainties from the assumed prior distribution in the Bayesian analysis. Therefore, the EoS inference program is in need of complementary tools. As an alternative to these conventional statistical methods, deep learning, namely, the machine learning method devising deep neural networks (NNs), has been successfully applied to a wide range of physics fields Pang:2016vdc (); Mori:2017pne (); Hezaveh:2017sht (); Chang:2017kvc (); Niu:2018csp (); Kaspschak:2020ezh (); Wang:2020hji (). Several studies have already put the NN in motion in the context of gravitational wave data analysis of binary neutron star mergers George:2016hay (); Carrillo:2016kvt (); George:2017pmj (); George:2018awu (). Along these lines, in our previous publications Fujimoto:2017cdo (); Fujimoto:2019hxv () we proposed a novel approach to the EoS extraction based on the deep machine learning (see also Refs. Ferreira:2019bny (); Morawski:2020izm (); Traversi:2020dho () for related works).
In the present work we make further comprehensive analyses using the machine learning method developed previously Fujimoto:2017cdo (); Fujimoto:2019hxv (). Here we give a brief synopsis of our machine learning approach to the neutron star EoS problem. Given an EoS, various pairs of the stellar mass and radius, $(M, R)$, follow from the general relativistic structural equation, i.e., the Tolman-Oppenheimer-Volkoff (TOV) equation Tolman:1939jz (); Oppenheimer:1939ne (). We will express the inverse operation of solving the TOV equation in terms of the deep NN, train the NN with a sufficiently large data set of mock observations, and eventually obtain a regressor that properly infers the most likely EoS corresponding to the observational $M$-$R$ inputs. While the other approaches to the neutron star EoS explicitly assume a prior distribution for the EoS probability distributions, our method only implicitly depends on the prior assumption. Therefore, in a sense that the implementations are based on different principles and the prior dependences appear differently, we can consider our method as complementary to other conventional approaches. We also note that independent algorithms would help us with gaining some information about systematics. For more detailed discussions on the prior, see Sec. 2.2.
This paper is organized as follows. Section 2 provides a detailed account on our problem setting, i.e., mapping neutron star observables to the EoS. Then, we outline our method by setting forth our parametrization for the EoS and reviewing the machine learning technique in general. Section 3 is devoted to the methodology study using mock data. We describe our data generating procedure and training setups, and then give a brief overview of the evolution of learning processes, which motivates the later analyses in Sec. 5. We showcase the typical examples out of the whole results, and then we carry out statistical analyses for the data as a whole. We confront our NN method with a different method, namely, the polynomial regression, and compare the two methods to show that our method surpasses the polynomial regression. In Sec. 4, we present the EoSs deduced from several different sets of real neutron star observables. We explain our treatment of uncertainty quantification along with the possibility of a weak first-order phase transition in the deduced EoS results. In Sec. 5 we study the efficiency of data augmentation in the context of remedying the overfitting problem. Finally, we summarize this work and discuss future outlooks in Sec. 6. Throughout this paper we use the natural unit system with $\hbar = c = 1$ unless otherwise specified.
2 Supervised Learning for the EoS Inference Problem
In this section we explicitly define our problem and summarize the basic strategy of our approach with the supervised machine learning. We will explain the concrete setup in each subsection where we adjust the strategy in accordance with the goal of each subsection. In the present study we want to constrain the EoS from the stellar observables. The EoS and the observables are nontrivially linked by the TOV equation which gives a means to calculate the neutron star structure from the EoS input. Thus, constraining the EoS from the observables is the inverse process of solving the TOV equation, but this inverse problem encounters difficulties from the nature of the observations. In Sec. 2.1 we formulate the inverse problem of the TOV equation and discuss its difficulties. We proceed to our approach to this problem using the supervised machine learning in Sec. 2.2. We closely describe the EoS parametrization and the data generation in Sec. 2.3. Then, we explain the design of the deep NN in Sec. 2.4 and its training procedures in Sec. 2.5.
2.1 TOV mapping between the EoS and the $M$-$R$ relation
In the present analysis we focus on the mass and the radius as the neutron star observables. Given a boundary condition of the core pressure $p_c$, the observables $M$ and $R$ of such a neutron star can be determined by solving the TOV equation Tolman:1939jz (); Oppenheimer:1939ne ():
(1) $\dfrac{dp}{dr} = -\,G\,\dfrac{\bigl[\varepsilon(r) + p(r)\bigr]\bigl[m(r) + 4\pi r^3 p(r)\bigr]}{r\,\bigl[r - 2G\,m(r)\bigr]}\,,$
(2) $\dfrac{dm}{dr} = 4\pi r^2\,\varepsilon(r)\,,$
where $r$ is the radial coordinate which represents a distance from the stellar center. The functions, $p(r)$ and $\varepsilon(r)$, are the pressure and the energy density (i.e., the mass density), respectively, at the position $r$. The function $m(r)$ represents a mass enclosed within the distance $r$. We can readily integrate these coupled differential equations from $r = 0$ with the initial condition, $p(r=0) = p_c$ and $m(r=0) = 0$, towards the outward direction. The radius of the neutron star, $R$, is defined by the surface condition: $p(r=R) = 0$. The mass of the neutron star is the total mass within $R$, i.e., $M = m(R)$.
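To make the integration procedure concrete, the following is a minimal sketch of an outward Euler integration of Eqs. (1) and (2) in geometrized units ($G = c = 1$). The function name `solve_tov`, the step sizes, and the unit conventions are our own illustrative choices, not the setup used in this work.

```python
import numpy as np

# Geometrized units G = c = 1: r and m are measured in km
# (1 solar mass ~ 1.4766 km), so epsilon and p carry units of km^-2.
MSUN_KM = 1.4766

def solve_tov(eps_of_p, p_c, dr=0.001, r_max=50.0):
    """Integrate the TOV equations (1)-(2) outward from the center.

    eps_of_p : callable giving the EoS in inverted form, epsilon(p)
    p_c      : central pressure, the boundary condition at r = 0
    Returns (R in km, M in solar masses) at the surface where p -> 0.
    """
    r, p, m = dr, p_c, 0.0
    while p > 0.0 and r < r_max:
        eps = eps_of_p(p)
        # Eq. (2): dm/dr = 4 pi r^2 epsilon
        dm = 4.0 * np.pi * r**2 * eps
        # Eq. (1): dp/dr = -(eps + p)(m + 4 pi r^3 p) / (r (r - 2 m))
        dp = -(eps + p) * (m + 4.0 * np.pi * r**3 * p) / (r * (r - 2.0 * m))
        p += dp * dr
        m += dm * dr
        r += dr
    return r, m / MSUN_KM
```

For instance, with a toy polytrope $p = K\varepsilon^2$, so that $\varepsilon(p) = \sqrt{p/K}$, and an illustrative central pressure, the solver returns a radius of order 10 km and a mass of order one solar mass; a production code would use a higher-order integrator and an adaptive step near the surface.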
The only missing information in the combined TOV equations (1) and (2) is a relationship between $p$ and $\varepsilon$, which is nothing but the EoS, i.e., $p = p(\varepsilon)$. Once an EoS is fixed, we can draw a trajectory of $(R(p_c), M(p_c))$ parametrized by $p_c$ in the $R$-$M$ plane. This trajectory is called the $M$-$R$ relation. Thus, the operation of solving the TOV equation can be considered as a mapping from the functional space of the EoSs onto the functional space of the $M$-$R$ relations. Here we call this mapping the “TOV mapping” denoted by $\mathcal{F}$:
(3) $\mathcal{F} : \{\mathrm{EoS}\} \;\longrightarrow\; \{M\text{-}R\ \mathrm{relation}\}\,.$
It is known that the mapping $\mathcal{F}$ is invertible, i.e., the inverse mapping $\mathcal{F}^{-1}$ exists Lindblom:1992 (). Thus, ideally, one could uniquely reconstruct the EoS from an observed $M$-$R$ relation. However, practically, we cannot access the entire continuous curve of the $M$-$R$ relation (which we will briefly call the $M$-$R$ curve) from neutron star observations. In reconstructing the EoS from the observational data, we are facing twofold obstacles:

Incompleteness — $M$-$R$ data from the observation does not form a continuous curve. Since one neutron star corresponds to one point on the $M$-$R$ curve, the observation of neutron stars gives a limited number of $M$-$R$ points and only partial information on the original $M$-$R$ curve is available. Even with a sufficiently large number of neutron stars, the information on possible unstable branches of the $M$-$R$ curve, on which the neutron stars do not exist, would be missing.

Uncertainties — Real data is not a point but a distribution. Each $(M_i, R_i)$ point is accompanied by observational uncertainties represented as a credibility distribution. Moreover, the central peak position of the distribution is not necessarily on the genuine $M$-$R$ relation.
We can introduce the following symbolic representation of the above complication associated with the observational limitation:
(4) $\mathcal{O}_a : \{M\text{-}R\ \mathrm{relation}\} \;\longrightarrow\; \{(M_i, R_i)\ \mathrm{data}\}\,.$
This mapping by $\mathcal{O}_a$ is a convolution of a projection from the $M$-$R$ relation to a set of points along the $M$-$R$ curve and a probability distribution due to observational uncertainties. The distribution generally depends on observational scenarios (how many and what kinds of neutron stars are detected by which signals), and such dependence is collectively represented by the label $a$. With this mapping, the relation between the EoS and the observed $M$-$R$ data is more correctly expressed as
(5) $\{(M_i, R_i)\} = \mathcal{O}_a \circ \mathcal{F}\,(\mathrm{EoS})\,.$
Because of this additional filter by $\mathcal{O}_a$, one cannot reconstruct the correct EoS from the observational data just by calculating $\mathcal{F}^{-1}$. Since a part of the original information is lost through $\mathcal{O}_a$, this is an ill-posed problem, i.e., there is no unique well-defined answer due to insufficient information and uncertainties inherent in this problem.
Here, instead of reconstructing the unique EoS, we try to infer the most likely EoS from $M$-$R$ observations with uncertainty; that is, the regression analysis of the inverse mapping including $\mathcal{O}_a$ (see Fig. 1 for a schematic illustration). In particular it is crucial to develop an approach leading to outputs robust against the observational uncertainties of the $M$-$R$ data.
2.2 General strategy for the machine learning implementation
To solve this ill-posed problem we employ a method of machine learning with deep NNs to infer the neutron star EoS. In particular we consider the supervised machine learning to train the NN with as many and different $a$'s as possible. The trained NN can then receive the $M$-$R$ data for one particular $a$ and return the output of the most likely EoS correspondingly. To put it another way, the “inversion” of the mapping is approximated by the regression, particularly by the deep NN in our approach, which would be symbolically written as $(\mathcal{O}_a \circ \mathcal{F})^{-1}$.
In the supervised learning, we need to prepare the training data and the validation data composed of pairs of the input $M$-$R$ data and the desired output EoS. To generate the training data efficiently, we make use of the asymmetry between $\mathcal{O}_a \circ \mathcal{F}$ and its inverse; that is, we can straightforwardly calculate the forward mapping $\mathcal{O}_a \circ \mathcal{F}$ by modeling the observation $\mathcal{O}_a$, while the latter inverse mapping, which is what we currently want to know, is more nontrivial. Thus, first, we randomly generate many possible answer EoSs represented by several parameters. We next generate the corresponding $M$-$R$ data by applying the forward mapping with various $a$'s. We will explain technical details of handling and simplifying $\mathcal{O}_a$ in Sec. 2.3.
Now, we turn to the description of the architecture of our NN model, and we will optimize the model parameters so that the model can infer the answer EoS corresponding to the input training $M$-$R$ data. During the training, it is important to monitor the training quality by checking the prediction error behavior for the validation data. After the optimization process is complete with good convergence of the validation error, we can test the predictive power using the mock data for which the true answer EoS is known (see Sec. 3 for actual calculations). Once the method passes all these qualification tests, we finally proceed to the application of the real observational data to obtain the most realistic EoS (see Sec. 4 for details).
Here, we comment on an alternative possibility of the inference problem formulation. As already mentioned, the inverse TOV mapping, $\mathcal{F}^{-1}$, is well-defined by itself. Thus, it is also feasible to decompose the inference as $(\mathcal{O}_a \circ \mathcal{F})^{-1} = \mathcal{F}^{-1} \circ \mathcal{O}_a^{-1}$ and train the NN aimed to approximate $\mathcal{O}_a^{-1}$; that is, the NN model would predict the $M$-$R$ curve from the input $M$-$R$ data. We will not adopt this strategy since it is unclear in this approach how to measure the reconstruction performance of the $M$-$R$ curve. Our target space is the EoS space, but the distance in the $M$-$R$ space is not uniformly reflected in the distance in the desired EoS space. We are ultimately interested in the EoS, so it is natural to directly model $(\mathcal{O}_a \circ \mathcal{F})^{-1}$ by maximizing the performance for the EoS reconstruction.
We also make a remark on the discriminative modeling and the generative modeling to clarify a difference between our machine learning method and other statistical approaches such as the Bayesian inference. The approach we are advocating in this paper belongs to the discriminative modeling: our NN model directly predicts the response variable, i.e., the EoS, for a fixed set of explanatory variables, i.e., the $M$-$R$ data. In contrast, the statistical approaches such as the Bayesian inference for the neutron star EoS are to be categorized as the generative modeling. The Bayesian inference first models the occurrence of the combination of the explanatory and response variables by the joint probability, $P(\mathrm{EoS}, \mathrm{data}) = P(\mathrm{data}\,|\,\mathrm{EoS})\,P(\mathrm{EoS})$. Here, $P(\mathrm{EoS})$ models the prior belief on the EoS, and $P(\mathrm{data}\,|\,\mathrm{EoS})$ models the observation process. Then, the Bayes theorem gives a posterior distribution of the EoS, $P(\mathrm{EoS}\,|\,\mathrm{data})$, from the joint probability, which corresponds to the likely EoS distribution for given $M$-$R$ data.
These discriminative and generative approaches can be considered to be complementary to each other. Their pros and cons are as follows:

The advantage in the generative approaches is that the EoS prediction for given $M$-$R$ data takes a distribution form, so that the prediction uncertainties can be estimated naturally. In the discriminative model, on the other hand, an additional analysis is needed to estimate the prediction uncertainties. We will perform the additional uncertainty analysis within our method in Sec. 4.3. At the same time, we will utilize our uncertainty analysis for a physics implication as discussed in Sec. 4.5.

In the discriminative modeling, once the NN training is completed, the most likely EoS corresponding to given $M$-$R$ data can immediately result from the mapping represented by the NN: the NN models the inference procedure by itself. Thus, the computational cost is much smaller than in statistical analyses, which enables us to perform supplementary estimation on observables and/or quality assessments if necessary. In the generative approaches, on the other hand, the posterior calculation is computationally expensive because the EoS parameter space is large. In addition, one needs to calculate the posterior for each $M$-$R$ data set separately. If one applies the Bayesian inference to the real observational data only once, this computational cost would not matter. The computational cost becomes problematic, however, if one wants to examine the statistical nature of the inference using a large number of generated EoSs and $M$-$R$ data.
2.3 Random EoS generation with parametrization by the speed of sound
The training data consists of pairs of the input $M$-$R$ data and the output EoS, $p(\varepsilon)$, parametrized by a finite number of variables. We first generate EoSs by randomly assigning values to the EoS parameters, solve the TOV equation to find the corresponding $M$-$R$ curve, and finally generate the $M$-$R$ data. The $M$-$R$ data generation involves $\mathcal{O}_a$, namely, we sample observational points of neutron stars on the $M$-$R$ curve, and probabilistically incorporate finite observational uncertainties, which should mimic observations in the scenario $a$. The real $M$-$R$ data is composed of the probability distribution for each neutron star on the $M$-$R$ plane, but in practice we should simplify $\mathcal{O}_a$ and need to parametrize the $M$-$R$ data as well. We will discuss the parametrization later, and in this section we focus on our EoS parametrization and generation.
In designing the EoS parametrization, it is important to consider its affinity to the physical constraints such as the thermodynamic stability and the causality, i.e., $p$ should be a monotonically non-decreasing function of $\varepsilon$, and the speed of sound, $c_s$, should not be larger than the speed of the light. For this reason, the speed of sound squared, $c_s^2$, which is defined by
(6) $c_s^2 = \dfrac{\partial p}{\partial \varepsilon}\,,$
is the convenient choice as the EoS parameter to impose the physical constraints easily.
Also, we need to smoothly connect the parametrized EoS to the empirical EoS known from the low-density region. Up to the energy density of the saturated nuclear matter, $\varepsilon_0$, we use a conventional nuclear EoS, specifically SLy4 Douchin:2001sv (). Because the EoS is well constrained by the saturation properties and the symmetry energy measurements near the saturation density, this specific choice of SLy4 would not affect our final results. In the energy region above $\varepsilon_0$ we employ the standard piecewise polytropic parametrization and partition the density range into a certain number of segments. In our present analysis we choose six energy points, $\varepsilon_i$ ($i = 0, 1, \dots, 5$) with $\varepsilon_{i=0} = \varepsilon_0$, equally separated in the logarithmic scale. Then, the intervals $[\varepsilon_{i-1}, \varepsilon_i]$ ($i = 1, \dots, 5$) form 5 segments.
The EoS parameter is the average speed of sound squared, $\bar{c}_{s,i}^2$ ($i = 1, \dots, 5$), in the $i$-th segment.
From these definitions we see that the pressure values at the $i$-th segment boundaries, $p_i \equiv p(\varepsilon_i)$, are read as
(7) $p_i = p_0 + \displaystyle\sum_{j=1}^{i} \bar{c}_{s,j}^2\,(\varepsilon_j - \varepsilon_{j-1})\,,$
where $p_0 = p(\varepsilon_0)$ is determined from our chosen nuclear EoS, i.e., SLy4. We make polytropic interpolation for the EoS by $p(\varepsilon) = K \varepsilon^{\Gamma}$ whose exponent and coefficient in the $i$-th segment are given by
(8) $\Gamma_i = \dfrac{\ln(p_i / p_{i-1})}{\ln(\varepsilon_i / \varepsilon_{i-1})}\,, \qquad K_i = \dfrac{p_i}{\varepsilon_i^{\Gamma_i}}\,.$
For the EoS generation we randomly assign the average sound velocities, $\bar{c}_{s,i}^2$, from the uniform distribution within $(0, 1)$. The upper bound comes from the causality, $c_s^2 \le 1$, and the lower bound from the thermodynamic stability, $c_s^2 \ge 0$. Here, a small margin is inserted as a regulator to avoid singular behavior of the TOV equation. We here note that we allow for small $\bar{c}_{s,i}^2$ corresponding to a (nearly) first-order phase transition. Repeating this procedure, we generate a large number of EoSs. For each generated EoS, we solve the TOV equation to obtain the $M$-$R$ curve as explained in Sec. 2.1. In Fig. 2 we show the typical EoS data and the corresponding $M$-$R$ curves used in our later analyses. It should be noted here that we excluded artificial biases in the EoS generation and allowed for any possible EoSs including ones disfavored by the model calculations or the observational data (e.g., an EoS whose maximum mass far exceeds the observed masses in the right panel of Fig. 2). This is important for the NN to be free from biases coming from specific model preference. In this way our analysis gains enough generalization ability, covers even exotic scenario cases, and captures the correct underlying relations between the input and the output.
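As an illustration of this generation step, the sketch below draws random average sound speeds and builds the piecewise-polytropic $p(\varepsilon)$ of Eqs. (7) and (8). The anchor values `eps0` and `p0`, the upper end of the energy grid, and the size of the margin are hypothetical placeholders standing in for the SLy4 matching point, not the actual numbers used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six energy points equally spaced in the log scale; eps0, p0, and the
# upper end are illustrative placeholders (p0 would come from SLy4).
eps0, p0 = 150.0, 4.0                  # MeV/fm^3, hypothetical anchor
eps = np.geomspace(eps0, 8.0 * eps0, 6)

# Random average sound speeds squared in the 5 segments, within (0, 1),
# with a small margin as a regulator away from cs^2 = 0.
margin = 0.01                          # illustrative value
cs2 = rng.uniform(margin, 1.0, size=5)

# Eq. (7): pressures at the segment boundaries
p = np.concatenate(([p0], p0 + np.cumsum(cs2 * np.diff(eps))))

# Eq. (8): polytropic interpolation p = K eps^Gamma within each segment
Gamma = np.log(p[1:] / p[:-1]) / np.log(eps[1:] / eps[:-1])
K = p[1:] / eps[1:] ** Gamma

def pressure(e):
    """Evaluate the piecewise-polytropic p(eps) for eps0 <= e <= eps[-1]."""
    i = min(max(np.searchsorted(eps, e) - 1, 0), 4)
    return K[i] * e ** Gamma[i]
```

Since every $\bar{c}_{s,i}^2$ is positive, the boundary pressures are automatically monotonic, and the interpolation reproduces $p_i$ exactly at each boundary by construction.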
Before closing this part, let us mention a possible improvement for the random EoS generation, though we would not utilize this improvement to keep our present approach as simple as possible. The problem arises from our assumed uniform distribution of $\bar{c}_{s,i}^2$. The parametrization and generation algorithm as explained above definitely allows for a first-order phase transition for sufficiently small $\bar{c}_{s,i}^2$'s; however, due to the uniform distribution, it is a tiny percentage among the whole data for the generated EoSs to accommodate a first-order phase transition, and this may be a part of our “prior” dependence. Since a strong first-order transition scenario has already been ruled out from phenomenology, this prior dependence should be harmless. Nevertheless, if necessary, we can naturally increase the percentage of the EoSs with a first-order phase transition in a very simple way as follows: In Eq. (8) we carry out the linear interpolation in the log-log plane. Alternatively, we can use the spline interpolation in the log-log plane. With such a smooth interpolation, there appear the energy density regions with negative $c_s^2$, i.e., the EoS can be non-monotonic. We can then replace this non-monotonic part by the first-order phase transition using the Maxwell construction. This is one effective and natural way to enhance a finite fraction of EoSs with a first-order phase transition while keeping the same EoS parameters. We also note that one more merit in this procedure is that the end points of the first-order energy regions are not restricted to be on the segment boundaries. Thus, the strength of the first-order phase transition is not necessarily discretized by the segment width, but an arbitrarily weak first-order phase transition could be realized with the same parametrization. We stress again that we would not adopt this procedure: the present observational precision gives a tighter limitation, and for the moment we do not benefit from this EoS improvement. In the future when the quality and the quantity of the observational data ameliorate, the above-mentioned procedure could be useful.
2.4 Neural network design: a general introduction
The neural network (NN) is one representation of a function with fitting parameters. The deep learning, i.e., the machine learning method using a deep NN, is nothing but the optimization procedure of the parameters contained in the function represented by the NN. In particular, the supervised learning can be considered as the fitting problem (namely, regression) for given pairs of the input and the output, i.e., the training data. We often refer to an NN “model” to mean the optimized system of the function given in terms of the NN.
The advantage of the deep learning, as compared to naive fitting methods, lies in the generalization properties of the NN. We need not rely on any prior knowledge about reasonable forms of fitting functions: the NN with a sufficient number of neurons is capable of capturing any continuous function citeulike:3561150 (); HORNIK1991251 (). With a large number of neurons (and layers), there are many fitting parameters, and recent studies of machine learning have developed powerful optimization methods for such complicated NNs.
The model function of a feedforward NN can be expressed as follows:
(9) $y = f\bigl(x \,\big|\, \{W^{(\ell)}, b^{(\ell)}\}\bigr)\,,$
where $x$ and $y$ are the input and the output, respectively. The fitting parameters, $W^{(\ell)}$ and $b^{(\ell)}$, are given in the form of matrices and vectors, respectively. For calculation steps, we first prepare $L$ layers (including the input and the output layers). Each layer $\ell$ ($\ell = 1, \dots, L$) consists of $N_\ell$ nodes called neurons corresponding to the vector components, $x^{(\ell)}_i$. We note that the number of neurons, $N_\ell$, can be different for different $\ell$. The input $x$ is set to the first layer (which is labeled as $\ell = 1$ in this paper), i.e., $x^{(1)} = x$, and the subsequent layers are calculated iteratively as
(10) $u^{(\ell)} = W^{(\ell)}\, x^{(\ell-1)} + b^{(\ell)}\,,$
(11) $x^{(\ell)} = \sigma^{(\ell)}\bigl(u^{(\ell)}\bigr)\,,$
The $L$-th layer becomes the final output, $y = x^{(L)}$. Here, $\sigma^{(\ell)}$'s are called activation functions, whose typical choices include the sigmoid function: $\sigma(u) = 1/(1 + e^{-u})$, the ReLU: $\sigma(u) = \max(0, u)$, the hyperbolic tangent: $\sigma(u) = \tanh u$, etc. In the above notation $\sigma^{(\ell)}$ returns a vector by applying the activation to each vector component of the function argument, i.e., $[\sigma(u)]_i = \sigma(u_i)$. The fitting parameters, $W^{(\ell)}$ and $b^{(\ell)}$, on the $\ell$-th layer denote the weights between nodes in the two adjacent layers and the activation offset at each neuron called the bias, respectively. These fitting parameters have no physical interpretation at the setup stage. The general architecture is schematically depicted in Fig. 3, in which the calculation proceeds from the left with the input $x$ to the right with the output $y$.
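The iteration of Eqs. (10) and (11) can be written down compactly. The sketch below is a plain NumPy forward pass: the input size 30 corresponds to 15 $(M_i, R_i)$ pairs and the output size 5 to the EoS parameters, but the hidden widths, random weights, and activation choices are arbitrary illustrations, not those of our trained network.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def feedforward(x, weights, biases, activations):
    """Eqs. (10)-(11): iterate u = W x + b, x = sigma(u) layer by layer."""
    for W, b, sigma in zip(weights, biases, activations):
        x = sigma(W @ x + b)
    return x

# A small net mapping 30 inputs to 5 outputs; hidden widths are arbitrary.
rng = np.random.default_rng(1)
sizes = [30, 60, 40, 5]
Ws = [rng.normal(0.0, 0.1, (n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
y = feedforward(rng.normal(size=30), Ws, bs, [relu, relu, sigmoid])
```

The final sigmoid here keeps every output component inside $(0, 1)$, which is a natural range when the outputs are interpreted as the sound speeds $\bar{c}_{s,i}^2$.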
The complexity of the NN structure, such as the number of layers and neurons, involves a trade-off between the fitting power and the overfitting. To capture the essence of the problem, the numbers of layers and neurons should be sufficiently large. At the same time, to avoid the overfitting problem, and to train the NN within a reasonable time, the numbers of layers and neurons should not be too large. There is no universal prescription, and the performance depends on the problem to be solved. For our setup we found overall good performance when we chose the neuron number on the first hidden layer greater than the input neuron number. A more sophisticated NN may be possible, but for our present purpose a simple network structure as in Fig. 3 is sufficient.
2.5 Loss function and training the neural network
For the actual procedure of machine learning we need to choose a loss function and minimize it, which is called training. We define a loss function as
(12) $L(\theta) = \displaystyle\int dx\, dy\, P(x, y)\, d\bigl(y, f(x\,|\,\theta)\bigr)\,,$
where $d(y, y')$ quantifies the distance or error between the NN prediction $y' = f(x\,|\,\theta)$ and the answer $y$ provided in the training data, $P(x, y)$ is the distribution of the data, and $\theta$ collectively denotes the fitting parameters $\{W^{(\ell)}, b^{(\ell)}\}$. The exact formula for the distribution is not necessarily known in general. Moreover, even if it is known, the multidimensional integration in Eq. (12) is practically intractable, so let us approximate it with the sum over the samples in the training data, $\{(x_n, y_n)\}$, with $N$ denoting the sample size:
(13) $L(\theta) \approx \dfrac{1}{N} \displaystyle\sum_{n=1}^{N} d\bigl(y_n, f(x_n\,|\,\theta)\bigr)\,.$
The optimization problem we deal with here is to minimize the above loss function by tuning the network parameters $\theta$. The minimization is usually achieved with the stochastic gradient descent method or its improved versions. Deterministic methods would be easily trapped in local minima or saddle points of the loss function. In practice, therefore, the stochastic algorithms are indispensable. In the gradient descent method the parameters at the iteration step $t$ are updated along the direction of the derivative of the loss function with respect to the parameters:
(14) $W^{(\ell)}_{t+1} = W^{(\ell)}_t - \eta\, \dfrac{\partial L}{\partial W^{(\ell)}}\bigg|_{\theta = \theta_t}\,,$
where $\eta$ is called the learning rate. We can update $b^{(\ell)}$ in the same way. We usually evaluate the derivatives by using the minibatch method. In the minibatch method, the training data set is first randomly divided into multiple subsets, which are called minibatches. Then, the derivative of the loss function is estimated within a minibatch $B$. Using the explicit expression (13), the approximated derivatives read:
(15) $\dfrac{\partial L}{\partial \theta} \approx \dfrac{1}{N_{\mathrm{batch}}} \displaystyle\sum_{n \in B} \dfrac{\partial}{\partial \theta}\, d\bigl(y_n, f(x_n\,|\,\theta)\bigr)\,,$
where the batch size $N_{\mathrm{batch}}$ denotes the number of sample points in $B$, whose optimal choice differs case by case. The epoch counts the number of scans over the entire training data set. The parameters are updated for each minibatch, so one epoch amounts to $N / N_{\mathrm{batch}}$ updates. Usually, the derivative appearing in Eq. (15) is evaluated by a method called backpropagation.
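As a self-contained illustration of the minibatch updates in Eqs. (14) and (15), the toy example below fits a single parameter by gradient descent with an analytic gradient in place of backpropagation; the data, learning rate, batch size, and epoch count are arbitrary choices for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression: fit y = w * x with a squared-error loss.
N, batch_size, eta = 1000, 100, 0.1
x = rng.normal(size=N)
y = 3.0 * x + 0.01 * rng.normal(size=N)   # true slope is 3

w = 0.0
for epoch in range(20):                    # one epoch = one scan over the data
    perm = rng.permutation(N)              # re-shuffle into minibatches
    for start in range(0, N, batch_size):
        batch = perm[start:start + batch_size]
        # Eq. (15): gradient of the loss estimated on the minibatch only
        grad = np.mean(2.0 * (w * x[batch] - y[batch]) * x[batch])
        # Eq. (14): gradient-descent update with learning rate eta
        w -= eta * grad
```

After training, `w` settles close to the true slope; the stochasticity from re-shuffling the minibatches is precisely what helps the method escape saddle points in the full NN setting.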
For numerics we basically use a Python library, Keras software:Keras () with TensorFlow arXiv:1605.08695 () as a backend. In this work we specifically employ msle, i.e., the mean squared logarithmic error, for the choice of the loss $d$ appearing in Eq. (12); that is,
(16) $d(y, y') = \dfrac{1}{N_{\mathrm{out}}} \displaystyle\sum_{i=1}^{N_{\mathrm{out}}} \bigl[\log(y_i + 1) - \log(y'_i + 1)\bigr]^2\,,$
where $N_{\mathrm{out}}$ denotes the dimension of the output.
We use a specific fitting algorithm, Adam DBLP:journals/corr/KingmaB14 (), with the minibatch size being 100 for the analysis in Sec. 3 and 1000 for the analysis in Sec. 4. We initialize our NN parameters with the Glorot uniform distribution.
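For reference, the msle loss of Eq. (16) can be reproduced in a few lines of NumPy; the function below is our own sketch mirroring the definition used by Keras's `msle`.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error, as in Eq. (16):
    the squared difference of log(1 + y), averaged over output components."""
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
```

The logarithm makes the loss sensitive to relative rather than absolute errors, which suits the EoS parameters spanning different magnitudes; `np.log1p` is used for numerical accuracy near zero.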
3 Extensive Methodology Study and the Performance Tests
This section is an extensive continuation from our previous work on the methodology study Fujimoto:2017cdo (). We here present more detailed and complete investigations of various performance tests.
In this section, to put our considerations under theoretical control, we assume a hypothetical observational scenario in which 15 neutron stars are observed and their masses and radii $(M_i, R_i)$ are known with a fixed size of observational uncertainty, following the previous setup in Refs. Ozel:2015fia (); Bogdanov:2016nle (); Ozel:2016oaf (); Fujimoto:2017cdo (). We design a NN, generate the validation data as well as the training data, and train the NN to predict the EoS parameters from the artificially generated $M$-$R$ data (i.e., mock data). We discuss the training behavior of the NN, the reconstruction performance of the EoS and the $M$-$R$ curve, and the distribution of reconstruction errors.
3.1 Mock data generation and training
For the current study in this section, we deal with the input $M$-$R$ data, $(M_i, R_i)$, and the output EoS parameters, as described in the previous section. A combination of these input and output data forms a sample point in the training and the validation data, and we need to generate many independent combinations to prepare the training and the validation data sets.
The procedure for sampling the $M$-$R$ points is sketched in Fig. 4. We follow the basic strategy described in Sec. 2.3 and introduce minor modifications for the current setup. For each randomly generated EoS we solve the TOV equation to get the $M$-$R$ curve and identify the maximum mass of the neutron star, corresponding to (1) and (2) in Fig. 4. If the maximum mass does not reach the observed maximum, such an EoS is rejected from the ensemble. Then, for each EoS, we randomly sample 15 pairs of $(M_i, R_i)$ on the corresponding $M$-$R$ curve, as shown in (3) in Fig. 4, assuming a uniform distribution of $M$ over the allowed mass interval. If there are multiple values of $R$ corresponding to one $M$, we take the largest $R$, discarding the others belonging to unstable branches. In this way we select 15 points of $(M_i, R_i)$ on the $M$-$R$ curve.
We also expect the NN to learn that observational data should include uncertainties, $\Delta M_i$ and $\Delta R_i$. We randomly generate $\Delta M_i$ and $\Delta R_i$ according to the normal distributions with assumed variances, $\sigma_M^2$ for the mass and $\sigma_R^2$ for the radius. These variances should be adjusted according to the real observational uncertainty, as we will do in Sec. 4, but for the test purpose in this section we simplify the treatment by fixing $\sigma_M$ and $\sigma_R$ by hand. Furthermore, the distributions are not necessarily centered on the genuine $M$-$R$ curve, as in (3) in Fig. 4. So, we shift the data points from $(M_i, R_i)$ to $(M_i + \Delta M_i, R_i + \Delta R_i)$. Importantly, we repeat this procedure to generate an ensemble of randomly shifted data for each EoS; the number of repetitions (and thus the number of shifted data sets obtained per EoS) is denoted by $n_s$ in this work, and its effect on learning will be carefully examined in later discussions. For studies in this section, we have prepared 1948 EoSs (2000 generated and 52 rejected), so the total size of the training data set is $1948 \times n_s$. We find that the learning proceeds faster if the data are normalized appropriately to be consistent with the Glorot initialization; we rescale the $M$ and $R$ inputs by constants chosen for this purpose.
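The shifting procedure above can be sketched as follows. The genuine points, $\sigma$ values, and $n_s$ in the demo call are placeholders, since the actual values used in this section are fixed separately:

```python
import numpy as np

rng = np.random.default_rng(1)

def augment_observation(m_true, r_true, sigma_m, sigma_r, n_s):
    """Produce n_s noisy copies of one set of genuine (M, R) points by
    adding independent Gaussian shifts of widths sigma_m and sigma_r."""
    m_true = np.asarray(m_true)
    r_true = np.asarray(r_true)
    dm = rng.normal(0.0, sigma_m, size=(n_s, m_true.size))
    dr = rng.normal(0.0, sigma_r, size=(n_s, r_true.size))
    return m_true + dm, r_true + dr

# One mock set of 15 genuine points -> n_s shifted training samples
m, r = augment_observation(np.linspace(1.0, 2.0, 15),
                           np.linspace(10.0, 13.0, 15),
                           sigma_m=0.1, sigma_r=0.5, n_s=100)
print(m.shape, r.shape)  # (100, 15) (100, 15)
```

Generating the noisy copies is trivially cheap compared with solving the TOV equation, which is why increasing $n_s$ is essentially costless.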
The NN is optimized to fit the training data so as to have a predictive power. To test it, we use the validation loss calculated for the validation data, i.e., independent mock data of the neutron star observation. We separately generate 200 EoSs, among which 196 EoSs pass the twosolarmass condition, and sample an observation for each EoS to generate the validation data set.
It is worth mentioning here that calculating $M$-$R$ curves requires repeated numerical integration of the TOV equation with various boundary conditions, which is computationally expensive. Generally speaking, in the machine learning approach, the preparation of the training data is often the most time-consuming part. In contrast, increasing the sample size by repeating the observation procedure does not require further computational cost. This procedure can actually be regarded as a kind of data augmentation by noise injection, where the observational errors play the role of the injected noise. Let us call this procedure the observational data augmentation hereafter. This observational data augmentation actually has another virtue, which will be discussed in later sections.
Layer index  Neurons  Activation Function 
0  30  N/A 
1  60  ReLU 
2  40  ReLU 
3  40  ReLU 
4  5  tanh 
We are constructing a NN that can give us one EoS in the output side in response to one “observation” of 15 neutron stars, $(M_i, R_i)$ ($i = 1, \dots, 15$), in the input side. Thus, the number of neurons in the input and the output layers should be matched with the size of the observational data and the number of EoS parameters, respectively. The input layer with 30 neurons matches the 15 $(M, R)$ points (30 input values). We sorted the 15 $(M, R)$ pairs by their masses in ascending order, but we found that the sorting does not affect the overall performance. The output neurons in the last layer correspond to the prediction target, i.e., the 5 parameters of the speed of sound.
The concrete design of our NN is summarized in Tab. 1. We choose the activation function at the last (output) layer as $\tanh$ so that the predicted speed of sound squared is automatically bounded in the causal range $c_s^2 \le 1$. In principle the output could still be unphysical: $c_s^2 < 0$ may occur, and then we readjust it as $c_s^2 = 0$. For the other layers we choose the ReLU, i.e., $\max(0, x)$, which is known to evade the vanishing gradient problem.
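A minimal NumPy sketch of the forward pass through the architecture of Tab. 1 may be illustrative; the weights here are untrained Glorot-initialized placeholders, whereas the actual training is done in Keras:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(0.0, x)

def glorot_uniform(n_in, n_out):
    # Glorot (Xavier) uniform initialization, as used for our NN parameters
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# Layer widths from Tab. 1: 30 inputs -> 60 -> 40 -> 40 -> 5 outputs
sizes = [30, 60, 40, 40, 5]
weights = [glorot_uniform(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    """ReLU on the hidden layers, tanh on the output layer so that the
    predicted speed of sound squared stays inside (-1, 1); negative
    outputs are readjusted to zero afterwards."""
    for W, c in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + c)
    return np.tanh(x @ weights[-1] + biases[-1])

out = forward(rng.normal(size=(1, 30)))  # one mock observation of 15 stars
print(out.shape)  # (1, 5)
```

The tanh bound guarantees causal outputs by construction, while the physicality condition $c_s^2 \ge 0$ is enforced only as a postprocessing step.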
3.2 Learning curves with the observational data augmentation
The stochastic gradient descent method is an iterative procedure to adjust the NN parameters step by step and to gradually make the loss function smaller. To judge the point where we stop the iteration, we need to monitor the convergence of the loss function in training. For this purpose we plot the loss function as a function of training time (i.e., the number of iteration units), which we call the learning curve in this work.
The actual shape of the learning curve depends on the initial parameters of the NN and also on the choice of minibatches in the stochastic gradient descent method. Although the learning curve differs every time we start a new training, we can still find some common tendencies. In Fig. 5 we show the typical behavior of learning curves for our neutron star problem. It should be noted that the horizontal axis in Fig. 5 is the number of minibatches. In this study we will compare the training speeds with different $n_s$ but with the fixed minibatch size. The number of epochs, i.e., the number of scans over the whole training data set, is often used as the unit of training time, but in the present study of training speeds we adopt the number of minibatches, which is proportional to the actual number of parameter updates and to the computational time of the training.
In Fig. 5 we plot two different learning curves for a single training: one for the training loss (dashed line) and the other for the validation loss (solid line). The training and the validation losses are the loss functions calculated for the training and the validation data sets, respectively, by using Eq. (13). The former diagnoses how well the model remembers the training data, and the latter diagnoses how well the model properly predicts unseen data by generalizing the learned information.
Here, we compare the learning curves for two training data sets with different values of $n_s$. By observing plots like Fig. 5 for many training trials, we can read out two general tendencies:

The converged value of the validation loss for the smaller $n_s$ is larger than that for the larger $n_s$. A possible interpretation is that the loss function can be more easily trapped in local minima/saddle points for smaller $n_s$.

As the number of batches increases, for the smaller $n_s$ the training loss keeps decreasing whilst the validation loss turns to increasing behavior. The loss function for the smaller $n_s$ thus apparently shows overfitting behavior, but the overfitting is not seen for the larger $n_s$.
From these results, we may argue that the data augmentation with sufficiently large $n_s$ could be useful to mitigate the problems of local minimum trapping and overfitting. To justify or falsify this speculation, we will perform further quantitative analyses of the observational data augmentation under a simple setup in Sec. 5. We note in passing that recent developments show a possibility for overparametrized networks, which have a very large number of parameters, to reconcile a good generalization ability and an elusion of overfitting 2016arXiv161103530Z (); NEURIPS2019_62dad6e2 (). Here, our point is that the data augmentation is needed to incorporate the physical uncertainty rather than to improve the learning quality, but it may have such a byproduct.
3.3 Typical examples – EoS reconstruction from the neural network
Once the loss function converges, we can use the trained NN model to infer an EoS from an observation of 15 $M$-$R$ points. We picked two typical examples, represented by orange and blue in Fig. 6. In the left panel of Fig. 6 the dashed lines represent randomly generated EoSs. We note that the two EoSs are identical in the low density region because the SLy4 EoS is employed below the matching density. The corresponding genuine $M$-$R$ relations are shown by the dashed lines in the right panel of Fig. 6. Randomly sampled mock observations consisting of 15 $M$-$R$ points are depicted by ellipses, where the centers are the “observed” $(M_i, R_i)$ and the major and minor axes show the observational uncertainties in the two directions. The reconstructed EoSs are depicted by solid lines in the left panel of Fig. 6. We can see that the reconstructed EoSs agree quite well with the original EoSs for these examples. It is also interesting to compare the $M$-$R$ relations corresponding to the original and the reconstructed EoSs. The solid lines in the right panel of Fig. 6 represent the $M$-$R$ relations calculated with the reconstructed EoSs. As is expected from the agreement of the EoSs in the left panel of Fig. 6, the original and the reconstructed $M$-$R$ relations are close to each other. More details are already reported in our previous publication Fujimoto:2017cdo ().
3.4 Error correlations
We make a detailed error analysis of the reconstructed EoSs in terms of the speed of sound. To increase the statistics for the present analysis, we enlarge $n_s$ (remember that increasing $n_s$ is almost costless).
Now, we introduce the prediction error in the speed of sound:

\[
\delta c_{s,i}^2 \,=\, c_{s,i}^{2\,\mathrm{(pred)}} - c_{s,i}^{2\,\mathrm{(orig)}}\,, \tag{17}
\]

where $c_{s,i}^{2\,\mathrm{(pred)}}$ and $c_{s,i}^{2\,\mathrm{(orig)}}$ are the reconstructed and original values, respectively, of the speed of sound in the $i$-th density segment. The first row of Fig. 7 shows the histograms of $\delta c_{s,i}^2$ ($i = 1, \dots, 5$). We see that the prediction error at the lower density (i.e., smaller $i$) is centered at the origin, while that at the higher density (i.e., larger $i$) is distributed widely, which means that the observational data efficiently constrain the lower density regions. At the highest density segment, the prediction error roughly follows the uniform distribution within the physical range. This behavior can be understood as follows. The original values of the speed of sound in the validation data are generated by the uniform distribution over the physical range, while the distributions of the predicted $c_{s,i}^2$ ($i = 1, \dots, 5$) are shown in the second row of Fig. 7. At lower density, the distribution of the prediction roughly follows the original uniform distribution. However, at higher density, the distribution of the prediction peaks around $c_s^2 \approx 0.5$. At the highest density, therefore, the NN always predicts $c_s^2 \approx 0.5$ for any inputs. This is because such a middle value is an optimal prediction to minimize the loss function when the constraining information is inadequate. The exact peak position is slightly smaller than 0.5, which is because our choice of the loss function, msle, gives larger weights to smaller values. Thus, the approximate uniform distribution of the prediction errors at the highest density (as seen in the upper rightmost panel of Fig. 7) just reflects the distribution of the original values relative to the fixed prediction.
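The shift of the optimal constant prediction below 0.5 can be checked with a small numerical experiment: for targets distributed uniformly over $[0, 1]$, the constant that minimizes the msle is $4/e - 1 \approx 0.47$, slightly below the mean 0.5 (the grid sizes here are arbitrary):

```python
import numpy as np

y = np.linspace(0.0, 1.0, 10001)       # targets uniform on [0, 1]

def msle_const(c):
    # mean squared logarithmic error of the constant prediction c
    return np.mean((np.log1p(c) - np.log1p(y)) ** 2)

c_grid = np.linspace(0.0, 1.0, 2001)
c_best = c_grid[np.argmin([msle_const(c) for c in c_grid])]
print(c_best)  # about 0.47 = 4/e - 1, slightly below 0.5
```

Setting the derivative to zero gives $\log(1 + c^*) = \langle \log(1 + y) \rangle = 2\ln 2 - 1$, i.e., $c^* = 4/e - 1$, confirming that msle biases the unconstrained prediction toward smaller values.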
From this analysis we conclude that the observational data contain more information on the lower density EoS but not much on the EoS in the highest density segments. This is quite natural, but such a quantitative demonstration as in Fig. 7 is instructive. The results in Fig. 7 also tell us two important aspects of our NN approach. First, the NN tends to give a moderate prediction, not too far from all possible values, when the uncertainty is large. This is nontrivial: the posterior distribution of the EoS is broad there, but such information is lost after the NN filtering. This is reasonable behavior in the sense that we want to design the NN to predict the most likely EoS instead of the posterior distribution. Secondly, the moderate prediction for unconstrained cases may explicitly depend on the choice of the loss function and also on the prior distribution, i.e., the distribution of the EoSs in the training data set in our problem. Thus, we should be careful about possible biases when we discuss unconstrained results in the high density regions.
The error analysis so far has been made separately for each EoS parameter. In Fig. 8, to extract more detailed information, we plot the marginalized distributions for different pairs of the EoS parameters with the corresponding Pearson correlation coefficients. We see that the speeds of sound of neighboring density segments have negative correlations, which is similar to the behavior seen in the Bayesian analysis Raithel:2017ity (). The underlying mechanism for this behavior can be explained in the same way as in Ref. Raithel:2017ity (): an overprediction in a segment can be compensated by an underprediction in the adjacent segment, and vice versa. The correlations are clearer in the lower density regions in which the predicted $c_s^2$'s have better accuracy. We also observe a small positive correlation between the parameters separated by two segments, which is also expected from the two negative correlations in between. The other correlations involving parameters in the high density regions are consistent with zero within the statistical error.
3.5 Conventional polynomial regression
It is an interesting question how the machine learning method could be superior to other conventional approaches as a regression tool. In this section we make a detour, introducing a polynomial model and checking the performance of a traditional fitting method. We can model the regression of the EoS from the $M$-$R$ data with the following polynomial fitting of degree $k$:

\[
y_j \,=\, \sum_{|\alpha| \le k} w_{j\alpha}\, x^{\alpha}\,, \qquad x^{\alpha} \equiv x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_D^{\alpha_D}\,, \tag{18}
\]

where $x$ and $y$ are the input and the output of the training data set, respectively, with $D$ being the size of the input vector $x$, and $w_{j\alpha}$ are the fitting parameters. For the optimization we employ the traditional least-squares method with the loss function in Eq. (12) specified as mse, i.e., the mean squared error:

\[
L \,=\, \frac{1}{N} \sum_{i=1}^{N} \bigl| y(x_i) - y_i \bigr|^2\,. \tag{19}
\]
In this case the mapping is linear with respect to the fitting parameters, which means the problem is essentially the (multivariate) linear regression. Thus, we can find an analytical solution of the exact global minimum of the loss function (19) in this traditional method.
Here we comment on the number of parameters in the polynomial model, which is given by the multiset coefficient $\binom{D+k}{k}$. For $D = 30$ in the present problem, as long as $k \ll D$, we can approximate this number as $\sim D^k/k!$, which explodes as the degree $k$ gets larger. The drawback of the linear regression is thus a huge computational cost that grows rapidly with the degree.
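The multiset coefficient can be evaluated directly to see how the parameter count explodes with the degree; a short check with $D = 30$ as in our setup (the function name is ours):

```python
from math import comb

def n_poly_params(D, k):
    """Number of monomials of degree <= k in D variables,
    i.e., the multiset coefficient C(D + k, k)."""
    return comb(D + k, k)

D = 30  # 15 (M, R) pairs -> 30 input variables
for k in range(1, 5):
    print(k, n_poly_params(D, k))  # 31, 496, 5456, 46376
```

Already at degree 4 the model carries tens of thousands of fitting parameters, far more than the training samples can constrain.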
3.6 Comparison between the neural network and the polynomial regression
In Sec. 3.3 we already observed two successful examples of the EoS reconstruction. For other EoSs in the validation data, the reconstructed  curves agree well with the genuine ones. In this section we shall quantify the overall agreement, and for this purpose we enlarge the size of the validation data; namely, we randomly generate 1000 EoSs, among which 967 EoSs pass the twosolarmass condition, and we carry out statistical analyses with 967 EoSs.
We define the overall reconstruction accuracy by calculating the radius deviation, i.e., the distance between the solid and dashed lines as already shown in the right panel of Fig. 6:

\[
\Delta R(M) \,=\, R_{\mathrm{pred}}(M) - R_{\mathrm{orig}}(M)\,, \tag{20}
\]

where $R_{\mathrm{pred}}(M)$ and $R_{\mathrm{orig}}(M)$ are the reconstructed radius (solid) and the genuine original radius (dashed), respectively, at the mass $M$. Then, we estimate the root mean square (RMS) of the radius deviations (20) using the 967 validation data at masses ranging from $0.6\,M_\odot$ to $1.8\,M_\odot$ with an interval of $0.2\,M_\odot$. We calculate the RMS for both the NN and the polynomial regression to make a quantitative comparison.
The logarithmic histogram of $\Delta R$ is plotted in Fig. 9 for three polynomial degrees, as shown in panels (a), (b), and (c). In a plot of the logarithmic probability density the Gaussian (normal) distribution takes a quadratic shape; the histograms from both the NN and the polynomial regression have a strong peak around $\Delta R = 0$ in all cases. The tails are wider than the normal distribution, and the polynomial results (lower panels) exhibit even slightly wider tails than the NN results (upper panels).
We can look into this tendency in a more quantitative manner in Fig. 10. In the leftmost panel (a) of Fig. 10 we plot the RMS of $\Delta R$. We see that the RMS from the NN stays small for a wide range of masses. This indicates that our NN method works surprisingly well; remember that the data points have random fluctuations of size $\sigma_R$ in the radius. It should be noticed that, even for mass regions hardly covered by neutron stars in the mock $M$-$R$ data of our setup, the corresponding radii are still reconstructed with comparable accuracy. One might think that this should be so simply because we use the SLy4 EoS in the low density region, but in view of Fig. 6, different EoSs lead to significantly different radii even in the low mass region. Figure 10 (a) clearly shows that the NN results surpass the polynomial outputs.
When the distribution has long tails, as seen in Fig. 9, the RMS may not fully characterize the data behavior. To quantify the deviation of the reconstruction more differentially, we can define quantile-based widths, $\sigma_{68}$ and $\sigma_{95}$, corresponding to 68% and 95% of $|\Delta R|$ contained within the half widths, respectively. They can be explicitly expressed as

\[
\sigma_{68} \,=\, Q\bigl(\mathrm{erf}(1/\sqrt{2})\bigr)\,, \qquad
\sigma_{95} \,=\, Q\bigl(\mathrm{erf}(2/\sqrt{2})\bigr)\,, \tag{21}
\]

where $Q(p)$ is the quantile function of the $|\Delta R|$ distribution, i.e., the value of $|\Delta R|$ at which the fraction $p$ is contained when integrated from small $|\Delta R|$. The error function, $\mathrm{erf}$, is used to translate the statistical significance in units of $\sigma$ into the probability, e.g., $\mathrm{erf}(1/\sqrt{2}) \simeq 0.6827$ and $\mathrm{erf}(2/\sqrt{2}) \simeq 0.9545$. If $\Delta R$ obeys the normal distribution with standard deviation $\sigma$, then $\sigma_{68} = \sigma$ and $\sigma_{95} = 2\sigma$ follow. Therefore, we can use $\sigma_{68}$ and $\sigma_{95}$ as alternatives to the RMS. For more general distributions it is important to note that these quantile-based widths have a clear meaning as the 68% and 95% confidence intervals, unlike the RMS.
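The quantile-based widths can be computed from a sample of radius deviations as sketched below; the function and variable names are ours, and the unit Gaussian sample only serves as a sanity check:

```python
import numpy as np
from math import erf, sqrt

def quantile_width(delta_r, sigma):
    """Half-width of |Delta R| containing the fraction erf(sigma/sqrt(2))
    of the sample: sigma = 1 -> 68.27% interval, sigma = 2 -> 95.45%."""
    p = erf(sigma / sqrt(2.0))
    return np.quantile(np.abs(delta_r), p)

rng = np.random.default_rng(3)
sample = rng.normal(0.0, 1.0, size=200_000)  # Delta R ~ N(0, 1)
w68 = quantile_width(sample, 1.0)  # close to 1 for a unit Gaussian
w95 = quantile_width(sample, 2.0)  # close to 2 for a unit Gaussian
print(w68, w95)
```

For a Gaussian sample the two widths reproduce $\sigma$ and $2\sigma$, while for long-tailed distributions $\sigma_{95}$ is much more sensitive to outliers than $\sigma_{68}$.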
In Figs. 10 (b) and (c) we plot $\sigma_{68}$ and $\sigma_{95}$. It is interesting that the $\sigma_{95}$ results appear very different, while the $\sigma_{68}$ results are almost indistinguishable between the NN and the polynomial regression. The polynomial $\sigma_{95}$ results are consistently larger than the NN results. This behavior of $\sigma_{95}$ implies that the polynomial results may be contaminated by outliers far from the true answer. We can understand this by relating it to Runge's phenomenon in polynomial approximation: higher order terms can cause large oscillations in the interpolation and the extrapolation. In particular, the extrapolation may easily break down because of higher order terms that blow up for large inputs. In contrast, the NNs with ReLU activation are piecewise linear in the inputs, so the outputs remain stable.
4 EoS Estimation from the Real Observational Data
In the previous section we have studied our methodology to infer the most likely EoS from observational data and quantitatively confirmed satisfactory performance to reproduce the correct results. Now, we shall apply the method to analyze the real observational data and constrain the real neutron star EoS.
For practical purposes it is important to develop a way to estimate uncertainties around the most likely EoS within the framework of the machine learning method. As we have already discussed, the EoS inference from the  data is an illposed problem and the solution cannot be uniquely determined. Thus, the prediction from the NN method cannot avoid uncertain deviations from the true EoS. To employ a predicted EoS as a physical output, we need to quantify the uncertainty bands on top of the results. In this section we consider two different approaches for the uncertainty quantification.
4.1 Compilation of the neutron star data
For the real data we use the $M$-$R$ relations of the neutron stars determined by various X-ray observations. There are three sorts of sources we adopt here: 8 neutron stars in quiescent low-mass X-ray binaries (qLMXB) Ozel:2015fia (); Bogdanov:2016nle (); Ozel:2016oaf (), 6 thermonuclear bursters with better constraints than the qLMXBs Ozel:2015fia (); Ozel:2016oaf (), as well as a rotation-powered millisecond pulsar, PSR J0030+0451, with thermal emission from hot spots on the stellar surface Riley:2019yda (); Miller:2019cac (). The data from the first two sources are tabulated in Refs. Ozel:2015fia (); Bogdanov:2016nle (); Ozel:2016oaf (), especially in Fig. 4 of Ref. Ozel:2016oaf ().
Ideally, with sufficient computational resources, the machine learning would be capable of directly dealing with such a two-dimensional probability distribution using, for example, the convolutional neural network (CNN). In the present work, however, we simplify our analysis by approximately characterizing the two-dimensional probability distribution with four parameters, following the prescription proposed in our previous publication Fujimoto:2019hxv (): the means and the variances of the marginal distributions with respect to $M$ and $R$. More specifically, we project the two-dimensional distribution onto the $M$ axis (and the $R$ axis) by integrating over $R$ (and $M$, respectively). In this way we marginalize the two-dimensional distribution into two one-dimensional distributions, which are eventually fitted by Gaussians with a mean and a variance.
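The marginalization step can be sketched as follows; the grids and the toy distribution are hypothetical, and an uncorrelated Gaussian is used only to check that the procedure recovers known parameters:

```python
import numpy as np

def marginal_gaussians(prob, m_grid, r_grid):
    """Characterize a 2D probability table prob[i, j] on (M_i, R_j) by the
    means and variances of its two marginal distributions."""
    prob = prob / prob.sum()          # normalize
    p_m = prob.sum(axis=1)            # integrate over R -> marginal in M
    p_r = prob.sum(axis=0)            # integrate over M -> marginal in R
    mu_m = np.sum(p_m * m_grid)
    mu_r = np.sum(p_r * r_grid)
    var_m = np.sum(p_m * (m_grid - mu_m) ** 2)
    var_r = np.sum(p_r * (r_grid - mu_r) ** 2)
    return mu_m, var_m, mu_r, var_r

# Toy check: an uncorrelated 2D Gaussian recovers its own parameters
m = np.linspace(0.5, 2.5, 201)
r = np.linspace(8.0, 16.0, 201)
M, R = np.meshgrid(m, r, indexing="ij")
table = np.exp(-0.5 * ((M - 1.4) / 0.1) ** 2 - 0.5 * ((R - 12.0) / 0.5) ** 2)
mu_m, var_m, mu_r, var_r = marginal_gaussians(table, m, r)
print(mu_m, mu_r)  # close to 1.4 and 12.0
```

Note that this four-parameter reduction discards any $M$-$R$ correlation in the original two-dimensional distribution, which is the price of the simplification.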
4.2 Training data generation with observational uncertainty
We should revise the data generation procedure previously illustrated in Fig. 4 in Sec. 3.1. Previously, for the mock data analysis, we assumed the universal variances, $\sigma_M^2$ and $\sigma_R^2$, for all the stars. In reality, however, they should vary from star to star, i.e., $\sigma_{M_i}$ and $\sigma_{R_i}$ should correspond to the observational uncertainties of the $i$-th neutron star. To deal with the real observational data, the revised procedure for sampling the $M$-$R$ points is sketched in Fig. 11. We need to design the NN with an input of information including $\sigma_{M_i}$ and $\sigma_{R_i}$: the input variables are extended to $(M_i, R_i; \sigma_{M_i}, \sigma_{R_i})$.
We shall recapitulate the data generation scheme as follows. In the same way as in Sec. 3 we prepare the 5 EoS parameters in the output side. In this section the training data comprise the input parameters $(M_i, R_i; \sigma_{M_i}, \sigma_{R_i})$ ($i = 1, \dots, 14$). We note that $i$ runs up to 14, not 15, corresponding to the number of observed neutron stars as explained in Sec. 4.1. We calculate the $M$-$R$ curve for each EoS, and then select 14 points of $(M_i, R_i)$ on the $M$-$R$ curve and add statistical fluctuations $\Delta M_i$ and $\Delta R_i$ [see Fig. 11 (3)]. Let us go into more detailed procedures now. Unlike the universal $\sigma_M$ and $\sigma_R$ in Sec. 3, here we randomly generate $\sigma_{M_i}$ and $\sigma_{R_i}$ differently for each $i$. These variances are sampled from uniform distributions over prescribed ranges. In view of the observational data, these ranges should be sufficient to cover the realistic situations. Then, $\Delta M_i$ and $\Delta R_i$ are sampled according to the Gaussian distributions with these variances. Finally we obtain the training data [see Fig. 11 (4)]. Hereafter we call these 14 tetrads an observation.
Now we prepare the training data set by taking multiple observations. For each EoS we randomly generate 100 different sets of the uncertainties $(\sigma_{M_i}, \sigma_{R_i})$, and then we make another 100 observations for each set. From the former 100 sets the NN is expected to learn that the observational uncertainties may vary, and the latter tells the NN that the genuine $M$-$R$ relation may deviate from the observational data. In total we make $100 \times 100 = 10^4$ “observations” per EoS. The size of the whole training data set is thus 100 times larger than before.
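The two-level sampling (uncertainty sets first, then noisy observations for each set) can be sketched as follows; the uncertainty ranges and the reduced repetition numbers in the demo call are placeholders, not the values used in the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_observations(m_true, r_true, sigma_m_range, sigma_r_range,
                      n_sigma, n_shift):
    """For one EoS: draw n_sigma sets of per-star uncertainties, then
    n_shift noisy observations for each set -> n_sigma * n_shift samples,
    each a flattened list of the 14 tetrads (M_i, R_i, sigma_Mi, sigma_Ri)."""
    m_true = np.asarray(m_true)
    r_true = np.asarray(r_true)
    samples = []
    for _ in range(n_sigma):
        sm = rng.uniform(*sigma_m_range, size=m_true.size)  # per-star sigma_M
        sr = rng.uniform(*sigma_r_range, size=r_true.size)  # per-star sigma_R
        for _ in range(n_shift):
            m = m_true + rng.normal(0.0, sm)
            r = r_true + rng.normal(0.0, sr)
            samples.append(np.concatenate([m, r, sm, sr]))
    return np.array(samples)

# 14 genuine points; ranges and repetitions are illustrative only
data = make_observations(np.linspace(1.0, 2.0, 14), np.linspace(10.0, 13.0, 14),
                         sigma_m_range=(0.0, 0.2), sigma_r_range=(0.0, 1.0),
                         n_sigma=10, n_shift=10)
print(data.shape)  # (100, 56)
```

Each row carries $14 \times 4 = 56$ numbers, matching the extended input layer of the NN used in this section.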
Layer index  Neurons  Activation Function 
0  56  N/A 
1  60  ReLU 
2  40  ReLU 
3  40  ReLU 
4  5  tanh 
We modify the architecture of the NN used in this section accordingly. The number of neurons in the input layer becomes $14 \times 4 = 56$ (the performance test was done with 15 points, but we have numerically confirmed that the same level of performance is achieved with 14 points as well). Since we already know from the mock data analysis that the mass sorting does not affect the performance, we keep the mass ordering as generated randomly, unlike in Sec. 3.1. We normalize the input data, $M_i$, $R_i$, $\sigma_{M_i}$, and $\sigma_{R_i}$, by constant rescaling factors. Aside from the input layer, the NN design of the other layers is chosen to be the same as before, as summarized in Tab. 2. For the NN optimization we adopt Adam, the same as in Sec. 3, but with a different minibatch size, 1000. Incorporating the observational uncertainties makes the training data set larger than before, and we expect that a larger minibatch size would speed up the training.
4.3 Two ways for uncertainty quantification
Here we prescribe two independent methods to quantify uncertainties in the output EoS based on different principles.
The first one utilizes the validation data, and the procedure is similar to that in Sec. 3.6. The basic idea is that, once we have trained the NN, we can evaluate the prediction accuracy from the validation data. We generate 100 samples of the validation data set whose input variances are chosen in accord with the real observational uncertainties as explained in Sec. 4.1. For the validation data we know the true $M$-$R$ relation, so that we can evaluate the deviation $\Delta R(M)$ as defined in Eq. (20). Using the whole validation data set, we calculate the root-mean-square deviation and regard it as the systematic error inherent in the NN approach. Later we show this uncertainty with the label “validation”, as seen in Fig. 13.
The second one is the ensemble method in machine learning. This method is usually used to enhance the stability and performance of the predicted output from NNs; here we repurpose it to quantify the uncertainty of the predicted output. The idea is concisely summarized in Fig. 12. In this method we set up multiple NNs independently, with their number denoted by $N_{\rm ens}$. We perform random sampling from the training data set to generate different subsets, and train each NN using its own subset. This random sampling is commonly referred to as bootstrapping. After the training, by feeding the input data, each NN predicts its own output values, as in Fig. 12. Finally, aggregating the outputs from these multiple NNs, i.e., averaging all the outputs, we get the overall output. Here we can also calculate the variance of the outputs around the average and regard it as the “uncertainty”. This whole procedure, comprising bootstrapping and aggregating, is named bagging.
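The bagging logic can be sketched generically; here a trivial nearest-neighbour lookup stands in for the independently trained NNs, which is only meant to illustrate the bootstrap-aggregate-variance structure, not the actual EoS inference:

```python
import numpy as np

rng = np.random.default_rng(5)

def bagging_predict(train_x, train_y, x, n_models=10):
    """Bootstrap the training set, 'train' n_models base learners, and
    aggregate: the mean is the overall output, the variance the uncertainty."""
    preds = []
    n = len(train_x)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)        # bootstrap resampling
        bx, by = train_x[idx], train_y[idx]
        # Base learner: nearest-neighbour lookup instead of a trained NN
        preds.append(by[np.argmin(np.abs(bx - x))])
    preds = np.array(preds)
    return preds.mean(), preds.var()

x_train = np.linspace(0.0, 1.0, 200)
y_train = np.sin(2.0 * np.pi * x_train) + 0.05 * rng.normal(size=200)
mean, var = bagging_predict(x_train, y_train, 0.25)
print(mean, var)  # mean close to sin(pi/2) = 1
```

The spread of the ensemble predictions reflects both the noise in the data and the sensitivity of the learner to resampling, which is what we interpret as the uncertainty band.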
We have introduced the two natural ways to quantify uncertainties, but our working prescriptions are still to be developed. In fact there is no established method yet. A more systematic way for the uncertainty quantification might be possible. This is an interesting problem left for future research in the general context of machine learning.
4.4 The most likely neutron star EoS
In Fig. 13 the orange line is the most likely neutron star EoS deduced from our NN approach. We estimated the uncertainty from the bagging (shown by the band with a hatch pattern labeled “bagging”) and from the validation (shown by the blue band labeled “validation”). We also plot bands from other works in the figure for comparison. The gray band represents an estimate from the EFT calculation combined with the polytropic extrapolation and the two-solar-mass pulsar constraint Hebeler:2013nza () (labeled “EFT+astro”). Because EFT is an ab initio approach, any reasonable predictions should lie within the gray band, and indeed our results are found to be near the middle of the gray band. The preceding Bayesian analyses Steiner:2010fz (); Steiner:2012xt (); Ozel:2015fia (); Bogdanov:2016nle (); Ozel:2016oaf () are also overlaid on Fig. 13. While Özel et al. Ozel:2015fia (); Bogdanov:2016nle (); Ozel:2016oaf () and our present analysis use the same astrophysical data, Steiner et al. Steiner:2010fz (); Steiner:2012xt () employ a subset of the data, i.e., 8 of the X-ray sources. One may think that our prediction gives a tighter constraint than the others, but the narrowness of the band may be related to the implicit assumption in our EoS parametrization; we will come back to this point in Sec. 4.5 (see Fig. 17). Figure 13 (b) shows the $M$-$R$ curves corresponding to the EoSs in (a). We see that our EoS (blue curve) certainly supports two-solar-mass neutron stars Demorest:2010bx (); Fonseca:2016tux (); Antoniadis:2013pzd (); Cromartie:2019kug ().
Figures 14 (a) and (b) show the tidal deformability and the correlation of the tidal deformabilities, respectively, for the binary neutron star merger GW170817. Once an EoS is given, the dimensionless tidal deformability, $\Lambda$, results from a quantity called the Love number, which is derived from the Einstein equation under static linearized perturbations to the Schwarzschild metric due to external tidal fields. Practically, we solve a second-order ordinary differential equation in combination with the TOV equation; see Refs. Hinderer:2007mb (); Hinderer:2009ca () for the explicit form of the equations. The blue band in Fig. 14 (a) represents $\Lambda$ as a function of $M$ from the EoS we inferred in the present work, which is consistent with the merger event GW170817 indicated by the red bar. In Fig. 14 (b) we show the correlation of the tidal deformabilities, $\Lambda_1$ of star 1 and $\Lambda_2$ of star 2, using the relation between $\Lambda$ and $M$ as given in Fig. 14 (a). The orange lines in Fig. 14 (b) refer to the constraints (solid: 90% and dashed: 50%) for which $\Lambda_1$ and $\Lambda_2$ are sampled independently TheLIGOScientific:2017qsa (), while the green lines refer to the constraints for which $\Lambda_1$ and $\Lambda_2$ are related through the measured chirp mass of the binary. In Fig. 14 (b) we clearly see that our predicted band is located within the 90% contours of the LIGO-Virgo data TheLIGOScientific:2017qsa (); Abbott:2018exr ().
So far we have only used the observational data from the thermonuclear bursters and qLMXBs, which may contain large systematic errors related to the uncertain atmospheric model of neutron stars. The results in Figs. 15 (a) and (b) include the $M$-$R$ constraint from the NICER mission as well. There are two independent analyses, as indicated in Fig. 15 (b), of the same observation of PSR J0030+0451 Riley:2019yda (); Miller:2019cac (), and we adopt the one [green bar in (b)] in Ref. Riley:2019yda (). We see that the uncertainty becomes slightly larger upon inclusion of the NICER data, which is attributed to the relatively large deviation of the NICER data from the others.
4.5 Possible EoSs with a weak firstorder phase transition
In the analyses we have presented so far, we used the piecewise polytrope with 5 segments of density. Here, we change the number of segments from 5 to 7 and repeat the inference with the finer bins. There are mainly two issues argued in this subsection: a possibility of a weak firstorder phase transition with the finer bins and its implication on the uncertainty quantification.
To this end we prepared 100 NNs in the bagging outlined in Sec. 4.3. We note that each NN is trained so as to predict an EoS in response to the real observational data. Here we use the $M$-$R$ data of the qLMXBs and the thermonuclear bursters without the NICER data. In the left panel of Fig. 16 we show 100 EoSs predicted from the 100 independent NNs. We remind the reader that the activation function in the output layer is chosen to be $\tanh$, which takes a value over $(-1, 1)$. When the predicted value is smaller than a small threshold, we adjust it to $c_s^2 = 0$ and identify a first-order phase transition. We found 44 predicted EoSs out of 100 that have a first-order phase transition. We highlight the regions of first-order phase transition with orange thick lines in Fig. 16. From this plot we can understand why we increased the number of segments: if we used the EoS parametrization with 5 segments, weak first-order phase transitions would be too strongly prohibited by the coarse discretization.
We also show a histogram in the right panel of Fig. 16 to give a breakdown of the EoS regions with a first-order phase transition. This histogram counts the number of first-order transition EoSs in each energy-density bin. It is interesting to see that most of the first-order phase transitions are concentrated in the intermediate energy-density region. On the one hand, in the lower energy-density region the first-order phase transition is less likely, and this tendency is consistent with the fact that a stronger first-order phase transition in a lower energy region is more disfavored by the two-solar-mass pulsar constraint Alford:2015dpa (). In the higher energy region, on the other hand, there are also fewer EoSs with a first-order phase transition. One may think that a first-order phase transition would be more allowed in the higher energy region, but this is not the case in the NN analysis. In Sec. 3.4 we already discussed that the NN model tends to predict the most conservative values in the high energy-density regions where the constraints are inadequate. Therefore, the correct interpretation of the absence of the first-order transition in the high density regions shown in Fig. 16 is not that our results exclude a first-order transition there, but merely that the observational data analyzed in our NN method do not favor a first-order transition there. Another artificial factor in the high energy-density region is that our piecewise polytrope is equally spaced on the log scale, so the higher density segments have larger energy widths on the linear scale. The parametrization then lacks the resolution to represent a weak first-order phase transition in the high energy region. We also note that our NN approach does not take into account the third-family scenario Gerlach:1968zz (); Schertler:2000xq () in which separate branches may appear in the mass-radius relation (see also Refs. Alford:2017qgh (); Blaschke:2020qqj () for recent discussions).
In Fig. 17 we show the EoS results with 7 segments (fine binning) after bagging. We see that the uncertainty is widened compared with the results with 5 segments (coarse binning). This is partially because the number of output parameters is increased while the amount of the input data is unchanged, so the corresponding uncertainty should increase. Furthermore, the uncertainty is enhanced by the effect of the first-order phase transition: if there is a first-order phase transition, the predicted value becomes (nearly) zero, and these zero values enlarge the variance of the output in the bagging and increase the uncertainty accordingly. Now the uncertainty appears comparable with that of other Bayesian methods, as seen in Fig. 17. Our previous results with 5 segments, shown in Ref. Fujimoto:2019hxv () and in Sec. 4.4 of this paper, had smaller uncertainty, which implies that the choice of 5 segments turns out to be optimal.
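The bagging band described above is simply the ensemble mean and standard deviation per output parameter; a minimal sketch (our own helper name):

```python
import numpy as np

def bagging_band(ensemble_preds):
    """Central value and uncertainty band from bagging: the mean and the
    standard deviation over the ensemble, per output parameter."""
    preds = np.asarray(ensemble_preds, dtype=float)
    return preds.mean(axis=0), preds.std(axis=0)
```

Note how a (near-)zero prediction from one ensemble member, as produced at a first-order transition, inflates the standard deviation of that output parameter even when all other members agree.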
Finally we plot the speed of sound in Fig. 18. The bare output from the NN model actually comprises these speed-of-sound values. It is clear from the plots that the speed of sound exceeds the conformal limit, $c_s^2 = 1/3$; see Refs. Bedaque:2014sqa (); Tews:2018kmu () for thorough discussions of the conformal limit. The inclusion of the NICER data only slightly widens the uncertainty, and changing the segment number has a more dominant effect on the uncertainty bands, as quantified in the left panel of Fig. 18.
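Given an EoS as tabulated pressure versus energy density, the squared sound speed and the conformal-limit check can be sketched as below (a finite-difference estimate in units with $c = 1$; the helper names are ours):

```python
import numpy as np

def sound_speed_squared(pressure, energy_density):
    """Finite-difference estimate of c_s^2 = dp/d(eps), units with c = 1."""
    return np.gradient(np.asarray(pressure, dtype=float),
                       np.asarray(energy_density, dtype=float))

def exceeds_conformal_limit(cs2):
    """True where the squared sound speed exceeds the conformal limit 1/3."""
    return np.asarray(cs2) > 1.0 / 3.0
```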
5 More on the Performance Test: Taming the Overfitting
In Sec. 3.2, we observed a quantitative difference between the learning curves for the training data sets with and without the data augmentation, as demonstrated in Fig. 5. It is then natural to anticipate that this observational data augmentation may help overcome the problems of local-minimum trapping and overfitting that we often meet during the NN training. This section presents numerical experiments to understand the behavior of the learning curve and the role of the augmentation therein. In particular, we focus on the overfitting problem here.
5.1 A toy model
Here we consider a specific and simple problem as a toy model. We choose the following double-peaked function
(22) 
in the specified domain. The concrete shape of the function above is depicted in Fig. 19. In Tab. 3 we make a comparison between the neutron star calculation and this toy-model fitting study.
                                     Neutron star      Toy-model fitting
Input
Output                               EoS
Training data (input)
Number of training data (output)     1948
Number of training data (input)      1948              1948000
For the training data, we first generate a set of the "true" inputs in the interval and calculate the corresponding outputs. This constitutes the base data set, whose size is the number of base data points.
Then, similarly to the neutron star case, we augment the training data set by duplicating copies of the base data with the input fluctuated by an "observational uncertainty". Here, we take the distribution of this uncertainty to be the normal distribution with a standard deviation fixed to a value of our choice:
(23) 
The size of the training data set is then the base size multiplied by the number of copies plus one. The trivial case with no extra copies corresponds to the naive training data set without augmentation, while any additional copies augment the training data set. Figure 19 exemplifies such an augmented training data set.
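The augmentation procedure can be sketched as follows. The interval [0, 1], the stand-in double-peaked function, and all function names here are illustrative assumptions; the actual target is defined in Eq. (22).

```python
import numpy as np

rng = np.random.default_rng(42)

def double_peaked(x):
    """Stand-in for the double-peaked target of Eq. (22); any smooth
    two-bump function serves for illustration."""
    return np.exp(-((x - 0.3) / 0.08) ** 2) + np.exp(-((x - 0.7) / 0.08) ** 2)

def make_training_set(f, n_base, n_copy, sigma):
    """Base pairs (x_i, f(x_i)) plus n_copy duplicates per point in which
    only the *input* is fluctuated by Gaussian noise of width sigma,
    mimicking observational uncertainty; the target stays f(x_i)."""
    x_base = rng.uniform(0.0, 1.0, size=n_base)
    y_base = f(x_base)
    x_all = np.concatenate(
        [x_base] + [x_base + rng.normal(0.0, sigma, n_base)
                    for _ in range(n_copy)])
    y_all = np.tile(y_base, n_copy + 1)  # size is n_base * (n_copy + 1)
    return x_all, y_all
```

Setting `n_copy=0` reproduces the naive, unaugmented training set.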
Since the task we deal with here is idealized and simplified as much as possible, we can reasonably keep the NN architecture simple and relatively small (albeit deep). In the present study we use two types, NN1221 and NN1991, as tabulated in Tab. 4. We also choose the simplest loss function here, namely mae, i.e., the mean absolute error:
\[ L_{\rm mae} = \frac{1}{N_{\rm train}} \sum_{i=1}^{N_{\rm train}} \bigl| y_i^{\rm (pred)} - y_i^{\rm (true)} \bigr| \qquad (24) \]
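As a sanity check, the mean absolute error is a one-liner:

```python
import numpy as np

def mae_loss(y_pred, y_true):
    """Mean absolute error: the average of |prediction - target|."""
    return float(np.mean(np.abs(np.asarray(y_pred, dtype=float)
                                - np.asarray(y_true, dtype=float))))
```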
5.2 Input noises, parameter fluctuations, and dropout
To deepen our understanding of the data augmentation, we argue that including noise in the input is equivalent to introducing fluctuations in the first-layer parameters of the NN. For the moment we do not limit ourselves to the special toy model in Eq. (22), but develop a general formulation using the schematic notation introduced in Sec. 2.4.
We consider the case where a noise term is added to the input values. The noise is assumed to be statistically independent of the input value. In this case the input noise can be absorbed by a redefinition of the first-layer parameters; that is,
\[ W^{(1)} (x + \epsilon) + b^{(1)} = W^{(1)} x + \tilde{b}^{(1)}, \qquad (25) \]
where \( \tilde{b}^{(1)} = b^{(1)} + W^{(1)} \epsilon \). Therefore, we find,
\[ f(x + \epsilon\,; \theta) = f(x\,; \tilde{\theta}), \qquad (26) \]
where \( \tilde{\theta} \) denotes the parameter set \( \theta \) with \( b^{(1)} \) replaced by \( \tilde{b}^{(1)} \).
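This equivalence is a direct consequence of the linearity of the first-layer affine map, and can be verified numerically (a self-contained check with randomly drawn weights, bias, input, and noise):

```python
import numpy as np

rng = np.random.default_rng(0)

# The first-layer affine map acting on a noisy input, W (x + eps) + b,
# equals the same map acting on the clean input with a shifted bias
# b + W eps: the input noise is absorbed into the first-layer bias.
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)
eps = 0.05 * rng.normal(size=3)

lhs = W @ (x + eps) + b          # noisy input, original parameters
rhs = W @ x + (b + W @ eps)      # clean input, shifted bias
assert np.allclose(lhs, rhs)
```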
The loss function with the input noise is given by