Constructing smooth potentials of mean force, radial distribution functions and probability densities from sampled data
Abstract
In this paper a method of obtaining smooth analytical estimates of probability densities, radial distribution functions and potentials of mean force from sampled data in a statistically controlled fashion is presented. The approach is general and can be applied to any density of a single random variable. The method outlined here avoids the use of histograms, which require the specification of a physical parameter (bin size) and tend to give noisy results. The technique is an extension of the Berg-Harris method [B.A. Berg and R.C. Harris, Comput. Phys. Commun. 179, 443 (2008)], which is typically inaccurate for radial distribution functions and potentials of mean force due to a nonuniform Jacobian factor. In addition, the standard method often requires a large number of Fourier modes to represent radial distribution functions, which tends to lead to oscillatory fits. It is shown that the issues of poor sampling due to a Jacobian factor can be resolved using a biased resampling scheme, while the requirement of a large number of Fourier modes is mitigated through an automated piecewise construction approach. The method is demonstrated by analyzing the radial distribution functions in an energy-discretized water model. In addition, the fitting procedure is illustrated on three more applications for which the original Berg-Harris method is not suitable, namely, a random variable with a discontinuous probability density, a density with long tails, and the distribution of the first arrival times of a diffusing particle to a sphere, which has both long tails and short-time structure. In all cases, the resampled, piecewise analytical fit outperforms the histogram and the original Berg-Harris method.
PACS: 05.20.Jj, 82.20.Wt, 02.70.Rr, 02.50.-r

I Introduction
In many situations, the probability of finding a particle at a certain distance from another given particle is of interest. This probability is given by the radial distribution function. In addition to providing detailed information on the local structure of a system, one can often express thermodynamic quantities such as the pressure, energy and compressibility in terms of the radial distribution functions.McQuarrie () Furthermore, the radial distribution function can be reformulated in terms of a potential of mean force, which is of great importance in multiscale simulations that take the potential of mean force as input in Langevin or Brownian dynamics simulations.FrenkelSmit (); Berendsen07 ()
Radial distribution functions are usually constructed in simulations by forming a histogram of sampled interparticle distances. FrenkelSmit () However, histograms contain the bin size as a free parameter, and can be rather noisy for a poorly selected value of this parameter. For probability densities, an alternative method recently proposed by Berg and Harris avoids histograms and yields a smooth analytical form for the probability density which describes the sample data statistically at least as well as the noisy histogram.BergHarris08 () This method has already proved useful in the context of the computation of quantum free energy differences from nonequilibrium work distributions,VanZonetal08b () the determination of the density in Bose-Einstein condensates, Gerickeetal08 () and in the distribution of a reaction coordinate in simulations of chemical reactions.Monesetal09 ()
Having a similar method for potentials of mean force and radial distribution functions would have many advantages. As in the Berg-Harris method, such an approach would avoid the noise that accompanies the standard histogram method, while no a priori choice of bin size is required. Furthermore, a smooth radial distribution function would give a better representation of the potential of mean force (which is related to the logarithm of the radial distribution function), and would allow the function to be evaluated at any point in the range over which it is defined. Finally, the expressions for the pressure, energy and compressibility in terms of the radial distribution functions involve integrals of the radial distribution function over $r$. Such integrals can be evaluated more accurately from an analytic form than from a histogram, whose accuracy is restricted by the bin size.
The purpose of this paper is to develop an approach to obtain smooth radial distribution functions and potentials of mean force from sampled interparticle distances. The resulting method turns out to be suitable not just for radial distribution functions and potentials of mean force, but for a large class of densities of single random variables.
The paper is structured as follows. Sec. II presents a model of water in which rigid molecules interact through a discretized potential. Construction of radial distribution functions from data derived from this model will be used as a running test case. This section also contains some details on the generation of sampled data through simulation. The Berg-Harris method for smoothing probability densities in a statistically controlled fashion, and the connection between probability distribution functions and radial distribution functions, are briefly reviewed in Sec. III. In Sec. IV, simulation results are presented that show the shortcoming of the Berg-Harris method for radial distribution functions. The extended method is developed in Sec. V, with Sec. V.1 containing an explanation for the poor performance and its solution through statistical resampling. Sec. V.2 addresses an additional problem related to the particular shape of radial distribution functions, which is solved by extending the method to use piecewise analytic functions. In Sec. VI, the generality of the method is illustrated by presenting the results of applying the method to data drawn from three different probability densities which are problematic for the original Berg-Harris method. The paper concludes with a discussion in Sec. VII.
II System: a discrete water model
In the development of the smooth approximation method below, a model of rigid water molecules subject to a discretized interaction potential between the molecules will be used as a running test case for the construction of radial distribution functions. Since a water molecule consists of two kinds of atoms (oxygen and hydrogen), there are three radial distribution functions in this system, $g_{OO}$, $g_{OH}$ and $g_{HH}$, which turn out to have quite different character and therefore give a more stringent test of the smooth fitting methods than a single-atom model would.
The relative distances between the atomic sites in a molecule are fixed, making the molecules rigid bodies. Each state of each body can therefore be described by its center-of-mass position, its orientation or attitude matrix that transforms coordinates from the lab frame to a body-fixed frame for the molecule, and the associated linear and angular momenta. Here the body-fixed frame is chosen so that the third row of the matrix corresponds to the direction of the molecule's dipole, whose magnitude is fixed at a constant value.
The interaction potential between a pair of molecules $i$ and $j$ is a discrete version of the soft sticky dipole potential LiuIchiye95 (); ChandraIchiye99 (); Tanetal03 (); FennellGezelter04 ()
(1)  $U_{ij} = U_{\mathrm{LJ}} + U_{\mathrm{dp}} + U_{\mathrm{sp}}$
where the Lennard-Jones, dipole, and sticky parts of the potential are, respectively, given by
(2)  $U_{\mathrm{LJ}}(r) = 4\epsilon\left[\left(\sigma/r\right)^{12} - \left(\sigma/r\right)^{6}\right]$
(3)  $U_{\mathrm{dp}} = \frac{\boldsymbol{\mu}_i \cdot \boldsymbol{\mu}_j}{r^3} - \frac{3\,(\boldsymbol{\mu}_i \cdot \mathbf{r})(\boldsymbol{\mu}_j \cdot \mathbf{r})}{r^5}$
(4)  
where $\theta$ and $\phi$ are the conventional spherical angles of the vector $\mathbf{r}_{ij}$, and $\theta'$ and $\phi'$ are those of $\mathbf{r}_{ji}$. Finally, the switching function $s(r)$ is defined as
(5) 
while $s'(r)$ is given by the same form with primed parameters $r_L'$ and $r_U'$.
Here, the so-called SSD/E reparameterization of the model was used, for which the parameter values are kcal/mol, Å, D, kcal/mol, and , while the cutoff parameters in the functions $s$ and $s'$ are taken to be Å, Å, Å and Å.FennellGezelter04 ()
The discontinuous interaction potential in our system is obtained from the smooth potential by the controlled energy discretization method presented in Ref. VanZonSchofield08. In this discretization, a cutoff naturally arises, and therefore no reaction field was included.
The natural simulation technique for systems with discretized potentials is discontinuous molecular dynamics, or DMD. dmd1 (); dmd2 (); VanZonSchofield08 () In DMD, the dynamics of the system is free (no forces or torques are present) in between interaction events at which linear and angular momenta change. By its very nature, the energy-discretization scheme used in DMD is symplectic, time-reversible and strictly conserves the total (discretized) energy. It has no fixed time step, but moves from event to event. The event frequency, which sets the efficiency of the method to a large degree, is determined by the level of discretization of the potential energy: the finer the discretization, the more events occur. One typically finds that a discretization of the order of already suffices for a reasonably accurate simulation of the smooth system.VanZonSchofield08 () In the simulations of which the results are presented below, the discretization of the potential energy was set to .
The orientational dynamics of free rigid bodies is an important aspect in the calculation of interaction times.dmd1 (); dmd2 () The solution of the equations of motion for the angular momenta and attitude matrix depends on the symmetry of the body through the principal moments of inertia. Although any value of the principal moments can be utilized to sample configurations of a system of rigid molecules, the DMD simulations presented here made use of the exact solution of the equations of motion for a free, asymmetric rigid rotor,dmd1 (); VanZonSchofield07a () and hence were based on the exact dynamics of the system.
At the time of a distance measurement in the simulation, the forces and torques are most likely zero, since these quantities are nonzero only at discrete time points. This makes alternative smoothing methods, such as the weighted residual methodCyrBond07 () and methods based on equilibrium identities using (smooth) forces,AdibJarzynski05 (); BasnerJarzynski08 () not applicable here. Since only the interparticle distances are available as input for the determination of the analytically fitted radial distribution functions, the comparison with histogram-based radial distribution functions is more equitable.
III Review
III.1 Smoothing probability densities with the Berg-Harris method
Consider a random variable $x$ which has a probability density $p(x)$. Suppose that one has a sample of $n$ data points that are independently drawn from the density $p$. How can one estimate $p$ from the data points? One way is to bin the data points into a histogram. Because histograms are quite sensitive to statistical noise, Berg and Harris developed the following alternative procedure to obtain an analytical estimate for the probability density from the data.BergHarris08 ()
In the Berg-Harris method, the data are first sorted such that $x_1 \le x_2 \le \cdots \le x_n$. Using the sorted data, the empirical cumulative distribution function $\hat F$ is defined in the range $x_1 \le x \le x_n$ as
(6)  $\hat F(x_i) = \frac{i-1}{n-1}$, with $\hat F$ constant between consecutive data points
Although $\hat F$ becomes a better approximation to the true cumulative distribution with increasing sample size $n$, it is a function with many steps for any finite value of $n$, so that its derivative is not analytic but rather consists of delta functions.
The next step consists of writing the function $\hat F(x)$ as a sum of a linear term and a Fourier expansion. The expansion is truncated at the $m$th term:
(7)  $F_m(x) = F_0(x) + \sum_{k=1}^{m} d_k \sin\left[k\pi F_0(x)\right]$
where the linear term is defined as
(8)  $F_0(x) = \frac{x - x_1}{x_n - x_1}$
Furthermore, the Fourier coefficients $d_k$ in Eq. (7) are determined from
(9)  $d_k = 2 \int_0^1 \left[\hat F(x) - F_0(x)\right] \sin\left[k\pi F_0(x)\right]\, dF_0$
When $\hat F$ is approximated by $F_m$, the probability density is approximately
(10)  $p(x) \approx \frac{dF_m}{dx} = \frac{1}{x_n - x_1}\left\{1 + \sum_{k=1}^{m} k\pi\, d_k \cos\left[k\pi F_0(x)\right]\right\}$
Because $\hat F(x) - F_0(x)$ is zero at the end points of the interval $[x_1, x_n]$, and the Fourier modes form a complete orthonormal basis of the space of such functions, the empirical cumulative distribution and the associated probability density are reconstructed exactly by Eqs. (7) and (10) as $m \to \infty$. However, as mentioned above, this limit would yield a series of delta functions for the probability density. The aim is to truncate the series at a level which is not so high that one is fitting the noise, but high enough to give a good smooth approximation to $\hat F$ and $p$.
The final step of the procedure is therefore to find the appropriate number of Fourier terms $m$ in a statistically controlled fashion. The value of $m$ is determined here using the Kolmogorov-Smirnov (KS) test.BergHarris08 (); NumRecipes () This test determines how likely it is that the difference between the empirical cumulative distribution function and its analytical approximation is due to noise.footnote1 () The test takes the maximum variation $D = \max_i |\hat F(x_i) - F_m(x_i)|$ between $\hat F$ and $F_m$ over the sampled points, and returns a probability $q$ that the difference between the two cumulative distribution functions is due to chance. The functional dependence of $q$ on $D$ is known to a good (asymptotic) approximation.Stephens70 ()
A small value of $q$ indicates that the difference between the cumulative distribution functions is statistically significant, i.e. the quality of the expansion is insufficient to represent the data. One therefore carries out a process of progressively increasing the number of Fourier modes $m$ and evaluating $F_m$ as well as $q$ until the value of $q$ is larger than some convergence threshold $p_c$. A reasonable value for this convergence threshold is .
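To make the construction concrete, the fitting loop can be sketched as follows. This is a minimal Python sketch, not the original implementation: the empirical-CDF convention, the trapezoid-rule estimate of the coefficients, and all function names are illustrative assumptions, and the KS probability uses the standard asymptotic series.

```python
import numpy as np

def ks_probability(lam):
    """Asymptotic Kolmogorov-Smirnov probability Q_KS(lambda): the
    probability that the observed maximum CDF deviation is due to chance."""
    if lam < 1e-3:
        return 1.0
    terms = [2.0 * (-1.0) ** (j - 1) * np.exp(-2.0 * j * j * lam * lam)
             for j in range(1, 101)]
    return max(0.0, min(1.0, sum(terms)))

def berg_harris_fit(data, p_c=0.95, m_max=100):
    """Sketch of the Berg-Harris construction: fit the empirical CDF by a
    straight line plus a truncated sine series, adding modes until the KS
    test accepts the fit.  Returns the sample range and the coefficients."""
    x = np.sort(np.asarray(data, float))
    n = len(x)
    lam = (x - x[0]) / (x[-1] - x[0])       # linear term F0 at the samples
    F_emp = np.arange(n) / (n - 1.0)        # empirical CDF, 0 at x1, 1 at xn
    R = F_emp - lam                         # remainder to expand in sines
    d = []
    for m in range(1, m_max + 1):
        f = R * np.sin(np.pi * m * lam)
        # trapezoid-rule estimate of d_m = 2 * integral_0^1 R sin(pi m u) du
        d.append(2.0 * np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(lam)))
        F_fit = lam + sum(dk * np.sin(np.pi * k * lam)
                          for k, dk in enumerate(d, start=1))
        D = np.max(np.abs(F_emp - F_fit))   # KS statistic
        if ks_probability(np.sqrt(n) * D) > p_c:
            break                           # fit is statistically acceptable
    return x[0], x[-1], np.array(d)
```

By construction the fitted cumulative distribution runs from 0 at the smallest sample to 1 at the largest, so the sine series only has to represent the remainder.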
In Ref. VanZonetal08b, errors were estimated using the jackknife algorithm,Efron79 () and the same method will be used here. However, since the interparticle distance data naturally come in blocks, each corresponding to a single configuration, not all data points are independent. To account for this, we use a block version of the jackknife method, in which a single block of data is omitted in each jackknife sample.Kuensch89 ()
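The block jackknife can be sketched in a few lines. This is a generic sketch, with illustrative names, in which `estimator` stands for any quantity computed from the pooled distance data:

```python
import numpy as np

def block_jackknife(blocks, estimator):
    """Block-jackknife error estimate: leave out one whole block (one
    configuration's worth of distances) per jackknife sample, so that
    correlations within a configuration do not bias the error bar."""
    n = len(blocks)
    estimates = np.array([estimator(np.concatenate(blocks[:i] + blocks[i + 1:]))
                          for i in range(n)])
    mean = estimates.mean()
    # standard jackknife error formula with the (n-1)/n prefactor
    err = np.sqrt((n - 1) / n * np.sum((estimates - mean) ** 2))
    return mean, err
```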
III.2 From probability densities to potentials of mean force
To be able to apply the above smoothing method to radial distributions, for which histograms are presently the method of choice, one has to make the connection between a probability density on the one hand, and the radial distribution function and potentials of mean force on the other hand. This connection will be briefly reviewed here because some of the details are needed below.
Consider a system of a single type of particle, in which the number of particles is $N$ and the volume of the system is $V$. The radial distribution function, denoted by $g(r)$, is the density of particles at a distance $r$ away from a chosen first particle relative to the mean density $\rho = N/V$. The true density at a distance $r$ from a first particle is thus $\rho\, g(r)$. The mean number of particles at a distance between $r$ and $r + dr$ from a given particle is then $4\pi r^2 \rho\, g(r)\, dr$. The Jacobian factor $4\pi r^2$, which is, of course, the surface area of a sphere with radius $r$, will play an important role below. The probability of a particle being at a distance $r$ is equal to the number of particles with this $r$, divided by the total number of particles, so one has the relation
(11)  $P(r) = \frac{4\pi r^2}{V}\, g(r)$
Systems with different types of particles give rise to different radial distribution functions $g_{ss'}(r)$, where $s$ and $s'$ label the kinds of particles. A similar argument to that above leads to the relation,
(12)  $P_{ss'}(r) = \frac{4\pi r^2}{V}\, g_{ss'}(r)$
where $P_{ss'}(r)$ is the probability density of distances between a particle of type $s$ and a particle of type $s'$. Once $g_{ss'}$ is known, the potential of mean force $w_{ss'}(r)$ is found fromMcQuarrie ()
(13)  $w_{ss'}(r) = -k_B T \ln g_{ss'}(r)$
where $T$ is the temperature of the system and $k_B$ is Boltzmann's constant.
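These two relations are one-liners in practice. The sketch below assumes energies in kcal/mol (so that an approximate value of Boltzmann's constant in kcal/(mol K) is used); the function names are illustrative:

```python
import numpy as np

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K), an assumed unit choice

def g_from_density(P, r, V):
    """Radial distribution function from the distance density:
    g(r) = V * P(r) / (4 pi r^2)."""
    return V * P / (4.0 * np.pi * r**2)

def pmf_from_g(g, T):
    """Potential of mean force from g(r): w(r) = -k_B T ln g(r)."""
    return -K_B * T * np.log(g)
```

A quick consistency check: a structureless system with $g(r) = 1$ everywhere must give a vanishing potential of mean force.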
A further consideration for the applicability of the BergHarris method to radial distribution functions is whether the KS test may be used at all. The KS test assumes that the samples are independently drawn. Correlation between the samples may result in a bias in the radial distribution functions. Since nearby particles in an instantaneous configuration of the system are correlated, this is potentially an issue. If, however, the system is sufficiently large, only a small fraction of the samples in a single configuration will be spatially correlated, making the KS test applicable to a very good approximation. Furthermore, if one ensures that the configurations are taken from the simulation at sufficiently large time intervals, time correlations do not pose a problem either.
IV Poor performance of the straightforward smoothing applied to radial distribution functions
Using the Berg-Harris method to get smooth probability densities from a sample of interparticle distances, one expects to obtain a good, smooth fit to the radial distribution function by using Eq. (12). To test this expectation, the system of pure water in which rigid water molecules interact via a discretized potential energy derived from the soft sticky dipole model described in Sec. II was simulated. For all simulation results presented in this section, the parameters of the water model were as follows. The temperature is set at , the number of particles is , and the cubic simulation box has sides of length Å, so that the density is kg/l. The principal moments of inertia of a rigid water molecule are Å, Å, and Å, where is the molecular mass of water. After equilibration, the simulations were run for picoseconds, in which the interparticle distances were sampled every picoseconds (long enough for the system to decorrelate), for a total of configurations.
From these data, the radial distribution functions $g_{OO}$, $g_{OH}$ and $g_{HH}$ were determined in the simulations through histograms and by following the smoothing procedure of Sec. III.1, using a value for $p_c$ of . The results for the three radial distribution functions are shown in Fig. 1. Clearly, the two methods do not agree very well at short interparticle separations, despite the fact that the smooth distance distributions are statistically good descriptions of the distance data according to the KS test. In particular, note that the peak heights defining the first solvation shell are not well described.
V Extending the Berg-Harris method
Given these unsatisfactory results, the histogram method would seem preferable to a straightforward application of the Berg-Harris method. There is, however, a way to fix the smoothing method so that it becomes preferable to the histogram method. To understand how to fix the problem, one first needs to understand its underlying causes.
V.1 Overrepresented large distances
V.1.1 Problematic Jacobian
The convergence criterion of the smooth approximation relies on the KS test.BergHarris08 (); NumRecipes () This test is based only on the maximum variation between $\hat F$ and $F_m$ and is more sensitive to typical data points than to outliers. Although this may appear to be a contradiction at first glance, it is important that the KS test depends on the maximum deviation in the cumulative distribution function, rather than on the deviation of the random variable from the mean. To see that outliers do not constitute very deviant points in the cumulative distribution, suppose that, by chance, an outlier from the far left tail of a distribution is found in a sample of size $n$. This would yield an increase of magnitude $1/n$ for the empirical cumulative distribution at the outlier, while the true cumulative distribution is practically zero there, since the outlier lies in the region where the distribution is very small. If $F_m$ is not too bad an approximation to $\hat F$, the deviation $|\hat F - F_m|$ will also be of order $1/n$ at the outlier. However, the deviation between $\hat F$ and $F_m$ at other points depends more on the goodness of fit than on the sample size, at least for points in regions that contain a sizable fraction of the samples: these are the 'typical points'. For these points, there is little sample size dependence, so that to first order in $1/n$ the deviation is independent of $n$. Since the deviation at the outlier decreases as $1/n$, for large enough sample sizes the outlier will not be seen as the most deviant point in the KS test; rather, some typical value will have the most deviant cumulative distribution.footnote1a ()
The above argument holds when the quality of the fit in the typical region is poor. However, as the number of Fourier modes used to fit is increased, the quality of the fit of the cumulative distribution improves in the typical region, and the maximum deviation shifts to less probable values. Because the probability density function is small in the tails, this maximum deviation of the distribution is not that large and the convergence criterion (the KS test) is met, resulting in a good fit in the typical region with a poor fit in the tails.
The Berg-Harris method of Sec. III.1 therefore works well for probability densities without "long tails," since typical values of the variable will be the values of interest. However, for radial distribution functions, this focus on typical values is the origin of the difficulties in getting the straightforward Berg-Harris method to work (cf. Fig. 1). Typical values of $r$ are of no interest in radial distribution functions. In a homogeneous system such as a fluid, the typical distance between any two particles is on the order of half the system size. The length scales of interest in the radial distribution functions to describe local structure are typically much smaller than that.
To make this point clearer, consider a dilute gas of hard spheres with diameter $\sigma$. In the limit of infinite dilution, the radial distribution function is zero for $r < \sigma$ and unity otherwise, i.e. $g(r) = \theta(r - \sigma)$, with $\theta$ the unit step function. The probability density of interparticle distances is then [cf. Eq. (11)]
(14)  $P(r) = \frac{4\pi r^2}{V}\, \theta(r - \sigma)$
Because of the Jacobian factor of $r^2$, the most likely values of $r$ are of the order of the largest possible distance $r_{\max}$. This means that in a sample of interparticle distances in the dilute hard sphere system, the large-distance samples ($r \sim r_{\max}$) are much more abundant than the small-distance samples ($r \sim \sigma$). Thus, when approximating by a smooth function using the KS test, the long-distance part will be fitted very well, but the short-distance behavior ($r \approx \sigma$) will not, because the test used is sensitive to the typical values of $r$, which are of order $r_{\max}$.
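This imbalance is easy to quantify numerically. The sketch below uses hypothetical numbers ($\sigma = 1$, $r_{\max} = 10$, names illustrative): sampling distances with density proportional to $r^2$ and counting how few samples land at the short distances one actually cares about.

```python
import numpy as np

# Sample distances with density proportional to r^2 on [sigma, r_max],
# as for the dilute hard-sphere gas, via inverse-CDF sampling.
rng = np.random.default_rng(1)
sigma, r_max, n = 1.0, 10.0, 100_000
u = rng.random(n)
r = (sigma**3 + u * (r_max**3 - sigma**3)) ** (1.0 / 3.0)

# Fraction of "interesting" short-distance samples, r < 2 sigma.
# Analytically this fraction is (2^3 - 1) / (10^3 - 1) = 7/999, about 0.7%.
short = np.mean(r < 2 * sigma)
```

With less than one percent of the samples at short distances, a KS-based fit spends essentially all of its statistical weight on the uninteresting long-distance region.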
The conclusion that large distance values overwhelm the smaller ones in which one is interested is valid beyond the dilute hard sphere case, since the Jacobian factor in Eq. (14) acts on long length scales and is therefore also present in systems of higher density with nonnegligible interactions.
A crude attempt at solving the large distance problem is to impose a cutoff $r_c$ on the allowed values of $r$ in the sample of interparticle distances. This cutoff should be of the order of the distance over which $g(r)$ differs from one. But to determine that distance one has to measure $g(r)$. Thus, one would have to guess a value of the cutoff distance and adjust it until $g$ approaches one within a certain accuracy for $r < r_c$. Unfortunately, even when the cutoff is chosen in this way, the results, although better than those in Fig. 1, are still not very impressive, as Fig. 2 for the oxygen-oxygen radial distribution function shows, with $r_c$ set to Å (using again the same convergence threshold). One sees a lot of spurious oscillations in the smooth radial distribution functions, which are due to the high number of Fourier modes needed to fit the radial distribution function (a large number of modes was required in this case). Furthermore, these oscillations do not even agree within the 95% confidence interval with the histogram result, deviating especially for small distances $r$. The explanation is that the Jacobian factor in Eq. (12) is still present within the restricted sample $r < r_c$, and causes the peaks at larger $r$ values to be fitted better than the peaks at smaller $r$ values.
V.1.2 Overcoming overrepresentation through resampling
To remove the troublesome Jacobian factor in Eqs. (11) and (12) altogether, one can resample the interparticle distances found in the simulation. The idea is to draw from the original sample of distances $\{r_1, \ldots, r_n\}$, such that smaller values of $r$ are more likely than larger values, by introducing a relative bias weight $w(r_i)$ for each distance in the sample to construct a new, resampled set.footnote1b () Given that the probability density of the original sample points is $P(r)$ [cf. Eq. (11)], and provided that subsequent resampled data points are chosen independently with weight $w$, the probability density of the resampled points is given by
(15)  $\tilde P(r) = \frac{w(r)\, P(r)}{c}$
where $c$ is a normalization constant. Note that $c$ may be determined from
(16)  $c = \int w(r)\, P(r)\, dr$
which may be approximated by the average of the weight over the samples:
(17)  $c \approx \frac{1}{n} \sum_{i=1}^{n} w(r_i)$
In the application to the radial distribution functions below, we also obtained $c$ from a numerical integration using the biased analytic approximation for $\tilde P$, the result of which differed from that given by the simpler expression in Eq. (17) by less than .
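The sample-average estimate of the normalization constant is a one-liner. The sketch below uses an artificial uniform distance sample on $[1, 2]$ purely for illustration; for that choice the exact average of $1/r^2$ is $1/2$:

```python
import numpy as np

# Estimate the normalization constant c as the sample average of the
# bias weight w(r) = 1/r^2, using an illustrative stand-in distance sample.
rng = np.random.default_rng(2)
r = rng.uniform(1.0, 2.0, 10_000)
c_estimate = np.mean(1.0 / r**2)
# exact value for a uniform sample on [1, 2]: integral of 1/r^2 from 1 to 2 = 1/2
```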
The simplest way to counter the Jacobian factor in Eq. (15) is to choose a weight
(18)  $w(r) = \frac{1}{r^2}$
which gives
(19)  $\tilde P(r) = \frac{4\pi}{cV}\, g(r)$
with
(20)  $c = \frac{4\pi}{V} \int g(r)\, dr$
In detail, the resampling leading to data with a probability density $\tilde P$ can be performed as follows: Given the original set $\{r_1, \ldots, r_n\}$, one determines the weight for each data point as
(21)  $w_i = \frac{1}{r_i^2}$
One also determines the maximum weight,
(22)  $w_{\max} = \max_i w_i$
to convert the weights into probabilities
(23)  $p_i = \frac{w_i}{w_{\max}}$
which, by construction, lie between 0 and 1. Next, one takes one of the original sample points $r_i$ at random (with equal weight), draws a random number $u$ uniformly from the interval $[0,1]$, and if $u < p_i$, adds $r_i$ to the resampled data set. The procedure is repeated until enough resampled points have been gathered. There is some choice in the number of resampled points to be taken. In our implementation, the number of resampled points is chosen to be the same as the number of points in the original sample.
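The rejection resampling just described can be sketched as follows (a minimal Python sketch with illustrative names):

```python
import numpy as np

def resample_distances(r, rng, n_out=None):
    """Rejection resampling with bias weight w(r) = 1/r^2, which cancels
    the 4*pi*r^2 Jacobian factor of the distance density on average."""
    r = np.asarray(r, dtype=float)
    w = 1.0 / r**2                  # bias weight for each sample
    p = w / w.max()                 # acceptance probabilities in [0, 1]
    n_out = len(r) if n_out is None else n_out
    out = []
    while len(out) < n_out:
        i = rng.integers(len(r))    # pick an original point with equal weight
        if rng.random() < p[i]:     # accept with probability p_i
            out.append(r[i])
    return np.array(out)
```

As a check, feeding the routine a sample drawn with density proportional to $r^2$ should return an (approximately) uniform sample, with a correspondingly smaller mean distance.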
Note that Eq. (19) shows that the resampled probability density $\tilde P(r)$ and the radial distribution function are proportional to one another. Since $g(r)$ approaches one for large $r$, the large values of $r$ are no longer overrepresented in $\tilde P$.
For the resampled data, one can use the Berg-Harris smoothing method of Sec. III.1 to obtain a smooth approximation to $\tilde P$. Fig. 3 shows the effect of resampling on the cumulative distribution functions associated with $g_{OO}$ of the water model, using the same convergence threshold. It is evident that the small-$r$ tail of the unweighted (Berg-Harris) cumulative distribution is much enhanced in the resampled cumulative distribution, and therefore more relevant to the KS test. Note that we have plotted only the fits, and not the empirical cumulative distributions, because the fits are hard to distinguish from the empirical cumulative distributions. The differences are easier to see in the associated radial distributions, which are plotted in Fig. 4. In contrast to Fig. 2, the smooth approximation now follows the result of the histogram within error bars for the whole range of $r$ values. But as in Fig. 2, the supposedly smooth $g_{OO}$ exhibits fast oscillatory behavior within those error bars. This oscillatory behavior will be addressed next.
V.2 Hard-to-fit distribution functions
V.2.1 Unphysical oscillations
It is hard to reconcile the aim of having a smooth approximation to the radial distribution functions with the large number of Fourier modes needed to approximate the shape of $g(r)$, which starts out as a very flat function at small $r$, then increases sharply, reaches a maximum not far beyond this sharp rise, and then decays on a larger scale to one in an oscillatory fashion. The peculiar shape of $g(r)$ leads to the oscillatory behavior seen in Figures 2 and 4. Even when the oscillations lie within error bars of the histogram results, they reintroduce a remnant of the noise, similar in magnitude to that present in the histograms, that one is trying to avoid.
It is conceivable that there are expansions other than the Fourier expansion that are more suitable for describing a sharply increasing function followed by more moderate behavior. However, it is not clear at present what this specialized expansion should be. Therefore, a systematic and generic way will now be presented which avoids high-frequency modes in the parts of the fit that do not require them. This will also make the method more generally applicable to discontinuous and other hard-to-fit densities.
We note that in the following section, the resampling explained in the previous section is assumed to have been performed on the data already, and that the resampled data have been sorted ($\tilde r_1 \le \tilde r_2 \le \cdots \le \tilde r_n$). For notational convenience, the tildes are omitted below.
V.2.2 Resolving the oscillation problem using a piecewise approach
The decomposition of $\hat F$ in Eq. (7) can be adjusted to incorporate hard-to-fit radial distributions by allowing the Fourier decomposition to be different on subintervals within the total interval $[x_1, x_n]$. Let the full interval be divided into subintervals $[b_0, b_1]$, $[b_1, b_2]$, …, $[b_{K-1}, b_K]$, where $b_0 = x_1$ and $b_K = x_n$. The intervals will be labelled by a Greek index $\alpha$, which can run from $1$ to $K$. How to choose the points $b_\alpha$ (where $0 < \alpha < K$, with $b_0$ and $b_K$ fixed) will be discussed later. Different intervals can have different numbers of samples that fall within their range. The fraction of the samples that fall within interval $\alpha$ is denoted by $\nu_\alpha$.
On each subinterval, $\hat F$ is approximated by a linear part and a truncated Fourier transform of the remainder. Analogously to Eq. (8), the linear part in interval $\alpha$ is built from the scaled coordinate
(24)  $\lambda_\alpha(x) = \frac{x - b_{\alpha-1}}{b_\alpha - b_{\alpha-1}}$
Note that $\lambda_\alpha$ ranges from 0 to 1 in the interval $[b_{\alpha-1}, b_\alpha]$. The piecewise linear approximation $F^{(1)}$ to $\hat F$ for $x$ in interval $\alpha$ is then
(25)  $F^{(1)}(x) = \sum_{\beta=1}^{\alpha-1} \nu_\beta + \nu_\alpha\, \lambda_\alpha(x)$
where it should be noted that
(26)  $F^{(1)}(b_\alpha) = \sum_{\beta=1}^{\alpha} \nu_\beta$
An example of such an $F^{(1)}$ is shown in Fig. 5, based on data from the oxygen-oxygen radial distribution function in the water model (as explained below). Approximations to $\hat F$ beyond $F^{(1)}$ are found by adding Fourier modes for each interval,
(27)  $F(x) = \sum_{\alpha=1}^{K} \chi_\alpha(x) \left[ \sum_{\beta=1}^{\alpha-1} \nu_\beta + \nu_\alpha \left( \lambda_\alpha(x) + \sum_{k=1}^{m_\alpha} d_k^{(\alpha)} \sin\left[k\pi \lambda_\alpha(x)\right] \right) \right]$
Here, $\chi_\alpha$ is the characteristic function on interval $\alpha$, i.e.,
(28)  $\chi_\alpha(x) = 1$ if $b_{\alpha-1} \le x < b_\alpha$, and $\chi_\alpha(x) = 0$ otherwise,
and $d_k^{(\alpha)}$ is given by
(29)  $d_k^{(\alpha)} = 2 \int_0^1 \left[ \hat F_\alpha(\lambda) - \lambda \right] \sin(k\pi\lambda)\, d\lambda$
where
(30)  $\hat F_\alpha(x) = \frac{1}{\nu_\alpha} \left[ \hat F(x) - \sum_{\beta=1}^{\alpha-1} \nu_\beta \right]$
The function $\hat F_\alpha$ is the conditional empirical cumulative probability distribution function for data points within interval $\alpha$. Eq. (27) represents a piecewise analytic approximation to the cumulative distribution. Since the Fourier modes form a complete orthonormal basis, the approximation becomes exact as $m_\alpha \to \infty$ for all $\alpha$. As before, however, the $m_\alpha$ should not become too large, to avoid spurious oscillations, which only amount to fitting the noise. One therefore defines a maximum number of Fourier modes $m_{\max}$ allowed in each subinterval and subdivides intervals if the maximum number of modes is not sufficient for convergence, as determined by the KS test.
The division of the original interval is chosen dynamically as follows:

1. Start with one interval;

2. Increase $m$ until $q > p_c$;

3. If $m$ exceeds $m_{\max}$, find the most deviant point $x_s$ according to the KS test;

4. Split the interval in two at the point $x_s$;

5. Repeat for the conditional distribution for each interval left and right of $x_s$.
Because the procedure is recursive, in each step the original one-interval procedure is used, with one exception: the allowable value of $p_c$ may be set to a lower value in a subinterval, since there are fewer points in the interval and therefore the interval carries less statistical weight. We have not found a unique, statistically controlled way to adjust $p_c$ for subintervals, but found heuristically that scaling the $p_c$ for interval $\alpha$ by $\nu_\alpha$ avoids fitting the noise and leads to a satisfactory overall value.
The recursive, multi-interval procedure is found to greatly speed up the convergence of the approximation scheme, leading to much lower $m_\alpha$, and thus fewer oscillations. It should be stressed that splitting at the most deviant point in the KS test is found to be essential here: choosing a different splitting point does not improve the convergence, because the most deviant point is then still difficult to fit, and the $q$ value for the subinterval containing the most deviant point remains the same.
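The recursive splitting logic can be sketched independently of the details of the single-interval fit. In the sketch below, `fit_interval` and `ks_pvalue` are caller-supplied stand-ins for the single-interval Fourier fit and the KS probability; their names and signatures are illustrative assumptions, not taken from the original method.

```python
def piecewise_fit(data, fit_interval, ks_pvalue, p_c=0.95, m_max=10):
    """Skeleton of the recursive interval-splitting procedure.

    fit_interval(data, m) must return (fit, x_s), where x_s is the most
    deviant sample point of the KS test; ks_pvalue(fit, data) must return
    the KS probability q of the fit.  Returns a list of per-interval fits."""
    for m in range(1, m_max + 1):
        fit, x_s = fit_interval(data, m)
        if ks_pvalue(fit, data) > p_c:
            return [fit]                      # converged on this interval
    # m_max modes were not enough: split at the most deviant point x_s
    left = [x for x in data if x <= x_s]
    right = [x for x in data if x > x_s]
    if not left or not right:                 # cannot split any further
        return [fit]
    # heuristic: scale the threshold in each subinterval by its sample fraction
    return (piecewise_fit(left, fit_interval, ks_pvalue,
                          p_c * len(left) / len(data), m_max)
            + piecewise_fit(right, fit_interval, ks_pvalue,
                            p_c * len(right) / len(data), m_max))
```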
Once the approximation in Eq. (27) has been obtained, the probability density is given by
(31)  $p(x) = \frac{\nu_\alpha}{b_\alpha - b_{\alpha-1}} \left\{ 1 + \sum_{k=1}^{m_\alpha} k\pi\, d_k^{(\alpha)} \cos\left[k\pi \lambda_\alpha(x)\right] \right\}$
where $\alpha$ is such that $b_{\alpha-1} \le x < b_\alpha$.
The results of applying the piecewise procedure to the $g_{OO}$ of the water model are shown in Fig. 6 (remembering that $g(r) \propto \tilde P(r)$), for given values of $m_{\max}$ and the initial $p_c$. While the method now works better than without the piecewise approach, it has one drawback: the derivative of the approximate cumulative distribution function, which gives the probability density, need not be a smooth function across the different intervals. As a consequence, the result in Fig. 6 shows artificial discontinuities. These discontinuities at the boundaries of the intervals fall within the 95% confidence intervals (not shown in Fig. 6).
V.2.3 Dealing with spurious discontinuities
Provided the underlying probability distribution of the samples is continuous, the spurious discontinuities that are present in the piecewiseanalytic fit will get smaller as the number of sample points increases. Nevertheless, for cases with poor statistics, one would like to have an approach that gives a continuous piecewise analytic fit to the distribution. In fact, for some applications, having a continuous radial distribution function is essential, such as for Brownian simulations that take the potential of mean force as input. Even though the discontinuities are within the statistical noise, they would lead to sudden, unphysical changes in energy in such applications.
There are many ways to get a continuous curve out of the discontinuous one, but it should be remembered that one is working within the statistical noise. We can therefore choose any method as long as the “patch” is still statistically reasonable. To determine the statistical suitability, one can once again use the KS test on the patched fit.
We have chosen the following simple procedure: around each discontinuity at a split point x_s, a quadratic patch is added to the cumulative distribution,
(32) ΔF(x) = a (|x − x_s| − w)² for |x − x_s| ≤ w,
while the fit is left unchanged elsewhere. The width w of the patch is adjustable, while the prefactor a follows from the requirement that the cumulative distribution has a continuous derivative at x_s. Provided the different patches do not overlap, this leads to a value of
(33) a = Δρ/(4w),
where Δρ is the height of the jump in the unpatched distribution function.
Initially, each width is set equal to half the minimum interval size on either side of the corresponding split point, to avoid overlap between the patches. The quality value of the patched fit is then determined via the KS test, and if it is not smaller than the value for the unpatched fit, the patch is accepted. Otherwise, the width of the patch is reduced by a factor of two until the quality value is acceptable.footnote2 () It will be demonstrated below that this solves the problem of spurious oscillations and discontinuities.
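The patching step can be sketched in Python. This is an illustrative sketch, not the paper's implementation: it assumes a symmetric quadratic patch a(|x − x_s| − w)² added to the cumulative distribution, with the prefactor chosen so that the patch's derivative jump of −4aw cancels the density jump at the split point.

```python
def patch_prefactor(jump, width):
    """Prefactor a of a quadratic CDF patch a*(|x - xs| - w)**2, chosen so
    the patched CDF has a continuous derivative at the split point xs:
    the patch's derivative jump, -4*a*w, must cancel the density jump."""
    return jump / (4.0 * width)

def patched_cdf(cdf, xs, jump, width):
    """Return the CDF with a quadratic patch of the given width added
    around the discontinuity of its derivative at xs.  The patch vanishes
    (with zero slope) at xs +/- width, so the fit is unchanged outside."""
    a = patch_prefactor(jump, width)
    def f(x):
        if abs(x - xs) <= width:
            return cdf(x) + a * (abs(x - xs) - width) ** 2
        return cdf(x)
    return f
```

For example, a CDF whose density jumps from 0.5 to 1.5 at x = 0.5 acquires, after patching with w = 0.2, the common slope 1.0 on both sides of the split point.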
V.3 Final results
Figs. 7 show the results of the resampled, piecewise procedure for the three radial distribution functions of the water model. It is clear that the piecewise analytic fit is now smoother than the histogram, hardly shows any oscillations, and is continuous. Furthermore, while error bars were omitted in Figs. 7 for clarity, it was found that the histogram and the patched, piecewise analytic results are in mutual agreement within the 95% confidence intervals.
It may be argued that plotting the histograms using bars is an unfair way of representing the histogram results. One often takes the histograms and applies a cubic spline fitNumRecipes () to the results, which makes the graph appear smoother. There is of course no a priori reason why the splined histogram should represent the radial distribution function any better. For this reason, we have shown only ‘unsplined’ histograms so far. The comparison between the histograms and the smooth approximations in these curves should really only be made at the midpoints of the histogram bins. Now that the piecewise analytic smoothing approach is fully developed, however, it is interesting to see how it compares to a smoothed spline fit of the histogram results. Figure 8 shows a plot of these two types of smooth results. The bin size for the histograms has been chosen such that the first peak of the radial distribution is well resolved. One sees that the cubic spline fit does a reasonable job for the first peak, but that for the second peak in the radial distribution function the spline fit is still noisy, while the statistically controlled piecewise analytic result is not. Thus, splines cannot fix the roughness of the histogram method, at least not when bin sizes are uniform.
To show how the piecewise analytic method performs for potentials of mean force, the smooth potentials of mean force between the different species, corresponding to the radial distributions in Figs. 7 [cf. Eq. (13)], have been plotted in Figs. 9. The results from the piecewise analytic method are considerably smoother than the histogram results, but still exhibit some roughness since they were based on only four configurations.
VI Further applications
The piecewise approximation method is not specific to radial distribution functions, but can help to fit any probability density that is hard to fit with a truncated Fourier series. Below, we will give several examples of such densities and show the advantages of using the piecewise analytical approximation method.
VI.1 A discontinuous density
Consider a random variable with a distribution given by
(34) 
Note that within the domain of this function there is a discontinuity. Discontinuities are very poorly represented by truncated Fourier series.Jeffreys ()
From the above distribution, 50,000 samples were drawn and used as input to the piecewise analytic approximation, as well as to the one-interval analytic approximation and the histogram method. The results are shown in Fig. 10.
One clearly sees the trouble that the Berg-Harris one-interval approximation has in capturing the discontinuity, while the histogram is very noisy. The piecewise analytic expansion, on the other hand, captures the whole distribution beautifully; the automated procedure divides the interval into two at the discontinuity and then needs zero Fourier modes to approximate the separate pieces.
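For readers who wish to reproduce this kind of test, samples from a density with a jump are easy to generate by inverse-CDF sampling. The following Python sketch uses an illustrative two-plateau density, not the specific density of Eq. (34):

```python
import random

def sample_discontinuous(n, seed=42):
    """Inverse-CDF sampling from an illustrative density with a jump:
    p(x) = 1/4 on [0, 1) and p(x) = 3/4 on [1, 2].  The CDF is
    F(x) = x/4 below the jump and F(x) = 1/4 + 3*(x - 1)/4 above it."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.random()
        if u < 0.25:
            out.append(4.0 * u)                        # invert the lower branch
        else:
            out.append(1.0 + 4.0 * (u - 0.25) / 3.0)   # invert the upper branch
    return out
```

A quarter of the samples fall below the jump at x = 1, so a histogram of such data shows a step that a single truncated Fourier series struggles to reproduce.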
VI.2 The standard Cauchy distribution
According to Berg and Harris,BergHarris08 () the original smoothing method has trouble with densities with long tails. It is therefore interesting to see whether the piecewise approach helps for these kinds of densities as well. As an example, we consider the standard Cauchy distribution
(35) ρ(x) = 1/[π(1 + x²)].
From this distribution, 50,000 samples were randomly drawn and used in the piecewise approach. The results are contrasted with those of the histogram in Fig. 11. The piecewise smooth approximation performs so well that it can hardly be distinguished from the exact Cauchy distribution.
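Standard Cauchy samples can be drawn by inverting the cumulative distribution F(x) = 1/2 + arctan(x)/π. A minimal Python sketch (the function names are illustrative):

```python
import math, random

def sample_cauchy(n, seed=1):
    """Draw standard-Cauchy variates by inverting the cumulative
    distribution F(x) = 1/2 + arctan(x)/pi, i.e. x = tan(pi*(u - 1/2))
    with u uniform on (0, 1)."""
    rng = random.Random(seed)
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def cauchy_cdf(x):
    """Exact CDF of the standard Cauchy distribution, useful for KS checks."""
    return 0.5 + math.atan(x) / math.pi
```

Comparing the empirical CDF of such samples against cauchy_cdf is exactly the kind of KS comparison on which the quality criterion of the smoothing method is built.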
Not plotted in Fig. 11 are the results of the original Berg-Harris method, which, surprisingly, are so similar to those of the piecewise approach that they would be hard to make out in the figure. The only statistically significant difference between the two results is in the height of the maximum, which the Berg-Harris fit slightly underestimates.
The success of both methods is illustrated further in Fig. 12, in which the errors in the piecewise and the one-interval results are compared; the piecewise analytic error is overall slightly smaller than that of the Berg-Harris method, but not by much. This success seems to contradict Berg and Harris’ warning against using the smoothing method for densities with long tails. The relatively high quality of the fit in the simple approach is possibly due to the symmetric nature of the Cauchy density and its relative lack of structure.
A more stringent test of the method applied to long-tailed distributions is presented next.
VI.3 First arrival time distribution of a diffusing particle to a sphere
Consider a particle undergoing diffusion, starting at a distance r₀ from the origin, i.e., anywhere on a sphere of radius r₀. The distribution function P of the diffusing particle satisfies the diffusion equation ∂P/∂t = D∇²P, where D is the self-diffusion constant. Investigating the time required to arrive anywhere on the surface of a sphere of radius a < r₀ for the first time is a classic first-passage problem.Redner () Such problems have applications in the rate at which molecules find each other in solution, and are a major determining factor of reaction rates in diffusion-limited reactions.
It turns out that the first arrival times are distributed according tofootnote3 ()
(36) ρ(t) = (a/r₀) Δr exp[−Δr²/(4Dt)] / √(4πDt³),
where Δr = r₀ − a. This distribution function has both short-time structure and a long tail.
From the distribution in Eq. (36), 100,000 samples were randomly drawn, with parameters chosen to set the time and length units. The piecewise approach was used to obtain an estimate of the probability distribution, as well as the one-interval analytic approximation and the histogram method. The results are compared in Figs. 13 and 14.
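Conditional on eventual arrival (which in three dimensions happens with probability a/r₀), a density of the Lévy form Δr exp[−Δr²/(4Dt)]/√(4πDt³) can be sampled exactly: its survival function is erf(Δr/√(4Dt)), so t = Δr²/(2DZ²) with Z a standard normal variate. A Python sketch under this assumption (names are illustrative):

```python
import math, random

def sample_arrival_times(n, delta=1.0, diff_const=1.0, seed=7):
    """Sample first-arrival times, conditional on the particle eventually
    reaching the absorbing sphere.  Assumes the conditional density has the
    Levy form rho(t) = delta/sqrt(4*pi*D*t**3) * exp(-delta**2/(4*D*t)),
    with delta the distance to the absorber and D the diffusion constant.
    Its survival function is erf(delta/sqrt(4*D*t)), which is inverted by
    t = delta**2 / (2*D*Z**2) with Z a standard normal variate."""
    rng = random.Random(seed)
    times = []
    while len(times) < n:
        z = rng.gauss(0.0, 1.0)
        if z != 0.0:                       # guard against division by zero
            times.append(delta * delta / (2.0 * diff_const * z * z))
    return times
```

Because the sampling is exact and rejection-free, the long tail (t ∼ t^(−3/2) decay of the density) is represented faithfully in the sample, which is precisely what makes this a stringent test for any smoothing method.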
While all methods work to some extent, the histogram method would have to be tailored to the function in question in order to be useful, with differently sized bins for different values of t. The Berg-Harris method works reasonably well without subdivisions, but exhibits fast oscillations in the tails of the density, as becomes very apparent from Fig. 14. The appearance of oscillations is not surprising when one considers the large number of Fourier modes that was needed for the Berg-Harris fit! The piecewise analytic result does not exhibit fast oscillations in the tail. Furthermore, the piecewise fit required a total of only 28 Fourier modes distributed over 8 intervals. Although the piecewise result does show some modulation within the error bars, we have checked that almost all of it disappears when one takes 10 times as many data points. At that level of statistics, the Berg-Harris results still have oscillations in the tails, and still require over 350 Fourier modes.
When the analytical form of the distribution is not known, such as for absorption in non-spherical geometries, but samples are available from numerical simulation, the piecewise approach would give the best description of the first arrival time density: one that is less noisy than the histogram construction and has fewer oscillations than the one-interval Berg-Harris method.
VII Conclusions
In this paper, a method to obtain smooth analytical estimates of probability densities, radial distribution functions and potentials of mean force in a statistically controlled fashion without using histograms was presented. This method uses only direct samples of the data (distance samples in the case of radial distribution functions). Since the method is expected to be generally useful, we have made our implementation, coded in C and C++, available on the web.footnote4 ()
While the method is based on the Berg-Harris method, the statistical criterion used in that method is most sensitive to the most common samples, which for radial distribution functions are not the ones of physical interest. To make the method work for radial distribution functions, a weighted resampling of the data was required. Spurious oscillations, allowed by the statistical noise, were eliminated using a piecewise approach. In addition, one can optionally patch the piecewise-analytic form to avoid discontinuities within the errors, if desired. The resampled, piecewise smoothing method was demonstrated on data from event-driven DMD simulations of water, and proved to give a much smoother result than the histogram method.
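The resampling step can be sketched as follows. This is a hypothetical Python illustration of a weighted bootstrap; the weight w(r) = 1/r² shown in the example is one natural choice that undoes the three-dimensional Jacobian, so that the KS criterion is no longer dominated by the most common (large-r) samples.

```python
import bisect, random

def weighted_resample(values, weight, n_out, seed=3):
    """Weighted bootstrap: draw n_out points from `values` with probability
    proportional to weight(v), using the cumulative-weight table and a
    binary search.  For radial distances in 3D, weight(r) = 1/r**2 undoes
    the r**2 Jacobian factor in the sampled data."""
    rng = random.Random(seed)
    cum = []
    total = 0.0
    for v in values:
        total += weight(v)
        cum.append(total)
    return [values[bisect.bisect_left(cum, rng.random() * total)]
            for _ in range(n_out)]
```

The resampled set can then be fed directly into the piecewise fitting procedure, with the fitted distribution transformed back by the inverse weight afterwards.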
If the purpose of a simulation is just to get a good smooth result for the radial distribution functions of a water model, one could use the histogram method with longer simulation runs to reduce the statistical uncertainties to some predetermined level. However, the histogram would still be biased and known only at a lattice of points if a uniform bin size is used. To get a smoother and less biased curve from the histogram, one would also have to decrease the bin size by hand. The piecewise analytic method makes such tuning unnecessary. The method also makes much longer runs unnecessary, since what appears to be very poor statistics for the histogram method turns out to be quite reasonable statistics for the piecewise analytic method. Recall that the same set of interatomic distances is used in both methods. Apparently, there is much more statistical information in the sample than the histogram is using (something that Berg and Harris also noticedBergHarris08 ()).
For the test case considered here, the computational cost of doing longer runs is not that large, so the advantages of the piecewise analytic method may seem nice but not necessary. In other applications, however, such as in studies of the distribution of water molecules near a polymer or a biomolecule (or near one of their subunits), better statistical information is costly to obtain: such systems are not only computationally more demanding, but there are also far fewer samples available in a single configuration, since only the water molecules near the polymer or the biomolecule are involved. If histograms are used, the simulation run times would have to be much longer to compensate for these poor statistics. In such cases, the piecewise analytic method is expected to be advantageous, since it appears able to use more of the information present in the interparticle distance sample.
The piecewise analytic method yields a smooth approximation to the probability density function, and the deviations of the empirical distribution from this smooth curve are supposed to be due to statistical noise. If this is true and the methods are unbiased, then for a large enough sample size one would expect the distribution of errors in the probability density functions to be Gaussian with zero mean. To test whether this is the case, the probability densities of the relative errors plotted in Fig. 14 were determined, both for the Berg-Harris and for the piecewise analytic method.footnote5 () The results are shown in Fig. 15. Within statistical uncertainty, both distributions were found to be unbiased. Furthermore, for the piecewise analytic results, the error densities are roughly Gaussian. The errors from the Berg-Harris method, on the other hand, seem to show deviations from Gaussian behavior. The non-Gaussian nature of the errors of the Berg-Harris method might be due to the sample noise being fitted too closely when a large number of Fourier modes is needed, so that the distribution of errors follows the distribution of the finite sample-size errors rather than being truly random. The Gaussian nature of the errors in the piecewise analytic method might be a confirmation that in the smooth approximation, enough Fourier modes have been taken into account for the remainder to be due to random statistical noise.
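Such a check can be automated. A minimal Python sketch computes the mean and excess kurtosis of the fit residuals (the names are illustrative); for unbiased, Gaussian errors both quantities should be close to zero.

```python
def error_moments(residuals):
    """Return (mean, excess kurtosis) of a list of fit residuals.
    For unbiased errors the mean is near zero; for Gaussian errors the
    excess kurtosis (fourth standardized moment minus 3) is near zero."""
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((e - mean) ** 2 for e in residuals) / n
    kurt = sum((e - mean) ** 4 for e in residuals) / n / var ** 2 - 3.0
    return mean, kurt
```

A strongly negative excess kurtosis indicates errors more bounded than Gaussian, while a positive value indicates heavy-tailed errors, such as those produced by fitting the sample noise too closely.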
One of the nice features of the piecewise analytic method is that one does not have to choose a bin size a priori, as one has to do in the histogram method. It is nonetheless true that within the smoothing method, one is to some extent free to choose the cutoff value of the ‘quality’ parameter and the maximum number of basis functions allowed in the expansion of each interval. Note that both are dimensionless numbers and do not contain physical parameters, in contrast to the bin size parameter required in the histogram approach. Setting the quality cutoff too low may result in a bad fit to the data, while allowing too many basis functions may result in noise fitting and artificial oscillations. We have found reasonable default choices for both, and these values were used in all the applications presented above.
There is also some freedom in the choice of the set of basis functions, as long as they form a complete set. The Fourier basis used here is convenient and familiar, but others are possible. For instance, in Ref. VanZonetal08b, a Chebyshev expansion was used. That expansion was chosen over the Fourier expansion because it gave fewer oscillations. The piecewise approach presented in this paper, however, resolves the oscillation problem as well, and is expected to be less sensitive to the choice of basis functions than the original Berg-Harris method. Furthermore, the method was shown to be applicable to other potentials of mean force and other probability densities with a hard-to-fit character.
Acknowledgements.
We would like to thank Prof. P. N. Roy for useful discussions. This work was supported by the Natural Sciences and Engineering Research Council of Canada.

References
 (1) D. A. McQuarrie, Statistical Mechanics (Harper Collins Publishers, 1976).
 (2) D. Frenkel and B. Smit, Understanding molecular simulation: From Algorithms to Applications (Academic Press, San Diego, 2002).
 (3) H. J. C. Berendsen, Simulating the Physical World: Hierarchical Modeling from Quantum Mechanics to Fluid Dynamics (Cambridge University Press, Cambridge, UK, 2007).
 (4) B. A. Berg and R. C. Harris, Comp. Phys. Comm. 179, 443 (2008).
 (5) R. van Zon, L. Hernández de la Peña, G. H. Peslherbe, and J. Schofield, Phys. Rev. E 78, 041103 (2008).
 (6) T. Gericke, P. Würtz, D. Reitz, T. Langen, and H. Ott, Nature Physics 4, 949 (2008).
 (7) L. Mones, P. Kulhánek, I. Simon, A. Laio, and M. Fuxreiter, J. Phys. Chem. B 113, 7867 (2009).
 (8) Y. Liu and T. Ichiye, J. Phys. Chem. 100, 2723 (1996).
 (9) A. Chandra and T. Ichiye, J. Chem. Phys. 111, 2701 (1999).
 (10) M.L. Tan, J. T. Fisher, A. Chandra, B. R. Brooks, and T. Ichiye, Chem. Phys. Lett. 376, 646 (2003).
 (11) C. J. Fennell and J. D. Gezelter, J. Chem. Phys. 120, 9175 (2004).
 (12) R. van Zon and J. Schofield, J. Chem. Phys. 128, 154119 (2008).
 (13) L. Hernández de la Peña, R. van Zon, J. Schofield, and S. B. Opps, J. Chem. Phys. 126, 074105 (2007).
 (14) L. Hernández de la Peña, R. van Zon, J. Schofield, and S. B. Opps, J. Chem. Phys. 126, 074106 (2007).
 (15) R. van Zon and J. Schofield, J. Comput. Phys. 225, 145 (2007).
 (16) E. C. Cyr and S. D. Bond, J. Comput. Phys. 225, 714 (2007).
 (17) A. B. Adib and C. Jarzynski, J. Chem. Phys. 122, 014114 (2005).
 (18) J. E. Basner and C. Jarzynski, J. Phys. Chem. B 112, 12722 (2008).
 (19) W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in Fortran, The Art of Scientific Computing (Cambridge University Press, Cambridge, UK, 1992), 2nd ed.
 (20) Alternatively, one can use other tests such as the Kuiper test, but these are less convenient for the piecewise approach presented later, since they involve several “hardtofit” points.
 (21) M. A. Stephens, J. Royal Stat. Soc. B 32, 115 (1970).
 (22) B. Efron, The Annals of Statistics 7, 1 (1979).
 (23) H. R. Künsch, The Annals of Statistics 17, 1217 (1989).
 (24) Another good explanation of the sensitivity of the KS test to values near the median can be found in Ref. NumRecipes, , Ch. 4, Sec. 3.
 (25) One could imagine performing the reweighting at the level of the KS test instead, but this would modify the statistic. The problem is that it would then not be known what a reasonable testing criterion for this new statistic should be, or even whether it admits a simple formulation such as the one available for the KS statistic.
 (26) A procedure in which the piecewise Fourier series is restricted to be continuous across the interval boundaries was also considered but found to lead to undesirable oscillations near discontinuities, reminiscent of the Gibbs phenomenon.Jeffreys ()
 (27) H. Jeffreys and B. S. Jeffreys, Methods of Mathematical Physics (Cambridge University Press, Cambridge, UK, 1956), 3rd ed.
 (28) S. Redner, A Guide to First-Passage Processes (Cambridge University Press, Cambridge, UK, 2001).
 (29) This is the Laplace inverse of Eq. (6.4.8) in Ref. Redner.
 (30) http://www.chem.utoronto.ca/~rzon/Code.html.
 (31) This analysis was also attempted for the deviations plotted in Fig. 12 but the error bars on the resulting distributions were too large to tell whether they are biased or nonGaussian.