Constructing smooth potentials of mean force, radial distribution functions and probability densities from sampled data

# Constructing smooth potentials of mean force, radial distribution functions and probability densities from sampled data

## Abstract

In this paper a method of obtaining smooth analytical estimates of probability densities, radial distribution functions and potentials of mean force from sampled data in a statistically controlled fashion is presented. The approach is general and can be applied to any density of a single random variable. The method outlined here avoids the use of histograms, which require the specification of a physical parameter (bin size) and tend to give noisy results. The technique is an extension of the Berg-Harris method [B.A. Berg and R.C. Harris, Comp. Phys. Comm. 179, 443 (2008)], which is typically inaccurate for radial distribution functions and potentials of mean force due to a non-uniform Jacobian factor. In addition, the standard method often requires a large number of Fourier modes to represent radial distribution functions, which tends to lead to oscillatory fits. It is shown that the issues of poor sampling due to a Jacobian factor can be resolved using a biased resampling scheme, while the requirement of a large number of Fourier modes is mitigated through an automated piecewise construction approach. The method is demonstrated by analyzing the radial distribution functions in an energy-discretized water model. In addition, the fitting procedure is illustrated on three more applications for which the original Berg-Harris method is not suitable, namely, a random variable with a discontinuous probability density, a density with long tails, and the distribution of the first arrival times of a diffusing particle to a sphere, which has both long tails and short-time structure. In all cases, the resampled, piecewise analytical fit outperforms the histogram and the original Berg-Harris method.

## 1Introduction

In many situations the probability to find a particle at a certain distance from a given other particle is of interest. This probability is given by the radial distribution function. In addition to providing detailed information on the local structure of a system, one can often express thermodynamic quantities such as the pressure, energy and compressibility in terms of the radial distribution functions.[1] Furthermore, the radial distribution function can be reformulated in terms of a potential of mean force, which is of great importance in multi-scale simulations that take the potential of mean force as input in Langevin or Brownian dynamics simulations.[2]

Radial distribution functions are usually constructed in simulations by forming a histogram of sampled inter-particle distances. [2] However, histograms contain the bin size as a free parameter, and can be rather noisy for a poorly selected value of this parameter. For probability densities, an alternative method recently proposed by Berg and Harris avoids histograms and yields a smooth analytical form for the probability density which describes the sample data statistically at least as well as the noisy histogram.[4] This method has already proved useful in the context of the computation of quantum free energy differences from non-equilibrium work distributions,[5] the determination of the density in Bose-Einstein condensates, [6] and in the distribution of a reaction coordinate in simulations of chemical reactions.[7]

Having a similar method for potentials of mean force and radial distribution functions would have many advantages. As in the Berg-Harris method, such an approach would avoid the noise that accompanies the standard histogram method, while no a priori choice of bin size is required. Furthermore, a smooth radial distribution would give a better representation of the potential of mean force (which is related to the logarithm of the radial distribution function), and allows the function to be evaluated at any point in the range over which it is defined. Finally, the expressions for the pressure, energy and compressibility in terms of the radial distribution functions involve integrals of the radial distribution function over . Such integrals can be evaluated more accurately from an analytic form than from a histogram, whose accuracy is restricted by the bin size.

The purpose of this paper is to develop an approach to obtain smooth radial distribution functions and potentials of mean force from sampled inter-particle distances. The resulting method turns out to be suitable not just for radial distribution functions and potentials of mean force, but for a large class of densities of single random variables.

The paper is structured as follows. Section 2 presents a model of water in which rigid molecules interact through a discretized potential. Construction of radial distribution functions from data derived from this model will be used as a running test case. This section also contains some details on the generation of sampled data through simulation. The Berg-Harris method for smoothing probability densities in a statistically controlled fashion, and the connection between probability distribution functions and radial distribution functions, are briefly reviewed in Section 3. In Section 4, simulation results are presented that show the shortcoming of the Berg-Harris method for radial distribution functions. The extended method is developed in Section 5, with Section 5.1 containing an explanation for the poor performance and its solution through statistical resampling. Section 5.2 addresses an additional problem related to the particular shape of radial distribution functions, which is solved by extending the method to use piecewise analytic functions. In Section 6, the generality of the method is illustrated by presenting the results of applying the method to data drawn from three different probability densities which are problematic for the original Berg-Harris method. The paper concludes with a discussion in Section 7.

## 2System: a discrete water model

In the development of the smooth approximation method below, a model of rigid water molecules subject to a discretized interaction potential between the molecules will be used as a running test case for the construction of radial distribution functions. Since a water molecule consists of two kinds of atoms (oxygen and hydrogen), there are three radial distribution functions in this system, , and , which turn out to have quite different character and therefore give a more stringent test of the smooth fitting methods than a single-atom model would.

The relative distances between the atomic sites in molecules are fixed, making the molecules rigid bodies. Each state of each body can therefore be described by its center-of-mass position , its orientation or attitude matrix that transforms coordinates from the lab frame to a body-fixed frame for molecule , and the associated linear and angular momenta. Here the body-fixed frame is chosen so that the third row of the matrix corresponds to the direction of the molecule’s dipole , whose magnitude is fixed at a value .

The interaction potential between a pair of molecules and is a discrete version of the soft sticky dipole potential [8]

where the Lennard-Jones, dipole, and sticky parts of the potential are, respectively, given by

where and are the conventional spherical angles of the vector , and and are those of . Finally, the switching function is defined as

while is given by the same form with primed parameters and .

Here, the so-called SSD/E reparameterization of the model was used, for which the parameter values are kcal/mol, Å, D, kcal/mol, and , while the cut-off parameters in the functions and are taken to be Å, Å, Å and Å.[11]

The discontinuous interaction potential in our system is obtained from the smooth potential by the controlled energy discretization method presented in Ref. . In this discretization, a cut-off naturally arises, and therefore no reaction field was included.

The natural simulation technique for systems with discretized potentials is discontinuous molecular dynamics, or DMD. [13] In DMD, the dynamics of the system is free (no forces or torques are present) in between interaction events at which linear and angular momenta change. By its very nature, the energy-discretization scheme used in DMD is symplectic, time-reversible and strictly conserves the total (discretized) energy. It has no fixed time-step, but moves from event to event. The event frequency, which sets the efficiency of the method to a large degree, is determined by the level of discretization of the potential energy: the finer the discretization, the more events occur. One typically finds that a discretization of the order of already suffices for a reasonably accurate simulation of the smooth system.[12] In the simulations of which the results are presented below, the discretization of the potential energy was set to .

The orientational dynamics of free rigid bodies is an important aspect in the calculation of interaction times.[13] The solution of the equations of motion for the angular momenta and attitude matrix depends on the symmetry of the body through the principal moments of inertia. Although any value of the principal moments can be utilized to sample configurations of a system of rigid molecules, the DMD simulations presented here made use of the exact solution of the equations of motion for a free, asymmetric rigid rotor,[13] and hence were based on the exact dynamics of the system.

At the time of a distance measurement in the simulation, the forces and torques are most likely zero, since these quantities are non-zero only at discrete time points. This makes alternative smoothing methods, such as the weighted residual method[16] and methods based on equilibrium identities using (smooth) forces,[17] not applicable here. Since only the inter-particle distances are available as input for the determination of the analytically-fitted radial distribution functions, the comparison with histogram-based radial distribution functions is more equitable.

## 3Review

### 3.1Smoothing probability densities with the Berg-Harris method

Consider a random variable which has a probability density . Suppose that one has a sample of data points that are independently drawn from the density . How can one estimate from the data points? One way is to bin the data points into a histogram. Because histograms are quite sensitive to statistical noise, Berg and Harris developed the following alternative procedure to obtain an analytical estimate for the probability density from the data.[4]

In the Berg-Harris method, the data are first sorted such that . Using the sorted data, the empirical cumulative distribution function is defined in the range as

Although becomes a better approximation to the true cumulative distribution with increasing sample size , it is a function with many steps for any finite value of so that its derivative is not analytic but rather consists of delta functions.

The next step consists of writing the function as a sum of a linear term and a Fourier expansion. The expansion is truncated at the th term:

where the linear term is defined as

Furthermore, the Fourier coefficients in Eq. (Equation 2) are determined from

When is approximated by , the probability density is approximately

Because is zero at the end points of the interval , and the Fourier modes form a complete orthonormal basis of the space of such functions, the empirical cumulative distribution and the associated probability density are reconstructed exactly by Eqs. (Equation 2) and (Equation 4) as . However, as mentioned above, this limit would yield a series of delta functions for the probability density . The aim is to truncate the series at a level which is not too high that one is fitting the noise, but high enough to give a good smooth approximation to and .

The final step of the procedure is therefore to find the appropriate number of Fourier terms in a statistically controlled fashion. The value of is determined here using the Kolmogorov-Smirnov (K-S) test.[4] This test determines how likely it is that the difference between the empirical cumulative distribution function and its analytical approximation is due to noise.[20] The test takes the maximum variation between and over the sampled points (), and returns a probability that the difference between the two cumulative distribution functions is due to chance. The functional dependence of on is known to a good (asymptotic) approximation.[21]

A small value of indicates that the difference between the cumulative distribution functions is statistically significant, i.e. the quality of the expansion is insufficient to represent the data. One therefore carries out a process of progressively increasing the number of Fourier modes and evaluating as well as until the value of is larger than some convergence threshold . A reasonable value for this convergence value is .

In Ref. , errors were estimated using the jackknife algorithm,[22] and the same method will be used here. However, since the inter-particle distance data naturally come in blocks, each corresponding to a single configuration, not all data points are independent. To account for this, we use a block-version of the jackknife method, in which a single block of data is omitted in each jackknife sample.[23]

### 3.2From probability densities to potentials of mean force

To be able to apply the above smoothing method to radial distributions, for which histograms are presently the method of choice, one has to make the connection between a probability density on the one hand, and the radial distribution function and potentials of mean force on the other hand. This connection will be briefly reviewed here because some of the details are needed below.

Consider a system of a single type of particle, in which the number of particles is and the volume of the system is . The radial distribution function, denoted by , is the density of particles at a distance away from a chosen first particle relative to the mean density . The true density at a distance from a first particle is thus . The mean number of particles at a distance between and from a given particle is then . The Jacobian factor , which is, of course, the surface area of a sphere with radius , will play an important role below. The probability of a particle being at a distance is equal to the number of particles with this , divided by the total number of particles, , i.e. , so one has the relation

Systems with different types of particles give rise to different radial distribution functions , where and label the kinds of particles. A similar argument to that above leads to the relation,

where is the probability density of distances between a particle and a particle . Once is known, the potential of mean force is found from[1]

where is the temperature of the system and is Boltzmann’s constant.

A further consideration for the applicability of the Berg-Harris method to radial distribution functions is whether the K-S test may be used at all. The K-S test assumes that the samples are independently drawn. Correlation between the samples may result in a bias in the radial distribution functions. Since nearby particles in an instantaneous configuration of the system are correlated, this is potentially an issue. If, however, the system is sufficiently large, only a small fraction of the samples in a single configuration will be spatially correlated, making the K-S test applicable to a very good approximation. Furthermore, if one ensures that the configurations are taken from the simulation at sufficiently large time intervals, time correlations do not pose a problem either.

## 4Poor performance of the straightforward smoothing applied to radial distribution functions

Using the Berg-Harris method to get a smooth probability densities from a sample of inter-particle distances, one expects to obtain a good, smooth fit to by using Eq. (Equation 6). To test this expectation, the system of pure water in which rigid water molecules interact via a discretized potential energy derived from the soft sticky dipole model described in Section 2 was simulated. For all simulation results presented in this section, the parameters of the water model were as follows. The temperature is set at , the number of particles is , the cubic simulation box has sides of length Å, so that the density is kgl. The principal moments of inertia of a rigid water molecule are Å, Å, and Å, where is the molecular mass of water. After equilibration, the simulations were run for picoseconds, in which the inter-particle distances were sampled every picoseconds (long enough for the system to decorrelate), for a total of configurations.

From these data, the radial distributions , and were determined in the simulations through histograms and by following the smoothing procedure of Section 3.1, using a value for of . The results for the three radial distribution functions are shown in Figure 1. Clearly, the two methods do not agree very well at short inter-particle separations, despite the fact that the smooth distance distributions are statistically good descriptions of the distance data according to the K-S test. In particular, note that the peak heights defining the first solvation shell are not well-described.

## 5Extending the Berg-Harris method

Given these unsatisfactory results, the histogram method would be preferred over the straightforward application of the Berg-Harris method. But there is a way to fix the smoothing method to the extent that the smoothing method becomes preferable over the histogram method. To understand how to fix the problem, one first needs to understand its underlying causes.

### 5.1Over-represented large distances

#### Problematic Jacobian

The convergence criterion of the smooth approximation relies on the K-S test.[4] This test is based only on the maximum variation between and and is more sensitive to typical data points than to outliers. Although this may appear to be a contradiction at first glance, it is important that the K-S test depends on the maximum deviation in the cumulative distribution function, rather than on the deviation of the random variable from the mean. To see that outliers do not constitute very deviant points in the cumulative distribution, suppose that, by chance, an outlier from the far left tail of a distribution is found in a sample of size . This would yield an increase of magnitude of for the empirical cumulative distribution at , while the cumulative distribution is practically zero since is in the region where the distribution is very small. If is not too bad an approximation to , will also be of order at . However, the deviation between and at other points depends more on the goodness of fit than on the sample size, at least for points in regions that contain a sizable fraction of the samples: these are the ‘typical points’. For these points, there is little sample size dependence, so that to first order in , one has . Since for large enough sample sizes, the outlier will not be seen as the most deviant point in the K-S test, but rather, some typical value will have the most deviant cumulative distribution.[24]

The above argument holds when the quality of the fit in the typical region is poor. However, as the number of Fourier modes used to fit is increased, the quality of the fit of the cumulative distribution improves in the typical region, and the maximum deviation shifts to less probable values. Because the probability density function is small in the tails, this maximum deviation of the distribution is not that large and the convergence criterion (the K-S test) is met, resulting in a good fit in the typical region with a poor fit in the tails.

The Berg-Harris method of Section 3.1 therefore works well for probability densities without “long tails,” since typical values of the variable will be the values of interest. However, for radial distribution functions, the focus on typical values is the origin of the difficulties getting the straightforward Berg-Harris method to work for radial distribution functions (cf. Fig. Figure 1). Typical values of are of no interest in radial distribution functions. In a homogeneous system such as a fluid, the typical distances between any two particles is on the order of half the system size, . The length scales of interest in the radial distribution functions to describe local structure are typically much smaller than that.

To make this point clearer, consider a dilute gas of hard spheres with diameter . In the limit of infinite dilution, the radial distribution function is zero for and unity otherwise. The probability density of inter-particle distances is then [cf. Eq. (Equation 5)]

Because of the Jacobian factor of , the most likely values of are of the order of the largest possible , which is . This means that in a sample of inter-particle distances in the dilute hard sphere system, the large distance samples () are much more abundant than the small distance samples (). Thus, when approximating by a smooth function using the K-S test, the long-distance part will be fitted very well, but the short-distance behavior () will not, because the test used is sensitive to the typical values of , which are .

The conclusion that large distance values overwhelm the smaller ones in which one is interested is valid beyond the dilute hard sphere case, since the Jacobian factor in Eq. (Equation 8) acts on long length scales and is therefore also present in systems of higher density with non-negligible interactions.

A crude attempt at solving the large distance problem is to impose a cut-off on the allowed values of in the sample of inter-particle distances. This cut-off should be of the order of the distance over which differs from one. But to determine that distance one has to measure . Thus, one would have to guess a value of the cut-off distance and adjust it until approaches one within a certain accuracy for . Unfortunately, even when the cut-off is chosen in this way, the results, although better than those in Figure 1, are still not very impressive, as Figure 2 for the oxygen-oxygen radial distribution function shows with set to Å(using again ). One sees a lot of spurious oscillations in the smooth radial distribution functions, which are due to the high number of Fourier modes needed to fit the radial distribution function ( in this case). Furthermore, these oscillations do not even agree within the 95% confidence interval with the histogram result, deviating especially for small distances . The explanation is that the Jacobian factor in Eq. (Equation 6) is still present within the restricted sample , and causes the peaks at larger values to be fitted better than the peaks at smaller values.

#### Overcoming over-representation through resampling

To remove the troublesome Jacobian factor in Eqs. (Equation 5) and (Equation 6) altogether, one can resample the inter-particle distances found in the simulation. The idea is to draw from the original sample of distances, such that smaller values of are more likely than larger values by introducing a relative bias weight for each distance in the sample to construct a new, resampled set .[25] Given that the probability densities of the original sample points is [cf. Eq. (Equation 5)], and provided that subsequent resampled data points are chosen independently with weight , the probability density of the resampled points is given by

where and are constants. Note that may be determined from

which may be approximated by the average of the weight over the samples:

In the application to the radial distribution functions below, we also obtained from a numerical integration using the biased analytic approximation for , the result of which differed from that given by the simpler expression in Eq. (Equation 10) by less than .

The simplest way to counter the Jacobian factor in Eq. (Equation 9) is to choose a weight

which gives

with

In detail, the resampling leading to data with a probability density can be performed as follows: Given the original set one determines the weight for each data point as

One also determines the maximum weight,

to convert the weights into probabilities

which, by construction, lie between 0 and 1. Next, one takes one of the original sample points at random (with equal weight), draws a random number uniformly from the interval [0,1] and if , one adds to the resampled data set . The procedure is repeated until enough resampled points have been gathered. There is some choice into what number of resampled points is to be taken. In our implementation, the number of resampled points is chosen to be the same as the number of points in the original sample.

Note that Eq. (Equation 11) shows that the resampled probability density and the radial distribution function are proportional to one another. Since approaches one for large , the large values of are no longer over-represented in .

For the resampled data , one can use the Berg-Harris smoothing method of Section 3.1 to obtain a smooth approximation to . Figure 3 shows the effect of resampling on the cumulative distribution functions associated with of the water model, using . It is evident that the small- tail of the unweighted (Berg-Harris) cumulative distribution is much enhanced in the resampled cumulative distribution, and therefore more relevant to the K-S test. Note that we have plotted only the fits, and not the empirical cumulative distributions, because the fits are hard to distinguish from the empirical cumulative distributions. The differences are easier to see in the associated radial distributions, which are plotted in Figure 4. In contrast to Figure 2, the smooth approximation now follows the result of the histogram within error bars for the whole range of values. But like in Figure 2, the supposedly smooth exhibits fast oscillatory behavior within those error bars. This oscillatory behavior will be addressed next.

### 5.2Hard-to-fit distribution functions

#### Unphysical oscillations

It is hard to reconcile the aim of having a smooth approximation to the radial distribution functions and the large number of Fourier modes needed to approximate the shape of , which starts out as a very flat function at small , then increases sharply, reaches a maximum not far beyond this sharp rise, and then decays on a larger scale to in an oscillatory fashion. The peculiar shape of leads to the oscillatory behavior seen in Figures Figure 2 and Figure 4. Even when the oscillations lie within error bars of the histogram results, they reintroduce a remnant of the noise similar in magnitude to that present in the histograms that one is trying to avoid.

It is conceivable that there are expansions other than the Fourier expansion that are more suitable for describing a sharply increasing function followed by a more moderate behavior. However, it is not clear at present what this specialized expansion should be. Therefore, a systematic and generic way will now be presented which avoids high-frequency modes in parts of that do not require them. This will also make the method more generally applicable to discontinuous and other hard-to-fit densities.

We note that in the following section, the resampling explained in the previous section is supposed to have been performed on the data already, and that the resampled data have been sorted (). For notational convenience, the tildes on , and are omitted below.

#### Resolving the oscillation problem using a piecewise approach

The decomposition of in Eq. (Equation 2) can be adjusted to incorporate hard-to-fit radial distributions by allowing the Fourier decomposition to be different on sub-intervals within the total interval . Let the full interval be divided into sub-intervals , , , …, , where and . The intervals will be labelled by a Greek index, which can run from to . How to choose the points (where , with fixed) will be discussed later. Different intervals can have different numbers of samples that fall within its range. The fraction of the samples that fall within a given interval is denoted by .

On each sub-interval, is approximated by a linear part and a truncated Fourier transform of the remainder. Analogously to Eq. (Equation 3), the linear part in interval is given by

Note that ranges from 0 to 1 in the interval . The piecewise linear approximation to for in interval is then

where it should be noted that

An example of such is shown in Figure 5, based on data from the oxygen-oxygen radial distribution function in the water model (as explained below). Approximations to beyond are found by adding Fourier modes for each interval,

Here, is the characteristic function on interval , i.e.,

and is given by

where

The function is the conditional empirical cumulative probability distribution function for data points within interval . Eq. (Equation 14) represents a piecewise analytic approximation to the cumulative distribution. Since the Fourier modes form a complete orthonormal basis, the approximation becomes exact as for all . As before, however, the should not become too large to avoid spurious oscillations, which only amounts to fitting the noise. One therefore defines a maximum number of Fourier modes allowed in each sub-interval and sub-divides intervals if the maximum number of modes is not sufficient for convergence, as determined by the K-S test.

The division of the original interval is chosen dynamically as follows:

2. Increase until ;

3. If exceeds , find the most deviant point according to the K-S test;

4. Split the interval in two at the point ;

5. Repeat for the conditional distribution for each interval left and right of .

Because the procedure is recursive, in each step the original one-interval procedure is used, which one exception: the allowable value of may be set to a lower value in a sub-interval, since there are fewer points in the interval and therefore the interval carries less statistical weight. We have not found a unique, statistically controlled way to adjust the for sub-intervals, but found heuristically that scaling the for interval by avoids fitting the noise and leads to a satisfactory overall value.

The recursive, multi-interval procedure is found to greatly speed up the convergence of the approximation scheme, leading to much lower , and thus fewer oscillations. It should be stressed that splitting at the most deviant point in the K-S test is found to be essential here: choosing a different splitting point does not improve the convergence because the most deviant point is then still difficult to fit and the value for the sub-interval containing the most deviant point remains the same.

Once the approximation in Eq. (Equation 14) has been obtained, the probability density is given by

where is such that .

The results of applying the piecewise procedure to the of the water model are shown in Figure 6 (remembering that ), with and the initial set to . While the method now works better than without the piecewise approach, it has one drawback: the derivative of the approximate cumulative distribution function, which gives , need not be a smooth function across the different intervals. As a consequence, the result in Figure 6 shows artificial discontinuities. These discontinuities at the boundaries of the intervals fall within the 95% confidence intervals (not shown in Figure 6).

#### Dealing with spurious discontinuities

Provided the underlying probability distribution of the samples is continuous, the spurious discontinuities that are present in the piecewise-analytic fit will get smaller as the number of sample points increases. Nevertheless, for cases with poor statistics, one would like to have an approach that gives a continuous piecewise analytic fit to the distribution. In fact, for some applications, having a continuous radial distribution function is essential, such as for Brownian simulations that take the potential of mean force as input. Even though the discontinuities are within the statistical noise, they would lead to sudden, unphysical changes in energy in such applications.

There are many ways to get a continuous curve out of the discontinuous one, but it should be remembered that one is working within the statistical noise. We can therefore choose any method as long as the “patch” is still statistically reasonable. To determine the statistical suitability, one can once again use the K-S test on the patched fit.

We have chosen the following simple procedure: We apply a quadratic patch function around each discontinuity at ,

The width of the patch is adjustable, while the prefactor follows from the requirement that the cumulative distribution has a continuous derivative at . Provided the different patches do not overlap, this leads to a value of

where is the height of the jump in the unpatched distribution function.

Initially, each width is set equal to half the minimum interval size on either side of the corresponding split point to avoid overlap between the patches. The value of the patched distribution is then determined, and if it is not smaller than the value for the unpatched distribution, the patch is accepted. Otherwise, the width of the patch is reduced by a factor of two, until the value is acceptable.[26] It will be demonstrated below that this solves the problem of spurious oscillations and discontinuities.

### 5.3Final results

Figs. Figure 7 show the result of the resampled, piecewise procedure for the three radial distribution functions of the water model, , and , with and . It is clear that the piecewise analytic fit is now smoother than the histogram, hardly shows any oscillations and is continuous. Furthermore, while error bars were omitted in Figure 7 for clarity, it was found that the histogram and the patched, piecewise analytic results are in mutual agreement within the 95% confidence intervals.

It may be argued that plotting the histograms using bars is an unfair way of representing the histogram results. One often takes the histograms and applies a cubic spline fit[19] to the results, which makes the graph seem smoother. There is of course no a priori reason why the splined histogram should represent the radial distribution function better. For this reason, we have shown only ‘unsplined’ histograms so far. The comparison between the histograms and smooth approximations in these curves should really only be done at the mid-point of the histogram bins. Now that the piecewise analytic smoothing approach is fully developed, however, it is interesting to see how it compares to a smoothed spline fit of the histogram results. Figure 8 shows a plot of these two types of smooth results for . The bin size for the histograms has been chosen such that the first peak of the radial distribution is well resolved. One sees that the cubic spline fit does a reasonable job for the first peak, but that in the second peak in the radial distribution function the spline fit is still noisy, while the statistically controlled piecewise analytic result is not. Thus, the splines cannot fix the roughness of the histogram method, at least not when bin sizes are uniform.

To show how the piecewise analytic method performs for potentials of mean force, the smooth potentials of mean force between the different species corresponding to the radial distributions in Figs. Figure 7 [cf. Eq. (Equation 7)], have been plotted in Figs. Figure 9. The results from the piecewise analytic method are considerably smoother than the histogram results, but still exhibit some roughness since they were based only on four configurations.

## 6Further applications

The piecewise approximation method is not specific to radial distribution functions, but can help to fit any probability density that is hard to fit with a truncated Fourier series. Below, we will give several examples of such densities and show the advantages of using the piecewise analytical approximation method.

### 6.1A discontinuous density

Consider a random variable with a distribution given by

Note that within the domain of this function , there is a discontinuity at . Discontinuities are very poorly represented by truncated Fourier series.[27]

From the above distribution, 50,000 samples were drawn and used as input to the piecewise analytic approximation (with and the initial ), as well as to the one-interval analytic approximation (with the same and unrestricted ) and the histogram method. The results are shown in Figure 10.

One clearly sees the trouble that the Berg-Harris one-interval approximation has in capturing the discontinuity, while the histogram is very noisy. The piecewise analytic expansion, on the other hand, beautifully captures the whole distribution; the automated procedure divides the interval in two at and then needs zero Fourier modes to approximate the separate pieces.

### 6.2The standard Cauchy distribution

According to Berg and Harris,[4] the original smoothing method has trouble with densities with long tails. It is therefore interesting to see if the piecewise approach helps for these kinds of densities as well. As an example, we consider the standard Cauchy distribution

From this distribution, 50,000 samples were randomly drawn and used in the piece-wise approach (with and the initial ). The results are contrasted with those of the histogram in Figure 11. The piecewise smooth approximation performs so well that it can hardly be distinguished from the Cauchy distribution.

Not plotted in Figure 11 were the results of the original Berg-Harris method, which, surprisingly, are so similar to the piecewise approach that they would be hard to make out in the figure. The only statistically significant difference between the two results is apparent in the height of the maximum, which the Berg-Harris fit slightly underestimates.

The success of both methods is illustrated further in Figure 12, in which the errors in the piecewise and the one-interval results are compared; the piecewise analytic error is overall slightly smaller than that of the Berg-Harris method, but not by much. This success seems to contradict Berg and Harris’ warning against using the smoothing method for densities with long tails. It is possible the relatively high quality of the fit in the simple approach is due to the symmetric nature of the Cauchy density and its relative lack of structure.

A more stringent test of the method applied to long-tailed distributions will be presented next.

### 6.3First arrival time distribution of a diffusing particle to a sphere

Consider a particle undergoing diffusion, starting at a distance from the origin, i.e. anywhere on a sphere of radius . The distribution function of the diffusing particle satisfies , where the self-diffusion constant. Investigating the time required to arrive anywhere on the surface of a sphere of radius for the first time is a classic case of a first passage problem.[28] Such problems have applications in the rate of molecules finding each other in a solution, and are a major determining factor of reaction rates in diffusion limited reactions.

It turns out that the first arrival times are distributed according to[29]

where . This distribution function has both short time structure and a long tail.

From the distribution in Eq. (Equation 19), 100,000 samples were randomly drawn, with parameters set at , (these choices set the time and length units) and . The piece-wise approach was used to obtain an estimate of the probability distribution (with and ), as well as the one-interval analytic approximation ( and unrestricted ) and the histogram method. The results are compared in Figs. Figure 13 and Figure 14.

While all methods work to some extent, the histogram method would have to be tailored to the function in question in order to be useful, with differently sized bins for different values of . The Berg-Harris method works reasonably well without subdivisions, but exhibits fast oscillations in the tails of the density, as becomes very apparent from Figure 14. The appearance of oscillations is not surprising when one considers that Fourier modes were needed for the Berg-Harris fit! The piecewise analytic result does not exhibit fast oscillations in the tail. Furthermore, the piecewise fit only required a total of 28 Fourier modes distributed over 8 intervals. Although the piecewise result does have some modulation within the error bars, we have checked that almost all of these disappear when one takes 10 times as many data points. At that level of statistics, the Berg-Harris results still have oscillations in the tails, and still require over 350 Fourier modes.

When the analytical form of is not known, such as for absorption in non-spherical geometries, but samples are available from numerical simulation, the piecewise approach would give the best description of : one that is less noisy than the histogram construction and with fewer oscillations than the one-interval Berg-Harris method.

## 7Conclusions

In this paper, a method to obtain smooth analytical estimates of probability densities, radial distribution functions and potentials of mean force in a statistically controlled fashion without using histograms was presented. This method only uses direct samples of data (distance samples in the case of radial distribution functions). Since this method is expected to be generally useful, we have made our implementation, coded in c and c++, available on the web.[30]

While the method is based on the Berg-Harris method, the statistical criterion used in that method is most sensitive to the most common samples, which for radial distribution functions are not the ones of physical interest. To make the method work for radial distribution functions, a weighted resampling of this data was required. Spurious oscillations, allowed by the statistical noise, were eliminated using a piecewise approach. In addition, one can optionally patch the piecewise-analytic form to avoid discontinuities within the errors, if desired. The resampled, piecewise smoothing method was demonstrated on data from event-driven DMD simulations of water, and proved to give a much smoother result than the histogram method.

If the purpose of a simulation is just to get a good smooth result for the radial distribution functions of a water model, one could use the histogram method with longer simulation runs to reduce the statistical uncertainties to some predetermined limit. However, the histogram would still be biased and known just at a lattice of points if a uniform bin size is used. To get a smoother and less biased curve from the histogram, one would also have to decrease the bin size by hand. The piecewise analytic method makes such tuning unnecessary. The method also make much longer runs unnecessary, since what appears to be very poor statistics for the histogram method turns out to be quite reasonable statistics for the piecewise analytic method. Remember that the same set of inter-atom distance are used in both methods. Apparently, there is much more statistics in the sample than the histogram is using (something that Berg and Harris also noticed[4]).

For the test case considered here, the computational cost of doing longer runs is not that large, so the advantages of the piecewise analytic method may seem to be nice but not necessary. In other applications, however, such as in studies of the distribution of water molecules near a polymer or a bio-molecule (or near one of their polymer units), better statistical information is costly to obtain because such systems are not only computationally more demanding, but there are far less samples available in a single configuration since only the water molecules near the polymer or the bio-molecule are involved. The simulation run times would have to be much larger to compensate for this poor statistics, if histograms are used. In such cases, the piecewise analytic method is expected to be advantageous since it appears able to use more of the information present in the inter-particle distance sample.

The piecewise analytic method yields a smooth approximation to the probability density function, and the deviations of the empirical distribution from this smooth curve are supposed to be due to statistical noise. If this is true and the methods are unbiased, then for large enough sample size , one would expect the distribution of errors in the probability density functions to be Gaussian with a zero mean. To test whether this is the case, the probability densities of the relative errors plotted in Figure 14 were determined, both for the Berg-Harris and for the piecewise analytic method.[31] The results are shown in Figure 15. Within statistical uncertainty, both distributions were found to be unbiased. Furthermore, for the piecewise analytic results, the error densities are roughly Gaussian. The errors from the Berg-Harris method, on the other hand, seem to show deviations from Gaussian behavior. The non-Gaussian nature of the errors of the Berg-Harris method might be due to the fact that the sample noise is fitted too closely since a large number of Fourier modes are necessary, which means the distribution of errors follows the distribution of the finite sample-size errors rather than being truly random. The Gaussian nature of the errors in the piecewise analytic method might be a confirmation that in the smooth approximation, enough Fourier modes have be taken into account for the remainder to be due to a random statistical noise.

One of the nice features of the piecewise analytic method is that one does not have to choose a bin size a priori, as one has to do in the histogram method. It is nonetheless true that within the smoothing method, one is to some extent free to choose the cut-off value of the ‘quality’ parameter and the maximum number of basis functions allowed in the expansion of each interval. Note that both and are dimensionless numbers, and do not contain physical parameters, in contrast to the bin size parameter required in the histogram approach. Setting too low may result in a bad fit to the data, while setting too large may result in noise fitting and artificial oscillations. We have found that and are reasonable choices, and these values were used in all the applications presented above.

There is also some freedom in the choice of the set of basis functions, as long as they form a complete set. The Fourier basis used here is convenient and familiar, but others are possible. For instance, in Ref. , a Chebyshev expansion was used. This expansion was chosen over the Fourier expansion because it gave fewer oscillations. The piecewise approach present in this paper, however, resolves the oscillation problem as well, and is expected to be less sensitive to the choice of basis functions than the original Berg-Harris method. Furthermore, the method was shown to also be applicable to other potentials of mean force and other probability densities with a hard-to-fit character.

We would like to thank Prof. P. N. Roy for useful discussions. This work was supported by the National Sciences and Engineering Research Council of Canada.

### References

1. D. A. McQuarrie, Statistical Mechanics (Harper Collins Publishers, 1976).
2. D. Frenkel and B. Smit, Understanding molecular simulation: From Algorithms to Applications (Academic Press, San Diego, 2002).
3. H. J. C. Berendsen, Simulating the Physical World: Hierarchical Modeling from Quantum Mechanics to Fluid Dynamics (Cambridge University Press, Cambridge, UK, 2007).
4. B. A. Berg and R. C. Harris, Comp. Phys. Comm. 179, 443 (2008).
5. R. van Zon, L. Hernández de la Peña, G. H. Peslherbe, and J. Schofield, Phys. Rev. E 78, 041103 (2008).
6. T. Gericke, P. Würtz, D. Reitz, T. Langen, and H. Ott, Nature Physics 4, 949 (2008).
7. L. Mones, P. Kulhánek, I. Simon, A. Laio, and M. Fuxreiter, J. Phys. Chem. B 113, 7867 (2009).
8. Y. Liu and T. Ichiye, J. Phys. Chem. 100, 2723 (1996).
9. A. Chandra and T. Ichiye, J. Chem. Phys. 111, 2701 (1999).
10. M.-L. Tan, J. T. Fisher, A. Chandra, B. R. Brooks, and T. Ichiye, Chem. Phys. Lett. 376, 646 (2003).
11. C. J. Fennell and J. D. Gezelter, J. Chem. Phys. 120, 9175 (2004).
12. R. van Zon and J. Schofield, J. Chem. Phys. 128, 154119 (2008).
13. L. Hernández de la Peña, R. van Zon, J. Schofield, and S. B. Opps, J. Chem. Phys. 126, 074105 (2007).
14. L. Hernández de la Peña, R. van Zon, J. Schofield, and S. B. Opps, J. Chem. Phys. 126, 074106 (2007).
15. R. van Zon and J. Schofield, J. Comp. Phys. 225, 145 (2007).
16. E. C. Cyr and S. D. Bond, J. Comput. Phys. 225, 714729 (2007).
17. A. B. Adib and C. Jarzynski, J. Chem. Phys. 122, 14114 (2005).
18. J. E. Basner and C. Jarzynski, J. Phys. Chem. B 112, 12722 (2008).
19. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in Fortran, The Art of Scientific Computing (Cambridge University Press, Cambridge, UK, 1992), 2nd ed.
20. Alternatively, one can use other tests such as the Kuiper test, but these are less convenient for the piecewise approach presented later, since they involve several “hard-to-fit” points.
21. M. A. Stephens, J. Royal Stat. Soc. B 32, 115 (1970).
22. B. Efron, The Annals of Statistics 7, 1 (1979).
23. H. R. Künsch, The Annals of Statistics 17, 1217 (1989).
24. Another good explanation of the sensitivity of the K-S test to values near the median can be found in Ref. , Ch. 4, Sec. 3.
25. One could imagine performing the reweighting at the level of K-S test instead, but this would modify the statistic. The problem is that it would then not be known what a reasonable testing criterion for this new statistic should be, or even whether it allows a simple formulation such as .
26. A procedure in which the piecewise Fourier series is restricted to be continuous across the interval boundaries was also considered but found to lead to undesirable oscillations near discontinuities, reminiscent of the Gibbs phenomenon.
27. H. Jeffreys and B. S. Jeffreys, Methods of Mathematical Physics (Cambridge University Press, Cambridge, UK, 1956), 3rd ed.
28. S. Redner, A Guide to First-Passage Processes (Cambridge University Press, Cambridge, UK, 1991).
29. This is the Laplace inverse of Eq. (6.4.8) in Ref. .
30. http://www.chem.utoronto.ca/~rzon/Code.html.
31. This analysis was also attempted for the deviations plotted in Fig. but the error bars on the resulting distributions were too large to tell whether they are biased or non-Gaussian.
10364