On the statistical effects of multiple reusing of simulated air showers in detector simulations
Simulations of extensive air showers, and of the detectors used to observe them, play a fundamental role in the study of high-energy cosmic rays. At the highest energies the detailed simulation of air showers is very costly in processing time and disk space because of the large number of secondary particles generated in interactions with the atmosphere, e.g. for eV proton shower. Therefore, in order to increase the statistics, it is quite common to recycle single showers many times to simulate the detector response. In this work we present a detailed study of the artificial effects introduced by the multiple use of single air showers in detector simulations. In particular, we study the effects that the repetitions introduce in the kernel density estimators, which are frequently used in composition studies.
Shower Simulations, Detector Simulations
1 Introduction
Air shower and detector simulations play a fundamental role in the study of cosmic rays. In particular, arrays of surface detectors that have no fluorescence telescopes to calibrate the energy scale must resort to simulated data in order to estimate the energy of the primary particle. Furthermore, the primary mass is also obtained by comparing experimental data with simulations.
There are several Monte Carlo programs for air shower simulation; the most widely used in the literature are AIRES [1], CORSIKA [2], and CONEX [3], the latter for fast simulation of the longitudinal shower development. Since the number of particles produced in a shower can be extremely large, e.g., for a eV proton shower, the computer processing time and disk space needed are also very large, even if thinning methods [4, 5] are used. Because of this difficulty, it is common practice to reuse the same shower to generate several events (see for example [6, 7]). This practice is more common in simulations that include surface detectors because, for fluorescence telescopes, Monte Carlo programs like CONEX have very fast and efficient algorithms for the generation of longitudinal profiles.
In this work we present a study of the effects of using multiple repetitions of individual showers, applied to the simulation of detectors, on the evaluation of standard estimators of the expected value, variance, and covariance. We study in detail the effects introduced in the kernel density estimators, which are analytical estimates of the underlying distribution function obtained from a finite sample of events. In cosmic-ray physics this technique is used mainly in connection with composition analyses (see for example [10, 7, 8]); however, it is also extensively used in many other areas of knowledge, to which this work can be directly extended.
2 Analytical Treatment
As mentioned in the introduction, we want to study the potential distortions introduced by reusing individual showers to maximize the statistics when simulating the response of a detector. Let us start with the optimum case in which each individual shower is used only once and, therefore, best reproduces reality.
Let be a -dimensional vector composed of physical observables (e.g. mass-sensitive parameters) distributed as and let be a random vector, distributed as , that takes into account the effects of the detectors and the corresponding reconstruction method such that, after measuring and reconstructing the empirical information, a vector is obtained. The distribution function of is the convolution of and ,
Suppose that we have a sample of independent events of the distribution . The probability of this configuration can be written as,
However, as previously noted, if single showers are recycled and used many times to simulate the response of the detectors, non-independent samples are obtained. If we use each shower of a sample of independent showers times to simulate the detector response, the following sample of size is obtained, , where the notation used henceforth corresponds to , where is the coordinate of vector , labels the independent shower and the detector simulation performed using that shower. The probability of such a configuration is given by
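For reference, the three relations described in this section can be written out as follows. This is a sketch in an assumed notation: the symbols x, ε, x', f, g, h, N and M are illustrative choices and are not taken from the text, and the response is taken to be additive, x' = x + ε, as implied by the convolution.

```latex
% Assumed notation (illustrative, not taken from the text):
%   x            : vector of physical observables, with density f
%   \varepsilon  : detector + reconstruction fluctuation, with density g
%   x' = x + \varepsilon : reconstructed vector, with density h
%   N independent showers, each used M times
\begin{align*}
  h(\mathbf{x}') &= \int d\mathbf{x}\; f(\mathbf{x})\, g(\mathbf{x}' - \mathbf{x}), \\
  P(\mathbf{x}'_1,\ldots,\mathbf{x}'_N)
    &= \prod_{i=1}^{N} h(\mathbf{x}'_i)\, d\mathbf{x}'_i, \\
  P(\mathbf{x}'_{11},\ldots,\mathbf{x}'_{NM})
    &= \prod_{i=1}^{N} \left[ \int d\mathbf{x}_i\; f(\mathbf{x}_i)
       \prod_{j=1}^{M} g(\mathbf{x}'_{ij} - \mathbf{x}_i)\, d\mathbf{x}'_{ij} \right].
\end{align*}
```

The second line factorizes over events, whereas the third factorizes only over showers: the M reconstructed vectors that come from one shower share the same underlying x_i and are therefore correlated.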
2.1 Mean, variance and covariance estimators
Let us consider the average of the coordinate of , , for the realistic case in which each shower is used only once to simulate the detector response,
By using Eq. (2) it is easy to obtain the well-known expressions for the expected value and variance of ,
The usual estimator of the covariance between two random variables is given by,
For the estimator of the variance of is obtained, . By using Eq. (2) it can be shown that both estimators are unbiased,
For the case in which each shower is used several times to simulate the response of the detectors the average of is given by,
which means that using samples obtained by reusing individual showers to simulate the detector response does not introduce any bias when calculating the average. However, the fluctuations of are increased by the generation of an additional term proportional to .
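The behaviour of the average can be checked numerically with a toy model. The sketch below assumes an additive response (reconstructed value = physical value + detector fluctuation), Gaussian distributions for both terms, and illustrative widths `sigma_x` and `sigma_eps`; these names and numbers are assumptions made for the example and do not come from the analysis. Under those assumptions the variance of the average of N showers, each reused M times, is `sigma_x**2 / N + sigma_eps**2 / (N * M)`, which is larger than the value `(sigma_x**2 + sigma_eps**2) / (N * M)` obtained for N M independent events.

```python
import numpy as np


def sample_means(n_showers, n_reps, sigma_x, sigma_eps, n_trials, rng):
    """Average of each of n_trials toy samples of n_showers * n_reps events."""
    # One physical value per independent shower (toy Gaussian, an assumption).
    x = rng.normal(0.0, sigma_x, size=(n_trials, n_showers, 1))
    # One independent detector fluctuation per reuse of each shower.
    eps = rng.normal(0.0, sigma_eps, size=(n_trials, n_showers, n_reps))
    # Broadcasting adds the same shower value to all of its repetitions.
    return (x + eps).mean(axis=(1, 2))


def predicted_variance_of_mean(n_showers, n_reps, sigma_x, sigma_eps):
    """Variance of the average under the additive toy model."""
    return sigma_x**2 / n_showers + sigma_eps**2 / (n_showers * n_reps)


rng = np.random.default_rng(2009)
sigma_x, sigma_eps = 60.0, 20.0  # illustrative widths, not taken from the paper
n_trials = 4000

# Same total sample size (500 events), with and without repetitions.
means_reused = sample_means(50, 10, sigma_x, sigma_eps, n_trials, rng)
means_independent = sample_means(500, 1, sigma_x, sigma_eps, n_trials, rng)

for label, means, n, m in [("reused", means_reused, 50, 10),
                           ("independent", means_independent, 500, 1)]:
    predicted = predicted_variance_of_mean(n, m, sigma_x, sigma_eps)
    print(f"{label:12s} mean={means.mean():7.3f} "
          f"var={means.var():7.2f} predicted var={predicted:7.2f}")
```

Both sets give an average compatible with zero (the true mean of the toy model), while the spread of the average is much larger for the reused set.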
The estimator of the covariance, between and , including multiple repetitions of the individual showers takes the form,
Therefore, as expected, the repetition of individual showers introduces a bias in the covariance estimator because the events are not independent. The bias is proportional to .
As mentioned before, the expected value of the variance estimator is obtained by setting in Eq. (14),
which shows that is now also a biased estimator of the variance of .
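The bias of the variance estimator can also be illustrated with a toy Monte Carlo. The sketch below again assumes an additive Gaussian model with hypothetical widths `sigma_x` and `sigma_eps`. For that model, a direct calculation gives an expected value of the usual variance estimator of `sigma_x**2 + sigma_eps**2 - (M - 1) / (N * M - 1) * sigma_x**2` for N showers each reused M times; this closed form was derived for the toy model and is not a transcription of the equations of this section.

```python
import numpy as np


def variance_estimates(n_showers, n_reps, sigma_x, sigma_eps, n_trials, rng):
    """Usual (ddof=1) variance estimate of each of n_trials toy samples."""
    x = rng.normal(0.0, sigma_x, size=(n_trials, n_showers, 1))
    eps = rng.normal(0.0, sigma_eps, size=(n_trials, n_showers, n_reps))
    # Flatten showers and repetitions into one sample per trial.
    events = (x + eps).reshape(n_trials, n_showers * n_reps)
    return events.var(axis=1, ddof=1)


def expected_variance_estimate(n_showers, n_reps, sigma_x, sigma_eps):
    """Expectation of the estimator under the additive toy model."""
    n_events = n_showers * n_reps
    bias = -(n_reps - 1) / (n_events - 1) * sigma_x**2
    return sigma_x**2 + sigma_eps**2 + bias


rng = np.random.default_rng(31)
sigma_x, sigma_eps = 60.0, 20.0  # illustrative widths, not taken from the paper
n_trials = 20000

# 200 events per sample in both cases.
var_reused = variance_estimates(20, 10, sigma_x, sigma_eps, n_trials, rng)
var_independent = variance_estimates(200, 1, sigma_x, sigma_eps, n_trials, rng)

print("true variance          :", sigma_x**2 + sigma_eps**2)
print("independent, mean of s2:", var_independent.mean())
print("reused, mean of s2     :", var_reused.mean())
print("reused, toy prediction :",
      expected_variance_estimate(20, 10, sigma_x, sigma_eps))
```

With M = 1 the bias term vanishes and the estimator is unbiased, as in the case without repetitions.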
2.2 Density estimators
The density estimation technique consists in obtaining an estimator of the underlying density function from a given data sample . In one of the most widely used variants of that technique, a density estimator is obtained from a superposition of kernel functions centered at each event of the data sample. For -dimensional data the kernel density estimator can be written as,
where is a -dimensional vector, is a symmetric, positive definite matrix (i.e., the symmetric, positive definite square-root matrix exists) and is the kernel function. The matrix gives the covariance between the different pairs of variables and also the degree of smoothing, i.e., the width of the kernel function.
which shows that is a biased estimator of .
By using a Taylor expansion and retaining the dominant terms, an approximate expression for the integrated mean square error is obtained,
Here we take , where is a small parameter that parametrizes the degree of smoothing.
Minimizing with respect to , the well-known expression of is recovered,
where the constant of proportionality depends on , the unknown density function that we want to estimate.
Let us consider the case in which repetitions of individual showers are included. The density estimator in this case is given by,
where just the leading terms are retained. Consequently, the takes in this particular case the form
Eq. (23) shows that the leading term introduced by the repetitions does not depend on and, therefore, the expression for remains equal to the case. The only effect introduced by the repetitions of the individual showers is to increase the fluctuations of the estimator for each .
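The statement that repetitions leave the expected value of the density estimator unchanged while increasing its pointwise fluctuations can be illustrated with a one-dimensional toy study. The sketch below is a minimal example under assumptions that are not part of the text: an additive Gaussian model, a fixed hand-picked bandwidth `h`, and a plain Gaussian kernel implemented directly in NumPy.

```python
import numpy as np


def gaussian_kde_1d(grid, sample, h):
    """Fixed-bandwidth Gaussian kernel density estimate evaluated on grid."""
    u = (grid[:, None] - sample[None, :]) / h
    norm = sample.size * h * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * u**2).sum(axis=1) / norm


def kde_ensemble(n_showers, n_reps, h, grid, n_trials, rng,
                 sigma_x=1.0, sigma_eps=0.3):
    """Density estimates of n_trials toy samples, each shower reused n_reps times."""
    estimates = np.empty((n_trials, grid.size))
    for t in range(n_trials):
        x = rng.normal(0.0, sigma_x, size=(n_showers, 1))
        eps = rng.normal(0.0, sigma_eps, size=(n_showers, n_reps))
        estimates[t] = gaussian_kde_1d(grid, (x + eps).ravel(), h)
    return estimates


rng = np.random.default_rng(7)
grid = np.linspace(-3.0, 3.0, 41)  # grid[20] is x = 0
h = 0.3                            # same hand-picked bandwidth for both sets

# 500 events per sample in both cases.
reused = kde_ensemble(50, 10, h, grid, 300, rng)
independent = kde_ensemble(500, 1, h, grid, 300, rng)

# Largest difference between the two ensemble-averaged estimates.
mean_gap = np.max(np.abs(reused.mean(axis=0) - independent.mean(axis=0)))
# Ratio of the pointwise standard deviations of the estimates.
spread_ratio = reused.std(axis=0) / independent.std(axis=0)

print("largest difference between mean estimates:", mean_gap)
print("ratio of pointwise spreads at x = 0      :", spread_ratio[20])
```

The two ensemble means agree within their statistical precision, while the pointwise spread of the estimates is clearly larger for the set built from reused showers.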
3 Numerical Example
In this section a numerical example that shows the predicted effects introduced by the shower repetitions is given. For that purpose, air shower simulations are performed using the program CONEX. A total of proton showers of primary energy eV and zenith angle are generated.
Samples of the parameter obtained from the CONEX simulations are considered. A Gaussian uncertainty of g cm$^{-2}$ and is assumed in order to take into account the detector response and the reconstruction method. Therefore, the distribution function of the reconstructed is given by Eq. (1) with the distribution function corresponding to the physical fluctuations and , a Gaussian distribution of mean value and , which takes into account the response of the detectors and reconstruction methods.
Four sets of 100 samples are considered. Each set of samples is denoted as where indicates the number of independent values of (obtained from CONEX) in each sample and the number of repetitions of each shower, i.e., the number of times that the Gaussian distribution is sampled for each of the independent values in each individual sample. Therefore, , , and are considered, where and differ only in the values sampled from the Gaussian distribution that models the detector response and reconstruction method. The number of events in each sample, belonging to the different sets, is , the same for all kinds of samples considered.
Figure 1 shows the distributions of the estimators of the average, , and the standard deviation, , for the sets of samples considered. It can be seen that, as expected, when repetitions are included the fluctuations increase, and when the number of independent showers increases the fluctuations decrease. Figure 1 also shows that, although the distributions of with repetitions have a tail towards larger values of grammage, which is not present in the corresponding without repetitions, the bias is not statistically significant. This is consistent with Eq. (15) which shows that the expected bias introduced by repetitions in the variance is proportional to for .
In order to illustrate the effects of repetitions on the density estimators, one-dimensional Gaussian kernels are used to estimate the density function of . An adaptive bandwidth method, introduced by B. Silverman [11], is used to obtain better estimates of the density function.
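A minimal sketch of an adaptive-bandwidth Gaussian kernel estimate in the spirit of Silverman's adaptive kernel method is given below. The text does not state the settings that were used, so the choices here are assumptions: a pilot estimate with Silverman's rule-of-thumb bandwidth, a sensitivity parameter `alpha = 0.5`, and a toy skewed sample standing in for the simulated values. All function names are hypothetical.

```python
import numpy as np


def rule_of_thumb_bandwidth(sample):
    """Silverman's rule-of-thumb bandwidth for a Gaussian kernel (1D)."""
    sigma = sample.std(ddof=1)
    q75, q25 = np.percentile(sample, [75, 25])
    return 0.9 * min(sigma, (q75 - q25) / 1.34) * sample.size ** (-0.2)


def gaussian_mixture(points, centers, widths):
    """Average of Gaussian kernels with one width per center."""
    u = (points[:, None] - centers[None, :]) / widths[None, :]
    kernels = np.exp(-0.5 * u**2) / (widths[None, :] * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)


def adaptive_kde(grid, sample, alpha=0.5):
    """Adaptive kernel estimate: wider kernels where the pilot density is low."""
    h = rule_of_thumb_bandwidth(sample)
    # Step 1: fixed-bandwidth pilot estimate evaluated at the data points.
    pilot = gaussian_mixture(sample, sample, np.full(sample.size, h))
    # Step 2: local bandwidth factors, normalized by the geometric mean.
    geometric_mean = np.exp(np.mean(np.log(pilot)))
    local_factor = (pilot / geometric_mean) ** (-alpha)
    # Step 3: kernel estimate with a width h * local_factor for each event.
    return gaussian_mixture(grid, sample, h * local_factor)


rng = np.random.default_rng(11)
# Toy skewed sample (an assumption, not CONEX output).
sample = 750.0 + rng.gamma(4.0, 15.0, size=400)
grid = np.linspace(sample.min() - 300.0, sample.max() + 300.0, 3001)

density = adaptive_kde(grid, sample)
integral = density.sum() * (grid[1] - grid[0])
print("integral of the density estimate:", integral)
```

Normalizing the local factors by the geometric mean of the pilot values keeps the overall smoothing scale set by `h`, while events in sparsely populated tails get broader kernels.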
A density estimate is obtained for each sample belonging to a given set, so each set yields as many density estimates as it contains samples. Figure 2 shows the mean value and the one sigma region obtained from the density estimates of each set. It can be seen that the mean values corresponding to samples with or without repetitions are very similar, which is consistent with the result obtained in subsection 2.2. Also, as expected from Eq. (2.2), the fluctuations corresponding to sets including repetitions are larger; comparing the results obtained for and , we see that the fluctuations in the latter case are smaller due to the smaller number of repetitions.
In this work we present a study of the effects of recycling individual cosmic ray showers to simulate the detector response, which is a common practice in Monte Carlo simulations at the highest energies. We find that the standard estimators of the expected value, variance and covariance are modified. In particular, the average remains an unbiased estimator of the expected value but its fluctuations are increased. For the standard estimators of the variance and covariance a bias proportional to appears when repetitions are included. In addition, as in the case of the average, the fluctuations of both estimators are increased. Finally, we study the effects introduced by repetition in the kernel density estimators obtained from finite samples. We find again that the expected value of the estimator is unchanged, i.e., the bias takes the same form. However, the pointwise fluctuations are increased and become more important as the ratio increases.
This work is partially supported by the Mexican agencies CONACyT, UNAM’s CIC, and PAPIIT. ADS is supported by a postdoctoral grant from the UNAM.
- [1] S. Sciutto, http://www.fisica.unlp.edu.ar/auger/aires.
- [2] D. Heck et al., Report FZKA 6097, Forschungszentrum Karlsruhe, 1998; http://www-ik3.fzk.de/~heck/corsika.
- [3] T. Bergmann et al., Astropart. Phys. 26, 420 (2007).
- [4] A. Hillas, Proc. 19th ICRC 1, 155 (1985).
- [5] A. Hillas, Nucl. Phys. (Proc. Suppl.) B52, 29 (1997).
- [6] M. Ave et al., Astropart. Phys. 19, 61 (2003).
- [7] A. D. Supanitsky, G. Medina-Tanco and A. Etchegoyen, Astropart. Phys. 31, 75 (2009).
- [8] A. D. Supanitsky, G. Medina-Tanco and A. Etchegoyen, Astropart. Phys. 31, 116 (2009).
- [9] A. D. Supanitsky, G. Medina-Tanco, Astropart. Phys. 30, 264 (2008).
- [10] T. Antoni et al., Astropart. Phys. 18, 319 (2003).
- [11] B. Silverman, Density Estimation for Statistics and Data Analysis, ed. Chapman & Hall, New York (1986).