
# On the classification problem for Poisson Point Processes

## Abstract

For Poisson processes taking values in a general metric space, we tackle the problem of supervised classification in two different ways: via the classical k-nearest neighbor rule, by introducing suitable distances between patterns of points, and via the Bayes rule, by estimating nonparametrically the intensity function of the process. In the first approach we prove that, under the separability of the space, the rule turns out to be consistent. In the second, we prove the consistency of the rule by proving the consistency of the estimated intensities. Both classifiers are shown to behave well under departures from the Poisson distribution.

###### keywords:
[class=MSC] 60G55, 62G05, 62G08
###### keywords:
point process, Poisson process, nonparametric estimation, classification

## 1 Introduction

Spatial point processes are commonly used to model the spatial structure formed by the locations of individuals in space. The growing interest in this kind of process is related to the wide range of areas where they can be applied. For instance, in ecology, they can be used to model the distribution of herds of animals, the spreading of nests of birds, patches of trees or plants, or the eroded areas in rivers or seas. In geography, the positions of earthquakes or volcanoes can be modelled by this kind of process. They can also be used to model the distribution of galaxies in astronomy and the locations of subscribers in telecommunications, among others. There exists a vast literature in this area; just to name a few, we refer to the recent book Spatial Data Analysis in Ecology and Agriculture Using R (Plant (2012)), which contains many other possible applications and techniques, as well as real data examples. In Illian, et al. (2009), the authors propose a hierarchical modelling of the interaction structure in the plant community. The current interest in this kind of process also appears in connection with the new developments in functional neuroimaging techniques (for example fMRI), where it is possible to record in real time the location of the activation zones of the brain (see for instance Kang, et al. (2011, 2014), and Yarkoni, et al. (2010)). In this context, in order to perform classification between healthy and unhealthy people, the differences between the neurons that fire under some stimuli can be measured by modelling them as spatial Poisson processes with different intensities. In Mateu, et al. (2015), the authors review several distances used to measure the differences between two spatial patterns in order to perform clustering or classification (see also Victor (1997)).
In a different application area, crime modelling and mapping using geospatial technologies (which include the use of spatial point processes) is, quoting Leitner (2013), “a topic of much interest mostly to academia, but also to the private sector and the government”; see also Bernasco and Elffers (2010) and Gervini (2015). On this topic, in Section 7 we study the spatial distribution of three different crimes which took place in Chicago between 2014 and 2016, by using an open access database containing, among other variables, the spatial locations of the crimes.

The aim of this manuscript is to tackle the supervised classification problem for Poisson point processes by framing it in the functional data setting. In particular, we prove the consistency of the k-nearest neighbors classifier in a more general context by proving the separability of the space and the Besicovitch condition (see Cérou and Guyader (2006) and Forzani, et al. (2012) for a deeper reading on this topic). Via some simulation studies, we show how different choices of distance lead to different classification results. In addition, following the ideas in Diggle (2014), we also propose a nonparametric estimator of the intensity function, prove its consistency and plug it into the Bayes rule to get a consistent classifier. This last approach is similar to the one proposed in Kang, et al. (2014), but we do not assume that the intensities vary in a parametric family. Through some simulation studies we show the good performance of the k-NN rule, so that it can be considered an easier-to-implement alternative to the classical Bayes rule. More precisely, the k-NN classifier does not require the estimation of the intensity function (which is computationally expensive) and can be employed in more general settings. With regard to the last statement, it is important to highlight that, although most of the classical applications of spatial point processes are for locations recorded in ℝ^2 or ℝ^3, we do not restrict our approach to that case, allowing the realizations of the processes to live in a general metric space (such as functional metric spaces or Riemannian manifolds, among others).

The manuscript is outlined as follows: in Section 2 we present definitions and preliminary results that we will use throughout the work. Section 3.1 is devoted to introducing an estimator of the intensity of the process in order to plug it into the Bayes rule and prove its consistency. In Section 3.2 we handle the problem of choosing a suitable distance to guarantee the separability of the space and the Besicovitch condition, in order to get the consistency of the k-NN estimator. Section 4 is devoted to the study of the metric dimension of the space introduced in Section 3.2. In Section 5 we extend the results to a more general class of processes: the Gibbs processes. In Section 6 we perform some simulation studies in order to assess the performance of the classification rules in different scenarios, as well as to see the effect of changing some parameters in the estimation and the robustness when the model is not Poisson. Finally, in Section 7 we perform classification in a real data scenario. All the proofs are given in the Appendix.

## 2 Definitions and preliminary results

This section is devoted to introducing some definitions and tools we will use throughout the paper. We start with the definition of the main object of this paper, the Poisson point process, and then we turn to classification rules in our context. For deeper reading on Poisson processes we refer to Gaetan and Guyon (2010), Kingman (1993) and Møller and Waagepetersen (2004).

### 2.1 Poisson process

Let (S, ρ) be a separable and bounded metric space, endowed with a Borel measure ν. Let us denote by B(S) the Borel σ-algebra on S and by S∞ the set of elements (subsets) x of S whose cardinal, #x, is finite. That is,

 S∞≐{x⊂S:#x<∞}.

Let λ: S → [0, ∞) be an integrable function. Given a probability space (Ω, F, P), we will say that a function X: Ω → S∞ is a Poisson process on S with intensity λ (we will denote X ∼ PP(λ)) if:

• the functions N(A) defined as N(A) ≐ #(X ∩ A) are random variables for all A ∈ B(S);

• given disjoint Borel subsets A1, …, Ak of S, the random variables N(A1), …, N(Ak) are independent;

• N(A) follows a Poisson distribution with mean μ(A) (we will denote N(A) ∼ P(μ(A))), being

 μ(A)=∫Aλ(ζ)dν(ζ).

Let E be the σ-algebra of parts of S∞. If X is a Poisson process, the distribution PX of X on (S∞, E) is defined as PX(B) = P(X ∈ B) for B ∈ E.

A well-known result (see Møller and Waagepetersen (2004)) on point processes states that, if X1 and X2 are Poisson processes with intensities λ1 and λ2, respectively, with values on a non-empty bounded metric space S such that μi(S) < ∞, i = 1, 2, and λ2(ξ) > 0 for all ξ ∈ S, then the distribution of X1 is absolutely continuous with respect to the distribution of X2 (PX1 ≪ PX2) with Radon–Nikodym derivative

 fX1(x)=exp[μ2(S)−μ1(S)]∏ξ∈xλ1(ξ)/λ2(ξ),

with x ∈ S∞. As a consequence, observe that if X ∼ PP(λ) then, taking λ2 ≡ 1, for all x ∈ S∞,

 fX(x)=exp[ν(S)−μ(S)]∏ξ∈xλ(ξ), (1)

where μ(S) = ∫S λ(ζ)dν(ζ).

### 2.2 Classification

Given a set of iid pairs (X1, Y1), …, (Xn, Yn) with the same distribution as (X, Y), the aim of classification is, given a new observation X, to predict the class Y ∈ {0, …, M} to which it belongs. In this context, a classification rule is a measurable function g: S∞ → {0, …, M} which, for a new observation x, returns a label g(x). It is well known (see, e.g., Devroye, et al. (1996)) that the optimal classifier is the Bayes rule, which minimizes the probability of error or, equivalently, maximizes the posterior probabilities:

 g∗=argming:S∞→{0,…,M}P(g(X)≠Y), that is, g∗(x)=argmaxj∈{0,…,M}P(Y=j|X=x).

The minimum probability of error L∗ is known as the Bayes error. If Ln is the probability of error of a sequence of classifiers built up from a training sample of size n, the sequence is said to be weakly consistent if Ln converges in probability to L∗ as n → ∞.

In our context, we assume that X conditioned on Y = j is a Poisson process; therefore, following (1), in Lemma 1 we obtain an expression for the Bayes rule as a function of the intensities of the processes.

###### Lemma 1.

Let pj = P(Y = j), j = 0, …, M. Let X0, …, XM be Poisson processes on S with intensities λ0, …, λM, respectively. Then the Bayes rule classifies a point x in class j if

 exp[μi(S)−μj(S)]∏ξ∈xλj(ξ)/λi(ξ)>pi/pj,∀i≠j, (2)

where pj = P(Y = j) and, as before, μj(A) = ∫A λj(ζ)dν(ζ), j = 0, …, M.

Observe that, in order to apply the Bayes rule we will need to estimate the intensities of the processes which will be done in Section 3.1.
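As an illustration, rule (2) is most conveniently applied on the log scale, which avoids overflow in the product over points. The following sketch assumes the intensities λj and masses μj(S) are known (in practice they are replaced by the estimates of Section 3.1); the function names and the homogeneous unit-square example are our own illustrative choices, not part of the paper.

```python
import numpy as np

def bayes_rule(x, intensities, masses, priors):
    """Bayes rule (2) on the log scale: pick the class j maximizing
    log p_j - mu_j(S) + sum_{xi in x} log lambda_j(xi)."""
    scores = []
    for lam, mu_S, p in zip(intensities, masses, priors):
        scores.append(np.log(p) - mu_S + np.sum(np.log(lam(x))))
    return int(np.argmax(scores))

# Toy example: two homogeneous intensities on the unit square,
# lambda_0 = 50 and lambda_1 = 100, so mu_0(S) = 50 and mu_1(S) = 100;
# patterns with many points favor class 1.
lam0 = lambda pts: np.full(len(pts), 50.0)
lam1 = lambda pts: np.full(len(pts), 100.0)
label = bayes_rule(np.random.rand(90, 2), [lam0, lam1], [50.0, 100.0], [0.5, 0.5])
```

Only the number of points matters in this homogeneous example; the class boundary sits near patterns of about 72 points.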

Another well-known classification rule is the k-nearest neighbor rule which, in our context, will classify a point x in class j if, for all i ≠ j,

 n∑k=1wnkI{Yk=j}>n∑k=1wnkI{Yk=i},

where the weights wnk are 1/k for the k nearest neighbors of x and 0 elsewhere. We say that Xi is the k-th nearest neighbor of x among X1, …, Xn if the distance d(x, Xi) is the k-th smallest among d(x, X1), …, d(x, Xn).
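This voting rule can be sketched directly, with the pattern distance d passed in as a parameter (the Hausdorff-type distances of Section 3.2 are natural choices); the names below are illustrative.

```python
from collections import Counter

def knn_rule(x, patterns, labels, k, dist):
    """k-NN rule: the k training patterns closest to x under `dist`
    vote with uniform weights 1/k; ties go to the smallest label."""
    order = sorted(range(len(patterns)), key=lambda i: dist(x, patterns[i]))
    votes = Counter(labels[i] for i in order[:k])
    return max(sorted(votes), key=lambda lab: votes[lab])

# Toy 1-d check with |a - b| as the pattern distance
patterns = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
labels = [0, 0, 0, 1, 1, 1]
label = knn_rule(0.15, patterns, labels, 3, lambda a, b: abs(a - b))
```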

For random variables taking values in a finite dimensional space (for instance ℝ^d), it is well known (see Stone (1977)) that the k-NN rule is universally consistent provided that k → ∞ and k/n → 0. However, when they take values in infinite dimensional spaces (as in this case), the consistency (even the weak one) is not necessarily true, as was studied by Cérou and Guyader (2006). Nevertheless, Forzani, et al. (2012) gave sufficient conditions to ensure the consistency of the classical estimators of the regression function η(x) = E(Y|X = x). Those conditions are the separability of the metric space for a given metric d and the Besicovitch condition, which can be stated as:

 limϵ→0PX{x:limδ→01PX(Bd(x,δ))∫Bd(x,δ)|η(x)−η(y)|dPX(y)>ϵ}=0, (3)

It is immediate that PX-a.s. continuity of η is a sufficient condition for (3). In order to get the consistency of the k-NN rule in the context of Poisson processes, we will study in Section 3.2 the problem of choosing a suitable distance which yields the separability of the space and the Besicovitch condition for η.

## 3 Main Results

Throughout this section we will assume that (S, ρ) is a separable compact metric space.

### 3.1 Bayes rule in the context of Poisson processes: consistency and the estimation of the intensity

In this section we propose to estimate the intensity functions nonparametrically in order to plug them into Equation (2) to get the Bayes rule for Poisson processes. Following Diggle (2014), given a realization x = {ξ1, …, ξn} of the process X with values in S∞, we estimate the intensity λ of X at the point ζ as

 ^λ(ζ)=(1/Kσ(ζ))∑i=1..n(1/σ^d)k(ρ(ζ,ξi)/σ), (4)

with k a symmetric, non-negative kernel, σ a smoothing parameter, and

 Kσ(ζ)=∫S(1/σ^d)k(ρ(ζ,ξ)/σ)dν(ξ).

Given a random sample X1, …, Xm of Poisson processes, each with realization xj, j = 1, …, m, we define an estimator of the intensity λ by

 ^^λm(ζ)=1mm∑j=1^λj(ζ), (5)

where ^λj(ζ) is the estimate of λ as given in (4) for the realization xj, j = 1, …, m.

Via a simulation study (see Section 6.3), we observed that, when performing classification by using the Bayes rule as in (2) with the estimated intensities (5), the best choice is .
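A minimal sketch of the estimators (4) and (5), assuming S is the unit square with ν the Lebesgue measure (so ν(S) = 1), a Gaussian-shaped kernel, and a grid-based approximation of the normalizing integral Kσ(ζ); these concrete choices are ours, not the paper's.

```python
import numpy as np

def kern(u):
    # symmetric, non-negative kernel (Gaussian shape, an arbitrary choice)
    return np.exp(-0.5 * u ** 2)

def intensity_hat(zeta, pattern, sigma, grid):
    """Estimator (4): kernel-smoothed intensity at zeta from one realization.

    pattern: (n, d) array of observed points; grid: (G, d) discretization of S
    used to approximate K_sigma(zeta) = int_S sigma^{-d} k(rho(zeta, xi)/sigma) dnu."""
    d = grid.shape[1]
    K = np.mean(kern(np.linalg.norm(grid - zeta, axis=1) / sigma)) / sigma ** d
    s = np.sum(kern(np.linalg.norm(pattern - zeta, axis=1) / sigma)) / sigma ** d
    return s / K

def intensity_hat_pooled(zeta, patterns, sigma, grid):
    """Estimator (5): average of (4) over the m observed realizations."""
    return np.mean([intensity_hat(zeta, p, sigma, grid) for p in patterns])

g = np.linspace(0.01, 0.99, 50)
grid = np.array([[a, b] for a in g for b in g])
center = np.array([0.5, 0.5])
```

A quick sanity check: feeding the grid itself as the "pattern" (2500 evenly spread points) returns an estimated intensity of exactly 2500, since the edge-correction term cancels.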

In the following theorem we show the consistency of the estimator given in (5).

###### Theorem 1.

Let us assume that the intensity function λ is continuous and that, for all ζ ∈ S, there exists such that . Let k be a symmetric, non-negative, continuous kernel such that and . Then, for almost all ζ (w.r.t. ν), there exists , such that

 limm→∞∣∣^^λm(ζ)−λ(ζ)∣∣=0 a.s. (6)
###### Remark 1.

It is easy to see that if S ⊂ ℝ^d, ν is absolutely continuous w.r.t. the Lebesgue measure, its density is bounded away from 0 and S is standard (see Definition 1 in Cuevas and Rodríguez-Casal (2004)), then the conditions of Theorem 1 hold and are enough to get (6).

For a recent review on the estimation of the intensity function for general point processes, see van Lieshout (2012).

### 3.2 k-NN rule in the context of Poisson processes: consistency

As we said in Section 2.2, in order to get the separability of S∞ as well as the Besicovitch condition, we need to choose a suitable distance. Since the elements of S∞ are subsets of S, a quite natural choice is the Hausdorff distance, which measures how far two subsets of a metric space are from each other. It is defined as follows.

###### Definition 1.

Given two non-empty compact sets x, y ⊂ S, the Hausdorff distance between them is defined by

 dH(x,y)=max{maxξ∈xρ(ξ,y),maxζ∈yρ(ζ,x)},

where ρ(ξ,y) = minζ∈y ρ(ξ,ζ).

###### Remark 2.

Observe that, when S is bounded, the metric dH on S∞ is well defined since, in this case, dH(x, y) < ∞ for all x, y ∈ S∞.
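For finite patterns the suprema in Definition 1 reduce to maxima over points, so dH can be computed from the pairwise distance matrix. A sketch, assuming ρ is the Euclidean distance (our choice for illustration):

```python
import numpy as np

def hausdorff(x, y):
    """Hausdorff distance between two non-empty finite patterns
    x (n, d) and y (m, d), with rho the Euclidean distance."""
    D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # D[i, j] = rho(x_i, y_j)
    # largest distance from a point of x to the set y, and symmetrically
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```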

In the following two propositions we state sufficient conditions to get the consistency of the k-nearest neighbor rule in (S∞, dH).

###### Proposition 1.

The space (S∞, dH) is separable.

###### Remark 3.

Moreover, if S is complete, the metric space of compact non-empty subsets of S endowed with dH turns out to be a complete and locally compact metric space (see Chapter 4 in Rockafellar and Wets (2009)).

###### Proposition 2.

Let us consider S ⊂ ℝ^d compact. Suppose that X conditioned on Y = j is a Poisson process on S with intensity function λj, for j = 0, …, M, respectively. Let us assume that λ0, …, λM are continuous functions and that the measure ν does not have atoms (i.e. ν({ζ}) = 0 for all ζ ∈ S). Then, for all x ∈ S∞, condition (3) holds for d = dH.

From Propositions 1 and 2 and Theorems 4.1 and 5.1 in Forzani, et al. (2012), the consistency of the k-NN estimator of the regression function follows, which in turn gives the consistency of the classification rule built up from such an estimator.

Although we stated the consistency of the k-NN rule for the Hausdorff distance, two points can be very close in Hausdorff distance but have very dissimilar cardinalities. Via some simulation studies we noted that, by adding to the Hausdorff distance a term that forces points close enough in Hausdorff distance to have the same cardinality, the performance of the classification rule improves considerably. Basically, this is due to the fact that in point process analysis the number of points is an important characteristic to distinguish between populations. Moreover, we performed a simulation study (see Section 6.1.3) to show that, for two populations with the same expected number of points but different intensities, the Hausdorff distance is still a good choice, without the necessity of adding a new term. With all this in mind, we define new metrics in S∞ which have shown to outperform the Hausdorff distance, and give the consistency of the k-NN rule for them.

###### Definition 2.

Given x, y ∈ S∞, we define a new distance on S∞ as:

 d(x,y)=(1/diam(S))dH(x,y)+d0(x,y), (7)

where diam(S) denotes the diameter of S (i.e. diam(S) = supξ,ζ∈S ρ(ξ,ζ)) and d0 is a function (not necessarily a distance) which verifies:

• #x = #y implies d0(x, y) = 0;

• d0(x, y) = d0(y, x);

• for all x, y ∈ S∞, d0(x, y) ≥ 0;

In what follows we list a set of functions d0 verifying the conditions in Definition 2:

• ;

• Hellinger: ;

• Kullback–Leibler: .
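As a sketch, taking for d0 the plain cardinality gap |#x − #y| (one illustrative choice among those above, vanishing exactly when the counts agree) gives the combined distance (7):

```python
import numpy as np

def hausdorff(x, y):
    # Hausdorff distance between finite patterns (Definition 1), Euclidean rho
    D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def combined_dist(x, y, diam_S, d0=lambda x, y: abs(len(x) - len(y))):
    """Distance (7): Hausdorff normalized by diam(S) plus a term d0
    penalizing patterns with different numbers of points."""
    return hausdorff(x, y) / diam_S + d0(x, y)
```

On the unit square (diam(S) = √2), a one-point and a two-point pattern at distance 1 get the score 1/√2 + 1, the second term reflecting the cardinality mismatch.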

As before, in the following two propositions we state sufficient conditions to get the consistency (see Theorems 4.1 and 5.1 in Forzani, et al. (2012)) of the k-nearest neighbor rule in S∞ with d as in (7), which in turn gives the consistency of the classification rule built up from such an estimator.

###### Proposition 3.

The space S∞ with d as defined in (7) is separable.

###### Proposition 4.

Let (S, ρ) be a bounded metric space. If the intensity λ of a Poisson process defined on S is continuous on S (with respect to the distance ρ), then the regression function η is continuous with respect to the metric d defined in (7) and hence fulfils condition (3).

As we will see in Section 6.4, for all the distances defined above, a higher number of neighbors gives better classification (although seven neighbors could be the right choice since, for seven or more neighbors, the results are the same).

## 4 Why do we need to prove the Besicovitch condition?

The Besicovitch condition would be trivial if the space S∞ were finite dimensional. However, this space is not even a vector space; therefore, we first need to define what “infinite dimensional” means.

###### Definition 3.

A metric space is finite dimensional (in the Nagata sense) if there exists such that for all , , and points , there exists such that . A metric space is said to be σ-finite dimensional if it is equal to a countable union of finite dimensional sets.

The following result ensures that S∞, with d as in (7), is not finite dimensional.

###### Proposition 5.

Let us assume that there exists such that for all , and all , contains two points such that . Then the space (S∞, d) is not finite dimensional.

## 5 Extensions of the results to Gibbs process

Gibbs processes appear as a natural generalization of Poisson processes since they allow a spatial dependency between the numbers of points in two disjoint subsets of S (compare with the definition of the Poisson process introduced in Section 2.1). We prove that Proposition 2 can be extended to this class of processes which, for instance, are being used in telecommunications to model the positions of base stations in order to improve the performance of wireless networks; see Zhuo, et al. (2015), Kelif, et al. (2014) and Guo and Haenggi (2013).

Recall that a process is Gibbs if its density with respect to PP(1) has the form f(x) = c·exp(−U(x)), c being a normalizing constant, where the energy U is admissible, in the sense that the qn satisfy:

 ∞∑n=0e−ν(S)n!qn<∞andqn=∫Snexp(−U(x))dν(x1)…dν(xn)<∞.

Since we will assume that S is compact and that U is a bounded function, the admissibility condition will be fulfilled. We will assume that the energy U is of the form (see p. 95 in Gaetan and Guyon (2010))

 U(x)=n∑i=1φ(xi)+n∑i=1n∑j>iψ(∥xi−xj∥). (8)

This includes as a particular case the Poisson process, taking ψ ≡ 0 and φ = −log λ, and Strauss processes, for which with and . The following proposition extends Proposition 2 to this kind of process.

###### Proposition 6.

Let us consider S ⊂ ℝ^d compact. Suppose that, for all j = 0, …, M, X conditioned on Y = j is a Gibbs process with energy given by (8), with φ and ψ continuous functions. Let us assume that the measure ν does not have atoms (i.e. ν({ζ}) = 0 for all ζ ∈ S). Then, for all x ∈ S∞, the Besicovitch condition holds for d = dH.
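For a finite pattern, the energy (8) is a direct double sum. The sketch below evaluates it for arbitrary φ and ψ, using the Strauss choice φ ≡ −log β, ψ(r) = −(log γ)·1{r ≤ R} as an example (the values of β, γ, R are illustrative):

```python
import numpy as np

def gibbs_energy(x, phi, psi):
    """Energy (8): U(x) = sum_i phi(x_i) + sum_{i<j} psi(||x_i - x_j||)."""
    x = np.asarray(x, dtype=float)
    U = sum(phi(xi) for xi in x)
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            U += psi(np.linalg.norm(x[i] - x[j]))
    return U

# Strauss process: phi = -log(beta); psi(r) = -log(gamma) if r <= R, else 0
beta, gamma, R = 2.0, 0.5, 1.0
phi = lambda xi: -np.log(beta)
psi = lambda r: -np.log(gamma) if r <= R else 0.0
```

With γ < 1, each pair of points closer than R adds a positive penalty to U, which makes tight clusters less likely than under the Poisson model.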

## 6 Simulations

In order to assess the performance of the proposed classification rules for two populations and to see how the nature of the intensity function affects the methods, we have implemented some simulation studies. First we show the behaviour in three different scenarios: one in which the intensities are smooth and decrease to zero exponentially fast, another with very wiggly intensities, and a last one where the expected number of points is the same but the distribution of the points is different. We also carry out three simulation studies to show the robustness under departures from the Poisson assumption, the effect of σ in the estimation of the intensities (4), and the effect of k in the k-nearest neighbor rule with the distances (7).

In what follows, we will use the following notation for the different distances:

• KNN_Hausdorff: k-NN in (S∞, dH);

• KNN_Hausdorff_d0: k-NN in S∞ with d given in (7) and d0 = ;

• KNN_Hausdorff_Hellinger: k-NN in S∞ with d given in (7) and d0 the Hellinger distance;

• KNN_Hausdorff_KL: k-NN in S∞ with d given in (7) and d0 the Kullback–Leibler divergence.

In all the simulations we generated training and testing samples of size ( for each class) and used replications. We chose k in the k-NN rule via cross validation. For the Bayes rule we used cross validation to get the optimal σ in Sections 6.1.1 and 6.1.2, but we fixed σ in Sections 6.1.3 and 6.2.

### 6.1 Behaviour of our proposed methods in three different scenarios

#### Smooth case.

In this case, for the class 0 we generate the processes in the square with intensity

 λ0(x,y)=c2exp(−20((x−1/2)2+(y−1/2)2)),

and for class 1,

 λ1((x,y),c1,d1)=c1exp(−d1((x−1/2)2+(y−1/2)2)).

In Figure 1 we report the misclassification rate for different values of the parameters c1 and d1, with . For a better understanding, in Figure 2 we plot different level sets of both estimated intensities. Let us observe that the intensities in this case overlap considerably, which makes the classification difficult. As expected, the misclassification rate decreases when the difference between the parameters of the two classes increases.
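Realizations with intensities of this type can be drawn by thinning (Lewis–Shedler): simulate a homogeneous process at the maximal rate and keep each point ξ with probability λ(ξ)/λmax. A sketch on the unit square, with illustrative constants of our own choosing:

```python
import numpy as np

def rpoispp_thinning(lam, lam_max, rng=None):
    """Inhomogeneous Poisson process on the unit square via thinning:
    draw N ~ Poisson(lam_max) uniform points, keep xi w.p. lam(xi)/lam_max."""
    rng = np.random.default_rng() if rng is None else rng
    pts = rng.random((rng.poisson(lam_max), 2))
    keep = rng.random(len(pts)) < np.array([lam(p) for p in pts]) / lam_max
    return pts[keep]

# class-0-type intensity, peaked at the center of the square
lam0 = lambda p: 100.0 * np.exp(-20.0 * ((p[0] - 0.5) ** 2 + (p[1] - 0.5) ** 2))
sample = rpoispp_thinning(lam0, 100.0, np.random.default_rng(1))
```

The retained pattern concentrates around (1/2, 1/2), with roughly 100·π/20 ≈ 16 points on average.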

#### Wiggly case.

In this case, for class 0 we generate the processes in the square with intensity

 λ0(x,y)=80+80xysin(1/(xy)),

and for class 1,

 λ1((x,y),c2)=c2+30xysin(1/(xy)),

where c2 is a positive constant. In Figure 3 we report the boxplot of the misclassification rate for different values of the parameter c2, and in Figure 4 we plot different level sets of both estimated intensities.

Again, as expected, the misclassification rate decreases when the difference between the two intensities increases.

#### Different intensities but same expected number of points.

In this case we generate two processes in the square . Both of them have intensities with the same height, but one of them is centered at and the other one at , as shown in Figure 6. In Figure 5 we report the misclassification rate, where we can see that, in this case, the Bayes rule performs much better than the k-NN rule.

As we can see in the previous simulations, in the case of intensities with the same expected number of points (Section 6.1.3), the estimated Bayes rule outperforms the k-NN based rules, whereas in the wiggly case (Section 6.1.2) k-NN achieves a better performance. This could be due to the fact that smooth intensities can be better estimated. For smooth intensities (Section 6.1.1), sometimes k-NN outperforms the Bayes rule (especially when adding an extra term to the distance).

### 6.2 Robustness under non Poisson distributions

In this simulation we generate two Strauss processes (see Section 5) in the same region , one with parameters and the other with parameters . In this case, the means of the misclassification rates are: for the Bayes rule, for KNN_Hausdorff, for KNN_Hausdorff_d0, for KNN_Hausdorff_Hellinger and for KNN_Hausdorff_KL.

This, together with the boxplots of the misclassification rates (Figure 7), shows the robustness of our methods and a better performance of the k-NN based rules.

### 6.3 Effect of the smoothing parameter used in the estimation of the intensities (4)

To show the effect of the smoothing parameter σ in the estimation of the intensity function (4), we ran the procedure in the setting described in Section 6.1.1 with , and for one of the intensities and , for the other. We took different combinations of σ. The misclassification rates are plotted in Figure 8, where in the title of each panel we put and on the x-axis . The boxplots show that, in general, the best combination is .

### 6.4 Effect of k in the k-NN rule using different distances (7)

To assess the effect of k in the k-NN rule, we ran it in the setting described in Section 6.1.1 with , , . The results are given in Figure 9. It can be seen that for all distances the k-NN rule performs better for higher values of k. However, it can also be seen that choosing a value greater than does not improve the performance considerably.

## 7 Real data example

The study of the geographic distribution of crimes gave rise to the well-known “social disorganization theory”, developed by the Chicago School (also called the Ecological School) which, since 1920, specializes in urban sociology and urban environment research. It proposes that the neighborhood of a subject is as significant as the person’s characteristics (like gender, race, etc.). See Chapter 33 in Bernasco and Elffers (2010) for a survey on this topic. The School collected the locations as well as descriptions of many crimes, including prostitution, assault, narcotics and battery, among others, reported between 2001 and 2016 in the city of Chicago, and joined them in an open source database of more than 6 million entries. This database was recently employed in Gervini (2015) to fit a model for the intensity function of replicated point processes. In order to assess whether there exist statistical differences between the spatial patterns of crimes, we performed classification among the different crimes.

To get different samples of the same process we split the data into periods of one week between the first of January of 2014 and the first of January of 2016. As a result, for every type of crime we have 105 samples, 84 of which were used as the training sample and the remaining 21 as the testing sample. Since there exists a wide range of intensities among the different crimes, we have considered only three of them: assault and robbery (joined in one class, denoted AR), narcotics (N), and criminal damage (CD). The mean number of locations registered in one week, for every sample, is , and for AR, N, and CD respectively. The classification errors obtained using the k-NN rule with (this value minimizes the misclassification error) were the following: between N and CD, 7%; between CD and AR, 6%; and finally, between AR and N, 15%. This result suggests that there is a stronger geographic similarity between the crimes typified as narcotics and those typified as assault and robbery. This can also be seen in Figure 10, where we represent the points for these 3 kinds of crimes reported in one week. There we can see that the points in blue (N) and in green (AR) are very close to each other, whereas the red ones are spread throughout the city.

## 8 Conclusions

We have proposed two consistent classification techniques for Poisson point processes: the k-NN and the Bayes rules. The k-NN rule has shown better performance in cases where the intensity function of the process is wiggly and for non-Poisson processes, whereas the Bayes rule performed better when the intensities have the same expected number of points. From a theoretical point of view, we proved that the k-NN rule is consistent not only for spatial processes in ℝ^d, but also for processes taking values in any metric space. The rule has also shown to be robust against departures from the Poisson distribution.

## Appendix

###### Proof of Lemma 1.

For j = 1, …, M, let fXj be the density of Xj with respect to the Poisson process with intensity 1. That is, fXj is the Radon–Nikodym derivative of the distribution of Xj with respect to the distribution of the Poisson process with intensity 1. Then,

 P(Y=j|X=x)=fXj(x)P(Y=j)/fX(x),j=1,…,M (9)

with fX the total probability density. Let pj = P(Y = j); then we have

 P(Y=j|X=x)>P(Y=i|X=x)⇔fXj(x)/fXi(x)>pi/pj. (10)

Now, since Xj ∼ PP(λj) for j = 1, …, M, from equation (1) we get,

 fXj(x)/fXi(x) = (exp[ν(S)−μj(S)]∏ξ∈xλj(ξ)) / (exp[ν(S)−μi(S)]∏ξ∈xλi(ξ)) = exp[μi(S)−μj(S)]∏ξ∈xλj(ξ)/λi(ξ).

Using this equality in (10), it turns out that

 P(Y=j|X=x)>P(Y=i|X=x) ⇔ exp[μi(S)−μj(S)]∏ξ∈xλj(ξ)/λi(ξ)>pi/pj.

Therefore, the Bayes rule classifies a point x in class j if

 exp[μi(S)−μj(S)]∏ξ∈xλj(ξ)/λi(ξ)>pi/pj, for all i≠j.

###### Proof of Theorem 1.

Let us fix ζ ∈ S and write,

 ∣∣^^λm(ζ)−λ(ζ)∣∣≤∣∣^^λm(ζ)−E(^^λm(ζ))∣∣+∣∣E(^^λm(ζ))−λ(ζ)∣∣. (11)

First observe that, conditioned on #X1 = n(1), the points of the realization are an iid sample with density λ(·)/μ(S) (see Definition 3.2 in Møller and Waagepetersen (2004)). In addition, since X1 is a Poisson process, #X1 ∼ P(μ(S)), so that E(#X1) = μ(S). Then,

 E(^^λm(ζ)) = E[E(^^λm(ζ) ∣ #X1=n(1),…,#Xm=n(m))]
 = E[(1/m)∑j=1..m (1/Kσm(ζ)) ∑i=1..#Xj (1/μ(S)) ∫S kσm(ρ(ζ,ξ))λ(ξ)dν(ξ)]
 = (1/(Kσm(ζ)μ(S))) ∫S kσm(ρ(ζ,ξ))λ(ξ)dν(ξ) E[(1/m)∑j=1..m #Xj]
 = (1/Kσm(ζ)) ∫S kσm(ρ(ζ,ξ))λ(ξ)dν(ξ)
 = E(^λ1(ζ)).

With this in (11) we have,

 ∣∣^^λm(ζ)−λ(ζ)∣∣≤∣∣^^λm(ζ)−E(^λ1(ζ))∣∣+∣∣E(^λ1(ζ))−λ(ζ)∣∣≐I+II. (12)

To prove that I → 0 a.s., observe that,

 P(∣∣∣1mm∑j=1^λj(ζ)−E(^λ1(ζ))∣∣∣>ϵ) (13) =E[P(∣∣∣1mm∑j=1^λj(ζ)−E(^λ1(ζ))∣∣∣>ϵ∣∣∣#X1=n(1),…,#Xm=n(m))].

In order to apply Hoeffding’s inequality to I, observe that, conditioned on #X1 = n(1), …, #Xm = n(m), if we denote , , and , with . Therefore, applying Hoeffding’s inequality we get

 E[P(∣(1/m)∑j=1..m^λj(ζ)−E(^λ1(ζ))∣>ϵ ∣ #X1=n(1),…,#Xm=n(m))]
 ≤ E[2exp(−2ϵ2(mγm(ζ))2/(K21∑j=1..m(#Xj)2))]
 = E[E[2exp(−2ϵ2(mγm(ζ))2/(K21(∑j=1..m−1(#Xj)2+(#Xm)2))) ∣ #X1=n(1),…,#Xm−1=n(m−1)]]
 ≤ E[2exp(−2ϵ2(mγm(ζ))2/(K21(∑j=1..m−1(#Xj)2+E((#Xm)2))))] (14)
 ⋮
 ≤ 2exp(−2ϵ2(mγm(ζ))2/(K21∑j=1..mE((#Xj)2)))
 = 2exp(−2ϵ2mγm(ζ)2/(K21μ(S)(1+μ(S)))),

where we used the same conditioning trick m times and, in the last equality, we used that #Xj ∼ P(μ(S)), so that E((#Xj)2) = μ(S)(1+μ(S)). Now, by the Lemma in Forzani, et al. (2012) (with ), there exists m0 such that mγm(ζ)2 ≥ (log(m))2 for all m ≥ m0. Therefore,

 ∞∑m=12exp(−2ϵ2mγm(ζ)2K21μ(S)(1+μ(S))) ≤∞∑m=12exp(−2ϵ2log(m)2K21μ(S)(1+μ(S))) =2∞∑m=1m−2ϵ2log(m)K21μ(S)(1+μ(S))<∞, (15)

then, from (13), (14) and (15) it follows that,

 ∞∑m=1P(∣∣∣1mm∑j=1^λj(ζ)−E(^λ1(ζ))∣∣∣>ϵ)<∞,

and finally, by the Borel–Cantelli Lemma,

 I=∣∣1mm∑j=1^λj(ζ)−E(^λ1(ζ))∣∣→0,a.s.

In order to prove that II → 0 in (12), observe that, since λ is a continuous function and S is compact, for all ϵ > 0 there exists δ > 0 such that |λ(ξ) − λ(ζ)| < ϵ for all ξ, ζ ∈ S with ρ(ξ, ζ) < δ. Then we get,

 II=|E(^λ1(ζ))−λ(ζ)| =∣∣1Kσm(ζ)∫B(ζ,σm)kσm(ρ(ζ,ξ