Cross-validation estimation of covariance parameters under fixed-domain asymptotics

F. Bachoc¹, A. Lagnoux, T.M.N. Nguyen
Institut de Mathématiques de Toulouse, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex 9, France
{francois.bachoc,agnes.lagnoux,thi_mong_ngoc.nguyen}@math.univ-toulouse.fr
¹Corresponding author: francois.bachoc@math.univ-toulouse.fr
Abstract

We consider a one-dimensional Gaussian process having exponential covariance function. Under fixed-domain asymptotics, we prove the strong consistency and asymptotic normality of a cross validation estimator of the microergodic covariance parameter. In this setting, Ying [40] proved the same asymptotic properties for the maximum likelihood estimator. Our proof includes several original or more involved components, compared to that of Ying. Also, while the asymptotic variance of maximum likelihood does not depend on the triangular array of observation points under consideration, that of cross validation does, and is shown to be lower and upper bounded. The lower bound coincides with the asymptotic variance of maximum likelihood. We provide examples of triangular arrays of observation points achieving the lower and upper bounds. We illustrate our asymptotic results with simulations, and provide extensions to the case of an unknown mean function. To our knowledge, this work constitutes the first fixed-domain asymptotic analysis of cross validation.

Keywords: Kriging, cross validation, strong consistency, asymptotic normality, spatial sampling, fixed-domain asymptotics

1 Introduction

Kriging [35, 28] consists in inferring the values of a Gaussian random field given observations at a finite set of observation points. It has become a popular method for a large range of applications, such as geostatistics [25], numerical code approximation [29, 30, 8] and calibration [27, 9] or global optimization [20].

Before Kriging can be applied, a covariance function must be chosen. The most common practice is to statistically estimate the covariance function, from a set of observations of the Gaussian process, and to plug the estimate into the Kriging equations [35, Ch.6.8]. Usually, it is assumed that the covariance function belongs to a given parametric family (see [1] for a review of classical families). In this case, the estimation boils down to estimating the corresponding covariance parameters. For covariance parameter estimation, maximum likelihood (ML) is the most studied and used method, while cross validation (CV) [36, 43, 5] is an alternative technique. CV has been shown to have attractive properties, compared to ML, when the parametric family of covariance functions is misspecified [5, 7].

There is a fair amount of literature on the asymptotic properties of ML. In this regard, the two main frameworks are increasing-domain and fixed-domain asymptotics [35, p.62]. Under increasing-domain asymptotics, the average density of observation points is bounded, so that the infinite sequence of observation points is unbounded. Under fixed-domain asymptotics, this sequence is dense in a bounded domain.

Consider first increasing-domain asymptotics. Generally speaking, for all (identifiable) covariance parameters, the ML estimator is consistent and asymptotically normal under some mild regularity conditions. The asymptotic covariance matrix is equal to the inverse of the (asymptotic) Fisher information matrix. This result was first shown in [24], and then extended in different directions in [12, 13, 31, 6, 16].

The situation is significantly different under fixed-domain asymptotics. Indeed, two types of covariance parameters can be distinguished: microergodic and non-microergodic parameters [18, 35]. A covariance parameter is microergodic if, for two different values of it, the two corresponding Gaussian measures are orthogonal, see [18, 35]. It is non-microergodic if, even for two different values of it, the two corresponding Gaussian measures are equivalent. Non-microergodic parameters cannot be estimated consistently, but have an asymptotically negligible impact on prediction [32, 33, 34, 42]. On the other hand, it is at least possible to consistently estimate microergodic covariance parameters, and misspecifying them can have a strong negative impact on prediction.

Under fixed-domain asymptotics, there exist results indicating which covariance parameters are microergodic, and providing the asymptotic properties of the corresponding ML estimator. Most of these available results are specific to particular covariance models. In dimension one, when the covariance model is exponential, only a reparameterized quantity obtained from the variance and scale parameters is microergodic. It is shown in [40] that the ML estimator of this microergodic parameter is strongly consistent and asymptotically normal. These results are extended in [11], by taking into account measurement errors, and in [10], by taking into account both measurement errors and an unknown mean function. In higher dimensions, for a separable exponential covariance function, all the covariance parameters are microergodic, and the asymptotic normality of the ML estimator is proved in [41]. Other results in this case are also given in [37, 2]. Consistency of ML is shown as well in [23] for the scale parameters of the Gaussian covariance function and in [22] for all the covariance parameters of the separable Matérn covariance function. Finally, for the entire isotropic Matérn class of covariance functions, all parameters are microergodic when the dimension is large enough [3], while in lower dimensions only reparameterized parameters obtained from the scale and variance are microergodic [42]. In [21], the asymptotic normality of the ML estimators of these microergodic parameters is proved, building on previous results in [14] and [39]. Finally, we remark that, beyond ML, quadratic variation-based estimators have also been extensively studied under fixed-domain asymptotics (see for instance [19]).

In contrast to ML, CV has received less theoretical attention. Under increasing-domain asymptotics, the consistency and asymptotic normality of a CV estimator is proved in [6]. Also, under increasing-domain asymptotics, it is shown in [7] that this CV estimator asymptotically minimizes the integrated square prediction error. To the best of our knowledge, no fixed-domain asymptotic analysis of CV exists in the literature.

In this paper, we provide a first fixed-domain asymptotic analysis of the CV estimator minimizing the CV logarithmic score, see Equation (5.11) in [28] and [43]. We focus on the case of the one-dimensional exponential covariance function, which was historically the first covariance function for which the asymptotic properties of ML were derived [40]. This covariance function is particularly amenable to theoretical analysis, as its Markovian property yields an explicit (matrix-free) expression of the likelihood function. It turns out that the CV logarithmic score can also be expressed in a matrix-free form, which enables us to prove the strong consistency and asymptotic normality of the corresponding CV estimator. We follow the same general proof architecture as in [40] for ML, but our proof, and the nature of our results, contain several new elements.

In terms of proofs, the random CV logarithmic score and its derivatives have more complicated expressions than their ML counterparts. [This is because the CV logarithmic score is based on the conditional distributions of the observations given both their nearest left and right neighbors, while the likelihood function is based solely on the nearest left neighbors. See Lemma 3.1 and Lemma 1 in [40] for details.] As a consequence, the computations are more involved, and tools other than those in [40] are needed. In particular, many of our asymptotic approximations rely on Taylor expansions of functions of several variables (where each variable is an interpoint distance going to zero, see the proofs for details). In contrast, only Taylor approximations in one variable are needed in [40]. In addition, we use central limit theorems for dependent random variables, while only independent variables need to be considered in [40].

The nature of our asymptotic normality result also differs from that in [40]. In this reference, the asymptotic variance does not depend on the triangular array of observation points. On the contrary, in our case, different triangular arrays of observation points can yield different asymptotic variances. We exhibit a lower and an upper bound for these asymptotic variances, and provide examples of triangular arrays reaching them. The lower bound is in fact equal to the asymptotic variance of ML in [40]. Interestingly, the triangular array given by equispaced observation points attains neither the lower nor the upper bound. It is also pointed out in [6] that equispaced observation points need not provide the smallest asymptotic variance for covariance parameter estimation.

Finally, the fact that the asymptotic variance is larger for CV than for ML is a standard finding in the well-specified case considered here, where the covariance function of the Gaussian process does belong to the parametric family of covariance functions under consideration. In contrast, as mentioned above, CV has attractive properties compared to ML when this well-specified case does not hold [5, 7].

The rest of the paper is organized as follows. In Section 2, we present in more detail the setting and the CV estimator under consideration. In Section 3, we give our strong consistency result for this estimator. In Section 4, we provide the asymptotic normality result, together with the analysis of the asymptotic variance. In Section 5, we present numerical experiments, illustrating our theoretical findings. In Section 6, we extend the results of Sections 3 and 4 to the case of an unknown mean function. In Section 7, we give a few concluding remarks. All the proofs are postponed to Section 8.

2 The context and the cross-validation estimators

We consider a centered Gaussian process $Y$ on $[0,1]$ with covariance function

$$K_{\sigma_0^2,\theta_0}(s,t) = \sigma_0^2 e^{-\theta_0 |s-t|},$$

for some fixed and unknown parameters $\sigma_0^2 > 0$ and $\theta_0 > 0$. This process is commonly known as the Ornstein-Uhlenbeck process. It satisfies the following stochastic differential equation, called Langevin's equation,

$$dY(t) = -\theta_0 Y(t)\, dt + \sigma_0 \sqrt{2\theta_0}\, dW(t),$$

where $W$ denotes a standard Brownian motion process. The Ornstein-Uhlenbeck process has been widely used to model physical, biological, social, and many other phenomena. It also possesses many useful mathematical properties that simplify the analysis.
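As an illustration, the Markovian structure just described yields an exact simulation scheme on any finite design. The following R sketch is not taken from the paper; the function name and the parameter values are ours. It samples the process with covariance $\sigma^2 e^{-\theta|s-t|}$ recursively from its one-step conditional distributions.

```r
# Exact simulation of the Ornstein-Uhlenbeck process on a finite design,
# using its Markov (AR(1)-type) structure; illustrative sketch only.
simulate_ou <- function(s, sigma2 = 1, theta = 1) {
  n <- length(s)
  y <- numeric(n)
  y[1] <- sqrt(sigma2) * rnorm(1)                    # stationary initial value
  for (i in 2:n) {
    rho  <- exp(-theta * (s[i] - s[i - 1]))          # one-step correlation
    y[i] <- rho * y[i - 1] + sqrt(sigma2 * (1 - rho^2)) * rnorm(1)
  }
  y
}

s <- sort(runif(200))                                # dense design in [0, 1]
y <- simulate_ou(s, sigma2 = 2, theta = 3)
```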

We introduce the parametric set of covariance functions $\{K_{\sigma^2,\theta}\}$, for $\sigma^2$ and $\theta$ ranging over fixed intervals of $(0,\infty)$, where

$$K_{\sigma^2,\theta}(s,t) = \sigma^2 e^{-\theta |s-t|}.$$
For any $n$, we consider a design of $n$ observation points in $[0,1]$. Without loss of generality, we may assume that the observation points are ordered, $s_1 < s_2 < \cdots < s_n$. As in [40], there is no need to assume that the sequences of observation points are nested. We consider the vector of observations $y = (y_1, \ldots, y_n)^\top$ at locations $s_1, \ldots, s_n$, with $y_i = Y(s_i)$. Now let $\Delta_i = s_i - s_{i-1}$, for $i = 2, \ldots, n$. For ease of notation, we do not mention the dependency on $n$ in $s_i$ and $\Delta_i$. We define $R_{\sigma^2,\theta}$ as the variance-covariance matrix of $y$ under covariance function $K_{\sigma^2,\theta}$,

From [4], we have

(1)
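Although the explicit expression in (1) is not reproduced here, its key structural consequence can be checked numerically: by the Markov property, the inverse of the exponential covariance matrix is tridiagonal. The R sketch below (design, parameter values and tolerances are illustrative) verifies this for a small random design.

```r
# The precision matrix of the exponential covariance is tridiagonal:
# all entries more than one off the diagonal vanish (up to rounding error).
s <- sort(runif(8)); sigma2 <- 2; theta <- 3
R <- sigma2 * exp(-theta * abs(outer(s, s, "-")))
Q <- solve(R)
max(abs(Q[abs(row(Q) - col(Q)) > 1]))                # numerically ~ 0
```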

We now address the CV estimators of $\sigma^2$ and $\theta$ considered in [28, 43]. Let

$$\hat{y}_{i,\theta} = \mathbb{E}_{\sigma^2,\theta}\left[ Y(s_i) \,\middle|\, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n \right],$$

where the conditional expectation is calculated assuming that $Y$ is centered and has covariance function $K_{\sigma^2,\theta}$. We remark that $\hat{y}_{i,\theta}$ does not depend on $\sigma^2$. We define similarly the leave-one-out conditional variance

$$\hat{\sigma}^2_{i,\sigma^2,\theta} = \mathrm{Var}_{\sigma^2,\theta}\left[ Y(s_i) \,\middle|\, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n \right].$$
Then, the CV estimators are given by

where

(2)

is the logarithmic score. The rationale for minimizing the logarithmic score is that it is equal, up to an additive constant and a positive multiplicative factor, to minus the sum over $i$ of the conditional log-likelihood of $y_i$ given the other observations, with covariance parameters $(\sigma^2, \theta)$. The term cross-validation underlines the fact that we consider leave-one-out quantities.
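For readers who wish to reproduce the criterion, the leave-one-out means and variances entering (2) can be computed from the precision matrix through the virtual cross-validation identities recalled at the beginning of Section 3 (from [15]): the leave-one-out residual for observation $i$ is the $i$-th entry of $R_{\sigma^2,\theta}^{-1} y$ divided by the $i$-th diagonal entry of $R_{\sigma^2,\theta}^{-1}$, and the leave-one-out variance is the reciprocal of that diagonal entry. The R sketch below implements one usual form of the logarithmic score with these identities; the exact additive constants and normalization in (2) may differ.

```r
# One usual form of the leave-one-out logarithmic score for a centered
# Gaussian process with exponential covariance (sketch; constants and
# normalization may differ from (2) in the paper).
log_score <- function(par, s, y) {          # par = c(sigma2, theta)
  R   <- par[1] * exp(-par[2] * abs(outer(s, s, "-")))
  Q   <- solve(R)
  res <- as.vector(Q %*% y) / diag(Q)       # leave-one-out residuals
  v   <- 1 / diag(Q)                        # leave-one-out predictive variances
  sum(log(v) + res^2 / v)                   # criterion to be minimized
}
```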

As already known [18, 40, 42], it is not possible to consistently estimate $\sigma_0^2$ and $\theta_0$ simultaneously (the ML estimator of $\theta_0$ is a non-degenerate random variable, even if $Y$ is observed continuously [44]), but it is possible to consistently estimate the product $\sigma_0^2 \theta_0$. As a consequence, we have considered three different cases, as in [40]. (i) Fix $\theta$ in (2) to a predetermined constant and consider the CV estimator of $\sigma^2$ that minimizes (2) with this fixed $\theta$. (ii) Fix $\sigma^2$ in (2) to a predetermined constant and consider the CV estimator of $\theta$ that minimizes (2) with this fixed $\sigma^2$. (iii) Consider the estimator of $\sigma_0^2 \theta_0$ given by the product of the joint CV estimators of $\sigma^2$ and $\theta$.

Ying [40] considers the ML estimators of $\sigma^2$ and $\theta$ and establishes their consistency and asymptotic normality. We carry out a similar asymptotic analysis for the above CV estimators. More precisely, we prove in the next section that the CV estimators of cases (i), (ii) and (iii) converge almost surely to their respective targets. In Section 4, we establish that, after normalization by a sequence that is lower and upper bounded, the estimation errors in the three cases all converge in distribution to a standard Gaussian random variable. We remark that the asymptotic variance depends on how the underlying design points are chosen. On the contrary, for the ML estimators [40], the asymptotic variance is the same for any triangular array of design points.

3 Consistency

In this section, we establish the strong consistency of the CV estimators described in the previous section. To this end, we consider the logarithmic score defined by (2). As done in [40], we base our analysis on the Markovian property of the Ornstein-Uhlenbeck process in order to handle the fact that, as $n$ increases, the observed sample becomes more and more correlated. We have

$$y_i - \hat{y}_{i,\theta} = \frac{\left( R_{\sigma^2,\theta}^{-1} y \right)_i}{\left( R_{\sigma^2,\theta}^{-1} \right)_{ii}}$$

(3)

and

$$\hat{\sigma}^2_{i,\sigma^2,\theta} = \frac{1}{\left( R_{\sigma^2,\theta}^{-1} \right)_{ii}},$$

from [43, 5, 15]. Then, using Equation (1), we get the following lemma after some tedious computations.

Lemma 3.1 (Logarithmic score).

With as in (2), we have

Based on Lemma 3.1, we prove the following theorem in Section 8.2.

Theorem 3.2 (Consistency).

Assume that

(4)

Let , where , , and are fixed and have been defined in the previous section. Assume that there exists in so that . Define as a solution of

(5)

Then exists and

(6)

In particular, let and be predetermined constants satisfying and . Define and as solutions of

(7)

and

(8)

Then and .

Remark 3.3.

It is worth remarking that the asymptotically preponderant terms in Lemma 3.1 are the same as those obtained in the context of ML estimation (see [40] and Section 8.2 for more details).

4 Asymptotic normality

Once the consistency has been established, the natural question of the convergence speed arises. We address this point in this section. We first provide a central limit result in the following theorem.

Theorem 4.1 (Central Limit Theorem).

Consider the same notation and assumptions as in Theorem 3.2. Assume further that either ; or ; hold. Then the estimators are asymptotically normal. More precisely, we have

(9)

Also, when we have

(10)

Finally, when we have

(11)

The quantity depends on how the underlying design points have been chosen. More precisely,

(12)
Remark 4.2.

The condition ; or ; ensures that or will be equal to zero for large enough almost surely, by applying Theorem 3.2. This is used in the proof of Theorem 4.1. A similar assumption is made in [40], where the parameter domain for is or .

In the following proposition, we show that the quantity appearing in Theorem 4.1 is lower and upper bounded, so that the rate of convergence in this theorem is always $\sqrt{n}$.

Proposition 4.3.

We have, for any choice of the triangular array of design points satisfying (4),

(13)
Remark 4.4.
  1. The asymptotic variance of the limiting distribution of can be easily estimated. By the previous proposition, this asymptotic variance is always larger than the one of the ML estimator. Indeed, with and the ML estimators of and we have , see [40]. This fact is quite expected as ML estimates usually perform best when the covariance model is well-specified, as is the case here.

  2. As one can check easily, the regular design given by equispaced observation points does not yield the limiting variance of the ML estimator. Instead, the asymptotic variance for this design lies strictly between the two bounds in (13). However, in Proposition 4.5, we exhibit a particular design realizing the limiting variance of the ML estimator.

In fact, the bounds in (13) are sharp as shown in the following proposition.

Proposition 4.5.

(i) Let be such that , for ,

where , and . Then, taking , we get .

(ii) Let and be such that , for and with . Then and .

Remark 4.6.

Intuitively, in Proposition 4.5 (ii), will be much smaller than for most of the indices , so that the quantities and in (12) will be negligible. We refer to the proof of Proposition 4.5 for further details.

5 Numerical experiments

We illustrate Theorem 4.1 by a Monte Carlo simulation. We fix the values of $\sigma_0^2$ and $\theta_0$ and consider three sample size values. For the smallest sample size, we address three designs. The first one is the ‘minimal’ design given by Proposition 4.5 (ii), which asymptotically achieves the minimal estimation variance. The second one is the ‘regular’ design given by equispaced observation points. The third one is the ‘maximal’ design given by Proposition 4.5 (i), which asymptotically achieves the maximal estimation variance. These three designs are shown in Figure 1. For the two larger sample sizes, the ‘minimal’ design is not amenable to numerical computation anymore, as the smallest interpoint distances become too small, so we only address the ‘regular’ and ‘maximal’ designs.

Figure 1: Plot of the observation points for the ‘minimal’, ‘regular’ and ‘maximal’ designs of the numerical experiments (from left to right), for the smallest sample size. For the ‘minimal’ design, nine points form a dense cluster around one and the asymptotic variance of the CV estimator attains the lower bound (Proposition 4.5 (ii)); for the ‘regular’ design, the asymptotic variance lies strictly between the two bounds; and for the ‘maximal’ design, the asymptotic variance attains the upper bound (Proposition 4.5 (i)).

For a given sample size and a given design, we repeat the data generation and estimation a large number of times. That is, we independently sample Gaussian vectors of size $n$ with zero mean vector and covariance matrix $R_{\sigma_0^2,\theta_0}$. For each of these Gaussian vectors, we compute the joint CV estimators of $\sigma^2$ and $\theta$, so that we consider case (9) of Theorem 4.1. The computation of these estimators is not challenging since, from Lemma 3.1, the logarithmic score can be computed quickly, with an $O(n)$ complexity. [For more general covariance functions, the computation of CV or ML criteria is more costly, with an $O(n^3)$ complexity.] The criterion is minimized by repeating the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm with several starting points, and by keeping the parameter value with smallest logarithmic score over all the repetitions. The R software was used, with the optim function. [We remark that, for fixed $\theta$, one could find an explicit expression of the minimizing $\sigma^2$ (see also [5] for a different CV criterion). Hence, it would be possible to minimize the profile logarithmic score over $\theta$ only. As mentioned earlier, this improvement is not needed here, since the criterion can be computed quickly.]
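The R sketch below shows the kind of estimation step described in this paragraph. It reuses simulate_ou and log_score from the earlier snippets; the starting values, bounds, design and parameter values are ours, and a single bounded quasi-Newton call replaces the repeated BFGS runs for brevity.

```r
# One data generation and one CV estimation (illustrative sketch).
set.seed(1)
s   <- (1:50) / 50                                  # a 'regular' design
y   <- simulate_ou(s, sigma2 = 2, theta = 3)        # one sample of the process
fit <- optim(par = c(1, 1), fn = log_score, s = s, y = y,
             method = "L-BFGS-B", lower = c(1e-2, 1e-2), upper = c(1e2, 1e2))
fit$par                                             # CV estimates of (sigma^2, theta)
prod(fit$par)                                       # estimate of the product sigma^2 * theta
```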

For each simulated Gaussian vector, we compute the normalized estimation error of the microergodic parameter. In Figure 2, we report the histograms of these values, for the seven configurations under consideration. In addition, we report the probability density functions of the seven corresponding Gaussian distributions with mean zero and the asymptotic variances of Theorem 4.1, to which the histograms converge as $n \to \infty$.
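A minimal Monte Carlo loop in the spirit of this comparison is sketched below; it reuses simulate_ou and log_score from the earlier snippets, and the number of replications, the design and the true parameter values are ours. It collects the estimates of the product $\sigma^2\theta$ over independent samples so that their normalized errors can be compared with a Gaussian density.

```r
# Monte Carlo sketch: distribution of the CV estimates of sigma^2 * theta
# over independent samples on a regular design (illustrative values).
set.seed(3)
n <- 50; s <- (1:n) / n; sigma2_0 <- 2; theta_0 <- 3
est <- replicate(500, {
  y   <- simulate_ou(s, sigma2_0, theta_0)
  fit <- optim(c(1, 1), log_score, s = s, y = y, method = "L-BFGS-B",
               lower = c(1e-2, 1e-2), upper = c(1e2, 1e2))
  prod(fit$par)
})
hist(sqrt(n) * (est - sigma2_0 * theta_0), freq = FALSE, breaks = 30,
     main = "Normalized estimation errors", xlab = "")
```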

In Figure 2, we observe that, for the smallest sample size, the asymptotic Gaussian distributions are already reasonable approximations of the empirical histograms. For the intermediate sample size, the asymptotic distributions become very close to the histograms, and for the largest one they are almost identical to the histograms. Hence, the convergence in distribution of Theorem 4.1 provides a good approximation of the finite-sample situation already for small to moderate $n$. The smallest sample size illustrates the benefit of the ‘minimal’ design for estimation, as the histogram is most concentrated around zero for this design. Similarly, the asymptotic variance is the smallest for this design, compared to the ‘regular’ and ‘maximal’ designs. For the two larger sample sizes, we also observe that the estimation is more accurate for the ‘regular’ design than for the ‘maximal’ design, which also confirms Remark 4.4 and Proposition 4.5.

Finally, we have obtained similar conclusions for the cases where either $\sigma^2$ or $\theta$ is known in the computation of the CV criterion (cases (10) and (11)). We do not report the corresponding results for the sake of brevity.

Figure 2: Illustration of Theorem 4.1. Histograms of independent realizations of , together with the corresponding asymptotic Gaussian probability density functions with mean and variances (red lines). The sample size is (top row), (middle row) and (bottom row). For the top row, the designs are the ‘minimal’ design (left), achieving the smallest asymptotic variance; the ‘regular’ design (middle), with equispaced observation points; and the ‘maximal’ design (right), achieving the largest asymptotic variance. For the middle and bottom rows, the designs are the ‘regular’ design (left) and the ‘maximal’ design (right).

6 Extension to regression models

In this section, we extend Theorems 3.2 and 4.1 to the case of regression models. We assume that, instead of , we observe the Gaussian process defined by . In the definition of , is fixed and unknown and, for , is a known function. Hence, we estimate jointly from the observation vector , with .

Let be the matrix . Then is the best linear unbiased predictor of given , under covariance function for all , see e.g. [30].

We now address CV estimation. Let , let , let be the matrix obtained by removing row of , and let be the matrix obtained by removing row and column of . Then, for all , is the best linear unbiased predictor of given , under covariance function .

We also let . Then, from e.g. [30],

(14)

is the best linear unbiased predictor of given . We let

Then, the CV estimator of we shall study in this section is

with

We remark that [43] suggests using a similar CV criterion, with the notable difference that is replaced by in (14). The benefit of the CV predictor (14), compared to that considered in [43], is that, in (14), no use of is made at all for predicting . In [15], the following relations are shown, extending those of Section 3. We have

and

(15)

with .

Based on the two displays above, and again using the explicit matrix inverse in (1), we are able to prove the consistency and asymptotic normality of where the asymptotic distribution is identical to that of Section 4.
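To fix ideas, the R sketch below gives the standard universal-kriging form of a leave-one-out predictor in which neither the $i$-th observation nor the $i$-th row of the regression matrix is used. It is only meant to illustrate the construction around (14); the notation and implementation choices are ours, and it is not claimed to reproduce the paper's exact expression.

```r
# Leave-one-out best linear unbiased predictor of y[i] in the regression
# setting: the trend coefficients are re-estimated by generalized least
# squares without observation i (illustrative sketch).
loo_blup <- function(i, s, y, H, sigma2, theta) {
  R    <- sigma2 * exp(-theta * abs(outer(s, s, "-")))
  Rmi  <- R[-i, -i]                      # covariances among the other points
  rmi  <- R[-i, i]                       # covariances with point i
  Hmi  <- H[-i, , drop = FALSE]
  ymi  <- y[-i]
  Qmi  <- solve(Rmi)
  beta <- solve(t(Hmi) %*% Qmi %*% Hmi, t(Hmi) %*% Qmi %*% ymi)
  drop(H[i, ] %*% beta + t(rmi) %*% Qmi %*% (ymi - Hmi %*% beta))
}
```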

Theorem 6.1 (Consistency).

Assume that and that there exists in so that . Assume also that are twice continuously differentiable and are linearly independent on . Then exists and

(16)
Theorem 6.2 (Central Limit Theorem).

Assume that the conditions of Theorem 6.1 hold and that and . Then, with as in (12), we have

(17)

In Theorems 6.1 and 6.2, the twice differentiability condition for is mostly technical, and could be replaced by a continuous differentiability condition, at the price of more technical proofs. [We remark that Theorems 6.1 and 6.2 apply in particular to polynomial functions, which are widely considered, for instance, in the framework of computer experiments [30].] As remarked in [40], when are continuously differentiable, the parameter is non-microergodic and cannot be consistently estimated.

Finally, assume now that the regression functions satisfy the conditions given in Theorem 3 (ii) in [40]. Then, one can show from the proof of this theorem that, for any sequence of random variables (and in particular for the CV estimators above), the estimator given above is consistent and asymptotically normal, with asymptotic distribution given in (4.5) in [40]. In this setting, it would be interesting to study the joint asymptotic distribution of these estimators.

7 Concluding remarks

We have proved the consistency and asymptotic normality of the CV estimator of the microergodic parameter $\sigma_0^2 \theta_0$, based on the logarithmic score. While the ML asymptotic variance of this estimator is the same for any triangular array of observation points, the corresponding CV asymptotic variance depends on the triangular array and is only known to lie between a lower and an upper bound, those bounds being tight. The triangular array we exhibit, achieving the minimal asymptotic variance for CV, is based on some ratios between interpoint distances going to zero as $n \to \infty$, which makes it too challenging to simulate numerically for large $n$. It would be interesting to find the smallest possible asymptotic variance for CV, when the ratios between interpoint distances are bounded away from zero.

One interesting agenda for future research would be to extend this asymptotic analysis of CV to the other settings where such an analysis was possible for ML. These settings include the case of measurement errors for the exponential covariance function in dimension one [11, 10], the case of the separable exponential covariance function in higher dimension [41] (consistency and asymptotic normality), of the separable Matérn covariance function [22] (consistency) and of the Gaussian covariance function [23] (consistency). In these references, tractable approximations of the inverse covariance matrices are provided, which could also be exploited in the case of CV. Finally, using techniques which are more spectral in nature, [14, 21, 39] prove central limit theorems for the ML estimation of the microergodic parameter in the isotropic Matérn model. An extension to CV would also be valuable.

8 Proofs

8.1 Notation and auxiliary results

Recall that and introduce and its normalized version for (the dependency on $n$ is not mentioned, to lighten notation). When , the random variables will be denoted and . By the Markovian and Gaussian properties of $Y$, it follows that, for each , is independent of . Therefore, is an i.i.d. sequence of random variables having the standard normal distribution.
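The i.i.d. standard Gaussian property of the normalized innovations can be checked numerically. The R sketch below reuses simulate_ou from the snippet in Section 2; the parameter values are illustrative. It forms the normalized one-step innovations of the observed process and verifies that their empirical mean and variance are close to 0 and 1.

```r
# Numerical check: normalized one-step innovations of the OU process
# behave like an i.i.d. N(0, 1) sample (illustrative sketch).
set.seed(2)
s <- sort(runif(2000)); sigma2_0 <- 2; theta_0 <- 3
y <- simulate_ou(s, sigma2_0, theta_0)
rho <- exp(-theta_0 * diff(s))
w <- (y[-1] - rho * y[-length(y)]) / sqrt(sigma2_0 * (1 - rho^2))
c(mean(w), var(w))                          # should be close to 0 and 1
```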
It is convenient to have short expressions for terms that converge in probability to zero. We follow [38]. The notation (respectively ) stands for a sequence of random variables that converges to zero in probability (resp. is bounded in probability) as . More generally, for a sequence of random variables ,

For deterministic sequences, the stochastic notations reduce to the usual $o$ and $O$. Throughout the paper, we use a generic constant (i.e., one that may or may not be the same at different occurrences) that does not depend on $n$.

We also denote by and two sequences of random variables satisfying

and

The definition of and may change from one line to the other. Similarly, we denote by a triangular array of deterministic scalars satisfying

The definition of may change from one line to the other, and possibly also in different occurrences within the same line. We also use several times that,

(18)
(19)

Before turning to the proof of Theorems 3.2 and 4.1, we state five auxiliary lemmas that will be required in the sequel.

Lemma 8.1.
  • Let , be fixed. Then as ,