
# Sparse Precision Matrix Selection for Fitting Gaussian Random Field Models to Large Data Sets[^1]

[^1]: This research is based on the dissertation of the first author, conducted under the guidance of the second and the third authors, Drs. Aybat and del Castillo.

S. Davanloo Tajbakhsh, N. S. Aybat, and E. del Castillo
Department of Industrial and Manufacturing Engineering
The Pennsylvania State University
###### Abstract

Iterative methods for fitting a Gaussian Random Field (GRF) model to spatial data via maximum likelihood (ML) require $O(n^3)$ floating point operations per iteration, where $n$ denotes the number of data locations. For large data sets, the $O(n^3)$ complexity per iteration together with the non-convexity of the ML problem render traditional ML methods inefficient for GRF fitting. The problem is even more aggravated for anisotropic GRFs, where the number of covariance function parameters increases with the dimension of the process domain. In this paper, we propose a new two-step GRF estimation procedure for second-order stationary processes. First, a convex likelihood problem regularized with a weighted $\ell_1$-norm, utilizing the available distance information between observation locations, is solved to fit a sparse precision (inverse covariance) matrix to the observed data using the Alternating Direction Method of Multipliers. Second, the parameters of the GRF spatial covariance function are estimated by solving a least squares problem. Theoretical error bounds for the proposed estimator are provided; moreover, convergence of the estimator is shown as the number of samples per location increases. The proposed method is numerically compared with state-of-the-art methods for big $n$. Data segmentation schemes are implemented to handle large data sets.

Keywords: Spatial Statistics, Convex Optimization, Gaussian Markov Random Fields, Covariance Selection, ADMM.

Gaussian Random Field (GRF) models[^2] are widely used in several fields, e.g., Machine Learning, Geostatistics, Computer Experiments (metamodeling), and Industrial Metrology. Traditional methods for fitting a GRF model to given sample data rely on computing the maximum likelihood estimate (MLE) of the parameters of an assumed spatial covariance function belonging to a known parametric family. As is well known in Spatial Statistics (Warnes and Ripley, 1987), the log-likelihood function for the covariance parameters of a GRF is non-concave, which leads to numerical difficulties in solving the optimization problem for the MLE, yielding suboptimal estimates that do not possess the desirable properties of the MLE. Furthermore, MLE optimization routines require $O(n^3)$ operations per iteration due to matrix inversions, where $n$ is the number of distinct data locations, which renders them inefficient on large data sets – Banerjee et al. (2008) called it the “Big $n$ problem”.

[^2]: A GRF is also called a Gaussian Process (GP) model or simply a Gaussian Field (GF).

To overcome these difficulties, we propose a new method, Sparse Precision matrix Selection (SPS), for fitting a GRF model. In the first stage of the SPS method, a sparse precision (inverse covariance) matrix is fitted to the GRF data observations by solving a convex likelihood problem regularized with a weighted $\ell_1$-norm, utilizing the available distance information among observation locations. This precision matrix is not parameterized in any form and constitutes a Gaussian Markov Random Field (GMRF) approximation to the GRF. This first-stage problem is solved using a variant of the Alternating Direction Method of Multipliers (ADMM) with a linear convergence rate. Suppose the covariance function has parameters $\theta$. In the second stage, these parameters are estimated via a least squares (LS) problem in $\theta$, resulting in more consistent estimates than the suboptimal solutions of the non-convex MLE problem. Although the second-stage LS problem is non-convex in general, it is still numerically much easier to solve than the non-convex MLE problem, which requires $O(n^3)$ operations per iteration. In particular, the solution to the second-stage LS problem can be computed via a line search in the range parameter for isotropic GRFs. Empirical evidence suggests that the first-stage optimization “zooms in” to a region in the covariance parameter space that is close to the true parameter values, alleviating the non-convexity to a certain degree. We next provide some preliminaries, including a brief review of other state-of-the-art methods.

### Preliminaries

Let $x \in \mathcal{X} \subset \mathbb{R}^d$ denote a location and $y(x)$ denote the value of a latent GRF $f(x)$ observed with additive Gaussian noise at location $x$: $y(x) = f(x) + \epsilon(x)$, where $f$ has a mean function $\mu(x)$ and covariance function $c(x, x')$ for all $x, x' \in \mathcal{X}$, and $\epsilon(x) \sim \mathcal{N}(0, \theta_0^*)$ is independent of $f$ and models the “nugget” error. We assume the training data $\mathcal{D}$ contains $N$ realizations of the GRF at each of $n$ distinct locations $\{x_i\}_{i=1}^n \subset \mathcal{X}$. Let $y^{(r)} \in \mathbb{R}^n$ denote the vector of $r$-th realization values for the locations in $\{x_i\}_{i=1}^n$. Given a new location $x_0$, the goal in GRF modeling is to predict $f_0 := f(x_0)$. We assume that the GRF has a constant mean equal to zero, i.e., $\mu(x) = 0$. Since any countable collection of observations from a GRF follows a multivariate normal distribution, the joint distribution of $(f_0, y^{(r)})$ is multivariate normal for all $r$, with zero mean and covariance blocks $c_{00} := c(x_0, x_0)$, $c_0 \in \mathbb{R}^n$ such that $(c_0)_i = c(x_0, x_i)$, and $C_f + \theta_0^* I_n$, where the covariance matrix $C_f$ is formed such that its $(i,j)$ element is equal to $c(x_i, x_j)$. Therefore, the conditional distribution of $f_0$ given $\{y^{(r)}\}_{r=1}^N$, i.e., the predictive distribution of $f_0$, is given as

$$p\big(f_0 \,\big|\, \{y^{(r)}\}_{r=1}^N\big) = \mathcal{N}\Big(c_0^\top \big(C_f + \theta_0^* I_n\big)^{-1} \sum_{r=1}^N y^{(r)}/N,\;\; c_{00} - c_0^\top \big(C_f + \theta_0^* I_n\big)^{-1} c_0\Big). \tag{1}$$

The mean of this predictive distribution is a point estimate (known as the Kriging estimate in Geostatistics) and its variance measures the uncertainty of this prediction.
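For concreteness, the predictive mean and variance in (1) amount to a few dense linear solves. The sketch below assumes a squared-exponential covariance purely as an example (the text does not fix a family here), and the function and parameter names are our own.

```python
import numpy as np

def se_cov(A, B, theta_v=1.0, theta_rho=1.0):
    """Squared-exponential covariance, an assumed example family."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return theta_v * np.exp(-d2 / theta_rho**2)

def krige(X, Y, x0, theta_0=1e-6, **kw):
    """Predictive mean and variance of f(x0) as in Eq. (1).
    X: (n, d) locations, Y: (N, n) realizations, x0: (d,) new location."""
    n = X.shape[0]
    K = se_cov(X, X, **kw) + theta_0 * np.eye(n)     # C_f + theta_0* I_n
    c0 = se_cov(x0[None, :], X, **kw).ravel()        # c_0
    c00 = se_cov(x0[None, :], x0[None, :], **kw)[0, 0]
    w = np.linalg.solve(K, c0)                       # (C_f + theta_0* I_n)^{-1} c_0
    return w @ Y.mean(axis=0), c00 - c0 @ w          # Kriging mean and variance
```

With a negligible nugget, predicting at a training location reproduces the averaged observation there and the predictive variance collapses toward zero, which is a quick sanity check of (1).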

It is clear from (1) that the prediction performance can be significantly improved by correctly estimating the unknown covariance function $c(\cdot,\cdot)$, which is typically assumed to belong to some parametric family $\{c(\cdot,\cdot;\theta) : \theta \in \Theta\}$, where $\Theta$ is a set that contains the true parameters of the $y$-process – this practice originates in the Geostatistics literature, e.g., Cressie (1993). Let $c(x, x'; \theta) = \theta_v\, r(x, x'; \theta_\rho)$, where $r$ is a parametric correlation function, and $\theta_\rho$ and $\theta_v > 0$ denote the spatial correlation and variance parameters, respectively. For isotropic correlation functions, $r(x, x'; \theta_\rho)$ depends on the locations only through the distance $\|x - x'\|$; examples include the Squared-Exponential (SE) and Matérn correlation functions. In the isotropic setting, $\theta_\rho > 0$ is the range parameter. In second-order stationary anisotropic random fields, the correlation between two points is a function of the vector connecting the two locations rather than the distance, e.g., the anisotropic exponential correlation function $r(x, x'; \theta_\rho) = \exp\big(-(x - x')^\top A\, (x - x')\big)$, where $A$ is a symmetric positive definite matrix parameterized by $\theta_\rho$. A covariance function is called valid if it leads to a positive definite covariance matrix for any finite set of fixed locations. Let $\theta^* = (\theta_\rho^*, \theta_v^*, \theta_0^*)$ denote the unknown true parameters of the $y$-process. Hence, $c_y(x, x'; \theta^*) := \theta_v^*\, r(x, x'; \theta_\rho^*) + \theta_0^*\, \delta(x, x')$ denotes the covariance function of the $y$-process. Here, $\delta(x, x') = 1$ if $x = x'$, and is equal to 0 otherwise.

Given a set of locations $\{x_i\}_{i \in I} \subset \mathcal{X}$ with $I := \{1, \ldots, n\}$, and $\theta = (\theta_\rho, \theta_v, \theta_0) \in \Theta$, let $R(\theta_\rho)$ be such that its $(i,j)$ element is $r(x_i, x_j; \theta_\rho)$, and define $C(\theta) := \theta_v R(\theta_\rho) + \theta_0 I_n$, i.e., its $(i,j)$ element is equal to $\theta_v\, r(x_i, x_j; \theta_\rho) + \theta_0\, \delta(x_i, x_j)$. Hence, $C_f = \theta_v^* R(\theta_\rho^*)$ and $C(\theta^*) = C_f + \theta_0^* I_n$ denote the true covariance matrices of the $f$-process and $y$-process, resp., corresponding to the locations $\{x_i\}_{i \in I}$. The log marginal likelihood function is written as

$$\ell(\theta \mid \mathcal{D}) = -\frac{1}{2}\log\det\big(C(\theta)\big) - \frac{1}{2N}\sum_{r=1}^N y^{(r)\top} C(\theta)^{-1} y^{(r)} - \frac{n}{2}\log(2\pi).$$

Let $S := \frac{1}{N}\sum_{r=1}^N y^{(r)} y^{(r)\top}$ denote the sample covariance matrix. Given matrices $A$ and $B$ of compatible dimensions, let $\langle A, B\rangle := \mathrm{tr}(A^\top B)$. Hence, finding the MLE of the $y$-process parameters requires solving

$$\hat{\theta}_{\mathrm{MLE}} = \operatorname*{argmin}_{\theta \in \Theta}\; \langle S, C(\theta)^{-1}\rangle + \log\det\big(C(\theta)\big) \tag{2}$$

over a set $\Theta$ containing the true unknown parameters $\theta^*$.
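The objective of (2) is cheap to write down for a given covariance matrix; a direct evaluation (our own sketch, with $C(\theta)$ passed in precomputed) is:

```python
import numpy as np

def mle_objective(C, Y):
    """<S, C^{-1}> + log det(C) from Eq. (2), with S = (1/N) sum_r y_r y_r^T.
    Up to constants, this equals -2 * (average log-likelihood per realization)."""
    N = Y.shape[0]
    S = Y.T @ Y / N                          # sample covariance matrix S
    sign, logdet = np.linalg.slogdet(C)
    assert sign > 0, "C must be positive definite"
    return np.trace(np.linalg.solve(C, S)) + logdet
```

Minimizing this over $\theta \mapsto C(\theta)$ is exactly the non-convex problem the paper is arguing against solving directly.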

The log-likelihood function is not concave in $\theta$ for many important parametric families of covariance functions. Therefore, the MLE problem in (2) is non-convex, which causes standard optimization routines to be trapped in local minima (Mardia and Watkins, 1989; Rasmussen and Williams, 2006). The main result of this paper, Theorem 3.2, and empirical evidence from our numerical experiments suggest why the two-step SPS approach works better than well-known one-step non-convex log-likelihood maximization approaches. SPS defers dealing with the non-convexity issue to a later stage, and first obtains a regularized log-likelihood estimate of the precision matrix by solving a convex problem. At the second stage, a non-convex least-squares problem is solved, whose global minimum is “close” to the true values; moreover, the objective is strongly convex in a considerably “large” neighborhood of the global minimum – see Figure 5.

Several other methods have been proposed in the literature to deal with the “Big $n$” problem in GRF modeling. These approaches can be broadly classified into six main classes: 1) Likelihood approximation methods approximate the likelihood function in the spectral domain (Fuentes, 2007; Stein, 1999), or approximate it as a product of conditional densities (Vecchia, 1988; Stein et al., 2004); 2) Covariance tapering provides a sparse covariance matrix in which the long-range (usually weak) covariance elements are set to zero. Sparse matrix routines are then used to efficiently find the inverse and determinant of the resulting matrix (Furrer et al., 2006); 3) Low-rank process approximation methods are based on a truncated basis expansion of the underlying GRF, which reduces the computational complexity from $O(n^3)$ to $O(nm^2)$, where $m \ll n$ is the number of basis functions used to approximate the process (Higdon, 2002; Cressie and Johannesson, 2008; Banerjee et al., 2008; Nychka et al., 2015); 4) Sampling-based stochastic approximation draws a subsample of the data points at each iteration, and the model parameters are updated according to a stochastic approximation technique until convergence is achieved (Liang et al., 2013); 5) Localized GRFs split the input domain into different segments, and the covariance parameters are estimated via ML locally on each segment (Gramacy and Apley, 2013) – this approach requires further formulation to avoid discontinuities of the predicted surface over the full domain (Park et al., 2011); and finally 6) Markov random field approximations, related to our proposed SPS method, will be discussed in more detail in Section 1. There are also methods in the intersection of two classes: Sang and Huang (2012) combined low-rank process approximation with covariance tapering; Snelson and Ghahramani (2007) proposed a mix of likelihood approximation and localized GRF.

The rest of the paper is organized as follows: in Section 1, we motivate the proposed method. In Sections 2 and 3, we discuss the two-stage SPS method in detail and prove the statistical properties of the SPS estimator. From a computational perspective, it is shown that the first-stage algorithm has a linear convergence rate, and that the second-stage problem is strongly convex around the estimator and can be solved efficiently via a line search for isotropic GRFs. Next, in Section 4, we assess the prediction performance of the proposed method, comparing it to alternative methods on both synthetic and real data. Finally, in Section 5, we conclude by providing some summarizing remarks and directions for further research.

## 1 Motivation for the proposed SPS method

Let $\theta^* \in \Theta$ be the true parameters, and $\Sigma^* := C(\theta^*)$ be the true covariance matrix of the $y$-process corresponding to the given locations $\{x_i\}_{i \in I}$. The proposed method can be motivated by four interrelated remarks: a) the precision matrix $P^* := (\Sigma^*)^{-1}$ of a stationary GRF can be approximated with a sparse matrix; b) powerful convex optimization algorithms exist to solve Sparse Covariance Selection (SCS) problems to find a sparse approximation to $P^*$; c) past and recent work on directly approximating a GRF with a GMRF also involves determining a sparse precision matrix; d) the available distance information among the given locations can be incorporated into the GRF estimation.

a) first motivation: Our method is motivated by the observation that the precision matrix $P^*$ of a stationary GRF can be approximated by a sparse matrix. The element $P^*_{ij}$ is a function of the conditional covariance, i.e., partial covariance, between $y(x_i)$ and $y(x_j)$ given the rest of the variables, and $P^*_{ij} \to 0$ as this conditional covariance vanishes because

$$P^*_{ij} = \frac{-\operatorname{Cov}\big(y(x_i),\, y(x_j) \mid \{y(x_k)\}_{k \neq i,j}\big)}{\operatorname{Var}\big(y(x_i) \mid \{y(x_k)\}_{k \neq i,j}\big)\operatorname{Var}\big(y(x_j) \mid \{y(x_k)\}_{k \neq i,j}\big) - \operatorname{Cov}\big(y(x_i), y(x_j) \mid \{y(x_k)\}_{k \neq i,j}\big)^2}.$$

In particular, conditionally independent variables lead to a zero entry in the precision matrix (Whittaker, 2009). This is why sparse precision matrices are common in graphical models and Bayesian networks (Whittaker, 2009), where most of the variable pairs are conditionally independent. The fact that the precision matrix of a GRF is close to sparse is related to the interesting behavior of the so-called screen effect in a spatial process (Cressie (1993), p. 133; Journel and Huijbregts (1978), p. 346). The screen effect is complete in $\mathbb{R}^1$, i.e., for any three data points on a line, the two outer points are conditionally independent given the middle one (in time series models, the partial (auto)correlation function “cuts off” after lag $p$ for a Markovian AR($p$) process – see Box et al. (2008)). However, for a GRF in $\mathbb{R}^d$ with $d > 1$, the screen effect is not complete; hence, the corresponding precision matrix is not sparse for any finite set of variables pertaining to the GRF. However, for all stationary GRFs tested, we have observed that for a finite set of locations the magnitudes of the off-diagonal elements of the precision matrix decay to 0 much faster than those observed for the covariance matrix. To illustrate this fact, we compare the decay in covariance and precision elements in Figure 1 for data generated from GRFs with Matérn, Squared-Exponential, and Exponential covariance functions. Clearly, the precision matrix can be better approximated by a sparse matrix than the covariance matrix can be.

By fixing the domain of the process and increasing $n$ (increasing the density of the data points), the screen effect becomes stronger, i.e., off-diagonal entries decay to 0 faster. Hence, the precision matrix can be better approximated with a sparse matrix as $n$ increases in a fixed domain. To illustrate this phenomenon numerically, we calculated the precision matrices corresponding to a Matérn GRF with variance and range parameters equal to 1 and 10, resp., for increasing $n$ over a fixed square domain. Then, as a measure of near-sparsity, we computed the percentage of scaled off-diagonal elements of the precision matrix greater in absolute value than a given threshold $\epsilon$. For comparison, we report the same quantities for the covariance matrices in Table 1. This shows the effect of infill asymptotics (Cressie, 1993) on the screen effect: in a fixed domain, as $n$ increases, precision matrices get closer to sparse matrices, while the covariance matrices are insensitive to the increasing density.
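The screen effect is easy to reproduce in one dimension, where it is complete for the exponential covariance (the resulting precision matrix is exactly tridiagonal). The snippet below, our own illustration rather than the paper's Table 1 setup, computes a near-sparsity measure of the kind described above for both matrices.

```python
import numpy as np

def pct_above(M, eps=1e-6):
    """Percentage of off-diagonal entries, scaled by the largest absolute
    off-diagonal entry, that exceed eps in magnitude."""
    off = np.abs(M[~np.eye(M.shape[0], dtype=bool)])
    return 100.0 * (off / off.max() > eps).mean()

x = np.linspace(0.0, 10.0, 60)                   # n = 60 points on a line
C = np.exp(-np.abs(x[:, None] - x[None, :]))     # exponential covariance
P = np.linalg.inv(C)                             # precision matrix
# Screen effect: P is (numerically) tridiagonal, while C is dense.
```

In higher dimensions the effect is not complete, but, as Table 1 reports, the precision entries still decay much faster than the covariance entries.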

b) second motivation: Our work is also motivated by the recent optimization literature on the Sparse Covariance Selection (SCS) problem (Dempster, 1972) in (3) – compare it with (2). Given a sample covariance matrix $S$ of a zero-mean multivariate Gaussian, d’Aspremont et al. (2008) proposed to estimate the corresponding precision matrix by solving a regularized maximum likelihood problem: $\min_{P \succ 0}\, \langle S, P\rangle - \log\det(P) + \alpha\,\mathbf{card}(P)$, where $\mathbf{card}(P)$ denotes the number of non-zero elements of $P$, and $P \succ 0$ restricts $P$ to the cone of symmetric, positive definite (PD) matrices. This problem is combinatorially hard due to the cardinality operator in the objective function. Since the $\ell_1$-norm, defined as $\|P\|_1 := \sum_{i,j}|P_{ij}|$, is the tightest convex envelope of $\mathbf{card}(\cdot)$, a convex approximation problem can be formulated as

$$\min_{P \succ 0}\; \langle S, P\rangle - \log\det(P) + \alpha\|P\|_1. \tag{3}$$

The growth of interest in SCS in the last decade is mainly due to the development of first-order algorithms that can efficiently deal with large-scale $\ell_1$-regularized convex problems (Yuan, 2012; Friedman et al., 2008; Mazumder and Hastie, 2012; Honorio and Jaakkola, 2013).
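For the unweighted problem (3), off-the-shelf solvers are available; e.g., scikit-learn's `GraphicalLasso` solves it with a coordinate-descent routine (the weighted variant proposed in this paper is not covered by that API):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n_vars, n_samples = 8, 500
# Sample from a Gaussian whose true precision is tridiagonal (AR(1)-type).
C_true = np.exp(-np.abs(np.subtract.outer(np.arange(n_vars), np.arange(n_vars))))
X = rng.standard_normal((n_samples, n_vars)) @ np.linalg.cholesky(C_true).T

model = GraphicalLasso(alpha=0.05).fit(X)   # alpha plays the role of α in (3)
P_hat = model.precision_                    # sparse precision estimate
```

The estimate is symmetric and positive definite by construction, which is what makes it usable as a GMRF approximation.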

c) third motivation: Further motivation comes from previous work on approximating a GRF with a Gaussian Markov Random Field (GMRF) to obtain computational gains from sparsity. A GRF process on a lattice is a GMRF under the conditional independence assumption, i.e., a variable is conditionally independent of the other variables on the lattice given its “neighbors” (Rue and Held, 2005). While the index set is countable for lattice data, the index set for a GRF is uncountable; hence, in general, GMRF models cannot represent GRFs exactly. For a very special class, Lindgren et al. (2011) recently established that Matérn GRFs are Markovian when the smoothness parameter $\nu$ is such that $\nu + d/2$ is an integer, where $d$ is the dimension of the input space – see Lindgren et al. (2011) and Simpson et al. (2012) for the use of this idea in the approximation of anisotropic and non-stationary GRFs. Rather than using a triangulation of the input space as proposed by Lindgren et al. (2011), or assuming a lattice process, we let the data determine the near-conditional-independence pattern between variables through the precision matrix estimated via a weighted $\ell_1$-regularization similar to that used in the SCS problem.

d) fourth motivation: Since the spatial locations of the observations $\{x_i\}_{i \in I}$ are known, these data can be utilized to improve the estimation even when the number of realizations at each location is low. For all stationary covariance functions tested, we observed that $|P^*_{ij}|$ decreases to 0 exponentially as $\|x_i - x_j\|$ increases – see Figure 2. Therefore, this information can be utilized for regularizing the likelihood function (see Section 2.1).

## 2 The SPS algorithm for fitting a GRF model

The proposed method for fitting a GRF is composed of two stages: 1) the true precision matrix corresponding to the training data set is approximated with a sparse matrix by solving a convex maximum likelihood problem regularized with a weighted $\ell_1$-norm; 2) after inverting the fitted precision matrix from the first stage, a least-squares problem is solved to estimate the unknown covariance function parameters.

### 2.1 STAGE-I: Estimation of precision matrices

Given the locations $\{x_i\}_{i \in I}$ with $I = \{1, \ldots, n\}$, suppose $y^{(r)} \sim \mathcal{N}(0, \Sigma^*)$ independently for $r = 1, \ldots, N$, where $\Sigma^*$ is the covariance matrix of the zero-mean GRF corresponding to $\{x_i\}_{i \in I}$, and let $P^* := (\Sigma^*)^{-1}$. Fix $0 < a^* \leq b^*$ satisfying $a^* \leq \lambda_{\min}(P^*)$ and $\lambda_{\max}(P^*) \leq b^*$; hence, $a^* I_n \preceq P^* \preceq b^* I_n$. Given $\{y^{(r)}\}_{r=1}^N$, compute the unbiased estimator of $\Sigma^*$, $S := \frac{1}{N}\sum_{r=1}^N y^{(r)} y^{(r)\top} \in \mathbb{S}^n$, where $\mathbb{S}^n$ denotes the set of $n \times n$ symmetric matrices, and form the weight matrix $G \in \mathbb{S}^n$ as follows: for all $i, j \in I$,

$$G_{ij} = \|x_i - x_j\|\;\; \text{if } i \neq j; \qquad G_{ii} = \min\big\{\|x_i - x_j\| :\, j \in I \setminus \{i\}\big\}. \tag{4}$$
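Constructing $G$ from the locations is straightforward; a minimal sketch:

```python
import numpy as np

def weight_matrix(X):
    """Weight matrix G of Eq. (4) for locations X with shape (n, d)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = D.copy()
    np.fill_diagonal(D, np.inf)             # exclude j = i when taking the min
    np.fill_diagonal(G, D.min(axis=1))      # G_ii = min_{j != i} ||x_i - x_j||
    return G
```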

Let $\alpha > 0$ be a regularization parameter. To approximate the true precision matrix $P^*$ with a sparse matrix, we propose to solve the following convex problem:

$$\hat{P} := \operatorname*{argmin}_P \Big\{\langle S, P\rangle - \log\det(P) + \alpha\langle G, |P|\rangle :\; a^* I_n \preceq P \preceq b^* I_n\Big\}, \tag{5}$$

where $|P|$ denotes the element-wise absolute value of $P$; hence, the last term is a weighted $\ell_1$-norm with weights equal to the “distances” in $G$ – compare it with (2) and (3). Note that $\hat{P}$ is always a full-rank matrix due to the $-\log\det(P)$ term in the objective function.

If there is no prior information on the process from which to obtain non-trivial bounds, then setting $a^*$ sufficiently small and $b^*$ sufficiently large trivially satisfies the condition on $a^*$ and $b^*$; in this case, (5) reduces to $\hat{P} := \operatorname{argmin}_{P \succ 0}\, \langle S, P\rangle - \log\det(P) + \alpha\langle G, |P|\rangle$. On the other hand, when there is prior information on the process, one can exploit it to obtain non-trivial bounds $a^*$ and $b^*$. For instance, let $\Sigma^* = C(\theta^*)$ be the true covariance matrix corresponding to the locations in $\{x_i\}_{i \in I}$, where $\theta_\rho^*$, $\theta_v^*$, and $\theta_0^*$ denote the true spatial correlation, variance, and nugget parameters of the $y$-process. The common structure of covariance functions implies that $\lambda_{\max}\big(R(\theta_\rho^*)\big) \leq \lambda_{\max}(\mathbf{1}\mathbf{1}^\top) = n$, where $\mathbf{1}$ denotes the vector of ones. Therefore, $\lambda_{\min}(P^*) \geq (\theta_v^*\, n + \theta_0^*)^{-1}$. Hence, if upper bounds on $\theta_v^*$ and $\theta_0^*$ are known a priori, then one can obtain a non-trivial lower bound $a^*$.

In comparison to the SCS problem in (3), the proposed model in (5) penalizes each element of the precision matrix with a different weight proportional to the distance between the corresponding locations. In particular, for $i, j \in I$ such that $i \neq j$, the weight is $G_{ij} = \|x_i - x_j\|$. This model assumes that the off-diagonal precision magnitudes decrease with distance, for which there is empirical evidence as shown in Figure 2. Moreover, the reason we solve (5) using $G$ with a strictly positive diagonal is that otherwise the diagonal elements of the precision matrix would not be penalized and might attain relatively large positive values. To solve the STAGE-I problem in (5), we propose to use the Alternating Direction Method of Multipliers (ADMM) algorithm displayed in Figure 3.

###### Theorem 2.1.

Given arbitrary initial iterates and a positive penalty sequence $\{\rho_k\}$, let $\{(P^k, Z^k)\}$ denote the primal iterate sequence generated by ADMM as shown in Figure 3. Then one of the sequences $\{P^k\}$ and $\{Z^k\}$ converges Q-linearly[^3] to $\hat{P}$, and the other converges R-linearly to $\hat{P}$, where $\hat{P}$ is the unique optimal solution to the STAGE-I problem given in (5).

[^3]: Let $\{x^k\}$ converge to $\bar{x}$ for a given norm $\|\cdot\|$. The convergence is called Q-linear if $\|x^{k+1} - \bar{x}\| \leq c\,\|x^k - \bar{x}\|$ for some $c \in (0, 1)$, and R-linear if $\|x^k - \bar{x}\| \leq t_k$ for some sequence $\{t_k\}$ converging to 0 Q-linearly.

Remark. As shown in the proof of Theorem 2.1, when non-trivial bounds $a^*$ and/or $b^*$ are imposed in (5), the choice of the penalty parameter in Figure 3 satisfies the technical condition that ensures the ADMM iterate sequence converges linearly.

The algorithm is terminated at the end of iteration $k$ when both the primal and dual residuals are below a given tolerance value, where the primal residual is $r^k := \|P^k - Z^k\|_F$ and the dual residual is $d^k := \rho_k\|Z^k - Z^{k-1}\|_F$. From the necessary and sufficient optimality conditions for Step 7 and Step 8 in Figure 3, $r^k = d^k = 0$ implies $P^k = Z^k = \hat{P}$, i.e., the unique optimal solution to (5). In practice, ADMM converges to an acceptable accuracy within a few tens of iterations, which was also the case in our numerical experiments. Typically, in ADMM algorithms (Boyd et al., 2011), the penalty parameter is held constant, i.e., $\rho_k = \rho$ for all $k$, for some $\rho > 0$. Although convergence is guaranteed for all $\rho > 0$, the empirical performance critically depends on the choice of $\rho$ – it deteriorates rapidly if the penalty is set too large or too small (Kontogiorgis and Meyer, 1998). Moreover, Lions and Mercier (1979) discuss that there exists a $\rho^*$ which optimizes the convergence rate bounds for the constant penalty ADMM scheme; however, estimating $\rho^*$ is difficult in practice. In our experiments, we used an increasing penalty sequence $\{\rho_k\}$. For details on the convergence of variable penalty ADMM, see (He et al., 2002; Aybat and Iyengar, 2015) in addition to the references above.

Next, we show that Steps 7 and 8 of ADMM, displayed in Figure 3, can be computed efficiently. Given a convex function $\phi$ and $\rho > 0$, the proximal mapping is defined as $\operatorname{prox}_{\phi/\rho}(\bar{P}) := \operatorname{argmin}_P\, \phi(P) + \frac{\rho}{2}\|P - \bar{P}\|_F^2$; and given a set $\mathcal{Q}$, let $\mathbb{1}_{\mathcal{Q}}$ denote the indicator function of $\mathcal{Q}$, i.e., $\mathbb{1}_{\mathcal{Q}}(P) = 0$ for $P \in \mathcal{Q}$, and equal to $+\infty$ otherwise.

###### Lemma 2.1.

Let $f(P) := \langle S, P\rangle - \log\det(P)$, and $\mathcal{Q} := \{P \in \mathbb{S}^n : a^* I_n \preceq P \preceq b^* I_n\}$. In generic form, Step 7 of ADMM can be written as $\operatorname{prox}_{(f + \mathbb{1}_{\mathcal{Q}})/\rho}(\bar{P})$ for some $\bar{P} \in \mathbb{S}^n$ and $\rho > 0$. Suppose $\rho\bar{P} - S$ has eigen-decomposition $Q\,\mathrm{diag}(\bar{\lambda})\,Q^\top$. Then $\operatorname{prox}_{(f + \mathbb{1}_{\mathcal{Q}})/\rho}(\bar{P}) = Q\,\mathrm{diag}(\lambda^*)\,Q^\top$, where

$$\lambda_i^* = \max\left\{\min\left\{\frac{\bar{\lambda}_i + \sqrt{\bar{\lambda}_i^2 + 4\rho}}{2\rho},\; b^*\right\},\; a^*\right\}, \qquad i = 1, \ldots, n. \tag{6}$$
###### Lemma 2.2.

Let $\Phi(P) := \alpha\langle G, |P|\rangle$. In generic form, Step 8 of ADMM can be written as $\operatorname{prox}_{\Phi/\rho}(\bar{P})$ for some $\bar{P} \in \mathbb{S}^n$ and $\rho > 0$, which can be computed as follows:

$$\big(\operatorname{prox}_{\Phi/\rho}(\bar{P})\big)_{ij} = \operatorname{sgn}(\bar{P}_{ij})\,\max\Big\{|\bar{P}_{ij}| - \tfrac{\alpha}{\rho}G_{ij},\, 0\Big\}, \quad \forall (i,j) \in I \times I \text{ s.t. } i \neq j, \tag{7a}$$
$$\big(\operatorname{prox}_{\Phi/\rho}(\bar{P})\big)_{ii} = \max\Big\{\bar{P}_{ii} - \tfrac{\alpha}{\rho}G_{ii},\, 0\Big\}, \quad \forall i \in I. \tag{7b}$$
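Putting Lemmas 2.1 and 2.2 together with the residual-based stopping rule gives the following sketch of the STAGE-I ADMM. This is our own reading of the algorithm (we use the splitting $P = Z$ and, for simplicity, a constant penalty $\rho$ rather than the increasing sequence used in the paper's experiments); function and variable names are ours.

```python
import numpy as np

def sps_stage1(S, G, alpha, a=1e-3, b=1e3, rho=1.0, tol=1e-6, max_iter=500):
    """ADMM sketch for Eq. (5):
    min <S,P> - logdet(P) + alpha*<G,|P|>  s.t.  a*I <= P <= b*I."""
    n = S.shape[0]
    Z = np.eye(n)
    U = np.zeros((n, n))
    for _ in range(max_iter):
        # Step 7 (Lemma 2.1): eigen-decompose rho*(Z - U) - S, map and clip.
        lam_bar, Q = np.linalg.eigh(rho * (Z - U) - S)
        lam = (lam_bar + np.sqrt(lam_bar**2 + 4.0 * rho)) / (2.0 * rho)
        P = (Q * np.clip(lam, a, b)) @ Q.T
        # Step 8 (Lemma 2.2): weighted soft-thresholding, Eqs. (7a)-(7b).
        Z_old, B = Z, P + U
        Z = np.sign(B) * np.maximum(np.abs(B) - alpha * G / rho, 0.0)
        np.fill_diagonal(Z, np.maximum(np.diag(B) - alpha * np.diag(G) / rho, 0.0))
        U = U + P - Z                       # dual update
        # Stop when primal and dual residuals are both small.
        if (np.linalg.norm(P - Z, 'fro') < tol
                and rho * np.linalg.norm(Z - Z_old, 'fro') < tol):
            break
    return P
```

Each iteration is dominated by one $n \times n$ eigen-decomposition, consistent with the per-iteration cost discussed for the exact method.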

### 2.2 STAGE-II: Estimation of covariance function parameters

After estimating the precision matrix $\hat{P}$ in the first stage from (5), a least-squares problem is solved in the second stage to fit a parametric covariance function to $\hat{C} := \hat{P}^{-1}$. Although this is a non-convex problem for parametric covariance functions in general, our main result, Theorem 3.2, and empirical evidence from our numerical experiments suggest that the non-convexity of this problem is much less serious than that of the likelihood function in (2). In STAGE-II, we propose to estimate the covariance parameters by solving

$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Theta}\; \big\|C(\theta) - \hat{P}^{-1}\big\|_F^2, \tag{8}$$

where $\theta = (\theta_\rho, \theta_v, \theta_0)$, $\|\cdot\|_F$ denotes the Frobenius norm, and $C(\theta)$ is the parametric covariance matrix corresponding to the locations of the training data $\mathcal{D}$; $\theta_\rho$ denotes the spatial parameters of the correlation function, $\theta_v$ is the variance parameter, and $\theta_0$ is the nugget, which in some applications is set equal to zero. Indeed, $C(\theta) = \theta_v R(\theta_\rho) + \theta_0 I_n$. The two-stage SPS method is summarized in Figure 4.

Solution to the STAGE-II problem. Let $\hat{C} := \hat{P}^{-1}$, where $\hat{P}$ is defined in (5). Consider sequentially solving (8): fixing $\theta_\rho$, the objective in (8) is first minimized over $\theta_v$ and $\theta_0$ in closed form (inner optimization); hence, it can be written as a function of $\theta_\rho$ only. Next, the resulting function is minimized over $\theta_\rho$ (outer optimization), i.e.,

$$\min_{\theta_\rho \in \Theta_\rho}\left\{\min_{\theta_v \geq 0,\, \theta_0 \geq 0}\; \frac{1}{2}\sum_{i,j \in I}\Big(\theta_v\, r(x_i, x_j; \theta_\rho) + \theta_0\, \delta(x_i, x_j) - \hat{C}_{ij}\Big)^2\right\}, \tag{9}$$

where $\delta(x_i, x_j) = 1$ if $x_i = x_j$, and equals 0 otherwise. Consider the inner optimization problem written as follows:

$$f(\theta_\rho; \hat{c}) := \min_{\theta_0 \geq 0,\, \theta_v \geq 0}\; \frac{1}{2}\big\|\theta_v\, r(\theta_\rho) + \theta_0\, d - \hat{c}\big\|^2, \tag{10}$$

where $\|\cdot\|$ denotes the Euclidean norm, and $r(\theta_\rho)$, $d$, and $\hat{c}$ are long vectors in $\mathbb{R}^{n^2}$ obtained by stacking $r(x_i, x_j; \theta_\rho)$, $\delta(x_i, x_j)$, and $\hat{C}_{ij}$, resp., over $i, j \in I$. We write $r(\theta_\rho)$ as $r$ for short when we do not need to emphasize the dependence of $r$ on $\theta_\rho$.

###### Theorem 2.2.

For any given $\theta_\rho \in \Theta_\rho$, the minimization problem in (10) has a unique global optimal solution $(\hat{\theta}_v, \hat{\theta}_0)$ that can be computed as

$$(\hat{\theta}_v, \hat{\theta}_0) = \begin{cases} \big(0,\; d^\top\hat{c}/n\big) & \text{if } r^\top\hat{c} \leq d^\top\hat{c},\\[4pt] \left(\dfrac{r^\top\hat{c} - d^\top\hat{c}}{\|r\|^2 - n},\; \dfrac{d^\top\hat{c}\,\|r\|^2/n - r^\top\hat{c}}{\|r\|^2 - n}\right) & \text{if } d^\top\hat{c} < r^\top\hat{c} < d^\top\hat{c}\,\|r\|^2/n,\\[4pt] \big(r^\top\hat{c}/\|r\|^2,\; 0\big) & \text{if } r^\top\hat{c} \geq d^\top\hat{c}\,\|r\|^2/n. \end{cases}$$
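Theorem 2.2 is the KKT solution of a two-variable nonnegative least-squares problem, using the identities $r^\top d = d^\top d = n$ for the stacked vectors and the fact that $d^\top \hat{c} > 0$ since $\hat{C}$ is positive definite. A direct implementation (our own sketch), cross-checked against a generic NNLS solver:

```python
import numpy as np

def inner_solution(r, d, c_hat):
    """Closed-form minimizer (theta_v, theta_0) of Eq. (10), per Theorem 2.2.
    r, d, c_hat are the stacked n^2-vectors; d has exactly n ones."""
    n = d.sum()
    rc, dc, rr = r @ c_hat, d @ c_hat, r @ r
    if rc <= dc:                      # variance component at its bound
        return 0.0, dc / n
    if rc >= dc * rr / n:             # nugget component at its bound
        return rc / rr, 0.0
    return (rc - dc) / (rr - n), (dc * rr / n - rc) / (rr - n)
```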
###### Corollary 2.1.

In the absence of the nugget parameter, i.e., when $\theta_0 = 0$ is imposed, the problem in (10) has a unique global minimizer equal to $\hat{\theta}_v = \max\{r^\top\hat{c},\, 0\}/\|r\|^2$.

Using Theorem 2.2 or Corollary 2.1, the solution to the inner problem can be computed as a function of the outer optimization variable $\theta_\rho$ in (9). In Lemma 3.1, we show that under certain conditions, the outer problem objective, $f(\theta_\rho; \hat{c})$, is strongly convex in $\theta_\rho$ around the global minimum. Moreover, for isotropic covariance functions, the outer problem can be simplified to a one-dimensional line search over $[0, \bar{\theta}_\rho]$, where $\bar{\theta}_\rho$ is an upper bound on the range parameter. This is illustrated in Figure 5, which displays the STAGE-II objective as a function of $\theta_\rho$, with true GRF parameters for an SE covariance function. The objective is unimodal with global minimum close to the true value. The univariate minimization is performed via bisection; hence, after $k$ iterations, the search interval containing the minimizer has length $2^{-k}\bar{\theta}_\rho$.
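For an isotropic family, STAGE-II thus reduces to the one-dimensional search just described. The sketch below assumes a squared-exponential correlation purely as an example, solves the inner problem with a generic NNLS call (Theorem 2.2 gives it in closed form), and uses a bounded scalar minimizer in place of plain bisection; all names are our own.

```python
import numpy as np
from scipy.optimize import minimize_scalar, nnls

def stage2(X, C_hat, rho_max=10.0):
    """STAGE-II for an isotropic SE correlation r(h) = exp(-(h/theta_rho)^2).
    X: (n, d) locations; C_hat: inverse of the STAGE-I precision estimate."""
    H = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = np.eye(X.shape[0]).ravel()
    c_hat = C_hat.ravel()

    def inner(theta_rho):
        # Inner problem (10): nonnegative LS in (theta_v, theta_0).
        r = np.exp(-(H / theta_rho) ** 2).ravel()
        sol, resid = nnls(np.column_stack([r, d]), c_hat)
        return sol, 0.5 * resid**2

    res = minimize_scalar(lambda t: inner(t)[1],
                          bounds=(1e-3, rho_max), method='bounded')
    (theta_v, theta_0), _ = inner(res.x)
    return res.x, theta_v, theta_0
```

When $\hat{C}$ is exactly of the assumed parametric form, the search recovers the generating parameters, which mirrors the "zooming-in" behavior described earlier.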

## 3 Statistical analysis of the SPS estimator

In this section, we focus on the statistical convergence of the parameter estimates obtained by the SPS algorithm displayed in Figure 4. Theorem 2.1 shows that given $\{y^{(r)}\}_{r=1}^N$, the SPS estimator of the precision matrix, $\hat{P}$ defined in (5), can be computed very efficiently. Throughout this section, we assume that non-trivial bounds $0 < a^* \leq b^* < \infty$ are given. These bounds are useful in practice to control the condition number of the estimator.

###### Theorem 3.1.

Let $\{y^{(r)}\}_{r=1}^N$ be independent realizations of a GRF with zero mean and stationary covariance function, observed over $n$ distinct locations $\{x_i\}_{i \in I}$; furthermore, let $P^*$ be the corresponding true precision matrix for these observations. Finally, let $\hat{P}$ be the SPS estimator computed as in (5) for $G$ chosen as in (4). Then for any given and