Scalar-on-Image Regression via the Soft-Thresholded Gaussian Process
The focus of this work is on spatial variable selection for scalar-on-image regression. We propose a new class of Bayesian nonparametric models, soft-thresholded Gaussian processes and develop the efficient posterior computation algorithms. Theoretically, soft-thresholded Gaussian processes provide large prior support for the spatially varying coefficients that enjoy piecewise smoothness, sparsity and continuity, characterizing the important features of imaging data. Also, under some mild regularity conditions, the soft-thresholded Gaussian process leads to the posterior consistency for both parameter estimation and variable selection for scalar-on-image regression, even when the number of true predictors is larger than the sample size. The proposed method is illustrated via simulations, compared numerically with existing alternatives and applied to Electroencephalography (EEG) study of alcoholism.
Keywords: spatial variable selection, EEG, posterior consistency, Gaussian processes.
Scalar-on-image regression has attracted considerable attention recently in both frequentist and Bayesian literature. This problem is challenging for several reasons such as: 1) the predictor is very high dimensional (two dimensional or three dimensional image), often larger than the sample size, 2) the observed predictor may be contaminated with noise and the true predictor signal may exhibit complex correlation structure, and 3) most components of the predictor may have no effect on the response, and when they have an effect it may vary smoothly.
Regularized regression techniques are usually needed when the dimension of the predictor is much higher relative to the sample size; lasso (Tibshirani 1996) is a popular method for variable selection by employing a penalty based on the sum of the absolute values of the regression coefficients. However most approaches do not accommodate predictors with ordered components such as in the case of predictor images. One exception is the fused lasso, which generalizes the lasso by penalizing both the coefficients and their successive differences, thus ensuring both sparsity and smoothness of the effect. To incorporate spatial dependence structure of the predictors, Reiss and Ogden (2010) extended the functional principal component regression originally proposed for one dimensional functional covariates to high dimensional predictors. They modeled the coefficient function by using B-spline functions and considered common smoothing spline penalty which is not sensitive to sharp edges and jumps. Recently, Wang and Zhu (2015) proposed a new type of penalty - based on the total variation of the function - which yields piecewise smooth regression coefficients. While these approaches are computationally efficient, none of them can fully take into account the spatial dependence of the image predictor. In addition, in this framework it is not clear how to assess statistical significance.
To overcome some of these limitations, this problem has been also approached form a Bayesian viewpoint. Goldsmith et al. (2014) proposed two latent spatial processes to model the sparsity and the smoothness of the regression coefficient: specifically an Ising prior was used for the binary indicator variable that controls whether a voxel image is predictive of the response or not (sparsity), and a conditional autoregressive Gaussian process for the non-zero regression coefficients to improve the model prediction (smoothness). The use of Ising prior for the binary indicator was first discussed in Smith and Fahrmeir (2007) in the context of high dimensional predictors and was also recently exploited by Li et al. (2015) who proposed it jointly with a Dirichlet process prior. To address the computational challenge of a non-closed form for the probability function, the latter work proposed an analytical approach to derive bounds for the hyperparameters. One of the characteristics of both Li et al. (2015) and Goldsmith et al. (2014) is that the sparsity and smoothness are controlled separately by two different spatial processes. As a result, the transition from zero-areas to non-zero neighboring areas in the regression coefficient may be very abrupt. This does not seem realistic for entire brain regions, where it is expected to see a gradual effect in contain brain areas on the response.
We propose a novel approach to spatial variable selection in the scalar-on-image regression by modeling the regression coefficients through a soft-thresholding transformation of latent Gaussian processes, to which we refer as soft-thresholded Gaussian processes. The soft-thresholding function is well known as the solution for the lasso estimate when the design matrix is orthonormal (Tibshirani 1996). The soft-thresholded Gaussian process leads to different model properties than the existent literature: in particular it ensures a gradual transition between the zero and non-zero effects of the neighboring locations. Theoretically, we can show that it provides a large support for the spatially varying coefficient function in the model that enjoys piecewise smoothness, sparsity and continuous properties. The idea is inspired from Boehm Vock et al. (2014) who considered it as a regularization technique for spatial variable selection. This approach does not assign prior probability mass at zero for regression coefficients and it is not designed for the scalar-on-image regression. The use of the soft-thresholded Gaussian process has an attractive computational advantage over the competing methods, where the use of Ising prior makes it impossible to have a closed form probability distribution function making the computations challenging. In particular, we consider a low-rank spatial model for the latent process, which is important for the scalability of the method to large datasets. For theoretical results, in addition to the large support, we also can show that the soft-thresholded Gaussian process leads to the posterior consistency for both parameter estimation and variable selection for scalar-on-image regression. That is, the posterior distribution of the spatially varying coefficient function concentrates in a small neighborhood of the true value and the its sign is also consistent with the true value with probability one as the number of subjects goes to infinity. These two results only need a few mild regularity conditions; the conclusions hold even when the number of true predictors is larger than the sample size.
The proposed method is introduced for the case of single image predictor and Gaussian responses for simplicity. Nevertheless extensions to accommodate other type of covariates through a linear effect as well as generalized responses are straightforward. The methods are applied to the data from an electroencephalography (EEG) study of alcoholism (http://kdd.ics.uci.edu/datasets/eeg/eeg.data.html), where of interest was to study the relation between the alcoholism status and the electrical brain activity over time. The data have been previously described in Li et al. (2010) and Zhou and Li (2014) and consist of EEG signals received from 64 channel of electrodes located on subjects’ scalp, corresponding to alcoholic subjects and healthy controls. The EEG signals are recorded for 256 seconds; leading to a high-dimensional predictor. Previous literature that analyzed these data ignored the spatial locations of the electrodes on the scalp, and thus considered a two-dimensional predictor. In contrast, we recover the locations of the electrodes from the standard electrode position nomenclature described by Fig. 1 of https://www.acns.org/pdf/guidelines/Guideline-5.pdf, as shown in Fig. 1. We study the same scientific question by accounting for the space-temporal dependence of the predictor.
2.1 Scalar-on-image regression
Let be a -dimensional vector space of real values for any integer . Suppose there are subjects in the dataset. For each subject , we collect a scalar response variable, , a set of spatially distributed imaging predictors, denoted and other scalar covariates , where denotes the image intensity values measured at location , for . Write which is an subset of a compact closed region . Let be a normal distribution with mean and covariance (or variance for one dimensional case). We consider the following scalar-on-image regression model: for ,
where is the coefficient for the scalar covariates and is a spatially varying coefficient function defined on for imaging predictor . Assume that are fixed design covariates and collects all fixed spatial locations. In practice, the normalizing scalar can be absorbed into imaging predictors; its role is to rescale the total effects of massive imaging predictors such that they are bounded away from infinity with a large probability when is very large. Scientifically, imaging predictors take values that measure the body tissue contrast or the neural activities at each spatial location and the number of imaging predictors, is determined by the resolution of the image. Thus, the total effects of imaging predictors reflect the total amount of the intensity in the brain signals, which should not increase to infinity as the image resolution increases. In model (1), the response is taken to be Gaussian and only one type of imaging predictor is included, although extensions to non-Gaussian responses and multi-modality imaging predictor regression are straightforward.
2.2 Soft-thresholded Gaussian Processes
In order to capture the characteristics of imaging predictors and their effects on the response variable, the prior for the covariate function should be sparse and spatial. That is, we assume that many locations have ; the sites with non-zero coefficients cluster spatially, and that the coefficients vary smoothly in clusters of non-zero coefficients. To encode these desired properties into the prior, we represent as a transformation of a Gaussian process, , where is the transformation function dependent on parameter and follows a Gaussian process prior. In this transkriging (Cressie 1993) or Gaussian copula (Nelsen 1999) model, the function determines the marginal distribution of , while the covariance of the latent determines ’s dependence structure.
Spatial dependence is determined by the prior for . We assume that is a Gaussian process with zero-mean and stationary covariance function for some covariance function . Although other transformations are possible (Boehm Vock et al. 2014), we select to be the soft-thresholding function to map near zero to exact zero and thus give a sparse prior. Let
where is the sign of , i.e. if and if and . The thresholding parameter determines the degree of sparsity. This soft-thresholded Gaussian process prior is denoted .
3 Theoretical Properties
In this section, we examine the theoretical properties of soft thresholded Gaussian processes as prior models for scalar-on-image regression. We first introduce the formal definition for the class of the true spatially varying coefficients in the model that can well characterize the effects of the imaging predictors on the response variable. In light of good properties of the soft thresholding function (Lemmas 1 – 2), we show that the soft thresholded Gaussian processes have large support for the true spatially varying coefficient functions in the class we define (Theorem 1). Then we can verify the prior positivity of neighborhoods (Lemma 3) and construct uniform consistent tests (Lemma 5) which are needed to define a sieve of spatially varying coefficient functions and find the upper bound of the tail probability (Lemmas A.6–A.4), and verify that the model is identifiable (Lemmas A.7–A.10) under certain regularity conditions. Thus, following the theory developed by Choudhuri et al. (2004), we show the posterior consistency (Theorem 2). Given the smoothness and sparsity of the soft-thresholded Gaussian process, we can further show the posterior sign consistency for the sparse spatially varying coefficient function (Theorem 3), indicating the posterior spatial variable selection consistency.
3.1 Notations and Definitions
We start with additional notations for the theoretical development and the formal definitions of the class of spatially varying coefficient functions under consideration. We assume that all the random variables and stochastic processes that we introduced in this article are defined in the same probability space, denoted . Recall that represents the -dimensional vector space of real values. Let represent the -dimensional vector space of non-negative integers. For any vector , let be the norm for vector for any , and be the supremum norm. For any , let be the smallest integer larger than and be the largest integer smaller than . Define the event indicator with if event occurs, otherwise. For any , define and define .
Let be defined in the set for , and let be a non-negative integer. We say is a differentiable function of order , if has partial derivatives
where , and .
Denote by a set of differentiable functions of order defined on . For any , define the norm for any and the supremum norm is . The reminder has the following property. Given any point of and any , there is a such that if and are any two points of with and , then . If , then .
Denote by and the closure and the boundary of any set . Define a collection of functions that satisfy the following properties. That is, there exist two disjoint non-empty open sets and with such that
3.2 Conditions for Theoretical Results
In this section, we list all the conditions that are needed to facilitate the theoretical results, although they may not be the weakest conditions.
There exists , , , and some , and such that for all , .
This condition implies that the number of imaging predictors should be of polynomial order of sample size. The lower bound indicates that needs to be sufficiently large such that the posterior distribution of the spatially varying coefficient function concentrates around the true value.
The true spatially varying coefficient function in model (1) enjoys the piecewise smoothness, sparsity and continuity properties, in short, .
The next two conditions summarize constraints on the spatial locations and the distribution of the imaging predictors.
When is a hypercube in , e.g. , there exists a set of that equally partitions . Then .
The covariate variables are independent realizations of a stochastic process at spatial locations . The stochastic process satisfies
Condition 4 includes assumptions on the mean of and on the range of eigenvalues of the covariance matrix for covariate variables. If Gaussian process on has zero mean and , , if and , where are chosen as the centers of the equal space partitions of , then condition (4.2) holds. Furthermore, condition (4.3) also holds. Specifically, for any , taking , for any , let By Condition (4.2),
There exists , such that for all , Thus, . Furthermore,
To ensure the large support property, we need the following condition on the kernel function of the Gaussian process. This condition also has been used previously by Ghosal and Roy (2006).
For every fixed , the covariance kernel has continuous partial derivatives up to order .
3.3 Large Support
One of the desired properties for the Bayesian nonparametric model is to have prior support over a large class of functions. In this section, we show that the support of the soft-thresholded Gaussian process is large for any spatially varying coefficient function of our interests in the scalar-on-image regression. We begin with two appealing properties of the soft-thresholding function in the following two lemmas.
The soft-thresholding function is Lipschitz continuous for any , that is, for all ,
For any function , there exists a threshold parameter and a smooth function such that
The proof of lemma 1 is straightforward by verifying the definition. The proof of lemma 2 is not trivial, it requires a detailed construction on the smooth function . Please refer to the Appendix for details.
(Large Support) For a function , there exists a thresholding parameter , such that the soft thresholded Gaussian process prior satisfies
Theorem 1 implies that there is always a positive probability that the soft-thresholded Gaussian process concentrates in an arbitrarily small neighborhood of any spatially varying coefficient function that has piecewise smoothness, sparsity and continuity properties.
3.4 Posterior Consistency
For , given the image predictor on a set of spatial locations and other covariates , suppose the response is generated from the scalar-on-image regression model (1) with parameters , that are known and parameter of interest . The assumptions about and are for theoretical convenience; in practice it is straightforward to estimate from the data. We assign a soft-thresholded Gaussian process prior for the spatially varying coefficient function, i.e. for some known and covariance kernel . In light of the large support by Theorem 1, the following lemma shows the positivity of prior neighborhoods:
We construct sieves for the spatially varying coefficient functions in as
Suppose with and the kernel function satisfies condition 5, then there exist constants and such that for all ,
(Uniform consistent tests) For any and , there exist , , and such that for all and all , if , a test function can be constructed such that
Proofs of Lemmas 3–5 are provided in the online supplementary materials. These lemmas verify three important conditions for proving posterior consistency in the scalar-on-image regression based on Theorem A.1 by Choudhuri et al. (2004). Thus, we have the following theorem:
(Posterior Consistency) Write data . If Conditions 1 – 5 hold, then for any ,
as in probability, where denotes the actual distribution of data .
Theorem 2 implies that the soft-thresholded Gaussian process prior can ensure that the posterior distribution of the spatially varying coefficient function concentrates in an arbitrarily small neighborhood of the true value, when both the number of subjects and number of spatial locations are sufficiently large. Given that the true function of interest is piecewise smooth, sparse and continuous, the soft-threshold Gaussian process prior can further ensure that the posterior probability of the sign of the spatially varying coefficient function being correct converges to one as the sample size goes to infinity. The result is formally stated in the following theorem.
(Posterior Sign Consistency) Suppose the model assumptions, prior settings and regularity conditions for Theorem 2 hold.
as in probability.
This theorem establishes the spatial variable selection consistency. It does not require the number of true imaging predictors is finite or less than the sample size. This is different from most previous results, but it is reasonable in that the true spatially varying coefficient function is piecewise smooth and continuous and the soft-thresholded Gaussian process will borrow strength from neighboring locations to estimate the true imaging predictors. Please refer to the Appendix for the proofs of Theorems 2 and 3.
4 Posterior Computation
4.1 Model Representation and Prior Specifications
We turn now to the practical applicability of our proposed method. We select a low-rank spatial model to ensure that computation remains possible for applications with large datasets. We exploit the kernel convolution approximation of a spatial Gaussian process. As discussed in Higdon et al. (1999), any stationary Gaussian process can be written , where is a kernel function and is a white-noise process with mean zero and variance . This gives covariance function
which illustrates the connection between covariance and kernel . This representation suggests the approximation for the latent Gaussian process
where are a grid of spatial knots covering , is a local kernel function, and is the coefficient associated with knot . We use tapered Gaussian kernels with bandwidth ,
so that for separated from by at least . Taking knots and selecting compact kernels both lead to computational savings, as discussed in Section 4.2.
The compact kernels control the local spatial structure and the prior for the coefficients controls broad spatial structure. Following the work by Nychka et al. (2015) for geostatistical data, we assume that the knots are arranged on an array, and use to denote that knots and are adjacent on this array. We then use a conditionally autoregressive prior (Gelfand et al. 2010) for the kernel coefficients. The conditional autoregressive prior is also defined locally, with full conditional distribution
where is the number of knots adjacent to knot , determines the strength of spatial dependence, and determines the variance. These full conditional distributions correspond to the joint distribution N, where is diagonal with diagonal elements and is the adjacency matrix with element equal 1 if and zero otherwise (including zeros on the diagonal).
Write . Denote by the kernel matrix with element , the prior for is given by . In this case, the do not have the equal variance, which may not generally be desirable. Non-constant variance arises because the kernel knots may be unequally distributed, and because the conditional autoregressive model is non-stationary in that the variances of the are unequal.
To stabilize the prior variance, define and as the corresponding matrix of standardized kernel coefficients, where are constants chosen so that the prior variance for each is equal. We take to be the -th diagonal element of , hence the kernel functions now depend on . By pulling the prior standard deviation out of the thresholding transformation, write , we have an equivalent model representation of model (1) as
where . After standardization the prior variance of each is one, and therefore the prior probability that is nonzero is for all , where denotes the cumulative distribution function of standard normal distribution. This endows each parameter with a distinct interpretation: controls the scale of the non-zero coefficients; controls the prior degree of sparsity; and controls spatial dependence.
In practice, we normalize the response and covariates, and then select priors , , , , and . Following Banerjee et al. (2004), we use a beta prior for with mean near one because only values near one provide appreciable spatial dependence. In many of the cases considered in the simulation studies, the sparsity parameter cannot be fully identified. To improve numerical stability, we suggest an informative data-driven prior. We first fit the non-sparse model with and record the proportion of the with posterior 95% credible interval that exclude zero, denoted . The prior for then restricts the prior proportion of non-zeros to be within 0.05 of , i.e., and .
4.2 Markov chain Monte Carlo Algorithm
We sample from the posterior distribution using Metropolis-Hastings within Gibbs sampling. The parameters , , and have conjugate full conditional distributions and are updated using Gibbs sampling. The spatial dependence parameter is sampled using Metropolis-Hastings sampling using a beta candidate distribution with the current value as mean and standard deviation tuned to give acceptance around 0.4. The threshold is updated using Metropolis sampling with random-walk Gaussian candidate distribution with standard deviation tuned to have acceptance probability around 0.4. The Metropolis update for uses the prior full conditional distribution in (3) as the candidate distribution which gives high acceptance rate and thus good mixing without tuning.
5 Simulation study
5.1 Data generation
In this section we conduct a simulation study to compare the proposed methods with other popular methods for scaler-on-image regression. For each simulated observation, we generate a two-dimensional image on the grid with . The covariates are generated following two covariance structures: exponential (“Exp”) and with shared structure (“SS”) with the signal, . The exponential covariates are Gaussian with mean and cov, where is the distance between locations and and controls the range of spatial dependence. The covariates generated with shared structure with are , where is Gaussian with exponential covariance with and ; this is denoted as “SS()”. The response is then generated as N. Both and are independent for . We consider two true images (“Five peaks” and “Triangle”, plotted in Figure 2), sample sizes , spatial correlation , and error standard deviation . For all combinations of these parameters considered we generate datasets.
We fit our model with a equally-spaced grid of knots covering with bandwidth set to the minimum distance between knots. We fit the model both with and thus sparsity (“STGP”) and with and thus no sparsity (“GP”). For both models, we run the proposed Markov chain Monte Carlo algorithm 5,000 iterations with 1,000 burn-in, and compute the posterior mean of . For the sparse model, we compute the posterior probability of a nonzero .
The lasso estimate is computed using the lars package (Hastie and Efron 2013) in R (R Core Team 2013) and the tuning parameter is selected using the Bayesian information criteria. The fused lasso estimate is computed using the genlasso package (Arnold and Tibshirani 2014) in R and the tuning parameters and are selected using the Bayesian information criteria. Due to computational considerations, we search only over .
We also compare with a functional principle components analysis approach (“FPCA”). We smooth each image using the technique of Xiao et al. (2013) implemented in the fbps function in R’s refund package (Crainiceanu et al. 2014), compute the eigen decomposition of the sample covariance of the smoothed images, and then perform principal components regression using the lasso penalty tuned via the Bayesian information criteria. We use the leading eigenvectors that explain 90% of the variation in the sample images.
Finally, we compare with the Bayesian spatial model of Goldsmith et al. (2014) (“Ising”). Goldsmith et al. (2014) use the model , where is the binary indicator that location is included in the model, and is the regression coefficient given that the location is included. Both the and have spatial priors; the continuous components follow a conditional autoregressive prior, and the binary components follow an Ising (autologistic) prior (Gelfand et al. 2010) with full conditional distributions
Estimating and is challenging because of the complexity of the Ising model (Møller et al. 2006), therefore Goldsmith et al. (2014) recommend selecting and using cross validation over and . Due to computational limitations we select values in the middle of these intervals and set and . Similar to our approach, 5,000 Markov chain Monte Carlo samples are simulated for the Ising model, and the first 1,000 are discarded as burn-in, and the posterior mean of and the posterior probability of a nonzero are computed.
Table 1 gives the mean squared error for estimation (averaged over location), type I error and power for detect non-zero signals along with computing time. The soft-thresholded Gaussian process (STGP) model gives the smallest mean squared error when the covariate has exponential correlation. Compared to the Gaussian process (GP) model, adding thresholding reduces mean squared error by roughly 50% in many cases. As expected the functional principal component analysis (FPCA) methods gives the smallest mean squared error in final two scenarios where the covariates are generated to have a similar spatial pattern as the true signal. Even in this case, the proposed method outperforms the other methods that do not exploit this shared structure. For variable selection results, we only compare the proposed method with Fused lasso and the Ising model for a fair comparison, because Lasso does not incorporate spatial locations and other methods do not perform variable selection directly. The results show that Fused lasso has much larger Type I errors in all cases and the Ising model has a very small power to detect the signal in each case. It is clear that the proposed method is much better than Fused lasso and the Ising model for variable selection accuracy. For the computing time, the proposed method is comparable to Fused lasso and faster than the Ising model.
(a) MSE (multiplied by 1000) for
|True||cov ()||Lasso||Fused lasso||FPCA||Ising||GP||STGP|
(b) Type I error (%)
(c) Power (%)
(d) Computing time (minutes)
|True||cov ()||Lasso||Fused lasso||FPCA||Ising||GP||STGP|