Spatial Clustering of Curves with Functional Covariates: A Bayesian Partitioning Model with Application to Spectra Radiance in Climate Study

Spatial Clustering of Curves with Functional Covariates: A Bayesian Partitioning Model with Application to Spectra Radiance in Climate Study

Zhen Zhang, Chae Young Lim, Tapabrata Maiti and Seiji Kato
Department of Statistics, University of Chicago, Chicago, Illinois, USA
Department of Statistics, Seoul National University, Seoul, South Korea
Department of Statistics & Probability
Michigan State University, East Lansing, Michigan, USA
Science Directorate, NASA Langley Research Center, Hampton, Virginia, USA
Email: zhangz19@galton.uchicago.edu
Abstract

In climate change study, the infrared spectral signatures of climate change have recently been conceptually adopted, and widely applied to identifying and attributing atmospheric composition change. We propose a Bayesian hierarchical model for spatial clustering of the high-dimensional functional data based on the effects of functional covariates and local features. We couple the functional mixed-effects model with a generalized spatial partitioning method for: (1) producing spatially contiguous clusters for the high-dimensional spatio-functional data; (2) improving the computational efficiency via parallel computing over subregions or multi-level partitions; and (3) capturing the near-boundary ambiguity and uncertainty for data-driven partitions. We propose a generalized partitioning method which puts less constraints on the shape of spatial clusters. Dimension reduction in the parameter space is also achieved via Bayesian wavelets to alleviate the increasing model complexity introduced by clusters. The model well captures the regional effects of the atmospheric and cloud properties on the spectral radiance measurements. The results elaborate the importance of exploiting spatially contiguous partitions for identifying regional effects and small-scale variability.

Keywords: Spatial clustering; Bayesian wavelets; Voronoi Tessellation; Functional covariates; High-dimensional data; Parallel computing; Spectral radiance.

1 Introduction

1.1 Spectral radiance change studies

The infrared spectral signatures of climate change have been conceptually adopted, and widely applied to identifying and attributing atmospheric composition change. The spectrally resolved radiances of earth thermal emission spectra have proven to be beneficial for climate change detection, which is essentially a comparison between unforced radiance variability and anthropogenic trend signals. Active research is ongoing to analyze how the radiances in different spectral bands vary under different controls of geophysical variables such as atmospheric and cloud properties (temperature, tropospheric water vapor, cloud height and fraction, optical thickness and particle size, etc.) to examine whether a detectable trend against natural variability exists (e.g., Huang and Ramaswamy, 2009; Kato et al., 2011, 2014). The spectral signatures associated with these factors can be uniquely determined by retrieving the cloud properties and atmospheric composition change. Therefore, it is very important to quantitatively study the spectral radiance data from multiple perspectives which potentially cast light on climate change.

To understand the effects of small-scale variability on atmospheric temperature, humidity, and cloud property change detection, Kato et al. (2011) derived the cloud fields for monthly zonal mean spectral radiances from Moderate Resolution Imaging Spectroradiometer (MODIS) spectral radiance observations. Two year’s worth of MODIS-derived cloud fields from January 2003 through December 2004 are used as a control run. Top-of-the-atmosphere (TOA) longwave nadir-view spectra from to cm are computed with a cm resolution. Detailed procedures for computing the spectral radiances are described in Kato et al. (2011) and the references provided therein. The control run hence includes longwave spectra over these wavenumbers from to cm at a total of 18 latitude zones from to and the months between 2003 and 2004. In addition to the control run, Kato et al. (2011) independently perturbed cloud and atmospheric properties for computing the TOA spectral radiance as perturbed runs, with uniform perturbed amounts at all latitudes. The atmospheric and cloud properties (perturbed amount) are: Temperature near-surface and skin ( K), at surface–200-hPa ( K) and at 200–10-hPa ( K); Water vapor at surface–500-hPa () and 500–200-hPa (); Cloud-top height at low-level ( km), midlevel ( km), and high-level ( km); Cloud fraction at low-level (), midlevel (), and high-level (); Thin ice cloud () optical thickness () and Ice cloud particle size (m); Water cloud optical thickness () and Water cloud particle size (m). Kato et al. (2011) studied the monthly zonal mean spectral radiance changes due to perturbations by differencing the control run and individual perturbed run spectral radiances, and averaging the zonal mean spectral radiances weighted by their area for the global mean radiance. The shapes of the spectral radiance changes from some perturbations, though individually calculated, were closely connected and functionally similar due to the relationship in certain spectral regions between these perturbed atmospheric properties. Kato et al. (2011) further quantified the relative difference between the spectral radiance change under simultaneous and individual perturbations using the zonal mean root-mean-square (RMS) over the wavenumber region from to cm. The results suggest that TOA spectral radiance change can be expressed approximately by a linear combination of the radiance changes due to individual perturbation. However, the spectral difference is found to be large particularly in the presence of the cloud fraction change. All these facts purport the joint modeling of these perturbed runs to comprehensively study the effects on the TOA radiance or radiance change, which may vary across the spectral regions over wavenumbers. Furthermore, such effects can potentially be piecewise linear across the 18 latitude zones and months due to the geographical and seasonal impact. A global linearity assumption may obscure such intra-region dissimilarity.

The earlier study by Kato et al. (2011) treated the spectral radiance difference between the control runs in 2003 and 2004 as the observed response variable in their linear regression model, for retrieving the cloud property differences between the two years. Kato et al. (2014) defined the spectral difference as the anomaly from the long-term average spectral radiance, for retrieving atmospheric and cloud property anomalies from spatially and temporally averaged spectral radiance. In this study, we use the spectral radiance from the control run in Kato et al. (2011) as the functional response, and treat the spectral radiance from the perturbed runs with different cloud and atmospheric properties as the functional covariates, to investigate the piecewise linearity of their effects over space and the months, in order to study local features and grouped patterns. Subsequently, we regularize the sampling points with observed functional data that comprise the spectral radiances over the wavenumbers onto a -D regular lattice system, with the 18 latitude zones and the months from January 2003 to December 2004 as the coordinates. For the regularized spatio-functional data, we seek for meaningful partitions of the spatial locations according to regional effects of the spectra radiances from the perturbed runs on the radiance from control run. The clusters under the partition are expected to be spatially contiguous for direct interpretation as subregions, and more importantly for capturing the local small-scale variability with spatial dependence. Moreover, we target on not only identifying the heterogeneity of the effects among latitude zones and months, but also quantifying the significance and strength of such effects that can vary across the spectral regions of the wavenumbers. Finally, we intend to provide the uncertainty associated with such partitions, in particular, the variability among different partitions and the boundary ambiguity in the label assignment. We propose a model-based approach for answering these questions in a unified faction.

1.2 Related Work

The spectra radiance data in this analysis exhibit an intrinsic high-dimensionality. A well-known issue in partitioning large and high-dimensional spatial data sets is the presence of nonstationarity or local stationarity in that the spatial patterns and behaviors of the underlying processes can vary across the study domain. Taking such local dependency into account can potentially improve both model fit and interpretability. One popular Bayesian clustering method is using Dirichlet Process prior (DPP) for horizontally partitioning the observations to identify common features in vertical direction. For example, Ray and Mallick (2006) combined horizontal partitions using DPP and vertical partitions using Bayesian wavelets for curve clustering with application to spatial data; Dass et al. (2015) conceptually extended the feature vector for unsupervised learning to feature function, and elicited a functional DPP for simultaneously detecting change points for multiple curves over a spatial domain; among others. Nevertheless, these DPP-based approaches mentioned above do not guarantee that the resulting clusters are spatially contiguous. Whereas for spatial data, the interpretation and capturing local stationarity can be greatly improved by seeking for spatially contiguous clusters, with natural interpretation as subregions. One convenient tool of introducing such spatially connected clusters is the treed Gaussian processes (Gramacy and Lee, 2008) under Bayesian framework to stochastically search for treed partitions (Denison et al., 1998) which typically produce rectangle-shaped clusters. Konomi et al. (2014a) adopted treed partition for multivariate Gaussian precesses which can potentially be applied to curve clustering. However, the number of parameters can rapidly grow when more clusters are introduced, and the penalization and dimension reduction in parameter space are not fully explored. Another convenient way of defining a spatial partition is using Voronoi tessellations (VT) by introducing a set of centers (nuclei) and certain rule for label assignment. Okabe et al. (2000) provided a comprehensive review of VT-type partitions with application in modeling spatial process. Knorr-Held and Raßer (2000) introduced a Bayesian clustering prior model based on VT for disease mapping to identify risk by regions. Kim et al. (2005) considered piecewise stationary Gaussian processes based on VT-based partition to address nonstationarity issues. Zhang et al. (2014) developed a VT-based Bayesian hierarchical model for identifying spatial extreme in cancer disease. Castro et al. (2015) adopted VT-based Bayesian clustering model to estimate the neighborhood matrix for modeling areal count data. Feng et al. (2016) compared the performance of the VT-based spatial clustering with traditional approaches such as Poisson-CAR model for spatial count data, and demonstrated the merit of VT-clustering using several simulation studies and real data examples. VT-type partition is also useful in non-Bayesian, non-parametric procedure as a convenient tool for defining subregions to extract local representatives for post-hoc clustering (see, for example, Secchi et al., 2013). A brief comparison between treed partition and VT-based partition can be found in Gramacy and Lee (2008) and Zhang et al. (2014). In general, treed partition seeks for axis-aligned splitting planes. VT-type partition puts less constraints on the shape of clusters and is suitable for identifying irregularly shaped patterns such as hot spots in disease. Nevertheless, VT-based approach still puts considerable constraints, such as convexity assumption, on the cluster shapes, which limit its use for clustering more complex objects.

The limitations of imposing strong constraints on cluster shapes for both treed and VT partition become even more severe for high-dimensional spatio-functional data. Suppose we seek for a partition of spatial locations, each with measurements. When is quite large comparing to , without dimension reduction, switching a label for a single spatial location involves simultaneously testing hypothesis. Consequently, the constraints on cluster shapes result in a very low chance of accepting the change in label assignment for even a few spatial locations. This causes the well-known local-trap problem (Denison et al., 1998) in that the stochastic model search fails to accept any new clustering configuration, which hinders the thorough exploration for probable clustering configurations. Therefore, an extended VT-type partition with less constraints on cluster shapes is desirable for handling more complex structure in high-dimensional spatial data.

In addition, the cluster-wise mean can be explained by a set of covariates, and the covariates may have varying effects across the space. Sanchez et al. (2015) utilized covariates information for VT-based spatial clustering to identify subregions that share similarity in covariate effects. It is more challenging to investigate functional covariate effects that are potentially varying across spatial clusters for high-dimensional spatial data, since the introduction of new partitions can lead to a rapid growth in the number of parameters. The dimension reduction in the functional representation is desirable. One common technique is the Bayesian wavelets (e.g., Clyde and George, 2000, 2003; Morris and Carroll, 2006) that partition the -observations into multi-resolution, and the important signals are generally centering at the first several levels of resolution. Bayesian wavelets have been employed for identifying the cluster means of the functional responses via the wavelet representation, which introduces sparsity and decorrelation property (e.g., Ray and Mallick, 2006). Nevertheless, the combination of Bayesian wavelets for functional covariates and spatially contiguous VT-type partition for spatial data has not been fully studied.

The main contribution of this work is the development of a spatio-functional clustering (SFC) model with varying functional covariate effects across random subregions, and a generalized Voronoi tessellation (GVT) procedure for partitioning spatial data with less constraints on the cluster shapes. We view clustering as a horizontal partitioning operator on the data with locations, which results in spatially contiguous regions for interpretation and addressing nonstationarity issue for large spatial data; We also view the selection of important wavelet coefficients as a vertical, binary partitioning operator of the data with replicates to handle the high-dimensionality. Dimension reduction is achieved along both directions by introducing a two-component partitioning operator. The vertical partition is nested in the horizontal partition to identify spatial clusters that share identical or similar underlying characteristics.

The remainder of the article is organized as follows. In Section 2, we present a spatio-functional mixed-effects model with random groups. In Section 3, we propose a generalized Voronoi tessellation for partitioning spatial data with less constraints on cluster shapes. We briefly introduce the model implementation and other options in Section 4. In Section 5, we provide the real data application on clustering the infrared spectral measurements in identifying regional and dynamic characteristics in atmospheric composition change, followed by some further discussions in Section 6.

2 A Spatio-Functional Clustering Model

In this section, we present a spatio-functional model with clustering. Suppose for each location in the spatial domain, we have -by- response and a set of -by- covariates for . Let introduce a spatial partition of the domain into contiguous clusters, i.e., with each cluster comprising locations in , hence . Consider the following model for ,

(1)

where is the -by- fixed-effect function for the th covariate in the th cluster, and denotes its Hadamard (element-wise) product with . Note the Hadamard product can be alternatively written as if we let be the block diagonal matrix with each block ; However, requires memory storage rather than , and hence is used purely for our notational convenience. Additionally, note that includes an intercept for , which represents the cluster mean level of the response . The cluster-wise fixed-effect function captures the large-scale variation that can be explained by the functional covariates in addition to the cluster means. The location-specific random-effect function further captures the small-scale variation due to individual location in addition to the cluster mean. It can also measure the local (within-cluster) and global (between-cluster) spatial dependence. is the nugget function that captures the small-scale variation due to the noise.

While the regional effects ’s are considered to capture the cluster-varying signals and local predictive relationship between ’s and , it is immediately seen from (1) that the model complexity arises as a severe issue when the functional mixed-effects model incorporates the clustering structure. For instance, an extra cluster can result in at least additional fixed-effect parameters. To avoid redundancy of parameters, we consider using the Bayesian wavelet smoothing technique for dimension reduction. We also utilize the well-known decorrelation property that yields simple covariance structure in the wavelet domain. More specifically, assuming , we let , where is the -by- discrete wavelet transformation (DWT) matrix that induces an one-to-one map of the original data domain indexed by into wavelet domain that has multi-resolution structure with double-index : indexes the resolution level, and indexes the location at each level . Note that (hence ) corresponds to the scaling function that preserves considerable information in the data domain. More explicitly, let with its th -by- column matrix denoted as , and th -by- row matrix as . Note is an orthogonal matrix with the inverse Wavelet transformation matrix . Each element , i.e.,

where consists of the wavelet basis functions and ’s at level and location for . Note since is not symmetric, otherwise . The dimensions at both sides under this transformation match as . The dimension reduction can be achieved by assuming a large portion of the wavelet coefficients ’s have trivial or rather weak signals towards ’s, and hence can be treated as exactly zero, in particular for larger . Subsequently, we consider the model (1) in the wavelet domain by multiplying at both sides of the equation:

(2)

where , , , , and . Note albeit is diagonal, can be dense, causing aforementioned memory issues. By decorrelation property of the wavelet transformation, we assume for the th cluster, where is a diagonal matrix with diagonal entries to measure the heterogeneity of the noise level at different wavelet basis locations . We fix such that represents the noise level for the scaling function , and is the relative variability at individual level and location for data to the base variability for the th cluster. Note that as long as , the residuals in the data domain are correlated with covariance . It is also convenient to write (2) as

(3)

where is the th column of that corresponds to for ; Its th element, , has the contribution towards . We routinely adopt the spike-and-slab prior for each with structured parameters:

(4)

where is the scaling parameter, or signal-to-noise ratio of versus the noise level at ; is an indicator that if can be treated as exactly zero: when , the prior is a point mass at . The hierarchy of depends on the prior shrinkage probability , which allows Bayesian learning of the shrinkage effects for the extent of dimension reduction. The horizontal partition operator and vertical partition operator that consists of all ’s, jointly determine the number of nonzero elements, or -norm of coefficients. We let be the -norm of . It is possible to consider Ising (auto-logistic) prior (see, e.g., Smith and Fahrmeir, 2007) for ’s in (4), if we further introduce a coarse-level neighborhood structure for the spatial clusters. However, for a relatively small number of clusters, the spatial dependence between cluster-wise parameters can be weak. We alternatively incorporate the Ising-type spatial averaging for the wavelets shrinkage parameters in the proposal densities when creating new clusters.

As remarked by Morris and Carroll (2006), allowing the scaling parameter to also depend on the location introduces extra flexibility and improves the potentiality of capturing signals at finer scales. In the proposed model with clustering, the number of the scaling parameters can, however, proliferate with the flexible assumption. We instead allow the scales ’s and shrinkage levels to vary across clusters for capturing more details via borrowing regional information. This assumption adds more flexibility to that in Ray and Mallick (2006), where the authors assumed homogeneous scaling and shrinkage levels of wavelet functions over clusters. Gramacy and Lee (2008) also considered cluster-specific hyperparameters for the treed Gaussian Processes.

For the spatial random effect function , a common assumption is that the correlation structures within and between ’s are separable (e.g., Morris and Carroll, 2006), such that the wavelet transformation only applies to the within-correlation structure, and the between-curve correlation structure is preserved in the wavelet domain. For multivariate spatial data such as temporal or functional response with treed partitions of the study domain, Konomi et al. (2014b) assumed separable covariance functions for measuring within-unit interactions of the multivariate response and between-unit spatial dependence. Despite bringing considerable advantage in computation, the assumption of separable covariance function has known issues (e.g., Stein, 2005). Moreover, for wavelet coefficients that contain distinct strengths of signals, it is generally not evident that the spatial dependence of the corresponding scale and shrinkage parameters is common. We therefore consider a non-separable dependence structure. For notational convenience, we denote i.e., consists of the first locations. Let inherit the decorrelation assumption in wavelet domain, we consider:

(5)

where we assume the conditional autoregressive (CAR) structure (see, e.g., Besag et al., 1991) for the spatial dependence among ’s for : is the scaling parameter, or detectability of the random effect comparing to the noise level at ; is the neighborhood matrix for cluster , with diagonal entries equal . is a diagonal matrix with each entry being the sum of neighbors. is the parameter that measures the spatial dependence for wavelet location at in . To guarantee the positive-definitness of , has finite support where and are the largest and smallest eigenvalue of , respectively. Note (5) specifies a localized random effect within each cluster, or a random effect nested in the cluster-specific random effect . It is possible to assume a global spatial random effect that further introduces the dependence between curves in different clusters. However, this depends on the computational efficiency and memory storage, since the computational burden explodes as . The localized random effect hence can greatly improve the computational efficiency in particular when is large. Consequently, the support of each spatial parameters depends on the partitions introduced by since and may vary, which adapts the prior knowledge under different partitions. This is another different aspect from the existing approaches on spatial partitioning models which typically assume a global prior for the spatial parameters (e.g., Kim et al., 2005; Gramacy and Lee, 2008).

To summarize, given a random horizontal partition operator that groups data into cluster , and a random vertical binary partition that suggests nontrivial signals for cluster with , we propose a spatio-functional clustering (SFC) model for functional response with corresponding -by- design matrix :

(6)
(7)
(8)
(9)
(10)

This prior specifications in the full Bayesian hierarchical model above introduce a set of hyper-parameters for the wavelets: the signal-to-noise ratio , and the probability of shrinkage , which can be routinely estimated using an empirical Bayes procedure as described by Clyde and George (2000) and Morris and Carroll (2006). However, the empirical Bayesian estimation procedure becomes intractable when the random clustering structure is involved. Furthermore, the clustering results can be sensitive to the choice of these hyperparameters. We thereby pose a further hierarchy to update these parameters, and draw posterior samples for cluster-wise inference. For all Inverse-Gamma prior densities, we choose the hyper-parameters to yield rather dispersed prior density, such as shape and . Note that for the variance term , the choice of and corresponds to the noninformative, scale-invariant Jefferys prior which may affect the expression of the marginal model likelihood. For Beta prior one can choose to yield the uniform prior on for , which does not violate the posterior propriety since has bounded support while the outcomes of ’s form a finite set. The full and marginal model likelihood is discussed in Appendix A.

3 Generalized Voronoi Tessellation for Spatial Clustering

In this section, we discuss the choice of in (6), and couple the proposed SFC model with a generalized spatial partitioning model. Let be a clustering configuration introduced by a Voronoi tessellation (Okabe et al., 2000; Knorr-Held and Raßer, 2000; Kim et al., 2005; Zhang et al., 2014). It is determined by two components, , with the number of clusters, and the label vector that indicates the membership for each location, . Note we have suppressed the implicit dependence of on for notation simplicity. For spatial clustering, we can reduce to where are cluster centers and . The label assignment is determined by the minimal distance criterion, i.e., locations with the minimal distance from center form . Hence we have each cluster . For point reference or irregularly spaced data, one natural choice for is the Euclidean or the great circle distance between the spatial coordinates; For areal or regular lattice data that can be regularized on a graph, for each location , we define the order of a neighbor location as the number of boundaries to cross from location to location , and use it as the distance . Note that the centers that correspond to the minimal distance may not be unique, hence does not uniquely induce a label assignment . A remedy by Knorr-Held and Raßer (2000) is to treat as an ordered set, and when the minimal distance for location corresponds to more than centers including , the procedure consistently assigns if has the smallest index in . The stochastic model search then involves swapping the center index and comparing the overall model likelihoods to check which assignment is more appropriate. However, the locations with minimal distance ties are still simultaneously assigned to either cluster regardless of the individual likelihood. This depicts the limitations of VT-based partition in handling more complex structures.

A frequently adopted clustering prior model (e.g., Knorr-Held and Raßer, 2000) is to define , where is a prior model for the number of clusters , and is a prior model for the cluster centers, . One may specify a minimal size such that for all ’s, especially in the presence of covariates. Consequently is bounded above, say . Consider the truncated geometric prior for with the normalizing constant with tunning parameter . Note that when is larger, is more likely to be smaller, while when it is fixed at a very small value, is almost uniformly distributed over the grids , such that each possible value of receives the same prior weight. Therefore, the hyperparameter allows the flexibility to introduce -type penalty on redundant parameters (see Zhang et al., 2014). Conditioning on , a uniform prior on centers is specified as such that all possible ’s with centers receive equal prior weight. Comparing to the treed partition which typically aligns the boundary of clusters to the axes, the partition induced by Voronoi tessellation puts less constraints on the shape of clusters. However, it still imposes quite strong constraints such as convexity, which limits its use for more generally and irregularly shaped patterns of clusters, and for capturing the boundary ambiguity which is a well-known issue in image segmentation.

Therefore, we extend a clustering model with a generalized Voronoi tessellation (GVT) by introducing the boundary concept, and assuming a prior distribution on the labels of a boundary set. Let where is the number of clusters, is a set of cluster centers which determines a partition of for the spatial domain, and is the label assignment where . We define a boundary set induced by in the following two steps for each location :

  • Let . If its cardinality i.e., location has the minimal distance from at least cluster centers, then location is a boundary point. Additionally, it is referred as a tied point;

  • Next, for some integer , suppose all the neighbors of location with orders , except tied points, form a set . Let be the set of unique labels that the locations in receive. If , then location is a boundary point.

In either case, is called the choice set of the boundary point . Consequently, the boundary set is the collection of all such boundary points. In general, one may pick smaller values to specify rather thin buffer regions near the boundaries to avoid fuzziness that hampers the spatial contiguity and hence the interpretability of clusters. Note for a non-boundary point, or an interior point or , its choice set only contains the cluster label assigned to location , i.e., since .

Next, we specify the clustering model prior as with independent label assignments . For instance, under the uniform prior weights, we have for label . Other choices include using the proportion of members in the neighbor set of that receive label as the prior weight . However, such majority rule can pose a strong assumption against the true structure suggested by the data. We hence stick to the simple, uniform prior weights to avoid additional computational effort.

Spatial clustering techniques based on Voronoi tessellation in previous studies (e.g., Knorr-Held and Raßer, 2000; Kim et al., 2005; Secchi et al., 2013; Zhang et al., 2014) choose the minimal-distance criterion and minimal-index-assignment rule by treating as an ordered set, to restrict for every location for making binary assignment. In our case, holds only for the interior points , and we assume random assignments for the boundary points. The implementation hence allows posterior boundary label correction for given local cluster parameters, and results in more generally shaped clusters to incorporate the boundary ambiguity. The extended clustering model prior can be more concisely presented as and we prefer as described in Appendix B.

We illustrate the flexibility using boundary correction with two examples in Figure 1. Let and we consider the locations on the regular lattice system with two clusters. In Example (a), the top-left panel shows the true boundary and label assignment which potentially describe the dynamic boundary change due to the seasonal effect in the spectra radiance study. Next, the top-right panel shows one example partition using Voronoi tessellation based on either Euclidean distance or shortest path with 4-point neighborhood system. In this case, the regular VT-based partition (e.g., Kim et al., 2005) agrees with treed partition (e.g., Gramacy and Lee, 2008) in producing a boundary that aligns to the axis, and the limitation is immediately seen for the misclassified locations near the boundary, as highlighted with circles. However, with boundary correction, these misclassified locations can still receive the correct label by comparing the likelihoods evaluated at both sides. Take the location with an asterisk for example, its order- neighbors receive different labels with hence it is a boundary point with the label determined by the likelihood. In example (b), the bottom-right panel visualizes a partition using Voronoi tessellation that introduces distance ties in that some locations have the same minimal distance to both cluster centers. In this case, both the regular VT and treed partition typically assign the boundary points simultaneously to either side based on some ordering of the centers. The boundary correction with allows point-wise label correction adjusted by the likelihood for the tied points. Note that one can assume random with certain prior distribution. A large not only results in additional computational effort in correcting larger boundary set, but also potentially hampers the spatial contiguity. Since the spectra radiance data set does not exhibit clear depth in boundary “invasion” and high complexity in shape, we fix , while other similar choices do not severely affect the results from our sensitivity analysis. In addition, the boundary flexibility under the new partitioning operator substantially accelerates the chance of traversing in the model space for the stochastic search.

Figure 1: Two examples of spatial partitions with boundary correction.

4 Model Implementation

The model implementation involves the stochastic search of the horizontally partitioning operator for clustering, with associated vertically partitioning operator for variable selection and the model parameters two layers: (1) higher-layer parameters that measure the scale, dependence, and shrinkage for the wavelet components, and (2) lower-layer parameters that can be routinely integrated out and the resulting marginal likelihood has a tractable form, as given in Appendix A. The estimation of the parameters follows a standard procedure under Bayesian framework using Markov chain Monte Carlo (MCMC) technique for posterior inference. Similar to many previous studies (e.g., Knorr-Held and Raßer, 2000; Feng et al., 2016; Zhang et al., 2014) that utilized the reversible jump MCMC algorithm, the stochastic search for partitions generally involves a split-merge step in transdimensional move for model exploration. In addition, we consider the model exploration in the same dimensional space, which involves the stochastic boundary search and boundary correction under our new partitioning model. The details are given in Appendix B.

It is worth pointing out that Jasra et al. (2007) considered a population-based reversible jump MCMC algorithm to alleviate the local-trap problem in multi-modal model space exploration. The idea is to allow the communications between multiple rj-MCMC runs by assigning a temperature ladder to pave the gap between the marginal likelihoods among chains. The authors also utilized the delayed rejection (Green and Mira, 2001) for accelerating the acceptance rate. On the other hand, Kim et al. (2005) employed the population-based Evolutionary Monte Carlo (EMC) (Liang and Wong, 2001) for searching spatial partitions based on the regular Voronoi tessellation, as an alternative to the rj-MCMC algorithm. Bottolo and Richardson (2010) also developed the EMC sampler for tackling high-dimensional model search issues in Bayesian variable selection, which is attractive to our application for vertical partitioning of wavelet coefficients.

One limitation of the population-based MCMC algorithm is that the communication inner chains can undermine the pure independence which is ideal for parallel computing over MCMC runs. Since the full Gibbs step within each chain can be quite time-consuming in particular for updating the components associated with the fixed-effect functions, we instead adopt two-level parallel computing scheme. Firstly, we parallelize the potentially quite large number of MCMC runs. Secondly, we parallelize the subsequent Gibbs sweep of model parameters for each subregion after sampling the partitioning operator within each MCMC run. The two-level parallelism turns out to substantially facilitate the SFC implementation in this data application. Next, to alleviate the local-trap issue, we start with pilot runs with a large number MCMC chains for recording locally-trapped partitions, and use them as the “anchor” points in the model space for the subsequent full stochastic search that utilizes the delayed rejection algorithm (Green and Mira, 2001). The performance is evaluated by checking if a large portion of distinct sampled partitions are presented in the posterior samples.

5 Application to Spectra Radiance Data

Top-of-the-atmosphere (TOA) longwave nadir-view spectra from to cm are computed with a cm resolution for 18 latitude zones and months from January 2003 to December 2004. The raw data over the observed wavenumbers for the curves, including one control run (top-left panel) and perturbed runs using indicated cloud and atmospheric properties, are shown in Figure 2. The curves have meaningful colors indicating their latitude zones with the central latitude shown on the top of Figure 2. The dark purple color indicates the 24 monthly radiances from the south pole latitude zone ( to ), while the dark blue color indicates the radiances from the north pole latitude zone. On the other hand, the dark red and brown color indicate the radiances from latitude zones near the equator. The response radiances (control run) clearly exhibit spatially contiguous group patterns that motivate this study. The near-pole zones evidently have smaller means and variabilities for the radiance, while the near-equator zones have higher quantities. Nevertheless, the covariates information can be different for the two near-pole zones, and it is therefore important to study the zonal effects rather than simply clustering accordingly to the means.

Figure 2: Spectral radiances over wavenumber (cm). The 15 perturbed runs were multiplied by from the original data, to have similar order of magnitude as the control run.

We apply the proposed SFC model with the spatial partitioning method to the spectral radiance data, by treating the radiances from control run as the response, and the output from the perturbed runs as covariates. On the -d lattice system regularized by month and latitude, We adopt the Manhattan distance, or equivalently the shortest path in this case, between a pair of locations and , which are neighbors if .

The original number of wavenumbers with the total runs can be relatively large and slow down the full implementation of component-wise updating the fixed-effects wavelet coefficients during the MCMC runs. Additionally, the primary goal is to identify the grouping structure. Thus, we consider preprocessing for dimension reduction by systematically sampling wavenumbers from to cm with resolution around cm. Other pre-regularization techniques such as using P-splines or kernel functions can be also considered if the resulting curves are shape-preserving and not over-smoothing the raw curves. The interval is empirically determined by checking the signals for the spectra radiance, and is largely overlapping with the interval ( to cm) in Kato et al. (2011) for calculating the relative difference. The interval between and cm is dropped in this analysis since the corresponding radiance for the control run is quite flat, close to 0. Furthermore, all the functional covariates are multiplied by since the original scales are quite small comparing to the response radiance from the control run. For each of locations, we have wavenumbers for each of the variables. As a consequence, we have a total of data points.

The data for this analysis are shown in Figure 2. The two red vertical lines specify the window including the fraction of data used by Kato et al. (2011). The two blue vertical lines specify the window including the fraction of data used in this analysis. We further perform a systematic subsampling to reduce the number of wavenumbers to be , with a coarser level of resolution of cm to accelerate the horizontal partition search.

After pilot runs for initial partitions and “anchor” points for model exploration, we fit the Bayesian model with MCMC runs with distinct initial values and a total of iterations per chain. At each iteration, the clustering configuration is updated, and then the parameters for each subregion are updated via MATLAB R2014a parallel computing toolbox with 16 cores per chain. The computational benefit with parallel computing depends on the number of subregions , and is generally significantly achieved comparing to the single-thread runs. On average, iterations take approximately an hour to complete. The most computational expensive part is updating the vector of ’s which is conducted sequentially and component-wise due to the wavelet smoothing prior for dimension reduction.

The posterior modal estimate that partitions the response spectral radiance from control run, and the corresponding posterior estimates of the mean curve with predictive bands for each cluster, are shown in Figure 4. Since we regularized the observed spectral radiance into monthlattice 2D lattice system, the map in Figure 4 can be viewed as a dynamic clustering map for the latitude zones over the months in . It is evident to see how the clustering map evolves over months. By incorporating the information from atmospheric compositions as functional covariates, the trends and variabilities of the spectral radiances can be well captured. The posterior mean and the predictive band for cluster-wise mean curve are computed using for all posterior samples that has , and is the number of locations in cluster under . The curves are transformed back to the original data domain for visualization and interpretation.

Figure 3: The maximal a posteriori (MAP) estimate of the clustering configuration.
Figure 4: Cluster-wise summary for over wavenumber (cm).
Figure 3: The maximal a posteriori (MAP) estimate of the clustering configuration.
Figure 5: Posterior summary of ’s under the modal cluster over wavenumber (cm).

The clustering in Figure 4 suggests that the mean levels and variabilities are larger in locations near the equator and have decreasing trends towards to both poles, based on which the global observations are partitioned into clusters. However, the clusters are not rectangles or regular bands that are typical outputs from treed partitions or even Voronoi tessellation on a regular lattice system. The map shows that those clusters with large mean levels and variabilities move towards poles during April and November, which further depicts the relationship between those characteristics and temperature. The flexibility in producing irregularly-shaped, non-convex clusters by introducing the boundaries with random cluster assignments is helpful in capturing such relevant patterns. In addition, three clusters near the equator that have relatively higher mean levels of radiances, are still spatially separable since Cluster 4 has less variability with higher homogeneity within the cluster. Moreover, the mean level and variability near the north pole are higher than the south pole. Moreover, there is one more cluster (Cluster 2) in the southern hemisphere with higher portion of ocean. This is possibly due to that the Antarctic forms a distinct cluster (Cluster 1).

Next, we report the posterior mean and credible intervals for using samples with transformed back to the data domain to detect strong signals. Figure 5 displays only the significant signals with credible intervals that do not span zero. A large portion of for most clusters are exactly 0, as suggested by the flatness and vacant wavenumber regions for the functions in Figure 5. It is evident that different covariates can have distinct impacts for the resulting clusters. For instance, for the temperature group, the near-surface and skin temperature has relatively stronger positive impacts at the starting wavenumber regions (approximately from to cm) for Cluster 3, from latitude S to S, near the boundary of Tropic of Capricorn and Antarctic Circle. While all the 3 temperature factors have strong negative impacts around cm for Cluster 4 near the Equator. The effects become strong, positive for Cluster 1 near the South Pole and Antarctic. This is reflected by Figure 2, which shows the surface-200-hPa and 200-10-hPa temperature have varying signs for curves from the two zones, while they all reach the second peak for the control run near that wavenumber. For Cluster 6 at the Arctic Ocean, the 200-10-hPa temperature also has strong effects near the valley point where the response radiance has quite small variability and hence strong linearity.

Another example is the 500-200-hPa water vapor. It has strong effects near the second peak point in the control run for Cluster 5, the Tropic of Cancer. The cloud-top height and cloud fraction both have remarkably high impacts for multiple latitude zones, though Cluster 6 near the North Pole apparently has relatively high impacts in cloud fraction. In particular, the midlevel and high-level cloud fraction correspond to the initial growth of the radiance for the control run. This finding echoes with Kato et al. (2011) in that the cloud fraction change is especially relevant in determining the behaviors of joint and independent effects, such as nonlinearity correspondence and relative difference. The last row in Figure 2 shows the thickness and particle size have strong effects for Cluster 1 and Cluster 2, the latitude zones near the South Pole and Antarctic Circle. All these facts indicate the importance of our approach of spatial partitioning based on different covariate effects.

To further validate the estimated fixed-effect functions for capturing the large-scale variation, we also compare the results with the least-square (LS) estimates for each wave number, discarding any individual curve effect, serial and spatial correlation. The LS-estimates in general agree with our estimates in Figure 5. Each panel at a selected wavenumber (cm) shows pairs of the response radiance versus 200-10-hPa temperature. The wavenumber and correspond to the two peaks in the response, and wavenumber corresponds to the deep valley where the variability is quite small for the response curves. The solid line indicates a local Least-Square estimate. The characters and colors are consistently employed in presenting the memberships under the posterior clustering configuration. As an example, for wavenumber 313 cm, the local linear regression may suggest a neutral effect, while the relationships under the partition are significant and varying by groups. Furthermore, for 200-10-hPa temperature, the LS-estimates similarly produce a local peak in effect around wavenumber 638 cm, which corresponds to the deep valley point between the two peak points at wavenumber 482 cm and 716 cm of the response radiance in Figure 2. Nevertheless, Figure 6 suggests the local LS-estimates can give quite misleading estimates by discarding the group structure under the posterior clustering configuration from the proposed model. With the partition, the proposed model can evidently give better individual estimates of the cluster-specific effect of 200-10-hPa temperature. Figure 6 shows after separation, at the valley point 638 cm, the near-pole clusters (Cluster 1, 2, and 6) have scatters quite close to an exactly “vertical” line. This precisely explains the remarkably large effects of 200-10-hPa temperature detected at 638 cm in Figure 5 for these clusters. It also explains the opposite strong signals in Figure 5 that suggest the effect function of 200-10-hPa temperature does not vary smoothly for Cluster 6: there is a strong negative effect (the “vertical” line leans towards the left) detected near the strong positive effect (the “vertical” line leans towards the right) at the valley point 638 cm. Similar patterns were found for other covariates, which is consistent with the prevalence of strong signal near the valley point 638 cm in Figure 5 for almost all the covariates, but for different clusters. This finding again demonstrates the importance of identifying regionally varying effects and providing meaningful partitions that can largely reduce the variability.

Figure 6: An illustration of misfitted regression lines without considering the partition.

Moreover, Figure 6 illustrates the merits of spatial contiguity for clusters. For instance, the relationship between 200-10-hPa temperature and the response radiance is apparently the same for Cluster 2 and Cluster 6. While most clustering algorithms do not incorporate the covariates information, a general clustering algorithm that discards the spatial contiguity may assign the two clusters to one, even when considering the covariate effects. While the two clusters are near different poles and have clearly different interpretations. A post-hoc separation may still fail to capture the local features, the boundary information, and the within-cluster spatial dependence.

6 Discussion

In this work, we present a single Bayesian hierarchical model for answering multiple relevant questions regarding to the spectral radiance study in a unified faction. The model has the ability to discover the local impacts in both space and wavenumber regions on the effects of the atmospheric and cloud properties on the spectral radiances from control run. The extension to the spatial partitioning models has the enhanced ability to produce clusters with less constraints to capture the seasonal effects, while maintaining the convenience in sampling the partitions to address several commonly recognized issues including the complexity, near-boundary ambiguity, posterior mixing, and uncertainty associated with the grouping structure of the high-dimensional spectral radiances. The model can also accommodate the spatial dependence with associated nonstationarity, regional characteristics for wavelet shrinkage. Comparing to existing curve clustering methods that can handle spatial data such as Ray and Mallick (2006), Konomi et al. (2014a) and Dass et al. (2015), the proposed SFC model has several attractive features such as producing spatially contiguous clusters, accounting for regional effects of multiple functional covariates, reducing parameters via wavelet shrinkage to alleviate the parameter proliferation issue for product partition models.

From computational perspective, we harness the Bayesian hierarchical model with the two-level parallel computing scheme: between- and within-MCMC parallelism. Due to the developments of high performance computing environments, most independent MCMC runs for Bayesian methods nowadays are conducted in parallel. However, the within-MCMC parallelization with random partitions has not been commonly adopted yet partially due to the challenges in preallocation of computing resources for varying sizes and number of components for parallelization. Our exploration in this case study suggests such parallel computing techniques indeed improve the model implementation and the computational ability with partitions that are even random.

The proposed method can work for data with relatively high dimensionality, in particular, when (in this application and ). However, the speedup is not satisfying since the parallel computation in this study mainly takes advantage of the horizontal, rather than vertical partitions of the data. The major computational challenge lies in the Gibbs sweep of sampling the -parameters. Alternatively, a sophisticated handle of the high-dimensionality in searching vertical binary partitions can be adopted in the broad literature of Bayesian variable selection such as Fast Scan Metropolis-Hastings scheme (FSMH) (Bottolo and Richardson, 2010), Shotgun Stochastic Search (SSS) (Hans et al., 2007), among others. In this application, our primary interest is searching for the horizontal partitions in clustering curves, hence we conduct a vertical reduction in the -functionals that facilitates the search while preserving the shapes.

The application of the proposed SFC model in this spectral radiance change study suggests it is essential to jointly analyze multiple functional covariates that capture the large-scale variation, while modeling the heterogeneity in covariates effects among subregions in the 2-D input space with latitude zone and month. With full posterior inference on parameters that characterize such regional effects, the full model-based SFC model is more than a clustering model of TOA spectral radiance for providing insight into cloud properties and atmospheric composition. Based on the retrieved feature functions and clustering maps, further investigation and identification of atmospheric composition change in climate study using the TOA spectral radiance change become an immediate task.

Appendix A Calculation of the marginal likelihood

The full Bayesian hierarchical SFC model proposed through (6) to (10) includes different layers of parameters:

  1. partitioning parameters , where is horizontal partitioning operator for clustering, and is the vertical partitioning operator for variable selection that is nested in (i.e., it applies to individual cluster);

  2. higher-layer parameters ; and

  3. lower-layer parameters .

The partition of parameters into higher and lower layer is due to the tractability of the integrated likelihood for sampling the partitioning parameters. We further consider integrating the lower-layer parameters . Integrating out the random effect out from (7) with the prior (9) involves an integration of the conditional density , which is with

(11)

The resulting integrated likelihood, with , becomes

(12)

Next, we continue to integrate out from the product of (12) over for cluster , with the prior in (8), this involves integrating the conditional density which is with

(13)

Let , , and for full observations in cluster . As a result, the integrated likelihood with both ’s and marginalized out, involves . Let be the precision matrix, by Sherman-Morrison-Woodbury formula, we have the closed form . Hence the marginal density can be alternatively viewed as

(14)

due to the fact that , and

We continue to integrate out from (14) with the Inverse Gamma prior (10) when . This involves an integration of the conditional density which is

(15)

Consequently, the marginal likelihood under the partitioning operator and parameters at higher level , when integrating out, is an -dimensional student- density which is

(16)

which clearly depends on the choice of the hyperparameters that capture information of the , as is an estimate of that lies in between the prior mode, , and the prior mean when . It is possible to obtain some accurate prior estimate for experimental data or simulation model output when the nugget effect is trivial. One can employ the Jefferys prior to avoid the sensitivity issue. The integrated likelihood then becomes

(17)

The quadratic term and the determinant in (16) or (17) can be evaluated as

with the latter obtained by Sylvester’s determinant theorem. The marginal likelihood (