Genomic Region Detection via Spatial Convex Clustering

# Genomic Region Detection via Spatial Convex Clustering

## Abstract

Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning parameters. Through simulation studies based on real methylation and copy number variation data, we show that SpaCC exhibits significant performance gains relative to existing methods. Finally, we illustrate SpaCC’s advantages as a pre-processing technique that reduces large-scale genomics data into a smaller number of genomic regions through several cancer epigenetics case studies on subtype discovery, network estimation, and epigenetic-wide association.

## 1Introduction

Modern genomic technologies take fine-grained measurements on on human subjects that allow for increasingly individualized treatment options for various diseases. Several of these technologies capture genomic information at spatially registered locations on the DNA sequence; examples include point mutations, next generation sequencing, copy number variation, and the focus of this paper, DNA Methylation arrays which measure epigenetic variation. Here, the units returned by the technology, CpG sites, are not of primary interest to scientists. More important are regions of CpG sites whose cumulative impact affects gene function [7]. To this end, we introduce an unsupervised feature learning technique which maps technological units to biological units by coalescing probes into contiguous genomic regions that are common across multiple subjects.

Epigenetic technologies measure genetic aspects that affect gene regulation beyond gene expression and transcriptomics. One such example is DNA Methylation, which measures the addition of a methyl group to CpG sites creating 5-methylcytosine. High methylation levels (hypermethylation) have been shown to block gene transcription in cancer. Similarly, low methylation levels (hypomethylation), typically at a global level, have also been observed by cancer researchers [7]. Recent advances in whole genome bisulfite sequencing technology now yield methylation intensity measurements at hundreds of thousands of CpG sites across the genome [2]. The ratio of methylated intensity to total intensity, the so-called beta-value, is returned as a measure of the DNA methylation level at each site. Such technologies interrogate genomic regions (such as gene promoter regions) by taking measurements of multiple CpG sites in close spatial proximity; beta-values in these regions are often strongly correlated, indicating that they behave functionally as a unit [9]. These observations imply that probes which are close in genomic distance and which display similar methylation levels are better treated as functional units, or genomic regions. Previous work on region detection in the context of methylation has focused on differentially methylated region (DMR) discovery. Approaches in this area have utilized both smoothing techniques [14] and the concept of linkage disequilibrium [35]. The task of DMR discovery is, however, an inherently supervised one. In this work, we focus on the unsupervised discovery of genomic regions for methylation data. We develop a method for grouping probes into genomic regions that leads to more interpretable scientific measurements, improves the performance of downstream analyses, and can thus serve as a dimensionality reducing pre-processing step for methylation data.

Another example of grouping spatial genomics data is the well-studied problem of copy number segmentation. Copy number variation (CNV) is one measure of structural variability in the genome that quantifies large scale insertions and deletions of genetic information across the DNA sequence. By utilizing array-CGH technology, structural differences relative to a reference sample are quantified as the log ratio of intensities at various loci across the genome. Such differences have been linked to various cancers such as breast and lung [34]. Similar to DNA methylation, copy number measurements at a particular loci are typically of lesser interest than regions of gain or loss, which signal large-scale amplification or deletions [36]. As such, the problem of Copy Number Segmentation has received much attention and numerous methods exist for this type of analysis [31]. Due to both its well-studied nature and similarity to region detection in the case of methylation data, we consider the task of copy number segmentation as a benchmark, and show that our method performs competitively to many state of the art methods.

For both methylation and CNV data, scientists seek entire regions of CNV amplifications or deletions or regions of CpG sites that behave as functional units. For these regions to serve as meaningful features for downstream multivariate statistical analyses, they must be consistent across all subjects. To address this, we introduce an unsupervised learning method which maps technological units to more meaningful biological units by clustering probes into genomic regions. An example of the genomic regions resulting from our method, Spatial Convex Clustering (SpaCC), can be seen in Figure 1. To detect such regions, we utilize the popular concept of fusion penalties [37] by extending methods for convex clustering to be appropriate for spatial genomics data [15]. We specifically focus on developing a fully automated and data-driven method that can be used to pre-process spatial genomics data into genomic regions for downstream analyses (Section Section 2). While we illustrate our method on CNV data to benchmark our method against widely used segmentation approaches (Section Section 3.1), the main focus of this paper is on methylation data. For this, we show how our method can lead to improved results for biomarker discovery in region-based epigenetic-wide association studies (Section Section 4.3), and our resulting genomic regions can yield improved features in downstream multivariate analysis such as clustering for subtype discovery (Section ?) and epigenetic network estimation (Section ?).

## 2Spatial Convex Clustering

Our objective is to discover groups of probes which (i) have similar measurements, (ii) are spatially contiguous, and (iii) are nearby on the chromosome. Taken together, (i)-(iii) deliver groups of probes similar enough to be considered as single functional units, or genomic regions. Importantly, we require that (i)-(iii) be satisfied simultaneously for all subjects, so that the genomic regions discovered are intrinsically meaningful, and not artifacts of a particular sample. One approach to multi-subject genomic region detection would be to consider clustering the genomic probes based on the subject observations, as opposed to the more commonly used clustering of observations. To this end, we merge ideas that use fusion penalties for copy number segmentation [44] and the more recently introduced convex clustering methods [5] to develop an automated pipeline for feature learning with spatial genomics data. In particular, we address data-specific weighting schemes, automated methods for detecting the number and extent of clusters, as well as a principled approach for handling missing values.

### 2.1Optimization Problem and Algorithm

Let be our data matrix with subjects and variables (probes). Our SpaCC problem is defined as

Here, we use the Frobenius loss associated with the Gaussian distribution to estimate the cluster means, given by the columns of . The second regularization term induces sparsity in the adjacent column differences of , thus forcing adjacent column mean estimates to exactly fuse as the regularization parameter, is increased; fused columns of form a spatially contiguous cluster of variables. As entire columns of are fused as a unit, the clusters are common to all samples. The level of regularization is governed by , with larger values encouraging more fusion and larger spatial clusters, and the weights, . As discussed subsequently, the latter are taken as spatial weights proportional to the inverse distance between adjacent probes that are specific to each genomic technology; these in turn, lead to more interpretable results and computationally efficient algorithms. Note that in contrast to convex clustering methods [5] which cluster observations, our approach clusters variables; additionally, we do not permit clusters among arbitrary groups of probes, and instead permit only genomically adjacent probes to coalesce together. In this sense, our method builds on the success of several existing fusion-based approaches that have been proposed specifically for CNV data [3]. Finally, note that we apply our SpaCC problem separately to data for each chromosome.

To fit our SpaCC model, we adopt an approach introduced by [5] for convex clustering problems. We reformulate by introducing an auxiliary variable , where , and rewrite the penalty in terms of . We then use the Alternating Minimization Algorithm (AMA) [39] optimization algorithm to fit our SpaCC problem. Formulating the augmented Lagrangian, we obtain updates for both primal variables (, ) and dual variables (), as shown in Algorithm ?.

### 2.2Spatial Weights

An important input to our SpaCC problem is the weight vector, , which incorporates genomic spatial structure and yields more interpretable results and faster computation. Importantly, the choice of weights is inversely proportional to the genomic distance between adjacent probes. While there are many such possible spatial weighting schemes, we propose to use , where is the distance in basepairs between probes. Here, is chosen based on properties of genomic data. For example with copy number data, it is common to have sizable portions of a chromosome amplified or deleted [32]. Hence, weights that decay at a slower rate as a function of genomic distance will perform well for CNV data; specifically, we take in our empirical studies. For methylation data, we expect methylated regions to form in small localized CpG islands near promoter regions of genes [9]. Hence, we take for methylation data, which yields a more rapid decay in weights as a function of genomic distance. Examples of these weight choices are shown in the supplemental materials and the efficacy of these choices are tested in simulations, Section 3, and real data examples, Section 4. Overall, tailoring spatial weights to specific technologies yields more interpretable genomic regions, with larger clusters in CNV data and smaller localized methylated regions, Figure 1.

Additionally, our choice of weights can dramatically decrease the computational burden of our SpaCC algorithm. Specifically, we apply hard-thresholding to the weights to set tiny weights to zero. Exact sparsity in the weight values prevents distant adjacent probes from coalescing into a common genomic region. Further if weights are set to zero, the penalty term of is perfectly separable into terms, yielding subproblems. These subproblems may in turn be solved in parallel. For example using our methylation weight choice, the Chromosome 17 TCGA Level 3 Breast Cancer Methylation SpaCC problem, Figure 1, separates into subproblems that can be solved completely in parallel. This yields dramatic computational gains when compared to the original problem which consists of probes.

### 2.3Missing Values and Parameter Selection

We seek to develop a fully automated method for detecting multi-subject genomic regions in CNV and methylation data; often automated methods are more reliable and reproducible, as practitioners have no knobs to tune that can yield different results across studies and labs. To this end, we introduce two automated, optimization-based approaches to handling practical problems with spatial genomics data: missing data and regularization parameter selection. First, missing values tend to be a major problem for large-scale high-throughput genomics data. For example, the TCGA Breast Cancer Level 3 Methylation Chromosome 17 data shown in Figure 1, contains total missing values. Similarly for TCGA Ovarian Cancer Level 2 Copy Number Variation, Chromosome 17 contains missing values. Typically, missing genomics data is handled via a two-step process where one first imputes the missing values using many popular off-the-shelf imputation routines [38] and then continues with analyses of the fully imputed data. This approach, however, is less reproducible as results of downstream analyses can change depending on the imputation procedure employed. Instead, we propose to fit our SpaCC model in the presence of missing data, effectively eliminating the need to preform a separate imputation step.

As before, let be our data matrix. Let denote the indices of missing elements. We adopt an approach similar to that in [6] to fit our SpaCC procedure in the presence of missing values. Specifically, we fit our SpaCC loss function only over the the non-missing elements of , given by the indices :

First, notice that this optimization problem is still convex, and hence our approach will yield the global solution. We propose to optimize using the majorization minimization (MM) algorithm [17]. Defining the surrogate function to be

notice that majorizes the objective ; namely (i) for all and (ii) . Iteratively minimizing the surrogate objective creates a non-increasing sequence of objective values . Defining , we augment leading to following algorithm to fit our SpaCC problem in the presence of missing data:

Notice that in contrast to traditional imputation routines, Algorithm ? does not explicitly replace missing elements in prior to analysis. Rather, missing indices are filled in iteratively via the SpaCC cluster means for the genomic regions, .

Now that we have an automated way of fitting SpaCC with missing values, we leverage this to propose a -fold cross-validation scheme to select the single tuning parameter, . Note that controls both the number of genomic regions and the extent of these regions. To select , we employ the approach of [46] where we remove random elements of in each fold and take the optimal as the parameter whose SpaCC imputed solution fit in the presence of missing values most closely aligns with the removed data. Specifically, for , where the number of folds, we define the indices to be left out at the th fold to be , where , and , so that is an approximately equal sized partition of the non-missing elements of . For each fold , we introduce additional missing elements via and solve the missing data problem, Algorithm ?. Specifically, let . Given , we solve SpaCC with missing values given by :

We can solve this problem via Algorithm ?, replacing with ; we denote the solution as . Then, we evaluate the performance of on the th fold by comparing the solution to the left out (non-missing) elements of via mean squared error: Our complete cross validation algorithm is given as follows:

Given the tuning parameter, , selected by minimizing the cross-validation error, we noticed that SpaCC identifies genomic regions which are spatially smaller than desired. Stated another way, cross-validation tends to underestimate the sparsity level in the differences, . Such behavior is well known for the lasso and other sparse problems and is hence why many advocate using the one-standard-error cross-validation rule [40]. An alternative approach is to post-process the results after cross-validation by thresholding [22]. We adopt such a scheme and propose to threshold the elements of at a level proportional to the estimated noise, ; specifically we threshold at the level, , similar to that proposed by [22]. In Figure 2, we give examples of cross-validation error curves over a sequence of values for the TCGA Ovarian Copy Number and Methylation chromosome 17 example from Figure 1; both the minimum cross-validation error as well as our proposed thresholding level are shown. The efficacy of our cross-validation and post-thresholding procedure are further studied in subsequent simulations. Overall, we have developed an automated and principled optimization-based approach for handling missing values and selecting the number and extent of genomic regions that offers many advantages over competing approaches.

## 3Simulation Studies

We study the performance of SpaCC empirically and compare our approach to existing methods through simulations based on real array-CGH and DNA-methylation data. Before presenting our simulation results, it will be helpful to introduce some notation. Our probes are indexed by . Associated with each probe is a genomic location , , and we denote the distance between genomic locations by . The probes are partitioned into clusters, , indexed by . For each probe we denote the cluster to which it belongs via . Our method and competitors will estimate a clustering , with and . We will then evaluate the performance of all methods by by comparing the estimated clustering to the true clustering using common clustering metrics such as the Rand, Adjusted Rand, and Jaccard Indexes [43]. Note that these clustering metrics are general and do not specifically account for the fact that we require clusters to be spatially contiguous. Hence, we also employ a lesser-known entropy-based clustering metric, the Variation of Information (VI) metric [20], that is better suited to measuring the information loss between two sets of clusters that may be nested, as we would often expect with spatially contiguous clusters.

### 3.1Simulation Studies: Copy Number Segmentation

We first evaluate the performance of SpaCC for the well-studied problem of copy number segmentation for array-CGH data. While not our primary focus, the problem shares many attributes in common with the problem of genomic region detection for methylation data. These common features along with many popular tools and software packages [33] make the problem an ideal benchmark to test the performance of our method.

Our simulations are based on TCGA Ovarian Cancer Level II array-CGH data for chromosome 17 [26], where we adopt the probe locations and use data for observed subjects to form the mean simulated CNV signal as well as the locations of copy number segments with amplifications or deletions. Specifically, we simulate series for probes and subjects from the following model: . Here, is the base mean of the series which is taken from a subject in the TCGA data that was detected as having no amplifications or deletions according to DNACopy [33]; is an indicator of an amplification or deletion for subject in region where the regions are taken from the observed segments as detected by DNACopy for subjects from the TCGA data; gives the mean shift for the amplification or deletion in region ; and is iid additive noise. As not all subjects will have copy number changes for each region, we simulate . Also, as each subject could have an amplification or deletion of differing magnitude, we simulate .

We study four simulation scenarios of varying difficulty. First, we study the effect of the magnitude of the copy number changes relative to the noise level by taking for large changes, and for small magnitude shifts that are more difficult to detect. Next, we use two different subsets of DNAcopy regions detected for a real TCGA patient, with easier or more difficult to detect shapes to seed the segment boundaries; that is, easier to detect segments tend to be spatially longer and harder to detect segments are shorter and more fragmented. Note that the easy and hard set of segments are shown in the Supplemental Materials. Also related to shape, we vary the probability of a shift, , with a larger probability corresponding greater consensus across subjects, and hence easier shift detection.

Results indicate that SpaCC outperforms according to all metrics on all regimes in this simulation. As illustrated subsequently for real data, DNACopy segments each subject separately. As such, DNACopy performs well on per-subject basis, but performs poorly at the detecting segments common across all patients. To address this shortcoming, the CNTools package and its implementation of the Circular Binary Segmentation (CBS) algorithm offers an alternative whereby common segments are returned for all subjects. Yet, CNTools’ still fails to reach a reasonable consensus regarding common subject segments. Given the discrepancy between subject’s amplification/deletion regions, CNTools tends to deliver shorter segments, wherein agreement can be reached across several subjects. These short segments often partition the true, larger segments, but are both difficult to interpret and necessarily score poorly on the various metrics. In contrast, SpaCC’s regularized solution yields an improved consensus between the individual series and their common segments, detecting larger segments more closely aligned with the underlying truth.

### 3.2Simulation Studies: Methylation Region Detection

Next we evaluate SpaCC’s performance for the task of methylation region discovery. We again model our study based on real data, utilizing TCGA Breast Cancer Level 3 Methylation data for chromosome 17 [25]. Initial clusters, , are detected using SpaCC, which then act as the ground truth for what follows. We simulate methylation beta values via the cdf transform , for subjects and probes , to ensure values in . Our simulated methylation regions are denoted by the degree of correlation of probes within the region. Specifically, we take to be . with spatial covariance given by where is the distance between probes. Here controls the decay rate of the within cluster spatial correlation, and similarly between clusters. The difficulty of the simulation is controlled by difference between the the within and between spatial correlation decay rate. By varying the ratio of decay rates, , we consider three scenarios of high, medium, and low within region spatial correlation, relative to between; these scenarios correspond to , respectively. We compare SpaCC’s performance to the linkage disequilibrium method introduced in [35].

We again note that SpaCC outperforms existing region-detection methods across all regimes and metrics. A portion of SpaCC’s success may be attributed to the continuous nature of its spatial fusions, wherein clusters are joined in a smooth fashion via continuous spatial weight decay. The LD method, by contrast, implements a greedy strategy for accumulating clusters based on a discrete window size, here 500 bp. This fixed window size gives the LD method less flexibility to detect long range clusters when they are present; SpaCC, by not choosing apriori distance cutoffs, does not have this difficulty.

## 4Applications of SpaCC to Cancer Genomics Data

We now illustrate how SpaCC can be used as a pre-processing tool to improve the analysis of real array-CGH and DNA-methylation data. Using case studies from TCGA ovarian, breast, and lung cancers we show how SpaCC can yield improved copy number segmentation results in Section 4.1, but mainly focus on the more novel application of detecting methylation regions. For this, we show how SpaCC can yield improvements in subtype discovery, Section ?, inferring epigenetic networks, Section ?, and biomarker discovery in region-based epigenetic-wide association studies (rEWAS), Section 4.3. As EWAS and especially rEWAS are relatively new types of epigenetic analyses, we include a more careful study of this application with additional simulation results in Section ? and an application to TCGA lung cancer data in Section ?.

### 4.1Ovarian Cancer Copy Number Segmentation

We investigate SpaCC’s segmentation performance on Ovarian Cancer TCGA Level 2 Copy Number data [26]. We report results for Chromosome 17, which is where the important BRCA1 gene, which has been widely associated with ovarian cancer [19], is located. The number of probes totals and we consider subjects. In Figure 4, we visually compare segments discovered by SpaCC to those of the two most popular competitors, DNACopy [33] and CNTools [48]. We plot three subjects’ copy number series for the whole chromosome and overlay the estimated segments, colored consecutively, detected by each method; the level of each segment corresponds to the subject’s average copy number per segment. The three subjects plotted represent the subject with the minimum, median, and maximum copy number changes across the cohort.

These results visually illustrate the many advantages of SpaCC over existing tools for copy number segmentation. In particular, DNACopy, while performing well on a per-subject basis delivers inconsistent segments across multiple subjects. Indeed, the number of segments returned by DNACopy varies from to , depending on subject. Thus, the segments produced by DNACopy cannot be used as a pre-processing step before further multivariate analyses, as the features are unaligned across the subjects. Conversely, CNTools, while delivering segments consistent across the subjects, finds difficulty assessing the trade-off between subject-specific patterns and patterns common to all subjects. The result is an awkward consensus with many (), short segments. This “shattered” appearance makes interpretation difficult due to both the number of segments and their size. In contrast, SpaCC finds longer more interpretable segments (), finding a more appropriate balance between subject-specific and common patterns across the cohort. As such, SpaCC is ideally suited as a pre-processing method to segment and reduce the dimension of copy number data before further multivariate analyses.

### 4.2Detecting Methylation Regions with SpaCC

We now study and illustrate how SpaCC can be used for biomarker discovery and as a pre-processing tool that yields a reduced and more meaningful set of features for downstream analysis of methylation data. First, we apply SpaCC to discover methylation regions from the TCGA Level 3 Breast Cancer data [25] for chromosome 17 with subjects and probes. We visually examine methylation regions returned by SpaCC in Figure Figure 5. Notice that the methylation regions detected are much shorter than their Copy Number counterparts. Segments in the latter tend to be amplified or deleted in large chromosomal regions, whereas regions of consistent methylation levels to occur in small localized areas corresponding to promoter regions of genes [9]. On the right in Figure 5, we can easily see how SpaCC features aggregate across probes whose levels are meaningful and consistent across all subjects; this is illustrated by the region denoted in blue. In contrast, SpaCC has not fused the two probes flanking this region as their levels differ significantly from the blue region for the subjects with the minimum and median variation in methylation levels shown. These local regions allow us to capture fine-grained characteristics that are more biologically meaningful than examining single probes. For example, the highlighted blue region is hypomethylated in most subjects, but hypermethylated for the bottom subject shown, which has the maximum variation in methylation levels. This region corresponds to the promoter region of the ABCC3 gene which is associated with HER2 and luminal breast tumors [29]. In total, SpaCC detected methylation regions for chromosome 17, which offers a reduction from the original probes.

#### Breast Cancer Methylation Subtype Discovery

Several have recently suggested that methylation levels can be used to define cancer subtypes, and in breast cancer, methylation levels have been used to characterize the well-known expression-based subtypes [41]. Here, we illustrate how reducing methylation data to the SpaCC-derived reduced set of features offers improvements in downstream multivariate analyses such as subtype detection. We continue to work with the TCGA breast cancer data for chromosome 17 and consider classifying between the Basal and Luminal (A and B) subtypes. To fairly assess the performance of SpaCC’s region-based features relative to using the raw probe values, we consider a simple classification scheme where we first reduce the data using principal components and then fit a Naive Bayes classifier. In Figure ?, we repeatedly split the data into training and test sets and report the misclassification errors on the test sets for a range of principal components (PCs). We also show PC scatterplots for SpaCC-based and probe-based analysis, illustrating the superior discriminatory power of SpaCC-based features. Also notice that SpaCC-based features outperform in terms of misclassification error for all the number of PCs considered. By reducing the noisy probe-level data into more biologically meaningful units, SpaCC is able to yield improved results for subtype detection.

#### Inferring Epigenetic Networks

Various genomic regions have been consistently shown to be hyper- or hypomethylated together in sets of cancer patients [10]. One way of understanding relationships between methylated regions in cancer is by inferring epigenetic networks. Here, we show how SpaCC’s region-based features can lead to more meaningful and interpretable epigenetic networks. We continue to study the TCGA breast cancer chromosome 17 data and use Gaussian Graphical Models to infer networks where either the raw probes or the SpaCC-estimated regions serve as nodes. Networks are estimated via the graphical lasso [13] with a common (dimension dependent) regularization parameter and stability selection [21] with a common threshold parameter of ; we utilize the BigQUIC software [8] to estimate both the probe and region-based networks.

Results for the probe-based and region-based epigenetic networks are shown in Figure 6. Networks inferred for probe features contain a much larger proportion of edges connecting spatially adjacent probes; summaries of the genomic distances between connected nodes are given in Table 3. This finding is not unexpected given the high degree of spatial correlation inherent in methylation levels. Connections between local regions, however, are less meaningful biologically. SpaCC, on the other hand, reduces the data to its relevant biological units and hence inferred networks have a larger portion of long range connections that are more likely of interest to scientists. For example, the TBCD gene highlighted in Figure 6 plays a role in the cytokinesis stage of cell division [11] and has been shown to be upregulated in breast cancer [12]. As can be seen in the probe-based inferred network, probes in the TBCD region of chromosome 17 form only local connections. In contrast, SpaCC aggregates many of the probes in the regions surrounding the TBCD gene. The resulting network estimate no longer contains TBCD intra-gene connections, and instead forms longer range connections. One such long range connection illustrated above is to the SEPT9 gene region which has exhibited high expression levels in breast cancer cell lines [23]. While the connection between the epigenetics of TBCD and SEPT9 has not been experimentally established, both genes have been associated with breast cancer and thus represent the precise type of potential connection which scientists seek to discover using graphical models. By aggregating probes into relevant biological units, SpaCC-based graphical models are able to discover potentially novel functional relationships between epigenomic regions which would otherwise be masked.

### 4.3Region-Based Epigenetic-Wide Association Studies

Finally, we study how SpaCC can improve systematic discovery of epigentic markers associated with disease through epigenome-wide association studies (EWAS). Analogous to more widely used GWAS, EWAS conducts univariate tests for association with an outcome at each epigenetic marker (probe at a CpG site) and adjusts significance levels for multiplicity. As the epigenome can encode environmental or behavioral characteristics of human subjects [4], EWAS can help discover how factors other than genetics contribute to disease. For EWAS with DNA Methylation data, however, the number of subjects is typically small relative to the CpG sites measured by the latest Illumina platform, thus leading to a possibly under-powered study. To address this, we propose to first reduce methylation data to genomic regions via SpaCC and then conduct an association study on the regions; we term this a region-based epigenome-wide association study (rEWAS). As SpaCC retains genomic regions that behave as functional units, we expect that this will reduce the number of tests conducted, leading to an increase in statistical power, while still being able to detect epigenetic markers of disease. Note that testing regions in rEWAS is similar in spirit to testing groups of genetic markers based on linkage disequilibrium in GWAS [45]. Also note that some EWAS analyses have proposed to test regions, but they have typically considered tests for global methylation levels [16] instead of localized genomic regions as we propose with SpaCC. In this section, we first evaluate the efficacy of SpaCC-based rEWAS via a simulation study and then present a rEWAS example to detect epigenetic markers of survival in lung cancer.

#### rEWAS Simulation Study

Since rEWAS is a new type of association study, we first use simulated methylation data to assess the performance of SpaCC-based rEWAS compared to traditional EWAS analysis using the raw probes. As in Section 3.2, we base our simulated data on TCGA Breast Cancer Level 3 methylation data for chromosome 17. We obtain initial regions, via SpaCC and then simulate region means for each subject as . Next, individual probes, are simulated as deviations about the region means via , where are chosen to ensure . Finally, we generate our response as a linear function of a subset of the region means: , where , and . The difficulty of the simulation is controlled by the signal-to-ratio (SNR) level, here the size of relative to . Our simulation consists of patients relative to probes.

We compare our SpaCC-based rEWAS approach to rEWAS using linkage disequilibrium [35] and the Fisher product method [47] as well as EWAS on the raw probes. For all methods, we fit a univariate linear regression model at each probe or region and corrected for multiplicity using Benjamini-Hochberg’s method to control the FDR [1]. The objective is to recover the regions or all the probes contained within the regions that determine the outcome. Notice then that there are two ways to report the true positive rate (TPR) and false discovery proportion (FDP) for this simulation study. First, we can use the regions detected as significant and compare the spatial extent of a detected region to that of the corresponding true region to determine a true positive rate; we call this the region point-of-view (RegionPOV) and the FDP is defined analogously. Second, we can compare the probes detected, or the probes within regions detected as significant, to the probes that lie within the true regions to determine the true positive rate; we call this the probe point-of-view (ProbePOV). The Supplemental Materials contain formal definitions of these metrics used to evaluate our simulation study.

In Figure Figure 8, we report the results of our simulation study at various SNR levels and FDR levels. Compared to probe-based EWAS and LD-based rEWAS, SpaCC-based rEWAS, achieves comparable FDP levels while achieving higher TPR according to both probe and region-based metrics. Overall, this study confirms our intuition that first reducing methylation data to regions via SpaCC, leads to increases in statistical power for EWAS.

#### Lung Cancer rEWAS Study

We now use our SpaCC-based rEWAS approach to discover epigenetic markers associated with lung cancer survival. We use the TCGA lung cancer DNA methylation data which has patients and probes across all chromosomes [27]. A univariate Cox proportional hazards model is used to test for associations with survival at each probe (CpG site) for EWAS or at each SpaCC-region for rEWAS. The Benjamini-Hochberg procedure was used to control the FDR at the 1% and 5% levels. The p-values at each genomic location for EWAS and SpaCC-based rEWAS are displayed in Manhattan plots in Figure 9; the alternating colors represent chromosomes. Horizontal lines are shown at the 1% and 5% FDR levels; vertical lines denoting p-values that cross these thresholds are statistically significant. Notice that the Manhattan plots for both EWAS and rEWAS retain a similar shape, indicating that both methods found a common epigenome-wide signature for lung cancer survival. On closer inspection, we see that SpaCC-based rEWAS yields a larger number of discoveries at equivalent FDR-levels. At the 1% FDR level, EWAS found 29 significant CpG sites, whereas SpaCC-based rEWAS found 49 significant regions which contain a total of 77 CpG sites. Similarly at the 5% FDR level, EWAS had 287 discoveries while SpaCC-based rEWAS found 556 regions which contain 861 CpG sites. The overlap in significant discoveries are shown in Table Table 5. Notice especially at the 1% FDR level that our SpaCC rEWAS method missed only 4 CpG sites declared significant by EWAS, but in the converse, EWAS missed 54 discoveries made by rEWAS.

Our analysis reveals several important epigenetic markers which are largely consistent with the cancer literature. In the top portion of the Table 6, we list the most significant discoveries from rEWAS which were also discovered by EWAS. We also report the epigenetic marker location, its gene target or the nearest gene, and a brief description of its role in the cancer literature. Note that further details on the literature are given in the Supplemental Materials. In the bottom portion of Table 6, we highlight the most significant discoveries found exclusively by rEWAS; these are known markers for lung cancer. Overall, our analysis reveals that reducing methylation data to biologically meaningful genomic regions via SpaCC before conducting EWAS studies, leads to major increases in statistical power for discovering epigenetic markers.

## 5Discussion

Building on the success of fusion-based penalties and convex clustering methods, we have introduced a clustering technique for spatially correlated data called Spatial Convex Clustering, or SpaCC. Through both simulations and real data examples we have shown SpaCC to be a successful tool for reducing DNA methylation data to its functional genomic regions. The resulting regions can then be used to discover epigenetic markers or as features for down stream multivariate statistical analyses. While this paper has focused on the analysis of methylation data, the SpaCC framework is not limited to this data type, nor genomic data in particular. Extensions of SpaCC may be appropriate for clustering spatially registered read counts from RNA-sequencing data to discover new isoforms and alternate splicing. Beyond genomics, similar fusion-based approaches may also be applied other biological data sources with known local spatial structure such as brain imaging. Taken together, these future extensions can allow scientists to perform a single integrative analysis to discover meaningful regions across differing biological modalities. An R package SpaCCr implementing our method is available from CRAN [24].

## Acknowledgements

J.N. was supported in this work by the NIH Training Award in Biostatistics for Cancer Research Award 5T32CA09652010. J.N. and G.I.A. were supported by NFS DMS-1264058 and NFS DMS-1554821. We also thank Zhandong Liu and Ying-Wooi Wan at Baylor College of Medicine for thoughtful discussions related to this work.

### References

1. Controlling the false discovery rate: a practical and powerful approach to multiple testing.
Yoav Benjamini and Yosef Hochberg. Journal of the royal statistical society. Series B (Methodological)
2. Genome-wide dna methylation profiling using infinium assay.
Marina Bibikova, Jennie Le, Bret Barnes, Shadi Saedinia-Melnyk, Lixin Zhou, Richard Shen, and Kevin L Gunderson. 2009.
3. The group fused lasso for multiple change-point detection.
Kevin Bleakley and Jean-Philippe Vert. arXiv preprint arXiv:1106.4199
4. Tobacco-smoking-related differential dna methylation: 27k discovery and replication.
Lutz P Breitling, Rongxi Yang, Bernhard Korn, Barbara Burwinkel, and Hermann Brenner. The American Journal of Human Genetics
5. Splitting methods for convex clustering.
Eric C. Chi and Kenneth Lange. Journal of Computational and Graphical Statistics
6. Convex biclustering.
Eric C Chi, Genevera I Allen, and Richard G Baraniuk. Biometrics
7. Dna methylation and cancer.
Partha M Das and Rakesh Singal. Journal of clinical oncology
8. Big & quic: Sparse inverse covariance estimation for a million variables.
Inderjit S Dhillon.
9. Dna methylation profiling of human chromosomes 6, 20 and 22.
Florian Eckhardt, Joern Lewin, Rene Cortese, Vardhman K Rakyan, John Attwood, Matthias Burger, John Burton, Tony V Cox, Rob Davies, Thomas A Down, et al. Nature genetics
10. A gene hypermethylation profile of human cancer.
Manel Esteller, Paul G Corn, Stephen B Baylin, and James G Herman. Cancer research
11. Tbcd links centriologenesis, spindle microtubule dynamics, and midbody abscission in human cells.
Mónica López Fanarraga, Javier Bellido, Cristina Jaén, Juan Carlos Villegas, and Juan Carlos Zabala. PloS one
12. Gene expression profile associated with response to doxorubicin-based therapy in breast cancer.
Maria Aparecida Azevedo Koike Folgueira, Dirce Maria Carraro, Helena Brentani, Diogo Ferreira da Costa Patrão, Edson Mantovani Barbosa, Mário Mourão Netto, José Roberto Fígaro Caldeira, Maria Lucia Hirata Katayama, Fernando Augusto Soares, Célia Tosello Oliveira, et al. Clinical Cancer Research
13. Sparse inverse covariance estimation with the graphical lasso.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Biostatistics
14. Bsmooth: from whole genome bisulfite sequencing reads to differentially methylated regions.
Kasper D Hansen, Benjamin Langmead, Rafael A Irizarry, et al. Genome Biol
15. Clusterpath: an algorithm for clustering using convex fusion penalties.
Toby Hocking, Jean-Philippe Vert, Armand Joulin, and Francis R Bach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 745–752, 2011.
16. Global dna methylation level in whole blood as a biomarker in head and neck squamous cell carcinoma.
Debra Ting Hsiung, Carmen J Marsit, E Andres Houseman, Karen Eddy, C Sloane Furniss, Michael D McClean, and Karl T Kelsey. Cancer Epidemiology Biomarkers & Prevention
17. A tutorial on mm algorithms.
David R Hunter and Kenneth Lange. The American Statistician
18. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.
Andrew E Jaffe, Peter Murakami, Hwajin Lee, Jeffrey T Leek, M Daniele Fallin, Andrew P Feinberg, and Rafael A Irizarry. International journal of epidemiology
19. Breast and ovarian cancer risks due to inherited mutations in brca1 and brca2.
Mary-Claire King, Joan H Marks, Jessica B Mandell, et al. Science
20. Comparing clusterings�an information based distance.
Marina Meilă. Journal of multivariate analysis
21. Stability selection.
Nicolai Meinshausen and Peter Bühlmann. Journal of the Royal Statistical Society: Series B (Statistical Methodology)
22. Lasso-type recovery of sparse representations for high-dimensional data.
Nicolai Meinshausen and Bin Yu. The Annals of Statistics
23. The septin 9 (msf) gene is amplified and overexpressed in mouse mammary gland adenocarcinomas and human breast cancer cell lines.
Cristina Montagna, Myung-Soo Lyu, Kent Hunter, Luanne Lukes, William Lowther, Tricia Reppert, Bruce Hissong, Zoë Weaver, and Thomas Ried. Cancer research
24. SpaCCr: A package for genomic region detection via spatial convex clustering.

John Nagorski. R package version 0.1.0.
25. Comprehensive molecular portraits of human breast tumours.
Cancer Genome Atlas Network et al. Nature
26. Integrated genomic analyses of ovarian carcinoma.
Cancer Genome Atlas Research Network et al. Nature
27. Comprehensive molecular profiling of lung adenocarcinoma.
Cancer Genome Atlas Research Network et al. Nature
28. A fused lasso latent feature model for analyzing multi-sample acgh data.
Gen Nowak, Trevor Hastie, Jonathan R Pollack, and Robert Tibshirani. Biostatistics
29. Functional genomics identifies abcc3 as a mediator of taxane resistance in her2-amplified breast cancer.
Carol O’Brien, Guy Cavet, Ajay Pandita, Xiaolan Hu, Lauren Haydu, Sankar Mohan, Karen Toy, Celina Sanchez Rivers, Zora Modrusan, Lukas C Amler, et al. Cancer research
30. A cohort study of tumoral line-1 hypomethylation and prognosis in colon cancer.
Shuji Ogino, Katsuhiko Nosho, Gregory J Kirkner, Takako Kawasaki, Andrew T Chan, Eva S Schernhammer, Edward L Giovannucci, and Charles S Fuchs. Journal of the National Cancer Institute
31. Circular binary segmentation for the analysis of array-based dna copy number data.
Adam B Olshen, ES Venkatraman, Robert Lucito, and Michael Wigler. Biostatistics
32. Global variation in copy number in the human genome.
Richard Redon, Shumpei Ishikawa, Karen R Fitch, Lars Feuk, George H Perry, T Daniel Andrews, Heike Fiegler, Michael H Shapero, Andrew R Carson, Wenwei Chen, et al. nature
33. DNAcopy: DNA copy number data analysis

Venkatraman E. Seshan and Adam Olshen. .
34. Copy number variations and cancer.
Adam Shlien and David Malkin. Genome medicine
35. Allele-specific methylation is prevalent and is contributed by cpg-snps in the human genome.
Robert Shoemaker, Jie Deng, Wei Wang, and Kun Zhang. Genome research
36. Functional copy-number alterations in cancer.
Barry S Taylor, Jordi Barretina, Nicholas D Socci, Penelope DeCarolis, Marc Ladanyi, Matthew Meyerson, Samuel Singer, and Chris Sander. PloS one
37. Sparsity and smoothness via the fused lasso.
Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Journal of the Royal Statistical Society: Series B (Statistical Methodology)
38. Missing value estimation methods for dna microarrays.
Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Bioinformatics
39. Applications of a splitting algorithm to decomposition in convex programming and variational inequalities.
Paul Tseng. SIAM Journal on Control and Optimization
40. The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso).
Sara van de Geer, Peter Bühlmann, Shuheng Zhou, et al. Electronic Journal of Statistics
41. Array-based dna methylation profiling for breast cancer subtype discrimination.
Ilse Van der Auwera, Wayne Yu, Liping Suo, Leander Van Neste, Peter Van Dam, Eric A Van Marck, Patrick Pauwels, Peter B Vermeulen, Luc Y Dirix, and Steven J Van Laere. PLoS One
42. A faster circular binary segmentation algorithm for the analysis of array cgh data.
ES Venkatraman and Adam B Olshen. Bioinformatics
43. Comparing clusterings: an overview

Silke Wagner and Dorothea Wagner. .
44. A method for calling gains and losses in array cgh data.
Pei Wang, Young Kim, Jonathan Pollack, Balasubramanian Narasimhan, and Robert Tibshirani. Biostatistics
45. The nhgri gwas catalog, a curated resource of snp-trait associations.
Danielle Welter, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff, et al. Nucleic acids research
46. Cross-validatory estimation of the number of components in factor and principal components models.
Svante Wold. Technometrics
47. Association of brain dna methylation in sorl1, abca7, hla-drb5, slc24a4, and bin1 with pathological diagnosis of alzheimer disease.
Lei Yu, Lori B Chibnik, Gyan P Srivastava, Nathalie Pochet, Jingyun Yang, Jishu Xu, James Kozubek, Nikolaus Obholzer, Sue E Leurgans, Julie A Schneider, et al. JAMA neurology
48. CNTools: Convert segment data into a region by sample matrix to allow for other high level computational analyses.

Jianhua Zhang. R package version 1.22.0.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minumum 40 characters