Simultaneous detection and estimation of trait associations with genomic phenotypes
\externaldocumentjade_supp
Abstract
Genomic phenotypes, such as DNA methylation and chromatin accessibility, can be used to characterize the transcriptional and regulatory activity of DNA within a cell. Recent technological advances have made it possible to measure such phenotypes very densely. This density often results in spatial structure, in the sense that measurements at nearby sites are very similar.
In this paper, we consider the task of comparing genomic phenotypes across experimental conditions, cell types, or disease subgroups. We propose a new method, Joint Adaptive Differential Estimation (JADE), which leverages the spatial structure inherent to genomic phenotypes. JADE simultaneously estimates smooth underlying group average genomic phenotype profiles, and detects regions in which the average profile differs between groups. We evaluate JADE’s performance in several biologically plausible simulation settings. We also consider an application to the detection of regions with differential methylation between mature skeletal muscle cells, myotubes and myoblasts.
1 Introduction
During the past decade, it has become possible to measure genomic phenotypes, or local properties of DNA such as DNA methylation, chromatin accessibility, copy number variation, and histone modification. These genomic phenotypes can be measured very densely, and in some cases even at singlenucleotide resolution. For example, DNA methylation proportion can be measured at nearly every CpG using bisulfite sequencing, and chromatin accessibility can be measured at every nucleotide with DNaseseq. Neighboring nucleotides may belong to the same functional unit. Thus, genomic phenotypes often have similar values at nearby genomic positions.
Unlike the DNA sequence itself, genomic phenotypes may change dynamically as a function of cellular activity, developmental stage, and environment. This motivates us to compare genomic phenotypes across experimental conditions or clinicallydefined categories. This task is made challenging by the fact that, for most genomic phenotypes, we cannot prespecify all relevant functional units and test each unit for an overall difference across groups. For instance, for epigenetic features, many potentially relevant regulatory regions are not annotated. Furthermore, in exploratory studies, it is not always known which classes of genetic elements, such as promoters or enhancers, should be considered.
When prespecified functional units are not available, a simple option for comparing genomic phenotypes across conditions is to perform a separate test at each locus. The majority of existing methods take such an approach, for instance using logistic regression, Fisher’s exact test, or ttests (Akalin and others, 2012; Stockwell and others, 2014; Jaffe and others, 2012; Hebestreit and others, 2013), sometimes followed by a multiplicity correction that considers spatial structure. These procedures identify differential regions by merging contiguous sites with large test statistics. Such tests tend to be underpowered, as they fail to borrow strength across neighboring loci.
In contrast, the BSmooth method of Hansen and others (2012) and the WaveQTL method of Shim and Stephens (2015) are twostep procedures that leverage the spatial structure of the genomic phenotypes. BSmooth first smooths the data, and then uses the smoothed data to calculate a statistic at each site. Differential regions are then identified by merging contiguous sites with large statistics. WaveQTL requires the genome to be divided into prespecified bins. A hierarchical Bayesian regression is performed in order to generate a binlevel test statistic, as well as estimates of association between the data and the outcome at different spatial scales.
In this paper, we propose joint adaptive differential estimation (JADE), a onestep approach for differential estimation and testing of genomic phenotypes. JADE is a penalized likelihoodbased approach which simultaneously estimates smooth averagegroup profiles and identifies regions of difference between groups. By combining these two tasks into a single step, JADE can adaptively share information both across loci and between groups, leading to improved power to detect differential regions without the need for prespecified functional units of interest. When the grouping variable has more than two levels, JADE finds regions where at least one group differs from the rest, and within those differential regions performs local clustering of profiles.
The rest of this paper is organized as follows. In Section 2, we introduce the underlying model, and formulate JADE as the solution to a convex optimization problem. In Section 3, we introduce a custom algorithm that can be used to efficiently solve the JADE optimization problem. In Section 4, we explore the performance of JADE, relative to existing methods, in a simulation study. In Section 5 we apply JADE to publicly available methylation data from the ENCODE project. The Discussion is in Section 6.
2 Problem Formulation
Consider a categorical trait, , such as disease status or tissue type, coded numerically for convenience. We wish to associate this trait with a genomic phenotype, , measured at positions along the genome.
For a given value of , we assume that varies smoothly as a function of genomic position,
Here the function represents the mean genomic phenotype profile for the th class. If all profiles are identical at site ( for all ), then there is no association between the mean of , the genomic phenotype at the th position, and the categorical trait . If for some , then there is an association between the mean of and . Our goal is to identify differential regions, or contiguous blocks of associated sites. A very similar framework was considered in Shim and Stephens (2015).
In what follows, we assume that we have independent observations of , denoted . We now introduce some notation that will be used throughout this paper. Let denote the number of observations with , so that . Let , and let . Furthermore, we let , and . In what follows, unless otherwise specified, the letter will index the observations, will index the values of the categorical trait , and will index the genomic positions of .
2.1 Example
We illustrate JADE with a simple toy example. In each of two groups, we simulate a quantitative genomic phenotype at a series of evenly spaced positions, . The data are generated as an overall groupspecific mean curve, plus independent normal errors, as shown in Figure 0(a). The two groupspecific mean curves differ only for .
We first consider estimating the mean curves by separately smoothing the data corresponding to each of the two groups. As is shown in Figure 0(b), the two estimated profiles are somewhat different at nearly every location.
In contrast, the results from applying JADE to this data are shown in Figure 0(c). JADE simultaneously smooths the data in each group, and penalizes the differences between the two estimated mean curves. Therefore, JADE can approximately recover the differential region shown in Figure 0(a).
Of course, the data that we encounter in real biological problems, such as the application studied in Section 5, are more complicated than the toy example shown in Figure 0(a). Real data are often characterized by unevenly spaced positions ; sites for which a subset of groups are missing measurements; and nonconstant variance of the genomic phenotype measurements. As we describe in the following sections, JADE is able to accommodate all of these characteristics.
2.2 Penalties to Induce Structure
JADE combines two tasks: (i) estimation of a smooth mean curve within each group; and (ii) fusion of the mean curves across groups. Here we use the term fusion to describe JADE’s ability to provide mean curve estimates that are identical across multiple groups at a particular genomic position. That is, if our estimates of and are identical for some , then we say that the estimated mean curves for the th and th classes are fused at position .
We briefly discuss the application of existing penalized regression methods to the two aforementioned tasks.
2.2.1 Smoothing a Genomic Phenotype
Consider the task of smoothing a single observation of a genomic phenotype, , measured at (potentially unevenly spaced) positions . Given weights , we consider the optimization problem
(1) 
The smoothed estimate, , minimizes the sum of two terms: a goodnessoffit term between and , and a penalty term that discourages a rough or complex . The penalty parameter controls the relative importance of these two terms. There are a number of options for , such as a smoothing spline penalty (Reinsch, 1971) or an trend filtering penalty (Kim and others, 2009; Tibshirani, 2014).
Trend filtering induces piecewise polynomial estimates for , of prespecified order , with adaptively chosen knots. The choice of is guided by the characteristics of the data at hand: for instance, trend filtering with (also known as the fused lasso; see Tibshirani and others, 2005) is appropriate for the piecewise constant structure of copy number data (Tibshirani and Wang, 2008); while is appropriate for relatively smooth DNA methylation data. Trend filtering is locally adaptive, in the sense that it can be used to fit a curve that is very smooth in one region of the domain and very rough in another; this is discussed extensively in Tibshirani (2014). This property is very attractive within the context of analyzing messy, heterogeneous, and heteroskedastic biological data. Consequently, in what follows, we take in (1) to be a trend filtering penalty.
For convenience, we now switch to using vector notation. The trend filtering estimate, , is the solution to the optimization problem
(2) 
where is the discrete th derivative matrix, the entries of which depend on both and the spacing of . The specific form of this matrix is detailed further in Section LABEL:sec:TF of the supplementary material [SM] available at Biostatistics online, and in Tibshirani (2014).
2.2.2 Fusing Genomic Phenotypes
Now consider the task of fusing observations of a genomic phenotype, — that is, we seek to encourage the estimated means to be identical at a given site. The convex clustering estimates, , solve the optimization problem (Pelckmans and others, 2005; Hocking and others, 2011; Heinzl and Tutz, 2014)
(3) 
In (3), is a weight for the th locus in the th observation.
For sufficiently large, the penalty in (3) will encourage similarity between and . In particular, if , then when is large, the entire vectors and will tend to be identical, or completely fused at all sites. In this case, the set of observations for which are identical can be interpreted as clusters. In contrast, if , then a large value of will encourage individual elements and to be identical. This amounts to fusing the vectors and at a subset of the sites.
2.3 Joint Smoothing and Comparison with JADE
Recall from the beginning of Section 2 the problem setup: each observation belongs to one of categories, denotes the number of observations within the th category, and denotes the mean of the observations of the genomic phenotype within the th category.
Our goal is to estimate a mean genomic phenotype profile, , for each of the categories. We want each mean profile to be smooth, and for the mean profiles to be identically equal to each other for many of the loci . To do this, we combine the smoothing and fusion penalties seen in (2) and (3) into a single convex optimization problem.
The JADE estimator is defined as the solution to the convex optimization problem
(4) 
This minimization consists of three terms: a weighted sum of squared residuals, a sum of trend filtering penalties, and a clustering penalty. When the nonnegative tuning parameter is sufficiently large, the trend filtering penalty encourages each mean profile to be smooth. Equation (4) could be modified to allow each of the groups to have its own smoothness tuning parameter, . For simplicity we use a single common parameter.
When the nonnegative tuning parameter is sufficiently large, the clustering penalty encourages many of the sites to have exactly the same value in the th and th mean profiles, for . In fact, when is large enough, some of the sites will have ; these can be interpreted as regions of the genome where the mean profile is constant across the groups. Thus, JADE simultaneously identifies regions of the genome in which the genomic phenotype is associated with the categorical variable , and estimates smooth average profiles for each group. It accomplishes this in an efficient way that borrows strength across nearby sites, without performing a separate test at each site in the genome. Selection of and in (4) is discussed in Section 3.2.
In (4), are diagonal weight matrices. These can be used to account for the fact that the elements of may have nonconstant variance across the sites, perhaps due to varying numbers of reads across the genome. Furthermore, if no data are available for the th site in the th group, then the th diagonal element of can be set to zero.
3 Solving the JADE Optimization Problem
3.1 An Alternating Direction Method of Multipliers Algorithm for JADE
The JADE optimization problem (4) is convex, so in principle, it can be solved with generalpurpose convex solvers, such as SDPT3
(Tütüncü and
others, 2003) or SeDuMi
(Sturm, 1999). However, these solvers do not scale well to genomesized problems. Therefore, we have developed an efficient custom alternating direction method of multipliers (ADMM; Boyd and others, 2010) algorithm for solving (4).
Our algorithm relies on the key observation by Tibshirani (2014) that the trend filtering penalty matrix can be decomposed as , where is the first difference operator, and is a scaled thorder difference operator. Details of these two matrices are provided in Section LABEL:sec:TF of the SM.
Using this decomposition, we can rewrite (4) as
(5)  
subject to 
The scaled augmented Lagrangian for this problem is
(6) 
where , , and . In (6), is a vector of dual variables, where for and . The dual variables are broken into multiple components in order to allow for different step sizes, and , as this leads to faster convergence. In our implementation, we adjust the step sizes adaptively; details are in Section LABEL:sec:stepsize of the SM.

Initialize as solutions to (4) with .

For , initialize and .

Iterate until the convergence criteria described in Section LABEL:sec:stepsize of the SM are satisfied:

For , update

For , update

For , update

For , update the dual variables by setting

Update the step sizes and as described in Section LABEL:sec:stepsize of the SM, and rescale the dual variables by setting

The ADMM algorithm corresponding to the scaled augmented Lagrangian (6) is given in Algorithm 1. The initialization in Step 1 simply amounts to solving a separate trend filtering problem for each , . The update in Step 3(b) involves solving a fused lasso problem; this can be done using the algorithm of Johnson (2013). The update in Step 3(c) has an explicit form in the case of groups (see Section LABEL:sec:algdetails of the SM). For groups, we make use of the solution of Hocking and others (2011).
If the output of Algorithm 1 has the property that for some , , then we conclude that at the th locus, the mean genomic phenotype profiles are identical. Due to numerical issues, however, we may not observe exact equality between and for . Therefore, in practice, we set a threshold , and conclude that and are equal if the absolute difference between them is below . The mean genomic phenotype profile for the th group can be obtained from in the output of Algorithm 1.
In practice, it is computationally prohibitive to solve the JADE optimization problem (4) on genomesized data. Therefore, we take a pragmatic approach: we segment the genome, and apply JADE to each segment in parallel. In the methylation data application presented in Section 5, there is a natural segmentation that respects the biology of the problem. Other situations might require more arbitrary segmentation. Provided that the regions to be detected by JADE are short relative to the segmentation that we impose, we expect the segmentation to have little effect on the results.
3.2 Tuning Parameter Selection
The JADE optimization problem in (4) involves two nonnegative tuning parameters. The parameter controls the smoothness of the mean genomic phenotype profiles while controls the amount of fusion between pairs of profiles. We take a twostage approach to select and rather than performing a grid search over all combinations of values.
In both stages, crossvalidation is performed by dividing the data points , , , into folds. For a given value of , each fold contains a data point at every th position, and the folds are staggered so that all data points at a single position are not in the same fold. For example, if , , and , then the first fold could contain and .
In the first stage, we set in (4); this amounts to combining all of the data into a single trend filtering problem. We then perform crossvalidation in order to select .
In the second stage of tuning parameter selection, we hold fixed at the value selected in the first stage, and select the tuning parameter using crossvalidation. Additional details are provided in Section LABEL:sec:cv_gamma of the SM.
In both stages of crossvalidation, we apply the onestandarderror rule, selecting the largest tuning parameter value that has crossvalidation error within one standard deviation of the minimum (Hastie and others, 2009).
4 Simulations
In Section 4.1, we consider a setting in which the genomic phenotype is continuousvalued. In Section 4.2, we consider a setting that is modeled after methylation sequence data.
4.1 Normal Simulations
4.1.1 Simulation SetUp
We simulate observations, 10 in each of groups, at evenly spaced sites, . (We explore other values of the sample size in Section LABEL:sec:ss_sims of the SM.) The data for the th observation in the th group at the th site is generated as
(7) 
where the functions and represent the mean genomic phenotype profiles for the two groups, and are displayed in Figure 2. The error terms are generated in one of two ways:

Autoregressive model. For and ,
We consider values of and .

Random effects model. For , , and ,
(8) In this setup, represents a mean shift for the th observation in the th group, such as one might expect as a result of a batch effect. We choose and such that , and the proportion of variance due to random effects, , takes on values of , , , and .
4.1.2 Methods for Comparison
In this section, we compare JADE to three test based methods. These methods decouple the tasks of estimating the mean genomic phenotype profiles for each of the groups, and testing for differences between the mean genomic phenotype profiles. These approaches assume that .

A twosample statistic is calculated at each site, without first smoothing the data. This approach is used by methylKit (Akalin and others, 2012).

Each observation is smoothed using local likelihood, with the bandwidth chosen by generalized crossvalidation. Then a two sample statistic is computed at each site, using the smoothed observations. BSmooth (Hansen and others, 2012) uses this strategy with a fixed bandwidth optimized for methylation data.

Each observation is smoothed using a quadratic smoothing spline, with the tuning parameter chosen by generalized crossvalidation. Then a two sample statistic is computed at each site, using the smoothed observations.
The third method is included in order to understand the impact of different smoothing strategies. For all three methods, a threshold is chosen, and any site with a test statistic exceeding that threshold in absolute value is declared to have a different mean value between the groups.
We use our own implementation in R
of all three strategies, because methylKit and BSmooth are both implemented specifically for methylation count data, whereas the genomic phenotypes in this simulation study are continuous.
We do not include the WaveQTL method of Shim and Stephens (2015) in our comparisons, as it requires the user to pass in prespecified genomic regions, and does not provide a persite assessment of the association between genomic phenotype and category.
In our application of JADE, we set the weight matrices and in (4) to the identity. Tuning parameters were selected according to the procedure in Section 3.2.
4.1.3 Results
We now compare the performances of JADE and the three testbased methods described in Section 4.1.2. Before presenting these results, we briefly discuss the calculation of false and true positives for each method.
For a given value of in the JADE optimization problem (4), we declare a false positive if and , and a true positive if and . We fit JADE at around 100 values of , as described in Section LABEL:sec:cv_gamma of the SM. For each value of considered, we calculate a true positive rate and a false positive rate.
For a given statistic method and a given choice of threshold, we declare a false positive if the absolute value of the statistic for the th site exceeds the threshold and . We declare a true positive if the absolute value of the statistic for the th site exceeds the threshold and . For each method, we calculate true positive and false positive rates for a sequence of threshold values.
Figures 3 and 4 display the average true positive rate (TPR) as a function of the false positive rate (FPR) for JADE and the three testbased methods, for the two error structures described in Section 4.1.1, averaged over 100 simulations. Colored points indicate the average TPR and FPR achieved with tuning parameters selected via crossvalidation for JADE, or using a false discovery rate (FDR) of for the statisticbased methods, as calculated using SLIM (Wang and others, 2011). Details of the calculation of these curves are given in Section LABEL:sec:roc of the SM.
In all settings, JADE results in a higher TPR for any fixed FPR than the competing methods. We expect JADE to perform well in the random effects setting because it pools observations within each group before smoothing, thereby averaging out individuallevel random effects. The JADE framework does not, however, account for the dependence between errors seen in the autoregressive simulations. These results show that JADE is robust, at least in this setting, to dependence between errors.
The statisticbased methods with an FDR cutoff of 10% tend to be more conservative than JADE with chosen by crossvalidation: that is, they yield fewer false positives and fewer true positives. The average FPR for JADE with chosen by crossvalidation increases for larger values of in the autoregressive settings and in the random effects settings.
In Section LABEL:sec:rl_sims of the SM, we evaluate JADE and the three test methods using a different approach, in which we treat contiguous blocks of associated sites as single discoveries.
4.2 Binomial Simulations
4.2.1 Simulation SetUp and Methods for Comparison
In this section, we use methylation sequence data to motivate our simulation setup. DNA methylation is a chemical modification that can affect cytosine residues directly followed by guanine residues (CpGs). In methylation sequencing experiments, DNA is fragmented, amplified, and bisulfite converted, a process in which nonmethylated cytosines in CpGs are converted to uracil. These fragments are then sequenced, and the uracils and cytosines at each CpG are counted. Thus, at each CpG site we obtain two numbers: the number of sequenced fragments (reads) and the number of observed uracils (counts). We analyze methylation data in Section 5. In this section we consider a simple simulation mimicking the binomial character of methylation data.
As in Section 4.1.1, we simulate observations, 10 in each of groups at evenly spaced sites. We generate the observed number of counts for the th individual in the th group at the th site as
Section LABEL:sec:binom of the SM describes the way in which , the total number of reads, is generated. No sites were permitted to have zero reads, as neither BSmooth (Hansen and others, 2012) nor methylKit (Akalin and others, 2012) can accommodate this.
In order to generate the binomial probability , we first scaled and translated and , the two mean genomic phenotype profiles displayed in Figure 2, to take on values between and . We then generated according to a random effects model, as follows:
We consider values of .
We fit JADE using the observed proportions . Due to the binomial meanvariance relationship as well as the variable read depth, the variance of is not constant across sites or observations. We can estimate the variance of as
(9) 
Here, differs from in the inclusion of pseudocounts to prevent estimates of zero variance. To accommodate these variance estimates in JADE, the diagonal elements of the matrix in (4) were set to .
In what follows, we compare JADE to two existing methods for analyzing methylation data, methylKit (Akalin and others, 2012) and BSmooth (Hansen and others, 2012). These are methylation specific implementations of methods 1 and 2 in Section 4.1.2.
The local likelihood smoothing bandwidth is fixed in BSmooth, making its performance dependent on the spacing of the measurement sites. For this comparison, we used a separation between sites of five units (). This spacing is close to what might be expected of bisulfite sequencing data in which measurements are closely spaced but not made at every basepair.
4.2.2 Results
Both BSmooth and methylKit produce a score for each site, quantifying the evidence that the mean profiles differ at that location. TPRs and FPRs for JADE, BSmooth, and methylKit were computed as described in Section 4.1.3. The results, averaged over 100 simulated data sets, are displayed in Figure 5. We find that JADE gives a higher TPR than BSmooth and methylKit for any fixed FPR in all settings.
5 Application to Methylation Data
In this section, we apply JADE to DNA methylation patterns during three stages of skeletal muscle cell development (myoblast, myotube, and adult muscle cells), using reduced representation bisulfite sequencing data from the ENCODE project (The Encode Project Consortium, 2012). Methylation data were described at the beginning of Section 4.2.
These cell lines have been studied extensively: in particular, ChIPseq peaks, DNaseI peaks, and H3K27ac marks are also available. Therefore, we are able to compare the set of differentially methylated regions (DMRs) detected by JADE with previous findings, and we can assess colocalization with other functional annotations in order to validate our results. In what follows, we will make use of the fact that there is a developmental ordering to the three cell types in our data: myoblasts precede myotubes, which precede mature muscle cells.
5.1 Analysis
We compared DNA methylation in myoblasts, myotubes, and mature skeletal muscle cells. Three technical replicates from a single cell line are available for both myoblasts and myotubes, and two technical replicates are available for mature skeletal muscle. We pooled technical replicates, and set the diagonal elements of the weight matrices in (4) equal to the inverse of the standard deviation estimates in (9). In this analysis, we only examined chromosome 22.
The locations at which DNA methylation can occur, CpG sites, are irregularly distributed throughout the genome. Since there is no biological reason to smooth across very long distances containing no CpG sites, this irregular spacing provides a natural way to segment the genome. We divided the chromosome into segments such that neighboring CpG sites within a segment are separated by less than 2 kb, and the first and last CpG of each segment is measured in all three cell types. Segments with fewer than 20 CpGs were removed. This resulted in 477 segments with an average segment length of 3.0 kb and an average of 64 CpG sites per segment. Running JADE on each of the 477 segments in parallel, with 5fold crossvalidation, required computing efforts equivalent to running 120 cores for two days.
Neither methylKit nor BSmooth can be directly applied to this data, since both methods are intended for a twogroup comparison, and in this data set we have three groups.
5.2 Results
5.2.1 DMRs Identified by JADE
We applied JADE to each of the 477 segments on chromosome 22, with and selected using 5fold crossvalidation as described in Section 3.2 and Section LABEL:sec:cv_gamma of the SM, and with as described in Section 3.1.
We declared a DMR as any contiguous set of CpGs at which two or more profiles are separated in the JADE output. Adjacent DMRs separated by a single CpG are combined to form a single DMR. We removed 48 regions containing 536 basepairs (0.3% of basepairs in DMRs) in which two pairs of profiles are fused (separation ) while the third pair is unfused (separation ), since such a pattern does not give a valid partition of the three profiles.
JADE identified 220 DMRs in 127 segments, with an average length of 826 basepairs. An example JADE fit is shown in Figure 5(b). In this segment, two DMRs have been identified (shaded in blue). In the DMR on the left, all three profiles are separated, while on the right, the myotube and myoblast profiles are fused.
A single DMR may contain multiple partitions of the profiles, though those in Figure 5(b) each contain only one partition. The 220 DMRs identified by JADE can be divided into 380 subregions, each of which contains only a single partition of the profiles.
5.2.2 CoLocalization of DMRs with Genetic and Epigenetic Landmarks
In order to assess the quality of the DMRs detected by JADE, we evaluate their overlap with epigenetic annotations and genetic landmarks. We expect the DMRs detected by JADE to be enriched for some of these genetic features.
We consider epigenetic annotations obtained in myotubes and mature skeletal muscle, available from ENCODE. (Annotations obtained in myoblasts were not available.) These annotations include (i) transcription factor binding sites identified through ChIPseq; (ii) active enhancer regions indicated by H3K27ac histone methylation; and (iii) DNase I hypersensitive sites, which mark open chromatin.
We also consider three types of genetic landmarks: (i) CpG islands, annotated in the UCSC genome browser. Evidence suggests that methylation in these regions affects gene expression (Bell and others, 2011; Illingworth and Bird, 2009). (ii) CpG island shores, defined as the 2 kb flanking regions of islands. Using BSmooth, Irizarry and others (2009) found that DMRs between colon cancer and healthy colon cells tend to be located in CpG island shores. (iii) The 2 kb flanking regions of transcription start sites (TSSs), annotated in the Gencode project (Harrow and others, 2012) and available from ENCODE annotations. These 2 kb flanking regions serve as proxies for promoter regions, which are typically located immediately upstream of the TSS.
We tested whether the number of detected DMRs overlapping each genetic feature differed from what would be expected by chance. Details are given in Section LABEL:sec:meth_null of the SM. We found that the DMRs identified by JADE are enriched in TSS flanking regions (Table 1).
Next, we restricted our analysis to the subregions of DMRs detected by JADE that are consistent with increasing methylation over the course of development (MyoblastMyotubeMature) or decreasing methylation over the course of development (MyoblastMyotubeMature). These two groups account for approximately 58% of all basepairs in detected DMRs, as discussed in Section LABEL:sec:methylpattern of the SM. The results can be found in Table LABEL:tab:gain_loss of the SM. We found that the rate of overlap with CpG islands is higher in lossofmethylation than in gainofmethylation DMR subregions. This pattern is consistent with prior suggestions that demethylation plays a major role in upregulating cell type specific gene expression over the course of development (Segalés and others, 2014; Hupkes and others, 2011). We also found that the rates of overlap with both DNaseI hypersensitive sites and H3K27ac modifications are higher in gainofmethylation than in lossofmethylation DMR subregions.
6 Discussion
In this manuscript we propose JADE, a flexible method for the analysis of genomic phenotypes measured in two or more conditions. JADE combines smoothing and group comparison into a single optimization problem, resulting in improved power over competing methods.
In addition to gains in power for simple comparisons of two groups, JADE offers a novel approach for analyzing data with respect to a categorical outcome. Although the BSmooth and methylKit frameworks could be extended to categorical outcomes by using a oneway ANOVA, categorical outcomes are typically analyzed by performing pairwise comparisons, as in Carrió and others (2015), or oneversusall comparisons. For example, tissuespecific DMRs have been identified by finding regions for which one cell type differs from the average over all other cell types (Irizarry and others, 2009; The Encode Project Consortium, 2012). JADE is able to identify DMRs across categorical outcomes, and determine the order and grouping of profiles within DMRs.
JADE is implemented as an R
package jadeTF
, currently available at the author’s website https://github.com/jean997/jadeTF.
7 Description of Supplementary Materials
The reader is referred to the online Supplementary Materials for technical appendices.
8 Funding
D. W. was partially supported by a Sloan Research Fellowship, NIH Grant DP5OD009145, and NSF CAREER DMS1252624. N.S. was supported by NIH Grant DP5OD019820. J.M. was supported by NIH Grant DP5OD019820 and NIH Grant F31HG008572. Methylation data analysis made use of resources provided by a Google Cloud Credits Award from Google Research.
References
 Akalin and others (2012) Akalin, Altuna, Kormaksson, Matthias, Li, Sheng, GarrettBakelman, Francine E, Figueroa, Maria E, Melnick, Ari and Mason, Christopher E. (2012, oct). methylKit: a comprehensive R package for the analysis of genomewide DNA methylation profiles. Genome Biology 13(10), R87.
 Bell and others (2011) Bell, Jordana T, Pai, Athma a, Pickrell, Joseph K, Gaffney, Daniel J, PiqueRegi, Roger, Degner, Jacob F, Gilad, Yoav and Pritchard, Jonathan K. (2011, jan). DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biology 12(1), R10.
 Boyd and others (2010) Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja and Eckstein, Jonathon. (2010). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning 3(1), 1–122.
 Carrió and others (2015) Carrió, Elvira, DíezVillanueva, Anna, Lois, Sergi, Mallona, Izaskun, Cases, Ildefonso, Forn, Marta, Peinado, Miguel A. and Suelves, Mònica. (2015). Deconstruction of DNA Methylation Patterns During Myogenesis Reveals Specific Epigenetic Events in the Establishment of the Skeletal Muscle Lineage. Stem Cells 33(6), 2025–2036.
 Hansen and others (2012) Hansen, Kasper D, Langmead, Benjamin and Irizarry, Rafael a. (2012, oct). BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology 13(10), R83.
 Harrow and others (2012) Harrow, J., Frankish, A., Gonzalez, J. M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B. L., Barrell, D., Zadissa, A., Searle, S., Barnes, I., Bignell, A., Boychenko, V., Hunt, T., Kay, M., Mukherjee, G., Rajan, J., DespacioReyes, G., Saunders, G., Steward, C., Harte, R., Lin, M., Howald, C., Tanzer, A., Derrien, T., Chrast, J., Walters, N., Balasubramanian, S., Pei, B., Tress, M., Rodriguez, J. M., Ezkurdia, I., van Baren, J., Brent, M., Haussler, D., Kellis, M., Valencia, A., Reymond, A., Gerstein, M., Guigo, R. and others. (2012, sep). GENCODE: The reference human genome annotation for The ENCODE Project. Genome Research 22(9), 1760–1774.
 Hastie and others (2009) Hastie, Trevor, Tibshirani, Robert and Friedman, Jerome. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second edition. Springer.
 Hebestreit and others (2013) Hebestreit, Katja, Dugas, Martin and Klein, Hans Ulrich. (2013, jul). Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 29(13), 1647–1653.
 Heinzl and Tutz (2014) Heinzl, Felix and Tutz, Gerhard. (2014, jan). Clustering in linearmixed models with a group fused lasso penalty. Biometrical Journal 56(1), 44–68.
 Hocking and others (2011) Hocking, Toby Dylan, Joulin, Armand, Bach, Francis and Vert, JeanPhilippe. (2011). Clusterpath : An Algorithm for Clustering using Convex Fusion Penalties. Proceedings of the 28th International Conference on Machine Learning (ICML).
 Hupkes and others (2011) Hupkes, M., Jonsson, M. K. B., Scheenen, W. J., van Rotterdam, W., Sotoca, a. M., van Someren, E. P., van der Heyden, M. a. G., van Veen, T. a., van Ravesteinvan Os, R. I., Bauerschmidt, S., Piek, E., Ypey, D. L., van Zoelen, E. J. and others. (2011, nov). Epigenetics: DNA demethylation promotes skeletal myotube maturation. The FASEB Journal 25(11), 3861–3872.
 Illingworth and Bird (2009) Illingworth, Robert S. and Bird, Adrian P. (2009, jun). CpG islands  ’A rough guide’. FEBS Letters 583(11), 1713–1720.
 Irizarry and others (2009) Irizarry, Rafael a, LaddAcosta, Christine, Wen, Bo, Wu, Zhijin, Montano, Carolina, Onyango, Patrick, Cui, Hengmi, Gabo, Kevin, Rongione, Michael, Webster, Maree, Ji, Hong, Potash, James B, Sabunciyan, Sarven and others. (2009, feb). The human colon cancer methylome shows similar hypo and hypermethylation at conserved tissuespecific CpG island shores. Nature genetics 41(2), 178–186.
 Jaffe and others (2012) Jaffe, Andrew E., Murakami, Peter, Lee, Hwajin, Leek, Jeffrey T., Fallin, M. Daniele, Feinberg, Andrew P. and Irizarry, Rafael a. (2012, feb). Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. International Journal of Epidemiology 41(1), 200–209.
 Johnson (2013) Johnson, Nicholas a. (2013, apr). A Dynamic Programming Algorithm for the Fused Lasso and L Segmentation. Journal of Computational and Graphical Statistics 22(2), 246–260.
 Kim and others (2009) Kim, SeungJean, Koh, Kwangmoo, Boyd, Stephen and Gorinevsky, Dimitry. (2009). Trend Filtering. SIAM Review 51(2), 339–360.
 Pelckmans and others (2005) Pelckmans, K., De Brabanter, J., Suykens, J. A. K. and De Moor, B. (2005). Convex Clustering Shrinkage. In: Workshop on Statistics and Optimization of Clustering Workshop (PASCAL), Number i.
 Reinsch (1971) Reinsch, Christian H. (1971). Smoothing by spline functions. II. Numerische Mathematik 16(5), 451–454.
 Segalés and others (2014) Segalés, Jessica, Perdiguero, Eusebio and MuñozCánoves, Pura. (2014, sep). Epigenetic control of adult skeletal muscle stem cell functions. FEBS Journal 282(9), 1571–1588.
 Shim and Stephens (2015) Shim, Heejung and Stephens, Matthew. (2015, jul). Waveletbased genetic association analysis of functional phenotypes arising from highthroughput sequencing assays. The Annals of Applied Statistics 9(2), 665–686.
 Stockwell and others (2014) Stockwell, Peter a., Chatterjee, Aniruddha, Rodger, Euan J. and Morison, Ian M. (2014, mar). DMAP: Differential methylation analysis package for RRBS and WGBS data. Bioinformatics 30(13), 1814–1822.
 Sturm (1999) Sturm, Jos F. (1999). Using SeDuMi 1.02, A Matlab toolbox for optimization over symmetric cones. Optimization Methods and Software 11(14), 625–653.
 The Encode Project Consortium (2012) The Encode Project Consortium. (2012, sep). An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74.
 Tibshirani and others (2005) Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji and Knight, Keith. (2005, feb). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society. Series B: Statistical Methodology 67(1), 91–108.
 Tibshirani and Wang (2008) Tibshirani, Robert and Wang, Pei. (2008, jan). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9(1), 18–29.
 Tibshirani (2014) Tibshirani, Ryan J. (2014, feb). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics 42(1), 285–323.
 Tütüncü and others (2003) Tütüncü, R. H., Toh, K. C. and Todd, M. J. (2003). Solving semidefinitequadraticlinear programs using SDPT3. Mathematical Programming, Series B 95(2), 189–217.
 Wang and others (2011) Wang, Hong Qiang, Tuominen, Lindsey K. and Tsai, Chung Jui. (2011, jan). SLIM: A sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 27(2), 225–231.
Genetic Feature  Total (N=220)  Fold  PValue 

Cpg Islands  119 (54.1%)  1.13  0.25 
CpG Island Shores  70 (31.8%)  0.99  1 
Transcription Start Sites  112 (50.9%)  1.32  0.011 
TF Binding Sites  25 (11.4%)  1.03  1 
DNase I HS Sites  95 (43.2%)  1.04  0.77 
H3K27ac Modifications  36 (16.4%)  0.77  0.22 