[
Abstract
Motivation: Second generation sequencing technologies are being increasingly used for genetic
association studies, where the main research interest is to identify sets of genetic
variants that contribute to various phenotype. The phenotype can be univariate disease status, multivariate responses and even highdimensional outcomes. Considering the genotype and phenotype as two complex objects, this also poses a general statistical problem of testing association between complex objects.
Results: We here proposed
a similaritybased test, generalized similarity U (GSU), that can test the association between complex objects. We first studied the theoretical properties of the test in a general setting and then focused on the application of the test to sequencing association studies. Based on theoretical analysis, we proposed to use Laplacian kernel based similarity for GSU to boost power and enhance robustness. Through simulation, we found that GSU did have advantages over existing methods in terms of power
and robustness. We further performed a whole genome sequencing (WGS) scan for Alzherimer’s Disease Neuroimaging Initiative (ADNI) data, identifying three genes, APOE, APOC1 and TOMM40, associated with imaging phenotype.
Availability: We developed a C++ package for analysis of whole genome sequencing data using GSU. The source codes can be downloaded at https://github.com/changshuaiwei/gsu.
Contact: weichangshuai@gmail.com
Generalized Similarity U]Generalized Similarity U:
A Nonparametric Test of Association Based on Similarity
xx \artmonthxx \doixxxxxxx/xxxxxxxxxxxxxx
eighted U Statistic; Sequencing Study; Nonparametric Statistics.
1 Introduction
The ongoing sequencing studies allowed researchers to comprehensively investigate the role of a deep catalog of human genome variations in complex diseases(Cirulli and Goldstein, 2010). Although these studies hold great promise for uncovering novel diseaseassociated variants, the massive sequencing data bring tremendous computational and statistical challenges to data analysis. Sequencing data is characterized with highdimensionality and sparsity, where a large portion of genetic variants are rare variants with minor allele frequency (MAF) smaller than 5%. Even with a large effect size, a rare variant is hard to detect because of its low MAF. Moreover, the massive number of rare variants raises computational burden and multiple comparison issue.
The common strategy is to perform a joint association test, namely, testing the joint effect of a set of single nucleotide variants (SNVs) on a genomic region, a functional unit (e.g., a gene) or a functional pathway. By combining multiple SNVs, the association information is aggregated and and the number of tests is greatly reduced. Among these approaches, methods based on variance component score tests (VCscore) are widely used (Lin, 1997; Wu et al., 2011). The methods considered the effects of the multiple variants as a random effect, and then test the effect by testing the variance component under the framework of the linear mixed model or the generalized linear mixed model.
There are also increasing interests in studying shared genetic contribution to multivariate phenotype. The multivariate phenotype can be multiple measurements evaluating different aspects of a disease, which better reflect the underlying biological mechanism of the disease. It can also be multiple disease phenotypes that used for studying comorbid genes or pleiotropic gene(Dick and Agrawal, 2008). A few methods can test the association of SNVset with multivariate phenotype, yet, most of the current methods can not handle multivariate phenotype when the outcome variables are of different types (e.g., some variables are binary while others are continuous). Besides conventional multivariate phenotype, modern data types, such as shapes, images and trees, are emerging in biomedical researches. These complex objects are difficult to be integrated in traditional statistical frameworks, whose primary interests are variables in vector spaces(e.g. continuous, ordinal and categorical variables). Yet, it is relatively easy to define distance metric or similarity metric for complex objects. As a consequence, many distance and similarity based methods have been proposed for modern data analysis.
In this paper, we present a similaritybased test using U statistic, referred to as the Generalized Similarity U test (GSU). GSU can be used to test the association of highdimensional and sparse predictors with univariate, multivariate or complexobject responses from sequencing association studies or other association studies. We first studied the theoretical properties of GSU in a general setting in Section 2, where we investigated the finitesample properties and asymptotic properties of the test. In section 3, we then focused on the applications of GSU to genetic sequencing studies. Extensive simulation studies were conducted to evaluated the performance of GSU in section 4, followed by a whole genome sequencing data application in section 5.
2 Generalized Similarity U
2.1 General Setting and Rational
We start with a formal setup. Let and be a probability space and a metric space respectively. Consider random elements and taking values on metric space and with distribution and respectively. Here, the random elements can be random variables (e.g., ), random vectors (e.g., ), random matrix (e.g., ), random graph (e.g., trees), or random objects (e.g., shapes).
Let and denote the realization of the random response element and random predictor element . Given a sample of data , we are interested in testing the association of response and predictor . Since and may not live in a natural vector space, it is not straight forward to construct a regression model, such as . However, it is easier to construct similarity measurement for pairs and with the distance metrics and . Intuitively, if and are associated, then high similarity between and should lead to high similarity between and .
The similarity measurement can be defined by a realvalued function quantifying the similarity between two elements. For example, we can define the similarity between and as , so that the closer and are in the metric space, the more similar and are. Other possible transformations from distance to similarity include inverse transformation (for some ) and thresholding transformation ( for some ). Loosely speaking, any monotonically nonincreasing function can be used to transform distance to similarity.
Here, we list some examples of similarity measurements.
Example 1 (vector similarity): Let . We can use Gaussian kernel () or crossproduct kernel () to measure similarity. Here, can be considered as a transformation from Euclidean distance , using the fact that .
Example 2 (graph similarity): Let be a graph with adjacency matrix , where is the set of vertices and is the set of edges. For any two graphs and , we can construct a product graph , with adjacency matrix . The similarity between the two graphs can be calculated using random walk on the product graph, (Vishwanathan et al., 2010), where is the length of the random walk, is the weight for size random walk, is the initial probability for vertices on , is the transition probability obtained from , and is the stopping probability for vertices on . Beside random walk, graph similarity can also be calculated using graphlet and subtree pattern.
Example 3 (image similarity): Image similarity can be calculated from local features and global features of the images by using traditional computer vision techniques such as scale invariant feature transformation (SIFT) and histogram of gradients (HOG). Both SIFT and HOG are human designed feature extraction. With large data sets, we can use modern machine learning methods, such as deep neural network (LeCun et al., 2015), to automatize the feature extraction, and construct more meaningful image similarity from high level representation of image.
2.2 A Motivating Model
Given the predictor elements and the response elements for the subjects and ,we denote their response similarity by,
and denote their predictor similarity by,. The similarity measurements and can be of a general form as long as they satisfy the finite second moment condition, i.e., and , where and ( and ) are independent identical copy of (). We center the response similarity by,
(1) 
and center the predictor similarity, , in the same manner. Based on the definition of the centered similarity, we can show that and (Supplementary Appendix S1).
We can investigate the relationship of the two similarities using a similarity regression model (Elston et al., 2000; Tzeng et al., 2009),
Since the similarities have been centered, the regression has zero intercept. The association can then be evaluated by testing null hypothesis , where can be estimated by, By the form of , testing is equivalent to testing the numerator , where . As we shall see soon, is in the same form as the generalized similarity U.
2.3 Weighted U Statistic
The generalized similarity U (GSU) is defined as the summation of the centered response similarities weighted by the centered predictor similarities,
(2) 
where is considered as the weight function and is considered as the U kernel. In our definition of GSU, the role of response similarity and predictor similarity are interchangeable. In other words, we can also treat as the weight function and as the U kernel.
Under the null hypothesis, when the predictor element is independent of response element (i.e., ), we have . Under alternative hypothesis, when is associated with , we expect that the response similarity is concordant with the predictor similarity. In other words, the positive response similarities are weighted heavier and the negative response similarities are weighted lighter, leading to a positive value of U statistic. A statistical test can be formed to test the association, and pvalue can be calculated by under null hypothesis, where is the observed value of .
Define a population parameter as
It is easy to show that GSU is an unbiased estimator of , i.e., . In addition, knowing that we can consider as a population covariance. In this sense, a scale invariant “correlation”, , can be calculated, as an indicator of strength of association.
2.4 Strongly Positive Definite Similarity
We have already shown that , which ensures the correct type I error. It is of interest now whether , so that we can control type II error (i.e., improve power) and reject null hypothesis whenever . The establishment of needs additional assumptions on the similarity measurements and metric spaces. For the completeness, we first introduce several preliminaries.
Define a “kernel” as a real symmetric function . A kernel is called positive definite if , and . A kernel is called negative definite if , and .
A positive definite kernel is called strictly positive definite when the equality implies . The kernel function here can be used to define similarity measurement. To consider , however, we need the kernel function to exhibit a property of “strong” positive definiteness in the integral form. Using similar notions of Rachev et al. (2013), we define a strongly positive definite kernel as follows.
Definition 1: Let Q be a finite positive measure on and be a function integrable with respect to Q. We say is strongly positive definite if it is positive definite and the equality implies a.e. Q.
Let be a finite signed measure dominated by s.t. . For strongly positive definite kernel , the equality implies . Now let be a measure on , we can show (in Supplementary Appendix S1) that
If the tensor product kernel is strongly positive definite, then implies (i.e., ). In fact, we can show as long as and are both strongly positive definite.
Theorem 2: Assume both and are strongly positive definite. Let and be the centered similarities as defined in (1). Define . Then, .
The proof is given in Appendix A by employing measures embedding into the reproducing kernel Hilbert space. Many popular kernels such as radial basis kernel () are strongly positive definite kernel on (Sriperumbudur et al., 2010). However, the cross product kernel is not strongly positive definite on , by observing that .
2.5 Asymptotic Test
For high dimensional data, it is computationally expensive to calculate pvalues using permutation. Here, we derive the asymptotic distribution of GSU under null hypothesis.
By considering the predictor similarity as the weight function and the response similarity as the U kernel, GSU is a weighted U statistic (Lindsay et al., 2008; Wei et al., 2016). More specifically, because its kernel satisfied (Supplementary Appendix S1), GSU is a degenerated weighted U statistic. To derive the limiting distribution of GSU, we can decompose the centered response similarity by, where and are eigenvalues and eigenfunctions of the U kennel , and all the eigenfunctions are orthogonal, i.e., equals 0 if and equals 1 if . Similarly, we can decompose the centered predictor similarity by, . We can then rewrite the GSU as,
where and . Using the form above, we can show that the limiting distribution of GSU is a weighted sum of independent chisquare random variables. This is the result of theorem 3 below, which is proved in Appendix B.
Theorem 3: Assume , , and . Let and be the centered similarities as defined in (1). Define U as . Then, , where are independent chisquare random variables with 1 degree of freedom.
Using the similar techniques, we can show that a weighted V statistic in the following form, also converges to a weighted sum of chisquared variables, i.e.,
2.6 Power and Sample Size
In this subsection, we derive the asymptotic distribution of GSU under the alternative hypothesis, and provide asymptotic power and sample size calculations for association analysis.
Denote . Assume under the alternative hypothesis that and . Using the Hoeffding projection, we can show that GSU asymptotically follows a normal distribution, with mean and variance . This is the result of Theorem 4, which is proved in Supplementary Appendix S4.
Theorem 4: Let and be the centered similarities as defined in (1). Suppose Y is associated with G, and the following conditions are satisfied: , , and . Define as . Then, .
The power of GSU at the significance level can be calculated by, where is the quantile for and is the CDF of a standard normal distribution. The sample size required to achieve power can be calculated by solving . By denoting as the quantile for a standard normal distribution, the required sample size is given by,
3 Generalized Similarity U for Sequencing Association
3.1 Settings for Sequencing Data Analysis
In a sequencing association study, the response element is called phenotype and the predictor element is called genotype. Common forms of phenotype and genotype are scalars or vectors. Suppose that subjects are sequenced in a study, where we are interested in testing the association of L phenotypic variables (, , ) with M genetic variants (, , ). For each subject , we observe a phenotype vector ( ) and a genotype vector ( ). In the special case when (or ), it is simplified to a univariate analysis (or a singlelocus analysis). When (or ), it extends to a multivariate analysis (or a multilocus analysis). Here, we allow multiple phenotypes to be of different types (e.g., continuous or categorical), and do not assume any distribution of phenotypes. The number of genetic variants and the number of phenotypes can be larger than the sample size. For example, the genetic data can be sequencing data (high dimensional genotype) and the phenotype data can be imaging data (high dimensional phenotype).
3.2 Similarity Measurement
The choices for the phenotype similarity and the genetic similarity are flexible. According to different types of genetic variants and the purpose of the analysis, we can choose different types of phenotype similarities and genetic similarities.
For phenotype similarity, one popular approach is to use a cross product kernel, i.e., (Tzeng et al., 2009). Yet, as discussed in previous theoretical analysis, cross product kernel may not fit for robust association analysis. Here, we propose a similarity measurement for both categorical and continuous phenotype using radial basis kernel with L1 norm (Laplacian Kernel),
where represents the weight for the th phenotypes given based on prior knowledge. If there is no prior knowledge, we can use an equal weight, . The Laplacian Kernel (LK) based phenotype similarity can be modified to take the correlation among the phenotypes into account, where . can be chosen to reflect the correlations among the phenotypes. For example, we can define as,
For the categorical SNVs data, the popular way of measuring genetic similarity is to use IBS function or the weighted IBS function(Lynch and Ritland, 1999). Assuming the genetic variants (, , ) are coded as 0, 1 and 2 for AA, Aa and aa respectively, the IBSbased genetic similarity is defined as, Alternatively, the weightedIBS (wIBS) genetic similarity can be defined to emphasize the effects of rare variants, where represents the weight for the th SNV in the SNVset, and is a scaling constant, defined as . is usually defined as a function of minor allele frequency (MAF, denoted as ). For example, the weight can be calculated using inverse variance, i.e., . However, IBSbased similarity can not be used for other genetic data, such as copy number variation (count) or expression data (continuous). Here, we propose a unified LKbased genetic similarity by generalizing wIBS,
where can be categorical, count or continuous variables, and can be calculated as function of variance of , i.e., .
Thus, we defined a unified measurement for genetic similarity and phenotype similarity with Laplacian kernel . Since laplacian kernel is strongly positive definite, we know that (from Theorem 2) the corresponding GSU has the property , so that it can control type II error for detection of any types of association. Since Laplacian kernel is bounded similarity measurement, i.e., and , we know the regularity conditions in Theorem 3 is satisfied and the asymptotic test for corresponding GSU is robust against distribution assumptions (for large sample size).
3.3 Computation and Covariates Adjustment
Let and be the matrix form of the phenotype similarity and genetic similarity, the centered similarity matrices and can be obtained by, and where is an nbyn identity matrix, and is an nbyn matrix where all elements are (Supplementary Appendix S5). Then GSU can be expressed as, In this form, can be viewed as a sum of the elementwise product of the two matrices, and , which are obtained by assigning 0 to the diagonal elements of matrices and .
To allow for covariates adjustment, we can perform two sided projection on the zerodiagonal centered similarity matrices, and . Suppose that there are covariates that need to be adjusted. Let represents the covariate matrix, we can calculate the covariate centered similarity matrices by (Supplementary Appendix S6), and The covariate adjusted GSU can be expressed as,
We include the diagonal terms in the covariateadjusted similarities because they also contain the similarity information after the adjustment. In fact, the covariateadjusted GSU is a weighted V statistic, and its asymptotic distribution can be attained similarly as weighted U statistic. We use matrix eigendecomposition to approximate the eigenvalues in function decomposition. Let and be the eigenvalues for matrices and respectively, the limiting distribution of is given by (Supplementary Appendix S7),
where are independent chisquare random variables with 1 degree of freedom. The pvalue can be calculated by using the Davies’ method (Davies, 1980), the Liu’s method(Liu et al., 2009) or the Kuonen’s method(Kuonen, 1999). To facilitate the high dimensional data analysis, we developed a C++ package based on GSU (https://github.com/changshuaiwei/gsu).
4 Simulation study
4.1 Simulation method
To mimic real genetic structure, we used genetic data from the 1000 Genome Project(Abecasis et al., 2010). Based on the genetic data, we then simulated phenotype values. In particular, we used a 1Mb region of the genome (Chromosome 17: 73443288344327) from the 1000 Genome Project. For each simulation replicate, we randomly chose a 30kb segment from the 1Mb region and formed a SNVset for the analysis, in which only rare variants (i.e., ) are used except otherwise specified. From the SNVset, we set a proportion of the SNVs as causal. A number of individuals were randomly chosen from the total 1092 individuals as the simulation sample to study the performance of the methods. We set sample size by default.
To investigate the robustness against different phenotype distributions, we simulated four types of phenotypes:

A binarydistributed phenotype (denoted as B), by ,

A Poissondistributed phenotype (denoted as P), by ,

A Gaussiandistributed phenotype (denoted as G), by ,

And a Cauchydistributed phenotype (denoted as C), by ,
Here, and were the phenotype value and the genotype vector (coded as 0, 1, and 2) for the th individual, respectively. We set except otherwise specified. were the effects of the SNVs, which were sampled from a uniform distribution with a mean of and a variance of .
Three sets of simulations were performed. In simulation I, we considered a single phenotype; in simulation II, we considered multivariate phenotype; in simulation III, we considered multivariate phenotype under the influence of confounding effects. Details of simulation settings are in Supplementary Appendix S8.
We evaluated the performance of GSU by comparing it with variance component score (VCscore) test under univariate or multivariate linear mixed model (Wu et al., 2011; Maity et al., 2012). For each simulation, we created 1000 simulation replicates to evaluate type I error and power. Type I error rates and powers are calculated using percentage of pvalues smaller than a given threshold (e.g., 0.05) under null models and alternative models respectively.
4.2 Result for Simulation I
The type I error rates and powers are summarized in Table 1. GSU had a wellcontrolled type I error (around 0.05) for all 4 phenotypes, while VCscore had an inflated type I error rates (0.113) for Cauchydistributed phenotype and overconservative type I error rates (0.005) for Binarydistributed phenotype.
For the disease model where half of the causal SNVs were deleterious (Table 1), GSU had slightly lower power than VCscore for Gaussiandistributed (0.258 v.s. 0.345) and Poissondistributed phenotype (0.506 v.s. 0.651), but had significantly higher power than VCscore for Cauchydistributed (0.503 v.s. 0.21) and Binarydistributed phenotype(0.402 v.s. 0.083). The same comparison was observed for the second disease model in which a majority of the SNVs were deleterious.
We performed additional simulations by including both common and rare variants (Supplementary Table S4). Under this setting, the power of VCscore increased significantly for Binary phenotype (0.764), though still lower than that of GSU (0.807). GSU attained higher power than VCscore for Poisson (0.813 v.s. 0.795) and Cauchy (0.885 v.s. 0.573) phenotype. Nevertheless, GSU was still less powerful than VCscore for Gaussian phenotype (0.853 v.s. 0.878).
4.3 Result for Simulation II
The type I error rates and powers for the multivariate analysis are summarized in Table 2. Similar to the results of the univariate analysis, GSU can correctly control type I error at the level of 0.05 (Table 2), while VCscore had inflated type I error when the phenotype contained variables with heavy tailed distribution (e.g., CGG and BCG). GSU attained higher power than VCscore for BBG, CGG and BCG phenotypes, and similar power as VCscore for BPP phenotype.
We examined the type I error rates at more stringent significance levels (Table 3) by simulating 1 million replicates. In general, GSU can control the type I error better than VCscore. For example, at and for BPP phenotype, GSU had type I error near (i.e., ), while, VCscore had type I error much higher than (i.e., ). While simulation demonstrated robustness of GSU over VCscore on controlling type I error, we observe GSU has slightly inflated type I errors at level. We suspect this is because of the small sample size. We therefore conducted another set of simulation with sample size of 200, and the results showed type I errors of GSU are better controlled for stringent significant levels under larger sample size (Supplementary Table S5).
To separate influences of different distributions, we also compared GSU and VCscore when phenotype have the same distributions (i.e., BBB, CCC, GGG, PPP). The results (Supplementary Table S6) are similar to those for univariate phenotype. In general, GSU can control type I errors better than VCscore. GSU had slightly lower power than VCscore for GGG phenotype (0.882 v.s. 0.958) and PPP phenotype (0.862 v.s. 0.966), but attained significantly higher power for BBB phenotype (0.862 v.s. 0.26) and CCC phenotype (0.724 v.s. 0.284). We further increased the dimension of phenotype to 10 for each type, and the comparisons showed that GSU have better control of type I error and attain higher power for most cases (Supplementary Figure S1).
4.4 Result for Simulation III
We summarized the type I errors in Table 4. Without covariates adjustment, both methods had inflated type I errors. With covariates adjustment, GSU showed robustness against confounding effects for all 4 multivariate phenotypes, with type I errors ranging from 0.056 to 0.061. VCscore can control type I error for BBG phenotype (0.054), but had inflated type I errors, ranging from 0.174 to 0.322, for the other 3 multivariate phenotypes.
In Figure 1, we generated the power curves by plotting the powers of the two methods against different sample sizes (50 to 200). GSU has higher power than VCscore for different sample sizes and multivariate phenotypes, except for BPP phenotype. The “higher power” of VCscore for BPP phenotype is due to the fact that VCscore has inflated type I error (i.e., 0.322, as shown in Table 4).
5 Real Data Application
We analyzed the whole genome sequencing data (WGS) from Alzherimer’s Disease Neuroimaging Initiative (ADNI) using the GSU C++ package. ADNI is a large scale longitudinal study that collects and utilize various predictors of Alzherimer’s Disease, including 3D brain imaging, cognitive measurements and genetic data. The sample with WGS data contains 808 individuals, with 280 Normal Controls (NC), 234 Early Mild Cognitive Impaired patients (EMCI), 246 Late Cognitive Impaired patients (LMCI), and 48 Alzheimer’s Disease patients (AD) at study baseline.
Whole genome sequencing was performed on autosomal chromosomes for each subject. To form SNVset, we group the genetic variants based on the gene range list from GRch37 assembly, where we only used the nonoverlapping genes. For genetic variants outside of gene ranges, we group them by evenly spacing the remaining genome with windows of 50kb. After completing quality control(e.g., delete variants with high missing rate) and grouping process, about 21 millions genetic variants remained for analysis, forming 61683 SNVsets.
We were interested in testing the association of the SNVsets with brain imaging summary matrices considered important to cognitive impairment. In particular, we used 6 variables: 18Ffluoro2deoxyglucose (FDG), Hippocampus, Entorhinal, 8Fflorbetapir (AV45), Fusiform, and Ventricles measurements at baseline, as multivariate phenotype. The phenotype similarity is calculated using weighted Laplacian kernel, . We “fished” the weight from the case control status. In particular, we regressed the case control status on the scaled multivariate phenotype and obtained regression coefficient for th variable, where we assigned (Table S7).
In order to adjust the potential confounding effects, we included age, gender, race and top 20 genome principle components as covariates in the analysis. Two sets of whole genome association analysis were performed. For the first scan, we include both common and rare variants, while for the second scan we only include rare variants. The QQ plots (Figure S2 and Figure S3) showed no systematical bias after adjusting covariates. We listed the top 5 SNVsets for each scan in Table 5. When both common and rare variants were considered, 4 SNVsets (i.e., APOE, Ch194538930945439308, APOC1, TOMM40) pasted the Bonferroni threshold, among which the genes APOE and TOMM40 has been reported in previous studies. As a comparison, we also performed the analysis using VCscore (Supplementary Table S10). VCscore attained similar results for the top association findings, though with less significant pvalues (e.g., pvalue = for APOE).
When only rare variants are considered, no SNVset past the Bonferroni threshold. Interestingly, the gene APOC1 was listed as one of the top 5 associated genes from both analyses. Further investigation will be needed to study its role in AD. More detailed results are in Table S8 and S9. We further calculated the pvalue of the top SNVsets using AD casecontrol status instead of multivariate phenotype with 6 intermediate measurements. The univariate analysis attained less significant result (Table S11). For example, the pvalue of APOE is from analysis using AD case control status, less significant than from analysis using brain imaging matrices.
6 Discussion
Many genetic studies collect multiple secondary phenotypes, or use intermediate biomarkers, to study complex diseases. By considering multiple phenotypes that measure the different aspects of underlying diseases, the power of the association analysis can potentially be improved (Zhang et al., 2010; Maity et al., 2012). Several methods were recently developed to detect the joint effect of genetic variants on multivariate phenotype(Tao et al., 2015; Wang et al., 2015). Most were built on parametric framework that poses certain assumptions on phenotype distribution. In this paper, we proposed a nonparametric test, GSU, based on similarity measurement. Simulation study showed that our methods can can control type I error for multiple different phenotypes and moderate level of confounding effects. In most cases, GSU also attained higher power than the parametric method. Although the simulation results depend on the simulation settings, and should always be interpreted in the context of the simulation setting, we believe the results reflect the advantage of GSU in a broader sense, because 1) the genetic data used in the simulation comes from the 1000 Genome Project, which reflects the LD pattern and the allele frequency distribution in the general population; and 2) we simulated a wide range of disease models, including univariate phenotype and multivariate phenotype with different distributions, to mimic real disease scenarios.
The test statistics in VCscore is a quadratic form, , where is the standardized residual under null, and is the genetic similarity matrix. If we rewrite as , VCscore is actually a weighted V statistic with cross product kernel . In this respect, VCscore can be considered as a special case of GSU. Nonetheless, there are several key differences: 1) GSU allows general forms of similarity and thus can be used for association analysis of elements in general metric space; 2) For multivariate association analysis, GSU with LK based similarity has the ability to detect any types of association (strongly positive definite similarity) and its asymptotic test is robust against distribution assumptions (bounded similarity); 3) For covariates adjustment, GSU used a centralized similarity and then perform two sided projection, i.e., , while, VCscore performed two sided projection on original similarity (), i.e., ; 4) Asymptotic distribution of GSU is in the form of , where distribution of VCscore is in the form of ; 5) For multivariate phenotype with variables, the dimension for similarity matrix is in GSU and in VCscore.
In simulation studies, we observed higher power of GSU over VCscore. This is mainly due to the fact that GSU is equipped with strongly positive definite kernel which can detect any type of association while the cross product kernel in VCscore does not have this property. We performed another set of simulations by generating dependence structure via rotation operator (Supplementary Appendix S9). In particular, we first generate two i.i.d. multimodal continuously distributed variables and then rotate the vector with angle (Figure S4). The data generated thus does not have first order dependence structure (correlation) nor second order dependence structure. The result (Figure S5) showed that GSU ( with LKbased similarity) had power of 1 for large enough sample size, while VCscore (with cross product kernel) can not detect any association regardless of different sample sizes. Though the“toy” simulation may not represent common scenarios in genetic association studies, it empirically explains the reason why GSU attained higher power than VCscore. To further investigate the influence of different kernels, we performed simulations using 5 different kernels for GSU, including 3 strongly positive definite kernels. The result shows that GSU with strongly positive definite kernels have higher powers for the most of the time, among which GSU with LK kernel have highest power (Supplementary Figure S6). In general, we recommend to use LK kernel for GSU. Nevertheless, its performance may not guaranteed to be optimal. In this case, we can perform kernel selection, for example, by using the procedure proposed by Wu et al. (2013). Besides the choice of kernel, different choices of weights can also influence the power of GSU for multivariate phenotype. In principle, we should use weights that represent their relative importance with respect to the underlying ”true phenotype”. For example, in real data analysis, we obtained the weights based on their contributions to the AD disease status. Here in this paper, we only considered the joint effect of SNVsets. If gene environment interaction effects are to be considered, we can calculate a composite similarity using both the genetic information and environmental information (Wei et al., 2016; Tong et al., 2016), and then construct GSU with the composite similarity and the phenotype similarity.
The asymptotic test for GSU (with LKbased similarity) is shown to be robust to distribution assumption. This is because the Laplacian kernel is bounded between 0 and 1, and the resulting similarities and thus satisfy the regularity condition of asymptotic test, i.e., and . However, crossproduct kernel does not have this property. As a result, we observed that in simulation studies GSU had more robust type I errors than VCscore. Nevertheless, we still observed slightly inflated type I error with stringent significant level (e.g., ) when . This is because the asymptotic null distribution can not approximate the actually null distribution well when sample size is small compared with when sample size is large (Supplementary Figure S7). One way to improve the robustness for small sample size is to take an rank transformation for each variable (i.e., ) before calculating the similarity. We performed additional simulation for GSU with rank transformation for using same setting as simulation II. The results showed that GSU with rank transformation (GSUrk) can control type I error well even with more stringent significant level for small sample size (Supplementary Table S12). Nevertheless, rank transformation can cause loss of information, which might lead to lower power.
In simulations, we observe that VCscore, although designed for Gaussian distributed phenotype, appears to be able to control type I error appropriately and attain slightly higher power for Poisson phenotype. This may be due to that Poisson distribution can be reasonably approximated by Gaussian distribution when its mean is moderate to large. We performed additional simulation using heavily right skewed Poisson distribution, and the results showed VCscore had lower power for one simulation and inflated type I errors for another simulation (Supplementary Table S13). We can use rank transformation to improve the robustness of VCscore (Wei et al., 2016). We performed additional simulation to compare GSUrk to VCscore test with rank transformation (VCscorerk). The result (Supplementary Table S14) showed that VCscorerk can control Type I errors under various setting. However, VCscorerk still had lower powers than GSUrk for most cases.
For the analysis of multivariate phenotype, the difference on the dimension of similarity matrix for GSU and VCscore influenced the computation efficiency especially when the number of variables in multivariate phenotype increases. The key reason is the cost of the eigen decomposition. For analysis of variable multivariate phenotype in a sample of size , GSU needs to decompose a matrix, while VCscore needs to decompose a matrix. The time used for matrix decomposition are for GSU and for VCscore. For example, in real data application when , the average time to analyze one SNV set is 36.75 seconds for VCscore and 1.3 seconds for GSU. For highdimensional setting (e.g., ), VCscore is computationally infeasible. An additional simulation shows that GSU is well behaved when the dimension of phenotype increase to 100 (Supplementary Figure S8). Nevertheless, noises in high dimensional phenotype or genotype may reduce the power of GSU. In this case, dimension reduction techniques, such as variable selection and principle component analysis, can be used to increase power.
The covariate adjustment proposed in the paper is a heuristic approach for adjusting confounding effect. Accurate adjustment of confounding effects requires additional assumptions on the distributions and the functional forms between responses and covariates. In the paper, we showed GSU works well when the confounding effects are moderate. Nonetheless, the heuristic covariate adjustment in GSU should always be used with caution. If there is a strong confounding effect, the heuristic approach might not control type I error very well. For this paper, covariate adjustment is not the primary focus, and the issue will be investigated in future studies.
Besides confounding effects, the correlation among variables in multivariate phenotype may also influence the performance of association testing. This is particularly important for regression based methods, since it handles multivariate phenotype by stretching the phenotype matrix to a long phenotype vector. Without considering correlations among variables in phenotype, the test will lead to inflated type I error. Nevertheless, GSU don’t have this issue, since its similarity matrix is calculated on subject level and its inference only assume independence between subjects. We performed additional simulations by introducing additional correlation in the multivariate phenotype (Supplementary Table S15). The results showed that, in general, GSU can control type I error and attain higher power than VCscore (Supplementary Figure S9).
In recent years, Ustatisticbased methods became popular in genetic data analysis, and have shown their robustness and flexibility for analyzing genetic data(Schaid et al., 2005; Li et al., 2011; Wei and Lu, 2015; Wei et al., 2016). GSU is a general framework of association analysis and is based on similarity measurements and U statistics. In this paper, we have focused on the association analysis between multivariate phenotype and categorical sequencing data (i.e, SNV data). GSU can easily be applied to analyze other types of genetic data, such as count data (CNV data) and continuous data (expression data) with unified LKbased similarity (Section 3.2). With appropriate similarity measurement (Section 2.1), GSU can also be used for association testing of modern data types, such as imaging, curves and trees.
Acknowledgements
Data used in preparation of this article were obtained from the Alzheimer Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgment_List.pdf
Supplementary Materials
Supplementary Materials are available on line.
References
 Abecasis et al. (2010) Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., Hurles, M. E., and McVean, G. A. (2010). A map of human genome variation from populationscale sequencing. Nature 467, 1061–73.
 Berlinet and ThomasAgnan (2011) Berlinet, A. and ThomasAgnan, C. (2011). Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media.
 Cirulli and Goldstein (2010) Cirulli, E. T. and Goldstein, D. B. (2010). Uncovering the roles of rare variants in common disease through wholegenome sequencing. Nature Reviews Genetics 11, 415–425.
 Davies (1980) Davies, R. B. (1980). Algorithm as 155: The distribution of a linear combination of x2 random variables. Journal of the Royal Statistical Society. Series C (Applied Statistics) 29, 323–333.
 Dick and Agrawal (2008) Dick, D. M. and Agrawal, A. (2008). The genetics of alcohol and other drug dependence. Alcohol Research & Health 31, 111.
 Elston et al. (2000) Elston, R. C., Buxbaum, S., Jacobs, K. B., and Olson, J. M. (2000). Haseman and elston revisited. Genetic epidemiology 19, 1–17.
 Kuonen (1999) Kuonen, D. (1999). Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 86, 929–935.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444.
 Li et al. (2011) Li, M., Ye, C., Fu, W., Elston, R. C., and Lu, Q. (2011). Detecting genetic interactions for quantitative traits with ustatistics. Genetic epidemiology 35, 457–468.
 Lin (1997) Lin, X. (1997). Variance component testing in generalised linear models with random effects. Biometrika 84, 309–326.
 Lindsay et al. (2008) Lindsay, B. G., Markatou, M., Ray, S., Yang, K., and Chen, S. C. (2008). Quadratic distances on probabilities: A unified foundation. Annals of Statistics 36, 983–1006.
 Liu et al. (2009) Liu, H., Tang, Y. Q., and Zhang, H. H. (2009). A new chisquare approximation to the distribution of nonnegative definite quadratic forms in noncentral normal variables. Computational Statistics & Data Analysis 53, 853–856.
 Lynch and Ritland (1999) Lynch, M. and Ritland, K. (1999). Estimation of pairwise relatedness with molecular markers. Genetics 152, 1753–1766.
 Lyons (2013) Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability 41, 3284–3305.
 Maity et al. (2012) Maity, A., Sullivan, P. E., and Tzeng, J. Y. (2012). Multivariate phenotype association analysis by markerset kernel machine regression. Genetic Epidemiology 36, 686–695.
 Rachev et al. (2013) Rachev, S. T., Klebanov, L., Stoyanov, S. V., and Fabozzi, F. (2013). The methods of distances in the theory of probability and statistics. Springer Science & Business Media.
 Schaid et al. (2005) Schaid, D. J., McDonnell, S. K., Hebbring, S. J., Cunningham, J. M., and Thibodeau, S. N. (2005). Nonparametric tests of association of multiple genes with human disease. American Journal of Human Genetics 76, 780–793.
 Sriperumbudur et al. (2010) Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research 11, 1517–1561.
 Tao et al. (2015) Tao, R., Zeng, D., Franceschini, N., North, K. E., Boerwinkle, E., and Lin, D.Y. (2015). Analysis of sequence data under multivariate traitdependent sampling. Journal of the American Statistical Association 110, 560–572.
 Tong et al. (2016) Tong, X., Wei, C., and Lu, Q. (2016). Genomewide joint analysis of singlenucleotide variant sets and gene expression for hypertension and related phenotypes. BMC proceedings 10, 125.
 Tzeng et al. (2009) Tzeng, J.Y., Zhang, D., Chang, S.M., Thomas, D. C., and Davidian, M. (2009). Genetrait similarity regression for multimarkerbased association analysis. Biometrics 65, 822–832.
 van der Vaart and Wellner (2000) van der Vaart, A. and Wellner, J. A. (2000). weak convergence and empirical processes. Springer, 2 edition.
 Vishwanathan et al. (2010) Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. (2010). Graph kernels. The Journal of Machine Learning Research 11, 1201–1242.
 Wang et al. (2015) Wang, Y., Liu, A., Mills, J. L., Boehnke, M., Wilson, A. F., BaileyWilson, J. E., Xiong, M., Wu, C. O., and Fan, R. (2015). Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genetic epidemiology 39, 259–275.
 Wei et al. (2016) Wei, C., Elston, R. C., and Lu, Q. (2016). A weighted u statistic for association analyses considering genetic heterogeneity. Statistics in Medicine .
 Wei and Lu (2015) Wei, C. and Lu, Q. (2015). A generalized similarity u test for multivariate analysis of sequencing data. arXiv preprint arXiv:1505.01179 .
 Wu et al. (2011) Wu, M. C., Lee, S., Cai, T. X., Li, Y., Boehnke, M., and Lin, X. H. (2011). Rarevariant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics 89, 82–93.
 Wu et al. (2013) Wu, M. C., Maity, A., Lee, S., Simmons, E. M., Harmon, Q. E., Lin, X., Engel, S. M., Molldrem, J. J., and Armistead, P. M. (2013). Kernel machine snpset testing under multiple candidate kernels. Genetic epidemiology 37, 267–275.
 Zhang et al. (2010) Zhang, H. P., Liu, C. T., and Wang, X. Q. (2010). An association test for multiple traits based on the generalized kendall’s tau. Journal of the American Statistical Association 105, 473–481.
Appendix 1
Due to space limit, we here only sketch the proofs. Detailed proofs can be found at Supplementary Appendix S2 and S3.
Appendix A: embedding into Hilbert Space
For each positive definite kernel , we can construct a unique reproducing kernel Hilbert space (RKHS) with reproducing kernel (Berlinet and ThomasAgnan, 2011), such that, 1) , , 2) , , . We can write , and then represent a measure as an element in RKHS (Lyons, 2013) using an embedding map : , s.t.,. Further, if is strongly positive definite, we can show the mapping is onetoone, i.e., .
Let and . We can then write as, , where, . If , then we know , i.e.,
We then can show, by repeatedly using measure embedding, that , i.e., .
Appendix B: Proof of Theorem 3
Because of the orthogonality of and the fact that , we can show . Similarly, . Under the null hypothesis, predictor element () is independent of response element (). Therefore, for and ,
and
Therefore, for any finite subset of , the multivariate random variable converges to a multivariate normal distribution.
Then, we need to show the convergence is uniform. Notice that, . Under the condition , the infinite countable sequence of function is a Donsker class (Theorem 2.13.1 in van der Vaart and Wellner (2000)). Therefore, the empirical process,