1 Introduction

The practice of pooling several individual test statistics to form aggregate tests is common in many statistical application where individual tests may be underpowered. While selection by aggregate tests can serve to increase power, the selection process invalidates the individual test-statistics, making it difficult to identify the ones that drive the signal in follow-up inference. Here, we develop a general approach for valid inference following selection by aggregate testing. We present novel powerful post-selection tests for the individual null hypotheses which are exact for the normal model and asymptotically justified otherwise. Our approach relies on the ability to characterize the distribution of the individual test statistics after conditioning on the event of selection. We provide efficient algorithms for estimation of the post-selection maximum-likelihood estimates and suggest confidence intervals which rely on a novel switching regime for good coverage guarantees. We validate our methods via comprehensive simulation studies and apply them to data from the Dallas Heart Study, demonstrating that single variant association discovery following selection by an aggregated test is indeed possible in practice.

Post-selection estimation and testing following aggregated association tests

Ruth Heller, Department of Statistics and Operations Research, Tel-Aviv university, Tel-Aviv 6997801, Israel, E-mail: ruheller@gmail.com

Amit Meir, Department of statistics, University of Washington, Seattle, WA, E-mail: amitmeir@uw.edu.

Nilanjan Chatterjee, Department of Biostatistics, Bloomberg School of Public Health, and Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD 21205, U.S.A.,E-mail: nchatte2@jhu.edu

1 Introduction

Many modern scientific investigations involve simultaneous testing of many thousands of hypotheses. Valid testing of large number of hypotheses requires strict multiple-testing adjustments, making it difficult to identify signals in the data if the signal is weak or sparse. One possible remedy is to pool groups of related test statistics into aggregate tests. This practice reduces the amount of multiplicity correction that needs to be applied and may assist in identifying weak signals that are spread over a number of test statistics. However, once an ‘interesting’ group of hypotheses has been identified, it may also be of interest to perform inference within the group in order to identify the individual test statistics that drive the signal.

In many scientific fields, there exist a natural predefined grouping of features of interest. In neuroscience, functional magnetic resonance imaging (fMRI) studies aim to identify the locations of activation while a subject is involved in a cognitive task. The individual null hypotheses of no activation are at the voxel level, and regions of interest can be tested for activation by aggregating the measured signals at the voxel level (Penny and Friston, 2003; Benjamini and Heller, 2007). Following identification of the regions of interest, it is meaningful to localize the signal within the region. In microbiome research, the operational taxonomic units (OTUs) are grouped into taxonomic classifications such as species level and genus level. The data for the individual null hypotheses of no association between OTU and phenotype can be aggregated in order to test the null hypotheses of no association at the species or genus level (Bogomolov et al., 2017). Here as well, following identification of the family associated with the phenotype, it is of interest to identify the OTUs within the family that drive the association. In genome-wide association studies (GWAS), the disease being analyzed may have multiple subtypes of interest. The standard analysis aim is to identify the SNPs associated with the overall disease, but another important aim is to identify associations with specific sub-types of the disease (Bhattacharjee et al., 2012). In genetic association studies, there is also a natural grouping of the genome, since genes are comprised of single variants. The test statistics of single variants within a gene can be aggregated into a test statistic for powerful identification of associations at the gene level (Wu et al., 2011; Bhattacharjee et al., 2012; Derkach et al., 2014; Yoo et al., 2016). Following identification at the gene level, it may be of interest to identify the single variants within the gene that drive the association.

For a single group of features, let be the estimator for the vector of parameters of interest in the group, . Much research has focused on developing powerful aggregate tests for selecting the groups of interest, i.e., for testing at the group level the null hypothesis that . When has (approximately) a known covariance and a normal distribution, classical test-statistics are the score, Wald, and likelihood ratio statistics, all of which have an asymptotic distribution. A recent example is the work by Reid et al. (2016), which suggested novel tests that improve on classical tests. Other examples come from the field of statistical genetics, where many gene level tests have been recently proposed based on weighted linear or quadratic combinations of score statistics for analyzing genomic studies of rare variants, see Derkach et al. (2014) for a review.

In this work we seek to develop methods for conducting inference on the coordinates of following selection by an aggregate test. Failure to account for data driven selection of any kind can lead to biased inference. For example, in linear regression, if the relationship of the predictors with a response is assumed linear for a group, then selection by an aggregate test constraints the response vector to values for which the aggregate test -value is below a threshold, and the post-selection distribution of the data is no longer multivariate normal but a truncated multivariate normal. Generally speaking, ignoring the selection will result in biased inference if there is dependence between the selection event and the individual test-statistic: if the individual test-statistic contributes to the selection by the aggregate test, then conditional on being selected the distribution of the individual test statistic is changed.

Inference following selection is an emerging field of research, which is of great interest both in the statistics community and in application fields. In the multiple testing literature, Benjamini and Bogomolov (2014) presented a novel approach for the problem of inference within families of hypotheses following selection of the families by any selection rule. Marginal confidence intervals following selection are addressed in generality in Benjamini and Yekutieli (2005), from a Bayesian perspective in Yekutieli (2012), and for a specific selection rule (i.e., that the test statistic is larger than a certain threshold) in Weinstein et al. (2013). Significant progress has also been made in the regression context, where variables are first selected into the model, and inference on the selected variables follows. Failing to account for the data-driven variable selection process invalidates the inference (Pötscher, 1991; Berk et al., 2013; Fithian et al., 2014). Recent valid post model-selection procedures can be found in Berk et al. (2013); Fithian et al. (2014); Lee and Taylor (2014); Lee et al. (2016); Taylor and Tibshirani (2016); Tian and Taylor (2015); Meir and Drton (2017).

Recently, Heller et al. (2017) addressed the problem of identifying the individual studies with association with a feature, following selection of potential features by a meta-analysis of multiple independent studies. We generalize the work of Heller et al. (2017) to allow for (approximately) known dependence across the individual test statistics. In particular, this allows for valid testing of predictors in a generalized linear model that was selected via an aggregated test . We further develop methods for obtaining post-selection point estimates and confidence intervals. We discuss the computation of maximum-likelihood estimates following aggregate testing and show that in the special case of aggregate testing with a Wald test, the problem of computing the multivariate conditional maximum likelihood estimator can be cast as a simple line-search problem. We also discuss computation of post-selection confidence intervals which are based on inversion of the post-selection tests. Finally, we develop regime switching post-selection tests and confidence intervals that adapt to the unknown underlying sparsity of the signal, and thus have good power when the signal is sparse as well as when it is non-sparse.

The paper is organized as follows. In § 2 we formally introduce our inference framework and goals. We develop theory for post-selection testing and estimation in § 3 and § 4, respectively. We conduct empirical evaluation of our test-statistics and post-selection estimates in § 5. In § 6, we apply our methods to a genomic application. Finally, § 7 concludes.

2 The set-up and the inferential goals

Let with known, and suppose that we are interesting in performing inference on if and only if we can reject an aggregate test for the global-null hypothesis that . For testing the global-null hypothesis, we use a quadratic test of the form where is a semi positive-definite matrix and is the quantile of under the null-hypothesis. Setting results in the well known Wald test statistic. The developments when group selection is by a linear aggregate test are similar and detailed in Appendix LABEL:sec-linear-aggregate-test.

The value of comes from the analysis at the group level. For example, in genomics, when the group is the gene, then typically . This is because the Bonferroni procedure is commonly used for identifying genes associated with phenotypes using aggregate tests, so the FWER on the family of genes is controlled at level .

Given that an aggregate test has been rejected at a level , our aim is to infer on the parameters . For , let

Our first aim is to test the family of hypotheses if , with FWER or FDR control. The conditional FWER and FDR (introduced in Heller et al., 2017) for the selected group are, respectively, and where and are the number of false and total rejections in the group. We provide procedures for conditional FWER/FDR control in § 3.

Our second aim is to estimate the magnitude of the regression coefficients given selection. Denoting the likelihood for by , the conditional likelihood can be written as:

We propose to use the maximizer of the conditional likelihood as a point estimate, and we show how to obtain it, as well as confidence intervals, in § 4.

Example 2.1.

Generalized Linear Models. Suppose we observe a response vector , and predictors of interest in a group (e.g., the single variants in a gene), . Let be a set of additional covariates to be accounted for in the model (e.g., environmental factors or ancestry variables in GWAS). Suppose that we are interested in modelling the relationship of the predictors in a group with the response vector using a generalized linear model. So, we assume that , an exponential family distribution with canonical parameter for some continuous link function and


In the case of linear regression, is the identity function and . If explain little of the variance in (e.g., in genomic applications) it is reasonable to estimate by the empirical variance of the residuals from the linear model with .

When is not assumed to be normal, the maximum likelihood estimator for the regression coefficients has an asymptotic normal distribution and an asymptotic truncated-normal distribution post-selection. While depends on and therefore cannot be assumed to be known in general, if explain little of the variance in it is reasonable to estimate the variance of under the assumption that . More generally, Tibshirani et al. (2015) discuss conditions under which naive plug-in estimation leads to asymptotically valid inference.

3 Testing following selection

In the absence of selection, we can test for using the -value of the test statistic : where and is the unit vector with a single entry of one in position . However, conditionally on selection, will often have a distribution that is stochastically smaller than uniform, meaning that its realization will no longer be a valid -value for testing .

To correct for selection, it appears necessary to evaluate the probability that . However, this probability depends on the unknown , and hence it cannot be evaluated when is true unless we assume that all other entries in are zero. In the special case of the distribution of , conditional on is known. Of course, in practice we do not know whether any of the entries of are non-zero.

In § 3.1 we suggest a way around this problem, by computing a valid conditional -values using the polyhedral lemma first introduced by Lee et al. (2016). In practice, we find that statistical tests based on the polyhedral lemma tend to have relatively low power if is sparse, and thus, in § 3.2 we suggest an inference method that automatically adapts to the sparsity level of . In § 3.3 we discuss applying multiple testing procedures to the valid conditional -values.

3.1 The conditional -values based on the polyhedral lemma

Performing inference on is difficult because the post-selection distribution of is not location invariant and depends heavily on the unknown parameter value. Suppose that we are interested in performing inference on an arbitrary linear function of the parameter vector . Lee et al. (2016) showed that by conditioning on additional information beyond the selection event, the post-selection distribution of a single coordinate can be reduced to a univariate truncated normal distribution which only depends on single unknown parameter . Furthermore, Fithian et al. (2014) showed that such conditioning yields a (unique) family of admissible unbiased post-selection tests.

Denote by the truncated normal distribution constrained to , i.e. the conditional distribution of a random variable conditional on it being in . Let be the CDF of . The following theorem, which is a direct result of the polyhedral lemma of Lee et al. (2016), provides us with a conditional distribution of any linear contrast of that we can use for post-selection inference.

Theorem 3.1.

Let be a linear combination of , and a fixed selection threshold. Let , where and is the identity matrix. Then


where is defined in Lemma A.1.

See Appendix A for the proof.

Since the only unknown parameter in the truncated distribution of (3.1) is , it is straightforward to compute a -value under the null hypothesis and construct confidence intervals via test inversion.

Corollary 3.1.

For the estimation of , let , , and as defined in Lemma A.1. Then,

For testing , let


Then, if is true,

The following example serves to give some intuition as to how the polyhedral lemma works and the possible adverse effects of the extra conditioning on .

Example 3.1.

Independence Model. Let and suppose that we are interested in testing after rejecting the global-null hypothesis that . In this case, the relevant contrast is and the orthogonal projection is . It is clear that is independent of and therefore, conditionally on selection the only relevant information contained in is that so . Note that if we do not condition on then the support of is and this is why conditioning on often results in a loss of efficiency (Fithian et al., 2014).

3.2 A hybrid conditional -value

Our empirical investigation in Section § 5 suggests that -values that are computed based on the polyhedral lemma tend to have good power when is not sparse or has a large magnitude. However, when only a single entry in is nonzero, -values based on the polyhedral lemma (which are valid for any configuration of the unknown ) tend to be considerably less powerful than -values computed based on the distribution of under the global-null distribution (where it is assumed that ). Therefore, we would like to consider a test that adapts to the unknown sparsity of the signal, by combining the two approaches for computing -values into a single test of , allowing for powerful identification of the non-null coefficients. The combined test will be useful in applications where both groups with sparse signals and with non-sparse signals are likely.

Sampling from the truncated multivariate normal distribution is a well studied problem, see for example Pakman and Paninski (2014). Specifically, under the global null, i.e., , one can use samples from the truncated distribution to asses the likelihood of the observed regression coefficients, defining


Under the global-null distribution, both and the -value computed under the global-null distribution have a uniform distribution. However, may be larger than because it requires extra conditioning on . Thus, if the only non-zero predictor in the model is the th predictor, the test based on can be expected to be more powerful than the test based on .

On the other hand, when more than one of the coordinates of are non-zero, will often be substantially larger than the original -value , e.g., if , then . This, while may not suffer any additional loss of power due to the extra conditioning e.g., if the aggregate test passes the selection threshold regardless of the value of , then and it will clearly be smaller than .

Since the preference for using instead of depends on the (unknown) , we suggest the following test that combines the two valid post-selection -values,


Clearly, would be a valid -value, i.e., with a null distribution that is either uniform or stochastically larger than uniform, if both and are valid -values. In the previous section we indeed showed that is a valid -value. But by the definition in equation (3.3) it is only clear that is valid when . Intuitively, for , we may assume that is conservative (i.e., has a null distribution that is stochastically larger than uniform). We shall now provide a rigorous justification.

We start with the special case that the quadratic aggregate test for selection is Wald’s test. Following selection by Wald’s test, is a valid -value for testing . This follows by showing that the marginal null distribution of is at least stochastically as large as the uniform, so the test based on the global null distribution where is conservative.

Theorem 3.2.

If , and has a normal distribution with mean and variance , then

See Appendix B for the proof.

More generally, when selection is based on , where is any positive definite symmetric matrix, we can still justify the use of for testing for a large enough sample size. This follows since the conditional -values under the global null are necessarily larger than the original -values, as formally stated in the following lemma.

Lemma 3.1.

If is a positive definite matrix, , and , then


for arbitrary fixed .

See Appendix LABEL:app-lemma-gaussiancorrelationineq for the proof. Setting to be the realized test statistic and , is the lefthand side of (3.5) and is the righthand side. It thus follows that

Since regardless of the true value of if for at least one , the probability of getting a smaller value than given selection, if is true, coincides with asymptotically. So is an asymptotically valid -value if for at least one . Since , it follows that and are asymptotically valid -values for any .

3.3 Controlling the conditional error rate

In order to identify the non-null entries in , we can apply a valid multiple testing procedure on the conditional -values computed as in § 3.1 or § 3.2. We can then achieve conditional error control.

The Bonferroni-Holm procedure will control the conditional FWER, since the conditional -values are valid -values and the procedure is valid under any dependency structure among the test statistics.

For conditional FDR control, we recommend using the Benjamini-Hochberg (BH) procedure. Although the BH procedure does not have proven FDR control for general dependence among the -values, it usually controls the FDR for dependencies encountered in practice. We believe that the robustness property of the BH procedure carries over to our setting, and that the conditional FDR will be controlled in practice. The robustness guarantee follows from empirical and theoretical results (Reiner-Benaim, 2007), which suggest that the FDR of the BH procedure does not exceed its nominal level for test statistics with a joint normal distribution, and our simulations in § 5, which suggest that this holds also following selection.

A conservative procedure that will control the conditional FDR is the Benjamini-Yekutieli procedure for general dependence, introduced in Benjamini and Yekutieli (2001). The theoretical guarantee follows since the conditional -values are valid -values and the procedure is valid under any dependency structure among the test statistics.

If the individual test statistics are independent, as occurs when the design matrix is orthogonal in the linear model, and the aggregate test statistic is monotone increasing in the absolute value of each test statistic (keeping all others fixed), then we have a theoretical guarantee that the BH procedure on controls the conditional FDR, even though these conditional -values are dependent. This is a direct result of Theorem 3.1 in Heller et al. (2017), and it is formally stated in the following theorem.

Theorem 3.3.

If , , and is a diagonal matrix, then the BH procedure at level on controls the conditional FDR at level , where is the number null coefficients in .

4 Estimation following selection

So far we focused on valid testing after selection by an aggregate test. But it is often also desirable to asses the absolute magnitude of parameters of interest. Just as model selection causes an inflation of test statistics, it also has an adverse effect on the accuracy of point estimates. In fact, inflation of estimated effect sizes is the main cause for the increased type-I error rates that are encountered in naive inference following selection. In Section § 4.1 we discuss the computation of post-selection of maximum likelihood estimators which are defined as the maximizers of the likelihood of the data conditional on selection and serve to correct for some of the selection bias.

Beyond point estimates, valid post-selection confidence intervals can be constructed by inverting the post-selection tests described in Section § 3. These however, may be either underpowered in the case of confidence intervals based on the polyhedral lemma or too conservative in the case of the hybrid confidence intervals. Thus, in Section § 4.2 we propose novel regime switching confidence intervals that maintain the validity and power of the hybrid method intervals while ensuring the desired level of confidence asymptotically.

4.1 Conditional maximum likelihood estimation

Let be the log-likelihood for , and the corresponding conditional log-likelihood. Define the conditional MLE as the maximizer of the conditional likelihood:


For notational convenience, we suppress the dependence of on the selection threshold . While difficult to compute in many practical cases, computing the conditional MLE following selection by aggregate testing is a relatively simple task. For the special case where , we are able to show that the maximum likelihood estimator is given by the solution to a simple line search problem.

Theorem 4.1.

Under the conditions of Theorem 3.2, the conditional maximum likelihood estimator is given by:

where is the observed value.

See appendix LABEL:app-linesearch for the proof.

The Theorem shows that the maximum likelihood estimation is reduced to maximizing the likelihood only with respect to a scalar factor. This follows when because the distribution of the test-statistic is governed by one unknown parameter. In the general case, the distribution of is a sum of chi-square random variables which depends on parameters, making the optimization problem slightly more involved. So, for we use the stochastic optimization approach of Meir and Drton (2017) to maximize the likelihood. Let

be a sample from the post selection distribution of for a mean parameter value . Then, taking gradient steps of the form


will lead to convergence to the conditional MLE as long as

Theorem 4.2.

Suppose that and that inference is conducted only if with . Then, the algorithm defined by (4.2) converges to the conditional MLE for the post-aggregate testing problem which satisfies


The result follows from the fact that the variance of the post-selection distribution of can be uniformly bounded from above by , see Meir and Drton (2017) for details. ∎

We discuss the topic of conditional maximum likelihood estimation in Generalized Linear Models and the related problem of estimation after aggregate testing with a linear test in Appendices LABEL:appendix-mle-computation and LABEL:sec-linear-aggregate-test.

The conditional MLE is consistent assuming the following. Suppose that we observe a sequence of regression coefficient estimates such that


Furthermore, suppose that we perform inference on the individual coordinates of if and only if


The good behaviour of the conditional MLE hinges on the probability of passing the selection by the aggregate test. The lower bound on the this probability is given trivially by and therefore the conditional MLE is consistent.

Theorem 4.3.

Assume that (4.3) and (4.4) hold. Then, the conditional MLE is consistent for , satisfying:


The result follows from the theory developed in the work of Meir and Drton (2017) for selective inference in exponential families and the fact that

For a discussion of post-selection efficiency, see the work of Routtenberg and Tong (2015).

4.2 Confidence intervals following selection by an aggregate test

From Theorem 3.1 it is clear that the truncated normal distribution can be used to construct confidence intervals post-selection in a straightforward manner. However, the extra conditioning (on ) may lead to wide confidence intervals relative to confidence intervals based on the sampled distributions, as pointed out by Tian and Taylor (2015). As an alternative, it is possible to invert a global type test (specifically, the test with null hypothesis for coefficient ) and construct a hybrid type confidence interval in order to obtain a confidence interval with more power to determine the sign of the regression coefficients (Weinstein et al., 2013).

For constructing a confidence interval at a level, let and be the lower and upper bounds of the polyhedral confidence interval for the th variable, so:

where is as defined in § 3.1. Similarly, let and be the lower and upper limit of the global-null confidence interval for the th variable:

where is the unit vector and is the CDF of given selection, for the parameter vector . We use the Robbins-Monroe process to find and (Garthwaite and Buckland, 1992). As in testing, the polyhedral confidence interval tends to be shorter and more efficient if there are several variables in the model that are highly correlated with the response variable and the global-null confidence intervals tend to be more powerful when the model is sparse or if the global-null hypothesis holds (approximately). As we have done in Section § 3.2, we propose a hybrid method for constructing a confidence interval, as defined by the lower and upper bounds:

The hybrid confidence intervals, while possessing a good degree of power to determine the sign regardless of the true underlying model, tend to be inefficient when there is strong signal in the data. To see why, consider the case of a regression model where . Then, for a sufficiently large sample size the polyhedral confidence interval will apply no correction and hybrid confidence interval will be conservative, with an asymptotic level of : . As a remedy, we propose a regime switching scheme for constructing confidence intervals in which we first determine whether or and then construct confidence intervals accordingly.

Procedure 4.1.

The post-selection level confidence interval for , with switching regime at level (with default value ):

  1. Compute .

  2. If , i.e., the aggregate test does not pass the more stringent threshold , then compute the hybrid conditional confidence interval at level .

  3. If , compute the unconditional confidence interval, at level .

Theorem 4.4.

Post-selection confidence intervals constructed with Procedure 4.1 have a confidence level at least if , and an asymptotic level if .

See Appendix LABEL:app-pf-thm-ci for the proof.

Remark 4.1.

Ours is not the first regime switching procedures proposed for inference in the presence of data-driven variable selection, see for example the works of Chatterjee and Lahiri (2011) and McKeague and Qian (2015). In both these cases, one has to determine whether some (or all) of the parameters are zero and construct a test in an appropriate manner. The usual prescription for selecting tuning parameters in such procedures is to scale the tuning parameter of the test (in our case, ) in such a way so the correct regime is selected with probability approaching one as the sample size grows. In our case, this would amount to setting in such a way so that and . However, in practice it is necessary to select a single a value for and so we chose to fix to a small value as to maintain a good degree of power when there is only limited amount of signal in the data and to modify our procedure in such a way as to ensure some finite sample coverage guarantees.

Figure 1: Point estimates and confidence intervals for the artificial data example described in Example 4.1. The naive point estimate is marked as a circle, the conditional MLE as a triangle and the true value of the parameter is marked as a square. The confidence intervals are the naive (solid red line), hybrid (dotted green line) and polyhedral (dashed blue line).
Example 4.1.

Figure 1 shows point estimates and confidence intervals for normal means vector which was selected via a quadratic aggregate test. The figure was generated by sampling with , , and . The aggregate test applied was a Wald test at an level. The naive and conditional estimates are plotted along with naive, polyhedral and hybrid confidence intervals. The conditional MLE applies the same multiplicative shrinkage of to all of the coordinates of and so the shrinkage is more visible for the larger observed values. Because the selection is driven by corresponding to the large negative coordinate , the polyhedral confidence intervals for the other coordinates of are similar in size to the naive confidence intervals. The naive confidence intervals overestimates the magnitude of , the polyhedral confidence intervals cover the true parameter value but fails to determine its sign and the regime switching confidence intervals both cover the true parameter value and succeed in determining the sign.

5 Simulations

In this section we conduct a simulation study where we assess the methods proposed in this work and verify our theoretical findings. In Section § 5.1 we asses the post-selection tests proposed in Section 3 with respect to their ability to control the FDR. In Section § 5.2 we compare the different testing method with respect to their power to detect true signal in the data. In Section § 5.3 we compare the conditional MLE and the unadjusted MLE with respect to their estimation error. Finally, in Section § 5.4 we asses the coverage rates of the polyhedral and regime switching confidence intervals.

In all of our simulations we generate data in a similar manner. We first generate a design matrix in a manner meant to approximate a rare-variant design. We sample marginal expression proportions for our variants from constrained to and for each subject we sample two multivariate normal vectors with . We then set to obtain a design matrix with dependent columns and a marginal distribution . We generate a sparse regression coefficients vector with zero coordinates and coordinates which are sampled form the distribution. We normalize the values of the regression coefficients such that the signal to noise ratio


equals some pre-specified value. Finally, we generate a response variable with . In all of our simulations we use a Wald aggregate test with a significance level of .

Figure 2: False Discovery Rates after aggregate testing. We plot the nominal FDR vs. the empirical FDR for the unadjusted naive p-values (red solid line), the polyhedral p-values (dotted green line), p-values based on the exact post-selection null distribution (dashed-blue) and the hybrid method (dashed purple line). The diagonal line is in dashed black. The figure is faceted according to the distribution of the noise and the signal to noise ratio in the data as defined in equation (5.1). Details about the data generation are in § 5.1.

5.1 Assessment of false discovery rate control

We assess how well the proposed testing procedures control the FDR under the assumed model as well as under model misspecification. We generate datasets with , , , and three types of distributions for the model residuals, all of which have a variance of :

We compare four testing procedures. BH on naive -values which are not adjusted for selection, BH on the polyhedral -values as computed in equation (3.2), BH on the -values based on the global null distribution as computed in equation (3.3), and BH on the hybrid -values as computed in equation (3.4).

We plot the target FDR versus the empirical FDR in Figure 2. When there is no signal in the data and the noise is not heavy tailed, all selection adjusted methods obtain close to nominal FDR levels (top left and right panels). When the noise is heavy tailed (Laplace), the methods based on the null Gaussian distribution have higher than nominal FDR rates, while the p-values computed with the polyhedral method exhibit a more robust behavior (top center panel). When there is some signal in the data, all selection adjusted p-values control the FDR at nominal or conservative rate (bottom row). The naive p-values do not control the FDR in any of the simulation settings. Thus, we conclude that the polyhedral -values may be preferable to the hybrid and global null -values if the distribution of the data is heavy-tailed. However, as we show in the next section, the hybrid method tends to have more power compared to the polyhedral method and is preferable when the residual distribution is well behaved.

5.2 Assessment of power to detect true signal

We compare the power to detect signal of the proposed testing procedures. We generate datasets with , , , and . We compare the same testing procedures as in § 5.1. We measure the power to identify true signals at a nominal FDR level of .

We plot the results of the simulation in Figure 3. In all of the simulations the naive unadjusted p-values have the most power, at the cost of an inflated FDR. When the number of non-zero regression coefficients is small, the global null and hybrid methods tend to have the most power, while all methods have a similar power when the signal is spread over a large number of regression coefficients. The method based on the global null distribution is the most powerful when the signal is sparse and low. The polyhedral method has more power when the signal is not too sparse or low. The hybrid method seems to adapt to the sparsity and signal strength well, exhibiting comparatively good power in all settings.

Figure 3: Power to detect true signals after aggregate testing. We plot the power of the different inference method as a function as the number of non-zero regression coefficients for the unadjusted naive p-values (red solid line), the polyhedral p-values (dotted green line), p-values based on the exact post-selection null distribution (dashed-blue) and the hybrid method (dashed purple line). The figure is faceted according to the strength of the signal as defined in equation (5.1). Details about the data generation are in § 5.2.

5.3 Assessment of estimation error

We compare the conditional MLE to the naive, unadjusted point estimate, itself. We set , , , and sample model residuals from the normal distribution with a standard deviation of

We plot the results of the simulation in Figure 4. When the dimension of is small, the conditional MLE estimates the vector of regression coefficients better than the unadjusted MLE. The gap between the conditional and naive estimator is roughly constant across the different sample sizes when because the probability of selection remains constant for all sample sizes. However, when there is some signal in the data the probability of passing the aggregate increases in the sample size and the gap between the estimators shrinks. The difference between the conditional MLE and the naive MLE decreases in the size of , to the extent that for the two estimators are indistinguishable from one another.

To see why this occurs, consider the following example. Let , suppose that we perform selection using a Wald test at a fixed level and consider the conditional likelihood function:

As we let the dimension grow, the decrease in the value of the (unconditional) gaussian log-likelihood due to a possible shrinkage of grows linearly in . At the same time, the additional penalty term remains bounded below by regardless of the dimension of the problem.

Figure 4: Root mean squared error for estimation after aggregate testing. We plot the RMSE for estimating the vector of regression coefficients with the naive unadjusted estimator (solid red line) and the conditional MLE (dashed blue line). The figure is faceted according to the signal-to-noise ratio as defined in (5.1) and the size of , .

5.4 Assessment of confidence interval coverage rates

In the last set of simulations, we evaluate the regime switching and polyhedral confidence intervals with respect to their coverage rates and power to determine the sign of the non-zero coefficients. We set the parameters of the simulation to , , , and sample the residuals from a normal distribution with a standard deviation of .

We plot the results of the simulation in Figure 5. The naive confidence intervals have a coverage rate far below nominal for signal-to-noise ratios less than . As could be expected, the polyhedral method achieves the correct coverage rates up to Monte-Carlo error in all simulation settings. When there is no signal in the data the regime switching confidence intervals have close to nominal coverage. When the signal to noise ratio is moderate, the regime-switching confidence intervals are conservative because the polyhedral confidence intervals are superior to the ones based on the global-null assumption with high probability while the probability of exceeding is still not overwhelmingly large. When the signal to noise ratio is high the regime switching confidence intervals are mostly identical to the naive ones, because the selection occurs with probability of close to , and so they have the correct coverage rate. Despite being more conservative than the polyhedral confidence intervals, the regime switching confidence intervals can have better power to determine the sign. Specifically, the regime switching confidence intervals tend to have more power when the true model is sparse and the signal to noise ratio is low or moderate.

Figure 5: Coverage rates and power to determine the sign of confidence intervals constructed after aggregate testing. We plot results rates for the naive unadjusted confidence intervals (solid red line), polyhedral confidence intervals (dotted green line) and regime switching intervals with (dashed blue line).

6 Application to variant selection following gene-level testing

Genome-wide association studies (GWAS) involves large scale association testing of genetic markers with underlying traits. Large GWAS of uncommon and rare variants are now becoming increasingly feasible with the advent of newer genotyping chips, cheaper sequencing technologies and sophisticated algorithms that allow imputation of low-frequency variants based on combinations of common variants that are already genotyped in large GWAS. Thus, association studies of rare variants is a very active area of research and some of the early studies have already begun to report their findings, e.g., Consortium et al. (2015) and Fuchsberger et al. (2016).

As the statistical power for testing association of traits with individual rare variants may be low, it has been suggested that tests for genetic associations be performed at an aggregated level by combining signals across multiple variants within genomic regions such as those defined by functional units of genes (Madsen and Browning, 2009; Morris and Zeggini, 2010; Neale et al., 2011; Wu et al., 2011; Lee et al., 2012; Sun et al., 2013). These tests can be divided into sum-based tests (which aggregate the variant statistics by a linear combination), variance component tests (which aggregate the squared variant statistics by a linear combination), or combined (sum-based and variance component) tests. See Derkach et al. (2014) for a review. There is, however, currently a lack of rigorous methods for variant selection following gene-level association testing.

The Dallas Heart Study (DHS) (Romeo et al., 2007) considered four genes of potential interest, genotyped in 3549 individuals (601 hispanic, 1830 non-hispanic black, 1043 non-hispanic white, 75 other ethnicities). We focus on the 32 variants in ANGPTL4, which includes both rare and common variants. Table 1, column 2, shows the number of subjects with rare variants.

To detect associations with triglyceride (TG), a metabolism trait, we applied the variance component test SKAT of Wu et al. (2011) , with outcome TG on a logarithmic scale, while adjusting for the covariates race, sex, and age on a logarithmic scale. ANGPTL4 is one of the four genes in the ANGTPTL family (Romeo et al., 2009). Using a Bonferroni correction for testing the genes in the family, ANGTPL4 is selected for post-selection inference if the SKAT test -value is at most . To identify the potentially susceptible variants, we proceeded as suggested in section § 3.

The SKAT -value for ANGTPL4 was and therefore the gene was selected. Table 1, column 6 lists the weights assigned to each variant in the SKAT test. These weights were obtained using the default settings of the publicly available R library SKAT. Figure 6 and Table 1 provide, respectively, a graphical display and the actual numbers for the naive (i.e., unconditional, not corrected for selection) and conditional -values. When using the polyhedral method described in Section § 3.1, one variant, , passes the Bonferroni threshold for FWER control at the level. When using the hybrid method described in Section § 3.2, two variants, and are identified at an FDR level of . This example demonstrates that it is possible to make further discoveries in a follow-up analysis after aggregate testing, to identify which underlying variants drive the signal. The variant is indeed associated with TG, as validated by external studies (Dewey et al., 2016).

Figure 6: The naive and two types selection adjusted -values on a scale (left panel) and FDR adjusted p-values on a scale (right panel) for the 32 variants. The p-values plotted are Naive unadjusted p-values (red circles), conditional p-values based on the polyhedral lemma (black triangles) and conditional p-values based on the hybrid method (green plus). The dotted line marks a multiplicity adjusted threshold of (FWER in the left panel, FDR in the right) and the dashed line marks a multiplicity adjusted threshold of .

7 Discussion

In this work, we provided valid inference for linear contrasts of estimated parameters, after the aggregate test passed a pre-defined threshold. For the post-selection inference we suggest in this paper, we only need the summary statistics for the selected group of interest, and knowledge of the selection threshold . The selection threshold does not have to be fixed. For example, a data dependent threshold will be valid if the groups are independent and via the BH procedure, or any other simple selection rule (as defined in Benjamini and Bogomolov, 2014). If the data of all the groups is available, then there remains an open question of how to choose in order to maximize the chance of discovery for individual hypotheses (assuming that an error control guarantee at the group level is not necessary). Data adaptive methods for choosing may invalidate the post-selection inference. We are currently investigating potential approaches, but they are outside the scope of this manuscript.

Our methods can be extended to tree structured hypothesis tests in a straightforward manner. See Bogomolov et al. (2017) and the references within for state-of-the-art work on hierarchical testing when there are more than two layers. An interesting genomic application is the following. Within a selected gene, the tests may be further divided naturally into subgroups. For example, clusters of SNPs within a gene (Yoo et al., 2016). It may be of interest to develop a multi-level analysis, where following selection we first examine the subgroups, and only then the individual effects.

In this work we suggested switching regimes to adapt to the different unknown sparsity of the estimated effects. We observed that by combining a powerful method for the sparse setting with a powerful method for the non-sparse setting, we get a method that has overall good performance. Such an approach can be very useful in genomic applications, where the signal is expected to be sparse in some groups but non-sparse in others. The switching regime approach may benefit other post-selection settings as well, e.g., confidence intervals for the selected parameters in a regression model.

8 Supplementary material

An R implementation of the methods in this paper is available in https://github.com/ammeir2/PSAT and will be available (soon) in the Bioconductor package PSAT.


RH and AM contributed equally to this paper. The authors thank Andriy Derkach for providing his R implementation of the SKAT test, and for helpful conversations on the rare variant testing applications. Mathias Drton for helpful discussions regarding the properties of the conditional MLE. The authors are also grateful for Bin Zhu for helpful discussions of the Dallas Heart Study example. RH is supported by Israel Science Foundation grant no. 0603616831.


  • Benjamini and Bogomolov (2014) Benjamini, Y. and Bogomolov, M. (2014). Selective inference on multiple families of hypotheses. J. R. Stat. Soc. Ser. B. Stat. Methodol., 76(1):297–318.
  • Benjamini and Heller (2007) Benjamini, Y. and Heller, R. (2007). False discovery rate for spatial signals. Journal of the American Statistical Association, 102(480):1272–1281.
  • Benjamini and Yekutieli (2001) Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29(4):1165–1188.
  • Benjamini and Yekutieli (2005) Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc., 100(469):71–81.
  • Berk et al. (2013) Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L., et al. (2013). Valid post-selection inference. The Annals of Statistics, 41(2):802–837.
  • Bertsekas and Tsitsiklis (2000) Bertsekas, D. P. and Tsitsiklis, J. N. (2000). Gradient convergence in gradient methods with errors. SIAM J. Optim., 10(3):627–642 (electronic).
  • Bhattacharjee et al. (2012) Bhattacharjee, S., Rajaraman, P., Jacobs, K., Wheeler, W., William, A., Melin, B., Hartge, P., Yeager, M., Chung, C., Chanock, S., and Chatterjee, N. a. (2012). A subset-based approach improves power and interpretation for the combined analysis of genetic association studies of heterogeneous traits. The American Journal of Human Genetics, 90(5):821–835.
  • Bogomolov et al. (2017) Bogomolov, M., Peterson, C., Benjamini, Y., and Sabatti, C. (2017). Testing hypotheses on a tree: new error rates and controlling strategies. arXiv preprint arXiv:1705.07529.
  • Chatterjee and Lahiri (2011) Chatterjee, A. and Lahiri, S. N. (2011). Bootstrapping lasso estimators. J. Amer. Statist. Assoc., 106(494):608–625.
  • Consortium et al. (2015) Consortium, U. et al. (2015). The uk10k project identifies rare variants in health and disease. Nature, 526(7571):82.
  • Derkach et al. (2014) Derkach, A., Lawless, J. F., and Sun, L. (2014). Pooled association tests for rare genetic variants: a review and some new results. Statist. Sci., 29(2):302–321.
  • Dewey et al. (2016) Dewey, F. E., Gusarova, V., O’Dushlaine, C., Gottesman, O., Trejos, J., Hunt, C., Van Hout, C. V., Habegger, L., Buckler, D., Lai, K.-M. V., et al. (2016). Inactivating variants in angptl4 and risk of coronary artery disease. New England Journal of Medicine, 374(12):1123–1133.
  • Fithian et al. (2014) Fithian, W., Sun, D., and Taylor, J. (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
  • Fuchsberger et al. (2016) Fuchsberger, C., Flannick, J., Teslovich, T. M., Mahajan, A., Agarwala, V., Gaulton, K. J., Ma, C., Fontanillas, P., Moutsianas, L., McCarthy, D. J., et al. (2016). The genetic architecture of type 2 diabetes. Nature, 536(7614):41–47.
  • Garthwaite and Buckland (1992) Garthwaite, P. H. and Buckland, S. T. (1992). Generating Monte Carlo confidence intervals by the Robbins-Monro process. J. Roy. Statist. Soc. Ser. C, 41(1):159–171.
  • Heller et al. (2017) Heller, R., Chatterjee, N., Krieger, A., and Shi, J. (2017). Post-selection inference following aggregate level hypothesis testing in large scale genomic data. J. Amer. Statist. Assoc.
  • Latał a and Matlak (2017) Latał a, R. and Matlak, D. (2017). Royen’s proof of the Gaussian correlation inequality. In Geometric aspects of functional analysis, volume 2169 of Lecture Notes in Math., pages 265–275. Springer, Cham.
  • Lee et al. (2016) Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist., 44(3):907–927.
  • Lee and Taylor (2014) Lee, J. D. and Taylor, J. E. (2014). Exact post model selection inference for marginal screening. In Advances in Neural Information Processing Systems, pages 136–144.
  • Lee et al. (2012) Lee, S., Wu, M. C., and Lin, X. (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13(4):762–775.
  • Madsen and Browning (2009) Madsen, B. E. and Browning, S. R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics, 5(2):e1000384.
  • McKeague and Qian (2015) McKeague, I. W. and Qian, M. (2015). An adaptive resampling test for detecting the presence of significant predictors. J. Amer. Statist. Assoc., 110(512):1422–1433.
  • Meir and Drton (2017) Meir, A. and Drton, M. (2017). Tractable post-selection maximum likelihood inference for the lasso. arXiv preprint arXiv:1705.09417.
  • Morris and Zeggini (2010) Morris, A. P. and Zeggini, E. (2010). An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic epidemiology, 34(2):188–193.
  • Neale et al. (2011) Neale, B. M., Rivas, M. A., Voight, B. F., Altshuler, D., Devlin, B., Orho-Melander, M., Kathiresan, S., Purcell, S. M., Roeder, K., and Daly, M. J. (2011). Testing for an unusual distribution of rare variants. PLoS genetics, 7(3):e1001322.
  • Pakman and Paninski (2014) Pakman, A. and Paninski, L. (2014). Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Graph. Statist., 23(2):518–542.
  • Penny and Friston (2003) Penny, W. and Friston, K. (2003). Mixtures of general linear models for functional neuroimaging. IEEE Transactions on Medical Imaging, 22:504–514.
  • Pötscher (1991) Pötscher, B. M. (1991). Effects of model selection on inference. Econometric Theory, 7(2):163–185.
  • Reid et al. (2016) Reid, S., Taylor, J., and Tibshirani, R. (2016). A general framework for estimation and inference from clusters of features. Journal of the American Statistical Association, (just-accepted).
  • Reiner-Benaim (2007) Reiner-Benaim, A. (2007). FDR control by the BH procedure for two-sided correlated tests with implications to gene expression data analysis. Biom. J., 49(1):107–126.
  • Romeo et al. (2007) Romeo, S., Pennacchio, L. A., Fu, Y., Boerwinkle, E., Tybjaerg-Hansen, A., Hobbs, H. H., and Cohen, J. C. (2007). Population-based resequencing of angptl4 uncovers variations that reduce triglycerides and increase hdl. Nature genetics, 39(4):513.
  • Romeo et al. (2009) Romeo, S., Yin, W., Kozlitina, J., Pennacchio, L. A., Boerwinkle, E., Hobbs, H. H., and Cohen, J. C. (2009). Rare loss-of-function mutations in angptl family members contribute to plasma triglyceride levels in humans. The Journal of clinical investigation, 119(1):70.
  • Routtenberg and Tong (2015) Routtenberg, T. and Tong, L. (2015). Estimation after parameter selection: Performance analysis and estimation methods. arXiv preprint arXiv:1503.02045.
  • Royen (2014) Royen, T. (2014). A simple proof of the Gaussian correlation conjecture extended to some multivariate gamma distributions. Far East J. Theor. Stat., 48(2):139–145.
  • Sun et al. (2013) Sun, J., Zheng, Y., and Hsu, L. (2013). A unified mixed-effects model for rare-variant association in sequencing studies. Genetic epidemiology, 37(4):334–344.
  • Taylor and Tibshirani (2016) Taylor, J. and Tibshirani, R. (2016). Post-selection inference for l1-penalized likelihood models. arXiv preprint arXiv:1602.07358.
  • Tian and Taylor (2015) Tian, X. and Taylor, J. E. (2015). Selective inference with a randomized response. arXiv preprint arXiv:1507.06739.
  • Tibshirani et al. (2015) Tibshirani, R. J., Rinaldo, A., Tibshirani, R., and Wasserman, L. (2015). Uniform asymptotic inference and the bootstrap after model selection. arXiv preprint arXiv:1506.06266.
  • Weinstein et al. (2013) Weinstein, A., Fithian, W., and Benjamini, Y. (2013). Selection adjusted confidence intervals with more power to determine the sign. J. Amer. Statist. Assoc., 108(501):165–176.
  • Wu et al. (2011) Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93.
  • Yekutieli (2012) Yekutieli, D. (2012). Adjusted Bayesian inference for selected parameters. J. R. Stat. Soc. Ser. B. Stat. Methodol., 74(3):515–541.
  • Yoo et al. (2016) Yoo, Y., Sun, L., Poirier, J., Paterson, A., and Bull, S. (2016). Multiple linear combination (mlc) regression tests for common variants adapted to linkage disequilibrium structure. Genetic epidemiology, 41:108–121.

Variant # rare variants hybrid PV conditional PV naive PV SKAT weight M1T 1.0000 0.8967 0.6872 0.3434 0.3434 4.9831 19.9657 P5L 2.0000 0.8588 0.3999 0.1995 0.1995 4.9663 74.7223 E40K 50.0000 -0.4490 0.0020 0.0010 0.0010 4.2172 7687.4667 M41I 28.0000 0.1403 0.8860 0.4333 0.4333 4.5466 288.0531 S67R 2.0000 0.3681 1.000 0.5827 0.5827 4.9663 15.9910 R72L 3.0000 -0.3468 1.000 0.5262 0.5262 4.9495 22.0274 G77R 1.0000 -1.0913 0.4928 0.2489 0.2489 4.9831 29.5739 E167K 1.0000 -1.2002 0.4072 0.2049 0.2049 4.9831 34.4125 P174S 1.0000 0.1040 1.000 0.9125 0.9125 4.9831 0.4010 E190Q 32.0000 0.2140 0.4108 0.2054 0.2054 4.4850 1199.4289 E196K 1.0000 -0.3991 1.000 0.6733 0.6733 4.9831 3.5122 K217X 1.0000 -0.7300 0.8704 0.4406 0.4406 4.9831 12.4116 G223R 1.0000 -1.3242 0.3239 0.1621 0.1621 4.9831 40.5690 R230C 1.0000 -0.5784 1.000 0.5412 0.5412 4.9831 7.6586 F237V 1.0000 -0.2453 1.000 0.7956 0.7956 4.9831 1.2266 K245fs 1.0000 0.5157 1.000 0.5858 0.5858 4.9831 6.6047 P251T 1.0000 1.3860 0.2854 0.1432 0.1432 4.9831 49.3013 T266M 1887.0000 0.0230 0.8619 0.2454 0.2454 0.0006 0.0002 R278Q 207.0000 -0.1945 0.0055 0.0103 0.0023 2.4309 10667.1021 V291M 1.0000 -0.6203 1.000 0.5123 0.5123 4.9831 9.5533 L293M 1.0000 0.4021 1.000 0.6717 0.6717 4.9831 1.3213 E296V 1.0000 -0.0308 1.000 0.9741 0.9741 4.9831 0.0235 S302fs 1.0000 -0.7089 0.9012 0.4539 0.4539 4.9831 12.4794 P307S 1.0000 -0.0793 1.000 0.9333 0.9333 4.9831 0.0274 V308M 3.0000 1.4719 0.0143 0.0071 0.0071 4.9495 477.6972 R336C 7.0000 -0.0469 1.000 0.8957 0.8957 4.8829 2.5696 D338E 1.0000 0.2314 1.000 0.8073 0.8073 4.9831 0.0339 W349C 1.0000 -0.6120 1.000 0.5179 0.5179 4.9831 9.3008 G361R 2.0000 0.4744 0.9598 0.4784 0.4784 4.9663 23.2973 R371Q 1.0000 0.4724 1.000 0.6177 0.6177 4.9831 5.5410 R384W 1.0000 -0.7610 0.8420 0.4225 0.4225 4.9831 22.6686 G361S 1.0000 -1.0389 0.5388 0.2724 0.2724 4.9831 26.7995

Table 1: For the 32 varaints in ANGPTL4, the number of subjects with rare variants (column 2), the estimated effect size (column 3), the conditional -value (column 4), the original -value (column 5), the default Beta-density weight in SKAT (column 6), and the contribution of the variant to the SKAT statistic (column 7). Variant has conditional -value below the , and is therefore discovered by Bonferroni, with a guarantee of conditional FWER control at the 0.05 level. The contribution of variant R278Q is by far the largest towards the SKAT statistic, and therefore the conditional -value is larger than the naive -value for R278Q. For all other variants in this gene, the conditional values coincide with the naive -values. This is expected when by conditioning on the test statistics of all the other variants, the SKAT test singificance is guaranteed regardless of the single variant test statistic value.

Appendix A Proof of Theorem 3.1

In order to compute the distribution of a linear combination of within selected regions, it is useful to first represent the selection event in simple form (the representation is based on the one used for post-model selection in Lee et al., 2016).

Lemma A.1.

For an arbitrary linear combination , let and , where is the identity matrix. The selection event can be rewritten in terms of as follows:


for .


Decomposing , the result is immediate from rewriting as a quadratic polynomial with argument . The selection event is therefore

Since the covariance between and is zero, it follows from Lemma A.1 that if is (approximately) normal, then the boundaries of the selection event are independent of . Therefore, conditional on and on the selection event , has a truncated normal distribution, truncated at the values .

Appendix B Proof of Theorem 3.2

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters